Because Scrapy has a built-in mechanism for writing items to a database, the original spider code does not need to change.
The only files that need modification are pipelines.py, items.py, and settings.py.
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from pymongo import MongoClient
from scrapy import Item


class TaazePipeline(object):
    # open the database connection
    def open_spider(self, spider):
        db_uri = spider.settings.get('MONGODB_URI', 'mongodb://192.168.168.237:27017')
        db_name = spider.settings.get('MONGODB_DB_NAME', 'scrapy_db')
        self.db_client = MongoClient(db_uri,
                                     username='mongoadmin',
                                     password='mongoadmin',
                                     )
        self.db = self.db_client[db_name]

    def process_item(self, item, spider):
        self.insert_db(item)
        return item

    # insert one record
    def insert_db(self, item):
        if isinstance(item, Item):
            item = dict(item)
        # skip books whose price is still "上架公布" (to be announced)
        if str(item["price"]).find("上架公布") < 0:
            self.db.books.insert_one(item)

    # close the database connection
    def close_spider(self, spider):
        self.db_client.close()
open_spider manages the Mongo connection; remember to install pymongo first, or this will not work.
insert_db inserts one record at a time; if you use MySQL instead, it is better to commit once at the very end.
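The "commit once at the end" idea can be sketched without a live database: buffer the items in the pipeline and flush them as one batch when the spider closes. `BatchBuffer` and `flush_fn` below are hypothetical names for illustration only; with MongoDB the flush would typically call `collection.insert_many`, and with MySQL it would be a single commit.

```python
# A minimal sketch of batching writes instead of inserting one record at a
# time. BatchBuffer and flush_fn are hypothetical; in a real pipeline,
# flush_fn could wrap collection.insert_many (pymongo) or a MySQL commit.

class BatchBuffer:
    def __init__(self, flush_fn, batch_size=100):
        self.flush_fn = flush_fn      # called with a list of buffered items
        self.batch_size = batch_size
        self.buffer = []

    def add(self, item):
        # mirror the pipeline's filter: skip books whose price is not set yet
        if "上架公布" in str(item.get("price", "")):
            return
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # called from add() when full, and again from close_spider() at the end
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []


batches = []
buf = BatchBuffer(batches.append, batch_size=2)
buf.add({"title": "Book A", "price": 250})
buf.add({"title": "Book B", "price": "上架公布"})   # filtered out
buf.add({"title": "Book C", "price": 310})
buf.flush()
print(batches)  # one batch containing Book A and Book C
```

The same filter as insert_db is applied before buffering, so unreleased books never reach the database.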
items.py
import scrapy
from scrapy import Item, Field


class TaazeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    bookid = scrapy.Field()
    title = scrapy.Field()
    volume = scrapy.Field()
    price = scrapy.Field()
    pic = scrapy.Field()
The TaazeSpider main program needs the following line added so it can access the item structure:
import taaze.items as items
Here taaze is the name of this project (the spider's name attribute).
class booksSpider(scrapy.Spider):
    name = "taaze"
    allowed_domains = ["taaze.tw"]
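For reference, the parse callback would fill the fields declared in TaazeItem. The sketch below uses a plain dict with the same keys (bookid, title, volume, price, pic) so it runs without Scrapy; `make_book_item` and the sample values are hypothetical, not part of the original spider.

```python
# A stdlib-only sketch (no Scrapy dependency) of shaping one record with the
# same fields TaazeItem declares. make_book_item and the sample values are
# hypothetical; in the real spider, parse() would fill a TaazeItem instead.

FIELDS = ("bookid", "title", "volume", "price", "pic")

def make_book_item(**values):
    # keep only the declared fields, defaulting missing ones to None,
    # much like an unfilled scrapy.Field simply stays absent on an Item
    return {name: values.get(name) for name in FIELDS}

item = make_book_item(bookid="11100012345", title="Example Book",
                      volume=1, price=250, pic="https://example.com/cover.jpg")
print(sorted(item))  # ['bookid', 'pic', 'price', 'title', 'volume']
```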
settings.py
# MongoDB settings
MONGODB_URI = 'mongodb://192.168.168.237:27017'
MONGODB_DB_NAME = 'scrapy_db'
ITEM_PIPELINES = {
    'taaze.pipelines.TaazePipeline': 403,
}
The part to change here is MONGODB_DB_NAME, which must match the name read in the pipeline above.
Under ITEM_PIPELINES, replace taaze with your own project name.
If you want to read MongoDB data in the spider itself,
you can start by overriding __init__:
class booksSpider(scrapy.Spider):
    name = "taaze"
    allowed_domains = ["taaze.tw"]
    #start_urls=[]

    def __init__(self, *a, **kw):
        super(booksSpider, self).__init__(*a, **kw)
        self.db_client = MongoClient('mongodb://192.168.168.237:27017',
                                     username='mongoadmin', password='mongoadmin',
                                     )
        self.db = self.db_client["scrapy_db"]
The booksSpider in super(booksSpider, self) is your own class name.
Then you can read the data directly in start_requests:
for keyword in self.db.wishList.find({}, {"name": 1, "_id": 0}):
    self.log(keyword['name'])
Each keyword returned by find() is a dict, so you read a column with keyword['name'].
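The loop above can feed start_requests directly: each wish-list document becomes a search URL. In the sketch below, a plain list of dicts of the same shape ({"name": ...}) stands in for the Mongo cursor so the code runs without pymongo, and the search-URL pattern is a hypothetical example, not taaze.tw's real one.

```python
from urllib.parse import quote

# Each document from db.wishList.find({}, {"name": 1, "_id": 0}) is a dict
# like {"name": "..."}; here a plain list stands in for the cursor.
wish_list = [{"name": "python"}, {"name": "機器學習"}]

# hypothetical search-URL pattern, for illustration only
SEARCH_URL = "https://www.taaze.tw/search?q={}"

def build_start_urls(docs):
    # quote() percent-encodes non-ASCII keywords so the URLs stay valid
    return [SEARCH_URL.format(quote(doc["name"])) for doc in docs]

urls = build_start_urls(wish_list)
print(urls[0])  # https://www.taaze.tw/search?q=python
```

In a real spider, start_requests would then yield a scrapy.Request for each of these URLs.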
ref.
PyMongo 3.8.0 documentation
Connecting Scrapy to various databases (SQLite, MySQL, MongoDB, Redis)
Storing data with a Scrapy Item Pipeline
Tool 003 - Crawling campus-beauty photos with Python Scrapy
Crawling Jandan girl pictures with Python Scrapy, example (1)
Scrapy: Can't override __init__ function