度估記事本: [Python] Storing Scrapy Items in MongoDB

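For how to set up MongoDB, see the earlier post on connecting MongoDB with Mongo-express.
Scrapy already has a mechanism for writing items to a database (the item pipeline), so the existing crawling logic does not need to change.
The only files that need changes are pipelines.py, items.py, and settings.py.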

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from pymongo import MongoClient
from scrapy import Item


class TaazePipeline(object):

    # Open the database connection when the spider starts
    def open_spider(self, spider):
        db_uri = spider.settings.get('MONGODB_URI', 'mongodb://192.168.168.237:27017')
        db_name = spider.settings.get('MONGODB_DB_NAME', 'scrapy_db')
        self.db_client = MongoClient(db_uri,
                                     username='mongoadmin',
                                     password='mongoadmin',
                                     )
        self.db = self.db_client[db_name]

    def process_item(self, item, spider):
        self.insert_db(item)
        return item

    # Insert one item
    def insert_db(self, item):
        if isinstance(item, Item):
            item = dict(item)
        # Skip books whose price field still contains "上架公布"
        # (price to be announced at release)
        if str(item["price"]).find("上架公布") < 0:
            # insert_one() replaces the deprecated Collection.insert()
            self.db.books.insert_one(item)

    # Close the database connection when the spider finishes
    def close_spider(self, spider):
        self.db_client.close()

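open_spider manages the MongoDB connection. Remember to install pymongo first, otherwise it cannot be used.
insert_db inserts one document at a time. If you use MySQL instead, it is better to commit once at the end.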
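The one-at-a-time insert can also be batched, in the same spirit as committing once at the end with MySQL. The following is only a minimal sketch, not the code used in this post: BufferedWriter and batch_size are hypothetical names, and collection is assumed to be any object with an insert_many(list_of_dicts) method, such as a PyMongo collection like self.db.books.

```python
class BufferedWriter:
    """Collects items and writes them in batches via insert_many()."""

    def __init__(self, collection, batch_size=100):
        # `collection` is assumed to expose insert_many(list_of_dicts),
        # as a PyMongo collection does.
        self.collection = collection
        self.batch_size = batch_size
        self.buffer = []

    def add(self, item):
        # Convert to a plain dict so Scrapy items and dicts are both accepted.
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # insert_many() rejects an empty list, so only call it when needed.
        if self.buffer:
            self.collection.insert_many(self.buffer)
            self.buffer = []
```

In a pipeline, add() would be called from process_item and flush() from close_spider, so the last partial batch is not lost.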

items.py



import scrapy
from scrapy import Item, Field


class TaazeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    bookid = scrapy.Field()
    title = scrapy.Field()
    volume = scrapy.Field()
    price = scrapy.Field()
    pic = scrapy.Field()

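The main TaazeSpider program needs the following line to access the item structure:

import taaze.items as items

Here taaze is the project name, which in this project matches the spider's name attribute:
class booksSpider(scrapy.Spider):
    name = "taaze"
    allowed_domains = ["taaze.tw"]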

settings.py



# MongoDB settings start
MONGODB_URI = 'mongodb://192.168.168.237:27017'
MONGODB_DB_NAME = 'scrapy_db'

ITEM_PIPELINES = {
    'taaze.pipelines.TaazePipeline': 403,
}

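What needs changing here is MONGODB_DB_NAME. Both the setting name and the database it points to must match what the pipeline above reads.
Under ITEM_PIPELINES, change taaze to your own project name.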
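Why the names must match can be shown with a small sketch. It assumes a plain dict standing in for spider.settings, whose get(name, default) lookup behaves the same way for this purpose. read_mongo_settings is a hypothetical helper, and the localhost URI is only a placeholder.

```python
def read_mongo_settings(settings):
    """Return (uri, db_name) from a settings mapping, with fallbacks."""
    uri = settings.get("MONGODB_URI", "mongodb://localhost:27017")
    db_name = settings.get("MONGODB_DB_NAME", "scrapy_db")
    return uri, db_name


# Matching key: the configured database name is used.
ok = read_mongo_settings({"MONGODB_DB_NAME": "books_db"})

# Misspelled key in the settings: no error is raised,
# and the default database name is silently used instead.
typo = read_mongo_settings({"MONOGDB_DB_NAME": "books_db"})
```

Because a mismatched key fails silently, it is worth checking the spelling on both sides whenever items end up in an unexpected database.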

To read data from MongoDB inside the spider itself,
first override __init__:



import scrapy
from pymongo import MongoClient


class booksSpider(scrapy.Spider):
    name = "taaze"
    allowed_domains = ["taaze.tw"]
    #start_urls=[]

    def __init__(self, *a, **kw):
        super(booksSpider, self).__init__(*a, **kw)
        # Connect to the same MongoDB instance the pipeline writes to
        self.db_client = MongoClient('mongodb://192.168.168.237:27017',
                                     username='mongoadmin', password='mongoadmin',
                                     )
        self.db = self.db_client["scrapy_db"]

In super(booksSpider, self), booksSpider is your class name.
After that, you can read values directly in start_requests:



        for keyword in self.db.wishList.find({}, {"name": 1, "_id": 0}):
            self.log(keyword['name'])

Each keyword yielded by find() is a dict (one document), not a list, so values are read as keyword['col'].
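One thing to watch with this projection: a document that has no name field comes back as an empty dict, so keyword['name'] would raise KeyError. A minimal sketch of a more defensive read follows. names_from is a hypothetical helper, and the sample list only stands in for the cursor returned by find().

```python
def names_from(docs):
    """Collect the 'name' value from each document that has one."""
    names = []
    for doc in docs:
        name = doc.get("name")  # None when the field is missing
        if name:
            names.append(name)
    return names


# Stand-in for what find({}, {"name": 1, "_id": 0}) would yield.
sample = [{"name": "python"}, {}, {"name": "scrapy"}]
print(names_from(sample))  # ['python', 'scrapy']
```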


Wednesday, July 17, 2019
