度估記事本: [Scrapy] Practical Side Notes, Part II

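1. Unit testing
Scrapy's way of unit-testing spiders is contracts: annotations written in a callback's docstring and checked with the scrapy check command.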
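
As a rough illustration of what a contract looks like, here is a minimal sketch. The spider name, the URL and the scraped field are placeholders made up for this example; only the @url, @returns and @scrapes annotations are Scrapy's built-in contracts.

```python
import scrapy


class ContractDemoSpider(scrapy.Spider):
    # Hypothetical spider name, used only for this sketch
    name = "contract_demo"

    def parse(self, response):
        """Extract the page title.

        @url http://example.com/
        @returns items 1 1
        @scrapes title
        """
        # The contracts above say: fetch @url, expect exactly one item,
        # and expect that item to contain a 'title' field.
        yield {"title": response.css("title::text").get()}
```

Running scrapy check contract_demo inside the project should then fetch the @url page and verify the callback against the declared contracts.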

2. Crawling several sites at once from a txt file

# One URL per line in todo.urls.txt
start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]
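
A slightly more defensive variant could look like the sketch below. The helper name load_start_urls and the convention of skipping blank lines and lines starting with # are assumptions added for this example; they are not part of Scrapy.

```python
from pathlib import Path


def load_start_urls(path):
    """Return the URLs listed in `path`, one per line.

    Blank lines and lines starting with '#' are skipped
    (an assumed convention for this sketch).
    """
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls
```

In the spider this would be used as start_urls = load_start_urls('todo.urls.txt').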

3. Command-line parameters

Stop after scraping 90 items:

scrapy crawl taaze -s CLOSESPIDER_ITEMCOUNT=90
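
The same limit can also be set permanently instead of on the command line. A minimal sketch, assuming the usual project layout with a settings.py file:

```python
# settings.py (project settings): same setting as the -s flag above
CLOSESPIDER_ITEMCOUNT = 90  # close the spider once about 90 items have been scraped
```

Requests that are already being downloaded when the limit is reached are still processed, so the final count may end up slightly above 90.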

4. By default Scrapy processes pending requests in last-in, first-out (LIFO) order, so the deepest pages are crawled first.
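
To see why LIFO leads to depth-first crawling, here is a toy simulation. It is only a simplified model, not Scrapy's scheduler (which also handles priorities and concurrent downloads), and the page names are made up.

```python
from collections import deque

# Toy link graph (hypothetical pages): each page lists the pages it links to.
LINKS = {
    "index": ["cat1", "cat2"],
    "cat1": ["item1", "item2"],
    "cat2": ["item3"],
    "item1": [],
    "item2": [],
    "item3": [],
}


def crawl_order(start, lifo=True):
    """Return the order in which pages are visited with a LIFO or FIFO queue."""
    pending = deque([start])
    seen = {start}
    order = []
    while pending:
        # LIFO takes the newest request, FIFO takes the oldest one
        page = pending.pop() if lifo else pending.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return order


print(crawl_order("index", lifo=True))
# ['index', 'cat2', 'item3', 'cat1', 'item2', 'item1']
print(crawl_order("index", lifo=False))
# ['index', 'cat1', 'cat2', 'item1', 'item2', 'item3']
```

With LIFO the crawl reaches item3 before it has even visited cat1. If breadth-first order is preferred, the Scrapy FAQ describes switching via DEPTH_PRIORITY = 1 together with the FIFO queue classes for SCHEDULER_DISK_QUEUE and SCHEDULER_MEMORY_QUEUE.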

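5. When a login page has hidden fields, use FormRequest.from_response: it pre-fills the values already present in the form (hidden ones included), so only the fields you want to set need to be supplied.

from scrapy import Request, FormRequest

# Both methods below belong to the spider class.

# Start on the welcome page
def start_requests(self):
    return [
        Request("http://web:9312/dynamic/nonce",
                callback=self.parse_welcome)]

# Submit the login form; hidden fields are carried over from the response
def parse_welcome(self, response):
    return FormRequest.from_response(
        response,
        formdata={"user": "user", "pass": "pass"})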

6. Building a page's file name

url = "aa%06d.html" % id

The id is substituted into the URL; 06 means that a value shorter than 6 digits is padded with leading zeros.
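
A short runnable sketch of the same formatting; the function name page_name and the sample ids are made up for illustration:

```python
def page_name(page_id):
    """Build a file name with the id zero-padded to 6 digits."""
    return "aa%06d.html" % page_id


print(page_name(42))        # aa000042.html
print(page_name(1234567))   # aa1234567.html (longer ids are not truncated)
print(f"aa{42:06d}.html")   # aa000042.html, same result with an f-string
```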

7. Extracting part of a string

x = "abcde"
x[2:]    ==>  cde
x[:2]    ==>  ab    (left)
x[-2:]   ==>  de    (right)
x[1:-1]  ==>  bcd

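8. ItemLoader with a selector as its source
The loader reads from the currently selected part of the page, so several items can be extracted from a single response, which reduces the number of requests.

from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose  # older Scrapy: scrapy.loader.processors

# PropertiesItem is the project's own Item class.
def parse_item(self, selector, response):
    # Create the loader using the selector
    l = ItemLoader(item=PropertiesItem(), selector=selector)

    # Load fields using XPath expressions (str replaces Python 2's unicode)
    l.add_xpath('title', './/*[@itemprop="name"][1]/text()',
                MapCompose(str.strip, str.title))
    l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                MapCompose(lambda i: i.replace(',', ''), float),
                re='[,.0-9]+')

    return l.load_item()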

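9. The MapCompose() function
It takes any number of Python functions or lambda expressions as arguments and applies them in order to produce the final result.
For example, MapCompose(str.strip, float) first strips the surrounding whitespace from the text extracted by the XPath and then converts it to a float.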
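
The idea can be sketched in plain Python. This is a simplified stand-in written for illustration, not the library's implementation: map_compose is a made-up name, and unlike the real MapCompose it does not flatten lists returned by a function.

```python
def map_compose(*functions):
    """Simplified stand-in for MapCompose: apply functions in order to each value."""
    def process(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                # Values for which a function returns None are dropped
                if result is not None:
                    next_values.append(result)
            values = next_values
        return values
    return process


clean_price = map_compose(str.strip, lambda s: s.replace(",", ""), float)
print(clean_price([" 1,200.50 ", "99 "]))  # [1200.5, 99.0]
```

Each function receives the output of the previous one, value by value, which is why the order of the arguments matters: stripping and removing the comma must happen before float is called.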

Ref.: Learning Scrapy筆記（三）- Scrapy基礎 (Learning Scrapy Notes (3): Scrapy Basics)

Monday, July 15, 2019
