python – Scrapy – How to crawl new pages based on links in scraped items


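I am new to Scrapy, and I am trying to crawl new pages from the links found in scraped items. Specifically, I want to scrape some Dropbox file-sharing links from Google search results and store those links in a JSON file. After getting the links, I want to open a new page for each one to verify whether the link is valid. If it is valid, I also want to store the file name in the JSON file.

I use a DropBoxItem with the attributes 'link', 'filename', 'status' and 'err_msg' to store each scraped item, and I tried to issue an asynchronous request for each scraped link in the parse function. But the parse_file_page function never seems to be called. Does anyone know how to implement this kind of two-step crawl?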

    # Imports for Scrapy 0.22, the version shown in the log below.
    # DropBoxItem is the project's own Item class (its import is omitted here).
    from scrapy.http import Request
    from scrapy.selector import Selector
    from scrapy.spider import Spider

    class DropBoxSpider(Spider):
        name = "dropBox"
        allowed_domains = ["google.com"]
        start_urls = [
            "https://www.google.com/#filter=0&q=site:www.dropBox.com/s/&start=0"
        ]

        def parse(self,response):
            sel = Selector(response)
            sites = sel.xpath("//h3[@class='r']")
            items = []
            for site in sites:
                item = DropBoxItem()
                link = site.xpath('a/@href').extract()
                item['link'] = link
                link = ''.join(link)
                #I want to parse a new page with url=link here
                new_request = Request(link,callback=self.parse_file_page)
                new_request.meta['item'] = item
                items.append(item)
            return items

        def parse_file_page(self,response):
            #item passed from request
            item = response.meta['item']
            #selector
            sel = Selector(response)
            content_area = sel.xpath("//div[@id='shmodel-content-area']")
            filename_area = content_area.xpath("div[@class='filename shmodel-filename']")
            if filename_area:
                filename = filename_area.xpath("span[@id]/text()").extract()
                if filename:
                    item['filename'] = filename
                    item['status'] = "normal"
            else:
                err_area = content_area.xpath("div[@class='err']")
                if err_area:
                    err_msg = err_area.xpath("h3/text()").extract()
                    item['err_msg'] = err_msg
                    item['status'] = "error"
            return item

Thanks to @ScrapyNovice for the answer. I modified the code, and it now looks like this:

    def parse(self,response):
        sel = Selector(response)
        sites = sel.xpath("//h3[@class='r']")
        #items = []
        for site in sites:
            item = DropBoxItem()
            link = site.xpath('a/@href').extract()
            item['link'] = link
            link = ''.join(link)
            print 'link!!!!!!=',link
            new_request = Request(link,callback=self.parse_file_page)
            new_request.meta['item'] = item
            yield new_request
            #items.append(item)
        yield item
        return
        #return item   #Note,when I simply return item here,got an error msg "SyntaxError: 'return' with argument inside generator"

    def parse_file_page(self,response):
        #item passed from request
        print 'parse_file_page!!!'
        item = response.meta['item']
        #selector
        sel = Selector(response)
        content_area = sel.xpath("//div[@id='shmodel-content-area']")
        filename_area = content_area.xpath("div[@class='filename shmodel-filename']")
        if filename_area:
            filename = filename_area.xpath("span[@id]/text()").extract()
            if filename:
                item['filename'] = filename
                item['status'] = "normal"
                item['err_msg'] = "none"
                print 'filename=',filename
        else:
            err_area = content_area.xpath("div[@class='err']")
            if err_area:
                err_msg = err_area.xpath("h3/text()").extract()
                item['filename'] = "null"
                item['err_msg'] = err_msg
                item['status'] = "error"
                print 'err_msg',err_msg
            else:
                item['filename'] = "null"
                item['err_msg'] = "unknown_err"
                item['status'] = "error"
                print 'unknown err'
        return item

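The control flow actually becomes quite strange. When I use "scrapy crawl dropBox -o items_dropBox.json -t json" to crawl a local file (a downloaded page of Google search results), I see the following output: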

2014-05-31 08:40:35-0400 [scrapy] INFO: Scrapy 0.22.2 started (bot: tutorial)
2014-05-31 08:40:35-0400 [scrapy] INFO: Optional features available: ssl,http11
2014-05-31 08:40:35-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders','FEED_FORMAT': 'json','SPIDER_MODULES': ['tutorial.spiders'],'FEED_URI': 'items_dropBox.json','BOT_NAME': 'tutorial'}
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled extensions: FeedExporter,LogStats,TelnetConsole,CloseSpider,WebService,CoreStats,SpiderState
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware,DownloadTimeoutMiddleware,UserAgentMiddleware,RetryMiddleware,DefaultHeadersMiddleware,MetaRefreshMiddleware,HttpCompressionMiddleware,RedirectMiddleware,CookiesMiddleware,ChunkedTransferMiddleware,DownloaderStats
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware,OffsiteMiddleware,RefererMiddleware,UrlLengthMiddleware,DepthMiddleware
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled item pipelines: 
2014-05-31 08:40:35-0400 [dropBox] INFO: Spider opened
2014-05-31 08:40:35-0400 [dropBox] INFO: Crawled 0 pages (at 0 pages/min),scraped 0 items (at 0 items/min)
2014-05-31 08:40:35-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-31 08:40:35-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-31 08:40:35-0400 [dropBox] DEBUG: Crawled (200) <GET file:///home/xin/Downloads/dropBox_s/dropBox_s_1-Google.html> (referer: None)
link!!!!!!= http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0
link!!!!!!= https://www.dropBox.com/s/
2014-05-31 08:40:35-0400 [dropBox] DEBUG: Filtered offsite request to 'www.dropBox.com': <GET https://www.dropBox.com/s/>
link!!!!!!= https://www.dropBox.com/s/awg9oeyychug66w
link!!!!!!= http://www.dropBox.com/s/kfmoyq9y4vrz8fm
link!!!!!!= https://www.dropBox.com/s/pvsp4uz6gejjhel
....  many links here
link!!!!!!= https://www.dropBox.com/s/gavgg48733m3918/MailCheck.xlsx
link!!!!!!= http://www.dropBox.com/s/9x8924gtb52ksn6/Phonesky.apk
2014-05-31 08:40:35-0400 [dropBox] DEBUG: Scraped from <200 file:///home/xin/Downloads/dropBox_s/dropBox_s_1-Google.html>
    {'link': [u'http://www.dropBox.com/s/9x8924gtb52ksn6/Phonesky.apk']}
2014-05-31 08:40:35-0400 [dropBox] DEBUG: Crawled (200) Box_s/dropBox_s_1-Google.html)
parse_file_page!!!
unknown err
2014-05-31 08:40:35-0400 [dropBox] DEBUG: Scraped from <200 http://www.google.com/intl/en/webmasters/>
    {'err_msg': 'unknown_err','filename': 'null','link': [u'http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0'],'status': 'error'}
2014-05-31 08:40:35-0400 [dropBox] INFO: Closing spider (finished)
2014-05-31 08:40:35-0400 [dropBox] INFO: Stored json feed (2 items) in: items_dropBox.json
2014-05-31 08:40:35-0400 [dropBox] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 558,'downloader/request_count': 2,'downloader/request_method_count/GET': 2,'downloader/response_bytes': 449979,'downloader/response_count': 2,'downloader/response_status_count/200': 2,'finish_reason': 'finished','finish_time': datetime.datetime(2014,5,31,12,40,35,348058),'item_scraped_count': 2,'log_count/DEBUG': 7,'log_count/INFO': 8,'request_depth_max': 1,'response_received_count': 2,'scheduler/dequeued': 2,'scheduler/dequeued/memory': 2,'scheduler/enqueued': 2,'scheduler/enqueued/memory': 2,'start_time': datetime.datetime(2014,5,31,12,40,35,249309)}
2014-05-31 08:40:35-0400 [dropBox] INFO: Spider closed (finished)

The JSON file now contains only:

[{"link": ["http://www.dropBox.com/s/9x8924gtb52ksn6/Phonesky.apk"]},{"status": "error","err_msg": "unknown_err","link": ["http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0"],"filename": "null"}]

Best answer

You are creating a Request and setting its callback correctly, but you never do anything with it.

        for site in sites:
            item = DropBoxItem()
            link = site.xpath('a/@href').extract()
            item['link'] = link
            link = ''.join(link)
            #I want to parse a new page with url=link here
            new_request = Request(link,callback=self.parse_file_page)
            new_request.meta['item'] = item
            yield new_request
            # Don't do this here because you're adding your Item twice.
            #items.append(item)

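On more of a design level, you are storing all of your scraped items in items at the end of parse(), but pipelines generally expect to receive individual items, not an array of them. Get rid of the items array and you will be able to use the JSON Feed Export built into Scrapy to store the results in JSON format.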

Update:

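The reason you get an error message when you try to return an item is that using yield in a function turns it into a generator. Calling the function then gives you a generator object that can be advanced repeatedly. Each time a yield is reached, the generator hands back the value being yielded but remembers its state and the line it was executing. The next time the generator is advanced, it resumes from where it left off. When it has nothing left to yield, it raises a StopIteration exception. In Python 2, a generator may not use return with a value (a bare return is fine), which is exactly what the SyntaxError is telling you.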
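The generator behaviour described here can be seen in a small stand-alone sketch. It is written in Python 3 syntax (the question's code is Python 2), and `count_links` is a made-up function used only for illustration, not part of Scrapy:

```python
def count_links(links):
    """Hypothetical generator: yields one link at a time."""
    for link in links:
        # Execution pauses here and resumes on the next next() call.
        yield link
    # A bare return simply ends the generator.
    return


gen = count_links(["a", "b"])   # calling the function only creates the generator
first = next(gen)               # runs until the first yield
second = next(gen)              # resumes right after the first yield

try:
    next(gen)                   # nothing left to yield
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, exhausted)
```

Scrapy consumes the generator returned by parse() in much the same way, which is why every yielded Request or item reaches the engine one at a time.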

You don't want to yield any items from parse(), because at that point they are still missing the filename, status and so on.

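The Requests you make in parse() go to dropbox.com, right? They are not getting through because dropbox.com is not in the spider's allowed_domains. (Hence the log message: DEBUG: Filtered offsite request to 'www.dropBox.com': <GET https://www.dropBox.com/s/>)

The one request that is actually valid and not filtered leads to http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0, which is one of Google's pages, not one of Dropbox's. You probably want to check a link's domain with urlparse in your parse() method before you yield a Request for it.

As for your results: the first JSON object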

{"link": ["http://www.dropBox.com/s/9x8924gtb52ksn6/Phonesky.apk"]}

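is from the place where you call yield item in your parse() method. There is only one because that yield is not inside any kind of loop, so when the generator resumes execution it runs the next line, return, which exits the generator. You will notice that this item is missing all of the fields you fill in inside parse_file_page(). This is why you don't want to yield any items in your parse() method.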
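Why only one item comes out of parse() can be reproduced without Scrapy. In the sketch below, `fake_parse` and the string links are stand-ins invented for illustration, and plain tuples take the place of Request and item objects (Python 3 syntax):

```python
def fake_parse(links):
    """Mimics the control flow of the modified parse(); not real Scrapy code."""
    item = None
    for link in links:
        item = {"link": link}
        # Inside the loop: one request is yielded per link.
        yield ("request", link)
    # Outside the loop: runs once, with whatever item was created last.
    yield ("item", item)
    return


results = list(fake_parse(["link-1", "link-2", "link-3"]))
requests = [value for kind, value in results if kind == "request"]
items = [value for kind, value in results if kind == "item"]

print(len(requests))  # 3
print(items)          # [{'link': 'link-3'}]
```

The single item carries only the last link, which matches the lone Phonesky.apk entry in the JSON file.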

Your second JSON object

{
 "status": "error","filename": "null"
}

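is the result of trying to parse one of Google's pages as if it were the Dropbox page you were expecting. You yielded multiple Requests in your parse() method, and all but one of them point to dropbox.com. All of the Dropbox links are dropped because they are not in your allowed_domains, so the only response you get is for the one other link on the page that matches your XPath selector and belongs to a site in your allowed_domains (the Google webmasters link). That is why you only see parse_file_page!!! once in your output.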
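One way to apply the domain check suggested above is a small helper built on urlparse. This is only a sketch: `is_dropbox_link` is a hypothetical name, and the import shown is the Python 3 location (in Python 2 the function lives in the `urlparse` module):

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse


def is_dropbox_link(url):
    """Hypothetical helper: True if the URL's host is dropbox.com or a subdomain."""
    host = (urlparse(url).hostname or "").lower()
    return host == "dropbox.com" or host.endswith(".dropbox.com")


links = [
    "https://www.dropbox.com/s/awg9oeyychug66w",
    "http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg",
]
dropbox_links = [link for link in links if is_dropbox_link(link)]
print(dropbox_links)
```

In parse(), a Request would then be yielded only for links that pass this check. The spider's allowed_domains would presumably also need to include dropbox.com, since otherwise the offsite filter still drops those requests.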

I suggest learning more about generators, since they are a fundamental part of using Scrapy. The second Google result for “python generator tutorial” looks like a very good place to start.
