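I'm new to Scrapy. What I want to do is create a crawler that only follows
the links found inside a particular HTML element on the given start_urls.
As an example, say I want a spider to crawl AirBnB listings, with start_urls set to https://www.airbnb.com/s?location=New+York%2C+NY&checkin=&checkout=&guests=1
Instead of crawling every link on that page, I only want to crawl the links inside the XPath //*[@id="results"]
Currently I'm using the following code, which crawls all links. How can I adapt it so that it only crawls the links inside //*[@id="results"]?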
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class BSpider(CrawlSpider):
    name = "bt"
    #follow = True
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://myurl.com/path"]
    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_item(self, response):
        {parse code}
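
To make the goal concrete, here is a minimal stand-alone sketch of the filtering I'm after. It uses only Python's standard library rather than Scrapy, and the names `ScopedLinkParser`, `links_inside` and `SAMPLE_HTML` are made up for illustration. It also assumes reasonably well-formed markup (every non-void tag is closed), so it is not a substitute for a real XPath engine.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ScopedLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside the element with a given id."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0      # 0 means we are currently outside the container
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Enter the container when we meet the element with the wanted id.
            if attrs.get("id") == self.container_id and tag not in VOID_TAGS:
                self.depth = 1
            return
        if tag not in VOID_TAGS:
            self.depth += 1
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Leaving a nested element; depth returns to 0 when the container closes.
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def links_inside(html, container_id):
    """Return the hrefs of all links inside the element whose id is container_id."""
    parser = ScopedLinkParser(container_id)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: only the two /rooms/ links sit inside id="results".
SAMPLE_HTML = """
<html><body>
  <a href="/help">Help</a>
  <div id="results">
    <ul>
      <li><a href="/rooms/1">Room 1</a><br></li>
      <li><a href="/rooms/2">Room 2</a></li>
    </ul>
  </div>
  <a href="/about">About</a>
</body></html>
"""

scoped_links = links_inside(SAMPLE_HTML, "results")
print(scoped_links)  # prints ['/rooms/1', '/rooms/2']
```

That is the behavior I would like the spider's rule to have: the /help and /about links are ignored and only the links under id="results" are followed. From what I can tell, Scrapy's link extractors accept a `restrict_xpaths` argument, which may be the way to express this in a Rule, but I have not confirmed that it does what I need here.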
Any pointers in the right direction would be much appreciated.
Thanks!