web-scraping - How can I prevent twisted.internet.error.ConnectionLost errors when using Scrapy?

I am scraping some pages with Scrapy and get the following error:

twisted.internet.error.ConnectionLost

My command-line output:

2015-05-04 18:40:32+0800 [cnproxy] INFO: Spider opened
2015-05-04 18:40:32+0800 [cnproxy] INFO: Crawled 0 pages (at 0 pages/min),scraped 0 items (at 0 items/min)
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy1.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy3.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy3.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy8.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy8.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy9.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy2.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy9.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy10.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy10.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu1.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy7.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy7.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy5.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy5.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy6.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy6.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu2.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy4.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy4.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] INFO: Closing spider (finished)
2015-05-04 18:40:35+0800 [cnproxy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 36,'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 36,'downloader/request_bytes': 8121,'downloader/request_count': 36,'downloader/request_method_count/GET': 36,'finish_reason': 'finished','finish_time': datetime.datetime(2015,5,4,10,40,35,608377),'log_count/DEBUG': 38,'log_count/ERROR': 12,'log_count/INFO': 7,'scheduler/dequeued': 36,'scheduler/dequeued/memory': 36,'scheduler/enqueued': 36,'scheduler/enqueued/memory': 36,'start_time': datetime.datetime(2015,5,4,10,40,32,624695)}
2015-05-04 18:40:35+0800 [cnproxy] INFO: Spider closed (finished)

My settings.py:

SPIDER_MODULES = ['proxy.spiders']
# The Scrapy setting is NEWSPIDER_MODULE (singular).
NEWSPIDER_MODULE = 'proxy.spiders'

DOWNLOAD_DELAY = 0
DOWNLOAD_TIMEOUT = 30

ITEM_PIPELINES = {
    'proxy.pipelines.ProxyPipeline': 100,
}

CONCURRENT_ITEMS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 64
#CONCURRENT_SPIDERS = 128

LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_FILE = '/home/hadoop/modules/scrapy/myapp/proxy/proxy.log'
LOG_LEVEL = 'DEBUG'
LOG_STDOUT = False

My spider, proxy_spider.py:

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from proxy.items import ProxyItem
import re

class ProxycrawlerSpider(CrawlSpider):
    name = 'cnproxy'
    allowed_domains = ['www.cnproxy.com']
    indexes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    start_urls = []
    for i in indexes:
        url = 'http://www.cnproxy.com/proxy%s.html' % i
        start_urls.append(url)
    start_urls.append('http://www.cnproxy.com/proxyedu1.html')
    start_urls.append('http://www.cnproxy.com/proxyedu2.html')

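    def parse_ip(self, response):
        sel = HtmlXPathSelector(response)
        # Skip the header row; columns 1, 2 and 4 hold address, protocol and location.
        # The address pattern matches a full dotted quad so each cell yields one match.
        addresses = sel.select('//tr[position()>1]/td[position()=1]').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
        protocols = sel.select('//tr[position()>1]/td[position()=2]').re(r'<td>(.*)</td>')
        locations = sel.select('//tr[position()>1]/td[position()=4]').re(r'<td>(.*)</td>')
        # Ports are written by inline JavaScript as letter codes joined with '+';
        # port_map turns each letter back into its digit and drops the '+'.
        ports_re = re.compile(r'write\(":"(.*)\)')
        raw_ports = ports_re.findall(response.body)
        port_map = {'z':'3','m':'4','k':'2','l':'9','d':'0','b':'5','i':'7','w':'6','r':'8','c':'1','+':''}
        ports = []
        for port in raw_ports:
            tmp = port
            for key in port_map:
                tmp = tmp.replace(key, port_map[key])
            ports.append(tmp)
        items = []
        # zip stops at the shortest list, so a row with a missing field cannot raise IndexError.
        for address, protocol, location, port in zip(addresses, protocols, locations, ports):
            item = ProxyItem()
            item['address'] = address
            item['protocol'] = protocol
            item['location'] = location
            item['port'] = port
            items.append(item)
        return items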

Is there something wrong with my pipeline or settings?
If not, how can I prevent the twisted.internet.error.ConnectionLost error?

I tried the Scrapy shell:

$ scrapy shell http://www.cnproxy.com/proxy1.html

and got the same error as in the title.
But I can open the page in Chrome. I also tried other pages, such as

$ scrapy shell http://stackoverflow.com

and they all work fine.

Solution

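You need to set a user agent string. Some websites appear to reject the default one and block requests whose user agent is not a browser's.
You can find examples of user agent strings online.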
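One way to check whether the user agent really is what gets the connection dropped is to request the same page twice, once with a non-browser user agent and once with a browser one. The following is a minimal sketch using only the Python 3 standard library; the helper names `build_request`, `try_fetch` and `compare`, and the `my-crawler/1.0` string, are assumptions made for illustration and are not part of Scrapy.

```python
import http.client
import urllib.request

# Browser user agent taken from the answer; any current browser string should do.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36")


def build_request(url, user_agent):
    """Create a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def try_fetch(url, user_agent, timeout=10):
    """Fetch url and return a short description of the outcome."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return "HTTP %d" % resp.status
    except (OSError, http.client.HTTPException) as exc:
        # Covers dropped connections, timeouts, DNS failures and HTTP error statuses.
        return "failed: %r" % (exc,)


def compare(url):
    """Print the outcome for a non-browser and a browser user agent."""
    for user_agent in ("my-crawler/1.0", BROWSER_UA):
        print("%-20s -> %s" % (user_agent[:20], try_fetch(url, user_agent)))
```

Calling `compare("http://www.cnproxy.com/proxy1.html")` would print one line per user agent. If only the non-browser request fails, the user agent is a likely cause; if both fail, the problem probably lies elsewhere.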

This article identifies best practices for keeping your spider from being blocked.
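The article itself is not reproduced here, so the following is only a sketch of settings that are commonly suggested for making a crawl less aggressive. The setting names are real Scrapy settings, but the values are illustrative assumptions and should be tuned for the target site. Note that the settings.py in the question does the opposite, with no delay and 64 concurrent requests per domain.

```python
# settings.py (sketch): values are illustrative assumptions, not taken from the article.

# Wait between requests to the same site instead of sending them back to back.
DOWNLOAD_DELAY = 2

# Keep the number of parallel requests per domain small.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Some sites use cookies to spot crawler behaviour; disable them if the site works without.
COOKIES_ENABLED = False

# Let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
```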

Open settings.py and add the following user agent:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'

You can also try a user-agent randomiser.
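A user-agent randomiser can be written as a small downloader middleware. The sketch below is one possible shape, not the specific randomiser the answer links to: the class name `RandomUserAgentMiddleware`, the `USER_AGENTS` pool and its second entry are assumptions made for illustration. It relies only on the downloader middleware hook `process_request(request, spider)`.

```python
import random

# Hypothetical pool of browser user agents; replace with strings of your choice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0",
]


class RandomUserAgentMiddleware(object):
    """Downloader middleware that sets a random User-Agent on every request."""

    def __init__(self, user_agents=USER_AGENTS):
        # Copy the pool so later changes to the caller's list do not affect us.
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header on each outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

To enable it, the class would be registered in settings.py under `DOWNLOADER_MIDDLEWARES`, for example `{'proxy.middlewares.RandomUserAgentMiddleware': 400}`. The module path `proxy.middlewares` and the priority `400` are assumptions and depend on where the class is saved in your project.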
