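Learning Scrapy, Part 1

Learning to crawl with Scrapy

Installing on Windows

Before using Scrapy, install the following:

Python 2.7
pip: running pip install scrapy installs the libraries Scrapy depends on automatically
lxml: installing lxml on Windows often runs into problems, so install a prebuilt third-party binary wheel instead:
pip install lxml-3.4.4-cp27-none-win32.whl

pywin32: to use it inside a virtualenv, install it with the command easy_install pywin32.exe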


Getting started

Creating a project

scrapy startproject weather

This command creates a weather directory with the following contents:

weather/
    scrapy.cfg
    weather/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

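scrapy.cfg: the project's configuration file
weather/: the project's Python module; all of your code goes here
weather/items.py: the project's item definitions
weather/pipelines.py: the project's pipelines
weather/settings.py: the project's settings
weather/spiders/: the directory that holds the spider code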


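Analyzing the data source
Inspect the source of the target page, decide which data you want, and define a matching field for each piece in the item.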

    # weather/items.py
    import scrapy


    class WeatherItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # demo 1
        city = scrapy.Field()
        date = scrapy.Field()
        dayDesc = scrapy.Field()
        dayTemp = scrapy.Field()

Writing the spider

    import scrapy
    from weather.items import WeatherItem


    class WeatherSpider(scrapy.Spider):
        name = "myweather"
        allowed_domains = ["sina.com.cn"]
        start_urls = ['http://weather.sina.com.cn']

        def parse(self, response):
            item = WeatherItem()
            # extract() returns a list of strings, so every field holds a list
            item['city'] = response.xpath('//*[@id="slider_ct_name"]/text()').extract()
            # the element that contains the multi-day forecast
            tenDay = response.xpath('//*[@id="blk_fc_c0_scroll"]')
            item['date'] = tenDay.css('p.wt_fc_c0_i_date::text').extract()
            item['dayDesc'] = tenDay.css('img.icons0_wt::attr(title)').extract()
            item['dayTemp'] = tenDay.css('p.wt_fc_c0_i_temp::text').extract()
            return item

Running the spider
scrapy crawl myweather -o wea.json

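Saving the data
To save the results to a file or a database you need an Item Pipeline. So what is an Item Pipeline?
After an Item has been collected by the Spider, it is passed to the Item Pipeline, where several components process it in a defined order.
Each item pipeline component (sometimes just called an "Item Pipeline") is a Python class that implements a simple method. It receives an Item, performs some action on it, and decides whether the Item continues through the pipeline or is dropped and processed no further.
Typical uses of item pipelines are:

cleaning HTML data
validating scraped data (checking that the item contains certain fields)
checking for duplicates (and dropping them)
storing the scraped results in a file or a database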

    # -*- coding: utf-8 -*-
    # weather/pipelines.py (Python 2.7, like the rest of this post)


    class WeatherPipeline(object):

        def __init__(self):
            pass

        def process_item(self, item, spider):
            # 'w+' truncates the file, so wea.txt is rewritten on every call
            with open('wea.txt', 'w+') as file:
                city = item['city'][0].encode('utf-8')
                file.write('city:' + str(city) + '\n\n')

                date = item['date']

                # odd indexes are used as the day description,
                # even indexes as the night description
                desc = item['dayDesc']
                dayDesc = desc[1::2]
                nightDesc = desc[0::2]

                dayTemp = item['dayTemp']

                weaitem = zip(date, dayDesc, nightDesc, dayTemp)

                # unpack into separate names so the loop does not overwrite item
                for d, dd, nd, temp in weaitem:
                    ta = temp.split('/')
                    dt = ta[0]
                    nt = ta[1]
                    txt = 'date:{0}\t\tday:{1}({2})\t\tnight:{3}({4})\n\n'.format(
                        d.encode('utf-8'), dd.encode('utf-8'), dt.encode('utf-8'),
                        nd.encode('utf-8'), nt.encode('utf-8')
                    )
                    file.write(txt)
            # return the original item so later pipeline components still receive it
            return item

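To see how the slicing and zip in process_item line up, here is a small standalone sketch. The sample values are made up for illustration (they are not real output from weather.sina.com.cn), and the sketch assumes, as the pipeline does, that the scraped titles come two per date and that each temperature string has the form day/night.

```python
# Hypothetical sample values shaped like the fields the spider collects.
date = ['09-01', '09-02']
desc = ['cloudy', 'sunny', 'rain', 'overcast']   # two titles per date
dayTemp = ['30/22', '28/21']                      # "day/night" temperatures

# Same slicing as the pipeline: odd indexes -> day, even indexes -> night.
dayDesc = desc[1::2]
nightDesc = desc[0::2]

rows = []
for d, dd, nd, temp in zip(date, dayDesc, nightDesc, dayTemp):
    dt, nt = temp.split('/')
    rows.append('date:{0}\tday:{1}({2})\tnight:{3}({4})'.format(d, dd, dt, nd, nt))

for row in rows:
    print(row)
```

Note that zip stops at the shortest input, so if one of the lists comes back shorter than the others the extra entries are silently skipped.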
Add ITEM_PIPELINES to the settings (weather/settings.py):

    ITEM_PIPELINES = {
        'weather.pipelines.WeatherPipeline': 1
    }
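
The number attached to each entry in ITEM_PIPELINES sets the order in which the components run (lower values first), and in Scrapy a component stops an item by raising scrapy.exceptions.DropItem. The sketch below imitates that flow in plain Python so that it runs without Scrapy. The DropItem class here is a local stand-in, and RequireFields, DropDuplicates and run_pipelines are hypothetical names used only for illustration; they are not Scrapy APIs.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class RequireFields(object):
    """Validation step: drop items that lack a required field."""
    required = ('city', 'date')

    def process_item(self, item, spider):
        for field in self.required:
            if not item.get(field):
                raise DropItem('missing field: ' + field)
        return item


class DropDuplicates(object):
    """Deduplication step: drop items whose city was already seen."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = item['city'][0]
        if key in self.seen:
            raise DropItem('duplicate city: ' + key)
        self.seen.add(key)
        return item


def run_pipelines(items, pipelines):
    """Pass each item through the components in ascending priority order."""
    ordered = [p for p, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]
    kept = []
    for item in items:
        try:
            for component in ordered:
                item = component.process_item(item, spider=None)
        except DropItem:
            continue  # a dropped item is not passed to later components
        kept.append(item)
    return kept


# Hypothetical items; plain dicts stand in for WeatherItem objects.
items = [
    {'city': ['Beijing'], 'date': ['09-01']},
    {'city': ['Beijing'], 'date': ['09-01']},   # duplicate
    {'city': ['Shanghai'], 'date': []},         # fails validation
]
pipelines = {DropDuplicates(): 200, RequireFields(): 100}
kept = run_pipelines(items, pipelines)
print(kept)
```

Here RequireFields runs before DropDuplicates because 100 is lower than 200, so only the first item survives both steps.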

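Postscript
When the crawler saves its output it uses the Windows system encoding, and some of the characters being written are not supported by that encoding (for example by the ASCII codec), so an error is raised.
The fix is to give the output file an explicit UTF-8 encoding when writing, using the codecs library, and to call json.dumps with indent=2 and ensure_ascii=False so that the JSON contains the Chinese text.
With ensure_ascii=False the text is written in its original script instead of \uXXXX escapes. The indent parameter sets the number of spaces per indentation level.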
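
A minimal sketch of that fix: serialize with ensure_ascii=False and write through a file opened with an explicit UTF-8 encoding. The sample dictionary and the file name wea_utf8.json are made up for illustration. The sketch uses io.open; codecs.open, the codecs library call, accepts the same encoding argument.

```python
import io
import json
import os
import tempfile

# Hypothetical item data containing Chinese text.
data = {'city': u'北京', 'dayDesc': [u'晴', u'多云']}

# ensure_ascii=False keeps the characters as-is instead of \uXXXX escapes;
# indent=2 pretty-prints with two spaces per level.
text = json.dumps(data, indent=2, ensure_ascii=False)

# Open the file with an explicit encoding instead of the system default.
path = os.path.join(tempfile.gettempdir(), 'wea_utf8.json')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading it back with the same encoding round-trips the data.
with io.open(path, 'r', encoding='utf-8') as f:
    loaded = json.load(f)

print(text)
```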
