/×××××××××××××××××××××××××××××××××××××××××/
Author:xxx0624
HomePage:http://www.cnblogs.com/xxx0624/
===============File===============
Config 1:
<property> <name>file.content.limit</name> <value>65536</value> <description>The length limit for downloaded content using the file protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the http.content.limit setting.</description> </property>
When downloading over the file protocol, this limits the size of the downloaded file. The default is 65536 bytes. Content beyond the limit is truncated.
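As a minimal sketch, local changes to this setting would normally go in conf/nutch-site.xml, which overrides nutch-default.xml. The value -1 below is only an illustration, chosen because the description says a negative value means no truncation.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: this file is conf/nutch-site.xml, which overrides nutch-default.xml -->
  <property>
    <name>file.content.limit</name>
    <!-- A negative value disables truncation, per the description above -->
    <value>-1</value>
  </property>
</configuration>
```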
Config 2:
===============HTTP===============
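Config 1: (Important!)
Defines the User-Agent properties sent in the HTTP header. This must be configured!
Config 2: (Important!)
Once this is set, Nutch looks for this agent in the robots.txt of each site it crawls; otherwise the page cannot be crawled.
Config 3:
However, if this value is set to false, the site cannot be crawled.
Config 4: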
<property> <name>http.timeout</name> <value>10000</value> <description>The default network timeout, in milliseconds.</description> </property>
The default network timeout is 10000 ms.
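For slow sites, the timeout could be raised with an override like the sketch below. The value 30000 (30 seconds) is an arbitrary illustration, and the fragment is assumed to sit inside the <configuration> element of conf/nutch-site.xml.

```xml
<!-- Assumption: goes inside <configuration> in conf/nutch-site.xml -->
<property>
  <name>http.timeout</name>
  <!-- 30000 ms = 30 s; illustrative value, not a recommendation -->
  <value>30000</value>
</property>
```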
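Config 5 (worth considering when optimizing Nutch's crawl speed):
Config 6:
Content exceeding the limit is truncated.
Config 7: (proxy settings)
These settings are all related to the protocol-httpclient plugin.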
===============FTP=================
None yet.
===============web db===============
(1) Settings for the fetch process (only some are listed)
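<property> <name>db.fetch.interval.default</name> <value>2592000</value> <description>The default number of seconds between re-fetches of a page (30 days).</description> </property>
Sets the interval at which pages are periodically re-fetched. The default is 30 days; the unit is seconds.
<property> <name>db.fetch.interval.max</name> <value>7776000</value> <description>The maximum number of seconds between re-fetches of a page (90 days). After this period every page in the db will be re-tried, no matter what is its status.</description> </property>
Once the db.fetch.interval.max period has elapsed, every page in the database is re-fetched, whatever its current status.
<property> <name>db.fetch.schedule.class</name> <value>org.apache.nutch.crawl.DefaultFetchSchedule</value> <description>The implementation of fetch schedule. DefaultFetchSchedule simply adds the original fetchInterval to the last fetch time, regardless of page changes.</description> </property>
The class named here implements the fetch schedule. DefaultFetchSchedule simply adds the original fetch interval to the last fetch time, regardless of whether the page has changed.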
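Because both intervals are given in seconds, it helps to spell out the arithmetic when changing them. The sketch below assumes a crawl that should revisit pages weekly and force a retry after 30 days; both targets are illustrative, and the fragment is assumed to sit inside <configuration> in conf/nutch-site.xml.

```xml
<!-- Assumption: goes inside <configuration> in conf/nutch-site.xml -->
<property>
  <name>db.fetch.interval.default</name>
  <!-- 7 days * 24 h * 60 min * 60 s = 604800 s -->
  <value>604800</value>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <!-- 30 days * 24 h * 60 min * 60 s = 2592000 s -->
  <value>2592000</value>
</property>
```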
Config 4:
Config 5:
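<property> <name>db.fetch.schedule.adaptive.min_interval</name> <value>60.0</value> <description>Minimum fetchInterval, in seconds.</description> </property>
The minimum interval between page updates.
(2) Generate configuration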
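Config 7:
Config 8:
Config 9:
If set to true, an additional job is run to update the crawldb, so distinct fetch lists are produced even without an intervening updatedb.
If set to false, running without an intervening updatedb may produce two identical fetch lists.
(3) Partitioner (distribution strategy configuration)
Config 10:
(4) Fetcher (download configuration)
Config 11:
Config 12: (Important!)
Config 13: (Important!) Worth considering for speeding up the Nutch crawler.
Config 14:
Config 15:
-1 means this setting is disabled.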
===============index===============
I don't understand the indexing part very well, so only some of the settings are listed...
<property> <name>indexer.max.title.length</name> <value>100</value> <description>The maximum number of characters of a title that are indexed. A value of -1 disables this check. Used by index-basic. </description> </property>
Sets the maximum length of a title that can be indexed.
================plugin===============
<property> <name>plugin.folders</name> <value>plugins</value> <description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute,it is used as is. If relative,it is searched for on the classpath.</description> </property>
Sets the location of the Nutch plugins.
<property> <name>plugin.auto-activation</name> <value>true</value> <description>Defines if some plugins that are not activated regarding the plugin.includes and plugin.excludes properties must be automaticaly activated if they are needed by some actived plugins. </description> </property>
The default, true, activates such plugins automatically.
<property> <name>plugin.includes</name> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value> <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP,and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient,but be aware of possible intermittent problems with the underlying commons-httpclient library. </description> </property>
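The description mentions enabling protocol-httpclient for HTTPS. One way to sketch that is to override plugin.includes and swap protocol-http for protocol-httpclient, leaving the rest of the default expression unchanged. This is an assumption about how you would apply the advice, placed inside <configuration> in conf/nutch-site.xml, and the description's caveat about intermittent commons-httpclient problems still applies.

```xml
<!-- Assumption: goes inside <configuration> in conf/nutch-site.xml -->
<property>
  <name>plugin.includes</name>
  <!-- Same as the default value, with protocol-http replaced by protocol-httpclient -->
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```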
Plugin 4:
<property>
<name>plugin.excludes</name>
<value></value>
<description>Regular expression naming plugin directory names to exclude.
</description>
</property>
===============parse===============
<property>
<name>parse.plugin.file</name>
<value>parse-plugins.xml</value>
<description>The name of the file that defines the associations between
content-types and parsers.</description>
</property>
Sets the mapping between content types and their parsers.
<property> <name>htmlparsefilter.order</name> <value></value> <description>The order by which HTMLParse filters are applied. If empty, all available HTMLParse filters (as dictated by properties plugin-includes and plugin-excludes above) are loaded and applied in system defined order. If not empty, only named filters are loaded and applied in given order. HTMLParse filter ordering MAY have an impact on end result, as some filters could rely on the Metadata generated by a previous filter. </description> </property>
Sets the order of the HTML parse filters. By default they are loaded as dictated by plugin-includes and plugin-excludes.
Config 3: (Important!)
<property>
<name>urlfilter.regex.file</name>
<value>regex-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing regular expressions
used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>
Sets URL filtering; regular expressions are supported.
===============solr & elasticSearch================
Config 1: (Important!)
<property> <name>solr.mapping.file</name> <value>solrindex-mapping.xml</value> <description> Defines the name of the file that will be used in the mapping of internal nutch field names to solr index fields as specified in the target Solr schema. </description> </property>
Sets the field mapping for the Solr index.
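<property> <name>solr.commit.index</name> <description>When closing the indexer, trigger a commit to the Solr server.</description> </property>
When the indexer is closed, the results are committed to the Solr server.
<property> <name>elastic.index</name> <value>index</value> <description>The name of the elasticsearch index. Will normally be autocreated if it doesn't exist.</description> </property>
Sets the default name of the Elasticsearch index.
<property> <name>elastic.max.bulk.docs</name> <value>500</value> <description>The number of docs in the batch that will trigger a flush to elasticsearch.</description> </property>
Sets the number of documents submitted per bulk request.
==================store==================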
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.memory.store.MemStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available:
org.apache.gora.sql.store.SqlStore
Default store. A DataStore implementation for RDBMS with a SQL interface. SqlStore uses JDBC drivers to communicate with the DB. As explained in ivy.xml, currently >= gora-core 0.3 is not backwards compatible with SqlStore.
org.apache.gora.cassandra.store.CassandraStore
Gora class for storing data in Apache Cassandra.
org.apache.gora.hbase.store.HBaseStore
Gora class for storing data in Apache HBase.
org.apache.gora.accumulo.store.AccumuloStore
Gora class for storing data in Apache Accumulo.
org.apache.gora.avro.store.AvroStore
Gora class for storing data in Apache Avro.
org.apache.gora.avro.store.DataFileAvroStore
Gora class for storing data in Apache Avro. DataFileAvroStore is a file based store which uses Avro's DataFile{Writer,Reader}'s as a backend. This datastore supports mapreduce.
org.apache.gora.memory.store.MemStore
Gora class for storing data in a Memory based implementation for tests.
</description>
</property>
Specifies the storage backend, such as HBase or Avro.
Option 2: you can also change the gora file.
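To illustrate switching the backend, the sketch below selects the HBase store from the list in the description. It assumes the fragment sits inside <configuration> in conf/nutch-site.xml and that the matching Gora HBase module and a reachable HBase instance are available; neither is covered by these notes.

```xml
<!-- Assumption: goes inside <configuration> in conf/nutch-site.xml -->
<property>
  <name>storage.data.store.class</name>
  <!-- Class name taken from the list of available stores in the description -->
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
```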