>file.content.ignored>true>If true, no file content will be saved during fetch. This is probably what we want most of the time, since file:// URLs are meant to be local and we can always use them directly at the parsing and indexing stages. Otherwise file contents will be saved. !! NOT IMPLEMENTED YET !! If this is set to true, Nutch does not download file content when crawling files.



>http.agent.name>(empty)>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization. NOTE: You should also check other related properties: http.robots.agents, http.agent.description, http.agent.url, http.agent.email, http.agent.version and set their values appropriately. This configures the HTTP agent.

It defines the User-Agent properties sent in the HTTP header and must be configured.


>http.robots.agents>*>The agent strings we'll look for in robots.txt files, comma-separated, in decreasing order of precedence. You should put the value of http.agent.name as the first agent name, and keep the default * at the end of the list. E.g.: BlurflDev,Blurfl,* Some sites publish robots rules; robots.txt exists to regulate crawler behavior.
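
As a rough illustration of the two agent properties above, an override in conf/nutch-site.xml might look like the sketch below. The name BlurflDev is only the placeholder borrowed from the description's own example, not a recommended value.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hypothetical agent name; replace it with a word tied to your organization. -->
  <property>
    <name>http.agent.name</name>
    <value>BlurflDev</value>
  </property>
  <!-- The agent name comes first and the default * stays at the end. -->
  <property>
    <name>http.robots.agents</name>
    <value>BlurflDev,Blurfl,*</value>
  </property>
</configuration>
```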



>http.robots.403.allow>Some servers return HTTP status 403 (Forbidden) if /robots.txt doesn't exist. This should probably mean that we are allowed to crawl the site nonetheless. If this is set to false, then such sites will be treated as forbidden. Some servers return a 403 error when there is no robots file; in that case we can crawl the content freely.



>http.timeout>10000>The default network timeout, in milliseconds.



>http.max.delays>100>The number of times a thread will delay when trying to fetch a page. Each time it finds that a host is busy, it will wait fetcher.server.delay. After http.max.delays attempts, it will give up on the page for now. This is the maximum number of times a thread waits while fetching a page. Each time the thread finds the host busy, it waits for fetcher.server.delay; once the number of waits exceeds http.max.delays, Nutch gives up on that page.
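
To make the interplay of these settings concrete, here is a sketch of a conf/nutch-site.xml override. The numbers are simply the defaults listed in these notes. The total in the comment assumes fetcher.server.delay stays at 5.0 seconds, so it is an estimate read off the description, not a measured figure.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Network timeout in milliseconds (10000 ms = 10 s). -->
  <property>
    <name>http.timeout</name>
    <value>10000</value>
  </property>
  <!-- Assuming a 5.0 s fetcher.server.delay, 100 waits add up to
       roughly 100 x 5 s = 500 s before the page is given up for now. -->
  <property>
    <name>http.max.delays</name>
    <value>100</value>
  </property>
</configuration>
```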


>http.content.limit>The length limit for downloaded content using the http protocol, in bytes. If this value is non-negative, content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting. When pages are downloaded over HTTP, this limits the size of the downloaded content to at most 65536 bytes.



>http.proxy.host>The proxy hostname. If empty, no proxy is used.

>http.proxy.port>The proxy port.

>http.proxy.username>Username for proxy. This will be used by 'protocol-httpclient', if the proxy server requests basic, digest and/or NTLM authentication. To use this, 'protocol-httpclient' must be present in the value of 'plugin.includes' property. NOTE: For NTLM authentication, do not prefix the username with the domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.

>http.proxy.password>Password for proxy. This will be used by 'protocol-httpclient', if the proxy server requests basic, digest and/or NTLM authentication. To use this, 'protocol-httpclient' must be present in the value of 'plugin.includes' property.

These are, respectively, the proxy host name, port, user name, and password.
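
A minimal sketch of the four proxy properties in conf/nutch-site.xml follows. The host, port, and password are hypothetical placeholders, and 'susam' is just the user name from the note above. Per the descriptions, the credentials only take effect when 'protocol-httpclient' is listed in 'plugin.includes', which is not shown here.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- proxy.example.com and 8080 are hypothetical placeholders. -->
  <property>
    <name>http.proxy.host</name>
    <value>proxy.example.com</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>8080</value>
  </property>
  <!-- For NTLM, give the bare user name without the domain prefix. -->
  <property>
    <name>http.proxy.username</name>
    <value>susam</value>
  </property>
  <!-- Placeholder password; never commit a real one. -->
  <property>
    <name>http.proxy.password</name>
    <value>changeme</value>
  </property>
</configuration>
```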




===============web db===============


>db.fetch.interval.default>2592000>The default number of seconds between re-fetches of a page (30 days). This sets the interval at which pages are periodically re-fetched; the default is 30 days.


>db.fetch.interval.max>7776000>The maximum number of seconds between re-fetches of a page (90 days). After this period every page in the db will be re-tried, no matter what its status is. In other words, once db.fetch.interval.max has elapsed, every page in the database is re-fetched regardless of its current status.

>db.fetch.schedule.class>org.apache.nutch.crawl.DefaultFetchSchedule>The implementation of fetch schedule. DefaultFetchSchedule simply adds the original fetchInterval to the last fetch time, regardless of page changes. The class named here implements the scheduling of page fetches.

DefaultFetchSchedule simply adds the original fetch interval to the last fetch time, regardless of whether the page has changed.


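>db.fetch.schedule.adaptive.inc_rate>0.4>If a page is unmodified, its fetchInterval will be increased by this rate. This value should not exceed 0.5, otherwise the algorithm becomes unstable. When a page is re-fetched and the database is updated, if the page has not changed, its re-fetch interval grows by this rate, that is, the old interval is multiplied by (1 + rate). The value must not exceed 0.5.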


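>db.fetch.schedule.adaptive.dec_rate>0.2>If a page is modified, its fetchInterval will be decreased by this rate. This value should not exceed 0.5, otherwise the algorithm becomes unstable. When a page is re-fetched and the database is updated, if the page has changed, its re-fetch interval shrinks by this rate, that is, the old interval is multiplied by (1 - rate). The value must not exceed 0.5.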

>db.fetch.schedule.adaptive.min_interval>60.0>Minimum fetchInterval, in seconds. The smallest re-fetch interval for a page.

>db.fetch.schedule.adaptive.max_interval>31536000.0>Maximum fetchInterval, in seconds (365 days). NOTE: this is limited by db.fetch.interval.max. Pages with fetchInterval larger than db.fetch.interval.max will be fetched anyway. The largest re-fetch interval for a page.
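
The adaptive settings above only matter when an adaptive schedule is selected instead of DefaultFetchSchedule. The sketch below assumes org.apache.nutch.crawl.AdaptiveFetchSchedule is the class that reads them and that it scales the interval multiplicatively. The worked numbers in the comments start from the 30-day default and ignore any other adjustment the schedule class may apply.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: this class consumes the db.fetch.schedule.adaptive.* properties. -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <!-- Unmodified page: 2592000 s x (1 + 0.4) = 3628800 s, i.e. 42 days. -->
  <property>
    <name>db.fetch.schedule.adaptive.inc_rate</name>
    <value>0.4</value>
  </property>
  <!-- Modified page: 2592000 s x (1 - 0.2) = 2073600 s, i.e. 24 days. -->
  <property>
    <name>db.fetch.schedule.adaptive.dec_rate</name>
    <value>0.2</value>
  </property>
  <!-- Intervals are kept between 60 s and 365 days, capped by db.fetch.interval.max. -->
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>60.0</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>31536000.0</value>
  </property>
</configuration>
```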



>generate.max.count>-1>The maximum number of urls in a single fetchlist. -1 if unlimited. The urls are counted according to the value of the parameter generate.count.mode. This sets the number of URLs in the fetch queue; -1 means unlimited.


>generate.count.mode>host>Determines how the URLs are counted for generate.max.count. Default value is 'host' but can be 'domain'. Note that we do not count per IP in the new version of the Generator. With the default, URLs are counted per host when deciding whether a URL still fits into the fetchlist.


>generate.update.crawldb>false>For highly-concurrent environments, where several generate/fetch/update cycles may overlap, setting this to true ensures that generate will create different fetchlists even without intervening updatedb-s, at the cost of running an additional job to update CrawlDB. If false, running generate twice without intervening updatedb will generate identical fetchlists. This matters in highly concurrent environments, where generate/fetch/update cycles may overlap.
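
The three generate.* properties are usually tuned together. The fragment below is a sketch for conf/nutch-site.xml; the limit of 100 is an arbitrary illustrative value, not a recommendation.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hypothetical cap: at most 100 URLs per counting unit in one fetchlist. -->
  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>
  <!-- Count the cap per host; 'domain' is the other documented option. -->
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>
  <!-- Make overlapping generate runs produce different fetchlists,
       at the cost of an extra CrawlDB update job. -->
  <property>
    <name>generate.update.crawldb</name>
    <value>true</value>
  </property>
</configuration>
```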





>partition.url.mode>byHost>Determines how to partition URLs. Default value is 'byHost', also takes 'byDomain' or 'byIP'. By default, URLs are distributed according to their host.



>fetcher.server.delay>5.0>The number of seconds the fetcher will delay between successive requests to the same server. This sets the pause between consecutive requests to the same server.


>fetcher.threads.fetch>10>The number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection). The total number of threads running in distributed mode will be the number of fetcher threads * number of nodes, as the fetcher has one map task per node. The default is 10 fetcher threads.


>fetcher.threads.per.queue>1>This number is the maximum number of threads that should be allowed to access a queue at one time. This sets how many threads may access the same queue at the same time.
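
The fetcher politeness and threading settings above can be sketched as the following conf/nutch-site.xml fragment. The values are the defaults listed in these notes, and the four-node cluster in the comment is a hypothetical example used only for the arithmetic.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Wait 5 seconds between successive requests to the same server. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
  <!-- 10 threads per node; on a hypothetical 4-node cluster that is
       10 x 4 = 40 threads in total, one map task per node. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
  <!-- Only one thread may work on a given queue at a time. -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
  </property>
</configuration>
```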


>fetcher.store.content>If true, fetcher will store content. Setting this to true means the fetcher threads store the content they download.


>fetcher.throughput.threshold.pages>The threshold of minimum pages per second. If the fetcher downloads fewer pages per second than the configured threshold, the fetcher stops, preventing slow queues from stalling the throughput. This threshold must be an integer. This can be useful when fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check. This sets the minimum throughput of the fetcher: if fewer pages per second are downloaded than this value, the fetcher threads stop.




  <description>The maximum number of characters of a title that are indexed. A value of -1 disables this check.
  Used by index-basic.</description>



  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>


  <description>Defines if some plugins that are not activated regarding
  the plugin.includes and plugin.excludes properties must be automatically
  activated if they are needed by some activated plugins.</description>



  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need to include at least the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.</description>



  <description>Regular expression naming plugin directory names to exclude.</description>



  <description>The name of the file that defines the associations between
  content-types and parsers.</description>


  <description>The order by which HTMLParse filters are applied.
  If empty, all available HTMLParse filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order.
  HTMLParse filter ordering MAY have an impact
  on end result, as some filters could rely on the metadata generated by a previous filter.</description>

This sets the order of the HTML parse filters. By default they are loaded according to plugin-includes and plugin-excludes.


  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-regex (RegexURLFilter) plugin.</description>


===============solr & elasticSearch================


  Defines the name of the file that will be used in the mapping of internal
  nutch field names to solr index fields as specified in the target Solr schema.


>solr.commit.index>When closing the indexer, trigger a commit to the Solr server. That is, the results are committed to the Solr server when the indexer is closed.

>elastic.index>index>The name of the elasticsearch index. Will normally be autocreated if it doesn't exist. This sets the default name of the ES index.

>elastic.max.bulk.docs>500>The number of docs in the batch that will trigger a flush to elasticsearch. This sets how many documents are submitted per bulk request.
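
For the two elasticsearch indexing properties, a conf/nutch-site.xml override could look like this sketch. The index name crawl_pages is a hypothetical placeholder; 500 is the default shown above.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hypothetical index name; normally autocreated if it does not exist. -->
  <property>
    <name>elastic.index</name>
    <value>crawl_pages</value>
  </property>
  <!-- Flush to elasticsearch once 500 documents are batched. -->
  <property>
    <name>elastic.max.bulk.docs</name>
    <value>500</value>
  </property>
</configuration>
```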


>storage.data.store.class>org.apache.gora.memory.store.MemStore>The Gora DataStore class for storing and retrieving data. Currently the following stores are available:
org.apache.gora.sql.store.SqlStore: Default store. A DataStore implementation for RDBMS with a SQL interface. SqlStore uses JDBC drivers to communicate with the DB. As explained in ivy.xml, currently >= gora-core 0.3 is not backwards compatible with SqlStore.
org.apache.gora.cassandra.store.CassandraStore: Gora class for storing data in Apache Cassandra.
org.apache.gora.hbase.store.HBaseStore: Gora class for storing data in Apache HBase.
org.apache.gora.accumulo.store.AccumuloStore: Gora class for storing data in Apache Accumulo.
org.apache.gora.avro.store.AvroStore: Gora class for storing data in Apache Avro.
org.apache.gora.avro.store.DataFileAvroStore: Gora class for storing data in Apache Avro. DataFileAvroStore is a file based store which uses Avro's DataFile{Writer,Reader}'s as a backend. This datastore supports mapreduce.
org.apache.gora.memory.store.MemStore: Gora class for storing data in a Memory based implementation for tests.
This property selects the storage backend, for example HBase or Avro.
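
Since MemStore is described as a test-only store, a real crawl would normally name one of the other classes. The sketch below picks HBaseStore purely as an example. It assumes the matching Gora module and its dependencies are already on the classpath, which this fragment does not set up.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Example choice only: any class from the list above can be named here. -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
</configuration>
```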



