How to resolve a mismatch between Nutch readdb stats, domain stats, and Solr record counts
I have crawled 2 seed URLs to a depth of 3. Solr contains 142 records. The domain stats are:
173 …
363 …
165 fronteracashandloan.com
8 gangnamsushihouse.com
The readdb stats are:
status 1 (db_unfetched): 246
status 2 (db_fetched): 153
status 3 (db_gone): 4
status 5 (db_redir_perm): 32
status 6 (db_notmodified): 20
status 7 (db_duplicate): 81
What could be causing this mismatch?
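As a quick sanity check, the readdb counts above can be reconciled in a short shell sketch. The working assumption here (mine, not confirmed behavior) is that only db_fetched and db_notmodified pages are candidates for indexing, and that later dedup/clean steps can remove documents from Solr; the crawldb path and Solr core name in the comments are likewise assumptions:

```shell
#!/bin/sh
# Per-status counts as reported by `bin/nutch readdb <crawldb> -stats`
unfetched=246; fetched=153; gone=4; redir_perm=32; notmodified=20; duplicate=81

# Pages that could plausibly be sent to Solr (fetched + not-modified)
candidates=$((fetched + notmodified))
echo "index candidates: $candidates"   # prints 173, vs. 142 records in Solr

# To re-check the live numbers (paths and core name are assumptions):
#   bin/nutch readdb crawl/crawldb -stats
#   curl 'http://localhost:8983/solr/<core>/select?q=*:*&rows=0'
```

The gap between 173 candidates and 142 indexed records would then have to come from documents dropped or deleted during indexing, deduplication, or cleaning.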
My Nutch configuration:
<configuration>
<property>
<name>http.agent.name</name>
<value>baiduspider</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf|tika|metatags)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
</property>
<property>
<name>db.ignore.external.links.mode</name>
<value>byDomain</value>
</property>
<property>
<name>fetcher.server.delay</name>
<value>2</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server. Note that this might get
overridden by a Crawl-Delay from a robots.txt and is used ONLY if
fetcher.threads.per.queue is set to 1.
</description>
</property>
<property>
<name>fetcher.server.min.delay</name>
<value>0.5</value>
<description>The minimum number of seconds the fetcher will delay between
successive requests to the same server. This value is applicable ONLY
if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
is turned off).</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>400</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection).</description>
</property>
<property>
<name>fetcher.max.crawl.delay</name>
<value>10</value>
<description>
If the Crawl-Delay in robots.txt is set to greater than this value (in
seconds) then the fetcher will skip this page, generating an error report.
If set to -1 the fetcher will never skip such pages and will wait the
amount of time retrieved from robots.txt Crawl-Delay, however long that
might be.
</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>25</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
<property>
<name>http.redirect.max</name>
<value>25</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, the fetcher won't immediately
follow redirected URLs; instead it will record them for later fetching.
</description>
</property>
</configuration>