How to resolve a mismatch between Nutch readdb stats, domain stats, and Solr record counts
I have crawled 2 seed URLs to a depth of 3. Solr contains 142 records. The domain stats are:
173 …
363 …
165 fronteracashandloan.com
8 gangnamsushihouse.com
The readdb stats are:
status 1 (db_unfetched): 246
status 2 (db_fetched): 153
status 3 (db_gone): 4
status 5 (db_redir_perm): 32
status 6 (db_notmodified): 20
status 7 (db_duplicate): 81
What could be causing this mismatch?
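As a quick sanity check, the readdb counts above can be reconciled in a short shell sketch. The working assumption here (mine, not confirmed behavior) is that only db_fetched and db_notmodified pages are candidates for indexing, and that later dedup/clean steps can remove documents from Solr; the crawldb path and Solr core name in the comments are likewise assumptions:

```shell
#!/bin/sh
# Per-status counts as reported by `bin/nutch readdb <crawldb> -stats`
unfetched=246; fetched=153; gone=4; redir_perm=32; notmodified=20; duplicate=81

# Pages that could plausibly be sent to Solr (fetched + not-modified)
candidates=$((fetched + notmodified))
echo "index candidates: $candidates"   # prints 173, vs. 142 records in Solr

# To re-check the live numbers (paths and core name are assumptions):
#   bin/nutch readdb crawl/crawldb -stats
#   curl 'http://localhost:8983/solr/<core>/select?q=*:*&rows=0'
```

The gap between 173 candidates and 142 indexed records would then have to come from documents dropped or deleted during indexing, deduplication, or cleaning.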
My Nutch configuration:
<configuration>
<property>
<name>http.agent.name</name>
<value>baiduspider</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf|tika|metatags)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
</property>
<property>
<name>db.ignore.external.links.mode</name>
<value>byDomain</value>
</property>
<property>
<name>fetcher.server.delay</name>
<value>2</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server. Note that this might get
overridden by a Crawl-Delay from a robots.txt and is used ONLY if
fetcher.threads.per.queue is set to 1.
</description>
</property>
<property>
<name>fetcher.server.min.delay</name>
<value>0.5</value>
<description>The minimum number of seconds the fetcher will delay between
successive requests to the same server. This value is applicable ONLY
if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
is turned off).</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>400</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection).</description>
</property>
<property>
<name>fetcher.max.crawl.delay</name>
<value>10</value>
<description>
If the Crawl-Delay in robots.txt is set to greater than this value (in
seconds) then the fetcher will skip this page, generating an error report.
If set to -1 the fetcher will never skip such pages and will wait the
amount of time retrieved from robots.txt Crawl-Delay, however long that
might be.
</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>25</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
<property>
<name>http.redirect.max</name>
<value>25</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, the fetcher won't immediately
follow redirected URLs; instead it will record them for later fetching.
</description>
</property>
</configuration>