Nutch crawl log: what do 399054 SCHEDULE_REJECTED and 5892 URLS_SKIPPED_PER_HOST_OVERFLOW mean?

While crawling with Nutch, I see the following in the log:

Generator: number of items rejected during selection:
Generator:     67  HOSTS_AFFECTED_PER_HOST_OVERFLOW
Generator:      3  MALFORMED_URL
Generator: 399054  SCHEDULE_REJECTED
Generator:   5892  URLS_SKIPPED_PER_HOST_OVERFLOW

I understand 67 HOSTS_AFFECTED_PER_HOST_OVERFLOW and 3 MALFORMED_URL.

I don't understand what 399054 SCHEDULE_REJECTED and 5892 URLS_SKIPPED_PER_HOST_OVERFLOW mean.

Can anyone explain what they mean?

Answer

The Generator phase maintains several counters that report how many URLs were filtered out or skipped during the Generator MapReduce job.

SCHEDULE_REJECTED

if (!schedule.shouldFetch(url, crawlDatum, curTime)) {
  context.getCounter("Generator", "SCHEDULE_REJECTED").increment(1);
  return;
}

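The schedule implementation is set by the following property in nutch-site.xml; the default is DefaultFetchSchedule:

db.fetch.schedule.class = org.apache.nutch.crawl.DefaultFetchSchedule

The shouldFetch method in AbstractFetchSchedule (which DefaultFetchSchedule inherits) decides whether a URL is allowed into the Fetcher phase at this time.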

public boolean shouldFetch(Text url,CrawlDatum datum,long curTime) {
    // pages are never truly GONE - we have to check them from time to time.
    // pages with too long a fetchInterval are adjusted so that they fit within
    // a maximum fetchInterval (segment retention period).
    if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
      if (datum.getFetchInterval() > maxInterval) {
        datum.setFetchInterval(maxInterval * 0.9f);
      }
      datum.setFetchTime(curTime);
    }
    if (datum.getFetchTime() > curTime) {
      return false; // not time yet
    }
    return true;
  }

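Per the logic above, a URL that was fetched in an earlier iteration can be fetched again in a later iteration only once its fetchTime has passed. The re-fetch window is determined by db.fetch.interval.default, which defaults to 30 days.

shouldFetch therefore ensures that a successfully fetched URL is retried only after 30 days; until then it is rejected in the Generator and counted as SCHEDULE_REJECTED.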
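The re-fetch window can be illustrated with a minimal, self-contained sketch. It does not use the Nutch classes: `ScheduleSketch` and `Datum` are hypothetical stand-ins, and the 90-day maximum interval is an assumed value standing in for the configured maximum. Only the comparison logic mirrors the quoted shouldFetch method.

```java
// Simplified stand-in for the schedule check quoted above; not the Nutch classes themselves.
final class ScheduleSketch {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    /** Hypothetical minimal datum: next fetch time (epoch millis) and fetch interval (seconds). */
    static final class Datum {
        long fetchTime;
        float fetchInterval;

        Datum(long fetchTime, float fetchInterval) {
            this.fetchTime = fetchTime;
            this.fetchInterval = fetchInterval;
        }
    }

    /** Mirrors the quoted logic: clamp far-future fetch times, then compare with the current time. */
    static boolean shouldFetch(Datum d, long curTime, int maxIntervalSec) {
        if (d.fetchTime - curTime > (long) maxIntervalSec * 1000) {
            if (d.fetchInterval > maxIntervalSec) {
                d.fetchInterval = maxIntervalSec * 0.9f;
            }
            d.fetchTime = curTime;
        }
        return d.fetchTime <= curTime; // false means "not time yet" (SCHEDULE_REJECTED)
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        int thirtyDaysSec = 30 * 24 * 60 * 60;  // 30-day default interval
        int maxIntervalSec = 90 * 24 * 60 * 60; // assumed maximum interval for this sketch

        // A page fetched "now" is next due in 30 days.
        Datum d = new Datum(now + 30 * DAY_MS, thirtyDaysSec);

        System.out.println(shouldFetch(d, now + 1 * DAY_MS, maxIntervalSec));  // false: rejected
        System.out.println(shouldFetch(d, now + 31 * DAY_MS, maxIntervalSec)); // true: due again
    }
}
```

In a crawl where most URLs in the CrawlDb were fetched recently, nearly all of them take the first path, which would be consistent with a large SCHEDULE_REJECTED count such as the 399054 in the question.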

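WAIT_FOR_UPDATE (default wait: 7 days): this counter is meaningful only when generate.update.crawldb=true is set; otherwise it has no effect.

It is intended for high-concurrency environments in which several generate/fetch/update cycles may overlap. Setting generate.update.crawldb to true makes generate produce distinct fetch lists, and crawl.gen.delay is what keeps them distinct: it defines how long items that have already been generated stay blocked (7 days by default).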

LongWritable oldGenTime = (LongWritable) crawlDatum.getMetaData()
    .get(Nutch.WRITABLE_GENERATE_TIME_KEY);
if (oldGenTime != null) { // awaiting fetch & update
  if (oldGenTime.get() + genDelay > curTime) { // still waiting for update
    context.getCounter("Generator", "WAIT_FOR_UPDATE").increment(1);
    return;
  }
}
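The blocking rule above can be reduced to a single predicate. The following is a sketch only: `WaitForUpdateSketch` and `stillWaitingForUpdate` are hypothetical names, a nullable `Long` replaces the metadata lookup, and the 7-day delay is hard-coded rather than read from crawl.gen.delay.

```java
// Stand-alone sketch of the WAIT_FOR_UPDATE decision; names are illustrative, not Nutch API.
final class WaitForUpdateSketch {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    /**
     * Returns true when the entry should be skipped and counted as WAIT_FOR_UPDATE.
     * oldGenTime is null when no earlier generate run has marked the entry.
     */
    static boolean stillWaitingForUpdate(Long oldGenTime, long genDelayMs, long curTime) {
        return oldGenTime != null && oldGenTime + genDelayMs > curTime;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long genDelayMs = 7 * DAY_MS; // 7-day delay, hard-coded for this sketch

        // Never generated before: eligible.
        System.out.println(stillWaitingForUpdate(null, genDelayMs, now));              // false
        // Generated 2 days ago and not yet updated: still blocked.
        System.out.println(stillWaitingForUpdate(now - 2 * DAY_MS, genDelayMs, now));  // true
        // Generated 8 days ago: the block has expired, so the entry is eligible again.
        System.out.println(stillWaitingForUpdate(now - 8 * DAY_MS, genDelayMs, now));  // false
    }
}
```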

MALFORMED_URL: this counter tracks URLs that have invalid URL syntax or URL-encoding problems.

HOSTS_AFFECTED_PER_HOST_OVERFLOW / URLS_SKIPPED_PER_HOST_OVERFLOW：

if (maxCount > 0) {
  int[] hostCount = hostCounts.get(hostordomain);
  if (hostCount == null) {
    hostCount = new int[]{1,0};
    hostCounts.put(hostordomain,hostCount);
  }
  // increment hostCount
  hostCount[1]++;

  // check if topN reached,select next segment if it is
  while (segCounts[hostCount[0] - 1] >= limit
          && hostCount[0] < maxNumSegments) {
    hostCount[0]++;
    hostCount[1] = 0;
  }

  // reached the limit of allowed URLs per host / domain
  // see if we can put it in the next segment?
  if (hostCount[1] > maxCount) {
    if (hostCount[0] < maxNumSegments) {
      hostCount[0]++;
      hostCount[1] = 1;
    } else {
      if (hostCount[1] == (maxCount + 1)) {
        context
                .getCounter("Generator","HOSTS_AFFECTED_PER_HOST_OVERFLOW")
                .increment(1);
        LOG.info(
                "Host or domain {} has more than {} URLs for all {} segments. Additional URLs won't be included in the fetchlist.",hostordomain,maxCount,maxNumSegments);
      }
      // skip this entry
      context.getCounter("Generator","URLS_SKIPPED_PER_HOST_OVERFLOW")
              .increment(1);
      continue;
    }
  }
  entry.segnum = new IntWritable(hostCount[0]);
  segCounts[hostCount[0] - 1]++;
} else {
  entry.segnum = new IntWritable(currentsegmentnum);
  segCounts[currentsegmentnum - 1]++;
}

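In the code above, the hostCounts map tracks, for each host or domain, the segment currently being filled (hostCount[0]) and the number of URLs already placed in it (hostCount[1]). The condition hostCount[1] == (maxCount + 1) is reached with hostCount[0] already at maxNumSegments only when the host or domain has hit its per-segment limit in every segment; at that point it is counted once under HOSTS_AFFECTED_PER_HOST_OVERFLOW.

HOSTS_AFFECTED_PER_HOST_OVERFLOW therefore counts the hosts/domains that overflowed: each one is counted a single time, when its first URL beyond the limit of the last segment is seen.

URLS_SKIPPED_PER_HOST_OVERFLOW counts every URL that was skipped because its host/domain had no room left in any segment. In the log from the question, 5892 URLs were skipped across 67 hosts/domains.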
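The relationship between the two overflow counters can be seen in a small simulation. This is a simplified sketch, not the Generator code: `HostOverflowSketch` and `assign` are hypothetical names, the counters are plain fields, and the topN/segCounts check is left out so that only the per-host limit is modeled.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the per-host limit; omits the topN (segCounts/limit) check.
final class HostOverflowSketch {
    int hostsAffected; // stands in for HOSTS_AFFECTED_PER_HOST_OVERFLOW
    int urlsSkipped;   // stands in for URLS_SKIPPED_PER_HOST_OVERFLOW

    private final Map<String, int[]> hostCounts = new HashMap<>();
    private final int maxCount;
    private final int maxNumSegments;

    HostOverflowSketch(int maxCount, int maxNumSegments) {
        this.maxCount = maxCount;
        this.maxNumSegments = maxNumSegments;
    }

    /** Returns the segment number assigned to one URL of the given host, or -1 if it is skipped. */
    int assign(String host) {
        // hostCount[0] = current segment, hostCount[1] = URLs seen for that segment
        int[] hostCount = hostCounts.computeIfAbsent(host, h -> new int[] {1, 0});
        hostCount[1]++;
        if (hostCount[1] > maxCount) {
            if (hostCount[0] < maxNumSegments) {
                hostCount[0]++;   // spill into the next segment
                hostCount[1] = 1;
            } else {
                if (hostCount[1] == maxCount + 1) {
                    hostsAffected++; // first overflowing URL: count the host once
                }
                urlsSkipped++;       // every overflowing URL is counted
                return -1;
            }
        }
        return hostCount[0];
    }

    public static void main(String[] args) {
        // At most 2 URLs per host per segment, 2 segments: 4 URLs per host in total.
        HostOverflowSketch gen = new HostOverflowSketch(2, 2);
        for (int i = 0; i < 7; i++) gen.assign("a.example"); // 3 URLs over the limit
        for (int i = 0; i < 3; i++) gen.assign("b.example"); // fits within the limit
        System.out.println(gen.hostsAffected); // 1
        System.out.println(gen.urlsSkipped);   // 3
    }
}
```

If the skipped URLs matter, the usual levers are the per-host limit and the number of segments; in Nutch these correspond to the generate.max.count and generate.max.num.segments properties.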

The remaining counters, such as INTERVAL_REJECTED, SCORE_TOO_LOW, and STATUS_REJECTED, are self-explanatory; check the Generator source code if you need the details.