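Nutch: can anyone explain the status names shown in readdb stats?
Can anyone explain the status names that appear in the Nutch readdb statistics?
1. db_redir_perm 2. db_unfetched 3. db_fetched 4. db_gone 5. db_redir_temp 6. db_duplicate 7. db_notmodified
Solution
Nutch stores all metadata about a URL in a CrawlDatum object, which is persisted under /crawldb/*/part-*/data.
According to the CrawlDatum source code: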
/** Page was not fetched yet. */
db_unfetched --> public static final byte STATUS_DB_UNFETCHED = 0x01;
/** Page was successfully fetched. */
db_fetched --> public static final byte STATUS_DB_FETCHED = 0x02;
/** Page no longer exists. */
db_gone --> public static final byte STATUS_DB_GONE = 0x03;
/** Page temporarily redirects to other page. */
db_redir_temp --> public static final byte STATUS_DB_REDIR_TEMP = 0x04;
/** Page permanently redirects to other page. */
db_redir_perm --> public static final byte STATUS_DB_REDIR_PERM = 0x05;
/** Page was successfully fetched and found not modified. */
db_notmodified --> public static final byte STATUS_DB_NOTMODIFIED = 0x06;
/** Page was marked as being a duplicate of another page */
db_duplicate --> public static final byte STATUS_DB_DUPLICATE = 0x07;
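The CrawlDatum field private byte status; takes one of the values above, depending on the state of the URL. (There are many other status flags that I am not covering here.)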
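To make the mapping concrete, here is a minimal standalone sketch that mirrors the seven constants listed above and looks up the name that readdb shows for a given status byte. The class DbStatusNames and its method nameOf are hypothetical and not part of Nutch; only the byte values and names are taken from the listing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: mirrors the db_* constants listed above.
final class DbStatusNames {
    private static final Map<Byte, String> NAMES = new LinkedHashMap<>();

    static {
        NAMES.put((byte) 0x01, "db_unfetched");
        NAMES.put((byte) 0x02, "db_fetched");
        NAMES.put((byte) 0x03, "db_gone");
        NAMES.put((byte) 0x04, "db_redir_temp");
        NAMES.put((byte) 0x05, "db_redir_perm");
        NAMES.put((byte) 0x06, "db_notmodified");
        NAMES.put((byte) 0x07, "db_duplicate");
    }

    /** Returns the db_* name for a status byte, or "unknown" for any other value. */
    static String nameOf(byte status) {
        return NAMES.getOrDefault(status, "unknown");
    }

    public static void main(String[] args) {
        // Print the whole table in the order of the constants.
        for (Map.Entry<Byte, String> e : NAMES.entrySet()) {
            System.out.printf("0x%02x -> %s%n", e.getKey(), e.getValue());
        }
    }
}
```

Any byte outside 0x01 to 0x07 falls through to "unknown" here, because the other status flags are deliberately left out of this sketch.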
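When does the status value of a CrawlDatum object change?
Many different flows can lead to one of the statuses above. I will explain the ones I know well.
- When we inject seed URLs, the crawldb folder is created with one CrawlDatum object per URL, each with status db_unfetched. See the code below from the Injector class, in the InjectReducer.reduce method.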
    for (CrawlDatum val : values) {
      if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
        injected.set(val);
        injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
        injectedSet = true;
      } else {
        old.set(val);
        oldSet = true;
      }
    }
Setting this status makes it easy for the Generator phase to select only the URLs that have not been fetched yet.
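The loop above can be imitated without Hadoop to see its effect on a single URL. This is a simplified sketch: the Status enum and the merge method are hypothetical stand-ins for CrawlDatum and the reducer, and the rule that an existing crawldb entry wins over a newly injected one is an assumption on my part, since the excerpt does not show what happens after the loop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for CrawlDatum statuses and the reducer; not Nutch classes.
final class InjectSketch {
    enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

    /**
     * Merges all values seen for one URL, following the loop above:
     * an INJECTED value is rewritten to DB_UNFETCHED, anything else is
     * treated as the entry already present in the crawldb.
     * Assumption: when both are present, the existing entry is kept.
     */
    static Status merge(List<Status> values) {
        Status injected = null;
        Status old = null;
        for (Status val : values) {
            if (val == Status.INJECTED) {
                injected = Status.DB_UNFETCHED;
            } else {
                old = val;
            }
        }
        return old != null ? old : injected;
    }

    public static void main(String[] args) {
        // A brand-new seed URL.
        System.out.println(merge(Arrays.asList(Status.INJECTED)));
        // A URL that is re-injected while already present in the crawldb.
        System.out.println(merge(Arrays.asList(Status.DB_FETCHED, Status.INJECTED)));
    }
}
```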
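- In the Fetcher phase, if you open the FetcherThread source code, you will see that the CrawlDatum status is changed based on the HTTP status code returned for the URL. Refer to the list of HTTP status codes for a better understanding.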
    case ProtocolStatus.MOVED: // redirect
    case ProtocolStatus.TEMP_MOVED:
      int code;
      boolean temp;
      if (status.getCode() == ProtocolStatus.MOVED) {
        code = CrawlDatum.STATUS_FETCH_REDIR_PERM;
        temp = false;
      } else {
        code = CrawlDatum.STATUS_FETCH_REDIR_TEMP;
        temp = true;
      }
      output(fit.url, fit.datum, content, status, code);
      String newUrl = status.getMessage();
      Text redirUrl = handleRedirect(fit, newUrl, temp, Fetcher.PROTOCOL_REDIR);
      if (redirUrl != null) {
        fit = queueRedirect(redirUrl, fit);
      } else {
        // stop redirecting
        redirecting = false;
      }
      break;

    case ProtocolStatus.EXCEPTION:
      logError(fit.url, status.getMessage());
      int killedURLs = ((FetchItemQueues) fetchQueues)
          .checkExceptionThreshold(fit.getQueueID());
      if (killedURLs != 0)
        context.getCounter("FetcherStatus",
            "AboveExceptionThresholdInQueue").increment(killedURLs);
      /* FALLTHROUGH */
    case ProtocolStatus.RETRY: // retry
    case ProtocolStatus.BLOCKED:
      output(fit.url, fit.datum, null, status, CrawlDatum.STATUS_FETCH_RETRY);
      break;

    case ProtocolStatus.GONE: // gone
    case ProtocolStatus.NOTFOUND:
    case ProtocolStatus.ACCESS_DENIED:
    case ProtocolStatus.ROBOTS_DENIED:
      output(fit.url, fit.datum, null, status, CrawlDatum.STATUS_FETCH_GONE);
      break;

    case ProtocolStatus.NOTMODIFIED:
      output(fit.url, fit.datum, null, status, CrawlDatum.STATUS_FETCH_NOTMODIFIED);
      break;

    default:
      if (LOG.isWarnEnabled()) {
        LOG.warn("{} {} Unknown ProtocolStatus: {}", getName(),
            Thread.currentThread().getId(), status.getCode());
      }
      output(fit.url, fit.datum, null, status, CrawlDatum.STATUS_FETCH_RETRY);
    } // end of switch (status.getCode())

    if (redirecting && redirectCount > maxRedirect) {
      ((FetchItemQueues) fetchQueues).finishFetchItem(fit);
      if (LOG.isInfoEnabled()) {
        LOG.info("{} {} - redirect count exceeded {}", getName(),
            Thread.currentThread().getId(), fit.url);
      }
      output(fit.url, fit.datum, null, ProtocolStatus.STATUS_REDIR_EXCEEDED,
          CrawlDatum.STATUS_FETCH_GONE);
    }
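The switch above boils down to a mapping from protocol status to fetch status. Below is a condensed sketch of that mapping. The enums Protocol and Fetch and the methods toFetchStatus and afterRedirectCheck are hypothetical names that mirror the constants in the excerpt; the success case, queue handling, redirect following, and logging are all left out.

```java
// Hypothetical enums mirroring the ProtocolStatus and CrawlDatum.STATUS_FETCH_*
// constants that appear in the excerpt; not Nutch classes.
final class FetchStatusSketch {
    enum Protocol {
        MOVED, TEMP_MOVED, EXCEPTION, RETRY, BLOCKED, GONE, NOTFOUND,
        ACCESS_DENIED, ROBOTS_DENIED, NOTMODIFIED, OTHER
    }

    enum Fetch { REDIR_PERM, REDIR_TEMP, RETRY, GONE, NOTMODIFIED }

    /** Maps a protocol-level outcome to the fetch status recorded for the URL. */
    static Fetch toFetchStatus(Protocol p) {
        switch (p) {
            case MOVED:
                return Fetch.REDIR_PERM;
            case TEMP_MOVED:
                return Fetch.REDIR_TEMP;
            case EXCEPTION: // falls through to RETRY, as in the excerpt
            case RETRY:
            case BLOCKED:
                return Fetch.RETRY;
            case GONE:
            case NOTFOUND:
            case ACCESS_DENIED:
            case ROBOTS_DENIED:
                return Fetch.GONE;
            case NOTMODIFIED:
                return Fetch.NOTMODIFIED;
            default: // unknown protocol status is retried
                return Fetch.RETRY;
        }
    }

    /** Mirrors the final check in the excerpt: too many redirects is recorded as GONE. */
    static Fetch afterRedirectCheck(Fetch current, boolean redirecting,
                                    int redirectCount, int maxRedirect) {
        return (redirecting && redirectCount > maxRedirect) ? Fetch.GONE : current;
    }
}
```

Note that these are STATUS_FETCH_* values written by the fetcher, not yet the db_* values reported by readdb.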
- In the deduplication phase, if a URL is found to be a duplicate based on its MD5 hash, its status is marked as STATUS_DB_DUPLICATE, and in the next iteration the generator …
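As an illustration of digest-based deduplication, the sketch below groups pages by the MD5 hash of their content and marks every page after the first one with the same hash as a duplicate. All names here (DedupSketch, md5Hex, markDuplicates) are hypothetical, and keeping the first URL seen is a simplifying assumption; which copy the real deduplication job keeps is not covered in this answer.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of digest-based deduplication; not the Nutch job itself.
final class DedupSketch {
    /** Returns the MD5 digest of the content as a 32-character hex string. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Takes URL -> page content (in iteration order) and returns URL -> status name.
     * Assumption: the first URL seen for a digest stays db_fetched, later ones
     * with the same digest become db_duplicate.
     */
    static Map<String, String> markDuplicates(Map<String, String> pagesByUrl) {
        Set<String> seenDigests = new HashSet<>();
        Map<String, String> status = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : pagesByUrl.entrySet()) {
            boolean firstWithDigest = seenDigests.add(md5Hex(e.getValue()));
            status.put(e.getKey(), firstWithDigest ? "db_fetched" : "db_duplicate");
        }
        return status;
    }
}
```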