Nutch: can someone explain the status names shown in readdb stats?

Can someone explain the status names that appear in the output of Nutch's readdb stats?

1. db_redir_perm 2. db_unfetched 3. db_fetched 4. db_gone 5. db_redir_temp 6. db_duplicate 7. db_notmodified

Solution

Nutch stores all metadata about a URL in a CrawlDatum object, and these objects are persisted under /crawldb/*/part-*/data.

According to the CrawlDatum source code:

    /** Page was not fetched yet. */
    db_unfetched   --> public static final byte STATUS_DB_UNFETCHED = 0x01;
    /** Page was successfully fetched. */
    db_fetched     --> public static final byte STATUS_DB_FETCHED = 0x02;
    /** Page no longer exists. */
    db_gone        --> public static final byte STATUS_DB_GONE = 0x03;
    /** Page temporarily redirects to other page. */
    db_redir_temp  --> public static final byte STATUS_DB_REDIR_TEMP = 0x04;
    /** Page permanently redirects to other page. */
    db_redir_perm  --> public static final byte STATUS_DB_REDIR_PERM = 0x05;
    /** Page was successfully fetched and found not modified. */
    db_notmodified --> public static final byte STATUS_DB_NOTMODIFIED = 0x06;
    /** Page was marked as being a duplicate of another page */
    db_duplicate   --> public static final byte STATUS_DB_DUPLICATE = 0x07;

The CrawlDatum field `private byte status;` takes one of the values above, depending on the state of the URL. (There are many other flags that I won't discuss here.)
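To make the mapping concrete, here is a minimal standalone sketch that mirrors the seven STATUS_DB_* values from the listing and tallies them per name, similar in spirit to what the stats output reports. The class `DbStatusNames` and its methods `nameOf` and `tally` are hypothetical names invented for this illustration; they are not part of Nutch, and the sketch does not read a real crawldb.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: mirrors the STATUS_DB_* values listed above, not Nutch's own class. */
final class DbStatusNames {
    private static final Map<Byte, String> NAMES = new LinkedHashMap<>();

    static {
        NAMES.put((byte) 0x01, "db_unfetched");
        NAMES.put((byte) 0x02, "db_fetched");
        NAMES.put((byte) 0x03, "db_gone");
        NAMES.put((byte) 0x04, "db_redir_temp");
        NAMES.put((byte) 0x05, "db_redir_perm");
        NAMES.put((byte) 0x06, "db_notmodified");
        NAMES.put((byte) 0x07, "db_duplicate");
    }

    /** Returns the stats name for a status byte, or "unknown" for any other value. */
    static String nameOf(byte status) {
        return NAMES.getOrDefault(status, "unknown");
    }

    /** Counts how many entries carry each status name. */
    static Map<String, Integer> tally(List<Byte> statuses) {
        Map<String, Integer> counts = new TreeMap<>();
        for (byte status : statuses) {
            counts.merge(nameOf(status), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample: one unfetched, two fetched, one duplicate.
        List<Byte> sample = List.of((byte) 0x01, (byte) 0x02, (byte) 0x02, (byte) 0x07);
        tally(sample).forEach((name, count) -> System.out.println(name + ": " + count));
    }
}
```

Reading the stats output then comes down to asking how many crawldb entries currently hold each of these byte values.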

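When does the status value of a CrawlDatum object change?

Many different flows can lead to one of the statuses listed above. I'll explain the ones I know well.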

  1. When we inject URLs, the crawldb folder is created with one CrawlDatum object per URL, each with status db_unfetched. See the Injector class code below.

InjectReducer.reduce method:

    for (CrawlDatum val : values) {
      if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
        // Newly injected URL: it enters the crawldb as db_unfetched.
        injected.set(val);
        injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
        injectedSet = true;
      } else {
        // Entry that already exists in the crawldb.
        old.set(val);
        oldSet = true;
      }
    }

Setting this flag helps the Generator phase select only the URLs that have not been fetched yet.
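The loop above can be modelled in isolation as follows. This is a simplified sketch, not Nutch code: the enum `Status` and the helper `resolve` are hypothetical, and the final line assumes that an entry already present in the crawldb wins over a newly injected one, which is my understanding of the default behaviour but is not shown in the excerpt.

```java
import java.util.List;

/** Simplified model of the inject merge step; names are illustrative, not Nutch's. */
final class InjectSketch {
    enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED, DB_GONE }

    /** Resolves all entries seen for one URL into the status kept in the crawldb. */
    static Status resolve(List<Status> values) {
        Status injected = null;
        Status old = null;
        for (Status val : values) {
            if (val == Status.INJECTED) {
                injected = Status.DB_UNFETCHED; // new URL enters as unfetched
            } else {
                old = val;                      // entry already in the crawldb
            }
        }
        // Assumption: the existing entry is kept when both are present.
        return old != null ? old : injected;
    }

    public static void main(String[] args) {
        System.out.println(resolve(List.of(Status.INJECTED)));
        System.out.println(resolve(List.of(Status.DB_FETCHED, Status.INJECTED)));
    }
}
```

Under that assumption, re-injecting a URL that was already fetched leaves its status unchanged, while a brand new URL shows up under db_unfetched.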

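  2. In the Fetcher phase, if you open the FetcherThread source code, you will see that the CrawlDatum status is changed based on the HTTP status code returned for the URL. (It helps to have the list of HTTP status codes at hand for a better understanding.)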

    // Excerpt from inside the switch on status.getCode() in FetcherThread
    case ProtocolStatus.MOVED: // redirect
    case ProtocolStatus.TEMP_MOVED:
      int code;
      boolean temp;
      if (status.getCode() == ProtocolStatus.MOVED) {
        code = CrawlDatum.STATUS_FETCH_REDIR_PERM;
        temp = false;
      } else {
        code = CrawlDatum.STATUS_FETCH_REDIR_TEMP;
        temp = true;
      }
      output(fit.url, fit.datum, content, status, code);
      String newUrl = status.getMessage();
      Text redirUrl = handleRedirect(fit, newUrl, temp, Fetcher.PROTOCOL_REDIR);
      if (redirUrl != null) {
        fit = queueRedirect(redirUrl, fit);
      } else {
        // stop redirecting
        redirecting = false;
      }
      break;
    case ProtocolStatus.EXCEPTION:
      logError(fit.url, status.getMessage());
      int killedURLs = ((FetchItemQueues) fetchQueues)
          .checkExceptionThreshold(fit.getQueueID());
      if (killedURLs != 0)
        context.getCounter("FetcherStatus", "AboveExceptionThresholdInQueue")
            .increment(killedURLs);
      /* FALLTHROUGH */
    case ProtocolStatus.RETRY: // retry
    case ProtocolStatus.BLOCKED:
      output(fit.url, fit.datum, null, status, CrawlDatum.STATUS_FETCH_RETRY);
      break;
    case ProtocolStatus.GONE: // gone
    case ProtocolStatus.NOTFOUND:
    case ProtocolStatus.ACCESS_DENIED:
    case ProtocolStatus.ROBOTS_DENIED:
      output(fit.url, fit.datum, null, status, CrawlDatum.STATUS_FETCH_GONE);
      break;
    case ProtocolStatus.NOTMODIFIED:
      output(fit.url, fit.datum, null, status, CrawlDatum.STATUS_FETCH_NOTMODIFIED);
      break;
    default:
      if (LOG.isWarnEnabled()) {
        LOG.warn("{} {} Unknown ProtocolStatus: {}", getName(),
            Thread.currentThread().getId(), status.getCode());
      }
      output(fit.url, fit.datum, null, status, CrawlDatum.STATUS_FETCH_RETRY);
    } // end of switch

    if (redirecting && redirectCount > maxRedirect) {
      ((FetchItemQueues) fetchQueues).finishFetchItem(fit);
      if (LOG.isInfoEnabled()) {
        LOG.info("{} {} - redirect count exceeded {}", getName(),
            Thread.currentThread().getId(), fit.url);
      }
      output(fit.url, fit.datum, null, ProtocolStatus.STATUS_REDIR_EXCEEDED,
          CrawlDatum.STATUS_FETCH_GONE);
    }

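Note that the Fetcher excerpt sets STATUS_FETCH_* values, whereas the stats output reports db_* names. As far as I understand it, the crawldb update step that runs after fetching translates one into the other. The sketch below models that translation under that assumption; the enum `FetchStatus`, the method `toDbStatus`, and the parameters `retriesSoFar` and `maxRetries` are hypothetical names for illustration, and the retry rule (give up and mark the page gone once the retry limit is reached) is my reading of the behaviour rather than something shown in the excerpt.

```java
/** Hedged model of how a fetch outcome could end up as a db_* status; not Nutch code. */
final class UpdateDbSketch {
    enum FetchStatus { SUCCESS, RETRY, REDIR_TEMP, REDIR_PERM, GONE, NOTMODIFIED }

    static String toDbStatus(FetchStatus fetched, int retriesSoFar, int maxRetries) {
        switch (fetched) {
            case SUCCESS:
                return "db_fetched";
            case REDIR_TEMP:
                return "db_redir_temp";
            case REDIR_PERM:
                return "db_redir_perm";
            case GONE:
                return "db_gone";
            case NOTMODIFIED:
                return "db_notmodified";
            case RETRY:
                // Assumption: the URL stays unfetched until the retry limit is reached.
                return retriesSoFar < maxRetries ? "db_unfetched" : "db_gone";
            default:
                throw new IllegalArgumentException("Unhandled status: " + fetched);
        }
    }

    public static void main(String[] args) {
        for (FetchStatus status : FetchStatus.values()) {
            System.out.println(status + " -> " + toDbStatus(status, 0, 3));
        }
    }
}
```

Seen this way, a 404 or a robots.txt denial in the Fetcher would surface as db_gone, and a transient error would leave the URL under db_unfetched so that a later round can try again.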
  3. In the deduplication phase, if a URL is found to be a duplicate based on its MD5 hash, its status is marked as STATUS_DB_DUPLICATE, and in the next iteration the generator ...
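The idea behind hash-based deduplication can be sketched as follows. This is a simplified illustration and not the Nutch deduplication job: the class `DedupSketch` and its methods are hypothetical, the page content is hashed directly as a string, and the first URL seen for a given hash is the one that is kept, which is a simplification chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative MD5-based duplicate marking; names and keep-first rule are assumptions. */
final class DedupSketch {
    /** Returns the MD5 digest of the content as a lowercase hex string. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Takes url -> content and returns url -> status name, in input order. */
    static Map<String, String> markDuplicates(Map<String, String> pages) {
        Map<String, String> firstUrlByHash = new HashMap<>();
        Map<String, String> statusByUrl = new LinkedHashMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            String hash = md5Hex(page.getValue());
            // putIfAbsent returns null only for the first URL with this hash.
            String first = firstUrlByHash.putIfAbsent(hash, page.getKey());
            statusByUrl.put(page.getKey(), first == null ? "db_fetched" : "db_duplicate");
        }
        return statusByUrl;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://a.example/", "same body");
        pages.put("http://b.example/", "same body");
        pages.put("http://c.example/", "different body");
        markDuplicates(pages).forEach((url, status) -> System.out.println(url + " " + status));
    }
}
```

Two URLs that serve byte-identical content produce the same hash, so only one of them keeps its fetched status and the other is counted under db_duplicate.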
