
Do not remove extra lines when processing crawled text from web pages

I searched the parse-html plugin but could not find where to change the code so that it stops removing extra lines from the HTML page. When crawling with Nutch, all extra newlines are removed from the crawled text. I want to keep the text together with all the newlines as they appear on the website. For example, when crawling the page https://www.modernfamilydental.net/, the expected output is: \n\n\n\nSan Francisco,CA Dentist\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWould you like to switch to the accessible version of this site?\nGo to accessible site\n\nClose modal window\n\n\n\n\n\nDon\'t need the accessible version of this site?\nHide the accessibility button\n\nClose modal window\n\n\n\n\n\n\nAccessibility View\n\n\nClose toolbar\n\n\n\n\nJavascript must be enabled for the correct page display\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModern Family Dental Hao Tran,DMD\nDentist located in Laurel Heights,San Francisco,CA\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n(415) 752-5244\n\n\n \n\n\n\n\n\n\n\n\n\nMenu\n\n\n\n\nHome\n\n\nServices\n \nLatest Equipment\n\n\nInsurance\n\n\nTeeth Whitening\n\n\nCrowns & Bridges\n\n\nSmile Makeovers\n\n\nResin Composite Bonding\n\n\nVeneers\n\n\nImplant Retained Dentures\n\n\nNight Guards\n\n\nMetal-Free Restoration\n\n\nInvisalign\n\n\nDental Examination

But the Nutch output is:

San Francisco,CA Dentist\nWould you like to switch to the accessible version of this site?\nGo to accessible site\nClose modal window\nDon\'t need the accessible version of this site?\nHide the accessibility button\n\nClose modal window\nAccessibility View\n\n\nClose toolbar\n\n\n\n\nJavascript must be enabled for the correct page display\nModern Family Dental Hao Tran,CA\n(415) 752-5244\nMenu\nHome\nServices\nLatest Equipment\nInsurance\nTeeth Whitening\nCrowns & Bridges\nSmile Makeovers\n\n\nResin Composite Bonding\nVeneers\nImplant Retained Dentures\nNight Guards\nMetal-Free Restoration\nInvisalign\nDental Examination

May I know which plugin code I should change, or should I change the parse_text code instead?

Solution

I have already answered this here in the comment section.

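If you do not want to read from the /content folder, you can do the following. I assume you are using the parse-html or parse-tika plugin to parse the HTML content.

If you are using either of them, the Nutch plugin uses the DOMContentUtils API to extract the parsed text from the HTML.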

        // Extracts the text from the given Node and appends it to the StringBuffer sb.
        // Returns true if extraction was aborted because of nested anchors.
        public boolean getText(StringBuffer sb, Node node, boolean abortOnNestedAnchors) {
          return getTextHelper(sb, node, abortOnNestedAnchors, 0);
        }
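To get a feel for what this kind of DOM-based text extraction does without building Nutch, the following is a minimal sketch that uses only the JDK's org.w3c.dom API. The class DomTextDemo and its methods are hypothetical names made up for this illustration; this is not the Nutch implementation, and it assumes well-formed XHTML input rather than arbitrary HTML. It appends the raw value of every text node, so newlines between elements are kept.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; it is not part of Nutch.
final class DomTextDemo {

    // Appends the unmodified value of every text node below `node`,
    // skipping the contents of <script> and <style> elements.
    static void collectText(StringBuilder sb, Node node) {
        String name = node.getNodeName();
        if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue()); // no whitespace collapsing here
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(sb, child);
        }
    }

    // Parses a well-formed XHTML fragment and returns its text content.
    static String extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder sb = new StringBuilder();
            collectText(sb, doc.getDocumentElement());
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<div><p>Home</p>\n\n<script>var x;</script><p>Services</p></div>";
        System.out.println(extract(page));
    }
}
```

Because the text node between the two paragraphs is appended as-is, the blank line between "Home" and "Services" survives in the result.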

In the getTextHelper method, you can comment out the line text = text.replaceAll("\\s+"," "); so that runs of whitespace characters [ \t\r\n\f] are no longer replaced with a single space.

  private boolean getTextHelper(StringBuffer sb, Node node, boolean abortOnNestedAnchors, int anchorDepth) {
    boolean abort = false;
    NodeWalker walker = new NodeWalker(node);

    while (walker.hasNext()) {

      Node currentNode = walker.nextNode();
      String nodeName = currentNode.getNodeName();
      short nodeType = currentNode.getNodeType();
      Node previousSibling = currentNode.getPreviousSibling();
      if (previousSibling != null
          && blockNodes.contains(previousSibling.getNodeName().toLowerCase())) {
        appendParagraphSeparator(sb);
      } else if (blockNodes.contains(nodeName.toLowerCase())) {
        appendParagraphSeparator(sb);
      }

      if ("script".equalsIgnoreCase(nodeName)) {
        walker.skipChildren();
      }
      if ("style".equalsIgnoreCase(nodeName)) {
        walker.skipChildren();
      }
      if (abortOnNestedAnchors && "a".equalsIgnoreCase(nodeName)) {
        anchorDepth++;
        if (anchorDepth > 1) {
          abort = true;
          break;
        }
      }
      if (nodeType == Node.COMMENT_NODE) {
        walker.skipChildren();
      }
      if (nodeType == Node.TEXT_NODE) {
        // cleanup and trim the value
        String text = currentNode.getNodeValue();
        text = text.replaceAll("\\s+", " "); // comment out this line to keep the original newlines
        text = text.trim();
        if (text.length() > 0) {
          appendSpace(sb);
          sb.append(text);
        } else {
          appendParagraphSeparator(sb);
        }
      }
    }

    return abort;
  }
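The effect of that one line can be checked in isolation. The sketch below is a standalone illustration, not Nutch code: WhitespaceDemo and its two methods are hypothetical names that mimic the text-node cleanup with and without the replaceAll call. Note that, going by the method shown above, text.trim() still runs afterwards, so leading and trailing newlines of each text node would still be dropped even with replaceAll commented out.

```java
// Hypothetical demo class; it is not part of Nutch.
final class WhitespaceDemo {

    // Mimics the original cleanup: every run of whitespace becomes one space,
    // then the result is trimmed.
    static String collapse(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Mimics the cleanup with the replaceAll line commented out:
    // only trim() remains, so interior newlines are preserved.
    static String keepInterior(String text) {
        return text.trim();
    }

    public static void main(String[] args) {
        String raw = "\n\nSan Francisco, CA Dentist\n\n\nMenu\n";
        System.out.println("[" + collapse(raw) + "]");
        System.out.println("[" + keepInterior(raw) + "]");
    }
}
```

If the leading and trailing newlines matter as well, the trim() call would also have to be adjusted; whether that is desirable depends on how the parsed text is used downstream.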
