Extracting newline-separated content from an HTML document using JSoup

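We want to strip the markup from an HTML document using Jsoup, subject to the following constraints:

1. The text content must come out in the correct order.
2. Each text block must be printed on its own line. Here a text block means a text node, except that text separated only by formatting nodes such as b, i, strong, or u does not get a new line of its own. For example, a paragraph reading "This is a paragraph." in which part of the sentence is wrapped in such a formatting tag should produce the following single line of text: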

This is a paragraph.

We started with the following code:

File input = new File(args[0]);
Document doc = Jsoup.parse(input,"UTF-8");
String content = Jsoup.clean(doc.body().toString(),Whitelist.none());
System.out.println(content);

The text content comes out in the correct order, but only a single line of output is produced (violating constraint 2).

Next we tried the following:

File input = new File(args[0]);
Document doc = Jsoup.parse(input,"UTF-8");

// Allow the full range of text and structural body HTML
String content = Jsoup.clean(doc.body().toString(),Whitelist.relaxed());

// Create a new HTML document 
String newHtml = "<html><body>" + content + "</body></html>";
Document newDoc = Jsoup.parse(newHtml);

// Get the ownText for the nodes 
Elements elements = newDoc.getAllElements();
for(Element element : elements) {
    String ownText = element.ownText();
    if(!ownText.isEmpty()) System.out.println(ownText);
}

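This gives us multiple lines, but the extracted content is out of order (violating constraint 1) and not every text block is printed on its own line (violating constraint 2).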
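A likely reason for the ordering problem is that ownText() joins all of an element's direct text nodes and skips its children, so text on either side of a child element is merged and emitted before the child's own text. The minimal sketch below illustrates this on a tiny snippet; the class name OwnTextDemo and the helper ownTexts are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextDemo {
    // Hypothetical helper: collects the non-empty ownText() of every element,
    // in the order returned by getAllElements().
    static List<String> ownTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element element : doc.getAllElements()) {
            String ownText = element.ownText();
            if (!ownText.isEmpty()) {
                lines.add(ownText);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // The <p> owns two text nodes ("Hello " and " now!") split by the <b> child.
        // Prints "Hello now!" first, then "there".
        ownTexts("<p>Hello <b>there</b> now!</p>").forEach(System.out::println);
    }
}
```

The word "there" ends up after "now!" even though it precedes it in the document, and the two halves of the paragraph are fused into one line, which matches both symptoms described above.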

We also tried another approach, but it violates both constraints as well.

File input = new File(args[0]);
Document doc = Jsoup.parse(input,"UTF-8");

Elements elements = doc.getElementsMatchingOwnText(".*");
for(Element element : elements) {
    String ownText = element.ownText();
    if(!ownText.isEmpty()) System.out.println(ownText);
}

Sample input:

<!-- http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm -->
<HTML>
<HEAD>
<TITLE>Your Title Here</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFF">
<CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"> </CENTER>
<HR>
<a href="http://somegreatsite.com">Link Name</a>
is a link to another nifty site
<H1>This is a Header</H1>
<H2>This is a Medium Header</H2>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>
<P>This is a new paragraph!
<P><B>This is a new paragraph!</B>
<BR><B><I>This is a new sentence without a paragraph break,in bold italics.</I></B>
<HR>
</BODY>
</HTML>

Expected output:

Link Name
is a link to another nifty site
This is a Header
This is a Medium Header
Send me mail at 
support@yourcompany.com
This is a new paragraph! 
This is a new paragraph!
This is a new sentence without a paragraph break,in bold italics.
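
One possible way to satisfy both constraints is to walk the body in document order and buffer text until a non-formatting element starts or ends. The sketch below is only an illustration under stated assumptions, not a definitive solution: the class name TextBlockExtractor, the helper names, and the exact set of formatting tags (b, i, strong, u, em) are assumptions, and it takes the HTML as a string rather than reading a file. It has not been checked line by line against the sample input above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class TextBlockExtractor {
    // Assumption: these tags format text without starting a new text block.
    private static final Set<String> FORMATTING_TAGS =
            new HashSet<>(Arrays.asList("b", "i", "strong", "u", "em"));

    // Returns one entry per text block, in document order.
    static List<String> extractLines(String html) {
        final List<String> lines = new ArrayList<>();
        final StringBuilder current = new StringBuilder();

        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // Buffer the text; it is emitted at the next block boundary.
                    current.append(((TextNode) node).text());
                } else if (isBlockBoundary(node)) {
                    flush(lines, current);
                }
            }

            @Override
            public void tail(Node node, int depth) {
                if (isBlockBoundary(node)) {
                    flush(lines, current);
                }
            }
        });
        flush(lines, current);
        return lines;
    }

    // Any element outside the formatting set ends the current text block.
    private static boolean isBlockBoundary(Node node) {
        return node instanceof Element
                && !FORMATTING_TAGS.contains(((Element) node).tagName());
    }

    // Emits the buffered text as one line (if non-blank) and resets the buffer.
    private static void flush(List<String> lines, StringBuilder current) {
        String line = current.toString().trim();
        if (!line.isEmpty()) {
            lines.add(line);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        // Prints a single line: This is a paragraph.
        extractLines("<p>This is a <b>paragraph</b>.</p>").forEach(System.out::println);
    }
}
```

Because the traversal visits nodes depth-first in document order, the order of the output follows the order of the text in the source, and treating a and br as boundaries keeps link text and text after a line break on separate lines.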
