How to parse an HTML string with JSoup in Clojure
I am using JSoup from Clojure to parse an HTML string. The source is as follows.
Dependencies
:dependencies [[org.clojure/clojure "1.10.1"]
               [org.jsoup/jsoup "1.13.1"]]
Source code
(require '[clojure.string :as str])
(import '(org.jsoup Jsoup)) ; needed so that Jsoup/parse resolves
(def HTML (str "<html><head><title>Website title</title></head>
<body><p>Sample paragraph number 1 </p>
<p>Sample paragraph number 2</p>
</body></html>"))
(defn fetch_html [html]
  (let [soup (Jsoup/parse html)
        titles (.title soup)
        paragraphs (.getElementsByTag soup "p")]
    {:title titles :paragraph paragraphs}))
(fetch_html HTML)
Expected result
{:title "Website title", :paragraph ["Sample paragraph number 1"
                                     "Sample paragraph number 2"]}
Unfortunately, the result is not what I expected
user=> (fetch_html HTML)
{:title "Website title", :paragraph []}
Solution
I have a Clojure wrapper for TagSoup that may be useful. Try running it in this template project. To use it in your own project, add the following line
[tupelo "21.01.05"]
to the :dependencies in your project.clj.
Code example:
(ns tst.demo.core
  (:use demo.core tupelo.core tupelo.test)
  (:require
    [tupelo.parse.tagsoup :as tagsoup]))
(dotest
  (let [html "<html>
<head><title>Website title</title></head>
<body><p>Sample paragraph number 1 </p>
<p>Sample paragraph number 2</p>
</body></html>"]
    (is= (tagsoup/parse html)
      {:tag     :html,
       :attrs   {},
       :content [{:tag     :head,
                  :content [{:tag :title, :attrs {}, :content ["Website title"]}]}
                 {:tag     :body,
                  :content [{:tag :p, :content ["Sample paragraph number 1 "]}
                            {:tag :p, :content ["Sample paragraph number 2"]}]}]})))
Details
If you look at the source code, it is easy to see why you would want to use a wrapper function!
(ns tupelo.parse.tagsoup
  (:use tupelo.core)
  (:require
    [schema.core :as s]
    [tupelo.parse.xml :as xml]
    [tupelo.string :as ts]
    [tupelo.schema :as tsk]))
(s/defn ^:private tagsoup-parse-fn
  [input-source :- org.xml.sax.InputSource
   content-handler]
  (doto (org.ccil.cowan.tagsoup.Parser.)
    (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/default-attributes" false)
    (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/cdata-elements" true)
    (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace" true)
    (.setContentHandler content-handler)
    (.setProperty "http://www.ccil.org/~cowan/tagsoup/properties/auto-detector"
      (proxy [org.ccil.cowan.tagsoup.AutoDetector] []
        (autoDetectingReader [^java.io.InputStream is]
          (java.io.InputStreamReader. is "UTF-8"))))
    (.setProperty "http://xml.org/sax/properties/lexical-handler" content-handler)
    (.parse input-source)))
; #todo make use string input: (ts/string->stream html-str)
(s/defn parse-raw :- tsk/KeyMap
  "Loads and parse an HTML resource and closes the input-stream."
  [html-str :- s/Str]
  (xml/parse-raw-streaming
    (org.xml.sax.InputSource.
      (ts/string->stream html-str))
    tagsoup-parse-fn))
; #todo make use string input: (ts/string->stream html-str)
(s/defn parse :- tsk/KeyMap
  "Loads and parse an HTML resource and closes the input-stream."
  [html-str :- s/Str]
  (xml/enlive-remove-whitespace
    (xml/enlive-normalize
      (parse-raw
        html-str))))
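(.getElementsByTag ...) returns a collection of Element objects, not strings, so you need to call the .text() method on each element to get its text value. I am using version 1.13.1 of Jsoup.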
(ns core
  (:import (org.jsoup Jsoup))
  (:require [clojure.string :as str]))
(def HTML (str "<html><head><title>Website title</title></head>
<body><p>Sample paragraph number 1 </p>
<p>Sample paragraph number 2</p>
</body></html>"))
(defn fetch_html [html]
  (let [soup (Jsoup/parse html)
        titles (.title soup)
        paragraphs (.getElementsByTag soup "p")]
    {:title titles :paragraph (mapv #(.text %) paragraphs)}))
(fetch_html HTML)
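As a possible variation on the same idea, Jsoup can collect the text of every matched element in one call. The sketch below is not part of the original answer: the namespace demo.paragraphs and the function name paragraph-texts are made up for illustration, and it assumes a Jsoup version that provides Elements.eachText (1.13.1 should).

```clojure
(ns demo.paragraphs
  (:import (org.jsoup Jsoup)))

(defn paragraph-texts
  "Returns a vector with the text of every <p> element in the html string.
  Hypothetical helper, shown only to illustrate the technique."
  [html]
  (let [doc (Jsoup/parse html)]
    ;; .select takes a CSS selector and returns an Elements collection;
    ;; .eachText returns a java.util.List of strings, one per element
    ;; that has text, so we copy it into a Clojure vector.
    (vec (.eachText (.select doc "p")))))

(paragraph-texts "<p>one </p><p>two</p>")
```

Like .text(), this normalizes and trims whitespace, so the trailing space in the first paragraph is not preserved. Elements without any text are skipped, which may or may not be what you want.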
Also consider using Reaver, a Clojure library that wraps JSoup, or any of the other wrappers that others have suggested.