Comparing cursor and conduit XML parsing
This is the XML I am parsing. It is my first time using XML, SOAP, or conduit.
<?xml version = "1.0" encoding = "utf-8"?>
<soap:Envelope xmlns:soap = "http://schemas.xmlsoap.org/soap/envelope/"
               xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
               xmlns:xsd = "http://www.w3.org/2001/XMLSchema">
  <soap:Body>
    <GetListItemsResponse xmlns = "http://schemas.microsoft.com/sharepoint/soap/">
      <GetListItemsResult>
        <listitems xmlns:s = 'uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882'
                   xmlns:dt = 'uuid:C2F41010-65B3-11d1-A29F-00AA00C14882'
                   xmlns:rs = 'urn:schemas-microsoft-com:rowset'
                   xmlns:z = '#RowsetSchema'>
          <rs:data ItemCount = "290">
            <z:row ows_Date = '2020-10-20 00:00:00'
                   ows_Document = 'https://www.oregon.gov/oha/PH/disEASESCONDITIONS/disEASESAZ/Emerging%20Respitory%20Infections/Oregon-COVID-19-Update-10-20-2020-FINAL.pdf,Oregon COVID-19 Daily Update 10.20.2020'
                   ows_Category = 'Daily Update'
                   ows_MetaInfo = '294;#'
                   ows__ModerationStatus = '0'
                   ows__Level = '1'
                   ows_ID = '294'
                   ows_UniqueId = '294;#{C51D9DDB-9A9C-4C56-B030-236D6A0980D2}'
                   ows_owshiddenversion = '1'
                   ows_FSObjType = '294;#0'
                   ows_Created = '2020-10-20 12:16:49'
                   ows_PermMask = '0x1000030041'
                   ows_Modified = '2020-10-20 12:16:49'
                   ows_FileRef = '294;#oha/ERD/Lists/COVID19 Updates/294_.000' />
          </rs:data>
        </listitems>
      </GetListItemsResult>
    </GetListItemsResponse>
  </soap:Body>
</soap:Envelope>
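I only want to keep the records whose ows_Category is Weekly Report and whose ows_Document does not contain Spanish. The cursor version was easy to get working. The conduit version was much more complicated, but I eventually worked it out with the help of the answer to this question.
Both approaches work now, but I still have several questions.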
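- Does conduit have an equivalent of the lax functions for ignoring namespaces?
- What makes concat necessary in the cursor function? Looking at the types, we start from a root node, then generate and maintain a list of relevant nodes under consideration, filter it, map over it, and so on. What creates the extra layer of nesting, and why?
- The conduit version needed the helpers f (to call force everywhere) and ns (to namespace everything). They seem so necessary that I would expect the library to provide them as utilities, since everyone would need them all the time. Or am I doing something silly?
- My worst sticking point was that I needed the glirspNS namespace on GetListItemsResult and listitems, even though in the XML it looks as if it should apply only to GetListItemsResponse. Only a lucky guess got me past this. Are namespaces supposed to be inherited downward like this until they are overridden?
- Regarding requireAttrRaw:
  - If we are responsible for verifying the Name, don't we need to know the namespace?
  - Why does requireAttrRaw give us [Content] rather than two Maybe Content values, one for ContentText and one for ContentEntity?
  - What are we supposed to do with ContentEntity "for pass-through parsing"?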
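One way to probe the concat question is to compare the types on a tiny input. The sketch below is not the code from this post: it assumes a made-up three-row document, and the names sample, pair, nested and flat are hypothetical. It relies on attribute returning a list and on (&|) mapping a function over the results of an axis, whereas (>=>) binds and flattens.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Control.Monad   ((>=>))
import qualified Data.Text       as T
import qualified Text.XML        as X
import           Text.XML.Cursor

-- Hypothetical stand-in document, not the SharePoint response.
sample :: Cursor
sample = fromDocument $ X.parseLBS_ X.def
  "<data><row a='1' b='x'/><row a='2'/><row a='3' b='z'/></data>"

-- attribute returns a list (empty when the attribute is missing),
-- so this function already produces a list for each row.
pair :: Cursor -> [(T.Text, T.Text)]
pair x = (,) <$> attribute "a" x <*> attribute "b" x

-- (&|) maps pair over the rows: one inner list per row.
nested :: [[(T.Text, T.Text)]]
nested = sample $// laxElement "row" &| pair

-- (>=>) binds instead of mapping, so the per-row lists are flattened.
flat :: [(T.Text, T.Text)]
flat = sample $// laxElement "row" >=> pair
```

Loaded in GHCi, concat nested and flat should agree, which suggests the extra layer comes from mapping a list-returning function rather than from the axes themselves.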
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Control.Applicative
import Control.Arrow
import Control.Exception
import Control.Monad
import qualified Data.ByteString.Lazy.Char8 as L8
import Data.Foldable
import qualified Data.Map.Strict as M
import Data.String
import qualified Data.Text as T
import Data.Time.Calendar
import Data.Time.Format
import Data.XML.Types
import qualified Text.XML as X
import Text.XML.Cursor hiding (force)
import Text.XML.Stream.Parse

data Doc = Doc
  { url  :: String
  , name :: String
  , date :: Day
  } deriving (Show)

main :: IO ()
main = do
  r <- L8.readFile "oha.xml"
  let go :: Cursor -> [Doc]
      go c = concat $ -- what is making the layer of nesting that makes this necessary? why?
        c $// laxElement "row"
          >=> attributeIs "ows_Category" "Weekly Report"
          >=> checkElement (maybe False (not . T.isInfixOf "Spanish") . M.lookup "ows_Document" . X.elementAttributes)
          &| \x -> doc <$> attribute "ows_Document" x <*> attribute "ows_Date" x

      -- build a Doc from the raw ows_Document value ("url, name") and the ows_Date value
      doc x = Doc u v . parseTimeOrError True defaultTimeLocale "%Y-%-m-%-d" . takeWhile (/= ' ') . T.unpack
        where (u, v) = second (drop 2) . break (== ',') $ T.unpack x

      parseAttributes, parseAttributes' :: AttrParser (T.Text, T.Text)
      parseAttributes' = do
        doc <- requireAttr "ows_Document"
        cat <- requireAttr "ows_Category"
        date <- requireAttr "ows_Date"
        ignoreAttrs
        guard $ not (T.isInfixOf "Spanish" doc) && cat == "Weekly Report"
        return (doc, date)

      -- since the attribute values don't interact, we can parse in Applicative rather than Monad
      parseAttributes = (,) <$> requireAttrRaw' "ows_Document" (not . T.isInfixOf "Spanish")
                            <*> requireAttr "ows_Date"
                            <*  requireAttrRaw' "ows_Category" ("Weekly Report" ==)
                            <*  ignoreAttrs
        where requireAttrRaw' n f = requireAttrRaw ("required attr value failed condition: " <> n) $ \(n', as) ->
                asum $ (\(ContentText a) -> guard (n' == fromString n && f a) *> pure a) <$> as
              -- shouldn't we have had to pass in namespace?
              -- why [Content] instead of two Maybe Content, one for ContentText and other for ContentEntity?
              -- what to do with ContentEntity Text "For pass-through parsing"?

      -- ns builds a namespaced name in {namespace}local form; f wraps a tag parser in force
      ns n = fromString . (("{" <> n <> "}") <>)
      f g n s = force (s <> " required") . g (ns n s)

      parseDocs :: (MonadThrow m, MonadIO m) => ConduitT Event o m [Doc]
      parseDocs = f tagNoAttr soapNS "Envelope"
                . f tagNoAttr soapNS "Body"
                . f tagNoAttr glirspNS "GetListItemsResponse"
                . f tagNoAttr glirspNS "GetListItemsResult" -- didn't expect to need ns glirspNS here
                . f tagNoAttr glirspNS "listitems" -- didn't expect to need ns glirspNS here
                . f tagIgnoreAttrs rsNS "data"
                . many' . tag' (ns zNS "row")
                    parseAttributes $ return . uncurry doc

      soapNS = "http://schemas.xmlsoap.org/soap/envelope/"
      glirspNS = "http://schemas.microsoft.com/sharepoint/soap/"
      rsNS = "urn:schemas-microsoft-com:rowset"
      zNS = "#RowsetSchema"

      disp = (print . length) <=< traverse print

  -- cursor version
  (throwIO ||| disp . go . fromDocument) $ X.parseLBS X.def r
  -- conduit version
  (disp =<<) . runConduit $ parseLBS def r .| parseDocs
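Regarding the two lines marked "didn't expect to need ns glirspNS here", a small sketch can show how names come out of the parser. It is based on the XML Namespaces rule that a default xmlns declaration applies to the declaring element and to its unprefixed descendants until another declaration overrides it. The document, the urn:example namespaces, and the names sampleDoc, elementNames and resolvedNames are all hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Text.XML as X

-- Hypothetical document: the default namespace is declared once, on <outer>.
sampleDoc :: X.Document
sampleDoc = X.parseLBS_ X.def
  "<outer xmlns='urn:example:default'>\
  \<inner><p:leaf xmlns:p='urn:example:other'/></inner>\
  \</outer>"

-- Collect the resolved name of every element, in document order.
elementNames :: X.Element -> [X.Name]
elementNames e =
  X.elementName e : concat [elementNames c | X.NodeElement c <- X.elementNodes e]

-- Names of outer, inner and leaf as the parser resolved them.
resolvedNames :: [X.Name]
resolvedNames = elementNames (X.documentRoot sampleDoc)
```

Inspecting map X.nameNamespace resolvedNames in GHCi shows which namespace each element ended up in; the unprefixed inner element is expected to carry the same namespace as outer, which would match what happened with GetListItemsResult and listitems.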
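Finally, I would normally fetch the XML with Network.HTTP.Simple.httpLBS rather than read it from a file. Am I right that there is a way to connect the conduit parser to the HTTP response so that it runs directly on the stream?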
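For the streaming question, here is a minimal sketch, assuming that httpSource from http-conduit and parseBytes from xml-conduit are the pieces to connect. The helper names countElements and countElementsFromUrl are hypothetical, and the request is a plain GET built with parseRequest; the real SharePoint SOAP call may need a different method, headers, or body.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Conduit
import qualified Data.ByteString       as B
import           Data.XML.Types        (Event (..))
import           Network.HTTP.Simple   (getResponseBody, httpSource, parseRequest)
import           Text.XML.Stream.Parse (def, parseBytes)

-- Count element-start events coming from any source of strict ByteString chunks.
-- The chunks are parsed as they arrive; the whole body is never collected first.
countElements :: ConduitT () B.ByteString (ResourceT IO) () -> IO Int
countElements src =
  runConduitRes $ src .| parseBytes def .| filterC isBegin .| lengthC
  where
    isBegin EventBeginElement{} = True
    isBegin _                   = False

-- Hypothetical entry point: stream the response body of a GET request
-- straight into the XML event parser.
countElementsFromUrl :: String -> IO Int
countElementsFromUrl u = do
  req <- parseRequest u
  countElements (httpSource req getResponseBody)
```

The counting sink is only a placeholder: a consumer of Event values such as parseDocs could be put after parseBytes def instead, provided its monad constraints are satisfied by ResourceT IO.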