Comparing cursor vs. conduit XML parsing

This is the XML I'm parsing. It's my first time using XML, SOAP, or conduit.

<?xml version  = "1.0" 
      encoding = "utf-8"?>
<soap:Envelope xmlns:soap = "http://schemas.xmlsoap.org/soap/envelope/" 
               xmlns:xsi  = "http://www.w3.org/2001/XMLSchema-instance" 
               xmlns:xsd  = "http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetListItemsResponse xmlns = "http://schemas.microsoft.com/sharepoint/soap/">
<GetListItemsResult>
<listitems xmlns:s  = 'uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882'
           xmlns:dt = 'uuid:C2F41010-65B3-11d1-A29F-00AA00C14882'
           xmlns:rs = 'urn:schemas-microsoft-com:rowset'
           xmlns:z  = '#RowsetSchema'>
<rs:data ItemCount = "290">
<z:row ows_Date              = '2020-10-20 00:00:00' 
       ows_Document          = 'https://www.oregon.gov/oha/PH/disEASESCONDITIONS/disEASESAZ/Emerging%20Respitory%20Infections/Oregon-COVID-19-Update-10-20-2020-FINAL.pdf,Oregon COVID-19 Daily Update 10.20.2020' 
       ows_Category          = 'Daily Update' 
       ows_MetaInfo          = '294;#' 
       ows__ModerationStatus = '0' 
       ows__Level            = '1' 
       ows_ID                = '294' 
       ows_UniqueId          = '294;#{C51D9DDB-9A9C-4C56-B030-236D6A0980D2}' 
       ows_owshiddenversion  = '1' 
       ows_FSObjType         = '294;#0' 
       ows_Created           = '2020-10-20 12:16:49' 
       ows_PermMask          = '0x1000030041' 
       ows_Modified          = '2020-10-20 12:16:49' 
       ows_FileRef           = '294;#oha/ERD/Lists/COVID19 Updates/294_.000' />
</rs:data>
</listitems>
</GetListItemsResult>
</GetListItemsResponse>
</soap:Body>
</soap:Envelope>

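I only want to keep the records where ows_Category is Weekly Report and ows_Document does not contain Spanish. My cursor version was easy to get working. The conduit version was much more complicated, but I eventually figured it out with the help of the answer to this question.

Even though both approaches now work, I still have several questions.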

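Is there a conduit equivalent to lax for ignoring namespaces?
What makes the concat necessary in the cursor function? Looking at the types, we start at the root node, generate and maintain a list of the relevant nodes under consideration, filter them, map over them, and so on. What creates the extra layer of nesting, and why?
The conduit version needs the helpers f (calling force everywhere) and ns (namespacing everything). They seem so necessary that I would expect the library to provide them as utilities, since everyone would need them all the time. Or am I doing something silly?
My worst sticking point was that I turned out to need the glirspNS namespace on GetListItemsResult and listitems, even though in the XML it looks like it should only apply to GetListItemsResponse. Only a lucky guess got me past this. Are namespaces supposed to be inherited downward like this until they are overridden?
About requireAttrRaw:
- If we're responsible for verifying the Name, shouldn't we need to know the namespace?
- Why does requireAttrRaw give us [Content] instead of two Maybe Content values, one each for ContentText and ContentEntity?
- What are we supposed to do with ContentEntity, which is documented as "For pass-through parsing"?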
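To make the concat question concrete, here is a minimal model of the shapes involved, using plain lists only. It assumes that attribute returns a list of values and that &| maps a function over the results of an axis, which is how those combinators are typed in Text.XML.Cursor. The names Node, attr, weekly, and nested are hypothetical stand-ins, not part of xml-conduit.

```haskell
-- Hypothetical stand-in for a cursor node: just its attributes.
type Node = [(String, String)]

-- Mirrors the shape of `attribute`: zero or one values, returned as a list.
attr :: String -> Node -> [String]
attr k = maybe [] pure . lookup k

-- Mirrors the shape of `&|`: run an axis, then map a function over its results.
(&|) :: (n -> [a]) -> (a -> b) -> n -> [b]
(&|) axis g = map g . axis
infixr 1 &|

-- An axis that keeps a node only if its category matches.
weekly :: Node -> [Node]
weekly n = [n | lookup "cat" n == Just "Weekly Report"]

-- The mapped function itself returns a list (it combines two `attr` results
-- in the list Applicative), so the overall result is a list of lists.
nested :: Node -> [[(String, String)]]
nested = weekly &| \x -> (,) <$> attr "doc" x <*> attr "date" x

main :: IO ()
main = do
  let n = [("cat", "Weekly Report"), ("doc", "a.pdf"), ("date", "2020-10-20")]
  print (nested n)           -- one inner list per matching node
  print (concat (nested n))  -- flattened
```

In this model the extra layer comes from mapping a list-returning function with &|. If the same holds for the real code, composing that last step with >=> instead of &| should flatten the result without concat, though I have not verified that against the original program.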

Finally, I normally get the XML from Network.HTTP.Simple.httpLBS rather than reading it from a file. Am I right that there is a way to hook up the conduit parser so that it runs directly on the stream?

{-# LANGUAGE OverloadedStrings #-}

import Conduit
import Control.Applicative
import Control.Arrow
import Control.Exception
import Control.Monad
import qualified Data.ByteString.Lazy.Char8 as L8
import Data.Foldable
import qualified Data.Map.Strict as M
import Data.String
import qualified Data.Text as T
import Data.Time.Calendar
import Data.Time.Format
import Data.XML.Types
import qualified Text.XML as X
import Text.XML.Cursor hiding (force)
import Text.XML.Stream.Parse

data Doc = Doc
  { url  :: String
  , name :: String
  , date :: Day
  } deriving (Show)

main :: IO ()
main = do
  r <- L8.readFile "oha.xml"

  let go :: Cursor -> [Doc]
      go c = concat $ -- what is making the layer of nesting that makes this necessary? why?
        c $// laxElement "row"
          >=> attributeIs "ows_Category" "Weekly Report"
          >=> checkElement
                ( maybe False (not . T.isInfixOf "Spanish")
                . M.lookup "ows_Document"
                . X.elementAttributes )
          &| \x -> doc <$> attribute "ows_Document" x <*> attribute "ows_Date" x

      doc x = Doc u v
            . parseTimeOrError True defaultTimeLocale "%Y-%-m-%-d"
            . takeWhile (/= ' ')
            . T.unpack
        where (u, v) = second (drop 2) . break (== ',') $ T.unpack x

      parseAttributes, parseAttributes' :: AttrParser (T.Text, T.Text)
      parseAttributes' = do
        doc  <- requireAttr "ows_Document"
        cat  <- requireAttr "ows_Category"
        date <- requireAttr "ows_Date"
        ignoreAttrs
        guard $ not (T.isInfixOf "Spanish" doc) && cat == "Weekly Report"
        return (doc, date)

      -- since the attribute values don't interact, we can parse in Applicative rather than Monad
      parseAttributes = (,)
        <$> requireAttrRaw' "ows_Document" (not . T.isInfixOf "Spanish")
        <*> requireAttr "ows_Date"
        <*  requireAttrRaw' "ows_Category" ("Weekly Report" ==)
        <*  ignoreAttrs
        where
          requireAttrRaw' n f =
            requireAttrRaw ("required attr value failed condition: " <> n) $ \(n', as) ->
              asum $ (\(ContentText a) -> guard (n' == fromString n && f a) *> pure a) <$> as
              -- shouldn't we have had to pass in namespace?
              -- why [Content] instead of two Maybe Content, one for ContentText and other for ContentEntity?
              -- what to do with ContentEntity Text "For pass-through parsing"?

      ns n = fromString . (("{" <> n <> "}") <>)
      f g n s = force (s <> " required") . g (ns n s)

      parseDocs :: (MonadThrow m, MonadIO m) => ConduitT Event o m [Doc]
      parseDocs = f tagNoAttr soapNS "Envelope"
                . f tagNoAttr soapNS "Body"
                . f tagNoAttr glirspNS "GetListItemsResponse"
                . f tagNoAttr glirspNS "GetListItemsResult" -- didn't expect to need ns glirspNS here
                . f tagNoAttr glirspNS "listitems"          -- didn't expect to need ns glirspNS here
                . f tagIgnoreAttrs rsNS "data"
                . many'
                . tag' (ns zNS "row") parseAttributes
                $ return . uncurry doc

      soapNS   = "http://schemas.xmlsoap.org/soap/envelope/"
      glirspNS = "http://schemas.microsoft.com/sharepoint/soap/"
      rsNS     = "urn:schemas-microsoft-com:rowset"
      zNS      = "#RowsetSchema"

      disp = (print . length) <=< traverse print

  (throwIO ||| disp . go . fromDocument) $ X.parseLBS X.def r
  (disp =<<) . runConduit $ parseLBS def r .| parseDocs
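
On the question of running the conduit parser directly on the HTTP response, one possible approach is httpSource from Network.HTTP.Simple, assuming the http-conduit package is available. It exposes the response body as a stream of ByteString chunks, which parseBytes from Text.XML.Stream.Parse can turn into XML events. This is only a sketch: the URL is a placeholder, the real SharePoint call would presumably need a properly built SOAP request, and lengthC stands in for a real sink such as parseDocs.

```haskell
import Conduit (lengthC, runConduitRes, (.|))
import Network.HTTP.Simple (getResponseBody, httpSource, parseRequest)
import Text.XML.Stream.Parse (def, parseBytes)

main :: IO ()
main = do
  -- Placeholder URL (an assumption); substitute the real request here.
  req <- parseRequest "https://example.com/list.xml"
  -- Stream the body chunk by chunk into the XML event parser,
  -- without first collecting the whole response as a lazy ByteString.
  n <- runConduitRes $
         httpSource req getResponseBody
           .| parseBytes def
           .| lengthC
  print (n :: Int) -- number of XML events seen
```

Replacing lengthC with an event consumer of type ConduitT Event o m a, such as parseDocs above, should let the same pipeline produce the parsed records.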