Comparing cursor vs. conduit XML parsing

This is the XML I'm parsing. It is my first time using XML, SOAP, or conduit.

<?xml version  = "1.0" 
      encoding = "utf-8"?>
<soap:Envelope xmlns:soap = "http://schemas.xmlsoap.org/soap/envelope/" 
               xmlns:xsi  = "http://www.w3.org/2001/XMLSchema-instance" 
               xmlns:xsd  = "http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetListItemsResponse xmlns = "http://schemas.microsoft.com/sharepoint/soap/">
<GetListItemsResult>
<listitems xmlns:s  = 'uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882'
           xmlns:dt = 'uuid:C2F41010-65B3-11d1-A29F-00AA00C14882'
           xmlns:rs = 'urn:schemas-microsoft-com:rowset'
           xmlns:z  = '#RowsetSchema'>
<rs:data ItemCount = "290">
<z:row ows_Date              = '2020-10-20 00:00:00' 
       ows_Document          = 'https://www.oregon.gov/oha/PH/disEASESCONDITIONS/disEASESAZ/Emerging%20Respitory%20Infections/Oregon-COVID-19-Update-10-20-2020-FINAL.pdf,Oregon COVID-19 Daily Update 10.20.2020' 
       ows_Category          = 'Daily Update' 
       ows_MetaInfo          = '294;#' 
       ows__ModerationStatus = '0' 
       ows__Level            = '1' 
       ows_ID                = '294' 
       ows_UniqueId          = '294;#{C51D9DDB-9A9C-4C56-B030-236D6A0980D2}' 
       ows_owshiddenversion  = '1' 
       ows_FSObjType         = '294;#0' 
       ows_Created           = '2020-10-20 12:16:49' 
       ows_PermMask          = '0x1000030041' 
       ows_Modified          = '2020-10-20 12:16:49' 
       ows_FileRef           = '294;#oha/ERD/Lists/COVID19 Updates/294_.000' />
</rs:data>
</listitems>
</GetListItemsResult>
</GetListItemsResponse>
</soap:Body>
</soap:Envelope>

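I only want to keep the records where ows_Category is Weekly Report and ows_Document does not contain Spanish. My cursor version was easy to get working. The conduit version was much more complicated, but I eventually figured it out with the help of the answers to this question.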
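To pin down the filter independently of either XML API, here is a minimal sketch of the criterion on plain strings. The `Row` type and the names `keepRow` and `splitDocument` are hypothetical and used only for illustration; the sketch assumes ows_Document has the form "url,title", possibly with a space after the comma.

```haskell
import Data.List (isInfixOf)

-- Hypothetical record holding only the two attributes the filter inspects.
data Row = Row
  { rowCategory :: String  -- value of ows_Category
  , rowDocument :: String  -- value of ows_Document, assumed to be "url,title"
  } deriving (Show, Eq)

-- Keep a row only if it is a weekly report whose document field
-- does not mention "Spanish".
keepRow :: Row -> Bool
keepRow r =
  rowCategory r == "Weekly Report"
    && not ("Spanish" `isInfixOf` rowDocument r)

-- Split "url,title" at the first comma, dropping the comma and any
-- spaces that follow it. Without a comma the title is empty.
splitDocument :: String -> (String, String)
splitDocument s = (u, dropWhile (== ' ') (drop 1 rest))
  where
    (u, rest) = break (== ',') s
```

For example, `filter keepRow rows` keeps only the wanted rows, and `splitDocument "a.pdf, Weekly Report"` evaluates to `("a.pdf","Weekly Report")`.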

Even though both approaches work now, I still have a few questions.

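  • Does the conduit approach have an equivalent of lax for ignoring namespaces?
  • What makes concat necessary in the cursor function? Looking at the types, we start with a root node, generate and maintain a list of relevant nodes to consider, filter over them, map over them, and so on. What creates the extra layer of nesting, and why?
  • The conduit version needed the helpers f (to call force everywhere) and ns (to namespace everything). They seem so necessary that I would expect the library to provide them as utilities, since everyone would need them all the time. Or am I doing something silly?
  • My worst sticking point was that it turned out I needed the glirspNS namespace on GetListItemsResult and listitems, even though in the XML it looks like it should apply only to GetListItemsResponse. Only a lucky guess got me past that. Are namespaces supposed to inherit downward like this until they are overridden?
  • Regarding requireAttrRaw:
    • If we are responsible for verifying the Name, don't we need to know the namespace?
    • Why does requireAttrRaw send us [Content] instead of two Maybe Content values, one each for ContentText and ContentEntity?
    • What are we supposed to do with ContentEntity, which is documented as "For pass-through parsing"?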
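On the concat question, the effect can be reproduced with ordinary lists. As I read the types in Text.XML.Cursor, `&|` has the shape `(Cursor -> [a]) -> (a -> b) -> Cursor -> [b]`, and `attribute` returns a list, so mapping a list-returning function yields a list of lists. The sketch below is a toy model under that assumption: `mapAxis`, `children`, `evens` and `attrs` are hypothetical stand-ins and are not part of xml-conduit.

```haskell
import Control.Monad ((>=>))

-- Stand-in for a cursor axis: a function from a node to a list of nodes.
type Axis n = n -> [n]

-- Same shape as (&|) is assumed to have: map a plain function over
-- the results of an axis.
mapAxis :: (n -> [a]) -> (a -> b) -> n -> [b]
mapAxis ax g = map g . ax

-- Toy axes over Int "nodes".
children :: Axis Int
children n = [n * 10 + 1, n * 10 + 2]

evens :: Axis Int
evens n = [n | even n]

-- A list-returning extractor, playing the role of
-- `doc <$> attribute a x <*> attribute b x` (list applicative).
attrs :: Int -> [String]
attrs n = [show n]

-- Mapping a list-returning function nests the result ...
nested :: Int -> [[String]]
nested = mapAxis (children >=> evens) attrs

-- ... so it has to be flattened afterwards,
flat :: Int -> [String]
flat = concat . nested

-- or the extractor can be composed with (>=>), which flattens as it goes.
flatK :: Int -> [String]
flatK = children >=> evens >=> attrs
```

Here `nested 1` is `[["12"]]`, while `flat 1` and `flatK 1` are both `["12"]`. In this model the extra layer comes from the extractor itself returning a list, not from the axes.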

{-# LANGUAGE OverloadedStrings #-}

import Conduit
import Control.Applicative
import Control.Arrow
import Control.Exception
import Control.Monad
import qualified Data.ByteString.Lazy.Char8 as L8
import Data.Foldable
import qualified Data.Map.Strict as M
import Data.String
import qualified Data.Text as T
import Data.Time.Calendar
import Data.Time.Format
import Data.XML.Types
import qualified Text.XML as X
import Text.XML.Cursor hiding (force)
import Text.XML.Stream.Parse

data Doc = Doc
  { url  :: String
  , name :: String
  , date :: Day
  } deriving (Show)

main :: IO ()
main = do
  r <- L8.readFile "oha.xml"

  let go :: Cursor -> [Doc]
      go c = concat $ -- what is making the layer of nesting that makes this necessary? why?
        c $// laxElement "row"
          >=> attributeIs "ows_Category" "Weekly Report"
          >=> checkElement (maybe False (not . T.isInfixOf "Spanish") . M.lookup "ows_Document" . X.elementAttributes)
          &| \x -> doc <$> attribute "ows_Document" x <*> attribute "ows_Date" x

      doc x = Doc u v . parseTimeOrError True defaultTimeLocale "%Y-%-m-%-d" . takeWhile (/= ' ') . T.unpack
        where (u, v) = second (drop 2) . break (== ',') $ T.unpack x

      parseAttributes, parseAttributes' :: AttrParser (T.Text, T.Text)
      parseAttributes' = do
        doc  <- requireAttr "ows_Document"
        cat  <- requireAttr "ows_Category"
        date <- requireAttr "ows_Date"
        ignoreAttrs
        guard $ not (T.isInfixOf "Spanish" doc) && cat == "Weekly Report"
        return (doc, date)

      -- since the attribute values don't interact, we can parse in applicative rather than monad
      parseAttributes = (,)
        <$> requireAttrRaw' "ows_Document" (not . T.isInfixOf "Spanish")
        <*> requireAttr "ows_Date"
        <*  requireAttrRaw' "ows_Category" ("Weekly Report" ==)
        <*  ignoreAttrs
        where requireAttrRaw' n f = requireAttrRaw ("required attr value failed condition: " <> n) $ \(n', as) ->
                asum $ (\(ContentText a) -> guard (n' == fromString n && f a) *> pure a) <$> as
              -- shouldn't we have had to pass in namespace?
              -- why [Content] instead of two Maybe Content, one for ContentText and other for ContentEntity?
              -- what to do with ContentEntity Text "For pass-through parsing"?

      ns n = fromString . (("{" <> n <> "}") <>)
      f g n s = force (s <> " required") . g (ns n s)

      parseDocs :: (MonadThrow m, MonadIO m) => ConduitT Event o m [Doc]
      parseDocs = f tagNoAttr soapNS "Envelope"
                . f tagNoAttr soapNS "Body"
                . f tagNoAttr glirspNS "GetListItemsResponse"
                . f tagNoAttr glirspNS "GetListItemsResult" -- didn't expect to need ns glirspNS here
                . f tagNoAttr glirspNS "listitems"          -- didn't expect to need ns glirspNS here
                . f tagIgnoreAttrs rsNS "data"
                . many'
                . tag' (ns zNS "row") parseAttributes
                $ return . uncurry doc

      soapNS   = "http://schemas.xmlsoap.org/soap/envelope/"
      glirspNS = "http://schemas.microsoft.com/sharepoint/soap/"
      rsNS     = "urn:schemas-microsoft-com:rowset"
      zNS      = "#RowsetSchema"

      disp = (print . length) <=< traverse print

  (throwIO ||| disp . go . fromDocument) $ X.parseLBS X.def r
  (disp =<<) . runConduit $ parseLBS def r .| parseDocs

Finally, I will normally be getting the XML from Network.HTTP.Simple.httpLBS rather than reading it from a file. Am I right that there is a way to hook the conduit parser up so that it runs directly on the stream?