R xml2 strips namespace prefixes from tag names

Suppose I want to parse a Microsoft 10-Q SEC XBRL filing:

library('xml2')
url <- "https://www.sec.gov/Archives/edgar/data/789019/000156459021002316/msft-10q_20201231_htm.xml"
xml <- read_xml(url)
xml_find_all(xml,"./us-gaap:EarningsPerShareBasic")

# {xml_nodeset (10)}
#  [1] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20201001_20201231" decimals="2" id="F_000099" unitRef="U_iso4217USD_x ...
#  [2] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20191001_20191231" decimals="2" id="F_000100" unitRef="U_iso4217USD_x ...
#  [3] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20200701_20201231" decimals="2" id="F_000101" unitRef="U_iso4217USD_x ...
#  [4] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20190701_20191231" decimals="2" id="F_000102" unitRef="U_iso4217USD_x ...
#  [5] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_us-gaapChangeInAccountingEstimateByTypeAxis_us-gaapServiceLifeMember_ ...
#  [6] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_us-gaapChangeInAccountingEstimateByTypeAxis_us-gaapServiceLifeMember_ ...
#  [7] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20201001_20201231" decimals="2" id="F_000517" unitRef="U_iso4217USD_x ...
#  [8] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20191001_20191231" decimals="2" id="F_000518" unitRef="U_iso4217USD_x ...
#  [9] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20200701_20201231" decimals="2" id="F_000519" unitRef="U_iso4217USD_x ...
# [10] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20190701_20191231" decimals="2" id="F_000520" unitRef="U_iso4217USD_x ...

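As the output above shows, most US XBRL tags carry a namespace prefix; here us-gaap: identifies the accounting standard. However, some xml2 functions, for example: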

 xml_name(xml_find_all(xml,"./us-gaap:EarningsPerShareBasic"))
 # [1] "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic"
 # [6] "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic"

 xml_find_first(xml,"./us-gaap:EarningsPerShareBasic")
 # {xml_node}
 # <EarningsPerShareBasic contextRef="C_0000789019_20201001_20201231" decimals="2" id="F_000099" unitRef="U_iso4217USD_xbrlishares">

strip the prefix.
Now suppose I want to collect all the tags and search them by name:

nodes <- xml_find_all(xml,"./*")
tags <- xml_name(nodes)
grep("earnings",tags,ignore.case = TRUE,value=TRUE)

Because xml_name(nodes) strips the prefix, the names that grep returns are not the actual (prefixed) tag names.

Is there any way to get a node's exact tag name?
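
One possible approach, sketched here on a small inline document rather than the SEC filing: xml_name() accepts an ns argument, and passing it the namespace map returned by xml_ns() should make it return prefix-qualified names. The sample XML, the namespace URI in it, and the variable names doc, plain, qualified and hits are assumptions made for illustration only.

```r
library(xml2)

# Hypothetical stand-in for the XBRL file; the namespace URI is illustrative.
doc <- read_xml(
  '<xbrl xmlns:us-gaap="http://example.com/us-gaap">
     <us-gaap:EarningsPerShareBasic>1.00</us-gaap:EarningsPerShareBasic>
     <us-gaap:Revenues>2.00</us-gaap:Revenues>
   </xbrl>'
)

# All direct children of the root element.
nodes <- xml_find_all(doc, "./*")

# Default call: local names only, the prefix is dropped.
plain <- xml_name(nodes)

# With a namespace map, each name is qualified with its prefix.
qualified <- xml_name(nodes, ns = xml_ns(doc))

# Searching the qualified names keeps the prefix in the matches.
hits <- grep("earnings", qualified, ignore.case = TRUE, value = TRUE)
print(hits)
```

On the real filing the same call would be xml_name(nodes, ns = xml_ns(xml)). Note that xml_ns() assigns generated prefixes such as d1 to default (unprefixed) namespaces, so elements in those namespaces would likely show that generated prefix rather than a bare name.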
