How to deal with R xml2 stripping tag prefixes
Suppose I want to parse a Microsoft 10-Q SEC XBRL file:
library('xml2')
url <- "https://www.sec.gov/Archives/edgar/data/789019/000156459021002316/msft-10q_20201231_htm.xml"
xml <- read_xml(url)
xml_find_all(xml,"./us-gaap:EarningsPerShareBasic")
# {xml_nodeset (10)}
# [1] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20201001_20201231" decimals="2" id="F_000099" unitRef="U_iso4217USD_x ...
# [2] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20191001_20191231" decimals="2" id="F_000100" unitRef="U_iso4217USD_x ...
# [3] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20200701_20201231" decimals="2" id="F_000101" unitRef="U_iso4217USD_x ...
# [4] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20190701_20191231" decimals="2" id="F_000102" unitRef="U_iso4217USD_x ...
# [5] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_us-gaapChangeInAccountingEstimateByTypeAxis_us-gaapServiceLifeMember_ ...
# [6] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_us-gaapChangeInAccountingEstimateByTypeAxis_us-gaapServiceLifeMember_ ...
# [7] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20201001_20201231" decimals="2" id="F_000517" unitRef="U_iso4217USD_x ...
# [8] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20191001_20191231" decimals="2" id="F_000518" unitRef="U_iso4217USD_x ...
# [9] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20200701_20201231" decimals="2" id="F_000519" unitRef="U_iso4217USD_x ...
# [10] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20190701_20191231" decimals="2" id="F_000520" unitRef="U_iso4217USD_x ...
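As shown above, most US XBRL tags carry a namespace prefix; here us-gaap: denotes the accounting standard. However, some xml2 functions drop that prefix from their output, for example: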
xml_name(xml_find_all(xml,"./us-gaap:EarningsPerShareBasic"))
# [1] "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic"
# [6] "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic"
and
xml_find_first(xml,"./us-gaap:EarningsPerShareBasic")
# {xml_node}
# <EarningsPerShareBasic contextRef="C_0000789019_20201001_20201231" decimals="2" id="F_000099" unitRef="U_iso4217USD_xbrlishares">
The stripped prefix is a problem when searching the tag names for some text, for example:
nodes <- xml_find_all(xml,"./*")
tags <- xml_name(nodes)
grep("earnings",tags,ignore.case = TRUE,value=TRUE)
Because xml_name(nodes) strips the prefix, grep does not give me the actual tag names.
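One possible way around this, as a minimal sketch: xml_name() takes an optional ns argument, and when it is given the prefix-to-URI mapping returned by xml_ns(), the names it returns should be qualified with their prefix. The small inline document below is a hypothetical stand-in for the SEC filing (its namespace URIs are only illustrative), so the example runs without network access.

```r
library(xml2)

# Hypothetical stand-in for the XBRL filing: a root element with a few
# us-gaap children. The namespace URIs are illustrative assumptions.
doc <- read_xml(paste0(
  '<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance" ',
  'xmlns:us-gaap="http://fasb.org/us-gaap/2020-01-31">',
  '<us-gaap:EarningsPerShareBasic>2.05</us-gaap:EarningsPerShareBasic>',
  '<us-gaap:EarningsPerShareDiluted>2.03</us-gaap:EarningsPerShareDiluted>',
  '<us-gaap:Revenues>100</us-gaap:Revenues>',
  '</xbrli:xbrl>'
))

# Collect the prefix -> URI mapping declared in the document
ns <- xml_ns(doc)

# Select all children of the root, as in the question
nodes <- xml_find_all(doc, "./*")

# Passing ns asks xml_name() for prefixed (qualified) names
tags <- xml_name(nodes, ns = ns)

# Search the qualified names, case-insensitively
hits <- grep("earnings", tags, ignore.case = TRUE, value = TRUE)
print(hits)
```

With the real filing, the same idea would be xml_name(nodes, ns = xml_ns(xml)); a default (unprefixed) namespace there would show up under a generated prefix such as d1.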