Using the Python ElementTree XML parser with loops

I am developing an app, https://share.streamlit.io/carrlucy/hsl_oa/main, that pages through the Europe PMC database to find open data. The RESTful API it provides includes a "nextCursorMark" field so that queries can be paginated...

I'm stuck on how to handle this information and would appreciate some ideas.

I know that the value I am looking for is stored in root[2] of the parsed tree.
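Positional indexing such as root[2] depends on the order of the elements in the response. As a minimal sketch, assuming the response has the shape of the output shown at the end of this post, the same value can also be looked up by tag name. The abridged sample string and the helper name `readCursor` are illustrative and not part of the original code.

```python
import xml.etree.ElementTree as ET

# Abridged stand-in for an API response (assumed shape, based on the output in this post)
SAMPLE = """<responseWrapper>
<version>6.5</version>
<hitCount>277624</hitCount>
<nextCursorMark>AoIIQJRo5Sg0MzQwNzg5MQ==</nextCursorMark>
<request><cursorMark>*</cursorMark></request>
<resultList/>
</responseWrapper>"""

def readCursor(root):
    """Return (hitCount, nextCursorMark), looked up by tag name rather than position."""
    hitCount = int(root.findtext('hitCount', default='0'))
    nextCursor = root.findtext('nextCursorMark')  # None when the tag is absent
    return hitCount, nextCursor

root = ET.fromstring(SAMPLE)
hitCount, nextCursor = readCursor(root)
print(hitCount, nextCursor)
```

Looking the element up by name keeps working if the API adds or reorders top-level elements, whereas root[2] would then point at something else.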

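The code below works for getting the first set of results (root[4] is the XML element that feeds the other for loops). I think I need to wrap it in another loop, so that each time it sees a new nextCursorMark value it builds a new element tree, which is then parsed by the for loops below? I'm also concerned that my code isn't finished, so is there a simpler way to do this? If anything in there doesn't make sense, I'd appreciate thoughts on that as well.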

'''

import math
import pandas as pd
import streamlit as st
import numpy as np
import json
import xml.etree.ElementTree as ET
import urllib.request
import rdflib
import altair as alt
from urllib.request import urlopen
from xml.etree.ElementTree import parse

"""
# Europe PMC Open Data Dashboard
"""

builtQuery=('https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=virginia&resultType=core&cursorMark=*&pageSize=50&format=xml')

#https://www.foxinfotech.in/2019/04/python-how-to-read-xml-from-url.html
restQuery=urlopen(builtQuery)

xmlTree=ET.parse(restQuery)
root = xmlTree.getroot()


   
#https://towardsdatascience.com/converting-multi-layered-xml-files-to-dataframes-in-python-using-xmltree-a13f9b043b48


openAccess=[]
authors=[]
date=[]
title=[]
iso=[]
doi=[]

nextPage=root[2].text

def lastText(element, tag):
    # Text of the last descendant named `tag`, or None when the result has no such tag
    matches=list(element.iter(tag))
    return matches[-1].text if matches else None

# root[1] is <hitCount>, root[4] is <resultList>; each child of root[4] is one <result>
if int(root[1].text)<1000:
    for a in root[4]:
        openAccess.append(lastText(a,'isOpenAccess'))
        authors.append(lastText(a,'authorString'))
        date.append(lastText(a,'firstPublicationDate'))
        title.append(lastText(a,'title'))
        iso.append(lastText(a,'ISOAbbreviation'))
        doi.append(lastText(a,'doi'))
       


df = pd.DataFrame({'Authors':authors,'ArticleTitle':title,'JournalTitle':iso,'date':date,'DOI':doi,'openAccess': openAccess})
df['date'] = pd.to_datetime(df['date'])


openFilter = sorted(df['openAccess'].drop_duplicates()) # select the open access values 
open_Filter = st.sidebar.selectbox('Open Access?',openFilter) # render the Streamlit select box in the sidebar, using the list created above as its options
df2=df[df['openAccess'].str.contains(open_Filter)] # keep only the rows matching the selected value
st.write(df2.sort_values(by='date'))


df['year']=df['date'].dt.to_period('Y')
df['yearDate'] = df['year'].astype(str)
df3 = df[['yearDate','openAccess']].copy()


valLayer = alt.Chart(df3).mark_bar().encode(x='yearDate',y='count(openAccess)',color='openAccess')

st.altair_chart(valLayer,use_container_width=True)

'''
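The outer loop described in the question could be sketched roughly as follows. This is one possible reading, not a tested integration with the app. The names `buildUrl`, `fetchPage`, `iterResults` and the `maxPages` cap are hypothetical, and the stop conditions (empty page, missing cursor, or a cursor that no longer changes) are assumptions about how the API signals the last page.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = 'https://www.ebi.ac.uk/europepmc/webservices/rest/search'

def buildUrl(query, cursorMark='*', pageSize=50):
    # urlencode escapes characters such as '=' and '+' that occur in cursor marks
    params = {'query': query, 'resultType': 'core', 'cursorMark': cursorMark,
              'pageSize': pageSize, 'format': 'xml'}
    return BASE_URL + '?' + urlencode(params)

def fetchPage(query, cursorMark):
    # Network call: download and parse one page of results
    with urlopen(buildUrl(query, cursorMark)) as response:
        return ET.parse(response).getroot()

def iterResults(query, fetch=fetchPage, maxPages=5):
    """Yield every <result> element, following nextCursorMark from page to page."""
    cursorMark = '*'
    for _ in range(maxPages):  # hypothetical safety cap on the number of requests
        root = fetch(query, cursorMark)
        results = root.findall('./resultList/result')
        yield from results
        nextCursor = root.findtext('nextCursorMark')
        # Assumed stop conditions: empty page, missing cursor, or unchanged cursor
        if not results or not nextCursor or nextCursor == cursorMark:
            break
        cursorMark = nextCursor
```

With something like this, the loop over root[4] could iterate over `iterResults('virginia')` instead, so the per-result parsing stays unchanged while the generator requests a fresh tree for each cursor. The cap matters here: at 50 results per page, a query with 277624 hits would take several thousand requests to exhaust.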

By the way, I have fixed the URL; its output is:

'''

<responseWrapper xmlns:slx="http://www.scholix.org" xmlns:epmc="https://www.europepmc.org/data" nighteye="disabled">
<script id="tinyhippos-injected"/>
<version>6.5</version>
<hitCount>277624</hitCount>
<nextCursorMark>AoIIQJRo5Sg0MzQwNzg5MQ==</nextCursorMark>
<request>
<queryString>virginia</queryString>
<resultType>core</resultType>
<cursorMark>*</cursorMark>
<pageSize>50</pageSize>
<sort/>
<synonym>false</synonym>
</request>
<resultList>
<result>
