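How to fix a DataFrame that only prints the last row of XML parsed with BeautifulSoup
Hello, I am trying to extract some information from a published XML dataset. Here is the first part of my code: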
from bs4 import BeautifulSoup as bs
import pandas as pd

content = []
with open("phosphiltestfilepmc.xml", "r") as file:
    content = file.readlines()
    content = "".join(content)
    bs_content = bs(content, "lxml")

available_contacts = 139
start_list = 0
input_tag = bs_content.find_all(attrs={'ref-type': 'corresp'})
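I am using find_all to return every element whose 'ref-type' attribute equals 'corresp', which gives me a ResultSet.
From there I iterate over the results and look up elements under each parent, as follows: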
l = []
a = []
for i in range(start_list, available_contacts):
    d = {}
    b = {}
    try:
        d['firstname'] = input_tag[i].parent('given-names')
    except Exception:
        pass
    try:
        d['lastname'] = input_tag[i].parent('surname')
    except Exception:
        pass
    try:
        d['email'] = input_tag[i].parent.parent.parent.parent('corresp')[0]('email')
    except Exception:
        d['email'] = 'j@g.com'
    l.append(d)
print(l)
The result of print(l) is a list of dictionaries (this is a snippet):
[{'firstname': [<given-names>Inn-Ho</given-names>],'lastname': [<surname>Tsai</surname>],'email': [<email>bc201@gate.sinica.edu.tw</email>]}]
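I am trying to get the text out of these dictionaries. As far as I can tell, get_text() cannot be called on a ResultSet.
My workaround is to loop over them again, this time using text.strip(), see below: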
for tag, tag2, tag3 in zip(d['firstname'], d['lastname'], d['email']):
    try:
        b['First Name'] = tag.text.strip()
    except Exception:
        pass
    try:
        b['Last Name'] = tag2.text.strip()
    except Exception:
        pass
    try:
        b['Email Address'] = tag3.text.strip()
    except Exception:
        pass
    a.append(b)
print(a)
The output of "a" is a list of dictionaries (this is just a snippet): [{'First Name': 'José María','Last Name': 'Gutiérrez','Email Address': 'jgutierr@icp.ucr.ac.cr'}]
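On the get_text() point: a ResultSet is a list of Tag objects, so get_text() is available on each element rather than on the list itself. Here is a minimal sketch, assuming a small inline sample (the `sample` string and the variable names are made up for illustration and are not part of the original post), that pulls the text out with a list comprehension, or with find() when only the first match is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample, shaped like one <contrib> entry from the question's file.
sample = """
<contrib contrib-type="author">
  <name><surname>Xavier</surname><given-names>C.H.</given-names></name>
  <xref ref-type="corresp" rid="cor1">*</xref>
</contrib>
"""

soup = BeautifulSoup(sample, "html.parser")
contrib = soup.find("contrib")

# Calling a tag like a function is shorthand for find_all(), so this is a ResultSet.
given_names = contrib("given-names")

# get_text() lives on each Tag, not on the ResultSet, so apply it per element.
texts = [tag.get_text(strip=True) for tag in given_names]

# find() returns a single Tag (or None), which avoids the extra list layer.
surname_tag = contrib.find("surname")
surname = surname_tag.get_text(strip=True) if surname_tag else None

print(texts, surname)
```

With find(), the second pass over the dictionaries is not needed, because each value is already a plain string.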
import pandas
df = pandas.DataFrame(a)
df
The output is only the last name in the list a. Please help.
Here is a snippet of the XML:
<?xml version="1.0" ?>
<!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">
<pmc-articleset><article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article">
<?properties open_access?>
<front>
<journal-Meta>
<journal-id journal-id-type="nlm-ta">Braz J Med Biol Res</journal-id>
<journal-id journal-id-type="iso-abbrev">Braz. J. Med. Biol. Res</journal-id>
<journal-id journal-id-type="publisher-id">bjmbr</journal-id>
<journal-title-group>
<journal-title>Brazilian Journal of Medical and Biological Research</journal-title>
</journal-title-group>
<issn pub-type="ppub">0100-879X</issn>
<issn pub-type="epub">1414-431X</issn>
<publisher>
<publisher-name>Associação Brasileira de Divulgação Científica</publisher-name>
</publisher>
</journal-Meta>
<article-Meta>
<article-id pub-id-type="pmid">31721904</article-id>
<article-id pub-id-type="pmc">6853074</article-id>
<article-id pub-id-type="other">00606</article-id>
<article-id pub-id-type="doi">10.1590/1414-431X20198441</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Behavioral effects of <italic>Bj</italic>-PRO-7a,a proline-rich oligopeptide from <italic>Bothrops jararaca</italic> venom</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-4646-5682</contrib-id>
<name>
<surname>Turones</surname>
<given-names>L.C.</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0002-2318-9809</contrib-id>
<name>
<surname>da Cruz</surname>
<given-names>K.R.</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0002-4061-8804</contrib-id>
<name>
<surname>Camargo-Silva</surname>
<given-names>G.</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-1799-1106</contrib-id>
<name>
<surname>Reis-Silva</surname>
<given-names>L.L.</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0002-4997-2658</contrib-id>
<name>
<surname>Graziani</surname>
<given-names>D.</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ferreira</surname>
<given-names>P.M.</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-2836-5565</contrib-id>
<name>
<surname>galdino</surname>
<given-names>P.M.</given-names>
</name>
<xref ref-type="aff" rid="aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-0488-5400</contrib-id>
<name>
<surname>Pedrino</surname>
<given-names>G.R.</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0001-8738-5852</contrib-id>
<name>
<surname>Santos</surname>
<given-names>R.</given-names>
</name>
<xref ref-type="aff" rid="aff3">3</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-1996-0901</contrib-id>
<name>
<surname>Costa</surname>
<given-names>E.A.</given-names>
</name>
<xref ref-type="aff" rid="aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0001-5709-9329</contrib-id>
<name>
<surname>Ianzer</surname>
<given-names>D.</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
<xref ref-type="corresp" rid="cor1">*</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-4006-8213</contrib-id>
<name>
<surname>Xavier</surname>
<given-names>C.H.</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
<xref ref-type="corresp" rid="cor1">*</xref>
</contrib>
<aff id="aff1">
<label>1</label>Laboratório de Neurobiologia de Sistemas,Departamento de Ciências Fisiológicas,Instituto de Ciências Biológicas,Universidade Federal de Goiás,Goiânia,GO,Brasil</aff>
<aff id="aff2">
<label>2</label>Laboratório de Farmacologia de Produtos Naturais e Sintéticos,Departamento de Farmacologia,Brasil</aff>
<aff id="aff3">
<label>3</label>Departamento de Fisiologia e Biofísica,Universidade Federal de Minas Gerais,Belo Horizonte,MG,Brasil</aff>
</contrib-group>
<author-notes>
<corresp id="cor1">Correspondence: C.H. Xavier: &lt;<email>carlosxavier@ufg.br</email>&gt;</corresp>
<fn fn-type="equal" id="fn1">
<p>*These authors contributed equally to his work.</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>07</day>
<month>11</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>52</volume>
<issue>11</issue>
<elocation-id>e8441</elocation-id>
<history>
<date date-type="received">
<day>12</day>
<month>2</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>30</day>
<month>8</month>
<year>2019</year>
</date>
</history>
<permissions>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License,which permits unrestricted use,distribution,and reproduction in any medium,provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>The heptapeptide <italic>Bj</italic>-PRO-7a,isolated and identified from <italic>Bothrops jararaca</italic> (<italic>Bj</italic>) venom,produces antihypertensive and other cardiovascular effects that are independent on angiotensin converting enzyme inhibition,possibly relying on cholinergic muscarinic receptors subtype 1 (M<sub>1</sub>R). However,whether <italic>Bj</italic>-PRO-7a acts upon the central nervous system and modifies behavior is yet to be determined. Therefore,the aims of this study were: i) to assess the effects of acute administration of <italic>Bj</italic>-PRO-7a upon behavior; ii) to reveal mechanisms involved in the effects of <italic>Bj</italic>-PRO-7a upon locomotion/exploration,anxiety,and depression-like behaviors. For this purpose,adult male Wistar (WT,wild type) and spontaneous hypertensive rats (SHR) received intraperitoneal injections of vehicle (0.9% NaCl),diazepam (2 mg/kg),imipramine (15 mg/kg),<italic>Bj</italic>-PRO-7a (71,213 or 426 nmol/kg),pirenzepine (852 nmol/kg),α-methyl-DL-tyrosine (200 mg/kg),or chlorpromazine (2 mg/kg),and underwent elevated plus maze,open field,and forced swimming tests. The heptapeptide promoted anxiolytic and antidepressant-like effects and increased locomotion/exploration. These effects of <italic>Bj</italic>-PRO-7a seem to be dependent on M<sub>1</sub>R activation and dopaminergic receptors and rely on catecholaminergic pathways.</p>
</abstract>
<kwd-group>
<kwd><italic>Bj</italic>-PRO-7a</kwd>
<kwd>Snake venom</kwd>
<kwd>Neuroactive compounds</kwd>
<kwd>Anxiety</kwd>
<kwd>Depression</kwd>
<kwd>Behavior</kwd>
</kwd-group>
<counts>
<fig-count count="9"/>
<table-count count="0"/>
<equation-count count="0"/>
<ref-count count="35"/>
</counts>
</article-Meta>
</front>
Here is the entire script:
from bs4 import BeautifulSoup as bs
import pandas as pd

content = []
with open("phosphiltestfilepmc.xml", "r") as file:
    content = file.readlines()
    content = "".join(content)
    bs_content = bs(content, "lxml")

available_contacts = 139
start_list = 0
#article_Meta = bs_content.find_all('article-Meta')
input_tag = bs_content.find_all(attrs={'ref-type': 'corresp'})
# something = []
# for link in input_tag:
#     something.append(link.parent.get('given-names'))
# print(something)
l = []
a = []
for i in range(start_list, available_contacts):
    d = {}
    b = {}
    try:
        d['firstname'] = input_tag[i].parent('given-names')
    except Exception:
        pass
    try:
        d['lastname'] = input_tag[i].parent('surname')
    except Exception:
        pass
    try:
        d['email'] = input_tag[i].parent.parent.parent.parent('corresp')[0]('email')
    except Exception:
        d['email'] = 'j@g.com'
    l.append(d)
#print(l)
for tag, tag2, tag3 in zip(d['firstname'], d['lastname'], d['email']):
    try:
        b['First Name'] = tag.text.strip()
    except Exception:
        pass
    try:
        b['Last Name'] = tag2.text.strip()
    except Exception:
        pass
    try:
        b['Email Address'] = tag3.text.strip()
    except Exception:
        pass
    a.append(b)
print(a)
import pandas
df = pandas.DataFrame(a)
df
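One plausible cause of the single-row result, assuming the second loop runs after the first loop has finished rather than inside it (the indentation of the posted script was lost, so this is a guess): d and b are rebound on every iteration, so once the first loop ends they refer only to the last contact. The sketch below reproduces that pattern with plain dictionaries; the `records` data and the variable names are invented for illustration.

```python
# Made-up stand-in for the parsed contacts.
records = [
    {"firstname": ["D."], "lastname": ["Ianzer"]},
    {"firstname": ["C.H."], "lastname": ["Xavier"]},
]

# Suspected pattern: the conversion loop runs after the collection loop,
# so it only ever sees the last value bound to d.
collected, rows_buggy = [], []
for record in records:
    d = dict(record)
    collected.append(d)
for first, last in zip(d["firstname"], d["lastname"]):
    rows_buggy.append({"First Name": first, "Last Name": last})

# Fix: build one output row per record inside the same loop.
rows_fixed = []
for record in records:
    rows_fixed.append({
        "First Name": record["firstname"][0],
        "Last Name": record["lastname"][0],
    })

print(len(rows_buggy), len(rows_fixed))  # 1 2
```

If this is what happened, moving the text-extraction step inside the first loop, or iterating over l instead of the leftover d, should give one row per contact.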
Solution
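I hope I have understood your question correctly: you want to extract the names from the <contrib> tags that contain an <xref ref-type="corresp"> (txt holds the XML snippet from the question):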
import pandas as pd
from bs4 import BeautifulSoup

# txt holds the XML snippet from the question
soup = BeautifulSoup(txt, 'html.parser')
all_data = []
# keep only <contrib> elements with a direct <xref ref-type="corresp"> child
for contrib in soup.select('contrib:has(> xref[ref-type="corresp"])'):
    cor_id = contrib.select_one('xref[ref-type="corresp"]')['rid']
    # look up the matching <corresp id="..."> and its <email>, if present
    email = soup.select_one('corresp#{} email'.format(cor_id))
    email = email.text if email else '-'
    all_data.append({
        'First Name': contrib.select_one('given-names').text,
        'Last Name': contrib.select_one('surname').text,
        'Email Address': email
    })
df = pd.DataFrame(all_data)
print(df)
Prints:
  First Name Last Name        Email Address
0         D.    Ianzer  carlosxavier@ufg.br
1       C.H.    Xavier  carlosxavier@ufg.br
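For comparison, the same lookup can be sketched with the standard library's xml.etree.ElementTree instead of BeautifulSoup. This is an alternative technique, not what the answer above uses; the trimmed `xml_text` sample and the helper name `corresponding_authors` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample modeled on the question's XML.
xml_text = """
<front>
  <contrib-group>
    <contrib contrib-type="author">
      <name><surname>Pedrino</surname><given-names>G.R.</given-names></name>
      <xref ref-type="aff" rid="aff1">1</xref>
    </contrib>
    <contrib contrib-type="author">
      <name><surname>Xavier</surname><given-names>C.H.</given-names></name>
      <xref ref-type="aff" rid="aff1">1</xref>
      <xref ref-type="corresp" rid="cor1">*</xref>
    </contrib>
  </contrib-group>
  <author-notes>
    <corresp id="cor1">Correspondence: <email>carlosxavier@ufg.br</email></corresp>
  </author-notes>
</front>
"""


def corresponding_authors(root):
    """Return one dict per <contrib> that has an <xref ref-type="corresp">."""
    # Map each <corresp id="..."> to its e-mail text, or "-" if it has none.
    emails = {
        c.get("id"): c.findtext("email", default="-")
        for c in root.iter("corresp")
    }
    rows = []
    for contrib in root.iter("contrib"):
        # First direct <xref> child marked as a correspondence reference.
        xref = next(
            (x for x in contrib.findall("xref") if x.get("ref-type") == "corresp"),
            None,
        )
        if xref is None:
            continue
        rows.append({
            "First Name": contrib.findtext("name/given-names", default=""),
            "Last Name": contrib.findtext("name/surname", default=""),
            "Email Address": emails.get(xref.get("rid"), "-"),
        })
    return rows


rows = corresponding_authors(ET.fromstring(xml_text))
print(rows)
```

The resulting list can be passed to pandas.DataFrame(rows) just like all_data above. Note that ElementTree is a strict XML parser, so unlike 'html.parser' it will reject input that is not well-formed.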