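How to find string offsets in XML using BeautifulSoup?
The XML content can contain multiple text blocks as well as multiple table blocks (the OCR'd documents look like research papers: paragraphs of text with some tables in between).
I am trying to extract the text from the tables, which look like this in the XML: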
<ComposedBlock ID="Page1_Block4" HEIGHT="240" WIDTH="1170" VPOS="226" HPOS="143" TYPE="table">
<TextBlock ID="Page1_Block5" HEIGHT="55" WIDTH="393" VPOS="226" HPOS="143" LANG="en-US" STYLEREFS="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1">
<TextLine HEIGHT="33" WIDTH="178" VPOS="234" HPOS="154">
<String CONTENT="some text" HEIGHT="33" WIDTH="178" VPOS="234" HPOS="154"/>
</TextLine>
</TextBlock>
</ComposedBlock>
Non-table text appears as part of a TextBlock, like this:
<TextBlock ID="Page1_Block1" HEIGHT="52" WIDTH="1918" VPOS="3101" HPOS="148" STYLEREFS="StyleId-4CD32088-9994-4ED5-BD2B-8082FC83356D- font1" LANG="en-US">
<TextLine HEIGHT="42" WIDTH="1362" VPOS="3101" HPOS="154">
<String CONTENT="Mafi" HEIGHT="32" WIDTH="74" VPOS="3101" HPOS="154"/>
<String STYLE="bold" CONTENT="," HEIGHT="10" WIDTH="4" VPOS="3129" HPOS="235"/>
<SP HEIGHT="36" WIDTH="18" VPOS="3103" HPOS="240"/>
</TextLine>
</TextBlock>
import bs4 as bs

# `content` holds the XML text and `page` the page number (both defined elsewhere)
soup = bs.BeautifulSoup(content, 'lxml')
page_definition = "Page" + str(page) + "_"
tables = soup.find_all('composedblock', {"type": "table"}, id=lambda value: value and value.startswith(page_definition))
for table in tables:
    table_content = []
    vpos = []
    # Converting table xml to str for bs4 to be able to consume
    table = str(table)
    xml_soup = bs.BeautifulSoup(table, 'lxml')
    # Finding all <textblock> tags inside the table ComposedBlock
    text_blocks = xml_soup.find_all('textblock')
    # Looping over all text blocks inside the table
    for text_block in text_blocks:
        # `table` is a str at this point, so the lookup goes through xml_soup
        print(xml_soup.text.index(text_block.text))
        # Getting all vpos numbers, as different vpos signify different rows ..
        # .. due to the virtue of vertical positions
        vpos.append(text_block.get("vpos"))
    # Collecting all vpos of a table
    all_vpos = sorted(set(vpos))
    for row_vpos in all_vpos:
        txt_block_row = xml_soup.find_all("textblock", {"vpos": str(row_vpos)})
        # Finding all text in a single row
        row = []
        for txt_block in txt_block_row:
            texts = txt_block.find_all("string")
            content_text = []
            for text in texts:
                content = text.get('content')
Solution
I hope I've understood your question correctly. You want to print the offsets of all strings that are inside a table:
from bs4 import BeautifulSoup
txt = '''
<TextBlock ID="Page1_Block1" HEIGHT="52" WIDTH="1918" VPOS="3101" HPOS="148" STYLEREFS="StyleId-4CD32088-9994-4ED5-BD2B-8082FC83356D- font1" LANG="en-US">
<TextLine HEIGHT="42" WIDTH="1362" VPOS="3101" HPOS="154">
<String CONTENT="Mafi" HEIGHT="32" WIDTH="74" VPOS="3101" HPOS="154"/>
<String STYLE="bold" CONTENT="," HEIGHT="10" WIDTH="4" VPOS="3129" HPOS="235"/>
<SP HEIGHT="36" WIDTH="18" VPOS="3103" HPOS="240"/>
</TextLine>
</TextBlock>
<ComposedBlock ID="Page1_Block4" HEIGHT="240" WIDTH="1170" VPOS="226" HPOS="143" TYPE="table">
<TextBlock ID="Page1_Block5" HEIGHT="55" WIDTH="393" VPOS="226" HPOS="143" LANG="en-US" STYLEREFS="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1">
<TextLine HEIGHT="33" WIDTH="178" VPOS="234" HPOS="154">
<String CONTENT="some text" HEIGHT="33" WIDTH="178" VPOS="234" HPOS="154"/>
</TextLine>
</TextBlock>
</ComposedBlock>
'''
soup = BeautifulSoup(txt, 'html.parser')

all_strings = []
are_we_inside_table = []
current_index, indexes = 0, []
for s in soup.select('string[content]'):
    all_strings.append(s['content'])
    are_we_inside_table.append(bool(s.find_previous('composedblock', type='table')))
    indexes.append(current_index)
    current_index += len(s['content'])

for s, t, i in zip(all_strings, are_we_inside_table, indexes):
    if t:
        print(i, s)
Prints:
5 some text
The offset is 5 because Mafi and , are strings too, but they are not inside the table.
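One caveat with the loop above: find_previous matches any earlier table block in document order, so a String that comes after a closed table would also be flagged as inside it. The following is a minimal sketch of a variant that uses find_parent instead, keeping the same cumulative offset scheme. The function name table_string_offsets and the shortened sample document are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def table_string_offsets(xml_text):
    """Return (offset, content) pairs for String elements inside a table block.

    The offset is the total length of all CONTENT values that appear
    earlier in document order (hypothetical helper, for illustration).
    """
    # html.parser lowercases tag and attribute names
    soup = BeautifulSoup(xml_text, "html.parser")
    results = []
    offset = 0
    for s in soup.select("string[content]"):
        content = s["content"]
        # find_parent only matches enclosing elements, so a String that
        # follows a closed table block is not counted as being inside it.
        if s.find_parent("composedblock", type="table") is not None:
            results.append((offset, content))
        offset += len(content)
    return results


# Shortened sample: attributes not needed for the lookup are omitted.
sample = '''
<TextBlock ID="Page1_Block1">
<TextLine><String CONTENT="Mafi"/><String CONTENT=","/></TextLine>
</TextBlock>
<ComposedBlock ID="Page1_Block4" TYPE="table">
<TextBlock ID="Page1_Block5">
<TextLine><String CONTENT="some text"/></TextLine>
</TextBlock>
</ComposedBlock>
<TextBlock ID="Page1_Block6">
<TextLine><String CONTENT="after"/></TextLine>
</TextBlock>
'''

print(table_string_offsets(sample))  # [(5, 'some text')]
```

The trailing "after" string sits outside the table, so it is skipped here, although it still advances the running offset for anything that follows it.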
EDIT: To restrict the search to strings on page 1 only, you can refine the selector:
...
for s in soup.select('[id^="Page{}_"] string[content]'.format(1)):
...
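As a self-contained illustration of the page filter, here is a small sketch. The helper name strings_on_page and the two-block sample are assumptions made for this example. One detail worth checking is the trailing underscore in the ID prefix: Page1 on its own would also match IDs such as Page10_Block1.

```python
from bs4 import BeautifulSoup


def strings_on_page(xml_text, page):
    """Return CONTENT values of String elements under blocks whose ID
    starts with 'Page<page>_' (hypothetical helper, for illustration)."""
    soup = BeautifulSoup(xml_text, "html.parser")
    # The trailing underscore keeps page 1 from also matching Page10_, Page11_, ...
    selector = '[id^="Page{}_"] string[content]'.format(page)
    return [s["content"] for s in soup.select(selector)]


sample = '''
<TextBlock ID="Page1_Block1"><TextLine><String CONTENT="first page"/></TextLine></TextBlock>
<TextBlock ID="Page10_Block1"><TextLine><String CONTENT="tenth page"/></TextLine></TextBlock>
'''

print(strings_on_page(sample, 1))  # ['first page']
```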