
How do I find string offsets in XML using BeautifulSoup?


I have an XML file produced by ABBYY's OCR software.

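The XML can contain multiple text blocks as well as multiple table blocks (the OCR'd document looks like a research paper: paragraphs of text with some tables in between).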

I am trying to extract the text from the tables, which look like this in the XML:

<ComposedBlock ID="Page1_Block4" HEIGHT="240" WIDTH="1170" VPOS="226" HPOS="143" TYPE="table">
    <TextBlock ID="Page1_Block5" HEIGHT="55" WIDTH="393" VPOS="226" HPOS="143" LANG="en-US" STYLEREFS="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1">
        <TextLine HEIGHT="33" WIDTH="178" VPOS="234" HPOS="154">
            <String CONTENT="some text" HEIGHT="33" WIDTH="178" VPOS="234" HPOS="154"/>
        </TextLine>
    </TextBlock>
</ComposedBlock>

Non-table text appears inside a plain TextBlock, like this:

<TextBlock ID="Page1_Block1" HEIGHT="52" WIDTH="1918" VPOS="3101" HPOS="148" STYLEREFS="StyleId-4CD32088-9994-4ED5-BD2B-8082FC83356D- font1" LANG="en-US">
    <TextLine HEIGHT="42" WIDTH="1362" VPOS="3101" HPOS="154">
        <String CONTENT="Mafi" HEIGHT="32" WIDTH="74" VPOS="3101" HPOS="154"/>
        <String STYLE="bold" CONTENT="," HEIGHT="10" WIDTH="4" VPOS="3129" HPOS="235"/>
        <SP HEIGHT="36" WIDTH="18" VPOS="3103" HPOS="240"/>
    </TextLine>
</TextBlock>

Now, my problem statement is: how do I get the global offsets of only the text that is inside the table blocks?

Here is what my code for extracting text from the tables looks like:

import bs4 as bs

# `content` (the XML text) and `page` (the page number) are defined elsewhere
soup = bs.BeautifulSoup(content, 'lxml')
page_definition = "Page" + str(page) + "_"
tables = soup.find_all('composedblock', {"type": "table"}, id=lambda value: value and value.startswith(page_definition))
for table in tables:
    table_content = []
    vpos_values = []
    # Converting table xml to str for bs4 to be able to consume
    table = str(table)
    xml_soup = bs.BeautifulSoup(table, 'lxml')
    # Finding all <textblock> tags inside the table ComposedBlock
    text_blocks = xml_soup.find_all('textblock')

    # Looping over all text blocks inside the table
    for text_block in text_blocks:
        # `table` is a plain str at this point, so search the re-parsed soup;
        # this is only an offset within this one table, not a global offset
        print(xml_soup.text.index(text_block.text))
        # Getting all vpos numbers, as different vpos signify different rows ..
        # .. due to the virtue of vertical positions
        vpos_values.append(text_block.get("vpos"))

    # Collecting all vpos of a table, sorted numerically rather than as strings
    all_vpos = sorted(set(vpos_values), key=int)
    for row_vpos in all_vpos:
        txt_block_row = xml_soup.find_all("textblock", {"vpos": row_vpos})

        # Finding all text in a single row
        row = []
        for txt_block in txt_block_row:
            texts = txt_block.find_all("string")
            content_text = []
            for text in texts:
                content = text.get('content')
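
The row-grouping idea in the code above (text blocks that share a VPOS belong to the same table row) can be sketched in a self-contained form. This is only a sketch under assumptions: the sample XML is made up for illustration, table_rows is a hypothetical helper name, and cells within a row are kept in document order rather than sorted by HPOS.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Made-up sample input, trimmed to the attributes the sketch needs
sample = '''
<ComposedBlock ID="Page1_Block4" TYPE="table">
    <TextBlock ID="Page1_Block5" VPOS="226" HPOS="143">
        <TextLine><String CONTENT="name"/></TextLine>
    </TextBlock>
    <TextBlock ID="Page1_Block6" VPOS="226" HPOS="600">
        <TextLine><String CONTENT="value"/></TextLine>
    </TextBlock>
    <TextBlock ID="Page1_Block7" VPOS="1000" HPOS="143">
        <TextLine><String CONTENT="width"/></TextLine>
    </TextBlock>
</ComposedBlock>
'''


def table_rows(table):
    """Group the text of each TextBlock in `table` by its VPOS value."""
    rows = defaultdict(list)
    for block in table.find_all('textblock'):
        # Join the CONTENT of every String in the block into one cell
        words = [s['content'] for s in block.find_all('string', content=True)]
        rows[int(block['vpos'])].append(' '.join(words))
    # Sort rows by numeric VPOS so that 226 comes before 1000
    return [rows[v] for v in sorted(rows)]


# html.parser lowercases tag and attribute names
soup = BeautifulSoup(sample, 'html.parser')
table = soup.find('composedblock', type='table')
rows = table_rows(table)
print(rows)
```

Converting VPOS to int before sorting matters: sorted as strings, "1000" would come before "226".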

Solution

I hope I have understood your question correctly. You want to print the offsets of all strings that are inside tables:

from bs4 import BeautifulSoup


txt = '''
<TextBlock ID="Page1_Block1" HEIGHT="52" WIDTH="1918" VPOS="3101" HPOS="148" STYLEREFS="StyleId-4CD32088-9994-4ED5-BD2B-8082FC83356D- font1" LANG="en-US">
    <TextLine HEIGHT="42" WIDTH="1362" VPOS="3101" HPOS="154">
        <String CONTENT="Mafi" HEIGHT="32" WIDTH="74" VPOS="3101" HPOS="154"/>
        <String STYLE="bold" CONTENT="," HEIGHT="10" WIDTH="4" VPOS="3129" HPOS="235"/>
        <SP HEIGHT="36" WIDTH="18" VPOS="3103" HPOS="240"/>
    </TextLine>
</TextBlock>

<ComposedBlock ID="Page1_Block4" HEIGHT="240" WIDTH="1170" VPOS="226" HPOS="143" TYPE="table">
    <TextBlock ID="Page1_Block5" HEIGHT="55" WIDTH="393" VPOS="226" HPOS="143" LANG="en-US" STYLEREFS="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1">
        <TextLine HEIGHT="33" WIDTH="178" VPOS="234" HPOS="154">
            <String CONTENT="some text" HEIGHT="33" WIDTH="178" VPOS="234" HPOS="154"/>
        </TextLine>
    </TextBlock>
</ComposedBlock>
'''

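soup = BeautifulSoup(txt, 'html.parser')

all_strings = []
are_we_inside_table = []
current_index, indexes = 0, []
for s in soup.select('string[content]'):
    all_strings.append(s['content'])
    # find_parent only matches an enclosing table block; find_previous would
    # also match strings that merely come after a table in the document
    are_we_inside_table.append(bool(s.find_parent('composedblock', type='table')))
    indexes.append(current_index)
    # The global offset advances for every string, inside a table or not
    current_index += len(s['content'])

for s, t, i in zip(all_strings, are_we_inside_table, indexes):
    if t:
        print(i, s)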

Prints:

5 some text

The offset is 5 because "Mafi" and "," are strings too (4 + 1 characters), but they are not inside the table.
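
Note that the offset of 5 counts only the characters in CONTENT attributes; the SP element after "," is skipped. If offsets into the reconstructed plain text are wanted instead, one option is to count each SP as a single character. A minimal sketch, assuming that convention: table_string_offsets and its count_spaces flag are hypothetical names, and the sample XML is trimmed and extended with a made-up trailing block for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed sample; the last TextBlock is made up to show text after a table
sample = '''
<TextBlock ID="Page1_Block1">
    <TextLine>
        <String CONTENT="Mafi"/>
        <String CONTENT=","/>
        <SP/>
    </TextLine>
</TextBlock>
<ComposedBlock ID="Page1_Block4" TYPE="table">
    <TextBlock ID="Page1_Block5">
        <TextLine><String CONTENT="some text"/></TextLine>
    </TextBlock>
</ComposedBlock>
<TextBlock ID="Page1_Block6">
    <TextLine><String CONTENT="after"/></TextLine>
</TextBlock>
'''


def table_string_offsets(xml_text, count_spaces=False):
    """Return (offset, content) pairs for strings inside table blocks."""
    soup = BeautifulSoup(xml_text, 'html.parser')
    results = []
    offset = 0
    # Walk String and SP elements in document order
    for node in soup.find_all(['string', 'sp']):
        if node.name == 'sp':
            # Assumption: one SP element stands for one space character
            if count_spaces:
                offset += 1
            continue
        content = node.get('content')
        if content is None:
            continue
        # Only strings with an enclosing table ComposedBlock are reported
        if node.find_parent('composedblock', type='table') is not None:
            results.append((offset, content))
        # Every string advances the global offset, reported or not
        offset += len(content)
    return results


without_spaces = table_string_offsets(sample)
with_spaces = table_string_offsets(sample, count_spaces=True)
print(without_spaces, with_spaces)
```

The trailing "after" string is not reported in either case, because it follows the table but is not nested inside it.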


EDIT: To restrict the search to strings on page 1 only, you can refine the selector:

...
# The trailing underscore keeps "Page1_" from also matching Page10, Page11, ...
for s in soup.select('[id^="Page{}_"] string[content]'.format(1)):
    ...
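
One caveat with prefix selectors: a prefix such as Page1 without a trailing underscore also matches IDs that start with Page10, Page11 and so on. A small sketch of a page filter that includes the underscore; strings_on_page is a hypothetical helper and the two-block sample is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample with one block on page 1 and one on page 10
sample = '''
<TextBlock ID="Page1_Block1"><TextLine><String CONTENT="one"/></TextLine></TextBlock>
<TextBlock ID="Page10_Block1"><TextLine><String CONTENT="ten"/></TextLine></TextBlock>
'''


def strings_on_page(soup, page):
    """Return the CONTENT of every String under an element of the given page."""
    # [id^="PageN_"] matches elements whose id starts with that exact prefix
    selector = '[id^="Page{}_"] string[content]'.format(page)
    return [s['content'] for s in soup.select(selector)]


soup = BeautifulSoup(sample, 'html.parser')
page1 = strings_on_page(soup, 1)
page10 = strings_on_page(soup, 10)
# Without the underscore, the page 1 prefix matches both blocks
loose = [s['content'] for s in soup.select('[id^="Page1"] string[content]')]
print(page1, page10, loose)
```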
