How can I find string offsets in XML using BeautifulSoup?

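I have an XML file produced by ABBYY's OCR software.

The XML can contain several text blocks as well as several table blocks (the OCR'd document looks like a research paper: paragraphs of text with a few tables in between).

I am trying to extract the text from the tables, which looks like this in the XML: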

<ComposedBlock ID="Page1_Block4" HEIGHT="240" WIDTH="1170" VPOS="226" HPOS="143" TYPE="table">
    <TextBlock ID="Page1_Block5" HEIGHT="55" WIDTH="393" VPOS="226" HPOS="143" LANG="en-US" STYLEREFS="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1">
        <TextLine HEIGHT="33" WIDTH="178" VPOS="234" HPOS="154">
            <String CONTENT="some text" HEIGHT="33" WIDTH="178" VPOS="234" HPOS="154"/>
        </TextLine>
    </TextBlock>
</ComposedBlock>

Non-table text appears as part of a TextBlock, like this:

<TextBlock ID="Page1_Block1" HEIGHT="52" WIDTH="1918" VPOS="3101" HPOS="148" STYLEREFS="StyleId-4CD32088-9994-4ED5-BD2B-8082FC83356D- font1" LANG="en-US">
    <TextLine HEIGHT="42" WIDTH="1362" VPOS="3101" HPOS="154">
        <String CONTENT="Mafi" HEIGHT="32" WIDTH="74" VPOS="3101" HPOS="154"/>
        <String STYLE="bold" CONTENT="," HEIGHT="10" WIDTH="4" VPOS="3129" HPOS="235"/>
        <SP HEIGHT="36" WIDTH="18" VPOS="3103" HPOS="240"/>
    </TextLine>
</TextBlock>

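My question is: how do I get the global offsets of only the text that is inside tables?

This is what my code for extracting the text from a table looks like: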

import bs4 as bs

# `content` (the XML text) and `page` (the page number) are defined earlier
soup = bs.BeautifulSoup(content, 'lxml')
page_definition = "Page" + str(page) + "_"
tables = soup.find_all('composedblock', {"type": "table"}, id=lambda value: value and value.startswith(page_definition))
for table in tables:
    table_content = []
    vpos = []
    # Converting table xml to str for bs4 to be able to consume
    table_xml = str(table)
    xml_soup = bs.BeautifulSoup(table_xml, 'lxml')
    # Finding all <textblock> tags inside the table ComposedBlock
    text_blocks = xml_soup.find_all('textblock')

    # Looping over all text blocks inside the table
    for text_block in text_blocks:
        # Attempted offset lookup; the text lives in CONTENT attributes, so
        # .text holds only whitespace and this does not give a useful offset
        print(xml_soup.text.index(text_block.text))
        # Getting all vpos numbers, as different vpos signify different rows ..
        # .. due to the virtue of vertical positions
        vpos.append(text_block.get("vpos"))

    # Collecting all vpos of a table
    all_vpos = sorted(set(vpos))
    for row_vpos in all_vpos:
        txt_block_row = xml_soup.find_all("textblock", {"vpos": str(row_vpos)})

        # Finding all text in a single row
        row = []
        for txt_block in txt_block_row:
            texts = txt_block.find_all("string")
            content_text = []
            for text in texts:
                string_content = text.get('content')

Solution

I hope I have understood your question correctly. You want to print the offsets of all strings that are inside a table:

from bs4 import BeautifulSoup


txt = '''
<TextBlock ID="Page1_Block1" HEIGHT="52" WIDTH="1918" VPOS="3101" HPOS="148" STYLEREFS="StyleId-4CD32088-9994-4ED5-BD2B-8082FC83356D- font1" LANG="en-US">
    <TextLine HEIGHT="42" WIDTH="1362" VPOS="3101" HPOS="154">
        <String CONTENT="Mafi" HEIGHT="32" WIDTH="74" VPOS="3101" HPOS="154"/>
        <String STYLE="bold" CONTENT="," HEIGHT="10" WIDTH="4" VPOS="3129" HPOS="235"/>
        <SP HEIGHT="36" WIDTH="18" VPOS="3103" HPOS="240"/>
    </TextLine>
</TextBlock>

<ComposedBlock ID="Page1_Block4" HEIGHT="240" WIDTH="1170" VPOS="226" HPOS="143" TYPE="table">
    <TextBlock ID="Page1_Block5" HEIGHT="55" WIDTH="393" VPOS="226" HPOS="143" LANG="en-US" STYLEREFS="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1">
        <TextLine HEIGHT="33" WIDTH="178" VPOS="234" HPOS="154">
            <String CONTENT="some text" HEIGHT="33" WIDTH="178" VPOS="234" HPOS="154"/>
        </TextLine>
    </TextBlock>
</ComposedBlock>
'''

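soup = BeautifulSoup(txt, 'html.parser')

all_strings = []
are_we_inside_table = []
current_index, indexes = 0, []
for s in soup.select('string[content]'):
    all_strings.append(s['content'])
    # find_parent() only matches an enclosing table; find_previous() would also
    # match a table that was already closed earlier in the document
    are_we_inside_table.append(bool(s.find_parent('composedblock', type='table')))
    indexes.append(current_index)
    # the offset advances by the length of every string, inside a table or not
    current_index += len(s['content'])

for s, t, i in zip(all_strings, are_we_inside_table, indexes):
    if t:
        print(i, s)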

Prints:

5 some text

The offset is 5 because "Mafi" and "," are strings too (4 + 1 characters), but they are not inside a table.
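The arithmetic behind that 5 can be checked without any parsing. The sketch below hard-codes the three CONTENT values from the sample (an assumption made purely for illustration) and derives each offset as the running total of the lengths that precede it. The names `contents`, `in_table` and `table_offsets` are made up for this example.

```python
from itertools import accumulate

# CONTENT values in document order, hard-coded from the sample above
contents = ["Mafi", ",", "some text"]
# whether each string sits inside a TYPE="table" ComposedBlock
in_table = [False, False, True]

# running totals of the lengths: [4, 5, 14]
running = list(accumulate(len(c) for c in contents))
# a string starts where the previous one ended, so shift the totals by one
offsets = [0] + running[:-1]

# keep only the strings that are inside a table
table_offsets = [(o, c) for o, c, t in zip(offsets, contents, in_table) if t]
print(table_offsets)  # [(5, 'some text')]
```

Note that these offsets count CONTENT characters only. SP elements are skipped, so the numbers are not positions in a space-joined text.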

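Edit: to restrict the search to strings on page 1 only, you can refine the selector:

...
# the trailing underscore keeps "Page1_" from also matching "Page10_", "Page11_", ...
for s in soup.select('[id^="Page{}_"] string[content]'.format(1)):
    ...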
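One thing to be aware of: once the selector is narrowed to a page, the running offset only counts strings on that page. If the offsets should stay relative to the whole document, one option is to count every string and filter only when reporting. The sketch below shows that idea with the standard library's xml.etree.ElementTree instead of BeautifulSoup. The `<Layout>` root, the trimmed attributes, the `Page10_` block and the function name `table_string_offsets` are assumptions for illustration, and namespaces that a real file may declare are ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a made-up <Layout> root wraps the blocks so that the
# snippet is well-formed XML; attributes are trimmed to the ones used here.
SAMPLE = '''
<Layout>
  <TextBlock ID="Page1_Block1">
    <TextLine>
      <String CONTENT="Mafi"/>
      <String CONTENT=","/>
    </TextLine>
  </TextBlock>
  <ComposedBlock ID="Page1_Block4" TYPE="table">
    <TextBlock ID="Page1_Block5">
      <TextLine><String CONTENT="some text"/></TextLine>
    </TextBlock>
  </ComposedBlock>
  <ComposedBlock ID="Page10_Block2" TYPE="table">
    <TextBlock ID="Page10_Block3">
      <TextLine><String CONTENT="other"/></TextLine>
    </TextBlock>
  </ComposedBlock>
</Layout>
'''


def table_string_offsets(xml_text, page):
    """Return (global_offset, content) for each String in a table on `page`."""
    prefix = "Page{}_".format(page)
    results = []
    offset = 0

    def walk(element, in_wanted_table):
        nonlocal offset
        # Entering a table block: decide whether it belongs to the wanted page.
        if element.tag == "ComposedBlock" and element.get("TYPE") == "table":
            in_wanted_table = element.get("ID", "").startswith(prefix)
        if element.tag == "String" and element.get("CONTENT") is not None:
            content = element.get("CONTENT")
            if in_wanted_table:
                results.append((offset, content))
            # Every string advances the offset, whether or not it is reported.
            offset += len(content)
        for child in element:
            walk(child, in_wanted_table)

    walk(ET.fromstring(xml_text), False)
    return results


offsets_page1 = table_string_offsets(SAMPLE, 1)
offsets_page10 = table_string_offsets(SAMPLE, 10)
print(offsets_page1)
print(offsets_page10)
```

Because the walk passes the table flag down from parent to child, a string counts as table text only while it is nested inside the matching ComposedBlock.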