
Text extracted from a presentation file with python-pptx comes out in the wrong order


I am trying to extract the text from PowerPoint text boxes with the following code:

from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

def iter_textframed_shapes(shapes):
    """Generate shape objects in *shapes* that can contain text.

    Shape objects are generated in document order (z-order), bottom to top.
    Group shapes are searched recursively, so nested groups are covered too.
    """
    for shape in shapes:
        # ---recurse on group shapes, including groups nested in groups---
        if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
            yield from iter_textframed_shapes(shape.shapes)
            continue

        # ---otherwise, treat shape as a "leaf" shape---
        if shape.has_text_frame:
            yield shape

prs = Presentation(path_to_my_prs)
 
for slide in prs.slides:
    textable_shapes = list(iter_textframed_shapes(slide.shapes))
    ordered_textable_shapes = sorted(
        textable_shapes, key=lambda shape: (shape.top, shape.left)
    )

    for shape in ordered_textable_shapes:
        print(shape.text)

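But sometimes the text box at the end of the slide is extracted first, sometimes one from the middle, and so on. How can I fix my code so that the text comes out in the correct order (left to right, top to bottom)?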
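One likely cause is that sorting on the exact `(top, left)` pair compares positions in EMU (914400 per inch). Two boxes that look like they sit on the same row almost never have identical `top` values, so a box on the right that is a hair higher sorts before the box on its left. A possible workaround is to bucket shapes into rows with a tolerance and only then sort each row by `left`. The sketch below is an illustration, not python-pptx functionality: `reading_order` and `row_tolerance` are hypothetical names, the quarter-inch default is an arbitrary assumption, and the demo uses stand-in objects instead of real shapes so it runs without a presentation file.

```python
from types import SimpleNamespace

EMU_PER_INCH = 914400  # python-pptx reports positions in English Metric Units


def reading_order(shapes, row_tolerance=EMU_PER_INCH // 4):
    """Return *shapes* sorted top to bottom, then left to right within a row.

    Hypothetical helper: shapes whose `top` lies within `row_tolerance` EMU
    of the first shape in a row are treated as belonging to that row.
    A missing position (None) is treated as 0 (an assumption of this sketch).
    """
    by_top = sorted(shapes, key=lambda s: (s.top or 0, s.left or 0))

    rows = []  # each row: {"anchor": top of its first shape, "members": [...]}
    for shape in by_top:
        top = shape.top or 0
        if rows and top - rows[-1]["anchor"] <= row_tolerance:
            rows[-1]["members"].append(shape)
        else:
            rows.append({"anchor": top, "members": [shape]})

    ordered = []
    for row in rows:
        ordered.extend(sorted(row["members"], key=lambda s: s.left or 0))
    return ordered


# Stand-ins for shapes: "right" is slightly higher than "left" on the same row.
demo = [
    SimpleNamespace(name="right", top=1000, left=5_000_000),
    SimpleNamespace(name="left", top=1200, left=100),
    SimpleNamespace(name="below", top=3_000_000, left=0),
]
print([s.name for s in reading_order(demo)])  # ['left', 'right', 'below']
```

In the loop over slides, this would replace the `sorted(...)` call, for example `ordered_textable_shapes = reading_order(textable_shapes)`. The tolerance probably needs tuning per deck. A second thing worth checking, if the slides use group shapes: as far as I understand, the `top` and `left` of a shape inside a group are expressed in the group's own child coordinate system, so they may not be directly comparable with the positions of top-level shapes.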
