微信公众号搜"智元新知"关注
微信扫一扫可直接关注哦!

如何从PDF文件提取文本和文本坐标?

如何解决如何从PDF文件提取文本和文本坐标?

换行符在最终输出中转换为下划线。这是我发现的最小工作解决方案。

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import pdfpage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import pdfpageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import pdfpageAggregator
import pdfminer

# Open a PDF file.
fp = open('/Users/me/Downloads/test.pdf', 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = pdfpageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = pdfpageInterpreter(rsrcmgr, device)

def parse_obj(lt_objs):

    # loop over the object list
    for obj in lt_objs:

        # if it's a textBox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            print "%6d, %6d, %s" % (obj.bBox[0], obj.bBox[1], obj.get_text().replace('\n', '_'))

        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTfigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in pdfpage.create_pages(document):

    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()

    # extract text from this object
    parse_obj(layout._objs)

解决方法

我想使用PDFMiner从PDF文件中提取所有文本框和文本框坐标。

其他许多Stack Overflow帖子都介绍了如何以有序方式提取所有文本,但是我该如何做获取文本和文本位置的中间步骤呢?

给定一个PDF文件,输出应类似于:

489,41,"Signature"
500,52,"b"
630,202,"a_g_i_r"

版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。