How to scrape data from a PDF into Excel

I am trying to scrape data from a PDF and save it to an Excel file. This is the PDF I need: https://www.medicaljournals.se/acta/content_files/files/pdf/98/219/Suppl219.pdf

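However, I don't need all of the data, only the following, saved to Excel in separate cells: starting on page 5, from P001 up to and including the Introduction. Each entry has a P number, a title, the people's names, and an introduction.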

At the moment I can only convert the PDF file to text (my code is below) and save it all in a single cell, but I need to split it into different cells.

import PyPDF2 as p2

PDFfile = open('Abstract Book from the 5th World Psoriasis and Psoriatic Arthritis Conference 2018.pdf', 'rb')
pdfread = p2.PdfFileReader(PDFfile)

pdflist = []

i = 6
while i<pdfread.getNumPages():
  pageinfo = pdfread.getPage(i)
  #print(pageinfo.extractText())
  i = i + 1

  pdflist.append(pageinfo.extractText().replace('\n',''))

print(pdflist)

Solution

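What you mainly need is a "header" regex that matches a run of at least 15 uppercase letters, and an "article" regex that matches the letter "P" followed by 3 digits. One more regex lets you split the text on any keyword.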
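To see how these patterns behave before running them on the real PDF, here is a small sketch on a made-up string. The sample text, `section_re`, `split_sections` and the `'Preamble'` key are illustrative assumptions, not taken from the PDF or from the code in this answer. It also swaps in a capturing group for the keyword split, so each piece of text stays paired with its own keyword even when an abstract lacks one of the sections.

```python
import re

# Made-up sample that mimics the layout of the extracted text (not real PDF content).
sample = (
    "P001\nA STUDY OF SOMETHING IMPORTANT\nJane Doe, John Roe\n"
    "Introduction: Some intro. Methods: Some methods. Results: Some results."
    "P002\nANOTHER TITLE IN CAPITAL LETTERS\nAnn Poe\n"
    "Introduction: Other intro. Conclusions: Done."
)

article_re = re.compile(r'P\d{3}')  # article number: 'P' followed by 3 digits
# The capturing group makes re.split() return the keywords along with the text.
section_re = re.compile(
    r'(Introduction|Objectives|Methods|Results|Conclusions|References):')

numbers = article_re.findall(sample)     # the P numbers, in order
chunks = article_re.split(sample)[1:]    # drop whatever precedes the first P number


def split_sections(body):
    """Split one article body on the section keywords, keyed by keyword."""
    parts = section_re.split(body)
    # parts == [text before first keyword, keyword, text, keyword, text, ...]
    sections = {'Preamble': parts[0].strip()}  # title and names, still unsplit
    for key, value in zip(parts[1::2], parts[2::2]):
        sections[key] = value.strip()
    return sections


parsed = {num: split_sections(chunk) for num, chunk in zip(numbers, chunks)}

for num, sections in parsed.items():
    print(num, sections)
```

Keeping the P number as the dictionary key also preserves the identifier the question asks for.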

import re
import PyPDF2  # the legacy (pre-3.0) PyPDF2 API is used below, as in the question

article_re = re.compile(r'P\d{3}')  # P001: letter 'P' and 3 digits
header_re = re.compile(r'[A-Z\s\-]{15,}|$')  # min 15 characters: UPPERCASE letters, whitespace (including '\n') and '-'
key_word_delimeters = ['Peoples', 'Introduction', 'Objectives', 'Methods', 'Results', 'Conclusions', 'References']

file = open('data.pdf', 'rb')
pdf = PyPDF2.PdfFileReader(file)

text = ''

for i in range(6, 63):
    text += pdf.getPage(i).extractText()  # all text in one variable

articles = []

for article in re.split(article_re, text):
    match = re.match(header_re, article)    # the title is expected at the start of each chunk
    if match and match.group():             # skip chunks without a title (e.g. an empty first chunk)
        header = match.group()              # get text from match
        other_text = article[match.end():]  # everything after the title
        first_name_letter = header[-1]      # workaround: the greedy pattern also swallows the first capital letter of the name
        header = header[:-1]                # cut last character: the first letter of the name
        header = header.replace('\n', '')   # delete line breaks
        header = header.replace('-', '')    # delete hyphens left by line wrapping
        item = {'header': header}           # store the cleaned title
        other_text = first_name_letter + other_text
        data_array = re.split(
            'Introduction:|Objectives:|Methods:|Results:|Conclusions:|References:', other_text)

        for key, data in zip(key_word_delimeters, data_array):
            item[key] = data.replace('\n', '')
        articles.append(item)
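The loop above stops at a list of dictionaries, so one more step is needed to get each field into its own cell. The following is a minimal sketch using only the standard library: it writes a CSV file, which Excel opens directly. The sample `articles` list, the `columns` order, the `write_articles` helper and the file name `articles.csv` are assumptions made for illustration.

```python
import csv

# Hypothetical stand-in for the `articles` list built by the loop above.
articles = [
    {'header': 'A STUDY TITLE', 'Peoples': 'Jane Doe',
     'Introduction': 'Intro text', 'Methods': 'Method text'},
    {'header': 'ANOTHER TITLE', 'Peoples': 'Ann Poe',
     'Introduction': 'Other intro'},
]

# One spreadsheet column per field, in a fixed order.
columns = ['header', 'Peoples', 'Introduction', 'Objectives', 'Methods',
           'Results', 'Conclusions', 'References']


def write_articles(rows, path):
    """Write one row per article; fields missing from a row become empty cells."""
    # 'utf-8-sig' adds a byte order mark, which helps Excel detect the encoding.
    # newline='' is what the csv module expects for files it writes.
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval='')
        writer.writeheader()
        writer.writerows(rows)


write_articles(articles, 'articles.csv')
```

If a native .xlsx file is required and pandas with openpyxl is installed, `pandas.DataFrame(articles).to_excel('articles.xlsx', index=False)` should give an equivalent sheet.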