Extracting references from a PDF - Python

How to extract references from a PDF - Python

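PDF is quite complex and I'm not an expert, but I looked at the source code of extractText() to see how it works, and by adding print('>>>', operator, operands) to it I could see the values it finds in the PDF.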

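This document uses "Tm" to move the position to a new line, so I changed the original code of extractText() to add \n on every "Tm", and I got the text split into lines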

Arto Anttila. 1995. How to recognise subjects in 
English. In Karlsson et al., chapt. 9, pp. 315-358. 
Dekang Lin. 1996. Evaluation of Principar with the 
Susanne corpus. In John Carroll, editor, Work- 
shop on Robust Parsing, pages 54-69, Prague. 
Jason M. Eisner. 1996. Three new probabilistic 
models for dependency parsing: An exploration. 
In The 16th International Conference on Compu- 
tational Linguistics, pages 340-345. copenhagen. 
David G. Hays. 1964. Dependency theory: A 
formalism and some observations. Language, 
40(4):511-525.

or, with --- printed between the lines

---
Arto Anttila. 1995. How to recognise subjects in 
---
English. In Karlsson et al., chapt. 9, pp. 315-358. 
---
Dekang Lin. 1996. Evaluation of Principar with the 
---
Susanne corpus. In John Carroll, editor, Work- 
---
shop on Robust Parsing, pages 54-69, Prague. 
---
Jason M. Eisner. 1996. Three new probabilistic 
---
models for dependency parsing: An exploration. 
---
In The 16th International Conference on Compu- 
---
tational Linguistics, pages 340-345. copenhagen. 
---
David G. Hays. 1964. Dependency theory: A 
---
formalism and some observations. Language, 
---
40(4):511-525.

It is still not that useful, but here is the code I used to get this result

import PyPDF2
from PyPDF2.pdf import *  # to import functions used in original `extractText`

# --- functions ---

def myExtractText(self):  
    # code from original `extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        # new code to add `\n` when text moves to new line
        elif operator == b_("Tm"):
            text += '\n'

    return text

# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    text += myExtractText(page)  # modified function

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')

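After some digging, it seems "Tm" also carries values, including the new position x, y, which can be used to calculate the distance between text lines and add \n only when the distance is bigger than some value. I tested different values, and with the value 17 I got the expected result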

---
Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al., chapt. 9, pp. 315-358. 
---
Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus. In John Carroll, editor, Work- shop on Robust Parsing, pages 54-69, Prague. 
---
Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In The 16th International Conference on Compu- tational Linguistics, pages 340-345. copenhagen. 
---
David G. Hays. 1964. Dependency theory: A formalism and some observations. Language, 40(4):511-525. 
---
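
The distance rule can be tried out in isolation, without PyPDF2. This is only a minimal sketch: the function name join_lines_by_distance, the (y, text) input format and the sample positions are made up for illustration, and only the threshold 17 and the two conditions come from the answer.

```python
def join_lines_by_distance(lines, max_gap=17):
    """Group text lines into entries using the vertical distance between them.

    `lines` is a list of (y, text) pairs in reading order. In PDF coordinates
    y grows upward, so y gets smaller as the text moves down the page.
    """
    entries = []
    prev_y = None
    for y, text in lines:
        diff_y = 0 if prev_y is None else prev_y - y
        # Start a new entry on a bigger gap, or when y jumps back up
        # (the text continues at the top of the next column).
        if prev_y is None or diff_y > max_gap or diff_y < 0:
            entries.append(text)
        else:
            entries[-1] += " " + text
        prev_y = y
    return entries


# Hypothetical positions: 12 units between lines, 20 between entries.
sample = [
    (700, "Arto Anttila. 1995. How to recognise subjects in"),
    (688, "English. In Karlsson et al., chapt. 9, pp. 315-358."),
    (668, "Dekang Lin. 1996. Evaluation of Principar with the"),
    (656, "Susanne corpus."),
    (720, "David G. Hays. 1964. Dependency theory: A"),  # top of next column
]

for entry in join_lines_by_distance(sample):
    print(entry)
    print('---')
```

With these made-up positions the five lines collapse into three entries, because only the 20-unit gap and the jump back up start a new one.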

Here is the PyPDF2 code that produces this result

import PyPDF2
from PyPDF2.pdf import *  # to import functions used in original `extractText`

# --- functions ---

def myExtractText2(self):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    prev_x = 0
    prev_y = 0

    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        elif operator == b_("Tm"):
            x = operands[-2]
            y = operands[-1]

            diff_x = prev_x - x
            diff_y = prev_y - y

            #print('>>>', diff_x, diff_y - y)
            #text += f'| {diff_x}, {diff_y - y} |'

            if diff_y > 17 or diff_y < 0:  # (bigger margin) or (move to top in next column)
                text += '\n'
                #text += '\n' # to add empty line between elements

            prev_x = x
            prev_y = y

    return text

# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    text += myExtractText2(page)  # modified function

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')

It works for this PDF, but other files may have a different structure or a different distance between references, and may need other changes.
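
For another file, one way to get a starting value instead of guessing is to look at the gaps between consecutive y positions collected from "Tm". The sketch below is an assumption of mine, not part of the tested code: the helper name suggest_threshold, the factor 1.5 and the sample positions are hypothetical, and the result still has to be checked against the real output.

```python
from collections import Counter


def suggest_threshold(ys, factor=1.5):
    """Suggest a distance threshold from a list of y positions in reading order.

    The most common positive gap is taken as the normal line spacing, and the
    threshold is set somewhat above it (`factor` is an arbitrary choice).
    Returns None when there are no positive gaps to look at.
    """
    gaps = [round(prev - cur) for prev, cur in zip(ys, ys[1:]) if prev - cur > 0]
    if not gaps:
        return None
    line_spacing = Counter(gaps).most_common(1)[0][0]
    return line_spacing * factor


# Hypothetical y values: lines 12 apart, one 20-unit gap, one column jump.
ys = [700, 688, 668, 656, 644, 720, 708]
print(suggest_threshold(ys))
```

The negative gap (the jump to the top of the next column) is ignored here, because the extraction code already treats it as a break on its own.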

A more generic version: it takes a second argument

If you run it without the second argument

 text += myExtractText(page)

then it works like the original extractText() and you get everything in one string.

If the second argument is True

 text += myExtractText(page, True)

then it adds a new line after every "Tm", like my first version.

If the second argument is an integer, e.g. 17

 text += myExtractText(page, 17)

then it adds a new line when the distance is bigger than 17, like my second version.

import PyPDF2
from PyPDF2.pdf import *  # to import functions used in original `extractText`

# --- functions ---

def myExtractText(self, distance=None):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    prev_x = 0
    prev_y = 0

    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        if operator == b_("Tm"):

            if distance is True: 
                text += '\n'

            elif isinstance(distance, int):
                x = operands[-2]
                y = operands[-1]

                diff_x = prev_x - x
                diff_y = prev_y - y

                #print('>>>', diff_x, diff_y - y)
                #text += f'| {diff_x}, {diff_y - y} |'

                if diff_y > distance or diff_y < 0:  # (bigger margin) or (move to top in next column)
                    text += '\n'
                    #text += '\n' # to add empty line between elements

                prev_x = x
                prev_y = y

    return text

# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function

    #text += myExtractText(page)        # modified function (works like original version)
    #text += myExtractText(page, True)  # modified function (add `\n` after every `Tm`)
    text += myExtractText(page, 17)  # modified function (add `\n` only if distance is bigger than `17`)

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')

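It is useful not only for the References text but for the rest of the text too: it seems to split the text into paragraphs.

Result for the beginning of the PDF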

---
A non-projective dependency parser 
---
Pasi Tapanainen and Timo J~irvinen University of Helsinki, Department of General Linguistics Research Unit for Multilingual Language Technology P.O. Box 4, FIN-00014 University of Helsinki, Finland {Pas i. Tapanainen, Timo. Jarvinen}@l ing. Hel s inki. f i 
---
Abstract 
---
We describe a practical parser for unre- stricted dependencies. The parser creates links between words and names the links according to their syntactic functions. We first describe the older Constraint Gram- mar parser where many of the ideas come from. Then we proceed to describe the cen- tral ideas of our new parser. Finally, the parser is evaluated. 
---
1 Introduction 
---
We are concerned with surface-syntactic parsing of running text. Our main goal is to describe syntac- tic analyses of sentences using dependency links that show the he~t-modifier relations between words. In addition, these links have labels that refer to the syntactic function of the modifying word. A simpli- fied example is in figure 1, where the link between I and see denotes that I is the modifier of see and its syntactic function is that of subject. Similarly, a modifies bird, and it is a determiner. 
---
see bi i ~ d'~b~ bird 
---
figure 1: Dependencies for sentence I see a bird. 
---
First, in this paper, we explain some central con- cepts of the Constraint Grammar framework from which many of the ideas are derived. Then, we give some linguistic background to the notations we are using, with a brief comparison to other current de- pendency formalisms and systems. New formalism is described briefly, and it is utilised in a small toy grammar to illustrate how the formalism works. Fi- nally, the real parsing system, with a grammar of some 2 500 rules, is evaluated. 
---
64 
---
The parser corresponds to over three man-years of work, which does not include the lexical analyser and the morphological disambiguator, both parts of the existing English Constraint Grammar parser (Karls- son et al., 1995). The parsers can be tested via WWW t . 
---
2 Background 
---
Our work is partly based on the work done with the Constraint Grammar framework that was orig- inally proposed by Fred Karlsson (1990). A de- tMled description of the English Constraint Gram- mar (ENGCG) is in Karlsson et al. (1995). The basic rule types of the Constraint Grammar (Tapanainen, 1996) 2 are REMOVE and SELECT for discarding and se- lecting an alternative reading of a word. Rules also have contextual tests that describe the condition ac- cording to which they may be applied. For example, the rule 
---
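
The joined lines still contain words that were hyphenated at a line break, such as "Work- shop" or "Compu- tational". A possible clean-up step is sketched below; it is not part of the code above, the name join_hyphenated is hypothetical, and the rule is only a heuristic: it would also glue together a real compound such as "corpus- based" if that happened to be split at the hyphen.

```python
import re


def join_hyphenated(text):
    """Remove a line-break hyphen: letter, '-', whitespace, lowercase letter.

    Page ranges such as "118- 145" and spaced dashes such as "Syntax - An"
    are left alone because the hyphen is not directly between letters.
    """
    return re.sub(r"(?<=[A-Za-z])-\s+(?=[a-z])", "", text)


line = "In John Carroll, editor, Work- shop on Robust Parsing, pages 54-69, Prague."
print(join_hyphenated(line))
```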

Original question

In my Python project I need to extract the REFERENCES section from PDF research papers. I'm using PyPDF2 to read the PDF and extract text from it like this.

import PyPDF2

pdfFileObj = open('fileName.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageCount = pdfReader.numPages
count = 0
text = ''

while count < pageCount:
    pageObj = pdfReader.getPage(count)
    count +=1
    text += pageObj.extractText()

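Now text can be in any format, and I can't identify any headings in it. I can't use find('References') because the paper can contain that word in other places too. Some papers have a number before the heading, e.g. 6 REFERENCES, so I could add a regular expression for that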

But I'm stuck on the papers that don't have any number before the heading.
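
The regular expression idea could look roughly like the sketch below. It rests on two assumptions that will not hold for every paper: the heading is written as "References" or "REFERENCES" (so a lowercase "references" inside a sentence is skipped), and the last such match is the real heading because the section normally comes at the end. The names HEADING and text_after_references are made up for this example.

```python
import re

# Optional section number ("6 " or "6. ") followed by the capitalised heading.
HEADING = re.compile(r"(?:\b\d+\.?\s+)?\b(?:References|REFERENCES)\b")


def text_after_references(text):
    """Return the text after the last heading-like match, or None if none is found."""
    matches = list(HEADING.finditer(text))
    if not matches:
        return None
    return text[matches[-1].end():].lstrip()


numbered = "as the references show. 6 REFERENCES Arto Anttila. 1995."
unnumbered = "See References [1]. More text. References Arto Anttila. 1995."

print(text_after_references(numbered))
print(text_after_references(unnumbered))
```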

This is the PDF I'm currently working with: A non-projective dependency parser

And this is the references text I get

References Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al.,chapt. 9,pp. 315-358. Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus. In John Carroll,editor,Work- shop on Robust Parsing,pages 54-69,Prague. Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In The 16th International Conference on Compu- tational Linguistics,pages 340-345. Copenhagen. David G. Hays. 1964. Dependency theory: A formalism and some observations. Language,40(4):511-525. Hans Jiirgen Heringer. 1993. Dependency syntax - basic ideas and the classical model. In Joachim Jacobs,Arnim von Stechow,Wolfgang Sternefeld,and Thee Venneman,editors,Syntax - An In- ternational Handbook of Contemporary Research,volume 1,chapter 12,pages 298-316. Walter de Gruyter,Berlin - New York. Richard Hudson. 1991. English Word Grammar. Basil Blackwell,Cambridge,MA. Arvi Hurskainen. 1996. Disambiguation of morpho- logical analysis in Bantu languages. In The 16th International Conference on Computational Lin- guistics,pages 568-573. Copenhagen. Time J~rvinen. 1994. Annotating 200 million words: the Bank of English project. In The 15th International Conference on Computational Lin- guistics Proceedings,pages 565-568. Kyoto. Fred Karlsson,Atro Voutilainen,Juha Heikkil~,and Arto Anttila,editors. 1995. Constraint Gram- mar: a language-independent system for parsing unrestricted text,volume 4 of Natural Language Processing. Mouton de Gruyter,Berlin and N.Y. Fred Karlsson. 1990. Constraint grammar as a framework for parsing running text. In Hans Karl- gren,Papers presented to the 13th Interna- tional Conference on Computational Linguistics,volume 3,pages 168-173,Helsinki,Finland. Michael McCord. 1990. Slot grammar: A system for simpler construction of practical natural language grammars. In lq,Studer,Natural Language and Logic: International Scientific Symposium,Lecture Notes in Computer Science,pages 118- 145. Springer,Berlin. Igor A. 
Mel'~uk. 1987. Dependency Syntax: Theory and Practice. State University of New York Press,Albany. Christer Samuelsson,Pasi Tapanainen,and Atro Voutilainen. 1996. Inducing constraint gram- mars. In Laurent Miclet and Colin de la Higuera,Grammatical Inference: Learning Syntax from Sentences,volume 1147 of Lecture Notes in Artificial Intelligence,pages 146-155,Springer. Daniel Sleator and Davy Temperley. 1991. Parsing English with a link grammar. Technical Report CMU-CS-91-196,Carnegie Mellon University. Pasi Tapanainen and Time J/irvinen. 1994. Syn- tactic analysis of natural language using linguis- tic rules and corpus-based patterns. In The 15th International Conference on Computational Lin- guistics Proceedings,pages 629-634. Kyoto. Pasi Tapanainen. 1996. The Constraint Grammar Parser CG-2. Number 27 in Publications of the Department of General Linguistics,University of Helsinki. Lucien TesniSre. 1959. l~ldments de syntaxe stvuc- turale,l~ditions Klincksieck,Paris. Atro Voutilainen. 1995. Morphological disambigua- tion. In Karlsson et al.,chapter 6,pages 165-284. 71

How can I parse this References string into the multiple references listed in the PDF? Any help would be appreciated.