How to extract references from a PDF - Python
A PDF is very complex and I'm not an expert, but I took the source code of extractText() to see how it works, and I used print('>>>', operator, operands) inside it to see what values it finds in the PDF.
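For reference, a minimal sketch of such an inspection loop is below, using the same old PyPDF2 1.x API as the rest of this post and the example file A97-1011.pdf used later; it only prints the raw operators and does not extract anything yet.

import PyPDF2
from PyPDF2.pdf import ContentStream

# open the example paper and take its first page
reader = PyPDF2.PdfFileReader(open('A97-1011.pdf', 'rb'))
page = reader.getPage(0)

# decode the page's content stream, the same way the original extractText() does
content = page["/Contents"].getObject()
if not isinstance(content, ContentStream):
    content = ContentStream(content, page.pdf)

# print every PDF operator and its operands, e.g. Tj/TJ (show text) and Tm (set text matrix / position)
for operands, operator in content.operations:
    print('>>>', operator, operands)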
In this document, "Tm" is used to move the position to a new line, so I changed the code of the original extractText() to add a \n whenever it meets "Tm", and I got the text in lines:
Arto Anttila. 1995. How to recognise subjects in
English. In Karlsson et al., chapt. 9, pp. 315-358.
Dekang Lin. 1996. Evaluation of Principar with the
Susanne corpus. In John Carroll, editor, Work-
shop on Robust Parsing, pages 54-69, Prague.
Jason M. Eisner. 1996. Three new probabilistic
models for dependency parsing: An exploration.
In The 16th International Conference on Compu-
tational Linguistics, pages 340-345. copenhagen.
David G. Hays. 1964. Dependency theory: A
formalism and some observations. Language,
40(4):511-525.
Or, with --- between the lines:
---
Arto Anttila. 1995. How to recognise subjects in
---
English. In Karlsson et al., chapt. 9, pp. 315-358.
---
Dekang Lin. 1996. Evaluation of Principar with the
---
Susanne corpus. In John Carroll, editor, Work-
---
shop on Robust Parsing, pages 54-69, Prague.
---
Jason M. Eisner. 1996. Three new probabilistic
---
models for dependency parsing: An exploration.
---
In The 16th International Conference on Compu-
---
tational Linguistics, pages 340-345. copenhagen.
---
David G. Hays. 1964. Dependency theory: A
---
formalism and some observations. Language,
---
40(4):511-525.
It still isn't that useful. Anyway, here is the code I used to get this result:
import PyPDF2
from PyPDF2.pdf import *    # to import functions used in the original `extractText`

# --- functions ---

def myExtractText(self):
    # code from original `extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()
    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    for operands, operator in content.operations:
        # used only for tests, to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        # new code to add `\n` when text moves to a new line
        elif operator == b_("Tm"):
            text += '\n'

    return text

# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''
for page in pdfReader.pages:
    #text += page.extractText()  # original function
    text += myExtractText(page)  # modified function

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')
After digging further, it turned out that Tm also carries the new position x, y, which can be used to calculate the distance between lines of text and to add a \n when the distance is bigger than some value. I tested different values, and with the value 17 I got the expected result:
---
Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al., chapt. 9, pp. 315-358.
---
Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus. In John Carroll, editor, Work- shop on Robust Parsing, pages 54-69, Prague.
---
Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In The 16th International Conference on Compu- tational Linguistics, pages 340-345. copenhagen.
---
David G. Hays. 1964. Dependency theory: A formalism and some observations. Language, 40(4):511-525.
---
Here is the code:
import PyPDF2
from PyPDF2.pdf import *    # to import functions used in the original `extractText`

# --- functions ---

def myExtractText(self):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()
    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    prev_x = 0
    prev_y = 0

    for operands, operator in content.operations:
        # used only for tests, to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        elif operator == b_("Tm"):
            x = operands[-2]
            y = operands[-1]

            diff_x = prev_x - x
            diff_y = prev_y - y
            #print('>>>', diff_x, diff_y - y)
            #text += f'| {diff_x}, {diff_y - y} |'

            if diff_y > 17 or diff_y < 0:  # (bigger margin) or (move to top in next column)
                text += '\n'
                #text += '\n'  # to add an empty line between elements

            prev_x = x
            prev_y = y

    return text

# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''
for page in pdfReader.pages:
    #text += page.extractText()  # original function
    text += myExtractText(page)  # modified function

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')
It works for this PDF, but other files may have a different structure or different distances between references, and they may need other changes.

A more universal version: it takes a second argument.

If you run it without the second argument

text += myExtractText(page)

then it works like the original extractText() and you get everything in one string.

If the second argument is True

text += myExtractText(page, True)

then it adds a \n after every Tm, like in my first version.

If the second argument is an integer, e.g. 17

text += myExtractText(page, 17)

then it adds a new line only when the distance is bigger than 17, like in my second version.
import PyPDF2
from PyPDF2.pdf import *    # to import functions used in the original `extractText`

# --- functions ---

def myExtractText(self, distance=None):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()
    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    prev_x = 0
    prev_y = 0

    for operands, operator in content.operations:
        # used only for tests, to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        if operator == b_("Tm"):
            if distance is True:
                text += '\n'
            elif isinstance(distance, int):
                x = operands[-2]
                y = operands[-1]

                diff_x = prev_x - x
                diff_y = prev_y - y
                #print('>>>', diff_x, diff_y - y)
                #text += f'| {diff_x}, {diff_y - y} |'

                if diff_y > distance or diff_y < 0:  # (bigger margin) or (move to top in next column)
                    text += '\n'
                    #text += '\n'  # to add an empty line between elements

                prev_x = x
                prev_y = y

    return text

# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''
for page in pdfReader.pages:
    #text += page.extractText()         # original function
    #text += myExtractText(page)        # modified function (works like the original version)
    #text += myExtractText(page, True)  # modified function (adds `\n` after every `Tm`)
    text += myExtractText(page, 17)     # modified function (adds `\n` only if distance is bigger than `17`)

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')
It turned out to be useful not only for the References text but for the rest of the text too: it seems to split it into paragraphs. The result for the beginning of the PDF:
---
A non-projective dependency parser
---
Pasi Tapanainen and Timo J~irvinen University of Helsinki, Department of General Linguistics Research Unit for Multilingual Language Technology P.O. Box 4, FIN-00014 University of Helsinki, Finland {Pas i. Tapanainen, Timo. Jarvinen}@l ing. Hel s inki. f i
---
Abstract
---
We describe a practical parser for unre- stricted dependencies. The parser creates links between words and names the links according to their syntactic functions. We first describe the older Constraint Gram- mar parser where many of the ideas come from. Then we proceed to describe the cen- tral ideas of our new parser. Finally, the parser is evaluated.
---
1 Introduction
---
We are concerned with surface-syntactic parsing of running text. Our main goal is to describe syntac- tic analyses of sentences using dependency links that show the he~t-modifier relations between words. In addition, these links have labels that refer to the syntactic function of the modifying word. A simpli- fied example is in figure 1, where the link between I and see denotes that I is the modifier of see and its syntactic function is that of subject. Similarly, a modifies bird, and it is a determiner.
---
see bi i ~ d'~b~ bird
---
figure 1: Dependencies for sentence I see a bird.
---
First, in this paper, we explain some central con- cepts of the Constraint Grammar framework from which many of the ideas are derived. Then, we give some linguistic background to the notations we are using, with a brief comparison to other current de- pendency formalisms and systems. New formalism is described briefly, and it is utilised in a small toy grammar to illustrate how the formalism works. Fi- nally, the real parsing system, with a grammar of some 2 500 rules, is evaluated.
---
64
---
The parser corresponds to over three man-years of work, which does not include the lexical analyser and the morphological disambiguator, both parts of the existing English Constraint Grammar parser (Karls- son et al., 1995). The parsers can be tested via WWW t .
---
2 Background
---
Our work is partly based on the work done with the Constraint Grammar framework that was orig- inally proposed by Fred Karlsson (1990). A de- tMled description of the English Constraint Gram- mar (ENGCG) is in Karlsson et al. (1995). The basic rule types of the Constraint Grammar (Tapanainen, 1996) 2 are REMOVE and SELECT for discarding and se- lecting an alternative reading of a word. Rules also have contextual tests that describe the condition ac- cording to which they may be applied. For example, the rule
---
The question
In my Python project I need to extract the REFERENCES section from PDF research papers. I use PyPDF2 to read the PDF and extract its text like this:
import PyPDF2

pdfFileObj = open('fileName.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageCount = pdfReader.numPages
count = 0
text = ''

while count < pageCount:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
Now text can be in any format and I'm not able to identify any headings in it. I can't simply use find('References'), because the paper can contain that word anywhere else too. Some papers put a number before the heading, e.g. 6 REFERENCES, so I could add a regex for that, but I got stuck on papers that don't have any numeric value before the heading.
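For what it's worth, a rough sketch of such a heading search is below; the pattern and the last-match heuristic are just assumptions for illustration and may need tuning per paper.

import re

# assumption: the heading is the word "References"/"REFERENCES", optionally
# preceded by a section number such as "6"; since the same word can also occur
# in the body text, keep the LAST match, as the heading is normally near the end
pattern = re.compile(r'(?:\d+\s+)?references\b', re.IGNORECASE)

matches = list(pattern.finditer(text))
if matches:
    references_text = text[matches[-1].end():]
else:
    references_text = ''   # heading not found - a different strategy is needed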
Here is the PDF I'm currently working with: A non-projective dependency parser. And this is what I get for the references:
References Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al.,chapt. 9,pp. 315-358. Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus. In John Carroll,editor,Work- shop on Robust Parsing,pages 54-69,Prague. Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In The 16th International Conference on Compu- tational Linguistics,pages 340-345. Copenhagen. David G. Hays. 1964. Dependency theory: A formalism and some observations. Language,40(4):511-525. Hans Jiirgen Heringer. 1993. Dependency syntax - basic ideas and the classical model. In Joachim Jacobs,Arnim von Stechow,Wolfgang Sternefeld,and Thee Venneman,editors,Syntax - An In- ternational Handbook of Contemporary Research,volume 1,chapter 12,pages 298-316. Walter de Gruyter,Berlin - New York. Richard Hudson. 1991. English Word Grammar. Basil Blackwell,Cambridge,MA. Arvi Hurskainen. 1996. Disambiguation of morpho- logical analysis in Bantu languages. In The 16th International Conference on Computational Lin- guistics,pages 568-573. Copenhagen. Time J~rvinen. 1994. Annotating 200 million words: the Bank of English project. In The 15th International Conference on Computational Lin- guistics Proceedings,pages 565-568. Kyoto. Fred Karlsson,Atro Voutilainen,Juha Heikkil~,and Arto Anttila,editors. 1995. Constraint Gram- mar: a language-independent system for parsing unrestricted text,volume 4 of Natural Language Processing. Mouton de Gruyter,Berlin and N.Y. Fred Karlsson. 1990. Constraint grammar as a framework for parsing running text. In Hans Karl- gren,Papers presented to the 13th Interna- tional Conference on Computational Linguistics,volume 3,pages 168-173,Helsinki,Finland. Michael McCord. 1990. Slot grammar: A system for simpler construction of practical natural language grammars. In lq,Studer,Natural Language and Logic: International Scientific Symposium,Lecture Notes in Computer Science,pages 118- 145. Springer,Berlin. Igor A. Mel'~uk. 1987. Dependency Syntax: Theory and Practice. State University of New York Press,Albany. Christer Samuelsson,Pasi Tapanainen,and Atro Voutilainen. 1996. Inducing constraint gram- mars. In Laurent Miclet and Colin de la Higuera,Grammatical Inference: Learning Syntax from Sentences,volume 1147 of Lecture Notes in Artificial Intelligence,pages 146-155,Springer. Daniel Sleator and Davy Temperley. 1991. Parsing English with a link grammar. Technical Report CMU-CS-91-196,Carnegie Mellon University. Pasi Tapanainen and Time J/irvinen. 1994. Syn- tactic analysis of natural language using linguis- tic rules and corpus-based patterns. In The 15th International Conference on Computational Lin- guistics Proceedings,pages 629-634. Kyoto. Pasi Tapanainen. 1996. The Constraint Grammar Parser CG-2. Number 27 in Publications of the Department of General Linguistics,University of Helsinki. Lucien TesniSre. 1959. l~ldments de syntaxe stvuc- turale,l~ditions Klincksieck,Paris. Atro Voutilainen. 1995. Morphological disambigua- tion. In Karlsson et al.,chapter 6,pages 165-284. 71
How can I parse this references string into the individual references mentioned in the PDF? Any help would be appreciated.