How to fix text extraction with Python and Anaconda in VS Code
I am using Python 3.7.10 in VS Code on Windows 10 64-bit.
My level: beginner
What do I want to do? I have a bilingual PDF, and I want to extract the English part, sentence by sentence, so essentially the whole document. In the end I want another PDF containing only the English part.
What have I done? I tried different approaches, but none of them seem to work.
So far I was able to import, in my Jupyter notebook, the packages I found under this thread: how to extract text from PDF file using python, i never did this and not getting the DOM of PDF file.
The packages are in the directory, and they are also referenced in settings.json in VS Code:
{
    "python.pythonPath": "C:\\Users\\rajaw\\Anaconda3\\envs\\condaenv\\python.exe",
    "jupyter.jupyterServerType": "local"
}
Where am I stuck? I don't really understand the problem, even though it seems to be common: a lot of people have looked for solutions to similar issues. The best-known suggestion is that the package might live in another directory, but it doesn't. Also, I cannot find a guide on how to find out which directory my notebook retrieves packages from, or how to tell my notebook in VS Code which directory it should use. I assume this can be done in settings.json. So my question is: the script from the link above cannot find my PDF, and there is also a Shell error.
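One standard-library way to see which interpreter and package directories the notebook kernel is actually using (nothing here is specific to my setup; it works in any notebook cell or script):

```python
import sys

# The interpreter the kernel is running; this should point into the
# Anaconda environment configured in settings.json.
print(sys.executable)

# The environment root directory.
print(sys.prefix)

# The directories Python searches when importing packages.
for p in sys.path:
    print(p)
```

If `sys.executable` is not the `condaenv` interpreter from settings.json, the notebook kernel is using a different environment than the one where the packages were installed.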
My code:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# write a for-loop to open many files -- leave a comment if you'd like to learn how
filename = 'q.pdf'
# open allows you to read the file
pdfFileObj = open(filename, 'rb')
# The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""
# The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
# This if statement checks whether the library above returned any words.
# It's needed because PyPDF2 cannot read scanned files.
if text != "":
    text = text
# If the above returned nothing, we run the OCR library textract to
# convert scanned/image-based PDF files into text.
else:
    # textract returns bytes, so decode before tokenizing
    text = textract.process('C:/Users/rajaw/Documents/DA_Project/q.pdf',
                            method='tesseract', language='eng').decode('utf-8')
# Now we have a text variable which contains all the text derived from our
# PDF file. Type print(text) to see what it contains. It likely contains a
# lot of spaces, possibly junk such as '\n' etc.
# Now we will clean our text variable and return it as a list of keywords.
# The word_tokenize() function will break our text phrases into individual words
tokens = word_tokenize(text)
# we'll create a new list which contains punctuation we wish to clean
punctuations = ['(', ')', ';', ':', '[', ']', ',']
# We initialize the stopwords variable, which is a list of words like
# "The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
# A list comprehension which only keeps words that are NOT in stop_words
# and NOT in punctuations.
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
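Since the script opens 'q.pdf' with a relative path, a quick standard-library check of where that path actually resolves (this is a generic diagnostic, not part of the extraction script):

```python
from pathlib import Path

pdf_path = Path('q.pdf')   # relative to the current working directory

# The directory a relative 'q.pdf' is resolved against; in a notebook
# this is usually the notebook's folder, which may not be where the PDF is.
print(Path.cwd())

# The absolute path that open('q.pdf', 'rb') will actually try.
print(pdf_path.resolve())

# False here means open() would raise FileNotFoundError.
print(pdf_path.exists())
```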
The error message:

FileNotFoundError                         Traceback (most recent call last)
~\AppData\Roaming\Python\python37\site-packages\textract\parsers\utils.py in run(self, args)
     83             args,
---> 84             stdout=subprocess.PIPE, stderr=subprocess.PIPE,
     85         )

~\Anaconda3\envs\myenv\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
    799                                 errread, errwrite,
--> 800                                 restore_signals, start_new_session)
    801         except:

~\Anaconda3\envs\myenv\lib\subprocess.py in _execute_child(self, p2cread, p2cwrite, c2pread, c2pwrite, errread, unused_restore_signals, unused_start_new_session)
   1206                          os.fspath(cwd) if cwd is not None else None,
-> 1207                          startupinfo)
   1208             finally:

**FileNotFoundError: [WinError 2] Das System kann die angegebene Datei nicht finden** [German: "The system cannot find the specified file"]

During handling of the above exception, another exception occurred:

ShellError                                Traceback (most recent call last)
<ipython-input-17-745599b9b2eb> in <module>
     23 # If the above returns as False, we run the OCR library textract to convert scanned/image based PDF files into text
     24 else:
---> 25     text = textract.process('C:/Users/rajaw/Documents/DA_Project/q.pdf', language='eng')
     26 # Now we have a text variable which contains all the text derived from our PDF file. Type print(text) to see what it contains. It likely contains a lot of spaces, possibly junk such as '\n' etc.
     27 # Now, and return it as a list of keywords.

~\AppData\Roaming\Python\python37\site-packages\textract\parsers\__init__.py in process(filename, extension, **kwargs)
     75
     76     parser = filetype_module.Parser()
---> 77     return parser.process(filename, **kwargs)
     78
     79

~\AppData\Roaming\Python\python37\site-packages\textract\parsers\utils.py in process(self, filename, **kwargs)
     44         # output encoding
     45         # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 46         byte_string = self.extract(filename, **kwargs)
     47         unicode_string = self.decode(byte_string)
     48         return self.encode(unicode_string, encoding)

~\AppData\Roaming\Python\python37\site-packages\textract\parsers\pdf_parser.py in extract(self, method, **kwargs)
     31             return self.extract_pdfminer(filename, **kwargs)
     32         elif method == 'tesseract':
---> 33             return self.extract_tesseract(filename, **kwargs)
     34         else:
     35             raise UnknownMethod(method)

~\AppData\Roaming\Python\python37\site-packages\textract\parsers\pdf_parser.py in extract_tesseract(self, **kwargs)
     55         contents = []
     56         try:
---> 57             stdout, _ = self.run(['pdftoppm', base])
     58
     59             for page in sorted(os.listdir(temp_dir)):

~\AppData\Roaming\Python\python37\site-packages\textract\parsers\utils.py in run(self, args)
     89         # This is equivalent to getting exitcode 127 from sh
     90         raise exceptions.ShellError(
---> 91             ' '.join(args), 127, '',
     92         )
     93

**ShellError: The command `pdftoppm C:/Users/rajaw/Documents/DA_Project/q.pdf C:\Users\rajaw\AppData\Local\Temp\tmpd_xu6afp\conv` failed with exit code 127
------------- stdout -------------
------------- stderr -------------**
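From what I have read, exit code 127 means the shell could not find the command at all; the traceback shows textract shelling out to pdftoppm, which comes with Poppler rather than with any pip package. A minimal standard-library check of whether pdftoppm is visible to the notebook kernel (a generic sketch, assuming nothing about the install):

```python
import os
import shutil

# shutil.which returns the full path of an executable, or None if it
# cannot be found on the current PATH.
location = shutil.which('pdftoppm')
if location is None:
    print('pdftoppm is not on PATH')
else:
    print('pdftoppm found at', location)

# The PATH the kernel actually sees; it can differ from your terminal's.
print(os.environ.get('PATH', ''))
```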