How to fix text extraction with Python and Anaconda in VS Code
I am using Python 3.7.10 in VS Code on Windows 10 64-bit.
My level: beginner
What do I want to do? I have a bilingual PDF, and I want to extract the English part, sentence by sentence, so essentially the whole document. In the end I want another PDF containing only the English part.
What have I done? I tried different approaches, but none of them seem to work.
So far I was able to import, in my Jupyter notebook, the packages I found under this thread: how to extract text from PDF file using python, i never did this and not getting the DOM of PDF file.
The packages are in the directory, and they are also referenced in settings.json in VS Code:
{
    "python.pythonPath": "C:\\Users\\rajaw\\Anaconda3\\envs\\condaenv\\python.exe",
    "jupyter.jupyterServerType": "local"
}
Where am I stuck? I don't really understand the problem, even though it seems to be common: a lot of people have looked for solutions to similar issues. The best-known suggestion is that the package might live in another directory, but it doesn't. Also, I cannot find a guide on how to find out which directory my notebook retrieves packages from, or how to tell my notebook in VS Code which directory it should use. I assume this can be done in settings.json. So my question is: the script from the link above cannot find my PDF, and there is also a Shell error.
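One standard-library way to see which interpreter and package directories the notebook kernel is actually using (nothing here is specific to my setup; it works in any notebook cell or script):

```python
import sys

# The interpreter the kernel is running; this should point into the
# Anaconda environment configured in settings.json.
print(sys.executable)

# The environment root directory.
print(sys.prefix)

# The directories Python searches when importing packages.
for p in sys.path:
    print(p)
```

If `sys.executable` is not the `condaenv` interpreter from settings.json, the notebook kernel is using a different environment than the one where the packages were installed.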
My code:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# write a for-loop to open many files -- leave a comment if you'd like to learn how
filename = 'q.pdf'
# open allows you to read the file
pdfFileObj = open(filename, 'rb')
# The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""
# The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
# This if statement checks whether the library above returned any words.
# It's needed because PyPDF2 cannot read scanned files.
if text != "":
    text = text
# If the above returned nothing, we run the OCR library textract to
# convert scanned/image-based PDF files into text.
else:
    # textract returns bytes, so decode before tokenizing
    text = textract.process('C:/Users/rajaw/Documents/DA_Project/q.pdf',
                            method='tesseract', language='eng').decode('utf-8')
# Now we have a text variable which contains all the text derived from our
# PDF file. Type print(text) to see what it contains. It likely contains a
# lot of spaces, possibly junk such as '\n' etc.
# Now we will clean our text variable and return it as a list of keywords.
# The word_tokenize() function will break our text phrases into individual words
tokens = word_tokenize(text)
# we'll create a new list which contains punctuation we wish to clean
punctuations = ['(', ')', ';', ':', '[', ']', ',']
# We initialize the stopwords variable, which is a list of words like
# "The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
# A list comprehension which only keeps words that are NOT in stop_words
# and NOT in punctuations.
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
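Since the script opens 'q.pdf' with a relative path, a quick standard-library check of where that path actually resolves (this is a generic diagnostic, not part of the extraction script):

```python
from pathlib import Path

pdf_path = Path('q.pdf')   # relative to the current working directory

# The directory a relative 'q.pdf' is resolved against; in a notebook
# this is usually the notebook's folder, which may not be where the PDF is.
print(Path.cwd())

# The absolute path that open('q.pdf', 'rb') will actually try.
print(pdf_path.resolve())

# False here means open() would raise FileNotFoundError.
print(pdf_path.exists())
```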
The error message:

FileNotFoundError                         Traceback (most recent call last)
~\AppData\Roaming\Python\python37\site-packages\textract\parsers\utils.py in run(self, args)
     83             args,
---> 84             stdout=subprocess.PIPE, stderr=subprocess.PIPE,
     85         )

~\Anaconda3\envs\myenv\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
    799                                 errread, errwrite,
--> 800                                 restore_signals, start_new_session)
    801         except:

~\Anaconda3\envs\myenv\lib\subprocess.py in _execute_child(self, p2cread, p2cwrite, c2pread, c2pwrite, errread, unused_restore_signals, unused_start_new_session)
   1206                          os.fspath(cwd) if cwd is not None else None,
-> 1207                          startupinfo)
   1208             finally:

**FileNotFoundError: [WinError 2] Das System kann die angegebene Datei nicht finden** [German: "The system cannot find the specified file"]

During handling of the above exception, another exception occurred:

ShellError                                Traceback (most recent call last)
<ipython-input-17-745599b9b2eb> in <module>
     23 # If the above returns as False, we run the OCR library textract to convert scanned/image based PDF files into text
     24 else:
---> 25     text = textract.process('C:/Users/rajaw/Documents/DA_Project/q.pdf', language='eng')
     26 # Now we have a text variable which contains all the text derived from our PDF file. Type print(text) to see what it contains. It likely contains a lot of spaces, possibly junk such as '\n' etc.
     27 # Now, and return it as a list of keywords.

~\AppData\Roaming\Python\python37\site-packages\textract\parsers\__init__.py in process(filename, extension, **kwargs)
     75
     76     parser = filetype_module.Parser()
---> 77     return parser.process(filename, **kwargs)
     78
     79

~\AppData\Roaming\Python\python37\site-packages\textract\parsers\utils.py in process(self, filename, **kwargs)
     44         # output encoding
     45         # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 46         byte_string = self.extract(filename, **kwargs)
     47         unicode_string = self.decode(byte_string)
     48         return self.encode(unicode_string, encoding)

~\AppData\Roaming\Python\python37\site-packages\textract\parsers\pdf_parser.py in extract(self, method, **kwargs)
     31             return self.extract_pdfminer(filename, **kwargs)
     32         elif method == 'tesseract':
---> 33             return self.extract_tesseract(filename, **kwargs)
     34         else:
     35             raise UnknownMethod(method)

~\AppData\Roaming\Python\python37\site-packages\textract\parsers\pdf_parser.py in extract_tesseract(self, **kwargs)
     55         contents = []
     56         try:
---> 57             stdout, _ = self.run(['pdftoppm', base])
     58
     59             for page in sorted(os.listdir(temp_dir)):

~\AppData\Roaming\Python\python37\site-packages\textract\parsers\utils.py in run(self, args)
     89         # This is equivalent to getting exitcode 127 from sh
     90         raise exceptions.ShellError(
---> 91             ' '.join(args), 127, '',
     92         )
     93

**ShellError: The command `pdftoppm C:/Users/rajaw/Documents/DA_Project/q.pdf C:\Users\rajaw\AppData\Local\Temp\tmpd_xu6afp\conv` failed with exit code 127
------------- stdout -------------
------------- stderr -------------**
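From what I have read, exit code 127 means the shell could not find the command at all; the traceback shows textract shelling out to pdftoppm, which comes with Poppler rather than with any pip package. A minimal standard-library check of whether pdftoppm is visible to the notebook kernel (a generic sketch, assuming nothing about the install):

```python
import os
import shutil

# shutil.which returns the full path of an executable, or None if it
# cannot be found on the current PATH.
location = shutil.which('pdftoppm')
if location is None:
    print('pdftoppm is not on PATH')
else:
    print('pdftoppm found at', location)

# The PATH the kernel actually sees; it can differ from your terminal's.
print(os.environ.get('PATH', ''))
```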