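How to fix PyMuPDF being unable to search certain PDFs
I wrote a small program that watches a folder. As soon as a .pdf file is dropped into that folder, it searches the .pdf for keywords and writes a new .txt file (listing the page numbers) and a new PDF that contains only the pages on which a keyword was found.
It works for most .pdf files, but some show strange behavior. Sometimes it seems to search only the first page and none of the others. If needed, I can provide a link to one of these PDFs.
Here is my code: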
import os
import time

import fitz  # PyMuPDF
from watchdog.events import PatternMatchingEventHandler
from watchdog.observers.polling import PollingObserver

os.chdir("C:/test/")
s1 = ["Siphone"]  # keywords to search for


def on_created(event):
    print("on_created", event.src_path)
    time.sleep(2)  # give the copy into the folder a moment to finish
    txt = "%s.txt" % event.src_path
    output = "%s_done.pdf" % event.src_path
    pdf_document = fitz.open(event.src_path)
    hits = set()  # 0-based numbers of the pages that contain a keyword
    with open(txt, "w") as f:
        for words in s1:
            f.write("%s:" % words)
            for current_page in range(len(pdf_document)):
                # camelCase names from older PyMuPDF releases; newer ones use
                # load_page / search_for / insert_pdf / add_highlight_annot
                page = pdf_document.loadPage(current_page)
                if page.searchFor(words):
                    hits.add(current_page)
                    f.write("%i," % (current_page + 1))  # 1-based in the report
            f.write("\n")
    test_list = sorted(hits)
    print(test_list)
    doc = fitz.open()  # new, empty PDF for the matching pages
    for p in test_list:
        doc.insertPDF(pdf_document, from_page=p, to_page=p)
    pdf_document.close()
    # highlight every hit in the extracted pages
    for page in doc:
        for i in s1:
            for inst in page.searchFor(i):
                page.addHighlightAnnot(inst)
    doc.save(output)  # raises ValueError when no page matched
    doc.close()


if __name__ == "__main__":
    patterns = ["*.pdf"]
    ignore_patterns = ["*done.pdf"]  # skip the PDFs this script writes itself
    ignore_directories = True
    case_sensitive = True
    my_event_handler = PatternMatchingEventHandler(
        patterns=patterns,
        ignore_patterns=ignore_patterns,
        ignore_directories=ignore_directories,
        case_sensitive=case_sensitive,
    )
    my_event_handler.on_created = on_created
    path = "C:/test/"
    go_recursively = True
    my_observer = PollingObserver()
    my_observer.schedule(my_event_handler, path, recursive=go_recursively)
    my_observer.start()
    try:
        while True:
            time.sleep(5)
    except KeyboardInterrupt:
        my_observer.stop()
    my_observer.join()
The following error appears for some PDFs (I assume PyMuPDF cannot read the file correctly and can only search page 0):
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Users\mo\AppData\Local\Programs\Python\python39\lib\threading.py",line 954,in _bootstrap_inner
self.run()
File "C:\Users\mo\AppData\Local\Programs\Python\python39\lib\site-packages\watchdog\observers\api.py",line 199,in run
self.dispatch_events(self.event_queue,self.timeout)
File "C:\Users\mo\AppData\Local\Programs\Python\python39\lib\site-packages\watchdog\observers\api.py",line 372,in dispatch_events
handler.dispatch(event)
File "C:\Users\mo\AppData\Local\Programs\Python\python39\lib\site-packages\watchdog\events.py",line 382,in dispatch
super().dispatch(event)
File "C:\Users\mo\AppData\Local\Programs\Python\python39\lib\site-packages\watchdog\events.py",line 261,in dispatch
{
File "C:\all\pdf\final_pdf_suche.py",line 51,in on_created
doc.save(output)
File "C:\Users\mo\AppData\Local\Programs\Python\python39\lib\site-packages\fitz\fitz.py",line 4206,in save
raise ValueError("cannot save with zero pages")
ValueError: cannot save with zero pages
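The ValueError at the end of the traceback is raised by `doc.save(output)` when the new document received no pages, i.e. when no page matched. A minimal sketch of a guard follows. The helper name `save_if_not_empty` is hypothetical, and it only assumes that `doc` supports `len()`, `save()` and `close()`, as the document object in the script does. It avoids the crash in the watcher thread but does not explain why the keyword is not found.

```python
def save_if_not_empty(doc, output):
    """Save doc to output only if it has pages; return True when saved.

    `doc` is assumed to behave like a PyMuPDF document: it supports len()
    and offers save() and close(). The helper name is hypothetical.
    """
    try:
        if len(doc) == 0:
            # Nothing matched, so saving would raise
            # "ValueError: cannot save with zero pages".
            print("no page matched, nothing written to", output)
            return False
        doc.save(output)
        return True
    finally:
        doc.close()  # release the document in both cases
```

In the handler, the final `doc.save(output)` and `doc.close()` calls could be replaced by `save_if_not_empty(doc, output)`.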
The word occurs several times in the PDF, but it is not found.
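One way to narrow this down is to check how much extractable text each page actually exposes. This rests on an assumption that the post does not confirm: the affected pages may have no searchable text layer, for example because they are scanned images, and a search can only match text that can be extracted. The sketch below uses the hypothetical helpers `text_length_per_page` and `report`. It calls `page.get_text()`, the current PyMuPDF name; older releases spell it `page.getText()`.

```python
def text_length_per_page(pages, get_text):
    """Return [(page_number, n_chars), ...] with 1-based page numbers.

    `get_text` is a callable that returns the plain text of one page.
    """
    return [
        (number, len(get_text(page).strip()))
        for number, page in enumerate(pages, start=1)
    ]


def report(path):
    """Print the amount of extractable text on every page of a PDF."""
    import fitz  # PyMuPDF; imported here so the helper above stays dependency-free

    doc = fitz.open(path)
    try:
        for number, chars in text_length_per_page(doc, lambda p: p.get_text()):
            print("page %d: %d characters of extractable text" % (number, chars))
    finally:
        doc.close()
```

If `report("C:/test/problem.pdf")` (a placeholder path) prints 0 characters for the pages on which the word is visibly present, those pages probably contain images rather than text, and the search has nothing to match against.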