How to download PDF files from the web with the Python requests library
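I'm trying to download some PDF files from a website with the requests module, but I keep getting the error shown below. I've seen posts saying to use response.content rather than response.text for PDF files, but that still raises the error. I'm not sure how to fix it.
Example link: https://corporate.exxonmobil.com/-/media/Global/Files/worldwide-giving/2018-Worldwide-Giving-Report.pdf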
def scrape_website(link):
    try:
        print("getting content")
        cert = requests.certs.where()
        page = requests.get(link,verify=cert,headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/81.0.4044.141 Safari/537.36"})
        print(page)
        if ".pdf" in link:
            print("the content is a pdf file. downloading..")
            return page.content
        return page.text
    except Exception as x:
        print(x)
        return ''

statement_page = scrape_website(link)
with open(filepath,'w+',encoding="utf-8") as f:
    print("writing page")
    f.write(statement_page)
    f.close()
<ipython-input-42-1e4771d32073> in save_html_page(page,path,filename)
     13     with open(filepath,encoding="utf-8") as f:
     14         print("writing page")
---> 15         f.write(page)
     16         f.close()
     17

TypeError: write() argument must be str, not bytes
Solution
I sometimes need to download content programmatically too. I just use this:
import requests
response = requests.get("https://link_to_thing.pdf")
file = open("myfile.pdf","wb")
file.write(response.content)
file.close()
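The snippet above works because the file is opened in binary mode ("wb"), which accepts the bytes held in response.content. The TypeError in the question comes from opening the file in text mode ('w+' with an encoding) and then writing bytes to it. A minimal sketch of that idea follows; save_payload is a hypothetical helper name, not something from the original posts, and it simply chooses the file mode from the type of the data it is given:

```python
def save_payload(filepath, payload):
    """Write payload to filepath, picking the file mode from its type.

    save_payload is a hypothetical helper used only for illustration.
    """
    if isinstance(payload, (bytes, bytearray)):
        # Binary data (e.g. response.content for a PDF) needs binary mode;
        # binary mode takes no encoding argument.
        with open(filepath, "wb") as f:
            f.write(payload)
    else:
        # Text (e.g. response.text for an HTML page) is written as UTF-8.
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(payload)
```

With a helper like this, the value returned by scrape_website could be passed straight through, whether it is page.content or page.text. The with statement closes the file on exit, so no explicit f.close() is needed.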
Here is an example I have used; it is very handy when you are downloading large PDF files:
import requests
import sys

url = 'url'
filename = 'filename'

# creating a connection to the pdf
print("Creating the connection ...")
with requests.get(url,stream=True) as r:
    if r.status_code != 200:
        print("Could not download the file '{}'\nError Code : {}\nReason : {}\n\n".format(
            url,r.status_code,r.reason),file=sys.stderr)
    else:
        # Storing the file as a pdf
        print("Saving the pdf file :\n\"{}\" ...".format(filename))
        with open(filename,'wb') as f:
            try:
                total_size = int(r.headers['Content-length'])
                saved_size_pers = 0
                moversBy = 8192*100/total_size
                for chunk in r.iter_content(chunk_size=8192):
                    if chunk:
                        f.write(chunk)
                        saved_size_pers += moversBy
                        print("\r=>> %.2f%%" % (
                            saved_size_pers if saved_size_pers <= 100 else 100.0),end='')
                print(end='\n\n')
            except Exception:
                print("==> Couldn't save : {}\\".format(filename))
                f.flush()
                r.close()
        r.close()
It uses iter_content() to download the PDF and save it chunk by chunk.
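The chunk-writing loop can also be factored out so that the progress figure is based on the bytes actually written and a missing Content-Length header does not abort the save. The following is a rough sketch under those assumptions; write_chunks and its total_size parameter are hypothetical names introduced here, not part of the answer above:

```python
def write_chunks(chunks, fileobj, total_size=None):
    """Write an iterable of bytes chunks to fileobj and return the byte count.

    write_chunks is a hypothetical helper. total_size is optional; when it
    is None or 0, no progress percentage is printed.
    """
    written = 0
    for chunk in chunks:
        if not chunk:
            # Skip empty chunks.
            continue
        fileobj.write(chunk)
        written += len(chunk)
        if total_size:
            # Cap at 100% in case more bytes arrive than were announced.
            percent = min(100.0, written * 100 / total_size)
            print("\r=>> %.2f%%" % percent, end="")
    if total_size:
        print()
    return written
```

With requests, this might be called inside the with blocks as write_chunks(r.iter_content(chunk_size=8192), f, int(r.headers.get("Content-Length", 0)) or None), so the size is taken from the header when the server sends one and progress output is skipped otherwise.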