How do I access a PDF file with Python through an automatic download link?
I'm trying to create an automated Python script that goes to a web page like this, finds the link at the bottom of the body text (anchor text "here"), and then downloads the PDF that loads when that download link is clicked. I'm able to retrieve the HTML from the original page and find the download link, but I don't know how to get the link to the PDF from there. Any help would be much appreciated. Here's what I have so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Open the page and locate the href for the bill text
url = 'https://www.murphy.senate.gov/newsroom/press-releases/murphy-blumenthal-introduce-legislation-to-create-a-national-green-bank-thousands-of-clean-energy-jobs'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

links = []
for link in soup.findAll('a', href=True, text=['HERE', 'here', 'Here']):
    links.append(link.get('href'))
links2 = [x for x in links if x is not None]

# Open the download link to get the PDF
html = urlopen(links2[0])
soup = BeautifulSoup(html, 'html.parser')

links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
links2 = [x for x in links if x is not None]
At this point the list of links I get does not include the PDF I'm looking for. Is there any way to get it without hardcoding the PDF link in my script (which would defeat the purpose of what I'm trying to do here)? Thanks!
Solution

Find the `a` element whose text is `here`, then follow where it leads.
import requests
from bs4 import BeautifulSoup

url = 'https://www.murphy.senate.gov/newsroom/press-releases/murphy-blumenthal-introduce-legislation-to-create-a-national-green-bank-thousands-of-clean-energy-jobs'
user_agent = {'User-agent': 'Mozilla/5.0'}

s = requests.Session()
r = s.get(url, headers=user_agent)
soup = BeautifulSoup(r.content, 'html.parser')

for a in soup.select('a'):
    if a.text == 'here':
        href = a['href']
        r = s.get(href, headers=user_agent)
        print(r.status_code, r.reason)
        print(r.headers)
        # The download page redirects via an HTTP Refresh header;
        # everything after "url=" is the real PDF address
        _, dl_url = r.headers['refresh'].split('url=', 1)
        r = s.get(dl_url, headers=user_agent)
        print(r.headers)
        file_bytes = r.content  # here's your PDF; you can write it out to a file
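Once `file_bytes` holds the response content, saving it to disk is a one-liner; a minimal sketch (the bytes and the `bill.pdf` filename below are stand-ins for the real download):

```python
from pathlib import Path

# Stand-in for r.content from the request above.
file_bytes = b"%PDF-1.4 minimal stand-in content"

out_path = Path("bill.pdf")
out_path.write_bytes(file_bytes)

# Every PDF begins with the %PDF magic bytes -- a quick sanity check that
# you saved the document itself and not an HTML error page.
assert out_path.read_bytes().startswith(b"%PDF")
```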