How to handle requests-html: skip a link if rendering the HTML raises a TimeoutError
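I am writing a web scraping script with requests-html. I scrape URLs, then loop over them and commit the results to a database. I have been able to scrape the links and write a for loop that renders each page and then scrapes specific product information. This works for most links, but for some the page does not render and I get a pyppeteer.errors.TimeoutError. I am fine with skipping some links, since most of the site's information gets scraped anyway. I tried using try and except as follows: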
session = HTMLSession()
for link in productlinks2:
    r = session.get(link)
    try:
        r.html.render(sleep=3, timeout=30)
    except TimeoutError:
        pass
But this still raises:
pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded.
Is there any way to skip the links that fail to render in time? Any help would be appreciated.
Solution
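Did you import the error class? If pyppeteer.errors.TimeoutError is not imported, the name TimeoutError in your except clause refers to Python's built-in exception, which may not match the one pyppeteer raises. You also need to set a timeout on your session.get() call.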
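To illustrate why the import matters, here is a minimal sketch that runs without requests-html or pyppeteer. The fake_errors class and the render() function are hypothetical stand-ins, not part of any library; they only show that an except clause matches on the exception class itself, not on its name.

```python
import builtins


class fake_errors:
    """Hypothetical stand-in for a library's errors module."""

    class TimeoutError(Exception):
        """Shares a name with the built-in TimeoutError but is a separate class."""


def render():
    # Stand-in for a render call that times out.
    raise fake_errors.TimeoutError("Navigation Timeout Exceeded")


def caught_by(exc_type) -> bool:
    """Return True if an `except exc_type` clause catches the error from render()."""
    try:
        render()
    except exc_type:
        return True
    except Exception:
        return False
    return False


print(caught_by(builtins.TimeoutError))     # False: same name, unrelated class
print(caught_by(fake_errors.TimeoutError))  # True: the class that was actually raised
```

Whether the built-in name happens to match a real library's exception depends on that library's class hierarchy, so importing the exact class you expect is the safer choice.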
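Which exception to catch depends on where the failure happens: if the URL is wrong, for example, session.get() raises an error before the page is ever rendered. The following example shows the different errors you can catch: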
from requests_html import HTMLSession
from requests.exceptions import ConnectionError, InvalidSchema, ReadTimeout
from pyppeteer.errors import TimeoutError

session = HTMLSession()
links = [
    'https://www.google.com/',
    'h**ps://www.google.com/',
    'https://deelay.me/4000/https://www.google.com/',  # 4s of delay to get the page
    'https://www.baaaadurl.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
]
for url in links:
    try:
        r = session.get(url, timeout=3)
        r.html.render(timeout=1)  # timeout short enough to render google but not youtube
        print(r.html.find('title', first=True).text, '\n')
    except InvalidSchema as e:
        # error for 'h**ps://www.google.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except ReadTimeout as e:
        # error due to too much delay for
        # 'https://deelay.me/4000/https://www.google.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except ConnectionError as e:
        # error for 'https://www.baaaadurl.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except TimeoutError as e:
        # error if rendering the page times out,
        # as with 'https://www.youtube.com/'
        print(f'For the url "{url}" the error is: {e} \n')
Output:
Google
For the url "h**ps://www.google.com/" the error is: No connection adapters were found for 'h**ps://www.google.com/'
For the url "https://deelay.me/4000/https://www.google.com/" the error is: HTTPSConnectionPool(host='deelay.me',port=443): Read timed out. (read timeout=3)
For the url "https://www.baaaadurl.com/" the error is: HTTPSConnectionPool(host='www.baaaadurl.com',port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2596ba6460>: Failed to establish a new connection: [Errno -2] Name or service not known'))
For the url "https://www.youtube.com/" the error is: Navigation Timeout Exceeded: 1000 ms exceeded.
Google
This way you can catch the errors and continue your loop.
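If you also want to know which links were skipped, the catch-and-continue loop can be wrapped in a small helper. This is only a sketch: the name process_skipping_failures and its parameters are hypothetical, and the fetch callable is supplied by the caller, so the block itself needs neither requests-html nor network access.

```python
from typing import Callable, Iterable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def process_skipping_failures(
    links: Iterable[str],
    fetch: Callable[[str], T],
    skip_on: Tuple[Type[BaseException], ...],
) -> Tuple[List[Tuple[str, T]], List[Tuple[str, BaseException]]]:
    """Call fetch(link) for every link, skipping links that raise one of skip_on.

    Returns (results, skipped): results pairs each link with the value fetch
    returned; skipped pairs each failed link with the exception it raised.
    """
    results: List[Tuple[str, T]] = []
    skipped: List[Tuple[str, BaseException]] = []
    for link in links:
        try:
            results.append((link, fetch(link)))
        except skip_on as exc:
            # Record the failure and move on to the next link.
            skipped.append((link, exc))
    return results, skipped


if __name__ == "__main__":
    # Fake fetch function used only for demonstration.
    def fake_fetch(link: str) -> str:
        if "slow" in link:
            raise RuntimeError("render timed out")
        return f"rendered {link}"

    ok, failed = process_skipping_failures(
        ["https://a.example/", "https://slow.example/"], fake_fetch, (RuntimeError,)
    )
    print(ok)
    print([link for link, _ in failed])
```

With requests-html, fetch would be a function that calls session.get(link, timeout=3) followed by r.html.render(timeout=30), and skip_on would list the exception classes imported in the answer above. Exceptions not listed in skip_on still propagate, so unexpected errors are not silently hidden.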