
requests-html: skip a link if rendering the HTML raises a TimeoutError


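I am writing a web scraping script with requests-html. I scrape URLs, then iterate over them and commit the results to a database. I have been able to scrape the links and create a for loop that renders each page and then scrapes specific product information. This works for most links, but for some of them the page does not render and I get a pyppeteer.errors.TimeoutError. I am fine with skipping a few links, since most of the site's information still gets scraped. I tried using try and except as follows: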

    session = HTMLSession()
    for link in productlinks2:
        r = session.get(link)
        try:
            r.html.render(sleep=3,timeout=30)
        except TimeoutError:
            pass

But this still produces:

pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded.

Is there any way to skip the links that cannot be rendered in time? Any help would be greatly appreciated.

Solution

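Did you import the error? The traceback names pyppeteer.errors.TimeoutError, so that is the class to import and catch. Without the import, TimeoutError in the except clause refers to Python's built-in exception.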
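
As a minimal sketch of that fix, the asker's loop can be written so that the exception class to skip on is passed in explicitly. The names `render_links`, `main` and `RenderTimeout` are hypothetical and not part of requests-html; the imports are deferred into `main` only so the helper can be exercised with a stand-in session.

```python
def render_links(session, links, timeout_error, sleep=3, timeout=30):
    """Render each link; skip those whose render step raises timeout_error.

    Returns (rendered_responses, skipped_links).
    """
    rendered, skipped = [], []
    for link in links:
        r = session.get(link)
        try:
            r.html.render(sleep=sleep, timeout=timeout)
        except timeout_error:
            # Rendering took too long: record the link and move on.
            skipped.append(link)
            continue
        rendered.append(r)
    return rendered, skipped


def main(productlinks2):
    # Imported under an alias so it cannot be confused with the
    # built-in TimeoutError.
    from requests_html import HTMLSession
    from pyppeteer.errors import TimeoutError as RenderTimeout

    rendered, skipped = render_links(HTMLSession(), productlinks2, RenderTimeout)
    print(f"rendered {len(rendered)} pages, skipped {len(skipped)}")
```

Calling `main(productlinks2)` with the list from the question would then print how many pages were rendered and how many were skipped, instead of stopping at the first slow page.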

You also need to set a timeout on your session.get() call.

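It depends on the error you get. If the URL is invalid, for example, session.get() raises an error before the page is ever rendered. The following example shows several different errors that can be caught: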

from requests_html import HTMLSession
from requests.exceptions import ConnectionError, InvalidSchema, ReadTimeout
from pyppeteer.errors import TimeoutError

session = HTMLSession()

links = [
    'https://www.google.com/',
    'h**ps://www.google.com/',
    'https://deelay.me/4000/https://www.google.com/',  # 4s of delay to get the page
    'https://www.baaaadurl.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
]

for url in links:
    try:
        r = session.get(url, timeout=3)
        # render timeout long enough for Google but too short for YouTube
        r.html.render(timeout=1)
        print(r.html.find('title', first=True).text, '\n')
    except InvalidSchema as e:
        # error for 'h**ps://www.google.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except ReadTimeout as e:
        # error due to too much delay for
        # 'https://deelay.me/4000/https://www.google.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except ConnectionError as e:
        # error for 'https://www.baaaadurl.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except TimeoutError as e:
        # error if rendering the page times out,
        # as for 'https://www.youtube.com/'
        print(f'For the url "{url}" the error is: {e} \n')

Printed output:

Google 

For the url "h**ps://www.google.com/" the error is: No connection adapters were found for 'h**ps://www.google.com/' 

For the url "https://deelay.me/4000/https://www.google.com/" the error is: HTTPSConnectionPool(host='deelay.me',port=443): Read timed out. (read timeout=3) 

For the url "https://www.baaaadurl.com/" the error is: HTTPSConnectionPool(host='www.baaaadurl.com',port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2596ba6460>: Failed to establish a new connection: [Errno -2] Name or service not known')) 

For the url "https://www.youtube.com/" the error is: Navigation Timeout Exceeded: 1000 ms exceeded. 

Google 

This way you can catch the errors and continue your loop.
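
If the individual error types do not need separate handling, the same idea can be sketched more compactly. This is only an illustration under assumptions: `scrape_titles` and `run` are hypothetical names, and it relies on `requests.exceptions.RequestException` being the base class of the three requests errors caught above, so a single tuple covers both the fetch and the render step.

```python
def scrape_titles(session, urls, skip_on, get_timeout=3, render_timeout=30):
    """Fetch and render each URL, skipping any that raise an error in skip_on.

    Returns ({url: title}, {url: name of the exception class}).
    """
    titles, failures = {}, {}
    for url in urls:
        try:
            r = session.get(url, timeout=get_timeout)
            r.html.render(timeout=render_timeout)
            element = r.html.find('title', first=True)
            # A page without a <title> is stored as None rather than failing.
            titles[url] = element.text if element is not None else None
        except skip_on as e:
            # Remember why the URL was skipped, then continue the loop.
            failures[url] = type(e).__name__
    return titles, failures


def run(urls):
    # Deferred imports: only needed when scraping for real.
    from requests_html import HTMLSession
    from requests.exceptions import RequestException
    from pyppeteer.errors import TimeoutError as RenderTimeout

    return scrape_titles(HTMLSession(), urls, (RequestException, RenderTimeout))
```

Returning the failures alongside the titles keeps a record of which links were skipped and why, which can be useful if they should be retried later with a longer timeout.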
