requests-html: skip a link if rendering the HTML raises TimeoutError

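I am writing a web scraping script with requests-html. I scrape URLs, then iterate over them and commit the results to a database. I have been able to scrape the links and write a for loop that renders each page and then scrapes specific product information. This works for most links, but for some the page does not render and I get a pyppeteer.errors.TimeoutError. I am fine with not scraping a few links, since most of the site's information gets scraped anyway. I have tried using try and except as follows: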

    session = HTMLSession()
    for link in productlinks2:
        r = session.get(link)
        try:
            r.html.render(sleep=3,timeout=30)
        except TimeoutError:
            pass

But this still produces:

pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded.

Is there any way to skip the links that fail to render in time? Any help would be greatly appreciated.

Solution

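Did you import the exception? The traceback names pyppeteer.errors.TimeoutError, which is a different class from Python's built-in TimeoutError, so it has to be imported from pyppeteer.errors for the except clause to refer to it.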
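To illustrate why the import matters, here is a minimal stdlib-only sketch. The names `fake_errors`, `fake_render` and `try_render` are hypothetical stand-ins and not part of requests-html or pyppeteer. The sketch assumes, as the traceback in the question suggests, that the library's exception is not a subclass of the built-in one in the asker's environment.

```python
import builtins


class fake_errors:
    """Hypothetical stand-in for a library's errors module."""

    class TimeoutError(Exception):
        """Shares its name with the built-in, but is an unrelated class."""


def fake_render():
    # Stand-in for r.html.render(); it always times out.
    raise fake_errors.TimeoutError("Navigation Timeout Exceeded: 30000 ms exceeded.")


def try_render(exc_type):
    """Return True if `except exc_type` catches the stand-in timeout."""
    try:
        try:
            fake_render()
        except exc_type:
            return True
    except Exception:
        # The inner clause did not match, so the error escaped it.
        return False


caught_by_builtin = try_render(builtins.TimeoutError)
caught_by_library = try_render(fake_errors.TimeoutError)
print(caught_by_builtin, caught_by_library)
```

An except clause matches on the class, not on the name, so only the clause that refers to the library's own class catches the error here.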

You also need to set a timeout for your session.get() call.

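Which exception you get depends on what goes wrong. If the URL itself is bad, for example, session.get() raises an error before the page is ever rendered. Here is an example showing the different errors you can catch: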

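from requests_html import HTMLSession
# note: requests' ConnectionError shadows the built-in of the same name in this module
from requests.exceptions import ConnectionError, InvalidSchema, ReadTimeout
# render() timeouts are reported with pyppeteer's own TimeoutError class
from pyppeteer.errors import TimeoutError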

session = HTMLSession()

links = [
    'https://www.google.com/',
    'h**ps://www.google.com/',
    'https://deelay.me/4000/https://www.google.com/',  # 4s of delay to get the page
    'https://www.baaaadurl.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
]

for url in links:
    try:
        r = session.get(url, timeout=3)
        r.html.render(timeout=1)  # short timeout: enough to render google but not youtube
        print(r.html.find('title', first=True).text, '\n')
    except InvalidSchema as e:
        # raised by session.get() for 'h**ps://www.google.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except ReadTimeout as e:
        # raised by session.get() when the response takes too long, as for
        # 'https://deelay.me/4000/https://www.google.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except ConnectionError as e:
        # raised by session.get() for 'https://www.baaaadurl.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except TimeoutError as e:
        # raised by render() when rendering times out, as for
        # 'https://www.youtube.com/'
        print(f'For the url "{url}" the error is: {e} \n')

Output:

Google 

For the url "h**ps://www.google.com/" the error is: No connection adapters were found for 'h**ps://www.google.com/' 

For the url "https://deelay.me/4000/https://www.google.com/" the error is: HTTPSConnectionPool(host='deelay.me',port=443): Read timed out. (read timeout=3) 

For the url "https://www.baaaadurl.com/" the error is: HTTPSConnectionPool(host='www.baaaadurl.com',port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2596ba6460>: Failed to establish a new connection: [Errno -2] Name or service not known')) 

For the url "https://www.youtube.com/" the error is: Navigation Timeout Exceeded: 1000 ms exceeded. 

Google

This way you can catch the errors and continue your loop.
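If you also want to know afterwards which links were skipped, the catch-and-continue pattern can be wrapped in a small helper. This is only a sketch: `process_all` and `fake_handler` are hypothetical names, and the handler here is a stdlib-only stand-in so the example runs without network access. In a real script the handler would call session.get() and r.html.render(), and `skip_on` would be the tuple of exception classes imported in the answer's code.

```python
def process_all(urls, handler, skip_on):
    """Call handler(url) for each URL; record and skip the ones that raise skip_on."""
    results = {}
    skipped = {}
    for url in urls:
        try:
            results[url] = handler(url)
        except skip_on as exc:
            # Remember why this URL was skipped, then move on to the next one.
            skipped[url] = f"{type(exc).__name__}: {exc}"
    return results, skipped


def fake_handler(url):
    # Stand-in for "fetch and render": fails in two different ways.
    if "slow" in url:
        raise TimeoutError("render timed out")
    if not url.startswith("https://"):
        raise ValueError("bad scheme")
    return url.upper()


urls = ["https://ok.example/", "h**ps://bad.example/", "https://slow.example/"]
results, skipped = process_all(urls, fake_handler, (TimeoutError, ValueError))
print(results)
print(skipped)
```

Passing the exception classes as a tuple keeps the loop body short. Separate except clauses, as in the answer, are the better fit when each error needs its own handling.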