
Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=https://www.nicobuyscars.com

How to resolve "Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=https://www.nicobuyscars.com"

I am running parsechecker on the URL https://www.nicobuyscars.com and get this output: Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=https://www.nicobuyscars.com

What is the problem, and how can I fix it? I tried changing the agent name, but that did not help. Please help.

Solution

It looks like the server is blocking requests based on the User-Agent request header. This can be reproduced with another HTTP client (wget):

$> wget --header='User-Agent: mycrawler/Nutch-1.17' https://www.nicobuyscars.com
--2020-09-25 11:08:19--  https://www.nicobuyscars.com/
Resolving www.nicobuyscars.com (www.nicobuyscars.com)... 205.147.88.151
Connecting to www.nicobuyscars.com (www.nicobuyscars.com)|205.147.88.151|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2020-09-25 11:08:19 ERROR 403: Forbidden.

$> wget https://www.nicobuyscars.com
--2020-09-25 11:08:27--  https://www.nicobuyscars.com/
Resolving www.nicobuyscars.com (www.nicobuyscars.com)... 205.147.88.151
Connecting to www.nicobuyscars.com (www.nicobuyscars.com)|205.147.88.151|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’
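
The same comparison can be scripted. The following is a minimal Python sketch, not part of the original answer: `probe`, `run_demo`, and `UABlockingHandler` are hypothetical names, and the blocking rule in the toy local server (reject any User-Agent containing "nutch") is only an assumption about how such a server might behave. The second User-Agent string is an illustrative stand-in for wget's default. The sketch runs against a local server so it does not send test traffic to the real site.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class UABlockingHandler(BaseHTTPRequestHandler):
    """Toy server: answers 403 when the User-Agent contains 'nutch' (assumed rule)."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        status = 403 if "nutch" in agent.lower() else 200
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"blocked\n" if status == 403 else b"ok\n")

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep the demo quiet


def probe(url, user_agent):
    """Return the HTTP status code the server sends for the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the code is what we want.
        return err.code


def run_demo():
    """Start the toy server on a free port and probe it with two User-Agents."""
    server = HTTPServer(("127.0.0.1", 0), UABlockingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_port
    agents = ("mycrawler/Nutch-1.17", "Wget/1.20 (linux-gnu)")
    try:
        return {agent: probe(url, agent) for agent in agents}
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for agent, status in run_demo().items():
        print(status, agent)
```

Against the toy server, the Nutch-style agent is rejected and the other one is accepted, which mirrors the two wget runs above. If the status differs only with the User-Agent, the block is header-based rather than IP-based.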

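In any case, use polite settings for Nutch: a large fetcher.server.delay, continued respect for robots.txt, and so on. The server most likely implements further heuristics to detect and block bots.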
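
As a rough illustration of the polite settings mentioned above, a fragment for conf/nutch-site.xml might look like the following. The values are assumptions chosen for illustration, not recommendations from the answer: `mycrawler` is taken from the User-Agent used in the wget test, and 10.0 seconds is simply a delay larger than the usual default.

```xml
<configuration>
  <!-- Agent name sent in the User-Agent header (illustrative value) -->
  <property>
    <name>http.agent.name</name>
    <value>mycrawler</value>
  </property>
  <!-- Seconds to wait between successive requests to the same server (illustrative value) -->
  <property>
    <name>fetcher.server.delay</name>
    <value>10.0</value>
  </property>
</configuration>
```

A longer delay reduces load on the target server but does not by itself lift a block based on the User-Agent header.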
