How to fix Scrapy crashing after three failed attempts
I want to crawl a website through proxies, but the crawler crashes after the third attempt. Below is the code I am using. My proxy database is large, and I am using the scrapy-rotating-proxies library, so I pass the proxies in via ROTATING_PROXY_LIST. The crawler starts and, after a short while, crashes without trying the next proxy and without downloading any pages.
import scrapy, sqlite3
from scrapy.crawler import CrawlerProcess
from rotating_proxies.policy import BanDetectionPolicy
from rotating_proxies.middlewares import RotatingProxyMiddleware
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.project import get_project_settings


class TestSpider(CrawlSpider):
    name = 'get_files'
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def __init__(self, *args, **kwargs):
        super(TestSpider, self).__init__(*args, **kwargs)
        self.response_list = list()
        self.crowled_hrefs = list()
        self.allowed_domains = [kwargs.get("domain")]
        self.what_to_look = kwargs.get("what_to_look")
        self.start_urls = ["https://www." + kwargs.get("domain")]

    def parse_item(self, response):
        if response.status == 200:  # or (response.status == 301)
            all_responses = response.css("a::attr(href)").extract()
            print(len(all_responses), "<<<<<<")
            for res in all_responses:
                if res not in self.response_list:
                    self.response_list.append(res)
                    if res.endswith(self.what_to_look):
                        print(res)
                        print("pdf")
        else:
            return b'banned' in response.body

    # ban-detection hooks picked up by scrapy-rotating-proxies' BanDetectionMiddleware
    def response_is_ban(self, request, response):
        if response not in self.crowled_hrefs:
            self.crowled_hrefs.append(response)
        else:
            print("HELLO")
        return b'banned' in response.body

    def exception_is_ban(self, request, exception):
        print(request, "THERE", exception)
        return None


if __name__ == "__main__":
    # load the proxy list from the local SQLite database
    conn = sqlite3.connect("hi.db", check_same_thread=False)
    c = conn.cursor()
    list_of_proxes = list()
    p = c.execute("SELECT proxy from proxies").fetchall()
    for i in p:
        list_of_proxes.append(i[0].rstrip())
    c.close()
    conn.close()
    print(len(list_of_proxes))

    custom_settings = {
        "LOG_ENABLED": True,
        "ROTATING_PROXY_LIST": list_of_proxes,
        "DEPTH_LIMIT": 1,
        # "ROTATING_PROXY_BACKOFF_BASE": 3600,
        # "ROTATING_PROXY_BACKOFF_CAP": 3600,
        "ROTATING_PROXY_PAGE_RETRY_TIMES": 5,
        "DOWNLOAD_TIMEOUT": 3,
        "DOWNLOADER_MIDDLEWARES": {
            "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
            "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
        },
    }

    process = CrawlerProcess(custom_settings)
    process.crawl(TestSpider, domain="palaplast.gr",
                  what_to_look=(".pdf", ".img", ".exe"))  # "https://palaplast.gr/katalogos/"
    process.start()
    print("END")
This is the error I get. How can I get past it so the crawler keeps trying the remaining proxies and actually downloads the site's pages?
1761
2020-10-21 11:56:41 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-10-21 11:56:41 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0,libxml2 2.9.5,cssselect 1.1.0,parsel 1.6.0,w3lib 1.22.0,Twisted 20.3.0,Python 3.8.5 (tags/v3.8.5:580fbb0,Jul 20 2020,15:43:08) [MSC v.1926 32 bit (Intel)],pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020),cryptography 3.1.1,Platform Windows-10-10.0.19041-SP0
2020-10-21 11:56:41 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-21 11:56:41 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1,'DOWNLOAD_TIMEOUT': 3}
2020-10-21 11:56:41 [scrapy.extensions.telnet] INFO: Telnet Password: 70605a3422c30cb7
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.logstats.LogStats']
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','rotating_proxies.middlewares.RotatingProxyMiddleware','rotating_proxies.middlewares.BanDetectionMiddleware','scrapy.downloadermiddlewares.cookies.CookiesMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-21 11:56:41 [scrapy.core.engine] INFO: Spider opened
2020-10-21 11:56:41 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),scraped 0 items (at 0 items/min)
2020-10-21 11:56:41 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-21 11:56:41 [rotating_proxies.middlewares] INFO: Proxies(good: 0,dead: 0,unchecked: 1695,reanimated: 0,mean backoff time: 0s)
<GET https://www.palaplast.gr> THERE User timeout caused connection failure: Getting https://www.palaplast.gr took longer than 3.0 seconds..
2020-10-21 11:56:44 [scrapy.downloadermiddlewares.retry] DEBUG: retrying <GET https://www.palaplast.gr> (Failed 1 times): User timeout caused connection failure: Getting https://www.palaplast.gr took longer than 3.0 seconds..
<GET https://www.palaplast.gr> THERE User timeout caused connection failure.
2020-10-21 11:56:47 [scrapy.downloadermiddlewares.retry] DEBUG: retrying <GET https://www.palaplast.gr> (Failed 2 times): User timeout caused connection failure.
<GET https://www.palaplast.gr> THERE Could not open CONNECT tunnel with proxy 172.67.182.2:80 [{'status': 409,'reason': b'Conflict'}]
2020-10-21 11:56:47 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.palaplast.gr> (Failed 3 times): Could not open CONNECT tunnel with proxy 172.67.182.2:80 [{'status': 409,'reason': b'Conflict'}]
2020-10-21 11:56:47 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.palaplast.gr>
Traceback (most recent call last):
File "C:\Users\john\AppData\Local\Programs\Python\python38-32\lib\site-packages\scrapy\core\downloader\middleware.py",line 44,in process_request
return (yield download_func(request=request,spider=spider))
scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy 172.67.182.2:80 [{'status': 409,'reason': b'Conflict'}]
2020-10-21 11:56:47 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-21 11:56:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 1,'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,'downloader/request_bytes': 648,'downloader/request_count': 3,'downloader/request_method_count/GET': 3,'elapsed_time_seconds': 6.250498,'finish_reason': 'finished','finish_time': datetime.datetime(2020,10,21,8,56,47,867509),'log_count/DEBUG': 2,'log_count/ERROR': 2,'log_count/INFO': 11,'proxies/mean_backoff': 0.0,'proxies/reanimated': 0,'proxies/unchecked': 1695,'retry/count': 2,'retry/max_reached': 1,'retry/reason_count/twisted.internet.error.TimeoutError': 2,'scheduler/dequeued': 3,'scheduler/dequeued/memory': 3,'scheduler/enqueued': 3,'scheduler/enqueued/memory': 3,'start_time': datetime.datetime(2020,41,617011)}
2020-10-21 11:56:47 [scrapy.core.engine] INFO: Spider closed (finished)
END
Solution
Sometimes you have to sharpen the axe before cutting down the tree. Just change

"ROTATING_PROXY_PAGE_RETRY_TIMES": 5

to

"ROTATING_PROXY_PAGE_RETRY_TIMES": 3600

so that the middleware rotates through up to 3600 proxies for a page instead of giving up after 5.
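For context, here is a minimal sketch of the settings block from the question with only that one value changed (everything else left exactly as in the question); ROTATING_PROXY_PAGE_RETRY_TIMES is the scrapy-rotating-proxies setting that caps how many times a page is retried with a different proxy before the request is dropped:

custom_settings = {
    "LOG_ENABLED": True,
    "ROTATING_PROXY_LIST": list_of_proxes,  # proxies read from the SQLite database in the question
    "DEPTH_LIMIT": 1,
    # retry each failing page with up to 3600 different proxies before giving up,
    # instead of dropping it after only 5 attempts
    "ROTATING_PROXY_PAGE_RETRY_TIMES": 3600,
    "DOWNLOAD_TIMEOUT": 3,
    "DOWNLOADER_MIDDLEWARES": {
        "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
        "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
    },
}

With a pool of roughly 1700 proxies, a retry budget larger than the pool simply means every proxy can be tried at least once before the start URL is abandoned.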