
How do I fix Scrapy crashing after three retries?

I want to crawl a website through proxies, but the crawler crashes after the third attempt. Below is the code I am using. My proxy database is large, and I am using the scrapy-rotating-proxies lib, so I pass the proxies in via ROTATING_PROXY_LIST. The crawler starts and, after a while, crashes without trying the next proxy and without downloading the page.

import sqlite3

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor



class TestSpider(CrawlSpider):
    name = 'get_files'
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.response_list = list()   # hrefs already seen
        self.crawled_hrefs = list()   # responses already checked for bans
        self.allowed_domains = [kwargs.get("domain")]
        self.what_to_look = kwargs.get("what_to_look")
        self.start_urls = ["https://www." + kwargs.get("domain")]
        
    def parse_item(self, response):
        if response.status == 200:
            # Collect every link on the page and report the ones whose
            # extension matches what we are looking for.
            all_hrefs = response.css("a::attr(href)").getall()
            print(len(all_hrefs), "<<<<<<")
            for href in all_hrefs:
                if href not in self.response_list:
                    self.response_list.append(href)
                    if href.endswith(self.what_to_look):
                        print(href)
        # A callback must return items/requests or None, never a bool;
        # ban checking belongs in response_is_ban below.
    
    def response_is_ban(self, request, response):
        # Ban-detection hook used by scrapy-rotating-proxies.
        if response not in self.crawled_hrefs:
            self.crawled_hrefs.append(response)
        else:
            print("HELLO")
        return b'banned' in response.body

    def exception_is_ban(self, request, exception):
        # Ban-detection hook for exceptions; returning None means the
        # exception is not treated as a proxy ban.
        print(request, "THERE", exception)
        return None
        
if __name__ == "__main__":
    # Load the proxy pool from the local SQLite database.
    conn = sqlite3.connect("hi.db", check_same_thread=False)
    c = conn.cursor()
    list_of_proxies = list()
    for row in c.execute("SELECT proxy FROM proxies").fetchall():
        list_of_proxies.append(row[0].rstrip())
    c.close()
    conn.close()
    print(len(list_of_proxies))

    custom_settings = {
        "LOG_ENABLED": True,
        "ROTATING_PROXY_LIST": list_of_proxies,
        #"ROTATING_PROXY_LIST": [],
        "DEPTH_LIMIT": 1,
        #"ROTATING_PROXY_BACKOFF_BASE": 3600,
        #"ROTATING_PROXY_BACKOFF_CAP": 3600,
        "ROTATING_PROXY_PAGE_RETRY_TIMES": 5,
        "DOWNLOAD_TIMEOUT": 3,
        "DOWNLOADER_MIDDLEWARES": {
            "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
            "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
        },
    }
    process = CrawlerProcess(custom_settings)
    process.crawl(TestSpider, domain="palaplast.gr", what_to_look=(".pdf", ".img", ".exe"))  #"https://palaplast.gr/katalogos/"
    process.start()
    print("END")

I am getting the error below. How can I get past it so that the crawler keeps checking whether the next proxy works and downloads the target site?

1761
2020-10-21 11:56:41 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-10-21 11:56:41 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0,libxml2 2.9.5,cssselect 1.1.0,parsel 1.6.0,w3lib 1.22.0,Twisted 20.3.0,Python 3.8.5 (tags/v3.8.5:580fbb0,Jul 20 2020,15:43:08) [MSC v.1926 32 bit (Intel)],pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020),cryptography 3.1.1,Platform Windows-10-10.0.19041-SP0
2020-10-21 11:56:41 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-21 11:56:41 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1,'DOWNLOAD_TIMEOUT': 3}
2020-10-21 11:56:41 [scrapy.extensions.telnet] INFO: Telnet Password: 70605a3422c30cb7
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.logstats.LogStats']
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','rotating_proxies.middlewares.RotatingProxyMiddleware','rotating_proxies.middlewares.BanDetectionMiddleware','scrapy.downloadermiddlewares.cookies.CookiesMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-21 11:56:41 [scrapy.core.engine] INFO: Spider opened
2020-10-21 11:56:41 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),scraped 0 items (at 0 items/min)
2020-10-21 11:56:41 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-21 11:56:41 [rotating_proxies.middlewares] INFO: Proxies(good: 0,dead: 0,unchecked: 1695,reanimated: 0,mean backoff time: 0s)
<GET https://www.palaplast.gr> THERE User timeout caused connection failure: Getting https://www.palaplast.gr took longer than 3.0 seconds..
2020-10-21 11:56:44 [scrapy.downloadermiddlewares.retry] DEBUG: retrying <GET https://www.palaplast.gr> (Failed 1 times): User timeout caused connection failure: Getting https://www.palaplast.gr took longer than 3.0 seconds..
<GET https://www.palaplast.gr> THERE User timeout caused connection failure.
2020-10-21 11:56:47 [scrapy.downloadermiddlewares.retry] DEBUG: retrying <GET https://www.palaplast.gr> (Failed 2 times): User timeout caused connection failure.
<GET https://www.palaplast.gr> THERE Could not open CONNECT tunnel with proxy 172.67.182.2:80 [{'status': 409,'reason': b'Conflict'}]
2020-10-21 11:56:47 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.palaplast.gr> (Failed 3 times): Could not open CONNECT tunnel with proxy 172.67.182.2:80 [{'status': 409,'reason': b'Conflict'}]
2020-10-21 11:56:47 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.palaplast.gr>
Traceback (most recent call last):
  File "C:\Users\john\AppData\Local\Programs\Python\python38-32\lib\site-packages\scrapy\core\downloader\middleware.py",line 44,in process_request
    return (yield download_func(request=request,spider=spider))
scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy 172.67.182.2:80 [{'status': 409,'reason': b'Conflict'}]
2020-10-21 11:56:47 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-21 11:56:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 1,'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,'downloader/request_bytes': 648,'downloader/request_count': 3,'downloader/request_method_count/GET': 3,'elapsed_time_seconds': 6.250498,'finish_reason': 'finished','finish_time': datetime.datetime(2020,10,21,8,56,47,867509),'log_count/DEBUG': 2,'log_count/ERROR': 2,'log_count/INFO': 11,'proxies/mean_backoff': 0.0,'proxies/reanimated': 0,'proxies/unchecked': 1695,'retry/count': 2,'retry/max_reached': 1,'retry/reason_count/twisted.internet.error.TimeoutError': 2,'scheduler/dequeued': 3,'scheduler/dequeued/memory': 3,'scheduler/enqueued': 3,'scheduler/enqueued/memory': 3,'start_time': datetime.datetime(2020,10,21,8,56,41,617011)}
2020-10-21 11:56:47 [scrapy.core.engine] INFO: Spider closed (finished)
END

Solution

Sometimes you have to sharpen the axe before you chop the tree. Just change

"ROTATING_PROXY_PAGE_RETRY_TIMES": 5

to

"ROTATING_PROXY_PAGE_RETRY_TIMES": 3600

so it can rotate through 3600 proxies.
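Applied to the custom_settings dict from the question, the fix would look like the sketch below. This is a minimal illustration of the answer's suggestion, not a verified configuration; every value other than ROTATING_PROXY_PAGE_RETRY_TIMES is carried over from the question unchanged.

custom_settings = {
    "LOG_ENABLED": True,
    "ROTATING_PROXY_LIST": list_of_proxies,  # the proxy pool loaded from hi.db
    "DEPTH_LIMIT": 1,
    # Let scrapy-rotating-proxies retry a page with up to 3600 different
    # proxies (instead of 5) before giving up on the URL.
    "ROTATING_PROXY_PAGE_RETRY_TIMES": 3600,
    "DOWNLOAD_TIMEOUT": 3,
    "DOWNLOADER_MIDDLEWARES": {
        "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
        "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
    },
}

With roughly 1700 proxies in the pool, 3600 page retries lets the crawler work through the whole list (about twice over) before a URL is abandoned, rather than dying after a handful of failed attempts.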

