
Scrapy: Couldn't bind: 24: Too many open files

How to solve Scrapy "Couldn't bind: 24: Too many open files"?

I started getting this error:

2020-09-04 20:45:25 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.url.com/> (Failed 2 times): Couldn't bind: 24: Too many open files.

I am running Scrapy on Ubuntu and saving the results into a Django database (Postgres).
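To see what is actually holding the descriptors, I can count the open files of the running Scrapy process. A minimal sketch (Linux only; SPIDER_PID is a hypothetical placeholder for the real PID):

import os

SPIDER_PID = 12345  # hypothetical: PID of the running scrapy process

fd_dir = f"/proc/{SPIDER_PID}/fd"
open_fds = os.listdir(fd_dir)
print(len(open_fds), "file descriptors open")

# Resolving the symlinks shows whether they are sockets, the log file,
# database connections, etc.
for fd in open_fds[:20]:
    try:
        print(fd, "->", os.readlink(os.path.join(fd_dir, fd)))
    except OSError:
        pass  # descriptor may have closed between listdir() and readlink()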

I don't know where the problem is. Here is what I have:

class Profilesspider(BaseSpiderMixin, scrapy.Spider):
    name = 'db_profiles_spider'

    custom_settings = {
        'CONCURRENT_REQUESTS': 20,
        'LOG_FILE': 'profiles_spider.log',
        'DOWNLOAD_TIMEOUT': 30,
        'DNS_TIMEOUT': 30,
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
        'RETRY_TIMES': 1,
        'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    }

    def start_requests(self):
        self._lock()  # creates lock file
        self.load_websites()
        self.buffer = []

        for website in self.websites:
            try:
                yield scrapy.Request(website.url, self.parse, meta={'website': website})
            except ValueError:
                continue

    def parse(self, response: Response):

        meta = response.meta
        website = meta['website']
        meta_tags = utils.meta_tags.extract_meta_tags(response)

        ....

        website.profile_scraped_at = now()
        website.save()
        profile.save()

    def error(self, failure):

        # log all failures
        meta = failure.request.meta
        website = meta['website']

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            website.set_response_code(response.status, save=False)

        elif failure.check(DNSLookupError):
            website.set_response_code(WebSite.RESPONSE_CODE__DNS_LOOKUP_ERROR, save=False)

        elif failure.check(TimeoutError, TCPTimedOutError):
            website.set_response_code(WebSite.RESPONSE_CODE__TIMEOUT, save=False)
        else:
            website.set_response_code(WebSite.RESPONSE_CODE__UNKNOWN, save=False)

        website.scraped_at = now()
        website.save()
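For context: the error() callback above only runs for requests created with Scrapy's errback parameter, while the Request in start_requests currently passes just self.parse. A sketch of the wiring I would expect, in case that detail matters:

# in start_requests(), instead of the current Request line (sketch only):
yield scrapy.Request(
    website.url,
    callback=self.parse,
    errback=self.error,
    meta={'website': website},
)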

These are my settings:

CONCURRENT_REQUESTS = 60
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 4
# CONCURRENT_REQUESTS_PER_IP = 4
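As I understand it, each concurrent request holds at least one socket (one file descriptor), and DNS lookups run on the reactor thread pool on top of that, so these settings bound how many descriptors one crawl keeps open. A deliberately throttled configuration I could compare against (example values only, not my real settings):

# Hypothetical throttled settings for a comparison run
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 2
REACTOR_THREADPOOL_MAXSIZE = 10   # default; threads used for DNS resolution
DOWNLOAD_TIMEOUT = 30
RETRY_TIMES = 1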

Do you have any idea where the problem is?

EDIT:

I also ran: ulimit -n 1000000
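Since ulimit -n only changes the limit for that shell and its children, I am not sure the spider process started from Django actually inherits it. A small check I can log from inside the spider process (standard library resource module):

import resource

# e.g. print or log this at the top of start_requests()
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE soft=%s hard=%s" % (soft, hard))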

EDIT2:

I am launching the spider from a Django admin action via subprocess:

def runspider__profiles(modeladmin, request, queryset):
    ids = '.'.join([str(x) for x in queryset.values_list('id', flat=True)])
    cmd = ' '.join(["nohup", settings.CRAWL_SH_ABS_PATH, "db_profiles_spider", "ids", ids, '&'])
    subprocess.call(cmd, shell=True)
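The spider therefore inherits the file-descriptor limit of the Django process that calls subprocess, not the limit of my interactive shell. A sketch of lifting the child's soft limit to its hard limit before exec (the helper name is mine; raising the hard limit itself would need elevated privileges):

import resource
import subprocess

def call_with_raised_fd_limit(cmd):
    # Hypothetical helper: behaves like subprocess.call(cmd, shell=True),
    # but first lifts the child's soft RLIMIT_NOFILE to its hard limit.
    def _preexec():
        # Runs in the child between fork() and exec()
        _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return subprocess.call(cmd, shell=True, preexec_fn=_preexec)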
