
Scrapy infinite loop with CrawlerProcess

How to fix an infinite loop with Scrapy's CrawlerProcess

I am currently running Scrapy v2.5 and I want to run it in an infinite loop. My code:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class main():

    def bucle(self, array_spider, process):
        mongo = mongodb(setting)  # mongodb/close_mongo are my own helper class
        for spider_name in array_spider:
            process.crawl(spider_name, params={"mongo": mongo, "spider_name": spider_name})
        process.start()  # blocking call, returns when all spiders finish
        process.stop()
        mongo.close_mongo()


if __name__ == "__main__":
    setting = get_project_settings()
    while True:
        # a fresh CrawlerProcess on every iteration
        process = CrawlerProcess(setting)
        array_spider = process.spider_loader.list()
        class_main = main()
        class_main.bucle(array_spider, process)

But this produces the following error:

Traceback (most recent call last):
  File "run_scrapy.py", line 92, in <module>
    process.start()
  File "/usr/local/lib/python3.8/dist-packages/scrapy/crawler.py", line 327, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python3.8/dist-packages/twisted/internet/base.py", line 1422, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python3.8/dist-packages/twisted/internet/base.py", line 1404, in startRunning
    ReactorBase.startRunning(cast(ReactorBase, self))
  File "/usr/local/lib/python3.8/dist-packages/twisted/internet/base.py", line 843, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Can anyone help me?

Solution

AFAIK there is no easy way to restart a spider, but there is an alternative: never let the spider close. For that you can use the spider_idle signal.

According to the documentation:

Sent when a spider has gone idle, which means the spider has no further:
* requests waiting to be downloaded
* requests scheduled
* items being processed in the item pipeline

You can also find an example of using Signals in the official documentation.
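
Putting this together, a minimal sketch of a spider that never goes idle might look like the following. It assumes Scrapy 2.5 (as in the question), where engine.crawl() still accepts the spider as a second argument; the spider name "forever" and the URL are placeholders, and re-scheduling the start URLs on every idle event is just one possible strategy.

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class ForeverSpider(scrapy.Spider):
    # hypothetical spider name and target URL, for illustration only
    name = "forever"
    start_urls = ["https://example.com"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # connect our handler to the spider_idle signal
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def spider_idle(self, spider):
        # fired when there are no pending requests, scheduled requests,
        # or items in the pipelines: re-schedule the start requests
        # (in Scrapy 2.5, engine.crawl() takes the spider as 2nd argument)
        for url in self.start_urls:
            self.crawler.engine.crawl(scrapy.Request(url, dont_filter=True), spider)
        # prevent the spider (and thus the reactor) from shutting down
        raise DontCloseSpider

    def parse(self, response):
        self.logger.info("got %s", response.url)

With this in place the outer while True loop and the repeated CrawlerProcess are no longer needed: process.start() is called exactly once, and the reactor keeps running because the spider never closes, so ReactorNotRestartable never comes up.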
