How to fix a Scrapy spider scheduled with APScheduler that only runs every other hour
I have the following code, which only works every two hours instead of every hour. I pipeline the data into MongoDB, and I can see that the _id advances by two between stored documents instead of one.
The code is meant to scrape, once per hour, the number of users online in each of the 100 subreddits saved in data.csv, and push the data to a MongoDB cloud server. Everything works, except it only scrapes every two hours rather than every hour.
import csv
import json
from datetime import datetime

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler


class SubredditSpider(scrapy.Spider):
    name = 'subreddit'
    sub_list = []  # list of subreddits
    count = 0

    def start_requests(self):
        SubredditSpider.count += 1
        if SubredditSpider.count > 24:
            SubredditSpider.count = 1
        with open('data.csv', 'r') as file:
            csv_reader = csv.reader(file)
            for row in csv_reader:
                self.sub_list.append(row[0])
        for sub in self.sub_list:
            yield scrapy.Request(f'https://www.reddit.com{sub}/about.json', self.parse)

    def parse(self, response):
        data = json.loads(response.body)
        subreddit = data['data']['display_name']
        active_users = data['data']['active_user_count']
        now = datetime.now()
        current_time = now.strftime("%H:%M")
        current_date = now.strftime("%d:%m:%Y")
        yield {
            '_id': SubredditSpider.count,
            'subreddit': subreddit,
            'active_users': active_users,
            'time': current_time,
            'date': current_date,
        }


def main():
    process = CrawlerProcess(get_project_settings())
    scheduler = TwistedScheduler()
    scheduler.add_job(process.crawl, 'cron', args=[SubredditSpider], hour='*')
    scheduler.start()
    process.start(False)
The log from an hour when it does not scrape anything:
2020-09-24 10:00:00 [apscheduler.scheduler] DEBUG: Looking for jobs to run
2020-09-24 10:00:00 [apscheduler.scheduler] DEBUG: Next wakeup is due at 2020-09-24 11:00:00+00:00 (in 3599.898356 seconds)
2020-09-24 10:00:00 [apscheduler.executors.default] INFO: Running job "CrawlerRunner.crawl (trigger: cron[hour='*'],next run at: 2020-09-24 11:00:00 UTC)" (scheduled at 2020-09-24 10:00:00+00:00)
2020-09-24 10:00:00 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'reddit','NEWSPIDER_MODULE': 'reddit.spiders','SPIDER_MODULES': ['reddit.spiders']}
2020-09-24 10:00:00 [scrapy.extensions.telnet] INFO: Telnet Password: telnet_password
2020-09-24 10:00:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.memusage.MemoryUsage','scrapy.extensions.logstats.LogStats']
2020-09-24 10:00:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.cookies.CookiesMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-24 10:00:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-09-24 10:00:00 [scrapy.middleware] INFO: Enabled item pipelines:
['reddit.pipelines.RedditPipeline']
2020-09-24 10:00:00 [scrapy.core.engine] INFO: Spider opened
2020-09-24 10:00:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),scraped 0 items (at 0 items/min)
2020-09-24 10:00:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-24 10:00:00 [apscheduler.executors.default] INFO: Job "CrawlerRunner.crawl (trigger: cron[hour='*'],next run at: 2020-09-24 11:00:00 UTC)" executed successfully
Solution
I modified your code so that it runs the spider every minute, scraping the same URL 100 times, and it worked for me; I let it run for 10 minutes.
I also tried the requests library, making one request per second, and everything worked fine.
I searched for the Reddit API rate limits. Some posts say you can make 100 requests, but their actual documentation limits you to 60 requests per minute:
https://github.com/reddit-archive/reddit/wiki/API
They let you scrape whatever you want within the limits stated at the link I shared, but you have to authenticate.
My only theory is that they are blocking your second scrape because you exceed their hourly rate. Maybe try using a proxy, or authenticate.
If you like, you can also send me your URL list and I will rerun the test; perhaps I did not hit their limit because I was requesting the same thing over and over.
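As a sketch of what authenticating looks like: Reddit's application-only OAuth2 flow posts client credentials to /api/v1/access_token and then sends the returned bearer token with each API request. The helper below only builds the token request (it does not send it), and the client id, secret, and user agent are placeholders you would replace with your registered app's values; see Reddit's OAuth2 docs for the exact grant types:

```python
import base64
import urllib.parse
import urllib.request

def build_token_request(client_id, client_secret, user_agent="subreddit-monitor/0.1"):
    """Build (but do not send) the OAuth2 client-credentials request that
    Reddit's /api/v1/access_token endpoint expects for app-only auth."""
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    creds = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return urllib.request.Request(
        "https://www.reddit.com/api/v1/access_token",
        data=body,
        headers={
            "Authorization": f"Basic {creds}",  # HTTP Basic auth with app credentials
            "User-Agent": user_agent,           # Reddit requires a descriptive UA
        },
        method="POST",
    )

req = build_token_request("your_client_id", "your_client_secret")
print(req.full_url)
```

Authenticated clients get the higher documented rate limit, which is why authenticating may make the second hourly run survive.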
import scrapy
import json
from datetime import datetime

import requests
from apscheduler.schedulers.twisted import TwistedScheduler
from apscheduler.schedulers.blocking import BlockingScheduler
from scrapy.crawler import CrawlerProcess


class SubredditSpider(scrapy.Spider):
    name = 'subreddit'
    sub_list = []  # list of subreddits
    count = 0
    custom_settings = {}

    def start_requests(self):
        SubredditSpider.count += 1
        if SubredditSpider.count > 24:
            SubredditSpider.count = 1
        for _ in range(100):
            yield scrapy.Request('https://www.reddit.com/r/Music/about.json',
                                 self.parse, dont_filter=True)

    def parse(self, response, *args):
        data = json.loads(response.body)
        subreddit = data['data']['display_name']
        active_users = data['data']['active_user_count']
        now = datetime.now()
        current_time = now.strftime("%H:%M")
        current_date = now.strftime("%d:%m:%Y")
        yield {
            '_id': SubredditSpider.count,
            'subreddit': subreddit,
            'active_users': active_users,
            'time': current_time,
            'date': current_date,
        }


def main():
    process = CrawlerProcess({'BOT_NAME': 'reddit'})
    scheduler = TwistedScheduler()
    scheduler.add_job(process.crawl, 'interval', args=[SubredditSpider], minutes=1)
    scheduler.start()
    process.start(False)


def get_active_users():
    url = "https://www.reddit.com/r/Music/about.json"
    payload = {}
    headers = {
        'User-Agent': 'PostmanRuntime/7.26.3',
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
    }
    response = requests.request("GET", url, headers=headers, data=payload)
    if response.status_code == 200:
        data = response.json()
        subreddit = data['data']['display_name']
        active_users = data['data']['active_user_count']
        now = datetime.now()
        current_time = now.strftime("%H:%M")
        current_date = now.strftime("%d:%m:%Y")
        print({
            '_id': SubredditSpider.count,
            'date': current_date,
        })
        SubredditSpider.count += 1
    else:
        print(response)


if __name__ == '__main__':
    main()
    # scheduler = BlockingScheduler()
    # scheduler.add_job(get_active_users, 'interval', seconds=1)
    # scheduler.start()
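If the second hourly run really is tripping Reddit's rate limit, one mitigation is client-side pacing so requests never exceed the documented 60 per minute. This is a small illustrative helper, not part of the original script; time values are passed in explicitly so the pacing logic is easy to verify:

```python
class RateLimiter:
    """Space out calls so that at most `max_calls` happen per `period` seconds,
    e.g. Reddit's documented 60 requests per minute for authenticated clients."""

    def __init__(self, max_calls=60, period=60.0):
        self.min_interval = period / max_calls  # seconds required between calls
        self._last_call = None

    def delay_needed(self, now):
        """Return how many seconds to wait before the next call is allowed."""
        if self._last_call is None:
            return 0.0
        elapsed = now - self._last_call
        return max(0.0, self.min_interval - elapsed)

    def record_call(self, now):
        self._last_call = now

limiter = RateLimiter(max_calls=60, period=60.0)
print(limiter.delay_needed(0.0))  # first call: no wait
limiter.record_call(0.0)
print(limiter.delay_needed(0.5))  # 0.5s after a call: 0.5s still to wait
```

Inside the spider the same effect can be had declaratively with Scrapy's DOWNLOAD_DELAY setting (e.g. DOWNLOAD_DELAY = 1 for at most one request per second).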