
Storing a response as a file with Scrapy Splash


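I am building my first Scrapy project with Splash, using the test data at http://quotes.toscrape.com/js/. I want to store the quotes from each page as a separate file on disk (in the code below I first try to store the whole page). The code worked when I did not use SplashRequest, but with the new version below, nothing is stored on disk when I use "Run and Debug" on it in Visual Studio Code. In addition, self.log does not write anything to my Visual Studio Code terminal window. I am new to Splash, so I am sure I am missing something, but what?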

I have already checked here and here.

import scrapy
from scrapy_splash import SplashRequest


class QuoteItem(scrapy.Item):
    author = scrapy.Field()
    quote = scrapy.Field()


class MySpider(scrapy.Spider):
    name = "jsscraper"

    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')

    def parse(self, response):
        for q in response.css("div.quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote

        # cycle through all available pages
        for a in response.css('ul.pager a'):
            yield SplashRequest(url=a, endpoint='render.html', args={'wait': 0.5})

        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Update 1

How I am debugging it:

The Output tab is empty.

The Terminal tab contains:

PS C:\scrapy\tutorial>  cd 'c:\scrapy\tutorial'; & 'C:\Users\Mark\AppData\Local\Programs\Python\python38-32\python.exe' 'c:\Users\Mark\.vscode\extensions\ms-python.python-2020.9.114305\pythonFiles\lib\python\debugpy\launcher' '58582' '--' 'c:\scrapy\tutorial\spiders\quotes_spider_js.py'
PS C:\scrapy\tutorial> 

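Also, nothing is logged in my Docker container instance, which I assume Splash needs in the first place.

Update 2

I ran scrapy crawl jsscraper, and the file 'quotes-js.html' was stored on disk. However, it contains the page's source HTML, and none of the JavaScript has been executed. I want the JS code on http://quotes.toscrape.com/js/ to be executed and only the quote content to be stored. How can I do that?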

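Answer

Problem

The JavaScript on the site you are scraping is not being executed.

Solution

Increase the wait time of the SplashRequest so that the JavaScript has time to execute.

Example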

yield SplashRequest(
    url=url, callback=self.parse, endpoint='render.html', args={'wait': 0.5}
)
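The wait argument only has an effect if the request actually goes through a running Splash instance, which would also explain why nothing shows up in the Docker container log. A minimal settings.py sketch follows; the middleware names and priorities are the ones documented in the scrapy-splash README, while the Splash address is an assumption about a local Docker container listening on port 8050 and should be adjusted to your setup.

```python
# settings.py (sketch): wiring Scrapy to Splash via scrapy-splash.
# ASSUMPTION: Splash runs locally in Docker and is reachable on port 8050.
SPLASH_URL = "http://localhost:8050"

# Route SplashRequest objects through Splash and keep cookies/compression working.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoid storing duplicate Splash arguments for every request.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering aware of the Splash arguments, not just the URL.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With these in place, run the spider with scrapy crawl jsscraper from the project directory (as in Update 2) rather than executing the spider file directly, since running the file on its own only defines the class and starts no crawl.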

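Writing the output to a JSON file:

I tried to solve your problem. Here is a working version of the code. I hope this is what you are trying to achieve.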

import json

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "jsscraper"

    # Pages 1 to 10 of the JavaScript-rendered quotes site.
    start_urls = ["http://quotes.toscrape.com/js/page/" + str(i + 1) for i in range(10)]

    def start_requests(self):
        for url in self.start_urls:
            # Render through Splash and wait 0.5 s so the page's JavaScript can run.
            yield SplashRequest(url=url, callback=self.parse, args={'wait': 0.5})

    def parse(self, response):
        # Collect every quote found on the rendered page.
        quotes = {"quotes": []}
        for q in response.css("div.quote"):
            quote = dict()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            quotes["quotes"].append(quote)

        # The page number is whatever follows "page/" in the URL
        # (a trailing slash, if any, is stripped so the file name stays valid).
        page = response.url[response.url.index("page/") + 5:].strip("/")
        print("page=", page)
        filename = 'quotes-%s.json' % page
        with open(filename, 'w') as outfile:
            outfile.write(json.dumps(quotes, indent=4, separators=(',', ":")))

Update: the code above has been updated to scrape all pages and save the results in separate JSON files, one per page, for pages 1 to 10.
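Hardcoding ten pages works for this site. If you would rather follow the pager as in the question, note that the question's loop passes the selector object a as the URL, whereas SplashRequest needs a URL string. The sketch below shows only the URL-resolution step with the standard library; next_page_url is a hypothetical helper name, and the example href and the li.next selector mentioned in the comment are assumptions about the page markup.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve a (possibly relative) pager href against the current page URL.

    Returns None when there is no next link, for example on the last page.
    """
    if not href:
        return None
    return urljoin(current_url, href)


# ASSUMPTION: in the spider, href would come from something like
#   response.css("li.next a::attr(href)").get()
# and the result would be passed on as
#   yield SplashRequest(url=next_url, callback=self.parse, args={'wait': 0.5})
print(next_page_url("http://quotes.toscrape.com/js/", "/js/page/2/"))
```

Inside a Scrapy callback, response.urljoin(href) does the same resolution against response.url.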

The spider writes the list of quotes from each page to a separate JSON file, as follows:

{
    "quotes":[
        {
            "author":"Albert Einstein",
            "quote":"\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
        },
        {
            "author":"J.K. Rowling",
            "quote":"\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"
        },
        {
            "author":"Jane Austen",
            "quote":"\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
        },
        {
            "author":"Marilyn Monroe",
            "quote":"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cTry not to become a man of success. Rather become a man of value.\u201d"
        },
        {
            "author":"Andr\u00e9 Gide",
            "quote":"\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"
        },
        {
            "author":"Thomas A. Edison",
            "quote":"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"
        },
        {
            "author":"Eleanor Roosevelt",
            "quote":"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"
        },
        {
            "author":"Steve Martin",
            "quote":"\u201cA day without sunshine is like, you know, night.\u201d"
        }
    ]
}
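The file name in the spider comes from slicing the URL after "page/", which raises a ValueError on a URL without that segment, such as the site's first page at /js/. A slightly more defensive variant might look like the sketch below. page_label and save_quotes are hypothetical helper names, and treating a URL without a page segment as page 1 is an assumption, not something the site guarantees.

```python
import json
import re
from pathlib import Path


def page_label(url):
    """Return the digits after '/page/' in the URL, or '1' if there are none."""
    match = re.search(r"/page/(\d+)", url)
    return match.group(1) if match else "1"


def save_quotes(quotes, url, out_dir="."):
    """Write one page's quotes to quotes-<page>.json and return the file path."""
    path = Path(out_dir) / f"quotes-{page_label(url)}.json"
    # json.dumps escapes non-ASCII characters by default, as in the output above.
    path.write_text(json.dumps({"quotes": quotes}, indent=4), encoding="utf-8")
    return path
```

In the spider's parse method this would be called as save_quotes(quotes["quotes"], response.url) in place of the slicing and open() lines.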
