XPath selection returns only the first result

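I am still new to Scrapy. When I try to read data from quotes.toscrape with XPath selectors, I do not get the content I expect: every item repeats the first quote on the page. As soon as I switch to CSS selectors, everything works as expected. Even though the example is very simple, I cannot find the error.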

quotes.py

import scrapy
from quotes_loader.items import QuotesLoaderItem as QL

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = [
        'http://quotes.toscrape.com//']

    def parse(self,response):
        item = QL()
        quotes = response.xpath('//div[@class="quote"]')

        for quote in quotes:
            # CSS-Selector
            # item['author_name'] = quote.css('small.author::text').get()
            # item['quote_text'] = quote.css('span.text::text').get()
            # item['author_link'] = quote.css('small.author + a::attr(href)').get()
            # item['tags'] = quote.css('div.tags > a.tag::text').get()

            # XPath selector
            item['author_name'] = quote.xpath('//small[@class="author"]/text()').get()
            item['quote_text'] = quote.xpath('//span[@class="text"]/text()').get()
            item['author_link'] = quote.xpath('//small[@class="author"]/following-sibling::a/@href').get()
            item['tags'] = quote.xpath('//*[@class="tags"]/*[@class="tag"]/text()').get()

            yield item

        # next_page_url = response.css('li.next > a::attr(href)').get()
        next_page_url = response.xpath('//*[class="next"]/a/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(absolute_next_page_url)

items.py

import scrapy
from scrapy.loader import ItemLoader


class QuotesLoaderItem(scrapy.Item):
    # define the fields for your item here like:
    author_name = scrapy.Field()
    quote_text = scrapy.Field()
    author_link = scrapy.Field()
    tags = scrapy.Field()

Result

author_name,quote_text,author_link,tags
Albert Einstein,“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”,/author/Albert-Einstein,change
Albert Einstein,...
...
(20 times)

Thanks for your help.

Solution

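I was calling xpath() on a selector object (each quote) rather than on the response object. An expression that starts with // is absolute: it searches the whole document no matter which selector it is called on, so .get() always returned the first match on the page. Prefixing the expression with a dot (.//) makes it relative to the current quote, so the syntax must look like this: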

import scrapy
from quotes_loader.items import QuotesLoaderItem as QL

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = [
        'http://quotes.toscrape.com//']

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')

        for quote in quotes:
            # Create a fresh item per quote instead of mutating one shared instance
            item = QL()
            # CSS-Selector
            # item['author_name'] = quote.css('small.author::text').get()
            # item['quote_text'] = quote.css('span.text::text').get()
            # item['author_link'] = quote.css('small.author + a::attr(href)').get()
            # item['tags'] = quote.css('div.tags > a.tag::text').get()
            
            # XPATH-Selector
            item['author_name'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['quote_text'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author_link'] = quote.xpath('.//small[@class="author"]/following-sibling::a/@href').get()
            item['tags'] = quote.xpath('.//*[@class="tags"]/*[@class="tag"]/text()').get()

            yield item

        # next_page_url = response.css('li.next > a::attr(href)').get()
        # @class tests the attribute; [class="next"] would look for a <class> child element
        next_page_url = response.xpath('//*[@class="next"]/a/@href').get()
        if next_page_url:
            # Only follow the link when one exists (the last page has none)
            yield scrapy.Request(response.urljoin(next_page_url))
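A note on the pagination expression: a predicate written as [class="next"] (without @) tests for a child element named class, not for the class attribute, so it is unlikely to match the pager link. The sketch below illustrates this using only the standard library. xml.etree.ElementTree supports just a subset of XPath, but it treats these two predicates the same way; the markup, PAGE_URL, and the helper next_page are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

PAGE_URL = "http://quotes.toscrape.com/page/1/"  # example URL for illustration

# Simplified stand-in for the pager markup (hypothetical)
PAGER = """
<nav>
  <ul class="pager">
    <li class="next"><a href="/page/2/">Next</a></li>
  </ul>
</nav>
"""

root = ET.fromstring(PAGER)

# [class='next'] looks for a child *element* <class> whose text is "next"
without_at = root.findall(".//*[class='next']/a")

# [@class='next'] tests the class *attribute*, which is what the markup uses
with_at = root.findall(".//*[@class='next']/a")


def next_page(links, base_url):
    """Return the absolute URL of the next page, or None when no link matched."""
    if not links:
        return None
    return urljoin(base_url, links[0].get("href"))


print(next_page(without_at, PAGE_URL))
print(next_page(with_at, PAGE_URL))
```

Returning None when nothing matches mirrors what a spider should do on the last page: stop instead of building a request from a missing link.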
