
Error while extracting data with Scrapy (Python)

How do I fix this error when extracting data with Scrapy in Python? Here is my spider:

import scrapy
import logging

class Countriesspider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()

            # absolute_url = f"https://www.worldometers.info{link}"
            # absolute_url = response.urljoin(link)

            yield response.follow(url=link, callback=self.parse_country, meta={'country_name': name})

    def parse_country(self, response):
        name = response.request.meta['country_name']
        rows = response.xpath("(//table[@class='table table-striped table-bordered table-hover table-condensed table-list'])[1])[1]/tbody/tr")
        for row in rows:
            year = row.xpath(".//td[1]/text()").get()
            population = row.xpath(".//td[2]/strong/text()").get()
            yield {
                'year': year,
                'population': population
            }

But I am getting this error:

(new_Virtual_workspace) SubhrajyotisAir:worldometer subhrajyotisaha$ scrapy crawl countries

2021-05-29 23:33:14 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: worldometer)

2021-05-29 23:33:14 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 21.2.0, Python 3.8.10 (default, May 19 2021, 11:01:55) - [Clang 10.0.0 ], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform macOS-10.14.1-x86_64-i386-64bit

2021-05-29 23:33:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor

2021-05-29 23:33:14 [scrapy.crawler] INFO: Overridden settings:

{'BOT_NAME': 'worldometer', 'NEWSPIDER_MODULE': 'worldometer.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['worldometer.spiders']}

2021-05-29 23:33:14 [scrapy.extensions.telnet] INFO: Telnet Password: 87f0a20eef9428d7

2021-05-29 23:33:14 [scrapy.middleware] INFO: Enabled extensions:

['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']

2021-05-29 23:33:14 [scrapy.middleware] INFO: Enabled downloader middlewares:

['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']

2021-05-29 23:33:14 [scrapy.middleware] INFO: Enabled spider middlewares:

['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']

2021-05-29 23:33:14 [scrapy.middleware] INFO: Enabled item pipelines:

[]

2021-05-29 23:33:14 [scrapy.core.engine] INFO: Spider opened

2021-05-29 23:33:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2021-05-29 23:33:14 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023

2021-05-29 23:33:18 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.worldometers.info/robots.txt> (referer: None)

2021-05-29 23:33:18 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.

2021-05-29 23:33:18 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.

2021-05-29 23:33:18 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.

2021-05-29 23:33:18 [protego] DEBUG: Rule at line 14 without any user agent to enforce it on.

2021-05-29 23:33:18 [protego] DEBUG: Rule at line 16 without any user agent to enforce it on.

2021-05-29 23:33:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.worldometers.info/world-population/population-by-country/> (referer: None)

2021-05-29 23:33:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.worldometers.info/world-population/ethiopia-population/> (referer: https://www.worldometers.info/world-population/population-by-country/)

2021-05-29 23:33:20 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.worldometers.info/world-population/ethiopia-population/> (referer: https://www.worldometers.info/world-population/population-by-country/)

Traceback (most recent call last):

  File "/Users/subhrajyotisaha/opt/anaconda3/envs/new_Virtual_workspace/lib/python3.8/site-packages/parsel/selector.py",line 236,in xpath

    result = xpathev(query,namespaces=nsp,File "src/lxml/etree.pyx",line 1582,in lxml.etree._Element.xpath

  File "src/lxml/xpath.pxi",line 305,in lxml.etree.XpathelementEvaluator.__call__

  File "src/lxml/xpath.pxi",line 225,in lxml.etree._XPathEvaluatorBase._handle_result

lxml.etree.XPathEvalError: Invalid expression


I am using a conda virtual environment with VS Code on macOS.
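The traceback points at the rows XPath inside parse_country: the expression ends with an extra ")[1]", which leaves the parentheses unbalanced, and lxml rejects it as "Invalid expression". Below is a minimal sketch of the corrected callback, meant as a drop-in replacement for the parse_country method of the spider class above; it assumes the population table is the first matching table on each country page (as the original [1] index already implies), and the 'country' field in the yielded item is an extra illustrative addition that was not in the original code.

def parse_country(self, response):
    name = response.request.meta['country_name']
    # Balanced parentheses: a single ")[1]" selects the first matching table.
    rows = response.xpath(
        "(//table[@class='table table-striped table-bordered "
        "table-hover table-condensed table-list'])[1]/tbody/tr"
    )
    for row in rows:
        yield {
            'country': name,  # not in the original yield; included for illustration
            'year': row.xpath(".//td[1]/text()").get(),
            'population': row.xpath(".//td[2]/strong/text()").get(),
        }

A quick way to test an expression like this before running the whole spider is to open scrapy shell https://www.worldometers.info/world-population/ethiopia-population/ and try response.xpath(...) there; an unbalanced expression fails immediately with the same "Invalid expression" message.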
