How to call the correct class based on a URL's domain

I am currently working on a web crawler in which I want to call the correct class to scrape web elements from a given URL.
Currently I have created:
import sys
import tldextract
import requests

class Scraper:
    scrapers = {}

    def __init_subclass__(scraper_class):
        Scraper.scrapers[scraper_class.url] = scraper_class

    @classmethod
    def for_url(cls, url):
        k = tldextract.extract(url)
        # return Scraper.scrapers[k.domain]()
        # or
        return cls.scrapers[k.domain]()

class BBCScraper(Scraper):
    url = 'bbc.co.uk'

    def scrape(s):
        print(s)
        # FIXME Scrape the correct values for BBC
        return "Scraped BBC News"

url = 'https://www.bbc.co.uk/'
scraper = Scraper.for_url(url)
scraper.scrape(requests.get(url))
What I want now is: if bbc is the domain, it should go into class BBCScraper(Scraper):, and since we call scraper.scrape(requests.get(url)), it should then scrape the web elements inside BBCScraper -> scrape -> return web elements.
However, I do hit a problem when trying to run this script:

Output:
    return cls.scrapers[k.domain]()
KeyError: 'bbc'

I would like to know how to call the correct class based on the domain given to the for_url classmethod.
Solution

The problem is that k.domain returns 'bbc', while you wrote url = 'bbc.co.uk'. So apply one of these fixes:

- keep url = 'bbc.co.uk' and look it up with k.registered_domain
- use url = 'bbc' and look it up with k.domain

You should also add a parameter to the scrape method to receive the response.
from abc import abstractmethod
import requests
import tldextract

class Scraper:
    scrapers = {}

    def __init_subclass__(scraper_class):
        Scraper.scrapers[scraper_class.url] = scraper_class

    @classmethod
    def for_url(cls, url):
        k = tldextract.extract(url)
        return cls.scrapers[k.registered_domain]()

    @abstractmethod
    def scrape(self, content: requests.Response):
        pass

class BBCScraper(Scraper):
    url = 'bbc.co.uk'

    def scrape(self, content: requests.Response):
        return "Scraped BBC News"

if __name__ == "__main__":
    url = 'https://www.bbc.co.uk/'
    scraper = Scraper.for_url(url)
    r = scraper.scrape(requests.get(url))
    print(r)  # Scraped BBC News
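The __init_subclass__ registry is what makes this pattern extensible: merely defining a new subclass registers it, with no central list to edit. A minimal stdlib-only sketch of that mechanic (GuardianScraper is a hypothetical second scraper, not from the question):

```python
class Scraper:
    scrapers = {}

    def __init_subclass__(cls, **kwargs):
        # called automatically whenever a subclass is defined
        super().__init_subclass__(**kwargs)
        Scraper.scrapers[cls.domain] = cls

class BBCScraper(Scraper):
    domain = 'bbc.co.uk'

class GuardianScraper(Scraper):  # hypothetical second scraper
    domain = 'theguardian.com'

print(sorted(Scraper.scrapers))  # ['bbc.co.uk', 'theguardian.com']
```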
Improvement

I would suggest storing the url in an instance attribute and moving requests.get into scrape, so there is less code in the main block:
from abc import abstractmethod
import requests
import tldextract

class Scraper:
    scrapers = {}

    def __init_subclass__(scraper_class):
        Scraper.scrapers[scraper_class.domain] = scraper_class

    @classmethod
    def for_url(cls, url):
        k = tldextract.extract(url)
        return cls.scrapers[k.registered_domain](url)

    @abstractmethod
    def scrape(self):
        pass

class BBCScraper(Scraper):
    domain = 'bbc.co.uk'

    def __init__(self, url):
        self.url = url

    def scrape(self):
        rep: requests.Response = requests.get(self.url)
        content = rep.text  # ALL HTML CONTENT
        return "Scraped BBC News" + content[:20]

if __name__ == "__main__":
    url = 'https://www.bbc.co.uk/'
    scraper = Scraper.for_url(url)
    r = scraper.scrape()
    print(r)  # Scraped BBC News<!DOCTYPE html><html
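One remaining sharp edge: a URL whose domain has no registered subclass still raises a bare KeyError. A hedged sketch of a friendlier lookup (for_domain is a hypothetical variant that takes the already-extracted registered domain, so this example stays stdlib-only):

```python
class Scraper:
    scrapers = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Scraper.scrapers[cls.domain] = cls

    @classmethod
    def for_domain(cls, domain):
        # fail with a clear message instead of a bare KeyError
        try:
            return cls.scrapers[domain]()
        except KeyError:
            raise ValueError(f"no scraper registered for {domain!r}") from None

class BBCScraper(Scraper):
    domain = 'bbc.co.uk'

print(type(Scraper.for_domain('bbc.co.uk')).__name__)  # BBCScraper
```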