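How to scrape HTML from IMDb using Python and Beautiful Soup
For example, I want to get the movie ratings from this page and print them line by line. I have already extracted the name and release year with BS4, but I don't know how to get the ratings...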
import requests
from bs4 import BeautifulSoup
import urllib.request

url = urllib.request.urlopen('http://imdb.com/list/ls097228983/')
content = url.read()
soup = BeautifulSoup(content,'lxml')

for div in soup.findAll('h3',attrs={'class':'lister-item-header'}):
    #print(div.find('a')['href'])
    #print("**")
    #print(div)
    year = div.find('span',attrs={'class':'lister-item-year text-muted unbold'})
    year = str(year)
    year = year.replace('<span class="lister-item-year text-muted unbold">','')
    year = year.replace('</span>','')
    name = div.find('a').contents[0]
    print(name + ' ' + year)
Desired output: Solaris (1972) 8.1
Solution
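You need to loop over the parent element instead: change the h3 lookup with 'class':'lister-item-header'
to a div lookup with 'class':'lister-item-content'.
The rating is not inside the header, so it can only be reached from the parent container.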
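To see why the header alone is not enough, here is a minimal offline sketch. The HTML snippet is a hypothetical, simplified stand-in that only reuses the class names from this thread (the real page markup may differ), and it uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that reuses the class names from the question.
# The href is a placeholder, not a real IMDb path.
SAMPLE = """
<div class="lister-item-content">
  <h3 class="lister-item-header">
    <a href="/title/example/">Solaris</a>
    <span class="lister-item-year text-muted unbold">(1972)</span>
  </h3>
  <div class="ipl-rating-widget">
    <span class="ipl-rating-star__rating">8.1</span>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

header = soup.find("h3", class_="lister-item-header")
content = soup.find("div", class_="lister-item-content")

# The rating span sits next to the header, not inside it,
# so a search that starts at the header comes back empty.
rating_in_header = header.find("span", class_="ipl-rating-star__rating")

# A search that starts at the parent container reaches the rating.
rating_in_content = content.find("span", class_="ipl-rating-star__rating")

print(rating_in_header)
print(rating_in_content.text)
```

find() only searches the descendants of the tag it is called on, which is why the starting element matters here.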
from bs4 import BeautifulSoup
import urllib.request

url = urllib.request.urlopen('http://imdb.com/list/ls097228983/')
content = url.read()
soup = BeautifulSoup(content, 'lxml')

# Loop over the parent container, which holds both the header and the rating.
for div in soup.find_all('div', {'class': 'lister-item-content'}):
    # The span text already includes the parentheses, e.g. "(1972)".
    year = div.find('span', attrs={'class': 'lister-item-year text-muted unbold'}).text
    name = div.find('a').contents[0]
    rating = div.find('span', class_='ipl-rating-star__rating').text
    # An f-string does not need an extra .format() call.
    print(f'{name} {year} {rating}')
    # Equivalent, using string concatenation:
    # print(name + ' ' + year + ' ' + rating)
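One thing the loop above does not cover is a list entry without a rating: in that case find() returns None and .text raises an AttributeError. The sketch below is one possible way to guard against that. The function name extract_rows, the sample markup, and the second (unrated) entry are all hypothetical and only illustrate the idea; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second entry has no rating on purpose.
SAMPLE = """
<div class="lister-item-content">
  <h3 class="lister-item-header">
    <a href="/title/example-1/">Solaris</a>
    <span class="lister-item-year text-muted unbold">(1972)</span>
  </h3>
  <span class="ipl-rating-star__rating">8.1</span>
</div>
<div class="lister-item-content">
  <h3 class="lister-item-header">
    <a href="/title/example-2/">Unrated Example</a>
    <span class="lister-item-year text-muted unbold">(2000)</span>
  </h3>
</div>
"""


def extract_rows(html):
    """Return a list of (name, year, rating) tuples; rating is None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="lister-item-content"):
        name = item.find("a").get_text(strip=True)
        # class_ matches when the given class is one of the tag's classes.
        year = item.find("span", class_="lister-item-year").get_text(strip=True)
        rating_tag = item.find("span", class_="ipl-rating-star__rating")
        rating = rating_tag.get_text(strip=True) if rating_tag else None
        rows.append((name, year, rating))
    return rows


rows = extract_rows(SAMPLE)
for name, year, rating in rows:
    # Fall back to a placeholder when there is no rating.
    print(f"{name} {year} {rating or 'n/a'}")
```

Returning tuples instead of printing inside the loop keeps the parsing separate from the output, so the same rows could be written to a file or sorted by rating later.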