Collecting all IMDb reviews for a given TV series


I am trying to collect data from IMDb using Python, but I can't get all the reviews. The following code works, but it does not return every available review:

from imdb import IMDb

ia = IMDb()

ia.get_movie_reviews('13433812') 

Output:

{'data': {'reviews': [{'content': 'Just finished watching the episode 4. Wow,it was so good. Well made mixture of thriller and comedy.I saw a few negative reviews here written after eps 1 or 2. I recommend watching at least up to eps 3 and 4. The real story starts from eps 3. Eps 4 is like a complete well made movie. You will surely enjoy it.','helpful': 0,'title': '','author': 'ur129930427','date': '28 February 2021','rating': None,'not_helpful': 0},
{'content': 'You can see the cast had a lot of fun making this Italian/Korean would-be mafia thriller,the sort of fun NOT experienced in Hollywood since the days of Burt Reynolds. Vincenzo contains a very absorbing plot,a cast star-struck by designer clothes,interspersed with Italian (and other) Classical music excerpts to set in relief some well written suspense and intrigue. The plot centers on,if we really are to believe it,the endemically CORRUPT upper echelons of S. Korean society. Is it a coincidence that many of the systemic abuses of power and institutional vice that constitute Vincenzo\'s Main Plot are Now also going on,this very moment in the USA? It is certainly food for thought. A clear advantage this Korean drama has over mediocre US shows,however is a much softer-handed use of violence,resorting more often to satire to keep the plot moving as opposed to gratuitous savagery Now so common in so-called "hit" US shows. So far,so good,Binjenzo!'

I also tried this Scrapy-based code, but it returned no reviews at all:

import requests
from scrapy.http import TextResponse
import urllib.parse
from urllib.parse import urljoin
base_url = "https://www.imdb.com/title/tt13433812/reviews?ref_=tt_urv"
r = requests.get(base_url)
response = TextResponse(r.url, body=r.text, encoding='utf-8')
reviews = response.xpath('//*[contains(@id,"1")]/p/text()').extract()
len(reviews)

Output: 0

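Solution

This should give you every reviewer name on that page by following the Load More pagination until it runs out. Feel free to define other fields and fetch them the same way, depending on what you need.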

import requests
from bs4 import BeautifulSoup

start_url = 'https://www.imdb.com/title/tt13433812/reviews?ref_=tt_urv'
link = 'https://www.imdb.com/title/tt13433812/reviews/_ajax'

params = {
    'ref_': 'undefined',
    'paginationKey': ''
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(start_url)

    while True:
        soup = BeautifulSoup(res.text, "lxml")
        for item in soup.select(".review-container"):
            reviewer_name = item.select_one("span.display-name-link > a").get_text(strip=True)
            print(reviewer_name)

        # The Load More element carries the key for the next batch of reviews.
        # When it is missing, select_one returns None and the loop ends.
        try:
            pagination_key = soup.select_one(".load-more-data[data-key]").get("data-key")
        except AttributeError:
            break
        params['paginationKey'] = pagination_key
        res = s.get(link, params=params)
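
As a rough sketch of what "defining other fields" could look like, the helper below pulls several values out of each `.review-container` and tolerates missing elements. The selectors `a.title` and `.content > .text` are borrowed from the Selenium answer in this thread. The function names `parse_review` and `parse_reviews` and the `SAMPLE_HTML` fragment are hypothetical and only there to make the example self-contained. The real page markup may differ.

```python
from bs4 import BeautifulSoup


def _text(node):
    """Return the stripped text of a tag, or None when the tag is missing."""
    return node.get_text(strip=True) if node is not None else None


def parse_review(item):
    """Extract a few fields from one .review-container element."""
    return {
        "reviewer_name": _text(item.select_one("span.display-name-link > a")),
        "title": _text(item.select_one("a.title")),
        "date": _text(item.select_one(".review-date")),
        "review": _text(item.select_one(".content > .text")),
    }


def parse_reviews(html):
    """Parse every review container found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [parse_review(item) for item in soup.select(".review-container")]


# Hypothetical fragment that mimics the class names used in this thread.
SAMPLE_HTML = """
<div class="review-container">
  <a class="title">Great show</a>
  <div class="display-name-date">
    <span class="display-name-link"><a href="/user/ur0000001/">some_user</a></span>
    <span class="review-date">1 March 2021</span>
  </div>
  <div class="content"><div class="text">Well made mixture of thriller and comedy.</div></div>
</div>
"""

for row in parse_reviews(SAMPLE_HTML):
    print(row)
```

Inside the pagination loop you would call `parse_review(item)` in place of the `print(reviewer_name)` line and append the result to a list.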

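Do you see the Load More button at the bottom of the page?

You are not getting all the reviews because the remaining ones are only loaded by an AJAX request when Load More is clicked.

One way to handle this is to use Selenium to click that button and then extract the reviews.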
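
To make the AJAX mechanism concrete, here is a minimal sketch that reads the pagination key from the Load More element and builds the URL of the follow-up request, using only the standard library. The `load-more-data` class, the `data-key` attribute, the `_ajax` endpoint and the `ref_`/`paginationKey` parameters come from the requests-based answer in this thread. The names `LoadMoreKeyParser` and `next_page_url` and the two HTML fragments are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

AJAX_URL = "https://www.imdb.com/title/tt13433812/reviews/_ajax"


class LoadMoreKeyParser(HTMLParser):
    """Remembers the data-key of the first element with class load-more-data."""

    def __init__(self):
        super().__init__()
        self.key = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.key is None and "load-more-data" in classes:
            self.key = attrs.get("data-key")


def next_page_url(html):
    """Return the AJAX URL for the next batch, or None if there is no key."""
    parser = LoadMoreKeyParser()
    parser.feed(html)
    if not parser.key:
        return None
    query = urlencode({"ref_": "undefined", "paginationKey": parser.key})
    return AJAX_URL + "?" + query


# Hypothetical page fragments, for illustration only.
WITH_MORE = '<div class="load-more-data" data-key="abc123"></div>'
LAST_PAGE = '<div class="lister-list"></div>'

print(next_page_url(WITH_MORE))
print(next_page_url(LAST_PAGE))  # None: nothing left to load
```

Fetching each returned URL in turn until the function returns `None` is the same idea as clicking Load More repeatedly, just without a browser.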


You can also use Selenium to keep clicking the Load More button until all reviews are loaded:

from selenium import webdriver
import time, urllib.parse
from bs4 import BeautifulSoup as soup

url = 'https://www.imdb.com/title/tt13433812/reviews?ref_=tt_urv'
d = webdriver.Chrome('/path/to/chromedriver')  # Selenium 3 style call; newer Selenium versions take a Service object
d.get(url)
# Click Load More until the number of loaded reviews reaches the total shown in the page header
while int(d.execute_script("return document.querySelectorAll('#main .review-container').length")) < int(d.execute_script("return document.querySelector('.header span').textContent").split()[0].replace(',', '')):
    d.execute_script('document.querySelector(".ipl-load-more__button").click()')
    time.sleep(3)

def text(node):
    return node.get_text(strip=True) if node else None  # e.g. reviews without a rating

r = [{'score': text(i.select_one('span.rating-other-user-rating span:nth-of-type(1)')),
      'title': text(i.select_one('a.title')),
      'reviewer_name': text(j := i.select_one('.display-name-link > a')),
      'reviewer_link': urllib.parse.urljoin(url, j['href']) if j else None,
      'date': text(i.select_one('.review-date')),
      'review': text(i.select_one('.content > .text'))}
     for i in soup(d.page_source, 'html.parser').select('#main .review-container')]
