
Unable to find a paragraph text element in a collapsible window with Selenium in Python


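I am trying to get the paragraph text of a collapsible element on a web page with Selenium in Python. So far, the collapsible window opens in Selenium via .click, but afterwards Selenium cannot find the paragraph with the desired class "object-viewer__ocr-articletext".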

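Selenium does not seem to be able to focus on the collapsible window that contains the newly visible elements (such as the desired paragraph).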

Link to the page: https://www.delpher.nl/nl/kranten/view?query=kernenergie&facets%5Bpapertitle%5D%5B%5D=Algemeen+Dagblad&facets%5Bpapertitle%5D%5B%5D=De+Volkskrant&facets%5Bpapertitle%5D%5B%5D=De+Telegraaf&facets%5Bpapertitle%5D%5B%5D=Trouw&page=1&sortfield=date&cql%5B%5D=%28date+_gte_+%2201-01-1970%22%29&cql%5B%5D=%28date+_lte_+%2201-01-2018%22%29&coll=ddd&redirect=true&identifier=ABCDDD:010818460:mpeg21:a0207&resultsidentifier=ABCDDD:010818460:mpeg21:a0207&rowid=1

Full code:

# Written against the Selenium 3 API: the find_element_by_* helpers used
# below are no longer available in Selenium 4.3 and later.
from selenium import webdriver
import time
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.options import Options


# Start Chrome without going through a proxy
chrome_options = Options()
chrome_options.add_argument("--no-proxy-server")
chrome_options.add_argument("--proxy-server='direct://'")
chrome_options.add_argument("--proxy-bypass-list=*")

driver = webdriver.Chrome(options=chrome_options)
driver.set_window_size(1400, 1080)

all_details = []
for c in range(1, 2):
    try:
        driver.get("https://www.delpher.nl/nl/kranten/results?query=kernenergie&facets%5Bpapertitle%5D%5B%5D=Algemeen+Dagblad&facets%5Bpapertitle%5D%5B%5D=De+Volkskrant&facets%5Bpapertitle%5D%5B%5D=De+Telegraaf&facets%5Bpapertitle%5D%5B%5D=Trouw&page={}&sortfield=date&cql%5B%5D=(date+_gte_+%2201-01-1970%22)&cql%5B%5D=(date+_lte_+%2201-01-2018%22)&coll=ddd".format(c))
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
        incategory = driver.find_elements_by_class_name("search-result")
        print(driver.current_url)

        links = [i.find_element_by_class_name("search-result__link").get_attribute("href") for i in incategory]

        # Loop through each link to access the page of each article
        for link in links:
            # open one article url
            driver.get(link)

            # newspaper
            newspaper = driver.find_element_by_xpath("//*[@id='content']/div[2]/div/div[2]/header/h1/span[2]")

            # date of the article
            date = driver.find_element_by_xpath("//*[@id='content']/div[2]/div/div[2]/header/div/ul/li[1]")

            # hover over the viewer and click it
            div_element = WebDriverWait(driver, 60).until(expected_conditions.presence_of_element_located((By.XPATH, '//*[@id="object"]/div/div/div')))
            hover = ActionChains(driver).move_to_element(div_element)
            hover.perform()
            div_element.click()

            # hover over the OCR button and click it to open the collapsible panel
            button = WebDriverWait(driver, 10).until(expected_conditions.presence_of_element_located((By.XPATH, '//*[@id="object-viewer__ocr-button"]')))
            hover = ActionChains(driver).move_to_element(button)
            hover.perform()

            button.click()

            # click the results panel through the page's own jQuery, then scroll down
            element = driver.find_element_by_css_selector(".object-viewer__ocr-panel-results")
            driver.execute_script("$(arguments[0]).click();", element)
            driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")

            # content of article
            content = None  # stays None if the lookup below fails
            try:
                content = driver.find_element_by_class_name("object-viewer__ocr-articletext")
            except Exception as e:
                print(str(e))

            # Define a dictionary with details we need
            r = {
                "1Newspaper": newspaper.text,
                "2Date": date.text,
                "3Content": content.text if content is not None else None,
            }
            # append r to all details
            all_details.append(r)

    except Exception as e:
        print(str(e))

# collect the results in a DataFrame and render it as text
df = pd.DataFrame(all_details)
df = df.to_string()

time.sleep(3)
driver.close()

In particular, this part of the code:

element = driver.find_element_by_css_selector(".object-viewer__ocr-panel-results")
driver.execute_script("$(arguments[0]).click();", element)
driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")

# content of article
try:
    content = driver.find_element_by_class_name("object-viewer__ocr-articletext")
except Exception as e:
    print(str(e))

Does anyone have a suggestion for finding the paragraph text in the collapsible window?

Thanks in advance.

Solution

Without a link to the page in question, it is hard to pin down the problem.

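My guess is that the DOM changes when you click the collapsible object, which means the collapsible object itself no longer has the same class, ID, or name.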

My second guess is that we are dealing with an iframe, which would require us to capture its id and switch focus to it.

What exception are you getting?
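
If the first guess is right and the panel's content only reaches the DOM some time after the click, the lookup may simply be running too early. Below is a minimal sketch of polling for the paragraph until it has text. It assumes the content loads asynchronously, which the question does not confirm. The helper name wait_for_text is made up for this example. Selenium's own WebDriverWait combined with expected_conditions covers the same ground; the helper only spells out the logic without depending on Selenium.

```python
import time


def wait_for_text(find_elements, timeout=10.0, poll=0.5):
    """Poll until an element returned by find_elements() has non-empty text.

    find_elements is any zero-argument callable that returns a list of
    objects with a .text attribute (Selenium WebElements fit this shape).
    Returns the stripped text, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        for element in find_elements():
            text = (element.text or "").strip()
            if text:
                return text
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


# Hypothetical usage with a Selenium driver (not executed here):
#   from selenium.webdriver.common.by import By
#   text = wait_for_text(
#       lambda: driver.find_elements(By.CLASS_NAME, "object-viewer__ocr-articletext"),
#       timeout=15,
#   )
```

Checking for non-empty text rather than mere presence matters because a WebElement's .text is empty while the element is still hidden.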


I found that the expanded element is present in full in the page's HTML, so I wrote new code using urllib and BeautifulSoup.

If anyone is interested in the new code, let me know!
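
For illustration, here is a minimal sketch of that kind of approach. It is not the code mentioned above, which was never posted. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages, and the names ClassTextExtractor and extract_text are made up for this example. It assumes the article text sits in the served HTML inside an element with the class object-viewer__ocr-articletext, and that the markup closes its tags properly.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given CSS class."""

    # Void elements never get a closing tag, so they must not change depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}
    # Closing one of these ends a block, so a word separator is inserted.
    BLOCK = {"p", "div"}

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []    # text pieces of the element being read
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            if tag == "br" and self.depth:
                self.chunks.append(" ")  # treat a line break as whitespace
            return
        if self.depth:
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.depth:
            return
        if tag in self.BLOCK:
            self.chunks.append(" ")
        self.depth -= 1
        if self.depth == 0:
            # Collapse runs of whitespace and store the finished element.
            self.results.append(" ".join("".join(self.chunks).split()))
            self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html, css_class="object-viewer__ocr-articletext"):
    """Return the text of each element in html that has css_class."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical usage (article_url is a placeholder, not executed here):
#   from urllib.request import urlopen
#   html = urlopen(article_url).read().decode("utf-8")
#   paragraphs = extract_text(html)
```

With BeautifulSoup installed, the lookup itself would shrink to something like soup.find_all(class_="object-viewer__ocr-articletext").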
