OK, so this is what I'm trying to achieve:
> Call a URL with a dynamically filtered list of search results
> Click the first search result (5 per page)
> Scrape the headline, paragraphs, and images, and store them as a JSON object in a separate file, e.g.

{
  "Title": "headline element of the individual entry",
  "Content": "paragraphs and images in DOM order of the individual entry"
}

> Navigate back to the search results overview page and repeat steps 2–3
> After 5/5 results, go to the next page (click the pagination link)
> Repeat steps 2–5 until there are no entries left
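The per-entry record from step 3 can be sketched in Python like this; the field values below are placeholders for illustration, not real scraped data:

```python
import json

# Placeholder record mirroring the target structure from step 3:
# "Title" holds the entry's headline, "Content" the paragraphs and
# image URLs in DOM order (all values here are illustrative only).
record = {
    "Title": "headline element of the individual entry",
    "Content": [
        "first paragraph text",
        "https://example.com/image-1.png",  # image URL kept in DOM order
        "second paragraph text",
    ],
}

# One file per entry, named after the title
with open(f"{record['Title']}.json", "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False)
```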
To picture the content again:
What I have so far is:
```python
# import libraries
from selenium import webdriver
from bs4 import BeautifulSoup

# URL
url = "https://URL.com"

# Create a browser session
driver = webdriver.Chrome("PATH TO chromedriver.exe")
driver.implicitly_wait(30)
driver.get(url)

# click consent btn on destination URL (overlays rest of the content)
python_consentButton = driver.find_element_by_id('acceptAllCookies')
python_consentButton.click()  # click cookie consent btn

# Selenium hands the page source to Beautiful Soup
soup_results_overview = BeautifulSoup(driver.page_source, 'lxml')

for link in soup_results_overview.findAll("a", class_="searchResults__detail"):
    # Selenium visits each Search Result Page
    searchResult = driver.find_element_by_class_name('searchResults__detail')
    searchResult.click()  # click Search Result

    # Ask Selenium to go back to the search results overview page
    driver.back()

# Tell Selenium to click paginate "next" link
# probably needs to be in a surrounding for loop?
paginate = driver.find_element_by_class_name('pagination-link-next')
paginate.click()  # click paginate next

driver.quit()
```
Problem
Every time Selenium navigates back to the search results overview page, the list count resets,
so it clicks the first entry five times, navigates through the next five items, and stops.
Any suggestions on how to solve this are appreciated.
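The usual Selenium fix for this is to iterate by position and re-locate the result list after every `driver.back()`, because the element references found before navigation go stale. A minimal stdlib-only sketch of that pattern, with a fake `fetch_results()` standing in for re-calling `driver.find_elements_by_class_name('searchResults__detail')` (the data below is invented for illustration):

```python
# Simulated search-results page. fetch_results() stands in for
# driver.find_elements_by_class_name('searchResults__detail'), which must
# be called again after every driver.back(): elements located before the
# navigation are stale and always point at the same first entry.
PAGE = ["entry-1", "entry-2", "entry-3", "entry-4", "entry-5"]

def fetch_results():
    # In Selenium this would re-query the live DOM each time.
    return list(PAGE)

visited = []
for i in range(len(fetch_results())):   # iterate by position, not by element
    results = fetch_results()           # re-locate the list after "back()"
    visited.append(results[i])          # click the i-th entry, not always the first
    # ... scrape the detail page here, then driver.back() ...
```

Because the index `i` survives the (simulated) page reload, each of the five entries is visited exactly once instead of the first entry five times.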
Solution
You don't need Selenium at all; you can scrape with just requests and BeautifulSoup. It will be faster and consume far fewer resources:
```python
import json
import requests
from bs4 import BeautifulSoup

# Get 1000 results
params = {
    "$filter": "TemplateName eq 'Application Article'",
    "$orderby": "ArticleDate desc",
    "$top": "1000",
    "$inlinecount": "allpages",
}
response = requests.get("https://www.cst.com/odata/Articles", params=params).json()

# iterate 1000 results
articles = response["value"]
for article in articles:
    article_json = {}
    article_content = []

    # title of article
    article_title = article["Title"]
    # article url
    article_url = str(article["Url"]).split("|")[1]
    print(article_title)

    # request article page and parse it
    article_page = requests.get(article_url).text
    page = BeautifulSoup(article_page, "html.parser")

    # get header
    header = page.select_one("h1.head--bordered").text
    article_json["Title"] = str(header).strip()

    # get body content with images links and descriptions
    content = page.select("section.content p, section.content img, "
                          "section.content span.imageDescription, section.content em")

    # collect content to json format
    for x in content:
        if x.name == "img":
            article_content.append("https://cst.com/solutions/article/" + x.attrs["src"])
        else:
            article_content.append(x.text)

    article_json["Content"] = article_content

    # write to json file
    with open(f"{article_json['Title']}.json", 'w') as to_json_file:
        to_json_file.write(json.dumps(article_json))

print("the end")
```
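One caveat with the `open(f"{article_json['Title']}.json", 'w')` line above: article titles can contain characters such as `/`, `:` or `?` that are invalid in file names on Windows or POSIX. A small stdlib-only helper that strips them first (the `safe_filename` name and the exact character set are my own choice, not part of the answer):

```python
import re

def safe_filename(title: str, max_len: int = 100) -> str:
    # Replace characters that are problematic in file names on
    # Windows and POSIX with underscores, then trim the length.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return cleaned[:max_len] or "untitled"

# e.g. used as: open(f"{safe_filename(article_json['Title'])}.json", "w")
```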