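How to scrape a dynamic website to pull the URLs of recent news articles
I am trying to use Python to extract investment news articles from a dynamic website. I have followed several tutorials that work for static websites, but I am having trouble pulling the URLs of the individual articles. The code I am using is below: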
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.institutionalinvestor.com/search?'
                'term=&'  # eventually, the term would include the words I am actively searching for
                'filters=%7B"dates":%5B"last%20week"%5D%7D')  # filter to the last week; this would eventually be for the last 24 hours only
r.html.absolute_links
This returns the links on the page as a set:
{'https://www.institutionalinvestor.com/Login',
 'https://www.institutionalinvestor.com/display-advertising',
 'http://www.ttivanguard.com/',
 'https://www.riaintel.com/',
 'http://interactive.institutionalinvestor.com/executive-IR-research-em/about-586KX-2742AB.html',
 'https://twitter.com/iimag',
 'https://myaccount.institutionalinvestor.com/Orders/SelectPackage.html',
 'https://www.institutionalinvestor.com/',
 'https://www.institutionalinvestor.com/Corner-Office',
 'https://www.institutionalinvestor.com/Management',
 'http://iimemberships.com/',
 'http://www.iiconferences.com/',
 'https://www.institutionalinvestor.com/Register',
 'https://www.institutionalinvestor.com/cookies',
 'https://www.institutionalinvestor.com/Careers',
 'https://www.institutionalinvestor.com/Custom-Research',
 'https://www.institutionalinvestor.com/Portfolio',
 'https://www.euromoneyplc.com/modern-slavery-act-transparency-statement',
 'https://www.institutionalinvestor.com/research',
 'https://www.institutionalinvestor.com/Masthead',
 'https://www.institutionalinvestor.com/about-thought-leadership',
 'https://www.institutionalinvestor.com/Investors',
 'https://www.institutionalinvestor.com/Premium',
 'https://www.institutionalinvestor.com/about-us',
 'https://www.institutionalinvestor.com/thought-leadership',
 'https://www.institutionalinvestor.com/PrivacyPolicy',
 'https://www.institutionalinvestor.com/sponsored',
 'https://www.institutionalinvestor.com/Video',
 'https://www.institutionalinvestor.com/How-to-Pitch-Institutional-Investor',
 'https://www.institutionalinvestor.com/FAQs',
 'https://www.institutionalinvestor.com/Research-FAQs',
 'https://www.institutionalinvestor.com/Reprints',
 'https://www.institutionalinvestor.com/TermsConditions',
 'https://www.linkedin.com/company/164389',
 'https://www.facebook.com/iimag',
 'https://www.institutionalinvestor.com/Customer-Service',
 'https://www.institutionalinvestor.com/Culture',
 'https://www.institutionalinvestor.com/awards',
 'https://www.institutionalinvestor.com/Research-Insight',
 'http://www.sovereignwealthcenter.com/'}
But I cannot find the links to the articles themselves. When I inspect the page source, this is what I see:
<div class="search-results" role="listbox">
<article class="search-result" ng-repeat="article in serverData.hits.results">
<div class="search-result-text-ghost"></div>
<h2 ng-class="article|publicationClass"><a ng-href="{{article|articleHref}}">{{article|snippet:'title'|removeHtmlTags}}</a>
</h2>
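As someone fairly new to HTML, the final h2 part leads me to believe the site is dynamic, and that is where I am stuck. Any help would be greatly appreciated. My ideal output would be the article's title, its source (in this case, "Institutional Investor"), a preview of the article (the first few lines or so), and the article's URL, all placed in a DataFrame that can be sent to me each morning to save the time I would otherwise spend pulling news manually. I have already put the rest of the project together, except for pulling news from sites such as Institutional Investor that are not covered by the API I am using.
I am open to any and all new approaches if necessary or recommended. Thanks in advance!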
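The suspicion about the h2 element can be checked without a browser. The sketch below uses only the standard library and feeds a trimmed copy of the markup quoted above to a small parser; the class name TemplateLinkFinder is made up for this illustration. If the anchors carry an ng-href that still contains {{ ... }} and no ordinary href, the server sent an unrendered AngularJS template, so the real links only exist after JavaScript runs.

```python
from html.parser import HTMLParser


class TemplateLinkFinder(HTMLParser):
    """Hypothetical helper: separates unrendered template links from real ones."""

    def __init__(self):
        super().__init__()
        self.unrendered = []  # ng-href values that still contain {{ ... }}
        self.rendered = []    # ordinary href values

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        ng_href = attributes.get("ng-href")
        if ng_href and "{{" in ng_href:
            self.unrendered.append(ng_href)
        if attributes.get("href"):
            self.rendered.append(attributes["href"])


# Trimmed copy of the markup quoted in the question (closing tags added).
raw_html = """
<div class="search-results" role="listbox">
<article class="search-result" ng-repeat="article in serverData.hits.results">
<h2 ng-class="article|publicationClass"><a ng-href="{{article|articleHref}}">{{article|snippet:'title'|removeHtmlTags}}</a>
</h2>
</article>
</div>
"""

finder = TemplateLinkFinder()
finder.feed(raw_html)
print(finder.unrendered)  # placeholders that were never filled in
print(finder.rendered)    # real links found in the raw HTML
```

Here the first list holds the placeholder and the second is empty, which matches what absolute_links returned: no article URLs in the raw response.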
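Solution
Try using Selenium.
A simple working example. You may need to refine a few things, such as the base URL, or building a DataFrame instead of printing, ...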
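from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "https://www.institutionalinvestor.com/search"
# Selenium 4 takes the driver path through a Service object; adjust the path for your machine.
driver = webdriver.Chrome(service=Service(r'C:\Program Files\ChromeDriver\chromedriver.exe'))
driver.get(url)
sleep(3)  # crude wait so the JavaScript can render the search results

soup = BeautifulSoup(driver.page_source, "lxml")
for article in soup.select('div.search-results > article'):
    title = article.find('h2').get_text()
    link = article.find('a')['href']
    print(title + ': https://www.institutionalinvestor.com' + link)

driver.quit()  # close the browser and end the WebDriver session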
Output
Who’s on Third?
: https://www.institutionalinvestor.com/article/b1pqxvgpm3dwjb/Who-s-on-Third
First the Cyberattack Hits. Then the Insider Trading.
: https://www.institutionalinvestor.com/article/b1pzfhkhcv70m1/First-the-Cyberattack-Hits-Then-the-Insider-Trading
Hedge Funds Featured Prominently in 2020 SPAC Boom
: https://www.institutionalinvestor.com/article/b1pzg04d0bbvxz/Hedge-Funds-Featured-Prominently-in-2020-SPAC-Boom
The Stocks That Drove Glenview’s Major Comeback
: https://www.institutionalinvestor.com/article/b1pzf7qb428t3x/The-Stocks-That-Drove-Glenview-s-Major-Comeback
Bill Ackman’s Billion-Dollar Year
: https://www.institutionalinvestor.com/article/b1pzgx69sxhstk/Bill-Ackman-s-Billion-Dollar-Year
Ex-Verger Interns Make NFL,‘Bachelor’ Debuts
: https://www.institutionalinvestor.com/article/b1pzg3qjq9xt5x/Ex-Verger-Interns-Make-NFL-Bachelor-Debuts
David Einhorn’s Greenlight Capital Pulls Off a Coup in the Fourth Quarter
: https://www.institutionalinvestor.com/article/b1pyl5mtkmpt80/David-Einhorn-s-Greenlight-Capital-Pulls-Off-a-Coup-in-the-Fourth-Quarter
Gold's 2020 Ride Explained
: https://www.institutionalinvestor.com/article/b1psmn58mppsyj/gold39s-2020-ride-explained
The ARK Invest Takeover Battle Is Over
: https://www.institutionalinvestor.com/article/b1pw88ldyr905m/The-ARK-Invest-Takeover-Battle-Is-Over
Investors Quickly Saw Big Gains From These SPACs
: https://www.institutionalinvestor.com/article/b1pt6fl7c9dsqc/Investors-Quickly-Saw-Big-Gains-From-These-SPACs
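Two of the refinements hinted at above can be sketched with the standard library alone. In the output, each title sits on its own line, most likely because the text of the h2 element ends with a newline, and each link is built by concatenating a hard-coded base URL. The helper below, whose name format_result is invented for this sketch, normalizes the whitespace in the title and uses urllib.parse.urljoin so that relative and already-absolute hrefs are both handled.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.institutionalinvestor.com"


def format_result(raw_title, href, base_url=BASE_URL):
    """Return 'title: absolute_url' with stray whitespace removed from the title."""
    title = " ".join(raw_title.split())  # collapses newlines and repeated spaces
    return f"{title}: {urljoin(base_url, href)}"


# A title with the trailing newline seen in the output, plus a relative href.
line = format_result("Who’s on Third?\n", "/article/b1pqxvgpm3dwjb/Who-s-on-Third")
print(line)
```

Inside the loop this would replace the print call, for example print(format_result(title, link)).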
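Yes, it is dynamic. You can use Selenium to let the page render first and then pull out the HTML as you normally would with a static site. Alternatively, their API is right there (I think even the full articles are in there, but I only pulled out what you asked for):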
import json

import pandas as pd
import requests

api = 'https://search.euromoneyapi.com/api/Search'
headers = {
    'content-type': 'application/json',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
}
# Not sent in the request below; kept because it appears to show extra options (from, size, sort).
payload = {"site": "amg_ii", "suggester": 'true', "from": 0, "size": 10, "sort": "dates", "sort_order": "desc"}
data = {"site": "amg_ii", "suggester": True, "sort_order": "desc"}
jsonData = requests.post(api, headers=headers, data=json.dumps(data)).json()

rows = []
articles = jsonData['hits']['results']
for article in articles:
    title = article['snippet']['title'][0]
    source = 'https://www.institutionalinvestor.com/'
    try:
        preview = article['snippet']['description'][0]
    except (KeyError, IndexError, TypeError):
        # Some articles come back without a description.
        preview = ''
    url = 'https://www.institutionalinvestor.com/article/' + article['id'].split('/')[-1] + '/' + article['fields']['url_title'][0]
    row = {'title': title, 'source': source, 'preview': preview, 'url': url}
    rows.append(row)

df = pd.DataFrame(rows)
Output:
print(df.to_string())
title source preview url
0 Who’s on Third? https://www.institutionalinvestor.com/ Third-party claims filing service providers require due diligence for shareholder litigation outside the U.S https://www.institutionalinvestor.com/article/b1pqxvgpm3dwjb/Who-s-on-Third
1 First the Cyberattack Hits. Then the Insider Trading. https://www.institutionalinvestor.com/ Researchers share their striking evidence of pre-disclosure spikes in options trading. https://www.institutionalinvestor.com/article/b1pzfhkhcv70m1/First-the-Cyberattack-Hits-Then-the-Insider-Trading
2 Hedge Funds Featured Prominently in 2020 SPAC Boom https://www.institutionalinvestor.com/ Nearly 13 percent of the blank check companies that filed plans to go public in 2020 were sponsored by hedge fund firms or individuals formerly associated with the industry. https://www.institutionalinvestor.com/article/b1pzg04d0bbvxz/Hedge-Funds-Featured-Prominently-in-2020-SPAC-Boom
3 The Stocks That Drove Glenview’s Major Comeback https://www.institutionalinvestor.com/ Larry Robbins’ hedge fund finished 2020 solidly positive thanks to huge gains in the final two months of the year. https://www.institutionalinvestor.com/article/b1pzf7qb428t3x/The-Stocks-That-Drove-Glenview-s-Major-Comeback
4 Bill Ackman’s Billion-Dollar Year https://www.institutionalinvestor.com/ A big short and a big SPAC fueled hefty gains for Pershing Square in 2020. https://www.institutionalinvestor.com/article/b1pzgx69sxhstk/Bill-Ackman-s-Billion-Dollar-Year
5 Ex-Verger Interns Make NFL,‘Bachelor’ Debuts https://www.institutionalinvestor.com/ Verger Capital Management CIO Jim Dunn shared the inside story on former interns John Wolford and Matt James. https://www.institutionalinvestor.com/article/b1pzg3qjq9xt5x/Ex-Verger-Interns-Make-NFL-Bachelor-Debuts
6 David Einhorn’s Greenlight Capital Pulls Off a Coup in the Fourth Quarter https://www.institutionalinvestor.com/ The manager turned in a strong fourth quarter by sticking with his biggest positions. https://www.institutionalinvestor.com/article/b1pyl5mtkmpt80/David-Einhorn-s-Greenlight-Capital-Pulls-Off-a-Coup-in-the-Fourth-Quarter
7 Gold's 2020 Ride Explained https://www.institutionalinvestor.com/ https://www.institutionalinvestor.com/article/b1psmn58mppsyj/gold39s-2020-ride-explained
8 The ARK Invest Takeover Battle Is Over https://www.institutionalinvestor.com/ A new deal has “extinguished” Resolute’s option to acquire an additional stake in the ETF firm. https://www.institutionalinvestor.com/article/b1pw88ldyr905m/The-ARK-Invest-Takeover-Battle-Is-Over
9 Investors Quickly Saw Big Gains From These SPACs https://www.institutionalinvestor.com/ At least two blank-check companies surged on recent merger announcements. https://www.institutionalinvestor.com/article/b1pt6fl7c9dsqc/Investors-Quickly-Saw-Big-Gains-From-These-SPACs
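The row-building step in the loop above can also be pulled into a small function that is easy to check offline. This is only a sketch: the function name article_to_row is invented, the sample record is hand-written to mimic the keys the loop reads (id, snippet, fields) rather than copied from a real response, and the "amg_ii/" prefix on the id is a guess. It also stores the publication name as the source, which is what the question asked for, instead of the site URL.

```python
def article_to_row(article, base_url="https://www.institutionalinvestor.com"):
    """Turn one search hit into a flat dict; a missing description becomes ''."""
    snippet = article.get("snippet", {})
    title = (snippet.get("title") or [""])[0]
    preview = (snippet.get("description") or [""])[0]
    article_id = article["id"].split("/")[-1]      # keep only the last path segment
    url_title = article["fields"]["url_title"][0]
    return {
        "title": title,
        "source": "Institutional Investor",
        "preview": preview,
        "url": f"{base_url}/article/{article_id}/{url_title}",
    }


# Hand-written sample shaped like the fields used above; it has no description.
sample_hit = {
    "id": "amg_ii/b1psmn58mppsyj",  # the prefix is an assumption for illustration
    "snippet": {"title": ["Gold's 2020 Ride Explained"]},
    "fields": {"url_title": ["gold39s-2020-ride-explained"]},
}

row = article_to_row(sample_hit)
print(row["url"])
print(repr(row["preview"]))
```

With this in place, the loop reduces to rows = [article_to_row(a) for a in articles] before the DataFrame is built.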