Scraping a dynamic website to pull the URLs of recent news articles


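I'm trying to use Python to pull investment news articles from a dynamic website. I've followed several tutorials that work for static sites, but I'm having trouble pulling the URLs of the individual articles. The code I'm using is below: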

    from requests_html import HTMLSession
    session = HTMLSession()
    
    r = session.get('https://www.institutionalinvestor.com/search?'
                    'term=&'  # eventually, the term would include the words I am actively searching for
                    'filters=%7B"dates":%5B"last%20week"%5D%7D')  # filter to the last week; this would eventually be the last 24 hours only

    r.html.absolute_links

This returns the links on the page as a set:

{'https://www.institutionalinvestor.com/Login','https://www.institutionalinvestor.com/display-advertising','http://www.ttivanguard.com/','https://www.riaintel.com/','http://interactive.institutionalinvestor.com/executive-IR-research-em/about-586KX-2742AB.html','https://twitter.com/iimag','https://myaccount.institutionalinvestor.com/Orders/SelectPackage.html','https://www.institutionalinvestor.com/','https://www.institutionalinvestor.com/Corner-Office','https://www.institutionalinvestor.com/Management','http://iimemberships.com/','http://www.iiconferences.com/','https://www.institutionalinvestor.com/Register','https://www.institutionalinvestor.com/cookies','https://www.institutionalinvestor.com/Careers','https://www.institutionalinvestor.com/Custom-Research','https://www.institutionalinvestor.com/Portfolio','https://www.euromoneyplc.com/modern-slavery-act-transparency-statement','https://www.institutionalinvestor.com/research','https://www.institutionalinvestor.com/Masthead','https://www.institutionalinvestor.com/about-thought-leadership','https://www.institutionalinvestor.com/Investors','https://www.institutionalinvestor.com/Premium','https://www.institutionalinvestor.com/about-us','https://www.institutionalinvestor.com/thought-leadership','https://www.institutionalinvestor.com/PrivacyPolicy','https://www.institutionalinvestor.com/sponsored','https://www.institutionalinvestor.com/Video','https://www.institutionalinvestor.com/How-to-Pitch-Institutional-Investor','https://www.institutionalinvestor.com/FAQs','https://www.institutionalinvestor.com/Research-FAQs','https://www.institutionalinvestor.com/Reprints','https://www.institutionalinvestor.com/TermsConditions','https://www.linkedin.com/company/164389','https://www.facebook.com/iimag','https://www.institutionalinvestor.com/Customer-Service','https://www.institutionalinvestor.com/Culture','https://www.institutionalinvestor.com/awards','https://www.institutionalinvestor.com/Research-Insight','http://www.sovereignwealthcenter.com/'}

But I can't find the links to the articles themselves. When I inspect the page source, this is what I see:

<div class="search-results" role="listbox">
                        <article class="search-result" ng-repeat="article in serverData.hits.results">
                            <div class="search-result-text-ghost"></div>
                            <h2 ng-class="article|publicationClass"><a ng-href="{{article|articleHref}}">{{article|snippet:'title'|removeHtmlTags}}</a>
                            </h2>

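As someone fairly new to HTML, the final h2 section makes me believe the site is dynamic, and this is where I'm stuck. Any help would be appreciated. My ideal output is a data frame containing each article's title, source (in this case "Institutional Investor"), a preview (the first few lines or so), and URL, which can be emailed to me every morning to save the time I would otherwise spend pulling news manually. I've already put the rest of the project together, except for pulling news from sites like Institutional Investor that aren't covered by the APIs I'm using.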
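A quick way to check that suspicion is to look for unrendered template placeholders in the raw HTML. The sketch below uses only the standard library; the marker patterns and the function name `looks_client_rendered` are illustrative assumptions, not part of any library, and the heuristic only covers AngularJS-style markup like the snippet above.

```python
import re

# Heuristic markers of an unrendered client-side template.
# Chosen for this illustration; not an exhaustive list.
TEMPLATE_MARKERS = [
    re.compile(r"\{\{.*?\}\}"),                    # {{ expression }} placeholders
    re.compile(r"\bng-(?:repeat|href|class)\b"),   # AngularJS directives seen in the page source
]


def looks_client_rendered(html: str) -> bool:
    """Return True if the raw HTML still contains template placeholders."""
    return any(pattern.search(html) for pattern in TEMPLATE_MARKERS)


# Raw markup as served (taken from the page source above).
raw = '<h2 ng-class="article|publicationClass"><a ng-href="{{article|articleHref}}">{{article|snippet:\'title\'|removeHtmlTags}}</a></h2>'
# What the same element might look like once the browser has run the JavaScript.
rendered = '<h2><a href="/article/b1pqxvgpm3dwjb/Who-s-on-Third">Who’s on Third?</a></h2>'

print(looks_client_rendered(raw))       # True
print(looks_client_rendered(rendered))  # False
```

With the session code above, the raw markup should be available as `r.html.html`. If the check returns True, the article links are filled in by JavaScript after the page loads, so a plain HTTP fetch will not contain them.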

I'm open to any and all new approaches if necessary or recommended. Thanks in advance!

Solution

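Try using Selenium.

Here is a simple working example. You may need to refine a few things, such as the baseUrl, or building a data frame instead of printing, ...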

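from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
from time import sleep

url = "https://www.institutionalinvestor.com/search"

# Selenium 4 takes the driver path through a Service object;
# older releases accepted webdriver.Chrome(executable_path=...).
driver = webdriver.Chrome(service=Service(r'C:\Program Files\ChromeDriver\chromedriver.exe'))

driver.get(url)
sleep(3)  # crude pause so the JavaScript has time to render the results
soup = BeautifulSoup(driver.page_source, "lxml")

for article in soup.select('div.search-results > article'):
    # get_text() keeps the newline inside <h2>, so the URL prints on the next line
    title = article.find('h2').get_text()
    link = article.find('a')['href']

    print(title + ': https://www.institutionalinvestor.com' + link)

driver.quit()  # end the browser session and the driver process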

Output

Who’s on Third?
: https://www.institutionalinvestor.com/article/b1pqxvgpm3dwjb/Who-s-on-Third
First the Cyberattack Hits. Then the Insider Trading.
: https://www.institutionalinvestor.com/article/b1pzfhkhcv70m1/First-the-Cyberattack-Hits-Then-the-Insider-Trading
Hedge Funds Featured Prominently in 2020 SPAC Boom
: https://www.institutionalinvestor.com/article/b1pzg04d0bbvxz/Hedge-Funds-Featured-Prominently-in-2020-SPAC-Boom
The Stocks That Drove Glenview’s Major Comeback
: https://www.institutionalinvestor.com/article/b1pzf7qb428t3x/The-Stocks-That-Drove-Glenview-s-Major-Comeback
Bill Ackman’s Billion-Dollar Year
: https://www.institutionalinvestor.com/article/b1pzgx69sxhstk/Bill-Ackman-s-Billion-Dollar-Year
Ex-Verger Interns Make NFL,‘Bachelor’ Debuts
: https://www.institutionalinvestor.com/article/b1pzg3qjq9xt5x/Ex-Verger-Interns-Make-NFL-Bachelor-Debuts
David Einhorn’s Greenlight Capital Pulls Off a Coup in the Fourth Quarter
: https://www.institutionalinvestor.com/article/b1pyl5mtkmpt80/David-Einhorn-s-Greenlight-Capital-Pulls-Off-a-Coup-in-the-Fourth-Quarter
Gold's 2020 Ride Explained
: https://www.institutionalinvestor.com/article/b1psmn58mppsyj/gold39s-2020-ride-explained
The ARK Invest Takeover Battle Is Over
: https://www.institutionalinvestor.com/article/b1pw88ldyr905m/The-ARK-Invest-Takeover-Battle-Is-Over
Investors Quickly Saw Big Gains From These SPACs
: https://www.institutionalinvestor.com/article/b1pt6fl7c9dsqc/Investors-Quickly-Saw-Big-Gains-From-These-SPACs
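As a rough sketch of the "baseUrl" and "data frame instead of print" refinements mentioned above, the scraped pairs can be collected into rows first. The helper name `build_rows` and the sample pairs are assumptions made for illustration; only the standard library is used.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.institutionalinvestor.com"  # site root taken from the example above


def build_rows(pairs, base_url=BASE_URL):
    """Turn (title, href) pairs into dicts with a cleaned title and an absolute URL."""
    rows = []
    for title, href in pairs:
        rows.append({
            "title": " ".join(title.split()),  # collapse newlines and repeated spaces
            "url": urljoin(base_url, href),     # resolves relative hrefs, keeps absolute ones
        })
    return rows


# Sample pairs shaped like what the loop above extracts (title still has its newline).
scraped = [
    ("Who’s on Third?\n", "/article/b1pqxvgpm3dwjb/Who-s-on-Third"),
    ("Bill Ackman’s Billion-Dollar Year\n", "/article/b1pzgx69sxhstk/Bill-Ackman-s-Billion-Dollar-Year"),
]
for row in build_rows(scraped):
    print(row["title"] + ": " + row["url"])
```

Inside the Selenium loop, appending `(title, link)` to a list and passing it to `build_rows` would give a list of dicts that `pandas.DataFrame(rows)` accepts directly.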

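Yes, it is dynamic. You can use Selenium to let the page render first and then pull the HTML as you normally would with a static site. Alternatively, their API is right there (I think even the full articles are in it, but I only pulled out what you asked for):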

import requests
import json
import pandas as pd

api = 'https://search.euromoneyapi.com/api/Search'

headers = {'content-type': 'application/json', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}

# Defined in the original answer but never sent; only `data` below is posted.
payload = {"site": "amg_ii", "suggester": 'true', "from": 0, "size": 10, "sort": "dates", "sort_order": "desc"}

data = {"site": "amg_ii", "suggester": True, "sort_order": "desc"}

jsonData = requests.post(api, headers=headers, data=json.dumps(data)).json()

rows = []
articles = jsonData['hits']['results']
for article in articles:
    title = article['snippet']['title'][0]
    source = 'https://www.institutionalinvestor.com/'
    try:
        preview = article['snippet']['description'][0]
    except (KeyError, IndexError):
        # some hits have no description; fall back to an empty preview
        preview = ''
    # article URL = site root + last segment of the id + URL-friendly title
    url = 'https://www.institutionalinvestor.com/article/' + article['id'].split('/')[-1] + '/' + article['fields']['url_title'][0]

    row = {'title': title, 'source': source, 'preview': preview, 'url': url}
    rows.append(row)

df = pd.DataFrame(rows)

Output:

print (df.to_string())
                                                                       title                                  source                                                                                                                                                                        preview                                                                                                                                     url
0                                                            Who’s on Third?  https://www.institutionalinvestor.com/                                                                  Third-party claims filing service providers require due diligence for shareholder litigation outside the U.S                                                              https://www.institutionalinvestor.com/article/b1pqxvgpm3dwjb/Who-s-on-Third
1                      First the Cyberattack Hits. Then the Insider Trading.  https://www.institutionalinvestor.com/                                                                                         Researchers share their striking evidence of pre-disclosure spikes in options trading.                        https://www.institutionalinvestor.com/article/b1pzfhkhcv70m1/First-the-Cyberattack-Hits-Then-the-Insider-Trading
2                         Hedge Funds Featured Prominently in 2020 SPAC Boom  https://www.institutionalinvestor.com/  Nearly 13 percent of the blank check companies that filed plans to go public in 2020 were sponsored by hedge fund firms or individuals formerly associated with the industry.                         https://www.institutionalinvestor.com/article/b1pzg04d0bbvxz/Hedge-Funds-Featured-Prominently-in-2020-SPAC-Boom
3                            The Stocks That Drove Glenview’s Major Comeback  https://www.institutionalinvestor.com/                                                             Larry Robbins’ hedge fund finished 2020 solidly positive thanks to huge gains in the final two months of the year.                            https://www.institutionalinvestor.com/article/b1pzf7qb428t3x/The-Stocks-That-Drove-Glenview-s-Major-Comeback
4                                         Bill Ackman’s Billion-Dollar Year  https://www.institutionalinvestor.com/                                                                                                     A big short and a big SPAC fueled hefty gains for Pershing Square in 2020.                                          https://www.institutionalinvestor.com/article/b1pzgx69sxhstk/Bill-Ackman-s-Billion-Dollar-Year
5                              Ex-Verger Interns Make NFL,‘Bachelor’ Debuts  https://www.institutionalinvestor.com/                                                                  Verger Capital Management CIO Jim Dunn shared the inside story on former interns John Wolford and Matt James.                                 https://www.institutionalinvestor.com/article/b1pzg3qjq9xt5x/Ex-Verger-Interns-Make-NFL-Bachelor-Debuts
6  David Einhorn’s Greenlight Capital Pulls Off a Coup in the Fourth Quarter  https://www.institutionalinvestor.com/                                                                                          The manager turned in a strong fourth quarter by sticking with his biggest positions.  https://www.institutionalinvestor.com/article/b1pyl5mtkmpt80/David-Einhorn-s-Greenlight-Capital-Pulls-Off-a-Coup-in-the-Fourth-Quarter
7                                             Gold&#39;s 2020 Ride Explained  https://www.institutionalinvestor.com/                                                                                                                                                                                                                               https://www.institutionalinvestor.com/article/b1psmn58mppsyj/gold39s-2020-ride-explained
8                                     The ARK Invest Takeover Battle Is Over  https://www.institutionalinvestor.com/                                                                                A new deal has “extinguished” Resolute’s option to acquire an additional stake in the ETF firm.                                     https://www.institutionalinvestor.com/article/b1pw88ldyr905m/The-ARK-Invest-Takeover-Battle-Is-Over
9                           Investors Quickly Saw Big Gains From These SPACs  https://www.institutionalinvestor.com/                                                                                                      At least two blank-check companies surged on recent merger announcements.                           https://www.institutionalinvestor.com/article/b1pt6fl7c9dsqc/Investors-Quickly-Saw-Big-Gains-From-These-SPACs
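Row 7 in the output above shows a title coming back with an HTML entity (`Gold&#39;s`) and an empty preview. A minimal sketch of a row builder that decodes entities and tolerates missing fields follows; `first` and `article_to_row` are hypothetical helper names, and the sample hit is made up to mirror row 7 rather than captured from the API.

```python
import html

SITE = "https://www.institutionalinvestor.com"


def first(values, default=""):
    """Return the first element of a list-like field, or a default if missing or empty."""
    return values[0] if values else default


def article_to_row(article):
    """Map one search hit (shaped like the API response above) to a flat row."""
    snippet = article.get("snippet", {})
    fields = article.get("fields", {})
    article_id = article["id"].split("/")[-1]  # keep only the last path segment
    return {
        "title": html.unescape(first(snippet.get("title"))),         # &#39; -> '
        "source": SITE + "/",
        "preview": html.unescape(first(snippet.get("description"))),  # '' when absent
        "url": f"{SITE}/article/{article_id}/{first(fields.get('url_title'))}",
    }


# Hypothetical hit modelled on row 7 of the output above (the "id" prefix is made up).
sample = {
    "id": "example/b1psmn58mppsyj",
    "snippet": {"title": ["Gold&#39;s 2020 Ride Explained"]},
    "fields": {"url_title": ["gold39s-2020-ride-explained"]},
}
print(article_to_row(sample))
```

Building the list as `rows = [article_to_row(a) for a in articles]` would then replace the loop body, with the data frame constructed from `rows` as before.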
