How to scrape a value from a Google search results page with Python, BeautifulSoup, and Requests
I'm new to Python and am trying to build a set of programs that text my phone with how the stock-market indexes are performing. I have some programs with limited functionality, and I believe they would improve if I could scrape data from Google, which so far I haven't managed to do. The value I'm trying to extract sits at the top of the results page each time, in what is almost a kind of table. At the bottom is a snippet connected to the value to be scraped.
Here is the part of my code that currently does the web scraping. I'm using Beautiful Soup and Requests.
import bs4
import requests
res = requests.get('https://www.google.com/search?safe=active&sxsrf=ALeKk00d7WrRTMvmhypG20E5MOEWpRwKlw%3A1601591747498&ei=w1l2X52DHoPatAXElILQDg&q=nasdaq+composite&oq=nasd&gs_lcp=CgZwc3ktYWIQAxgAMgwIIxAnEJ0CEEYQ-gEyBAgjECcyBAgjECcyCggAELEDEIMBEEMyCggAELEDEIMBEEMyCAgAELEDEIMBMgcIABCxAxBDMgcIABCxAxBDMgcIABCxAxBDMgoIABCxAxCDARBDOgQIABBHOgUIABCxAzoHCCMQ6gIQJzoHCCMQJxCdAjoECAAQQ1DszQ5Y9tkOYPjkDmgBcAJ4BYABqQOIAZ4NkgEJMS43LjEuMC4xmAEAoAEBqgEHZ3dzLXdperABCsgBCMABAQ&sclient=psy-ab')
type(res)
soup = bs4.BeautifulSoup(res.text,'lxml')
type(soup)
Current_Level = soup.find(class_='IsqQVc_NprOob_XcVN5d')
print (Current_Level)
If you search for "nasdaq composite", that link leads to the page. The class I pass to soup.find() is the one that lines up with the value when I right-click the value on the page and inspect it.
[Image of the value I'm trying to scrape]
Solution
Here is how to get the value:
from bs4 import BeautifulSoup as soup
import requests

url_to_scrape = "https://www.google.com/search?q=nasdaq+composite&oq=nasdaq+composite&aqs=chrome.0.0l8.5126j1j4&sourceid=chrome&ie=UTF-8"
try:
    client_page = requests.get(url_to_scrape)
except requests.RequestException as exc:
    raise SystemExit(f"Request aborted: {exc}")
page_html = client_page.text
client_page.close()

page_soup = soup(page_html, "html.parser")
# Google renders the index level in a div with this class (subject to change)
nasdaqValue = page_soup.find_all("div", {"class": "BNeawe iBp4i AP7Wnd"})
print(nasdaqValue[0].text)
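The find-all call returns a list, so indexing `[0]` raises an IndexError whenever Google serves markup without that class (a consent page, a bot check, or simply a rotated layout). A minimal, self-contained guard against that, reusing the same class name:

```python
from bs4 import BeautifulSoup

def first_match_text(html, css_class="BNeawe iBp4i AP7Wnd"):
    """Return the text of the first div carrying css_class, or None if absent."""
    matches = BeautifulSoup(html, "html.parser").find_all("div", class_=css_class)
    return matches[0].text if matches else None

# Behaves like the snippet above when the class is present...
print(first_match_text('<div class="BNeawe iBp4i AP7Wnd">11,332.49</div>'))  # 11,332.49
# ...and degrades to None instead of crashing when it is not.
print(first_match_text("<p>consent page</p>"))  # None
```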
Another answer:
Try using the class name BNeawe iBp4i AP7Wnd:
import requests
from bs4 import BeautifulSoup
res = requests.get(
"https://www.google.com/search?safe=active&sxsrf=ALeKk00d7WrRTMvmhypG20E5MOEWpRwKlw%3A1601591747498&ei=w1l2X52DHoPatAXElILQDg&q=nasdaq+composite&oq=nasd&gs_lcp=CgZwc3ktYWIQAxgAMgwIIxAnEJ0CEEYQ-gEyBAgjECcyBAgjECcyCggAELEDEIMBEEMyCggAELEDEIMBEEMyCAgAELEDEIMBMgcIABCxAxBDMgcIABCxAxBDMgcIABCxAxBDMgoIABCxAxCDARBDOgQIABBHOgUIABCxAzoHCCMQ6gIQJzoHCCMQJxCdAjoECAAQQ1DszQ5Y9tkOYPjkDmgBcAJ4BYABqQOIAZ4NkgEJMS43LjEuMC4xmAEAoAEBqgEHZ3dzLXdperABCsgBCMABAQ&sclient=psy-ab"
)
soup = BeautifulSoup(res.text,"lxml")
# Use `.split('+')` to drop the daily change (e.g. `+257.47 (2.32%)`) from the text
Current_Level = soup.find(class_="BNeawe iBp4i AP7Wnd").text.split('+')[0].strip()
print(Current_Level)
Output:
11,332.49
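The split can be sanity-checked offline against a sample of the combined string Google renders (the numbers are the ones from the page capture in this thread):

```python
# Sample of the combined text Google renders for the index.
text = "11,332.49 +257.47 (2.32%)"

# Everything before the '+' is the index level; strip the trailing space.
current_level = text.split('+')[0].strip()
print(current_level)  # 11,332.49
```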
Edit:
If you call soup.prettify(), you will see that the data sits under the BNeawe iBp4i AP7Wnd class:
soup = BeautifulSoup(res.text,"lxml")
print(soup.prettify())
...
...
<div>
<div>
<div>
<div class="kCrYT">
<div>
<div>
<div>
<div class="BNeawe iBp4i AP7Wnd">
<div>
<div class="BNeawe iBp4i AP7Wnd">
11,332.49
<span class="rQMQod AWuZUe">
+257.47 (2.32%)
</span>
</div>
</div>
</div>
</div>
</div>
...
...
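The prettified fragment also shows that the daily change sits in its own span (class rQMQod AWuZUe), so the index level and the change can be separated without string-splitting. A sketch against a static copy of that markup; the class names are Google's obfuscated ones and rotate often:

```python
from bs4 import BeautifulSoup

# Static copy of the fragment prettify() showed above.
html = """
<div class="BNeawe iBp4i AP7Wnd">
  11,332.49
  <span class="rQMQod AWuZUe">
    +257.47 (2.32%)
  </span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="BNeawe iBp4i AP7Wnd")

# The daily change is the nested span's text...
change = container.find("span", class_="rQMQod AWuZUe").get_text(strip=True)
# ...and the index level is the text node that precedes it.
level = container.contents[0].strip()

print(level)   # 11,332.49
print(change)  # +257.47 (2.32%)
```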
Another answer:
Make sure you are sending a user-agent header. Here is my answer to the same problem you have. To make things quicker, I'll copy the code:
import requests
import lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?q=Nasdaq+composite', headers=headers)
html = response.text
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('.wT3VGc').text)
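The answers in this thread each target a different class name (IsqQVc NprOob XcVN5d, BNeawe iBp4i AP7Wnd, wT3VGc) because Google rotates its obfuscated class names. One way to make a scraper survive a rotation a bit longer is to fall back through several candidate selectors; the list below is assembled from this thread and is illustrative only:

```python
from bs4 import BeautifulSoup

# Selectors seen in this thread for the index level; illustrative, not exhaustive.
CANDIDATE_SELECTORS = [".wT3VGc", ".BNeawe.iBp4i.AP7Wnd", ".IsqQVc.NprOob.XcVN5d"]

def find_index_level(html):
    """Return the text of the first node any candidate selector matches, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in CANDIDATE_SELECTORS:
        node = soup.select_one(selector)
        if node:
            return node.get_text(strip=True)
    return None

# Works against the fragment captured earlier in the thread:
print(find_index_level('<div class="BNeawe iBp4i AP7Wnd">11,332.49</div>'))  # 11,332.49
```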