
Web scraping after login

How to scrape a web page after logging in

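I run the code below to log in at the URL assigned to loginUrl. After authenticating, I want to fetch another page, whose URL is stored in the second URL variable. However, when I print(portfolioPage.content), I get the page shown immediately after login instead of the portfolioPage I want. What is wrong with my code?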

from bs4 import BeautifulSoup
import requests
# create session
session = requests.Session()

loginUrl='https://www.investopedia.com/auth/realms/investopedia/protocol/openid-connect/auth?client_id=inv-simulator&redirect_uri=https%3A%2F%2Fwww.investopedia.com%2Fauth%2Frealms%2Finvestopedia%2Fshopify-auth%2Finv-simulator%2Flogin%3F%26redirectUrl%3Dhttps%253A%252F%252Fwww.investopedia.com%252Fauth%252Frealms%252Finvestopedia%252Fprotocol%252Fopenid-connect%252Fauth%253Fresponse_type%253Dcode%2526approval_prompt%253Dauto%2526redirect_uri%253Dhttps%25253A%25252F%25252Fwww.investopedia.com%25252Fsimulator%25252Fhome.aspx%2526client_id%253Dinv-simulator-conf&state=7edda3b2-eb6a-441f-8589-b42b8b78accf&response_mode=fragment&response_type=code&scope=openid&nonce=cd558670-7ae3-4c14-8281-bc149d4987b3'
portfolioUrl = 'https://www.investopedia.com/simulator/Trade/Tradestock.aspx'

payload = {
    'username': 'my email',
    'password': 'my password',
}
authPage = session.get(loginUrl)
soup = BeautifulSoup(authPage.content, 'html.parser')
form = soup.find('form')
postUrl = form['action']
auth = session.post(postUrl, data=payload)

portfolioPage = session.get(portfolioUrl)
soup = BeautifulSoup(portfolioPage.content, 'html.parser')
print(portfolioPage.content)

Solution

I don't think you are posting your data correctly, and you are not keeping the session open after you log in. Try this...

import requests
from bs4 import BeautifulSoup

# loginUrl and portfolioUrl are the login and target URLs from the question.
# The with block closes the session automatically once it exits.
with requests.Session() as login_request:
    payload = {
        'username': 'my email',
        'password': 'my password',
    }
    login_request.post(loginUrl, data=payload)

    # still inside the with block, so the same session and its cookies are reused
    source_code = login_request.get(portfolioUrl).content

# after this you can use soup to parse the source_code
soup = BeautifulSoup(source_code, 'html.parser')

print(soup)  # to check whether it is printing the logged-in data
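
To check whether the login actually took effect, it can help to look at where the request ended up and not only at the page body. The following is a minimal sketch and not part of the original answer: `summarize_redirects` and `landed_on` are hypothetical helper names, and the stand-in object with example.com URLs only mimics the `status_code`, `url`, and `history` attributes that a requests response exposes.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


def summarize_redirects(response):
    """Return (status_code, url) for every redirect hop, then the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def landed_on(response, expected_path):
    """True if the final URL's path ends with expected_path (case-insensitive)."""
    final_path = urlparse(response.url).path.lower()
    return final_path.endswith(expected_path.lower())


# Made-up stand-in for a response that was redirected back to a login page.
# With requests you would pass the real response object instead.
fake = SimpleNamespace(
    status_code=200,
    url='https://example.com/login',
    history=[SimpleNamespace(status_code=302, url='https://example.com/trade')],
)

for status, url in summarize_redirects(fake):
    print(status, url)
print(landed_on(fake, '/trade'))
```

With the real session you would pass the result of `login_request.get(portfolioUrl)`. If the final URL is a login or home page and not the page you asked for, the POST probably did not authenticate the session.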

You can try this:

import requests
from bs4 import BeautifulSoup

# create session
session = requests.Session()

url = 'https://investopedia.com/simulator/portfolio/'

payload = {
    'username': 'your_email',
    'password': 'your_password',
}

# get the login page
auth_page = session.get(url)
soup = BeautifulSoup(auth_page.content, 'html.parser')

# get the login form
form = soup.find('form')

# get the URL the form posts to
post_url = form['action']

# authenticate
session.post(post_url, data=payload)

# fetch and parse the target page
content_url = 'https://investopedia.com/simulator/trade/tradestock.aspx'
page = session.get(content_url)
page_soup = BeautifulSoup(page.content, 'html.parser')

# locate the simulator section and its second 'table2' table
sim_page = page_soup.find('div', {'class': 'sim-page'})
table = sim_page.find_all('table', {'class': 'table2'})[1]
rows = table.find_all('tr')

for row in rows:
    print(row.find('th').text)
    print(row.find('td').text)
    print('----')

Output:

Value (USD)
$10,000.00
----
Buying Power
$10,000.00
----
Cash
$10,000.00
----
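
Login forms often carry hidden fields (for example a CSRF token) and a relative `action`, and posting only the username and password may then be rejected. The sketch below shows one way to handle that, under assumptions: `build_login_request` is a hypothetical helper, and the sample HTML, field names, and example.com URL are made up for illustration and not taken from the Investopedia page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def build_login_request(html, page_url, credentials):
    """Return (post_url, payload) for the first form found in html."""
    soup = BeautifulSoup(html, 'html.parser')
    form = soup.find('form')
    if form is None:
        raise ValueError('no <form> found; the page may be rendered by JavaScript')

    # Start from the form's own inputs so hidden values are sent back unchanged.
    payload = {}
    for field in form.find_all('input'):
        name = field.get('name')
        if name:
            payload[name] = field.get('value', '')

    # Overwrite the empty username/password inputs with the real credentials.
    payload.update(credentials)

    # The action may be relative; resolve it against the URL the form came from.
    post_url = urljoin(page_url, form.get('action', ''))
    return post_url, payload


# Made-up login form for demonstration only.
sample_html = '''
<form action="/login/submit?session_code=abc" method="post">
  <input type="hidden" name="csrf_token" value="xyz">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
'''
post_url, payload = build_login_request(
    sample_html,
    'https://example.com/auth/login',
    {'username': 'your_email', 'password': 'your_password'},
)
print(post_url)
print(payload)
```

In the answer's flow this would take the place of reading `form['action']` directly: pass `auth_page.text` and `auth_page.url`, then call `session.post(post_url, data=payload)`.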
