微信公众号搜"智元新知"关注
微信扫一扫可直接关注哦!

如何从网站获取/识别特定数据?

如何解决如何从网站获取/识别特定数据?

我正在尝试使用以下代码抓取网站

import requests
import pandas as pd

with requests.Session() as connection:
    connection.headers.update(
        {
            "referer": "https://gmatclub.com/forum/decision-tracker.html","user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/89.0.4389.86 Yabrowser/21.3.0.740 Yowser/2.5 Safari/537.36",}
    )
    _ = connection.get("https://gmatclub.com/forum/decision-tracker.html")
    endpoint = connection.get("https://gmatclub.com/api/schools/v1/forum/app-tracker-latest-updates?limit=50&year=all").json()
    for item in endpoint["statistics"]:
        print(item)

我不确定如何在决策跟踪器下获取录取状态 - 实时更新。

enter image description here

解决方法

勾号、叉号和圆圈代表申请是被录取、被拒绝还是因任何原因待决。此信息位于 status_id 下。在源代码中可以找到数字的映射库。当我们将其转换为 python 字典时,我们可以获得状态,还可以重建刻度等:

import requests

status_mapping = {1: { 'id':1,'class':'mainApplicationSubmitted','name':'Application Submitted' },3: { 'id':3,'class':'mainInterviewed','name':'interviewed' },4: { 'id':4,'class':'mainAdmitted','name':'admited' },5: { 'id':5,'class':'mainDenied','name':'denied' },6: { 'id':6,7: { 'id':7,'class':'mainWaitListed','name':'waitlisted' },8: { 'id':8,9: { 'id':9,'class':'mainMatriculating','name':'matriculating' },10:{ 'id':10,'class':'mainWlAdmited','name':'admitted From WL' },11:{ 'id':11,'class':'mainResearching','name':'researching Or Writing Essays' },12:{ 'id':12,'class':'mainInvitedToInterview','name':'invited To Interview' },13:{ 'id':13,'class':'mainWithdrawn','name':'withdrawn Application '}}

with requests.Session() as connection:
    connection.headers.update(
        {
            "referer": "https://gmatclub.com/forum/decision-tracker.html","user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/89.0.4389.86 YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36",}
    )
    _ = connection.get("https://gmatclub.com/forum/decision-tracker.html")
    endpoint = connection.get("https://gmatclub.com/api/schools/v1/forum/app-tracker-latest-updates?limit=50&year=all").json()
    for item in endpoint["statistics"]:
        try:
            status = status_mapping[int(item['status_id'])]['name']
            if int(item['status_id']) in [4]:
                status_short = 'green'
            elif int(item['status_id']) in [5,6]:
                status_short = 'red'
            else:
                status_short = 'grey'
            print(status,status_short)
        except:
            print(f"Key {item['status_id']} is missing from status_mapping. Check the entry at {item['date']} to see what this key represents and add it to status_mapping.")

版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。