Scraping Twitter with Python


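I have been working on a project that reverse-engineers Twitter's web app, using Python and Twitter's unofficial API, in order to scrape public posts. (I want to build an "alternative" app: just a localhost page where I can search for a user and fetch their posts.)

I have been searching for and reading everything related to REST, AJAX, and the Python modules requests, requests-html, BeautifulSoup, and so on.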

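When I look at Twitter in devtools (for example on Marvel's profile page), the only relevant requests I can see being sent (via POST and GET) are client_event.json and UserTweets?variables=.... I know these are the relevant ones because I cleared the Network tab and recorded only while I scrolled down and loaded new tweets; these were the only requests that showed up that were not random videos (I filtered the list with -video -init -csp_report -config -ondemand -like -pageview -recommendations -prefetch -jot -key_live_kn -svg -jpg -jpeg -png -ico -analytics -loader -sharedCore -Hebrew).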

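I am new to this field, so I may be doing something wrong. I can see the response I am looking for on UserTweets, a nice JSON with all the data I need, but no matter what I try I cannot get to it.

I tried different modules and different headers and got nowhere. I do not want to use Selenium, because it is annoying and because I already know where the data I need is stored.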

I have been trying to send a GET request to: https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D

by doing this:

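from requests_html import HTMLSession
from bs4 import BeautifulSoup

# Create the session; the snippet as posted used `session` without defining it
session = HTMLSession()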

response = session.get('https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D')
response.html.render()
s = BeautifulSoup(response.html.html,'lxml')

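But I get an HTML document that either says Chromium is not supported, or is just a static page without the JavaScript that updates the DOM.

Any help is appreciated.

Thanks

P.S. I have posted the same question on reverseengineering.stackexchange as well, just to be safe (Stack Overflow has more suitable tags :-))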

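Solution

Before you dive into actual code, I would start by building a correct request to Twitter. I would use a third-party tool that focuses on REST and APIs (such as Postman) to build and test the request, and only then write the actual code.

From your question, it looks like you will be using Twitter's open API, which means you only need to send the x-guest-token and the basic Bearer authorization in the request headers.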

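The Bearer value is static: just browse Twitter and copy/paste it from the network monitor in devtools.
The x-guest-token needs something dynamic because it expires. I suggest sending a curl request to Twitter, parsing the token out of the response, and putting it into your headers before you send the request. You can see something very similar in: Python Downloading twitter video using python (without using twitter api).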
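
The two-header idea can be sketched in Python as follows. This is only a sketch under assumptions: the answer does not name the endpoint that issues guest tokens, so `GUEST_ACTIVATE_URL` and the `{"guest_token": ...}` response shape are assumptions based on what the web client has been observed to call, and either may change or stop working. The function names are hypothetical, and the standard library is used so the sketch runs without extra packages.

```python
import json
import urllib.request

# Assumption: this endpoint is not named in the answer; it is the one the web
# client was commonly observed to call, and it may change or disappear.
GUEST_ACTIVATE_URL = "https://api.twitter.com/1.1/guest/activate.json"


def parse_guest_token(body: str) -> str:
    """Extract the token from a JSON body shaped like {"guest_token": "..."}."""
    return str(json.loads(body)["guest_token"])


def build_headers(bearer: str, guest_token: str) -> dict:
    """Combine the static Bearer value with a freshly fetched guest token."""
    return {
        "authorization": f"Bearer {bearer}",
        "x-guest-token": guest_token,
    }


def fetch_guest_token(bearer: str, timeout: float = 10.0) -> str:
    """POST to the activation endpoint and return a new guest token."""
    request = urllib.request.Request(
        GUEST_ACTIVATE_URL,
        data=b"",
        headers={"authorization": f"Bearer {bearer}"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_guest_token(response.read().decode("utf-8"))
```

Because the token expires, `fetch_guest_token` would be called again whenever a request starts failing, and `build_headers` would be rebuilt with the new value.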

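Once you have both of the above, build the required GET request in Postman and check that you get the correct response. Only after everything works in Postman should you write the same thing in Python or any other language.**

**You can use Postman's code snippets to generate the required code automatically for many programming languages.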
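
Once the request works in Postman, the Python version might look roughly like this. It is a sketch, not a confirmed recipe: the base URL and query id are copied from the URL in the question and can change at any time, the helper names are hypothetical, and the header values are placeholders you would fill in yourself.

```python
import json
import urllib.parse
import urllib.request

# Taken from the URL in the question; the query id is an opaque value that
# the site can change at any time.
GRAPHQL_BASE = "https://twitter.com/i/api/graphql"
QUERY_ID = "vamMfA41UoKXUmppa9PhSw"


def build_user_tweets_url(variables: dict, query_id: str = QUERY_ID) -> str:
    """Serialize `variables` as compact JSON and percent-encode it into the URL."""
    compact = json.dumps(variables, separators=(",", ":"))
    encoded = urllib.parse.quote(compact, safe="")
    return f"{GRAPHQL_BASE}/{query_id}/UserTweets?variables={encoded}"


def fetch_json(url: str, headers: dict, timeout: float = 10.0) -> dict:
    """Send the GET request with the given headers and decode the JSON body."""
    request = urllib.request.Request(url, headers=headers, method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL needs no network access, so it can be checked offline
url = build_user_tweets_url({"userId": "15687962", "count": 20})
print(url)
```

The actual call would then be `fetch_json(url, headers)`, where `headers` holds the `authorization` and `x-guest-token` values described above. Keeping URL construction separate from the network call makes it easy to compare the generated URL against the one shown in devtools before sending anything.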

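I just tried the same thing, but with requests instead of the requests_html module. I can get all of the site content, but I would not call it "pretty".

Also, I am now blocked from accessing the site without logging in. Here is my small example. Please use the official Twitter API instead.

I also think I may have been blocked after trying this script. I only ran it twice.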

import requests
import bs4

def example():
    result = requests.get("https://twitter.com/childrightscnct")
    soup = bs4.BeautifulSoup(result.text,"lxml")
    print(soup)

if __name__ == '__main__':
    example()

To select an element with bs4, use

some_text = soup.select_one('locator').getText()
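
Note that in bs4 `select()` returns a list of every match while `select_one()` returns a single element (or None), so the text has to be read from an individual element. A small sketch with made-up markup and a made-up selector, using the built-in `html.parser` so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched page
html = """
<div class="post"><p class="text">first post</p></div>
<div class="post"><p class="text">second post</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first matching element, or None if nothing matches
first = soup.select_one("div.post p.text")
first_text = first.get_text() if first is not None else None

# select() returns a list of every match, so call get_text() per element
all_texts = [tag.get_text() for tag in soup.select("div.post p.text")]

print(first_text)
print(all_texts)
```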

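I found a tool for scraping Twitter that has a lot of stars on GitHub: https://github.com/twintproject/twint. I have not tried it myself; I hope it is legitimate.

@TripleS, here is an example of how to extract the JSON data from __INITIAL_STATE__ and write it to a text file.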

import requests
import re
import json

# get page
result = requests.get('https://twitter.com/ThePSF')


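# Extract the JSON object assigned in: window.__INITIAL_STATE__ = {...};
# (the literal dot is escaped so it does not match arbitrary characters)
match = re.search(r"window\.__INITIAL_STATE__\s?=\s?(\{.*?\});", result.text)
if match is None:
    raise SystemExit("window.__INITIAL_STATE__ not found in the page")
json_string = match.group(1)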

# convert text string to structured json data
twitter_json = json.loads(json_string)

# Save the structured JSON data to a text file, which may help you
# orient yourself and possibly pick out the parts you are
# interested in (if there are any)
with open('twitter_json_data.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(json.dumps(twitter_json, indent=4, sort_keys=True))
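
One caveat with the non-greedy pattern `(\{.*?\});` used above: it stops at the first `};` it meets, which can cut the object short if that sequence appears inside a string value. A possible alternative, sketched here with a hypothetical helper name and a made-up page fragment, is to locate the assignment with a regex and let the standard library's `json.JSONDecoder().raw_decode` find where the object ends:

```python
import json
import re
from typing import Optional

STATE_PREFIX = re.compile(r"window\.__INITIAL_STATE__\s*=\s*")


def extract_initial_state(html: str) -> Optional[dict]:
    """Return the object assigned to window.__INITIAL_STATE__, or None."""
    match = STATE_PREFIX.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (the trailing ";" and other scripts)
        state, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return state


# Hypothetical page fragment; note the "};" inside a string value
sample = '<script>window.__INITIAL_STATE__={"user":{"bio":"a};b"}};window.x=1;</script>'
print(extract_initial_state(sample))
```

Returning None instead of raising keeps the caller in control when the page is a login wall or the markup has changed.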