
Scraping from JSON with beautifulsoup and urllib


I am practicing scraping on an example website that uses JSON. For example, take this page: http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini. I want to grab the information at lines 388-396 of the page source:

view-source:https://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini

and save each variable in quotes (i.e. "id", "item_number", "type", etc.) in a variable of the same name. The block in question looks like this:

<script>
    var js_data = {"first_time_bid":true,"yourbid":0,"product":{"id":55,"item_number":"P55","type":"PRODUCT","fixed":0,"price":1000,"tot_price":1000,"min_bid_value":1010,"currency":"EUR","raise_bid":10,"stamp_end":"2013-06-14 12:00:00","bids_number":12,"estimated_value":200,"extended_time":0,"url":"https:\/\/www.charitystars.com\/product\/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini","conversion_value":1,"eid":0,"user_has_bidded":false},"bid":{"id":323,"uid":126,"first_name":"fabio","last_name":"Gastaldi","company_name":"","is_company":0,"title":"fab1","nationality":"IT","amount":1000,"max_amount":0,"table":"","stamp":1371166006,"real_stamp":"2013-06-14 01:26:46"}};
    var p_currency = '€';
    var conversion_value = '1';
    var merch_items = [];
    var gallery_items = [];

    var inside_gala = false;
</script>

So far, I have managed to run the following code:

import urllib2
import json
from bs4 import BeautifulSoup

hdr = {"User-Agent": "My Agent"}
req = urllib2.Request("http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini", headers=hdr)
response = urllib2.urlopen(req)
htmlSource = response.read()
soup = BeautifulSoup(htmlSource)
title = soup.find_all("span", {"itemprop": "name"})  # get the title
script_soup = soup.find_all("script")

For some reason, script_soup contains a lot of information I do not need. I believe the part I need is in there, but I do not know how to access it (in an efficient way). I would really appreciate some help.

Solution

The data is indeed in script_soup[9]. The problem is that it is a JSON string hard-coded inside a script tag. You can get the string as plain text with script_soup[9].string, then extract the JSON part with split() (as in my example) or with a regex, and finally load the string into a Python dictionary with json.loads().

import requests
from bs4 import BeautifulSoup
from pandas import DataFrame
import json

hdr = {"User-Agent": "My Agent"}
response = requests.get("http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini", headers=hdr)

soup = BeautifulSoup(response.content, "html.parser")
script_soup = soup.find_all("script")
# the JSON literal sits between "var js_data = " and the trailing ";"
data = json.loads(script_soup[9].string.split('= ')[1].split(';')[0])

The data is now stored in the variable data. You can parse it as needed, or load it into a pandas DataFrame with DataFrame(data).
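The answer mentions a regex as an alternative to split(). A minimal sketch of that approach, run here on a shortened, hard-coded stand-in for script_soup[9].string (the variable name js_data comes from the page source; the sample data is abbreviated):

```python
import json
import re

# shortened stand-in for the <script> text scraped from the page
script_text = 'var js_data = {"first_time_bid": true, "product": {"id": 55, "type": "PRODUCT"}};'

# capture everything between "var js_data = " and the closing ";"
match = re.search(r'var js_data\s*=\s*(\{.*\})\s*;', script_text)
data = json.loads(match.group(1))

print(data["product"]["id"])  # → 55
```

Unlike splitting on '= ', the regex anchors on the js_data variable name itself, so it still works if other assignments appear earlier in the script tag.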

Alternative solution

If you can use the requests and lxml modules, you can use the following.

Updated according to the OP:

import requests
from lxml import html
import json

header = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                   ' Chrome/85.0.4183.102 Safari/537.36 Edg/85.0.564.51'),
    'X-Requested-With': 'XMLHttpRequest'
}


url = 'http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini'
a = requests.get(url, headers=header)

# text of the first <script> tag inside the page content
a = html.fromstring(a.text).xpath('//*[@class="page-content"]/script/text()')[0]
# strip newlines and spaces (note: this also removes spaces inside string
# values such as "stamp_end")
a = a.replace('\n', '').replace(' ', '')
b = a.split(';')               # one JavaScript statement per element
b = [i.split('=') for i in b]  # split each statement into name / value
c = json.loads(b[0][1])        # b[0][1] is the js_data JSON literal
c['product']
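The OP also wanted each quoted key ("id", "item_number", "type", etc.) available under a variable of the same name. Rather than creating loose variables, one sketch is to wrap the parsed dictionary in a types.SimpleNamespace, so each key becomes an attribute (shown here on a shortened, hard-coded sample of the product object):

```python
import json
from types import SimpleNamespace

# shortened sample of the js_data "product" object from the page
product = json.loads('{"id": 55, "item_number": "P55", "type": "PRODUCT", "price": 1000}')

# each JSON key becomes an attribute with the same name
p = SimpleNamespace(**product)
print(p.id, p.item_number)  # → 55 P55
```

This keeps the names from the JSON without shadowing builtins such as id in the module namespace.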
