Zillow scraper: why can't I scrape the complete list of results from a Zillow search?
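I'm trying to explore Zillow housing data for analysis, but the data I scrape from Zillow contains far fewer records than the search results show.
For example:
The search page reports 76 records. If I use the Google Chrome extension Zillow-to-excel, all 76 homes can be scraped. https://chrome.google.com/webstore/detail/zillow-to-excel/aecdekdgjlncaadbdiciepplaobhcjgi/related
But when I scrape Zillow with Python requests, I only get 18-20 records. Here is my code: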
import requests
import json
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np

cnt = 0
stop_check = 0
ele = []
url = 'https://www.zillow.com/birmingham-al-35216/'
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-TW;q=0.6',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
for i in range(1, 2):
    # search state copied from the browser, with the page number substituted
    params = {
        'searchQueryState': '{"pagination":{"currentPage":' + str(i) + '},"usersSearchTerm":"35216","mapBounds":{"west":-86.83314614582643,"east":-86.73781685417354,"south":33.32843303639682,"north":33.511017584543204},"regionSelection":[{"regionId":73386,"regionType":7}],"isMapVisible":true,"filterState":{"sort":{"value":"globalrelevanceex"},"ah":{"value":true}},"isListVisible":true,"mapZoom":13}'
    }
    page = requests.get(url, headers=headers, params=params, timeout=2)
    sp = soup(page.content, 'lxml')
    # each listing card exposes its address in an <address class="list-card-addr"> tag
    lst = sp.find_all('address', {'class': 'list-card-addr'})
    ele.extend(lst)
    print(i, len(lst))
    if len(lst) == 0:
        stop_check += 1
    if stop_check >= 3:
        print('stop on three empty')
        break
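The headers and params were copied from the site using Chrome developer tools. I also tried other searches and found that I can only scrape the first 9-11 records on each page.
I know there is a Zillow API, but as far as I can tell it cannot be used for a general search such as all homes in a ZIP code, so I want to try web scraping.
Could I get some suggestions on how to fix my code?
Thanks a lot!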
Solution
You can try requesting the search JSON endpoint directly instead of parsing the HTML page:
import requests
import json

url = 'https://www.zillow.com/search/GetSearchPageState.htm'
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'upgrade-insecure-requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
houses = []
for page in range(1, 3):
    params = {
        "searchQueryState": json.dumps({
            "pagination": {"currentPage": page},
            "usersSearchTerm": "35216",
            "mapBounds": {
                "west": -86.97413567189196,
                "east": -86.57244804982165,
                "south": 33.346263857015515,
                "north": 33.48754107532057
            },
            "mapZoom": 12,
            "regionSelection": [{"regionId": 73386, "regionType": 7}],
            "isMapVisible": True,
            "filterState": {
                "isAllHomes": {"value": True},
                "sortSelection": {"value": "globalrelevanceex"}
            },
            "isListVisible": True
        }),
        "wants": json.dumps({
            "cat1": ["listResults", "mapResults"],
            "cat2": ["total"]
        }),
        "requestId": 3
    }
    # send request (the response gets its own name so it does not shadow the loop variable)
    response = requests.get(url, headers=headers, params=params)
    # get json data
    json_data = response.json()
    # collect the listings from this page
    for house in json_data['cat1']['searchResults']['listResults']:
        houses.append(house)

# show data
print('Total houses - {}'.format(len(houses)))
# show info in houses
for house in houses:
    if 'brokerName' in house:
        print('{}: {}'.format(house['brokerName'], house['price']))
    else:
        print('No broker: {}'.format(house['price']))
Output:
Total houses - 76
RealtySouth-MB-Crestline: $424,900
eXp Realty,LLC Central: $259,900
ARC Realty Mountain Brook: $849,000
Ray & Poynor Properties: $499,900
Hinge Realty: $1,550,000
...
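If it helps, the request parameters and the result parsing can be pulled into two small functions so that each piece can be checked without touching the network. This is only a sketch: the function names `build_params` and `extract_listings` are made up for illustration, the parameter values are copied from the code above, and the response layout (`cat1` -> `searchResults` -> `listResults`, with optional `brokerName` and `price` keys) is assumed from that one example rather than from any documented API, so it may differ or change.

```python
import json


def build_params(page, search_term="35216", region_id=73386, region_type=7):
    """Build the query parameters for one results page.

    Hypothetical helper: the values mirror the request shown above.
    """
    query_state = {
        "pagination": {"currentPage": page},
        "usersSearchTerm": search_term,
        "mapBounds": {
            "west": -86.97413567189196,
            "east": -86.57244804982165,
            "south": 33.346263857015515,
            "north": 33.48754107532057,
        },
        "mapZoom": 12,
        "regionSelection": [{"regionId": region_id, "regionType": region_type}],
        "isMapVisible": True,
        "filterState": {
            "isAllHomes": {"value": True},
            "sortSelection": {"value": "globalrelevanceex"},
        },
        "isListVisible": True,
    }
    wants = {"cat1": ["listResults", "mapResults"], "cat2": ["total"]}
    # json.dumps produces valid JSON (true/false, quoting) without string concatenation
    return {
        "searchQueryState": json.dumps(query_state),
        "wants": json.dumps(wants),
        "requestId": 3,
    }


def extract_listings(json_data):
    """Return broker and price for each listing, tolerating missing keys.

    Assumes the cat1 -> searchResults -> listResults layout seen above.
    """
    cat1 = json_data.get("cat1") or {}
    search_results = cat1.get("searchResults") or {}
    results = search_results.get("listResults") or []
    return [
        {"broker": house.get("brokerName"), "price": house.get("price")}
        for house in results
    ]


if __name__ == "__main__":
    # Made-up sample shaped like the assumed response; no request is sent here.
    sample = {
        "cat1": {
            "searchResults": {
                "listResults": [
                    {"brokerName": "Hinge Realty", "price": "$1,550,000"},
                    {"price": "$424,900"},
                ]
            }
        }
    }
    for row in extract_listings(sample):
        print(row["broker"] or "No broker", row["price"])
```

In the loop above, `params = build_params(page)` would replace the inline dictionary and `houses.extend(extract_listings(response.json()))` would replace the inner `for` loop. Using `.get` means a response without the expected keys yields an empty list instead of raising `KeyError`.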
P.S. If this helped, please don't forget to mark the answer as correct :)