Zillow scraper: why can't I scrape the complete listings from a Zillow search?

I am exploring Zillow housing data for analysis, but the data I scrape from Zillow contains far fewer records than the site lists.

For example:

I tried to pull the for-sale listings in ZIP code 35216: https://www.zillow.com/birmingham-al-35216/?searchQueryState=%7B%22usersSearchTerm%22%3A%2235216%22%2C%22mapBounds%22%3A%7B%22west%22%3A-86.93997505787829%2C%22east%22%3A-86.62926796559313%2C%22south%22%3A33.33562772711966%2C%22north%22%3A33.51819716059094%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A73386%2C%22regionType%22%3A7%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22sort%22%3A%7B%22value%22%3A%22globalrelevanceex%22%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A13%2C%22pagination%22%3A%7B%7D%7D

The page shows 76 records. With the Google Chrome extension Zillow-to-excel, all 76 houses can be scraped. https://chrome.google.com/webstore/detail/zillow-to-excel/aecdekdgjlncaadbdiciepplaobhcjgi/related

But when I scrape Zillow with Python requests, I only get 18-20 records. Here is my code:

import requests
import json
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np

cnt=0
stop_check=0
ele=[]
url='https://www.zillow.com/birmingham-al-35216/'
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9','accept-encoding': 'gzip,deflate,br','accept-language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-TW;q=0.6','upgrade-insecure-requests': '1','user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
for i in range(1,2):
    params = {
    'searchQueryState':'{"pagination":{"currentPage":'+str(i)+'},"usersSearchTerm":"35216","mapBounds":{"west":-86.83314614582643,"east":-86.73781685417354,"south":33.32843303639682,"north":33.511017584543204},"regionSelection":[{"regionId":73386,"regionType":7}],"isMapVisible":true,"filterState":{"sort":{"value":"globalrelevanceex"},"ah":{"value":true}},"isListVisible":true,"mapZoom":13}'
    }
    page=requests.get(url,headers=headers,params=params,timeout=2)
    sp=soup(page.content,'lxml')
    lst=sp.find_all('address',{'class':'list-card-addr'})
    ele.extend(lst)
    print(i,len(lst))
    if len(lst)==0:
        stop_check+=1
    if stop_check>=3:
        print('stop on three empty')
        break

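The headers and params were copied from the page using Chrome developer tools. I also tried other searches and found that I can only scrape the first 9-11 records on each page.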

I know there is a Zillow API, but it cannot be used for general searches such as all the houses in a ZIP code. So I want to try web scraping.

Could I get some suggestions on how to fix my code?

Thanks a lot!

Solution

You can try this:

import requests
import json

url = 'https://www.zillow.com/search/GetSearchPageState.htm'

headers = {
    'Accept': '*/*','Accept-Encoding': 'gzip,deflate,br','Accept-Language': 'en-US,en;q=0.9','upgrade-insecure-requests': '1','User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

houses = []
for page in range(1,3):
    params = {
        "searchQueryState": json.dumps({
            "pagination": {"currentPage": page},"usersSearchTerm": "35216","mapBounds": {
                "west": -86.97413567189196,"east": -86.57244804982165,"south": 33.346263857015515,"north": 33.48754107532057
            },"mapZoom": 12,"regionSelection": [
                {
                    "regionId": 73386,"regionType": 7
                }
            ],"isMapVisible": True,"filterState": {
                "isAllHomes": {
                    "value": True
                },"sortSelection": {
                    "value": "globalrelevanceex"
                }
            },"isListVisible": True
        }),"wants": json.dumps(
            {
                "cat1": ["listResults","mapResults"],"cat2": ["total"]
            }
        ),"requestId": 3
    }

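    # send the request (kept in its own name so the loop variable `page` is not overwritten)
    response = requests.get(url,headers=headers,params=params)

    # parse the JSON body
    json_data = response.json()

    # collect the listings from this page
    for house in json_data['cat1']['searchResults']['listResults']:
        houses.append(house)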


# show data
print('Total houses - {}'.format(len(houses)))

# show info in houses
for house in houses:
    if 'brokerName' in house.keys():
        print('{}: {}'.format(house['brokerName'],house['price']))
    else:
        print('No broker: {}'.format(house['price']))

Output:

Total houses - 76
RealtySouth-MB-Crestline: $424,900
eXp Realty,LLC Central: $259,900
ARC Realty Mountain Brook: $849,000
Ray & Poynor Properties: $499,900
Hinge Realty: $1,550,000
...
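
The loop above hard-codes two pages, which happened to cover the 76 results. Here is a minimal sketch of paging until nothing new comes back, assuming the endpoint keeps accepting the same parameters. `build_params` and `collect_listings` are hypothetical helper names, not part of requests or of any Zillow API, and the stop condition (an empty or repeated page) is an assumption that has not been checked against the live site.

```python
import json
from typing import Callable, Dict, List


def build_params(page: int, search_term: str, region_id: int,
                 bounds: Dict[str, float]) -> Dict[str, object]:
    """Build query parameters in the same shape as the request above."""
    query_state = {
        "pagination": {"currentPage": page},
        "usersSearchTerm": search_term,
        "mapBounds": bounds,
        "mapZoom": 12,
        "regionSelection": [{"regionId": region_id, "regionType": 7}],
        "isMapVisible": True,
        "filterState": {
            "isAllHomes": {"value": True},
            "sortSelection": {"value": "globalrelevanceex"},
        },
        "isListVisible": True,
    }
    wants = {"cat1": ["listResults", "mapResults"], "cat2": ["total"]}
    return {
        "searchQueryState": json.dumps(query_state),
        "wants": json.dumps(wants),
        "requestId": 3,
    }


def collect_listings(fetch_page: Callable[[int], dict],
                     max_pages: int = 20) -> List[dict]:
    """Fetch pages 1, 2, ... and stop at the first page that adds nothing new.

    fetch_page is any callable that returns the decoded JSON for a page,
    so the paging logic can be tried without touching the network.
    """
    houses: List[dict] = []
    seen = set()
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        results = (payload.get("cat1", {})
                          .get("searchResults", {})
                          .get("listResults", []))
        new_houses = []
        for house in results:
            # Serialise each record so repeated records can be detected.
            key = json.dumps(house, sort_keys=True)
            if key not in seen:
                seen.add(key)
                new_houses.append(house)
        if not new_houses:
            break  # assumption: an empty or repeated page means we are done
        houses.extend(new_houses)
    return houses
```

To use it with the request from the answer, pass something like `lambda p: requests.get(url, headers=headers, params=build_params(p, "35216", 73386, bounds)).json()` as `fetch_page`, where `bounds` is the `mapBounds` dictionary.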

P.S. If this helped you, please don't forget to mark the answer as correct :)
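
Since the goal is analysis, it may help to flatten the results into rows before loading them anywhere. This sketch uses only the standard library and only the two keys the answer already reads, `brokerName` and `price`. `to_rows` and `to_csv` are hypothetical helpers, and any other field name would need to be checked against the actual response.

```python
import csv
import io
from typing import Dict, Iterable, List


def to_rows(houses: Iterable[dict], fields: List[str]) -> List[Dict[str, object]]:
    """Keep only the requested fields, using '' where a listing lacks one."""
    return [{field: house.get(field, "") for field in fields} for house in houses]


def to_csv(rows: List[Dict[str, object]], fields: List[str]) -> str:
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Example with one record from the output above and one made-up record
# without a broker (the second record is illustrative only).
sample = [
    {"brokerName": "Hinge Realty", "price": "$1,550,000"},
    {"price": "$259,900"},
]
print(to_csv(to_rows(sample, ["brokerName", "price"]), ["brokerName", "price"]))
```

Prices contain commas, so the `csv` module quotes them. Writing the file by hand with `','.join(...)` would split those values across columns.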