Scraping New-Home Listings from 房天下 (fang.com)
Introduction
This crawler uses Selenium driving the Chrome browser to collect the listings, and stores the scraped data in a MongoDB database.
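Before running the spider, it helps to confirm the environment is in place. The snippet below is only a minimal check, assuming chromedriver.exe sits next to the script and MongoDB is listening locally on the default port 27017 (the same setup the full code below relies on):

# Minimal environment check (assumes a local chromedriver and MongoDB on localhost:27017).
import pymongo
from selenium import webdriver

driver = webdriver.Chrome(executable_path='./chromedriver.exe')   # launch Chrome via the local driver
client = pymongo.MongoClient(host="localhost", port=27017)        # connect to the local MongoDB
print(client.server_info()["version"])                            # prints the MongoDB server version
driver.quit()
client.close()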
Disclaimer
The code is provided for learning and exchange only; do not use it for anything illegal.
Approach
Starting URL: https://www.fang.com/SoufunFamily.htm
From this page we can extract the URLs of every major city in the country. Each city URL can then be spliced into the URL of that city's new-home listing page. Since every listing page has a "next page" button, we can locate it with XPath and simulate a click to move on to the next page, then extract that page's listings in the same way. Repeating these steps completes the crawl of the new-home data. The URL-splicing step is sketched just below; the full code follows after that.
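A minimal sketch of that splicing logic, pulled out of the spider for clarity (build_newhouse_url is an illustrative helper name, not something from the final code):

# Illustrative helper showing how a city link maps to its new-home listing URL.
def build_newhouse_url(city, city_url):
    # city_url looks like "https://sh.fang.com/"; the sub-domain is the city abbreviation
    city_name = city_url.split("//")[1].split(".")[0]
    if city == "北京":
        # Beijing is a special case: its new-home page has no city prefix
        return "https://newhouse.fang.com/house/s/"
    return "http://" + city_name + ".newhouse.fang.com/house/s/"

print(build_newhouse_url("上海", "https://sh.fang.com/"))
# -> http://sh.newhouse.fang.com/house/s/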
Full Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2020/2/6 21:00
# @Author: Martin
# @File: 房天下.py
# @Software: PyCharm
import re
import time
import pymongo
from lxml import etree
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


class FangSpider(object):
    def __init__(self):
        # chromedriver.exe is expected to sit next to this script
        self.driver = webdriver.Chrome(executable_path='./chromedriver.exe')
        self.start_url = 'https://www.fang.com/SoufunFamily.htm'
        # local MongoDB, database "fangtianxia"
        self.client = pymongo.MongoClient(host="localhost", port=27017)
        self.db = self.client['fangtianxia']

    def run(self):
        # Open the city-list page and wait until the city table has loaded
        self.driver.get(self.start_url)
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.XPATH, '//table[@class="table01"]/tbody/tr'))
        )
        source = self.driver.page_source
        html = etree.HTML(source)
        # Drop the last row of the city table
        trs = html.xpath('//table[@class="table01"]/tbody/tr')[:-1]
        for tr in trs:
            a_list = tr.xpath('./td[3]/a')
            for a in a_list:
                city = a.xpath('./text()')[0]
                city_url = a.xpath('./@href')[0]
                # Splice the city sub-domain into the new-home listing URL
                city_name = city_url.split("//")[1].split(".")[0]
                newhouse_url = "http://" + city_name + ".newhouse.fang.com/house/s/"
                if city == "北京":
                    # Beijing's new-home page has no city prefix
                    newhouse_url = 'https://newhouse.fang.com/house/s/'
                self.parse_newhouse(newhouse_url, city)
                time.sleep(1)

    def parse_newhouse(self, newhouse_url, city):
        # Open one city's new-home list and walk through all of its pages
        self.driver.get(newhouse_url)
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.XPATH, '//div[@id="newhouse_loupai_list"]/ul/li'))
        )
        while True:
            time.sleep(1)
            source = self.driver.page_source
            self.get_newhouse_info(source, city)
            try:
                # Click "next page"; when the button is missing we are on the last page
                btn = self.driver.find_element(By.XPATH, '//div[@class="page"]/ul/li[2]/a[@class="next"]')
                btn.click()
            except Exception:
                break

    def get_newhouse_info(self, source, city):
        # Parse one listing page and save every estate found on it
        html = etree.HTML(source)
        li_list = html.xpath('//div[@id="newhouse_loupai_list"]/ul/li')
        for li in li_list:
            name = "".join(li.xpath('.//div[@class="nlc_details"]/div[1]/div/a/text()')).strip()
            origin_url_list = li.xpath('.//div[@class="nlc_details"]/div[1]/div/a/@href')
            if origin_url_list:
                origin_url = "http:" + origin_url_list[0]
            else:
                origin_url = ""
            # Room types, e.g. "3居4居"
            a_list = li.xpath('.//div[@class="nlc_details"]/div[2]/a')
            room_type = ""
            for a in a_list:
                text = a.xpath('./text()')
                if text and text[0].endswith("居"):
                    room_str = text[0]
                else:
                    room_str = ""
                room_type += room_str
            # Floor area, stripped of whitespace and separators
            area = "".join(li.xpath('.//div[@class="nlc_details"]/div[2]/text()'))
            area = re.sub(r'\s', "", area).replace("/", "").replace("-", "")
            address = li.xpath('.//div[@class="nlc_details"]/div[3]/div/a/@title')
            if address:
                address = address[0]
            else:
                address = ""
            # The price is split between a <span> (number) and an <em> (unit)
            price = li.xpath('.//div[@class="nhouse_price"]/span/text()') + li.xpath('.//div[@class="nhouse_price"]/em/text()')
            if len(price) == 2:
                price = price[0] + price[1]
            else:
                price = ""
            # Sale status (e.g. 在售 / 待售)
            sale = li.xpath('.//div[contains(@class,"fangyuan")]/span/text()')
            if sale:
                sale = sale[0]
            else:
                sale = ""
            # Promotional labels attached to the listing
            label_list = li.xpath('.//div[contains(@class,"fangyuan")]/a')
            label = ""
            for a in label_list:
                text = a.xpath('./text()')
                if text:
                    text = text[0]
                else:
                    text = ""
                label += text
            house = {
                'city': city,
                'name': name,
                'room_type': room_type,
                'area': area,
                'price': price,
                'sale': sale,
                'label': label,
                'address': address,
                'origin_url': origin_url
            }
            print(house)
            self.save(house)

    def save(self, info):
        # Insert one listing document into the "fangtianxia" collection
        self.db.fangtianxia.insert_one(info)

    def close(self):
        self.client.close()


if __name__ == '__main__':
    spider = FangSpider()
    spider.run()
    spider.close()
Results
I stopped the crawler before it finished; after roughly 5 to 6 minutes of running, a quick look at the database showed well over a thousand records already saved.
Summary
Because Selenium has to drive a real browser, crawling pages this way is relatively slow overall. Still, when the site exposes no usable data API, or its pages rely on obfuscated JavaScript, Selenium is a decent fallback. One optional tweak to shave some of the overhead is sketched below.
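A small sketch, not part of the original spider: running Chrome in headless mode avoids rendering a visible window and usually reduces some of the overhead (the driver path and URL below simply mirror the setup used above):

# Optional tweak: headless Chrome, assuming the same local chromedriver as above.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')      # run without a visible browser window
options.add_argument('--disable-gpu')   # commonly recommended alongside headless mode
driver = webdriver.Chrome(executable_path='./chromedriver.exe', options=options)
driver.get('https://newhouse.fang.com/house/s/')
print(driver.title)
driver.quit()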