Last weekend a friend asked me for help with a task: batch-collecting data from a website. The manual workflow is to search for a keyword on page A, then click each link in the returned result list to open detail page B and copy out the needed fields. Doing that by hand would be a lot of work, so I spent two hours learning enough Python to scrape the data, and, because the table on each detail page is not organized as regular rows and columns, another two hours reshaping the output. The packages involved: requests to fetch pages, re to regex-match the <a> tags, pandas to read the table embedded in each page, and then xlrd and xlwt to reshape the Excel data into the format I wanted. The code is below.
Step 1: scrape the data
import requests
import re
import time
import pandas as pd

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

num = 0
for i in range(1, 53):  # 52 pages of search results
    u0 = 'http://xxx.html?pageNo='
    u2 = '&aaee0102_03=%E5%8C%97%E4%BA%AC%E5%B8%82%E6%B0%91%E6%94%BF%E5%B1%80&field=aaee0105&sort=desc&flag=0&_=1616827094244'
    url012 = u0 + str(i) + u2
    response_1 = requests.get(url012)
    html = response_1.text
    # Pull the detail-page links out of the list page, e.g.
    # <a href="/xxx.html?aaee0101=ff808081734c76d401735197e262025c" target="_self">
    urls = re.findall('<a href="(.*?)" target="_self">', html)
    for url in urls:
        r01 = "http://xxx" + url
        num = num + 1
        print(num)  # progress counter
        time.sleep(0.01)  # small pause between requests
        response_1_1 = requests.get(r01, headers=header)
        # Each detail page holds one <table>; append it to a single CSV
        data = pd.read_html(response_1_1.text)[0]
        data.to_csv('/Users/xxx.csv', mode='a', encoding='utf-8')
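The regex above assumes every detail link is an anchor tag with target="_self". A minimal, self-contained check of that extraction step, using made-up HTML (the real site URL is masked as xxx throughout this post):

```python
import re

# Hypothetical snippet of the list-page HTML; hrefs are invented for illustration
html = '''
<a href="/xxx.html?aaee0101=abc123" target="_self">Org A</a>
<a href="/xxx.html?aaee0101=def456" target="_self">Org B</a>
<a href="/other.html" target="_blank">not a detail link</a>
'''

# Same pattern as in the scraper: capture the href of every target="_self" anchor
urls = re.findall('<a href="(.*?)" target="_self">', html)
print(urls)

# Prepend the site root, as the scraper does before requesting each detail page
full = ["http://xxx" + u for u in urls]
print(full)
```

Only the two target="_self" links are captured; the target="_blank" anchor is skipped, which is exactly how the scraper filters list-page navigation links from detail links.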
Step 2: reformat the data
#coding=utf-8
import xlrd  # note: xlrd 2.0+ dropped .xlsx support; use xlrd 1.x here
import xlwt

data = xlrd.open_workbook('/xxx.xlsx')  # open the scraped workbook
table = data.sheets()[0]                # first sheet
nrows = table.nrows                     # total number of rows

workbook = xlwt.Workbook()
worksheet = workbook.add_sheet('test')

# Each scraped record occupies 11 rows in the sheet; the field values sit at
# fixed row offsets within that block (most in column index 2, two in column 4).
for i in range(0, 775):  # 775 records in total
    h = 2 + 11 * i
    print(h)  # starting row of the current record
    zuzhi = table.cell(2 + 11 * i, 2).value      # organization name
    chengli = table.cell(3 + 11 * i, 2).value    # date founded
    zige = table.cell(5 + 11 * i, 2).value       # qualification
    zhusuo = table.cell(6 + 11 * i, 2).value     # address
    youxiang = table.cell(8 + 11 * i, 2).value   # email
    menhu = table.cell(9 + 11 * i, 2).value      # website/portal
    lianxiren = table.cell(10 + 11 * i, 2).value # contact person
    dengji = table.cell(3 + 11 * i, 4).value     # rating
    dianhua = table.cell(10 + 11 * i, 4).value   # phone
    # Write one record per output row, one field per column
    worksheet.write(i, 0, zuzhi)
    worksheet.write(i, 1, chengli)
    worksheet.write(i, 2, zige)
    worksheet.write(i, 3, zhusuo)
    worksheet.write(i, 4, youxiang)
    worksheet.write(i, 5, menhu)
    worksheet.write(i, 6, lianxiren)
    worksheet.write(i, 7, dengji)
    worksheet.write(i, 8, dianhua)

workbook.save('/Users/xxx.xls')
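The core of step 2 is the index arithmetic: record i starts at row 11*i, and each field lives at a fixed offset inside that 11-row block. A pure-Python sketch of the same unflattening, using a fake in-memory sheet instead of xlrd (field names and offsets mirror two of the fields above; the sample values are invented):

```python
ROWS_PER_RECORD = 11

def extract(sheet, n_records):
    """sheet[r][c] mimics table.cell(r, c).value on the flattened worksheet."""
    out = []
    for i in range(n_records):
        base = ROWS_PER_RECORD * i
        out.append({
            "zuzhi": sheet[base + 2][2],   # organization name: row offset 2, column 2
            "dengji": sheet[base + 3][4],  # rating: row offset 3, column 4
        })
    return out

# Build a fake flattened sheet: two records, 11 rows each, 5 columns
flat = [["" for _ in range(5)] for _ in range(2 * ROWS_PER_RECORD)]
flat[2][2], flat[3][4] = "Org A", "AAAA"    # record 0
flat[13][2], flat[14][4] = "Org B", "AAA"   # record 1 (offsets shifted by 11)
print(extract(flat, 2))
```

The same pattern generalizes to all nine fields: once you know the block height (11 rows here) and each field's (row offset, column) pair, the vertical layout collapses into one output row per record.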