Code screenshot
1.3 Inspecting the page shows the original page encoding is `gb2312`, so we set that encoding when fetching the content
Analyzing the garbled page source
Fixing the garbled source
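The encoding fix can be reproduced offline. In this sketch the response object is built by hand so no network is needed, and the sample text is made up; it only illustrates why `gb2312` bytes turn into mojibake under the wrong decoder:

```python
import requests

# Build a Response by hand so the demo needs no network access.
# The payload simulates a gb2312-encoded page like the one scraped here.
resp = requests.models.Response()
resp._content = '美女图片'.encode('gb2312')

# Without a charset header, requests may fall back to ISO-8859-1,
# which produces mojibake for Chinese text.
resp.encoding = 'ISO-8859-1'
garbled = resp.text

# Telling requests the real encoding fixes the text.
resp.encoding = 'gb2312'
print(resp.text)  # 美女图片
```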
2.1 We now have the main page's source, so let's open the page in a browser and see what that source renders as
Did any of them make your heart skip a beat? Personally, I think the innocent-looking ones are the nicest.
2.2 Scraping approach: easy or hard
Scraping approach analysis
3.1 Starting with the easy part: first we grab the links to all the photo sets on this page (each model has several photos), then send a request to each of those links. In the browser, press F12 to open developer tools and use the element picker to locate the information we want.
Analyzing the source to choose a regex pattern
'''
author : 极简XksA
date : 2018.8.8
goal : scrape the pictures by category and save them locally
'''
import re
import requests

# Main listing page: http://www.27270.com/ent/meinvtupian/

# 1. Request the main page
r_url = 'http://www.27270.com/ent/meinvtupian/'
html_code = requests.get(r_url)
# 2. Set the page encoding to gb2312
html_code.encoding = 'gb2312'
html_text = html_code.text

# 3.1 Get the link of each photo set
# (the original pattern was lost in extraction; this is a plausible
#  reconstruction -- adjust it to the actual page markup)
pattern01 = r'<a href="(http://www\.27270\.com/ent/meinvtupian/\d+/\d+\.html)" title='
beautiful_url = re.findall(pattern01, html_text)
print(beautiful_url)
print(len(beautiful_url))

# 3.2 Get the captions (same caveat: reconstructed pattern)
pattern02 = r'<a href="http://www\.27270\.com/ent/meinvtupian/\d+/\d+\.html" title="(.*?)"'
beautiful_words = re.findall(pattern02, html_text)
print(beautiful_words)
print(len(beautiful_words))
Run result:
30
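The link-and-caption extraction can be tried offline on a made-up snippet. This is a minimal sketch; the markup and both patterns are hypothetical, merely shaped like the listing page described above:

```python
import re

# Hypothetical listing markup, shaped like the real page (not its actual source)
html_text = '''
<a href="http://www.27270.com/ent/meinvtupian/2018/261848.html" title="Girl A">Girl A</a>
<a href="http://www.27270.com/ent/meinvtupian/2018/261849.html" title="Girl B">Girl B</a>
'''

# 3.1 photo-set links
pattern01 = r'<a href="(http://www\.27270\.com/ent/meinvtupian/\d+/\d+\.html)" title='
beautiful_url = re.findall(pattern01, html_text)

# 3.2 captions
pattern02 = r'title="(.*?)"'
beautiful_words = re.findall(pattern02, html_text)

print(beautiful_url)    # two set links
print(beautiful_words)  # ['Girl A', 'Girl B']
```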
3.4.1 Page analysis
Analysis of a single photo-set page
3.4.2 Code implementation
for i in range(len(beautiful_url)):
    # 4.1 Request a single photo-set page
    picture_codes = requests.get(beautiful_url[i])
    picture_codes.encoding = 'gb2312'
    picture_words = picture_codes.text
    # print(picture_words)
    # Match the image whose alt text equals the caption extracted above
    # (the original pattern was lost in extraction; plausible reconstruction)
    pattern03 = r'<img alt="%s" src="(.*?)" />' % beautiful_words[i]
    picture_url = re.findall(pattern03, picture_words)
    print(picture_url)
3.4.3 Run result
# Note: this gets only the first picture of each model, not all of them
['http://t2.hddhhn.com/uploads/tu/201803/9999/f45065ed61.jpg'],['http://t2.hddhhn.com/uploads/tu/201803/9999/9126579004.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/320ab4622e.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/1fde4d7a1f.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/ef21eaa896.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/e1697062d3.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/419c69bec1.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/4302dc643c.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/df7ff261b0.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/b7b870636f.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/11ec3cf8b2.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/10a0a11a02.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/53e4e2717c.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/7431e6e040.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/228cf34f62.jpg'],
['http://t2.hddhhn.com/uploads/tu/201807/9999/a9b7d62201.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/ba91f1e60e.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/76da610fa9.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/3ed260e5ae.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/3d93b5fd09.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/280277b310.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/b69662e2d9.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/fbf7a9178b.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/3f9a20a7da.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/691c12fa18.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/249d3362c4.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/29ea1b5fb7.jpg'],['http://t2.hddhhn.com/uploads/tu/201807/9999/db087ab231.jpg'],['http://t2.hddhhn.com/uploads/tu/201803/9999/1a9b5f8522.jpg'],['http://t2.hddhhn.com/uploads/tu/201803/9999/9b597acb26.jpg']
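One caveat about `pattern03`: the caption is interpolated straight into a regex, so any regex metacharacter in it (brackets, dots, plus signs) would break or distort the match. A small sketch of the safer variant using `re.escape`; the caption and markup here are made up:

```python
import re

caption = '清纯美女[高清]'  # made-up caption containing regex metacharacters
page = '<img alt="清纯美女[高清]" src="http://t2.hddhhn.com/uploads/tu/demo.jpg" />'

# Escape the caption before building the pattern, so [ and ] match literally
pattern03 = r'<img alt="%s" src="(.*?)" />' % re.escape(caption)
print(re.findall(pattern03, page))  # ['http://t2.hddhhn.com/uploads/tu/demo.jpg']
```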
3.5.1 Page analysis
3.5.2 Code implementation
# 4.3 Scrape the remaining pages of one photo set
# (the original pattern was lost in extraction; plausible reconstruction
#  matching pagination links such as '261848_2.html')
pattern04 = r"<li><a href='(\d+_\d+\.html)'>\d+</a></li>"
pictures_url = re.findall(pattern04, picture_words)
print(pictures_url)
print(pattern03)

picture_address = [picture_url]  # start with the image from the first page
for j in range(len(pictures_url)):
    other_picture_url = r'http://www.27270.com/ent/meinvtupian/2018/{0}'.format(pictures_url[j])
    pictures_codes = requests.get(other_picture_url)
    pictures_codes.encoding = 'gb2312'
    pictures_words = pictures_codes.text
    picture_02 = re.findall(pattern03, pictures_words)
    picture_address.append(picture_02)
print(picture_address)
3.5.3 Run result
['261848_2.html','261848_3.html','261848_4.html','261848_5.html','261848_6.html','261848_7.html','261848_8.html']

[['http://t2.hddhhn.com/uploads/tu/201803/9999/f45065ed61.jpg'],['http://t2.hddhhn.com/uploads/tu/201803/9999/88e0742045.jpg'],['http://t2.hddhhn.com/uploads/tu/201803/9999/c8d4eba79b.jpg'],['http://t2.hddhhn.com/uploads/tu/201803/9999/78e50b4522.jpg'],
['http://t2.hddhhn.com/uploads/tu/201803/9999/c435bee80c.jpg'],['http://t2.hddhhn.com/uploads/tu/201803/9999/c8411d490e.jpg'],['http://t2.hddhhn.com/uploads/tu/201803/9999/0e7442531e.jpg'],['http://t2.hddhhn.com/uploads/tu/201803/9999/7aff8c935f.jpg']]
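The seven pagination links in the output follow an obvious naming scheme: page n of set 261848 is '261848_n.html'. So if the pagination markup ever changes, the sub-page names could also be generated directly. A sketch, with the page count of 8 taken from the run result above:

```python
first_page = '261848.html'
page_count = 8  # total pages in this photo set, per the run result

# Derive the sub-page names '261848_2.html' .. '261848_8.html'
stem = first_page.rsplit('.', 1)[0]
pictures_url = ['{0}_{1}.html'.format(stem, n) for n in range(2, page_count + 1)]
print(pictures_url)  # ['261848_2.html', ..., '261848_8.html']
```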
3.6 Downloading the images
import os

'''
Download each photo set into its own local folder
'''
def download_pictures(folder_name, picture_address):
    file_path = r'G:\Beautiful\{0}'.format(folder_name)
    if not os.path.exists(file_path):
        os.mkdir(os.path.join(r'G:\Beautiful', folder_name))
    for i in range(len(picture_address)):
        # Download the file (wb: write in binary mode)
        with open(r'G:\Beautiful\{0}\{1}.jpg'.format(folder_name, i + 1), 'wb') as f:
            response = requests.get(picture_address[i][0])
            f.write(response.content)
3.6.2 Run result
Scraping result (the captions are a bit explicit)
Example of a single photo set; judging by the file sizes, the images are quite sharp
4. So far we have only scraped the photo sets shown on the main page. How can we page through the main listing as well?