The site crawled this time is https://image.so.com/. Open the page, switch to the 美女 (beauty) category, open the browser developer tools, switch to the XHR tab, and scroll down the page; you will see many Ajax requests appear, as shown in the figure:
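Each of those Ajax requests returns JSON rather than HTML. The sketch below parses a small hand-made payload shaped like the `zjl` response (the sample values are invented, but the `list` key and the `qhimg_url`/`qhimg_thumb`/`title` fields match what the spider reads later):

```python
import json

# a hand-made sample shaped like the zjl Ajax response (values are invented)
payload = '''
{
  "list": [
    {"id": "abc123",
     "qhimg_url": "https://p0.qhimg.com/t1/abc123.jpg",
     "qhimg_thumb": "https://p0.qhimg.com/t1/abc123_thumb.jpg",
     "title": "sample image"}
  ]
}
'''

result = json.loads(payload)
for image in result['list']:
    # the same four fields the spider later copies into its item
    record = {
        'id': image.get('id'),
        'url': image.get('qhimg_url'),
        'title': image.get('title'),
        'thumb': image.get('qhimg_thumb'),
    }
    print(record['url'])
```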
Spiders.py code:
import scrapy
import json
from Pro360.items import Pro360Item

class ImaSpider(scrapy.Spider):
    name = 'Ima'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://image.so.com/zjl?ch=beauty&sn=0']
    MAX_PAGE = 50  # number of pages to crawl
    for i in range(1, MAX_PAGE + 1):
        # build the paginated URL and append it to start_urls
        url = 'https://image.so.com/zjl?ch=beauty&sn={}'.format(i * 30)
        start_urls.append(url)

    def parse(self, response):
        result = json.loads(response.text)
        for image in result['list']:
            # collect the fields of each image and send the item to the pipelines
            item = Pro360Item()
            item['id'] = image.get('id')
            item['url'] = image.get('qhimg_url')
            item['title'] = image.get('title')
            item['thumb'] = image.get('qhimg_thumb')
            yield item
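The pagination above can be checked in isolation: each page holds 30 images, so the `sn` offset advances in steps of 30. A quick sketch of the same URL construction:

```python
MAX_PAGE = 50  # same page count as the spider

# sn=0 for the first page, then sn advances by 30 per page
start_urls = ['https://image.so.com/zjl?ch=beauty&sn=0']
for i in range(1, MAX_PAGE + 1):
    start_urls.append('https://image.so.com/zjl?ch=beauty&sn={}'.format(i * 30))

print(len(start_urls))  # 51
print(start_urls[-1])   # https://image.so.com/zjl?ch=beauty&sn=1500
```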
items.py code:
import scrapy

class Pro360Item(scrapy.Item):
    # define the fields for your item here
    id = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    thumb = scrapy.Field()
pipelines.py code:

import pymongo
import pymysql
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request
# Pipeline that stores items in MongoDB
class MongoPipeline:
    def open_spider(self, spider):
        print('MongoDB pipeline: spider started')
        self.client = pymongo.MongoClient(host='127.0.0.1', port=27017)
        self.db = self.client['Image360']
        self.collection = self.db['images']

    def process_item(self, item, spider):
        # Collection.insert() is deprecated; use insert_one()
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        print('MongoDB pipeline: spider finished')
        self.client.close()
# Pipeline that stores items in MySQL
class MysqLPipeline:
    def open_spider(self, spider):
        print('MySQL pipeline: spider started')
        self.db = pymysql.connect(host='127.0.0.1',
                                  user='root',
                                  password='123456',
                                  database='image360',
                                  charset='utf8',
                                  port=3306)
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        data = dict(item)
        # build the INSERT statement dynamically from the item's fields
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))
        sql = 'insert into images(%s) values(%s)' % (keys, values)
        try:
            self.cursor.execute(sql, tuple(data.values()))
            self.db.commit()
        except Exception:
            self.db.rollback()
        return item

    def close_spider(self, spider):
        print('MySQL pipeline: spider finished')
        self.db.close()
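The dynamic INSERT built in `process_item` can be exercised without a database connection: the column list comes from the item's keys and the `%s` placeholders line up one-to-one with the values passed to `execute()`. A sketch, using an invented sample item:

```python
# an invented sample item with the same four fields as the spider yields
data = {
    'id': 'abc123',
    'url': 'https://p0.qhimg.com/t1/abc123.jpg',
    'title': 'sample image',
    'thumb': 'https://p0.qhimg.com/t1/abc123_thumb.jpg',
}

# same construction as in process_item
keys = ','.join(data.keys())
values = ','.join(['%s'] * len(data))
sql = 'insert into images(%s) values(%s)' % (keys, values)

print(sql)
# insert into images(id,url,title,thumb) values(%s,%s,%s,%s)
```

Because the placeholders and `tuple(data.values())` are both derived from the same dict, the pipeline keeps working unchanged if fields are added to or removed from the item.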
# Pipeline that saves the image files to local disk
class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # name the file after the last path segment of the image URL
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_path = [x['path'] for ok, x in results if ok]
        if not image_path:
            raise DropItem('Image download failed')
        return item

    def get_media_requests(self, item, info):
        # request each image URL collected by the spider
        yield Request(item['url'])
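The `file_path` override names each file after the last segment of the image URL. That rule can be tried on a plain string (the sample URL is invented); note that a naive `split('/')` keeps any query string, while `urllib.parse.urlparse` drops it, which is usually safer for a filename:

```python
from urllib.parse import urlparse

# an invented image URL with a query string appended
url = 'https://p0.qhimg.com/t1/abc123.jpg?size=large'

# the rule used by file_path above keeps the query string
naive = url.split('/')[-1]

# urlparse().path strips the query before splitting
robust = urlparse(url).path.split('/')[-1]

print(naive)   # abc123.jpg?size=large
print(robust)  # abc123.jpg
```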
settings.py configuration:
1. Set the robots.txt setting to False, set the log level to ERROR, and add a User-Agent.
2. Register the three pipelines in ITEM_PIPELINES.
3. Specify the local path for storing the downloaded photos.
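The settings listed above might look like the following sketch; the pipeline priorities, the User-Agent string, and the IMAGES_STORE path are assumptions, not values from the original post:

```python
# settings.py — a minimal sketch, assuming the project is named Pro360
ROBOTSTXT_OBEY = False   # do not obey robots.txt
LOG_LEVEL = 'ERROR'      # only show errors in the log
USER_AGENT = 'Mozilla/5.0'  # hypothetical placeholder; fill in a real browser UA

# register the three pipelines (priorities here are assumed values)
ITEM_PIPELINES = {
    'Pro360.pipelines.ImagePipeline': 300,
    'Pro360.pipelines.MongoPipeline': 301,
    'Pro360.pipelines.MysqLPipeline': 302,
}

IMAGES_STORE = './images'  # assumed local folder where ImagesPipeline saves photos
```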
Results:
Data stored in MongoDB (screenshot)
Data in the MySQL database (screenshot)
Photos saved locally (screenshot)