Web scraping Google Images does not yield the first image result
I have written this web scraping script (with some online help) that scrapes Google Images. Here it is:
import csv
import os

import requests
from bs4 import BeautifulSoup

# Base URL for Google Search
google_image = 'https://www.google.com/search?site=&tbm=isch&source=hp&biw=1873&bih=990&'

# CSV Directory
csv_dir = '../main/data/activities.csv'

# Necessary headers for Python to look like a browser to the server
usr_agent = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 '
                  'Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}


def description_to_query(activity_description):
    line = activity_description.lower()
    line = line.replace(' ', '+')
    return line


def download_images(activity_description):
    """Takes as input the activity description and writes corresponding Google Images result to
    activitiy_img folder, adding that path to the activities csv as well"""
    # Image directory
    img_dir = '../main/data/activitiy_img'
    if not os.path.exists(img_dir):
        os.mkdir(img_dir)

    # Create URL for web
    searchurl = google_image + 'q=' + description_to_query(activity_description)
    print(f'{activity_description}: {searchurl}')

    # Get content from URL
    response = requests.get(searchurl, headers=usr_agent)

    # Find all img tags carrying the thumbnail class
    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.findAll('img', {'class': 'rg_i Q4LuWd'})

    # Unpacking each tag and retrieving data-src content. If key not found, continue.
    link = ''
    for res in results:
        try:
            link = res['data-src']
        except KeyError:
            continue

    # Getting image data from just retrieved data-src and declaring img_name based on dir and description
    response = requests.get(link)
    img_path = img_dir + '/' + activity_description + '.jpg'

    # Writing file
    with open(img_path, 'wb') as img:
        img.write(response.content)
    print(f'Downloading image {img_path}...')

    # Updating activities CSV file
    write_img_to_csv(activity_description, img_path)


def write_img_to_csv(activity_description, img_path):
    """Writes image path to CSV line corresponding with activity description"""
    # Reading the CSV and copying its rows into a list
    with open(csv_dir, newline='') as csv_file:
        lines = list(csv.reader(csv_file))

    # Changing CSV values
    for row in lines:
        if row[1] == activity_description.lower():
            # Store the image path in the fourth column of the matching row
            row[3] = img_path
            print(f'{img_path} added to CSV')

    # Overwriting the CSV file with the locally changed rows
    with open(csv_dir, 'w', newline='') as csv_file:
        csv.writer(csv_file).writerows(lines)

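Now, the issue is that it works (hooray!), but it does not seem to grab the first result (which is generally the best one). Instead it grabs a very "late" result, which can be quite far off from the search description activity_description and tends to have a low resolution.
I am wondering why it does this. I have checked the Google Images HTML source code, and the dictionary used to identify the image class, {'class': 'rg_i Q4LuWd'}, seems to be present on the first images too. I assumed that findAll would return the first results first, but I might be wrong there. I would like to know whether I am, and if not, where the error is.
Thanks!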
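One possible reading of the behaviour described above: if the for loop only assigns link and the download happens once after the loop has finished, every tag that has a data-src overwrites the previous one, so the value that survives is the last match rather than the first. The sketch below contrasts the two behaviours. It is only an illustration under that assumption: first_data_src and last_data_src are hypothetical helper names, and plain dicts with example.com URLs stand in for the img tags returned by soup.findAll (a BeautifulSoup tag supports both res['data-src'] and res.get('data-src') in the same way).

```python
def first_data_src(results):
    """Return the first non-empty 'data-src' value in results, or None."""
    for res in results:
        link = res.get('data-src')
        if link:
            # Stop at the first hit instead of letting later hits overwrite it.
            return link
    return None


def last_data_src(results):
    """Mimic a loop that keeps assigning link: the last hit wins."""
    link = ''
    for res in results:
        try:
            link = res['data-src']
        except KeyError:
            continue
    return link


# Hypothetical stand-ins for the <img> tags, in page order.
sample = [
    {'src': 'data:image/gif;base64,...'},           # no data-src attribute
    {'data-src': 'https://example.com/first.jpg'},
    {'data-src': 'https://example.com/second.jpg'},
    {'data-src': 'https://example.com/last.jpg'},
]

print(first_data_src(sample))
print(last_data_src(sample))
```

If this is indeed the cause, returning (or breaking) on the first tag that has a data-src would keep the earliest result. Whether the earliest thumbnails in the served HTML carry data-src at all is a separate question that the sketch does not settle.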