Web scraping Google Images does not return the first image result

I have written this web scraping script, with some help from online sources, to scrape Google Images. Here it is:

import os
import requests
from bs4 import BeautifulSoup
import csv

# Base URL for Google Search
google_image = 'https://www.google.com/search?site=&tbm=isch&source=hp&biw=1873&bih=990&'

# CSV Directory
csv_dir = '../main/data/activities.csv'

# Necessary headers for Python to make a browser-like request
usr_agent = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 '
                  'Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}


def description_to_query(activity_description):
    line = activity_description.lower()
    line = line.replace(' ','+')
    return line


def download_images(activity_description):
    """Takes as input the activity description and writes the corresponding Google Images result to the activitiy_img folder, adding that path to the activities CSV as well."""

    # Image directory
    img_dir = '../main/data/activitiy_img'
    if not os.path.exists(img_dir):
        os.mkdir(img_dir)

    # Create URL for web
    searchurl = google_image + 'q=' + description_to_query(activity_description)
    print(f'{activity_description}: {searchurl}')

    # Get content from URL
    response = requests.get(searchurl, headers=usr_agent)

    # Find all img tags that carry the result-thumbnail class
    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.findAll('img', {'class': 'rg_i Q4LuWd'})

    # Retrieving data-src content from each img tag. If key not found, continue.
    link = ''
    for res in results:
        try:
            link = res['data-src']
        except KeyError:
            continue

    # Getting image data from just retrieved data-src and declaring img_name based on dir and description
    response = requests.get(link)
    img_path = img_dir + '/' + activity_description + '.jpg'

    # Writing file
    with open(img_path, 'wb') as img:
        img.write(response.content)
        print(f'Downloading image {img_path}...')

    # Updating activities CSV file
    write_img_to_csv(activity_description, img_path)


def write_img_to_csv(activity_description, img_path):
    """Writes image path to CSV line corresponding with activity description"""

    # Reading the CSV into a list of rows
    with open(csv_dir, newline='') as csv_file:
        lines = list(csv.reader(csv_file))

    # Changing CSV values
    for row in lines:
        if row[1] == activity_description.lower():
            # Storing the image path in the fourth column
            row[3] = img_path
            print(f'{img_path} added to CSV')

    # Overwriting the CSV file with the locally changed rows
    with open(csv_dir, 'w', newline='') as csv_file:
        csv.writer(csv_file).writerows(lines)

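Now, the thing is, it works (hooray!), but it does not seem to grab the first result (which is usually the best one). Instead it grabs a very "late" result that can be far from the search description (activity_description) and tends to be low resolution.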
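The symptom described here matches how the for loop in download_images is written: link is reassigned on every tag that has a data-src, and there is no break, so after the loop it holds the last match rather than the first. A minimal sketch of that behaviour follows, using plain dicts as hypothetical stand-ins for the img tags (the URLs and the helper names last_link and first_link are made up for illustration):

```python
# Hypothetical stand-ins for the <img> tags returned by findAll, in page order.
results = [
    {'src': 'inline-thumbnail'},                      # no data-src key
    {'data-src': 'https://example.com/first.jpg'},
    {'data-src': 'https://example.com/last.jpg'},
]


def last_link(results):
    """Same loop shape as in download_images: keeps overwriting link."""
    link = ''
    for res in results:
        try:
            link = res['data-src']  # reassigned on every match
        except KeyError:
            continue
    return link


def first_link(results):
    """Variant that stops at the first tag that has a data-src."""
    for res in results:
        try:
            return res['data-src']
        except KeyError:
            continue
    return ''


print(last_link(results))
print(first_link(results))
```

If this is indeed the cause, adding a break after the assignment in the original loop would be the smallest change that keeps the first match.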

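I would like to know why it does this. I have checked the Google Images HTML source, and the dictionary used to identify the image class, {'class': 'rg_i Q4LuWd'}, appears to be present on the first image as well. I assumed findAll would find the first result first, but I may be wrong there. I would like to know whether I am and, if not, where the error is.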
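On the assumption about ordering: BeautifulSoup's find_all (findAll is its older alias) returns matches in document order, so the first element of results should be the first matching tag in the HTML. One possible complication, stated here as an assumption and not confirmed by the question, is that the earliest thumbnails may carry their image in src (for example as an inline data URI) and have no data-src at all, in which case the KeyError branch skips them. A sketch that takes the first usable reference from either attribute, run against hypothetical markup (SAMPLE_HTML and first_image_link are illustrative names):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first thumbnail is inlined in src,
# the later ones expose a URL through data-src.
SAMPLE_HTML = """
<img class="rg_i Q4LuWd" src="data:image/jpeg;base64,AAAA">
<img class="rg_i Q4LuWd" data-src="https://example.com/second.jpg">
<img class="rg_i Q4LuWd" data-src="https://example.com/third.jpg">
"""


def first_image_link(html):
    """Return the first image reference in document order, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    for img in soup.find_all('img', {'class': 'rg_i Q4LuWd'}):
        # Prefer data-src, fall back to src when it is missing.
        link = img.get('data-src') or img.get('src')
        if link:
            return link
    return None


print(first_image_link(SAMPLE_HTML))
```

Note that a data: URI cannot be passed to requests.get; it would have to be base64-decoded instead, so the download step would need a separate branch for that case.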

Thanks!
