
Scraping a JS-rendered page with Requests_HTML does not work properly


I am scraping the JS-rendered page https://www.flipkart.com/search?q=Acer+Laptops. On this page the product images are loaded dynamically. Before rendering, the images have this src value:

//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/placeholder_9951d0.svg

After rendering, the src should look like this:

https://rukminim1.flixcart.com/image/312/312/kcp4osw0/computer/f/w/d/acer-na-thin-and-light-laptop-original-imaftrdmuyxq5nrf.jpeg?q=70

With requests_html I can get the rendered src value, but only for the first few images at the top of the page. Can anyone help me here? My code:

from requests_html import HTMLSession

session = HTMLSession()
res = session.get("https://www.flipkart.com/search?q=Acer+Laptops")
res.html.render()  # run the page's JavaScript in headless Chromium
# Container for all the results
all_results = res.html.find('#container > div > div.t-0M7P._2doH3V > div._3e7xtJ > div._1HmYoV.hCUpcT > div:nth-child(2)', first=True)
items = all_results.find('._1UoZlX')  # Container for each product being displayed
for item in items:
    item_image = item.find('div._3BTv9X img', first=True).attrs.get('src')
    print(item_image)

Output:

https://rukminim1.flixcart.com/image/312/312/kamtsi80/computer/m/8/y/acer-na-gaming-laptop-original-imafs5prytwgrcyf.jpeg?q=70
https://rukminim1.flixcart.com/image/312/312/kcp4osw0/computer/f/w/d/acer-na-thin-and-light-laptop-original-imaftrdmuyxq5nrf.jpeg?q=70
//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/placeholder_9951d0.svg
//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/placeholder_9951d0.svg

As you can see, the first two images have loaded and the rest have not. Thanks in advance, everyone!

Solution

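import re

import requests


def main(url):
    r = requests.get(url)
    # The image URL templates appear in the raw page source under the
    # "dynamicImageUrl" key, so they can be extracted without rendering JS.
    match = re.findall(r'dynamicImageUrl":"(.*?)"', r.text)
    print(match)


main("https://www.flipkart.com/search?q=Acer+Laptops")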

Output:

['http://rukmini1.flixcart.com/flap/{@width}/{@height}/image/c9ef9eae08a3b038.jpg?q={@quality}','https://rukminim1.flixcart.com/www/{@width}/{@height}/promos/21/07/2017/e8625e14-3277-4f16-a4d4-df8ed525905b.png?q={@quality}','https://rukminim1.flixcart.com/www/{@width}/{@height}/promos/21/07/2017/e8625e14-3277-4f16-a4d4-df8ed525905b.png?q={@quality}']

You can now replace the {@width}, {@height}, and {@quality} placeholders as needed.

The defaults are 312 x 312 with a quality of 70.
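
The substitution described above could be sketched as follows. The helper name `fill_image_url` is hypothetical, and the sketch assumes the placeholders always appear literally as `{@width}`, `{@height}`, and `{@quality}`, as they do in the output shown earlier.

```python
def fill_image_url(template, width=312, height=312, quality=70):
    """Replace the {@width}, {@height}, {@quality} placeholders in a URL template.

    Hypothetical helper; the defaults follow the 312 x 312 x 70 values
    mentioned in the answer.
    """
    return (
        template.replace("{@width}", str(width))
        .replace("{@height}", str(height))
        .replace("{@quality}", str(quality))
    )


# One of the templates from the output above
template = (
    "https://rukminim1.flixcart.com/www/{@width}/{@height}/promos/21/07/2017/"
    "e8625e14-3277-4f16-a4d4-df8ed525905b.png?q={@quality}"
)
print(fill_image_url(template))
print(fill_image_url(template, width=128, height=128, quality=50))
```

Plain `str.replace` is used rather than `str.format` because the `@` inside the braces is not a valid format field name.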

I found a solution: because the images are lazy-loaded, I had to pass the "scrolldown" and "sleep" parameters to the "render()" function. Here is the code:

from requests_html import HTMLSession

session = HTMLSession()
res = session.get("https://www.flipkart.com/search?q=Acer+Laptops")
# Page down 20 times, sleeping 0.1 s each time, so lazy-loaded images get fetched
res.html.render(scrolldown=20, sleep=.1)
# Container for all the results
all_results = res.html.find('#container > div > div.t-0M7P._2doH3V > div._3e7xtJ > div._1HmYoV.hCUpcT > div:nth-child(2)', first=True)
items = all_results.find('._1UoZlX')  # Container for each product being displayed
for item in items:
    item_image = item.find('div._3BTv9X img', first=True).attrs.get('src')
    print(item_image)
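
To judge whether the chosen scrolldown and sleep values were enough, one option is to count how many src values are still the placeholder. The sketch below is only an illustration: the helper names are hypothetical, and it assumes an unloaded image can be recognized by "placeholder_" in its src, as in the output shown in the question. It also turns protocol-relative values such as //img1a.flixcart.com/... into absolute URLs.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.flipkart.com/search?q=Acer+Laptops"
# Assumption: images that have not loaded yet keep a "placeholder_*.svg" src
PLACEHOLDER_MARKER = "placeholder_"


def is_placeholder(src):
    """Return True if the src is missing or still points at the placeholder."""
    return src is None or PLACEHOLDER_MARKER in src


def absolute_url(src, base=PAGE_URL):
    """Resolve protocol-relative ('//host/path') or relative srcs against the page URL."""
    return urljoin(base, src)


def summarize(srcs):
    """Split srcs into absolute URLs of loaded images and a count of placeholders."""
    loaded = [absolute_url(s) for s in srcs if not is_placeholder(s)]
    pending = sum(1 for s in srcs if is_placeholder(s))
    return loaded, pending


# The four src values printed in the question
sample = [
    "https://rukminim1.flixcart.com/image/312/312/kamtsi80/computer/m/8/y/acer-na-gaming-laptop-original-imafs5prytwgrcyf.jpeg?q=70",
    "https://rukminim1.flixcart.com/image/312/312/kcp4osw0/computer/f/w/d/acer-na-thin-and-light-laptop-original-imaftrdmuyxq5nrf.jpeg?q=70",
    "//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/placeholder_9951d0.svg",
    "//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/placeholder_9951d0.svg",
]
loaded, pending = summarize(sample)
print(f"{len(loaded)} loaded, {pending} still placeholders")
```

If the placeholder count stays above zero after rendering, increasing scrolldown or sleep is a reasonable next step.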
