Scraping with Goutte hangs until it times out, but only on one specific website

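I'm experimenting with Goutte, but I can't get it to connect to one particular website. Every other URL I've tried works fine, and I'm trying to work out what is blocking the connection. The request just hangs until it times out after 30 seconds. If I remove the timeout, the same thing happens after 150 seconds.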

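Things to note:

  • So far the timeout/hang only happens with tesco.com. asda.com, google.com and others work fine and return results.
  • The site loads instantly in a web browser (Chrome), so it is not related to my IP or ISP.
  • If I send a GET request to the same URL from Postman, I get a normal response.
  • It does not appear to be related to the user agent.
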
<?php

namespace App\Http\Controllers;

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

class ScraperController extends Controller
{
    public function scrape()
    {
        $goutteClient = new Client();

        // Only a browser-like user agent is sent at this point.
        $goutteClient->setHeader('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36');

        // 'debug' => true prints the cURL verbose log.
        $guzzleClient = new GuzzleClient([
            'timeout' => 30,
            'verify' => true,
            'debug' => true,
        ]);
        $goutteClient->setClient($guzzleClient);
        $crawler = $goutteClient->request('GET', 'https://www.tesco.com/');

        dump($crawler);

        /*$crawler->filter('.result__title .result__a')->each(function ($node) {
            dump($node->text());
        });*/
    }
}

Here is the "debug" output, including the error:

* Trying 104.123.91.150:443...
* TCP_NODELAY set
* Connected to www.tesco.com (104.123.91.150) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: C=GB; L=Welwyn Garden City; jurisdictionC=GB; O=Tesco PLC; businessCategory=Private Organization; serialNumber=00445790; CN=www.tesco.com
* start date: Feb 4 11:09:23 2020 GMT
* expire date: Feb 3 11:39:21 2022 GMT
* subjectAltName: host "www.tesco.com" matched cert's "www.tesco.com"
* issuer: C=US; O=Entrust, Inc.; OU=See www.entrust.net/legal-terms; OU=(c) 2014 Entrust, Inc. - for authorized use only; CN=Entrust Certification Authority - L1M
* SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.tesco.com
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36

* old SSL session ID is stale, removing
* Operation timed out after 30001 milliseconds with 0 bytes received
* Closing connection 0

GuzzleHttp\Exception\ConnectException
cURL error 28: Operation timed out after 30001 milliseconds with 0 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html)
http://localhost/scrape

Can anyone see why I'm not getting any response at all?

Solution

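I managed to solve this by adding more headers. The debug output shows the TLS handshake completing and the GET request being sent, yet 0 bytes come back, so the server accepts the connection and then never replies to a request that carries only a user-agent header. Sending a fuller, browser-like set of headers got a response. The working version below also enables cookies on the Guzzle client and lowers the timeout to 5 seconds; I have not isolated which of these changes is the one that matters.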

<?php

namespace App\Http\Controllers;

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

class ScraperController extends Controller
{
    public function scrape()
    {
        $goutteClient = new Client();

        // Send the same kind of headers a desktop Chrome request would carry.
        $goutteClient->setHeader('accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9');
        $goutteClient->setHeader('accept-encoding', 'gzip, deflate, br');
        $goutteClient->setHeader('accept-language', 'en-GB,en-US;q=0.9,en;q=0.8');
        $goutteClient->setHeader('upgrade-insecure-requests', '1');
        $goutteClient->setHeader('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36');
        $goutteClient->setHeader('connection', 'keep-alive');

        // Cookies are now enabled and the timeout is shorter than before.
        $guzzleClient = new GuzzleClient([
            'timeout' => 5,
            'verify' => true,
            'debug' => true,
            'cookies' => true,
        ]);
        $goutteClient->setClient($guzzleClient);
        $crawler = $goutteClient->request('GET', 'https://www.tesco.com/');

        dump($crawler);

        /*$crawler->filter('.result__title .result__a')->each(function ($node) {
            dump($node->text());
        });*/
    }
}
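
The fix above adds several headers at once, so it does not show which of them the site actually insists on. One way to narrow that down is to start from the working set and drop one header at a time. The sketch below is only an illustration under assumptions: `findRequiredHeaders()` and `guzzleProbe()` are hypothetical helpers, not part of Goutte or Guzzle, and the approach assumes the full header set works and that each header matters independently of the others. Repeated probing may also trip rate limiting on the target site, so treat the result as a hint rather than a definitive answer.

```php
<?php

/**
 * Hypothetical helper: reports which headers appear to be required by
 * removing one header at a time from a known-good set and re-testing.
 *
 * @param array<string, string> $headers  Header set that is known to work.
 * @param callable              $succeeds fn(array $headers): bool, true when a
 *                                        request with these headers gets a response.
 * @return string[] Names of the headers whose removal makes the request fail.
 */
function findRequiredHeaders(array $headers, callable $succeeds): array
{
    $required = [];
    foreach (array_keys($headers) as $name) {
        $trial = $headers;
        unset($trial[$name]);          // try the request without this one header
        if (!$succeeds($trial)) {
            $required[] = $name;       // the request broke, so the header matters
        }
    }
    return $required;
}

/**
 * Hypothetical probe built on Guzzle: returns a callable that sends a GET
 * to $url with the given headers and reports whether a response arrived.
 * Timeouts (cURL error 28) and HTTP error statuses both count as failure.
 */
function guzzleProbe(string $url): callable
{
    return function (array $headers) use ($url): bool {
        $client = new \GuzzleHttp\Client(['timeout' => 5, 'cookies' => true]);
        try {
            $client->request('GET', $url, ['headers' => $headers]);
            return true;
        } catch (\GuzzleHttp\Exception\TransferException $e) {
            return false;
        }
    };
}
```

To use it, pass the header array from the working controller together with `guzzleProbe('https://www.tesco.com/')`. Taking the check as a callable keeps the loop independent of the HTTP client, so the same function can be exercised with a stub instead of real network requests.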
