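I want to develop a web crawler that starts from a seed URL and then crawls 100 HTML pages it finds that belong to the same domain as the seed URL, keeping a record of the URLs it has traversed so that it avoids duplicates. I wrote the following, but the $url_count value does not seem to increase, and the retrieved URLs even include links from other domains. How can I fix this? Here I have used stackoverflow.com as my starting URL.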
    use strict;
    use warnings;
    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;

    ##open file to store links
    open my $file1,">>",("extracted_links.txt");
    select($file1);

    ##starting URL
    my @urls = 'https://stackoverflow.com/';

    my $browser = LWP::UserAgent->new('IE 6');
    $browser->timeout(10);
    my %visited;
    my $url_count = 0;

    while (@urls)
    {
        my $url = shift @urls;
        if (exists $visited{$url}) ##check if URL already exists
        {
            next;
        }
        else
        {
            $url_count++;
        }

        my $request = HTTP::Request->new(GET => $url);
        my $response = $browser->request($request);

        if ($response->is_error())
        {
            printf "%s\n",$response->status_line;
        }
        else
        {
            my $contents = $response->content();
            $visited{$url} = 1;
            @lines = split(/\n/,$contents);
            foreach $line(@lines)
            {
                $line =~ m@(((http\:\/\/)|(www\.))([a-z]|[A-Z]|[0-9]|[/.]|[~]|[-_]|[()])*[^'">])@g;
                print "$1\n";
                push @urls,$$line[2];
            }

            sleep 60;

            if ($visited{$url} == 100)
            {
                last;
            }
        }
    }

    close $file1;
Solution
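A few points. Your URL parsing is fragile, and you certainly won't pick up relative links, because the regex only matches text that starts with http:// or www. Also, you don't test for 100 links: you test whether the current URL's entry in %visited equals 100, and since that entry is only ever set to 1, the check never succeeds. That is almost certainly not what you meant. Finally, I'm not very familiar with LWP, so I'll show an example that uses the Mojolicious suite of tools.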
This seems to work; perhaps it will give you some ideas.
    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Mojo::UserAgent;
    use Mojo::URL;

    ## open file to store links
    open my $log, '>', 'extracted_links.txt' or die $!;

    ## starting URL
    my $base = Mojo::URL->new('https://stackoverflow.com/');
    my @urls = $base;

    my $ua = Mojo::UserAgent->new;
    my %visited;
    my $url_count = 0;

    while (@urls) {
      my $url = shift @urls;
      next if exists $visited{$url};

      print "$url\n";
      print $log "$url\n";

      $visited{$url} = 1;
      $url_count++;

      # find all <a> tags that carry an href and act on each
      $ua->get($url)->res->dom('a[href]')->each(sub {
        # resolve relative links against the page they were found on
        my $link = Mojo::URL->new($_->{href})->to_abs($url);
        # skip links to another host (or with no host at all, e.g. mailto:)
        return unless ($link->host // '') eq $base->host;
        push @urls, $link;
      });

      # stop once 100 pages have been visited
      last if $url_count == 100;

      sleep 1;
    }

    close $log;
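The relative-link and same-domain points can also be pulled out into a small helper, which makes them easy to check without touching the network. The following is only a sketch under a few assumptions: `normalize_link` is a hypothetical name (it is not part of Mojolicious), the page URL and the sample hrefs are made up for illustration, and "same domain" is taken to mean an exact host match with the seed URL.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::URL;

# Hypothetical helper (not a Mojolicious function): turn an href found on
# $page into an absolute, fragment-free URL string, or return nothing if
# the link should not be crawled.
sub normalize_link {
    my ($href, $page, $base) = @_;
    return unless defined $href && length $href;

    # Resolve relative links such as "/users" against the page URL;
    # links that are already absolute are left as they are
    my $link = Mojo::URL->new($href)->to_abs(Mojo::URL->new($page));

    # Keep only web links; this drops mailto:, javascript: and similar
    return unless ($link->scheme // '') =~ /^https?$/;

    # Same-domain check, assumed here to be an exact host match with the seed
    my $seed_host = Mojo::URL->new($base)->host // '';
    return unless lc($link->host // '') eq lc($seed_host);

    # "#section" anchors point into the same document, so strip them
    $link->fragment(undef);

    return $link->to_string;
}

# Illustrative values only
my $base = 'https://stackoverflow.com/';
my $page = 'https://stackoverflow.com/questions/tagged/perl';

my (%seen, @queue);
for my $href ('/users', 'https://stackoverflow.com/users#top',
              'mailto:someone@example.com', 'https://example.com/', '/tags') {
    my $link = normalize_link($href, $page, $base);
    next unless defined $link;   # filtered out
    next if $seen{$link}++;      # already queued
    push @queue, $link;
}

print "$_\n" for @queue;
```

With these sample hrefs only the two stackoverflow.com links should end up in the queue: the `#top` variant collapses into the plain `/users` URL, and the mailto and example.com links are dropped. Note that an exact host match also excludes subdomains such as meta.stackoverflow.com; if those should count as the same domain, the comparison would need to be loosened.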