php - How do I "scrape" content from a page's source?

I have this code to fetch a page's HTML:
$page = file_get_contents('http://example.com/page.html');
$page = htmlentities($page);

I want to collect some content from it. For example, suppose the page source contains:

<strong>technorati.com</strong><br />
Connection Failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection Failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br /><br />Pinging <strong>newsgator.com</strong><br />
Done<br /><br />Pinging <strong>blo.gs</strong><br />
Done<br /><br />Pinging <strong>Feedburner.com</strong><br />
Done<br /><br />Pinging <strong>blogstreet.com</strong><br />
Done<br /><br />Pinging <strong>my.yahoo.com</strong><br />
Connection Failed<br /><br />Pinging <strong>moreover.com</strong><br />
Connection Failed<br /><br />Pinging <strong>newsisfree.com</strong><br />
Done<br />

Is there a way to pull this out of the source and store it in a variable, so that it looks like this:

technorati.com Connection Failed
icerocket.com Connection Failed
weblogs.com Done
Etc.

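The page is dynamic, which is why I am having trouble. Could I search the source for each site? But then how would I get the result that follows it (Connection Failed / Done)?
Thank you very much for your help!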

I have used the Simple HTML DOM PHP library to scrape several sites. It is available here: http://simplehtmldom.sourceforge.net/

Then use code like this:

<?php
// The library file name is lowercase; this matters on case-sensitive filesystems.
include_once 'simple_html_dom.php';

$url = "http://slashdot.org/";
$html = file_get_html($url);

//remove additional spaces
$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

foreach ($html->find('h2') as $heading) { // for each <h2> heading
        // take the first <a> nested inside a <span>, then echo its text with whitespace normalized
        echo preg_replace($pat, $rep, $heading->find('span a', 0)->plaintext) . "\n";
}
?>

This produces output similar to:

5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents

Original source: https://www.jb51.cc/php/130313.html
