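How do you parse and process HTML/XML in PHP?

How can one parse HTML/XML and extract information from it?

Answer:

Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.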

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C’s Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real-world (broken) HTML, and it can run XPath queries. It is based on libxml.
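A minimal sketch of that combination, assuming the `dom` and `libxml` extensions are enabled (they are in a default PHP build). The markup and URLs are made up for illustration: the paragraphs are unclosed and there is a stray closing tag.

```php
<?php
// Hypothetical, deliberately broken input: unclosed <p> elements, a stray </div>.
$html = '<p>See <a href="https://example.com/a">first link</a>'
      . '<p>and <a href="/b" class="x">second</a></div>';

// Collect libxml's parse complaints internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html); // goes through libxml's HTML parser, which recovers from broken markup

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Query the recovered tree with XPath: the href attribute of every <a> element.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

Switching `libxml_use_internal_errors()` on around the load is a choice made here to keep the output quiet; the parser recovers either way.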

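It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API.

Basic usage examples can be found in "Grabbing the href attribute of an A element", and a general conceptual overview can be found at "DOMDocument in php".

How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.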

XMLReader

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

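XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are that parsing broken HTML with XMLReader will be less robust than with DOM, where you can explicitly tell it to use libxml's HTML Parser Module.

A basic usage example can be found at "getting all values from h1 tags using php".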
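As a rough sketch of the pull-parser style, the following walks a small, well-formed document and collects the text of each `h1` element. The sample markup is invented, and it has to be well-formed XML because XMLReader is not using the HTML parser here.

```php
<?php
// Hypothetical well-formed input.
$xml = '<html><body><h1>First</h1><p>text</p><h1>Second</h1></body></html>';

$reader = new XMLReader();
$reader->XML($xml); // load from a string; open() would read from a URI instead

$headings = [];
// read() advances the cursor one node at a time; only the current node is inspected.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'h1') {
        $headings[] = $reader->readString(); // text content of the current element
    }
}
$reader->close();

print_r($headings);
```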

XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

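The XML Parser library is also based on libxml and implements a SAX-style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but it will be more difficult to work with than the pull parser implemented by XMLReader.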
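To illustrate what "push" means in practice, here is a small sketch in which the parser calls handlers as it meets start tags, end tags and character data. The element names and the state kept in `$inItem` and `$items` are assumptions made for this example.

```php
<?php
// Hypothetical input document.
$xml = '<feed><item>alpha</item><item>beta</item></feed>';

$items  = [];    // collected text of each <item>
$inItem = false; // state that must be tracked by hand between callbacks

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$inItem, &$items) {
        if ($name === 'item') {
            $inItem = true;
            $items[] = '';
        }
    },
    function ($parser, $name) use (&$inItem) {
        if ($name === 'item') {
            $inItem = false;
        }
    }
);

// Character data may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$inItem, &$items) {
        if ($inItem) {
            $items[count($items) - 1] .= $data;
        }
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($items);
```

The manual state tracking is the part that tends to make this style harder to work with than XMLReader, where the calling code decides when to move forward.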

SimpleXml

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML because it will choke.
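A short sketch of what "choke" looks like, using an invented fragment with an unclosed `br` and mismatched tags. Errors are collected through libxml so they can be inspected instead of surfacing as PHP warnings.

```php
<?php
// Hypothetical tag soup: <br> and <b> are never closed, so </p> does not match.
$broken = '<p>Unclosed paragraph<br><b>bold</p>';

$previous = libxml_use_internal_errors(true);

$result = simplexml_load_string($broken); // false when the input is not well-formed XML
$errors = libxml_get_errors();            // array of LibXMLError objects

libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($result);
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```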

A basic usage example can be found at "A simple program to CRUD node and node values of xml file", and there are lots of additional examples in the PHP Manual.
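For well-formed input, the property and array style access looks roughly like this. The document structure (`library`, `book`, `title`) is made up for the example.

```php
<?php
// Hypothetical well-formed XML.
$xml = <<<XML
<library>
  <book id="b1"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
</library>
XML;

$library = simplexml_load_string($xml);

// Read: child elements are properties, attributes are array keys.
$titles = [];
foreach ($library->book as $book) {
    $titles[(string) $book['id']] = (string) $book->title;
}

// Update an existing node and append a new one.
$library->book[0]->title = 'Renamed';
$library->addChild('book')->addChild('title', 'Third');

echo $library->asXML();
```

The `(string)` casts matter: without them the values are still `SimpleXMLElement` objects rather than plain strings.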

3rd Party Libraries (libxml based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

FluentDOM

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

HtmlPageDom

Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery (not updated for years)

PHPQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library written in PHP5 and provides additional Command Line Interface (CLI).

See also: https://github.com/electrolinux/phpquery

Zend_Dom

Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

QueryPath

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including css-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.

fDOMDocument

fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP Warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple “xml to object/array” mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API.
It leverages XPath and the fluent programming pattern to be fun and effective.

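3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.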

PHP Simple HTML DOM Parser

  • An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
  • Require PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.

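I generally do not recommend this parser. The codebase is horrible, and the parser itself is rather slow and memory hungry. Not all jQuery selectors (such as child selectors) are possible. Any of the libxml-based libraries should outperform this easily.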

PHP Html Parser

PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assist in the development of tools which require a quick, easy way to scrap html, whether it’s valid or not! This project was original supported by sunra/PHP-simple-html-dom-parser but the support seems to have stopped so this project is my adaptation of his previous work.

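Again, I would not recommend this parser. It is rather slow, with high CPU usage. There is also no function to clear the memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since April 14.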

Ganon

  • A universal tokenizer and HTML/XML/RSS DOM Parser
    • Ability to manipulate elements and their attributes
    • Supports invalid HTML and UTF8
  • Can perform advanced CSS3-like queries on elements (like jQuery — namespaces supported)
  • A HTML beautifier (like HTML Tidy)
    • Minify CSS and Javascript
    • Sort attributes, change character case, correct indentation, etc.
  • Extensible
    • Parsing documents using callbacks based on current character/token
    • Operations separated in smaller functions for easy overriding
  • Fast and Easy

Never used it. Can't tell if it's any good.

HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you want to consider using a dedicated parser, like

html5lib

A Python and PHP implementations of a HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.

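We might see more dedicated parsers once HTML5 is finalized. There is also a W3 blog post titled "How-To for html 5 parsing" that is worth checking out.

WebServices

If you don't feel like programming PHP, you can also use Web services. In general, I found very little utility for these, but that's just me and my use cases.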

ScraperWiki.

ScraperWiki’s external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.

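Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the RegEx fail when it's not properly written. You should know what you are doing before using RegEx on HTML.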
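A small illustration of that brittleness, with an invented pattern and invented markup. The pattern stands in for the kind of snippet found online, and `hrefsViaDom()` is a hypothetical helper written for this comparison.

```php
<?php
// A typical brittle pattern: it assumes href is the first and only attribute.
$pattern = '/<a href="([^"]*)">/';

$original = '<a href="/home">Home</a>';
$changed  = '<a class="nav" href="/home">Home</a>'; // same link, one attribute added

$matchedOriginal = preg_match($pattern, $original, $m1);
$matchedChanged  = preg_match($pattern, $changed, $m2);

// Hypothetical helper: extract hrefs through the DOM extension instead.
function hrefsViaDom(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

var_dump($matchedOriginal, $matchedChanged);
var_dump(hrefsViaDom($original), hrefsViaDom($changed));
```

The regex could of course be loosened to tolerate extra attributes, but each such tolerance has to be added by hand, whereas the parser handles attribute order and spacing already.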

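HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules for each new RegEx you write. RegEx are fine in some cases, but it really depends on your use case.

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job on this.

Also see "Parsing Html The Cthulhu Way".

Books

If you want to spend some money, have a look at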

> PHP Architect’s Guide to Webscraping with PHP

I am not affiliated with PHP Architect or the authors.