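Minifying HTML with Boost regex in C++

Question

How can I minify HTML using C++?

Resources

An external library might be the answer, but I would rather improve my current code. That said, I am interested in other possibilities too.

Current code

The only part I had to change from the original post was this one: "(?ix)"
... plus a few escape characters.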

#include <string>
#include <boost/regex.hpp>

void minifyhtml(std::string* s) {
  boost::regex Nowhitespace(
    "(?ix)"
    "(?>"           // Match all whitespace runs other than a single space.
    "[^\\S ]\\s*"   // Either one [\t\r\n\f\v] and zero or more ws,
    "| \\s{2,}"     // or two or more consecutive-any-whitespace.
    ")"             // Note: The remaining regex consumes no text at all...
    "(?="           // Ensure we are not in a blacklist tag.
    "[^<]*+"        // Either zero or more non-"<" {normal*}
    "(?:"           // Begin {(special normal*)*} construct
    "<"             // or a < starting a non-blacklist tag.
    "(?!/?(?:textarea|pre|script)\\b)"
    "[^<]*+"        // more non-"<" {normal*}
    ")*+"           // Finish "unrolling-the-loop"
    "(?:"           // Begin alternation group.
    "<"             // Either a blacklist start tag.
    "(?>textarea|pre|script)\\b"
    "| \\z"         // or end of file.
    ")"             // End alternation group.
    ")"             // If we made it here,we are not in a blacklist tag.
  );

  // @todo Don't remove conditional html comments
  boost::regex nocomments("<!--(.*)-->");

  *s = boost::regex_replace(*s,Nowhitespace," ");
  *s = boost::regex_replace(*s,nocomments,"");
}

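Only the first regex comes from the original post. The other one is something I am still working on and should be considered far from complete. Hopefully it gives a good idea of what I am trying to accomplish.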
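As a rough illustration of where the comment regex could go: the greedy "(.*)" in nocomments can swallow everything between the first "<!--" and the last "-->" in the document, and the @todo about conditional comments is still open. The sketch below is only one possible direction, not a finished solution. It uses std::regex from C++11 instead of Boost.Regex so that it is self-contained, and the function name strip_comments_sketch is made up for this example. It does not handle downlevel-revealed conditional comments such as "<!--[if !IE]><!-->".

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): removes ordinary HTML
// comments but leaves comments that start with "<!--[if" untouched.
std::string strip_comments_sketch(const std::string& html) {
    // (?!\[if)  : negative lookahead, skip conditional comments
    // [\s\S]*?  : lazy match that also crosses newlines, so each comment
    //             ends at its own "-->" rather than at the last one
    static const std::regex comment(R"(<!--(?!\[if)[\s\S]*?-->)",
                                    std::regex::icase);
    return std::regex_replace(html, comment, "");
}
```

Calling strip_comments_sketch on a string with two separate comments removes each comment on its own and keeps the text between them, which the greedy version would delete.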

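Solution

Regexps are a powerful tool, but I think using them in this case is a bad idea. For example, the regex you provided is a maintenance nightmare: you cannot tell quickly, just by looking at it, what it is supposed to match.

You need an HTML parser that tokenizes the input file, or that lets you access the tokens either as a stream or as an object tree. Basically, you read the tokens, discard the tokens and attributes you do not need, then write what remains to the output. Using something like this will let you develop a solution faster than trying to solve it with regexps.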
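To make the read-discard-write idea concrete, here is a minimal hand-rolled sketch. It is not a real HTML parser and is no substitute for one: it assumes that tags end at the first ">" (so a ">" inside an attribute value will confuse it), and it only knows the three raw-text elements from the question's regex. The names minify_html_sketch and starts_with_ci are invented for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True if `s` contains the lowercase `word` at `pos`, ignoring case.
static bool starts_with_ci(const std::string& s, std::size_t pos,
                           const std::string& word) {
    if (pos + word.size() > s.size()) return false;
    for (std::size_t i = 0; i < word.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(s[pos + i])) != word[i])
            return false;
    }
    return true;
}

// Hypothetical helper: walks the input token by token and writes a
// minified copy. Comments are dropped (except "<!--[if" ones), runs of
// whitespace become one space, pre/textarea/script bodies are kept as-is.
std::string minify_html_sketch(const std::string& in) {
    static const char* const kRaw[] = {"pre", "textarea", "script"};
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in.compare(i, 4, "<!--") == 0) {            // comment token
            std::size_t end = in.find("-->", i + 4);
            std::size_t stop = (end == std::string::npos) ? in.size() : end + 3;
            if (starts_with_ci(in, i + 4, "[if"))       // keep conditional comments
                out.append(in, i, stop - i);
            i = stop;
        } else if (in[i] == '<') {                      // tag token
            std::size_t end = in.find('>', i);
            std::size_t stop = (end == std::string::npos) ? in.size() : end + 1;
            out.append(in, i, stop - i);
            for (const char* name : kRaw) {
                const std::string tag(name);
                std::size_t after = i + 1 + tag.size();
                bool boundary = after >= in.size() ||
                    !std::isalnum(static_cast<unsigned char>(in[after]));
                if (starts_with_ci(in, i + 1, tag) && boundary) {
                    // Copy the element body verbatim up to its closing tag.
                    const std::string close = "</" + tag;
                    std::size_t j = stop;
                    while (j < in.size() && !starts_with_ci(in, j, close)) ++j;
                    out.append(in, stop, j - stop);
                    stop = j;
                    break;
                }
            }
            i = stop;
        } else if (std::isspace(static_cast<unsigned char>(in[i]))) {
            while (i < in.size() &&
                   std::isspace(static_cast<unsigned char>(in[i]))) ++i;
            out += ' ';                                 // collapse whitespace run
        } else {
            out += in[i++];                             // ordinary text
        }
    }
    return out;
}
```

Each branch of the loop corresponds to one kind of token, so adding a rule (for example, dropping an attribute) means touching one branch instead of rewriting a single large pattern. A real parser library would replace the scanning code while keeping the same overall shape.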

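I think you could use an XML parser, or you could search for an XML parser with HTML support.

In C++, libxml (which may have an HTML support module), Qt 4, tinyxml, and whichever XML parser libstrophe uses could all work.

Note that C++ (especially C++03) might not be the best language for this kind of program. Although I strongly dislike Python, it has the "Beautiful Soup" module, which works very well for this kind of problem.

Qt 4 might work because it provides a decent Unicode string type (and you will need one if you are going to parse HTML).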
