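How to extract a JavaScript variable from a scraped HTML page using PHP (regex)
I am struggling to extract a JavaScript variable from the HTML dump of a scraped web page.
I am currently using this regular expression: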
$re = '/window\.universal_variable\s*=\s*\{(.*?)\}/ms';
But it only returns the first group of values. What I actually want is every variable and value under product (i.e. id, product_id, sku and so on).
<script type="text/javascript">
window.universal_variable = {
page: {
category: "product",searchTerm: "sony",environment: "production",variation: "production",revision: "1.1"
},user: {
otb: "",ATG_FO_IND: "A",
ooops_preference: "false",registered_today: false,registration_date: "",registered_in_current_session: false,idv_verified: true,last_order_date: "",start_date: "",first_order: false,returning: false,last_transaction_payment_type: "",unicaSegment: "",targetedPromos :"",cva:"0",cvb:"1",cvc:""
}// end of user ,
product:{
id: "KEN6C",product_id: "prod1086433641",sku: "KEN6C",manufacturer: "",category: "Televisions",category_facet: "4740",department: "Electricals",subcategory: "electricals_televisions",currency: "GBP",unit_price: "",unit_sale_price: "319.0",rating: "4.3",ratingCount: "2048"
}// end of product
}// end of window.universal_variable
window.sdgGA = {
environment: "production",device: "desktop",userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/78.0.3904.70 Safari/537.36",page: {
PID: "test : PRODUCT",loggedInState: "not logged in",category:"product",customerStatus: "new"
},</script>
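Any suggestions?
Solution
Your pattern only returns the first group because the lazy (.*?)\} stops at the first closing brace it finds, and that brace closes the page object. Rather than relying on a very fragile regular expression, I would suggest using a transpiler such as this one. I tested it on your sample code and it worked fine.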
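If a regex is still acceptable for a quick one-off, a narrower pattern aimed directly at the product block may be enough. The following is only a minimal sketch, not the transpiler approach recommended above. The function name extractProductFields is hypothetical, and the code assumes that the product object is flat (no nested braces) and that no string value contains a double quote or a closing brace, which holds for the sample in the question but may not hold for other pages.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Returns the key/value pairs of the flat `product: { ... }` object as strings.
function extractProductFields(string $html): array
{
    // Capture everything between `product: {` and the next `}`.
    // Assumption: the object contains no nested braces.
    if (!preg_match('/\bproduct\s*:\s*\{([^{}]*)\}/', $html, $block)) {
        return [];
    }

    // Match `key: "quoted value"` or `key: bareword` (true, false, numbers).
    preg_match_all(
        '/([A-Za-z_]\w*)\s*:\s*(?:"([^"]*)"|([^,\s"]+))/',
        $block[1],
        $pairs,
        PREG_SET_ORDER
    );

    $fields = [];
    foreach ($pairs as $pair) {
        // Group 3 is present only when the value was unquoted.
        $fields[$pair[1]] = $pair[3] ?? $pair[2];
    }
    return $fields;
}

// Usage with a shortened version of the sample from the question.
$html = <<<'HTML'
<script type="text/javascript">
window.universal_variable = {
page: {
category: "product",searchTerm: "sony"
},
product:{
id: "KEN6C",product_id: "prod1086433641",sku: "KEN6C",unit_sale_price: "319.0"
}// end of product
}// end of window.universal_variable
</script>
HTML;

print_r(extractProductFields($html));
```

Every value comes back as a string, so prices and ratings need to be cast before use. This remains brittle in the same way as the original pattern: if the site adds a nested object or a quote inside a value, a real JavaScript parser is the safer choice.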