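How to fix node-fetch not returning all of the HTML in the page body
I'm using cheerio and node-fetch to collect all the product links on a particular URL.
I get an array of links back, but the list is incomplete because the body is missing the HTML that contains the product links.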
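const fetch = require('node-fetch'); // node-fetch v2; v3 is ESM-only and needs `import`
const cheerio = require('cheerio');

fetch('https://shop.gossmanknives.com/shop?olsPage=products')
  .then(res => res.text())
  .then(body => {
    // Parse the static HTML returned by the server
    const $ = cheerio.load(body);
    // Collect the href of every <a> plus every element with data-ux="Link"
    const snapshot = $("a,[data-ux='Link']")
      .map((i, x) => $(x).attr('href'))
      .toArray();
    console.log(snapshot);
  });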
This is the array that comes back:
['#','/','/shop','#','https://www.godaddy.com/websites/website-builder?isc=pwugc&utm_source=wsb&utm_medium=applications&utm_campaign=en-us_corp_applications_base']
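That seems odd, because there is an element like the one below that should have been picked up, yet the "body" returned by fetch() is missing a lot of the HTML I see in view source. I'm not sure why. Maybe the data is dynamic and isn't on the page yet when fetch() runs?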
<a rel="" typography="LinkAlpha" data-ux="Link" data-aid="PRODUCT_NAME_RENDERED_Orion" data-page="https://shop.gossmanknives.com/shop" data-page-query="olsPage=products/orion-aebl-black" href="https://shop.gossmanknives.com/shop?olsPage=products/orion-aebl-black" class="x-el x-el-a c2-9 c2-a c2-b c2-c c2-d c2-61 c2-f c2-3 c2-43 c2-4 c2-o c2-62 c2-63 c2-5 c2-6 c2-7 c2-8 x-d-ux x-d-aid x-d-page x-d-page-query" data-tccl="ux2.SHOP.shop1.Section.Default.Link.Default.43.click,click"><div data-ux="ProductCard" class="x-el x-el-div x-el c2-1 c2-2 c2-3 c2-4 c2-5 c2-6 c2-7 c2-8 x-d-ux c2-1 c2-2 c2-3 c2-4 c2-5 c2-6 c2-7 c2-8 x-d-ux"><div data-ux="ProductAsset" name="Orion" class="x-el x-el-div c2-1 c2-2 c2-1e c2-64 c2-65 c2-33 c2-4d c2-66 c2-2y c2-2z c2-30 c2-31 c2-3 c2-4 c2-5 c2-6 c2-7 c2-8 x-d-ux"><div id="guacBg20" role="img" data-ux="Background" data-aid="PRODUCT_IMAGE_RENDERED_Orion" treatmentdata="[object Object]" class="x-el x-el-div c2-1 c2-2 c2-67 c2-68 c2-69 c2-6a c2-1g c2-6b c2-6c c2-1t c2-1i c2-6d c2-71 c2-3 c2-4 c2-5 c2-6 c2-7 c2-8 x-d-ux x-d-aid" data-guac-image="loaded"><script>new guacImage('https://img1.wsimg.com/isteam/ip/94c95d7f-6505-4bfd-9837-ff1bcff87400/ols/IMG_0005-0002.JPG/:/rs=w:{width},h:{height},cg:false,m',document.getElementById('guacBg20'),{"useTreatmentData":true,"backgroundLayers":["linear-gradient(to bottom,rgba(22,22,0) 0%,0) 100%)"]})</script></div></div><div data-ux="ProductName" class="x-el x-el-div c2-1 c2-2 c2-6f c2-e c2-4j c2-g c2-3z c2-3 c2-4 c2-6g c2-5 c2-6 c2-7 c2-8 x-d-ux"><p typography="BodyAlpha" data-ux="Text" class="x-el x-el-p c2-1 c2-2 c2-c c2-d c2-4u c2-x c2-y c2-3y c2-6h c2-3 c2-6i c2-12 x-d-ux">Orion</p></div><div data-ux="ProductPrices" class="x-el x-el-div c2-1 c2-2 c2-6j c2-3y c2-3 c2-4 c2-5 c2-6 c2-7 c2-8 x-d-ux"><div typography="BodyAlpha" data-ux="Price" price="[object Object]" data-aid="PRODUCT_PRICE_RENDERED_Orion" class="x-el x-el-div c2-1 c2-2 c2-c c2-d c2-4u c2-x c2-y c2-t c2-3y c2-6k c2-3 c2-6i c2-12 x-d-ux 
x-d-aid">$365.00</div></div><p typography="DetailsAlpha" data-ux="ProductLabel" data-aid="PRODUCT_SHIP_FREE_RENDERED_Orion" class="x-el x-el-p c2-1 c2-1p c2-c c2-d c2-4u c2-6f c2-y c2-3y c2-28 c2-4r c2-3 c2-12 c2-29 c2-6q c2-2a c2-2b c2-2c x-d-ux x-d-aid">Free Shipping</p></div></a>
Note: I'm using https://www.npmjs.com/package/node-fetch
Solution
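Your selector looks wrong: you are matching every <a> element, or any element that has a [data-ux='Link'] attribute, so you pick up plenty of links that don't have the attribute. To get only the links that have it, pass "a[data-ux='Link']" with no space between the two parts (with a space, "a [data-ux='Link']" would instead match elements nested inside an <a>).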
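Navigation to the product pages then goes through the URL query string. For some reason the query part of the URL appears to be stripped by the time cheerio returns the href. Note that the array contains "/shop" values that may originally have been "/shop?something=123...". Try logging the whole <a> element and see what you can work out from there.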
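Once full hrefs are available, product links can be told apart from navigation links by their query string. The following is a minimal sketch, assuming product URLs carry an `olsPage` parameter that starts with `products/` (as in the sample element in the question). `extractProductLinks` and `BASE_URL` are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: keep only hrefs whose `olsPage` query parameter points at a product.
// Assumption: product pages look like /shop?olsPage=products/<slug>, as in the question's sample.
const BASE_URL = 'https://shop.gossmanknives.com/shop';

function extractProductLinks(hrefs, base = BASE_URL) {
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, base); // resolves relative hrefs such as "/shop" or "#"
    } catch {
      continue; // skip values that cannot be parsed as URLs
    }
    const olsPage = url.searchParams.get('olsPage') || '';
    if (olsPage.startsWith('products/')) {
      url.hash = ''; // ignore fragments so duplicates collapse
      seen.add(url.href);
    }
  }
  return [...seen];
}

console.log(extractProductLinks([
  '#',
  '/shop',
  '/shop?olsPage=products/orion-aebl-black',
  'https://shop.gossmanknives.com/shop?olsPage=products/orion-aebl-black',
]));
```

Resolving every href against the page URL first means relative and absolute forms of the same link are deduplicated rather than reported twice.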
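The data isn't in the body because the page is dynamic HTML: the product links are added by JavaScript after the initial response, and fetch() never runs that JavaScript.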
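A quick way to confirm that diagnosis is to check whether a marker seen in the browser appears in the raw response at all. The sketch below uses two made-up HTML strings to stand in for the server response and the rendered DOM. `hasMarker` is a hypothetical helper, and the `data-aid` prefix is taken from the sample element in the question.

```javascript
// Hypothetical helper: report whether `marker` occurs anywhere in the HTML text.
function hasMarker(html, marker) {
  return html.includes(marker);
}

// Made-up stand-ins: what a server might send vs. what the browser shows after scripts run.
const serverHtml =
  '<html><body><div id="app"></div><script src="/bundle.js"></script></body></html>';
const renderedHtml =
  '<html><body><div id="app">' +
  '<a data-ux="Link" data-aid="PRODUCT_NAME_RENDERED_Orion" ' +
  'href="https://shop.gossmanknives.com/shop?olsPage=products/orion-aebl-black">Orion</a>' +
  '</div></body></html>';

const marker = 'data-aid="PRODUCT_NAME_RENDERED_';
console.log(hasMarker(serverHtml, marker));   // false
console.log(hasMarker(renderedHtml, marker)); // true
```

In practice `serverHtml` would be the `body` string from `res.text()`. If the marker is absent there, no change of selector will find the links.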
I used Puppeteer (original source here) and everything works fine.
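const puppeteer = require('puppeteer');

// Load the page in headless Chrome so client-side scripts can render the product links.
async function getContents(url) {
  let browser;
  try {
    browser = await puppeteer.launch();
    const [page] = await browser.pages();
    // 'networkidle0' waits until there have been no network connections for at least 500 ms
    await page.goto(url, { waitUntil: 'networkidle0' });
    // Serialize the fully rendered document
    const data = await page.evaluate(() => document.documentElement.outerHTML);
    console.log(data);
    return data;
  } catch (err) {
    console.error(err);
  } finally {
    // Close the browser even if navigation fails
    if (browser) await browser.close();
  }
}

getContents('https://shop.gossmanknives.com/shop?olsPage=products');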
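If only the links are needed, the hrefs can also be read from the rendered DOM directly instead of dumping the whole document. This is a rough sketch, not the original poster's code: `collectLinks` is a hypothetical name, `page` is assumed to be a Puppeteer Page that has already navigated to the shop URL, and the default selector is the one suggested in the first answer.

```javascript
// Sketch: read hrefs from the rendered DOM through Puppeteer's page.$$eval.
// Assumption: `page` is a Puppeteer Page on which page.goto(...) has already completed.
async function collectLinks(page, selector = "a[data-ux='Link']") {
  // $$eval runs the callback inside the browser with all elements matching the selector.
  const hrefs = await page.$$eval(selector, (anchors) => anchors.map((a) => a.href));
  // Drop duplicates and empty values.
  return [...new Set(hrefs)].filter(Boolean);
}

// Usage, after page.goto(url, { waitUntil: 'networkidle0' }):
//   const links = await collectLinks(page);
//   console.log(links);
```

Reading `a.href` in the browser yields absolute URLs, so the query string that identifies each product is kept.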