node.js - Webscraping with cheerio nodeJS

asked Jan 27, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm trying to scrape a webpage to practise with cheerio, but I can't get it to work. I'm using axios to make the HTTP request.

scrap.js

const cheerio = require('cheerio');
const axios = require('axios');

async function iniciar() {
    axios.get('https://www.idealo.es/precios/4102124/the-north-face-men-s-mcmurdo-parka-tnf-black.html').then( res => {
        var price = [];
        const $ = cheerio.load(res.data);

        $('span.oopStage-variantThumbnailsFromPrice').each( (index, element) => {
            const name = $(element).first().text()
            price.push(name)
        })
        console.log(price);
    })
}

module.exports = {
    iniciar
};

main.js

const scrap = require('./assets/scrap');
scrap.iniciar()

It always returns an empty array. This is the markup I'm targeting:

<strong>
 <span class="oopStage-variantThumbnailsFromText">desde</span>
 <span class="oopStage-variantThumbnailsFromPrice">294,99&nbsp;€</span>
</strong>

Any ideas?

1 Answer

深蓝 · Answer 1 · 2021-01-26T20:53:34+0000

This isn't working because the HTML of that page is dynamic: it is generated on the client side by JavaScript, so the response that axios receives does not contain the price element yet.
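One way to confirm this diagnosis is to check whether the class name appears anywhere in the raw response body before any script runs. The sketch below is only an illustration: `markupHasClass` is a hypothetical helper (not part of cheerio or axios), and `staticHtml` is a made-up stand-in for what a client-rendered page might send, not the actual idealo response.

```javascript
// Hypothetical helper: reports whether any element in a raw HTML string
// carries the given class name as a whole token of its class attribute.
function markupHasClass(html, className) {
    // Escape regex metacharacters so the class name is matched literally.
    const escaped = className.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    // Match class="..." or class='...' containing the name as a whole token.
    const pattern = new RegExp(
        'class\\s*=\\s*(["\'])(?:[^"\']*\\s)?' + escaped + '(?:\\s[^"\']*)?\\1'
    );
    return pattern.test(html);
}

// Markup as it appears once the page has rendered (taken from the question).
const renderedHtml =
    '<strong><span class="oopStage-variantThumbnailsFromPrice">294,99&nbsp;€</span></strong>';
// Assumed example of a server response that ships only an empty container.
const staticHtml = '<div id="app"></div>';

console.log(markupHasClass(renderedHtml, 'oopStage-variantThumbnailsFromPrice')); // true
console.log(markupHasClass(staticHtml, 'oopStage-variantThumbnailsFromPrice'));   // false
```

Running the same check on `res.data` from the axios call would show whether the selector has anything to match before cheerio is involved at all.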

We can still scrape the data, but we need a headless browser such as Puppeteer (Zombie.js or another headless browser might work too). I'll use Puppeteer for this example.

We load the page in the headless browser, then parse the rendered HTML in much the same way you did before.

I'm also using the user-agents package to generate a random user agent, which helps avoid a Captcha request.

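const puppeteer = require('puppeteer');
const cheerio = require('cheerio');
const UserAgent = require('user-agents');

// Loads the page in a headless browser and returns the rendered HTML,
// or null if anything goes wrong.
async function getDynamicPageHtml(url) {
    let browser;
    try {
        browser = await puppeteer.launch();
        const page = await browser.newPage();
        // user-agents exports a class; each new instance is a random user agent.
        await page.setUserAgent(new UserAgent().toString());

        // Wait until the network is idle so client-side rendering can finish.
        await page.goto(url, { waitUntil: 'networkidle0' });
        return await page.content();
    } catch (err) {
        console.error(err);
        return null;
    } finally {
        // Close the browser even when navigation fails.
        if (browser) await browser.close();
    }
}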

async function iniciar() {
    const html = await getDynamicPageHtml('https://www.idealo.es/precios/4102124/the-north-face-men-s-mcmurdo-parka-tnf-black.html');
    // getDynamicPageHtml returns null on failure; there is nothing to parse then.
    if (html === null) {
        return [];
    }
    const $ = cheerio.load(html);
    const price = $('span.oopStage-variantThumbnailsFromPrice').map( (index, element) => {
        return $(element).text().trim();
    }).toArray();
    console.log("iniciar: price:", price);
    return price;
}

module.exports = {
    iniciar
};

I'm getting the below output when I call iniciar:

iniciar: price: [ '294,99 €' ]
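The character between the amount and the euro sign is the non-breaking space from the `&nbsp;` in the page markup, which some terminals display as `?`. If you need the price as a number, a small normalising step along these lines may help. This is a sketch only: `parseEuroPrice` is a hypothetical helper, and it assumes Spanish-style formatting where `.` groups thousands and `,` marks decimals.

```javascript
// Hypothetical helper: converts a scraped price such as '294,99\u00a0€'
// into a number, or returns null when no number can be read.
// Assumes '.' is the thousands separator and ',' the decimal separator.
function parseEuroPrice(text) {
    const cleaned = text
        .replace(/[^\d.,]/g, '')   // drop currency symbol and all whitespace, including \u00a0
        .replace(/\./g, '')        // remove thousands separators
        .replace(',', '.');        // decimal comma -> decimal point
    const value = Number.parseFloat(cleaned);
    return Number.isNaN(value) ? null : value;
}

console.log(parseEuroPrice('294,99\u00a0€'));  // 294.99
console.log(parseEuroPrice('1.294,99 €'));     // 1294.99
console.log(parseEuroPrice('desde'));          // null
```

It could be applied to the array returned by iniciar with `price.map(parseEuroPrice)`.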
