Welcome to MLink Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


node.js - Web scraping with cheerio in Node.js

I'm trying to scrape a web page to practice with cheerio, but I can't get it to work. I'm using axios to make the HTTP request.

scrap.js

const cheerio = require('cheerio');
const axios = require('axios');

async function iniciar() {
    axios.get('https://www.idealo.es/precios/4102124/the-north-face-men-s-mcmurdo-parka-tnf-black.html').then( res => {
        var price = [];
        const $ = cheerio.load(res.data);

        $('span.oopStage-variantThumbnailsFromPrice').each( (index, element) => {
            const name = $(element).first().text()
            price.push(name)
        })
        console.log(price);
    })
}

module.exports = {
    iniciar
};

main.js

const scrap = require('./assets/scrap');
scrap.iniciar()

It always prints an empty array. This is the markup I'm trying to match:

<strong>
 <span class="oopStage-variantThumbnailsFromText">desde</span>
 <span class="oopStage-variantThumbnailsFromPrice">294,99&nbsp;€</span>
</strong>

Any ideas?



1 Answer


This isn't working because the content of that page is dynamic: the price markup is generated on the client side by JavaScript. axios only downloads the initial HTML and never runs that JavaScript, so cheerio has no matching elements to select.
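One quick way to confirm this before reaching for a headless browser is to check whether the class name appears anywhere in the raw response body. The sketch below is only illustrative: `rawHtmlHasClass` is a hypothetical helper (not part of cheerio or axios), and `staticHtml` is an assumed example of a server response that ships only an empty mount point, not the actual response from the site.

```javascript
// Diagnostic sketch: does the raw (pre-JavaScript) HTML mention the class at all?
// `rawHtmlHasClass` is a hypothetical helper, not part of cheerio or axios.
function rawHtmlHasClass(html, className) {
    // Look at every class="..." attribute and compare whole class tokens.
    const classAttr = /class\s*=\s*["']([^"']*)["']/gi;
    let match;
    while ((match = classAttr.exec(html)) !== null) {
        if (match[1].split(/\s+/).includes(className)) {
            return true;
        }
    }
    return false;
}

// Markup as it appears once the page has rendered (taken from the question).
const renderedHtml = '<strong><span class="oopStage-variantThumbnailsFromPrice">294,99&nbsp;€</span></strong>';
// Assumed shape of a response that relies on client-side rendering.
const staticHtml = '<div id="app"></div><script src="bundle.js"></script>';

console.log(rawHtmlHasClass(renderedHtml, 'oopStage-variantThumbnailsFromPrice')); // true
console.log(rawHtmlHasClass(staticHtml, 'oopStage-variantThumbnailsFromPrice'));   // false
```

If the same check on `res.data` from the axios call returns false, the element simply isn't in the downloaded HTML, and no cheerio selector will find it.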

We can still scrape the data, but we need a headless browser such as Puppeteer (Zombie.js or another headless browser might work too). I'll use Puppeteer for this example.

We load the page in the headless browser, then parse the resulting HTML in much the same way you did before.

I'm also using the user-agents package to generate a random User-Agent string, which helps avoid a CAPTCHA challenge.

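const puppeteer = require('puppeteer');
const cheerio = require('cheerio');
const UserAgent = require('user-agents');

// Load the page in a headless browser and return its HTML after client-side rendering.
async function getDynamicPageHtml(url) {
    let browser;
    try {
        browser = await puppeteer.launch();
        const page = await browser.newPage();
        // user-agents exports a class; an instance stringifies to a random user-agent.
        await page.setUserAgent(new UserAgent().toString());

        // Wait until there have been no network connections for at least 500 ms.
        await page.goto(url, { waitUntil: 'networkidle0' });
        return await page.content();
    } catch (err) {
        console.error(err);
        return null;
    } finally {
        // Close the browser even when navigation fails.
        if (browser) await browser.close();
    }
}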

async function iniciar() {
    const html = await getDynamicPageHtml('https://www.idealo.es/precios/4102124/the-north-face-men-s-mcmurdo-parka-tnf-black.html');
    // getDynamicPageHtml returns null on failure, so there is nothing to parse.
    if (html === null) {
        return [];
    }
    const $ = cheerio.load(html);
    const price = $('span.oopStage-variantThumbnailsFromPrice').map( (index, element) => {
        return $(element).text().trim();
    }).toArray();
    console.log("iniciar: price:", price);
    return price;
}

module.exports = {
    iniciar
};

I get the following output when I call iniciar:

iniciar: price: [ '294,99 €' ]
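The character between the digits and the euro sign comes from the `&nbsp;` in the page markup, so it is a non-breaking space (U+00A0) rather than a regular one. If you need the price as a number, something along these lines might work. It is only a sketch: `parseEuroPrice` is a hypothetical helper, and it assumes Spanish-style formatting with `.` as the thousands separator and `,` as the decimal separator.

```javascript
// Hypothetical helper: turn a scraped price such as '294,99\u00a0€' into a number.
// Assumes Spanish-style formatting: '.' for thousands, ',' for decimals.
function parseEuroPrice(text) {
    const cleaned = text
        .replace(/[^\d.,]/g, '')   // drop the currency symbol and any spaces, including U+00A0
        .replace(/\./g, '')        // remove thousands separators
        .replace(',', '.');        // decimal comma -> decimal point
    const value = Number.parseFloat(cleaned);
    // Return null when the text holds no parsable number.
    return Number.isNaN(value) ? null : value;
}

console.log(parseEuroPrice('294,99\u00a0€')); // 294.99
console.log(parseEuroPrice('1.294,99 €'));    // 1294.99
```

You could apply it to the array returned by iniciar with `price.map(parseEuroPrice)`.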
