Loading
Build a Node.js web scraper that fetches pages, parses HTML with Cheerio, handles pagination and rate limiting, and exports structured data to JSON and CSV.
Web scraping is the process of programmatically extracting data from websites. It is a critical skill for data engineering, market research, content aggregation, and competitive analysis. When an API does not exist, scraping is often the only way to get the data you need.
In this tutorial, you will build a robust web scraper in Node.js using Cheerio for HTML parsing. Your scraper will handle pagination, respect rate limits, retry on failure, and output structured data to both JSON and CSV formats. You will also learn responsible scraping practices — checking robots.txt, setting appropriate delays, and identifying your bot with a user agent string.
The tools are simple: Node.js for the runtime, node-fetch or the built-in fetch for HTTP requests, cheerio for parsing HTML into a jQuery-like API, and csv-stringify for CSV output. No browser automation needed for server-rendered pages.
Initialize a TypeScript Node.js project with the dependencies you need.
Create the entry point and types:
The ScrapeConfig interface centralizes all scraper behavior. The delayMs field controls the pause between requests — this is how you avoid overwhelming target servers.
Create a fetch wrapper that handles retries, timeouts, and rate limiting.
The fetcher implements exponential backoff: each retry waits longer than the last. This is respectful to the target server and dramatically improves success rates when dealing with transient failures or rate limits.
Cheerio loads HTML into a DOM-like structure you can query with CSS selectors — the same selectors you use in browser DevTools.
The key insight with Cheerio is that you need to inspect the target site's HTML structure first. Open browser DevTools, find the elements you want, note their CSS classes or attributes, then translate those into selectors.
Most data worth scraping spans multiple pages. Build a pagination loop that follows "next" links until there are no more pages or you hit your configured limit.
The delay between pages is not optional. Hammering a server with rapid-fire requests gets your IP banned and puts unnecessary load on someone else's infrastructure. A delay of 1-2 seconds between requests is a reasonable starting point.
For more sophisticated rate limiting, build a token bucket that controls requests per second across your entire application.
The token bucket algorithm is the industry standard for rate limiting. Tokens accumulate over time up to a maximum. Each request consumes one token. When tokens are exhausted, requests wait until a token becomes available.
Responsible scraping starts with checking the site's robots.txt file to see which paths are allowed.
Build output formatters that write your scraped data to files.
JSON is ideal for programmatic consumption — pipe it into another script, load it into a database, or use it as an API fixture. CSV is ideal for analysis in spreadsheets or data tools like pandas.
Create the main entry point that orchestrates the full scraping pipeline.
Run the scraper with npx tsx src/index.ts. To adapt this to any website, you only need to change the CSS selectors in parser.ts and the configuration in index.ts. The fetching, rate limiting, pagination, and output infrastructure stays the same.
Remember: always check a site's terms of service before scraping. Use delays between requests. Identify your bot with a descriptive user agent. And never scrape personal data without consent. Responsible scraping is sustainable scraping.
# Create the project directory and initialize an npm package.
mkdir web-scraper && cd web-scraper
npm init -y
# Runtime dependencies: cheerio for HTML parsing, csv-stringify for CSV output.
npm install cheerio csv-stringify
# Dev dependencies: TypeScript compiler, Node type definitions, and the tsx runner.
npm install -D typescript @types/node tsx
# Generate a tsconfig.json targeting modern Node with native ES modules.
# NOTE(review): the trailing "// src/types.ts" is a file marker fused onto this line by formatting.
npx tsc --init --target ES2022 --module NodeNext --moduleResolution NodeNext// src/types.ts
/** One product record extracted from a listing page. */
export interface ScrapedItem {
/** Product name (text of the .product-title element). */
title: string;
/** href of the card's anchor; may be site-relative as scraped. */
url: string;
/** Display price exactly as shown on the page (unparsed string). */
price: string;
/** Rating read from the data-value attribute; "0" when absent. */
rating: string;
/** Short description text; empty string when the card has none. */
description: string;
}
/** Runtime configuration shared by the fetcher and the pagination loop. */
export interface ScrapeConfig {
/** First listing-page URL; pagination starts here. */
baseUrl: string;
/** Upper bound on the number of pages to crawl. */
maxPages: number;
/** Pause between page requests, in milliseconds — the politeness delay. */
delayMs: number;
/** User-Agent header value identifying the bot to the target server. */
userAgent: string;
}// src/fetcher.ts
import { ScrapeConfig } from "./types.js";
/** Resolve after `ms` milliseconds; used to pace retries and requests. */
function sleep(ms: number): Promise<void> {
  return new Promise<void>((done) => {
    setTimeout(() => done(), ms);
  });
}
/**
 * Fetch a page as HTML text with retries, backoff, and a per-request timeout.
 *
 * Retry policy:
 *  - HTTP 429 (rate limited): wait `delayMs * attempt * 2` and retry.
 *  - Other non-OK statuses and network errors: wait `delayMs * attempt` and retry.
 *  - After `retries` attempts, throw an Error describing the last failure.
 *
 * @param url     Absolute URL to fetch.
 * @param config  Scraper settings (user agent, base delay).
 * @param retries Maximum number of attempts (default 3).
 * @returns The response body as a string.
 * @throws Error when every attempt fails or the 10s timeout fires on the last attempt.
 */
export async function fetchPage(url: string, config: ScrapeConfig, retries = 3): Promise<string> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const response = await fetch(url, {
        headers: {
          "User-Agent": config.userAgent,
          Accept: "text/html",
        },
        // Abort any request that hangs longer than 10 seconds.
        signal: AbortSignal.timeout(10000),
      });
      if (response.status === 429) {
        // The server asked us to slow down: back off harder than for a normal failure.
        lastError = new Error(`HTTP 429: rate limited`);
        if (attempt < retries) {
          const backoff = config.delayMs * attempt * 2;
          console.warn(`Rate limited. Waiting ${backoff}ms before retry.`);
          await sleep(backoff);
        }
        continue;
      }
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}: ${response.statusText}`);
      }
      return await response.text();
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        // Linear backoff between ordinary failures.
        await sleep(config.delayMs * attempt);
      }
    }
  }
  // Fix: the original fell through here when the FINAL attempt got a 429 and
  // threw a misleading "Unreachable" error; report the real failure instead.
  throw new Error(`Failed to fetch ${url} after ${retries} attempts: ${lastError}`);
}// src/parser.ts
import * as cheerio from "cheerio";
import { ScrapedItem } from "./types.js";
/**
 * Parse a product listing page into structured items.
 *
 * Selectors target the demo store's markup (.product-card, .product-title,
 * .price, .rating, .description); adapt them after inspecting the target
 * site in browser DevTools.
 *
 * @param html Raw HTML of a listing page.
 * @returns Items that have at least a title and a URL; incomplete cards are skipped.
 */
export function parseListingPage(html: string): ScrapedItem[] {
  const $ = cheerio.load(html);
  const results: ScrapedItem[] = [];
  $(".product-card").each((_, el) => {
    const card = $(el);
    const title = card.find(".product-title").text().trim();
    const url = card.find("a").attr("href") ?? "";
    // Skip cards missing the essential fields.
    if (!title || !url) return;
    results.push({
      title,
      url,
      price: card.find(".price").text().trim(),
      rating: card.find(".rating").attr("data-value") ?? "0",
      description: card.find(".description").text().trim(),
    });
  });
  return results;
}
/**
 * Extract the "next page" link from a listing page.
 *
 * Uses the WHATWG URL parser to resolve relative hrefs against `baseUrl`,
 * which handles root-relative ("/page/2"), relative ("page/2"), and absolute
 * links correctly. (The original string concatenation broke root-relative
 * links whenever baseUrl contained a path component.)
 *
 * @param html    Raw HTML of the current page.
 * @param baseUrl URL the page was fetched from, used to resolve relative links.
 * @returns Absolute URL of the next page, or null when on the last page.
 */
export function getNextPageUrl(html: string, baseUrl: string): string | null {
  const $ = cheerio.load(html);
  const nextLink = $(".pagination .next a").attr("href");
  if (!nextLink) return null;
  try {
    // new URL(relative, base) performs standard browser-style resolution.
    return new URL(nextLink, baseUrl).toString();
  } catch {
    // Malformed href — treat as "no next page" rather than crashing the crawl.
    return null;
  }
}// src/scraper.ts
import { ScrapeConfig, ScrapedItem } from "./types.js";
import { fetchPage } from "./fetcher.js";
import { parseListingPage, getNextPageUrl } from "./parser.js";
/** Pause for `ms` milliseconds between page fetches. */
function sleep(ms: number): Promise<void> {
  return new Promise((wake) => {
    setTimeout(wake, ms);
  });
}
/**
 * Crawl listing pages starting at config.baseUrl, following "next" links.
 *
 * Stops when there is no next page or config.maxPages has been reached,
 * pausing config.delayMs between fetches to avoid hammering the server.
 *
 * @param config Scraper settings (start URL, page cap, delay, user agent).
 * @returns All items collected across every visited page.
 */
export async function scrapeAll(config: ScrapeConfig): Promise<ScrapedItem[]> {
  const collected: ScrapedItem[] = [];
  let nextUrl: string | null = config.baseUrl;
  for (let pageNum = 1; nextUrl !== null && pageNum <= config.maxPages; pageNum++) {
    console.log(`Scraping page ${pageNum}: ${nextUrl}`);
    const html = await fetchPage(nextUrl, config);
    const pageItems = parseListingPage(html);
    collected.push(...pageItems);
    console.log(`Found ${pageItems.length} items (${collected.length} total)`);
    nextUrl = getNextPageUrl(html, config.baseUrl);
    // Be polite: wait before the next request, but not after the final page.
    if (nextUrl !== null) {
      await sleep(config.delayMs);
    }
  }
  return collected;
}// src/rate-limiter.ts
/**
 * Token-bucket rate limiter: sustains `requestsPerSecond` requests per
 * second and allows bursts up to the same amount.
 *
 * Tokens refill continuously based on elapsed wall-clock time; acquire()
 * resolves once a token is available and consumes it.
 */
export class RateLimiter {
  private tokens: number;
  private lastRefill: number;
  private readonly maxTokens: number;
  private readonly refillRate: number;

  /** @param requestsPerSecond Sustained request rate (also the burst capacity). */
  constructor(requestsPerSecond: number) {
    this.maxTokens = requestsPerSecond;
    this.tokens = requestsPerSecond;
    this.refillRate = requestsPerSecond;
    this.lastRefill = Date.now();
  }

  /** Wait until a token is available, then consume it. */
  async acquire(): Promise<void> {
    this.refill();
    while (this.tokens < 1) {
      // Sleep roughly one token-interval, then re-check availability.
      const intervalMs = 1000 / this.refillRate;
      await new Promise((wake) => setTimeout(wake, intervalMs));
      this.refill();
    }
    this.tokens -= 1;
  }

  /** Credit tokens earned since the last refill, capped at maxTokens. */
  private refill(): void {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSeconds * this.refillRate);
    this.lastRefill = now;
  }
}// src/robots.ts
/**
 * Fetch and parse <baseUrl>/robots.txt, collecting the Disallow paths that
 * apply to `userAgent` (exact match, case-insensitive) or the wildcard "*".
 *
 * Fix: only the directive keywords are matched case-insensitively; the path
 * values keep their original case, because URL paths are case-sensitive.
 * (The original lowercased the whole line, corrupting paths like /Admin.)
 *
 * Best-effort: a network error or missing robots.txt yields an empty set.
 *
 * @param baseUrl   Site origin without a trailing slash, e.g. "https://example.com".
 * @param userAgent The scraper's user agent string.
 * @returns Set of disallowed path prefixes for this agent.
 */
export async function checkRobotsTxt(baseUrl: string, userAgent: string): Promise<Set<string>> {
  const disallowed = new Set<string>();
  try {
    const response = await fetch(`${baseUrl}/robots.txt`);
    if (!response.ok) return disallowed;
    const text = await response.text();
    let isRelevantAgent = false;
    for (const line of text.split("\n")) {
      const raw = line.trim();
      const lower = raw.toLowerCase();
      if (lower.startsWith("user-agent:")) {
        const agent = lower.slice("user-agent:".length).trim();
        isRelevantAgent = agent === "*" || agent === userAgent.toLowerCase();
      }
      if (isRelevantAgent && lower.startsWith("disallow:")) {
        // Take the path from the original-case line: paths are case-sensitive.
        const path = raw.slice("disallow:".length).trim();
        if (path) disallowed.add(path);
      }
    }
  } catch (error) {
    console.warn(`Could not fetch robots.txt: ${error}`);
  }
  return disallowed;
}// src/output.ts
import fs from "fs";
import { stringify } from "csv-stringify/sync";
import { ScrapedItem } from "./types.js";
/** Serialize items as pretty-printed JSON and write them to `filepath`. */
export function writeJson(items: ScrapedItem[], filepath: string): void {
  const payload = JSON.stringify(items, null, 2);
  fs.writeFileSync(filepath, payload);
  console.log(`Wrote ${items.length} items to ${filepath}`);
}
/**
 * Write items as CSV (header row plus one row per item) to `filepath`.
 * Column order follows the key order of the first item; an empty input
 * produces a file with no columns.
 */
export function writeCsv(items: ScrapedItem[], filepath: string): void {
  const columns = Object.keys(items[0] ?? {}) as (keyof ScrapedItem)[];
  const table = [columns, ...items.map((item) => columns.map((col) => item[col]))];
  fs.writeFileSync(filepath, stringify(table));
  console.log(`Wrote ${items.length} items to ${filepath}`);
}// src/index.ts
import { ScrapeConfig } from "./types.js";
import { scrapeAll } from "./scraper.js";
import { writeJson, writeCsv } from "./output.js";
import { checkRobotsTxt } from "./robots.js";
async function main(): Promise<void> {
const config: ScrapeConfig = {
baseUrl: "https://example-store.com/products",
maxPages: 10,
delayMs: 1500,
userAgent: "MyScraper/1.0 (contact@example.com)",
};
const disallowed = await checkRobotsTxt(config.baseUrl, config.userAgent);
if (disallowed.size > 0) {
console.log("Disallowed paths:", [...disallowed]);
}
const items = await scrapeAll(config);
if (items.length === 0) {
console.log("No items scraped. Check your selectors.");
return;
}
writeJson(items, "output/data.json");
writeCsv(items, "output/data.csv");
console.log(`Scraping complete. ${items.length} items collected.`);
}
main().catch(console.error);