Scrape Website

Extract all visible text from a website. Enter a website URL and optionally a CSS selector to target a specific section; the tool scrapes the main page or, if chosen, all linked pages, returning the collected text content as output.

Web Scraping Workflow

Extract text content from any website using headless Chrome automation with flexible CSS selector targeting.

Overview

This workflow provides a powerful web scraping solution that can extract content from single pages or crawl multiple pages automatically. Using BuildShip's integrated Puppeteer service, it runs headless Chrome automation to scrape websites reliably, handling JavaScript-rendered content and complex page structures. Perfect for content aggregation, competitive research, data collection, and automated monitoring tasks.

How it works

  1. Input Processing - Accepts a website URL and validates/formats it (automatically adding https:// if missing), along with an optional CSS selector and crawling-mode parameters
  2. Web Scraping Execution - Uses BuildShip's Puppeteer service at https://puppeteer.buildship.run/v1/text for single pages or https://puppeteer.buildship.run/v1/crawl for multi-page crawling with headless Chrome automation (see the sketch after this list)
  3. Content Extraction - Applies CSS selectors to target specific page elements and extract clean text content from the rendered HTML
  4. Data Processing - Returns structured JSON containing the scraped content, original URLs, metadata (title, description), and crawl information
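
To make steps 1 and 2 concrete, here is a minimal TypeScript sketch. The endpoint URL comes from step 2, but the request body fields (url, selector) and the response shape are assumptions about the service's schema rather than documented facts:

```typescript
// A minimal sketch of steps 1-2. The JSON body fields `url` and `selector`
// and the JSON response shape are assumptions about the service's schema.
async function scrapeText(websiteUrl: string, selector = "body"): Promise<unknown> {
  // Step 1: Input Processing - add https:// if the scheme is missing
  const url = /^https?:\/\//i.test(websiteUrl)
    ? websiteUrl
    : `https://${websiteUrl}`;

  // Step 2: Web Scraping Execution - single-page text endpoint
  // (swap in .../v1/crawl for multi-page crawling)
  const res = await fetch("https://puppeteer.buildship.run/v1/text", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, selector }),
  });
  if (!res.ok) throw new Error(`Scrape failed with status ${res.status}`);

  // Steps 3-4: the service returns the extracted text plus metadata
  return res.json();
}

// Usage: scrapeText("news.ycombinator.com", ".titleline").then(console.log);
```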

Prerequisites and Integrations Used

  • BuildShip Puppeteer Service - Defaults to keyless integration using BuildShip Credits, but supports BYOK (Bring Your Own Key) for custom API keys
  • No external accounts required - Works out of the box with BuildShip's integrated headless browser service
  • Modern web browser support - Handles JavaScript-rendered content and complex page structures through Chrome automation

How to customize

  1. Enter the website URL you want to scrape in the websiteUrl workflow input parameter
  2. Customize the CSS selector to target specific content by modifying the selector parameter (defaults to body for full page content):
    • Use #main to target an element with ID "main"
    • Use .article to target elements with class "article"
    • Use h1, h2, h3 to target all heading elements
  3. Choose scraping mode by setting the allPages parameter (a complete example input set follows this list):
    • false (default): Single page scraping
    • true: Multi-page crawling with automatic link following
  4. Configure rate limiting for multi-page crawling:
    • Set maxRequestsPerCrawl (default: 50) to limit total requests
    • Set maxConcurrency (default: 20) to control parallel requests
  5. Add your own API key (optional) - Click the key icon on the scraper node to use your own Puppeteer service instead of BuildShip Credits
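
For reference, here is an illustrative input set for a multi-page crawl, written as a TypeScript object. The parameter names come from the steps above; the values are examples, not requirements:

```typescript
// Example inputs for a multi-page crawl. Parameter names are taken from the
// customization steps above; the values are illustrative only.
const workflowInputs = {
  websiteUrl: "https://news.ycombinator.com/",
  selector: ".titleline",  // target specific elements; defaults to "body"
  allPages: true,          // true => crawl linked pages via /v1/crawl
  maxRequestsPerCrawl: 50, // cap on total requests (default: 50)
  maxConcurrency: 20,      // cap on parallel requests (default: 20)
};
```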

Who should use this template

  • Web Developers building applications that need automated content extraction without complex Puppeteer setup
  • Data Analysts collecting structured data for competitive research, market monitoring, or content analysis
  • Content Managers automating content aggregation from multiple sources or monitoring competitor websites
  • SEO Specialists extracting meta tags, content structure, and competitive intelligence data
  • Marketing Teams monitoring competitor pricing, product descriptions, or promotional content automatically
  • Research Teams gathering web data for academic research, market analysis, or trend monitoring
  • Business Intelligence Teams integrating web data into larger data pipelines and reporting systems

Next possible steps

If you want to extend this workflow's functionality, consider adding:

  • Data storage integration - Save scraped content to Airtable, Google Sheets, or databases for persistent storage
  • Content processing - Add AI-powered content analysis or summarization using OpenAI's GPT models
  • Scheduling automation - Set up recurring scraping jobs using BuildShip's scheduled triggers for regular monitoring
  • Email notifications - Send alerts when specific content changes are detected using email integrations
  • Data transformation - Process scraped content into different formats (CSV, JSON, XML) for various downstream systems
  • Multi-site orchestration - Chain multiple scraping operations together for comprehensive data collection workflows
  • Content filtering - Add logic to filter and validate scraped content based on specific criteria or keywords
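
As a starting point for the content-filtering idea above, the sketch below keeps only scraped records whose text contains at least one keyword. The ScrapedPage shape is hypothetical; adapt it to whatever the scraper node actually returns:

```typescript
// Hypothetical record shape for one scraped page; adjust to the real output.
interface ScrapedPage {
  url: string;
  text: string;
}

// Keep only pages whose text mentions at least one keyword (case-insensitive).
function filterByKeywords(pages: ScrapedPage[], keywords: string[]): ScrapedPage[] {
  const needles = keywords.map((k) => k.toLowerCase());
  return pages.filter((page) => {
    const haystack = page.text.toLowerCase();
    return needles.some((needle) => haystack.includes(needle));
  });
}
```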