Scrape Website

Extract all visible text from a website. Enter a website URL and optionally a CSS selector to target a specific section; the tool scrapes the main page or, if chosen, all linked pages, returning the collected text content as output.

Web Scraping Workflow

Extract text content from any website using headless Chrome automation with flexible CSS selector targeting.

Overview

This workflow provides a powerful web scraping solution that can extract content from single pages or crawl multiple pages automatically. Using BuildShip's integrated Puppeteer service, it runs headless Chrome automation to scrape websites reliably, handling JavaScript-rendered content and complex page structures. Perfect for content aggregation, competitive research, data collection, and automated monitoring tasks.

How it works

  1. Input Processing - Accepts a website URL and validates/formats it (automatically adding https:// if missing), along with an optional CSS selector and crawling-mode parameters
  2. Web Scraping Execution - Uses BuildShip's Puppeteer service at https://puppeteer.buildship.run/v1/text for single pages or https://puppeteer.buildship.run/v1/crawl for multi-page crawling with headless Chrome automation (see the sketch after this list)
  3. Content Extraction - Applies CSS selectors to target specific page elements and extract clean text content from the rendered HTML
  4. Data Processing - Returns structured JSON containing the scraped content, original URLs, metadata (title, description), and crawl information
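
To make steps 1 and 2 concrete, here is a minimal TypeScript sketch. The endpoint URL comes from step 2, but the request body fields (url, selector) and the response shape are assumptions about the service's schema rather than documented facts:

```typescript
// A minimal sketch of steps 1-2. The JSON body fields `url` and `selector`
// and the JSON response shape are assumptions about the service's schema.
async function scrapeText(websiteUrl: string, selector = "body"): Promise<unknown> {
  // Step 1: Input Processing - add https:// if the scheme is missing
  const url = /^https?:\/\//i.test(websiteUrl)
    ? websiteUrl
    : `https://${websiteUrl}`;

  // Step 2: Web Scraping Execution - single-page text endpoint
  // (swap in .../v1/crawl for multi-page crawling)
  const res = await fetch("https://puppeteer.buildship.run/v1/text", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, selector }),
  });
  if (!res.ok) throw new Error(`Scrape failed with status ${res.status}`);

  // Steps 3-4: the service returns the extracted text plus metadata
  return res.json();
}

// Usage: scrapeText("news.ycombinator.com", ".titleline").then(console.log);
```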

Prerequisites and Integrations Used

  • BuildShip Puppeteer Service - Defaults to keyless integration using BuildShip Credits, but supports BYOK (Bring Your Own Key) for custom API keys
  • No external accounts required - Works out of the box with BuildShip's integrated headless browser service
  • Modern web browser support - Handles JavaScript-rendered content and complex page structures through Chrome automation

How to customize

  1. Enter the website URL you want to scrape in the websiteUrl workflow input parameter
  2. Customize the CSS selector to target specific content by modifying the selector parameter (defaults to body for full page content):
    • Use #main to target an element with ID "main"
    • Use .article to target elements with class "article"
    • Use h1, h2, h3 to target all heading elements
  3. Choose scraping mode by setting the allPages parameter (a complete example input set follows this list):
    • false (default): Single page scraping
    • true: Multi-page crawling with automatic link following
  4. Configure rate limiting for multi-page crawling:
    • Set maxRequestsPerCrawl (default: 50) to limit total requests
    • Set maxConcurrency (default: 20) to control parallel requests
  5. Add your own API key (optional) - Click the key icon on the scraper node to use your own Puppeteer service instead of BuildShip Credits
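
For reference, here is an illustrative input set for a multi-page crawl, written as a TypeScript object. The parameter names come from the steps above; the values are examples, not requirements:

```typescript
// Example inputs for a multi-page crawl. Parameter names are taken from the
// customization steps above; the values are illustrative only.
const workflowInputs = {
  websiteUrl: "https://news.ycombinator.com/",
  selector: ".titleline",  // target specific elements; defaults to "body"
  allPages: true,          // true => crawl linked pages via /v1/crawl
  maxRequestsPerCrawl: 50, // cap on total requests (default: 50)
  maxConcurrency: 20,      // cap on parallel requests (default: 20)
};
```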

Who should use this template

  • Web Developers building applications that need automated content extraction without complex Puppeteer setup
  • Data Analysts collecting structured data for competitive research, market monitoring, or content analysis
  • Content Managers automating content aggregation from multiple sources or monitoring competitor websites
  • SEO Specialists extracting meta tags, content structure, and competitive intelligence data
  • Marketing Teams monitoring competitor pricing, product descriptions, or promotional content automatically
  • Research Teams gathering web data for academic research, market analysis, or trend monitoring
  • Business Intelligence Teams integrating web data into larger data pipelines and reporting systems

Next possible steps

If you want to extend this workflow's functionality, consider adding:

  • Data storage integration - Save scraped content to Airtable, Google Sheets, or databases for persistent storage
  • Content processing - Add AI-powered content analysis or summarization using OpenAI's GPT models
  • Scheduling automation - Set up recurring scraping jobs using BuildShip's scheduled triggers for regular monitoring
  • Email notifications - Send alerts when specific content changes are detected using email integrations
  • Data transformation - Process scraped content into different formats (CSV, JSON, XML) for various downstream systems
  • Multi-site orchestration - Chain multiple scraping operations together for comprehensive data collection workflows
  • Content filtering - Add logic to filter and validate scraped content based on specific criteria or keywords
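
As a starting point for the content-filtering idea above, the sketch below keeps only scraped records whose text contains at least one keyword. The ScrapedPage shape is hypothetical; adapt it to whatever the scraper node actually returns:

```typescript
// Hypothetical record shape for one scraped page; adjust to the real output.
interface ScrapedPage {
  url: string;
  text: string;
}

// Keep only pages whose text mentions at least one keyword (case-insensitive).
function filterByKeywords(pages: ScrapedPage[], keywords: string[]): ScrapedPage[] {
  const needles = keywords.map((k) => k.toLowerCase());
  return pages.filter((page) => {
    const haystack = page.text.toLowerCase();
    return needles.some((needle) => haystack.includes(needle));
  });
}
```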