Guides

Source-data guides

Practical guides for crawler decisions, website-to-RAG preparation, document processing, and source-data workflows.

Start here
Guide library
RAG preparation guide

How to prepare website content for RAG

A practical guide to preparing website content for RAG workflows with scoped crawling, clean exports, source review, chunking, and retrieval-ready structure.

  • Use explicit URL targets and same-site limits before crawling.
  • Export source material in reviewable formats before embedding.
  • Keep source evidence so bad answers can be traced back to bad inputs.
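The first point above, explicit targets with same-site limits, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SourceOfTruth.io's implementation: the function name `in_scope`, the `max_depth` parameter, and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def in_scope(candidate_url: str, start_url: str, depth: int, max_depth: int = 2) -> bool:
    """Return True if candidate_url is an http(s) URL on the same host as
    start_url and the link depth does not exceed max_depth.

    Hypothetical helper for illustration only.
    """
    start = urlparse(start_url)
    candidate = urlparse(candidate_url)

    # Skip mailto:, javascript:, and other non-web schemes.
    if candidate.scheme not in ("http", "https"):
        return False

    # urlparse lowercases .hostname, so the comparison is case-insensitive.
    same_host = candidate.hostname == start.hostname
    return same_host and depth <= max_depth
```

A crawler loop would call a check like this on every discovered link before queueing it, so that the job stays bounded to the site and depth chosen up front.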
Crawler vs ETL guide

Web crawler vs. ETL pipeline: what is the difference?

Understand how web crawlers and ETL pipelines differ, when to use each one, and why SourceOfTruth.io keeps crawler collection separate from broader AI data preparation.

  • Use a crawler when the source is a website or documentation set.
  • Use ETL when data needs repeatable extraction, transformation, and loading across systems.
  • Crawler jobs need limits, estimates, and export evidence.
Document processing guide

AI document processing: from files to usable source text

A source-data guide to AI document processing, including extraction, metadata, cleaning, chunking, evidence review, and metered downstream processing.

  • Extract text before applying AI analysis.
  • Keep metadata and source evidence attached to processed output.
  • Review extraction quality before chunking and embedding.
RAG data pipeline guide

What belongs in a RAG data pipeline?

A practical overview of RAG data pipeline stages: source collection, cleaning, chunking, embeddings, indexing, retrieval testing, and evidence review.

  • The source layer determines the quality ceiling for RAG.
  • Chunking and embeddings should happen after source cleanup.
  • Retrieval tests need source evidence, not just answer demos.
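The ordering in these points (clean first, chunk second, keep evidence attached) could look roughly like the sketch below. It is an assumption-laden illustration: the `Chunk` type, the `clean_text` and `chunk_words` helpers, and the word-count chunking rule are hypothetical choices for this example, not a description of any particular product's pipeline.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source_url: str  # evidence: where this text came from
    index: int       # position of the chunk within the source
    text: str


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


def chunk_words(source_url: str, raw: str, max_words: int = 50) -> list[Chunk]:
    """Clean the raw text first, then split it into chunks of at most
    max_words words, each carrying its source URL."""
    words = clean_text(raw).split()
    return [
        Chunk(source_url, i // max_words, " ".join(words[i:i + max_words]))
        for i in range(0, len(words), max_words)
    ]
```

Because every chunk carries its `source_url`, a retrieval test that returns a poor chunk can be traced back to the page it came from.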
Crawler comparison guide

Looking for a Firecrawl alternative for source-data workflows?

A practical guide for teams evaluating crawler tools for AI source-data workflows, clean exports, metered usage, and RAG preparation.

  • Evaluate crawler tools by workflow fit, not just raw scrape ability.
  • Look for clean exports, limits, estimates, and job history.
  • Keep crawler pricing separate from advanced AI processing costs.
Definitions

Crawler-first launch

The active product surface is Search + Web Crawler, with pricing and credits focused on bounded web collection.

RAG preparation

Source material should be reviewed, cleaned, and structured before it becomes chunks, embeddings, and retrieval context.

Document processing

File extraction and normalization are a future pipeline surface; page crawling and web exports stay separate.

ETL/ELT roadmap

ETL/ELT stays marked as coming soon until connector, retry, observability, governance, and pricing expectations are production-ready.

The crawler is the live revenue surface. Guides can explain future pipeline direction, but public checkout and active pricing should remain crawler-first.