How to prepare website content for RAG
Retrieval-augmented generation (RAG) systems perform better when the source layer is clean, scoped, and reviewable. Start with the pages that should be trusted, collect them with clear crawler limits, export clean source material, and only then move into chunking, embedding, and retrieval testing.
Step 1
Define the source set first
Before running a crawler, decide which pages should become trusted source material. Product docs, help-center pages, developer docs, policy pages, and public knowledge-base content are usually stronger candidates than broad marketing archives.
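One way to make the source set explicit is to write it down as a URL filter before any crawl runs. The sketch below is only an illustration: the host example.com, the path prefixes, and the function name in_source_set are all assumptions, not part of any particular crawler.

```python
from urllib.parse import urlparse

# Hypothetical scope: replace the host and prefixes with your own source set.
ALLOWED_HOST = "example.com"
ALLOWED_PREFIXES = ("/docs/", "/help/", "/policies/")
EXCLUDED_PREFIXES = ("/blog/", "/press/")


def in_source_set(url: str) -> bool:
    """Return True if the URL belongs to the trusted source set."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    if parts.hostname != ALLOWED_HOST:
        return False
    path = parts.path or "/"
    # Exclusions win over inclusions so broad archives stay out.
    if path.startswith(EXCLUDED_PREFIXES):
        return False
    return path.startswith(ALLOWED_PREFIXES)


candidates = [
    "https://example.com/docs/install",
    "https://example.com/blog/2021-launch",
    "https://partner.example.org/docs/install",
]
trusted = [url for url in candidates if in_source_set(url)]
print(trusted)
```

Keeping the filter in version control gives reviewers a single place to see what counts as trusted source material.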
Step 2
Run a bounded crawl
Set explicit limits on page count, crawl depth, and whether the crawler may follow links off the starting site, and choose the export format before the job runs. Bounded jobs are easier to review, less expensive to operate, and safer to repeat when content changes.
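The limits described above can be sketched as a breadth-first crawl that stops at a page budget and a depth budget. This is a minimal illustration, not any specific crawler's API: CrawlLimits, bounded_crawl, and the fetch_links callable are hypothetical names, and the in-memory SITE dictionary stands in for real HTTP fetching.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterable
from urllib.parse import urlparse


@dataclass(frozen=True)
class CrawlLimits:
    max_pages: int = 50
    max_depth: int = 2
    same_site_only: bool = True


def bounded_crawl(
    start_url: str,
    fetch_links: Callable[[str], Iterable[str]],
    limits: CrawlLimits,
) -> list[str]:
    """Visit pages breadth-first, stopping at the configured limits."""
    start_host = urlparse(start_url).hostname
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited: list[str] = []
    while queue and len(visited) < limits.max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= limits.max_depth:
            continue  # do not follow links beyond the depth budget
        for link in fetch_links(url):
            if link in seen:
                continue
            if limits.same_site_only and urlparse(link).hostname != start_host:
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# Stand-in link graph (assumption: a real job would fetch and parse pages).
SITE = {
    "https://example.com/docs/": [
        "https://example.com/docs/a",
        "https://example.com/docs/b",
        "https://other.example.org/x",
    ],
    "https://example.com/docs/a": ["https://example.com/docs/a/deep"],
    "https://example.com/docs/a/deep": ["https://example.com/docs/a/deeper"],
}

pages = bounded_crawl(
    "https://example.com/docs/",
    lambda url: SITE.get(url, []),
    CrawlLimits(max_pages=3, max_depth=2),
)
print(pages)
```

Because the limits live in one frozen object, the same job can be rerun later with identical bounds and the two runs compared.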
Step 3
Clean and review before indexing
Crawler output should be readable before it becomes retrieval input. Review Markdown, JSON, or CSV exports for navigation clutter, stale copy, duplicate sections, and missing source context.
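Part of this review can be automated before a person reads the export. The sketch below assumes a JSON-style export where each record has "url" and "text" fields (an assumed shape, not a documented format). It flags text blocks that repeat across pages, which often indicates navigation clutter or duplicate sections, and records that lack a source URL. The function names are hypothetical.

```python
import hashlib


def normalize(block: str) -> str:
    """Lowercase and collapse whitespace so near-identical blocks match."""
    return " ".join(block.lower().split())


def review_export(pages: list[dict], repeat_threshold: int = 2) -> dict:
    """Report blocks repeated across sources and records with no URL."""
    seen: dict[str, dict] = {}
    missing_source: list[int] = []
    for index, page in enumerate(pages):
        url = page.get("url") or ""
        if not url:
            missing_source.append(index)
        source = url or f"<record {index}>"
        # Blocks are separated by blank lines, as in a Markdown export.
        for block in page.get("text", "").split("\n\n"):
            norm = normalize(block)
            if not norm:
                continue
            digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
            entry = seen.setdefault(digest, {"text": norm, "sources": set()})
            entry["sources"].add(source)
    repeated = [
        {"text": entry["text"], "sources": sorted(entry["sources"])}
        for entry in seen.values()
        if len(entry["sources"]) >= repeat_threshold
    ]
    return {"repeated_blocks": repeated, "missing_source": missing_source}


# Illustrative records only; not real crawl output.
export = [
    {"url": "https://example.com/docs/a",
     "text": "Home | Docs | Pricing\n\nInstall the CLI with the setup script."},
    {"url": "https://example.com/docs/b",
     "text": "Home | Docs | Pricing\n\nRotate API keys every 90 days."},
    {"url": "", "text": "Orphaned paragraph with no source."},
]
report = review_export(export)
print(report)
```

A report like this does not replace reading the export, and it will not catch stale copy, but it narrows where a reviewer should look first.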
Step 4
Prepare for chunks and retrieval
Once the source text is clean, chunking and embedding choices are easier to evaluate, because retrieval failures are less likely to be caused by noisy input. Keep the crawler job ID, source URL, timestamp, and content boundaries attached to every chunk and embedding derived from a page.
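A minimal sketch of carrying that evidence downstream: each chunk records the job ID, source URL, fetch timestamp, and the character offsets it covers in the cleaned page text. The Chunk fields, the chunk_page function, the paragraph-packing strategy, and the sample values are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    job_id: str
    source_url: str
    fetched_at: str
    start: int  # character offset into the cleaned page text
    end: int
    text: str


def chunk_page(text: str, *, job_id: str, source_url: str,
               fetched_at: str, max_chars: int = 500) -> list[Chunk]:
    """Pack consecutive paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole.
    """
    def make(start: int, end: int) -> Chunk:
        return Chunk(job_id, source_url, fetched_at, start, end, text[start:end])

    chunks: list[Chunk] = []
    cur_start = cur_end = None
    # A paragraph is a run of non-empty lines; blank lines separate them.
    for match in re.finditer(r"[^\n]+(?:\n[^\n]+)*", text):
        p_start, p_end = match.span()
        if cur_start is None:
            cur_start, cur_end = p_start, p_end
        elif p_end - cur_start <= max_chars:
            cur_end = p_end
        else:
            chunks.append(make(cur_start, cur_end))
            cur_start, cur_end = p_start, p_end
    if cur_start is not None:
        chunks.append(make(cur_start, cur_end))
    return chunks


page_text = (
    "Install the CLI.\n\n"
    "Run the setup script.\n\n"
    "Rotate API keys every 90 days."
)
chunks = chunk_page(
    page_text,
    job_id="job-0001",                         # placeholder job ID
    source_url="https://example.com/docs/a",   # placeholder URL
    fetched_at="2024-01-01T00:00:00Z",         # placeholder timestamp
    max_chars=40,
)
for chunk in chunks:
    print(chunk.start, chunk.end, repr(chunk.text))
```

Storing offsets rather than only the text means a retrieved chunk can be mapped back to the exact span of the exported page it came from.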
FAQ
Quick answers
Should website content be crawled before chunking?
Yes. Crawling and export review should happen before chunking so navigation clutter, duplicates, and irrelevant pages are removed before they reach retrieval.
What format should I export for RAG preparation?
Markdown is often easiest to review, while JSON and CSV are useful when teams need structure, automation, or downstream processing.
Why keep crawl evidence?
Evidence makes it easier to trace retrieval issues back to the source URL, crawler run, timestamp, and exported text that entered the pipeline.