
How to prepare website content for RAG

RAG systems perform better when the source layer is clean, scoped, and reviewable. Start with the pages that should be trusted, collect them with clear crawler limits, export clean source material, and only then move into chunking, embedding, and retrieval testing.

Step 1

Define the source set first

Before running a crawler, decide which pages should become trusted source material. Product docs, help-center pages, developer docs, policy pages, and public knowledge-base content are usually stronger candidates than broad marketing archives.

Choose trusted sections
Avoid duplicate archives
Exclude irrelevant pages
Document the target scope
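
One way to document the target scope is to write it down as data and check URLs against it. The sketch below is a minimal illustration, not a required format: the host `example.com`, the path prefixes, and the `in_scope` helper are all hypothetical placeholders.

```python
from urllib.parse import urlparse

# Hypothetical scope definition: the host and prefixes are placeholders,
# not values taken from any real site or crawler configuration.
SCOPE = {
    "host": "example.com",
    "include_prefixes": ["/docs/", "/help/", "/policies/"],
    "exclude_prefixes": ["/docs/archive/", "/help/tag/"],
}


def in_scope(url: str, scope: dict = SCOPE) -> bool:
    """Return True if the URL belongs to the trusted source set."""
    parts = urlparse(url)
    if parts.hostname != scope["host"]:
        return False
    path = parts.path or "/"
    # Exclusions win over inclusions, so archives under a trusted
    # section are still dropped.
    if any(path.startswith(p) for p in scope["exclude_prefixes"]):
        return False
    return any(path.startswith(p) for p in scope["include_prefixes"])


if __name__ == "__main__":
    for candidate in [
        "https://example.com/docs/install",
        "https://example.com/docs/archive/2019",
        "https://example.com/blog/launch",
    ]:
        print(candidate, in_scope(candidate))
```

Keeping the scope in one reviewable structure means the same definition can be checked into version control and reused when the crawl is repeated.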

Step 2

Run a bounded crawl

Use limits for page count, same-site behavior, depth, and export format. Bounded jobs are easier to review, less expensive to operate, and safer to repeat when content changes.

Estimate usage
Set page limits
Track job status
Save crawl evidence
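
The limits described above can be sketched as a breadth-first traversal that stops at a page count and a depth, and that ignores links to other hosts. This is an illustration of the idea rather than any particular crawler's implementation: `bounded_crawl`, its parameters, and the caller-supplied `fetch_links` function are assumptions.

```python
from collections import deque
from urllib.parse import urlparse


def bounded_crawl(start_url, fetch_links, max_pages=50, max_depth=2):
    """Breadth-first crawl that stops at max_pages and max_depth.

    fetch_links is a caller-supplied function (hypothetical here) that
    returns the absolute URLs linked from a page.
    """
    site = urlparse(start_url).hostname
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []  # crawl evidence: (url, depth) in visit order

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in fetch_links(url):
            same_site = urlparse(link).hostname == site
            if same_site and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    # A tiny in-memory link graph stands in for real HTTP fetching.
    graph = {
        "https://example.com/": ["https://example.com/a", "https://other.org/x"],
        "https://example.com/a": ["https://example.com/b"],
    }
    for url, depth in bounded_crawl("https://example.com/", lambda u: graph.get(u, [])):
        print(depth, url)
```

Passing the link fetcher in as a function keeps the limits testable without network access, and the returned visit list can be saved alongside the export as part of the crawl evidence.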

Step 3

Clean and review before indexing

Crawler output should be readable before it becomes retrieval input. Review Markdown, JSON, or CSV exports for navigation clutter, stale copy, duplicate sections, and missing source context.

Review Markdown
Check metadata
Remove clutter
Preserve source URLs
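
Part of this review can be automated before a person reads the export. The following sketch assumes each exported page is a dictionary with `url` and `markdown` keys and that a short list of known clutter phrases exists; both the record shape and the phrases are illustrative assumptions, and the `review_record` helper is hypothetical.

```python
def review_record(record, clutter_phrases=("Skip to content", "Cookie settings")):
    """Return (cleaned_text, issues) for one exported page.

    The record shape ({"url": ..., "markdown": ...}) and the clutter
    phrases are assumptions made for this example.
    """
    issues = []
    if not record.get("url"):
        issues.append("missing source URL")

    seen = set()
    kept = []
    for block in record.get("markdown", "").split("\n\n"):
        text = block.strip()
        if not text:
            continue
        if text in clutter_phrases:
            continue  # drop known navigation residue
        if text in seen:
            issues.append("duplicate block removed")
            continue
        seen.add(text)
        kept.append(text)
    return "\n\n".join(kept), issues


if __name__ == "__main__":
    page = {
        "url": "https://example.com/docs/install",
        "markdown": "Skip to content\n\n# Install\n\nRun the installer.\n\nRun the installer.",
    }
    cleaned, issues = review_record(page)
    print(cleaned)
    print(issues)
```

Exact-match filtering like this only catches the obvious cases. Stale copy and subtle duplication still need a human pass over the readable export.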

Step 4

Prepare for chunks and retrieval

Once the source text is clean, differences in retrieval quality are more likely to reflect chunking and embedding choices than leftover clutter. Keep crawler job IDs, source URLs, timestamps, and content boundaries attached to each downstream unit, such as a chunk or an embedding record.

Chunk with context
Attach source URLs
Track timestamps
Test retrieval quality
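
As a rough sketch of keeping provenance attached, the function below groups paragraphs into chunks and copies the source URL, job ID, and crawl timestamp onto each one. The paragraph-based splitting, the `max_chars` budget, and the field names are assumptions chosen for illustration, not a recommended chunking strategy.

```python
def chunk_page(text, source_url, job_id, crawled_at, max_chars=800):
    """Group paragraphs into chunks that aim to stay under max_chars.

    A single paragraph longer than max_chars is kept whole rather than
    split mid-sentence, so a chunk can exceed the budget in that case.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)

    # Every chunk carries the same provenance fields as its source page.
    return [
        {
            "chunk_index": i,
            "text": chunk,
            "source_url": source_url,
            "job_id": job_id,
            "crawled_at": crawled_at,
        }
        for i, chunk in enumerate(chunks)
    ]


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for c in chunk_page(sample, "https://example.com/docs/install",
                        "job-123", "2024-01-01T00:00:00Z", max_chars=40):
        print(c["chunk_index"], repr(c["text"]), c["source_url"])
```

Splitting on paragraph boundaries keeps content boundaries intact, and the per-chunk metadata is what later allows a poor retrieval result to be traced back to a specific page and crawl run.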

FAQ

Quick answers

Should website content be crawled before chunking?

Yes. Crawling and export review should happen before chunking so navigation clutter, duplicates, and irrelevant pages are removed before they reach retrieval.

What format should I export for RAG preparation?

Markdown is often easiest to review, while JSON and CSV are useful when teams need structure, automation, or downstream processing.
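
For teams that want the structured formats, the same page records can be serialized both ways with the standard library. This is only a sketch: the `url`, `title`, and `markdown` fields and the `export_records` helper are assumed for the example and do not describe any specific crawler's export schema.

```python
import csv
import io
import json

# Assumed record fields for this example.
FIELDS = ["url", "title", "markdown"]


def export_records(records):
    """Serialize page records as a JSON array and as CSV text."""
    json_text = json.dumps(records, indent=2, ensure_ascii=False)

    buf = io.StringIO()
    # DictWriter raises ValueError if a record has keys outside FIELDS,
    # which surfaces unexpected columns early.
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return json_text, buf.getvalue()


if __name__ == "__main__":
    pages = [{"url": "https://example.com/docs/a", "title": "A", "markdown": "# A\n\nBody"}]
    json_text, csv_text = export_records(pages)
    print(json_text)
    print(csv_text)
```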

Why keep crawl evidence?

Evidence makes it easier to trace retrieval issues back to the source URL, crawler run, timestamp, and exported text that entered the pipeline.