Document Management

Documents are the foundation of your knowledge bases. This guide covers how to add, organize, and maintain your document collections.

Uploading Files

Drag and Drop

The simplest way to add files:

Open a knowledge base
Drag files from your computer onto the page
Drop them in the upload zone
Wait for processing

File Picker

Alternatively:

Click Upload Files
Select files from your computer
Click Open

Bulk Upload

For many files:

Drag a folder (supported in Chrome and Edge)
Select multiple files in the picker
Files are processed in parallel

Supported File Types

Category	Formats	Notes
Documents	PDF, DOCX, DOC, TXT, RTF	Text is extracted automatically
Presentations	PPTX, PPT	Slide content and notes
Spreadsheets	XLSX, XLS, CSV	Cell values and headers
Web	HTML, Markdown	Rendered content
Code	Most languages	Syntax-aware chunking

Maximum file size depends on your organization’s configuration. Typical limits are 50–100 MB per file.
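A quick local pre-check can catch the two most common upload failures before any file leaves your machine. The sketch below is illustrative only: the extension list is inferred from the table above (`.md` and `.html` are assumed extensions for Markdown and HTML, and code files are left out), the 50 MB default is just the lower end of the typical range, and `precheck` is a hypothetical helper, not part of the product.

```python
from pathlib import Path

# Assumption: extensions inferred from the Supported File Types table.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".txt", ".rtf",   # documents
    ".pptx", ".ppt",                           # presentations
    ".xlsx", ".xls", ".csv",                   # spreadsheets
    ".html", ".md",                            # web
}

def precheck(filename, size_bytes, max_mb=50):
    """Return a list of problems found; an empty list means the file looks OK."""
    problems = []
    if Path(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported format")
    if size_bytes > max_mb * 1024 * 1024:
        problems.append("too large")
    return problems
```

Ask your administrator for the real limit and pass it as `max_mb`.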

Images and Scans

For documents that are scanned images or contain images with text:

OCR is applied automatically to extract text
Quality depends on image clarity
Consider re-scanning poor-quality documents

Adding Web Content

Single URLs

To add an individual web page:

Click Add URL
Enter the full URL (including https://)
Click Add

The page is fetched immediately and its content indexed.

Web Crawling

For multiple pages from a website:

Click Add Web Source
Enter the starting URL
Configure basic settings such as path filters, blacklisted patterns, sitemap mode, and XPath filtering
Click Start Crawling

The crawler discovers pages by following links and indexes their content. For the detailed workflow and current crawl settings, see Crawl a Website.

Advanced Crawler Settings

For more control, expand Hostname settings to configure:

Path filters - Only crawl specific sections (e.g., /docs/)
Blacklisted patterns - Skip low-value or duplicate URL patterns
Robots.txt - Respect the site’s crawling rules (enabled by default)
Sitemap mode - Only follow links from the sitemap
XPath filter - Extract only the useful page content
HTTP headers - Send custom headers to controlled internal sites

These settings can be configured per hostname if your source spans multiple domains.
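To see how path filters and blacklisted patterns interact, it can help to think of them as a single yes/no check applied to every discovered URL. The following is a minimal sketch under stated assumptions: path filters are treated as path prefixes, blacklisted patterns as regular expressions matched against the full URL, and `should_crawl` is a hypothetical function. The actual crawler may evaluate these settings differently.

```python
import re
from urllib.parse import urlparse

def should_crawl(url, allowed_hosts, path_prefixes, blacklist_patterns):
    """Return True if the URL passes the hostname, path, and blacklist checks."""
    parsed = urlparse(url)
    if parsed.hostname not in allowed_hosts:
        return False
    path = parsed.path or "/"
    # Assumption: an empty path filter list means "crawl every path".
    if path_prefixes and not any(path.startswith(p) for p in path_prefixes):
        return False
    # Assumption: blacklisted patterns are regexes matched against the whole URL.
    return not any(re.search(p, url) for p in blacklist_patterns)

hosts = {"docs.example.com"}
print(should_crawl("https://docs.example.com/docs/intro", hosts, ["/docs/"], [r"\?print="]))
```

With these example settings, a URL under `/blog/` or one carrying a `?print=` query string would be skipped.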

Crawl Status

While crawling, a status banner shows:

Pages discovered, indexed, and skipped
Any errors encountered
Estimated completion

You can pause a crawl in progress and resume later.

Automatic Recrawling

Keep content fresh with scheduled recrawls:

Open the web source settings
Set Recrawl Schedule:
- Manual only
- Every 12 hours
- Daily
- Weekly
- Monthly
Save

The crawler checks for new and updated pages on schedule. Unchanged pages are skipped to save processing time.
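This guide does not say how unchanged pages are detected. One common approach, shown here purely as an illustration, is to compare a hash of each page's text against the hash stored during the previous crawl. The function names and the whitespace normalization are assumptions; a real crawler might rely on HTTP headers such as `ETag` or `Last-Modified` instead.

```python
import hashlib

def content_fingerprint(text):
    """Hash the page text after collapsing whitespace, so trivial
    formatting differences do not count as a change."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def pages_to_reindex(previous, fetched):
    """previous maps URL -> stored fingerprint; fetched maps URL -> fresh text.
    Returns the URLs that are new or whose content has changed."""
    changed = []
    for url, text in fetched.items():
        if previous.get(url) != content_fingerprint(text):
            changed.append(url)
    return changed
```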

Document Processing

When you add a document, several things happen:

1. Text Extraction

Content is extracted from the file format. This includes:

Body text
Headers and titles
Table content
Image captions (if available)
Metadata (author, date, etc.)

2. Chunking

Text is split into smaller pieces called chunks. This is necessary because:

Search works better with focused passages
AI models have context limits
Relevant information can be isolated

Default settings:

Chunk size: 512 tokens
Overlap: 50 tokens (consecutive chunks share context)
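The effect of the default values is easiest to see with a sliding window. The sketch below assumes the simplest fixed-size strategy over an already tokenized document; `chunk_tokens` is a hypothetical helper, and the product's chunking strategies (for example, syntax-aware chunking for code) are likely more sophisticated.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size tokens, where each
    window repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks

# A 1,000-token document with the defaults yields windows starting at
# tokens 0, 462, and 924.
chunks = chunk_tokens(list(range(1000)))
print([len(c) for c in chunks])
```

Each new window starts 462 tokens after the previous one (512 minus 50), which is why a sentence falling on a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.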

3. Embedding

Each chunk is converted to a vector (a list of numbers) using the embedding model. This enables semantic search - finding content by meaning, not just keywords.
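As a rough illustration of what "finding content by meaning" involves, semantic search typically ranks chunks by how closely their vectors point in the same direction as the query vector. The sketch below uses cosine similarity on tiny hand-made vectors; the chunk IDs and numbers are invented for the example, real embeddings have hundreds or thousands of dimensions, and the similarity metric your deployment uses may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_chunks(query_vec, chunk_vecs):
    """Return chunk IDs ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in chunk_vecs.items()]
    return [cid for _, cid in sorted(scored, reverse=True)]

# Hypothetical three-dimensional "embeddings" for two chunks.
chunk_vecs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "api-limits": [0.0, 0.2, 0.9],
}
print(rank_chunks([1.0, 0.0, 0.1], chunk_vecs))
```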

4. Indexing

Chunks and their embeddings are stored in a vector database, ready for search.

Filtering Documents

Use the source filter to narrow the document list:

Filter	Shows
All	Everything in the knowledge base
Files	Uploaded documents only
Web	Crawled web pages only

Combine with search to quickly find specific documents.

Searching Documents

The search box above the document list searches across the entire knowledge base, not just the current page. It matches:

File names - e.g. quarterly-report
Web page URLs - e.g. docs.example.com/pricing
Tags - both tag names and values, e.g. department or legal

Search is case-insensitive and matches partial words. Results are paginated like the regular document list. The knowledge base list page offers the same server-side search on knowledge base names and descriptions.
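The matching rules above can be approximated as a case-insensitive substring test over a document's name, URL, and tags. This sketch assumes that "partial words" means plain substring matching and uses a hypothetical dictionary shape for documents; the server-side implementation is not documented here and may behave differently.

```python
def matches(query, doc):
    """Case-insensitive substring match against name, URL, tag names, and tag values."""
    needle = query.casefold()
    fields = [doc.get("name", ""), doc.get("url", "")]
    for tag_name, tag_value in doc.get("tags", {}).items():
        fields.extend([tag_name, tag_value])
    return any(needle in field.casefold() for field in fields)

# Hypothetical document records, for illustration only.
report = {"name": "Quarterly-Report.pdf", "tags": {"department": "Legal"}}
page = {"url": "docs.example.com/pricing"}
print(matches("quarter", report), matches("LEGAL", report), matches("pricing", page))
```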

Document Status

Each document shows a status that updates in real time:

Status	Meaning
Queued	Waiting to be processed
Processing	Currently being extracted and indexed
Ready	Successfully processed and searchable
Error	Something went wrong during processing

Status changes appear automatically, so there is no need to refresh the page. After uploading, watch as documents move from Queued to Processing to Ready.

Click an Error status to see details. Common issues:

Unsupported format - File type not recognized
Password protected - Document is encrypted
Extraction failed - Content couldn’t be read
Too large - File exceeds size limit

Viewing Document Details

Click any document to see:

File information - Name, type, size, dates
Processing details - Chunk count, tokens, parser used (the effective parser from the last indexing pass). If a per-document parser override is active, an override badge appears alongside it.
Chunks viewer - See exactly how the document was split

The Chunks Viewer

Understanding how documents are chunked helps debug retrieval issues:

Click View Chunks on any document
Browse through chunks (paginated for large documents)
Expand any chunk to see its full text
Search within the chunks to find specific content
Copy chunk text for testing or debugging

Each chunk shows:

Text content (expandable)
Page number (for PDFs)
Token count
Position in document

If important information is split awkwardly across chunks, consider adjusting the chunk size or using a different chunking strategy in RAG Settings.

Document Tags

Organize documents with tags:

Select a document
Click Edit Tags
Add or remove tags
Save

Tags help with:

Filtering the document list
Finding specific content types
Organizing large collections

Updating Documents

To replace a document with a new version:

Delete the old document
Upload the new version

Alternatively, to re-process a document that is already in the knowledge base:

Click Re-index… on the document to re-process it. Confirming without changing anything reuses the document’s current settings
In the same dialog, optionally pick a different parser or chunking strategy for this document only. This is useful when a document failed with the default parser

For frequently updated content, consider using connectors that sync automatically rather than manual uploads.

Deleting Documents

To remove a document:

Find it in the documents list
Click the delete icon (trash)
Confirm deletion

The document and all its chunks are removed. This affects search results immediately.

Bulk Deletion

To delete multiple documents:

Use filters to narrow the list
Select documents using checkboxes
Click Delete Selected
Confirm

Reindexing

When you change chunking or parsing settings, existing documents keep their old chunks. To apply the new settings:

Single Document

Select Re-index… from a document row’s action menu. The dialog opens with the document’s current settings preselected: its persisted parser override if one exists, or Use store default otherwise (the store’s current parser is shown next to it). Confirming without changing anything reprocesses the document with those settings.

To try a different parser or chunking strategy for this one document without changing the knowledge base defaults:

Select Re-index… on the document row

Choose a parser in the dialog:

Parser	Best For
Store default	Use the knowledge base parser (clears any previous per-document override)
Tika (`tika`)	Plain text, simple office documents
Tika + OCR (`tika-ocr`)	Scanned documents, images with text
Unstructured (`unstructured`)	Documents with headings, lists, tables
Unstructured + OCR (`unstructured-ocr`)	Complex scanned layouts, multi-column
LLM (`llm`)	Images, audio, video — always forced for media files

Optionally expand Advanced: chunking to override the chunking strategy for this document alone (the controls start from the store’s current chunking config)
Click Re-index

An explicit parser or chunking choice is persisted on the document and reused on subsequent re-indexes. To revert to the knowledge-base default, open the dialog again and select Use store default — the document then follows the store configuration, including future changes to it.

For image, audio, and video files the parser is locked to LLM — the backend enforces this regardless of the parser selected. The modal reflects this constraint.

For failed documents, a Re-index with another parser shortcut appears directly in the error details panel. It opens the same dialog, so you can recover a failed document without going through the action menu.
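The precedence rules described in this section can be summarized as a small decision function. This is a sketch of the documented behavior, not the backend's code: the parser identifiers come from the table above, while `effective_parser`, `MEDIA_KINDS`, and the `file_kind` labels are hypothetical names chosen for the example.

```python
# Assumption: file kinds are labeled with simple strings for illustration.
MEDIA_KINDS = {"image", "audio", "video"}

def effective_parser(file_kind, document_override, store_default):
    """Pick the parser for one document.

    1. Media files always use the LLM parser.
    2. Otherwise a persisted per-document override wins.
    3. Otherwise the knowledge base (store) default applies.
    """
    if file_kind in MEDIA_KINDS:
        return "llm"
    if document_override is not None:
        return document_override
    return store_default

print(effective_parser("pdf", None, "tika"))
print(effective_parser("pdf", "unstructured-ocr", "tika"))
print(effective_parser("image", "tika", "tika"))
```

Selecting Use store default corresponds to clearing the override (passing `None` here), after which the document follows whatever the store default is at re-index time.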

All Documents

To reindex the entire knowledge base:

Go to Settings
Scroll to Danger Zone
Click Reindex All Documents
Confirm

When re-indexing all documents, per-document parser overrides are preserved — documents that have a specific parser set will continue to use that parser in the bulk pass.

Reindex only reapplies chunking and parsing. It does not change the embedding model or vector dimensions. Those are frozen at knowledge-base creation, and the physical vector index cannot be resized. To migrate to a different embedding model, see Changing the embedding model.

Reindexing large knowledge bases takes time and consumes processing resources. Documents remain searchable during reindexing, but results may be inconsistent until it completes.

Best Practices

Use clean source documents

Well-formatted documents with clear headings produce better chunks and retrieval. Clean up messy documents before uploading.

Test with representative queries

After adding documents, test search in the Playground. Verify that relevant content is retrieved for typical questions.

Remove duplicates

Duplicate content hurts retrieval quality. If the same information appears in multiple documents, keep the most authoritative version.

Keep documents focused

Many focused documents are better than a few giant documents. Split large documents by topic if they cover multiple subjects.

Use meaningful filenames

Filenames become part of the metadata and can help with retrieval. Use descriptive names, not “Document1.pdf”.

Next Steps

Connect external sources

Configure RAG settings