Uploading Files
Drag and Drop
The simplest way to add files:
- Open a knowledge base
- Drag files from your computer onto the page
- Drop them in the upload zone
- Wait for processing
File Picker
Alternatively:
- Click Upload Files
- Select files from your computer
- Click Open
Bulk Upload
For many files:
- Drag a folder (supported in Chrome and Edge)
- Select multiple files in the picker
- Files are processed in parallel
Supported File Types
| Category | Formats | Notes |
|---|---|---|
| Documents | PDF, DOCX, DOC, TXT, RTF | Text is extracted automatically |
| Presentations | PPTX, PPT | Slide content and notes |
| Spreadsheets | XLSX, XLS, CSV | Cell values and headers |
| Web | HTML, Markdown | Rendered content |
| Code | Most languages | Syntax-aware chunking |
The maximum file size depends on your organization’s configuration. Typical limits are 50 to 100 MB per file.
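Before a large upload, it can help to check files locally against the table above. The following is a minimal sketch of such a pre-flight check, not a feature of the product. The function names are hypothetical, the extension set is read off the table (code formats are left out because the table does not enumerate them, and the .htm and .md spellings are assumptions), and the 50 MB limit is an assumed value that you should replace with your organization's configured limit.

```python
from pathlib import Path

# Extensions read off the Supported File Types table. Code formats are
# omitted because the table does not list them individually.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".txt", ".rtf",  # documents
    ".pptx", ".ppt",                          # presentations
    ".xlsx", ".xls", ".csv",                  # spreadsheets
    ".html", ".htm", ".md",                   # web (assumed spellings)
}

# Assumed limit; the real value depends on your organization's configuration.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


def check_upload(name: str, size_bytes: int,
                 max_bytes: int = MAX_FILE_SIZE_BYTES) -> list[str]:
    """Return a list of problems; an empty list means the file looks fine."""
    problems = []
    if Path(name).suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported format")
    if size_bytes > max_bytes:
        problems.append("too large")
    return problems


def check_folder(folder: str) -> dict[str, list[str]]:
    """Check every file under a folder and report only files with problems."""
    report = {}
    for path in Path(folder).rglob("*"):
        if path.is_file():
            problems = check_upload(path.name, path.stat().st_size)
            if problems:
                report[str(path)] = problems
    return report
```

Calling check_folder on a local folder returns a mapping from file path to the problems found, so an empty result suggests the folder is ready to drag into the upload zone.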
Images and Scans
For documents that are scanned images or contain images with text:
- OCR is applied automatically to extract text
- Quality depends on image clarity
- Consider re-scanning poor quality documents
Adding Web Content
Single URLs
To add an individual web page:
- Click Add URL
- Enter the full URL (including https://)
- Click Add
Web Crawling
For multiple pages from a website:
- Click Add Web Source
- Enter the starting URL
- Configure basic settings such as path filters, blacklisted patterns, sitemap mode, and XPath filtering
- Click Start Crawling
Advanced Crawler Settings
For more control, expand Hostname settings to configure:
- Path filters - Only crawl specific sections (e.g., /docs/)
- Blacklisted patterns - Skip low-value or duplicate URL patterns
- Robots.txt - Keep the site’s crawling rules enabled by default
- Sitemap mode - Only follow links from the sitemap
- XPath filter - Extract only the useful page content
- HTTP headers - Send custom headers to controlled internal sites
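To get a feel for how path filters and blacklisted patterns interact, the sketch below decides whether a single URL would be crawled. It is an illustration under assumptions, not the crawler's actual logic: should_crawl is a hypothetical helper, path filters are assumed to be path prefixes, and blacklisted patterns are assumed to be regular expressions matched against the full URL.

```python
import re
from urllib.parse import urlparse


def should_crawl(url: str, hostname: str, path_prefixes: list[str],
                 blacklist_patterns: list[str]) -> bool:
    """Decide whether a URL passes the hostname, path, and blacklist checks."""
    parsed = urlparse(url)
    # Stay on the configured host.
    if parsed.hostname != hostname:
        return False
    # If path filters are set, the URL path must start with one of them.
    if path_prefixes and not any(parsed.path.startswith(p) for p in path_prefixes):
        return False
    # Skip the URL if any blacklisted pattern (assumed regex) matches it.
    return not any(re.search(pattern, url) for pattern in blacklist_patterns)


# Example: crawl only /docs/ and skip print-view duplicates.
allowed = should_crawl(
    "https://docs.example.com/docs/intro",
    hostname="docs.example.com",
    path_prefixes=["/docs/"],
    blacklist_patterns=[r"\?print="],
)
```

Testing a handful of representative URLs this way before starting a crawl can reveal filters that are too broad or too narrow.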
Crawl Status
While crawling, a status banner shows:
- Pages discovered, indexed, and skipped
- Any errors encountered
- Estimated completion
Automatic Recrawling
Keep content fresh with scheduled recrawls:
- Open the web source settings
- Set Recrawl Schedule:
  - Manual only
  - Every 12 hours
  - Daily
  - Weekly
  - Monthly
- Save
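The schedule options above can be read as fixed intervals since the last crawl. The sketch below makes that reading concrete; it is an assumption about the behavior, not a description of the scheduler. The schedule keys and the next_recrawl helper are hypothetical, and Monthly is approximated as 30 days.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical keys for the schedule options; None means manual only.
SCHEDULE_INTERVALS = {
    "manual": None,
    "12h": timedelta(hours=12),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),  # assumption: a month is treated as 30 days
}


def next_recrawl(last_crawl: datetime, schedule: str) -> Optional[datetime]:
    """Return when the next recrawl would be due, or None for manual only."""
    interval = SCHEDULE_INTERVALS[schedule]
    return None if interval is None else last_crawl + interval


due = next_recrawl(datetime(2024, 1, 1, 8, 0), "12h")
```

A shorter interval keeps content fresher at the cost of more load on the crawled site, so pick the longest interval that matches how often the source actually changes.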
Document Processing
When you add a document, several things happen:
1. Text Extraction
Content is extracted from the file format. This includes:
- Body text
- Headers and titles
- Table content
- Image captions (if available)
- Metadata (author, date, etc.)
2. Chunking
Text is split into smaller pieces called chunks. This is necessary because:
- Search works better with focused passages
- AI models have context limits
- Relevant information can be isolated
Chunks are created with the following settings:
- Chunk size: 512 tokens
- Overlap: 50 tokens (consecutive chunks share context)
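The effect of a 512-token chunk size with a 50-token overlap can be illustrated with a sliding window. This is a minimal sketch rather than the product's chunker: chunk_tokens is a hypothetical helper, it works on an already tokenized list, and it ignores sentence or heading boundaries that a real chunking strategy may respect.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512,
                 overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this many tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# A 1000-token document yields three chunks; the last one is shorter.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
```

With these settings each new chunk starts 462 tokens after the previous one, so the final 50 tokens of one chunk are repeated at the start of the next. That repetition is what lets a passage that straddles a boundary still appear intact in at least one chunk.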
3. Embedding
Each chunk is converted to a vector (a list of numbers) using the embedding model. This enables semantic search: finding content by meaning, not just keywords.
4. Indexing
Chunks and their embeddings are stored in a vector database, ready for search.
Filtering Documents
Use the source filter to narrow the document list:
| Filter | Shows |
|---|---|
| All | Everything in the knowledge base |
| Files | Uploaded documents only |
| Web | Crawled web pages only |
Searching Documents
The search box above the document list searches across the entire knowledge base, not just the current page. It matches:
- File names - e.g. quarterly-report
- Web page URLs - e.g. docs.example.com/pricing
- Tags - both tag names and values, e.g. department or legal
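The matching rules above can be pictured as one query checked against several fields of each document. The sketch below is illustrative only: the Document class and matches function are hypothetical, and case-insensitive substring matching is an assumption, since the exact matching behavior is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for an entry in the document list."""
    name: str = ""
    url: str = ""
    tags: dict[str, str] = field(default_factory=dict)


def matches(doc: Document, query: str) -> bool:
    """Return True if the query appears in the name, URL, or any tag name or value."""
    q = query.strip().lower()
    if not q:
        return True  # an empty search box shows everything
    searchable = [doc.name, doc.url, *doc.tags.keys(), *doc.tags.values()]
    return any(q in text.lower() for text in searchable)


report = Document(name="quarterly-report.pdf", tags={"department": "legal"})
pricing = Document(url="https://docs.example.com/pricing")
hits = [d for d in (report, pricing) if matches(d, "legal")]
```

Here the query legal finds the report through its tag value even though the word does not appear in the file name.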
Document Status
Each document shows a status that updates in real time:
| Status | Meaning |
|---|---|
| Queued | Waiting to be processed |
| Processing | Currently being extracted and indexed |
| Ready | Successfully processed and searchable |
| Error | Something went wrong during processing |
Common causes of errors:
- Unsupported format - File type not recognized
- Password protected - Document is encrypted
- Extraction failed - Content couldn’t be read
- Too large - File exceeds size limit
Viewing Document Details
Click any document to see:
- File information - Name, type, size, dates
- Processing details - Chunk count, tokens, parser used (the effective parser from the last indexing pass). If a per-document parser override is active, an override badge appears alongside it.
- Chunks viewer - See exactly how the document was split
The Chunks Viewer
Understanding how documents are chunked helps debug retrieval issues:
- Click View Chunks on any document
- Browse through chunks (paginated for large documents)
- Expand any chunk to see its full text
- Search within the chunks to find specific content
- Copy chunk text for testing or debugging
Each chunk shows:
- Text content (expandable)
- Page number (for PDFs)
- Token count
- Position in document
Document Tags
Organize documents with tags:
- Select a document
- Click Edit Tags
- Add or remove tags
- Save
Tags help with:
- Filtering the document list
- Finding specific content types
- Organizing large collections
Updating Documents
To replace a document with a new version:
- Delete the old document
- Upload the new version
To re-process an existing document without replacing it:
- Click Re-index… on the document. Confirming without changing anything reuses the document’s current settings
- In the same dialog, optionally pick a different parser or chunking strategy for this document only. This is useful when a document failed with the default parser
Deleting Documents
To remove a document:
- Find it in the documents list
- Click the delete icon (trash)
- Confirm deletion
Bulk Deletion
To delete multiple documents:
- Use filters to narrow the list
- Select documents using checkboxes
- Click Delete Selected
- Confirm
Reindexing
When you change chunking or parsing settings, existing documents keep their old chunks. To apply the new settings, reindex a single document or the entire knowledge base.
Single Document
Select Re-index… from a document row’s action menu. The dialog opens preselected on the document’s current settings: its persisted parser override if one exists, Use store default otherwise (the store’s current parser is shown next to it). Confirming without changing anything reprocesses the document with those settings.
To try a different parser or chunking strategy for this one document without changing the knowledge base defaults:
- Select Re-index… on the document row
- Choose a parser in the dialog:
| Parser | Best For |
|---|---|
| Store default | Use the knowledge base parser (clears any previous per-document override) |
| Tika (tika) | Plain text, simple office documents |
| Tika + OCR (tika-ocr) | Scanned documents, images with text |
| Unstructured (unstructured) | Documents with headings, lists, tables |
| Unstructured + OCR (unstructured-ocr) | Complex scanned layouts, multi-column |
| LLM (llm) | Images, audio, video (always forced for media files) |
- Optionally expand Advanced: chunking to override the chunking strategy for this document alone (the controls start from the store’s current chunking config)
- Click Re-index
An explicit parser or chunking choice is persisted on the document and reused on subsequent re-indexes. To revert to the knowledge-base default, open the dialog again and select Use store default — the document then follows the store configuration, including future changes to it.
For image, audio, and video files the parser is locked to LLM — the backend enforces this regardless of the parser selected. The modal reflects this constraint.
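The precedence rules in this section (media files are locked to LLM, then a per-document override, then the store default) can be summarized in a few lines. This is a sketch of the described behavior, not the backend's code: effective_parser is a hypothetical helper, and detecting media files by extension with the list below is an assumption.

```python
from pathlib import Path
from typing import Optional

# Parser identifiers as listed in the parser table.
PARSERS = {"tika", "tika-ocr", "unstructured", "unstructured-ocr", "llm"}

# Assumed examples of media extensions; the backend's actual detection may differ.
MEDIA_EXTENSIONS = {".png", ".jpg", ".jpeg", ".mp3", ".wav", ".mp4", ".mov"}


def effective_parser(filename: str, store_default: str,
                     override: Optional[str] = None) -> str:
    """Resolve which parser a document would be indexed with."""
    # Image, audio, and video files are always parsed with the LLM parser.
    if Path(filename).suffix.lower() in MEDIA_EXTENSIONS:
        return "llm"
    # A persisted per-document override wins over the store default.
    chosen = override if override is not None else store_default
    if chosen not in PARSERS:
        raise ValueError(f"unknown parser: {chosen}")
    return chosen


# A scanned PDF with an OCR override, in a store that defaults to Tika.
parser = effective_parser("scan.pdf", store_default="tika", override="tika-ocr")
```

Passing override=None models selecting Use store default, in which case the document follows whatever the store is currently configured to use.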
All Documents
To reindex the entire knowledge base:
- Go to Settings
- Scroll to Danger Zone
- Click Reindex All Documents
- Confirm
Best Practices
Use clean source documents
Well-formatted documents with clear headings produce better chunks and retrieval. Clean up messy documents before uploading.
Test with representative queries
After adding documents, test search in the Playground. Verify that relevant content is retrieved for typical questions.
Remove duplicates
Duplicate content hurts retrieval quality. If the same information appears in multiple documents, keep the most authoritative version.
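One way to catch exact duplicates before uploading is to hash the files in a local folder and group identical hashes. The sketch below is a local helper, not a product feature, and find_duplicates is a hypothetical name. It only detects byte-identical files; the same information saved in two different formats or with small edits will not be flagged.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(folder: str) -> list[list[str]]:
    """Group files under `folder` whose contents are byte-for-byte identical."""
    by_hash = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            # Reads each file fully into memory; fine for typical document sizes.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(str(path))
    # Keep only hashes that more than one file shares.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Each returned group lists paths with identical content, so you can keep one copy per group and upload only that.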
Keep documents focused
Many focused documents are better than few giant documents. Split large documents by topic if they cover multiple subjects.
Use meaningful filenames
Filenames become part of the metadata and can help with retrieval. Use descriptive names, not “Document1.pdf”.
Next Steps
Connect external sources
Set up automatic syncing with SharePoint, Google Drive, and more
Configure RAG settings
Fine-tune chunking and retrieval for better results