# Document Management

> Upload, organize, and manage documents in your knowledge bases

Documents are the foundation of your knowledge bases. This guide covers how to add, organize, and maintain your document collections.

## Uploading Files

### Drag and Drop

The simplest way to add files:

1. Open a knowledge base
2. Drag files from your computer onto the page
3. Drop them in the upload zone
4. Wait for processing

### File Picker

Alternatively:

1. Click **Upload Files**
2. Select files from your computer
3. Click **Open**

### Bulk Upload

For many files:

* Drag a folder (supported in Chrome and Edge)
* Select multiple files in the picker
* Files are processed in parallel

## Supported File Types

| Category          | Formats                  | Notes                           |
| ----------------- | ------------------------ | ------------------------------- |
| **Documents**     | PDF, DOCX, DOC, TXT, RTF | Text is extracted automatically |
| **Presentations** | PPTX, PPT                | Slide content and notes         |
| **Spreadsheets**  | XLSX, XLS, CSV           | Cell values and headers         |
| **Web**           | HTML, Markdown           | Rendered content                |
| **Code**          | Most languages           | Syntax-aware chunking           |

<Note>
  Maximum file size depends on your organization's configuration. Typical limits are 50 to 100 MB per file.
</Note>
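If you prepare uploads with a script, a local pre-check can catch unsupported files before they reach the queue. The following is a minimal sketch, not part of the product: the `categorize` helper and the exact extension lists (including `.htm`, `.md`, and `.markdown`) are assumptions derived from the table above, and code files are left out because "most languages" cannot be enumerated here.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension map, derived from the supported-formats table.
CATEGORIES = {
    "Documents": {".pdf", ".docx", ".doc", ".txt", ".rtf"},
    "Presentations": {".pptx", ".ppt"},
    "Spreadsheets": {".xlsx", ".xls", ".csv"},
    "Web": {".html", ".htm", ".md", ".markdown"},
}


def categorize(filename: str) -> Optional[str]:
    """Return the table category for a file name, or None if it is not listed."""
    suffix = Path(filename).suffix.lower()
    for category, extensions in CATEGORIES.items():
        if suffix in extensions:
            return category
    return None
```

A `None` result only means the extension is not in this illustrative map; the platform itself decides what it accepts.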

### Images and Scans

For documents that are scanned images or contain images with text:

* **OCR** is applied automatically to extract text
* Quality depends on image clarity
* Consider re-scanning poor quality documents

## Adding Web Content

### Single URLs

To add an individual web page:

1. Click **Add URL**
2. Enter the full URL (including `https://`)
3. Click **Add**

The page is fetched immediately and its content indexed.

### Web Crawling

For multiple pages from a website:

1. Click **Add Web Source**
2. Enter the starting URL
3. Configure basic settings such as path filters, blacklisted patterns, sitemap mode, and XPath filtering
4. Click **Start Crawling**

The crawler discovers pages by following links and indexes their content.

For the detailed workflow and current crawl settings, see [Crawl a Website](./crawl-website).

#### Advanced Crawler Settings

For more control, expand **Hostname settings** to configure:

* **Path filters** - Only crawl specific sections (e.g., `/docs/`)
* **Blacklisted patterns** - Skip low-value or duplicate URL patterns
* **Robots.txt** - Keep the site's crawling rules enabled by default
* **Sitemap mode** - Only follow links from the sitemap
* **XPath filter** - Extract only the useful page content
* **HTTP headers** - Send custom headers to controlled internal sites

These settings can be configured per hostname if your source spans multiple domains.
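To reason about how path filters and blacklisted patterns interact, it can help to model them as a simple predicate. The sketch below is illustrative only: `should_crawl` is a hypothetical function, path filters are treated as path prefixes, and blacklisted patterns are treated as regular expressions, which may not match the crawler's actual pattern syntax.

```python
import re
from urllib.parse import urlparse


def should_crawl(url, path_prefixes, blacklist):
    """Return True if the URL passes the path filters and no blacklist pattern.

    Assumptions: an empty path_prefixes list means "no restriction", and each
    blacklist entry is a regular expression searched anywhere in the full URL.
    """
    path = urlparse(url).path
    if path_prefixes and not any(path.startswith(p) for p in path_prefixes):
        return False
    return not any(re.search(pattern, url) for pattern in blacklist)


# Keep only the /docs/ section and skip paginated listing URLs.
filters = ["/docs/"]
skip = [r"\?page="]
print(should_crawl("https://example.com/docs/intro", filters, skip))
print(should_crawl("https://example.com/docs/list?page=2", filters, skip))
```

Under these assumptions the first URL is accepted and the second is rejected, because it matches the blacklisted pagination pattern even though it sits under `/docs/`.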

#### Crawl Status

While crawling, a status banner shows:

* Pages discovered, indexed, and skipped
* Any errors encountered
* Estimated completion

You can **pause** a crawl in progress and resume later.

### Automatic Recrawling

Keep content fresh with scheduled recrawls:

1. Open the web source settings
2. Set **Recrawl Schedule**:
   * Manual only
   * Every 12 hours
   * Daily
   * Weekly
   * Monthly
3. Save

The crawler checks for new and updated pages on schedule. Unchanged pages are skipped to save processing time.
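This page does not describe how the crawler detects unchanged pages. One common approach is to compare a fingerprint of each page's content with the one stored at the previous crawl. The sketch below illustrates that idea only; `content_fingerprint` and `pages_to_reindex` are hypothetical names, and the real crawler may rely on other signals such as HTTP headers.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Hash the extracted page text so two versions can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_to_reindex(previous: dict, fetched: dict) -> list:
    """Return URLs whose content is new or differs from the stored fingerprint.

    previous maps URL -> fingerprint from the last crawl.
    fetched maps URL -> freshly extracted text.
    """
    changed = []
    for url, text in fetched.items():
        if previous.get(url) != content_fingerprint(text):
            changed.append(url)
    return changed
```

Only the URLs returned by `pages_to_reindex` would need to go through extraction, chunking, and embedding again.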

## Document Processing

When you add a document, several things happen:

### 1. Text Extraction

Content is extracted from the file format. This includes:

* Body text
* Headers and titles
* Table content
* Image captions (if available)
* Metadata (author, date, etc.)

### 2. Chunking

Text is split into smaller pieces called chunks. This is necessary because:

* Search works better with focused passages
* AI models have context limits
* Relevant information can be isolated

Default settings:

* **Chunk size**: 512 tokens
* **Overlap**: 50 tokens (consecutive chunks share context)
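The relationship between chunk size and overlap can be sketched as a sliding window. This is a simplified illustration under stated assumptions, not the platform's chunker: `chunk_tokens` is a hypothetical helper, and it works on a pre-tokenized list, whereas the real pipeline uses a model tokenizer and may respect document structure.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


# Stand-in tokens; a real tokenizer would produce sub-word units instead.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With the default values, a 1,000-token input yields three chunks of 512, 512, and 76 tokens, and the last 50 tokens of each chunk reappear at the start of the next one.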

### 3. Embedding

Each chunk is converted to a vector (a list of numbers) using the embedding model. This enables semantic search - finding content by meaning, not just keywords.

### 4. Indexing

Chunks and their embeddings are stored in a vector database, ready for search.
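To make the idea of searching by meaning concrete, the sketch below ranks stored chunk vectors against a query vector. It assumes cosine similarity as the distance measure and uses tiny hand-written vectors; the page does not state which metric or vector database the platform uses, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=3):
    """Return the ids of the k chunks whose vectors are closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in index.items()]
    return [chunk_id for _, chunk_id in sorted(scored, reverse=True)[:k]]


# Toy two-dimensional "embeddings"; real embeddings have hundreds of dimensions.
index = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [0.9, 0.1]}
print(top_k([1.0, 0.0], index, k=2))
```

A vector database performs the same kind of nearest-neighbor lookup, typically with approximate indexes so that it stays fast across very large numbers of chunks.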

## Filtering Documents

Use the source filter to narrow the document list:

| Filter    | Shows                            |
| --------- | -------------------------------- |
| **All**   | Everything in the knowledge base |
| **Files** | Uploaded documents only          |
| **Web**   | Crawled web pages only           |

Combine with search to quickly find specific documents.

## Searching Documents

The search box above the document list searches across the **entire knowledge base**, not just the current page. It matches:

* **File names** - e.g. `quarterly-report`
* **Web page URLs** - e.g. `docs.example.com/pricing`
* **Tags** - both tag names and values, e.g. `department` or `legal`

Search is case-insensitive and matches partial words. Results are paginated like the regular document list. The knowledge base list page offers the same server-side search on knowledge base names and descriptions.
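The matching rules above can be approximated as a case-insensitive substring test over the name, URL, and tags. This is a rough sketch under assumptions: the `matches` function and the `name`, `url`, and `tags` field names are hypothetical, and the server-side search may behave differently in edge cases.

```python
def matches(query, document):
    """Return True if the query is a case-insensitive substring of the
    document's name, URL, or any tag name or tag value."""
    needle = query.casefold()
    haystacks = [document.get("name", ""), document.get("url", "")]
    for tag_name, tag_value in document.get("tags", {}).items():
        haystacks.extend([tag_name, tag_value])
    return any(needle in text.casefold() for text in haystacks)


doc = {"name": "Quarterly-Report.pdf", "tags": {"department": "Legal"}}
print(matches("quarterly", doc), matches("LEGAL", doc), matches("pricing", doc))
```

Here `quarterly` matches the file name, `LEGAL` matches a tag value regardless of case, and `pricing` matches nothing.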

## Document Status

Each document shows a status that updates in real time:

| Status         | Meaning                                |
| -------------- | -------------------------------------- |
| **Queued**     | Waiting to be processed                |
| **Processing** | Currently being extracted and indexed  |
| **Ready**      | Successfully processed and searchable  |
| **Error**      | Something went wrong during processing |

Status changes appear automatically - no need to refresh the page. After uploading, watch as documents move from queued to processing to ready.

Click an error status to see details. Common issues:

* **Unsupported format** - File type not recognized
* **Password protected** - Document is encrypted
* **Extraction failed** - Content couldn't be read
* **Too large** - File exceeds size limit

## Viewing Document Details

Click any document to see:

* **File information** - Name, type, size, dates
* **Processing details** - Chunk count, tokens, parser used (the effective parser from the last indexing pass). If a per-document parser override is active, an **override** badge appears alongside it.
* **Chunks viewer** - See exactly how the document was split

### The Chunks Viewer

Understanding how documents are chunked helps debug retrieval issues:

1. Click **View Chunks** on any document
2. Browse through chunks (paginated for large documents)
3. **Expand** any chunk to see its full text
4. **Search** within the chunks to find specific content
5. **Copy** chunk text for testing or debugging

Each chunk shows:

* Text content (expandable)
* Page number (for PDFs)
* Token count
* Position in document

If important information is split awkwardly across chunks, consider adjusting the chunk size or using a different chunking strategy in [RAG Settings](./rag-settings).

## Document Tags

Organize documents with tags:

1. Select a document
2. Click **Edit Tags**
3. Add or remove tags
4. Save

Tags help with:

* Filtering the document list
* Finding specific content types
* Organizing large collections

## Updating Documents

To replace a document with a new version:

1. Delete the old document
2. Upload the new version

Or, to re-process a document that is already in the knowledge base:

1. Click **Re-index…** on the document. Confirming without changing anything reuses the document's current settings.
2. In the same dialog, optionally pick a different parser or chunking strategy for this document only. This is useful when a document failed with the default parser.

<Tip>
  For frequently updated content, consider using connectors that sync automatically rather than manual uploads.
</Tip>

## Deleting Documents

To remove a document:

1. Find it in the documents list
2. Click the delete icon (trash)
3. Confirm deletion

The document and all its chunks are removed. This affects search results immediately.

### Bulk Deletion

To delete multiple documents:

1. Use filters to narrow the list
2. Select documents using checkboxes
3. Click **Delete Selected**
4. Confirm

## Reindexing

When you change chunking or parsing settings, existing documents keep their old chunks. To apply the new settings:

### Single Document

Select **Re-index…** from a document row's action menu. The dialog opens with the document's current settings preselected: its persisted parser override if one exists, otherwise **Use store default** (the store's current parser is shown next to it). Confirming without changing anything reprocesses the document with those settings.

To try a different parser or chunking strategy for this one document without changing the knowledge base defaults:

1. Select **Re-index…** on the document row

2. Choose a parser in the dialog:

   | Parser                                      | Best For                                                                  |
   | ------------------------------------------- | ------------------------------------------------------------------------- |
   | **Store default**                           | Use the knowledge base parser (clears any previous per-document override) |
   | **Tika** (`tika`)                           | Plain text, simple office documents                                       |
   | **Tika + OCR** (`tika-ocr`)                 | Scanned documents, images with text                                       |
   | **Unstructured** (`unstructured`)           | Documents with headings, lists, tables                                    |
   | **Unstructured + OCR** (`unstructured-ocr`) | Complex scanned layouts, multi-column                                     |
   | **LLM** (`llm`)                             | Images, audio, video — always forced for media files                      |

3. Optionally expand **Advanced: chunking** to override the chunking strategy for this document alone (the controls start from the store's current chunking config)

4. Click **Re-index**

<Note>
  An explicit parser or chunking choice is **persisted on the document** and reused on subsequent re-indexes. To revert to the knowledge-base default, open the dialog again and select **Use store default** — the document then follows the store configuration, including future changes to it.
</Note>

<Note>
  For image, audio, and video files the parser is locked to **LLM** — the backend enforces this regardless of the parser selected. The modal reflects this constraint.
</Note>
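The precedence described in the notes above (media files forced to LLM, then the per-document override, then the store default) can be summarized in a few lines. This is a sketch of the documented rules, not the backend's code: `effective_parser` is a hypothetical function, and the media extension list is illustrative because the page does not enumerate which file types count as media.

```python
from pathlib import Path

# Illustrative only; the actual set of media types is not listed on this page.
MEDIA_EXTENSIONS = {".png", ".jpg", ".jpeg", ".mp3", ".wav", ".mp4"}


def effective_parser(filename, store_default, override=None):
    """Resolve which parser a document would be indexed with."""
    if Path(filename).suffix.lower() in MEDIA_EXTENSIONS:
        return "llm"  # media files ignore both the override and the default
    # A persisted per-document override wins; otherwise follow the store.
    return override or store_default


print(effective_parser("scan.pdf", "tika", override="tika-ocr"))
print(effective_parser("clip.mp4", "tika", override="unstructured"))
print(effective_parser("notes.docx", "unstructured"))
```

Selecting **Use store default** corresponds to clearing `override`, after which the document follows whatever parser the knowledge base is configured with.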

For failed documents, a **Re-index with another parser** shortcut appears directly in the error details panel. It opens the same dialog, so you can recover without going through the action menu.

### All Documents

To reindex the entire knowledge base:

1. Go to **Settings**
2. Scroll to **Danger Zone**
3. Click **Reindex All Documents**
4. Confirm

When re-indexing all documents, per-document parser overrides are preserved — documents that have a specific parser set will continue to use that parser in the bulk pass.

<Warning>
  Reindex only reapplies **chunking and parsing**. It does **not** change the embedding model or vector dimensions — those are frozen at knowledge-base creation and the physical vector index cannot be resized. To migrate to a different embedding model, see [Changing the embedding model](/products/agent-factory/knowledge-architecture#changing-the-embedding-model-the-a-b-pattern). Reindexing large knowledge bases takes time and consumes processing resources; documents remain searchable during reindexing, but results may be inconsistent until complete.
</Warning>

## Best Practices

<AccordionGroup>
  <Accordion title="Use clean source documents">
    Well-formatted documents with clear headings produce better chunks and retrieval. Clean up messy documents before uploading.
  </Accordion>

  <Accordion title="Test with representative queries">
    After adding documents, test search in the Playground. Verify that relevant content is retrieved for typical questions.
  </Accordion>

  <Accordion title="Remove duplicates">
    Duplicate content hurts retrieval quality. If the same information appears in multiple documents, keep the most authoritative version.
  </Accordion>

  <Accordion title="Keep documents focused">
    Many focused documents are better than a few giant ones. Split large documents by topic if they cover multiple subjects.
  </Accordion>

  <Accordion title="Use meaningful filenames">
    Filenames become part of the metadata and can help with retrieval. Use descriptive names, not "Document1.pdf".
  </Accordion>
</AccordionGroup>

## Next Steps

<CardGroup cols="2">
  <Card title="Connect external sources" icon="cloud" href="./connectors">
    Set up automatic syncing with SharePoint, Google Drive, and more
  </Card>

  <Card title="Configure RAG settings" icon="sliders" href="./rag-settings">
    Fine-tune chunking and retrieval for better results
  </Card>
</CardGroup>
