Scaling - Prisme.ai

Scale your self-hosted Prisme.ai platform to meet growing demands As your organization’s usage of Prisme.ai grows, you’ll need to scale your self-hosted platform to maintain performance and reliability. This guide provides strategies and best practices for scaling different components of your Prisme.ai deployment.

Scaling Approaches

Horizontal Scaling

Horizontal scaling involves adding more instances (pods, nodes) to distribute load: Benefits:

Better fault tolerance and availability
Linear capacity scaling
No downtime during scaling operations

Considerations:

Requires stateless application design
More complex networking
Service discovery requirements

Vertical Scaling

Vertical scaling involves increasing resources (CPU, memory) of existing instances: Benefits:

Simpler to implement
Better for stateful components
Can address specific bottlenecks

Considerations:

Limited by maximum resource sizes
May require downtime during scaling
Cost efficiency diminishes at larger scales

When to Scale

Performance Indicators

Monitor these key metrics to identify scaling needs:

API response times exceeding thresholds
CPU utilization consistently above 70%
Memory utilization consistently above 80%
Request queue depth increasing
Database query times growing

Growth Indicators

Business metrics that suggest scaling requirements:

Increasing number of users
Growing document count
More concurrent sessions
Higher query volume
Additional knowledge bases

Preventative Scaling

Proactive scaling for anticipated demands:

Before major rollouts
Ahead of seasonal peaks
Prior to marketing campaigns
In advance of organizational growth

Recovery Objectives

Scaling to meet resilience targets:

Redundancy requirements
High availability goals
Load distribution needs
Geographic distribution objectives

Scaling Core Components

API & Worker Services

When scaling API and worker services, proper resource management is crucial for optimal performance. First, assess current usage by gathering metrics on performance and resource utilization. Configure Horizontal Pod Autoscaling (HPA) to enable automatic scaling based on CPU and memory metrics, setting appropriate minimum and maximum replica counts. Update your Helm values to configure scaling parameters, including replica counts and autoscaling settings. Set proper resource requests and limits based on observed usage patterns, starting conservatively and adjusting based on monitoring data. Configure Pod Disruption Budgets to ensure high availability during scaling operations.

Product Modules

Each Prisme.ai product module can be scaled independently based on specific usage patterns. Knowledges requires scaling for document processing load and large knowledge bases, with tuning based on retrieval volume. Chat needs scaling based on concurrent user sessions and message throughput, considering message storage requirements. Agent Creator scaling focuses on catalog browsing traffic and agent deployment operations, with attention to metadata storage needs. Specific workspaxces on Builder requires scaling for concurrent development sessions and complex builds, considering testing environment requirements. Different products may require different scaling approaches based on their specific workloads and usage patterns.

Ingress & Networking

Ensure your ingress controller can handle increased traffic by scaling it appropriately. Configure connection pooling to optimize connection handling for scaled deployments, setting appropriate database pool sizes and Redis client limits. Implement Redis caching for frequently accessed data to reduce load on backend services.

Resource Optimization

Requests and Limits Configuration

Proper resource configuration is essential for effective scaling. Adjust CPU and memory limits for all core services and applications to accommodate the highest expected usage peaks. Set resource limits above the largest anticipated spikes to ensure services can handle peak loads without being throttled. Configure resource requests equal to their limits to guarantee that pods are assigned to nodes with sufficient available resources for peak loads. This approach ensures consistent performance during high-traffic periods and prevents resource contention between pods on the same node.

Service Crawler Optimization

The crawler service requires specific tuning for optimal performance. The DOWNLOAD_DELAY variable controls the delay between requests and should be adjusted based on target crawl throughput. The REQUEST_QUEUES_POLLING_SIZE determines how many requests are processed simultaneously, while REQUEST_QUEUES_POLLING_INTERVAL sets the frequency of queue checks. For typical document processing, such as a 100KB DOCX file containing 50,000 characters, recommended settings include a polling size of 8 requests, a download delay of 0.5 seconds, and a polling interval of 10 seconds. These values should be adjusted based on document types, processing time requirements, and target throughput.

Internal Cluster Communication

Optimize internal API calls by forcing all internal cluster communication to use HTTP instead of routing through Load Balancer HTTPS endpoints. Configure the INTERNAL_API_URL environment variable on all services to use internal service URLs, such as http://core-prismeai-api-gateway.core/v2. This optimization provides faster network communication and reduces CPU overhead from HTTPS processing, particularly beneficial for high-frequency internal API calls during runtime operations.

Runtime Configuration

Readiness Probe Tuning

Configure readiness probes with appropriate timeouts to prevent pod termination during load spikes. Set probe timeouts to at least 3 seconds with 2-3 failure attempts allowed before considering a pod unhealthy. This flexibility prevents unnecessary pod restarts during temporary high-load conditions.

Throttle Management

Consider disabling runtime throttling globally or specifically for Knowledges and Agent Creator workspaces to improve performance under load. Alternatively, increase throttle limits according to your performance requirements and capacity planning. https://docs.prisme.ai/api-reference/rate-limits#configuration-options.

API Gateway Timeout Adjustment

The API gateway default timeout of 60 seconds may be insufficient for LLM calls that can exceed one minute. Adjust the timeout configuration in the core-prismeai-api-gateway-config ConfigMap to accommodate longer-running requests, typically setting it to 120 seconds or based on your specific LLM response time requirements.

Event Volume Management

Reduce the size of execution events that are primarily used for monitoring rather than functional purposes. The BROKER_EMIT_MAXLEN and BROKER_EMIT_EXECUTED_AUTOMATION_MAXLEN environment variable controls maximum event size, with a default of 10,000 characters for runtime.automations.executed (BROKER_EMIT_EXECUTED_AUTOMATION_MAXLEN) and 100,000 for all other events (BROKER_EMIT_MAXLEN). These defaults should be suitable for most monitoring needs while reducing storage and processing overhead.

Database operations

Per-engine scaling, least-privileges and tuning details live on the dedicated database pages:

MongoDB

Replica sets, sharding, custom roles for least privileges.

Elasticsearch or OpenSearch

ILM, automated cleanup, index template tuning, reindex recipe.

Redis

Cluster mode, eviction policies, memory tuning.

PostgreSQL

Read replicas, PgBouncer, Entra ID passwordless auth.

Storage Scaling

Object Storage

S3 or compatible object storage typically scales automatically, but ensure proper configuration for performance and cost optimization. Enable transfer acceleration for faster uploads, use multipart uploads for large files, and implement appropriate file organization strategies. Consider regional deployments for global access and implement lifecycle policies for cost optimization, using appropriate storage classes based on access patterns.

Persistent Volumes

Adjust storage for stateful components by expanding persistent volume claims where supported by the storage class. Monitor storage usage patterns and plan for growth, ensuring adequate space for database operations, backups, and temporary files.

Infrastructure Scaling with Terraform

Scale Kubernetes nodes by adjusting node groups in your Terraform configuration, setting appropriate minimum, maximum, and desired node counts based on workload requirements. Configure cluster autoscaling for automatic node provisioning based on pod resource requirements and scheduling constraints. For global deployments, implement multi-region architecture with appropriate load balancing, database replication, and storage synchronization strategies.

Monitoring for Scaling Decisions

Key Metrics

Monitor core metrics that indicate scaling needs including API response times above 200ms, sustained CPU utilization above 70%, memory usage above 80%, increasing queue depths, and connection timeouts.

Monitoring Tools

Implement comprehensive monitoring using Prometheus and Grafana, Kubernetes metrics server, custom dashboards for Prisme.ai services, and database-specific monitoring tools.

Alert Thresholds

Set up alert thresholds to trigger scaling actions, with warnings at 60% resource utilization, critical alerts at 80% utilization, performance degradation above 50%, and error rate increases above 10%.

Scaling Dashboards

Create focused scaling dashboards showing resource usage trends, traffic patterns, database performance metrics, and storage growth rates to support scaling decisions.

Scaling Best Practices

Gradual Implementation

Implement scaling changes gradually rather than making large adjustments at once. Increase replicas by 50-100% increments, monitor effects before further scaling, allow systems to stabilize between changes, and document performance impacts for future reference.

Testing and Validation

Test scaling changes in non-production environments using load testing tools like JMeter, k6, or Locust. Simulate real-world usage patterns, test both scaling up and down scenarios, and verify application behavior during scaling events.

Automation

Use automation for routine scaling operations including Horizontal Pod Autoscalers, cluster autoscaling, scheduled scaling for predictable patterns, and anomaly detection for unexpected load increases.

Documentation

Maintain clear documentation for scaling operations including standard operating procedures, emergency scaling runbooks, performance baselines, and historical scaling decisions with their outcomes.

Next Steps

Continue with platform operations by implementing regular updates to keep your platform current, and establish comprehensive backup and restore strategies to protect your data and ensure business continuity.

​Scaling Approaches

​Horizontal Scaling

​Vertical Scaling

​When to Scale

​Performance Indicators

​Growth Indicators

​Preventative Scaling

​Recovery Objectives

​Scaling Core Components

​API & Worker Services

​Product Modules

​Ingress & Networking

​Resource Optimization

​Requests and Limits Configuration

​Service Crawler Optimization

​Internal Cluster Communication

​Runtime Configuration

​Readiness Probe Tuning

​Throttle Management

​API Gateway Timeout Adjustment

​Event Volume Management

​Database operations