Performance and sizing

This guide provides sizing recommendations and performance characteristics to help you plan Virtual MCP Server (vMCP) deployments.

Resource requirements

Baseline resources

Minimal deployment (development/testing):

  • CPU: 100m (0.1 cores)
  • Memory: 128Mi

Production deployment (recommended):

  • CPU: 500m (0.5 cores)
  • Memory: 512Mi
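Expressed as Kubernetes resource settings, the production baseline above might look like this sketch (the container name `vmcp` and the `podTemplateSpec` field follow the examples later in this guide; the limit values are illustrative headroom, not a vMCP requirement):

```yaml
spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: 500m     # production baseline: 0.5 cores
              memory: 512Mi # production baseline
            limits:
              cpu: '1'      # headroom above the request; tune for your traffic
              memory: 1Gi
```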

Scaling factors

Resource needs increase based on:

  • Number of backends: Each backend adds minimal overhead (~10-20Mi memory)
  • Request volume: Higher traffic requires more CPU for request processing
  • Data volume: Large inputs and tool responses increase memory usage and network bandwidth
  • Composite tool complexity: Workflows with many parallel steps consume more memory
  • Token caching: Authentication token cache grows with unique client count

Backend scale recommendations

vMCP performs well across different scales:

Backend count | Use case                             | Notes
1-5           | Small teams, focused toolsets        | Minimal resource overhead
5-15          | Medium teams, diverse tools          | Recommended range for most use cases
15-30         | Large teams, comprehensive toolsets  | Increase health check interval
30+           | Enterprise-scale deployments         | Consider multiple vMCP instances

Scaling strategies

Horizontal scaling

Horizontal scaling is possible for stateless use cases where MCP sessions can be resumed on any vMCP instance. However, stateful backends (e.g., Playwright browser sessions, database connections) complicate horizontal scaling because requests must be routed to the same vMCP instance that established the session.

Session considerations:

  • vMCP uses MCP session IDs to cache routing tables and maintain consistency
  • Some backends maintain persistent state that requires session affinity
  • Clients must be able to disconnect and resume sessions for horizontal scaling to work reliably

When horizontal scaling works well:

  • Stateless backends (fetch, search, read-only operations)
  • Short-lived sessions with no persistent state
  • Use cases where session affinity can be reliably maintained

When horizontal scaling is challenging:

  • Stateful backends (Playwright, database connections, file system operations)
  • Long-lived sessions requiring persistent state
  • Complex session interdependencies

Configuration

To scale horizontally, increase replicas in the Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmcp-my-vmcp
spec:
  replicas: 3 # Horizontal scaling
  # ... rest of deployment spec

Backend scaling

When scaling vMCP horizontally, the backend MCP servers will also see increased load. Ensure your backend deployments (MCPServer resources) are also scaled appropriately to handle the additional traffic.

Session affinity is required when using multiple replicas. Clients must be routed to the same vMCP instance for the duration of their session. Configure based on your deployment:

  • Kubernetes Service: Use sessionAffinity: ClientIP for basic client-to-pod stickiness
    • Note: This is IP-based and may not work well behind proxies or with changing client IPs
  • Ingress Controller: Configure cookie-based sticky sessions (recommended)
    • nginx: Use nginx.ingress.kubernetes.io/affinity: cookie
    • Other controllers: Consult your Ingress controller documentation
  • Gateway API: Use appropriate session affinity configuration based on your Gateway implementation

Session affinity recommendations
  • For stateless backends: Cookie-based sticky sessions work well and provide reliable routing through proxies
  • For stateful backends (Playwright, databases): Consider vertical scaling or dedicated vMCP instances instead of horizontal scaling with session affinity, as session resumption may not work reliably
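As an illustration of the first two options above, here are two sketches: a Service with `ClientIP` affinity and an Ingress with cookie-based stickiness using the standard ingress-nginx affinity annotations (resource names, ports, and the hostname are placeholders; adapt them to your deployment):

```yaml
# Option 1: IP-based stickiness at the Service level
apiVersion: v1
kind: Service
metadata:
  name: vmcp-my-vmcp
spec:
  selector:
    app: vmcp-my-vmcp
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800 # align with your expected session lifetime
  ports:
    - port: 8080
---
# Option 2: cookie-based stickiness with ingress-nginx (recommended)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vmcp-my-vmcp
  annotations:
    nginx.ingress.kubernetes.io/affinity: cookie
    nginx.ingress.kubernetes.io/session-cookie-name: vmcp-affinity
    nginx.ingress.kubernetes.io/session-cookie-max-age: '10800'
spec:
  rules:
    - host: vmcp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vmcp-my-vmcp
                port:
                  number: 8080
```

Cookie-based affinity survives client IP changes and proxies, which is why it's the recommended option above.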

Vertical scaling

Vertical scaling (increasing CPU/memory per instance) provides the simplest scaling story and works for all use cases, including stateful backends. However, it has limits and may not provide high availability since a single instance failure affects all sessions.

Recommended approach:

  • Start with vertical scaling for simplicity
  • Add horizontal scaling with session affinity when vertical limits are reached
  • For stateful backends, consider dedicated vMCP instances per team/use case

When to scale

Scale up (increase resources)

Increase CPU and memory when you observe:

  • High CPU usage (>70% sustained) during normal operations
  • Memory pressure or OOM (out-of-memory) kills
  • Slow response times (>1 second) for simple tool calls
  • Health check timeouts or frequent backend unavailability

Scale out (increase replicas)

Add more vMCP instances when:

  • CPU usage remains high despite increasing resources
  • You need higher availability and fault tolerance
  • Request volume exceeds capacity of a single instance
  • You want to distribute load across multiple availability zones
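If you automate scale-out, a standard HorizontalPodAutoscaler targeting the vMCP Deployment is one way to do it. This sketch assumes the Deployment name from the earlier example and reuses the ~70% CPU threshold mentioned above:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vmcp-my-vmcp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vmcp-my-vmcp
  minReplicas: 2   # two replicas for availability
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # scale out before sustained CPU pressure
```

Keep in mind that running multiple replicas still requires session affinity, as described in the horizontal scaling section.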

Scale configuration

Adjust operational settings when scaling:

Large backend counts (15+)

When managing 15 or more backends, reduce health check frequency to minimize overhead. Increase the healthCheckInterval to 60 seconds and raise the unhealthyThreshold to 5 for better stability with more backends.

spec:
  config:
    operational:
      failureHandling:
        healthCheckInterval: 60s
        unhealthyThreshold: 5

High request volumes

For deployments handling high request volumes, allocate more CPU and memory resources. Start with 1 CPU and 1Gi memory for requests, with limits of 2 CPUs and 2Gi memory.

spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: '1'
              memory: 1Gi
            limits:
              cpu: '2'
              memory: 2Gi

Performance optimization

Backend discovery

Backend discovery performance improves when you use inline mode for static configurations, eliminating Kubernetes API queries during startup. Group related tools in fewer servers to reduce the total backend count, and minimize the number of tools and resources each backend exposes to speed up initialization.

Authentication

Token caching is enabled by default and significantly reduces authentication overhead. For internal or trusted backends, consider using unauthenticated mode to eliminate authentication latency entirely. Configure appropriate token expiration times in your OIDC provider to balance security with cache efficiency.

Composite workflows

Design workflow steps to minimize dependencies between them, allowing vMCP to execute more steps in parallel. Use onError.action: continue for non-critical steps so that individual failures don't block the entire workflow. Set explicit timeouts on steps that call slow backends to prevent workflow delays.
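To illustrate these ideas, a workflow fragment might look like the following sketch. The `onError.action: continue` setting comes from the guidance above; the `timeout` field, the `${steps.*}` reference syntax, and the tool names are assumptions for illustration and should be checked against the vMCP workflow reference:

```yaml
steps:
  # Independent steps with no shared inputs can execute in parallel
  - id: fetch-docs
    tool: docs-server.search
    timeout: 10s    # explicit timeout for a slow backend (assumed field name)
    onError:
      action: continue # non-critical: a failure here doesn't block the workflow
  - id: fetch-issues
    tool: tracker.list_issues
    timeout: 10s
    onError:
      action: continue
  # This step depends on both results, so it runs after they complete
  - id: summarize
    tool: llm.summarize
    inputs:
      docs: ${steps.fetch-docs.output}   # assumed reference syntax
      issues: ${steps.fetch-issues.output}
```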

Monitoring

Use vMCP's Telemetry and metrics integration to track backend request latency and error rates, workflow execution times and failure patterns, and health check success rates. These metrics help identify bottlenecks and guide optimization efforts.