Alpha · read-only diagnostics for Kubernetes inference

Diagnose inference latency before you scale GPUs.

P95 Labs is a read-only diagnostics layer for Kubernetes inference workloads. It correlates cluster state, Prometheus telemetry, queue depth, and GPU signals to explain why p95 is breaking — and which knob to inspect first.

Read-only · RBAC by default
vLLM · Triton · TGI · KServe
Helm + CLI · 5-min install

inference-gateway / latency (p95, p50)

p95 regression · p95: 78ms · p50: 19ms · GPU util: 41%

Recommendation: queue depth rising while GPU utilisation remains low — inspect batching and concurrency limits before scaling.

bash — p95 diagnose

$ p95 diagnose --ns inference --since 1h
› scanning 4 deployments · 12 replicas
! p95 regression on inference-gateway (78ms, +210%)
→ queue rising while GPU util 41%: inspect batching

The problem

Inference problems hide between layers.

Latency, queueing, autoscaling, and GPU behaviour are tightly coupled — but they're measured in separate tools that don't talk to each other.

GPU utilisation looks low, but latency is high

Idle-looking GPUs and slow responses at the same time point to a bottleneck somewhere other than raw compute.

Autoscaling reacts after the queue is already burning

By the time replicas spin up, requests have queued or timed out, and p95 has already broken for real users.

Long prompts quietly destroy p95

A small share of long-context requests can dominate tail latency without ever showing up in averages.
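
A small worked example makes the effect concrete. The numbers below are invented for illustration (94 short requests at 20 ms and 6 long-context requests at 400 ms), and the nearest-rank percentile helper is just one simple definition among several:

```python
import statistics

def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceiling of q% of n
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds, not measurements from a real workload.
latencies_ms = [20.0] * 94 + [400.0] * 6

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)

print(f"mean={mean_ms:.1f}ms p50={p50_ms:.0f}ms p95={p95_ms:.0f}ms")
# prints: mean=42.8ms p50=20ms p95=400ms
```

In this toy data set the median does not move at all and the mean roughly doubles, while p95 is twenty times the typical request.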

Batching and concurrency are tuned by guesswork

Without correlated signals, batch size and concurrency limits get set by intuition and rarely revisited.

Existing dashboards show symptoms, not recommended actions

Graphs tell you that p95 spiked. They rarely tell you which knob to inspect first, or why.

The product

Read-only intelligence for production inference.

P95 Labs sits beside your cluster, not inside the request path. It reads what you already emit and turns it into ranked, explained recommendations.

system architecture

read-only boundary

Your Kubernetes cluster · data plane

Inference workloads

vLLM · Triton · TGI · KServe

Prometheus

latency · queue · GPU exporters

scrape · GET only

P95 Labs · control plane

Collector + TimescaleDB

high-cardinality time series

Correlation engine

transparent rule set

Ranked recommendations

explained + scoped

Recommendations are surfaced to your team. P95 Labs never writes to the cluster — a human reviews and applies every change.

Kubernetes workload discovery

Automatically maps inference deployments, replicas, and runtimes across scoped namespaces.

Prometheus metrics ingestion

Pulls the latency, queue depth, and GPU signals you already export — no new agents in the request path.
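
As a rough sketch of what a GET-only pull can look like, the snippet below builds a URL for Prometheus's instant-query endpoint (`/api/v1/query`). The in-cluster address and the helper name are assumptions made for illustration, the metric names come from the illustrative rules on this page, and none of this is a description of the P95 Labs collector itself:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_instant_query_url(base_url, promql):
    """Build a GET URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

# Hypothetical Prometheus address; replace with your own service URL.
url = build_instant_query_url(
    "http://prometheus.monitoring.svc:9090",
    "queue_depth > 0 and gpu_util < 0.5",
)
print(url)

# Round-trip check: the encoded query decodes back to the original PromQL.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

Because the request is a plain HTTP GET against an API you already run, nothing new has to sit in the request path of your inference traffic.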

TimescaleDB-backed telemetry

Time-series storage built for high-cardinality inference metrics and historical correlation.

Rule-based recommendations

Transparent, inspectable rules — every recommendation explains the signals behind it.

CLI and Helm deployment

Install with a Helm chart, then drive everything from a CLI that fits existing platform workflows.

No autonomous production mutation

P95 Labs never changes your cluster on its own. It recommends; your team approves and acts.

Recommendations

Signals correlated into actions.

Each recommendation includes severity, the related metric, a plain-language explanation, and a concrete next step to inspect. Examples shown are illustrative.
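
One way to picture that shape is a small record type plus a severity sort. The field names, the `low` severity level, and the ranking rule are assumptions for this sketch, not the product's actual schema:

```python
from dataclasses import dataclass

# Assumed ordering; the page only shows "high" and "medium".
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}

@dataclass(frozen=True)
class Recommendation:
    rule_id: str
    severity: str
    metric: str
    explanation: str
    next_step: str

def rank(recommendations):
    """Sort most severe first; sorted() is stable, so ties keep their input order."""
    return sorted(recommendations, key=lambda r: SEVERITY_ORDER[r.severity])

recs = [
    Recommendation(
        "P95-HPA-007", "medium",
        "hpa_desired_replicas - hpa_ready_replicas > 0",
        "Desired replicas climbed before ready replicas caught up.",
        "Inspect scale-up thresholds.",
    ),
    Recommendation(
        "P95-QUEUE-001", "high",
        "queue_depth > 0 and gpu_util < 0.5",
        "Requests are backing up before they reach the GPU.",
        "Inspect max concurrent requests and batch settings.",
    ),
]

for rec in rank(recs):
    print(rec.rule_id, rec.severity, "->", rec.next_step)
```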

P95-QUEUE-001 · severity: high

Queue depth rising while GPU utilisation remains low

query

queue_depth > 0 and gpu_util < 0.5

Requests are backing up before they reach the GPU. Compute is available, so the limit is in batching, concurrency, or admission — not capacity.

Next step: inspect max concurrent requests and batch settings before adding replicas or GPUs.
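
A minimal sketch of how a rule like this could be evaluated over a window of samples. The 0.5 threshold comes from the illustrative query above; the "rising" test (last sample above the first) and the function name are assumptions, and a real rule would likely use a more robust trend estimate:

```python
def queue_rising_gpu_idle(queue_depth, gpu_util, util_threshold=0.5):
    """True when the queue is non-empty and growing while mean GPU utilisation is below the threshold."""
    if len(queue_depth) < 2 or not gpu_util:
        return False  # not enough data to judge a trend
    queue_backed_up = min(queue_depth) > 0
    queue_rising = queue_depth[-1] > queue_depth[0]
    gpu_idle = sum(gpu_util) / len(gpu_util) < util_threshold
    return queue_backed_up and queue_rising and gpu_idle

# Hypothetical samples: queue grows while the GPU sits around 41%.
print(queue_rising_gpu_idle([2, 5, 9, 14], [0.40, 0.42, 0.41, 0.41]))  # True
# Same queue, but the GPU is busy: this points at capacity, so the rule stays quiet.
print(queue_rising_gpu_idle([2, 5, 9, 14], [0.92, 0.95, 0.93, 0.94]))  # False
```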

P95-TAIL-004 · severity: high

p95 latency regression correlated with request burst

query

histogram_quantile(0.95, …) ↑ with rate(requests[1m])

Tail latency rose in lockstep with a traffic burst while p50 stayed flat — a sign that a small fraction of requests are absorbing the spike.

Next step: inspect burst handling and per-request token limits; check for long-context outliers.
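
The same idea can be sketched as a comparison between a baseline window and the current one. The thresholds (p95 up by at least 50%, p50 within 10%) and the baseline values are assumptions chosen for illustration; only the current p95 of 78 ms and p50 of 19 ms echo the example figures on this page:

```python
def pct_change(baseline, current):
    """Percentage change from baseline to current."""
    return (current - baseline) / baseline * 100.0

def tail_regression(baseline, current, p95_rise_pct=50.0, p50_flat_pct=10.0):
    """True when traffic rose, p95 rose sharply, and p50 stayed roughly flat."""
    burst = current["rps"] > baseline["rps"]
    p95_up = pct_change(baseline["p95_ms"], current["p95_ms"]) >= p95_rise_pct
    p50_flat = abs(pct_change(baseline["p50_ms"], current["p50_ms"])) <= p50_flat_pct
    return burst and p95_up and p50_flat

# Hypothetical windows; the baseline numbers are made up.
baseline = {"p50_ms": 18.0, "p95_ms": 25.0, "rps": 100.0}
current = {"p50_ms": 19.0, "p95_ms": 78.0, "rps": 240.0}

print(tail_regression(baseline, current))  # True
print(round(pct_change(baseline["p95_ms"], current["p95_ms"])))  # 212
```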

P95-HPA-007 · severity: medium

Autoscaler lag detected

query

hpa_desired_replicas - hpa_ready_replicas > 0

Desired replicas climbed several intervals before ready replicas caught up. The queue was already draining slowly during the gap.

Next step: inspect scale-up thresholds and warm-pool / preStop timing to shorten reaction time.
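
To make "several intervals" measurable, one simple approach is to count the longest run of consecutive samples where desired replicas exceed ready replicas. The function name and the sample series are hypothetical:

```python
def longest_lag(desired, ready):
    """Longest run of consecutive intervals in which desired replicas exceed ready replicas."""
    longest = current = 0
    for want, have in zip(desired, ready):
        current = current + 1 if want > have else 0
        longest = max(longest, current)
    return longest

# Hypothetical per-interval replica counts during a scale-up from 4 to 8.
desired = [4, 6, 8, 8, 8, 8]
ready = [4, 4, 4, 6, 8, 8]

print(longest_lag(desired, ready))  # 3
```

Multiplying the result by the scrape interval gives a rough lag duration to compare against how quickly the queue was growing over the same window.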

P95-KV-012 · severity: medium

GPU memory pressure affecting long-context requests

query

kv_cache_utilization > 0.9 and seq_len_p99 ↑

KV cache utilisation is approaching limits during long-context windows, which can trigger eviction and recompute under load.

Next step: inspect max sequence length and KV cache allocation relative to model footprint.
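
A back-of-the-envelope estimate helps when sizing that allocation. For standard attention, each cached token stores one key and one value vector per layer per KV head. The model dimensions below are hypothetical, and the estimate ignores runtime overheads such as block granularity or fragmentation:

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache per token: 2 (key and value) x layers x KV heads x head dim x element size."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit elements.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)
context_tokens = 32_768
per_sequence_gib = per_token * context_tokens / 1024**3

print(per_token)         # 131072 bytes, i.e. 128 KiB per token
print(per_sequence_gib)  # 4.0 GiB for one full-length sequence
```

Under these assumptions a handful of concurrent full-length sequences is enough to consume tens of GiB, which is why long-context windows and high KV cache utilisation tend to show up together.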

Security

Read-only is enforced by RBAC, not by promise.

Production inference is sensitive. P95 Labs requests only the read verbs it needs (get, list, watch); the bundled role grants no create, update, patch, or delete permissions.

clusterrole.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: p95-readonly
rules:
  - apiGroups: ["", "apps"]
    resources: [pods, deployments, nodes]
    verbs: [get, list, watch]
# no create / update / patch / delete
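
If you want to verify a role like this yourself before installing, a few lines are enough to flag any non-read verb. The rules are transcribed by hand from the snippet above into a Python structure; the helper name is hypothetical, and in practice you would load the rendered manifest rather than retype it:

```python
READ_VERBS = {"get", "list", "watch"}

def non_read_verbs(rules):
    """Collect every verb that is not get, list, or watch (a '*' wildcard is flagged too)."""
    found = set()
    for rule in rules:
        found.update(v for v in rule.get("verbs", []) if v not in READ_VERBS)
    return found

# Mirrors the ClusterRole rules shown above.
bundled_rules = [
    {
        "apiGroups": ["", "apps"],
        "resources": ["pods", "deployments", "nodes"],
        "verbs": ["get", "list", "watch"],
    },
]

print(non_read_verbs(bundled_rules))  # set()

# A role that slipped in a write verb would be reported.
tampered = bundled_rules + [{"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["patch"]}]
print(non_read_verbs(tampered))  # {'patch'}
```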

Read-only by default

P95 Labs observes. It does not mutate workloads or configuration on its own.

No sensitive payloads collected

No prompts, request bodies, or model outputs are collected by default — only operational telemetry.

Least-privilege Kubernetes RBAC

Ships with tightly scoped RBAC requesting only the read permissions it needs.

Scoped namespace deployment

Deploy into the namespaces you choose. Nothing outside that scope is touched.

Human approval before any change

Recommendations are surfaced for your team. A human always approves before action.

Clear uninstall path

Remove the Helm release and RBAC cleanly, with no residual controllers left behind.

Design partner program

Working with a handful of teams running real inference.

The alpha is founder-led and read-only. We're not taking payment yet — we're looking for teams who feel the p95 pain and want sharper diagnostics before adding GPU capacity.

vLLM · Triton · TGI · KServe · Custom runtimes

What you get

Hands-on diagnostics on your real workloads
Direct line to the founder, no sales layer
Influence over the rule set and roadmap

What we look for

Inference running on Kubernetes
Prometheus already exporting metrics
A read-only namespace to deploy into