How it works

How P95 Labs works.

A read-only diagnostics and recommendation layer that reads what your cluster already emits and turns it into ranked, explained actions.

Current status

P95 Labs is in founder-led alpha. The current build includes Kubernetes workload discovery, Prometheus ingestion, TimescaleDB-backed telemetry, CLI access, Helm deployment, and rule-based recommendations. The next validation target is real vLLM workloads.

Data flow

Workload discovery → Metrics ingestion → Telemetry storage → Correlation engine → Recommendations → Human review

1. Workload discovery

Maps inference deployments, replicas, and runtimes (vLLM, Triton, TGI, KServe, or custom) across scoped namespaces. Read-only, least-privilege RBAC.
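
As an illustration of what "read-only, least-privilege" can mean in practice, the sketch below builds a namespaced Kubernetes Role that grants only the get, list, and watch verbs on workload objects. The role name, the namespace, and the exact resource list are assumptions made for this example, not the permissions the P95 Labs Helm chart actually requests.

```python
import json

# The only verbs a read-only discovery role should need.
READ_ONLY_VERBS = {"get", "list", "watch"}


def read_only_role(namespace: str, name: str = "p95-discovery-readonly") -> dict:
    """Build a namespaced Role manifest limited to read verbs.

    The default name and the resource list are hypothetical.
    """
    verbs = sorted(READ_ONLY_VERBS)
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",  # Role, not ClusterRole: access stays scoped to one namespace
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {"apiGroups": ["apps"],
             "resources": ["deployments", "replicasets", "statefulsets"],
             "verbs": verbs},
            {"apiGroups": [""], "resources": ["pods"], "verbs": verbs},
        ],
    }


def is_read_only(role: dict) -> bool:
    """Return True if no rule grants a verb outside get/list/watch."""
    return all(set(rule["verbs"]) <= READ_ONLY_VERBS for rule in role["rules"])


role = read_only_role("inference")
print(json.dumps(role, indent=2))
print("read-only:", is_read_only(role))
```

A check like is_read_only can run in CI against whatever RBAC manifests you are asked to install, so a write verb such as patch or delete is caught before it reaches the cluster.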

2. Metrics ingestion

Ingests the latency, queue depth, GPU, and cluster signals you already export to Prometheus. Nothing is inserted into the request path, and no prompts or outputs are collected by default.
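
To make the ingestion step concrete, here is a minimal sketch of reading one signal through the standard Prometheus HTTP API instant-query endpoint (/api/v1/query). The Prometheus address, the example metric name, and the choice to key results by the pod label are assumptions for illustration. This is not P95 Labs' actual ingestion code.

```python
import json
from urllib.parse import urlencode

# Assumption: an in-cluster Prometheus address; replace with your own.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a read-only instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body: str) -> dict:
    """Map each series' `pod` label to its latest sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    samples = {}
    for series in payload["data"]["result"]:
        pod = series["metric"].get("pod", "unknown")
        _timestamp, value = series["value"]  # Prometheus sends the value as a string
        samples[pod] = float(value)
    return samples


# Example metric name only; use whichever queue-depth gauge your runtime exports.
url = instant_query_url(PROMETHEUS_URL, "vllm:num_requests_waiting")

# A hand-written response in the shape the API returns for a vector result.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "llm-0"}, "value": [1700000000.0, "12"]},
            {"metric": {"pod": "llm-1"}, "value": [1700000000.0, "3"]},
        ],
    },
})
print(url)
print(parse_vector(sample_body))
```

The sketch only issues GET-style queries against metrics that already exist, which is why this kind of ingestion adds nothing to the request path.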

3. Telemetry storage

Signals are stored in TimescaleDB, a time-series database built on PostgreSQL. It handles high-cardinality inference metrics and keeps enough history to correlate signals across bursts and deploys.

4. Correlation & recommendations

A transparent, rule-based engine correlates signals (for example, rising queue depth against low GPU utilisation) and emits ranked recommendations, each with an explanation of the signals behind it.
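
The queue-depth example above can be sketched as a single rule plus a ranking step. Everything named here (the Signals and Recommendation types, the thresholds, the scoring formula, and the wording of the suggested action) is hypothetical and chosen only to show the shape of a transparent rule. The real engine's rules may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Signals:
    workload: str
    queue_depth_start: float  # queue depth at the start of the window
    queue_depth_end: float    # queue depth at the end of the window
    gpu_utilisation: float    # mean over the window, 0.0 to 1.0


@dataclass
class Recommendation:
    workload: str
    action: str
    explanation: str
    score: float


# Hypothetical thresholds, picked for illustration.
QUEUE_GROWTH_MIN = 2.0
GPU_UTIL_LOW = 0.40


def queue_rising_gpu_idle(s: Signals) -> Optional[Recommendation]:
    """Fire when the queue at least doubles while the GPUs sit mostly idle."""
    if s.queue_depth_start <= 0:
        return None
    growth = s.queue_depth_end / s.queue_depth_start
    if growth < QUEUE_GROWTH_MIN or s.gpu_utilisation >= GPU_UTIL_LOW:
        return None
    return Recommendation(
        workload=s.workload,
        action="Review batching and concurrency limits before adding replicas",
        explanation=(f"queue depth grew {growth:.1f}x while mean GPU "
                     f"utilisation was {s.gpu_utilisation:.0%}"),
        score=growth * (1.0 - s.gpu_utilisation),
    )


RULES = [queue_rising_gpu_idle]


def recommend(all_signals: List[Signals]) -> List[Recommendation]:
    """Run every rule on every workload and rank the hits, highest score first."""
    found = []
    for s in all_signals:
        for rule in RULES:
            rec = rule(s)
            if rec is not None:
                found.append(rec)
    return sorted(found, key=lambda r: r.score, reverse=True)


window = [
    Signals("chat-api", 4, 20, 0.25),
    Signals("embeddings", 10, 12, 0.20),  # queue barely moved: no hit
    Signals("summariser", 5, 15, 0.30),
    Signals("batch-eval", 5, 50, 0.90),   # GPUs busy: no hit
]
for rec in recommend(window):
    print(f"{rec.workload}: {rec.action} ({rec.explanation})")
```

Because each rule is a plain function that returns its own explanation, a reviewer can read exactly why a recommendation fired and which numbers it used.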

5. Deployment

Install via a Helm chart and operate through a CLI that fits existing platform workflows. The control plane runs outside the request path.

6. Action model

P95 Labs never mutates production on its own. Recommendations are surfaced for your team, and a human approves before any change is made.
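
One way to picture this approval gate is a change record that cannot move forward until a named person signs off. The ProposedChange type and release_for_rollout function are hypothetical names for this sketch. Note that even after approval the function only hands the change back to the team and never touches the cluster itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProposedChange:
    summary: str
    approved_by: Optional[str] = None  # stays None until a human signs off

    def approve(self, reviewer: str) -> None:
        """Record the reviewer; an empty name does not count as approval."""
        if not reviewer.strip():
            raise ValueError("an approval needs a named reviewer")
        self.approved_by = reviewer


def release_for_rollout(change: ProposedChange) -> str:
    """Hand an approved change to the team's own rollout process.

    Refuses unapproved changes and applies nothing by itself.
    """
    if change.approved_by is None:
        raise PermissionError("change has not been approved by a human reviewer")
    return f"{change.summary} (approved by {change.approved_by})"


change = ProposedChange("Raise max batch size for chat-api")
try:
    release_for_rollout(change)
except PermissionError as err:
    print("blocked:", err)

change.approve("alice")
print(release_for_rollout(change))
```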

Want to evaluate it on your stack?

The alpha is founder-led and read-only. We're onboarding a small number of teams running real inference workloads.

Join design partner program