How it works
How P95 Labs works.
P95 Labs is a read-only diagnostics and recommendation layer: it takes what your cluster already emits and turns it into ranked, explained actions.
Current status
P95 Labs is in founder-led alpha. The current build includes Kubernetes workload discovery, Prometheus ingestion, TimescaleDB-backed telemetry, CLI access, Helm deployment, and rule-based recommendations. The next validation target is real vLLM workloads.
Data flow
1. Workload discovery
Maps inference deployments, replicas, and runtimes (vLLM, Triton, TGI, KServe, or custom) across scoped namespaces. Read-only, least-privilege RBAC.
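
As a rough illustration of the discovery step, the sketch below classifies each deployment's runtime from its container image names and drops anything outside the scoped namespaces. The image-name hints, the `detect_runtime` and `discover` helpers, and the plain-dict input shape are all assumptions made for this example, not P95 Labs' actual detection logic. A real implementation would likely also look at labels, custom resources, and exposed metrics.

```python
from dataclasses import dataclass

# Hypothetical substring hints (assumption); real detection needs more signals.
RUNTIME_HINTS = {
    "vllm": "vLLM",
    "tritonserver": "Triton",
    "text-generation-inference": "TGI",
    "kserve": "KServe",
}


@dataclass
class Workload:
    namespace: str
    name: str
    replicas: int
    runtime: str


def detect_runtime(images):
    """Return the first runtime whose hint appears in any image name."""
    for image in images:
        lowered = image.lower()
        for hint, runtime in RUNTIME_HINTS.items():
            if hint in lowered:
                return runtime
    return "custom"


def discover(deployments, scoped_namespaces):
    """Map deployment records to Workload entries, skipping unscoped namespaces.

    `deployments` is assumed to be a list of dicts that were already fetched
    through read-only list calls; nothing here writes to the cluster.
    """
    workloads = []
    for d in deployments:
        if d["namespace"] not in scoped_namespaces:
            continue
        runtime = detect_runtime(d["images"])
        workloads.append(Workload(d["namespace"], d["name"], d["replicas"], runtime))
    return workloads
```

Anything that matches no hint falls back to "custom", which mirrors the catch-all category in the list of runtimes above.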
2. Metrics ingestion
Ingests the latency, queue depth, GPU, and cluster signals you already export to Prometheus. Nothing is inserted into the request path, and no prompts or outputs are collected by default.
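
To make the ingestion step concrete, here is a minimal sketch of reading a signal through Prometheus' HTTP API: it builds a `/api/v1/query_range` URL and flattens the matrix-shaped JSON response into samples. The base URL, the helper names, and the example metric name are assumptions for illustration; the metrics available depend on what your exporters publish, and this is not necessarily how P95 Labs itself ingests data.

```python
import json
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, start, end, step="30s"):
    """Build a Prometheus range-query URL (read-only GET, no request-path hooks)."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(payload):
    """Flatten a Prometheus matrix response into (labels, timestamp, value) tuples."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    samples = []
    for series in body["data"]["result"]:
        # Each sample is [unix_timestamp, "value-as-string"].
        for ts, value in series["values"]:
            samples.append((series["metric"], float(ts), float(value)))
    return samples


# Example: the metric name below is an assumed queue-depth gauge.
url = build_range_query_url(
    "http://prometheus:9090",  # hypothetical in-cluster address
    "vllm:num_requests_waiting",
    start=1700000000,
    end=1700000600,
)
```

Only metric samples and their labels pass through this path; request bodies never appear in a Prometheus response, which is consistent with not collecting prompts or outputs.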
3. Telemetry storage
Signals are stored in TimescaleDB, a time-series database suited to high-cardinality inference metrics and to correlating history across traffic bursts and deploys.
4. Correlation & recommendations
A transparent, rule-based engine correlates signals (for example, rising queue depth against low GPU utilisation) and emits ranked recommendations, each explaining the signals behind it.
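
The queue-depth example can be sketched as a single rule plus a ranking pass. Everything in this block is an assumption made for illustration: the `Window` and `Recommendation` types, the 40% GPU threshold, and the scoring formula are not taken from P95 Labs' rule set. The point is only to show how a rule can carry its own evidence so the output stays explainable.

```python
from dataclasses import dataclass


@dataclass
class Window:
    workload: str
    queue_depth: list  # samples, oldest to newest
    gpu_util: list     # percent, 0 to 100


@dataclass
class Recommendation:
    workload: str
    action: str
    score: float
    evidence: dict


def queue_rising_gpu_idle(w, gpu_threshold=40.0):
    """Hypothetical rule: queue depth grows while mean GPU utilisation stays low."""
    if len(w.queue_depth) < 2 or not w.gpu_util:
        return None
    growth = w.queue_depth[-1] - w.queue_depth[0]
    mean_gpu = sum(w.gpu_util) / len(w.gpu_util)
    if growth <= 0 or mean_gpu >= gpu_threshold:
        return None
    # Illustrative score: more queue growth and more idle headroom rank higher.
    score = growth * (gpu_threshold - mean_gpu) / gpu_threshold
    return Recommendation(
        workload=w.workload,
        action="Review batching or concurrency limits: requests queue while GPUs are underused",
        score=round(score, 2),
        evidence={"queue_growth": growth, "mean_gpu_util": round(mean_gpu, 1)},
    )


RULES = [queue_rising_gpu_idle]


def recommend(windows):
    """Run every rule over every window and return matches, highest score first."""
    recs = []
    for w in windows:
        for rule in RULES:
            rec = rule(w)
            if rec is not None:
                recs.append(rec)
    return sorted(recs, key=lambda r: r.score, reverse=True)
```

Because each rule is a plain function that returns its evidence alongside the action, a reader can trace any recommendation back to the numbers that triggered it.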
5. Deployment
Install via a Helm chart and operate through a CLI that fits existing platform workflows. The control plane runs outside the request path.
6. Action model
P95 Labs never mutates production on its own. Recommendations are surfaced for your team, and a human approves before any change is made.
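
One way to picture this action model is an approval gate: a proposed change cannot run until a named person has signed off. The `PendingChange` type, `apply_change` function, and `executor` callback below are hypothetical names invented for this sketch; they illustrate the principle and make no claim about how changes are actually carried out on your stack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingChange:
    recommendation_id: str
    description: str
    approved_by: Optional[str] = None

    def approve(self, reviewer: str) -> None:
        """Record the human who signed off on this change."""
        if not reviewer:
            raise ValueError("an approver identity is required")
        self.approved_by = reviewer


def apply_change(change, executor):
    """Run `executor` only for changes that a human has approved."""
    if change.approved_by is None:
        raise PermissionError("change has not been approved by a human")
    return executor(change)
```

With this shape, an unapproved change fails loudly rather than being applied silently, and the approver's identity stays attached to the change for later review.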
Want to evaluate it on your stack?
The alpha is founder-led and read-only. We're onboarding a small number of teams running real inference workloads.
Join design partner program