1.2 – Core metrics and formulas
This document presents a concise reference of the main formulas used in application and system performance engineering.
These formulas formalize the concepts introduced in:
They should be read as a complement to the conceptual model, not in isolation.
They provide the quantitative basis used to reason about system behavior, validate hypotheses, and interpret performance test results.
Table of Contents
- 1.2.1 Little’s Law (system-level concurrency)
- 1.2.2 Utilization Law (resource-level busy time)
- 1.2.3 Service time vs response time (queueing)
- 1.2.4 Service Demand (visits × service time)
- 1.2.5 Throughput
- 1.2.6 Error rate
- 1.2.7 Percentiles (p50, p95, p99)
- 1.2.8 Empirical CDF (threshold → percentage)
- 1.2.9 Long-tail latency (what it is)
- 1.2.10 Quick checklist (what to measure in tests)
Notation (typical)
| Symbol | Definition |
|---|---|
| X or λ | throughput / arrival rate (requests per second) |
| R or W | response time / time in system (seconds) |
| S | service time at a resource (seconds per request) |
| U | utilization of a resource (0–1) |
| L | average concurrency / in-flight requests (count) |
| V | average number of visits to a resource per request |
| D | service demand on a resource (seconds per request) |
This notation is used consistently throughout the guide and allows formulas to be applied uniformly across different contexts.
1.2.1 Little’s Law (system-level concurrency)
Definition
This law relates average concurrency to throughput and time in system.
Formula
Formula: L = λ × W
Where
- L = average number of requests in the system (in-flight / concurrency)
- λ = arrival rate / throughput (requests/s)
- W = average time in system (s) (often the average end-to-end response time)
Practical meaning
If throughput and average response time are known, the number of requests that are simultaneously “in flight” in the system can be estimated.
This makes Little’s Law one of the most useful tools for reasoning about system load and concurrency.
Example
If λ = 200 req/s and W = 0.15 s:
L = 200 × 0.15 = 30
On average, there are about 30 requests in flight.
Practical interpretation
Little’s Law links three observable quantities:
- throughput
- latency
- concurrency
This allows:
- estimating concurrency from measurements
- validating system behavior
- detecting inconsistencies in metrics
This law is extensively used in performance engineering, capacity planning, and system diagnostics.
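As a minimal sketch, Little's Law can be used both to estimate concurrency and to sanity-check a measured value. The function names and the 10% tolerance below are illustrative assumptions, not part of the guide.

```python
def concurrency(arrival_rate_rps: float, time_in_system_s: float) -> float:
    """Little's Law: L = λ × W (average requests in flight)."""
    return arrival_rate_rps * time_in_system_s


def is_consistent(measured_concurrency: float,
                  arrival_rate_rps: float,
                  time_in_system_s: float,
                  tolerance: float = 0.10) -> bool:
    """Return True if a measured concurrency agrees with L = λ × W.

    The relative tolerance (10% by default) is an arbitrary, illustrative choice.
    """
    expected = concurrency(arrival_rate_rps, time_in_system_s)
    if expected == 0:
        return measured_concurrency == 0
    return abs(measured_concurrency - expected) / expected <= tolerance


# Values from the example above: λ = 200 req/s, W = 0.15 s.
print(concurrency(200, 0.15))
# A measured concurrency far from the estimate suggests inconsistent metrics.
print(is_consistent(60, 200, 0.15))
```

A large gap between the measured and the estimated concurrency usually points to a measurement problem (different time windows, averages taken over non-steady periods) rather than to a violation of the law.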
1.2.2 Utilization Law (resource-level busy time)
Definition
Utilization is the fraction of time during which a single resource is busy over a fixed time interval (typically 1 second).
It measures the “busy time percentage”.
Formula
Formula: U = X × S
Where
- U = utilization (0–1)
- X = throughput observed by that resource (req/s)
- S = mean service time at that resource (s/req)
Resource
A single service unit, e.g. CPU core, thread/worker, DB connection, etc.
Example
A DB worker handles 50 req/s, and each query takes 10 ms = 0.01 s:
U = 50 × 0.01 = 0.5 (50%)
Interpretation: the resource is busy 0.5 seconds per second.
Practical interpretation
Utilization is a key indicator of resource saturation.
As utilization approaches 1:
- queueing increases
- latency grows non-linearly
- system stability decreases
This makes it one of the most important signals when diagnosing bottlenecks.
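The following sketch applies the Utilization Law to a single service unit and derives the throughput at which that unit would be fully busy (U = 1, so X = 1 / S). The helper names are illustrative assumptions.

```python
def utilization(throughput_rps: float, service_time_s: float) -> float:
    """Utilization Law: U = X × S for a single service unit."""
    return throughput_rps * service_time_s


def saturation_throughput(service_time_s: float) -> float:
    """Throughput at which a single service unit reaches U = 1 (X = 1 / S)."""
    if service_time_s <= 0:
        raise ValueError("service time must be positive")
    return 1.0 / service_time_s


# Values from the example above: 50 req/s, 10 ms per query.
u = utilization(50, 0.01)
x_max = saturation_throughput(0.01)
print(f"utilization: {u:.0%}, saturation at about {x_max:.0f} req/s")
```

In practice a resource becomes unstable well before U = 1, so the saturation throughput is better read as an upper bound than as a target.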
1.2.3 Service time vs response time (queueing)
Definition
Response time at a resource includes:
- service time (actual work)
- queue time (waiting)
Formula
Formula: R = S + W_q
Where
- R = response time at the resource
- S = service time
- W_q = waiting time in queue
Practical meaning
As utilization approaches saturation, queueing grows non-linearly and dominates response time, causing long-tail latency.
Practical interpretation
This formula explains why systems slow down under load even when computation cost does not change.
In many real systems:
- service time remains relatively stable
- waiting time increases rapidly
As a consequence:
- response time is dominated by queueing
- latency becomes unpredictable
This is a key point in diagnosing performance issues.
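To illustrate the non-linear growth, the sketch below assumes a simple M/M/1 queue (single server, random arrivals and service times), for which the mean response time is R = S / (1 - U). This model is an assumption introduced only for illustration. Real systems may behave differently, but the shape of the curve is typical.

```python
def mm1_response_time(service_time_s: float, utilization: float) -> float:
    """Mean response time R = S / (1 - U), valid only under M/M/1 assumptions."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


SERVICE_TIME_S = 0.010  # hypothetical constant service time of 10 ms

for u in (0.5, 0.8, 0.9, 0.95, 0.99):
    r = mm1_response_time(SERVICE_TIME_S, u)
    w_q = r - SERVICE_TIME_S  # waiting time, from R = S + W_q
    print(f"U={u:.2f}  R={r * 1000:7.1f} ms  W_q={w_q * 1000:7.1f} ms")
```

The service time stays at 10 ms throughout. Only the waiting time changes, and it grows much faster than utilization does.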
1.2.4 Service Demand (visits × service time)
Definition
Total service required from a resource per request, taking multiple visits into account.
Formula
Formula: D = V × S
Where
- D = service demand on the resource (s)
- V = average visits to the resource per request
- S = service time per visit (s)
Example
A request performs V = 3 DB queries, each taking S = 5 ms = 0.005 s:
D = 3 × 0.005 = 0.015 s (15 ms per request)
Practical interpretation
Service demand represents the total work required from a resource for each request.
It is particularly useful for:
- identifying the most heavily used resources
- estimating capacity limits
- understanding scaling behavior
Reducing service demand is often more effective than increasing raw capacity.
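A minimal sketch of how service demand can be used to find the most heavily used resource. The per-request profile is hypothetical. The throughput bound of 1 / D follows from applying the Utilization Law with D in place of S (U = X × D ≤ 1) and assumes a single service unit per resource.

```python
def service_demand(visits: float, service_time_s: float) -> float:
    """Service demand: D = V × S (seconds of resource time per request)."""
    return visits * service_time_s


# Hypothetical per-request profile: resource -> (visits, service time per visit in s).
profile = {
    "db": (3, 0.005),
    "cache": (5, 0.0005),
    "cpu": (1, 0.008),
}

demands = {name: service_demand(v, s) for name, (v, s) in profile.items()}

# The resource with the largest demand saturates first.
bottleneck = max(demands, key=demands.get)

# Upper bound on throughput, assuming one service unit per resource.
max_throughput_rps = 1.0 / demands[bottleneck]

for name, d in sorted(demands.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:6s} D = {d * 1000:.1f} ms/request")
print(f"bottleneck: {bottleneck}, bound of about {max_throughput_rps:.1f} req/s")
```

Halving the number of DB visits in this profile would double the bound, which is the sense in which reducing demand can be more effective than adding capacity.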
1.2.5 Throughput
Definition
Requests completed per unit of time.
Formula
Formula: X = N / T
Where
- N = number of completed requests
- T = observation window (seconds)
Practical interpretation
Throughput is one of the primary indicators of system performance.
It reflects the system’s ability to process work.
However, throughput must always be interpreted together with:
- latency
- error rate
- resource utilization
High throughput alone does not guarantee acceptable system behavior.
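As a small illustration of X = N / T, the sketch below counts the completions that fall inside an observation window. The timestamps and the function name are hypothetical.

```python
def throughput(completion_times_s, window_start_s, window_end_s):
    """X = N / T, counting completions in the half-open window [start, end)."""
    window = window_end_s - window_start_s
    if window <= 0:
        raise ValueError("observation window must be positive")
    completed = sum(1 for t in completion_times_s
                    if window_start_s <= t < window_end_s)
    return completed / window


# Hypothetical completion timestamps (seconds since test start).
completions = [0.2, 0.4, 0.9, 1.1, 1.5, 1.8, 2.3, 2.7, 3.1, 3.9]

print(throughput(completions, 0.0, 4.0))  # whole run
print(throughput(completions, 1.0, 2.0))  # one-second slice
```

Computing throughput over short slices as well as over the whole run helps reveal ramp-up effects and drops that a single average would hide.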
1.2.6 Error rate
Definition
Fraction of requests that fail (timeouts, 5xx, etc.), usually expressed as a percentage.
Formula
Formula: ErrorRate = (N_err / N_total) × 100%
Practical interpretation
Error rate reflects system reliability under load.
An increase in error rate often indicates:
- overload conditions
- resource exhaustion
- instability
Error rate should always be monitored together with latency and throughput.
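A minimal sketch of the error-rate formula applied to a list of outcomes. The convention used here (HTTP status 5xx counts as a failure, `None` represents a timeout) is an assumption for illustration. Each project should define what counts as a failure for its own tests.

```python
def is_failure(status):
    """Assumed convention: None means timeout, 5xx means server error."""
    return status is None or status >= 500


def error_rate_percent(statuses):
    """ErrorRate = (N_err / N_total) × 100%."""
    if not statuses:
        raise ValueError("no requests observed")
    failed = sum(1 for s in statuses if is_failure(s))
    return 100.0 * failed / len(statuses)


# Hypothetical outcomes: 95 successes, 3 server errors, 2 timeouts.
statuses = [200] * 95 + [500] * 3 + [None] * 2
print(f"{error_rate_percent(statuses):.1f}%")
```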
1.2.7 Percentiles (p50, p95, p99)
Definition
The p-th percentile is the value below which p% of observations fall.
- p50 ≈ median (“typical request”)
- p95 = threshold for the slowest 5%
- p99 = threshold for the slowest 1%
Percentiles capture distribution and tail behavior better than averages.
Practical interpretation
Percentiles are essential for understanding the user’s real experience.
In many systems:
- average latency appears acceptable
- tail latency (p95/p99) is significantly worse
This difference is critical for system evaluation and SLO definition.
1.2.7.1 How to compute a percentile (ordered sample)
Given N values sorted in ascending order (v_1 ≤ v_2 ≤ ... ≤ v_N):
Compute the theoretical position:
Formula: P = (p / 100) × (N + 1)
- If P is an integer → percentile = v_P
- Otherwise, let k = floor(P) and δ = P - k (fractional part), then interpolate:
Formula: percentile = v_k + δ × (v_(k+1) - v_k)
Note: percentile definitions vary slightly across tools. This method is a commonly used approach.
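The method above can be sketched as follows. Clamping positions that fall outside the range 1..N to the smallest or largest value is an assumption of this sketch, since the text does not specify that case. The sample data is hypothetical.

```python
import math


def percentile(values, p):
    """p-th percentile using P = (p / 100) × (N + 1) with linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    v = sorted(values)
    n = len(v)
    pos = (p / 100) * (n + 1)  # 1-based theoretical position P
    # Assumption: positions outside [1, N] are clamped to the min / max value.
    if pos <= 1:
        return v[0]
    if pos >= n:
        return v[-1]
    k = math.floor(pos)
    delta = pos - k  # fractional part δ (0 when P is an integer)
    # v[k - 1] is v_k in the 1-based notation used in the text.
    return v[k - 1] + delta * (v[k] - v[k - 1])


# Hypothetical latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 120, 13, 16, 18, 17, 19]
for p in (50, 90, 95):
    print(f"p{p} = {percentile(latencies_ms, p):.1f} ms")
```

With only ten samples, p95 already falls on the largest value. High percentiles need large samples to be meaningful.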
1.2.7.2 Interpretation vs average (why tails matter)
- If p50 is much lower than the mean, the distribution is right-skewed (a few slow requests inflate the mean).
- If p95 or p99 is far above the mean, you have long-tail latency.
A typical pattern:
- the mean looks “acceptable”
- p95/p99 are bad
→ user experience is degraded for a non-negligible fraction of users and SLOs are at risk.
Practical interpretation
Percentiles highlight behaviors that averages hide.
They are essential for:
- defining service level objectives (SLOs)
- detecting tail latency issues
- understanding worst-case behavior
Ignoring percentiles often leads to incorrect conclusions about system performance.
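The pattern described above (an acceptable-looking mean with poor p95/p99) can be reproduced with a small hypothetical sample. This sketch uses Python's standard `statistics` module. Its `quantiles` function with `method="exclusive"` is based on the same (N + 1) position, but other tools may return slightly different values.

```python
import statistics

# Hypothetical sample: 95 fast requests (100 ms) and 5 slow ones (2000 ms).
latencies_ms = [100] * 95 + [2000] * 5

mean_ms = statistics.mean(latencies_ms)
p50_ms = statistics.median(latencies_ms)

# quantiles(..., n=100) returns 99 cut points: index 94 is p95, index 98 is p99.
cuts = statistics.quantiles(latencies_ms, n=100, method="exclusive")
p95_ms, p99_ms = cuts[94], cuts[98]

print(f"mean={mean_ms} ms  p50={p50_ms} ms  p95={p95_ms} ms  p99={p99_ms} ms")
```

Here the mean is roughly twice the median, which signals right skew, while p95 and p99 sit close to the slow group at 2000 ms.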
1.2.8 Empirical CDF (threshold → percentage)
Definition
Given a threshold t, the empirical cumulative distribution function (CDF) indicates the fraction of samples less than or equal to t.
Formula
Formula: F(t) = count(x_i ≤ t) / N
Practical meaning
The CDF answers the question: “If my SLO is 200 ms, what % of requests meet it?”
Percentiles answer the inverse question: “What threshold corresponds to 95% of requests?”
Practical interpretation
CDF and percentiles are complementary views of the same data.
- CDF: given a threshold → what fraction meets it
- Percentile: given a fraction → what threshold corresponds to it
Both are useful for performance analysis and SLO validation.
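A minimal sketch of the empirical CDF, answering the SLO question from the text. The latency values and the function name are hypothetical.

```python
def ecdf(samples, threshold):
    """F(t) = count(x_i ≤ t) / N: fraction of samples at or below the threshold."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(1 for x in samples if x <= threshold) / len(samples)


# Hypothetical latencies in milliseconds.
latencies_ms = [120, 180, 150, 210, 190, 450, 170, 160, 200, 130]

slo_ms = 200
fraction = ecdf(latencies_ms, slo_ms)
print(f"{fraction:.0%} of requests complete within {slo_ms} ms")
```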
1.2.9 Long-tail latency (what it is)
Definition
A small fraction of requests (e.g. 5% or 1%) is much slower than the majority.
Why the tail “dominates”
- SLOs are typically defined on p95/p99, so tails drive pass/fail.
- In distributed systems, the slowest dependency often determines end-to-end latency.
- Tail events are frequently driven by contention / queueing.
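To make the second point concrete, the sketch below estimates how often a request that waits for several parallel dependency calls is hit by at least one slow call. It assumes the calls are independent and that each one is slow with the same probability. Both are simplifying assumptions, and correlated slowness (for example, a shared overloaded database) changes the numbers.

```python
def prob_at_least_one_slow(p_slow_single: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent calls is slow."""
    if not 0 <= p_slow_single <= 1:
        raise ValueError("probability must be in [0, 1]")
    if fan_out < 1:
        raise ValueError("fan_out must be at least 1")
    return 1.0 - (1.0 - p_slow_single) ** fan_out


# Each dependency call exceeds its own p99 latency 1% of the time.
for fan_out in (1, 10, 100):
    p = prob_at_least_one_slow(0.01, fan_out)
    print(f"fan-out {fan_out:3d}: {p:.1%} of requests see a slow call")
```

Under these assumptions, a per-call tail that looks rare becomes the common case as the fan-out grows.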
Common causes (high-level)
- thread pool / connection pool saturation (queueing)
- lock contention / synchronization hot spots
- slow DB queries, missing indexes, lock waits
- retries + timeouts amplifying tail latency
- hot keys in caches / uneven shard load
- GC pauses / memory pressure (stop-the-world)
- network jitter / packet loss / retransmissions
- disk I/O spikes, compactions, fsync/WAL flush
Practical interpretation
Long-tail latency is one of the most critical aspects of system performance.
It explains why:
- average metrics may appear acceptable
- user experience is still degraded
Managing tail latency is often more important than improving average performance.
1.2.10 Quick checklist (what to measure in tests)
- Latency: p50/p90/p95/p99
- Throughput: RPS/TPS
- Error rate: timeouts/5xx
- Utilization: CPU, memory, DB, pools
- Queue lengths: thread pools, connection pools, message backlogs
- Dependency timings: DB/Redis/external APIs
Practical interpretation
These metrics constitute the minimum set required to understand system behavior during performance tests.
They allow:
- identifying bottlenecks
- detecting instability
- correlating workload with system behavior
Measuring only a subset of these metrics often leads to incomplete or misleading analysis.
Key idea
Formulas are not isolated abstractions.
They are tools used to explain observed behavior and validate system models.
Evaluating them is an essential element of performance engineering.