1.1 – Foundations

This section introduces the fundamental concepts required to reason about application and system performance: a conceptual model and a set of core principles, used throughout the guide, for analyzing system behavior under load.

1.1.1 Throughput, latency, concurrency

Definition

These are the three primary dimensions used to describe system performance.

Throughput: amount of work completed per unit of time, typically the number of requests processed per second
Latency: time required to complete a request (response time)
Concurrency: number of requests being processed at the same time

These concepts are fundamental in performance engineering and are used throughout the guide to describe system behavior.

Relationship

These quantities are not independent.

For a stable system:

increasing throughput typically increases concurrency
increasing concurrency tends to increase latency
latency directly affects how many requests remain “in flight”

This relationship is central to understanding how systems behave under load.

Practical intuition

A system can be viewed as a processing pipeline:

Input: requests enter
Execution: they are processed
Output: they exit

At any moment:

some requests are being processed (concurrency)
new requests arrive and completed requests leave (throughput; in a stable system both rates are equal)
each request takes time to complete (latency)

This mental model helps reason about flow, accumulation, and delays in real systems.

Example

If a system processes:

100 requests per second (100 req/s)
each request takes 200 ms (0.2 s)

then, on average:

about 20 requests are in flight at any given time

This relationship is formalized by Little’s Law:

→ 1.2.1 Little’s Law
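The arithmetic behind this example can be checked with a short sketch. It assumes steady-state averages, and the function name in_flight is illustrative rather than taken from the guide.

```python
def in_flight(throughput_rps: float, latency_s: float) -> float:
    """Average number of requests in flight: throughput multiplied by latency."""
    return throughput_rps * latency_s


# The example above: 100 req/s at 200 ms per request gives about 20 in flight.
print(in_flight(100, 0.2))

# Halving latency at the same throughput halves the number in flight.
print(in_flight(100, 0.1))

# Doubling throughput at the same latency doubles the number in flight.
print(in_flight(200, 0.2))
```

The same function covers the cases listed under "Practical interpretation": fixing any two of the three quantities determines the third.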

Practical interpretation

Throughput, latency, and concurrency are tied together by a single relationship.

Changing one of them necessarily changes at least one of the others.

For example:

reducing latency reduces concurrency for the same throughput
increasing throughput increases concurrency if latency remains constant
high concurrency increases the probability of queueing and contention

This is a key element in diagnosing performance issues.

1.1.2 Service time vs response time

Definition

At a resource level, response time is composed of two parts:

service time (S): time spent performing actual work
waiting time (Wq): time spent waiting before being processed

This distinction is fundamental in performance analysis.

Relationship

Response time:

includes both execution and waiting
increases when queues form

Even if service time remains constant:

response time can increase significantly due to waiting

This is one of the main reasons systems degrade under load.
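The effect can be illustrated with a minimal sketch. It assumes a single server that handles requests in arrival order with a fixed service time; the function name and the sample arrival times are illustrative only.

```python
def fifo_response_times(arrivals, service_time):
    """Return (waiting time, response time) per request for one FIFO server.

    arrivals must be sorted in ascending order; service_time is constant.
    """
    results = []
    server_free_at = 0.0
    for arrival in arrivals:
        # A request starts when it has arrived AND the server is free.
        start = max(arrival, server_free_at)
        wait = start - arrival
        server_free_at = start + service_time
        results.append((wait, wait + service_time))
    return results


# Same service time (0.5 s) in both cases; only the arrival pattern differs.
spaced = fifo_response_times([0.0, 1.0, 2.0, 3.0], 0.5)
burst = fifo_response_times([0.0, 0.0, 0.0, 0.0], 0.5)

print("spaced:", spaced)  # no request ever waits
print("burst: ", burst)   # each request waits for all the ones ahead of it
```

In the spaced case every response time equals the service time. In the burst case the last request spends three service times waiting, although the work per request is unchanged.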

Practical meaning

A slow system is often not slow because the work itself is expensive, but because the work is waiting for available resources.

As load increases:

queues grow
waiting dominates
response time degrades

This decomposition is formalized as:

→ 1.2.3 Service time vs response time

Practical interpretation

Separating service time from response time allows:

identifying whether latency is dominated by processing (for example, CPU-bound work) or by queueing
distinguishing processing cost from resource contention
understanding whether optimization should target execution or waiting

In many real systems, latency issues are primarily caused by queueing rather than computation.

1.1.3 Systems under load

Definition

A system under load processes a continuous stream of incoming requests.

Load is typically expressed as:

requests per second
concurrent users
transactions per second

Load defines the operating conditions under which performance must be evaluated.

Behavior

As load increases:

resource utilization increases
queues begin to form
latency increases
throughput eventually plateaus or degrades

These effects are not linear and depend on system design and resource constraints.
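The plateau in throughput can be sketched with an idealized model. It assumes identical servers with a fixed service time and does not model degradation beyond saturation; the function name and numbers are illustrative.

```python
def steady_state(offered_rps, service_time_s, servers=1):
    """Return (throughput, utilization) for an idealized resource.

    Capacity is servers / service_time; throughput cannot exceed it.
    """
    capacity_rps = servers / service_time_s
    throughput = min(offered_rps, capacity_rps)
    utilization = throughput / capacity_rps
    return throughput, utilization


# One server, 10 ms per request: capacity is about 100 req/s.
for offered in (25, 50, 100, 150, 200):
    throughput, utilization = steady_state(offered, 0.01)
    print(f"offered {offered:>3} req/s -> "
          f"throughput {throughput:.0f} req/s, utilization {utilization:.0%}")
```

Below capacity, throughput follows the offered load. Above it, throughput stays flat and the excess requests accumulate in queues.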

Key observation

Systems do not degrade linearly.

At low load:

latency stays close to the service time and changes little as load grows

Near saturation:

small increases in load can cause significant increases in latency

This non-linear behavior is a key characteristic of real-world systems.
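One way to see the shape of this curve is the textbook single-server queueing model (M/M/1), in which mean response time equals service time divided by (1 - utilization). This model is an assumption brought in for illustration: it presumes random (Poisson) arrivals and exponentially distributed service times, and real systems will differ in detail.

```python
def mm1_response_time(service_time_s, utilization):
    """Mean response time in an M/M/1 queue: S / (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time_s / (1.0 - utilization)


# Service time fixed at 100 ms; only utilization changes.
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    response_ms = mm1_response_time(0.1, rho) * 1000
    print(f"utilization {rho:.2f} -> mean response time {response_ms:.0f} ms")
```

In this model, moving from 90% to 95% utilization doubles the mean response time, whereas moving from 10% to 15% changes it by only a few percent.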

Practical interpretation

Understanding system behavior under load is essential for:

capacity planning
performance testing
diagnosing latency issues

It helps explain why systems may appear stable in testing but fail under slightly higher production load.

1.1.4 Saturation and bottlenecks

Saturation

A resource is saturated when it is busy all or nearly all of the time, so that additional work has to wait.

Typical examples:

CPU at or near 100% utilization
thread pool fully utilized
connection pool exhausted

Saturation indicates that a resource cannot handle additional demand without degradation.

Bottleneck

A bottleneck is the resource that limits system throughput.

Characteristics:

highest utilization
longest queues
dominant contribution to response time

The bottleneck determines the overall system capacity.
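A minimal sketch of this idea, assuming a serial pipeline in which every request visits each stage once. The stage names and capacities are hypothetical.

```python
def find_bottleneck(capacities_rps):
    """Return (stage name, capacity) of the stage with the lowest capacity."""
    name = min(capacities_rps, key=capacities_rps.get)
    return name, capacities_rps[name]


# Hypothetical per-stage capacities in requests per second.
stages = {"web": 500, "app": 300, "db": 120}
print(find_bottleneck(stages))   # the db stage caps the whole system

# Doubling a non-bottleneck stage leaves system capacity unchanged.
stages["app"] = 600
print(find_bottleneck(stages))

# Raising the bottleneck's capacity moves the limit to the next stage.
stages["db"] = 400
print(find_bottleneck(stages))
```

Under these assumptions, system capacity is the minimum of the stage capacities, which is why work on non-bottleneck stages does not raise throughput.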

Practical meaning

Improving resources that are not actual bottlenecks has little or no effect.

Performance improvements require:

identifying the bottleneck
reducing its demand or increasing its capacity

This is a key principle in performance engineering.

Practical interpretation

In complex systems:

multiple resources may appear limiting
but typically only one limits throughput at a given time

Correctly identifying the bottleneck is essential to avoid ineffective optimizations.

1.1.5 Why systems slow down

Common mechanisms

Performance degradation is usually driven by a limited number of factors:

queueing due to saturation
contention on shared resources
inefficient use of resources
external dependencies becoming slow

These mechanisms often interact and amplify each other.

Queueing effect

As resource utilization approaches its limits:

waiting time increases rapidly
response time becomes dominated by queueing

This behavior follows from the relationship between utilization and waiting time:

→ 1.2.2 Utilization Law

Amplification effects

Certain patterns amplify performance problems:

retries increase load on already saturated systems
timeouts lead to duplicated work
cascading dependencies propagate delays

These effects can transform moderate load into severe degradation.
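The retry case can be quantified with a rough sketch. It assumes each attempt fails independently with the same probability and that clients retry immediately up to a fixed limit; both assumptions, and the function names, are illustrative.

```python
def expected_attempts(failure_prob, max_retries):
    """Expected attempts per request: 1 + p + p^2 + ... + p^max_retries."""
    return sum(failure_prob ** k for k in range(max_retries + 1))


def effective_load(offered_rps, failure_prob, max_retries):
    """Load actually hitting the system once retries are included."""
    return offered_rps * expected_attempts(failure_prob, max_retries)


# 100 req/s of user traffic, up to 3 retries per request.
for p in (0.0, 0.1, 0.5, 0.9):
    load = effective_load(100, p, 3)
    print(f"failure probability {p:.1f} -> effective load {load:.1f} req/s")
```

As the failure probability rises, which is what happens when a system saturates, the effective load approaches (1 + max_retries) times the original traffic, pushing the system further into overload.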

Practical interpretation

Performance degradation is rarely caused by a single factor.

Instead, it emerges from:

interactions between components
accumulation of waiting time
feedback loops under load

Recognizing these interactions is what makes an effective diagnosis possible.

Practical conclusion

Most performance problems are not caused by a single slow or problematic operation, but by:

interactions between components
accumulation of waiting time
overload conditions

Understanding these mechanisms is required before applying formulas or running tests.

Key idea

System performance is determined by interactions between workload, resources, and concurrency.

Understanding these interactions constitutes the foundation of performance engineering.
