1.9 Common performance problems

1.9 – Common performance problems

This chapter describes common performance problems that appear in real systems under load.

These problems are not isolated categories. They often interact, reinforce each other, and become visible as latency growth, throughput loss, instability, or tail degradation.

The purpose of this chapter is to connect recurring symptoms to the underlying mechanisms already introduced in the previous chapters.

1.9.1 CPU-bound inefficiency

Definition

A CPU-bound inefficiency occurs when the system spends excessive CPU time performing work that could be reduced, optimized, or avoided.

This does not necessarily mean that the system is CPU-saturated at all times.

It means that available CPU time is being consumed inefficiently, reducing the amount of useful work the system can perform before reaching saturation.

Typical causes

inefficient algorithms (e.g. unnecessary complexity)
repeated computations
lack of caching for expensive operations
excessive data transformations

These causes are common because CPU inefficiency often emerges from code that is functionally correct but structurally wasteful.

In performance engineering, inefficiency matters most when it occurs in hot paths or highly repeated operations.

Example

public int countMatches(List<String> items, String target) {
    int count = 0;
    for (String s : items) {
        if (s.toLowerCase().equals(target.toLowerCase())) {
            count++;
        }
    }
    return count;
}

Interpretation:

repeated toLowerCase() calls create unnecessary work
CPU time increases with input size
avoidable computation in hot paths

The problem is not only the cost of the loop itself, but the repeated transformation of values that could be normalized once instead of at every comparison.

Mechanism

CPU-bound inefficiency wastes execution capacity.

More CPU time is consumed than necessary to produce the same result.

As the workload grows:

CPU utilization rises earlier
runnable work accumulates sooner
useful throughput reaches its limit earlier

This transforms inefficient code into a system-level bottleneck when request volume increases.

Impact under load

increased CPU utilization
reduced throughput
earlier CPU saturation

This leads to scheduling delays (→ 1.8.1 CPU behavior) and non-linear latency growth (→ 1.5.3 Non-linear degradation).

In practical terms, the system reaches its CPU limit sooner than expected, leaving less headroom for bursts or concurrent traffic growth.

Observable symptoms

Typical symptoms include:

high CPU usage under moderate load
rising latency with increasing request volume
throughput flattening earlier than expected
significant CPU time spent in repeated or avoidable operations

These symptoms often appear before total CPU saturation and may initially look like a generic scaling problem.

Practical implications

optimize hot paths
avoid repeated work
reduce algorithmic complexity

It is also important to identify which inefficiencies actually matter at system level.

An inefficient operation executed once may be irrelevant.

The same inefficiency executed millions of times becomes a bottleneck.

Practical interpretation

CPU inefficiency is one of the most common reasons a system fails to scale despite apparently sufficient hardware.

The issue is not lack of CPU in absolute terms, but poor use of the CPU that is available.

Optimization is therefore most valuable when it increases the amount of useful work performed per unit of CPU time.

Key idea

CPU inefficiency reduces the amount of useful work the system can perform before reaching saturation.

1.9.2 Excessive allocation and memory churn

Definition

Excessive allocation occurs when the system creates a large number of short-lived objects, increasing memory churn and pressure on the runtime.

This is a common problem in managed runtimes, where allocation is easy and often inexpensive per operation, but expensive in aggregate when performed continuously under load.

Example

for (Order o : orders) {
    result.add(new ReportRow(o.getId(), o.getAmount(), o.getStatus()));
}

Interpretation:

many objects are created per iteration
objects are short-lived
allocation rate increases

If this pattern appears in frequently executed code, total allocation volume can become significant even when each individual object is small.

Mechanism

high allocation rate increases memory churn
garbage collection runs more frequently

(→ 1.7.2 Allocation and object lifecycle)
(→ 1.7.3 Garbage collection)

The system therefore pays not only for creating objects, but for reclaiming them, tracking them, and managing the runtime effects of frequent memory turnover.

Impact under load

increased GC activity
CPU overhead for memory management
latency variability

This contributes to memory pressure (→ 1.7.4 Memory pressure and performance).

As load increases, allocation-related overhead often becomes more visible through pauses, jitter, and widening latency percentiles.

Observable symptoms

Typical symptoms include:

increased garbage collection frequency
periodic latency spikes
growing gap between average and tail latency
moderate CPU usage with unstable response times
memory behavior that degrades as throughput increases

These symptoms are especially common in systems that allocate heavily in request-processing paths.

Practical implications

reduce unnecessary object creation
reuse objects when appropriate
analyze allocation patterns

It is also important to distinguish between:

necessary allocation
avoidable allocation
retained allocation that should have remained temporary

This distinction helps determine whether the issue is churn, retention, or both.

Practical interpretation

Excessive allocation is often invisible in code review because the code remains simple and correct.

Its effect becomes visible only at runtime, when repeated object creation changes GC behavior and memory pressure.

A system may therefore appear logically efficient while still behaving poorly because it creates too much transient memory traffic.

Key idea

Memory churn increases runtime overhead and introduces latency variability.

1.9.3 Contention and synchronization hot spots

Definition

Contention occurs when multiple threads compete for the same resource, forcing serialized access.

A synchronization hot spot is a part of the system where this competition becomes concentrated and repeatedly delays execution.

These hot spots are especially damaging because they reduce effective parallelism exactly where concurrency is expected to help.

Example

public class Counter {
    private int value = 0;

    public synchronized void increment() {
        value++;
    }
}

Interpretation:

access is serialized through synchronization
only one thread progresses at a time
throughput is limited by the critical section

The issue is not that synchronization exists, but that a frequently accessed shared path can become the limiting point for the whole system.

Mechanism

threads block while waiting for the lock
contention increases with concurrency

(→ 1.6 Concurrency and parallelism)

As more threads compete for the same synchronized section:

waiting time grows
effective parallelism decreases
more time is spent coordinating than progressing

This causes the system to behave as if its concurrency were lower than its thread count suggests.

Impact under load

increased waiting time
reduced throughput
latency increases

This leads to queueing effects (→ 1.5 System behavior under load).

Under higher load, synchronization hot spots often become visible as latency growth without proportional CPU growth, because threads are waiting rather than computing.

Observable symptoms

Typical symptoms include:

rising latency with moderate CPU usage
many threads blocked or waiting
reduced scalability as concurrency increases
throughput limited by a small critical section
lock-heavy code paths appearing on hot execution paths

These symptoms are often misleading because the system may appear only partially utilized while already constrained.

Practical implications

minimize shared mutable state
reduce critical section size
use more scalable concurrency patterns

It is also important to identify whether the bottleneck is caused by:

lock scope
frequency of access
long critical sections
unnecessary synchronization

Different causes require different fixes.

Practical interpretation

Contention problems are often misunderstood as generic slowness.

In reality, the core issue is serialization: many threads are present, but only a few are making useful progress.

Performance engineering therefore focuses not only on adding concurrency, but on making sure concurrency does not collapse into waiting.

Key idea

Contention converts parallel work into serialized execution.

1.9.4 Blocking and waiting bottlenecks

Definition

Blocking occurs when a thread waits for an external operation to complete, preventing it from doing useful work.

This includes waiting for:

I/O
network responses
locks
external services
other coordinated events

Blocking is often necessary, but it becomes a bottleneck when too many execution resources are occupied by waiting rather than progressing.

Example

public String fetchData() throws Exception {
    Thread.sleep(50); // simulate blocking call
    return "data";
}

Interpretation:

thread is idle during wait
resources remain allocated
concurrency does not translate to throughput

The thread exists, but it is not advancing useful work during the blocked period.

Mechanism

threads spend time waiting instead of executing
thread pools may become saturated

(→ 1.6 Concurrency and parallelism)

As more threads become blocked:

fewer threads remain available for new work
queueing appears at the execution model level
latency grows even if the CPU is not fully used

This is why blocking bottlenecks often coexist with moderate CPU usage.

Impact under load

increased latency
reduced throughput
thread exhaustion

This amplifies queueing and saturation (→ 1.5 System behavior under load).

Under sustained load, blocking behavior often creates a feedback loop where queued requests wait for threads that are themselves waiting on slow operations.

Observable symptoms

Typical symptoms include:

many threads in waiting or blocked states
growing request queues
moderate CPU with poor throughput
rising latency during I/O-heavy or dependency-heavy operations
thread pools that appear full without corresponding productive work

These symptoms are especially common in services that mix request concurrency with synchronous downstream calls.

Practical implications

reduce blocking operations
use asynchronous or non-blocking patterns when appropriate
size thread pools carefully

It is also useful to distinguish between:

unavoidable blocking
avoidable blocking
blocking placed in high-frequency execution paths

That distinction helps identify where redesign is necessary.

Practical interpretation

Blocking reduces effective concurrency.

A system may have many threads, but if a large share of them is waiting, the system behaves as if it had much less execution capacity than expected.

This is why blocking issues are often execution-model problems before they become raw resource problems.

Key idea

Blocking reduces effective concurrency and limits system throughput.

1.9.5 Queue buildup and saturation effects

Definition

Queue buildup occurs when incoming work exceeds processing capacity, causing requests to wait before being processed.

This is one of the most common and most important performance problems because queueing transforms moderate overload into rapidly increasing latency.

Mechanism

arrival rate exceeds service capacity
queues grow over time

This can be described using Little’s Law (→ 1.2.1 Little’s Law (system-level concurrency)).

As incoming demand continues while processing remains limited, waiting accumulates and response time begins to include increasingly large queue delay.

Impact under load

waiting time increases
response time increases
latency becomes unstable

This leads to non-linear degradation (→ 1.5.3 Non-linear degradation) and throughput limits.

Once queueing becomes dominant, the system can deteriorate very quickly even if the original increase in load was relatively small.

Observable symptoms

growing queue lengths
increasing response times
stable or decreasing throughput

Other symptoms may include:

bursts of timeout errors
widening p95/p99 latency
delayed recovery after temporary overload

These effects often indicate that the system is operating near or beyond effective capacity.

Practical implications

control concurrency
increase capacity of the bottleneck resource
reduce arrival rate if necessary

It is also important to determine where the queue is forming:

thread pool
connection pool
device
network buffer
downstream service

The location of the queue often reveals the actual bottleneck.

Practical interpretation

Queue buildup is not just an operational detail.

It is often the direct mechanism through which overload becomes visible to users.

A system may still be functioning, but once work begins to wait systematically, latency growth becomes inevitable.

Key idea

Queues grow when demand exceeds capacity, driving latency.

1.9.6 Dependency amplification and cascading latency

Definition

Dependency amplification occurs when latency in one component propagates and increases latency across the system.

This problem is especially important in distributed systems, where a request often depends on multiple downstream calls before it can complete.

Mechanism

requests depend on multiple downstream services
delays accumulate across calls
slow components affect the entire system

Even when each individual delay is small, the total effect can become significant once multiple dependencies, retries, or serial call chains are involved.

Example

public Response process() {
    Data a = serviceA.call();
    Data b = serviceB.call();
    return combine(a, b);
}

Interpretation:

total latency depends on multiple dependencies
slowest dependency dominates response time

In real systems, this effect becomes stronger when requests depend on many services, remote databases, or chained synchronous operations.

Impact under load

latency amplification across services
increased variability
tail latency degradation

(→ 1.5.5 Tail latency amplification)

Under load, dependency amplification often becomes more severe because slow downstream systems retain upstream threads, requests, and queues for longer periods.

Observable symptoms

Typical symptoms include:

sudden latency increases without local CPU saturation
degraded p95/p99 behavior caused by downstream variability
request chains that become slower as one dependency slows down
instability spreading from one service to another
retries and timeouts increasing pressure across the system

These symptoms are often difficult to interpret without correlating behavior across multiple components.

Practical implications

minimize number of synchronous dependencies
use timeouts and fallback strategies
isolate slow components

It is also useful to identify:

which dependency contributes most to end-to-end delay
whether calls are serial or parallel
whether retries worsen the problem
whether slow components trigger upstream queueing

This turns a vague “distributed slowness” problem into a diagnosable system behavior.

Practical interpretation

A system’s latency is not determined only by its own code.

It is often determined by the slowest dependency in the request path.

The more dependencies a system has, the more likely it is that variability in one place will become visible everywhere else.

Key idea

System latency is often determined by the slowest dependency.

1.9 Common performance problems