1.9 Common performance problems
1.9 – Common performance problems
This chapter describes common performance problems that appear in real systems under load.
These problems are not isolated categories. They often interact, reinforce each other, and become visible as latency growth, throughput loss, instability, or tail degradation.
The purpose of this chapter is to connect recurring symptoms to the underlying mechanisms already introduced in the previous chapters.
Table of Contents
- 1.9.1 CPU-bound inefficiency
- 1.9.2 Excessive allocation and memory churn
- 1.9.3 Contention and synchronization hot spots
- 1.9.4 Blocking and waiting bottlenecks
- 1.9.5 Queue buildup and saturation effects
- 1.9.6 Dependency amplification and cascading latency
1.9.1 CPU-bound inefficiency
Definition
A CPU-bound inefficiency occurs when the system spends excessive CPU time performing work that could be reduced, optimized, or avoided.
This does not necessarily mean that the system is CPU-saturated at all times.
It means that available CPU time is being consumed inefficiently, reducing the amount of useful work the system can perform before reaching saturation.
Typical causes
- inefficient algorithms (e.g. unnecessary complexity)
- repeated computations
- lack of caching for expensive operations
- excessive data transformations
These causes are common because CPU inefficiency often emerges from code that is functionally correct but structurally wasteful.
In performance engineering, inefficiency matters most when it occurs in hot paths or highly repeated operations.
Example
public int countMatches(List<String> items, String target) {
int count = 0;
for (String s : items) {
if (s.toLowerCase().equals(target.toLowerCase())) {
count++;
}
}
return count;
}
Interpretation:
- repeated
toLowerCase()calls create unnecessary work - CPU time increases with input size
- avoidable computation in hot paths
The problem is not only the cost of the loop itself, but the repeated transformation of values that could be normalized once instead of at every comparison.
Mechanism
CPU-bound inefficiency wastes execution capacity.
More CPU time is consumed than necessary to produce the same result.
As the workload grows:
- CPU utilization rises earlier
- runnable work accumulates sooner
- useful throughput reaches its limit earlier
This transforms inefficient code into a system-level bottleneck when request volume increases.
Impact under load
- increased CPU utilization
- reduced throughput
- earlier CPU saturation
This leads to scheduling delays (→ 1.8.1 CPU behavior) and non-linear latency growth (→ 1.5.3 Non-linear degradation).
In practical terms, the system reaches its CPU limit sooner than expected, leaving less headroom for bursts or concurrent traffic growth.
Observable symptoms
Typical symptoms include:
- high CPU usage under moderate load
- rising latency with increasing request volume
- throughput flattening earlier than expected
- significant CPU time spent in repeated or avoidable operations
These symptoms often appear before total CPU saturation and may initially look like a generic scaling problem.
Practical implications
- optimize hot paths
- avoid repeated work
- reduce algorithmic complexity
It is also important to identify which inefficiencies actually matter at system level.
An inefficient operation executed once may be irrelevant.
The same inefficiency executed millions of times becomes a bottleneck.
Practical interpretation
CPU inefficiency is one of the most common reasons a system fails to scale despite apparently sufficient hardware.
The issue is not lack of CPU in absolute terms, but poor use of the CPU that is available.
Optimization is therefore most valuable when it increases the amount of useful work performed per unit of CPU time.
Key idea
CPU inefficiency reduces the amount of useful work the system can perform before reaching saturation.
1.9.2 Excessive allocation and memory churn
Definition
Excessive allocation occurs when the system creates a large number of short-lived objects, increasing memory churn and pressure on the runtime.
This is a common problem in managed runtimes, where allocation is easy and often inexpensive per operation, but expensive in aggregate when performed continuously under load.
Example
for (Order o : orders) {
result.add(new ReportRow(o.getId(), o.getAmount(), o.getStatus()));
}
Interpretation:
- many objects are created per iteration
- objects are short-lived
- allocation rate increases
If this pattern appears in frequently executed code, total allocation volume can become significant even when each individual object is small.
Mechanism
- high allocation rate increases memory churn
- garbage collection runs more frequently
(→ 1.7.2 Allocation and object lifecycle)
(→ 1.7.3 Garbage collection)
The system therefore pays not only for creating objects, but for reclaiming them, tracking them, and managing the runtime effects of frequent memory turnover.
Impact under load
- increased GC activity
- CPU overhead for memory management
- latency variability
This contributes to memory pressure (→ 1.7.4 Memory pressure and performance).
As load increases, allocation-related overhead often becomes more visible through pauses, jitter, and widening latency percentiles.
Observable symptoms
Typical symptoms include:
- increased garbage collection frequency
- periodic latency spikes
- growing gap between average and tail latency
- moderate CPU usage with unstable response times
- memory behavior that degrades as throughput increases
These symptoms are especially common in systems that allocate heavily in request-processing paths.
Practical implications
- reduce unnecessary object creation
- reuse objects when appropriate
- analyze allocation patterns
It is also important to distinguish between:
- necessary allocation
- avoidable allocation
- retained allocation that should have remained temporary
This distinction helps determine whether the issue is churn, retention, or both.
Practical interpretation
Excessive allocation is often invisible in code review because the code remains simple and correct.
Its effect becomes visible only at runtime, when repeated object creation changes GC behavior and memory pressure.
A system may therefore appear logically efficient while still behaving poorly because it creates too much transient memory traffic.
Key idea
Memory churn increases runtime overhead and introduces latency variability.
1.9.3 Contention and synchronization hot spots
Definition
Contention occurs when multiple threads compete for the same resource, forcing serialized access.
A synchronization hot spot is a part of the system where this competition becomes concentrated and repeatedly delays execution.
These hot spots are especially damaging because they reduce effective parallelism exactly where concurrency is expected to help.
Example
public class Counter {
private int value = 0;
public synchronized void increment() {
value++;
}
}
Interpretation:
- access is serialized through synchronization
- only one thread progresses at a time
- throughput is limited by the critical section
The issue is not that synchronization exists, but that a frequently accessed shared path can become the limiting point for the whole system.
Mechanism
- threads block while waiting for the lock
- contention increases with concurrency
(→ 1.6 Concurrency and parallelism)
As more threads compete for the same synchronized section:
- waiting time grows
- effective parallelism decreases
- more time is spent coordinating than progressing
This causes the system to behave as if its concurrency were lower than its thread count suggests.
Impact under load
- increased waiting time
- reduced throughput
- latency increases
This leads to queueing effects (→ 1.5 System behavior under load).
Under higher load, synchronization hot spots often become visible as latency growth without proportional CPU growth, because threads are waiting rather than computing.
Observable symptoms
Typical symptoms include:
- rising latency with moderate CPU usage
- many threads blocked or waiting
- reduced scalability as concurrency increases
- throughput limited by a small critical section
- lock-heavy code paths appearing on hot execution paths
These symptoms are often misleading because the system may appear only partially utilized while already constrained.
Practical implications
- minimize shared mutable state
- reduce critical section size
- use more scalable concurrency patterns
It is also important to identify whether the bottleneck is caused by:
- lock scope
- frequency of access
- long critical sections
- unnecessary synchronization
Different causes require different fixes.
Practical interpretation
Contention problems are often misunderstood as generic slowness.
In reality, the core issue is serialization: many threads are present, but only a few are making useful progress.
Performance engineering therefore focuses not only on adding concurrency, but on making sure concurrency does not collapse into waiting.
Key idea
Contention converts parallel work into serialized execution.
1.9.4 Blocking and waiting bottlenecks
Definition
Blocking occurs when a thread waits for an external operation to complete, preventing it from doing useful work.
This includes waiting for:
- I/O
- network responses
- locks
- external services
- other coordinated events
Blocking is often necessary, but it becomes a bottleneck when too many execution resources are occupied by waiting rather than progressing.
Example
public String fetchData() throws Exception {
Thread.sleep(50); // simulate blocking call
return "data";
}
Interpretation:
- thread is idle during wait
- resources remain allocated
- concurrency does not translate to throughput
The thread exists, but it is not advancing useful work during the blocked period.
Mechanism
- threads spend time waiting instead of executing
- thread pools may become saturated
(→ 1.6 Concurrency and parallelism)
As more threads become blocked:
- fewer threads remain available for new work
- queueing appears at the execution model level
- latency grows even if the CPU is not fully used
This is why blocking bottlenecks often coexist with moderate CPU usage.
Impact under load
- increased latency
- reduced throughput
- thread exhaustion
This amplifies queueing and saturation (→ 1.5 System behavior under load).
Under sustained load, blocking behavior often creates a feedback loop where queued requests wait for threads that are themselves waiting on slow operations.
Observable symptoms
Typical symptoms include:
- many threads in waiting or blocked states
- growing request queues
- moderate CPU with poor throughput
- rising latency during I/O-heavy or dependency-heavy operations
- thread pools that appear full without corresponding productive work
These symptoms are especially common in services that mix request concurrency with synchronous downstream calls.
Practical implications
- reduce blocking operations
- use asynchronous or non-blocking patterns when appropriate
- size thread pools carefully
It is also useful to distinguish between:
- unavoidable blocking
- avoidable blocking
- blocking placed in high-frequency execution paths
That distinction helps identify where redesign is necessary.
Practical interpretation
Blocking reduces effective concurrency.
A system may have many threads, but if a large share of them is waiting, the system behaves as if it had much less execution capacity than expected.
This is why blocking issues are often execution-model problems before they become raw resource problems.
Key idea
Blocking reduces effective concurrency and limits system throughput.
1.9.5 Queue buildup and saturation effects
Definition
Queue buildup occurs when incoming work exceeds processing capacity, causing requests to wait before being processed.
This is one of the most common and most important performance problems because queueing transforms moderate overload into rapidly increasing latency.
Mechanism
- arrival rate exceeds service capacity
- queues grow over time
This can be described using Little’s Law (→ 1.2.1 Little’s Law (system-level concurrency)).
As incoming demand continues while processing remains limited, waiting accumulates and response time begins to include increasingly large queue delay.
Impact under load
- waiting time increases
- response time increases
- latency becomes unstable
This leads to non-linear degradation (→ 1.5.3 Non-linear degradation) and throughput limits.
Once queueing becomes dominant, the system can deteriorate very quickly even if the original increase in load was relatively small.
Observable symptoms
- growing queue lengths
- increasing response times
- stable or decreasing throughput
Other symptoms may include:
- bursts of timeout errors
- widening p95/p99 latency
- delayed recovery after temporary overload
These effects often indicate that the system is operating near or beyond effective capacity.
Practical implications
- control concurrency
- increase capacity of the bottleneck resource
- reduce arrival rate if necessary
It is also important to determine where the queue is forming:
- thread pool
- connection pool
- device
- network buffer
- downstream service
The location of the queue often reveals the actual bottleneck.
Practical interpretation
Queue buildup is not just an operational detail.
It is often the direct mechanism through which overload becomes visible to users.
A system may still be functioning, but once work begins to wait systematically, latency growth becomes inevitable.
Key idea
Queues grow when demand exceeds capacity, driving latency.
1.9.6 Dependency amplification and cascading latency
Definition
Dependency amplification occurs when latency in one component propagates and increases latency across the system.
This problem is especially important in distributed systems, where a request often depends on multiple downstream calls before it can complete.
Mechanism
- requests depend on multiple downstream services
- delays accumulate across calls
- slow components affect the entire system
Even when each individual delay is small, the total effect can become significant once multiple dependencies, retries, or serial call chains are involved.
Example
public Response process() {
Data a = serviceA.call();
Data b = serviceB.call();
return combine(a, b);
}
Interpretation:
- total latency depends on multiple dependencies
- slowest dependency dominates response time
In real systems, this effect becomes stronger when requests depend on many services, remote databases, or chained synchronous operations.
Impact under load
- latency amplification across services
- increased variability
- tail latency degradation
(→ 1.5.5 Tail latency amplification)
Under load, dependency amplification often becomes more severe because slow downstream systems retain upstream threads, requests, and queues for longer periods.
Observable symptoms
Typical symptoms include:
- sudden latency increases without local CPU saturation
- degraded p95/p99 behavior caused by downstream variability
- request chains that become slower as one dependency slows down
- instability spreading from one service to another
- retries and timeouts increasing pressure across the system
These symptoms are often difficult to interpret without correlating behavior across multiple components.
Practical implications
- minimize number of synchronous dependencies
- use timeouts and fallback strategies
- isolate slow components
It is also useful to identify:
- which dependency contributes most to end-to-end delay
- whether calls are serial or parallel
- whether retries worsen the problem
- whether slow components trigger upstream queueing
This turns a vague “distributed slowness” problem into a diagnosable system behavior.
Practical interpretation
A system’s latency is not determined only by its own code.
It is often determined by the slowest dependency in the request path.
The more dependencies a system has, the more likely it is that variability in one place will become visible everywhere else.
Key idea
System latency is often determined by the slowest dependency.