Performance Testing: Types, Metrics, and How It Works

Software that works is not the same as software that works under load. A feature can pass every functional test, ship clean, and then buckle the first time real traffic arrives. Performance testing is the discipline that closes that gap: it measures how a system behaves when it is busy, not just whether it is correct when it is idle.

This guide explains what performance testing is, the main types, the metrics that define a result, and how it fits into a modern testing process.

What performance testing is

Performance testing evaluates the speed, stability, and scalability of a system under a defined workload. It does not ask “does this feature work?” It asks “how fast is it, how much can it take, and what happens when it has had enough?”

That distinction matters. Functional testing and performance testing answer different questions and catch different bugs. A login endpoint can return the correct token every time and still take four seconds to do it under load. Functional tests pass; users leave. Performance testing is what surfaces that second problem.

The output of performance testing is not a simple pass or fail. It is a profile: at this load, the system responds this fast, sustains this throughput, and fails in this way once pushed past this point. That profile is what lets a team plan capacity, set realistic service levels, and catch regressions before release.

The main types of performance testing

Performance testing is a family of test types, each applying load in a different shape to answer a different question.

Baseline testing runs the system under normal, expected load and records the result. This baseline is the reference every later test compares against. Without it, you cannot tell whether a number is good, bad, or simply different.

Load testing applies expected peak traffic and confirms the system holds up: response times stay within budget, errors stay near zero. It validates that the system survives a normal busy day.

Stress testing deliberately exceeds capacity, raising load until the system degrades or fails. The goal is to find the breaking point and observe the failure mode. Graceful degradation, where the system is slower but still serving, is acceptable; data loss and cascading errors are not.

Spike testing applies a sudden, steep jump in load and drops it again. It models flash sales, news events, and other bursts. A system tuned for steady traffic can still fail a spike because it cannot scale fast enough.

Capacity testing finds the maximum load the system can handle while still meeting its targets. The answer is a concrete number, used directly for capacity planning and autoscaling thresholds.

Soak testing, also called stability or endurance testing, holds a moderate load for an extended period to expose slow failures: memory leaks, resource exhaustion, gradual slowdown. These are invisible in short runs and only appear over hours.

A common starting set is baseline, load, and soak testing, with stress and spike testing added for systems that see high or unpredictable traffic.
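The load shapes described above can be sketched as functions of time. This is a minimal illustration rather than any tool's API: the function name load_profile, the ramp over the first fifth of the run, the 3x stress ceiling, the burst window, and the half-peak soak level are all assumptions chosen for the example.

```python
def load_profile(kind: str, t: float, duration: float, peak: int) -> int:
    """Return the target number of virtual users at time t (seconds).

    `peak` is the expected peak load; all ratios below are illustrative.
    """
    if not 0 <= t <= duration:
        return 0
    if kind == "load":
        # Ramp to the expected peak over the first fifth of the run, then hold.
        ramp = duration / 5
        return peak if t >= ramp else int(peak * t / ramp)
    if kind == "stress":
        # Keep climbing past the expected peak (here up to 3x) until the end.
        return int(3 * peak * t / duration)
    if kind == "spike":
        # Low background load with a short burst in the middle of the run.
        in_burst = 0.45 * duration <= t <= 0.55 * duration
        return peak if in_burst else peak // 10
    if kind == "soak":
        # Moderate, constant load (here half of peak) for the whole run.
        return peak // 2
    raise ValueError(f"unknown profile: {kind}")
```

A scheduler would call this once per tick and start or stop virtual users to match the returned target. Only the shape differs between test types; the scenario being executed can stay the same.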

The metrics that define a result

A performance test is only as useful as the metrics you read from it.

Response time is the duration from request to response. Always read it as a distribution. The average is misleading; a healthy average can hide a 99th percentile that is ten times worse. The slow tail is what real users notice and complain about.
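To make the point about distributions concrete, here is a small sketch that computes the mean and several percentiles for the same set of response times. The latency values are invented for illustration, and the percentile function uses the nearest-rank definition, which is one of several conventions in use.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. `p` is a number from 0 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Invented data: 95 fast requests and 5 very slow ones.
latencies_ms = [100] * 95 + [2100] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)   # 200.0
p50_ms = percentile(latencies_ms, 50)             # 100
p95_ms = percentile(latencies_ms, 95)             # 100
p99_ms = percentile(latencies_ms, 99)             # 2100
```

The mean of 200 ms matches no request that was actually served, and in this data set even the 95th percentile misses the slow tail. Only the 99th percentile shows that one request in twenty took more than two seconds.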

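Throughput is the volume of work completed per unit of time, often requests per second. Averaged over a run, it is the total number of completed requests divided by the test duration. That figure shows what the system delivered under the load you applied; it equals the system's capacity only if the test actually pushed the system to saturation.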
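As a rough sketch of the calculation, the following computes overall throughput and a per-second breakdown from request completion times. The function names and the sample timestamps are assumptions made for the example.

```python
def throughput(completion_times_s, duration_s):
    """Average requests per second over the whole run."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return len(completion_times_s) / duration_s

def per_second(completion_times_s):
    """Count completed requests in each one-second bucket of the run."""
    buckets = {}
    for t in completion_times_s:
        second = int(t)
        buckets[second] = buckets.get(second, 0) + 1
    return dict(sorted(buckets.items()))

# Invented completion timestamps, in seconds since the test started.
times = [0.1, 0.5, 0.9, 1.2, 1.8, 2.5]
overall = throughput(times, duration_s=3)   # 2.0
by_second = per_second(times)               # {0: 3, 1: 2, 2: 1}
```

The per-second view is often more useful than the single average: a plateau in completed requests while the offered load keeps rising is a sign that the system has saturated.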

Concurrency is the number of simultaneous users or connections. System capacity is frequently stated as the concurrency level at which response time crosses the acceptable threshold.
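One way to turn that definition into a number is to run the test at increasing concurrency steps and take the highest step that still meets the targets. The sketch below assumes each step has already been summarized as a (concurrency, p95 in ms, error percentage) tuple; the function name, the thresholds, and the data are hypothetical.

```python
def capacity_from_steps(steps, p95_budget_ms, max_error_pct):
    """Return the highest concurrency level that met both targets before
    the first violation, or None if even the lowest step failed."""
    best = None
    for concurrency, p95_ms, error_pct in sorted(steps):
        if p95_ms > p95_budget_ms or error_pct > max_error_pct:
            break  # stop at the first step that misses a target
        best = concurrency
    return best

# Invented step results: (virtual users, p95 response time, error rate).
steps = [(50, 120.0, 0.0), (100, 180.0, 0.1), (200, 420.0, 0.5), (400, 1500.0, 6.0)]
limit = capacity_from_steps(steps, p95_budget_ms=300.0, max_error_pct=1.0)  # 100
```

Stopping at the first violation, rather than taking the largest passing step anywhere in the list, avoids reporting a level that only passed by chance after the system had already started to degrade.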

Error rate is the percentage of requests that fail under load. A system that stays fast but starts failing requests at high concurrency has not passed; speed without reliability is not performance.

CPU and memory utilization explain why the other numbers move. If latency rises while CPU is pinned at 100%, the system is most likely compute-bound. If latency rises while CPU sits mostly idle, the bottleneck is elsewhere, usually a database, a lock, or an external dependency.

A complete result reads like a sentence: at this concurrency, throughput peaked here, the 95th-percentile response time was this, the error rate was that, and the server was bound on this resource.
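A result in that form can be assembled from raw samples. The following is a sketch under stated assumptions: the Sample type, the summarize function, and the example data are invented here, and the 95th percentile uses the nearest-rank definition. The resource the server was bound on has to come from server-side monitoring, so it is not computed.

```python
import math
from dataclasses import dataclass

@dataclass
class Sample:
    latency_ms: float
    ok: bool  # False if the request failed

def summarize(samples, duration_s, concurrency):
    """Reduce raw samples to throughput, p95 latency, and error rate."""
    if not samples:
        raise ValueError("no samples")
    n = len(samples)
    ordered = sorted(s.latency_ms for s in samples)
    p95_ms = ordered[max(1, math.ceil(95 * n / 100)) - 1]
    failures = sum(1 for s in samples if not s.ok)
    return {
        "concurrency": concurrency,
        "throughput_rps": n / duration_s,
        "p95_ms": p95_ms,
        "error_rate_pct": 100 * failures / n,
    }

def as_sentence(r):
    return (f"At {r['concurrency']} concurrent users, throughput was "
            f"{r['throughput_rps']:.1f} req/s, p95 response time was "
            f"{r['p95_ms']:.0f} ms, and the error rate was "
            f"{r['error_rate_pct']:.1f}%.")

# Invented run: 18 fast successes and 2 slow failures over 10 seconds.
run = [Sample(100.0, True)] * 18 + [Sample(900.0, False)] * 2
report = summarize(run, duration_s=10, concurrency=5)
```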

Where performance testing fits in the process

Performance testing used to be a single gate near the end of a project, run once before launch by a specialist team. That model fails for systems that ship continuously, because any change can cause a performance regression. A new query, an added integration, or an unindexed column can each quietly add latency that no functional test detects.

The better model treats performance like correctness: continuously checked, with a budget. Define a response-time and error-rate budget for critical paths. Run a light load test in CI/CD so a regression fails the build at the pull request. Reserve heavy stress and soak runs for scheduled pre-release testing, where their longer runtime is acceptable.
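A budget check of this kind can be very small. The sketch below assumes the load test has already produced a dictionary of metrics; the budget values, metric names, and function names are hypothetical and would be replaced by whatever the team's tooling reports.

```python
# Hypothetical budget for one critical path; the numbers are placeholders.
BUDGET = {"p95_ms": 300.0, "error_rate_pct": 1.0}

def check_budget(result, budget=BUDGET):
    """Return a list of violations; an empty list means within budget."""
    violations = []
    for metric, limit in budget.items():
        value = result[metric]
        if value > limit:
            violations.append(f"{metric}: {value} exceeds budget {limit}")
    return violations

def exit_code(result, budget=BUDGET):
    """0 if the result is within budget, 1 otherwise."""
    return 1 if check_budget(result, budget) else 0

# A CI step would pass exit_code(...) to sys.exit so that a regression
# fails the build.
within = {"p95_ms": 280.0, "error_rate_pct": 0.4}
regressed = {"p95_ms": 350.0, "error_rate_pct": 0.4}
```

Keeping the budget in version control next to the test means a change to the threshold is reviewed like any other change.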

For most systems, the highest-value place to performance test is the API layer. APIs carry the core logic, they are fast and deterministic to call, and they have no flaky UI to fight. Testing APIs under load gives reliable numbers cheaply. Keeping performance tests beside functional API tests means every change is checked for both correctness and speed at once.

Common performance testing mistakes

Performance testing is easy to do in a way that produces confident, wrong answers. A few mistakes show up again and again.

Testing against unrealistic infrastructure. A load test on a developer laptop, or against a staging environment with a fraction of production’s resources, produces numbers that say little about how production will behave. Test on infrastructure that matches production as closely as you can afford to.

Ignoring warm-up effects. Many systems are slow for the first few seconds of a run while caches fill and connection pools open. Measuring the cold start and the steady state together produces a misleading average. Discard the warm-up window or report it separately.
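Separating the warm-up window is a simple filter on timestamps. This sketch assumes each sample is a (seconds since test start, latency in ms) pair; the function name, the five-second window, and the data are illustrative only.

```python
def split_warmup(samples, warmup_s):
    """Split (timestamp_s, latency_ms) samples into warm-up and
    steady-state latencies at the given cut-off."""
    warmup = [lat for t, lat in samples if t < warmup_s]
    steady = [lat for t, lat in samples if t >= warmup_s]
    return warmup, steady

# Invented run: two slow cold-start requests, then steady traffic.
samples = [(0.2, 900.0), (1.0, 700.0), (5.0, 100.0), (6.0, 110.0), (7.0, 90.0)]
warmup, steady = split_warmup(samples, warmup_s=5.0)

mean_all = sum(lat for _, lat in samples) / len(samples)   # 380.0
mean_steady = sum(steady) / len(steady)                    # 100.0
```

Here the combined mean is almost four times the steady-state mean, which is why the two windows should be reported separately.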

Reading averages instead of percentiles. This mistake is worth repeating because it is so common. An average response time of 200 ms can hide a 99th percentile of three seconds. The average describes a request nobody actually makes; the percentiles describe real users.

Using unrealistic data. Testing every request with the same user id or the same product means the database serves everything from cache. Real traffic spreads across the data set, hitting cold rows and cache misses. Vary the test data to match.
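One simple way to vary the data is to draw request parameters from a pool of ids with a seeded random generator, so runs stay reproducible. The function name, field names, and pool sizes below are assumptions for the example; real traffic is usually skewed toward popular items, which a uniform draw does not model.

```python
import random

def make_requests(n, user_ids, product_ids, seed=42):
    """Build n request parameter sets drawn uniformly from the id pools.
    A fixed seed keeps the sequence identical between runs."""
    rng = random.Random(seed)
    return [
        {"user_id": rng.choice(user_ids), "product_id": rng.choice(product_ids)}
        for _ in range(n)
    ]

# Hypothetical id pools; in practice these would come from a test data set.
users = list(range(1, 501))
products = list(range(1, 2001))
requests = make_requests(1000, users, products)
distinct_users = len({r["user_id"] for r in requests})
```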

Testing one endpoint in isolation. Real users move through workflows: log in, browse, search, check out. Hammering a single endpoint misses the contention that appears when several endpoints compete for the same database and connection pool. Test realistic multi-step scenarios, not just individual calls.

Treating the test as one-and-done. A single pre-launch performance test goes stale the moment the next feature ships. Performance regresses continuously, so the test has to run continuously too.

Avoiding these six mistakes is most of what separates a performance test that informs a decision from one that produces a comforting but meaningless green check.

Running performance tests with Apidog

Apidog builds load testing into the same workspace used for API design and functional testing, so performance checks do not require a separate tool or a separate copy of the API definition.

You take an endpoint or a multi-step test scenario, confirm it passes functionally, then run it under a configured number of virtual users for a set duration. Apidog ramps the load up gradually and reports response-time percentiles, throughput, and error rate live, so you can see the exact concurrency level where performance turns. For load beyond a single machine, the scenario exports to JMeter while keeping the same definition.

Because the same test scenario serves both functional and performance runs, you maintain one artifact instead of two. Download Apidog to profile an endpoint you already have.

Frequently asked questions

What is the difference between performance testing and functional testing? Functional testing checks whether the software produces correct results. Performance testing checks how fast and how reliably it does so under load. Both are needed; neither replaces the other.

Which performance test type should I run first? Baseline, then load. The baseline gives you a reference under normal conditions; the load test confirms the system survives expected peak traffic. Add stress, spike, and soak testing from there.

Why use percentiles instead of average response time? The average blends the fast majority and the slow tail into a single number, so it hides how slow the tail really is. The 95th and 99th percentiles reveal what the least-lucky requests experience, and that tail is what users feel.

Can performance testing be automated? Yes. A light load test runs well in CI on every change, with a defined budget that fails the build on regression. Heavier stress and soak tests are usually scheduled rather than run on every commit.

When in the development cycle should performance testing start? Earlier than most teams think. You cannot get final latency numbers without real infrastructure, but you can establish budgets and write the test scenarios during design. Running a basic load test as soon as an endpoint is functional catches problems while they are cheap to fix.

Who is responsible for performance testing? On modern teams it is shared. Developers run lightweight load checks on their own changes; QA owns the broader test scenarios and budgets; operations or SRE supplies the production-like infrastructure and the server-side metrics. Treating it as one specialist’s job is how performance problems reach production.

How long should a performance test run? Long enough to pass the warm-up window and reach a steady state, usually several minutes for a load test. Soak tests run for hours or days by design, since their whole purpose is to expose slow degradation that short runs miss.