At WyldTrace, we had a traceability API that was slow — not catastrophically slow, but frustratingly, unpredictably slow. The kind of slow that makes stakeholders raise eyebrows during demos and causes that uncomfortable silence when a product manager asks why the page took four seconds to load.
After months of profiling, tuning, and one critical revelation about connection pooling under load, we brought average request latency down by 38%. This is what I learned — and what I'd do differently from day one.
The Problem We Had
The platform handled product provenance lookups — scanning a QR code on a physical product should return its full supply chain history in under two seconds. In development, it did. In staging, it did. In production, under real load, it sometimes took five or six. Sporadically. Infuriatingly.
The initial instinct — and I'll admit this was mine — was to throw solutions at it. Add caching. Tune the JVM heap. Rewrite the worst-looking query. These things helped marginally. But we were optimising blind.
Lesson one: Optimising without measuring is guessing. You might guess right occasionally, but you'll never know why it worked, and you won't be able to reproduce the result deliberately.
Building Visibility First
Before touching a single line of application code, we instrumented everything. Spring Boot Actuator gave us the foundation. We wired in Micrometer with a Prometheus backend, added custom timers around our most-called service methods, and deployed a Grafana dashboard that gave us per-endpoint p50, p95, and p99 latency in real time.
The first thing the dashboard told us was humbling: the problem wasn't our code at all.
```java
// Before: fire and forget, no visibility
public ProvenanceRecord lookupRecord(String qrCode) {
    return repository.findByQrCode(qrCode);
}

// After: instrumented with Micrometer timer
private final MeterRegistry registry;

public ProvenanceRecord lookupRecord(String qrCode) {
    return Timer.builder("provenance.lookup")
        .tag("endpoint", "qr-scan")
        .register(registry)
        .record(() -> repository.findByQrCode(qrCode));
}
```
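With that timer exporting to Prometheus, a per-endpoint p95 panel can be driven by a query along these lines (a sketch, assuming `publishPercentileHistogram()` is enabled on the timer so Micrometer exports histogram buckets, and using Micrometer's default Prometheus translation of the `provenance.lookup` name):

```promql
histogram_quantile(
  0.95,
  sum(rate(provenance_lookup_seconds_bucket[5m])) by (le, endpoint)
)
```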
Once we could see the data, patterns emerged immediately. Latency spikes happened at predictable intervals — roughly every 10 minutes — and they correlated almost perfectly with connection pool exhaustion events in our HikariCP logs.
The Real Culprit
Our microservice was configured with default HikariCP settings. The default maximum pool size is 10 connections. Under normal load, fine. Under the burst of 50–80 concurrent lookups that came with a real product launch event, threads were queuing for a database connection for up to 3.4 seconds before the actual query even ran.
The query itself was fast — under 40ms with proper indexing. But threads were waiting 3,400ms just to get a connection. We'd been profiling queries while the real problem was a config value we'd never touched.
What We Changed
- HikariCP pool size — tuned from default 10 to 30, with a connection timeout of 2s and a max lifetime of 10 minutes. Monitored the pool utilisation to find the right ceiling without over-provisioning.
- Database indexing — added a composite index on `(qr_code, product_id, created_at)`, which reduced our most common query from a full table scan to a sub-5ms index seek.
- Read replicas — directed all lookup traffic to a read replica on AWS RDS, freeing the primary for writes and reducing contention.
- Request pipeline batching — grouped concurrent lookups for the same product into a single downstream call using a short-lived in-flight cache keyed on QR code hash.
- N+1 query elimination — Hibernate was triggering one query per supply chain step. Replacing it with a single JOIN-based fetch via `@EntityGraph` cut per-request query count from 12 down to 1.
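The in-flight cache used for request batching can be sketched roughly like this (the class and method names are my own illustration, not WyldTrace's actual code; it assumes the downstream call can be wrapped in a `CompletableFuture`):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Rough sketch of request coalescing: concurrent lookups for the same key
// share one in-flight downstream call instead of each issuing their own.
public class InFlightCoalescer<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight =
            new ConcurrentHashMap<>();

    public CompletableFuture<V> lookup(K key, Function<K, V> loader) {
        // computeIfAbsent guarantees only one future is created per key;
        // the entry is evicted once the call completes, so a result is
        // shared only while the downstream call is actually in flight.
        return inFlight.computeIfAbsent(key, k ->
                CompletableFuture.supplyAsync(() -> loader.apply(k))
                        .whenComplete((value, error) -> inFlight.remove(k)));
    }
}
```

Keying on a hash of the QR code, as described above, makes concurrent scans of the same product collapse into a single repository call during a burst.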
What I'd Do Differently
Looking back, nearly all of this pain was avoidable. The fixes were not complex — the real cost was the weeks we spent optimising the wrong things before we could see clearly what was wrong.
- Instrument from day one. Add Micrometer, wire up Prometheus, build a Grafana dashboard before the first PR is merged. It costs an afternoon and pays back tenfold.
- Load test early and often. A single-user response time tells you almost nothing. What matters is behaviour under your p95 concurrent load. Use k6 or Gatling and test from the first week.
- Never leave connection pool config at default. Profile your actual concurrent usage, set
maximumPoolSizedeliberately, and alert on pool saturation. - Audit every ORM query. Hibernate is powerful and treacherous in equal measure. Log SQL in staging, count queries per request, and treat any N+1 as a bug.
- Treat latency as a feature. It's not a performance concern to defer to a later sprint. Your users feel it on the first day.
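For the pool settings described earlier, the Spring Boot flavour of that configuration might look like this (values taken from our tuning; the right ceiling depends on your own measured concurrency, so treat these as a worked example rather than defaults to copy):

```properties
# Tuned HikariCP settings (Spring Boot property names)
spring.datasource.hikari.maximum-pool-size=30
# Fail fast after 2s rather than queue indefinitely for a connection
spring.datasource.hikari.connection-timeout=2000
# Recycle connections every 10 minutes
spring.datasource.hikari.max-lifetime=600000
```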
The Outcome
After implementing the changes above over three focused sprints, our average end-to-end latency for a QR provenance lookup dropped from 4.2 seconds to 1.2 seconds at scale — well within our target. The p99 dropped from a painful 8.1 seconds to 2.4 seconds. No more stakeholder eyebrows.
More importantly, we now had the instrumentation to know what was happening at all times. When a new deployment causes a regression, we see it on the dashboard within minutes — not in a Slack message from an unhappy user.
That shift — from reacting to symptoms to observing causes — is the real lesson here. The code changes were almost secondary.
TL;DR: If your API is slow and you don't have per-endpoint p95 latency visible on a dashboard right now, that's the first thing to fix. Measure before you optimise. Everything else follows from seeing clearly.