Ludicrous speed, GO!
One of the things I truly enjoy as a software engineer is the benchmark/think/tweak cycle. Figuring out where a bottleneck is and trying to eliminate it. Hitting a performance wall that makes you revisit assumptions, re-architect a component, or research alternate algorithms.
But more important to HikariCP than performance are reliability and simplicity.
HikariCP has best-of-breed resilience in the face of network disruption, and each release has brought with it improved stability and consistency under load. However, some of these reliability gains have come at the expense of performance. Consequently, HikariCP 2.3.8 is roughly 20% slower than HikariCP 2.0.1.
In HikariCP 2.4.0, after many releases focused on reliability, I wanted to regain our performance.
HikariCP utilises a specialized collection called a ConcurrentBag to hold connections in the pool. And like all pools, we need to be able to signal waiting threads when connections are returned to the pool. In versions prior to 2.4.0, ConcurrentBag utilised an implementation of AbstractQueuedLongSynchronizer for wait/notify semantics.
AbstractQueuedLongSynchronizer provides useful features like efficient FIFO thread queueing and parking/unparking. Subclasses generally rely on the provided compareAndSetState() methods, which are merely wrappers around an AtomicLong, to implement their synchronization semantics.
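To make the mechanics concrete, here is a minimal, hypothetical AbstractQueuedLongSynchronizer subclass — a toy "sequence gate", not HikariCP's actual implementation — showing how tryAcquireShared()/tryReleaseShared() lean on compareAndSetState() and the underlying long state:

```java
import java.util.concurrent.locks.AbstractQueuedLongSynchronizer;

// Hypothetical sketch: threads wait until the state advances past a target.
public class SequenceGate extends AbstractQueuedLongSynchronizer {
    @Override
    protected long tryAcquireShared(long target) {
        // positive result = acquired; negative = queue up and park
        return getState() > target ? 1 : -1;
    }

    @Override
    protected boolean tryReleaseShared(long unused) {
        long s;
        do {
            s = getState();
        } while (!compareAndSetState(s, s + 1));  // CAS on the underlying long
        return true;  // wake queued waiters so they re-check the state
    }

    public void await(long target) { acquireShared(target); }

    public void advance() { releaseShared(1); }

    public static void main(String[] args) throws InterruptedException {
        SequenceGate gate = new SequenceGate();
        Thread waiter = new Thread(() -> gate.await(0));  // blocks until state > 0
        waiter.start();
        Thread.sleep(100);
        gate.advance();  // increments state and unparks the waiter
        waiter.join(2000);
        System.out.println("waiter finished: " + !waiter.isAlive());
    }
}
```

Every release goes through that compareAndSetState() loop — which is exactly where AtomicLong-style contention lives.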
The performance of HikariCP’s AbstractQueuedLongSynchronizer implementation was fine, very good even, but the fact that AtomicLong performs poorly under contention periodically surfaced in my brain.
I kept thinking “there must be some way to take advantage of Java 8’s LongAdder”. It’s well known that LongAdder has much higher performance under contention; that is its raison d’être. I won’t bore you with all of the particulars. The Javadoc says it best:

“This class is usually preferable to AtomicLong when multiple threads update a common sum that is used for purposes such as collecting statistics, not for fine-grained synchronization control.”
This is because LongAdder is not Sequentially Consistent.
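A quick illustration of the statistics-style usage the Javadoc describes — each writer bumps its own striped cell via increment(), and a reader folds the cells together with sum(); there is no single hot CAS location (the class name here is just for the demo):

```java
import java.util.concurrent.atomic.LongAdder;

public class AdderDemo {
    public static void main(String[] args) throws InterruptedException {
        LongAdder adder = new LongAdder();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                // contended increments land on per-thread striped cells
                for (int n = 0; n < 100_000; n++) adder.increment();
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        // sum() folds the cells; the total is exact once writers have stopped
        System.out.println(adder.sum());
    }
}
```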
It turns out that LongAdder is Sequentially Consistent if you stick to only the increment() and sum() methods. That is to say, the value must monotonically increase.
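A sketch of what that constraint buys us, assuming the names here (signal, awaitAdvance) are illustrative rather than HikariCP’s: because the sequence only ever increases, a waiter can snapshot sum() and then wait for any strictly greater value — a signal can never be "un-seen", so re-checking closes the lost-wakeup window:

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.LockSupport;

public class SequenceWait {
    private static final LongAdder sequence = new LongAdder();

    // signal side: only ever increments, keeping sum() monotonic
    static void signal() { sequence.increment(); }

    // wait side: remember the sequence we saw, then wait for it to advance
    static void awaitAdvance(long seen) {
        while (sequence.sum() <= seen) {
            LockSupport.parkNanos(1_000);  // real code would queue and park properly
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long seen = sequence.sum();
        Thread waiter = new Thread(() -> awaitAdvance(seen));
        waiter.start();
        Thread.sleep(50);
        signal();
        waiter.join(2000);
        System.out.println("advanced: " + !waiter.isAlive());
    }
}
```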
It now seemed possible to create a new LongAdder-based wait/notify mechanism that substantially outperforms the previous one, as long as we adhere to the Sequential Consistency constraints above.
QueuedSequenceSynchronizer is a mash-up of LongAdder and AbstractQueuedLongSynchronizer, taking advantage of the performance of the former and the infrastructure of the latter. On Java 7 it falls back to AtomicLong¹, but on Java 8 … it’s ludicrously fast.
¹ Unless DropWizard Metrics is present, in which case we use their LongAdder Java 7 backport.
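Putting the pieces together, here is a hypothetical miniature of the borrow/requite pattern — the class and method names are illustrative, and this is far simpler than the real ConcurrentBag/QueuedSequenceSynchronizer pair — where returning a connection costs only a queue offer plus a LongAdder increment:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.LockSupport;

// Illustrative miniature, not HikariCP's actual ConcurrentBag.
public class MiniBag<T> {
    private final Queue<T> shared = new ConcurrentLinkedQueue<>();
    private final LongAdder sequence = new LongAdder();

    public T borrow(long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        do {
            long seen = sequence.sum();      // snapshot before scanning
            T item = shared.poll();
            if (item != null) return item;
            // nothing free: wait until the sequence advances (a return happened)
            while (sequence.sum() <= seen && System.nanoTime() < deadline) {
                LockSupport.parkNanos(10_000);
            }
        } while (System.nanoTime() < deadline);
        return null;                          // timed out
    }

    public void requite(T item) {
        shared.offer(item);
        sequence.increment();                 // cheap LongAdder bump wakes borrowers
    }

    public static void main(String[] args) {
        MiniBag<String> bag = new MiniBag<>();
        bag.requite("conn-1");
        String c = bag.borrow(1, TimeUnit.SECONDS);
        System.out.println("borrowed: " + c);
        bag.requite(c);
    }
}
```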
Do you have a nanosecond?
Without much fanfare, I’ll cut to the chase and stick with a simple before/after on my Core i7 (3770) 3.4GHz “Ivy Bridge” iMac.
Put another way, roundtrip times (getConnection()/close()) are now between 150 and 250 nanoseconds on commodity hardware.
As usual in our benchmarks, “Unconstrained” means that there are more available connections than threads, and “Constrained” means that threads outnumber connections 2:1.
Of course, the benchmark basically creates maximum contention (~20-50k calls per millisecond), so in production environments we would expect L2 cache-line invalidation to be less frequent (to put it mildly).
In the case of unconstrained access, the QueuedSequenceSynchronizer doesn’t really come into play much. The big win comes from the fact that released connections are merely incrementing a LongAdder in v2.4.0, compared to incrementing an AtomicLong in v2.3.x.
In the case of constrained access, the QueuedSequenceSynchronizer sees quite a bit more action. My concern was that the necessity of calling LongAdder.sum(), which is generally much slower than AtomicLong.get(), would result in worse performance than v2.3.x instead of better.
However, this fear proved unfounded. While the constrained performance is roughly half of the unconstrained in v2.4.0, it still amazingly beats the unconstrained performance of v2.3.x.
Stacking HikariCP 2.4.0 up against the usual pools in the benchmark suite…
- One Connection Cycle is defined as a single DataSource.getConnection()/Connection.close().
- In the Unconstrained benchmark, connections > threads.
- In the Constrained benchmark, threads > connections (2:1).
- One Statement Cycle is defined as a single Connection.prepareStatement(), Statement.execute(), Statement.close().
* Versions: HikariCP 2.4.0, commons-dbcp2 2.1, Tomcat 8.0.23, Vibur 3.0, c3p0 0.9.5.1, Java 8u45
* Java options: -server -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xmx1096m
Scratching an Itch
Having scratched my performance itch, at least for a while, I’ll be turning my attention back to equally important tasks such as improving metrics reporting and performance bottleneck troubleshooting ability.
Thanks for reading.