https://db-benchmarks.com aims to make database and search engine benchmarks:
⚖️ Fair and transparent - it should be clear under what conditions a given database / search engine delivers a given level of performance
🚀 High quality - controlling the coefficient of variation allows producing results that remain the same whether you run a query today, tomorrow, or next week
👪 Easily reproducible - anyone can reproduce any test on their own hardware
📊 Easy to understand - the charts are very simple
➕ Extendable - pluggable architecture allows adding more databases to test
And keep it all 100% open source!
This repository provides a test framework which does the job.
Why is this important?
Many database benchmarks are not objective; others just don't care about result accuracy and stability, which in some cases defeats the whole purpose of benchmarking. A few examples:
Druid vs Clickhouse vs Rockset
We actually wanted to do the benchmark on the same hardware, an m5.8xlarge, but the only pre-baked configuration we have for m5.8xlarge is actually the m5d.8xlarge … Instead, we run on a c5.9xlarge instance
Bad news, guys: when you run benchmarks on different hardware, at the very least you can't then claim that something is "106.76%" or "103.13%" of something else. Even when you test on the same bare-metal server it's quite difficult to get a coefficient of variation lower than 5%, so a 3% difference measured on different servers can most likely be ignored. Given all that, how can one be sure the final conclusion is true?
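To make the point concrete, here is a minimal illustration (not part of the framework) of how the coefficient of variation is computed and why a small gap between means is meaningless when it sits inside the run-to-run noise. The timings below are made up for the example:

```python
import statistics

def coefficient_of_variation(latencies_ms):
    """CV = standard deviation / mean; lower means more stable results."""
    return statistics.stdev(latencies_ms) / statistics.mean(latencies_ms)

# Hypothetical timings (ms) of the same query repeated 5 times on each server:
server_a = [102, 97, 105, 99, 108]
server_b = [100, 106, 95, 103, 98]

print(f"CV on server A: {coefficient_of_variation(server_a):.1%}")  # ~4.3%
print(f"CV on server B: {coefficient_of_variation(server_b):.1%}")  # ~4.3%
# The means (102.2 ms vs 100.4 ms) differ by under 2% - well inside the
# ~5% run-to-run noise, so the gap tells you nothing, and even less so
# when the two sets of numbers come from different hardware.
```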
Lots of databases and engines
Mark did a great job making the taxi rides test run on so many different databases and search engines. But since the tests were made on different hardware, the numbers in the resulting table can't be meaningfully compared with one another. You always need to keep this in mind when evaluating the results in the table.
Clickhouse vs others
When you run each query just 3 times, you'll most likely get a very high coefficient of variation for each of them, which means that if you run the same benchmark a minute later you may get results that differ by some 20%. And how does one reproduce it on their own hardware? Unfortunately, I can't find any way to do that.
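A minimal sketch of the alternative: instead of trusting 3 runs, keep repeating the query until the coefficient of variation of its timings drops below a threshold. All names here (measure_until_stable, run_query, the thresholds) are illustrative, not the framework's actual API:

```python
import statistics
import time

def measure_until_stable(run_query, max_cv=0.05, min_runs=5, max_runs=100):
    """Repeat a query until the coefficient of variation (CV) of its
    timings drops below max_cv, so the result is stable and reproducible."""
    timings = []
    for attempt in range(1, max_runs + 1):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
        if attempt >= min_runs:
            cv = statistics.stdev(timings) / statistics.mean(timings)
            if cv < max_cv:
                return statistics.mean(timings), cv
    raise RuntimeError(f"Timings never stabilized within {max_runs} runs")
```

With this approach, a reported number comes with a known variance, so anyone re-running the test today, tomorrow, or on their own hardware can tell whether their result actually disagrees or just falls within the noise.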