Napkin Math

The goal of this project is to collect software, numbers, and techniques to
quickly estimate the expected performance of systems from first-principles. For
example, how quickly can you read 1 GB of memory? By composing these resources
you should be able to answer interesting questions like: how much storage cost
should you expect to pay for logging for an application with 100,000 RPS?

The best introduction to this skill is through my talk at
SRECON.

The best way to practise napkin math in the grand domain of computers is to work
on your own problems. The second-best is to subscribe to this
newsletter where you'll get a problem every few
weeks to practise on. It should only take you a few minutes to solve each one as your
facility with these techniques improve.

The archive of problems to practise with are
here. The solution will be in the following
newsletter.

Numbers

Below are numbers that are rounded from runs on a metal Intel Xeon E-2236 3.4GHz
with 12 (virtual) cores.

Note 1: Some throughput and latency numbers don't line up, this is
intentional for ease of calculations.

Note 2: Take the numbers with a grain of salt. E.g. for I/O, fio is
the state-of-the-art. I am continuously updating these numbers as I learn more
to improve accuracy and as hardware improves.

Operation	Latency	Throughput	1 MiB	1 GiB
Sequential Memory R/W (64 bytes)	0.5 ns
├ Single Thread, No SIMD		10 GiB/s	100 μs	100 ms
├ Single Thread, SIMD		20 GiB/s	50 μs	50 ms
├ Threaded, No SIMD		30 GiB/s	35 μs	35 ms
├ Threaded, SIMD		35 GiB/s	30 μs	30 ms
Network Same-Zone		10 GiB/s	100 μs	100 ms
├ Inside VPC		10 GiB/s	100 μs	100 ms
├ Outside VPC		3 GiB/s	300 μs	300 ms
Hashing, not crypto-safe (64 bytes)	25 ns	2 GiB/s	500 μs	500 ms
Random Memory R/W (64 bytes)	50 ns	1 GiB/s	1 ms	1s
Fast Serialization `[8]` `[9]` †	N/A	1 GiB/s	1 ms	1s
Fast Deserialization `[8]` `[9]` †	N/A	1 GiB/s	1 ms	1s
System Call	500 ns	N/A	N/A	N/A
Hashing, crypto-safe (64 bytes)	500 ns	200 MiB/s	10 ms	10s
Sequential SSD read (8 KiB)	1 μs	4 GiB/s	200 μs	200 ms
Context Switch `[1] [2]`	10 μs	N/A	N/A	N/A
Sequential SSD write, -fsync (8KiB)	10 μs	1 GiB/s	1 ms	1s
TCP Echo Server (32 KiB)	10 μs	4 GiB/s	200 μs	200 ms
Decompression `[11]`	N/A	1 GiB/s	1 ms	1s
Compression `[11]`	N/A	500 MiB/s	2 ms	2s
Sequential SSD write, +fsync (8KiB)	1 ms	10 MiB/s	100 ms	2 min
Sorting (64-bit integers)	N/A	200 MiB/s	5 ms	5s
Sequential HDD Read (8 KiB)	10 ms	250 MiB/s	2 ms	2s
Blob Storage same region, 1 file	50 ms	500 MiB/s	2 ms	2s
Blob Storage same region, n files	50 ms	NW limit
Random SSD Read (8 KiB)	100 μs	70 MiB/s	15 ms	15s
Serialization `[8]` `[9]` †	N/A	100 MiB/s	10 ms	10s
Deserialization `[8]` `[9]` †	N/A	100 MiB/s	10 ms	10s
Proxy: Envoy/ProxySQL/Nginx/HAProxy	50 μs	?	?	?
Network within same region	250 μs	2 GiB/s	500 μs	500 ms
Premium network within zone/VPC	250 μs	25 GiB/s	50 μs	40 ms
{MySQL, Memcached, Redis, ..} Query	500 μs	?	?	?
Random HDD Read (8 KiB)	10 ms	0.7 MiB/s	2 s	30m
Network between regions `[6]`	Varies	25 MiB/s	40 ms	40s
Network NA East <-> West	60 ms	25 MiB/s	40 ms	40s
Network EU West <-> NA East	80 ms	25 MiB/s	40 ms	40s
Network NA West <-> Singapore	180 ms	25 MiB/s	40 ms	40s
Network EU West <-> Singapore	160 ms	25 MiB/s	40 ms	40s

†: "Fast serialization/deserialization" is typically a simple wire-protocol
that just dumps bytes, or a very efficient environment. Typically standard
serialization such as e.g. JSON will be of the slower kind. We include both here
as serialization/deserialization is a very, very broad topic with extremely
different performance characteristics depending on data and implementation.

You can run this with ./run to run with the right optimization levels. You
won't get the right numbers when you're compiling in debug mode. You can help
this project by adding new suites and filling out the blanks.

Note: I'm currently porting the benchmarks over to Criterion.rs, so some are
in bench/ now. You can run those by uncommenting the relevant line in ./run.

I am aware of some inefficiencies in this suite. I intend to improve my skills
in this area, in order to ensure the numbers are the upper-bound of performance
you may be able to squeeze out in production. I find it highly unlikely any of
them will be more than 2-3x off, which shouldn't be a problem for most users.

Cost Numbers

Approximate numbers that should be consistent between Cloud providers.

What	Amount	$ / Month	1y commit $ /month	Spot $ /month	Hourly Spot $
CPU	1	$15	$10	$2	$0.005
GPU	1	$5000	$3000	$1500	$2
Memory	1 GB	$2	$1	$0.2	$0.0005
Storage
├ Warehouse Storage	1 GB	$0.02
├ Blob (S3, GCS)	1 GB	$0.02
├ Zonal HDD	1 GB	$0.05
├ Ephemeral SSD	1 GB	$0.1	$0.05	$0.05	$0.05
├ Regional HDD	1 GB	$0.1
├ Zonal SSD	1 GB	$0.2
├ Regional SSD	1 GB	$0.35
Networking
├ Same Zone †	1 GB	$0
├ Cross-Zone	1 GB	$0.01
├ Cross-Zone Blob	1 GB	$0
├ Region Ingress	1 GB	$0
├ Region Egress	1 GB	$0.1
├ Internet Ingress	1 GB	$0
├ Internet Egress	1 GB	$0.1
CDN Egress	1 GB	$0.05
CDN Fill ‡	1 GB	$0.01
Warehouse Query	1 GB	$0.005

† This refers to network leaving your cloud provider, e.g. sending data to S3
from GCP or egress network for sending HTML from AWS to a client.

‡ An additional per cache-fill fee is incurred that costs close to blob storage
write costs (see just below).

Furthermore, for blob storage (S3/GCS/R2/...), you're charged per read/write
operation (fewer, large files is cheaper):

	1M	1000
Reads	$0.5	$0.0004
Writes	$10	$0.01

Compression Ratios

This is sourced from a few sources. [3] [4] [5] Note that compression speeds (but
generally not ratios) vary by an order of magnitude depending on the algorithm
and the level of compression (which trades speed for compression).

I typically ballpark that another x in compression ratio decreases performance
by 10x. E.g. we can get a 2x ratio on English
Wikipedia at ~200
MiB/s, and 3x at ~20MiB/s, and 4x at 1MB/s.

What	Compression Ratio
HTML	2-3x
English	2-4x
Source Code	2-4x
Executables	2-3x
RPC	5-10x
SSL	-2% `[10]`

Techniques

Don't overcomplicate. If you are basing your calculation on more than 6
assumptions, you're likely making it harder than it should be.
Keep the units. They're good checksumming.
Wolframalpha has terrific support if you need a
hand in converting e.g. KiB to TiB.
Calculate with exponents. A lot of back-of-the-envelope calculations are
done with just coefficients and exponents, e.g. c * 10^e. Your goal is to
get within an order of magnitude right--that's just e. c matters a lot
less. Only worrying about single-digit coefficients and exponents makes it
much easier on a napkin (not to speak of all the zeros you avoid writing).
Perform Fermi decomposition. Write down things you can guess at until you
can start to hint at an answer. When you want to know the cost of storage
for logging, you're going to want to know how big a log line is, how many of
those you have per second, what that costs, and so on.

Resources

[1]: https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/
[2]: https://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
[3]: https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf
[4]: https://github.com/google/snappy
[5]: https://quixdb.github.io/squash-benchmark/
[6]: https://dl.acm.org/doi/10.1145/1879141.1879143
[7]: https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics#Seek_times_&_characteristics
[8]: https://github.com/simdjson/simdjson#performance-results
[9]: https://github.com/protocolbuffers/protobuf/blob/master/docs/performance.md
[10]: https://www.imperialviolet.org/2010/06/25/overclocking-ssl.html
[11]: https://github.com/inikep/lzbench
"How to get consistent results when benchmarking on
Linux?".
Great compilation of various Kernel and CPU features to toggle for reliable
bench-marking, e.g. CPU affinity, disabling turbo boost, etc. It also has
resources on proper statistical methods for benchmarking.
LLVM benchmarking tips. Similar
to the above in terms of dedicating CPUs, disabling address space
randomization, etc.
Top-Down performance analysis
methodology.
Useful post about using toplev to find the bottlenecks. This is particularly
useful for the benchmarking suite we have here, to ensure the programs are
correctly written (I have not taken them through this yet, but plan to).
Godbolt's compiler explorer. Fantastic resource
for comparing assembly between Rust and e.g. C with Clang/GCC.
cargo-show-asm. Cargo extension to allow
disassembling functions. Unfortunately the support for closure is a bit
lacking, which requires some refactoring.
Agner's Assembly
Guide. An excellent
resource on writing optimum assembly, which will be useful to inspect the
various functions for inefficiencies in our suite.
Agner's Instruction
Tables. Thorough
resource on the expected throughput for various instructions which is helpful
to inspect the assembly.
halobates.de. Useful resource for low-level
performance by the author of toplev.
Systems Performance (book). Fantastic book about analyzing system performance, finding bottlenecks, and understanding operating systems.
io_uring. Best summary, it links to many
resources.
How Long Does It Takes To Make a Context Switch
Integer Compression Comparisons
Files are hard

Name With Owner	sirupsen/napkin-math
Primary Language	Rust
Program language	Rust (Language Count: 6)
Platform
License:	MIT License

Created At	2019-06-01 11:30:14
Pushed At	2024-09-15 13:12:14
Last Commit At	2024-09-15 09:12:04
Release Count	0

Stargazers Count	4.4k
Watchers Count	94
Fork Count	180
Commits Count	125
Has Issues Enabled
Issues Count	17
Issue Open Count	3
Pull Requests Count	12
Pull Requests Open Count	4
Pull Requests Close Count	2

Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private

Napkin Math

Github stars Tracking Chart

Napkin Math

Numbers

Cost Numbers

Compression Ratios

Techniques

Resources

Main metrics