TLDR:
- Use C libraries (via NIFs) for matrix computation when performance is the top priority; the pure Elixir implementation is about $10^3$ times slower at matrix computation.
- OTP 25 introduces the JIT on ARM64, and it shows a 3-4% performance improvement in matrix computation.
- Almost linear speedup can be achieved when a large computation task can be divided into independent smaller ones.
- Apple M1 Max performs much better than its x86_64 competitors (Intel Core i9 8950HK and AMD Ryzen 9 3900XT).
Benchmark code here: https://github.com/cocoa-xu/CIFAR-10-livebook
Numerical Elixir
I started using Elixir/Erlang about 2 months ago, and I learned about Numerical Elixir (Nx) from my supervisor, Lito.
Basically, Nx is to Elixir what NumPy is to Python. It implements a number of numerical operations, especially for multi-dimensional arrays (tensors). It's worth noting that Nx comes with built-in automatic differentiation (autograd), which means we don't have to write the corresponding derivative functions for back-propagation when training a neural network.
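For instance, a gradient can be obtained directly from a numerical definition. The snippet below is only a minimal sketch (the module and function names are made up, and the exact grad API may differ slightly between Nx versions):

```elixir
defmodule AutogradSketch do
  import Nx.Defn

  # f(x) = sum(x * x); its gradient df/dx = 2x is derived automatically,
  # so no hand-written derivative is needed for back-propagation.
  defn f(x), do: Nx.sum(Nx.multiply(x, x))
  defn df(x), do: grad(x, &f/1)
end

AutogradSketch.df(Nx.tensor([1.0, 2.0, 3.0]))
#=> #Nx.Tensor<f32[3] [2.0, 4.0, 6.0]>
```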
I explored Nx and wrote some benchmarks to evaluate its performance on different hardware (Raspberry Pi 4, x86_64 laptops and desktops, ARM64 laptops) and under different conditions (calls to external C libraries vs. a pure Elixir implementation). Here are the numbers I finally got!
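Here, "calls to external C libraries" means using a backend such as Torchx, which dispatches tensor operations to LibTorch through NIFs, while the pure Elixir condition uses the default Nx.BinaryBackend. A rough sketch of how a backend is chosen (exact option names may vary across Nx/Torchx versions, and the :torchx dependency has to be installed):

```elixir
# Pure Elixir: every tensor operation is evaluated in Elixir itself.
Nx.default_backend(Nx.BinaryBackend)

# LibTorch via NIF: tensor operations run in native code (requires :torchx).
Nx.default_backend(Torchx.Backend)

# A backend can also be chosen per tensor.
t = Nx.tensor([[1.0, 2.0], [3.0, 4.0]], backend: Torchx.Backend)
Nx.dot(t, t)
```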
P.S. The goal of this benchmark is only to evaluate matrix computation performance, not to achieve a decent (or even acceptable) CIFAR-10 prediction accuracy.
Benchmark Settings
Hardware
- Raspberry Pi 4, 8 GB of RAM. Ubuntu 20.04 aarch64.
- x86_64 laptop. Intel 8th Gen Core i9 8950HK, 6 Cores 12 Threads, MacBook Pro (15-inch, 2018), 32 GB RAM. macOS Big Sur 11.1 x86_64.
- x86_64 desktop. AMD Ryzen 9 3900XT, 12 Cores 24 Threads, Desktop PC, 64 GB RAM, NVIDIA RTX 3090. Ubuntu 20.04 x86_64.
- ARM64 laptop. Apple M1 Max, 10 Cores (8 Performance + 2 Efficiency) 10 Threads, MacBook Pro (14-inch, 2021), 64 GB RAM. macOS Monterey 12.0.1 aarch64.
Software
- Erlang OTP 24.0.6 and 25@b58c66e12.
- Numerical Elixir, Nx@e90de80157.
- LibTorch CPU, v1.9.1.
- LibTorch GPU, v1.9.1. CUDA 11.1, cuDNN 8.2.1.
Dataset
- CIFAR-10.
Method
- 3-layer DenseNN (see the forward-pass sketch after this list).
- Input layer. Dense layer, size {nil, 1024, 64} + {nil, 64}, activation sigmoid.
- Hidden layer. Dense layer, size {nil, 64, 32} + {nil, 32}, activation sigmoid.
- Output layer. Dense layer, size {nil, 32, 10} + {nil, 10}, activation softmax.
- Number of epochs: 5.
- Batch size.
  - 300 when using Nx.BinaryBackend, single-thread.
  - 250 * n_jobs when using Nx.BinaryBackend, multi-thread, where n_jobs is the number of available logical cores.
  - 300 when using Torchx.Backend.
- Binary.
  Benchmark.run(
    backend: Nx.BinaryBackend,
    batch_size: 300,
    n_jobs: 1
  )
- Binary MT.
  Benchmark.run(
    backend: Nx.BinaryBackend,
    batch_size: 250 * System.schedulers_online(),
    n_jobs: System.schedulers_online()
  )
- Torch CPU/GPU.
  Benchmark.run(backend: Torchx.Backend, batch_size: 300)
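To make the layer sizes above concrete, here is a minimal sketch of the forward pass of the 3-layer DenseNN written with Nx.Defn. It is an illustration only, not the code from the benchmark repo; the parameter names (w1, b1, ...) are made up, and the real training loop also needs a loss function and a gradient-descent step.

```elixir
defmodule DenseNNSketch do
  import Nx.Defn

  # Numerically stable softmax over the last axis.
  defn softmax(logits) do
    exp = Nx.exp(logits - Nx.reduce_max(logits, axes: [-1], keep_axes: true))
    exp / Nx.sum(exp, axes: [-1], keep_axes: true)
  end

  # x: {batch, 1024}; w1: {1024, 64}, b1: {64}; w2: {64, 32}, b2: {32};
  # w3: {32, 10}, b3: {10}.
  defn predict(x, w1, b1, w2, b2, w3, b3) do
    x
    |> Nx.dot(w1) |> Nx.add(b1) |> Nx.sigmoid()
    |> Nx.dot(w2) |> Nx.add(b2) |> Nx.sigmoid()
    |> Nx.dot(w3) |> Nx.add(b3) |> softmax()
  end
end
```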
Benchmark Results
Numbers are in seconds.
I'll fill in the empty cells when the rest of the benchmarks are done.
Hardware | Backend | OTP | Load Dataset | To Batched Input | Mean Epoch Time |
---|---|---|---|---|---|
Pi 4 | Binary | 24 | | | |
Pi 4 | Binary MT | 24 | | | |
Pi 4 | Binary | 25 | 194.427 | 11.917 | 27336.010 |
Pi 4 | Binary MT | 25 | 207.923 | 11.855 | 18210.347 |
Pi 4 | Torch CPU | 24 | 15.334 | 4.880 | 17.170 |
Pi 4 | Torch CPU | 25 | 16.372 | 4.442 | 16.207 |
8950HK | Binary | 24 | 17.994 | 3.036 | 4460.758 |
8950HK | Binary MT | 24 | 17.826 | 2.934 | 1471.090 |
8950HK | Torch CPU | 24 | 2.141 | 0.778 | 0.841 |
3900XT | Binary | 24 | 6.058 | 2.391 | 3670.930 |
3900XT | Binary MT | 24 | 6.034 | 2.536 | 786.443 |
3900XT | Torch CPU | 24 | 1.653 | 0.617 | 0.770 |
3900XT | Torch GPU | 24 | 1.630 | 0.652 | 0.564 |
M1 Max | Binary | 24 | 11.090 | 2.135 | 3003.321 |
M1 Max | Binary MT | 24 | 10.925 | 2.154 | 453.536 |
M1 Max | Binary | 25 | 9.458 | 1.548 | 3257.853 |
M1 Max | Binary MT | 25 | 9.949 | 1.527 | 436.385 |
M1 Max | Torch CPU | 24 | 1.702 | 1.900 | 0.803 |
M1 Max | Torch CPU | 25 | 1.599 | 0.745 | 0.773 |
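For reference, the near-linear "Binary MT" speedup comes from splitting the independent per-batch work across all logical cores. The snippet below only illustrates that idea and is not the benchmark repo's implementation; heavy_work is a placeholder for the CPU-bound per-batch matrix computation.

```elixir
defmodule ParallelSketch do
  # Run fun over each chunk concurrently, with one task per logical core.
  def pmap(chunks, fun) do
    chunks
    |> Task.async_stream(fun,
      max_concurrency: System.schedulers_online(),
      timeout: :infinity
    )
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

# Placeholder workload; when the chunks are independent, wall-clock time
# scales roughly with 1 / System.schedulers_online().
heavy_work = fn batch -> Enum.reduce(batch, 0, fn x, acc -> x * x + acc end) end

1..1_000_000
|> Enum.chunk_every(10_000)
|> ParallelSketch.pmap(heavy_work)
```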