Numerical Elixir Benchmark: CIFAR10 with 3-Layer DenseNN


  1. Use C libraries (via NIFs) for matrix computation when performance is the top priority; otherwise, a pure Elixir implementation is about $10^3$ times slower at matrix computation.
  2. OTP 25 introduces a JIT on ARM64, which yields a 3-4% performance improvement in matrix computation.
  3. An almost linear speedup can be achieved when a large computation task can be divided into independent smaller ones.
  4. Apple M1 Max performs much better than its x86_64 competitors (Intel Core i9 8950HK and AMD Ryzen 9 3900XT).
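Point 3 can be sketched in a few lines of Elixir: split a batch into independent chunks and fan them out across cores with `Task.async_stream/3`. This is a toy illustration (the tensor contents and chunk size are made up), assuming the `nx` package is available via `Mix.install/1`:

```elixir
Mix.install([{:nx, "~> 0.6"}])

n_jobs = System.schedulers_online()
batch = Nx.iota({240, 4})

# Each chunk is summed independently, so the work spreads across
# logical cores with near-linear speedup for large enough chunks.
partial_sums =
  batch
  |> Nx.to_batched(24)
  |> Task.async_stream(&Nx.sum/1, max_concurrency: n_jobs)
  |> Enum.map(fn {:ok, t} -> Nx.to_number(t) end)

# Recombining the partial results gives the same answer as the
# single-threaded computation.
Enum.sum(partial_sums)
```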

Benchmark code here:

Numerical Elixir

I started using Elixir/Erlang about 2 months ago, and I learned about Numerical Elixir (Nx) from my supervisor, Lito.

Basically, Nx is to Elixir what NumPy is to Python. It implements a broad set of numerical operations, especially on multi-dimensional arrays. It's worth noting that Nx comes with built-in automatic differentiation (autograd), which means we don't have to hand-write the derivative functions needed for backpropagation when training a neural network.
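As a minimal sketch of what that autograd buys us (assuming the `nx` package is available via `Mix.install/1`; the module and function names here are my own, not from the benchmark code), here is a gradient computed without ever writing a derivative by hand:

```elixir
Mix.install([{:nx, "~> 0.6"}])

defmodule Autograd do
  import Nx.Defn

  # f(x) = x * x; analytically, f'(x) = 2x
  defn f(x), do: x * x

  # grad/2 asks Nx to derive and evaluate f'(x) for us
  defn df(x), do: grad(x, &f/1)
end

Autograd.df(Nx.tensor(3.0)) |> Nx.to_number()
```

The same mechanism scales up to the loss function of a whole network, which is what makes hand-written backpropagation unnecessary.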

I explored Nx and wrote some benchmarks to evaluate its performance across different hardware (Raspberry Pi 4, x86_64 laptops and desktops, an ARM64 laptop) and conditions (calls to external C libraries allowed vs. a pure Elixir implementation). And here I finally have some numbers!

P.S. The goal of this benchmark is only to evaluate matrix computation performance, not to reach a decent (or even acceptable) CIFAR-10 prediction accuracy.
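For context, switching between the pure Elixir and C-backed conditions comes down to choosing the default Nx backend. A sketch (assumes the `nx` and `torchx` packages are installed; the `device: :cuda` option for the GPU runs is my reading of Torchx's options, not taken from the benchmark code):

```elixir
# Pure Elixir condition: tensors are stored and computed on plain BEAM binaries
Nx.default_backend(Nx.BinaryBackend)

# C-backed condition: tensor ops dispatch to LibTorch through the Torchx NIF
Nx.default_backend(Torchx.Backend)

# GPU runs: ask Torchx to place tensors on a CUDA device
Nx.default_backend({Torchx.Backend, device: :cuda})
```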

Benchmark Settings


  • Raspberry Pi 4, 8 GB of RAM. Ubuntu 20.04 aarch64.
  • x86_64 laptop. Intel 8th Gen Core i9 8950HK, 6 Cores 12 Threads, MacBook Pro (15-inch, 2018), 32 GB RAM. macOS Big Sur 11.1 x86_64.
  • x86_64 desktop. AMD Ryzen 9 3900XT, 12 Cores 24 Threads, Desktop PC, 64 GB RAM, NVIDIA RTX 3090. Ubuntu 20.04 x86_64.
  • ARM64 laptop. Apple M1 Max, 10 Cores (8 Performance + 2 Efficiency) 10 Threads, MacBook Pro (14-inch, 2021), 64 GB RAM. macOS Monterey 12.0.1 aarch64.



Dataset: CIFAR-10, binary version.


  • 3-layer DenseNN.
    1. Input layer. Dense layer, size {nil, 1024, 64} + {nil, 64}, activation sigmoid.
    2. Hidden layer. Dense layer, size {nil, 64, 32} + {nil, 32}, activation sigmoid.
    3. Output layer. Dense layer, size {nil, 32, 10} + {nil, 10}, activation softmax.
  • Number of epochs: 5.
  • Batch size.
    • 300 when using Nx.BinaryBackend, single-thread
    • 250 * n_jobs when using Nx.BinaryBackend, multi-thread. n_jobs will be the number of available logical cores.
    • 300 when using Torchx.Backend.
  • Binary.
  backend: Nx.BinaryBackend,
  batch_size: 300,
  n_jobs: 1
  • Binary MT.
  backend: Nx.BinaryBackend,
  batch_size: 250 * System.schedulers_online(),
  n_jobs: System.schedulers_online()
  • Torch CPU/GPU.
  backend: Torchx.Backend,
  batch_size: 300
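The layer shapes above translate into a forward pass along these lines. This is a simplified sketch of the model, not the benchmark's actual training loop; the module and parameter names are mine, and `nx` is assumed to be installed:

```elixir
Mix.install([{:nx, "~> 0.6"}])

defmodule DenseNN do
  import Nx.Defn

  # One dense layer: x · W + b (activation is applied by the caller)
  defnp dense(x, w, b), do: Nx.dot(x, w) + b

  defnp softmax(logits) do
    # Subtract the row max for numerical stability before exponentiating
    exp = Nx.exp(logits - Nx.reduce_max(logits, axes: [-1], keep_axes: true))
    exp / Nx.sum(exp, axes: [-1], keep_axes: true)
  end

  # {batch, 1024} -> sigmoid {batch, 64} -> sigmoid {batch, 32} -> softmax {batch, 10}
  defn predict(x, {w1, b1, w2, b2, w3, b3}) do
    x
    |> dense(w1, b1)
    |> Nx.sigmoid()
    |> dense(w2, b2)
    |> Nx.sigmoid()
    |> dense(w3, b3)
    |> softmax()
  end
end
```

Combined with Nx's autograd over a loss function, this is the shape of the computation each epoch time in the table below measures.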

Benchmark Results

Numbers are in seconds.

I'll fill in the empty cells when the rest of the benchmarks are done.

| Hardware | Backend   | OTP | Load Dataset | To Batched Input | Mean Epoch Time |
|----------|-----------|-----|--------------|------------------|-----------------|
| Pi 4     | Binary    | 24  |              |                  |                 |
| Pi 4     | Binary MT | 24  |              |                  |                 |
| Pi 4     | Binary    | 25  | 194.427      | 11.917           | 27336.010       |
| Pi 4     | Binary MT | 25  | 207.923      | 11.855           | 18210.347       |
| Pi 4     | Torch CPU | 24  | 15.334       | 4.880            | 17.170          |
| Pi 4     | Torch CPU | 25  | 16.372       | 4.442            | 16.207          |
| 8950HK   | Binary MT | 24  | 17.826       | 2.934            | 1471.090        |
| 8950HK   | Torch CPU | 24  | 2.141        | 0.778            | 0.841           |
| 3900XT   | Binary MT | 24  | 6.034        | 2.536            | 786.443         |
| 3900XT   | Torch CPU | 24  | 1.653        | 0.617            | 0.770           |
| 3900XT   | Torch GPU | 24  | 1.630        | 0.652            | 0.564           |
| M1 Max   | Binary    | 24  | 11.090       | 2.135            | 3003.321        |
| M1 Max   | Binary MT | 24  | 10.925      | 2.154            | 453.536         |
| M1 Max   | Binary    | 25  | 9.458        | 1.548            | 3257.853        |
| M1 Max   | Binary MT | 25  | 9.949        | 1.527            | 436.385         |
| M1 Max   | Torch CPU | 24  | 1.702        | 1.900            | 0.803           |
| M1 Max   | Torch CPU | 25  | 1.599        | 0.745            | 0.773           |
