ARM64 CPU Compilation Test – Season 1

Takeaways: (1) The M1 Max is powerful, deserves its price, and outperforms some more expensive ARM64 servers. (2) Oracle Cloud's Always Free tier can provide ARM64 servers with decent performance.

Just a quick compilation test amongst Apple M1 Max, AWS c6g.2xlarge, c6g.metal and Oracle Ampere (VM.Standard.A1.Flex).

Hardware-wise, I'm using a MacBook Pro 14-inch with the M1 Max (10 cores: 8 performance cores + 2 efficiency cores). The build is done in an ARM64 Docker container with the arm64v8/ubuntu image. The Docker engine can use 8 cores and 14 GB of RAM. It's worth noting that allocating 8 cores to the Docker engine does not guarantee they are all performance cores: core scheduling is handled by macOS, and there is no core pinning on recent macOS.

The hardware configuration on the AWS c6g.2xlarge is just the stock one, 8 cores and 16 GB of RAM. The system image on the c6g.2xlarge machine is also Ubuntu 20.04.

As for the Oracle Ampere (VM.Standard.A1.Flex), I tested three configurations:

  1. 4 CPUs, 24 GB of RAM
  2. 8 CPUs, 48 GB of RAM
  3. 16 CPUs, 96 GB of RAM

The first configuration is eligible for the Oracle Always Free tier, while the second is meant to match the core count of the M1 Max and AWS c6g.2xlarge. The last one is the maximum spec under the default quota (the quota can be raised by upgrading to a paid account). The OS image used on these configurations is Ubuntu 20.04 as well (image build 2021.10.15-0).

The M1 Max completed the compilation in ~28 minutes, while it took ~45 minutes and ~24 minutes for the AWS c6g.2xlarge and c6g.metal respectively. The Oracle Ampere machines finished in ~68 minutes (4c), ~42 minutes (8c) and ~32 minutes (16c). The precise results are shown in the table below.

| Machine | Cores | RAM | Cost | Compile Time (seconds) |
| --- | --- | --- | --- | --- |
| MBP 14", M1 Max | 8 @ ~3 GHz | 14 GB | One time, ≥$2,499.00 | 1697.344 |
| AWS c6g.2xlarge | 8 @ 2.5 GHz | 16 GB | $0.272/hr (~$204/mo) | 2736.556 |
| AWS c6g.metal | 64 @ 2.5 GHz | 128 GB | $2.176/hr (~$1,632/mo) | 1448.384 |
| Oracle Ampere | 4 @ 3 GHz | 24 GB | Free Tier | 4109.323 |
| Oracle Ampere | 8 @ 3 GHz | 48 GB | $0.08/hr (~$30/mo) | 2569.361 |
| Oracle Ampere | 16 @ 3 GHz | 96 GB | $0.16/hr (~$89/mo) | 1906.699 |

GCC 11.2 compilation time on different machines.

As we can see, the M1 Max took about 37.98% less time than the c6g.2xlarge machine (1697.344 s vs. 2736.556 s).

  • M1 Max (8c) completed in ~28 minutes.
  • AWS c6g.2xlarge finished in ~45 minutes.
  • AWS c6g.metal finished in ~24 minutes.
  • Oracle Ampere (VM.Standard.A1.Flex), 16 cores: ~32 minutes.
  • Oracle Ampere (VM.Standard.A1.Flex), 8 cores: ~42 minutes.
  • Oracle Ampere (VM.Standard.A1.Flex), 4 cores: ~68 minutes.

The test script used is shown below:

#!/bin/bash

export GCC_VER=11.2.0
export GCC_SUFFIX=11.2

# use sudo only if it exists (the Docker containers run as root without sudo)
export sudo="$(which sudo)"

# install the build dependencies
$sudo apt-get update -y
$sudo apt-get install -y make build-essential wget zlib1g-dev
wget "https://ftpmirror.gnu.org/gcc/gcc-${GCC_VER}/gcc-${GCC_VER}.tar.xz" \
  -O "gcc-${GCC_VER}.tar.xz"
tar xf "gcc-${GCC_VER}.tar.xz"
cd "gcc-${GCC_VER}"
contrib/download_prerequisites
cd .. && mkdir build && cd build

../gcc-${GCC_VER}/configure -v \
  --build=aarch64-linux-gnu \
  --host=aarch64-linux-gnu \
  --target=aarch64-linux-gnu \
  --prefix=/usr/local \
  --enable-checking=release \
  --enable-languages=c,c++,go,d,fortran,objc,obj-c++ \
  --disable-multilib \
  --program-suffix=-${GCC_SUFFIX} \
  --enable-threads=posix \
  --enable-nls \
  --enable-clocale=gnu \
  --enable-libstdcxx-debug \
  --enable-libstdcxx-time=yes \
  --with-default-libstdcxx-abi=new \
  --enable-gnu-unique-object \
  --disable-libquadmath \
  --disable-libquadmath-support \
  --enable-plugin \
  --enable-default-pie \
  --with-system-zlib \
  --with-target-system-zlib=auto \
  --enable-multiarch \
  --enable-fix-cortex-a53-843419 \
  --disable-werror

# build with one job per core and record the wall-clock time
time make -j"$(nproc)"

Cocoa's Linux Package Repo

# at the very least, upgrade the ca-certificates package
# to get the latest SSL root certificates
sudo apt update && sudo apt install -y ca-certificates gnupg2 curl

# add the signing key
curl https://repo.uwucocoa.moe/pgp.key | gpg --dearmor | \
    sudo tee /usr/share/keyrings/uwucocoa-archive-keyring.gpg > /dev/null

# add the source for arm64 (signed-by points apt at the keyring added above)
echo "deb [arch=arm64 signed-by=/usr/share/keyrings/uwucocoa-archive-keyring.gpg] https://repo.uwucocoa.moe/ stable main" | \
  sudo tee /etc/apt/sources.list.d/uwucocoa.list

# update caches
sudo apt update
# all packages from this repo have a 'uwu' suffix
sudo apt-cache search uwu

# the repo also has some packages for amd64 (x86_64), armhf (armv7), s390x,
# ppc64el and riscv64; pick the line that matches your architecture
# (each command overwrites the same sources file)
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/uwucocoa-archive-keyring.gpg] https://repo.uwucocoa.moe/ stable main" | \
  sudo tee /etc/apt/sources.list.d/uwucocoa.list
echo "deb [arch=armhf signed-by=/usr/share/keyrings/uwucocoa-archive-keyring.gpg] https://repo.uwucocoa.moe/ stable main" | \
  sudo tee /etc/apt/sources.list.d/uwucocoa.list
echo "deb [arch=s390x signed-by=/usr/share/keyrings/uwucocoa-archive-keyring.gpg] https://repo.uwucocoa.moe/ stable main" | \
  sudo tee /etc/apt/sources.list.d/uwucocoa.list
echo "deb [arch=ppc64el signed-by=/usr/share/keyrings/uwucocoa-archive-keyring.gpg] https://repo.uwucocoa.moe/ stable main" | \
  sudo tee /etc/apt/sources.list.d/uwucocoa.list
echo "deb [arch=riscv64 signed-by=/usr/share/keyrings/uwucocoa-archive-keyring.gpg] https://repo.uwucocoa.moe/ stable main" | \
  sudo tee /etc/apt/sources.list.d/uwucocoa.list

Available packages can be viewed at https://repo.uwucocoa.moe/pool/main/, although there are only a few armhf, s390x and ppc64el packages.

Numerical Elixir Benchmark: CIFAR10 with 3-Layer DenseNN

TLDR:

  1. Use C libraries (via NIFs) for matrix computation when performance is the top priority; otherwise, pure Elixir matrix computation is about $10^3$ times slower.
  2. OTP 25 introduces the JIT on ARM64, and it shows a 3-4% performance improvement in matrix computation.
  3. Almost linear speedup can be achieved when a large computation task can be divided into independent smaller ones (see the sketch after this list).
  4. Apple M1 Max performs much better than its x86_64 competitors (Intel Core i9 8950HK and AMD Ryzen 9 3900XT).
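
The speedup in (3) mostly comes from splitting the batched input into independent chunks and processing them on separate scheduler threads. A minimal sketch of the idea in plain Elixir (not the benchmark code; the range and chunk size here are arbitrary):

chunks = Enum.chunk_every(1..1_000_000, 250)

sums =
  chunks
  # one concurrent worker per logical core
  |> Task.async_stream(fn chunk -> Enum.sum(chunk) end,
       max_concurrency: System.schedulers_online())
  |> Enum.map(fn {:ok, sum} -> sum end)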

Benchmark code here: https://github.com/cocoa-xu/CIFAR-10-livebook

Numerical Elixir

I started to use Elixir/Erlang about 2 months ago, and I learned about Numerical Elixir (Nx) from my supervisor, Lito.

Basically, Nx is to Elixir what NumPy is to Python. It implements a number of numerical operations, especially on multi-dimensional arrays (tensors). It's worth noting that Nx comes with built-in automatic differentiation (autograd), which means we don't have to write the corresponding derivative functions for backpropagation when training a neural network.
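
As a rough illustration of the autograd part (a minimal sketch, not from the benchmark; the module and function names are mine, and the exact grad API can vary a bit between Nx versions):

defmodule GradSketch do
  import Nx.Defn

  # f(w) = sum(w * w)
  defn f(w), do: Nx.sum(Nx.multiply(w, w))

  # Nx derives the gradient for us: df/dw = 2w
  defn df(w), do: grad(w, &f/1)
end

GradSketch.df(Nx.tensor([1.0, 2.0, 3.0]))
# => f32[3] tensor [2.0, 4.0, 6.0]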

I explored Nx and wrote some benchmarks to evaluate its performance on different hardware (Raspberry Pi 4, x86_64 laptops and desktops, ARM64 laptops) and under different conditions (calls to external C libraries vs. a pure Elixir implementation). And here I finally got some numbers!
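
The two conditions correspond to two Nx backends: Nx.BinaryBackend (pure Elixir) and Torchx.Backend (libtorch called through NIFs). Switching between them looks roughly like this (a sketch, assuming Torchx is installed):

# pure Elixir implementation
Nx.default_backend(Nx.BinaryBackend)
a = Nx.tensor([[1.0, 2.0], [3.0, 4.0]])
Nx.dot(a, a)

# the same computation, dispatched to libtorch through NIFs
Nx.default_backend(Torchx.Backend)
b = Nx.tensor([[1.0, 2.0], [3.0, 4.0]])
Nx.dot(b, b)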

P.S. The goal of this benchmark is only to evaluate matrix computation performance, not to get a decent (or even acceptable) CIFAR-10 prediction accuracy.

Benchmark Settings

Hardware

  • Raspberry Pi 4, 8 GB of RAM. Ubuntu 20.04 aarch64.
  • x86_64 laptop. Intel 8th Gen Core i9 8950HK, 6 Cores 12 Threads, MacBook Pro (15-inch, 2018), 32 GB RAM. macOS Big Sur 11.1 x86_64.
  • x86_64 desktop. AMD Ryzen 9 3900XT, 12 Cores 24 Threads, Desktop PC, 64 GB RAM, NVIDIA RTX 3090. Ubuntu 20.04 x86_64.
  • ARM64 laptop. M1 Max, 10 Cores (8 Performance + 2 Efficiency) 10 Threads, MacBook Pro (14-inch, 2021), 64 GB RAM. macOS Monterey 12.0.1 aarch64.

Software

Dataset

CIFAR-10 binary version.
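
Each record in the binary version is 1 label byte followed by 3,072 pixel bytes (1,024 per colour channel). Loading one batch file into Nx tensors can be sketched like this (not the exact benchmark code; the file path assumes the extracted cifar-10-batches-bin directory):

{:ok, data} = File.read("cifar-10-batches-bin/data_batch_1.bin")

records =
  # bitstring generator: consume 1 label byte + 3072 pixel bytes per record
  for <<label::8, pixels::binary-size(3072) <- data>> do
    image =
      pixels
      |> Nx.from_binary(:u8)
      |> Nx.reshape({3, 32, 32})

    {label, image}
  end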

Method

  • 3-layer DenseNN (a rough Nx forward-pass sketch follows this list).
    1. Input layer. Dense layer, size {nil, 1024, 64} + {nil, 64}, activation sigmoid.
    2. Hidden layer. Dense layer, size {nil, 64, 32} + {nil, 32}, activation sigmoid.
    3. Output layer. Dense layer, size {nil, 32, 10} + {nil, 10}, activation softmax.
  • Number of epochs: 5.
  • Batch size.
    • 300 when using Nx.BinaryBackend, single-thread
    • 250 * n_jobs when using Nx.BinaryBackend, multi-thread. n_jobs will be the number of available logical cores.
    • 300 when using Torchx.Backend.
  • Binary.
Benchmark.run(
  backend: Nx.BinaryBackend,
  batch_size: 300,
  n_jobs: 1
)
  • Binary MT.
Benchmark.run(
  backend: Nx.BinaryBackend,
  batch_size: 250 * System.schedulers_online(),
  n_jobs: System.schedulers_online()
)
  • Torch CPU/GPU.
Benchmark.run(backend: Torchx.Backend, batch_size: 300)
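
For reference, one dense layer plus the output softmax can be written with plain Nx operations roughly like this (a sketch using my own module and variable names; the shapes follow the layer sizes listed above):

defmodule DenseSketch do
  import Nx.Defn

  # x: {batch, 1024}, w: {1024, 64}, b: {64}
  defn dense_sigmoid(x, w, b) do
    x
    |> Nx.dot(w)
    |> Nx.add(b)
    |> Nx.sigmoid()
  end

  # softmax over the class axis for the output layer
  # (subtracting the row max keeps Nx.exp from overflowing)
  defn softmax(logits) do
    shifted = logits - Nx.reduce_max(logits, axes: [1], keep_axes: true)
    exps = Nx.exp(shifted)
    exps / Nx.sum(exps, axes: [1], keep_axes: true)
  end
end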

Benchmark Results

Numbers are in seconds.

I'll fill in the empty cells when the rest of the benchmarks are done.

| Hardware | Backend | OTP | Load Dataset (s) | To Batched Input (s) | Mean Epoch Time (s) |
| --- | --- | --- | --- | --- | --- |
| Pi 4 | Binary | 24 | | | |
| Pi 4 | Binary MT | 24 | | | |
| Pi 4 | Binary | 25 | 194.427 | 11.917 | 27336.010 |
| Pi 4 | Binary MT | 25 | 207.923 | 11.855 | 18210.347 |
| Pi 4 | Torch CPU | 24 | 15.334 | 4.880 | 17.170 |
| Pi 4 | Torch CPU | 25 | 16.372 | 4.442 | 16.207 |
| 8950HK | Binary | 24 | 17.994 | 3.036 | 4460.758 |
| 8950HK | Binary MT | 24 | 17.826 | 2.934 | 1471.090 |
| 8950HK | Torch CPU | 24 | 2.141 | 0.778 | 0.841 |
| 3900XT | Binary | 24 | 6.058 | 2.391 | 3670.930 |
| 3900XT | Binary MT | 24 | 6.034 | 2.536 | 786.443 |
| 3900XT | Torch CPU | 24 | 1.653 | 0.617 | 0.770 |
| 3900XT | Torch GPU | 24 | 1.630 | 0.652 | 0.564 |
| M1 Max | Binary | 24 | 11.090 | 2.135 | 3003.321 |
| M1 Max | Binary MT | 24 | 10.925 | 2.154 | 453.536 |
| M1 Max | Binary | 25 | 9.458 | 1.548 | 3257.853 |
| M1 Max | Binary MT | 25 | 9.949 | 1.527 | 436.385 |
| M1 Max | Torch CPU | 24 | 1.702 | 1.900 | 0.803 |
| M1 Max | Torch CPU | 25 | 1.599 | 0.745 | 0.773 |