An exploration of Graviton3 performance


1. Introduction

This document presents an exploration of the performance of the Graviton3 processor using a set of carefully selected benchmarks.

The next sections cover the methodology as well as the benchmarking tool used to evaluate the performance of multiple CPU micro-architectures (Haswell, Skylake, Ice Lake, Zen2 Rome, Zen3 Milan, Graviton2, and A64FX) and compare them to the Graviton3.

2. Methodology

In order to evaluate the performance of the target systems, we deployed three categories of benchmarks that tackle different aspects:

1 - A CPU core frequency evaluation benchmark that uses a tight loop.

2 - A cache latency benchmark implementing a random pointer-chasing loop that measures the access latency for memory blocks of different sizes.

3 - A set of code patterns representing common workloads found in production-grade applications, aimed at exposing the performance of the memory hierarchy of the target system. These benchmarks are also used to evaluate the quality of compiler-generated code with regard to vectorization.

2.1. CPU core frequency benchmark

On AARCH64, the subs instruction is documented to execute in a single cycle. We can therefore implement a tight loop of N iterations that runs in roughly N cycles, assuming the subs + bne pair also retires in one cycle per iteration. Below is an example of such a loop for AARCH64:

__asm__ volatile (
                ".align 4\n"                /* Align the loop entry point          */
                "loop1:\n"
                "subs %[_N], %[_N], #1\n"   /* N--, updates the condition flags    */
                "bne loop1\n"               /* Loop back while N != 0              */
                : [_N] "+r" (N)
                :                           /* No inputs                           */
                : "cc");                    /* The condition flags are clobbered   */

After timing this loop with clock_gettime for a given N, the frequency in GHz follows directly from the one-iteration-per-cycle assumption: f = N / t, where t is the elapsed time in nanoseconds.

The same principle applies to x86_64, where we use the following tight loop instead:

__asm__ volatile (
                "loop1:\n"
                "dec %[_N]\n"               /* N--, updates ZF                     */
                "jnz loop1\n"               /* Loop back while N != 0              */
                : [_N] "+r" (N)
                :                           /* No inputs                           */
                : "cc");                    /* The condition flags are clobbered   */
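
For reference, a minimal timing harness around the AARCH64 loop could look like the sketch below. This is only an illustration assuming clock_gettime with CLOCK_MONOTONIC; the harness actually used for this study is not shown, and the function and constant names are arbitrary.

#include <stdio.h>
#include <time.h>

//AARCH64-only sketch: spin for N iterations (one cycle per iteration)
static void spin(unsigned long long N)
{
  __asm__ volatile (
                  ".align 4\n"
                  "loop1:\n"
                  "subs %[_N], %[_N], #1\n"
                  "bne loop1\n"
                  : [_N] "+r" (N)
                  :
                  : "cc");
}

int main(void)
{
  const unsigned long long N = 1ULL << 30; //Number of loop iterations
  struct timespec t0, t1;

  clock_gettime(CLOCK_MONOTONIC, &t0);
  spin(N);
  clock_gettime(CLOCK_MONOTONIC, &t1);

  //Elapsed time in nanoseconds
  const double t_ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);

  //One iteration per cycle ==> cycles per nanosecond == GHz
  printf("Estimated core frequency: %.4lf GHz\n", (double)N / t_ns);

  return 0;
}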

2.2. Cache latency benchmark

This benchmark measures the latency of memory accesses over a large array of pointers. The array is initialized with a random cyclic permutation so that each iteration accesses a cache line in a random order. This defeats the hardware prefetchers and exposes the real access latency of each cache level.

Below are excerpts from the cyclic permutation code and the pointer chasing loop:

//Shuffle the pointer addresses: swap each element with a randomly chosen
//element in another cycle, keeping the same offset within the cycle
for (int i = size - 1; i >= 0; i--)
  {
    if (i < cycle_len)
      continue;

    //Random partner index: randomly chosen cycle, same intra-cycle offset
    unsigned ii = yrand(i / cycle_len) * cycle_len + (i % cycle_len);

    //Swap the two pointers
    void *tmp = memblock[i];

    memblock[i] = memblock[ii];
    memblock[ii] = tmp;
  }

//Start the chase at the first element of the shuffled block
p = &memblock[0];

//Pointer chasing loop: 16 dependent loads per iteration to amortize the loop overhead
for (int i = iterations; i; i--)
  {
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
    p = *(void **)p;
  }
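
The average access latency is then obtained by dividing the time spent in the chasing loop by the total number of dereferences. Below is a simplified sketch of this calculation, with the 16-way unrolling written as an inner loop and clock_gettime used for timing; the names are illustrative and the actual harness is not shown.

#include <time.h>

#define UNROLL 16 //The chasing loop performs 16 dereferences per iteration

//Hypothetical helper: time the pointer chasing loop and return the latency in ns per access
static double chase_latency_ns(void **memblock, long long iterations)
{
  void *p = &memblock[0];
  struct timespec t0, t1;

  clock_gettime(CLOCK_MONOTONIC, &t0);

  for (long long i = iterations; i; i--)
    for (int u = 0; u < UNROLL; u++)
      p = *(void **)p; //Dependent load: the next address comes from the previous one

  clock_gettime(CLOCK_MONOTONIC, &t1);

  //Keep p alive so the compiler cannot remove the chain of dependent loads
  __asm__ volatile ("" : : "r" (p) : "memory");

  const double t_ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);

  return t_ns / ((double)iterations * UNROLL);
}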

2.3. Memory bandwidth benchmarks

This section covers the C benchmarks and code patterns used to evaluate the bandwidth of the targeted systems.

The code patterns/workloads were inspired by the STREAM benchmark and mainly aim at exposing the system to various workloads mimicking the behavior of codelets commonly found in production applications.

Most of the chosen patterns should be 'easily' vectorized and parallelized (using OpenMP) by the available compilers; an OpenMP version of one of the kernels is sketched at the end of this section.

To measure these benchmarks properly, many samples are collected and, for small array sizes, the kernels are repeated enough times to be accurately measurable. The stability/repeatability of the measurements is assessed by computing the mean standard deviation of the collected samples, along with the average error of the computations. A sketch of such a measurement loop is given below.
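
As an illustration of this measurement scheme, a minimal harness for the copy kernel could look like the sketch below. The sample count, the repetition strategy, and the names are assumptions made for the example; the actual harness used for this study is not shown.

#include <math.h>
#include <stdio.h>
#include <time.h>

#define SAMPLES 31 //Arbitrary number of timed samples per kernel

//Hypothetical example: time the 'copy' kernel and report its bandwidth in GiB/s
static void bench_copy(double *a, const double *b, unsigned long long n,
                       unsigned long long reps)
{
  double gibps[SAMPLES];

  for (int s = 0; s < SAMPLES; s++)
    {
      struct timespec t0, t1;

      clock_gettime(CLOCK_MONOTONIC, &t0);

      //Repeat the kernel so that small arrays remain accurately measurable
      for (unsigned long long r = 0; r < reps; r++)
        for (unsigned long long i = 0; i < n; i++)
          a[i] = b[i];

      clock_gettime(CLOCK_MONOTONIC, &t1);

      const double t_ns  = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
      const double bytes = (double)reps * n * 2.0 * sizeof(double); //1 load + 1 store per element

      gibps[s] = (bytes / (t_ns * 1e-9)) / (1024.0 * 1024.0 * 1024.0);
    }

  //Mean and standard deviation over the collected samples
  double mean = 0.0, var = 0.0;

  for (int s = 0; s < SAMPLES; s++)
    mean += gibps[s];

  mean /= SAMPLES;

  for (int s = 0; s < SAMPLES; s++)
    var += (gibps[s] - mean) * (gibps[s] - mean);

  printf("copy: %.3lf GiB/s (stddev: %.3lf)\n", mean, sqrt(var / SAMPLES));
}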

2.3.1. Load only benchmarks

These benchmarks perform only memory loads coupled with double precision floating-point arithmetic operations.

We will use the following symbols to express the global pattern of the benchmark: L for a load, S for a store, A for an addition, M for a multiplication, and F for FMA/MAC (Fused-Multiply-Add or Multiply-And-Accumulate) operations.

  • Array reduction:

reduc.png

Figure 1: Array reduction formula

This code performs a load followed by an add. L,A pattern.

for (unsigned long long i = 0; i < n; i++)
  r += a[i];

  • Dot product of two arrays:

dotprod.png

Figure 2: Dotprod formula

This code performs two loads followed by a fused multiply-add. L,L,F pattern.

for (unsigned long long i = 0; i < n; i++)
  d += (a[i] * b[i]);

  • Correlation factor between the data of two arrays:

correl.png

Figure 3: Pearson correlation formula

The following code calculates the sums required to compute the correlation coefficient. L,L,A,F,A,F,F pattern.

for (unsigned long long i = 0; i < n; i++)
  {
    const double _a = a[i];
    const double _b = b[i];

    sum_a += _a;
    sum_a2 += (_a * _a);

    sum_b += _b;
    sum_b2 += (_b * _b);

    sum_ab += (_a * _b);
  }

  • Least squares:

leastsq.png

Figure 4: Least squares formula

This code calculates the sums required to compute the least-squares components m and b. L,L,A,F,A,F pattern.

for (unsigned long long i = 0; i < n; i++)
 {
   const double _a = a[i];
   const double _b = b[i];

   sum_a  += _a;
   sum_a2 += (_a * _a);

   sum_b  += _b;

   sum_ab += (_a * _b);
 }

2.3.2. Store and load/store benchmarks

These benchmarks perform load and store operations coupled with double precision floating-point arithmetic operations.

  • Array initialization:

init.png

Figure 5: Array initialization formula

This code performs stores exclusively in order to initialize an array. S pattern.

for (unsigned long long i = 0; i < n; i++)
  a[i] = c;

  • Array copy:

copy.png

Figure 6: Array copy formula

This code performs a load and a store in order to copy an array into another. L,S pattern.

for (unsigned long long i = 0; i < n; i++)
  a[i] = b[i];

  • Array scaling:

scale.png

Figure 7: Array scaling formula

This code performs a load followed by a multiplication and a store. L,M,S pattern.

for (unsigned long long i = 0; i < n; i++)
  a[i] *= s;

  • Array sum:

sum.png

Figure 8: Array sum formula

This code performs two loads, an addition, then a store. L,L,A,S pattern.

for (unsigned long long i = 0; i < n; i++)
  c[i] = (a[i] + b[i]);

  • Triad:

triad.png

Figure 9: Triad formula

This code performs three loads, a fused multiply-add, then a store. L,L,L,F,S pattern.

for (unsigned long long i = 0; i < n; i++)
  c[i] += (a[i] * b[i]);
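
For the parallel results presented in section 4.3.2, the kernels are parallelized with OpenMP. As a minimal sketch, assuming a straightforward parallel-for strategy (the actual pragmas used in the benchmark suite are not shown), the triad kernel could be parallelized as follows:

//Hypothetical OpenMP version of the triad kernel: c[i] += a[i] * b[i]
//Built with -fopenmp; each thread processes a chunk of the arrays
void triad_omp(double *restrict c, const double *restrict a,
               const double *restrict b, unsigned long long n)
{
#pragma omp parallel for schedule(static)
  for (unsigned long long i = 0; i < n; i++)
    c[i] += (a[i] * b[i]);
}

A static schedule keeps each thread working on a contiguous chunk of the arrays, which is usually the best fit for such streaming kernels.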

3. Target architectures and compilers

3.1. Target architectures

This table summarizes the main features of the targeted x86 and AARCH64 systems:

| Model name | Provider | Micro-architecture | Threads per core | Cores per socket | Sockets | Total cores | Total threads | Max freq. (GHz) | Boost freq. (GHz) | Min freq. (GHz) | L1d (KiB) | L2 (KiB) | L3 (MiB) | L3/core (MiB) | SIMD ISA | Memory | Process |
| Intel(R) Xeon(R) E5-2699 v3 | LIPARAD | Haswell | 2 | 18 | 2 | 36 | 72 | 3.6 | False | 1.2 | 32 | 256 | 45 | 2.5 | SSE, AVX2 | DDR4 | 22 nm |
| Intel(R) Xeon(R) Platinum 8170 | LIPARAD | Skylake | 1 | 26 | 2 | 52 | 52 | 2.1 | False | 1.0 | 32 | 1024 | 35.75 | 1.375 | SSE, AVX2, AVX512 | DDR4 | 14 nm |
| Intel(R) Xeon(R) Platinum 8375C | AWS | Ice Lake | 2 | 24 | 1 | 24 | 48 | 3.5 | False | N/A | 48 | 512 | 54 | 2 | SSE, AVX2, AVX512 | DDR4 | 10 nm |
| AMD EPYC 7R32 | AWS | Zen2 Rome | 2 | 32 | 1 | 32 | 64 | 3.0 | 3.3 | N/A | 32 | 512 | 128 | 4 | SSE, AVX2 | DDR4 | 7 nm |
| AMD EPYC 7R13 | AWS | Zen3 Milan | 2 | 32 | 1 | 32 | 64 | 3.0 | 3.7 | N/A | 32 | 512 | 128 | 4 | SSE, AVX2 | DDR4 | 7 nm |
| Amazon Graviton2 | AWS | Neoverse N1 | 1 | 64 | 1 | 64 | 64 | 2.5 | False | N/A | 64 | 1024 | 32 | 0.5 | Neon | DDR4 | 7 nm |
| Amazon Graviton3 | AWS | Neoverse V1 | 1 | 64 | 1 | 64 | 64 | 2.6 | False | N/A | 64 | 1024 | 32 | 0.5 | Neon, SVE | DDR5 | 5 nm |

3.2. Compilers available on x86 systems

3.2.1. Intel Haswell

  • GCC version 11.2.0
  • CLANG version 13.0.1
  • Intel(R) oneAPI DPC++/C++ Compiler 2022.0.0 (2022.0.0.20211123)
  • ICC 2021.5.0 20211109

3.2.2. Intel Skylake

  • GCC version 11.2.0
  • CLANG version 13.0.1
  • Intel(R) oneAPI DPC++/C++ Compiler 2022.0.0 (2022.0.0.20211123)
  • ICC 2021.5.0 20211109

3.2.3. Intel Ice Lake

  • GCC version 11.x.x
  • CLANG version 14.x.x

3.2.4. AMD Zen2 Rome

  • GCC version 10.3.0
  • AOCC version 3.2.0

3.2.5. AMD Zen3 Milan

  • GCC version 10.3.0
  • AOCC version 3.2.0

3.3. Compilers available on AARCH64 systems

3.3.1. Graviton2 (ARMv8.2 architecture/Neoverse N1 cores)

  • GCC version 10.0.0
  • CLANG version 12.0.0
  • ARM 21.1

3.3.2. Graviton3 (ARMv8.4+SVE /Neoverse V1 cores)

  • GCC version 11.1.0
  • CLANG version 12.0.0
  • ARM 21.1

4. Experimental results

4.1. Frequency benchmark

4.1.1. Single core

The following table shows the frequencies measured on a single core by running the frequency benchmark on the target systems:

| CPU model | Max freq. (GHz) | Measured freq. (GHz) | Measurement error (%) |
| Haswell | 3.6 | 3.5920 | 0.017 |
| Skylake | 2.1 | 2.0904 | 0.015 |
| Ice Lake | 3.5 | 3.5000 | 0.065 |
| Zen2 Rome | 2.9 (3.3 with boost) | 3.3000 | 0.745 |
| Zen3 Milan | 3.0 (3.7 with boost) | 3.7000 | 0.188 |
| Graviton2 | 2.5 | 2.5000 | 0.037 |
| Graviton3 | 2.6 | 2.6000 | 0.008 |

From the measurements above, we can safely conclude that, with a single active core, the systems are quite stable and that performance measurements won't be affected by CPU frequency fluctuations due to DVFS, frequency boost, or other power/frequency management technologies.

4.1.2. Multi-core

The following plots show the measured frequency across multiple cores in parallel. The benchmark is run on 1, 2, 3, 4, … cores at a time and the measured frequency for each core is recorded along with the measurement stability.

plot_graviton2.png

Figure 10: Graviton2 frequency evolution across multiple cores (2.5 GHz)

plot_graviton3.png

Figure 11: Graviton3 frequency evolution across multiple cores (2.6 GHz)

plot_a64fx.png

Figure 12: A64FX frequency evolution across multiple cores (2.6 GHz)

plot_skylake.png

Figure 13: Skylake frequency evolution across multiple cores (2.1 GHz)

For the Graviton2, Graviton3, A64FX, and Skylake processors, we notice that the frequency remains stable regardless of the number of active cores. On the other hand, for the Haswell, Ice Lake, Zen2, and Zen3 processors, the frequency drops when multiple cores are in use, as shown in the plots below.

plot_haswell.png

Figure 14: Haswell frequency evolution across multiple cores

plot_icelake.png

Figure 15: Ice Lake frequency evolution across multiple cores

plot_zen2.png

Figure 16: Zen2 frequency evolution across multiple cores

plot_zen3.png

Figure 17: Zen3 frequency evolution across multiple cores

4.2. Cache benchmark

The following figures show the results of the cache benchmark (random pointer chasing) on the architectures described above.

plot_cache_lat.png

Figure 18: Cache latency benchmark on multiple CPU micro-architectures

plot_cache_lat_split.png

Figure 19: Cache latency benchmark on multiple CPU micro-architectures (split)

Here, we compare the cache latency of the Graviton3 CPU with that of the other CPUs.

plot_cache_lat_vs.png

Figure 20: One-to-one comparison of the cache latencies of Graviton3 and other CPU micro-architectures

4.3. Bandwidth benchmarks

4.3.1. Single core benchmarks

The following tables summarize the bandwidth (in GiB/s) for each benchmark on the target systems using multiple compilers.

Important: Note that the results shown here do not necessarily reflect the best achievable performance of the underlying architecture, but rather the performance of compiler-generated code.

  1. GCC
    1. -O3 -funroll-loops -finline-functions
      | Model name | init | copy | scale | sum | triad | reduc | dotprod | correl | leastsq |
      | Haswell | 7.841 | 20.752 | 8.126 | 11.036 | 11.274 | 7.082 | 10.960 | 7.855 | 8.974 |
      | Skylake | 8.788 | 9.331 | 10.911 | 11.400 | 11.384 | 3.817 | 7.365 | 6.382 | 6.658 |
      | Ice Lake | 11.260 | 30.527 | 13.400 | 14.075 | 13.856 | 6.102 | 11.418 | 9.990 | 10.294 |
      | AMD EPYC Zen2 Rome | 14.571 | 18.046 | 16.267 | 22.972 | 25.646 | 7.647 | 14.656 | 12.599 | 13.336 |
      | AMD EPYC Zen3 Milan | 18.644 | 36.567 | 23.008 | 24.410 | 28.370 | 8.702 | 17.343 | 12.408 | 16.254 |
      | Amazon Graviton2 | 36.929 | 30.881 | 16.889 | 24.456 | 14.988 | 9.062 | 17.634 | 11.748 | 13.778 |
      | Amazon Graviton3 | 49.597 | 49.897 | 25.261 | 45.530 | 33.536 | 9.637 | 19.270 | 11.015 | 13.931 |
      | A64FX | 23.587 | 15.173 | 23.350 | 42.082 | 37.020 | 1.473 | 2.912 | 4.360 | 12.567 |

      O3_seq.png

      Figure 21: Sequential bandwidth benchmarks (GCC, -O3 -funroll-loops -finline-functions)

    2. -Ofast -funroll-loops -finline-functions
      | Model name | init | copy | scale | sum | triad | reduc | dotprod | correl | leastsq |
      | Haswell | 7.974 | 20.882 | 8.169 | 8.169 | 11.097 | 11.349 | 12.408 | 11.790 | 11.909 |
      | Skylake | 8.579 | 9.293 | 10.780 | 11.332 | 11.332 | 11.152 | 11.777 | 10.898 | 11.083 |
      | Ice Lake | 11.156 | 29.792 | 13.264 | 14.042 | 13.823 | 14.185 | 13.853 | 14.689 | 14.105 |
      | AMD EPYC Zen2 Rome | 14.321 | 18.103 | 16.066 | 23.109 | 25.740 | 17.181 | 23.563 | 22.084 | 22.829 |
      | AMD EPYC Zen3 Milan | 18.045 | 34.784 | 22.307 | 24.938 | 28.810 | 25.937 | 31.717 | 30.118 | 30.519 |
      | Amazon Graviton2 | 36.814 | 31.698 | 18.646 | 24.044 | 17.396 | 17.427 | 18.386 | 18.620 | 18.618 |
      | Amazon Graviton3 | 49.654 | 49.971 | 24.965 | 45.835 | 33.573 | 28.084 | 34.672 | 34.077 | 34.348 |

      Ofast_seq.png

      Figure 22: Sequential bandwidth benchmarks (GCC, -Ofast -funroll-loops -finline-functions)

4.3.2. Parallel benchmarks (OpenMP)

The following tables summarize the bandwidth (in GiB/s) for all parallel (OpenMP) benchmarks on the target systems using multiple compilers:

  1. GCC
    1. -O3 -fopenmp -funroll-loops -finline-functions
      | Model name | init | copy | scale | sum | triad | reduc | dotprod | correl | leastsq |
      | Haswell | 40.472 | 59.449 | 40.031 | 67.926 | 67.875 | 97.832 | 96.924 | 96.812 | 96.819 |
      | Skylake | 85.154 | 113.010 | 55.220 | 107.421 | 128.508 | 126.979 | 166.585 | 174.954 | 172.813 |
      | Ice Lake | 70.676 | 121.012 | 60.045 | 128.628 | 103.566 | 139.729 | 145.405 | 143.077 | 143.379 |
      | AMD EPYC Zen2 Rome | 39.121 | 56.896 | 47.520 | 72.865 | 75.696 | 73.042 | 92.864 | 92.334 | 92.486 |
      | AMD EPYC Zen3 Milan | 57.402 | 78.408 | 59.934 | 87.267 | 87.069 | 111.427 | 116.127 | 115.192 | 114.831 |
      | Amazon Graviton2 | 148.458 | 146.693 | 74.474 | 149.744 | 111.612 | 146.820 | 153.409 | 152.044 | 151.884 |
      | Amazon Graviton3 | 213.623 | 232.102 | 110.914 | 223.261 | 182.721 | 251.491 | 260.696 | 260.398 | 260.879 |
      | A64FX | 364.709 | 460.137 | 378.337 | 446.239 | 483.873 | 70.101 | 137.800 | 65.379 | 82.239 |

      O3_omp.png

      Figure 23: OpenMP bandwidth benchmarks (GCC, -O3 -fopenmp -funroll-loops -finline-functions)

    2. -Ofast -fopenmp -funroll-loops -finline-functions
      | Model name | init | copy | scale | sum | triad | reduc | dotprod | correl | leastsq |
      | Haswell | 40.145 | 60.237 | 40.187 | 67.652 | 67.642 | 98.684 | 94.644 | 96.378 | 94.922 |
      | Skylake | 83.874 | 110.347 | 86.372 | 126.957 | 126.106 | 193.222 | 168.237 | 176.408 | 175.184 |
      | Ice Lake | 70.808 | 121.963 | 60.427 | 128.920 | 103.579 | 145.062 | 147.960 | 146.683 | 146.861 |
      | AMD EPYC Zen2 Rome | 47.723 | 65.030 | 48.859 | 75.014 | 75.361 | 80.624 | 91.055 | 87.829 | 93.232 |
      | AMD EPYC Zen3 Milan | 57.140 | 78.207 | 59.888 | 87.296 | 87.416 | 112.586 | 115.968 | 116.036 | 115.920 |
      | Amazon Graviton2 | 145.616 | 145.986 | 74.299 | 149.841 | 111.622 | 147.861 | 153.018 | 151.939 | 151.882 |
      | Amazon Graviton3 | 216.010 | 232.508 | 110.993 | 223.247 | 183.110 | 258.779 | 259.508 | 260.373 | 260.067 |

      Ofast_omp.png

      Figure 24: OpenMP bandwidth benchmarks (GCC, -Ofast -fopenmp -funroll-loops -finline-functions)

  2. Sequential side-by-side comparison

    O3_Ofast_seq_comp.png

    Figure 25: Side-by-side comparison of the sequential bandwidth benchmarks (-O3 vs. -Ofast)

4.4. Graviton3 compilers comparison

4.4.1. GCC

  • assembly codes (soon …)

4.4.2. CLANG

  • assembly codes (soon …)

4.4.3. ARM CLANG

  • assembly codes (soon …)

4.4.4. Sequential benchmarks

| Compiler | init | copy | scale | sum | triad | reduc | dotprod | correl | leastsq |
| GCC | 49.597 | 49.897 | 25.261 | 45.530 | 33.536 | 9.637 | 19.270 | 11.015 | 13.931 |
| CLANG | 57.860 | 49.673 | 25.126 | 45.665 | 33.334 | 9.630 | 19.106 | 13.08 | 16.034 |
| ARM CLANG | 57.171 | 49.698 | 25.084 | 45.649 | 33.357 | 9.631 | 19.090 | 13.101 | 15.959 |

O3_seq_grav3.png

Figure 26: Sequential bandwidth of the different compilers on Graviton3 (-O3 -funroll-loops -finline-functions)

4.4.5. OpenMP benchmarks

| Compiler | init | copy | scale | sum | triad | reduc | dotprod | correl | leastsq |
| GCC | 213.623 | 232.102 | 110.914 | 223.261 | 182.721 | 251.491 | 260.696 | 260.398 | 260.879 |
| CLANG | 218.906 | 232.225 | 111.093 | 221.840 | 183.534 | 251.695 | 260.299 | 258.103 | 259.629 |
| ARM CLANG | 217.381 | 231.842 | 110.995 | 222.039 | 183.397 | 251.372 | 260.351 | 257.790 | 259.489 |

O3_omp_grav3.png

Figure 27: OpenMP bandwidth of the different compilers on Graviton3 (-O3 -funroll-loops -finline-functions)

4.5. A64FX

a64fx_seq.png

Figure 28: Bandwidth of sequential benchmarks using different compilers on A64FX

a64fx_omp.png

Figure 29: Bandwidth of OpenMP benchmarks using different compilers on A64FX

4.6. Graviton3 vs. A64FX

a64fx_graviton3_seq.png

Figure 30: Bandwidth comparison of sequential benchmarks using different compilers on A64FX and Graviton3

a64fx_graviton3_omp.png

Figure 31: Bandwidth comparison of OpenMP benchmarks using different compilers on A64FX and Graviton3

Author: yaspr

Created: 2022-06-10 Fri 15:06
