Loop Id: 5 | Module: exec | Source: Step10_orig.c:19-35 | Coverage: 99.82% |
---|
0x401bf0 VMOVUPS (%RSI,%RBX,4),%YMM16 [2] |
0x401bf7 VSUBPS %YMM6,%YMM16,%YMM16 |
0x401bfd VMOVUPS (%RDX,%RBX,4),%YMM17 [4] |
0x401c04 VSUBPS %YMM5,%YMM17,%YMM17 |
0x401c0a VMOVUPS (%RCX,%RBX,4),%YMM18 [1] |
0x401c11 VSUBPS %YMM2,%YMM18,%YMM18 |
0x401c17 VMULPS %YMM16,%YMM16,%YMM19 |
0x401c1d VFMADD231PS %YMM17,%YMM17,%YMM19 |
0x401c23 VFMADD231PS %YMM18,%YMM18,%YMM19 |
0x401c29 VCMPPS $0x1,%YMM1,%YMM19,%K1 |
0x401c30 VMOVUPS (%R8,%RBX,4),%YMM20{%K1}{z} [3] |
0x401c37 VADDPS %YMM27,%YMM19,%YMM21 |
0x401c3d VEXTRACTF32X4 $0x1,%YMM21,%XMM3 |
0x401c44 VCVTPS2PD %XMM3,%YMM3 |
0x401c48 VCVTPS2PD %XMM21,%YMM21 |
0x401c4e VSQRTPD %YMM21,%YMM23 |
0x401c54 VSQRTPD %YMM3,%YMM24 |
0x401c5a VMULPD %YMM21,%YMM21,%YMM21 |
0x401c60 VDIVPD %YMM21,%YMM8,%YMM21 |
0x401c66 VMULPD %YMM3,%YMM3,%YMM3 |
0x401c6a VMOVAPS %YMM9,%YMM25 |
0x401c70 VFMADD213PS %YMM10,%YMM19,%YMM25 |
0x401c76 VFMADD213PS %YMM11,%YMM19,%YMM25 |
0x401c7c VFMADD213PS %YMM13,%YMM19,%YMM25 |
0x401c82 VFMADD213PS %YMM14,%YMM19,%YMM25 |
0x401c88 VFMADD213PS %YMM15,%YMM19,%YMM25 |
0x401c8e VCVTPS2PD %XMM25,%YMM26 |
0x401c94 VDIVPD %YMM3,%YMM8,%YMM3 |
0x401c98 VFMADD231PD %YMM21,%YMM23,%YMM26 |
0x401c9e VEXTRACTF32X4 $0x1,%YMM25,%XMM0 |
0x401ca5 VCVTPS2PD %XMM0,%YMM0 |
0x401ca9 VFMADD231PD %YMM3,%YMM24,%YMM0 |
0x401caf VCVTPD2PS %YMM26,%XMM3 |
0x401cb5 VCVTPD2PS %YMM0,%XMM0 |
0x401cb9 VINSERTF128 $0x1,%XMM0,%YMM3,%YMM0 |
0x401cbf VCMPPS $0x1,%YMM19,%YMM22,%K1 |
0x401cc6 VMULPS %YMM0,%YMM20,%YMM0{%K1}{z} |
0x401ccc VFMADD231PS %YMM16,%YMM0,%YMM12 |
0x401cd2 VFMADD231PS %YMM17,%YMM0,%YMM7 |
0x401cd8 VFMADD231PS %YMM18,%YMM0,%YMM4 |
0x401cde ADD $0x8,%RBX |
0x401ce2 CMP %RDI,%RBX |
0x401ce5 JB 401bf0 |
/home/kcamus/qaas_runs/169-401-3406/intel/HACCmk/build/HACCmk/src/Step10_orig.c: 19 - 35 |
-------------------------------------------------------------------------------- |
19: for ( j = 0; j < count1; j++ ) |
20: { |
21: dxc = xx1[j] - xxi; |
22: dyc = yy1[j] - yyi; |
23: dzc = zz1[j] - zzi; |
24: |
25: r2 = dxc * dxc + dyc * dyc + dzc * dzc; |
26: |
27: m = ( r2 < fsrrmax2 ) ? mass1[j] : 0.0f; |
28: |
29: f = pow( r2 + mp_rsm2, -1.5 ) - ( ma0 + r2*(ma1 + r2*(ma2 + r2*(ma3 + r2*(ma4 + r2*ma5))))); |
30: |
31: f = ( r2 > 0.0f ) ? m * f : 0.0f; |
32: |
33: xi = xi + f * dxc; |
34: yi = yi + f * dyc; |
35: zi = zi + f * dzc; |
Coverage (%) | Name | Source Location | Module |
---|---|---|---|
100.00 | main.extracted.8 | main.c:142 | exec |
 | __kmp_invoke_microtask | | libiomp5.so |
 | __kmp_fork_call | | libiomp5.so |
 | __kmpc_fork_call | | libiomp5.so |
 | main | main.c:139 | exec |
 | __libc_init_first | | libc.so.6 |
Path / |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.00 |
CQA speedup if FP arith vectorized | 1.00 |
CQA speedup if fully vectorized | 1.00 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 2.27 - 2.67 |
Bottlenecks | P0 |
Function | Step10_orig |
Source | Step10_orig.c:19-35 |
Source loop unroll info | not unrolled or unrolled with no peel/tail loop |
Source loop unroll confidence level | max |
Unroll/vectorization loop type | NA |
Unroll factor | NA |
CQA cycles | 34.00 - 40.00 |
CQA cycles if no scalar integer | 34.00 - 40.00 |
CQA cycles if FP arith vectorized | 34.00 - 40.00 |
CQA cycles if fully vectorized | 34.00 - 40.00 |
Front-end cycles | 12.00 |
DIV/SQRT cycles | 34.00 - 40.00 |
P0 cycles | 15.00 |
P1 cycles | 15.00 |
P2 cycles | 2.00 |
P3 cycles | 2.00 |
P4 cycles | 0.00 |
P5 cycles | 11.00 |
P6 cycles | 2.00 |
P7 cycles | 0.00 |
Inter-iter dependencies cycles | 4 |
FE+BE cycles (UFS) | 35.32 - 40.95 |
Stall cycles (UFS) | 22.86 - 28.49 |
Nb insns | 43.00 |
Nb uops | 48.00 |
Nb loads | 4.00 |
Nb stores | 0.00 |
Nb stack references | 0.00 |
FLOP/cycle | 7.29 - 6.20 |
Nb FLOP add-sub | 32.00 |
Nb FLOP mul | 24.00 |
Nb FLOP fma | 88.00 |
Nb FLOP div | 8.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 8.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 3.20 - 3.76 |
Bytes prefetched | 0.00 |
Bytes loaded | 128.00 |
Bytes stored | 0.00 |
Stride 0 | 0.00 |
Stride 1 | 4.00 |
Stride n | 0.00 |
Stride unknown | 0.00 |
Stride indirect | 0.00 |
Vectorization ratio all | 100.00 |
Vectorization ratio load | 100.00 |
Vectorization ratio store | NA |
Vectorization ratio mul | 100.00 |
Vectorization ratio add_sub | 100.00 |
Vectorization ratio fma | 100.00 |
Vectorization ratio div_sqrt | 100.00 |
Vectorization ratio other | 100.00 |
Vector-efficiency ratio all | 45.63 |
Vector-efficiency ratio load | 50.00 |
Vector-efficiency ratio store | NA |
Vector-efficiency ratio mul | 50.00 |
Vector-efficiency ratio add_sub | 50.00 |
Vector-efficiency ratio fma | 50.00 |
Vector-efficiency ratio div_sqrt | 50.00 |
Vector-efficiency ratio other | 35.42 |
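The FLOP/cycle and "next bottleneck killed" rows can be reproduced from the other rows of this table. The sketch below assumes each FMA counts as two FLOP, and that the next bound after the 34.00 - 40.00 cycle estimate is the 15.00-cycle P0 figure. Both assumptions are inferred because they match the reported values; they are not stated in the report. The function names are hypothetical.

```c
/* FLOP per vector iteration. Each FMA is weighted as two FLOP; that
 * weighting is an assumption chosen because it reproduces the report. */
double total_flop(double add_sub, double mul, double fma,
                  double div, double sqrt_ops)
{
    return add_sub + mul + 2.0 * fma + div + sqrt_ops;
}

/* Throughput for a given cycle estimate. */
double flop_per_cycle(double flop, double cycles)
{
    return flop / cycles;
}

/* Speedup if the loop were limited by the next bound instead. */
double speedup_if_bound_removed(double current_cycles, double next_cycles)
{
    return current_cycles / next_cycles;
}
```

With the table's counts (32 add-sub, 24 mul, 88 fma, 8 div, 8 sqrt) the total is 248 FLOP, which gives about 7.29 FLOP/cycle at 34 cycles and 6.20 at 40 cycles. Dividing 34 and 40 by 15 gives about 2.27 and 2.67.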
Path / |
Function | Step10_orig |
Source file and lines | Step10_orig.c:19-35 |
Module | exec |
nb instructions | 43 |
nb uops | 48 |
loop length | 251 |
used x86 registers | 6 |
used mmx registers | 0 |
used xmm registers | 4 |
used ymm registers | 28 |
used zmm registers | 0 |
nb stack references | 0 |
ADD-SUB / MUL ratio | 1.00 |
micro-operation queue | 12.00 cycles |
front end | 12.00 cycles |
 | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 |
---|---|---|---|---|---|---|---|---|
uops | 15.00 | 15.00 | 2.00 | 2.00 | 0.00 | 11.00 | 2.00 | 0.00 |
cycles | 15.00 | 15.00 | 2.00 | 2.00 | 0.00 | 11.00 | 2.00 | 0.00 |
Cycles executing div or sqrt instructions | 34.00-40.00 |
Longest recurrence chain latency (RecMII) | 4.00 |
FE+BE cycles | 35.32-40.95 |
Stall cycles | 22.86-28.49 |
RS full (events) | 0.33-0.32 |
PRF_FLOAT full (events) | 25.71-32.20 |
Front-end | 12.00 |
Dispatch | 15.00 |
DIV/SQRT | 34.00-40.00 |
Data deps. | 4.00 |
Overall L1 | 34.00-40.00 |
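Vectorization ratio all | 100% |
Vectorization ratio load | 100% |
Vectorization ratio store | NA (no store vectorizable/vectorized instructions) |
Vectorization ratio mul | 100% |
Vectorization ratio add-sub | 100% |
Vectorization ratio fma | 100% |
Vectorization ratio div/sqrt | 100% |
Vectorization ratio other | 100% |
Vector-efficiency ratio all | 45% |
Vector-efficiency ratio load | 50% |
Vector-efficiency ratio store | NA (no store vectorizable/vectorized instructions) |
Vector-efficiency ratio mul | 50% |
Vector-efficiency ratio add-sub | 50% |
Vector-efficiency ratio fma | 50% |
Vector-efficiency ratio div/sqrt | 50% |
Vector-efficiency ratio other | 35% |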
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | Latency | Recip. throughput |
---|---|---|---|---|---|---|---|---|---|---|---|
VMOVUPS (%RSI,%RBX,4),%YMM16 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VSUBPS %YMM6,%YMM16,%YMM16 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPS (%RDX,%RBX,4),%YMM17 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VSUBPS %YMM5,%YMM17,%YMM17 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPS (%RCX,%RBX,4),%YMM18 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VSUBPS %YMM2,%YMM18,%YMM18 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMULPS %YMM16,%YMM16,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD231PS %YMM17,%YMM17,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD231PS %YMM18,%YMM18,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VCMPPS $0x1,%YMM1,%YMM19,%K1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 1 |
VMOVUPS (%R8,%RBX,4),%YMM20{%K1}{z} | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VADDPS %YMM27,%YMM19,%YMM21 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VEXTRACTF32X4 $0x1,%YMM21,%XMM3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 1 |
VCVTPS2PD %XMM3,%YMM3 | 2 | 0.50 | 0.50 | 0 | 0 | 0 | 1 | 0 | 0 | 7 | 1 |
VCVTPS2PD %XMM21,%YMM21 | 2 | 0.50 | 0.50 | 0 | 0 | 0 | 1 | 0 | 0 | 7 | 1 |
VSQRTPD %YMM21,%YMM23 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13-19 | 9-12 |
VSQRTPD %YMM3,%YMM24 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13-19 | 9-12 |
VMULPD %YMM21,%YMM21,%YMM21 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VDIVPD %YMM21,%YMM8,%YMM21 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13-14 | 8 |
VMULPD %YMM3,%YMM3,%YMM3 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVAPS %YMM9,%YMM25 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VFMADD213PS %YMM10,%YMM19,%YMM25 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PS %YMM11,%YMM19,%YMM25 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PS %YMM13,%YMM19,%YMM25 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PS %YMM14,%YMM19,%YMM25 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PS %YMM15,%YMM19,%YMM25 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VCVTPS2PD %XMM25,%YMM26 | 2 | 0.50 | 0.50 | 0 | 0 | 0 | 1 | 0 | 0 | 7 | 1 |
VDIVPD %YMM3,%YMM8,%YMM3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13-14 | 8 |
VFMADD231PD %YMM21,%YMM23,%YMM26 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VEXTRACTF32X4 $0x1,%YMM25,%XMM0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 1 |
VCVTPS2PD %XMM0,%YMM0 | 2 | 0.50 | 0.50 | 0 | 0 | 0 | 1 | 0 | 0 | 7 | 1 |
VFMADD231PD %YMM3,%YMM24,%YMM0 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VCVTPD2PS %YMM26,%XMM3 | 2 | 0.50 | 0.50 | 0 | 0 | 0 | 1 | 0 | 0 | 7 | 1 |
VCVTPD2PS %YMM0,%XMM0 | 2 | 0.50 | 0.50 | 0 | 0 | 0 | 1 | 0 | 0 | 7 | 1 |
VINSERTF128 $0x1,%XMM0,%YMM3,%YMM0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 1 |
VCMPPS $0x1,%YMM19,%YMM22,%K1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 1 |
VMULPS %YMM0,%YMM20,%YMM0{%K1}{z} | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD231PS %YMM16,%YMM0,%YMM12 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD231PS %YMM17,%YMM0,%YMM7 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD231PS %YMM18,%YMM0,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
ADD $0x8,%RBX | 1 | 0.25 | 0.25 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 1 | 0.25 |
CMP %RDI,%RBX | 1 | 0.25 | 0.25 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 1 | 0.25 |
JB 401bf0 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50-1 |
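The 34.00-40.00 figure for cycles executing div or sqrt instructions is consistent with summing the reciprocal throughputs in this table: two `VSQRTPD` at 9-12 cycles each and two `VDIVPD` at 8 cycles each. The sketch below reproduces that sum. It assumes the divide unit handles these instructions back to back with no overlap, which is an interpretation of the numbers and not something the report states. The `cycle_range` type and `div_sqrt_busy` function are hypothetical names.

```c
/* Closed interval of cycle counts, e.g. a "9-12" table entry. */
typedef struct {
    double lo;
    double hi;
} cycle_range;

/* Cycles the divide unit is busy: instruction count times reciprocal
 * throughput, summed over square roots and divisions. */
cycle_range div_sqrt_busy(int n_sqrt, cycle_range sqrt_rt,
                          int n_div, cycle_range div_rt)
{
    cycle_range total;
    total.lo = n_sqrt * sqrt_rt.lo + n_div * div_rt.lo;
    total.hi = n_sqrt * sqrt_rt.hi + n_div * div_rt.hi;
    return total;
}
```

Calling `div_sqrt_busy` with 2 square roots at {9, 12} and 2 divisions at {8, 8} gives the interval 34 to 40, well above the 15.00-cycle dispatch and 12.00-cycle front-end figures.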
Metric | run_0 |
---|---|
Coverage (% app. time) | 99.82 |
Time (s) | 41.69 |
Instance Count | 2190000 |
Iteration Count - min | 50 |
Iteration Count - avg | 961 |
Iteration Count - max | 1872 |
Cycles per Iteration - min | 39.79 |
Cycles per Iteration - avg | 41.11 |
Cycles per Iteration - max | 3646.22 |
Metric | Value |
---|---|
Bucket Coverage (% loop time) | 99.93 |
Instance Count | 2190000 |
ORIG CPI:min | 63.60 |
ORIG CPI:med | 63.96 |
ORIG CPI:max | 66.96 |
DL1 CPI:min | 63.04 |
DL1 CPI:med | 65.36 |
DL1 CPI:max | 66.72 |
ORIG (min) / DL1 (min) | 1.01 |
ORIG (med) / DL1 (med) | 0.98 |
ORIG (max) / DL1 (max) | 1.00 |
Nb Iteration:min | 50 |
Nb Iteration:med | 50.00 |
Nb Iteration:max | 50 |
ORIG: min (cycles) | 3180 |
ORIG: med (cycles) | 3198.00 |
ORIG: max (cycles) | 3348 |
DL1:min (cycles) | 3152 |
DL1:med (cycles) | 3268.00 |
DL1:max (cycles) | 3336 |
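The CPI rows in this table follow from the cycle rows divided by the 50 iterations per instance, and the ORIG / DL1 rows are the ratio of the two CPI values. The sketch below spells out that arithmetic with hypothetical function names. It assumes "CPI" here means cycles per loop iteration, which is what the numbers imply.

```c
/* Cycles per loop iteration for one measured instance. */
double cycles_per_iteration(double cycles, double iterations)
{
    return cycles / iterations;
}

/* Ratio of the original variant to the DL1 variant; a value close to 1
 * means the two variants run at about the same speed. */
double variant_ratio(double orig_cpi, double dl1_cpi)
{
    return orig_cpi / dl1_cpi;
}
```

For example, 3180 / 50 = 63.60 and 3152 / 50 = 63.04, and their ratio rounds to 1.01. The median ratio, 63.96 / 65.36, rounds to 0.98. All three ratios sit within about 2% of 1, so the two variants are nearly indistinguishable in this bucket.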
Metric (average per iteration except for Time and Iteration Count) | ORIG Min (Thread) | ORIG Med (Thread) | ORIG Avg (Thread) | ORIG Max (Thread) | ORIG Min (Instances) | ORIG Med (Instances) | ORIG Max (Instances) | DL1 Min (Thread) | DL1 Med (Thread) | DL1 Avg (Thread) | DL1 Max (Thread) | DL1 Min (Instances) | DL1 Med (Instances) | DL1 Max (Instances) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Time | 3198.00 | 3198.00 | 3198.00 | 3198.00 | 3180.00 | 3198.00 | 3348.00 | 3268.00 | 3268.00 | 3268.00 | 3268.00 | 3152.00 | 3268.00 | 3336.00 |
CPI MED | 63.96 | 63.96 | 63.96 | 63.96 | 63.60 | 63.96 | 66.96 | 65.36 | 65.36 | 65.36 | 65.36 | 63.04 | 65.36 | 66.72 |
Iteration Count | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 |
Metric | ORIG | DL1 |
---|---|---|
CPI MIN | 63.60 | 63.04 |
CPI AVG | 64.05 | 64.89 |
CPI MAX | 66.96 | 66.72 |
ORIG | DL1 | Original Code |
---|---|---|
0x6bbcb9 ADDQ $0x1,-0x2201(%RIP) 0x6bbcc1 VMOVUPS (%RSI,%RBX,4),%YMM16 | 0x6bc0f9 VMOVUPS -0x2d03(%RIP),%YMM16 | 0x401bf0 VMOVUPS (%RSI,%RBX,4),%YMM16 |
0x6bbcc8 VSUBPS %YMM6,%YMM16,%YMM16 | 0x6bc103 VSUBPS %YMM6,%YMM16,%YMM16 | 0x401bf7 VSUBPS %YMM6,%YMM16,%YMM16 |
0x6bbcce VMOVUPS (%RDX,%RBX,4),%YMM17 | 0x6bc109 VMOVUPS -0x2d13(%RIP),%YMM17 | 0x401bfd VMOVUPS (%RDX,%RBX,4),%YMM17 |
0x6bbcd5 VSUBPS %YMM5,%YMM17,%YMM17 | 0x6bc113 VSUBPS %YMM5,%YMM17,%YMM17 | 0x401c04 VSUBPS %YMM5,%YMM17,%YMM17 |
0x6bbcdb VMOVUPS (%RCX,%RBX,4),%YMM18 | 0x6bc119 VMOVUPS -0x2d23(%RIP),%YMM18 | 0x401c0a VMOVUPS (%RCX,%RBX,4),%YMM18 |
0x6bbce2 VSUBPS %YMM2,%YMM18,%YMM18 | 0x6bc123 VSUBPS %YMM2,%YMM18,%YMM18 | 0x401c11 VSUBPS %YMM2,%YMM18,%YMM18 |
0x6bbce8 VMULPS %YMM16,%YMM16,%YMM19 | 0x6bc129 VMULPS %YMM16,%YMM16,%YMM19 | 0x401c17 VMULPS %YMM16,%YMM16,%YMM19 |
0x6bbcee VFMADD231PS %YMM17,%YMM17,%YMM19 | 0x6bc12f VFMADD231PS %YMM17,%YMM17,%YMM19 | 0x401c1d VFMADD231PS %YMM17,%YMM17,%YMM19 |
0x6bbcf4 VFMADD231PS %YMM18,%YMM18,%YMM19 | 0x6bc135 VFMADD231PS %YMM18,%YMM18,%YMM19 | 0x401c23 VFMADD231PS %YMM18,%YMM18,%YMM19 |
0x6bbcfa VCMPPS $0x1,%YMM1,%YMM19,%K1 | 0x6bc13b VCMPPS $0x1,%YMM1,%YMM19,%K1 | 0x401c29 VCMPPS $0x1,%YMM1,%YMM19,%K1 |
0x6bbd01 VMOVUPS (%R8,%RBX,4),%YMM20{%K1}{z} | 0x6bc142 VMOVUPS -0x2d4c(%RIP),%YMM20{%K1}{z} | 0x401c30 VMOVUPS (%R8,%RBX,4),%YMM20{%K1}{z} |
0x6bbd08 VADDPS %YMM27,%YMM19,%YMM21 | 0x6bc14c VADDPS %YMM27,%YMM19,%YMM21 | 0x401c37 VADDPS %YMM27,%YMM19,%YMM21 |
0x6bbd0e VEXTRACTF32X4 $0x1,%YMM21,%XMM3 | 0x6bc152 VEXTRACTF32X4 $0x1,%YMM21,%XMM3 | 0x401c3d VEXTRACTF32X4 $0x1,%YMM21,%XMM3 |
0x6bbd15 VCVTPS2PD %XMM3,%YMM3 | 0x6bc159 VCVTPS2PD %XMM3,%YMM3 | 0x401c44 VCVTPS2PD %XMM3,%YMM3 |
0x6bbd19 VCVTPS2PD %XMM21,%YMM21 | 0x6bc15d VCVTPS2PD %XMM21,%YMM21 | 0x401c48 VCVTPS2PD %XMM21,%YMM21 |
0x6bbd1f VSQRTPD %YMM21,%YMM23 | 0x6bc163 VSQRTPD -0x2f2d(%RIP),%YMM23 | 0x401c4e VSQRTPD %YMM21,%YMM23 |
0x6bbd25 VSQRTPD %YMM3,%YMM24 | 0x6bc16d VSQRTPD -0x2eb7(%RIP),%YMM24 | 0x401c54 VSQRTPD %YMM3,%YMM24 |
0x6bbd2b VMULPD %YMM21,%YMM21,%YMM21 | 0x6bc177 VMULPD %YMM21,%YMM21,%YMM21 | 0x401c5a VMULPD %YMM21,%YMM21,%YMM21 |
0x6bbd31 VDIVPD %YMM21,%YMM8,%YMM21 | 0x6bc17d VMOVUPD -0x2e45(%RIP),%YMM8 0x6bc185 VDIVPD -0x2e8f(%RIP),%YMM8,%YMM21 | 0x401c60 VDIVPD %YMM21,%YMM8,%YMM21 |
0x6bbd37 VMULPD %YMM3,%YMM3,%YMM3 | 0x6bc18f VMULPD %YMM3,%YMM3,%YMM3 | 0x401c66 VMULPD %YMM3,%YMM3,%YMM3 |
0x6bbd3b VMOVAPS %YMM9,%YMM25 | 0x6bc193 VMOVAPS %YMM9,%YMM25 | 0x401c6a VMOVAPS %YMM9,%YMM25 |
0x6bbd41 VFMADD213PS %YMM10,%YMM19,%YMM25 | 0x6bc199 VFMADD213PS %YMM10,%YMM19,%YMM25 | 0x401c70 VFMADD213PS %YMM10,%YMM19,%YMM25 |
0x6bbd47 VFMADD213PS %YMM11,%YMM19,%YMM25 | 0x6bc19f VFMADD213PS %YMM11,%YMM19,%YMM25 | 0x401c76 VFMADD213PS %YMM11,%YMM19,%YMM25 |
0x6bbd4d VFMADD213PS %YMM13,%YMM19,%YMM25 | 0x6bc1a5 VFMADD213PS %YMM13,%YMM19,%YMM25 | 0x401c7c VFMADD213PS %YMM13,%YMM19,%YMM25 |
0x6bbd53 VFMADD213PS %YMM14,%YMM19,%YMM25 | 0x6bc1ab VFMADD213PS %YMM14,%YMM19,%YMM25 | 0x401c82 VFMADD213PS %YMM14,%YMM19,%YMM25 |
0x6bbd59 VFMADD213PS %YMM15,%YMM19,%YMM25 | 0x6bc1b1 VFMADD213PS %YMM15,%YMM19,%YMM25 | 0x401c88 VFMADD213PS %YMM15,%YMM19,%YMM25 |
0x6bbd5f VCVTPS2PD %XMM25,%YMM26 | 0x6bc1b7 VCVTPS2PD %XMM25,%YMM26 | 0x401c8e VCVTPS2PD %XMM25,%YMM26 |
0x6bbd65 VDIVPD %YMM3,%YMM8,%YMM3 | 0x6bc1bd VMOVUPD -0x2e05(%RIP),%YMM8 0x6bc1c5 VDIVPD -0x2e4d(%RIP),%YMM8,%YMM3 | 0x401c94 VDIVPD %YMM3,%YMM8,%YMM3 |
0x6bbd69 VFMADD231PD %YMM21,%YMM23,%YMM26 | 0x6bc1cd VFMADD231PD %YMM21,%YMM23,%YMM26 | 0x401c98 VFMADD231PD %YMM21,%YMM23,%YMM26 |
0x6bbd6f VEXTRACTF32X4 $0x1,%YMM25,%XMM0 | 0x6bc1d3 VEXTRACTF32X4 $0x1,%YMM25,%XMM0 | 0x401c9e VEXTRACTF32X4 $0x1,%YMM25,%XMM0 |
0x6bbd76 VCVTPS2PD %XMM0,%YMM0 | 0x6bc1da VCVTPS2PD %XMM0,%YMM0 | 0x401ca5 VCVTPS2PD %XMM0,%YMM0 |
0x6bbd7a VFMADD231PD %YMM3,%YMM24,%YMM0 | 0x6bc1de VFMADD231PD %YMM3,%YMM24,%YMM0 | 0x401ca9 VFMADD231PD %YMM3,%YMM24,%YMM0 |
0x6bbd80 VCVTPD2PS %YMM26,%XMM3 | 0x6bc1e4 VCVTPD2PS %YMM26,%XMM3 | 0x401caf VCVTPD2PS %YMM26,%XMM3 |
0x6bbd86 VCVTPD2PS %YMM0,%XMM0 | 0x6bc1ea VCVTPD2PS %YMM0,%XMM0 | 0x401cb5 VCVTPD2PS %YMM0,%XMM0 |
0x6bbd8a VINSERTF128 $0x1,%XMM0,%YMM3,%YMM0 | 0x6bc1ee VINSERTF128 $0x1,%XMM0,%YMM3,%YMM0 | 0x401cb9 VINSERTF128 $0x1,%XMM0,%YMM3,%YMM0 |
0x6bbd90 VCMPPS $0x1,%YMM19,%YMM22,%K1 | 0x6bc1f4 VCMPPS $0x1,%YMM19,%YMM22,%K1 | 0x401cbf VCMPPS $0x1,%YMM19,%YMM22,%K1 |
0x6bbd97 VMULPS %YMM0,%YMM20,%YMM0{%K1}{z} | 0x6bc1fb VMULPS %YMM0,%YMM20,%YMM0{%K1}{z} | 0x401cc6 VMULPS %YMM0,%YMM20,%YMM0{%K1}{z} |
0x6bbd9d VFMADD231PS %YMM16,%YMM0,%YMM12 | 0x6bc201 VFMADD231PS %YMM16,%YMM0,%YMM12 | 0x401ccc VFMADD231PS %YMM16,%YMM0,%YMM12 |
0x6bbda3 VFMADD231PS %YMM17,%YMM0,%YMM7 | 0x6bc207 VFMADD231PS %YMM17,%YMM0,%YMM7 | 0x401cd2 VFMADD231PS %YMM17,%YMM0,%YMM7 |
0x6bbda9 VFMADD231PS %YMM18,%YMM0,%YMM4 | 0x6bc20d VFMADD231PS %YMM18,%YMM0,%YMM4 | 0x401cd8 VFMADD231PS %YMM18,%YMM0,%YMM4 |
0x6bbdaf ADD $0x8,%RBX | 0x6bc213 ADD $0x8,%RBX | 0x401cde ADD $0x8,%RBX |
0x6bbdb3 CMP %RDI,%RBX | 0x6bc217 CMP %RDI,%RBX | 0x401ce2 CMP %RDI,%RBX |
0x6bbdb6 JB 6bbcb9 | 0x6bc21a JB 6bc0f9 | 0x401ce5 JB 401bf0 |
Path / |
Metric | ORIG | DL1 | Original |
---|---|---|---|
FP operations per cycle L1 | 6.20, 7.29 | 6.20, 7.29 | 6.20, 7.29 |
cycles L1 CQA | 40.00 | 40.00 | 40.00 |
cycles UFS | 41.27 | 41.46 | 40.95 |
bytes loaded | 136.00 | 320.00 | 128.00 |
bytes stored | 8.00 | 0.00 | 0.00 |
nb loads | 5.00 | 10.00 | 4.00 |
nb stores | 1.00 | 0.00 | 0.00 |
cycles dispatch | 15.00 | 15.00 | 15.00 |
cycles front end | 12.50 | 12.50 | 12.00 |
cycles P0 | 15.00 | 15.00 | 15.00 |
cycles P1 | 15.00 | 15.00 | 15.00 |
cycles P2 | 2.50 | 5.00 | 2.00 |
cycles P3 | 2.50 | 5.00 | 2.00 |
cycles P4 | 1.00 | 0.00 | 0.00 |
cycles P5 | 11.00 | 11.00 | 11.00 |
cycles P6 | 3.00 | 2.00 | 2.00 |
cycles P7 | 1.00 | 0.00 | 0.00 |
stall cycles | 28.29 | 28.50 | 28.49 |
LB full | 0.00 | 0.00 | 0.00 |
LM full | 0.00 | 0.00 | 0.00 |
PRF full | 0.00 | 0.00 | 0.00 |
PRF_FLOAT full | 32.41 | 31.84 | 32.20 |
PRF_INT full | 0.00 | 0.00 | 0.00 |
ROB full | 0.00 | 0.00 | 0.00 |
RS full | 0.29 | 0.30 | 0.32 |
SB full | 0.00 | 0.00 | 0.00 |
nb uops | 50.00 | 50.00 | 48.00 |
uops P0 | 15.00 | 15.00 | 15.00 |
uops P1 | 15.00 | 15.00 | 15.00 |
uops P2 | 2.50 | 5.00 | 2.00 |
uops P3 | 2.50 | 5.00 | 2.00 |
uops P4 | 1.00 | 0.00 | 0.00 |
uops P5 | 11.00 | 11.00 | 11.00 |
uops P6 | 3.00 | 2.00 | 2.00 |
uops P7 | 1.00 | 0.00 | 0.00 |
ID | 11 | 13 | 5 |
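The "bytes loaded" row can be cross-checked against the disassembly comparison table. The sketch below assumes every YMM memory operand moves 32 bytes and that the `ADDQ` on a memory operand at the head of the ORIG variant reads and writes 8 bytes. The function name `bytes_loaded` is hypothetical.

```c
/* Bytes read per loop iteration, given the number of 256-bit (YMM)
 * memory reads and the number of 64-bit (quadword) memory reads. */
int bytes_loaded(int ymm_loads, int qword_loads)
{
    return 32 * ymm_loads + 8 * qword_loads;
}
```

The original loop has four YMM loads, giving 128 bytes. ORIG adds the one `ADDQ` read for 136 bytes, and its write accounts for the 8 bytes stored. DL1 has ten YMM memory reads (four `VMOVUPS`, two `VMOVUPD`, and memory operands on two `VSQRTPD` and two `VDIVPD`), giving 320 bytes.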