Loop Id: 605 | Module: exec | Source: update_halo_kernel.f90:99-158 [...] | Coverage: 0.01% |
---|
0x451c40 SUB %R8,%R14 |
0x451c43 ADD $0x2,%R14 |
0x451c47 VPBROADCASTQ %R14,%ZMM10 |
0x451c4d VPBROADCASTQ %RBX,%ZMM9 |
0x451c53 VPADDQ %ZMM2,%ZMM9,%ZMM9 |
0x451c59 XOR %R10D,%R10D |
0x451c5c MOV 0x98(%RBP),%R15 |
0x451c63 VPBROADCASTQ %R10,%ZMM11 |
0x451c69 VPSUBQ %ZMM11,%ZMM0,%ZMM12 |
0x451c6f VPCMPNLEUQ %ZMM2,%ZMM12,%K1 |
0x451c76 VPBROADCASTD %R10D,%YMM12 |
0x451c7c VPSUBD %YMM12,%YMM8,%YMM8 |
0x451c81 VPADDD %YMM3,%YMM8,%YMM8 |
0x451c85 VPMOVSXDQ %YMM8,%ZMM8 |
0x451c8b VPSUBQ %ZMM1,%ZMM8,%ZMM8 |
0x451c91 VPMULLQ %ZMM8,%ZMM7,%ZMM8 |
0x451c97 VPSLLQ $0x3,%ZMM10,%ZMM10 |
0x451c9e VPADDQ %ZMM10,%ZMM4,%ZMM10 |
0x451ca4 VPADDQ %ZMM8,%ZMM10,%ZMM8 |
0x451caa VPXOR %XMM12,%XMM12,%XMM12 |
0x451caf KMOVQ %K1,%K2 |
0x451cb4 VGATHERQPD (,%ZMM8,1),%ZMM12{%K2} |
0x451cbf VMOVAPD %ZMM12,%ZMM6{%K1} |
0x451cc5 VPADDQ %ZMM11,%ZMM9,%ZMM8 |
0x451ccb VPSUBQ %ZMM1,%ZMM8,%ZMM8 |
0x451cd1 VPSUBQ %ZMM5,%ZMM8,%ZMM8 |
0x451cd7 VPMULLQ %ZMM8,%ZMM7,%ZMM7 |
0x451cdd VPADDQ %ZMM7,%ZMM10,%ZMM7 |
0x451ce3 VSCATTERQPD %ZMM6,(,%ZMM7,1){%K1} |
0x451cee JMP 451cf0 |
0x451cf0 LEA 0x1(%RSI),%R10D |
0x451cf4 CMP %ECX,%ESI |
0x451cf6 MOV %R10D,%ESI |
0x451cf9 JE 44e35f |
0x451cff TEST %EDX,%EDX |
0x451d01 JLE 451cf0 |
0x451d03 LEA (%RAX,%RSI,1),%R14D |
0x451d07 MOV 0xd8(%RBP),%R10 |
0x451d0e MOV (%R10),%R15 |
0x451d11 MOV 0xa8(%RBP),%R10 |
0x451d18 MOV (%R10),%R11D |
0x451d1b MOVSXD %R11D,%RBX |
0x451d1e MOV %RDX,%R10 |
0x451d21 VPBROADCASTD %EBX,%YMM8 |
0x451d27 VPBROADCASTQ %R15,%ZMM7 |
0x451d2d MOVSXD %R14D,%R14 |
0x451d30 AND %RDI,%R10 |
0x451d33 JE 451c40 |
0x451d39 SUB %R9,%R14 |
0x451d3c VPBROADCASTQ %R14,%ZMM10 |
0x451d42 VPSLLQ $0x3,%ZMM10,%ZMM9 |
0x451d49 VPADDQ %ZMM9,%ZMM4,%ZMM11 |
0x451d4f VPBROADCASTQ %RBX,%ZMM9 |
0x451d55 VPADDQ %ZMM2,%ZMM9,%ZMM9 |
0x451d5b XOR %EBX,%EBX |
0x451d5d NOPL (%RAX) |
(606) 0x451d60 VPBROADCASTD %R11D,%YMM12 |
(606) 0x451d66 VPADDD %YMM3,%YMM12,%YMM12 |
(606) 0x451d6a VPMOVSXDQ %YMM12,%ZMM12 |
(606) 0x451d70 VPSUBQ %ZMM1,%ZMM12,%ZMM12 |
(606) 0x451d76 VPMULLQ %ZMM12,%ZMM7,%ZMM12 |
(606) 0x451d7c VPADDQ %ZMM12,%ZMM11,%ZMM12 |
(606) 0x451d82 KXNORW %K0,%K0,%K1 |
(606) 0x451d86 VXORPD %XMM13,%XMM13,%XMM13 |
(606) 0x451d8b VGATHERQPD (,%ZMM12,1),%ZMM13{%K1} |
(606) 0x451d96 VPBROADCASTQ %RBX,%ZMM12 |
(606) 0x451d9c VPADDQ %ZMM12,%ZMM9,%ZMM12 |
(606) 0x451da2 VPSUBQ %ZMM1,%ZMM12,%ZMM12 |
(606) 0x451da8 VPSUBQ %ZMM5,%ZMM12,%ZMM12 |
(606) 0x451dae VPMULLQ %ZMM12,%ZMM7,%ZMM12 |
(606) 0x451db4 VPADDQ %ZMM12,%ZMM11,%ZMM12 |
(606) 0x451dba KXNORW %K0,%K0,%K1 |
(606) 0x451dbe VSCATTERQPD %ZMM13,(,%ZMM12,1){%K1} |
(606) 0x451dc9 ADD $0x8,%RBX |
(606) 0x451dcd ADD $-0x8,%R11D |
(606) 0x451dd1 CMP %R10,%RBX |
(606) 0x451dd4 JB 451d60 |
0x451dd6 CMP %RDX,%R10 |
0x451dd9 MOV 0x98(%RBP),%R15 |
0x451de0 JNE 451c63 |
0x451de6 JMP 451cf0 |
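The listing above looks like a strip-mined loop: the inner loop marked (606) advances 8 elements per pass with an all-ones mask (`KXNORW`), and the block at 0x451c63 handles the leftover elements under a mask produced by `VPCMPNLEUQ`. The following is a minimal Python sketch of that control flow, not a transcription of the binary. It assumes 8 lanes per vector (64-bit elements in a 512-bit register) and assumes that `AND %RDI,%R10` rounds the trip count down to a multiple of 8; the value held in RDI is not visible in this excerpt. All function and variable names are hypothetical, and a plain element copy stands in for the gather/scatter pair.

```python
LANES = 8  # assumed: eight 64-bit elements per 512-bit ZMM register


def strip_mine(trip_count, lanes=LANES):
    """Split a trip count into full-vector passes and one masked remainder.

    Returns (full_passes, remainder_mask). remainder_mask is a list with one
    boolean per lane, or None when the trip count is a multiple of `lanes`.
    """
    main_count = trip_count & ~(lanes - 1)  # assumed effect of AND %RDI,%R10
    full_passes = main_count // lanes       # loop 606 adds 8 to RBX per pass
    if main_count == trip_count:            # CMP %RDX,%R10 finds them equal
        return full_passes, None
    left = trip_count - main_count
    # Lanes below `left` stay active, playing the role of the compare mask.
    return full_passes, [lane < left for lane in range(lanes)]


def masked_copy(src, dst, trip_count):
    """Copy trip_count elements: full vectors first, then the masked tail."""
    full_passes, mask = strip_mine(trip_count)
    for p in range(full_passes):
        base = p * LANES
        for lane in range(LANES):           # every lane active
            dst[base + lane] = src[base + lane]
    if mask is not None:
        base = full_passes * LANES
        for lane, active in enumerate(mask):
            if active:                      # inactive lanes are never touched
                dst[base + lane] = src[base + lane]
    return dst
```

In this model a trip count of 2 yields no full passes and a remainder mask with two active lanes, so six of the eight lanes do no useful work on that pass.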
/scratch_na/users/xoserete/qaas_runs/171-322-0339/intel/CloverLeafFC/build/CloverLeafFC/CloverLeaf_ref/kernels/update_halo_kernel.f90: 99 - 158 |
-------------------------------------------------------------------------------- |
99: IF(fields(FIELD_DENSITY0).EQ.1) THEN |
[...] |
155: DO j=x_min-depth,x_max+depth |
156: !$OMP SIMD |
157: DO k=1,depth |
158: density1(j,y_max+k)=density1(j,y_max+1-k) |
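Fortran stores arrays in column-major order, so in `density1(j,y_max+k)` the first index `j` is contiguous in memory and the second index moves by a whole column. With `!$OMP SIMD` on the `k` loop, neighbouring lanes therefore touch elements one column apart, which is consistent with the `VGATHERQPD`/`VSCATTERQPD` pair in the assembly. The Python sketch below only models the address arithmetic. The array bounds are hypothetical (the declaration of `density1` is not part of this excerpt), as are the helper names.

```python
ELEM_BYTES = 8  # 64-bit elements, as moved by VGATHERQPD / VSCATTERQPD


def byte_offset(j, k, j_lo, k_lo, n_j):
    """Byte offset of a(j, k) in a column-major array.

    j_lo and k_lo are the lower bounds of the two dimensions and n_j is the
    extent of the first (contiguous) dimension.
    """
    return ((j - j_lo) + (k - k_lo) * n_j) * ELEM_BYTES


def strides(offsets):
    """Differences between consecutive byte offsets."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


# Hypothetical problem size and bounds, assumed for illustration only:
# density1(x_min-2:x_max+2, y_min-2:y_max+2)
x_min, x_max, y_min, y_max, depth = 1, 10, 1, 10, 2
j_lo, k_lo = x_min - 2, y_min - 2
n_j = (x_max + 2) - j_lo + 1  # 14 elements per column

j = 5  # any fixed j inside the range
# Lanes run over k, as in the source: stores go to y_max+k, loads from y_max+1-k.
k_store_strides = strides(
    [byte_offset(j, y_max + k, j_lo, k_lo, n_j) for k in range(1, depth + 1)])
k_load_strides = strides(
    [byte_offset(j, y_max + 1 - k, j_lo, k_lo, n_j) for k in range(1, depth + 1)])
# Lanes run over j instead: neighbouring lanes are neighbouring elements.
j_strides = strides(
    [byte_offset(jj, y_max + 1, j_lo, k_lo, n_j) for jj in range(x_min, x_min + 4)])
```

With these assumed bounds the lane-to-lane stride is one full column (112 bytes, negative on the load side) when lanes run over `k`, and 8 bytes when lanes run over `j`. Whether vectorizing over `j` would pay off in the real code depends on trip counts that this report does not show.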
Path / |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.06 |
CQA speedup if FP arith vectorized | 1.00 |
CQA speedup if fully vectorized | 1.26 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 1.15 |
Bottlenecks | |
Function | update_halo_kernel_.DIR.OMP.PARALLEL.2 |
Source | update_halo_kernel.f90:99-99,update_halo_kernel.f90:155-158 |
Source loop unroll info | NA |
Source loop unroll confidence level | NA |
Unroll/vectorization loop type | NA |
Unroll factor | NA |
CQA cycles | 9.00 |
CQA cycles if no scalar integer | 8.50 |
CQA cycles if FP arith vectorized | 9.00 |
CQA cycles if fully vectorized | 7.14 |
Front-end cycles | 8.08 |
DIV/SQRT cycles | 8.75 |
P0 cycles | 3.65 |
P1 cycles | 2.58 |
P2 cycles | 2.58 |
P3 cycles | 2.00 |
P4 cycles | 8.65 |
P5 cycles | 3.70 |
P6 cycles | 2.00 |
P7 cycles | 2.00 |
P8 cycles | 2.00 |
P9 cycles | 3.50 |
P10 cycles | 2.58 |
P11 cycles | 0.00 |
Inter-iter dependencies cycles | 0 |
FE+BE cycles (UFS) | 9.66 - 13.72 |
Stall cycles (UFS) | 2.27 - 6.23 |
Nb insns | 34.00 |
Nb uops | 48.50 |
Nb loads | 4.25 |
Nb stores | 0.50 |
Nb stack references | 2.25 |
FLOP/cycle | 0.00 |
Nb FLOP add-sub | 0.00 |
Nb FLOP mul | 0.00 |
Nb FLOP fma | 0.00 |
Nb FLOP div | 0.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 0.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 7.27 |
Bytes prefetched | 0.00 |
Bytes loaded | 59.00 |
Bytes stored | 32.00 |
Stride 0 | 0.75 |
Stride 1 | 0.00 |
Stride n | 0.00 |
Stride unknown | 1.25 |
Stride indirect | 1.00 |
Vectorization ratio all | 34.26 |
Vectorization ratio load | 16.67 |
Vectorization ratio store | 100.00 |
Vectorization ratio mul | 100.00 |
Vectorization ratio add_sub | 81.20 |
Vectorization ratio fma | NA |
Vectorization ratio div_sqrt | NA |
Vectorization ratio other | 22.22 |
Vector-efficiency ratio all | 36.81 |
Vector-efficiency ratio load | 25.35 |
Vector-efficiency ratio store | 100.00 |
Vector-efficiency ratio mul | 100.00 |
Vector-efficiency ratio add_sub | 78.42 |
Vector-efficiency ratio fma | NA |
Vector-efficiency ratio div_sqrt | NA |
Vector-efficiency ratio other | 25.19 |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.00 |
CQA speedup if FP arith vectorized | 1.00 |
CQA speedup if fully vectorized | 16.00 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 1.20 |
Bottlenecks | P0, P6 |
Function | update_halo_kernel_.DIR.OMP.PARALLEL.2 |
Source | update_halo_kernel.f90:99-99,update_halo_kernel.f90:155-158 |
Source loop unroll info | NA |
Source loop unroll confidence level | NA |
Unroll/vectorization loop type | NA |
Unroll factor | NA |
CQA cycles | 1.00 |
CQA cycles if no scalar integer | 1.00 |
CQA cycles if FP arith vectorized | 1.00 |
CQA cycles if fully vectorized | 0.06 |
Front-end cycles | 0.83 |
DIV/SQRT cycles | 1.00 |
P0 cycles | 0.80 |
P1 cycles | 0.00 |
P2 cycles | 0.00 |
P3 cycles | 0.00 |
P4 cycles | 0.60 |
P5 cycles | 1.00 |
P6 cycles | 0.00 |
P7 cycles | 0.00 |
P8 cycles | 0.00 |
P9 cycles | 0.60 |
P10 cycles | 0.00 |
P11 cycles | 0.00 |
Inter-iter dependencies cycles | 0 |
FE+BE cycles (UFS) | 1.06 - 2.04 |
Stall cycles (UFS) | 0.00 - 0.78 |
Nb insns | 6.00 |
Nb uops | 5.00 |
Nb loads | 0.00 |
Nb stores | 0.00 |
Nb stack references | 0.00 |
FLOP/cycle | 0.00 |
Nb FLOP add-sub | 0.00 |
Nb FLOP mul | 0.00 |
Nb FLOP fma | 0.00 |
Nb FLOP div | 0.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 0.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 0.00 |
Bytes prefetched | 0.00 |
Bytes loaded | 0.00 |
Bytes stored | 0.00 |
Stride 0 | 0.00 |
Stride 1 | 0.00 |
Stride n | 0.00 |
Stride unknown | 0.00 |
Stride indirect | 0.00 |
Vectorization ratio all | 0.00 |
Vectorization ratio load | NA |
Vectorization ratio store | NA |
Vectorization ratio mul | NA |
Vectorization ratio add_sub | NA |
Vectorization ratio fma | NA |
Vectorization ratio div_sqrt | NA |
Vectorization ratio other | 0.00 |
Vector-efficiency ratio all | 6.25 |
Vector-efficiency ratio load | NA |
Vector-efficiency ratio store | NA |
Vector-efficiency ratio mul | NA |
Vector-efficiency ratio add_sub | NA |
Vector-efficiency ratio fma | NA |
Vector-efficiency ratio div_sqrt | NA |
Vector-efficiency ratio other | 6.25 |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.04 |
CQA speedup if FP arith vectorized | 1.00 |
CQA speedup if fully vectorized | 1.21 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 1.13 |
Bottlenecks | P0, P5 |
Function | update_halo_kernel_.DIR.OMP.PARALLEL.2 |
Source | update_halo_kernel.f90:99-99,update_halo_kernel.f90:155-158 |
Source loop unroll info | NA |
Source loop unroll confidence level | NA |
Unroll/vectorization loop type | NA |
Unroll factor | NA |
CQA cycles | 14.50 |
CQA cycles if no scalar integer | 14.00 |
CQA cycles if FP arith vectorized | 14.50 |
CQA cycles if fully vectorized | 12.00 |
Front-end cycles | 12.83 |
DIV/SQRT cycles | 14.50 |
P0 cycles | 4.40 |
P1 cycles | 4.33 |
P2 cycles | 4.33 |
P3 cycles | 4.00 |
P4 cycles | 14.50 |
P5 cycles | 4.40 |
P6 cycles | 4.00 |
P7 cycles | 4.00 |
P8 cycles | 4.00 |
P9 cycles | 4.20 |
P10 cycles | 4.33 |
P11 cycles | 0.00 |
Inter-iter dependencies cycles | 0 |
FE+BE cycles (UFS) | 15.82 - 23.47 |
Stall cycles (UFS) | 4.57 - 12.13 |
Nb insns | 48.00 |
Nb uops | 77.00 |
Nb loads | 6.00 |
Nb stores | 1.00 |
Nb stack references | 3.00 |
FLOP/cycle | 0.00 |
Nb FLOP add-sub | 0.00 |
Nb FLOP mul | 0.00 |
Nb FLOP fma | 0.00 |
Nb FLOP div | 0.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 0.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 11.31 |
Bytes prefetched | 0.00 |
Bytes loaded | 100.00 |
Bytes stored | 64.00 |
Stride 0 | 1.00 |
Stride 1 | 0.00 |
Stride n | 0.00 |
Stride unknown | 1.00 |
Stride indirect | 2.00 |
Vectorization ratio all | 58.82 |
Vectorization ratio load | 25.00 |
Vectorization ratio store | 100.00 |
Vectorization ratio mul | 100.00 |
Vectorization ratio add_sub | 84.62 |
Vectorization ratio fma | NA |
Vectorization ratio div_sqrt | NA |
Vectorization ratio other | 40.00 |
Vector-efficiency ratio all | 56.62 |
Vector-efficiency ratio load | 32.81 |
Vector-efficiency ratio store | 100.00 |
Vector-efficiency ratio mul | 100.00 |
Vector-efficiency ratio add_sub | 78.85 |
Vector-efficiency ratio fma | NA |
Vector-efficiency ratio div_sqrt | NA |
Vector-efficiency ratio other | 37.92 |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.25 |
CQA speedup if FP arith vectorized | 1.00 |
CQA speedup if fully vectorized | 1.43 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 1.25 |
Bottlenecks | micro-operation queue |
Function | update_halo_kernel_.DIR.OMP.PARALLEL.2 |
Source | update_halo_kernel.f90:99-99,update_halo_kernel.f90:155-158 |
Source loop unroll info | NA |
Source loop unroll confidence level | NA |
Unroll/vectorization loop type | NA |
Unroll factor | NA |
CQA cycles | 5.00 |
CQA cycles if no scalar integer | 4.00 |
CQA cycles if FP arith vectorized | 5.00 |
CQA cycles if fully vectorized | 3.50 |
Front-end cycles | 5.00 |
DIV/SQRT cycles | 4.00 |
P0 cycles | 4.00 |
P1 cycles | 1.67 |
P2 cycles | 1.67 |
P3 cycles | 0.00 |
P4 cycles | 4.00 |
P5 cycles | 4.00 |
P6 cycles | 0.00 |
P7 cycles | 0.00 |
P8 cycles | 0.00 |
P9 cycles | 4.00 |
P10 cycles | 1.67 |
P11 cycles | 0.00 |
Inter-iter dependencies cycles | 0 |
FE+BE cycles (UFS) | 5.17 - 5.19 |
Stall cycles (UFS) | 0.00 |
Nb insns | 30.00 |
Nb uops | 30.00 |
Nb loads | 5.00 |
Nb stores | 0.00 |
Nb stack references | 3.00 |
FLOP/cycle | 0.00 |
Nb FLOP add-sub | 0.00 |
Nb FLOP mul | 0.00 |
Nb FLOP fma | 0.00 |
Nb FLOP div | 0.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 0.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 7.20 |
Bytes prefetched | 0.00 |
Bytes loaded | 36.00 |
Bytes stored | 0.00 |
Stride 0 | 1.00 |
Stride 1 | 0.00 |
Stride n | 0.00 |
Stride unknown | 2.00 |
Stride indirect | 0.00 |
Vectorization ratio all | 18.75 |
Vectorization ratio load | 0.00 |
Vectorization ratio store | NA |
Vectorization ratio mul | NA |
Vectorization ratio add_sub | 66.67 |
Vectorization ratio fma | NA |
Vectorization ratio div_sqrt | NA |
Vectorization ratio other | 10.00 |
Vector-efficiency ratio all | 26.95 |
Vector-efficiency ratio load | 10.42 |
Vector-efficiency ratio store | NA |
Vector-efficiency ratio mul | NA |
Vector-efficiency ratio add_sub | 70.83 |
Vector-efficiency ratio fma | NA |
Vector-efficiency ratio div_sqrt | NA |
Vector-efficiency ratio other | 18.75 |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.03 |
CQA speedup if FP arith vectorized | 1.00 |
CQA speedup if fully vectorized | 1.19 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 1.13 |
Bottlenecks | P0, P5 |
Function | update_halo_kernel_.DIR.OMP.PARALLEL.2 |
Source | update_halo_kernel.f90:99-99,update_halo_kernel.f90:155-158 |
Source loop unroll info | NA |
Source loop unroll confidence level | NA |
Unroll/vectorization loop type | NA |
Unroll factor | NA |
CQA cycles | 15.50 |
CQA cycles if no scalar integer | 15.00 |
CQA cycles if FP arith vectorized | 15.50 |
CQA cycles if fully vectorized | 13.00 |
Front-end cycles | 13.67 |
DIV/SQRT cycles | 15.50 |
P0 cycles | 5.40 |
P1 cycles | 4.33 |
P2 cycles | 4.33 |
P3 cycles | 4.00 |
P4 cycles | 15.50 |
P5 cycles | 5.40 |
P6 cycles | 4.00 |
P7 cycles | 4.00 |
P8 cycles | 4.00 |
P9 cycles | 5.20 |
P10 cycles | 4.33 |
P11 cycles | 0.00 |
Inter-iter dependencies cycles | 0 |
FE+BE cycles (UFS) | 16.58 - 24.19 |
Stall cycles (UFS) | 4.51 - 12.03 |
Nb insns | 52.00 |
Nb uops | 82.00 |
Nb loads | 6.00 |
Nb stores | 1.00 |
Nb stack references | 3.00 |
FLOP/cycle | 0.00 |
Nb FLOP add-sub | 0.00 |
Nb FLOP mul | 0.00 |
Nb FLOP fma | 0.00 |
Nb FLOP div | 0.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 0.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 10.58 |
Bytes prefetched | 0.00 |
Bytes loaded | 100.00 |
Bytes stored | 64.00 |
Stride 0 | 1.00 |
Stride 1 | 0.00 |
Stride n | 0.00 |
Stride unknown | 2.00 |
Stride indirect | 2.00 |
Vectorization ratio all | 59.46 |
Vectorization ratio load | 25.00 |
Vectorization ratio store | 100.00 |
Vectorization ratio mul | 100.00 |
Vectorization ratio add_sub | 92.31 |
Vectorization ratio fma | NA |
Vectorization ratio div_sqrt | NA |
Vectorization ratio other | 38.89 |
Vector-efficiency ratio all | 57.43 |
Vector-efficiency ratio load | 32.81 |
Vector-efficiency ratio store | 100.00 |
Vector-efficiency ratio mul | 100.00 |
Vector-efficiency ratio add_sub | 85.58 |
Vector-efficiency ratio fma | NA |
Vector-efficiency ratio div_sqrt | NA |
Vector-efficiency ratio other | 37.85 |
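Several of the derived rows in the metric tables above can be reproduced from the raw rows. The Python sketch below checks two relationships that the numbers in this report are consistent with: "Bytes/cycle" equals (bytes loaded + bytes stored) / CQA cycles, and "CQA speedup if fully vectorized" equals CQA cycles / CQA cycles if fully vectorized. These are observations about this report, not a statement of how CQA computes the values internally. The keys `T2` to `T5` are hypothetical labels for the second to fifth metric tables in order of appearance.

```python
# Raw values copied from the second to fifth metric tables of this report.
tables = {
    "T2": {"cycles": 1.00, "cycles_full_vec": 0.06, "loaded": 0.0, "stored": 0.0},
    "T3": {"cycles": 14.50, "cycles_full_vec": 12.00, "loaded": 100.0, "stored": 64.0},
    "T4": {"cycles": 5.00, "cycles_full_vec": 3.50, "loaded": 36.0, "stored": 0.0},
    "T5": {"cycles": 15.50, "cycles_full_vec": 13.00, "loaded": 100.0, "stored": 64.0},
}


def bytes_per_cycle(t):
    """Memory traffic per cycle: all bytes moved divided by the cycle estimate."""
    return (t["loaded"] + t["stored"]) / t["cycles"]


def speedup_full_vec(t):
    """Ratio of the current cycle estimate to the fully vectorized estimate."""
    return t["cycles"] / t["cycles_full_vec"]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# The first metric table matches the plain average of the other four.
avg_cycles = mean(t["cycles"] for t in tables.values())
avg_bytes_per_cycle = mean(bytes_per_cycle(t) for t in tables.values())
avg_speedup = avg_cycles / mean(t["cycles_full_vec"] for t in tables.values())

for name, t in tables.items():
    print(name, round(bytes_per_cycle(t), 2), round(speedup_full_vec(t), 2))
print("average", round(avg_cycles, 2), round(avg_bytes_per_cycle, 2), round(avg_speedup, 2))
```

For `T2` the sketch gives 16.67 rather than the reported 16.00. The likely reason is that 0.06 is a rounded display of 1/16 = 0.0625, but the report does not confirm this.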
Path / |
Function | update_halo_kernel_.DIR.OMP.PARALLEL.2 |
Source file and lines | update_halo_kernel.f90:99-158 |
Module | exec |
nb instructions | 34 |
nb uops | 48.50 |
loop length | 167.50 |
used x86 registers | 10 |
used mmx registers | 0 |
used xmm registers | 0.50 |
used ymm registers | 1.75 |
used zmm registers | 7.50 |
nb stack references | 2.25 |
micro-operation queue | 8.08 cycles |
front end | 8.08 cycles |
Metric | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
uops | 8.75 | 3.65 | 2.58 | 2.58 | 2.00 | 8.65 | 3.70 | 2.00 | 2.00 | 2.00 | 3.50 | 2.58 |
cycles | 8.75 | 3.65 | 2.58 | 2.58 | 2.00 | 8.65 | 3.70 | 2.00 | 2.00 | 2.00 | 3.50 | 2.58 |
Cycles executing div or sqrt instructions | NA |
Longest recurrence chain latency (RecMII) | 0.00 |
FE+BE cycles | 9.66-13.72 |
Stall cycles | 2.27-6.23 |
RS full (events) | 5.14-0.36 |
Front-end | 8.08 |
Dispatch | 8.75 |
Data deps. | 0.00 |
Overall L1 | 9.00 |
Vectorization ratios (INT)
all | 32% |
load | 0% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 100% |
add-sub | 81% |
fma | NA (no fma vectorizable/vectorized instructions) |
other | 18% |
Vectorization ratios (FP)
all | 100% |
load | 100% |
store | 100% |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 100% |
Vectorization ratios (INT+FP)
all | 34% |
load | 16% |
store | 100% |
mul | 100% |
add-sub | 81% |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 22% |
Vector-efficiency ratios (INT)
all | 34% |
load | 10% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 100% |
add-sub | 78% |
fma | NA (no fma vectorizable/vectorized instructions) |
other | 20% |
Vector-efficiency ratios (FP)
all | 100% |
load | 100% |
store | 100% |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 100% |
Vector-efficiency ratios (INT+FP)
all | 36% |
load | 25% |
store | 100% |
mul | 100% |
add-sub | 78% |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 25% |
Function | update_halo_kernel_.DIR.OMP.PARALLEL.2 |
Source file and lines | update_halo_kernel.f90:99-158 |
Module | exec |
nb instructions | 6 |
nb uops | 5 |
loop length | 19 |
used x86 registers | 4 |
used mmx registers | 0 |
used xmm registers | 0 |
used ymm registers | 0 |
used zmm registers | 0 |
nb stack references | 0 |
micro-operation queue | 0.83 cycles |
front end | 0.83 cycles |
Metric | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
uops | 1.00 | 0.80 | 0.00 | 0.00 | 0.00 | 0.60 | 1.00 | 0.00 | 0.00 | 0.00 | 0.60 | 0.00 |
cycles | 1.00 | 0.80 | 0.00 | 0.00 | 0.00 | 0.60 | 1.00 | 0.00 | 0.00 | 0.00 | 0.60 | 0.00 |
Cycles executing div or sqrt instructions | NA |
Longest recurrence chain latency (RecMII) | 0.00 |
FE+BE cycles | 1.06-2.04 |
Stall cycles | 0.00-0.78 |
Front-end | 0.83 |
Dispatch | 1.00 |
Data deps. | 0.00 |
Overall L1 | 1.00 |
Vectorization ratios
all | 0% |
load | NA (no load vectorizable/vectorized instructions) |
store | NA (no store vectorizable/vectorized instructions) |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 0% |
Vector-efficiency ratios
all | 6% |
load | NA (no load vectorizable/vectorized instructions) |
store | NA (no store vectorizable/vectorized instructions) |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 6% |
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | Latency | Recip. throughput |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LEA 0x1(%RSI),%R10D | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
CMP %ECX,%ESI | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
MOV %R10D,%ESI | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
JE 44e35f <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x51f> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
TEST %EDX,%EDX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 2 | 0.20 |
JLE 451cf0 <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x3eb0> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
Function | update_halo_kernel_.DIR.OMP.PARALLEL.2 |
Source file and lines | update_halo_kernel.f90:99-158 |
Module | exec |
nb instructions | 48 |
nb uops | 77 |
loop length | 249 |
used x86 registers | 12 |
used mmx registers | 0 |
used xmm registers | 1 |
used ymm registers | 3 |
used zmm registers | 12 |
nb stack references | 3 |
micro-operation queue | 12.83 cycles |
front end | 12.83 cycles |
Metric | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
uops | 14.50 | 4.40 | 4.33 | 4.33 | 4.00 | 14.50 | 4.40 | 4.00 | 4.00 | 4.00 | 4.20 | 4.33 |
cycles | 14.50 | 4.40 | 4.33 | 4.33 | 4.00 | 14.50 | 4.40 | 4.00 | 4.00 | 4.00 | 4.20 | 4.33 |
Cycles executing div or sqrt instructions | NA |
Longest recurrence chain latency (RecMII) | 0.00 |
FE+BE cycles | 15.82-23.47 |
Stall cycles | 4.57-12.13 |
RS full (events) | 11.01-0.09 |
Front-end | 12.83 |
Dispatch | 14.50 |
Data deps. | 0.00 |
Overall L1 | 14.50 |
Vectorization ratios (INT)
all | 54% |
load | 0% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 100% |
add-sub | 84% |
fma | NA (no fma vectorizable/vectorized instructions) |
other | 30% |
Vectorization ratios (FP)
all | 100% |
load | 100% |
store | 100% |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 100% |
Vectorization ratios (INT+FP)
all | 58% |
load | 25% |
store | 100% |
mul | 100% |
add-sub | 84% |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 40% |
Vector-efficiency ratios (INT)
all | 52% |
load | 10% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 100% |
add-sub | 78% |
fma | NA (no fma vectorizable/vectorized instructions) |
other | 28% |
Vector-efficiency ratios (FP)
all | 100% |
load | 100% |
store | 100% |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 100% |
Vector-efficiency ratios (INT+FP)
all | 56% |
load | 32% |
store | 100% |
mul | 100% |
add-sub | 78% |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 37% |
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | Latency | Recip. throughput |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SUB %R8,%R14 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
ADD $0x2,%R14 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
VPBROADCASTQ %R14,%ZMM10 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPBROADCASTQ %RBX,%ZMM9 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPADDQ %ZMM2,%ZMM9,%ZMM9 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
XOR %R10D,%R10D | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
MOV 0x98(%RBP),%R15 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
VPBROADCASTQ %R10,%ZMM11 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPSUBQ %ZMM11,%ZMM0,%ZMM12 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.50 |
VPCMPNLEUQ %ZMM2,%ZMM12,%K1 | |||||||||||||||
VPBROADCASTD %R10D,%YMM12 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPSUBD %YMM12,%YMM8,%YMM8 | 1 | 0.33 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.33 |
VPADDD %YMM3,%YMM8,%YMM8 | 1 | 0.33 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.33 |
VPMOVSXDQ %YMM8,%ZMM8 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPSUBQ %ZMM1,%ZMM8,%ZMM8 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.50 |
VPMULLQ %ZMM8,%ZMM7,%ZMM8 | 5 | 1.50 | 0 | 0 | 0 | 0 | 1.50 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 1.50 |
VPSLLQ $0x3,%ZMM10,%ZMM10 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2-4 | 1 |
VPADDQ %ZMM10,%ZMM4,%ZMM10 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
VPADDQ %ZMM8,%ZMM10,%ZMM8 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
VPXOR %XMM12,%XMM12,%XMM12 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
KMOVQ %K1,%K2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
VGATHERQPD (,%ZMM8,1),%ZMM12{%K2} | 5 | 1 | 0 | 2.67 | 2.67 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2.67 | 0-29 | 2.67 |
VMOVAPD %ZMM12,%ZMM6{%K1} | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.17 |
VPADDQ %ZMM11,%ZMM9,%ZMM8 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
VPSUBQ %ZMM1,%ZMM8,%ZMM8 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.50 |
VPSUBQ %ZMM5,%ZMM8,%ZMM8 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.50 |
VPMULLQ %ZMM8,%ZMM7,%ZMM7 | 5 | 1.50 | 0 | 0 | 0 | 0 | 1.50 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 1.50 |
VPADDQ %ZMM7,%ZMM10,%ZMM7 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
VSCATTERQPD %ZMM6,(,%ZMM7,1){%K1} | 20 | 2.20 | 0.20 | 0 | 0 | 4 | 0.20 | 0.20 | 4 | 4 | 4 | 0.20 | 0 | 2-12 | 7 |
JMP 451cf0 <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x3eb0> | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.84 |
LEA 0x1(%RSI),%R10D | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
CMP %ECX,%ESI | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
MOV %R10D,%ESI | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
JE 44e35f <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x51f> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
TEST %EDX,%EDX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 2 | 0.20 |
JLE 451cf0 <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x3eb0> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
LEA (%RAX,%RSI,1),%R14D | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
MOV 0xd8(%RBP),%R10 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
MOV (%R10),%R15 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
MOV 0xa8(%RBP),%R10 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
MOV (%R10),%R11D | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
MOVSXD %R11D,%RBX | 1 | 0 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0.33 | 0 | 1 | 0.33 |
MOV %RDX,%R10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
VPBROADCASTD %EBX,%YMM8 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPBROADCASTQ %R15,%ZMM7 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
MOVSXD %R14D,%R14 | 1 | 0 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0.33 | 0 | 1 | 0.33 |
AND %RDI,%R10 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
JE 451c40 <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x3e00> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
Function | update_halo_kernel_.DIR.OMP.PARALLEL.2 |
Source file and lines | update_halo_kernel.f90:99-158 |
Module | exec |
nb instructions | 30 |
nb uops | 30 |
loop length | 133 |
used x86 registers | 12 |
used mmx registers | 0 |
used xmm registers | 0 |
used ymm registers | 1 |
used zmm registers | 6 |
nb stack references | 3 |
micro-operation queue | 5.00 cycles |
front end | 5.00 cycles |
Metric | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
uops | 4.00 | 4.00 | 1.67 | 1.67 | 0.00 | 4.00 | 4.00 | 0.00 | 0.00 | 0.00 | 4.00 | 1.67 |
cycles | 4.00 | 4.00 | 1.67 | 1.67 | 0.00 | 4.00 | 4.00 | 0.00 | 0.00 | 0.00 | 4.00 | 1.67 |
Cycles executing div or sqrt instructions | NA |
Longest recurrence chain latency (RecMII) | 0.00 |
FE+BE cycles | 5.17-5.19 |
Stall cycles | 0.00 |
Front-end | 5.00 |
Dispatch | 4.00 |
Data deps. | 0.00 |
Overall L1 | 5.00 |
Vectorization ratios
all | 18% |
load | 0% |
store | NA (no store vectorizable/vectorized instructions) |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | 66% |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 10% |
Vector-efficiency ratios
all | 26% |
load | 10% |
store | NA (no store vectorizable/vectorized instructions) |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | 70% |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 18% |
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | Latency | Recip. throughput |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LEA 0x1(%RSI),%R10D | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
CMP %ECX,%ESI | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
MOV %R10D,%ESI | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
JE 44e35f <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x51f> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
TEST %EDX,%EDX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 2 | 0.20 |
JLE 451cf0 <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x3eb0> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
LEA (%RAX,%RSI,1),%R14D | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
MOV 0xd8(%RBP),%R10 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
MOV (%R10),%R15 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
MOV 0xa8(%RBP),%R10 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
MOV (%R10),%R11D | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
MOVSXD %R11D,%RBX | 1 | 0 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0.33 | 0 | 1 | 0.33 |
MOV %RDX,%R10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
VPBROADCASTD %EBX,%YMM8 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPBROADCASTQ %R15,%ZMM7 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
MOVSXD %R14D,%R14 | 1 | 0 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0.33 | 0 | 1 | 0.33 |
AND %RDI,%R10 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
JE 451c40 <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x3e00> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
SUB %R9,%R14 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
VPBROADCASTQ %R14,%ZMM10 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPSLLQ $0x3,%ZMM10,%ZMM9 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2-4 | 1 |
VPADDQ %ZMM9,%ZMM4,%ZMM11 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
VPBROADCASTQ %RBX,%ZMM9 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPADDQ %ZMM2,%ZMM9,%ZMM9 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
XOR %EBX,%EBX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
NOPL (%RAX) | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
CMP %RDX,%R10 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
MOV 0x98(%RBP),%R15 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
JNE 451c63 <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x3e23> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
JMP 451cf0 <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x3eb0> | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.08 |
Function | update_halo_kernel_.DIR.OMP.PARALLEL.2 |
Source file and lines | update_halo_kernel.f90:99-158 |
Module | exec |
nb instructions | 52 |
nb uops | 82 |
loop length | 269 |
used x86 registers | 12 |
used mmx registers | 0 |
used xmm registers | 1 |
used ymm registers | 3 |
used zmm registers | 12 |
nb stack references | 3 |
micro-operation queue | 13.67 cycles |
front end | 13.67 cycles |
Metric | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
uops | 15.50 | 5.40 | 4.33 | 4.33 | 4.00 | 15.50 | 5.40 | 4.00 | 4.00 | 4.00 | 5.20 | 4.33 |
cycles | 15.50 | 5.40 | 4.33 | 4.33 | 4.00 | 15.50 | 5.40 | 4.00 | 4.00 | 4.00 | 5.20 | 4.33 |
Cycles executing div or sqrt instructions | NA |
Longest recurrence chain latency (RecMII) | 0.00 |
FE+BE cycles | 16.58-24.19 |
Stall cycles | 4.51-12.03 |
RS full (events) | 9.56-0.01 |
Front-end | 13.67 |
Dispatch | 15.50 |
Data deps. | 0.00 |
Overall L1 | 15.50 |
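In each per-path summary of this report, the "Overall L1" figure coincides with the largest of the "Front-end", "Dispatch" and "Data deps." figures, and the averaged summary (9.00) coincides with the mean of the four per-path values. The short Python check below reproduces that reading from the numbers shown. It is an observation drawn from this report only, and the function name is hypothetical.

```python
# (front_end, dispatch, data_deps, reported_overall_l1) for the four
# per-path summaries of this report, in order of appearance.
summaries = [
    (0.83, 1.00, 0.00, 1.00),
    (12.83, 14.50, 0.00, 14.50),
    (5.00, 4.00, 0.00, 5.00),
    (13.67, 15.50, 0.00, 15.50),
]


def overall_l1(front_end, dispatch, data_deps):
    """Cycle estimate taken as the tightest of the three lower bounds."""
    return max(front_end, dispatch, data_deps)


estimates = [overall_l1(fe, disp, deps) for fe, disp, deps, _ in summaries]
average_estimate = sum(estimates) / len(estimates)

for (fe, disp, deps, reported), est in zip(summaries, estimates):
    limiter = "front-end" if est == fe else "dispatch"
    print(f"estimate {est:.2f} (reported {reported:.2f}), limited by {limiter}")
print(f"average {average_estimate:.2f}")
```

Under this reading, three of the four paths are bound by dispatch, that is, by the busiest execution port, and one path is bound by the front end, which agrees with its "micro-operation queue" bottleneck entry.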
Vectorization ratios (INT)
all | 55% |
load | 0% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 100% |
add-sub | 92% |
fma | NA (no fma vectorizable/vectorized instructions) |
other | 31% |
all | 100% |
load | 100% |
store | 100% |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 100% |
all | 59% |
load | 25% |
store | 100% |
mul | 100% |
add-sub | 92% |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 38% |
all | 53% |
load | 10% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 100% |
add-sub | 85% |
fma | NA (no fma vectorizable/vectorized instructions) |
other | 30% |
all | 100% |
load | 100% |
store | 100% |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 100% |
all | 57% |
load | 32% |
store | 100% |
mul | 100% |
add-sub | 85% |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 37% |
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | Latency | Recip. throughput |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
VPBROADCASTQ %R10,%ZMM11 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPSUBQ %ZMM11,%ZMM0,%ZMM12 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.50 |
VPCMPNLEUQ %ZMM2,%ZMM12,%K1 | | | | | | | | | | | | | | | |
VPBROADCASTD %R10D,%YMM12 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPSUBD %YMM12,%YMM8,%YMM8 | 1 | 0.33 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.33 |
VPADDD %YMM3,%YMM8,%YMM8 | 1 | 0.33 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.33 |
VPMOVSXDQ %YMM8,%ZMM8 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPSUBQ %ZMM1,%ZMM8,%ZMM8 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.50 |
VPMULLQ %ZMM8,%ZMM7,%ZMM8 | 5 | 1.50 | 0 | 0 | 0 | 0 | 1.50 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 1.50 |
VPSLLQ $0x3,%ZMM10,%ZMM10 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2-4 | 1 |
VPADDQ %ZMM10,%ZMM4,%ZMM10 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
VPADDQ %ZMM8,%ZMM10,%ZMM8 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
VPXOR %XMM12,%XMM12,%XMM12 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
KMOVQ %K1,%K2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
VGATHERQPD (,%ZMM8,1),%ZMM12{%K2} | 5 | 1 | 0 | 2.67 | 2.67 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2.67 | 0-29 | 2.67 |
VMOVAPD %ZMM12,%ZMM6{%K1} | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.17 |
VPADDQ %ZMM11,%ZMM9,%ZMM8 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
VPSUBQ %ZMM1,%ZMM8,%ZMM8 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.50 |
VPSUBQ %ZMM5,%ZMM8,%ZMM8 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.50 |
VPMULLQ %ZMM8,%ZMM7,%ZMM7 | 5 | 1.50 | 0 | 0 | 0 | 0 | 1.50 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 1.50 |
VPADDQ %ZMM7,%ZMM10,%ZMM7 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
VSCATTERQPD %ZMM6,(,%ZMM7,1){%K1} | 20 | 2.20 | 0.20 | 0 | 0 | 4 | 0.20 | 0.20 | 4 | 4 | 4 | 0.20 | 0 | 2-12 | 7 |
JMP 451cf0 <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x3eb0> | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.84 |
LEA 0x1(%RSI),%R10D | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
CMP %ECX,%ESI | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
MOV %R10D,%ESI | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
JE 44e35f <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x51f> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
TEST %EDX,%EDX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 2 | 0.20 |
JLE 451cf0 <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x3eb0> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
LEA (%RAX,%RSI,1),%R14D | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
MOV 0xd8(%RBP),%R10 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
MOV (%R10),%R15 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
MOV 0xa8(%RBP),%R10 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
MOV (%R10),%R11D | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
MOVSXD %R11D,%RBX | 1 | 0 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0.33 | 0 | 1 | 0.33 |
MOV %RDX,%R10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
VPBROADCASTD %EBX,%YMM8 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPBROADCASTQ %R15,%ZMM7 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
MOVSXD %R14D,%R14 | 1 | 0 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0.33 | 0 | 1 | 0.33 |
AND %RDI,%R10 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
JE 451c40 <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x3e00> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
SUB %R9,%R14 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
VPBROADCASTQ %R14,%ZMM10 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPSLLQ $0x3,%ZMM10,%ZMM9 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2-4 | 1 |
VPADDQ %ZMM9,%ZMM4,%ZMM11 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
VPBROADCASTQ %RBX,%ZMM9 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VPADDQ %ZMM2,%ZMM9,%ZMM9 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
XOR %EBX,%EBX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
NOPL (%RAX) | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
CMP %RDX,%R10 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
MOV 0x98(%RBP),%R15 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
JNE 451c63 <update_halo_kernel_module_mp_update_halo_kernel_.DIR.OMP.PARALLEL.2+0x3e23> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |