Loop Id: 1024 | Module: exec | Source: csr_matvec.c:307-314 | Coverage: 0.99% |
---|
(1023) 0x4aa760 STR D0, [X13, X20,LSL #3] |
(1023) 0x4aa764 ORR X20, XZR, X0 |
(1023) 0x4aa768 CMP X0, X19 |
(1023) 0x4aa76c B.EQ 4ab304 |
(1023) 0x4aa770 ADD X0, X20, #1 |
(1023) 0x4aa774 ORR X4, XZR, X16 |
(1023) 0x4aa778 LDR D0, [X8, X20,LSL #3] |
(1023) 0x4aa77c LDR X16, [X9, X0,LSL #3] |
(1023) 0x4aa780 SUBS X2, X16, X4 |
(1023) 0x4aa784 B.LE 4aa760 |
(1023) 0x4aa788 CMP X2, #2 |
(1023) 0x4aa78c B.CS 4aa7a0 |
(1023) 0x4aa790 ORR X1, XZR, X4 |
(1023) 0x4aa794 B 4aa7f4 |
0x4aa7a0 AND X3, X2, #8127 |
0x4aa7a4 UBFM X5, X4, #61, #60 |
0x4aa7a8 MOVI D1, #0 |
0x4aa7ac ADD X1, X4, X3 |
0x4aa7b0 ADD X4, X14, X5 |
0x4aa7b4 ADD X5, X15, X5 |
0x4aa7b8 ORR X6, XZR, X3 |
0x4aa7bc HINT #0 |
(1022) 0x4aa7c0 LDP X7, X21, [X4, #1016] |
(1022) 0x4aa7c4 LDP D2, D3, [X5, #1016] |
(1022) 0x4aa7c8 SUBS X6, X6, #2 |
(1022) 0x4aa7cc ADD X4, X4, #16 |
(1022) 0x4aa7d0 ADD X5, X5, #16 |
(1022) 0x4aa7d4 LDR D4, [X11, X7,LSL #3] |
(1022) 0x4aa7d8 LDR D5, [X11, X21,LSL #3] |
(1022) 0x4aa7dc FMSUB D0, D4, D2, D0 |
(1022) 0x4aa7e0 FMSUB D1, D5, D3, D1 |
(1022) 0x4aa7e4 B.NE 4aa7c0 |
0x4aa7e8 FADD D0, D1, D0 |
0x4aa7ec CMP X2, X3 |
0x4aa7f0 B.EQ 4aa760 |
(1023) 0x4aa7f4 SUB W2, W16, W1 |
(1023) 0x4aa7f8 ANDS X3, X2, #4160 |
(1023) 0x4aa7fc B.EQ 4aa834 |
(1023) 0x4aa800 ORR X2, XZR, X1 |
(1026) 0x4aa804 LDR X4, [X12, X2,LSL #3] |
(1026) 0x4aa808 LDR D1, [X10, X2,LSL #3] |
(1026) 0x4aa80c ADD X2, X2, #1 |
(1026) 0x4aa810 SUBS X3, X3, #1 |
(1026) 0x4aa814 LDR D2, [X11, X4,LSL #3] |
(1026) 0x4aa818 FMSUB D0, D2, D1, D0 |
(1026) 0x4aa81c B.NE 4aa804 |
(1023) 0x4aa820 ORN X1, XZR, X1 |
(1023) 0x4aa824 ADD X1, X16, X1 |
(1023) 0x4aa828 CMP X1, #3 |
(1023) 0x4aa82c B.CC 4aa760 |
(1023) 0x4aa830 B 4aa848 |
(1023) 0x4aa834 ORR X2, XZR, X1 |
(1023) 0x4aa838 ORN X1, XZR, X1 |
(1023) 0x4aa83c ADD X1, X16, X1 |
(1023) 0x4aa840 CMP X1, #3 |
(1023) 0x4aa844 B.CC 4aa760 |
(1023) 0x4aa848 UBFM X3, X2, #61, #60 |
(1023) 0x4aa84c SUB X1, X16, X2 |
(1023) 0x4aa850 ADD X2, X17, X3 |
(1023) 0x4aa854 ADD X3, X18, X3 |
(1023) 0x4aa858 HINT #0 |
(1023) 0x4aa85c HINT #0 |
(1025) 0x4aa860 LDP X4, X5, [X3, #1008] |
(1025) 0x4aa864 LDP D2, D3, [X2, #1008] |
(1025) 0x4aa868 SUBS X1, X1, #4 |
(1025) 0x4aa86c LDR D1, [X11, X4,LSL #3] |
(1025) 0x4aa870 FMUL D1, D1, D2 |
(1025) 0x4aa874 LDR D2, [X11, X5,LSL #3] |
(1025) 0x4aa878 FMADD D1, D2, D3, D1 |
(1025) 0x4aa87c LDP X4, X5, [X3], #32 |
(1025) 0x4aa880 LDR D4, [X11, X4,LSL #3] |
(1025) 0x4aa884 LDP D2, D3, [X2], #32 |
(1025) 0x4aa888 FMADD D1, D4, D2, D1 |
(1025) 0x4aa88c LDR D2, [X11, X5,LSL #3] |
(1025) 0x4aa890 FMADD D1, D2, D3, D1 |
(1025) 0x4aa894 FSUB D0, D0, D1 |
(1025) 0x4aa898 B.NE 4aa860 |
(1023) 0x4aa89c B 4aa760 |
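The block tagged (1022) in the listing above reads as a two-way unrolled form of the inner product loop, with two independent accumulators (D0 and D1) that are merged by the `FADD` at 0x4aa7e8. A minimal C sketch of that shape follows. It is an illustration, not the compiler's exact output: the function name `row_residual_unrolled2` is hypothetical, `long` indices are assumed because the index loads are 64-bit, and the `AND` at 0x4aa7a0 is assumed to round the trip count down to an even number (its immediate is printed in encoded form).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of csr_matvec.c: computes
 * b_i - sum over jj in [begin, end) of A_data[jj] * x_data[A_j[jj]]
 * using the two-accumulator shape seen in the block tagged (1022). */
double row_residual_unrolled2(double b_i, long begin, long end,
                              const double *A_data, const long *A_j,
                              const double *x_data)
{
    assert(A_data != NULL && A_j != NULL && x_data != NULL);

    double acc0 = b_i;   /* role of D0 */
    double acc1 = 0.0;   /* role of D1, zeroed by MOVI D1, #0 */
    long n = end - begin;
    long jj = begin;

    if (n >= 2) {
        /* Assumed meaning of the AND: even part of the trip count. */
        long even_end = begin + (n & ~1L);
        for (; jj < even_end; jj += 2) {
            acc0 -= A_data[jj]     * x_data[A_j[jj]];      /* FMSUB into D0 */
            acc1 -= A_data[jj + 1] * x_data[A_j[jj + 1]];  /* FMSUB into D1 */
        }
        acc0 += acc1;    /* FADD D0, D1, D0 */
    }

    /* Scalar remainder: runs at most once after the unrolled part,
     * or handles a row with a single nonzero. */
    for (; jj < end; jj++) {
        acc0 -= A_data[jj] * x_data[A_j[jj]];
    }
    return acc0;
}
```

Splitting the sum across two accumulators shortens the dependency chain on the subtraction, at the cost of a different floating-point summation order than the source loop.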
/home/hbollore/qaas/qaas-runs/169-817-3176/intel/AMG/build/AMG/AMG/seq_mv/csr_matvec.c: 307 - 314 |
-------------------------------------------------------------------------------- |
307: for (i = iBegin; i < iEnd; i++) |
308: { |
309: tempx = b_data[i]; |
310: for (jj = A_i[i]; jj < A_i[i+1]; jj++) |
311: { |
312: tempx -= A_data[jj] * x_data[A_j[jj]]; |
313: } |
314: y_data[i] = tempx; |
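For readers who want to run the kernel outside of AMG, the excerpt above can be wrapped as a standalone function. This is a sketch under assumptions: the array names come from the listing, while the function name `csr_residual_rows`, the parameter order, and the use of `long` for indices are choices made for this example and are not taken from csr_matvec.c.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the rows [iBegin, iEnd) of y = b - A*x for a CSR matrix.
 * A_i holds row pointers, A_j column indices, A_data nonzero values.
 * Function name and index type are assumptions for illustration. */
void csr_residual_rows(long iBegin, long iEnd,
                       const long *A_i, const long *A_j,
                       const double *A_data,
                       const double *x_data, const double *b_data,
                       double *y_data)
{
    assert(A_i != NULL && A_j != NULL && A_data != NULL);
    assert(x_data != NULL && b_data != NULL && y_data != NULL);

    for (long i = iBegin; i < iEnd; i++) {
        double tempx = b_data[i];
        /* Indirect access x_data[A_j[jj]] is what keeps this loop scalar. */
        for (long jj = A_i[i]; jj < A_i[i + 1]; jj++) {
            tempx -= A_data[jj] * x_data[A_j[jj]];
        }
        y_data[i] = tempx;
    }
}
```

As a usage example, the 2x2 matrix with rows (2, 0) and (1, 3) is stored as `A_i = {0, 1, 3}`, `A_j = {0, 0, 1}`, `A_data = {2, 1, 3}`; with `x = {1, 2}` and `b = {5, 10}` both entries of `y` come out as 3.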
Coverage (%) | Name | Source Location | Module |
---|---|---|---|
100.00 | __kmp_invoke_microtask | | libomp.so |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.00 |
CQA speedup if FP arith vectorized | 1.00 |
CQA speedup if fully vectorized | 4.00 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 1.40 |
Bottlenecks | P2, P3, P4, P5 |
Function | .omp_outlined..18#0x4a9fb0 |
Source | csr_matvec.c:310-310 |
Source loop unroll info | NA |
Source loop unroll confidence level | NA |
Unroll/vectorization loop type | NA |
Unroll factor | NA |
CQA cycles | 1.75 |
CQA cycles if no scalar integer | 1.75 |
CQA cycles if FP arith vectorized | 1.75 |
CQA cycles if fully vectorized | 0.44 |
Front-end cycles | 1.25 |
DIV/SQRT cycles | NA |
P0 cycles | 0.50 |
P1 cycles | 0.50 |
P2 cycles | 1.75 |
P3 cycles | 1.75 |
P4 cycles | 1.75 |
P5 cycles | 1.75 |
P6 cycles | 0.50 |
P7 cycles | 0.50 |
P8 cycles | 0.50 |
P9 cycles | 0.50 |
P10 cycles | 0.00 |
P11 cycles | 0.00 |
P12 cycles | 0.00 |
P13 cycles | 0.00 |
P14 cycles | 0.00 |
Inter-iter dependencies cycles | NA |
FE+BE cycles (UFS) | NA |
Stall cycles (UFS) | NA |
Nb insns | 11.00 |
Nb uops | 10.00 |
Nb loads | NA |
Nb stores | 0.00 |
Nb stack references | 0.00 |
FLOP/cycle | 0.57 |
Nb FLOP add-sub | 1.00 |
Nb FLOP mul | 0.00 |
Nb FLOP fma | 0.00 |
Nb FLOP div | 0.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 0.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 0.00 |
Bytes prefetched | 0.00 |
Bytes loaded | 0.00 |
Bytes stored | 0.00 |
Stride 0 | NA |
Stride 1 | NA |
Stride n | NA |
Stride unknown | NA |
Stride indirect | NA |
Vectorization ratio all | 0.00 |
Vectorization ratio load | NA |
Vectorization ratio store | NA |
Vectorization ratio mul | NA |
Vectorization ratio add_sub | 0.00 |
Vectorization ratio fma | NA |
Vectorization ratio div_sqrt | NA |
Vectorization ratio other | 0.00 |
Vector-efficiency ratio all | 25.00 |
Vector-efficiency ratio load | NA |
Vector-efficiency ratio store | NA |
Vector-efficiency ratio mul | NA |
Vector-efficiency ratio add_sub | 25.00 |
Vector-efficiency ratio fma | NA |
Vector-efficiency ratio div_sqrt | NA |
Vector-efficiency ratio other | 25.00 |
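The derived figures in the table above are consistent with simple ratios of the reported cycle counts. The arithmetic below is a reading of the numbers, not a statement of how CQA computes them internally; it assumes the loop cost is the larger of the front-end bound and the busiest port, and that the reported 0.44 is 0.4375 after rounding.

```latex
\begin{align*}
\text{CQA cycles} &= \max\bigl(\text{front-end},\ \max_p \text{P}_p\bigr)
                   = \max(1.25,\ 1.75) = 1.75 \\
\text{speedup if fully vectorized} &= \frac{1.75}{0.4375} = 4.00 \\
\text{speedup if next bottleneck killed} &= \frac{1.75}{1.25} = 1.40 \\
\text{FLOP/cycle} &= \frac{1\ \text{add-sub}}{1.75\ \text{cycles}} \approx 0.57
\end{align*}
```

Under this reading, removing the port bottleneck leaves the front end (1.25 cycles) as the next limit, which is why that speedup is capped at 1.40.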
Function | .omp_outlined..18#0x4a9fb0 |
Source file and lines | csr_matvec.c:307-314 |
Module | exec |
nb instructions | 11 |
loop length | 44 |
nb stack references | 0 |
front end | 1.25 cycles |
Port | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | P12 | P13 | P14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
uops | 0.50 | 0.50 | 1.75 | 1.75 | 1.75 | 1.75 | 0.50 | 0.50 | 0.50 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
cycles | 0.50 | 0.50 | 1.75 | 1.75 | 1.75 | 1.75 | 0.50 | 0.50 | 0.50 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Cycles executing div or sqrt instructions | NA |
Front-end | 1.25 |
Overall L1 | 1.75 |
all | 0% |
load | NA (no load vectorizable/vectorized instructions) |
store | NA (no store vectorizable/vectorized instructions) |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | 0% |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 0% |
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | P12 | P13 | P14 | Latency | Recip. throughput |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AND X3, X2, #8127 | 1 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.25 |
UBFM X5, X4, #61, #60 | 1 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.25 |
MOVI D1, #0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 2 | 0.25 |
ADD X1, X4, X3 | 1 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.25 |
ADD X4, X14, X5 | 1 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.25 |
ADD X5, X15, X5 | 1 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.25 |
ORR X6, XZR, X3 | 1 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.25 |
HINT #0 | ||||||||||||||||||
FADD D0, D1, D0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 2 | 0.25 |
CMP X2, X3 | 1 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.33 |
B.EQ 4aa760 <.omp_outlined..18+0x7b0> | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 |
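The per-port totals in the uops row can be reproduced by summing the rows of the instruction table above. The sketch below does that under a simplifying assumption read off the table: each instruction belongs to one of three dispatch classes (branch on P0/P1 at 0.50 each, integer on P2 to P5 at 0.25 each, FP/SIMD on P6 to P9 at 0.25 each). The names `uop_class`, `port_pressure`, and `max_pressure` are hypothetical, and `HINT #0` is left out because its row carries no port data.

```c
#include <assert.h>
#include <stddef.h>

#define NPORTS 15   /* P0..P14, as in the table header */

/* Dispatch classes as they appear in the instruction table (assumption:
 * every instruction of a class spreads evenly over the same ports). */
enum uop_class { UOP_BRANCH, UOP_INT, UOP_FP };

/* Accumulate per-port pressure for a sequence of instructions. */
void port_pressure(const enum uop_class *insns, size_t n, double out[NPORTS])
{
    assert(insns != NULL && out != NULL);

    for (int p = 0; p < NPORTS; p++) {
        out[p] = 0.0;
    }
    for (size_t k = 0; k < n; k++) {
        switch (insns[k]) {
        case UOP_BRANCH:                       /* P0, P1 */
            out[0] += 0.50;
            out[1] += 0.50;
            break;
        case UOP_INT:                          /* P2..P5 */
            for (int p = 2; p <= 5; p++) out[p] += 0.25;
            break;
        case UOP_FP:                           /* P6..P9 */
            for (int p = 6; p <= 9; p++) out[p] += 0.25;
            break;
        }
    }
}

/* Busiest port: the back-end bound under this simple model. */
double max_pressure(const double ports[NPORTS])
{
    double m = ports[0];
    for (int p = 1; p < NPORTS; p++) {
        if (ports[p] > m) m = ports[p];
    }
    return m;
}
```

Feeding in the ten instructions of the table (seven integer, two FP, one branch) gives 1.75 on P2 to P5 and 0.50 on P0, P1, and P6 to P9, matching the uops row and the listed bottlenecks.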