OV - exec - Loop 3138

0x500470 LDR	X4, [X21, X1,LSL #3]    [1]
0x500474 UBFM	X28, X2, #61, #60
0x500478 LDR	D6, [X22, X1,LSL #3]    [2]
0x50047c UBFM	X23, X4, #61, #60
0x500480 LDR	X3, [X0, X23]    [5]
0x500484 FMUL	D7, D28, D6
0x500488 UBFM	X30, X3, #61, #60
0x50048c CMP	X20, X3
0x500490 B.GT	500644
0x500494 LDR	D16, [X19, X30]    [6]
0x500498 ADD	X1, X1, #1
0x50049c FADD	D17, D16, D7
0x5004a0 STR	D17, [X19, X30]    [6]
0x5004a4 CMP	X1, X14
0x5004a8 B.LT	500470

0x500644 STR	X2, [X0, X23]    [5]
0x500648 ADD	X1, X1, #1
0x50064c ADD	X2, X2, #1
0x500650 STR	D7, [X19, X28]    [6]
0x500654 STR	X4, [X24, X28]    [3]
0x500658 LDR	X14, [X15]    [4]
0x50065c CMP	X14, X1
0x500660 B.GT	500470
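The three UBFM instructions with immediates #61, #60 in the listing above are the A64 encoding of LSL Xd, Xn, #3, which scales an 8-byte element index to a byte offset. The sketch below models that behaviour in C; the helper names ubfm64 and index_to_byte_offset are hypothetical, and the function is a simplified reading of the 64-bit UBFM semantics rather than a full decoder.

```c
#include <stdint.h>

/* Simplified model of 64-bit UBFM (hypothetical helper, for illustration). */
static uint64_t ubfm64(uint64_t x, unsigned immr, unsigned imms)
{
    if (imms >= immr) {
        /* UBFX/LSR form: extract bits immr..imms into the low bits. */
        unsigned width = imms - immr + 1;
        uint64_t mask = (width == 64) ? ~UINT64_C(0)
                                      : ((UINT64_C(1) << width) - 1);
        return (x >> immr) & mask;
    }
    /* UBFIZ/LSL form: move the low imms+1 bits up to bit (64 - immr). */
    uint64_t mask = (UINT64_C(1) << (imms + 1)) - 1;
    return (x & mask) << (64 - immr);
}

/* UBFM Xd, Xn, #61, #60 yields the same value as index << 3. */
static uint64_t index_to_byte_offset(uint64_t index)
{
    return ubfm64(index, 61, 60);
}
```

Read this way, X23, X28 and X30 are byte offsets derived from the indices in X4, X2 and X3, and they feed the register-offset loads and stores that follow.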

/home/hbollore/qaas/qaas-runs/174-161-6712/intel/AMG/build/AMG/AMG/parcsr_mv/par_csr_matop.c: 946 - 965

--------------------------------------------------------------------------------

946:          for (jj3 = B_diag_i[i2]; jj3 < B_diag_i[i2+1]; jj3++)
947:          {
948:             i3 = B_diag_j[jj3];
[...]
956:             if (B_marker[i3] < jj_row_begin_diag)
957:             {
958:                B_marker[i3] = jj_count_diag;
959:                C_diag_data[jj_count_diag] = a_entry*B_diag_data[jj3];
960:                C_diag_j[jj_count_diag] = i3;
961:                jj_count_diag++;
962:             }
963:             else
964:             {
965:                C_diag_data[B_marker[i3]] += a_entry*B_diag_data[jj3];
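
For readers who want to experiment with the loop outside the application, here is a minimal self-contained sketch of the same marker-based accumulation. The function name accumulate_row, the use of long for the index type and the calling convention are assumptions made for illustration; only the array and variable names come from the excerpt above.

```c
/* Hypothetical stand-in for the inner loop at lines 946-965.
   Returns the updated jj_count_diag. */
static long accumulate_row(const long *B_diag_i, const long *B_diag_j,
                           const double *B_diag_data, long i2, double a_entry,
                           long *B_marker, long jj_row_begin_diag,
                           long jj_count_diag,
                           long *C_diag_j, double *C_diag_data)
{
    for (long jj3 = B_diag_i[i2]; jj3 < B_diag_i[i2 + 1]; jj3++) {
        long i3 = B_diag_j[jj3];
        double prod = a_entry * B_diag_data[jj3];
        if (B_marker[i3] < jj_row_begin_diag) {
            /* First hit on column i3 in this row: open a new entry. */
            B_marker[i3] = jj_count_diag;
            C_diag_data[jj_count_diag] = prod;
            C_diag_j[jj_count_diag] = i3;
            jj_count_diag++;
        } else {
            /* Column already present: accumulate into the existing entry. */
            C_diag_data[B_marker[i3]] += prod;
        }
    }
    return jj_count_diag;
}
```

Every access to B_marker and C_diag_data goes through an index loaded from memory, which matches the "Stride indirect" counts reported in the metrics, and two iterations may hit the same i3, so the iterations are not independent in general.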

Coverage (%)	Name	Source Location	Module
99.08	gomp_thread_start	team.c:130	libgomp.so.1.0.0
	start_thread		libc.so.6
	thread_start		libc.so.6



Metric	Value
CQA speedup if no scalar integer	1.87
CQA speedup if FP arith vectorized	1.09
CQA speedup if fully vectorized	2.00
CQA speedup if no inter-iteration dependency	NA
CQA speedup if next bottleneck killed	1.05
Bottlenecks
Function	hypre_ParMatmul._omp_fn.3
Source	par_csr_matop.c:946-948,par_csr_matop.c:956-965
Source loop unroll info	not unrolled or unrolled with no peel/tail loop
Source loop unroll confidence level	max
Unroll/vectorization loop type	NA
Unroll factor	NA
CQA cycles	2.19
CQA cycles if no scalar integer	1.17
CQA cycles if FP arith vectorized	2.00
CQA cycles if fully vectorized	1.09
Front-end cycles	2.00
P0 cycles	1.00
P1 cycles	1.00
P2 cycles	1.08
P3 cycles	1.08
P4 cycles	1.08
P5 cycles	1.08
P6 cycles	1.08
P7 cycles	1.08
P8 cycles	0.63
P9 cycles	0.63
P10 cycles	0.63
P11 cycles	0.63
P12 cycles	2.17
P13 cycles	1.83
P14 cycles	2.00
P15 cycles	0.50
P16 cycles	0.50
DIV/SQRT cycles	0.00
Inter-iter dependencies cycles	1
FE+BE cycles (UFS)	NA
Stall cycles (UFS)	NA
Nb insns	16.00
Nb uops	16.00
Nb loads	NA
Nb stores	2.00
Nb stack references	0.00
FLOP/cycle	0.69
Nb FLOP add-sub	0.50
Nb FLOP mul	1.00
Nb FLOP fma	0.00
Nb FLOP div	0.00
Nb FLOP rcp	0.00
Nb FLOP sqrt	0.00
Nb FLOP rsqrt	0.00
Bytes/cycle	21.87
Bytes prefetched	0.00
Bytes loaded	32.00
Bytes stored	16.00
Stride 0	0.50
Stride 1	0.00
Stride n	0.00
Stride unknown	0.00
Stride indirect	2.50
Vectorization ratio all	0.00
Vectorization ratio load	0.00
Vectorization ratio store	0.00
Vectorization ratio mul	0.00
Vectorization ratio add_sub	0.00
Vectorization ratio fma	NA
Vectorization ratio div_sqrt	NA
Vectorization ratio other	0.00
Vector-efficiency ratio all	50.00
Vector-efficiency ratio load	50.00
Vector-efficiency ratio store	50.00
Vector-efficiency ratio mul	50.00
Vector-efficiency ratio add_sub	50.00
Vector-efficiency ratio fma	NA
Vector-efficiency ratio div_sqrt	NA
Vector-efficiency ratio other	50.00

Metric	Value
CQA speedup if no scalar integer	1.61
CQA speedup if FP arith vectorized	1.02
CQA speedup if fully vectorized	2.00
CQA speedup if no inter-iteration dependency	NA
CQA speedup if next bottleneck killed	1.02
Bottlenecks	micro-operation queue
Function	hypre_ParMatmul._omp_fn.3
Source	par_csr_matop.c:946-948,par_csr_matop.c:956-965
Source loop unroll info	not unrolled or unrolled with no peel/tail loop
Source loop unroll confidence level	max
Unroll/vectorization loop type	NA
Unroll factor	NA
CQA cycles	1.88
CQA cycles if no scalar integer	1.17
CQA cycles if FP arith vectorized	1.83
CQA cycles if fully vectorized	0.94
Front-end cycles	1.88
P0 cycles	1.00
P1 cycles	1.00
P2 cycles	1.00
P3 cycles	1.00
P4 cycles	1.00
P5 cycles	1.00
P6 cycles	1.00
P7 cycles	1.00
P8 cycles	0.75
P9 cycles	0.75
P10 cycles	0.75
P11 cycles	0.75
P12 cycles	1.83
P13 cycles	1.50
P14 cycles	1.67
P15 cycles	0.00
P16 cycles	0.00
DIV/SQRT cycles	0.00
Inter-iter dependencies cycles	1
FE+BE cycles (UFS)	NA
Stall cycles (UFS)	NA
Nb insns	15.00
Nb uops	15.00
Nb loads	NA
Nb stores	1.00
Nb stack references	0.00
FLOP/cycle	1.07
Nb FLOP add-sub	1.00
Nb FLOP mul	1.00
Nb FLOP fma	0.00
Nb FLOP div	0.00
Nb FLOP rcp	0.00
Nb FLOP sqrt	0.00
Nb FLOP rsqrt	0.00
Bytes/cycle	21.33
Bytes prefetched	0.00
Bytes loaded	32.00
Bytes stored	8.00
Stride 0	0.00
Stride 1	0.00
Stride n	0.00
Stride unknown	0.00
Stride indirect	2.00
Vectorization ratio all	0.00
Vectorization ratio load	0.00
Vectorization ratio store	0.00
Vectorization ratio mul	0.00
Vectorization ratio add_sub	0.00
Vectorization ratio fma	NA
Vectorization ratio div_sqrt	NA
Vectorization ratio other	0.00
Vector-efficiency ratio all	50.00
Vector-efficiency ratio load	50.00
Vector-efficiency ratio store	50.00
Vector-efficiency ratio mul	50.00
Vector-efficiency ratio add_sub	50.00
Vector-efficiency ratio fma	NA
Vector-efficiency ratio div_sqrt	NA
Vector-efficiency ratio other	50.00

Metric	Value
CQA speedup if no scalar integer	2.14
CQA speedup if FP arith vectorized	1.15
CQA speedup if fully vectorized	2.00
CQA speedup if no inter-iteration dependency	NA
CQA speedup if next bottleneck killed	1.07
Bottlenecks	P12
Function	hypre_ParMatmul._omp_fn.3
Source	par_csr_matop.c:946-948,par_csr_matop.c:956-965
Source loop unroll info	not unrolled or unrolled with no peel/tail loop
Source loop unroll confidence level	max
Unroll/vectorization loop type	NA
Unroll factor	NA
CQA cycles	2.50
CQA cycles if no scalar integer	1.17
CQA cycles if FP arith vectorized	2.17
CQA cycles if fully vectorized	1.25
Front-end cycles	2.13
P0 cycles	1.00
P1 cycles	1.00
P2 cycles	1.17
P3 cycles	1.17
P4 cycles	1.17
P5 cycles	1.17
P6 cycles	1.17
P7 cycles	1.17
P8 cycles	0.50
P9 cycles	0.50
P10 cycles	0.50
P11 cycles	0.50
P12 cycles	2.50
P13 cycles	2.17
P14 cycles	2.33
P15 cycles	1.00
P16 cycles	1.00
DIV/SQRT cycles	0.00
Inter-iter dependencies cycles	1
FE+BE cycles (UFS)	NA
Stall cycles (UFS)	NA
Nb insns	17.00
Nb uops	17.00
Nb loads	NA
Nb stores	3.00
Nb stack references	0.00
FLOP/cycle	0.40
Nb FLOP add-sub	0.00
Nb FLOP mul	1.00
Nb FLOP fma	0.00
Nb FLOP div	0.00
Nb FLOP rcp	0.00
Nb FLOP sqrt	0.00
Nb FLOP rsqrt	0.00
Bytes/cycle	22.40
Bytes prefetched	0.00
Bytes loaded	32.00
Bytes stored	24.00
Stride 0	1.00
Stride 1	0.00
Stride n	0.00
Stride unknown	0.00
Stride indirect	3.00
Vectorization ratio all	0.00
Vectorization ratio load	0.00
Vectorization ratio store	0.00
Vectorization ratio mul	0.00
Vectorization ratio add_sub	0.00
Vectorization ratio fma	NA
Vectorization ratio div_sqrt	NA
Vectorization ratio other	0.00
Vector-efficiency ratio all	50.00
Vector-efficiency ratio load	50.00
Vector-efficiency ratio store	50.00
Vector-efficiency ratio mul	50.00
Vector-efficiency ratio add_sub	50.00
Vector-efficiency ratio fma	NA
Vector-efficiency ratio div_sqrt	NA
Vector-efficiency ratio other	50.00

Average path: a virtual path defined by the average values of all real paths.

Function	hypre_ParMatmul._omp_fn.3
Source file and lines	par_csr_matop.c:946-965
Module	exec

The loop is defined in /home/hbollore/qaas/qaas-runs/174-161-6712/intel/AMG/build/AMG/AMG/parcsr_mv/par_csr_matop.c:946-948,956-965.

The related source loop is not unrolled or unrolled with no peel/tail loop.
The structure of this loop is probably <if then [else] end>.

The presence of multiple execution paths is typically the main/first bottleneck.
Try to simplify control flow inside the loop: ideally, remove all conditional expressions, for example by (if applicable):

- hoisting them (moving them outside the loop)
- turning them into conditional moves, MIN or MAX

Example: if (x<0) x=0 => x = (x<0 ? 0 : x) (or MAX(0,x) after defining the corresponding macro)
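
The rewrite suggested in the example can be sketched in C as follows. The function names are hypothetical; whether the compiler actually emits a conditional select for the second form depends on the compiler and its options.

```c
/* Branchy form from the report's example. */
static long clamp_branchy(long x)
{
    if (x < 0)
        x = 0;
    return x;
}

/* Select form: a single expression that is easy to lower to a
   conditional select instead of a branch. */
static long clamp_select(long x)
{
    return (x < 0) ? 0 : x;
}
```

In the loop analysed here the two branches perform different stores (three stores on the if side, a load-add-store on the else side), so this pattern does not apply directly; it illustrates the generic hint only.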


Vectorization

Your loop is not vectorized. 2 data elements could be processed at once in vector registers. By vectorizing your loop, you can lower the cost of an iteration from 2.19 to 1.09 cycles (2.00x speedup).

Details

All VPU instructions are used in their scalar form (each processes only one data element per vector register). Since your execution units are vector units, only a vectorized loop can use their full power.

Workaround

Try another compiler, or update/tune your current one.
Remove inter-iteration dependences from your loop and make it unit-stride:
- If your arrays have 2 or more dimensions, check whether elements are accessed contiguously and, if not, try to permute loops accordingly. C storage order is row-major: for(i) for(j) a[j][i] = b[j][i]; (slow, non stride 1) => for(i) for(j) a[i][j] = b[i][j]; (fast, stride 1)
- If your loop streams arrays of structures (AoS), try to use structures of arrays (SoA) instead: for(i) a[i].x = b[i].x; (slow, non stride 1) => for(i) a.x[i] = b.x[i]; (fast, stride 1)
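
The AoS-to-SoA item can be sketched as below. The struct and function names are hypothetical and unrelated to the application's own data structures.

```c
#include <stddef.h>

/* Array of structures: consecutive x fields are 3 doubles apart. */
struct point_aos { double x, y, z; };

/* Structure of arrays: consecutive x values are adjacent in memory. */
struct points_soa { double *x, *y, *z; };

static void copy_x_aos(struct point_aos *a, const struct point_aos *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i].x = b[i].x;        /* non-unit stride */
}

static void copy_x_soa(struct points_soa *a, const struct points_soa *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a->x[i] = b->x[i];      /* unit stride */
}
```

Note that the arrays in the analysed loop (B_diag_j, B_diag_data, C_diag_j, C_diag_data) are already separate arrays; the non-unit accesses there come from indirect indexing, which a layout change of this kind does not remove.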

FMA

Presence of both ADD/SUB and MUL operations.

Workaround

Try to change the order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations, so that your compiler can generate FMA instructions wherever possible. For instance, a + b*c is a valid FMA (MUL then ADD), whereas (a+b)*c cannot be translated into an FMA (ADD then MUL).
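
The two expression shapes named in the workaround look like this in C. The function names are hypothetical, and whether a fused instruction is actually emitted depends on the compiler and its floating-point contraction settings.

```c
/* a + b*c: multiply then add, eligible for one fused multiply-add. */
static double mul_then_add(double a, double b, double c)
{
    return a + b * c;
}

/* (a+b)*c: add then multiply, cannot be fused as written. */
static double add_then_mul(double a, double b, double c)
{
    return (a + b) * c;
}
```

The source statement at line 965 already has the first shape. In the binary, the FMUL result in D7 is computed before the branch and is used by both paths (stored directly on one, added on the other), which is one plausible reason the multiply and add remain separate instructions.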

Matching between your loop (in the source code) and the binary loop

The binary loop is composed of 1.5 FP arithmetic operations:

0.50: addition or subtraction
1: multiply

The binary loop is loading 32 bytes. The binary loop is storing 16 bytes.

Arithmetic intensity

Arithmetic intensity is 0.03 FP operations per loaded or stored byte.
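
The figure follows from the counts reported just above (1.5 FP operations, 32 bytes loaded, 16 bytes stored). A small sketch of the ratio, with a hypothetical function name:

```c
/* FP operations per byte moved (loaded plus stored). */
static double arithmetic_intensity(double fp_ops, double bytes_loaded,
                                   double bytes_stored)
{
    return fp_ops / (bytes_loaded + bytes_stored);
}
```

With the average-path values, arithmetic_intensity(1.5, 32.0, 16.0) gives 1.5 / 48 = 0.03125, which the report rounds to 0.03.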

General properties

nb instructions	16
loop length	64
nb stack references	0

Front-end

front end

2.00 cycles

Back-end

	P0	P1	P2	P3	P4	P5	P6	P7	P8	P9	P10	P11	P12	P13	P14	P15	P16
uops	1.00	1.00	1.08	1.08	1.08	1.08	1.08	1.08	0.63	0.63	0.63	0.63	2.17	1.83	2.00	0.50	0.50
cycles	1.00	1.00	1.08	1.08	1.08	1.08	1.08	1.08	0.63	0.63	0.63	0.63	2.17	1.83	2.00	0.50	0.50

Execution ports to units layout:

P0: BRU
P1: BRU
P2: ALU
P3: ALU
P4: ALU
P5: ALU
P6: ALU
P7: ALU
P8 (128 bits): VPU, FP store data, ALU, DIV/SQRT
P9 (128 bits): VPU, ALU, FP store data
P10 (128 bits): VPU, ALU, DIV/SQRT
P11 (128 bits): ALU, VPU
P12 (256 bits): store address, load
P13 (256 bits): store address, load
P14 (256 bits): load
P15 (64 bits): store data
P16 (64 bits): store data

Cycles executing div or sqrt instructions	NA
Longest recurrence chain latency (RecMII)	1.00

Cycles summary

Front-end	2.00
Data deps.	1.00
Overall L1	2.19

Vectorization ratios

INT

all	0%
load	0%
store	0%
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	0%
fma	NA (no fma vectorizable/vectorized instructions)
other	0%

FP

all	0%
load	NA (no load vectorizable/vectorized instructions)
store	NA (no store vectorizable/vectorized instructions)
mul	0%
add-sub	0%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

INT+FP

all	0%
load	0%
store	0%
mul	0%
add-sub	0%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	0%

Cycles and memory resources usage

Assuming all data fit into the L1 cache, each iteration of the binary loop takes 2.19 cycles. At this rate:

15% of peak load performance is reached (14.93 out of 96.00 bytes loaded per cycle (GB/s @ 1GHz))
14% of peak store performance is reached (6.93 out of 48.00 bytes stored per cycle (GB/s @ 1GHz))

Function	hypre_ParMatmul._omp_fn.3
Source file and lines	par_csr_matop.c:946-965
Module	exec


Code clean check

Detected a slowdown caused by scalar integer instructions (typically used for address computation). By removing them, you can lower the cost of an iteration from 1.88 to 1.17 cycles (1.61x speedup).

Workaround

Try to reorganize arrays of structures into structures of arrays
Consider permuting loops (see vectorization gain report)

Vectorization

Your loop is not vectorized. 2 data elements could be processed at once in vector registers. By vectorizing your loop, you can lower the cost of an iteration from 1.88 to 0.94 cycles (2.00x speedup).


FMA

Presence of both ADD/SUB and MUL operations.

Workaround

Try to change the order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations, so that your compiler can generate FMA instructions wherever possible. For instance, a + b*c is a valid FMA (MUL then ADD), whereas (a+b)*c cannot be translated into an FMA (ADD then MUL).

Matching between your loop (in the source code) and the binary loop

The binary loop is composed of 2 FP arithmetic operations:

1: addition or subtraction
1: multiply

The binary loop is loading 32 bytes. The binary loop is storing 8 bytes.

Arithmetic intensity

Arithmetic intensity is 0.05 FP operations per loaded or stored byte.

General properties

nb instructions	15
loop length	60
nb stack references	0

Front-end

FIT IN UOP CACHE

front end

1.88 cycles

Back-end

	P0	P1	P2	P3	P4	P5	P6	P7	P8	P9	P10	P11	P12	P13	P14	P15	P16
uops	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	0.75	0.75	0.75	0.75	1.83	1.50	1.67	0.00	0.00
cycles	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	0.75	0.75	0.75	0.75	1.83	1.50	1.67	0.00	0.00


Cycles executing div or sqrt instructions	NA
Longest recurrence chain latency (RecMII)	1.00

Cycles summary

Front-end	1.88
Data deps.	1.00
Overall L1	1.88

Vectorization ratios

INT

all	0%
load	0%
store	0%
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	NA (no add-sub vectorizable/vectorized instructions)
fma	NA (no fma vectorizable/vectorized instructions)
other	0%

FP

all	0%
load	NA (no load vectorizable/vectorized instructions)
store	NA (no store vectorizable/vectorized instructions)
mul	0%
add-sub	0%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

INT+FP

all	0%
load	0%
store	0%
mul	0%
add-sub	0%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	0%

Cycles and memory resources usage

Assuming all data fit into the L1 cache, each iteration of the binary loop takes 1.88 cycles. At this rate:

17% of peak load performance is reached (17.07 out of 96.00 bytes loaded per cycle (GB/s @ 1GHz))
8% of peak store performance is reached (4.27 out of 48.00 bytes stored per cycle (GB/s @ 1GHz))

ASM code

In the binary file, the address of the loop is: 500470

Instruction	Nb FU	P0	P1	P2	P3	P4	P5	P6	P7	P8	P9	P10	P11	P12	P13	P14	Latency	Recip. throughput	Vectorization
LDR X4, [X21, X1,LSL #3]	1	0	0	0	0	0	0	0	0	0	0	0	0	0.33	0.33	0.33	4	0.33	scal (50.0%)
UBFM X28, X2, #61, #60	1	0	0	0.17	0.17	0.17	0.17	0.17	0.17	0	0	0	0	0	0	0	1	0.17	scal (50.0%)
LDR D6, [X22, X1,LSL #3]	1	0	0	0	0	0	0	0	0	0	0	0	0	0.33	0.33	0.33	6	0.33	scal (50.0%)
UBFM X23, X4, #61, #60	1	0	0	0.17	0.17	0.17	0.17	0.17	0.17	0	0	0	0	0	0	0	1	0.17	N/A
LDR X3, [X0, X23]	1	0	0	0	0	0	0	0	0	0	0	0	0	0.33	0.33	0.33	4	0.33	scal (50.0%)
FMUL D7, D28, D6	1	0	0	0	0	0	0	0	0	0.25	0.25	0.25	0.25	0	0	0	3	0.25	scal (50.0%)
UBFM X30, X3, #61, #60	1	0	0	0.17	0.17	0.17	0.17	0.17	0.17	0	0	0	0	0	0	0	1	0.17	N/A
CMP X20, X3	1	0	0	0.25	0.25	0	0	0.25	0.25	0	0	0	0	0	0	0	1	0.33	scal (50.0%)
B.GT 500644 <hypre_ParMatmul._omp_fn.3+0x384>	1	0.50	0.50	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0.50	N/A
LDR D16, [X19, X30]	1	0	0	0	0	0	0	0	0	0	0	0	0	0.33	0.33	0.33	6	0.33	scal (50.0%)
ADD X1, X1, #1	1	0	0	0.17	0.17	0.17	0.17	0.17	0.17	0	0	0	0	0	0	0	1	0.17	N/A
FADD D17, D16, D7	1	0	0	0	0	0	0	0	0	0.25	0.25	0.25	0.25	0	0	0	2	0.25	scal (50.0%)
STR D17, [X19, X30]	1	0	0	0	0	0	0	0	0	0.50	0.50	0	0	0.50	0.50	0	2	0.50	scal (50.0%)
CMP X1, X14	1	0	0	0.25	0.25	0	0	0.25	0.25	0	0	0	0	0	0	0	1	0.33	N/A
B.LT 500470 <hypre_ParMatmul._omp_fn.3+0x1b0>	1	0.50	0.50	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0.50	N/A

Function	hypre_ParMatmul._omp_fn.3
Source file and lines	par_csr_matop.c:946-965
Module	exec


Code clean check

Detected a slowdown caused by scalar integer instructions (typically used for address computation). By removing them, you can lower the cost of an iteration from 2.50 to 1.17 cycles (2.14x speedup).
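
The quoted speedup is the ratio of the two cycle counts in the sentence above. A minimal sketch, with a hypothetical function name:

```c
/* Speedup as the ratio of cycles per iteration before and after a change. */
static double speedup(double cycles_before, double cycles_after)
{
    return cycles_before / cycles_after;
}
```

For this path, speedup(2.50, 1.17) is about 2.137, reported as 2.14x. The same ratio applied to the vectorization estimate, speedup(2.50, 1.25), gives exactly 2.00.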

Workaround

Try to reorganize arrays of structures into structures of arrays
Consider permuting loops (see vectorization gain report)

Vectorization

Your loop is not vectorized. 2 data elements could be processed at once in vector registers. By vectorizing your loop, you can lower the cost of an iteration from 2.50 to 1.25 cycles (2.00x speedup).



Matching between your loop (in the source code) and the binary loop

The binary loop is composed of 1 FP arithmetic operation:

1: multiply

The binary loop is loading 32 bytes. The binary loop is storing 24 bytes.

Arithmetic intensity

Arithmetic intensity is 0.02 FP operations per loaded or stored byte.

General properties

nb instructions	17
loop length	68
nb stack references	0

Front-end

FIT IN UOP CACHE

front end

2.13 cycles

Back-end

	P0	P1	P2	P3	P4	P5	P6	P7	P8	P9	P10	P11	P12	P13	P14	P15	P16
uops	1.00	1.00	1.17	1.17	1.17	1.17	1.17	1.17	0.50	0.50	0.50	0.50	2.50	2.17	2.33	1.00	1.00
cycles	1.00	1.00	1.17	1.17	1.17	1.17	1.17	1.17	0.50	0.50	0.50	0.50	2.50	2.17	2.33	1.00	1.00


Cycles executing div or sqrt instructions	NA
Longest recurrence chain latency (RecMII)	1.00

Cycles summary

Front-end	2.13
Data deps.	1.00
Overall L1	2.50
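
For the two real paths, the "Overall L1" value equals the largest of the front-end cycles, the busiest back-end port and the data-dependency cycles. The sketch below encodes that observation; it is an assumption consistent with the numbers in this report, not a documented CQA formula, and the function name is hypothetical.

```c
/* Largest of three per-iteration cycle bounds (assumed model). */
static double overall_l1_estimate(double front_end, double busiest_port,
                                  double data_deps)
{
    double m = front_end;
    if (busiest_port > m)
        m = busiest_port;
    if (data_deps > m)
        m = data_deps;
    return m;
}
```

With this path's values (front end 2.13, P12 2.50, data dependencies 1.00) the estimate is 2.50; with the other real path's values (1.88, 1.83, 1.00) it is 1.88. The average path's 2.19 is the mean of those two results rather than a maximum of its own averaged inputs.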

Vectorization ratios

INT

all	0%
load	0%
store	0%
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	0%
fma	NA (no fma vectorizable/vectorized instructions)
other	0%

FP

all	0%
load	NA (no load vectorizable/vectorized instructions)
store	NA (no store vectorizable/vectorized instructions)
mul	0%
add-sub	NA (no add-sub vectorizable/vectorized instructions)
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

INT+FP

all	0%
load	0%
store	0%
mul	0%
add-sub	0%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	0%

Cycles and memory resources usage

Assuming all data fit into the L1 cache, each iteration of the binary loop takes 2.50 cycles. At this rate:

13% of peak load performance is reached (12.80 out of 96.00 bytes loaded per cycle (GB/s @ 1GHz))
20% of peak store performance is reached (9.60 out of 48.00 bytes stored per cycle (GB/s @ 1GHz))
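
These rates can be reproduced from the byte counts and the cycle count of this path. A minimal sketch with hypothetical function names; the peak values of 96 and 48 bytes per cycle are taken from the lines above.

```c
/* Bytes moved per cycle for one loop iteration. */
static double bytes_per_cycle(double bytes, double cycles)
{
    return bytes / cycles;
}

/* Achieved rate as a percentage of the peak rate. */
static double percent_of_peak(double achieved, double peak)
{
    return 100.0 * achieved / peak;
}
```

Here bytes_per_cycle(32, 2.50) is 12.80 and bytes_per_cycle(24, 2.50) is 9.60, that is about 13.3% of the 96-byte load peak and 20% of the 48-byte store peak. The percentages printed in the report appear to be truncated rather than rounded.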

ASM code

In the binary file, the address of the loop is: 500470

Instruction	Nb FU	P0	P1	P2	P3	P4	P5	P6	P7	P8	P9	P10	P11	P12	P13	P14	P15	P16	Latency	Recip. throughput	Vectorization
LDR X4, [X21, X1,LSL #3]	1	0	0	0	0	0	0	0	0	0	0	0	0	0.33	0.33	0.33	0	0	4	0.33	scal (50.0%)
UBFM X28, X2, #61, #60	1	0	0	0.17	0.17	0.17	0.17	0.17	0.17	0	0	0	0	0	0	0	0	0	1	0.17	N/A
LDR D6, [X22, X1,LSL #3]	1	0	0	0	0	0	0	0	0	0	0	0	0	0.33	0.33	0.33	0	0	6	0.33	scal (50.0%)
UBFM X23, X4, #61, #60	1	0	0	0.17	0.17	0.17	0.17	0.17	0.17	0	0	0	0	0	0	0	0	0	1	0.17	N/A
LDR X3, [X0, X23]	1	0	0	0	0	0	0	0	0	0	0	0	0	0.33	0.33	0.33	0	0	4	0.33	scal (50.0%)
FMUL D7, D28, D6	1	0	0	0	0	0	0	0	0	0.25	0.25	0.25	0.25	0	0	0	0	0	3	0.25	scal (50.0%)
UBFM X30, X3, #61, #60	1	0	0	0.17	0.17	0.17	0.17	0.17	0.17	0	0	0	0	0	0	0	0	0	1	0.17	scal (50.0%)
CMP X20, X3	1	0	0	0.25	0.25	0	0	0.25	0.25	0	0	0	0	0	0	0	0	0	1	0.33	scal (50.0%)
B.GT 500644 <hypre_ParMatmul._omp_fn.3+0x384>	1	0.50	0.50	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0.50	N/A
STR X2, [X0, X23]	1	0	0	0	0	0	0	0	0	0	0	0	0	0.50	0.50	0	0.50	0.50	1	0.50	scal (50.0%)
ADD X1, X1, #1	1	0	0	0.17	0.17	0.17	0.17	0.17	0.17	0	0	0	0	0	0	0	0	0	1	0.17	N/A
ADD X2, X2, #1	1	0	0	0.17	0.17	0.17	0.17	0.17	0.17	0	0	0	0	0	0	0	0	0	1	0.17	scal (50.0%)
STR D7, [X19, X28]	1	0	0	0	0	0	0	0	0	0.50	0.50	0	0	0.50	0.50	0	0	0	2	0.50	scal (50.0%)
STR X4, [X24, X28]	1	0	0	0	0	0	0	0	0	0	0	0	0	0.50	0.50	0	0.50	0.50	1	0.50	scal (50.0%)
LDR X14, [X15]	1	0	0	0	0	0	0	0	0	0	0	0	0	0.33	0.33	0.33	0	0	4	0.33	N/A
CMP X14, X1	1	0	0	0.25	0.25	0	0	0.25	0.25	0	0	0	0	0	0	0	0	0	1	0.33	N/A
B.GT 500470 <hypre_ParMatmul._omp_fn.3+0x1b0>	1	0.50	0.50	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0.50	N/A
