OV - - Loop 129480 - engine_linux64_intel

	Loop Id: 129480	Module: engine_linux64_intel_impi	Source: mulawc.F:2399-2399	Coverage: 0.04%

0x3aa252a MOV	-0xf188(%RBP),%R8    [2]

0x3aa2531 MOV	-0x70(%RBP),%RAX    [2]

0x3aa2535 MOV	%R8,%RDI

0x3aa2538 MOV	0x118(%RBX),%RDX    [4]

0x3aa253f SUB	-0x388(%RBP),%RDI    [2]

0x3aa2546 MOV	-0xf178(%RBP),%RSI    [2]

0x3aa254d LEA	0x1(%R8,%RAX,1),%RCX

0x3aa2552 VMOVUPS	-0x8(%RDX,%RCX,8),%ZMM16    [3]

0x3aa255d VMOVUPS	0x38(%RDX,%RCX,8),%ZMM19    [3]

0x3aa2568 VMULPD	-0x3f70(%RBP,%R8,8),%ZMM16,%ZMM17    [2]

0x3aa2573 VMULPD	-0x3f30(%RBP,%R8,8),%ZMM19,%ZMM20    [2]

0x3aa257e VADDPD	(%RSI,%RDI,8),%ZMM17,%ZMM18    [1]

0x3aa2585 VMOVUPD	%ZMM18,(%RSI,%RDI,8)    [1]

0x3aa258c ADD	$0x10,%R8

0x3aa2590 VADDPD	0x40(%RSI,%RDI,8),%ZMM20,%ZMM21    [1]

0x3aa2598 VMOVUPD	%ZMM21,0x40(%RSI,%RDI,8)    [1]

0x3aa25a0 MOV	%R8,-0xf188(%RBP)    [2]

0x3aa25a7 CMP	-0xf190(%RBP),%R8    [2]

0x3aa25ae JB	3aa252a

/home/kcamus/POP/POP3/OpenRadioss/OpenRadioss/engine/source/materials/mat_share/mulawc.F: 2399 - 2399

--------------------------------------------------------------------------------

2399:                 FOR(1:NEL,5) = FOR(1:NEL,5) + THKLY(JPOS:JPOS+NEL-1)*SIGNZX(1:NEL)

Coverage (%)	Name	Source Location	Module
►99.97+	cmain3_.h	cmain3.F:295	engine_linux64_intel_impi
○	cforc3_.h	cforc3.F:504	engine_linux64_intel_impi
○	forintc_.h	forintc.F:398	engine_linux64_intel_impi
○	resol_		engine_linux64_intel_impi
○	__kmp_invoke_microtask		engine_linux64_intel_impi
○	__kmp_invoke_task_func		engine_linux64_intel_impi

Path /

Metric	Value
CQA speedup if no scalar integer	1.50
CQA speedup if FP arith vectorized	1.00
CQA speedup if fully vectorized	1.00
CQA speedup if no inter-iteration dependency	NA
CQA speedup if next bottleneck killed	1.09
Bottlenecks	P2, P3,
Function	mulawc_.h
Source	mulawc.F:2399-2399
Source loop unroll info	unrolled by 2
Source loop unroll confidence level	high
Unroll/vectorization loop type	main
Unroll factor	2
CQA cycles	6.00
CQA cycles if no scalar integer	4.00
CQA cycles if FP arith vectorized	6.00
CQA cycles if fully vectorized	6.00
Front-end cycles	5.50
P0 cycles	2.00
P1 cycles	2.00
P2 cycles	6.00
P3 cycles	6.00
P4 cycles	3.00
P5 cycles	2.00
P6 cycles	2.00
P7 cycles	3.00
DIV/SQRT cycles	0.00
Inter-iter dependencies cycles	0
FE+BE cycles (UFS)	6.25
Stall cycles (UFS)	0.48
Nb insns	19.00
Nb uops	18.00
Nb loads	12.00
Nb stores	3.00
Nb stack references	5.00
FLOP/cycle	5.33
Nb FLOP add-sub	16.00
Nb FLOP mul	16.00
Nb FLOP fma	0.00
Nb FLOP div	0.00
Nb FLOP rcp	0.00
Nb FLOP sqrt	0.00
Nb FLOP rsqrt	0.00
Bytes/cycle	94.67
Bytes prefetched	0.00
Bytes loaded	432.00
Bytes stored	136.00
Stride 0	2.00
Stride 1	0.00
Stride n	0.00
Stride unknown	1.00
Stride indirect	1.00
Vectorization ratio all	88.89
Vectorization ratio load	100.00
Vectorization ratio store	66.67
Vectorization ratio mul	100.00
Vectorization ratio add_sub	100.00
Vectorization ratio fma	NA
Vectorization ratio div_sqrt	NA
Vectorization ratio other	NA
Vector-efficiency ratio all	90.28
Vector-efficiency ratio load	100.00
Vector-efficiency ratio store	70.83
Vector-efficiency ratio mul	100.00
Vector-efficiency ratio add_sub	100.00
Vector-efficiency ratio fma	NA
Vector-efficiency ratio div_sqrt	NA
Vector-efficiency ratio other	NA
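
The FLOP/cycle and Bytes/cycle rows appear to be derived from the raw counts in the same table. A minimal check, using only values copied from the table (the variable names are illustrative, not part of the tool):

```python
# Values copied from the metric table above.
cqa_cycles = 6.00
flop_add_sub = 16.0
flop_mul = 16.0
bytes_loaded = 432.0
bytes_stored = 136.0

# FLOP/cycle: all FP operations of one iteration divided by its cycle cost.
flop_per_cycle = (flop_add_sub + flop_mul) / cqa_cycles

# Bytes/cycle: loaded plus stored bytes divided by the same cycle cost.
bytes_per_cycle = (bytes_loaded + bytes_stored) / cqa_cycles

print(f"FLOP/cycle  = {flop_per_cycle:.2f}")
print(f"Bytes/cycle = {bytes_per_cycle:.2f}")
```

Both results agree with the table when rounded to two decimals.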

Path /

Average path: Display a virtual path defined by average values of all real paths

Function	mulawc_.h
Source file and lines	mulawc.F:2399-2399
Module	engine_linux64_intel_impi

The loop is defined in /home/kcamus/POP/POP3/OpenRadioss/OpenRadioss/engine/source/materials/mat_share/mulawc.F:2399.

It is the main loop of the related source loop, which is unrolled by 2 (including vectorization).

Code clean check

Detected a slowdown caused by scalar integer instructions (typically used for address computation). By removing them, you can lower the cost of an iteration from 6.00 to 4.00 cycles (1.50x speedup).

Workaround

Try to reorganize arrays of structures into structures of arrays
Consider permuting loops (see vectorization gain report)
To reference allocatable arrays, use "allocatable" instead of "pointer", or qualify pointers with the "contiguous" attribute (Fortran 2008)
For structures, limit to one indirection. For example, use a_b%c instead of a%b%c, with a_b set to a%b before this loop
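
To see where the scalar integer cost comes from, the sketch below tallies the 19 instructions of the binary loop by hand. The list is transcribed from the disassembly of this loop; the half-cycle-per-load figure is taken from the P2/P3 columns of the per-instruction table. It only shows the load-port share of the cost and does not attempt to reproduce the tool's 4.00-cycle estimate.

```python
# Mnemonics of the 19 loop instructions, in disassembly order.
# The flag marks instructions that read memory (a load on P2 or P3).
instructions = [
    ("MOV", True), ("MOV", True), ("MOV", False), ("MOV", True),
    ("SUB", True), ("MOV", True), ("LEA", False),
    ("VMOVUPS", True), ("VMOVUPS", True),
    ("VMULPD", True), ("VMULPD", True),
    ("VADDPD", True), ("VMOVUPD", False),
    ("ADD", False),
    ("VADDPD", True), ("VMOVUPD", False),
    ("MOV", False),  # loop counter written back to the stack (a store)
    ("CMP", True), ("JB", False),
]

# In this loop every vector instruction has a mnemonic starting with "V".
scalar = [(m, ld) for m, ld in instructions if not m.startswith("V")]
vector = [(m, ld) for m, ld in instructions if m.startswith("V")]

scalar_loads = sum(1 for _, ld in scalar if ld)
vector_loads = sum(1 for _, ld in vector if ld)

LOAD_PORTS = 2  # P2 and P3, each load occupies one of them for a cycle
load_cycles_all = (scalar_loads + vector_loads) / LOAD_PORTS
load_cycles_vector_only = vector_loads / LOAD_PORTS

print(f"scalar integer instructions: {len(scalar)} of {len(instructions)}")
print(f"loads: {scalar_loads} scalar + {vector_loads} vector")
print(f"load-port cycles: {load_cycles_all} now, "
      f"{load_cycles_vector_only} without the scalar loads")
```

Half of the loads are 8-byte reloads of loop-invariant values and of the loop counter from the stack or a descriptor, which is why the workarounds above focus on simplifying address computation.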

Vectorization

Your loop is partially vectorized. 90% of vector register length is used (average across all SSE/AVX instructions).

Details

88% of SSE/AVX instructions are used in vector version (process two or more data elements in vector registers):

66% of SSE/AVX stores are used in vector version.
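
The percentages above can be reproduced from the per-instruction "Vectorization" column: eight 512-bit instructions at 100% register fill and one scalar 8-byte store at 12.5%. The sketch below assumes that the ratio counts vector instructions and that the efficiency averages register fill; this reading matches the reported numbers but is not taken from the tool's documentation.

```python
SCALAR_FILL = 100.0 / 8  # one 64-bit element in a 512-bit register: 12.5%

# Register fill (%) of each instruction counted as vectorizable.
loads = [100.0, 100.0]                # 2 VMOVUPS
muls = [100.0, 100.0]                 # 2 VMULPD
adds = [100.0, 100.0]                 # 2 VADDPD
stores = [100.0, 100.0, SCALAR_FILL]  # 2 VMOVUPD + 1 scalar MOV store
all_insns = loads + muls + adds + stores


def vectorization_ratio(fills):
    """Percentage of instructions that process more than one element."""
    vector = [f for f in fills if f > SCALAR_FILL]
    return 100.0 * len(vector) / len(fills)


def efficiency_ratio(fills):
    """Average percentage of the vector register length that is used."""
    return sum(fills) / len(fills)


print(f"vectorization all   = {vectorization_ratio(all_insns):.2f}")
print(f"vectorization store = {vectorization_ratio(stores):.2f}")
print(f"efficiency all      = {efficiency_ratio(all_insns):.2f}")
print(f"efficiency store    = {efficiency_ratio(stores):.2f}")
```

The prose figures (88%, 66%, 90%) are the same values truncated to whole percentages. The single non-vector store is the write-back of the loop counter, not a floating-point store.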

Execution units bottlenecks

Performance is limited by:

reading data from caches/RAM (load units are a bottleneck)
writing data to caches/RAM (the store unit is a bottleneck)

By removing all these bottlenecks, you can lower the cost of an iteration from 6.00 to 5.50 cycles (1.09x speedup).
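
The 6.00 to 5.50 figure is consistent with a simple model in which the cost of an iteration is the maximum of the front-end cycles and the busiest execution port. The sketch below assumes that model (an interpretation of the report, not a documented formula) and uses the port cycles listed in the back-end section.

```python
front_end = 5.50
port_cycles = {
    "P0": 2.0, "P1": 2.0, "P2": 6.0, "P3": 6.0,
    "P4": 3.0, "P5": 2.0, "P6": 2.0, "P7": 3.0,
}

# The busiest ports set the cost of one iteration.
worst = max(port_cycles.values())
bottlenecks = [p for p, c in port_cycles.items() if c == worst]
cqa_cycles = max(front_end, worst)

# Once those ports are relieved, the next limit is the larger of the
# front end and the remaining ports.
remaining = [c for c in port_cycles.values() if c < worst]
next_limit = max([front_end] + remaining)
speedup = cqa_cycles / next_limit

print(f"bottlenecks: {bottlenecks}")
print(f"cycles: {cqa_cycles:.2f} -> {next_limit:.2f} ({speedup:.2f}x)")
```

Under this reading the front end, at 5.50 cycles, is the next limit, so relieving the load ports alone cannot gain more than about 9%.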

Workaround

Read fewer array elements
Write fewer array elements
Provide more information to your compiler:
- hardcode the bounds of the corresponding 'for' loop

FMA

Presence of both ADD/SUB and MUL operations.

Workaround

Pass to your compiler a micro-architecture specialization option:
- use axHost or xHost
Try changing the order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations, so that your compiler can generate FMA instructions wherever possible. For instance, a + b*c is a valid FMA (MUL then ADD). However, (a+b)*c cannot be translated into an FMA (ADD then MUL).

Slow data structures access

Detected data structures (typically arrays) that cannot be efficiently read/written

Details

Constant unknown stride: 1 occurrence(s)
Irregular (variable stride) or indirect: 1 occurrence(s)

Non-unit-stride (noncontiguous) accesses do not use data caches efficiently

Workaround

Try to reorganize arrays of structures into structures of arrays
Consider permuting loops (see vectorization gain report)
Try to remove indirect accesses. If applicable, precompute elements outside the innermost loop.

Vector unaligned load/store instructions

Detected 4 optimal vector unaligned load/store instructions.

Details

VMOVUPD: 2 occurrences
- /home/kcamus/POP/POP3/OpenRadioss/OpenRadioss/engine/source/materials/mat_share/mulawc.F:2399
- /home/kcamus/POP/POP3/OpenRadioss/OpenRadioss/engine/source/materials/mat_share/mulawc.F:2399
VMOVUPS: 2 occurrences
- /home/kcamus/POP/POP3/OpenRadioss/OpenRadioss/engine/source/materials/mat_share/mulawc.F:2399
- /home/kcamus/POP/POP3/OpenRadioss/OpenRadioss/engine/source/materials/mat_share/mulawc.F:2399

Workaround

Use vector aligned instructions:

Align your arrays on 64-byte boundaries: compile with -align array64byte (note: this does not affect arrays in COMMON blocks).
Inform your compiler that your arrays are vector aligned: add !DIR$ VECTOR ALIGNED to the loop if all accessed arrays are aligned, or !DIR$ ASSUME_ALIGNED FOO: 64 if only FOO is aligned.

Type of elements and instruction set

4 AVX-512 instructions are processing arithmetic or math operations on double precision FP elements in vector mode (eight at a time).

Matching between your loop (in the source code) and the binary loop

The binary loop is composed of 32 FP arithmetical operations:

16: addition or subtraction
16: multiply

The binary loop is loading 432 bytes (54 double precision FP elements). The binary loop is storing 136 bytes (17 double precision FP elements).
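
These totals can be rebuilt from the instruction mix of the binary loop. The sketch below uses counts read off the disassembly (six 512-bit loads, six 64-bit scalar loads, two 512-bit stores, one 64-bit scalar store, two VMULPD and two VADDPD); the constant names are illustrative.

```python
ZMM_BYTES = 64     # one 512-bit register
SCALAR_BYTES = 8   # one 64-bit integer or double
DOUBLES_PER_ZMM = ZMM_BYTES // SCALAR_BYTES

vector_loads, scalar_loads = 6, 6
vector_stores, scalar_stores = 2, 1
vmulpd, vaddpd = 2, 2

bytes_loaded = vector_loads * ZMM_BYTES + scalar_loads * SCALAR_BYTES
bytes_stored = vector_stores * ZMM_BYTES + scalar_stores * SCALAR_BYTES
flop_mul = vmulpd * DOUBLES_PER_ZMM
flop_add = vaddpd * DOUBLES_PER_ZMM

# The report expresses traffic as 8-byte elements, whatever their type.
elements_loaded = bytes_loaded // SCALAR_BYTES
elements_stored = bytes_stored // SCALAR_BYTES
fp_elements_loaded = vector_loads * DOUBLES_PER_ZMM
fp_elements_stored = vector_stores * DOUBLES_PER_ZMM

print(f"loaded: {bytes_loaded} bytes = {elements_loaded} elements "
      f"({fp_elements_loaded} of them floating point)")
print(f"stored: {bytes_stored} bytes = {elements_stored} elements "
      f"({fp_elements_stored} of them floating point)")
print(f"FLOP: {flop_mul} mul + {flop_add} add")
```

Only 48 of the 54 loaded elements and 16 of the 17 stored elements are floating-point data; the rest are 64-bit integer values used for addressing and loop control. Each pass processes 16 source elements, which matches the ADD $0x10,%R8 counter increment.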

Arithmetic intensity

Arithmetic intensity is 0.06 FP operations per loaded or stored byte.
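
As a cross-check, the same figure can be obtained both from the binary-loop totals and from the source statement at line 2399, which per element reads FOR, THKLY and SIGNZX, writes FOR, and performs one multiply and one add. The source-level count assumes 8-byte reals, as indicated by the double precision instructions in the binary.

```python
# Binary-loop view: totals reported for one iteration of the binary loop.
flop_binary = 16 + 16
bytes_binary = 432 + 136
ai_binary = flop_binary / bytes_binary

# Source-level view, per array element (assumes 8-byte reals).
REAL_BYTES = 8
flop_source = 2                    # one multiply, one add
bytes_source = (3 + 1) * REAL_BYTES  # three reads, one write
ai_source = flop_source / bytes_source

print(f"binary loop: {ai_binary:.4f} FLOP/byte")
print(f"source line: {ai_source:.4f} FLOP/byte")
```

The two values differ slightly because the binary loop also moves integer data, but both round to 0.06. An intensity this low means the statement is bound by data movement rather than by arithmetic.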

General properties

nb instructions	19
nb uops	18
loop length	138
used x86 registers	8
used mmx registers	0
used xmm registers	0
used ymm registers	0
used zmm registers	6
nb stack references	5
ADD-SUB / MUL ratio	1.00

Front-end

ASSUMED MACRO FUSION FIT IN UOP CACHE

micro-operation queue	5.50 cycles
front end	5.50 cycles

Back-end

	P0	P1	P2	P3	P4	P5	P6	P7
uops	2.00	2.00	6.00	6.00	3.00	2.00	2.00	3.00
cycles	2.00	2.00	6.00	6.00	3.00	2.00	2.00	3.00

Execution ports to units layout:

P0 (256 bits): VPU, ALU, DIV/SQRT
P1 (256 bits): ALU, VPU
P2 (512 bits): store address, load
P3 (512 bits): store address, load
P4 (512 bits): store data
P5 (512 bits): ALU, VPU
P6: ALU
P7: store address

Cycles executing div or sqrt instructions	NA
Longest recurrence chain latency (RecMII)	0.00

Front-end and detailed OoO resources (UFS)

FE+BE cycles	6.25
Stall cycles	0.48
LM full (events)	0.48

Cycles summary

Front-end	5.50
Dispatch	6.00
Data deps.	0.00
Overall L1	6.00

Vectorization ratios

INT

all	0%
load	NA (no load vectorizable/vectorized instructions)
store	0%
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	NA (no add-sub vectorizable/vectorized instructions)
fma	NA (no fma vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

FP

all	100%
load	100%
store	100%
mul	100%
add-sub	100%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

INT+FP

all	88%
load	100%
store	66%
mul	100%
add-sub	100%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

Vector efficiency ratios

INT

all	12%
load	NA (no load vectorizable/vectorized instructions)
store	12%
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	NA (no add-sub vectorizable/vectorized instructions)
fma	NA (no fma vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

FP

all	100%
load	100%
store	100%
mul	100%
add-sub	100%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

INT+FP

all	90%
load	100%
store	70%
mul	100%
add-sub	100%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

Cycles and memory resources usage

Assuming all data fit into the L1 cache, each iteration of the binary loop takes 6.00 cycles. At this rate:

56% of peak load performance is reached (72.00 out of 128.00 bytes loaded per cycle (GB/s @ 1GHz))
35% of peak store performance is reached (22.67 out of 64.00 bytes stored per cycle (GB/s @ 1GHz))
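
The two percentages follow from the per-iteration byte counts and the peak rates quoted in the same lines. A short check, with illustrative variable names:

```python
cqa_cycles = 6.00
bytes_loaded, bytes_stored = 432.0, 136.0

# Peak L1 rates quoted by the report, in bytes per cycle.
peak_load, peak_store = 128.0, 64.0

load_rate = bytes_loaded / cqa_cycles
store_rate = bytes_stored / cqa_cycles
load_pct = 100.0 * load_rate / peak_load
store_pct = 100.0 * store_rate / peak_store

print(f"load : {load_rate:.2f} B/cycle = {load_pct:.2f}% of peak")
print(f"store: {store_rate:.2f} B/cycle = {store_pct:.2f}% of peak")
```

The report shows these values truncated to whole percentages. The peak figures are consistent with the port layout listed in the back-end section: two 512-bit load ports (2 x 64 bytes) and one 512-bit store-data port (64 bytes).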

Front-end bottlenecks

Found no such bottlenecks.

ASM code

In the binary file, the address of the loop is: 3aa252a

Instruction	Nb FU	P0	P1	P2	P3	P4	P5	P6	P7	Latency	Recip. throughput	Vectorization
MOV -0xf188(%RBP),%R8	1	0	0	0.50	0.50	0	0	0	0	4-5	0.50	N/A
MOV -0x70(%RBP),%RAX	1	0	0	0.50	0.50	0	0	0	0	4-5	0.50	N/A
MOV %R8,%RDI	1	0	0	0	0	0	0	0	0	0	0.25	N/A
MOV 0x118(%RBX),%RDX	1	0	0	0.50	0.50	0	0	0	0	4-5	0.50	N/A
SUB -0x388(%RBP),%RDI	1	0.25	0.25	0.50	0.50	0	0.25	0.25	0	1	0.50	N/A
MOV -0xf178(%RBP),%RSI	1	0	0	0.50	0.50	0	0	0	0	4-5	0.50	N/A
LEA 0x1(%R8,%RAX,1),%RCX	1	0	1	0	0	0	0	0	0	3	1	N/A
VMOVUPS -0x8(%RDX,%RCX,8),%ZMM16	1	0	0	0.50	0.50	0	0	0	0	5-6	0.50	vect (100.0%)
VMOVUPS 0x38(%RDX,%RCX,8),%ZMM19	1	0	0	0.50	0.50	0	0	0	0	5-6	0.50	vect (100.0%)
VMULPD -0x3f70(%RBP,%R8,8),%ZMM16,%ZMM17	1	0.50	0	0.50	0.50	0	0.50	0	0	4	0.50	vect (100.0%)
VMULPD -0x3f30(%RBP,%R8,8),%ZMM19,%ZMM20	1	0.50	0	0.50	0.50	0	0.50	0	0	4	0.50	vect (100.0%)
VADDPD (%RSI,%RDI,8),%ZMM17,%ZMM18	1	0.50	0	0.50	0.50	0	0.50	0	0	4	0.50	vect (100.0%)
VMOVUPD %ZMM18,(%RSI,%RDI,8)	1	0	0	0.33	0.33	1	0	0	0.33	3	1	vect (100.0%)
ADD $0x10,%R8	1	0.25	0.25	0	0	0	0.25	0.25	0	1	0.25	N/A
VADDPD 0x40(%RSI,%RDI,8),%ZMM20,%ZMM21	1	0.50	0	0.50	0.50	0	0.50	0	0	4	0.50	vect (100.0%)
VMOVUPD %ZMM21,0x40(%RSI,%RDI,8)	1	0	0	0.33	0.33	1	0	0	0.33	3	1	vect (100.0%)
MOV %R8,-0xf188(%RBP)	1	0	0	0.33	0.33	1	0	0	0.33	3	1	scal (12.5%)
CMP -0xf190(%RBP),%R8	1	0.25	0.25	0.50	0.50	0	0.25	0.25	0	1	0.50	N/A
JB 3aa252a <mulawc_mod_mp_mulawc_.h+0x3c33a>	1	0.50	0	0	0	0	0	0.50	0	0	0.50-1	N/A
