OV - - Loop 56170 - engine_linuxa64

0x16ff9a8 LDP	D0, D1, [X12, #1016]    [2]

0x16ff9ac LDPSW	W16, W17, [X14], #8    [3]

0x16ff9b0 SUBS	X15, X15, #2

0x16ff9b4 ADD	X12, X12, #16

0x16ff9b8 STR	D0, [X19, X16,LSL #3]    [1]

0x16ff9bc STR	D1, [X19, X17,LSL #3]    [1]

0x16ff9c0 B.NE	16ff9a8

/work/m23012/camus/OpenRadioss/OpenRadioss/engine/source/output/anim/generate/dfuncc.F: 4896 - 4896

--------------------------------------------------------------------------------

4896:                   FUNC(EL2FA(NN4+N)) = EVAR(I)
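
The source statement is an indexed scatter: values read contiguously from EVAR are written to the positions of FUNC selected by EL2FA. A minimal Python sketch of that access pattern, mirroring the two-way unrolled binary loop (one LDP of two doubles, one LDPSW of two indices, two STR), could look like the following. The helper name `scatter_pairs`, the 0-based indexing, and the even trip count are assumptions made for illustration; they are not taken from the OpenRadioss source.

```python
def scatter_pairs(func, el2fa, evar):
    """Scatter evar[k] into func[el2fa[k]], two elements per step.

    Hypothetical helper: the names mirror the Fortran arrays, indices are
    0-based here, and len(evar) is assumed to be even.
    """
    assert len(el2fa) == len(evar) and len(evar) % 2 == 0
    for k in range(0, len(evar), 2):
        d0, d1 = evar[k], evar[k + 1]      # LDP: two contiguous doubles
        i0, i1 = el2fa[k], el2fa[k + 1]    # LDPSW: two contiguous 32-bit indices
        func[i0] = d0                      # STR: indirect (scattered) store
        func[i1] = d1                      # STR: indirect (scattered) store
    return func


func = [0.0] * 6
scatter_pairs(func, [4, 1, 5, 0], [10.0, 20.0, 30.0, 40.0])
print(func)
```

Because each store address depends on an index loaded in the same iteration, the two stores cannot be merged into one contiguous vector store, which is consistent with the "Stride indirect" entry and the 0% store vectorization ratio reported for this loop.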

Coverage (%)	Name	Source Location	Module

Path /

Metric	Value
CQA speedup if no scalar integer	1.50
CQA speedup if FP arith vectorized	1.00
CQA speedup if fully vectorized	1.27
CQA speedup if no inter-iteration dependency	NA
CQA speedup if next bottleneck killed	1.50
Bottlenecks	instruction fetch, predecoding
Function	dfuncc
Source	dfuncc.F:4896-4896
Source loop unroll info	unrolled by 2
Source loop unroll confidence level	max
Unroll/vectorization loop type	NA
Unroll factor	2
CQA cycles	3.00
CQA cycles if no scalar integer	2.00
CQA cycles if FP arith vectorized	3.00
CQA cycles if fully vectorized	2.36
Front-end cycles	3.00
P0 cycles	1.00
P1 cycles	1.33
P2 cycles	1.33
P3 cycles	1.33
P4 cycles	1.00
P5 cycles	1.00
P6 cycles	2.00
P7 cycles	2.00
DIV/SQRT cycles	0.00
Inter-iter dependencies cycles	1
FE+BE cycles (UFS)	NA
Stall cycles (UFS)	NA
Nb insns	7.00
Nb uops	7.00
Nb loads	NA
Nb stores	2.00
Nb stack references	0.00
FLOP/cycle	0.00
Nb FLOP add-sub	0.00
Nb FLOP mul	0.00
Nb FLOP fma	0.00
Nb FLOP div	0.00
Nb FLOP rcp	0.00
Nb FLOP sqrt	0.00
Nb FLOP rsqrt	0.00
Bytes/cycle	13.33
Bytes prefetched	0.00
Bytes loaded	24.00
Bytes stored	16.00
Stride 0	1.00
Stride 1	1.00
Stride n	0.00
Stride unknown	0.00
Stride indirect	1.00
Vectorization ratio all	25.00
Vectorization ratio load	100.00
Vectorization ratio store	0.00
Vectorization ratio mul	NA
Vectorization ratio add_sub	0.00
Vectorization ratio fma	NA
Vectorization ratio div_sqrt	NA
Vectorization ratio other	NA
Vector-efficiency ratio all	62.50
Vector-efficiency ratio load	100.00
Vector-efficiency ratio store	50.00
Vector-efficiency ratio mul	NA
Vector-efficiency ratio add_sub	50.00
Vector-efficiency ratio fma	NA
Vector-efficiency ratio div_sqrt	NA
Vector-efficiency ratio other	NA

Function	dfuncc
Source file and lines	dfuncc.F:4896-4896
Module	engine_linuxa64_ompi

The loop is defined in /work/m23012/camus/OpenRadioss/OpenRadioss/engine/source/output/anim/generate/dfuncc.F:4896.

The related source loop is unrolled by 2 (including vectorization).

Code clean check

Detected a slowdown caused by scalar integer instructions (typically used for address computation). By removing them, you can lower the cost of an iteration from 3.00 to 2.00 cycles (1.50x speedup).
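
The speedup figures in this report appear to be plain ratios of the current cycle estimate to the projected one. A short sketch of that arithmetic, using the cycle counts from the metric table (the function name `speedup` is illustrative):

```python
def speedup(current_cycles, projected_cycles):
    """Ratio of the current per-iteration cost to a projected cost."""
    return current_cycles / projected_cycles


cqa_cycles = 3.00        # "CQA cycles"
no_scalar_int = 2.00     # "CQA cycles if no scalar integer"
fully_vectorized = 2.36  # "CQA cycles if fully vectorized"

print(round(speedup(cqa_cycles, no_scalar_int), 2))
print(round(speedup(cqa_cycles, fully_vectorized), 2))
```

Both results agree with the reported 1.50x and 1.27x values after rounding to two decimals.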

Workaround

Try to reorganize arrays of structures to structures of arrays
Consider permuting loops (see vectorization gain report)
To reference allocatable arrays, declare them "allocatable" instead of "pointer", or qualify pointers with the "contiguous" attribute (Fortran 2008)
For structures, limit to one indirection. For example, use a_b%c instead of a%b%c with a_b set to a%b before this loop

Vectorization

Your loop is poorly vectorized. Only 62% of vector register length is used (average across all VPU instructions). By fully vectorizing your loop, you can lower the cost of an iteration from 3.00 to 2.36 cycles (1.27x speedup).

Details

25% of VPU instructions are used in vector version (process two or more data elements in vector registers):

0% of VPU stores are used in vector version.
0% of VPU addition or subtraction instructions are used in vector version.

Since your execution units are vector units, only a fully vectorized loop can use their full power.
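
The 25% and 62% figures can be reproduced from the per-instruction tags in the ASM code table, assuming both are unweighted averages over the four instructions that carry a vect/scal tag (LDP, SUBS and the two STR). This is a sketch of that assumption, not a description of CQA's internals; the tuple layout and function names are hypothetical.

```python
# (mnemonic, category, is_vector, percent of vector register length used)
tagged = [
    ("LDP", "load", True, 100.0),
    ("SUBS", "add_sub", False, 50.0),
    ("STR", "store", False, 50.0),
    ("STR", "store", False, 50.0),
]


def vectorization_ratio(rows):
    """Percentage of tagged instructions that are in vector form."""
    return 100.0 * sum(1 for row in rows if row[2]) / len(rows)


def vector_efficiency(rows):
    """Average percentage of the vector register length in use."""
    return sum(row[3] for row in rows) / len(rows)


stores = [row for row in tagged if row[1] == "store"]
print(vectorization_ratio(tagged), vector_efficiency(tagged))
print(vectorization_ratio(stores), vector_efficiency(stores))
```

Under this assumption the overall values come out at 25.0 and 62.5, and the store-only values at 0.0 and 50.0, matching the "Vectorization ratio" and "Vector-efficiency ratio" rows of the metric table.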

Workaround

Try another compiler or update/tune your current one
Remove inter-iteration dependences from your loop and make it unit-stride:
- If your arrays have 2 or more dimensions, check whether elements are accessed contiguously and, otherwise, try to permute loops accordingly. Fortran storage order is column-major: do i do j a(i,j) = b(i,j) (slow, non stride 1) => do i do j a(j,i) = b(j,i) (fast, stride 1)
- If your loop streams arrays of structures (AoS), try to use structures of arrays instead (SoA): do i a(i)%x = b(i)%x (slow, non stride 1) => do i a%x(i) = b%x(i) (fast, stride 1)
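
The column-major advice can be checked with a little index arithmetic. The sketch below computes the linear element offset of a(i,j) for a two-dimensional Fortran array and compares the step between consecutive inner-loop iterations for the two loop orders. The array size `n_rows = 100` and the helper name are arbitrary choices for illustration.

```python
def column_major_offset(i, j, n_rows):
    """Linear element offset of a(i, j), 1-based indices, column-major order."""
    return (i - 1) + (j - 1) * n_rows


n_rows = 100

# Inner loop over j with the access a(i, j): the second subscript moves.
slow_stride = column_major_offset(1, 2, n_rows) - column_major_offset(1, 1, n_rows)

# Inner loop over j with the access a(j, i): the first subscript moves.
fast_stride = column_major_offset(2, 1, n_rows) - column_major_offset(1, 1, n_rows)

print(slow_stride, fast_stride)
```

The first access pattern jumps a whole column (100 elements here) on every inner iteration, while the second advances by one element. Note that this general advice concerns strided accesses; it does not by itself remove an indirect index such as the one in the loop analyzed here.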

Matching between your loop (in the source code) and the binary loop

The binary loop does not contain any FP arithmetical operations. The binary loop is loading 24 bytes. The binary loop is storing 16 bytes.

General properties

nb instructions	7
loop length	28
nb stack references	0
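
The loop length is consistent with the listing: AArch64 instructions have a fixed 4-byte encoding, so the byte length follows from the addresses of the first and last instructions of the loop. A small check, with the two addresses copied from the disassembly:

```python
INSN_BYTES = 4  # every A64 instruction is encoded on 4 bytes

first_insn = 0x16FF9A8  # LDP at the loop entry
last_insn = 0x16FF9C0   # B.NE closing the loop

nb_insns = (last_insn - first_insn) // INSN_BYTES + 1
loop_length = nb_insns * INSN_BYTES

print(nb_insns, loop_length)
```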

Front-end

DOES NOT FIT IN UOP CACHE

front end	3.00 cycles

Back-end

	P0	P1	P2	P3	P4	P5	P6	P7
uops	1.00	1.33	1.33	1.33	1.00	1.00	2.00	2.00
cycles	1.00	1.33	1.33	1.33	1.00	1.00	2.00	2.00

Execution ports to units layout:

P0: BRU
P1: ALU
P2: ALU
P3: ALU
P4 (128 bits): VPU, ALU, DIV/SQRT
P5 (128 bits): ALU, VPU
P6 (128 bits): load, store data, store address
P7 (128 bits): load, store data, store address

Cycles executing div or sqrt instructions	NA
Longest recurrence chain latency (RecMII)	1.00

Cycles summary

Front-end	3.00
Data deps.	1.00
Overall L1	3.00
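
One way to read this summary is that the slowest component sets the cost of an iteration. The sketch below assumes the L1 estimate is the maximum of the front-end cycles, the busiest execution port, and the data-dependency chain; that assumption is consistent with the numbers in this report but is not taken from the CQA documentation. The names `l1_cycles` and `next_limit` are illustrative.

```python
def l1_cycles(front_end, port_cycles, data_deps):
    """Estimated cycles per iteration: the slowest component sets the pace."""
    return max(front_end, max(port_cycles.values()), data_deps)


ports = {"P0": 1.00, "P1": 1.33, "P2": 1.33, "P3": 1.33,
         "P4": 1.00, "P5": 1.00, "P6": 2.00, "P7": 2.00}
front_end = 3.00
data_deps = 1.00

overall = l1_cycles(front_end, ports, data_deps)

# What would limit the loop if the front-end were no longer the bottleneck.
next_limit = max(max(ports.values()), data_deps)

print(overall, next_limit, round(overall / next_limit, 2))
```

With the front-end at 3.00 cycles and the load/store ports P6 and P7 at 2.00, the ratio of 1.50 matches the "CQA speedup if next bottleneck killed" entry.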

Vectorization ratios

all	25%
load	100%
store	0%
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	0%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

Cycles and memory resources usage

Assuming all data fit into the L1 cache, each iteration of the binary loop takes 3.00 cycles. At this rate:

25% of peak load performance is reached (8.00 out of 32.00 bytes loaded per cycle (GB/s @ 1GHz))
16% of peak store performance is reached (5.33 out of 32.00 bytes stored per cycle (GB/s @ 1GHz))
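
These rates follow from the bytes moved per iteration divided by the 3.00-cycle estimate. A short sketch of the arithmetic, with the byte counts broken down by instruction (the breakdown is inferred from the operand sizes in the disassembly):

```python
PEAK_BYTES_PER_CYCLE = 32.0  # peak load (and store) rate quoted in the report

cycles = 3.00
bytes_loaded = 16 + 8  # LDP of two 8-byte doubles + LDPSW of two 4-byte integers
bytes_stored = 8 + 8   # two STR of one 8-byte double each

load_rate = bytes_loaded / cycles
store_rate = bytes_stored / cycles
total_rate = (bytes_loaded + bytes_stored) / cycles

print(round(load_rate, 2), round(100 * load_rate / PEAK_BYTES_PER_CYCLE, 1))
print(round(store_rate, 2), round(100 * store_rate / PEAK_BYTES_PER_CYCLE, 1))
print(round(total_rate, 2))
```

The load side gives 8.00 bytes per cycle (25% of peak) and the total gives the 13.33 bytes per cycle listed in the metric table. The store side evaluates to about 16.7% of peak, which the report shows as 16%.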

ASM code

In the binary file, the address of the loop is: 16ff9a8

Instruction	Nb FU	P0	P1	P2	P3	P4	P5	P6	P7	Latency	Recip. throughput	Vectorization
LDP D0, D1, [X12, #1016]	1	0	0.33	0.33	0.33	0	0	0.50	0.50	5	1	vect (100.0%)
LDPSW W16, W17, [X14], #8	1	0	0.33	0.33	0.33	0	0	0.50	0.50	5	1	N/A
SUBS X15, X15, #2	1	0	0.33	0.33	0.33	0	0	0	0	1	0.33	scal (50.0%)
ADD X12, X12, #16	1	0	0.33	0.33	0.33	0	0	0	0	1	0.33	N/A
STR D0, [X19, X16,LSL #3]	1	0	0	0	0	0.50	0.50	0.50	0.50	2	0.50	scal (50.0%)
STR D1, [X19, X17,LSL #3]	1	0	0	0	0	0.50	0.50	0.50	0.50	2	0.50	scal (50.0%)
B.NE 16ff9a8 <dfuncc_+0x1d94>	1	1	0	0	0	0	0	0	0	1	1	N/A
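
The per-port totals in the Back-end table can be recovered by summing the columns of this table. The sketch below assumes that the 0.33 entries stand for an even one-third split across the three ALU ports P1 to P3, so that four such entries add up to the reported 1.33. The data layout is hypothetical; only non-zero entries are listed.

```python
THIRD = 1 / 3  # assumption: 0.33 in the table is one third, shown to two decimals

# (instruction, {port: uops dispatched to that port})
rows = [
    ("LDP",   {"P1": THIRD, "P2": THIRD, "P3": THIRD, "P6": 0.5, "P7": 0.5}),
    ("LDPSW", {"P1": THIRD, "P2": THIRD, "P3": THIRD, "P6": 0.5, "P7": 0.5}),
    ("SUBS",  {"P1": THIRD, "P2": THIRD, "P3": THIRD}),
    ("ADD",   {"P1": THIRD, "P2": THIRD, "P3": THIRD}),
    ("STR",   {"P4": 0.5, "P5": 0.5, "P6": 0.5, "P7": 0.5}),
    ("STR",   {"P4": 0.5, "P5": 0.5, "P6": 0.5, "P7": 0.5}),
    ("B.NE",  {"P0": 1.0}),
]

totals = {f"P{n}": 0.0 for n in range(8)}
for _, usage in rows:
    for port, uops in usage.items():
        totals[port] += uops

pressure = {port: round(value, 2) for port, value in totals.items()}
print(pressure)
```

The sums reproduce the 1.00 / 1.33 / 1.33 / 1.33 / 1.00 / 1.00 / 2.00 / 2.00 row, with the load/store ports P6 and P7 being the busiest back-end resources.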
