OV - - Loop 9960 - engine_linuxa64_gf

0x77b184 STR	D23, [X0], #8    [1]

0x77b188 CMP	X1, X0

0x77b18c B.NE	77b184

/home/eoseret/OpenRadioss/engine/source/elements/sh3n/coquedk/cncoef3.F: 261 - 261

--------------------------------------------------------------------------------

261:       AMU(JFT:JLT) = DN

Coverage (%)	Name	Source Location	Module
►50.69+	czforc3	czforc3.F:440	engine_linuxa64_gf_ompi
○	forintc	forintc.F:370	engine_linuxa64_gf_ompi
○	resol_._omp_fn.14	lockon.inc:28	engine_linuxa64_gf_ompi
○	gomp_thread_start	gomp_thread_start	libgomp.so.1.0.0
○	start_thread	start_thread	libc.so.6
○	thread_start	thread_start	libc.so.6
►49.31+	czforc3	czforc3.F:440	engine_linuxa64_gf_ompi
○	forintc	forintc.F:370	engine_linuxa64_gf_ompi
○	resol_._omp_fn.14	lockon.inc:28	engine_linuxa64_gf_ompi
○	GOMP_parallel	libgomp.h:980	libgomp.so.1.0.0
○	resol	resol.F:4458	engine_linuxa64_gf_ompi
○	resol_head	resol_head.F:284	engine_linuxa64_gf_ompi
○	radioss2	radioss2.F:2178	engine_linuxa64_gf_ompi
○	radioss0	radioss0.F:95	engine_linuxa64_gf_ompi
○	main	radioss.F:38	engine_linuxa64_gf_ompi
○	__libc_start_call_main		libc.so.6
○	__libc_start_main		libc.so.6
○	_start		engine_linuxa64_gf_ompi

Path /

Metric	Value
CQA speedup if no scalar integer	1.00
CQA speedup if FP arith vectorized	1.00
CQA speedup if fully vectorized	2.00
CQA speedup if no inter-iteration dependency	NA
CQA speedup if next bottleneck killed	1.20
Bottlenecks	P0, P1, P8, P9, P12, P13
Function	cncoef3b
Source	cncoef3.F:261-261
Source loop unroll info	not unrolled or unrolled with no peel/tail loop
Source loop unroll confidence level	max
Unroll/vectorization loop type	NA
Unroll factor	NA
CQA cycles	0.50
CQA cycles if no scalar integer	0.50
CQA cycles if FP arith vectorized	0.50
CQA cycles if fully vectorized	0.25
Front-end cycles	0.38
P0 cycles	0.50
P1 cycles	0.50
P2 cycles	0.42
P3 cycles	0.42
P4 cycles	0.33
P5 cycles	0.33
P6 cycles	0.25
P7 cycles	0.25
P8 cycles	0.50
P9 cycles	0.50
P10 cycles	0.00
P11 cycles	0.00
P12 cycles	0.50
P13 cycles	0.50
P14 cycles	0.00
P15 cycles	0.00
P16 cycles	0.00
DIV/SQRT cycles	0.00
Inter-iter dependencies cycles	0
FE+BE cycles (UFS)	NA
Stall cycles (UFS)	NA
Nb insns	3.00
Nb uops	3.00
Nb loads	NA
Nb stores	1.00
Nb stack references	0.00
FLOP/cycle	0.00
Nb FLOP add-sub	0.00
Nb FLOP mul	0.00
Nb FLOP fma	0.00
Nb FLOP div	0.00
Nb FLOP rcp	0.00
Nb FLOP sqrt	0.00
Nb FLOP rsqrt	0.00
Bytes/cycle	16.00
Bytes prefetched	0.00
Bytes loaded	0.00
Bytes stored	8.00
Stride 0	1.00
Stride 1	0.00
Stride n	0.00
Stride unknown	0.00
Stride indirect	0.00
Vectorization ratio all	0.00
Vectorization ratio load	NA
Vectorization ratio store	0.00
Vectorization ratio mul	NA
Vectorization ratio add_sub	NA
Vectorization ratio fma	NA
Vectorization ratio div_sqrt	NA
Vectorization ratio other	NA
Vector-efficiency ratio all	50.00
Vector-efficiency ratio load	NA
Vector-efficiency ratio store	50.00
Vector-efficiency ratio mul	NA
Vector-efficiency ratio add_sub	NA
Vector-efficiency ratio fma	NA
Vector-efficiency ratio div_sqrt	NA
Vector-efficiency ratio other	NA

Average path: Display a virtual path defined by average values of all real paths

Function	cncoef3b
Source file and lines	cncoef3.F:261-261
Module	engine_linuxa64_gf_ompi

The loop is defined in /home/eoseret/OpenRadioss/engine/source/elements/sh3n/coquedk/cncoef3.F:261.

The related source loop is not unrolled or unrolled with no peel/tail loop.


Vectorization

Your loop is not vectorized. 2 data elements could be processed at once in vector registers. By vectorizing your loop, you can lower the cost of an iteration from 0.50 to 0.25 cycles (2.00x speedup).

Details

All VPU instructions are used in their scalar form, so each one processes only a single data element per vector register. Since your execution units are vector units, only a vectorized loop can use their full power.
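The 2.00x figure can be re-derived from numbers given elsewhere in this report. The sketch below assumes 128-bit vector registers (the width listed for ports P8 to P11) and 8-byte elements (the loop stores 8 bytes per iteration from a D register); the variable names are illustrative and are not part of the CQA output.

```python
# Assumptions taken from this report: 128-bit vector ports, 8-byte stored element.
VECTOR_BITS = 128
ELEMENT_BYTES = 8

# Number of elements that fit in one vector register.
lanes = VECTOR_BITS // (ELEMENT_BYTES * 8)

scalar_cycles = 0.50                   # "CQA cycles" for the current scalar loop
vector_cycles = scalar_cycles / lanes  # ideal cost per element once fully vectorized
speedup = scalar_cycles / vector_cycles

print(lanes, vector_cycles, speedup)   # prints: 2 0.25 2.0
```

The same ratio explains the 50% vector-efficiency value: a scalar 64-bit store uses half of a 128-bit register.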

Workaround

Try another compiler or update/tune your current one:
- Recompile with -ftree-vectorize (included in -O3) to enable loop vectorization, and with -fassociative-math (included in -Ofast or -ffast-math) to extend vectorization to FP reductions.
Remove inter-iteration dependences from your loop and make it unit-stride:
- If your arrays have 2 or more dimensions, check whether elements are accessed contiguously and, if not, permute the loops accordingly. Fortran storage order is column-major, so the leftmost index should vary in the innermost loop: do i do j a(i,j) = b(i,j) (slow, non stride 1) => do j do i a(i,j) = b(i,j) (fast, stride 1)
- If your loop streams arrays of structures (AoS), try to use structures of arrays (SoA) instead: do i a(i)%x = b(i)%x (slow, non stride 1) => do i a%x(i) = b%x(i) (fast, stride 1)


Matching between your loop (in the source code) and the binary loop

The binary loop does not contain any FP arithmetic operations. It stores 8 bytes per iteration.

Unroll opportunity

Loop body is too small to efficiently use resources.

Workaround

Unroll your loop if the trip count is significantly higher than the target unroll factor. This can be done manually, by recompiling with -funroll-loops and/or -floop-unroll-and-jam, or by placing the unroll (resp. unroll_and_jam) directive on top of the inner (resp. surrounding) loop. You can enforce an unroll factor with: !GCC$ unroll N
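To see why the body is considered too small, it can help to count instructions per stored element. The sketch below is a rough model only: it assumes an unrolled body still issues one store per element and keeps a single compare/branch pair per trip, which is not guaranteed for the code a compiler would actually generate. The function name and parameters are hypothetical.

```python
def insns_per_element(unroll_factor, stores_per_element=1, loop_overhead=2):
    """Instructions executed per stored element for a given unroll factor.

    loop_overhead counts the CMP and B.NE that close each trip of the loop.
    """
    body = unroll_factor * stores_per_element + loop_overhead
    return body / unroll_factor

# Factor 1 is the current loop: 3 instructions for 1 store.
for factor in (1, 2, 4, 8):
    print(factor, insns_per_element(factor))
```

Under these assumptions the loop-control share falls from two thirds of the instructions at factor 1 to one fifth at factor 8.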

General properties

nb instructions	3
loop length	12
nb stack references	0

Front-end

FIT IN UOP CACHE

front end	0.38 cycles

Back-end

	P0	P1	P2	P3	P4	P5	P6	P7	P8	P9	P10	P11	P12	P13	P14	P15	P16
uops	0.50	0.50	0.42	0.42	0.33	0.33	0.25	0.25	0.50	0.50	0.00	0.00	0.50	0.50	0.00	0.00	0.00
cycles	0.50	0.50	0.42	0.42	0.33	0.33	0.25	0.25	0.50	0.50	0.00	0.00	0.50	0.50	0.00	0.00	0.00

Execution ports to units layout:

P0: BRU
P1: BRU
P2: ALU
P3: ALU
P4: ALU
P5: ALU
P6: ALU
P7: ALU
P8 (128 bits): VPU, FP store data, ALU, DIV/SQRT
P9 (128 bits): VPU, ALU, FP store data
P10 (128 bits): VPU, ALU, DIV/SQRT
P11 (128 bits): ALU, VPU
P12 (256 bits): store address, load
P13 (256 bits): store address, load
P14 (256 bits): load
P15 (64 bits): store data
P16 (64 bits): store data

Cycles executing div or sqrt instructions	NA
Longest recurrence chain latency (RecMII)	0.00

Cycles summary

Front-end	0.38
Data deps.	0.00
Overall L1	0.50
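The summary values above are consistent with taking the maximum over the front-end, the data-dependency chain and the busiest execution ports. The sketch below reproduces that reading from the per-port cycles in this report; treating the estimate as a plain maximum is an assumption about how CQA combines the figures, not a statement of its implementation.

```python
# Per-port cycles copied from the Back-end table of this report.
port_cycles = {
    "P0": 0.50, "P1": 0.50, "P2": 0.42, "P3": 0.42, "P4": 0.33, "P5": 0.33,
    "P6": 0.25, "P7": 0.25, "P8": 0.50, "P9": 0.50, "P10": 0.00, "P11": 0.00,
    "P12": 0.50, "P13": 0.50, "P14": 0.00, "P15": 0.00, "P16": 0.00,
}
front_end = 0.38
data_deps = 0.00

back_end = max(port_cycles.values())
bottlenecks = [port for port, cycles in port_cycles.items() if cycles == back_end]
estimate = max(front_end, data_deps, back_end)

# Next limit once the saturated ports are relieved (assumed: next-largest value).
next_limit = max([c for c in port_cycles.values() if c < back_end] + [front_end, data_deps])
next_speedup = estimate / next_limit

print(estimate, bottlenecks, round(next_speedup, 2))
```

With the rounded 0.42 shown in the table, the last ratio comes out at about 1.19, in line with the 1.20 reported for "CQA speedup if next bottleneck killed".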

Vectorization ratios

all	0%
load	NA (no load vectorizable/vectorized instructions)
store	0%
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	NA (no add-sub vectorizable/vectorized instructions)
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

Cycles and memory resources usage

Assuming all data fit into the L1 cache, each iteration of the binary loop takes 0.50 cycles. At this rate:

33% of peak store performance is reached (16.00 out of 48.00 bytes stored per cycle (GB/s @ 1GHz))
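The 33% figure follows directly from the bytes stored per iteration and the cycle estimate. A minimal check, using only values from this report (the 48 bytes/cycle peak is taken as given, and the variable names are illustrative):

```python
bytes_stored_per_iter = 8.0   # "Bytes stored"
cycles_per_iter = 0.50        # "CQA cycles"
peak_store_bytes = 48.0       # peak store bandwidth per cycle quoted by the report

achieved = bytes_stored_per_iter / cycles_per_iter   # bytes stored per cycle
fraction = achieved / peak_store_bytes

# At 1 GHz, one byte per cycle is one GB/s, so the same number reads as GB/s.
print(f"{achieved:.2f} B/cycle, {fraction:.0%} of peak")  # prints: 16.00 B/cycle, 33% of peak
```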

ASM code

In the binary file, the address of the loop is: 77b184

Instruction	Nb FU	P0	P1	P2	P3	P4	P5	P6	P7	P8	P9	P12	P13	Latency	Recip. throughput	Vectorization
STR D23, [X0], #8	1	0	0	0.17	0.17	0.17	0.17	0.17	0.17	0.50	0.50	0.50	0.50	2	0.50	scal (50.0%)
CMP X1, X0	1	0	0	0.25	0.25	0	0	0.25	0.25	0	0	0	0	1	0.33	N/A
B.NE 77b184 <cncoef3b_+0x484>	1	0.50	0.50	0	0	0	0	0	0	0	0	0	0	1	0.50	N/A
