OV - lbc - Loop 327

0x42aab0 VMOVQ	(%R14,%RSI,8),%XMM4    [1]

0x42aab6 VMOVQ	%XMM4,(%R11)    [2]

0x42aabb INC	%RSI

0x42aabe ADD	%RBX,%R11

0x42aac1 CMP	%RSI,%RAX

0x42aac4 JNE	42aab0

/home/kcamus/POP3/lbm/lbc/lbc.F90: 117 - 117

--------------------------------------------------------------------------------

117:    lb_dom%fOut = lb_dom%fIn

Coverage (%)	Name	Module
►100.00+	main	lbc
○	__libc_init_first	libc.so.6
○	__libc_start_main	libc.so.6
○	_start	lbc

Path /

Metric	Value
CQA speedup if no scalar integer	1.25
CQA speedup if FP arith vectorized	1.00
CQA speedup if fully vectorized	8.00
CQA speedup if no inter-iteration dependency	NA
CQA speedup if next bottleneck killed	1.25
Bottlenecks	micro-operation queue
Function	MAIN_
Source	lbc.F90:117-117
Source loop unroll info	unrolled by 4
Source loop unroll confidence level	max
Unroll/vectorization loop type	peel/tail
Unroll factor	1
CQA cycles	1.25
CQA cycles if no scalar integer	1.00
CQA cycles if FP arith vectorized	1.25
CQA cycles if fully vectorized	0.16
Front-end cycles	1.25
P0 cycles	0.75
P1 cycles	0.75
P2 cycles	0.83
P3 cycles	0.50
P4 cycles	1.00
P5 cycles	0.75
P6 cycles	0.75
P7 cycles	0.67
DIV/SQRT cycles	0.00
Inter-iter dependencies cycles	1
FE+BE cycles (UFS)	1.36
Stall cycles (UFS)	0.00
Nb insns	6.00
Nb uops	5.00
Nb loads	1.00
Nb stores	1.00
Nb stack references	0.00
FLOP/cycle	0.00
Nb FLOP add-sub	0.00
Nb FLOP mul	0.00
Nb FLOP fma	0.00
Nb FLOP div	0.00
Nb FLOP rcp	0.00
Nb FLOP sqrt	0.00
Nb FLOP rsqrt	0.00
Bytes/cycle	12.80
Bytes prefetched	0.00
Bytes loaded	8.00
Bytes stored	8.00
Stride 0	0.00
Stride 1	1.00
Stride n	0.00
Stride unknown	1.00
Stride indirect	0.00
Vectorization ratio all	0.00
Vectorization ratio load	0.00
Vectorization ratio store	0.00
Vectorization ratio mul	NA
Vectorization ratio add_sub	NA
Vectorization ratio fma	NA
Vectorization ratio div_sqrt	NA
Vectorization ratio other	NA
Vector-efficiency ratio all	12.50
Vector-efficiency ratio load	12.50
Vector-efficiency ratio store	12.50
Vector-efficiency ratio mul	NA
Vector-efficiency ratio add_sub	NA
Vector-efficiency ratio fma	NA
Vector-efficiency ratio div_sqrt	NA
Vector-efficiency ratio other	NA

Average path: Display a virtual path defined by average values of all real paths

Function	MAIN_
Source file and lines	lbc.F90:117-117
Module	lbc

The loop is defined in /home/kcamus/POP3/lbm/lbc/lbc.F90:117.

It is a peel/tail loop of the related source loop, which is unrolled by 4 (including vectorization).

Code clean check

Detected a slowdown caused by scalar integer instructions (typically used for address computation). By removing them, you can lower the cost of an iteration from 1.25 to 1.00 cycles (1.25x speedup).
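
The speedup quoted above is simply the ratio of the current cycle estimate to the what-if estimate. The minimal sketch below recomputes it from the values in the metric table; the function and variable names are illustrative and are not part of the tool. The fully vectorized case is computed from 1.25 / 8 rather than from the displayed 0.16, on the assumption that 0.16 is a rounded display value.

```python
def speedup(cycles_now, cycles_after):
    """Ratio of the current per-iteration cost to a what-if cost."""
    return cycles_now / cycles_after

cqa_cycles = 1.25  # "CQA cycles" from the metric table

# What-if costs from the metric table. The fully vectorized cost is
# recomputed as 1.25 / 8 (assumption: the displayed 0.16 is rounded).
what_if_cycles = {
    "no scalar integer": 1.00,
    "FP arith vectorized": 1.25,
    "fully vectorized": cqa_cycles / 8,
}

for scenario, cycles in what_if_cycles.items():
    print(f"{scenario}: {speedup(cqa_cycles, cycles):.2f}x")
```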

Workaround

Try to reorganize arrays of structures into structures of arrays
Consider permuting loops (see vectorization gain report)
To reference allocatable arrays, declare them "allocatable" rather than "pointer", or qualify pointers with the "contiguous" attribute (Fortran 2008)
For structures, limit access to one level of indirection. For example, use a_b%c instead of a%b%c, with a_b set to a%b before this loop

Unrolling/vectorization cost

This loop is the peel/tail of an unrolled/vectorized loop. If its cost is not negligible compared to the main (unrolled/vectorized) loop, unrolling/vectorization is counterproductive because of the low trip count.

Details

The more source iterations each pass of the main loop processes, the higher the trip count must be to amortize the peel/tail overhead.
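
To make the amortization argument concrete, the sketch below estimates what fraction of the total cycles is spent in the tail loop for a given trip count. It rests on several assumptions that the report does not confirm: there are no peel iterations, the tail runs trip_count mod 4 iterations at the reported 1.25 cycles each, and one pass of the main loop costs a hypothetical 1.25 cycles (the report gives no figure for the main loop).

```python
def tail_share(trip_count, unroll=4, tail_cycles=1.25, main_cycles=1.25):
    """Fraction of total cycles spent in the tail loop.

    main_cycles is a hypothetical cost per pass of the unrolled main loop;
    the report only provides the tail loop cost (1.25 cycles).
    """
    main_iters, tail_iters = divmod(trip_count, unroll)
    tail_cost = tail_iters * tail_cycles
    main_cost = main_iters * main_cycles
    total = tail_cost + main_cost
    return tail_cost / total if total else 0.0

for n in (7, 103, 1003):
    print(f"trip count {n}: {100 * tail_share(n):.1f}% of cycles in the tail")
```

With a short trip count the tail dominates; with a long one its share becomes small, which is the behaviour the note above describes.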

Workaround

Recompile with -prof-gen, execute, then recompile with -prof-use (profile-guided optimization)
Insert !DIR$ LOOP COUNT MAX(n),MIN(n),AVG(n) at the top of your loop
Hardcode the most frequent values of the loop bounds by adding specialized paths. For instance, where foo(i) stands for the loop body, replace
    do i=1,n
      foo(i)
    end do
with:
    select case (n)
    case (4)
      do i=1,4
        foo(i)
      end do
    case (6)
      do i=1,6
        foo(i)
      end do
    case default
      do i=1,n
        foo(i)
      end do
    end select

Vectorization

Your loop is not vectorized. 8 data elements could be processed at once in vector registers.

Details

All SSE/AVX instructions are used in scalar version (process only one data element in vector registers).
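
The figures "8 data elements" and the 12.5% efficiency reported elsewhere in this report follow from the register width. The sketch below assumes 512-bit vector registers (inferred from the 8-element figure and the zmm register count listed in the report) and 64-bit elements (each VMOVQ moves 8 bytes); the function names are illustrative.

```python
def lanes(register_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return register_bits // element_bits

def vector_efficiency(used_bits, register_bits):
    """Percentage of the vector register actually carrying data."""
    return 100.0 * used_bits / register_bits

REGISTER_BITS = 512  # assumption: widest register available on the target
ELEMENT_BITS = 64    # one 8-byte element per VMOVQ

n_lanes = lanes(REGISTER_BITS, ELEMENT_BITS)
print("elements per register:", n_lanes)
print("scalar efficiency (%):", vector_efficiency(ELEMENT_BITS, REGISTER_BITS))
print("cycles if fully vectorized:", 1.25 / n_lanes)
```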

Execution units bottlenecks

Found no such bottlenecks but see expert reports for more complex bottlenecks.


Slow data structures access

Detected data structures (typically arrays) that cannot be efficiently read/written

Details

Constant unknown stride: 1 occurrence(s)

Non-unit stride (non-contiguous) accesses do not use data caches efficiently
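
In this loop the store pointer %R11 appears to advance by the run-time value in %RBX on every iteration, which is presumably the access reported with an unknown stride. The sketch below is a simplified model of why a larger stride wastes cache capacity: it computes the fraction of each fetched cache line that holds useful data. The 64-byte line size is an assumption typical of x86 processors, and the model ignores alignment and reuse by other loops.

```python
def line_utilization(stride_bytes, element_bytes=8, line_bytes=64):
    """Fraction of each touched cache line that is actually used.

    Assumes a positive, constant stride. Strides smaller than the element
    are treated as contiguous, and strides larger than a line still touch
    one line per element.
    """
    span = min(max(stride_bytes, element_bytes), line_bytes)
    return element_bytes / span

for stride in (8, 16, 64, 4096):
    print(f"stride {stride:5d} bytes: {100 * line_utilization(stride):.1f}% of each line used")
```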

Workaround

Try to reorganize arrays of structures into structures of arrays
Consider permuting loops (see vectorization gain report)

Type of elements and instruction set

No instructions are processing arithmetic or math operations on FP elements. This loop is probably writing/copying data or processing integer elements.

Matching between your loop (in the source code) and the binary loop

The binary loop does not contain any FP arithmetical operations. The binary loop is loading 8 bytes. The binary loop is storing 8 bytes.

General properties

nb instructions	6
nb uops	5
loop length	22
used x86 registers	5
used mmx registers	0
used xmm registers	1
used ymm registers	0
used zmm registers	0
nb stack references	0

Front-end

ASSUMED MACRO FUSION
FIT IN UOP CACHE

micro-operation queue	1.25 cycles
front end	1.25 cycles
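
The 1.25-cycle front-end figure is consistent with 5 uops being issued at 4 uops per cycle. The issue width of 4 is an assumption inferred from these two numbers, not a value stated in the report. The 6 instructions presumably become 5 uops because CMP and JNE are macro-fused, as the "ASSUMED MACRO FUSION" flag suggests.

```python
def front_end_cycles(n_uops, issue_width):
    """Cycles needed to issue n_uops at issue_width uops per cycle."""
    return n_uops / issue_width

N_UOPS = 5       # "nb uops" from the general properties
ISSUE_WIDTH = 4  # assumption: uops delivered per cycle by the uop queue

print("front-end cycles per iteration:", front_end_cycles(N_UOPS, ISSUE_WIDTH))
```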

Back-end

	P0	P1	P2	P3	P4	P5	P6	P7
uops	0.75	0.75	0.83	0.50	1.00	0.75	0.75	0.67
cycles	0.75	0.75	0.83	0.50	1.00	0.75	0.75	0.67

Execution ports to units layout:

P0 (256 bits): VPU, ALU, DIV/SQRT
P1 (256 bits): ALU, VPU
P2 (512 bits): store address, load
P3 (512 bits): store address, load
P4 (512 bits): store data
P5 (512 bits): ALU, VPU
P6: ALU
P7: store address
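
As a sanity check, the per-port pressure can be grouped by the unit types listed above. The sketch below sums the reported port cycles per group; the grouping follows the layout list, while the interpretation in the comments (which instruction feeds which group) is an assumption.

```python
# Per-port cycles from the back-end table.
ports = {"P0": 0.75, "P1": 0.75, "P2": 0.83, "P3": 0.50,
         "P4": 1.00, "P5": 0.75, "P6": 0.75, "P7": 0.67}

def group_pressure(port_cycles, names):
    """Total pressure over a group of ports that share a unit type."""
    return sum(port_cycles[name] for name in names)

# ALU ports: presumably INC, ADD and the fused CMP/JNE (3 uops).
alu = group_pressure(ports, ["P0", "P1", "P5", "P6"])
# Address ports: presumably one load and one store address (2 uops).
address = group_pressure(ports, ["P2", "P3", "P7"])
# Store data port: one store per iteration.
store_data = ports["P4"]

busiest = max(ports, key=ports.get)
print(f"ALU {alu:.2f}, address {address:.2f}, store data {store_data:.2f}")
print(f"busiest port: {busiest} at {ports[busiest]:.2f} cycles")
```

The busiest port, P4 at 1.00 cycles, is below the 1.25-cycle front-end figure, which agrees with the report finding no execution unit bottleneck.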

Cycles executing div or sqrt instructions	NA
Longest recurrence chain latency (RecMII)	1.00

Front-end and detailed OoO resources (UFS)

FE+BE cycles	1.36
Stall cycles	0.00

Cycles summary

Front-end	1.25
Dispatch	1.00
Data deps.	1.00
Overall L1	1.25
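
The overall figure matches the largest of the three components. The sketch below assumes that this is how the summary is combined, an assumption that fits the numbers here but is not stated in the report, and it names the limiting component.

```python
# Components from the cycles summary.
components = {"Front-end": 1.25, "Dispatch": 1.00, "Data deps.": 1.00}

def overall_cycles(parts):
    """Assumed combination rule: the slowest component sets the pace."""
    return max(parts.values())

limiting = max(components, key=components.get)
print("overall L1 cycles:", overall_cycles(components))
print("limiting component:", limiting)
```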

Vectorization ratios

all	0%
load	0%
store	0%
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	NA (no add-sub vectorizable/vectorized instructions)
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

Vector efficiency ratios

all	12%
load	12%
store	12%
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	NA (no add-sub vectorizable/vectorized instructions)
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

Cycles and memory resources usage

Assuming all data fit into the L1 cache, each iteration of the binary loop takes 1.25 cycles. At this rate:

5% of peak load performance is reached (6.40 out of 128.00 bytes loaded per cycle (GB/s @ 1GHz))
10% of peak store performance is reached (6.40 out of 64.00 bytes stored per cycle (GB/s @ 1GHz))
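
These percentages can be reproduced from the bytes moved per iteration and the cycle estimate. The sketch below uses only values quoted in this report; the function names are illustrative.

```python
CYCLES_PER_ITERATION = 1.25
BYTES_LOADED = 8.0
BYTES_STORED = 8.0
PEAK_LOAD_BYTES_PER_CYCLE = 128.0
PEAK_STORE_BYTES_PER_CYCLE = 64.0

def bytes_per_cycle(n_bytes, cycles):
    """Sustained traffic in bytes per cycle."""
    return n_bytes / cycles

def percent_of_peak(rate, peak):
    """Sustained rate as a percentage of the peak rate."""
    return 100.0 * rate / peak

load_rate = bytes_per_cycle(BYTES_LOADED, CYCLES_PER_ITERATION)
store_rate = bytes_per_cycle(BYTES_STORED, CYCLES_PER_ITERATION)

print(f"load:  {load_rate:.2f} B/cycle, {percent_of_peak(load_rate, PEAK_LOAD_BYTES_PER_CYCLE):.0f}% of peak")
print(f"store: {store_rate:.2f} B/cycle, {percent_of_peak(store_rate, PEAK_STORE_BYTES_PER_CYCLE):.0f}% of peak")
print(f"total: {load_rate + store_rate:.2f} B/cycle")
```

The total corresponds to the "Bytes/cycle" value of 12.80 in the metric table.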

Front-end bottlenecks

Performance is limited by instruction throughput, that is, by loading and decoding program instructions for the execution core: the front end is the bottleneck. By removing this bottleneck, you can lower the cost of an iteration from 1.25 to 1.00 cycles (1.25x speedup).

ASM code

In the binary file, the address of the loop is: 42aab0

Instruction	Nb FU	P0	P1	P2	P3	P4	P5	P6	P7	Latency	Recip. throughput	Vectorization
VMOVQ (%R14,%RSI,8),%XMM4	1	0	0	0.50	0.50	0	0	0	0	4-5	0.50	scal (12.5%)
VMOVQ %XMM4,(%R11)	1	0	0	0.33	0.33	1	0	0	0.33	3	1	scal (12.5%)
INC %RSI	1	0.25	0.25	0	0	0	0.25	0.25	0	1	0.25	N/A
ADD %RBX,%R11	1	0.25	0.25	0	0	0	0.25	0.25	0	1	0.25	N/A
CMP %RSI,%RAX	1	0.25	0.25	0	0	0	0.25	0.25	0	1	0.25	N/A
JNE 42aab0 <MAIN__+0xc00>	1	0.50	0	0	0	0	0	0.50	0	0	0.50-1	N/A
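
To show what these six instructions do, the sketch below emulates them on a flat byte array. It is a reading of the listing, not tool output: registers are modelled as plain integers, memory as a bytearray, and the register values in the usage example are hypothetical. The loop is a strided copy that reads contiguous 8-byte elements and writes them %RBX bytes apart.

```python
import struct

def run_tail_loop(mem, r14, rsi, rax, r11, rbx):
    """Emulate the binary loop; returns the final values of RSI and R11."""
    while True:
        src = r14 + rsi * 8
        xmm4 = mem[src:src + 8]    # VMOVQ (%R14,%RSI,8),%XMM4
        mem[r11:r11 + 8] = xmm4    # VMOVQ %XMM4,(%R11)
        rsi += 1                   # INC %RSI
        r11 += rbx                 # ADD %RBX,%R11
        if rsi == rax:             # CMP %RSI,%RAX / JNE 42aab0
            break
    return rsi, r11

# Usage with hypothetical values: copy 3 doubles from offset 0
# to offset 64, writing one element every 24 bytes.
memory = bytearray(256)
struct.pack_into("<3d", memory, 0, 1.0, 2.0, 3.0)
run_tail_loop(memory, r14=0, rsi=0, rax=3, r11=64, rbx=24)
print([struct.unpack_from("<d", memory, 64 + 24 * i)[0] for i in range(3)])
```

Because the exit test comes after the body, the loop always executes at least one iteration, so the surrounding code is expected to skip it when no tail elements remain.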
