[ovs-dev,RFC,2/5] configure: Include -mprefetchwt1 explicitly.

Message ID 1512418610-84032-2-git-send-email-bhanuprakash.bodireddy@intel.com
State New
Series
  • [ovs-dev,RFC,1/5] compiler: Introduce OVS_PREFETCH variants.

Commit Message

Bodireddy, Bhanuprakash Dec. 4, 2017, 8:16 p.m.
Processors support prefetch instructions in anticipation of a write, but
compilers (gcc) won't emit them unless explicitly asked to do so, even
with '-march=native' specified.

[Problem]
  Case A:
    OVS_PREFETCH_CACHE(addr, OPCH_HTW)
       __builtin_prefetch(addr, 1, 3)
         leaq    -112(%rbp), %rax        [Assembly]
         prefetchw  (%rax)

  Case B:
    OVS_PREFETCH_CACHE(addr, OPCH_LTW)
       __builtin_prefetch(addr, 1, 1)
         leaq    -112(%rbp), %rax        [Assembly]
         prefetchw  (%rax)             <***problem***>

  In spite of specifying -march=native and using Low Temporal Write (OPCH_LTW),
  the compiler generates a 'prefetchw' instruction instead of the 'prefetchwt1'
  instruction available on the processor.

[Solution]
  Include -mprefetchwt1

  Case B:
    OVS_PREFETCH_CACHE(addr, OPCH_LTW)
       __builtin_prefetch(addr, 1, 1)
         leaq    -112(%rbp), %rax        [Assembly]
         prefetchwt1  (%rax)

[Testing]
  $ ./boot.sh
  $ ./configure
     checking target hint for cgcc... x86_64
     checking whether gcc accepts -mprefetchwt1... yes
  $ make -j

Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
---
 configure.ac | 1 +
 1 file changed, 1 insertion(+)
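
The two cases in the commit message can be illustrated with a minimal sketch. It assumes only what the message shows: the high temporal write hint maps to __builtin_prefetch(addr, 1, 3) and the low temporal write hint maps to __builtin_prefetch(addr, 1, 1). The names sketch_prefetch_htw, sketch_prefetch_ltw, sketch_fill and SKETCH_PREFETCH_AHEAD are hypothetical and are not the OVS_PREFETCH_CACHE implementation from patch 1/5.

```c
#include <stddef.h>

/* Hypothetical look-ahead distance, in elements. */
#define SKETCH_PREFETCH_AHEAD 16

/* Write prefetch, high temporal locality (the "Case A" builtin call).
 * The rw and locality arguments must be compile-time constants. */
static inline void sketch_prefetch_htw(const void *addr)
{
    __builtin_prefetch(addr, 1, 3);
}

/* Write prefetch, low temporal locality (the "Case B" builtin call). */
static inline void sketch_prefetch_ltw(const void *addr)
{
    __builtin_prefetch(addr, 1, 1);
}

/* Fill a buffer once while hinting upcoming stores. Returns the sum of
 * the values written so the result can be checked. */
static long sketch_fill(int *buf, size_t n, int value)
{
    long sum = 0;

    for (size_t i = 0; i < n; i++) {
        /* Only prefetch addresses that lie inside the buffer. */
        if (i + SKETCH_PREFETCH_AHEAD < n) {
            sketch_prefetch_ltw(&buf[i + SKETCH_PREFETCH_AHEAD]);
        }
        buf[i] = value;
        sum += value;
    }
    return sum;
}
```

Which instruction each call becomes depends on the target flags, which is the point of this patch; the sketch only fixes the builtin arguments.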

Comments

Ben Pfaff Dec. 4, 2017, 8:37 p.m. | #1
On Mon, Dec 04, 2017 at 08:16:47PM +0000, Bhanuprakash Bodireddy wrote:
> Processors support prefetch instruction in anticipation of write but
> compilers(gcc) won't use them unless explicitly asked to do so even
> with '-march=native' specified.
> 
> [Problem]
>   Case A:
>     OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>        __builtin_prefetch(addr, 1, 3)
>          leaq    -112(%rbp), %rax        [Assembly]
>          prefetchw  (%rax)
> 
>   Case B:
>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>        __builtin_prefetch(addr, 1, 1)
>          leaq    -112(%rbp), %rax        [Assembly]
>          prefetchw  (%rax)             <***problem***>
> 
>   Inspite of specifying -march=native and using Low Temporal Write(OPCH_LTW),
>   the compiler generates 'prefetchw' instruction instead of 'prefetchwt1'
>   instruction available on processor.
> 
> [Solution]
>   Include -mprefetchwt1
> 
>   Case B:
>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>        __builtin_prefetch(addr, 1, 1)
>          leaq    -112(%rbp), %rax        [Assembly]
>          prefetchwt1  (%rax)
> 
> [Testing]
>   $ ./boot.sh
>   $ ./configure
>      checking target hint for cgcc... x86_64
>      checking whether gcc accepts -mprefetchwt1... yes
>   $ make -j
> 
> Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>

Does this have any effect if the architecture or CPU configured for use
does not support prefetchwt1?  If it could lead to that situation, then
this does not seem like the right thing to do, and we might want to fall
back to recommending use of the option when the person building knows
that the software will run on a machine with prefetchwt1.
Bodireddy, Bhanuprakash Dec. 4, 2017, 8:59 p.m. | #2
>On Mon, Dec 04, 2017 at 08:16:47PM +0000, Bhanuprakash Bodireddy wrote:
>> Processors support prefetch instruction in anticipation of write but
>> compilers(gcc) won't use them unless explicitly asked to do so even
>> with '-march=native' specified.
>>
>> [Problem]
>>   Case A:
>>     OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>>        __builtin_prefetch(addr, 1, 3)
>>          leaq    -112(%rbp), %rax        [Assembly]
>>          prefetchw  (%rax)
>>
>>   Case B:
>>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>>        __builtin_prefetch(addr, 1, 1)
>>          leaq    -112(%rbp), %rax        [Assembly]
>>          prefetchw  (%rax)             <***problem***>
>>
>>   Inspite of specifying -march=native and using Low Temporal
>Write(OPCH_LTW),
>>   the compiler generates 'prefetchw' instruction instead of 'prefetchwt1'
>>   instruction available on processor.
>>
>> [Solution]
>>   Include -mprefetchwt1
>>
>>   Case B:
>>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>>        __builtin_prefetch(addr, 1, 1)
>>          leaq    -112(%rbp), %rax        [Assembly]
>>          prefetchwt1  (%rax)
>>
>> [Testing]
>>   $ ./boot.sh
>>   $ ./configure
>>      checking target hint for cgcc... x86_64
>>      checking whether gcc accepts -mprefetchwt1... yes
>>   $ make -j
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> <bhanuprakash.bodireddy@intel.com>
>
>Does this have any effect if the architecture or CPU configured for use does
>not support prefetchwt1?

That's a good question, and I spent a fair amount of time today figuring this out.
I have Haswell, Broadwell and Skylake CPUs and they all support this instruction. However, I found that it isn't enabled by default even with -march=native, so it needs to be enabled explicitly.

Coming to your question, there won't be side effects from using OPCH_LTW.
On processors that support *neither* PREFETCHW nor PREFETCHWT1, the compiler generates a 'prefetcht1' instruction.
On processors that support PREFETCHW, the compiler generates a 'prefetchw' instruction.
On processors that support both PREFETCHW and PREFETCHWT1, the compiler generates a 'prefetchwt1' instruction when -mprefetchwt1 is explicitly enabled.

>If it could lead to that situation, then this does not
>seem like the right thing to do, and we might want to fall back to
>recommending use of the option when the person building knows that the
>software will run on a machine with prefetchwt1.

According to the above, on processors that don't support this instruction, a 'prefetcht1' instruction is generated, with no side effects.
I verified this using https://gcc.godbolt.org/ by carefully checking the instructions generated for different compiler versions and -march flags.

- Bhanuprakash.
Ben Pfaff Dec. 4, 2017, 9:01 p.m. | #3
On Mon, Dec 04, 2017 at 08:59:47PM +0000, Bodireddy, Bhanuprakash wrote:
> >On Mon, Dec 04, 2017 at 08:16:47PM +0000, Bhanuprakash Bodireddy wrote:
> >> Processors support prefetch instruction in anticipation of write but
> >> compilers(gcc) won't use them unless explicitly asked to do so even
> >> with '-march=native' specified.
> >>
> >> [Problem]
> >>   Case A:
> >>     OVS_PREFETCH_CACHE(addr, OPCH_HTW)
> >>        __builtin_prefetch(addr, 1, 3)
> >>          leaq    -112(%rbp), %rax        [Assembly]
> >>          prefetchw  (%rax)
> >>
> >>   Case B:
> >>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
> >>        __builtin_prefetch(addr, 1, 1)
> >>          leaq    -112(%rbp), %rax        [Assembly]
> >>          prefetchw  (%rax)             <***problem***>
> >>
> >>   Inspite of specifying -march=native and using Low Temporal
> >Write(OPCH_LTW),
> >>   the compiler generates 'prefetchw' instruction instead of 'prefetchwt1'
> >>   instruction available on processor.
> >>
> >> [Solution]
> >>   Include -mprefetchwt1
> >>
> >>   Case B:
> >>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
> >>        __builtin_prefetch(addr, 1, 1)
> >>          leaq    -112(%rbp), %rax        [Assembly]
> >>          prefetchwt1  (%rax)
> >>
> >> [Testing]
> >>   $ ./boot.sh
> >>   $ ./configure
> >>      checking target hint for cgcc... x86_64
> >>      checking whether gcc accepts -mprefetchwt1... yes
> >>   $ make -j
> >>
> >> Signed-off-by: Bhanuprakash Bodireddy
> >> <bhanuprakash.bodireddy@intel.com>
> >
> >Does this have any effect if the architecture or CPU configured for use does
> >not support prefetchwt1?
> 
> That's a good question and I spent reasonable time today to figure this out.
> I have Haswell, Broadwell and Skylake CPUs and they all support this instruction.  But I found that this instruction isn't enabled by default even with march=native and so need to explicitly enable this.
> 
> Coming to your question, there won't be side effects on using OPCH_LTW.
> On Processors that *doesn't* support PREFETCHW and PREFETCHWT1 the compiler generates a 'prefetcht1' instruction.
> On processors that support PREFETCHW the compiler generates 'prefetchw' instruction.
> On processors that support PREFETCHW & PREFETCHWT1, the compiler generates 'prefetchwt1' instruction with -mprefetchwt1 explicitly enabled.
> 
> >If it could lead to that situation, then this does not
> >seem like the right thing to do, and we might want to fall back to
> >recommending use of the option when the person building knows that the
> >software will run on a machine with prefetchwt1.
> 
> According to above on processors that doesn't have this instruction support, 'prefetchnt1' instruction would be generated and doesn't have side effects.
> I verified this using https://gcc.godbolt.org/  and carefully checking the instructions generated for different compiler versions and march flags.

OK.  That is good reassurance, then, so:
        Acked-by: Ben Pfaff <blp@ovn.org>

Beyond the comments I've made already, I don't expect to review this
series myself.  Thanks for all the work you've put into it.
Ilya Maximets Dec. 5, 2017, 9:24 a.m. | #4
>>On Mon, Dec 04, 2017 at 08:16:47PM +0000, Bhanuprakash Bodireddy wrote:
>>> Processors support prefetch instruction in anticipation of write but
>>> compilers(gcc) won't use them unless explicitly asked to do so even
>>> with '-march=native' specified.
>>>
>>> [Problem]
>>>   Case A:
>>>     OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>>>        __builtin_prefetch(addr, 1, 3)
>>>          leaq    -112(%rbp), %rax        [Assembly]
>>>          prefetchw  (%rax)
>>>
>>>   Case B:
>>>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>>>        __builtin_prefetch(addr, 1, 1)
>>>          leaq    -112(%rbp), %rax        [Assembly]
>>>          prefetchw  (%rax)             <***problem***>
>>>
>>>   Inspite of specifying -march=native and using Low Temporal
>>Write(OPCH_LTW),
>>>   the compiler generates 'prefetchw' instruction instead of 'prefetchwt1'
>>>   instruction available on processor.
>>>
>>> [Solution]
>>>   Include -mprefetchwt1
>>>
>>>   Case B:
>>>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>>>        __builtin_prefetch(addr, 1, 1)
>>>          leaq    -112(%rbp), %rax        [Assembly]
>>>          prefetchwt1  (%rax)
>>>
>>> [Testing]
>>>   $ ./boot.sh
>>>   $ ./configure
>>>      checking target hint for cgcc... x86_64
>>>      checking whether gcc accepts -mprefetchwt1... yes
>>>   $ make -j
>>>
>>> Signed-off-by: Bhanuprakash Bodireddy
>>> <bhanuprakash.bodireddy at intel.com>
>>
>>Does this have any effect if the architecture or CPU configured for use does
>>not support prefetchwt1?
> 
> That's a good question and I spent reasonable time today to figure this out.
> I have Haswell, Broadwell and Skylake CPUs and they all support this instruction.

Hmm. I have 2 different Broadwell machines (Xeon E5 v4 and i7-6800K) and neither of them
has the prefetchwt1 instruction according to cpuid:
	
	PREFETCHWT1                              = false

This means that introducing this change will break binary compatibility even between
CPUs of the same generation, i.e. I will not be able to run binaries compiled on your
system on mine.

If that's true, I'd prefer not to have this change.

Anyway, adding this change will make it impossible to compile a generic binary for
different platforms if the build server supports prefetchwt1. There should be a
way to disable this arch-specific compiler flag even if it is supported on the current
platform.
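
The cpuid output above can also be reproduced from C. The following is a minimal sketch, assuming a GCC or Clang toolchain that ships <cpuid.h> with __get_cpuid_count, and assuming the usual feature-bit locations: CPUID.(EAX=07H,ECX=0):ECX bit 0 for PREFETCHWT1 and CPUID.80000001H:ECX bit 8 for PREFETCHW. The function names cpu_has_prefetchwt1 and cpu_has_prefetchw are hypothetical.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Returns true if the running CPU reports PREFETCHWT1.
 * Assumed bit: CPUID.(EAX=07H,ECX=0):ECX[0]. */
static bool cpu_has_prefetchwt1(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid_count() returns 0 if leaf 7 is not available. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    return (ecx & 1u) != 0;
#else
    return false;   /* Not an x86 target. */
#endif
}

/* Returns true if the running CPU reports PREFETCHW.
 * Assumed bit: CPUID.80000001H:ECX[8]. */
static bool cpu_has_prefetchw(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    return ((ecx >> 8) & 1u) != 0;
#else
    return false;   /* Not an x86 target. */
#endif
}
```

A check like this describes the machine it runs on, not the machine the binary was built on, which is the distinction behind the compatibility concern.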

Best regards, Ilya Maximets.

> But I found that this instruction isn't enabled by default even with march=native and so need to explicitly enable this.
> 
> Coming to your question, there won't be side effects on using OPCH_LTW.
> On Processors that *doesn't* support PREFETCHW and PREFETCHWT1 the compiler generates a 'prefetcht1' instruction.
> On processors that support PREFETCHW the compiler generates 'prefetchw' instruction.
> On processors that support PREFETCHW & PREFETCHWT1, the compiler generates 'prefetchwt1' instruction with -mprefetchwt1 explicitly enabled.
> 
>>If it could lead to that situation, then this does not
>>seem like the right thing to do, and we might want to fall back to
>>recommending use of the option when the person building knows that the
>>software will run on a machine with prefetchwt1.
> 
> According to above on processors that doesn't have this instruction support, 'prefetchnt1' instruction would be generated and doesn't have side effects.
> I verified this using https://gcc.godbolt.org/  and carefully checking the instructions generated for different compiler versions and march flags.
> 
> - Bhanuprakash.
Bodireddy, Bhanuprakash Dec. 5, 2017, 1:54 p.m. | #5
>>>On Mon, Dec 04, 2017 at 08:16:47PM +0000, Bhanuprakash Bodireddy
>wrote:
>>>> Processors support prefetch instruction in anticipation of write but
>>>> compilers(gcc) won't use them unless explicitly asked to do so even
>>>> with '-march=native' specified.
>>>>
>>>> [Problem]
>>>>   Case A:
>>>>     OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>>>>        __builtin_prefetch(addr, 1, 3)
>>>>          leaq    -112(%rbp), %rax        [Assembly]
>>>>          prefetchw  (%rax)
>>>>
>>>>   Case B:
>>>>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>>>>        __builtin_prefetch(addr, 1, 1)
>>>>          leaq    -112(%rbp), %rax        [Assembly]
>>>>          prefetchw  (%rax)             <***problem***>
>>>>
>>>>   Inspite of specifying -march=native and using Low Temporal
>>>Write(OPCH_LTW),
>>>>   the compiler generates 'prefetchw' instruction instead of 'prefetchwt1'
>>>>   instruction available on processor.
>>>>
>>>> [Solution]
>>>>   Include -mprefetchwt1
>>>>
>>>>   Case B:
>>>>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>>>>        __builtin_prefetch(addr, 1, 1)
>>>>          leaq    -112(%rbp), %rax        [Assembly]
>>>>          prefetchwt1  (%rax)
>>>>
>>>> [Testing]
>>>>   $ ./boot.sh
>>>>   $ ./configure
>>>>      checking target hint for cgcc... x86_64
>>>>      checking whether gcc accepts -mprefetchwt1... yes
>>>>   $ make -j
>>>>
>>>> Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy at
>>>> intel.com>
>>>
>>>Does this have any effect if the architecture or CPU configured for
>>>use does not support prefetchwt1?
>>
>> That's a good question and I spent reasonable time today to figure this out.
>> I have Haswell, Broadwell and Skylake CPUs and they all support this
>instruction.
>
>Hmm. I have 2 different Broadwell machines (Xeon E5 v4 and i7-6800K) and
>both of them doesn't have prefetchwt1 instruction according to cpuid:
>
>	PREFETCHWT1                              = false

Xeon E5-26XX v4 is a Broadwell workstation/server part, but the i7-6800K is a Skylake desktop variant, whereas E3-12XX v5 is the equivalent Skylake workstation/server variant.
AFAIK, prefetchwt1 should be available on the above processors; I'm not sure why cpuid reports otherwise.

pmd_thread_main()
-------------------------------------------------------------------------------------------
With OPCH_HTW, we see the prefetchw instruction.

OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_HTW);
    cycles_count_start(pmd);
    for (;;) {
        for (i = 0; i < poll_cnt; i++) {
            process_packets =
                dp_netdev_process_rxq_port(pmd, poll_list[i].rxq->rx,
                                           poll_list[i].port_no);
            cycles_count_intermediate(pmd, poll_list[i].rxq,


Address	Source Line	Assembly	
0x6e29ef	4,086	movl  0x823ecb(%rip), %edi							
0x6e29f5	4,085	movq  0x50(%rsp), %rax							
0x6e29fa	4,086	test %edi, %edi							
0x6e29fc	4,085	prefetchwz  (%rax)							
----------------------------------------------------------------------------------------
With OPCH_LTW, we can see the prefetchwt1b instruction being used (change made to show this).

OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_LTW);
    cycles_count_start(pmd);
    for (;;) {
        for (i = 0; i < poll_cnt; i++) {
            ..........

Address	Source Line	Assembly	
0x6e29ef	4,086	movl  0x823ecb(%rip), %edi							
0x6e29f5	4,085	movq  0x50(%rsp), %rax							
0x6e29fa	4,086	test %edi, %edi							
0x6e29fc	4,085	prefetchwt1b  (%rax)							
-----------------------------------------------------------------------------------------

>
>This means that introducing of this change will break binary compatibility even
>between CPUs of the same generation, i.e. I will not be able to run on my
>system binaries compiled on yours.
>
>If it's true I prefer to not have this change.
>
>Anyway adding of this change will make compiling a generic binary for a
>different platforms impossible if your build server supports prefetchwt1.
>There should be way to disable this arch specific compiler flag even if it
>supported on my current platform.

I see your point: a build server can be more advanced and support the prefetchwt1 instruction.
When I copy the precompiled binaries to a server that doesn't support it and run them, how do they behave?

I'm not sure about this. Maybe Red Hat/Canonical developers can comment on how they handle this kind of case.

I will try to check this on my side.

- Bhanuprakash.

>
>Best regards, Ilya Maximets.
>
>> But I found that this instruction isn't enabled by default even with
>march=native and so need to explicitly enable this.
>>
>> Coming to your question, there won't be side effects on using OPCH_LTW.
>> On Processors that *doesn't* support PREFETCHW and PREFETCHWT1 the
>compiler generates a 'prefetcht1' instruction.
>> On processors that support PREFETCHW the compiler generates 'prefetchw'
>instruction.
>> On processors that support PREFETCHW & PREFETCHWT1, the compiler
>generates 'prefetchwt1' instruction with -mprefetchwt1 explicitly enabled.
>>
>>>If it could lead to that situation, then this does not seem like the
>>>right thing to do, and we might want to fall back to recommending use
>>>of the option when the person building knows that the software will
>>>run on a machine with prefetchwt1.
>>
>> According to above on processors that doesn't have this instruction support,
>'prefetchnt1' instruction would be generated and doesn't have side effects.
>> I verified this using https://gcc.godbolt.org/  and carefully checking the
>instructions generated for different compiler versions and march flags.
>>
>> - Bhanuprakash.
Ilya Maximets Dec. 5, 2017, 3 p.m. | #6
On 05.12.2017 16:54, Bodireddy, Bhanuprakash wrote:
>>>> On Mon, Dec 04, 2017 at 08:16:47PM +0000, Bhanuprakash Bodireddy
>> wrote:
>>>>> Processors support prefetch instruction in anticipation of write but
>>>>> compilers(gcc) won't use them unless explicitly asked to do so even
>>>>> with '-march=native' specified.
>>>>>
>>>>> [Problem]
>>>>>   Case A:
>>>>>     OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>>>>>        __builtin_prefetch(addr, 1, 3)
>>>>>          leaq    -112(%rbp), %rax        [Assembly]
>>>>>          prefetchw  (%rax)
>>>>>
>>>>>   Case B:
>>>>>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>>>>>        __builtin_prefetch(addr, 1, 1)
>>>>>          leaq    -112(%rbp), %rax        [Assembly]
>>>>>          prefetchw  (%rax)             <***problem***>
>>>>>
>>>>>   Inspite of specifying -march=native and using Low Temporal
>>>> Write(OPCH_LTW),
>>>>>   the compiler generates 'prefetchw' instruction instead of 'prefetchwt1'
>>>>>   instruction available on processor.
>>>>>
>>>>> [Solution]
>>>>>   Include -mprefetchwt1
>>>>>
>>>>>   Case B:
>>>>>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>>>>>        __builtin_prefetch(addr, 1, 1)
>>>>>          leaq    -112(%rbp), %rax        [Assembly]
>>>>>          prefetchwt1  (%rax)
>>>>>
>>>>> [Testing]
>>>>>   $ ./boot.sh
>>>>>   $ ./configure
>>>>>      checking target hint for cgcc... x86_64
>>>>>      checking whether gcc accepts -mprefetchwt1... yes
>>>>>   $ make -j
>>>>>
>>>>> Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy at
>>>>> intel.com>
>>>>
>>>> Does this have any effect if the architecture or CPU configured for
>>>> use does not support prefetchwt1?
>>>
>>> That's a good question and I spent reasonable time today to figure this out.
>>> I have Haswell, Broadwell and Skylake CPUs and they all support this
>> instruction.
>>
>> Hmm. I have 2 different Broadwell machines (Xeon E5 v4 and i7-6800K) and
>> both of them doesn't have prefetchwt1 instruction according to cpuid:
>>
>> 	PREFETCHWT1                              = false
> 
> Xeon E5-26XX v4 is Broadwell workstation/server but i7-6800k is Skylake Desktop variant where as E3-12XX v5 is equivalent skylake workstation/server variant.
> AFAIK, prefetchwt1 should be available on above processors, not sure why cpuid displays it otherwise.

That is totally weird. I tried to compile the following simple program:

int main()
{
        int c;

        __builtin_prefetch(&c, 1, 1);
        c = 8;

        return c;
}

on my old Ivy Bridge i7-3770 CPU. It does not support even 'prefetchw':

      PREFETCHWT1                              = false
      3DNow! PREFETCH/PREFETCHW instructions = false

Results:

$ gcc 1.c 
$ objdump -S ./a.out | grep prefetch -A2 -B2
  40055b:       31 c0                   xor    %eax,%eax
  40055d:       48 8d 45 f4             lea    -0xc(%rbp),%rax
  400561:       0f 18 18                prefetcht2 (%rax)
  400564:       c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
  40056b:       8b 45 f4                mov    -0xc(%rbp),%eax

$ gcc 1.c -march=native
$ objdump -S ./a.out | grep prefetch -A2 -B2
  40055b:       31 c0                   xor    %eax,%eax
  40055d:       48 8d 45 f4             lea    -0xc(%rbp),%rax
  400561:       0f 18 18                prefetcht2 (%rax)
  400564:       c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
  40056b:       8b 45 f4                mov    -0xc(%rbp),%eax

$ gcc 1.c -march=native -mprefetchwt1
$ objdump -S ./a.out | grep prefetch -A2 -B2
  40055b:       31 c0                   xor    %eax,%eax
  40055d:       48 8d 45 f4             lea    -0xc(%rbp),%rax
  400561:       0f 0d 10                prefetchwt1 (%rax)
  400564:       c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
  40056b:       8b 45 f4                mov    -0xc(%rbp),%eax

So, it inserts this instruction even if I have no such instruction in CPU.
More interesting is that the program still works without any issues.
I assume that the CPU just skips that instruction or executes something else.

So, it's really strange and it's unclear what the CPU really executes in
the case where we have 'prefetchwt1' in the code but it is not supported by the CPU.
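
The byte sequences in the objdump output hint at why this works: the prefetch variants share a two-byte opcode and differ only in the reg field of the ModRM byte. Below is a minimal sketch of that decoding, assuming the commonly documented encodings (0F 18 /0../3 for prefetchnta, prefetcht0, prefetcht1, prefetcht2; 0F 0D /1 for prefetchw and 0F 0D /2 for prefetchwt1). The enum and function names are hypothetical, and only the simple "opcode, opcode, ModRM" form without prefixes is handled.

```c
#include <stddef.h>
#include <stdint.h>

enum prefetch_kind {
    PF_UNKNOWN,
    PF_NTA, PF_T0, PF_T1, PF_T2,   /* 0F 18 /0../3 */
    PF_3DNOW,                      /* 0F 0D /0 (AMD 3DNow! PREFETCH) */
    PF_W,                          /* 0F 0D /1 */
    PF_WT1                         /* 0F 0D /2 */
};

/* Classify a 3-byte prefetch encoding by its ModRM reg field. */
static enum prefetch_kind decode_prefetch(const uint8_t *insn, size_t len)
{
    if (len < 3 || insn[0] != 0x0f) {
        return PF_UNKNOWN;
    }

    unsigned int mod = insn[2] >> 6;         /* ModRM bits 7..6 */
    unsigned int reg = (insn[2] >> 3) & 7;   /* ModRM bits 5..3 */

    if (mod == 3) {
        return PF_UNKNOWN;   /* Prefetches take a memory operand. */
    }

    if (insn[1] == 0x18) {
        switch (reg) {
        case 0: return PF_NTA;
        case 1: return PF_T0;
        case 2: return PF_T1;
        case 3: return PF_T2;
        default: return PF_UNKNOWN;
        }
    }
    if (insn[1] == 0x0d) {
        switch (reg) {
        case 0: return PF_3DNOW;
        case 1: return PF_W;
        case 2: return PF_WT1;
        default: return PF_UNKNOWN;
        }
    }
    return PF_UNKNOWN;
}
```

Applied to the bytes above, "0f 18 18" has reg = 3 and "0f 0d 10" has reg = 2, matching the prefetcht2 and prefetchwt1 mnemonics that objdump printed. What a CPU without PREFETCHWT1 does with the second encoding is a property of the hardware and cannot be derived from the bytes alone.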

If the CPU just skips this instruction, we will lose all the prefetching optimizations
because all the calls will be replaced by the non-existent 'prefetchwt1'.

How can we be sure that 'prefetchwt1' was really executed?

Best regards, Ilya Maximets.

> 
> pmd_thread_main()
> -------------------------------------------------------------------------------------------
> WITH OPCH_HTW, we see prefetchw instruction. 
> 
> OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_HTW);
>     cycles_count_start(pmd);
>     for (;;) {
>         for (i = 0; i < poll_cnt; i++) {
>             process_packets =
>                 dp_netdev_process_rxq_port(pmd, poll_list[i].rxq->rx,
>                                            poll_list[i].port_no);
>             cycles_count_intermediate(pmd, poll_list[i].rxq,
> 
> 
> Address	Source Line	Assembly	
> 0x6e29ef	4,086	movl  0x823ecb(%rip), %edi							
> 0x6e29f5	4,085	movq  0x50(%rsp), %rax							
> 0x6e29fa	4,086	test %edi, %edi							
> 0x6e29fc	4,085	prefetchwz  (%rax)							
> ----------------------------------------------------------------------------------------
> With OPCH_LTW, we can see prefetchwt1b instruction being used(change made to show this).
> 
> OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_LTW);
>     cycles_count_start(pmd);
>     for (;;) {
>         for (i = 0; i < poll_cnt; i++) {
>             ..........
> 
> Address	Source Line	Assembly	
> 0x6e29ef	4,086	movl  0x823ecb(%rip), %edi							
> 0x6e29f5	4,085	movq  0x50(%rsp), %rax							
> 0x6e29fa	4,086	test %edi, %edi							
> 0x6e29fc	4,085	prefetchwt1b  (%rax)							
> -----------------------------------------------------------------------------------------
> 
>>
>> This means that introducing of this change will break binary compatibility even
>> between CPUs of the same generation, i.e. I will not be able to run on my
>> system binaries compiled on yours.
>>
>> If it's true I prefer to not have this change.
>>
>> Anyway adding of this change will make compiling a generic binary for a
>> different platforms impossible if your build server supports prefetchwt1.
>> There should be way to disable this arch specific compiler flag even if it
>> supported on my current platform.
> 
> I see your point where a build server can be advanced and supports the prefetchwt1 instruction
> and when I copy and run the precompiled binaries on a server not supporting it, how does this behave?
> 
> Not sure on this. May be Redhat/canonical developers can comment on how they handle this kind of cases.
> 
> I will try to check this on my side.
> 
> - Bhanuprakash.
> 
>>
>> Best regards, Ilya Maximets.
>>
>>> But I found that this instruction isn't enabled by default even with
>> march=native and so need to explicitly enable this.
>>>
>>> Coming to your question, there won't be side effects on using OPCH_LTW.
>>> On Processors that *doesn't* support PREFETCHW and PREFETCHWT1 the
>> compiler generates a 'prefetcht1' instruction.
>>> On processors that support PREFETCHW the compiler generates 'prefetchw'
>> instruction.
>>> On processors that support PREFETCHW & PREFETCHWT1, the compiler
>> generates 'prefetchwt1' instruction with -mprefetchwt1 explicitly enabled.
>>>
>>>> If it could lead to that situation, then this does not seem like the
>>>> right thing to do, and we might want to fall back to recommending use
>>>> of the option when the person building knows that the software will
>>>> run on a machine with prefetchwt1.
>>>
>>> According to above on processors that doesn't have this instruction support,
>> 'prefetchnt1' instruction would be generated and doesn't have side effects.
>>> I verified this using https://gcc.godbolt.org/  and carefully checking the
>> instructions generated for different compiler versions and march flags.
>>>
>>> - Bhanuprakash.
Bodireddy, Bhanuprakash Dec. 5, 2017, 4:19 p.m. | #7
[...]
>int main()
>{
>        int c;
>
>        __builtin_prefetch(&c, 1, 1);
>        c = 8;
>
>        return c;
>}
>
>on my old Ivy Bridge i7-3770 CPU. It does not support even 'prefetchw':
>
>      PREFETCHWT1                              = false
>      3DNow! PREFETCH/PREFETCHW instructions = false
>
>Results:

[Bhanu] I found https://gcc.godbolt.org/ the other day, and it's handy for generating code for different targets and compilers.

>$ gcc 1.c
>$ objdump -S ./a.out | grep prefetch -A2 -B2
>  40055b:       31 c0                   xor    %eax,%eax
>  40055d:       48 8d 45 f4             lea    -0xc(%rbp),%rax
>  400561:       0f 18 18                prefetcht2 (%rax)
>  400564:       c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
>  40056b:       8b 45 f4                mov    -0xc(%rbp),%eax

[Bhanu] As expected, the compiler generates prefetcht2.

>
>$ gcc 1.c -march=native
>$ objdump -S ./a.out | grep prefetch -A2 -B2
>  40055b:       31 c0                   xor    %eax,%eax
>  40055d:       48 8d 45 f4             lea    -0xc(%rbp),%rax
>  400561:       0f 18 18                prefetcht2 (%rax)
>  400564:       c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
>  40056b:       8b 45 f4                mov    -0xc(%rbp),%eax

[Bhanu] Though -march=native is specified, the processor doesn't have the instruction, so the compiler still generates prefetcht2.

>$ gcc 1.c -march=native -mprefetchwt1
>$ objdump -S ./a.out | grep prefetch -A2 -B2
>  40055b:       31 c0                   xor    %eax,%eax
>  40055d:       48 8d 45 f4             lea    -0xc(%rbp),%rax
>  400561:       0f 0d 10                prefetchwt1 (%rax)
>  400564:       c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
>  40056b:       8b 45 f4                mov    -0xc(%rbp),%eax

[Bhanu] The compiler inserts prefetchwt1 instruction as we asked it to do.

>
>So, it inserts this instruction even if I have no such instruction in CPU.

[Bhanu]
Though the compiler generates this, the instruction isn't available on the processor, so it just becomes a multi-byte no-operation (NOP).
On processors (Intel) that don't have prefetchw or the 3DNow! feature (AMD), it decodes into a NOP.
http://ref.x86asm.net/coder64.html#x0F0D
	- Click on '0D' in the two-byte opcode index (16. 0F0D NOP)
	- More information on this can be found in the Intel SW developer's manual (combined volumes)

>More interesting is that program still works without any issues.
>I assume that CPU just skips that instruction or executes something else.

[Bhanu] This is mostly what is expected. On processors that support prefetchwt1 it executes; on others it just becomes a NOP.

>
>So, it's really strange and it's unclear what CPU really executes in case where
>we have 'prefetchwt1' in code but not supported by CPU.

[Bhanu] It's probably decoded into a NOP by the pipeline's decode units.

>
>If CPU just skips this instruction we will lost all the prefetching optimizations
>because all the calls will be replaced by non-existent 'prefetchwt1'.

[Bhanu] I would be worried if the core generated an exception, treating it as an illegal instruction. Instead, the pipeline units treat it as a NOP if they don't support it.
So the micro-optimizations don't really do anything on processors that don't support it.

>
>How can we be sure that 'prefetchwt1' was really executed?

[Bhanu] I don't know how we can see this unless we can peek into the instruction queues and decoders of the pipeline :(.

- Bhanuprakash.
Ilya Maximets Dec. 7, 2017, 2:21 p.m. | #8
On 05.12.2017 19:19, Bodireddy, Bhanuprakash wrote:
> [...]
>> int main()
>> {
>>        int c;
>>
>>        __builtin_prefetch(&c, 1, 1);
>>        c = 8;
>>
>>        return c;
>> }
>>
>> on my old Ivy Bridge i7-3770 CPU. It does not support even 'prefetchw':
>>
>>      PREFETCHWT1                              = false
>>      3DNow! PREFETCH/PREFETCHW instructions = false
>>
>> Results:
> 
> [Bhanu] I  found https://gcc.godbolt.org/ the other day and its handy to generate code for different targets and compilers.
> 
>> $ gcc 1.c
>> $ objdump -S ./a.out | grep prefetch -A2 -B2
>>  40055b:       31 c0                   xor    %eax,%eax
>>  40055d:       48 8d 45 f4             lea    -0xc(%rbp),%rax
>>  400561:       0f 18 18                prefetcht2 (%rax)
>>  400564:       c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
>>  40056b:       8b 45 f4                mov    -0xc(%rbp),%eax
> 
> [Bhanu] Expected and compiler generates prefetcht2.
> 
>>
>> $ gcc 1.c -march=native
>> $ objdump -S ./a.out | grep prefetch -A2 -B2
>>  40055b:       31 c0                   xor    %eax,%eax
>>  40055d:       48 8d 45 f4             lea    -0xc(%rbp),%rax
>>  400561:       0f 18 18                prefetcht2 (%rax)
>>  400564:       c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
>>  40056b:       8b 45 f4                mov    -0xc(%rbp),%eax
> 
> [Bhanu] Though march=native is specified the processor doesn't  have it and still prefetchnt2 is generated by compiler.
> 
>> $ gcc 1.c -march=native -mprefetchwt1
>> $ objdump -S ./a.out | grep prefetch -A2 -B2
>>  40055b:       31 c0                   xor    %eax,%eax
>>  40055d:       48 8d 45 f4             lea    -0xc(%rbp),%rax
>>  400561:       0f 0d 10                prefetchwt1 (%rax)
>>  400564:       c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
>>  40056b:       8b 45 f4                mov    -0xc(%rbp),%eax
> 
> [Bhanu] The compiler inserts prefetchwt1 instruction as we asked it to do.
> 
>>
>> So, it inserts this instruction even if I have no such instruction in CPU.
> 
> [Bhanu]
> Though the compiler generates this instruction, it isn't available on the processor, so it just becomes a multi-byte no-operation (NOP).
> On processors (Intel) that don't have prefetchw or the 3DNow! feature (AMD), it decodes into a NOP.
> http://ref.x86asm.net/coder64.html#x0F0D
>   - Click on '0D' in the two-byte opcode index (16. 0F0D NOP).
>   - More information on this can be found in the Intel Software Developer's Manual (Combined Volumes).
> 
>> More interesting is that program still works without any issues.
>> I assume that CPU just skips that instruction or executes something else.
> 
> [Bhanu] This is mostly what is expected. On processors that support prefetchwt1 it executes, and on others it just becomes a NOP.
> 
>>
>> So, it's really strange, and it's unclear what the CPU really executes when
>> we have 'prefetchwt1' in the code but the CPU doesn't support it.
> 
> [Bhanu] It's decoded into a NOP, maybe by the pipeline decoding units.
> 
>>
>> If the CPU just skips this instruction, we will lose all the prefetching optimizations
>> because all the calls will be replaced by the non-existent 'prefetchwt1'.
> 
> [Bhanu] I would be worried if the core generated an exception, treating it as an illegal instruction. Instead, the pipeline units treat it as a NOP if the core doesn't support it.
> So the micro-optimizations don't really do anything on processors that don't support it.

This could be an issue. If someday we have a real performance optimization
based on the OPCH_HTW prefetch, we will get prefetchwt1 on systems that support
it and a NOP on others, even if they have the usual prefetchw, which could
provide a performance improvement too.

As I understand it, checking for '-mprefetchwt1' is equivalent to checking the
compiler version. It checks nothing about whether the CPU supports this
instruction. This could end up with non-working performance optimizations and
even degradation on systems that support the usual prefetches but not
prefetchwt1 (useless NOPs degrade performance if they are on a hot path).

IMHO, this compiler option should be passed only if the CPU really supports it.
I guess the most we can do is add a note to the performance optimization
guide that '-mprefetchwt1' can be passed via CFLAGS if the user is sure that it
is supported by the target CPU.
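
One way a user could confirm CPU support before adding the flag is to query CPUID. The following is only a minimal sketch and not part of the patch: it assumes GCC or Clang (for <cpuid.h>), and all function names are made up for illustration. Per the Intel SDM, PREFETCHWT1 is reported in CPUID leaf 7 (sub-leaf 0), ECX bit 0, and PREFETCHW in leaf 0x80000001, ECX bit 8.

```c
#include <stdbool.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>              /* GCC/Clang wrappers around CPUID. */
#endif

/* Pure helpers: decode the feature bits from raw CPUID register values. */
static inline bool
ecx_leaf7_has_prefetchwt1(unsigned int ecx)
{
    return (ecx & (1u << 0)) != 0;      /* CPUID.(EAX=7,ECX=0):ECX[0] */
}

static inline bool
ecx_ext1_has_prefetchw(unsigned int ecx)
{
    return (ecx & (1u << 8)) != 0;      /* CPUID.80000001H:ECX[8] */
}

/* Returns true if the running CPU reports PREFETCHWT1. */
static inline bool
cpu_has_prefetchwt1(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_max(0, NULL) < 7) {
        return false;                   /* Leaf 7 is not implemented. */
    }
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    (void) eax; (void) ebx; (void) edx; /* Only ECX is of interest. */
    return ecx_leaf7_has_prefetchwt1(ecx);
#else
    return false;                       /* Not an x86 CPU. */
#endif
}

/* Returns true if the running CPU reports PREFETCHW. */
static inline bool
cpu_has_prefetchw(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_max(0x80000000, NULL) < 0x80000001) {
        return false;                   /* Extended leaf is not implemented. */
    }
    __cpuid(0x80000001, eax, ebx, ecx, edx);
    (void) eax; (void) ebx; (void) edx; /* Only ECX is of interest. */
    return ecx_ext1_has_prefetchw(ecx);
#else
    return false;                       /* Not an x86 CPU. */
#endif
}
```

This only tells us about the machine it runs on, so it helps a user decide on CFLAGS for a local build; it says nothing about the CPUs a distributed package will eventually run on.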

> 
>>
>> How can we be sure that 'prefetchwt1' was really executed?
> 
> [Bhanu] I don't know how we can see this unless we can peek into the instruction queues and decoders of the pipeline :(.
> 
> - Bhanuprakash.
>
Jan Scheurich Dec. 7, 2017, 4:35 p.m. | #9
> >> If the CPU just skips this instruction, we will lose all the prefetching optimizations
> >> because all the calls will be replaced by the non-existent 'prefetchwt1'.
> >
> > [Bhanu] I would be worried if the core generated an exception, treating it as an illegal instruction. Instead, the pipeline units treat it as a NOP if the core doesn't support it.
> > So the micro-optimizations don't really do anything on processors that don't support it.
> 
> This could be an issue. If someday we have a real performance optimization
> based on the OPCH_HTW prefetch, we will get prefetchwt1 on systems that support
> it and a NOP on others, even if they have the usual prefetchw, which could
> provide a performance improvement too.
> 
> As I understand it, checking for '-mprefetchwt1' is equivalent to checking the
> compiler version. It checks nothing about whether the CPU supports this
> instruction. This could end up with non-working performance optimizations and
> even degradation on systems that support the usual prefetches but not
> prefetchwt1 (useless NOPs degrade performance if they are on a hot path).
> 
> IMHO, this compiler option should be passed only if the CPU really supports it.
> I guess the most we can do is add a note to the performance optimization
> guide that '-mprefetchwt1' can be passed via CFLAGS if the user is sure that it
> is supported by the target CPU.

That is my thinking as well. The people/organizations building OVS packages for deployment have the responsibility to specify the minimum requirements on the target architecture and feed that into the compiler using CFLAGS. That may well be leaning towards the lower end of capabilities to maximize compatibility and sacrifice some performance on high-end CPUs.

The specialized prefetch macros should be mapped to the best available target instructions by the compiler and/or conditional compile directives based on the CFLAGS architecture settings.
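
As a rough sketch of the conditional-compile approach, the wrappers below take their behaviour from the compiler's predefined feature macros instead of a flag forced in configure. To my knowledge GCC and Clang define __PREFETCHWT1__ when -mprefetchwt1 is in effect and __PRFCHW__ when -mprfchw (or a -march that implies it) is in effect. The SKETCH_* names are hypothetical and are not the macros from patch 1/5.

```c
#include <stddef.h>

/* Which write-prefetch instruction the current CFLAGS make available. */
#if defined(__PREFETCHWT1__)
#define SKETCH_WRITE_PREFETCH "prefetchwt1"
#elif defined(__PRFCHW__)
#define SKETCH_WRITE_PREFETCH "prefetchw"
#else
#define SKETCH_WRITE_PREFETCH "none"
#endif

/* High temporal locality write hint: same request on every target.
 * The rw and locality arguments must be compile-time constants. */
#define SKETCH_PREFETCH_HTW(addr) __builtin_prefetch((addr), 1, 3)

#if defined(__PRFCHW__) && !defined(__PREFETCHWT1__)
/* Only prefetchw is enabled: request it explicitly rather than relying
 * on the compiler to upgrade the low-locality hint. */
#define SKETCH_PREFETCH_LTW(addr) __builtin_prefetch((addr), 1, 3)
#else
/* prefetchwt1 is enabled, or no write prefetch at all: pass the
 * low-locality hint through and let the compiler choose. */
#define SKETCH_PREFETCH_LTW(addr) __builtin_prefetch((addr), 1, 1)
#endif

/* Lets a build or test log report what the CFLAGS selected. */
static inline const char *
sketch_write_prefetch_name(void)
{
    return SKETCH_WRITE_PREFETCH;
}
```

With this shape, a package built with conservative CFLAGS never contains prefetchwt1, while a local -march=native build on a capable machine picks it up without any change to configure.ac.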

We would gather all these target-specific compiler optimization guidelines in the advanced DPDK documentation of OVS. 

Of course developers or benchmark testers are free to use -march=native or similar at their discretion in their local test beds for best possible performance.

BR, Jan
Bodireddy, Bhanuprakash Dec. 7, 2017, 7:46 p.m. | #10
>> >> If the CPU just skips this instruction, we will lose all the prefetching
>> >> optimizations because all the calls will be replaced by the non-existent
>> >> 'prefetchwt1'.
>> >
>> > [Bhanu] I would be worried if the core generated an exception, treating
>> > it as an illegal instruction. Instead, the pipeline units treat it as a NOP
>> > if the core doesn't support it.
>> > So the micro-optimizations don't really do anything on processors
>> > that don't support it.
>>
>> This could be an issue. If someday we have a real performance
>> optimization based on the OPCH_HTW prefetch, we will get prefetchwt1 on
>> systems that support it and a NOP on others, even if they have the usual
>> prefetchw, which could provide a performance improvement too.

[Bhanu] Adding the information below only for future reference (I am going to point to this thread in the commit log).

On systems that have *only* prefetchw and no prefetchwt1 instruction:
     OPCH_LTW  -  prefetchw
     OPCH_MTW  -  prefetchw
     OPCH_HTW  -  prefetchw
     OPCH_NTW  -  prefetchw

On systems that support both prefetchw and prefetchwt1:
     OPCH_LTW  -  prefetchwt1
     OPCH_MTW  -  prefetchwt1
     OPCH_HTW  -  prefetchw
     OPCH_NTW  -  prefetchwt1

So OPCH_HTW would always be prefetchw, and LTW/MTW/NTW might turn into NOPs on processors that support prefetchw alone
(when compiled with CFLAGS = -march=native -mprefetchwt1).
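
For reference, the two tables above can be written as a small lookup. This is only an illustrative sketch: the enum and function names are hypothetical stand-ins for the OPCH_* hints, and the numeric values assume the hints map to the __builtin_prefetch() locality argument (LTW = 1 and HTW = 3 as shown in the commit message; NTW = 0 and MTW = 2 are my assumption).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the OPCH_* write hints. */
enum sketch_write_hint {
    SKETCH_NTW = 0,     /* Non-temporal write (assumed value). */
    SKETCH_LTW = 1,     /* Low temporal locality write. */
    SKETCH_MTW = 2,     /* Moderate temporal locality write (assumed value). */
    SKETCH_HTW = 3      /* High temporal locality write. */
};

/* Returns the instruction the tables list for 'hint', given which
 * instruction sets the compiler was told it may use.  Returns NULL for
 * combinations the tables do not cover. */
static inline const char *
sketch_expected_insn(enum sketch_write_hint hint, bool prfchw,
                     bool prefetchwt1)
{
    if (prfchw && prefetchwt1) {
        /* Second table: only the high-locality hint stays prefetchw. */
        return hint == SKETCH_HTW ? "prefetchw" : "prefetchwt1";
    }
    if (prfchw) {
        /* First table: every write hint becomes prefetchw. */
        return "prefetchw";
    }
    return NULL;
}
```

For example, sketch_expected_insn(SKETCH_LTW, true, true) yields "prefetchwt1", which is the instruction that would decode as a NOP if the binary later runs on a CPU with prefetchw alone.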

>>
>> As I understand it, checking for '-mprefetchwt1' is equivalent to checking
>> the compiler version. It checks nothing about whether the CPU supports
>> this instruction.
>> This could end up with non-working performance optimizations and even
>> degradation on systems that support the usual prefetches but not
>> prefetchwt1 (useless NOPs degrade performance if they are on a hot
>> path).
>>
>> IMHO, this compiler option should be passed only if the CPU really supports it.
>> I guess the most we can do is add a note to the performance
>> optimization guide that '-mprefetchwt1' can be passed via CFLAGS if
>> the user is sure that it is supported by the target CPU.
>
>That is my thinking as well. The people/organizations building OVS packages
>for deployment have the responsibility to specify the minimum requirements
>on the target architecture and feed that into the compiler using CFLAGS. That
>may well be leaning towards the lower end of capabilities to maximize
>compatibility and sacrifice some performance on high-end CPUs.
>
>The specialized prefetch macros should be mapped to the best available
>target instructions by the compiler and/or conditional compile directives
>based on the CFLAGS architecture settings.
>
>We would gather all these target-specific compiler optimization guidelines in
>the advanced DPDK documentation of OVS.
>
>Of course developers or benchmark testers are free to use -march=native or
>similar at their discretion in their local test beds for best possible performance.

If the general view is to drop this flag from the build and only document it, I am happy with that and can update the documentation.
But I still think we are being too defensive here; with a few NOPs, the performance impact isn't even noticeable.

- Bhanuprakash.

Patch

diff --git a/configure.ac b/configure.ac
index 6a8113a..8f4fbe2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -171,6 +171,7 @@  OVS_CONDITIONAL_CC_OPTION([-Wno-unused], [HAVE_WNO_UNUSED])
 OVS_CONDITIONAL_CC_OPTION([-Wno-unused-parameter], [HAVE_WNO_UNUSED_PARAMETER])
 OVS_ENABLE_WERROR
 OVS_ENABLE_SPARSE
+OVS_ENABLE_OPTION([-mprefetchwt1])
 OVS_CTAGS_IDENTIFIERS
 
 AC_ARG_VAR(KARCH, [Kernel Architecture String])