diff mbox

[RFC,i386] : Use "lock orl $0, -4(%esp)" in mfence_nosse

Message ID CAFULd4ZB8jehEJZBDmn10HGqQvOho9MJ9wDZVorRmbZMduJxDA@mail.gmail.com
State New
Headers show

Commit Message

Uros Bizjak May 29, 2016, 9:10 p.m. UTC
Hello!

As explained in PR71245, comment #3 [1], it is better to use offset -4
to a %esp to implement a non-SSE memory fence instruction:

-q-

I guess it costs a code byte for a disp8 in the addressing mode, but
it avoids adding a lot of latency to a critical path involving a
spill/reload to (%esp), in functions where there is something at
(%esp).

If it's an object larger than 4B, the lock orl could even cause a
store-forwarding stall when the object is reloaded.  (e.g. a double or
a vector).

Ideally we could do the  lock orl  on some padding between two locals,
or on something in memory that wasn't going to be loaded soon, to
avoid touching more stack memory (which might be in the next page
down).  But we still want to do it on a cache line that's hot, so
going way up above our own stack frame isn't good either.

-/q-

Attached RFC patch implements this proposal.

2016-05-29  Uros Bizjak  <ubizjak@gmail.com>

    * config/i386/sync.md (mfence_nosse): Use "lock orl $0, -4(%esp)".

Patch was bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Any other opinion on this issue? The linux kernel also implements
memory fence like the above proposal.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245#c3

Uros.

Comments

Jakub Jelinek Feb. 17, 2017, 4:30 p.m. UTC | #1
On Sun, May 29, 2016 at 11:10:15PM +0200, Uros Bizjak wrote:
> As explained in PR71245, comment #3 [1], it is better to use offset -4
> to a %esp to implement a non-SSE memory fence instruction:
> 
> -q-
> 
> I guess it costs a code byte for a disp8 in the addressing mode, but
> it avoids adding a lot of latency to a critical path involving a
> spill/reload to (%esp), in functions where there is something at
> (%esp).
> 
> If it's an object larger than 4B, the lock orl could even cause a
> store-forwarding stall when the object is reloaded.  (e.g. a double or
> a vector).
> 
> Ideally we could do the  lock orl  on some padding between two locals,
> or on something in memory that wasn't going to be loaded soon, to
> avoid touching more stack memory (which might be in the next page
> down).  But we still want to do it on a cache line that's hot, so
> going way up above our own stack frame isn't good either.

Unfortunately this makes valgrind unhappy about that:
https://bugzilla.redhat.com/show_bug.cgi?id=1423434
I assume it will complain now on anything pre-SSE2 that contains the memory
barrier in 32-bit code.
Perhaps we should decrement and increment %esp around it or something
similar (or push/pop)?  Of course, that would mean we need to take care
of async unwind info.

> Attached RFC patch implements this proposal.
> 
> 2016-05-29  Uros Bizjak  <ubizjak@gmail.com>
> 
>     * config/i386/sync.md (mfence_nosse): Use "lock orl $0, -4(%esp)".
> 
> Patch was bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
> 
> Any other opinion on this issue? The linux kernel also implements
> memory fence like the above proposal.
> 
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245#c3
> 
> Uros.

> Index: config/i386/sync.md
> ===================================================================
> --- config/i386/sync.md	(revision 236863)
> +++ config/i386/sync.md	(working copy)
> @@ -98,7 +98,7 @@
>  	(unspec:BLK [(match_dup 0)] UNSPEC_MFENCE))
>     (clobber (reg:CC FLAGS_REG))]
>    "!(TARGET_64BIT || TARGET_SSE2)"
> -  "lock{%;} or{l}\t{$0, (%%esp)|DWORD PTR [esp], 0}"
> +  "lock{%;} or{l}\t{$0, -4(%%esp)|DWORD PTR [esp-4], 0}"
>    [(set_attr "memory" "unknown")])
>  
>  (define_expand "mem_thread_fence"


	Jakub
Uros Bizjak Feb. 17, 2017, 4:59 p.m. UTC | #2
On Fri, Feb 17, 2017 at 5:30 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Sun, May 29, 2016 at 11:10:15PM +0200, Uros Bizjak wrote:
>> As explained in PR71245, comment #3 [1], it is better to use offset -4
>> to a %esp to implement a non-SSE memory fence instruction:
>>
>> -q-
>>
>> I guess it costs a code byte for a disp8 in the addressing mode, but
>> it avoids adding a lot of latency to a critical path involving a
>> spill/reload to (%esp), in functions where there is something at
>> (%esp).
>>
>> If it's an object larger than 4B, the lock orl could even cause a
>> store-forwarding stall when the object is reloaded.  (e.g. a double or
>> a vector).
>>
>> Ideally we could do the  lock orl  on some padding between two locals,
>> or on something in memory that wasn't going to be loaded soon, to
>> avoid touching more stack memory (which might be in the next page
>> down).  But we still want to do it on a cache line that's hot, so
>> going way up above our own stack frame isn't good either.
>
> Unfortunately this makes valgrind unhappy about that:
> https://bugzilla.redhat.com/show_bug.cgi?id=1423434
> I assume it will complain now on anything pre-SSE2 that contains the memory
> barrier in 32-bit code.
> Perhaps we should decrement and increment %esp around it or something
> similar (or push/pop)?  Of course, that would mean we need to take care
> of async unwind info.

Or, we can simply revert the patch? Not that the barrier performance
of non-SSE 32bit targets matter...

Uros.
Jakub Jelinek Feb. 17, 2017, 5 p.m. UTC | #3
On Fri, Feb 17, 2017 at 05:59:30PM +0100, Uros Bizjak wrote:
> > Unfortunately this makes valgrind unhappy about that:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1423434
> > I assume it will complain now on anything pre-SSE2 that contains the memory
> > barrier in 32-bit code.
> > Perhaps we should decrement and increment %esp around it or something
> > similar (or push/pop)?  Of course, that would mean we need to take care
> > of async unwind info.
> 
> Or, we can simply revert the patch? Not that the barrier performance
> of non-SSE 32bit targets matter...

Yeah.  People who care about performance should use -m64 anyway.

	Jakub
diff mbox

Patch

Index: config/i386/sync.md
===================================================================
--- config/i386/sync.md	(revision 236863)
+++ config/i386/sync.md	(working copy)
@@ -98,7 +98,7 @@ 
 	(unspec:BLK [(match_dup 0)] UNSPEC_MFENCE))
    (clobber (reg:CC FLAGS_REG))]
   "!(TARGET_64BIT || TARGET_SSE2)"
-  "lock{%;} or{l}\t{$0, (%%esp)|DWORD PTR [esp], 0}"
+  "lock{%;} or{l}\t{$0, -4(%%esp)|DWORD PTR [esp-4], 0}"
   [(set_attr "memory" "unknown")])
 
 (define_expand "mem_thread_fence"