
[RFC,v3,14/14] target-i386: Generate fences for x86

Message ID: 20160618040343.19517-15-bobby.prani@gmail.com
State: New

Commit Message

Pranith Kumar June 18, 2016, 4:03 a.m. UTC
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 target-i386/translate.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Richard Henderson June 18, 2016, 5:48 a.m. UTC | #1
On 06/17/2016 09:03 PM, Pranith Kumar wrote:
>          case 0xe8 ... 0xef: /* lfence */
> +            tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
> +            break;
>          case 0xf0 ... 0xf7: /* mfence */
>              if (!(s->cpuid_features & CPUID_SSE2)
>                  || (prefixes & PREFIX_LOCK)) {
>                  goto illegal_op;
>              }
> +            tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);

You need to duplicate the sse2 check if you split lfence from mfence.


r~
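
A sketch of the fix being requested, with the existing SSE2/LOCK check
duplicated so that each opcode range validates on its own (illustrative,
not the eventually posted patch):

    case 0xe8 ... 0xef: /* lfence */
        if (!(s->cpuid_features & CPUID_SSE2)
            || (prefixes & PREFIX_LOCK)) {
            goto illegal_op;
        }
        tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
        break;
    case 0xf0 ... 0xf7: /* mfence */
        if (!(s->cpuid_features & CPUID_SSE2)
            || (prefixes & PREFIX_LOCK)) {
            goto illegal_op;
        }
        tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
        break;

Without the duplicated check, the lfence case no longer falls through
into the mfence case, so an lfence with a LOCK prefix would wrongly be
accepted.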
Pranith Kumar June 20, 2016, 3:05 p.m. UTC | #2
On Sat, Jun 18, 2016 at 1:48 AM, Richard Henderson <rth@twiddle.net> wrote:
> On 06/17/2016 09:03 PM, Pranith Kumar wrote:
>>
>>          case 0xe8 ... 0xef: /* lfence */
>> +            tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
>> +            break;
>>          case 0xf0 ... 0xf7: /* mfence */
>>              if (!(s->cpuid_features & CPUID_SSE2)
>>                  || (prefixes & PREFIX_LOCK)) {
>>                  goto illegal_op;
>>              }
>> +            tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
>
>
> You need to duplicate the sse2 check if you split lfence from mfence.
>

Yes, I missed this one. I will fix it.

Thanks for the review.
Paolo Bonzini June 21, 2016, 7:28 a.m. UTC | #3
On 18/06/2016 06:03, Pranith Kumar wrote:
> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> ---
>  target-i386/translate.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/target-i386/translate.c b/target-i386/translate.c
> index bf33e6b..32b0f5c 100644
> --- a/target-i386/translate.c
> +++ b/target-i386/translate.c
> @@ -8012,13 +8012,17 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>                  || (prefixes & PREFIX_LOCK)) {
>                  goto illegal_op;
>              }
> +            tcg_gen_mb(TCG_MO_ST_ST | TCG_BAR_SC);
>              break;
>          case 0xe8 ... 0xef: /* lfence */
> +            tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
> +            break;

These are unnecessary.  On the other hand, _each and every load_ must be
followed by a LD_LD | LD_ST barrier, and each and every store must be
preceded by a LD_ST | ST_ST barrier.

Paolo

>          case 0xf0 ... 0xf7: /* mfence */
>              if (!(s->cpuid_features & CPUID_SSE2)
>                  || (prefixes & PREFIX_LOCK)) {
>                  goto illegal_op;
>              }
> +            tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
>              break;
>  
>          default:
>
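
Spelled out, Paolo's rule amounts to wrapping every guest memory access
in the translator roughly like this. These are hypothetical wrappers for
illustration only; gen_op_ld_v/gen_op_st_v are the translator's existing
load/store generators, everything else is a sketch:

    /* Hypothetical wrappers, not part of the posted patch: per-access
     * barriers that emulate x86-TSO on a weakly ordered host. */
    static void gen_tso_ld(DisasContext *s, TCGMemOp ot, TCGv dst, TCGv addr)
    {
        gen_op_ld_v(s, ot, dst, addr);
        /* order this load before all later loads and stores */
        tcg_gen_mb(TCG_MO_LD_LD | TCG_MO_LD_ST | TCG_BAR_SC);
    }

    static void gen_tso_st(DisasContext *s, TCGMemOp ot, TCGv val, TCGv addr)
    {
        /* order all earlier loads and stores before this store */
        tcg_gen_mb(TCG_MO_LD_ST | TCG_MO_ST_ST | TCG_BAR_SC);
        gen_op_st_v(s, ot, val, addr);
    }

The one ordering neither barrier enforces is store-then-load, which is
exactly the reordering x86-TSO permits.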
Richard Henderson June 21, 2016, 3:57 p.m. UTC | #4
On 06/21/2016 12:28 AM, Paolo Bonzini wrote:
>
>
> On 18/06/2016 06:03, Pranith Kumar wrote:
>> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
>> ---
>>  target-i386/translate.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/target-i386/translate.c b/target-i386/translate.c
>> index bf33e6b..32b0f5c 100644
>> --- a/target-i386/translate.c
>> +++ b/target-i386/translate.c
>> @@ -8012,13 +8012,17 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>>                  || (prefixes & PREFIX_LOCK)) {
>>                  goto illegal_op;
>>              }
>> +            tcg_gen_mb(TCG_MO_ST_ST | TCG_BAR_SC);
>>              break;
>>          case 0xe8 ... 0xef: /* lfence */
>> +            tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
>> +            break;
>
> These are unnecessary.  On the other hand, _each and every load_ must be
> followed by a LD_LD | LD_ST barrier, and each and every store must be
> preceded by a LD_ST | ST_ST barrier.

They're not unnecessary if we (1) add those barriers for normal loads and 
stores and (2) omit them from the non-temporal loads and stores.

In the meantime they're good documentation.


r~
Paolo Bonzini June 21, 2016, 4:12 p.m. UTC | #5
On 21/06/2016 17:57, Richard Henderson wrote:
>>>
>>>                  || (prefixes & PREFIX_LOCK)) {
>>>                  goto illegal_op;
>>>              }
>>> +            tcg_gen_mb(TCG_MO_ST_ST | TCG_BAR_SC);
>>>              break;
>>>          case 0xe8 ... 0xef: /* lfence */
>>> +            tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
>>> +            break;
>>
>> These are unnecessary.  On the other hand, _each and every load_ must be
>> followed by a LD_LD | LD_ST barrier, and each and every store must be
>> preceded by a LD_ST | ST_ST barrier.
> 
> They're not unnecessary if we (1) add those barriers for normal loads
> and stores and (2) omit them from the non-temporal loads and stores.

When does TCG generate non-temporal loads and stores?

Paolo
Richard Henderson June 21, 2016, 4:23 p.m. UTC | #6
On 06/21/2016 09:12 AM, Paolo Bonzini wrote:
>
>
> On 21/06/2016 17:57, Richard Henderson wrote:
>>>>
>>>>                  || (prefixes & PREFIX_LOCK)) {
>>>>                  goto illegal_op;
>>>>              }
>>>> +            tcg_gen_mb(TCG_MO_ST_ST | TCG_BAR_SC);
>>>>              break;
>>>>          case 0xe8 ... 0xef: /* lfence */
>>>> +            tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
>>>> +            break;
>>>
>>> These are unnecessary.  On the other hand, _each and every load_ must be
>>> followed by a LD_LD | LD_ST barrier, and each and every store must be
>>> preceded by a LD_ST | ST_ST barrier.
>>
>> They're not unnecessary if we (1) add those barriers for normal loads
>> and stores and (2) omit them from the non-temporal loads and stores.
>
> When does TCG generate non-temporal loads and stores?

I was talking about the guest non-temporal loads and stores.


r~
Paolo Bonzini June 21, 2016, 4:33 p.m. UTC | #7
On 21/06/2016 18:23, Richard Henderson wrote:
> On 06/21/2016 09:12 AM, Paolo Bonzini wrote:
>>
>>
>> On 21/06/2016 17:57, Richard Henderson wrote:
>>>>>
>>>>>                  || (prefixes & PREFIX_LOCK)) {
>>>>>                  goto illegal_op;
>>>>>              }
>>>>> +            tcg_gen_mb(TCG_MO_ST_ST | TCG_BAR_SC);
>>>>>              break;
>>>>>          case 0xe8 ... 0xef: /* lfence */
>>>>> +            tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
>>>>> +            break;
>>>>
>>>> These are unnecessary.  On the other hand, _each and every load_
>>>> must be
>>>> followed by a LD_LD | LD_ST barrier, and each and every store must be
>>>> preceded by a LD_ST | ST_ST barrier.
>>>
>>> They're not unnecessary if we (1) add those barriers for normal loads
>>> and stores and (2) omit them from the non-temporal loads and stores.
>>
>> When does TCG generate non-temporal loads and stores?
> 
> I was talking about the guest non-temporal loads and stores.

Oh, you're right---sorry, I confused target-i386 and tcg/i386.

Thanks,

Paolo
Pranith Kumar June 21, 2016, 5:28 p.m. UTC | #8
On Tue, Jun 21, 2016 at 3:28 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 18/06/2016 06:03, Pranith Kumar wrote:
>> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
>> ---
>>  target-i386/translate.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/target-i386/translate.c b/target-i386/translate.c
>> index bf33e6b..32b0f5c 100644
>> --- a/target-i386/translate.c
>> +++ b/target-i386/translate.c
>> @@ -8012,13 +8012,17 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>>                  || (prefixes & PREFIX_LOCK)) {
>>                  goto illegal_op;
>>              }
>> +            tcg_gen_mb(TCG_MO_ST_ST | TCG_BAR_SC);
>>              break;
>>          case 0xe8 ... 0xef: /* lfence */
>> +            tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
>> +            break;
>
> These are unnecessary.  On the other hand, _each and every load_ must be
> followed by a LD_LD | LD_ST barrier, and each and every store must be
> preceded by a LD_ST | ST_ST barrier.
>

Regarding the second point, I did consider the situation of running x86 on
ARM, where such barriers are necessary for correctness. But I am
really apprehensive about the cost they will impose. I am not sure whether
there are any alternative solutions that avoid generating barriers for each
memory operation, but it would be great if we could reduce them.

Thanks,
Peter Maydell June 21, 2016, 5:54 p.m. UTC | #9
On 21 June 2016 at 18:28, Pranith Kumar <bobby.prani@gmail.com> wrote:
> Regarding the second point, I did consider the situation of running x86 on
> ARM, where such barriers are necessary for correctness. But I am
> really apprehensive about the cost they will impose. I am not sure whether
> there are any alternative solutions that avoid generating barriers for each
> memory operation, but it would be great if we could reduce them.

I vaguely recall an idea that you could avoid needing
explicit barriers by turning all the guest load/stores into
host load-acquire/store-release, but I have no idea whether
that's (a) actually true (b) any better than piles of
explicit barriers.

thanks
-- PMM
Pranith Kumar June 21, 2016, 6:03 p.m. UTC | #10
On Tue, Jun 21, 2016 at 1:54 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 21 June 2016 at 18:28, Pranith Kumar <bobby.prani@gmail.com> wrote:
>> Regarding the second point, I did consider the situation of running x86 on
>> ARM, where such barriers are necessary for correctness. But I am
>> really apprehensive about the cost they will impose. I am not sure whether
>> there are any alternative solutions that avoid generating barriers for each
>> memory operation, but it would be great if we could reduce them.
>
> I vaguely recall an idea that you could avoid needing
> explicit barriers by turning all the guest load/stores into
> host load-acquire/store-release, but I have no idea whether
> that's (a) actually true (b) any better than piles of
> explicit barriers.

Yes, this is true for ARMv8 (not sure about ia64). The
load-acquire/store-release operations are sequentially consistent with
each other. But this does not work for ARMv7, and, as you said, I
think the cost there too is really prohibitive.
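
A minimal sketch of that mapping in C11 atomics (illustrative helpers,
not QEMU code): every guest access becomes a host acquire/release
operation, which on ARMv8 compiles to LDAR/STLR.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Sketch: emulate guest memory accesses with host acquire/release
     * operations instead of explicit trailing/leading barriers. */
    static inline uint32_t guest_ld32(_Atomic uint32_t *p)
    {
        return atomic_load_explicit(p, memory_order_acquire);
    }

    static inline void guest_st32(_Atomic uint32_t *p, uint32_t v)
    {
        atomic_store_explicit(p, v, memory_order_release);
    }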
Alex Bennée June 21, 2016, 6:25 p.m. UTC | #11
Pranith Kumar <bobby.prani@gmail.com> writes:

> On Tue, Jun 21, 2016 at 1:54 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
>> On 21 June 2016 at 18:28, Pranith Kumar <bobby.prani@gmail.com> wrote:
>>> Regarding the second point, I did consider the situation of running x86 on
>>> ARM, where such barriers are necessary for correctness. But I am
>>> really apprehensive about the cost they will impose. I am not sure whether
>>> there are any alternative solutions that avoid generating barriers for each
>>> memory operation, but it would be great if we could reduce them.
>>
>> I vaguely recall an idea that you could avoid needing
>> explicit barriers by turning all the guest load/stores into
>> host load-acquire/store-release, but I have no idea whether
>> that's (a) actually true (b) any better than piles of
>> explicit barriers.
>
> Yes, this is true for ARMv8 (not sure about ia64). The
> load-acquire/store-release operations are sequentially consistent with
> each other. But this does not work for ARMv7, and, as you said, I
> think the cost there too is really prohibitive.

If the cost ends up being too prohibitive (as in -smp N runs slower and
slower), then we may just keep -accel tcg,thread=single as the default
for that combination.

We need some hard numbers either way.

--
Alex Bennée
Sergey Fedorov June 22, 2016, 11:18 a.m. UTC | #12
On 21/06/16 21:03, Pranith Kumar wrote:
> On Tue, Jun 21, 2016 at 1:54 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
>> On 21 June 2016 at 18:28, Pranith Kumar <bobby.prani@gmail.com> wrote:
>>> Regarding the second point, I did consider the situation of running x86 on
>>> ARM, where such barriers are necessary for correctness. But I am
>>> really apprehensive about the cost they will impose. I am not sure whether
>>> there are any alternative solutions that avoid generating barriers for each
>>> memory operation, but it would be great if we could reduce them.
>> I vaguely recall an idea that you could avoid needing
>> explicit barriers by turning all the guest load/stores into
>> host load-acquire/store-release, but I have no idea whether
>> that's (a) actually true (b) any better than piles of
>> explicit barriers.
> Yes, this is true for ARMv8 (not sure about ia64). The
> load-acquire/store-release operations are sequentially consistent with
> each other. But this does not work for ARMv7, and, as you said, I
> think the cost there too is really prohibitive.

As I understand it, there is no requirement for sequential consistency
even on systems with a pretty strong memory model such as x86. Due to
the presence of the store queue, earlier regular stores are allowed to
complete after subsequent regular loads. This relaxation breaks the
sequential consistency requirement, if I understand correctly, since it
allows a CPU to see its own stores in a different order with respect to
other CPUs' stores. However, such a model can perfectly match
acquire/release semantics, even as defined by the Itanium memory model.
Let's break it down:
(1) a load-acquire must not be reordered with any subsequent loads and
stores;
(2) a store-release must not be reordered with any preceding loads and
stores;
(3) thus, if all loads are load-acquires and all stores are
store-releases, the only possible reordering is a store-release
reordered after a subsequent load-acquire.

Considering this, I think that strongly ordered memory model semantics
(as in the x86 memory model) can be translated directly into relaxed
acquire/release memory model semantics (as in the Itanium memory model,
or the somewhat stronger ARMv8). And I believe this can perform better
than inserting separate memory barriers on architectures which provide
acquire/release semantics, since it is more relaxed and permits certain
hardware optimizations such as store-after-load reordering.
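
To make point (3) concrete, here is the classic store-buffering litmus
test in C11 form (an illustration, not code from this series):

    #include <stdatomic.h>

    /* Store-buffering (SB) litmus test.  Even with every store compiled
     * as a store-release and every load as a load-acquire, the release
     * in each thread may be reordered after the following acquire, so
     * r0 == 0 && r1 == 0 is a permitted outcome.  x86-TSO permits the
     * same outcome, so the only reordering this mapping allows is one
     * the guest architecture also allows. */
    static _Atomic int x, y;

    int thread0(void)   /* returns r0 */
    {
        atomic_store_explicit(&x, 1, memory_order_release);
        return atomic_load_explicit(&y, memory_order_acquire);
    }

    int thread1(void)   /* returns r1 */
    {
        atomic_store_explicit(&y, 1, memory_order_release);
        return atomic_load_explicit(&x, memory_order_acquire);
    }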

Kind regards,
Sergey

Patch

diff --git a/target-i386/translate.c b/target-i386/translate.c
index bf33e6b..32b0f5c 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -8012,13 +8012,17 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                 || (prefixes & PREFIX_LOCK)) {
                 goto illegal_op;
             }
+            tcg_gen_mb(TCG_MO_ST_ST | TCG_BAR_SC);
             break;
         case 0xe8 ... 0xef: /* lfence */
+            tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
+            break;
         case 0xf0 ... 0xf7: /* mfence */
             if (!(s->cpuid_features & CPUID_SSE2)
                 || (prefixes & PREFIX_LOCK)) {
                 goto illegal_op;
             }
+            tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
             break;
 
         default: