
[RFC,v2,02/13] tcg/i386: Add support for fence

Message ID: 20160531183928.29406-3-bobby.prani@gmail.com
State: New

Commit Message

Pranith Kumar May 31, 2016, 6:39 p.m. UTC
Generate an mfence instruction on SSE2-enabled processors. For older
processors, generate a 'lock orl $0,0(%esp)' instruction, which has
similar ordering semantics.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
[rth: Check for sse2, fallback to locked memory op otherwise.]
Signed-off-by: Richard Henderson <rth@twiddle.net>

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 tcg/i386/tcg-target.inc.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
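
For reference, here are the two fence idioms the commit message describes,
written as standalone inline assembly. This is a minimal sketch assuming
GCC/Clang asm syntax and a 32-bit build for the pre-SSE2 fallback; the
helper names are illustrative and not part of the patch itself.

static inline void fence_sse2(void)
{
    /* Full barrier: orders all prior loads and stores against all
       later ones.  Requires SSE2.  */
    __asm__ __volatile__("mfence" ::: "memory");
}

static inline void fence_pre_sse2(void)
{
    /* Pre-SSE2 fallback: a locked read-modify-write of a dummy stack
       location drains the store buffer, giving mfence-like ordering
       for ordinary loads and stores.  */
    __asm__ __volatile__("lock; orl $0, (%%esp)" ::: "memory");
}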

Comments

Richard Henderson May 31, 2016, 8:27 p.m. UTC | #1
On 05/31/2016 11:39 AM, Pranith Kumar wrote:
> +    case INDEX_op_mb:
> +        tcg_out_mb(s);

You need to look at the barrier type and DTRT.  In particular, the Linux 
smp_rmb and smp_wmb types need not emit any code.

> +    { INDEX_op_mb, { "r" } },

You certainly do *not* need the constant argument loaded into a register.  This 
should remain { }.


r~
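
Concretely, the op-def entry rth is asking for would be the following,
with no constraint string at all, since the barrier type travels as a
constant TCGArg rather than in a register:

    { INDEX_op_mb, { } },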
Pranith Kumar June 1, 2016, 6:49 p.m. UTC | #2
On Tue, May 31, 2016 at 4:27 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 05/31/2016 11:39 AM, Pranith Kumar wrote:
>>
>> +    case INDEX_op_mb:
>> +        tcg_out_mb(s);
>
>
> You need to look at the barrier type and DTRT.  In particular, the Linux
> smp_rmb and smp_wmb types need not emit any code.

These are converted to 'lfence' and 'sfence' instructions. Depending on
the target backend, I think we still need to emit barrier instructions.
For example, if the target backend is ARMv7, we need to emit a 'dmb'
instruction for both x86 fence instructions. I am not sure why they
would not emit any code?

>
>> +    { INDEX_op_mb, { "r" } },
>
>
> You certainly do *not* need the constant argument loaded into a register.
> This should remain { }.
>

OK, I will fix this.

Thanks,
Richard Henderson June 1, 2016, 9:17 p.m. UTC | #3
On 06/01/2016 11:49 AM, Pranith Kumar wrote:
> On Tue, May 31, 2016 at 4:27 PM, Richard Henderson <rth@twiddle.net> wrote:
>> On 05/31/2016 11:39 AM, Pranith Kumar wrote:
>>>
>>> +    case INDEX_op_mb:
>>> +        tcg_out_mb(s);
>>
>>
>> You need to look at the barrier type and DTRT.  In particular, the Linux
>> smp_rmb and smp_wmb types need not emit any code.
>
> These are converted to 'lfence' and 'sfence' instructions. Based on
> the target backend, I think we still need to emit barrier
> instructions. For example, if target backend is ARMv7 we need to emit
> 'dmb' instruction for both x86 fence instructions. I am not sure why
> they do not emit any code?

Because x86 has a strong memory model.

It does not require barriers to keep normal loads and stores in order.  The 
primary reason for the *fence instructions is to order the "non-temporal" 
memory operations that are part of the SSE instruction set, which we're not 
using at all.

This is why you'll find

/*
  * Because of the strongly ordered storage model, wmb() and rmb() are nops
  * here (a compiler barrier only).  QEMU doesn't do accesses to write-combining
  * qemu memory or non-temporal load/stores from C code.
  */
#define smp_wmb()   barrier()
#define smp_rmb()   barrier()

for x86 and s390.


r~
Pranith Kumar June 1, 2016, 9:44 p.m. UTC | #4
On Wed, Jun 1, 2016 at 5:17 PM, Richard Henderson <rth@twiddle.net> wrote:
>
> Because x86 has a strong memory model.
>
> It does not require barriers to keep normal loads and stores in order.  The
> primary reason for the *fence instructions is to order the "non-temporal"
> memory operations that are part of the SSE instruction set, which we're not
> using at all.
>
> This is why you'll find
>
> /*
>  * Because of the strongly ordered storage model, wmb() and rmb() are nops
>  * here (a compiler barrier only).  QEMU doesn't do accesses to
> write-combining
>  * qemu memory or non-temporal load/stores from C code.
>  */
> #define smp_wmb()   barrier()
> #define smp_rmb()   barrier()
>
> for x86 and s390.


OK. For the x86 target, that is true; I think I got the context
confused. On the x86 target we can elide the read and write barriers,
but we still need to generate 'mfence' to prevent store-after-load
reordering. I will refine this in the next version.

Thanks,
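
Taken together, the thread points toward a tcg_out_mb that receives the
barrier type as its constant operand and, on x86, emits a fence only
when store-after-load ordering is requested. Here is a sketch of that
direction; the TCG_MO_* flag names assume the barrier-type constants
introduced elsewhere in this series, and the code is illustrative
rather than the committed version:

static void tcg_out_mb(TCGContext *s, TCGArg a0)
{
    /* x86 is strongly ordered: load-load, load-store and store-store
       orderings are already guaranteed, so only the store-load
       component of the barrier needs an actual instruction.  */
    if (a0 & TCG_MO_ST_LD) {
        if (have_sse2) {
            /* mfence */
            tcg_out8(s, 0x0f);
            tcg_out8(s, 0xae);
            tcg_out8(s, 0xf0);
        } else {
            /* lock orl $0,0(%esp) */
            tcg_out8(s, 0xf0);
            tcg_out_modrm_offset(s, OPC_ARITH_EvIb, ARITH_OR,
                                 TCG_REG_ESP, 0);
            tcg_out8(s, 0);
        }
    }
}

The caller in tcg_out_op would then pass the operand through, e.g.
tcg_out_mb(s, args[0]).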

Patch

diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 8fd37f4..1fd5a99 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -121,6 +121,16 @@  static bool have_cmov;
 # define have_cmov 0
 #endif
 
+/* For 32-bit, we are going to attempt to determine at runtime whether
+   sse2 support is available.  */
+#if TCG_TARGET_REG_BITS == 64 || defined(__SSE2__)
+# define have_sse2 1
+#elif defined(CONFIG_CPUID_H) && defined(bit_SSE2)
+static bool have_sse2;
+#else
+# define have_sse2 0
+#endif
+
 /* If bit_MOVBE is defined in cpuid.h (added in GCC version 4.6), we are
    going to attempt to determine at runtime whether movbe is available.  */
 #if defined(CONFIG_CPUID_H) && defined(bit_MOVBE)
@@ -686,6 +696,21 @@  static inline void tcg_out_pushi(TCGContext *s, tcg_target_long val)
     }
 }
 
+static inline void tcg_out_mb(TCGContext *s)
+{
+    if (have_sse2) {
+        /* mfence */
+        tcg_out8(s, 0x0f);
+        tcg_out8(s, 0xae);
+        tcg_out8(s, 0xf0);
+    } else {
+        /* lock orl $0,0(%esp) */
+        tcg_out8(s, 0xf0);
+        tcg_out_modrm_offset(s, OPC_ARITH_EvIb, ARITH_OR, TCG_REG_ESP, 0);
+        tcg_out8(s, 0);
+    }
+}
+
 static inline void tcg_out_push(TCGContext *s, int reg)
 {
     tcg_out_opc(s, OPC_PUSH_r32 + LOWREGMASK(reg), 0, reg, 0);
@@ -2114,6 +2139,9 @@  static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         }
         break;
 
+    case INDEX_op_mb:
+        tcg_out_mb(s);
+        break;
     case INDEX_op_mov_i32:  /* Always emitted via tcg_out_mov.  */
     case INDEX_op_mov_i64:
     case INDEX_op_movi_i32: /* Always emitted via tcg_out_movi.  */
@@ -2179,6 +2207,8 @@  static const TCGTargetOpDef x86_op_defs[] = {
     { INDEX_op_add2_i32, { "r", "r", "0", "1", "ri", "ri" } },
     { INDEX_op_sub2_i32, { "r", "r", "0", "1", "ri", "ri" } },
 
+    { INDEX_op_mb, { "r" } },
+
 #if TCG_TARGET_REG_BITS == 32
     { INDEX_op_brcond2_i32, { "r", "r", "ri", "ri" } },
     { INDEX_op_setcond2_i32, { "r", "r", "r", "ri", "ri" } },
@@ -2356,6 +2386,11 @@  static void tcg_target_init(TCGContext *s)
            available, we'll use a small forward branch.  */
         have_cmov = (d & bit_CMOV) != 0;
 #endif
+#ifndef have_sse2
+        /* Likewise, almost all hardware supports SSE2, but we do
+           have a locked memory operation to use as a substitute.  */
+        have_sse2 = (d & bit_SSE2) != 0;
+#endif
 #ifndef have_movbe
         /* MOVBE is only available on Intel Atom and Haswell CPUs, so we
            need to probe for it.  */
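
For context, the 'd' tested in the have_sse2 hunk comes from the
cpuid(1) probe earlier in tcg_target_init, which this hunk does not
show. A sketch of that probe, assuming GCC's cpuid.h (which provides
both __get_cpuid and bit_SSE2):

#ifdef CONFIG_CPUID_H
    unsigned a, b, c, d;
    if (__get_cpuid(1, &a, &b, &c, &d)) {
        /* bit_SSE2 is bit 26 of EDX for cpuid leaf 1.  */
        have_sse2 = (d & bit_SSE2) != 0;
    }
#endif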