diff mbox

ARM: Weaker memory barriers

Message ID 201403110254.s2B2sJ1g023579@outgoing.mit.edu
State New
Headers show

Commit Message

John Carr March 11, 2014, 2:54 a.m. UTC
A comment in arm/sync.md notes "We should consider issuing a inner
shareability zone barrier here instead."  Here is my first attempt
at a patch to emit weaker memory barriers.  Three instructions seem
to be relevant for user mode code on my Cortex A9 Linux box:

dmb ishst, dmb ish, dmb sy

I believe these correspond to a release barrier, a full barrier
with respect to other CPUs, and a full barrier that also orders
relative to I/O.

Consider this a request for comments on whether the approach is correct.
I haven't done any testing yet (beyond eyeballing the assembly output).

Comments

Will Deacon March 11, 2014, 10:25 a.m. UTC | #1
Hi John,

On Tue, Mar 11, 2014 at 02:54:18AM +0000, John Carr wrote:
> A comment in arm/sync.md notes "We should consider issuing a inner
> shareability zone barrier here instead."  Here is my first attempt
> at a patch to emit weaker memory barriers.  Three instructions seem
> to be relevant for user mode code on my Cortex A9 Linux box:
> 
> dmb ishst, dmb ish, dmb sy
> 
> I believe these correspond to a release barrier, a full barrier
> with respect to other CPUs, and a full barrier that also orders
> relative to I/O.

Not quite; DMB ISHST only orders writes with other writes, so loads can move
across it in both directions. That means it's not sufficient for releasing a
lock, for example.

<shameless plug>

I gave a presentation at ELCE about the various ARMv7 barrier options (from
a kernel perspective):

  https://www.youtube.com/watch?v=6ORn6_35kKo

</shameless plug>

> Consider this a request for comments on whether the approach is correct.
> I haven't done any testing yet (beyond eyeballing the assembly output).

You'll probably find that a lot of ARM micro-architectures treat them
equivalently, which makes this a good source of potentially latent bugs if
you get the wrong options. When I added these options to the kernel, I
tested with a big/little A7/A15 setup using a CCI400 (the cache-coherent
interconnect) to try and tickle things a bit better, but it's by no means
definitive.

>  (define_insn "*memory_barrier"
>    [(set (match_operand:BLK 0 "" "")
> -	(unspec:BLK [(match_dup 0)] UNSPEC_MEMORY_BARRIER))]
> -  "TARGET_HAVE_MEMORY_BARRIER"
> +	(unspec:BLK
> +	 [(match_dup 0) (match_operand:SI 1 "const_int_operand")]
> +	 UNSPEC_MEMORY_BARRIER))]
> +  "TARGET_HAVE_DMB || TARGET_HAVE_MEMORY_BARRIER"
>    {
>      if (TARGET_HAVE_DMB)
>        {
> -	/* Note we issue a system level barrier. We should consider issuing
> -	   a inner shareabilty zone barrier here instead, ie. "DMB ISH".  */
> -	/* ??? Differentiate based on SEQ_CST vs less strict?  */
> -	return "dmb\tsy";
> +        switch (INTVAL(operands[1]))
> +	{
> +	case MEMMODEL_RELEASE:
> +          return "dmb\tishst";

As stated above, this isn't correct.

> +	case MEMMODEL_CONSUME:

You might be able to build this one out of <control dependency> + ISB (in
lieu of a true address dependency). However, given the recent monolithic C11
thread that took over this list and others, let's play it safe for now and
stick with DMB ISH.

> +	case MEMMODEL_ACQUIRE:
> +	case MEMMODEL_ACQ_REL:
> +	  return "dmb\tish";

I think you probably *always* want ISH for GCC. You assume normal, cacheable
, coherent memory right? That also mirrors the first comment you delete
above.

> +	case MEMMODEL_SEQ_CST:
> +	  return "dmb\tsy";

Again, ISH here should be fine. SY is for things like non-coherent devices,
which I don't think GCC has a concept of (the second comment you delete
doesn't make any sense and has probably tricked you).

Hope that helps,

Will
John Carr March 11, 2014, 9:12 p.m. UTC | #2
Will Deacon <will.deacon@arm.com> wrote:

> 
> Hi John,
> 
> On Tue, Mar 11, 2014 at 02:54:18AM +0000, John Carr wrote:
> > A comment in arm/sync.md notes "We should consider issuing a inner
> > shareability zone barrier here instead."  Here is my first attempt
> > at a patch to emit weaker memory barriers.  Three instructions seem
> > to be relevant for user mode code on my Cortex A9 Linux box:
> > 
> > dmb ishst, dmb ish, dmb sy
> > 
> > I believe these correspond to a release barrier, a full barrier
> > with respect to other CPUs, and a full barrier that also orders
> > relative to I/O.
> 
> Not quite; DMB ISHST only orders writes with other writes, so loads can move
> across it in both directions. That means it's not sufficient for releasing a
> lock, for example.

Release in this context doesn't mean "lock release".  I understand
it to mean release in the specific context of the C++11 memory model.
(Similarly, if you're arguing standards compliance "inline" really
means "relax the one definition rule for this function.")

I don't see a prohibition on moving non-atomic loads across a release
store.  Can you point to an analysis that shows a full barrier is needed?

If we assume that gcc is used to generate code for processes running
within a single inner shareable domain, then we can start by demoting
"dmb sy" to "dmb ish" for the memory barrier with no other change.

If a store-store barrier has no place in the gcc atomic memory model,
that supports my hypothesis that a twisty maze of ifdefs is superior to
a "portable" attractive nuisance.

> 
> <shameless plug>
> 
> I gave a presentation at ELCE about the various ARMv7 barrier options (from
> a kernel perspective):
> 
>   https://www.youtube.com/watch?v=6ORn6_35kKo
> 
> </shameless plug>

Conveniently just about the same length as a dryer cycle.

I learned that inner shareable domain is the right one to use
for normal user mode code, and the Linux kernel doesn't care
because it has its own memory model and barrier functions.
Will Deacon March 12, 2014, 1:43 p.m. UTC | #3
On Tue, Mar 11, 2014 at 09:12:53PM +0000, John Carr wrote:
> Will Deacon <will.deacon@arm.com> wrote:
> > On Tue, Mar 11, 2014 at 02:54:18AM +0000, John Carr wrote:
> > > A comment in arm/sync.md notes "We should consider issuing a inner
> > > shareability zone barrier here instead."  Here is my first attempt
> > > at a patch to emit weaker memory barriers.  Three instructions seem
> > > to be relevant for user mode code on my Cortex A9 Linux box:
> > > 
> > > dmb ishst, dmb ish, dmb sy
> > > 
> > > I believe these correspond to a release barrier, a full barrier
> > > with respect to other CPUs, and a full barrier that also orders
> > > relative to I/O.
> > 
> > Not quite; DMB ISHST only orders writes with other writes, so loads can move
> > across it in both directions. That means it's not sufficient for releasing a
> > lock, for example.
> 
> Release in this context doesn't mean "lock release".  I understand
> it to mean release in the specific context of the C++11 memory model.
> (Similarly, if you're arguing standards compliance "inline" really
> means "relax the one definition rule for this function.")
> 
> I don't see a prohibition on moving non-atomic loads across a release
> store.  Can you point to an analysis that shows a full barrier is needed?

Well, you can use acquire/release to implement a lock easily enough. For
example, try feeding the following to cppmem:

  int main() {
    int x = 0, y = 0;
    atomic_int z = 0;

    {{{ { r1 = x; y = 1;
          z.store(1, memory_order_release); }
    ||| { r0 = z.load(memory_order_acquire).readsvalue(1);
          r1 = y; x = 1;}
    }}}

    return 0;
  }

There is one consistent execution, which requires the first thread to have
r1 == 0 (i.e. read x as zero) and the second thread to have r1 == 1 (i.e.
read y as 1).

If we implement store-release using DMB ISHST, the assembly code would look
something like the following (I've treated the atomic accesses like normal
load/store instructions for clarity, since they don't affect the ordering
here):

  T0:

  LDR r1, [x]
  STR #1, [y]
  DMB ISHST
  STR #1, [z]

  T1:

  LDR r0, [z] // Reads 1
  DMB ISH
  LDR r1, [y]
  STR #1, [x]

The problem with this is that the LDR in T0 can be re-ordered *past* the
rest of the sequence, potentially resulting in r1 == 1, which is forbidden.
It's just like reading from a shared, lock-protected data structure without
the lock held.

> If we assume that gcc is used to generate code for processes running
> within a single inner shareable domain, then we can start by demoting
> "dmb sy" to "dmb ish" for the memory barrier with no other change.

I'm all for such a change.

> If a store-store barrier has no place in the gcc atomic memory model,
> that supports my hypothesis that a twisty maze of ifdefs is superior to
> a "portable" attractive nuisance.

I don't understand your point here.

Will
diff mbox

Patch

2014-03-10  John F. Carr  <jfc@mit.edu>

	* config/arm/sync.md (mem_thread_fence): New expander for weaker
	memory barriers.
	* config/arm/arm.c (arm_pre_atomic_barrier, arm_post_atomic_barrier): 
	Emit only as strong a fence as needed.

Index: config/arm/arm.c
===================================================================
--- config/arm/arm.c	(revision 208470)
+++ config/arm/arm.c	(working copy)
@@ -29813,7 +29813,12 @@ 
 arm_pre_atomic_barrier (enum memmodel model)
 {
   if (need_atomic_barrier_p (model, true))
-    emit_insn (gen_memory_barrier ());
+    {
+      if (HAVE_mem_thread_fence)
+	emit_insn (gen_mem_thread_fence (GEN_INT ((int) model)));
+      else
+	emit_insn (gen_memory_barrier ());
+    }
 }
 
 static void
@@ -29820,7 +29825,12 @@ 
 arm_post_atomic_barrier (enum memmodel model)
 {
   if (need_atomic_barrier_p (model, false))
-    emit_insn (gen_memory_barrier ());
+    {
+      if (HAVE_mem_thread_fence)
+	emit_insn (gen_mem_thread_fence (GEN_INT ((int) model)));
+      else
+	emit_insn (gen_memory_barrier ());
+    }
 }
 
 /* Emit the load-exclusive and store-exclusive instructions.
Index: config/arm/sync.md
===================================================================
--- config/arm/sync.md	(revision 208470)
+++ config/arm/sync.md	(working copy)
@@ -34,26 +34,54 @@ 
 (define_mode_attr sync_sfx
   [(QI "b") (HI "h") (SI "") (DI "d")])
 
+(define_expand "mem_thread_fence"
+  [(set (match_dup 1)
+	(unspec:BLK
+	 [(match_dup 1)
+	  (match_operand:SI 0 "const_int_operand")]
+	 UNSPEC_MEMORY_BARRIER))]
+  "TARGET_HAVE_DMB"
+{
+  if (INTVAL(operands[0]) == MEMMODEL_RELAXED)
+    DONE;
+  operands[1] = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode));
+  MEM_VOLATILE_P (operands[1]) = 1;
+})
+
 (define_expand "memory_barrier"
   [(set (match_dup 0)
-	(unspec:BLK [(match_dup 0)] UNSPEC_MEMORY_BARRIER))]
+	(unspec:BLK [(match_dup 0) (match_dup 1)]
+		    UNSPEC_MEMORY_BARRIER))]
   "TARGET_HAVE_MEMORY_BARRIER"
 {
   operands[0] = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode));
   MEM_VOLATILE_P (operands[0]) = 1;
+  operands[1] = GEN_INT((int) MEMMODEL_SEQ_CST);
 })
 
 (define_insn "*memory_barrier"
   [(set (match_operand:BLK 0 "" "")
-	(unspec:BLK [(match_dup 0)] UNSPEC_MEMORY_BARRIER))]
-  "TARGET_HAVE_MEMORY_BARRIER"
+	(unspec:BLK
+	 [(match_dup 0) (match_operand:SI 1 "const_int_operand")]
+	 UNSPEC_MEMORY_BARRIER))]
+  "TARGET_HAVE_DMB || TARGET_HAVE_MEMORY_BARRIER"
   {
     if (TARGET_HAVE_DMB)
       {
-	/* Note we issue a system level barrier. We should consider issuing
-	   a inner shareabilty zone barrier here instead, ie. "DMB ISH".  */
-	/* ??? Differentiate based on SEQ_CST vs less strict?  */
-	return "dmb\tsy";
+        switch (INTVAL(operands[1]))
+	{
+	case MEMMODEL_RELEASE:
+          return "dmb\tishst";
+	case MEMMODEL_CONSUME:
+	case MEMMODEL_ACQUIRE:
+	case MEMMODEL_ACQ_REL:
+	  return "dmb\tish";
+	case MEMMODEL_SEQ_CST:
+	  return "dmb\tsy";
+	case MEMMODEL_RELAXED:
+	default:
+	  gcc_unreachable ();
+        }
       }
 
     if (TARGET_HAVE_DMB_MCR)