
Torn read/write possible on aarch64/x86-64 MTTCG?

Message ID 20170724212327.GA24963@flamenco
State New

Commit Message

Emilio Cota July 24, 2017, 9:23 p.m. UTC
(Adding some Cc's)

On Mon, Jul 24, 2017 at 19:05:33 +0000, Andrew Baumann via Qemu-devel wrote:
> Hi all,
>
> I'm trying to track down what appears to be a translation bug in either
> the aarch64 target or x86_64 TCG (in multithreaded mode). The symptoms
> are entirely consistent with a torn read/write -- that is, a 64-bit
> load or store that was translated to two 32-bit loads and stores --
> but that's obviously not what happens in the common path through the
> translation for this code, so I'm wondering: are there any cases in
> which qemu will split a 64-bit memory access into two 32-bit accesses?

That would be a bug in MTTCG.

> The code: Guest CPU A writes a 64-bit value to an aligned memory
> location that was previously 0, using a regular store; e.g.:
>	f9000034 str	     x20,[x1]
>
> Guest CPU B (who is busy-waiting) reads a value from the same location:
>	f9400280 ldr	     x0,[x20]
>
> The symptom: CPU B loads a value that is neither NULL nor the value
> written. Instead, x0 gets only the low 32-bits of the value written
> (high bits are all zero). By the time this value is dereferenced (a
> few instructions later) and the exception handlers run, the memory
> location from which it was loaded has the correct 64-bit value with
> a non-zero upper half.
>
> Obviously on a real ARM memory barriers are critical, and indeed
> the code has such barriers in it, but I'm assuming that any possible
> mistranslation of the barriers is irrelevant because for a 64-bit load
> and a 64-bit store you should get all or nothing. Other clues that may
> be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
> is used to resolve a race when updating another variable), and the
> busy-wait loop has a yield instruction in it (although those appear
> to be no-ops with MTTCG).

This might have to do with how ldrex/strex is emulated; are you relying
on the exclusive pair detecting ABA? If so, your code won't work in
QEMU since it uses cmpxchg to emulate ldrex/strex.

> The bug repros more easily with more guest VCPUs, and more load on the
> host (i.e. more context switching to expose the race). It doesn't repro
> for the single-threaded TCG. Unfortunately it's hard to get detailed
> trace information, because the bug only repros roughly every one in 40
> attempts, and it's a long way into the guest OS boot before it arises.
>
> I'm not yet 100% convinced this is a qemu bug -- the obvious path
> through the translator for those instructions does 64-bit memory
> accesses on the host -- but at the same time, it has never been seen
> outside qemu, and after staring long and hard at the guest code, we're
> pretty sure it's correct. It's also extremely unlikely to be a wild
> write, given that it occurs on a wide variety of guest call-stacks,
> and the memory is later inconsistent with what was loaded.
>
> Any clues or debugging suggestions appreciated!

- Pin the QEMU-MTTCG process to a single CPU. Can you repro then?

- Force the emulation of cmpxchg via EXCP_ATOMIC with:

diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 87f673e..771effe5 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
          }
          tcg_temp_free_i64(t1);
      } else if ((memop & MO_SIZE) == MO_64) {
-#ifdef CONFIG_ATOMIC64
+#if 0

This will halt all other vCPUs before the calling vCPU performs
the cmpxchg. Can you reproduce then?

		Emilio

Comments

Richard Henderson July 24, 2017, 10:02 p.m. UTC | #1
On 07/24/2017 02:23 PM, Emilio G. Cota wrote:
> (Adding some Cc's)
> 
> On Mon, Jul 24, 2017 at 19:05:33 +0000, Andrew Baumann via Qemu-devel wrote:
>> Hi all,
>>
>> I'm trying to track down what appears to be a translation bug in either
>> the aarch64 target or x86_64 TCG (in multithreaded mode). The symptoms

I assume this is really x86_64 and not i686 as host.

>> are entirely consistent with a torn read/write -- that is, a 64-bit
>> load or store that was translated to two 32-bit loads and stores --
>> but that's obviously not what happens in the common path through the
>> translation for this code, so I'm wondering: are there any cases in
>> which qemu will split a 64-bit memory access into two 32-bit accesses?
> 
> That would be a bug in MTTCG.
> 
>> The code: Guest CPU A writes a 64-bit value to an aligned memory
>> location that was previously 0, using a regular store; e.g.:
>> 	f9000034 str	     x20,[x1]
>>
>> Guest CPU B (who is busy-waiting) reads a value from the same location:
>> 	f9400280 ldr	     x0,[x20]
>>
>> The symptom: CPU B loads a value that is neither NULL nor the value
>> written. Instead, x0 gets only the low 32-bits of the value written
>> (high bits are all zero). By the time this value is dereferenced (a
>> few instructions later) and the exception handlers run, the memory
>> location from which it was loaded has the correct 64-bit value with
>> a non-zero upper half.
>>
>> Obviously on a real ARM memory barriers are critical, and indeed
>> the code has such barriers in it, but I'm assuming that any possible
>> mistranslation of the barriers is irrelevant because for a 64-bit load
>> and a 64-bit store you should get all or nothing. Other clues that may
>> be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
>> is used to resolve a race when updating another variable), and the
>> busy-wait loop has a yield instruction in it (although those appear
>> to be no-ops with MTTCG).
> 
> This might have to do with how ldrex/strex is emulated; are you relying
> on the exclusive pair detecting ABA? If so, your code won't work in
> QEMU since it uses cmpxchg to emulate ldrex/strex.

The ABA problem has nothing to do with tearing.  And cmpxchg will definitely
not create tearing problems.

I don't know how we would manage 64-bit tearing on a 64-bit host, at least for 
the aarch64 guest, for which I believe we have a good emulation.

> - Pin the QEMU-MTTCG process to a single CPU. Can you repro then?

A good suggestion.

> - Force the emulation of cmpxchg via EXCP_ATOMIC with:
> 
> diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
> index 87f673e..771effe5 100644
> --- a/tcg/tcg-op.c
> +++ b/tcg/tcg-op.c
> @@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
>           }
>           tcg_temp_free_i64(t1);
>       } else if ((memop & MO_SIZE) == MO_64) {
> -#ifdef CONFIG_ATOMIC64
> +#if 0

I suspect this will simply alter the timing.  However, give it a go by all means.

If there's a test case that you can share, that would be awesome.

Especially if you can prod it to happen with a standalone minimal binary.  With 
luck you can reproduce via aarch64-linux-user too, and simply signal an error 
via branch to __builtin_trap.


r~
Cameron Esfahani via July 25, 2017, 9:53 p.m. UTC | #2
> From: Richard Henderson [mailto:rth7680@gmail.com] On Behalf Of Richard Henderson
> Sent: Monday, 24 July 2017 15:03
>
> On 07/24/2017 02:23 PM, Emilio G. Cota wrote:
> > (Adding some Cc's)
> >
> > On Mon, Jul 24, 2017 at 19:05:33 +0000, Andrew Baumann via Qemu-devel wrote:
> >> Hi all,
> >>
> >> I'm trying to track down what appears to be a translation bug in either
> >> the aarch64 target or x86_64 TCG (in multithreaded mode). The symptoms
>
> I assume this is really x86_64 and not i686 as host.

Yes.

> >> are entirely consistent with a torn read/write -- that is, a 64-bit
> >> load or store that was translated to two 32-bit loads and stores --
> >> but that's obviously not what happens in the common path through the
> >> translation for this code, so I'm wondering: are there any cases in
> >> which qemu will split a 64-bit memory access into two 32-bit accesses?
> >
> > That would be a bug in MTTCG.
> >
> >> The code: Guest CPU A writes a 64-bit value to an aligned memory
> >> location that was previously 0, using a regular store; e.g.:
> >> 	f9000034 str	     x20,[x1]
> >>
> >> Guest CPU B (who is busy-waiting) reads a value from the same location:
> >> 	f9400280 ldr	     x0,[x20]
> >>
> >> The symptom: CPU B loads a value that is neither NULL nor the value
> >> written. Instead, x0 gets only the low 32-bits of the value written
> >> (high bits are all zero). By the time this value is dereferenced (a
> >> few instructions later) and the exception handlers run, the memory
> >> location from which it was loaded has the correct 64-bit value with
> >> a non-zero upper half.
> >>
> >> Obviously on a real ARM memory barriers are critical, and indeed
> >> the code has such barriers in it, but I'm assuming that any possible
> >> mistranslation of the barriers is irrelevant because for a 64-bit load
> >> and a 64-bit store you should get all or nothing. Other clues that may
> >> be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
> >> is used to resolve a race when updating another variable), and the
> >> busy-wait loop has a yield instruction in it (although those appear
> >> to be no-ops with MTTCG).
> >
> > This might have to do with how ldrex/strex is emulated; are you relying
> > on the exclusive pair detecting ABA? If so, your code won't work in
> > QEMU since it uses cmpxchg to emulate ldrex/strex.
>
> ABA problem is nothing to do with tearing.  And cmpxchg will definitely not
> create tearing problems.

In fact, the ldrex/strex are there implementing a cmpxchg. :)

> I don't know how we would manage 64-bit tearing on a 64-bit host, at least
> for the aarch64 guest, for which I believe we have a good emulation.
>
> > - Pin the QEMU-MTTCG process to a single CPU. Can you repro then?
>
> A good suggestion.

Thanks for the suggestions. I've been running this configuration all day, and haven't seen a repro yet in hundreds of attempts. I'll report if that changes.

The problem is that this test isn't very conclusive... I don't know how often the VCPU threads will context switch, but I suspect it's pretty rare relative to the ability to expose races when they run directly on different host cores. And if the race really doesn't exist when running on a single core, that to me suggests a hardware problem, which is even less likely.

> > - Force the emulation of cmpxchg via EXCP_ATOMIC with:
> >
> > diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
> > index 87f673e..771effe5 100644
> > --- a/tcg/tcg-op.c
> > +++ b/tcg/tcg-op.c
> > @@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
> >           }
> >           tcg_temp_free_i64(t1);
> >       } else if ((memop & MO_SIZE) == MO_64) {
> > -#ifdef CONFIG_ATOMIC64
> > +#if 0
>
> I suspect this will simply alter the timing.  However, give it a go by all means.

I haven't tried this yet. If it behaves as you describe, it seems likely to make the race harder to hit.

> If there's a test case that you can share, that would be awesome.

I wish we could :(

> Especially if you can prod it to happen with a standalone minimal binary.  With
> luck you can reproduce via aarch64-linux-user too, and simply signal an error
> via branch to __builtin_trap.

Initial attempts to repro this with a tight loop failed, but I'll take another stab at producing a standalone test program that we can share.

Andrew
Alex Bennée July 26, 2017, 7:59 a.m. UTC | #3
Andrew Baumann <Andrew.Baumann@microsoft.com> writes:

>> From: Richard Henderson [mailto:rth7680@gmail.com] On Behalf Of Richard
>> Henderson
>> Sent: Monday, 24 July 2017 15:03
>>
>> On 07/24/2017 02:23 PM, Emilio G. Cota wrote:
>> > (Adding some Cc's)
>> >
>> > On Mon, Jul 24, 2017 at 19:05:33 +0000, Andrew Baumann via Qemu-devel
>> wrote:
>> >> Hi all,
>> >>
>> >> I'm trying to track down what appears to be a translation bug in either
>> >> the aarch64 target or x86_64 TCG (in multithreaded mode). The
>> symptoms
>>
>> I assume this is really x86_64 and not i686 as host.
>
> Yes.
>
>> >> are entirely consistent with a torn read/write -- that is, a 64-bit
>> >> load or store that was translated to two 32-bit loads and stores --
>> >> but that's obviously not what happens in the common path through the
>> >> translation for this code, so I'm wondering: are there any cases in
>> >> which qemu will split a 64-bit memory access into two 32-bit accesses?
>> >
>> > That would be a bug in MTTCG.
>> >
>> >> The code: Guest CPU A writes a 64-bit value to an aligned memory
>> >> location that was previously 0, using a regular store; e.g.:
>> >> 	f9000034 str	     x20,[x1]
>> >>
>> >> Guest CPU B (who is busy-waiting) reads a value from the same location:
>> >> 	f9400280 ldr	     x0,[x20]
>> >>
>> >> The symptom: CPU B loads a value that is neither NULL nor the value
>> >> written. Instead, x0 gets only the low 32-bits of the value written
>> >> (high bits are all zero). By the time this value is dereferenced (a
>> >> few instructions later) and the exception handlers run, the memory
>> >> location from which it was loaded has the correct 64-bit value with
>> >> a non-zero upper half.
>> >>
>> >> Obviously on a real ARM memory barriers are critical, and indeed
>> >> the code has such barriers in it, but I'm assuming that any possible
>> >> mistranslation of the barriers is irrelevant because for a 64-bit load
>> >> and a 64-bit store you should get all or nothing. Other clues that may
>> >> be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
>> >> is used to resolve a race when updating another variable), and the
>> >> busy-wait loop has a yield instruction in it (although those appear
>> >> to be no-ops with MTTCG).
>> >
>> > This might have to do with how ldrex/strex is emulated; are you relying
>> > on the exclusive pair detecting ABA? If so, your code won't work in
>> > QEMU since it uses cmpxchg to emulate ldrex/strex.
>>
>> ABA problem is nothing to do with tearing.  And cmpxchg will definitely not
>> create tearing problems.
>
> In fact, the ldrex/strex are there implementing a cmpxchg. :)
>
>> I don't know how we would manage 64-bit tearing on a 64-bit host, at least
>> for
>> the aarch64 guest, for which I believe we have a good emulation.
>>
>> > - Pin the QEMU-MTTCG process to a single CPU. Can you repro then?
>>
>> A good suggestion.
>
> Thanks for the suggestions. I've been running this configuration all day, and haven't seen a repro yet in hundreds of attempts. I'll report if that changes.
>
> The problem is that this test isn't very conclusive... I don't know
> how often the VCPU threads will context switch, but I suspect it's
> pretty rare relative to the ability to expose races when they run
> directly on different host cores. And if the race really doesn't exist
> when running on a single core, that to me suggests a hardware problem,
> which is even less likely.

They will likely context switch as they drop/clear locks coming out of
the translated code.

>
>> > - Force the emulation of cmpxchg via EXCP_ATOMIC with:
>> >
>> > diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
>> > index 87f673e..771effe5 100644
>> > --- a/tcg/tcg-op.c
>> > +++ b/tcg/tcg-op.c
>> > @@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64
>> retv, TCGv addr, TCGv_i64 cmpv,
>> >           }
>> >           tcg_temp_free_i64(t1);
>> >       } else if ((memop & MO_SIZE) == MO_64) {
>> > -#ifdef CONFIG_ATOMIC64
>> > +#if 0
>>
>> I suspect this will simply alter the timing.  However, give it a go by all means.
>
> I haven't tried this yet. If it behaves as you describe, it seems likely to make the race harder to hit.
>
>> If there's a test case that you can share, that would be awesome.
>
> I wish we could :(
>
>> Especially if you can prod it to happen with a standalone minimal binary.  With
>> luck you can reproduce via aarch64-linux-user too, and simply signal an error
>> via branch to __builtin_trap.
>
> Initial attempts to repro this with a tight loop failed, but I'll take
> another stab at producing a standalone test program that we can share.

You could try using the counter value as a way to synchronise threads if
there is a certain race you are trying to engineer. I did something
similar in the barrier tests I wrote while doing the MTTCG work:

  https://github.com/stsquad/kvm-unit-tests/blob/mttcg/current-tests-v8/arm/barrier-litmus-test.c#L61

>
> Andrew


--
Alex Bennée

Patch

diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 87f673e..771effe5 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
         }
         tcg_temp_free_i64(t1);
     } else if ((memop & MO_SIZE) == MO_64) {
-#ifdef CONFIG_ATOMIC64
+#if 0
         gen_atomic_cx_i64 gen;

         gen = table_cmpxchg[memop & (MO_SIZE | MO_BSWAP)];