
linux-user crashing in multi-threaded programs

Message ID FEA18DAD-B1F9-4456-8B07-116E6938537C@suse.de
State New

Commit Message

Alexander Graf Nov. 21, 2011, 2:05 a.m. UTC
Howdy,

In an adventure to find out why yast2-core fails in its testsuite during build on ARM for us, I dove into the code for a bit and debugged it down to our old friend: multi-threaded TB invalidation.

I have a nice test case at

  http://csgraf.de/tmp/yast2-core.tbz2

which you can easily use to reproduce the race. I run it using the following command line:

  qemu-arm -L yast2-core yast2-core/bin/test_thread_log.prg

It breaks pretty much every single time.

Apparently two threads try to modify jmp_next at the same time, rendering the whole invalidation racy. The following assertion patch fires under that race, while in a normal single-threaded world it should never trigger:



I would advise against using it as a real fix though, as it effectively disables TB chaining, making all of QEMU pretty slow.


Does anyone have pending patches, approaches, or ideas on how to fix this? It seems to be the last big roadblock towards getting linux-user actually workable for us.


Alex

Comments

Peter Maydell Nov. 21, 2011, 8:54 a.m. UTC | #1
On 21 November 2011 02:05, Alexander Graf <agraf@suse.de> wrote:
> In an adventure to find out why yast2-core fails in its testsuite during build on ARM for us, I dove into the code for a bit and debugged it down to our old friend: multi-threaded TB invalidation.

https://bugs.launchpad.net/qemu/+bug/668799

> I have a nice test case at
>
>  http://csgraf.de/tmp/yast2-core.tbz2
>
> which you can easily use to reproduce the race.

There's also a test case attached to 668799.

> Does anyone have pending patches, approaches or ideas on how to
> fix this? It seems to be the last big road block towards getting
> linux-user actually workable for us.

I think I sent you my work-in-progress patch, didn't I?
The basic idea is that we throw away this code which tries
to unlink TB chains, and instead we just have a simple flag
which we check at the start of each TB (cost: a few insns
and a load from something that will be in L1), and a
thread wanting to cause a cpu exit simply sets the flag.

There are a couple of things in the patch that needed cleaning
up but the main reason I didn't get round to doing anything with
it was that I wanted to benchmark it for speed and I don't
have a nice benchmarking setup... (my very rough tests
suggested ~3% slowdown which personally I think is an
acceptable tradeoff for reliability.)

-- PMM
Alexander Graf Nov. 21, 2011, 9:55 a.m. UTC | #2
On 21.11.2011, at 09:54, Peter Maydell wrote:

> On 21 November 2011 02:05, Alexander Graf <agraf@suse.de> wrote:
>> In an adventure to find out why yast2-core fails in its testsuite during build on ARM for us, I dove into the code for a bit and debugged it down to our old friend: multi-threaded TB invalidation.
> 
> https://bugs.launchpad.net/qemu/+bug/668799

Yeah, I keep running into the s/interrupt_lock/tb_lock/ patch which fails to solve the issue for me.

> 
>> I have a nice test case at
>> 
>>  http://csgraf.de/tmp/yast2-core.tbz2
>> 
>> which you can easily use to reproduce the race.
> 
> There's also a test case attached to 668799.

Oops :). Oh well - the more the merrier :).

> 
>> Does anyone have pending patches, approaches or ideas on how to
>> fix this? It seems to be the last big road block towards getting
>> linux-user actually workable for us.
> 
> I think I sent you my work-in-progress patch, didn't I?

I know that we talked about it. But I can't find it. That doesn't mean that you didn't send it, but either way I'd really like to see your patch somewhere where it can be found by more people than just me. How about the bug report? :)

> The basic idea is that we throw away this code which tries
> to unlink TB chains, and instead we just have a simple flag
> which we check at the start of each TB (cost: a few insns
> and a load from something that will be in L1), and a
> thread wanting to cause a cpu exit simply sets the flag.
> 
> There are a couple of things in the patch that needed cleaning
> up but the main reason I didn't get round to doing anything with
> it was that I wanted to benchmark it for speed and I don't
> have a nice benchmarking setup... (my very rough tests
> suggested ~3% slowdown which personally I think is an
> acceptable tradeoff for reliability.)

I agree that it's an acceptable tradeoff. There's also always the chance of doing tricks like I did in my patch tonight, so you could do the unlinking code in single-threaded applications and switch to the more expensive flag check in multi-threaded ones.


Alex
Peter Maydell Nov. 21, 2011, 9:59 a.m. UTC | #3
On 21 November 2011 09:55, Alexander Graf <agraf@suse.de> wrote:
> On 21.11.2011, at 09:54, Peter Maydell wrote:
>> I think I sent you my work-in-progress patch, didn't I?
>
> I know that we talked about it. But I can't find it. That doesn't
> mean that you didn't send it, but either way I'd really like to see
> your patch somewhere where it can be found by more people than just
> me. How about the bug report? :)

Good idea -- I'll attach it there today.

> There's also always the chance of doing tricks like I did in my patch
> tonight, so you could do the unlinking code in single-threaded
> applications and switch to the more expensive flag check in
> multi-threaded ones.

You can hit this bug even single-threaded or in system mode:
it's just that aggressively multi-threaded user apps are the
easiest way to trigger it. So I don't think we can ever use
the tb-unlinking code safely.

-- PMM
Alexander Graf Nov. 21, 2011, 10:07 a.m. UTC | #4
On 21.11.2011, at 10:59, Peter Maydell wrote:

> On 21 November 2011 09:55, Alexander Graf <agraf@suse.de> wrote:
>> On 21.11.2011, at 09:54, Peter Maydell wrote:
>>> I think I sent you my work-in-progress patch, didn't I?
>> 
>> I know that we talked about it. But I can't find it. That doesn't
>> mean that you didn't send it, but either way I'd really like to see
>> your patch somewhere where it can be found by more people than just
>> me. How about the bug report? :)
> 
> Good idea -- I'll attach it there today.
> 
>> There's also always the chance of doing tricks like I did in my patch
>> tonight, so you could do the unlinking code in single-threaded
>> applications and switch to the more expensive flag check in
>> multi-threaded ones.
> 
> You can hit this bug even single-threaded or in system mode:
> it's just that aggressively multi-threaded user apps are the
> easiest way to trigger it. So I don't think we can ever use
> the tb-unlinking code safely.

Not sure I understand. I thought it's a race between multiple threads trying to chain/unchain TBs at the same time? How can that happen in system or single-threaded mode?

Alex
Peter Maydell Nov. 21, 2011, 10:11 a.m. UTC | #5
On 21 November 2011 10:07, Alexander Graf <agraf@suse.de> wrote:
> On 21.11.2011, at 10:59, Peter Maydell wrote:
>> You can hit this bug even single-threaded or in system mode:
>> it's just that aggressively multi-threaded user apps are the
>> easiest way to trigger it. So I don't think we can ever use
>> the tb-unlinking code safely.
>
> Not sure I understand. I thought it's a race between multiple
> threads trying to chain/unchain TBs at the same time? How can
> that happen in system or single-threaded mode?

It's a race between (a) a thread executing code and (b) any
other thread or signal handler that calls cpu_exit(). So (b)
can be the IO thread, or the linux-user host_signal_handler().

-- PMM
陳韋任 Nov. 23, 2011, 8:52 a.m. UTC | #6
On Mon, Nov 21, 2011 at 10:11:37AM +0000, Peter Maydell wrote:
> On 21 November 2011 10:07, Alexander Graf <agraf@suse.de> wrote:
> > On 21.11.2011, at 10:59, Peter Maydell wrote:
> >> You can hit this bug even single-threaded or in system mode:
> >> it's just that aggressively multi-threaded user apps are the
> >> easiest way to trigger it. So I don't think we can ever use
> >> the tb-unlinking code safely.
> >
> > Not sure I understand. I thought it's a race between multiple
> > threads trying to chain/unchain TBs at the same time? How can
> > that happen in system or single-threaded mode?
> 
> It's a race between (a) a thread executing code and (b) any
> other thread or signal handler that calls cpu_exit(). So (b)
> can be the IO thread, or the linux-user host_signal_handler().

  Would it be a good idea to write this down as a comment on
cpu_unlink_tb?

Regards,
chenwj

Patch

diff --git a/exec.c b/exec.c
index 6b92198..0c48222 100644
--- a/exec.c
+++ b/exec.c
@@ -1356,6 +1356,7 @@  static inline void tb_reset_jump_recursive2(TranslationBlock *tb, int n)
             tb1 = (TranslationBlock *)((long)tb1 & ~3);
             if (n1 == 2)
                 break;
+            assert(tb1);
             tb1 = tb1->jmp_next[n1];
         }
         /* we are now sure now that tb jumps to tb1 */


So how do I know it's a race in the chaining code? Well, it only happens with threaded code. And it only happens when actually chaining. The following patch makes the test case work:

diff --git a/exec-all.h b/exec-all.h
index c211242..c8124e9 100644
--- a/exec-all.h
+++ b/exec-all.h
@@ -259,6 +259,7 @@  static inline void tb_set_jmp_target(TranslationBlock *tb,
 static inline void tb_add_jump(TranslationBlock *tb, int n,
                                TranslationBlock *tb_next)
 {
+    if(1) return;
     /* NOTE: this test is only needed for thread safety */
     if (!tb->jmp_next[n]) {
         /* patch the native jump address */