Message ID | FEA18DAD-B1F9-4456-8B07-116E6938537C@suse.de |
---|---|
State | New |
Headers | show |
On 21 November 2011 02:05, Alexander Graf <agraf@suse.de> wrote: > In an adventure to find out why yast2-core fails in its testsuite during build on ARM for us, I dove into the code for a bit and debugged it down to our old friend: multi-threaded TB invalidation. https://bugs.launchpad.net/qemu/+bug/668799 > I have a nice test case at > > http://csgraf.de/tmp/yast2-core.tbz2 > > which you can easily use to reproduce the race. There's also a test case attached to 668799. > Does anyone have pending patches, approaches or ideas on how to > fix this? It seems to be the last big road block towards getting > linux-user actually workable for us. I think I sent you my work-in-progress patch, didn't I? The basic idea is that we throw away this code which tries to unlink TB chains, and instead we just have a simple flag which we check at the start of each TB (cost: a few insns and a load from something that will be in L1), and a thread wanting to cause a cpu exit simply sets the flag. There are a couple of things in the patch that needed cleaning up but the main reason I didn't get round to doing anything with it was that I wanted to benchmark it for speed and I don't have a nice benchmarking setup... (my very rough tests suggested ~3% slowdown which personally I think is an acceptable tradeoff for reliability.) -- PMM
On 21.11.2011, at 09:54, Peter Maydell wrote: > On 21 November 2011 02:05, Alexander Graf <agraf@suse.de> wrote: >> In an adventure to find out why yast2-core fails in its testsuite during build on ARM for us, I dove into the code for a bit and debugged it down to our old friend: multi-threaded TB invalidation. > > https://bugs.launchpad.net/qemu/+bug/668799 Yeah, I keep running into the s/interrupt_lock/tb_lock/ patch which fails to solve the issue for me. > >> I have a nice test case at >> >> http://csgraf.de/tmp/yast2-core.tbz2 >> >> which you can easily use to reproduce the race. > > There's also a test case attached to 668799. Oops :). Oh well - the more the merrier :). > >> Does anyone have pending patches, approaches or ideas on how to >> fix this? It seems to be the last big road block towards getting >> linux-user actually workable for us. > > I think I sent you my work-in-progress patch, didn't I? I know that we talked about it. But I can't find it. That doesn't mean that you didn't send it, but either way I'd really like to see your patch somewhere where it can be found by more people than just me. How about the bug report? :) > The basic idea is that we throw away this code which tries > to unlink TB chains, and instead we just have a simple flag > which we check at the start of each TB (cost: a few insns > and a load from something that will be in L1), and a > thread wanting to cause a cpu exit simply sets the flag. > > There are a couple of things in the patch that needed cleaning > up but the main reason I didn't get round to doing anything with > it was that I wanted to benchmark it for speed and I don't > have a nice benchmarking setup... (my very rough tests > suggested ~3% slowdown which personally I think is an > acceptable tradeoff for reliability.) I agree that it's an acceptable tradeoff. There's also always the chance of doing tricks like I did in my patch tonight, so you could do the unlinking code in single-threaded applications and switch to the more expensive flag check in multi-threaded ones. Alex
On 21 November 2011 09:55, Alexander Graf <agraf@suse.de> wrote: > On 21.11.2011, at 09:54, Peter Maydell wrote: >> I think I sent you my work-in-progress patch, didn't I? > > I know that we talked about it. But I can't find it. That doesn't > mean that you didn't send it, but either way I'd really like to see > your patch somewhere where it can be found by more people than just > me. How about the bug report? :) Good idea -- I'll attach it there today. > There's also always the chance of doing tricks like I did in my patch > tonight, so you could do the unlinking code in single-threaded > applications and switch to the more expensive flag check in > multi-threaded ones. You can hit this bug even single-threaded or in system mode: it's just that aggressively multi-threaded user apps are the easiest way to trigger it. So I don't think we can ever use the tb-unlinking code safely. -- PMM
On 21.11.2011, at 10:59, Peter Maydell wrote: > On 21 November 2011 09:55, Alexander Graf <agraf@suse.de> wrote: >> On 21.11.2011, at 09:54, Peter Maydell wrote: >>> I think I sent you my work-in-progress patch, didn't I? >> >> I know that we talked about it. But I can't find it. That doesn't >> mean that you didn't send it, but either way I'd really like to see >> your patch somewhere where it can be found by more people than just >> me. How about the bug report? :) > > Good idea -- I'll attach it there today. > >> There's also always the chance of doing tricks like I did in my patch >> tonight, so you could do the unlinking code in single-threaded >> applications and switch to the more expensive flag check in >> multi-threaded ones. > > You can hit this bug even single-threaded or in system mode: > it's just that aggressively multi-threaded user apps are the > easiest way to trigger it. So I don't think we can ever use > the tb-unlinking code safely. Not sure I understand. I thought it's a race between multiple threads trying to chain/unchain TBs at the same time? How can that happen in system or single-threaded mode? Alex
On 21 November 2011 10:07, Alexander Graf <agraf@suse.de> wrote: > On 21.11.2011, at 10:59, Peter Maydell wrote: >> You can hit this bug even single-threaded or in system mode: >> it's just that aggressively multi-threaded user apps are the >> easiest way to trigger it. So I don't think we can ever use >> the tb-unlinking code safely. > > Not sure I understand. I thought it's a race between multiple >threads trying to chain/unchain TBs at the same time? How can > that happen in system or single-threaded mode? It's a race between (a) a thread executing code and (b) any other thread or signal handler that calls cpu_exit(). So (b) can be the IO thread, or the linux-user host_signal_handler(). -- PMM
On Mon, Nov 21, 2011 at 10:11:37AM +0000, Peter Maydell wrote: > On 21 November 2011 10:07, Alexander Graf <agraf@suse.de> wrote: > > On 21.11.2011, at 10:59, Peter Maydell wrote: > >> You can hit this bug even single-threaded or in system mode: > >> it's just that aggressively multi-threaded user apps are the > >> easiest way to trigger it. So I don't think we can ever use > >> the tb-unlinking code safely. > > > > Not sure I understand. I thought it's a race between multiple > >threads trying to chain/unchain TBs at the same time? How can > > that happen in system or single-threaded mode? > > It's a race between (a) a thread executing code and (b) any > other thread or signal handler that calls cpu_exit(). So (b) > can be the IO thread, or the linux-user host_signal_handler(). Would it be a good idea to write this as a comment about cpu_unlink_tb? Regards, chenwj
diff --git a/exec.c b/exec.c index 6b92198..0c48222 100644 --- a/exec.c +++ b/exec.c @@ -1356,6 +1356,7 @@ static inline void tb_reset_jump_recursive2(TranslationBlock *tb, int n) tb1 = (TranslationBlock *)((long)tb1 & ~3); if (n1 == 2) break; + assert(tb1); tb1 = tb1->jmp_next[n1]; } /* we are now sure now that tb jumps to tb1 */ So how do I know it's a race in the chaining code? Well, it only happens with threaded code. And it only happens when actually chaining. The following patch makes the test case work: diff --git a/exec-all.h b/exec-all.h index c211242..c8124e9 100644 --- a/exec-all.h +++ b/exec-all.h @@ -259,6 +259,7 @@ static inline void tb_set_jmp_target(TranslationBlock *tb, static inline void tb_add_jump(TranslationBlock *tb, int n, TranslationBlock *tb_next) { + if(1) return; /* NOTE: this test is only needed for thread safety */ if (!tb->jmp_next[n]) { /* patch the native jump address */