Fix sporadic failure in g++.dg/tsan/aligned_vs_unaligned_race.C

Message ID CACT4Y+Yyf9ARSsuwAjNM1j=rLsb6DAANWd8Z70vOuUgh7_5yKw@mail.gmail.com
State New

Commit Message

Dmitry Vyukov Jan. 21, 2015, 8:23 a.m. UTC
Hi Mike,

Yes, I can quantify the cost. It is very high.

Here is the patch that I used:

--- rtl/tsan_rtl.cc (revision 226644)
+++ rtl/tsan_rtl.cc (working copy)
@@ -709,7 +709,11 @@
 ALWAYS_INLINE USED
 void MemoryAccess(ThreadState *thr, uptr pc, uptr addr,
     int kAccessSizeLog, bool kAccessIsWrite, bool kIsAtomic) {
   u64 *shadow_mem = (u64*)MemToShadow(addr);
+
+  atomic_fetch_add((atomic_uint64_t*)shadow_mem, 0, memory_order_acq_rel);
+

On the standard tsan benchmark that does 8-byte writes:
before:
[       OK ] DISABLED_BENCH.Mop8Write (1161 ms)
after:
[       OK ] DISABLED_BENCH.Mop8Write (5085 ms)

So that's 338% slowdown.
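
The fetch_add of 0 leaves the shadow word unchanged, but it still forces a
full atomic read-modify-write on every access, which is roughly the cost a
locked, deterministic shadow update would pay. A standalone sketch of the
same comparison in standard C++ (illustrative only; the real runtime uses
tsan's internal atomics and shadow mapping, and the real numbers are the
Mop8Write ones above):

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>

// Standalone sketch: compare a plain shadow-word load against the
// no-op atomic RMW used in the patch above.
int main() {
  std::atomic<uint64_t> shadow{0};
  const int kIters = 100 * 1000 * 1000;

  auto t0 = std::chrono::steady_clock::now();
  uint64_t sink = 0;
  for (int i = 0; i < kIters; i++)
    sink += shadow.load(std::memory_order_relaxed);  // current fast path
  auto t1 = std::chrono::steady_clock::now();

  for (int i = 0; i < kIters; i++)
    shadow.fetch_add(0, std::memory_order_acq_rel);  // adds nothing, but
                                                     // forces a locked RMW
  auto t2 = std::chrono::steady_clock::now();

  using std::chrono::duration_cast;
  using std::chrono::milliseconds;
  std::printf("load: %lld ms, rmw: %lld ms (sink=%llu)\n",
              (long long)duration_cast<milliseconds>(t1 - t0).count(),
              (long long)duration_cast<milliseconds>(t2 - t1).count(),
              (unsigned long long)sink);
}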

On Mon, Jan 19, 2015 at 6:12 PM, Mike Stump <mikestump@comcast.net> wrote:
> On Jan 19, 2015, at 12:47 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> Long story short: tsan has a logical data race in the core of the data
>> race detection algorithm. The race is not a bug, but a deliberate design
>> decision that makes tsan considerably faster.
>
> Could you please quantify that for us?  Also, what lockless update method did you use?  Did you try atomic increment?  On my machine, they are as cheap as stores; can’t imagine they could be slow at all.  If the latency and bandwidth of atomic increment is identical to store, would the costs be any higher than using a store to update the tsan data?  A proper port of tsan to my machine would make use of atomic increment.  I consider it a simple matter to sequence the thread termination and the output routine to ensure that all the updates in the threads happen before the output routine runs.  The output routine strikes me as slow, and thread termination strikes me as slow, so taking a little extra time there seems reasonable.  Was the excessive cost you saw due to the termination costs?
>
>> So, ironically, if the racing memory accesses happen almost
>> simultaneously, tsan can miss the race.
>> Thus we have sleeps.
>
> I’ve not seen a reason why the test suite should randomly fail.  The gcc test suite does not.  Could you explain why the llvm test suite does?  Surely you know that sleep is not a synchronization primitive?
>
>> Sleeps vs barrier is being discussed in the "Fix parameters of
>> __tsan_vptr_update" thread.
>
> When finished, let us know the outcome.  To date, I’ve not seen any compelling reason to have the core of tsan be other than deterministic and the test suite other than deterministic.  I’d love to see the backing for such a decision.
>
>> I would really like to keep llvm and gcc tests in sync as much as possible.
>
> Excellent, from my perspective, that would mean that you make the llvm test suite deterministic.

Comments

Jakub Jelinek Jan. 21, 2015, 8:34 a.m. UTC | #1
On Wed, Jan 21, 2015 at 12:23:34PM +0400, Dmitry Vyukov wrote:
> Hi Mike,
> 
> Yes, I can quantify the cost. It is very high.
> 
> Here is the patch that I used:
> 
> --- rtl/tsan_rtl.cc (revision 226644)
> +++ rtl/tsan_rtl.cc (working copy)
> @@ -709,7 +709,11 @@
>  ALWAYS_INLINE USED
>  void MemoryAccess(ThreadState *thr, uptr pc, uptr addr,
>      int kAccessSizeLog, bool kAccessIsWrite, bool kIsAtomic) {
>    u64 *shadow_mem = (u64*)MemToShadow(addr);
> +
> +  atomic_fetch_add((atomic_uint64_t*)shadow_mem, 0, memory_order_acq_rel);

And what would the cost be with that atomic_fetch_add guarded by
if (__builtin_expect (someCondition, 0)) ?
If that doesn't slow down the non-deterministic default case too much,
it would allow users to choose what they prefer - much faster but
unreliable, or slower but deterministic.  Then for the gcc testsuite we
could opt for the latter.

> +
> 
> On the standard tsan benchmark that does 8-byte writes:
> before:
> [       OK ] DISABLED_BENCH.Mop8Write (1161 ms)
> after:
> [       OK ] DISABLED_BENCH.Mop8Write (5085 ms)
> 
> So that's 338% slowdown.

	Jakub
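
Concretely, the guarded check Jakub suggests might look like this sketch in
standard C++ (the "deterministic" flag name and the std::atomic types are
stand-ins; tsan's runtime has its own flag parsing and atomic wrappers):

#include <atomic>
#include <cstdint>

// Sketch of the guarded slow path; "deterministic" stands in for
// whatever runtime flag would play the role of someCondition.
static bool deterministic = false;  // assumed flag, off by default

inline void MaybeSerializeShadow(std::atomic<uint64_t> *shadow_mem) {
  // Predicted-false branch: the default mode pays only an untaken
  // branch, while opting in pays the full atomic RMW on every access.
  if (__builtin_expect(deterministic, 0))
    shadow_mem->fetch_add(0, std::memory_order_acq_rel);
}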
Dmitry Vyukov Jan. 21, 2015, 8:51 a.m. UTC | #2
On Wed, Jan 21, 2015 at 11:34 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Jan 21, 2015 at 12:23:34PM +0400, Dmitry Vyukov wrote:
>> Hi Mike,
>>
>> Yes, I can quantify the cost. It is very high.
>>
>> Here is the patch that I used:
>>
>> --- rtl/tsan_rtl.cc (revision 226644)
>> +++ rtl/tsan_rtl.cc (working copy)
>> @@ -709,7 +709,11 @@
>>  ALWAYS_INLINE USED
>>  void MemoryAccess(ThreadState *thr, uptr pc, uptr addr,
>>      int kAccessSizeLog, bool kAccessIsWrite, bool kIsAtomic) {
>>    u64 *shadow_mem = (u64*)MemToShadow(addr);
>> +
>> +  atomic_fetch_add((atomic_uint64_t*)shadow_mem, 0, memory_order_acq_rel);
>
> And the cost of adding that atomic_fetch_add guarded by
> if (__builtin_expect (someCondition, 0)) ?
> If that doesn't slow down the non-deterministic default case too much,
> that would allow users to choose what they prefer - much faster unreliable
> and slower deterministic.  Then for the gcc testsuite we could opt for the
> latter.

It's more complex than this.

1. In reality the cost can be even higher. We would need to construct a
spin-mutex in the shadow, so there would be contention between threads
(both waiting for the mutex and cache-line contention).
Currently threads don't always write to shadow (sometimes they only
read it), so the cache line can actually stay shared between several
threads accessing the same shadow.

2. We don't have a spare bit in shadow for the mutex.

3. An asynchronous shadow memory flush can turn arbitrary regions of
shadow memory into zeroes. The shadow is specifically designed to
tolerate this; a mutex embedded in it would not.

4. Even that won't make multi-threaded programs deterministic. First,
there is at least one other issue in tsan -- it has only 4 slots to
remember previous memory accesses, and they are evicted in an
effectively random manner. Second, multi-threaded programs are
non-deterministic by themselves.

So there are lots of technical problems, a significant slowdown, and no
value for end users. I don't want to solve all of that just for tests.
The invisible barrier is the right solution for making the tests
deterministic.
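
For reference, a sketch of the invisible-barrier pattern (the names and the
harness are illustrative, not the actual test-suite header): the barrier
orders the racing accesses in real time, but the wait goes through
dlsym(RTLD_NEXT, ...), so tsan's interceptor never observes a
happens-before edge and still reports the race deterministically.

#include <dlfcn.h>
#include <pthread.h>

static pthread_barrier_t barrier;
static int Global;

// Call the real pthread_barrier_wait, bypassing tsan's interceptor so
// the barrier does not create a happens-before edge visible to tsan.
static void invisible_barrier_wait(pthread_barrier_t *b) {
  typedef int (*wait_fn)(pthread_barrier_t *);
  static wait_fn real_wait =
      (wait_fn)dlsym(RTLD_NEXT, "pthread_barrier_wait");
  real_wait(b);
}

static void *Thread1(void *) {
  Global = 42;                       // first racing access
  invisible_barrier_wait(&barrier);  // orders the accesses in real time
  return 0;
}

static void *Thread2(void *) {
  invisible_barrier_wait(&barrier);
  Global = 43;                       // second access; tsan reports here
  return 0;
}

int main() {
  pthread_barrier_init(&barrier, 0, 2);
  pthread_t t[2];
  pthread_create(&t[0], 0, Thread1, 0);
  pthread_create(&t[1], 0, Thread2, 0);
  pthread_join(t[0], 0);
  pthread_join(t[1], 0);
  return 0;
}

(In a real harness the barrier initialization also has to escape the
interceptor; that detail is omitted here.)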

>> On the standard tsan benchmark that does 8-byte writes:
>> before:
>> [       OK ] DISABLED_BENCH.Mop8Write (1161 ms)
>> after:
>> [       OK ] DISABLED_BENCH.Mop8Write (5085 ms)
>>
>> So that's 338% slowdown.
>
>         Jakub

Patch

--- rtl/tsan_rtl.cc (revision 226644)
+++ rtl/tsan_rtl.cc (working copy)
@@ -709,7 +709,11 @@ 
 ALWAYS_INLINE USED
 void MemoryAccess(ThreadState *thr, uptr pc, uptr addr,
     int kAccessSizeLog, bool kAccessIsWrite, bool kIsAtomic) {
   u64 *shadow_mem = (u64*)MemToShadow(addr);
+
+  atomic_fetch_add((atomic_uint64_t*)shadow_mem, 0, memory_order_acq_rel);
+
