From patchwork Wed Apr 22 15:07:47 2015
X-Patchwork-Submitter: Adhemerval Zanella Netto
X-Patchwork-Id: 463691
Message-ID: <5537B943.2060001@linaro.org>
Date: Wed, 22 Apr 2015 12:07:47 -0300
From: Adhemerval Zanella
To: libc-alpha@sourceware.org
Subject: Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
References: <553793A3.7030206@arm.com>
In-Reply-To: <553793A3.7030206@arm.com>

Hi,

On 22-04-2015 09:27, Szabolcs Nagy wrote:
> Lazy TLSDESC initialization needs to be synchronized with concurrent
> TLS accesses.
>
> The original ARM TLSDESC design did not discuss this race, which
> arises on systems with weak memory ordering guarantees:
>
> http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
>
> "Storing the final value of the TLS descriptor also needs care: the
> resolver field must be set to its final value after the argument gets
> its final value, such that if any thread attempts to use the
> descriptor before it gets its final value, it still goes to the hold
> function."
>
> The current glibc code (i386, x86_64, arm, aarch64) is
>
>   td->arg = ...;
>   td->entry = ...;
>
> The ARM memory model allows store-store reordering, so a barrier is
> needed between these two stores.  (The code is broken on x86 as well
> in principle: the x86 memory model does not help at the C code level,
> since the compiler may reorder non-aliasing stores.)
>
> What is missing from the TLSDESC design spec is a description of the
> ordering requirements on the read side: the TLS access sequence
> (generated by the compiler) loads the descriptor function pointer
> (td->entry), and then the argument (td->arg) is loaded in that
> function.  These two loads must observe the changes made on the write
> side in a sequentially consistent way.  The code in asm:
>
>   ldr x1, [x0]      // load td->entry
>   blr x1            // call it
>
> entryfunc:
>   ldr x0, [x0,#8]   // load td->arg
>
> The branch (blr) does not provide a load-load ordering barrier, so the
> second load may observe the old arg even though the first load
> observed the new entry.  This happens on existing aarch64 hardware,
> causing invalid memory accesses due to the incorrect TLS offset.
>
> Various alternatives were considered to force the load ordering in the
> descriptor function:
>
> (1) barrier
>
> entryfunc:
>   dmb ishld
>   ldr x0, [x0,#8]
>
> (2) address dependency (if the address of the second load depends on
> the result of the first one, the ordering is guaranteed):
>
> entryfunc:
>   ldr x1, [x0]
>   and x1, x1, #8
>   orr x1, x1, #8
>   ldr x0, [x0,x1]
>
> (3) load-acquire (an ARMv8 instruction that is ordered before
> subsequent loads and stores):
>
> entryfunc:
>   ldar xzr, [x0]
>   ldr x0, [x0,#8]
>
> Option (1) is the simplest but slowest (note: this runs at every TLS
> access).  Options (2) and (3) do one extra load from [x0] (same-address
> loads are ordered, so it happens-after the load at the call site).
> Option (2) clobbers x1, so I think (3) is the best solution (a load
> into the zero register has the same effect as a load into a normal
> register, but the result is discarded, so nothing is clobbered).  Note
> that the TLSDESC ABI allows the descriptor function to clobber x1, but
> current gcc does not implement this correctly; gcc will be fixed
> independently.  The dynamic and undefweak descriptors currently
> save/restore x1, so only static TLS would be affected by this
> clobbering issue.

I see option (3) as the preferred one as well, since it fits better
with the C11 memory semantics.

> On the write side I used a full barrier (__sync_synchronize) although
>
>   dmb ishst
>
> would be enough, but the write side is not performance critical.

My understanding is that you do not need a sequentially consistent
barrier here, but rather a store-release on 'td->arg', since the code
only requires that the 'td->entry' store not be reordered with the
'td->arg' store.
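To make the intended pairing concrete, here is a minimal stand-alone
C11 sketch (illustrative only; the type and function names are made up
and this is not the glibc code).  Note that in C11 the release
semantics attach to the store that is performed last, the one that
publishes the descriptor, so that every store before it becomes
visible to a reader that acquire-loads it:

  #include <stdatomic.h>

  struct desc;
  typedef void *(*entry_fn) (struct desc *);

  struct desc
  {
    _Atomic entry_fn entry;   /* td->entry: descriptor function.  */
    _Atomic (void *) arg;     /* td->arg: its argument.  */
  };

  /* Write side (lazy resolver): the release store to entry orders the
     earlier store to arg before it, so a reader that observes the new
     entry also observes the new arg.  */
  static void
  publish (struct desc *td, entry_fn fn, void *arg)
  {
    atomic_store_explicit (&td->arg, arg, memory_order_relaxed);
    atomic_store_explicit (&td->entry, fn, memory_order_release);
  }

  /* Read side (TLS access sequence): the acquire load of entry pairs
     with the release store above, so the load of arg inside the entry
     function is guaranteed to observe the published value.  */
  static void *
  tls_access (struct desc *td)
  {
    entry_fn fn = atomic_load_explicit (&td->entry, memory_order_acquire);
    return fn (td);
  }

On AArch64 the acquire load maps to exactly the ldar of option (3)
above.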
So, to adjust this to C11 semantics, I would change (see the hunk
quoted at the bottom):

> I introduced a new _dl_tlsdesc_return_lazy descriptor function for
> lazily relocated static TLS, so the non-lazy use case is not
> affected.  The dynamic and undefweak descriptors do enough work that
> the additional ldar should not have a performance impact.
>
> Other thoughts:
>
> - Lazy binding for static TLS may be unnecessary complexity: it seems
> that gcc generates at most two static TLSDESC relocation entries per
> translation unit (zero- and non-zero-initialized TLS), so a lot of
> object files with static TLS would have to be linked into a shared
> object before the load-time relocation overhead became significant.
> Unless there is some evidence that lazy static TLS relocation makes
> sense, I would suggest removing that logic (from all archs).  (The
> dynamic case is probably a similar micro-optimization, but there the
> lazy vs non-lazy semantics differ in the case of undefined symbols,
> and some applications may depend on that.)

Recently I have been seeing that lazy relocation in general yields
smaller gains than it used to while adding a lot of complexity.  I
would like to see a use case where lazy TLS relocation, or even normal
lazy relocation, shows a real gain.

> - 32-bit arm with -mtls-dialect=gnu2 is affected too; it will have to
> go with option (1) or (2), with an additional twist: some existing
> ARMv7 CPUs do not implement same-address load ordering reliably, see:
> http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf
>
> - I don't understand the role of the hold function in the general
> TLSDESC design: why is it not enough to let other threads wait on the
> global lock in the initial resolver function?  Is the global dl lock
> implemented as a spin lock?  Is it for some liveness/fairness
> property?
>
> - There are some incorrect uses of "cfi_adjust_cfa_offset" in
> dl-tlsdesc.S; that is a separate patch.
>
> Changelog:
>
> 2015-04-22  Szabolcs Nagy
>
> 	[BZ #18034]
> 	* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
> 	* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
> 	(_dl_tlsdesc_undefweak): Guarantee load-load ordering.
> 	(_dl_tlsdesc_dynamic): Likewise.
> 	* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Add
> 	barrier between stores.
>
> diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
> index 4821f8c..f738cc6 100644
> --- a/sysdeps/aarch64/tlsdesc.c
> +++ b/sysdeps/aarch64/tlsdesc.c
> @@ -87,6 +87,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
>    if (!sym)
>      {
>        td->arg = (void*) reloc->r_addend;
> +      /* Barrier so readers see the write above before the one below.  */
> +      __sync_synchronize ();

to 'atomic_store_release (&td->arg, (void *) reloc->r_addend)'.
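For concreteness, a sketch of how that write side could look with
glibc's internal C11-style atomic macros (this assumes the
atomic_store_release macro from include/atomic.h; the td->entry
assignment is reconstructed from the changelog context above, so treat
the exact lines as illustrative rather than a verified patch).  Since a
release store only orders the stores before it, the release semantics
have to end up on the store that publishes the descriptor, i.e. the
last one:

  if (!sym)
    {
      /* Plain store: made visible by the release store below.  */
      td->arg = (void *) reloc->r_addend;
      /* Release store: any thread that observes the new entry
	 function is guaranteed to also observe the td->arg store
	 above.  On AArch64 this can be implemented with stlr.  */
      atomic_store_release (&td->entry, _dl_tlsdesc_undefweak);
    }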