From patchwork Wed Apr 22 15:07:47 2015
X-Patchwork-Submitter: Adhemerval Zanella Netto
X-Patchwork-Id: 463691
Message-ID: <5537B943.2060001@linaro.org>
Date: Wed, 22 Apr 2015 12:07:47 -0300
From: Adhemerval Zanella
To: libc-alpha@sourceware.org
Subject: Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
References: <553793A3.7030206@arm.com>
In-Reply-To: <553793A3.7030206@arm.com>

Hi,

On 22-04-2015 09:27, Szabolcs Nagy wrote:
> Lazy TLSDESC initialization needs to be synchronized with concurrent
> TLS accesses.
>
> The original ARM TLSDESC design did not discuss this race, which
> arises on systems with weak memory ordering guarantees:
>
> http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
>
> "Storing the final value of the TLS descriptor also needs care: the
> resolver field must be set to its final value after the argument gets
> its final value, such that if any thread attempts to use the
> descriptor before it gets its final value, it still goes to the hold
> function."
>
> The current glibc code (i386, x86_64, arm, aarch64) is
>
>   td->arg = ...;
>   td->entry = ...;
>
> The ARM memory model allows store-store reordering, so a barrier is
> needed between these two stores.  (The code is broken on x86 as well
> in principle: the x86 memory model does not help at the C code level,
> since the compiler may reorder non-aliasing stores.)
>
> What is missing from the TLSDESC design spec is a description of the
> ordering requirements on the read side: the TLS access sequence
> (generated by the compiler) loads the descriptor function pointer
> (td->entry), and then the argument (td->arg) is loaded in that
> function.  These two loads must observe the changes made on the write
> side in a sequentially consistent way.  The code in asm:
>
>   ldr x1, [x0]      // load td->entry
>   blr x1            // call it
>
> entryfunc:
>   ldr x0, [x0,#8]   // load td->arg
>
> The branch (blr) does not provide a load-load ordering barrier, so the
> second load may observe the old arg even though the first load
> observed the new entry.  This happens on existing aarch64 hardware,
> causing invalid memory accesses due to the incorrect TLS offset.
>
> Various alternatives were considered to force the load ordering in the
> descriptor function:
>
> (1) barrier
>
> entryfunc:
>   dmb ishld
>   ldr x0, [x0,#8]
>
> (2) address dependency (if the address of the second load depends on
> the result of the first one, the ordering is guaranteed):
>
> entryfunc:
>   ldr x1, [x0]
>   and x1, x1, #8
>   orr x1, x1, #8
>   ldr x0, [x0,x1]
>
> (3) load-acquire (an ARMv8 instruction that is ordered before
> subsequent loads and stores):
>
> entryfunc:
>   ldar xzr, [x0]
>   ldr x0, [x0,#8]
>
> Option (1) is the simplest but slowest (note: this runs at every TLS
> access).  Options (2) and (3) do one extra load from [x0] (same-address
> loads are ordered, so it happens-after the load at the call site).
> Option (2) clobbers x1, so I think (3) is the best solution (a load
> into the zero register has the same effect as a load into a normal
> register, but the result is discarded, so nothing is clobbered).  Note
> that the TLSDESC ABI allows the descriptor function to clobber x1, but
> current gcc does not implement this correctly; gcc will be fixed
> independently.  The dynamic and undefweak descriptors currently
> save/restore x1, so only static TLS would be affected by this
> clobbering issue.

I see option (3) as the preferred one as well, since it fits better
with the C11 memory semantics.

> On the write side I used a full barrier (__sync_synchronize) although
>
>   dmb ishst
>
> would be enough, but the write side is not performance critical.

My understanding is that you do not need a sequentially consistent
barrier here, but rather a store-release on 'td->arg', since the code
only requires that the 'td->entry' store not be reordered with the
'td->arg' store.
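To make the intended pairing concrete, here is a minimal stand-alone
C11 sketch (illustrative only; the type and function names are made up
and this is not the glibc code).  Note that in C11 the release
semantics attach to the store that is performed last, the one that
publishes the descriptor, so that every store before it becomes
visible to a reader that acquire-loads it:

  #include <stdatomic.h>

  struct desc;
  typedef void *(*entry_fn) (struct desc *);

  struct desc
  {
    _Atomic entry_fn entry;   /* td->entry: descriptor function.  */
    _Atomic (void *) arg;     /* td->arg: its argument.  */
  };

  /* Write side (lazy resolver): the release store to entry orders the
     earlier store to arg before it, so a reader that observes the new
     entry also observes the new arg.  */
  static void
  publish (struct desc *td, entry_fn fn, void *arg)
  {
    atomic_store_explicit (&td->arg, arg, memory_order_relaxed);
    atomic_store_explicit (&td->entry, fn, memory_order_release);
  }

  /* Read side (TLS access sequence): the acquire load of entry pairs
     with the release store above, so the load of arg inside the entry
     function is guaranteed to observe the published value.  */
  static void *
  tls_access (struct desc *td)
  {
    entry_fn fn = atomic_load_explicit (&td->entry, memory_order_acquire);
    return fn (td);
  }

On AArch64 the acquire load maps to exactly the ldar of option (3)
above.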
So, to adjust this to C11 semantics, I would change (see the hunk
quoted at the bottom):

> I introduced a new _dl_tlsdesc_return_lazy descriptor function for
> lazily relocated static TLS, so the non-lazy use case is not
> affected.  The dynamic and undefweak descriptors do enough work that
> the additional ldar should not have a performance impact.
>
> Other thoughts:
>
> - Lazy binding for static TLS may be unnecessary complexity: it seems
> that gcc generates at most two static TLSDESC relocation entries per
> translation unit (zero- and non-zero-initialized TLS), so a lot of
> object files with static TLS would have to be linked into a shared
> object before the load-time relocation overhead became significant.
> Unless there is some evidence that lazy static TLS relocation makes
> sense, I would suggest removing that logic (from all archs).  (The
> dynamic case is probably a similar micro-optimization, but there the
> lazy vs non-lazy semantics differ in the case of undefined symbols,
> and some applications may depend on that.)

Recently I have been seeing that lazy relocation in general yields
smaller gains than it used to while adding a lot of complexity.  I
would like to see a use case where lazy TLS relocation, or even normal
lazy relocation, shows a real gain.

> - 32-bit arm with -mtls-dialect=gnu2 is affected too; it will have to
> go with option (1) or (2), with an additional twist: some existing
> ARMv7 CPUs do not implement same-address load ordering reliably, see:
> http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf
>
> - I don't understand the role of the hold function in the general
> TLSDESC design: why is it not enough to let other threads wait on the
> global lock in the initial resolver function?  Is the global dl lock
> implemented as a spin lock?  Is it for some liveness/fairness
> property?
>
> - There are some incorrect uses of "cfi_adjust_cfa_offset" in
> dl-tlsdesc.S; that is a separate patch.
>
> Changelog:
>
> 2015-04-22  Szabolcs Nagy
>
> 	[BZ #18034]
> 	* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
> 	* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
> 	(_dl_tlsdesc_undefweak): Guarantee load-load ordering.
> 	(_dl_tlsdesc_dynamic): Likewise.
> 	* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Add
> 	barrier between stores.
>
> diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
> index 4821f8c..f738cc6 100644
> --- a/sysdeps/aarch64/tlsdesc.c
> +++ b/sysdeps/aarch64/tlsdesc.c
> @@ -87,6 +87,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
>    if (!sym)
>      {
>        td->arg = (void*) reloc->r_addend;
> +      /* Barrier so readers see the write above before the one below.  */
> +      __sync_synchronize ();

to 'atomic_store_release (&td->arg, (void *) reloc->r_addend)'.
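For concreteness, a sketch of how that write side could look with
glibc's internal C11-style atomic macros (this assumes the
atomic_store_release macro from include/atomic.h; the td->entry
assignment is reconstructed from the changelog context above, so treat
the exact lines as illustrative rather than a verified patch).  Since a
release store only orders the stores before it, the release semantics
have to end up on the store that publishes the descriptor, i.e. the
last one:

  if (!sym)
    {
      /* Plain store: made visible by the release store below.  */
      td->arg = (void *) reloc->r_addend;
      /* Release store: any thread that observes the new entry
	 function is guaranteed to also observe the td->arg store
	 above.  On AArch64 this can be implemented with stlr.  */
      atomic_store_release (&td->entry, _dl_tlsdesc_undefweak);
    }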