Message ID | orzimyknzx.fsf@livre.home
---|---
State | New
On Fri, 2016-09-23 at 23:52 -0300, Alexandre Oliva wrote:
> (if it decides to update it without resizing, we also have a race, but
> both threads end up writing the same final value, which AFAIK is not
> so much of a problem)

Can this still happen after your patches?  If so, then this needs to be
fixed.  Data races are a problem.  You must use atomics to actually make
writes of the same value, concurrently by more than one thread,
well-defined.

Also, for code comments at least, we need to be less ambiguous regarding
our use of the word "race".  I think we should avoid it unless we
specifically mean a data race (as defined by C11), or something that's
bad and needs to be fixed.  If we want to say that something is not
ordered by happens-before, we should say exactly that.  This allows us
to see the term "race" as a red flag.  If we need to use atomics to
avoid data races, they will not be ordered by happens-before.  But
that's okay, because that's in the nature of cases where we have to
synchronize.

The final state that we should head towards is code that is either
data-race-free or uses proper synchronization (e.g., atomics) that is
well-documented.

> The end result is that the initializer may write to stale/freed memory,
> that may even have already been reassigned to some other purpose.
> That's very bad indeed.
>
> So, although I post the entire backported patch, because that's what I
> tested, and the comment I refer to above is in a part of the patch
> that, if applied, would restore the described race, I immediately
> withdraw it, and propose instead to revert the changes to
> elf/dl-reloc.c and nptl/allocatestack.c that I'd installed earlier
> this week to the master branch.
> They're not necessary for use with the regular __tls_get_addr, and
> they won't affect the alternate __tls_get_addr used in static
> programs, because the Static TLS block for the main program (the only
> module it cares about) and its DTV entry are always set up by
> __libc_setup_tls (main thread) or _dl_allocate_tls_init (other
> threads): it will never be the case that we will dlopen the main
> program after the program starts and have to initialize its static TLS
> blocks (and DTV entry) for all running threads, because we never
> dlopen the main program.

ISTM that all this should become a code comment, and not just be in an
email on the list.

> For the master branch, I'll post a separate patch momentarily,
> reverting the incorrect portion of the recent change, and expanding
> the comments about the race condition, so that neither myself nor
> anyone else ever mistakes another (harmless?) race for that one: that
> one thread initializes other threads' TLS blocks while holding the
> rtld lock, but later on the other threads access the initialized TLS
> blocks without taking any locks or any form of memory synchronization
> that would clearly establish, memory-wise, the implied happens-before.

IIUC, that's where the assumption about dlopen vs. usage comes in that
you included in the other patch.  If so, this is not a data race (nor a
race condition).  Or are you referring to something else?

BTW, there are no harmless data races ;)

> I'll retest and repost the patches for master and branches early next
> week, if not over the weekend.

Please also remember that with the tools we have, it's pretty unlikely
that we can test the absence of concurrency bugs.  Therefore, it is
very important that our reasoning about concurrency is thorough.  And
we need to do that collectively, which means:
* Shared terminology
* Proper, detailed documentation
Otherwise, we can't communicate properly about it, which means we can't
review it, and that we're not decreasing bus factors.
On Sep 24, 2016, Torvald Riegel <triegel@redhat.com> wrote:

> On Fri, 2016-09-23 at 23:52 -0300, Alexandre Oliva wrote:
>> (if it decides to update it without resizing, we also have a race,
>> but both threads end up writing the same final value, which AFAIK is
>> not so much of a problem)

> Can this still happen after your patches?

After removing (again) one of the possibly-concurrent writes, I don't
see how it could still happen, since there's only one writer to a
thread's DTV at a time: at thread creation there's one, and while the
thread is active there's another.  But that's not the result of a
thorough global analysis from scratch, it's just my intuition based on
my limited understanding of the whole TLS architecture.  If you want a
thorough analysis, I'm afraid you'll have to resort to someone else,
probably someone who speaks the same language as you, and that reasons
at the same abstraction layer as you.  I'm not that person, as our many
attempts at conversations have shown.

> ISTM that all this should become a code comment, and not just be in an
> email on the list.

It's there in the patch.

> IIUC, that's where the assumption about dlopen vs. usage comes in that
> you included in the other patch.

Yeah, that too.

> If so, this is not a data race (nor a race condition).

It is a potential data race, yes, but I'll sustain it is probably
harmless, although I know you'll take that as heresy ;-)  Please look
for the details in the other thread.
On Sun, 2016-09-25 at 19:09 -0300, Alexandre Oliva wrote:
> On Sep 24, 2016, Torvald Riegel <triegel@redhat.com> wrote:
>
> > On Fri, 2016-09-23 at 23:52 -0300, Alexandre Oliva wrote:
> >> (if it decides to update it without resizing, we also have a race,
> >> but both threads end up writing the same final value, which AFAIK
> >> is not so much of a problem)
>
> > Can this still happen after your patches?
>
> After removing (again) one of the possibly-concurrent writes, I don't
> see how it could still happen, since there's only one writer to a
> thread's DTV at a time: at thread creation there's one, and while the
> thread is active there's another.  But that's not the result of a
> thorough global analysis from scratch, it's just my intuition based on
> my limited understanding of the whole TLS architecture.  If you want a
> thorough analysis, I'm afraid you'll have to resort to someone else,
> probably someone who speaks the same language as you, and that reasons
> at the same abstraction layer as you.  I'm not that person, as our
> many attempts at conversations have shown.
>
> > ISTM that all this should become a code comment, and not just be in
> > an email on the list.
>
> It's there in the patch.

Apparently so; I thought it wasn't.

At the very least please include this sentence of yours (see above) in
the code comments: "But that's not the result of a thorough global
analysis from scratch, it's just my intuition based on my limited
understanding of the whole TLS architecture."

What I want to avoid is that somebody else tries to work on the code
and believes in your comment as-is, when you already know that you
haven't checked this completely.

> > IIUC, that's where the assumption about dlopen vs. usage comes in
> > that you included in the other patch.
>
> Yeah, that too.
>
> > If so, this is not a data race (nor a race condition).
> > It is a potential data race, yes,

I'm saying that when making the assumption, which is essentially a
precondition we (should) specify, this isn't a data race.  If the
precondition is violated, then that's already undefined behavior, but a
fault of the caller.
On Sep 29, 2016, Torvald Riegel <triegel@redhat.com> wrote:

> On Sun, 2016-09-25 at 19:09 -0300, Alexandre Oliva wrote:
>> On Sep 24, 2016, Torvald Riegel <triegel@redhat.com> wrote:
>>
>> > Can this still happen after your patches?
>>
>> After removing (again) one of the possibly-concurrent writes, I
>> don't see how it could still happen, since there's only one writer
>> to a thread's DTV at a time [...]
>>
>> > ISTM that all this should become a code comment, and not just be
>> > in an email on the list.
>>
>> It's there in the patch.

> Apparently so; I thought it wasn't.

> At the very least please include this sentence of yours (see above)
> in the code comments: "But that's not the result of a thorough global
> analysis from scratch, it's just my intuition based on my limited
> understanding of the whole TLS architecture."

> What I want to avoid is that somebody else tries to work on the code
> and believes in your comment as-is, when you already know that you
> haven't checked this completely.

I checked thoroughly what I wrote in the comment, namely, that there'd
be a DTV race if we did what the removed code used to do before.
What I meant is that I cannot affirm that there aren't any other DTV-
or TLS-related races, before or after the fix, because I haven't ever
done a thorough review of the TLS implementation, and I surely haven't
done one for the purpose of posting the patch.  I just know enough
about the code in general, and about concurrent programming in general,
to know that if I remove that code (which was unnecessary, and which
caused the race I had detected a year or so ago), that specific race
will be gone, and the remaining code will work the way it did before
the patch I installed last week.

What I know about these topics would allow me to make numerous other
claims, of course.  But what would the point be of my spelling any of
them out?  You wouldn't believe them anyway!  Even if I said I'd made a
thorough analysis, would you trust it?

>> > IIUC, that's where the assumption about dlopen vs. usage comes in
>> > that you included in the other patch.
>>
>> Yeah, that too.
>>
>> > If so, this is not a data race (nor a race condition).
>>
>> It is a potential data race, yes,

> I'm saying that when making the assumption, which is essentially a
> precondition we (should) specify, this isn't a data race.  If the
> precondition is violated, then that's already undefined behavior, but
> a fault of the caller.

Are you saying it ceases to be a data race just because some earlier
precondition failed to be observed?  I don't think I agree with that.
But I don't think I'm interested in being part of such a discussion
with you either.
On Thu, 2016-09-29 at 19:07 -0300, Alexandre Oliva wrote:
> What I know about these topics would allow me to make numerous other
> claims, of course.  But what would the point be of my spelling any of
> them out?  You wouldn't believe them anyway!  Even if I said I'd made
> a thorough analysis, would you trust it?

What I asked for is that you make it clear if you are not confident in
a comment you add.  Of course, I'd trust such a comment less than one
that you are confident in.

> > I'm saying that when making the assumption, which is essentially a
> > precondition we (should) specify, this isn't a data race.  If the
> > precondition is violated, then that's already undefined behavior,
> > but a fault of the caller.
>
> Are you saying it ceases to be a data race just because some earlier
> precondition failed to be observed?  I don't think I agree with that.

I would not classify it as a data race in the code, because we make
statements about the code based on what it's supposed to do.  If the
caller violates the precondition, anything can happen.  Nonetheless,
when speculating about what would happen if the caller were allowed to
call in a way that violates the precondition, we would say that there's
a data race there.  However, speculating in code comments about
possible behavior that violates preconditions is just confusing IMO,
unless it's explicitly stated as such (i.e., that it is essentially a
thought experiment).
diff --git a/elf/dl-reloc.c b/elf/dl-reloc.c
index 42bddc1..64e3aea 100644
--- a/elf/dl-reloc.c
+++ b/elf/dl-reloc.c
@@ -137,6 +137,12 @@ _dl_nothread_init_static_tls (struct link_map *map)
 # error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 #endif
 
+  /* Fill in the DTV slot so that a later LD/GD access will find it.  */
+  dtv_t *dtv = THREAD_DTV ();
+  assert (map->l_tls_modid <= dtv[-1].counter);
+  dtv[map->l_tls_modid].pointer.is_static = true;
+  dtv[map->l_tls_modid].pointer.val = dest;
+
   /* Initialize the memory.  */
   memset (__mempcpy (dest, map->l_tls_initimage, map->l_tls_initimage_size),
 	  '\0', map->l_tls_blocksize - map->l_tls_initimage_size);
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index ed13fd9..0add960 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -511,6 +511,11 @@ _dl_allocate_tls_init (void *result)
 # error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 #endif
 
+	  /* Set up the DTV entry.  The simplified __tls_get_addr that
+	     some platforms use in static programs requires it.  */
+	  dtv[map->l_tls_modid].pointer.is_static = true;
+	  dtv[map->l_tls_modid].pointer.val = dest;
+
 	  /* Copy the initialization image and clear the BSS part.  */
 	  memset (__mempcpy (dest, map->l_tls_initimage,
 			     map->l_tls_initimage_size), '\0',
diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index 6b42b11..99002b2 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -1209,9 +1209,12 @@ init_one_static_tls (struct pthread *curp, struct link_map *map)
 # error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 # endif
 
-  /* We cannot delay the initialization of the Static TLS area, since
-     it can be accessed with LE or IE, but since the DTV is only used
-     by GD and LD, we can delay its update to avoid a race.  */
+  /* Fill in the DTV slot so that a later LD/GD access will find it.  */
+  dtv_t *dtv = GET_DTV (TLS_TPADJ (curp));
+  dtv[map->l_tls_modid].pointer.is_static = true;
+  dtv[map->l_tls_modid].pointer.val = dest;
+
+  /* Initialize the memory.  */
   memset (__mempcpy (dest, map->l_tls_initimage, map->l_tls_initimage_size),
 	  '\0', map->l_tls_blocksize - map->l_tls_initimage_size);
 }