Message ID: 20131104105059.GL3947@linux.vnet.ibm.com (mailing list archive)
State: Not Applicable
On Mon, Nov 04, 2013 at 02:51:00AM -0800, Paul E. McKenney wrote:
> OK, something like this for the definitions (though PowerPC might want
> to locally abstract the lwsync expansion):
>
> #define smp_store_with_release_semantics(p, v) /* x86, s390, etc. */ \
> do { \
>         barrier(); \
>         ACCESS_ONCE(p) = (v); \
> } while (0)
>
> #define smp_store_with_release_semantics(p, v) /* PowerPC. */ \
> do { \
>         __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
>         ACCESS_ONCE(p) = (v); \
> } while (0)
>
> #define smp_load_with_acquire_semantics(p) /* x86, s390, etc. */ \
> ({ \
>         typeof(*p) *_________p1 = ACCESS_ONCE(p); \
>         barrier(); \
>         _________p1; \
> })
>
> #define smp_load_with_acquire_semantics(p) /* PowerPC. */ \
> ({ \
>         typeof(*p) *_________p1 = ACCESS_ONCE(p); \
>         __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
>         _________p1; \
> })
>
> For ARM, smp_load_with_acquire_semantics() is a wrapper around the ARM
> "ldar" instruction and smp_store_with_release_semantics() is a wrapper
> around the ARM "stlr" instruction.

This still leaves me confused as to what to do with my case :/

Slightly modified since last time -- as the simplified version was maybe
simplified too far.

To recap, I'd like to get rid of barrier A where possible, since that's
now a full barrier for every event written.

However, there's no immediate store I can attach it to; the obvious one
would be the kbuf->head store, but that's complicated by the
local_cmpxchg() thing.

And we need that cmpxchg loop because a hardware NMI event can
interleave with a software event.

And to be honest, I'm still totally confused about memory barriers vs
control flow vs C/C++. The only way we're ever getting to that memcpy is
if we've already observed ubuf->tail, so that LOAD has to be fully
processed and completed.

I'm really not seeing how a STORE from the memcpy() could possibly go
wrong; and if C/C++ can hoist the memcpy() over a compiler barrier()
then I suppose we should all just go home.

/me who wants A to be a barrier() but is terminally confused.

---

/*
 * One important detail is that the kbuf part and the kbuf_writer() are
 * strictly per cpu and we can thus rely on program order for those.
 *
 * Only the userspace consumer can possibly run on another cpu, and thus we
 * need to ensure data consistency for those.
 */

struct buffer {
        u64 size;
        u64 tail;
        u64 head;
        void *data;
};

struct buffer *kbuf, *ubuf;

/*
 * If there's space in the buffer; store the data @buf; otherwise
 * discard it.
 */
void kbuf_write(int sz, void *buf)
{
        u64 tail, head, offset;

        do {
                tail = ACCESS_ONCE(ubuf->tail);
                offset = head = kbuf->head;
                if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
                        /* discard @buf */
                        return;
                }
                head += sz;
        } while (local_cmpxchg(&kbuf->head, offset, head) != offset);

        /*
         * Ensure that if we see the userspace tail (ubuf->tail) such
         * that there is space to write @buf without overwriting data
         * userspace hasn't seen yet, we won't in fact store data before
         * that read completes.
         */
        smp_mb(); /* A, matches with D */

        memcpy(kbuf->data + offset, buf, sz);

        /*
         * Ensure that we write all the @buf data before we update the
         * userspace visible ubuf->head pointer.
         */
        smp_wmb(); /* B, matches with C */

        ubuf->head = kbuf->head;
}

/*
 * Consume the buffer data and update the tail pointer to indicate to
 * kernel space there's 'free' space.
 */
void ubuf_read(void)
{
        u64 head, tail;

        tail = ACCESS_ONCE(ubuf->tail);
        head = ACCESS_ONCE(ubuf->head);

        /*
         * Ensure we read the buffer boundaries before the actual buffer
         * data...
         */
        smp_rmb(); /* C, matches with B */

        while (tail != head) {
                obj = ubuf->data + tail;
                /* process obj */
                tail += obj->size;
                tail %= ubuf->size;
        }

        /*
         * Ensure all data reads are complete before we issue the
         * ubuf->tail update; once that update hits, kbuf_write() can
         * observe and overwrite data.
         */
        smp_mb(); /* D, matches with A */

        ubuf->tail = tail;
}
On Mon, Nov 04, 2013 at 12:22:54PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 04, 2013 at 02:51:00AM -0800, Paul E. McKenney wrote:
> > OK, something like this for the definitions (though PowerPC might want
> > to locally abstract the lwsync expansion):
> >
> > #define smp_store_with_release_semantics(p, v) /* x86, s390, etc. */ \
> > do { \
> >         barrier(); \
> >         ACCESS_ONCE(p) = (v); \
> > } while (0)
> >
> > #define smp_store_with_release_semantics(p, v) /* PowerPC. */ \
> > do { \
> >         __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
> >         ACCESS_ONCE(p) = (v); \
> > } while (0)
> >
> > #define smp_load_with_acquire_semantics(p) /* x86, s390, etc. */ \
> > ({ \
> >         typeof(*p) *_________p1 = ACCESS_ONCE(p); \
> >         barrier(); \
> >         _________p1; \
> > })
> >
> > #define smp_load_with_acquire_semantics(p) /* PowerPC. */ \
> > ({ \
> >         typeof(*p) *_________p1 = ACCESS_ONCE(p); \
> >         __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
> >         _________p1; \
> > })
> >
> > For ARM, smp_load_with_acquire_semantics() is a wrapper around the ARM
> > "ldar" instruction and smp_store_with_release_semantics() is a wrapper
> > around the ARM "stlr" instruction.
>
> This still leaves me confused as to what to do with my case :/
>
> Slightly modified since last time -- as the simplified version was maybe
> simplified too far.
>
> To recap, I'd like to get rid of barrier A where possible, since that's
> now a full barrier for every event written.
>
> However, there's no immediate store I can attach it to; the obvious one
> would be the kbuf->head store, but that's complicated by the
> local_cmpxchg() thing.
>
> And we need that cmpxchg loop because a hardware NMI event can
> interleave with a software event.
>
> And to be honest, I'm still totally confused about memory barriers vs
> control flow vs C/C++. The only way we're ever getting to that memcpy is
> if we've already observed ubuf->tail, so that LOAD has to be fully
> processed and completed.
>
> I'm really not seeing how a STORE from the memcpy() could possibly go
> wrong; and if C/C++ can hoist the memcpy() over a compiler barrier()
> then I suppose we should all just go home.
>
> /me who wants A to be a barrier() but is terminally confused.

Well, let's see...

> ---
>
> /*
>  * One important detail is that the kbuf part and the kbuf_writer() are
>  * strictly per cpu and we can thus rely on program order for those.
>  *
>  * Only the userspace consumer can possibly run on another cpu, and thus we
>  * need to ensure data consistency for those.
>  */
>
> struct buffer {
>         u64 size;
>         u64 tail;
>         u64 head;
>         void *data;
> };
>
> struct buffer *kbuf, *ubuf;
>
> /*
>  * If there's space in the buffer; store the data @buf; otherwise
>  * discard it.
>  */
> void kbuf_write(int sz, void *buf)
> {
>         u64 tail, head, offset;
>
>         do {
>                 tail = ACCESS_ONCE(ubuf->tail);

So the above load is the key load.  It determines whether or not we
have space in the buffer.  This of course assumes that only this CPU
writes to ->head.

If so, then:

	tail = smp_load_with_acquire_semantics(ubuf->tail); /* A -> D */

>                 offset = head = kbuf->head;
>                 if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
>                         /* discard @buf */
>                         return;
>                 }
>                 head += sz;
>         } while (local_cmpxchg(&kbuf->head, offset, head) != offset);

If there is an issue with kbuf->head, presumably local_cmpxchg() fails
and we retry.

But sheesh, do you think we could have buried the definitions of
local_cmpxchg() under a few more layers of macro expansion just to
keep things more obscure?  Anyway, griping aside...

o	__cmpxchg_local_generic() in include/asm-generic/cmpxchg-local.h
	doesn't seem to exclude NMIs, so is not safe for this usage.

o	__cmpxchg_local() in ARM handles NMI as long as the
	argument is 32 bits, otherwise, it uses the aforementioned
	__cmpxchg_local_generic(), which does not handle NMI.  Given your
	u64, this does not look good...

	And some ARM subarches (e.g., metag) seem to fail to handle NMI
	even in the 32-bit case.

o	FRV and M32r seem to act similarly to ARM.

Or maybe these architectures don't do NMIs?  If they do, local_cmpxchg()
does not seem to be safe against NMIs in general.  :-/

That said, powerpc, 64-bit s390, sparc, and x86 seem to handle it.

Of course, x86's local_cmpxchg() has full memory barriers implicitly.

>
>         /*
>          * Ensure that if we see the userspace tail (ubuf->tail) such
>          * that there is space to write @buf without overwriting data
>          * userspace hasn't seen yet, we won't in fact store data before
>          * that read completes.
>          */
>
>         smp_mb(); /* A, matches with D */

Given a change to smp_load_with_acquire_semantics() above, you should not
need this smp_mb().

>         memcpy(kbuf->data + offset, buf, sz);
>
>         /*
>          * Ensure that we write all the @buf data before we update the
>          * userspace visible ubuf->head pointer.
>          */
>         smp_wmb(); /* B, matches with C */
>
>         ubuf->head = kbuf->head;

Replace the smp_wmb() and the assignment with:

	smp_store_with_release_semantics(ubuf->head, kbuf->head); /* B -> C */

> }
>
> /*
>  * Consume the buffer data and update the tail pointer to indicate to
>  * kernel space there's 'free' space.
>  */
> void ubuf_read(void)
> {
>         u64 head, tail;
>
>         tail = ACCESS_ONCE(ubuf->tail);

Does anyone else write tail?  Or is this defense against NMIs?

If no one else writes to tail and if NMIs cannot muck things up, then
the above ACCESS_ONCE() is not needed, though I would not object to
its staying.

>         head = ACCESS_ONCE(ubuf->head);

Make the above be:

	head = smp_load_with_acquire_semantics(ubuf->head); /* C -> B */

>         /*
>          * Ensure we read the buffer boundaries before the actual buffer
>          * data...
>          */
>         smp_rmb(); /* C, matches with B */

And drop the above memory barrier.

>         while (tail != head) {
>                 obj = ubuf->data + tail;
>                 /* process obj */
>                 tail += obj->size;
>                 tail %= ubuf->size;
>         }
>
>         /*
>          * Ensure all data reads are complete before we issue the
>          * ubuf->tail update; once that update hits, kbuf_write() can
>          * observe and overwrite data.
>          */
>         smp_mb(); /* D, matches with A */
>
>         ubuf->tail = tail;

Replace the above barrier and the assignment with:

	smp_store_with_release_semantics(ubuf->tail, tail); /* D -> A */

> }

All this is leading me to suggest the following shortenings of names:

	smp_load_with_acquire_semantics() -> smp_load_acquire()

	smp_store_with_release_semantics() -> smp_store_release()

But names aside, the above gets rid of explicit barriers on TSO
architectures, allows ARM to avoid full DMB, and allows PowerPC to use
lwsync instead of the heavier-weight sync.

							Thanx, Paul
On Mon, Nov 04, 2013 at 08:27:32AM -0800, Paul E. McKenney wrote:
> > /*
> >  * One important detail is that the kbuf part and the kbuf_writer() are
> >  * strictly per cpu and we can thus rely on program order for those.
> >  *
> >  * Only the userspace consumer can possibly run on another cpu, and thus we
> >  * need to ensure data consistency for those.
> >  */
> >
> > struct buffer {
> >         u64 size;
> >         u64 tail;
> >         u64 head;
> >         void *data;
> > };
> >
> > struct buffer *kbuf, *ubuf;
> >
> > /*
> >  * If there's space in the buffer; store the data @buf; otherwise
> >  * discard it.
> >  */
> > void kbuf_write(int sz, void *buf)
> > {
> >         u64 tail, head, offset;
> >
> >         do {
> >                 tail = ACCESS_ONCE(ubuf->tail);
>
> So the above load is the key load.  It determines whether or not we
> have space in the buffer.  This of course assumes that only this CPU
> writes to ->head.

This assumption is true.

> If so, then:
>
> 	tail = smp_load_with_acquire_semantics(ubuf->tail); /* A -> D */

OK, the way I understand ACQUIRE semantics are the semi-permeable LOCK
semantics from Documentation/memory-barriers.txt. In which case the
relevant STORES below could be hoisted up here, but not across the READ,
which I suppose is sufficient.

> >                 offset = head = kbuf->head;
> >                 if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
> >                         /* discard @buf */
> >                         return;
> >                 }
> >                 head += sz;
> >         } while (local_cmpxchg(&kbuf->head, offset, head) != offset);
>
> If there is an issue with kbuf->head, presumably local_cmpxchg() fails
> and we retry.
>
> But sheesh, do you think we could have buried the definitions of
> local_cmpxchg() under a few more layers of macro expansion just to
> keep things more obscure?  Anyway, griping aside...
>
> o	__cmpxchg_local_generic() in include/asm-generic/cmpxchg-local.h
> 	doesn't seem to exclude NMIs, so is not safe for this usage.
>
> o	__cmpxchg_local() in ARM handles NMI as long as the
> 	argument is 32 bits, otherwise, it uses the aforementioned
> 	__cmpxchg_local_generic(), which does not handle NMI.  Given your
> 	u64, this does not look good...
>
> 	And some ARM subarches (e.g., metag) seem to fail to handle NMI
> 	even in the 32-bit case.
>
> o	FRV and M32r seem to act similarly to ARM.
>
> Or maybe these architectures don't do NMIs?  If they do, local_cmpxchg()
> does not seem to be safe against NMIs in general.  :-/
>
> That said, powerpc, 64-bit s390, sparc, and x86 seem to handle it.

Ah my bad, so the in-kernel kbuf variant uses unsigned long, which on
all archs should be the native word size and cover the address space.
Only the public variant (ubuf) is u64 wide to not change the data
structure layout on compat etc. I suppose this was a victim of the
simplification :/

And in the 32bit case the upper word will always be zero and the
partial reads should all work out just fine.

> Of course, x86's local_cmpxchg() has full memory barriers implicitly.

Not quite, the 'lock' in __raw_cmpxchg() expands to "" due to
__cmpxchg_local(), etc..

> >
> >         /*
> >          * Ensure that if we see the userspace tail (ubuf->tail) such
> >          * that there is space to write @buf without overwriting data
> >          * userspace hasn't seen yet, we won't in fact store data before
> >          * that read completes.
> >          */
> >
> >         smp_mb(); /* A, matches with D */
>
> Given a change to smp_load_with_acquire_semantics() above, you should not
> need this smp_mb().

Because the STORES can not be hoisted across the ACQUIRE, indeed.

> >         memcpy(kbuf->data + offset, buf, sz);
> >
> >         /*
> >          * Ensure that we write all the @buf data before we update the
> >          * userspace visible ubuf->head pointer.
> >          */
> >         smp_wmb(); /* B, matches with C */
> >
> >         ubuf->head = kbuf->head;
>
> Replace the smp_wmb() and the assignment with:
>
> 	smp_store_with_release_semantics(ubuf->head, kbuf->head); /* B -> C */

And here the RELEASE semantics I assume are the same as the
semi-permeable UNLOCK from _The_ document? In which case the above
STORES cannot be lowered across this store and all should, again, be
well.

> > }
> >
> > /*
> >  * Consume the buffer data and update the tail pointer to indicate to
> >  * kernel space there's 'free' space.
> >  */
> > void ubuf_read(void)
> > {
> >         u64 head, tail;
> >
> >         tail = ACCESS_ONCE(ubuf->tail);
>
> Does anyone else write tail?  Or is this defense against NMIs?

No, we're the sole writer; just general paranoia. Not sure the actual
userspace does this; /me checks. Nope, distinct lack of ACCESS_ONCE()
there, just the rmb(), which, including a barrier(), should hopefully
accomplish similar things most of the time ;-)

I'll need to introduce ACCESS_ONCE() to the userspace code.

> If no one else writes to tail and if NMIs cannot muck things up, then
> the above ACCESS_ONCE() is not needed, though I would not object to
> its staying.
>
> >         head = ACCESS_ONCE(ubuf->head);
>
> Make the above be:
>
> 	head = smp_load_with_acquire_semantics(ubuf->head); /* C -> B */
>
> >         /*
> >          * Ensure we read the buffer boundaries before the actual buffer
> >          * data...
> >          */
> >         smp_rmb(); /* C, matches with B */
>
> And drop the above memory barrier.
>
> >         while (tail != head) {
> >                 obj = ubuf->data + tail;
> >                 /* process obj */
> >                 tail += obj->size;
> >                 tail %= ubuf->size;
> >         }
> >
> >         /*
> >          * Ensure all data reads are complete before we issue the
> >          * ubuf->tail update; once that update hits, kbuf_write() can
> >          * observe and overwrite data.
> >          */
> >         smp_mb(); /* D, matches with A */
> >
> >         ubuf->tail = tail;
>
> Replace the above barrier and the assignment with:
>
> 	smp_store_with_release_semantics(ubuf->tail, tail); /* D -> A */
>
> > }

Right, so this consumer side isn't called that often and the two
barriers are only per consume, not per event, so I don't care too much
about those.

It would also mean hoisting the implementation of the proposed
primitives into userspace -- which reminds me: should we make
include/asm/barrier.h a uapi header?

> All this is leading me to suggest the following shortenings of names:
>
> 	smp_load_with_acquire_semantics() -> smp_load_acquire()
>
> 	smp_store_with_release_semantics() -> smp_store_release()
>
> But names aside, the above gets rid of explicit barriers on TSO
> architectures, allows ARM to avoid full DMB, and allows PowerPC to use
> lwsync instead of the heavier-weight sync.

Totally awesome! ;-) And full ack on the shorter names.
* Peter Zijlstra (peterz@infradead.org) wrote:

[...]

Hi Peter,

Looking at this simplified version of perf's ring buffer
synchronization, I get concerned about the following issue:

> /*
>  * One important detail is that the kbuf part and the kbuf_writer() are
>  * strictly per cpu and we can thus rely on program order for those.
>  *
>  * Only the userspace consumer can possibly run on another cpu, and thus we
>  * need to ensure data consistency for those.
>  */
>
> struct buffer {
>         u64 size;
>         u64 tail;
>         u64 head;
>         void *data;
> };
>
> struct buffer *kbuf, *ubuf;
>
> /*
>  * If there's space in the buffer; store the data @buf; otherwise
>  * discard it.
>  */
> void kbuf_write(int sz, void *buf)
> {
>         u64 tail, head, offset;
>
>         do {
>                 tail = ACCESS_ONCE(ubuf->tail);
>                 offset = head = kbuf->head;
>                 if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
>                         /* discard @buf */
>                         return;
>                 }
>                 head += sz;
>         } while (local_cmpxchg(&kbuf->head, offset, head) != offset);
>

Let's suppose we have a thread executing kbuf_write(), interrupted by an
IRQ or NMI right after a successful local_cmpxchg() (space reservation
in the buffer). If the nested execution context also calls kbuf_write(),
it will therefore update ubuf->head (below) with the second reserved
space, and only after that will it return to the original thread context
and continue executing kbuf_write(), thus overwriting ubuf->head with
the prior-to-last reserved offset.

All this probably works OK most of the time, when we have an event flow
guaranteeing that a following event will fix things up, but there
appears to be a risk of losing events near the end of the trace when
those are in nested execution contexts.

Thoughts?

Thanks,

Mathieu

>         /*
>          * Ensure that if we see the userspace tail (ubuf->tail) such
>          * that there is space to write @buf without overwriting data
>          * userspace hasn't seen yet, we won't in fact store data before
>          * that read completes.
>          */
>
>         smp_mb(); /* A, matches with D */
>
>         memcpy(kbuf->data + offset, buf, sz);
>
>         /*
>          * Ensure that we write all the @buf data before we update the
>          * userspace visible ubuf->head pointer.
>          */
>         smp_wmb(); /* B, matches with C */
>
>         ubuf->head = kbuf->head;
> }
>
> /*
>  * Consume the buffer data and update the tail pointer to indicate to
>  * kernel space there's 'free' space.
>  */
> void ubuf_read(void)
> {
>         u64 head, tail;
>
>         tail = ACCESS_ONCE(ubuf->tail);
>         head = ACCESS_ONCE(ubuf->head);
>
>         /*
>          * Ensure we read the buffer boundaries before the actual buffer
>          * data...
>          */
>         smp_rmb(); /* C, matches with B */
>
>         while (tail != head) {
>                 obj = ubuf->data + tail;
>                 /* process obj */
>                 tail += obj->size;
>                 tail %= ubuf->size;
>         }
>
>         /*
>          * Ensure all data reads are complete before we issue the
>          * ubuf->tail update; once that update hits, kbuf_write() can
>          * observe and overwrite data.
>          */
>         smp_mb(); /* D, matches with A */
>
>         ubuf->tail = tail;
> }
diff --git a/Documentation/circular-buffers.txt b/Documentation/circular-buffers.txt
index 8117e5bf6065..1846044bf6cc 100644
--- a/Documentation/circular-buffers.txt
+++ b/Documentation/circular-buffers.txt
@@ -160,6 +160,7 @@ The producer will look something like this:
 	spin_lock(&producer_lock);
 
 	unsigned long head = buffer->head;
+	/* The spin_unlock() and next spin_lock() provide needed ordering. */
 	unsigned long tail = ACCESS_ONCE(buffer->tail);
 
 	if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
@@ -168,9 +169,8 @@ The producer will look something like this:
 
 		produce_item(item);
 
-		smp_wmb(); /* commit the item before incrementing the head */
-
-		buffer->head = (head + 1) & (buffer->size - 1);
+		smp_store_with_release_semantics(buffer->head,
+						 (head + 1) & (buffer->size - 1));
 
 		/* wake_up() will make sure that the head is committed before
 		 * waking anyone up */
@@ -183,9 +183,14 @@ This will instruct the CPU that the contents of the new item must be written
 before the head index makes it available to the consumer and then instructs the
 CPU that the revised head index must be written before the consumer is woken.
 
-Note that wake_up() doesn't have to be the exact mechanism used, but whatever
-is used must guarantee a (write) memory barrier between the update of the head
-index and the change of state of the consumer, if a change of state occurs.
+Note that wake_up() does not guarantee any sort of barrier unless something
+is actually awakened.  We therefore cannot rely on it for ordering.  However,
+there is always one element of the array left empty.  Therefore, the
+producer must produce two elements before it could possibly corrupt the
+element currently being read by the consumer.  Therefore, the unlock-lock
+pair between consecutive invocations of the consumer provides the necessary
+ordering between the read of the index indicating that the consumer has
+vacated a given element and the write by the producer to that same element.
 
 
 THE CONSUMER
@@ -195,21 +200,18 @@ The consumer will look something like this:
 
 	spin_lock(&consumer_lock);
 
-	unsigned long head = ACCESS_ONCE(buffer->head);
+	unsigned long head = smp_load_with_acquire_semantics(buffer->head);
 	unsigned long tail = buffer->tail;
 
 	if (CIRC_CNT(head, tail, buffer->size) >= 1) {
-		/* read index before reading contents at that index */
-		smp_read_barrier_depends();
 
 		/* extract one item from the buffer */
 		struct item *item = buffer[tail];
 
 		consume_item(item);
 
-		smp_mb(); /* finish reading descriptor before incrementing tail */
-
-		buffer->tail = (tail + 1) & (buffer->size - 1);
+		smp_store_with_release_semantics(buffer->tail,
+						 (tail + 1) & (buffer->size - 1));
 	}
 
 	spin_unlock(&consumer_lock);
@@ -218,12 +220,17 @@ This will instruct the CPU to make sure the index is up to date before reading
 the new item, and then it shall make sure the CPU has finished reading the item
 before it writes the new tail pointer, which will erase the item.
 
-Note the use of ACCESS_ONCE() in both algorithms to read the opposition index.
-This prevents the compiler from discarding and reloading its cached value -
-which some compilers will do across smp_read_barrier_depends().  This isn't
-strictly needed if you can be sure that the opposition index will _only_ be
-used the once.
+Note the use of ACCESS_ONCE() and smp_load_with_acquire_semantics()
+to read the opposition index.  This prevents the compiler from
+discarding and reloading its cached value - which some compilers will
+do across smp_read_barrier_depends().  This isn't strictly needed
+if you can be sure that the opposition index will _only_ be used
+the once.  The smp_load_with_acquire_semantics() additionally forces
+the CPU to order against subsequent memory references.  Similarly,
+smp_store_with_release_semantics() is used in both algorithms to write
+the thread's index.  This documents the fact that we are writing to
+something that can be read concurrently, prevents the compiler from
+tearing the store, and enforces ordering against previous accesses.

===============