Message ID: 20230303171939.237819-6-pbonzini@redhat.com
State: New
Series: Fix missing memory barriers on ARM
On 3/3/23 09:19, Paolo Bonzini wrote:
> There is no implicit memory barrier in qatomic_fetch_or() and
> atomic_fetch_and() on ARM systems.  Add an explicit
> smp_mb__after_rmw() to match the intended semantics.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  util/async.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/util/async.c b/util/async.c
> index 0657b7539777..6129f2c991cb 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -74,13 +74,15 @@ static void aio_bh_enqueue(QEMUBH *bh, unsigned new_flags)
>      unsigned old_flags;
>
>      /*
> -     * The memory barrier implicit in qatomic_fetch_or makes sure that:
> -     * 1. idle & any writes needed by the callback are done before the
> -     *    locations are read in the aio_bh_poll.
> +     * The memory barrier makes sure that:
> +     * 1. any writes needed by the callback are visible from the callback
> +     *    after aio_bh_dequeue() returns bh.
>       * 2. ctx is loaded before the callback has a chance to execute and bh
>       *    could be freed.
>       */
>      old_flags = qatomic_fetch_or(&bh->flags, BH_PENDING | new_flags);
> +    smp_mb__after_rmw();
> +
>      if (!(old_flags & BH_PENDING)) {
>          QSLIST_INSERT_HEAD_ATOMIC(&ctx->bh_list, bh, next);
>      }
> @@ -107,14 +109,15 @@ static QEMUBH *aio_bh_dequeue(BHList *head, unsigned *flags)
>      QSLIST_REMOVE_HEAD(head, next);
>
>      /*
> -     * The qatomic_and is paired with aio_bh_enqueue(). The implicit memory
> -     * barrier ensures that the callback sees all writes done by the scheduling
> +     * The memory barrier is paired with aio_bh_enqueue() and it
> +     * ensures that the callback sees all writes done by the scheduling
>       * thread. It also ensures that the scheduling thread sees the cleared
>       * flag before bh->cb has run, and thus will call aio_notify again if
>       * necessary.
>       */

Is it really paired with aio_bh_enqueue?

Seems like the enqueue barrier really is for aio_bh_poll, and the dequeue barrier is for the callback.

While the comments seem off, the code seems correct.

If you agree:
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~
On 3/5/23 20:32, Richard Henderson wrote:
> On 3/3/23 09:19, Paolo Bonzini wrote:
>>      unsigned old_flags;
>>      /*
>>       * The memory barrier makes sure that:
>>       * 1. any writes needed by the callback are visible from the callback
>>       *    after aio_bh_dequeue() returns bh.
>>       * 2. ctx is loaded before the callback has a chance to execute and bh
>>       *    could be freed.
>>       */
>>      old_flags = qatomic_fetch_or(&bh->flags, BH_PENDING | new_flags);
>> +    smp_mb__after_rmw();
>> +
>>      if (!(old_flags & BH_PENDING)) {
>>          QSLIST_INSERT_HEAD_ATOMIC(&ctx->bh_list, bh, next);
>>      }
>> @@ -107,14 +109,15 @@ static QEMUBH *aio_bh_dequeue(BHList *head,
>> unsigned *flags)
>>      QSLIST_REMOVE_HEAD(head, next);
>>      /*
>>       * The memory barrier is paired with aio_bh_enqueue() and it
>>       * ensures that the callback sees all writes done by the scheduling
>>       * thread. It also ensures that the scheduling thread sees the cleared
>>       * flag before bh->cb has run, and thus will call aio_notify again if
>>       * necessary.
>>       */
>
> Is it really paired with aio_bh_enqueue?
>
> Seems like the enqueue barrier really is for aio_bh_poll, and the
> dequeue barrier is for the callback.

The documentation has been quite obsolete since the introduction of bh_list.
The logic is as follows:

  aio_bh_enqueue()
    load bh->ctx
    set PENDING flag                          // (0)
    if flag was not set
      // bh becomes visible to dequeuing thread here:
      insert into bh_list                     // (1)
    aio_notify
      // Write bh->flags and bh_list before ctx->notified
      smp_wmb()                               // (2)
      set notified to true
      // Write notified before reading notify_me
      smp_mb()                                // (3)
      if notify_me then event_notifier_set()

and on the dequeue side it's tricky to see why all barriers are needed; it boils down to the busy-wait polling done at the beginning of aio_poll():

  aio_poll()
    compute approximate timeout (*unordered* read of bh_list)
    enable notify_me
    // Write notify_me before reading notified
    smp_mb()                                  // synchronizes with (3)
    if notified then timeout = 0
    ctx->fdmon_ops->wait(timeout)
    disable notify_me
    aio_notify_accept()
      set notified to false
      // Write ctx->notified before reading e.g. bh_list
      smp_mb()                                // synchronizes with (2)
    aio_bh_poll()
      QSLIST_MOVE_ATOMIC                      // synchronizes with (1)
      aio_bh_dequeue
        remove from head
        clear PENDING/SCHEDULED/IDLE          // synchronizes with (0)
        if flags were set
          aio_bh_call

That is:

for synchronization point (0)

- the main function here is to ensure that aio_bh_dequeue() removes from the list before the PENDING flag is cleared, otherwise enqueuers can clobber the list, and vice versa for the enqueuers.  For this, it is enough that qatomic_fetch_and() is "release" and qatomic_fetch_or() is "acquire" (and they are).

for synchronization point (1)

- the bottom half machinery between aio_bh_enqueue and aio_bh_poll only needs release-to-acquire synchronization, and that is provided by (1).  This is also where ordering ensures that bh->ctx is loaded before the callback executes (which could free bh).

- between enqueuers, setting the PENDING flag only needs to be done before inserting into bh_list to avoid double insertion (and of course needs to be done atomically); for this purpose, QSLIST_INSERT_HEAD_ATOMIC needs to be "release" (which it is).
Also the call to aio_notify() in aio_bh_enqueue() is unconditional, so aio_bh_dequeue() need not protect against missed calls to aio_notify().  aio_notify() only synchronizes with aio_poll() and aio_notify_accept().

for synchronization point (2)

- This cannot be just a smp_rmb() because the timeout that was computed earlier was not ordered against either notified or notify_me.

- This is a smp_mb() for generality; as far as bh_list is concerned it could be smp_mb__before_rmw().

Synchronization point (3) is pretty mundane in comparison.

So I'm dropping this patch; I have prepared a documentation patch instead.

Paolo
diff --git a/util/async.c b/util/async.c
index 0657b7539777..6129f2c991cb 100644
--- a/util/async.c
+++ b/util/async.c
@@ -74,13 +74,15 @@ static void aio_bh_enqueue(QEMUBH *bh, unsigned new_flags)
     unsigned old_flags;
 
     /*
-     * The memory barrier implicit in qatomic_fetch_or makes sure that:
-     * 1. idle & any writes needed by the callback are done before the
-     *    locations are read in the aio_bh_poll.
+     * The memory barrier makes sure that:
+     * 1. any writes needed by the callback are visible from the callback
+     *    after aio_bh_dequeue() returns bh.
      * 2. ctx is loaded before the callback has a chance to execute and bh
      *    could be freed.
      */
     old_flags = qatomic_fetch_or(&bh->flags, BH_PENDING | new_flags);
+    smp_mb__after_rmw();
+
     if (!(old_flags & BH_PENDING)) {
         QSLIST_INSERT_HEAD_ATOMIC(&ctx->bh_list, bh, next);
     }
@@ -107,14 +109,15 @@ static QEMUBH *aio_bh_dequeue(BHList *head, unsigned *flags)
     QSLIST_REMOVE_HEAD(head, next);
 
     /*
-     * The qatomic_and is paired with aio_bh_enqueue(). The implicit memory
-     * barrier ensures that the callback sees all writes done by the scheduling
+     * The memory barrier is paired with aio_bh_enqueue() and it
+     * ensures that the callback sees all writes done by the scheduling
      * thread. It also ensures that the scheduling thread sees the cleared
      * flag before bh->cb has run, and thus will call aio_notify again if
      * necessary.
      */
     *flags = qatomic_fetch_and(&bh->flags,
                                ~(BH_PENDING | BH_SCHEDULED | BH_IDLE));
+    smp_mb__after_rmw();
 
     return bh;
 }
There is no implicit memory barrier in qatomic_fetch_or() and atomic_fetch_and() on ARM systems.  Add an explicit smp_mb__after_rmw() to match the intended semantics.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 util/async.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)