epoll: prevent missed events on EPOLL_CTL_MOD

Message ID 1357148750.21409.17169.camel@edumazet-glaptop
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Jan. 2, 2013, 5:45 p.m. UTC
On Tue, 2013-01-01 at 23:56 +0000, Eric Wong wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > Please document the barrier that this mb() pairs with, and then give
> > an explanation for the fix in the commit message, and I'll happily
> > take it. Even if it's just duplicating the comments above the
> > wq_has_sleeper() function, except modified for the ep_modify() case.
> 
> Hopefully my explanation is correct and makes sense below,
> I think both effects of the barrier are needed
> 
> > Of course, it would be good to get verification from Jason and Andreas
> > that the alternate patch also works for them.
> 
> Jason just confirmed it.
> 
> ------------------------------- 8< ----------------------------
> From 02f43757d04bb6f2786e79eecf1cfa82e6574379 Mon Sep 17 00:00:00 2001
> From: Eric Wong <normalperson@yhbt.net>
> Date: Tue, 1 Jan 2013 21:20:27 +0000
> Subject: [PATCH] epoll: prevent missed events on EPOLL_CTL_MOD
> 
> EPOLL_CTL_MOD sets the interest mask before calling f_op->poll() to
> ensure events are not missed.  Since the modifications to the interest
> mask are not protected by the same lock as ep_poll_callback, we need to
> ensure the change is visible to other CPUs calling ep_poll_callback.
> 
> We also need to ensure f_op->poll() has an up-to-date view of past
> events which occurred before we modified the interest mask.  So this
> barrier also pairs with the barrier in wq_has_sleeper().
> 
> This should guarantee either ep_poll_callback or f_op->poll() (or both)
> will notice the readiness of a recently-ready/modified item.
> 
> This issue was encountered by Andreas Voellmy and Junchang(Jason) Wang in:
> http://thread.gmane.org/gmane.linux.kernel/1408782/
> 
> Signed-off-by: Eric Wong <normalperson@yhbt.net>
> Cc: Hans Verkuil <hans.verkuil@cisco.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Davide Libenzi <davidel@xmailserver.org>
> Cc: Hans de Goede <hdegoede@redhat.com>
> Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
> Cc: David Miller <davem@davemloft.net>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andreas Voellmy <andreas.voellmy@yale.edu>
> Tested-by: "Junchang(Jason) Wang" <junchang.wang@yale.edu>
> Cc: netdev@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> ---
>  fs/eventpoll.c | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index cd96649..39573ee 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1285,7 +1285,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
>  	 * otherwise we might miss an event that happens between the
>  	 * f_op->poll() call and the new event set registering.
>  	 */
> -	epi->event.events = event->events;
> +	epi->event.events = event->events; /* need barrier below */
>  	pt._key = event->events;
>  	epi->event.data = event->data; /* protected by mtx */
>  	if (epi->event.events & EPOLLWAKEUP) {
> @@ -1296,6 +1296,26 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
>  	}
>  
>  	/*
> +	 * The following barrier has two effects:
> +	 *
> +	 * 1) Flush epi changes above to other CPUs.  This ensures
> +	 *    we do not miss events from ep_poll_callback if an
> +	 *    event occurs immediately after we call f_op->poll().
> +	 *    We need this because we did not take ep->lock while
> +	 *    changing epi above (but ep_poll_callback does take
> +	 *    ep->lock).
> +	 *
> +	 * 2) We also need to ensure we do not miss _past_ events
> +	 *    when calling f_op->poll().  This barrier also
> +	 *    pairs with the barrier in wq_has_sleeper (see
> +	 *    comments for wq_has_sleeper).
> +	 *
> +	 * This barrier will now guarantee ep_poll_callback or f_op->poll
> +	 * (or both) will notice the readiness of an item.
> +	 */
> +	smp_mb();
> +
> +	/*
>  	 * Get current event bits. We can safely use the file* here because
>  	 * its usage count has been increased by the caller of this function.
>  	 */
> -- 
> Eric Wong
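For reference, the wq_has_sleeper() helper that the commit message and the
new comment refer to lived in include/net/sock.h at the time.  A trimmed
sketch of roughly how it looked in that era (comments kept, exact details
vary across kernel versions):

static inline bool wq_has_sleeper(struct socket_wq *wq)
{
	/*
	 * We need to be sure we are in sync with the
	 * add_wait_queue modifications to the wait queue.
	 *
	 * This memory barrier is paired in the sock_poll_wait.
	 */
	smp_mb();
	return wq && waitqueue_active(&wq->wait);
}

The pairing works the usual way for this kind of race: each side does its
write, then a full barrier, then reads the other side's state, so at least
one of the two sides must observe the other's update.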

First, thanks for working on this issue.

It seems the real problem is the epi->event.events = event->events;
assignment, which is done without taking ep->lock.

While a smp_mb() could reduce the race window, I believe there is still
a race, and the following patch would close it.

Comments

Eric Wong Jan. 2, 2013, 6:40 p.m. UTC | #1
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> First, thanks for working on this issue.

No problem!

> It seems the real problem is the epi->event.events = event->events;
> assignment, which is done without taking ep->lock.

Yes.  I am hoping it is possible to do it without a lock there,
but your change is more obviously correct.

> While a smp_mb() could reduce the race window, I believe there is still
> a race, and the following patch would close it.

I'm not an experienced kernel hacker; can you describe where the race
would be?

> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index be56b21..25e5c53 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1313,7 +1313,10 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
>  	 * otherwise we might miss an event that happens between the
>  	 * f_op->poll() call and the new event set registering.
>  	 */
> +	spin_lock_irq(&ep->lock);
>  	epi->event.events = event->events;
> +	spin_unlock_irq(&ep->lock);
> +
>  	pt._key = event->events;
>  	epi->event.data = event->data; /* protected by mtx */
>  	if (epi->event.events & EPOLLWAKEUP) {
Eric Dumazet Jan. 2, 2013, 7:03 p.m. UTC | #2
On Wed, 2013-01-02 at 18:40 +0000, Eric Wong wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > First, thanks for working on this issue.
> 
> No problem!
> 
> > It seems the real problem is the epi->event.events = event->events;
> > assignment, which is done without taking ep->lock.
> 
> Yes.  I am hoping it is possible to do it without a lock there,
> but your change is more obviously correct.
> 
> > While a smp_mb() could reduce the race window, I believe there is still
> > a race, and the following patch would close it.
> 
> I'm not an experienced kernel hacker; can you describe where the race
> would be?

It would be, for example, ep_send_events_proc() doing:

if (epi->event.events & EPOLLONESHOT)
    epi->event.events &= EP_PRIVATE_BITS;

and this could happen at the same time as the unlocked write in ep_modify().
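To make the lost-update pattern concrete, here is a minimal userspace
sketch, illustration only and not kernel code; FAKE_PRIVATE_BITS and the
thread bodies are made-up stand-ins for the two kernel paths.  One thread
does a plain store (like ep_modify() writing the new mask), the other a
non-atomic "&=" (like ep_send_events_proc() clearing oneshot bits); an
unlucky interleaving leaves a value neither serialized order could produce:

/* Build: cc -O2 -pthread race_demo.c */
#include <pthread.h>
#include <stdio.h>

#define FAKE_PRIVATE_BITS 0xff000000u	/* stand-in, not the real EP_PRIVATE_BITS */

static volatile unsigned int events;
static pthread_barrier_t start_line;

static void *store_thread(void *arg)
{
	(void)arg;
	pthread_barrier_wait(&start_line);
	events = 0x1u;			/* like ep_modify(): plain store of new mask */
	return NULL;
}

static void *rmw_thread(void *arg)
{
	(void)arg;
	pthread_barrier_wait(&start_line);
	events &= FAKE_PRIVATE_BITS;	/* like ep_send_events_proc(): non-atomic RMW */
	return NULL;
}

int main(void)
{
	unsigned long lost = 0;

	for (int i = 0; i < 200000; i++) {
		pthread_t a, b;

		events = 0xffffffffu;	/* old mask with private bits set */
		pthread_barrier_init(&start_line, NULL, 2);
		pthread_create(&a, NULL, store_thread, NULL);
		pthread_create(&b, NULL, rmw_thread, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		pthread_barrier_destroy(&start_line);

		/*
		 * Serialized orders give 0x1 (RMW then store) or 0x0
		 * (store then RMW).  Anything else means the RMW read
		 * the old mask and overwrote the new store.
		 */
		if (events != 0x1u && events != 0x0u)
			lost++;
	}
	printf("lost updates: %lu\n", lost);
	return 0;
}

An smp_mb() orders the plain store against later reads, but it cannot make
the "&=" read-modify-write atomic; only a common lock (or an atomic op)
closes that window.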



Eric Wong Jan. 2, 2013, 7:32 p.m. UTC | #3
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2013-01-02 at 18:40 +0000, Eric Wong wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > It seems the real problem is the epi->event.events = event->events;
> > > assignment, which is done without taking ep->lock.
> > 
> > Yes.  I am hoping it is possible to do it without a lock there,
> > but your change is more obviously correct.
> > 
> > > While a smp_mb() could reduce the race window, I believe there is still
> > > a race, and the following patch would close it.
> > 
> > I'm not an experienced kernel hacker; can you describe where the race
> > would be?
> 
> It would be, for example, ep_send_events_proc() doing:
> 
> if (epi->event.events & EPOLLONESHOT)
>     epi->event.events &= EP_PRIVATE_BITS;
> 
> and this could happen at the same time as the unlocked write in ep_modify().

That modification in ep_send_events_proc() is protected by ep->mtx
(as is ep_modify()), though.  Maybe there are other places, but I
don't see any.
Eric Dumazet Jan. 2, 2013, 10:08 p.m. UTC | #4
On Wed, 2013-01-02 at 19:32 +0000, Eric Wong wrote:

> That modification in ep_send_events_proc() is protected by ep->mtx
> (as is ep_modify()), though.  Maybe there are other places, but I
> don't see it.

Yes, and using a mutex to protect this field while it's read from
interrupt context (where the mutex cannot help with synchronization) is
why there were races.

Some users rely on the barriers implied by spin_lock()/spin_unlock(),
others on explicit barriers, and, before your patch, on pure luck.
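For readers following along: the reader side here is ep_poll_callback(),
which runs from wakeup (often interrupt) context and samples
epi->event.events under ep->lock.  A heavily trimmed sketch of the
function from fs/eventpoll.c of that era, with most of the body elided:

static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
	struct epitem *epi = ep_item_from_wait(wait);
	struct eventpoll *ep = epi->ep;
	unsigned long flags;

	spin_lock_irqsave(&ep->lock, flags);

	/*
	 * Check the events the caller is interested in: this is the
	 * read that races with the unlocked write in ep_modify().
	 */
	if (key && !((unsigned long) key & epi->event.events))
		goto out_unlock;

	/* ... queue epi on the ready list and wake up waiters ... */

out_unlock:
	spin_unlock_irqrestore(&ep->lock, flags);
	return 1;
}

Taking ep->lock around the write in ep_modify() therefore gives that
write both mutual exclusion against this reader and the ordering that
spin_lock()/spin_unlock() imply.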




Patch

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index be56b21..25e5c53 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1313,7 +1313,10 @@  static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
 	 * otherwise we might miss an event that happens between the
 	 * f_op->poll() call and the new event set registering.
 	 */
+	spin_lock_irq(&ep->lock);
 	epi->event.events = event->events;
+	spin_unlock_irq(&ep->lock);
+
 	pt._key = event->events;
 	epi->event.data = event->data; /* protected by mtx */
 	if (epi->event.events & EPOLLWAKEUP) {