diff mbox

[RFC] tcp: race in receive part

Message ID 20090624162112.GB5409@jolsa.lab.eng.brq.redhat.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Jiri Olsa June 24, 2009, 4:21 p.m. UTC
On Wed, Jun 24, 2009 at 01:04:13PM +0200, Eric Dumazet wrote:
> Jiri Olsa a écrit :
> > On Tue, Jun 23, 2009 at 12:32:10PM +0200, Eric Dumazet wrote:
> >> Jiri Olsa a écrit :
> >>> Hi,
> >>>
> >>> thanks for an answer, and sorry for my late reply,
> >>> we needed the cust permission to disclose the debug data.
> >>>
> >> I see ! Now this is me with litle time as I am traveling right now.
> >>
> >>> On Thu, Jun 18, 2009 at 04:06:42PM +0200, Eric Dumazet wrote:
> >>>> Jiri Olsa a écrit :
> >>>>> Hi,
> >>>>>
> >>>>> in RHEL4 we can see a race in the tcp layer. We were not able to reproduce 
> >>>>> this on the upstream kernel, but since the issue occurs very rarelly
> >>>>> (once per 8 days), we just might not be lucky.
> >>>>>
> >>>>> I'm affraid this might be a long email, I'll try to structure it nicely.. :)
> >>>>>
> >>>> Thanks for your mail and detailed analysis
> >>>>
> >>>>> RACE DESCRIPTION
> >>>>> ================
> >>>>>
> >>>>> There's a nice pdf describing the issue (and sollution using locks) on
> >>>>> https://bugzilla.redhat.com/attachment.cgi?id=345014
> >>>> I could not reach this url unfortunatly 
> >>>>
> >>>> --> "You are not authorized to access bug #494404. "
> >>> please try it now, the bug should be accessible now
> >>>
> >> Thanks, this doc is indeed nice :)
> >>
> >> But adding an write_lock()/write_unlock() in tcp_poll() was overkill
> >> It had an sm_mb() implied because of the nesting of locks.
> >>
> >>>>> The race fires, when following code paths meet, and the tp->rcv_nxt and
> >>>>> __add_wait_queue updates stay in CPU caches.
> >>>>>
> >>>>> CPU1                         CPU2
> >>>>>
> >>>>>
> >>>>> sys_select                   receive packet
> >>>>>   ...                        ...
> >>>>>   __add_wait_queue           update tp->rcv_nxt
> >>>>>   ...                        ...
> >>>>>   tp->rcv_nxt check          sock_def_readable
> >>>>>   ...                        {
> >>>>>   schedule                      ...
> >>>>>                                 if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> >>>>>                                         wake_up_interruptible(sk->sk_sleep)
> >>>>>                                 ...
> >>>>>                              }
> >>>>>
> >>>>> If there were no cache the code would work ok, since the wait_queue and
> >>>>> rcv_nxt are opposit to each other.
> >>>>>
> >>>>> Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
> >>>>> passed the tp->rcv_nxt check and sleeps, or will get the new value for
> >>>>> tp->rcv_nxt and will return with new data mask.  
> >>>>> In both cases the process (CPU1) is being added to the wait queue, so the
> >>>>> waitqueue_active (CPU2) call cannot miss and will wake up CPU1.
> >>>>>
> >>>>> The bad case is when the __add_wait_queue changes done by CPU1 stay in its
> >>>>> cache , and so does the tp->rcv_nxt update on CPU2 side.  The CPU1 will then
> >>>>> endup calling schedule and sleep forever if there are no more data on the
> >>>>> socket.
> >>>>>
> >>>>> Adding smp_mb() calls before sock_def_readable call and after __add_wait_queue
> >>>>> should prevent the above bad scenario.
> >>>>>
> >>>>> The upstream patch is attached. It seems to prevent the issue.
> >>>>>
> >>>>>
> >>>>>
> >>>>> CPU BUGS
> >>>>> ========
> >>>>>
> >>>>> The customer has been able to reproduce this problem only on one CPU model:
> >>>>> Xeon E5345*2. They didn't reproduce on XEON MV, for example.
> >>>> Is there an easy way to reproduce the problem ?
> >>> there's a reproducer attached to the bug
> >>>
> >>> https://enterprise.redhat.com/issue-tracker/?module=download&fid=201560&key=f6f87caf6ac2dc1eb1173257c8a5ef78
> >>>
> >>> it is basically the client/server program. 
> >>> They're passing messages to each other. When a message is sent,
> >>> both of them expect message on the input before sending another message.
> >>>
> >>> Very rarely the code hits the place when the process which called select
> >>> is not woken up by incomming data. Probably because of the memory cache
> >>> incoherency. See backtrace in the 
> >>>
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=494404#c1
> >>>
> >>>
> >>>>> That CPU model happens to have 2 possible issues, that might cause the issue:
> >>>>> (see errata http://www.intel.com/Assets/PDF/specupdate/315338.pdf)
> >>>>>
> >>>>> AJ39 and AJ18. The first one can be workarounded by BIOS upgrade,
> >>>>> the other one has following notes:
> >>>> AJ18 only matters on unaligned accesses, tcp code doesnt do this.
> >>>>
> >>>>>       Software should ensure at least one of the following is true when
> >>>>>       modifying shared data by multiple agents:
> >>>>>              • The shared data is aligned
> >>>>>              • Proper semaphores or barriers are used in order to
> >>>>>                 prevent concurrent data accesses.
> >>>>>
> >>>>>
> >>>>>
> >>>>> RFC
> >>>>> ===
> >>>>>
> >>>>> I'm aware that not having this issue reproduced on upstream lowers the odds
> >>>>> having this checked in. However AFAICS the issue is present. I'd appreciate
> >>>>> any comment/ideas.
> >>>>>
> >>>>>
> >>>>> thanks,
> >>>>> jirka
> >>>>>
> >>>>>
> >>>>> Signed-off-by: Jiri Olsa <jolsa@redhat.com>
> >>>>>
> >>>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >>>>> index 17b89c5..f5d9dbf 100644
> >>>>> --- a/net/ipv4/tcp.c
> >>>>> +++ b/net/ipv4/tcp.c
> >>>>> @@ -340,6 +340,11 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
> >>>>>  	struct tcp_sock *tp = tcp_sk(sk);
> >>>>>  
> >>>>>  	poll_wait(file, sk->sk_sleep, wait);
> >>>> poll_wait() calls add_wait_queue() which contains a 
> >>>> spin_lock_irqsave()/spin_unlock_irqrestore() pair
> >>>>
> >>>> Documentation/memory-barriers.txt states in line 1123 :
> >>>>
> >>>> 	Memory operations issued after the LOCK will be completed after the LOCK
> >>>> 	operation has completed.
> >>>>
> >>>> and line 1131 states :
> >>>>
> >>>> 	Memory operations issued before the UNLOCK will be completed before the
> >>>> 	UNLOCK operation has completed.
> >>>>
> >>>> So yes, there is no full smp_mb() in poll_wait()
> >>>>
> >>>>> +
> >>>>> +	/* Get in sync with tcp_data_queue, tcp_urg
> >>>>> +	   and tcp_rcv_established function. */
> >>>>> +	smp_mb();
> >>>> If this barrier is really necessary, I guess it should be done in poll_wait() itself.
> >>>>
> >>>> Documentation/memory-barriers.txt misses some information about poll_wait()
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>> +
> >>>>>  	if (sk->sk_state == TCP_LISTEN)
> >>>>>  		return inet_csk_listen_poll(sk);
> >>>>>  
> >>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>>>> index 2bdb0da..0606e5e 100644
> >>>>> --- a/net/ipv4/tcp_input.c
> >>>>> +++ b/net/ipv4/tcp_input.c
> >>>>> @@ -4362,8 +4362,11 @@ queue_and_out:
> >>>>>  
> >>>>>  		if (eaten > 0)
> >>>>>  			__kfree_skb(skb);
> >>>>> -		else if (!sock_flag(sk, SOCK_DEAD))
> >>>>> +		else if (!sock_flag(sk, SOCK_DEAD)) {
> >>>>> +			/* Get in sync with tcp_poll function. */
> >>>>> +			smp_mb();
> >>>>>  			sk->sk_data_ready(sk, 0);
> >>>>> +		}
> >>>>>  		return;
> >>>>>
> >>>> Oh well... if smp_mb() is needed, I believe it should be done
> >>>> right before "if (waitqueue_active(sk->sk_sleep) ... "
> >>>>
> >>>>  	read_lock(&sk->sk_callback_lock);
> >>>> +	smp_mb();
> >>>>  	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> >>>>  		wake_up_interruptible(sk->sk_sleep)
> >>>>
> >>>> It would match other parts in kernel (see fs/splice.c, fs/aio.c, ...)
> >>>>
> >>>> Strange thing is that read_lock() on x86 is a full memory barrier, as it uses
> >>>> "lock subl $0x1,(%eax)"
> >>>>
> >>>> Maybe we could define a smp_mb_after_read_lock()  (a compiler barrier() on x86)
> >>>>
> >>> First version of the patch was actually in this layer, see
> >>> https://bugzilla.redhat.com/attachment.cgi?id=345886
> >>>
> >>> I was adviced this could be to invasive (it was in waitqueue_active actually), 
> >>> so I moved the change to the TCP layer itself... 
> >>>
> >>> As far as I understand the problem there's need for 2 barriers to be
> >>> sure, the memory will have correct data. One in the codepath calling the
> >>> select (tcp_poll), and in the other one updating the available data status 
> >>> (sock_def_readable), am I missing smth?
> >>>
> >> Hmm, I am not saying your patch doesnt fix the problem, I am saying it
> >> is a partial fix of a general problem. We might have same problem(s) in other
> >> parts of network stack. This is a very serious issue.
> >>
> >> Point 1 :
> >>
> >> You added an smp_mb() call in tcp_poll(). This one looks fine to solve
> >> the problem for tcp sockets. What about other protocols ? Do we have
> >> same problem ?
> > 
> > Looks like most of the protocols using the poll_wait. Also I assume
> > that most of them will probably have the same scenario as the one
> > described (CPU1 and CPU2 codepaths in the RACE DESCRIPTION).
> > 
> > So I moved the poll smp_mb() call to the __pollwait function, plz
> > check the attached diff. This might be too invasive, so another
> > place could be probably polling callbacks themselfs like 
> > datagram_poll (used very often by protocols), tcp_poll, udp_poll...
> > 
> > I'm still looking which way would be more suitable, comments are very
> > welcome :)
> > 
> >> Point 2 :
> >>
> >> You added several smp_mb() calls in tcp input path. In my first
> >> reply, I said it was probably better to add smp_mb() in a single 
> >> place, right before "if (waitqueue_active(sk->sk_sleep) ... ",
> >> but in all paths (input path & output path).
> >>
> >> Point 3 :
> >>
> >> The optimization we could do, defining
> >> a smp_mb_after_read_lock() that could be a nop on x86
> >>
> >> 	read_lock(&sk->sk_callback_lock); // "lock subl $0x1,(%eax)" on x86
> >> 	smp_mb_after_read_lock(); /* compiler barrier() on x86 */
> >> 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> >>  		wake_up_interruptible(sk->sk_sleep);
> >>
> >> Am I missing something ?
> >>
> >> ;)
> >>
> > 
> > not at all :) I'm the one behind..
> > 
> > Anyway I made modifications based on Point 2) and 3) and the diff is
> > attached, please check.
> > 
> > thanks a lot,
> > jirka
> > 
> > 
> > diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> > index b7e5db8..570c0ff 100644
> > --- a/arch/x86/include/asm/spinlock.h
> > +++ b/arch/x86/include/asm/spinlock.h
> > @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >  #define _raw_read_relax(lock)	cpu_relax()
> >  #define _raw_write_relax(lock)	cpu_relax()
> >  
> > +/* The read_lock() on x86 is a full memory barrier. */
> > +#define smp_mb__after_read_lock() barrier()
> > +
> >  #endif /* _ASM_X86_SPINLOCK_H */
> > diff --git a/fs/select.c b/fs/select.c
> > index d870237..f9ebd45 100644
> > --- a/fs/select.c
> > +++ b/fs/select.c
> > @@ -219,6 +219,10 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
> >  	init_waitqueue_func_entry(&entry->wait, pollwake);
> >  	entry->wait.private = pwq;
> >  	add_wait_queue(wait_address, &entry->wait);
> > +
> > +	/* This memory barrier is paired with the smp_mb__after_read_lock
> > +	 * in the sock_def_readable. */
> > +	smp_mb();
> >  }
> >  
> >  int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
> > diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> > index 252b245..dd28726 100644
> > --- a/include/linux/spinlock.h
> > +++ b/include/linux/spinlock.h
> > @@ -132,6 +132,11 @@ do {								\
> >  #endif /*__raw_spin_is_contended*/
> >  #endif
> >  
> > +/* The read_lock does not imply full memory barrier. */
> > +#ifndef smp_mb__after_read_lock
> > +#define smp_mb__after_read_lock() smp_mb()
> > +#endif
> > +
> >  /**
> >   * spin_unlock_wait - wait until the spinlock gets unlocked
> >   * @lock: the spinlock in question.
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index b0ba569..11e414f 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -1732,6 +1732,7 @@ static void sock_def_error_report(struct sock *sk)
> >  static void sock_def_readable(struct sock *sk, int len)
> >  {
> >  	read_lock(&sk->sk_callback_lock);
> > +	smp_mb__after_read_lock();
> >  	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> >  		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
> >  						POLLRDNORM | POLLRDBAND);
> 
> I suspect we need to change all places where we do :
> 
> 
> read_lock(&sk->sk_callback_lock);
> ...
> if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> 
> to :
> 
> read_lock(&sk->sk_callback_lock);
> ...
> smp_mb__after_read_lock();
> if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> 
> I suggest you add a helper function like
> 
> static inline int sk_has_sleeper(struct sock *sk)
> {
> 	/*
> 	 * some nice comment here why this barrier is needed
> 	 * (and where opposite one is located)
> 	 */
> 	smp_mb__after_read_lock();
> 	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
> }
> 
> and use it in net/atm/common.c : vcc_def_wakeup() & vcc_write_space()
> net/dccp/output.c : dccp_write_space()
> net/core/stream.c : sk_stream_write_space()
> net/core/sock.c : sock_def_wakeup(), sock_def_error_report(),
> 		sock_def_readable(), sock_def_write_space()
> net/iucv/af_iucv.c : iucv_sock_wake_msglim()
> 
> and several others as well in net tree... "find|grep" are your friends :)
> 
> 
> Thanks

I made the modification, plz check the attached diff.

I found some places where the read_lock is not ahead of the check:
  "if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))"

I'm not sure we dont want to address those as well; located in following
files:
        drivers/net/tun.c
        net/core/stream.c
        net/sctp/socket.c
        net/sunrpc/svcsock.c


thanks,
jirka



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Oleg Nesterov June 24, 2009, 4:30 p.m. UTC | #1
On 06/24, Jiri Olsa wrote:
>
> +/* The read_lock() on x86 is a full memory barrier. */
> +#define smp_mb__after_read_lock() barrier()

Just curious, why do we need barrier() ?

I must admit, personally I dislike _read_lock part. Because I think we
need a "more generic" smp_mb__{before,after}_lock() or whatever which
work for spin_lock/read_lock/write_lock.

In that case it can have more users. Btw, in fs/select.c too, see
__pollwake().

And surprise,

> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -219,6 +219,10 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
>  	init_waitqueue_func_entry(&entry->wait, pollwake);
>  	entry->wait.private = pwq;
>  	add_wait_queue(wait_address, &entry->wait);
> +
> +	/* This memory barrier is paired with the smp_mb__after_read_lock
> +	 * in the sk_has_sleeper. */
> +	smp_mb();

This could be smp_mb__after_lock() too.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Oleg Nesterov June 24, 2009, 4:41 p.m. UTC | #2
On 06/24, Oleg Nesterov wrote:
>
> On 06/24, Jiri Olsa wrote:
> >
> > +/* The read_lock() on x86 is a full memory barrier. */
> > +#define smp_mb__after_read_lock() barrier()
>
> Just curious, why do we need barrier() ?
>
> I must admit, personally I dislike _read_lock part. Because I think we
> need a "more generic" smp_mb__{before,after}_lock() or whatever which
> work for spin_lock/read_lock/write_lock.
>
> In that case it can have more users. Btw, in fs/select.c too, see
> __pollwake().
>
> And surprise,
>
> > --- a/fs/select.c
> > +++ b/fs/select.c
> > @@ -219,6 +219,10 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
> >  	init_waitqueue_func_entry(&entry->wait, pollwake);
> >  	entry->wait.private = pwq;
> >  	add_wait_queue(wait_address, &entry->wait);
> > +
> > +	/* This memory barrier is paired with the smp_mb__after_read_lock
> > +	 * in the sk_has_sleeper. */
> > +	smp_mb();
>
> This could be smp_mb__after_lock() too.

Cough. this needs mb__after_UNlock(), sorry.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet June 25, 2009, 10:28 a.m. UTC | #3
Jiri Olsa a écrit :
> 
> I made the modification, plz check the attached diff.
> 
> I found some places where the read_lock is not ahead of the check:
>   "if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))"
> 
> I'm not sure we dont want to address those as well; located in following
> files:
>         drivers/net/tun.c
>         net/core/stream.c
>         net/sctp/socket.c
>         net/sunrpc/svcsock.c

We'll take care of them later :)

> 
> 
> thanks,
> jirka
> 

This patch is OK with me, please submit a new formal patch with
fresh ChangeLog so that we can all agree and Signed-off-by/Acked-by

Oleg, I think your comment can be addressed in a followup patch ?

Thanks to all

> 
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index b7e5db8..570c0ff 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
>  #define _raw_read_relax(lock)	cpu_relax()
>  #define _raw_write_relax(lock)	cpu_relax()
>  
> +/* The read_lock() on x86 is a full memory barrier. */
> +#define smp_mb__after_read_lock() barrier()
> +
>  #endif /* _ASM_X86_SPINLOCK_H */
> diff --git a/fs/select.c b/fs/select.c
> index d870237..cf5d80b 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -219,6 +219,10 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
>  	init_waitqueue_func_entry(&entry->wait, pollwake);
>  	entry->wait.private = pwq;
>  	add_wait_queue(wait_address, &entry->wait);
> +
> +	/* This memory barrier is paired with the smp_mb__after_read_lock
> +	 * in the sk_has_sleeper. */
> +	smp_mb();
>  }
>  
>  int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 252b245..dd28726 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -132,6 +132,11 @@ do {								\
>  #endif /*__raw_spin_is_contended*/
>  #endif
>  
> +/* The read_lock does not imply full memory barrier. */
> +#ifndef smp_mb__after_read_lock
> +#define smp_mb__after_read_lock() smp_mb()
> +#endif
> +
>  /**
>   * spin_unlock_wait - wait until the spinlock gets unlocked
>   * @lock: the spinlock in question.
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 07133c5..a02a956 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1241,6 +1241,24 @@ static inline int sk_has_allocations(const struct sock *sk)
>  	return sk_wmem_alloc_get(sk) || sk_rmem_alloc_get(sk);
>  }
>  
> +/**
> + * sk_has_sleeper - check if there are any waiting processes
> + * @sk: socket
> + *
> + * Returns true if socket has waiting processes
> + */
> +static inline int sk_has_sleeper(struct sock *sk)
> +{
> +	/*
> +	 * We need to be sure we are in sync with the
> +	 * add_wait_queue modifications to the wait queue.
> +	 *
> +	 * This memory barrier is paired in the __pollwait.
> +	 */
> +	smp_mb__after_read_lock();
> +	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
> +}
> +
>  /*
>   * 	Queue a received datagram if it will fit. Stream and sequenced
>   *	protocols can't normally use this as they need to fit buffers in
> diff --git a/net/atm/common.c b/net/atm/common.c
> index c1c9793..67a8642 100644
> --- a/net/atm/common.c
> +++ b/net/atm/common.c
> @@ -92,7 +92,7 @@ static void vcc_sock_destruct(struct sock *sk)
>  static void vcc_def_wakeup(struct sock *sk)
>  {
>  	read_lock(&sk->sk_callback_lock);
> -	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +	if (sk_has_sleeper(sk))
>  		wake_up(sk->sk_sleep);
>  	read_unlock(&sk->sk_callback_lock);
>  }
> @@ -110,7 +110,7 @@ static void vcc_write_space(struct sock *sk)
>  	read_lock(&sk->sk_callback_lock);
>  
>  	if (vcc_writable(sk)) {
> -		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +		if (sk_has_sleeper(sk))
>  			wake_up_interruptible(sk->sk_sleep);
>  
>  		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index b0ba569..6354863 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1715,7 +1715,7 @@ EXPORT_SYMBOL(sock_no_sendpage);
>  static void sock_def_wakeup(struct sock *sk)
>  {
>  	read_lock(&sk->sk_callback_lock);
> -	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +	if (sk_has_sleeper(sk))
>  		wake_up_interruptible_all(sk->sk_sleep);
>  	read_unlock(&sk->sk_callback_lock);
>  }
> @@ -1723,7 +1723,7 @@ static void sock_def_wakeup(struct sock *sk)
>  static void sock_def_error_report(struct sock *sk)
>  {
>  	read_lock(&sk->sk_callback_lock);
> -	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +	if (sk_has_sleeper(sk))
>  		wake_up_interruptible_poll(sk->sk_sleep, POLLERR);
>  	sk_wake_async(sk, SOCK_WAKE_IO, POLL_ERR);
>  	read_unlock(&sk->sk_callback_lock);
> @@ -1732,7 +1732,7 @@ static void sock_def_error_report(struct sock *sk)
>  static void sock_def_readable(struct sock *sk, int len)
>  {
>  	read_lock(&sk->sk_callback_lock);
> -	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +	if (sk_has_sleeper(sk))
>  		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
>  						POLLRDNORM | POLLRDBAND);
>  	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
> @@ -1747,7 +1747,7 @@ static void sock_def_write_space(struct sock *sk)
>  	 * progress.  --DaveM
>  	 */
>  	if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
> -		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +		if (sk_has_sleeper(sk))
>  			wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT |
>  						POLLWRNORM | POLLWRBAND);
>  
> diff --git a/net/dccp/output.c b/net/dccp/output.c
> index c0e88c1..c96119f 100644
> --- a/net/dccp/output.c
> +++ b/net/dccp/output.c
> @@ -196,7 +196,7 @@ void dccp_write_space(struct sock *sk)
>  {
>  	read_lock(&sk->sk_callback_lock);
>  
> -	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +	if (sk_has_sleeper(sk))
>  		wake_up_interruptible(sk->sk_sleep);
>  	/* Should agree with poll, otherwise some programs break */
>  	if (sock_writeable(sk))
> diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
> index 6be5f92..ba0149d 100644
> --- a/net/iucv/af_iucv.c
> +++ b/net/iucv/af_iucv.c
> @@ -306,7 +306,7 @@ static inline int iucv_below_msglim(struct sock *sk)
>  static void iucv_sock_wake_msglim(struct sock *sk)
>  {
>  	read_lock(&sk->sk_callback_lock);
> -	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +	if (sk_has_sleeper(sk))
>  		wake_up_interruptible_all(sk->sk_sleep);
>  	sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
>  	read_unlock(&sk->sk_callback_lock);
> diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
> index eac5e7b..60e0e38 100644
> --- a/net/rxrpc/af_rxrpc.c
> +++ b/net/rxrpc/af_rxrpc.c
> @@ -63,7 +63,7 @@ static void rxrpc_write_space(struct sock *sk)
>  	_enter("%p", sk);
>  	read_lock(&sk->sk_callback_lock);
>  	if (rxrpc_writable(sk)) {
> -		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +		if (sk_has_sleeper(sk))
>  			wake_up_interruptible(sk->sk_sleep);
>  		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
>  	}
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 36d4e44..143143a 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -315,7 +315,7 @@ static void unix_write_space(struct sock *sk)
>  {
>  	read_lock(&sk->sk_callback_lock);
>  	if (unix_writable(sk)) {
> -		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +		if (sk_has_sleeper(sk))
>  			wake_up_interruptible_sync(sk->sk_sleep);
>  		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
>  	}
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet June 25, 2009, 10:46 a.m. UTC | #4
Eric Dumazet a écrit :
> Jiri Olsa a écrit :
>> I made the modification, plz check the attached diff.
>>
>> I found some places where the read_lock is not ahead of the check:
>>   "if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))"
>>
>> I'm not sure we dont want to address those as well; located in following
>> files:
>>         drivers/net/tun.c
>>         net/core/stream.c
>>         net/sctp/socket.c
>>         net/sunrpc/svcsock.c
> 
> We'll take care of them later :)
> 
>>
>> thanks,
>> jirka
>>
> 
> This patch is OK with me, please submit a new formal patch with
> fresh ChangeLog so that we can all agree and Signed-off-by/Acked-by
> 
> Oleg, I think your comment can be addressed in a followup patch ?
> 
> Thanks to all

To clarify, I meant the second comment from Oleg.

Jiri, please define a "smp_mb__after_lock()" instead of smp_mb__after_read_lock()

+/* The {read|write|spin}_lock() on x86 are full memory barriers. */
+#define smp_mb__after_lock() do { } while (0)
+
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet June 25, 2009, 10:51 a.m. UTC | #5
Oleg Nesterov a écrit :
> On 06/24, Oleg Nesterov wrote:
>> On 06/24, Jiri Olsa wrote:
>>> +/* The read_lock() on x86 is a full memory barrier. */
>>> +#define smp_mb__after_read_lock() barrier()
>> Just curious, why do we need barrier() ?
>>
>> I must admit, personally I dislike _read_lock part. Because I think we
>> need a "more generic" smp_mb__{before,after}_lock() or whatever which
>> work for spin_lock/read_lock/write_lock.
>>
>> In that case it can have more users. Btw, in fs/select.c too, see
>> __pollwake().
>>
>> And surprise,
>>
>>> --- a/fs/select.c
>>> +++ b/fs/select.c
>>> @@ -219,6 +219,10 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
>>>  	init_waitqueue_func_entry(&entry->wait, pollwake);
>>>  	entry->wait.private = pwq;
>>>  	add_wait_queue(wait_address, &entry->wait);
>>> +
>>> +	/* This memory barrier is paired with the smp_mb__after_read_lock
>>> +	 * in the sk_has_sleeper. */
>>> +	smp_mb();
>> This could be smp_mb__after_lock() too.
> 
> Cough. this needs mb__after_UNlock(), sorry.
> 

Yes, and this time you need separate smp_mb__after_spin_unlock(),
as rwlocks and spinlocks dont have same unlock implementation.

(spin_unlock dont have memory barrier on x86, while read_write_unlock do have a barrier)

As it wont give us a benefit on x86 but code obfuscation, I suspect we can leave this for now :)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index b7e5db8..570c0ff 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -302,4 +302,7 @@  static inline void __raw_write_unlock(raw_rwlock_t *rw)
 #define _raw_read_relax(lock)	cpu_relax()
 #define _raw_write_relax(lock)	cpu_relax()
 
+/* The read_lock() on x86 is a full memory barrier. */
+#define smp_mb__after_read_lock() barrier()
+
 #endif /* _ASM_X86_SPINLOCK_H */
diff --git a/fs/select.c b/fs/select.c
index d870237..cf5d80b 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -219,6 +219,10 @@  static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 	init_waitqueue_func_entry(&entry->wait, pollwake);
 	entry->wait.private = pwq;
 	add_wait_queue(wait_address, &entry->wait);
+
+	/* This memory barrier is paired with the smp_mb__after_read_lock
+	 * in the sk_has_sleeper. */
+	smp_mb();
 }
 
 int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 252b245..dd28726 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -132,6 +132,11 @@  do {								\
 #endif /*__raw_spin_is_contended*/
 #endif
 
+/* The read_lock does not imply full memory barrier. */
+#ifndef smp_mb__after_read_lock
+#define smp_mb__after_read_lock() smp_mb()
+#endif
+
 /**
  * spin_unlock_wait - wait until the spinlock gets unlocked
  * @lock: the spinlock in question.
diff --git a/include/net/sock.h b/include/net/sock.h
index 07133c5..a02a956 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1241,6 +1241,24 @@  static inline int sk_has_allocations(const struct sock *sk)
 	return sk_wmem_alloc_get(sk) || sk_rmem_alloc_get(sk);
 }
 
+/**
+ * sk_has_sleeper - check if there are any waiting processes
+ * @sk: socket
+ *
+ * Returns true if socket has waiting processes
+ */
+static inline int sk_has_sleeper(struct sock *sk)
+{
+	/*
+	 * We need to be sure we are in sync with the
+	 * add_wait_queue modifications to the wait queue.
+	 *
+	 * This memory barrier is paired in the __pollwait.
+	 */
+	smp_mb__after_read_lock();
+	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
+}
+
 /*
  * 	Queue a received datagram if it will fit. Stream and sequenced
  *	protocols can't normally use this as they need to fit buffers in
diff --git a/net/atm/common.c b/net/atm/common.c
index c1c9793..67a8642 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -92,7 +92,7 @@  static void vcc_sock_destruct(struct sock *sk)
 static void vcc_def_wakeup(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+	if (sk_has_sleeper(sk))
 		wake_up(sk->sk_sleep);
 	read_unlock(&sk->sk_callback_lock);
 }
@@ -110,7 +110,7 @@  static void vcc_write_space(struct sock *sk)
 	read_lock(&sk->sk_callback_lock);
 
 	if (vcc_writable(sk)) {
-		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		if (sk_has_sleeper(sk))
 			wake_up_interruptible(sk->sk_sleep);
 
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
diff --git a/net/core/sock.c b/net/core/sock.c
index b0ba569..6354863 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1715,7 +1715,7 @@  EXPORT_SYMBOL(sock_no_sendpage);
 static void sock_def_wakeup(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+	if (sk_has_sleeper(sk))
 		wake_up_interruptible_all(sk->sk_sleep);
 	read_unlock(&sk->sk_callback_lock);
 }
@@ -1723,7 +1723,7 @@  static void sock_def_wakeup(struct sock *sk)
 static void sock_def_error_report(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+	if (sk_has_sleeper(sk))
 		wake_up_interruptible_poll(sk->sk_sleep, POLLERR);
 	sk_wake_async(sk, SOCK_WAKE_IO, POLL_ERR);
 	read_unlock(&sk->sk_callback_lock);
@@ -1732,7 +1732,7 @@  static void sock_def_error_report(struct sock *sk)
 static void sock_def_readable(struct sock *sk, int len)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+	if (sk_has_sleeper(sk))
 		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
 						POLLRDNORM | POLLRDBAND);
 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
@@ -1747,7 +1747,7 @@  static void sock_def_write_space(struct sock *sk)
 	 * progress.  --DaveM
 	 */
 	if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
-		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		if (sk_has_sleeper(sk))
 			wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT |
 						POLLWRNORM | POLLWRBAND);
 
diff --git a/net/dccp/output.c b/net/dccp/output.c
index c0e88c1..c96119f 100644
--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -196,7 +196,7 @@  void dccp_write_space(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
 
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+	if (sk_has_sleeper(sk))
 		wake_up_interruptible(sk->sk_sleep);
 	/* Should agree with poll, otherwise some programs break */
 	if (sock_writeable(sk))
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index 6be5f92..ba0149d 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -306,7 +306,7 @@  static inline int iucv_below_msglim(struct sock *sk)
 static void iucv_sock_wake_msglim(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+	if (sk_has_sleeper(sk))
 		wake_up_interruptible_all(sk->sk_sleep);
 	sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	read_unlock(&sk->sk_callback_lock);
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index eac5e7b..60e0e38 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -63,7 +63,7 @@  static void rxrpc_write_space(struct sock *sk)
 	_enter("%p", sk);
 	read_lock(&sk->sk_callback_lock);
 	if (rxrpc_writable(sk)) {
-		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		if (sk_has_sleeper(sk))
 			wake_up_interruptible(sk->sk_sleep);
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 36d4e44..143143a 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -315,7 +315,7 @@  static void unix_write_space(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
 	if (unix_writable(sk)) {
-		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		if (sk_has_sleeper(sk))
 			wake_up_interruptible_sync(sk->sk_sleep);
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}