
[RFC] tcp: race in receive part

Message ID 20090618102727.GC3782@jolsa.lab.eng.brq.redhat.com
State RFC, archived
Delegated to: David Miller

Commit Message

Jiri Olsa June 18, 2009, 10:27 a.m. UTC
Hi,

in RHEL4 we can see a race in the TCP layer. We were not able to reproduce
this on the upstream kernel, but since the issue occurs very rarely
(roughly once every 8 days), we may simply not have been lucky.

I'm afraid this might be a long email; I'll try to structure it nicely.. :)



RACE DESCRIPTION

Comments

Eric Dumazet June 18, 2009, 2:06 p.m. UTC | #1
Jiri Olsa wrote:
> Hi,
> 
> in RHEL4 we can see a race in the TCP layer. We were not able to reproduce
> this on the upstream kernel, but since the issue occurs very rarely
> (roughly once every 8 days), we may simply not have been lucky.
> 
> I'm afraid this might be a long email; I'll try to structure it nicely.. :)
> 

Thanks for your mail and detailed analysis

> 
> 
> RACE DESCRIPTION
> ================
> 
> There's a nice PDF describing the issue (and a solution using locks) at
> https://bugzilla.redhat.com/attachment.cgi?id=345014

I could not reach this URL, unfortunately

--> "You are not authorized to access bug #494404. "

> 
> 
> The race fires when the following code paths meet, and the tp->rcv_nxt and
> __add_wait_queue updates stay in the CPU caches.
> 
> CPU1                         CPU2
> 
> 
> sys_select                   receive packet
>   ...                        ...
>   __add_wait_queue           update tp->rcv_nxt
>   ...                        ...
>   tp->rcv_nxt check          sock_def_readable
>   ...                        {
>   schedule                      ...
>                                 if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>                                         wake_up_interruptible(sk->sk_sleep)
>                                 ...
>                              }
> 
> If there were no caching, the code would work correctly, since the wait_queue
> and rcv_nxt accesses are performed in the opposite order on the two CPUs.
> 
> Meaning that once tp->rcv_nxt is updated by CPU2, CPU1 has either already
> passed the tp->rcv_nxt check and sleeps, or it sees the new value of
> tp->rcv_nxt and returns with the new data mask.
> In both cases the process (CPU1) has been added to the wait queue, so the
> waitqueue_active call (CPU2) cannot miss it and will wake up CPU1.
> 
> The bad case is when the __add_wait_queue change done by CPU1 stays in its
> cache, and so does the tp->rcv_nxt update on CPU2's side.  CPU1 will then
> end up calling schedule and sleeping forever if no more data arrive on the
> socket.
> 
> Adding smp_mb() calls before the sock_def_readable call and after __add_wait_queue
> should prevent the above bad scenario.
> 
> The upstream patch is attached. It seems to prevent the issue.
> 
> 
> 
> CPU BUGS
> ========
> 
> The customer has been able to reproduce this problem only on one CPU model:
> Xeon E5345*2. They did not reproduce it on a Xeon MV, for example.

Is there an easy way to reproduce the problem ?

> 
> That CPU model happens to have two possible errata that might cause the issue:
> (see errata http://www.intel.com/Assets/PDF/specupdate/315338.pdf)
> 
> AJ39 and AJ18. The first one can be worked around by a BIOS upgrade;
> the other one has the following notes:

AJ18 only matters for unaligned accesses; the TCP code does not do that.

> 
>       Software should ensure at least one of the following is true when
>       modifying shared data by multiple agents:
>              • The shared data is aligned
>              • Proper semaphores or barriers are used in order to
>                 prevent concurrent data accesses.
> 
> 
> 
> RFC
> ===
> 
> I'm aware that not having reproduced this issue on upstream lowers the odds
> of having this checked in. However, AFAICS the issue is present. I'd appreciate
> any comments/ideas.
> 
> 
> thanks,
> jirka
> 
> 
> Signed-off-by: Jiri Olsa <jolsa@redhat.com>
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 17b89c5..f5d9dbf 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -340,6 +340,11 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
>  	struct tcp_sock *tp = tcp_sk(sk);
>  
>  	poll_wait(file, sk->sk_sleep, wait);

poll_wait() calls add_wait_queue() which contains a 
spin_lock_irqsave()/spin_unlock_irqrestore() pair

Documentation/memory-barriers.txt states in line 1123 :

	Memory operations issued after the LOCK will be completed after the LOCK
	operation has completed.

and line 1131 states :

	Memory operations issued before the UNLOCK will be completed before the
	UNLOCK operation has completed.

So yes, there is no full smp_mb() in poll_wait()

> +
> +	/* Get in sync with tcp_data_queue, tcp_urg
> +	   and tcp_rcv_established function. */
> +	smp_mb();

If this barrier is really necessary, I guess it should be done in poll_wait() itself.

Documentation/memory-barriers.txt misses some information about poll_wait()
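
For illustration only, the suggestion above would amount to something like the
following sketch (not an actual patch; the exact placement and the pairing
comment are assumptions):

	static inline void poll_wait(struct file *filp,
				     wait_queue_head_t *wait_address, poll_table *p)
	{
		if (p && wait_address)
			p->qproc(filp, wait_address, p);
		/* Pair with a barrier on the wakeup side, so the
		 * __add_wait_queue() done via qproc is visible before the
		 * caller re-checks its wakeup condition. */
		smp_mb();
	}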




> +
>  	if (sk->sk_state == TCP_LISTEN)
>  		return inet_csk_listen_poll(sk);
>  
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 2bdb0da..0606e5e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4362,8 +4362,11 @@ queue_and_out:
>  
>  		if (eaten > 0)
>  			__kfree_skb(skb);
> -		else if (!sock_flag(sk, SOCK_DEAD))
> +		else if (!sock_flag(sk, SOCK_DEAD)) {
> +			/* Get in sync with tcp_poll function. */
> +			smp_mb();
>  			sk->sk_data_ready(sk, 0);
> +		}
>  		return;
>

Oh well... if smp_mb() is needed, I believe it should be done
right before "if (waitqueue_active(sk->sk_sleep) ... "

 	read_lock(&sk->sk_callback_lock);
+	smp_mb();
 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
 		wake_up_interruptible(sk->sk_sleep)

It would match other parts of the kernel (see fs/splice.c, fs/aio.c, ...)

The strange thing is that read_lock() on x86 is a full memory barrier, as it uses
"lock subl $0x1,(%eax)".

Maybe we could define an smp_mb_after_read_lock() (a compiler barrier() on x86).
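
For illustration only, such a helper might look like this (a sketch;
smp_mb_after_read_lock() does not exist in the kernel at this point, and the
arch-conditional is shown inline for brevity rather than in the arch headers):

	#ifdef CONFIG_X86
	/* read_lock() already uses a locked instruction, which is a full
	 * memory barrier on x86, so a compiler barrier is enough here. */
	#define smp_mb_after_read_lock()	barrier()
	#else
	#define smp_mb_after_read_lock()	smp_mb()
	#endif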

Jiri Olsa June 23, 2009, 9:12 a.m. UTC | #2
Hi,

thanks for the answer, and sorry for my late reply;
we needed the customer's permission to disclose the debug data.


On Thu, Jun 18, 2009 at 04:06:42PM +0200, Eric Dumazet wrote:
> Jiri Olsa wrote:
> > Hi,
> > 
> > in RHEL4 we can see a race in the TCP layer. We were not able to reproduce
> > this on the upstream kernel, but since the issue occurs very rarely
> > (roughly once every 8 days), we may simply not have been lucky.
> > 
> > I'm afraid this might be a long email; I'll try to structure it nicely.. :)
> > 
> 
> Thanks for your mail and detailed analysis
> 
> > 
> > 
> > RACE DESCRIPTION
> > ================
> > 
> > There's a nice PDF describing the issue (and a solution using locks) at
> > https://bugzilla.redhat.com/attachment.cgi?id=345014
> 
> I could not reach this URL, unfortunately
> 
> --> "You are not authorized to access bug #494404. "

please try it now; the bug should be accessible now

> 
> > 
> > 
> > The race fires when the following code paths meet, and the tp->rcv_nxt and
> > __add_wait_queue updates stay in the CPU caches.
> > 
> > CPU1                         CPU2
> > 
> > 
> > sys_select                   receive packet
> >   ...                        ...
> >   __add_wait_queue           update tp->rcv_nxt
> >   ...                        ...
> >   tp->rcv_nxt check          sock_def_readable
> >   ...                        {
> >   schedule                      ...
> >                                 if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> >                                         wake_up_interruptible(sk->sk_sleep)
> >                                 ...
> >                              }
> > 
> > If there were no caching, the code would work correctly, since the wait_queue
> > and rcv_nxt accesses are performed in the opposite order on the two CPUs.
> > 
> > Meaning that once tp->rcv_nxt is updated by CPU2, CPU1 has either already
> > passed the tp->rcv_nxt check and sleeps, or it sees the new value of
> > tp->rcv_nxt and returns with the new data mask.
> > In both cases the process (CPU1) has been added to the wait queue, so the
> > waitqueue_active call (CPU2) cannot miss it and will wake up CPU1.
> > 
> > The bad case is when the __add_wait_queue change done by CPU1 stays in its
> > cache, and so does the tp->rcv_nxt update on CPU2's side.  CPU1 will then
> > end up calling schedule and sleeping forever if no more data arrive on the
> > socket.
> > 
> > Adding smp_mb() calls before the sock_def_readable call and after __add_wait_queue
> > should prevent the above bad scenario.
> > 
> > The upstream patch is attached. It seems to prevent the issue.
> > 
> > 
> > 
> > CPU BUGS
> > ========
> > 
> > The customer has been able to reproduce this problem only on one CPU model:
> > Xeon E5345*2. They did not reproduce it on a Xeon MV, for example.
> 
> Is there an easy way to reproduce the problem ?

there's a reproducer attached to the bug

https://enterprise.redhat.com/issue-tracker/?module=download&fid=201560&key=f6f87caf6ac2dc1eb1173257c8a5ef78

it is basically a client/server program.
They pass messages to each other; when a message is sent,
both of them expect a message on the input before sending another message.

Very rarely the code hits the case where the process that called select
is not woken up by incoming data, probably because of the memory cache
incoherency. See the backtrace in

https://bugzilla.redhat.com/show_bug.cgi?id=494404#c1
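
For illustration only, the ping-pong pattern described above looks roughly like
the standalone sketch below (a guess at the pattern, not the attached
reproducer; all names and details here are made up):

	/* Two processes exchange one-byte messages over a loopback TCP
	 * connection; each side select()s for the peer's reply before
	 * sending the next message.  The reported hang shows up as a
	 * select() that never returns although the peer already wrote. */
	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/select.h>
	#include <sys/socket.h>
	#include <unistd.h>

	static void pingpong(int fd, int start)
	{
		char c = 'x';
		fd_set rfds;

		if (start && write(fd, &c, 1) != 1)
			exit(1);
		for (;;) {
			FD_ZERO(&rfds);
			FD_SET(fd, &rfds);
			if (select(fd + 1, &rfds, NULL, NULL, NULL) < 0)
				exit(1);
			if (read(fd, &c, 1) != 1)
				exit(1);
			if (write(fd, &c, 1) != 1)
				exit(1);
		}
	}

	int main(void)
	{
		struct sockaddr_in addr;
		socklen_t len = sizeof(addr);
		int lfd = socket(AF_INET, SOCK_STREAM, 0);

		memset(&addr, 0, sizeof(addr));
		addr.sin_family = AF_INET;
		addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
		addr.sin_port = 0;			/* any free port */
		bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
		getsockname(lfd, (struct sockaddr *)&addr, &len);
		listen(lfd, 1);

		if (fork() == 0) {			/* client side */
			int fd = socket(AF_INET, SOCK_STREAM, 0);

			connect(fd, (struct sockaddr *)&addr, sizeof(addr));
			pingpong(fd, 1);
		} else {				/* server side */
			pingpong(accept(lfd, NULL, NULL), 0);
		}
		return 0;
	}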


> 
> > 
> > That CPU model happens to have two possible errata that might cause the issue:
> > (see errata http://www.intel.com/Assets/PDF/specupdate/315338.pdf)
> > 
> > AJ39 and AJ18. The first one can be worked around by a BIOS upgrade;
> > the other one has the following notes:
> 
> AJ18 only matters for unaligned accesses; the TCP code does not do that.
> 
> > 
> >       Software should ensure at least one of the following is true when
> >       modifying shared data by multiple agents:
> >              • The shared data is aligned
> >              • Proper semaphores or barriers are used in order to
> >                 prevent concurrent data accesses.
> > 
> > 
> > 
> > RFC
> > ===
> > 
> > I'm aware that not having reproduced this issue on upstream lowers the odds
> > of having this checked in. However, AFAICS the issue is present. I'd appreciate
> > any comments/ideas.
> > 
> > 
> > thanks,
> > jirka
> > 
> > 
> > Signed-off-by: Jiri Olsa <jolsa@redhat.com>
> > 
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 17b89c5..f5d9dbf 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -340,6 +340,11 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
> >  	struct tcp_sock *tp = tcp_sk(sk);
> >  
> >  	poll_wait(file, sk->sk_sleep, wait);
> 
> poll_wait() calls add_wait_queue() which contains a 
> spin_lock_irqsave()/spin_unlock_irqrestore() pair
> 
> Documentation/memory-barriers.txt states in line 1123 :
> 
> 	Memory operations issued after the LOCK will be completed after the LOCK
> 	operation has completed.
> 
> and line 1131 states :
> 
> 	Memory operations issued before the UNLOCK will be completed before the
> 	UNLOCK operation has completed.
> 
> So yes, there is no full smp_mb() in poll_wait()
> 
> > +
> > +	/* Get in sync with tcp_data_queue, tcp_urg
> > +	   and tcp_rcv_established function. */
> > +	smp_mb();
> 
> If this barrier is really necessary, I guess it should be done in poll_wait() itself.
> 
> Documentation/memory-barriers.txt misses some information about poll_wait()
> 
> 
> 
> 
> > +
> >  	if (sk->sk_state == TCP_LISTEN)
> >  		return inet_csk_listen_poll(sk);
> >  
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 2bdb0da..0606e5e 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -4362,8 +4362,11 @@ queue_and_out:
> >  
> >  		if (eaten > 0)
> >  			__kfree_skb(skb);
> > -		else if (!sock_flag(sk, SOCK_DEAD))
> > +		else if (!sock_flag(sk, SOCK_DEAD)) {
> > +			/* Get in sync with tcp_poll function. */
> > +			smp_mb();
> >  			sk->sk_data_ready(sk, 0);
> > +		}
> >  		return;
> >
> 
> Oh well... if smp_mb() is needed, I believe it should be done
> right before "if (waitqueue_active(sk->sk_sleep) ... "
> 
>  	read_lock(&sk->sk_callback_lock);
> +	smp_mb();
>  	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>  		wake_up_interruptible(sk->sk_sleep)
> 
> It would match other parts of the kernel (see fs/splice.c, fs/aio.c, ...)
> 
> The strange thing is that read_lock() on x86 is a full memory barrier, as it uses
> "lock subl $0x1,(%eax)".
> 
> Maybe we could define an smp_mb_after_read_lock() (a compiler barrier() on x86).
> 

The first version of the patch was actually in this layer, see
https://bugzilla.redhat.com/attachment.cgi?id=345886

I was advised this could be too invasive (it was in waitqueue_active actually),
so I moved the change to the TCP layer itself...

As far as I understand the problem, two barriers are needed to be sure
the memory will have correct data: one in the code path calling
select (tcp_poll), and one in the path updating the available-data status
(sock_def_readable). Am I missing something?


thanks,
jirka
Eric Dumazet June 23, 2009, 10:32 a.m. UTC | #3
Jiri Olsa wrote:
> Hi,
> 
> thanks for the answer, and sorry for my late reply;
> we needed the customer's permission to disclose the debug data.
> 

I see! Now it is me with little time, as I am traveling right now.

> 
> On Thu, Jun 18, 2009 at 04:06:42PM +0200, Eric Dumazet wrote:
>> Jiri Olsa wrote:
>>> Hi,
>>>
>>> in RHEL4 we can see a race in the TCP layer. We were not able to reproduce
>>> this on the upstream kernel, but since the issue occurs very rarely
>>> (roughly once every 8 days), we may simply not have been lucky.
>>>
>>> I'm afraid this might be a long email; I'll try to structure it nicely.. :)
>>>
>> Thanks for your mail and detailed analysis
>>
>>>
>>> RACE DESCRIPTION
>>> ================
>>>
>>> There's a nice PDF describing the issue (and a solution using locks) at
>>> https://bugzilla.redhat.com/attachment.cgi?id=345014
>> I could not reach this URL, unfortunately
>>
>> --> "You are not authorized to access bug #494404. "
> 
> please try it now; the bug should be accessible now
> 

Thanks, this doc is indeed nice :)

But adding a write_lock()/write_unlock() in tcp_poll() was overkill.
It had an smp_mb() implied because of the nesting of locks.

>>>
>>> The race fires when the following code paths meet, and the tp->rcv_nxt and
>>> __add_wait_queue updates stay in the CPU caches.
>>>
>>> CPU1                         CPU2
>>>
>>>
>>> sys_select                   receive packet
>>>   ...                        ...
>>>   __add_wait_queue           update tp->rcv_nxt
>>>   ...                        ...
>>>   tp->rcv_nxt check          sock_def_readable
>>>   ...                        {
>>>   schedule                      ...
>>>                                 if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>>>                                         wake_up_interruptible(sk->sk_sleep)
>>>                                 ...
>>>                              }
>>>
>>> If there were no caching, the code would work correctly, since the wait_queue
>>> and rcv_nxt accesses are performed in the opposite order on the two CPUs.
>>>
>>> Meaning that once tp->rcv_nxt is updated by CPU2, CPU1 has either already
>>> passed the tp->rcv_nxt check and sleeps, or it sees the new value of
>>> tp->rcv_nxt and returns with the new data mask.
>>> In both cases the process (CPU1) has been added to the wait queue, so the
>>> waitqueue_active call (CPU2) cannot miss it and will wake up CPU1.
>>>
>>> The bad case is when the __add_wait_queue change done by CPU1 stays in its
>>> cache, and so does the tp->rcv_nxt update on CPU2's side.  CPU1 will then
>>> end up calling schedule and sleeping forever if no more data arrive on the
>>> socket.
>>>
>>> Adding smp_mb() calls before the sock_def_readable call and after __add_wait_queue
>>> should prevent the above bad scenario.
>>>
>>> The upstream patch is attached. It seems to prevent the issue.
>>>
>>>
>>>
>>> CPU BUGS
>>> ========
>>>
>>> The customer has been able to reproduce this problem only on one CPU model:
>>> Xeon E5345*2. They did not reproduce it on a Xeon MV, for example.
>> Is there an easy way to reproduce the problem ?
> 
> there's a reproducer attached to the bug
> 
> https://enterprise.redhat.com/issue-tracker/?module=download&fid=201560&key=f6f87caf6ac2dc1eb1173257c8a5ef78
> 
> it is basically a client/server program.
> They pass messages to each other; when a message is sent,
> both of them expect a message on the input before sending another message.
> 
> Very rarely the code hits the case where the process that called select
> is not woken up by incoming data, probably because of the memory cache
> incoherency. See the backtrace in
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=494404#c1
> 
> 
>>> That CPU model happens to have two possible errata that might cause the issue:
>>> (see errata http://www.intel.com/Assets/PDF/specupdate/315338.pdf)
>>>
>>> AJ39 and AJ18. The first one can be worked around by a BIOS upgrade;
>>> the other one has the following notes:
>> AJ18 only matters for unaligned accesses; the TCP code does not do that.
>>
>>>       Software should ensure at least one of the following is true when
>>>       modifying shared data by multiple agents:
>>>              • The shared data is aligned
>>>              • Proper semaphores or barriers are used in order to
>>>                 prevent concurrent data accesses.
>>>
>>>
>>>
>>> RFC
>>> ===
>>>
>>> I'm aware that not having reproduced this issue on upstream lowers the odds
>>> of having this checked in. However, AFAICS the issue is present. I'd appreciate
>>> any comments/ideas.
>>>
>>>
>>> thanks,
>>> jirka
>>>
>>>
>>> Signed-off-by: Jiri Olsa <jolsa@redhat.com>
>>>
>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>>> index 17b89c5..f5d9dbf 100644
>>> --- a/net/ipv4/tcp.c
>>> +++ b/net/ipv4/tcp.c
>>> @@ -340,6 +340,11 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
>>>  	struct tcp_sock *tp = tcp_sk(sk);
>>>  
>>>  	poll_wait(file, sk->sk_sleep, wait);
>> poll_wait() calls add_wait_queue() which contains a 
>> spin_lock_irqsave()/spin_unlock_irqrestore() pair
>>
>> Documentation/memory-barriers.txt states in line 1123 :
>>
>> 	Memory operations issued after the LOCK will be completed after the LOCK
>> 	operation has completed.
>>
>> and line 1131 states :
>>
>> 	Memory operations issued before the UNLOCK will be completed before the
>> 	UNLOCK operation has completed.
>>
>> So yes, there is no full smp_mb() in poll_wait()
>>
>>> +
>>> +	/* Get in sync with tcp_data_queue, tcp_urg
>>> +	   and tcp_rcv_established function. */
>>> +	smp_mb();
>> If this barrier is really necessary, I guess it should be done in poll_wait() itself.
>>
>> Documentation/memory-barriers.txt misses some information about poll_wait()
>>
>>
>>
>>
>>> +
>>>  	if (sk->sk_state == TCP_LISTEN)
>>>  		return inet_csk_listen_poll(sk);
>>>  
>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>> index 2bdb0da..0606e5e 100644
>>> --- a/net/ipv4/tcp_input.c
>>> +++ b/net/ipv4/tcp_input.c
>>> @@ -4362,8 +4362,11 @@ queue_and_out:
>>>  
>>>  		if (eaten > 0)
>>>  			__kfree_skb(skb);
>>> -		else if (!sock_flag(sk, SOCK_DEAD))
>>> +		else if (!sock_flag(sk, SOCK_DEAD)) {
>>> +			/* Get in sync with tcp_poll function. */
>>> +			smp_mb();
>>>  			sk->sk_data_ready(sk, 0);
>>> +		}
>>>  		return;
>>>
>> Oh well... if smp_mb() is needed, I believe it should be done
>> right before "if (waitqueue_active(sk->sk_sleep) ... "
>>
>>  	read_lock(&sk->sk_callback_lock);
>> +	smp_mb();
>>  	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>>  		wake_up_interruptible(sk->sk_sleep)
>>
>> It would match other parts of the kernel (see fs/splice.c, fs/aio.c, ...)
>>
>> The strange thing is that read_lock() on x86 is a full memory barrier, as it uses
>> "lock subl $0x1,(%eax)".
>>
>> Maybe we could define an smp_mb_after_read_lock() (a compiler barrier() on x86).
>>
> 
> The first version of the patch was actually in this layer, see
> https://bugzilla.redhat.com/attachment.cgi?id=345886
> 
> I was advised this could be too invasive (it was in waitqueue_active actually),
> so I moved the change to the TCP layer itself...
> 
> As far as I understand the problem, two barriers are needed to be sure
> the memory will have correct data: one in the code path calling
> select (tcp_poll), and one in the path updating the available-data status
> (sock_def_readable). Am I missing something?
> 

Hmm, I am not saying your patch doesn't fix the problem; I am saying it
is a partial fix of a general problem. We might have the same problem(s) in other
parts of the network stack. This is a very serious issue.

Point 1 :

You added an smp_mb() call in tcp_poll(). This one looks fine to solve
the problem for TCP sockets. What about other protocols? Do we have the
same problem there?

Point 2 :

You added several smp_mb() calls in the TCP input path. In my first
reply, I said it was probably better to add smp_mb() in a single
place, right before "if (waitqueue_active(sk->sk_sleep)) ...",
but in all paths (input path & output path).

Point 3 :

The optimization we could do: define an smp_mb_after_read_lock()
that could be a nop on x86

	read_lock(&sk->sk_callback_lock); // "lock subl $0x1,(%eax)" on x86
	smp_mb_after_read_lock(); /* compiler barrier() on x86 */
	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
 		wake_up_interruptible(sk->sk_sleep);

Am I missing something ?

;)

Oleg Nesterov June 23, 2009, 7:44 p.m. UTC | #4
(Off-topic)

On 06/23, Eric Dumazet wrote:
>
> The optimization we could do: define an smp_mb_after_read_lock()
> that could be a nop on x86

Imho it would be nice to have smp_mb__before/after_lock.

insert_work(), __pte_alloc(), try_to_wake_up() could use it, and probably
other places as well.

Oleg.
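
For illustration only, such helpers might look like the sketch below (an
assumption: neither smp_mb__before_lock() nor smp_mb__after_lock() exists in
the kernel at this point; the names are simply taken from the suggestion
above, and the generic fallback is a full barrier):

	/* Generic fallbacks.  Architectures whose lock operations already
	 * imply a full barrier (e.g. x86, where lock acquisition is a locked
	 * read-modify-write) could override these with a plain compiler
	 * barrier().  They would be used around the lock in wakeup paths
	 * such as Eric's Point 3 above. */
	#ifndef smp_mb__before_lock
	#define smp_mb__before_lock()	smp_mb()
	#endif
	#ifndef smp_mb__after_lock
	#define smp_mb__after_lock()	smp_mb()
	#endif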


Patch

================

There's a nice PDF describing the issue (and a solution using locks) at
https://bugzilla.redhat.com/attachment.cgi?id=345014


The race fires when the following code paths meet, and the tp->rcv_nxt and
__add_wait_queue updates stay in the CPU caches.

CPU1                         CPU2


sys_select                   receive packet
  ...                        ...
  __add_wait_queue           update tp->rcv_nxt
  ...                        ...
  tp->rcv_nxt check          sock_def_readable
  ...                        {
  schedule                      ...
                                if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                                        wake_up_interruptible(sk->sk_sleep)
                                ...
                             }

If there were no caching, the code would work correctly, since the wait_queue
and rcv_nxt accesses are performed in the opposite order on the two CPUs.

Meaning that once tp->rcv_nxt is updated by CPU2, CPU1 has either already
passed the tp->rcv_nxt check and sleeps, or it sees the new value of
tp->rcv_nxt and returns with the new data mask.
In both cases the process (CPU1) has been added to the wait queue, so the
waitqueue_active call (CPU2) cannot miss it and will wake up CPU1.

The bad case is when the __add_wait_queue change done by CPU1 stays in its
cache, and so does the tp->rcv_nxt update on CPU2's side.  CPU1 will then
end up calling schedule and sleeping forever if no more data arrive on the
socket.

Adding smp_mb() calls before the sock_def_readable call and after __add_wait_queue
should prevent the above bad scenario.

The upstream patch is attached. It seems to prevent the issue.



CPU BUGS
========

The customer has been able to reproduce this problem only on one CPU model:
Xeon E5345*2. They did not reproduce it on a Xeon MV, for example.

That CPU model happens to have two possible errata that might cause the issue:
(see errata http://www.intel.com/Assets/PDF/specupdate/315338.pdf)

AJ39 and AJ18. The first one can be worked around by a BIOS upgrade;
the other one has the following notes:

      Software should ensure at least one of the following is true when
      modifying shared data by multiple agents:
             • The shared data is aligned
             • Proper semaphores or barriers are used in order to
                prevent concurrent data accesses.



RFC
===

I'm aware that not having reproduced this issue on upstream lowers the odds
of having this checked in. However, AFAICS the issue is present. I'd appreciate
any comments/ideas.


thanks,
jirka


Signed-off-by: Jiri Olsa <jolsa@redhat.com>

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 17b89c5..f5d9dbf 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -340,6 +340,11 @@  unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	poll_wait(file, sk->sk_sleep, wait);
+
+	/* Get in sync with tcp_data_queue, tcp_urg
+	   and tcp_rcv_established function. */
+	smp_mb();
+
 	if (sk->sk_state == TCP_LISTEN)
 		return inet_csk_listen_poll(sk);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2bdb0da..0606e5e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4362,8 +4362,11 @@  queue_and_out:
 
 		if (eaten > 0)
 			__kfree_skb(skb);
-		else if (!sock_flag(sk, SOCK_DEAD))
+		else if (!sock_flag(sk, SOCK_DEAD)) {
+			/* Get in sync with tcp_poll function. */
+			smp_mb();
 			sk->sk_data_ready(sk, 0);
+		}
 		return;
 	}
 
@@ -4967,8 +4970,11 @@  static void tcp_urg(struct sock *sk, struct sk_buff *skb, struct tcphdr *th)
 			if (skb_copy_bits(skb, ptr, &tmp, 1))
 				BUG();
 			tp->urg_data = TCP_URG_VALID | tmp;
-			if (!sock_flag(sk, SOCK_DEAD))
+			if (!sock_flag(sk, SOCK_DEAD)) {
+				/* Get in sync with tcp_poll function. */
+				smp_mb();
 				sk->sk_data_ready(sk, 0);
+			}
 		}
 	}
 }
@@ -5317,8 +5323,11 @@  no_ack:
 #endif
 			if (eaten)
 				__kfree_skb(skb);
-			else
+			else {
+				/* Get in sync with tcp_poll function. */
+				smp_mb();
 				sk->sk_data_ready(sk, 0);
+			}
 			return 0;
 		}
 	}