wrong smp_mb__after_atomic() in tcp_check_space() ?

Message ID: 1485194671.16328.195.camel@edumazet-glaptop3.roam.corp.google.com
State: RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Jan. 23, 2017, 6:04 p.m. UTC
On Mon, 2017-01-23 at 11:56 -0500, Jason Baron wrote:
> On 01/23/2017 09:30 AM, Oleg Nesterov wrote:
> > Hello,
> >
> > smp_mb__after_atomic() looks wrong and misleading, sock_reset_flag() does the
> > non-atomic __clear_bit() and thus it can not guarantee test_bit(SOCK_NOSPACE)
> > (non-atomic too) won't be reordered.
> >
> 
> Indeed. Here's a bit of discussion on it:
> http://marc.info/?l=linux-netdev&m=146662325920596&w=2
> 
> > It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths"
> > and the patch looks correct in that we need the barriers in tcp_check_space()
> > and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ?
> >
> 
> Yes, I think it should be upgraded to an smp_mb() there. If you agree 
> with this analysis, I will send a patch to upgrade it. Note, I did not 
> actually run into this race in practice.

SOCK_QUEUE_SHRUNK is used locally in TCP; it is not used by tcp_poll().

(Otherwise it would be using atomic set/clear operations)

I do not see an obvious reason why we have this smp_mb__after_atomic() in
tcp_check_space().


But looking at this code, it seems we lack one barrier if sk_sndbuf is
ever increased. Fortunately this almost never happens during a TCP session
lifetime...

Comments

Jason Baron Jan. 23, 2017, 6:45 p.m. UTC | #1
On 01/23/2017 01:04 PM, Eric Dumazet wrote:
> On Mon, 2017-01-23 at 11:56 -0500, Jason Baron wrote:
>> On 01/23/2017 09:30 AM, Oleg Nesterov wrote:
>>> Hello,
>>>
>>> smp_mb__after_atomic() looks wrong and misleading, sock_reset_flag() does the
>>> non-atomic __clear_bit() and thus it can not guarantee test_bit(SOCK_NOSPACE)
>>> (non-atomic too) won't be reordered.
>>>
>>
>> Indeed. Here's a bit of discussion on it:
>> http://marc.info/?l=linux-netdev&m=146662325920596&w=2
>>
>>> It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths"
>>> and the patch looks correct in that we need the barriers in tcp_check_space()
>>> and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ?
>>>
>>
>> Yes, I think it should be upgraded to an smp_mb() there. If you agree
>> with this analysis, I will send a patch to upgrade it. Note, I did not
>> actually run into this race in practice.
>
> SOCK_QUEUE_SHRUNK is used locally in TCP, it is not used by tcp_poll().
>
> (Otherwise it would be using atomic set/clear operations)
>
> I do not see obvious reason why we have this smp_mb__after_atomic() in
> tcp_check_space().
>
>

The idea of the smp_mb__after_atomic() in tcp_check_space() was to
ensure that the read of SOCK_NOSPACE there does not happen before any of
the writes that make sk_stream_is_writeable() true. Otherwise, we could
miss doing the wakeup from tcp_check_space(). One could argue that a
subsequent call to tcp_check_space() will likely see the SOCK_NOSPACE bit
set, but in theory we could have a small send buffer, or a lot of data
could be ack'd in one go.

What I missed in the original patch was that sock_reset_flag() isn't an
atomic operation and thus the smp_mb__after_atomic() is wrong.


> But looking at this code, it seems we lack one barrier if sk_sndbuf is
> ever increased. Fortunately this almost never happen during TCP session
> lifetime...
>

But won't the wakeup from sk->sk_write_space(sk) imply an smp_wmb(), as per
the comment in __wake_up()?

Thanks,

-Jason


> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index bfa165cc455ad0a9aea44964aa663dbe6085aebd..3692e9f4c852cebf8c4d46c141f112e75e4ae66d 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -331,8 +331,13 @@ static void tcp_sndbuf_expand(struct sock *sk)
>  	sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2;
>  	sndmem *= nr_segs * per_mss;
>
> -	if (sk->sk_sndbuf < sndmem)
> +	if (sk->sk_sndbuf < sndmem) {
>  		sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
> +		/* Paired with second sk_stream_is_writeable(sk)
> +		 * test from tcp_poll()
> +		 */
> +		smp_wmb();
> +	}
>  }
>
>  /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
>
>
Oleg Nesterov Jan. 24, 2017, 9:18 a.m. UTC | #2
On 01/23, Eric Dumazet wrote:
>
> On Mon, 2017-01-23 at 11:56 -0500, Jason Baron wrote:
> > On 01/23/2017 09:30 AM, Oleg Nesterov wrote:
> > > Hello,
> > >
> > > smp_mb__after_atomic() looks wrong and misleading, sock_reset_flag() does the
> > > non-atomic __clear_bit() and thus it can not guarantee test_bit(SOCK_NOSPACE)
> > > (non-atomic too) won't be reordered.
> > >
> >
> > Indeed. Here's a bit of discussion on it:
> > http://marc.info/?l=linux-netdev&m=146662325920596&w=2
> >
> > > It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths"
> > > and the patch looks correct in that we need the barriers in tcp_check_space()
> > > and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ?
> > >
> >
> > Yes, I think it should be upgraded to an smp_mb() there. If you agree 
> > with this analysis, I will send a patch to upgrade it. Note, I did not 
> > actually run into this race in practice.
>
> SOCK_QUEUE_SHRUNK is used locally in TCP, it is not used by tcp_poll().
>
> (Otherwise it would be using atomic set/clear operations)
>
> I do not see obvious reason why we have this smp_mb__after_atomic() in
> tcp_check_space().

It is not that we need to serialize __clear_bit(SOCK_QUEUE_SHRUNK) and
test_bit(SOCK_NOSPACE); we do not care if they are reordered.

But we need to ensure that either tcp_poll() sees sk_stream_is_writeable()
or tcp_check_space() sees SOCK_NOSPACE and calls tcp_new_space().
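
To make the "either ... or" requirement concrete, here is a minimal userspace
sketch of the pattern using C11 atomics. It is not kernel code: free_space,
nospace and both function names are hypothetical stand-ins, and the seq_cst
fence only plays the role that smp_mb() would play in the kernel. Each side
stores its own flag, issues a full barrier, then loads the other side's flag;
with a full barrier on both sides, at least one of the two loads should
observe the other side's store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not kernel symbols. */
static atomic_int free_space;  /* models what sk_stream_is_writeable() reads */
static atomic_bool nospace;    /* models the SOCK_NOSPACE bit */

/* Poll side: publish "I am waiting for space", then re-check for space. */
static bool poll_side_sees_space(void)
{
    atomic_store_explicit(&nospace, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* role of a full barrier */
    return atomic_load_explicit(&free_space, memory_order_relaxed) > 0;
}

/* ACK side: make space available, then check whether anyone is waiting. */
static bool ack_side_sees_nospace(int freed)
{
    assert(freed > 0);
    atomic_store_explicit(&free_space, freed, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* role of a full barrier */
    return atomic_load_explicit(&nospace, memory_order_relaxed);
}
```

If either fence is dropped or weakened to something that does not order a
prior plain store against a later load, both functions may return false when
run concurrently, which corresponds to the missed wakeup.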


> But looking at this code, it seems we lack one barrier if sk_sndbuf is
> ever increased. Fortunately this almost never happen during TCP session
> lifetime...
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index bfa165cc455ad0a9aea44964aa663dbe6085aebd..3692e9f4c852cebf8c4d46c141f112e75e4ae66d 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -331,8 +331,13 @@ static void tcp_sndbuf_expand(struct sock *sk)
>  	sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2;
>  	sndmem *= nr_segs * per_mss;
>
> -	if (sk->sk_sndbuf < sndmem)
> +	if (sk->sk_sndbuf < sndmem) {
>  		sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
> +		/* Paired with second sk_stream_is_writeable(sk)
> +		 * test from tcp_poll()
> +		 */
> +		smp_wmb();
> +	}
>  }

I do not think we need the additional barrier here. If we are going to call
sk->sk_write_space(), we rely on wq_has_sleeper(), whose barrier also pairs
with the 2nd check in tcp_poll().
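
As a rough illustration of that pairing, again in userspace C11 rather than
kernel code: sndbuf, queued, sleepers and all function names below are
hypothetical, and has_sleeper() only mimics the "full barrier, then look at
the wait queue" shape. The writer updates the limit and then asks whether
there is a sleeper; the poller registers itself first and then re-checks
writeability.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not kernel symbols. */
static atomic_int sndbuf = 4;  /* models sk->sk_sndbuf */
static atomic_int queued = 4;  /* models the bytes currently queued */
static atomic_int sleepers;    /* models entries on the socket wait queue */

/* Mimics the shape of a "barrier, then check for waiters" helper. */
static bool has_sleeper(void)
{
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&sleepers, memory_order_relaxed) > 0;
}

/* Writer: raise the limit, then decide whether a wakeup is needed. */
static bool expand_and_check_sleeper(int new_sndbuf)
{
    assert(new_sndbuf > 0);
    atomic_store_explicit(&sndbuf, new_sndbuf, memory_order_relaxed);
    return has_sleeper();
}

/* Poller: register as a sleeper, full barrier, then re-check writeability. */
static bool poll_register_and_recheck(void)
{
    atomic_fetch_add_explicit(&sleepers, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&queued, memory_order_relaxed) <
           atomic_load_explicit(&sndbuf, memory_order_relaxed);
}
```

Under these assumptions the store to sndbuf is already ordered before the
sleeper check by the barrier inside has_sleeper(), so an extra write barrier
right after the store adds nothing for this pairing.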

Oleg.

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bfa165cc455ad0a9aea44964aa663dbe6085aebd..3692e9f4c852cebf8c4d46c141f112e75e4ae66d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -331,8 +331,13 @@  static void tcp_sndbuf_expand(struct sock *sk)
 	sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2;
 	sndmem *= nr_segs * per_mss;
 
-	if (sk->sk_sndbuf < sndmem)
+	if (sk->sk_sndbuf < sndmem) {
 		sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
+		/* Paired with second sk_stream_is_writeable(sk)
+		 * test from tcp_poll()
+		 */
+		smp_wmb();
+	}
 }
 
 /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)