diff mbox

tcp: set SOCK_NOSPACE under memory presure

Message ID 20150420200513.11D8E202D@prod-mail-relay06.akamai.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Jason Baron April 20, 2015, 8:05 p.m. UTC
Under tcp memory pressure, calling epoll_wait() in edge triggered
mode after -EAGAIN, can result in an indefinite hang in epoll_wait(),
even when there is suffcient memory available to continue making
progress. The problem is that __sk_mem_schedule() can return 0,
under memory pressure without having set the SOCK_NOSPACE flag. Thus,
even though all the outstanding packets have been acked, we never
get the EPOLLOUT that we are expecting from epoll_wait().

This issue is currently limited to epoll when used in edge trigger
mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE.
This is sufficient for poll()/select() and epoll() in level trigger
mode. However, in edge trigger mode, epoll() is relying on the write
path to set SOCK_NOSPACE. So I view this patch as bringing us into
sync with poll()/select() and epoll() level trigger behavior.

I can reproduce this issue, using SO_SNDBUF, since
__sk_mem_schedule() will return 0, or failure more readily with
SO_SNDBUF:

1) create socket and set SO_SNDBUF to N
2) add socket as edge trigger
3) write to socket and block in epoll on -EAGAIN
4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem

The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory()
when the socket is non-blocking.

Note that we could still hang if sk->sk_wmem_queue is 0, when we get
the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we
are waiting for and event that will never happen. I believe
that this case is hard to hit (and did not hit in my testing),
in that over the 'soft' limit, we continue to guarantee a minimum
write buffer size. Perhaps, we could return -ENOSPC in this
case...note that this case is not specific to epoll ET, but
rather would affect all blocking and non-blocking sockets as well,
and thus I think its ok to treat it as a separate case.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 net/core/stream.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

David Miller April 21, 2015, 9:33 p.m. UTC | #1
From: Jason Baron <jbaron@akamai.com>
Date: Mon, 20 Apr 2015 20:05:13 +0000 (GMT)

> Under tcp memory pressure, calling epoll_wait() in edge triggered
> mode after -EAGAIN, can result in an indefinite hang in epoll_wait(),
> even when there is suffcient memory available to continue making
> progress. The problem is that __sk_mem_schedule() can return 0,
> under memory pressure without having set the SOCK_NOSPACE flag. Thus,
> even though all the outstanding packets have been acked, we never
> get the EPOLLOUT that we are expecting from epoll_wait().
> 
> This issue is currently limited to epoll when used in edge trigger
> mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE.
> This is sufficient for poll()/select() and epoll() in level trigger
> mode. However, in edge trigger mode, epoll() is relying on the write
> path to set SOCK_NOSPACE. So I view this patch as bringing us into
> sync with poll()/select() and epoll() level trigger behavior.

Can you explain exactly how epoll in edge trigger mode is
depending upon SOCK_NOSPACE being set in this way?  I tried
to read the epoll code and it just seems to call ->poll()
in the normal way when returning event state.

Also, there are exactly two call sites of sk_stream_wait_space()
for TCP, and they both look like this:

====================
wait_for_sndbuf:
		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
wait_for_memory:
		tcp_push(sk, flags & ~MSG_MORE, mss_now,
			 TCP_NAGLE_PUSH, size_goal);

		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
			goto do_error;
====================

Definitely, the person who wrote this code intended SOCK_NOSPACE to be
set only when we are waiting for sndbuf space rather than just memory.

At a minimum, I need a more detailed commit log message for this,
showing the exact code paths in epoll() that have this requirement and
thus create the looping condition.  Because with a casual scan of the
epoll code I could not figure it out.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Baron April 22, 2015, 2:25 p.m. UTC | #2
On 04/21/2015 05:33 PM, David Miller wrote:
> From: Jason Baron <jbaron@akamai.com>
> Date: Mon, 20 Apr 2015 20:05:13 +0000 (GMT)
>
>> Under tcp memory pressure, calling epoll_wait() in edge triggered
>> mode after -EAGAIN, can result in an indefinite hang in epoll_wait(),
>> even when there is suffcient memory available to continue making
>> progress. The problem is that __sk_mem_schedule() can return 0,
>> under memory pressure without having set the SOCK_NOSPACE flag. Thus,
>> even though all the outstanding packets have been acked, we never
>> get the EPOLLOUT that we are expecting from epoll_wait().
>>
>> This issue is currently limited to epoll when used in edge trigger
>> mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE.
>> This is sufficient for poll()/select() and epoll() in level trigger
>> mode. However, in edge trigger mode, epoll() is relying on the write
>> path to set SOCK_NOSPACE. So I view this patch as bringing us into
>> sync with poll()/select() and epoll() level trigger behavior.
> Can you explain exactly how epoll in edge trigger mode is
> depending upon SOCK_NOSPACE being set in this way?  I tried
> to read the epoll code and it just seems to call ->poll()
> in the normal way when returning event state.

In edge trigger mode, when we receive a wakeup event we
call ->poll() in the normal way, *but* we do not leave the event
as still pending. (Specifically, in the epoll() code we are not
re-adding it (fs/eventpoll.c:ep_send_events_proc())). This is
because we are only interested in the 'edge' or the event
going high. In level trigger mode, we do leave the event
pending if its 'high', such that it will re-trigger again for us on
the next epoll_wait().

EPOLL(7) is clear that in edge-trigger mode we can only do
epoll_wait() after read/write return -EAGAIN. Thus, in the
case of the socket write, we are relying on the fact that
tcp_sendmsg()/network layer is going to issue a wakeup
for us at some point in the future when we get -EAGAIN.

This all works fine in the case you pointed out where we
have exceeded the sk->sndbuf and set SOCK_NOSPACE.
However, when we return -EAGAIN from the write path
b/c we are over the tcp memory limits and not b/c we are
over the sndbuf, we are never going to get another wakeup
(since SOCK_NOSPACE is not set in this case). Level trigger
avoids this since the subsequent epoll_wait() is going to
re-try the ->poll() (and set SOCK_NOSPACE if it fails).

Now, in the memory failure case, we are not really
waiting for the buffer to empty, but rather for there to be
memory more generally available. So it could be argued
that we need to implement a wakeup here based on memory
being available as opposed to the write queue emptying.
That is one potential option here.

I think the other one is the route I was proposing, which was
to treat the out of memory case, in the same way as the
sk->sndbuf queue full case, as select(), poll() and epoll() level
trigger are currently doing. And potentially add maybe an
-ENOSPC return if the write queue really is empty...I thought
that approach made sense b/c even under memory pressure
(over sk_prot_mem_limits(sk, 1), but not over
sk_prot_mem_limits(sk, 2)) we continue to guarantee a
minimum sndbuf size (implying we can keep making progress).
That said, there is a case, over sk_prot_mem_limits(sk, 2),
where we do not guarantee the minimum buffer size, but I
think in practice that is very hard to hit (since we are reducing
usage over sk_prot_mem_limits(sk, 1) aggressively).

There is also the case where we actually are out of memory
on the system, ie kmalloc() etc. are failing, in which case
we could maybe return -ENOSPC, or else we would potentially
need a larger change to wait on memory being available
as opposed to the buffer emptying.

Thanks,

-Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Baron April 30, 2015, 2:34 p.m. UTC | #3
On 04/22/2015 10:25 AM, Jason Baron wrote:
> On 04/21/2015 05:33 PM, David Miller wrote:
>> From: Jason Baron <jbaron@akamai.com>
>> Date: Mon, 20 Apr 2015 20:05:13 +0000 (GMT)
>>
>>> Under tcp memory pressure, calling epoll_wait() in edge triggered
>>> mode after -EAGAIN, can result in an indefinite hang in epoll_wait(),
>>> even when there is suffcient memory available to continue making
>>> progress. The problem is that __sk_mem_schedule() can return 0,
>>> under memory pressure without having set the SOCK_NOSPACE flag. Thus,
>>> even though all the outstanding packets have been acked, we never
>>> get the EPOLLOUT that we are expecting from epoll_wait().
>>>
>>> This issue is currently limited to epoll when used in edge trigger
>>> mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE.
>>> This is sufficient for poll()/select() and epoll() in level trigger
>>> mode. However, in edge trigger mode, epoll() is relying on the write
>>> path to set SOCK_NOSPACE. So I view this patch as bringing us into
>>> sync with poll()/select() and epoll() level trigger behavior.
>> Can you explain exactly how epoll in edge trigger mode is
>> depending upon SOCK_NOSPACE being set in this way?  I tried
>> to read the epoll code and it just seems to call ->poll()
>> in the normal way when returning event state.
> In edge trigger mode, when we receive a wakeup event we
> call ->poll() in the normal way, *but* we do not leave the event
> as still pending. (Specifically, in the epoll() code we are not
> re-adding it (fs/eventpoll.c:ep_send_events_proc())). This is
> because we are only interested in the 'edge' or the event
> going high. In level trigger mode, we do leave the event
> pending if its 'high', such that it will re-trigger again for us on
> the next epoll_wait().
>
> EPOLL(7) is clear that in edge-trigger mode we can only do
> epoll_wait() after read/write return -EAGAIN. Thus, in the
> case of the socket write, we are relying on the fact that
> tcp_sendmsg()/network layer is going to issue a wakeup
> for us at some point in the future when we get -EAGAIN.
>
> This all works fine in the case you pointed out where we
> have exceeded the sk->sndbuf and set SOCK_NOSPACE.
> However, when we return -EAGAIN from the write path
> b/c we are over the tcp memory limits and not b/c we are
> over the sndbuf, we are never going to get another wakeup
> (since SOCK_NOSPACE is not set in this case). Level trigger
> avoids this since the subsequent epoll_wait() is going to
> re-try the ->poll() (and set SOCK_NOSPACE if it fails).
>
> Now, in the memory failure case, we are not really
> waiting for the buffer to empty, but rather for there to be
> memory more generally available. So it could be argued
> that we need to implement a wakeup here based on memory
> being available as opposed to the write queue emptying.
> That is one potential option here.
>
> I think the other one is the route I was proposing, which was
> to treat the out of memory case, in the same way as the
> sk->sndbuf queue full case, as select(), poll() and epoll() level
> trigger are currently doing. And potentially add maybe an
> -ENOSPC return if the write queue really is empty...I thought
> that approach made sense b/c even under memory pressure
> (over sk_prot_mem_limits(sk, 1), but not over
> sk_prot_mem_limits(sk, 2)) we continue to guarantee a
> minimum sndbuf size (implying we can keep making progress).
> That said, there is a case, over sk_prot_mem_limits(sk, 2),
> where we do not guarantee the minimum buffer size, but I
> think in practice that is very hard to hit (since we are reducing
> usage over sk_prot_mem_limits(sk, 1) aggressively).
>
> There is also the case where we actually are out of memory
> on the system, ie kmalloc() etc. are failing, in which case
> we could maybe return -ENOSPC, or else we would potentially
> need a larger change to wait on memory being available
> as opposed to the buffer emptying.
>
> Thanks,
>
> -Jason

Hi,

Just curious if anybody had any further reaction on this issue.
I think making the epoll edge trigger case, as least match what
we are seeing for poll()/select()/epoll() level trigger seems
reasonable here.

Thanks,

-Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 30, 2015, 3:44 p.m. UTC | #4
From: Jason Baron <jbaron@akamai.com>
Date: Thu, 30 Apr 2015 10:34:17 -0400

> Just curious if anybody had any further reaction on this issue.
> I think making the epoll edge trigger case, as least match what
> we are seeing for poll()/select()/epoll() level trigger seems
> reasonable here.

It's in my inbox and looking into this is on my TODO list.

I simply lack the time right now and you will need to be
patient.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller May 4, 2015, 11:12 p.m. UTC | #5
From: Jason Baron <jbaron@akamai.com>
Date: Thu, 30 Apr 2015 10:34:17 -0400

> Just curious if anybody had any further reaction on this issue.
> I think making the epoll edge trigger case, as least match what
> we are seeing for poll()/select()/epoll() level trigger seems
> reasonable here.

Ok Jason I think I understand the situation now, thanks.

Can you add a condensed version of this nice detailed
description you gave of all the relevant code paths and
mechanisms to the commit message and resubmit?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/stream.c b/net/core/stream.c
index 301c05f..d70f77a 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -119,6 +119,7 @@  int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
 	int err = 0;
 	long vm_wait = 0;
 	long current_timeo = *timeo_p;
+	bool noblock = (*timeo_p ? false : true);
 	DEFINE_WAIT(wait);
 
 	if (sk_stream_memory_free(sk))
@@ -131,8 +132,11 @@  int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
 
 		if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
 			goto do_error;
-		if (!*timeo_p)
+		if (!*timeo_p) {
+			if (noblock)
+				set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 			goto do_nonblock;
+		}
 		if (signal_pending(current))
 			goto do_interrupted;
 		clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);