
[net-next,1/8] tipc: decrease connection flow control window

Message ID: 1399641209-26112-2-git-send-email-jon.maloy@ericsson.com
State: Changes Requested, archived
Delegated to: David Miller

Commit Message

Jon Maloy May 9, 2014, 1:13 p.m. UTC
Memory overhead when allocating big buffers for data transfer may
be quite significant. E.g., truesize of a 64 KB buffer turns out
to be 132 KB, 2 x the requested size.

This invalidates the "worst case" calculation we have been
using to determine the default socket receive buffer limit,
which is based on the assumption that 1024x64KB = 67MB buffers
may be queued up on a socket.

Since TIPC connections cannot survive hitting the buffer limit,
we have to compensate for this overhead.

We do that in this commit by halving the fixed connection flow
control window from 1024 (2*512) messages to 512 (2*256). Since
older version nodes send out acks at 512 message intervals,
compatibility with such nodes is guaranteed, although performance
may be non-optimal in such cases.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
---
 net/tipc/core.c   |    7 ++++---
 net/tipc/port.h   |    9 +++++----
 net/tipc/socket.c |    4 ++--
 3 files changed, 11 insertions(+), 9 deletions(-)

Comments

David Laight May 9, 2014, 1:30 p.m. UTC | #1
From: Jon Maloy
> Memory overhead when allocating big buffers for data transfer may
> be quite significant. E.g., truesize of a 64 KB buffer turns out
> to be 132 KB, 2 x the requested size.

If the data is in the skb allocated by the ethernet driver then
the cumulative truesize is probably very dependent on the driver.
In some cases the value could be much higher - especially if the
drivers are fixed to report a correct truesize.

> This invalidates the "worst case" calculation we have been
> using to determine the default socket receive buffer limit,
> which is based on the assumption that 1024x64KB = 67MB buffers
> may be queued up on a socket.
> 
> Since TIPC connections cannot survive hitting the buffer limit,
> we have to compensate for this overhead.

If the connection can't survive this, then you probably have to
accept the received data anyway.
However I'd have thought you should be able to treat it as equivalent
to a lost ethernet packet.

Sounds a bit like a badly designed protocol to me...

	David



Erik Hugne May 9, 2014, 2:27 p.m. UTC | #2
On Fri, May 09, 2014 at 01:30:43PM +0000, David Laight wrote:
> Sounds a bit like a badly designed protocol to me...

Well, I don't like this either but the current message-oriented flow control
will eventually be replaced with a byte-based one. Right now we're trying
to find a way to adapt a message-oriented flow control to per-socket buffer
constraints, without breaking protocol compatibility.

//E
David Laight May 9, 2014, 2:59 p.m. UTC | #3
From: erik.hugne@ericsson. 
> On Fri, May 09, 2014 at 01:30:43PM +0000, David Laight wrote:
> > Sounds a bit like a badly designed protocol to me...
> 
> Well, i don't like this either but the current message-oriented flowcontrol
> will eventually be replaced with a byte-based one. Right now we're trying
> to find a way to adapt a message-oriented flow control to per-socket buffer
> constraints, without breaking protocol compatibility.

I wasn't thinking of the byte/message flow control problems.
More of only requesting acks every 512 messages
(if my quick reading of the proposed change is right).
Especially without the receiver being able to stop the sender when
there is a buffer crisis (of any kind) at the receiver.

I guess the aim was to limit the 'back traffic' by making the
assumption that the receiving system would always be able to
process the data as fast as it arrived.
If the receiver is an embedded system this is unlikely to be true.

I suspect you want the receiver to offer a large window (bytes or messages),
but on a buffer crisis (per connection or global) be able to tell the
sender to stop sending.
A few messages received while the sender is stopped could still be saved
(effectively assuming they were sent before the 'stop' was received).

If running over an unreliable transport you'll want to retx the stop in
case it was lost.
If the 'stop' message is an ack, then don't ack the last received message.
Then you'll get timeout recovery if the 'start' message is lost.
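
A rough sketch of that scheme, with invented names (nothing below exists
in TIPC, it is illustration only):

    struct rx_advert {
            unsigned int window;    /* bytes (or messages) the receiver will accept */
            unsigned int stop;      /* non-zero: sender must pause sending          */
    };

    /* Buffer crisis: advertise stop and withhold the ack for the last
     * received message, so that if the later 'start' advert is lost the
     * sender's ordinary timeout recovery kicks in. */
    static void on_buffer_crisis(struct rx_advert *adv)
    {
            adv->window = 0;
            adv->stop = 1;
    }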

	David



Jon Maloy May 9, 2014, 4 p.m. UTC | #4
On 05/09/2014 10:59 AM, David Laight wrote:
> From: erik.hugne@ericsson. 
>> On Fri, May 09, 2014 at 01:30:43PM +0000, David Laight wrote:
>>> Sounds a bit like a badly designed protocol to me...
>> Well, i don't like this either but the current message-oriented flowcontrol
>> will eventually be replaced with a byte-based one. Right now we're trying
>> to find a way to adapt a message-oriented flow control to per-socket buffer
>> constraints, without breaking protocol compatibility.
> I wasn't thinking of the byte/message flow control problems.
> More of only requesting acks every 512 messages

Acks are not requested, they are sent out unsolicited at fixed
intervals, controlled by the reading capacity of the receiving
process.

The fundamental problem we have is that we acknowledge
messages, without considering their size, so a sender won't
stop (there is no active stop message from the receiver) until
he has sent 512 messages and not received an ack, which in
the worst case amounts to 512 x truesize(64KB) = 67 MB
outstanding data. This is what the receiving socket must
be able to absorb, in order to avoid broken connections.
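
Spelled out with the figures above (a back-of-the-envelope sketch, not
code from the tree):

    #include <stdio.h>

    int main(void)
    {
            unsigned long truesize = 132 * 1000UL;  /* ~132 KB per 64 KB message            */
            unsigned long window   = 512;           /* unacked messages before sender stops */
            unsigned long worst    = window * truesize;

            /* what one receiving socket must be able to absorb */
            printf("worst case: %lu bytes (~%lu MB)\n", worst, worst / 1000000);
            return 0;
    }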

If we acknowledge bytes instead of messages and possibly
introduce a flexible acknowledge interval, this whole problem will
go away, but until then (a few months from now, I think) we have to
make the best out of what we have.


> (if my quick reading of the proposed change is right).
> Especially without the receiver being able to stop the sender when
> there is a buffer crisis (of any kind) at the receiver.

To use any form of active feedback to the sender sounds scary
to me. Passive feedback, in the form of withheld acks, is the way
to go. That is the way it works now, and the way it should work,
it is just that the buffer limits we need to sustain this currently
are insanely high.
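
In sketch form (simplified and outside the tree; the constants are the
post-patch values, the struct and helpers are invented for illustration):

    #define TIPC_CONNACK_INTV 256   /* receiver acks after this many reads     */
    #define TIPC_FLOWCTRL_WIN 512   /* sender stalls at this many unacked msgs */

    struct conn_counters {
            unsigned int sent;          /* messages sent                        */
            unsigned int acked;         /* messages acknowledged by the peer    */
            unsigned int conn_unacked;  /* messages received but not yet acked  */
    };

    /* Passive stop: there is no explicit stop message, the sender simply
     * blocks once the window of unacknowledged messages is full. */
    static int tx_congested(const struct conn_counters *c)
    {
            return (c->sent - c->acked) >= TIPC_FLOWCTRL_WIN;
    }

    /* Acks are generated only as the receiving process reads, so a slow
     * reader withholds acks and eventually stalls the sender. */
    static int rx_ack_due(struct conn_counters *c)
    {
            return ++c->conn_unacked >= TIPC_CONNACK_INTV;
    }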

>
> I guess the aim was to limit the 'back traffic' by making the
> assumption that the receiving system would always be able to
> process the data as fast as it arrived.
> If the receiver is an embedded system this is unlikely to be true.

True.  Legacy TIPC comes from a world where memory constraints
weren't considered a big issue. That is different now, of course, and
has to be remedied.

>
> I suspect you want the receiver to offer a large window (bytes or messages),
> but on a buffer crisis (per connection or global) be able to tell the
> sender to stop sending.

I believe we rather should put our effort into a byte-oriented, TCP-like
flow control, to permanently solve this issue. As said, it is on our agenda.

> A few messages received while the sender is stopped could still be saved
> (effectively assuming they were sent before the 'stop' was received).
>
> If running over an unreliable transport you'll want to retx the stop in
> case it was lost.

TIPC connections always run over a reliable medium: its own link
layer. This means that the connection layer does not have any
form of sequence numbering or retransmission; it just assumes
that everything arrives in order and without losses, as long
as the link layer doesn't tell it otherwise.

This isn't really a problem, as long as the connection-level flow
control is able to work within reasonable constraints.  That is
what it is unable to do now, and what we are going to remedy once
our drive to redo the locking policy is finished.

Regards
///jon


> If the 'stop' message is an ack, then don't ack the last received message.
> Then you'll get timeout recovery if the 'start' message is lost.
>
> 	David
>
>
>

David Laight May 9, 2014, 4:35 p.m. UTC | #5
From: Jon Maloy
> On 05/09/2014 10:59 AM, David Laight wrote:
> > From: erik.hugne@ericsson.
> >> On Fri, May 09, 2014 at 01:30:43PM +0000, David Laight wrote:
> >>> Sounds a bit like a badly designed protocol to me...
> >> Well, i don't like this either but the current message-oriented flowcontrol
> >> will eventually be replaced with a byte-based one. Right now we're trying
> >> to find a way to adapt a message-oriented flow control to per-socket buffer
> >> constraints, without breaking protocol compatibility.
> > I wasn't thinking of the byte/message flow control problems.
> > More of only requesting acks every 512 messages
> 
> Acks are not requested, they are sent out unsolicited at fix
> intervals, controlled by the reading capacity of the receiving
> process.

Requested by the receiving side...

> The fundamental problem we have is that we acknowledge
> messages, without considering their size, so a sender won't
> stop (there is no active stop message from the receiver) until
> he has sent 512 messages and not received an ack, which in
> the worst case amounts to 512 x truesize(64KB) = 67 MB
> outstanding data. This is what the receiving socket must
> be able to absorb, in order to avoid broken connections.

Even 512 x 64kB is a lot.
I thought that a 64k skb would only have a truesize of 72k (with 4k pages).
I'm also guessing that an ethernet controller that can do txp rx offloading
might have to allocate 64k skb - so a single 1500 byte ethernet frame
is likely to be in an skb with a truesize of 72k.
I'm not sure if this affects your 128k truesize though.

(Pasted from below)
> TIPC connections always run over a reliable media, - its own link
> layer. This means that the connection layer does not have any
> form of sequence numbering or retransmission, -it just assumes
> that everything arrives in order and without losses, as long
> as the link layer doesn't tell it otherwise.

Hmmm... If multiple connections are multiplexed over a single
link you are completely stuffed!

	David



Jon Maloy May 9, 2014, 6:07 p.m. UTC | #6
> -----Original Message-----
> From: David Laight [mailto:David.Laight@ACULAB.COM]
> Sent: May-09-14 12:35 PM
> To: 'Jon Maloy'; Erik Hugne
> Cc: Jon Maloy; davem@davemloft.net; netdev@vger.kernel.org; Paul
> Gortmaker; ying.xue@windriver.com; tipc-discussion@lists.sourceforge.net
> Subject: RE: [PATCH net-next 1/8] tipc: decrease connection flow control
> window
> 
> From: Jon Maloy
> > On 05/09/2014 10:59 AM, David Laight wrote:
> > > From: erik.hugne@ericsson.
> > >> On Fri, May 09, 2014 at 01:30:43PM +0000, David Laight wrote:
> > >>> Sounds a bit like a badly designed protocol to me...
> > >> Well, i don't like this either but the current message-oriented
> > >> flowcontrol will eventually be replaced with a byte-based one.
> > >> Right now we're trying to find a way to adapt a message-oriented
> > >> flow control to per-socket buffer constraints, without breaking protocol
> compatibility.
> > > I wasn't thinking of the byte/message flow control problems.
> > > More of only requesting acks every 512 messages
> >
> > Acks are not requested, they are sent out unsolicited at fix
> > intervals, controlled by the reading capacity of the receiving
> > process.
> 
> Requested by the received side...
??? Sent by the receiving side.
> 
> > The fundamental problem we have is that we acknowledge messages,
> > without considering their size, so a sender won't stop (there is no
> > active stop message from the receiver) until he has sent 512 messages
> > and not received an ack, which in the worst case amounts to 512 x
> > truesize(64KB) = 67 MB outstanding data. This is what the receiving
> > socket must be able to absorb, in order to avoid broken connections.
> 
> Even 512 x 64kB is a lot.
> I thought that a 64k skb would only have a truesize of 72k (with 4k pages).
> I'm also guessing that an ethernet controller that can do txp rx offloading
> might have to allocate 64k skb - so a single 1500 byte ethernet frame is likely
> to be in an skb with a truesize of 72k.
> I'm not sure if this affects your 128k truesize though.

The 128KiB comes from locally allocated buffers. TIPC, being an IPC protocol,
does not use the loopback interface, and hence does not need fragmentation
for node local communication. This gives a latency improvement of ~3 times
as compared to TCP.

> 
> (Pasted from below)
> > TIPC connections always run over a reliable media, - its own link
> > layer. This means that the connection layer does not have any form of
> > sequence numbering or retransmission, -it just assumes that everything
> > arrives in order and without losses, as long as the link layer doesn't
> > tell it otherwise.
> 
> Hmmm... If multiple connections are multiplexed over a single link you are
> completely stuffed!

Not as much as you would think. First, we are not limited to one link, we
can have parallel links across different LANs/VLANs. As a matter of fact,
TIPC was running circles around TCP until a few years ago, but now it is
admittedly behind when it comes to total throughput/interface. Recent
tests using GRO and variable link transmission windows still show that we
can easily fill 2x10G links, which is no less than TCP can do given the physical
constraint, and it is anyway the maximum most systems have these days.
It remains to be seen what happens when we go to 40G or 100G, but
we have some ideas even there.

///jon

> 
> 	David
> 
> 

David Miller May 13, 2014, 4:05 a.m. UTC | #7
From: Jon Maloy <jon.maloy@ericsson.com>
Date: Fri,  9 May 2014 09:13:22 -0400

> Memory overhead when allocating big buffers for data transfer may
> be quite significant. E.g., truesize of a 64 KB buffer turns out
> to be 132 KB, 2 x the requested size.
> 
> This invalidates the "worst case" calculation we have been
> using to determine the default socket receive buffer limit,
> which is based on the assumption that 1024x64KB = 67MB buffers
> may be queued up on a socket.
> 
> Since TIPC connections cannot survive hitting the buffer limit,
> we have to compensate for this overhead.
> 
> We do that in this commit by dividing the fix connection flow
> control window from 1024 (2*512) messages to 512 (2*256). Since
> older version nodes send out acks at 512 message intervals,
> compatibility with such nodes is guaranteed, although performance
> may be non-optimal in such cases.
> 
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
> Reviewed-by: Ying Xue <ying.xue@windriver.com>

So all I have to do is open 64 sockets to make TIPC commit to 4GB
of ram at once?

I really think you need to rethink this, the socket limits are
there for a reason.
Jon Maloy May 13, 2014, 5:27 p.m. UTC | #8
On 05/13/2014 12:05 AM, David Miller wrote:
> From: Jon Maloy <jon.maloy@ericsson.com>
> Date: Fri,  9 May 2014 09:13:22 -0400
>
>> Memory overhead when allocating big buffers for data transfer may
>> be quite significant. E.g., truesize of a 64 KB buffer turns out
>> to be 132 KB, 2 x the requested size.
>>
>> This invalidates the "worst case" calculation we have been
>> using to determine the default socket receive buffer limit,
>> which is based on the assumption that 1024x64KB = 67MB buffers
>> may be queued up on a socket.
>>
>> Since TIPC connections cannot survive hitting the buffer limit,
>> we have to compensate for this overhead.
>>
>> We do that in this commit by dividing the fix connection flow
>> control window from 1024 (2*512) messages to 512 (2*256). Since
>> older version nodes send out acks at 512 message intervals,
>> compatibility with such nodes is guaranteed, although performance
>> may be non-optimal in such cases.
>>
>> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
>> Reviewed-by: Ying Xue <ying.xue@windriver.com>
> So all I have to do is open 64 sockets to make TIPC commit to 4GB
> of ram at once?

Yes. We are fully aware of this. But this is the way it has been for the
last two years, and this series changes nothing regarding that. The limit
was even higher before.

>
> I really think you need to rethink this, the socket limits are
> there for a reason.

We have already done that. A couple of months from now, when
we have finished our current redesign of the locking policy
and transmission path code,  you can expect a series of commits
where the connection-level flow control is completely re-worked.
It will be byte-based, and be pretty similar to what we have in TCP.
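
A minimal sketch of what byte-based accounting amounts to (illustrative
only, none of these names reflect the planned design):

    struct byte_fc {
            unsigned long sent_bytes;   /* bytes sent on the connection            */
            unsigned long acked_bytes;  /* bytes the peer has acknowledged         */
            unsigned long peer_window;  /* bytes of buffer the peer has advertised */
    };

    static int may_send(const struct byte_fc *fc, unsigned long len)
    {
            /* message count no longer matters, only outstanding bytes do */
            return fc->sent_bytes - fc->acked_bytes + len <= fc->peer_window;
    }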

But, that solution cannot be made backwards compatible with
the current, message based flow control, so we will have to keep
supporting that one too for a while. We will probably use capability
flags to distinguish between the two, and require active enabling for
any new node to use the old algorithm. I think fixing weaknesses
in the current flow control can be seen as such support, as long
as we don't extend the limits for claimable memory further than
they are now.
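
For illustration, such a capability bit could be as simple as this (the
name is invented, nothing like it exists in the tree yet):

    #define TIPC_CAP_BYTE_FLOWCTL 0x0001  /* peer supports byte-based flow control */

    static int peer_uses_byte_flowctl(unsigned int peer_capabilities)
    {
            return peer_capabilities & TIPC_CAP_BYTE_FLOWCTL;
    }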

Commit #1 is such a fix, while #2 will be valid even when we
introduce the new flow control. The other ones are about completely
different matters.

So, please advise me, should I resubmit the series as a whole,
without patch #1, without patches #1 and #2, or do you expect us
to drop everything else until we have a new flow control?
The latter alternative will no doubt cause us some double effort.

Regards
///jon



David Miller May 13, 2014, 10:32 p.m. UTC | #9
From: Jon Maloy <jon.maloy@ericsson.com>
Date: Tue, 13 May 2014 13:27:45 -0400

> So, please advise me, should I resubmit the series as a whole,
> without patch #1, without ##1 and 2, or do you expect us
> to drop everything else until we have a new flow control?
> The latter alternative will no doubt cause us some double effort.

Ok, just repost the series without any changes and I'll review it
again.

Patch

diff --git a/net/tipc/core.c b/net/tipc/core.c
index 57f8ae9..676d180 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -154,10 +154,11 @@  static int __init tipc_init(void)
 	tipc_max_ports = CONFIG_TIPC_PORTS;
 	tipc_net_id = 4711;
 
-	sysctl_tipc_rmem[0] = CONN_OVERLOAD_LIMIT >> 4 << TIPC_LOW_IMPORTANCE;
-	sysctl_tipc_rmem[1] = CONN_OVERLOAD_LIMIT >> 4 <<
+	sysctl_tipc_rmem[0] = TIPC_CONN_OVERLOAD_LIMIT >> 4 <<
+			      TIPC_LOW_IMPORTANCE;
+	sysctl_tipc_rmem[1] = TIPC_CONN_OVERLOAD_LIMIT >> 4 <<
 			      TIPC_CRITICAL_IMPORTANCE;
-	sysctl_tipc_rmem[2] = CONN_OVERLOAD_LIMIT;
+	sysctl_tipc_rmem[2] = TIPC_CONN_OVERLOAD_LIMIT;
 
 	res = tipc_core_start();
 	if (res)
diff --git a/net/tipc/port.h b/net/tipc/port.h
index a003973..5dfd165 100644
--- a/net/tipc/port.h
+++ b/net/tipc/port.h
@@ -42,9 +42,10 @@ 
 #include "msg.h"
 #include "node_subscr.h"
 
-#define TIPC_FLOW_CONTROL_WIN 512
-#define CONN_OVERLOAD_LIMIT	((TIPC_FLOW_CONTROL_WIN * 2 + 1) * \
-				SKB_TRUESIZE(TIPC_MAX_USER_MSG_SIZE))
+#define TIPC_CONNACK_INTV         256
+#define TIPC_FLOWCTRL_WIN        (TIPC_CONNACK_INTV * 2)
+#define TIPC_CONN_OVERLOAD_LIMIT ((TIPC_FLOWCTRL_WIN * 2 + 1) * \
+				  SKB_TRUESIZE(TIPC_MAX_USER_MSG_SIZE))
 
 /**
  * struct tipc_port - TIPC port structure
@@ -187,7 +188,7 @@  static inline void tipc_port_unlock(struct tipc_port *p_ptr)
 
 static inline int tipc_port_congested(struct tipc_port *p_ptr)
 {
-	return (p_ptr->sent - p_ptr->acked) >= (TIPC_FLOW_CONTROL_WIN * 2);
+	return ((p_ptr->sent - p_ptr->acked) >= TIPC_FLOWCTRL_WIN);
 }
 
 
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 3f9912f..8685daf 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1101,7 +1101,7 @@  restart:
 	/* Consume received message (optional) */
 	if (likely(!(flags & MSG_PEEK))) {
 		if ((sock->state != SS_READY) &&
-		    (++port->conn_unacked >= TIPC_FLOW_CONTROL_WIN))
+		    (++port->conn_unacked >= TIPC_CONNACK_INTV))
 			tipc_acknowledge(port->ref, port->conn_unacked);
 		advance_rx_queue(sk);
 	}
@@ -1210,7 +1210,7 @@  restart:
 
 	/* Consume received message (optional) */
 	if (likely(!(flags & MSG_PEEK))) {
-		if (unlikely(++port->conn_unacked >= TIPC_FLOW_CONTROL_WIN))
+		if (unlikely(++port->conn_unacked >= TIPC_CONNACK_INTV))
 			tipc_acknowledge(port->ref, port->conn_unacked);
 		advance_rx_queue(sk);
 	}