
bonding: flow control regression [was Re: bridging: flow control regression]

Message ID 1288683057.2660.154.camel@edumazet-laptop
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Nov. 2, 2010, 7:30 a.m. UTC
On Tue, Nov 02, 2010 at 16:03 +0900, Simon Horman wrote:
> On Tue, Nov 02, 2010 at 05:53:42AM +0100, Eric Dumazet wrote:
> > On Tue, Nov 02, 2010 at 11:06 +0900, Simon Horman wrote:
> > 
> > > Thanks for the explanation.
> > > I'm not entirely sure how much of a problem this is in practice.
> > 
> > Maybe for virtual devices (tunnels, bonding, ...), it would make sense
> > to delay the orphaning up to the real device.
> 
> That was my initial thought. Could you give me some guidance
> on how that might be done so I can try and make a patch to test?
> 
> > But if the socket send buffer is very large, it would defeat the flow
> > control any way...
> 
> I'm primarily concerned about a situation where
> UDP packets are sent as fast as possible, indefinitely.
> And in that scenario, I think it would need to be a rather large buffer.
> 

Please try the following patch, thanks.

 drivers/net/bonding/bond_main.c |    1 +
 include/linux/if.h              |    3 +++
 net/core/dev.c                  |    5 +++--
 3 files changed, 7 insertions(+), 2 deletions(-)




Comments

Simon Horman Nov. 2, 2010, 8:46 a.m. UTC | #1
On Tue, Nov 02, 2010 at 08:30:57AM +0100, Eric Dumazet wrote:
> On Tue, Nov 02, 2010 at 16:03 +0900, Simon Horman wrote:
> > On Tue, Nov 02, 2010 at 05:53:42AM +0100, Eric Dumazet wrote:
> > > On Tue, Nov 02, 2010 at 11:06 +0900, Simon Horman wrote:
> > > 
> > > > Thanks for the explanation.
> > > > I'm not entirely sure how much of a problem this is in practice.
> > > 
> > > Maybe for virtual devices (tunnels, bonding, ...), it would make sense
> > > to delay the orphaning up to the real device.
> > 
> > That was my initial thought. Could you give me some guidance
> > on how that might be done so I can try and make a patch to test?
> > 
> > > But if the socket send buffer is very large, it would defeat the flow
> > > control any way...
> > 
> > I'm primarily concerned about a situation where
> > UDP packets are sent as fast as possible, indefinitely.
> > And in that scenario, I think it would need to be a rather large buffer.
> > 
> 
> Please try following patch, thanks.

Thanks Eric, that seems to resolve the problem that I was seeing.

With your patch I see:

No bonding

# netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB

116736    1472   30.00     2438413      0      957.2     8.52     1.458 
129024           30.00     2438413             957.2     -1.00    -1.000

With bonding (one slave, the interface used in the test above)

netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB

116736    1472   30.00     2438390      0      957.1     8.97     1.535 
129024           30.00     2438390             957.1     -1.00    -1.000

Eric Dumazet Nov. 2, 2010, 9:29 a.m. UTC | #2
On Tue, Nov 02, 2010 at 17:46 +0900, Simon Horman wrote:

> Thanks Eric, that seems to resolve the problem that I was seeing.
> 
> With your patch I see:
> 
> No bonding
> 
> # netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> Socket  Message  Elapsed      Messages                   CPU      Service
> Size    Size     Time         Okay Errors   Throughput   Util     Demand
> bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> 
> 116736    1472   30.00     2438413      0      957.2     8.52     1.458 
> 129024           30.00     2438413             957.2     -1.00    -1.000
> 
> With bonding (one slave, the interface used in the test above)
> 
> netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> Socket  Message  Elapsed      Messages                   CPU      Service
> Size    Size     Time         Okay Errors   Throughput   Util     Demand
> bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> 
> 116736    1472   30.00     2438390      0      957.1     8.97     1.535 
> 129024           30.00     2438390             957.1     -1.00    -1.000
> 


Sure, the patch helps when not too many flows are involved, but this is a
hack.

Say the device queue is 1000 packets and you run a workload with 2000
sockets: it won't work...

Or the device queue is 1000 packets, there is one flow, and the socket send
queue size allows more than 1000 packets to be 'in flight' (echo 2000000
>/proc/sys/net/core/wmem_default): it won't work with bonding either; it
only works when a qdisc sits on the first device met after the socket.
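
(For reference, the orphaning helper under discussion is essentially the
following; a simplified paraphrase of the 2.6.3x-era include/linux/skbuff.h,
shown only to illustrate why orphaning early defeats the backpressure, and
not part of the posted patch.)

static inline void skb_orphan(struct sk_buff *skb)
{
	/*
	 * Release the skb's charge against the sending socket: the
	 * destructor (sock_wfree() for a local UDP sender) decrements
	 * sk_wmem_alloc and wakes the writer, so once this has run the
	 * socket send buffer no longer throttles the sender and only
	 * the queues further down (qdisc, device TX ring) can drop.
	 */
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
	skb->sk = NULL;
}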



Simon Horman Nov. 6, 2010, 9:25 a.m. UTC | #3
On Tue, Nov 02, 2010 at 10:29:45AM +0100, Eric Dumazet wrote:
> On Tue, Nov 02, 2010 at 17:46 +0900, Simon Horman wrote:
> 
> > Thanks Eric, that seems to resolve the problem that I was seeing.
> > 
> > With your patch I see:
> > 
> > No bonding
> > 
> > # netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> > Socket  Message  Elapsed      Messages                   CPU      Service
> > Size    Size     Time         Okay Errors   Throughput   Util     Demand
> > bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> > 
> > 116736    1472   30.00     2438413      0      957.2     8.52     1.458 
> > 129024           30.00     2438413             957.2     -1.00    -1.000
> > 
> > With bonding (one slave, the interface used in the test above)
> > 
> > netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> > Socket  Message  Elapsed      Messages                   CPU      Service
> > Size    Size     Time         Okay Errors   Throughput   Util     Demand
> > bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> > 
> > 116736    1472   30.00     2438390      0      957.1     8.97     1.535 
> > 129024           30.00     2438390             957.1     -1.00    -1.000
> > 
> 
> 
> Sure the patch helps when not too many flows are involved, but this is a
> hack.
> 
> Say the device queue is 1000 packets, and you run a workload with 2000
> sockets, it wont work...
> 
> Or device queue is 1000 packets, one flow, and socket send queue size
> allows for more than 1000 packets to be 'in flight' (echo 2000000
> >/proc/sys/net/core/wmem_default) , it wont work too with bonding, only
> with devices with a qdisc sitting in the first device met after the
> socket.

True, thanks for pointing that out.

The scenario that I am actually interested in is virtualisation.
And I believe that your patch helps the vhostnet case (I don't see
flow control problems with bonding + virtio without vhostnet). However,
I am unsure whether there are similarly easy ways to defeat
flow control in the vhostnet case as well.

Simon Horman Dec. 8, 2010, 1:22 p.m. UTC | #4
On Sat, Nov 06, 2010 at 06:25:37PM +0900, Simon Horman wrote:
> On Tue, Nov 02, 2010 at 10:29:45AM +0100, Eric Dumazet wrote:
> > On Tue, Nov 02, 2010 at 17:46 +0900, Simon Horman wrote:
> > 
> > > Thanks Eric, that seems to resolve the problem that I was seeing.
> > > 
> > > With your patch I see:
> > > 
> > > No bonding
> > > 
> > > # netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> > > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> > > Socket  Message  Elapsed      Messages                   CPU      Service
> > > Size    Size     Time         Okay Errors   Throughput   Util     Demand
> > > bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> > > 
> > > 116736    1472   30.00     2438413      0      957.2     8.52     1.458 
> > > 129024           30.00     2438413             957.2     -1.00    -1.000
> > > 
> > > With bonding (one slave, the interface used in the test above)
> > > 
> > > netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> > > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> > > Socket  Message  Elapsed      Messages                   CPU      Service
> > > Size    Size     Time         Okay Errors   Throughput   Util     Demand
> > > bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> > > 
> > > 116736    1472   30.00     2438390      0      957.1     8.97     1.535 
> > > 129024           30.00     2438390             957.1     -1.00    -1.000
> > > 
> > 
> > 
> > Sure the patch helps when not too many flows are involved, but this is a
> > hack.
> > 
> > Say the device queue is 1000 packets, and you run a workload with 2000
> > sockets, it wont work...
> > 
> > Or device queue is 1000 packets, one flow, and socket send queue size
> > allows for more than 1000 packets to be 'in flight' (echo 2000000
> > >/proc/sys/net/core/wmem_default) , it wont work too with bonding, only
> > with devices with a qdisc sitting in the first device met after the
> > socket.
> 
> True, thanks for pointing that out.
> 
> The scenario that I am actually interested in is virtualisation.
> And I believe that your patch helps the vhostnet case (I don't see
> flow control problems with bonding + virtio without vhostnet). However,
> I am unsure if there are also some easy work-arounds to degrade
> flow control in the vhostnet case too.

Hi Eric,

do you have any thoughts on this?

I measured the performance impact of your patch on 2.6.37-rc1
and I can see why early orphaning is a win.

The tests are run over a bond with 3 slaves.
The bond is in balance-rr mode. Other parameters of interest are:
	MTU=1500
	client,server:tcp_reordering=3(default)
	client:GSO=off,
	client:TSO=off
	server:GRO=off
	server:rx-usecs=3(default)

Without your no-early-orphan patch
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
	172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      1621.03   16.31    6.48     1.648   2.621

With your no-early-orphan patch
# netperf -C -c -4 -t TCP_STREAM -H 172.17.60.216
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
	172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      1433.48   9.60     5.45     1.098   2.490


However, in the case of virtualisation I think it is a win to be able to do
flow control on UDP traffic from guests (using virtio). Am I missing
something, or can flow control be bypassed anyway? If not, perhaps making
the change that your patch makes configurable through proc or ethtool is an
option?

Eric Dumazet Dec. 8, 2010, 1:50 p.m. UTC | #5
On Wed, Dec 08, 2010 at 22:22 +0900, Simon Horman wrote:

> Hi Eric,
> 
> do you have any thoughts on this?
> 
> I measured the performance impact of your patch on 2.6.37-rc1
> and I can see why early orphaning is a win.
> 
> The tests are run over a bond with 3 slaves.
> The bond is in rr-balance mode. Other parameters of interest are:
> 	MTU=1500
> 	client,server:tcp_reordering=3(default)
> 	client:GSO=off,
> 	client:TSO=off
> 	server:GRO=off
> 	server:rx-usecs=3(default)
> 
> Without your no early-orphan patch
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 	172.17.60.216 (172.17.60.216) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> 
>  87380  16384  16384    10.00      1621.03   16.31    6.48     1.648   2.621
> 
> With your no early-orphan patch
> # netperf -C -c -4 -t TCP_STREAM -H 172.17.60.216
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 	172.17.60.216 (172.17.60.216) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> 
>  87380  16384  16384    10.00      1433.48   9.60     5.45     1.098   2.490
> 

It seems strange that this makes such a big difference with one flow.


> 
> However in the case of virtualisation I think it is a win to be able to do
> flow control on UDP traffic from guests (using vitio). Am I missing
> something and flow control can be bypassed anyway? If not perhaps making
> the change that your patch makes configurable through proc or ethtool is an
> option?
> 

virtio_net's start_xmit() does an skb_orphan() anyway, so not doing it a
few nanoseconds earlier won't change anything.

The real perf problem is when skbs are queued (for example on an eth
driver's TX ring or in a qdisc queue) and then freed some micro (or milli)
seconds later.

Maybe your ethtool suggestion is the way to go, so that we can remove the
special skb_orphan() calls done in some drivers: let the core network
stack decide to skb_orphan() itself, not the driver.
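
(To sketch what that could look like, purely as an illustration: a new
ethtool flag mapped onto IFF_EARLY_ORPHAN through the get_flags/set_flags
ethtool_ops of that era. ETH_FLAG_EARLY_ORPHAN, its value and both helpers
below are hypothetical names, not something that exists in the tree.)

#include <linux/ethtool.h>
#include <linux/netdevice.h>

/* Hypothetical ethtool flag bit mirroring the IFF_EARLY_ORPHAN priv_flag. */
#define ETH_FLAG_EARLY_ORPHAN	0x20000

static u32 sketch_get_flags(struct net_device *dev)
{
	u32 flags = ethtool_op_get_flags(dev);

	if (dev->priv_flags & IFF_EARLY_ORPHAN)
		flags |= ETH_FLAG_EARLY_ORPHAN;
	return flags;
}

static int sketch_set_flags(struct net_device *dev, u32 data)
{
	if (data & ETH_FLAG_EARLY_ORPHAN)
		dev->priv_flags |= IFF_EARLY_ORPHAN;
	else
		dev->priv_flags &= ~IFF_EARLY_ORPHAN;
	return 0;
}

/* Wired into a driver's ethtool_ops as .get_flags / .set_flags, this would
 * let userspace toggle early orphaning per device, instead of drivers
 * hard-coding their own skb_orphan() calls.
 */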




Patch

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index bdb68a6..325931e 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4714,6 +4714,7 @@  static void bond_setup(struct net_device *bond_dev)
 	bond_dev->flags |= IFF_MASTER|IFF_MULTICAST;
 	bond_dev->priv_flags |= IFF_BONDING;
 	bond_dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
+	bond_dev->priv_flags &= ~IFF_EARLY_ORPHAN;
 
 	if (bond->params.arp_interval)
 		bond_dev->priv_flags |= IFF_MASTER_ARPMON;
diff --git a/include/linux/if.h b/include/linux/if.h
index 1239599..7499a99 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -77,6 +77,9 @@ 
 #define IFF_BRIDGE_PORT	0x8000		/* device used as bridge port */
 #define IFF_OVS_DATAPATH	0x10000	/* device used as Open vSwitch
 					 * datapath port */
+#define IFF_EARLY_ORPHAN	0x20000 /* early orphan skbs in
+					 * dev_hard_start_xmit()
+					 */
 
 #define IF_GET_IFACE	0x0001		/* for querying only */
 #define IF_GET_PROTO	0x0002
diff --git a/net/core/dev.c b/net/core/dev.c
index 35dfb83..eabf94d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2005,7 +2005,8 @@  int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
 			skb_dst_drop(skb);
 
-		skb_orphan_try(skb);
+		if (dev->priv_flags & IFF_EARLY_ORPHAN)
+			skb_orphan_try(skb);
 
 		if (vlan_tx_tag_present(skb) &&
 		    !(dev->features & NETIF_F_HW_VLAN_TX)) {
@@ -5590,7 +5591,7 @@  struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 	INIT_LIST_HEAD(&dev->napi_list);
 	INIT_LIST_HEAD(&dev->unreg_list);
 	INIT_LIST_HEAD(&dev->link_watch_list);
-	dev->priv_flags = IFF_XMIT_DST_RELEASE;
+	dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_EARLY_ORPHAN ;
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;