Message ID: 1288683057.2660.154.camel@edumazet-laptop
State: RFC, archived
Delegated to: David Miller
On Tue, Nov 02, 2010 at 08:30:57AM +0100, Eric Dumazet wrote:
> On Tuesday, 02 November 2010 at 16:03 +0900, Simon Horman wrote:
> > On Tue, Nov 02, 2010 at 05:53:42AM +0100, Eric Dumazet wrote:
> > > On Tuesday, 02 November 2010 at 11:06 +0900, Simon Horman wrote:
> > > >
> > > > Thanks for the explanation.
> > > > I'm not entirely sure how much of a problem this is in practice.
> > >
> > > Maybe for virtual devices (tunnels, bonding, ...) it would make sense
> > > to delay the orphaning up to the real device.
> >
> > That was my initial thought. Could you give me some guidance
> > on how that might be done, so I can try to make a patch to test?
> >
> > > But if the socket send buffer is very large, it would defeat the flow
> > > control anyway...
> >
> > I'm primarily concerned about a situation where
> > UDP packets are sent as fast as possible, indefinitely.
> > And in that scenario, I think it would need to be a rather large buffer.
>
> Please try the following patch, thanks.

Thanks Eric, that seems to resolve the problem that I was seeing.

With your patch I see:

No bonding

# netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB

116736    1472   30.00     2438413      0      957.2     8.52     1.458
129024           30.00     2438413             957.2     -1.00    -1.000

With bonding (one slave, the interface used in the test above)

netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB

116736    1472   30.00     2438390      0      957.1     8.97     1.535
129024           30.00     2438390             957.1     -1.00    -1.000
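For context on the flow control under discussion: a UDP sender is throttled
only by socket send-buffer accounting, i.e. sendto() blocks once the bytes
charged to sk_wmem_alloc reach the SO_SNDBUF limit, and each skb's charge is
returned when the skb is freed or orphaned. A minimal userspace sketch of
the kind of sender Simon describes; the destination address is taken from
the tests above, and the port is a placeholder:

/*
 * Minimal sketch: UDP datagrams pushed as fast as possible,
 * indefinitely. When flow control works, sendto() blocks whenever
 * the socket's in-flight bytes hit the SO_SNDBUF limit; once skbs
 * are orphaned early in the transmit path, the charge is returned
 * immediately and this loop never slows down, however congested
 * the device is.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
	char payload[1472];		/* 1472 + 28 bytes of headers = 1500-byte MTU */
	struct sockaddr_in dst;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return 1;
	memset(payload, 0, sizeof(payload));
	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(12345);	/* placeholder port */
	inet_pton(AF_INET, "172.17.60.216", &dst.sin_addr);

	for (;;)			/* send as fast as possible, forever */
		sendto(fd, payload, sizeof(payload), 0,
		       (struct sockaddr *)&dst, sizeof(dst));
}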
On Tuesday, 02 November 2010 at 17:46 +0900, Simon Horman wrote:
> Thanks Eric, that seems to resolve the problem that I was seeing.
>
> With your patch I see:
>
> No bonding
>
> # netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> Socket  Message  Elapsed      Messages                   CPU      Service
> Size    Size     Time         Okay Errors   Throughput   Util     Demand
> bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
>
> 116736    1472   30.00     2438413      0      957.2     8.52     1.458
> 129024           30.00     2438413             957.2     -1.00    -1.000
>
> With bonding (one slave, the interface used in the test above)
>
> netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> Socket  Message  Elapsed      Messages                   CPU      Service
> Size    Size     Time         Okay Errors   Throughput   Util     Demand
> bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
>
> 116736    1472   30.00     2438390      0      957.1     8.97     1.535
> 129024           30.00     2438390             957.1     -1.00    -1.000

Sure, the patch helps when not too many flows are involved, but it is a hack.

Say the device queue is 1000 packets and you run a workload with 2000
sockets: it won't work...

Or the device queue is 1000 packets, there is one flow, and the socket send
queue size allows more than 1000 packets to be 'in flight'
(echo 2000000 > /proc/sys/net/core/wmem_default): it won't work either with
bonding, only with devices that have a qdisc sitting in the first device met
after the socket.
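Eric's caveats follow from what orphaning actually does. For reference,
skb_orphan() of this era looked roughly like the following (from
include/linux/skbuff.h; quoted from memory, so treat it as a sketch):

/*
 * Roughly the skb_orphan() of this era. Running the destructor here
 * (sock_wfree() for these skbs) returns the skb's charge to
 * sk_wmem_alloc and can wake the sending socket immediately, even
 * though the packet may still be sitting on a qdisc or a TX ring.
 */
static inline void skb_orphan(struct sk_buff *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
	skb->sk = NULL;
}

So once the orphan happens, nothing ties the queued packet back to the
socket, and send-buffer flow control is effectively over for that packet.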
On Tue, Nov 02, 2010 at 10:29:45AM +0100, Eric Dumazet wrote:
> On Tuesday, 02 November 2010 at 17:46 +0900, Simon Horman wrote:
>
> > Thanks Eric, that seems to resolve the problem that I was seeing.
> >
> > [netperf results snipped]
>
> Sure, the patch helps when not too many flows are involved, but it is a
> hack.
>
> Say the device queue is 1000 packets and you run a workload with 2000
> sockets: it won't work...
>
> Or the device queue is 1000 packets, there is one flow, and the socket
> send queue size allows more than 1000 packets to be 'in flight'
> (echo 2000000 > /proc/sys/net/core/wmem_default): it won't work either
> with bonding, only with devices that have a qdisc sitting in the first
> device met after the socket.

True, thanks for pointing that out.

The scenario that I am actually interested in is virtualisation. And I
believe that your patch helps the vhost-net case (I don't see flow control
problems with bonding + virtio without vhost-net). However, I am unsure
whether there are also easy ways to degrade flow control in the vhost-net
case too.
On Sat, Nov 06, 2010 at 06:25:37PM +0900, Simon Horman wrote:
> On Tue, Nov 02, 2010 at 10:29:45AM +0100, Eric Dumazet wrote:
> > [earlier discussion and netperf results snipped]
> >
> > Sure, the patch helps when not too many flows are involved, but it is a
> > hack.
> >
> > Say the device queue is 1000 packets and you run a workload with 2000
> > sockets: it won't work...
> >
> > Or the device queue is 1000 packets, there is one flow, and the socket
> > send queue size allows more than 1000 packets to be 'in flight'
> > (echo 2000000 > /proc/sys/net/core/wmem_default): it won't work either
> > with bonding, only with devices that have a qdisc sitting in the first
> > device met after the socket.
>
> True, thanks for pointing that out.
>
> The scenario that I am actually interested in is virtualisation. And I
> believe that your patch helps the vhost-net case (I don't see flow
> control problems with bonding + virtio without vhost-net). However, I am
> unsure whether there are also easy ways to degrade flow control in the
> vhost-net case too.

Hi Eric,

do you have any thoughts on this?

I measured the performance impact of your patch on 2.6.37-rc1,
and I can see why early orphaning is a win.

The tests are run over a bond with three slaves.
The bond is in balance-rr mode. Other parameters of interest are:

MTU=1500
client,server: tcp_reordering=3 (default)
client: GSO=off
client: TSO=off
server: GRO=off
server: rx-usecs=3 (default)

Without your no-early-orphan patch

TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      1621.03   16.31    6.48     1.648   2.621

With your no-early-orphan patch

# netperf -C -c -4 -t TCP_STREAM -H 172.17.60.216
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      1433.48    9.60    5.45     1.098   2.490

However, in the case of virtualisation I think it is a win to be able to do
flow control on UDP traffic from guests (using virtio). Am I missing
something, and can flow control be bypassed anyway? If not, perhaps making
the change that your patch makes configurable through proc or ethtool is an
option?
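The "early orphaning" being measured here is the skb_orphan_try() call in
dev_hard_start_xmit(). Roughly, from the net/core/dev.c of this era (again
a sketch from memory rather than a verbatim quote):

/*
 * Roughly the skb_orphan_try() of this era. The skb is orphaned as
 * it enters the device transmit path, unless TX timestamping still
 * needs the socket; sk_hash is saved into rxhash first so that
 * skb_tx_hash() can keep selecting a consistent TX queue after the
 * socket pointer is gone.
 */
static inline void skb_orphan_try(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;

	if (sk && !skb_shinfo(skb)->tx_flags) {
		/* skb_tx_hash() won't be able to get sk.
		 * We copy sk_hash into skb->rxhash.
		 */
		if (!skb->rxhash)
			skb->rxhash = sk->sk_hash;
		skb_orphan(skb);
	}
}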
On Wednesday, 08 December 2010 at 22:22 +0900, Simon Horman wrote:
> Hi Eric,
>
> do you have any thoughts on this?
>
> I measured the performance impact of your patch on 2.6.37-rc1,
> and I can see why early orphaning is a win.
>
> The tests are run over a bond with three slaves.
> The bond is in balance-rr mode. Other parameters of interest are:
> MTU=1500
> client,server: tcp_reordering=3 (default)
> client: GSO=off
> client: TSO=off
> server: GRO=off
> server: rx-usecs=3 (default)
>
> Without your no-early-orphan patch
> [netperf results snipped: 1621.03 Mbit/s]
>
> With your no-early-orphan patch
> [netperf results snipped: 1433.48 Mbit/s]

It seems strange that this makes such a big difference with one flow.

> However, in the case of virtualisation I think it is a win to be able to
> do flow control on UDP traffic from guests (using virtio). Am I missing
> something, and can flow control be bypassed anyway? If not, perhaps
> making the change that your patch makes configurable through proc or
> ethtool is an option?

virtio_net start_xmit() does one skb_orphan() anyway, so not doing it some
nanoseconds earlier won't change anything.

The real perf problem is when skbs are queued (for example on an eth driver
TX ring or a qdisc queue), then freed some micro (or milli) seconds later.

Maybe your ethtool suggestion is the way to go, so that we can remove the
special skb_orphan() calls done in some drivers: let the core network stack
decide to skb_orphan() itself, not the driver.
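A simplified sketch of the virtio_net transmit path Eric refers to
(drivers/net/virtio_net.c around 2.6.37; capacity and error handling
omitted, so this is illustrative rather than verbatim driver code):

/*
 * Simplified sketch of virtio_net's start_xmit() of this era. The
 * driver orphans the skb itself right after queueing it to the host,
 * so the sending socket never waits for the TX completion; this is
 * exactly the per-driver skb_orphan() Eric suggests moving into the
 * core stack's control.
 */
static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct virtnet_info *vi = netdev_priv(dev);

	free_old_xmit_skbs(vi);		/* reclaim completed TX buffers */
	xmit_skb(vi, skb);		/* queue the skb to the host */
	virtqueue_kick(vi->svq);	/* notify the host */

	/* Don't wait up for transmitted skbs to be freed. */
	skb_orphan(skb);
	nf_reset(skb);

	return NETDEV_TX_OK;
}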
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index bdb68a6..325931e 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4714,6 +4714,7 @@ static void bond_setup(struct net_device *bond_dev)
 	bond_dev->flags |= IFF_MASTER|IFF_MULTICAST;
 	bond_dev->priv_flags |= IFF_BONDING;
 	bond_dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
+	bond_dev->priv_flags &= ~IFF_EARLY_ORPHAN;
 
 	if (bond->params.arp_interval)
 		bond_dev->priv_flags |= IFF_MASTER_ARPMON;
diff --git a/include/linux/if.h b/include/linux/if.h
index 1239599..7499a99 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -77,6 +77,9 @@
 #define IFF_BRIDGE_PORT	0x8000		/* device used as bridge port */
 #define IFF_OVS_DATAPATH	0x10000	/* device used as Open vSwitch
					 * datapath port */
+#define IFF_EARLY_ORPHAN	0x20000	/* early orphan skbs in
+					 * dev_hard_start_xmit()
+					 */
 
 #define IF_GET_IFACE	0x0001		/* for querying only */
 #define IF_GET_PROTO	0x0002
diff --git a/net/core/dev.c b/net/core/dev.c
index 35dfb83..eabf94d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2005,7 +2005,8 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
 			skb_dst_drop(skb);
 
-		skb_orphan_try(skb);
+		if (dev->priv_flags & IFF_EARLY_ORPHAN)
+			skb_orphan_try(skb);
 
 		if (vlan_tx_tag_present(skb) &&
 		    !(dev->features & NETIF_F_HW_VLAN_TX)) {
@@ -5590,7 +5591,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 	INIT_LIST_HEAD(&dev->napi_list);
 	INIT_LIST_HEAD(&dev->unreg_list);
 	INIT_LIST_HEAD(&dev->link_watch_list);
-	dev->priv_flags = IFF_XMIT_DST_RELEASE;
+	dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_EARLY_ORPHAN;
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;
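Under this scheme, early orphaning stays the default (the flag is set in
alloc_netdev_mq()), and a virtual device opts out by clearing it, as
bonding does above. A hypothetical example for a tunnel-style driver
(my_tunnel_setup() is an invented name, shown only to illustrate the
flag's intended use):

/*
 * Hypothetical example of a virtual device opting out of early
 * orphaning with the flag the patch introduces: skbs keep their
 * socket reference, and their send-buffer charge, until the
 * underlying real device frees them.
 */
static void my_tunnel_setup(struct net_device *dev)
{
	ether_setup(dev);
	dev->priv_flags &= ~IFF_EARLY_ORPHAN;
}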