From patchwork Sat Feb 8 13:14:33 2014
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 318420
X-Patchwork-Delegate: davem@davemloft.net
Message-ID: <1391865273.10160.76.camel@edumazet-glaptop2.roam.corp.google.com>
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto
 corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB
 VMFS creation takes 15 minutes vs 26 seconds]
From: Eric Dumazet
To: Thomas Glanzmann, John Ogness
Cc: Eric Dumazet, "David S. Miller", "Nicholas A. Bellinger",
 target-devel, Linux Network Development, LKML
Date: Sat, 08 Feb 2014 05:14:33 -0800
In-Reply-To: <20140208093808.GD16336@glanzmann.de>
References: <20140206153640.GB4103@glanzmann.de>
 <1391727771.14985.41.camel@haakon3.risingtidesystems.com>
 <20140207051500.GB10916@glanzmann.de>
 <20140207075536.GB17815@glanzmann.de>
 <1391801597.1155.28.camel@haakon3.risingtidesystems.com>
 <20140207205142.GA8609@glanzmann.de>
 <20140208093808.GD16336@glanzmann.de>
X-Mailing-List: netdev@vger.kernel.org

On Sat, 2014-02-08 at 10:38 +0100, Thomas Glanzmann wrote:
> Hello Eric,
>
> [RESEND: the times it took to create the VMFS were switched between
> on and off; with autocorking on it took over 2 minutes, with it off
> it took less than 4 seconds]
>
> [RESEND 2: the throughput graphs were switched as well ;-(]
>
> > * Thomas Glanzmann [2014-02-07 08:55]:
> > > Creating a 4 TB VMFS filesystem over iSCSI takes 24 seconds on 3.12
> > > and 15 minutes on 3.14.0-rc2+.
>
> > * Nicholas A. Bellinger [2014-02-07 20:30]:
> > Would it be possible to try a couple of different stable kernel
> > versions to help track this down?
> I bisected[1] it and found the offending commit: f54b311 ("tcp: auto
> corking")[2]: "if we have a small send and a previous packet is
> already in the qdisc or device queue, defer until TX completion or we
> get more data." - description by David S. Miller
>
> I gathered a pcap with tcp_autocorking on and off.
>
> On - took 2 minutes 24 seconds to create a 500 GB VMFS file system:
> https://thomas.glanzmann.de/tmp/tcp_auto_corking_on.pcap.bz2
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:46:34.png
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:52:28.png
>
> Off (sysctl net.ipv4.tcp_autocorking=0) - took 4 seconds to create a
> 500 GB VMFS file system:
> https://thomas.glanzmann.de/tmp/tcp_auto_corking_off.pcap.bz2
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:45:43.png
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:53:17.png
>
> The first graph can be generated by bunzipping the file, opening it in
> Wireshark, and selecting Statistics > IO Graph with the unit changed
> to Bytes/Tick. The second graph can be generated by selecting
> Statistics > TCP Stream Graph > Round Trip Time.
>
> You can also see that the round trip time increases by a factor of at
> least 25.
>
> I once saw a similar problem with delayed ACK packets in the
> paravirtualized network driver in Xen: the TCP window filled up and
> throughput dropped from 30 MB/s to less than 100 KB/s. The symptom was
> that logging in to a Windows desktop took more than 10 minutes where
> it used to take under 30 seconds, because the user's profile was
> loaded slowly from a CIFS server. The culprit then was also delayed
> small packets: ACK packets, in the CIFS case. So far I have only
> proven the regression for iSCSI with TCP auto corking, but I assume we
> will see many others if we leave it enabled.
> I found the problem by doing the following:
>
> - I compiled each kernel by executing the following commands:
>       yes '' | make oldconfig
>       time make -j 24
>       make modules_install
>       mkinitramfs -o /boot/initrd.img-bisect
>
> - I cleaned the iSCSI configuration after each test by issuing:
>       /etc/init.d/target stop
>       rm /iscsi?/* /etc/target/*
>
> - I configured iSCSI after each reboot:
>       cat > lio-v101.conf <<EOF
>       set global auto_cd_after_create=false
>       /backstores/fileio create shared-01.v101.campusvl.de /iscsi1/shared-01.v101.campusvl.de size=500G buffered=true
>       /iscsi create iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de
>       /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.4
>       /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.5
>       /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.4
>       /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.5
>       /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/luns create /backstores/fileio/shared-01.v101.campusvl.de lun=10
>       /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/ set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1
>       saveconfig
>       yes
>       EOF
>       targetcli < lio-v101.conf
>
> - I then connected a freshly booted ESXi 5.5.0 build 1331820 host
>   (deployed via Auto Deploy) to the iSCSI target, configured the
>   portal, rescanned, and created a 500 GB VMFS 5 filesystem. I noted
>   the time: if it took longer than 2 minutes the kernel was bad, if it
>   took less than 10 seconds it was good.
>
> - git bisect good/bad
>
> My network config is:
>
> auto bond0
> iface bond0 inet static
>       address 10.100.4.62
>       netmask 255.255.0.0
>       gateway 10.100.0.1
>       slaves eth0 eth1
>       bond-mode 802.3ad
>       bond-miimon 100
>
> auto bond0.101
> iface bond0.101 inet static
>       address 10.101.99.4
>       netmask 255.255.0.0
>
> auto bond1
> iface bond1 inet static
>       address 10.100.5.62
>       netmask 255.255.0.0
>       slaves eth2 eth3
>       bond-mode 802.3ad
>       bond-miimon 100
>
> auto bond1.101
> iface bond1.101 inet static
>       address 10.101.99.5
>       netmask 255.255.0.0
>
> I propose disabling tcp_autocorking by default, because it obviously
> degrades iSCSI performance and probably that of many other protocols.
> The commit also mentions that applications can explicitly disable auto
> corking; we should probably do that for the iSCSI target, but I don't
> know how. Anyone?
>
> [1] http://pbot.rmdir.de/a65q6MjgV36tZnn5jS-DUQ
> [2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f54b311142a92ea2e42598e347b84e1655caf8e3
>
> Cheers,
> Thomas

Hi Thomas, thanks a lot for this very detailed bug report.

I think you are being hit by other bug(s); let's try to fix it/them
instead of disabling this feature.

John Ogness started a thread yesterday about TCP_NODELAY being defeated
by the TCP Small Queues (TSQ) mechanism, which TCP auto corking is
built on. Two RFC patches were discussed:

- One dealing with the TCP_NODELAY flag, which John posted and which
  I'll adapt to the current kernel.

- One dealing with a possible race, which I suggested (I doubt this
  could trigger on every write, but let's fix it anyway).
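For context, the decision the new sysctl gates is essentially this test
from the autocorking commit (the comment is added here for
illustration and is not part of the commit):

	/* Auto-cork only if this small skb is not at the head of the
	 * write queue and bytes we sent earlier still sit in the qdisc
	 * or device queue: a TX completion is then guaranteed to come
	 * and push the deferred data out later.
	 */
	static bool tcp_should_autocork(struct sock *sk, struct sk_buff *skb,
					int size_goal)
	{
		return skb->len < size_goal &&
		       sysctl_tcp_autocorking &&
		       skb != tcp_write_queue_head(sk) &&
		       atomic_read(&sk->sk_wmem_alloc) > skb->truesize;
	}

As for disabling corking per application: from userspace that opt-out
is simply TCP_NODELAY on the connected socket. A minimal sketch (the
helper name is mine; the in-kernel LIO target would need the
kernel_setsockopt() equivalent instead):

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	static int tcp_disable_coalescing(int fd)
	{
		int one = 1;

		/* Ask the stack to send small segments immediately
		 * instead of deferring them for possible coalescing.
		 */
		return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
				  &one, sizeof(one));
	}

Note that making TCP_NODELAY stick while TSQ throttles the socket is
exactly what the first hunk below addresses.

Here is the combined patch, could you test it?

Thanks !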
---

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 10435b3b9d0f..3be16727f058 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -698,7 +698,8 @@ static void tcp_tsq_handler(struct sock *sk)
 	if ((1 << sk->sk_state) &
 	    (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
 	     TCPF_CLOSE_WAIT  | TCPF_LAST_ACK))
-		tcp_write_xmit(sk, tcp_current_mss(sk), 0, 0, GFP_ATOMIC);
+		tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
+			       0, GFP_ATOMIC);
 }
 /*
  * One tasklet per cpu tries to send more skbs.
@@ -1904,7 +1905,15 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
-			break;
+			/* It is possible TX completion already happened
+			 * before we set TSQ_THROTTLED, so we must
+			 * test again the condition.
+			 * We abuse smp_mb__after_clear_bit() because
+			 * there is no smp_mb__after_set_bit() yet
+			 */
+			smp_mb__after_clear_bit();
+			if (atomic_read(&sk->sk_wmem_alloc) > limit)
+				break;
 		}
 
 		limit = mss_now;
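The ordering requirement behind the second hunk can be shown in
isolation. Below is a standalone sketch of the same re-check pattern,
written with C11 atomics rather than kernel primitives; all names are
illustrative and not taken from the patch:

	/* A writer throttles itself when queued bytes exceed a limit,
	 * but the TX-completion side may drain the queue between the
	 * writer's first test and its flag store; a full fence on both
	 * sides guarantees at least one of them sees the other's update.
	 */
	#include <stdatomic.h>
	#include <stdbool.h>

	static atomic_long wmem_alloc;	/* bytes queued below us */
	static atomic_bool throttled;	/* analogous to TSQ_THROTTLED */

	/* Writer side: returns true only if it is safe to stop sending. */
	static bool must_throttle(long limit)
	{
		if (atomic_load(&wmem_alloc) <= limit)
			return false;

		atomic_store(&throttled, true);
		/* Make the flag visible before re-reading the counter;
		 * this is the role smp_mb__after_clear_bit() plays in
		 * the patch above.
		 */
		atomic_thread_fence(memory_order_seq_cst);

		return atomic_load(&wmem_alloc) > limit;
	}

	/* Completion side: drain bytes, then wake the writer if needed. */
	static void tx_completion(long freed)
	{
		atomic_fetch_sub(&wmem_alloc, freed);
		atomic_thread_fence(memory_order_seq_cst);
		if (atomic_exchange(&throttled, false)) {
			/* reschedule the writer, as the TSQ tasklet does */
		}
	}

Either the completion runs entirely before the flag store, in which
case the writer's second read sees the drained counter and keeps
sending, or it runs after, in which case it sees the flag and
reschedules the writer; the wakeup cannot be lost in either order.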