Patchwork GRO aggregation

login
register
mail settings
Submitter Eric Dumazet
Date Sept. 13, 2012, 12:05 p.m.
Message ID <1347537926.13103.1530.camel@edumazet-glaptop>
Download mbox | patch
Permalink /patch/183617/
State RFC
Delegated to: David Miller
Headers show

Comments

Eric Dumazet - Sept. 13, 2012, 12:05 p.m.
On Thu, 2012-09-13 at 12:59 +0300, Or Gerlitz wrote:
> On Thu, Sep 13, 2012 at 11:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > MAX_SKB_FRAGS is 16
> > skb_gro_receive() will return -E2BIG once this limit is hit.
> > If you use a MSS = 100 (instead of MSS = 1460), then GRO skb will
> > contain only at most 1700 bytes, but TSO packets can still be 64KB, if
> > the sender NIC can afford it (some NICS wont work quite well)
> 
> Hi Eric,
> 
> Addressing this assertion of yours, Shlomo showed that with ixgbe he managed
> to see GRO aggregating 32KB which means 20-21 packets that is > 16 fragments
> in this notation, can it be related to the way ixgbe is actually
> allocating skbs?
> 

Hard to say without knowing exact kernel version, as things change a lot
in this area.

You have several kind of GRO. One fast and one slow.

The slow one uses a linked list of skbs (pinfo->frag_list), while the
fast one uses fragments (pinfo->nr_frags)

For example, some drivers (mellanox one is in this lot) pull too many
bytes in skb->head and this defeats the fast GRO :
Part of payload is in skb->head, remaining part in pinfo->frags[0]

skb_gro_receive() then has to allocate a new head skb, to link skbs into
head->frag_list. The total skb->truesize is not reduced at all, its
increased.

So you might think GRO is working, but its only a hack, as one skb has a
list of skbs, and this makes TCP read() slower, and defeats TCP
coalescing as well. Whats the point of delivering fat skbs to TCP stack
if it slows down the consumer, because of increased cache line misses ?

I am not _very_ interested in the slow GRO behavior, I try to improve
the fast path.

ixgbe uses the fast GRO, at least on recent kernels.

In my tests on mellanox, it only aggregates 8 frames per skb, and still
we reach 10Gbps...

03:41:40.128074 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1563841:1575425(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128080 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1575425:1587009(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128085 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1587009:1598593(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128089 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1598593:1610177(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128093 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1610177:1621761(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128103 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1633345:1644929(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128116 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1668097:1679681(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128121 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1679681:1691265(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128134 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1714433:1726017(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128146 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1749185:1759321(10136) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128163 IP 7.7.7.83.52113 > 7.7.7.84.38079: . ack 1575425 win 4147 <nop,nop,timestamp 152427711 137349733>
03:41:40.128193 IP 7.7.7.83.52113 > 7.7.7.84.38079: . ack 1759321 win 3339 <nop,nop,timestamp 152427711 137349733>

And it aggregates 8 frames per skb because each individual frame uses 2 fragments :

One of 512 bytes and one of 1024 bytes : total of 1536 bytes,
instead of the typical 2048 bytes used by other NIC

To get better performance, mellanox could use only one frag
per MTU (if MTU <= 1500), using 1536 bytes frags.

I tried this and this gives now :

05:00:12.507398 IP 7.7.7.84.63422 > 7.7.7.83.37622: . 2064384:2089000(24616) ack 1 win 229 <nop,nop,timestamp 142062123 4294793380>
05:00:12.507419 IP 7.7.7.84.63422 > 7.7.7.83.37622: . 2138232:2161400(23168) ack 1 win 229 <nop,nop,timestamp 142062123 4294793380>
05:00:12.507489 IP 7.7.7.84.63422 > 7.7.7.83.37622: . 2244664:2269280(24616) ack 1 win 229 <nop,nop,timestamp 142062123 4294793380>
05:00:12.507509 IP 7.7.7.83.37622 > 7.7.7.84.63422: . ack 2244664 win 16384 <nop,nop,timestamp 4294793380 142062123>

But there is no real difference in throughput.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz - Sept. 13, 2012, 12:47 p.m.
On Thu, Sep 13, 2012 at 3:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-09-13 at 12:59 +0300, Or Gerlitz wrote:
>> On Thu, Sep 13, 2012 at 11:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > MAX_SKB_FRAGS is 16
>> > skb_gro_receive() will return -E2BIG once this limit is hit.
>> > If you use a MSS = 100 (instead of MSS = 1460), then GRO skb will
>> > contain only at most 1700 bytes, but TSO packets can still be 64KB, if
>> > the sender NIC can afford it (some NICS wont work quite well)

>> Addressing this assertion of yours, Shlomo showed that with ixgbe he managed
>> to see GRO aggregating 32KB which means 20-21 packets that is > 16 fragments
>> in this notation, can it be related to the way ixgbe is actually allocating skbs?

> Hard to say without knowing exact kernel version, as things change a lot in this area.

As Shlomo wrote earlier on this thread his testbed is 3.6-rc1


> You have several kind of GRO. One fast and one slow.
> The slow one uses a linked list of skbs (pinfo->frag_list), while the
> fast one uses fragments (pinfo->nr_frags)
>
> For example, some drivers (mellanox one is in this lot) pull too many
> bytes in skb->head and this defeats the fast GRO :
> Part of payload is in skb->head, remaining part in pinfo->frags[0]
>
> skb_gro_receive() then has to allocate a new head skb, to link skbs into
> head->frag_list. The total skb->truesize is not reduced at all, its
> increased.
>
> So you might think GRO is working, but its only a hack, as one skb has a
> list of skbs, and this makes TCP read() slower, and defeats TCP
> coalescing as well. Whats the point of delivering fat skbs to TCP stack
> if it slows down the consumer, because of increased cache line misses ?

Shlomo is dealing with making the IPoIB driver work well with GRO,
thanks for the
comments on the Mellanox Ethernet driver, we will look there too
(added Yevgeny)...

As for IPoIB it has two modes, connected which irrelevant for this
discussion, and datagram
- who is under the  scope here. Its MTU is typically 2044 but can be
4092 as well, the allocation
of skb's for this mode is done in ipoib_alloc_rx_skb() -- which you've
patched recently...

Following your comment we noted that if using the lower/typical mtu of
2044 which means
we are below the ipoib_ud_need_sg() threshold, skbs are allocated on
one "form" and if using
the 4092 mtu in another "form" - do you see each of the form to fall
into different GRO flow, e.g
2044 to the "slow" and 4092 to the "fast"?!

Or.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Sept. 13, 2012, 1:22 p.m.
On Thu, 2012-09-13 at 15:47 +0300, Or Gerlitz wrote:
> Shlomo is dealing with making the IPoIB driver work well with GRO,
> thanks for the
> comments on the Mellanox Ethernet driver, we will look there too
> (added Yevgeny)...
> 
> As for IPoIB it has two modes, connected which irrelevant for this
> discussion, and datagram
> - who is under the  scope here. Its MTU is typically 2044 but can be
> 4092 as well, the allocation
> of skb's for this mode is done in ipoib_alloc_rx_skb() -- which you've
> patched recently...
> 
> Following your comment we noted that if using the lower/typical mtu of
> 2044 which means
> we are below the ipoib_ud_need_sg() threshold, skbs are allocated on
> one "form" and if using
> the 4092 mtu in another "form" - do you see each of the form to fall
> into different GRO flow, e.g
> 2044 to the "slow" and 4092 to the "fast"?!

Seems fine to me both ways, because you use dev_alloc_skb(), and you
dont pull tcp payload into tcp->head.

You might try adding prefetch() as well to bring into cpu cache
IP/TCP headers before they are needed in gro layers.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 6c4f935..435c35e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -96,8 +96,8 @@ 
 /* Receive fragment sizes; we use at most 4 fragments (for 9600 byte MTU
  * and 4K allocations) */
 enum {
-	FRAG_SZ0 = 512 - NET_IP_ALIGN,
-	FRAG_SZ1 = 1024,
+	FRAG_SZ0 = 1536 - NET_IP_ALIGN,
+	FRAG_SZ1 = 2048,
        FRAG_SZ2 = 4096,
        FRAG_SZ3 = MLX4_EN_ALLOC_SIZE
 };