
[net-next,3/3] ipv4: gre: add GRO capability

Message ID 1348750130.5093.1227.camel@edumazet-glaptop
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Sept. 27, 2012, 12:48 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

Add GRO capability to IPv4 GRE tunnels, using the gro_cells
infrastructure.

Tested using IPv4 and IPv6 TCP traffic inside this tunnel, and
checking GRO is building large packets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/ipip.h |    3 +++
 net/ipv4/ip_gre.c  |   13 +++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)
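
The full diff appears at the bottom of this page under "Patch"; condensed, the change wires the gro_cells API into the tunnel in three places (the calls below are taken verbatim from that diff):

/* ipgre_tunnel_init(): set up the gro_cells for this tunnel device */
err = gro_cells_init(&tunnel->gro_cells, dev);

/* ipgre_rcv(): hand the decapsulated skb to GRO instead of netif_rx() */
gro_cells_receive(&tunnel->gro_cells, skb);

/* ipgre_dev_free(): release the cells before the device is freed */
gro_cells_destroy(&tunnel->gro_cells);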




Comments

Jesse Gross Sept. 27, 2012, 5:52 p.m. UTC | #1
On Thu, Sep 27, 2012 at 5:48 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Add GRO capability to IPv4 GRE tunnels, using the gro_cells
> infrastructure.
>
> Tested using IPv4 and IPv6 TCP traffic inside this tunnel, and
> checking GRO is building large packets.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

When I was thinking about doing this, my original plan was to handle
GRO/GSO by extending the current handlers to be able to look inside
GRE and then loop around to process the inner packet (similar to what
is done today with skb_flow_dissect() for RPS).  Is there a reason to
do it in the device?

Pushing it earlier/later in the stack obviously increases the benefit
and it will also be more compatible with the forthcoming OVS tunneling
hooks, which will be flow based and therefore won't have a device.

Also, the next generation of NICs will support this type of thing in
hardware so putting the software versions very close to the NIC will
give us a more similar abstraction.
Eric Dumazet Sept. 27, 2012, 6:08 p.m. UTC | #2
On Thu, 2012-09-27 at 10:52 -0700, Jesse Gross wrote:

> When I was thinking about doing this, my original plan was to handle
> GRO/GSO by extending the current handlers to be able to look inside
> GRE and then loop around to process the inner packet (similar to what
> is done today with skb_flow_dissect() for RPS).  Is there a reason to
> do it in the device?
> 
> Pushing it earlier/later in the stack obviously increases the benefit
> and it will also be more compatible with the forthcoming OVS tunneling
> hooks, which will be flow based and therefore won't have a device.
> 
> Also, the next generation of NICs will support this type of thing in
> hardware so putting the software versions very close to the NIC will
> give us a more similar abstraction.

This sounds not feasible with all kind of tunnels, for example IPIP
tunnels, or UDP encapsulation, at least with current stack (not OVS)

Also note that pushing earlier means forcing the checksumming earlier
and it consumes a lot of cpu cycles. Hopefully NIC will help us in the
future.

Using a napi_struct permits to eventually have separate cpus, and things
like RPS/RSS to split the load.



Eric Dumazet Sept. 27, 2012, 6:19 p.m. UTC | #3
On Thu, 2012-09-27 at 20:08 +0200, Eric Dumazet wrote:

> 
> This sounds not feasible with all kind of tunnels, for example IPIP
> tunnels, or UDP encapsulation, at least with current stack (not OVS)
> 
> Also note that pushing earlier means forcing the checksumming earlier
> and it consumes a lot of cpu cycles. Hopefully NIC will help us in the
> future.
> 
> Using a napi_struct permits to eventually have separate cpus, and things
> like RPS/RSS to split the load.

Also please note that my implementation doesn't bypass the first IP stack
traversal (and firewalling, if any), so it changes nothing in terms of
existing setups.

So packets that should be forwarded will stay as they are (no tunnel
decapsulation/re-encapsulation).

Doing this in the generic GRO layer sounds a bit difficult.


Jesse Gross Sept. 27, 2012, 10:03 p.m. UTC | #4
On Thu, Sep 27, 2012 at 11:08 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-09-27 at 10:52 -0700, Jesse Gross wrote:
>
>> When I was thinking about doing this, my original plan was to handle
>> GRO/GSO by extending the current handlers to be able to look inside
>> GRE and then loop around to process the inner packet (similar to what
>> is done today with skb_flow_dissect() for RPS).  Is there a reason to
>> do it in the device?
>>
>> Pushing it earlier/later in the stack obviously increases the benefit
>> and it will also be more compatible with the forthcoming OVS tunneling
>> hooks, which will be flow based and therefore won't have a device.
>>
>> Also, the next generation of NICs will support this type of thing in
>> hardware so putting the software versions very close to the NIC will
>> give us a more similar abstraction.
>
> This sounds not feasible with all kind of tunnels, for example IPIP
> tunnels, or UDP encapsulation, at least with current stack (not OVS)

Hmm, I think we might be talking about different things since I can't
think of why it wouldn't be feasible (and none of it should be
specific to OVS).  What I was planning would result in the creation of
large but still encapsulated packets.  The merging would be purely
based on the headers in each layer being the same (as GRO is today) so
the logic of the IP stack, UDP stack, etc. isn't processed until
later.

> Also note that pushing earlier means forcing the checksumming earlier
> and it consumes a lot of cpu cycles. Hopefully NIC will help us in the
> future.

It is a good point that if the packet isn't actually destined to us
then probably none of this is worth it (although I suspect that the
relative number of tunnel packets that are passed through vs.
terminated is fairly low).  Many NICs are capable of supplying
CHECKSUM_COMPLETE packets here, even if it is not exposed by the
drivers.

> Using a napi_struct permits to eventually have separate cpus, and things
> like RPS/RSS to split the load.

We should be able to split the load today using RPS since we can look
into the GRE flow once the packet comes off the NIC (assuming that it
is using NAPI).
Jesse Gross Sept. 27, 2012, 10:03 p.m. UTC | #5
On Thu, Sep 27, 2012 at 11:19 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-09-27 at 20:08 +0200, Eric Dumazet wrote:
>
>>
>> This sounds not feasible with all kind of tunnels, for example IPIP
>> tunnels, or UDP encapsulation, at least with current stack (not OVS)
>>
>> Also note that pushing earlier means forcing the checksumming earlier
>> and it consumes a lot of cpu cycles. Hopefully NIC will help us in the
>> future.
>>
>> Using a napi_struct permits to eventually have separate cpus, and things
>> like RPS/RSS to split the load.
>
> Also please note that my implementation doesn't bypass the first IP stack
> traversal (and firewalling, if any), so it changes nothing in terms of
> existing setups.
>
> So packets that should be forwarded will stay as they are (no tunnel
> decapsulation/re-encapsulation).
>
> Doing this in the generic GRO layer sounds a bit difficult.

We wouldn't actually do the decapsulation at the point of GRO.  This
is actually pretty similar to what we do with TCP - we merge TCP
payloads even though we haven't done any real IP processing yet.
However, we do check firewall rules later if we actually hit the IP
stack.  GRE would work the same way in this case.

What I'm describing is pretty much exactly what NICs will be doing, so
if that doesn't work we'll have a problem...
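
For illustration only (not code from this thread): the kind of outer-header
comparison such a GRE-aware gro_receive could do before looping around to the
inner protocol. The struct and function names below are made up.

#include <linux/types.h>

struct gre_hdr {			/* minimal GRE header view, illustration only */
	__be16 flags;
	__be16 proto;			/* EtherType of the encapsulated packet */
};

/* merge two held packets only when their outer GRE headers match exactly */
static bool gre_outer_match(const struct gre_hdr *a, const struct gre_hdr *b)
{
	/* a complete version would also compare the optional key field when
	 * GRE_KEY is set, and give up on sequence numbers or checksums */
	return a->flags == b->flags && a->proto == b->proto;
}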
Eric Dumazet Sept. 28, 2012, 2:04 p.m. UTC | #6
On Thu, 2012-09-27 at 15:03 -0700, Jesse Gross wrote:

> We wouldn't actually do the decapsulation at the point of GRO.  This
> is actually pretty similar to what we do with TCP - we merge TCP
> payloads even though we haven't done any real IP processing yet.
> However, we do check firewall rules later if we actually hit the IP
> stack.  GRE would work the same way in this case.
> 
> What I'm describing is pretty much exactly what NICs will be doing, so
> if that doesn't work we'll have a problem...

GRO's ability to truly aggregate data is limited to certain
workloads. How NICs will handle interleaved flows, I don't really know.

What you describe needs some serious preliminary GRO work, because it
depends on napi_gro_flush() being called from time to time, while we
need something else, more fine-grained.

(I am pretty sure GRO needs some love from us, it looks like some
packets can stay a long time in gro_list. It would be nice if it was
able to reorder packets (from same flow) as well)

Anyway, my changes are self-contained in a new file and non intrusive.

As soon as we can provide a better alternative we can revert them ?

Thanks


Jesse Gross Oct. 1, 2012, 8:56 p.m. UTC | #7
On Fri, Sep 28, 2012 at 7:04 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-09-27 at 15:03 -0700, Jesse Gross wrote:
>
>> We wouldn't actually do the decapsulation at the point of GRO.  This
>> is actually pretty similar to what we do with TCP - we merge TCP
>> payloads even though we haven't done any real IP processing yet.
>> However, we do check firewall rules later if we actually hit the IP
>> stack.  GRE would work the same way in this case.
>>
>> What I'm describing is pretty much exactly what NICs will be doing, so
>> if that doesn't work we'll have a problem...
>
> GRO's ability to truly aggregate data is limited to certain
> workloads. How NICs will handle interleaved flows, I don't really know.
>
> What you describe needs some serious preliminary GRO work, because it
> depends on napi_gro_flush() being called from time to time, while we
> need something else, more fine-grained.
>
> (I am pretty sure GRO needs some love from us, it looks like some
> packets can stay a long time in gro_list. It would be nice if it was
> able to reorder packets (from same flow) as well)

It's definitely possible to improve GRO in a couple of areas.  I'm not
quite sure why you say that these changes are related to tunnels
though, since they're not really different from say, a VLAN tag.

> Anyway, my changes are self-contained in a new file and non intrusive.
>
> As soon as we can provide a better alternative we can revert them ?

Sure, I don't have a problem with your patches for now.  I was just
trying to think about different approaches.
David Miller Oct. 1, 2012, 9:04 p.m. UTC | #8
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 27 Sep 2012 14:48:50 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> Add GRO capability to IPv4 GRE tunnels, using the gro_cells
> infrastructure.
> 
> Tested using IPv4 and IPv6 TCP traffic inside this tunnel, and
> checking GRO is building large packets.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.
Eric Dumazet Oct. 5, 2012, 2:52 p.m. UTC | #9
Current GRO cell is somewhat limited :

- It uses a single list (napi->gro_list) of pending skbs

- This list has a limit of 8 skbs (MAX_GRO_SKBS)

- Workloads with lot of concurrent flows have small GRO hit rate but
  pay high overhead (in inet_gro_receive())

- Increasing MAX_GRO_SKBS is not an option, because GRO
  overhead becomes too high.

- Packets can stay a long time held in GRO cell (there is
  no flush if napi never completes on a stressed cpu)

  Some elephant flows can stall interactive ones (if we receive
  flood of non TCP frames, we dont flush tcp packets waiting in
gro_list)

What we could do :

1) Use a hash to avoid expensive gro_list management and allow
   much more concurrent flows.

Use skb_get_rxhash(skb) to compute rxhash

If l4_rxhash not set -> not a GRO candidate.

If l4_rxhash is set, use a hash lookup to immediately find 'same flow'
candidates.

(tcp stack could eventually use rxhash instead of its custom hash
computation ...)

2) Use a LRU list to eventually be able to 'flush' too old packets,
   even if the napi never completes. Each time we process a new packet,
   being a GRO candidate or not, we increment a napi->sequence, and we
   flush the oldest packet in gro_lru_list if its own sequence is too
   old.

  That would give a latency guarantee.
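
A very rough sketch of what point 1) could look like; gro_hash and
GRO_HASH_SIZE are hypothetical names (today there is only the single
napi->gro_list), and the per-header comparison GRO already does would still
follow the lookup:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define GRO_HASH_SIZE 128		/* hypothetical: buckets replacing the single gro_list */

/* gro_hash is a hypothetical table of GRO_HASH_SIZE sk_buff chains */
static struct sk_buff *gro_lookup_candidate(struct sk_buff **gro_hash,
					    struct sk_buff *skb)
{
	u32 hash = skb_get_rxhash(skb);	/* computes and caches the rxhash if needed */
	struct sk_buff *p;

	if (!skb->l4_rxhash)		/* no L4 hash -> not a GRO candidate */
		return NULL;

	for (p = gro_hash[hash & (GRO_HASH_SIZE - 1)]; p; p = p->next)
		if (p->rxhash == hash)
			return p;	/* possible 'same flow' candidate */

	return NULL;
}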



Rick Jones Oct. 5, 2012, 6:16 p.m. UTC | #10
On 10/05/2012 07:52 AM, Eric Dumazet wrote:
> What we could do :
>
> 1) Use a hash to avoid expensive gro_list management and allow
>     much more concurrent flows.
>
> Use skb_get_rxhash(skb) to compute rxhash
>
> If l4_rxhash not set -> not a GRO candidate.
>
> If l4_rxhash is set, use a hash lookup to immediately find 'same flow'
> candidates.
>
> (tcp stack could eventually use rxhash instead of its custom hash
> computation ...)
>
> 2) Use a LRU list to eventually be able to 'flush' too old packets,
>     even if the napi never completes. Each time we process a new packet,
>     being a GRO candidate or not, we increment a napi->sequence, and we
>     flush the oldest packet in gro_lru_list if its own sequence is too
>     old.
>
>    That would give a latency guarantee.

Flushing things if N packets have come through sounds like goodness, and 
it reminds me a bit of what happens with IP fragment reassembly - 
another area where the stack is trying to guess just how long to 
hang onto a packet before doing something else with it.  But the value 
of N to get a "decent" per-flow GRO aggregation rate will depend on the 
number of concurrent flows, right?  If I want to have a good shot at 
getting 2 segments combined for 1000 active, concurrent flows entering 
my system via that interface, won't N have to approach 2000?

GRO (and HW LRO) has a fundamental limitation/disadvantage here.  GRO 
does provide a very nice "boost" on various situations (especially 
numbers of concurrent netperfs that don't blow-out the tracking limits) 
but since it won't really know anything about the flow(s) involved (*) 
or even their number (?), it will always be guessing.  That is why it is 
really only "poor man's JumboFrames" (or larger MTU - Sadly, the IEEE 
keeps us all beggars here).

A goodly portion of the benefit of GRO comes from the "incidental" ACK 
avoidance it causes yes?  That being the case, might that be a 
worthwhile avenue to explore?   It would then naturally scale as TCP et 
al do today.

When we go to 40 GbE will we have 4x as many flows, or the same number 
of 4x faster flows?

rick jones

* for example - does this TCP segment contain the last byte(s) of a 
pipelined http request/response and the first byte(s) of the next one 
and so should "flush" now?
Eric Dumazet Oct. 5, 2012, 7 p.m. UTC | #11
On Fri, 2012-10-05 at 11:16 -0700, Rick Jones wrote:
> Flushing things if N packets have come through sounds like goodness, and 
> it reminds me a bit of what happens with IP fragment reassembly - 
> another area where the stack is trying to guess just how long to 
> hang onto a packet before doing something else with it.  But the value 
> of N to get a "decent" per-flow GRO aggregation rate will depend on the 
> number of concurrent flows, right?  If I want to have a good shot at 
> getting 2 segments combined for 1000 active, concurrent flows entering 
> my system via that interface, won't N have to approach 2000?
> 

It all depends on the max latency you can afford.

> GRO (and HW LRO) has a fundamental limitation/disadvantage here.  GRO 
> does provide a very nice "boost" on various situations (especially 
> numbers of concurrent netperfs that don't blow-out the tracking limits) 
> but since it won't really know anything about the flow(s) involved (*) 
> or even their number (?), it will always be guessing.  That is why it is 
> really only "poor man's JumboFrames" (or larger MTU - Sadly, the IEEE 
> keeps us all beggars here).
> 
> A goodly portion of the benefit of GRO comes from the "incidental" ACK 
> avoidance it causes yes?  That being the case, might that be a 
> worthwhile avenue to explore?   It would then naturally scale as TCP et 
> al do today.
> 
> When we go to 40 GbE will we have 4x as many flows, or the same number 
> of 4x faster flows?
> 
> rick jones
> 
> * for example - does this TCP segment contain the last byte(s) of a 
> pipelined http request/response and the first byte(s) of the next one 
> and so should "flush" now?

Some remarks :

1) I use some 40Gbe links, thats probably why I try to improve things ;)

2) benefit of GRO can be huge, and not only for the ACK avoidance
   (other tricks could be done for ACK avoidance in the stack)

3) High speeds probably need multiqueue device, and each queue has its
own GRO unit.

  For example on a 40Gbe, 8 queues -> 5Gbps per queue (about 400k
packets/sec)

Lets say we allow no more than 1ms of delay in GRO, this means we could
have about 400 packets in the GRO queue (assuming 1500 bytes packets)
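
(As a quick check of those figures: 5 Gbit/s of 1500-byte packets is
5e9 / (1500 * 8) ~= 416,000 packets/sec, i.e. roughly one packet every
2.4 us, so a 1 ms budget does correspond to about 400 packets held at once.)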

Another idea to play with would be to extend GRO to allow packet
reorder.



Rick Jones Oct. 5, 2012, 7:35 p.m. UTC | #12
On 10/05/2012 12:00 PM, Eric Dumazet wrote:
> On Fri, 2012-10-05 at 11:16 -0700, Rick Jones wrote:
>
> Some remarks :
>
> 1) I use some 40Gbe links, thats probably why I try to improve things ;)

Path length before workarounds :)

> 2) benefit of GRO can be huge, and not only for the ACK avoidance
>     (other tricks could be done for ACK avoidance in the stack)

Just how much code path is there between NAPI and the socket?? (And I 
guess just how much combining are you hoping for?)

> 3) High speeds probably need multiqueue device, and each queue has its
> own GRO unit.
>
>    For example on a 40Gbe, 8 queues -> 5Gbps per queue (about 400k
> packets/sec)
>
> Lets say we allow no more than 1ms of delay in GRO,

OK.  That means we can ignore HPC and FSI because they wouldn't tolerate 
that kind of added delay anyway.  I'm not sure if that also then 
eliminates the networked storage types.

> this means we could have about 400 packets in the GRO queue (assuming
> 1500 bytes packets)

How many flows are you going to have entering via that queue?  And just 
how well "shuffled" will the segments of those flows be?  That is what 
it all comes down to right?  How many (active) flows and how well 
shuffled they are.  If the flows aren't well shuffled, you can get away 
with a smallish coalescing context.  If they are perfectly shuffled and 
greater in number than your delay allowance, you get right back to square 
one, with all the overhead of GRO attempts and none of the benefit.

If the flow count is < 400 to allow a decent shot at a non-zero 
combining rate on well shuffled flows with the 400 packet limit, then 
that means each flow is >= 12.5 Mbit/s on average at 5 Gbit/s 
aggregated.  And I think you then get two segments per flow aggregated 
at a time.  Is that consistent with what you expect to be the 
characteristics of the flows entering via that queue?

rick jones
Eric Dumazet Oct. 5, 2012, 8:06 p.m. UTC | #13
On Fri, 2012-10-05 at 12:35 -0700, Rick Jones wrote:

> Just how much code path is there between NAPI and the socket?? (And I 
> guess just how much combining are you hoping for?)
> 

When GRO correctly works, you can save about 30% of cpu cycles, it
depends...

Doubling MAX_SKB_FRAGS (allowing 32+1 MSS per GRO skb instead of 16+1)
gives an improvement as well...

> > Lets say we allow no more than 1ms of delay in GRO,
> 
> OK.  That means we can ignore HPC and FSI because they wouldn't tolerate 
> that kind of added delay anyway.  I'm not sure if that also then 
> eliminates the networked storage types.
> 

I took this 1ms delay, but I never said it was a fixed value ;)

Also remember one thing: this is the _max_ delay, in case your napi
handler is flooded. This almost never happens (tm)


> > this means we could have about 400 packets in the GRO queue (assuming
> > 1500 bytes packets)
> 
> How many flows are you going to have entering via that queue?  And just 
> how well "shuffled" will the segments of those flows be?  That is what 
> it all comes down to right?  How many (active) flows and how well 
> shuffled they are.  If the flows aren't well shuffled, you can get away 
> with a smallish coalescing context.  If they are perfectly shuffled and 
> greater in number than your delay allowance, you get right back to square 
> one, with all the overhead of GRO attempts and none of the benefit.

Not sure what you mean by shuffle. We use a hash table to locate a flow,
but we also have a LRU list to get the packets ordered by their entry in
the 'GRO unit'.

If napi completes, all the LRU list content is flushed to the IP stack
(napi_gro_flush()).

If napi doesn't complete, we would only flush 'too old' packets found in
the LRU.

Note: this selective flush can be called once per napi run from
net_rx_action(). Extra cost to get a somewhat precise timestamp
would be acceptable (one call to ktime_get() or get_cycles() every 64
packets)

This timestamp could be stored in napi->timestamp and done once per
n->poll(n, weight) call.
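
A sketch of that selective flush: the list and ktime helpers are real kernel
APIs, but gro_lru_entry, the 'enqueued' field and napi_gro_flush_one() are
made-up names, just to show the shape of the idea.

#include <linux/ktime.h>
#include <linux/list.h>
#include <linux/skbuff.h>

struct gro_lru_entry {			/* hypothetical per-held-packet bookkeeping */
	struct list_head lru;		/* kept ordered, oldest first */
	ktime_t enqueued;		/* when the packet entered the GRO unit */
	struct sk_buff *skb;
};

/* 'now' would come from the single ktime_get() done per n->poll(n, weight) run */
static void gro_flush_too_old(struct list_head *gro_lru, ktime_t now, s64 max_age_ns)
{
	struct gro_lru_entry *e, *tmp;

	list_for_each_entry_safe(e, tmp, gro_lru, lru) {
		if (ktime_to_ns(ktime_sub(now, e->enqueued)) < max_age_ns)
			break;			/* everything after this is younger */
		list_del(&e->lru);
		napi_gro_flush_one(e->skb);	/* hypothetical: hand the held skb to the IP stack */
	}
}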

> 
> If the flow count is < 400 to allow a decent shot at a non-zero 
> combining rate on well shuffled flows with the 400 packet limit, then 
> that means each flow is >= 12.5 Mbit/s on average at 5 Gbit/s 
> aggregated.  And I think you then get two segments per flow aggregated 
> at a time.  Is that consistent with what you expect to be the 
> characteristics of the flows entering via that queue?

If a packet cant stay more than 1ms, then a flow sending less than 1000
packets per second wont benefit from GRO.

So yes, 12.5 Mbit/s would be the threshold.

By the way, when TCP timestamps are used, and hosts are linux machines
with HZ=1000, current GRO can not coalesce packets anyway because their
TCP options are different.

(So it would be not useful trying bigger sojourn time than 1ms)



Herbert Xu Oct. 6, 2012, 4:11 a.m. UTC | #14
On Fri, Oct 05, 2012 at 04:52:27PM +0200, Eric Dumazet wrote:
> Current GRO cell is somewhat limited :
> 
> - It uses a single list (napi->gro_list) of pending skbs
> 
> - This list has a limit of 8 skbs (MAX_GRO_SKBS)
> 
> - Workloads with lot of concurrent flows have small GRO hit rate but
>   pay high overhead (in inet_gro_receive())
> 
> - Increasing MAX_GRO_SKBS is not an option, because GRO
>   overhead becomes too high.

Yeah these were all meant to be addressed at some point.

> - Packets can stay a long time held in GRO cell (there is
>   no flush if napi never completes on a stressed cpu)

This should never happen though.  NAPI runs must always be
punctuated just to guarantee one card never hogs a CPU.  Which
driver causes these behaviour?

>   Some elephant flows can stall interactive ones (if we receive
>   flood of non TCP frames, we dont flush tcp packets waiting in
> gro_list)

Again this should never be a problem given the natural limit
on backlog processing.

> What we could do :
> 
> 1) Use a hash to avoid expensive gro_list management and allow
>    much more concurrent flows.
> 
> Use skb_get_rxhash(skb) to compute rxhash
> 
> If l4_rxhash not set -> not a GRO candidate.
> 
> If l4_rxhash is set, use a hash lookup to immediately find 'same flow'
> candidates.
> 
> (tcp stack could eventually use rxhash instead of its custom hash
> computation ...)

Sounds good to me.

> 2) Use a LRU list to eventually be able to 'flush' too old packets,
>    even if the napi never completes. Each time we process a new packet,
>    being a GRO candidate or not, we increment a napi->sequence, and we
>    flush the oldest packet in gro_lru_list if its own sequence is too
>    old.
> 
>   That would give a latency guarantee.

I don't think this should ever be necessary.  IOW, if we need this
for GRO, then it means that we also need it for NAPI for the exact
same reasons.

Cheers,
Eric Dumazet Oct. 6, 2012, 5:08 a.m. UTC | #15
Le samedi 06 octobre 2012 à 12:11 +0800, Herbert Xu a écrit :
> On Fri, Oct 05, 2012 at 04:52:27PM +0200, Eric Dumazet wrote:
> > Current GRO cell is somewhat limited :
> > 
> > - It uses a single list (napi->gro_list) of pending skbs
> > 
> > - This list has a limit of 8 skbs (MAX_GRO_SKBS)
> > 
> > - Workloads with lot of concurrent flows have small GRO hit rate but
> >   pay high overhead (in inet_gro_receive())
> > 
> > - Increasing MAX_GRO_SKBS is not an option, because GRO
> >   overhead becomes too high.
> 
> Yeah these were all meant to be addressed at some point.
> 
> > - Packets can stay a long time held in GRO cell (there is
> >   no flush if napi never completes on a stressed cpu)
> 
> This should never happen though.  NAPI runs must always be
> punctuated just to guarantee one card never hogs a CPU.  Which
> driver causes these behaviour?

I believe its a generic issue, not specific to a driver.

napi_gro_flush() is only called from napi_complete() 

Some drivers (marvell/skge.c & realtek/8139cp.c) calls it only because
they 'inline' napi_complete()



Herbert Xu Oct. 6, 2012, 5:14 a.m. UTC | #16
On Sat, Oct 06, 2012 at 07:08:46AM +0200, Eric Dumazet wrote:
> Le samedi 06 octobre 2012 à 12:11 +0800, Herbert Xu a écrit :
> > On Fri, Oct 05, 2012 at 04:52:27PM +0200, Eric Dumazet wrote:
> > > Current GRO cell is somewhat limited :
> > > 
> > > - It uses a single list (napi->gro_list) of pending skbs
> > > 
> > > - This list has a limit of 8 skbs (MAX_GRO_SKBS)
> > > 
> > > - Workloads with lot of concurrent flows have small GRO hit rate but
> > >   pay high overhead (in inet_gro_receive())
> > > 
> > > - Increasing MAX_GRO_SKBS is not an option, because GRO
> > >   overhead becomes too high.
> > 
> > Yeah these were all meant to be addressed at some point.
> > 
> > > - Packets can stay a long time held in GRO cell (there is
> > >   no flush if napi never completes on a stressed cpu)
> > 
> > This should never happen though.  NAPI runs must always be
> > punctuated just to guarantee one card never hogs a CPU.  Which
> > driver causes these behaviour?
> 
> I believe its a generic issue, not specific to a driver.
> 
> napi_gro_flush() is only called from napi_complete() 
> 
> Some drivers (marvell/skge.c & realtek/8139cp.c) calls it only because
> they 'inline' napi_complete()

So which driver has the potential of never doing napi_gro_flush?

Cheers,
Eric Dumazet Oct. 6, 2012, 6:22 a.m. UTC | #17
On Sat, 2012-10-06 at 13:14 +0800, Herbert Xu wrote:
> On Sat, Oct 06, 2012 at 07:08:46AM +0200, Eric Dumazet wrote:
> > Le samedi 06 octobre 2012 à 12:11 +0800, Herbert Xu a écrit :
> > > On Fri, Oct 05, 2012 at 04:52:27PM +0200, Eric Dumazet wrote:
> > > > Current GRO cell is somewhat limited :
> > > > 
> > > > - It uses a single list (napi->gro_list) of pending skbs
> > > > 
> > > > - This list has a limit of 8 skbs (MAX_GRO_SKBS)
> > > > 
> > > > - Workloads with lot of concurrent flows have small GRO hit rate but
> > > >   pay high overhead (in inet_gro_receive())
> > > > 
> > > > - Increasing MAX_GRO_SKBS is not an option, because GRO
> > > >   overhead becomes too high.
> > > 
> > > Yeah these were all meant to be addressed at some point.
> > > 
> > > > - Packets can stay a long time held in GRO cell (there is
> > > >   no flush if napi never completes on a stressed cpu)
> > > 
> > > This should never happen though.  NAPI runs must always be
> > > punctuated just to guarantee one card never hogs a CPU.  Which
> > > driver causes these behaviour?
> > 
> > I believe its a generic issue, not specific to a driver.
> > 
> > napi_gro_flush() is only called from napi_complete() 
> > 
> > Some drivers (marvell/skge.c & realtek/8139cp.c) calls it only because
> > they 'inline' napi_complete()
> 
> So which driver has the potential of never doing napi_gro_flush?

All drivers.

If the napi->poll() consumes all budget, we dont call napi_complete()



Eric Dumazet Oct. 6, 2012, 7 a.m. UTC | #18
On Sat, 2012-10-06 at 08:22 +0200, Eric Dumazet wrote:

> All drivers.
> 
> If the napi->poll() consumes all budget, we dont call napi_complete()

Probably nobody noticed, because TCP stack retransmits packets
eventually.

But this adds unexpected latencies.

I'll send a patch.



Herbert Xu Oct. 6, 2012, 10:56 a.m. UTC | #19
On Sat, Oct 06, 2012 at 09:00:25AM +0200, Eric Dumazet wrote:
> On Sat, 2012-10-06 at 08:22 +0200, Eric Dumazet wrote:
> 
> > All drivers.
> > 
> > If the napi->poll() consumes all budget, we dont call napi_complete()
> 
> Probably nobody noticed, because TCP stack retransmits packets
> eventually.
> 
> But this adds unexpected latencies.
> 
> I'll send a patch.

Yeah we should definitely do a flush in that case.  Thanks!
Rick Jones Oct. 8, 2012, 4:40 p.m. UTC | #20
On 10/05/2012 01:06 PM, Eric Dumazet wrote:
> On Fri, 2012-10-05 at 12:35 -0700, Rick Jones wrote:
>
>> Just how much code path is there between NAPI and the socket?? (And I
>> guess just how much combining are you hoping for?)
>>
>
> When GRO correctly works, you can save about 30% of cpu cycles, it
> depends...
>
> Doubling MAX_SKB_FRAGS (allowing 32+1 MSS per GRO skb instead of 16+1)
> gives an improvement as well...

OK, but how much of that 30% come from where?  Each coalesced segment is 
saving the cycles between NAPI and the socket.  Each avoided ACK is 
saving the cycles from TCP to the bottom of the driver and a (share of) 
transmit completion.


> I took this 1ms delay, but I never said it was a fixed value ;)
>
> Also remember one thing: this is the _max_ delay, in case your napi
> handler is flooded. This almost never happens (tm)

We can still ignore the FSI types and probably the HPC types because 
they will insist on never happens (tm) :)


>
> Not sure what you mean by shuffle. We use a hash table to locate a flow,
> but we also have a LRU list to get the packets ordered by their entry in
> the 'GRO unit'.

Whe I say shuffle I mean something along the lines of interleave.  So, 
if we have four flows, 1-4, a perfect shuffle of their segments would be 
something like:

1 2 3 4 1 2 3 4 1 2 3 4

but not well shuffled might look like

1 1 3 2 3 2 4 4 4 1 3 2

rick

Eric Dumazet Oct. 8, 2012, 4:59 p.m. UTC | #21
On Mon, 2012-10-08 at 09:40 -0700, Rick Jones wrote:
> On 10/05/2012 01:06 PM, Eric Dumazet wrote:
> > On Fri, 2012-10-05 at 12:35 -0700, Rick Jones wrote:
> >
> >> Just how much code path is there between NAPI and the socket?? (And I
> >> guess just how much combining are you hoping for?)
> >>
> >
> > When GRO correctly works, you can save about 30% of cpu cycles, it
> > depends...
> >
> > Doubling MAX_SKB_FRAGS (allowing 32+1 MSS per GRO skb instead of 16+1)
> > gives an improvement as well...
> 
> OK, but how much of that 30% come from where?  Each coalesced segment is 
> saving the cycles between NAPI and the socket.  Each avoided ACK is 
> saving the cycles from TCP to the bottom of the driver and a (share of) 
> transmit completion.

It comes from the fact that you have less competition between the Bottom
Half handler and the application on the socket lock, not counting all the
layers that we have to cross (IP, netfilter ...).

Each time a TCP packet is delivered while the socket is owned by the user,
the packet is placed on a special 'backlog queue', and the application has
to process this packet right before releasing the socket lock. It sucks
because it adds latency, and other frames keep being queued to the backlog
while the application processes it (very expensive because of cache line
misses).

So GRO really makes this kind of event less probable.

> 
> Whe I say shuffle I mean something along the lines of interleave.  So, 
> if we have four flows, 1-4, a perfect shuffle of their segments would be 
> something like:
> 
> 1 2 3 4 1 2 3 4 1 2 3 4
> 
> but not well shuffled might look like
> 
> 1 1 3 2 3 2 4 4 4 1 3 2
> 

If all these packets are delivered in the same NAPI run, and correctly
aggregated, their order doesnt matter.

In first case, we will deliver  B1, B2, B3, B4   (B being a GRO packet
with 3 MSS)

In second case we will deliver

B1 B3 B2 B4



Rick Jones Oct. 8, 2012, 5:49 p.m. UTC | #22
On 10/08/2012 09:59 AM, Eric Dumazet wrote:
> On Mon, 2012-10-08 at 09:40 -0700, Rick Jones wrote:
>> Whe I say shuffle I mean something along the lines of interleave.  So,
>> if we have four flows, 1-4, a perfect shuffle of their segments would be
>> something like:
>>
>> 1 2 3 4 1 2 3 4 1 2 3 4
>>
>> but not well shuffled might look like
>>
>> 1 1 3 2 3 2 4 4 4 1 3 2
>>
>
> If all these packets are delivered in the same NAPI run, and correctly
> aggregated, their order doesnt matter.
>
> In first case, we will deliver  B1, B2, B3, B4   (B being a GRO packet
> with 3 MSS)
>
> In second case we will deliver
>
> B1 B3 B2 B4

So, with my term shuffle better defined, let's circle back to your 
proposal to try to GRO-service a very much larger group of flows, with a 
"flush queued packets older than N packets" heuristic as part of the 
latency minimization.  If N were 2 there (half the number of flows), the 
"perfect" shuffle doesn't get aggregated at all, right?  N would have to 
be 4, or the number of concurrent flows.  What I'm trying to get at is 
just how many concurrent flows you are trying to get GRO to scale to, 
and whether at that level you have asymptotically approached having a 
hash/retained state that is, basically, a duplicate of what is happening 
in TCP.

rick
Eric Dumazet Oct. 8, 2012, 5:55 p.m. UTC | #23
On Mon, 2012-10-08 at 10:49 -0700, Rick Jones wrote:

> So, with my term shuffle better defined, let's circle back to your 
> proposal to try to GRO-service a very much larger group of flows, with a 
> "flush queued packets older than N packets" heuristic as part of the 
> latency minimization.  If N were 2 there (half the number of flows), the 
> "perfect" shuffle doesn't get aggregated at all, right?  N would have to 
> be 4, or the number of concurrent flows.  What I'm trying to get at is 
> just how many concurrent flows you are trying to get GRO to scale to, 
> and whether at that level you have asymptotically approached having a 
> hash/retained state that is, basically, a duplicate of what is happening 
> in TCP.
> 

I didn't say "flush queued packets older than N packets"; instead I
suggested using a time limit, eventually a sysctl.

"flush packet if the oldest part of it is aged by xxx us"

Say the default would be 500 us

If your hardware is capable of receiving 2 packets per us, this would
allow 1000 packets in GRO queue. So using a hash table with 128 slots,
and keeping the current limit of 8 entries per bucket would allow this
kind of workload.
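
(Checking that sizing: 128 slots * 8 entries per bucket = 1024 held packets,
which covers the ~1000 packets that 2 packets/us over 500 us works out to.)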

If your hardware is capable of receiving one packet per us, this would
allow 50 packets in GRO queue.



Eric Dumazet Oct. 8, 2012, 5:56 p.m. UTC | #24
On Mon, 2012-10-08 at 19:55 +0200, Eric Dumazet wrote:

> If your hardware is capable of receiving one packet per us, this would
> allow 50 packets in GRO queue.

Sorry, I meant : 10 us per packet.



Rick Jones Oct. 8, 2012, 6:21 p.m. UTC | #25
On 10/08/2012 10:55 AM, Eric Dumazet wrote:
> On Mon, 2012-10-08 at 10:49 -0700, Rick Jones wrote:
>
>> So, with my term shuffle better defined, let's circle back to your
>> proposal to try to GRO-service a very much larger group of flows, with a
>> "flush queued packets older than N packets" heuristic as part of the
>> latency minimization.  If N were 2 there - half the number of flows, the
>> "perfect" shuffle" doesn't get aggregated at all right? N would have to
>> be 4 or the number of concurrent flows.   What I'm trying to get at is
>> just to how many concurrent flows you are trying to get GRO to scale,
>> and whether at that level you have asymptotically approached having a
>> hash/retained state that is, basically, a duplicate of what is happening
>> in TCP.
>>
>
> I didn't say "flush queued packets older than N packets"; instead I
> suggested using a time limit, eventually a sysctl.

Did I then mis-interpret:

> 2) Use a LRU list to eventually be able to 'flush' too old packets,
>    even if the napi never completes. Each time we process a new packet,
>    being a GRO candidate or not, we increment a napi->sequence, and we
>    flush the oldest packet in gro_lru_list if its own sequence is too
>    old.

in your initial RFC email?  Because I took that as flushing a given 
packet if N packets have come through since that packet was queued to 
await coalescing.

rick
Eric Dumazet Oct. 8, 2012, 6:28 p.m. UTC | #26
On Mon, 2012-10-08 at 11:21 -0700, Rick Jones wrote:

> Did I then mis-interpret:
> 
> > 2) Use a LRU list to eventually be able to 'flush' too old packets,
> >    even if the napi never completes. Each time we process a new packet,
> >    being a GRO candidate or not, we increment a napi->sequence, and we
> >    flush the oldest packet in gro_lru_list if its own sequence is too
> >    old.
> 
> in your initial RFC email?  Because I took that as flushing a given 
> packet if N packets have come through since that packet was queued to 
> await coalescing.

Yes, this was refined in the patch I sent.

It currently uses jiffies, but my plan is to try get_cycles() or
ktime_get() after some experiments, allowing cycle/ns resolution.

(One ktime_get() per napi batch seems OK)




Patch

diff --git a/include/net/ipip.h b/include/net/ipip.h
index a93cf6d..ddc077c 100644
--- a/include/net/ipip.h
+++ b/include/net/ipip.h
@@ -2,6 +2,7 @@ 
 #define __NET_IPIP_H 1
 
 #include <linux/if_tunnel.h>
+#include <net/gro_cells.h>
 #include <net/ip.h>
 
 /* Keep error state on tunnel for 30 sec */
@@ -36,6 +37,8 @@  struct ip_tunnel {
 #endif
 	struct ip_tunnel_prl_entry __rcu *prl;		/* potential router list */
 	unsigned int			prl_count;	/* # of entries in PRL */
+
+	struct gro_cells		gro_cells;
 };
 
 struct ip_tunnel_prl_entry {
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index f233c1d..1f00b30 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -714,8 +714,7 @@  static int ipgre_rcv(struct sk_buff *skb)
 		skb_reset_network_header(skb);
 		ipgre_ecn_decapsulate(iph, skb);
 
-		netif_rx(skb);
-
+		gro_cells_receive(&tunnel->gro_cells, skb);
 		rcu_read_unlock();
 		return 0;
 	}
@@ -1296,6 +1295,9 @@  static const struct net_device_ops ipgre_netdev_ops = {
 
 static void ipgre_dev_free(struct net_device *dev)
 {
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+
+	gro_cells_destroy(&tunnel->gro_cells);
 	free_percpu(dev->tstats);
 	free_netdev(dev);
 }
@@ -1327,6 +1329,7 @@  static int ipgre_tunnel_init(struct net_device *dev)
 {
 	struct ip_tunnel *tunnel;
 	struct iphdr *iph;
+	int err;
 
 	tunnel = netdev_priv(dev);
 	iph = &tunnel->parms.iph;
@@ -1353,6 +1356,12 @@  static int ipgre_tunnel_init(struct net_device *dev)
 	if (!dev->tstats)
 		return -ENOMEM;
 
+	err = gro_cells_init(&tunnel->gro_cells, dev);
+	if (err) {
+		free_percpu(dev->tstats);
+		return err;
+	}
+
 	return 0;
 }