
vxlan: gro not effective for intel 82599

Message ID 5981772fe36e64f8fec5997a4c7aa08f@imap.linux.ibm.com
State RFC, archived
Delegated to: David Miller

Commit Message

Ramu Ramamurthy June 26, 2015, 12:03 a.m. UTC
Problem:
-------

GRO is enabled on the interfaces in the following test, but GRO does not
take effect for vxlan-encapsulated TCP streams. The root cause is described
below.

VM nic (mtu 1450)---bridge---vxlan----10Gb nic (intel 82599ES)-----|
VM nic (mtu 1450)---bridge---vxlan----10Gb nic (intel 82599ES)-----|

Because GRO is not effective, the throughput for a vxlan-encapsulated
TCP stream is around 3 Gbps.

With the proposed patch, GRO takes effect for vxlan-encapsulated TCP
streams, and throughput in the same test is around 8.6 Gbps.


Root Cause:
----------


At entry to udp4_gro_receive(), the gro parameters are set as follows:

     skb->ip_summed  == 0 (CHECKSUM_NONE)
     NAPI_GRO_CB(skb)->csum_cnt == 0
     NAPI_GRO_CB(skb)->csum_valid == 0

     The UDP header checksum is 0.

static struct sk_buff **udp4_gro_receive(struct sk_buff **head,
					 struct sk_buff *skb)
{

          <snip>

	if (skb_gro_checksum_validate_zero_check(skb, IPPROTO_UDP, uh->check,
						 inet_gro_compute_pseudo))

>>>             This calls __skb_incr_checksum_unnecessary which sets
>>>                     skb->ip_summed to  CHECKSUM_UNNECESSARY
>>> 

		goto flush;
	else if (uh->check)
		skb_gro_checksum_try_convert(skb, IPPROTO_UDP, uh->check,
					     inet_gro_compute_pseudo);
skip:
	NAPI_GRO_CB(skb)->is_ipv6 = 0;
	return udp_gro_receive(head, skb, uh);

}
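
(For reference, the helper named in the annotation above looks roughly as
follows; this is a paraphrase of __skb_incr_checksum_unnecessary() from
include/linux/skbuff.h of this era, shown only for context, not part of the
patch.)

static inline void __skb_incr_checksum_unnecessary(struct sk_buff *skb)
{
	if (skb->ip_summed == CHECKSUM_NONE) {
		/* First software-validated checksum: mark the skb verified. */
		skb->ip_summed = CHECKSUM_UNNECESSARY;
		skb->csum_level = 0;
	} else if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
		/* Further validated levels only bump csum_level. */
		if (skb->csum_level < SKB_MAX_CSUM_LEVELS)
			skb->csum_level++;
	}
	/* Note: NAPI_GRO_CB(skb)->csum_cnt is not touched here, which is
	 * why the bail-out condition in udp_gro_receive() below still
	 * triggers.
	 */
}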

struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff 
*skb,
				 struct udphdr *uh)
{
	struct udp_offload_priv *uo_priv;
	struct sk_buff *p, **pp = NULL;
	struct udphdr *uh2;
	unsigned int off = skb_gro_offset(skb);
	int flush = 1;

	if (NAPI_GRO_CB(skb)->udp_mark ||
	    (skb->ip_summed != CHECKSUM_PARTIAL &&
	     NAPI_GRO_CB(skb)->csum_cnt == 0 &&
	     !NAPI_GRO_CB(skb)->csum_valid))
		goto out;
>>> 
>>>      vxlan GRO gets skipped due to the above condition, because here:
>>>          skb->ip_summed == CHECKSUM_UNNECESSARY
>>>          NAPI_GRO_CB(skb)->csum_cnt == 0
>>>          NAPI_GRO_CB(skb)->csum_valid == 0

There is no reason to skip vxlan GRO under the above combination of
conditions, because tcp4_gro_receive() validates the inner TCP checksum
anyway.
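
(For context, tcp4_gro_receive() from net/ipv4/tcp_offload.c in this era
looks roughly like the following paraphrased sketch; it is shown only to
illustrate where the inner TCP checksum gets validated in the GRO path.)

static struct sk_buff **tcp4_gro_receive(struct sk_buff **head,
					 struct sk_buff *skb)
{
	/* Don't bother verifying checksum if we're going to flush anyway. */
	if (!NAPI_GRO_CB(skb)->flush &&
	    skb_gro_checksum_validate(skb, IPPROTO_TCP,
				      inet_gro_compute_pseudo)) {
		/* Inner checksum failed: flush, do not aggregate. */
		NAPI_GRO_CB(skb)->flush = 1;
		return NULL;
	}

	return tcp_gro_receive(head, skb);
}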


Patch:
------

Signed-off-by: Ramu Ramamurthy <ramu.ramamurthy@us.ibm.com>
---
  net/ipv4/udp_offload.c |    1 +
  1 files changed, 1 insertions(+), 0 deletions(-)


Comments

Tom Herbert June 26, 2015, 12:20 a.m. UTC | #1
On Thu, Jun 25, 2015 at 5:03 PM, Ramu Ramamurthy
<sramamur@linux.vnet.ibm.com> wrote:
> Problem:
> -------
>
> GRO is enabled on the interfaces in the following test,
> but GRO does not take effect for vxlan-encapsulated tcp streams. The root
> cause of why GRO does not take effect is described below.
>
> VM nic (mtu 1450)---bridge---vxlan----10Gb nic (intel 82599ES)-----|
> VM nic (mtu 1450)---bridge---vxlan----10Gb nic (intel 82599ES)-----|
>
> Because gro is not effective, the throughput for vxlan-encapsulated
> tcp-stream is around 3 Gbps.
>
> With the proposed patch, gro takes effect for vxlan-encapsulated tcp
> streams,
> and performance in the same test is around 8.6 Gbps.
>
>
> Root Cause:
> ----------
>
>
> At entry to udp4_gro_receive(), the gro parameters are set as follows:
>
>     skb->ip_summed  == 0 (CHECKSUM_NONE)
>     NAPI_GRO_CB(skb)->csum_cnt == 0
>     NAPI_GRO_CB(skb)->csum_valid == 0
>
>     UDH header checksum is 0.
>
> static struct sk_buff **udp4_gro_receive(struct sk_buff **head,
>                                          struct sk_buff *skb)
> {
>
>          <snip>
>
>         if (skb_gro_checksum_validate_zero_check(skb, IPPROTO_UDP,
> uh->check,
>                                                  inet_gro_compute_pseudo))
>
>>>>             This calls __skb_incr_checksum_unnecessary which sets
>>>>                     skb->ip_summed to  CHECKSUM_UNNECESSARY
>>>>
>
>                 goto flush;
>         else if (uh->check)
>                 skb_gro_checksum_try_convert(skb, IPPROTO_UDP, uh->check,
>                                              inet_gro_compute_pseudo);
> skip:
>         NAPI_GRO_CB(skb)->is_ipv6 = 0;
>         return udp_gro_receive(head, skb, uh);
>
> }
>
> struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *skb,
>                                  struct udphdr *uh)
> {
>         struct udp_offload_priv *uo_priv;
>         struct sk_buff *p, **pp = NULL;
>         struct udphdr *uh2;
>         unsigned int off = skb_gro_offset(skb);
>         int flush = 1;
>
>         if (NAPI_GRO_CB(skb)->udp_mark ||
>             (skb->ip_summed != CHECKSUM_PARTIAL &&
>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
>              !NAPI_GRO_CB(skb)->csum_valid))
>                 goto out;
>>>>
>>>>
>>>>      vxlan GRO gets skipped due to the above condition because here,:
>>>>          skb->ip_summed == CHECKSUM_UNNECESSARY
>>>>          NAPI_GRO_CB(skb)->csum_cnt == 0
>>>>          NAPI_GRO_CB(skb)->csum_valid == 0
>
>
> There is no reason for skipping vxlan gro in the above combination of
> conditions,
> because, tcp4_gro_receive() validates the inner tcp checksum anyway !
>
>
> Patch:
> ------
>
> Signed-off-by: Ramu Ramamurthy <ramu.ramamurthy@us.ibm.com>
> ---
>  net/ipv4/udp_offload.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index f938616..17fc12b 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -301,6 +301,7 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head,
> struct sk_buff *skb,
>
>         if (NAPI_GRO_CB(skb)->udp_mark ||
>             (skb->ip_summed != CHECKSUM_PARTIAL &&
> +            skb->ip_summed != CHECKSUM_UNNECESSARY &&
>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
>              !NAPI_GRO_CB(skb)->csum_valid))
>                 goto out;
> --

This isn't right. CHECKSUM_UNNECESSARY here refers only to the outer
checksum, which is zero in this case and is therefore trivially
unnecessary. The inner checksum still needs to be computed on the host. By
convention, we do not do GRO if the inner checksum would have to be
computed (csum_cnt == 0 checks for that). If we want to allow checksum
calculation to occur in the GRO path, meaning we understand the
ramifications and can show this is better for performance, then all of the
checksum checks here should be removed.
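
(For reference, the per-skb GRO checksum state is initialized roughly as
follows; this is a paraphrase of the setup in dev_gro_receive() from
net/core/dev.c of this era. It shows why csum_cnt is 0 when the NIC reports
CHECKSUM_NONE; the later flip of ip_summed to CHECKSUM_UNNECESSARY by the
zero-checksum check does not change csum_cnt.)

	/* Setup for GRO checksum validation */
	switch (skb->ip_summed) {
	case CHECKSUM_COMPLETE:
		NAPI_GRO_CB(skb)->csum = skb->csum;
		NAPI_GRO_CB(skb)->csum_valid = 1;
		NAPI_GRO_CB(skb)->csum_cnt = 0;
		break;
	case CHECKSUM_UNNECESSARY:
		/* csum_cnt counts checksum levels the hardware verified. */
		NAPI_GRO_CB(skb)->csum_cnt = skb->csum_level + 1;
		NAPI_GRO_CB(skb)->csum_valid = 0;
		break;
	default:
		/* CHECKSUM_NONE (e.g. the 82599): nothing verified by hw. */
		NAPI_GRO_CB(skb)->csum_cnt = 0;
		NAPI_GRO_CB(skb)->csum_valid = 0;
	}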

> 1.7.1
>
>
>
>
>
> Notes:
> -------
>
> The above gro fix applies to all udp-encapsulation protocols (vxlan, geneve)
>
>
>
>
Ramu Ramamurthy June 26, 2015, 1:06 a.m. UTC | #2
On 2015-06-25 17:20, Tom Herbert wrote:
> On Thu, Jun 25, 2015 at 5:03 PM, Ramu Ramamurthy
> <sramamur@linux.vnet.ibm.com> wrote:
>> Problem:
>> -------
>> 
>> GRO is enabled on the interfaces in the following test,
>> but GRO does not take effect for vxlan-encapsulated tcp streams. The 
>> root
>> cause of why GRO does not take effect is described below.
>> 
>> VM nic (mtu 1450)---bridge---vxlan----10Gb nic (intel 82599ES)-----|
>> VM nic (mtu 1450)---bridge---vxlan----10Gb nic (intel 82599ES)-----|
>> 
>> Because gro is not effective, the throughput for vxlan-encapsulated
>> tcp-stream is around 3 Gbps.
>> 
>> With the proposed patch, gro takes effect for vxlan-encapsulated tcp
>> streams,
>> and performance in the same test is around 8.6 Gbps.
>> 
>> 
>> Root Cause:
>> ----------
>> 
>> 
>> At entry to udp4_gro_receive(), the gro parameters are set as follows:
>> 
>>     skb->ip_summed  == 0 (CHECKSUM_NONE)
>>     NAPI_GRO_CB(skb)->csum_cnt == 0
>>     NAPI_GRO_CB(skb)->csum_valid == 0
>> 
>>     UDH header checksum is 0.
>> 
>> static struct sk_buff **udp4_gro_receive(struct sk_buff **head,
>>                                          struct sk_buff *skb)
>> {
>> 
>>          <snip>
>> 
>>         if (skb_gro_checksum_validate_zero_check(skb, IPPROTO_UDP,
>> uh->check,
>>                                                  
>> inet_gro_compute_pseudo))
>> 
>>>>>             This calls __skb_incr_checksum_unnecessary which sets
>>>>>                     skb->ip_summed to  CHECKSUM_UNNECESSARY
>>>>> 
>> 
>>                 goto flush;
>>         else if (uh->check)
>>                 skb_gro_checksum_try_convert(skb, IPPROTO_UDP, 
>> uh->check,
>>                                              inet_gro_compute_pseudo);
>> skip:
>>         NAPI_GRO_CB(skb)->is_ipv6 = 0;
>>         return udp_gro_receive(head, skb, uh);
>> 
>> }
>> 
>> struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff 
>> *skb,
>>                                  struct udphdr *uh)
>> {
>>         struct udp_offload_priv *uo_priv;
>>         struct sk_buff *p, **pp = NULL;
>>         struct udphdr *uh2;
>>         unsigned int off = skb_gro_offset(skb);
>>         int flush = 1;
>> 
>>         if (NAPI_GRO_CB(skb)->udp_mark ||
>>             (skb->ip_summed != CHECKSUM_PARTIAL &&
>>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
>>              !NAPI_GRO_CB(skb)->csum_valid))
>>                 goto out;
>>>>> 
>>>>> 
>>>>>      vxlan GRO gets skipped due to the above condition because 
>>>>> here,:
>>>>>          skb->ip_summed == CHECKSUM_UNNECESSARY
>>>>>          NAPI_GRO_CB(skb)->csum_cnt == 0
>>>>>          NAPI_GRO_CB(skb)->csum_valid == 0
>> 
>> 
>> There is no reason for skipping vxlan gro in the above combination of
>> conditions,
>> because, tcp4_gro_receive() validates the inner tcp checksum anyway !
>> 
>> 
>> Patch:
>> ------
>> 
>> Signed-off-by: Ramu Ramamurthy <ramu.ramamurthy@us.ibm.com>
>> ---
>>  net/ipv4/udp_offload.c |    1 +
>>  1 files changed, 1 insertions(+), 0 deletions(-)
>> 
>> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
>> index f938616..17fc12b 100644
>> --- a/net/ipv4/udp_offload.c
>> +++ b/net/ipv4/udp_offload.c
>> @@ -301,6 +301,7 @@ struct sk_buff **udp_gro_receive(struct sk_buff 
>> **head,
>> struct sk_buff *skb,
>> 
>>         if (NAPI_GRO_CB(skb)->udp_mark ||
>>             (skb->ip_summed != CHECKSUM_PARTIAL &&
>> +            skb->ip_summed != CHECKSUM_UNNECESSARY &&
>>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
>>              !NAPI_GRO_CB(skb)->csum_valid))
>>                 goto out;
>> --
> 
> This isn't right. The CHECKSUM_UNNECESSARY only refers to the outer
> checksum which is zero in this case so it is trivially unnecessary.
> The inner checksum still needs to be computed on the host. By
> convention, we do not do GRO if it is required to compute the inner
> checksum (csum_cnt == 0 checks that). If we want to allow checksum
> calculation to occur in the GRO path, meaning we understand the
> ramifications and can show this is better for performance, then all
> the checks about checksum here should be removed.
> 

Isn't the inner checksum computed on the GRO path from tcp4_gro_receive(),
as shown below? This trace is from my testbed.

In my tests, I consistently get 8.5-9 Gbps with vxlan GRO (in spite of the
added software inner checksumming), whereas without vxlan GRO the
performance drops to 3 Gbps or so. So, a significant performance benefit
can be gained on Intel 10G NICs, which are widely deployed. Hence the
interest in pursuing this or a modified patch.

      vxlan_gro_receive <-udp4_gro_receive
      ksoftirqd/1-94    [001] ..s. 11421.420280: __pskb_pull_tail <-vxlan_gro_receive
      ksoftirqd/1-94    [001] ..s. 11421.420280: skb_copy_bits <-__pskb_pull_tail
      ksoftirqd/1-94    [001] ..s. 11421.420280: __pskb_pull_tail <-vxlan_gro_receive
      ksoftirqd/1-94    [001] ..s. 11421.420281: skb_copy_bits <-__pskb_pull_tail
      ksoftirqd/1-94    [001] ..s. 11421.420281: gro_find_receive_by_type <-vxlan_gro_receive
      ksoftirqd/1-94    [001] ..s. 11421.420281: inet_gro_receive <-vxlan_gro_receive
      ksoftirqd/1-94    [001] ..s. 11421.420281: __pskb_pull_tail <-inet_gro_receive
      ksoftirqd/1-94    [001] ..s. 11421.420281: skb_copy_bits <-__pskb_pull_tail
      ksoftirqd/1-94    [001] ..s. 11421.420281: tcp4_gro_receive <-inet_gro_receive
      ksoftirqd/1-94    [001] ..s. 11421.420281: __skb_gro_checksum_complete <-tcp4_gro_receive
      ksoftirqd/1-94    [001] ..s. 11421.420281: skb_checksum <-__skb_gro_checksum_complete
      ksoftirqd/1-94    [001] ..s. 11421.420281: __skb_checksum <-skb_checksum
      ksoftirqd/1-94    [001] ..s1 11421.420281: csum_partial <-csum_partial_ext
      ksoftirqd/1-94    [001] ..s1 11421.420281: do_csum <-csum_partial



>> 1.7.1
>> 
>> 
>> 
>> 
>> 
>> Notes:
>> -------
>> 
>> The above gro fix applies to all udp-encapsulation protocols (vxlan, 
>> geneve)
>> 
>> 
>> 
>> 
Tom Herbert June 26, 2015, 2:57 a.m. UTC | #3
On Thu, Jun 25, 2015 at 6:06 PM, Ramu Ramamurthy
<sramamur@linux.vnet.ibm.com> wrote:
> On 2015-06-25 17:20, Tom Herbert wrote:
>>
>> On Thu, Jun 25, 2015 at 5:03 PM, Ramu Ramamurthy
>> <sramamur@linux.vnet.ibm.com> wrote:
>>>
>>> Problem:
>>> -------
>>>
>>> GRO is enabled on the interfaces in the following test,
>>> but GRO does not take effect for vxlan-encapsulated tcp streams. The root
>>> cause of why GRO does not take effect is described below.
>>>
>>> VM nic (mtu 1450)---bridge---vxlan----10Gb nic (intel 82599ES)-----|
>>> VM nic (mtu 1450)---bridge---vxlan----10Gb nic (intel 82599ES)-----|
>>>
>>> Because gro is not effective, the throughput for vxlan-encapsulated
>>> tcp-stream is around 3 Gbps.
>>>
>>> With the proposed patch, gro takes effect for vxlan-encapsulated tcp
>>> streams,
>>> and performance in the same test is around 8.6 Gbps.
>>>
>>>
>>> Root Cause:
>>> ----------
>>>
>>>
>>> At entry to udp4_gro_receive(), the gro parameters are set as follows:
>>>
>>>     skb->ip_summed  == 0 (CHECKSUM_NONE)
>>>     NAPI_GRO_CB(skb)->csum_cnt == 0
>>>     NAPI_GRO_CB(skb)->csum_valid == 0
>>>
>>>     UDH header checksum is 0.
>>>
>>> static struct sk_buff **udp4_gro_receive(struct sk_buff **head,
>>>                                          struct sk_buff *skb)
>>> {
>>>
>>>          <snip>
>>>
>>>         if (skb_gro_checksum_validate_zero_check(skb, IPPROTO_UDP,
>>> uh->check,
>>>
>>> inet_gro_compute_pseudo))
>>>
>>>>>>             This calls __skb_incr_checksum_unnecessary which sets
>>>>>>                     skb->ip_summed to  CHECKSUM_UNNECESSARY
>>>>>>
>>>
>>>                 goto flush;
>>>         else if (uh->check)
>>>                 skb_gro_checksum_try_convert(skb, IPPROTO_UDP, uh->check,
>>>                                              inet_gro_compute_pseudo);
>>> skip:
>>>         NAPI_GRO_CB(skb)->is_ipv6 = 0;
>>>         return udp_gro_receive(head, skb, uh);
>>>
>>> }
>>>
>>> struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff
>>> *skb,
>>>                                  struct udphdr *uh)
>>> {
>>>         struct udp_offload_priv *uo_priv;
>>>         struct sk_buff *p, **pp = NULL;
>>>         struct udphdr *uh2;
>>>         unsigned int off = skb_gro_offset(skb);
>>>         int flush = 1;
>>>
>>>         if (NAPI_GRO_CB(skb)->udp_mark ||
>>>             (skb->ip_summed != CHECKSUM_PARTIAL &&
>>>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
>>>              !NAPI_GRO_CB(skb)->csum_valid))
>>>                 goto out;
>>>>>>
>>>>>>
>>>>>>
>>>>>>      vxlan GRO gets skipped due to the above condition because here,:
>>>>>>          skb->ip_summed == CHECKSUM_UNNECESSARY
>>>>>>          NAPI_GRO_CB(skb)->csum_cnt == 0
>>>>>>          NAPI_GRO_CB(skb)->csum_valid == 0
>>>
>>>
>>>
>>> There is no reason for skipping vxlan gro in the above combination of
>>> conditions,
>>> because, tcp4_gro_receive() validates the inner tcp checksum anyway !
>>>
>>>
>>> Patch:
>>> ------
>>>
>>> Signed-off-by: Ramu Ramamurthy <ramu.ramamurthy@us.ibm.com>
>>> ---
>>>  net/ipv4/udp_offload.c |    1 +
>>>  1 files changed, 1 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
>>> index f938616..17fc12b 100644
>>> --- a/net/ipv4/udp_offload.c
>>> +++ b/net/ipv4/udp_offload.c
>>> @@ -301,6 +301,7 @@ struct sk_buff **udp_gro_receive(struct sk_buff
>>> **head,
>>> struct sk_buff *skb,
>>>
>>>         if (NAPI_GRO_CB(skb)->udp_mark ||
>>>             (skb->ip_summed != CHECKSUM_PARTIAL &&
>>> +            skb->ip_summed != CHECKSUM_UNNECESSARY &&
>>>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
>>>              !NAPI_GRO_CB(skb)->csum_valid))
>>>                 goto out;
>>> --
>>
>>
>> This isn't right. The CHECKSUM_UNNECESSARY only refers to the outer
>> checksum which is zero in this case so it is trivially unnecessary.
>> The inner checksum still needs to be computed on the host. By
>> convention, we do not do GRO if it is required to compute the inner
>> checksum (csum_cnt == 0 checks that). If we want to allow checksum
>> calculation to occur in the GRO path, meaning we understand the
>> ramifications and can show this is better for performance, then all
>> the checks about checksum here should be removed.
>>
>
> Isnt the inner checksum computed on the gro-path from tcp4_gro_receive() as
> follows ?
> This trace is from my testbed.
>
> In my tests, I consistently get 8.5-9 Gbps with vxlan gro (inspite of
> the added sw inner checksumming), whereas without vxlan GRO  the performance
> drops down to 3Gbps or so. So, a significant performance benefit can be
> gained
> on intel 10G nics which are widely deployed. Hence the interest in pursuing
> this or a modified patch.
>
That may be, but this change would affect all uses of GRO with UDP
encapsulation, not just Intel 10G NICs. For instance, pushing a lot of
checksum calculation into the NAPI for a single-queue device could
overwhelm the corresponding CPU; this is the motivation for the
restriction in the first place. We need to do a little more diligence
here.

Can you please provide more details about your tests and configuration
(number of flows, number of queues, etc.)? Also, please try enabling the
UDP checksum; this should eliminate the need for checksum computation on
the receiver and allow GRO to be used. Enabling RCO should then eliminate
checksum computation on the host.

Thanks,
Tom

>      vxlan_gro_receive <-udp4_gro_receive
>      ksoftirqd/1-94    [001] ..s. 11421.420280: __pskb_pull_tail
> <-vxlan_gro_receive
>      ksoftirqd/1-94    [001] ..s. 11421.420280: skb_copy_bits
> <-__pskb_pull_tail
>      ksoftirqd/1-94    [001] ..s. 11421.420280: __pskb_pull_tail
> <-vxlan_gro_receive
>      ksoftirqd/1-94    [001] ..s. 11421.420281: skb_copy_bits
> <-__pskb_pull_tail
>      ksoftirqd/1-94    [001] ..s. 11421.420281: gro_find_receive_by_type
> <-vxlan_gro_receive
>      ksoftirqd/1-94    [001] ..s. 11421.420281: inet_gro_receive
> <-vxlan_gro_receive
>      ksoftirqd/1-94    [001] ..s. 11421.420281: __pskb_pull_tail
> <-inet_gro_receive
>      ksoftirqd/1-94    [001] ..s. 11421.420281: skb_copy_bits
> <-__pskb_pull_tail
>      ksoftirqd/1-94    [001] ..s. 11421.420281: tcp4_gro_receive
> <-inet_gro_receive
>      ksoftirqd/1-94    [001] ..s. 11421.420281: __skb_gro_checksum_complete
> <-tcp4_gro_receive
>      ksoftirqd/1-94    [001] ..s. 11421.420281: skb_checksum
> <-__skb_gro_checksum_complete
>      ksoftirqd/1-94    [001] ..s. 11421.420281: __skb_checksum
> <-skb_checksum
>      ksoftirqd/1-94    [001] ..s1 11421.420281: csum_partial
> <-csum_partial_ext
>      ksoftirqd/1-94    [001] ..s1 11421.420281: do_csum <-csum_partial
>
>
>
>
>>> 1.7.1
>>>
>>>
>>>
>>>
>>>
>>> Notes:
>>> -------
>>>
>>> The above gro fix applies to all udp-encapsulation protocols (vxlan,
>>> geneve)
>>>
>>>
>>>
>>>
Eric Dumazet June 26, 2015, 5:15 a.m. UTC | #4
On Thu, 2015-06-25 at 19:57 -0700, Tom Herbert wrote:

> That may be, but this change would affect all uses of GRO with UDP
> encapsulation not just for intel 10G NICs. For instance, pushing a lot
> of checksum calculation into the napi for a single queue device could
> overwhelm the corresponding CPU-- this is the motivation for the
> restriction in the first place. We need to do a little more diligence
> here.

We made the choice of computing checksums years ago when GRO was enabled
for GRE packets, as many NICs are not able to validate TCP checksums with
this configuration.

Sure, in some cases this can increase load on softirq handlers, but
there are ways to spread the load on multiqueue NICs.


Tom Herbert June 26, 2015, 5:24 p.m. UTC | #5
On Thu, Jun 25, 2015 at 10:15 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-06-25 at 19:57 -0700, Tom Herbert wrote:
>
>> That may be, but this change would affect all uses of GRO with UDP
>> encapsulation not just for intel 10G NICs. For instance, pushing a lot
>> of checksum calculation into the napi for a single queue device could
>> overwhelm the corresponding CPU-- this is the motivation for the
>> restriction in the first place. We need to do a little more diligence
>> here.
>
> We made the choice of computing checksums years ago when GRO was enabled
> for GRE packets, as many NIC are not able to validate TCP checksums with
> this configuration.
>
But we only started computing checksums for GRE in the device NAPI with
Jerry's patch:

commit bf5a755f5e9186406bbf50f4087100af5bd68e40
Author: Jerry Chu <hkchu@google.com>
Date:   Tue Jan 7 10:23:19 2014 -0800

    net-gre-gro: Add GRE support to the GRO stack

Or's patch to support GRO for UDP encapsulation, which included a check
for checksum-complete, was accepted a mere 13 days later :-)

We have the option of performing GRO at the tunnel interface, and that is
good if we need to do checksum calculation, since it happens after packet
steering and can warm up the cache for an application. If we remove the
checksum check in udp_gro_receive(), then the only way to push GRO to the
tunnel interface would be to disable GRO for everyone on the Ethernet
device. I suspect Ramu's results are without GRO enabled on the tunnel
interface, and getting no GRO at all can of course be very expensive.

Anyway, I am looking at solutions and trying to repro the results.

Tom








> Sure, in some cases this can increase load on softirq handlers, but
> there are some ways to spread load on multi queue NIC.
>
>
Ramu Ramamurthy June 26, 2015, 5:36 p.m. UTC | #6
On 2015-06-25 19:57, Tom Herbert wrote:
> On Thu, Jun 25, 2015 at 6:06 PM, Ramu Ramamurthy
> <sramamur@linux.vnet.ibm.com> wrote:
>> On 2015-06-25 17:20, Tom Herbert wrote:
>>> 
>>> On Thu, Jun 25, 2015 at 5:03 PM, Ramu Ramamurthy
>>> <sramamur@linux.vnet.ibm.com> wrote:
>>>> 
>>>> Problem:
>>>> -------
>>>> 
>>>> GRO is enabled on the interfaces in the following test,
>>>> but GRO does not take effect for vxlan-encapsulated tcp streams. The 
>>>> root
>>>> cause of why GRO does not take effect is described below.
>>>> 
>>>> VM nic (mtu 1450)---bridge---vxlan----10Gb nic (intel 82599ES)-----|
>>>> VM nic (mtu 1450)---bridge---vxlan----10Gb nic (intel 82599ES)-----|
>>>> 
>>>> Because gro is not effective, the throughput for vxlan-encapsulated
>>>> tcp-stream is around 3 Gbps.
>>>> 
>>>> With the proposed patch, gro takes effect for vxlan-encapsulated tcp
>>>> streams,
>>>> and performance in the same test is around 8.6 Gbps.
>>>> 
>>>> 
>>>> Root Cause:
>>>> ----------
>>>> 
>>>> 
>>>> At entry to udp4_gro_receive(), the gro parameters are set as 
>>>> follows:
>>>> 
>>>>     skb->ip_summed  == 0 (CHECKSUM_NONE)
>>>>     NAPI_GRO_CB(skb)->csum_cnt == 0
>>>>     NAPI_GRO_CB(skb)->csum_valid == 0
>>>> 
>>>>     UDH header checksum is 0.
>>>> 
>>>> static struct sk_buff **udp4_gro_receive(struct sk_buff **head,
>>>>                                          struct sk_buff *skb)
>>>> {
>>>> 
>>>>          <snip>
>>>> 
>>>>         if (skb_gro_checksum_validate_zero_check(skb, IPPROTO_UDP,
>>>> uh->check,
>>>> 
>>>> inet_gro_compute_pseudo))
>>>> 
>>>>>>>             This calls __skb_incr_checksum_unnecessary which sets
>>>>>>>                     skb->ip_summed to  CHECKSUM_UNNECESSARY
>>>>>>> 
>>>> 
>>>>                 goto flush;
>>>>         else if (uh->check)
>>>>                 skb_gro_checksum_try_convert(skb, IPPROTO_UDP, 
>>>> uh->check,
>>>>                                              
>>>> inet_gro_compute_pseudo);
>>>> skip:
>>>>         NAPI_GRO_CB(skb)->is_ipv6 = 0;
>>>>         return udp_gro_receive(head, skb, uh);
>>>> 
>>>> }
>>>> 
>>>> struct sk_buff **udp_gro_receive(struct sk_buff **head, struct 
>>>> sk_buff
>>>> *skb,
>>>>                                  struct udphdr *uh)
>>>> {
>>>>         struct udp_offload_priv *uo_priv;
>>>>         struct sk_buff *p, **pp = NULL;
>>>>         struct udphdr *uh2;
>>>>         unsigned int off = skb_gro_offset(skb);
>>>>         int flush = 1;
>>>> 
>>>>         if (NAPI_GRO_CB(skb)->udp_mark ||
>>>>             (skb->ip_summed != CHECKSUM_PARTIAL &&
>>>>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
>>>>              !NAPI_GRO_CB(skb)->csum_valid))
>>>>                 goto out;
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>      vxlan GRO gets skipped due to the above condition because 
>>>>>>> here,:
>>>>>>>          skb->ip_summed == CHECKSUM_UNNECESSARY
>>>>>>>          NAPI_GRO_CB(skb)->csum_cnt == 0
>>>>>>>          NAPI_GRO_CB(skb)->csum_valid == 0
>>>> 
>>>> 
>>>> 
>>>> There is no reason for skipping vxlan gro in the above combination 
>>>> of
>>>> conditions,
>>>> because, tcp4_gro_receive() validates the inner tcp checksum anyway 
>>>> !
>>>> 
>>>> 
>>>> Patch:
>>>> ------
>>>> 
>>>> Signed-off-by: Ramu Ramamurthy <ramu.ramamurthy@us.ibm.com>
>>>> ---
>>>>  net/ipv4/udp_offload.c |    1 +
>>>>  1 files changed, 1 insertions(+), 0 deletions(-)
>>>> 
>>>> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
>>>> index f938616..17fc12b 100644
>>>> --- a/net/ipv4/udp_offload.c
>>>> +++ b/net/ipv4/udp_offload.c
>>>> @@ -301,6 +301,7 @@ struct sk_buff **udp_gro_receive(struct sk_buff
>>>> **head,
>>>> struct sk_buff *skb,
>>>> 
>>>>         if (NAPI_GRO_CB(skb)->udp_mark ||
>>>>             (skb->ip_summed != CHECKSUM_PARTIAL &&
>>>> +            skb->ip_summed != CHECKSUM_UNNECESSARY &&
>>>>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
>>>>              !NAPI_GRO_CB(skb)->csum_valid))
>>>>                 goto out;
>>>> --
>>> 
>>> 
>>> This isn't right. The CHECKSUM_UNNECESSARY only refers to the outer
>>> checksum which is zero in this case so it is trivially unnecessary.
>>> The inner checksum still needs to be computed on the host. By
>>> convention, we do not do GRO if it is required to compute the inner
>>> checksum (csum_cnt == 0 checks that). If we want to allow checksum
>>> calculation to occur in the GRO path, meaning we understand the
>>> ramifications and can show this is better for performance, then all
>>> the checks about checksum here should be removed.
>>> 
>> 
>> Isnt the inner checksum computed on the gro-path from 
>> tcp4_gro_receive() as
>> follows ?
>> This trace is from my testbed.
>> 
>> In my tests, I consistently get 8.5-9 Gbps with vxlan gro (inspite of
>> the added sw inner checksumming), whereas without vxlan GRO  the 
>> performance
>> drops down to 3Gbps or so. So, a significant performance benefit can 
>> be
>> gained
>> on intel 10G nics which are widely deployed. Hence the interest in 
>> pursuing
>> this or a modified patch.
>> 
> That may be, but this change would affect all uses of GRO with UDP
> encapsulation not just for intel 10G NICs. For instance, pushing a lot
> of checksum calculation into the napi for a single queue device could
> overwhelm the corresponding CPU-- this is the motivation for the
> restriction in the first place. We need to do a little more diligence
> here.
> 
> Can you please provide more details about your tests and configuration
> (# of flows, #queues, etc.). Also, please try enabling UDP checksum
> this should eliminate need for checksum computation on the receiver
> and allow GRO to be used. Enabling RCO should then eliminate checksum
> computation on the host.
> 
> Thanks,
> Tom
> 

I am testing the simplest configuration: one TCP flow generated by iperf
from a VM connected to a Linux bridge with a vxlan tunnel interface. The
10G NIC (82599ES) has multiple receive queues, but in this simple test
that is likely immaterial (the tuple on which it hashes is fixed). The
real difference in performance appears to be whether or not vxlan GRO is
performed in software.

The vxlan spec says the UDP checksum SHOULD be transmitted as zero, so by
default we should expect vxlan traffic to come in with a zero checksum,
whether from other devices or other operating systems:
          UDP Checksum: It SHOULD be transmitted as zero.  When a packet
          is received with a UDP checksum of zero, it MUST be accepted
          for decapsulation.  Optionally, if the encapsulating end point
          includes a non-zero UDP checksum, it MUST be correctly
          calculated across the entire packet including the IP header,
          UDP header, VXLAN header, and encapsulated MAC frame.
https://datatracker.ietf.org/doc/rfc7348/

The geneve spec also by default allows UDP checksums to be zero.
https://tools.ietf.org/html/draft-gross-geneve-00#section-3.3

In summary, I would like us to remove (or relax) the checksum checks in
udp_offload.c so that vxlan/geneve GRO is performed by default when it is
configured.
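
For concreteness, a minimal sketch of what that relaxation could look like
in udp_gro_receive() (not an accepted upstream change, just an illustration
of the alternative Tom mentioned of dropping the checksum gate entirely):

	/* Hypothetical: keep only the re-entry guard and let the inner
	 * protocol handler (e.g. tcp4_gro_receive) validate its own
	 * checksum in the GRO path.
	 */
	if (NAPI_GRO_CB(skb)->udp_mark)
		goto out;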








>>      vxlan_gro_receive <-udp4_gro_receive
>>      ksoftirqd/1-94    [001] ..s. 11421.420280: __pskb_pull_tail
>> <-vxlan_gro_receive
>>      ksoftirqd/1-94    [001] ..s. 11421.420280: skb_copy_bits
>> <-__pskb_pull_tail
>>      ksoftirqd/1-94    [001] ..s. 11421.420280: __pskb_pull_tail
>> <-vxlan_gro_receive
>>      ksoftirqd/1-94    [001] ..s. 11421.420281: skb_copy_bits
>> <-__pskb_pull_tail
>>      ksoftirqd/1-94    [001] ..s. 11421.420281: 
>> gro_find_receive_by_type
>> <-vxlan_gro_receive
>>      ksoftirqd/1-94    [001] ..s. 11421.420281: inet_gro_receive
>> <-vxlan_gro_receive
>>      ksoftirqd/1-94    [001] ..s. 11421.420281: __pskb_pull_tail
>> <-inet_gro_receive
>>      ksoftirqd/1-94    [001] ..s. 11421.420281: skb_copy_bits
>> <-__pskb_pull_tail
>>      ksoftirqd/1-94    [001] ..s. 11421.420281: tcp4_gro_receive
>> <-inet_gro_receive
>>      ksoftirqd/1-94    [001] ..s. 11421.420281: 
>> __skb_gro_checksum_complete
>> <-tcp4_gro_receive
>>      ksoftirqd/1-94    [001] ..s. 11421.420281: skb_checksum
>> <-__skb_gro_checksum_complete
>>      ksoftirqd/1-94    [001] ..s. 11421.420281: __skb_checksum
>> <-skb_checksum
>>      ksoftirqd/1-94    [001] ..s1 11421.420281: csum_partial
>> <-csum_partial_ext
>>      ksoftirqd/1-94    [001] ..s1 11421.420281: do_csum <-csum_partial
>> 
>> 
>> 
>> 
>>>> 1.7.1
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Notes:
>>>> -------
>>>> 
>>>> The above gro fix applies to all udp-encapsulation protocols (vxlan,
>>>> geneve)
>>>> 
>>>> 
>>>> 
>>>> 
Tom Herbert June 26, 2015, 6:04 p.m. UTC | #7
> I am testing the simplest configuration which has 1 TCP flow generated by
> iperf from
> a VM connected to a linux bridge with a vxlan tunnel interface. The 10G nic
> (82599 ES) has
> multiple receive queues, but in this simple test, it is likely immaterial
> (because, the
> tuple on which it hashes would be fixed). The real difference in performance
> appears to
> be whether or not vxlan gro is performed by software.
>

Please do "ethtool -k vxlan0" of whatever interface is for vxlan.
Ensure GRO is "on", if not enable it on the interface by "ethtool _k
vxlan0 gro on". Run iperf and to tcpdump on the vxlan interface to
verify GRO is being done. If we are seeing performance degradation
when GRO is being done at tunnel versus device that would be a
different problem than no GRO being done at all.
Ramu Ramamurthy June 26, 2015, 7:31 p.m. UTC | #8
On 2015-06-26 11:04, Tom Herbert wrote:
>> I am testing the simplest configuration which has 1 TCP flow generated 
>> by
>> iperf from
>> a VM connected to a linux bridge with a vxlan tunnel interface. The 
>> 10G nic
>> (82599 ES) has
>> multiple receive queues, but in this simple test, it is likely 
>> immaterial
>> (because, the
>> tuple on which it hashes would be fixed). The real difference in 
>> performance
>> appears to
>> be whether or not vxlan gro is performed by software.
>> 
> 
> Please do "ethtool -k vxlan0" of whatever interface is for vxlan.
> Ensure GRO is "on", if not enable it on the interface by "ethtool _k
> vxlan0 gro on". Run iperf and to tcpdump on the vxlan interface to
> verify GRO is being done. If we are seeing performance degradation
> when GRO is being done at tunnel versus device that would be a
> different problem than no GRO being done at all.

Here are more details on the test.

GRO is "on" on both the device and the tunnel. tcpdump on the vxlan
interface shows un-aggregated packets:

[root@ramu1 tracing]# tcpdump -i vxlan0
<snip>
ptions [nop,nop,TS val 1972850548 ecr 193703], length 1398
14:14:38.911955 IP 1.1.1.21.44134 > 1.1.1.11.commplex-link: Flags [.], seq 224921449:224922847, ack 1, win 221, options [nop,nop,TS val 1972850548 ecr 193703], length 1398
14:14:38.911957 IP 1.1.1.21.44134 > 1.1.1.11.commplex-link: Flags [.], seq 224922847:224924245, ack 1, win 221, options [nop,nop,TS val 1972850548 ecr 193703], length 1398
14:14:38.911958 IP 1.1.1.21.44134 > 1.1.1.11.commplex-link: Flags [.], seq 224924245:224925643, ack 1, win 221, options [nop,nop,TS val 1972850548 ecr 193703], length 1398
14:14:38.911959 IP 1.1.1.21.44134 > 1.1.1.11.commplex-link: Flags [.], seq 224925643:224927041, ack 1, win 221, options [nop,nop,TS val 1972850548 ecr 193703], length 1398

In the kernel trace, I don't see "vxlan_gro_receive" being hit at all.

[root@localhost ~]# ./iperf -s -i 2
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 1.1.1.11 port 5001 connected with 1.1.1.21 port 44135
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 2.0 sec   503 MBytes  2.11 Gbits/sec


With the proposed patch (and everything else remaining the same), tcpdump
shows aggregated frames like this:

[root@ramu1 perf]# tcpdump -i vxlan0
<snip>
14:29:50.961380 IP 1.1.1.21.44138 > 1.1.1.11.commplex-link: Flags [.], seq 24565681:24629989, ack 1, win 221, options [nop,nop,TS val 1973762616 ecr 4294793113], length 64308
14:29:50.961506 IP 1.1.1.11.commplex-link > 1.1.1.21.44138: Flags [.], ack 24629989, win 21888, options [nop,nop,TS val 4294793113 ecr 1973762616], length 0
14:29:50.961463 IP 1.1.1.21.44138 > 1.1.1.11.commplex-link: Flags [.], seq 24629989:24694297, ack 1, win 221, options [nop,nop,TS val 1973762616 ecr 4294793113], length 64308
14:29:50.961518 IP 1.1.1.21.44138 > 1.1.1.11.commplex-link: Flags [.], seq 24694297:24758605, ack 1, win 221, options [nop,nop,TS val 1973762616 ecr 4294793113], length 64308
14:29:50.961655 IP 1.1.1.11.commplex-link > 1.1.1.21.44138: Flags [.], ack 24694297, win 21932, options [nop,nop,TS val 4294793113 ecr 1973762616], length 0
14:29:50.961626 IP 1.1.1.21.44138 > 1.1.1.11.commplex-link: Flags [P.], seq 24758605:24822913, ack 1, win 221, options [nop,nop,TS val 1973762616 ecr 4294793113], length 64308


[root@localhost ~]# ./iperf -s -i 2
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 1.1.1.11 port 5001 connected with 1.1.1.21 port 44136
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 2.0 sec  1.64 GBytes  7.04 Gbits/sec
[  4]  2.0- 4.0 sec  1.98 GBytes  8.48 Gbits/sec
[  4]  4.0- 6.0 sec  1.98 GBytes  8.52 Gbits/sec
[  4]  6.0- 8.0 sec  1.99 GBytes  8.53 Gbits/sec

The kernel trace shows vxlan_gro_receive being hit.


Topology:
---------

VM1 ---bridge (br_perf)---vxlan0----10Gnic(int4)-----10Gnic---vxlan0----bridge (br_perf)---VM2

MTUs:
   VM        (1450)
   br_perf   (9000)
   vxlan0    (9000)
   int4      (9000)

Hw/Sw Adapter, Drivers
-----------------------

02:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

[root@ramu1 ~]# ethtool -i int4
driver: ixgbe
version: 4.0.1-k-rh7.1
firmware-version: 0x80000208
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

Config HOST1:
-------------

[root@ramu1 perf]# cat docfg.sh
ip link del vxlan0
ip link set dev br_perf down
brctl delbr br_perf

brctl addbr br_perf
ip link set dev br_perf up
ip link add vxlan0 mtu 9000 type vxlan id 1 l2miss l3miss rsc proxy nolearning dstport 8472
ip link set dev vxlan0 up
brctl addif br_perf vxlan0

ip neigh add 1.1.1.21 lladdr 52:54:00:17:c8:4d dev vxlan0 nud permanent

bridge fdb replace 52:54:00:17:c8:4d dev vxlan0 self permanent dst 10.50.117.216


Config VM1:
-----------
eth0 IP,MAC:  1.1.1.11, 52:54:00:6c:53:61

CPU affinity for both VMs
--------------------------
[root@ramu1 perf]# virsh vcpuinfo centos-6.5
VCPU:           0
CPU:            N/A
State:          N/A
CPU time        N/A
CPU Affinity:   ---y------------------------------------

Iptables disabled on bridges
------------------------------
echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables

Offload Settings both hosts are at default
-------------------------------------------

[root@ramu1 perf]# ethtool -k int4
Features for int4:
rx-checksumming: on
tx-checksumming: on
	tx-checksum-ipv4: on
	tx-checksum-ip-generic: off [fixed]
	tx-checksum-ipv6: on
	tx-checksum-fcoe-crc: on [fixed]
	tx-checksum-sctp: on
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-mpls-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
busy-poll: on [fixed]



[root@ramu1 perf]# ethtool -k br_perf
Features for br_perf:
rx-checksumming: off [fixed]
tx-checksumming: on
	tx-checksum-ipv4: off [fixed]
	tx-checksum-ip-generic: on
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: off [requested on]
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: on
	tx-tcp6-segmentation: on
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [requested on]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: on [fixed]
tx-gso-robust: off [requested on]
tx-fcoe-segmentation: off [requested on]
tx-gre-segmentation: on
tx-ipip-segmentation: on
tx-sit-segmentation: on
tx-udp_tnl-segmentation: on
tx-mpls-segmentation: on
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
busy-poll: off [fixed]







[root@ramu1 perf]# ethtool -k vxlan0
Features for vxlan0:
rx-checksumming: on
tx-checksumming: on
	tx-checksum-ipv4: off [fixed]
	tx-checksum-ip-generic: on
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: on
	tx-tcp6-segmentation: on
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: on [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-mpls-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
busy-poll: off [fixed]
[root@ramu1 perf]#













Tom Herbert June 26, 2015, 7:59 p.m. UTC | #9
On Fri, Jun 26, 2015 at 12:31 PM, Ramu Ramamurthy
<sramamur@linux.vnet.ibm.com> wrote:
> On 2015-06-26 11:04, Tom Herbert wrote:
>>>
>>> I am testing the simplest configuration which has 1 TCP flow generated by
>>> iperf from
>>> a VM connected to a linux bridge with a vxlan tunnel interface. The 10G
>>> nic
>>> (82599 ES) has
>>> multiple receive queues, but in this simple test, it is likely immaterial
>>> (because, the
>>> tuple on which it hashes would be fixed). The real difference in
>>> performance
>>> appears to
>>> be whether or not vxlan gro is performed by software.
>>>
>>
>> Please do "ethtool -k vxlan0" of whatever interface is for vxlan.
>> Ensure GRO is "on", if not enable it on the interface by "ethtool _k
>> vxlan0 gro on". Run iperf and to tcpdump on the vxlan interface to
>> verify GRO is being done. If we are seeing performance degradation
>> when GRO is being done at tunnel versus device that would be a
>> different problem than no GRO being done at all.
>
>
> Heres more details on the test.
>
> gro is "on" on the device and the tunnel. tcpdump on the vxlan interface
> show un-aggregated packets
>
> [root@ramu1 tracing]# tcpdump -i vxlan0
> <snip>
> ptions [nop,nop,TS val 1972850548 ecr 193703], length 1398
> 14:14:38.911955 IP 1.1.1.21.44134 > 1.1.1.11.commplex-link: Flags [.], seq
> 224921449:224922847, ack 1, win 221, options [nop,nop,TS val 1972850548 ecr

Looks like GRO was never implemented for vxlan tunnels. The driver is
simply calling netif_rx() instead of using the GRO cells infrastructure;
geneve is doing the same thing. For the other tunnels that are used in
foo-over-udp (GRE, IPIP, SIT), ip_tunnel_rcv() is called, which in turn
calls gro_cells_receive().
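
(For reference, a rough sketch of the difference being described; the
gro_cells helpers live in include/net/gro_cells.h, and the calls below are
paraphrased to contrast the two receive paths, not taken verbatim from the
drivers.)

	/* vxlan/geneve today: deliver straight to the stack, so the inner
	 * packet gets no per-tunnel GRO pass.
	 */
	netif_rx(skb);

	/* GRE/IPIP/SIT via ip_tunnel_rcv(): feed a per-cpu GRO cell, which
	 * gives the decapsulated packet another GRO pass on the tunnel dev.
	 */
	gro_cells_receive(&tunnel->gro_cells, skb);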
Ramu Ramamurthy June 26, 2015, 9:44 p.m. UTC | #10
On 2015-06-26 12:59, Tom Herbert wrote:
> On Fri, Jun 26, 2015 at 12:31 PM, Ramu Ramamurthy
> <sramamur@linux.vnet.ibm.com> wrote:
>> On 2015-06-26 11:04, Tom Herbert wrote:
>>>> 
>>>> I am testing the simplest configuration which has 1 TCP flow 
>>>> generated by
>>>> iperf from
>>>> a VM connected to a linux bridge with a vxlan tunnel interface. The 
>>>> 10G
>>>> nic
>>>> (82599 ES) has
>>>> multiple receive queues, but in this simple test, it is likely 
>>>> immaterial
>>>> (because, the
>>>> tuple on which it hashes would be fixed). The real difference in
>>>> performance
>>>> appears to
>>>> be whether or not vxlan gro is performed by software.
>>>> 
>>> 
>>> Please do "ethtool -k vxlan0" of whatever interface is for vxlan.
>>> Ensure GRO is "on", if not enable it on the interface by "ethtool _k
>>> vxlan0 gro on". Run iperf and to tcpdump on the vxlan interface to
>>> verify GRO is being done. If we are seeing performance degradation
>>> when GRO is being done at tunnel versus device that would be a
>>> different problem than no GRO being done at all.
>> 
>> 
>> Heres more details on the test.
>> 
>> gro is "on" on the device and the tunnel. tcpdump on the vxlan 
>> interface
>> show un-aggregated packets
>> 
>> [root@ramu1 tracing]# tcpdump -i vxlan0
>> <snip>
>> ptions [nop,nop,TS val 1972850548 ecr 193703], length 1398
>> 14:14:38.911955 IP 1.1.1.21.44134 > 1.1.1.11.commplex-link: Flags [.], 
>> seq
>> 224921449:224922847, ack 1, win 221, options [nop,nop,TS val 
>> 1972850548 ecr
> 
> Looks like GRO was never implemented for vxlan tunnels. The driver is
> simply calling netif_rx instead of using the GRO cells infrastructure.
> geneve is doing the same thing. For other tunnels which are used in
> foo-over-udp (GRE, IPIP, SIT) ip_tunnel_rcv is called which in turn
> calls gro_cells_receive.

Can we remove (or relax) the checksum checks in udp_gro_receive() which
are preventing the vxlan GRO callbacks from being invoked? The vxlan
driver registers these offload callbacks, and I can see them work when I
relax the following checksum checks.

	if (NAPI_GRO_CB(skb)->udp_mark ||
	    (skb->ip_summed != CHECKSUM_PARTIAL &&      <<<< remove or relax these checks,
	     NAPI_GRO_CB(skb)->csum_cnt == 0 &&         <<<< which are directly
	     !NAPI_GRO_CB(skb)->csum_valid))            <<<< dependent on NIC capability
		goto out;

Alternatively, can we move these checks into the respective drivers'
gro_receive() functions?

The other changes you suggest (gro_cells) are beyond my understanding.



Or Gerlitz June 28, 2015, 8:19 p.m. UTC | #11
On Fri, Jun 26, 2015 at 10:59 PM, Tom Herbert <tom@herbertland.com> wrote:
[...]
> Looks like GRO was never implemented for vxlan tunnels. The driver is
> simply calling netif_rx instead of using the GRO cells infrastructure.
> geneve is doing the same thing. For other tunnels which are used in
> foo-over-udp (GRE, IPIP, SIT) ip_tunnel_rcv is called which in turn
> calls gro_cells_receive.

Tom,

Since v3.14, when tunneled (say VXLAN/GRE) packets are received on the
physical interface, they go through GRO aggregation before being delivered
up to the tunnel "device" (e.g. either a vxlan/gre netdevice or an OVS
vxlan/gre vport). So, in that respect, can you elaborate a little further
on why we want to GRO them again?

Or.
Tom Herbert June 28, 2015, 9:17 p.m. UTC | #12
On Sun, Jun 28, 2015 at 1:19 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Fri, Jun 26, 2015 at 10:59 PM, Tom Herbert <tom@herbertland.com> wrote:
> [...]
>> Looks like GRO was never implemented for vxlan tunnels. The driver is
>> simply calling netif_rx instead of using the GRO cells infrastructure.
>> geneve is doing the same thing. For other tunnels which are used in
>> foo-over-udp (GRE, IPIP, SIT) ip_tunnel_rcv is called which in turn
>> calls gro_cells_receive.
>
> Tom,
>
> Since v3.14, when a tunneled (say VXLAN/GRE) packets are received on
> the physical interface, they go through GRO aggregation before being
> delivered up to the tunnel "device" (e.g either vxlan/gre netdevice or
> OVS vxlan/gre vport) -- so in that respect, can you elaborate a little
> further why we want to GRO them again?
>

If we don't have a verifiable checksum from the device, GRO is not applied
to UDP-encapsulated packets at the physical interface, but it can be done
at the tunnel. Ramu is seeing poor performance because no GRO at all is
happening, so doing it at the tunnel is an improvement. As I described
before, avoiding checksum calculation in the device NAPI still seems to be
a good thing (in my testing I do see a slight regression if we do the
checksum in the device NAPI).

btw, the real "fix" for this is for NICs to provide CHECKSUM_COMPLETE! :-)

Tom

> Or.
Ramu Ramamurthy June 29, 2015, 7:56 p.m. UTC | #13
On 2015-06-28 14:17, Tom Herbert wrote:
> On Sun, Jun 28, 2015 at 1:19 PM, Or Gerlitz <gerlitz.or@gmail.com> 
> wrote:
>> On Fri, Jun 26, 2015 at 10:59 PM, Tom Herbert <tom@herbertland.com> 
>> wrote:
>> [...]
>>> Looks like GRO was never implemented for vxlan tunnels. The driver is
>>> simply calling netif_rx instead of using the GRO cells 
>>> infrastructure.
>>> geneve is doing the same thing. For other tunnels which are used in
>>> foo-over-udp (GRE, IPIP, SIT) ip_tunnel_rcv is called which in turn
>>> calls gro_cells_receive.
>> 
>> Tom,
>> 
>> Since v3.14, when a tunneled (say VXLAN/GRE) packets are received on
>> the physical interface, they go through GRO aggregation before being
>> delivered up to the tunnel "device" (e.g either vxlan/gre netdevice or
>> OVS vxlan/gre vport) -- so in that respect, can you elaborate a little
>> further why we want to GRO them again?
>> 
> 
> If we don't have a verifiable checksum from the device GRO is not
> applied to UDP encapsulated packets at the physical interface, but can
> be done at the tunnel. Ramu is seeing poor performance because there
> is no GRO at all is happening, so doing it at the tunnel is an
> improvement. As I described before, avoiding checksum calculation in
> the device NAPI still seems to be a good thing (in my testing I do see
> a slight regression if we were to do the checksum in device NAPI).
> 
> btw, the real "fix" for this is for NICs to provide CHECKSUM_COMPLETE! 
> :-)
> 
> Tom
> 
>> Or.

When I force the sender to set a non-zero UDP checksum for the
vxlan-encapsulated TCP stream, I can see GRO activated at the receiver
(82599ES NIC), and the throughput is ~8.5 Gbps! (The non-zero UDP checksum
gets validated in software by udp4_gro_receive(), which marks the GRO
checksum state valid, so the check in udp_gro_receive() passes.)

So, for GRO to be effective on the 82599ES receiver, the sender needs to
set the UDP checksum. If the sender does NOT set the UDP checksum
(udp-checksum == 0), then the gro-cells approach suggested by Tom will
perform GRO at the tunnel device level.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index f938616..17fc12b 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -301,6 +301,7 @@  struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *skb,

  	if (NAPI_GRO_CB(skb)->udp_mark ||
  	    (skb->ip_summed != CHECKSUM_PARTIAL &&
+	     skb->ip_summed != CHECKSUM_UNNECESSARY &&
  	     NAPI_GRO_CB(skb)->csum_cnt == 0 &&
  	     !NAPI_GRO_CB(skb)->csum_valid))