
RFC: ixgbe+build_skb+extra performance experiments

Message ID 1412229642-10555-1-git-send-email-ast@plumgrid.com
State RFC, archived
Delegated to: David Miller

Commit Message

Alexei Starovoitov Oct. 2, 2014, 6 a.m. UTC
Hi All,

I'm trying to speed up single core packets per second.

I took dual port ixgbe and added both ports to a linux bridge.
Only one port is connected to another system running pktgen at 10G rate.
I disabled gro to measure pure RX speed of ixgbe.

Out of the box I see 6.5 Mpps and the following stack:
  21.83%    ksoftirqd/0  [kernel.kallsyms]  [k] memcpy
  17.58%    ksoftirqd/0  [ixgbe]            [k] ixgbe_clean_rx_irq
  10.07%    ksoftirqd/0  [kernel.kallsyms]  [k] build_skb
   6.40%    ksoftirqd/0  [kernel.kallsyms]  [k] __netdev_alloc_frag
   5.18%    ksoftirqd/0  [kernel.kallsyms]  [k] put_compound_page
   4.93%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_alloc
   4.55%    ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core

Obviously the driver spends a huge amount of time copying data from
hw buffers into the skb.

Then I applied a buggy (but working in this case) patch:
http://patchwork.ozlabs.org/patch/236044/
that is trying to use build_skb() API in ixgbe.

RX speed jumped to 7.6 Mpps with the following stack:
  27.02%    ksoftirqd/0  [kernel.kallsyms]  [k] eth_type_trans
  16.68%    ksoftirqd/0  [ixgbe]            [k] ixgbe_clean_rx_irq
  11.45%    ksoftirqd/0  [kernel.kallsyms]  [k] build_skb
   5.20%    ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
   4.72%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_alloc
   3.96%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_free

Packets are no longer copied and performance is higher.
It's doing the following:
- build_skb out of hw buffer and prefetch packet data
- eth_type_trans
- napi_gro_receive

but build_skb() is too fast and cpu doesn't have enough time
to prefetch packet data before eth_type_trans() is called,
so I added mini skb bursting of 2 skbs (patch below) that does:
- build_skb1 out of hw buffer and prefetch packet data
- build_skb2 out of hw buffer and prefetch packet data
- eth_type_trans(skb1)
- napi_gro_receive(skb1)
- eth_type_trans(skb2)
- napi_gro_receive(skb2)
and performance jumped to 9.0 Mpps with stack:
  20.54%    ksoftirqd/0  [ixgbe]            [k] ixgbe_clean_rx_irq
  13.15%    ksoftirqd/0  [kernel.kallsyms]  [k] build_skb
   8.35%    ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
   7.16%    ksoftirqd/0  [kernel.kallsyms]  [k] eth_type_trans
   4.73%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_free
   4.50%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_alloc

With further instruction tuning inside ixgbe_clean_rx_irq()
I could push it to 9.4 Mpps.

From 6.5 Mpps to 9.4 Mpps via build_skb() and tuning.

Is there a way to fix the issue Ben pointed out a year ago?
A brute force fix could be: avoid half-page buffers.
We'll be wasting 16 Mbytes of memory. Sure, but in some cases
the extra performance might be worth it.
Other options?
I think we need to try harder to switch to build_skb().
It will open up a lot of possibilities for further performance
improvements.
Thoughts?

---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   34 +++++++++++++++++++++----
 1 file changed, 29 insertions(+), 5 deletions(-)

Comments

Jesper Dangaard Brouer Oct. 2, 2014, 7:36 a.m. UTC | #1
On Wed,  1 Oct 2014 23:00:42 -0700 Alexei Starovoitov <ast@plumgrid.com> wrote:

> I'm trying to speed up single core packets per second.

Great, welcome to the club ;-)
 
> I took dual port ixgbe and added both ports to a linux bridge.
> Only one port is connected to another system running pktgen at 10G rate.
> I disabled gro to measure pure RX speed of ixgbe.

It is great that you are attacking the RX side, I planned to look at it
after finishing the qdisc bulking.  It is really lagging behind,
especially after we have now almost "fixed" the TX side (driver layer
can now do 14.8Mpps, if ignoring rest of stack, alloc etc.).


> Out of the box I see 6.5 Mpps and the following stack:
>   21.83%    ksoftirqd/0  [kernel.kallsyms]  [k] memcpy
>   17.58%    ksoftirqd/0  [ixgbe]            [k] ixgbe_clean_rx_irq
>   10.07%    ksoftirqd/0  [kernel.kallsyms]  [k] build_skb
>    6.40%    ksoftirqd/0  [kernel.kallsyms]  [k] __netdev_alloc_frag
>    5.18%    ksoftirqd/0  [kernel.kallsyms]  [k] put_compound_page
>    4.93%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_alloc
>    4.55%    ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
> 
> Obviously the driver spends a huge amount of time copying data from
> hw buffers into the skb.
> 
> Then I applied a buggy (but working in this case) patch:
> http://patchwork.ozlabs.org/patch/236044/
> that is trying to use build_skb() API in ixgbe.

I also expected it would be a huge win to use the build_skb() API.
Good to see this confirmed! :-)

I've been playing with a faster memory pool/allocator (implemented via a
ring_queue), and my experiments show I could save 52ns when using it
for the skb->data.  And you basically avoid this skb->data alloc with
build_skb().
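
For concreteness, such a recycler is roughly a fixed-size ring of
preallocated 2K buffers, something like the sketch below (struct buf_ring
and the helpers are invented names, not the actual ring_queue code, and
locking is avoided by assuming per-CPU single producer/consumer use):

/* Sketch of a ring-based buffer recycler for skb->data (hypothetical
 * names; needs <linux/slab.h>).
 */
struct buf_ring {
	unsigned int prod;	/* producer index: buffers returned */
	unsigned int cons;	/* consumer index: buffers taken */
	unsigned int mask;	/* ring size - 1, size is a power of two */
	void *bufs[];
};

static inline void *buf_ring_get(struct buf_ring *r)
{
	void *buf;

	if (r->cons == r->prod)			/* empty: fall back to slab */
		return kmalloc(2048, GFP_ATOMIC);
	buf = r->bufs[r->cons & r->mask];
	r->cons++;
	return buf;
}

static inline void buf_ring_put(struct buf_ring *r, void *buf)
{
	if (r->prod - r->cons > r->mask) {	/* full: free back to slab */
		kfree(buf);
		return;
	}
	r->bufs[r->prod & r->mask] = buf;
	r->prod++;
}

In the common case getting a buffer is just an index increment, which is
where the saving over the slab path would come from.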


> RX speed jumped to 7.6 Mpps with the following stack:
>   27.02%    ksoftirqd/0  [kernel.kallsyms]  [k] eth_type_trans
>   16.68%    ksoftirqd/0  [ixgbe]            [k] ixgbe_clean_rx_irq
>   11.45%    ksoftirqd/0  [kernel.kallsyms]  [k] build_skb
>    5.20%    ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>    4.72%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_alloc
>    3.96%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_free

My faster memory pool/allocator could save 8ns for the kmem_cache/slub
calls, which is also high in your perf top.  8ns out of 40ns which is
the micro benchmarked cost of the kmem_cache_{alloc,free} calls.


> Packets are no longer copied and performance is higher.
> It's doing the following:
> - build_skb out of hw buffer and prefetch packet data
> - eth_type_trans
> - napi_gro_receive
> 
> but build_skb() is too fast and cpu doesn't have enough time
> to prefetch packet data before eth_type_trans() is called,
> so I added mini skb bursting of 2 skbs (patch below) that does:
> - build_skb1 out of hw buffer and prefetch packet data
> - build_skb2 out of hw buffer and prefetch packet data
> - eth_type_trans(skb1)
> - napi_gro_receive(skb1)
> - eth_type_trans(skb2)
> - napi_gro_receive(skb2)
> and performance jumped to 9.0 Mpps with stack:
>   20.54%    ksoftirqd/0  [ixgbe]            [k] ixgbe_clean_rx_irq
>   13.15%    ksoftirqd/0  [kernel.kallsyms]  [k] build_skb
>    8.35%    ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>    7.16%    ksoftirqd/0  [kernel.kallsyms]  [k] eth_type_trans
>    4.73%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_free
>    4.50%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_alloc
> 
> With further instruction tuning inside ixgbe_clean_rx_irq()
> I could push it to 9.4 Mpps.
> 
> From 6.5 Mpps to 9.4 Mpps via build_skb() and tuning.

Cool, quite impressive performance boost! - good work! :-)


> Is there a way to fix the issue Ben pointed out a year ago?
> A brute force fix could be: avoid half-page buffers.
> We'll be wasting 16 Mbytes of memory. Sure, but in some cases
> the extra performance might be worth it.
> Other options?
>
> I think we need to try harder to switch to build_skb().
> It will open up a lot of possibilities for further performance
> improvements.
> Thoughts?

Yes, we should really work on converting more drivers to use
build_skb().

> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   34 +++++++++++++++++++++----
>  1 file changed, 29 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 21d1a65..1d1e37f 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -1590,8 +1590,6 @@ static void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
>  	}
> 
>  	skb_record_rx_queue(skb, rx_ring->queue_index);
> -
> -	skb->protocol = eth_type_trans(skb, dev);
>  }
> 
>  static void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
> @@ -2063,6 +2061,24 @@ dma_sync:
>  	return skb;
>  }
> 
> +#define BURST_SIZE 2
> +static void ixgbe_rx_skb_burst(struct sk_buff *skbs[BURST_SIZE],
> +			       unsigned int skb_burst,
> +			       struct ixgbe_q_vector *q_vector,
> +			       struct net_device *dev)
> +{
> +	int i;
> +
> +	for (i = 0; i < skb_burst; i++) {
> +		struct sk_buff *skb = skbs[i];
> +
> +		skb->protocol = eth_type_trans(skb, dev);
> +
> +		skb_mark_napi_id(skb, &q_vector->napi);
> +		ixgbe_rx_skb(q_vector, skb);
> +	}
> +}
> +
>  /**
>   * ixgbe_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>   * @q_vector: structure containing interrupt and ring information
> @@ -2087,6 +2103,8 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>  	unsigned int mss = 0;
>  #endif /* IXGBE_FCOE */
>  	u16 cleaned_count = ixgbe_desc_unused(rx_ring);
> +	struct sk_buff *skbs[BURST_SIZE];
> +	unsigned int skb_burst = 0;
> 
>  	while (likely(total_rx_packets < budget)) {
>  		union ixgbe_adv_rx_desc *rx_desc;
> @@ -2161,13 +2179,19 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>  		}
>  
>  #endif /* IXGBE_FCOE */
> -		skb_mark_napi_id(skb, &q_vector->napi);
> -		ixgbe_rx_skb(q_vector, skb);
> -
>  		/* update budget accounting */
>  		total_rx_packets++;
> +		skbs[skb_burst++] = skb;
> +
> +		if (skb_burst == BURST_SIZE) {
> +			ixgbe_rx_skb_burst(skbs, skb_burst, q_vector,
> +					   rx_ring->netdev);
> +			skb_burst = 0;
> +		}
>  	}
>  
> +	ixgbe_rx_skb_burst(skbs, skb_burst, q_vector, rx_ring->netdev);
> +
>  	u64_stats_update_begin(&rx_ring->syncp);
>  	rx_ring->stats.packets += total_rx_packets;
>  	rx_ring->stats.bytes += total_rx_bytes;
Alexander H Duyck Oct. 3, 2014, 2:40 p.m. UTC | #2
On 10/02/2014 12:36 AM, Jesper Dangaard Brouer wrote:
> On Wed,  1 Oct 2014 23:00:42 -0700 Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> I'm trying to speed up single core packets per second.
> Great, welcome to the club ;-)

Yes, but please keep in mind that multi-core is the more common use case
for many systems.

>> I took dual port ixgbe and added both ports to a linux bridge.
>> Only one port is connected to another system running pktgen at 10G rate.
>> I disabled gro to measure pure RX speed of ixgbe.
> It is great that you are attacking the RX side, I planned to look at it
> after finishing the qdisc bulking.  It is really lagging behind,
> especially after we have now almost "fixed" the TX side (driver layer
> can now do 14.8Mpps, if ignoring rest of stack, alloc etc.).

To that end we may want to look to something like GRO to do the
buffering on the Rx side so that we could make use of GRO/GSO to send
blocks of buffers instead of one at a time.

>> Out of the box I see 6.5 Mpps and the following stack:
>>   21.83%    ksoftirqd/0  [kernel.kallsyms]  [k] memcpy
>>   17.58%    ksoftirqd/0  [ixgbe]            [k] ixgbe_clean_rx_irq
>>   10.07%    ksoftirqd/0  [kernel.kallsyms]  [k] build_skb
>>    6.40%    ksoftirqd/0  [kernel.kallsyms]  [k] __netdev_alloc_frag
>>    5.18%    ksoftirqd/0  [kernel.kallsyms]  [k] put_compound_page
>>    4.93%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_alloc
>>    4.55%    ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>>
>> Obviously the driver spends a huge amount of time copying data from
>> hw buffers into the skb.
>>
>> Then I applied a buggy (but working in this case) patch:
>> http://patchwork.ozlabs.org/patch/236044/
>> that is trying to use build_skb() API in ixgbe.
> I also expected it would be a huge win to use the build_skb() API.
> Good to see this confirmed! :-)

From my past experience this is very platform dependent.  For example
with DDIO or DCA features enabled on a system the memcpy is very cheap
since it is already in the cache.  It is one of the reasons for choosing
that as a means of working around the fact that we cannot use build_skb
and page reuse in the same driver.

> I've been playing with a faster memory pool/allocator (implemented via a
> ring_queue), and my experiments show I could save 52ns when using it
> for the skb->data.  And you basically avoid this skb->data alloc with
> build_skb().

The other big item that I think many are overlooking is the cost of
mapping the buffer.  On systems with an IOMMU, or even in some cases just
the swiotlb, this can add significant cost to allocating a new buffer.

>> RX speed jumped to 7.6 Mpps with the following stack:
>>   27.02%    ksoftirqd/0  [kernel.kallsyms]  [k] eth_type_trans
>>   16.68%    ksoftirqd/0  [ixgbe]            [k] ixgbe_clean_rx_irq
>>   11.45%    ksoftirqd/0  [kernel.kallsyms]  [k] build_skb
>>    5.20%    ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>>    4.72%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_alloc
>>    3.96%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_free
> My faster memory pool/allocator could save 8ns for the kmem_cache/slub
> calls, which is also high in your perf top.  8ns out of 40ns which is
> the micro benchmarked cost of the kmem_cache_{alloc,free} calls.
>
>
>> Packets are no longer copied and performance is higher.
>> It's doing the following:
>> - build_skb out of hw buffer and prefetch packet data
>> - eth_type_trans
>> - napi_gro_receive
>>
>> but build_skb() is too fast and cpu doesn't have enough time
>> to prefetch packet data before eth_type_trans() is called,
>> so I added mini skb bursting of 2 skbs (patch below) that does:
>> - build_skb1 out of hw buffer and prefetch packet data
>> - build_skb2 out of hw buffer and prefetch packet data
>> - eth_type_trans(skb1)
>> - napi_gro_receive(skb1)
>> - eth_type_trans(skb2)
>> - napi_gro_receive(skb2)
>> and performance jumped to 9.0 Mpps with stack:
>>   20.54%    ksoftirqd/0  [ixgbe]            [k] ixgbe_clean_rx_irq
>>   13.15%    ksoftirqd/0  [kernel.kallsyms]  [k] build_skb
>>    8.35%    ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>>    7.16%    ksoftirqd/0  [kernel.kallsyms]  [k] eth_type_trans
>>    4.73%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_free
>>    4.50%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_alloc
>>
>> With further instruction tuning inside ixgbe_clean_rx_irq()
>> I could push it to 9.4 Mpps.
>>
>> From 6.5 Mpps to 9.4 Mpps via build_skb() and tuning.
> Cool, quite impressive performance boost! - good work! :-)

One thought I had at one point was to try and add a flag to the DMA api
to indicate if the DMA api is trivial resulting in just a call to
virt_to_phys.  It might be worthwhile to look into something like that,
then we could split the receive processing into one of two paths, one
for non-trivial DMA mapping APIs, and one for trivial DMA mapping APIs
such as swiotlb on a device that supports all the memory in the system.
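
As a sketch of that idea (everything named here is hypothetical; neither
the flag nor dma_mapping_is_trivial() exists today):

/* Hypothetical sketch: decide once at probe time whether the device's
 * DMA mapping is a plain virt_to_phys() translation, and let the hot
 * path branch on that.  dma_mapping_is_trivial(), the rx_dma_is_trivial
 * flag and the two rx helpers are all made-up names.
 */
static void ixgbe_select_rx_path(struct ixgbe_adapter *adapter)
{
	adapter->rx_dma_is_trivial =
		dma_mapping_is_trivial(&adapter->pdev->dev);
}

static struct sk_buff *ixgbe_fetch_rx_skb(struct ixgbe_ring *rx_ring,
					  union ixgbe_adv_rx_desc *rx_desc)
{
	if (rx_ring->rx_dma_is_trivial)
		/* zero-copy: wrap the hw buffer with build_skb() */
		return ixgbe_build_skb_path(rx_ring, rx_desc);

	/* non-trivial IOMMU/swiotlb mapping: keep copy + page reuse */
	return ixgbe_page_reuse_path(rx_ring, rx_desc);
}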

>> Is there a way to fix the issue Ben pointed out a year ago?
>> A brute force fix could be: avoid half-page buffers.
>> We'll be wasting 16 Mbytes of memory. Sure, but in some cases
>> the extra performance might be worth it.
>> Other options?
>>
>> I think we need to try harder to switch to build_skb().
>> It will open up a lot of possibilities for further performance
>> improvements.
>> Thoughts?
> Yes, we should really work on converting more drivers to use
> build_skb().

The problem is build_skb usage comes at a certain cost.  Specifically in
the case of small packets it can result in a larger memory footprint
since you cannot just reuse the same region in the buffer.  I suspect we
may need to look into some sort of compromise between build_skb and a
copybreak scheme for best cache performance on Xeon for example.
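
A compromise like that might look roughly as follows (a sketch only;
IXGBE_RX_COPYBREAK and the function are made up, the headroom layout is
simplified, and DMA sync/unmap handling is omitted):

/* Sketch of a copybreak/build_skb hybrid (hypothetical names).  buf
 * points at the start of a 2K receive chunk; the frame itself is assumed
 * to start at buf + NET_SKB_PAD.
 */
#define IXGBE_RX_COPYBREAK	256

static struct sk_buff *ixgbe_rx_to_skb(struct ixgbe_ring *rx_ring,
				       void *buf, unsigned int size)
{
	struct sk_buff *skb;

	if (size <= IXGBE_RX_COPYBREAK) {
		/* small frame: copy it out so the chunk can be reused */
		skb = netdev_alloc_skb_ip_align(rx_ring->netdev, size);
		if (!skb)
			return NULL;
		memcpy(__skb_put(skb, size), buf + NET_SKB_PAD, size);
		return skb;
	}

	/* large frame: zero-copy, the skb takes ownership of the chunk */
	skb = build_skb(buf, IXGBE_RX_BUFSZ);
	if (!skb)
		return NULL;
	skb_reserve(skb, NET_SKB_PAD);
	__skb_put(skb, size);
	return skb;
}

Small frames keep the cache-friendly copy/reuse behaviour, larger frames
get the zero-copy build_skb() path.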

>> ---
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   34 +++++++++++++++++++++----
>>  1 file changed, 29 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> index 21d1a65..1d1e37f 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> @@ -1590,8 +1590,6 @@ static void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
>>  	}
>>
>>  	skb_record_rx_queue(skb, rx_ring->queue_index);
>> -
>> -	skb->protocol = eth_type_trans(skb, dev);
>>  }
>>
>>  static void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
>> @@ -2063,6 +2061,24 @@ dma_sync:
>>  	return skb;
>>  }
>>
>> +#define BURST_SIZE 2
>> +static void ixgbe_rx_skb_burst(struct sk_buff *skbs[BURST_SIZE],
>> +			       unsigned int skb_burst,
>> +			       struct ixgbe_q_vector *q_vector,
>> +			       struct net_device *dev)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < skb_burst; i++) {
>> +		struct sk_buff *skb = skbs[i];
>> +
>> +		skb->protocol = eth_type_trans(skb, dev);
>> +
>> +		skb_mark_napi_id(skb, &q_vector->napi);
>> +		ixgbe_rx_skb(q_vector, skb);
>> +	}
>> +}
>> +
>>  /**
>>   * ixgbe_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>>   * @q_vector: structure containing interrupt and ring information
>> @@ -2087,6 +2103,8 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>>  	unsigned int mss = 0;
>>  #endif /* IXGBE_FCOE */
>>  	u16 cleaned_count = ixgbe_desc_unused(rx_ring);
>> +	struct sk_buff *skbs[BURST_SIZE];
>> +	unsigned int skb_burst = 0;
>>
>>  	while (likely(total_rx_packets < budget)) {
>>  		union ixgbe_adv_rx_desc *rx_desc;
>> @@ -2161,13 +2179,19 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>>  		}
>>  
>>  #endif /* IXGBE_FCOE */
>> -		skb_mark_napi_id(skb, &q_vector->napi);
>> -		ixgbe_rx_skb(q_vector, skb);
>> -
>>  		/* update budget accounting */
>>  		total_rx_packets++;
>> +		skbs[skb_burst++] = skb;
>> +
>> +		if (skb_burst == BURST_SIZE) {
>> +			ixgbe_rx_skb_burst(skbs, skb_burst, q_vector,
>> +					   rx_ring->netdev);
>> +			skb_burst = 0;
>> +		}
>>  	}
>>  
>> +	ixgbe_rx_skb_burst(skbs, skb_burst, q_vector, rx_ring->netdev);
>> +
>>  	u64_stats_update_begin(&rx_ring->syncp);
>>  	rx_ring->stats.packets += total_rx_packets;
>>  	rx_ring->stats.bytes += total_rx_bytes;
>

For the burst size logic you might want to explore handling the
descriptors in 4 descriptor aligned chunks; that should give you the best
possible performance since that would mean processing the descriptor
ring one cache-line at a time.
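
For reference, union ixgbe_adv_rx_desc is 16 bytes, so four descriptors
are exactly one 64-byte cache line; a batched cleanup loop could look
roughly like this (a sketch only: ixgbe_clean_one_desc() is a placeholder
and ring wrap-around is omitted):

#define DESC_PER_CACHELINE (L1_CACHE_BYTES / sizeof(union ixgbe_adv_rx_desc))

/* Sketch: walk the descriptor ring one cache line (4 descriptors) at a
 * time and prefetch the next group while the current one is processed.
 */
static int ixgbe_clean_rx_irq_batched(struct ixgbe_q_vector *q_vector,
				      struct ixgbe_ring *rx_ring, int budget)
{
	int total_rx_packets = 0;

	while (total_rx_packets < budget) {
		unsigned int i;

		/* pull in the following cache line of descriptors early */
		prefetch(IXGBE_RX_DESC(rx_ring, rx_ring->next_to_clean +
					       DESC_PER_CACHELINE));

		for (i = 0; i < DESC_PER_CACHELINE; i++)
			total_rx_packets += ixgbe_clean_one_desc(q_vector,
								 rx_ring);
	}

	return total_rx_packets;
}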

Thanks,

Alex
Alexei Starovoitov Oct. 3, 2014, 4:54 p.m. UTC | #3
On Fri, Oct 3, 2014 at 7:40 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On 10/02/2014 12:36 AM, Jesper Dangaard Brouer wrote:
>> On Wed,  1 Oct 2014 23:00:42 -0700 Alexei Starovoitov <ast@plumgrid.com> wrote:
>>
>>> I'm trying to speed up single core packets per second.
>> Great, welcome to the club ;-)
>
> Yes, but please keep in mind that multi-core is the more common use case
> for many systems.

Well, I care about 'single core performance' and not 'single core systems'.
My primary test machines are 4-core i7 Haswell and
12-core Xeon servers. It's much easier to benchmark, understand, and
speed up performance on a single cpu before turning on packet
spraying and stressing the whole box.

> To that end we may want to look to something like GRO to do the
> buffering on the Rx side so that we could make use of GRO/GSO to send
> blocks of buffers instead of one at a time.

Optimizing gro would be the next step. When I turn gro on now, it only
slows things down and muddies the perf profile.

> From my past experience this is very platform dependent.  For example
> with DDIO or DCA features enabled on a system the memcpy is very cheap
> since it is already in the cache.  It is one of the reasons for choosing
> that as a means of working around the fact that we cannot use build_skb
> and page reuse in the same driver.

My systems are already Intel alphabet soup, including DCA,
and I have CONFIG_IXGBE_DCA=y,
yet memcpy() is #1 as you can see in the profile.

> One thought I had at one point was to try and add a flag to the DMA api
> to indicate if the DMA api is trivial resulting in just a call to
> virt_to_phys.  It might be worthwhile to look into something like that,
> then we could split the receive processing into one of two paths, one
> for non-trivial DMA mapping APIs, and one for trivial DMA mapping APIs
> such as swiotlb on a device that supports all the memory in the system.

I have a similar hack to optimize the swiotlb case, but it's not helpful
right now. The first step is to use build_skb().

> The problem is build_skb usage comes at a certain cost.  Specifically in
> the case of small packets it can result in a larger memory footprint
> since you cannot just reuse the same region in the buffer.  I suspect we
> may need to look into some sort of compromise between build_skb and a
> copybreak scheme for best cache performance on Xeon for example.

We're talking about the 10Gbps ixgbe use case here.
For e1000 on a small system with precious memory the copybreak
approach might make sense, but a large server in a datacenter
I would rather configure for build_skb() only.
In your patch you made a cutoff based on a 1500 MTU.
I would prefer a 1550 or 1600 cutoff, so that encapsulated
packets can get into the hypervisor as quickly as possible and
be forwarded to the appropriate VMs or containers.

> For the burst size logic you might want to explore handling the
> descriptors in 4 descriptor aligned chunks; that should give you the best
> possible performance since that would mean processing the descriptor
> ring one cache-line at a time.

Makes sense. I was thinking of pipelining it more in the future,
including splitting build_skb() into phases of allocation and initialization,
so that the prefetch from the previous stage will have time to populate caches.
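
Conceptually that would be a two-pass loop per batch, e.g. (sketch; the
two ixgbe_* skb helpers are invented stand-ins for the split build_skb()
phases and error handling is omitted):

/* Pass 1 only builds the skb shells and issues prefetches; pass 2
 * touches the packet data once it has had time to reach the cache.
 * ixgbe_alloc_skb_shell() and ixgbe_init_skb_fields() are made-up names.
 * Assumes n <= BURST_SIZE.
 */
static void ixgbe_rx_burst_pipelined(struct ixgbe_q_vector *q_vector,
				     struct ixgbe_ring *rx_ring,
				     struct ixgbe_rx_buffer **bufs,
				     unsigned int n)
{
	struct sk_buff *skbs[BURST_SIZE];
	unsigned int i;

	/* pass 1: allocate shells, start prefetching the packet headers */
	for (i = 0; i < n; i++) {
		skbs[i] = ixgbe_alloc_skb_shell(rx_ring, bufs[i]);
		prefetch(page_address(bufs[i]->page) + bufs[i]->page_offset);
	}

	/* pass 2: the headers should be in cache by now */
	for (i = 0; i < n; i++) {
		ixgbe_init_skb_fields(rx_ring, skbs[i]);
		skbs[i]->protocol = eth_type_trans(skbs[i], rx_ring->netdev);
		napi_gro_receive(&q_vector->napi, skbs[i]);
	}
}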

I was hoping my performance measurements were convincing
enough for you to dust off ixgbe+build_skb patch, fix page reuse
somehow and submit it for everyone to cheer :)
Duyck, Alexander H Oct. 3, 2014, 5:49 p.m. UTC | #4
On 10/03/2014 09:54 AM, Alexei Starovoitov wrote:
> On Fri, Oct 3, 2014 at 7:40 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On 10/02/2014 12:36 AM, Jesper Dangaard Brouer wrote:
>>> On Wed,  1 Oct 2014 23:00:42 -0700 Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>
>>>> I'm trying to speed up single core packets per second.
>>> Great, welcome to the club ;-)
>>
>> Yes, but please keep in mind that multi-core is the more common use case
>> for many systems.
> 
> Well, I care about 'single core performance' and not 'single core systems'.
> My primary test machines are 4-core i7 Haswell and
> 12-core Xeon servers. It's much easier to benchmark, understand, and
> speed up performance on a single cpu before turning on packet
> spraying and stressing the whole box.

The point is that there are single core optimizations that don't actually
scale well when you spread them out to multiple cores, due to
design issues resulting in cache thrash and such.  I was just suggesting
you keep that in mind when making any changes.

>> To that end we may want to look to something like GRO to do the
>> buffering on the Rx side so that we could make use of GRO/GSO to send
>> blocks of buffers instead of one at a time.
> 
> Optimizing gro would be the next step. When I turn gro on now, it only
> slows things down and muddies the perf profile.

I am aware of that.  Optimizing GRO now though might make more sense
since it would allow for a clean optimization.  The problem with using
build_skb on ixgbe is that ixgbe does page-reuse.  If you were to do a
clean setup of build_skb you would likely lose more performance as the
design of page-reuse provided significant gains on ixgbe, especially on
platforms with IOMMU enabled.

>> From my past experience this is very platform dependent.  For example
>> with DDIO or DCA features enabled on a system the memcpy is very cheap
>> since it is already in the cache.  It is one of the reasons for choosing
>> that as a means of working around the fact that we cannot use build_skb
>> and page reuse in the same driver.
> 
> My systems are already Intel alphabet soup, including DCA,
> and I have CONFIG_IXGBE_DCA=y,
> yet memcpy() is #1 as you can see in the profile.

Yes, but that is the kernel option, not the hardware feature.  On Xeon
systems both IOAT and DDIO are available and function.  On i7 those
features are not enabled as I recall.

>> One thought I had at one point was to try and add a flag to the DMA api
>> to indicate if the DMA api is trivial resulting in just a call to
>> virt_to_phys.  It might be worthwhile to look into something like that,
>> then we could split the receive processing into one of two paths, one
>> for non-trivial DMA mapping APIs, and one for trivial DMA mapping APIs
>> such as swiotlb on a device that supports all the memory in the system.
> 
> I have a similar hack to optimize the swiotlb case, but it's not helpful
> right now. The first step is to use build_skb().

If you really want to use build_skb() you should remove most of the page
reuse code and simply map a page and use sub-sections of it.  The ixgbe
receive path didn't work with build_skb() because of the page reuse, and
before you can push anything that would make use of build_skb() you
would first need to remove all of that code.
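
Roughly along these lines (a sketch only; DMA sync/unmap and page
refcounting are left out, and the chunk layout is illustrative):

/* Sketch: the page is DMA-mapped once, carved into IXGBE_RX_BUFSZ (2K)
 * chunks, and each chunk becomes the backing store of one skb via
 * build_skb() -- no memcpy and no page reuse.
 */
static struct sk_buff *ixgbe_build_rx_skb(struct ixgbe_ring *rx_ring,
					  struct ixgbe_rx_buffer *rx_buffer,
					  unsigned int size)
{
	void *va = page_address(rx_buffer->page) + rx_buffer->page_offset;
	struct sk_buff *skb;

	/* chunk layout: [ NET_SKB_PAD | frame | struct skb_shared_info ] */
	skb = build_skb(va, IXGBE_RX_BUFSZ);
	if (unlikely(!skb))
		return NULL;

	skb_reserve(skb, NET_SKB_PAD);
	__skb_put(skb, size);

	/* advance to the next 2K chunk; once the page is exhausted it is
	 * released and a fresh page is mapped instead of being reused */
	rx_buffer->page_offset += IXGBE_RX_BUFSZ;

	return skb;
}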

>> The problem is build_skb usage comes at a certain cost.  Specifically in
>> the case of small packets it can result in a larger memory footprint
>> since you cannot just reuse the same region in the buffer.  I suspect we
>> may need to look into some sort of compromise between build_skb and a
>> copybreak scheme for best cache performance on Xeon for example.
> 
> We're talking about the 10Gbps ixgbe use case here.
> For e1000 on a small system with precious memory the copybreak
> approach might make sense, but a large server in a datacenter
> I would rather configure for build_skb() only.
> In your patch you made a cutoff based on a 1500 MTU.
> I would prefer a 1550 or 1600 cutoff, so that encapsulated
> packets can get into the hypervisor as quickly as possible and
> be forwarded to the appropriate VMs or containers.

I am talking about cache footprint rather than memory.  At 1500 byte packets
the expense for doing a 1 or 2 cache line copy to get the header is
pretty low.  Also if you are talking about virtualization I assume you
must not be talking about co-existing with direct assignment or running
on PowerPC since most drivers that use build_skb will not perform as
well as the page reuse when an IOMMU is enabled on the host.

Also the upper limit would largely be controlled based on the size of
the skb_shared_info.  The actual value for maximum supported buffer
would probably be the shared info size plus the size for any padding
needed at the start of the frame subtracted from 2K.
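
In other words, roughly (illustrative arithmetic only, not an existing
macro):

/* Illustrative only: the largest frame that fits in one 2K chunk once
 * the headroom and the trailing struct skb_shared_info are subtracted.
 */
#define IXGBE_BUILD_SKB_MAX_FRAME \
	(2048 - NET_SKB_PAD - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))

On x86_64 that works out to roughly 1.6K, which is in the same ballpark
as the 1550/1600 cutoff discussed above.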

>> For the burst size logic you might want to explore handling the
>> descriptors in 4 descriptor aligned chunks; that should give you the best
>> possible performance since that would mean processing the descriptor
>> ring one cache-line at a time.
> 
> Makes sense. I was thinking of pipelining it more in the future,
> including splitting build_skb() into phases of allocation and initialization,
> so that the prefetch from the previous stage will have time to populate caches.
> 
> I was hoping my performance measurements were convincing
> enough for you to dust off ixgbe+build_skb patch, fix page reuse
> somehow and submit it for everyone to cheer :)

The problem is market segment focus.  You are running ixgbe on
enthusiast grade hardware.  When you get it up onto the big Xeon or even
PowerPC systems the performance layout is quite different due to the
introduction of other features, and the focus is on these units as this
is what you find in most big data centers.

The i7 focused approach can be very narrow.  When I rewrote most of the
receive path for ixgbe I had to take into account multiple use cases,
including multiple architectures and system setups with multiple
NUMA nodes, IOMMU, PowerPC, Atom, Xeon, and other factors.  The current
driver is a compromise in order to get the best performance across all
platforms without sacrificing too much on any one of them.

If anything I think you might need to start first with looking at the
DMA APIs out there and seeing if there is a way to sort out trivial and
non-trivial DMA APIs.  So for the cases that are using a simple
virt_to_phys such as x86 we might be able to get away with just using
build_skb and could probably set a flag indicating as such so that we
ran that path.  For the other DMA APIs out there however we would
probably need to just use standard page reuse in order to avoid
overwriting the page data on a shared page.

Thanks,

Alex


Patch

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 21d1a65..1d1e37f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1590,8 +1590,6 @@  static void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
 	}

 	skb_record_rx_queue(skb, rx_ring->queue_index);
-
-	skb->protocol = eth_type_trans(skb, dev);
 }

 static void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
@@ -2063,6 +2061,24 @@  dma_sync:
 	return skb;
 }

+#define BURST_SIZE 2
+static void ixgbe_rx_skb_burst(struct sk_buff *skbs[BURST_SIZE],
+			       unsigned int skb_burst,
+			       struct ixgbe_q_vector *q_vector,
+			       struct net_device *dev)
+{
+	int i;
+
+	for (i = 0; i < skb_burst; i++) {
+		struct sk_buff *skb = skbs[i];
+
+		skb->protocol = eth_type_trans(skb, dev);
+
+		skb_mark_napi_id(skb, &q_vector->napi);
+		ixgbe_rx_skb(q_vector, skb);
+	}
+}
+
 /**
  * ixgbe_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @q_vector: structure containing interrupt and ring information
@@ -2087,6 +2103,8 @@  static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 	unsigned int mss = 0;
 #endif /* IXGBE_FCOE */
 	u16 cleaned_count = ixgbe_desc_unused(rx_ring);
+	struct sk_buff *skbs[BURST_SIZE];
+	unsigned int skb_burst = 0;

 	while (likely(total_rx_packets < budget)) {
 		union ixgbe_adv_rx_desc *rx_desc;
@@ -2161,13 +2179,19 @@  static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		}
 
 #endif /* IXGBE_FCOE */
-		skb_mark_napi_id(skb, &q_vector->napi);
-		ixgbe_rx_skb(q_vector, skb);
-
 		/* update budget accounting */
 		total_rx_packets++;
+		skbs[skb_burst++] = skb;
+
+		if (skb_burst == BURST_SIZE) {
+			ixgbe_rx_skb_burst(skbs, skb_burst, q_vector,
+					   rx_ring->netdev);
+			skb_burst = 0;
+		}
 	}
 
+	ixgbe_rx_skb_burst(skbs, skb_burst, q_vector, rx_ring->netdev);
+
 	u64_stats_update_begin(&rx_ring->syncp);
 	rx_ring->stats.packets += total_rx_packets;
 	rx_ring->stats.bytes += total_rx_bytes;