
net: Update netdev_alloc_frag to work more efficiently with TCP and GRO

Message ID 20120620004306.17814.58369.stgit@gitlad.jf.intel.com
State RFC, archived
Delegated to: David Miller

Commit Message

Duyck, Alexander H June 20, 2012, 12:43 a.m. UTC
This patch is meant to help improve system performance when
netdev_alloc_frag is used in scenarios in which buffers are short lived.
This is accomplished by allowing the page offset to be reset in the event
that the page count is 1.  I also reversed the direction in which we give
out sections of the page, so that we start at the end of the page and end
at the start.  The main motivation is that I preferred to have the offset
represent the amount of the page remaining to be used.

My primary test case was using ixgbe in combination with TCP.  With this
patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
thread of netperf receiving a TCP stream via ixgbe.

I also tested several scenarios in which the page reuse would not be
possible such as UDP flows and routing.  In both of these scenarios I saw
no noticeable performance degradation compared to the kernel without this
patch.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 net/core/skbuff.c |   15 +++++++++++----
 1 files changed, 11 insertions(+), 4 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
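The reuse trick described in the commit message can be sketched outside the kernel. The following is a simplified userspace model, not the kernel code: SIM_PAGE_SIZE, frag_cache, and the malloc-backed refcounting are stand-ins for the real page allocator and per-CPU cache, and the model assumes all outstanding fragments are freed before the page is exhausted.

```c
#include <stdlib.h>

#define SIM_PAGE_SIZE 4096

/* Userspace stand-in for the per-CPU cache: one page, the number of
 * bytes still unused, and a simulated reference count. */
struct frag_cache {
	unsigned char *page;
	unsigned int offset;    /* bytes of the page still unused */
	unsigned int refcount;  /* 1 == only the allocator holds the page */
};

/* Mirrors the patched logic: hand out fragments from the end of the
 * page toward the start, and reset the offset to the full page size
 * whenever the reference count shows all earlier fragments were freed. */
static void *frag_alloc(struct frag_cache *nc, unsigned int fragsz)
{
refill:
	if (!nc->page) {
		nc->page = malloc(SIM_PAGE_SIZE);
		nc->refcount = 1;
	}
	if (!nc->page)
		return NULL;

	unsigned int offset = SIM_PAGE_SIZE;

	if (nc->refcount != 1)
		offset = nc->offset;

	if (offset < fragsz) {
		/* Simplification: assume no fragments are outstanding
		 * when the page is exhausted, so it can be released. */
		free(nc->page);
		nc->page = NULL;
		goto refill;
	}

	offset -= fragsz;
	nc->offset = offset;
	nc->refcount++;         /* caller now holds a reference */
	return nc->page + offset;
}

static void frag_free(struct frag_cache *nc, void *data)
{
	(void)data;
	nc->refcount--;         /* caller's reference released */
}
```

With 256-byte fragments, the second allocation lands 256 bytes below the first; once both are freed the count drops back to 1, so the next allocation starts again from the end of the page instead of discarding it, which is the short-lived-buffer case the patch targets.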

Comments

Alexander H Duyck June 20, 2012, 1:49 a.m. UTC | #1
On 6/19/2012 5:43 PM, Alexander Duyck wrote:
> [...]
It looks like I forgot to add "--auto" to the command line when I sent 
this out via stg mail so I am just adding Eric to the CC list on this 
reply.  Sorry for the extra noise.

Thanks,

Alex

Eric Dumazet June 20, 2012, 5:36 a.m. UTC | #2
On Tue, 2012-06-19 at 17:43 -0700, Alexander Duyck wrote:
> [...]

I tested this idea a month ago and got unconvincing results, because
the branch was taken half of the time.

The case where the page can be reused is probably specific to ixgbe,
because it uses a different allocator for the frags themselves;
netdev_alloc_frag() is only used to allocate the skb head.

For typical nics, we allocate frags to populate the RX ring _way_ before
packet is received by the NIC.

Then, I played with using order-2 pages instead of order-0 ones if
PAGE_SIZE < 8192.

No clear win either, but you might try this too.



Eric Dumazet June 20, 2012, 8:17 a.m. UTC | #3
On Wed, 2012-06-20 at 07:36 +0200, Eric Dumazet wrote:

> [...]

By the way, the big cost in netdev_alloc_frag() is the irq masking/restore.
We probably could have a version for softirq users...

Another idea would be to use a pool of pages, instead of a single one,
if we want to play with the "clear the offset if page count is one"
idea.

Strangely, I ran the order-2 allocation benchmarks again and got good
results this time, but with the latest net-next; maybe things have changed
since the last time I did this.

(netdev_alloc_frag(), get_page_from_freelist() and put_page() are less
prevalent in the perf results)


Eric Dumazet June 20, 2012, 8:44 a.m. UTC | #4
On Wed, 2012-06-20 at 10:17 +0200, Eric Dumazet wrote:

> Strange, I did again benchs with order-2 allocations and got good
> results this time, but with latest net-next, maybe things have changed
> since last time I did this.
> 
> (netdev_alloc_frag(), get_page_from_freelist() and put_page() less
> prevalent in perf results)
> 

Oh well, I now remember why I abandoned this: machines don't have
infinite memory after all...

(put_page assumes order-0 page)


David Miller June 20, 2012, 9:04 a.m. UTC | #5
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 20 Jun 2012 10:44:07 +0200

> (put_page assumes order-0 page)

But it certainly can handle compound pages, I thought?
Eric Dumazet June 20, 2012, 9:14 a.m. UTC | #6
On Wed, 2012-06-20 at 02:04 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 20 Jun 2012 10:44:07 +0200
> 
> > (put_page assumes order-0 page)
> 
> But it certainly can handle compound pages, I thought?

Yes, I forgot __GFP_COMP ;)

Oh well, time for a coffee


Duyck, Alexander H June 20, 2012, 4:30 p.m. UTC | #7
On 06/19/2012 10:36 PM, Eric Dumazet wrote:
> On Tue, 2012-06-19 at 17:43 -0700, Alexander Duyck wrote:
>> [...]
> I tested this idea a month ago and got unconvincing results, because
> the branch was taken half of the time.
>
> The case where the page can be reused is probably specific to ixgbe,
> because it uses a different allocator for the frags themselves;
> netdev_alloc_frag() is only used to allocate the skb head.
Actually it is pretty much anywhere a copy-break type setup exists.  I
think ixgbe and a few other drivers have this type of setup, where
netdev_alloc_skb is called and the data is just copied into the buffer.
My thought was that if I can improve this one case without hurting the
other cases, I should just go ahead and submit it, since it is a net win
performance-wise.

I think one of the biggest advantages of this for ixgbe is that it
allows the buffer to become cache warm so that writing the shared info
and copying the header contents becomes very cheap compared to accessing
a cache cold page.

> For typical nics, we allocate frags to populate the RX ring _way_ before
> packet is received by the NIC.
>
> Then, I played with using order-2 pages instead of order-0 ones if
> PAGE_SIZE < 8192.
>
> No clear win either, but you might try this too.
The biggest issue I see with an order-2 page is that it means the memory
is going to take much longer to cycle out of a shared page.  As a result,
changes like the one I just came up with would likely have little to no
benefit, because we would run out of room in the frags list before we
could start reusing a fresh page.

Thanks,

Alex

Duyck, Alexander H June 20, 2012, 5:14 p.m. UTC | #8
On 06/20/2012 09:30 AM, Alexander Duyck wrote:
> On 06/19/2012 10:36 PM, Eric Dumazet wrote:
> [...]
>
Actually, I think I just realized what the difference is.  I was looking
at things with LRO disabled.  With LRO enabled, our hardware RSC feature
kind of defeats the whole point of GRO or TCP coalescing anyway,
since it will stuff 16 fragments into a single packet before we even
hand the packet off to the stack.

Thanks,

Alex
Eric Dumazet June 20, 2012, 6:41 p.m. UTC | #9
On Wed, 2012-06-20 at 10:14 -0700, Alexander Duyck wrote:

> Actually I think I just realized what the difference is.  I was looking
> at things with LRO disabled.  With LRO enabled our hardware RSC feature
> kind of defeats the whole point of the GRO or TCP coalescing anyway
> since it will stuff 16 fragments into a single packet before we even
> hand the packet off to the stack.

I noticed LRO is now 'off' by default on ixgbe (net-next tree); I am
pretty sure it was 'on' some months ago?



Duyck, Alexander H June 20, 2012, 8:10 p.m. UTC | #10
On 06/20/2012 11:41 AM, Eric Dumazet wrote:
> On Wed, 2012-06-20 at 10:14 -0700, Alexander Duyck wrote:
>
>> Actually I think I just realized what the difference is.  I was looking
>> at things with LRO disabled.  With LRO enabled our hardware RSC feature
>> kind of defeats the whole point of the GRO or TCP coalescing anyway
>> since it will stuff 16 fragments into a single packet before we even
>> hand the packet off to the stack.
> I noticed LRO was now 'off' by default on ixgbe (net-next tree), I am
> pretty sure it was 'on' some months ago ?
It should be on by default unless you are doing some routing.  In that
case, as soon as the interface comes up, LRO is disabled by the stack.

Thanks,

Alex
David Miller June 21, 2012, 5:56 a.m. UTC | #11
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 20 Jun 2012 10:17:03 +0200

> By the way, the big cost in netdev_alloc_frag() is the irq
> masking/restore.  We probably could have a version for softirq
> users...

That's an extremely disappointing way for us to be losing cycles.

Everyone pays this price purely because:

1) __netdev_alloc_skb() uses netdev_alloc_frag()

and:

2) #1 is invoked, either directly or indirectly, by tons
   of slow non-NAPI drivers.

This got me looking into the plethora of interfaces we let drivers use
to allocate RX buffers.  It's a big mess.

We have dev_alloc_skb() which essentially calls __netdev_alloc_skb()
with a NULL device argument.  This is terrible because it means that
if we want to do something interesting on a per-device level we can't
rely upon the device being non-NULL in __netdev_alloc_skb().

I looked at the remaining dev_alloc_skb() users and these are in places
which are allocating packets in a module which is one level removed from
the netdevice level.  For example, ATM and infiniband IPATH.

What these callers want is something more like:

static inline struct sk_buff *alloc_skb_and_reserve_pad(unsigned int length,
							gfp_t gfp_mask)
{
	struct sk_buff *skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
					  0, NUMA_NO_NODE);
	if (likely(skb))
		skb_reserve(skb, NET_SKB_PAD);
	return skb;
}

Then we won't have the NULL device case for __netdev_alloc_skb() any more.
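The shape of the proposed helper can be modeled in userspace. This is a hypothetical analogue, not kernel code: sim_skb, SIM_NET_SKB_PAD, and the malloc backing are stand-ins for sk_buff, NET_SKB_PAD, and __alloc_skb; it only shows how allocating length + pad and then reserving the pad leaves headroom in front of the payload.

```c
#include <stdlib.h>

#define SIM_NET_SKB_PAD 32	/* stand-in for NET_SKB_PAD */

/* Minimal userspace analogue of an skb: one buffer with head/data
 * pointers, where "reserving" headroom just advances data. */
struct sim_skb {
	unsigned char *head;	/* start of the allocation */
	unsigned char *data;	/* start of the packet payload */
	unsigned int truesize;
};

static void sim_skb_reserve(struct sim_skb *skb, unsigned int len)
{
	skb->data += len;	/* headroom grows; payload area moves down */
}

/* Mirrors the proposed helper: allocate length + pad, then reserve the
 * pad so the caller gets headroom for link-layer headers up front. */
static struct sim_skb *sim_alloc_skb_and_reserve_pad(unsigned int length)
{
	struct sim_skb *skb = malloc(sizeof(*skb));

	if (!skb)
		return NULL;
	skb->truesize = length + SIM_NET_SKB_PAD;
	skb->head = malloc(skb->truesize);
	if (!skb->head) {
		free(skb);
		return NULL;
	}
	skb->data = skb->head;
	sim_skb_reserve(skb, SIM_NET_SKB_PAD);
	return skb;
}
```

After the call, data sits SIM_NET_SKB_PAD bytes past head, which is the invariant the real helper would guarantee to its callers.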

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..eb3853c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -317,15 +317,22 @@  void *netdev_alloc_frag(unsigned int fragsz)
 	if (unlikely(!nc->page)) {
 refill:
 		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
-		nc->offset = 0;
 	}
 	if (likely(nc->page)) {
-		if (nc->offset + fragsz > PAGE_SIZE) {
+		unsigned int offset = PAGE_SIZE;
+
+		if (page_count(nc->page) != 1)
+			offset = nc->offset;
+
+		if (offset < fragsz) {
 			put_page(nc->page);
 			goto refill;
 		}
-		data = page_address(nc->page) + nc->offset;
-		nc->offset += fragsz;
+
+		offset -= fragsz;
+		nc->offset = offset;
+
+		data = page_address(nc->page) + offset;
 		get_page(nc->page);
 	}
 	local_irq_restore(flags);