diff mbox

[net-next,02/11] ixgbe: Mask off check of frag_off as we only want fragment offset

Message ID 1365765866-15741-3-git-send-email-jeffrey.t.kirsher@intel.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Kirsher, Jeffrey T April 12, 2013, 11:24 a.m. UTC
From: Alexander Duyck <alexander.h.duyck@intel.com>

We were incorrectly checking the entire frag_off field when we only wanted the
fragment offset.  As a result we were not pulling in TCP headers when the DNF
flag was set.

To correct that we will now check for frag off using the IP_OFFSET mask.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Eric Dumazet April 12, 2013, 1:28 p.m. UTC | #1
On Fri, 2013-04-12 at 04:24 -0700, Jeff Kirsher wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> We were incorrectly checking the entire frag_off field when we only wanted the
> fragment offset.  As a result we were not pulling in TCP headers when the DNF
> flag was set.
> 
> To correct that we will now check for frag off using the IP_OFFSET mask.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 0c51a9f..096a221 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -1347,7 +1347,7 @@ static unsigned int ixgbe_get_headlen(unsigned char *data,
>  			return hdr.network - data;
>  
>  		/* record next protocol if header is present */
> -		if (!hdr.ipv4->frag_off)
> +		if (!(hdr.ipv4->frag_off & htons(IP_OFFSET)))
>  			nexthdr = hdr.ipv4->protocol;
>  	} else if (protocol == __constant_htons(ETH_P_IPV6)) {
>  		if ((hdr.network - data) > (max_len - sizeof(struct ipv6hdr)))

I wonder if you could use core functions instead of all this...

A simple wrapper would be :

static noinline unsigned int ixgbe_get_headlen(unsigned char *data,
						u32 maxlen)
{
    struct skb fake;
    unsigned int res;

    fake->data = data;
    fake->head = data;
    fake->data_len = 0;
    fake->len = maxlen;
    skb_reset_network_header(&fake);
    res = __skb_get_poff(&fake);
    return res ?: maxlen;
}

(completely untested)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 12, 2013, 1:45 p.m. UTC | #2
On Fri, 2013-04-12 at 06:28 -0700, Eric Dumazet wrote:

> I wonder if you could use core functions instead of all this...
> 
> A simple wrapper would be :

Or more something like :

static noinline unsigned int ixgbe_get_headlen(unsigned char *data,
                                                u32 maxlen)
{
    struct skb fake;
    unsigned int res;

    if (maxlen < ETH_HLEN)
	return maxlen;

    fake->data = data + ETH_HLEN;
    fake->head = data;
    fake->data_len = 0;
    fake->len = maxlen - ETH_HLEN;
    skb_reset_network_header(&fake);
    res = __skb_get_poff(&fake);
    return res ? res + ETH_HLEN : maxlen;
}



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Duyck, Alexander H April 12, 2013, 4:38 p.m. UTC | #3
On 04/12/2013 06:45 AM, Eric Dumazet wrote:
> On Fri, 2013-04-12 at 06:28 -0700, Eric Dumazet wrote:
>
>> I wonder if you could use core functions instead of all this...
>>
>> A simple wrapper would be :
> Or more something like :
>
> static noinline unsigned int ixgbe_get_headlen(unsigned char *data,
>                                                 u32 maxlen)
> {
>     struct skb fake;
>     unsigned int res;
>
>     if (maxlen < ETH_HLEN)
> 	return maxlen;
>
>     fake->data = data + ETH_HLEN;
>     fake->head = data;
>     fake->data_len = 0;
>     fake->len = maxlen - ETH_HLEN;
>     skb_reset_network_header(&fake);
>     res = __skb_get_poff(&fake);
>     return res ? res + ETH_HLEN : maxlen;
> }

The problem is this is way more then I need, and I would prefer not to
allocate a 192+ byte structure on the stack in order to just parse a
header that is likely less than 128 bytes.

I could probably do something like create a copy of the
ixgbe_get_headlen function, maybe named something like
etherdev_get_headlen and stored in eth.c that could be used by both igb
and ixgbe.  That way it would be available for anyone else who might
want to do something similar.  If that would work for you I could
probably submit that patch sometime in the next few hours.

Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 12, 2013, 4:51 p.m. UTC | #4
On Fri, 2013-04-12 at 09:38 -0700, Alexander Duyck wrote:
> On 04/12/2013 06:45 AM, Eric Dumazet wrote:
> > On Fri, 2013-04-12 at 06:28 -0700, Eric Dumazet wrote:
> >
> >> I wonder if you could use core functions instead of all this...
> >>
> >> A simple wrapper would be :
> > Or more something like :
> >
> > static noinline unsigned int ixgbe_get_headlen(unsigned char *data,
> >                                                 u32 maxlen)
> > {
> >     struct skb fake;
> >     unsigned int res;
> >
> >     if (maxlen < ETH_HLEN)
> > 	return maxlen;
> >
> >     fake->data = data + ETH_HLEN;
> >     fake->head = data;
> >     fake->data_len = 0;
> >     fake->len = maxlen - ETH_HLEN;
> >     skb_reset_network_header(&fake);
> >     res = __skb_get_poff(&fake);
> >     return res ? res + ETH_HLEN : maxlen;
> > }
> 
> The problem is this is way more then I need, and I would prefer not to
> allocate a 192+ byte structure on the stack in order to just parse a
> header that is likely less than 128 bytes.
> 

Thats why I used 'noinline' keyword.

Your code adds significative icache pressure and latencies.


> I could probably do something like create a copy of the
> ixgbe_get_headlen function, maybe named something like
> etherdev_get_headlen and stored in eth.c that could be used by both igb
> and ixgbe.  That way it would be available for anyone else who might
> want to do something similar.  If that would work for you I could
> probably submit that patch sometime in the next few hours.

No please don't do that.

I suggested reusing stuff, not duplicating it.

The main problem is not the cpu cycles spent to parse the header, but
bringing two cache lines for the memcpy() to pull headers.  (TCP uses 66
bytes of headers)

If you use a prefetch(data + 64), chances are good the current generic
code will run before hitting the memory stall


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Duyck, Alexander H April 12, 2013, 6:22 p.m. UTC | #5
On 04/12/2013 09:51 AM, Eric Dumazet wrote:
> On Fri, 2013-04-12 at 09:38 -0700, Alexander Duyck wrote:
>> On 04/12/2013 06:45 AM, Eric Dumazet wrote:
>>> On Fri, 2013-04-12 at 06:28 -0700, Eric Dumazet wrote:
>>>
>>>> I wonder if you could use core functions instead of all this...
>>>>
>>>> A simple wrapper would be :
>>> Or more something like :
>>>
>>> static noinline unsigned int ixgbe_get_headlen(unsigned char *data,
>>>                                                 u32 maxlen)
>>> {
>>>     struct skb fake;
>>>     unsigned int res;
>>>
>>>     if (maxlen < ETH_HLEN)
>>> 	return maxlen;
>>>
>>>     fake->data = data + ETH_HLEN;
>>>     fake->head = data;
>>>     fake->data_len = 0;
>>>     fake->len = maxlen - ETH_HLEN;
>>>     skb_reset_network_header(&fake);
>>>     res = __skb_get_poff(&fake);
>>>     return res ? res + ETH_HLEN : maxlen;
>>> }
>> The problem is this is way more then I need, and I would prefer not to
>> allocate a 192+ byte structure on the stack in order to just parse a
>> header that is likely less than 128 bytes.
>>
> Thats why I used 'noinline' keyword.
>
> Your code adds significative icache pressure and latencies.

The footprint for the code itself is not that large, and the fact is the
behavior is different enough from skb_flow_dissect which is what
__skb_get_poff relies on that I don't think I could get the same
behavior without adding at least one more protocol (FCOE), and probably
some sort of flag because in our case we want the header length
including L4 header for the first frame of a fragmented flow since the
goal is to leave only payload data in the pages.

>> I could probably do something like create a copy of the
>> ixgbe_get_headlen function, maybe named something like
>> etherdev_get_headlen and stored in eth.c that could be used by both igb
>> and ixgbe.  That way it would be available for anyone else who might
>> want to do something similar.  If that would work for you I could
>> probably submit that patch sometime in the next few hours.
> No please don't do that.
>
> I suggested reusing stuff, not duplicating it.
>
> The main problem is not the cpu cycles spent to parse the header, but
> bringing two cache lines for the memcpy() to pull headers.  (TCP uses 66
> bytes of headers)
>
> If you use a prefetch(data + 64), chances are good the current generic
> code will run before hitting the memory stall

The main problem I have with this is the fact that before we are done we
would have had to populate a number of fields within the fake skb before
the parsing could be completed.  This also assumes that at no point in
the future will somebody add anything that requires any other fields to
be set or unset within the skb since all of the values in fake are not
memset to 0 like a standard skb.  It would be a pain to debug this type
of issue.

For example the code snippet you sent likely wouldn't have worked
because it appears to have missed that it would also need to set
skb->protocol before being called.

I appreciate the desire to reuse, and what I meant was that since igb
and ixgbe both use essentially the same function I could move it to one
central location and both of them could use it as well as any other low
level drivers that need to just quickly parse the header out of a linear
block of data.  I just don't feel __skb_get_poff really does what I am
looking for since it assumes it is working with a skb, not just a linear
block of data.  If it could get broken up somehow so that it, or at
least pieces of it could just be used on linear blocks of data then I
might be more interested in reusing it.

Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 12, 2013, 6:44 p.m. UTC | #6
On Fri, 2013-04-12 at 11:22 -0700, Alexander Duyck wrote:

> I appreciate the desire to reuse, and what I meant was that since igb
> and ixgbe both use essentially the same function I could move it to one
> central location and both of them could use it as well as any other low
> level drivers that need to just quickly parse the header out of a linear
> block of data.  I just don't feel __skb_get_poff really does what I am
> looking for since it assumes it is working with a skb, not just a linear
> block of data.

Big deal.

Filling skb->data/head/len/data_len/network_offset and you're done in 3
cycles or so.

We are constrained by the ~250 cycles the cpu has for free to fetch 2
cache lines from the received frame.

>   If it could get broken up somehow so that it, or at
> least pieces of it could just be used on linear blocks of data then I
> might be more interested in reusing it.

But this is not going to happen. Our skb code is able to deal with
multiple areas (skb->head and frags). We are not going to duplicate it
for linear skbs 'only'

Thats your call obviously to maintain your own code, but consider you
fix this quite incredible bug only now. (All TCP data segments have the
DF flag set, so you were not pulling tcp headers)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Duyck, Alexander H April 12, 2013, 8:12 p.m. UTC | #7
On 04/12/2013 11:44 AM, Eric Dumazet wrote:
> On Fri, 2013-04-12 at 11:22 -0700, Alexander Duyck wrote:
>>   If it could get broken up somehow so that it, or at
>> least pieces of it could just be used on linear blocks of data then I
>> might be more interested in reusing it.
> But this is not going to happen. Our skb code is able to deal with
> multiple areas (skb->head and frags). We are not going to duplicate it
> for linear skbs 'only'
>
> Thats your call obviously to maintain your own code, but consider you
> fix this quite incredible bug only now. (All TCP data segments have the
> DF flag set, so you were not pulling tcp headers)

I think part of the trouble is we are debating reworking the wrong
function.  We would probably be better off if we could come up with a
generic way to handle the ixgbe_pull_tail function.  I think the only
complaint I would have in using the __skb_get_poff in there is the fact
that it would be copying the header in multiple chunks.  If it worked
more like GRO where it just used an offset in the first frag instead of
having to copy the headers separately to read them then I would be 100%
on board.

Thanks,

Alex




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 12, 2013, 8:29 p.m. UTC | #8
On Fri, 2013-04-12 at 13:12 -0700, Alexander Duyck wrote:

> I think part of the trouble is we are debating reworking the wrong
> function.  We would probably be better off if we could come up with a
> generic way to handle the ixgbe_pull_tail function.  I think the only
> complaint I would have in using the __skb_get_poff in there is the fact
> that it would be copying the header in multiple chunks.  If it worked
> more like GRO where it just used an offset in the first frag instead of
> having to copy the headers separately to read them then I would be 100%
> on board.

__skb_get_poff() does no copy at all if you provide a linear 'skb'

Its really fast, I am really sorry you have wrong idea of what is going
on.

static inline void *skb_header_pointer(const struct sk_buff *skb, int offset,
                                       int len, void *buffer)
{
        int hlen = skb_headlen(skb);

        if (hlen - offset >= len)
                return skb->data + offset;


On the other hand GRO engine is way more expensive, now you mention it, we might
use the flow dissector to speed it.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Duyck, Alexander H April 12, 2013, 9:05 p.m. UTC | #9
On 04/12/2013 01:29 PM, Eric Dumazet wrote:
> On Fri, 2013-04-12 at 13:12 -0700, Alexander Duyck wrote:
>
>> I think part of the trouble is we are debating reworking the wrong
>> function.  We would probably be better off if we could come up with a
>> generic way to handle the ixgbe_pull_tail function.  I think the only
>> complaint I would have in using the __skb_get_poff in there is the fact
>> that it would be copying the header in multiple chunks.  If it worked
>> more like GRO where it just used an offset in the first frag instead of
>> having to copy the headers separately to read them then I would be 100%
>> on board.
> __skb_get_poff() does no copy at all if you provide a linear 'skb'
>
> Its really fast, I am really sorry you have wrong idea of what is going
> on.
>
> static inline void *skb_header_pointer(const struct sk_buff *skb, int offset,
>                                        int len, void *buffer)
> {
>         int hlen = skb_headlen(skb);
>
>         if (hlen - offset >= len)
>                 return skb->data + offset;
>
>
> On the other hand GRO engine is way more expensive, now you mention it, we might
> use the flow dissector to speed it.

I am sure it is fast with a linear skb.  I understand what is going on
in that function, and perhaps my original complaint was not clear.  The
issue I had before is the fact that we were using a big uninitialized
blob that was the fake skb, and I really didn't want to deal with the
potential of somebody expecting something there was that possibly
uninitialized.  Instead I would prefer to deal with what we have,
instead of possibly introducing new issues by using uninitialized memory.

The problem is I have a non-linear skb, with nothing in the skb->data
region.  All my code does now is parse frag 0 to get the size of the
head, and then memcpy that out to skb->data and then update lengths and
offsets.  The problem right now is if I call __skb_get_poff it will go
through the portion of that path you didn't call out that does
skb_copy_bits.

What I would need in order to give up my current solution is a function
that I can pass a fully formed non-linear skb to that will efficiently
parse the header out of frag 0 and place it in skb->data.  What I do not
want to do is hand a partially initialized skb off to some kernel
function and hope that it doesn't do anything I didn't account for.

Perhaps it is just best for me to wait and see what you do with GRO
since it has many of the same requirements that I essentially do.

Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 12, 2013, 9:12 p.m. UTC | #10
On Fri, 2013-04-12 at 14:05 -0700, Alexander Duyck wrote:

>  The problem right now is if I call __skb_get_poff it will go
> through the portion of that path you didn't call out that does
> skb_copy_bits.

Absolutely not : skb->data_len is 0.

How possibly can you say such things ?

OK I give up with this topic.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Duyck, Alexander H April 12, 2013, 9:34 p.m. UTC | #11
On 04/12/2013 02:12 PM, Eric Dumazet wrote:
> On Fri, 2013-04-12 at 14:05 -0700, Alexander Duyck wrote:
>
>>  The problem right now is if I call __skb_get_poff it will go
>> through the portion of that path you didn't call out that does
>> skb_copy_bits.
> Absolutely not : skb->data_len is 0.
>
> How possibly can you say such things ?
>
> OK I give up with this topic.

Yeah, we should agree to disagree since I think at some point we started
arguing two different things.

As for the patch that started this though we are good right?  No problem
with the patch itself, our discussion was over possible future work correct?

Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 12, 2013, 9:40 p.m. UTC | #12
On Fri, 2013-04-12 at 14:34 -0700, Alexander Duyck wrote:

> As for the patch that started this though we are good right?  No problem
> with the patch itself, our discussion was over possible future work correct?

No problem.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0c51a9f..096a221 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1347,7 +1347,7 @@  static unsigned int ixgbe_get_headlen(unsigned char *data,
 			return hdr.network - data;
 
 		/* record next protocol if header is present */
-		if (!hdr.ipv4->frag_off)
+		if (!(hdr.ipv4->frag_off & htons(IP_OFFSET)))
 			nexthdr = hdr.ipv4->protocol;
 	} else if (protocol == __constant_htons(ETH_P_IPV6)) {
 		if ((hdr.network - data) > (max_len - sizeof(struct ipv6hdr)))