
[RFC] Kernel unaligned access at __skb_flow_dissect

Message ID 20160129180651.GA17127@oracle.com
State RFC, archived
Delegated to: David Miller

Commit Message

Sowmini Varadhan Jan. 29, 2016, 6:06 p.m. UTC
There is an unaligned access at __skb_flow_dissect when it calls
ip6_flowlabel() with the call stack

  __skb_flow_dissect+0x2a8/0x87c
  eth_get_headlen+0x5c/0xaxa4
  ixgbe_clean_rx_irq+0x5cc/0xb20 [ixgbe]
  ixgbe_poll+0x5a4/0x760 [ixgbe]
  net_rx_action+0x13c/0x354
    :

Essentially, ixgbe_pull_tail() is trying to figure out how much
to pull, in order to have an aligned buffer:

        pull_len = eth_get_headlen(va, IXGBE_RX_HDR_SIZE);

        /* align pull length to size of long to optimize memcpy performance */
        skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));

and it seems like the unaligned access is unavoidable here (see comments
in __skb_get_poff, for example).

This (below) is what I came up with to get rid of the unaligned access
errors on sparc. Is there a better solution? (Not having access
to struct ip6_hdr in this file made put_unaligned usage non-obvious.)
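
For reference, the get_unaligned() route might look something like the sketch
below (untested, and assuming <asm/unaligned.h> can be pulled into this file;
IPV6_FLOWLABEL_MASK is already a big-endian constant, so only the 32-bit load
itself needs the helper):

    #include <asm/unaligned.h>

    /* Hypothetical alternative to the union/memcpy helper in the patch below:
     * let the generic unaligned-access helper do a safe (possibly byte-wise)
     * load of the first 32 bits of the IPv6 header, then mask out the label.
     */
    static inline __be32 ip6_flowlabel_unaligned(const void *hdr)
    {
            return get_unaligned((const __be32 *)hdr) & IPV6_FLOWLABEL_MASK;
    }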

Comments

Eric Dumazet Jan. 29, 2016, 6:33 p.m. UTC | #1
On Fri, 2016-01-29 at 13:06 -0500, Sowmini Varadhan wrote:
> There is an unaligned access at __skb_flow_dissect when it calls
> ip6_flowlabel() with the call stack
> 
>   __skb_flow_dissect+0x2a8/0x87c
>   eth_get_headlen+0x5c/0xaxa4
>   ixgbe_clean_rx_irq+0x5cc/0xb20 [ixgbe]
>   ixgbe_poll+0x5a4/0x760 [ixgbe]
>   net_rx_action+0x13c/0x354
>     :
> 
> Essentially, ixgbe_pull_tail() is trying to figure out how much
> to pull, in order to have an aligned buffer:
> 
>         pull_len = eth_get_headlen(va, IXGBE_RX_HDR_SIZE);
> 
>         /* align pull length to size of long to optimize memcpy performance */
>         skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
> 
> and seems like the unaligned access is unavoidable here (see comments
> in __skb_get_poff, for example).
> 
> This (below) is what I came up with, to get rid of the unaligned access
> errors on sparc, Is there a better solution? (Not having access
> to struct ip6_hdr in this file made put_unaligned usage non-obvious)
> 
> 
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -102,6 +102,17 @@ __be32 __skb_flow_get_ports(const struct sk_buff *skb, int 
>  }
>  EXPORT_SYMBOL(__skb_flow_get_ports);
>  
> +static inline __be32 ip6_flowlabel_align(const u8 *hdr)
> +{
> +       union {
> +               __u8 w[4];
> +               __u32 flow;
> +       } ip6_flow;
> +
> +       memcpy(ip6_flow.w, hdr, 4);
> +       return (ip6_flow.flow & IPV6_FLOWLABEL_MASK);
> +}
> +
>  /**
>   * __skb_flow_dissect - extract the flow_keys struct and return it
>   * @skb: sk_buff to extract the flow from, can be NULL if the rest are specifie
> @@ -230,7 +241,7 @@ ipv6:
>                         key_control->addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
>                 }
>  
> -               flow_label = ip6_flowlabel(iph);
> +               flow_label = ip6_flowlabel_align((const u8 *)iph);
>                 if (flow_label) {
>                         if (dissector_uses_key(flow_dissector,
>                                                FLOW_DISSECTOR_KEY_FLOW_LABEL)) {
> 
> 

Why doesn't the ipv6 stack itself trigger the issue?

Maybe the driver itself does not properly align IP headers on sparc?

Make sure NET_IP_ALIGN is 2 on your build.

Note that x86 does not care, but a driver should always align the Ethernet
header to NET_IP_ALIGN, so that IP headers are aligned to 4-byte
boundaries.
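
The driver-side pattern being described is roughly the following (illustrative
only, not taken from ixgbe; netdev_alloc_skb_ip_align() wraps the same two steps):

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Reserving NET_IP_ALIGN bytes of headroom puts the 14-byte Ethernet
     * header at offset 2, so the IP header that follows it starts on a
     * 4-byte boundary.
     */
    static struct sk_buff *rx_alloc_aligned(struct net_device *dev, unsigned int len)
    {
            struct sk_buff *skb = netdev_alloc_skb(dev, len + NET_IP_ALIGN);

            if (!skb)
                    return NULL;
            skb_reserve(skb, NET_IP_ALIGN);
            return skb;
    }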
Eric Dumazet Jan. 29, 2016, 6:54 p.m. UTC | #2
On Fri, 2016-01-29 at 10:33 -0800, Eric Dumazet wrote:
> On Fri, 2016-01-29 at 13:06 -0500, Sowmini Varadhan wrote:
> > There is an unaligned access at __skb_flow_dissect when it calls
> > ip6_flowlabel() with the call stack
> > 
> >   __skb_flow_dissect+0x2a8/0x87c
> >   eth_get_headlen+0x5c/0xaxa4
> >   ixgbe_clean_rx_irq+0x5cc/0xb20 [ixgbe]
> >   ixgbe_poll+0x5a4/0x760 [ixgbe]
> >   net_rx_action+0x13c/0x354
> >     :
> > 
> > Essentially, ixgbe_pull_tail() is trying to figure out how much
> > to pull, in order to have an aligned buffer:
> > 
> >         pull_len = eth_get_headlen(va, IXGBE_RX_HDR_SIZE);
> > 
> >         /* align pull length to size of long to optimize memcpy performance */
> >         skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
> > 
> > and seems like the unaligned access is unavoidable here (see comments
> > in __skb_get_poff, for example).
> > 
> > This (below) is what I came up with, to get rid of the unaligned access
> > errors on sparc, Is there a better solution? (Not having access
> > to struct ip6_hdr in this file made put_unaligned usage non-obvious)
> > 
> > 
> > --- a/net/core/flow_dissector.c
> > +++ b/net/core/flow_dissector.c
> > @@ -102,6 +102,17 @@ __be32 __skb_flow_get_ports(const struct sk_buff *skb, int 
> >  }
> >  EXPORT_SYMBOL(__skb_flow_get_ports);
> >  
> > +static inline __be32 ip6_flowlabel_align(const u8 *hdr)
> > +{
> > +       union {
> > +               __u8 w[4];
> > +               __u32 flow;
> > +       } ip6_flow;
> > +
> > +       memcpy(ip6_flow.w, hdr, 4);
> > +       return (ip6_flow.flow & IPV6_FLOWLABEL_MASK);
> > +}
> > +
> >  /**
> >   * __skb_flow_dissect - extract the flow_keys struct and return it
> >   * @skb: sk_buff to extract the flow from, can be NULL if the rest are specifie
> > @@ -230,7 +241,7 @@ ipv6:
> >                         key_control->addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
> >                 }
> >  
> > -               flow_label = ip6_flowlabel(iph);
> > +               flow_label = ip6_flowlabel_align((const u8 *)iph);
> >                 if (flow_label) {
> >                         if (dissector_uses_key(flow_dissector,
> >                                                FLOW_DISSECTOR_KEY_FLOW_LABEL)) {
> > 
> > 
> 
> Why ipv6 stack itself does not trigger the issue ?
> 
> Maybe the driver itself does not properly align IP headers on sparc ?
> 
> Make sure NET_IP_ALIGN is 2 on your build.
> 
> Note that x86 does not care, but a driver should always align Ethernet
> header to NET_IP_ALIGN, so that IP headers are aligned to 4 bytes
> boundaries.

Hmmm.... it seems that the flow dissector can "support L2 GRE", leading to
unaligned accesses, since an Ethernet header is 14 bytes (not a multiple of 4):

commit e1733de2243609073534cf56afb146a62af3c3d8
Author: Michael Dalton <mwdalton@google.com>
Date:   Mon Mar 11 06:52:28 2013 +0000

    flow_dissector: support L2 GRE
    
    Add support for L2 GRE tunnels, so that RPS can be more effective.


Michael, do we still need this?

IP stacks in Linux assume IP headers are always aligned to 4 bytes,
so having a GRE header like this would trigger alignment traps on some
arches.
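
For concreteness, assuming the driver honoured NET_IP_ALIGN (data starting at
offset 2) and a minimal IPv4 GRE encapsulation with no IP options and no GRE
key/checksum, the offsets work out as:

    outer Ethernet    offset  2   (14 bytes)
    outer IPv4        offset 16   (20 bytes, 4-byte aligned)
    GRE               offset 36   ( 4 bytes, 4-byte aligned)
    inner Ethernet    offset 40   (14 bytes)
    inner IP header   offset 54   -> only 2-byte aligned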
David Miller Jan. 29, 2016, 7:04 p.m. UTC | #3
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Fri, 29 Jan 2016 13:06:51 -0500

> @@ -102,6 +102,17 @@ __be32 __skb_flow_get_ports(const struct sk_buff *skb, int 
>  }
>  EXPORT_SYMBOL(__skb_flow_get_ports);
>  
> +static inline __be32 ip6_flowlabel_align(const u8 *hdr)
> +{
> +       union {
> +               __u8 w[4];
> +               __u32 flow;
> +       } ip6_flow;
> +
> +       memcpy(ip6_flow.w, hdr, 4);
 ...
> -               flow_label = ip6_flowlabel(iph);
> +               flow_label = ip6_flowlabel_align((const u8 *)iph);

Casting is not a foolproof solution.

The compiler is legally allowed to walk through the casts, discover that
the original type has certain alignment guarantees, and use them in its
inline expansion of memcpy().

Also, so much stuff in this file is going to trip over this issue if
the IP header isn't even 4-byte aligned.
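
David's point about the cast, in miniature (a hypothetical example, not code
from the patch):

    #include <linux/ipv6.h>
    #include <linux/string.h>

    static __be32 flow_word(const struct ipv6hdr *iph)
    {
            __be32 flow;

            /* The u8 cast does not erase what the compiler knows about the
             * declared alignment of *iph, so a builtin 4-byte memcpy may
             * still be expanded into a single aligned 32-bit load.
             */
            memcpy(&flow, (const u8 *)iph, sizeof(flow));
            return flow;
    }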
David Miller Jan. 29, 2016, 7:06 p.m. UTC | #4
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 29 Jan 2016 10:33:48 -0800

> Why ipv6 stack itself does not trigger the issue ?
> 
> Maybe the driver itself does not properly align IP headers on sparc ?
> 
> Make sure NET_IP_ALIGN is 2 on your build.
> 
> Note that x86 does not care, but a driver should always align Ethernet
> header to NET_IP_ALIGN, so that IP headers are aligned to 4 bytes
> boundaries.

Eric, please reread the original posting; the driver is trying to do
exactly this.

This is the part in the driver where it is trying to figure out how
much to pull in order to align the packet before it gets sent into the
stack.

That's why only the flow dissector sees this problem.
Sowmini Varadhan Jan. 29, 2016, 7:22 p.m. UTC | #5
On (01/29/16 10:54), Eric Dumazet wrote:
> > Why ipv6 stack itself does not trigger the issue ?
> > Maybe the driver itself does not properly align IP headers on sparc ?
> > Make sure NET_IP_ALIGN is 2 on your build.
> > Note that x86 does not care, but a driver should always align Ethernet
> > header to NET_IP_ALIGN, so that IP headers are aligned to 4 bytes
> > boundaries.

Consolidating a few responses:

NET_IP_ALIGN is 2, of course, or else I would have seen errors about unaligned
access in many more places. The issue is being triggered by the ipv6
stack (I'm running iperf over ipv6), so I'm not sure I understand the
first question.

I tried out the suggested patch of setting page_offset to NET_IP_ALIGN
in ixgbe_main.c - and yes, it kills the unaligned errors, so that's a
Good Thing. (though I see davem just responded to the patch)

However, the patch itself is not arch specific, and the unaligned
address issue should have impacted other archs (silently) as well -
even for ipv4.

It would also impact other logic in __skb_flow_dissect:
e.g., wouldn't the access to iph->saddr for ipv4 frags be
unaligned in __skb_flow_dissect? (And as I mentioned in my original
mail, the comments in __skb_get_poff seem to indicate that others
have tripped up on this before - and yes, casting is not a good thing,
but it seems like that's what they did to paper over the problem.)

> Hmmm.... it seems that flow dissector can "support L2 GRE", leading to
> unaligned accesses since a header is 14 bytes (not multiple of 4)
  :
> IP stacks in linux assume IP headers are always aligned to 4 bytes,
> so it means having a GRE header like this would align trap on some
> arches.

I'm not sure I see how the above commit could have been impacting me;
I don't have any l2 gre involved (simple p2p Ethernet II + ipv6 + tcp,
no vlans, gre, or other exotic stuff).

--Sowmini
Eric Dumazet Jan. 29, 2016, 7:37 p.m. UTC | #6
On Fri, 2016-01-29 at 11:06 -0800, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 29 Jan 2016 10:33:48 -0800
> 
> > Why ipv6 stack itself does not trigger the issue ?
> > 
> > Maybe the driver itself does not properly align IP headers on sparc ?
> > 
> > Make sure NET_IP_ALIGN is 2 on your build.
> > 
> > Note that x86 does not care, but a driver should always align Ethernet
> > header to NET_IP_ALIGN, so that IP headers are aligned to 4 bytes
> > boundaries.
> 
> Eric, please reread the original posting, the driver is trying to do
> exactly this.

Sure, but this driver was tested on x86.

> 
> This is the part in the driver where it is trying to figure out how
> much to pull in order to align the packet before it gets sent into the
> stack.
> 
> That's why only the flow dissector sees this problem.

I have no idea why reading iph->saddr or iph->daddr would not hit the
problem, but accessing the 32bit ipv6 flow label would be an issue.

Something is fishy.

But really adding unaligned() accesses in flow dissector would slow it
quite a lot on MIPS and others.
David Miller Jan. 29, 2016, 7:41 p.m. UTC | #7
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 29 Jan 2016 11:37:30 -0800

> But really adding unaligned() accesses in flow dissector would slow it
> quite a lot on MIPS and others.

Agreed, the fix definitely belongs in the callers of these interfaces,
which in this case is the driver.
Sowmini Varadhan Jan. 29, 2016, 7:44 p.m. UTC | #8
On (01/29/16 11:37), Eric Dumazet wrote:
> 
> I have no idea why reading iph->saddr or iph->daddr would not hit the
> problem, but accessing the 32bit ipv6 flow label would be an issue.
> 
> Something is fishy.

I was wondering about this myself. Even on sparc, I first
ran into the errors only for ipv6. I don't know if the fact that the
saddr is memcpy'ed masks the error (even though the problem
is still there).

But doing an if (!IS_ALIGNED(..)) check on iph->saddr in the
code does result in a positive.
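
The kind of ad-hoc check being described is roughly (a debugging aid only,
not a proposed fix; iph is the dissector's IP header pointer):

    if (!IS_ALIGNED((unsigned long)&iph->saddr, 4))
            pr_warn_once("flow_dissector: saddr %p is not 4-byte aligned\n",
                         &iph->saddr);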

> But really adding unaligned() accesses in flow dissector would slow it
> quite a lot on MIPS and others.
Eric Dumazet Jan. 29, 2016, 8:01 p.m. UTC | #9
On Fri, 2016-01-29 at 14:44 -0500, Sowmini Varadhan wrote:
> On (01/29/16 11:37), Eric Dumazet wrote:
> > 
> > I have no idea why reading iph->saddr or iph->daddr would not hit the
> > problem, but accessing the 32bit ipv6 flow label would be an issue.
> > 
> > Something is fishy.
> 
> I was wondering about this myself. Even on sparc, I only first
> ran into the errors for ipv6. I dont know if the fact that the
> saddr is memcpy'ed masks the error (even though the problem
> is still there). 

Oh right, recent work in the flow dissector added all these memcpy() calls.

I was still looking at linux-4.3 ;)
Eric Dumazet Jan. 29, 2016, 8:06 p.m. UTC | #10
On Fri, 2016-01-29 at 12:01 -0800, Eric Dumazet wrote:
> On Fri, 2016-01-29 at 14:44 -0500, Sowmini Varadhan wrote:
> > On (01/29/16 11:37), Eric Dumazet wrote:
> > > 
> > > I have no idea why reading iph->saddr or iph->daddr would not hit the
> > > problem, but accessing the 32bit ipv6 flow label would be an issue.
> > > 
> > > Something is fishy.
> > 
> > I was wondering about this myself. Even on sparc, I only first
> > ran into the errors for ipv6. I dont know if the fact that the
> > saddr is memcpy'ed masks the error (even though the problem
> > is still there). 
> 
> Oh right, recent work in flow dissector added all these memcpy()
> 
> I was still looking at linux-4.3 ;)

The code for GRE/GRE_KEY is still accessing 32-bit vars.

const __be32 *keyid;
...
key_keyid->keyid = *keyid;

So instead of auditing the flow dissector code, we should fix the few drivers
using it.
Tom Herbert Jan. 29, 2016, 9 p.m. UTC | #11
On Fri, Jan 29, 2016 at 10:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-01-29 at 10:33 -0800, Eric Dumazet wrote:
>> On Fri, 2016-01-29 at 13:06 -0500, Sowmini Varadhan wrote:
>> > There is an unaligned access at __skb_flow_dissect when it calls
>> > ip6_flowlabel() with the call stack
>> >
>> >   __skb_flow_dissect+0x2a8/0x87c
>> >   eth_get_headlen+0x5c/0xaxa4
>> >   ixgbe_clean_rx_irq+0x5cc/0xb20 [ixgbe]
>> >   ixgbe_poll+0x5a4/0x760 [ixgbe]
>> >   net_rx_action+0x13c/0x354
>> >     :
>> >
>> > Essentially, ixgbe_pull_tail() is trying to figure out how much
>> > to pull, in order to have an aligned buffer:
>> >
>> >         pull_len = eth_get_headlen(va, IXGBE_RX_HDR_SIZE);
>> >
>> >         /* align pull length to size of long to optimize memcpy performance */
>> >         skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
>> >
>> > and seems like the unaligned access is unavoidable here (see comments
>> > in __skb_get_poff, for example).
>> >
>> > This (below) is what I came up with, to get rid of the unaligned access
>> > errors on sparc, Is there a better solution? (Not having access
>> > to struct ip6_hdr in this file made put_unaligned usage non-obvious)
>> >
>> >
>> > --- a/net/core/flow_dissector.c
>> > +++ b/net/core/flow_dissector.c
>> > @@ -102,6 +102,17 @@ __be32 __skb_flow_get_ports(const struct sk_buff *skb, int
>> >  }
>> >  EXPORT_SYMBOL(__skb_flow_get_ports);
>> >
>> > +static inline __be32 ip6_flowlabel_align(const u8 *hdr)
>> > +{
>> > +       union {
>> > +               __u8 w[4];
>> > +               __u32 flow;
>> > +       } ip6_flow;
>> > +
>> > +       memcpy(ip6_flow.w, hdr, 4);
>> > +       return (ip6_flow.flow & IPV6_FLOWLABEL_MASK);
>> > +}
>> > +
>> >  /**
>> >   * __skb_flow_dissect - extract the flow_keys struct and return it
>> >   * @skb: sk_buff to extract the flow from, can be NULL if the rest are specifie
>> > @@ -230,7 +241,7 @@ ipv6:
>> >                         key_control->addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
>> >                 }
>> >
>> > -               flow_label = ip6_flowlabel(iph);
>> > +               flow_label = ip6_flowlabel_align((const u8 *)iph);
>> >                 if (flow_label) {
>> >                         if (dissector_uses_key(flow_dissector,
>> >                                                FLOW_DISSECTOR_KEY_FLOW_LABEL)) {
>> >
>> >
>>
>> Why ipv6 stack itself does not trigger the issue ?
>>
>> Maybe the driver itself does not properly align IP headers on sparc ?
>>
>> Make sure NET_IP_ALIGN is 2 on your build.
>>
>> Note that x86 does not care, but a driver should always align Ethernet
>> header to NET_IP_ALIGN, so that IP headers are aligned to 4 bytes
>> boundaries.
>
> Hmmm.... it seems that flow dissector can "support L2 GRE", leading to
> unaligned accesses since a header is 14 bytes (not multiple of 4)
>
> commit e1733de2243609073534cf56afb146a62af3c3d8
> Author: Michael Dalton <mwdalton@google.com>
> Date:   Mon Mar 11 06:52:28 2013 +0000
>
>     flow_dissector: support L2 GRE
>
>     Add support for L2 GRE tunnels, so that RPS can be more effective.
>
>
> Michael, do we still need this ?
>
> IP stacks in linux assume IP headers are always aligned to 4 bytes,
> so it means having a GRE header like this would align trap on some
> arches.
>
Doesn't every IP/VXLAN packet contain unaligned headers? Why don't
those create alignment issues (like when the stack looks at addresses)?

Sowmini Varadhan Jan. 29, 2016, 9:09 p.m. UTC | #12
On (01/29/16 13:00), Tom Herbert wrote:
> Doesn't every IP/VXLAN packet contains unaligned headers? Why don't
> those create alignment issues (like when stack looks at addresses)?

They do.

http://comments.gmane.org/gmane.linux.network/370672

Some of it was fixed in https://lkml.org/lkml/2015/7/20/63.
There's still some NET_IP_ALIGN missing. IIRC, I was seeing
this for mldv3, but the fix has to be done carefully, by
someone who knows how to fully regression test it.

I don't know if other tunneling methods manage to get the
NET_IP_ALIGN correct in every case.

Also, while sparc complains about unaligned access
in most cases, some sins may pass under the radar, and other
platforms don't even generate traps, so it's easy not to know
that there's a problem a lot of the time.

--Sowmini
Alexander H Duyck Jan. 29, 2016, 9:16 p.m. UTC | #13
On Fri, Jan 29, 2016 at 12:01 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-01-29 at 14:44 -0500, Sowmini Varadhan wrote:
>> On (01/29/16 11:37), Eric Dumazet wrote:
>> >
>> > I have no idea why reading iph->saddr or iph->daddr would not hit the
>> > problem, but accessing the 32bit ipv6 flow label would be an issue.
>> >
>> > Something is fishy.
>>
>> I was wondering about this myself. Even on sparc, I only first
>> ran into the errors for ipv6. I dont know if the fact that the
>> saddr is memcpy'ed masks the error (even though the problem
>> is still there).
>
> Oh right, recent work in flow dissector added all these memcpy()
>
> I was still looking at linux-4.3 ;)

It has to be something recent.  I know back when I wrote the code I
tested it on a few different architectures and had to add a few bits
in __skb_get_poff so that it would read doff as a u8 instead of a
bitfield in a u32.

Looking at the code, it seems like this should be an easy fix.  Just
swap the two lines that acquire and test the flow label with the check
for dissector_uses_key(... _KEY_FLOW_LABEL).  Then ixgbe will stop
trying to grab flow labels.

- Alex
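
Concretely, the reordering Alex describes would look roughly like this against
the hunk quoted earlier (a sketch; only the ordering changes, the key handling
inside stays as it is):

    if (dissector_uses_key(flow_dissector,
                           FLOW_DISSECTOR_KEY_FLOW_LABEL)) {
            flow_label = ip6_flowlabel(iph);
            if (flow_label) {
                    /* ... existing FLOW_LABEL key handling ... */
            }
    }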
Tom Herbert Jan. 29, 2016, 9:33 p.m. UTC | #14
On Fri, Jan 29, 2016 at 1:09 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (01/29/16 13:00), Tom Herbert wrote:
>> Doesn't every IP/VXLAN packet contains unaligned headers? Why don't
>> those create alignment issues (like when stack looks at addresses)?
>
> They do.
>
> http://comments.gmane.org/gmane.linux.network/370672
>
> some of it was fixed in https://lkml.org/lkml/2015/7/20/63.
> There's still some NET_IP_ALIGN missing. IIRC, I was seeing
> this for mldv3, but the fix has to be done carefully, by
> someone who knows how to fully regression test it.
>
> I dont know if other tunneling methods manage to get the
> NET_IP_ALIGN correct in every case.
>
Fortunately, most encapsulations are based on 4-byte alignment.
ETHER_IP even specifies a two-byte header to ensure that whatever
follows the encapsulated Ethernet header is 4-byte aligned. The
encapsulation protocols that carry Ethernet without any inserted bytes
for alignment are the potential problem (GRE, VXLAN, etc.).
NET_IP_ALIGN can't help because the outer and inner (IP) headers have
different alignment, so no single value can work.

> Also, while sparc complains about unaligned access
> in most cases, some sins may pass under the radar, and other
> platforms dont even generate traps, so it's easy to not know
> that there's a problem, a lot of the time.
>
That is worrisome. The only way I can think of to be completely
protected would be to copy encapsulated Ethernet frames into a properly
aligned (NET_IP_ALIGN) buffer in the driver :-( It doesn't look like VXLAN
does anything like that.

Tom

> --Sowmini
>
Eric Dumazet Jan. 29, 2016, 9:33 p.m. UTC | #15
On Fri, 2016-01-29 at 13:16 -0800, Alexander Duyck wrote:

> It has to be something recent.  I know back when I wrote the code I
> tested it on a few different architectures and had to add a few bits
> in __skb_get_poff so that it would read doff as a u8 instead of
> bitfield in a u32.
> 
> Looking at the code it seems like this should be an easy fix.  Just
> swap the two lines that acquire and test the flow label with the check
> for dissector_uses_key(... _KEY_FLOW_LABEL).  Then ixgbe will stop
> trying to grab flow labels.

Note that if we properly align headers, we also align TCP/UDP payload.

Meaning that copying data in recvmsg() is likely faster on some cpus
(the ones having NET_IP_ALIGN set to 2, presumably).
Alexander H Duyck Jan. 29, 2016, 10:08 p.m. UTC | #16
On Fri, Jan 29, 2016 at 1:33 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-01-29 at 13:16 -0800, Alexander Duyck wrote:
>
>> It has to be something recent.  I know back when I wrote the code I
>> tested it on a few different architectures and had to add a few bits
>> in __skb_get_poff so that it would read doff as a u8 instead of
>> bitfield in a u32.
>>
>> Looking at the code it seems like this should be an easy fix.  Just
>> swap the two lines that acquire and test the flow label with the check
>> for dissector_uses_key(... _KEY_FLOW_LABEL).  Then ixgbe will stop
>> trying to grab flow labels.
>
> Note that if we properly align headers, we also align TCP/UDP payload.
>
> Meaning that copying data in recvmsg() is likely faster in some cpus.
> (The ones having NET_IP_ALIGN set to 2 presumably)

It also means DMA becomes dramatically slower as it introduces a
partial write access for the start of every frame.  It is why we
set NET_IP_ALIGN to 0 on x86, since unaligned DMA was becoming more
expensive than reading unaligned IP headers.

The gain on recvmsg would probably be minimal.  The only time I have
seen any significant speed-up for copying is if you can get both ends
aligned to something like 16B.

- Alex
Eric Dumazet Jan. 29, 2016, 10:28 p.m. UTC | #17
On Fri, 2016-01-29 at 14:08 -0800, Alexander Duyck wrote:

> It also means DMA becomes dramatically slower as it introduces a
> partial write access for the start of every frame.  It is why we had
> set NET_IP_ALIGN to 0 on x86 since DMA was becoming more expensive
> when unaligned then reading IP unaligned headers.

Well, I guess that if you have an arch where DMA accesses are slow and
NET_IP_ALIGN = 2, you are out of luck. This is why some platforms are
better than others.


> 
> The gain on recvmsg would probably be minimal.  The only time I have
> seen any significant speed-up for copying is if you can get both ends
> aligned to something like 16B.

On modern intel cpus, this does not matter at all, sure. It took a while
before "rep movsb" finally did the right thing.

The memcpy()-and-friends implementations are much slower on some older
arches (when dealing with unaligned src/dst).

arch/mips/lib/memcpy.S is a gem ;)
Alexander H Duyck Jan. 29, 2016, 11 p.m. UTC | #18
On Fri, Jan 29, 2016 at 2:28 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-01-29 at 14:08 -0800, Alexander Duyck wrote:
>
>> It also means DMA becomes dramatically slower as it introduces a
>> partial write access for the start of every frame.  It is why we had
>> set NET_IP_ALIGN to 0 on x86 since DMA was becoming more expensive
>> when unaligned then reading IP unaligned headers.
>
> Well, I guess that if you have an arch where DMA accesses are slow and
> NET_IP_ALIGN = 2, you are out of luck. This is why some platforms are
> better than others.

The other bit you forgot to mention was an IOMMU.  That is another
per-architecture thing that can really slow us down.  Back when I
rewrote the receive path I was dealing with a number of performance
complaints on PowerPC.  The approach I took with the Intel drivers was
supposed to be the best compromise for IOMMU, DMA alignment, and IP
header alignment.

>>
>> The gain on recvmsg would probably be minimal.  The only time I have
>> seen any significant speed-up for copying is if you can get both ends
>> aligned to something like 16B.
>
> On modern intel cpus, this does not matter at all, sure. It took a while
> before "rep movsb" finally did the right thing.
>
> memcpy() and friends implementations are much slower on some older
> arches (when dealing with unaligned src/dst)
>
> arch/mips/lib/memcpy.S is a gem ;)

Yeah.  I can imagine.  The fact is you can't make everybody happy, so I
am good with just trying to support the majority of architectures as best
as possible; if a few have to take a performance hit for an unaligned
memcpy then so be it.

- Alex
Tom Herbert Jan. 29, 2016, 11 p.m. UTC | #19
On Fri, Jan 29, 2016 at 1:09 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (01/29/16 13:00), Tom Herbert wrote:
>> Doesn't every IP/VXLAN packet contains unaligned headers? Why don't
>> those create alignment issues (like when stack looks at addresses)?
>
> They do.
>
> http://comments.gmane.org/gmane.linux.network/370672
>
> some of it was fixed in https://lkml.org/lkml/2015/7/20/63.
> There's still some NET_IP_ALIGN missing. IIRC, I was seeing
> this for mldv3, but the fix has to be done carefully, by
> someone who knows how to fully regression test it.
>
> I dont know if other tunneling methods manage to get the
> NET_IP_ALIGN correct in every case.
>
> Also, while sparc complains about unaligned access
> in most cases, some sins may pass under the radar, and other
> platforms dont even generate traps, so it's easy to not know
> that there's a problem, a lot of the time.
>
The sparc documentation is pretty clear
(http://docs.oracle.com/cd/E19253-01/816-4854/hwovr-2/index.html); it seems
like unaligned accesses are not allowed by the architecture.

Tom

> --Sowmini
>
Sowmini Varadhan Jan. 29, 2016, 11:04 p.m. UTC | #20
On (01/29/16 15:00), Tom Herbert wrote:

> The sparc documentation is pretty clear
> http://docs.oracle.com/cd/E19253-01/816-4854/hwovr-2/index.html, seems
> like unaligned accesses are not allowed in the architecture.

Yes, but it looks like you can paper over some of this with
memcpy (as was happening with the saddr ref in skb_flow_dissect
that puzzled me and Eric because it did not generate any traps).

I suppose one could sprinkle a few WARN_ONs for !IS_ALIGNED,
but that's not a fool-proof detection method either (in addition
to all the ugly shouting in the code).

--Sowmini
Tom Herbert Jan. 29, 2016, 11:31 p.m. UTC | #21
On Fri, Jan 29, 2016 at 3:04 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (01/29/16 15:00), Tom Herbert wrote:
>
>> The sparc documentation is pretty clear
>> http://docs.oracle.com/cd/E19253-01/816-4854/hwovr-2/index.html, seems
>> like unaligned accesses are not allowed in the architecture.
>
> yes, but looks like you  can paper over some of this with
> memcpy (as was happening with the saddr ref in skb_flow_dissect
> that puzzled me and Eric because it did not generate any traps).
>
Well, I am more worried about what happens when an unaligned
encapsulated TCP/IP packet gets into the stack. It seems unlikely we'd
want to replace a bunch of address, seq number, etc. accesses with memcpy
all over the stack. ;-)

But even within the flow dissector, to be completely correct, we need to
replace all 32-bit accesses with the memcpy (flow_label, mpls label,
keyid) and be vigilant about new ones coming in. For that matter, if
we really want to be pedantic, nothing prevents a driver from giving
us a packet with 1-byte alignment, in which case we need to consider
access to 16-bit values also! Considering that this is a very narrow
use case (it requires encapsulated Ethernet and an architecture that is
alignment sensitive), I wonder if we should just punt on flow
dissection when we see non-4-byte alignment and
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set.

Tom
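
A sketch of the early bail-out Tom is floating (hypothetical; data and nhoff
are the dissector's buffer pointer and network header offset):

    /* Give up on dissection rather than risk trapping on an arch that
     * cannot do unaligned loads cheaply.
     */
    if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
        !IS_ALIGNED((unsigned long)data + nhoff, 4))
            return false;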
Sowmini Varadhan Jan. 29, 2016, 11:58 p.m. UTC | #22
On (01/29/16 15:31), Tom Herbert wrote:
> But even within flow dissector, to be completely correct, we need to
> replace all 32-bit accesses with the mempy (flow_label, mpls label,
> keyid) and be vigilant about new ones coming in. For that matter, ..

well, one question that came to my mind when I was looking at this is:
why does eth_get_headlen try to compute the flow hash even before the
driver has had a chance to pull things into an aligned buffer? Isn't
that just risking even more unaligned accesses?
Tom Herbert Jan. 30, 2016, 12:47 a.m. UTC | #23
On Fri, Jan 29, 2016 at 3:58 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (01/29/16 15:31), Tom Herbert wrote:
>> But even within flow dissector, to be completely correct, we need to
>> replace all 32-bit accesses with the mempy (flow_label, mpls label,
>> keyid) and be vigilant about new ones coming in. For that matter, ..
>
> well, one question that came to my mind when I was looking at this is:
> why does eth_get_headlen try to compute the flow hash even before the
> driver has had a chance to pull things into an aligned buffer? Isn't
> that just risking even more unaligned accesses?
>
Hmm, thanks for pointing that out. It looks like this is called by a
couple of drivers as part of pulling up the data to get alignment. I am
actually surprised they are doing this. Flow dissector is not the
cheapest function in the world and it seems like a shame to be calling
it this early and not even getting a hash for the skb. Also, this is
another instance of touching the data, so if we do get to the point of
trying to steer packets without a cache miss (another thread), this
undermines that. Seems like it might be just as well to pull up a
fixed number of bytes (ixgbe wants a min of 60 bytes), or don't pull up
anything, avoid the cache miss here, and let the stack pull up as
needed.

Anyway, I think you do have a point that flow_dissector does not
assume alignment or pull up data like the rest of the stack. Seems
like we need a utility function like align_access (maybe there is one
already) to get 32-bit fields and probably do need the sanity check
for odd alignment.

Tom
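
The utility function Tom mentions could be as small as the following (the name
and exact shape are hypothetical; per David's earlier caveat, get_unaligned()
would be the more robust building block than a raw memcpy once the compiler can
see the caller's pointer type):

    static inline __be32 flow_get_be32(const void *p)
    {
            __be32 v;

            memcpy(&v, p, sizeof(v));       /* byte-wise copy of one 32-bit field */
            return v;
    }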
David Miller Jan. 30, 2016, 1:18 a.m. UTC | #24
From: Tom Herbert <tom@herbertland.com>
Date: Fri, 29 Jan 2016 16:47:46 -0800

> Seems like it might be just as well to pull up a fixed number of
> bytes (ixgbe want min of 60 bytes),

This is precisely what we were trying to move away from; please
don't regress back to that.
Eric Dumazet Jan. 30, 2016, 2:15 a.m. UTC | #25
On Fri, 2016-01-29 at 16:47 -0800, Tom Herbert wrote:

> Hmm, thanks for pointing that out. It looks like this is called by a
> couple drivers as part of pulling up the data to get alignment. I am
> actually surprised they are doing this. Flow dissector is not the
> cheapest function in the world and it seems like a shame to be calling
> it this early and not even getting a hash for the skb. Also, this is
> another instance of touching the data so if we do get to the point of
> trying steering packets without cache miss (another thread); this
> undermines that. Seems like it might be just as well to pull up a
> fixed number of bytes (ixgbe want min of 60 bytes), or don't pull up
> anything avoid the cache miss here and let the stack pull up as
> needed.

Except that when, for example, mlx4 pulls only the headers
instead of a fixed amount of bytes, we increased performance,
because GRO could cook packets twice as big for tunneled traffic.

( commit cfecec56ae7c7c40f23fbdac04acee027ca3bd66 )

1) Increasing performance by prefetching headers is a separate matter,
completely orthogonal to this.

2) Taking the opportunity to also give back the l4 hash as a bonus
has been considered, but tests showed no clear benefit.

  For typical cases where the l4 hash is not used (RPS and RFS not being
enabled by default on linux), computing it brings nothing but added
complexity.
Tom Herbert Jan. 31, 2016, 10:13 p.m. UTC | #26
On Fri, Jan 29, 2016 at 6:15 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-01-29 at 16:47 -0800, Tom Herbert wrote:
>
>> Hmm, thanks for pointing that out. It looks like this is called by a
>> couple drivers as part of pulling up the data to get alignment. I am
>> actually surprised they are doing this. Flow dissector is not the
>> cheapest function in the world and it seems like a shame to be calling
>> it this early and not even getting a hash for the skb. Also, this is
>> another instance of touching the data so if we do get to the point of
>> trying steering packets without cache miss (another thread); this
>> undermines that. Seems like it might be just as well to pull up a
>> fixed number of bytes (ixgbe want min of 60 bytes), or don't pull up
>> anything avoid the cache miss here and let the stack pull up as
>> needed.
>
> Except that when for example mlx4 is able to pull only the headers
> instead of a fixed amount of bytes, we increased the performance,
> because GRO could cook packets twice as big for tunneled traffic.
>
> ( commit cfecec56ae7c7c40f23fbdac04acee027ca3bd66 )
>
> 1) Increasing performance by prefetching headers is a separate matter,
> completely orthogonal to this.
>
> 2) Taking the opportunity to also give back the l4 hash as a bonus
> has been considered, but tests showed no clear benefit.
>
>   For typical cases where l4 hash is not used (RPS and RFS being not
> enabled by default on linux), computing it brings nothing but added
> complexity.
>
Neither is GRE enabled by default in Linux and it is not a typical
case. So that patch is an optimization for a very narrow use case that
impacts the core data path for everyone. Please at least consider
making it configurable.

Eric Dumazet Feb. 1, 2016, 12:04 a.m. UTC | #27
On Sun, 2016-01-31 at 14:13 -0800, Tom Herbert wrote:

> Neither is GRE enabled by default in Linux and it is not a typical
> case. So that patch is an optimization for a very narrow use case that
> impacts the core data path for everyone. Please at least consider
> making it configurable.

No idea why it should be configurable, by adding yet another
conditional.

Pulling all the headers at once is faster than pulling them in the GRO slow
path or, if GRO is disabled, in the pskb_may_pull() slow path.

It seems we need to fix flow dissector anyway.

Note: this path is only taken for non-native traffic, in case you missed
this.

Regular IP+TCP or IP+UDP does not use it.
David Laight Feb. 1, 2016, 4:55 p.m. UTC | #28
From: Eric Dumazet
> Sent: 29 January 2016 22:29
...
> On modern intel cpus, this does not matter at all, sure. It took a while
> before "rep movsb" finally did the right thing.

Unfortunately memcpy_to_io() etc. now map to 'rep movsb', and that can
only be optimised for cached addresses.

So copies to/from pcie space get done as byte copies, slower than slow.

The same is true when userspace has used mmap() to directly access
pcie memory.

There isn't a standard wrapper that generates 'rep movsd'.

	David
Eric Dumazet Feb. 1, 2016, 5:25 p.m. UTC | #29
On Mon, 2016-02-01 at 16:55 +0000, David Laight wrote:
> From: Eric Dumazet
> > Sent: 29 January 2016 22:29
> ...
> > On modern intel cpus, this does not matter at all, sure. It took a while
> > before "rep movsb" finally did the right thing.
> 
> Unfortunately memcpy_to_io() etc now map to 'rep movsb', and that can
> only be optimisied for cached addresses.
> 
> So copies to/from pcie space get done as byte copies, slower than slow.
> 
> The same is true when usespace has used mmap() to directly access
> pcie memory.
> 
> There isn't a standard wrapper than generates 'rep movsd'.

If memcpy_to_io() has a problem because it assumed memcpy() had some
properties, that is quite a different problem that should be referred to
the x86 maintainers?

arch/x86/lib/memcpy_64.S certainly contains a suitable variant.

Patch

--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -102,6 +102,17 @@  __be32 __skb_flow_get_ports(const struct sk_buff *skb, int 
 }
 EXPORT_SYMBOL(__skb_flow_get_ports);
 
+static inline __be32 ip6_flowlabel_align(const u8 *hdr)
+{
+       union {
+               __u8 w[4];
+               __u32 flow;
+       } ip6_flow;
+
+       memcpy(ip6_flow.w, hdr, 4);
+       return (ip6_flow.flow & IPV6_FLOWLABEL_MASK);
+}
+
 /**
  * __skb_flow_dissect - extract the flow_keys struct and return it
  * @skb: sk_buff to extract the flow from, can be NULL if the rest are specifie
@@ -230,7 +241,7 @@  ipv6:
                        key_control->addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
                }
 
-               flow_label = ip6_flowlabel(iph);
+               flow_label = ip6_flowlabel_align((const u8 *)iph);
                if (flow_label) {
                        if (dissector_uses_key(flow_dissector,
                                               FLOW_DISSECTOR_KEY_FLOW_LABEL)) {