Message ID | 1471470587-24813-3-git-send-email-dsa@cumulusnetworks.com |
---|---|
State | Superseded, archived |
Delegated to: | David Miller |
On Wed, Aug 17, 2016 at 2:49 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> As reported by Lennert the MPLS GSO code is failing to properly segment
> large packets. There are a couple of problems:
>
> 1. the inner protocol is not set so the gso segment functions for inner
>    protocol layers are not getting run, and
>
> 2 MPLS labels for packets that use the "native" (non-OVS) MPLS code
>    are not properly accounted for in mpls_gso_segment.
>
> The MPLS GSO code was added for OVS. It is re-using skb_mac_gso_segment
> to call the gso segment functions for the higher layer protocols. That
> means skb_mac_gso_segment is called twice -- once with the network
> protocol set to MPLS and again with the network protocol set to the
> inner protocol.
>
> This patch sets the inner skb protocol addressing item 1 above and sets
> the network_header and inner_network_header to mark where the MPLS labels
> start and end. The MPLS code in OVS is also updated to set the two
> network markers.
>
> From there the MPLS GSO code uses the difference between the network
> header and the inner network header to know the size of the MPLS header
> that was pushed. It then pulls the MPLS header, resets the mac_len and
> protocol for the inner protocol and then calls skb_mac_gso_segment
> to segment the skb. Afterwards the skb protocol is set to mpls for
> each segment as suggested by Simon.
>
> Reported-by: Lennert Buytenhek <buytenh@wantstofly.org>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ---
>  net/mpls/mpls_gso.c       | 24 +++++++++++++-----------
>  net/mpls/mpls_iptunnel.c  |  5 +++++
>  net/openvswitch/actions.c |  6 ++++++
>  3 files changed, 24 insertions(+), 11 deletions(-)
>

<snip>

> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 1ecbd7715f6d..6d78f162a88b 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -167,6 +167,12 @@ static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key,
>  		skb->mac_len);
>  	skb_reset_mac_header(skb);
>
> +	/* for GSO: set MPLS as network header and encapsulated protocol
> +	 * header as inner network header
> +	 */
> +	skb_set_network_header(skb, skb->mac_len);
> +	skb_set_inner_network_header(skb, skb->mac_len + MPLS_HLEN);
> +
>  	new_mpls_lse = (__be32 *)skb_mpls_header(skb);
>  	*new_mpls_lse = mpls->mpls_lse;
>

So the one question I would have about this is how attached are you to
using the network_header to record the offset for the MPLS header? I
ask because I think from a hardware offloading perspective it would
make it much easier if instead you used the inner_mac_header to
represent the offset for the MPLS header. This way device drivers
could just skip over it like a VLAN and just use network and transport
header values like they would otherwise.

- Alex
On 8/17/16 5:16 PM, Alexander Duyck wrote:
>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>> index 1ecbd7715f6d..6d78f162a88b 100644
>> --- a/net/openvswitch/actions.c
>> +++ b/net/openvswitch/actions.c
>> @@ -167,6 +167,12 @@ static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key,
>>  		skb->mac_len);
>>  	skb_reset_mac_header(skb);
>>
>> +	/* for GSO: set MPLS as network header and encapsulated protocol
>> +	 * header as inner network header
>> +	 */
>> +	skb_set_network_header(skb, skb->mac_len);
>> +	skb_set_inner_network_header(skb, skb->mac_len + MPLS_HLEN);
>> +
>>  	new_mpls_lse = (__be32 *)skb_mpls_header(skb);
>>  	*new_mpls_lse = mpls->mpls_lse;
>>
>
> So the one question I would have about this is how attached are you to
> using the network_header to record the offset for the MPLS header? I
> ask because I think from a hardware offloading perspective it would
> make it much easier if instead you used the inner_mac_header to
> represent the offset for the MPLS header. This way device drivers
> could just skip over it like a VLAN and just use network and transport
> header values like they would otherwise.
>

Where does the network_header relate to if I change the marker to
inner_mac_header? Would it be skipped?

skb->protocol is set to MPLS.
mac_header points to ethernet address
network_header points to ???
inner protocol is set to what is encapsulated (e.g., ipv4 or ipv6)
inner_mac_header points to start of mpls label.
inner_network points to start of network header.

Is that sufficient for h/w drivers?
On Wed, Aug 17, 2016 at 4:23 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 8/17/16 5:16 PM, Alexander Duyck wrote:
>>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>>> index 1ecbd7715f6d..6d78f162a88b 100644
>>> --- a/net/openvswitch/actions.c
>>> +++ b/net/openvswitch/actions.c
>>> @@ -167,6 +167,12 @@ static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key,
>>>  		skb->mac_len);
>>>  	skb_reset_mac_header(skb);
>>>
>>> +	/* for GSO: set MPLS as network header and encapsulated protocol
>>> +	 * header as inner network header
>>> +	 */
>>> +	skb_set_network_header(skb, skb->mac_len);
>>> +	skb_set_inner_network_header(skb, skb->mac_len + MPLS_HLEN);
>>> +
>>>  	new_mpls_lse = (__be32 *)skb_mpls_header(skb);
>>>  	*new_mpls_lse = mpls->mpls_lse;
>>>
>>
>> So the one question I would have about this is how attached are you to
>> using the network_header to record the offset for the MPLS header? I
>> ask because I think from a hardware offloading perspective it would
>> make it much easier if instead you used the inner_mac_header to
>> represent the offset for the MPLS header. This way device drivers
>> could just skip over it like a VLAN and just use network and transport
>> header values like they would otherwise.
>>
>
> Where does the network_header relate to if I change the marker to inner_mac_header? Would it be skipped?

No, the network header would still be the network header.

> skb->protocol is set to MPLS.
> mac_header points to ethernet address
> network_header points to ???

The network_header would point to the IP header like it would be for a
non-MPLS frame.

> inner protocol is set to what is encapsulated (e.g., ipv4 or ipv6)

I am okay with this, but wonder if we actually need it. Do you know
of any protocols other than IPv4 or IPv6 that can be carried over MPLS
and would expect to be offloaded? If not we may be able to just get
away with recording the network header offset and then using the first
nibble of the network header to determine the IP version since the
value should be 4 or 6 for the two types we are offloading.

> inner_mac_header points to start of mpls label.

So this is what I would expect.

> inner_network points to start of network header.

The problem is that using inner_network_header to point to the network
header will require me to fork the path pretty significantly for most
of the Intel devices that would want to do MPLS GSO. The assumption
most drivers make is that if we are offloading things then
network_header and inner_network_header will point to either IPv4 or
IPv6 headers. Introducing MPLS as the network_header with IPv4 or
IPv6 as the inner_network_header throws a kink in the works because we
currently ignore inner_network_header for the devices that are doing
UDP or GRE tunnel GSO via GSO_PARTIAL with TSO_MANGLEID.

> Is that sufficient for h/w drivers?

I think of this as working like how we handle it for IP over IP
tunnels. In that case we are at L3 so the inner_network_header field
is populated, but the transport header stays the same. In the case of
MPLS it isn't really L3 it is more of an L2.5 so my preference would
be to treat it like it is an L2 tunnel or VLAN and just overwrite the
inner_mac_header with the MPLS header offset, and leave the network
and transport headers untouched.

One other bonus that also occurred to me is that you might be able to
get away with doing MPLS offloads for MPLS over IP or GRE tunnels. I
hadn't realized that MPLS inside of these tunnels was a thing, I had
just noticed it while looking over how the IP-in-IP tunnels are all
being handled. However if you move the header tracking to
inner_mac_header, and can avoid using skb->inner_protocol by instead
using the first nibble of the network_header value then you could
probably support segmenting those types of tunnels in hardware.

- Alex
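
The "first nibble" idea in the message above can be illustrated with a minimal userspace sketch. This is not kernel code and not something posted in the thread: `ip_version_from_nibble` is a hypothetical helper name, and the buffer is assumed to start at the header that follows the MPLS label stack.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a kernel API: return 4 or 6 when the high
 * nibble of the first header byte marks an IPv4 or IPv6 packet, and 0
 * for anything else (including an empty or missing buffer).
 */
int ip_version_from_nibble(const uint8_t *hdr, size_t len)
{
	uint8_t version;

	if (hdr == NULL || len == 0)
		return 0;

	/* Both IPv4 and IPv6 carry the version in the top 4 bits of byte 0. */
	version = hdr[0] >> 4;
	if (version == 4 || version == 6)
		return version;

	return 0;
}
```

In a driver the byte would presumably be read at skb_network_header(skb) rather than from a plain array; that is an assumption here, since the thread does not show such code.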
On 8/17/16 7:06 PM, Alexander Duyck wrote:
> On Wed, Aug 17, 2016 at 4:23 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>> On 8/17/16 5:16 PM, Alexander Duyck wrote:
>>>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>>>> index 1ecbd7715f6d..6d78f162a88b 100644
>>>> --- a/net/openvswitch/actions.c
>>>> +++ b/net/openvswitch/actions.c
>>>> @@ -167,6 +167,12 @@ static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key,
>>>>  		skb->mac_len);
>>>>  	skb_reset_mac_header(skb);
>>>>
>>>> +	/* for GSO: set MPLS as network header and encapsulated protocol
>>>> +	 * header as inner network header
>>>> +	 */
>>>> +	skb_set_network_header(skb, skb->mac_len);
>>>> +	skb_set_inner_network_header(skb, skb->mac_len + MPLS_HLEN);
>>>> +
>>>>  	new_mpls_lse = (__be32 *)skb_mpls_header(skb);
>>>>  	*new_mpls_lse = mpls->mpls_lse;
>>>>
>>>
>>> So the one question I would have about this is how attached are you to
>>> using the network_header to record the offset for the MPLS header? I
>>> ask because I think from a hardware offloading perspective it would
>>> make it much easier if instead you used the inner_mac_header to
>>> represent the offset for the MPLS header. This way device drivers
>>> could just skip over it like a VLAN and just use network and transport
>>> header values like they would otherwise.
>>>
>>
>> Where does the network_header relate to if I change the marker to inner_mac_header? Would it be skipped?
>
> No, the network header would still be the network header.

If core MPLS code (i.e., non-OVS) does not do
skb_reset_network_header(skb) after adding the MPLS label, nothing
works. Not even ping with small packets. tcpdump shows a completely
mangled packet. Right now resetting the network_header to mpls is
required.
Thought I would go through and do a second pass since it sounds like
the inner_mac_header idea isn't going to fly. If we can't push this
as an L2 encapsulation there are a few tweaks we probably need in order
to make this work as an L3. I have included comments inline below.

Also I haven't worked with MPLS much before. Is there a simple way to
set up an MPLS tunnel between two hosts connected back to back so that
I could try testing a few things related to this patch?

Thanks.

- Alex

On Wed, Aug 17, 2016 at 2:49 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> As reported by Lennert the MPLS GSO code is failing to properly segment
> large packets. There are a couple of problems:
>
> 1. the inner protocol is not set so the gso segment functions for inner
>    protocol layers are not getting run, and
>
> 2 MPLS labels for packets that use the "native" (non-OVS) MPLS code
>    are not properly accounted for in mpls_gso_segment.
>
> The MPLS GSO code was added for OVS. It is re-using skb_mac_gso_segment
> to call the gso segment functions for the higher layer protocols. That
> means skb_mac_gso_segment is called twice -- once with the network
> protocol set to MPLS and again with the network protocol set to the
> inner protocol.
>
> This patch sets the inner skb protocol addressing item 1 above and sets
> the network_header and inner_network_header to mark where the MPLS labels
> start and end. The MPLS code in OVS is also updated to set the two
> network markers.
>
> From there the MPLS GSO code uses the difference between the network
> header and the inner network header to know the size of the MPLS header
> that was pushed. It then pulls the MPLS header, resets the mac_len and
> protocol for the inner protocol and then calls skb_mac_gso_segment
> to segment the skb. Afterwards the skb protocol is set to mpls for
> each segment as suggested by Simon.
>
> Reported-by: Lennert Buytenhek <buytenh@wantstofly.org>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ---
>  net/mpls/mpls_gso.c       | 24 +++++++++++++-----------
>  net/mpls/mpls_iptunnel.c  |  5 +++++
>  net/openvswitch/actions.c |  6 ++++++
>  3 files changed, 24 insertions(+), 11 deletions(-)
>
> diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
> index 2055e57ed1c3..fa6899f02cc8 100644
> --- a/net/mpls/mpls_gso.c
> +++ b/net/mpls/mpls_gso.c
> @@ -22,33 +22,35 @@
>  static struct sk_buff *mpls_gso_segment(struct sk_buff *skb,
>  				       netdev_features_t features)
>  {
> +	int mpls_hlen = skb_inner_network_header(skb) - skb_network_header(skb);
>  	struct sk_buff *segs = ERR_PTR(-EINVAL);
> +	u16 mac_offset = skb->mac_header;
>  	netdev_features_t mpls_features;
>  	__be16 mpls_protocol;
> +	u16 mac_len = skb->mac_len;

So one thing you may want to do here is defer the skb_network_header()
call until after being able to call skb_reset_network_header(). For
reference you might look at how we handle inet_gso_segment. That way
if at some point in the future we end up having to support MPLS
encapsulated in an IP tunnel it should be able to play the same as
IP-in-IP.

>
>  	/* Setup inner SKB. */
>  	mpls_protocol = skb->protocol;
>  	skb->protocol = skb->inner_protocol;
>
> -	/* Push back the mac header that skb_mac_gso_segment() has pulled.
> -	 * It will be re-pulled by the call to skb_mac_gso_segment() below
> -	 */
> -	__skb_push(skb, skb->mac_len);
> +	__skb_pull(skb, mpls_hlen);
> +	skb->mac_len = skb_inner_network_offset(skb);

So I am not sure setting the skb->mac_len here really does anything.
If I am not mistaken I think the value should always come out 0 since
you already pulled mpls_hlen, and skb->data should be equal to
skb_network_header(). So you might save yourself a few cycles and
just set skb->mac_len = 0.

Also you may need to call skb_reset_mac_header() so that you don't
have the skb_mac_gso_segment call pushing your MPLS header and the
headers below it back on before you can capture those offsets back in
your frame.

>
>  	/* Segment inner packet. */
>  	mpls_features = skb->dev->mpls_features & features;
>  	segs = skb_mac_gso_segment(skb, mpls_features);
> -
> +	if (IS_ERR_OR_NULL(segs)) {
> +		skb_gso_error_unwind(skb, mpls_protocol, mpls_hlen, mac_offset,
> +				     mac_len);
> +		goto out;
> +	}
>
>  	/* Restore outer protocol. */
>  	skb->protocol = mpls_protocol;
> +	for (skb = segs; skb; skb = skb->next)
> +		skb->protocol = mpls_protocol;

At this point you should probably be pushing back on your MPLS header
and resetting the inner network header, network header, and mac
header. Otherwise either the inner IPv4 or IPv6 header will be set as
the network_header after you have segmented the frame. This is one of
the reasons why I thought my original idea would work. You might
refer to the approach taken in gre_gso_segment as an example of how to
approach that. The key bit here is that you can't lose the offsets
you set up when you were creating the frame and I don't see anything
anywhere that is handling the inner_network_header value.

>
> -	/* Re-pull the mac header that the call to skb_mac_gso_segment()
> -	 * above pulled. It will be re-pushed after returning
> -	 * skb_mac_gso_segment(), an indirect caller of this function.
> -	 */
> -	__skb_pull(skb, skb->data - skb_mac_header(skb));
> -
> +out:
>  	return segs;
>  }
>
> diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
> index aed872cc05a6..55c5ab907563 100644
> --- a/net/mpls/mpls_iptunnel.c
> +++ b/net/mpls/mpls_iptunnel.c
> @@ -90,7 +90,12 @@ static int mpls_xmit(struct sk_buff *skb)
>  	if (skb_cow(skb, hh_len + new_header_size))
>  		goto drop;
>
> +	skb_set_inner_protocol(skb, skb->protocol);
> +	skb_reset_inner_network_header(skb);
> +	skb->encapsulation = 1;
> +

So you probably shouldn't be updating skb->encapsulation. Normally
that is used for L4 encapsulation over UDP or GRE. The problem is it
signals that the checksum needs to be computed at
inner_transport_header instead of transport_header and can cause
issues if we try to offload the checksum for this.

>  	skb_push(skb, new_header_size);
> +
>  	skb_reset_network_header(skb);
>
>  	skb->dev = out_dev;
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 1ecbd7715f6d..6d78f162a88b 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -167,6 +167,12 @@ static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key,
>  		skb->mac_len);
>  	skb_reset_mac_header(skb);
>
> +	/* for GSO: set MPLS as network header and encapsulated protocol
> +	 * header as inner network header
> +	 */
> +	skb_set_network_header(skb, skb->mac_len);
> +	skb_set_inner_network_header(skb, skb->mac_len + MPLS_HLEN);
> +
>  	new_mpls_lse = (__be32 *)skb_mpls_header(skb);
>  	*new_mpls_lse = mpls->mpls_lse;
>
> --
> 2.1.4
>
On 8/18/16 8:37 AM, Alexander Duyck wrote:
> Thought I would go through and do a second pass since it sounds like
> the inner_mac_header idea isn't going to fly. If we can't push this
> as an L2 encapsulation there are few tweaks we probably need in order
> to make this work as an L3. I have included comments inline below.
>
> Also I haven't worked with MPLS much before. Is there a simple way to
> setup an MPLS tunnel between two hosts connected back to back so that
> I could try testing a few things related to this patch?

Here are the commands that I use for VMs - copy and paste. It is an
adaptation of Lennert's namespace script. VM ids are local to my host.
Network addresses are 10.100.1.x/24 and 2100:1::x/120 on eth1 of the
respective node. It includes MPLS encap, IP-IP encap and none, to
compare performance.

VM2
===
modprobe mpls_router
modprobe mpls_gso
modprobe mpls_iptunnel
sysctl -w net.mpls.platform_labels=1000

ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.3
ip -6 route add 3000:1::1/128 encap mpls 101 via inet6 2100:1::3

ip tunnel add tun0 mode ipip remote 10.100.1.3
ip link set dev tun0 up
ip route add 10.10.10.11/32 dev tun0

ip route add 10.10.10.12/32 via inet 10.100.1.3
ip -6 route add 3000:1::3/128 via inet6 2100:1::3

VM3
===
modprobe mpls_router
modprobe mpls_gso
modprobe mpls_iptunnel
sysctl -w net.mpls.conf.eth1.input=1
sysctl -w net.mpls.platform_labels=1000

ip -f mpls route add 100 via inet 10.100.2.4
ip -f mpls route add 101 via inet6 2100:2::4

ip tunnel add tun0 mode ipip remote 10.100.1.2
ip link set dev tun0 up
ip ro add 10.10.10.11/32 via 10.100.2.4

ip ro add 10.10.10.12/32 via 10.100.2.4
ip -6 route add 3000:1::3/128 via inet6 2100:2::4

VM4
===
ip addr add 10.10.10.10/32 dev lo
ip addr add 10.10.10.11/32 dev lo
ip addr add 10.10.10.12/32 dev lo
ip -6 addr add 3000:1::1/128 dev lo
ip -6 addr add 3000:1::2/128 dev lo
ip -6 addr add 3000:1::3/128 dev lo
netserver

Go back to VM2:

ping -c 1 10.10.10.10
ping -c 1 10.10.10.11
ping -c 1 10.10.10.12

netperf -c -C -H 10.10.10.10 -l 10 -t TCP_STREAM
netperf -c -C -H 10.10.10.11 -l 10 -t TCP_STREAM
netperf -c -C -H 10.10.10.12 -l 10 -t TCP_STREAM

I'll take a look at your other comments today.
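
For readers who, like Alex, have not worked with MPLS much: each `encap mpls` route in the setup above pushes one 4-byte label stack entry in front of the IP header. The sketch below packs and unpacks such an entry using the RFC 3032 layout (20-bit label, 3-bit traffic class, bottom-of-stack bit, 8-bit TTL). It is an illustration only: the function names are hypothetical, the value is kept in host byte order (it is big-endian on the wire), and the TTL of 64 mentioned afterwards is just an example value.

```c
#include <stdint.h>

/* Hypothetical helpers, not kernel functions. Field positions follow
 * RFC 3032: label in bits 31..12, traffic class in bits 11..9,
 * bottom-of-stack flag in bit 8, TTL in bits 7..0.
 */
uint32_t lse_encode(uint32_t label, uint8_t tc, int bos, uint8_t ttl)
{
	return ((label & 0xFFFFFu) << 12) |
	       ((uint32_t)(tc & 0x7u) << 9) |
	       ((bos ? 1u : 0u) << 8) |
	       (uint32_t)ttl;
}

/* Extract the 20-bit label from a host-order label stack entry. */
uint32_t lse_label(uint32_t lse)
{
	return lse >> 12;
}

/* Return 1 when the entry is the last one in the label stack. */
int lse_is_bottom(uint32_t lse)
{
	return (lse >> 8) & 1u;
}
```

With label 100 as in the IPv4 route, a single bottom-of-stack entry with TTL 64 encodes to 0x00064140.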
diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
index 2055e57ed1c3..fa6899f02cc8 100644
--- a/net/mpls/mpls_gso.c
+++ b/net/mpls/mpls_gso.c
@@ -22,33 +22,35 @@
 static struct sk_buff *mpls_gso_segment(struct sk_buff *skb,
 				       netdev_features_t features)
 {
+	int mpls_hlen = skb_inner_network_header(skb) - skb_network_header(skb);
 	struct sk_buff *segs = ERR_PTR(-EINVAL);
+	u16 mac_offset = skb->mac_header;
 	netdev_features_t mpls_features;
 	__be16 mpls_protocol;
+	u16 mac_len = skb->mac_len;
 
 	/* Setup inner SKB. */
 	mpls_protocol = skb->protocol;
 	skb->protocol = skb->inner_protocol;
 
-	/* Push back the mac header that skb_mac_gso_segment() has pulled.
-	 * It will be re-pulled by the call to skb_mac_gso_segment() below
-	 */
-	__skb_push(skb, skb->mac_len);
+	__skb_pull(skb, mpls_hlen);
+	skb->mac_len = skb_inner_network_offset(skb);
 
 	/* Segment inner packet. */
 	mpls_features = skb->dev->mpls_features & features;
 	segs = skb_mac_gso_segment(skb, mpls_features);
-
+	if (IS_ERR_OR_NULL(segs)) {
+		skb_gso_error_unwind(skb, mpls_protocol, mpls_hlen, mac_offset,
+				     mac_len);
+		goto out;
+	}
 
 	/* Restore outer protocol. */
 	skb->protocol = mpls_protocol;
+	for (skb = segs; skb; skb = skb->next)
+		skb->protocol = mpls_protocol;
 
-	/* Re-pull the mac header that the call to skb_mac_gso_segment()
-	 * above pulled. It will be re-pushed after returning
-	 * skb_mac_gso_segment(), an indirect caller of this function.
-	 */
-	__skb_pull(skb, skb->data - skb_mac_header(skb));
-
+out:
 	return segs;
 }
 
diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
index aed872cc05a6..55c5ab907563 100644
--- a/net/mpls/mpls_iptunnel.c
+++ b/net/mpls/mpls_iptunnel.c
@@ -90,7 +90,12 @@ static int mpls_xmit(struct sk_buff *skb)
 	if (skb_cow(skb, hh_len + new_header_size))
 		goto drop;
 
+	skb_set_inner_protocol(skb, skb->protocol);
+	skb_reset_inner_network_header(skb);
+	skb->encapsulation = 1;
+
 	skb_push(skb, new_header_size);
+
 	skb_reset_network_header(skb);
 
 	skb->dev = out_dev;
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 1ecbd7715f6d..6d78f162a88b 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -167,6 +167,12 @@ static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key,
 		skb->mac_len);
 	skb_reset_mac_header(skb);
 
+	/* for GSO: set MPLS as network header and encapsulated protocol
+	 * header as inner network header
+	 */
+	skb_set_network_header(skb, skb->mac_len);
+	skb_set_inner_network_header(skb, skb->mac_len + MPLS_HLEN);
+
 	new_mpls_lse = (__be32 *)skb_mpls_header(skb);
 	*new_mpls_lse = mpls->mpls_lse;
 
As reported by Lennert the MPLS GSO code is failing to properly segment
large packets. There are a couple of problems:

1. the inner protocol is not set so the gso segment functions for inner
   protocol layers are not getting run, and

2 MPLS labels for packets that use the "native" (non-OVS) MPLS code
   are not properly accounted for in mpls_gso_segment.

The MPLS GSO code was added for OVS. It is re-using skb_mac_gso_segment
to call the gso segment functions for the higher layer protocols. That
means skb_mac_gso_segment is called twice -- once with the network
protocol set to MPLS and again with the network protocol set to the
inner protocol.

This patch sets the inner skb protocol addressing item 1 above and sets
the network_header and inner_network_header to mark where the MPLS labels
start and end. The MPLS code in OVS is also updated to set the two
network markers.

From there the MPLS GSO code uses the difference between the network
header and the inner network header to know the size of the MPLS header
that was pushed. It then pulls the MPLS header, resets the mac_len and
protocol for the inner protocol and then calls skb_mac_gso_segment
to segment the skb. Afterwards the skb protocol is set to mpls for
each segment as suggested by Simon.

Reported-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 net/mpls/mpls_gso.c       | 24 +++++++++++++-----------
 net/mpls/mpls_iptunnel.c  |  5 +++++
 net/openvswitch/actions.c |  6 ++++++
 3 files changed, 24 insertions(+), 11 deletions(-)
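
The header bookkeeping described in the commit message can be followed with a small userspace model. This is only a sketch: `struct skb_model` and the `model_*` functions are hypothetical stand-ins for the few sk_buff fields and helpers involved, and the model assumes a 14-byte Ethernet header that has already been pulled by the time the MPLS segment function runs.

```c
#include <stdint.h>

#define ETH_HLEN_MODEL  14	/* Ethernet header length, assumed */
#define MPLS_HLEN_MODEL 4	/* one MPLS label stack entry */

/* Toy stand-in for the sk_buff fields the patch touches. All values are
 * offsets from the start of the buffer; "data" is the current position.
 */
struct skb_model {
	uint16_t data;
	uint16_t mac_header;
	uint16_t network_header;	/* start of the MPLS labels */
	uint16_t inner_network_header;	/* start of the IPv4/IPv6 header */
};

/* Build a frame as the MPLS GSO code would see it in this model: the MAC
 * header is already pulled, so data sits on the first of nr_labels labels.
 */
struct skb_model model_mpls_frame(unsigned int nr_labels)
{
	struct skb_model skb;

	skb.mac_header = 0;
	skb.network_header = ETH_HLEN_MODEL;
	skb.inner_network_header =
		(uint16_t)(ETH_HLEN_MODEL + nr_labels * MPLS_HLEN_MODEL);
	skb.data = skb.network_header;
	return skb;
}

/* Size of the MPLS header: inner network header minus network header. */
int model_mpls_hlen(const struct skb_model *skb)
{
	return skb->inner_network_header - skb->network_header;
}

/* Advance the current position, like pulling len bytes off the front. */
void model_pull(struct skb_model *skb, unsigned int len)
{
	skb->data = (uint16_t)(skb->data + len);
}

/* Distance from the current position to the inner network header. */
int model_inner_network_offset(const struct skb_model *skb)
{
	return skb->inner_network_header - skb->data;
}
```

For a frame with two labels the model gives an MPLS header length of 8, and after pulling those 8 bytes the inner network offset is 0, which matches the review comment that the recomputed mac_len would always come out as 0.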