diff mbox

[v9,net-next,4/7] openvswitch: add layer 3 flow/port support

Message ID 20160506055705.GA9276@penelope.isobedori.kobe.vergenet.net
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Simon Horman May 6, 2016, 5:57 a.m. UTC
[CC Jiri Benc]

On Thu, May 05, 2016 at 10:37:08AM -0700, pravin shelar wrote:
> On Wed, May 4, 2016 at 12:36 AM, Simon Horman
> <simon.horman@netronome.com> wrote:
> > From: Lorand Jakab <lojakab@cisco.com>
> >
> > Implementation of the pop_eth and push_eth actions in the kernel, and
> > layer 3 flow support.
> >
> > This doesn't actually do anything yet as no layer 2 tunnel ports are
> > supported yet. The original patch by Lorand was against the Open vSwtich
> > tree which has L2 LISP tunnels but that is not supported in mainline Linux.
> > I (Simon) plan to follow up with support for non-TEB GRE ports based on
> > work by Thomas Morin.
> >
> > Cc: Thomas Morin <thomas.morin@orange.com>
> > Signed-off-by: Lorand Jakab <lojakab@cisco.com>
> > Signed-off-by: Simon Horman <simon.horman@netronome.com>
> >
> > ---
> 
> ...
> 
> > diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> > index 0ea128eeeab2..6e174ea5f2bb 100644
> > --- a/net/openvswitch/flow.c
> > +++ b/net/openvswitch/flow.c
> > @@ -468,28 +468,31 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
> >
> >         skb_reset_mac_header(skb);
> >
> > -       /* Link layer.  We are guaranteed to have at least the 14 byte Ethernet
> > -        * header in the linear data area.
> > -        */
> > -       eth = eth_hdr(skb);
> > -       ether_addr_copy(key->eth.src, eth->h_source);
> > -       ether_addr_copy(key->eth.dst, eth->h_dest);
> > +       /* Link layer. */
> > +       if (key->phy.is_layer3) {
> > +               key->eth.tci = 0;
> > +               key->eth.type = skb->protocol;
> > +       } else {
> > +               eth = eth_hdr(skb);
> > +               ether_addr_copy(key->eth.src, eth->h_source);
> > +               ether_addr_copy(key->eth.dst, eth->h_dest);
> >
> > -       __skb_pull(skb, 2 * ETH_ALEN);
> > -       /* We are going to push all headers that we pull, so no need to
> > -        * update skb->csum here.
> > -        */
> > +               __skb_pull(skb, 2 * ETH_ALEN);
> > +               /* We are going to push all headers that we pull, so no need to
> > +                * update skb->csum here.
> > +                */
> >
> > -       key->eth.tci = 0;
> > -       if (skb_vlan_tag_present(skb))
> > -               key->eth.tci = htons(skb->vlan_tci);
> > -       else if (eth->h_proto == htons(ETH_P_8021Q))
> > -               if (unlikely(parse_vlan(skb, key)))
> > -                       return -ENOMEM;
> > +               key->eth.tci = 0;
> > +               if (skb_vlan_tag_present(skb))
> > +                       key->eth.tci = htons(skb->vlan_tci);
> > +               else if (eth->h_proto == htons(ETH_P_8021Q))
> > +                       if (unlikely(parse_vlan(skb, key)))
> > +                               return -ENOMEM;
> >
> > -       key->eth.type = parse_ethertype(skb);
> > -       if (unlikely(key->eth.type == htons(0)))
> > -               return -ENOMEM;
> > +               key->eth.type = parse_ethertype(skb);
> > +               if (unlikely(key->eth.type == htons(0)))
> > +                       return -ENOMEM;
> > +       }
> >
> >         skb_reset_network_header(skb);
> >         skb_reset_mac_len(skb);
> > @@ -696,11 +699,23 @@ int ovs_flow_key_update(struct sk_buff *skb, struct sw_flow_key *key)
> >  int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
> >                          struct sk_buff *skb, struct sw_flow_key *key)
> >  {
> > +       bool is_layer3 = false;
> > +       bool is_teb = false;
> is_layer3 and is_teb are mutually exclusive, so can't we use single
> boolean here?

Sure, I can do something like the following if you prefer.
To my mind it makes things a bit less readable. But I don't feel
strongly about this.


> > +       int err;
> > +
> >         /* Extract metadata from packet. */
> >         if (tun_info) {
> >                 key->tun_proto = ip_tunnel_info_af(tun_info);
> >                 memcpy(&key->tun_key, &tun_info->key, sizeof(key->tun_key));
> >
> > +               if (OVS_CB(skb)->input_vport->dev->type != ARPHRD_ETHER) {
> > +                       if (skb->protocol == htons(ETH_P_TEB))
> > +                               is_teb = true;
> > +                       else
> > +                               is_layer3 = true;
> > +               }
> > +
> On transmit side you are using mac_len to detect l3 packet, why not do
> same while extracting the key?

Unfortunately mac_len can't be relied on here, emprically it has the same
value (28 in my tests) for both the TEB and layer3 case above.

Perhaps that could be changed by futher enhancements in the tunneling code
but I think things are symetric as they stand:

* On recieve skb->protocol can be read to distinguish TEB and layer3 packets
* On transmit skb->protocol should be set to distinguish TEB and layer3 packets

Comments

Jiri Benc May 6, 2016, 9:25 a.m. UTC | #1
On Fri, 6 May 2016 14:57:07 +0900, Simon Horman wrote:
> On Thu, May 05, 2016 at 10:37:08AM -0700, pravin shelar wrote:
> > On transmit side you are using mac_len to detect l3 packet, why not do
> > same while extracting the key?

I agree. The skb should be self-contained, i.e. it should be obvious
whether it has the MAC header set or not just from the skb itself at
any point in the packet processing. Otherwise, I'd expect things like
recirculation to break after push/pop of eth header.

> Unfortunately mac_len can't be relied on here, emprically it has the same
> value (28 in my tests) for both the TEB and layer3 case above.

That's strange, it looks like there's something setting the mac header
unconditionally in ovs code. We should find that place and change it.

The ARPHRD_NONE interfaces don't even set mac_header and mac_len, this
will need to be set by ovs upon getting frame from such interface. 

> Perhaps that could be changed by futher enhancements in the tunneling code
> but I think things are symetric as they stand:
> 
> * On recieve skb->protocol can be read to distinguish TEB and layer3 packets
> * On transmit skb->protocol should be set to distinguish TEB and layer3 packets

Yes, but you need to act upon this directly after receiving the
frame/just before sending the frame and set up an internal flag that
will be used throughout the code. That way, the packet can be handed
over to different parts of the code, recirculated, etc. without
worries. skb->mac_len is probably a good candidate for such flag.

 Jiri
Simon Horman May 9, 2016, 8:04 a.m. UTC | #2
On Fri, May 06, 2016 at 11:25:14AM +0200, Jiri Benc wrote:
> On Fri, 6 May 2016 14:57:07 +0900, Simon Horman wrote:
> > On Thu, May 05, 2016 at 10:37:08AM -0700, pravin shelar wrote:
> > > On transmit side you are using mac_len to detect l3 packet, why not do
> > > same while extracting the key?
> 
> I agree. The skb should be self-contained, i.e. it should be obvious
> whether it has the MAC header set or not just from the skb itself at
> any point in the packet processing. Otherwise, I'd expect things like
> recirculation to break after push/pop of eth header.
> 
> > Unfortunately mac_len can't be relied on here, emprically it has the same
> > value (28 in my tests) for both the TEB and layer3 case above.
> 
> That's strange, it looks like there's something setting the mac header
> unconditionally in ovs code. We should find that place and change it.

It seems to be caused by the following:

1. __ipgre_rcv() calls skb_pop_mac_header() which
   sets skb->mac_header to the skb->network_header.

2. __ipgre_rcv() then calls ip_tunnel_rcv() which calls
   skb_reset_network_header(). This updates skb->network_header to
   just after the end of the GRE header.

   This is 28 bytes after the old skb->network_header
   as there is a 20 byte IPv4 header followed by an
   8 byte GRE header.

3. Later, dev_gro_receive() calls skb_reset_mac_len().
   This calculates skb->mac_len based on skb->network_header and
   skb->mac_header. I.e. 28 bytes.


I think this may be possible to address by calling
skb_reset_network_header() instead of skb_pop_mac_header()
in __ipgre_rcv().

I think that would leave skb->mac_header and skb->network_header both
set to just after the end of the GRE header and result in mac_len of 0.
It looks like this owuld be for for both TEB and non-TEB GRE packets
and that OVS would need to check against mac_len, protocol and most
likely dev->type early on.

> The ARPHRD_NONE interfaces don't even set mac_header and mac_len, this
> will need to be set by ovs upon getting frame from such interface. 

Noted.

> > Perhaps that could be changed by futher enhancements in the tunneling code
> > but I think things are symetric as they stand:
> > 
> > * On recieve skb->protocol can be read to distinguish TEB and layer3 packets
> > * On transmit skb->protocol should be set to distinguish TEB and layer3 packets
> 
> Yes, but you need to act upon this directly after receiving the
> frame/just before sending the frame and set up an internal flag that
> will be used throughout the code. That way, the packet can be handed
> over to different parts of the code, recirculated, etc. without
> worries. skb->mac_len is probably a good candidate for such flag.

Its possible that I've overlooked something but as things stand I think
things look like this:

* ovs_flow_key_extract() keys off dev->type and skb->protocol.
* ovs_flow_key_extract() calls key_extract() which amongst other things
  sets up the skb->mac_header and skb->mac_len of the skb.
* ovs_flow_key_extract() sets skb->protocol to that of the inner packet
  in the case of TEB
* Actions update the above mentioned skb fields as appropriate.

So it seems to me that it should be safe to rely on skb->protocol
in the receive path. Or more specifically, in netdev_port_receive().

If mac_len is also able to be used then I think fine. But it seems to me
that it needs to be set up by OvS at least for the ARPHRD_NONE case. This
could be done early on, say in netdev_port_receive(). But it seems that
would involve duplicating some of what is already occurring in
key_extract().
Jiri Benc May 10, 2016, 12:01 p.m. UTC | #3
On Mon, 9 May 2016 17:04:22 +0900, Simon Horman wrote:
> It seems to be caused by the following:
> 
> 1. __ipgre_rcv() calls skb_pop_mac_header() which
>    sets skb->mac_header to the skb->network_header.
> 
> 2. __ipgre_rcv() then calls ip_tunnel_rcv() which calls
>    skb_reset_network_header(). This updates skb->network_header to
>    just after the end of the GRE header.
> 
>    This is 28 bytes after the old skb->network_header
>    as there is a 20 byte IPv4 header followed by an
>    8 byte GRE header.
> 
> 3. Later, dev_gro_receive() calls skb_reset_mac_len().
>    This calculates skb->mac_len based on skb->network_header and
>    skb->mac_header. I.e. 28 bytes.

Right. Thanks for tracking this down!

> I think this may be possible to address by calling
> skb_reset_network_header() instead of skb_pop_mac_header()
> in __ipgre_rcv().

We can't do that. The interface type is ARPHRD_IPGRE and not
ARPHRD_NONE, so the current behavior makes pretty good sense. See
e.g. commit 0e3da5bb8da45.

We have two options here:

1. As for metadata tunnels all the info is in metadata_dst and we
   don't need the IP/GRE header for anything, we can make the ipgre
   interface ARPHRD_NONE in metadata based mode.

2. We can fix this up in ovs after receiving the packet from
   ARPHRD_IPGRE interface.

I think the first option is the correct one. We already don't assign
dev->header_ops in metadata mode. I'll prepare a patch.

> Its possible that I've overlooked something but as things stand I think
> things look like this:
> 
> * ovs_flow_key_extract() keys off dev->type and skb->protocol.
> * ovs_flow_key_extract() calls key_extract() which amongst other things
>   sets up the skb->mac_header and skb->mac_len of the skb.
> * ovs_flow_key_extract() sets skb->protocol to that of the inner packet
>   in the case of TEB
> * Actions update the above mentioned skb fields as appropriate.

Okay, that actually eases things somewhat.

> So it seems to me that it should be safe to rely on skb->protocol
> in the receive path. Or more specifically, in netdev_port_receive().
> 
> If mac_len is also able to be used then I think fine. But it seems to me
> that it needs to be set up by OvS at least for the ARPHRD_NONE case. This
> could be done early on, say in netdev_port_receive(). But it seems that
> would involve duplicating some of what is already occurring in
> key_extract().

I'd actually prefer doing this earlier, netdev_port_receive looks like
the right place. Just set mac_len there (or whatever) and let
key_extract do the rest of the work, not depending on dev->type in
there.

My point about recirculation was not actually valid, as I missed you're
doing this in ovs_flow_key_extract and not in key_extract. Still,
I think the special handling of particular interface types belongs to
the tx processing on those interfaces, not to the common code.

Thanks!

 Jiri
Simon Horman May 11, 2016, 1:50 a.m. UTC | #4
Hi Jiri,

On Tue, May 10, 2016 at 02:01:06PM +0200, Jiri Benc wrote:
> On Mon, 9 May 2016 17:04:22 +0900, Simon Horman wrote:
> > It seems to be caused by the following:
> > 
> > 1. __ipgre_rcv() calls skb_pop_mac_header() which
> >    sets skb->mac_header to the skb->network_header.
> > 
> > 2. __ipgre_rcv() then calls ip_tunnel_rcv() which calls
> >    skb_reset_network_header(). This updates skb->network_header to
> >    just after the end of the GRE header.
> > 
> >    This is 28 bytes after the old skb->network_header
> >    as there is a 20 byte IPv4 header followed by an
> >    8 byte GRE header.
> > 
> > 3. Later, dev_gro_receive() calls skb_reset_mac_len().
> >    This calculates skb->mac_len based on skb->network_header and
> >    skb->mac_header. I.e. 28 bytes.
> 
> Right. Thanks for tracking this down!
> 
> > I think this may be possible to address by calling
> > skb_reset_network_header() instead of skb_pop_mac_header()
> > in __ipgre_rcv().
> 
> We can't do that. The interface type is ARPHRD_IPGRE and not
> ARPHRD_NONE, so the current behavior makes pretty good sense. See
> e.g. commit 0e3da5bb8da45.
> 
> We have two options here:
> 
> 1. As for metadata tunnels all the info is in metadata_dst and we
>    don't need the IP/GRE header for anything, we can make the ipgre
>    interface ARPHRD_NONE in metadata based mode.
> 
> 2. We can fix this up in ovs after receiving the packet from
>    ARPHRD_IPGRE interface.
> 
> I think the first option is the correct one. We already don't assign
> dev->header_ops in metadata mode. I'll prepare a patch.

I agree that 1. seems to be the better approach.

> > Its possible that I've overlooked something but as things stand I think
> > things look like this:
> > 
> > * ovs_flow_key_extract() keys off dev->type and skb->protocol.
> > * ovs_flow_key_extract() calls key_extract() which amongst other things
> >   sets up the skb->mac_header and skb->mac_len of the skb.
> > * ovs_flow_key_extract() sets skb->protocol to that of the inner packet
> >   in the case of TEB
> > * Actions update the above mentioned skb fields as appropriate.
> 
> Okay, that actually eases things somewhat.
> 
> > So it seems to me that it should be safe to rely on skb->protocol
> > in the receive path. Or more specifically, in netdev_port_receive().
> > 
> > If mac_len is also able to be used then I think fine. But it seems to me
> > that it needs to be set up by OvS at least for the ARPHRD_NONE case. This
> > could be done early on, say in netdev_port_receive(). But it seems that
> > would involve duplicating some of what is already occurring in
> > key_extract().
> 
> I'd actually prefer doing this earlier, netdev_port_receive looks like
> the right place. Just set mac_len there (or whatever) and let
> key_extract do the rest of the work, not depending on dev->type in
> there.
> 
> My point about recirculation was not actually valid, as I missed you're
> doing this in ovs_flow_key_extract and not in key_extract. Still,
> I think the special handling of particular interface types belongs to
> the tx processing on those interfaces, not to the common code.

Sure, if that is your preference I think it should be simple enough to
implement. I agree that netdev_port_receive() looks like a good place for
this.
Jiri Benc May 11, 2016, 1:57 p.m. UTC | #5
On Wed, 11 May 2016 10:50:12 +0900, Simon Horman wrote:
> On Tue, May 10, 2016 at 02:01:06PM +0200, Jiri Benc wrote:
> > We have two options here:
> > 
> > 1. As for metadata tunnels all the info is in metadata_dst and we
> >    don't need the IP/GRE header for anything, we can make the ipgre
> >    interface ARPHRD_NONE in metadata based mode.
> > 
> > 2. We can fix this up in ovs after receiving the packet from
> >    ARPHRD_IPGRE interface.
> > 
> > I think the first option is the correct one. We already don't assign
> > dev->header_ops in metadata mode. I'll prepare a patch.
> 
> I agree that 1. seems to be the better approach.

I just sent a patch that fixes this. And we have the same bug in
VXLAN-GPE, I sent a patch, too.

> Sure, if that is your preference I think it should be simple enough to
> implement. I agree that netdev_port_receive() looks like a good place for
> this.

Great, thanks!

 Jiri
diff mbox

Patch

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index d320c2657627..fc92cf542101 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -701,7 +701,6 @@  int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
 			 struct sk_buff *skb, struct sw_flow_key *key)
 {
 	bool is_layer3 = false;
-	bool is_teb = false;
 	int err;
 
 	/* Extract metadata from packet. */
@@ -709,13 +708,9 @@  int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
 		key->tun_proto = ip_tunnel_info_af(tun_info);
 		memcpy(&key->tun_key, &tun_info->key, sizeof(key->tun_key));
 
-		if (OVS_CB(skb)->input_vport->dev->type != ARPHRD_ETHER) {
-			if (skb->protocol == htons(ETH_P_TEB))
-				is_teb = true;
-			else
-				is_layer3 = true;
-		}
-
+		if (OVS_CB(skb)->input_vport->dev->type != ARPHRD_ETHER &&
+		    skb->protocol != htons(ETH_P_TEB))
+			is_layer3 = true;
 
 		if (tun_info->options_len) {
 			BUILD_BUG_ON((1 << (sizeof(tun_info->options_len) *
@@ -746,10 +741,12 @@  int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
 	if (err < 0)
 		return err;
 
-	if (is_teb)
-		skb->protocol = key->eth.type;
-	else if (is_layer3)
-		key->eth.type = skb->protocol;
+	if (tun_info && OVS_CB(skb)->input_vport->dev->type != ARPHRD_ETHER) {
+		if (is_layer3)
+			key->eth.type = skb->protocol;
+		else
+			skb->protocol = key->eth.type;
+	}
 
 	return err;
 }