diff mbox

[01/21] datapath: tunnelling: Replace tun_id with tun_key

Message ID 1337850554-10339-2-git-send-email-horms@verge.net.au
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Simon Horman May 24, 2012, 9:08 a.m. UTC
this is a first pass at providing a tun_key which can be used
as the basis for flow-based tunnelling. The tun_key includes and
replaces the tun_id in both struct ovs_skb_cb and struct sw_tun_key.

In ovs_skb_cb tun_key is a pointer as it is envisaged that it will grow
when support for IPv6 to an extent that inlining the structure will result
in ovs_skb_cb being larger than the 48 bytes available in skb->cb.

As OVS does not support IPv6 as the outer transport protocol for tunnels
the IPv6 portions of this change, which appeared in the previous revision,
have been dropped in order to limit the scope and size of this patch.

This patch does not make any effort to retain the existing tun_id behaviour
nor does it fully implement flow-based tunnels. As such it it is incomplete
and can't be used in its current form (other than to break OVS tunnelling).

** Please do not apply **

Cc: Kyle Mestery <kmestery@cisco.com>
Signed-off-by: Simon Horman <horms@verge.net.au>

---

v4
* Add tun_flags to ovs_key_ipv4_tunnel
* Correct format in format_odp_key_attr()

v3
* Rework, actually works in limited scenarios

v2
* Use pointer to struct ovs_key_ipv4_tunnel in OVS_CB()
  rather than having a struct ovs_key_ipv4_tunnel in OVS_CB()

v1
* Initial post
---
 datapath/actions.c          |  6 +++---
 datapath/datapath.c         | 10 +++++++++-
 datapath/datapath.h         |  5 +++--
 datapath/flow.c             | 34 +++++++++++++++++++++++-----------
 datapath/flow.h             | 27 +++++++++++++++++++++++----
 datapath/tunnel.c           | 24 +++++++++++++-----------
 datapath/tunnel.h           |  5 +++--
 datapath/vport-capwap.c     | 12 ++++++------
 datapath/vport-gre.c        | 21 +++++++++++----------
 datapath/vport.c            |  2 +-
 include/linux/openvswitch.h | 13 ++++++++++++-
 lib/dpif-netdev.c           |  1 +
 lib/odp-util.c              | 13 +++++++++++++
 lib/odp-util.h              |  5 +++--
 14 files changed, 124 insertions(+), 54 deletions(-)

Comments

Jesse Gross June 3, 2012, 9:15 a.m. UTC | #1
On Thu, May 24, 2012 at 6:08 PM, Simon Horman <horms@verge.net.au> wrote:
> this is a first pass at providing a tun_key which can be used
> as the basis for flow-based tunnelling. The tun_key includes and
> replaces the tun_id in both struct ovs_skb_cb and struct sw_tun_key.
>
> In ovs_skb_cb tun_key is a pointer as it is envisaged that it will grow
> when support for IPv6 to an extent that inlining the structure will result
> in ovs_skb_cb being larger than the 48 bytes available in skb->cb.
>
> As OVS does not support IPv6 as the outer transport protocol for tunnels
> the IPv6 portions of this change, which appeared in the previous revision,
> have been dropped in order to limit the scope and size of this patch.
>
> This patch does not make any effort to retain the existing tun_id behaviour
> nor does it fully implement flow-based tunnels. As such it it is incomplete
> and can't be used in its current form (other than to break OVS tunnelling).
>
> ** Please do not apply **
>
> Cc: Kyle Mestery <kmestery@cisco.com>
> Signed-off-by: Simon Horman <horms@verge.net.au>

Thanks and sorry again about being so slow to look at this.

Overall, this looks pretty good to me.  The main difficulty that I had
was in figuring out what should go with the old behavior and what
should go with the new since it's at an intermediate point between the
two but I understand that it's difficult to break it up in a way that
both encapsulates a particular set of functionality and isn't too
large.  Otherwise, I noticed a few specific things that I noted below.

> diff --git a/datapath/flow.c b/datapath/flow.c
> index d07337c..49c0dd8 100644
> --- a/datapath/flow.c
> +++ b/datapath/flow.c
> @@ -1162,14 +1166,15 @@ int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp,
>  * get the metadata, that is, the parts of the flow key that cannot be
>  * extracted from the packet itself.
>  */
> -int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port, __be64 *tun_id,
> +int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port,
> +                                  struct ovs_key_ipv4_tunnel *tun_key,
>                                   const struct nlattr *attr)
>  {
>        const struct nlattr *nla;
>        int rem;
>
>        *in_port = DP_MAX_PORTS;
> -       *tun_id = 0;
> +       tun_key->tun_id = 0;

I think we probably want to memset the entire tun_key to zero to avoid
having potentially uninitialized data in the flow.

> @@ -1204,15 +1210,21 @@ int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port, __be64 *tun_id,
>  int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb)
>  {
>        struct ovs_key_ethernet *eth_key;
> +       struct ovs_key_ipv4_tunnel *tun_key;
>        struct nlattr *nla, *encap;
>
>        if (swkey->phy.priority &&
>            nla_put_u32(skb, OVS_KEY_ATTR_PRIORITY, swkey->phy.priority))
>                goto nla_put_failure;
>
> -       if (swkey->phy.tun_id != cpu_to_be64(0) &&
> -           nla_put_be64(skb, OVS_KEY_ATTR_TUN_ID, swkey->phy.tun_id))
> -               goto nla_put_failure;
> +       if (swkey->phy.tun_key.ipv4_dst) {

It's probably OK to use DIP equal to zero as a not present marker but
we need to enforce that it's always true - for example we shouldn't
allow somebody to setup a flow that way or receive packets with a zero
address.  Alternately, we may be able to find a spare bit to indicate
this, like is done with vlans.

In any case, I think we need to do some additional validation when
setting up flows to check reserved space, for example, as otherwise
that will never match.

> diff --git a/datapath/flow.h b/datapath/flow.h
> index 5be481e..bab5363 100644
> --- a/datapath/flow.h
> +++ b/datapath/flow.h
> @@ -42,7 +42,7 @@ struct sw_flow_actions {
>
>  struct sw_flow_key {
>        struct {
> -               __be64  tun_id;         /* Encapsulating tunnel ID. */
> +               struct ovs_key_ipv4_tunnel tun_key;  /* Encapsulating tunnel key. */

This is an optimization but as we get closer I'd like to put the
tun_key at the end of struct sw_flow_key so that packets that didn't
come from a tunnel don't have to pay the cost during the lookup (this
is especially true as we add support for IPv6 tunnels).

In a similar vein, struct ovs_key_ipv4_tunnel contains some fields
that I think can never apply for lookup such as the flags so it would
be nice if we could remove that for lookup.

> @@ -150,6 +150,7 @@ u64 ovs_flow_used_time(unsigned long flow_jiffies);
>  *                         ------  ---  ------  -----
>  *  OVS_KEY_ATTR_PRIORITY      4    --     4      8
>  *  OVS_KEY_ATTR_TUN_ID        8    --     4     12
> + *  OVS_KEY_ATTR_IPV4_TUNNEL  18     2     4     24

If my math is correct, I think the size of the base struct
ova_key_ipv4_tunnel is 24 bytes.

> +static inline void tun_key_swap_addr(struct ovs_key_ipv4_tunnel *tun_key)
> +{
> +       __be32 ndst = tun_key->ipv4_src;
> +       tun_key->ipv4_src = tun_key->ipv4_dst;
> +       tun_key->ipv4_dst = ndst;
> +}

I'm not quite sure when we would need to swap the addresses in a
tunnel and I didn't see any uses of this function.

> +static inline void tun_key_init(struct ovs_key_ipv4_tunnel *tun_key,
> +                               const struct iphdr *iph, __be64 tun_id)
> +{
> +       tun_key->tun_id = tun_id;
> +       tun_key->ipv4_src = iph->saddr;
> +       tun_key->ipv4_dst = iph->daddr;
> +       tun_key->ipv4_tos = iph->tos;
> +       tun_key->ipv4_ttl = iph->ttl;
> +}

Aren't there some fields that we need to zero out to avoid problems in
the lookup?

> diff --git a/datapath/tunnel.c b/datapath/tunnel.c
> index d651c11..010e513 100644
> --- a/datapath/tunnel.c
> +++ b/datapath/tunnel.c
> @@ -367,9 +367,9 @@ struct vport *ovs_tnl_find_port(struct net *net, __be32 saddr, __be32 daddr,
>        return NULL;
>  }
>
> -static void ecn_decapsulate(struct sk_buff *skb, u8 tos)
> +static void ecn_decapsulate(struct sk_buff *skb)
>  {
> -       if (unlikely(INET_ECN_is_ce(tos))) {
> +       if (unlikely(INET_ECN_is_ce(OVS_CB(skb)->tun_key->ipv4_tos))) {

This might come in a later patch, although I didn't see it in a quick
scan, but it should be possible to implement all the ECN encapsulation
and decapsulation in userspace, just like we can do with the rest of
the ToS and TTL.

>  bool ovs_tnl_frag_needed(struct vport *vport,
>                         const struct tnl_mutable_config *mutable,
> -                        struct sk_buff *skb, unsigned int mtu, __be64 flow_key)
> +                        struct sk_buff *skb, unsigned int mtu,
> +                        struct ovs_key_ipv4_tunnel *tun_key)
>  {
>        unsigned int eth_hdr_len = ETH_HLEN;
>        unsigned int total_length = 0, header_length = 0, payload_length;
>        struct ethhdr *eh, *old_eh = eth_hdr(skb);
>        struct sk_buff *nskb;
> +       struct ovs_key_ipv4_tunnel ntun_key;
>
>        /* Sanity check */
>        if (skb->protocol == htons(ETH_P_IP)) {
> @@ -705,8 +707,10 @@ bool ovs_tnl_frag_needed(struct vport *vport,
>         * any way of synthesizing packets.
>         */
>        if ((mutable->flags & (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION)) ==
> -           (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION))
> -               OVS_CB(nskb)->tun_id = flow_key;
> +           (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION)) {
> +               ntun_key = *tun_key;
> +               OVS_CB(nskb)->tun_key = &ntun_key;
> +       }

I guess this is probably where you were going to use the function to
reverse IP addresses.  The logic doesn't really work but it's moot
since this is going away anyways.

> @@ -799,10 +803,8 @@ static void create_tunnel_header(const struct vport *vport,
>        iph->ihl        = sizeof(struct iphdr) >> 2;
>        iph->frag_off   = htons(IP_DF);
>        iph->protocol   = tnl_vport->tnl_ops->ipproto;
> -       iph->tos        = mutable->tos;
>        iph->daddr      = rt->rt_dst;
>        iph->saddr      = rt->rt_src;
> -       iph->ttl        = mutable->ttl;
>        if (!iph->ttl)
>                iph->ttl = ip4_dst_hoplimit(&rt_dst(rt));

I'm not sure that these changes quite belong in this patch (not that
it shouldn't be done but it seems like the supporting code isn't there
yet).

> diff --git a/datapath/vport-gre.c b/datapath/vport-gre.c
> index ab89c5b..fd2b038 100644
> --- a/datapath/vport-gre.c
> +++ b/datapath/vport-gre.c
> @@ -101,10 +101,6 @@ static struct sk_buff *gre_update_header(const struct vport *vport,
>        __be32 *options = (__be32 *)(skb_network_header(skb) + mutable->tunnel_hlen
>                                               - GRE_HEADER_SECTION);
>
> -       /* Work backwards over the options so the checksum is last. */
> -       if (mutable->flags & TNL_F_OUT_KEY_ACTION)
> -               *options = be64_get_low32(OVS_CB(skb)->tun_id);

Why does this go away?

> diff --git a/datapath/vport.c b/datapath/vport.c
> index 172261a..0c77a1b 100644
> --- a/datapath/vport.c
> +++ b/datapath/vport.c
> @@ -462,7 +462,7 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb)
>                OVS_CB(skb)->flow = NULL;
>
>        if (!(vport->ops->flags & VPORT_F_TUN_ID))
> -               OVS_CB(skb)->tun_id = 0;
> +               OVS_CB(skb)->tun_key = NULL;

We probably should rename this flag now.

> diff --git a/lib/odp-util.h b/lib/odp-util.h
> index d53f083..4e5a8a1 100644
> --- a/lib/odp-util.h
> +++ b/lib/odp-util.h
> @@ -72,6 +72,7 @@ int odp_actions_from_string(const char *, const struct simap *port_names,
>  *                         ------  ---  ------  -----
>  *  OVS_KEY_ATTR_PRIORITY      4    --     4      8
>  *  OVS_KEY_ATTR_TUN_ID        8    --     4     12
> + *  OVS_KEY_ATTR_IPV4_TUNNEL  18     2     4     24

Same thing about the size here as well.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Simon Horman June 4, 2012, 10:34 p.m. UTC | #2
On Sun, Jun 03, 2012 at 06:15:04PM +0900, Jesse Gross wrote:
> On Thu, May 24, 2012 at 6:08 PM, Simon Horman <horms@verge.net.au> wrote:
> > this is a first pass at providing a tun_key which can be used
> > as the basis for flow-based tunnelling. The tun_key includes and
> > replaces the tun_id in both struct ovs_skb_cb and struct sw_tun_key.
> >
> > In ovs_skb_cb tun_key is a pointer as it is envisaged that it will grow
> > when support for IPv6 to an extent that inlining the structure will result
> > in ovs_skb_cb being larger than the 48 bytes available in skb->cb.
> >
> > As OVS does not support IPv6 as the outer transport protocol for tunnels
> > the IPv6 portions of this change, which appeared in the previous revision,
> > have been dropped in order to limit the scope and size of this patch.
> >
> > This patch does not make any effort to retain the existing tun_id behaviour
> > nor does it fully implement flow-based tunnels. As such it it is incomplete
> > and can't be used in its current form (other than to break OVS tunnelling).
> >
> > ** Please do not apply **
> >
> > Cc: Kyle Mestery <kmestery@cisco.com>
> > Signed-off-by: Simon Horman <horms@verge.net.au>
> 
> Thanks and sorry again about being so slow to look at this.
> 
> Overall, this looks pretty good to me.  The main difficulty that I had
> was in figuring out what should go with the old behavior and what
> should go with the new since it's at an intermediate point between the
> two but I understand that it's difficult to break it up in a way that
> both encapsulates a particular set of functionality and isn't too
> large.  Otherwise, I noticed a few specific things that I noted below.
> 
> > diff --git a/datapath/flow.c b/datapath/flow.c
> > index d07337c..49c0dd8 100644
> > --- a/datapath/flow.c
> > +++ b/datapath/flow.c
> > @@ -1162,14 +1166,15 @@ int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp,
> >  * get the metadata, that is, the parts of the flow key that cannot be
> >  * extracted from the packet itself.
> >  */
> > -int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port, __be64 *tun_id,
> > +int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port,
> > +                                  struct ovs_key_ipv4_tunnel *tun_key,
> >                                   const struct nlattr *attr)
> >  {
> >        const struct nlattr *nla;
> >        int rem;
> >
> >        *in_port = DP_MAX_PORTS;
> > -       *tun_id = 0;
> > +       tun_key->tun_id = 0;
> 
> I think we probably want to memset the entire tun_key to zero to avoid
> having potentially uninitialized data in the flow.

Sure, that is fine by me.

> > @@ -1204,15 +1210,21 @@ int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port, __be64 *tun_id,
> >  int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb)
> >  {
> >        struct ovs_key_ethernet *eth_key;
> > +       struct ovs_key_ipv4_tunnel *tun_key;
> >        struct nlattr *nla, *encap;
> >
> >        if (swkey->phy.priority &&
> >            nla_put_u32(skb, OVS_KEY_ATTR_PRIORITY, swkey->phy.priority))
> >                goto nla_put_failure;
> >
> > -       if (swkey->phy.tun_id != cpu_to_be64(0) &&
> > -           nla_put_be64(skb, OVS_KEY_ATTR_TUN_ID, swkey->phy.tun_id))
> > -               goto nla_put_failure;
> > +       if (swkey->phy.tun_key.ipv4_dst) {
> 
> It's probably OK to use DIP equal to zero as a not present marker but
> we need to enforce that it's always true - for example we shouldn't
> allow somebody to setup a flow that way or receive packets with a zero
> address.  Alternately, we may be able to find a spare bit to indicate
> this, like is done with vlans.

When I originally wrote this there didn't seem to be any obvious
place in ovs_key_ipv4_tunnel to have an active/inactive bit, which
is in part why the code relies on checking DIP.

However, more recent versions of ovs_key_ipv4_tunnel have a flags field of
which only one bit is currently used. We could use one of the unused flag
bits.

> In any case, I think we need to do some additional validation when
> setting up flows to check reserved space, for example, as otherwise
> that will never match.

Sure. My testing seems to indicate that matching does occur,
though I am quite happy to tighten things up.

> 
> > diff --git a/datapath/flow.h b/datapath/flow.h
> > index 5be481e..bab5363 100644
> > --- a/datapath/flow.h
> > +++ b/datapath/flow.h
> > @@ -42,7 +42,7 @@ struct sw_flow_actions {
> >
> >  struct sw_flow_key {
> >        struct {
> > -               __be64  tun_id;         /* Encapsulating tunnel ID. */
> > +               struct ovs_key_ipv4_tunnel tun_key;  /* Encapsulating tunnel key. */
> 
> This is an optimization but as we get closer I'd like to put the
> tun_key at the end of struct sw_flow_key so that packets that didn't
> come from a tunnel don't have to pay the cost during the lookup (this
> is especially true as we add support for IPv6 tunnels).

Sure.

> In a similar vein, struct ovs_key_ipv4_tunnel contains some fields
> that I think can never apply for lookup such as the flags so it would
> be nice if we could remove that for lookup.

I think they need to be there to be passed around, so I'm not
sure if they can easily be removed from ovs_key_ipv4_tunnel if that
is what you are asking.

> > @@ -150,6 +150,7 @@ u64 ovs_flow_used_time(unsigned long flow_jiffies);
> >  *                         ------  ---  ------  -----
> >  *  OVS_KEY_ATTR_PRIORITY      4    --     4      8
> >  *  OVS_KEY_ATTR_TUN_ID        8    --     4     12
> > + *  OVS_KEY_ATTR_IPV4_TUNNEL  18     2     4     24
> 
> If my math is correct, I think the size of the base struct
> ova_key_ipv4_tunnel is 24 bytes.

Sorry, the size changed a few times and I seem to have forgotten to
update the above table accordingly.

> > +static inline void tun_key_swap_addr(struct ovs_key_ipv4_tunnel *tun_key)
> > +{
> > +       __be32 ndst = tun_key->ipv4_src;
> > +       tun_key->ipv4_src = tun_key->ipv4_dst;
> > +       tun_key->ipv4_dst = ndst;
> > +}
> 
> I'm not quite sure when we would need to swap the addresses in a
> tunnel and I didn't see any uses of this function.

This should no longer be needed - and was always broken - I'll remove it.

> > +static inline void tun_key_init(struct ovs_key_ipv4_tunnel *tun_key,
> > +                               const struct iphdr *iph, __be64 tun_id)
> > +{
> > +       tun_key->tun_id = tun_id;
> > +       tun_key->ipv4_src = iph->saddr;
> > +       tun_key->ipv4_dst = iph->daddr;
> > +       tun_key->ipv4_tos = iph->tos;
> > +       tun_key->ipv4_ttl = iph->ttl;
> > +}
> 
> Aren't there some fields that we need to zero out to avoid problems in
> the lookup?

Thanks, I will check.

> > diff --git a/datapath/tunnel.c b/datapath/tunnel.c
> > index d651c11..010e513 100644
> > --- a/datapath/tunnel.c
> > +++ b/datapath/tunnel.c
> > @@ -367,9 +367,9 @@ struct vport *ovs_tnl_find_port(struct net *net, __be32 saddr, __be32 daddr,
> >        return NULL;
> >  }
> >
> > -static void ecn_decapsulate(struct sk_buff *skb, u8 tos)
> > +static void ecn_decapsulate(struct sk_buff *skb)
> >  {
> > -       if (unlikely(INET_ECN_is_ce(tos))) {
> > +       if (unlikely(INET_ECN_is_ce(OVS_CB(skb)->tun_key->ipv4_tos))) {
> 
> This might come in a later patch, although I didn't see it in a quick
> scan, but it should be possible to implement all the ECN encapsulation
> and decapsulation in userspace, just like we can do with the rest of
> the ToS and TTL.

I hadn't considered that, I will add it to my TODO list.

> >  bool ovs_tnl_frag_needed(struct vport *vport,
> >                         const struct tnl_mutable_config *mutable,
> > -                        struct sk_buff *skb, unsigned int mtu, __be64 flow_key)
> > +                        struct sk_buff *skb, unsigned int mtu,
> > +                        struct ovs_key_ipv4_tunnel *tun_key)
> >  {
> >        unsigned int eth_hdr_len = ETH_HLEN;
> >        unsigned int total_length = 0, header_length = 0, payload_length;
> >        struct ethhdr *eh, *old_eh = eth_hdr(skb);
> >        struct sk_buff *nskb;
> > +       struct ovs_key_ipv4_tunnel ntun_key;
> >
> >        /* Sanity check */
> >        if (skb->protocol == htons(ETH_P_IP)) {
> > @@ -705,8 +707,10 @@ bool ovs_tnl_frag_needed(struct vport *vport,
> >         * any way of synthesizing packets.
> >         */
> >        if ((mutable->flags & (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION)) ==
> > -           (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION))
> > -               OVS_CB(nskb)->tun_id = flow_key;
> > +           (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION)) {
> > +               ntun_key = *tun_key;
> > +               OVS_CB(nskb)->tun_key = &ntun_key;
> > +       }
> 
> I guess this is probably where you were going to use the function to
> reverse IP addresses.  The logic doesn't really work but it's moot
> since this is going away anyways.

My latest series includes a clean up to ovs_tnl_frag_needed() to allow
it to work in some circumstances - i.e. those found in my test environment.
That series removes knowledge of tun_key from ovs_tnl_frag_needed().

I am however happy to remove ovs_tnl_frag_needed() completely if you think
that is appropriate.

> > @@ -799,10 +803,8 @@ static void create_tunnel_header(const struct vport *vport,
> >        iph->ihl        = sizeof(struct iphdr) >> 2;
> >        iph->frag_off   = htons(IP_DF);
> >        iph->protocol   = tnl_vport->tnl_ops->ipproto;
> > -       iph->tos        = mutable->tos;
> >        iph->daddr      = rt->rt_dst;
> >        iph->saddr      = rt->rt_src;
> > -       iph->ttl        = mutable->ttl;
> >        if (!iph->ttl)
> >                iph->ttl = ip4_dst_hoplimit(&rt_dst(rt));
> 
> I'm not sure that these changes quite belong in this patch (not that
> it shouldn't be done but it seems like the supporting code isn't there
> yet).

Sorry, I did some merging of my patches and this seems to have been
a case where I merged incorrectly. I think that change belongs in:

[PATCH 18/21] dataptah: remove ttl and tos from tnl_mutable_config

(I need to fix the typo in the title of that patch!)

> > diff --git a/datapath/vport-gre.c b/datapath/vport-gre.c
> > index ab89c5b..fd2b038 100644
> > --- a/datapath/vport-gre.c
> > +++ b/datapath/vport-gre.c
> > @@ -101,10 +101,6 @@ static struct sk_buff *gre_update_header(const struct vport *vport,
> >        __be32 *options = (__be32 *)(skb_network_header(skb) + mutable->tunnel_hlen
> >                                               - GRE_HEADER_SECTION);
> >
> > -       /* Work backwards over the options so the checksum is last. */
> > -       if (mutable->flags & TNL_F_OUT_KEY_ACTION)
> > -               *options = be64_get_low32(OVS_CB(skb)->tun_id);
> 
> Why does this go away?

I agree that looks wrong and my only explanation is that it is remnants
of some hack.

I have done some, hopefully more sensible, re-working of gre_update_header in:

[PATCH 20/21] datapath: Use tun_key flags for id and csum settings on transmit

> > diff --git a/datapath/vport.c b/datapath/vport.c
> > index 172261a..0c77a1b 100644
> > --- a/datapath/vport.c
> > +++ b/datapath/vport.c
> > @@ -462,7 +462,7 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb)
> >                OVS_CB(skb)->flow = NULL;
> >
> >        if (!(vport->ops->flags & VPORT_F_TUN_ID))
> > -               OVS_CB(skb)->tun_id = 0;
> > +               OVS_CB(skb)->tun_key = NULL;
> 
> We probably should rename this flag now.

Yes, I think there are several flags that may now be removed.
I'll add that to my TODO list.

> > diff --git a/lib/odp-util.h b/lib/odp-util.h
> > index d53f083..4e5a8a1 100644
> > --- a/lib/odp-util.h
> > +++ b/lib/odp-util.h
> > @@ -72,6 +72,7 @@ int odp_actions_from_string(const char *, const struct simap *port_names,
> >  *                         ------  ---  ------  -----
> >  *  OVS_KEY_ATTR_PRIORITY      4    --     4      8
> >  *  OVS_KEY_ATTR_TUN_ID        8    --     4     12
> > + *  OVS_KEY_ATTR_IPV4_TUNNEL  18     2     4     24
> 
> Same thing about the size here as well.

Thanks
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesse Gross June 5, 2012, 3:33 a.m. UTC | #3
On Tue, Jun 5, 2012 at 7:34 AM, Simon Horman <horms@verge.net.au> wrote:
> On Sun, Jun 03, 2012 at 06:15:04PM +0900, Jesse Gross wrote:
>> On Thu, May 24, 2012 at 6:08 PM, Simon Horman <horms@verge.net.au> wrote:
>> > @@ -1204,15 +1210,21 @@ int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port, __be64 *tun_id,
>> >  int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb)
>> >  {
>> >        struct ovs_key_ethernet *eth_key;
>> > +       struct ovs_key_ipv4_tunnel *tun_key;
>> >        struct nlattr *nla, *encap;
>> >
>> >        if (swkey->phy.priority &&
>> >            nla_put_u32(skb, OVS_KEY_ATTR_PRIORITY, swkey->phy.priority))
>> >                goto nla_put_failure;
>> >
>> > -       if (swkey->phy.tun_id != cpu_to_be64(0) &&
>> > -           nla_put_be64(skb, OVS_KEY_ATTR_TUN_ID, swkey->phy.tun_id))
>> > -               goto nla_put_failure;
>> > +       if (swkey->phy.tun_key.ipv4_dst) {
>>
>> It's probably OK to use DIP equal to zero as a not present marker but
>> we need to enforce that it's always true - for example we shouldn't
>> allow somebody to setup a flow that way or receive packets with a zero
>> address.  Alternately, we may be able to find a spare bit to indicate
>> this, like is done with vlans.
>
> When I originally wrote this there didn't seem to be any obvious
> place in ovs_key_ipv4_tunnel to have an active/inactive bit, which
> is in part why the code relies on checking DIP.
>
> However, more recent versions of ovs_key_ipv4_tunnel have a flags field of
> which only one bit is currently used. We could use one of the unused flag
> bits.

I guess it depends on what we end up doing with the lookup struct.  If
it stays as it is today, there's plenty of space if you include those
padding bytes.  If we shrink it down and there isn't a place then I do
think it is fine to use DIP (since this is traversing an IP stack and
DIP = 0 is an invalid value it's not like an L2 switch not allowing
invalid IP packet).  In that case, we just need to do more validation
in other places to make sure that this is the only situation that the
condition can arise.

>> In any case, I think we need to do some additional validation when
>> setting up flows to check reserved space, for example, as otherwise
>> that will never match.
>
> Sure. My testing seems to indicate that matching does occur,
> though I am quite happy to tighten things up.

I don't think it causes a problem as long as userspace is well
behaved, I was think it's best to detect problems early.

>> In a similar vein, struct ovs_key_ipv4_tunnel contains some fields
>> that I think can never apply for lookup such as the flags so it would
>> be nice if we could remove that for lookup.
>
> I think they need to be there to be passed around, so I'm not
> sure if they can easily be removed from ovs_key_ipv4_tunnel if that
> is what you are asking.

My guess is that we'll probably want to separate out the struct that
is used for lookup from the one that is used for communication with
userspace, which is what we do for most things so that we have more
freedom to optimize in the kernel.

>> >  bool ovs_tnl_frag_needed(struct vport *vport,
>> >                         const struct tnl_mutable_config *mutable,
>> > -                        struct sk_buff *skb, unsigned int mtu, __be64 flow_key)
>> > +                        struct sk_buff *skb, unsigned int mtu,
>> > +                        struct ovs_key_ipv4_tunnel *tun_key)
>> >  {
>> >        unsigned int eth_hdr_len = ETH_HLEN;
>> >        unsigned int total_length = 0, header_length = 0, payload_length;
>> >        struct ethhdr *eh, *old_eh = eth_hdr(skb);
>> >        struct sk_buff *nskb;
>> > +       struct ovs_key_ipv4_tunnel ntun_key;
>> >
>> >        /* Sanity check */
>> >        if (skb->protocol == htons(ETH_P_IP)) {
>> > @@ -705,8 +707,10 @@ bool ovs_tnl_frag_needed(struct vport *vport,
>> >         * any way of synthesizing packets.
>> >         */
>> >        if ((mutable->flags & (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION)) ==
>> > -           (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION))
>> > -               OVS_CB(nskb)->tun_id = flow_key;
>> > +           (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION)) {
>> > +               ntun_key = *tun_key;
>> > +               OVS_CB(nskb)->tun_key = &ntun_key;
>> > +       }
>>
>> I guess this is probably where you were going to use the function to
>> reverse IP addresses.  The logic doesn't really work but it's moot
>> since this is going away anyways.
>
> My latest series includes a clean up to ovs_tnl_frag_needed() to allow
> it to work in some circumstances - i.e. those found in my test environment.
> That series removes knowledge of tun_key from ovs_tnl_frag_needed().
>
> I am however happy to remove ovs_tnl_frag_needed() completely if you think
> that is appropriate.

I think, in retrospect, that the path MTU discovery that I implemented
here was probably not the right choice and MSS clamping is the correct
way to do things.  It was better when it wasn't possible to do any
kind of flow-based manipulation of tunnels but the model is breaking
down more and more over time.  Given that I would be hesitant to
submit it upstream and since that's the goal of this work, removing it
completely is probably the right thing to do.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Simon Horman June 5, 2012, 8:12 a.m. UTC | #4
On Tue, Jun 05, 2012 at 12:33:11PM +0900, Jesse Gross wrote:
> On Tue, Jun 5, 2012 at 7:34 AM, Simon Horman <horms@verge.net.au> wrote:
> > On Sun, Jun 03, 2012 at 06:15:04PM +0900, Jesse Gross wrote:
> >> On Thu, May 24, 2012 at 6:08 PM, Simon Horman <horms@verge.net.au> wrote:
> >> > @@ -1204,15 +1210,21 @@ int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port, __be64 *tun_id,
> >> >  int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb)
> >> >  {
> >> >        struct ovs_key_ethernet *eth_key;
> >> > +       struct ovs_key_ipv4_tunnel *tun_key;
> >> >        struct nlattr *nla, *encap;
> >> >
> >> >        if (swkey->phy.priority &&
> >> >            nla_put_u32(skb, OVS_KEY_ATTR_PRIORITY, swkey->phy.priority))
> >> >                goto nla_put_failure;
> >> >
> >> > -       if (swkey->phy.tun_id != cpu_to_be64(0) &&
> >> > -           nla_put_be64(skb, OVS_KEY_ATTR_TUN_ID, swkey->phy.tun_id))
> >> > -               goto nla_put_failure;
> >> > +       if (swkey->phy.tun_key.ipv4_dst) {
> >>
> >> It's probably OK to use DIP equal to zero as a not present marker but
> >> we need to enforce that it's always true - for example we shouldn't
> >> allow somebody to setup a flow that way or receive packets with a zero
> >> address.  Alternately, we may be able to find a spare bit to indicate
> >> this, like is done with vlans.
> >
> > When I originally wrote this there didn't seem to be any obvious
> > place in ovs_key_ipv4_tunnel to have an active/inactive bit, which
> > is in part why the code relies on checking DIP.
> >
> > However, more recent versions of ovs_key_ipv4_tunnel have a flags field of
> > which only one bit is currently used. We could use one of the unused flag
> > bits.
> 
> I guess it depends on what we end up doing with the lookup struct.  If
> it stays as it is today, there's plenty of space if you include those
> padding bytes.  If we shrink it down and there isn't a place then I do
> think it is fine to use DIP (since this is traversing an IP stack and
> DIP = 0 is an invalid value it's not like an L2 switch not allowing
> invalid IP packet).  In that case, we just need to do more validation
> in other places to make sure that this is the only situation that the
> condition can arise.

Ok, my suspicion is that there will be space.

> >> In any case, I think we need to do some additional validation when
> >> setting up flows to check reserved space, for example, as otherwise
> >> that will never match.
> >
> > Sure. My testing seems to indicate that matching does occur,
> > though I am quite happy to tighten things up.
> 
> I don't think it causes a problem as long as userspace is well
> behaved, I was think it's best to detect problems early.

Ok, good point. I'll make sure the data is clean.
> 
> >> In a similar vein, struct ovs_key_ipv4_tunnel contains some fields
> >> that I think can never apply for lookup such as the flags so it would
> >> be nice if we could remove that for lookup.
> >
> > I think they need to be there to be passed around, so I'm not
> > sure if they can easily be removed from ovs_key_ipv4_tunnel if that
> > is what you are asking.
> 
> My guess is that we'll probably want to separate out the struct that
> is used for lookup from the one that is used for communication with
> userspace, which is what we do for most things so that we have more
> freedom to optimize in the kernel.

Understood, I'll look into doing that.

> >> >  bool ovs_tnl_frag_needed(struct vport *vport,
> >> >                         const struct tnl_mutable_config *mutable,
> >> > -                        struct sk_buff *skb, unsigned int mtu, __be64 flow_key)
> >> > +                        struct sk_buff *skb, unsigned int mtu,
> >> > +                        struct ovs_key_ipv4_tunnel *tun_key)
> >> >  {
> >> >        unsigned int eth_hdr_len = ETH_HLEN;
> >> >        unsigned int total_length = 0, header_length = 0, payload_length;
> >> >        struct ethhdr *eh, *old_eh = eth_hdr(skb);
> >> >        struct sk_buff *nskb;
> >> > +       struct ovs_key_ipv4_tunnel ntun_key;
> >> >
> >> >        /* Sanity check */
> >> >        if (skb->protocol == htons(ETH_P_IP)) {
> >> > @@ -705,8 +707,10 @@ bool ovs_tnl_frag_needed(struct vport *vport,
> >> >         * any way of synthesizing packets.
> >> >         */
> >> >        if ((mutable->flags & (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION)) ==
> >> > -           (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION))
> >> > -               OVS_CB(nskb)->tun_id = flow_key;
> >> > +           (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION)) {
> >> > +               ntun_key = *tun_key;
> >> > +               OVS_CB(nskb)->tun_key = &ntun_key;
> >> > +       }
> >>
> >> I guess this is probably where you were going to use the function to
> >> reverse IP addresses.  The logic doesn't really work but it's moot
> >> since this is going away anyways.
> >
> > My latest series includes a clean up to ovs_tnl_frag_needed() to allow
> > it to work in some circumstances - i.e. those found in my test environment.
> > That series removes knowledge of tun_key from ovs_tnl_frag_needed().
> >
> > I am however happy to remove ovs_tnl_frag_needed() completely if you think
> > that is appropriate.
> 
> I think, in retrospect, that the path MTU discovery that I implemented
> here was probably not the right choice and MSS clamping is the correct
> way to do things.  It was better when it wasn't possible to do any
> kind of flow-based manipulation of tunnels but the model is breaking
> down more and more over time.  Given that I would be hesitant to
> submit it upstream and since that's the goal of this work, removing it
> completely is probably the right thing to do.

Sure. My latest series also includes an implementation of MSS clamping,
so assuming it isn't doomed in some way we should be safe to remove
MTU discovery. I'll see about doing that for my next post.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/datapath/actions.c b/datapath/actions.c
index 208f260..7b2ea25 100644
--- a/datapath/actions.c
+++ b/datapath/actions.c
@@ -342,8 +342,8 @@  static int execute_set_action(struct sk_buff *skb,
 		skb->priority = nla_get_u32(nested_attr);
 		break;
 
-	case OVS_KEY_ATTR_TUN_ID:
-		OVS_CB(skb)->tun_id = nla_get_be64(nested_attr);
+	case OVS_KEY_ATTR_IPV4_TUNNEL:
+		OVS_CB(skb)->tun_key = nla_data(nested_attr);
 		break;
 
 	case OVS_KEY_ATTR_ETHERNET:
@@ -469,7 +469,7 @@  int ovs_execute_actions(struct datapath *dp, struct sk_buff *skb)
 		goto out_loop;
 	}
 
-	OVS_CB(skb)->tun_id = 0;
+	OVS_CB(skb)->tun_key = NULL;
 	error = do_execute_actions(dp, skb, acts->actions,
 					 acts->actions_len, false);
 
diff --git a/datapath/datapath.c b/datapath/datapath.c
index a4376a0..65dfe79 100644
--- a/datapath/datapath.c
+++ b/datapath/datapath.c
@@ -587,12 +587,20 @@  static int validate_set(const struct nlattr *a,
 
 	switch (key_type) {
 	const struct ovs_key_ipv4 *ipv4_key;
+	const struct ovs_key_ipv4_tunnel *tun_key;
 
 	case OVS_KEY_ATTR_PRIORITY:
 	case OVS_KEY_ATTR_TUN_ID:
 	case OVS_KEY_ATTR_ETHERNET:
 		break;
 
+	case OVS_KEY_ATTR_IPV4_TUNNEL:
+		tun_key = nla_data(ovs_key);
+		if (!tun_key->ipv4_dst) {
+			return -EINVAL;
+		}
+		break;
+
 	case OVS_KEY_ATTR_IPV4:
 		if (flow_key->eth.type != htons(ETH_P_IP))
 			return -EINVAL;
@@ -785,7 +793,7 @@  static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 
 	err = ovs_flow_metadata_from_nlattrs(&flow->key.phy.priority,
 					     &flow->key.phy.in_port,
-					     &flow->key.phy.tun_id,
+					     &flow->key.phy.tun_key,
 					     a[OVS_PACKET_ATTR_KEY]);
 	if (err)
 		goto err_flow_put;
diff --git a/datapath/datapath.h b/datapath/datapath.h
index affbf0e..de0b28d 100644
--- a/datapath/datapath.h
+++ b/datapath/datapath.h
@@ -96,7 +96,7 @@  struct datapath {
 /**
  * struct ovs_skb_cb - OVS data in skb CB
  * @flow: The flow associated with this packet.  May be %NULL if no flow.
- * @tun_id: ID of the tunnel that encapsulated this packet.  It is 0 if the
+ * @tun_key: Key for the tunnel that encapsulated this packet.
  * @ip_summed: Consistently stores L4 checksumming status across different
  * kernel versions.
  * @csum_start: Stores the offset from which to start checksumming independent
@@ -107,7 +107,7 @@  struct datapath {
  */
 struct ovs_skb_cb {
 	struct sw_flow		*flow;
-	__be64			tun_id;
+	struct ovs_key_ipv4_tunnel  *tun_key;
 #ifdef NEED_CSUM_NORMALIZE
 	enum csum_type		ip_summed;
 	u16			csum_start;
@@ -192,4 +192,5 @@  struct sk_buff *ovs_vport_cmd_build_info(struct vport *, u32 pid, u32 seq,
 					 u8 cmd);
 
 int ovs_execute_actions(struct datapath *dp, struct sk_buff *skb);
+
 #endif /* datapath.h */
diff --git a/datapath/flow.c b/datapath/flow.c
index d07337c..49c0dd8 100644
--- a/datapath/flow.c
+++ b/datapath/flow.c
@@ -629,7 +629,8 @@  int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
 	memset(key, 0, sizeof(*key));
 
 	key->phy.priority = skb->priority;
-	key->phy.tun_id = OVS_CB(skb)->tun_id;
+	if (OVS_CB(skb)->tun_key)
+		key->phy.tun_key = *OVS_CB(skb)->tun_key;
 	key->phy.in_port = in_port;
 
 	skb_reset_mac_header(skb);
@@ -847,6 +848,7 @@  const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 
 	/* Not upstream. */
 	[OVS_KEY_ATTR_TUN_ID] = sizeof(__be64),
+	[OVS_KEY_ATTR_IPV4_TUNNEL] = sizeof(struct ovs_key_ipv4_tunnel),
 };
 
 static int ipv4_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_len,
@@ -1022,9 +1024,11 @@  int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp,
 		swkey->phy.in_port = DP_MAX_PORTS;
 	}
 
-	if (attrs & (1ULL << OVS_KEY_ATTR_TUN_ID)) {
-		swkey->phy.tun_id = nla_get_be64(a[OVS_KEY_ATTR_TUN_ID]);
-		attrs &= ~(1ULL << OVS_KEY_ATTR_TUN_ID);
+	if (attrs & (1ULL << OVS_KEY_ATTR_IPV4_TUNNEL)) {
+		struct ovs_key_ipv4_tunnel *tun_key;
+		tun_key = nla_data(a[OVS_KEY_ATTR_IPV4_TUNNEL]);
+		swkey->phy.tun_key = *tun_key;
+		attrs &= ~(1ULL << OVS_KEY_ATTR_IPV4_TUNNEL);
 	}
 
 	/* Data attributes. */
@@ -1162,14 +1166,15 @@  int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp,
  * get the metadata, that is, the parts of the flow key that cannot be
  * extracted from the packet itself.
  */
-int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port, __be64 *tun_id,
+int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port,
+				   struct ovs_key_ipv4_tunnel *tun_key,
 				   const struct nlattr *attr)
 {
 	const struct nlattr *nla;
 	int rem;
 
 	*in_port = DP_MAX_PORTS;
-	*tun_id = 0;
+	tun_key->tun_id = 0;
 	*priority = 0;
 
 	nla_for_each_nested(nla, attr, rem) {
@@ -1184,8 +1189,9 @@  int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port, __be64 *tun_id,
 				*priority = nla_get_u32(nla);
 				break;
 
-			case OVS_KEY_ATTR_TUN_ID:
-				*tun_id = nla_get_be64(nla);
+			case OVS_KEY_ATTR_IPV4_TUNNEL:
+				memcpy(tun_key, nla_data(nla),
+				       sizeof(*tun_key));
 				break;
 
 			case OVS_KEY_ATTR_IN_PORT:
@@ -1204,15 +1210,21 @@  int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port, __be64 *tun_id,
 int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb)
 {
 	struct ovs_key_ethernet *eth_key;
+	struct ovs_key_ipv4_tunnel *tun_key;
 	struct nlattr *nla, *encap;
 
 	if (swkey->phy.priority &&
 	    nla_put_u32(skb, OVS_KEY_ATTR_PRIORITY, swkey->phy.priority))
 		goto nla_put_failure;
 
-	if (swkey->phy.tun_id != cpu_to_be64(0) &&
-	    nla_put_be64(skb, OVS_KEY_ATTR_TUN_ID, swkey->phy.tun_id))
-		goto nla_put_failure;
+	if (swkey->phy.tun_key.ipv4_dst) {
+		nla = nla_reserve(skb, OVS_KEY_ATTR_IPV4_TUNNEL,
+				  sizeof(*tun_key));
+		if (!nla)
+			goto nla_put_failure;
+		tun_key = nla_data(nla);
+		*tun_key = swkey->phy.tun_key;
+	}
 
 	if (swkey->phy.in_port != DP_MAX_PORTS &&
 	    nla_put_u32(skb, OVS_KEY_ATTR_IN_PORT, swkey->phy.in_port))
diff --git a/datapath/flow.h b/datapath/flow.h
index 5be481e..bab5363 100644
--- a/datapath/flow.h
+++ b/datapath/flow.h
@@ -42,7 +42,7 @@  struct sw_flow_actions {
 
 struct sw_flow_key {
 	struct {
-		__be64	tun_id;		/* Encapsulating tunnel ID. */
+		struct ovs_key_ipv4_tunnel tun_key;  /* Encapsulating tunnel key. */
 		u32	priority;	/* Packet QoS priority. */
 		u16	in_port;	/* Input switch port (or DP_MAX_PORTS). */
 	} phy;
@@ -150,6 +150,7 @@  u64 ovs_flow_used_time(unsigned long flow_jiffies);
  *                         ------  ---  ------  -----
  *  OVS_KEY_ATTR_PRIORITY      4    --     4      8
  *  OVS_KEY_ATTR_TUN_ID        8    --     4     12
+ *  OVS_KEY_ATTR_IPV4_TUNNEL  18     2     4     24
  *  OVS_KEY_ATTR_IN_PORT       4    --     4      8
  *  OVS_KEY_ATTR_ETHERNET     12    --     4     16
  *  OVS_KEY_ATTR_8021Q         4    --     4      8
@@ -158,14 +159,15 @@  u64 ovs_flow_used_time(unsigned long flow_jiffies);
  *  OVS_KEY_ATTR_ICMPV6        2     2     4      8
  *  OVS_KEY_ATTR_ND           28    --     4     32
  *  -------------------------------------------------
- *  total                                       144
+ *  total                                       168
  */
-#define FLOW_BUFSIZE 144
+#define FLOW_BUFSIZE 168
 
 int ovs_flow_to_nlattrs(const struct sw_flow_key *, struct sk_buff *);
 int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp,
 		      const struct nlattr *);
-int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port, __be64 *tun_id,
+int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port,
+				   struct ovs_key_ipv4_tunnel *tun_key,
 				   const struct nlattr *);
 
 #define MAX_ACTIONS_BUFSIZE	(16 * 1024)
@@ -204,4 +206,21 @@  u32 ovs_flow_hash(const struct sw_flow_key *key, int key_len);
 struct sw_flow *ovs_flow_tbl_next(struct flow_table *table, u32 *bucket, u32 *idx);
 extern const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1];
 
+static inline void tun_key_swap_addr(struct ovs_key_ipv4_tunnel *tun_key)
+{
+	__be32 ndst = tun_key->ipv4_src;
+	tun_key->ipv4_src = tun_key->ipv4_dst;
+	tun_key->ipv4_dst = ndst;
+}
+
+static inline void tun_key_init(struct ovs_key_ipv4_tunnel *tun_key,
+				const struct iphdr *iph, __be64 tun_id)
+{
+	tun_key->tun_id = tun_id;
+	tun_key->ipv4_src = iph->saddr;
+	tun_key->ipv4_dst = iph->daddr;
+	tun_key->ipv4_tos = iph->tos;
+	tun_key->ipv4_ttl = iph->ttl;
+}
+
 #endif /* flow.h */
diff --git a/datapath/tunnel.c b/datapath/tunnel.c
index d651c11..010e513 100644
--- a/datapath/tunnel.c
+++ b/datapath/tunnel.c
@@ -367,9 +367,9 @@  struct vport *ovs_tnl_find_port(struct net *net, __be32 saddr, __be32 daddr,
 	return NULL;
 }
 
-static void ecn_decapsulate(struct sk_buff *skb, u8 tos)
+static void ecn_decapsulate(struct sk_buff *skb)
 {
-	if (unlikely(INET_ECN_is_ce(tos))) {
+	if (unlikely(INET_ECN_is_ce(OVS_CB(skb)->tun_key->ipv4_tos))) {
 		__be16 protocol = skb->protocol;
 
 		skb_set_network_header(skb, ETH_HLEN);
@@ -416,7 +416,7 @@  static void ecn_decapsulate(struct sk_buff *skb, u8 tos)
  * - skb->csum does not include the inner Ethernet header.
  * - The layer pointers are undefined.
  */
-void ovs_tnl_rcv(struct vport *vport, struct sk_buff *skb, u8 tos)
+void ovs_tnl_rcv(struct vport *vport, struct sk_buff *skb)
 {
 	struct ethhdr *eh;
 
@@ -433,7 +433,7 @@  void ovs_tnl_rcv(struct vport *vport, struct sk_buff *skb, u8 tos)
 	skb_clear_rxhash(skb);
 	secpath_reset(skb);
 
-	ecn_decapsulate(skb, tos);
+	ecn_decapsulate(skb);
 	vlan_set_tci(skb, 0);
 
 	if (unlikely(compute_ip_summed(skb, false))) {
@@ -613,12 +613,14 @@  static void ipv6_build_icmp(struct sk_buff *skb, struct sk_buff *nskb,
 
 bool ovs_tnl_frag_needed(struct vport *vport,
 			 const struct tnl_mutable_config *mutable,
-			 struct sk_buff *skb, unsigned int mtu, __be64 flow_key)
+			 struct sk_buff *skb, unsigned int mtu,
+			 struct ovs_key_ipv4_tunnel *tun_key)
 {
 	unsigned int eth_hdr_len = ETH_HLEN;
 	unsigned int total_length = 0, header_length = 0, payload_length;
 	struct ethhdr *eh, *old_eh = eth_hdr(skb);
 	struct sk_buff *nskb;
+	struct ovs_key_ipv4_tunnel ntun_key;
 
 	/* Sanity check */
 	if (skb->protocol == htons(ETH_P_IP)) {
@@ -705,8 +707,10 @@  bool ovs_tnl_frag_needed(struct vport *vport,
 	 * any way of synthesizing packets.
 	 */
 	if ((mutable->flags & (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION)) ==
-	    (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION))
-		OVS_CB(nskb)->tun_id = flow_key;
+	    (TNL_F_IN_KEY_MATCH | TNL_F_OUT_KEY_ACTION)) {
+		ntun_key = *tun_key;
+		OVS_CB(nskb)->tun_key = &ntun_key;
+	}
 
 	if (unlikely(compute_ip_summed(nskb, false))) {
 		kfree_skb(nskb);
@@ -761,7 +765,7 @@  static bool check_mtu(struct sk_buff *skb,
 
 			if (packet_length > mtu &&
 			    ovs_tnl_frag_needed(vport, mutable, skb, mtu,
-						OVS_CB(skb)->tun_id))
+						OVS_CB(skb)->tun_key))
 				return false;
 		}
 	}
@@ -778,7 +782,7 @@  static bool check_mtu(struct sk_buff *skb,
 
 			if (packet_length > mtu &&
 			    ovs_tnl_frag_needed(vport, mutable, skb, mtu,
-						OVS_CB(skb)->tun_id))
+						OVS_CB(skb)->tun_key))
 				return false;
 		}
 	}
@@ -799,10 +803,8 @@  static void create_tunnel_header(const struct vport *vport,
 	iph->ihl	= sizeof(struct iphdr) >> 2;
 	iph->frag_off	= htons(IP_DF);
 	iph->protocol	= tnl_vport->tnl_ops->ipproto;
-	iph->tos	= mutable->tos;
 	iph->daddr	= rt->rt_dst;
 	iph->saddr	= rt->rt_src;
-	iph->ttl	= mutable->ttl;
 	if (!iph->ttl)
 		iph->ttl = ip4_dst_hoplimit(&rt_dst(rt));
 
diff --git a/datapath/tunnel.h b/datapath/tunnel.h
index 1924017..7d78297 100644
--- a/datapath/tunnel.h
+++ b/datapath/tunnel.h
@@ -269,14 +269,15 @@  int ovs_tnl_set_addr(struct vport *vport, const unsigned char *addr);
 const char *ovs_tnl_get_name(const struct vport *vport);
 const unsigned char *ovs_tnl_get_addr(const struct vport *vport);
 int ovs_tnl_send(struct vport *vport, struct sk_buff *skb);
-void ovs_tnl_rcv(struct vport *vport, struct sk_buff *skb, u8 tos);
+void ovs_tnl_rcv(struct vport *vport, struct sk_buff *skb);
 
 struct vport *ovs_tnl_find_port(struct net *net, __be32 saddr, __be32 daddr,
 				__be64 key, int tunnel_type,
 				const struct tnl_mutable_config **mutable);
 bool ovs_tnl_frag_needed(struct vport *vport,
 			 const struct tnl_mutable_config *mutable,
-			 struct sk_buff *skb, unsigned int mtu, __be64 flow_key);
+			 struct sk_buff *skb, unsigned int mtu,
+			 struct ovs_key_ipv4_tunnel *tun_key);
 void ovs_tnl_free_linked_skbs(struct sk_buff *skb);
 
 int ovs_tnl_init(void);
diff --git a/datapath/vport-capwap.c b/datapath/vport-capwap.c
index 05a099d..1e08d5a 100644
--- a/datapath/vport-capwap.c
+++ b/datapath/vport-capwap.c
@@ -220,7 +220,7 @@  static struct sk_buff *capwap_update_header(const struct vport *vport,
 		struct capwaphdr_wsi *wsi = (struct capwaphdr_wsi *)(cwh + 1);
 		struct capwaphdr_wsi_key *opt = (struct capwaphdr_wsi_key *)(wsi + 1);
 
-		opt->key = OVS_CB(skb)->tun_id;
+		opt->key = OVS_CB(skb)->tun_key->tun_id;
 	}
 
 	udph->len = htons(skb->len - skb_transport_offset(skb));
@@ -316,6 +316,7 @@  static int capwap_rcv(struct sock *sk, struct sk_buff *skb)
 	struct vport *vport;
 	const struct tnl_mutable_config *mutable;
 	struct iphdr *iph;
+	struct ovs_key_ipv4_tunnel tun_key;
 	__be64 key = 0;
 
 	if (unlikely(!pskb_may_pull(skb, CAPWAP_MIN_HLEN + ETH_HLEN)))
@@ -333,12 +334,11 @@  static int capwap_rcv(struct sock *sk, struct sk_buff *skb)
 		goto error;
 	}
 
-	if (mutable->flags & TNL_F_IN_KEY_MATCH)
-		OVS_CB(skb)->tun_id = key;
-	else
-		OVS_CB(skb)->tun_id = 0;
+	tun_key_init(&tun_key, iph,
+		     mutable->flags & TNL_F_IN_KEY_MATCH ? key : 0);
+	OVS_CB(skb)->tun_key = &tun_key;
 
-	ovs_tnl_rcv(vport, skb, iph->tos);
+	ovs_tnl_rcv(vport, skb);
 	goto out;
 
 error:
diff --git a/datapath/vport-gre.c b/datapath/vport-gre.c
index ab89c5b..fd2b038 100644
--- a/datapath/vport-gre.c
+++ b/datapath/vport-gre.c
@@ -101,10 +101,6 @@  static struct sk_buff *gre_update_header(const struct vport *vport,
 	__be32 *options = (__be32 *)(skb_network_header(skb) + mutable->tunnel_hlen
 					       - GRE_HEADER_SECTION);
 
-	/* Work backwards over the options so the checksum is last. */
-	if (mutable->flags & TNL_F_OUT_KEY_ACTION)
-		*options = be64_get_low32(OVS_CB(skb)->tun_id);
-
 	if (mutable->out_key || mutable->flags & TNL_F_OUT_KEY_ACTION)
 		options--;
 
@@ -285,7 +281,11 @@  static void gre_err(struct sk_buff *skb, u32 info)
 #endif
 
 	__skb_pull(skb, tunnel_hdr_len);
-	ovs_tnl_frag_needed(vport, mutable, skb, mtu, key);
+	{
+		struct ovs_key_ipv4_tunnel tun_key;
+		tun_key_init(&tun_key, iph, key);
+		ovs_tnl_frag_needed(vport, mutable, skb, mtu, &tun_key);
+	}
 	__skb_push(skb, tunnel_hdr_len);
 
 out:
@@ -327,6 +327,7 @@  static int gre_rcv(struct sk_buff *skb)
 	const struct tnl_mutable_config *mutable;
 	int hdr_len;
 	struct iphdr *iph;
+	struct ovs_key_ipv4_tunnel tun_key;
 	__be16 flags;
 	__be64 key;
 
@@ -351,15 +352,15 @@  static int gre_rcv(struct sk_buff *skb)
 		goto error;
 	}
 
-	if (mutable->flags & TNL_F_IN_KEY_MATCH)
-		OVS_CB(skb)->tun_id = key;
-	else
-		OVS_CB(skb)->tun_id = 0;
+
+	tun_key_init(&tun_key, iph,
+		     mutable->flags & TNL_F_IN_KEY_MATCH ? key : 0);
+	OVS_CB(skb)->tun_key = &tun_key;
 
 	__skb_pull(skb, hdr_len);
 	skb_postpull_rcsum(skb, skb_transport_header(skb), hdr_len + ETH_HLEN);
 
-	ovs_tnl_rcv(vport, skb, iph->tos);
+	ovs_tnl_rcv(vport, skb);
 	return 0;
 
 error:
diff --git a/datapath/vport.c b/datapath/vport.c
index 172261a..0c77a1b 100644
--- a/datapath/vport.c
+++ b/datapath/vport.c
@@ -462,7 +462,7 @@  void ovs_vport_receive(struct vport *vport, struct sk_buff *skb)
 		OVS_CB(skb)->flow = NULL;
 
 	if (!(vport->ops->flags & VPORT_F_TUN_ID))
-		OVS_CB(skb)->tun_id = 0;
+		OVS_CB(skb)->tun_key = NULL;
 
 	ovs_dp_process_received_packet(vport, skb);
 }
diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
index f5c9cca..c32bb58 100644
--- a/include/linux/openvswitch.h
+++ b/include/linux/openvswitch.h
@@ -278,7 +278,8 @@  enum ovs_key_attr {
 	OVS_KEY_ATTR_ICMPV6,    /* struct ovs_key_icmpv6 */
 	OVS_KEY_ATTR_ARP,       /* struct ovs_key_arp */
 	OVS_KEY_ATTR_ND,        /* struct ovs_key_nd */
-	OVS_KEY_ATTR_TUN_ID = 63, /* be64 tunnel ID */
+	OVS_KEY_ATTR_TUN_ID,    /* be64 tunnel ID */
+	OVS_KEY_ATTR_IPV4_TUNNEL,  /* struct ovs_key_ipv4_tunnel */
 	__OVS_KEY_ATTR_MAX
 };
 
@@ -360,6 +361,16 @@  struct ovs_key_nd {
 	__u8  nd_tll[6];
 };
 
+struct ovs_key_ipv4_tunnel {
+	__be64 tun_id;
+	__u32  tun_flags;
+	__be32 ipv4_src;
+	__be32 ipv4_dst;
+	__u8   ipv4_tos;
+	__u8   ipv4_ttl;
+	__u8   pad[2];
+};
+
 /**
  * enum ovs_flow_attr - attributes for %OVS_FLOW_* commands.
  * @OVS_FLOW_ATTR_KEY: Nested %OVS_KEY_ATTR_* attributes specifying the flow
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index fb0a863..d065a3a 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -1165,6 +1165,7 @@  execute_set_action(struct ofpbuf *packet, const struct nlattr *a)
     case OVS_KEY_ATTR_TUN_ID:
     case OVS_KEY_ATTR_PRIORITY:
     case OVS_KEY_ATTR_IPV6:
+    case OVS_KEY_ATTR_IPV4_TUNNEL:
         /* not implemented */
         break;
 
diff --git a/lib/odp-util.c b/lib/odp-util.c
index 8693d3c..23d1efe 100644
--- a/lib/odp-util.c
+++ b/lib/odp-util.c
@@ -106,6 +106,7 @@  ovs_key_attr_to_string(enum ovs_key_attr attr)
     case OVS_KEY_ATTR_ARP: return "arp";
     case OVS_KEY_ATTR_ND: return "nd";
     case OVS_KEY_ATTR_TUN_ID: return "tun_id";
+    case OVS_KEY_ATTR_IPV4_TUNNEL: return "ipv4_tunnel";
 
     case __OVS_KEY_ATTR_MAX:
     default:
@@ -614,6 +615,7 @@  odp_flow_key_attr_len(uint16_t type)
     case OVS_KEY_ATTR_ICMPV6: return sizeof(struct ovs_key_icmpv6);
     case OVS_KEY_ATTR_ARP: return sizeof(struct ovs_key_arp);
     case OVS_KEY_ATTR_ND: return sizeof(struct ovs_key_nd);
+    case OVS_KEY_ATTR_IPV4_TUNNEL: return sizeof(struct ovs_key_ipv4_tunnel);
 
     case OVS_KEY_ATTR_UNSPEC:
     case __OVS_KEY_ATTR_MAX:
@@ -668,6 +670,7 @@  format_odp_key_attr(const struct nlattr *a, struct ds *ds)
     const struct ovs_key_icmpv6 *icmpv6_key;
     const struct ovs_key_arp *arp_key;
     const struct ovs_key_nd *nd_key;
+    const struct ovs_key_ipv4_tunnel *ipv4_tun_key;
     enum ovs_key_attr attr = nl_attr_type(a);
     int expected_len;
 
@@ -698,6 +701,16 @@  format_odp_key_attr(const struct nlattr *a, struct ds *ds)
         ds_put_format(ds, "(%#"PRIx64")", ntohll(nl_attr_get_be64(a)));
         break;
 
+    case OVS_KEY_ATTR_IPV4_TUNNEL:
+        ipv4_tun_key = nl_attr_get(a);
+        ds_put_format(ds, "(tun_id=%"PRIx64",flags=%"PRIx32
+                      ",src="IP_FMT",dst="IP_FMT",tos=%"PRIx8",ttl=%"PRIu8")",
+                      ntohll(ipv4_tun_key->tun_id), ipv4_tun_key->tun_flags,
+                      IP_ARGS(&ipv4_tun_key->ipv4_src),
+                      IP_ARGS(&ipv4_tun_key->ipv4_dst),
+                      ipv4_tun_key->ipv4_tos, ipv4_tun_key->ipv4_ttl);
+        break;
+
     case OVS_KEY_ATTR_IN_PORT:
         ds_put_format(ds, "(%"PRIu32")", nl_attr_get_u32(a));
         break;
diff --git a/lib/odp-util.h b/lib/odp-util.h
index d53f083..4e5a8a1 100644
--- a/lib/odp-util.h
+++ b/lib/odp-util.h
@@ -72,6 +72,7 @@  int odp_actions_from_string(const char *, const struct simap *port_names,
  *                         ------  ---  ------  -----
  *  OVS_KEY_ATTR_PRIORITY      4    --     4      8
  *  OVS_KEY_ATTR_TUN_ID        8    --     4     12
+ *  OVS_KEY_ATTR_IPV4_TUNNEL  18     2     4     24
  *  OVS_KEY_ATTR_IN_PORT       4    --     4      8
  *  OVS_KEY_ATTR_ETHERNET     12    --     4     16
  *  OVS_KEY_ATTR_8021Q         4    --     4      8
@@ -80,9 +81,9 @@  int odp_actions_from_string(const char *, const struct simap *port_names,
  *  OVS_KEY_ATTR_ICMPV6        2     2     4      8
  *  OVS_KEY_ATTR_ND           28    --     4     32
  *  -------------------------------------------------
- *  total                                       144
+ *  total                                       168
  */
-#define ODPUTIL_FLOW_KEY_BYTES 144
+#define ODPUTIL_FLOW_KEY_BYTES 168
 
 /* A buffer with sufficient size and alignment to hold an nlattr-formatted flow
  * key.  An array of "struct nlattr" might not, in theory, be sufficiently