[ovs-dev,v5] tunneling: Avoid recirculation on datapath by computing the recirculate actions at translate time.

Hi Joe,

Thank you for your comments! Please, find my answers below.

> I had trouble trying to review this, in part because I felt like
> changes to fix multiple issues are being placed into one patch. If
> there are separate issues being addressed, each issue should be fixed
> in a separate patch, each with a clear description of the intent of
> the change (with links to patches that introduced the breakage if
> possible!). If there is refactoring to do, consider separating the
> non-functional changes into a separate patch---when looking at the
> original patch that started this discussion, I found it difficult to
> review post-merge because the refactoring was included and I couldn't
> tell if there were actually functional changes.
> 
> Given that this is taking a bit longer than I thought and we need to
> take a holistic approach to how all of these interactions should work,
> I plan to go ahead and apply the revert patch. I look forward to
> seeing a fresh series proposal that will bring back the benefits of
> the original patch.
> 
> More feedback below.
> 
> > Here comes the patch:
> >
> >
> > Since tunneling and forwarding via patch port uses clone action, the tunneled
> 
> Where does forwarding via patch port use clone action?

Sorry, that was not correct. Patch port does not use clone action. Both patch 
port and native tunneling use the apply_netsed_clone_action(). That's what I 
wanted to express.

> apply_nested_clone_action() also takes copies of the flow and
> base_flow, do we need to do it twice in this path? I suspect that this
> work could benefit from some further refactoring, taking note to have
> separate patches to refactor code and different patches for functional
> changes.

Thank you for indicating this. There is no need to create a backup copy of 
flow. We need to store some fields of base_flow. I could use a function like 
this to create backup:

static void
copy_flow_L2L3_data(struct flow *dst, const struct flow *src)
{
    dst->packet_type = src->packet_type;
    dst->dl_dst = src->dl_dst;
    dst->dl_src = src->dl_src;
    dst->dl_type = src->dl_type;
    dst->nw_dst = src->nw_dst;
    dst->nw_src = src->nw_src;
    dst->ipv6_dst = src->ipv6_dst;
    dst->ipv6_src = src->ipv6_src;
    dst->nw_tos = src->nw_tos;
    dst->nw_ttl = src->nw_ttl;
    dst->nw_proto = src->nw_proto;
    dst->tp_dst = src->tp_dst;
    dst->tp_src = src->tp_src;
}

...

/* Backup. */
copy_flow_L2L3_data(&old_base_flow, &ctx->base_flow);

...

/* Restore. */
copy_flow_L2L3_data(&ctx->base_flow, &old_base_flow);

> >
> >      err = tnl_route_lookup_flow(flow, &d_ip6, &s_ip6, &out_dev);
> >      if (err) {
> > @@ -3216,16 +3248,57 @@ build_tunnel_send(struct xlate_ctx *ctx, const struct xport *xport,
> >      tnl_push_data.tnl_port = odp_to_u32(tunnel_odp_port);
> >      tnl_push_data.out_port = odp_to_u32(out_dev->odp_port);
> >
> > +    /* After tunnel header has been added, MAC and IP data of flow and
> > +     * base_flow need to be set properly, since there is not recirculation
> > +     * any more when sending packet to tunnel. */
> > +    if (!eth_addr_is_zero(ctx->xin->flow.dl_dst)) {
> > +        ctx->xin->flow.dl_dst = dmac;
> > +        ctx->base_flow.dl_dst = ctx->xin->flow.dl_dst;
> > +    }
> 
> What's the expectation if ctx->xin->flow.dl_dst is zero? Is that for
> L3 tunneled packets?

No, it's not for L3 tunneled packets. Without this push_pop_tunnel and 
push_pop_tunnel_ipv6 unit tests do fail. This needs to be fixed.

783: tunnel_push_pop - action                        FAILED (tunnel-push-pop.at:109)
785: tunnel_push_pop_ipv6 - action                   FAILED (tunnel-push-pop-ipv6.at:92)

./tunnel-push-pop.at:108: ovs-appctl ofproto/trace ovs-dummy 'in_port(2),eth_type(0x0800),ipv4(src=1.1.3.88,dst=1.1.
3.112,proto=47,tos=0,ttl=64,frag=no)'
stdout:
Flow: ip,in_port=LOCAL,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=1.1.3.88,nw_dst=1.1.
3.112,nw_proto=47,nw_tos=0,nw_ecn=0,nw_ttl=64

bridge("int-br")
----------------
 0. priority 32768
    output:2
     -> output to native tunnel
     -> tunneling to 1.1.2.92 via br0
     -> tunneling from aa:55:aa:55:00:00 1.1.2.88 to f8:bc:12:44:34:b6 1.1.2.92

bridge("br0")
-------------
 0. priority 32768
    NORMAL
     -> learned port is input port, dropping

Final flow: unchanged
Megaflow: recirc_id=0,ip,in_port=LOCAL,vlan_tci=0x0000/0x1fff,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_e
cn=0,nw_frag=no
Datapath actions: clone(drop),tnl_push(tnl_port(4789),header(size=50,type=4,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:5
5:00:00,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=17,tos=0,ttl=64,frag=0x4000),udp(src=0,dst=4789,csum=0x
0),vxlan(flags=0x8000000,vni=0x7b)),out_port(100))
./tunnel-push-pop.at:109: tail -1 stdout

> > +
> > +    ctx->xin->flow.dl_src = smac;
> > +    ctx->base_flow.dl_src = ctx->xin->flow.dl_src;
> > +
> > +    propagate_tunnel_data_to_flow(&ctx->xin->flow, &ctx->base_flow);
> > +
> > +    if (!tnl_params.is_ipv6) {
> > +        ctx->xin->flow.dl_type = htons(ETH_TYPE_IP);
> > +    } else {
> > +        ctx->xin->flow.dl_type = htons(ETH_TYPE_IPV6);
> > +    }
> 
> When writing if (!condition) { ... } else { ... }, please consider
> whether you can swap the conditions and remove the negation. It's less
> cognitive load if we don't have to negate a negative condition to
> consider which condition the "else" applies to. (This otherwise looks
> correct, so just a style request.)

Ok. I'm going to change this.

> > +
> > +    if (tnl_push_data.tnl_type == OVS_VPORT_TYPE_GRE) {
> > +        ctx->xin->flow.nw_proto = IPPROTO_GRE;
> > +    } else {
> > +        ctx->xin->flow.nw_proto = IPPROTO_UDP;
> > +    }
> > +    ctx->base_flow.nw_proto = ctx->xin->flow.nw_proto;
> 
> Should we use a switch () statement here to ensure that newer tunnel
> ports are properly handled in this case if/when they are added? What
> about things like LISP or STT?

I can use a switch case:

    switch (tnl_push_data.tnl_type) {
    case OVS_VPORT_TYPE_GRE:
        ctx->xin->flow.nw_proto = IPPROTO_GRE;
        break;
    case OVS_VPORT_TYPE_VXLAN:
    case OVS_VPORT_TYPE_GENEVE:
        ctx->xin->flow.nw_proto = IPPROTO_UDP;
        break;
    case OVS_VPORT_TYPE_LISP:
         /* XXX: UDP? */
    case OVS_VPORT_TYPE_STT:
         /* XXX: TCP? */
    default:
        break;
    }

But, how to handle LISP and STT? I have not seen any build_header() function 
for LISP and STT in netdev-vport.c: 

    static const struct vport_class vport_classes[] = {
        TUNNEL_CLASS("geneve", "genev_sys", netdev_geneve_build_header,
                                            netdev_tnl_push_udp_header,
                                            netdev_geneve_pop_header),
        TUNNEL_CLASS("gre", "gre_sys", netdev_gre_build_header,
                                       netdev_gre_push_header,
                                       netdev_gre_pop_header),
        TUNNEL_CLASS("vxlan", "vxlan_sys", netdev_vxlan_build_header,
                                           netdev_tnl_push_udp_header,
                                           netdev_vxlan_pop_header),
        TUNNEL_CLASS("lisp", "lisp_sys", NULL, NULL, NULL),
        TUNNEL_CLASS("stt", "stt_sys", NULL, NULL, NULL),
    };

Are LISP and STT valid tunnel types for native tunneling? If I'm not wrong, 
build_tunnel_send() is invoked when native tunneling is used.

> > +
> >      size_t push_action_size = 0;
> >      size_t clone_ofs = nl_msg_start_nested(ctx->odp_actions,
> >                                             OVS_ACTION_ATTR_CLONE);
> >      odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
> >      push_action_size = ctx->odp_actions->size;
> > +
> > +    if (ctx->rule) {
> > +        old_ths = ctx->rule->tunnel_header_size;
> > +        ctx->rule->tunnel_header_size = tnl_push_data.header_len;
> > +    }
> > +
> >      apply_nested_clone_actions(ctx, xport, out_dev);
> >      if (ctx->odp_actions->size > push_action_size) {
> >          /* Update the CLONE action only when combined */
> >          nl_msg_end_nested(ctx->odp_actions, clone_ofs);
> >      }
> 
> If there are no actions on the underlay bridge (ie drop), the else
> condition here should probably reset the nl_msg nesting so that we
> don't generate any actions to the datapath at all. You may also
> consider adding a xlate_report() call to explain why the drop is
> occuring - this should help debugging later on using ofproto/trace.

I'm not sure about this. 
Sugesh, how should this be resolved?

> >      if (credit_counts) {
> > +        uint64_t stats_n_bytes = 0;
> > +
> > +        if (rule->truncated_packet_size) {
> > +            stats_n_bytes = MIN(rule->truncated_packet_size, stats->n_bytes);
> > +        } else {
> > +            stats_n_bytes = stats->n_bytes;
> > +        }
> 
> Is this fixing a separate issue? I'm confused about whether truncated
> packet stats are being correctly attributed today on master. If so, I
> think this should split out into a separate patch. You might also
> consider whether it makes more sense to modify 'stats' earlier in the
> path so that each rule doesn't need to individually apply the stats
> adjustment. I could imagine a store/restore of the stats plus
> modifying for truncation during the xlate_output_trunc_action()
> processing rather than pushing this complexity all the way into the
> rule stats attribution.

Incorrect bytes statistics on the underlay bridge is an unwanted side-effect of 
the original "Avoid recirculate" commit. By exluding recirculation, there is no
upcall for the tunnelled packets, and the packet size is not set by 
upcall_xlate() again:

  static void
  upcall_xlate(struct udpif *udpif, struct upcall *upcall,
               struct ofpbuf *odp_actions, struct flow_wildcards *wc)
  {
      struct dpif_flow_stats stats;
      struct xlate_in xin;

      stats.n_packets = 1;
      stats.n_bytes = dp_packet_size(upcall->packet);
      stats.used = time_msec();
      stats.tcp_flags = ntohs(upcall->flow->tcp_flags);

      xlate_in_init(&xin, upcall->ofproto,
                    ofproto_dpif_get_tables_version(upcall->ofproto),
                    upcall->flow, upcall->in_port, NULL,
                    stats.tcp_flags, upcall->packet, wc, odp_actions);

      if (upcall->type == DPIF_UC_MISS) {
          xin.resubmit_stats = &stats;

Here upcall_xlate() sets xin.resubmint_stats which will be used for calculating
statistic. These are const values, that's why I have not updated them earlier.

    /* If nonnull, flow translation credits the specified statistics to each
     * rule reached through a resubmit or OFPP_TABLE action.
     *
     * This is normally null so the client has to set it manually after
     * calling xlate_in_init(). */
    const struct dpif_flow_stats *resubmit_stats;

If it does not harm, then the const modifier could be removed and stats updated 
at earlier stage.

Best regards,
Zoltan

[ovs-dev,v5] tunneling: Avoid recirculation on datapath by computing the recirculate actions at translate time.

Commit Message

Comments

Patch