
[ovs-dev,v5] tunneling: Avoid recirculation on datapath by computing the recirculate actions at translate time.

Message ID AM2PR07MB104298D64EBD0B440536460F8AED0@AM2PR07MB1042.eurprd07.prod.outlook.com
State Changes Requested

Commit Message

Zoltan Balogh May 11, 2017, 4:04 p.m. UTC
Hi Joe,

Thank you for your comments! Please, find my answers below.
 
> I had trouble trying to review this, in part because I felt like
> changes to fix multiple issues are being placed into one patch. If
> there are separate issues being addressed, each issue should be fixed
> in a separate patch, each with a clear description of the intent of
> the change (with links to patches that introduced the breakage if
> possible!). If there is refactoring to do, consider separating the
> non-functional changes into a separate patch---when looking at the
> original patch that started this discussion, I found it difficult to
> review post-merge because the refactoring was included and I couldn't
> tell if there were actually functional changes.
> 
> Given that this is taking a bit longer than I thought and we need to
> take a holistic approach to how all of these interactions should work,
> I plan to go ahead and apply the revert patch. I look forward to
> seeing a fresh series proposal that will bring back the benefits of
> the original patch.
> 
> More feedback below.
> 
> > Here comes the patch:
> >
> >
> > Since tunneling and forwarding via patch port uses clone action, the tunneled
> 
> Where does forwarding via patch port use clone action?

Sorry, that was not correct. Patch port does not use clone action. Both patch 
port and native tunneling use apply_nested_clone_action(). That's what I 
wanted to express.


> apply_nested_clone_action() also takes copies of the flow and
> base_flow, do we need to do it twice in this path? I suspect that this
> work could benefit from some further refactoring, taking note to have
> separate patches to refactor code and different patches for functional
> changes.

Thank you for pointing this out. There is no need to create a backup copy of 
flow. We only need to store some fields of base_flow. I could use a function like 
this to create the backup:

static void
copy_flow_L2L3_data(struct flow *dst, const struct flow *src)
{
    dst->packet_type = src->packet_type;
    dst->dl_dst = src->dl_dst;
    dst->dl_src = src->dl_src;
    dst->dl_type = src->dl_type;
    dst->nw_dst = src->nw_dst;
    dst->nw_src = src->nw_src;
    dst->ipv6_dst = src->ipv6_dst;
    dst->ipv6_src = src->ipv6_src;
    dst->nw_tos = src->nw_tos;
    dst->nw_ttl = src->nw_ttl;
    dst->nw_proto = src->nw_proto;
    dst->tp_dst = src->tp_dst;
    dst->tp_src = src->tp_src;
}

...

/* Backup. */
copy_flow_L2L3_data(&old_base_flow, &ctx->base_flow);

...

/* Restore. */
copy_flow_L2L3_data(&ctx->base_flow, &old_base_flow);

> >
> >      err = tnl_route_lookup_flow(flow, &d_ip6, &s_ip6, &out_dev);
> >      if (err) {
> > @@ -3216,16 +3248,57 @@ build_tunnel_send(struct xlate_ctx *ctx, const struct xport *xport,
> >      tnl_push_data.tnl_port = odp_to_u32(tunnel_odp_port);
> >      tnl_push_data.out_port = odp_to_u32(out_dev->odp_port);
> >
> > +    /* After tunnel header has been added, MAC and IP data of flow and
> > +     * base_flow need to be set properly, since there is not recirculation
> > +     * any more when sending packet to tunnel. */
> > +    if (!eth_addr_is_zero(ctx->xin->flow.dl_dst)) {
> > +        ctx->xin->flow.dl_dst = dmac;
> > +        ctx->base_flow.dl_dst = ctx->xin->flow.dl_dst;
> > +    }
> 
> What's the expectation if ctx->xin->flow.dl_dst is zero? Is that for
> L3 tunneled packets?

No, it's not for L3 tunneled packets. Without this check, the tunnel_push_pop and 
tunnel_push_pop_ipv6 unit tests fail. This needs to be fixed.

783: tunnel_push_pop - action                        FAILED (tunnel-push-pop.at:109)
785: tunnel_push_pop_ipv6 - action                   FAILED (tunnel-push-pop-ipv6.at:92)


./tunnel-push-pop.at:108: ovs-appctl ofproto/trace ovs-dummy 'in_port(2),eth_type(0x0800),ipv4(src=1.1.3.88,dst=1.1.
3.112,proto=47,tos=0,ttl=64,frag=no)'
stdout:
Flow: ip,in_port=LOCAL,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=1.1.3.88,nw_dst=1.1.
3.112,nw_proto=47,nw_tos=0,nw_ecn=0,nw_ttl=64

bridge("int-br")
----------------
 0. priority 32768
    output:2
     -> output to native tunnel
     -> tunneling to 1.1.2.92 via br0
     -> tunneling from aa:55:aa:55:00:00 1.1.2.88 to f8:bc:12:44:34:b6 1.1.2.92

bridge("br0")
-------------
 0. priority 32768
    NORMAL
     -> learned port is input port, dropping

Final flow: unchanged
Megaflow: recirc_id=0,ip,in_port=LOCAL,vlan_tci=0x0000/0x1fff,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_e
cn=0,nw_frag=no
Datapath actions: clone(drop),tnl_push(tnl_port(4789),header(size=50,type=4,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:5
5:00:00,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=17,tos=0,ttl=64,frag=0x4000),udp(src=0,dst=4789,csum=0x
0),vxlan(flags=0x8000000,vni=0x7b)),out_port(100))
./tunnel-push-pop.at:109: tail -1 stdout

> > +
> > +    ctx->xin->flow.dl_src = smac;
> > +    ctx->base_flow.dl_src = ctx->xin->flow.dl_src;
> > +
> > +    propagate_tunnel_data_to_flow(&ctx->xin->flow, &ctx->base_flow);
> > +
> > +    if (!tnl_params.is_ipv6) {
> > +        ctx->xin->flow.dl_type = htons(ETH_TYPE_IP);
> > +    } else {
> > +        ctx->xin->flow.dl_type = htons(ETH_TYPE_IPV6);
> > +    }
> 
> When writing if (!condition) { ... } else { ... }, please consider
> whether you can swap the conditions and remove the negation. It's less
> cognitive load if we don't have to negate a negative condition to
> consider which condition the "else" applies to. (This otherwise looks
> correct, so just a style request.)

Ok. I'm going to change this.
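
That is, something like this (same logic, just the two branches swapped):

    if (tnl_params.is_ipv6) {
        ctx->xin->flow.dl_type = htons(ETH_TYPE_IPV6);
    } else {
        ctx->xin->flow.dl_type = htons(ETH_TYPE_IP);
    }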

> > +
> > +    if (tnl_push_data.tnl_type == OVS_VPORT_TYPE_GRE) {
> > +        ctx->xin->flow.nw_proto = IPPROTO_GRE;
> > +    } else {
> > +        ctx->xin->flow.nw_proto = IPPROTO_UDP;
> > +    }
> > +    ctx->base_flow.nw_proto = ctx->xin->flow.nw_proto;
> 
> Should we use a switch () statement here to ensure that newer tunnel
> ports are properly handled in this case if/when they are added? What
> about things like LISP or STT?

I can use a switch statement:

    switch (tnl_push_data.tnl_type) {
    case OVS_VPORT_TYPE_GRE:
        ctx->xin->flow.nw_proto = IPPROTO_GRE;
        break;
    case OVS_VPORT_TYPE_VXLAN:
    case OVS_VPORT_TYPE_GENEVE:
        ctx->xin->flow.nw_proto = IPPROTO_UDP;
        break;
    case OVS_VPORT_TYPE_LISP:
         /* XXX: UDP? */
    case OVS_VPORT_TYPE_STT:
         /* XXX: TCP? */
    default:
        break;
    }

But how should LISP and STT be handled? I have not seen any build_header() function 
for LISP or STT in netdev-vport.c: 

    static const struct vport_class vport_classes[] = {
        TUNNEL_CLASS("geneve", "genev_sys", netdev_geneve_build_header,
                                            netdev_tnl_push_udp_header,
                                            netdev_geneve_pop_header),
        TUNNEL_CLASS("gre", "gre_sys", netdev_gre_build_header,
                                       netdev_gre_push_header,
                                       netdev_gre_pop_header),
        TUNNEL_CLASS("vxlan", "vxlan_sys", netdev_vxlan_build_header,
                                           netdev_tnl_push_udp_header,
                                           netdev_vxlan_pop_header),
        TUNNEL_CLASS("lisp", "lisp_sys", NULL, NULL, NULL),
        TUNNEL_CLASS("stt", "stt_sys", NULL, NULL, NULL),
    };

Are LISP and STT valid tunnel types for native tunneling? If I'm not wrong, 
build_tunnel_send() is invoked when native tunneling is used.

> > +
> >      size_t push_action_size = 0;
> >      size_t clone_ofs = nl_msg_start_nested(ctx->odp_actions,
> >                                             OVS_ACTION_ATTR_CLONE);
> >      odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
> >      push_action_size = ctx->odp_actions->size;
> > +
> > +    if (ctx->rule) {
> > +        old_ths = ctx->rule->tunnel_header_size;
> > +        ctx->rule->tunnel_header_size = tnl_push_data.header_len;
> > +    }
> > +
> >      apply_nested_clone_actions(ctx, xport, out_dev);
> >      if (ctx->odp_actions->size > push_action_size) {
> >          /* Update the CLONE action only when combined */
> >          nl_msg_end_nested(ctx->odp_actions, clone_ofs);
> >      }
> 
> If there are no actions on the underlay bridge (ie drop), the else
> condition here should probably reset the nl_msg nesting so that we
> don't generate any actions to the datapath at all. You may also
> consider adding a xlate_report() call to explain why the drop is
> occuring - this should help debugging later on using ofproto/trace.

I'm not sure about this. 
Sugesh, how should this be resolved?
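
For reference, a rough sketch of what Joe describes could look like this (assuming
it is acceptable to simply truncate odp_actions back to clone_ofs; the
xlate_report() message is only an example):

    apply_nested_clone_actions(ctx, xport, out_dev);
    if (ctx->odp_actions->size > push_action_size) {
        /* Update the CLONE action only when combined. */
        nl_msg_end_nested(ctx->odp_actions, clone_ofs);
    } else {
        /* The underlay bridge dropped the packet: remove the nested clone
         * and the tnl_push inside it, so that no datapath actions are
         * generated at all. */
        ctx->odp_actions->size = clone_ofs;
        xlate_report(ctx, "underlay bridge drops the tunneled packet");
    }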

> >      if (credit_counts) {
> > +        uint64_t stats_n_bytes = 0;
> > +
> > +        if (rule->truncated_packet_size) {
> > +            stats_n_bytes = MIN(rule->truncated_packet_size, stats->n_bytes);
> > +        } else {
> > +            stats_n_bytes = stats->n_bytes;
> > +        }
> 
> Is this fixing a separate issue? I'm confused about whether truncated
> packet stats are being correctly attributed today on master. If so, I
> think this should split out into a separate patch. You might also
> consider whether it makes more sense to modify 'stats' earlier in the
> path so that each rule doesn't need to individually apply the stats
> adjustment. I could imagine a store/restore of the stats plus
> modifying for truncation during the xlate_output_trunc_action()
> processing rather than pushing this complexity all the way into the
> rule stats attribution.

Incorrect byte statistics on the underlay bridge are an unwanted side-effect of 
the original "Avoid recirculate" commit. By excluding recirculation, there is no
upcall for the tunnelled packets, and the packet size is not set by 
upcall_xlate() again:

  static void
  upcall_xlate(struct udpif *udpif, struct upcall *upcall,
               struct ofpbuf *odp_actions, struct flow_wildcards *wc)
  {
      struct dpif_flow_stats stats;
      struct xlate_in xin;

      stats.n_packets = 1;
      stats.n_bytes = dp_packet_size(upcall->packet);
      stats.used = time_msec();
      stats.tcp_flags = ntohs(upcall->flow->tcp_flags);

      xlate_in_init(&xin, upcall->ofproto,
                    ofproto_dpif_get_tables_version(upcall->ofproto),
                    upcall->flow, upcall->in_port, NULL,
                    stats.tcp_flags, upcall->packet, wc, odp_actions);

      if (upcall->type == DPIF_UC_MISS) {
          xin.resubmit_stats = &stats;

Here upcall_xlate() sets xin.resubmit_stats, which will be used for calculating
the statistics. These are const values; that's why I have not updated them earlier.

    /* If nonnull, flow translation credits the specified statistics to each
     * rule reached through a resubmit or OFPP_TABLE action.
     *
     * This is normally null so the client has to set it manually after
     * calling xlate_in_init(). */
    const struct dpif_flow_stats *resubmit_stats;

If it does no harm, then the const modifier could be removed and the stats updated 
at an earlier stage.
 
Best regards,
Zoltan

Comments

Joe Stringer May 11, 2017, 4:38 p.m. UTC | #1
On 11 May 2017 at 09:04, Zoltán Balogh <zoltan.balogh@ericsson.com> wrote:
> Hi Joe,
>
> Thank you for your comments! Please, find my answers below.
>
>> I had trouble trying to review this, in part because I felt like
>> changes to fix multiple issues are being placed into one patch. If
>> there are separate issues being addressed, each issue should be fixed
>> in a separate patch, each with a clear description of the intent of
>> the change (with links to patches that introduced the breakage if
>> possible!). If there is refactoring to do, consider separating the
>> non-functional changes into a separate patch---when looking at the
>> original patch that started this discussion, I found it difficult to
>> review post-merge because the refactoring was included and I couldn't
>> tell if there were actually functional changes.
>>
>> Given that this is taking a bit longer than I thought and we need to
>> take a holistic approach to how all of these interactions should work,
>> I plan to go ahead and apply the revert patch. I look forward to
>> seeing a fresh series proposal that will bring back the benefits of
>> the original patch.
>>
>> More feedback below.
>>
>> > Here comes the patch:
>> >
>> >
>> > Since tunneling and forwarding via patch port uses clone action, the tunneled
>>
>> Where does forwarding via patch port use clone action?
>
> Sorry, that was not correct. Patch port does not use clone action. Both patch
> port and native tunneling use apply_nested_clone_action(). That's what I
> wanted to express.

OK. I find that function name a bit confusing given it doesn't relate to clone.

>> apply_nested_clone_action() also takes copies of the flow and
>> base_flow, do we need to do it twice in this path? I suspect that this
>> work could benefit from some further refactoring, taking note to have
>> separate patches to refactor code and different patches for functional
>> changes.
>
> Thank you for indicating this. There is no need to create a backup copy of
> flow. We need to store some fields of base_flow. I could use a function like
> this to create backup:
>
> static void
> copy_flow_L2L3_data(struct flow *dst, const struct flow *src)
> {
>     dst->packet_type = src->packet_type;
>     dst->dl_dst = src->dl_dst;
>     dst->dl_src = src->dl_src;
>     dst->dl_type = src->dl_type;
>     dst->nw_dst = src->nw_dst;
>     dst->nw_src = src->nw_src;
>     dst->ipv6_dst = src->ipv6_dst;
>     dst->ipv6_src = src->ipv6_src;
>     dst->nw_tos = src->nw_tos;
>     dst->nw_ttl = src->nw_ttl;
>     dst->nw_proto = src->nw_proto;
>     dst->tp_dst = src->tp_dst;
>     dst->tp_src = src->tp_src;
> }

Ah, that's not quite what I meant. Copying the whole flow seems
generically correct as per before. The above may be more efficient,
but it also seems easier to screw up if 'struct flow' fields are
changed. I think we can start with copying the whole flow, and
consider an optimization like this separately.

My concern was that both build_tunnel_send() and
apply_nested_clone_action() were doing the store/restore of the flows.

> ...
>
> /* Backup. */
> copy_flow_L2L3_data(&old_base_flow, &ctx->base_flow);
>
> ...
>
> /* Restore. */
> copy_flow_L2L3_data(&ctx->base_flow, &old_base_flow);
>
>> >
>> >      err = tnl_route_lookup_flow(flow, &d_ip6, &s_ip6, &out_dev);
>> >      if (err) {
>> > @@ -3216,16 +3248,57 @@ build_tunnel_send(struct xlate_ctx *ctx, const struct xport *xport,
>> >      tnl_push_data.tnl_port = odp_to_u32(tunnel_odp_port);
>> >      tnl_push_data.out_port = odp_to_u32(out_dev->odp_port);
>> >
>> > +    /* After tunnel header has been added, MAC and IP data of flow and
>> > +     * base_flow need to be set properly, since there is not recirculation
>> > +     * any more when sending packet to tunnel. */
>> > +    if (!eth_addr_is_zero(ctx->xin->flow.dl_dst)) {
>> > +        ctx->xin->flow.dl_dst = dmac;
>> > +        ctx->base_flow.dl_dst = ctx->xin->flow.dl_dst;
>> > +    }
>>
>> What's the expectation if ctx->xin->flow.dl_dst is zero? Is that for
>> L3 tunneled packets?
>
> No, it's not for L3 tunneled packets. Without this check, the tunnel_push_pop and
> tunnel_push_pop_ipv6 unit tests fail. This needs to be fixed.
>
> 783: tunnel_push_pop - action                        FAILED (tunnel-push-pop.at:109)
> 785: tunnel_push_pop_ipv6 - action                   FAILED (tunnel-push-pop-ipv6.at:92)

OK, I still don't quite understand how we get into this state where
dl_dst is zero in the first place. If it's unexpected, let's fix the
bug that causes this state. If it's expected, maybe we can add a short
comment above the if statement which describes why the dl_dst may be
zero and why it makes sense to leave the flow's mac alone for
tunneling.

> ./tunnel-push-pop.at:108: ovs-appctl ofproto/trace ovs-dummy 'in_port(2),eth_type(0x0800),ipv4(src=1.1.3.88,dst=1.1.
> 3.112,proto=47,tos=0,ttl=64,frag=no)'
> stdout:
> Flow: ip,in_port=LOCAL,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=1.1.3.88,nw_dst=1.1.
> 3.112,nw_proto=47,nw_tos=0,nw_ecn=0,nw_ttl=64
>
> bridge("int-br")
> ----------------
>  0. priority 32768
>     output:2
>      -> output to native tunnel
>      -> tunneling to 1.1.2.92 via br0
>      -> tunneling from aa:55:aa:55:00:00 1.1.2.88 to f8:bc:12:44:34:b6 1.1.2.92
>
> bridge("br0")
> -------------
>  0. priority 32768
>     NORMAL
>      -> learned port is input port, dropping
>
> Final flow: unchanged
> Megaflow: recirc_id=0,ip,in_port=LOCAL,vlan_tci=0x0000/0x1fff,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_e
> cn=0,nw_frag=no
> Datapath actions: clone(drop),tnl_push(tnl_port(4789),header(size=50,type=4,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:5
> 5:00:00,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=17,tos=0,ttl=64,frag=0x4000),udp(src=0,dst=4789,csum=0x
> 0),vxlan(flags=0x8000000,vni=0x7b)),out_port(100))
> ./tunnel-push-pop.at:109: tail -1 stdout
> --- -   2017-05-11 14:40:42.205776583 +0200
> +++ /home/ezolbal/work/general_L3_tunneling/ovs/tests/testsuite.dir/at-groups/783/stdout        2017-05-11 14:40:42.200387095 +0200
> @@ -1,2 +1,2 @@
> -Datapath actions: clone(tnl_push(tnl_port(4789),header(size=50,type=4,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=17,tos=0,ttl=64,frag=0x4000),udp(src=0,dst=4789,csum=0x0),vxlan(flags=0x8000000,vni=0x7b)),out_port(100)),1)
> +Datapath actions: clone(drop),tnl_push(tnl_port(4789),header(size=50,type=4,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=17,tos=0,ttl=64,frag=0x4000),udp(src=0,dst=4789,csum=0x0),vxlan(flags=0x8000000,vni=0x7b)),out_port(100))
>
>
>> > +
>> > +    ctx->xin->flow.dl_src = smac;
>> > +    ctx->base_flow.dl_src = ctx->xin->flow.dl_src;
>> > +
>> > +    propagate_tunnel_data_to_flow(&ctx->xin->flow, &ctx->base_flow);
>> > +
>> > +    if (!tnl_params.is_ipv6) {
>> > +        ctx->xin->flow.dl_type = htons(ETH_TYPE_IP);
>> > +    } else {
>> > +        ctx->xin->flow.dl_type = htons(ETH_TYPE_IPV6);
>> > +    }
>>
>> When writing if (!condition) { ... } else { ... }, please consider
>> whether you can swap the conditions and remove the negation. It's less
>> cognitive load if we don't have to negate a negative condition to
>> consider which condition the "else" applies to. (This otherwise looks
>> correct, so just a style request.)
>
> Ok. I'm going to change this.

Thanks.

>> > +
>> > +    if (tnl_push_data.tnl_type == OVS_VPORT_TYPE_GRE) {
>> > +        ctx->xin->flow.nw_proto = IPPROTO_GRE;
>> > +    } else {
>> > +        ctx->xin->flow.nw_proto = IPPROTO_UDP;
>> > +    }
>> > +    ctx->base_flow.nw_proto = ctx->xin->flow.nw_proto;
>>
>> Should we use a switch () statement here to ensure that newer tunnel
>> ports are properly handled in this case if/when they are added? What
>> about things like LISP or STT?
>
> I can use a switch case:
>
>     switch (tnl_push_data.tnl_type) {
>     case OVS_VPORT_TYPE_GRE:
>         ctx->xin->flow.nw_proto = IPPROTO_GRE;
>         break;
>     case OVS_VPORT_TYPE_VXLAN:
>     case OVS_VPORT_TYPE_GENEVE:
>         ctx->xin->flow.nw_proto = IPPROTO_UDP;
>         break;
>     case OVS_VPORT_TYPE_LISP:
>          /* XXX: UDP? */
>     case OVS_VPORT_TYPE_STT:
>          /* XXX: TCP? */
>     default:
>         break;
>     }
>
> But, how to handle LISP and STT? I have not seen any build_header() function
> for LISP and STT in netdev-vport.c:
>
>     static const struct vport_class vport_classes[] = {
>         TUNNEL_CLASS("geneve", "genev_sys", netdev_geneve_build_header,
>                                             netdev_tnl_push_udp_header,
>                                             netdev_geneve_pop_header),
>         TUNNEL_CLASS("gre", "gre_sys", netdev_gre_build_header,
>                                        netdev_gre_push_header,
>                                        netdev_gre_pop_header),
>         TUNNEL_CLASS("vxlan", "vxlan_sys", netdev_vxlan_build_header,
>                                            netdev_tnl_push_udp_header,
>                                            netdev_vxlan_pop_header),
>         TUNNEL_CLASS("lisp", "lisp_sys", NULL, NULL, NULL),
>         TUNNEL_CLASS("stt", "stt_sys", NULL, NULL, NULL),
>     };
>
> Are LISP and STT valid tunnel types for native tunneling? If I'm not wrong,
> build_tunnel_send() is invoked when native tunneling is used.

Ah! Sugesh pointed this out to me yesterday. I thought there was STT,
but you're right.

For LISP/STT it can just call OVS_NOT_REACHED(). Then it can fail in a
pretty obvious way if someone attempts to use those tunnel types in
future.
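
So the switch could presumably end up looking something like this (sketch only):

    switch (tnl_push_data.tnl_type) {
    case OVS_VPORT_TYPE_GRE:
        ctx->xin->flow.nw_proto = IPPROTO_GRE;
        break;
    case OVS_VPORT_TYPE_VXLAN:
    case OVS_VPORT_TYPE_GENEVE:
        ctx->xin->flow.nw_proto = IPPROTO_UDP;
        break;
    case OVS_VPORT_TYPE_LISP:
    case OVS_VPORT_TYPE_STT:
    default:
        OVS_NOT_REACHED();
    }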

>> > +
>> >      size_t push_action_size = 0;
>> >      size_t clone_ofs = nl_msg_start_nested(ctx->odp_actions,
>> >                                             OVS_ACTION_ATTR_CLONE);
>> >      odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
>> >      push_action_size = ctx->odp_actions->size;
>> > +
>> > +    if (ctx->rule) {
>> > +        old_ths = ctx->rule->tunnel_header_size;
>> > +        ctx->rule->tunnel_header_size = tnl_push_data.header_len;
>> > +    }
>> > +
>> >      apply_nested_clone_actions(ctx, xport, out_dev);
>> >      if (ctx->odp_actions->size > push_action_size) {
>> >          /* Update the CLONE action only when combined */
>> >          nl_msg_end_nested(ctx->odp_actions, clone_ofs);
>> >      }
>>
>> If there are no actions on the underlay bridge (ie drop), the else
>> condition here should probably reset the nl_msg nesting so that we
>> don't generate any actions to the datapath at all. You may also
>> consider adding a xlate_report() call to explain why the drop is
>> occuring - this should help debugging later on using ofproto/trace.
>
> I'm not sure about this.
> Sugesh, how should this be resolved?
>
>> >      if (credit_counts) {
>> > +        uint64_t stats_n_bytes = 0;
>> > +
>> > +        if (rule->truncated_packet_size) {
>> > +            stats_n_bytes = MIN(rule->truncated_packet_size, stats->n_bytes);
>> > +        } else {
>> > +            stats_n_bytes = stats->n_bytes;
>> > +        }
>>
>> Is this fixing a separate issue? I'm confused about whether truncated
>> packet stats are being correctly attributed today on master. If so, I
>> think this should split out into a separate patch. You might also
>> consider whether it makes more sense to modify 'stats' earlier in the
>> path so that each rule doesn't need to individually apply the stats
>> adjustment. I could imagine a store/restore of the stats plus
>> modifying for truncation during the xlate_output_trunc_action()
>> processing rather than pushing this complexity all the way into the
>> rule stats attribution.
>
> Incorrect bytes statistics on the underlay bridge is an unwanted side-effect of
> the original "Avoid recirculate" commit. By exluding recirculation, there is no
> upcall for the tunnelled packets, and the packet size is not set by
> upcall_xlate() again:
>
>   static void
>   upcall_xlate(struct udpif *udpif, struct upcall *upcall,
>                struct ofpbuf *odp_actions, struct flow_wildcards *wc)
>   {
>       struct dpif_flow_stats stats;
>       struct xlate_in xin;
>
>       stats.n_packets = 1;
>       stats.n_bytes = dp_packet_size(upcall->packet);
>       stats.used = time_msec();
>       stats.tcp_flags = ntohs(upcall->flow->tcp_flags);
>
>       xlate_in_init(&xin, upcall->ofproto,
>                     ofproto_dpif_get_tables_version(upcall->ofproto),
>                     upcall->flow, upcall->in_port, NULL,
>                     stats.tcp_flags, upcall->packet, wc, odp_actions);
>
>       if (upcall->type == DPIF_UC_MISS) {
>           xin.resubmit_stats = &stats;
>
> Here upcall_xlate() sets xin.resubmit_stats which will be used for calculating
> statistic. These are const values, that's why I have not updated them earlier.
>
>     /* If nonnull, flow translation credits the specified statistics to each
>      * rule reached through a resubmit or OFPP_TABLE action.
>      *
>      * This is normally null so the client has to set it manually after
>      * calling xlate_in_init(). */
>     const struct dpif_flow_stats *resubmit_stats;
>
> If it does not harm, then the const modifier could be removed and stats updated
> at earlier stage.

I see, thanks for the explanation. Mostly I'm just trying to push the
question of "how can we make this code easier to read and
understand?". Const is nice because it tells you these fields won't
change. But I think it's probably a little easier if the truncation of
the statistics is performed where the truncation occurs, so that the
core stats attribution logic can be oblivious of this particular
feature.

Cheers,
Joe
Zoltan Balogh May 17, 2017, 11:37 a.m. UTC | #2
Hi Joe,

I started to rework my patch based on your comments and suggestions. I had some 
difficulties with the last one. Let us focus on this below.

> >> >      if (credit_counts) {
> >> > +        uint64_t stats_n_bytes = 0;
> >> > +
> >> > +        if (rule->truncated_packet_size) {
> >> > +            stats_n_bytes = MIN(rule->truncated_packet_size, stats->n_bytes);
> >> > +        } else {
> >> > +            stats_n_bytes = stats->n_bytes;
> >> > +        }
> >>
> >> Is this fixing a separate issue? I'm confused about whether truncated
> >> packet stats are being correctly attributed today on master. If so, I
> >> think this should split out into a separate patch. You might also
> >> consider whether it makes more sense to modify 'stats' earlier in the
> >> path so that each rule doesn't need to individually apply the stats
> >> adjustment. I could imagine a store/restore of the stats plus
> >> modifying for truncation during the xlate_output_trunc_action()
> >> processing rather than pushing this complexity all the way into the
> >> rule stats attribution.
>
> >
> > Incorrect bytes statistics on the underlay bridge is an unwanted side-effect of
> > the original "Avoid recirculate" commit. By exluding recirculation, there is no
> > upcall for the tunnelled packets, and the packet size is not set by
> > upcall_xlate() again:
> >
> >   static void
> >   upcall_xlate(struct udpif *udpif, struct upcall *upcall,
> >                struct ofpbuf *odp_actions, struct flow_wildcards *wc)
> >   {
> >       struct dpif_flow_stats stats;
> >       struct xlate_in xin;
> >
> >       stats.n_packets = 1;
> >       stats.n_bytes = dp_packet_size(upcall->packet);
> >       stats.used = time_msec();
> >       stats.tcp_flags = ntohs(upcall->flow->tcp_flags);
> >
> >       xlate_in_init(&xin, upcall->ofproto,
> >                     ofproto_dpif_get_tables_version(upcall->ofproto),
> >                     upcall->flow, upcall->in_port, NULL,
> >                     stats.tcp_flags, upcall->packet, wc, odp_actions);
> >
> >       if (upcall->type == DPIF_UC_MISS) {
> >           xin.resubmit_stats = &stats;
> >
> > Here upcall_xlate() sets xin.resubmint_stats which will be used for calculating
> > statistic. These are const values, that's why I have not updated them earlier.
> >
> >     /* If nonnull, flow translation credits the specified statistics to each
> >      * rule reached through a resubmit or OFPP_TABLE action.
> >      *
> >      * This is normally null so the client has to set it manually after
> >      * calling xlate_in_init(). */
> >     const struct dpif_flow_stats *resubmit_stats;
> >
> > If it does not harm, then the const modifier could be removed and stats updated
> > at earlier stage.
> 
> I see, thanks for the explanation. Mostly I'm just trying to push the
> question of "how can we make this code easier to read and
> understand?". Const is nice because it tells you these fields won't
> change. But I think it's probably a little easier if the truncation of
> the statistics is performed where the truncation occurs, so that the
> core stats attribution logic can be oblivious of this particular
> feature.

I removed the const qualifier from resubmit_stats, removed the truncation from 
rule_dpif_credit_stats__(), and applied it to resubmit_stats during translation.
That works fine for a packet translated due to an upcall. At the end we get a 
single datapath flow in the netdev datapath. It should look like this:

netdev@ovs-netdev: hit:4 missed:10
        br-int:
                br-int 65534/1: (tap)
                br-int-ns1 1/3: (system)
                gre0 2/5: (gre: remote_ip=10.0.0.20)
        br-p:
                br-p 65534/2: (tap)
                br-p-ns2 3/4: (system)

flow-dump from non-dpdk interfaces:
recirc_id(0),in_port(3),eth_type(0x0800),ipv4(tos=0/0x3,ttl=64,frag=no), packets:0, bytes:0, used:never, actions:trunc(200),clone(tnl_push(tnl_port(5)
,header(size=38,type=3,eth(dst=cc:cc:cc:dd:dd:de,src=02:3e:b1:dc:0e:42,dl_type=0x0800),ipv4(src=10.0.0.10,dst=10.0.0.20,proto=47,tos=0,ttl=64,frag=0x4
000),gre((flags=0x0,proto=0x6558))),out_port(2)),set(ipv4(tos=0/0x3,ttl=63)),4)


At this point I run into a problem when a new packet arrives but there is no need 
for translation, since we already have the datapath flow in the netdev datapath. 
In this case the byte and packet statistics are calculated by the 
rule_dpif_credit_stats__() function as well. 

static void
rule_dpif_credit_stats__(struct rule_dpif *rule,
                         const struct dpif_flow_stats *stats,
                         bool credit_counts)
    OVS_REQUIRES(rule->stats_mutex)
{
    if (credit_counts) {
        rule->stats.n_packets += stats->n_packets;
        rule->stats.n_bytes += stats->n_bytes;
    }
    rule->stats.used = MAX(rule->stats.used, stats->used);
}

However, in this case the 'stats' argument does not come from resubmit_stats. 
It is derived from the dpif_flow stats and the ukey stats, resulting in the 
original packet size. 
I guess this is due to the fact that we have a single datapath flow, which cannot
reflect both the original and the altered packet size. Someone could 
confirm this, to make sure I'm not wrong.

These functions are invoked:

static void
revalidate(struct revalidator *revalidator)
{
...
        const struct dpif_flow *f;
...
        n_dumped = dpif_flow_dump_next(dump_thread, flows, ARRAY_SIZE(flows));
...
         for (f = flows; f < &flows[n_dumped]; f++, index++) {
            long long int used = f->stats.used;
...
                result = revalidate_ukey(udpif, ukey, &f->stats, &odp_actions,
                                         reval_seq, &recircs);
...
}

static enum reval_result
revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
                const struct dpif_flow_stats *stats,
                struct ofpbuf *odp_actions, uint64_t reval_seq,
                struct recirc_refs *recircs)
    OVS_REQUIRES(ukey->mutex)
{
...
    push.n_bytes = (stats->n_bytes > ukey->stats.n_bytes
                    ? stats->n_bytes - ukey->stats.n_bytes
                    : 0);
...
        xlate_push_stats(ukey->xcache, &push);
...
}

void
xlate_push_stats(struct xlate_cache *xcache,
                 const struct dpif_flow_stats *stats)
{
...
        xlate_push_stats_entry(entry, stats);
...
}

void
xlate_push_stats_entry(struct xc_entry *entry,
                       const struct dpif_flow_stats *stats)
{
...
        rule_dpif_credit_stats(entry->rule, stats);
...
}

void
rule_dpif_credit_stats(struct rule_dpif *rule,
                       const struct dpif_flow_stats *stats)
{
...
        rule_dpif_credit_stats__(rule, stats, true);
...
}

So, I'm not sure the stats correction can be solved at an earlier stage than in 
rule_dpif_credit_stats__(). I would like to get some help or verification 
from experts. Comments are welcome.

Best regards,
Zoltan
Joe Stringer May 17, 2017, 6:24 p.m. UTC | #3
On 17 May 2017 at 04:37, Zoltán Balogh <zoltan.balogh@ericsson.com> wrote:
> Hi Joe
>
> I started to rework my patch based on your comments and suggestions. I had some
> difficulties with the last one. Let us focus on this below.
>
>> >> >      if (credit_counts) {
>> >> > +        uint64_t stats_n_bytes = 0;
>> >> > +
>> >> > +        if (rule->truncated_packet_size) {
>> >> > +            stats_n_bytes = MIN(rule->truncated_packet_size, stats->n_bytes);
>> >> > +        } else {
>> >> > +            stats_n_bytes = stats->n_bytes;
>> >> > +        }
>> >>
>> >> Is this fixing a separate issue? I'm confused about whether truncated
>> >> packet stats are being correctly attributed today on master. If so, I
>> >> think this should split out into a separate patch. You might also
>> >> consider whether it makes more sense to modify 'stats' earlier in the
>> >> path so that each rule doesn't need to individually apply the stats
>> >> adjustment. I could imagine a store/restore of the stats plus
>> >> modifying for truncation during the xlate_output_trunc_action()
>> >> processing rather than pushing this complexity all the way into the
>> >> rule stats attribution.
>>
>> >
>> > Incorrect bytes statistics on the underlay bridge is an unwanted side-effect of
>> > the original "Avoid recirculate" commit. By exluding recirculation, there is no
>> > upcall for the tunnelled packets, and the packet size is not set by
>> > upcall_xlate() again:
>> >
>> >   static void
>> >   upcall_xlate(struct udpif *udpif, struct upcall *upcall,
>> >                struct ofpbuf *odp_actions, struct flow_wildcards *wc)
>> >   {
>> >       struct dpif_flow_stats stats;
>> >       struct xlate_in xin;
>> >
>> >       stats.n_packets = 1;
>> >       stats.n_bytes = dp_packet_size(upcall->packet);
>> >       stats.used = time_msec();
>> >       stats.tcp_flags = ntohs(upcall->flow->tcp_flags);
>> >
>> >       xlate_in_init(&xin, upcall->ofproto,
>> >                     ofproto_dpif_get_tables_version(upcall->ofproto),
>> >                     upcall->flow, upcall->in_port, NULL,
>> >                     stats.tcp_flags, upcall->packet, wc, odp_actions);
>> >
>> >       if (upcall->type == DPIF_UC_MISS) {
>> >           xin.resubmit_stats = &stats;
>> >
>> > Here upcall_xlate() sets xin.resubmint_stats which will be used for calculating
>> > statistic. These are const values, that's why I have not updated them earlier.
>> >
>> >     /* If nonnull, flow translation credits the specified statistics to each
>> >      * rule reached through a resubmit or OFPP_TABLE action.
>> >      *
>> >      * This is normally null so the client has to set it manually after
>> >      * calling xlate_in_init(). */
>> >     const struct dpif_flow_stats *resubmit_stats;
>> >
>> > If it does not harm, then the const modifier could be removed and stats updated
>> > at earlier stage.
>>
>> I see, thanks for the explanation. Mostly I'm just trying to push the
>> question of "how can we make this code easier to read and
>> understand?". Const is nice because it tells you these fields won't
>> change. But I think it's probably a little easier if the truncation of
>> the statistics is performed where the truncation occurs, so that the
>> core stats attribution logic can be oblivious of this particular
>> feature.
>
> I removed the const qualifier from resubmit_stats, removed the truncation from
> rule_dpif_credit_stats__() and applied it to resubmit_stats during translation.
> That works fine for a packet translated due to an upcall. At the end we get a
> single datapath flow in the netdev datapath. It should look like this:
>
> netdev@ovs-netdev: hit:4 missed:10
>         br-int:
>                 br-int 65534/1: (tap)
>                 br-int-ns1 1/3: (system)
>                 gre0 2/5: (gre: remote_ip=10.0.0.20)
>         br-p:
>                 br-p 65534/2: (tap)
>                 br-p-ns2 3/4: (system)
>
> flow-dump from non-dpdk interfaces:
> recirc_id(0),in_port(3),eth_type(0x0800),ipv4(tos=0/0x3,ttl=64,frag=no), packets:0, bytes:0, used:never, actions:trunc(200),clone(tnl_push(tnl_port(5)
> ,header(size=38,type=3,eth(dst=cc:cc:cc:dd:dd:de,src=02:3e:b1:dc:0e:42,dl_type=0x0800),ipv4(src=10.0.0.10,dst=10.0.0.20,proto=47,tos=0,ttl=64,frag=0x4
> 000),gre((flags=0x0,proto=0x6558))),out_port(2)),set(ipv4(tos=0/0x3,ttl=63)),4)
>
>
> At this point I get a problem, when a new packet arrives but there is no need
> for translation since we have the datapath flow in the netdev datapath.
> In this case the byte and packet statistics are calculated by the
> rule_dpif_credit_stats__() function as well.
>
> static void
> rule_dpif_credit_stats__(struct rule_dpif *rule,
>                          const struct dpif_flow_stats *stats,
>                          bool credit_counts)
>     OVS_REQUIRES(rule->stats_mutex)
> {
>     if (credit_counts) {
>         rule->stats.n_packets += stats->n_packets;
>         rule->stats.n_bytes += stats->n_bytes;
>     }
>     rule->stats.used = MAX(rule->stats.used, stats->used);
> }
>
> However, in this case the 'stats' argument does not come from resubmit_stats.
> It is derived from dpif_flow_stats and udpif_flow stats resulting in the
> original packet size.
> I guess, this is due to fact, that we have a single datapath flow which can't
> reflect the size of the original and the altered packet size. Someone could
> confirm this, to make sure I'm not wrong.

upcall_xlate() generates the 'dpif_flow_stats' from the packet in the
upcall path. It then populates xin.resubmit_stats and calls
xlate_actions().

For the revalidator path, it is similar, but the stats are populated from xlate_key().

What I imagined happening during translation is that when the flow
with the truncate action is handled, the 'resubmit_stats' would be
modified for subsequent translation. That is to say, in
xlate_output_trunc_action() where it calls out to the
xlate_output_action() common code, there would be store/restore of the
resubmit_stats.
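
Roughly, as a sketch only (reusing the names from ofproto-dpif-xlate.c; the
per-packet cap below is where the accuracy question further down comes in):

    /* In xlate_output_trunc_action(), around the common output translation:
     * temporarily swap in a capped copy of the resubmit stats. */
    const struct dpif_flow_stats *saved = ctx->xin->resubmit_stats;
    struct dpif_flow_stats truncated;

    if (saved) {
        truncated = *saved;
        truncated.n_bytes = MIN(truncated.n_bytes,
                                truncated.n_packets * max_len);
        ctx->xin->resubmit_stats = &truncated;
    }
    xlate_output_action(ctx, port, max_len, false);
    ctx->xin->resubmit_stats = saved;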

Backing up a bit for context, the stats attribution goes roughly like this:
* First upcall, handler thread calls through the translate code with a
packet. The resubmit_stats are derived from that packet. This goes
through xlate_actions().
* First dump of flow from revalidator thread fetches the flow and runs
the same xlate_actions() with whatever stats it has (may be zero).
This time, whenever stats attribution or side effects occur, an
xlate_cache entry is generated.
* Second and subsequent dumps of flows fetches the flow and shortcuts
the xlate_actions() by using the xlate_cache instead - ie a call to
xlate_push_stats().

So, in the same place where the resubmit_stats is manipulated, you
would also need to generate a new XC entry which would manipulate the
stats - this would be a 'side-effect'. I'd imagine that prior to the
full output translation there would be a XC_TRUNCATE(truncated_size)
then afterwards there would be an XC_TRUNCATE_RESET(). Or it could be
just XC_SET_SIZE(...) where 0 is reset and non-zero is a truncate
size. In the implementation/execution in xlate_push_stats() when
performing XC_TRUNCATE you would need to store the original push_stats
size somewhere, then calculate a new 'n_bytes' based on the number of
packets and existing bytes*. For XC_TRUNCATE_RESET(), it would restore
the original push_stats size.
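
A rough sketch of what such a cache entry might look like (the names here are
hypothetical, just mirroring the existing xc_entry pattern):

    /* Hypothetical new entry type alongside XC_RULE, XC_NETDEV, etc. */
    struct xc_entry {
        enum xc_type type;              /* ..., XC_TRUNCATE, ... */
        union {
            struct rule_dpif *rule;
            struct {
                uint32_t max_len;       /* 0 would mean "reset to full size". */
            } truncate;
            /* ... existing entry types ... */
        } u;
    };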

* Hmm, I'm not sure the calculation will be 100% here. Let's say there
were 3 packets hit the flow, 50B, 200B, 300B. If output(max_len=100)
was executed, then we don't know how many of the packets were
truncated. The maximum number of bytes that could be transmitted is
300, but the actual number was 250. We could divide the n_bytes by
n_packets, subtract the max_len and then multiply back up by the
number of packets, which works for this case assuming floating point
arithmetic but is slightly off if using integer math.

> These functions are invoked:
>
> static void
> revalidate(struct revalidator *revalidator)
> {
> ...
>         const struct dpif_flow *f;
> ...
>         n_dumped = dpif_flow_dump_next(dump_thread, flows, ARRAY_SIZE(flows));
> ...
>          for (f = flows; f < &flows[n_dumped]; f++, index++) {
>             long long int used = f->stats.used;
> ...
>                 result = revalidate_ukey(udpif, ukey, &f->stats, &odp_actions,
>                                          reval_seq, &recircs);
> ...
> }
>
> static enum reval_result
> revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
>                 const struct dpif_flow_stats *stats,
>                 struct ofpbuf *odp_actions, uint64_t reval_seq,
>                 struct recirc_refs *recircs)
>     OVS_REQUIRES(ukey->mutex)
> {
> ...
>     push.n_bytes = (stats->n_bytes > ukey->stats.n_bytes
>                     ? stats->n_bytes - ukey->stats.n_bytes
>                     : 0);
> ...
>         xlate_push_stats(ukey->xcache, &push);
> ...
> }
>
> void
> xlate_push_stats(struct xlate_cache *xcache,
>                  const struct dpif_flow_stats *stats)
> {
> ...
>         xlate_push_stats_entry(entry, stats);
> ...
> }
>
> void
> xlate_push_stats_entry(struct xc_entry *entry,
>                        const struct dpif_flow_stats *stats)
> {
> ...
>         rule_dpif_credit_stats(entry->rule, stats);
> ...
> }
>
> void
> rule_dpif_credit_stats(struct rule_dpif *rule,
>                        const struct dpif_flow_stats *stats)
> {
> ...
>         rule_dpif_credit_stats__(rule, stats, true);
> ...
> }
>
> So, I'm not sure if stats correction can be solved at an earlier stage but in
> rule_dpif_credit_stats__(). I would like to get some help or verification
> from experts. Comments are welcome.
Zoltan Balogh May 26, 2017, 2 p.m. UTC | #4
Hi Joe,

> Backing up a bit for context, the stats attribution goes roughly like this:
> * First upcall, handler thread calls through the translate code with a
> packet. The resubmit_stats are derived from that packet. This goes
> through xlate_actions().
> * First dump of flow from revalidator thread fetches the flow and runs
> the same xlate_actions() with whatever stats it has (may be zero).
> This time, whenever stats attribution or side effects occur, an
> xlate_cache entry is generated.
> * Second and subsequent dumps of flows fetches the flow and shortcuts
> the xlate_actions() by using the xlate_cache instead - ie a call to
> xlate_push_stats().
> 
> So, in the same place where the resubmit_stats is manipulated, you
> would also need to generate a new XC entry which would manipulate the
> stats - this would be a 'side-effect'. I'd imagine that prior to the
> full output translation there would be a XC_TRUNCATE(truncated_size)
> then afterwards there would be an XC_TRUNCATE_RESET(). Or it could be
> just XC_SET_SIZE(...) where 0 is reset and non-zero is a truncate
> size. In the implementation/execution in xlate_push_stats() when
> performing XC_TRUNCATE you would need to store the original push_stats
> size somewhere, then calculate a new 'n_bytes' based on the number of
> packets and existing bytes*. For XC_TRUNCATE_RESET(), it would restore
> the original push_stats size.

 Thank you for the explanation.
 
> * Hmm, I'm not sure the calculation will be 100% here. Let's say there
> were 3 packets hit the flow, 50B, 200B, 300B. If output(max_len=100)
> was executed, then we don't know how many of the packets were
> truncated. The maximum number of bytes that could be transmitted is
> 300, but the actual number was 250. We could divide the n_bytes by
> n_packets, subtract the max_len and then multiply back up by the
> number of packets, which works for this case assuming floating point
> arithmetic but is slightly off if using integer math..

I don't think that would be the proper way of calculating n_bytes. Let's 
say we have 3 packets of 50B, 200B, 200B and max_len=100. The output 
should be 50 + 100 + 100 = 250B.
Following the instructions above we would get 
[(50 + 200 + 200) / 3 - 100] * 3 = [450 / 3 - 100] * 3 = 50 * 3 = 150B.

Any other idea how to calculate the truncated size with the xlate cache? 
Or maybe I did not understand your calculation.
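
Just to make the discrepancy concrete, a tiny standalone sketch with the
hypothetical packet sizes above:

#include <stdio.h>

int main(void)
{
    const int sizes[] = {50, 200, 200};       /* example packet sizes */
    const int n_packets = 3, max_len = 100;
    int n_bytes = 0, actual = 0;

    for (int i = 0; i < n_packets; i++) {
        n_bytes += sizes[i];
        actual += sizes[i] < max_len ? sizes[i] : max_len;
    }

    /* "Divide n_bytes by n_packets, subtract max_len, multiply back up". */
    int estimate = (n_bytes / n_packets - max_len) * n_packets;

    printf("actual bytes after truncation: %d\n", actual);    /* 250 */
    printf("estimated bytes:               %d\n", estimate);  /* 150 */
    return 0;
}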

There is one more thing to be taken into consideration. By adding a tunnel 
header, the size of packets increases as well. But that's a constant value
for each packet, so it is easier to account for.

Best regards,
Zoltan
Chandran, Sugesh May 26, 2017, 5:42 p.m. UTC | #5
Regards
_Sugesh


> -----Original Message-----
> From: Zoltán Balogh [mailto:zoltan.balogh@ericsson.com]
> Sent: Friday, May 26, 2017 3:01 PM
> To: Joe Stringer <joe@ovn.org>
> Cc: Chandran, Sugesh <sugesh.chandran@intel.com>; Andy Zhou
> <azhou@ovn.org>; William Tu <u9012063@gmail.com>;
> gvrose8192@gmail.com; ovs dev <dev@openvswitch.org>; Ben Pfaff
> <blp@ovn.org>; Jan Scheurich <jan.scheurich@ericsson.com>; Gray, Mark D
> <mark.d.gray@intel.com>
> Subject: RE: [ovs-dev] [PATCH v5] tunneling: Avoid recirculation on datapath
> by computing the recirculate actions at translate time.
>
> Hi Joe,
>
> > Backing up a bit for context, the stats attribution goes roughly like this:
> > * First upcall, handler thread calls through the translate code with a
> > packet. The resubmit_stats are derived from that packet. This goes
> > through xlate_actions().
> > * First dump of flow from revalidator thread fetches the flow and runs
> > the same xlate_actions() with whatever stats it has (may be zero).
> > This time, whenever stats attribution or side effects occur, an
> > xlate_cache entry is generated.
> > * Second and subsequent dumps of flows fetches the flow and shortcuts
> > the xlate_actions() by using the xlate_cache instead - ie a call to
> > xlate_push_stats().
> >
> > So, in the same place where the resubmit_stats is manipulated, you
> > would also need to generate a new XC entry which would manipulate the
> > stats - this would be a 'side-effect'. I'd imagine that prior to the
> > full output translation there would be a XC_TRUNCATE(truncated_size)
> > then afterwards there would be an XC_TRUNCATE_RESET(). Or it could be
> > just XC_SET_SIZE(...) where 0 is reset and non-zero is a truncate
> > size. In the implementation/execution in xlate_push_stats() when
> > performing XC_TRUNCATE you would need to store the original push_stats
> > size somewhere, then calculate a new 'n_bytes' based on the number of
> > packets and existing bytes*. For XC_TRUNCATE_RESET(), it would restore
> > the original push_stats size.
>
> Thank you for the explanation.
>
> > * Hmm, I'm not sure the calculation will be 100% here. Let's say there
> > were 3 packets hit the flow, 50B, 200B, 300B. If output(max_len=100)
> > was executed, then we don't know how many of the packets were
> > truncated. The maximum number of bytes that could be transmitted is
> > 300, but the actual number was 250. We could divide the n_bytes by
> > n_packets, subtract the max_len and then multiply back up by the
> > number of packets, which works for this case assuming floating point
> > arithmetic but is slightly off if using integer math.
>
> I don't think that would be the proper way of calculating n_bytes. Let's say
> we have 3 packets of 50B, 200B, 200B and max_len=100. The output should
> be 50 + 100 + 100 = 250B.
> Following the instructions above we would get
> [(50 + 200 + 200) / 3 - 100] * 3 = [450 / 3 - 100] * 3 = 50 * 3 = 150B.
>
> Any other idea how to calculate the truncated size with the xlate cache?
> Or maybe I did not understand your calculation.
[Sugesh] Since we have this issue with the trunc action, how about limiting
the combined action to only those tunnels that don't have any post-trunc
action? If there is a trunc action, create two separate rules as is done now.
Is there any other action that would be considered an exception like this?
>
> There is one more thing to be taken into consideration. By adding a tunnel
> header, the size of packets increases as well. But that's a constant value for
> each packet, easier to calculate with it.
>
> Best regards,
> Zoltan
Joe Stringer May 26, 2017, 5:51 p.m. UTC | #6
On 26 May 2017 at 07:00, Zoltán Balogh <zoltan.balogh@ericsson.com> wrote:
>
> Hi Joe,
>
>> Backing up a bit for context, the stats attribution goes roughly like this:
>> * First upcall, handler thread calls through the translate code with a
>> packet. The resubmit_stats are derived from that packet. This goes
>> through xlate_actions().
>> * First dump of flow from revalidator thread fetches the flow and runs
>> the same xlate_actions() with whatever stats it has (may be zero).
>> This time, whenever stats attribution or side effects occur, an
>> xlate_cache entry is generated.
>> * Second and subsequent dumps of flows fetches the flow and shortcuts
>> the xlate_actions() by using the xlate_cache instead - ie a call to
>> xlate_push_stats().
>>
>> So, in the same place where the resubmit_stats is manipulated, you
>> would also need to generate a new XC entry which would manipulate the
>> stats - this would be a 'side-effect'. I'd imagine that prior to the
>> full output translation there would be a XC_TRUNCATE(truncated_size)
>> then afterwards there would be an XC_TRUNCATE_RESET(). Or it could be
>> just XC_SET_SIZE(...) where 0 is reset and non-zero is a truncate
>> size. In the implementation/execution in xlate_push_stats() when
>> performing XC_TRUNCATE you would need to store the original push_stats
>> size somewhere, then calculate a new 'n_bytes' based on the number of
>> packets and existing bytes*. For XC_TRUNCATE_RESET(), it would restore
>> the original push_stats size.
>
>  Thank you for the explanation.
>
>> * Hmm, I'm not sure the calculation will be 100% here. Let's say there
>> were 3 packets hit the flow, 50B, 200B, 300B. If output(max_len=100)
>> was executed, then we don't know how many of the packets were
>> truncated. The maximum number of bytes that could be transmitted is
>> 300, but the actual number was 250. We could divide the n_bytes by
>> n_packets, subtract the max_len and then multiply back up by the
>> number of packets, which works for this case assuming floating point
>> arithmetic but is slightly off if using integer math..
>
> I don't think, that would be the proper way of calculating n_bytes. Let's
> say we have 3 packets with 50B, 200B, 200B and max_len=100. The output
> should be 50 + 100 + 100 = 250B.
> Following the instructions above we will get
> [(50 + 200 + 200) / 3 - 100 ] * 3 = [450 / 3 - 100 ] * 3 = 50 * 3 = 150B
>
> Any other idea how to calculate the truncated size with xlate cache?
> Or maybe I did not understand your calculation.

Nope, you're absolutely right. It occurred to me a little later that
for more complex statistic sets this doesn't work. I suspect you
actually need datapath support for tracking these stats separately.

> There is one more thing to be taken into consideration. By adding a tunnel
> header, the size of packets increases as well. But that's a constant value
> for each packet, easier to calculate with it.

Good point.
Joe Stringer May 26, 2017, 6:01 p.m. UTC | #7
On 26 May 2017 at 10:42, Chandran, Sugesh <sugesh.chandran@intel.com> wrote:
>
>
> Regards
> _Sugesh
>
>
>> -----Original Message-----
>> From: Zoltán Balogh [mailto:zoltan.balogh@ericsson.com]
>> Sent: Friday, May 26, 2017 3:01 PM
>> To: Joe Stringer <joe@ovn.org>
>> Cc: Chandran, Sugesh <sugesh.chandran@intel.com>; Andy Zhou
>> <azhou@ovn.org>; William Tu <u9012063@gmail.com>;
>> gvrose8192@gmail.com; ovs dev <dev@openvswitch.org>; Ben Pfaff
>> <blp@ovn.org>; Jan Scheurich <jan.scheurich@ericsson.com>; Gray, Mark D
>> <mark.d.gray@intel.com>
>> Subject: RE: [ovs-dev] [PATCH v5] tunneling: Avoid recirculation on datapath
>> by computing the recirculate actions at translate time.
>>
>>
>> Hi Joe,
>>
>> > Backing up a bit for context, the stats attribution goes roughly like this:
>> > * First upcall, handler thread calls through the translate code with a
>> > packet. The resubmit_stats are derived from that packet. This goes
>> > through xlate_actions().
>> > * First dump of flow from revalidator thread fetches the flow and runs
>> > the same xlate_actions() with whatever stats it has (may be zero).
>> > This time, whenever stats attribution or side effects occur, an
>> > xlate_cache entry is generated.
>> > * Second and subsequent dumps of flows fetches the flow and shortcuts
>> > the xlate_actions() by using the xlate_cache instead - ie a call to
>> > xlate_push_stats().
>> >
>> > So, in the same place where the resubmit_stats is manipulated, you
>> > would also need to generate a new XC entry which would manipulate the
>> > stats - this would be a 'side-effect'. I'd imagine that prior to the
>> > full output translation there would be a XC_TRUNCATE(truncated_size)
>> > then afterwards there would be an XC_TRUNCATE_RESET(). Or it could be
>> > just XC_SET_SIZE(...) where 0 is reset and non-zero is a truncate
>> > size. In the implementation/execution in xlate_push_stats() when
>> > performing XC_TRUNCATE you would need to store the original push_stats
>> > size somewhere, then calculate a new 'n_bytes' based on the number of
>> > packets and existing bytes*. For XC_TRUNCATE_RESET(), it would restore
>> > the original push_stats size.
>>
>>  Thank you for the explanation.
>>
>> > * Hmm, I'm not sure the calculation will be 100% here. Let's say there
>> > were 3 packets hit the flow, 50B, 200B, 300B. If output(max_len=100)
>> > was executed, then we don't know how many of the packets were
>> > truncated. The maximum number of bytes that could be transmitted is
>> > 300, but the actual number was 250. We could divide the n_bytes by
>> > n_packets, subtract the max_len and then multiply back up by the
>> > number of packets, which works for this case assuming floating point
>> > arithmetic but is slightly off if using integer math..
>>
>> I don't think, that would be the proper way of calculating n_bytes. Let's say
>> we have 3 packets with 50B, 200B, 200B and max_len=100. The output should
>> be 50 + 100 + 100 = 250B.
>> Following the instructions above we will get
>> [(50 + 200 + 200) / 3 - 100 ] * 3 = [450 / 3 - 100 ] * 3 = 50 * 3 = 150B
>>
>> Any other idea how to calculate the truncated size with xlate cache?
>> Or maybe I did not understand your calculation.
> [Sugesh] Since we have this issue with the trunc action, how about limiting
> the combined action to only those tunnels that don't have any post-trunc
> action? If there is a trunc action, create two separate rules as is done now.
> Is there any other action that would be considered an exception like this?

There's no need for different rules.

Translation of truncate+output is done from
xlate_output_trunc_action(), which calls the common
xlate_output_action() translation after setting up the packet max_len.
Perhaps this path just needs to keep the recirculation, while the
common xlate_output_action() path should be capable of doing this
'combined' (patch-port-style) translation.

I think your last question is effectively asking 'who calls
xlate_output_action'. I see output_reg, output_trunc, enqueue, bundle,
and regular output. I think that truncate is the only 'special' case
from this perspective.
Chandran, Sugesh May 26, 2017, 6:06 p.m. UTC | #8
Thank you Joe for the response.

Regards
_Sugesh


> -----Original Message-----
> From: Joe Stringer [mailto:joe@ovn.org]
> Sent: Friday, May 26, 2017 7:02 PM
> To: Chandran, Sugesh <sugesh.chandran@intel.com>
> Cc: Zoltán Balogh <zoltan.balogh@ericsson.com>; Andy Zhou
> <azhou@ovn.org>; William Tu <u9012063@gmail.com>;
> gvrose8192@gmail.com; ovs dev <dev@openvswitch.org>; Ben Pfaff
> <blp@ovn.org>; Jan Scheurich <jan.scheurich@ericsson.com>; Gray, Mark D
> <mark.d.gray@intel.com>
> Subject: Re: [ovs-dev] [PATCH v5] tunneling: Avoid recirculation on datapath
> by computing the recirculate actions at translate time.
>
> On 26 May 2017 at 10:42, Chandran, Sugesh <sugesh.chandran@intel.com>
> wrote:
> >
> > Regards
> > _Sugesh
> >
> >> -----Original Message-----
> >> From: Zoltán Balogh [mailto:zoltan.balogh@ericsson.com]
> >> Sent: Friday, May 26, 2017 3:01 PM
> >> To: Joe Stringer <joe@ovn.org>
> >> Cc: Chandran, Sugesh <sugesh.chandran@intel.com>; Andy Zhou
> >> <azhou@ovn.org>; William Tu <u9012063@gmail.com>;
> >> gvrose8192@gmail.com; ovs dev <dev@openvswitch.org>; Ben Pfaff
> >> <blp@ovn.org>; Jan Scheurich <jan.scheurich@ericsson.com>; Gray, Mark D
> >> <mark.d.gray@intel.com>
> >> Subject: RE: [ovs-dev] [PATCH v5] tunneling: Avoid recirculation on
> >> datapath by computing the recirculate actions at translate time.
> >>
> >> Hi Joe,
> >>
> >> > Backing up a bit for context, the stats attribution goes roughly like this:
> >> > * First upcall, handler thread calls through the translate code
> >> > with a packet. The resubmit_stats are derived from that packet.
> >> > This goes through xlate_actions().
> >> > * First dump of flow from revalidator thread fetches the flow and
> >> > runs the same xlate_actions() with whatever stats it has (may be zero).
> >> > This time, whenever stats attribution or side effects occur, an
> >> > xlate_cache entry is generated.
> >> > * Second and subsequent dumps of flows fetches the flow and
> >> > shortcuts the xlate_actions() by using the xlate_cache instead - ie
> >> > a call to xlate_push_stats().
> >> >
> >> > So, in the same place where the resubmit_stats is manipulated, you
> >> > would also need to generate a new XC entry which would manipulate
> >> > the stats - this would be a 'side-effect'. I'd imagine that prior
> >> > to the full output translation there would be a
> >> > XC_TRUNCATE(truncated_size) then afterwards there would be an
> >> > XC_TRUNCATE_RESET(). Or it could be just XC_SET_SIZE(...) where 0
> >> > is reset and non-zero is a truncate size. In the
> >> > implementation/execution in xlate_push_stats() when performing
> >> > XC_TRUNCATE you would need to store the original push_stats size
> >> > somewhere, then calculate a new 'n_bytes' based on the number of
> >> > packets and existing bytes*. For XC_TRUNCATE_RESET(), it would
> >> > restore the original push_stats size.
> >>
> >> Thank you for the explanation.
> >>
> >> > * Hmm, I'm not sure the calculation will be 100% here. Let's say
> >> > there were 3 packets hit the flow, 50B, 200B, 300B. If
> >> > output(max_len=100) was executed, then we don't know how many of
> >> > the packets were truncated. The maximum number of bytes that could
> >> > be transmitted is 300, but the actual number was 250. We could
> >> > divide the n_bytes by n_packets, subtract the max_len and then
> >> > multiply back up by the number of packets, which works for this
> >> > case assuming floating point arithmetic but is slightly off if using
> >> > integer math.
> >>
> >> I don't think that would be the proper way of calculating n_bytes.
> >> Let's say we have 3 packets of 50B, 200B, 200B and max_len=100. The
> >> output should be 50 + 100 + 100 = 250B.
> >> Following the instructions above we would get
> >> [(50 + 200 + 200) / 3 - 100] * 3 = [450 / 3 - 100] * 3 = 50 * 3 = 150B.
> >>
> >> Any other idea how to calculate the truncated size with the xlate cache?
> >> Or maybe I did not understand your calculation.
> > [Sugesh] Since we have this issue with the trunc action, how about limiting
> > the combined action to only those tunnels that don't have any post-trunc
> > action? If there is a trunc action, create two separate rules as is done now.
> > Is there any other action that would be considered an exception like this?
>
> There's no need for different rules.
[Sugesh] Ok. What I meant here is to do the combine only for the non-trunc case.
>
> Translation of truncate+output is done from xlate_output_trunc_action(),
> which calls the common
> xlate_output_action() translation after setting up the packet max_len.
> Perhaps this path just needs to keep the recirculation, while the common
> xlate_output_action() path should be capable of doing this 'combined'
> (patch-port-style) translation.
[Sugesh] Yes.
>
> I think your last question is effectively asking 'who calls xlate_output_action'.
> I see output_reg, output_trunc, enqueue, bundle, and regular output. I think
> that truncate is the only 'special' case from this perspective.
[Sugesh] So we can treat trunc as a special case then.

Patch

--- -   2017-05-11 14:40:42.205776583 +0200
+++ /home/ezolbal/work/general_L3_tunneling/ovs/tests/testsuite.dir/at-groups/783/stdout        2017-05-11 14:40:42.200387095 +0200
@@ -1,2 +1,2 @@ 
-Datapath actions: clone(tnl_push(tnl_port(4789),header(size=50,type=4,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=17,tos=0,ttl=64,frag=0x4000),udp(src=0,dst=4789,csum=0x0),vxlan(flags=0x8000000,vni=0x7b)),out_port(100)),1)
+Datapath actions: clone(drop),tnl_push(tnl_port(4789),header(size=50,type=4,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=17,tos=0,ttl=64,frag=0x4000),udp(src=0,dst=4789,csum=0x0),vxlan(flags=0x8000000,vni=0x7b)),out_port(100))