diff mbox series

[ovs-dev,v2,2/4] netdev-linux: Favour inner packet for multi-encapsulated TSO.

Message ID 20240212195030.190385-2-mkp@redhat.com
State Changes Requested
Delegated to: Simon Horman
Headers show
Series [ovs-dev,v2,1/4] Userspace: Software fallback for UDP encapsulated TCP segmentation. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test fail github build: failed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Mike Pattrick Feb. 12, 2024, 7:50 p.m. UTC
Previously if an OVS configuration nested multiple layers of UDP tunnels
like VXLAN or GENEVE on top of each other through netdev-linux
interfaces, the vnet header would be incorrectly set to the outermost
UDP tunnel layer instead of the intermediary tunnel layer.

This resulted in the middle UDP tunnel not checksum offloading properly.

Fixes: 3337e6d91c5b ("userspace: Enable L4 checksum offloading by default.")
Reported-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
---
 lib/netdev-linux.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

Comments

Mike Pattrick Feb. 14, 2024, 2:12 p.m. UTC | #1
Note, this failed github CI with the following error message:

../../tests/testsuite: line 4034: ./atconfig: No such file or directory

This appears to be a false negative.


-M

On Mon, Feb 12, 2024 at 2:50 PM Mike Pattrick <mkp@redhat.com> wrote:
>
> Previously if an OVS configuration nested multiple layers of UDP tunnels
> like VXLAN or GENEVE on top of each other through netdev-linux
> interfaces, the vnet header would be incorrectly set to the outermost
> UDP tunnel layer instead of the intermediary tunnel layer.
>
> This resulted in the middle UDP tunnel not checksum offloading properly.
>
> Fixes: 3337e6d91c5b ("userspace: Enable L4 checksum offloading by default.")
> Reported-by: David Marchand <david.marchand@redhat.com>
> Signed-off-by: Mike Pattrick <mkp@redhat.com>
> ---
>  lib/netdev-linux.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index 1b2e5b6c2..7a156cc28 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -7239,14 +7239,23 @@ netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
>              vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
>                                      struct tcp_header, tcp_csum);
>          } else if (dp_packet_hwol_l4_is_udp(b)) {
> -            struct udp_header *udp_hdr = dp_packet_l4(b);
> +            /* Favour the inner packet when indicating checksum offsets. */
> +            void *l3_off = dp_packet_inner_l3(b);
> +            void *l4_off = dp_packet_inner_l4(b);
> +
> +            if (!l3_off || !l4_off) {
> +                l3_off = dp_packet_l3(b);
> +                l4_off = dp_packet_l4(b);
> +            }
> +            struct udp_header *udp_hdr = l4_off;
> +
>              ovs_be16 csum = 0;
>
>              if (dp_packet_hwol_is_ipv4(b)) {
> -                const struct ip_header *ip_hdr = dp_packet_l3(b);
> +                const struct ip_header *ip_hdr = l3_off;
>                  csum = ~csum_finish(packet_csum_pseudoheader(ip_hdr));
>              } else if (dp_packet_hwol_tx_ipv6(b)) {
> -                const struct ovs_16aligned_ip6_hdr *ip6_hdr = dp_packet_l3(b);
> +                const struct ovs_16aligned_ip6_hdr *ip6_hdr = l4_off;
>                  csum = ~csum_finish(packet_csum_pseudoheader6(ip6_hdr));
>              }
>
> --
> 2.39.3
>
David Marchand Feb. 14, 2024, 5:08 p.m. UTC | #2
Hello Mike,

On Mon, Feb 12, 2024 at 8:50 PM Mike Pattrick <mkp@redhat.com> wrote:
>
> Previously if an OVS configuration nested multiple layers of UDP tunnels
> like VXLAN or GENEVE on top of each other through netdev-linux
> interfaces, the vnet header would be incorrectly set to the outermost
> UDP tunnel layer instead of the intermediary tunnel layer.
>
> This resulted in the middle UDP tunnel not checksum offloading properly.
>
> Fixes: 3337e6d91c5b ("userspace: Enable L4 checksum offloading by default.")
> Reported-by: David Marchand <david.marchand@redhat.com>
> Signed-off-by: Mike Pattrick <mkp@redhat.com>

I have some trouble relating this patch to the issue I faced :-).
Could you detail a test that shows the issue you fix here?



After applying (only this patch), I still reproduce an issue with
inner checksums.
As I reported this issue to you offlist, let me put the details in public here.

I wrote a system-traffic.at unit test that stacks 3 vxlan tunnels
(separate topic, but for the context, my goal was to stress DPDK
dp-packets wrt headroom).
If I try this unit test before commit 084c8087292c ("userspace:
Support VXLAN and GENEVE TSO."), I have no issue.

The topology is as follows:
##############################################################
#
# at_ns0                    . init_net
#                           .
# at_vxlan1 (10.1.1.1/24)   . br0 (10.1.1.100/24)
# (remote 172.31.1.100)     . |
#                           . at_vxlan0
#                           . (remote 172.31.1.1)
#                           .
# at_vxlan3 (172.31.1.1/24) . br-underlay0 (172.31.1.100/24)
# (remote 172.31.2.100)     . |
#                           . at_vxlan2
#                           . (remote 172.31.2.1)
#                           .
# at_vxlan5 (172.31.2.1/24) . br-underlay1 (172.31.2.100/24)
# (remote 172.31.3.100)     . |
#                           . at_vxlan4
#                           . (remote 172.31.3.1)
#                           .
# p0 (172.31.3.1/24)        . br-underlay2 (172.31.3.100/24)
# |                         . |
# \-------------------------.-ovs-p0
#
##############################################################

(gmail will probably bust this copy/paste, so putting a link to the
actual test: https://github.com/david-marchand/ovs/commit/manyvxlan~2#diff-45a77f85f9679bc66ac97300392c0d5d9f5c53264fa8a82d735a553246e71faeR400)

With this setup, I try to ping, from at_ns0 netns, the ip address of
the br tap iface plugged with the other side of each tunnel:

- Most outter level, no encapsulation, all good:
16:24:51.590966 a6:0a:bf:e2:f3:f2 > 82:cf:78:de:ed:46, ethertype IPv4
(0x0800), length 98: (tos 0x0, ttl 64, id 63550, offset 0, flags [DF],
proto ICMP (1), length 84)
    172.31.3.1 > 172.31.3.100: ICMP echo request, id 26707, seq 1, length 64

16:24:51.591084 82:cf:78:de:ed:46 > a6:0a:bf:e2:f3:f2, ethertype IPv4
(0x0800), length 98: (tos 0x0, ttl 64, id 28720, offset 0, flags
[none], proto ICMP (1), length 84)
    172.31.3.100 > 172.31.3.1: ICMP echo reply, id 26707, seq 1, length 64

- One tunnel encap all good:
16:24:54.140629 a6:0a:bf:e2:f3:f2 > 82:cf:78:de:ed:46, ethertype IPv4
(0x0800), length 148: (tos 0x0, ttl 64, id 61052, offset 0, flags
[none], proto UDP (17), length 134)
    172.31.3.1.36831 > 172.31.3.100.vxlan: [udp sum ok] VXLAN, flags
[I] (0x08), vni 0
1e:db:ec:e5:28:6d > 9a:39:be:e8:18:4b, ethertype IPv4 (0x0800), length
98: (tos 0x0, ttl 64, id 54399, offset 0, flags [DF], proto ICMP (1),
length 84)
    172.31.2.1 > 172.31.2.100: ICMP echo request, id 51488, seq 1, length 64

16:24:54.140772 82:cf:78:de:ed:46 > a6:0a:bf:e2:f3:f2, ethertype IPv4
(0x0800), length 148: (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
proto UDP (17), length 134)
    172.31.3.100.39912 > 172.31.3.1.vxlan: [no cksum] VXLAN, flags [I]
(0x08), vni 0
9a:39:be:e8:18:4b > 1e:db:ec:e5:28:6d, ethertype IPv4 (0x0800), length
98: (tos 0x0, ttl 64, id 29701, offset 0, flags [none], proto ICMP
(1), length 84)
    172.31.2.100 > 172.31.2.1: ICMP echo reply, id 51488, seq 1, length 64

- Two tunnels encap:
16:24:58.578900 a6:0a:bf:e2:f3:f2 > 82:cf:78:de:ed:46, ethertype IPv4
(0x0800), length 142: (tos 0x0, ttl 64, id 61719, offset 0, flags
[none], proto UDP (17), length 128)
    172.31.3.1.50673 > 172.31.3.100.vxlan: [udp sum ok] VXLAN, flags
[I] (0x08), vni 0
1e:db:ec:e5:28:6d > 9a:39:be:e8:18:4b, ethertype IPv4 (0x0800), length
92: (tos 0x0, ttl 64, id 35175, offset 0, flags [none], proto UDP
(17), length 78)
    172.31.2.1.44060 > 172.31.2.100.vxlan: [udp sum ok] VXLAN, flags
[I] (0x08), vni 1
62:53:3f:82:da:56 > Broadcast, ethertype ARP (0x0806), length 42:
Ethernet (len 6), IPv4 (len 4), Request who-has 172.31.1.100 tell
172.31.1.1, length 28

16:24:58.579021 82:cf:78:de:ed:46 > a6:0a:bf:e2:f3:f2, ethertype IPv4
(0x0800), length 142: (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
proto UDP (17), length 128)
    172.31.3.100.56325 > 172.31.3.1.vxlan: [no cksum] VXLAN, flags [I]
(0x08), vni 0
9a:39:be:e8:18:4b > 1e:db:ec:e5:28:6d, ethertype IPv4 (0x0800), length
92: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17),
length 78, bad cksum 0 (->ddfb)!)
    172.31.2.100.56325 > 172.31.2.1.vxlan: [no cksum] VXLAN, flags [I]
(0x08), vni 1
26:8d:e7:0e:41:45 > 62:53:3f:82:da:56, ethertype ARP (0x0806), length
42: Ethernet (len 6), IPv4 (len 4), Reply 172.31.1.100 is-at
26:8d:e7:0e:41:45, length 28

In this last packet capture, the inner vxlan tunnel header has wrong
(well no) IP checksum.

If I set the tunnel endpoint neighbors in each netns, the same issue
can be seen for icmp encapsulated traffic:

17:41:39.303170 32:a4:65:eb:df:49 > a6:0a:bf:e2:f3:f2, ethertype IPv4
(0x0800), length 198: (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
proto UDP (17), length 184)
    172.31.3.100.59550 > 172.31.3.1.vxlan: [no cksum] VXLAN, flags [I]
(0x08), vni 0
fe:72:7b:39:24:49 > 7e:9d:b3:a5:a2:f8, ethertype IPv4 (0x0800), length
148: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17),
length 134, bad cksum 0 (->ddc3)!)
    172.31.2.100.59550 > 172.31.2.1.vxlan: [no cksum] VXLAN, flags [I]
(0x08), vni 1
56:21:48:f3:f3:42 > 72:64:65:f9:38:0a, ethertype IPv4 (0x0800), length
98: (tos 0x0, ttl 64, id 49609, offset 0, flags [none], proto ICMP
(1), length 84)
    172.31.1.100 > 172.31.1.1: ICMP echo reply, id 35387, seq 1, length 64


Running in GHA:
https://github.com/david-marchand/ovs/actions/runs/7903675971/job/21572151269

Afaics, this issue is not reproduced when using the kernel datapath
(check-kernel) but it is seen in GHA for check-system-userspace /
check-dpdk / check-afxdp.
I can reproduce those locally.

There is also a failure with check-offloads in GHA which I can't
reproduce locally (for the latter, it may be a kernel related issue
(and I think there were some known issues..)).


Now, if I apply below diff, my issue seems to go away for
check-system-userspace / check-dpdk / check-afxdp.
$ git diff -b HEAD^
diff --git a/lib/netdev-native-tnl.c b/lib/netdev-native-tnl.c
index 0d6d803fe4..7f3e2c0ae4 100644
--- a/lib/netdev-native-tnl.c
+++ b/lib/netdev-native-tnl.c
@@ -280,6 +280,7 @@ dp_packet_tnl_ol_process(struct dp_packet *packet,
                 dp_packet_hwol_set_tx_ipv6(packet);
             }
         }
+    }

     /* Attention please, tunnel inner l2 len is consist of udp header
      * len and tunnel header len and inner l2 len. */
@@ -299,7 +300,6 @@ dp_packet_tnl_ol_process(struct dp_packet *packet,
                                      (char *) dp_packet_eth(packet) +
                                      VXLAN_HLEN);
     }
-    }
 }

 void

But IPv6 tso tests have some failures, and some nsh test fails too
regardless of which datapath is used, so this is not a right fix.

Actual commit:
https://github.com/david-marchand/ovs/commit/manyvxlan
https://github.com/david-marchand/ovs/actions/runs/7904287294
Mike Pattrick Feb. 15, 2024, 6:03 a.m. UTC | #3
On Wed, Feb 14, 2024 at 12:09 PM David Marchand
<david.marchand@redhat.com> wrote:
>
> Hello Mike,
>
> On Mon, Feb 12, 2024 at 8:50 PM Mike Pattrick <mkp@redhat.com> wrote:
> >
> > Previously if an OVS configuration nested multiple layers of UDP tunnels
> > like VXLAN or GENEVE on top of each other through netdev-linux
> > interfaces, the vnet header would be incorrectly set to the outermost
> > UDP tunnel layer instead of the intermediary tunnel layer.
> >
> > This resulted in the middle UDP tunnel not checksum offloading properly.
> >
> > Fixes: 3337e6d91c5b ("userspace: Enable L4 checksum offloading by default.")
> > Reported-by: David Marchand <david.marchand@redhat.com>
> > Signed-off-by: Mike Pattrick <mkp@redhat.com>
>
> I have some trouble relating this patch to the issue I faced :-).
> Could you detail a test that shows the issue you fix here?

I made a slight modification to the test suite you provided adding a
UDP traffic test to your test; which unveiled this issue. But you are
correct, I got so distracted by one issue that I missed other issues.

>
> After applying (only this patch), I still reproduce an issue with
> inner checksums.
> As I reported this issue to you offlist, let me put the details in public here.
>
> I wrote a system-traffic.at unit test that stacks 3 vxlan tunnels
> (separate topic, but for the context, my goal was to stress DPDK
> dp-packets wrt headroom).
> If I try this unit test before commit 084c8087292c ("userspace:
> Support VXLAN and GENEVE TSO."), I have no issue.
>
> The topology is as follows:
> ##############################################################
> #
> # at_ns0                    . init_net
> #                           .
> # at_vxlan1 (10.1.1.1/24)   . br0 (10.1.1.100/24)
> # (remote 172.31.1.100)     . |
> #                           . at_vxlan0
> #                           . (remote 172.31.1.1)
> #                           .
> # at_vxlan3 (172.31.1.1/24) . br-underlay0 (172.31.1.100/24)
> # (remote 172.31.2.100)     . |
> #                           . at_vxlan2
> #                           . (remote 172.31.2.1)
> #                           .
> # at_vxlan5 (172.31.2.1/24) . br-underlay1 (172.31.2.100/24)
> # (remote 172.31.3.100)     . |
> #                           . at_vxlan4
> #                           . (remote 172.31.3.1)
> #                           .
> # p0 (172.31.3.1/24)        . br-underlay2 (172.31.3.100/24)
> # |                         . |
> # \-------------------------.-ovs-p0
> #
> ##############################################################
>
> (gmail will probably bust this copy/paste, so putting a link to the
> actual test: https://github.com/david-marchand/ovs/commit/manyvxlan~2#diff-45a77f85f9679bc66ac97300392c0d5d9f5c53264fa8a82d735a553246e71faeR400)
>
> With this setup, I try to ping, from at_ns0 netns, the ip address of
> the br tap iface plugged with the other side of each tunnel:
>
> - Most outter level, no encapsulation, all good:
> 16:24:51.590966 a6:0a:bf:e2:f3:f2 > 82:cf:78:de:ed:46, ethertype IPv4
> (0x0800), length 98: (tos 0x0, ttl 64, id 63550, offset 0, flags [DF],
> proto ICMP (1), length 84)
>     172.31.3.1 > 172.31.3.100: ICMP echo request, id 26707, seq 1, length 64
>
> 16:24:51.591084 82:cf:78:de:ed:46 > a6:0a:bf:e2:f3:f2, ethertype IPv4
> (0x0800), length 98: (tos 0x0, ttl 64, id 28720, offset 0, flags
> [none], proto ICMP (1), length 84)
>     172.31.3.100 > 172.31.3.1: ICMP echo reply, id 26707, seq 1, length 64
>
> - One tunnel encap all good:
> 16:24:54.140629 a6:0a:bf:e2:f3:f2 > 82:cf:78:de:ed:46, ethertype IPv4
> (0x0800), length 148: (tos 0x0, ttl 64, id 61052, offset 0, flags
> [none], proto UDP (17), length 134)
>     172.31.3.1.36831 > 172.31.3.100.vxlan: [udp sum ok] VXLAN, flags
> [I] (0x08), vni 0
> 1e:db:ec:e5:28:6d > 9a:39:be:e8:18:4b, ethertype IPv4 (0x0800), length
> 98: (tos 0x0, ttl 64, id 54399, offset 0, flags [DF], proto ICMP (1),
> length 84)
>     172.31.2.1 > 172.31.2.100: ICMP echo request, id 51488, seq 1, length 64
>
> 16:24:54.140772 82:cf:78:de:ed:46 > a6:0a:bf:e2:f3:f2, ethertype IPv4
> (0x0800), length 148: (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> proto UDP (17), length 134)
>     172.31.3.100.39912 > 172.31.3.1.vxlan: [no cksum] VXLAN, flags [I]
> (0x08), vni 0
> 9a:39:be:e8:18:4b > 1e:db:ec:e5:28:6d, ethertype IPv4 (0x0800), length
> 98: (tos 0x0, ttl 64, id 29701, offset 0, flags [none], proto ICMP
> (1), length 84)
>     172.31.2.100 > 172.31.2.1: ICMP echo reply, id 51488, seq 1, length 64
>
> - Two tunnels encap:
> 16:24:58.578900 a6:0a:bf:e2:f3:f2 > 82:cf:78:de:ed:46, ethertype IPv4
> (0x0800), length 142: (tos 0x0, ttl 64, id 61719, offset 0, flags
> [none], proto UDP (17), length 128)
>     172.31.3.1.50673 > 172.31.3.100.vxlan: [udp sum ok] VXLAN, flags
> [I] (0x08), vni 0
> 1e:db:ec:e5:28:6d > 9a:39:be:e8:18:4b, ethertype IPv4 (0x0800), length
> 92: (tos 0x0, ttl 64, id 35175, offset 0, flags [none], proto UDP
> (17), length 78)
>     172.31.2.1.44060 > 172.31.2.100.vxlan: [udp sum ok] VXLAN, flags
> [I] (0x08), vni 1
> 62:53:3f:82:da:56 > Broadcast, ethertype ARP (0x0806), length 42:
> Ethernet (len 6), IPv4 (len 4), Request who-has 172.31.1.100 tell
> 172.31.1.1, length 28
>
> 16:24:58.579021 82:cf:78:de:ed:46 > a6:0a:bf:e2:f3:f2, ethertype IPv4
> (0x0800), length 142: (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> proto UDP (17), length 128)
>     172.31.3.100.56325 > 172.31.3.1.vxlan: [no cksum] VXLAN, flags [I]
> (0x08), vni 0
> 9a:39:be:e8:18:4b > 1e:db:ec:e5:28:6d, ethertype IPv4 (0x0800), length
> 92: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17),
> length 78, bad cksum 0 (->ddfb)!)
>     172.31.2.100.56325 > 172.31.2.1.vxlan: [no cksum] VXLAN, flags [I]
> (0x08), vni 1
> 26:8d:e7:0e:41:45 > 62:53:3f:82:da:56, ethertype ARP (0x0806), length
> 42: Ethernet (len 6), IPv4 (len 4), Reply 172.31.1.100 is-at
> 26:8d:e7:0e:41:45, length 28
>
> In this last packet capture, the inner vxlan tunnel header has wrong
> (well no) IP checksum.
>
> If I set the tunnel endpoint neighbors in each netns, the same issue
> can be seen for icmp encapsulated traffic:
>
> 17:41:39.303170 32:a4:65:eb:df:49 > a6:0a:bf:e2:f3:f2, ethertype IPv4
> (0x0800), length 198: (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> proto UDP (17), length 184)
>     172.31.3.100.59550 > 172.31.3.1.vxlan: [no cksum] VXLAN, flags [I]
> (0x08), vni 0
> fe:72:7b:39:24:49 > 7e:9d:b3:a5:a2:f8, ethertype IPv4 (0x0800), length
> 148: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17),
> length 134, bad cksum 0 (->ddc3)!)
>     172.31.2.100.59550 > 172.31.2.1.vxlan: [no cksum] VXLAN, flags [I]
> (0x08), vni 1
> 56:21:48:f3:f3:42 > 72:64:65:f9:38:0a, ethertype IPv4 (0x0800), length
> 98: (tos 0x0, ttl 64, id 49609, offset 0, flags [none], proto ICMP
> (1), length 84)
>     172.31.1.100 > 172.31.1.1: ICMP echo reply, id 35387, seq 1, length 64
>
>
> Running in GHA:
> https://github.com/david-marchand/ovs/actions/runs/7903675971/job/21572151269
>
> Afaics, this issue is not reproduced when using the kernel datapath
> (check-kernel) but it is seen in GHA for check-system-userspace /
> check-dpdk / check-afxdp.
> I can reproduce those locally.
>
> There is also a failure with check-offloads in GHA which I can't
> reproduce locally (for the latter, it may be a kernel related issue
> (and I think there were some known issues..)).
>
>
> Now, if I apply below diff, my issue seems to go away for
> check-system-userspace / check-dpdk / check-afxdp.
> $ git diff -b HEAD^
> diff --git a/lib/netdev-native-tnl.c b/lib/netdev-native-tnl.c
> index 0d6d803fe4..7f3e2c0ae4 100644
> --- a/lib/netdev-native-tnl.c
> +++ b/lib/netdev-native-tnl.c
> @@ -280,6 +280,7 @@ dp_packet_tnl_ol_process(struct dp_packet *packet,
>                  dp_packet_hwol_set_tx_ipv6(packet);
>              }
>          }
> +    }
>
>      /* Attention please, tunnel inner l2 len is consist of udp header
>       * len and tunnel header len and inner l2 len. */
> @@ -299,7 +300,6 @@ dp_packet_tnl_ol_process(struct dp_packet *packet,
>                                       (char *) dp_packet_eth(packet) +
>                                       VXLAN_HLEN);
>      }
> -    }
>  }
>
>  void
>
> But IPv6 tso tests have some failures, and some nsh test fails too
> regardless of which datapath is used, so this is not a right fix.

One of the problems here is when the tunnel offload flag is set, we
set dp_packet_hwol_set_tx_outer_ipv4_csum() instead of
dp_packet_hwol_set_tx_ipv4_csum(). But when modifying IP addresses, we
check for dp_packet_hwol_tx_ip_csum() not the outer ip csum flag.
Likewise, in netdev_tnl_ip_extract_tnl_md we also only check for ip
csum, not the outer flag.

Since dp_packet_l3*() functions always refer to the outermost packet
but the ipv4_csum() functions refer to the inner packet there is some
tension here, and some incorrect behaviour.

I've made a branch where we properly account for outer and inner
checksums, and it passes the tests mostly, except for afxdp.

For afxdp we crash in dp_packet_prealloc_headroom(). netdev-afxdp has
a hardcoded OVS_XDP_HEADROOM=128 bytes and the multiple layers of
tunneling exceeds that. I ran a test where I set this to 256 and the
test passes, but that seems like a non-ideal solution. We probably
shouldn't abort() in dp_packet_resize(), as it could be possible to
accidentally run into this.

Dropping the packet is probably preferable IMO, but that is also a
very large change, as none of the calling functions have return codes
themselves and some of the 2rd degree call backs don't either, so many
functions will need to change.

You can see the branch here: https://github.com/mkp-rh/ovs/tree/multitun
And the test run here: https://github.com/mkp-rh/ovs/actions/runs/7911539363

I'll clean up this a bit and address some of the other things
mentioned, like the incorrect Fixes tag.


Cheers,
M

>
> Actual commit:
> https://github.com/david-marchand/ovs/commit/manyvxlan
> https://github.com/david-marchand/ovs/actions/runs/7904287294
>
>
>
> --
> David Marchand
>
David Marchand Feb. 15, 2024, 8:48 a.m. UTC | #4
On Thu, Feb 15, 2024 at 7:03 AM Mike Pattrick <mkp@redhat.com> wrote:
> I've made a branch where we properly account for outer and inner
> checksums, and it passes the tests mostly, except for afxdp.
>
> For afxdp we crash in dp_packet_prealloc_headroom(). netdev-afxdp has
> a hardcoded OVS_XDP_HEADROOM=128 bytes and the multiple layers of
> tunneling exceeds that. I ran a test where I set this to 256 and the
> test passes, but that seems like a non-ideal solution. We probably
> shouldn't abort() in dp_packet_resize(), as it could be possible to
> accidentally run into this.

This is exactly the point I wanted to stress with DPDK dp-packets.

The reason behind was to check this old patch of mine:
https://patchwork.ozlabs.org/project/openvswitch/patch/20220318153339.31083-1-david.marchand@redhat.com/

DPDK dp-packets data are supposed to be located at
RTE_PKTMBUF_HEADROOM == 128 bytes, on rx.

But I uncovered recently that we won't hit this headroom limit with
net/af_xdp backing netdev-dpdk ports...
The net/af_xdp driver tries to be smart and avoid copies by using the
unaligned chunk af_xdp feature.
https://git.dpdk.org/dpdk/commit/?id=d8a210774e1d4c295fd93b983538da0d15312edd
A consequence is that this driver places received data with a 384
bytes headroom (RTE_PKTMBUF_HEADROOM + XDP_PACKET_HEADROOM).
Which then defeats my unit test...

This placement of data looks incorrect to me, from the DPDK mbuf API "spirit".
Applications expect a RTE_PKTMBUF_HEADROOM headroom, and they size
their buffers accordingly.
This extra headroom would mean applications need to account for this
peculiarity when using this driver...

I will need to spend more time on this, but not now.


>
> Dropping the packet is probably preferable IMO, but that is also a
> very large change, as none of the calling functions have return codes
> themselves and some of the 2rd degree call backs don't either, so many
> functions will need to change.

Or extend dp_packet_resize() for af_xdp dp-packet.
The tricky part is that the dp-packet is part of a umem buffer.
If we make a af_xdp dp-packet points at a different malloc'd data
buffer, we need to distinguish for this case when freeing this
dp-packet.
I can put this on my todolist.


>
> You can see the branch here: https://github.com/mkp-rh/ovs/tree/multitun
> And the test run here: https://github.com/mkp-rh/ovs/actions/runs/7911539363
>
> I'll clean up this a bit and address some of the other things
> mentioned, like the incorrect Fixes tag.

We don't need to fix all issues, the main point is the inner checksum
issue, as it is something that got broken in 3.3.
If we strip the 3rd layer of tunnel from my unit test, it would be
enough to reproduce without hitting af_xdp headroom limit.

Or do you think we can extend an existing test?
At least, fixes should be isolated from the new features like one
introduced in patch 1 of this series.
diff mbox series

Patch

diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 1b2e5b6c2..7a156cc28 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -7239,14 +7239,23 @@  netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
             vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
                                     struct tcp_header, tcp_csum);
         } else if (dp_packet_hwol_l4_is_udp(b)) {
-            struct udp_header *udp_hdr = dp_packet_l4(b);
+            /* Favour the inner packet when indicating checksum offsets. */
+            void *l3_off = dp_packet_inner_l3(b);
+            void *l4_off = dp_packet_inner_l4(b);
+
+            if (!l3_off || !l4_off) {
+                l3_off = dp_packet_l3(b);
+                l4_off = dp_packet_l4(b);
+            }
+            struct udp_header *udp_hdr = l4_off;
+
             ovs_be16 csum = 0;
 
             if (dp_packet_hwol_is_ipv4(b)) {
-                const struct ip_header *ip_hdr = dp_packet_l3(b);
+                const struct ip_header *ip_hdr = l3_off;
                 csum = ~csum_finish(packet_csum_pseudoheader(ip_hdr));
             } else if (dp_packet_hwol_tx_ipv6(b)) {
-                const struct ovs_16aligned_ip6_hdr *ip6_hdr = dp_packet_l3(b);
+                const struct ovs_16aligned_ip6_hdr *ip6_hdr = l4_off;
                 csum = ~csum_finish(packet_csum_pseudoheader6(ip6_hdr));
             }