diff mbox series

[ovs-dev,v2] netdev-dpdk: Fallback to non tunnel offloading API.

Message ID 20240328091537.1467676-1-david.marchand@redhat.com
State Changes Requested
Headers show
Series [ovs-dev,v2] netdev-dpdk: Fallback to non tunnel offloading API. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

David Marchand March 28, 2024, 9:15 a.m. UTC
The outer checksum offloading API in DPDK is ambiguous and was
added by Intel folks with the assumption that any outer offloading
always goes with an inner offloading request.

With net/i40e and net/ice drivers, requesting outer ip checksum with a
tunnel context but no inner offloading request triggers a Tx failure
associated with a port MDD event.
2024-03-27T16:02:07.084Z|00018|dpdk|WARN|ice_interrupt_handler(): OICR:
	MDD event

To avoid this situation, if no checksum or segmentation offloading is
requested on the inner part of a packet, fallback to "normal" (non outer)
offloading request.
And outer offloading can be re-enabled for net/i40e and netice.

Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
Changes since v1:
- reset inner marks before converting outer requests,
- fixed some coding style,

---
 lib/netdev-dpdk.c | 83 ++++++++++++++++++++++++-----------------------
 1 file changed, 43 insertions(+), 40 deletions(-)

Comments

David Marchand April 3, 2024, 5:40 p.m. UTC | #1
On Thu, Mar 28, 2024 at 10:16 AM David Marchand
<david.marchand@redhat.com> wrote:
>
> The outer checksum offloading API in DPDK is ambiguous and was
> added by Intel folks with the assumption that any outer offloading
> always goes with an inner offloading request.
>
> With net/i40e and net/ice drivers, requesting outer ip checksum with a
> tunnel context but no inner offloading request triggers a Tx failure
> associated with a port MDD event.
> 2024-03-27T16:02:07.084Z|00018|dpdk|WARN|ice_interrupt_handler(): OICR:
>         MDD event
>
> To avoid this situation, if no checksum or segmentation offloading is
> requested on the inner part of a packet, fallback to "normal" (non outer)
> offloading request.
> And outer offloading can be re-enabled for net/i40e and netice.
>
> Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
> Signed-off-by: David Marchand <david.marchand@redhat.com>
> ---
> Changes since v1:
> - reset inner marks before converting outer requests,
> - fixed some coding style,
>
> ---
>  lib/netdev-dpdk.c | 83 ++++++++++++++++++++++++-----------------------
>  1 file changed, 43 insertions(+), 40 deletions(-)
>
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 2111f77681..ae43594a3d 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1354,18 +1354,6 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>          info.tx_offload_capa &= ~RTE_ETH_TX_OFFLOAD_TCP_CKSUM;
>      }
>
> -    if (!strcmp(info.driver_name, "net_ice")
> -        || !strcmp(info.driver_name, "net_i40e")) {
> -        /* FIXME: Driver advertises the capability but doesn't seem
> -         * to actually support it correctly.  Can remove this once
> -         * the driver is fixed on DPDK side. */
> -        VLOG_INFO("%s: disabled Tx outer udp checksum offloads for a "
> -                  "net/ice or net/i40e port.", netdev_get_name(&dev->up));
> -        info.tx_offload_capa &= ~RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM;
> -        info.tx_offload_capa &= ~RTE_ETH_TX_OFFLOAD_VXLAN_TNL_TSO;
> -        info.tx_offload_capa &= ~RTE_ETH_TX_OFFLOAD_GENEVE_TNL_TSO;
> -    }
> -

A few comments after spending some time on the topic.


- This patch fixes some misusage of the DPDK API.
Basically, resolving a neighbor with ARP inside tunnels is broken on
Intel nics (even without re-enabling outer udp checksum).
This can be observed with the following debug patch at the end of
netdev_dpdk_prep_hwol_packet():

+    char buf[256];
+    if (rte_get_tx_ol_flag_list(mbuf->ol_flags, buf, sizeof(buf)) != 0)
+        buf[0] = '\0';
+    VLOG_WARN("len=%u, ol_flags=%s, outer_l2_len=%u, outer_l3_len=%u,
l2_len=%u, l3_len=%u, l4_len=%u, tso_segsz=%u", mbuf->pkt_len, buf,
mbuf->outer_l2_len, mbuf->outer_l3_len, mbuf->l2_len, mbuf->l3_len,
mbuf->l4_len, mbuf->tso_segsz);

Then doing a "arping" inside the tunnel triggers:
2024-04-03T16:05:40.920Z|00014|netdev_dpdk(pmd-c03/id:8)|WARN|len=96,
ol_flags=RTE_MBUF_F_TX_L4_NO_CKSUM RTE_MBUF_F_TX_OUTER_IP_CKSUM
RTE_MBUF_F_TX_OUTER_IPV4 RTE_MBUF_F_TX_TUNNEL_VXLAN , outer_l2_len=18,
outer_l3_len=20, l2_len=0, l3_len=0, l4_len=0, tso_segsz=0
2024-04-03T16:05:40.920Z|00012|dpdk|WARN|ice_interrupt_handler():
OICR: MDD event

We need this fix in OVS regardless of the outer udp checksum issue.
I'll respin this fix in a new series, without touching UDP checksum capa.


- It does seem that X710 nics have no support for outer udp checksum
(according to its datasheet). Some X722 version may have support for
it, but net/i40e does not configure the Tx descriptor accordingly.
On the other hand, E810 ones seem fine (according to its datasheet).

After more debugging, I managed to get outer udp checksum working.
I understand the DPDK rte_net_intel_cksum_flags_prepare() helper does
not set the pseudo header checksum in the outer udp header.
I proposed a fix in the dpdk bz.

Waiting for the fix on DPDK side... it is still possible to add the
missing bits in OVS (see the branch I pointed at in the OVS issue).


- About the workaround (disabling outer udp checksum for net/ice and
net/i40e), the net/iavf is subject to the same bugs.
So we should disable outer udp checksum too for this driver.

However, I am not sure the iavf driver (can?) differentiates which PF
/ hw is used underneath.
So we may have no solution but to always disable this type of
offloading in OVS for net/iavf.
Ilya Maximets April 3, 2024, 6:14 p.m. UTC | #2
On 4/3/24 19:40, David Marchand wrote:
> On Thu, Mar 28, 2024 at 10:16 AM David Marchand
> <david.marchand@redhat.com> wrote:
>>
>> The outer checksum offloading API in DPDK is ambiguous and was
>> added by Intel folks with the assumption that any outer offloading
>> always goes with an inner offloading request.
>>
>> With net/i40e and net/ice drivers, requesting outer ip checksum with a
>> tunnel context but no inner offloading request triggers a Tx failure
>> associated with a port MDD event.
>> 2024-03-27T16:02:07.084Z|00018|dpdk|WARN|ice_interrupt_handler(): OICR:
>>         MDD event
>>
>> To avoid this situation, if no checksum or segmentation offloading is
>> requested on the inner part of a packet, fallback to "normal" (non outer)
>> offloading request.
>> And outer offloading can be re-enabled for net/i40e and netice.
>>
>> Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
>> Signed-off-by: David Marchand <david.marchand@redhat.com>
>> ---
>> Changes since v1:
>> - reset inner marks before converting outer requests,
>> - fixed some coding style,
>>
>> ---
>>  lib/netdev-dpdk.c | 83 ++++++++++++++++++++++++-----------------------
>>  1 file changed, 43 insertions(+), 40 deletions(-)
>>
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index 2111f77681..ae43594a3d 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -1354,18 +1354,6 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>          info.tx_offload_capa &= ~RTE_ETH_TX_OFFLOAD_TCP_CKSUM;
>>      }
>>
>> -    if (!strcmp(info.driver_name, "net_ice")
>> -        || !strcmp(info.driver_name, "net_i40e")) {
>> -        /* FIXME: Driver advertises the capability but doesn't seem
>> -         * to actually support it correctly.  Can remove this once
>> -         * the driver is fixed on DPDK side. */
>> -        VLOG_INFO("%s: disabled Tx outer udp checksum offloads for a "
>> -                  "net/ice or net/i40e port.", netdev_get_name(&dev->up));
>> -        info.tx_offload_capa &= ~RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM;
>> -        info.tx_offload_capa &= ~RTE_ETH_TX_OFFLOAD_VXLAN_TNL_TSO;
>> -        info.tx_offload_capa &= ~RTE_ETH_TX_OFFLOAD_GENEVE_TNL_TSO;
>> -    }
>> -
> 
> A few comments after spending some time on the topic.

Hi, David.  Thanks for working on the issue.  I appreciate this.

> - This patch fixes some misusage of the DPDK API.

Hmm, I understand that the driver does something funny when it gets
outer flags set without any inner flags, but how is that a misuse
of the DPDK API?  Could you point me to the API docs that say that
inner flags must always be set in this case or that setting only
outer offloads is not allowed?

I agree that it seems safer to just downgrade all outer flags to
inner ones on OVS side, when no inner offloads are requested, I'm
just not sure I agree that it's an API misuse.  Especially since
non-Intel cards seem to work fine.

> Basically, resolving a neighbor with ARP inside tunnels is broken on
> Intel nics (even without re-enabling outer udp checksum).
> This can be observed with the following debug patch at the end of
> netdev_dpdk_prep_hwol_packet():
> 
> +    char buf[256];
> +    if (rte_get_tx_ol_flag_list(mbuf->ol_flags, buf, sizeof(buf)) != 0)
> +        buf[0] = '\0';
> +    VLOG_WARN("len=%u, ol_flags=%s, outer_l2_len=%u, outer_l3_len=%u,
> l2_len=%u, l3_len=%u, l4_len=%u, tso_segsz=%u", mbuf->pkt_len, buf,
> mbuf->outer_l2_len, mbuf->outer_l3_len, mbuf->l2_len, mbuf->l3_len,
> mbuf->l4_len, mbuf->tso_segsz);
> 
> Then doing a "arping" inside the tunnel triggers:
> 2024-04-03T16:05:40.920Z|00014|netdev_dpdk(pmd-c03/id:8)|WARN|len=96,
> ol_flags=RTE_MBUF_F_TX_L4_NO_CKSUM RTE_MBUF_F_TX_OUTER_IP_CKSUM
> RTE_MBUF_F_TX_OUTER_IPV4 RTE_MBUF_F_TX_TUNNEL_VXLAN , outer_l2_len=18,
> outer_l3_len=20, l2_len=0, l3_len=0, l4_len=0, tso_segsz=0
> 2024-04-03T16:05:40.920Z|00012|dpdk|WARN|ice_interrupt_handler():
> OICR: MDD event
> 
> We need this fix in OVS regardless of the outer udp checksum issue.
> I'll respin this fix in a new series, without touching UDP checksum capa.
> 
> 
> - It does seem that X710 nics have no support for outer udp checksum
> (according to its datasheet). Some X722 version may have support for
> it, but net/i40e does not configure the Tx descriptor accordingly.
> On the other hand, E810 ones seem fine (according to its datasheet).
> 
> After more debugging, I managed to get outer udp checksum working.
> I understand the DPDK rte_net_intel_cksum_flags_prepare() helper does
> not set the pseudo header checksum in the outer udp header.
> I proposed a fix in the dpdk bz.
> 
> Waiting for the fix on DPDK side... it is still possible to add the
> missing bits in OVS (see the branch I pointed at in the OVS issue).

Since this feature never worked with ice in OVS and it is experimental,
I tend to think that we should just disable it for ice as well until
DPDK is fixed.

A little too many fixes for that thing we have already and this one will
involve some extra driver-specific logic that we don't have any automated
tests for.

> 
> 
> - About the workaround (disabling outer udp checksum for net/ice and
> net/i40e), the net/iavf is subject to the same bugs.
> So we should disable outer udp checksum too for this driver.
> 
> However, I am not sure the iavf driver (can?) differentiates which PF
> / hw is used underneath.
> So we may have no solution but to always disable this type of
> offloading in OVS for net/iavf.

IAVF should just not advertise support if the underlying hardware
doesn't support it.  There should be a way for a driver to know the
exact card it runs on.  But we can add it to the list of broken
DPDK drivers in OVS until it's fixed, sure.

Best regards, Ilya Maximets.
David Marchand April 4, 2024, 12:46 p.m. UTC | #3
On Wed, Apr 3, 2024 at 8:13 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > - This patch fixes some misusage of the DPDK API.
>
> Hmm, I understand that the driver does something funny when it gets
> outer flags set without any inner flags, but how is that a misuse
> of the DPDK API?  Could you point me to the API docs that say that
> inner flags must always be set in this case or that setting only
> outer offloads is not allowed?

Setting the tunnel type (which is set along outer checksum in OVS) is
described as:

/**
 * Bits 45:48 used for the tunnel type.
 * The tunnel type must be specified for TSO or checksum on the inner part
 * of tunnel packets.
 * These flags can be used with RTE_MBUF_F_TX_TCP_SEG for TSO, or
 * RTE_MBUF_F_TX_xxx_CKSUM.
 * The mbuf fields for inner and outer header lengths are required:
 * outer_l2_len, outer_l3_len, l2_len, l3_len, l4_len and tso_segsz for TSO.
 */
#define RTE_MBUF_F_TX_TUNNEL_VXLAN   (0x1ULL << 45)
#define RTE_MBUF_F_TX_TUNNEL_GRE     (0x2ULL << 45)
#define RTE_MBUF_F_TX_TUNNEL_IPIP    (0x3ULL << 45)
#define RTE_MBUF_F_TX_TUNNEL_GENEVE  (0x4ULL << 45)
/** TX packet with MPLS-in-UDP RFC 7510 header. */
#define RTE_MBUF_F_TX_TUNNEL_MPLSINUDP (0x5ULL << 45)
#define RTE_MBUF_F_TX_TUNNEL_VXLAN_GPE (0x6ULL << 45)
#define RTE_MBUF_F_TX_TUNNEL_GTP       (0x7ULL << 45)
#define RTE_MBUF_F_TX_TUNNEL_ESP       (0x8ULL << 45)

It is not specified what to expect it neither TSO nor inner checksum
is requested.

In a same way, it is not described what to expect if outer API is
called with no inner offload.
Adding Ferruh and Thomas who may have one opinion.


>
> I agree that it seems safer to just downgrade all outer flags to
> inner ones on OVS side, when no inner offloads are requested, I'm
> just not sure I agree that it's an API misuse.  Especially since
> non-Intel cards seem to work fine.

I suppose you mean mlx5.
Has it been tested on other nics?


>
> > Basically, resolving a neighbor with ARP inside tunnels is broken on
> > Intel nics (even without re-enabling outer udp checksum).
> > This can be observed with the following debug patch at the end of
> > netdev_dpdk_prep_hwol_packet():
> >
> > +    char buf[256];
> > +    if (rte_get_tx_ol_flag_list(mbuf->ol_flags, buf, sizeof(buf)) != 0)
> > +        buf[0] = '\0';
> > +    VLOG_WARN("len=%u, ol_flags=%s, outer_l2_len=%u, outer_l3_len=%u,
> > l2_len=%u, l3_len=%u, l4_len=%u, tso_segsz=%u", mbuf->pkt_len, buf,
> > mbuf->outer_l2_len, mbuf->outer_l3_len, mbuf->l2_len, mbuf->l3_len,
> > mbuf->l4_len, mbuf->tso_segsz);
> >
> > Then doing a "arping" inside the tunnel triggers:
> > 2024-04-03T16:05:40.920Z|00014|netdev_dpdk(pmd-c03/id:8)|WARN|len=96,
> > ol_flags=RTE_MBUF_F_TX_L4_NO_CKSUM RTE_MBUF_F_TX_OUTER_IP_CKSUM
> > RTE_MBUF_F_TX_OUTER_IPV4 RTE_MBUF_F_TX_TUNNEL_VXLAN , outer_l2_len=18,
> > outer_l3_len=20, l2_len=0, l3_len=0, l4_len=0, tso_segsz=0
> > 2024-04-03T16:05:40.920Z|00012|dpdk|WARN|ice_interrupt_handler():
> > OICR: MDD event
> >
> > We need this fix in OVS regardless of the outer udp checksum issue.
> > I'll respin this fix in a new series, without touching UDP checksum capa.
> >
> >
> > - It does seem that X710 nics have no support for outer udp checksum
> > (according to its datasheet). Some X722 version may have support for
> > it, but net/i40e does not configure the Tx descriptor accordingly.
> > On the other hand, E810 ones seem fine (according to its datasheet).
> >
> > After more debugging, I managed to get outer udp checksum working.
> > I understand the DPDK rte_net_intel_cksum_flags_prepare() helper does
> > not set the pseudo header checksum in the outer udp header.
> > I proposed a fix in the dpdk bz.
> >
> > Waiting for the fix on DPDK side... it is still possible to add the
> > missing bits in OVS (see the branch I pointed at in the OVS issue).
>
> Since this feature never worked with ice in OVS and it is experimental,
> I tend to think that we should just disable it for ice as well until
> DPDK is fixed.
>
> A little too many fixes for that thing we have already and this one will
> involve some extra driver-specific logic that we don't have any automated
> tests for.

I don't mind waiting for the DPDK fix before re-enabling outer udp and
other offloads.
Ilya Maximets April 5, 2024, 1 p.m. UTC | #4
On 4/4/24 14:46, David Marchand wrote:
> On Wed, Apr 3, 2024 at 8:13 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>> - This patch fixes some misusage of the DPDK API.
>>
>> Hmm, I understand that the driver does something funny when it gets
>> outer flags set without any inner flags, but how is that a misuse
>> of the DPDK API?  Could you point me to the API docs that say that
>> inner flags must always be set in this case or that setting only
>> outer offloads is not allowed?
> 
> Setting the tunnel type (which is set along outer checksum in OVS) is
> described as:
> 
> /**
>  * Bits 45:48 used for the tunnel type.
>  * The tunnel type must be specified for TSO or checksum on the inner part
>  * of tunnel packets.
>  * These flags can be used with RTE_MBUF_F_TX_TCP_SEG for TSO, or
>  * RTE_MBUF_F_TX_xxx_CKSUM.
>  * The mbuf fields for inner and outer header lengths are required:
>  * outer_l2_len, outer_l3_len, l2_len, l3_len, l4_len and tso_segsz for TSO.
>  */
> #define RTE_MBUF_F_TX_TUNNEL_VXLAN   (0x1ULL << 45)
> #define RTE_MBUF_F_TX_TUNNEL_GRE     (0x2ULL << 45)
> #define RTE_MBUF_F_TX_TUNNEL_IPIP    (0x3ULL << 45)
> #define RTE_MBUF_F_TX_TUNNEL_GENEVE  (0x4ULL << 45)
> /** TX packet with MPLS-in-UDP RFC 7510 header. */
> #define RTE_MBUF_F_TX_TUNNEL_MPLSINUDP (0x5ULL << 45)
> #define RTE_MBUF_F_TX_TUNNEL_VXLAN_GPE (0x6ULL << 45)
> #define RTE_MBUF_F_TX_TUNNEL_GTP       (0x7ULL << 45)
> #define RTE_MBUF_F_TX_TUNNEL_ESP       (0x8ULL << 45)
> 
> It is not specified what to expect it neither TSO nor inner checksum
> is requested.

I can agree on that part that specifying the tunnel type while not
requesting any inner offloads is kind of undefined.

However, this is counter-intuitive that providing more of correct
information about a packet makes the driver freak out.

While we should work around this, I would still not call it a misuse.

DPDK needs to define the behavior for this case.  Two options are:

a. Document that tunnel type must not be set if inner offloads
   are not requested.
b. Fix the affected drivers to not break the packets and document
   that any valid informational flags can be set in ol_flags.

Option 'a' sounds like a poor API design though.

BTW, why the outer checksums are documented as inner:
 https://doc.dpdk.org/guides/nics/features.html#inner-l3-checksum
 https://doc.dpdk.org/guides/nics/features.html#inner-l4-checksum
?

> 
> In a same way, it is not described what to expect if outer API is
> called with no inner offload.

I disagree on that one.

/**
 * Outer UDP checksum offload flag. This flag is used for enabling
 * outer UDP checksum in PMD. To use outer UDP checksum, the user needs to
 * 1) Enable the following in mbuf,
 * a) Fill outer_l2_len and outer_l3_len in mbuf.
 * b) Set the RTE_MBUF_F_TX_OUTER_UDP_CKSUM flag.
 * c) Set the RTE_MBUF_F_TX_OUTER_IPV4 or RTE_MBUF_F_TX_OUTER_IPV6 flag.
 * 2) Configure RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM offload flag.
 */
#define RTE_MBUF_F_TX_OUTER_UDP_CKSUM     (1ULL << 41)

Reading this and other documentation in the file I would expect the
following two sets of flags to be equivalent on a card that supports
tunnel offloads:

 - RTE_MBUF_F_TX_OUTER_IPV4|RTE_MBUF_F_TX_OUTER_UDP_CKSUM with the
   outer_l2_len and outer_l3_len filled in.

 - RTE_MBUF_F_TX_IPV4|RTE_MBUF_F_TX_UDP_CKSUM with
   l2_len and l3_len filled in.

If these are not equivalent, that sounds like a bug in DPDK.

BTW, the l2_len field description is kind of vague:

            uint64_t l2_len:RTE_MBUF_L2_LEN_BITS;
            /**< L2 (MAC) Header Length for non-tunneling pkt.
             * Outer_L4_len + ... + Inner_L2_len for tunneling pkt.
             */

Reading other comments, it seems that setting any "outer" bits
in ol_flags constitutes a 'tunneling pkt', but it's not said explicitly
in this particular comment.

> Adding Ferruh and Thomas who may have one opinion.
> 
> 
>>
>> I agree that it seems safer to just downgrade all outer flags to
>> inner ones on OVS side, when no inner offloads are requested, I'm
>> just not sure I agree that it's an API misuse.  Especially since
>> non-Intel cards seem to work fine.
> 
> I suppose you mean mlx5.

Yes, I think it was tested with mlx5.

> Has it been tested on other nics?

I didn't test.  The original reporter doesn't have other cards, IIRC.

>>
>>> Basically, resolving a neighbor with ARP inside tunnels is broken on
>>> Intel nics (even without re-enabling outer udp checksum).
>>> This can be observed with the following debug patch at the end of
>>> netdev_dpdk_prep_hwol_packet():
>>>
>>> +    char buf[256];
>>> +    if (rte_get_tx_ol_flag_list(mbuf->ol_flags, buf, sizeof(buf)) != 0)
>>> +        buf[0] = '\0';
>>> +    VLOG_WARN("len=%u, ol_flags=%s, outer_l2_len=%u, outer_l3_len=%u,
>>> l2_len=%u, l3_len=%u, l4_len=%u, tso_segsz=%u", mbuf->pkt_len, buf,
>>> mbuf->outer_l2_len, mbuf->outer_l3_len, mbuf->l2_len, mbuf->l3_len,
>>> mbuf->l4_len, mbuf->tso_segsz);
>>>
>>> Then doing a "arping" inside the tunnel triggers:
>>> 2024-04-03T16:05:40.920Z|00014|netdev_dpdk(pmd-c03/id:8)|WARN|len=96,
>>> ol_flags=RTE_MBUF_F_TX_L4_NO_CKSUM RTE_MBUF_F_TX_OUTER_IP_CKSUM
>>> RTE_MBUF_F_TX_OUTER_IPV4 RTE_MBUF_F_TX_TUNNEL_VXLAN , outer_l2_len=18,
>>> outer_l3_len=20, l2_len=0, l3_len=0, l4_len=0, tso_segsz=0

The fact that l2_len and l3_len are not set here looks like an OVS
bug though, as AFAIU, these should always be set if any Tx offload
is requested.

>>> 2024-04-03T16:05:40.920Z|00012|dpdk|WARN|ice_interrupt_handler():
>>> OICR: MDD event
>>>
>>> We need this fix in OVS regardless of the outer udp checksum issue.
>>> I'll respin this fix in a new series, without touching UDP checksum capa.
>>>
>>>
>>> - It does seem that X710 nics have no support for outer udp checksum
>>> (according to its datasheet). Some X722 version may have support for
>>> it, but net/i40e does not configure the Tx descriptor accordingly.
>>> On the other hand, E810 ones seem fine (according to its datasheet).
>>>
>>> After more debugging, I managed to get outer udp checksum working.
>>> I understand the DPDK rte_net_intel_cksum_flags_prepare() helper does
>>> not set the pseudo header checksum in the outer udp header.
>>> I proposed a fix in the dpdk bz.
>>>
>>> Waiting for the fix on DPDK side... it is still possible to add the
>>> missing bits in OVS (see the branch I pointed at in the OVS issue).
>>
>> Since this feature never worked with ice in OVS and it is experimental,
>> I tend to think that we should just disable it for ice as well until
>> DPDK is fixed.
>>
>> A little too many fixes for that thing we have already and this one will
>> involve some extra driver-specific logic that we don't have any automated
>> tests for.
> 
> I don't mind waiting for the DPDK fix before re-enabling outer udp and
> other offloads.
Just to be clear, I think we need:
 - a fix to downgrade outer offloads to inner ones in cases where no
   inner offloads requested.
 - keep manually disabling tunnel offloads for Intel NICs, adding iavf
   to the list.

But we should not:
 - add pseudo-header checksum calculations to OVS to workaround the
   driver behavior.

Best regards, Ilya Maximets.
David Marchand April 8, 2024, 2:02 p.m. UTC | #5
On Fri, Apr 5, 2024 at 3:00 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >>> Basically, resolving a neighbor with ARP inside tunnels is broken on
> >>> Intel nics (even without re-enabling outer udp checksum).
> >>> This can be observed with the following debug patch at the end of
> >>> netdev_dpdk_prep_hwol_packet():
> >>>
> >>> +    char buf[256];
> >>> +    if (rte_get_tx_ol_flag_list(mbuf->ol_flags, buf, sizeof(buf)) != 0)
> >>> +        buf[0] = '\0';
> >>> +    VLOG_WARN("len=%u, ol_flags=%s, outer_l2_len=%u, outer_l3_len=%u,
> >>> l2_len=%u, l3_len=%u, l4_len=%u, tso_segsz=%u", mbuf->pkt_len, buf,
> >>> mbuf->outer_l2_len, mbuf->outer_l3_len, mbuf->l2_len, mbuf->l3_len,
> >>> mbuf->l4_len, mbuf->tso_segsz);
> >>>
> >>> Then doing a "arping" inside the tunnel triggers:
> >>> 2024-04-03T16:05:40.920Z|00014|netdev_dpdk(pmd-c03/id:8)|WARN|len=96,
> >>> ol_flags=RTE_MBUF_F_TX_L4_NO_CKSUM RTE_MBUF_F_TX_OUTER_IP_CKSUM
> >>> RTE_MBUF_F_TX_OUTER_IPV4 RTE_MBUF_F_TX_TUNNEL_VXLAN , outer_l2_len=18,
> >>> outer_l3_len=20, l2_len=0, l3_len=0, l4_len=0, tso_segsz=0
>
> The fact that l2_len and l3_len are not set here looks like an OVS
> bug though, as AFAIU, these should always be set if any Tx offload
> is requested.

The commit that introduces such Tx offloads requests is:
f81d782c19 - netdev-native-tnl: Mark all vxlan/geneve packets as
tunneled. (7 weeks ago) <Mike Pattrick>
diff mbox series

Patch

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 2111f77681..ae43594a3d 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -1354,18 +1354,6 @@  dpdk_eth_dev_init(struct netdev_dpdk *dev)
         info.tx_offload_capa &= ~RTE_ETH_TX_OFFLOAD_TCP_CKSUM;
     }
 
-    if (!strcmp(info.driver_name, "net_ice")
-        || !strcmp(info.driver_name, "net_i40e")) {
-        /* FIXME: Driver advertises the capability but doesn't seem
-         * to actually support it correctly.  Can remove this once
-         * the driver is fixed on DPDK side. */
-        VLOG_INFO("%s: disabled Tx outer udp checksum offloads for a "
-                  "net/ice or net/i40e port.", netdev_get_name(&dev->up));
-        info.tx_offload_capa &= ~RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM;
-        info.tx_offload_capa &= ~RTE_ETH_TX_OFFLOAD_VXLAN_TNL_TSO;
-        info.tx_offload_capa &= ~RTE_ETH_TX_OFFLOAD_GENEVE_TNL_TSO;
-    }
-
     if (info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_IPV4_CKSUM) {
         dev->hw_ol_features |= NETDEV_TX_IPV4_CKSUM_OFFLOAD;
     } else {
@@ -2584,16 +2572,18 @@  netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
     struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf);
     struct tcp_header *th;
 
-    const uint64_t all_requests = (RTE_MBUF_F_TX_IP_CKSUM |
-                                   RTE_MBUF_F_TX_L4_MASK  |
-                                   RTE_MBUF_F_TX_OUTER_IP_CKSUM  |
-                                   RTE_MBUF_F_TX_OUTER_UDP_CKSUM |
-                                   RTE_MBUF_F_TX_TCP_SEG);
-    const uint64_t all_marks = (RTE_MBUF_F_TX_IPV4 |
-                                RTE_MBUF_F_TX_IPV6 |
-                                RTE_MBUF_F_TX_OUTER_IPV4 |
-                                RTE_MBUF_F_TX_OUTER_IPV6 |
-                                RTE_MBUF_F_TX_TUNNEL_MASK);
+    const uint64_t all_inner_requests = (RTE_MBUF_F_TX_IP_CKSUM |
+                                         RTE_MBUF_F_TX_L4_MASK |
+                                         RTE_MBUF_F_TX_TCP_SEG);
+    const uint64_t all_outer_requests = (RTE_MBUF_F_TX_OUTER_IP_CKSUM  |
+                                          RTE_MBUF_F_TX_OUTER_UDP_CKSUM);
+    const uint64_t all_requests = all_inner_requests | all_outer_requests;
+    const uint64_t all_inner_marks = (RTE_MBUF_F_TX_IPV4 |
+                                      RTE_MBUF_F_TX_IPV6);
+    const uint64_t all_outer_marks = (RTE_MBUF_F_TX_OUTER_IPV4 |
+                                      RTE_MBUF_F_TX_OUTER_IPV6 |
+                                      RTE_MBUF_F_TX_TUNNEL_MASK);
+    const uint64_t all_marks = all_inner_marks | all_outer_marks;
 
     if (!(mbuf->ol_flags & all_requests)) {
         /* No offloads requested, no marks should be set. */
@@ -2610,32 +2600,45 @@  netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
         return true;
     }
 
+    const uint64_t tunnel_type = mbuf->ol_flags & RTE_MBUF_F_TX_TUNNEL_MASK;
+    if (OVS_UNLIKELY(tunnel_type
+                     && tunnel_type != RTE_MBUF_F_TX_TUNNEL_GENEVE
+                     && tunnel_type != RTE_MBUF_F_TX_TUNNEL_VXLAN)) {
+        VLOG_WARN_RL(&rl, "%s: Unexpected tunnel type: %#"PRIx64,
+                     netdev_get_name(&dev->up), tunnel_type);
+        netdev_dpdk_mbuf_dump(netdev_get_name(&dev->up),
+                              "Packet with unexpected tunnel type", mbuf);
+        return false;
+    }
+
     /* If packet is vxlan or geneve tunnel packet, calculate outer
      * l2 len and outer l3 len. Inner l2/l3/l4 len are calculated
      * before. */
-    const uint64_t tunnel_type = mbuf->ol_flags & RTE_MBUF_F_TX_TUNNEL_MASK;
-    if (tunnel_type == RTE_MBUF_F_TX_TUNNEL_GENEVE ||
-        tunnel_type == RTE_MBUF_F_TX_TUNNEL_VXLAN) {
+    if ((tunnel_type == RTE_MBUF_F_TX_TUNNEL_GENEVE
+         || tunnel_type == RTE_MBUF_F_TX_TUNNEL_VXLAN)
+        && (mbuf->ol_flags & all_inner_requests)) {
+
         mbuf->outer_l2_len = (char *) dp_packet_l3(pkt) -
                  (char *) dp_packet_eth(pkt);
         mbuf->outer_l3_len = (char *) dp_packet_l4(pkt) -
                  (char *) dp_packet_l3(pkt);
-
-        /* If neither inner checksums nor TSO is requested, inner marks
-         * should not be set. */
-        if (!(mbuf->ol_flags & (RTE_MBUF_F_TX_IP_CKSUM |
-                                RTE_MBUF_F_TX_L4_MASK  |
-                                RTE_MBUF_F_TX_TCP_SEG))) {
-            mbuf->ol_flags &= ~(RTE_MBUF_F_TX_IPV4 |
-                                RTE_MBUF_F_TX_IPV6);
-        }
-    } else if (OVS_UNLIKELY(tunnel_type)) {
-        VLOG_WARN_RL(&rl, "%s: Unexpected tunnel type: %#"PRIx64,
-                     netdev_get_name(&dev->up), tunnel_type);
-        netdev_dpdk_mbuf_dump(netdev_get_name(&dev->up),
-                              "Packet with unexpected tunnel type", mbuf);
-        return false;
     } else {
+        if (OVS_UNLIKELY(!(mbuf->ol_flags & all_inner_requests))) {
+            /* If no inner offloading is requested, fallback to non tunneling
+             * checksum offloads. */
+
+            mbuf->ol_flags &= ~all_inner_marks;
+            if (mbuf->ol_flags & RTE_MBUF_F_TX_OUTER_IP_CKSUM) {
+                mbuf->ol_flags |= RTE_MBUF_F_TX_IP_CKSUM;
+                mbuf->ol_flags |= RTE_MBUF_F_TX_IPV4;
+            }
+            if (mbuf->ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM) {
+                mbuf->ol_flags |= RTE_MBUF_F_TX_UDP_CKSUM;
+                mbuf->ol_flags |= mbuf->ol_flags & RTE_MBUF_F_TX_OUTER_IPV4
+                                  ? RTE_MBUF_F_TX_IPV4 : RTE_MBUF_F_TX_IPV6;
+            }
+            mbuf->ol_flags &= ~(all_outer_requests | all_outer_marks);
+        }
         mbuf->l2_len = (char *) dp_packet_l3(pkt) -
                (char *) dp_packet_eth(pkt);
         mbuf->l3_len = (char *) dp_packet_l4(pkt) -