
[ISSUE,4.20.6] mlx5 and checksum failures

Message ID CAA85sZtE7Gv8mKL5tUh8AJ4yG9xd_HZh9svWkHXm=j7VohD1Cw@mail.gmail.com
State Not Applicable
Delegated to: David Miller

Commit Message

Ian Kumlien Feb. 6, 2019, 4:16 p.m. UTC
Hi,

I'm hitting an issue that I think is fixed by the following patch.
I haven't verified it yet, but it looks like it should go on the
stable queue(?)

(And yes, I did look, and couldn't find it ;))

commit e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Mon Dec 3 22:14:04 2018 -0800

    net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames

    When an ethernet frame is padded to meet the minimum ethernet frame
    size, the padding octets are not covered by the hardware checksum.
    Fortunately the padding octets are usually zero's, which don't affect
    checksum. However, we have a switch which pads non-zero octets, this
    causes kernel hardware checksum fault repeatedly.

    Prior to:
    commit '88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE ...")'
    skb checksum was forced to be CHECKSUM_NONE when padding is detected.
    After it, we need to keep skb->csum updated, like what we do for RXFCS.
    However, fixing up CHECKSUM_COMPLETE requires to verify and parse IP
    headers, it is not worthy the effort as the packets are so small that
    CHECKSUM_COMPLETE can't save anything.

    Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Tariq Toukan <tariqt@mellanox.com>
    Cc: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

                                     struct mlx5_cqe64 *cqe,
                                     struct mlx5e_rq *rq,
@@ -754,6 +756,17 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
        if (unlikely(test_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, &rq->state)))
                goto csum_unnecessary;

+       /* CQE csum doesn't cover padding octets in short ethernet
+        * frames. And the pad field is appended prior to calculating
+        * and appending the FCS field.
+        *
+        * Detecting these padded frames requires to verify and parse
+        * IP headers, so we simply force all those small frames to be
+        * CHECKSUM_UNNECESSARY even if they are not padded.
+        */
+       if (short_frame(skb->len))
+               goto csum_unnecessary;
+
        if (likely(is_last_ethertype_ip(skb, &network_depth, &proto))) {
                if (unlikely(get_ip_proto(skb, network_depth, proto) == IPPROTO_SCTP))
                        goto csum_unnecessary;
---
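
To illustrate the commit message's point that zero padding does not affect the checksum while non-zero padding does, here is a minimal user-space sketch of the 16-bit ones' complement sum. It is not kernel code: `csum16` and `padding_changes_csum` are hypothetical helpers written for this note, and the 64-byte scratch buffer is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a kernel API): 16-bit ones' complement sum
 * over a byte buffer, the arithmetic behind the Internet checksum. */
static uint16_t csum16(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)                        /* odd trailing byte */
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)                   /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Hypothetical helper: pad `data` up to `padded_len` bytes with `pad`
 * and report whether the sum over the padded buffer differs from the
 * sum over the unpadded data. Returns -1 on invalid lengths. */
static int padding_changes_csum(const uint8_t *data, size_t len,
                                size_t padded_len, uint8_t pad)
{
    uint8_t buf[64];

    if (len > padded_len || padded_len > sizeof(buf))
        return -1;
    memcpy(buf, data, len);
    memset(buf + len, pad, padded_len - len);
    return csum16(buf, padded_len) != csum16(data, len);
}
```

Padding the four bytes 45 00 00 1c to six bytes with 0x00 leaves the sum unchanged, while padding with 0xab changes it. This only shows the arithmetic; how the NIC and the stack actually handle CHECKSUM_COMPLETE is more involved.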

Kernel log:
[ 3226.017424] bond0: hw csum failure
[ 3226.018387] CPU: 13 PID: 0 Comm: swapper/13 Tainted: G          I 4.20.6-1.el7.elrepo.x86_64 #1
[ 3226.020928] Hardware name: HP ProLiant DL380 G6, BIOS P62 01/22/2015
[ 3226.022649] Call Trace:
[ 3226.023409]  <IRQ>
[ 3226.024039]  dump_stack+0x63/0x88
[ 3226.025066]  netdev_rx_csum_fault+0x3a/0x40
[ 3226.026208]  __skb_checksum_complete+0xd5/0xe0
[ 3226.027418]  nf_ip_checksum+0xc9/0xf0
[ 3226.028474]  nf_checksum+0x2d/0x40
[ 3226.029504]  tcp_packet+0x2ce/0xa20 [nf_conntrack]
[ 3226.030913]  ? tcp_v4_do_rcv+0x77/0x1f0
[ 3226.032094]  ? sock_put+0x19/0x20
[ 3226.033070]  ? nf_ct_deliver_cached_events+0xd0/0x110 [nf_conntrack]
[ 3226.034754]  nf_conntrack_in+0x140/0x510 [nf_conntrack]
[ 3226.036228]  ipv4_conntrack_in+0x14/0x20 [nf_conntrack]
[ 3226.037646]  nf_hook_slow+0x42/0xc0
[ 3226.038626]  ip_rcv+0xb5/0xd0
[ 3226.039480]  ? ip_local_deliver_finish+0x1e0/0x1e0
[ 3226.040767]  __netif_receive_skb_one_core+0x57/0x80
[ 3226.042155]  __netif_receive_skb+0x18/0x60
[ 3226.043275]  netif_receive_skb_internal+0x45/0xf0
[ 3226.044530]  napi_gro_receive+0xd0/0xf0
[ 3226.045665]  mlx5e_handle_rx_cqe+0x1e6/0x540 [mlx5_core]
[ 3226.047167]  mlx5e_poll_rx_cq+0xd6/0x9c0 [mlx5_core]
[ 3226.048516]  mlx5e_napi_poll+0xc2/0xcd0 [mlx5_core]
[ 3226.049836]  ? mlx5_eq_int+0x4b4/0x6c0 [mlx5_core]
[ 3226.051118]  net_rx_action+0x289/0x3d0
[ 3226.052257]  __do_softirq+0xd5/0x2a2
[ 3226.053277]  irq_exit+0xe8/0x100
[ 3226.054183]  do_IRQ+0x59/0xe0
[ 3226.055014]  common_interrupt+0xf/0xf
[ 3226.056038]  </IRQ>
[ 3226.056722] RIP: 0010:cpuidle_enter_state+0xba/0x2f0
[ 3226.058087] Code: d0 95 7e e8 38 07 a1 ff 41 8b 5c 24 04 49 89 c6 66 66 66 66 90 31 ff e8 34 19 a1 ff 80 7d cf 00 0f 85 8c 01 00 00 fb 66 66 90 <66> 66 90 45 85 ed 0f 88 94 01 00 00 4c 2b 75 c0 48 ba cf f7 53 e3
[ 3226.062925] RSP: 0018:ffffc9000c547e50 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd6
[ 3226.064974] RAX: ffff88a3df7a2dc0 RBX: 000000000000000d RCX: 000000000000001f
[ 3226.066866] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
[ 3226.068747] RBP: ffffc9000c547e90 R08: 0000000000000002 R09: ffffffcdc506f2e7
[ 3226.070622] R10: 0000000000000018 R11: 071c71c71c71c71c R12: ffffe8ffffb96f00
[ 3226.072525] R13: 0000000000000004 R14: 000002ef1d9f1e10 R15: ffff88a3d8900000
[ 3226.074479]  cpuidle_enter+0x17/0x20
[ 3226.075463]  call_cpuidle+0x23/0x40
[ 3226.076412]  do_idle+0x1db/0x280
[ 3226.077323]  cpu_startup_entry+0x1d/0x30
[ 3226.078417]  start_secondary+0x1ae/0x200
[ 3226.079490]  secondary_startup_64+0xa4/0xb0

Comments

Saeed Mahameed Feb. 6, 2019, 5:18 p.m. UTC | #1
On Wed, 2019-02-06 at 17:16 +0100, Ian Kumlien wrote:
> Hi,
> 
> I'm hitting an issue that i think is fixed by the following patch,
> i haven't verified it yet - but it looks like it should go on the
> stable queue(?)
> 
> (And yes, I did look, and couldn't find it ;))
> 

Yes, I couldn't find it either.

It should have been queued up for 4.18 by now.
Dave said he would take care of it; maybe he just forgot,
since the patch needed some extra care.

https://patchwork.ozlabs.org/patch/1027837/

Dave, is there anything I can do here?

Thanks,
Saeed.

> commit e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
> Author: Cong Wang <xiyou.wangcong@gmail.com>
> Date:   Mon Dec 3 22:14:04 2018 -0800
> 
>     net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames
David Miller Feb. 6, 2019, 10:03 p.m. UTC | #2
From: Saeed Mahameed <saeedm@mellanox.com>
Date: Wed, 6 Feb 2019 17:18:55 +0000

> On Wed, 2019-02-06 at 17:16 +0100, Ian Kumlien wrote:
>> Hi,
>> 
>> I'm hitting an issue that i think is fixed by the following patch,
>> i haven't verified it yet - but it looks like it should go on the
>> stable queue(?)
>> 
>> (And yes, I did look, and couldn't find it ;))
>> 
> 
> Yes, i couldn't find it neither,
> 
> It should have been queued up for 4.18 by now.
> Dave said he will take care of it, maybe he just forgot or something.
> since the patch needed some extra care..
> 
> https://patchwork.ozlabs.org/patch/1027837/
> 
> Dave, Is there anything i can do here ?

I never handle anything past the most recent two -stable releases,
which right now are 4.20 and 4.19.

For anything beyond that you have to contact the person who maintains
that -stable tree.
Ian Kumlien Feb. 6, 2019, 10:12 p.m. UTC | #3
On Wed, Feb 6, 2019 at 11:03 PM David Miller <davem@davemloft.net> wrote:
> From: Saeed Mahameed <saeedm@mellanox.com>
> Date: Wed, 6 Feb 2019 17:18:55 +0000

> > On Wed, 2019-02-06 at 17:16 +0100, Ian Kumlien wrote:
> >> Hi,
> >>
> >> I'm hitting an issue that i think is fixed by the following patch,
> >> i haven't verified it yet - but it looks like it should go on the
> >> stable queue(?)
> >>
> >> (And yes, I did look, and couldn't find it ;))
> >>
> >
> > Yes, i couldn't find it neither,
> >
> > It should have been queued up for 4.18 by now.
> > Dave said he will take care of it, maybe he just forgot or something.
> > since the patch needed some extra care..
> >
> > https://patchwork.ozlabs.org/patch/1027837/
> >
> > Dave, Is there anything i can do here ?
>
> I never handle anything past the most recent two -stable releases,
> which right now is 4.20 and 4.19

Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things

> For anything beyond that you have to contact the person who maintains
> that -stable tree.

I think it needs to be applied to all -stable since 4.18 :/
David Miller Feb. 6, 2019, 10:22 p.m. UTC | #4
From: Ian Kumlien <ian.kumlien@gmail.com>
Date: Wed, 6 Feb 2019 23:12:53 +0100

> Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things

It's... there:

https://patchwork.ozlabs.org/bundle/davem/stable/?series=&submitter=&state=*&q=&archive=
Ian Kumlien Feb. 6, 2019, 10:33 p.m. UTC | #5
On Wed, Feb 6, 2019 at 11:22 PM David Miller <davem@davemloft.net> wrote:
>
> From: Ian Kumlien <ian.kumlien@gmail.com>
> Date: Wed, 6 Feb 2019 23:12:53 +0100
>
> > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
>
> Its... there:
>
> https://patchwork.ozlabs.org/bundle/davem/stable/?series=&submitter=&state=*&q=&archive=

F... Sorry, yet again...

I thought that I DID look at patchwork, but apparently accepted patches
aren't listed by default.
Cong Wang Feb. 6, 2019, 10:38 p.m. UTC | #6
On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things

It doesn't break anything, packets are _not_ dropped, only that the
warning itself is noisy.
Ian Kumlien Feb. 6, 2019, 10:41 p.m. UTC | #7
On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
>
> It doesn't break anything, packets are _not_ dropped, only that the
> warning itself is noisy.

Not my experience; for me it slows the machine down and loses packets.
I don't, however, know if this is the only culprit.

You can actually see it on ping, where it starts out at 0.0xyx and
ends up at ~10ms.

But as I said, I assume this is the culprit - further investigation
will be done =)
Cong Wang Feb. 6, 2019, 10:49 p.m. UTC | #8
On Wed, Feb 6, 2019 at 2:41 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
> >
> > It doesn't break anything, packets are _not_ dropped, only that the
> > warning itself is noisy.
>
> Not my experience, to me it slows the machine down and looses packets,
> I don't however know
> if this is the only culprit

Packet processing could be slowed down by printing
this kernel warning. Packets should still be delivered to the upper
stack; at least I didn't see any packet drops because of this.

>
> You can actually see it on ping where it start out with 0.0xyx and
> ends up at ~10ms
>

I don't understand how it could affect ICMP, it is purely TCP
from my point of view, even the stack trace from you says so. ;)

Thanks.
Ian Kumlien Feb. 6, 2019, 11 p.m. UTC | #9
On Wed, Feb 6, 2019 at 11:49 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Feb 6, 2019 at 2:41 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
> > >
> > > It doesn't break anything, packets are _not_ dropped, only that the
> > > warning itself is noisy.
> >
> > Not my experience, to me it slows the machine down and looses packets,
> > I don't however know
> > if this is the only culprit
>
> The packet process could be slow down because of printing
> out this kernel warning. Packet should be still delivered to upper
> stack, at least I didn't see any packet drops because of this.

I have several machines pushing the same errors currently, while on this
one I was logged in on the serial console and not over ssh like the others.

On the other machines, typing is slow, loses characters, and drops the
connection.

But, again, I don't know if this is the only culprit, it sure does
fill dmesg though =)
(which suddenly takes minutes to show over a 100gig connection)

> > You can actually see it on ping where it start out with 0.0xyx and
> > ends up at ~10ms
>
> I don't understand how it could affect ICMP, it is purely TCP
> from my point of view, even the stack trace from you says so. ;)

It changes directly after the first hw checksum failure, I don't know why =/
Saeed Mahameed Feb. 7, 2019, 1 a.m. UTC | #10
On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Wed, Feb 6, 2019 at 11:49 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Wed, Feb 6, 2019 at 2:41 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > >
> > > On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
> > > >
> > > > It doesn't break anything, packets are _not_ dropped, only that the
> > > > warning itself is noisy.
> > >
> > > Not my experience, to me it slows the machine down and looses packets,
> > > I don't however know
> > > if this is the only culprit
> >
> > The packet process could be slow down because of printing
> > out this kernel warning. Packet should be still delivered to upper
> > stack, at least I didn't see any packet drops because of this.
>
> I have several machines pushing the same errors currently, while on this
> one I was logged in on the serial console and not over ssh like the others.
>
> On the other machines, typing is slow, looses characters and drops the
> connection
>
> But, again, I don't know if this is the only culprit, it sure does
> fill dmesg though =)
> (which suddenly takes minutes to show over a 100gig connection)
>
> > > You can actually see it on ping where it start out with 0.0xyx and
> > > ends up at ~10ms
> >
> > I don't understand how it could affect ICMP, it is purely TCP
> > from my point of view, even the stack trace from you says so. ;)
>
> It changes directly after the first hw checksum failure, I don't know why =/

Weird, maybe a real checksumming issue/corruption on the PCI bus?!

Can you try turning off checksum offloads:
ethtool -K ethX rx off
Ian Kumlien Feb. 7, 2019, 10:16 a.m. UTC | #11
On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > It changes directly after the first hw checksum failure, I don't know why =/
>
> weird, Maybe a real check-summing issue/corruption on the PCI ?!

Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine

Just FYI, my dmesg testcase:
time ssh <server> "dmesg && exit"
real    3m5.845s
user    0m0.035s
sys     0m0.041s

> can you try turning off checksum offloads
> ethtool -K ethX  rx off

same test:
real    0m3.408s
user    0m0.022s
sys     0m0.032s

So yes, something in 4.20.6 goes wrong on the receiving part :/
Saeed Mahameed Feb. 7, 2019, 6:42 p.m. UTC | #12
On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > It changes directly after the first hw checksum failure, I don't know why =/
> >
> > weird, Maybe a real check-summing issue/corruption on the PCI ?!
>
> Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine
>

Great, the difference is only 120 patches.
That is bisectable; it should take only about 7 iterations to find the
offending commit.
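
As a rough check on the bisect cost for a range of this size: a binary search over n candidate commits needs at most ceil(log2(n)) test builds. The sketch below computes that bound; `bisect_steps` is a hypothetical helper written for this note, not part of git.

```c
#include <assert.h>

/* Hypothetical helper: worst-case number of test builds a binary
 * search needs to isolate one bad commit among n candidates,
 * i.e. ceil(log2(n)). */
static unsigned int bisect_steps(unsigned int n)
{
    unsigned int steps = 0;
    unsigned long long span = 1;    /* wide enough that doubling cannot wrap */

    while (span < n) {
        span *= 2;                  /* each test halves the remaining range */
        steps++;
    }
    return steps;
}
```

For the 120 patches between 4.20.5 and 4.20.6, `bisect_steps(120)` gives 7, so on a machine that takes minutes to boot a full bisect is roughly seven build-and-reboot cycles.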


> Just FYI, my dmesg testcase:
> time ssh <server> "dmesg && exit
> real    3m5.845s
> user    0m0.035s
> sys     0m0.041s
>
> > can you try turning off checksum offloads
> > ethtool -K ethX  rx off
>
> same test:
> real    0m3.408s
> user    0m0.022s
> sys     0m0.032s
>
> So yes, something in 4.20.6 goes wrong on the receiving part :/
Marcelo Ricardo Leitner Feb. 7, 2019, 9:40 p.m. UTC | #13
On Wed, Feb 06, 2019 at 11:41:28PM +0100, Ian Kumlien wrote:
> On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
> >
> > It doesn't break anything, packets are _not_ dropped, only that the
> > warning itself is noisy.
> 
> Not my experience, to me it slows the machine down and looses packets,
> I don't however know
> if this is the only culprit
> 
> You can actually see it on ping where it start out with 0.0xyx and
> ends up at ~10ms

Serial console is a/the killer in these situations.
Ian Kumlien Feb. 7, 2019, 10:01 p.m. UTC | #14
On Thu, Feb 7, 2019 at 7:43 PM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>
> On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > It changes directly after the first hw checksum failure, I don't know why =/
> > >
> > > weird, Maybe a real check-summing issue/corruption on the PCI ?!
> >
> > Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine
> >
>
> Great, the difference is only 120 patches.
> that is bisect-able, it will only take 5 iterations to find the
> offending commit.

I just wish it wasn't a server that takes what feels like 5 minutes to boot...

All of these seas of sensors 2d and 3d... =P

But, yep, that's the plan

> > Just FYI, my dmesg testcase:
> > time ssh <server> "dmesg && exit
> > real    3m5.845s
> > user    0m0.035s
> > sys     0m0.041s
> >
> > > can you try turning off checksum offloads
> > > ethtool -K ethX  rx off
> >
> > same test:
> > real    0m3.408s
> > user    0m0.022s
> > sys     0m0.032s
> >
> > So yes, something in 4.20.6 goes wrong on the receiving part :/
Ian Kumlien Feb. 8, 2019, 4:29 p.m. UTC | #15
On Thu, Feb 7, 2019 at 11:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> On Thu, Feb 7, 2019 at 7:43 PM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > It changes directly after the first hw checksum failure, I don't know why =/
> > > >
> > > > weird, Maybe a real check-summing issue/corruption on the PCI ?!
> > >
> > > Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine

> > Great, the difference is only 120 patches.
> > that is bisect-able, it will only take 5 iterations to find the
> > offending commit.
>
> I just wish it wasn't a server that takes, what feels like 5 minutes to boot...
>
> All of these seas of sensors 2d and 3d... =P
>
> But, yep, that's the plan

Huh, spent most of the day on two bisects and neither of them yielded
any results....

Looks like I'll have to start investigating the elrepo kernel-ml build =(
Ian Kumlien Feb. 9, 2019, 3:54 p.m. UTC | #16
On Fri, Feb 8, 2019 at 5:29 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> On Thu, Feb 7, 2019 at 11:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > On Thu, Feb 7, 2019 at 7:43 PM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > > > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > It changes directly after the first hw checksum failure, I don't know why =/
> > > > >
> > > > > weird, Maybe a real check-summing issue/corruption on the PCI ?!
> > > >
> > > > Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine
>
> > > Great, the difference is only 120 patches.
> > > that is bisect-able, it will only take 5 iterations to find the
> > > offending commit.
> >
> > I just wish it wasn't a server that takes, what feels like 5 minutes to boot...
> >
> > All of these seas of sensors 2d and 3d... =P
> >
> > But, yep, that's the plan
>
> Huh, spent most of the day with two bisects and none of them yielded
> any results....
>
> Looks like I'll have to start investigating the elrepo kernel-ml build =(

Just realized that it's not an entirely fair comparison - since
retpolines weren't enabled, damned old compilers...
Ian Kumlien Feb. 13, 2019, 12:04 p.m. UTC | #17
One last update on this, 4.20.8 compiled with the same compiler works
- I still suspect that it was fixed by:
net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames

Anyway, we can forget about it now ;)

On Sat, Feb 9, 2019 at 4:54 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Fri, Feb 8, 2019 at 5:29 PM Ian Kumlien <ian.kumlien@gmail.com> wrote
> > On Thu, Feb 7, 2019 at 11:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > On Thu, Feb 7, 2019 at 7:43 PM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > > On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > > > > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > > It changes directly after the first hw checksum failure, I don't know why =/
> > > > > >
> > > > > > weird, Maybe a real check-summing issue/corruption on the PCI ?!
> > > > >
> > > > > Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine
> >
> > > > Great, the difference is only 120 patches.
> > > > that is bisect-able, it will only take 5 iterations to find the
> > > > offending commit.
> > >
> > > I just wish it wasn't a server that takes, what feels like 5 minutes to boot...
> > >
> > > All of these seas of sensors 2d and 3d... =P
> > >
> > > But, yep, that's the plan
> >
> > Huh, spent most of the day with two bisects and none of them yielded
> > any results....
> >
> > Looks like I'll have to start investigating the elrepo kernel-ml build =(
>
> Just realized that it's not an entirely fair comparison - since
> retpolines wasn't enabled, damned old compilers...
Saeed Mahameed Feb. 13, 2019, 11:36 p.m. UTC | #18
On Wed, 2019-02-13 at 13:04 +0100, Ian Kumlien wrote:
> One last update on this, 4.20.8 compiled with the same compiler works
> - I still suspect that it was fixed by:
> net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames
> 
> Anyway, we can forget about it now ;)

cool, nice to know.

Thanks for the update.

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 1d0bb5ff8c26..f86e4804e83e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -732,6 +732,8 @@ static u8 get_ip_proto(struct sk_buff *skb, int network_depth, __be16 proto)
                                            ((struct ipv6hdr *)ip_p)->nexthdr;
 }

+#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
+
 static inline void mlx5e_handle_csum(struct net_device *netdev,
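
For reference, a user-space sketch of the threshold that `short_frame()` encodes. It assumes the standard values ETH_ZLEN = 60 and ETH_FCS_LEN = 4 from <linux/if_ether.h>; they are redefined locally here only so the snippet compiles outside the kernel tree.

```c
#include <assert.h>

/* Local copies of the kernel constants (assumed values, see above). */
#define ETH_ZLEN    60  /* minimum frame length in octets, without FCS */
#define ETH_FCS_LEN 4   /* octets in the frame check sequence */

/* Same macro as in the patch: true for frames of at most 64 octets. */
#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
```

With these values, any skb of 64 bytes or less takes the csum_unnecessary path, whether or not it was actually padded; a 65-byte frame does not.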