Message ID | CAA85sZtE7Gv8mKL5tUh8AJ4yG9xd_HZh9svWkHXm=j7VohD1Cw@mail.gmail.com |
---|---|
State | Not Applicable |
Delegated to: | David Miller |
Series | [ISSUE,4.20.6] mlx5 and checksum failures |

On Wed, 2019-02-06 at 17:16 +0100, Ian Kumlien wrote:
> Hi,
>
> I'm hitting an issue that i think is fixed by the following patch,
> i haven't verified it yet - but it looks like it should go on the
> stable queue(?)
>
> (And yes, I did look, and couldn't find it ;))
>

Yes, i couldn't find it neither,

It should have been queued up for 4.18 by now.
Dave said he will take care of it, maybe he just forgot or something.
since the patch needed some extra care..

https://patchwork.ozlabs.org/patch/1027837/

Dave, Is there anything i can do here ?

Thanks,
Saeed.

> commit e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
> Author: Cong Wang <xiyou.wangcong@gmail.com>
> Date: Mon Dec 3 22:14:04 2018 -0800
>
> net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Wed, 6 Feb 2019 17:18:55 +0000

> On Wed, 2019-02-06 at 17:16 +0100, Ian Kumlien wrote:
>> Hi,
>>
>> I'm hitting an issue that i think is fixed by the following patch,
>> i haven't verified it yet - but it looks like it should go on the
>> stable queue(?)
>>
>> (And yes, I did look, and couldn't find it ;))
>>
>
> Yes, i couldn't find it neither,
>
> It should have been queued up for 4.18 by now.
> Dave said he will take care of it, maybe he just forgot or something.
> since the patch needed some extra care..
>
> https://patchwork.ozlabs.org/patch/1027837/
>
> Dave, Is there anything i can do here ?

I never handle anything past the most recent two -stable releases,
which right now is 4.20 and 4.19

For anything beyond that you have to contact the person who maintains
that -stable tree.

On Wed, Feb 6, 2019 at 11:03 PM David Miller <davem@davemloft.net> wrote:
> From: Saeed Mahameed <saeedm@mellanox.com>
> Date: Wed, 6 Feb 2019 17:18:55 +0000
> > On Wed, 2019-02-06 at 17:16 +0100, Ian Kumlien wrote:
> >> Hi,
> >>
> >> I'm hitting an issue that i think is fixed by the following patch,
> >> i haven't verified it yet - but it looks like it should go on the
> >> stable queue(?)
> >>
> >> (And yes, I did look, and couldn't find it ;))
> >>
> >
> > Yes, i couldn't find it neither,
> >
> > It should have been queued up for 4.18 by now.
> > Dave said he will take care of it, maybe he just forgot or something.
> > since the patch needed some extra care..
> >
> > https://patchwork.ozlabs.org/patch/1027837/
> >
> > Dave, Is there anything i can do here ?
>
> I never handle anything past the most recent two -stable releases,
> which right now is 4.20 and 4.19

Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things

> For anything beyond that you have to contact the person who maintains
> that -stable tree.

I think it needs to be applied to all -stable since 4.18 :/

From: Ian Kumlien <ian.kumlien@gmail.com>
Date: Wed, 6 Feb 2019 23:12:53 +0100

> Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things

Its... there:

https://patchwork.ozlabs.org/bundle/davem/stable/?series=&submitter=&state=*&q=&archive=

On Wed, Feb 6, 2019 at 11:22 PM David Miller <davem@davemloft.net> wrote:
>
> From: Ian Kumlien <ian.kumlien@gmail.com>
> Date: Wed, 6 Feb 2019 23:12:53 +0100
>
> > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
>
> Its... there:
>
> https://patchwork.ozlabs.org/bundle/davem/stable/?series=&submitter=&state=*&q=&archive=

F... Sorry, yet again... I thought that I DID look at patchwork but
apparently accepted wasn't listed by default

On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things

It doesn't break anything, packets are _not_ dropped, only that the
warning itself is noisy.

On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
>
> It doesn't break anything, packets are _not_ dropped, only that the
> warning itself is noisy.

Not my experience, to me it slows the machine down and looses packets,
I don't however know
if this is the only culprit

You can actually see it on ping where it start out with 0.0xyx and
ends up at ~10ms

But as I said, I assume this is the culprit - further investigation
will be done =)

On Wed, Feb 6, 2019 at 2:41 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
> >
> > It doesn't break anything, packets are _not_ dropped, only that the
> > warning itself is noisy.
>
> Not my experience, to me it slows the machine down and looses packets,
> I don't however know
> if this is the only culprit

The packet process could be slow down because of printing
out this kernel warning. Packet should be still delivered to upper
stack, at least I didn't see any packet drops because of this.

>
> You can actually see it on ping where it start out with 0.0xyx and
> ends up at ~10ms
>

I don't understand how it could affect ICMP, it is purely TCP
from my point of view, even the stack trace from you says so. ;)

Thanks.

On Wed, Feb 6, 2019 at 11:49 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Feb 6, 2019 at 2:41 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
> > >
> > > It doesn't break anything, packets are _not_ dropped, only that the
> > > warning itself is noisy.
> >
> > Not my experience, to me it slows the machine down and looses packets,
> > I don't however know
> > if this is the only culprit
>
> The packet process could be slow down because of printing
> out this kernel warning. Packet should be still delivered to upper
> stack, at least I didn't see any packet drops because of this.

I have several machines pushing the same errors currently, while on this
one I was logged in on the serial console and not over ssh like the others.

On the other machines, typing is slow, looses characters and drops the
connection

But, again, I don't know if this is the only culprit, it sure does
fill dmesg though =)
(which suddenly takes minutes to show over a 100gig connection)

> > You can actually see it on ping where it start out with 0.0xyx and
> > ends up at ~10ms
>
> I don't understand how it could affect ICMP, it is purely TCP
> from my point of view, even the stack trace from you says so. ;)

It changes directly after the first hw checksum failure, I don't know why =/

On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Wed, Feb 6, 2019 at 11:49 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Wed, Feb 6, 2019 at 2:41 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > >
> > > On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
> > > >
> > > > It doesn't break anything, packets are _not_ dropped, only that the
> > > > warning itself is noisy.
> > >
> > > Not my experience, to me it slows the machine down and looses packets,
> > > I don't however know
> > > if this is the only culprit
> >
> > The packet process could be slow down because of printing
> > out this kernel warning. Packet should be still delivered to upper
> > stack, at least I didn't see any packet drops because of this.
>
> I have several machines pushing the same errors currently, while on this
> one I was logged in on the serial console and not over ssh like the others.
>
> On the other machines, typing is slow, looses characters and drops the
> connection
>
> But, again, I don't know if this is the only culprit, it sure does
> fill dmesg though =)
> (which suddenly takes minutes to show over a 100gig connection)
>
> > > You can actually see it on ping where it start out with 0.0xyx and
> > > ends up at ~10ms
> >
> > I don't understand how it could affect ICMP, it is purely TCP
> > from my point of view, even the stack trace from you says so. ;)
>
> It changes directly after the first hw checksum failure, I don't know why =/

weird, Maybe a real check-summing issue/corruption on the PCI ?!

can you try turning off checksum offloads
ethtool -K ethX rx off

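For anyone following along, the suggestion above amounts to looking at the current offload state and then switching RX checksum offload off. A minimal sketch, assuming a placeholder interface name `ethX`; the helper names `run`, `show_offloads` and `set_rx_csum` are made up for illustration, and with `DRY_RUN=1` the commands are only printed, not executed:

```sh
#!/bin/sh
# Sketch: inspect and toggle RX checksum offload on one interface.
# With DRY_RUN=1 the ethtool commands are printed instead of run, which
# is handy for reviewing them before touching a production machine.
# The helper names below are hypothetical, not part of ethtool.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Query the current offload settings (lowercase -k shows features).
show_offloads() {
    run ethtool -k "$1"
}

# Switch RX checksum offload on or off (uppercase -K changes features).
set_rx_csum() {
    run ethtool -K "$1" rx "$2"
}

# Example, printing rather than executing:
DRY_RUN=1
show_offloads ethX
set_rx_csum ethX off
```

Switching the offload back on after the test would be `set_rx_csum ethX on`; the change made by `ethtool -K` does not survive a reboot.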
On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > It changes directly after the first hw checksum failure, I don't know why =/
>
> weird, Maybe a real check-summing issue/corruption on the PCI ?!

Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine

Just FYI, my dmesg testcase:
time ssh <server> "dmesg && exit"
real 3m5.845s
user 0m0.035s
sys 0m0.041s

> can you try turning off checksum offloads
> ethtool -K ethX rx off

same test:
real 0m3.408s
user 0m0.022s
sys 0m0.032s

So yes, something in 4.20.6 goes wrong on the receiving part :/

On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > It changes directly after the first hw checksum failure, I don't know why =/
> >
> > weird, Maybe a real check-summing issue/corruption on the PCI ?!
>
> Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine
>

Great, the difference is only 120 patches.
that is bisect-able, it will only take 5 iterations to find the
offending commit.

> Just FYI, my dmesg testcase:
> time ssh <server> "dmesg && exit"
> real 3m5.845s
> user 0m0.035s
> sys 0m0.041s
>
> > can you try turning off checksum offloads
> > ethtool -K ethX rx off
>
> same test:
> real 0m3.408s
> user 0m0.022s
> sys 0m0.032s
>
> So yes, something in 4.20.6 goes wrong on the receiving part :/

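For a sense of scale: a bisect halves the candidate range on each round, so the number of build-and-boot cycles grows with log2 of the patch count. For roughly 120 patches that comes to about 7 rounds rather than 5. A minimal sketch of that arithmetic follows; the function name `bisect_steps` is made up for illustration, and the git commands in the comments assume a linux-stable checkout with the usual `v4.20.x` tags and are not executed here:

```sh
#!/bin/sh
# Estimate how many build-and-test rounds a bisect needs for a range of
# n candidate commits: the smallest k such that 2^k >= n.
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# Roughly 120 patches between the two point releases, per the thread.
bisect_steps 120

# The workflow itself would look something like this (assumed tag names):
#   git bisect start v4.20.6 v4.20.5   # bad revision first, then good
#   ...build, boot, run the dmesg-over-ssh test...
#   git bisect good                    # or: git bisect bad
#   git bisect reset                   # when finished
```

With a server that takes several minutes to boot, each extra round is noticeable, which is why the count matters in practice.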
On Wed, Feb 06, 2019 at 11:41:28PM +0100, Ian Kumlien wrote:
> On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
> >
> > It doesn't break anything, packets are _not_ dropped, only that the
> > warning itself is noisy.
>
> Not my experience, to me it slows the machine down and looses packets,
> I don't however know
> if this is the only culprit
>
> You can actually see it on ping where it start out with 0.0xyx and
> ends up at ~10ms

Serial console is a/the killer in these situations.

On Thu, Feb 7, 2019 at 7:43 PM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>
> On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > It changes directly after the first hw checksum failure, I don't know why =/
> > >
> > > weird, Maybe a real check-summing issue/corruption on the PCI ?!
> >
> > Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine
> >
>
> Great, the difference is only 120 patches.
> that is bisect-able, it will only take 5 iterations to find the
> offending commit.

I just wish it wasn't a server that takes, what feels like 5 minutes to boot...

All of these seas of sensors 2d and 3d... =P

But, yep, that's the plan

> > Just FYI, my dmesg testcase:
> > time ssh <server> "dmesg && exit"
> > real 3m5.845s
> > user 0m0.035s
> > sys 0m0.041s
> >
> > > can you try turning off checksum offloads
> > > ethtool -K ethX rx off
> >
> > same test:
> > real 0m3.408s
> > user 0m0.022s
> > sys 0m0.032s
> >
> > So yes, something in 4.20.6 goes wrong on the receiving part :/

On Thu, Feb 7, 2019 at 11:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> On Thu, Feb 7, 2019 at 7:43 PM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > It changes directly after the first hw checksum failure, I don't know why =/
> > > >
> > > > weird, Maybe a real check-summing issue/corruption on the PCI ?!
> > >
> > > Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine

> > Great, the difference is only 120 patches.
> > that is bisect-able, it will only take 5 iterations to find the
> > offending commit.
>
> I just wish it wasn't a server that takes, what feels like 5 minutes to boot...
>
> All of these seas of sensors 2d and 3d... =P
>
> But, yep, that's the plan

Huh, spent most of the day with two bisects and none of them yielded
any results....

Looks like I'll have to start investigating the elrepo kernel-ml build =(

On Fri, Feb 8, 2019 at 5:29 PM Ian Kumlien <ian.kumlien@gmail.com> wrote
> On Thu, Feb 7, 2019 at 11:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > On Thu, Feb 7, 2019 at 7:43 PM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > > > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > It changes directly after the first hw checksum failure, I don't know why =/
> > > > >
> > > > > weird, Maybe a real check-summing issue/corruption on the PCI ?!
> > > >
> > > > Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine
>
> > > Great, the difference is only 120 patches.
> > > that is bisect-able, it will only take 5 iterations to find the
> > > offending commit.
> >
> > I just wish it wasn't a server that takes, what feels like 5 minutes to boot...
> >
> > All of these seas of sensors 2d and 3d... =P
> >
> > But, yep, that's the plan
>
> Huh, spent most of the day with two bisects and none of them yielded
> any results....
>
> Looks like I'll have to start investigating the elrepo kernel-ml build =(

Just realized that it's not an entirely fair comparison - since
retpolines wasn't enabled, damned old compilers...

One last update on this, 4.20.8 compiled with the same compiler works
- I still suspect that it was fixed by:
net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames

Anyway, we can forget about it now ;)

On Sat, Feb 9, 2019 at 4:54 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Fri, Feb 8, 2019 at 5:29 PM Ian Kumlien <ian.kumlien@gmail.com> wrote
> > On Thu, Feb 7, 2019 at 11:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > On Thu, Feb 7, 2019 at 7:43 PM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > > On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > > > > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > > It changes directly after the first hw checksum failure, I don't know why =/
> > > > > >
> > > > > > weird, Maybe a real check-summing issue/corruption on the PCI ?!
> > > > >
> > > > > Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine
> >
> > > > Great, the difference is only 120 patches.
> > > > that is bisect-able, it will only take 5 iterations to find the
> > > > offending commit.
> > >
> > > I just wish it wasn't a server that takes, what feels like 5 minutes to boot...
> > >
> > > All of these seas of sensors 2d and 3d... =P
> > >
> > > But, yep, that's the plan
> >
> > Huh, spent most of the day with two bisects and none of them yielded
> > any results....
> >
> > Looks like I'll have to start investigating the elrepo kernel-ml build =(
>
> Just realized that it's not an entirely fair comparison - since
> retpolines wasn't enabled, damned old compilers...

On Wed, 2019-02-13 at 13:04 +0100, Ian Kumlien wrote:
> One last update on this, 4.20.8 compiled with the same compiler works
> - I still suspect that it was fixed by:
> net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames
>
> Anyway, we can forget about it now ;)

cool, nice to know.

Thanks for the update.

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 1d0bb5ff8c26..f86e4804e83e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -732,6 +732,8 @@ static u8 get_ip_proto(struct sk_buff *skb, int network_depth, __be16 proto)
 					    ((struct ipv6hdr *)ip_p)->nexthdr;
 }
 
+#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
+
 static inline void mlx5e_handle_csum(struct net_device *netdev,
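
The `short_frame()` macro added above compares a frame size against `ETH_ZLEN + ETH_FCS_LEN`. In the kernel's `if_ether.h` these are 60 and 4 bytes, so anything up to and including 64 bytes counts as short. A rough shell sketch of just that arithmetic (a userspace illustration, not driver code; the sample sizes are arbitrary):

```sh
#!/bin/sh
# Userspace mirror of the short_frame() size test from the diff above.
# Constant values as defined in the kernel's include/uapi/linux/if_ether.h.
ETH_ZLEN=60      # minimum frame length without FCS, in bytes
ETH_FCS_LEN=4    # frame check sequence length, in bytes

# Succeeds (exit status 0) when a frame of the given size counts as short.
short_frame() {
    [ "$1" -le $((ETH_ZLEN + ETH_FCS_LEN)) ]
}

# Try a few sample sizes on either side of the 64-byte threshold.
for size in 60 64 65 1500; do
    if short_frame "$size"; then
        echo "$size: short"
    else
        echo "$size: not short"
    fi
done
```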