diff mbox series

[net] net: invert the check of detecting hardware RX checksum fault

Message ID 20181115231602.18328-1-xiyou.wangcong@gmail.com
State Changes Requested, archived
Delegated to: David Miller
Headers show
Series [net] net: invert the check of detecting hardware RX checksum fault | expand

Commit Message

Cong Wang Nov. 15, 2018, 11:16 p.m. UTC
The following evidences indicate this check is likely wrong:

1. In the assignment "skb->csum_valid = !sum", sum==0 indicates a valid checksum.

2. __skb_checksum_complete() always returns sum, and TCP packets are dropped
   only when it returns non-zero. So non-zero indicates a failure.

3. In __skb_checksum_validate_complete(), we have a nearly same check, where
   zero is considered as success.

4. csum_fold() already does the one’s complement, this indicates 0 should
   be considered as a successful validation.

5. We have triggered this fault for many times, but InCsumErrors field in
   /proc/net/snmp remains 0.

Base on the above, non-zero should be used as a checksum mismatch.

I tested this with mlx5 driver, no warning or InCsumErrors after 1 hour.

Fixes: fb286bb2990a ("[NET]: Detect hardware rx checksum faults correctly")
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/core/datagram.c | 4 ++--
 net/core/dev.c      | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

Comments

Herbert Xu Nov. 16, 2018, 1:52 a.m. UTC | #1
On Thu, Nov 15, 2018 at 03:16:02PM -0800, Cong Wang wrote:
> The following evidences indicate this check is likely wrong:
> 
> 1. In the assignment "skb->csum_valid = !sum", sum==0 indicates a valid checksum.
> 
> 2. __skb_checksum_complete() always returns sum, and TCP packets are dropped
>    only when it returns non-zero. So non-zero indicates a failure.
> 
> 3. In __skb_checksum_validate_complete(), we have a nearly same check, where
>    zero is considered as success.
> 
> 4. csum_fold() already does the one’s complement, this indicates 0 should
>    be considered as a successful validation.
> 
> 5. We have triggered this fault for many times, but InCsumErrors field in
>    /proc/net/snmp remains 0.
> 
> Base on the above, non-zero should be used as a checksum mismatch.
> 
> I tested this with mlx5 driver, no warning or InCsumErrors after 1 hour.
> 
> Fixes: fb286bb2990a ("[NET]: Detect hardware rx checksum faults correctly")
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Tom Herbert <tom@herbertland.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  net/core/datagram.c | 4 ++--
>  net/core/dev.c      | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 57f3a6fcfc1e..e542a9a212a7 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -733,7 +733,7 @@ __sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len)
>  	__sum16 sum;
>  
>  	sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
> -	if (likely(!sum)) {
> +	if (unlikely(sum)) {
>  		if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
>  		    !skb->csum_complete_sw)
>  			netdev_rx_csum_fault(skb->dev);

Normally if the hardware's partial checksum is valid then we just
trust it and send the packet along.  However, if the partial
checksum is invalid we don't trust it and we will compute the
whole checksum manually which is what ends up in sum.

netdev_rx_csum_fault is meant to warn about the situation where
a packet with a valid checksum (i.e., sum == 0) was given to us
by the hardware with a partial checksum that was invalid.

So changing it to sum here is wrong.

Can you give more information as to how you got the warnings with
mlx5? It sounds like there may be a real bug there because if you
are getting the warning then it means that a packet with an invalid
hardware-computed partial checksum passed the manual check and
was actually valid.  This implies that either the hardware or the
driver is broken.

Cheers,
Cong Wang Nov. 16, 2018, 2:23 a.m. UTC | #2
On Thu, Nov 15, 2018 at 5:52 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Nov 15, 2018 at 03:16:02PM -0800, Cong Wang wrote:
> > The following evidences indicate this check is likely wrong:
> >
> > 1. In the assignment "skb->csum_valid = !sum", sum==0 indicates a valid checksum.
> >
> > 2. __skb_checksum_complete() always returns sum, and TCP packets are dropped
> >    only when it returns non-zero. So non-zero indicates a failure.
> >
> > 3. In __skb_checksum_validate_complete(), we have a nearly same check, where
> >    zero is considered as success.
> >
> > 4. csum_fold() already does the one’s complement, this indicates 0 should
> >    be considered as a successful validation.
> >
> > 5. We have triggered this fault for many times, but InCsumErrors field in
> >    /proc/net/snmp remains 0.
> >
> > Base on the above, non-zero should be used as a checksum mismatch.
> >
> > I tested this with mlx5 driver, no warning or InCsumErrors after 1 hour.
> >
> > Fixes: fb286bb2990a ("[NET]: Detect hardware rx checksum faults correctly")
> > Cc: Herbert Xu <herbert@gondor.apana.org.au>
> > Cc: Tom Herbert <tom@herbertland.com>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > ---
> >  net/core/datagram.c | 4 ++--
> >  net/core/dev.c      | 2 +-
> >  2 files changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index 57f3a6fcfc1e..e542a9a212a7 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -733,7 +733,7 @@ __sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len)
> >       __sum16 sum;
> >
> >       sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
> > -     if (likely(!sum)) {
> > +     if (unlikely(sum)) {
> >               if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
> >                   !skb->csum_complete_sw)
> >                       netdev_rx_csum_fault(skb->dev);
>
> Normally if the hardware's partial checksum is valid then we just
> trust it and send the packet along.  However, if the partial
> checksum is invalid we don't trust it and we will compute the
> whole checksum manually which is what ends up in sum.

Not sure if I understand partial checksum here, but it is the
CHECKSUM_COMPLETE case which I am trying to fix, not
CHECKSUM_PARTIAL.

Or you mean the checksum returned by skb_checksum(), that is,
checksum from skb->data to skb->data+skb->len.

If neither, I am confused.

>
> netdev_rx_csum_fault is meant to warn about the situation where
> a packet with a valid checksum (i.e., sum == 0) was given to us
> by the hardware with a partial checksum that was invalid.
>
> So changing it to sum here is wrong.
>

So, in other word, a checksum *match* is the intended to detect
this HW RX checksum fault?

What has been changed in between skb_checksum_init() and
tcp_checksum_complete() so that the logic is inverted?

Looks like I miss something too obvious to understand the logic. :-/



> Can you give more information as to how you got the warnings with
> mlx5? It sounds like there may be a real bug there because if you
> are getting the warning then it means that a packet with an invalid
> hardware-computed partial checksum passed the manual check and
> was actually valid.  This implies that either the hardware or the
> driver is broken.

Sure, my case is nearly same with Pawel's, except I have no vlan:
https://marc.info/?l=linux-netdev&m=154086647601721&w=2

None of us has RXFCS, if you are curious whether Eric's fix works
for us.

There are also a few other reports with conntrack involved:
https://marc.info/?l=linux-netdev&m=154134983130200&w=2
https://marc.info/?l=linux-netdev&m=154070099731902&w=2

Thanks.
Herbert Xu Nov. 16, 2018, 4:50 a.m. UTC | #3
On Thu, Nov 15, 2018 at 06:23:38PM -0800, Cong Wang wrote:
>
> > Normally if the hardware's partial checksum is valid then we just
> > trust it and send the packet along.  However, if the partial
> > checksum is invalid we don't trust it and we will compute the
> > whole checksum manually which is what ends up in sum.
> 
> Not sure if I understand partial checksum here, but it is the
> CHECKSUM_COMPLETE case which I am trying to fix, not
> CHECKSUM_PARTIAL.

What I meant by partial checksum is the checksum produced by the
hardware on RX.  In the kernel we call that CHECKSUM_COMPLETE.
CHECKSUM_PARTIAL is the absence of the substantial part of the
checksum which is something we use in the kernel primarily for TX.

Yes the names are confusing :)

> So, in other word, a checksum *match* is the intended to detect
> this HW RX checksum fault?

Correct.  Or more likely it's probably a bug in either the driver
or if there are overlaying code such as VLAN then in that code.

Basically if the RX checksum is buggy, it's much more likely to
cause a valid packet to be rejected than to cause an invalid packet
to be accepted, because we still verify that checksum against the
pseudoheader.  So we only attempt to catch buggy hardware/drivers
by doing a second manual verification for the case where the packet
is flagged as invalid.

> Sure, my case is nearly same with Pawel's, except I have no vlan:
> https://marc.info/?l=linux-netdev&m=154086647601721&w=2

Can you please provide your backtrace?

Thanks,
Eric Dumazet Nov. 16, 2018, 4:52 a.m. UTC | #4
On 11/15/2018 06:23 PM, Cong Wang wrote:
> On Thu, Nov 15, 2018 at 5:52 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>
>> On Thu, Nov 15, 2018 at 03:16:02PM -0800, Cong Wang wrote:
>>> The following evidences indicate this check is likely wrong:
>>>
>>> 1. In the assignment "skb->csum_valid = !sum", sum==0 indicates a valid checksum.
>>>
>>> 2. __skb_checksum_complete() always returns sum, and TCP packets are dropped
>>>    only when it returns non-zero. So non-zero indicates a failure.
>>>
>>> 3. In __skb_checksum_validate_complete(), we have a nearly same check, where
>>>    zero is considered as success.
>>>
>>> 4. csum_fold() already does the one’s complement, this indicates 0 should
>>>    be considered as a successful validation.
>>>
>>> 5. We have triggered this fault for many times, but InCsumErrors field in
>>>    /proc/net/snmp remains 0.
>>>
>>> Base on the above, non-zero should be used as a checksum mismatch.
>>>
>>> I tested this with mlx5 driver, no warning or InCsumErrors after 1 hour.
>>>
>>> Fixes: fb286bb2990a ("[NET]: Detect hardware rx checksum faults correctly")
>>> Cc: Herbert Xu <herbert@gondor.apana.org.au>
>>> Cc: Tom Herbert <tom@herbertland.com>
>>> Cc: Eric Dumazet <edumazet@google.com>
>>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>>> ---
>>>  net/core/datagram.c | 4 ++--
>>>  net/core/dev.c      | 2 +-
>>>  2 files changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/net/core/datagram.c b/net/core/datagram.c
>>> index 57f3a6fcfc1e..e542a9a212a7 100644
>>> --- a/net/core/datagram.c
>>> +++ b/net/core/datagram.c
>>> @@ -733,7 +733,7 @@ __sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len)
>>>       __sum16 sum;
>>>
>>>       sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
>>> -     if (likely(!sum)) {
>>> +     if (unlikely(sum)) {
>>>               if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
>>>                   !skb->csum_complete_sw)
>>>                       netdev_rx_csum_fault(skb->dev);
>>
>> Normally if the hardware's partial checksum is valid then we just
>> trust it and send the packet along.  However, if the partial
>> checksum is invalid we don't trust it and we will compute the
>> whole checksum manually which is what ends up in sum.
> 
> Not sure if I understand partial checksum here, but it is the
> CHECKSUM_COMPLETE case which I am trying to fix, not
> CHECKSUM_PARTIAL.
> 
> Or you mean the checksum returned by skb_checksum(), that is,
> checksum from skb->data to skb->data+skb->len.
> 
> If neither, I am confused.
> 
>>
>> netdev_rx_csum_fault is meant to warn about the situation where
>> a packet with a valid checksum (i.e., sum == 0) was given to us
>> by the hardware with a partial checksum that was invalid.
>>
>> So changing it to sum here is wrong.
>>
> 
> So, in other word, a checksum *match* is the intended to detect
> this HW RX checksum fault?
> 
> What has been changed in between skb_checksum_init() and
> tcp_checksum_complete() so that the logic is inverted?
> 
> Looks like I miss something too obvious to understand the logic. :-/
> 
> 
> 
>> Can you give more information as to how you got the warnings with
>> mlx5? It sounds like there may be a real bug there because if you
>> are getting the warning then it means that a packet with an invalid
>> hardware-computed partial checksum passed the manual check and
>> was actually valid.  This implies that either the hardware or the
>> driver is broken.
> 
> Sure, my case is nearly same with Pawel's, except I have no vlan:
> https://marc.info/?l=linux-netdev&m=154086647601721&w=2
> 
> None of us has RXFCS, if you are curious whether Eric's fix works
> for us.
> 
> There are also a few other reports with conntrack involved:
> https://marc.info/?l=linux-netdev&m=154134983130200&w=2
> https://marc.info/?l=linux-netdev&m=154070099731902&w=2


It is very possible NIC provides an incorrect CHECKSUM_COMPLETE, in the
case non zero trailer bytes were added by a buggy switch (or host)

Saeed can comment/confirm, but the theory is that the NIC does header analysis and
computes a checksum only on the bytes of the IP frame, not including the tail bytes
that were added by a switch.

You could use trafgen to cook such a frame and confirm the theory.

Something like :

{
  0x00, 0x1a, 0x11, 0xc3, 0x0d, 0x45,  # MAC Destination
  0x00, 0x12, 0xc0, 0x02, 0xac, 0x5a,  # MAC Source
  const16(0x0800),

  /* IPv4 Version, IHL, TOS */
  0b01000101, 0,
  /* IPv4 Total Len */
  const16(40),
  /* IPv4 Ident */
  //drnd(2),
  const16(2),

  /* IPv4 Flags, Frag Off */
  0b01000000, 0,
  /* IPv4 TTL */
  64,
  /* Proto TCP */
  0x06,
  /* IPv4 Checksum (IP header from, to) */
  csumip(14, 33),

  7, drnd(3), # Source IP
  10,246,7,152,       # Dest IP

  /* TCP Source Port */
  drnd(2),
  /* TCP Dest Port */
  const16(80),
  /* TCP Sequence Number */
  drnd(4),
  /* TCP Ackn. Number */
  c32(0),

  /* TCP Header length + Flags */
  const16((0x5 << 12) | 2)		/* TCP SYN Flag */

  /* Window Size */
  const16(16),
  /* TCP Checksum (offset IP, offset TCP) */
  csumtcp(14, 34),

  11,22,33,44,55,66, /* PAD */
}
Herbert Xu Nov. 16, 2018, 4:59 a.m. UTC | #5
On Thu, Nov 15, 2018 at 08:52:23PM -0800, Eric Dumazet wrote:
>
> It is very possible NIC provides an incorrect CHECKSUM_COMPLETE, in the
> case non zero trailer bytes were added by a buggy switch (or host)

We should probably change netdev_rx_csum_fault to print out at
least one complete packet plus the hardware-generated checksum.

That would make debugging these rare hardware faults much easier.

Cheers,
Cong Wang Nov. 16, 2018, 8:06 p.m. UTC | #6
On Thu, Nov 15, 2018 at 8:50 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Nov 15, 2018 at 06:23:38PM -0800, Cong Wang wrote:
> >
> > > Normally if the hardware's partial checksum is valid then we just
> > > trust it and send the packet along.  However, if the partial
> > > checksum is invalid we don't trust it and we will compute the
> > > whole checksum manually which is what ends up in sum.
> >
> > Not sure if I understand partial checksum here, but it is the
> > CHECKSUM_COMPLETE case which I am trying to fix, not
> > CHECKSUM_PARTIAL.
>
> What I meant by partial checksum is the checksum produced by the
> hardware on RX.  In the kernel we call that CHECKSUM_COMPLETE.
> CHECKSUM_PARTIAL is the absence of the substantial part of the
> checksum which is something we use in the kernel primarily for TX.
>
> Yes the names are confusing :)

Yeah, understood. The hardware provides skb->csum in this case, but
we keep adjusting it each time when we change skb->data.


>
> > So, in other word, a checksum *match* is the intended to detect
> > this HW RX checksum fault?
>
> Correct.  Or more likely it's probably a bug in either the driver
> or if there are overlaying code such as VLAN then in that code.
>
> Basically if the RX checksum is buggy, it's much more likely to
> cause a valid packet to be rejected than to cause an invalid packet
> to be accepted, because we still verify that checksum against the
> pseudoheader.  So we only attempt to catch buggy hardware/drivers
> by doing a second manual verification for the case where the packet
> is flagged as invalid.

Hmm, now I see how it works. Actually it uses the differences between
these two check's as the difference between hardware checksum with
skb_checksum().

I will send a patch to add a comment there to avoid confusion.


>
> > Sure, my case is nearly same with Pawel's, except I have no vlan:
> > https://marc.info/?l=linux-netdev&m=154086647601721&w=2
>
> Can you please provide your backtrace?

I already did:
https://marc.info/?l=linux-netdev&m=154092211305599&w=2

Note, the offending commit has been backported to 4.14, which
is why I saw this warning. I have no idea why it is backported
from the beginning, it is just an optimization, doesn't fix any bug,
IMHO.

Also, it is much harder for me to reproduce it than Pawel who
saw the warning every second. Sometimes I need 1 hour to trigger
it, sometimes other people here needs 10+ hours to trigger it.

Let me see if I can add vlan on my side to make it more reproducible,
it seems hard as our switch doesn't use vlan either.

We have warnings with conntrack involved too, I can provide it too
if you are interested.

I tend to revert it for -stable, at least that is what I plan to do
on my side unless there is a fix coming soon.

Thanks.
Cong Wang Nov. 16, 2018, 8:10 p.m. UTC | #7
On Thu, Nov 15, 2018 at 8:59 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Nov 15, 2018 at 08:52:23PM -0800, Eric Dumazet wrote:
> >
> > It is very possible NIC provides an incorrect CHECKSUM_COMPLETE, in the
> > case non zero trailer bytes were added by a buggy switch (or host)
>
> We should probably change netdev_rx_csum_fault to print out at
> least one complete packet plus the hardware-generated checksum.
>
> That would make debugging these rare hardware faults much easier.

I have a patch as a starter:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=7fe50ac83f4319c18ed7c634d85cad16bd0bf509

Let me know if you want to add more information there.

Dumping the hex of an skb data?

Thanks.
Cong Wang Nov. 16, 2018, 8:15 p.m. UTC | #8
On Thu, Nov 15, 2018 at 8:52 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> It is very possible NIC provides an incorrect CHECKSUM_COMPLETE, in the
> case non zero trailer bytes were added by a buggy switch (or host)
>
> Saeed can comment/confirm, but the theory is that the NIC does header analysis and
> computes a checksum only on the bytes of the IP frame, not including the tail bytes
> that were added by a switch.


This theory seems can't explain why Pawel saw this warning so often,
which is beyond the probability of a buggy switch. I don't know.


>
> You could use trafgen to cook such a frame and confirm the theory.
>
> Something like :

I will try it.

Thanks.
Cong Wang Nov. 16, 2018, 9:32 p.m. UTC | #9
On Fri, Nov 16, 2018 at 12:06 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Hmm, now I see how it works. Actually it uses the differences between
> these two check's as the difference between hardware checksum with
> skb_checksum().
>

Well...

This is true only when there is a skb_checksum_init*() or
skb_checksum_validate*() prior to it, it seems not true for
nf_ip_checksum() where skb->csum is correctly set to pesudo header
checksum but there is no validation of the original skb->csum.
So this check should be still inverted there??

Or am I still missing anything here?
Eric Dumazet Nov. 16, 2018, 9:33 p.m. UTC | #10
On 11/16/2018 12:15 PM, Cong Wang wrote:
> On Thu, Nov 15, 2018 at 8:52 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> It is very possible NIC provides an incorrect CHECKSUM_COMPLETE, in the
>> case non zero trailer bytes were added by a buggy switch (or host)
>>
>> Saeed can comment/confirm, but the theory is that the NIC does header analysis and
>> computes a checksum only on the bytes of the IP frame, not including the tail bytes
>> that were added by a switch.
> 
> 
> This theory seems can't explain why Pawel saw this warning so often,
> which is beyond the probability of a buggy switch. I don't know.

Well the bug here would be the receiver NIC, not really respecting CHECKSUM_COMPLETE premise
(provide a checksum over all the bytes, regardless of how smart header parsing can be on the NIC)

'Buggy switch' would add random bytes after IP frames, but as I mentioned, any AF_PACKET user
can cook arbitrary padding after a valid IP (or IPv6) frame.

> 
> I will try it.
> 
> Thanks.
>
David Miller Nov. 18, 2018, 12:40 a.m. UTC | #11
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 16 Nov 2018 09:52:42 +0800

> netdev_rx_csum_fault is meant to warn about the situation where
> a packet with a valid checksum (i.e., sum == 0) was given to us
> by the hardware with a partial checksum that was invalid.

Thanks for reminding us how this code is supposed to work :-)

Definitely we need to better document this and it looks like Cong is
going to add a comment in order to do so.

As for dumping out a packet, we need to rate limit that more strongly
than the existing net_ratelimit() there.  But only dumping once per
bootup seems to extreme in the other direction.  If we get like 2
or 3 of these in an hour, dumping all of them would be reasonable
and useful.
Herbert Xu Nov. 19, 2018, 4:01 a.m. UTC | #12
On Fri, Nov 16, 2018 at 01:32:50PM -0800, Cong Wang wrote:
>
> This is true only when there is a skb_checksum_init*() or
> skb_checksum_validate*() prior to it, it seems not true for
> nf_ip_checksum() where skb->csum is correctly set to pesudo header
> checksum but there is no validation of the original skb->csum.
> So this check should be still inverted there??
> 
> Or am I still missing anything here?

What do you mean? My copy of nf_ip_checksum seems to be doing the
right thing as far as verifying CHECKSUM_COMPLETED goes.

Cheers,
Cong Wang Nov. 19, 2018, 7:25 p.m. UTC | #13
On Sun, Nov 18, 2018 at 8:02 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Fri, Nov 16, 2018 at 01:32:50PM -0800, Cong Wang wrote:
> >
> > This is true only when there is a skb_checksum_init*() or
> > skb_checksum_validate*() prior to it, it seems not true for
> > nf_ip_checksum() where skb->csum is correctly set to pesudo header
> > checksum but there is no validation of the original skb->csum.
> > So this check should be still inverted there??
> >
> > Or am I still missing anything here?
>
> What do you mean? My copy of nf_ip_checksum seems to be doing the
> right thing as far as verifying CHECKSUM_COMPLETED goes.

Hmm, it calls csum_tcpudp_magic() directly instead of
__skb_checksum_validate_complete(), but it also sets ip_summed
to CHECKSUM_UNNECESSARY:

 20                 if ((protocol == 0 && !csum_fold(skb->csum)) ||
 21                     !csum_tcpudp_magic(iph->saddr, iph->daddr,
 22                                        skb->len - dataoff, protocol,
 23                                        skb->csum)) {
 24                         skb->ip_summed = CHECKSUM_UNNECESSARY;
 25                         break;
 26                 }

which means the rx checksum fault won't be triggered.
Cong Wang Nov. 20, 2018, 1:42 a.m. UTC | #14
On Fri, Nov 16, 2018 at 12:15 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Thu, Nov 15, 2018 at 8:52 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > You could use trafgen to cook such a frame and confirm the theory.
> >
> > Something like :
>
> I will try it.

I just tried it, it doesn't make much difference, the warning only
shows up once after I ran the trafgen script for many times,
it could be triggered by other daemons running on the host too.
Eric Dumazet Nov. 20, 2018, 1:47 a.m. UTC | #15
On 11/19/2018 05:42 PM, Cong Wang wrote:
> On Fri, Nov 16, 2018 at 12:15 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>
>> On Thu, Nov 15, 2018 at 8:52 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>
>>> You could use trafgen to cook such a frame and confirm the theory.
>>>
>>> Something like :
>>
>> I will try it.
> 
> I just tried it, it doesn't make much difference, the warning only
> shows up once after I ran the trafgen script for many times,
> it could be triggered by other daemons running on the host too.


I guess we will need to dump the whole packet for debugging,
as suggest by Herbert :/
Herbert Xu Nov. 20, 2018, 3:09 a.m. UTC | #16
On Mon, Nov 19, 2018 at 11:25:47AM -0800, Cong Wang wrote:
> 
> Hmm, it calls csum_tcpudp_magic() directly instead of
> __skb_checksum_validate_complete(), but it also sets ip_summed
> to CHECKSUM_UNNECESSARY:
> 
>  20                 if ((protocol == 0 && !csum_fold(skb->csum)) ||
>  21                     !csum_tcpudp_magic(iph->saddr, iph->daddr,
>  22                                        skb->len - dataoff, protocol,
>  23                                        skb->csum)) {
>  24                         skb->ip_summed = CHECKSUM_UNNECESSARY;
>  25                         break;
>  26                 }
> 
> which means the rx checksum fault won't be triggered.

This is just the case where the hardware checksum is valid.

For the other case, it just ignores the hardware checksum.  Perhaps
it should call rx checksum fault if the checksum turns out to be
correct in that case but it's not really a big deal.

Cheers,
David Miller Nov. 20, 2018, 6:13 p.m. UTC | #17
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 19 Nov 2018 17:47:33 -0800

> 
> 
> On 11/19/2018 05:42 PM, Cong Wang wrote:
>> On Fri, Nov 16, 2018 at 12:15 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>>
>>> On Thu, Nov 15, 2018 at 8:52 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>
>>>> You could use trafgen to cook such a frame and confirm the theory.
>>>>
>>>> Something like :
>>>
>>> I will try it.
>> 
>> I just tried it, it doesn't make much difference, the warning only
>> shows up once after I ran the trafgen script for many times,
>> it could be triggered by other daemons running on the host too.
> 
> 
> I guess we will need to dump the whole packet for debugging,
> as suggest by Herbert :/

Something like this?  Is it safe to linearize here?

diff --git a/net/core/dev.c b/net/core/dev.c
index f2bfd2eda7b2..3955939cc0cf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3094,6 +3094,8 @@ EXPORT_SYMBOL(__skb_gso_segment);
 void netdev_rx_csum_fault(struct net_device *dev, struct sk_buff *skb)
 {
 	if (net_ratelimit()) {
+		static unsigned long last_full_pkt_dump = ~0UL;
+
 		pr_err("%s: hw csum failure\n", dev ? dev->name : "<unknown>");
 		if (dev)
 			pr_err("dev features: %pNF\n", &dev->features);
@@ -3102,6 +3104,13 @@ void netdev_rx_csum_fault(struct net_device *dev, struct sk_buff *skb)
 		       skb_shinfo(skb)->gso_size, skb_shinfo(skb)->gso_type,
 		       skb_shinfo(skb)->nr_frags, skb->ip_summed, skb->csum,
 		       skb->csum_complete_sw, skb->csum_valid, skb->csum_level);
+		if (jiffies - last_full_pkt_dump >= (HZ * 60 * 5) &&
+		    !skb_linearize(skb)) {
+			last_full_pkt_dump = jiffies;
+			print_hex_dump(KERN_ERR, "skb data: ",
+				       DUMP_PREFIX_OFFSET, 32, 1,
+				       skb->data, skb->len, false);
+		}
 		dump_stack();
 	}
 }
Eric Dumazet Nov. 20, 2018, 6:18 p.m. UTC | #18
On Tue, Nov 20, 2018 at 10:13 AM David Miller <davem@davemloft.net> wrote:
>
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 19 Nov 2018 17:47:33 -0800
>
> >
> >
> > On 11/19/2018 05:42 PM, Cong Wang wrote:
> >> On Fri, Nov 16, 2018 at 12:15 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >>>
> >>> On Thu, Nov 15, 2018 at 8:52 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>>>
> >>>> You could use trafgen to cook such a frame and confirm the theory.
> >>>>
> >>>> Something like :
> >>>
> >>> I will try it.
> >>
> >> I just tried it, it doesn't make much difference, the warning only
> >> shows up once after I ran the trafgen script for many times,
> >> it could be triggered by other daemons running on the host too.
> >
> >
> > I guess we will need to dump the whole packet for debugging,
> > as suggest by Herbert :/
>
> Something like this?  Is it safe to linearize here?
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index f2bfd2eda7b2..3955939cc0cf 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3094,6 +3094,8 @@ EXPORT_SYMBOL(__skb_gso_segment);
>  void netdev_rx_csum_fault(struct net_device *dev, struct sk_buff *skb)
>  {
>         if (net_ratelimit()) {
> +               static unsigned long last_full_pkt_dump = ~0UL;
> +
>                 pr_err("%s: hw csum failure\n", dev ? dev->name : "<unknown>");
>                 if (dev)
>                         pr_err("dev features: %pNF\n", &dev->features);
> @@ -3102,6 +3104,13 @@ void netdev_rx_csum_fault(struct net_device *dev, struct sk_buff *skb)
>                        skb_shinfo(skb)->gso_size, skb_shinfo(skb)->gso_type,
>                        skb_shinfo(skb)->nr_frags, skb->ip_summed, skb->csum,
>                        skb->csum_complete_sw, skb->csum_valid, skb->csum_level);
> +               if (jiffies - last_full_pkt_dump >= (HZ * 60 * 5) &&
> +                   !skb_linearize(skb)) {
> +                       last_full_pkt_dump = jiffies;
> +                       print_hex_dump(KERN_ERR, "skb data: ",
> +                                      DUMP_PREFIX_OFFSET, 32, 1,
> +                                      skb->data, skb->len, false);


I guess we should dump from skb->head instead, to show all headers.

Maybe dump the mac header offset as well, so that we can ignore the
padding bytes (NET_SKB_PAD) if needed.
Herbert Xu Nov. 21, 2018, 3:35 a.m. UTC | #19
On Tue, Nov 20, 2018 at 10:18:17AM -0800, Eric Dumazet wrote:
>
> > Something like this?  Is it safe to linearize here?

It looks safe to me.  It's only unsafe if your skb is shared which
from my grepping does not appear to be the case (and it cannot be
shared if you're modifying skb->csum which all the callers that
I found were doing for obvious reasons).

If it's cloned skb_linearize will unclone it.

> I guess we should dump from skb->head instead, to show all headers.
>
> Maybe dump the mac header offset as well, so that we can ignore the
> padding bytes (NET_SKB_PAD) if needed.

Yes I agree.  It doesn't hurt to dump more data.

Cheers,
Paweł Staszewski Nov. 22, 2018, 1:25 a.m. UTC | #20
W dniu 16.11.2018 o 21:06, Cong Wang pisze:
> On Thu, Nov 15, 2018 at 8:50 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Thu, Nov 15, 2018 at 06:23:38PM -0800, Cong Wang wrote:
>>>> Normally if the hardware's partial checksum is valid then we just
>>>> trust it and send the packet along.  However, if the partial
>>>> checksum is invalid we don't trust it and we will compute the
>>>> whole checksum manually which is what ends up in sum.
>>> Not sure if I understand partial checksum here, but it is the
>>> CHECKSUM_COMPLETE case which I am trying to fix, not
>>> CHECKSUM_PARTIAL.
>> What I meant by partial checksum is the checksum produced by the
>> hardware on RX.  In the kernel we call that CHECKSUM_COMPLETE.
>> CHECKSUM_PARTIAL is the absence of the substantial part of the
>> checksum which is something we use in the kernel primarily for TX.
>>
>> Yes the names are confusing :)
> Yeah, understood. The hardware provides skb->csum in this case, but
> we keep adjusting it each time when we change skb->data.
>
>
>>> So, in other word, a checksum *match* is the intended to detect
>>> this HW RX checksum fault?
>> Correct.  Or more likely it's probably a bug in either the driver
>> or if there are overlaying code such as VLAN then in that code.
>>
>> Basically if the RX checksum is buggy, it's much more likely to
>> cause a valid packet to be rejected than to cause an invalid packet
>> to be accepted, because we still verify that checksum against the
>> pseudoheader.  So we only attempt to catch buggy hardware/drivers
>> by doing a second manual verification for the case where the packet
>> is flagged as invalid.
> Hmm, now I see how it works. Actually it uses the differences between
> these two check's as the difference between hardware checksum with
> skb_checksum().
>
> I will send a patch to add a comment there to avoid confusion.
>
>
>>> Sure, my case is nearly same with Pawel's, except I have no vlan:
>>> https://marc.info/?l=linux-netdev&m=154086647601721&w=2
>> Can you please provide your backtrace?
> I already did:
> https://marc.info/?l=linux-netdev&m=154092211305599&w=2
>
> Note, the offending commit has been backported to 4.14, which
> is why I saw this warning. I have no idea why it is backported
> from the beginning, it is just an optimization, doesn't fix any bug,
> IMHO.
>
> Also, it is much harder for me to reproduce it than Pawel who
> saw the warning every second. Sometimes I need 1 hour to trigger
> it, sometimes other people here needs 10+ hours to trigger it.

By the way - changed network controller for vlans where i was receiving 
rx csum fail to 82599 with ixgbe driver and

with mellanox:

[91584.359273] vlan980: hw csum failure
[91584.359278] CPU: 54 PID: 0 Comm: swapper/54 Not tainted 4.20.0-rc1+ #2
[91584.359279] Call Trace:
[91584.359282]  <IRQ>
[91584.359290]  dump_stack+0x46/0x5b
[91584.359296]  __skb_checksum_complete+0x9b/0xb0
[91584.359301]  icmp_rcv+0x51/0x1f0
[91584.359305]  ip_local_deliver_finish+0x49/0xd0
[91584.359307]  ip_local_deliver+0xb7/0xe0
[91584.359309]  ? ip_sublist_rcv_finish+0x50/0x50
[91584.359310]  ip_rcv+0x96/0xc0
[91584.359313]  __netif_receive_skb_one_core+0x4b/0x70
[91584.359315]  netif_receive_skb_internal+0x2f/0xc0
[91584.359316]  napi_gro_receive+0xb0/0xd0
[91584.359320]  mlx5e_handle_rx_cqe+0x78/0xd0
[91584.359321]  mlx5e_poll_rx_cq+0xc4/0x970
[91584.359323]  mlx5e_napi_poll+0xab/0xcb0
[91584.359325]  net_rx_action+0xd9/0x300
[91584.359328]  __do_softirq+0xd3/0x2d9
[91584.359333]  irq_exit+0x7a/0x80
[91584.359334]  do_IRQ+0x72/0xc0
[91584.359336]  common_interrupt+0xf/0xf
[91584.359337]  </IRQ>
[91584.359340] RIP: 0010:mwait_idle+0x74/0x1b0
[91584.359342] Code: ae f0 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 
01 c8 48 8b 00 48 c1 e8 03 83 e0 01 0f 85 26 01 00 00 48 89 c1 fb 0f 01 
c9 <65> 8b 2d 95 8e 6b 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0
[91584.359343] RSP: 0018:ffffc900034f3ec0 EFLAGS: 00000246 ORIG_RAX: 
ffffffffffffffde
[91584.359344] RAX: 0000000000000000 RBX: 0000000000000036 RCX: 
0000000000000000
[91584.359345] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
0000000000000000
[91584.359346] RBP: 0000000000000036 R08: 0000000000000000 R09: 
0000000000000000
[91584.359346] R10: 00000001008b49bb R11: 0000000000000c00 R12: 
0000000000000000
[91584.359347] R13: 0000000000000000 R14: 0000000000000000 R15: 
0000000000000000
[91584.359352]  do_idle+0x19f/0x1c0
[91584.359354]  ? do_idle+0x4/0x1c0
[91584.359355]  cpu_startup_entry+0x14/0x20
[91584.359360]  start_secondary+0x165/0x190
[91584.359364]  secondary_startup_64+0xa4/0xb0


With intel no errors.


>
> Let me see if I can add vlan on my side to make it more reproducible,
> it seems hard as our switch doesn't use vlan either.
>
> We have warnings with conntrack involved too, I can provide it too
> if you are interested.
>
> I tend to revert it for -stable, at least that is what I plan to do
> on my side unless there is a fix coming soon.
>
> Thanks.
>
diff mbox series

Patch

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 57f3a6fcfc1e..e542a9a212a7 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -733,7 +733,7 @@  __sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len)
 	__sum16 sum;
 
 	sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
-	if (likely(!sum)) {
+	if (unlikely(sum)) {
 		if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
 		    !skb->csum_complete_sw)
 			netdev_rx_csum_fault(skb->dev);
@@ -753,7 +753,7 @@  __sum16 __skb_checksum_complete(struct sk_buff *skb)
 
 	/* skb->csum holds pseudo checksum */
 	sum = csum_fold(csum_add(skb->csum, csum));
-	if (likely(!sum)) {
+	if (unlikely(sum)) {
 		if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
 		    !skb->csum_complete_sw)
 			netdev_rx_csum_fault(skb->dev);
diff --git a/net/core/dev.c b/net/core/dev.c
index 0ffcbdd55fa9..c76dee329844 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5776,7 +5776,7 @@  __sum16 __skb_gro_checksum_complete(struct sk_buff *skb)
 
 	/* NAPI_GRO_CB(skb)->csum holds pseudo checksum */
 	sum = csum_fold(csum_add(NAPI_GRO_CB(skb)->csum, wsum));
-	if (likely(!sum)) {
+	if (unlikely(sum)) {
 		if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
 		    !skb->csum_complete_sw)
 			netdev_rx_csum_fault(skb->dev);