diff mbox

veth: remove hardware checksum feature

Message ID 51F15E50.8080208@guap.ru
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Vitaly E. Lavrov July 25, 2013, 5:20 p.m. UTC
The network device VETH can't support the feature NETIF_F_HW_CSUM.
All locally generated packets have invalid checksum.
Wrong commit http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8093315a91340bca52549044975d8c7f673b28a1 ( 
kernel 3.9.0 )

Workaround "ethtool -K vethX tx off"

Possible patch:

the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet July 25, 2013, 5:59 p.m. UTC | #1
On Thu, 2013-07-25 at 21:20 +0400, Vitaly E. Lavrov wrote:
> The network device VETH can't support the feature NETIF_F_HW_CSUM.
> All locally generated packets have invalid checksum.
> Wrong commit http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8093315a91340bca52549044975d8c7f673b28a1 ( 
> kernel 3.9.0 )
> 
> Workaround "ethtool -K vethX tx off"
> 
> Possible patch:

No idea of what the problem is.

Could you elaborate ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear Aug. 8, 2013, 12:07 a.m. UTC | #2
On 07/25/2013 10:20 AM, Vitaly E. Lavrov wrote:
> The network device VETH can't support the feature NETIF_F_HW_CSUM.
> All locally generated packets have invalid checksum.
> Wrong commit http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8093315a91340bca52549044975d8c7f673b28a1 ( kernel 3.9.0 )
>
> Workaround "ethtool -K vethX tx off"
>
> Possible patch:
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 177f911..3db97da 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -254,7 +254,7 @@ static const struct net_device_ops veth_netdev_ops = {
>   };
>
>   #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |    \
> -                      NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
> +                      NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
>                         NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | \
>                         NETIF_F_HW_VLAN_STAG_TX | NETIF_F_HW_VLAN_STAG_RX )

I see the same problem.

My test case is a bit complicated, but the gist is that I have a VETH
pair, one with IP (veth1), one connected to a bridge-like-thing (veth2).

The UDP frames sent on veth1 appear on veth2, and when I sniff veth2,
the packets show broken checksum.  The work-around mentioned in Vitaly's
email above fixes the problem for me (I did not try the patch yet).

Eric:  You responded originally that you needed more info.  If
my explanation above is not sufficient, please let me know what
you need...

Thanks,
Ben
Eric Dumazet Aug. 8, 2013, 12:12 a.m. UTC | #3
On Wed, 2013-08-07 at 17:07 -0700, Ben Greear wrote:
> On 07/25/2013 10:20 AM, Vitaly E. Lavrov wrote:
> > The network device VETH can't support the feature NETIF_F_HW_CSUM.
> > All locally generated packets have invalid checksum.
> > Wrong commit http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8093315a91340bca52549044975d8c7f673b28a1 ( kernel 3.9.0 )
> >
> > Workaround "ethtool -K vethX tx off"
> >
> > Possible patch:
> >
> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > index 177f911..3db97da 100644
> > --- a/drivers/net/veth.c
> > +++ b/drivers/net/veth.c
> > @@ -254,7 +254,7 @@ static const struct net_device_ops veth_netdev_ops = {
> >   };
> >
> >   #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |    \
> > -                      NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
> > +                      NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
> >                         NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | \
> >                         NETIF_F_HW_VLAN_STAG_TX | NETIF_F_HW_VLAN_STAG_RX )
> 
> I see the same problem.
> 
> My test case is a bit complicated, but the gist is that I have a VETH
> pair, one with IP (veth1), one connected to a bridge-like-thing (veth2).
> 
> The UDP frames sent on veth1 appear on veth2, and when I sniff veth2,
> the packets show broken checksum.  The work-around mentioned in Vitaly's
> email above fixes the problem for me (I did not try the patch yet).
> 
> Eric:  You responded originally that you needed more info.  If
> my explanation above is not sufficient, please let me know what
> you need...
> 

tcpdump is known to display wrong checksums, its not a reason to disable
tx checksums on our interfaces and kill performance.

       -K     Don't  attempt to verify IP, TCP, or UDP checksums.  This
is useful for interfaces that perform some or all of those checksum
calculation in hardware; other‐
              wise, all outgoing TCP checksums will be flagged as bad.


So we need more information than 'my trcpdump says checksums are wrong'

We could also disable TSO because : It sends packets bigger than MTU,
this can not be good ;)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear Aug. 8, 2013, 12:23 a.m. UTC | #4
On 08/07/2013 05:12 PM, Eric Dumazet wrote:
> On Wed, 2013-08-07 at 17:07 -0700, Ben Greear wrote:
>> On 07/25/2013 10:20 AM, Vitaly E. Lavrov wrote:
>>> The network device VETH can't support the feature NETIF_F_HW_CSUM.
>>> All locally generated packets have invalid checksum.
>>> Wrong commit http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8093315a91340bca52549044975d8c7f673b28a1 ( kernel 3.9.0 )
>>>
>>> Workaround "ethtool -K vethX tx off"
>>>
>>> Possible patch:
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index 177f911..3db97da 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -254,7 +254,7 @@ static const struct net_device_ops veth_netdev_ops = {
>>>    };
>>>
>>>    #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |    \
>>> -                      NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
>>> +                      NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
>>>                          NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | \
>>>                          NETIF_F_HW_VLAN_STAG_TX | NETIF_F_HW_VLAN_STAG_RX )
>>
>> I see the same problem.
>>
>> My test case is a bit complicated, but the gist is that I have a VETH
>> pair, one with IP (veth1), one connected to a bridge-like-thing (veth2).
>>
>> The UDP frames sent on veth1 appear on veth2, and when I sniff veth2,
>> the packets show broken checksum.  The work-around mentioned in Vitaly's
>> email above fixes the problem for me (I did not try the patch yet).
>>
>> Eric:  You responded originally that you needed more info.  If
>> my explanation above is not sufficient, please let me know what
>> you need...
>>
>
> tcpdump is known to display wrong checksums, its not a reason to disable
> tx checksums on our interfaces and kill performance.

I am receiving the packet into user space by reading veth2
using a packet socket, and then writing that packet out to eth6
(e100e).  As far as I can tell, it reads from veth2 with bad checksum
and then goes onto the wire with bad checksum.

>         -K     Don't  attempt to verify IP, TCP, or UDP checksums.  This
> is useful for interfaces that perform some or all of those checksum
> calculation in hardware; other‐
>                wise, all outgoing TCP checksums will be flagged as bad.

Is it ever valid to *read* a packet with bad checksum though?  I thought
the bogus bad hw-checksum issue was only on the tx-side as far as sniffing goes?

Thanks,
Ben
Eric Dumazet Aug. 8, 2013, 1:08 a.m. UTC | #5
On Wed, 2013-08-07 at 17:23 -0700, Ben Greear wrote:

> I am receiving the packet into user space by reading veth2
> using a packet socket, and then writing that packet out to eth6
> (e100e).  As far as I can tell, it reads from veth2 with bad checksum
> and then goes onto the wire with bad checksum.
> 

Then, when you read the packet socket, you probably have an indication
that checksum is to be computed or ignored.

Your application breaks because of this.

If forwarding was done by the kernel, the checksum would be filled
either by hardware, or core network helpers.

> Is it ever valid to *read* a packet with bad checksum though?  I thought
> the bogus bad hw-checksum issue was only on the tx-side as far as sniffing goes?

We have same flags on loopback interface.

So using your application on loopback should break the same ?

I am not saying your application is buggy, maybe we need a helper in
net/packet/af_packet.c. Please check TP_STATUS_CSUMNOTREADY

This was added 6 years ago in commit 8dc4194474159660
("[PACKET]: Add optional checksum computation for recvmsg")



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear Aug. 8, 2013, 1:54 a.m. UTC | #6
On 08/07/2013 06:08 PM, Eric Dumazet wrote:
> On Wed, 2013-08-07 at 17:23 -0700, Ben Greear wrote:
>
>> I am receiving the packet into user space by reading veth2
>> using a packet socket, and then writing that packet out to eth6
>> (e100e).  As far as I can tell, it reads from veth2 with bad checksum
>> and then goes onto the wire with bad checksum.
>>
>
> Then, when you read the packet socket, you probably have an indication
> that checksum is to be computed or ignored.

My user-space bridge code doesn't know or care about the checksum, it
is just reading a packet from one interface and writing onto another.

>
> Your application breaks because of this.

>> Is it ever valid to *read* a packet with bad checksum though?  I thought
>> the bogus bad hw-checksum issue was only on the tx-side as far as sniffing goes?
>
> We have same flags on loopback interface.
>
> So using your application on loopback should break the same ?

I suppose, I have never tried to bridge loopback, and probably won't
try anytime soon.

> I am not saying your application is buggy, maybe we need a helper in
> net/packet/af_packet.c. Please check TP_STATUS_CSUMNOTREADY
>
> This was added 6 years ago in commit 8dc4194474159660
> ("[PACKET]: Add optional checksum computation for recvmsg")

Hmm, I'm using recmmsg, in case that matters.  I'm not sure I really
understand that patch well.  It looks like it requires user-space changes?

I'd rather not have to deal with calculating checksums just to bridge
two interfaces...if it comes to that, I'd rather just force a disable
of the HW checksum feature using ethtool when my app is configured in
this manner.

Maybe we should just do the csum calc in the kernel if packet is
about to be sent up to user-space via af_packet?  I think that
would keep the expected behaviour, and hopefully not loose any of the performance
benefits for cases where the packet never leaves the kernel?

Thanks,
Ben
Eric Dumazet Aug. 8, 2013, 2:52 a.m. UTC | #7
On Wed, 2013-08-07 at 18:54 -0700, Ben Greear wrote:

> Maybe we should just do the csum calc in the kernel if packet is
> about to be sent up to user-space via af_packet?  I think that
> would keep the expected behaviour, and hopefully not loose any of the performance
> benefits for cases where the packet never leaves the kernel?

I think you are rephrasing what I suggested ;)

I'll send a patch asap, unless someone beats me.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear Aug. 8, 2013, 7:43 p.m. UTC | #8
On 08/07/2013 07:52 PM, Eric Dumazet wrote:
> On Wed, 2013-08-07 at 18:54 -0700, Ben Greear wrote:
>
>> Maybe we should just do the csum calc in the kernel if packet is
>> about to be sent up to user-space via af_packet?  I think that
>> would keep the expected behaviour, and hopefully not loose any of the performance
>> benefits for cases where the packet never leaves the kernel?
>
> I think you are rephrasing what I suggested ;)
>
> I'll send a patch asap, unless someone beats me.


I'll be happy to test.

And a slightly related question if you have the time:

My kernel-mode bridge was breaking as well, but this appears to
be a bug in my code.  I had this code (since 2009, at least):

// Evidently this fixes issues with sending between NICs
// that support and do not support hw-csum.
skb->ip_summed = CHECKSUM_NONE;


It seems that at some point, this fixed some problem I saw, but it was long
ago.  If I remove this, then the 3.9 kernel bridges just fine between
the VETH and a physical interface.

I do not see any modification of the ip_summed in the bridge code.

So, the question is:  Is there any time that I *should* be mucking with
skb->ip_summed when bridging pkts from one device to another on modern
kernels?  I'm using the ptype_all hook to grab packets, and ndo_start_xmit
to send them, in case that matters.

Thanks,
Ben
Eric Dumazet Aug. 8, 2013, 8:16 p.m. UTC | #9
On Thu, 2013-08-08 at 12:43 -0700, Ben Greear wrote:

> So, the question is:  Is there any time that I *should* be mucking with
> skb->ip_summed when bridging pkts from one device to another on modern
> kernels?  I'm using the ptype_all hook to grab packets, and ndo_start_xmit
> to send them, in case that matters.


ndo_start_xmit() is likely bypassing the core network fallbacks.

You might try dev_hard_start_xmit() instead


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear Aug. 8, 2013, 10:13 p.m. UTC | #10
On 08/08/2013 01:16 PM, Eric Dumazet wrote:
> On Thu, 2013-08-08 at 12:43 -0700, Ben Greear wrote:
>
>> So, the question is:  Is there any time that I *should* be mucking with
>> skb->ip_summed when bridging pkts from one device to another on modern
>> kernels?  I'm using the ptype_all hook to grab packets, and ndo_start_xmit
>> to send them, in case that matters.
>
>
> ndo_start_xmit() is likely bypassing the core network fallbacks.
>
> You might try dev_hard_start_xmit() instead

Looks like it should work nicely, except it would be really
nice if I could short-circuit the xmit_nit() part for my
protocol handler.

Think there would be any interest in allowing the ptype_all
handlers to optionally register a direction (ie tx-only, rx-only, both)
and have dev_queue_xmit_nit() pay attention to that?

Thanks,
Ben
Ben Greear Aug. 8, 2013, 10:20 p.m. UTC | #11
On 08/08/2013 03:13 PM, Ben Greear wrote:
> On 08/08/2013 01:16 PM, Eric Dumazet wrote:
>> On Thu, 2013-08-08 at 12:43 -0700, Ben Greear wrote:
>>
>>> So, the question is:  Is there any time that I *should* be mucking with
>>> skb->ip_summed when bridging pkts from one device to another on modern
>>> kernels?  I'm using the ptype_all hook to grab packets, and ndo_start_xmit
>>> to send them, in case that matters.
>>
>>
>> ndo_start_xmit() is likely bypassing the core network fallbacks.
>>
>> You might try dev_hard_start_xmit() instead
>
> Looks like it should work nicely, except it would be really
> nice if I could short-circuit the xmit_nit() part for my
> protocol handler.
>
> Think there would be any interest in allowing the ptype_all
> handlers to optionally register a direction (ie tx-only, rx-only, both)
> and have dev_queue_xmit_nit() pay attention to that?

Actually, seems I might be able to abuse the ptype->af_packet_priv
and skb->sk and make the xmit_nit() not pass it back up...

Still might be interesting to have direction support in the ptype
handlers though...

Ben
Eric Dumazet Aug. 8, 2013, 10:22 p.m. UTC | #12
On Thu, 2013-08-08 at 15:13 -0700, Ben Greear wrote:

> Looks like it should work nicely, except it would be really
> nice if I could short-circuit the xmit_nit() part for my
> protocol handler.
> 
> Think there would be any interest in allowing the ptype_all
> handlers to optionally register a direction (ie tx-only, rx-only, both)
> and have dev_queue_xmit_nit() pay attention to that?

Sure, it would also make sense to apply the BPF filter _before_ doing
the skb_clone()

Right now, we :

clone packets
-> deliver the clone to the sniffer.
 sniffer eventuall drops the packet after BPF filtering.

Its trivial to test the tx/rx thing in BPF, and it's JIT code.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear Aug. 8, 2013, 10:35 p.m. UTC | #13
On 08/08/2013 03:22 PM, Eric Dumazet wrote:
> On Thu, 2013-08-08 at 15:13 -0700, Ben Greear wrote:
>
>> Looks like it should work nicely, except it would be really
>> nice if I could short-circuit the xmit_nit() part for my
>> protocol handler.
>>
>> Think there would be any interest in allowing the ptype_all
>> handlers to optionally register a direction (ie tx-only, rx-only, both)
>> and have dev_queue_xmit_nit() pay attention to that?
>
> Sure, it would also make sense to apply the BPF filter _before_ doing
> the skb_clone()
>
> Right now, we :
>
> clone packets
> -> deliver the clone to the sniffer.
>   sniffer eventuall drops the packet after BPF filtering.
>
> Its trivial to test the tx/rx thing in BPF, and it's JIT code.

I'm actually registering the hook from a kernel module and
doing the bridging in this module.  I'm not using sockets or
BPF like a sniffer would...

I think for my own use, just causing the skb_loop_sk() method to
return true would be optimal, but in general I like your idea.

Thanks,
Ben
diff mbox

Patch

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 177f911..3db97da 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -254,7 +254,7 @@  static const struct net_device_ops veth_netdev_ops = {
  };

  #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |    \
-                      NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
+                      NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
                        NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | \
                        NETIF_F_HW_VLAN_STAG_TX | NETIF_F_HW_VLAN_STAG_RX )
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in