Message ID | 572155F4.10405@candelatech.com |
---|---|
State | Not Applicable, archived |
Delegated to: | David Miller |
Headers | show |
Hello, 2016-04-27, 17:14:44 -0700, Ben Greear wrote: > On 04/27/2016 05:00 PM, Hannes Frederic Sowa wrote: > > Hi Ben, > > > > On Wed, Apr 27, 2016, at 20:07, Ben Hutchings wrote: > > > On Wed, 2016-04-27 at 08:59 -0700, Ben Greear wrote: > > > > On 04/26/2016 04:02 PM, Ben Hutchings wrote: > > > > > > > > > > 3.2.80-rc1 review patch. If anyone has any objections, please let me know. > > > > I would be careful about this. It causes regressions when sending > > > > PACKET_SOCKET buffers from user-space to veth devices. > > > > > > > > There was a proposed upstream fix for the regression, but it has not gone > > > > into the tree as far as I know. > > > > > > > > http://www.spinics.net/lists/netdev/msg370436.html > > > [...] > > > > > > OK, I'll drop this for now. > > > > The fall out from not having this patch is in my opinion a bigger > > fallout than not having this patch. This patch fixes silent data > > corruption vs. the problem Ben Greear is talking about, which might not > > be that a common usage. > > > > What do others think? > > > > Bye, > > Hannes > > > > This patch from Cong Wang seems to fix the regression for me, I think it should be added and > tested in the main tree, and then apply them to stable as a pair. > > http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a Actually, no, this is not really a regression. If you capture packets on a device with checksum offloading enabled, the TCP/UDP checksum isn't filled. veth also behaves that way. What the "veth: don't modify ip_summed" patch does is enable proper checksum validation on veth. This really was a bug in veth. Cong's patch would also break cases where we choose to inject packets with invalid checksums, and they would now be accepted as correct. Your use case is invalid, it just happened to work because of a bug. If you want the stack to fill checksums so that you want capture and reinject packets, you have to disable checksum offloading (or compute the checksum yourself in userspace). Thanks.
On 04/28/2016 03:29 AM, Sabrina Dubroca wrote: > Hello, > > 2016-04-27, 17:14:44 -0700, Ben Greear wrote: >> On 04/27/2016 05:00 PM, Hannes Frederic Sowa wrote: >>> Hi Ben, >>> >>> On Wed, Apr 27, 2016, at 20:07, Ben Hutchings wrote: >>>> On Wed, 2016-04-27 at 08:59 -0700, Ben Greear wrote: >>>>> On 04/26/2016 04:02 PM, Ben Hutchings wrote: >>>>>> >>>>>> 3.2.80-rc1 review patch. If anyone has any objections, please let me know. >>>>> I would be careful about this. It causes regressions when sending >>>>> PACKET_SOCKET buffers from user-space to veth devices. >>>>> >>>>> There was a proposed upstream fix for the regression, but it has not gone >>>>> into the tree as far as I know. >>>>> >>>>> http://www.spinics.net/lists/netdev/msg370436.html >>>> [...] >>>> >>>> OK, I'll drop this for now. >>> >>> The fall out from not having this patch is in my opinion a bigger >>> fallout than not having this patch. This patch fixes silent data >>> corruption vs. the problem Ben Greear is talking about, which might not >>> be that a common usage. >>> >>> What do others think? >>> >>> Bye, >>> Hannes >>> >> >> This patch from Cong Wang seems to fix the regression for me, I think it should be added and >> tested in the main tree, and then apply them to stable as a pair. >> >> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a > > Actually, no, this is not really a regression. > > If you capture packets on a device with checksum offloading enabled, > the TCP/UDP checksum isn't filled. veth also behaves that way. What > the "veth: don't modify ip_summed" patch does is enable proper > checksum validation on veth. This really was a bug in veth. > > Cong's patch would also break cases where we choose to inject packets > with invalid checksums, and they would now be accepted as correct. > > Your use case is invalid, it just happened to work because of a > bug. If you want the stack to fill checksums so that you want capture > and reinject packets, you have to disable checksum offloading (or > compute the checksum yourself in userspace). Disabling checksum offloading or computing in user-space (and then recomputing in veth to verify the checksum?) is a huge performance loss. Maybe we could add a socket option to enable Cong's patch on a per-socket basis? That way my use-case can still work and you can have this new behaviour by default? Thanks, Ben
On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote: > Hello, > > 2016-04-27, 17:14:44 -0700, Ben Greear wrote: > > > > On 04/27/2016 05:00 PM, Hannes Frederic Sowa wrote: > > > > > > Hi Ben, > > > > > > On Wed, Apr 27, 2016, at 20:07, Ben Hutchings wrote: > > > > > > > > On Wed, 2016-04-27 at 08:59 -0700, Ben Greear wrote: > > > > > > > > > > On 04/26/2016 04:02 PM, Ben Hutchings wrote: > > > > > > > > > > > > > > > > > > 3.2.80-rc1 review patch. If anyone has any objections, please let me know. > > > > > I would be careful about this. It causes regressions when sending > > > > > PACKET_SOCKET buffers from user-space to veth devices. > > > > > > > > > > There was a proposed upstream fix for the regression, but it has not gone > > > > > into the tree as far as I know. > > > > > > > > > > http://www.spinics.net/lists/netdev/msg370436.html > > > > [...] > > > > > > > > OK, I'll drop this for now. > > > The fall out from not having this patch is in my opinion a bigger > > > fallout than not having this patch. This patch fixes silent data > > > corruption vs. the problem Ben Greear is talking about, which might not > > > be that a common usage. > > > > > > What do others think? > > > > > > Bye, > > > Hannes > > > > > This patch from Cong Wang seems to fix the regression for me, I think it should be added and > > tested in the main tree, and then apply them to stable as a pair. > > > > http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a > Actually, no, this is not really a regression. [...] It really is. Even though the old behaviour was a bug (raw packets should not be changed), if there are real applications that depend on that then we have to keep those applications working somehow. Ben.
On Thu, 2016-04-28 at 06:45 -0700, Ben Greear wrote: > > On 04/28/2016 03:29 AM, Sabrina Dubroca wrote: [...] > > Your use case is invalid, it just happened to work because of a > > bug. If you want the stack to fill checksums so that you want capture > > and reinject packets, you have to disable checksum offloading (or > > compute the checksum yourself in userspace). > Disabling checksum offloading or computing in user-space (and then > recomputing in veth to verify the checksum?) is a huge performance loss. > > Maybe we could add a socket option to enable Cong's patch on a per-socket > basis? That way my use-case can still work and you can have this new > behaviour by default? It does sound like a useful option to have. If there are other applications that depend on veth's checksum-fixing behaviour and are being distributed in binary form, then a per-device option might be necessary so users can keep those applications working before they're updated. Ben.
On 04/30/2016 11:33 AM, Ben Hutchings wrote: > On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote: >> Hello, >>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a >> Actually, no, this is not really a regression. > [...] > > It really is. Even though the old behaviour was a bug (raw packets > should not be changed), if there are real applications that depend on > that then we have to keep those applications working somehow. To be honest, I fail to see why the old behaviour is a bug when sending raw packets from user-space. If raw packets should not be changed, then we need some way to specify what the checksum setting is to begin with, otherwise, user-space has not enough control. A socket option for new programs, and sysctl configurable defaults for raw sockets for old binary programs would be sufficient I think. Thanks, Ben
We've put considerable effort into cleaning up the checksum interface to make it as unambiguous as possible, please be very careful to follow it. Broken checksum processing is really hard to detect and debug. CHECKSUM_UNNECESSARY means that some number of _specific_ checksums (indicated by csum_level) have been verified to be correct in a packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is never right. If CHECKSUM_UNNECESSARY is set in such a manner but the checksum it would refer to has not been verified and is incorrect this is a major bug. Tom On Sat, Apr 30, 2016 at 12:40 PM, Ben Greear <greearb@candelatech.com> wrote: > > > On 04/30/2016 11:33 AM, Ben Hutchings wrote: >> >> On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote: >>> >>> Hello, > > >>>> >>>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a >>> >>> Actually, no, this is not really a regression. >> >> [...] >> >> It really is. Even though the old behaviour was a bug (raw packets >> should not be changed), if there are real applications that depend on >> that then we have to keep those applications working somehow. > > > To be honest, I fail to see why the old behaviour is a bug when sending > raw packets from user-space. If raw packets should not be changed, then > we need some way to specify what the checksum setting is to begin with, > otherwise, user-space has not enough control. > > A socket option for new programs, and sysctl configurable defaults for raw > sockets > for old binary programs would be sufficient I think. > > > Thanks, > Ben > > -- > Ben Greear <greearb@candelatech.com> > Candela Technologies Inc http://www.candelatech.com
[oops – resending this because I was using gmail in HTML mode before by accident] There was a discussion on a separate thread about this. I agree with Sabrina fully. I believe veth should provide an abstraction layer that correctly emulates a physical network in all ways. Consider an environment where we have multiple physical computers. Each computer runs one or more containers, each of which has a publicly routable ip address. When adding a new app to the cluster, a scheduler might decide to run this container on any physical machine of its choice, assuming that apps have a way of routing traffic to their backends (we did something similar Google >10 years ago). This is something we might imagine happening with docker and ipv6 for instance. If you have an app, A, which sends raw ethernet frames (the simplest case I could imagine) with TCP data that has invalid checksums to app B, which is receiving it, the behaviour of the system _will be different_ depending upon whether app B is scheduled to run on the same machine as app A or not. This seems like a clear bug and a broken abstraction (especially as the default case), and something we should endeavour to avoid. I do see Ben's point about enabling sw checksum verification as potentially incurring a huge performance penalty (I haven't had a chance to measure it) that is completely wasteful in the vast majority of cases. Unfortunately I just don't see how we can solve this problem in a way that preserves a correct abstraction layer while also avoiding excess work. I guess the first piece of data that would be helpful is to determine just how big of a performance penalty this is. If it's small, then maybe it doesn't matter. On Thu, Apr 28, 2016 at 6:29 AM, Sabrina Dubroca <sd@queasysnail.net> wrote: > Hello, > > 2016-04-27, 17:14:44 -0700, Ben Greear wrote: >> On 04/27/2016 05:00 PM, Hannes Frederic Sowa wrote: >> > Hi Ben, >> > >> > On Wed, Apr 27, 2016, at 20:07, Ben Hutchings wrote: >> > > On Wed, 2016-04-27 at 08:59 -0700, Ben Greear wrote: >> > > > On 04/26/2016 04:02 PM, Ben Hutchings wrote: >> > > > > >> > > > > 3.2.80-rc1 review patch. If anyone has any objections, please let me know. >> > > > I would be careful about this. It causes regressions when sending >> > > > PACKET_SOCKET buffers from user-space to veth devices. >> > > > >> > > > There was a proposed upstream fix for the regression, but it has not gone >> > > > into the tree as far as I know. >> > > > >> > > > http://www.spinics.net/lists/netdev/msg370436.html >> > > [...] >> > > >> > > OK, I'll drop this for now. >> > >> > The fall out from not having this patch is in my opinion a bigger >> > fallout than not having this patch. This patch fixes silent data >> > corruption vs. the problem Ben Greear is talking about, which might not >> > be that a common usage. >> > >> > What do others think? >> > >> > Bye, >> > Hannes >> > >> >> This patch from Cong Wang seems to fix the regression for me, I think it should be added and >> tested in the main tree, and then apply them to stable as a pair. >> >> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a > > Actually, no, this is not really a regression. > > If you capture packets on a device with checksum offloading enabled, > the TCP/UDP checksum isn't filled. veth also behaves that way. What > the "veth: don't modify ip_summed" patch does is enable proper > checksum validation on veth. This really was a bug in veth. > > Cong's patch would also break cases where we choose to inject packets > with invalid checksums, and they would now be accepted as correct. > > Your use case is invalid, it just happened to work because of a > bug. If you want the stack to fill checksums so that you want capture > and reinject packets, you have to disable checksum offloading (or > compute the checksum yourself in userspace). > > Thanks. > > -- > Sabrina
On 04/30/2016 12:54 PM, Tom Herbert wrote: > We've put considerable effort into cleaning up the checksum interface > to make it as unambiguous as possible, please be very careful to > follow it. Broken checksum processing is really hard to detect and > debug. > > CHECKSUM_UNNECESSARY means that some number of _specific_ checksums > (indicated by csum_level) have been verified to be correct in a > packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is > never right. If CHECKSUM_UNNECESSARY is set in such a manner but the > checksum it would refer to has not been verified and is incorrect this > is a major bug. Suppose I know that the packet received on a packet-socket has already been verified by a NIC that supports hardware checksumming. Then, I want to transmit it on a veth interface using a second packet socket. I do not want veth to recalculate the checksum on transmit, nor to validate it on the peer veth on receive, because I do not want to waste the CPU cycles. I am assuming that my app is not accidentally corrupting frames, so the checksum can never be bad. How should the checksumming be configured for the packets going into the packet-socket from user-space? Also, I might want to send raw frames that do have broken checksums (lets assume a real NIC, not veth), and I want them to hit the wire with those bad checksums. How do I configure the checksumming in this case? Thanks, Ben > > Tom > > On Sat, Apr 30, 2016 at 12:40 PM, Ben Greear <greearb@candelatech.com> wrote: >> >> >> On 04/30/2016 11:33 AM, Ben Hutchings wrote: >>> >>> On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote: >>>> >>>> Hello, >> >> >>>>> >>>>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a >>>> >>>> Actually, no, this is not really a regression. >>> >>> [...] >>> >>> It really is. Even though the old behaviour was a bug (raw packets >>> should not be changed), if there are real applications that depend on >>> that then we have to keep those applications working somehow. >> >> >> To be honest, I fail to see why the old behaviour is a bug when sending >> raw packets from user-space. If raw packets should not be changed, then >> we need some way to specify what the checksum setting is to begin with, >> otherwise, user-space has not enough control. >> >> A socket option for new programs, and sysctl configurable defaults for raw >> sockets >> for old binary programs would be sufficient I think. >> >> >> Thanks, >> Ben >> >> -- >> Ben Greear <greearb@candelatech.com> >> Candela Technologies Inc http://www.candelatech.com >
On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear <greearb@candelatech.com> wrote: > > > On 04/30/2016 12:54 PM, Tom Herbert wrote: >> >> We've put considerable effort into cleaning up the checksum interface >> to make it as unambiguous as possible, please be very careful to >> follow it. Broken checksum processing is really hard to detect and >> debug. >> >> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums >> (indicated by csum_level) have been verified to be correct in a >> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is >> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the >> checksum it would refer to has not been verified and is incorrect this >> is a major bug. > > > Suppose I know that the packet received on a packet-socket has > already been verified by a NIC that supports hardware checksumming. > > Then, I want to transmit it on a veth interface using a second > packet socket. I do not want veth to recalculate the checksum on > transmit, nor to validate it on the peer veth on receive, because I do > not want to waste the CPU cycles. I am assuming that my app is not > accidentally corrupting frames, so the checksum can never be bad. > > How should the checksumming be configured for the packets going into > the packet-socket from user-space? It seems like that only the receiver should decide whether or not to checksum packets on the veth, not the sender. How about: We could add a receiving socket option for "don't checksum packets received from a veth when the other side has marked them as elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending socket option for "mark all data sent via this socket to a veth as elide-checksum-suggested". So the process would be: Writer: 1. open read socket 2. open write socket, with option elide-checksum-for-veth-suggested 3. write data Reader: 1. open read socket with "follow-elide-checksum-suggestions-on-veth" 2. read data The kernel / module would then need to persist the flag on all packets that traverse a veth, and drop these data when they leave the veth module. > > > Also, I might want to send raw frames that do have > broken checksums (lets assume a real NIC, not veth), and I want them > to hit the wire with those bad checksums. > > > How do I configure the checksumming in this case? Correct me if I'm wrong but I think this is already possible now. You can have packets with incorrect checksum hitting the wire as is. What you cannot do is instruct the receiving end to ignore the checksum from the sending end when using a physical device (and something I think we should mimic on the sending device). > > > > Thanks, > Ben > > > >> >> Tom >> >> On Sat, Apr 30, 2016 at 12:40 PM, Ben Greear <greearb@candelatech.com> wrote: >>> >>> >>> >>> On 04/30/2016 11:33 AM, Ben Hutchings wrote: >>>> >>>> >>>> On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote: >>>>> >>>>> >>>>> Hello, >>> >>> >>> >>>>>> >>>>>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a >>>>> >>>>> >>>>> Actually, no, this is not really a regression. >>>> >>>> >>>> [...] >>>> >>>> It really is. Even though the old behaviour was a bug (raw packets >>>> should not be changed), if there are real applications that depend on >>>> that then we have to keep those applications working somehow. >>> >>> >>> >>> To be honest, I fail to see why the old behaviour is a bug when sending >>> raw packets from user-space. If raw packets should not be changed, then >>> we need some way to specify what the checksum setting is to begin with, >>> otherwise, user-space has not enough control. >>> >>> A socket option for new programs, and sysctl configurable defaults for raw >>> sockets >>> for old binary programs would be sufficient I think. >>> >>> >>> Thanks, >>> Ben >>> >>> -- >>> Ben Greear <greearb@candelatech.com> >>> Candela Technologies Inc http://www.candelatech.com >> >> > > -- > Ben Greear <greearb@candelatech.com> > Candela Technologies Inc http://www.candelatech.com
On 04/30/2016 02:13 PM, Vijay Pandurangan wrote: > On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear <greearb@candelatech.com> wrote: >> >> >> On 04/30/2016 12:54 PM, Tom Herbert wrote: >>> >>> We've put considerable effort into cleaning up the checksum interface >>> to make it as unambiguous as possible, please be very careful to >>> follow it. Broken checksum processing is really hard to detect and >>> debug. >>> >>> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums >>> (indicated by csum_level) have been verified to be correct in a >>> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is >>> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the >>> checksum it would refer to has not been verified and is incorrect this >>> is a major bug. >> >> >> Suppose I know that the packet received on a packet-socket has >> already been verified by a NIC that supports hardware checksumming. >> >> Then, I want to transmit it on a veth interface using a second >> packet socket. I do not want veth to recalculate the checksum on >> transmit, nor to validate it on the peer veth on receive, because I do >> not want to waste the CPU cycles. I am assuming that my app is not >> accidentally corrupting frames, so the checksum can never be bad. >> >> How should the checksumming be configured for the packets going into >> the packet-socket from user-space? > > > It seems like that only the receiver should decide whether or not to > checksum packets on the veth, not the sender. > > How about: > > We could add a receiving socket option for "don't checksum packets > received from a veth when the other side has marked them as > elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending > socket option for "mark all data sent via this socket to a veth as > elide-checksum-suggested". > > So the process would be: > > Writer: > 1. open read socket > 2. open write socket, with option elide-checksum-for-veth-suggested > 3. write data > > Reader: > 1. open read socket with "follow-elide-checksum-suggestions-on-veth" > 2. read data > > The kernel / module would then need to persist the flag on all packets > that traverse a veth, and drop these data when they leave the veth > module. I'm not sure this works completely. In my app, the packet flow might be: eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <-> vethB <-> [kernel router/bridge logic ...] <-> eth1 There may be no sockets on the vethB port. And reader/writer is not a good way to look at it since I am implementing a bi-directional bridge in user-space and each packet-socket is for both rx and tx. >> Also, I might want to send raw frames that do have >> broken checksums (lets assume a real NIC, not veth), and I want them >> to hit the wire with those bad checksums. >> >> >> How do I configure the checksumming in this case? > > > Correct me if I'm wrong but I think this is already possible now. You > can have packets with incorrect checksum hitting the wire as is. What > you cannot do is instruct the receiving end to ignore the checksum > from the sending end when using a physical device (and something I > think we should mimic on the sending device). Yes, it does work currently (or, last I checked)...I just want to make sure it keeps working. Thanks, Ben
On Sat, Apr 30, 2016 at 5:29 PM, Ben Greear <greearb@candelatech.com> wrote: > > > On 04/30/2016 02:13 PM, Vijay Pandurangan wrote: >> >> On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear <greearb@candelatech.com> >> wrote: >>> >>> >>> >>> On 04/30/2016 12:54 PM, Tom Herbert wrote: >>>> >>>> >>>> We've put considerable effort into cleaning up the checksum interface >>>> to make it as unambiguous as possible, please be very careful to >>>> follow it. Broken checksum processing is really hard to detect and >>>> debug. >>>> >>>> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums >>>> (indicated by csum_level) have been verified to be correct in a >>>> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is >>>> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the >>>> checksum it would refer to has not been verified and is incorrect this >>>> is a major bug. >>> >>> >>> >>> Suppose I know that the packet received on a packet-socket has >>> already been verified by a NIC that supports hardware checksumming. >>> >>> Then, I want to transmit it on a veth interface using a second >>> packet socket. I do not want veth to recalculate the checksum on >>> transmit, nor to validate it on the peer veth on receive, because I do >>> not want to waste the CPU cycles. I am assuming that my app is not >>> accidentally corrupting frames, so the checksum can never be bad. >>> >>> How should the checksumming be configured for the packets going into >>> the packet-socket from user-space? >> >> >> >> It seems like that only the receiver should decide whether or not to >> checksum packets on the veth, not the sender. >> >> How about: >> >> We could add a receiving socket option for "don't checksum packets >> received from a veth when the other side has marked them as >> elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending >> socket option for "mark all data sent via this socket to a veth as >> elide-checksum-suggested". >> >> So the process would be: >> >> Writer: >> 1. open read socket >> 2. open write socket, with option elide-checksum-for-veth-suggested >> 3. write data >> >> Reader: >> 1. open read socket with "follow-elide-checksum-suggestions-on-veth" >> 2. read data >> >> The kernel / module would then need to persist the flag on all packets >> that traverse a veth, and drop these data when they leave the veth >> module. > > > I'm not sure this works completely. In my app, the packet flow might be: > > eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <-> vethB > <-> [kernel router/bridge logic ...] <-> eth1 Good point, so if you had: eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> veth B <-> userspace-stub <->eth1 and user-space hub enabled this elide flag, things would work, right? Then, it seems like what we need is a way to tell the kernel router/bridge logic to follow elide signals in packets coming from veth. I'm not sure what the best way to do this is because I'm less familiar with conventions in that part of the kernel, but assuming there's a way to do this, would it be acceptable? > > There may be no sockets on the vethB port. And reader/writer is not > a good way to look at it since I am implementing a bi-directional bridge in > user-space and each packet-socket is for both rx and tx. Sure, but we could model a bidrectional connection as two unidirectional sockets for our discussions here, right? > >>> Also, I might want to send raw frames that do have >>> broken checksums (lets assume a real NIC, not veth), and I want them >>> to hit the wire with those bad checksums. >>> >>> >>> How do I configure the checksumming in this case? >> >> >> >> Correct me if I'm wrong but I think this is already possible now. You >> can have packets with incorrect checksum hitting the wire as is. What >> you cannot do is instruct the receiving end to ignore the checksum >> from the sending end when using a physical device (and something I >> think we should mimic on the sending device). > > > Yes, it does work currently (or, last I checked)...I just want to make sure > it keeps working. Yeah, good point. It would be great if we could write a test, or at the very least, a list of invariants about what kinds of things should work somewhere. > > > Thanks, > Ben > > -- > Ben Greear <greearb@candelatech.com> > Candela Technologies Inc http://www.candelatech.com
On 04/30/2016 02:36 PM, Vijay Pandurangan wrote: > On Sat, Apr 30, 2016 at 5:29 PM, Ben Greear <greearb@candelatech.com> wrote: >> >> >> On 04/30/2016 02:13 PM, Vijay Pandurangan wrote: >>> >>> On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear <greearb@candelatech.com> >>> wrote: >>>> >>>> >>>> >>>> On 04/30/2016 12:54 PM, Tom Herbert wrote: >>>>> >>>>> >>>>> We've put considerable effort into cleaning up the checksum interface >>>>> to make it as unambiguous as possible, please be very careful to >>>>> follow it. Broken checksum processing is really hard to detect and >>>>> debug. >>>>> >>>>> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums >>>>> (indicated by csum_level) have been verified to be correct in a >>>>> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is >>>>> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the >>>>> checksum it would refer to has not been verified and is incorrect this >>>>> is a major bug. >>>> >>>> >>>> >>>> Suppose I know that the packet received on a packet-socket has >>>> already been verified by a NIC that supports hardware checksumming. >>>> >>>> Then, I want to transmit it on a veth interface using a second >>>> packet socket. I do not want veth to recalculate the checksum on >>>> transmit, nor to validate it on the peer veth on receive, because I do >>>> not want to waste the CPU cycles. I am assuming that my app is not >>>> accidentally corrupting frames, so the checksum can never be bad. >>>> >>>> How should the checksumming be configured for the packets going into >>>> the packet-socket from user-space? >>> >>> >>> >>> It seems like that only the receiver should decide whether or not to >>> checksum packets on the veth, not the sender. >>> >>> How about: >>> >>> We could add a receiving socket option for "don't checksum packets >>> received from a veth when the other side has marked them as >>> elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending >>> socket option for "mark all data sent via this socket to a veth as >>> elide-checksum-suggested". >>> >>> So the process would be: >>> >>> Writer: >>> 1. open read socket >>> 2. open write socket, with option elide-checksum-for-veth-suggested >>> 3. write data >>> >>> Reader: >>> 1. open read socket with "follow-elide-checksum-suggestions-on-veth" >>> 2. read data >>> >>> The kernel / module would then need to persist the flag on all packets >>> that traverse a veth, and drop these data when they leave the veth >>> module. >> >> >> I'm not sure this works completely. In my app, the packet flow might be: >> >> eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <-> vethB >> <-> [kernel router/bridge logic ...] <-> eth1 > > Good point, so if you had: > > eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> veth B <-> > userspace-stub <->eth1 > > and user-space hub enabled this elide flag, things would work, right? > Then, it seems like what we need is a way to tell the kernel > router/bridge logic to follow elide signals in packets coming from > veth. I'm not sure what the best way to do this is because I'm less > familiar with conventions in that part of the kernel, but assuming > there's a way to do this, would it be acceptable? You cannot receive on one veth without transmitting on the other, so I think the elide csum logic can go on the raw-socket, and apply to packets in the transmit-from-user-space direction. Just allowing the socket to make the veth behave like it used to before this patch in question should be good enough, since that worked for us for years. So, just an option to modify the ip_summed for pkts sent on a socket is probably sufficient. >> There may be no sockets on the vethB port. And reader/writer is not >> a good way to look at it since I am implementing a bi-directional bridge in >> user-space and each packet-socket is for both rx and tx. > > Sure, but we could model a bidrectional connection as two > unidirectional sockets for our discussions here, right? Best not to I think, you want to make sure that one socket can correctly handle tx and rx. As long as that works, then using uni-directional sockets should work too. Thanks, Ben
On Sat, Apr 30, 2016 at 5:52 PM, Ben Greear <greearb@candelatech.com> wrote: >> >> Good point, so if you had: >> >> eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> veth B <-> >> userspace-stub <->eth1 >> >> and user-space hub enabled this elide flag, things would work, right? >> Then, it seems like what we need is a way to tell the kernel >> router/bridge logic to follow elide signals in packets coming from >> veth. I'm not sure what the best way to do this is because I'm less >> familiar with conventions in that part of the kernel, but assuming >> there's a way to do this, would it be acceptable? > > > You cannot receive on one veth without transmitting on the other, so > I think the elide csum logic can go on the raw-socket, and apply to packets > in the transmit-from-user-space direction. Just allowing the socket to make > the veth behave like it used to before this patch in question should be good > enough, since that worked for us for years. So, just an option to modify > the > ip_summed for pkts sent on a socket is probably sufficient. I don't think this is right. Consider: - App A sends out corrupt packets 50% of the time and discards inbound data. - App B doesn't care about corrupt packets and is happy to receive them and has some way of dealing with them (special case) - App C is a regular app, say nc or something. In your world, where A decides what happens to data it transmits, then A<--veth-->B and A<---wire-->B will have the same behaviour but A<-- veth --> C and A<-- wire --> C will have _different_ behaviour: C will behave incorrectly if it's connected over veth but correctly if connected with a wire. That is a bug. Since A cannot know what the app it's talking to will desire, I argue that both sides of a message must be opted in to this optimization. > >>> There may be no sockets on the vethB port. And reader/writer is not >>> a good way to look at it since I am implementing a bi-directional bridge >>> in >>> user-space and each packet-socket is for both rx and tx. >> >> >> Sure, but we could model a bidrectional connection as two >> unidirectional sockets for our discussions here, right? > > > Best not to I think, you want to make sure that one socket can > correctly handle tx and rx. As long as that works, then using > uni-directional sockets should work too. > > > Thanks, > Ben > > -- > Ben Greear <greearb@candelatech.com> > Candela Technologies Inc http://www.candelatech.com
On Sat, Apr 30, 2016 at 1:59 PM, Ben Greear <greearb@candelatech.com> wrote: > > On 04/30/2016 12:54 PM, Tom Herbert wrote: >> >> We've put considerable effort into cleaning up the checksum interface >> to make it as unambiguous as possible, please be very careful to >> follow it. Broken checksum processing is really hard to detect and >> debug. >> >> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums >> (indicated by csum_level) have been verified to be correct in a >> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is >> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the >> checksum it would refer to has not been verified and is incorrect this >> is a major bug. > > > Suppose I know that the packet received on a packet-socket has > already been verified by a NIC that supports hardware checksumming. > > Then, I want to transmit it on a veth interface using a second > packet socket. I do not want veth to recalculate the checksum on > transmit, nor to validate it on the peer veth on receive, because I do > not want to waste the CPU cycles. I am assuming that my app is not > accidentally corrupting frames, so the checksum can never be bad. > > How should the checksumming be configured for the packets going into > the packet-socket from user-space? > Checksum unnecessary conversion to checksum complete should be done and then the computed value can be sent to user space. If you want to take advantage of the value on transmit then we would to change the interface. But I'm not sure why recalculating the the checksum in the host is even a problem. With all the advances in checksum offload, LCO, RCO it seems like that shouldn't be a common case anyway. > Also, I might want to send raw frames that do have > broken checksums (lets assume a real NIC, not veth), and I want them > to hit the wire with those bad checksums. > > How do I configure the checksumming in this case? Just send raw packets with bad checksums (no checksum offload). Tom > > > Thanks, > Ben > > > >> >> Tom >> >> On Sat, Apr 30, 2016 at 12:40 PM, Ben Greear <greearb@candelatech.com> >> wrote: >>> >>> >>> >>> On 04/30/2016 11:33 AM, Ben Hutchings wrote: >>>> >>>> >>>> On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote: >>>>> >>>>> >>>>> Hello, >>> >>> >>> >>>>>> >>>>>> >>>>>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a >>>>> >>>>> >>>>> Actually, no, this is not really a regression. >>>> >>>> >>>> [...] >>>> >>>> It really is. Even though the old behaviour was a bug (raw packets >>>> should not be changed), if there are real applications that depend on >>>> that then we have to keep those applications working somehow. >>> >>> >>> >>> To be honest, I fail to see why the old behaviour is a bug when sending >>> raw packets from user-space. If raw packets should not be changed, then >>> we need some way to specify what the checksum setting is to begin with, >>> otherwise, user-space has not enough control. >>> >>> A socket option for new programs, and sysctl configurable defaults for >>> raw >>> sockets >>> for old binary programs would be sufficient I think. >>> >>> >>> Thanks, >>> Ben >>> >>> -- >>> Ben Greear <greearb@candelatech.com> >>> Candela Technologies Inc http://www.candelatech.com >> >> > > -- > Ben Greear <greearb@candelatech.com> > Candela Technologies Inc http://www.candelatech.com
On 04/30/2016 03:01 PM, Vijay Pandurangan wrote: > On Sat, Apr 30, 2016 at 5:52 PM, Ben Greear <greearb@candelatech.com> wrote: >>> >>> Good point, so if you had: >>> >>> eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> veth B <-> >>> userspace-stub <->eth1 >>> >>> and user-space hub enabled this elide flag, things would work, right? >>> Then, it seems like what we need is a way to tell the kernel >>> router/bridge logic to follow elide signals in packets coming from >>> veth. I'm not sure what the best way to do this is because I'm less >>> familiar with conventions in that part of the kernel, but assuming >>> there's a way to do this, would it be acceptable? >> >> >> You cannot receive on one veth without transmitting on the other, so >> I think the elide csum logic can go on the raw-socket, and apply to packets >> in the transmit-from-user-space direction. Just allowing the socket to make >> the veth behave like it used to before this patch in question should be good >> enough, since that worked for us for years. So, just an option to modify >> the >> ip_summed for pkts sent on a socket is probably sufficient. > > I don't think this is right. Consider: > > - App A sends out corrupt packets 50% of the time and discards inbound data. > - App B doesn't care about corrupt packets and is happy to receive > them and has some way of dealing with them (special case) > - App C is a regular app, say nc or something. > > In your world, where A decides what happens to data it transmits, > then > A<--veth-->B and A<---wire-->B will have the same behaviour > > but > > A<-- veth --> C and A<-- wire --> C will have _different_ behaviour: C > will behave incorrectly if it's connected over veth but correctly if > connected with a wire. That is a bug. > > Since A cannot know what the app it's talking to will desire, I argue > that both sides of a message must be opted in to this optimization. How can you make a generic app C know how to do this? The path could be, for instance: eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <-> vethC <-> vethD <-> appC There are no sockets on vethB, but it does need to have special behaviour to elide csums. Even if appC is hacked to know how to twiddle some thing on it's veth port, mucking with vethD will have no effect on vethB. With regard to your example above, why would A corrupt packets? My guess: 1) It has bugs (so, fix the bugs, it could equally create incorrect data with proper checksums, so just enabling checksumming adds no useful protection.) 2) It means to corrupt frames. In that case, someone must expect that C should receive incorrect frames, otherwise why bother making App-A corrupt them in the first place? 3) You are explicitly trying to test the kernel checksum logic, so you want the kernel to detect the bad checksum and throw away the packet. In this case, just don't set the socket option in appA to elide checksums and the packet will be thrown away. Any other cases you can think of? Thanks, Ben
On Sat, Apr 30, 2016 at 03:43:51PM -0700, Ben Greear wrote: > On 04/30/2016 03:01 PM, Vijay Pandurangan wrote: > > Consider: > > > > - App A sends out corrupt packets 50% of the time and discards inbound data. (...) > How can you make a generic app C know how to do this? The path could be, > for instance: > > eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <-> vethC <-> vethD <-> appC > > There are no sockets on vethB, but it does need to have special behaviour to elide > csums. Even if appC is hacked to know how to twiddle some thing on it's veth port, > mucking with vethD will have no effect on vethB. > > With regard to your example above, why would A corrupt packets? My guess: > > 1) It has bugs (so, fix the bugs, it could equally create incorrect data with proper checksums, > so just enabling checksumming adds no useful protection.) I agree with Ben here, what he needs is the ability for userspace to be trusted when *forwarding* a packet. Ideally you'd only want to receive the csum status per packet on the packet socket and pass the same value on the vethA interface, with this status being kept when the packet reaches vethB. If A purposely corrupts packet, it's A's problem. It's similar to designing a NIC which intentionally corrupts packets and reports "checksum good". The real issue is that in order to do things right, the userspace bridge (here, "A") would really need to pass this status. In Ben's case as he says, bad checksum packets are dropped before reaching A, so that simplifies the process quite a bit and that might be what causes some confusion, but ideally we'd rather have recvmsg() and sendmsg() with these flags. I faced the exact same issue 3 years ago when playing with netmap, it was slow as hell because it would lose all checksum information when packets were passing through userland, resulting in GRO/GSO etc being disabled, and had to modify it to let userland preserve it. That's especially important when you have to deal with possibly corrupted packets not yet detected in the chain because the NIC did not validate their checksums. Willy
Mr Miller: How do you feel about a new socket-option to allow a socket to request the old veth behaviour? Thanks, Ben On 04/30/2016 10:30 PM, Willy Tarreau wrote: > On Sat, Apr 30, 2016 at 03:43:51PM -0700, Ben Greear wrote: >> On 04/30/2016 03:01 PM, Vijay Pandurangan wrote: >>> Consider: >>> >>> - App A sends out corrupt packets 50% of the time and discards inbound data. > (...) >> How can you make a generic app C know how to do this? The path could be, >> for instance: >> >> eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <-> vethC <-> vethD <-> appC >> >> There are no sockets on vethB, but it does need to have special behaviour to elide >> csums. Even if appC is hacked to know how to twiddle some thing on it's veth port, >> mucking with vethD will have no effect on vethB. >> >> With regard to your example above, why would A corrupt packets? My guess: >> >> 1) It has bugs (so, fix the bugs, it could equally create incorrect data with proper checksums, >> so just enabling checksumming adds no useful protection.) > > I agree with Ben here, what he needs is the ability for userspace to be > trusted when *forwarding* a packet. Ideally you'd only want to receive > the csum status per packet on the packet socket and pass the same value > on the vethA interface, with this status being kept when the packet > reaches vethB. > > If A purposely corrupts packet, it's A's problem. It's similar to designing > a NIC which intentionally corrupts packets and reports "checksum good". > > The real issue is that in order to do things right, the userspace bridge > (here, "A") would really need to pass this status. In Ben's case as he > says, bad checksum packets are dropped before reaching A, so that > simplifies the process quite a bit and that might be what causes some > confusion, but ideally we'd rather have recvmsg() and sendmsg() with > these flags. > > I faced the exact same issue 3 years ago when playing with netmap, it was > slow as hell because it would lose all checksum information when packets > were passing through userland, resulting in GRO/GSO etc being disabled, > and had to modify it to let userland preserve it. That's especially > important when you have to deal with possibly corrupted packets not yet > detected in the chain because the NIC did not validate their checksums. > > Willy >
From: Ben Greear <greearb@candelatech.com> Date: Fri, 13 May 2016 09:57:19 -0700 > How do you feel about a new socket-option to allow a socket to > request the old veth behaviour? I depend upon the opinions of the experts who work upstream on and maintain these components, since it is an area I am not so familiar with. Generally speaking asking me directly for opinions on matters like this isn't the way to go, in fact I kind of find it irritating. It can't all be on me.
On 05/13/2016 11:21 AM, David Miller wrote: > From: Ben Greear <greearb@candelatech.com> > Date: Fri, 13 May 2016 09:57:19 -0700 > >> How do you feel about a new socket-option to allow a socket to >> request the old veth behaviour? > > I depend upon the opinions of the experts who work upstream on and > maintain these components, since it is an area I am not so familiar > with. > > Generally speaking asking me directly for opinions on matters like > this isn't the way to go, in fact I kind of find it irritating. It > can't all be on me. > Fair enough, thanks for your time. Ben
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index da1ae0e..f8cc758 100644 (file) --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1926,6 +1926,7 @@ retry: goto out_unlock; } + skb->ip_summed = CHECKSUM_UNNECESSARY; skb->protocol = proto; skb->dev = dev; skb->priority = sk->sk_priority; @@ -2352,6 +2353,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb, ph.raw = frame; + skb->ip_summed = CHECKSUM_UNNECESSARY; skb->protocol = proto; skb->dev = dev; skb->priority = po->sk.sk_priority; @@ -2776,6 +2778,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) goto out_free; } + skb->ip_summed = CHECKSUM_UNNECESSARY; skb->protocol = proto; skb->dev = dev; skb->priority = sk->sk_priority;