| Message ID | 20160318232522.14955.13475.stgit@localhost.localdomain |
|---|---|
| State | RFC, archived |
| Delegated to: | David Miller |
| Headers | show |
On 18/03/16 23:25, Alexander Duyck wrote: > This patch adds support for something I am referring to as GSO partial. > The basic idea is that we can support a broader range of devices for > segmentation if we use fixed outer headers and have the hardware only > really deal with segmenting the inner header. The idea behind the naming > is due to the fact that everything before csum_start will be fixed headers, > and everything after will be the region that is handled by hardware. > > With the current implementation it allows us to add support for the > following GSO types with an inner TSO or TSO6 offload: > NETIF_F_GSO_GRE > NETIF_F_GSO_GRE_CSUM > NETIF_F_UDP_TUNNEL > NETIF_F_UDP_TUNNEL_CSUM > > Signed-off-by: Alexander Duyck <aduyck@mirantis.com> > --- If I'm correctly understanding what you're doing, you're building a large TCP segment, feeding it through the encapsulation drivers as normal, then at GSO time you're fixing up length fields, checksums etc. in the headers. I think we can do this more simply, by making it so that at the time when we _generate_ the TCP segment, we give it headers saying it's one MSS big, but have several MSS of data. Similarly when adding the encap headers, they all need to get their lengths from what the layer below tells them, rather than the current length of data in the SKB. Then at GSO time all the headers already have the right things in, and you don't need to call any per-protocol GSO callbacks for them. Any protocol that noticed it was putting something non-copyable in its headers (e.g. GRE with the Counter field, or an outer IP layer without DF set needing real IPIDs) would set a flag in the SKB to indicate that we really do need to call through the per-protocol GSO stuff. (Ideally, if we had a separate skb->gso_start field rather than piggybacking on csum_start, we could reset it to point just before us, so that any further headers outside us still can be copied rather than taking callbacks. 
But I'm not sure whether that's worth using up sk_buff real estate for.) (It might still be necessary to put the true length in the TCP header, if hardware is using that as an input to segmentation. I think sfc hardware just uses 'total length of all payload DMA descriptors', but others might behave differently.) However, I haven't yet had the time to attempt to implement this, so there might be some obvious reason I'm missing why this is impossible. Also, it's possible that I've completely misunderstood your patch and it's orthogonal to and can coexist with what I'm suggesting. -Ed
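To make the "headers describe one MSS" idea concrete, here is a minimal sketch (header sizes and struct/function names are illustrative assumptions for option-less IPv4 + VXLAN, not kernel code): each encapsulation layer derives its length fields from the layer below it, starting from a single MSS, rather than from the multi-MSS payload actually sitting in the SKB.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative header sizes for an option-less IPv4 + VXLAN + TCP stack;
 * real headers vary with options. */
enum {
    TCP_HLEN   = 20,
    IP_HLEN    = 20,
    UDP_HLEN   = 8,
    VXLAN_HLEN = 8,
};

struct seg_lens {
    uint16_t inner_ip_tot_len;  /* inner IPv4 Total Length */
    uint16_t outer_udp_len;     /* outer UDP Length        */
    uint16_t outer_ip_tot_len;  /* outer IPv4 Total Length */
};

/* Each layer takes its length from the layer below it, seeded with one
 * MSS of payload - not with the (multi-MSS) length of data in the SKB. */
static struct seg_lens per_segment_lens(uint16_t mss)
{
    struct seg_lens l;

    l.inner_ip_tot_len = IP_HLEN + TCP_HLEN + mss;
    l.outer_udp_len    = UDP_HLEN + VXLAN_HLEN + l.inner_ip_tot_len;
    l.outer_ip_tot_len = IP_HLEN + l.outer_udp_len;
    return l;
}
```

With all length fields already describing a single segment, a header-copying segmenter never has to touch them.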
On Tue, Mar 22, 2016 at 10:00 AM, Edward Cree <ecree@solarflare.com> wrote: > On 18/03/16 23:25, Alexander Duyck wrote: >> This patch adds support for something I am referring to as GSO partial. >> The basic idea is that we can support a broader range of devices for >> segmentation if we use fixed outer headers and have the hardware only >> really deal with segmenting the inner header. The idea behind the naming >> is due to the fact that everything before csum_start will be fixed headers, >> and everything after will be the region that is handled by hardware. >> >> With the current implementation it allows us to add support for the >> following GSO types with an inner TSO or TSO6 offload: >> NETIF_F_GSO_GRE >> NETIF_F_GSO_GRE_CSUM >> NETIF_F_UDP_TUNNEL >> NETIF_F_UDP_TUNNEL_CSUM >> >> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> >> --- > If I'm correctly understanding what you're doing, you're building a large > TCP segment, feeding it through the encapsulation drivers as normal, then > at GSO time you're fixing up length fields, checksums etc. in the headers. > I think we can do this more simply, by making it so that at the time when > we _generate_ the TCP segment, we give it headers saying it's one MSS big, > but have several MSS of data. Similarly when adding the encap headers, > they all need to get their lengths from what the layer below tells them, > rather than the current length of data in the SKB. Then at GSO time all > the headers already have the right things in, and you don't need to call > any per-protocol GSO callbacks for them. One issue I have to deal with here is that we have no way of knowing what the underlying hardware can support at the time of segment being created. You have to keep in mind that what we have access to is the tunnel dev in many cases, not the underlying dev so we don't know if things can be offloaded to hardware or not. 
By pushing this logic into the GSO code we can actually implement it without much overhead since we either segment it into an MSS multiple, or into single MSS sized chunks. This way we defer the decision until the very last moment when we actually know if we can offload some portion of this in hardware or not. > Any protocol that noticed it was putting something non-copyable in its > headers (e.g. GRE with the Counter field, or an outer IP layer without DF > set needing real IPIDs) would set a flag in the SKB to indicate that we > really do need to call through the per-protocol GSO stuff. (Ideally, if > we had a separate skb->gso_start field rather than piggybacking on > csum_start, we could reset it to point just before us, so that any further > headers outside us still can be copied rather than taking callbacks. But > I'm not sure whether that's worth using up sk_buff real estate for.) The idea behind piggybacking on csum_start was due to the fact that we cannot perform GSO/TSO unless CHECKSUM_PARTIAL is set. As far as I know in the case of TCP offloads this always ends up being the inner-most L4 header so it works out in that it actually reduces code path as we were having to deal with all the skb->encapsulation checks. It was a relationship that already existed, I just decided to make use of it since it simplifies things pretty significantly. As far as retreating I don't really see how that would work. In most cases it is an all-or-nothing proposition to setup these outer headers. Either we can segment the frame with the outer headers replicated or we cannot. I suspect it would end up being a common case where the hardware will update the outer IP and inner TCP headers, but I think the outer L4 and inner IP headers will be the ones that most likely always end up being static. Also we already have code paths in place in the GRE driver for instance that prevent us from using GSO in the case of TUNNEL_SEQ being enabled. 
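The deferred decision described here can be sketched as follows (a hypothetical illustration, not the kernel's GSO code): at GSO time, once the egress device is known, the superframe payload is either kept as one chunk whose length is an exact MSS multiple (hardware finishes the segmentation) or split into single-MSS frames in software.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of deferring the offload decision to GSO time. */
struct gso_split {
    size_t chunk_len;   /* payload handed on as one partial-GSO frame */
    size_t n_frames;    /* number of frames produced by this step     */
};

static struct gso_split gso_partial_decide(size_t payload_len, size_t mss,
                                           int hw_supports_partial)
{
    struct gso_split s;

    if (hw_supports_partial) {
        /* trim to the largest MSS multiple; any remainder would go out
         * as a separate, software-built trailing segment */
        s.chunk_len = (payload_len / mss) * mss;
        s.n_frames = 1;
    } else {
        /* classic software GSO: one frame per MSS of payload */
        s.chunk_len = mss;
        s.n_frames = (payload_len + mss - 1) / mss;
    }
    return s;
}
```

Because nothing earlier in the stack had to commit to either path, the same superframe works for both outcomes.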
> (It might still be necessary to put the true length in the TCP header, if > hardware is using that as an input to segmentation. I think sfc hardware > just uses 'total length of all payload DMA descriptors', but others might > behave differently.) That is what most drivers do. The way I kind of retained that is that the TCP header doesn't include an actual length field, but I left the pseudo-header using the full length of all data. My thought was to end up using something like the ixgbe approach for most devices. What I did there was replicate the tunnel headers and inner IPv4 or IPv6 header. In the case of ixgbe and i40e I can throw away the checksum and length values for the outer IP header, one thing I was curious about is if I really needed to retain the full packet size for those. > However, I haven't yet had the time to attempt to implement this, so there > might be some obvious reason I'm missing why this is impossible. > Also, it's possible that I've completely misunderstood your patch and it's > orthogonal to and can coexist with what I'm suggesting. The one piece I could really use would be an understanding of what inputs your hardware is expecting in order for us to extend TSO to support this kind of approach. Then I could start tailoring the output generated so that we had something that would work with more devices. I was thinking the approach I have taken is fairly generic since essentially it allows us to get away with TSO as long as we are allowed to provide the offsets for the IP header and the TCP header. From what I can tell it looks like the Solarflare drivers do something similar so you might even try making changes similar to what I did for ixgbe to see if you can get a proof of concept working for sfc. - Alex
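The pseudo-header point can be made concrete. The TCP pseudo-header checksum is just a ones-complement sum containing one length term, so swapping the superframe length for a per-segment length (or vice versa) is a single incremental edit. A sketch, with host byte order and illustrative values — these are not the kernel's csum helpers:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return sum;
}

/* IPv4 TCP pseudo-header sum: source/destination addresses (as four
 * 16-bit words), protocol number, and the TCP length (header + payload). */
static uint16_t pseudo_sum(const uint16_t addr_words[4], uint16_t proto,
                           uint16_t len)
{
    uint32_t sum = (uint32_t)proto + len;

    for (int i = 0; i < 4; i++)
        sum += addr_words[i];
    return (uint16_t)fold16(sum);
}

/* Swap the length term: subtracting in ones-complement arithmetic is
 * adding the complement, so drop old_len and add new_len in one fold. */
static uint16_t pseudo_sum_swap_len(uint16_t sum, uint16_t old_len,
                                    uint16_t new_len)
{
    return (uint16_t)fold16((uint32_t)sum + (uint16_t)~old_len + new_len);
}
```

Seeding the checksum with the full length and cancelling it later, or seeding with the per-segment length up front, are therefore the same amount of arithmetic; the debate is only about where that edit happens.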
On 22/03/16 17:47, Alexander Duyck wrote: > On Tue, Mar 22, 2016 at 10:00 AM, Edward Cree <ecree@solarflare.com> wrote: >> On 18/03/16 23:25, Alexander Duyck wrote: >>> This patch adds support for something I am referring to as GSO partial. >>> The basic idea is that we can support a broader range of devices for >>> segmentation if we use fixed outer headers and have the hardware only >>> really deal with segmenting the inner header. The idea behind the naming >>> is due to the fact that everything before csum_start will be fixed headers, >>> and everything after will be the region that is handled by hardware. >>> >>> With the current implementation it allows us to add support for the >>> following GSO types with an inner TSO or TSO6 offload: >>> NETIF_F_GSO_GRE >>> NETIF_F_GSO_GRE_CSUM >>> NETIF_F_UDP_TUNNEL >>> NETIF_F_UDP_TUNNEL_CSUM >>> >>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> >>> --- >> If I'm correctly understanding what you're doing, you're building a large >> TCP segment, feeding it through the encapsulation drivers as normal, then >> at GSO time you're fixing up length fields, checksums etc. in the headers. >> I think we can do this more simply, by making it so that at the time when >> we _generate_ the TCP segment, we give it headers saying it's one MSS big, >> but have several MSS of data. Similarly when adding the encap headers, >> they all need to get their lengths from what the layer below tells them, >> rather than the current length of data in the SKB. Then at GSO time all >> the headers already have the right things in, and you don't need to call >> any per-protocol GSO callbacks for them. > One issue I have to deal with here is that we have no way of knowing > what the underlying hardware can support at the time of segment being > created. You have to keep in mind that what we have access to is the > tunnel dev in many cases, not the underlying dev so we don't know if > things can be offloaded to hardware or not. 
By pushing this logic > into the GSO code we can actually implement it without much overhead > since we either segment it into an MSS multiple, or into single MSS > sized chunks. This way we defer the decision until the very last > moment when we actually know if we can offload some portion of this in > hardware or not. But won't the tunnel dev have the feature flag for GSO_PARTIAL depending on what the underlying dev advertises? (Or, at least, could we make it be that way?) Alternatively, have per-protocol GSO callbacks to do the fixup in the opposite direction to what you have now - in the long term we hope that hardware supporting GSO partial will become the common case, so that should be the fast path without bouncing backwards through GSO callbacks. Then, if you find out at GSO time that the hardware wants to do old-style TSO, you call those callbacks so as to give it a superframe with the long lengths filled in everywhere. (Even that might not be necessary; it's a question of whether hardware makes assumptions about what those fields contain when folding its packet edits into checksums. Since this is going to be driver-specific and drivers doing these things will have a fixed list of what encaps they can parse and therefore do this for, maybe all these fixups could be done by the driver - using common helper functions, of course - in its TSO path.) >> Any protocol that noticed it was putting something non-copyable in its >> headers (e.g. GRE with the Counter field, or an outer IP layer without DF >> set needing real IPIDs) would set a flag in the SKB to indicate that we >> really do need to call through the per-protocol GSO stuff. (Ideally, if >> we had a separate skb->gso_start field rather than piggybacking on >> csum_start, we could reset it to point just before us, so that any further >> headers outside us still can be copied rather than taking callbacks. But >> I'm not sure whether that's worth using up sk_buff real estate for.) 
> The idea behind piggybacking on csum_start was due to the fact that we > cannot perform GSO/TSO unless CHECKSUM_PARTIAL is set. As far as I > know in the case of TCP offloads this always ends up being the > inner-most L4 header so it works out in that it actually reduces code > path as we were having to deal with all the skb->encapsulation checks. > It was a relationship that already existed, I just decided to make use > of it since it simplifies things pretty significantly. Yes; it's a clever idea. Only trouble is that we really want the inner IP header rather than the inner TCP header, so that we can (if we want to) increment the inner IP IDs. Of course, if we Officially Don't Care about inner IP IDs that's not a problem. I wonder if we could just always fill in inner_network_header even if we're not doing encapsulation. Or does it end up pointing to a 'middle' header in the case of nested encap? > As far as retreating I don't really see how that would work. In most > cases it is an all-or-nothing proposition to setup these outer > headers. Either we can segment the frame with the outer headers > replicated or we cannot. I suspect it would end up being a common > case where the hardware will update the outer IP and inner TCP > headers, but I think the outer L4 and inner IP headers will be the > ones that most likely always end up being static. Having thought a bit more about this, I think supporting anything other than "hardware updates inner [IP and] TCP headers" is needlessly complex (well, we still have to handle "hardware updates everything 'cos it thinks it knows best", because that already exists in the wild in hardware that might not support the new way). I don't think there's likely to be a case where hardware can do half of the segmentation at the same time as copying headers for the other half. I also still don't see why hardware would want to update the outer IP header - can you explain? 
> Also we already > have code paths in place in the GRE driver for instance that prevent > us from using GSO in the case of TUNNEL_SEQ being enabled. Oh good, one less thing to worry about. >> (It might still be necessary to put the true length in the TCP header, if >> hardware is using that as an input to segmentation. I think sfc hardware >> just uses 'total length of all payload DMA descriptors', but others might >> behave differently.) > That is what most drivers do. The way I kind of retained that is that > the TCP header doesn't include an actual length field, but I left the > pseudo-header using the full length of all data. But then you're guaranteed to have to update the outer L4 checksum when you segment (because outer LCO reads the inner pseudo-header checksum). Why not use the single-segment length in the pseudo-header, then the outer L4 checksum is already the right thing? (And if your hardware can't be told to leave the outer L4 checksum alone, then it's not worth the trouble of trying to support GSO partial for it, since it clearly wants to do old-style "NIC knows best" TSO.) Then if the hardware is assuming the (inner) pseudo is using the full length, and is going to include that edit in its checksum calculation, you can just do the opposite edit in the driver, just before handing the packet off to the hardware. Again, the idea is that we optimise for GSO partial by making it a plain header copy everywhere, and put all the 'fix things up' on the _other_ path. And yes, I forgot (and keep forgetting) that the TCP header doesn't have an explicit length field. > My thought was to > end up using something like the ixgbe approach for most devices. What > I did there was replicate the tunnel headers and inner IPv4 or IPv6 > header. In the case of ixgbe and i40e I can throw away the checksum > and length values for the outer IP header, one thing I was curious > about is if I really needed to retain the full packet size for those. 
Again, the outer IP header should be computed for a single segment rather than for the superframe, so that it doesn't need to be edited later. It should be possible to send a "GSO partial" frame to TSO without a single GSO callback needing to be called; similarly, software GSO should be able to just copy the outer headers, and only need to know how to update the TCP header. (See below for my "what a NIC should do" TSO design, which software can easily emulate.) >> However, I haven't yet had the time to attempt to implement this, so there >> might be some obvious reason I'm missing why this is impossible. >> Also, it's possible that I've completely misunderstood your patch and it's >> orthogonal to and can coexist with what I'm suggesting. > The one piece I could really use would be an understanding of what > inputs your hardware is expecting in order for us to extend TSO to > support this kind of approach. Then I could start tailoring the > output generated so that we had something that would work with more > devices. I was thinking the approach I have taken is fairly generic > since essentially it allows us to get away with TSO as long as we are > allowed to provide the offsets for the IP header and the TCP header. > From what I can tell it looks like the Solarflare drivers do something > similar so you might even try making changes similar to what I did for > ixgbe to see if you can get a proof of concept working for sfc. So, this is all still slightly speculative because while I've talked to some of our firmware developers, we haven't got as far as actually writing the new firmware. I'd also like to make clear that this isn't "what Solarflare has officially decided to do"; rather it's "what I'm currently trying to convince people at Solarflare to do". 
But what I think we're going to end up with is this:

The kernel will give us a packet that looks like a single MSS-sized segment except that the payload is too long; the length fields in all the headers are for an MSS-sized segment, and the checksums are correct for that (except that the inner TCP checksum is, of course, the pseudo-header sum rather than a sum over the whole payload). The kernel will also tell us where in the packet the inner IP header begins. The driver will then give the following descriptors to the hardware:
* A TSO descriptor, containing the offset of the inner IP header, and the MSS to use for segmentation.
* A DMA descriptor containing all the headers (i.e. up to the end of the inner TCP header).
* A series of DMA descriptors containing the payload, with a total length divisible by the MSS we thought of earlier.

The NIC can now read IHL from the inner IP header, and thereby compute the offset of the inner TCP header, and the csum_start/offset values. Then for each MSS-sized block of payload, the NIC will do the following:
* transmit header + payload block
* increment inner IP ID, and decrement inner IP checksum (ones-complement)
* add MSS to TCP sequence number

I believe this is something that any NIC with TSO support should be able to learn to do, with appropriate firmware changes. It might be a while before there are NICs in the wild that can do this, though. -Ed
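The per-segment edit loop described here is also easy to emulate in software. A sketch (field layout and values are illustrative, not any NIC's actual interface), using the RFC 1624 incremental update so the inner IP checksum tracks the IP ID bump without a full recompute:

```c
#include <assert.h>
#include <stdint.h>

/* RFC 1624 (eqn. 3) incremental update of a ones-complement checksum
 * when one 16-bit field changes from old to new_val. */
static uint16_t csum16_update(uint16_t check, uint16_t old, uint16_t new_val)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old;
    sum += new_val;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Full IPv4 header checksum over host-order 16-bit words, for checking
 * the incremental path against a from-scratch recompute. */
static uint16_t ipv4_csum(const uint16_t *words, int n)
{
    uint32_t sum = 0;

    for (int i = 0; i < n; i++)
        sum += words[i];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

struct seg_hdrs {
    uint16_t ip_id;      /* inner IPv4 Identification  */
    uint16_t ip_check;   /* inner IPv4 header checksum */
    uint32_t tcp_seq;    /* inner TCP sequence number  */
};

/* What the NIC would do between emitting one segment and the next:
 * bump the IP ID (adjusting the checksum) and advance the sequence. */
static void next_segment(struct seg_hdrs *h, uint16_t mss)
{
    uint16_t new_id = (uint16_t)(h->ip_id + 1);

    h->ip_check = csum16_update(h->ip_check, h->ip_id, new_id);
    h->ip_id = new_id;
    h->tcp_seq += mss;
}
```

All three edits are local to the copied header, which is what makes this loop cheap enough to sit in firmware or in a driver's TSO fallback path.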
On Tue, Mar 22, 2016 at 12:40 PM, Edward Cree <ecree@solarflare.com> wrote: > On 22/03/16 17:47, Alexander Duyck wrote: >> On Tue, Mar 22, 2016 at 10:00 AM, Edward Cree <ecree@solarflare.com> wrote: >>> On 18/03/16 23:25, Alexander Duyck wrote: >>>> This patch adds support for something I am referring to as GSO partial. >>>> The basic idea is that we can support a broader range of devices for >>>> segmentation if we use fixed outer headers and have the hardware only >>>> really deal with segmenting the inner header. The idea behind the naming >>>> is due to the fact that everything before csum_start will be fixed headers, >>>> and everything after will be the region that is handled by hardware. >>>> >>>> With the current implementation it allows us to add support for the >>>> following GSO types with an inner TSO or TSO6 offload: >>>> NETIF_F_GSO_GRE >>>> NETIF_F_GSO_GRE_CSUM >>>> NETIF_F_UDP_TUNNEL >>>> NETIF_F_UDP_TUNNEL_CSUM >>>> >>>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> >>>> --- >>> If I'm correctly understanding what you're doing, you're building a large >>> TCP segment, feeding it through the encapsulation drivers as normal, then >>> at GSO time you're fixing up length fields, checksums etc. in the headers. >>> I think we can do this more simply, by making it so that at the time when >>> we _generate_ the TCP segment, we give it headers saying it's one MSS big, >>> but have several MSS of data. Similarly when adding the encap headers, >>> they all need to get their lengths from what the layer below tells them, >>> rather than the current length of data in the SKB. Then at GSO time all >>> the headers already have the right things in, and you don't need to call >>> any per-protocol GSO callbacks for them. >> One issue I have to deal with here is that we have no way of knowing >> what the underlying hardware can support at the time of segment being >> created. 
You have to keep in mind that what we have access to is the >> tunnel dev in many cases, not the underlying dev so we don't know if >> things can be offloaded to hardware or not. By pushing this logic >> into the GSO code we can actually implement it without much overhead >> since we either segment it into an MSS multiple, or into single MSS >> sized chunks. This way we defer the decision until the very last >> moment when we actually know if we can offload some portion of this in >> hardware or not. > But won't the tunnel dev have the feature flag for GSO_PARTIAL depending > on what the underlying dev advertises? (Or, at least, could we make it > be that way?) Features that have been designed this way in the past are usually pretty fragile. Not only would you have to track changes in the routing table but you could have bridges, tc, vlan devices, etc. all of which might change the path of the packet and would have to somehow propagate this information. It's much more robust to resolve the device capabilities just before you hand the packet to that device. Plus, anything along the path of the packet (iptables, for example) that looks at the headers might potentially need to be aware of this optimization. You're also assuming that the generating TCP stack is resident on the same machine as the device that does the offloads. That's not necessarily true in the case of VMs or remote senders whose packets have been GRO'ed. Keeping the core stack consistent and just handling this at the GRO/driver layer as Alex has here seems preferable to me.
From: Jesse Gross <jesse@kernel.org> Date: Tue, 22 Mar 2016 13:11:21 -0700 > Features that have been designed this way in the past are usually > pretty fragile. Not only would you have to track changes in the > routing table but you could have bridges, tc, vlan devices, etc. all > of which might change the path of the packet and would have to somehow > propagate this information. It's much more robust to resolve the > device capabilities just before you hand the packet to that device. > Plus, anything along the path of the packet (iptables, for example) > that looks at the headers might potentially need to be aware of this > optimization. Indeed, this is a major fundamental issue in our stack right now. I keep being reminded of that ugly change we had to make to accommodate scatter-gather limitations for Infiniband devices, exactly because properties don't propagate properly through all of the layers right now. But we have to solve this somehow, the packetizer has to know certain basic properties of the ultimate device in order to function properly. This requirement is unavoidable.
On Tue, Mar 22, 2016 at 12:40 PM, Edward Cree <ecree@solarflare.com> wrote: > On 22/03/16 17:47, Alexander Duyck wrote: >> On Tue, Mar 22, 2016 at 10:00 AM, Edward Cree <ecree@solarflare.com> wrote: >>> On 18/03/16 23:25, Alexander Duyck wrote: >>>> This patch adds support for something I am referring to as GSO partial. >>>> The basic idea is that we can support a broader range of devices for >>>> segmentation if we use fixed outer headers and have the hardware only >>>> really deal with segmenting the inner header. The idea behind the naming >>>> is due to the fact that everything before csum_start will be fixed headers, >>>> and everything after will be the region that is handled by hardware. >>>> >>>> With the current implementation it allows us to add support for the >>>> following GSO types with an inner TSO or TSO6 offload: >>>> NETIF_F_GSO_GRE >>>> NETIF_F_GSO_GRE_CSUM >>>> NETIF_F_UDP_TUNNEL >>>> NETIF_F_UDP_TUNNEL_CSUM >>>> >>>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> >>>> --- >>> If I'm correctly understanding what you're doing, you're building a large >>> TCP segment, feeding it through the encapsulation drivers as normal, then >>> at GSO time you're fixing up length fields, checksums etc. in the headers. >>> I think we can do this more simply, by making it so that at the time when >>> we _generate_ the TCP segment, we give it headers saying it's one MSS big, >>> but have several MSS of data. Similarly when adding the encap headers, >>> they all need to get their lengths from what the layer below tells them, >>> rather than the current length of data in the SKB. Then at GSO time all >>> the headers already have the right things in, and you don't need to call >>> any per-protocol GSO callbacks for them. >> One issue I have to deal with here is that we have no way of knowing >> what the underlying hardware can support at the time of segment being >> created. 
You have to keep in mind that what we have access to is the >> tunnel dev in many cases, not the underlying dev so we don't know if >> things can be offloaded to hardware or not. By pushing this logic >> into the GSO code we can actually implement it without much overhead >> since we either segment it into an MSS multiple, or into single MSS >> sized chunks. This way we defer the decision until the very last >> moment when we actually know if we can offload some portion of this in >> hardware or not. > But won't the tunnel dev have the feature flag for GSO_PARTIAL depending > on what the underlying dev advertises? (Or, at least, could we make it > be that way?) This stuff doesn't work. That is why tunnels now advertise all available features that can be offloaded via software. Basically if we can advertise a feature we do, and then we sort things out in software if we cannot actually do it in hardware. > Alternatively, have per-protocol GSO callbacks to do the fixup in the > opposite direction to what you have now - in the long term we hope that > hardware supporting GSO partial will become the common case, so that > should be the fast path without bouncing backwards through GSO callbacks. > Then, if you find out at GSO time that the hardware wants to do old-style > TSO, you call those callbacks so as to give it a superframe with the long > lengths filled in everywhere. (Even that might not be necessary; it's a > question of whether hardware makes assumptions about what those fields > contain when folding its packet edits into checksums. Since this is > going to be driver-specific and drivers doing these things will have a > fixed list of what encaps they can parse and therefore do this for, maybe > all these fixups could be done by the driver - using common helper > functions, of course - in its TSO path.) I thought about doing that but decided it was much simpler to simply update all headers. 
For now I want to keep this as simple as possible while we get the first few drivers on board. If we later want to optimize and add complexity we can go that route, but for now this change is more than fast enough as it allows me to push an i40e at 20Gb/s while sending frames with outer checksums enabled. >>> Any protocol that noticed it was putting something non-copyable in its >>> headers (e.g. GRE with the Counter field, or an outer IP layer without DF >>> set needing real IPIDs) would set a flag in the SKB to indicate that we >>> really do need to call through the per-protocol GSO stuff. (Ideally, if >>> we had a separate skb->gso_start field rather than piggybacking on >>> csum_start, we could reset it to point just before us, so that any further >>> headers outside us still can be copied rather than taking callbacks. But >>> I'm not sure whether that's worth using up sk_buff real estate for.) >> The idea behind piggybacking on csum_start was due to the fact that we >> cannot perform GSO/TSO unless CHECKSUM_PARTIAL is set. As far as I >> know in the case of TCP offloads this always ends up being the >> inner-most L4 header so it works out in that it actually reduces code >> path as we were having to deal with all the skb->encapsulation checks. >> It was a relationship that already existed, I just decided to make use >> of it since it simplifies things pretty significantly. > Yes; it's a clever idea. Only trouble is that we really want the inner IP > header rather than the inner TCP header, so that we can (if we want to) > increment the inner IP IDs. Of course, if we Officially Don't Care about > inner IP IDs that's not a problem. The inner IP IDs are the ones that are guaranteed to be the ones we can leave fixed since TCP will require that the DF bit be set. The current VXLAN implementation does not set DF for the outer headers. So really we don't have too many options right now if we are wanting to support tunnels. 
> I wonder if we could just always fill in inner_network_header even if we're > not doing encapsulation. Or does it end up pointing to a 'middle' header > in the case of nested encap? Right now nested encap is not supported because tunnels don't advertise hw_encap_features. >> As far as retreating I don't really see how that would work. In most >> cases it is an all-or-nothing proposition to setup these outer >> headers. Either we can segment the frame with the outer headers >> replicated or we cannot. I suspect it would end up being a common >> case where the hardware will update the outer IP and inner TCP >> headers, but I think the outer L4 and inner IP headers will be the >> ones that most likely always end up being static. > Having thought a bit more about this, I think supporting anything other > than "hardware updates inner [IP and] TCP headers" is needlessly complex > (well, we still have to handle "hardware updates everything 'cos it > thinks it knows best", because that already exists in the wild in > hardware that might not support the new way). I don't think there's > likely to be a case where hardware can do half of the segmentation at > the same time as copying headers for the other half. The ixgbe approach skips the inner IP header. The inner IPv4 ID field can be fixed since TCP requires the DF bit be set. > I also still don't see why hardware would want to update the outer IP > header - can you explain? Inner IP header has DF bit set. As such we can ignore the IPv4 ID field. The outer header will not have DF set. As such RFC6864 says we have to increment it since it could be fragmented. >> Also we already >> have code paths in place in the GRE driver for instance that prevent >> us from using GSO in the case of TUNNEL_SEQ being enabled. > Oh good, one less thing to worry about. > >>> (It might still be necessary to put the true length in the TCP header, if >>> hardware is using that as an input to segmentation. 
I think sfc hardware >>> just uses 'total length of all payload DMA descriptors', but others might >>> behave differently.) >> That is what most drivers do. The way I kind of retained that is that >> the TCP header doesn't include an actual length field, but I left the >> pseudo-header using the full length of all data. > But then you're guaranteed to have to update the outer L4 checksum when > you segment (because outer LCO reads the inner pseudo-header checksum). Yes, but the LCO checksum is computed when we are constructing the headers. I have already taken that into account in my patches. > Why not use the single-segment length in the pseudo-header, then the > outer L4 checksum is already the right thing? (And if your hardware > can't be told to leave the outer L4 checksum alone, then it's not worth > the trouble of trying to support GSO partial for it, since it clearly > wants to do old-style "NIC knows best" TSO.) The problem then becomes that I need special-case code. One for standard TSO and one for this special case. If I leave it as is I can use the same code to cancel out the length in the TCP pseudo-header checksum for either case. > Then if the hardware is assuming the (inner) pseudo is using the full > length, and is going to include that edit in its checksum calculation, > you can just do the opposite edit in the driver, just before handing > the packet off to the hardware. I want to keep the scope of this change limited. By keeping the inner TCP checksum calculation working the same way I don't have to write up some new function. The general idea of my approach is to treat everything from the UDP or GRE header to the inner IP header as an IP header extension type value. That will allow us to trick most hardware into being able to accept it. > Again, the idea is that we optimise for GSO partial by making it a plain > header copy everywhere, and put all the 'fix things up' on the _other_ > path. That is what I went for. 
The only part that isn't a plain header copy is the TCP pseudo-header checksum. > And yes, I forgot (and keep forgetting) that the TCP header doesn't have > an explicit length field. Right. That is one of the other reasons for not wanting to pull that length out: the full length is the only real length value we can track with the TCP header, since the header doesn't contain one of its own. >> My thought was to >> end up using something like the ixgbe approach for most devices. What >> I did there was replicate the tunnel headers and inner IPv4 or IPv6 >> header. In the case of ixgbe and i40e I can throw away the checksum >> and length values for the outer IP header; one thing I was curious >> about is if I really needed to retain the full packet size for those. > Again, the outer IP header should be computed for a single segment > rather than for the superframe, so that it doesn't need to be edited > later. It should be possible to send a "GSO partial" frame to TSO > without a single GSO callback needing to be called; similarly, > software GSO should be able to just copy the outer headers, and only > need to know how to update the TCP header. (See below for my "what > a NIC should do" TSO design, which software can easily emulate.) >>> However, I haven't yet had the time to attempt to implement this, so there >>> might be some obvious reason I'm missing why this is impossible. >>> Also, it's possible that I've completely misunderstood your patch and it's >>> orthogonal to and can coexist with what I'm suggesting. >> The one piece I could really use would be an understanding of what >> inputs your hardware is expecting in order for us to extend TSO to >> support this kind of approach. Then I could start tailoring the >> output generated so that we had something that would work with more >> devices.
I was thinking the approach I have taken is fairly generic >> since essentially it allows us to get away with TSO as long as we are >> allowed to provide the offsets for the IP header and the TCP header. >> From what I can tell it looks like the Solarflare drivers do something >> similar so you might even try making changes similar to what I did for >> ixgbe to see if you can get a proof of concept working for sfc. > So, this is all still slightly speculative because while I've talked to > some of our firmware developers, we haven't got as far as actually writing > the new firmware. I'd also like to make clear that this isn't "what > Solarflare has officially decided to do"; rather it's "what I'm currently > trying to convince people at Solarflare to do". > But what I think we're going to end up with is this: > > The kernel will give us a packet that looks like a single MSS-sized segment > except that the payload is too long; the length fields in all the headers > are for an MSS-sized segment, and the checksums are correct for that > (except that the inner TCP checksum is, of course, the pseudo-header sum > rather than a sum over the whole payload). The kernel will also tell us > where in the packet the inner IP header begins. The driver will then give > the following descriptors to the hardware: > * A TSO descriptor, containing the offset of the inner IP header, and the > MSS to use for segmentation. > * A DMA descriptor containing all the headers (i.e. up to the end of the > inner TCP header). > * A series of DMA descriptors containing the payload, with a total length > divisible by the MSS we thought of earlier. > The NIC can now read IHL from the inner IP header, and thereby compute the > offset of the inner TCP header, and the csum_start/offset values. 
> Then for each MSS-sized block of payload, the NIC will do the following: > * transmit header + payload block > * increment inner IP ID, and decrement inner IP checksum (ones-complement) > * add MSS to TCP sequence number > > I believe this is something that any NIC with TSO support should be able to > learn to do, with appropriate firmware changes. It might be a while before > there are NICs in the wild that can do this, though. The only issue I see is that the expectation that the outer headers go untouched is problematic. The outer headers are where things like fragmentation will occur in transit. In addition I suspect a number of devices other than the Intel NICs probably use the network header to determine where to insert L2 tags such as VLAN. Like I have said, with the current solution I could probably update igb, igbvf, fm10k, ixgbe, ixgbevf, i40e, and i40evf to support this. That covers pretty much the entire Intel series of drivers in terms of enterprise. The question is what I need to enable to support other drivers. It doesn't seem like there would be too much, but the bit I would need to know is what the expectations are when computing outer IPv4 checksums. - Alex
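The per-segment fixups in the TSO design sketched above (increment inner IP ID with an incremental ones-complement checksum update, add MSS to the TCP sequence number) are easy to model in software. The following is only an illustrative sketch: the struct and function names are invented for the example, and the checksum update follows RFC 1624.

```c
#include <assert.h>
#include <stdint.h>

/* Full ones-complement checksum over n 16-bit words (checksum field excluded). */
static uint16_t csum_full(const uint16_t *words, int n)
{
	uint32_t sum = 0;

	for (int i = 0; i < n; i++)
		sum += words[i];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* RFC 1624 incremental update: a 16-bit field changes from old to new_val. */
static uint16_t csum_update(uint16_t check, uint16_t old, uint16_t new_val)
{
	uint32_t sum = (uint16_t)~check;

	sum += (uint16_t)~old;
	sum += new_val;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Hypothetical per-segment state the firmware would carry between blocks. */
struct seg_state {
	uint16_t ip_id;		/* inner IPv4 ID */
	uint16_t ip_check;	/* inner IPv4 header checksum */
	uint32_t tcp_seq;	/* inner TCP sequence number */
};

/* Fixups applied after transmitting each MSS-sized block. */
static void next_segment(struct seg_state *s, uint32_t mss)
{
	uint16_t old_id = s->ip_id;

	s->ip_id = (uint16_t)(old_id + 1);
	s->ip_check = csum_update(s->ip_check, old_id, s->ip_id);
	s->tcp_seq += mss;
}
```

Note that "decrement inner IP checksum (ones-complement)" falls out of the arithmetic: incrementing the ID by one decrements the stored complemented checksum by one.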
On 22/03/16 21:38, Alexander Duyck wrote: > On Tue, Mar 22, 2016 at 12:40 PM, Edward Cree <ecree@solarflare.com> wrote: >> But won't the tunnel dev have the feature flag for GSO_PARTIAL depending >> on what the underlying dev advertises? (Or, at least, could we make it >> be that way?) > This stuff doesn't work. That is why tunnels now advertise all > available features that can be offloaded via software. Basically if > we can advertise a feature we do, and then we sort things out in > software if we cannot actually do it in hardware. Fair enough; then go with the other approach: >> Alternatively, have per-protocol GSO callbacks to do the fixup in the >> opposite direction to what you have now - in the long term we hope that >> hardware supporting GSO partial will become the common case, so that >> should be the fast path without bouncing backwards through GSO callbacks. >> Then, if you find out at GSO time that the hardware wants to do old-style >> TSO, you call those callbacks so as to give it a superframe with the long >> lengths filled in everywhere. > I thought about doing that but decided it was much simpler to simply > update all headers. For now I want to keep this as simple as possible > while we get the first few drivers on board. If we later want to > optimize and add complexity we can go that route, but for now this > change is more than fast enough as it allows me to push an i40e at > 20Gb/s while sending frames with outer checksums enabled. My belief is that my way is (in the long run) simpler: ultimately it gets rid of per-protocol GSO callbacks entirely. Every header gets written correctly* when the packet initially traverses the stack, and then at transmit time you either hand that off to TSO, or do the equivalent thing in software: segment at the TCP layer while treating everything above it as an opaque pseudo-L2 header. *That is, correctly for a single segment, rather than correctly for the superframe. >> Yes; it's a clever idea.
Only trouble is that we really want the inner IP >> header rather than the inner TCP header, so that we can (if we want to) >> increment the inner IP IDs. Of course, if we Officially Don't Care about >> inner IP IDs that's not a problem. > The inner IP IDs are the ones that are guaranteed to be the ones we > can leave fixed since TCP will require that the DF bit be set. I was worrying about the SLHC stuff; I thought the inner ones were precisely the ones where that was a problem. If we really don't care about it, then we do just need the inner TCP header, and we can stick with using your csum_start == gso_start trick :) > The > current VXLAN implementation does not set DF for the outer headers. > So really we don't have too many options right now if we are wanting > to support tunnels. I was predicating all this on the assumption that tunnels would be changed to set DF in their outer frames; I thought I saw a patch recently to do that, but maybe I was mistaken. In any case I think that's the right thing to do, and it's a necessary prerequisite for truly tunnel-agnostic TSO. >> I wonder if we could just always fill in inner_network_header even if we're >> not doing encapsulation. Or does it end up pointing to a 'middle' header >> in the case of nested encap? > Right now nested encap is not supported because tunnels don't > advertise hw_encap_features. You mean not supported by offloads, right? We can still _create_ a nested tunnel device, it just won't use hardware offloads. And in the long run, if we can make tunnel-agnostic TSO as I'm proposing, we should be able to support it for nested tunnels too (the giant L2 header just gets more giant). >> Why not use the single-segment length in the pseudo-header, then the >> outer L4 checksum is already the right thing?
>> (And if your hardware >> can't be told to leave the outer L4 checksum alone, then it's not worth >> the trouble of trying to support GSO partial for it, since it clearly >> wants to do old-style "NIC knows best" TSO.) > The problem then becomes that I need special-case code. One for > standard TSO and one for this special case. If I leave it as is I can > use the same code to cancel out the length in the TCP pseudo-header > checksum for either case. I don't think that's true. Look at it this way: there are two choices. 1) The normal path builds things for standard TSO, and we have code to fix it up for this new-style TSO. 2) The normal path builds things for new-style TSO, and we have code to fix it up for standard TSO. Now the fix-up code should be of equal complexity in either case, because the two cases are inverses of each other. So we should choose whichever approach makes the normal path simpler. I think having TCP 'lie' to all the outer layers about the length of the packet is simpler, because then they don't even have to know that GSO is happening. They just append a header appropriate for an MSS-sized packet, blissfully unaware that the payload data carries on for longer than that. Moreover, the fixup in (2) can sometimes be avoided... I don't know how your hardware does encapsulated TSO, but what ours does (in the old "I'm-a-smart-NIC-I-know-best" world) is that it computes both checksums itself from scratch. So it doesn't care what it was given as the pre-seeded value, because it's reconstructing the pseudo-header and zeroing the checksum field to begin with. I would have expected this to be the common behaviour for "smart" NICs doing encap TSO, in which case (2) is a clear winner. And, of course, once "less is more" hardware becomes common, we will want (2) anyway. > The general idea of my approach is to treat everything from the UDP or GRE header > to the inner IP header as an IP header extension type value.
That > will allow us to trick most hardware into being able to accept it. I think an L2 header extension is a better semantic match for what's happening (the tunnel is an L2 device and we're sending L3 packets over it). But why does it matter? Are you giving the hardware the L2 and L3 headers in separate DMA descriptors or something? The way I see it, all hardware needs to be told is "where to start TSO from", and how it thinks of the stuff before that doesn't matter, because it's not supposed to touch it anyway. >> Again, the idea is that we optimise for GSO partial by making it a plain >> header copy everywhere, and put all the 'fix things up' on the _other_ >> path. > That is what I went for. The only part that isn't a plain header copy > is the TCP pseudo-header checksum. Right, but that means someone, somewhere, has to fix up the outer UDP checksum to account for that (because the TCP phdr sum leaks out of LCO). I realise that's a per-superframe job, rather than a per-segment job, but it still means that code at GSO time (when we decide whether to do GSO or GSO-partial) has to know enough about the tunnel headers to find the outer checksum and fix it up. And it's a question of who pays that cost, the old TSO or the new one; see above for why I think the old TSO should pay it. > The only issue I see is the expectation that the outer headers go > untouched is problematic. The outer headers are where things like > fragmentation will occur in transit. Like I say, I'm assuming we'll start setting DF on outer frames. Besides, it doesn't matter what happens "in transit" - as long as the outer headers aren't touched by the transmitting NIC, the network can do what it likes to them afterwards, without it breaking our TSO code. > In addition I suspect a number > of devices other than the Intel NICs probably use the network header > to determine where to insert L2 tags such as VLAN. 
Ah, VLAN is interesting, because there's two things you might want: VLAN inside the tunnel, or outside of it. Presumably if you're having the NIC insert the VLAN, you want it outside (e.g. you're doing SR-IOV and you're putting the VF on a VLAN). But then it doesn't make sense to work backwards from the network header, because that'll get confused by traffic that's already VLAN tagged - because again, you want to insert the new VLAN as the outer VLAN. You need to find the Ethertype, as well; it just seems like working backwards from L3 is crazy. Particularly when working forwards from L2 is so easy - you know your L2 header begins at the start of the packet, you (hopefully) know it's Ethernet, so you know where the VLAN tags and Ethertype go. (Are you /sure/ Intel works backwards from the L3 header? I'm pretty confident we don't.) Then of course this works fine with the 'giant L2 header', because the NIC just inserts the VLAN as normal in the outer Ethernet header and shunts the remaining headers along to make room for it. In fact, in our NICs I think that's done by an entirely separate bit of hardware that doesn't even have to know that TSO was considering much more of the packet to be "L2 header". _However_, if we don't need to update the IP IDs, then we can just take the offset of the inner L4 header, and it doesn't make any difference whether you choose to think of the stuff before that as L2 + giant L3 or as giant L2 + normal L3, because it's not part of the OS->NIC interface (which is just "L4 starts here"). If your NIC needs to be told where the outer L3 starts as well, then, I guess that's just a wart you need in your drivers. You have skb->network_header, so that shouldn't be difficult - that will always point to the outer one. > Like I have said with the current solution I could probably update > igb, igbvf, fm10k, ixgbe, ixgbevf, i40e, and i40evf to support this. > That covers pretty much the entire Intel series of drivers in terms of > enterprise. 
The question is what do I need to enable to support other > drivers. It doesn't seem like there would be too much but the bit I > would need to know is what the expectations are when computing outer > IPv4 checksums. Like I say, the plan we had for Solarflare NICs before "less is more" happened was that the device would reconstruct the outer pseudo-header (same as it does with the inner) and ignore any pre-seeded value in the checksum field. I'd expected other vendors to have gone down the same route, but if Intel didn't then maybe others didn't either. It'd be nice if some of them would chime in and let us know what they want... -Ed
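The LCO property being argued over in this exchange can be shown numerically. Once the inner TCP checksum field is finally filled in as check = ~(pseudo + data), the ones-complement sum over the whole inner segment, checksum field included, collapses to ~pseudo; so the outer checksum can be computed from the outer header words plus the complement of the seeded pseudo-header value, before any payload checksum exists. This is a toy sketch with invented helper names, working only at 16-bit word granularity:

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit ones-complement accumulator down to 16 bits. */
static uint16_t fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones-complement sum of n 16-bit words, starting from seed. */
static uint16_t ocsum(const uint16_t *w, int n, uint32_t seed)
{
	uint32_t sum = seed;

	for (int i = 0; i < n; i++)
		sum += w[i];
	return fold(sum);
}

/* LCO-style outer checksum: sum the outer header words between the outer
 * L4 header and csum_start, seeded with the complement of the value stored
 * in the inner checksum field (the inner pseudo-header checksum). */
static uint16_t lco_outer(const uint16_t *outer, int n, uint16_t seeded_check)
{
	return ocsum(outer, n, (uint16_t)~seeded_check);
}
```

This mirrors the idea behind the kernel's lco_csum(); the real code works on bytes and also handles UDP's zero-checksum special case, which this sketch ignores.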
On Fri, Mar 18, 2016 at 4:25 PM, Alexander Duyck <aduyck@mirantis.com> wrote: > This patch adds support for something I am referring to as GSO partial. > The basic idea is that we can support a broader range of devices for > segmentation if we use fixed outer headers and have the hardware only > really deal with segmenting the inner header. The idea behind the naming > is due to the fact that everything before csum_start will be fixed headers, > and everything after will be the region that is handled by hardware. > Personally, I don't like the name ;-) This technique should be "generic" to handle almost all GSO except those cases where headers are not fixed, which we should be able to avoid as a design point in any new encapsulations. Besides, what if someday we perform GSO on something where csum_start is not set? Can you add some description about the strategy for dealing with ip_id? Thanks, Tom > With the current implementation it allows us to add support for the > following GSO types with an inner TSO or TSO6 offload: > NETIF_F_GSO_GRE > NETIF_F_GSO_GRE_CSUM > NETIF_F_UDP_TUNNEL > NETIF_F_UDP_TUNNEL_CSUM > > Signed-off-by: Alexander Duyck <aduyck@mirantis.com> > --- > include/linux/netdev_features.h | 7 ++++++- > include/linux/netdevice.h | 2 ++ > include/linux/skbuff.h | 7 ++++++- > net/core/dev.c | 31 ++++++++++++++++++++++++++++++- > net/core/ethtool.c | 1 + > net/core/skbuff.c | 21 ++++++++++++++++++++- > net/ipv4/af_inet.c | 12 ++++++++++-- > net/ipv4/gre_offload.c | 23 +++++++++++++++++++---- > net/ipv4/tcp_offload.c | 10 ++++++++-- > net/ipv4/udp_offload.c | 20 ++++++++++++++++---- > net/ipv6/ip6_offload.c | 9 ++++++++- > 11 files changed, 126 insertions(+), 17 deletions(-) > > diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h > index a734bf43d190..8df3c5553af0 100644 > --- a/include/linux/netdev_features.h > +++ b/include/linux/netdev_features.h > @@ -48,8 +48,12 @@ enum { > NETIF_F_GSO_UDP_TUNNEL_BIT, /* ...
UDP TUNNEL with TSO */ > NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */ > NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */ > + NETIF_F_GSO_PARTIAL_BIT, /* ... Only segment inner-most L4 > + * in hardware and all other > + * headers in software. > + */ > /**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */ > - NETIF_F_GSO_TUNNEL_REMCSUM_BIT, > + NETIF_F_GSO_PARTIAL_BIT, > > NETIF_F_FCOE_CRC_BIT, /* FCoE CRC32 */ > NETIF_F_SCTP_CRC_BIT, /* SCTP checksum offload */ > @@ -121,6 +125,7 @@ enum { > #define NETIF_F_GSO_UDP_TUNNEL __NETIF_F(GSO_UDP_TUNNEL) > #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM) > #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM) > +#define NETIF_F_GSO_PARTIAL __NETIF_F(GSO_PARTIAL) > #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER) > #define NETIF_F_HW_VLAN_STAG_RX __NETIF_F(HW_VLAN_STAG_RX) > #define NETIF_F_HW_VLAN_STAG_TX __NETIF_F(HW_VLAN_STAG_TX) > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index 31474d9d8a96..427d748ad8f9 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -1647,6 +1647,7 @@ struct net_device { > netdev_features_t vlan_features; > netdev_features_t hw_enc_features; > netdev_features_t mpls_features; > + netdev_features_t gso_partial_features; > > int ifindex; > int group; > @@ -4014,6 +4015,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type) > BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT)); > BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT)); > BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT)); > + BUILD_BUG_ON(SKB_GSO_PARTIAL != (NETIF_F_GSO_PARTIAL >> NETIF_F_GSO_SHIFT)); > > return (features & feature) == feature; > } > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index 15d0df943466..c291a282f8b6 100644 > --- 
a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -482,6 +482,8 @@ enum { > SKB_GSO_UDP_TUNNEL_CSUM = 1 << 11, > > SKB_GSO_TUNNEL_REMCSUM = 1 << 12, > + > + SKB_GSO_PARTIAL = 1 << 13, > }; > > #if BITS_PER_LONG > 32 > @@ -3584,7 +3586,10 @@ static inline struct sec_path *skb_sec_path(struct sk_buff *skb) > * Keeps track of level of encapsulation of network headers. > */ > struct skb_gso_cb { > - int mac_offset; > + union { > + int mac_offset; > + int data_offset; > + }; > int encap_level; > __wsum csum; > __u16 csum_start; > diff --git a/net/core/dev.c b/net/core/dev.c > index edb7179bc051..666cf427898b 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -2711,6 +2711,19 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb, > return ERR_PTR(err); > } > > + /* Only report GSO partial support if it will enable us to > + * support segmentation on this frame without needing additional > + * work. > + */ > + if (features & NETIF_F_GSO_PARTIAL) { > + netdev_features_t partial_features; > + struct net_device *dev = skb->dev; > + > + partial_features = dev->features & dev->gso_partial_features; > + if (!skb_gso_ok(skb, features | partial_features)) > + features &= ~NETIF_F_GSO_PARTIAL; > + } > + > BUILD_BUG_ON(SKB_SGO_CB_OFFSET + > sizeof(*SKB_GSO_CB(skb)) > sizeof(skb->cb)); > > @@ -2841,6 +2854,14 @@ netdev_features_t netif_skb_features(struct sk_buff *skb) > if (skb->encapsulation) > features &= dev->hw_enc_features; > > + /* Support for GSO partial features requires software intervention > + * before we can actually process the packets so we need to strip > + * support for any partial features now and we can pull them back > + * in after we have partially segmented the frame. 
> + */ > + if (skb_is_gso(skb) && !(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL)) > + features &= ~dev->gso_partial_features; > + > if (skb_vlan_tagged(skb)) > features = netdev_intersect_features(features, > dev->vlan_features | > @@ -6702,6 +6723,14 @@ static netdev_features_t netdev_fix_features(struct net_device *dev, > } > } > > + /* GSO partial features require GSO partial be set */ > + if ((features & dev->gso_partial_features) && > + !(features & NETIF_F_GSO_PARTIAL)) { > + netdev_dbg(dev, > + "Dropping partially supported GSO features since no GSO partial.\n"); > + features &= ~dev->gso_partial_features; > + } > + > #ifdef CONFIG_NET_RX_BUSY_POLL > if (dev->netdev_ops->ndo_busy_poll) > features |= NETIF_F_BUSY_POLL; > @@ -6982,7 +7011,7 @@ int register_netdevice(struct net_device *dev) > > /* Make NETIF_F_SG inheritable to tunnel devices. > */ > - dev->hw_enc_features |= NETIF_F_SG; > + dev->hw_enc_features |= NETIF_F_SG | NETIF_F_GSO_PARTIAL; > > /* Make NETIF_F_SG inheritable to MPLS. 
> */ > diff --git a/net/core/ethtool.c b/net/core/ethtool.c > index b3c39d531469..d1b278c6c29f 100644 > --- a/net/core/ethtool.c > +++ b/net/core/ethtool.c > @@ -88,6 +88,7 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] > [NETIF_F_GSO_UDP_TUNNEL_BIT] = "tx-udp_tnl-segmentation", > [NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation", > [NETIF_F_GSO_TUNNEL_REMCSUM_BIT] = "tx-remcsum-segmentation", > + [NETIF_F_GSO_PARTIAL_BIT] = "tx-gso-partial", > > [NETIF_F_FCOE_CRC_BIT] = "tx-checksum-fcoe-crc", > [NETIF_F_SCTP_CRC_BIT] = "tx-checksum-sctp", > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index f044f970f1a6..bdcba77e164c 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -3076,8 +3076,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, > struct sk_buff *frag_skb = head_skb; > unsigned int offset = doffset; > unsigned int tnl_hlen = skb_tnl_header_len(head_skb); > + unsigned int partial_segs = 0; > unsigned int headroom; > - unsigned int len; > + unsigned int len = head_skb->len; > __be16 proto; > bool csum; > int sg = !!(features & NETIF_F_SG); > @@ -3094,6 +3095,15 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, > > csum = !!can_checksum_protocol(features, proto); > > + /* GSO partial only requires that we trim off any excess that > + * doesn't fit into an MSS sized block, so take care of that > + * now. > + */ > + if (features & NETIF_F_GSO_PARTIAL) { > + partial_segs = len / mss; > + mss *= partial_segs; > + } > + > headroom = skb_headroom(head_skb); > pos = skb_headlen(head_skb); > > @@ -3281,6 +3291,15 @@ perform_csum_check: > */ > segs->prev = tail; > > + /* Update GSO info on first skb in partial sequence. 
*/ > + if (partial_segs) { > + skb_shinfo(segs)->gso_size = mss / partial_segs; > + skb_shinfo(segs)->gso_segs = partial_segs; > + skb_shinfo(segs)->gso_type = skb_shinfo(head_skb)->gso_type | > + SKB_GSO_PARTIAL; > + SKB_GSO_CB(segs)->data_offset = skb_headroom(segs) + doffset; > + } > + > /* Following permits correct backpressure, for protocols > * using skb_set_owner_w(). > * Idea is to tranfert ownership from head_skb to last segment. > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c > index 5e3885672907..d091f91fa25d 100644 > --- a/net/ipv4/af_inet.c > +++ b/net/ipv4/af_inet.c > @@ -1199,7 +1199,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb, > unsigned int offset = 0; > bool udpfrag, encap; > struct iphdr *iph; > - int proto; > + int proto, tot_len; > int nhoff; > int ihl; > int id; > @@ -1269,10 +1269,18 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb, > if (skb->next) > iph->frag_off |= htons(IP_MF); > offset += skb->len - nhoff - ihl; > + tot_len = skb->len - nhoff; > + } else if (skb_is_gso(skb)) { > + iph->id = htons(id); > + id += skb_shinfo(skb)->gso_segs; > + tot_len = skb_shinfo(skb)->gso_size + > + SKB_GSO_CB(skb)->data_offset + > + skb->head - (unsigned char *)iph; > } else { > iph->id = htons(id++); > + tot_len = skb->len - nhoff; > } > - iph->tot_len = htons(skb->len - nhoff); > + iph->tot_len = htons(tot_len); > ip_send_check(iph); > if (encap) > skb_reset_inner_headers(skb); > diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c > index 7ea14ced9222..dea0390d65bb 100644 > --- a/net/ipv4/gre_offload.c > +++ b/net/ipv4/gre_offload.c > @@ -85,7 +85,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb, > skb = segs; > do { > struct gre_base_hdr *greh; > - __be32 *pcsum; > + __sum16 *pcsum; > > /* Set up inner headers if we are offloading inner checksum */ > if (skb->ip_summed == CHECKSUM_PARTIAL) { > @@ -105,10 +105,25 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb, > continue; 
> > greh = (struct gre_base_hdr *)skb_transport_header(skb); > - pcsum = (__be32 *)(greh + 1); > + pcsum = (__sum16 *)(greh + 1); > + > + if (skb_is_gso(skb)) { > + unsigned int partial_adj; > + > + /* Adjust checksum to account for the fact that > + * the partial checksum is based on actual size > + * whereas headers should be based on MSS size. > + */ > + partial_adj = skb->len + skb_headroom(skb) - > + SKB_GSO_CB(skb)->data_offset - > + skb_shinfo(skb)->gso_size; > + *pcsum = ~csum_fold((__force __wsum)htonl(partial_adj)); > + } else { > + *pcsum = 0; > + } > > - *pcsum = 0; > - *(__sum16 *)pcsum = gso_make_checksum(skb, 0); > + *(pcsum + 1) = 0; > + *pcsum = gso_make_checksum(skb, 0); > } while ((skb = skb->next)); > out: > return segs; > diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c > index 1a2e9957c177..4e9b8f011473 100644 > --- a/net/ipv4/tcp_offload.c > +++ b/net/ipv4/tcp_offload.c > @@ -107,6 +107,12 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb, > goto out; > } > > + /* GSO partial only requires splitting the frame into an MSS > + * multiple and possibly a remainder. So update the mss now. 
> + */ > + if (features & NETIF_F_GSO_PARTIAL) > + mss = skb->len - (skb->len % mss); > + > copy_destructor = gso_skb->destructor == tcp_wfree; > ooo_okay = gso_skb->ooo_okay; > /* All segments but the first should have ooo_okay cleared */ > @@ -131,7 +137,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb, > newcheck = ~csum_fold((__force __wsum)((__force u32)th->check + > (__force u32)delta)); > > - do { > + while (skb->next) { > th->fin = th->psh = 0; > th->check = newcheck; > > @@ -151,7 +157,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb, > > th->seq = htonl(seq); > th->cwr = 0; > - } while (skb->next); > + } > > /* Following permits TCP Small Queues to work well with GSO : > * The callback to TCP stack will be called at the time last frag > diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c > index 8a3405a80260..5fcb93269afb 100644 > --- a/net/ipv4/udp_offload.c > +++ b/net/ipv4/udp_offload.c > @@ -100,7 +100,7 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb, > udp_offset = outer_hlen - tnl_hlen; > skb = segs; > do { > - __be16 len; > + unsigned int len; > > if (remcsum) > skb->ip_summed = CHECKSUM_NONE; > @@ -118,14 +118,26 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb, > skb_reset_mac_header(skb); > skb_set_network_header(skb, mac_len); > skb_set_transport_header(skb, udp_offset); > - len = htons(skb->len - udp_offset); > + len = skb->len - udp_offset; > uh = udp_hdr(skb); > - uh->len = len; > + > + /* If we are only performing partial GSO the inner header > + * will be using a length value equal to only one MSS sized > + * segment instead of the entire frame. 
> + */ > + if (skb_is_gso(skb)) { > + uh->len = htons(skb_shinfo(skb)->gso_size + > + SKB_GSO_CB(skb)->data_offset + > + skb->head - (unsigned char *)uh); > + } else { > + uh->len = htons(len); > + } > > if (!need_csum) > continue; > > - uh->check = ~csum_fold(csum_add(partial, (__force __wsum)len)); > + uh->check = ~csum_fold(csum_add(partial, > + (__force __wsum)htonl(len))); > > if (skb->encapsulation || !offload_csum) { > uh->check = gso_make_checksum(skb, ~uh->check); > diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c > index eeca943f12dc..d467053c226a 100644 > --- a/net/ipv6/ip6_offload.c > +++ b/net/ipv6/ip6_offload.c > @@ -63,6 +63,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb, > int proto; > struct frag_hdr *fptr; > unsigned int unfrag_ip6hlen; > + unsigned int payload_len; > u8 *prevhdr; > int offset = 0; > bool encap, udpfrag; > @@ -117,7 +118,13 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb, > > for (skb = segs; skb; skb = skb->next) { > ipv6h = (struct ipv6hdr *)(skb_mac_header(skb) + nhoff); > - ipv6h->payload_len = htons(skb->len - nhoff - sizeof(*ipv6h)); > + if (skb_is_gso(skb)) > + payload_len = skb_shinfo(skb)->gso_size + > + SKB_GSO_CB(skb)->data_offset + > + skb->head - (unsigned char *)(ipv6h + 1); > + else > + payload_len = skb->len - nhoff - sizeof(*ipv6h); > + ipv6h->payload_len = htons(payload_len); > skb->network_header = (u8 *)ipv6h - skb->head; > > if (udpfrag) { >
On Wed, Mar 23, 2016 at 9:27 AM, Edward Cree <ecree@solarflare.com> wrote: > On 22/03/16 21:38, Alexander Duyck wrote: >> On Tue, Mar 22, 2016 at 12:40 PM, Edward Cree <ecree@solarflare.com> wrote: >>> But won't the tunnel dev have the feature flag for GSO_PARTIAL depending >>> on what the underlying dev advertises? (Or, at least, could we make it >>> be that way?) >> This stuff doesn't work. That is why tunnels now advertise all >> available features that can be offloaded via software. Basically if >> we can advertise a feature we do, and then we sort things out in >> software if we cannot actually do it in hardware. > Fair enough; then go with the other approach: >>> Alternatively, have per-protocol GSO callbacks to do the fixup in the >>> opposite direction to what you have now - in the long term we hope that >>> hardware supporting GSO partial will become the common case, so that >>> should be the fast path without bouncing backwards through GSO callbacks. >>> Then, if you find out at GSO time that the hardware wants to do old-style >>> TSO, you call those callbacks so as to give it a superframe with the long >>> lengths filled in everywhere. >> I thought about doing that but decided it was much simpler to simply >> update all headers. For now I want to keep this as simple as possible >> while we get the first few drivers on board. If we later want to >> optimize and add complexity we can go that route, but for now this >> change is more than fast enough as it allows me to push an i40e at >> 20Gb/s while sending frames with outer checksums enabled. > My belief is that my way is (in the long run) simpler: ultimately it gets > rid of per-protocol GSO callbacks entirely. Every header gets written > correctly* when the packet initially traverses the stack, and then at > transmit time you either hand that off to TSO, or do the equivalent thing > in software: segment at the TCP layer while treating everything above it > as an opaque pseudo-L2 header.
I'm pretty sure that isn't a safe approach to take. With GSO we are holding off until we are about to hand the packets off to the device before we segment them. If we do it earlier it will lead to more issues: the packet could route off to somewhere we were not expecting, and having it already "soft segmented" could leave the packet no longer coherent. > *That is, correctly for a single segment, rather than correctly for the > superframe. Yes, but what about the cases where a packet gets switched from Tx back to Rx? How would you expect the stack to handle that then? >>> Yes; it's a clever idea. Only trouble is that we really want the inner IP >>> header rather than the inner TCP header, so that we can (if we want to) >>> increment the inner IP IDs. Of course, if we Officially Don't Care about >>> inner IP IDs that's not a problem. >> The inner IP IDs are the ones that are guaranteed to be the ones we >> can leave fixed since TCP will require that the DF bit be set. > I was worrying about the SLHC stuff, I thought the inner ones were precisely > the ones where that was a problem. If we really don't care about it, then > we do just need the inner TCP header, and we can stick with using your > csum_start == gso_start trick :) >> The >> current VXLAN implementation does not set DF for the outer headers. >> So really we don't have too many options right now if we are wanting >> to support tunnels. > I was predicating all this on the assumption that tunnels would be changed > to set DF in their outer frames; I thought I saw a patch recently to do > that, but maybe I was mistaken. In any case I think that's the right thing > to do, and it's a necessary prerequisite for truly tunnel-agnostic TSO. >>> I wonder if we could just always fill in inner_network_header even if we're >>> not doing encapsulation. Or does it end up pointing to a 'middle' header >>> in the case of nested encap?
>> Right now nested encap is not supported because tunnels don't >> advertise hw_encap_features. > You mean not supported by offloads, right? We can still _create_ a nested > tunnel device, it just won't use hardware offloads. And in the long run, > if we can make tunnel-agnostic TSO as I'm proposing, we should be able to > support it for nested tunnels too (the giant L2 header just gets more giant). Right. Basically what will currently happen is that if you were to encapsulate an ipip tunnel inside of a VXLAN for instance the GSO would kick in before the VXLAN tunnel has even added the outer headers because the VXLAN netdev does not advertise NETIF_F_GSO_IPIP. That is the kind of thing we want to have happen anyway, since a physical device wouldn't know how to deal with such a scenario. >>> Why not use the single-segment length in the pseudo-header, then the >>> outer L4 checksum is already the right thing? (And if your hardware >>> can't be told to leave the outer L4 checksum alone, then it's not worth >>> the trouble of trying to support GSO partial for it, since it clearly >>> wants to do old-style "NIC knows best" TSO.) >> The problem then becomes that I need special case code. One for >> standard TSO and one for this special case. If I leave it as is I can >> use the same code to cancel out the length in the TCP pseudo-header >> checksum for either case. > I don't think that's true. Look at it this way: there are two choices. > 1) The normal path builds things for standard TSO, and we have code to > fix it up for this new-style TSO. > 2) The normal path builds things for new-style TSO, and we have code to > fix it up for standard TSO. > Now the fix-up code should be of equal complexity in either case, because > the two cases are inverses of each other. So we should choose whichever > approach makes the normal path simpler. 
I think having TCP 'lie' to all > the outer layers about the length of the packet is simpler, because then > they don't even have to know that GSO is happening. They just append a > header appropriate for an MSS-sized packet, blissfully unaware that the > payload data carries on for longer than that. > Moreover, the fixup in (2) can sometimes be avoided... > I don't know how your hardware does encapsulated TSO, but what ours does > (in the old "I'm-a-smart-NIC-I-know-best" world) is that it computes both > checksums itself from scratch. So it doesn't care what it was given as > the pre-seeded value, because it's reconstructing the pseudo-header and > zeroing the checksum field to begin with. > I would have expected this to be the common behaviour for "smart" NICs > doing encap TSO, in which case (2) is a clear winner. And, of course, > once "less is more" hardware becomes common, we will want (2) anyway. The argument here is a matter of complexity. I really don't think we gain much by recomputing the pseudo-header with the segmented length versus the entire packet length. In the case of GSO it is more complicated to figure out the length if we are having to take gso_size and add the size for a TCP header rather than just subtracting the inner transport offset from skb->len. >> The general idea of my approach is to treat the UDP or GRE header >> to the inner IP header as an IP header extension type value. That >> will allow us to trick most hardware into being able to accept it. > I think an L2 header extension is a better semantic match for what's > happening (the tunnel is an L2 device and we're sending L3 packets over > it). But why does it matter? Are you giving the hardware the L2 and > L3 headers in separate DMA descriptors or something? The way I see it, > all hardware needs to be told is "where to start TSO from", and how it > thinks of the stuff before that doesn't matter, because it's not > supposed to touch it anyway. 
The problem is setups like VFs where they want to know where the network header starts so they know where to insert a VLAN tag. Most hardware supports IP options or IPv6 extension headers. What I am doing is exploiting that in the case of the Intel hardware by telling it the IP header is the outer IP header and the transport header is the inner transport header. Then all I have to deal with is the fact that hardware will try to compute the entire IPv4 header checksum over the range so I cancel that out by seeding the IP checksum with the lco_csum added to the inner pseudo-header checksum. >>> Again, the idea is that we optimise for GSO partial by making it a plain >>> header copy everywhere, and put all the 'fix things up' on the _other_ >>> path. >> That is what I went for. The only part that isn't a plain header copy >> is the TCP pseudo-header checksum. > Right, but that means someone, somewhere, has to fix up the outer UDP > checksum to account for that (because the TCP phdr sum leaks out of LCO). > I realise that's a per-superframe job, rather than a per-segment job, but > it still means that code at GSO time (when we decide whether to do GSO or > GSO-partial) has to know enough about the tunnel headers to find the outer > checksum and fix it up. The fix-up already happens in the GSO code path I have coded up. The outer checksum already takes it into account. If you look at my patches you should see a block in both the UDP tunnel and GRE tunnel code that handles the "skb_is_gso()" case. That is where we deal with the checksum adjustment. In the case of UDP tunnels it actually requires very little work as the only field that actually has to be updated is uh->len since the outer pseudo-header checksum effectively cancels out the length for the inner anyway. In the case of GRE it took a bit more as I had to seed the checksum with the delta of the length in order to correctly account for the change. 
> And it's a question of who pays that cost, the old TSO or the new one; see > above for why I think the old TSO should pay it. Generally I don't agree with slowing down existing paths in order to improve the performance for new ones. One of my goals with this was to keep the code minimally intrusive. I would rather not have to rewrite all driver TSO paths in order to support a new segmentation approach. >> The only issue I see is the expectation that the outer headers go >> untouched is problematic. The outer headers are where things like >> fragmentation will occur in transit. > Like I say, I'm assuming we'll start setting DF on outer frames. > Besides, it doesn't matter what happens "in transit" - as long as the > outer headers aren't touched by the transmitting NIC, the network can > do what it likes to them afterwards, without it breaking our TSO code. The IP ID bit is going to be something where I don't want to break things. One of the things I have seen is there ends up being a number of occasions where VXLAN gets fragmented due to incorrectly configured MTU. I would rather not have it get completely screwed up when this occurs. A performance hit for fragmenting is one thing. Having people start complaining because previously working tunnels suddenly stop functioning is another. The fact is the whole point of VXLAN is to support sending the tunneled frames between data centers. With the DF bit set on a tunnel header we end up making it so that frames get dropped instead of fragmented which would be a change of behavior. >> In addition I suspect a number >> of devices other than the Intel NICs probably use the network header >> to determine where to insert L2 tags such as VLAN. > Ah, VLAN is interesting, because there's two things you might want: > VLAN inside the tunnel, or outside of it. Presumably if you're having > the NIC insert the VLAN, you want it outside (e.g. you're doing SR-IOV > and you're putting the VF on a VLAN). 
> But then it doesn't make sense to work backwards from the network > header, because that'll get confused by traffic that's already VLAN > tagged - because again, you want to insert the new VLAN as the outer > VLAN. You need to find the Ethertype, as well; it just seems like > working backwards from L3 is crazy. Particularly when working forwards > from L2 is so easy - you know your L2 header begins at the start of the > packet, you (hopefully) know it's Ethernet, so you know where the VLAN > tags and Ethertype go. (Are you /sure/ Intel works backwards from the > L3 header? I'm pretty confident we don't.) Yes, I am very confident of that. For Intel hardware the outer VLAN tag would be the one inserted via software, the inner VLAN tag is the one inserted via hardware. > Then of course this works fine with the 'giant L2 header', because the > NIC just inserts the VLAN as normal in the outer Ethernet header and > shunts the remaining headers along to make room for it. In fact, in > our NICs I think that's done by an entirely separate bit of hardware > that doesn't even have to know that TSO was considering much more of > the packet to be "L2 header". Well the Intel hardware doesn't work that way. It would be nice if it did because then you could actually support Q-in-Q on the VFs without much issue, but due to the limiting nature of a 64b transmit descriptor they opted to insert the VLAN tags at the start of the network header. > _However_, if we don't need to update the IP IDs, then we can just take > the offset of the inner L4 header, and it doesn't make any difference > whether you choose to think of the stuff before that as L2 + giant L3 > or as giant L2 + normal L3, because it's not part of the OS->NIC > interface (which is just "L4 starts here"). If your NIC needs to be > told where the outer L3 starts as well, then, I guess that's just a > wart you need in your drivers. 
You have skb->network_header, so that > shouldn't be difficult - that will always point to the outer one. Right. That is kind of what I was going for with this setup. The only issue is that the VXLAN tunnels not setting the DF bit kind of get in the way of the giant L3 approach. Dealing with the outer header needing the DF bit is something I have left unaddressed at this point. The question would be what is the correct approach to take for all this. I know RFC2003 for IPv4 in IPv4 says you must set the DF bit if the inner header has the DF bit set. I'm just wondering if we can apply the same type of logic to GRE and UDP tunnels. >> Like I have said with the current solution I could probably update >> igb, igbvf, fm10k, ixgbe, ixgbevf, i40e, and i40evf to support this. >> That covers pretty much the entire Intel series of drivers in terms of >> enterprise. The question is what do I need to enable to support other >> drivers. It doesn't seem like there would be too much but the bit I >> would need to know is what the expectations are when computing outer >> IPv4 checksums. > Like I say, the plan we had for Solarflare NICs before "less is more" > happened was that the device would reconstruct the outer pseudo-header > (same as it does with the inner) and ignore any pre-seeded value in the > checksum field. I'd expected other vendors to have gone down the same > route, but if Intel didn't then maybe others didn't either. It'd be > nice if some of them would chime in and let us know what they want... Right. I added Or to the Cc. Maybe we can get some input from Mellanox on how they do TSO and what changes they would need in order to support TSO for GRE or UDP tunnels with checksum. My concern is that I am not sure how much value there is to add with this type of code if the hardware is parsing headers. In the case of most of the Intel parts you specify offsets for the network and transport headers so it gives you some degree of flexibility. 
If however the hardware parses headers it becomes problematic as we can only support protocols that can be parsed by the hardware. So for example on the Intel parts using the drivers igb and ixgbe I will be able to support VXLAN-GPE using a partial GSO approach. However for fm10k which does some parsing for the tunnel headers I probably won't as it will not know what to do if there are any extra headers included and won't be able to determine the offset of the inner transport header. - Alex
On Wed, Mar 23, 2016 at 10:09 AM, Tom Herbert <tom@herbertland.com> wrote: > On Fri, Mar 18, 2016 at 4:25 PM, Alexander Duyck <aduyck@mirantis.com> wrote: >> This patch adds support for something I am referring to as GSO partial. >> The basic idea is that we can support a broader range of devices for >> segmentation if we use fixed outer headers and have the hardware only >> really deal with segmenting the inner header. The idea behind the naming >> is due to the fact that everything before csum_start will be fixed headers, >> and everything after will be the region that is handled by hardware. >> > Personally, I don't like the name ;-) This technique should be > "generic" to handle almost all GSO except those case where headers are > not fixed which we should be able to avoid as a design point in any > new encapsulations. Besides, what if someday we perform GSO on > something where csum_start is not set? Well the "partial" component of it refers to the fact that we have to do a certain amount of this in software before we offload the work to hardware. So it isn't a full segmentation offload, it is a partial one. The csum_start is a bit of a red herring in terms of the naming. I was using csum_start as that works for TCP offloads since it will always point to the inner-most transport header. It was basically just a convenient optimization for the drivers and makes it so that we can essentially use just one code path in the case of ixgbe since network header and checksum_start will always represent the two headers we need to update for TSO regardless of it being partial or complete offload. > Can you add some description about strategy for dealing with ip_id? Yeah. I still need to add more documentation. I just didn't want to get into details on it until we have finalized things a bit more. I'm still wondering if we should follow the pattern of ipip tunnels in terms of setting the DF bit if the inner header has the DF bit set. 
If we end up needing to add code to do that then it makes it so that the ip_id value can be fixed for both inner and outer which makes the segmentation much simpler since the only header that would ever really need to be updated would be the transport header in order to get the checksum correct. - Alex
On 23/03/16 18:06, Alexander Duyck wrote: > On Wed, Mar 23, 2016 at 9:27 AM, Edward Cree <ecree@solarflare.com> wrote: >> My belief is that my way is (in the long run) simpler: ultimately it gets >> rid of per-protocol GSO callbacks entirely. Every header gets written >> correctly* when the packet initially traverses the stack, and then at >> transmit time you either hand that off to TSO, or do the equivalent thing >> in software: segment at the TCP layer while treating everything above it >> as an opaque pseudo-L2 header. > I'm pretty sure that isn't a safe approach to take. With GSO we are > holding off until we are about to hand the packets off to the device > before we segment it. If we do it earlier it will lead to more issues > as you could have the packet route off to somewhere you were not > expecting and having it already "soft segmented" could lead to issues > where the packet would no longer be coherent. Ah, yes, that is a potential problem. And now that I think about it, we might not know what the MSS is until segmentation time, either, even if we did know for sure we would want to segment. So my approach doesn't work after all, the superframe has to be coherent when it traverses the stack. >> You mean not supported by offloads, right? We can still _create_ a nested >> tunnel device, it just won't use hardware offloads. And in the long run, >> if we can make tunnel-agnostic TSO as I'm proposing, we should be able to >> support it for nested tunnels too (the giant L2 header just gets more giant). > Right. Basically what will currently happen is that if you were to > encapsulate an ipip tunnel inside of a VXLAN for instance the GSO > would kick in before the VXLAN tunnel has even added the outer headers > because the VXLAN netdev does not advertise NETIF_F_GSO_IPIP. That is > the kind of thing we want to have happen though anyway since a > physical device wouldn't know how to deal with such a scenario anyway. I disagree. 
Surely we should be able to "soft segment" the packet just before we give it to the physical device, and then tell it to do dumb copying of both the VXLAN and IPIP headers? At this point, we don't have the problem you identified above, because we've arrived at the device now. So we can chase through some per-protocol callbacks to shorten all the outer lengths and adjust all the outer checksums, then hand it to the device for TSO. The device is treating the extra headers as an opaque blob, so it doesn't know or care whether it's one layer of encapsulation or forty-two. >> I think an L2 header extension is a better semantic match for what's >> happening (the tunnel is an L2 device and we're sending L3 packets over >> it). But why does it matter? Are you giving the hardware the L2 and >> L3 headers in separate DMA descriptors or something? The way I see it, >> all hardware needs to be told is "where to start TSO from", and how it >> thinks of the stuff before that doesn't matter, because it's not >> supposed to touch it anyway. > The problem is setups like VFs where they want to know where the > network header starts so they know where to insert a VLAN tag. Most > hardware supports IP options or IPv6 extension headers. What I am > doing is exploiting that in the case of the Intel hardware by telling > it the IP header is the outer IP header and the transport header is > the inner transport header. Then all I have to deal with is the fact > that hardware will try to compute the entire IPv4 header checksum over > the range so I cancel that out by seeding the IP checksum with the > lco_csum added to the inner pseudo-header checksum. Ok, it sounds like the interface to Intel hardware is just Very Different to Solarflare hardware on this point: we don't tell our hardware anything about where the various headers start, it just parses them to figure it out. (And for new-style TSO we'd tell it where the TCP header starts, as I described before.) 
But this sounds like a driver-level thing: you have to undo some of what your hardware will do because you're having to lie to it about what you're giving it. So that all happens in the driver, the stack's GSO code isn't affected. In which case I'm much less bothered by this; I thought it was an assumption you were baking into the stack. As far as I'm concerned you can do whatever ugly hacks you like in Intel drivers, as long as I don't have to do them in sfc ;) Although, why is your device computing the IPv4 header checksum? Those aren't supposed to be offloaded, the stack always already filled them in in software. (I think sfc makes the same mistake actually.) Is there not a way for you to tell your device to skip IP header checksum offload? Then you wouldn't have this problem in the first place, and you could tell it the IP header was as big as you like without having to seed it with this correction value. >> Like I say, I'm assuming we'll start setting DF on outer frames. >> Besides, it doesn't matter what happens "in transit" - as long as the >> outer headers aren't touched by the transmitting NIC, the network can >> do what it likes to them afterwards, without it breaking our TSO code. > The IP ID bit is going to be something where I don't want to break > things. One of the things I have seen is there ends up being a number > of occasions where VXLAN gets fragmented due to incorrectly configured > MTU. I would rather not have it get completely screwed up when this > occurs. A performance hit for fragmenting is one thing. Having > people start complaining because previously working tunnels suddenly > stop functioning is another. The fact is the whole point of VXLAN is > to support sending the tunneled frames between data centers. With the > DF bit set on a tunnel header we end up making it so that frames get > dropped instead of fragmented which would be a change of behavior. I agree this isn't something we can do silently. 
But we _can_ make it a condition for enabling gso-partial. And I think it's a necessary condition for truly generic TSO. Sure, your 'L3 extension header' works fine for a single tunnel. But if you nest tunnels, you now need to update the outer _and_ middle IP IDs, and you can't do that because you only have one L3 header pointer. OTOH, for a single tunnel I think we could implement your 'L3 extension header' trick in firmware, by making it always parse the outer packet up to the outer L3 header and increment the IP ID in that. So I could live with that approach if necessary. > Yes, I am very confident of that. For Intel hardware the outer VLAN > tag would be the one inserted via software, the inner VLAN tag is the > one inserted via hardware. That's really weird; why does it do that? For sfc, the only reason we do hardware VLANs at all is to transparently tag a VF (or non-primary PF) that's being passed through into an (untrusted) VM. For that use case you'd always want to insert the outer VLAN tag, because otherwise the VM can escape the isolation by inserting a different VLAN tag in software. >> _However_, if we don't need to update the IP IDs, then we can just take >> the offset of the inner L4 header, and it doesn't make any difference >> whether you choose to think of the stuff before that as L2 + giant L3 >> or as giant L2 + normal L3, because it's not part of the OS->NIC >> interface (which is just "L4 starts here"). If your NIC needs to be >> told where the outer L3 starts as well, then, I guess that's just a >> wart you need in your drivers. You have skb->network_header, so that >> shouldn't be difficult - that will always point to the outer one. > Right. That is kind of what I was going for with this setup. The > only issue is that the VXLAN tunnels not setting the DF bit kind of > get in the way of the giant L3 approach. 
On the contrary; your giant L3 approach is exactly what solves this case (for non-nested tunnels) - the hardware has the outer IP and inner TCP header offsets, which are exactly the two headers it needs to alter. And if the hardware is sensible, it won't try to re-checksum the whole giant L3 header, it'll just decrement the (outer) IP checksum to account for incrementing the (outer) IP ID. If the hardware isn't sensible, then you have to play games like you are doing in the Intel drivers ;) > Dealing with the outer header needing the DF bit is something I have > left unaddressed at this point. The question would be what is the > correct approach to take for all this. I know RFC2003 for IPv4 in > IPv4 says you must set the DF bit if the inner header has the DF bit > set. I'm just wondering if we can apply the same type of logic to GRE > and UDP tunnels. I wonder if _not_ setting the DF bit in the outer header can harm TCP congestion control at all? If so, then we'd pretty much have to set it in that case. > My concern is that I am not sure how much value there is to add with > this type of code if the hardware is parsing headers. In the case of > most of the Intel parts you specify offsets for the network and > transport headers so it gives you some degree of flexibility. If > however the hardware parses headers it becomes problematic as we can > only support protocols that can be parsed by the hardware. Solarflare parts will _normally_ parse headers to get that information. But, when doing TSO, we do get a chance to specify some extra information, in the TSO option descriptor. Enough of the datapath is under firmware control that that should be enough; as long as the outer frame is IP over Ethernet, the hardware will parse that fine, and we *should* be able to make it just accept that it doesn't know what's going on between that and the start of the TCP header. 
And, it shouldn't matter that the hardware can parse some types of tunnel headers, because we'll just tell it to ignore that. Of course, that means changing the firmware; luckily we haven't got any parts in the wild doing tunnel offloads yet, so we still have a chance to do that without needing driver code to work around our past mistakes... But this stuff does definitely add value for us, it means we could TSO any tunnel type whatsoever; even nested tunnels as long as only the outermost IP ID needs to change. -Ed
On Wed, Mar 23, 2016 at 2:05 PM, Edward Cree <ecree@solarflare.com> wrote: > On 23/03/16 18:06, Alexander Duyck wrote: >> On Wed, Mar 23, 2016 at 9:27 AM, Edward Cree <ecree@solarflare.com> wrote: >>> My belief is that my way is (in the long run) simpler: ultimately it gets >>> rid of per-protocol GSO callbacks entirely. Every header gets written >>> correctly* when the packet initially traverses the stack, and then at >>> transmit time you either hand that off to TSO, or do the equivalent thing >>> in software: segment at the TCP layer while treating everything above it >>> as an opaque pseudo-L2 header. >> I'm pretty sure that isn't a safe approach to take. With GSO we are >> holding off until we are about to hand the packets off to the device >> before we segment it. If we do it earlier it will lead to more issues >> as you could have the packet route off to somewhere you were not >> expecting and having it already "soft segmented" could lead to issues >> where the packet would no longer be coherent. > Ah, yes, that is a potential problem. And now that I think about it, we > might not know what the MSS is until segmentation time, either, even if > we did know for sure we would want to segment. > So my approach doesn't work after all, the superframe has to be coherent > when it traverses the stack. Right. >>> You mean not supported by offloads, right? We can still _create_ a nested >>> tunnel device, it just won't use hardware offloads. And in the long run, >>> if we can make tunnel-agnostic TSO as I'm proposing, we should be able to >>> support it for nested tunnels too (the giant L2 header just gets more giant). >> Right. Basically what will currently happen is that if you were to >> encapsulate an ipip tunnel inside of a VXLAN for instance the GSO >> would kick in before the VXLAN tunnel has even added the outer headers >> because the VXLAN netdev does not advertise NETIF_F_GSO_IPIP. 
That is >> the kind of thing we want to have happen though anyway since a >> physical device wouldn't know how to deal with such a scenario anyway. > I disagree. Surely we should be able to "soft segment" the packet just > before we give it to the physical device, and then tell it to do dumb copying > of both the VXLAN and IPIP headers? At this point, we don't have the problem > you identified above, because we've arrived at the device now. One issue here is that all levels of IP headers would have to have the DF bit set. I don't think that happens right now. > So we can chase through some per-protocol callbacks to shorten all the outer > lengths and adjust all the outer checksums, then hand it to the device for > TSO. The device is treating the extra headers as an opaque blob, so it > doesn't know or care whether it's one layer of encapsulation or forty-two. So if we do pure software offloads this is doable. However the GSO flags are meant to have hardware feature equivalents. The problem is if you combine an IPIP and VXLAN header how do you know what header is what and which order things are in, and what is the likelihood of having a device that would get things right when dealing with 3 levels of IP headers. This is one of the reasons why we don't support multiple levels of tunnels in the GSO code. GSO is just meant to be a fall-back for hardware offloads. >>> I think an L2 header extension is a better semantic match for what's >>> happening (the tunnel is an L2 device and we're sending L3 packets over >>> it). But why does it matter? Are you giving the hardware the L2 and >>> L3 headers in separate DMA descriptors or something? The way I see it, >>> all hardware needs to be told is "where to start TSO from", and how it >>> thinks of the stuff before that doesn't matter, because it's not >>> supposed to touch it anyway. >> The problem is setups like VFs where they want to know where the >> network header starts so they know where to insert a VLAN tag. 
Most >> hardware supports IP options or IPv6 extension headers. What I am >> doing is exploiting that in the case of the Intel hardware by telling >> it the IP header is the outer IP header and the transport header is >> the inner transport header. Then all I have to deal with is the fact >> that hardware will try to compute the entire IPv4 header checksum over >> the range so I cancel that out by seeding the IP checksum with the >> lco_csum added to the inner pseudo-header checksum. > Ok, it sounds like the interface to Intel hardware is just Very Different > to Solarflare hardware on this point: we don't tell our hardware anything > about where the various headers start, it just parses them to figure it > out. (And for new-style TSO we'd tell it where the TCP header starts, as > I described before.) That is kind of what I figured. So does that mean for IPv6 you guys are parsing through extension headers? I believe that is one of the reasons why Intel did things the way they did is to avoid having to parse through any IPv4 options or IPv6 extension headers. > But this sounds like a driver-level thing: you have to undo some of what > your hardware will do because you're having to lie to it about what you're > giving it. So that all happens in the driver, the stack's GSO code isn't > affected. In which case I'm much less bothered by this; I thought it was > an assumption you were baking into the stack. As far as I'm concerned you > can do whatever ugly hacks you like in Intel drivers, as long as I don't > have to do them in sfc ;) The thing is I have to have the hardware recompute the outer IPv4 checksum because it is updating the ID field. If I were able to leave the IP ID static then I wouldn't need to update the checksum. The only bit that makes it ugly is the fact that the hardware is being misled so it thinks the IPv4 header has 50 bytes of options when those actually represent the outer L4 through to the inner L3 headers. 
> Although, why is your device computing the IPv4 header checksum? Those > aren't supposed to be offloaded, the stack always already filled them in > in software. (I think sfc makes the same mistake actually.) Is there not > a way for you to tell your device to skip IP header checksum offload? > Then you wouldn't have this problem in the first place, and you could tell > it the IP header was as big as you like without having to seed it with > this correction value. It is because we are rewriting the IP ID. We cannot have a static checksum if the IP ID is getting updated. >>> Like I say, I'm assuming we'll start setting DF on outer frames. >>> Besides, it doesn't matter what happens "in transit" - as long as the >>> outer headers aren't touched by the transmitting NIC, the network can >>> do what it likes to them afterwards, without it breaking our TSO code. >> The IP ID bit is going to be something where I don't want to break >> things. One of the things I have seen is there ends up being a number >> of occasions where VXLAN gets fragmented due to incorrectly configured >> MTU. I would rather not have it get completely screwed up when this >> occurs. A performance hit for fragmenting is one thing. Having >> people start complaining because previously working tunnels suddenly >> stop functioning is another. The fact is the whole point of VXLAN is >> to support sending the tunneled frames between data centers. With the >> DF bit set on a tunnel header we end up making it so that frames get >> dropped instead of fragmented which would be a change of behavior. > I agree this isn't something we can do silently. But we _can_ make it a > condition for enabling gso-partial. And I think it's a necessary > condition for truly generic TSO. Sure, your 'L3 extension header' works > fine for a single tunnel. But if you nest tunnels, you now need to > update the outer _and_ middle IP IDs, and you can't do that because you > only have one L3 header pointer. 
This is getting away from the 'less is more' concept. If we are doing multiple levels of tunnels we have already made things far too complicated and it is unlikely hardware will ever support anything like that. > OTOH, for a single tunnel I think we could implement your 'L3 extension > header' trick in firmware, by making it always parse the outer packet up > to the outer L3 header and increment the IP ID in that. So I could live > with that approach if necessary. Good to hear. >> Yes, I am very confident of that. For Intel hardware the outer VLAN >> tag would be the one inserted via software, the inner VLAN tag is the >> one inserted via hardware. > That's really weird; why does it do that? > For sfc, the only reason we do hardware VLANs at all is to transparently > tag a VF (or non-primary PF) that's being passed through into an > (untrusted) VM. For that use case you'd always want to insert the outer > VLAN tag, because otherwise the VM can escape the isolation by inserting > a different VLAN tag in software. The problem is I think the feature was implemented long before any of the SR-IOV stuff was added and it has been carried that way through the igb and ixgbe drivers. I haven't looked at i40e but it doesn't report doing STAG so I don't know how it handles QinQ. >>> _However_, if we don't need to update the IP IDs, then we can just take >>> the offset of the inner L4 header, and it doesn't make any difference >>> whether you choose to think of the stuff before that as L2 + giant L3 >>> or as giant L2 + normal L3, because it's not part of the OS->NIC >>> interface (which is just "L4 starts here"). If your NIC needs to be >>> told where the outer L3 starts as well, then, I guess that's just a >>> wart you need in your drivers. You have skb->network_header, so that >>> shouldn't be difficult - that will always point to the outer one. >> Right. That is kind of what I was going for with this setup. 
The >> only issue is that the VXLAN tunnels not setting the DF bit kind of >> get in the way of the giant L3 approach. Sorry meant giant L2 approach here, not L3. > On the contrary; your giant L3 approach is exactly what solves this case > (for non-nested tunnels) - the hardware has the outer IP and inner TCP > header offsets, which are exactly the two headers it needs to alter. > And if the hardware is sensible, it won't try to re-checksum the whole > giant L3 header, it'll just decrement the (outer) IP checksum to account > for incrementing the (outer) IP ID. If the hardware isn't sensible, > then you have to play games like you are doing in the Intel drivers ;) Agreed. >> Dealing with the outer header needing the DF bit is something I have >> left unaddressed at this point. The question would be what is the >> correct approach to take for all this. I know RFC2003 for IPv4 in >> IPv4 says you must set the DF bit if the inner header has the DF bit >> set. I'm just wondering if we can apply the same type of logic to GRE >> and UDP tunnels. > I wonder if _not_ setting the DF bit in the outer header can harm TCP > congestion control at all? If so, then we'd pretty much have to set it > in that case. Really the only concern with DF based on RFC 2003 is for path MTU discovery. What happens right now if you set the VXLAN MTU incorrectly is you end up taking a performance hit as the tunneled frames are fragmented and have to be reassembled on the other end. It ends up being a performance hit due to the fragmentation. I might need to look into this. It is possible that we are already doing the wrong stuff anyway with this and might need to look at setting the DF bit anyway. >> My concern is that I am not sure how much value there is to add with >> this type of code if the hardware is parsing headers. In the case of >> most of the Intel parts you specify offsets for the network and >> transport headers so it gives you some degree of flexibility. 
If >> however the hardware parses headers it becomes problematic as we can >> only support protocols that can be parsed by the hardware. > Solarflare parts will _normally_ parse headers to get that information. > But, when doing TSO, we do get a chance to specify some extra > information, in the TSO option descriptor. Enough of the datapath is > under firmware control that that should be enough; as long as the outer > frame is IP over Ethernet, the hardware will parse that fine, and we > *should* be able to make it just accept that it doesn't know what's > going on between that and the start of the TCP header. And, it > shouldn't matter that the hardware can parse some types of tunnel > headers, because we'll just tell it to ignore that. Right. Just skipping the tunnel headers makes it quite a bit easier. > Of course, that means changing the firmware; luckily we haven't got any > parts in the wild doing tunnel offloads yet, so we still have a chance > to do that without needing driver code to work around our past > mistakes... > > But this stuff does definitely add value for us, it means we could TSO > any tunnel type whatsoever; even nested tunnels as long as only the > outermost IP ID needs to change. Right. In your case it sounds like you would have the advantage of just having to run essentially two counters, one increments the IPv4 ID and the other decrements the IPv4 checksum. Beyond that the outer headers wouldn't need to change at all. The only other issue would be determining how the inner pseudo-header checksum is updated. If you were parsing out header fields from the IP header previously to generate it you would instead need to update things so that you could use the partial checksum that is already stored in the TCP header checksum field. - Alex
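[Editor's note: the "two counters" Alexander mentions (increment the IP ID, decrement the checksum) are just RFC 1624 incremental checksum update. A sketch with invented helper names, not kernel code:]

```python
def fold(s):
    while s >> 16:
        s = (s & 0xffff) + (s >> 16)
    return s

def ip_fast_csum(words):
    """Full recompute: ones-complement checksum over 16-bit header
    words, with the checksum field itself passed as zero."""
    return (~fold(sum(words))) & 0xffff

def csum_update_id(check, old_id, new_id):
    """RFC 1624: HC' = ~(~HC + ~m + m').  Adjust a stored checksum
    for a changed IP ID without touching the rest of the header."""
    return (~fold(((~check) & 0xffff) + ((~old_id) & 0xffff) + new_id)) & 0xffff

# 20-byte header as 16-bit words; word 2 is the ID, word 5 the checksum.
hdr = [0x4500, 0x0054, 0x0000, 0x4000, 0x4011, 0x0000, 0xc0a8, 0x0101,
       0xc0a8, 0x0102]
ID = 2

check = ip_fast_csum(hdr)
for new_id in range(1, 6):  # each emitted segment bumps the ID
    check = csum_update_id(check, hdr[ID], new_id)
    hdr[ID] = new_id
    # incremental update matches a full recompute every time
    assert check == ip_fast_csum(hdr)
```

Note the pairing: each ID increment shows up as a checksum decrement, which is exactly the second counter described above.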
On 23/03/16 22:36, Alexander Duyck wrote: > On Wed, Mar 23, 2016 at 2:05 PM, Edward Cree <ecree@solarflare.com> wrote: >> I disagree. Surely we should be able to "soft segment" the packet just >> before we give it to the physical device, and then tell it to do dumb copying >> of both the VXLAN and IPIP headers? At this point, we don't have the problem >> you identified above, because we've arrived at the device now. > One issue here is that all levels of IP headers would have to have the > DF bit set. I don't think that happens right now. Yes, that's still a requirement. (Well, except for the outermost IP header.) >> So we can chase through some per-protocol callbacks to shorten all the outer >> lengths and adjust all the outer checksums, then hand it to the device for >> TSO. The device is treating the extra headers as an opaque blob, so it >> doesn't know or care whether it's one layer of encapsulation or forty-two. > So if we do pure software offloads this is doable. However the GSO > flags are meant to have hardware feature equivalents. The problem is > if you combine an IPIP and VXLAN header how do you know what header is > what and which order things are in, and what is the likelihood of > having a device that would get things right when dealing with 3 levels > of IP headers. This is one of the reasons why we don't support > multiple levels of tunnels in the GSO code. GSO is just meant to be a > fall-back for hardware offloads. Right, but if the hardware does things "the new way" it should work fine: Packet still starts with Eth + IP. Packet still has TCP headers at some specified offset. So it all works, as long as you don't have to update any IP IDs except possibly the outermost one. >> Ok, it sounds like the interface to Intel hardware is just Very Different >> to Solarflare hardware on this point: we don't tell our hardware anything >> about where the various headers start, it just parses them to figure it >> out. 
(And for new-style TSO we'd tell it where the TCP header starts, as >> I described before.) > That is kind of what I figured. So does that mean for IPv6 you guys > are parsing through extension headers? I believe that is one of the > reasons why Intel did things the way they did is to avoid having to > parse through any IPv4 options or IPv6 extension headers. I believe so, but I'd have to check with our firmware team to be sure. The hardware needs to have that capability for RX processing, where it wants to figure out things like the l4proto for IPv6: you have to walk the extension headers until you get a layer 4 nexthdr. I wonder how Intel manage without that? >> I agree this isn't something we can do silently. But we _can_ make it a >> condition for enabling gso-partial. And I think it's a necessary >> condition for truly generic TSO. Sure, your 'L3 extension header' works >> fine for a single tunnel. But if you nest tunnels, you now need to >> update the outer _and_ middle IP IDs, and you can't do that because you >> only have one L3 header pointer. > This is getting away from the 'less is more' concept. If we are doing > multiple levels of tunnels we have already made things far too > complicated and it is unlikely hardware will ever support anything > like that. That's not how I understood the concept. I parsed it as "if hardware knows less, we can get more out of it", i.e. by having the hardware blithely paste together whatever headers you give it, you can support things like nested tunnels. As long as your 'middle' IP header has DF set, this can be done without the hardware needing to know a thing about it. And while we don't need to implement that straight away, we should care to design our interfaces to ensure we can do that in the future without too much trouble. 
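[Editor's note: the RX-side walk Edward refers to (chasing nexthdr until a layer-4 protocol appears) is roughly what the kernel's ipv6_skip_exthdr() does. A simplified model; real code must also handle fragment headers (fixed 8 bytes) and AH (length counted in 4-octet units):]

```python
# Extension headers sharing the generic TLV layout:
# byte 0 = next header, byte 1 = length in 8-octet units, minus one.
HOPOPTS, ROUTING, DSTOPTS = 0, 43, 60
IPPROTO_TCP = 6

def skip_exthdrs(nexthdr, payload):
    """Walk the extension-header chain starting at the IPv6 payload;
    return (l4proto, offset of the layer-4 header)."""
    off = 0
    while nexthdr in (HOPOPTS, ROUTING, DSTOPTS):
        nexthdr = payload[off]
        off += (payload[off + 1] + 1) * 8
    return nexthdr, off

# Hop-by-hop header (8 bytes, next = TCP) followed by a TCP header.
pkt = bytes([IPPROTO_TCP, 0]) + bytes(6) + bytes(20)
proto, off = skip_exthdrs(HOPOPTS, pkt)
assert (proto, off) == (IPPROTO_TCP, 8)
```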
>> Of course, that means changing the firmware; luckily we haven't got any >> parts in the wild doing tunnel offloads yet, so we still have a chance >> to do that without needing driver code to work around our past >> mistakes... >> >> But this stuff does definitely add value for us, it means we could TSO >> any tunnel type whatsoever; even nested tunnels as long as only the >> outermost IP ID needs to change. > Right. In your case it sounds like you would have the advantage of > just having to run essentially two counters, one increments the IPv4 > ID and the other decrements the IPv4 checksum. Beyond that the outer > headers wouldn't need to change at all. Exactly. > The only other issue would be determining how the inner pseudo-header > checksum is updated. If you were parsing out header fields from the > IP header previously to generate it you would instead need to update > things so that you could use the partial checksum that is already > stored in the TCP header checksum field. Right, but again that's sufficiently under firmware control (AFAIK) that that should just be a SMOP for the firmware. Though I will ask about that tomorrow, just in case. -Ed
On Wed, Mar 23, 2016 at 4:00 PM, Edward Cree <ecree@solarflare.com> wrote: > On 23/03/16 22:36, Alexander Duyck wrote: >> On Wed, Mar 23, 2016 at 2:05 PM, Edward Cree <ecree@solarflare.com> wrote: >>> I disagree. Surely we should be able to "soft segment" the packet just >>> before we give it to the physical device, and then tell it to do dumb copying >>> of both the VXLAN and IPIP headers? At this point, we don't have the problem >>> you identified above, because we've arrived at the device now. >> One issue here is that all levels of IP headers would have to have the >> DF bit set. I don't think that happens right now. > Yes, that's still a requirement. (Well, except for the outermost IP header.) >>> So we can chase through some per-protocol callbacks to shorten all the outer >>> lengths and adjust all the outer checksums, then hand it to the device for >>> TSO. The device is treating the extra headers as an opaque blob, so it >>> doesn't know or care whether it's one layer of encapsulation or forty-two. >> So if we do pure software offloads this is doable. However the GSO >> flags are meant to have hardware feature equivalents. The problem is >> if you combine an IPIP and VXLAN header how do you know what header is >> what and which order things are in, and what is the likelihood of >> having a device that would get things right when dealing with 3 levels >> of IP headers. This is one of the reasons why we don't support >> multiple levels of tunnels in the GSO code. GSO is just meant to be a >> fall-back for hardware offloads. > Right, but if the hardware does things "the new way" it should work fine: > Packet still starts with Eth + IP. Packet still has TCP headers at some > specified offset. So it all works, as long as you don't have to update > any IP IDs except possibly the outermost one. Right, but the problem becomes how do you identify what tunnel wants what. So for example we could theoretically have a UDP tunnel in a UDP with checksum. 
How would we tell which one wants to have the checksum set and which one doesn't? The fact is we cannot. You are looking too far ahead. We haven't gotten to tunnel in tunnel yet. The approach as it stands doesn't have any issues that necessarily prevent that as long as the outer is the only IP ID that has to increment, but we don't support anything like that now so we don't need to worry about it too much. >>> Ok, it sounds like the interface to Intel hardware is just Very Different >>> to Solarflare hardware on this point: we don't tell our hardware anything >>> about where the various headers start, it just parses them to figure it >>> out. (And for new-style TSO we'd tell it where the TCP header starts, as >>> I described before.) >> That is kind of what I figured. So does that mean for IPv6 you guys >> are parsing through extension headers? I believe that is one of the >> reasons why Intel did things the way they did is to avoid having to >> parse through any IPv4 options or IPv6 extension headers. > I believe so, but I'd have to check with our firmware team to be sure. > The hardware needs to have that capability for RX processing, where it > wants to figure out things like the l4proto for IPv6: you have to walk > the extension headers until you get a layer 4 nexthdr. I wonder how > Intel manage without that? They have some parsing in the Rx. That is one of the reasons why there was all the arguing about adding GENEVE port numbers a few months ago. They just don't make use of it in the Tx path with the exception of the fm10k parts. >>> I agree this isn't something we can do silently. But we _can_ make it a >>> condition for enabling gso-partial. And I think it's a necessary >>> condition for truly generic TSO. Sure, your 'L3 extension header' works >>> fine for a single tunnel. But if you nest tunnels, you now need to >>> update the outer _and_ middle IP IDs, and you can't do that because you >>> only have one L3 header pointer.
>> This is getting away from the 'less is more' concept. If we are doing >> multiple levels of tunnels we have already made things far too >> complicated and it is unlikely hardware will ever support anything >> like that. > That's not how I understood the concept. I parsed it as "if hardware knows > less, we can get more out of it", i.e. by having the hardware blithely paste > together whatever headers you give it, you can support things like nested > tunnels. As long as your 'middle' IP header has DF set, this can be done > without the hardware needing to know a thing about it. And while we don't > need to implement that straight away, we should care to design our > interfaces to ensure we can do that in the future without too much trouble. The design as is does nothing to prevent that. One of the reasons why I prefer to keep the outer IP ID incrementing is in order to support that kind of concept. Also it shields us a bit as we usually cannot control the network between the tunnel endpoints since it is usually traversing a WAN. What we need to do though is go through and see if we can get away with something like "if inner IP DF is set the outer IP DF bit must be set" kind of logic for GRE and UDP tunnels. If we can push that then it will allow us to essentially fix all the tunnel logic in one shot since TCP requires DF bit be set so all levels of headers would have the DF bit set. >>> Of course, that means changing the firmware; luckily we haven't got any >>> parts in the wild doing tunnel offloads yet, so we still have a chance >>> to do that without needing driver code to work around our past >>> mistakes... >>> >>> But this stuff does definitely add value for us, it means we could TSO >>> any tunnel type whatsoever; even nested tunnels as long as only the >>> outermost IP ID needs to change. >> Right. 
In your case it sounds like you would have the advantage of >> just having to run essentially two counters, one increments the IPv4 >> ID and the other decrements the IPv4 checksum. Beyond that the outer >> headers wouldn't need to change at all. > Exactly. >> The only other issue would be determining how the inner pseudo-header >> checksum is updated. If you were parsing out header fields from the >> IP header previously to generate it you would instead need to update >> things so that you could use the partial checksum that is already >> stored in the TCP header checksum field. > Right, but again that's sufficiently under firmware control (AFAIK) that > that should just be a SMOP for the firmware. Though I will ask about > that tomorrow, just in case. There shouldn't be much to it. In the case of the Intel parts they want the length cancelled out of the checksum by the driver and then they fold it back in via hardware. I would imagine that your hardware could probably do something similar or may already be doing it since the length has to be handled differently for IPv4 vs IPv6. - Alex
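[Editor's note: the length cancellation Alexander describes can be sketched with csum_tcpudp-style pseudo-header math. A toy model, not driver code: the stack leaves the pseudo-header checksum for the whole superframe in the TCP checksum field; the driver backs the superframe length out, and hardware folds each segment's own length back in.]

```python
def fold(s):
    while s >> 16:
        s = (s & 0xffff) + (s >> 16)
    return s

def pseudo_csum(saddr, daddr, proto, length):
    """IPv4 pseudo-header checksum (folded ones-complement sum)."""
    return fold((saddr >> 16) + (saddr & 0xffff) +
                (daddr >> 16) + (daddr & 0xffff) + proto + length)

SA, DA, TCP = 0xc0a80101, 0xc0a80102, 6
super_len, seg_len = 45000, 1448

# What the stack left in the TCP checksum field (superframe length in).
partial = pseudo_csum(SA, DA, TCP, super_len)
# Driver cancels the superframe length out of the partial checksum...
seed = fold(partial + ((~super_len) & 0xffff))
assert seed == pseudo_csum(SA, DA, TCP, 0)
# ...and hardware folds each segment's own length back in.
assert fold(seed + seg_len) == pseudo_csum(SA, DA, TCP, seg_len)
```

The IPv4-vs-IPv6 difference Alexander mentions is only in which addresses feed the pseudo-header; the length handling is the same arithmetic.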
On Wed, Mar 23, 2016 at 11:19 AM, Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Wed, Mar 23, 2016 at 10:09 AM, Tom Herbert <tom@herbertland.com> wrote: >> Can you add some description about strategy for dealing with ip_id? > > Yeah. I still need to add more documentation. I just didn't want to > get into details on it until we have finalized things a bit more. I'm > still wondering if we should follow the pattern of ipip tunnels in > terms of setting the DF bit if the inner header has the DF bit set. > If we end up needing to add code to do that then it makes it so that > the ip_id value can be fixed for both inner and outer which makes the > segmentation much simpler since the only header that would ever really > need to be updated would be the transport header in order to get the > checksum correct. I tried to do this years ago but in practice it broke things. There's enough middleboxes/firewalls/etc. out there that filter ICMP messages that path MTU discovery isn't necessarily reliable. And while you might argue that if the box is breaking things then the same would be true for the original, unencapsulated TCP stream but a lot of times there are some other hacks built in (like MSS clamping) that make assumptions that the traffic is TCP. So at the minimum it is generally good to have an option to force the DF bit off. That being said, I actually think that it is good to have the DF bit on by default for encapsulation headers being added. Unintentional (and perhaps multiple layers of) fragmentation usually results in unusably bad performance, so it is best to try to correct it, hopefully automatically in most cases. And, of course, this is the direction that IPv6 has already gone. If we can assume that this is the most common case then in practice we can keep the outer headers constant for the high performance path. To me, incrementing the inner IP really seems the best choice.
The inner header is most likely someone else's traffic so it is best not to mess with that, whereas the outer headers are likely ours and we know the parameters for them (and can set the DF bit as we believe is correct). Also, if you are looking forward to the future as far as stacking multiple layers of tunnels, I think the only consistent thing to do is have the inner ID increment and all of the tunnel headers stay fixed - it is hard to justify why the first tunnel header should increment but not the second one. And finally, as a nice bonus, this is what the GRO code has been expecting already so you won't cause any performance regressions with existing systems.
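[Editor's note: the policy under discussion, an RFC 2003-style "outer DF mirrors inner DF" with the opt-out Jesse asks for, is tiny. A sketch; the knob name is invented:]

```python
IP_DF = 0x4000  # don't-fragment flag within the IPv4 frag_off field

def outer_frag_off(inner_frag_off, force_df_off=False):
    """RFC 2003-style DF propagation for an encapsulating header:
    copy the inner DF bit out, unless the (hypothetical) per-tunnel
    'force DF off' knob is set."""
    if force_df_off:
        return 0
    return inner_frag_off & IP_DF

assert outer_frag_off(IP_DF) == IP_DF   # inner DF set -> outer DF set
assert outer_frag_off(0x1fff) == 0      # inner DF clear -> outer DF clear
assert outer_frag_off(IP_DF, force_df_off=True) == 0
```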
On Wed, Mar 23, 2016 at 6:37 PM, Jesse Gross <jesse@kernel.org> wrote: > On Wed, Mar 23, 2016 at 11:19 AM, Alexander Duyck > <alexander.duyck@gmail.com> wrote: >> On Wed, Mar 23, 2016 at 10:09 AM, Tom Herbert <tom@herbertland.com> wrote: >>> Can you add some description about strategy for dealing with ip_id? >> >> Yeah. I still need to add more documentation. I just didn't want to >> get into details on it until we have finalized things a bit more. I'm >> still wondering if we should follow the pattern of ipip tunnels in >> terms of setting the DF bit if the inner header has the DF bit set. >> If we end up needing to add code to do that then it makes it so that >> the ip_id value can be fixed for both inner and outer which makes the >> segmentation much simpler since the only header that would ever really >> need to be updated would be the transport header in order to get the >> checksum correct. > > I tried to do this years ago but in practice it broke things. > > There's enough middleboxes/firewalls/etc. out there that filter ICMP > messages that path MTU discovery isn't necessarily reliable. And while > you might argue that if the box is breaking things then the same would > be true for the original, unencapsulated TCP stream but a lot of times > there are some other hacks built in (like MSS clamping) that make > assumptions that the traffic is TCP. So at the minimum it is generally > good to have an option to force the DF bit off. Agreed, I think if we tried anything we would have to have some sort of bit remaining to shut it off. > That being said, I actually think that it is good to have the DF bit > on by default for encapsulation headers being added. Unintentional > (and perhaps multiple layers of) fragmentation usually results in > unuseably bad performance and so it best to try to correct it, > hopefully automatically in most cases. And, of course, this is the > direction that IPv6 has already gone. 
If we can assume that this is > the most common case then in practice we can keep the outer headers > constant for the high performance path. I'm still weighing my options on how to address the DF issue. I'm not really sure having DF always on for outer headers is the best way to go either. I kind of like the idea that if DF is set for the inner headers then we set it for the outer headers. I just have to see how hard something like that would be to implement. > To me, incrementing the inner IP really seems the best choice. The > inner header is most likely someone else's traffic so it best to not > mess with that whereas the outer headers are likely ours and we know > the parameters for them (and can set the DF bit as we believe is > correct). Also, if you are looking forward to the future as far as > stacking multiple layers of tunnels, I think the only consistent thing > to do is have the inner ID increment and all of the tunnel headers > stay fixed - it is hard to justify why the first tunnel header should > increment but not the second one. And finally, as a nice bonus, this > is what the GRO code has been expecting already so you won't cause any > performance regressions with existing systems. Agreed. However, in the case of igb and ixgbe, incrementing the inner isn't really going to work well as it introduces a number of issues. I've considered not enabling GSO partial by default on those parts, but leaving it as an available feature. In the case of i40e we will not have any of those issues as both the inner and outer IP IDs will increment. The regressions issue shouldn't be all that big. In most cases tunnels are point to point links. Odds are if someone is running a network they should have similar devices on both ends. So in the case of the Intel drivers I have patched here updating i40e isn't that big of a deal since it retains the original GSO behavior.
In the case of ixgbe it didn't support any tunnel offloads so there is no GRO support without checksums being enabled in the tunnel, and Tx checksum support via HW_CSUM has yet to be pushed upstream from Jeff's tree so things are currently throttled by the Tx side. GSO partial is going to end up being very driver specific in terms of what increments and what doesn't as we are having to really hack our way around to make it do things the spec sheets say it cannot do. Some implementations are going to end up working out better than others in terms of interoperability and what can make use of other advanced features such as GRO on legacy systems. - Alex
On 23/03/16 23:15, Alexander Duyck wrote: > Right, but the problem becomes how do you identify what tunnel wants > what. So for example we could theoretically have a UDP tunnel in a > UDP with checksum. How would we tell which one want to have the > checksum set and which one doesn't? The fact is we cannot. I think we can still handle that, assuming the device is only touching the innermost checksum (i.e. it's obeying csum_start/offset). We don't need flags to tell us what to fill in in GSO, we can work it all out: Make the series of per-protocol callbacks for GSO partial run inner-outwards, by using recursion at the head. Make each return a csum_edit value. Then for example: For IPv4 header, our checksum covers only our header, so we fold any edits into our own checksum, and pass csum_edit through unchanged. For UDP header, we look to see if the current checksum field is zero. If so, we leave it as zero, fold our edits into csum_edit and return the result. Otherwise, we fold our edits and csum_edit into our checksum field, and return zero. For GRE, we look at the checksumming bit in the GRE header, and behave similarly to UDP. Etcetera... This should even be a worthwhile simplification of the non-nested case, because (if I understand correctly) it means GSO partial doesn't need the various gso_type flags we already have to specify tunnel type and checksum status; it just figures it out as it goes. If your device is touching other checksums as well, then of course you need to figure that out in your driver so you can cancel it out. But the device will only fiddle with the headers you tell it about (in your case I think that's outermost L3), not any others in the middle. So it should still all work, without the driver having to know about the nesting. > You are > looking too far ahead. We haven't gotten to tunnel in tunnel yet. IMHO, if our offloads are truly generic, tunnel in tunnel should be low-hanging fruit.
(In principle, "VxLAN + Ethernet + IP + GRE" is just another encapsulation header, albeit a rather long one). Therefore, if it _isn't_ low-hanging fruit for us, we should suspect that we aren't generic. So even if it's not currently useful in itself, it's still a convenient canary. -Ed
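[Editor's note: Edward's inner-outwards scheme could look something like the following toy model. The protocol handling is drastically simplified and every name is invented; each callback recurses first, then either absorbs the accumulated csum_edit into its own checksum field or passes it further out.]

```python
def fold(s):
    while s >> 16:
        s = (s & 0xffff) + (s >> 16)
    return s

def csum_absorb(check, delta):
    """Fold a 16-bit edit into a stored ones-complement checksum."""
    return (~fold(((~check) & 0xffff) + delta)) & 0xffff

def gso_partial_walk(headers, i=0):
    """Head recursion: the innermost header runs first, and each level
    returns the csum_edit still unaccounted for.  'headers' is an
    outermost-first list; 'edit' models the length-field change each
    header needs folded in somewhere."""
    if i == len(headers):
        return 0
    csum_edit = gso_partial_walk(headers, i + 1)  # inner headers first
    h = headers[i]
    if h['proto'] == 'ipv4':
        # Our checksum covers only our own header: absorb our edit,
        # pass the inner edits through unchanged.
        h['check'] = csum_absorb(h['check'], h['edit'])
        return csum_edit
    # UDP/GRE-style: a zero checksum field means "not checksummed
    # here", so everything gets handed outward.
    if h['check'] == 0:
        return fold(csum_edit + h['edit'])
    h['check'] = csum_absorb(h['check'], fold(csum_edit + h['edit']))
    return 0

hdrs = [{'proto': 'ipv4', 'check': 0xabcd, 'edit': 3},
        {'proto': 'udp',  'check': 0,      'edit': 7},
        {'proto': 'ipv4', 'check': 0x1234, 'edit': 5}]
left = gso_partial_walk(hdrs)
assert (left, hdrs[0]['check'], hdrs[2]['check']) == (7, 0xabca, 0x122f)
```

A nonzero value left over at the top just means no header along the chain checksummed those edits (here, the zero-checksum UDP header handed them outward and the IPv4 headers only pass foreign edits through).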
On Thu, Mar 24, 2016 at 10:12 AM, Edward Cree <ecree@solarflare.com> wrote: > On 23/03/16 23:15, Alexander Duyck wrote: >> Right, but the problem becomes how do you identify what tunnel wants >> what. So for example we could theoretically have a UDP tunnel in a >> UDP with checksum. How would we tell which one want to have the >> checksum set and which one doesn't? The fact is we cannot. > I think we can still handle that, assuming the device is only touching the > innermost checksum (i.e. it's obeying csum_start/offset). We don't need > flags to tell us what to fill in in GSO, we can work it all out: > Make the series of per-protocol callbacks for GSO partial run inner-outwards, by using recursion at the head. Make each return a csum_edit > value. Then for example: > For IPv4 header, our checksum covers only our header, so we fold any edits > into our own checksum, and pass csum_edit through unchanged. Right. IPv4 is easy because it is a localized checksum that is always present. > For UDP header, we look to see if the current checksum field is zero. If > so, we leave it as zero, fold our edits into csum_edit and return the > result. Otherwise, we fold our edits and csum_edit into our checksum > field, and return zero. This would require changing how we handle partial checksums so that in the case of UDP we don't allow 0 as a valid value. Currently we do. It isn't till we get to the final checksum that we take care of the bit flip in the case of 0. > For GRE, we look at the checksumming bit in the GRE header, and behave > similarly to UDP. > Etcetera... Right. In the case of GRE we at least have a flag we could check. > This should even be a worthwhile simplification of the non-nested case, > because (if I understand correctly) it means GSO partial doesn't need the > various gso_type flags we already have to specify tunnel type and checksum > status; it just figures it out as it goes.
Yes, but doing packet inspection can get to be expensive as we crawl through the headers. In addition it gets into the whole software versus hardware offloads thing. > If your device is touching other checksums as well, then of course you need > to figure that out in your driver so you can cancel it out. But the device > will only fiddle with the headers you tell it about (in your case I think > that's outermost L3), not any others in the middle. So it should still all > work, without the driver having to know about the nesting. > >> You are >> looking too far ahead. We haven't gotten to tunnel in tunnel yet. > IMHO, if our offloads are truly generic, tunnel in tunnel should be low-hanging fruit. (In principle, "VxLAN + Ethernet + IP + GRE" is just > another encapsulation header, albeit a rather long one). Therefore, if > it _isn't_ low-hanging fruit for us, we should suspect that we aren't > generic. So even if it's not currently useful in itself, it's still a > convenient canary. Honestly I think tunnel-in-tunnel is not going to be doable due to the fact that we would have to increment multiple layers of IP ID in order to do it correctly. The more I look into the whole DF on outer headers thing the more I see RFCs such as RFC 2784 that say not to do it unless you want to implement PMTU discovery, and PMTU discovery is inherently flawed since it would require ICMP messages to be passed which may be filtered out by firewalls. On top of that it occurred to me that GRE cannot be screened in GRO for the outer-IP-ID check. Basically what can happen is on devices that don't parse inner headers for GRE we can end up with two TCP flows from the same tunnel essentially stomping on each other and causing one another to get evicted for having an outer IP-ID that doesn't match up with the expected value. - Alex
On 24/03/16 18:43, Alexander Duyck wrote: > On Thu, Mar 24, 2016 at 10:12 AM, Edward Cree <ecree@solarflare.com> wrote: >> For UDP header, we look to see if the current checksum field is zero. If >> so, we leave it as zero, fold our edits into csum_edit and return the >> result. Otherwise, we fold our edits and csum_edit into our checksum >> field, and return zero. > This would require changing how we handle partial checksums so that in > the case of UDP we don't allow 0 as a valid value. Currently we do. > It isn't till we get to the final checksum that we take care of the > bit flip in the case of 0. No, the UDP checksum will have been filled in by LCO and thus have been bit-flipped already if it was zero. Only the innermost L4 header will have a partial checksum, and that's TCP so the checksum is required. (Alternatively: whichever header has the partial checksum - and there is at most one - is identified by skb->csum_start, and by definition the checksum must be enabled for that header, so we can skip the 'check for zero' heuristic there.) (Besides, I thought it was impossible for the partial checksum to be zero anyway because at least one of the inputs must be nonzero, and the end-around carry can never produce a zero. But maybe I'm getting confused here.) >> This should even be a worthwhile simplification of the non-nested case, >> because (if I understand correctly) it means GSO partial doesn't need the >> various gso_type flags we already have to specify tunnel type and checksum >> status; it just figures it out as it goes. > Yes, but doing packet inspection can get to be expensive as we crawl > through the headers. In addition it gets into the whole software > versus hardware offloads thing. The headers should already be in cache, I thought, and this is only happening once per superframe.
We're already going to have to crawl through the headers anyway to edit the lengths, I don't think it should cost much more to also inspect things like GRE csum bit or the UDP checksum field. And by identifying the 'next header' from this method as well, we don't need to add a new SKB_GSO_FOO bit or two every time we add another kind of encapsulation to the kernel. (And remember that this is separate from - and doesn't impact - the existing GSO code; this is a bunch of new foo_gso_partial() callbacks, distinct from the foo_gso_segment() ones; and it's about preparing a superframe for hardware offload rather than doing the segmentation in software. I think it's preferable to have some preparation happen in software if that allows hardware to be totally dumb. (Apologies if that was already all obvious.)) > Honestly I think tunnel-in-tunnel is not going to be doable due to the > fact that we would have to increment multiple layers of IP ID in order > to do it correctly. The more I look into the whole DF on outer > headers thing the more I see RFCs such as RFC 2784 that say not to do > it unless you want to implement PMTU discovery, and PMTU discovery is > inherently flawed since it would require ICMP messages to be passed > which may be filtered out by firewalls. If PMTU discovery really is "inherently flawed", then TCP is broken and so is IPv6. I think the Right Thing here is for us to translate and forward ICMP errors to the originator of the traffic, as RFC 2784 vaguely suggests. It should also be possible to configure the tunnel endpoint's MTU to a sufficiently low value that in practice it will fit within the path MTU; then the sender will discover the tunnel's MTU restriction and stay within that. (I am assuming here that ICMP won't be filtered on the overlay network - which should be under the control of the tunnel administrator - even if it might be on the underlay network.) 
> On top of that it occurred to me that GRE cannot be screened in GRO > for the outer-IP-ID check. Basically what can happen is on devices > that don't parse inner headers for GRE we can end up with two TCP > flows from the same tunnel essentially stomping on each other and > causing one another to get evicted for having an outer IP-ID that > doesn't match up with the expected value. Yes, that does seem problematic - you'd need to save away the IPID check result and only evict once you'd ascertained they were the same flow. All the more reason to try to make your tunnels use DF *grin*. On the subject of GRO, I was wondering - it seems like the main reason why drivers have LRO is so they can be more permissive than GRO's "must be reversible" rules. (At least, that seems to be the case for sfc's LRO, which is only in our out-of-tree driver.) Maybe instead of each driver having its own LRO code (with hacks to avoid breaking Slow Start and suchlike by breaking ACK stats), there should be per-device options to control how permissive GRO is (e.g. don't check IP IDs), so that a user who wants LRO-like behaviour can get it from GRO. Any obvious reason why I couldn't do such a thing? (I realise LRO might not go away entirely, if other drivers are doing various hardware-assisted things. But in our case it really is all in software.) -Ed
On Thu, Mar 24, 2016 at 1:17 PM, Edward Cree <ecree@solarflare.com> wrote: > On 24/03/16 18:43, Alexander Duyck wrote: >> On Thu, Mar 24, 2016 at 10:12 AM, Edward Cree <ecree@solarflare.com> wrote: >>> For UDP header, we look to see if the current checksum field is zero. If >>> so, we leave it as zero, fold our edits into csum_edit and return the >>> result. Otherwise, we fold our edits and csum_edit into our checksum >>> field, and return zero. >> This would require changing how we handle partial checksums so that in >> the case of UDP we don't allow 0 as a valid value. Currently we do. >> It isn't till we get to the final checksum that we take care of the >> bit flip in the case of 0. > No, the UDP checksum will have been filled in by LCO and thus have been bit- > flipped already if it was zero. Only the innermost L4 header will have a > partial checksum, and that's TCP so the checksum is required. (Alternatively: > whichever header has the partial checksum - and there is at most one - is > identified by skb->csum_start, and by definition the checksum must be enabled > for that header, so we can skip the 'check for zero' heuristic there.) No, recheck the code. If the skb is GSO when we call udp_set_csum() we only have populated the partial checksum in the UDP header. It isn't until we actually perform the segmentation that we call lco_csum(). > (Besides, I thought it was impossible for the partial checksum to be zero > anyway because at least one of the inputs must be nonzero, and the end- > around carry can never produce a zero. But maybe I'm getting confused here.) I forgot about that bit. I think you are right. We end up inverting the output from csum fold so we are back to 0x1 - 0xFFFF as possible values. 
>>> This should even be a worthwhile simplification of the non-nested case, >>> because (if I understand correctly) it means GSO partial doesn't need the >>> various gso_type flags we already have to specify tunnel type and checksum >>> status; it just figures it out as it goes. >> Yes, but doing packet inspection can get to be expensive as we crawl >> through the headers. In addition it gets into the whole software >> versus hardware offloads thing. > The headers should already be in cache, I thought, and this is only happening > once per superframe. We're already going to have to crawl through the headers > anyway to edit the lengths, I don't think it should cost much more to also > inspect things like GRE csum bit or the UDP checksum field. And by > identifying the 'next header' from this method as well, we don't need to add a > new SKB_GSO_FOO bit or two every time we add another kind of encapsulation to > the kernel. Right, but this gets back into the hardware flags issue. The whole reason for SKB_GSO_FOO and SKB_GSO_FOO_CSUM is because hardware typically supports one but not the other. The other thing to keep in mind is it is much easier to figure out what offloads we can make use of when we are only dealing with at most 1 layer of tunnels. When we start getting into stacked tunnels it really becomes impossible to figure out what features in the hardware we can still make use of. You throw everything and the kitchen sink on it and we have to give up and would be doing everything in software. At least if we restrict the features being requested the hardware has a chance to perhaps be useful. > (And remember that this is separate from - and doesn't impact - the existing > GSO code; this is a bunch of new foo_gso_partial() callbacks, distinct from > the foo_gso_segment() ones; and it's about preparing a superframe for hardware > offload rather than doing the segmentation in software. 
I think it's > preferable to have some preparation happen in software if that allows hardware > to be totally dumb. (Apologies if that was already all obvious.)) This is where you and I differ. I think there are things you are overlooking. For example one of the reasons why I know the outer UDP checksum works the way it does with tunnel GSO is because the i40e and i40evf drivers already have hardware that supports UDP_TUNNEL_CSUM. As such in order to do what you are proposing we would likely have to rip that code out and completely rewrite it since it would change out the checksum value given to the device. >> Honestly I think tunnel-in-tunnel is not going to be doable due to the >> fact that we would have to increment multiple layers of IP ID in order >> to do it correctly. The more I look into the whole DF on outer >> headers thing the more I see RFCs such as RFC 2784 that say not to do >> it unless you want to implement PMTU discovery, and PMTU discovery is >> inherently flawed since it would require ICMP messages to be passed >> which may be filtered out by firewalls. > If PMTU discovery really is "inherently flawed", then TCP is broken and so > is IPv6. I think the Right Thing here is for us to translate and forward > ICMP errors to the originator of the traffic, as RFC 2784 vaguely suggests. > It should also be possible to configure the tunnel endpoint's MTU to a > sufficiently low value that in practice it will fit within the path MTU; > then the sender will discover the tunnel's MTU restriction and stay within > that. (I am assuming here that ICMP won't be filtered on the overlay > network - which should be under the control of the tunnel administrator - > even if it might be on the underlay network.) I'm not going to get into an argument over the merits of PMTU. The fact is I don't believe it is worth the effort in order to enable the tunnel-in-tunnel case you envision. 
Also another thing to keep in mind is you would have to find a way to identify that a tunnel requesting to be segmented is setting the DF bit in the outer headers. >> On top of that it occurred to me that GRE cannot be screened in GRO >> for the outer-IP-ID check. Basically what can happen is on devices >> that don't parse inner headers for GRE we can end up with two TCP >> flows from the same tunnel essentially stomping on each other and >> causing one another to get evicted for having an outer IP-ID that >> doesn't match up with the expected value. > Yes, that does seem problematic - you'd need to save away the IPID check > result and only evict once you'd ascertained they were the same flow. All > the more reason to try to make your tunnels use DF *grin*. Feel free to solve the problem if you believe it is that easy. There is nothing in my patches to prevent that. > On the subject of GRO, I was wondering - it seems like the main reason why > drivers have LRO is so they can be more permissive than GRO's "must be > reversible" rules. (At least, that seems to be the case for sfc's LRO, > which is only in our out-of-tree driver.) Maybe instead of each driver > having its own LRO code (with hacks to avoid breaking Slow Start and > suchlike by breaking ACK stats), there should be per-device options to > control how permissive GRO is (e.g. don't check IP IDs), so that a user who > wants LRO-like behaviour can get it from GRO. I don't really see the point. Odds are the LRO in the driver isn't too much more efficient than the in-kernel GRO. If anything I would think the main motivation for maintaining LRO in your out-of-tree driver is to support kernels prior to GRO. > Any obvious reason why I couldn't do such a thing? > > (I realise LRO might not go away entirely, if other drivers are doing > various hardware-assisted things. But in our case it really is all in > software.) I don't know. 
I'm thinking it all depends on what kind of stuff you are talking about trying to change. Most of the flush conditions in GRO have very good reasons for being there. I know in the case of the Intel drivers I actually found that by copying most of the conditions into the out-of-tree driver LRO code we had I saw improvements. More often than not the more aggressive LRO can end up causing more harm than good because what ends up happening is that it spends too much time aggregating and starves the other side by withholding acks. - Alex

On 24/03/16 21:50, Alexander Duyck wrote: > On Thu, Mar 24, 2016 at 1:17 PM, Edward Cree <ecree@solarflare.com> wrote: >> (Besides, I thought it was impossible for the partial checksum to be zero >> anyway because at least one of the inputs must be nonzero, and the end- >> around carry can never produce a zero. But maybe I'm getting confused here.) > > I forgot about that bit. I think you are right. We end up inverting > the output from csum fold so we are back to 0x1 - 0xFFFF as possible > values. Yes, it seems csum_fold inverts it and then the relevant callers invert it again. (I hadn't spotted either inversion, so I panicked for a moment when I thought it was getting inverted just once.) >> The headers should already be in cache, I thought, and this is only happening >> once per superframe. We're already going to have to crawl through the headers >> anyway to edit the lengths, I don't think it should cost much more to also >> inspect things like GRE csum bit or the UDP checksum field. And by >> identifying the 'next header' from this method as well, we don't need to add a >> new SKB_GSO_FOO bit or two every time we add another kind of encapsulation to >> the kernel. > > Right, but this gets back into the hardware flags issue. The whole > reason for SKB_GSO_FOO and SKB_GSO_FOO_CSUM is because hardware > typically supports one but not the other. Right, and for hardware supporting a specific GSO_FOO[_CSUM], we'll still need the flags - but in that case we'll be doing "old style" TSO anyway. GSO partial only really makes sense for a device that isn't (or at least can be prevented from) parsing the tunnel headers, and such hardware will support GSO_PARTIAL only, not a profusion of GSO_FOO (with or without _CSUM). 
So: in the initial transmit path we build a coherent superframe; when we get to the device, we say either "oh, device doesn't support offloads at all, call GSO", or "oh, device supports this particular GSO type for TSO, so give it the superframe SKB to do what it wants with", or "oh, device supports GSO_PARTIAL, let's call skb_mac_gso_partial [i.e. chain of new callbacks I was describing in previous email] and then give the non-coherent SKB to the device". And in that last case it doesn't matter what the GSO type is, because the hardware can handle anything as long as we transform it into GSO partial. > The other thing to keep in mind is it is much easier to figure out > what offloads we can make use of when we are only dealing with at most > 1 layer of tunnels. When we start getting into stacked tunnels it > really becomes impossible to figure out what features in the hardware > we can still make use of. You throw everything and the kitchen sink > on it and we have to give up and would be doing everything in > software. At least if we restrict the features being requested the > hardware has a chance to perhaps be useful. I think all the protocol-specific offloads can just say "don't even try if you've got stacked tunnels"; it's not something I expect current hardware to support, and if GSO partial works out then a hw vendor would have to be crazy to add such support in the future. But GSO partial doesn't care about stacked tunnels, it still works. Unless of course someone in the middle wants to change their IP IDs or GRE Counter or whatever, in which case they can mark the skb as "not gso_partial-safe" when they first build their header, or their gso_partial callback can return "nope", and then we fall back on a protocol-specific offload (if one's available, which it won't be for stacked tunnels) or software segmentation. 
>> (And remember that this is separate from - and doesn't impact - the existing >> GSO code; this is a bunch of new foo_gso_partial() callbacks, distinct from >> the foo_gso_segment() ones; and it's about preparing a superframe for hardware >> offload rather than doing the segmentation in software. I think it's >> preferable to have some preparation happen in software if that allows hardware >> to be totally dumb. (Apologies if that was already all obvious.)) > > This is where you and I differ. I think there are things you are overlooking. > > For example one of the reasons why I know the outer UDP checksum works > the way it does with tunnel GSO is because the i40e and i40evf drivers > already have hardware that supports UDP_TUNNEL_CSUM. As such in order > to do what you are proposing we would likely have to rip that code out > and completely rewrite it since it would change out the checksum value > given to the device. I'm not sure I understand, so bear with me. AIUI, UDP_TUNNEL_CSUM means that the device will fill in both inner and outer L4 checksums. But it can only do that if it knows where the inner and outer L4 headers *are*, and I thought you were going to tell it not to treat it as a tunnel at all, but instead as a giant L3 header. So it won't touch the outer L4 checksum at all (and, in the case of stacked tunnels, it wouldn't touch the 'middle' one either - it only sees the innermost L4 header). So while the driver would need to change to add GSO_partial support, the hardware wouldn't. Where am I going wrong? > I'm not going to get into an argument over the merits of PMTU. The > fact is I don't believe it is worth the effort in order to enable the > tunnel-in-tunnel case you envision. I agree it's not worth the effort to implement it right now. But some day it will be, and at that point, if we've designed GSO partial the way I'm arguing for, it will just magically start working when we disable fragmentation on inner tunnels. 
And in the meantime we'll be using a more generic interface that hopefully won't encourage driver (and hardware) designers to see the gso_type enum as a list of protocols to support. Again, tunnel-in-tunnel isn't important in itself, but as a canary for "can we support whatever encapsulation protocol comes down the pike"; arguing that you won't be able to set DF on your inner tunnel because operational consideration XYZ is missing the point IMHO. > Also another thing to keep in mind is you would have to find a way to > identify that a tunnel requesting to be segmented is setting the DF > bit in the outer headers. But that's easy, as above: either foo_xmit sets a "don't GSO-partial me" flag in the SKB, or foo_gso_partial returns -EMIGHTBEFRAGMENTED. The former is probably preferable for performance reasons. >>> On top of that it occurred to me that GRE cannot be screened in GRO >>> for the outer-IP-ID check. Basically what can happen is on devices >>> that don't parse inner headers for GRE we can end up with two TCP >>> flows from the same tunnel essentially stomping on each other and >>> causing one another to get evicted for having an outer IP-ID that >>> doesn't match up with the expected value. > >> Yes, that does seem problematic - you'd need to save away the IPID check >> result and only evict once you'd ascertained they were the same flow. All >> the more reason to try to make your tunnels use DF *grin*. > > Feel free to solve the problem if you believe it is that easy. There > is nothing in my patches to prevent that. I'm not saying it's easy. The GRO side of your patches look great to me and I think you're more likely to solve it than I am. But I will have a go at it too, if I have time. Though I must admit, I don't quite understand why we have to restrict to (incrementing or constant) in GRO, rather than just saying "it's DF, we can ignore the IP IDs and just aggregate it all together". 
We won't be producing the original IP IDs anyway if we resegment, because we don't know whether they were originally incrementing or constant. (In other words, "RFC6864 compliant GRO" as you've defined it can reverse either GSO or GSO partial, but GSO (in either form) can't always reverse RFC6864 compliant GRO.) -Ed
On Thu, Mar 24, 2016 at 4:00 PM, Edward Cree <ecree@solarflare.com> wrote: > On 24/03/16 21:50, Alexander Duyck wrote: >> On Thu, Mar 24, 2016 at 1:17 PM, Edward Cree <ecree@solarflare.com> wrote: >>> (Besides, I thought it was impossible for the partial checksum to be zero >>> anyway because at least one of the inputs must be nonzero, and the end- >>> around carry can never produce a zero. But maybe I'm getting confused here.) >> >> I forgot about that bit. I think you are right. We end up inverting >> the output from csum fold so we are back to 0x1 - 0xFFFF as possible >> values. > Yes, it seems csum_fold inverts it and then the relevant callers invert it > again. (I hadn't spotted either inversion, so I panicked for a moment when > I thought it was getting inverted just once.) > >>> The headers should already be in cache, I thought, and this is only happening >>> once per superframe. We're already going to have to crawl through the headers >>> anyway to edit the lengths, I don't think it should cost much more to also >>> inspect things like GRE csum bit or the UDP checksum field. And by >>> identifying the 'next header' from this method as well, we don't need to add a >>> new SKB_GSO_FOO bit or two every time we add another kind of encapsulation to >>> the kernel. >> >> Right, but this gets back into the hardware flags issue. The whole >> reason for SKB_GSO_FOO and SKB_GSO_FOO_CSUM is because hardware >> typically supports one but not the other. > Right, and for hardware supporting a specific GSO_FOO[_CSUM], we'll still > need the flags - but in that case we'll be doing "old style" TSO anyway. > GSO partial only really makes sense for a device that isn't (or at least can > be prevented from) parsing the tunnel headers, and such hardware will support > GSO_PARTIAL only, not a profusion of GSO_FOO (with or without _CSUM). 
> > So: in the initial transmit path we build a coherent superframe; when we get > to the device, we say either "oh, device doesn't support offloads at all, > call GSO", or "oh, device supports this particular GSO type for TSO, so give > it the superframe SKB to do what it wants with", or "oh, device supports > GSO_PARTIAL, let's call skb_mac_gso_partial [i.e. chain of new callbacks I > was describing in previous email] and then give the non-coherent SKB to the > device". What you are proposing is essentially forking GSO. I really don't like that. I prefer the approach we have now where essentially GSO partial is something checked for at the very end and we have a few checks when putting together the headers so we do the right thing. > And in that last case it doesn't matter what the GSO type is, because the > hardware can handle anything as long as we transform it into GSO partial. Yes and no. The problem is trying to figure out what the device supports. The hw_enc_features flags are only meant to be used with one layer of tunnels. Are you suggesting we add some code to make sure the GSO tunnel types are kept mutually exclusive and add an extra offload that forces GSO/GSO_PARTIAL if they aren't? This is sounding very complicated. I think we would be better off if you took the time to try and implement some of this yourself so you could see how feasible it is. >> The other thing to keep in mind is it is much easier to figure out >> what offloads we can make use of when we are only dealing with at most >> 1 layer of tunnels. When we start getting into stacked tunnels it >> really becomes impossible to figure out what features in the hardware >> we can still make use of. You throw everything and the kitchen sink >> on it and we have to give up and would be doing everything in >> software. At least if we restrict the features being requested the >> hardware has a chance to perhaps be useful. 
> I think all the protocol-specific offloads can just say "don't even try > if you've got stacked tunnels"; it's not something I expect current > hardware to support, and if GSO partial works out then a hw vendor would > have to be crazy to add such support in the future. But GSO partial > doesn't care about stacked tunnels, it still works. Unless of course > someone in the middle wants to change their IP IDs or GRE Counter or > whatever, in which case they can mark the skb as "not gso_partial-safe" > when they first build their header, or their gso_partial callback can > return "nope", and then we fall back on a protocol-specific offload (if > one's available, which it won't be for stacked tunnels) or software > segmentation. That is basically what happens now. When the tunnel gets configured if there are any options that would prevent GSO from working then the tunnel disables TSO support. That way the frames are segmented further up the stack and then come through and are built around the segmented frames. >> For example one of the reasons why I know the outer UDP checksum works >> the way it does with tunnel GSO is because the i40e and i40evf drivers >> already have hardware that supports UDP_TUNNEL_CSUM. As such in order >> to do what you are proposing we would likely have to rip that code out >> and completely rewrite it since it would change out the checksum value >> given to the device. > I'm not sure I understand, so bear with me. AIUI, UDP_TUNNEL_CSUM means > that the device will fill in both inner and outer L4 checksums. But it can > only do that if it knows where the inner and outer L4 headers *are*, and I > thought you were going to tell it not to treat it as a tunnel at all, but > instead as a giant L3 header. So it won't touch the outer L4 checksum at > all (and, in the case of stacked tunnels, it wouldn't touch the 'middle' > one either - it only sees the innermost L4 header). There is hardware that supports this already without GSO partial. 
Specifically the i40e driver supports an XL722 device that supports doing the outer UDP tunnel checksum in addition to the inner TCP checksum. I'm suspecting there will be hardware from other vendors in the near future that will have support as well. The ixgbe and igb drivers will be taking the giant L3 header approach. > So while the driver would need to change to add GSO_partial support, the > hardware wouldn't. > Where am I going wrong? > >> I'm not going to get into an argument over the merits of PMTU. The >> fact is I don't believe it is worth the effort in order to enable the >> tunnel-in-tunnel case you envision. > I agree it's not worth the effort to implement it right now. But some day > it will be, and at that point, if we've designed GSO partial the way I'm > arguing for, it will just magically start working when we disable > fragmentation on inner tunnels. And in the meantime we'll be using a more > generic interface that hopefully won't encourage driver (and hardware) > designers to see the gso_type enum as a list of protocols to support. > Again, tunnel-in-tunnel isn't important in itself, but as a canary for > "can we support whatever encapsulation protocol comes down the pike"; > arguing that you won't be able to set DF on your inner tunnel because > operational consideration XYZ is missing the point IMHO. It turns out that UDP tunnels seem to be the only real problem child in terms of the DF bit. I was looking over the GRE code and it already sets the DF bit and supports PMTU. Odds are we would probably need to figure out how to borrow code from there and apply it to the UDP tunnels. >> Also another thing to keep in mind is you would have to find a way to >> identify that a tunnel requesting to be segmented is setting the DF >> bit in the outer headers. > But that's easy, as above: either foo_xmit sets a "don't GSO-partial me" > flag in the SKB, or foo_gso_partial returns -EMIGHTBEFRAGMENTED.
The > former is probably preferable for performance reasons. Still ugly. Honestly I don't know if it is worth all this trouble because we are just as likely to run out of headroom, which would probably be a more expensive operation than giving up on the segmentation offloads and only supporting HW_CSUM anyway. >>>> On top of that it occurred to me that GRE cannot be screened in GRO >>>> for the outer-IP-ID check. Basically what can happen is on devices >>>> that don't parse inner headers for GRE we can end up with two TCP >>>> flows from the same tunnel essentially stomping on each other and >>>> causing one another to get evicted for having an outer IP-ID that >>>> doesn't match up with the expected value. >> >>> Yes, that does seem problematic - you'd need to save away the IPID check >>> result and only evict once you'd ascertained they were the same flow. All >>> the more reason to try to make your tunnels use DF *grin*. >> >> Feel free to solve the problem if you believe it is that easy. There >> is nothing in my patches to prevent that. > I'm not saying it's easy. The GRO side of your patches looks great to me > and I think you're more likely to solve it than I am. But I will have a > go at it too, if I have time. My concern is that what you are talking about ends up being a pretty significant change to the transmit path itself. Really the stack doesn't want things segmented like that until you hit the GSO layer and I think that might be where you run into issues. > Though I must admit, I don't quite understand why we have to restrict to > (incrementing or constant) in GRO, rather than just saying "it's DF, we > can ignore the IP IDs and just aggregate it all together". We won't be > producing the original IP IDs anyway if we resegment, because we don't > know whether they were originally incrementing or constant.
(In other > words, "RFC6864 compliant GRO" as you've defined it can reverse either > GSO or GSO partial, but GSO (in either form) can't always reverse > RFC6864 compliant GRO.) Actually for v2 of this GRO code I am taking the "it's DF just ignore the IP ID entirely" approach. Basically IPv4 with DF set will behave the same as IPv6 by setting flush_id to 0 instead of computing the offset. I realized I was still deriving information from the IP ID and RFC 6864 says we cannot do that for atomic datagrams. - Alex
On Thu, 24 Mar 2016, Alexander Duyck wrote: > On Thu, Mar 24, 2016 at 4:00 PM, Edward Cree <ecree@solarflare.com> wrote: >> So: in the initial transmit path we build a coherent superframe; when we get >> to the device, we say either "oh, device doesn't support offloads at all, >> call GSO", or "oh, device supports this particular GSO type for TSO, so give >> it the superframe SKB to do what it wants with", or "oh, device supports >> GSO_PARTIAL, let's call skb_mac_gso_partial [i.e. chain of new callbacks I >> was describing in previous email] and then give the non-coherent SKB to the >> device". > > What you are proposing is essentially forking GSO. In a way, I suppose it is. > I really don't > like that. I prefer the approach we have now where essentially GSO > partial is something checked for at the very end and we have a few > checks when putting together the headers so we do the right thing. I guess both can exist, and I'll name mine something other than "GSO partial"... > This is sounding very complicated. I think we would be better off if > you took the time to try and implement some of this yourself so you > could see how feasible it is. Sure, I was already intending to do that before I saw your patch series and thought it might be able to do what I had in mind. As it now seems like we're envisaging different things, I'll go back to implementing mine and wish you luck with yours. -Ed
On Wed, Mar 23, 2016 at 7:53 PM, Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Wed, Mar 23, 2016 at 6:37 PM, Jesse Gross <jesse@kernel.org> wrote: >> That being said, I actually think that it is good to have the DF bit >> on by default for encapsulation headers being added. Unintentional >> (and perhaps multiple layers of) fragmentation usually results in >> unusably bad performance and so it is best to try to correct it, >> hopefully automatically in most cases. And, of course, this is the >> direction that IPv6 has already gone. If we can assume that this is >> the most common case then in practice we can keep the outer headers >> constant for the high performance path. > > I'm still weighing my options on how to address the DF issue. I'm > not really sure having DF always on for outer headers is the best way > to go either. I kind of like the idea that if DF is set for the inner > headers then we set it for the outer headers. I just have to see how > hard something like that would be to implement. I don't think it would be too hard to implement. We already have code that copies things like ECN between the inner and outer headers, so it shouldn't be an issue to add this if that's what we decide is right. My first concern was that we have a way to turn off the DF bit in the event that it breaks things. Otherwise, I think it is a performance win both from the perspective of avoiding fragmentation and allowing blind copying of headers through TSO by avoiding the need to increment the ID. In practice, there's probably not too much difference between turning it on by default and inheriting from the inner frame other than it might affect what is considered to be the 'fast path'. >> To me, incrementing the inner IP ID really seems the best choice.
>> The inner header is most likely someone else's traffic, so it is best
>> not to mess with that, whereas the outer headers are likely ours and
>> we know the parameters for them (and can set the DF bit as we believe
>> is correct). Also, if you are looking forward to the future as far as
>> stacking multiple layers of tunnels, I think the only consistent thing
>> to do is have the inner ID increment and all of the tunnel headers
>> stay fixed - it is hard to justify why the first tunnel header should
>> increment but not the second one. And finally, as a nice bonus, this
>> is what the GRO code has been expecting already, so you won't cause any
>> performance regressions with existing systems.
>
> Agreed. However, in the case of igb and ixgbe, incrementing the
> inner isn't really going to work well as it introduces a number of
> issues. I've considered not enabling GSO partial by default on those
> parts, but leaving it as an available feature. In the case of i40e we
> will not have any of those issues as both the inner and outer IP IDs
> will increment.
>
> The regressions issue shouldn't be all that big. In most cases
> tunnels are point-to-point links. Odds are that if someone is running a
> network they have similar devices on both ends. So in the case
> of the Intel drivers I have patched here, updating i40e isn't that big
> of a deal since it retains the original GSO behavior. In the case of
> ixgbe it didn't support any tunnel offloads, so there is no GRO support
> without checksums being enabled in the tunnel, and Tx checksum support
> via HW_CSUM has yet to be pushed upstream from Jeff's tree, so things
> are currently throttled by the Tx side.
>
> GSO partial is going to end up being very driver-specific in terms of
> what increments and what doesn't, as we are having to really hack our
> way around to make it do things the spec sheets say it cannot do.
> Some implementations are going to end up working out better than
> others in terms of interoperability and what can make use of other
> advanced features such as GRO on legacy systems.

I have to admit that I'm a little concerned about the fact that different drivers will end up doing different things as a result of this series. Generally speaking, it doesn't seem all that good for drivers to be setting policy for what packets look like.

That being said, if we are OK with IP IDs jumping around when DF is set as a result of the first patch, then this shouldn't really be a problem as the effect is similar. We just obviously need to make sure that we don't let a packet without DF and non-incrementing IP IDs escape out into the wild.

Anyways, I think you've already covered a lot of this ground in the other thread, so I don't think there is any real reason to rehash it here.
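[Not part of the thread: the "inherit DF from the inner header" idea and the "outer IDs fixed, inner IDs increment" policy discussed above can be sketched in a userspace model. This is a toy, not kernel code - `toy_iphdr`, `inherit_df` and `fill_ids` are hypothetical names, and the real `struct iphdr` keeps `frag_off` in network byte order.]

```c
#include <assert.h>
#include <stdint.h>

#define IP_DF 0x4000  /* don't-fragment flag position in the frag_off field */

/* Minimal stand-in for the two IPv4 header fields the policy touches;
 * host byte order in this toy model.
 */
struct toy_iphdr {
	uint16_t frag_off;
	uint16_t id;
};

/* Hypothetical policy: the outer header sets DF only when the inner
 * header had it set, instead of forcing DF on unconditionally.
 */
static void inherit_df(struct toy_iphdr *outer, const struct toy_iphdr *inner)
{
	if (inner->frag_off & IP_DF)
		outer->frag_off |= IP_DF;
	else
		outer->frag_off &= ~IP_DF;
}

/* Per the discussion: for segment 'seg' of a GSO-partial superframe the
 * inner IP ID increments (what GRO expects) while the outer header is a
 * verbatim copy, so its ID stays fixed.
 */
static void fill_ids(struct toy_iphdr *outer, struct toy_iphdr *inner,
		     uint16_t base_id, unsigned int seg)
{
	outer->id = base_id;                    /* fixed outer header */
	inner->id = (uint16_t)(base_id + seg);  /* incrementing inner ID */
}
```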
On Fri, Mar 18, 2016 at 4:25 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index edb7179bc051..666cf427898b 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2711,6 +2711,19 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
[...]
> +	/* Only report GSO partial support if it will enable us to
> +	 * support segmentation on this frame without needing additional
> +	 * work.
> +	 */
> +	if (features & NETIF_F_GSO_PARTIAL) {
> +		netdev_features_t partial_features;
> +		struct net_device *dev = skb->dev;
> +
> +		partial_features = dev->features & dev->gso_partial_features;
> +		if (!skb_gso_ok(skb, features | partial_features))
> +			features &= ~NETIF_F_GSO_PARTIAL;

I think we need to add NETIF_F_GSO_ROBUST into the skb_gso_ok() check - otherwise packets coming from VMs fail this test and we lose GSO partial. It's totally safe to expose this feature, since we'll compute gso_segs anyways.

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f044f970f1a6..bdcba77e164c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3281,6 +3291,15 @@ perform_csum_check:
> 	 */
> 	segs->prev = tail;
>
> +	/* Update GSO info on first skb in partial sequence. */
> +	if (partial_segs) {
> +		skb_shinfo(segs)->gso_size = mss / partial_segs;

One small thing: this gso_size is the same as the original MSS, right? It seems like we could trivially stick it in a local variable and avoid the extra division.

> +		skb_shinfo(segs)->gso_segs = partial_segs;
> +		skb_shinfo(segs)->gso_type = skb_shinfo(head_skb)->gso_type |
> +					     SKB_GSO_PARTIAL;

Since we're computing the gso_segs ourselves, it might be nice to strip out SKB_GSO_DODGY when we set the type.

I just wanted to say that this is really nice work - I was expecting it to turn out to be really messy and unmaintainable but this is very clean. Thanks!
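[Not part of the thread: the `mss / partial_segs` round-trip Jesse points out can be checked with plain arithmetic. A userspace model of the `skb_segment()` GSO-partial setup - `gso_partial_setup` is a hypothetical name, not a kernel function:]

```c
#include <assert.h>

/* Model of the GSO-partial math in skb_segment(): trim the payload to a
 * multiple of the MSS, segment on the enlarged "super-MSS", then write
 * back a gso_size recovered by dividing out partial_segs again - which,
 * as noted in the review, is just the original MSS.
 */
struct partial {
	unsigned int partial_segs; /* MSS-sized blocks per superframe */
	unsigned int super_mss;    /* mss * partial_segs, used to split */
	unsigned int gso_size;     /* super_mss / partial_segs == mss */
};

static struct partial gso_partial_setup(unsigned int len, unsigned int mss)
{
	struct partial p;

	p.partial_segs = len / mss;                /* as in skb_segment() */
	p.super_mss = mss * p.partial_segs;
	p.gso_size = p.super_mss / p.partial_segs; /* round-trips to mss */
	return p;
}
```

For a 64000-byte payload with a 1448-byte MSS this gives 44 inner segments, a 63712-byte superframe, and a gso_size of 1448 - hence the suggestion to keep the original MSS in a local variable instead of redoing the division.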
On Sun, Mar 27, 2016 at 10:36 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Fri, Mar 18, 2016 at 4:25 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index edb7179bc051..666cf427898b 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2711,6 +2711,19 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
> [...]
>> +	/* Only report GSO partial support if it will enable us to
>> +	 * support segmentation on this frame without needing additional
>> +	 * work.
>> +	 */
>> +	if (features & NETIF_F_GSO_PARTIAL) {
>> +		netdev_features_t partial_features;
>> +		struct net_device *dev = skb->dev;
>> +
>> +		partial_features = dev->features & dev->gso_partial_features;
>> +		if (!skb_gso_ok(skb, features | partial_features))
>> +			features &= ~NETIF_F_GSO_PARTIAL;
>
> I think we need to add NETIF_F_GSO_ROBUST into the skb_gso_ok() check
> - otherwise packets coming from VMs fail this test and we lose GSO
> partial. It's totally safe to expose this feature, since we'll compute
> gso_segs anyways.

Good point. I will update that before submitting for net-next.

>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index f044f970f1a6..bdcba77e164c 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -3281,6 +3291,15 @@ perform_csum_check:
>> 	 */
>> 	segs->prev = tail;
>>
>> +	/* Update GSO info on first skb in partial sequence. */
>> +	if (partial_segs) {
>> +		skb_shinfo(segs)->gso_size = mss / partial_segs;
>
> One small thing: this gso_size is the same as the original MSS, right?
> It seems like we could trivially stick it in a local variable and
> avoid the extra division.

Nice catch. I was a bit too in the mindset of having to use the same variable throughout.
>> +	skb_shinfo(segs)->gso_segs = partial_segs;
>> +	skb_shinfo(segs)->gso_type = skb_shinfo(head_skb)->gso_type |
>> +				     SKB_GSO_PARTIAL;
>
> Since we're computing the gso_segs ourselves, it might be nice to
> strip out SKB_GSO_DODGY when we set the type.

I will have that fixed for the version I submit for net-next.

> I just wanted to say that this is really nice work - I was expecting
> it to turn out to be really messy and unmaintainable but this is very
> clean. Thanks!

Thanks.

- Alex
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index a734bf43d190..8df3c5553af0 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -48,8 +48,12 @@ enum {
 	NETIF_F_GSO_UDP_TUNNEL_BIT,	/* ... UDP TUNNEL with TSO */
 	NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
 	NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
+	NETIF_F_GSO_PARTIAL_BIT,	/* ... Only segment inner-most L4
+					 *     in hardware and all other
+					 *     headers in software.
+					 */
 	/**/NETIF_F_GSO_LAST =		/* last bit, see GSO_MASK */
-		NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
+		NETIF_F_GSO_PARTIAL_BIT,

 	NETIF_F_FCOE_CRC_BIT,		/* FCoE CRC32 */
 	NETIF_F_SCTP_CRC_BIT,		/* SCTP checksum offload */
@@ -121,6 +125,7 @@ enum {
 #define NETIF_F_GSO_UDP_TUNNEL	__NETIF_F(GSO_UDP_TUNNEL)
 #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
 #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
+#define NETIF_F_GSO_PARTIAL	__NETIF_F(GSO_PARTIAL)
 #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
 #define NETIF_F_HW_VLAN_STAG_RX	__NETIF_F(HW_VLAN_STAG_RX)
 #define NETIF_F_HW_VLAN_STAG_TX	__NETIF_F(HW_VLAN_STAG_TX)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 31474d9d8a96..427d748ad8f9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1647,6 +1647,7 @@ struct net_device {
 	netdev_features_t vlan_features;
 	netdev_features_t hw_enc_features;
 	netdev_features_t mpls_features;
+	netdev_features_t gso_partial_features;

 	int ifindex;
 	int group;
@@ -4014,6 +4015,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
 	BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT));
+	BUILD_BUG_ON(SKB_GSO_PARTIAL != (NETIF_F_GSO_PARTIAL >> NETIF_F_GSO_SHIFT));

 	return (features & feature) == feature;
 }
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 15d0df943466..c291a282f8b6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -482,6 +482,8 @@ enum {
 	SKB_GSO_UDP_TUNNEL_CSUM = 1 << 11,

 	SKB_GSO_TUNNEL_REMCSUM = 1 << 12,
+
+	SKB_GSO_PARTIAL = 1 << 13,
 };

 #if BITS_PER_LONG > 32
@@ -3584,7 +3586,10 @@ static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
  * Keeps track of level of encapsulation of network headers.
  */
 struct skb_gso_cb {
-	int	mac_offset;
+	union {
+		int	mac_offset;
+		int	data_offset;
+	};
 	int	encap_level;
 	__wsum	csum;
 	__u16	csum_start;
diff --git a/net/core/dev.c b/net/core/dev.c
index edb7179bc051..666cf427898b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2711,6 +2711,19 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
 			return ERR_PTR(err);
 	}

+	/* Only report GSO partial support if it will enable us to
+	 * support segmentation on this frame without needing additional
+	 * work.
+	 */
+	if (features & NETIF_F_GSO_PARTIAL) {
+		netdev_features_t partial_features;
+		struct net_device *dev = skb->dev;
+
+		partial_features = dev->features & dev->gso_partial_features;
+		if (!skb_gso_ok(skb, features | partial_features))
+			features &= ~NETIF_F_GSO_PARTIAL;
+	}
+
 	BUILD_BUG_ON(SKB_SGO_CB_OFFSET +
 		     sizeof(*SKB_GSO_CB(skb)) > sizeof(skb->cb));

@@ -2841,6 +2854,14 @@ netdev_features_t netif_skb_features(struct sk_buff *skb)
 	if (skb->encapsulation)
 		features &= dev->hw_enc_features;

+	/* Support for GSO partial features requires software intervention
+	 * before we can actually process the packets so we need to strip
+	 * support for any partial features now and we can pull them back
+	 * in after we have partially segmented the frame.
+	 */
+	if (skb_is_gso(skb) && !(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL))
+		features &= ~dev->gso_partial_features;
+
 	if (skb_vlan_tagged(skb))
 		features = netdev_intersect_features(features,
 						     dev->vlan_features |
@@ -6702,6 +6723,14 @@ static netdev_features_t netdev_fix_features(struct net_device *dev,
 		}
 	}

+	/* GSO partial features require GSO partial be set */
+	if ((features & dev->gso_partial_features) &&
+	    !(features & NETIF_F_GSO_PARTIAL)) {
+		netdev_dbg(dev,
+			   "Dropping partially supported GSO features since no GSO partial.\n");
+		features &= ~dev->gso_partial_features;
+	}
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	if (dev->netdev_ops->ndo_busy_poll)
 		features |= NETIF_F_BUSY_POLL;
@@ -6982,7 +7011,7 @@ int register_netdevice(struct net_device *dev)

 	/* Make NETIF_F_SG inheritable to tunnel devices.
 	 */
-	dev->hw_enc_features |= NETIF_F_SG;
+	dev->hw_enc_features |= NETIF_F_SG | NETIF_F_GSO_PARTIAL;

 	/* Make NETIF_F_SG inheritable to MPLS.
 	 */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index b3c39d531469..d1b278c6c29f 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -88,6 +88,7 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
 	[NETIF_F_GSO_UDP_TUNNEL_BIT] =	 "tx-udp_tnl-segmentation",
 	[NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation",
 	[NETIF_F_GSO_TUNNEL_REMCSUM_BIT] = "tx-remcsum-segmentation",
+	[NETIF_F_GSO_PARTIAL_BIT] =	 "tx-gso-partial",

 	[NETIF_F_FCOE_CRC_BIT] =	 "tx-checksum-fcoe-crc",
 	[NETIF_F_SCTP_CRC_BIT] =	 "tx-checksum-sctp",
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f044f970f1a6..bdcba77e164c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3076,8 +3076,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	struct sk_buff *frag_skb = head_skb;
 	unsigned int offset = doffset;
 	unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
+	unsigned int partial_segs = 0;
 	unsigned int headroom;
-	unsigned int len;
+	unsigned int len = head_skb->len;
 	__be16 proto;
 	bool csum;
 	int sg = !!(features & NETIF_F_SG);
@@ -3094,6 +3095,15 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,

 	csum = !!can_checksum_protocol(features, proto);

+	/* GSO partial only requires that we trim off any excess that
+	 * doesn't fit into an MSS sized block, so take care of that
+	 * now.
+	 */
+	if (features & NETIF_F_GSO_PARTIAL) {
+		partial_segs = len / mss;
+		mss *= partial_segs;
+	}
+
 	headroom = skb_headroom(head_skb);
 	pos = skb_headlen(head_skb);

@@ -3281,6 +3291,15 @@ perform_csum_check:
 	 */
 	segs->prev = tail;

+	/* Update GSO info on first skb in partial sequence. */
+	if (partial_segs) {
+		skb_shinfo(segs)->gso_size = mss / partial_segs;
+		skb_shinfo(segs)->gso_segs = partial_segs;
+		skb_shinfo(segs)->gso_type = skb_shinfo(head_skb)->gso_type |
+					     SKB_GSO_PARTIAL;
+		SKB_GSO_CB(segs)->data_offset = skb_headroom(segs) + doffset;
+	}
+
 	/* Following permits correct backpressure, for protocols
 	 * using skb_set_owner_w().
 	 * Idea is to tranfert ownership from head_skb to last segment.
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 5e3885672907..d091f91fa25d 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1199,7 +1199,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 	unsigned int offset = 0;
 	bool udpfrag, encap;
 	struct iphdr *iph;
-	int proto;
+	int proto, tot_len;
 	int nhoff;
 	int ihl;
 	int id;
@@ -1269,10 +1269,18 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 			if (skb->next)
 				iph->frag_off |= htons(IP_MF);
 			offset += skb->len - nhoff - ihl;
+			tot_len = skb->len - nhoff;
+		} else if (skb_is_gso(skb)) {
+			iph->id = htons(id);
+			id += skb_shinfo(skb)->gso_segs;
+			tot_len = skb_shinfo(skb)->gso_size +
+				  SKB_GSO_CB(skb)->data_offset +
+				  skb->head - (unsigned char *)iph;
 		} else {
 			iph->id = htons(id++);
+			tot_len = skb->len - nhoff;
 		}
-		iph->tot_len = htons(skb->len - nhoff);
+		iph->tot_len = htons(tot_len);
 		ip_send_check(iph);
 		if (encap)
 			skb_reset_inner_headers(skb);
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index 7ea14ced9222..dea0390d65bb 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -85,7 +85,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
 	skb = segs;
 	do {
 		struct gre_base_hdr *greh;
-		__be32 *pcsum;
+		__sum16 *pcsum;

 		/* Set up inner headers if we are offloading inner checksum */
 		if (skb->ip_summed == CHECKSUM_PARTIAL) {
@@ -105,10 +105,25 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
 			continue;

 		greh = (struct gre_base_hdr *)skb_transport_header(skb);
-		pcsum = (__be32 *)(greh + 1);
+		pcsum = (__sum16 *)(greh + 1);
+
+		if (skb_is_gso(skb)) {
+			unsigned int partial_adj;
+
+			/* Adjust checksum to account for the fact that
+			 * the partial checksum is based on actual size
+			 * whereas headers should be based on MSS size.
+			 */
+			partial_adj = skb->len + skb_headroom(skb) -
+				      SKB_GSO_CB(skb)->data_offset -
+				      skb_shinfo(skb)->gso_size;
+			*pcsum = ~csum_fold((__force __wsum)htonl(partial_adj));
+		} else {
+			*pcsum = 0;
+		}

-		*pcsum = 0;
-		*(__sum16 *)pcsum = gso_make_checksum(skb, 0);
+		*(pcsum + 1) = 0;
+		*pcsum = gso_make_checksum(skb, 0);
 	} while ((skb = skb->next));
 out:
 	return segs;
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 1a2e9957c177..4e9b8f011473 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -107,6 +107,12 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 		goto out;
 	}

+	/* GSO partial only requires splitting the frame into an MSS
+	 * multiple and possibly a remainder. So update the mss now.
+	 */
+	if (features & NETIF_F_GSO_PARTIAL)
+		mss = skb->len - (skb->len % mss);
+
 	copy_destructor = gso_skb->destructor == tcp_wfree;
 	ooo_okay = gso_skb->ooo_okay;
 	/* All segments but the first should have ooo_okay cleared */
@@ -131,7 +137,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 	newcheck = ~csum_fold((__force __wsum)((__force u32)th->check +
 					       (__force u32)delta));

-	do {
+	while (skb->next) {
 		th->fin = th->psh = 0;
 		th->check = newcheck;

@@ -151,7 +157,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,

 		th->seq = htonl(seq);
 		th->cwr = 0;
-	} while (skb->next);
+	}

 	/* Following permits TCP Small Queues to work well with GSO :
 	 * The callback to TCP stack will be called at the time last frag
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 8a3405a80260..5fcb93269afb 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -100,7 +100,7 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
 	udp_offset = outer_hlen - tnl_hlen;
 	skb = segs;
 	do {
-		__be16 len;
+		unsigned int len;

 		if (remcsum)
 			skb->ip_summed = CHECKSUM_NONE;
@@ -118,14 +118,26 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
 		skb_reset_mac_header(skb);
 		skb_set_network_header(skb, mac_len);
 		skb_set_transport_header(skb, udp_offset);
-		len = htons(skb->len - udp_offset);
+		len = skb->len - udp_offset;
 		uh = udp_hdr(skb);
-		uh->len = len;
+
+		/* If we are only performing partial GSO the inner header
+		 * will be using a length value equal to only one MSS sized
+		 * segment instead of the entire frame.
+		 */
+		if (skb_is_gso(skb)) {
+			uh->len = htons(skb_shinfo(skb)->gso_size +
+					SKB_GSO_CB(skb)->data_offset +
+					skb->head - (unsigned char *)uh);
+		} else {
+			uh->len = htons(len);
+		}

 		if (!need_csum)
 			continue;

-		uh->check = ~csum_fold(csum_add(partial, (__force __wsum)len));
+		uh->check = ~csum_fold(csum_add(partial,
+						(__force __wsum)htonl(len)));

 		if (skb->encapsulation || !offload_csum) {
 			uh->check = gso_make_checksum(skb, ~uh->check);
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index eeca943f12dc..d467053c226a 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -63,6 +63,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 	int proto;
 	struct frag_hdr *fptr;
 	unsigned int unfrag_ip6hlen;
+	unsigned int payload_len;
 	u8 *prevhdr;
 	int offset = 0;
 	bool encap, udpfrag;
@@ -117,7 +118,13 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,

 	for (skb = segs; skb; skb = skb->next) {
 		ipv6h = (struct ipv6hdr *)(skb_mac_header(skb) + nhoff);
-		ipv6h->payload_len = htons(skb->len - nhoff - sizeof(*ipv6h));
+		if (skb_is_gso(skb))
+			payload_len = skb_shinfo(skb)->gso_size +
+				      SKB_GSO_CB(skb)->data_offset +
+				      skb->head - (unsigned char *)(ipv6h + 1);
+		else
+			payload_len = skb->len - nhoff - sizeof(*ipv6h);
+		ipv6h->payload_len = htons(payload_len);
 		skb->network_header = (u8 *)ipv6h - skb->head;

 		if (udpfrag) {
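[Not part of the patch: the `partial_adj` seeding in gre_offload.c above relies on one's-complement arithmetic, where adding the bitwise complement of a value subtracts it. A userspace model of the kernel's `csum_add()`/`csum_fold()` helpers - byte-order handling is omitted, `swap_len` is a hypothetical name, and the sample values in the usage note are made up:]

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit one's-complement accumulate with end-around carry,
 * mirroring the kernel's csum_add().
 */
static uint32_t csum_add(uint32_t a, uint32_t b)
{
	uint64_t s = (uint64_t)a + b;

	return (uint32_t)((s & 0xffffffffu) + (s >> 32));
}

/* Fold a 32-bit partial sum down to 16 bits and complement it,
 * mirroring csum_fold().
 */
static uint16_t csum_fold(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* A checksum accumulated with the actual superframe length can be
 * rewritten to one based on the smaller per-MSS header length without
 * re-summing the payload: adding ~actual_len cancels the old length,
 * then hdr_len is added in its place.
 */
static uint16_t swap_len(uint32_t partial, uint32_t actual_len,
			 uint32_t hdr_len)
{
	return csum_fold(csum_add(csum_add(partial, ~actual_len), hdr_len));
}
```

With an arbitrary partial sum, swapping a 9000-byte actual length for a 1448-byte per-MSS length this way yields the same folded checksum as summing with 1448 directly, which is the property the `~csum_fold(htonl(partial_adj))` seed in the patch depends on.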
This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support a broader range of devices for
segmentation if we use fixed outer headers and have the hardware only
really deal with segmenting the inner header. The idea behind the naming
is due to the fact that everything before csum_start will be fixed headers,
and everything after will be the region that is handled by hardware.

With the current implementation it allows us to add support for the
following GSO types with an inner TSO or TSO6 offload:
	NETIF_F_GSO_GRE
	NETIF_F_GSO_GRE_CSUM
	NETIF_F_UDP_TUNNEL
	NETIF_F_UDP_TUNNEL_CSUM

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 include/linux/netdev_features.h |    7 ++++++-
 include/linux/netdevice.h       |    2 ++
 include/linux/skbuff.h          |    7 ++++++-
 net/core/dev.c                  |   31 ++++++++++++++++++++++++++++++-
 net/core/ethtool.c              |    1 +
 net/core/skbuff.c               |   21 ++++++++++++++++++++-
 net/ipv4/af_inet.c              |   12 ++++++++++--
 net/ipv4/gre_offload.c          |   23 +++++++++++++++++++----
 net/ipv4/tcp_offload.c          |   10 ++++++++--
 net/ipv4/udp_offload.c          |   20 ++++++++++++++++----
 net/ipv6/ip6_offload.c          |    9 ++++++++-

 11 files changed, 126 insertions(+), 17 deletions(-)
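[Not part of the submission: the `gso_size + data_offset + skb->head - (unsigned char *)uh` expression used for the per-MSS header lengths in udp_offload.c, af_inet.c and ip6_offload.c reduces to plain offset arithmetic once everything is measured from `skb->head`. A flat-buffer model - `partial_udp_len` is a hypothetical name, and the sample offsets are made up:]

```c
#include <assert.h>

/* Per-MSS length written into the outer UDP header of a GSO-partial
 * superframe: the headers between the UDP header and the start of the
 * payload (data_offset - udp_offset bytes) plus one MSS of data.
 * Both offsets are measured from the start of the buffer (skb->head),
 * so the headroom term in the patch cancels out.
 */
static unsigned int partial_udp_len(unsigned int gso_size,
				    unsigned int data_offset,
				    unsigned int udp_offset)
{
	return gso_size + (data_offset - udp_offset);
}
```

For example, with the outer UDP header at offset 34, the inner payload starting at offset 104, and a 1448-byte MSS, each superframe carries a UDP length of 1518 - the value one on-the-wire MSS-sized segment would have, rather than the length of the whole superframe.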