Message ID | alpine.DEB.2.02.1401092051020.25554@tomh.mtv.corp.google.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
From: Tom Herbert <therbert@google.com> Date: Thu, 9 Jan 2014 20:54:09 -0800 (PST) > When initializing a gro_list for a packet, first check the rxhash of > the incoming skb against that of the skb's in the list. This should be > a very strong inidicator of whether the flow is going to be matched, > and potentially allows a lot of other checks to be short circuited. > > Signed-off-by: Tom Herbert <therbert@google.com> This is not appropriate for 'net', so I assume you mean to target this for 'net-next', right? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2014-01-09 at 20:54 -0800, Tom Herbert wrote: > When initializing a gro_list for a packet, first check the rxhash of > the incoming skb against that of the skb's in the list. This should be > a very strong inidicator of whether the flow is going to be matched, > and potentially allows a lot of other checks to be short circuited. > Hmm... this idea was discussed in the past. I used it when attempting to use a hash table instead of a gro_list last year. Unfortunately this added lot of cycles when rxhash is not provided by hardware, and my tests found it was not a win. Remember : in most cases, gro_list contains one flow, so this test does nothing special but adds overhead. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2014-01-09 at 21:38 -0800, Eric Dumazet wrote: > Hmm... this idea was discussed in the past. I used it when attempting to > use a hash table instead of a gro_list last year. > > Unfortunately this added lot of cycles when rxhash is not provided by > hardware, and my tests found it was not a win. > > Remember : in most cases, gro_list contains one flow, so this test does > nothing special but adds overhead. Additional point : For napi_gro_frag() users, skb->head do not contains headers, so flow dissector enters slow path to fetch the headers from the fragment, one by one. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 9, 2014 at 9:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Thu, 2014-01-09 at 20:54 -0800, Tom Herbert wrote: >> When initializing a gro_list for a packet, first check the rxhash of >> the incoming skb against that of the skb's in the list. This should be >> a very strong inidicator of whether the flow is going to be matched, >> and potentially allows a lot of other checks to be short circuited. >> > > Hmm... this idea was discussed in the past. I used it when attempting to > use a hash table instead of a gro_list last year. > > Unfortunately this added lot of cycles when rxhash is not provided by > hardware, and my tests found it was not a win. > > Remember : in most cases, gro_list contains one flow, so this test does > nothing special but adds overhead. I don't understand what your basis is that gro_list in most cases contains one flow, but assuming that were true, maybe we should make the it only contain one flow eliminating the complexity of multiple flows (same_flow logic is convoluted and layers of encapsulation is not going to simplify things). If we are doing RPS or RFS, rxhash will be computed anyway, so the case your optimizing is pretty narrow: no RPS, no RFS, no hardware hash, and a single flow in gro_list. Nevertheless, if this is really an important concern, we can make the check directly against skb->rxhash so to be opportunistic and avoid the possibility of computation. > > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 9, 2014 at 9:43 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Thu, 2014-01-09 at 21:38 -0800, Eric Dumazet wrote: > >> Hmm... this idea was discussed in the past. I used it when attempting to >> use a hash table instead of a gro_list last year. >> >> Unfortunately this added lot of cycles when rxhash is not provided by >> hardware, and my tests found it was not a win. >> >> Remember : in most cases, gro_list contains one flow, so this test does >> nothing special but adds overhead. > > Additional point : For napi_gro_frag() users, skb->head do not contains > headers, so flow dissector enters slow path to fetch the headers from > the fragment, one by one. > skb_get_hash is only called for the new skb, we compare against skb->rxhash for packets on the gro_list. Also, I still intend to change skb_get_hash to not repeatedly compute the hash when there's no L4 hash to be found. > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Tom Herbert <therbert@google.com> Date: Fri, 10 Jan 2014 08:27:20 -0800 > On Thu, Jan 9, 2014 at 9:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> On Thu, 2014-01-09 at 20:54 -0800, Tom Herbert wrote: >>> When initializing a gro_list for a packet, first check the rxhash of >>> the incoming skb against that of the skb's in the list. This should be >>> a very strong inidicator of whether the flow is going to be matched, >>> and potentially allows a lot of other checks to be short circuited. >>> >> >> Hmm... this idea was discussed in the past. I used it when attempting to >> use a hash table instead of a gro_list last year. >> >> Unfortunately this added lot of cycles when rxhash is not provided by >> hardware, and my tests found it was not a win. >> >> Remember : in most cases, gro_list contains one flow, so this test does >> nothing special but adds overhead. > > I don't understand what your basis is that gro_list in most cases > contains one flow It also doesn't jive well with Eric's recent patch to adjust the GRO overflow strategy (600adc18eba823f9fd8ed5fec8b04f11dddf3884 ("net: gro: change GRO overflow strategy")) :-) I sort of like Tom's idea to optimistically compare the hash, if we do in fact have one already. Eric would the change be OK if Tom did it that way? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 13, 2014 at 11:59 AM, David Miller <davem@davemloft.net> wrote: > From: Tom Herbert <therbert@google.com> > Date: Fri, 10 Jan 2014 08:27:20 -0800 > >> On Thu, Jan 9, 2014 at 9:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >>> On Thu, 2014-01-09 at 20:54 -0800, Tom Herbert wrote: >>>> When initializing a gro_list for a packet, first check the rxhash of >>>> the incoming skb against that of the skb's in the list. This should be >>>> a very strong inidicator of whether the flow is going to be matched, >>>> and potentially allows a lot of other checks to be short circuited. >>>> >>> >>> Hmm... this idea was discussed in the past. I used it when attempting to >>> use a hash table instead of a gro_list last year. >>> >>> Unfortunately this added lot of cycles when rxhash is not provided by >>> hardware, and my tests found it was not a win. >>> >>> Remember : in most cases, gro_list contains one flow, so this test does >>> nothing special but adds overhead. >> >> I don't understand what your basis is that gro_list in most cases >> contains one flow > > It also doesn't jive well with Eric's recent patch to adjust the GRO > overflow strategy (600adc18eba823f9fd8ed5fec8b04f11dddf3884 ("net: > gro: change GRO overflow strategy")) > > :-) > > I sort of like Tom's idea to optimistically compare the hash, if we > do in fact have one already. > > Eric would the change be OK if Tom did it that way? btw, I'm also looking at "if ((a ^ b) | (c ^ d)...)" versus "if ((a != b) || (c != d)...)". There's a pretty small number of functions that use this trick. In isolating one for testing, I really don't see the ^ | method as being much of win, even with a modest amount of branch correct branch prediction || can be better, if we can get mostly correct branch prediction it can be significantly better. Before I fix this, is there any background I should know about? :-) Tom -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2014-01-13 at 11:59 -0800, David Miller wrote: > It also doesn't jive well with Eric's recent patch to adjust the GRO > overflow strategy (600adc18eba823f9fd8ed5fec8b04f11dddf3884 ("net: > gro: change GRO overflow strategy")) > > :-) > > I sort of like Tom's idea to optimistically compare the hash, if we > do in fact have one already. > > Eric would the change be OK if Tom did it that way? > -- Yes this is what I suggested, but it seems Tom had something different in mind. I would rather not call flow dissector from GRO, especially considering nobody but Google uses RPS/RFS (otherwise CVE-2013-4348 would have been discovered much sooner) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2014-01-13 at 12:17 -0800, Eric Dumazet wrote: > On Mon, 2014-01-13 at 11:59 -0800, David Miller wrote: > > > It also doesn't jive well with Eric's recent patch to adjust the GRO > > overflow strategy (600adc18eba823f9fd8ed5fec8b04f11dddf3884 ("net: > > gro: change GRO overflow strategy")) > > > > :-) > > > > I sort of like Tom's idea to optimistically compare the hash, if we > > do in fact have one already. > > > > Eric would the change be OK if Tom did it that way? > > -- > > Yes this is what I suggested, but it seems Tom had something different > in mind. > > I would rather not call flow dissector from GRO, especially considering > nobody but Google uses RPS/RFS (otherwise CVE-2013-4348 would have been > discovered much sooner) According to the original report of that vulnerability: > skb_flow_dissect() were used by several places: > - packet scheduler that want classify flows > - skb_get_rxhash() that will be used by RPS, vxlan, multiqueue > tap,macvtap packet fanout > - skb_probe_transport_header() which was used for probing transport > header for DODGY packets > - __skb_get_poff() which will be used by socket filter So flow dissector is already part of the attack surface for both local and remote users in common configurations. Ben.
On Mon, Jan 13, 2014 at 12:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Mon, 2014-01-13 at 11:59 -0800, David Miller wrote: > >> It also doesn't jive well with Eric's recent patch to adjust the GRO >> overflow strategy (600adc18eba823f9fd8ed5fec8b04f11dddf3884 ("net: >> gro: change GRO overflow strategy")) >> >> :-) >> >> I sort of like Tom's idea to optimistically compare the hash, if we >> do in fact have one already. >> >> Eric would the change be OK if Tom did it that way? >> -- > > Yes this is what I suggested, but it seems Tom had something different > in mind. > I will add skb_get_hash_noeval > I would rather not call flow dissector from GRO, especially considering > nobody but Google uses RPS/RFS (otherwise CVE-2013-4348 would have been > discovered much sooner) Or maybe nobody uses IPIP. ;-) > > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2014-01-13 at 13:24 -0800, Tom Herbert wrote: > > Or maybe nobody uses IPIP. ;-) The bug happened even without IPIP being used on the node. Changing one byte from 0x45 to 0x40 was enough to trigger it. Since we do not perform header checksum in flow dissector, normal corrupted traffic on the Internet would have a chance to trigger the infinite loop. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2014-01-13 at 20:50 +0000, Ben Hutchings wrote: > According to the original report of that vulnerability: > > > skb_flow_dissect() were used by several places: > > - packet scheduler that want classify flows > > - skb_get_rxhash() that will be used by RPS, vxlan, multiqueue > > tap,macvtap packet fanout > > - skb_probe_transport_header() which was used for probing transport > > header for DODGY packets > > - __skb_get_poff() which will be used by socket filter > > So flow dissector is already part of the attack surface for both local > and remote users in common configurations. Take a debian or Android distro, and this bug is not hit on 'common configurations'. Send them a 'packet of death', they will not hang, unless some admin set up RPS/RFS on the incoming device. Anyway, I see no point pushing this schem (flow dissect all incoming packets). A router has no gain going to L4 header, but still might need GRO. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2014-01-13 at 13:37 -0800, Eric Dumazet wrote: > On Mon, 2014-01-13 at 20:50 +0000, Ben Hutchings wrote: > > > According to the original report of that vulnerability: > > > > > skb_flow_dissect() were used by several places: > > > - packet scheduler that want classify flows > > > - skb_get_rxhash() that will be used by RPS, vxlan, multiqueue > > > tap,macvtap packet fanout > > > - skb_probe_transport_header() which was used for probing transport > > > header for DODGY packets > > > - __skb_get_poff() which will be used by socket filter > > > > So flow dissector is already part of the attack surface for both local > > and remote users in common configurations. > > Take a debian or Android distro, and this bug is not hit on 'common > configurations'. Send them a 'packet of death', they will not hang, > unless some admin set up RPS/RFS on the incoming device. [...] When I investigated the scope of this for Debian, I tried sending a 'packet of death' to a VM and actually triggered the lockup in the TX path of the *host*, running Debian unstable with Linux 3.11. I didn't track down exactly why that was but I think that libvirt's default networking configuration includes multiqueue devices that use flow dissector. Ben.
On Wed, 2014-01-15 at 01:31 +0000, Ben Hutchings wrote: > When I investigated the scope of this for Debian, I tried sending a > 'packet of death' to a VM and actually triggered the lockup in the TX > path of the *host*, running Debian unstable with Linux 3.11. I didn't > track down exactly why that was but I think that libvirt's default > networking configuration includes multiqueue devices that use flow > dissector. OK, I take that majority of debian hosts are running some VM then, nice to know, time to update my hosts and usages I guess. Anyway, current flow dissector needs care if we really use it in unprotected areas. Hostile packets can force flow dissection of MTU bytes, bringing host to abysmal performance. Once we fix all the issues, we'll see how expensive it is and if it really can help. Last year, my experiments were exactly using it in GRO, to have a hash table instead of a single gro_list, unfortunately this added a latency regression that I found not acceptable at that time. With TSO auto sizing, I might need to revisit the idea anyway... -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2014-01-14 at 18:27 -0800, Eric Dumazet wrote: > On Wed, 2014-01-15 at 01:31 +0000, Ben Hutchings wrote: > > > When I investigated the scope of this for Debian, I tried sending a > > 'packet of death' to a VM and actually triggered the lockup in the TX > > path of the *host*, running Debian unstable with Linux 3.11. I didn't > > track down exactly why that was but I think that libvirt's default > > networking configuration includes multiqueue devices that use flow > > dissector. > > OK, I take that majority of debian hosts are running some VM then, > nice to know, time to update my hosts and usages I guess. I said that common configurations use flow dissector - not the majority. > Anyway, current flow dissector needs care if we really use it in > unprotected areas. > > Hostile packets can force flow dissection of MTU bytes, > bringing host to abysmal performance. [...] Yes. I think the number of times it can loop should be limited to some small number. Ben.
diff --git a/net/core/dev.c b/net/core/dev.c index ce01847..88bf426 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3794,10 +3794,14 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb) { struct sk_buff *p; unsigned int maclen = skb->dev->hard_header_len; + u32 hash = skb_get_hash(skb); for (p = napi->gro_list; p; p = p->next) { unsigned long diffs; + if ((diffs = (hash != p->rxhash))) + goto skip; + diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev; diffs |= p->vlan_tci ^ skb->vlan_tci; if (maclen == ETH_HLEN) @@ -3807,6 +3811,7 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb) diffs = memcmp(skb_mac_header(p), skb_gro_mac_header(skb), maclen); +skip: NAPI_GRO_CB(p)->same_flow = !diffs; NAPI_GRO_CB(p)->flush = 0; }
When initializing a gro_list for a packet, first check the rxhash of the incoming skb against that of the skb's in the list. This should be a very strong inidicator of whether the flow is going to be matched, and potentially allows a lot of other checks to be short circuited. Signed-off-by: Tom Herbert <therbert@google.com> --- net/core/dev.c | 5 +++++ 1 file changed, 5 insertions(+)