net: Check skb->rxhash in gro_receive

Message ID alpine.DEB.2.02.1401092051020.25554@tomh.mtv.corp.google.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Tom Herbert Jan. 10, 2014, 4:54 a.m. UTC
When initializing a gro_list for a packet, first check the rxhash of
the incoming skb against that of the skbs in the list. This should be
a very strong indicator of whether the flow is going to be matched,
and potentially allows a lot of other checks to be short-circuited.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 net/core/dev.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

David Miller Jan. 10, 2014, 5 a.m. UTC | #1
From: Tom Herbert <therbert@google.com>
Date: Thu, 9 Jan 2014 20:54:09 -0800 (PST)

> When initializing a gro_list for a packet, first check the rxhash of
> the incoming skb against that of the skbs in the list. This should be
> a very strong indicator of whether the flow is going to be matched,
> and potentially allows a lot of other checks to be short-circuited.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>

This is not appropriate for 'net', so I assume you mean to target
this for 'net-next', right?
Eric Dumazet Jan. 10, 2014, 5:38 a.m. UTC | #2
On Thu, 2014-01-09 at 20:54 -0800, Tom Herbert wrote:
> When initializing a gro_list for a packet, first check the rxhash of
> the incoming skb against that of the skbs in the list. This should be
> a very strong indicator of whether the flow is going to be matched,
> and potentially allows a lot of other checks to be short-circuited.
> 

Hmm... this idea was discussed in the past. I used it when attempting to
use a hash table instead of a gro_list last year.

Unfortunately this added a lot of cycles when the rxhash is not provided by
hardware, and my tests found it was not a win.

Remember : in most cases, gro_list contains one flow, so this test does
nothing special but adds overhead.



Eric Dumazet Jan. 10, 2014, 5:43 a.m. UTC | #3
On Thu, 2014-01-09 at 21:38 -0800, Eric Dumazet wrote:

> Hmm... this idea was discussed in the past. I used it when attempting to
> use a hash table instead of a gro_list last year.
> 
> Unfortunately this added a lot of cycles when the rxhash is not provided by
> hardware, and my tests found it was not a win.
> 
> Remember : in most cases, gro_list contains one flow, so this test does
> nothing special but adds overhead.

Additional point : for napi_gro_frags() users, skb->head does not contain the
headers, so the flow dissector enters a slow path to fetch the headers from
the fragments, one by one.



Tom Herbert Jan. 10, 2014, 4:27 p.m. UTC | #4
On Thu, Jan 9, 2014 at 9:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2014-01-09 at 20:54 -0800, Tom Herbert wrote:
>> When initializing a gro_list for a packet, first check the rxhash of
>> the incoming skb against that of the skbs in the list. This should be
>> a very strong indicator of whether the flow is going to be matched,
>> and potentially allows a lot of other checks to be short-circuited.
>>
>
> Hmm... this idea was discussed in the past. I used it when attempting to
> use a hash table instead of a gro_list last year.
>
> Unfortunately this added a lot of cycles when the rxhash is not provided by
> hardware, and my tests found it was not a win.
>
> Remember : in most cases, gro_list contains one flow, so this test does
> nothing special but adds overhead.

I don't understand what your basis is that gro_list in most cases
contains one flow, but assuming that were true, maybe we should make
it only contain one flow, eliminating the complexity of multiple
flows (the same_flow logic is convoluted, and layers of encapsulation are
not going to simplify things).

If we are doing RPS or RFS, the rxhash will be computed anyway, so the
case you're optimizing is pretty narrow: no RPS, no RFS, no hardware
hash, and a single flow in the gro_list. Nevertheless, if this is really
an important concern, we can make the check directly against
skb->rxhash so as to be opportunistic and avoid the possibility of
computing it.
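
For illustration, a minimal sketch of that opportunistic variant (an assumption
about how it could look, not the posted patch), reusing the gro_list_prepare()
loop shown in the diff at the end of this page and comparing skb->rxhash
directly, so that skb_get_hash() and the flow dissector are never invoked on
this path:

	for (p = napi->gro_list; p; p = p->next) {
		unsigned long diffs;

		/* Only trust the hashes when both skbs already carry one,
		 * e.g. a hash provided by the NIC or set earlier by RPS/RFS;
		 * never compute one here. */
		if (skb->rxhash && p->rxhash && skb->rxhash != p->rxhash) {
			NAPI_GRO_CB(p)->same_flow = 0;
			NAPI_GRO_CB(p)->flush = 0;
			continue;
		}

		/* ... the existing dev/vlan/MAC checks from the diff below
		 * run unchanged when the hashes match or are unavailable ... */
	}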

>
>
>
Tom Herbert Jan. 10, 2014, 4:47 p.m. UTC | #5
On Thu, Jan 9, 2014 at 9:43 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2014-01-09 at 21:38 -0800, Eric Dumazet wrote:
>
>> Hmm... this idea was discussed in the past. I used it when attempting to
>> use a hash table instead of a gro_list last year.
>>
>> Unfortunately this added a lot of cycles when the rxhash is not provided by
>> hardware, and my tests found it was not a win.
>>
>> Remember : in most cases, gro_list contains one flow, so this test does
>> nothing special but adds overhead.
>
> Additional point : for napi_gro_frags() users, skb->head does not contain the
> headers, so the flow dissector enters a slow path to fetch the headers from
> the fragments, one by one.
>
skb_get_hash is only called for the new skb; we compare against
skb->rxhash for packets already on the gro_list. Also, I still intend to
change skb_get_hash to not repeatedly compute the hash when there's no
L4 hash to be found.
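
A rough sketch of that last idea (assumptions only, not an existing patch):
record on the skb that a software hash has already been attempted, so later
calls return immediately. 'sw_rxhash' is a hypothetical flag, and the internal
helper that runs the flow dissector is assumed to be __skb_get_hash():

	static inline __u32 skb_get_hash(struct sk_buff *skb)
	{
		/* Skip recomputation if a hash was already computed, either
		 * an L4 hash or a previous software attempt (hypothetical
		 * sw_rxhash flag, set by __skb_get_hash()). */
		if (!skb->l4_rxhash && !skb->sw_rxhash)
			__skb_get_hash(skb);

		return skb->rxhash;
	}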

>
>
David Miller Jan. 13, 2014, 7:59 p.m. UTC | #6
From: Tom Herbert <therbert@google.com>
Date: Fri, 10 Jan 2014 08:27:20 -0800

> On Thu, Jan 9, 2014 at 9:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Thu, 2014-01-09 at 20:54 -0800, Tom Herbert wrote:
>>> When initializing a gro_list for a packet, first check the rxhash of
>>> the incoming skb against that of the skbs in the list. This should be
>>> a very strong indicator of whether the flow is going to be matched,
>>> and potentially allows a lot of other checks to be short-circuited.
>>>
>>
>> Hmm... this idea was discussed in the past. I used it when attempting to
>> use a hash table instead of a gro_list last year.
>>
>> Unfortunately this added a lot of cycles when the rxhash is not provided by
>> hardware, and my tests found it was not a win.
>>
>> Remember : in most cases, gro_list contains one flow, so this test does
>> nothing special but adds overhead.
> 
> I don't understand what your basis is that gro_list in most cases
> contains one flow

It also doesn't jive well with Eric's recent patch to adjust the GRO
overflow strategy (600adc18eba823f9fd8ed5fec8b04f11dddf3884 ("net:
gro: change GRO overflow strategy"))

:-)

I sort of like Tom's idea to optimistically compare the hash, if we
do in fact have one already.

Eric would the change be OK if Tom did it that way?
Tom Herbert Jan. 13, 2014, 8:13 p.m. UTC | #7
On Mon, Jan 13, 2014 at 11:59 AM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Fri, 10 Jan 2014 08:27:20 -0800
>
>> On Thu, Jan 9, 2014 at 9:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> On Thu, 2014-01-09 at 20:54 -0800, Tom Herbert wrote:
>>>> When initializing a gro_list for a packet, first check the rxhash of
>>>> the incoming skb against that of the skbs in the list. This should be
>>>> a very strong indicator of whether the flow is going to be matched,
>>>> and potentially allows a lot of other checks to be short-circuited.
>>>>
>>>
>>> Hmm... this idea was discussed in the past. I used it when attempting to
>>> use a hash table instead of a gro_list last year.
>>>
>>> Unfortunately this added a lot of cycles when the rxhash is not provided by
>>> hardware, and my tests found it was not a win.
>>>
>>> Remember : in most cases, gro_list contains one flow, so this test does
>>> nothing special but adds overhead.
>>
>> I don't understand what your basis is that gro_list in most cases
>> contains one flow
>
> It also doesn't jive well with Eric's recent patch to adjust the GRO
> overflow strategy (600adc18eba823f9fd8ed5fec8b04f11dddf3884 ("net:
> gro: change GRO overflow strategy"))
>
> :-)
>
> I sort of like Tom's idea to optimistically compare the hash, if we
> do in fact have one already.
>
> Eric would the change be OK if Tom did it that way?

btw, I'm also looking at "if ((a ^ b) | (c ^ d)...)" versus "if ((a !=
b) || (c != d)...)". There's a pretty small number of functions that
use this trick. In isolating one for testing, I really don't see the
^/| method as being much of a win: even with a modest amount of correct
branch prediction || can be better, and if we can get mostly correct
branch prediction it can be significantly better. Before I fix this, is
there any background I should know about? :-)
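
For readers unfamiliar with the trick being compared, a minimal standalone
illustration (hypothetical helpers, not kernel code): the first form folds
every comparison into arithmetic and branches once on the accumulated result,
while the second short-circuits, so a well-predicted first comparison skips
the rest entirely:

	#include <stdbool.h>

	/* Branch-free style: OR together the XOR differences, test once. */
	static bool same_flow_xor(unsigned long a, unsigned long b,
				  unsigned long c, unsigned long d)
	{
		return ((a ^ b) | (c ^ d)) == 0;
	}

	/* Short-circuit style: each comparison is its own branch, which the
	 * CPU can predict independently. */
	static bool same_flow_cmp(unsigned long a, unsigned long b,
				  unsigned long c, unsigned long d)
	{
		return a == b && c == d;
	}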

Tom
Eric Dumazet Jan. 13, 2014, 8:17 p.m. UTC | #8
On Mon, 2014-01-13 at 11:59 -0800, David Miller wrote:

> It also doesn't jive well with Eric's recent patch to adjust the GRO
> overflow strategy (600adc18eba823f9fd8ed5fec8b04f11dddf3884 ("net:
> gro: change GRO overflow strategy"))
> 
> :-)
> 
> I sort of like Tom's idea to optimistically compare the hash, if we
> do in fact have one already.
> 
> Eric would the change be OK if Tom did it that way?
> --

Yes this is what I suggested, but it seems Tom had something different
in mind.

I would rather not call flow dissector from GRO, especially considering
nobody but Google uses RPS/RFS (otherwise CVE-2013-4348 would have been
discovered much sooner)



Ben Hutchings Jan. 13, 2014, 8:50 p.m. UTC | #9
On Mon, 2014-01-13 at 12:17 -0800, Eric Dumazet wrote:
> On Mon, 2014-01-13 at 11:59 -0800, David Miller wrote:
> 
> > It also doesn't jive well with Eric's recent patch to adjust the GRO
> > overflow strategy (600adc18eba823f9fd8ed5fec8b04f11dddf3884 ("net:
> > gro: change GRO overflow strategy"))
> > 
> > :-)
> > 
> > I sort of like Tom's idea to optimistically compare the hash, if we
> > do in fact have one already.
> > 
> > Eric would the change be OK if Tom did it that way?
> > --
> 
> Yes this is what I suggested, but it seems Tom had something different
> in mind.
> 
> I would rather not call flow dissector from GRO, especially considering
> nobody but Google uses RPS/RFS (otherwise CVE-2013-4348 would have been
> discovered much sooner)

According to the original report of that vulnerability:

> skb_flow_dissect() were used by several places:                         
> - packet scheduler that want classify flows                             
> - skb_get_rxhash() that will be used by RPS, vxlan, multiqueue          
> tap,macvtap packet fanout                                               
> - skb_probe_transport_header() which was used for probing transport     
> header for DODGY packets                                                
> - __skb_get_poff() which will be used by socket filter             

So flow dissector is already part of the attack surface for both local
and remote users in common configurations.

Ben.
Tom Herbert Jan. 13, 2014, 9:24 p.m. UTC | #10
On Mon, Jan 13, 2014 at 12:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2014-01-13 at 11:59 -0800, David Miller wrote:
>
>> It also doesn't jive well with Eric's recent patch to adjust the GRO
>> overflow strategy (600adc18eba823f9fd8ed5fec8b04f11dddf3884 ("net:
>> gro: change GRO overflow strategy"))
>>
>> :-)
>>
>> I sort of like Tom's idea to optimistically compare the hash, if we
>> do in fact have one already.
>>
>> Eric would the change be OK if Tom did it that way?
>> --
>
> Yes this is what I suggested, but it seems Tom had something different
> in mind.
>

I will add skb_get_hash_noeval
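
A minimal sketch of what such a helper might look like (the name comes from
the message above; the body is an assumption): it reports whatever hash is
already recorded on the skb and never computes one:

	/* Return the recorded hash, if any, without ever running the flow
	 * dissector; 0 means no hash is available. */
	static inline __u32 skb_get_hash_noeval(const struct sk_buff *skb)
	{
		return skb->rxhash;
	}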

> I would rather not call flow dissector from GRO, especially considering
> nobody but Google uses RPS/RFS (otherwise CVE-2013-4348 would have been
> discovered much sooner)

Or maybe nobody uses IPIP. ;-)

>
>
>
Eric Dumazet Jan. 13, 2014, 9:36 p.m. UTC | #11
On Mon, 2014-01-13 at 13:24 -0800, Tom Herbert wrote:

> 
> Or maybe nobody uses IPIP. ;-)

The bug happened even without IPIP being used on the node.

Changing one byte from 0x45 to 0x40 (i.e. zeroing the IPv4 header-length
nibble) was enough to trigger it.

Since we do not verify the header checksum in the flow dissector, ordinary
corrupted traffic on the Internet would have a chance to trigger the
infinite loop.


Eric Dumazet Jan. 13, 2014, 9:37 p.m. UTC | #12
On Mon, 2014-01-13 at 20:50 +0000, Ben Hutchings wrote:

> According to the original report of that vulnerability:
> 
> > skb_flow_dissect() were used by several places:                         
> > - packet scheduler that want classify flows                             
> > - skb_get_rxhash() that will be used by RPS, vxlan, multiqueue          
> > tap,macvtap packet fanout                                               
> > - skb_probe_transport_header() which was used for probing transport     
> > header for DODGY packets                                                
> > - __skb_get_poff() which will be used by socket filter             
> 
> So flow dissector is already part of the attack surface for both local
> and remote users in common configurations.

Take a Debian or Android distro, and this bug is not hit in 'common
configurations'. Send them a 'packet of death' and they will not hang,
unless some admin has set up RPS/RFS on the incoming device.

Anyway, I see no point pushing this scheme (flow dissecting all incoming
packets). A router gains nothing from going to the L4 header, but it
still might need GRO.



Ben Hutchings Jan. 15, 2014, 1:31 a.m. UTC | #13
On Mon, 2014-01-13 at 13:37 -0800, Eric Dumazet wrote:
> On Mon, 2014-01-13 at 20:50 +0000, Ben Hutchings wrote:
> 
> > According to the original report of that vulnerability:
> > 
> > > skb_flow_dissect() were used by several places:                         
> > > - packet scheduler that want classify flows                             
> > > - skb_get_rxhash() that will be used by RPS, vxlan, multiqueue          
> > > tap,macvtap packet fanout                                               
> > > - skb_probe_transport_header() which was used for probing transport     
> > > header for DODGY packets                                                
> > > - __skb_get_poff() which will be used by socket filter             
> > 
> > So flow dissector is already part of the attack surface for both local
> > and remote users in common configurations.
> 
> Take a Debian or Android distro, and this bug is not hit in 'common
> configurations'. Send them a 'packet of death' and they will not hang,
> unless some admin has set up RPS/RFS on the incoming device.
[...]

When I investigated the scope of this for Debian, I tried sending a
'packet of death' to a VM and actually triggered the lockup in the TX
path of the *host*, running Debian unstable with Linux 3.11.  I didn't
track down exactly why that was but I think that libvirt's default
networking configuration includes multiqueue devices that use flow
dissector.

Ben.
Eric Dumazet Jan. 15, 2014, 2:27 a.m. UTC | #14
On Wed, 2014-01-15 at 01:31 +0000, Ben Hutchings wrote:

> When I investigated the scope of this for Debian, I tried sending a
> 'packet of death' to a VM and actually triggered the lockup in the TX
> path of the *host*, running Debian unstable with Linux 3.11.  I didn't
> track down exactly why that was but I think that libvirt's default
> networking configuration includes multiqueue devices that use flow
> dissector.

OK, I take it that the majority of Debian hosts are running some VM then;
nice to know. Time to update my hosts and usage, I guess.

Anyway, current flow dissector needs care if we really use it in
unprotected areas.

Hostile packets can force flow dissection of MTU bytes,
bringing host to abysmal performance.

Once we fix all the issues, we'll see how expensive it is and
if it really can help.

Last year, my experiments used exactly this in GRO, to have a hash
table instead of a single gro_list; unfortunately it added a latency
regression that I found unacceptable at the time.

With TSO auto sizing, I might need to revisit the idea anyway...


Ben Hutchings Jan. 15, 2014, 2:43 a.m. UTC | #15
On Tue, 2014-01-14 at 18:27 -0800, Eric Dumazet wrote:
> On Wed, 2014-01-15 at 01:31 +0000, Ben Hutchings wrote:
> 
> > When I investigated the scope of this for Debian, I tried sending a
> > 'packet of death' to a VM and actually triggered the lockup in the TX
> > path of the *host*, running Debian unstable with Linux 3.11.  I didn't
> > track down exactly why that was but I think that libvirt's default
> > networking configuration includes multiqueue devices that use flow
> > dissector.
> 
> OK, I take it that the majority of Debian hosts are running some VM then;
> nice to know. Time to update my hosts and usage, I guess.

I said that common configurations use flow dissector - not the majority.

> Anyway, current flow dissector needs care if we really use it in
> unprotected areas.
> 
> Hostile packets can force flow dissection of MTU bytes,
> bringing host to abysmal performance.
[...]

Yes.  I think the number of times it can loop should be limited to some
small number.
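
As a rough illustration of that suggestion (an assumed shape for a fix, not an
actual patch), the dissector's encapsulation-following loop could carry a
counter and bail out past a small bound:

	/* Hypothetical cap on how many encapsulation headers the flow
	 * dissector will walk before giving up (inside skb_flow_dissect()). */
	#define FLOW_DISSECT_MAX_HDRS	8

		int hdr_count = 0;

	again:
		/* Give up rather than loop forever on a malformed packet. */
		if (++hdr_count > FLOW_DISSECT_MAX_HDRS)
			return false;

		/* ... the existing protocol switch follows, which may
		 * 'goto again' for IPIP, GRE and similar encapsulations ... */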

Ben.

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index ce01847..88bf426 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3794,10 +3794,14 @@  static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
 {
 	struct sk_buff *p;
 	unsigned int maclen = skb->dev->hard_header_len;
+	u32 hash = skb_get_hash(skb);
 
 	for (p = napi->gro_list; p; p = p->next) {
 		unsigned long diffs;
 
+		if ((diffs = (hash != p->rxhash)))
+			goto skip;
+
 		diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
 		diffs |= p->vlan_tci ^ skb->vlan_tci;
 		if (maclen == ETH_HLEN)
@@ -3807,6 +3811,7 @@  static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
 			diffs = memcmp(skb_mac_header(p),
 				       skb_gro_mac_header(skb),
 				       maclen);
+skip:
 		NAPI_GRO_CB(p)->same_flow = !diffs;
 		NAPI_GRO_CB(p)->flush = 0;
 	}