
net: fix for a race condition in the inet frag code

Message ID 1393855520-18334-1-git-send-email-nikolay@redhat.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Nikolay Aleksandrov March 3, 2014, 2:05 p.m. UTC
I stumbled upon this very serious bug while hunting for another one.
It's a very subtle race condition between inet_frag_evictor,
inet_frag_intern and the IPv4/6 frag_queue and expire functions
(basically the users of inet_frag_kill/inet_frag_put).
What happens is that after a fragment has been added to the hash chain,
but before it's been added to the lru_list (inet_frag_lru_add), it may
get deleted: either by an expired timer (if the system load is high or
the timeout sufficiently low) or by the frag_queue function for various
reasons. Then, after it does get added to the lru_list, it's only a
matter of time before the evictor reaches a piece of memory that has
already been freed, leading to a number of different bugs depending on
what's left there. I've been able to trigger this on both IPv4 and IPv6
(which is expected, as the frag code is shared), but it's been much
more difficult to trigger on IPv4 due to the protocol differences in
how fragments are treated. The setup I used to reproduce this:
2 machines with 4 x 10G NICs bonded in a round-robin bond, so the same
flow can be seen on multiple cards at the same time. Then I used
multiple instances of ping/ping6 to generate fragmented packets and
flood the machines with them, while running other processes to load the
attacked machine. It is very important to have the _same flow_ coming
in on multiple CPUs concurrently. Usually the attacked machine would
die in less than 30 minutes; if configured to have many evictor calls
and short timeouts, it could happen in 10 minutes or so.
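
To make the interleaving concrete, here is a rough sketch of the race
(a simplified timeline, not verbatim kernel code; the killer can be the
frag_queue function or the expire timer running on another CPU):

    CPU A (inet_frag_intern)        CPU B (fq_find + frag_queue/expire)
    -----------------------------   -----------------------------------
    atomic_inc(&qp->refcnt);
    hlist_add_head(&qp->list, ...);
    spin_unlock(&hb->chain_lock);
                                    /* finds qp in the hash, takes a ref */
                                    inet_frag_kill(qp, f);
                                    /* unhashes qp and does the lru_del,
                                     * but qp isn't on the lru_list yet */
                                    inet_frag_put(qp, f);
    inet_frag_lru_add(nf, qp);
    /* qp goes on the LRU list even though it's already killed */
    ...
    inet_frag_put(qp, f);
    /* refcnt hits 0, qp is freed while still linked on the lru_list;
     * the evictor later walks the list -> use-after-free */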

The fix is simple: move the lru_add under the hash chain locked region,
so that when a removing function is called it has to wait for the
fragment to be added to the lru_list before it can remove it (this
works because the hash chain removal is done before the lru_list one,
and there's no longer a window between the two list adds in which the
frag can get dropped). With this fix applied I couldn't kill the same
machine in 24 hours with the same setup.

Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of rwlock")

CC: Jesper Dangaard Brouer <brouer@redhat.com>
CC: David S. Miller <davem@davemloft.net>

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
---
I'm new to this code, so I'm not sure this is the best approach to fix
the issue and I'm open to other suggestions. Since I consider the issue
quite serious (it's remotely triggerable), I'll be closely monitoring
this thread to get it fixed ASAP.

 net/ipv4/inet_fragment.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Florian Westphal March 3, 2014, 2:40 p.m. UTC | #1
Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> I stumbled upon this very serious bug while hunting for another one,
> it's a very subtle race condition between inet_frag_evictor,
> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
> the users of inet_frag_kill/inet_frag_put).
> What happens is that after a fragment has been added to the hash chain but
> before it's been added to the lru_list (inet_frag_lru_add), it may get
> deleted (either by an expired timer if the system load is high or the
> timer sufficiently low, or by the frag_queue function for different
> reasons) before it's added to the lru_list

Sorry.  Not following here, see below.

> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index bb075fc9a14f..322dcebfc588 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
>  
>  	atomic_inc(&qp->refcnt);
>  	hlist_add_head(&qp->list, &hb->chain);
> +	inet_frag_lru_add(nf, qp);
>  	spin_unlock(&hb->chain_lock);
>  	read_unlock(&f->lock);

If I understand correctly you're saying that qp can be freed by a
timer on another CPU right after dropping the locks.  But how is it
possible?

->refcnt is bumped above when arming the timer (before dropping chain
lock), so even if the frag_expire timer fires instantly it should not
free qp.

What am I missing?

Thanks,
Florian
Nikolay Aleksandrov March 3, 2014, 2:43 p.m. UTC | #2
On 03/03/2014 03:40 PM, Florian Westphal wrote:
> Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>> I stumbled upon this very serious bug while hunting for another one,
>> it's a very subtle race condition between inet_frag_evictor,
>> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
>> the users of inet_frag_kill/inet_frag_put).
>> What happens is that after a fragment has been added to the hash chain but
>> before it's been added to the lru_list (inet_frag_lru_add), it may get
>> deleted (either by an expired timer if the system load is high or the
>> timer sufficiently low, or by the frag_queue function for different
>> reasons) before it's added to the lru_list
> 
> Sorry.  Not following here, see below.
> 
>> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
>> index bb075fc9a14f..322dcebfc588 100644
>> --- a/net/ipv4/inet_fragment.c
>> +++ b/net/ipv4/inet_fragment.c
>> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
>>  
>>  	atomic_inc(&qp->refcnt);
>>  	hlist_add_head(&qp->list, &hb->chain);
>> +	inet_frag_lru_add(nf, qp);
>>  	spin_unlock(&hb->chain_lock);
>>  	read_unlock(&f->lock);
> 
> If I understand correctly you're saying that qp can be freed by a
> timer on another CPU right after dropping the locks.  But how is it
> possible?
> 
> ->refcnt is bumped above when arming the timer (before dropping chain
> lock), so even if the frag_expire timer fires instantly it should not
> free qp.
> 
> What am I missing?
> 
> Thanks,
> Florian
> 
inet_frag_kill, when called from the IPv4/6 frag_queue function, will
remove the timer's refcount, and the inet_frag_put that follows will
drop the refcount to 0 and free the fragment. All of this can happen
before the frag was ever added to the LRU list, and then it gets added
anyway. This is much easier to hit on IPv6 because of the dropping of
overlapping fragments in its frag_queue function. The point is that the
timer's refcount gets removed one way or another: it can be done by the
timer itself (there's an inet_frag_put at the end of frag_expire), or,
much more easily, by the frag_queue function.
I think I explained it badly before, I hope this makes it clearer :-)
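
To put actual numbers on it, a back-of-the-envelope refcount walk (my
reading of the code, simplified; a concurrent fq_find lookup would add
and then drop one more reference):

    3  after inet_frag_intern: the alloc ref returned to the caller,
       the timer's ref and the hash chain's ref
    2  inet_frag_kill: del_timer() succeeds, the timer ref is dropped
    1  inet_frag_kill: unconditional atomic_dec, hash chain ref dropped
    0  the caller's inet_frag_put -> qp is freed

Nothing in that sequence waits for the lru_add, so all of it can
complete around the unlocked inet_frag_lru_add call.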

Cheers,
Nik

Nikolay Aleksandrov March 3, 2014, 2:49 p.m. UTC | #3
On 03/03/2014 03:40 PM, Florian Westphal wrote:
> Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>> I stumbled upon this very serious bug while hunting for another one,
>> it's a very subtle race condition between inet_frag_evictor,
>> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
>> the users of inet_frag_kill/inet_frag_put).
>> What happens is that after a fragment has been added to the hash chain but
>> before it's been added to the lru_list (inet_frag_lru_add), it may get
>> deleted (either by an expired timer if the system load is high or the
>> timer sufficiently low, or by the frag_queue function for different
>> reasons) before it's added to the lru_list
> 
> Sorry.  Not following here, see below.
> 
>> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
>> index bb075fc9a14f..322dcebfc588 100644
>> --- a/net/ipv4/inet_fragment.c
>> +++ b/net/ipv4/inet_fragment.c
>> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
>>  
>>  	atomic_inc(&qp->refcnt);
>>  	hlist_add_head(&qp->list, &hb->chain);
>> +	inet_frag_lru_add(nf, qp);
>>  	spin_unlock(&hb->chain_lock);
>>  	read_unlock(&f->lock);
> 
> If I understand correctly you're saying that qp can be freed by a
> timer on another CPU right after dropping the locks.  But how is it
> possible?
> 
> ->refcnt is bumped above when arming the timer (before dropping chain
> lock), so even if the frag_expire timer fires instantly it should not
> free qp.
> 
> What am I missing?
> 
> Thanks,
> Florian
> 
An important point is that inet_frag_kill removes both the timer's refcnt and
has an unconditional atomic_dec to remove the original/guarding refcnt, so it
basically removes everything that's in the way.
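
For reference, this is roughly what inet_frag_kill does (paraphrased
and simplified, not verbatim source):

void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
{
	if (del_timer(&fq->timer))
		atomic_dec(&fq->refcnt);	/* the timer's ref */

	if (!(fq->last_in & INET_FRAG_COMPLETE)) {
		fq_unlink(fq, f);		/* hash + lru removal */
		atomic_dec(&fq->refcnt);	/* the guarding ref, dropped
						 * unconditionally */
		fq->last_in |= INET_FRAG_COMPLETE;
	}
}

So after a kill, nothing pins the queue except whatever references the
callers themselves still hold.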

Jesper Dangaard Brouer March 3, 2014, 3:31 p.m. UTC | #4
On Mon, 03 Mar 2014 15:49:56 +0100
Nikolay Aleksandrov <nikolay@redhat.com> wrote:

> On 03/03/2014 03:40 PM, Florian Westphal wrote:
> > Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> >> I stumbled upon this very serious bug while hunting for another one,
> >> it's a very subtle race condition between inet_frag_evictor,
> >> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
> >> the users of inet_frag_kill/inet_frag_put).
> >> What happens is that after a fragment has been added to the hash chain but
> >> before it's been added to the lru_list (inet_frag_lru_add), it may get
> >> deleted (either by an expired timer if the system load is high or the
> >> timer sufficiently low, or by the frag_queue function for different
> >> reasons) before it's added to the lru_list
> > 
> > Sorry.  Not following here, see below.
> > 
> >> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> >> index bb075fc9a14f..322dcebfc588 100644
> >> --- a/net/ipv4/inet_fragment.c
> >> +++ b/net/ipv4/inet_fragment.c
> >> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
> >>  
> >>  	atomic_inc(&qp->refcnt);
> >>  	hlist_add_head(&qp->list, &hb->chain);
> >> +	inet_frag_lru_add(nf, qp);
> >>  	spin_unlock(&hb->chain_lock);
> >>  	read_unlock(&f->lock);
> > 
> > If I understand correctly you're saying that qp can be freed by a
> > timer on another CPU right after dropping the locks.  But how is it
> > possible?
> > 
> > ->refcnt is bumped above when arming the timer (before dropping chain
> > lock), so even if the frag_expire timer fires instantly it should not
> > free qp.
> > 
> > What am I missing?
> > 
> > Thanks,
> > Florian
> > 
> An important point is that inet_frag_kill removes both the timer's refcnt and
> has an unconditional atomic_dec to remove the original/guarding refcnt, so it
> basically removes everything that's in the way.
 
It sounds like we might have a refcnt problem...
Do we need an extra refcnt for maintaining elements on the LRU list?
Nikolay Aleksandrov March 3, 2014, 3:34 p.m. UTC | #5
On 03/03/2014 04:31 PM, Jesper Dangaard Brouer wrote:
> On Mon, 03 Mar 2014 15:49:56 +0100
> Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> 
>> On 03/03/2014 03:40 PM, Florian Westphal wrote:
>>> Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>>>> I stumbled upon this very serious bug while hunting for another one,
>>>> it's a very subtle race condition between inet_frag_evictor,
>>>> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
>>>> the users of inet_frag_kill/inet_frag_put).
>>>> What happens is that after a fragment has been added to the hash chain but
>>>> before it's been added to the lru_list (inet_frag_lru_add), it may get
>>>> deleted (either by an expired timer if the system load is high or the
>>>> timer sufficiently low, or by the frag_queue function for different
>>>> reasons) before it's added to the lru_list
>>>
>>> Sorry.  Not following here, see below.
>>>
>>>> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
>>>> index bb075fc9a14f..322dcebfc588 100644
>>>> --- a/net/ipv4/inet_fragment.c
>>>> +++ b/net/ipv4/inet_fragment.c
>>>> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
>>>>  
>>>>  	atomic_inc(&qp->refcnt);
>>>>  	hlist_add_head(&qp->list, &hb->chain);
>>>> +	inet_frag_lru_add(nf, qp);
>>>>  	spin_unlock(&hb->chain_lock);
>>>>  	read_unlock(&f->lock);
>>>
>>> If I understand correctly you're saying that qp can be freed by a
>>> timer on another CPU right after dropping the locks.  But how is it
>>> possible?
>>>
>>> ->refcnt is bumped above when arming the timer (before dropping chain
>>> lock), so even if the frag_expire timer fires instantly it should not
>>> free qp.
>>>
>>> What am I missing?
>>>
>>> Thanks,
>>> Florian
>>>
>> An important point is that inet_frag_kill removes both the timer's refcnt and
>> has an unconditional atomic_dec to remove the original/guarding refcnt, so it
>> basically removes everything that's in the way.
>  
> It sounds like we might have a refcnt problem...
> Do we need an extra refcnt for maintaining elements on the LRU list?
> 
I don't think so. If you keep the lru_add separated from the chain
addition as before, you still have a race condition where frags are
seen by the fq_find functions but are not in the LRU list yet - unless
you take the extra refcnt before adding to the lru list while holding
the chain lock (or before that), which would alter the behaviour. And
even then, if inet_frag_kill gets called it should remove that refcnt
as well, which puts us in the same position as now if we want to keep
the current behaviour.
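
To illustrate, the alternative would look roughly like this
(hypothetical sketch, not something I'm proposing):

	atomic_inc(&qp->refcnt);	/* extra ref to pin qp for the LRU */
	hlist_add_head(&qp->list, &hb->chain);
	spin_unlock(&hb->chain_lock);
	read_unlock(&f->lock);
	inet_frag_lru_add(nf, qp);	/* still a window: in hash, not on LRU */

To avoid leaking queues that get killed before the lru_add runs,
inet_frag_kill would then have to drop that extra ref as well, which
puts us right back where we are today.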

Nik
Jesper Dangaard Brouer March 3, 2014, 5:13 p.m. UTC | #6
On Mon, 03 Mar 2014 15:43:00 +0100
Nikolay Aleksandrov <nikolay@redhat.com> wrote:

> On 03/03/2014 03:40 PM, Florian Westphal wrote:
> > Nikolay Aleksandrov <nikolay@redhat.com> wrote:

[...]
> >> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> >> index bb075fc9a14f..322dcebfc588 100644
> >> --- a/net/ipv4/inet_fragment.c
> >> +++ b/net/ipv4/inet_fragment.c
> >> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
> >>  
> >>  	atomic_inc(&qp->refcnt);
> >>  	hlist_add_head(&qp->list, &hb->chain);
> >> +	inet_frag_lru_add(nf, qp);
> >>  	spin_unlock(&hb->chain_lock);
> >>  	read_unlock(&f->lock);
> > 
[...]
> > 
> inet_frag_kill when called from the IPv4/6 frag_queue function will remove the
> timer refcount, then inet_frag_put afterwards will drop it to 0 and free it and
> all of this could happen before the frag was ever added to the LRU list, then it
> gets added. This happens much easier for IPv6 because of the dropping of
> overlapping fragments in its frag_queue function, the point is we need to have
> the timer's refcount removed in any way (it could be the timer itself - there's
> an inet_frag_put in the end, or much easier by the frag_queue function).
> I think I've explained it badly, I hope this makes it clearer :-)

I like this desc better.

After some IRC discussions with Nik and Florian, I acknowledge this is
a real race condition.

The real solution is to remove the LRU list system (which will also
solve a scalability problem), but short-term we need Nik's fix, which I
guess should also go to stable.

Thanks Nik!
Florian Westphal March 3, 2014, 5:17 p.m. UTC | #7
Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> On 03/03/2014 03:40 PM, Florian Westphal wrote:
> > Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> >> I stumbled upon this very serious bug while hunting for another one,
> >> it's a very subtle race condition between inet_frag_evictor,
> >> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
> >> the users of inet_frag_kill/inet_frag_put).
> >> What happens is that after a fragment has been added to the hash chain but
> >> before it's been added to the lru_list (inet_frag_lru_add), it may get
> >> deleted (either by an expired timer if the system load is high or the
> >> timer sufficiently low, or by the frag_queue function for different
> >> reasons) before it's added to the lru_list
> > 
> > Sorry.  Not following here, see below.
> > 
> >> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> >> index bb075fc9a14f..322dcebfc588 100644
> >> --- a/net/ipv4/inet_fragment.c
> >> +++ b/net/ipv4/inet_fragment.c
> >> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
> >>  
> >>  	atomic_inc(&qp->refcnt);
> >>  	hlist_add_head(&qp->list, &hb->chain);
> >> +	inet_frag_lru_add(nf, qp);
> >>  	spin_unlock(&hb->chain_lock);
> >>  	read_unlock(&f->lock);
> > 
> > If I understand correctly you're saying that qp can be freed by a
> > timer on another CPU right after dropping the locks.  But how is it
> > possible?
> > 
> > ->refcnt is bumped above when arming the timer (before dropping chain
> > lock), so even if the frag_expire timer fires instantly it should not
> > free qp.
> > 
> > What am I missing?
> > 
> > Thanks,
> > Florian
> > 
> An important point is that inet_frag_kill removes both the timer's refcnt and
> has an unconditional atomic_dec to remove the original/guarding refcnt, so it
> basically removes everything that's in the way.

You're right.

Problem is that when we return from inet_frag_intern() we can end up
with a qp that is no longer in the hash (inet_frag_kill was invoked)
but has been added to the lru list _after_ inet_frag_kill supposedly
removed it.

The refcnt is not 0 (yet) by the time inet_frag_intern returns, but it
drops to 0 soon after, on the next _put event.

Your fix makes 'in hash table but not on lru list' impossible and
thus avoids the problem.

Thanks for explaining!
David Miller March 3, 2014, 9:34 p.m. UTC | #8
From: Nikolay Aleksandrov <nikolay@redhat.com>
Date: Mon,  3 Mar 2014 15:05:20 +0100

> I stumbled upon this very serious bug while hunting for another one,
> it's a very subtle race condition between inet_frag_evictor,
> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
> the users of inet_frag_kill/inet_frag_put).
> What happens is that after a fragment has been added to the hash chain but
> before it's been added to the lru_list (inet_frag_lru_add), it may get
> deleted (either by an expired timer if the system load is high or the
> timer sufficiently low, or by the frag_queue function for different
> reasons) before it's added to the lru_list, then after it gets added
> it's a matter of time for the evictor to get to a piece of memory which
> has been freed leading to a number of different bugs depending on what's
> left there. I've been able to trigger this on both IPv4 and IPv6 (which
> is normal as the frag code is the same), but it's been much more
> difficult to trigger on IPv4 due to the protocol differences about how
> fragments are treated. The setup I used to reproduce this is:
> 2 machines with 4 x 10G bonded in a RR bond, so the same flow can be
> seen on multiple cards at the same time. Then I used multiple instances
> of ping/ping6 to generate fragmented packets and flood the machines with
> them while running other processes to load the attacked machine.
> *It is very important to have the _same flow_ coming in on multiple CPUs
> concurrently. Usually the attacked machine would die in less than 30
> minutes, if configured properly to have many evictor calls and timeouts
> it could happen in 10 minutes or so.
> 
> The fix is simple, just move the lru_add under the hash chain locked
> region so when a removing function is called it'll have to wait for the
> fragment to be added to the lru_list, and then it'll remove it (it works
> because the hash chain removal is done before the lru_list one and
> there's no window between the two list adds when the frag can get
> dropped). With this fix applied I couldn't kill the same machine in 24
> hours with the same setup.
> 
> Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of
> rwlock")
> 
> CC: Jesper Dangaard Brouer <brouer@redhat.com>
> CC: David S. Miller <davem@davemloft.net>
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>

Nik, please beef up the commit message a bit and resubmit.

Some of your replies explained the situation a bit better.  Don't
be afraid of making the commit message too long or too verbose :-)

Thanks!

Patch

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index bb075fc9a14f..322dcebfc588 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -278,9 +278,10 @@  static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 
 	atomic_inc(&qp->refcnt);
 	hlist_add_head(&qp->list, &hb->chain);
+	inet_frag_lru_add(nf, qp);
 	spin_unlock(&hb->chain_lock);
 	read_unlock(&f->lock);
-	inet_frag_lru_add(nf, qp);
+
 	return qp;
 }