Message ID: 1393855520-18334-1-git-send-email-nikolay@redhat.com
State: Changes Requested, archived
Delegated to: David Miller
Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> I stumbled upon this very serious bug while hunting for another one,
> it's a very subtle race condition between inet_frag_evictor,
> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
> the users of inet_frag_kill/inet_frag_put).
> What happens is that after a fragment has been added to the hash chain but
> before it's been added to the lru_list (inet_frag_lru_add), it may get
> deleted (either by an expired timer if the system load is high or the
> timer sufficiently low, or by the frag_queue function for different
> reasons) before it's added to the lru_list

Sorry. Not following here, see below.

> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index bb075fc9a14f..322dcebfc588 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
>
>  	atomic_inc(&qp->refcnt);
>  	hlist_add_head(&qp->list, &hb->chain);
> +	inet_frag_lru_add(nf, qp);
>  	spin_unlock(&hb->chain_lock);
>  	read_unlock(&f->lock);

If I understand correctly, you're saying that qp can be freed by a timer
on another CPU right after dropping the locks. But how is that possible?

->refcnt is bumped above when arming the timer (before dropping the chain
lock), so even if the frag_expire timer fires instantly it should not
free qp.

What am I missing?

Thanks,
Florian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/03/2014 03:40 PM, Florian Westphal wrote:
> Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>> I stumbled upon this very serious bug while hunting for another one,
>> it's a very subtle race condition between inet_frag_evictor,
>> inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically
>> the users of inet_frag_kill/inet_frag_put).
>> What happens is that after a fragment has been added to the hash chain but
>> before it's been added to the lru_list (inet_frag_lru_add), it may get
>> deleted (either by an expired timer if the system load is high or the
>> timer sufficiently low, or by the frag_queue function for different
>> reasons) before it's added to the lru_list
>
> Sorry. Not following here, see below.
>
>> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
>> index bb075fc9a14f..322dcebfc588 100644
>> --- a/net/ipv4/inet_fragment.c
>> +++ b/net/ipv4/inet_fragment.c
>> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
>>
>>  	atomic_inc(&qp->refcnt);
>>  	hlist_add_head(&qp->list, &hb->chain);
>> +	inet_frag_lru_add(nf, qp);
>>  	spin_unlock(&hb->chain_lock);
>>  	read_unlock(&f->lock);
>
> If I understand correctly, you're saying that qp can be freed by a timer
> on another CPU right after dropping the locks. But how is that possible?
>
> ->refcnt is bumped above when arming the timer (before dropping the chain
> lock), so even if the frag_expire timer fires instantly it should not
> free qp.
>
> What am I missing?
>
> Thanks,
> Florian

When inet_frag_kill is called from the IPv4/6 frag_queue function, it
drops the timer's refcount, and the inet_frag_put that follows drops the
count to 0 and frees the queue. All of this can happen before the frag
was ever added to the LRU list, after which it gets added anyway.
This is much easier to hit on IPv6 because its frag_queue function drops
overlapping fragments. The point is that the timer's refcount gets removed
one way or another: either by the timer itself (there's an inet_frag_put
at the end of frag_expire) or, much more easily, by the frag_queue
function.
I think I explained it badly before; I hope this makes it clearer :-)

Cheers,
 Nik
On 03/03/2014 03:40 PM, Florian Westphal wrote:
> Nikolay Aleksandrov <nikolay@redhat.com> wrote:
[...]
> If I understand correctly, you're saying that qp can be freed by a timer
> on another CPU right after dropping the locks. But how is that possible?
>
> ->refcnt is bumped above when arming the timer (before dropping the chain
> lock), so even if the frag_expire timer fires instantly it should not
> free qp.
>
> What am I missing?

An important point is that inet_frag_kill removes both the timer's refcnt
and has an unconditional atomic_dec that removes the original/guarding
refcnt, so it basically removes everything that stands in the way.
On Mon, 03 Mar 2014 15:49:56 +0100
Nikolay Aleksandrov <nikolay@redhat.com> wrote:

> On 03/03/2014 03:40 PM, Florian Westphal wrote:
> > Nikolay Aleksandrov <nikolay@redhat.com> wrote:
[...]
> An important point is that inet_frag_kill removes both the timer's refcnt
> and has an unconditional atomic_dec that removes the original/guarding
> refcnt, so it basically removes everything that stands in the way.

It sounds like we might have a refcnt problem...

Do we need an extra refcnt for maintaining elements on the LRU list?
On 03/03/2014 04:31 PM, Jesper Dangaard Brouer wrote:
> On Mon, 03 Mar 2014 15:49:56 +0100
> Nikolay Aleksandrov <nikolay@redhat.com> wrote:
[...]
> It sounds like we might have a refcnt problem...
>
> Do we need an extra refcnt for maintaining elements on the LRU list?

I don't think so. If you keep the lru_add separate from the chain addition
as before, you still have a race window in which frags can be seen by the
fq_find functions but are not yet on the LRU list, unless you take the
extra refcnt before adding to the LRU list while holding the chain lock
(or earlier), which would change the behaviour. And even then, if
inet_frag_kill gets called it would have to remove that refcnt too, which
puts us back in the same position as now if we want to keep the current
behaviour.

Nik
On Mon, 03 Mar 2014 15:43:00 +0100
Nikolay Aleksandrov <nikolay@redhat.com> wrote:

> On 03/03/2014 03:40 PM, Florian Westphal wrote:
> > Nikolay Aleksandrov <nikolay@redhat.com> wrote:
[...]
> >> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> >> index bb075fc9a14f..322dcebfc588 100644
> >> --- a/net/ipv4/inet_fragment.c
> >> +++ b/net/ipv4/inet_fragment.c
> >> @@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
> >>
> >>  	atomic_inc(&qp->refcnt);
> >>  	hlist_add_head(&qp->list, &hb->chain);
> >> +	inet_frag_lru_add(nf, qp);
> >>  	spin_unlock(&hb->chain_lock);
> >>  	read_unlock(&f->lock);
[...]
> When inet_frag_kill is called from the IPv4/6 frag_queue function, it
> drops the timer's refcount, and the inet_frag_put that follows drops the
> count to 0 and frees the queue. All of this can happen before the frag
> was ever added to the LRU list, after which it gets added anyway.
> This is much easier to hit on IPv6 because its frag_queue function drops
> overlapping fragments. The point is that the timer's refcount gets
> removed one way or another: either by the timer itself (there's an
> inet_frag_put at the end of frag_expire) or, much more easily, by the
> frag_queue function.
> I think I explained it badly before; I hope this makes it clearer :-)

I like this description better.

After some IRC discussions with Nik and Florian, I acknowledge this is a
real race condition.

The real solution is to remove the LRU list system (which would also solve
a scalability problem), but short term we need Nik's fix, which I guess
should also go to stable.

Thanks Nik!
Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> On 03/03/2014 03:40 PM, Florian Westphal wrote:
> > Nikolay Aleksandrov <nikolay@redhat.com> wrote:
[...]
> An important point is that inet_frag_kill removes both the timer's refcnt
> and has an unconditional atomic_dec that removes the original/guarding
> refcnt, so it basically removes everything that stands in the way.

You're right.

Problem is that when we return from inet_frag_intern() we can end up with
a qp that is no longer in the hash (inet_frag_kill was invoked) but that
has been added to the lru list _after_ inet_frag_kill supposedly removed
it. The refcnt is not 0 (yet) by the time inet_frag_intern returns, but it
drops to 0 soon after on the next _put event.

Your fix makes 'in hash table but not on lru list' impossible and thus
avoids the problem.

Thanks for explaining!
From: Nikolay Aleksandrov <nikolay@redhat.com>
Date: Mon, 3 Mar 2014 15:05:20 +0100

> I stumbled upon this very serious bug while hunting for another one,
> it's a very subtle race condition between inet_frag_evictor,
> inet_frag_intern and the IPv4/6 frag_queue and expire functions
> (basically the users of inet_frag_kill/inet_frag_put).
[...]
> Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of
> rwlock")
>
> CC: Jesper Dangaard Brouer <brouer@redhat.com>
> CC: David S. Miller <davem@davemloft.net>
>
> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>

Nik, please beef up the commit message a bit and resubmit. Some of your
replies explained the situation a bit better.

Don't be afraid of making the commit message too long or too verbose :-)

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index bb075fc9a14f..322dcebfc588 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -278,9 +278,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 
 	atomic_inc(&qp->refcnt);
 	hlist_add_head(&qp->list, &hb->chain);
+	inet_frag_lru_add(nf, qp);
 	spin_unlock(&hb->chain_lock);
 	read_unlock(&f->lock);
-	inet_frag_lru_add(nf, qp);
+
 	return qp;
 }
I stumbled upon this very serious bug while hunting for another one,
it's a very subtle race condition between inet_frag_evictor,
inet_frag_intern and the IPv4/6 frag_queue and expire functions
(basically the users of inet_frag_kill/inet_frag_put).

What happens is that after a fragment has been added to the hash chain,
but before it's been added to the lru_list (inet_frag_lru_add), it may
get deleted (either by an expired timer, if the system load is high or
the timer sufficiently low, or by the frag_queue function for different
reasons) before it's added to the lru_list. After it then gets added,
it's a matter of time before the evictor gets to a piece of memory that
has been freed, leading to a number of different bugs depending on
what's left there.

I've been able to trigger this on both IPv4 and IPv6 (which is normal,
as the frag code is the same), but it's been much more difficult to
trigger on IPv4 due to the protocol differences in how fragments are
treated. The setup I used to reproduce this is: 2 machines with 4 x 10G
bonded in a round-robin bond, so the same flow can be seen on multiple
cards at the same time. Then I used multiple instances of ping/ping6 to
generate fragmented packets and flood the machines with them, while
running other processes to load the attacked machine. *It is very
important to have the _same flow_ coming in on multiple CPUs
concurrently.* Usually the attacked machine would die in less than 30
minutes; if configured properly, to have many evictor calls and
timeouts, it could happen in 10 minutes or so.

The fix is simple: just move the lru_add under the hash chain locked
region, so that when a removing function is called it has to wait for
the fragment to be added to the lru_list before removing it (this works
because the hash chain removal is done before the lru_list one, and
there's no window between the two list adds in which the frag can get
dropped). With this fix applied I couldn't kill the same machine in 24
hours with the same setup.
Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of rwlock")

CC: Jesper Dangaard Brouer <brouer@redhat.com>
CC: David S. Miller <davem@davemloft.net>

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
---
I'm new to this code, so I'm not sure if this is the best approach to fix
the issue and am open to other suggestions. Since I consider the issue
quite serious (remotely triggerable), I'll be closely monitoring this
thread to get it fixed asap.

 net/ipv4/inet_fragment.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)