
[net-next,V3-evictor] net: frag evictor, avoid killing warm frag queues

Message ID 20121204133007.20215.52566.stgit@dragon
State RFC, archived
Delegated to: David Miller

Commit Message

Jesper Dangaard Brouer Dec. 4, 2012, 1:30 p.m. UTC
The fragmentation evictor system has a very unfortunate eviction
scheme for killing fragments when the system is put under pressure.

If packets are coming in too fast, the evictor code kills "warm"
fragments too quickly, resulting in close to zero throughput, as
fragments are killed before they have a chance to complete.

This is related to the bad interaction with the LRU (Least Recently
Used) list.  Under load, the LRU list sort-of changes meaning/behavior.
When the LRU head is very new/warm, the head is most likely the one
with the most fragments, and the tail (the latest used or added
element) the one with the least.

This is solved by introducing a creation "jiffie" timestamp
(creation_ts).  If eviction is attempted on an element created in the
same jiffie, then perform tail drop on the LRU list instead.

Signed-off-by: Jesper Dangaard Brouer <jbrouer@redhat.com>

---
V2:
 - Drop the INET_FRAG_FIRST_IN idea for detecting dropped "head" packets

V3:
 - Move the tail drop, from inet_frag_alloc() to inet_frag_evictor()
   This will be close to the same semantics, but at a higher cost.


 include/net/inet_frag.h  |    1 +
 net/ipv4/inet_fragment.c |   12 ++++++++++++
 2 files changed, 13 insertions(+), 0 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet Dec. 4, 2012, 2:47 p.m. UTC | #1
On Tue, 2012-12-04 at 14:30 +0100, Jesper Dangaard Brouer wrote:
> The fragmentation evictor system have a very unfortunate eviction
> system for killing fragment, when the system is put under pressure.
> 
> If packets are coming in too fast, the evictor code kills "warm"
> fragments too quickly.  Resulting in close to zero throughput, as
> fragments are killed before they have a chance to complete
> 
> This is related to the bad interaction with the LRU (Least Recently
> Used) list.  Under load the LRU list sort-of changes meaning/behavior.
> When the LRU head is very new/warm, then the head is most likely the
> one with most fragments and the tail (latest used or added element)
> with least.
> 
> Solved by, introducing a creation "jiffie" timestamp (creation_ts).
> If the element is tried evicted in same jiffie, then perform tail drop
> on the LRU list instead.
> 
> Signed-off-by: Jesper Dangaard Brouer <jbrouer@redhat.com>

This would only 'work' if a reassembled packet can be completed
within one jiffie.

For 64KB packets, this means a 100Mb link won't be able to deliver a
reassembled packet under IP frags load if HZ=1000.

The LRU goal is to be able to select the oldest inet_frag_queue,
because in typical networks packet losses really do happen, and this
is why some packets won't complete their reassembly. They will
naturally be found at the LRU head, and they are probably very fat
(for example, a single packet was lost for the inet_frag_queue).

Choosing the most recent inet_frag_queue is exactly the opposite
strategy. We pay the huge cost of maintaining a central LRU, and we
exactly misuse it.

As long as an inet_frag_queue receives new fragments and is moved to
the LRU tail, it's a candidate for being kept, not a candidate for
being evicted.

Only when an inet_frag_queue is the oldest one, it becomes a candidate
for eviction.

I think you are trying to solve a configuration/tuning problem by
changing a valid strategy.

What's wrong with admitting that the high_thresh/low_thresh default
values should be updated, now that some people apparently want to use
IP fragments in production?

Let's say we allow the use of 1% of memory for frags, instead of the
current 256 KB limit, which was chosen decades ago.

Only in very severe DOS attacks, LRU head 'creation_ts' would possibly
be <= 1ms. And under severe DOS attacks, I am afraid there is nothing we
can do.

(We could eventually avoid the LRU hassle and choose instead a random
drop strategy)

high_thresh/low_thresh should be changed from 'int' to 'long' as
well, so that a 64-bit host could use more than 2GB for frag storage.



Jesper Dangaard Brouer Dec. 4, 2012, 5:51 p.m. UTC | #2
On Tue, 2012-12-04 at 06:47 -0800, Eric Dumazet wrote:
> On Tue, 2012-12-04 at 14:30 +0100, Jesper Dangaard Brouer wrote:
> > The fragmentation evictor system have a very unfortunate eviction
> > system for killing fragment, when the system is put under pressure.
> > 
> > If packets are coming in too fast, the evictor code kills "warm"
> > fragments too quickly.  Resulting in close to zero throughput, as
> > fragments are killed before they have a chance to complete
> > 
> > This is related to the bad interaction with the LRU (Least Recently
> > Used) list.  Under load the LRU list sort-of changes meaning/behavior.
> > When the LRU head is very new/warm, then the head is most likely the
> > one with most fragments and the tail (latest used or added element)
> > with least.
> > 
> > Solved by, introducing a creation "jiffie" timestamp (creation_ts).
> > If the element is tried evicted in same jiffie, then perform tail drop
> > on the LRU list instead.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <jbrouer@redhat.com>

First of all, this patch is not the perfect thing; it's a starting
point of a discussion to find a better solution.


> This would only 'work' if a reassembled packet can be done/completed
> under one jiffie.

True, and I'm not happy with this resolution.  Its only purpose is to
help me detect when the LRU list is reversing its functionality.

This is the *only* message I'm trying to convey:

    **The LRU list is misbehaving** (in this situation)


Perhaps the best option is to implement something other than an
LRU... I just haven't found the correct replacement/idea yet.


> For 64KB packets, this means 100Mb link wont be able to deliver a
> reassembled packet under IP frags load if HZ=1000

True, the 1 jiffie check should be increased, but that's not the
point.  (Also, I make no promise of fairness; I hope we can address
these fairness issues in a later patch, perhaps in combination with
replacing the LRU.)


(Notice: I have run tests with higher high_thresh/low_thresh values, the
results are the same)


> LRU goal is to be able to select the oldest inet_frag_queue, because in
> typical networks, packet losses are really happening and this is why
> some packets wont complete their reassembly. They naturally will be
> found on LRU head, and they probably are very fat (for example a single
> packet was lost for the inet_frag_queue)

Look at what is happening in inet_frag_evictor() when we are under
load.  We will quickly delete all the oldest inet_frag_queues you are
talking about.  After that, what will the LRU list be filled with?
Only new fragments.

Think about what the order of this list is now.  Remember, it only
contains incomplete inet_frag_queue's.

My theory, prove me wrong, is that when the LRU head is very
new/warm, the head is most likely the one with the most fragments,
and the tail (the latest used or added element) the one with the
least fragments.


> Choosing the most recent inet_frag_queue is exactly the opposite
> strategy. We pay the huge cost of maintaining a central LRU, and we
> exactly misuse it.

Then perhaps the LRU list is the wrong choice?

> As long as an inet_frag_queue receives new fragments and is moved to the
> LRU tail, its a candidate for being kept, not a candidate for being
> evicted.

Remember, I have shown/proven that all inet_frag_queue's in the list
have been touched within 1 jiffie.  Which one do you choose for
removal?

(Also remember, if an inet_frag_queue loses one frame at the network
layer, it will not complete, and after 1 jiffie it will be killed by
the evictor.  So, this function still "works".)


> Only when an inet_frag_queue is the oldest one, it becomes a candidate
> for eviction.
> 
> I think you are trying to solve a configuration/tuning problem by
> changing a valid strategy.
> 
> Whats wrong with admitting high_thresh/low_thresh default values should
> be updated, now some people apparently want to use IP fragments in
> production ?

I'm not against increasing the high_thresh/low_thresh default values.
I have tested with your 4MB/3MB settings (and 40/39, and 400/399).
The results are (almost) the same; it's not the problem!  I have shown
you several test results already (added some extra tests below).
And yes, the high_thresh/low_thresh default values should be
increased; I just don't want to discuss how much.

I want to discuss the correctness of the evictor and LRU.  You are
trying to avoid calling the evictor code; you cannot, assuming a
queueing system where packets are arriving at a higher rate than you
can process.
Jesper Dangaard Brouer Dec. 5, 2012, 9:24 a.m. UTC | #3
First of all, this patch contains a small bug (see below), which
resulted in me not testing the correct patch...

Second, this patch does NOT behave as I expected and claimed.  Thus,
the conclusions in my previous response might be wrong!

The previous evictor patch, of letting new fragments enter, worked
amazingly well.  But I suspect this might also be related to a
bug/problem in the evictor loop (which was being hidden by that
patch).

My new *theory* is that the evictor loop will be looping too much if
it finds a fragment which is INET_FRAG_COMPLETE... in that case, we
don't advance the LRU list, and thus will pick up the exact same
inet_frag_queue again in the loop... to get out of the loop we need
another CPU or packet to change the LRU list for us... I'll test that
theory... (it could also be CPUs fighting over the same LRU head
element that causes this) ... more to come...


On Tue, 2012-12-04 at 14:30 +0100, Jesper Dangaard Brouer wrote:
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 4750d2b..d8bf59b 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -178,6 +178,16 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
>  
>  		q = list_first_entry(&nf->lru_list,
>  				struct inet_frag_queue, lru_list);
> +
> +		/* When head of LRU is very new/warm, then the head is
> +		 * most likely the one with most fragments and the
> +		 * tail with least, thus drop tail
> +		 */
> +		if (!force && q->creation_ts == (u32) jiffies) {
> +			q = list_entry(&nf->lru_list.prev,

Remove the "&" in &nf->lru_list.prev

> +				struct inet_frag_queue, lru_list);
> +		}
> +
>  		atomic_inc(&q->refcnt);
>  		read_unlock(&f->lock);


Jesper Dangaard Brouer Dec. 6, 2012, 12:26 p.m. UTC | #4
On Wed, 2012-12-05 at 10:24 +0100, Jesper Dangaard Brouer wrote:
> 
> The previous evictor patch of letting new fragments enter, worked
> amazingly well.  But I suspect, this might also be related to a
> bug/problem in the evictor loop (which were being hidden by that
> patch).

The evictor loop does not contain a bug, just an SMP scalability
issue (which is fixed by later patches).  The first evictor patch,
which does not let new fragments enter, only worked amazingly well
because it hides this (and other) scalability issues, and implicitly
allows frags already "in" to exceed the mem usage for 1 jiffie.  This
invalidates the patch, as the improvement was only a side effect.


> My new *theory* is that the evictor loop, will be looping too much, if
> it finds a fragment which is INET_FRAG_COMPLETE ... in that case, we
> don't advance the LRU list, and thus will pickup the exact same
> inet_frag_queue again in the loop... to get out of the loop we need
> another CPU or packet to change the LRU list for us... I'll test that
> theory... (its could also be CPUs fighting over the same LRU head
> element that cause this) ... more to come...

The above theory does happen, but does not cause excessive looping.
The CPUs are just fighting about who gets to free the inet_frag_queue
and who gets to unlink it from its data structures (resulting, I
guess, in cache bouncing between CPUs).

CPUs are fighting for the same LRU head (inet_frag_queue) element,
which is bad for scalability.  We could fix this by unlinking the
element once a CPU grabs it, but it would require us to change a
read_lock to a write_lock, so we might not gain much performance.

I already (implicitly) fix this in a later patch, where I'm moving
the LRU lists to be per CPU.  So, I don't know if it's worth fixing.


(And yes, I'm using thresh 4Mb/3Mb as my default setting now, but I'm
also experimenting with other thresh sizes)

p.s. Thank you, Eric, for being so persistent, so that I realized
this patch was not good.  We can hopefully now move on to the other
patches, which fix the real scalability issues.

--Jesper


Florian Westphal Dec. 6, 2012, 12:32 p.m. UTC | #5
Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:
> CPUs are fighting for the same LRU head (inet_frag_queue) element,
> which is bad for scalability.  We could fix this by unlinking the
> element once a CPU graps it, but it would require us to change a
> read_lock to a write_lock, thus we might not gain much performance.
> 
> I already (implicit) fix this is a later patch, where I'm moving the
> LRU lists to be per CPU.  So, I don't know if it's worth fixing.

Do you think it's worth trying to remove the LRU list altogether and
just evict from the hash in a round-robin fashion instead?
David Laight Dec. 6, 2012, 1:29 p.m. UTC | #6
> Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:
> > CPUs are fighting for the same LRU head (inet_frag_queue) element,
> > which is bad for scalability.  We could fix this by unlinking the
> > element once a CPU graps it, but it would require us to change a
> > read_lock to a write_lock, thus we might not gain much performance.
> >
> > I already (implicit) fix this is a later patch, where I'm moving the
> > LRU lists to be per CPU.  So, I don't know if it's worth fixing.
> 
> Do you think its worth trying to remove the lru list altogether and
> just evict from the hash in a round-robin fashion instead?

Round-robin will be the same as LRU under overload - so it will have
the same issues.
Random might be better - especially if IP datagrams for which
more than one in-sequence packet have been received are moved
to a second structure.
But you still need something to control the total memory use.

NFS/UDP is about the only thing that generates very large
IP datagrams - and no one in their right mind runs that
over non-local links.

For SMP you might hash to a small array of pointers (to fragments)
each having its own lock. Only evict items with the same hash.
Put the id in the array and you probably won't need to look at
the actual fragment (saving a cache miss) unless it is the one
you want.

	David



Jesper Dangaard Brouer Dec. 6, 2012, 1:55 p.m. UTC | #7
On Thu, 2012-12-06 at 13:32 +0100, Florian Westphal wrote:
> Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:
> > CPUs are fighting for the same LRU head (inet_frag_queue) element,
> > which is bad for scalability.  We could fix this by unlinking the
> > element once a CPU graps it, but it would require us to change a
> > read_lock to a write_lock, thus we might not gain much performance.
> > 
> > I already (implicit) fix this is a later patch, where I'm moving the
> > LRU lists to be per CPU.  So, I don't know if it's worth fixing.
> 
> Do you think its worth trying to remove the lru list altogether and
> just evict from the hash in a round-robin fashion instead?

Perhaps.  But do note that my bashing of the LRU list was wrong.  I
planned to explain that in a separate mail, but basically I was
causing a DoS attack with incomplete fragments on myself, because I
had disabled Ethernet flow-control.  This led me to some false
assumptions about the LRU list behavior (sorry).

The LRU might be the correct solution after all.  If I enable
Ethernet flow-control again, then I have a hard time "activating" the
evictor code (with thresh 4M/3M).  I'll need a separate DoS program,
which can send incomplete fragments (in back-to-back bursts) to
provoke the evictor and LRU.

My cheap DoS reproducer-hack is to disable Ethernet flow-control on
only one interface (out of 3), to cause packet drops and the
incomplete fragments.  The current preliminary result is that the two
other interfaces still get packets through; we don't get the
zero-throughput situation.
 Two interfaces and no DoS: 15342 Mbit/s
 Three interfaces and DoS:   7355 Mbit/s

The reduction might look big, but you have to take into account that
"activating" the evictor code also causes scalability issues of its
own (which could account for the performance drop itself).

--Jesper


Eric Dumazet Dec. 6, 2012, 2:47 p.m. UTC | #8
On Thu, 2012-12-06 at 14:55 +0100, Jesper Dangaard Brouer wrote:

> Perhaps.  But do note my bashing of the LRU list were wrong.  I planned
> to explain that in a separate mail, but basically I were causing a DoS
> attack with incomplete fragments on my self, because I had disabled
> Ethernet flow-control.  Which led me to some false assumptions on the
> LRU list behavior (sorry).
> 
> The LRU might be the correct solution after all.  If I enable Ethernet
> flow-control again, then I have a hard time "activating" the evictor
> code (with thresh 4M/3M) .  I'll need a separate DoS program, which can
> send incomplete fragments (in back-to-back bursts) to provoke the
> evictor and LRU.
> 
> My cheap DoS reproducer-hack is to disable Ethernet flow-control on only
> one interface (out of 3), to cause packet drops and the incomplete
> fragments. The current preliminary results is that the two other
> interfaces still gets packets through, we don't get the zero throughput
> situation.
>  Two interfaces and no DoS: 15342 Mbit/s
>  Three interfaces and DoS:   7355 Mbit/s
> 
> The reduction might look big, but you have to take into account, that
> "activating" the evictor code, is also causing scalability issues of its
> own (which could account for the performance drop it self).

I would try removing the LRU, but keeping the age information (jiffie of
last valid frag received on one inet_frag_queue)

The eviction would be a function of the current memory used for the
frags (percpu_counter for good SMP scalability), divided by the max
allowed size, and ipfrag_time.

Under load, we would evict inet_frag_queues before the ipfrag_time
timer fires, without necessarily having to scan all the frags, only
the ones we find in the bucket we need to parse anyway (and lock).

The whole idea of a full garbage collect under softirq is not scalable,
as it locks a CPU in a non preemptible section for too long.



Jesper Dangaard Brouer Dec. 6, 2012, 3:23 p.m. UTC | #9
On Thu, 2012-12-06 at 06:47 -0800, Eric Dumazet wrote:
> On Thu, 2012-12-06 at 14:55 +0100, Jesper Dangaard Brouer wrote:
> 
> > The LRU might be the correct solution after all.  If I enable Ethernet
> > flow-control again, then I have a hard time "activating" the evictor
> > code (with thresh 4M/3M) .  I'll need a separate DoS program, which can
> > send incomplete fragments (in back-to-back bursts) to provoke the
> > evictor and LRU.
> > 
> > My cheap DoS reproducer-hack is to disable Ethernet flow-control on only
> > one interface (out of 3), to cause packet drops and the incomplete
> > fragments. The current preliminary results is that the two other
> > interfaces still gets packets through, we don't get the zero throughput
> > situation.
> >  Two interfaces and no DoS: 15342 Mbit/s
> >  Three interfaces and DoS:   7355 Mbit/s
> > 
> > The reduction might look big, but you have to take into account, that
> > "activating" the evictor code, is also causing scalability issues of its
> > own (which could account for the performance drop it self).
> 
> I would try removing the LRU, but keeping the age information (jiffie of
> last valid frag received on one inet_frag_queue)

I don't think it's worth optimizing further, atm.

Because the test above is without any of my SMP scalability fixes.
With my SMP fixes, the result is full scalability:

 Three interfaces:  (9601+6723+9432) = 25756 Mbit/s

And the 6723 Mbit/s number is because the old 10G NIC cannot generate
any more...

And I basically cannot use the cheap DoS reproducer-hack, as the
machine/code-path is now too fast...

Running with 4 interfaces, and starting 6 netperf's (to cause more
interleaving and higher mem usage):

 4716+8042+8765+6204+2475+4568 = 34770 Mbit/s

I could just barely manage to get IpReasmFails up to 14.

[jbrouer@dragon ~]$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives                    2980048            0.0
IpInDelivers                    66217              0.0
IpReasmReqds                    2980040            0.0
IpReasmOKs                      66218              0.0
IpReasmFails                    14                 0.0
UdpInDatagrams                  66218              0.0
IpExtInOctets                   4397976885         0.0

So, after the SMP fixes, it's very hard to "activate" the evictor.
We would need to find a slower box, e.g. an embedded one, and tune
the evictor on that, as a multi-CPU machine will now basically scale
"too well" ;-)

--Jesper

David Miller Dec. 6, 2012, 9:38 p.m. UTC | #10
From: "David Laight" <David.Laight@ACULAB.COM>
Date: Thu, 6 Dec 2012 13:29:13 -0000

> NFS/UDP is about the only thing that generates very large
> IP datagrams - and no one in their right mind runs that
> over non-local links.

There are people with real applications that use UDP with
large IP datagrams.  As unfortunate as it is, this is the
reality we have to deal with.

Patch

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 32786a0..7b897b2 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -24,6 +24,7 @@  struct inet_frag_queue {
 	ktime_t			stamp;
 	int			len;        /* total length of orig datagram */
 	int			meat;
+	u32			creation_ts;/* jiffies when queue was created*/
 	__u8			last_in;    /* first/last segment arrived? */
 
 #define INET_FRAG_COMPLETE	4
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 4750d2b..d8bf59b 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -178,6 +178,16 @@  int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
 
 		q = list_first_entry(&nf->lru_list,
 				struct inet_frag_queue, lru_list);
+
+		/* When head of LRU is very new/warm, then the head is
+		 * most likely the one with most fragments and the
+		 * tail with least, thus drop tail
+		 */
+		if (!force && q->creation_ts == (u32) jiffies) {
+			q = list_entry(&nf->lru_list.prev,
+				struct inet_frag_queue, lru_list);
+		}
+
 		atomic_inc(&q->refcnt);
 		read_unlock(&f->lock);
 
@@ -243,11 +253,13 @@  static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 		struct inet_frags *f, void *arg)
 {
 	struct inet_frag_queue *q;
+	// Note: We could also perform the tail drop here
 
 	q = kzalloc(f->qsize, GFP_ATOMIC);
 	if (q == NULL)
 		return NULL;
 
+	q->creation_ts = (u32) jiffies;
 	q->net = nf;
 	f->constructor(q, arg);
 	atomic_add(f->qsize, &nf->mem);