diff mbox

regression with poll(2)

Message ID 20120820090443.GA3275@suse.de
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Mel Gorman Aug. 20, 2012, 9:04 a.m. UTC
On Sun, Aug 19, 2012 at 11:49:31AM -0700, Sage Weil wrote:
> I've bisected and identified this commit:
> 
>     netvm: propagate page->pfmemalloc to skb
>     
>     The skb->pfmemalloc flag gets set to true iff during the slab allocation
>     of data in __alloc_skb that the the PFMEMALLOC reserves were used.  If the
>     packet is fragmented, it is possible that pages will be allocated from the
>     PFMEMALLOC reserve without propagating this information to the skb.  This
>     patch propagates page->pfmemalloc from pages allocated for fragments to
>     the skb.
>     
>     Signed-off-by: Mel Gorman <mgorman@suse.de>
>     Acked-by: David S. Miller <davem@davemloft.net>
>     Cc: Neil Brown <neilb@suse.de>
>     Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
>     Cc: Mike Christie <michaelc@cs.wisc.edu>
>     Cc: Eric B Munson <emunson@mgebm.net>
>     Cc: Eric Dumazet <eric.dumazet@gmail.com>
>     Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
>     Cc: Mel Gorman <mgorman@suse.de>
>     Cc: Christoph Lameter <cl@linux.com>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 

Ok, thanks.

> I've retested several times and confirmed that this change leads to the 
> breakage, and also confirmed that reverting it on top of -rc1 also fixes 
> the problem.
> 
> I've also added some additional instrumentation to my code and confirmed 
> that the process is blocking on poll(2) while netstat is reporting 
> data available on the socket.
> 
> What can I do to help track this down?
> 

Can the following patch be tested please? It is reported to fix an fio
regression that may be similar to what you are experiencing but has not
been picked up yet.

---8<---
From: Alex Shi <alex.shi@intel.com>
Subject: [PATCH] mm: correct page->pfmemalloc to fix deactivate_slab regression

commit cfd19c5a9ec (mm: only set page->pfmemalloc when
ALLOC_NO_WATERMARKS was used) try to narrow down page->pfmemalloc
setting, but it missed some places the pfmemalloc should be set.

So, in __slab_alloc, the unalignment pfmemalloc and ALLOC_NO_WATERMARKS
cause incorrect deactivate_slab() on our core2 server:

    64.73%           fio  [kernel.kallsyms]     [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |
                        |---0.34%-- deactivate_slab
                        |          __slab_alloc
                        |          kmem_cache_alloc
                        |          |

That causes our fio sync write performance has 40% regression.

This patch move the checking in get_page_from_freelist, that resolved
this issue.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 mm/page_alloc.c |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)

Comments

Eric Dumazet Aug. 20, 2012, 9:30 a.m. UTC | #1
On Mon, 2012-08-20 at 10:04 +0100, Mel Gorman wrote:

> Can the following patch be tested please? It is reported to fix an fio
> regression that may be similar to what you are experiencing but has not
> been picked up yet.
> 
> -

This seems to help here.

Boot your machine with "mem=768M" or a bit less depending on your setup,
and try a netperf.

-> before patch :

# netperf
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
localhost.localdomain () port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    14.00       6.05   

-> after patch :

Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    10.00    18509.73   


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil Aug. 20, 2012, 4:54 p.m. UTC | #2
On Mon, 20 Aug 2012, Mel Gorman wrote:
> On Sun, Aug 19, 2012 at 11:49:31AM -0700, Sage Weil wrote:
> > I've bisected and identified this commit:
> > 
> >     netvm: propagate page->pfmemalloc to skb
> >     
> >     The skb->pfmemalloc flag gets set to true iff during the slab allocation
> >     of data in __alloc_skb that the the PFMEMALLOC reserves were used.  If the
> >     packet is fragmented, it is possible that pages will be allocated from the
> >     PFMEMALLOC reserve without propagating this information to the skb.  This
> >     patch propagates page->pfmemalloc from pages allocated for fragments to
> >     the skb.
> >     
> >     Signed-off-by: Mel Gorman <mgorman@suse.de>
> >     Acked-by: David S. Miller <davem@davemloft.net>
> >     Cc: Neil Brown <neilb@suse.de>
> >     Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >     Cc: Mike Christie <michaelc@cs.wisc.edu>
> >     Cc: Eric B Munson <emunson@mgebm.net>
> >     Cc: Eric Dumazet <eric.dumazet@gmail.com>
> >     Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
> >     Cc: Mel Gorman <mgorman@suse.de>
> >     Cc: Christoph Lameter <cl@linux.com>
> >     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> >     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> > 
> 
> Ok, thanks.
> 
> > I've retested several times and confirmed that this change leads to the 
> > breakage, and also confirmed that reverting it on top of -rc1 also fixes 
> > the problem.
> > 
> > I've also added some additional instrumentation to my code and confirmed 
> > that the process is blocking on poll(2) while netstat is reporting 
> > data available on the socket.
> > 
> > What can I do to help track this down?
> > 
> 
> Can the following patch be tested please? It is reported to fix an fio
> regression that may be similar to what you are experiencing but has not
> been picked up yet.

This patch appears to resolve things for me as well, at least after a 
couple of passes.  I'll let you know if I see any further problems come up 
with more testing.

Thanks!
sage


> 
> ---8<---
> From: Alex Shi <alex.shi@intel.com>
> Subject: [PATCH] mm: correct page->pfmemalloc to fix deactivate_slab regression
> 
> commit cfd19c5a9ec (mm: only set page->pfmemalloc when
> ALLOC_NO_WATERMARKS was used) try to narrow down page->pfmemalloc
> setting, but it missed some places the pfmemalloc should be set.
> 
> So, in __slab_alloc, the unalignment pfmemalloc and ALLOC_NO_WATERMARKS
> cause incorrect deactivate_slab() on our core2 server:
> 
>     64.73%           fio  [kernel.kallsyms]     [k] _raw_spin_lock
>                      |
>                      --- _raw_spin_lock
>                         |
>                         |---0.34%-- deactivate_slab
>                         |          __slab_alloc
>                         |          kmem_cache_alloc
>                         |          |
> 
> That causes our fio sync write performance has 40% regression.
> 
> This patch move the checking in get_page_from_freelist, that resolved
> this issue.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  mm/page_alloc.c |   21 +++++++++++----------
>  1 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 009ac28..07f1924 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1928,6 +1928,17 @@ this_zone_full:
>  		zlc_active = 0;
>  		goto zonelist_scan;
>  	}
> +
> +	if (page)
> +		/*
> +		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> +		 * necessary to allocate the page. The expectation is
> +		 * that the caller is taking steps that will free more
> +		 * memory. The caller should avoid the page being used
> +		 * for !PFMEMALLOC purposes.
> +		 */
> +		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
> +
>  	return page;
>  }
>  
> @@ -2389,14 +2400,6 @@ rebalance:
>  				zonelist, high_zoneidx, nodemask,
>  				preferred_zone, migratetype);
>  		if (page) {
> -			/*
> -			 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> -			 * necessary to allocate the page. The expectation is
> -			 * that the caller is taking steps that will free more
> -			 * memory. The caller should avoid the page being used
> -			 * for !PFMEMALLOC purposes.
> -			 */
> -			page->pfmemalloc = true;
>  			goto got_pg;
>  		}
>  	}
> @@ -2569,8 +2572,6 @@ retry_cpuset:
>  		page = __alloc_pages_slowpath(gfp_mask, order,
>  				zonelist, high_zoneidx, nodemask,
>  				preferred_zone, migratetype);
> -	else
> -		page->pfmemalloc = false;
>  
>  	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
>  
> -- 
> 1.7.5.4
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Aug. 20, 2012, 5:02 p.m. UTC | #3
On Mon, Aug 20, 2012 at 2:04 AM, Mel Gorman <mgorman@suse.de> wrote:
>
> Can the following patch be tested please? It is reported to fix an fio
> regression that may be similar to what you are experiencing but has not
> been picked up yet.

Andrew, is this in your queue, or should I take this directly, or
what? It seems to fix the problem for Eric and Sage, at least.

           Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Morton Aug. 20, 2012, 11:20 p.m. UTC | #4
On Mon, 20 Aug 2012 11:30:59 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Mon, 2012-08-20 at 10:04 +0100, Mel Gorman wrote:
> 
> > Can the following patch be tested please? It is reported to fix an fio
> > regression that may be similar to what you are experiencing but has not
> > been picked up yet.
> > 
> > -
> 
> This seems to help here.
> 
> Boot your machine with "mem=768M" or a bit less depending on your setup,
> and try a netperf.
> 
> -> before patch :
> 
> # netperf
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> localhost.localdomain () port 0 AF_INET
> Recv   Send    Send                          
> Socket Socket  Message  Elapsed              
> Size   Size    Size     Time     Throughput  
> bytes  bytes   bytes    secs.    10^6bits/sec  
> 
>  87380  16384  16384    14.00       6.05   
> 
> -> after patch :
> 
> Recv   Send    Send                          
> Socket Socket  Message  Elapsed              
> Size   Size    Size     Time     Throughput  
> bytes  bytes   bytes    secs.    10^6bits/sec  
> 
>  87380  16384  16384    10.00    18509.73   

"seem to help"?  Was previous performance fully restored?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Aug. 21, 2012, 5:16 a.m. UTC | #5
On Mon, 2012-08-20 at 16:20 -0700, Andrew Morton wrote:
> On Mon, 20 Aug 2012 11:30:59 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Mon, 2012-08-20 at 10:04 +0100, Mel Gorman wrote:
> > 
> > > Can the following patch be tested please? It is reported to fix an fio
> > > regression that may be similar to what you are experiencing but has not
> > > been picked up yet.
> > > 
> > > -
> > 
> > This seems to help here.
> > 
> > Boot your machine with "mem=768M" or a bit less depending on your setup,
> > and try a netperf.
> > 
> > -> before patch :
> > 
> > # netperf
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > localhost.localdomain () port 0 AF_INET
> > Recv   Send    Send                          
> > Socket Socket  Message  Elapsed              
> > Size   Size    Size     Time     Throughput  
> > bytes  bytes   bytes    secs.    10^6bits/sec  
> > 
> >  87380  16384  16384    14.00       6.05   
> > 
> > -> after patch :
> > 
> > Recv   Send    Send                          
> > Socket Socket  Message  Elapsed              
> > Size   Size    Size     Time     Throughput  
> > bytes  bytes   bytes    secs.    10^6bits/sec  
> > 
> >  87380  16384  16384    10.00    18509.73   
> 
> "seem to help"?  Was previous performance fully restored?

I did some tests this morning on my HP Z600, and got same numbers than
3.5.0

Of course, its a bit difficult to say, because there is no
CONFIG_PFMEMALLOC to test real impact.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mel Gorman Aug. 21, 2012, 7:05 a.m. UTC | #6
On Mon, Aug 20, 2012 at 09:54:59AM -0700, Sage Weil wrote:
> > <SNIP>
> > 
> > > I've retested several times and confirmed that this change leads to the 
> > > breakage, and also confirmed that reverting it on top of -rc1 also fixes 
> > > the problem.
> > > 
> > > I've also added some additional instrumentation to my code and confirmed 
> > > that the process is blocking on poll(2) while netstat is reporting 
> > > data available on the socket.
> > > 
> > > What can I do to help track this down?
> > > 
> > 
> > Can the following patch be tested please? It is reported to fix an fio
> > regression that may be similar to what you are experiencing but has not
> > been picked up yet.
> 
> This patch appears to resolve things for me as well, at least after a 
> couple of passes.  I'll let you know if I see any further problems come up 
> with more testing.
> 

Thanks very much Sage.
diff mbox

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 009ac28..07f1924 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1928,6 +1928,17 @@  this_zone_full:
 		zlc_active = 0;
 		goto zonelist_scan;
 	}
+
+	if (page)
+		/*
+		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
+		 * necessary to allocate the page. The expectation is
+		 * that the caller is taking steps that will free more
+		 * memory. The caller should avoid the page being used
+		 * for !PFMEMALLOC purposes.
+		 */
+		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+
 	return page;
 }
 
@@ -2389,14 +2400,6 @@  rebalance:
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
 		if (page) {
-			/*
-			 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
-			 * necessary to allocate the page. The expectation is
-			 * that the caller is taking steps that will free more
-			 * memory. The caller should avoid the page being used
-			 * for !PFMEMALLOC purposes.
-			 */
-			page->pfmemalloc = true;
 			goto got_pg;
 		}
 	}
@@ -2569,8 +2572,6 @@  retry_cpuset:
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
-	else
-		page->pfmemalloc = false;
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);