Patchwork Excessive stall times on ext4 in 3.9-rc2

login
register
mail settings
Submitter Mel Gorman
Date April 11, 2013, 5:04 p.m.
Message ID <20130411170402.GB11656@suse.de>
Download mbox | patch
Permalink /patch/235835/
State Accepted
Headers show

Comments

Mel Gorman - April 11, 2013, 5:04 p.m.
On Wed, Apr 10, 2013 at 09:12:45AM -0400, Theodore Ts'o wrote:
> On Wed, Apr 10, 2013 at 11:56:08AM +0100, Mel Gorman wrote:
> > During major activity there is likely to be "good" behaviour
> > with stalls roughly every 30 seconds roughly corresponding to
> > dirty_expire_centiseconds. As you'd expect, the flusher thread is stuck
> > when this happens.
> > 
> >   237 ?        00:00:00 flush-8:0
> > [<ffffffff811a35b9>] sleep_on_buffer+0x9/0x10
> > [<ffffffff811a35ee>] __lock_buffer+0x2e/0x30
> > [<ffffffff8123a21f>] do_get_write_access+0x43f/0x4b0
> 
> If we're stalling on lock_buffer(), that implies that buffer was being
> written, and for some reason it was taking a very long time to
> complete.
> 

Yes.

> It might be worthwhile to put a timestamp in struct dm_crypt_io, and
> record the time when a particular I/O encryption/decryption is getting
> queued to the kcryptd workqueues, and when they finally squirt out.
> 

That somewhat assumes that dm_crypt was at fault which is not unreasonable
but I was skeptical as the workload on dm_crypt was opening a maildir
and mostly reads.

I used a tracepoint in jbd2 to get an idea of what device the buffer_head
was managing and dm did not show up on the list. This is what a trace-cmd
log of the test told me.

       flush-8:0-240   [005]   236.655363: jbd2_lock_buffer_stall: dev 8,8 stall_ms 1096
         awesome-1364  [005]   290.594396: jbd2_lock_buffer_stall: dev 8,6 stall_ms 7312
 gnome-pty-helpe-2256  [005]   290.836952: jbd2_lock_buffer_stall: dev 8,8 stall_ms 7528
       flush-8:0-240   [003]   304.012424: jbd2_lock_buffer_stall: dev 8,8 stall_ms 4472
  gnome-terminal-2332  [005]   308.290879: jbd2_lock_buffer_stall: dev 8,6 stall_ms 3060
         awesome-1364  [006]   308.291318: jbd2_lock_buffer_stall: dev 8,6 stall_ms 3048
       flush-8:0-240   [005]   331.525996: jbd2_lock_buffer_stall: dev 8,5 stall_ms 8732
       flush-8:0-240   [005]   332.353526: jbd2_lock_buffer_stall: dev 8,5 stall_ms 472
       flush-8:0-240   [005]   345.341547: jbd2_lock_buffer_stall: dev 8,5 stall_ms 10024
  gnome-terminal-2418  [005]   347.166876: jbd2_lock_buffer_stall: dev 8,6 stall_ms 11852
         awesome-1364  [005]   347.167082: jbd2_lock_buffer_stall: dev 8,6 stall_ms 11844
       flush-8:0-240   [005]   347.424520: jbd2_lock_buffer_stall: dev 8,5 stall_ms 2012
       flush-8:0-240   [005]   347.583752: jbd2_lock_buffer_stall: dev 8,5 stall_ms 156
       flush-8:0-240   [005]   390.079682: jbd2_lock_buffer_stall: dev 8,8 stall_ms 396
       flush-8:0-240   [002]   407.882385: jbd2_lock_buffer_stall: dev 8,8 stall_ms 12244
       flush-8:0-240   [005]   408.003976: jbd2_lock_buffer_stall: dev 8,8 stall_ms 124
  gnome-terminal-2610  [005]   413.613365: jbd2_lock_buffer_stall: dev 8,6 stall_ms 3400
         awesome-1364  [006]   413.613605: jbd2_lock_buffer_stall: dev 8,6 stall_ms 3736
       flush-8:0-240   [002]   430.706475: jbd2_lock_buffer_stall: dev 8,5 stall_ms 9748
       flush-8:0-240   [005]   458.188896: jbd2_lock_buffer_stall: dev 8,5 stall_ms 7748
       flush-8:0-240   [005]   458.828143: jbd2_lock_buffer_stall: dev 8,5 stall_ms 348
       flush-8:0-240   [006]   459.163814: jbd2_lock_buffer_stall: dev 8,5 stall_ms 252
       flush-8:0-240   [005]   462.340173: jbd2_lock_buffer_stall: dev 8,5 stall_ms 3160
       flush-8:0-240   [005]   469.917705: jbd2_lock_buffer_stall: dev 8,5 stall_ms 6340
       flush-8:0-240   [005]   474.434206: jbd2_lock_buffer_stall: dev 8,5 stall_ms 4512
             tar-2315  [005]   510.043613: jbd2_lock_buffer_stall: dev 8,5 stall_ms 4316
           tclsh-1780  [005]   773.336488: jbd2_lock_buffer_stall: dev 8,5 stall_ms 736
             git-3100  [005]   775.933506: jbd2_lock_buffer_stall: dev 8,5 stall_ms 3664
             git-4763  [005]   864.093317: jbd2_lock_buffer_stall: dev 8,5 stall_ms 140
       flush-8:0-240   [005]   864.242068: jbd2_lock_buffer_stall: dev 8,6 stall_ms 280
             git-4763  [005]   864.264157: jbd2_lock_buffer_stall: dev 8,5 stall_ms 148
       flush-8:0-240   [005]   865.200004: jbd2_lock_buffer_stall: dev 8,5 stall_ms 464
             git-4763  [000]   865.602469: jbd2_lock_buffer_stall: dev 8,5 stall_ms 300
       flush-8:0-240   [005]   865.705448: jbd2_lock_buffer_stall: dev 8,5 stall_ms 500
       flush-8:0-240   [005]   885.367576: jbd2_lock_buffer_stall: dev 8,8 stall_ms 11024
       flush-8:0-240   [005]   895.339697: jbd2_lock_buffer_stall: dev 8,5 stall_ms 120
       flush-8:0-240   [005]   895.765488: jbd2_lock_buffer_stall: dev 8,5 stall_ms 424
 systemd-journal-265   [005]   915.687201: jbd2_lock_buffer_stall: dev 8,8 stall_ms 14844
       flush-8:0-240   [005]   915.690529: jbd2_lock_buffer_stall: dev 8,6 stall_ms 19656
             git-5442  [005]  1034.845674: jbd2_lock_buffer_stall: dev 8,5 stall_ms 344
             git-5442  [005]  1035.157389: jbd2_lock_buffer_stall: dev 8,5 stall_ms 264
       flush-8:0-240   [005]  1035.875478: jbd2_lock_buffer_stall: dev 8,8 stall_ms 1368
       flush-8:0-240   [005]  1036.189218: jbd2_lock_buffer_stall: dev 8,8 stall_ms 312
  gnome-terminal-5592  [005]  1037.318594: jbd2_lock_buffer_stall: dev 8,6 stall_ms 2628
         awesome-1364  [000]  1037.318913: jbd2_lock_buffer_stall: dev 8,6 stall_ms 2632
             git-5789  [005]  1076.805405: jbd2_lock_buffer_stall: dev 8,5 stall_ms 184
       flush-8:0-240   [005]  1078.401721: jbd2_lock_buffer_stall: dev 8,5 stall_ms 700
       flush-8:0-240   [005]  1078.784200: jbd2_lock_buffer_stall: dev 8,5 stall_ms 356
             git-5789  [005]  1079.722683: jbd2_lock_buffer_stall: dev 8,5 stall_ms 1452
       flush-8:0-240   [005]  1109.928552: jbd2_lock_buffer_stall: dev 8,5 stall_ms 976
       flush-8:0-240   [005]  1111.762280: jbd2_lock_buffer_stall: dev 8,5 stall_ms 1832
       flush-8:0-240   [005]  1260.197720: jbd2_lock_buffer_stall: dev 8,5 stall_ms 344
       flush-8:0-240   [005]  1260.403556: jbd2_lock_buffer_stall: dev 8,5 stall_ms 204
       flush-8:0-240   [005]  1260.550904: jbd2_lock_buffer_stall: dev 8,5 stall_ms 108
             git-6598  [005]  1260.832948: jbd2_lock_buffer_stall: dev 8,5 stall_ms 1084
       flush-8:0-240   [005]  1311.736367: jbd2_lock_buffer_stall: dev 8,5 stall_ms 260
       flush-8:0-240   [005]  1313.689297: jbd2_lock_buffer_stall: dev 8,5 stall_ms 412
       flush-8:0-240   [005]  1314.230420: jbd2_lock_buffer_stall: dev 8,5 stall_ms 540
             git-7022  [006]  1314.241607: jbd2_lock_buffer_stall: dev 8,5 stall_ms 668
       flush-8:0-240   [000]  1347.980425: jbd2_lock_buffer_stall: dev 8,5 stall_ms 308
       flush-8:0-240   [005]  1348.164598: jbd2_lock_buffer_stall: dev 8,5 stall_ms 104
             git-7998  [005]  1547.755328: jbd2_lock_buffer_stall: dev 8,5 stall_ms 304
       flush-8:0-240   [006]  1547.764209: jbd2_lock_buffer_stall: dev 8,5 stall_ms 208
       flush-8:0-240   [005]  1548.653365: jbd2_lock_buffer_stall: dev 8,5 stall_ms 844
       flush-8:0-240   [005]  1549.255022: jbd2_lock_buffer_stall: dev 8,5 stall_ms 460
       flush-8:0-240   [005]  1725.036408: jbd2_lock_buffer_stall: dev 8,6 stall_ms 156
             git-8743  [005]  1740.492630: jbd2_lock_buffer_stall: dev 8,5 stall_ms 15032
             git-8743  [005]  1749.485214: jbd2_lock_buffer_stall: dev 8,5 stall_ms 8648
       flush-8:0-240   [005]  1775.937819: jbd2_lock_buffer_stall: dev 8,5 stall_ms 4268
       flush-8:0-240   [006]  1776.335682: jbd2_lock_buffer_stall: dev 8,5 stall_ms 336
       flush-8:0-240   [006]  1776.446799: jbd2_lock_buffer_stall: dev 8,5 stall_ms 112
       flush-8:0-240   [005]  1802.593183: jbd2_lock_buffer_stall: dev 8,6 stall_ms 108
       flush-8:0-240   [006]  1802.809237: jbd2_lock_buffer_stall: dev 8,8 stall_ms 208
       flush-8:0-240   [005]  2012.041976: jbd2_lock_buffer_stall: dev 8,6 stall_ms 292
           tclsh-1778  [005]  2012.055139: jbd2_lock_buffer_stall: dev 8,5 stall_ms 424
  latency-output-1933  [002]  2012.055147: jbd2_lock_buffer_stall: dev 8,5 stall_ms 136
             git-10209 [005]  2012.074584: jbd2_lock_buffer_stall: dev 8,5 stall_ms 164
       flush-8:0-240   [005]  2012.177241: jbd2_lock_buffer_stall: dev 8,5 stall_ms 128
             git-10209 [005]  2012.297472: jbd2_lock_buffer_stall: dev 8,5 stall_ms 216
       flush-8:0-240   [005]  2012.299828: jbd2_lock_buffer_stall: dev 8,5 stall_ms 120

dm is not obviously at fault there. sda5 is /usr/src (git checkout
running there with some logging), sda6 is /home and sda8 is / . This is
the tracepoint patch used.

---8<---
jbd2: Trace when lock_buffer at the start of a journal write takes a long time

While investigating interactivity problems it was clear that processes
sometimes stall for long periods of times if an attempt is made to lock
a buffer that is already part of a transaction. It would stall in a
trace looking something like

[<ffffffff811a39de>] __lock_buffer+0x2e/0x30
[<ffffffff8123a60f>] do_get_write_access+0x43f/0x4b0
[<ffffffff8123a7cb>] jbd2_journal_get_write_access+0x2b/0x50
[<ffffffff81220f79>] __ext4_journal_get_write_access+0x39/0x80
[<ffffffff811f3198>] ext4_reserve_inode_write+0x78/0xa0
[<ffffffff811f3209>] ext4_mark_inode_dirty+0x49/0x220
[<ffffffff811f57d1>] ext4_dirty_inode+0x41/0x60
[<ffffffff8119ac3e>] __mark_inode_dirty+0x4e/0x2d0
[<ffffffff8118b9b9>] update_time+0x79/0xc0
[<ffffffff8118ba98>] file_update_time+0x98/0x100
[<ffffffff81110ffc>] __generic_file_aio_write+0x17c/0x3b0
[<ffffffff811112aa>] generic_file_aio_write+0x7a/0xf0
[<ffffffff811ea853>] ext4_file_write+0x83/0xd0
[<ffffffff81172b23>] do_sync_write+0xa3/0xe0
[<ffffffff811731ae>] vfs_write+0xae/0x180
[<ffffffff8117361d>] sys_write+0x4d/0x90
[<ffffffff8159d62d>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff

There was a suspicion that dm_crypt might be part responsible so this
patch adds a tracepoint capturing when lock_buffer takes too long
in do_get_write_access() that logs what device is being written and
how long the stall was for.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/jbd2/transaction.c       |  8 ++++++++
 include/trace/events/jbd2.h | 21 +++++++++++++++++++++
 2 files changed, 29 insertions(+)

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - April 11, 2013, 6:35 p.m.
On Thu, Apr 11, 2013 at 06:04:02PM +0100, Mel Gorman wrote:
> > If we're stalling on lock_buffer(), that implies that buffer was being
> > written, and for some reason it was taking a very long time to
> > complete.
> > 
> 
> Yes.
> 
> > It might be worthwhile to put a timestamp in struct dm_crypt_io, and
> > record the time when a particular I/O encryption/decryption is getting
> > queued to the kcryptd workqueues, and when they finally squirt out.
> > 
> 
> That somewhat assumes that dm_crypt was at fault which is not unreasonable
> but I was skeptical as the workload on dm_crypt was opening a maildir
> and mostly reads.

Hmm... well, I've reviewed all of the places in the ext4 and jbd2
layer where we call lock_buffer(), and with one exception[1] we're not
holding the the bh locked any longer than necessary.  There are a few
places where we grab a spinlock or two before we can do what we need
to do and then release the lock'ed buffer head, but the only time we
hold the bh locked for long periods of time is when we submit metadata
blocks for I/O.

[1] There is one exception in ext4_xattr_release_block() where I
believe we should move the call to unlock_buffer(bh) before the call
to ext4_free_blocks(), since we've already elevanted the bh count and
ext4_free_blocks() does not need to have the bh locked.  It's not
related to any of the stalls you've repored, though, as near as I can
tell (none of the stack traces include the ext4 xattr code, and this
would only affect external extended attribute blocks).


Could you code which checks the hold time of lock_buffer(), measuing
from when the lock is successfully grabbed, to see if you can see if I
missed some code path in ext4 or jbd2 where the bh is locked and then
there is some call to some function which needs to block for some
random reason?  What I'd suggest is putting a timestamp in buffer_head
structure, which is set by lock_buffer once it is successfully grabbed
the lock, and then in unlock_buffer(), if it is held for more than a
second or some such, to dump out the stack trace.

Because at this point, either I'm missing something or I'm beginning
to suspect that your hard drive (or maybe something the block layer?)
is simply taking a long time to service an I/O request.  Putting in
this check should be able to very quickly determine what code path
and/or which subsystem we should be focused upon.

Thanks,

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - April 11, 2013, 9:33 p.m.
On Thu 11-04-13 14:35:12, Ted Tso wrote:
> On Thu, Apr 11, 2013 at 06:04:02PM +0100, Mel Gorman wrote:
> > > If we're stalling on lock_buffer(), that implies that buffer was being
> > > written, and for some reason it was taking a very long time to
> > > complete.
> > > 
> > 
> > Yes.
> > 
> > > It might be worthwhile to put a timestamp in struct dm_crypt_io, and
> > > record the time when a particular I/O encryption/decryption is getting
> > > queued to the kcryptd workqueues, and when they finally squirt out.
> > > 
> > 
> > That somewhat assumes that dm_crypt was at fault which is not unreasonable
> > but I was skeptical as the workload on dm_crypt was opening a maildir
> > and mostly reads.
> 
> Hmm... well, I've reviewed all of the places in the ext4 and jbd2
> layer where we call lock_buffer(), and with one exception[1] we're not
> holding the the bh locked any longer than necessary.  There are a few
> places where we grab a spinlock or two before we can do what we need
> to do and then release the lock'ed buffer head, but the only time we
> hold the bh locked for long periods of time is when we submit metadata
> blocks for I/O.
> 
> [1] There is one exception in ext4_xattr_release_block() where I
> believe we should move the call to unlock_buffer(bh) before the call
> to ext4_free_blocks(), since we've already elevanted the bh count and
> ext4_free_blocks() does not need to have the bh locked.  It's not
> related to any of the stalls you've repored, though, as near as I can
> tell (none of the stack traces include the ext4 xattr code, and this
> would only affect external extended attribute blocks).
> 
> 
> Could you code which checks the hold time of lock_buffer(), measuing
> from when the lock is successfully grabbed, to see if you can see if I
> missed some code path in ext4 or jbd2 where the bh is locked and then
> there is some call to some function which needs to block for some
> random reason?  What I'd suggest is putting a timestamp in buffer_head
> structure, which is set by lock_buffer once it is successfully grabbed
> the lock, and then in unlock_buffer(), if it is held for more than a
> second or some such, to dump out the stack trace.
> 
> Because at this point, either I'm missing something or I'm beginning
> to suspect that your hard drive (or maybe something the block layer?)
> is simply taking a long time to service an I/O request.  Putting in
> this check should be able to very quickly determine what code path
> and/or which subsystem we should be focused upon.
  I think it might be more enlightening if Mel traced which process in
which funclion is holding the buffer lock. I suspect we'll find out that
the flusher thread has submitted the buffer for IO as an async write and
thus it takes a long time to complete in presence of reads which have
higher priority.

								Honza
Theodore Ts'o - April 12, 2013, 2:57 a.m.
On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
>   I think it might be more enlightening if Mel traced which process in
> which funclion is holding the buffer lock. I suspect we'll find out that
> the flusher thread has submitted the buffer for IO as an async write and
> thus it takes a long time to complete in presence of reads which have
> higher priority.

That's an interesting theory.  If the workload is one which is very
heavy on reads and writes, that could explain the high latency.  That
would explain why those of us who are using primarily SSD's are seeing
the problems, because would be reads are nice and fast.

If that is the case, one possible solution that comes to mind would be
to mark buffer_heads that contain metadata with a flag, so that the
flusher thread can write them back at the same priority as reads.

The only problem I can see with this hypothesis is that if this is the
explanation for what Mel and Jiri are seeing, it's something that
would have been around for a long time, and would affect ext3 as well
as ext4.  That isn't quite consistent, however, with Mel's observation
that this is a probablem which has gotten worse in relatively
recently.

	  	    	   	    	       - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - April 12, 2013, 4:50 a.m.
On Thu, Apr 11, 2013 at 10:57:08PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
> >   I think it might be more enlightening if Mel traced which process in
> > which funclion is holding the buffer lock. I suspect we'll find out that
> > the flusher thread has submitted the buffer for IO as an async write and
> > thus it takes a long time to complete in presence of reads which have
> > higher priority.
> 
> That's an interesting theory.  If the workload is one which is very
> heavy on reads and writes, that could explain the high latency.  That
> would explain why those of us who are using primarily SSD's are seeing
> the problems, because would be reads are nice and fast.
> 
> If that is the case, one possible solution that comes to mind would be
> to mark buffer_heads that contain metadata with a flag, so that the
> flusher thread can write them back at the same priority as reads.

Ext4 is already using REQ_META for this purpose.

I'm surprised that no-one has suggested "change the IO elevator"
yet.....

Cheers,

Dave.
Mel Gorman - April 12, 2013, 9:45 a.m.
On Thu, Apr 11, 2013 at 02:35:12PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 11, 2013 at 06:04:02PM +0100, Mel Gorman wrote:
> > > If we're stalling on lock_buffer(), that implies that buffer was being
> > > written, and for some reason it was taking a very long time to
> > > complete.
> > > 
> > 
> > Yes.
> > 
> > > It might be worthwhile to put a timestamp in struct dm_crypt_io, and
> > > record the time when a particular I/O encryption/decryption is getting
> > > queued to the kcryptd workqueues, and when they finally squirt out.
> > > 
> > 
> > That somewhat assumes that dm_crypt was at fault which is not unreasonable
> > but I was skeptical as the workload on dm_crypt was opening a maildir
> > and mostly reads.
> 
> Hmm... well, I've reviewed all of the places in the ext4 and jbd2
> layer where we call lock_buffer(), and with one exception[1] we're not
> holding the the bh locked any longer than necessary.  There are a few
> places where we grab a spinlock or two before we can do what we need
> to do and then release the lock'ed buffer head, but the only time we
> hold the bh locked for long periods of time is when we submit metadata
> blocks for I/O.
> 

Yeah, ok. This is not the answer I was hoping for but it's the answer I
expected.

> Could you code which checks the hold time of lock_buffer(), measuing
> from when the lock is successfully grabbed, to see if you can see if I
> missed some code path in ext4 or jbd2 where the bh is locked and then
> there is some call to some function which needs to block for some
> random reason?
>
> What I'd suggest is putting a timestamp in buffer_head
> structure, which is set by lock_buffer once it is successfully grabbed
> the lock, and then in unlock_buffer(), if it is held for more than a
> second or some such, to dump out the stack trace.
> 

I can do that but the results might lack meaning. What I could do instead
is use a variation of the page owner tracking patch (current iteration at
https://lkml.org/lkml/2012/12/7/487) to record a stack trace in lock_buffer
and dump it from jbd2/transaction.c if it stalls for too long. I'll report
if I find something useful.

> Because at this point, either I'm missing something or I'm beginning
> to suspect that your hard drive (or maybe something the block layer?)
> is simply taking a long time to service an I/O request. 

It could be because the drive is a piece of crap.
Mel Gorman - April 12, 2013, 9:47 a.m.
On Thu, Apr 11, 2013 at 10:57:08PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
> >   I think it might be more enlightening if Mel traced which process in
> > which funclion is holding the buffer lock. I suspect we'll find out that
> > the flusher thread has submitted the buffer for IO as an async write and
> > thus it takes a long time to complete in presence of reads which have
> > higher priority.
> 
> That's an interesting theory.  If the workload is one which is very
> heavy on reads and writes, that could explain the high latency.  That
> would explain why those of us who are using primarily SSD's are seeing
> the problems, because would be reads are nice and fast.
> 
> If that is the case, one possible solution that comes to mind would be
> to mark buffer_heads that contain metadata with a flag, so that the
> flusher thread can write them back at the same priority as reads.
> 
> The only problem I can see with this hypothesis is that if this is the
> explanation for what Mel and Jiri are seeing, it's something that
> would have been around for a long time, and would affect ext3 as well
> as ext4.  That isn't quite consistent, however, with Mel's observation
> that this is a probablem which has gotten worse in relatively
> recently.
> 

According to the tests I've run, multi-second stalls have been a problem for
a while but never really bothered me. I'm not sure why it felt particularly
bad around -rc2 or why it seems to be better now. Maybe I just had my
cranky pants on.
Tvrtko Ursulin - April 12, 2013, 10:18 a.m.
Hi all,

On Thursday 11 April 2013 22:57:08 Theodore Ts'o wrote:
> That's an interesting theory.  If the workload is one which is very
> heavy on reads and writes, that could explain the high latency.  That
> would explain why those of us who are using primarily SSD's are seeing
> the problems, because would be reads are nice and fast.
> 
> If that is the case, one possible solution that comes to mind would be
> to mark buffer_heads that contain metadata with a flag, so that the
> flusher thread can write them back at the same priority as reads.
> 
> The only problem I can see with this hypothesis is that if this is the
> explanation for what Mel and Jiri are seeing, it's something that
> would have been around for a long time, and would affect ext3 as well
> as ext4.  That isn't quite consistent, however, with Mel's observation
> that this is a probablem which has gotten worse in relatively
> recently.

Dropping in as a casual observer and having missed the start of the thread, 
risking that I will just muddle the water for you.

I had a similar problem for quite a while with ext4, at least that was my 
conclusion since the fix was to migrate one filesystem to xfs which fixed it 
for me. Time period when I observed this was between 3.5 and 3.7 kernels.

Situation was I had an ext4 filesystem (on top of LVM, which was on top of MD 
RAID 1, which was on top of two mechanical hard drives) which was dedicated to 
holding a large SVN check-out. Other filesystems were also ext4 on different 
logical volumes (but same spindles).

Symptoms were long stalls of everything (including window management!) on a 
relatively heavily loaded desktop (which was KDE). Stalls would last anything 
from five to maybe even 30 seconds. Not sure exactly but long enough that you 
think the system has actually crashed. I couldn't even switch away to a 
different virtual terminal during the stall, nothing.

Eventually I traced it down to kdesvn (subversion client) periodically 
refreshing (or something) it's metadata and hence generating some IO on that 
dedicated filesystem. That combined with some other desktop activity had an 
effect of stalling everything else. I thought it was very weird, but I suppose 
KDE and all the rest nowadays do to much IO in everything they do.

Following a hunch I reformatted that filesystem as XFS which fixed the 
problem.

I can't reproduce this now to run any tests so I know this is not very helpful 
now. But perhaps some of the info will be useful to someone.

Tvrtko

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - April 12, 2013, 3:19 p.m.
On Fri, Apr 12, 2013 at 02:50:42PM +1000, Dave Chinner wrote:
> > If that is the case, one possible solution that comes to mind would be
> > to mark buffer_heads that contain metadata with a flag, so that the
> > flusher thread can write them back at the same priority as reads.
> 
> Ext4 is already using REQ_META for this purpose.

We're using REQ_META | REQ_PRIO for reads, not writes.

> I'm surprised that no-one has suggested "change the IO elevator"
> yet.....

Well, testing to see if the stalls go away with the noop schedule is a
good thing to try just to validate the theory.

The thing is, we do want to make ext4 work well with cfq, and
prioritizing non-readahead read requests ahead of data writeback does
make sense.  The issue is with is that metadata writes going through
the block device could in some cases effectively cause a priority
inversion when what had previously been an asynchronous writeback
starts blocking a foreground, user-visible process.

At least, that's the theory; we should confirm that this is indeed
what is causing the data stalls which Mel is reporting on HDD's before
we start figuring out how to fix this problem.

   	 	      	     	 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - April 13, 2013, 1:23 a.m.
On Fri, Apr 12, 2013 at 11:19:52AM -0400, Theodore Ts'o wrote:
> On Fri, Apr 12, 2013 at 02:50:42PM +1000, Dave Chinner wrote:
> > > If that is the case, one possible solution that comes to mind would be
> > > to mark buffer_heads that contain metadata with a flag, so that the
> > > flusher thread can write them back at the same priority as reads.
> > 
> > Ext4 is already using REQ_META for this purpose.
> 
> We're using REQ_META | REQ_PRIO for reads, not writes.
> 
> > I'm surprised that no-one has suggested "change the IO elevator"
> > yet.....
> 
> Well, testing to see if the stalls go away with the noop schedule is a
> good thing to try just to validate the theory.

Exactly.

> The thing is, we do want to make ext4 work well with cfq, and
> prioritizing non-readahead read requests ahead of data writeback does
> make sense.  The issue is with is that metadata writes going through
> the block device could in some cases effectively cause a priority
> inversion when what had previously been an asynchronous writeback
> starts blocking a foreground, user-visible process.

Here's the historic problem with CFQ: it's scheduling algorithms
change from release to release, and so what you tune the filesystem
to for this release is likely to cause different behaviour
in a few releases time.

We've had this problem time and time again with CFQ+XFS, so we
stopped trying to "tune" to a particular elevator long ago.  The
best you can do it tag the Io as appropriately as possible (e.g.
metadata with REQ_META, sync IO with ?_SYNC, etc), and then hope CFQ
hasn't been broken since the last release....

> At least, that's the theory; we should confirm that this is indeed
> what is causing the data stalls which Mel is reporting on HDD's before
> we start figuring out how to fix this problem.

*nod*.

Cheers,

Dave.
Theodore Ts'o - April 21, 2013, 12:05 a.m.
As an update to this thread, we brought up this issue at LSF/MM, and
there is a thought that we should be able to solve this problem by
having lock_buffer() check to see if the buffer is locked due to a
write being queued, to have the priority of the write bumped up in the
write queues to resolve the priority inversion.  I believe Jeff Moyer
was going to look into this, if I remember correctly.

An alternate solution which I've been playing around adds buffer_head
flags so we can indicate that a buffer contains metadata and/or should
have I/O submitted with the REQ_PRIO flag set.

Adding a buffer_head flag for at least BH_Meta is probably a good
thing, since that way the blktrace will be properly annotated.
Whether we should keep the BH_Prio flag or rely on lock_buffer()
automatically raising the priority is, my feeling is that if
lock_buffer() can do the right thing, we should probably do it via
lock_buffer().  I have a feeling this might be decidedly non-trivial,
though, so perhaps we should just doing via BH flags?

	   	      	     	  	- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mel Gorman - April 22, 2013, 2:38 p.m.
(Adding Jeff Moyer to the cc as I'm told he is interested in the blktrace)

On Fri, Apr 12, 2013 at 11:19:52AM -0400, Theodore Ts'o wrote:
> On Fri, Apr 12, 2013 at 02:50:42PM +1000, Dave Chinner wrote:
> > > If that is the case, one possible solution that comes to mind would be
> > > to mark buffer_heads that contain metadata with a flag, so that the
> > > flusher thread can write them back at the same priority as reads.
> > 
> > Ext4 is already using REQ_META for this purpose.
> 
> We're using REQ_META | REQ_PRIO for reads, not writes.
> 
> > I'm surprised that no-one has suggested "change the IO elevator"
> > yet.....
> 
> Well, testing to see if the stalls go away with the noop schedule is a
> good thing to try just to validate the theory.
> 

I still haven't tested with a different elevator. While this bug is
relatively high priority for me, there are still are other issues in the way.

TLDR: Flusher writes pages very quickly after processes dirty a buffer. Reads
starve flusher writes.

Now the ugliness and being a windbag.

I collected blktrace and some other logs and they are available at
http://www.csn.ul.ie/~mel/postings/stalls-20130419/log.tar.gz  and there
is a lot of stuff in there.  The unix time the test started is in the
first line of the file tests-timestamp-bisect . This can help figure out
how far into the test some of the other timestamped logs are

The kernel log with information from the lock_buffer debugging patch is
in dmesg-bisect-gitcheckout . The information in it is race-prone and
cannot be 100% trusted but it's still useful.

iostat is in iostat-bisect-gitcheckout 

Here are a few observations I got from the data.

1. flushers appear to acquire buffer locks *very* quickly after an
   application writes. Look for lines that look like

   "foo failed trylock without holder released 0 ms ago acquired 0 ms ago by bar"

   There are a lot of entries like this

	jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar
	jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar
	jbd2 2124 tar failed trylock without holder released 0 ms ago acquired 0 ms ago by 239 flush-8:0
	jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar
	jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar
	jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar
	jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar

   I expected flushers to be writing back the buffers just released in about
   5 seconds time, not immediately.  It may indicate that when flushers
   wake to clean expired inodes that it keeps cleaning inodes as they are
   being dirtied.

2. The flush thread can prevent a process making forward progress for
   a long time. Take this as an example

         jbd2 stalled dev 8,8 for 8168 ms lock holdtime 20692 ms
         Last Owner 239 flush-8:0 Acquired Stack
          [<ffffffff8100fd8a>] save_stack_trace+0x2a/0x50
          [<ffffffff811a3ad6>] set_lock_buffer_owner+0x86/0x90
          [<ffffffff811a72ee>] __block_write_full_page+0x16e/0x360
          [<ffffffff811a75b3>] block_write_full_page_endio+0xd3/0x110
          [<ffffffff811a7600>] block_write_full_page+0x10/0x20
          [<ffffffff811aa7f3>] blkdev_writepage+0x13/0x20
          [<ffffffff81119352>] __writepage+0x12/0x40
          [<ffffffff81119b56>] write_cache_pages+0x206/0x460
          [<ffffffff81119df5>] generic_writepages+0x45/0x70
          [<ffffffff8111accb>] do_writepages+0x1b/0x30
          [<ffffffff81199d60>] __writeback_single_inode+0x40/0x1b0
          [<ffffffff8119c40a>] writeback_sb_inodes+0x19a/0x350
          [<ffffffff8119c656>] __writeback_inodes_wb+0x96/0xc0
          [<ffffffff8119c8fb>] wb_writeback+0x27b/0x330
          [<ffffffff8119e300>] wb_do_writeback+0x190/0x1d0
          [<ffffffff8119e3c3>] bdi_writeback_thread+0x83/0x280
          [<ffffffff8106901b>] kthread+0xbb/0xc0
          [<ffffffff8159e1fc>] ret_from_fork+0x7c/0xb0
          [<ffffffffffffffff>] 0xffffffffffffffff

   This part is saying that we locked the buffer due to blkdev_writepage
   which I assume must be a metadata update. Based on where we lock the
   buffer, the only reason we would leave the buffer unlocked if this
   was an asynchronous write request leaving the buffer to be unlocked by
   end_buffer_async_write at some time in the future

         Last Owner Activity Stack: 239 flush-8:0
          [<ffffffff812aee61>] __blkdev_issue_zeroout+0x191/0x1a0
          [<ffffffff812aef51>] blkdev_issue_zeroout+0xe1/0xf0
          [<ffffffff8121abe9>] ext4_ext_zeroout.isra.30+0x49/0x60
          [<ffffffff8121ee47>] ext4_ext_convert_to_initialized+0x227/0x5f0
          [<ffffffff8121f8a3>] ext4_ext_handle_uninitialized_extents+0x2f3/0x3a0
          [<ffffffff8121ff57>] ext4_ext_map_blocks+0x5d7/0xa00
          [<ffffffff811f0715>] ext4_map_blocks+0x2d5/0x470
          [<ffffffff811f47da>] mpage_da_map_and_submit+0xba/0x2f0
          [<ffffffff811f52e0>] ext4_da_writepages+0x380/0x620
          [<ffffffff8111accb>] do_writepages+0x1b/0x30
          [<ffffffff81199d60>] __writeback_single_inode+0x40/0x1b0
          [<ffffffff8119c40a>] writeback_sb_inodes+0x19a/0x350
          [<ffffffff8119c656>] __writeback_inodes_wb+0x96/0xc0
          [<ffffffff8119c8fb>] wb_writeback+0x27b/0x330

   This part is indicating that at the time a process tried to acquire
   the buffer lock that flusher was off doing something else entirely.
   That points again to the metadata write being asynchronous.

         Current Owner 1829 stap
          [<ffffffff8100fd8a>] save_stack_trace+0x2a/0x50
          [<ffffffff811a3ad6>] set_lock_buffer_owner+0x86/0x90
          [<ffffffff8123a5b2>] do_get_write_access+0xd2/0x800
          [<ffffffff8123ae2b>] jbd2_journal_get_write_access+0x2b/0x50
          [<ffffffff81221249>] __ext4_journal_get_write_access+0x39/0x80
          [<ffffffff81229bca>] ext4_free_blocks+0x36a/0xbe0
          [<ffffffff8121c686>] ext4_remove_blocks+0x256/0x2d0
          [<ffffffff8121c905>] ext4_ext_rm_leaf+0x205/0x520
          [<ffffffff8121e64c>] ext4_ext_remove_space+0x4dc/0x750
          [<ffffffff8122051b>] ext4_ext_truncate+0x19b/0x1e0
          [<ffffffff811efde5>] ext4_truncate.part.61+0xd5/0xf0
          [<ffffffff811f0ee4>] ext4_truncate+0x34/0x90
          [<ffffffff811f382d>] ext4_setattr+0x18d/0x640
          [<ffffffff8118d3e2>] notify_change+0x1f2/0x3c0
          [<ffffffff811716f9>] do_truncate+0x59/0xa0
          [<ffffffff8117d3f6>] handle_truncate+0x66/0xa0
          [<ffffffff81181576>] do_last+0x626/0x820
          [<ffffffff81181823>] path_openat+0xb3/0x4a0
          [<ffffffff8118237d>] do_filp_open+0x3d/0xa0
          [<ffffffff81172869>] do_sys_open+0xf9/0x1e0
          [<ffffffff8117296c>] sys_open+0x1c/0x20
          [<ffffffff8159e2ad>] system_call_fastpath+0x1a/0x1f
          [<ffffffffffffffff>] 0xffffffffffffffff

   This is just showing where stap was trying to acquire the buffer lock
   truncating data.

3. The blktrace indicates that reads can starve writes from flusher

   While there are people that can look at a blktrace and find problems
   like they are rain man, I'm more like an ADHD squirrel when looking at
   a trace.  I wrote a script to look for what unrelated requests completed
   while an request got stalled for over a second. It seemed like something
   that a tool shoudl already exist for but I didn't find one unless btt
   can give the information somehow.

   Each delayed request is quite long but here is the first example
   discovered by the script

Request 4174 took 1.060828037 to complete
  239    W    260608696  [flush-8:0]
Request started time index 4.731862902
Inflight while queued
  239    W    260608696  [flush-8:0]
    239    W    260608072  [flush-8:0]
    239    W    260607872  [flush-8:0]
    239    W    260608488  [flush-8:0]
    239    W    260608472  [flush-8:0]
    239    W    260608568  [flush-8:0]
    239    W    260608008  [flush-8:0]
    239    W    260607728  [flush-8:0]
    239    W    260607112  [flush-8:0]
    239    W    260608544  [flush-8:0]
    239    W    260622168  [flush-8:0]
    239    W    271863816  [flush-8:0]
    239    W    260608672  [flush-8:0]
    239    W    260607944  [flush-8:0]
    239    W    203833687  [flush-8:0]
   1676    R    541999743 [watch-inbox-ope]
    239    W    260608240  [flush-8:0]
    239    W    203851359  [flush-8:0]
    239    W    272019768  [flush-8:0]
    239    W    260607272  [flush-8:0]
    239    W    260607992  [flush-8:0]
    239    W    483478791  [flush-8:0]
    239    W    260608528  [flush-8:0]
    239    W    260607456  [flush-8:0]
    239    W    261310704  [flush-8:0]
    239    W    260608200  [flush-8:0]
    239    W    260607744  [flush-8:0]
    239    W    204729015  [flush-8:0]
    239    W    204728927  [flush-8:0]
    239    W    260608584  [flush-8:0]
    239    W    260608352  [flush-8:0]
    239    W    270532504  [flush-8:0]
    239    W    260608600  [flush-8:0]
    239    W    260607152  [flush-8:0]
    239    W    260607888  [flush-8:0]
    239    W    260607192  [flush-8:0]
    239    W    260607568  [flush-8:0]
    239    W    260607632  [flush-8:0]
    239    W    271831080  [flush-8:0]
    239    W    260608312  [flush-8:0]
    239    W    260607440  [flush-8:0]
    239    W    204729023  [flush-8:0]
    239    W    260608056  [flush-8:0]
    239    W    272019776  [flush-8:0]
    239    W    260608632  [flush-8:0]
    239    W    260607704  [flush-8:0]
    239    W    271827168  [flush-8:0]
    239    W    260607208  [flush-8:0]
    239    W    260607384  [flush-8:0]
    239    W    260607856  [flush-8:0]
    239    W    260607320  [flush-8:0]
    239    W    271827160  [flush-8:0]
    239    W    260608152  [flush-8:0]
    239    W    261271552  [flush-8:0]
    239    W    260607168  [flush-8:0]
    239    W    260608088  [flush-8:0]
    239    W    260607480  [flush-8:0]
    239    W    260608424  [flush-8:0]
    239    W    260608040  [flush-8:0]
    239    W    260608400  [flush-8:0]
    239    W    260608224  [flush-8:0]
    239    W    260607680  [flush-8:0]
    239    W    260607808  [flush-8:0]
    239    W    266347440  [flush-8:0]
    239    W    260607776  [flush-8:0]
    239    W    260607512  [flush-8:0]
    239    W    266347280  [flush-8:0]
    239    W    260607424  [flush-8:0]
    239    W    260607656  [flush-8:0]
    239    W    260607976  [flush-8:0]
    239    W    260608440  [flush-8:0]
    239    W    260608272  [flush-8:0]
    239    W    260607536  [flush-8:0]
    239    W    260607920  [flush-8:0]
    239    W    260608456  [flush-8:0]
Complete since queueing
 1676    R    541999743 [watch-inbox-ope]
  239    W    203833687  [flush-8:0]
 1676    R    541999759 [watch-inbox-ope]
 1676    R    541999791 [watch-inbox-ope]
 1676    R    541999807 [watch-inbox-ope]
 1676    R    541999839 [watch-inbox-ope]
 1676    R    541999855 [watch-inbox-ope]
 1676    R    542210351 [watch-inbox-ope]
 1676    R    542210367 [watch-inbox-ope]
 1676    R    541999887 [watch-inbox-ope]
 1676    R    541999911 [watch-inbox-ope]
 1676    R    541999935 [watch-inbox-ope]
 1676    R    541999967 [watch-inbox-ope]
 1676   RM    540448791 [watch-inbox-ope]
 1676    R    541999983 [watch-inbox-ope]
 1676    R    542051791 [watch-inbox-ope]
 1676    R    541999999 [watch-inbox-ope]
 1676    R    541949839 [watch-inbox-ope]
 1676    R    541949871 [watch-inbox-ope]
 1676    R    541949903 [watch-inbox-ope]
 1676    R    541949935 [watch-inbox-ope]
 1676    R    541949887 [watch-inbox-ope]
 1676    R    542051823 [watch-inbox-ope]
 1676    R    541949967 [watch-inbox-ope]
 1676    R    542051839 [watch-inbox-ope]
 1676    R    541949999 [watch-inbox-ope]
 1676    R    541950015 [watch-inbox-ope]
 1676    R    541950031 [watch-inbox-ope]
 1676    R    541950047 [watch-inbox-ope]
 1676    R    541950063 [watch-inbox-ope]
 1676    R    542112079 [watch-inbox-ope]
 1676    R    542112095 [watch-inbox-ope]
 1676    R    542112111 [watch-inbox-ope]
 1676    R    542112127 [watch-inbox-ope]
 1676    R    542112847 [watch-inbox-ope]
 1676    R    542112863 [watch-inbox-ope]
 1676   RM    540461311 [watch-inbox-ope]
 1676   RM    540448799 [watch-inbox-ope]
 1676    R    542112879 [watch-inbox-ope]
 1676    R    541950087 [watch-inbox-ope]
 1676    R    541950111 [watch-inbox-ope]
 1676    R    542112895 [watch-inbox-ope]
 1676    R    541950127 [watch-inbox-ope]
 1676    R    541950159 [watch-inbox-ope]
 1676    R    541950175 [watch-inbox-ope]
 1676    R    541950191 [watch-inbox-ope]
 1676    R    541950207 [watch-inbox-ope]
 1676    R    541950239 [watch-inbox-ope]
 1676    R    541950255 [watch-inbox-ope]
 1676    R    541950287 [watch-inbox-ope]
 1676    R    541950303 [watch-inbox-ope]
 1676    R    541950319 [watch-inbox-ope]
 1676    R    542113103 [watch-inbox-ope]
 1676    R    541950343 [watch-inbox-ope]
 1676    R    541950367 [watch-inbox-ope]
 1676    R    541950399 [watch-inbox-ope]
 1676    R    542113119 [watch-inbox-ope]
 1676    R    542113135 [watch-inbox-ope]
 1676    R    541950415 [watch-inbox-ope]
 1676   RM    540448815 [watch-inbox-ope]
 1676    R    542113151 [watch-inbox-ope]
 1676    R    541950447 [watch-inbox-ope]
 1676    R    541950463 [watch-inbox-ope]
 1676    R    542113743 [watch-inbox-ope]
 1676    R    542113759 [watch-inbox-ope]
 1676    R    542113775 [watch-inbox-ope]
 1676    R    542113791 [watch-inbox-ope]
  239    W    203851359  [flush-8:0]
  239    W    204729015  [flush-8:0]
  239    W    204728927  [flush-8:0]
  239    W    204729023  [flush-8:0]
  239    W    260569008  [flush-8:0]
 1676    R    542145871 [watch-inbox-ope]
 1676    R    542145903 [watch-inbox-ope]
 1676    R    542145887 [watch-inbox-ope]
 1676    R    542154639 [watch-inbox-ope]
 1676    R    542154655 [watch-inbox-ope]
 1676    R    542154671 [watch-inbox-ope]
 1676    R    542154687 [watch-inbox-ope]
 1676    R    542154831 [watch-inbox-ope]
 1676    R    542154863 [watch-inbox-ope]
 1676    R    542157647 [watch-inbox-ope]
 1676    R    542157663 [watch-inbox-ope]
 1676    R    541950479 [watch-inbox-ope]
 1676    R    541950503 [watch-inbox-ope]
 1676    R    541950535 [watch-inbox-ope]
 1676    R    541950599 [watch-inbox-ope]
 1676    R    541950727 [watch-inbox-ope]
 1676    R    541950751 [watch-inbox-ope]
 1676    R    541950767 [watch-inbox-ope]
 1676   RM    540448823 [watch-inbox-ope]
 1676    R    541950783 [watch-inbox-ope]
 1676    R    541950807 [watch-inbox-ope]
 1676    R    541950839 [watch-inbox-ope]
 1676    R    541950855 [watch-inbox-ope]
 1676    R    541950879 [watch-inbox-ope]
 1676    R    541950895 [watch-inbox-ope]
 1676    R    541950919 [watch-inbox-ope]
 1676    R    541950951 [watch-inbox-ope]
 1676    R    541950959 [watch-inbox-ope]
 1676    R    541950975 [watch-inbox-ope]
 1676    R    541951007 [watch-inbox-ope]
 1676    R    541951023 [watch-inbox-ope]
 1676    R    541951055 [watch-inbox-ope]
 1676    R    541951087 [watch-inbox-ope]
 1676    R    541951103 [watch-inbox-ope]
 1676    R    541951119 [watch-inbox-ope]
 1676    R    541951143 [watch-inbox-ope]
 1676    R    541951167 [watch-inbox-ope]
 1676    R    542157679 [watch-inbox-ope]
 1676    R    542157695 [watch-inbox-ope]
 1676    R    541951183 [watch-inbox-ope]
 1676    R    541951215 [watch-inbox-ope]
 1676    R    541951231 [watch-inbox-ope]
 1676    R    542158223 [watch-inbox-ope]
 1676   RM    540448831 [watch-inbox-ope]
 1676    R    541951247 [watch-inbox-ope]
 1676    R    541951271 [watch-inbox-ope]
 1676    R    541951295 [watch-inbox-ope]
 1676    R    542158239 [watch-inbox-ope]
 1676    R    542158255 [watch-inbox-ope]
 1676    R    541951311 [watch-inbox-ope]
 1676    R    542158271 [watch-inbox-ope]
 1676    R    541951343 [watch-inbox-ope]
 1676    R    541951359 [watch-inbox-ope]
 1676    R    541951391 [watch-inbox-ope]
 1676    R    541951407 [watch-inbox-ope]
 1676    R    541951423 [watch-inbox-ope]
 1676    R    541951439 [watch-inbox-ope]
 1676    R    541951471 [watch-inbox-ope]
 1676    R    542158607 [watch-inbox-ope]
 1676    R    541951487 [watch-inbox-ope]
 1676    R    542158639 [watch-inbox-ope]
 1676    R    542158655 [watch-inbox-ope]
 1676    R    542187215 [watch-inbox-ope]
 1676    R    542187231 [watch-inbox-ope]
 1676    R    542187247 [watch-inbox-ope]
 1676    R    541951503 [watch-inbox-ope]
 1676   RM    540448839 [watch-inbox-ope]
 1676    R    542187263 [watch-inbox-ope]
 1676    R    541951535 [watch-inbox-ope]
 1676    R    541951551 [watch-inbox-ope]
 1676    R    541951599 [watch-inbox-ope]
 1676    R    541951575 [watch-inbox-ope]
 1676    R    542190607 [watch-inbox-ope]
  239    W    261310704  [flush-8:0]
  239    W    266347280  [flush-8:0]
  239    W    266347440  [flush-8:0]
 1676    R    542190623 [watch-inbox-ope]
 1676    R    542190639 [watch-inbox-ope]
 1676    R    542190655 [watch-inbox-ope]
 1676    R    542193999 [watch-inbox-ope]
 1676    R    542194015 [watch-inbox-ope]
 1676    R    541951631 [watch-inbox-ope]
 1676    R    541951663 [watch-inbox-ope]
 1676    R    541951679 [watch-inbox-ope]
 1676    R    541951711 [watch-inbox-ope]
 1676    R    541951727 [watch-inbox-ope]
 1676    R    541951743 [watch-inbox-ope]
 1676    R    542194031 [watch-inbox-ope]
 1676    R    542194047 [watch-inbox-ope]
 1676    R    542197711 [watch-inbox-ope]
 1676   RM    540448847 [watch-inbox-ope]
 1676    R    541951759 [watch-inbox-ope]
 1676    R    541951783 [watch-inbox-ope]
 1676    R    541951807 [watch-inbox-ope]
 1676    R    542197727 [watch-inbox-ope]
 1676    R    542197743 [watch-inbox-ope]
 1676    R    542197759 [watch-inbox-ope]
 1676    R    541951823 [watch-inbox-ope]
 1676    R    541951855 [watch-inbox-ope]
 1676    R    541951871 [watch-inbox-ope]
 1676    R    541951895 [watch-inbox-ope]
 1676    R    541951919 [watch-inbox-ope]
 1676    R    541951935 [watch-inbox-ope]
 1676    R    541951951 [watch-inbox-ope]
 1676    R    541951967 [watch-inbox-ope]
 1676    R    541951983 [watch-inbox-ope]
 1676    R    542207567 [watch-inbox-ope]
 1676    R    542207599 [watch-inbox-ope]
 1676    R    542210383 [watch-inbox-ope]
 1676    R    542210399 [watch-inbox-ope]
 1676    R    542210415 [watch-inbox-ope]
 1676    R    542210431 [watch-inbox-ope]
 1676   RM    540448855 [watch-inbox-ope]
 1676    R    541952015 [watch-inbox-ope]
 1676    R    541952047 [watch-inbox-ope]
 1676    R    541952063 [watch-inbox-ope]
 1676    R    541952079 [watch-inbox-ope]
 1676    R    541952103 [watch-inbox-ope]
 1676    R    541952127 [watch-inbox-ope]
 1676    R    541952159 [watch-inbox-ope]
 1676    R    541952175 [watch-inbox-ope]
 1676    R    541952207 [watch-inbox-ope]
 1676    R    541952223 [watch-inbox-ope]
 1676    R    541952255 [watch-inbox-ope]
 1676    R    541952303 [watch-inbox-ope]
 1676    R    541952319 [watch-inbox-ope]
 1676    R    541952335 [watch-inbox-ope]
 1676    R    541952351 [watch-inbox-ope]
 1676    R    541952383 [watch-inbox-ope]
 1676    R    542051855 [watch-inbox-ope]
 1676    R    542051871 [watch-inbox-ope]
 1676    R    542051887 [watch-inbox-ope]
 1676    R    542051903 [watch-inbox-ope]
 1676    R    542051919 [watch-inbox-ope]
 1676    R    541952391 [watch-inbox-ope]
 1676    R    541952415 [watch-inbox-ope]
 1676   RM    540448863 [watch-inbox-ope]
 1676    R    542051935 [watch-inbox-ope]
 1676    R    541952431 [watch-inbox-ope]
 1676    R    541952447 [watch-inbox-ope]
 1676    R    541952463 [watch-inbox-ope]
 1676    R    541952487 [watch-inbox-ope]
 1676    R    541952511 [watch-inbox-ope]
 1676    R    541952527 [watch-inbox-ope]
 1676    R    541952559 [watch-inbox-ope]
 1676    R    541952607 [watch-inbox-ope]
 1676    R    541952623 [watch-inbox-ope]
 1676    R    542051951 [watch-inbox-ope]
 1676    R    541952639 [watch-inbox-ope]
 1676    R    542112271 [watch-inbox-ope]
  239    W    261271552  [flush-8:0]
  239    W    270532504  [flush-8:0]
  239    W    271827168  [flush-8:0]
  239    W    271827160  [flush-8:0]
  239    W    271831080  [flush-8:0]
 1676    R    542112287 [watch-inbox-ope]
 1676    R    542112303 [watch-inbox-ope]
 1676    R    542112319 [watch-inbox-ope]
 1676    R    542112335 [watch-inbox-ope]
 1676    R    542112351 [watch-inbox-ope]
 1676    R    542112367 [watch-inbox-ope]
 1676    R    542112383 [watch-inbox-ope]
 1676    R    542112655 [watch-inbox-ope]
 1676   RM    540448871 [watch-inbox-ope]
 1676    R    542112671 [watch-inbox-ope]
 1676    R    542112687 [watch-inbox-ope]
 1676    R    542112703 [watch-inbox-ope]
 1676    R    542112719 [watch-inbox-ope]
 1676    R    542112735 [watch-inbox-ope]
 1676    R    541952655 [watch-inbox-ope]
 1676    R    541952687 [watch-inbox-ope]
 1676    R    541952703 [watch-inbox-ope]
 1676    R    541952735 [watch-inbox-ope]
 1676    R    541952751 [watch-inbox-ope]
 1676    R    542112751 [watch-inbox-ope]
 1676    R    541952767 [watch-inbox-ope]
 1676    R    541952783 [watch-inbox-ope]
 1676    R    541952799 [watch-inbox-ope]
 1676    R    541952815 [watch-inbox-ope]
 1676    R    541952831 [watch-inbox-ope]
 1676    R    541952863 [watch-inbox-ope]
 1676    R    541952879 [watch-inbox-ope]
 1676    R    542113807 [watch-inbox-ope]
 1676    R    541952903 [watch-inbox-ope]
 1676    R    541952935 [watch-inbox-ope]
 1676   RM    540448879 [watch-inbox-ope]
 1676    R    541952959 [watch-inbox-ope]
 1676    R    542113823 [watch-inbox-ope]
 1676    R    542113839 [watch-inbox-ope]
 1676    R    542113855 [watch-inbox-ope]
 1676    R    541952975 [watch-inbox-ope]
 1676    R    541952991 [watch-inbox-ope]
 1676    R    541953007 [watch-inbox-ope]
 1676    R    541953023 [watch-inbox-ope]
 1676    R    541953055 [watch-inbox-ope]
 1676    R    541953071 [watch-inbox-ope]
 1676    R    541953103 [watch-inbox-ope]
 1676    R    541953119 [watch-inbox-ope]
 1676    R    541953135 [watch-inbox-ope]
 1676    R    542113871 [watch-inbox-ope]
 1676    R    542113887 [watch-inbox-ope]
 1676    R    541953167 [watch-inbox-ope]
 1676    R    541953191 [watch-inbox-ope]
 1676    R    541953223 [watch-inbox-ope]
 1676    R    541953247 [watch-inbox-ope]
 1676    R    541953263 [watch-inbox-ope]
 1676    R    541953279 [watch-inbox-ope]
 1676   RM    540448887 [watch-inbox-ope]
 1676    R    541953303 [watch-inbox-ope]
 1676    R    541953327 [watch-inbox-ope]
 1676    R    542113903 [watch-inbox-ope]
 1676    R    542113919 [watch-inbox-ope]
 1676    R    541953359 [watch-inbox-ope]
 1676    R    541953375 [watch-inbox-ope]
 1676    R    541953391 [watch-inbox-ope]
 1676    R    541953407 [watch-inbox-ope]
 1676    R    542145679 [watch-inbox-ope]
 1676    R    542145695 [watch-inbox-ope]
 1676    R    542145711 [watch-inbox-ope]
 1676    R    542145727 [watch-inbox-ope]
 1676    R    542145743 [watch-inbox-ope]
 1676    R    541953423 [watch-inbox-ope]
 1676    R    542145759 [watch-inbox-ope]
 1676    R    541953455 [watch-inbox-ope]
 1676    R    541953471 [watch-inbox-ope]
 1676    R    542145775 [watch-inbox-ope]
 1676    R    542145791 [watch-inbox-ope]
 1676    R    541953487 [watch-inbox-ope]
 1676    R    541953519 [watch-inbox-ope]
 1676   RM    540448895 [watch-inbox-ope]
 1676    R    541953535 [watch-inbox-ope]
 1676    R    541953551 [watch-inbox-ope]
 1676    R    541953567 [watch-inbox-ope]
 1676    R    541953599 [watch-inbox-ope]
 1676    R    541953615 [watch-inbox-ope]
 1676    R    541953631 [watch-inbox-ope]
 1676    R    541953647 [watch-inbox-ope]
 1676    R    542157455 [watch-inbox-ope]
 1676    R    542157471 [watch-inbox-ope]
 1676    R    542157487 [watch-inbox-ope]
 1676    R    541953671 [watch-inbox-ope]
 1676   RA    540386719 [watch-inbox-ope]
 1676   RA    540386727 [watch-inbox-ope]
 1676   RA    540386735 [watch-inbox-ope]
 1676   RA    540386743 [watch-inbox-ope]
 1676   RA    540386751 [watch-inbox-ope]
 1676   RA    540386759 [watch-inbox-ope]
 1676   RA    540386767 [watch-inbox-ope]
 1676   RA    540386775 [watch-inbox-ope]
 1676   RA    540386783 [watch-inbox-ope]
 1676   RA    540386791 [watch-inbox-ope]
 1676   RA    540386799 [watch-inbox-ope]
 1676   RA    540386807 [watch-inbox-ope]
 1676   RA    540386815 [watch-inbox-ope]
 1676   RA    540386823 [watch-inbox-ope]
 1676   RA    540386831 [watch-inbox-ope]
 1676   RA    540386839 [watch-inbox-ope]
 1676   RA    540386847 [watch-inbox-ope]
 1676   RA    540386855 [watch-inbox-ope]
 1676   RA    540386863 [watch-inbox-ope]
 1676   RA    540386871 [watch-inbox-ope]
 1676   RA    540386879 [watch-inbox-ope]
 1676   RA    540386887 [watch-inbox-ope]
 1676   RA    540386895 [watch-inbox-ope]
 1676   RA    540386903 [watch-inbox-ope]
 1676   RA    540386911 [watch-inbox-ope]
 1676   RA    540386919 [watch-inbox-ope]
 1676   RA    540386927 [watch-inbox-ope]
 1676   RA    540386935 [watch-inbox-ope]
 1676   RA    540386943 [watch-inbox-ope]
 1676   RA    540386951 [watch-inbox-ope]
 1676   RA    540386959 [watch-inbox-ope]
 1676   RM    540386711 [watch-inbox-ope]
  239    W    271863816  [flush-8:0]
  239    W    272019768  [flush-8:0]
  239    W    272019776  [flush-8:0]
  239    W    483478791  [flush-8:0]
  239    W    260578312  [flush-8:0]
  239    W    260578400  [flush-8:0]
 1676    R    541953695 [watch-inbox-ope]
 1676    R    541953711 [watch-inbox-ope]
 1676    R    542157503 [watch-inbox-ope]
 1676    R    541953743 [watch-inbox-ope]
 1676    R    541953759 [watch-inbox-ope]
 1676    R    541953775 [watch-inbox-ope]
 1676    R    542157519 [watch-inbox-ope]
 1676    R    541953791 [watch-inbox-ope]
 1676    R    542157551 [watch-inbox-ope]
 1676    R    541953807 [watch-inbox-ope]
 1676    R    541953831 [watch-inbox-ope]
 1676    R    541953863 [watch-inbox-ope]
 1676    R    541953927 [watch-inbox-ope]
 1676    R    541954055 [watch-inbox-ope]
 1676   RM    540448903 [watch-inbox-ope]
 1676    R    542157567 [watch-inbox-ope]
 1676    R    541954127 [watch-inbox-ope]
 1676    R    541954143 [watch-inbox-ope]
 1676    R    541954159 [watch-inbox-ope]
 1676    R    541954183 [watch-inbox-ope]
 1676    R    541954207 [watch-inbox-ope]
 1676    R    541954223 [watch-inbox-ope]
 1676    R    541954239 [watch-inbox-ope]
 1676    R    541954255 [watch-inbox-ope]
 1676    R    541954271 [watch-inbox-ope]
 1676    R    541954287 [watch-inbox-ope]
 1676    R    541954319 [watch-inbox-ope]
 1676    R    541954335 [watch-inbox-ope]
 1676    R    541954351 [watch-inbox-ope]
 1676    R    541954367 [watch-inbox-ope]
 1676    R    541954391 [watch-inbox-ope]
 1676    R    541954415 [watch-inbox-ope]
 1676    R    541954431 [watch-inbox-ope]
 1676    R    541954455 [watch-inbox-ope]
 1676    R    541954479 [watch-inbox-ope]
 1676    R    541954495 [watch-inbox-ope]
 1676   RM    540456719 [watch-inbox-ope]
  239    W    260622168  [flush-8:0]
  239    W    260625528  [flush-8:0]
  239    W    260625608  [flush-8:0]
  239    W    260614368  [flush-8:0]
  239    W    260614336  [flush-8:0]
  239    W    260614304  [flush-8:0]
  239    W    260614280  [flush-8:0]
 1676    R    541954511 [watch-inbox-ope]
 1676    R    541954527 [watch-inbox-ope]
 1676    R    541954543 [watch-inbox-ope]
 1676    R    541954567 [watch-inbox-ope]
 1676    R    541954599 [watch-inbox-ope]
 1676    R    541954607 [watch-inbox-ope]
 1676    R    541954623 [watch-inbox-ope]
 1676    R    541954655 [watch-inbox-ope]
 1676    R    541954671 [watch-inbox-ope]
 1676    R    541954687 [watch-inbox-ope]
 1676    R    542158351 [watch-inbox-ope]
 1676    R    542158367 [watch-inbox-ope]
 1676    R    542158383 [watch-inbox-ope]
 1676    R    542158399 [watch-inbox-ope]
 1676    R    542158415 [watch-inbox-ope]
 1676    R    541954703 [watch-inbox-ope]
 1676    R    541954727 [watch-inbox-ope]
 1676    R    541954751 [watch-inbox-ope]
 1676    R    542158431 [watch-inbox-ope]
 1676    R    542158447 [watch-inbox-ope]
 1676    R    542158463 [watch-inbox-ope]
 1676    R    541954767 [watch-inbox-ope]
 1676   RM    540456727 [watch-inbox-ope]
 1676    R    541954831 [watch-inbox-ope]
 1676    R    541954863 [watch-inbox-ope]
 1676    R    541954783 [watch-inbox-ope]
 1676    R    541954799 [watch-inbox-ope]
 1676    R    541954895 [watch-inbox-ope]
 1676    R    541954911 [watch-inbox-ope]
 1676    R    541954927 [watch-inbox-ope]
 1676    R    541954943 [watch-inbox-ope]
 1676    R    542158479 [watch-inbox-ope]
 1676    R    541954959 [watch-inbox-ope]
 1676    R    542158495 [watch-inbox-ope]
 1676    R    541954983 [watch-inbox-ope]
 1676    R    541955007 [watch-inbox-ope]
 1676    R    541955023 [watch-inbox-ope]
 1676    R    541955047 [watch-inbox-ope]
 1676    R    541955071 [watch-inbox-ope]
 1676    R    541955087 [watch-inbox-ope]
 1676    R    541955119 [watch-inbox-ope]
 1676    R    541955183 [watch-inbox-ope]
  239    W    260607112  [flush-8:0]
  239    W    260607152  [flush-8:0]
  239    W    260607168  [flush-8:0]
  239    W    260607192  [flush-8:0]
  239    W    260607208  [flush-8:0]
  239    W    260607272  [flush-8:0]
  239    W    260607320  [flush-8:0]
  239    W    260607384  [flush-8:0]
  239    W    260607424  [flush-8:0]
  239    W    260607440  [flush-8:0]
  239    W    260607456  [flush-8:0]
  239    W    260607480  [flush-8:0]
  239    W    260607512  [flush-8:0]
  239    W    260607536  [flush-8:0]
  239    W    260607568  [flush-8:0]
  239    W    260607632  [flush-8:0]
  239    W    260607656  [flush-8:0]
  239    W    260607680  [flush-8:0]
  239    W    260607704  [flush-8:0]
  239    W    260607728  [flush-8:0]
  239    W    260607744  [flush-8:0]
  239    W    260607776  [flush-8:0]
  239    W    260607808  [flush-8:0]
  239    W    260607856  [flush-8:0]
  239    W    260607872  [flush-8:0]
  239    W    260607888  [flush-8:0]
  239    W    260607920  [flush-8:0]
  239    W    260607944  [flush-8:0]
  239    W    260607976  [flush-8:0]
  239    W    260607992  [flush-8:0]
  239    W    260608008  [flush-8:0]
  239    W    260608040  [flush-8:0]
  239    W    260608056  [flush-8:0]
  239    W    260608072  [flush-8:0]
  239    W    260608088  [flush-8:0]
  239    W    260608152  [flush-8:0]
  239    W    260608200  [flush-8:0]
  239    W    260608224  [flush-8:0]
  239    W    260608240  [flush-8:0]
  239    W    260608272  [flush-8:0]
  239    W    260608312  [flush-8:0]
  239    W    260608352  [flush-8:0]
  239    W    260608400  [flush-8:0]
  239    W    260608424  [flush-8:0]
  239    W    260608440  [flush-8:0]
 1676    R    541955311 [watch-inbox-ope]
 1676    R    541955327 [watch-inbox-ope]
 1676    R    542158511 [watch-inbox-ope]
 1676    R    542158543 [watch-inbox-ope]
 1676    R    542158559 [watch-inbox-ope]
 1676   RM    540456735 [watch-inbox-ope]
 1676    R    542158575 [watch-inbox-ope]
 1676    R    542158591 [watch-inbox-ope]
 1676    R    541955343 [watch-inbox-ope]
 1676    R    541955359 [watch-inbox-ope]
 1676    R    541955375 [watch-inbox-ope]
 1676    R    541955391 [watch-inbox-ope]
 1676    R    541955423 [watch-inbox-ope]
 1676    R    541955439 [watch-inbox-ope]
 1676    R    542187279 [watch-inbox-ope]
  239    W    260608456  [flush-8:0]
  239    W    260608472  [flush-8:0]
  239    W    260608488  [flush-8:0]
  239    W    260608528  [flush-8:0]
  239    W    260608544  [flush-8:0]
  239    W    260608568  [flush-8:0]
  239    W    260608584  [flush-8:0]
  239    W    260608600  [flush-8:0]
  239    W    260608632  [flush-8:0]
  239    W    260608672  [flush-8:0]
 1676    R    542187295 [watch-inbox-ope]
 1676    R    542187311 [watch-inbox-ope]
 1676    R    541955463 [watch-inbox-ope]
 1676    R    541955503 [watch-inbox-ope]
 1676    R    541955487 [watch-inbox-ope]
 1676    R    542187327 [watch-inbox-ope]
 1676    R    541955535 [watch-inbox-ope]
 1676    R    541955551 [watch-inbox-ope]
 1676    R    541955567 [watch-inbox-ope]
 1676    R    541955583 [watch-inbox-ope]
 1676    R    541955615 [watch-inbox-ope]
 1676    R    541955655 [watch-inbox-ope]
 1676   RM    540456743 [watch-inbox-ope]
 1676    R    541955679 [watch-inbox-ope]
 1676    R    541955695 [watch-inbox-ope]
 1676    R    541955711 [watch-inbox-ope]
 1676    R    542187343 [watch-inbox-ope]
 1676    R    542187359 [watch-inbox-ope]
 1676    R    542187375 [watch-inbox-ope]
 1676    R    542187391 [watch-inbox-ope]
 1676    R    542190479 [watch-inbox-ope]
 1676    R    541955727 [watch-inbox-ope]
 1676    R    541955751 [watch-inbox-ope]
 1676    R    541955775 [watch-inbox-ope]
 1676    R    542190495 [watch-inbox-ope]
 1676    R    542190511 [watch-inbox-ope]
 1676    R    542190527 [watch-inbox-ope]
 1676    R    541955791 [watch-inbox-ope]
 1676    R    541955823 [watch-inbox-ope]
 1676    R    541955839 [watch-inbox-ope]
 1676    R    541955855 [watch-inbox-ope]
 1676    R    541955879 [watch-inbox-ope]
 1676    R    541955903 [watch-inbox-ope]
 1676    R    541955919 [watch-inbox-ope]
 1676   RM    540456751 [watch-inbox-ope]
 1676    R    541955943 [watch-inbox-ope]
 1676    R    541955967 [watch-inbox-ope]
 1676    R    542190543 [watch-inbox-ope]
 1676    R    542190559 [watch-inbox-ope]
 1676    R    542190575 [watch-inbox-ope]
 1676    R    542190591 [watch-inbox-ope]
 1676    R    542193807 [watch-inbox-ope]
 1676    R    541955983 [watch-inbox-ope]
 1676    R    541956015 [watch-inbox-ope]
 1676    R    541956031 [watch-inbox-ope]
 1676    R    541956047 [watch-inbox-ope]
 1676    R    541956079 [watch-inbox-ope]
 1676    R    541956095 [watch-inbox-ope]
 1676    R    542193839 [watch-inbox-ope]
 1676    R    542193855 [watch-inbox-ope]
 1676    R    542193871 [watch-inbox-ope]
 1676    R    541956111 [watch-inbox-ope]
 1676    R    541956143 [watch-inbox-ope]
 1676    R    541956207 [watch-inbox-ope]
 1676    R    541956255 [watch-inbox-ope]
 1676    R    542193887 [watch-inbox-ope]
 1676    R    541956271 [watch-inbox-ope]
 1676    R    541956287 [watch-inbox-ope]
 1676    R    541956335 [watch-inbox-ope]
 1676   RM    540456759 [watch-inbox-ope]
 1676    R    542193903 [watch-inbox-ope]
 1676    R    541956319 [watch-inbox-ope]
 1676    R    541956367 [watch-inbox-ope]
 1676    R    541956399 [watch-inbox-ope]
 1676    R    541956415 [watch-inbox-ope]
 1676    R    541956431 [watch-inbox-ope]
 1676    R    541956447 [watch-inbox-ope]
 1676    R    541956479 [watch-inbox-ope]
 1676    R    542197775 [watch-inbox-ope]
 1676    R    542197791 [watch-inbox-ope]
 1676    R    542197807 [watch-inbox-ope]
 1676    R    542197823 [watch-inbox-ope]
 1676    R    542197839 [watch-inbox-ope]

I recognise that the output will have a WTF reaction but the key
observations to me are

a) a single write request from flusher took over a second to complete
b) at the time it was queued, it was mostly other writes that were in
   the queue at the same time
c) The write request and the parallel writes were all asynchronous write
   requests
D) at the time the request completed, there were a LARGE number of
   other requested queued and completed at the same time.

Of the requests queued and completed in the meantime the breakdown was

     22 RM
     31 RA
     82 W
    445 R

If I'm reading this correctly, it is saying that 22 reads were merged (RM),
31 reads were remapped to another device (RA) which is probably reads from
the dm-crypt partition, 82 were writes (W) which is not far off the number
of writes that were in the queue and 445 were other reads. The delay was
dominated by reads that were queued after the write request and completed
before it.

There are lots of other example but here is part of one from much later
that starts with.

Request 27128 took 7.536721619 to complete
  239    W    188973663  [flush-8:0]

That's saying that the 27128th request in the trace took over 7 seconds
to complete and was an asynchronous write from flusher. The contents of
the queue are displayed at that time and the breakdown of requests is

     23 WS
     86 RM
    124 RA
    442 W
   1931 R

7 seconds later when it was completed the breakdown of completed
requests was

     25 WS
    114 RM
    155 RA
    408 W
   2457 R

In combination, that confirms for me that asynchronous writes from flush
are being starved by reads. When a process requires a buffer that is locked
by that asynchronous write from flusher, it stalls.

> The thing is, we do want to make ext4 work well with cfq, and
> prioritizing non-readahead read requests ahead of data writeback does
> make sense.  The issue is with is that metadata writes going through
> the block device could in some cases effectively cause a priority
> inversion when what had previously been an asynchronous writeback
> starts blocking a foreground, user-visible process.
> 
> At least, that's the theory;

I *think* the data more or less confirms the theory but it'd be nice if
someone else double checked in case I'm seeing what I want to see
instead of what is actually there.
Jeff Moyer - April 22, 2013, 10:42 p.m.
Mel Gorman <mgorman@suse.de> writes:

> (Adding Jeff Moyer to the cc as I'm told he is interested in the blktrace)

Thanks.  I've got a few comments and corrections for you below.

> TLDR: Flusher writes pages very quickly after processes dirty a buffer. Reads
> starve flusher writes.
[snip]

> 3. The blktrace indicates that reads can starve writes from flusher
>
>    While there are people that can look at a blktrace and find problems
>    like they are rain man, I'm more like an ADHD squirrel when looking at
>    a trace.  I wrote a script to look for what unrelated requests completed
>    while an request got stalled for over a second. It seemed like something
>    that a tool shoudl already exist for but I didn't find one unless btt
>    can give the information somehow.

Care to share that script?

[snip]

> I recognise that the output will have a WTF reaction but the key
> observations to me are
>
> a) a single write request from flusher took over a second to complete
> b) at the time it was queued, it was mostly other writes that were in
>    the queue at the same time
> c) The write request and the parallel writes were all asynchronous write
>    requests
> D) at the time the request completed, there were a LARGE number of
>    other requested queued and completed at the same time.
>
> Of the requests queued and completed in the meantime the breakdown was
>
>      22 RM
>      31 RA
>      82 W
>     445 R
>
> If I'm reading this correctly, it is saying that 22 reads were merged (RM),
> 31 reads were remapped to another device (RA) which is probably reads from
> the dm-crypt partition, 82 were writes (W) which is not far off the number
> of writes that were in the queue and 445 were other reads. The delay was
> dominated by reads that were queued after the write request and completed
> before it.

RM == Read Meta
RA == Read Ahead  (remapping, by the way, does not happen across
                   devices, just into partitions)
W and R you understood correctly.

> That's saying that the 27128th request in the trace took over 7 seconds
> to complete and was an asynchronous write from flusher. The contents of
> the queue are displayed at that time and the breakdown of requests is
>
>      23 WS  [JEM: write sync]
>      86 RM  [JEM: Read Meta]
>     124 RA  [JEM: Read Ahead]
>     442 W
>    1931 R
>
> 7 seconds later when it was completed the breakdown of completed
> requests was
>
>      25 WS
>     114 RM
>     155 RA
>     408 W
>    2457 R
>
> In combination, that confirms for me that asynchronous writes from flush
> are being starved by reads. When a process requires a buffer that is locked
> by that asynchronous write from flusher, it stalls.
>
>> The thing is, we do want to make ext4 work well with cfq, and
>> prioritizing non-readahead read requests ahead of data writeback does
>> make sense.  The issue is with is that metadata writes going through
>> the block device could in some cases effectively cause a priority
>> inversion when what had previously been an asynchronous writeback
>> starts blocking a foreground, user-visible process.
>> 
>> At least, that's the theory;
>
> I *think* the data more or less confirms the theory but it'd be nice if
> someone else double checked in case I'm seeing what I want to see
> instead of what is actually there.

Looks sane.  You can also see a lot of "preempt"s in the blkparse
output, which indicates exactly what you're saying.  Any sync request
gets priority over the async requests.

I'll also note that even though your I/O is going all over the place
(D2C is pretty bad, 14ms), most of the time is spent waiting for a
struct request allocation or between Queue and Merge:

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000001   0.000992259   8.898375882     2300861
Q2G               0.000000843  10.193261239 2064.079501935     1016463 <====
G2I               0.000000461   0.000044702   3.237065090     1015803
Q2M               0.000000101   8.203147238 2064.079367557     1311662
I2D               0.000002012   1.476824812 2064.089774419     1014890
M2D               0.000003283   6.994306138 283.573348664     1284872
D2C               0.000061889   0.014438316   0.857811758     2291996
Q2C               0.000072284  13.363007244 2064.092228625     2292191

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8,  0) |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
---------- | --------- --------- --------- --------- ---------
   Overall |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%

I'm not sure I believe that max value.  2064 seconds seems a bit high.
Also, Q2M should not be anywhere near that big, so more investigation is
required there.  A quick look over the data doesn't show any such delays
(making me question the tools), but I'll write some code tomorrow to
verify the btt output.

Jan, if I were to come up with a way of promoting a particular async
queue to the front of the line, where would I put such a call in the
ext4/jbd2 code to be effective?

Mel, can you reproduce this at will?  Do you have a reproducer that I
could run so I'm not constantly bugging you?

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - April 23, 2013, 12:02 a.m.
On Mon, Apr 22, 2013 at 06:42:23PM -0400, Jeff Moyer wrote:
> 
> Jan, if I were to come up with a way of promoting a particular async
> queue to the front of the line, where would I put such a call in the
> ext4/jbd2 code to be effective?

Well, I thought we had discussed trying to bump a pending I/O
automatically when there was an attempt to call lock_buffer() on the
bh?  That would be ideal, because we could keep the async writeback
low priority until someone is trying to wait upon it, at which point
obviously it should no longer be considered an async write call.

Failing that, this is something I've been toying with.... what do you
think?

http://patchwork.ozlabs.org/patch/238192/
http://patchwork.ozlabs.org/patch/238257/

(The first patch in the series just makes sure that allocation bitmap
reads are marked with the META/PRIO flags.  It's not strictly speaking
related to the problem discussed here, but for completeness:
http://patchwork.ozlabs.org/patch/238193/)

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mel Gorman - April 23, 2013, 2:01 p.m.
On Mon, Apr 22, 2013 at 06:42:23PM -0400, Jeff Moyer wrote:
> > 3. The blktrace indicates that reads can starve writes from flusher
> >
> >    While there are people that can look at a blktrace and find problems
> >    like they are rain man, I'm more like an ADHD squirrel when looking at
> >    a trace.  I wrote a script to look for what unrelated requests completed
> >    while an request got stalled for over a second. It seemed like something
> >    that a tool shoudl already exist for but I didn't find one unless btt
> >    can give the information somehow.
> 
> Care to share that script?
> 

I would have preferred not to because it is an ugly hatchet job churned
out in a few minutes. It's written in perl and uses the text output from
blkparse making it slow. It uses an excessive amount of memory because I was
taking shortcuts so is resource heavy. It ignores most of the information
from blkparse and so there are gaps in what it reports. Even though it's
dogshit, it was useful in this particular case so I added it to mmtests
anyway. Be aware that it takes ages to run and you might want to break
the blkparse output into pieces

It's used something like

blkparse -i blktrace-log > blkparse-log
cat blkparse-log | $PATH_TO_MMTESTS/subreport/blktrace-queue-watch.pl

> [snip]
> 
> > I recognise that the output will have a WTF reaction but the key
> > observations to me are
> >
> > a) a single write request from flusher took over a second to complete
> > b) at the time it was queued, it was mostly other writes that were in
> >    the queue at the same time
> > c) The write request and the parallel writes were all asynchronous write
> >    requests
> > D) at the time the request completed, there were a LARGE number of
> >    other requested queued and completed at the same time.
> >
> > Of the requests queued and completed in the meantime the breakdown was
> >
> >      22 RM
> >      31 RA
> >      82 W
> >     445 R
> >
> > If I'm reading this correctly, it is saying that 22 reads were merged (RM),
> > 31 reads were remapped to another device (RA) which is probably reads from
> > the dm-crypt partition, 82 were writes (W) which is not far off the number
> > of writes that were in the queue and 445 were other reads. The delay was
> > dominated by reads that were queued after the write request and completed
> > before it.
> 
> RM == Read Meta
> RA == Read Ahead  (remapping, by the way, does not happen across
>                    devices, just into partitions)
> W and R you understood correctly.
> 

Thanks for those corrections. I misread the meaning of the action
identifiers section of the blkparse manual. I should have double checked
the source.

> >> <SNIP>
> >> The thing is, we do want to make ext4 work well with cfq, and
> >> prioritizing non-readahead read requests ahead of data writeback does
> >> make sense.  The issue is with is that metadata writes going through
> >> the block device could in some cases effectively cause a priority
> >> inversion when what had previously been an asynchronous writeback
> >> starts blocking a foreground, user-visible process.
> >> 
> >> At least, that's the theory;
> >
> > I *think* the data more or less confirms the theory but it'd be nice if
> > someone else double checked in case I'm seeing what I want to see
> > instead of what is actually there.
> 
> Looks sane.  You can also see a lot of "preempt"s in the blkparse
> output, which indicates exactly what you're saying.  Any sync request
> gets priority over the async requests.
> 

Good to know.

> I'll also note that even though your I/O is going all over the place
> (D2C is pretty bad, 14ms), most of the time is spent waiting for a
> struct request allocation or between Queue and Merge:
> 
> ==================== All Devices ====================
> 
>             ALL           MIN           AVG           MAX           N
> --------------- ------------- ------------- ------------- -----------
> 
> Q2Q               0.000000001   0.000992259   8.898375882     2300861
> Q2G               0.000000843  10.193261239 2064.079501935     1016463 <====

This is not normally my sandbox so do you mind spelling this out?

IIUC, the time to allocate the struct request from the slab cache is just a
small portion of this time. The bulk of the time is spent in get_request()
waiting for congestion to clear on the request list for either the sync or
async queue. Once a process goes to sleep on that waitqueue, it has to wait
until enough requests on that queue have been serviced before it gets woken
again at which point it gets priority access to prevent further starvation.
This is the Queue To Get Reqiest (Q2G) delay. What we may be seeing here
is that the async queue was congested and on average, we are waiting for
10 seconds for it to clear. The maximum value may be bogus for reasons
explained later.

Is that accurate?

> G2I               0.000000461   0.000044702   3.237065090     1015803
> Q2M               0.000000101   8.203147238 2064.079367557     1311662
> I2D               0.000002012   1.476824812 2064.089774419     1014890
> M2D               0.000003283   6.994306138 283.573348664     1284872
> D2C               0.000061889   0.014438316   0.857811758     2291996
> Q2C               0.000072284  13.363007244 2064.092228625     2292191
> 
> ==================== Device Overhead ====================
> 
>        DEV |       Q2G       G2I       Q2M       I2D       D2C
> ---------- | --------- --------- --------- --------- ---------
>  (  8,  0) |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
> ---------- | --------- --------- --------- --------- ---------
>    Overall |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
> 
> I'm not sure I believe that max value.  2064 seconds seems a bit high.

It is so I looked closer at the timestamps and there is an one hour
correction about 4400 seconds into the test.  Daylight savings time kicked
in on March 31st and the machine is rarely rebooted until this test case
came along. It looks like there is a timezone or time misconfiguration
on the laptop that starts the machine with the wrong time. NTP must have
corrected the time which skewed the readings in that window severely :(

Normally on my test machines these services are disabled to avoid
exactly this sort of problem.

> Also, Q2M should not be anywhere near that big, so more investigation is
> required there.  A quick look over the data doesn't show any such delays
> (making me question the tools), but I'll write some code tomorrow to
> verify the btt output.
> 

It might be a single set of readings during a time correction that
screwed it.

> Jan, if I were to come up with a way of promoting a particular async
> queue to the front of the line, where would I put such a call in the
> ext4/jbd2 code to be effective?
> 
> Mel, can you reproduce this at will?  Do you have a reproducer that I
> could run so I'm not constantly bugging you?
> 

I can reproduce it at will. Due to the nature of the test, the test
results are variable and unfortunately it is one of the tricker mmtest
configurations to setup.

1. Get access to a webserver
2. Close mmtests to your test machine
   git clone https://github.com/gormanm/mmtests.git
3. Edit shellpacks/common-config.sh and set WEBROOT to a webserver path
4. Create a tar.gz of a large git tree and place it at $WEBROOT/linux-2.6.tar.gz
   Alternatively place a compressed git tree anywhere and edit
   configs/config-global-dhp__io-multiple-source-latency
   and update GITCHECKOUT_SOURCETAR
5. Create a tar.gz of a large maildir directory and place it at
   $WEBROOT/$WEBROOT/maildir.tar.gz
   Alternatively, use an existing maildir folder and set
   MONITOR_INBOX_OPEN_MAILDIR in
   configs/config-global-dhp__io-multiple-source-latency

It's awkward but it's not like there are standard benchmarks lying around
and it seemed the best way to reproduce the problems I typically see early
in the lifetime of a system or when running a git checkout when the tree
has not been used in a few hours. Run the actual test with

./run-mmtests.sh --config configs/config-global-dhp__io-multiple-source-latency --run-monitor test-name-of-your-choice

Results will be in work/log. You'll need to run this as root so it
can run blktrace and so it can drop_caches between git checkouts
(to force disk IO). If systemtap craps out on you, then edit
configs/config-global-dhp__io-multiple-source-latency and remove dstate
from MONITORS_GZIP

If you have trouble getting this running, ping me on IRC.

Thanks.
Mel Gorman - April 23, 2013, 3:33 p.m.
On Sat, Apr 20, 2013 at 08:05:22PM -0400, Theodore Ts'o wrote:
> An alternate solution which I've been playing around adds buffer_head
> flags so we can indicate that a buffer contains metadata and/or should
> have I/O submitted with the REQ_PRIO flag set.
> 

I beefed up the reporting slightly and tested the patches comparing
3.9-rc6 vanilla with your patches. The full report with graphs are at

http://www.csn.ul.ie/~mel/postings/ext4tag-20130423/report.html

                           3.9.0-rc6             3.9.0-rc6
                             vanilla               ext4tag
User    min           0.00 (  0.00%)        0.00 (  0.00%)
User    mean           nan (   nan%)         nan (   nan%)
User    stddev         nan (   nan%)         nan (   nan%)
User    max           0.00 (  0.00%)        0.00 (  0.00%)
User    range         0.00 (  0.00%)        0.00 (  0.00%)
System  min           9.14 (  0.00%)        9.13 (  0.11%)
System  mean          9.60 (  0.00%)        9.73 ( -1.33%)
System  stddev        0.39 (  0.00%)        0.94 (-142.69%)
System  max          10.31 (  0.00%)       11.58 (-12.32%)
System  range         1.17 (  0.00%)        2.45 (-109.40%)
Elapsed min         665.54 (  0.00%)      612.25 (  8.01%)
Elapsed mean        775.35 (  0.00%)      688.01 ( 11.26%)
Elapsed stddev       69.11 (  0.00%)       58.22 ( 15.75%)
Elapsed max         858.40 (  0.00%)      773.06 (  9.94%)
Elapsed range       192.86 (  0.00%)      160.81 ( 16.62%)
CPU     min           3.00 (  0.00%)        3.00 (  0.00%)
CPU     mean          3.60 (  0.00%)        4.20 (-16.67%)
CPU     stddev        0.49 (  0.00%)        0.75 (-52.75%)
CPU     max           4.00 (  0.00%)        5.00 (-25.00%)
CPU     range         1.00 (  0.00%)        2.00 (-100.00%)

The patches appear to improve the git checkout times slightly but this
test is quite variable.

The vmstat figures report some reclaim activity but if you look at the graphs
further down you will see that the bulk of the kswapd reclaim scan and
steal activity is at the start of the test when it's downloading and
untarring a git tree to work with. (I also note that the mouse-over
graph for direct reclaim efficiency is broken but it's not important
right now).

From iostat

                    3.9.0-rc6   3.9.0-rc6
                      vanilla     ext4tag
Mean dm-0-avgqz          1.18        1.19
Mean dm-0-await         17.30       16.50
Mean dm-0-r_await       17.30       16.50
Mean dm-0-w_await        0.94        0.48
Mean sda-avgqz         650.29      719.81
Mean sda-await        2501.33     2597.23
Mean sda-r_await        30.01       24.91
Mean sda-w_await     11228.80    11120.64
Max  dm-0-avgqz         12.30       10.14
Max  dm-0-await         42.65       52.23
Max  dm-0-r_await       42.65       52.23
Max  dm-0-w_await      541.00      263.83
Max  sda-avgqz        3811.93     3375.11
Max  sda-await        7178.61     7170.44
Max  sda-r_await       384.37      297.85
Max  sda-w_await     51353.93    50338.25

There are no really obvious massive advantages to me there and if you look
at the graphs for the avgqs, await etc over time, the patched kernel are
not obviously better. The Wait CPU usage looks roughly the same too.

On the more positive side, the dstate systemtap monitor script tells me
that all processes were stalled for less time -- 9575 seconds versus
10910. The most severe event to stall on is sleep_on_buffer() as a
result of ext4_bread.

Vanilla kernel	3325677 ms stalled with 57 events
Patched kernel  2411471 ms stalled with 42 events

That's a pretty big drop but it gets bad again for the second worst stall --
wait_on_page_bit as a result of generic_file_buffered_write.

Vanilla kernel  1336064 ms stalled with 109 events
Patched kernel  2338781 ms stalled with 164 events

So conceptually the patches make sense but the first set of tests do
not indicate that they'll fix the problem and the stall times do not
indicate that interactivity will be any better. I'll still apply them
and boot them on my main work machine and see how they "feel" this
evening.
Theodore Ts'o - April 23, 2013, 3:50 p.m.
On Tue, Apr 23, 2013 at 04:33:05PM +0100, Mel Gorman wrote:
> That's a pretty big drop but it gets bad again for the second worst stall --
> wait_on_page_bit as a result of generic_file_buffered_write.
> 
> Vanilla kernel  1336064 ms stalled with 109 events
> Patched kernel  2338781 ms stalled with 164 events

Do you have the stack trace for this stall?  I'm wondering if this is
caused by the waiting for stable pages in write_begin() , or something
else.

If it is blocking caused by stable page writeback that's interesting,
since it would imply that something in your workload is trying to
write to a page that has already been modified (i.e., appending to a
log file, or updating a database file).  Does that make sense given
what your workload might be running?

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mel Gorman - April 23, 2013, 4:13 p.m.
On Tue, Apr 23, 2013 at 11:50:19AM -0400, Theodore Ts'o wrote:
> On Tue, Apr 23, 2013 at 04:33:05PM +0100, Mel Gorman wrote:
> > That's a pretty big drop but it gets bad again for the second worst stall --
> > wait_on_page_bit as a result of generic_file_buffered_write.
> > 
> > Vanilla kernel  1336064 ms stalled with 109 events
> > Patched kernel  2338781 ms stalled with 164 events
> 
> Do you have the stack trace for this stall?  I'm wondering if this is
> caused by the waiting for stable pages in write_begin() , or something
> else.
> 

[<ffffffff81110238>] wait_on_page_bit+0x78/0x80
[<ffffffff815af294>] kretprobe_trampoline+0x0/0x4c
[<ffffffff81110e84>] generic_file_buffered_write+0x114/0x2a0
[<ffffffff81111ccd>] __generic_file_aio_write+0x1bd/0x3c0
[<ffffffff81111f4a>] generic_file_aio_write+0x7a/0xf0
[<ffffffff811ee639>] ext4_file_write+0x99/0x420
[<ffffffff81174d87>] do_sync_write+0xa7/0xe0
[<ffffffff81175447>] vfs_write+0xa7/0x180
[<ffffffff811758cd>] sys_write+0x4d/0x90
[<ffffffff815b3eed>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff

The processes that stalled in this particular trace are wget, latency-output,
tar and tclsh. Most of these are sequential writers except for tar which
is both a sequential reader and sequential writers.

> If it is blocking caused by stable page writeback that's interesting,
> since it would imply that something in your workload is trying to
> write to a page that has already been modified (i.e., appending to a
> log file, or updating a database file).  Does that make sense given
> what your workload might be running?
> 

I doubt it is stable write consider the type of processes that are running. I
would expect the bulk of the activity to be sequential readers or writers
of multiple files. The summarised report from the raw data is now at

http://www.csn.ul.ie/~mel/postings/ext4tag-20130423/dstate-summary-vanilla
http://www.csn.ul.ie/~mel/postings/ext4tag-20130423/dstate-summary-ext4tag

It's an aside but the worst of the stalls are incurred by systemd-tmpfile
which were not a deliberate part of the test and yet another thing that
I would not have caught unless I was running tests on my laptop. Looking
closer at that thing, the default configuration is to run the service 15
minutes after boot and after that it runs once a day. It looks like the
bulk of the scanning would be in /var/tmp/ looking at systemds own files
(over 3000 of them) which I'm a little amused by.

My normal test machines would not hit this because they are not systemd
based but the existance of thing thing is worth noting. Any IO-based tests
run on systemd-based distributions may give different results depending
on whether this service triggered during the test or not.
Jeff Moyer - April 24, 2013, 7:09 p.m.
Mel Gorman <mgorman@suse.de> writes:

>> I'll also note that even though your I/O is going all over the place
>> (D2C is pretty bad, 14ms), most of the time is spent waiting for a
>> struct request allocation or between Queue and Merge:
>> 
>> ==================== All Devices ====================
>> 
>>             ALL           MIN           AVG           MAX           N
>> --------------- ------------- ------------- ------------- -----------
>> 
>> Q2Q               0.000000001   0.000992259   8.898375882     2300861
>> Q2G               0.000000843  10.193261239 2064.079501935     1016463 <====
>
> This is not normally my sandbox so do you mind spelling this out?
>
> IIUC, the time to allocate the struct request from the slab cache is just a
> small portion of this time. The bulk of the time is spent in get_request()
> waiting for congestion to clear on the request list for either the sync or
> async queue. Once a process goes to sleep on that waitqueue, it has to wait
> until enough requests on that queue have been serviced before it gets woken
> again at which point it gets priority access to prevent further starvation.
> This is the Queue To Get Reqiest (Q2G) delay. What we may be seeing here
> is that the async queue was congested and on average, we are waiting for
> 10 seconds for it to clear. The maximum value may be bogus for reasons
> explained later.
>
> Is that accurate?

Yes, without getting into excruciating detail.

>> G2I               0.000000461   0.000044702   3.237065090     1015803
>> Q2M               0.000000101   8.203147238 2064.079367557     1311662
>> I2D               0.000002012   1.476824812 2064.089774419     1014890
>> M2D               0.000003283   6.994306138 283.573348664     1284872
>> D2C               0.000061889   0.014438316   0.857811758     2291996
>> Q2C               0.000072284  13.363007244 2064.092228625     2292191
>> 
>> ==================== Device Overhead ====================
>> 
>>        DEV |       Q2G       G2I       Q2M       I2D       D2C
>> ---------- | --------- --------- --------- --------- ---------
>>  (  8,  0) |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
>> ---------- | --------- --------- --------- --------- ---------
>>    Overall |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
>> 
>> I'm not sure I believe that max value.  2064 seconds seems a bit high.
>
> It is so I looked closer at the timestamps and there is an one hour
> correction about 4400 seconds into the test.  Daylight savings time kicked
> in on March 31st and the machine is rarely rebooted until this test case
> came along. It looks like there is a timezone or time misconfiguration
> on the laptop that starts the machine with the wrong time. NTP must have
> corrected the time which skewed the readings in that window severely :(

Not sure I'm buying that argument, as there are no gaps in the blkparse
output.  The logging is not done using wallclock time.  I still haven't
had sufficient time to dig into these numbers.

>> Also, Q2M should not be anywhere near that big, so more investigation is
>> required there.  A quick look over the data doesn't show any such delays
>> (making me question the tools), but I'll write some code tomorrow to
>> verify the btt output.
>> 
>
> It might be a single set of readings during a time correction that
> screwed it.

Again, I don't think so.

> I can reproduce it at will. Due to the nature of the test, the test
> results are variable and unfortunately it is one of the tricker mmtest
> configurations to setup.
>
> 1. Get access to a webserver
> 2. Close mmtests to your test machine
>    git clone https://github.com/gormanm/mmtests.git
> 3. Edit shellpacks/common-config.sh and set WEBROOT to a webserver path
> 4. Create a tar.gz of a large git tree and place it at $WEBROOT/linux-2.6.tar.gz
>    Alternatively place a compressed git tree anywhere and edit
>    configs/config-global-dhp__io-multiple-source-latency
>    and update GITCHECKOUT_SOURCETAR
> 5. Create a tar.gz of a large maildir directory and place it at
>    $WEBROOT/$WEBROOT/maildir.tar.gz
>    Alternatively, use an existing maildir folder and set
>    MONITOR_INBOX_OPEN_MAILDIR in
>    configs/config-global-dhp__io-multiple-source-latency
>
> It's awkward but it's not like there are standard benchmarks lying around
> and it seemed the best way to reproduce the problems I typically see early
> in the lifetime of a system or when running a git checkout when the tree
> has not been used in a few hours. Run the actual test with
>
> ./run-mmtests.sh --config configs/config-global-dhp__io-multiple-source-latency --run-monitor test-name-of-your-choice
>
> Results will be in work/log. You'll need to run this as root so it
> can run blktrace and so it can drop_caches between git checkouts
> (to force disk IO). If systemtap craps out on you, then edit
> configs/config-global-dhp__io-multiple-source-latency and remove dstate
> from MONITORS_GZIP

And how do I determine whether I've hit the problem?

> If you have trouble getting this running, ping me on IRC.

Yes, I'm having issues getting things to go, but you didn't provide me a
time zone, an irc server or a nick to help me find you.  Was that
intentional?  ;-)

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mel Gorman - April 25, 2013, 12:21 p.m.
On Wed, Apr 24, 2013 at 03:09:13PM -0400, Jeff Moyer wrote:
> Mel Gorman <mgorman@suse.de> writes:
> 
> >> I'll also note that even though your I/O is going all over the place
> >> (D2C is pretty bad, 14ms), most of the time is spent waiting for a
> >> struct request allocation or between Queue and Merge:
> >> 
> >> ==================== All Devices ====================
> >> 
> >>             ALL           MIN           AVG           MAX           N
> >> --------------- ------------- ------------- ------------- -----------
> >> 
> >> Q2Q               0.000000001   0.000992259   8.898375882     2300861
> >> Q2G               0.000000843  10.193261239 2064.079501935     1016463 <====
> >
> > This is not normally my sandbox so do you mind spelling this out?
> >
> > IIUC, the time to allocate the struct request from the slab cache is just a
> > small portion of this time. The bulk of the time is spent in get_request()
> > waiting for congestion to clear on the request list for either the sync or
> > async queue. Once a process goes to sleep on that waitqueue, it has to wait
> > until enough requests on that queue have been serviced before it gets woken
> > again at which point it gets priority access to prevent further starvation.
> > This is the Queue To Get Reqiest (Q2G) delay. What we may be seeing here
> > is that the async queue was congested and on average, we are waiting for
> > 10 seconds for it to clear. The maximum value may be bogus for reasons
> > explained later.
> >
> > Is that accurate?
> 
> Yes, without getting into excruciating detail.


Good enough, thanks.

> >> I'm not sure I believe that max value.  2064 seconds seems a bit high.
> >
> > It is so I looked closer at the timestamps and there is an one hour
> > correction about 4400 seconds into the test.  Daylight savings time kicked
> > in on March 31st and the machine is rarely rebooted until this test case
> > came along. It looks like there is a timezone or time misconfiguration
> > on the laptop that starts the machine with the wrong time. NTP must have
> > corrected the time which skewed the readings in that window severely :(
> 
> Not sure I'm buying that argument, as there are no gaps in the blkparse
> output.  The logging is not done using wallclock time.  I still haven't
> had sufficient time to dig into these numbers.
> 

Ok.

> > It's awkward but it's not like there are standard benchmarks lying around
> > and it seemed the best way to reproduce the problems I typically see early
> > in the lifetime of a system or when running a git checkout when the tree
> > has not been used in a few hours. Run the actual test with
> >
> > ./run-mmtests.sh --config configs/config-global-dhp__io-multiple-source-latency --run-monitor test-name-of-your-choice
> >
> > Results will be in work/log. You'll need to run this as root so it
> > can run blktrace and so it can drop_caches between git checkouts
> > (to force disk IO). If systemtap craps out on you, then edit
> > configs/config-global-dhp__io-multiple-source-latency and remove dstate
> > from MONITORS_GZIP
> 
> And how do I determine whether I've hit the problem?
> 

If systemtap is available then

cat work/log/dstate-TESTNAME-gitcheckout | subreport/stap-dstate-frequency

will give you a report on the worst stalls and the stack traces when those
stalls occurred. If the stalls are 10+ seconds then you've certainly hit
the problem.

Alternatively

cd work/log
../../compare-kernels.sh

Look at the average time it takes to run the git checkout. Is it
abnormally high in comparison to if there was no parallel IO? If you do
not know what the normal time is, run with

./run-mmtests.sh --config configs/config-global-dhp__io-multiple-source-latency --no-monitor test-name-no-monitor

The monitors are what's opening the maildir and generating the parallel
IO. If there is an excessive difference between the git checkout times,
then you've hit the problem.

Furthermore, look at the await times. If they do not appear in the
compare-kernels.sh output then

../../bin/compare-mmtests.pl -d . -b gitcheckout -n test-name-of-your-choice --print-monitor iostat

And look at the await times. Are they very high? If so, you hit the
problem. If you want a better look at the await figures over time,
either look at the iostat file or you can graph it with

../../bin/graph-mmtests.sh -d . -b gitcheckout -n test-name-of-your-choice --print-monitor iostat --sub-heading sda-await

where sda-await should be  substituted for whatever the disk is that
you're running the test one.


> > If you have trouble getting this running, ping me on IRC.
> 
> Yes, I'm having issues getting things to go, but you didn't provide me a
> time zone, an irc server or a nick to help me find you.  Was that
> intentional?  ;-)
> 

Not consciously :) . I'm in the IST timezone (GMT+1), OFTC IRC network
and #mm channel. If it's my evening I'm not always responsive so send me
the error output on email and I'll try fixing any problems that way.

Patch

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 325bc01..1be0ccb 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -640,6 +640,7 @@  do_get_write_access(handle_t *handle, struct journal_head *jh,
 	int error;
 	char *frozen_buffer = NULL;
 	int need_copy = 0;
+	unsigned long start_lock, time_lock;
 
 	if (is_handle_aborted(handle))
 		return -EROFS;
@@ -655,9 +656,16 @@  repeat:
 
 	/* @@@ Need to check for errors here at some point. */
 
+ 	start_lock = jiffies;
 	lock_buffer(bh);
 	jbd_lock_bh_state(bh);
 
+	/* If it takes too long to lock the buffer, trace it */
+	time_lock = jbd2_time_diff(start_lock, jiffies);
+	if (time_lock > HZ/10)
+		trace_jbd2_lock_buffer_stall(bh->b_bdev->bd_dev,
+			jiffies_to_msecs(time_lock));
+
 	/* We now hold the buffer lock so it is safe to query the buffer
 	 * state.  Is the buffer dirty?
 	 *
diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
index 070df49..c1d1f3e 100644
--- a/include/trace/events/jbd2.h
+++ b/include/trace/events/jbd2.h
@@ -358,6 +358,27 @@  TRACE_EVENT(jbd2_write_superblock,
 		  MINOR(__entry->dev), __entry->write_op)
 );
 
+TRACE_EVENT(jbd2_lock_buffer_stall,
+
+	TP_PROTO(dev_t dev, unsigned long stall_ms),
+
+	TP_ARGS(dev, stall_ms),
+
+	TP_STRUCT__entry(
+		__field(        dev_t, dev	)
+		__field(unsigned long, stall_ms	)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= dev;
+		__entry->stall_ms	= stall_ms;
+	),
+
+	TP_printk("dev %d,%d stall_ms %lu",
+		MAJOR(__entry->dev), MINOR(__entry->dev),
+		__entry->stall_ms)
+);
+
 #endif /* _TRACE_JBD2_H */
 
 /* This part must be outside protection */