diff mbox

[RFC] Don't do page stablization if !CONFIG_BLKDEV_INTEGRITY

Message ID E1S5QTU-0005Cc-Kl@tytso-glaptop.cam.corp.google.com
State Accepted, archived
Headers show

Commit Message

Theodore Ts'o March 7, 2012, 11:40 p.m. UTC
We've recently discovered a workload at Google where the page
stablization patches (specifically commit 0e499890c1f: ext4: wait for
writeback to complete while making pages writable) resulted in a
**major** performance regression.  As in, kernel threads that were
writing to log files were getting hit by up to 2 seconds stalls, which
very badly hurt a particular application.  Reverting this commit fixed
the performance regression.

The main reason for the page stablizatoin patches was for DIF/DIX
support, right?   So I'm wondering if we should just disable the calls
to wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined.
i.e., something like this.

What do people think?  I have a feeling this is going to be very
controversial....

					- Ted

ext4: Disable page stablization if DIF/DIX not enabled

Requiring processes which are writing to files which are under writeback
until the writeback is complete can result in massive performance hits.
This is especially true if writes are being throttled due to I/O cgroup
limits and the application is running on an especially busy server.

If CONFIG_BLKDEV_INTEGRITY is not enabled, disable page stablization,
since that's the main case where this is needed, and page stablization
can be very painful.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Sandeen March 7, 2012, 11:54 p.m. UTC | #1
On 3/7/12 5:40 PM, Theodore Ts'o wrote:
> We've recently discovered a workload at Google where the page
> stablization patches (specifically commit 0e499890c1f: ext4: wait for
> writeback to complete while making pages writable) resulted in a
> **major** performance regression.  As in, kernel threads that were
> writing to log files were getting hit by up to 2 seconds stalls, which
> very badly hurt a particular application.  Reverting this commit fixed
> the performance regression.
> 
> The main reason for the page stablizatoin patches was for DIF/DIX
> support, right?   So I'm wondering if we should just disable the calls
> to wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined.
> i.e., something like this.

Can you devise a non-secret testcase that demonstrates this?

Thanks,
-Eric

> What do people think?  I have a feeling this is going to be very
> controversial....
> 
> 					- Ted
> 
> ext4: Disable page stablization if DIF/DIX not enabled
> 
> Requiring processes which are writing to files which are under writeback
> until the writeback is complete can result in massive performance hits.
> This is especially true if writes are being throttled due to I/O cgroup
> limits and the application is running on an especially busy server.
> 
> If CONFIG_BLKDEV_INTEGRITY is not enabled, disable page stablization,
> since that's the main case where this is needed, and page stablization
> can be very painful.
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 1a30db7..d25c60f 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2333,7 +2333,9 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		ret = -EAGAIN;
>  		goto out_unlock;
>  	}
> +#ifdef CONFIG_BLKDEV_INTEGRITY
>  	wait_on_page_writeback(page);
> +#endif
>  	return 0;
>  out_unlock:
>  	unlock_page(page);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5f8081c..01f86c5 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4638,8 +4638,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	if (page_has_buffers(page)) {
>  		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
>  					ext4_bh_unmapped)) {
> +#ifdef CONFIG_BLKDEV_INTEGRITY
>  			/* Wait so that we don't change page under IO */
>  			wait_on_page_writeback(page);
> +#endif
>  			ret = VM_FAULT_LOCKED;
>  			goto out;
>  		}
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong March 8, 2012, 12:05 a.m. UTC | #2
On Wed, Mar 07, 2012 at 05:54:11PM -0600, Eric Sandeen wrote:
> On 3/7/12 5:40 PM, Theodore Ts'o wrote:
> > We've recently discovered a workload at Google where the page
> > stablization patches (specifically commit 0e499890c1f: ext4: wait for
> > writeback to complete while making pages writable) resulted in a
> > **major** performance regression.  As in, kernel threads that were
> > writing to log files were getting hit by up to 2 seconds stalls, which
> > very badly hurt a particular application.  Reverting this commit fixed
> > the performance regression.
> > 
> > The main reason for the page stablizatoin patches was for DIF/DIX
> > support, right?   So I'm wondering if we should just disable the calls
> > to wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined.
> > i.e., something like this.
> 
> Can you devise a non-secret testcase that demonstrates this?

It sure would be nice if the block device could know if it requires stable
writeback, and the fs could figure that out.... though iirc there was more to
my patchset than just these two wait_on_page_writeback() calls.

--D
> 
> Thanks,
> -Eric
> 
> > What do people think?  I have a feeling this is going to be very
> > controversial....
> > 
> > 					- Ted
> > 
> > ext4: Disable page stablization if DIF/DIX not enabled
> > 
> > Requiring processes which are writing to files which are under writeback
> > until the writeback is complete can result in massive performance hits.
> > This is especially true if writes are being throttled due to I/O cgroup
> > limits and the application is running on an especially busy server.
> > 
> > If CONFIG_BLKDEV_INTEGRITY is not enabled, disable page stablization,
> > since that's the main case where this is needed, and page stablization
> > can be very painful.
> > 
> > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> > 
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 1a30db7..d25c60f 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -2333,7 +2333,9 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  		ret = -EAGAIN;
> >  		goto out_unlock;
> >  	}
> > +#ifdef CONFIG_BLKDEV_INTEGRITY
> >  	wait_on_page_writeback(page);
> > +#endif
> >  	return 0;
> >  out_unlock:
> >  	unlock_page(page);
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 5f8081c..01f86c5 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4638,8 +4638,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> >  	if (page_has_buffers(page)) {
> >  		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> >  					ext4_bh_unmapped)) {
> > +#ifdef CONFIG_BLKDEV_INTEGRITY
> >  			/* Wait so that we don't change page under IO */
> >  			wait_on_page_writeback(page);
> > +#endif
> >  			ret = VM_FAULT_LOCKED;
> >  			goto out;
> >  		}
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Boaz Harrosh March 8, 2012, 12:23 a.m. UTC | #3
On 03/07/2012 03:40 PM, Theodore Ts'o wrote:
> 
> We've recently discovered a workload at Google where the page
> stablization patches (specifically commit 0e499890c1f: ext4: wait for
> writeback to complete while making pages writable) resulted in a
> **major** performance regression.  As in, kernel threads that were
> writing to log files were getting hit by up to 2 seconds stalls, which
> very badly hurt a particular application.  

That 2 seconds hit I think I know how to fix somewhat with a smarter
write-back. I want to talk about this in LSF with people

> Reverting this commit fixed the performance regression.
> 
> The main reason for the page stablizatoin patches was for DIF/DIX
> support, right?   So I'm wondering if we should just disable the calls
> to wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined.
> i.e., something like this.
> 
> What do people think?  I have a feeling this is going to be very
> controversial....
> 

NACK

It's not a CONFIG_ thing it's: Is this particular device needs stable pages?

As I stated many times before, the device should have a property that
says if it needs stable pages or not. The candidates for stable pages are:

	- DIF/DIX enabled devices
	- RAID-1/4/5/6 devices
	- iscsi devices with data digest signature
	- Any other checksum enabled block device.

A fedora distro will have CONFIG_BLKDEV_INTEGRITY set then you are always
out of luck, even with devices that can care less.

Please submit a proper patch, even a temporary mount option. But this is
ABI. The best is to find where to export it as part of the device's
properties sysfs dir. And inspect that

> 					- Ted
> 

Thanks
Boaz

> ext4: Disable page stablization if DIF/DIX not enabled
> 
> Requiring processes which are writing to files which are under writeback
> until the writeback is complete can result in massive performance hits.
> This is especially true if writes are being throttled due to I/O cgroup
> limits and the application is running on an especially busy server.
> 
> If CONFIG_BLKDEV_INTEGRITY is not enabled, disable page stablization,
> since that's the main case where this is needed, and page stablization
> can be very painful.
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 1a30db7..d25c60f 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2333,7 +2333,9 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		ret = -EAGAIN;
>  		goto out_unlock;
>  	}
> +#ifdef CONFIG_BLKDEV_INTEGRITY
>  	wait_on_page_writeback(page);
> +#endif
>  	return 0;
>  out_unlock:
>  	unlock_page(page);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5f8081c..01f86c5 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4638,8 +4638,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	if (page_has_buffers(page)) {
>  		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
>  					ext4_bh_unmapped)) {
> +#ifdef CONFIG_BLKDEV_INTEGRITY
>  			/* Wait so that we don't change page under IO */
>  			wait_on_page_writeback(page);
> +#endif
>  			ret = VM_FAULT_LOCKED;
>  			goto out;
>  		}
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zach Brown March 8, 2012, 2:39 a.m. UTC | #4
> Can you devise a non-secret testcase that demonstrates this?

Hmm.  I bet you could get fio to do it.  Giant file, random mmap()
writes, spin until the CPU overwhelms writeback?

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o March 8, 2012, 3:54 p.m. UTC | #5
On Wed, Mar 07, 2012 at 09:39:50PM -0500, Zach Brown wrote:
> 
> >Can you devise a non-secret testcase that demonstrates this?
> 
> Hmm.  I bet you could get fio to do it.  Giant file, random mmap()
> writes, spin until the CPU overwhelms writeback?

Kick off a bunch of fio processes, each in separate I/O cgroups set up
so that each of the processes get a "fair" amount of the I/O
bandwidth.  (This is quite common in cloud deployments where you are
packing a huge number of tasks onto a single box; whether the tasks
are inside virtual machines or containers don't really matter for the
purpose of this exercise.  We basically need to simulate a system
where the disks are busy.)

Then in one of those cgroups, create a process which is constantly
appending to a file using buffered I/O; this could be a log file, or
an application-level journal file; and measure the latency of that
write system call.  Every so often, writeback will push the dirty
pages corresponding to the log/journal file to disk.  When that
happens, and page stablization is enabled, the latency of that write
system call will spike.

And any time you have a distributed system where you are depending on
a large number of RPC/SOAP/Service Oriented Architecture Enterpise
Service Bus calls (I don't really care which buzzword you use, but IBM
and Oracle really like the last one :-), long-tail latencies are what
kill your responsiveness and predictability.  Especially when a thread
goes away for a second or more...

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason March 8, 2012, 6:09 p.m. UTC | #6
On Thu, Mar 08, 2012 at 10:54:19AM -0500, Ted Ts'o wrote:
> On Wed, Mar 07, 2012 at 09:39:50PM -0500, Zach Brown wrote:
> > 
> > >Can you devise a non-secret testcase that demonstrates this?
> > 
> > Hmm.  I bet you could get fio to do it.  Giant file, random mmap()
> > writes, spin until the CPU overwhelms writeback?
> 
> Kick off a bunch of fio processes, each in separate I/O cgroups set up
> so that each of the processes get a "fair" amount of the I/O
> bandwidth.  (This is quite common in cloud deployments where you are
> packing a huge number of tasks onto a single box; whether the tasks
> are inside virtual machines or containers don't really matter for the
> purpose of this exercise.  We basically need to simulate a system
> where the disks are busy.)
> 
> Then in one of those cgroups, create a process which is constantly
> appending to a file using buffered I/O; this could be a log file, or
> an application-level journal file; and measure the latency of that
> write system call.  Every so often, writeback will push the dirty
> pages corresponding to the log/journal file to disk.  When that
> happens, and page stablization is enabled, the latency of that write
> system call will spike.
> 
> And any time you have a distributed system where you are depending on
> a large number of RPC/SOAP/Service Oriented Architecture Enterpise
> Service Bus calls (I don't really care which buzzword you use, but IBM
> and Oracle really like the last one :-), long-tail latencies are what
> kill your responsiveness and predictability.  Especially when a thread
> goes away for a second or more...

But, why are we writeback for a second or more?  Aren't there other
parts of this we would want to fix as well?

I'm not against only turning on stable pages when they are needed, but
the code that isn't the default tends to be somewhat less used.  So it
does increase testing burden when we do want stable pages, and it tends
to make for awkward bugs that are hard to reproduce because someone
neglects to mention it.

IMHO it's much more important to nail down the 2 second writeback
latency. That's not good.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Boaz Harrosh March 8, 2012, 8:20 p.m. UTC | #7
On 03/08/2012 10:09 AM, Chris Mason wrote:
> 
> But, why are we writeback for a second or more?  Aren't there other
> parts of this we would want to fix as well?
> 
> I'm not against only turning on stable pages when they are needed, but
> the code that isn't the default tends to be somewhat less used.  So it
> does increase testing burden when we do want stable pages, and it tends
> to make for awkward bugs that are hard to reproduce because someone
> neglects to mention it.
> 
> IMHO it's much more important to nail down the 2 second writeback
> latency. That's not good.
> 

I think I understand this one. It's do to the sync nature introduced
by page_waiting in mkwrite.

The system is loaded everything is somewhat 2 second or more in a lag.
The 2 sec (or more) comes from the max-dirty-limit/disk-speed so any
IO you'll submit will probably be on stable disk 2 sec later. (In theory,
any power fail will loose all dirty pages which is in our case
max-dirty-limit)

Now usually that's fine because everything is queued and waits a bit
evenly distributed and you wait, theoretically, only the rate of your
IO. But here, all of a sudden, you are not allowed to be queued and you
are waiting for the head of queue to be actually done, and the app is
just frozen.

Actually now when I think of it the pages were already submitted for
them to be waited on. So the 2-sec is the depth of the block+scsi+target
queues. I guess they can be pretty deep.

I have a theory of how we can fix that 2-sec wait, by avoiding writeback of
the last n pages of an inode who's mtime is less then 2-sec. This would
solve any sequential writer wait penalty, which is Ted's case

Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason March 8, 2012, 8:37 p.m. UTC | #8
On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> On 03/08/2012 10:09 AM, Chris Mason wrote:
> > 
> > But, why are we writeback for a second or more?  Aren't there other
> > parts of this we would want to fix as well?
> > 
> > I'm not against only turning on stable pages when they are needed, but
> > the code that isn't the default tends to be somewhat less used.  So it
> > does increase testing burden when we do want stable pages, and it tends
> > to make for awkward bugs that are hard to reproduce because someone
> > neglects to mention it.
> > 
> > IMHO it's much more important to nail down the 2 second writeback
> > latency. That's not good.
> > 
> 
> I think I understand this one. It's do to the sync nature introduced
> by page_waiting in mkwrite.

Pages go from dirty to writeback for a few reasons.  Background
writeout, or O_DIRECT or someone running sync

background writeout shouldn't be queueing up so much work that
synchronous writeout has a 2 second delay.

If the latencies are coming from something that was run through
fsync...well there's not too much we can do about that.  The problem is
that our page_mkwrite call isn't starting the IO it is just waiting on
it, so we can't bump the priority on it.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Moyer March 8, 2012, 8:42 p.m. UTC | #9
Chris Mason <chris.mason@oracle.com> writes:

> On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
>> I think I understand this one. It's do to the sync nature introduced
>> by page_waiting in mkwrite.
>
> Pages go from dirty to writeback for a few reasons.  Background
> writeout, or O_DIRECT or someone running sync
>
> background writeout shouldn't be queueing up so much work that
> synchronous writeout has a 2 second delay.

So now we're back to figuring out how to tell how long I/O will take?
If writeback is issuing random access I/Os to spinning media, you can
bet it might be a while.  Today, you could lower nr_requests to some
obscenely small number to improve worst-case latency.  I thought there
was some talk about improving the intelligence of writeback in this
regard, but it's a tough problem, especially given that writeback isn't
the only cook in the kitchen.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Boaz Harrosh March 8, 2012, 8:50 p.m. UTC | #10
On 03/08/2012 12:37 PM, Chris Mason wrote:
> On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
>> On 03/08/2012 10:09 AM, Chris Mason wrote:
>>>
>>> But, why are we writeback for a second or more?  Aren't there other
>>> parts of this we would want to fix as well?
>>>
>>> I'm not against only turning on stable pages when they are needed, but
>>> the code that isn't the default tends to be somewhat less used.  So it
>>> does increase testing burden when we do want stable pages, and it tends
>>> to make for awkward bugs that are hard to reproduce because someone
>>> neglects to mention it.
>>>
>>> IMHO it's much more important to nail down the 2 second writeback
>>> latency. That's not good.
>>>
>>
>> I think I understand this one. It's do to the sync nature introduced
>> by page_waiting in mkwrite.
> 
> Pages go from dirty to writeback for a few reasons.  Background
> writeout, or O_DIRECT or someone running sync
> 
> background writeout shouldn't be queueing up so much work that
> synchronous writeout has a 2 second delay.
> 
> If the latencies are coming from something that was run through
> fsync...well there's not too much we can do about that.  The problem is
> that our page_mkwrite call isn't starting the IO it is just waiting on
> it, so we can't bump the priority on it.
> 

I agree. I think the logger model is: write, than sync

Before they used to be waiting on the sync phase now their waiting
on write, when concurrent with sync. I would like to inspect this situation.
I agree with you that it's just shifting heaviness that is now hiding somewhere
else.

> -chris

Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason March 8, 2012, 8:55 p.m. UTC | #11
On Thu, Mar 08, 2012 at 03:42:52PM -0500, Jeff Moyer wrote:
> Chris Mason <chris.mason@oracle.com> writes:
> 
> > On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> >> I think I understand this one. It's do to the sync nature introduced
> >> by page_waiting in mkwrite.
> >
> > Pages go from dirty to writeback for a few reasons.  Background
> > writeout, or O_DIRECT or someone running sync
> >
> > background writeout shouldn't be queueing up so much work that
> > synchronous writeout has a 2 second delay.
> 
> So now we're back to figuring out how to tell how long I/O will take?
> If writeback is issuing random access I/Os to spinning media, you can
> bet it might be a while.  Today, you could lower nr_requests to some
> obscenely small number to improve worst-case latency.  I thought there
> was some talk about improving the intelligence of writeback in this
> regard, but it's a tough problem, especially given that writeback isn't
> the only cook in the kitchen.


I'm not against Ted's original idea, but I'd hate to miss the chance to
blame Jens for the block layer making filesystems slow.

Since there is a reliable way to reproduce, it makes sense to track it
down a little more instead of papering over the wait_on_page_writeback
call.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o March 8, 2012, 9:12 p.m. UTC | #12
On Thu, Mar 08, 2012 at 03:42:52PM -0500, Jeff Moyer wrote:
> 
> So now we're back to figuring out how to tell how long I/O will take?
> If writeback is issuing random access I/Os to spinning media, you can
> bet it might be a while.  Today, you could lower nr_requests to some
> obscenely small number to improve worst-case latency.  I thought there
> was some talk about improving the intelligence of writeback in this
> regard, but it's a tough problem, especially given that writeback isn't
> the only cook in the kitchen.

... and it gets worse if there is any kind of I/O prioritization going
on via ionice(), or (as was the case in our example) I/O cgroups were
being used to provide proportional I/O rate controls.  I don't think
it's realistic to assume the writeback code can predict how long I/O
will take when it does a submission.

BTW, I'd have to check (having not looked at the application code in
depth; the bug was primarily solved by bisection and reverting the
problem commit) but I'm not entirely sure the thread doing the write
was calling fsync(); the main issue as I understand things was that
the application wasn't expecting the write(2) system call would block
unexpectedly for long periods of time while doing small buffered,
appending I/O's.  (Again, for the kind of work that distributed
systems do, 99th percentile latency is important!)

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason March 8, 2012, 9:20 p.m. UTC | #13
On Thu, Mar 08, 2012 at 04:12:21PM -0500, Ted Ts'o wrote:
> On Thu, Mar 08, 2012 at 03:42:52PM -0500, Jeff Moyer wrote:
> > 
> > So now we're back to figuring out how to tell how long I/O will take?
> > If writeback is issuing random access I/Os to spinning media, you can
> > bet it might be a while.  Today, you could lower nr_requests to some
> > obscenely small number to improve worst-case latency.  I thought there
> > was some talk about improving the intelligence of writeback in this
> > regard, but it's a tough problem, especially given that writeback isn't
> > the only cook in the kitchen.
> 
> ... and it gets worse if there is any kind of I/O prioritization going
> on via ionice(), or (as was the case in our example) I/O cgroups were
> being used to provide proportional I/O rate controls.  I don't think
> it's realistic to assume the writeback code can predict how long I/O
> will take when it does a submission.

cgroups do make it much harder because it could be a simple IO priority
inversion.  The latencies are just going to be a fact of life for now
and the best choice is to skip the stable pages.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o March 8, 2012, 9:24 p.m. UTC | #14
On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> 
> I have a theory of how we can fix that 2-sec wait, by avoiding writeback of
> the last n pages of an inode who's mtime is less then 2-sec. This would
> solve any sequential writer wait penalty, which is Ted's case

That won't work in general, *unless* 2 seconds is enough time that the
appending writer is done writing to that particular 4k page and moved
on to the next 4k block, so nothing touches that page and potentially
blocks for however long it takes for the queues to drain.

Let's take another example, suppose you have a file-backed mmap
region, and you modify the page, and now let's suppose the process is
under enough memory pressure that the page cleaner decides to initiate
writeback of the page.  Now suppose you get unlucky (this is the 1% or
0.1% case; remember, 99th or 99.9 percentile latencies matter), and
you try to modify the page in question again.  ***THUNK*** your
process takes a page fault, and is frozen solid in amber for
potentially seconds until the I/O queues drain.

Hmm.... let's turn this around.  If the issue is checksum calculation,
how about trying to solve this problem in some cases by deferring the
checksum calculation until right before the block I/O layer is going
to schedule the write (i.e., have the I/O submitter provide a callback
function which calculates the checksum, which gets called by the BIO
layer at the very last moment)?

This won't work in all cases (I can see this getting really messy in
the software RAID-5/6 case if you don't want to memory copies) but it
might solve the problem in at least some of the cases where people
care about this.

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason March 8, 2012, 9:38 p.m. UTC | #15
On Thu, Mar 08, 2012 at 04:24:12PM -0500, Ted Ts'o wrote:
> On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> > 
> > I have a theory of how we can fix that 2-sec wait, by avoiding writeback of
> > the last n pages of an inode who's mtime is less then 2-sec. This would
> > solve any sequential writer wait penalty, which is Ted's case
> 
> That won't work in general, *unless* 2 seconds is enough time that the
> appending writer is done writing to that particular 4k page and moved
> on to the next 4k block, so nothing touches that page and potentially
> blocks for however long it takes for the queues to drain.
> 
> Let's take another example, suppose you have a file-backed mmap
> region, and you modify the page, and now let's suppose the process is
> under enough memory pressure that the page cleaner decides to initiate
> writeback of the page.  Now suppose you get unlucky (this is the 1% or
> 0.1% case; remember, 99th or 99.9 percentile latencies matter), and
> you try to modify the page in question again.  ***THUNK*** your
> process takes a page fault, and is frozen solid in amber for
> potentially seconds until the I/O queues drain.
> 
> Hmm.... let's turn this around.  If the issue is checksum calculation,
> how about trying to solve this problem in some cases by deferring the
> checksum calculation until right before the block I/O layer is going
> to schedule the write (i.e., have the I/O submitter provide a callback
> function which calculates the checksum, which gets called by the BIO
> layer at the very last moment)?

Btrfs currently does this, and the DIF code is by definition right
before the write.  The pages really only get set writeback when they are
being sent in flight, so the waiting being done by the stable pages
patch is file_write or page_mkwrite being polite and waiting for the IO
to finish before changing the page.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o March 8, 2012, 9:41 p.m. UTC | #16
On Thu, Mar 08, 2012 at 04:38:08PM -0500, Chris Mason wrote:
> Btrfs currently does this, and the DIF code is by definition right
> before the write.  The pages really only get set writeback when they are
> being sent in flight, so the waiting being done by the stable pages
> patch is file_write or page_mkwrite being polite and waiting for the IO
> to finish before changing the page.

Right before submission to the bio layer?  Or right before the device
driver sends the request to the host bus adapter?  I was thinking of
the latter....

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Boaz Harrosh March 8, 2012, 9:52 p.m. UTC | #17
On 03/08/2012 01:24 PM, Ted Ts'o wrote:
> On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
>>
>> I have a theory of how we can fix that 2-sec wait, by avoiding writeback of
>> the last n pages of an inode who's mtime is less then 2-sec. This would
>> solve any sequential writer wait penalty, which is Ted's case
> 
> That won't work in general, *unless* 2 seconds is enough time that the
> appending writer is done writing to that particular 4k page and moved
> on to the next 4k block, so nothing touches that page and potentially
> blocks for however long it takes for the queues to drain.
> 

Exactly. Is that not the case for a sequential writer. It modifies
the top page for a while inode time keeps incrementing, then eventually
it advances to the next page, now the before the last page is never
touched again. and can be submitted for writeout.

> Let's take another example, suppose you have a file-backed mmap
> region, and you modify the page, and now let's suppose the process is
> under enough memory pressure that the page cleaner decides to initiate
> writeback of the page.  Now suppose you get unlucky (this is the 1% or
> 0.1% case; remember, 99th or 99.9 percentile latencies matter), and
> you try to modify the page in question again.  ***THUNK*** your
> process takes a page fault, and is frozen solid in amber for
> potentially seconds until the I/O queues drain.
> 

As I said, if the IO is random you are in though luck, and BTW mmap
is not a must, just simple write() call will behave just the same
since it sits in the same mkwrite().

But that was not your case. Your case was an appending logger.

But my new theory is that your case is not the "writeback" to
app-write collision, but the app-write to sync() collision which
is a different case.

> Hmm.... let's turn this around.  If the issue is checksum calculation,
> how about trying to solve this problem in some cases by deferring the
> checksum calculation until right before the block I/O layer is going
> to schedule the write (i.e., have the I/O submitter provide a callback
> function which calculates the checksum, which gets called by the BIO
> layer at the very last moment)?
> 

iscsi is that case, but the problem is not when "calculates the checksum"
but: when "changing page state" your schema can work but you will need to
add a new page state, dirty => writeback => stable (new state).

> This won't work in all cases (I can see this getting really messy in
> the software RAID-5/6 case if you don't want to memory copies) but it
> might solve the problem in at least some of the cases where people
> care about this.
> 
> 					- Ted

Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner March 8, 2012, 11:32 p.m. UTC | #18
On Thu, Mar 08, 2012 at 12:50:42PM -0800, Boaz Harrosh wrote:
> On 03/08/2012 12:37 PM, Chris Mason wrote:
> > On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> >> On 03/08/2012 10:09 AM, Chris Mason wrote:
> >>>
> >>> But, why are we writeback for a second or more?  Aren't there other
> >>> parts of this we would want to fix as well?
> >>>
> >>> I'm not against only turning on stable pages when they are needed, but
> >>> the code that isn't the default tends to be somewhat less used.  So it
> >>> does increase testing burden when we do want stable pages, and it tends
> >>> to make for awkward bugs that are hard to reproduce because someone
> >>> neglects to mention it.
> >>>
> >>> IMHO it's much more important to nail down the 2 second writeback
> >>> latency. That's not good.
> >>>
> >>
> >> I think I understand this one. It's do to the sync nature introduced
> >> by page_waiting in mkwrite.
> > 
> > Pages go from dirty to writeback for a few reasons.  Background
> > writeout, or O_DIRECT or someone running sync
> > 
> > background writeout shouldn't be queueing up so much work that
> > synchronous writeout has a 2 second delay.
> > 
> > If the latencies are coming from something that was run through
> > fsync...well there's not too much we can do about that.  The problem is
> > that our page_mkwrite call isn't starting the IO it is just waiting on
> > it, so we can't bump the priority on it.
> > 
> 
> I agree. I think the logger model is: write, than sync
> 
> Before they used to be waiting on the sync phase now their waiting
> on write, when concurrent with sync. I would like to inspect this situation.
> I agree with you that it's just shifting heaviness that is now hiding somewhere
> else.

I'd argue that the application is broken, not the kernel. IO
latencies are always going to occur, so if the application is
sensitive to them, then the app needs to take measures to minimise
the effect of potential latencies.  Double/ring buffering with async
IO dispatch is the general method of avoiding significant latencies
in IO aggregator applications like loggers...

Cheers,

Dave.
Chris Mason March 9, 2012, 1:02 a.m. UTC | #19
On Thu, Mar 08, 2012 at 04:41:48PM -0500, Ted Ts'o wrote:
> On Thu, Mar 08, 2012 at 04:38:08PM -0500, Chris Mason wrote:
> > Btrfs currently does this, and the DIF code is by definition right
> > before the write.  The pages really only get set writeback when they are
> > being sent in flight, so the waiting being done by the stable pages
> > patch is file_write or page_mkwrite being polite and waiting for the IO
> > to finish before changing the page.
> 
> Right before submission to the bio layer?  Or right before the device
> driver sends the request to the host bus adapter?  I was thinking of
> the latter....

blktrace can probably give us numbers for how big that wait is, but
hopefully it is very small.  If it is too big, you can always bump
nr_requests.

Btrfs crcs before submit_bio, not sure when the dif code does it.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin K. Petersen March 9, 2012, 1:08 a.m. UTC | #20
>>>>> "Chris" == Chris Mason <chris.mason@oracle.com> writes:

Chris> Btrfs crcs before submit_bio, not sure when the dif code does it.

Slightly later. At make_request time.
Dave Chinner March 9, 2012, 8:11 a.m. UTC | #21
On Thu, Mar 08, 2012 at 04:20:54PM -0500, Chris Mason wrote:
> On Thu, Mar 08, 2012 at 04:12:21PM -0500, Ted Ts'o wrote:
> > On Thu, Mar 08, 2012 at 03:42:52PM -0500, Jeff Moyer wrote:
> > > 
> > > So now we're back to figuring out how to tell how long I/O will take?
> > > If writeback is issuing random access I/Os to spinning media, you can
> > > bet it might be a while.  Today, you could lower nr_requests to some
> > > obscenely small number to improve worst-case latency.  I thought there
> > > was some talk about improving the intelligence of writeback in this
> > > regard, but it's a tough problem, especially given that writeback isn't
> > > the only cook in the kitchen.
> > 
> > ... and it gets worse if there is any kind of I/O prioritization going
> > on via ionice(), or (as was the case in our example) I/O cgroups were
> > being used to provide proportional I/O rate controls.  I don't think
> > it's realistic to assume the writeback code can predict how long I/O
> > will take when it does a submission.
> 
> cgroups do make it much harder because it could be a simple IO priority
> inversion.  The latencies are just going to be a fact of life for now
> and the best choice is to skip the stable pages.

They have always been a fact of life - just ask anyone that has to
deal with deterministic or "real-time" IO applications.

Unpredictable IO path latencies are not a new problem, and it
doesn't take stable pages to cause sigificant holdoffs in the
writing to a file.  For example: writeback triggers triggers delayed
allocation, which locks the extent map and then blocks behind an
allocation already in progress or has to do IO to read in freespace
metadata. The next write comes along from another thread/process and
it has to map a new page and that now blocks on the extent map lock
and won't progress until the delayed allocation in progress
completes....

IO latencies are pretty much unavoidable, so the best thing to do is
to write applications that care about latency to minimise it's
impact as much as possible. Simple techniques like double buffering
and async IO dispatch techniques to decouple the IO stream from the
process/threads that are doing real work are the usual ways of
dealing with this problem.

Cheers,

Dave.
Theodore Ts'o March 9, 2012, 4:20 p.m. UTC | #22
On Thu, Mar 08, 2012 at 08:02:07PM -0500, Chris Mason wrote:
> > Right before submission to the bio layer?  Or right before the device
> > driver sends the request to the host bus adapter?  I was thinking of
> > the latter....
> 
> blktrace can probably give us numbers for how big that wait is, but
> hopefully it is very small.  If it is too big, you can always bump
> nr_requests.

Not necessarily; things can get stalled for quite a while due to
ionice or io rate throttling.  There's quite a lot that can happen
after submit_bio, at the cfq layer, for example.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index 1a30db7..d25c60f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2333,7 +2333,9 @@  int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 		ret = -EAGAIN;
 		goto out_unlock;
 	}
+#ifdef CONFIG_BLKDEV_INTEGRITY
 	wait_on_page_writeback(page);
+#endif
 	return 0;
 out_unlock:
 	unlock_page(page);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5f8081c..01f86c5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4638,8 +4638,10 @@  int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (page_has_buffers(page)) {
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
 					ext4_bh_unmapped)) {
+#ifdef CONFIG_BLKDEV_INTEGRITY
 			/* Wait so that we don't change page under IO */
 			wait_on_page_writeback(page);
+#endif
 			ret = VM_FAULT_LOCKED;
 			goto out;
 		}