Message ID: E1S5QTU-0005Cc-Kl@tytso-glaptop.cam.corp.google.com
State: Accepted, archived
On 3/7/12 5:40 PM, Theodore Ts'o wrote:
> We've recently discovered a workload at Google where the page
> stabilization patches (specifically commit 0e499890c1f: "ext4: wait for
> writeback to complete while making pages writable") resulted in a
> **major** performance regression. As in, kernel threads that were
> writing to log files were getting hit by stalls of up to 2 seconds, which
> very badly hurt a particular application. Reverting this commit fixed
> the performance regression.
>
> The main reason for the page stabilization patches was for DIF/DIX
> support, right? So I'm wondering if we should just disable the calls
> to wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined.
> i.e., something like this.

Can you devise a non-secret testcase that demonstrates this?

Thanks,
-Eric

> What do people think? I have a feeling this is going to be very
> controversial....
>
> - Ted
>
> ext4: Disable page stabilization if DIF/DIX not enabled
>
> Requiring processes which are writing to files under writeback to wait
> until the writeback is complete can result in massive performance hits.
> This is especially true if writes are being throttled due to I/O cgroup
> limits and the application is running on an especially busy server.
>
> If CONFIG_BLKDEV_INTEGRITY is not enabled, disable page stabilization,
> since that's the main case where this is needed, and page stabilization
> can be very painful.
>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 1a30db7..d25c60f 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2333,7 +2333,9 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> 		ret = -EAGAIN;
> 		goto out_unlock;
> 	}
> +#ifdef CONFIG_BLKDEV_INTEGRITY
> 	wait_on_page_writeback(page);
> +#endif
> 	return 0;
> out_unlock:
> 	unlock_page(page);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5f8081c..01f86c5 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4638,8 +4638,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> 	if (page_has_buffers(page)) {
> 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> 				       ext4_bh_unmapped)) {
> +#ifdef CONFIG_BLKDEV_INTEGRITY
> 			/* Wait so that we don't change page under IO */
> 			wait_on_page_writeback(page);
> +#endif
> 			ret = VM_FAULT_LOCKED;
> 			goto out;
> 		}

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 07, 2012 at 05:54:11PM -0600, Eric Sandeen wrote:
> On 3/7/12 5:40 PM, Theodore Ts'o wrote:
> > We've recently discovered a workload at Google where the page
> > stabilization patches (specifically commit 0e499890c1f: "ext4: wait
> > for writeback to complete while making pages writable") resulted in
> > a **major** performance regression. [...]
> >
> > The main reason for the page stabilization patches was for DIF/DIX
> > support, right? So I'm wondering if we should just disable the calls
> > to wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined.
> > i.e., something like this.
>
> Can you devise a non-secret testcase that demonstrates this?

It sure would be nice if the block device could indicate whether it
requires stable writeback, and the fs could figure that out... though
IIRC there was more to my patchset than just these two
wait_on_page_writeback() calls.

--D
On 03/07/2012 03:40 PM, Theodore Ts'o wrote:
>
> We've recently discovered a workload at Google where the page
> stabilization patches (specifically commit 0e499890c1f: "ext4: wait for
> writeback to complete while making pages writable") resulted in a
> **major** performance regression. As in, kernel threads that were
> writing to log files were getting hit by stalls of up to 2 seconds, which
> very badly hurt a particular application.

That 2-second hit I think I know how to fix, somewhat, with smarter
writeback. I want to talk about this at LSF with people.

> Reverting this commit fixed the performance regression.
>
> The main reason for the page stabilization patches was for DIF/DIX
> support, right? So I'm wondering if we should just disable the calls
> to wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined.
> i.e., something like this.
>
> What do people think? I have a feeling this is going to be very
> controversial....
>

NACK. It's not a CONFIG_ thing; the question is: does this particular
device need stable pages?

As I have stated many times before, the device should have a property
that says whether it needs stable pages or not. The candidates for
stable pages are:
 - DIF/DIX enabled devices
 - RAID-1/4/5/6 devices
 - iSCSI devices with data digest signatures
 - any other checksum-enabled block device

A Fedora distro will have CONFIG_BLKDEV_INTEGRITY set, so you are
always out of luck, even with devices that couldn't care less.

Please submit a proper patch, even a temporary mount option. But this
is ABI. The best approach would be to find where to export it as part
of the device's properties sysfs dir, and inspect that.

> - Ted
> [...]

Thanks
Boaz
> Can you devise a non-secret testcase that demonstrates this?
Hmm. I bet you could get fio to do it. Giant file, random mmap()
writes, spin until the CPU overwhelms writeback?
- z
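Zach's suggestion amounts to dirtying random pages of a large mapped file faster than writeback can retire them, so that faults keep landing on pages that are under writeback. A rough userspace sketch of the dirtying loop (file name and sizes are made up, and fio would normally drive this at scale):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Dirty 'iters' random pages of a file through mmap.  With stable
 * pages enabled, each store can fault into page_mkwrite() and stall
 * behind any in-flight writeback of that page. */
int dirty_random_pages(const char *path, size_t size, int iters)
{
	int fd = open(path, O_RDWR | O_CREAT, 0644);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	char *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}
	long pagesz = sysconf(_SC_PAGESIZE);
	size_t npages = size / pagesz;

	for (int i = 0; i < iters; i++)
		map[(rand() % npages) * pagesz] = (char)i; /* dirty one page */
	munmap(map, size);
	close(fd);
	return 0;
}
```

In the real testcase the file would be much larger than memory and the loop would run until writeback is permanently behind.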
On Wed, Mar 07, 2012 at 09:39:50PM -0500, Zach Brown wrote:
> > Can you devise a non-secret testcase that demonstrates this?
>
> Hmm. I bet you could get fio to do it. Giant file, random mmap()
> writes, spin until the CPU overwhelms writeback?

Kick off a bunch of fio processes, each in separate I/O cgroups set up
so that each of the processes gets a "fair" amount of the I/O
bandwidth. (This is quite common in cloud deployments where you are
packing a huge number of tasks onto a single box; whether the tasks
are inside virtual machines or containers doesn't really matter for
the purpose of this exercise. We basically need to simulate a system
where the disks are busy.)

Then in one of those cgroups, create a process which is constantly
appending to a file using buffered I/O; this could be a log file, or
an application-level journal file; and measure the latency of that
write system call. Every so often, writeback will push the dirty
pages corresponding to the log/journal file to disk. When that
happens, and page stabilization is enabled, the latency of that write
system call will spike.

And any time you have a distributed system where you are depending on
a large number of RPC/SOAP/Service Oriented Architecture Enterprise
Service Bus calls (I don't really care which buzzword you use, but IBM
and Oracle really like the last one :-), long-tail latencies are what
kill your responsiveness and predictability. Especially when a thread
goes away for a second or more...

- Ted
On Thu, Mar 08, 2012 at 10:54:19AM -0500, Ted Ts'o wrote:
> On Wed, Mar 07, 2012 at 09:39:50PM -0500, Zach Brown wrote:
> >
> > > Can you devise a non-secret testcase that demonstrates this?
> >
> > Hmm. I bet you could get fio to do it. Giant file, random mmap()
> > writes, spin until the CPU overwhelms writeback?
>
> Kick off a bunch of fio processes, each in separate I/O cgroups set up
> so that each of the processes gets a "fair" amount of the I/O
> bandwidth. [...]
>
> Then in one of those cgroups, create a process which is constantly
> appending to a file using buffered I/O [...] and measure the latency
> of that write system call. [...]
>
> And any time you have a distributed system [...] long-tail latencies
> are what kill your responsiveness and predictability. Especially when
> a thread goes away for a second or more...

But why are we in writeback for a second or more? Aren't there other
parts of this we would want to fix as well?

I'm not against only turning on stable pages when they are needed, but
code that isn't the default tends to be somewhat less used. So it
does increase the testing burden when we do want stable pages, and it
tends to make for awkward bugs that are hard to reproduce because
someone neglects to mention it.

IMHO it's much more important to nail down the 2-second writeback
latency. That's not good.

-chris
On 03/08/2012 10:09 AM, Chris Mason wrote:
>
> But why are we in writeback for a second or more? Aren't there other
> parts of this we would want to fix as well?
> [...]
> IMHO it's much more important to nail down the 2-second writeback
> latency. That's not good.
>

I think I understand this one. It's due to the synchronous wait
introduced by the page_writeback wait in mkwrite. The system is
loaded, so everything is lagging by 2 seconds or more. The 2 sec (or
more) comes from max-dirty-limit / disk-speed, so any IO you submit
will probably be on stable disk 2 sec later. (In theory, any power
failure will lose all dirty pages, which in our case is up to the
max-dirty-limit.)

Now usually that's fine, because everything is queued and waits a bit,
evenly distributed, and you wait, theoretically, only at the rate of
your own IO. But here, all of a sudden, you are not allowed to be
queued; you are waiting for the head of the queue to actually be done,
and the app is just frozen.

Actually, now that I think of it, the pages were already submitted by
the time we wait on them. So the 2 sec is the depth of the
block+SCSI+target queues. I guess they can be pretty deep.

I have a theory of how we can fix that 2-sec wait: avoid writeback of
the last n pages of an inode whose mtime is less than 2 sec old. This
would remove the wait penalty for any sequential writer, which is
Ted's case.

Thanks
Boaz
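Boaz's back-of-the-envelope estimate (worst-case wait is roughly max-dirty-limit divided by disk speed) is simple enough to state directly; the numbers below are illustrative, not from the thread:

```c
/* Worst-case drain time in seconds: how long until a just-queued page
 * reaches stable storage, if the queues can hold up to 'dirty_limit'
 * bytes ahead of it and the disk retires 'disk_bw' bytes per second. */
double drain_seconds(double dirty_limit_bytes, double disk_bw_bytes_per_sec)
{
	return dirty_limit_bytes / disk_bw_bytes_per_sec;
}
```

For example, 200 MB of queued dirty data against a 100 MB/s disk gives a 2-second worst-case wait, matching the stalls Ted reported.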
On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> On 03/08/2012 10:09 AM, Chris Mason wrote:
> > [...]
> > IMHO it's much more important to nail down the 2-second writeback
> > latency. That's not good.
>
> I think I understand this one. It's due to the synchronous wait
> introduced by the page_writeback wait in mkwrite.

Pages go from dirty to writeback for a few reasons: background
writeout, O_DIRECT, or someone running sync.

Background writeout shouldn't be queueing up so much work that
synchronous writeout has a 2-second delay.

If the latencies are coming from something that was run through
fsync... well, there's not too much we can do about that. The problem
is that our page_mkwrite call isn't starting the IO, it is just
waiting on it, so we can't bump the priority on it.

-chris
Chris Mason <chris.mason@oracle.com> writes:
> On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> > I think I understand this one. It's due to the synchronous wait
> > introduced by the page_writeback wait in mkwrite.
>
> Pages go from dirty to writeback for a few reasons: background
> writeout, O_DIRECT, or someone running sync.
>
> Background writeout shouldn't be queueing up so much work that
> synchronous writeout has a 2-second delay.

So now we're back to figuring out how to tell how long I/O will take?
If writeback is issuing random access I/Os to spinning media, you can
bet it might be a while. Today, you could lower nr_requests to some
obscenely small number to improve worst-case latency. I thought there
was some talk about improving the intelligence of writeback in this
regard, but it's a tough problem, especially given that writeback isn't
the only cook in the kitchen.

Cheers,
Jeff
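The nr_requests knob Jeff mentions lives under each block device's queue directory in sysfs. A minimal sketch of locating and reading it (the device name "sda" is an assumption; lowering the value would mean writing the same file as root):

```c
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Build the sysfs path for a block device's nr_requests knob,
 * e.g. /sys/block/sda/queue/nr_requests.  Returns dst. */
char *nr_requests_path(char *dst, size_t len, const char *dev)
{
	snprintf(dst, len, "/sys/block/%s/queue/nr_requests", dev);
	return dst;
}

/* Read the current queue depth; returns -1 if the knob can't be read
 * (e.g. no such device, or not running on Linux). */
int read_nr_requests(const char *dev)
{
	char path[128];
	int n = -1;
	FILE *f = fopen(nr_requests_path(path, sizeof(path), dev), "r");

	if (f) {
		if (fscanf(f, "%d", &n) != 1)
			n = -1;
		fclose(f);
	}
	return n;
}
```

Shrinking this number bounds how much work can sit in the dispatch queue, which is exactly the worst-case latency trade-off Jeff describes.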
On 03/08/2012 12:37 PM, Chris Mason wrote:
> On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> > I think I understand this one. It's due to the synchronous wait
> > introduced by the page_writeback wait in mkwrite.
>
> Pages go from dirty to writeback for a few reasons: background
> writeout, O_DIRECT, or someone running sync.
> [...]
> If the latencies are coming from something that was run through
> fsync... well, there's not too much we can do about that. The problem
> is that our page_mkwrite call isn't starting the IO, it is just
> waiting on it, so we can't bump the priority on it.

I agree. I think the logger model is: write, then sync.

Before, they used to be waiting in the sync phase; now they're waiting
in write, when it is concurrent with sync. I would like to inspect
this situation. I agree with you that it's just shifting a burden that
is now hiding somewhere else.

Thanks
Boaz
On Thu, Mar 08, 2012 at 03:42:52PM -0500, Jeff Moyer wrote:
> Chris Mason <chris.mason@oracle.com> writes:
> [...]
> So now we're back to figuring out how to tell how long I/O will take?
> If writeback is issuing random access I/Os to spinning media, you can
> bet it might be a while. Today, you could lower nr_requests to some
> obscenely small number to improve worst-case latency. I thought there
> was some talk about improving the intelligence of writeback in this
> regard, but it's a tough problem, especially given that writeback isn't
> the only cook in the kitchen.

I'm not against Ted's original idea, but I'd hate to miss the chance
to blame Jens for the block layer making filesystems slow.

Since there is a reliable way to reproduce this, it makes sense to
track it down a little more instead of papering over the
wait_on_page_writeback call.

-chris
On Thu, Mar 08, 2012 at 03:42:52PM -0500, Jeff Moyer wrote:
>
> So now we're back to figuring out how to tell how long I/O will take?
> If writeback is issuing random access I/Os to spinning media, you can
> bet it might be a while. [...] it's a tough problem, especially given
> that writeback isn't the only cook in the kitchen.

... and it gets worse if there is any kind of I/O prioritization going
on via ionice(), or (as was the case in our example) I/O cgroups are
being used to provide proportional I/O rate controls. I don't think
it's realistic to assume the writeback code can predict how long I/O
will take when it does a submission.

BTW, I'd have to check (having not looked at the application code in
depth; the bug was primarily solved by bisection and reverting the
problem commit), but I'm not entirely sure the thread doing the write
was calling fsync(); the main issue as I understand things was that
the application wasn't expecting the write(2) system call to block
unexpectedly for long periods of time while doing small, buffered,
appending I/Os. (Again, for the kind of work that distributed systems
do, 99th percentile latency is important!)

- Ted
On Thu, Mar 08, 2012 at 04:12:21PM -0500, Ted Ts'o wrote:
> On Thu, Mar 08, 2012 at 03:42:52PM -0500, Jeff Moyer wrote:
> > [...]
>
> ... and it gets worse if there is any kind of I/O prioritization going
> on via ionice(), or (as was the case in our example) I/O cgroups are
> being used to provide proportional I/O rate controls. I don't think
> it's realistic to assume the writeback code can predict how long I/O
> will take when it does a submission.

cgroups do make it much harder, because it could be a simple IO
priority inversion. The latencies are just going to be a fact of life
for now, and the best choice is to skip the stable pages.

-chris
On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
>
> I have a theory of how we can fix that 2-sec wait: avoid writeback of
> the last n pages of an inode whose mtime is less than 2 sec old. This
> would remove the wait penalty for any sequential writer, which is
> Ted's case.

That won't work in general, *unless* 2 seconds is enough time that the
appending writer is done writing to that particular 4k page and has
moved on to the next 4k block, so nothing touches that page and
potentially blocks for however long it takes for the queues to drain.

Let's take another example. Suppose you have a file-backed mmap
region, and you modify a page, and now let's suppose the process is
under enough memory pressure that the page cleaner decides to initiate
writeback of the page. Now suppose you get unlucky (this is the 1% or
0.1% case; remember, 99th or 99.9th percentile latencies matter), and
you try to modify the page in question again. ***THUNK*** Your
process takes a page fault, and is frozen solid in amber for
potentially seconds until the I/O queues drain.

Hmm... let's turn this around. If the issue is checksum calculation,
how about trying to solve this problem in some cases by deferring the
checksum calculation until right before the block I/O layer is going
to schedule the write (i.e., have the I/O submitter provide a callback
function which calculates the checksum, which gets called by the BIO
layer at the very last moment)?

This won't work in all cases (I can see this getting really messy in
the software RAID-5/6 case if you want to avoid memory copies), but it
might solve the problem in at least some of the cases where people
care about this.

- Ted
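Ted's deferred-checksum idea can be sketched in userspace: the submitter attaches a callback instead of a precomputed checksum, and the (pretend) block layer invokes it at the very last moment before issuing the write. All names here (toy_bio, toy_submit) are hypothetical, and Adler-32 merely stands in for whatever DIF/DIX checksum the real code would use:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy model of a submitted write: instead of carrying a precomputed
 * checksum, it carries a callback that the (pretend) block layer
 * invokes right before the request goes to the device. */
struct toy_bio {
	const uint8_t *data;
	size_t len;
	uint32_t csum;
	uint32_t (*csum_fn)(const uint8_t *data, size_t len);
};

/* Stand-in checksum; the real code would use the DIF/DIX CRC. */
static uint32_t adler32(const uint8_t *d, size_t n)
{
	uint32_t a = 1, b = 0;

	for (size_t i = 0; i < n; i++) {
		a = (a + d[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}

/* "Device submission": the checksum is computed here, at the last
 * moment, so page modifications made before this point can never
 * leave a stale checksum behind. */
void toy_submit(struct toy_bio *bio)
{
	bio->csum = bio->csum_fn(bio->data, bio->len);
	/* ... hand the request off to the device here ... */
}

/* Convenience wrapper: submit a buffer and return its checksum. */
uint32_t submit_and_get(const uint8_t *d, size_t n)
{
	struct toy_bio bio = { d, n, 0, adler32 };

	toy_submit(&bio);
	return bio.csum;
}
```

The catch, as Ted notes, is that the page must still not change between this last-moment calculation and the DMA actually completing, which is why this only removes part of the stable-page window.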
On Thu, Mar 08, 2012 at 04:24:12PM -0500, Ted Ts'o wrote:
> On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> > [...]
>
> Hmm... let's turn this around. If the issue is checksum calculation,
> how about trying to solve this problem in some cases by deferring the
> checksum calculation until right before the block I/O layer is going
> to schedule the write (i.e., have the I/O submitter provide a callback
> function which calculates the checksum, which gets called by the BIO
> layer at the very last moment)?

Btrfs currently does this, and the DIF code by definition runs right
before the write. The pages really only get set writeback when they
are being sent in flight, so the waiting being done by the stable
pages patch is file_write or page_mkwrite being polite and waiting for
the IO to finish before changing the page.

-chris
On Thu, Mar 08, 2012 at 04:38:08PM -0500, Chris Mason wrote:
> Btrfs currently does this, and the DIF code by definition runs right
> before the write. The pages really only get set writeback when they
> are being sent in flight, so the waiting being done by the stable
> pages patch is file_write or page_mkwrite being polite and waiting
> for the IO to finish before changing the page.

Right before submission to the bio layer? Or right before the device
driver sends the request to the host bus adapter? I was thinking of
the latter...

- Ted
On 03/08/2012 01:24 PM, Ted Ts'o wrote:
> On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> >
> > I have a theory of how we can fix that 2-sec wait: avoid writeback
> > of the last n pages of an inode whose mtime is less than 2 sec old.
> > This would remove the wait penalty for any sequential writer, which
> > is Ted's case.
>
> That won't work in general, *unless* 2 seconds is enough time that the
> appending writer is done writing to that particular 4k page and has
> moved on to the next 4k block, so nothing touches that page and
> potentially blocks for however long it takes for the queues to drain.

Exactly. Isn't that the case for a sequential writer? It modifies the
top page for a while, the inode time keeps incrementing, then
eventually it advances to the next page; now the page before the last
is never touched again and can be submitted for writeout.

> Let's take another example. Suppose you have a file-backed mmap
> region, and you modify a page, and now let's suppose the process is
> under enough memory pressure that the page cleaner decides to
> initiate writeback of the page. Now suppose you get unlucky (this is
> the 1% or 0.1% case; remember, 99th or 99.9th percentile latencies
> matter), and you try to modify the page in question again.
> ***THUNK*** Your process takes a page fault, and is frozen solid in
> amber for potentially seconds until the I/O queues drain.

As I said, if the IO is random you are out of luck, and BTW mmap is
not a must; a simple write() call will behave just the same, since it
hits the same wait_on_page_writeback(). But that was not your case.
Your case was an appending logger. My new theory, though, is that your
case is not the writeback vs. app-write collision, but the app-write
vs. sync() collision, which is a different case.

> Hmm... let's turn this around. If the issue is checksum calculation,
> how about trying to solve this problem in some cases by deferring the
> checksum calculation until right before the block I/O layer is going
> to schedule the write (i.e., have the I/O submitter provide a
> callback function which calculates the checksum, which gets called by
> the BIO layer at the very last moment)?

iSCSI is that case, but the problem is not *when* you calculate the
checksum; it is when you change the page's state. Your scheme can
work, but you will need to add a new page state: dirty => writeback =>
stable (new state).

> This won't work in all cases (I can see this getting really messy in
> the software RAID-5/6 case if you want to avoid memory copies), but
> it might solve the problem in at least some of the cases where people
> care about this.
>
> - Ted

Thanks
Boaz
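The extra state Boaz proposes can be pictured as a tiny state machine; PG_STABLE is his hypothetical addition, and all of the names below are invented for illustration:

```c
/* Boaz's proposed page lifecycle, with PG_STABLE as the hypothetical
 * new state a page enters only once its contents must stop changing
 * (checksum taken, request handed to the device). */
enum toy_page_state { PG_DIRTY, PG_WRITEBACK, PG_STABLE, PG_CLEAN };

/* Advance a page one step: dirty -> writeback -> stable -> clean.
 * In this scheme, writers would only have to wait on PG_STABLE pages,
 * not on everything marked writeback. */
enum toy_page_state advance(enum toy_page_state s)
{
	switch (s) {
	case PG_DIRTY:     return PG_WRITEBACK; /* queued for writeout */
	case PG_WRITEBACK: return PG_STABLE;    /* checksum taken, in flight */
	case PG_STABLE:    return PG_CLEAN;     /* IO completed */
	default:           return PG_CLEAN;
	}
}
```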
On Thu, Mar 08, 2012 at 12:50:42PM -0800, Boaz Harrosh wrote:
> On 03/08/2012 12:37 PM, Chris Mason wrote:
> > On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> >> On 03/08/2012 10:09 AM, Chris Mason wrote:
> >>>
> >>> But, why are we in writeback for a second or more? Aren't there other
> >>> parts of this we would want to fix as well?
> >>>
> >>> I'm not against only turning on stable pages when they are needed, but
> >>> the code that isn't the default tends to be somewhat less used. So it
> >>> does increase testing burden when we do want stable pages, and it tends
> >>> to make for awkward bugs that are hard to reproduce because someone
> >>> neglects to mention it.
> >>>
> >>> IMHO it's much more important to nail down the 2 second writeback
> >>> latency. That's not good.
> >>>
> >>
> >> I think I understand this one. It's due to the sync nature introduced
> >> by the page wait in mkwrite.
> >
> > Pages go from dirty to writeback for a few reasons. Background
> > writeout, O_DIRECT, or someone running sync.
> >
> > Background writeout shouldn't be queueing up so much work that
> > synchronous writeout has a 2 second delay.
> >
> > If the latencies are coming from something that was run through
> > fsync... well, there's not too much we can do about that. The problem
> > is that our page_mkwrite call isn't starting the IO, it is just waiting
> > on it, so we can't bump the priority on it.
> >
>
> I agree. I think the logger model is: write, then sync.
>
> Before, they used to wait in the sync phase; now they wait in write,
> when it is concurrent with sync. I would like to inspect this
> situation. I agree with you that it's just shifting heaviness that is
> now hiding somewhere else.

I'd argue that the application is broken, not the kernel. IO latencies
are always going to occur, so if the application is sensitive to them,
then the app needs to take measures to minimise the effect of potential
latencies.

Double/ring buffering with async IO dispatch is the general method of
avoiding significant latencies in IO aggregator applications like
loggers...

Cheers,

Dave.
On Thu, Mar 08, 2012 at 04:41:48PM -0500, Ted Ts'o wrote:
> On Thu, Mar 08, 2012 at 04:38:08PM -0500, Chris Mason wrote:
> > Btrfs currently does this, and the DIF code is by definition right
> > before the write. The pages really only get set writeback when they
> > are being sent in flight, so the waiting being done by the stable
> > pages patch is file_write or page_mkwrite being polite and waiting for
> > the IO to finish before changing the page.
>
> Right before submission to the bio layer? Or right before the device
> driver sends the request to the host bus adapter? I was thinking of
> the latter....

blktrace can probably give us numbers for how big that wait is, but
hopefully it is very small. If it is too big, you can always bump
nr_requests.

Btrfs crcs before submit_bio; not sure when the DIF code does it.

-chris
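[Editor's note: the nr_requests knob Chris and Jeff refer to lives in sysfs
per block device. The device name below is a placeholder; substitute the
device backing your filesystem, and note writing it needs root. Raising it
reduces waits for a free request slot; lowering it bounds queue depth and
hence worst-case writeback latency.]

```shell
# Inspect the current request queue depth (commonly 128 by default):
cat /sys/block/sda/queue/nr_requests

# Shrink it to bound worst-case latency (Jeff's suggestion), or raise it
# to reduce waits for a free request (Chris's suggestion):
echo 16 > /sys/block/sda/queue/nr_requests
```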
>>>>> "Chris" == Chris Mason <chris.mason@oracle.com> writes:
Chris> Btrfs crcs before submit_bio, not sure when the dif code does it.
Slightly later. At make_request time.
On Thu, Mar 08, 2012 at 04:20:54PM -0500, Chris Mason wrote:
> On Thu, Mar 08, 2012 at 04:12:21PM -0500, Ted Ts'o wrote:
> > On Thu, Mar 08, 2012 at 03:42:52PM -0500, Jeff Moyer wrote:
> > >
> > > So now we're back to figuring out how to tell how long I/O will
> > > take? If writeback is issuing random access I/Os to spinning media,
> > > you can bet it might be a while. Today, you could lower nr_requests
> > > to some obscenely small number to improve worst-case latency. I
> > > thought there was some talk about improving the intelligence of
> > > writeback in this regard, but it's a tough problem, especially given
> > > that writeback isn't the only cook in the kitchen.
> >
> > ... and it gets worse if there is any kind of I/O prioritization going
> > on via ionice(), or (as was the case in our example) I/O cgroups were
> > being used to provide proportional I/O rate controls. I don't think
> > it's realistic to assume the writeback code can predict how long I/O
> > will take when it does a submission.
>
> cgroups do make it much harder because it could be a simple IO priority
> inversion. The latencies are just going to be a fact of life for now
> and the best choice is to skip the stable pages.

They have always been a fact of life - just ask anyone that has to deal
with deterministic or "real-time" IO applications. Unpredictable IO
path latencies are not a new problem, and it doesn't take stable pages
to cause significant holdoffs in writing to a file.

For example: writeback triggers delayed allocation, which locks the
extent map and then blocks behind an allocation already in progress, or
has to do IO to read in freespace metadata. The next write comes along
from another thread/process and has to map a new page; that now blocks
on the extent map lock and won't progress until the delayed allocation
in progress completes....

IO latencies are pretty much unavoidable, so the best thing to do is to
write applications that care about latency to minimise their impact as
much as possible. Simple techniques like double buffering and async IO
dispatch to decouple the IO stream from the processes/threads doing
real work are the usual ways of dealing with this problem.

Cheers,

Dave.
On Thu, Mar 08, 2012 at 08:02:07PM -0500, Chris Mason wrote:
> > Right before submission to the bio layer? Or right before the device
> > driver sends the request to the host bus adapter? I was thinking of
> > the latter....
>
> blktrace can probably give us numbers for how big that wait is, but
> hopefully it is very small. If it is too big, you can always bump
> nr_requests.

Not necessarily; things can get stalled for quite a while due to ionice
or io rate throttling. There's quite a lot that can happen after
submit_bio, at the cfq layer, for example.

					- Ted
diff --git a/fs/buffer.c b/fs/buffer.c
index 1a30db7..d25c60f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2333,7 +2333,9 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 		ret = -EAGAIN;
 		goto out_unlock;
 	}
+#ifdef CONFIG_BLKDEV_INTEGRITY
 	wait_on_page_writeback(page);
+#endif
 	return 0;
 out_unlock:
 	unlock_page(page);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5f8081c..01f86c5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4638,8 +4638,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (page_has_buffers(page)) {
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
 				       ext4_bh_unmapped)) {
+#ifdef CONFIG_BLKDEV_INTEGRITY
 			/* Wait so that we don't change page under IO */
 			wait_on_page_writeback(page);
+#endif
 			ret = VM_FAULT_LOCKED;
 			goto out;
 		}
We've recently discovered a workload at Google where the page
stabilization patches (specifically commit 0e499890c1f: "ext4: wait for
writeback to complete while making pages writable") resulted in a
**major** performance regression. As in, kernel threads that were
writing to log files were getting hit by up to 2 second stalls, which
very badly hurt a particular application. Reverting this commit fixed
the performance regression.

The main reason for the page stabilization patches was DIF/DIX support,
right? So I'm wondering if we should just disable the calls to
wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined, i.e.,
something like this.

What do people think? I have a feeling this is going to be very
controversial....

					- Ted

ext4: Disable page stabilization if DIF/DIX not enabled

Requiring processes which are writing to files under writeback to wait
until the writeback is complete can result in massive performance hits.
This is especially true if writes are being throttled due to I/O cgroup
limits and the application is running on an especially busy server.

If CONFIG_BLKDEV_INTEGRITY is not enabled, disable page stabilization,
since that's the main case where it is needed, and page stabilization
can be very painful.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>