[3/8] nowait aio: return if direct write will trigger writeback

Submitted by Goldwyn Rodrigues on Feb. 28, 2017, 11:36 p.m.

Details

Message ID 20170228233610.25456-4-rgoldwyn@suse.de
State New
Headers show

Commit Message

Goldwyn Rodrigues Feb. 28, 2017, 11:36 p.m.
From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Find out if the write will trigger a wait due to writeback. If yes,
return -EAGAIN.

This introduces a new function filemap_range_has_page() which
returns true if the file's mapping has a page within the range
mentioned.

Return -EINVAL for buffered AIO: there are multiple causes of
delay such as page locks, dirty throttling logic, page loading
from disk etc. which cannot be taken care of.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 50 +++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 49 insertions(+), 3 deletions(-)

Comments

Matthew Wilcox March 1, 2017, 3:46 a.m.
On Tue, Feb 28, 2017 at 05:36:05PM -0600, Goldwyn Rodrigues wrote:
> Find out if the write will trigger a wait due to writeback. If yes,
> return -EAGAIN.
> 
> This introduces a new function filemap_range_has_page() which
> returns true if the file's mapping has a page within the range
> mentioned.

Ugh, this is pretty inefficient.  If that's all you want to know, then
using the radix tree directly will be far more efficient than spinning
up all the pagevec machinery only to discard the pages found.

But what's going to kick these pages out of cache?  Shouldn't we rather
find the pages, kick them out if clean, start writeback if not, and *then*
return -EAGAIN?

So maybe we want to spin up the pagevec machinery after all so we can
do that extra work?
Christoph Hellwig March 1, 2017, 3:38 p.m.
On Tue, Feb 28, 2017 at 07:46:06PM -0800, Matthew Wilcox wrote:
> Ugh, this is pretty inefficient.  If that's all you want to know, then
> using the radix tree directly will be far more efficient than spinning
> up all the pagevec machinery only to discard the pages found.
> 
> But what's going to kick these pages out of cache?  Shouldn't we rather
> find the pages, kick them out if clean, start writeback if not, and *then*
> return -EAGAIN?
> 
> So maybe we want to spin up the pagevec machinery after all so we can
> do that extra work?

As pointed out in the last round of these patches I think we really
need to pass a flags argument to filemap_write_and_wait_range to
communicate the non-blocking nature and only return -EAGAIN if we'd
block.  As a bonus that can indeed start to kick the pages out.
Jan Kara March 2, 2017, 10:38 a.m.
On Wed 01-03-17 07:38:57, Christoph Hellwig wrote:
> On Tue, Feb 28, 2017 at 07:46:06PM -0800, Matthew Wilcox wrote:
> > Ugh, this is pretty inefficient.  If that's all you want to know, then
> > using the radix tree directly will be far more efficient than spinning
> > up all the pagevec machinery only to discard the pages found.
> > 
> > But what's going to kick these pages out of cache?  Shouldn't we rather
> > find the pages, kick them out if clean, start writeback if not, and *then*
> > return -EAGAIN?
> > 
> > So maybe we want to spin up the pagevec machinery after all so we can
> > do that extra work?
> 
> As pointed out in the last round of these patches I think we really
> need to pass a flags argument to filemap_write_and_wait_range to
> communicate the non-blocking nature and only return -EAGAIN if we'd
> block.  As a bonus that can indeed start to kick the pages out.

Aren't flags to filemap_write_and_wait_range() unnecessary complication?
Realistically, most users wanting performance from AIO DIO so badly that
they bother with this API won't have any pages to write / evict. If they do
by some bad accident, they can fall back to standard "blocking" AIO DIO.
So I don't see much value in teaching filemap_write_and_wait_range() about
a non-blocking mode...

								Honza
Matthew Wilcox March 2, 2017, 2:12 p.m.
On Thu, Mar 02, 2017 at 11:38:45AM +0100, Jan Kara wrote:
> On Wed 01-03-17 07:38:57, Christoph Hellwig wrote:
> > On Tue, Feb 28, 2017 at 07:46:06PM -0800, Matthew Wilcox wrote:
> > > But what's going to kick these pages out of cache?  Shouldn't we rather
> > > find the pages, kick them out if clean, start writeback if not, and *then*
> > > return -EAGAIN?
> > 
> > As pointed out in the last round of these patches I think we really
> > need to pass a flags argument to filemap_write_and_wait_range to
> > communicate the non-blocking nature and only return -EAGAIN if we'd
> > block.  As a bonus that can indeed start to kick the pages out.
> 
> Aren't flags to filemap_write_and_wait_range() unnecessary complication?
> Realistically, most users wanting performance from AIO DIO so badly that
> they bother with this API won't have any pages to write / evict. If they do
> by some bad accident, they can fall back to standard "blocking" AIO DIO.
> So I don't see much value in teaching filemap_write_and_wait_range() about
> a non-blocking mode...

That lets me execute a DoS against a user using this API.  All I have
to do is open the file they're using read-only and read a byte from it.
Page goes into page-cache, and they'll only get -EAGAIN from calling
this syscall until the page ages out.

Also, I don't understand why this is a flag.  Isn't the point of AIO to
be non-blocking?  Why isn't this just a change to how we do AIO?
Jan Kara March 2, 2017, 3:22 p.m.
On Thu 02-03-17 06:12:45, Matthew Wilcox wrote:
> On Thu, Mar 02, 2017 at 11:38:45AM +0100, Jan Kara wrote:
> > On Wed 01-03-17 07:38:57, Christoph Hellwig wrote:
> > > On Tue, Feb 28, 2017 at 07:46:06PM -0800, Matthew Wilcox wrote:
> > > > But what's going to kick these pages out of cache?  Shouldn't we rather
> > > > find the pages, kick them out if clean, start writeback if not, and *then*
> > > > return -EAGAIN?
> > > 
> > > As pointed out in the last round of these patches I think we really
> > > need to pass a flags argument to filemap_write_and_wait_range to
> > > communicate the non-blocking nature and only return -EAGAIN if we'd
> > > block.  As a bonus that can indeed start to kick the pages out.
> > 
> > Aren't flags to filemap_write_and_wait_range() unnecessary complication?
> > Realistically, most users wanting performance from AIO DIO so badly that
> > they bother with this API won't have any pages to write / evict. If they do
> > by some bad accident, they can fall back to standard "blocking" AIO DIO.
> > So I don't see much value in teaching filemap_write_and_wait_range() about
> > a non-blocking mode...
> 
> That lets me execute a DoS against a user using this API.  All I have
> to do is open the file they're using read-only and read a byte from it.
> Page goes into page-cache, and they'll only get -EAGAIN from calling
> this syscall until the page ages out.

It will not be a DoS. This non-blocking AIO can always return EAGAIN when
it feels like it and the caller is required to fall back to a blocking
version in that case if he wants to guarantee forward progress. It is just
a performance optimization which allows user (database) to submit IO from a
computation thread instead of having to offload it to an IO thread...

> Also, I don't understand why this is a flag.  Isn't the point of AIO to
> be non-blocking?  Why isn't this just a change to how we do AIO?

Because this is an API change and the caller has to implement some handling
to guarantee a forward progress of non-blocking IO...

								Honza

Patch hide | download patch | download mbox

diff --git a/include/linux/fs.h b/include/linux/fs.h
index ab2f556..527ef53 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2494,6 +2494,8 @@  extern int filemap_fdatawait(struct address_space *);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
 				   loff_t lend);
+extern int filemap_range_has_page(struct address_space *, loff_t lstart,
+				   loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
 				        loff_t lstart, loff_t lend);
diff --git a/mm/filemap.c b/mm/filemap.c
index 78dd50e..82335f4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -375,6 +375,39 @@  int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
+/**
+ * filemap_range_has_page - check if a page exists in range.
+ * @mapping:		address space structure to wait for
+ * @start_byte:		offset in bytes where the range starts
+ * @end_byte:		offset in bytes where the range ends (inclusive)
+ *
+ * Find at least one page in the range supplied, usually used to check if
+ * direct writing in this range will trigger a writeback.
+ */
+int filemap_range_has_page(struct address_space *mapping,
+				loff_t start_byte, loff_t end_byte)
+{
+	pgoff_t index = start_byte >> PAGE_SHIFT;
+	pgoff_t end = end_byte >> PAGE_SHIFT;
+	struct pagevec pvec;
+	int ret;
+
+	if (end_byte < start_byte)
+		return 0;
+
+	if (mapping->nrpages == 0)
+		return 0;
+
+	pagevec_init(&pvec, 0);
+	ret = pagevec_lookup(&pvec, mapping, index, 1);
+	if (!ret)
+		return 0;
+	ret = (pvec.pages[0]->index <= end);
+	pagevec_release(&pvec);
+	return ret;
+}
+EXPORT_SYMBOL(filemap_range_has_page);
+
 static int __filemap_fdatawait_range(struct address_space *mapping,
 				     loff_t start_byte, loff_t end_byte)
 {
@@ -2631,6 +2664,9 @@  inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 
 	pos = iocb->ki_pos;
 
+	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+		return -EINVAL;
+
 	if (limit != RLIM_INFINITY) {
 		if (iocb->ki_pos >= limit) {
 			send_sig(SIGXFSZ, current, 0);
@@ -2700,9 +2736,17 @@  generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	write_len = iov_iter_count(from);
 	end = (pos + write_len - 1) >> PAGE_SHIFT;
 
-	written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
-	if (written)
-		goto out;
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		/* If there are pages to writeback, return */
+		if (filemap_range_has_page(inode->i_mapping, pos,
+					   pos + iov_iter_count(from)))
+			return -EAGAIN;
+	} else {
+		written = filemap_write_and_wait_range(mapping, pos,
+							pos + write_len - 1);
+		if (written)
+			goto out;
+	}
 
 	/*
 	 * After a write we want buffered reads to be sure to go to disk to get