diff mbox series

[v3,56/68] afs: Handle len being extending over page end in write_begin/write_end

Message ID 163967169723.1823006.2868573008412053995.stgit@warthog.procyon.org.uk
State New
Headers show
Series fscache, cachefiles: Rewrite | expand

Commit Message

David Howells Dec. 16, 2021, 4:21 p.m. UTC
With transparent huge pages, in the future, write_begin() and write_end()
may be passed a length parameter that, in combination with the offset into
the page, exceeds the length of that page.  This allows
grab_cache_page_write_begin() to better choose the size of THP to allocate.

Fix afs's functions to handle this by trimming the length as needed after
the page has been allocated.

[Removed the now-unnecessary index var; spotted by kernel test robot]

Fixes: e1b1240c1ff5 ("netfs: Add write_begin helper")
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/162367681795.460125.11729955608839747375.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163819657068.215744.601051542491746150.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906965274.143852.11487892388439890377.stgit@warthog.procyon.org.uk/ # v2
---

 fs/afs/write.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

Comments

Linus Torvalds Dec. 16, 2021, 4:31 p.m. UTC | #1
On Thu, Dec 16, 2021 at 8:22 AM David Howells <dhowells@redhat.com> wrote:
>
> With transparent huge pages, in the future, write_begin() and write_end()
> may be passed a length parameter that, in combination with the offset into
> the page, exceeds the length of that page.  This allows
> grab_cache_page_write_begin() to better choose the size of THP to allocate.

I still think this is a fundamental bug in the caller. That
"explanation" is weak, and the whole concept smells like week-old fish
to me.

         Linus
David Howells Dec. 16, 2021, 4:47 p.m. UTC | #2
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > With transparent huge pages, in the future, write_begin() and write_end()
> > may be passed a length parameter that, in combination with the offset into
> > the page, exceeds the length of that page.  This allows
> > grab_cache_page_write_begin() to better choose the size of THP to allocate.
> 
> I still think this is a fundamental bug in the caller. That
> "explanation" is weak, and the whole concept smells like week-old fish
> to me.

You really should ask Willy about this as it's multipage folio-related.

AIUI, because the page/folio may be allocated inside ->write_begin(),
generic_perform_write() tells the filesystem how much it has been asked to
write and then the folio allocation can be made to fit that.

However, at this time, ->write_begin() and ->write_end() have a page pointer
(or pointer-to-pointer), not a folio pointer, in their signature, so the
filesystem has to convert between them.

I'm working on write helpers for netfslib that absorb this out of the
filesystems that use it into its own take on generic_perform_write(), thereby
eliminating the need for ->write_begin and ->write_end.  I have this kind of
working for afs and 9p at the moment and am looking at ceph, but there's a way
to go yet.  I believe iomap does the same for block-based filesystems that use
it (such as xfs).

I think Willy's aim is to get rid of ->write_begin and ->write_end entirely.

David
Matthew Wilcox Dec. 16, 2021, 7:28 p.m. UTC | #3
On Thu, Dec 16, 2021 at 08:31:19AM -0800, Linus Torvalds wrote:
> On Thu, Dec 16, 2021 at 8:22 AM David Howells <dhowells@redhat.com> wrote:
> >
> > With transparent huge pages, in the future, write_begin() and write_end()
> > may be passed a length parameter that, in combination with the offset into
> > the page, exceeds the length of that page.  This allows
> > grab_cache_page_write_begin() to better choose the size of THP to allocate.
> 
> I still think this is a fundamental bug in the caller. That
> "explanation" is weak, and the whole concept smells like week-old fish
> to me.

You're right that  ->write_end can't be called with more bytes than fit
in the folio.  That makes no sense at all.

I haven't finished fully fleshing this out yet (and as a result we still
only create PAGE_SIZE folios on writes), but my basic plan was:

generic_perform_write:
-	bytes = min_t(unsigned long, PAGE_SIZE - offset,
+	bytes = min_t(unsigned long, FOLIO_MAX_PAGE_CACHE_SIZE - offset,
 					iov_iter_count(i));
...
                status = a_ops->write_begin(file, mapping, pos, bytes, flags,
-                                               &page, &fsdata);
+                                               &folio, &fsdata);

+		offset = offset_in_folio(folio, pos);
+		bytes = folio_size(folio - offset);
...
                status = a_ops->write_end(file, mapping, pos, bytes, copied,
-                                               page, fsdata);
+                                               folio, fsdata);

Since ->write_begin is the place where we actually create folios, it
needs to know what size folio to create.  Unless you'd rather we do
something to actually create the folio before calling ->write_begin?
Linus Torvalds Dec. 16, 2021, 7:46 p.m. UTC | #4
On Thu, Dec 16, 2021 at 11:28 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> Since ->write_begin is the place where we actually create folios, it
> needs to know what size folio to create.  Unless you'd rather we do
> something to actually create the folio before calling ->write_begin?

I don't think we can create a folio before that, because the
filesystem may not even want a folio (think persistent memory or
whatever).

Honestly, I think you need to describe more what you actually want to
happen. Because generic_perform_write() has already decided to use a
PAGE_SIZE by the time write_begin() is called,

Right now the world order is "we chunk things by PAGE_SIZE", and
that's just how it is.

I can see other options - like the filesystem passing in the chunk
size when it calls generic_perform_write().

Or we make the rule be that ->write_begin() simply always is given the
whole area, and the filesystem can decide how it wants to chunk things
up, and return the size of the write chunk in the status (rather than
the current "success or error").

But at no point will this *EVER* be a "afs will limit the size to the
folio size" issue. Nothing like that will ever make sense. Allowing
bigger chunks will not be about any fscache issues, it will be about
every single filesystem that uses generic_perform_write().

So I will NAK these patches by David, because they are fundamentally
wrong, whichever way we turn. Any "write in bigger chunks" patch will
be something else entirely.

                 Linus
Matthew Wilcox Dec. 16, 2021, 8:20 p.m. UTC | #5
On Thu, Dec 16, 2021 at 11:46:18AM -0800, Linus Torvalds wrote:
> On Thu, Dec 16, 2021 at 11:28 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Since ->write_begin is the place where we actually create folios, it
> > needs to know what size folio to create.  Unless you'd rather we do
> > something to actually create the folio before calling ->write_begin?
> 
> I don't think we can create a folio before that, because the
> filesystem may not even want a folio (think persistent memory or
> whatever).
> 
> Honestly, I think you need to describe more what you actually want to
> happen. Because generic_perform_write() has already decided to use a
> PAGE_SIZE by the time write_begin() is called,
> 
> Right now the world order is "we chunk things by PAGE_SIZE", and
> that's just how it is.

Right.  And we could leave it like that.  There's a huge amount of win
that comes from just creating large folios as part of readahead, and
anything we do for writes is going to be a smaller win.

That said, I would like it if a program which does:

fd = creat("foo", 0644);
write(fd, buf, 64 * 1024);
close(fd);

uses a single 64k page.

> I can see other options - like the filesystem passing in the chunk
> size when it calls generic_perform_write().

I'm hoping to avoid that.  Ideally filesystems don't know what the
"chunk size" is that's being used; they'll see a mixture of sizes
being used for any given file (potentially).  Depends on access
patterns, availability of higher-order memory, etc.

> Or we make the rule be that ->write_begin() simply always is given the
> whole area, and the filesystem can decide how it wants to chunk things
> up, and return the size of the write chunk in the status (rather than
> the current "success or error").

We do need to be slightly more limiting than "always gets the whole
area", because we do that fault_in_iov_iter_readable() call first,
and if the user has been mean and asked to write() 2GB of memory on
a (virtual) machine with 256MB, I'd prefer it if we didn't swap our way
through 2GB of address space before calling into ->write_begin.

> But at no point will this *EVER* be a "afs will limit the size to the
> folio size" issue. Nothing like that will ever make sense. Allowing
> bigger chunks will not be about any fscache issues, it will be about
> every single filesystem that uses generic_perform_write().

I agree that there should be nothing here that is specific to fscache.
David has in the past tried to convince me that he should always get
256kB folios, and I've done my best to explain that the MM just isn't
going to make that guarantee.

That said, this patch seems to be doing the right thing; it passes
the entire length into netfs_write_begin(), and is then truncating
the length to stop at the end of the folio that it got back.
David Howells Dec. 16, 2021, 9:17 p.m. UTC | #6
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So I will NAK these patches by David, because they are fundamentally
> wrong, whichever way we turn. Any "write in bigger chunks" patch will
> be something else entirely.

I'll just drop patches 56 and 57 for now, then.  I think the problem only
manifests if I set the flag saying the filesystem/inode/whatever supports
multipage folios.

David
diff mbox series

Patch

diff --git a/fs/afs/write.c b/fs/afs/write.c
index ca4909baf5e6..8e4e87d66855 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -25,7 +25,8 @@  int afs_set_page_dirty(struct page *page)
 }
 
 /*
- * prepare to perform part of a write to a page
+ * Prepare to perform part of a write to a page.  Note that len may extend
+ * beyond the end of the page.
  */
 int afs_write_begin(struct file *file, struct address_space *mapping,
 		    loff_t pos, unsigned len, unsigned flags,
@@ -36,7 +37,6 @@  int afs_write_begin(struct file *file, struct address_space *mapping,
 	unsigned long priv;
 	unsigned f, from;
 	unsigned t, to;
-	pgoff_t index;
 	int ret;
 
 	_enter("{%llx:%llu},%llx,%x",
@@ -51,8 +51,8 @@  int afs_write_begin(struct file *file, struct address_space *mapping,
 	if (ret < 0)
 		return ret;
 
-	index = folio_index(folio);
-	from = pos - index * PAGE_SIZE;
+	from = offset_in_folio(folio, pos);
+	len = min_t(size_t, len, folio_size(folio) - from);
 	to = from + len;
 
 try_again:
@@ -103,7 +103,8 @@  int afs_write_begin(struct file *file, struct address_space *mapping,
 }
 
 /*
- * finalise part of a write to a page
+ * Finalise part of a write to a page.  Note that len may extend beyond the end
+ * of the page.
  */
 int afs_write_end(struct file *file, struct address_space *mapping,
 		  loff_t pos, unsigned len, unsigned copied,