
net: bypass ->sendpage for slab pages

Message ID 20200819051945.1797088-1-hch@lst.de
State Deferred
Delegated to: David Miller
Series net: bypass ->sendpage for slab pages

Commit Message

Christoph Hellwig Aug. 19, 2020, 5:19 a.m. UTC
Sending Slab or tail pages into ->sendpage will cause really strange
delayed oopses.  Prevent it right in the networking code instead of
requiring drivers to guess the exact conditions where sendpage works.
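
A caller pattern that triggers this looks roughly like the sketch
below (purely illustrative, not taken from any in-tree driver):

	/* buf comes from kmalloc(), so virt_to_page(buf) points at a slab
	 * page whose refcount is owned by the slab allocator; ->sendpage
	 * taking and dropping page references on it corrupts that state
	 * and oopses much later.
	 */
	static int send_kmalloc_buf(struct socket *sock, void *buf, size_t len)
	{
		return kernel_sendpage(sock, virt_to_page(buf),
				       offset_in_page(buf), len, MSG_NOSIGNAL);
	}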

Based on a patch from Coly Li <colyli@suse.de>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 net/socket.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

David Miller Aug. 19, 2020, 7:07 p.m. UTC | #1
From: Christoph Hellwig <hch@lst.de>
Date: Wed, 19 Aug 2020 07:19:45 +0200

> Sending Slab or tail pages into ->sendpage will cause really strange
> delayed oopses.  Prevent it right in the networking code instead of
> requiring drivers to guess the exact conditions where sendpage works.
> 
> Based on a patch from Coly Li <colyli@suse.de>.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Yes this fixes the problem, but it doesn't in any way deal with the
callers who are doing this stuff.

They are all likely using sendpage because they expect that it will
avoid the copy, for performance reasons or whatever.

Now it won't.

At least with Coly's patch set, the set of violators was documented
and they could switch to allocating non-slab pages or calling
sendmsg() or write() instead.
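
For reference, the sendmsg()-based path for a slab-backed buffer would
look roughly like the sketch below (names are illustrative; this is
essentially what sock_no_sendpage() ends up doing internally):

	struct kvec vec = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = { .msg_flags = MSG_NOSIGNAL };

	/* copies the data into the socket, so slab memory is fine here */
	ret = kernel_sendmsg(sock, &msg, &vec, 1, len);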

I hear talk about ABIs just doing the right thing, but when their
value is increased performance vs. other interfaces it means that
taking a slow path silently is bad in the long term.  And that's
what this proposed patch here does.
Christoph Hellwig Aug. 20, 2020, 4:37 a.m. UTC | #2
On Wed, Aug 19, 2020 at 12:07:09PM -0700, David Miller wrote:
> Yes this fixes the problem, but it doesn't in any way deal with the
> callers who are doing this stuff.
> 
> They are all likely using sendpage because they expect that it will
> avoid the copy, for performance reasons or whatever.
> 
> Now it won't.
> 
> At least with Coly's patch set, the set of violators was documented
> and they could switch to allocating non-slab pages or calling
> sendmsg() or write() instead.
> 
> I hear talk about ABIs just doing the right thing, but when their
> value is increased performance vs. other interfaces it means that
> taking a slow path silently is bad in the long term.  And that's
> what this proposed patch here does.

If you look at who uses sendpage outside the networking layer itself
you see that it is basically block drivers and file systems.  These
have no way to control what memory they get passed and have to deal
with everything someone throws at them.

So for these callers the requirements are in order of importance:

 (1) just send the damn page without generating weird OOPSes
 (2) do so as fast as possible
 (3) do so without requiring pointless boilerplate code

And I think the current interface fails these requirements really badly.
Having a helper that just does the right thing would really help all of
these users, including those currently using raw ->sendpage over
kernel_sendpage.  If you don't want kernel_sendpage to just do the
right thing, we could add another helper, e.g.
kernel_sendpage_or_fallback, but that would seem a little pointless
to me.
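
Such a helper could be as small as the sketch below (the name and the
exact checks are only illustrative; they mirror the hunk at the bottom
of this page):

	static inline int kernel_sendpage_or_fallback(struct socket *sock,
			struct page *page, int offset, size_t size, int flags)
	{
		/* zero-copy only for pages ->sendpage can safely refcount */
		if (sock->ops->sendpage && !PageSlab(page) &&
		    page_count(page) > 0)
			return sock->ops->sendpage(sock, page, offset, size,
						   flags);
		return sock_no_sendpage(sock, page, offset, size, flags);
	}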
David Miller Aug. 21, 2020, 9:14 p.m. UTC | #3
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 20 Aug 2020 06:37:44 +0200

> If you look at who uses sendpage outside the networking layer itself
> you see that it is basically block drivers and file systems.  These
> have no way to control what memory they get passed and have to deal
> with everything someone throws at them.

I see nvme doing virt_to_page() on several things when it calls into
kernel_sendpage().

This is the kind of stuff I want cleaned up, and which your patch
will not trap nor address.

In nvme it sometimes seems to check for sendpage validity:

		/* can't zcopy slab pages */
		if (unlikely(PageSlab(page))) {
			ret = sock_no_sendpage(queue->sock, page, offset, len,
					flags);
		} else {
			ret = kernel_sendpage(queue->sock, page, offset, len,
					flags);
		}

Yet elsewhere it does not and just blindly calls:

	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
			offset_in_page(pdu) + req->offset, len,  flags);

This pdu seems to come from a page frag allocation.

That's the host side.  On the target side:

		ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
					left, flags);

No page slab check or anything like that.

I'm hesitant to put in the kernel_sendpage() patch, because it provides a
disincentive to fix up code like this.
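
One way to make the check uniform across these call sites would be a
small predicate along the lines of the sketch below (the helper name is
just a suggestion):

	/* true if a page can safely be handed to ->sendpage(): not a slab
	 * page and with a refcount the network stack may take and drop
	 */
	static inline bool sendpage_ok(struct page *page)
	{
		return !PageSlab(page) && page_count(page) >= 1;
	}

Each caller could then pick kernel_sendpage() or sock_no_sendpage()
based on that predicate instead of open-coding the PageSlab() test.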
Coly Li Sept. 18, 2020, 8:37 a.m. UTC | #4
On 2020/8/22 05:14, David Miller wrote:
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 20 Aug 2020 06:37:44 +0200
> 
>> If you look at who uses sendpage outside the networking layer itself
>> you see that it is basically block driver and file systems.  These
>> have no way to control what memory they get passed and have to deal
>> with everything someone throws at them.
> 
> I see nvme doing virt_to_page() on several things when it calls into
> kernel_sendpage().
> 
> This is the kind of stuff I want cleaned up, and which your patch
> will not trap nor address.
> 
> In nvme it sometimes seems to check for sendpage validity:
> 
> 		/* can't zcopy slab pages */
> 		if (unlikely(PageSlab(page))) {
> 			ret = sock_no_sendpage(queue->sock, page, offset, len,
> 					flags);
> 		} else {
> 			ret = kernel_sendpage(queue->sock, page, offset, len,
> 					flags);
> 		}
> 
> Yet elsewhere it does not and just blindly calls:
> 
> 	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
> 			offset_in_page(pdu) + req->offset, len,  flags);
> 
> This pdu seems to come from a page frag allocation.
> 
> That's the host side.  On the target side:
> 
> 		ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
> 					left, flags);
> 
> No page slab check or anything like that.
> 
> I'm hesitant to put in the kernel_sendpage() patch, because it provides a
> disincentive to fix up code like this.
> 

Hi David and Christoph,

It has been quiet for a while; what should we do next for the
kernel_sendpage() related issue?

Will Christoph's patch or my series be considered the proper fix, or
should I wait for some other, better idea to show up? Either is OK for
me, once the problem is fixed.

Thanks in advance.

Coly Li
Christoph Hellwig Sept. 21, 2020, 2:25 p.m. UTC | #5
On Fri, Sep 18, 2020 at 04:37:24PM +0800, Coly Li wrote:
> Hi David and Christoph,
> 
> It has been quiet for a while; what should we do next for the
> kernel_sendpage() related issue?
> 
> Will Christoph's patch or my series be considered the proper fix, or
> should I wait for some other, better idea to show up? Either is OK for
> me, once the problem is fixed.

I think for all the network storage stuff we really need a "send me
out a page" helper, and the nvmet bits that Dave pointed to look to
me like they actually are currently broken.

Given that Dave doesn't want to change the kernel_sendpage semantics
I'll resend with a new helper instead.  Any preferences for a name?
safe_sendpage?  kernel_sendpage_safe?  kernel_send_one_page?

Patch

diff --git a/net/socket.c b/net/socket.c
index dbbe8ea7d395da..b4e65688915fe3 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -3638,7 +3638,11 @@  EXPORT_SYMBOL(kernel_getpeername);
 int kernel_sendpage(struct socket *sock, struct page *page, int offset,
 		    size_t size, int flags)
 {
-	if (sock->ops->sendpage)
+	/* sendpage manipulates the refcount of the page passed in, which
+	 * does not work for Slab pages, or for tails of non-__GFP_COMP
+	 * high order pages.
+	 */
+	if (sock->ops->sendpage && !PageSlab(page) && page_count(page) > 0)
 		return sock->ops->sendpage(sock, page, offset, size, flags);
 
 	return sock_no_sendpage(sock, page, offset, size, flags);