diff mbox series

aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll)

Message ID 20190303151846.GQ2217@ZenIV.linux.org.uk
State Not Applicable
Delegated to: David Miller
Headers show
Series aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) | expand

Commit Message

Al Viro March 3, 2019, 3:18 p.m. UTC
On Sun, Mar 03, 2019 at 01:55:02PM +0000, Al Viro wrote:

> Maybe unrelated to this bug, but...  What's to prevent a wakeup
> that happens just after we'd been added to a waitqueue by ->poll()
> triggering aio_poll_wake(), which gets to aio_poll_complete()
> with its fput() *before* we'd reached the end of ->poll() instance?
> 
> I wonder if adding
> 	get_file(req->file);  // make sure that early completion is safe
> right after
>         req->file = fget(iocb->aio_fildes);
>         if (unlikely(!req->file))
>                 return -EBADF;
> paired with
> 	fput(req->file);
> right after out: in aio_poll() is needed...  Am I missing something
> obvious here?  Christoph?

In more details - normally IOCB_CMD_POLL handling looks so:

1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll()
2) aio_poll() resolves the descriptor to struct file by
	req->file = fget(iocb->aio_fildes)
3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
aio_kiocb to 2 (bumps by 1, that is).
4) aio_poll() calls vfs_poll().  After sanity checks (basically, "poll_wait()
had been called and only once") it locks the queue.  That's what the extra
reference to iocb had been for - we know we can safely access it.
5) With queue locked, we check if ->woken has already been set to true
(by aio_poll_wake()) and, if it had been, we unlock the queue, drop
a reference to aio_kiocb and bugger off - at that point it's
a responsibility to aio_poll_wake() and the stuff called/scheduled by
it.  That code will drop the reference to file in req->file, along
with the other reference to our aio_kiocb.
6) otherwise, we see whether we need to wait.  If we do, we unlock
the queue, drop one reference to aio_kiocb and go away - eventual
wakeup (or cancel) will deal with the reference to file and with the
other reference to aio_kiocb
7) otherwise we remove ourselves from waitqueue (still under the queue
lock), so that wakeup won't get us.  No async activity will be happening,
so we can safely drop req->file and iocb ourselves.

If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
won't get freed under us, so we can do all the checks and locking safely.
And we don't touch ->file if we detect that case.

However, vfs_poll() most certainly *does* touch the file it had been
given.  So wakeup coming while we are still in ->poll() might end up
doing fput() on that file.  That case is not too rare, and usually
we are saved by the still present reference from descriptor table -
that fput() is not the final one.  But if another thread closes that
descriptor right after our fget() and wakeup does happen before
->poll() returns, we are in trouble - final fput() done while we
are in the middle of a method.

What we need is to hold on to the file reference the same way we do with
aio_kiocb.  No need to store the reference to what we'd got in a separate
variable - req->file is never reassigned and we'd already made sure that
req won't be freed under us, so dropping the extra reference with
fput(req->file) is fine in all cases.

Fixes: bfe4037e722ec
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

Comments

Eric Dumazet March 3, 2019, 6:37 p.m. UTC | #1
On 03/03/2019 07:18 AM, Al Viro wrote:

> Fixes: bfe4037e722ec
> Cc: stable@vger.kernel.org
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/fs/aio.c b/fs/aio.c
> index 3083180a54c8..7e88bfabdac2 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1767,6 +1767,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
>  
>  	/* one for removal from waitqueue, one for this function */
>  	refcount_set(&aiocb->ki_refcnt, 2);
> +	get_file(req->file);
>  
>  	mask = vfs_poll(req->file, &apt.pt) & req->events;
>  	if (unlikely(!req->head)) {
> @@ -1793,6 +1794,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
>  	spin_unlock_irq(&ctx->ctx_lock);
>  
>  out:
> +	fput(req->file);
>  	if (unlikely(apt.error)) {
>  		fput(req->file);
>  		return apt.error;
> 

Very nice changelog Al, thanks for fixing this.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Linus Torvalds March 3, 2019, 7:44 p.m. UTC | #2
On Sun, Mar 3, 2019 at 7:18 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> > Maybe unrelated to this bug, but...  What's to prevent a wakeup
> > that happens just after we'd been added to a waitqueue by ->poll()
> > triggering aio_poll_wake(), which gets to aio_poll_complete()
> > with its fput() *before* we'd reached the end of ->poll() instance?

I'm assuming you're talking about the second vfs_poll() in
aio_poll_complete_work()? The one we call before we check for
"rew->cancelled" properly under the spinlock?

> 1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll()
> 2) aio_poll() resolves the descriptor to struct file by
>         req->file = fget(iocb->aio_fildes)
[...]

So honestly, the whole filp handling in aio looks overly complicated to me.

All the different operations do that silly fget/fput() dance, although
aio_read/write at least tried to make a common helper function for
handling it.

But as far as I can tell, they *all* could do:

 - look up file in aio_submit() when allocating and creating the aio_kiocb

 - drop the filp in 'iocb_put()' (which happens whether things
complete successfully or not).

and we'd have avoided a lot of complexity, and we'd have avoided this bug.

Your patch fixes the poll() case, but it does so by just letting the
existing complexity remain, and adding a second fget/fput pair in the
poll logic.

It would seem like it would be much better to rip all the complexity
out entirely, and replace it with sane, simple and obviously correct
code.

Hmm?

In other words, why wouldn't something like the attached work instead?

TOTALLY UNTESTED! It builds, and it looks sane, but maybe I'm
overlooking some obvious problem with it. But doesn't it look nice to
see

 2 files changed, 41 insertions(+), 50 deletions(-)

with actual code reduction, and a fundamental simplification in
handling of the file pointer?

                 Linus
fs/aio.c           | 83 ++++++++++++++++++++++--------------------------------
 include/linux/fs.h |  8 +++++-
 2 files changed, 41 insertions(+), 50 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index aaaaf4d12c73..9eccd2ea6ca9 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -167,14 +167,18 @@ struct kioctx {
 	unsigned		id;
 };
 
+/*
+ * First field must be 'ki_filp' in all the
+ * iocb unions!
+ */
 struct fsync_iocb {
+	struct file		*ki_filp;
 	struct work_struct	work;
-	struct file		*file;
 	bool			datasync;
 };
 
 struct poll_iocb {
-	struct file		*file;
+	struct file		*ki_filp;
 	struct wait_queue_head	*head;
 	__poll_t		events;
 	bool			woken;
@@ -183,8 +187,15 @@ struct poll_iocb {
 	struct work_struct	work;
 };
 
+/*
+ * NOTE! Each of the iocb union members has "ki_filp" as
+ * the first entry in their struct definition. So you can
+ * access the file pointer either directly through this
+ * anonymous union, or through any of the sub-structs.
+ */
 struct aio_kiocb {
 	union {
+		struct file		*ki_filp;
 		struct kiocb		rw;
 		struct fsync_iocb	fsync;
 		struct poll_iocb	poll;
@@ -1060,6 +1071,7 @@ static inline void iocb_put(struct aio_kiocb *iocb)
 {
 	if (refcount_read(&iocb->ki_refcnt) == 0 ||
 	    refcount_dec_and_test(&iocb->ki_refcnt)) {
+		fput(iocb->ki_filp);
 		percpu_ref_put(&iocb->ki_ctx->reqs);
 		kmem_cache_free(kiocb_cachep, iocb);
 	}
@@ -1413,7 +1425,7 @@ static void aio_complete_rw(struct kiocb *kiocb, long res, long res2)
 		aio_remove_iocb(iocb);
 
 	if (kiocb->ki_flags & IOCB_WRITE) {
-		struct inode *inode = file_inode(kiocb->ki_filp);
+		struct inode *inode = file_inode(iocb->ki_filp);
 
 		/*
 		 * Tell lockdep we inherited freeze protection from submission
@@ -1421,10 +1433,9 @@ static void aio_complete_rw(struct kiocb *kiocb, long res, long res2)
 		 */
 		if (S_ISREG(inode->i_mode))
 			__sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE);
-		file_end_write(kiocb->ki_filp);
+		file_end_write(iocb->ki_filp);
 	}
 
-	fput(kiocb->ki_filp);
 	aio_complete(iocb, res, res2);
 }
 
@@ -1432,9 +1443,6 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
 {
 	int ret;
 
-	req->ki_filp = fget(iocb->aio_fildes);
-	if (unlikely(!req->ki_filp))
-		return -EBADF;
 	req->ki_complete = aio_complete_rw;
 	req->private = NULL;
 	req->ki_pos = iocb->aio_offset;
@@ -1451,7 +1459,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
 		ret = ioprio_check_cap(iocb->aio_reqprio);
 		if (ret) {
 			pr_debug("aio ioprio check cap error: %d\n", ret);
-			goto out_fput;
+			return ret;
 		}
 
 		req->ki_ioprio = iocb->aio_reqprio;
@@ -1460,14 +1468,10 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
 
 	ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
 	if (unlikely(ret))
-		goto out_fput;
+		return ret;
 
 	req->ki_flags &= ~IOCB_HIPRI; /* no one is going to poll for this I/O */
 	return 0;
-
-out_fput:
-	fput(req->ki_filp);
-	return ret;
 }
 
 static int aio_setup_rw(int rw, const struct iocb *iocb, struct iovec **iovec,
@@ -1521,24 +1525,19 @@ static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb,
 	if (ret)
 		return ret;
 	file = req->ki_filp;
-
-	ret = -EBADF;
 	if (unlikely(!(file->f_mode & FMODE_READ)))
-		goto out_fput;
+		return -EBADF;
 	ret = -EINVAL;
 	if (unlikely(!file->f_op->read_iter))
-		goto out_fput;
+		return -EINVAL;
 
 	ret = aio_setup_rw(READ, iocb, &iovec, vectored, compat, &iter);
 	if (ret)
-		goto out_fput;
+		return ret;
 	ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter));
 	if (!ret)
 		aio_rw_done(req, call_read_iter(file, req, &iter));
 	kfree(iovec);
-out_fput:
-	if (unlikely(ret))
-		fput(file);
 	return ret;
 }
 
@@ -1555,16 +1554,14 @@ static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb,
 		return ret;
 	file = req->ki_filp;
 
-	ret = -EBADF;
 	if (unlikely(!(file->f_mode & FMODE_WRITE)))
-		goto out_fput;
-	ret = -EINVAL;
+		return -EBADF;
 	if (unlikely(!file->f_op->write_iter))
-		goto out_fput;
+		return -EINVAL;
 
 	ret = aio_setup_rw(WRITE, iocb, &iovec, vectored, compat, &iter);
 	if (ret)
-		goto out_fput;
+		return ret;
 	ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter));
 	if (!ret) {
 		/*
@@ -1582,9 +1579,6 @@ static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb,
 		aio_rw_done(req, call_write_iter(file, req, &iter));
 	}
 	kfree(iovec);
-out_fput:
-	if (unlikely(ret))
-		fput(file);
 	return ret;
 }
 
@@ -1593,8 +1587,7 @@ static void aio_fsync_work(struct work_struct *work)
 	struct fsync_iocb *req = container_of(work, struct fsync_iocb, work);
 	int ret;
 
-	ret = vfs_fsync(req->file, req->datasync);
-	fput(req->file);
+	ret = vfs_fsync(req->ki_filp, req->datasync);
 	aio_complete(container_of(req, struct aio_kiocb, fsync), ret, 0);
 }
 
@@ -1605,13 +1598,8 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb,
 			iocb->aio_rw_flags))
 		return -EINVAL;
 
-	req->file = fget(iocb->aio_fildes);
-	if (unlikely(!req->file))
-		return -EBADF;
-	if (unlikely(!req->file->f_op->fsync)) {
-		fput(req->file);
+	if (unlikely(!req->ki_filp->f_op->fsync))
 		return -EINVAL;
-	}
 
 	req->datasync = datasync;
 	INIT_WORK(&req->work, aio_fsync_work);
@@ -1621,10 +1609,7 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb,
 
 static inline void aio_poll_complete(struct aio_kiocb *iocb, __poll_t mask)
 {
-	struct file *file = iocb->poll.file;
-
 	aio_complete(iocb, mangle_poll(mask), 0);
-	fput(file);
 }
 
 static void aio_poll_complete_work(struct work_struct *work)
@@ -1636,7 +1621,7 @@ static void aio_poll_complete_work(struct work_struct *work)
 	__poll_t mask = 0;
 
 	if (!READ_ONCE(req->cancelled))
-		mask = vfs_poll(req->file, &pt) & req->events;
+		mask = vfs_poll(iocb->ki_filp, &pt) & req->events;
 
 	/*
 	 * Note that ->ki_cancel callers also delete iocb from active_reqs after
@@ -1743,9 +1728,6 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
 
 	INIT_WORK(&req->work, aio_poll_complete_work);
 	req->events = demangle_poll(iocb->aio_buf) | EPOLLERR | EPOLLHUP;
-	req->file = fget(iocb->aio_fildes);
-	if (unlikely(!req->file))
-		return -EBADF;
 
 	req->head = NULL;
 	req->woken = false;
@@ -1763,7 +1745,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
 	/* one for removal from waitqueue, one for this function */
 	refcount_set(&aiocb->ki_refcnt, 2);
 
-	mask = vfs_poll(req->file, &apt.pt) & req->events;
+	mask = vfs_poll(aiocb->ki_filp, &apt.pt) & req->events;
 	if (unlikely(!req->head)) {
 		/* we did not manage to set up a waitqueue, done */
 		goto out;
@@ -1788,10 +1770,8 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
 	spin_unlock_irq(&ctx->ctx_lock);
 
 out:
-	if (unlikely(apt.error)) {
-		fput(req->file);
+	if (unlikely(apt.error))
 		return apt.error;
-	}
 
 	if (mask)
 		aio_poll_complete(aiocb, mask);
@@ -1829,6 +1809,11 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 	if (unlikely(!req))
 		goto out_put_reqs_available;
 
+	req->ki_filp = fget(iocb->aio_fildes);
+	ret = -EBADF;
+	if (unlikely(!req->ki_filp))
+		goto out_put_req;
+
 	if (iocb->aio_flags & IOCB_FLAG_RESFD) {
 		/*
 		 * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 29d8e2cfed0e..fd423fec8d83 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -304,13 +304,19 @@ enum rw_hint {
 
 struct kiocb {
 	struct file		*ki_filp;
+
+	/* The 'ki_filp' pointer is shared in a union for aio */
+	randomized_struct_fields_start
+
 	loff_t			ki_pos;
 	void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
 	void			*private;
 	int			ki_flags;
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
-} __randomize_layout;
+
+	randomized_struct_fields_end
+};
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
 {
Linus Torvalds March 3, 2019, 8:13 p.m. UTC | #3
On Sun, Mar 3, 2019 at 11:44 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But doesn't it look nice to see
>
>  2 files changed, 41 insertions(+), 50 deletions(-)
>
> with actual code reduction, and a fundamental simplification in
> handling of the file pointer?

A coupl,e of the changes are "useless", and do the same thing as not
having them at all:

-               struct inode *inode = file_inode(kiocb->ki_filp);
+               struct inode *inode = file_inode(iocb->ki_filp);

-               file_end_write(kiocb->ki_filp);
+               file_end_write(iocb->ki_filp);

because the "ki_filp" ends up existing in both kiocb and iocb. At one
point of editing that file I decided to try to just remove it from the
sub-structs entirely and only keep it in the top-level structure, but
it needs to be inside the 'struct kiocb' anyway for all the other
users outside of fs/aio.c.

Anyway, I don't think the patch is wrong (although I haven't actually
_tested_ it) but I wanted to point out that those two one-liner
changes are just "noise" that doesn't matter for the working of the
patch.

In the above, we have 'kiocb' being the embedded 'struct kiocb', and
'iocb' is the 'struct aio_kiocb' that contains it. 'ki_filp' is the
exact same field in both cases.

                 Linus

               Linus
Al Viro March 3, 2019, 8:30 p.m. UTC | #4
On Sun, Mar 03, 2019 at 11:44:33AM -0800, Linus Torvalds wrote:
> On Sun, Mar 3, 2019 at 7:18 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > > Maybe unrelated to this bug, but...  What's to prevent a wakeup
> > > that happens just after we'd been added to a waitqueue by ->poll()
> > > triggering aio_poll_wake(), which gets to aio_poll_complete()
> > > with its fput() *before* we'd reached the end of ->poll() instance?
> 
> I'm assuming you're talking about the second vfs_poll() in
> aio_poll_complete_work()? The one we call before we check for
> "rew->cancelled" properly under the spinlock?

No.  The first one, right in aio_poll().

> But as far as I can tell, they *all* could do:
> 
>  - look up file in aio_submit() when allocating and creating the aio_kiocb
> 
>  - drop the filp in 'iocb_put()' (which happens whether things
> complete successfully or not).
>
> and we'd have avoided a lot of complexity, and we'd have avoided this bug.
> 
> Your patch fixes the poll() case, but it does so by just letting the
> existing complexity remain, and adding a second fget/fput pair in the
> poll logic.
> 
> It would seem like it would be much better to rip all the complexity
> out entirely, and replace it with sane, simple and obviously correct
> code.
> 
> Hmm?
> 
> In other words, why wouldn't something like the attached work instead?
 
I'll need to dig out the mail archive from last year, but IIRC this
had been considered and there'd been non-trivial problems.  Give me
an hour or so and I'll dig that out (there'd been many rounds of
review, with long threads involved, some private, some on fsdevel).

> @@ -1060,6 +1071,7 @@ static inline void iocb_put(struct aio_kiocb *iocb)
>  {
>  	if (refcount_read(&iocb->ki_refcnt) == 0 ||
>  	    refcount_dec_and_test(&iocb->ki_refcnt)) {
> +		fput(iocb->ki_filp);

Trivial oops here - it might be NULL.
Linus Torvalds March 3, 2019, 10:23 p.m. UTC | #5
On Sun, Mar 3, 2019 at 12:30 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sun, Mar 03, 2019 at 11:44:33AM -0800, Linus Torvalds wrote:
> >
> > I'm assuming you're talking about the second vfs_poll() in
> > aio_poll_complete_work()? The one we call before we check for
> > "rew->cancelled" properly under the spinlock?
>
> No.  The first one, right in aio_poll().

Ok, they are both confusing. The lifetime of that filp is just not
clear, with the whole "it could have been completed asynchronously"
issue.

> I'll need to dig out the mail archive from last year, but IIRC this
> had been considered and there'd been non-trivial problems.  Give me
> an hour or so and I'll dig that out (there'd been many rounds of
> review, with long threads involved, some private, some on fsdevel).

Looking more at the patch, it still looks eminently sane to me.I fixed
the silly "ki_filp" thing you noticed (I thought we made fput() take
NULL, but you're right we don't), and simplified the patch to not do
the changes that aren't necessary.

I fixed the silly "filp can be NULL" issue you pointed out (I lazily
thought we allowed fput(NULL), but you're right, we definitely don't),
and simplified the patch to not do the unnecessary changes where we
can just access the file pointer multiple different ways.

And the resulting kernel boots fine, but I doubt I have anything that
actually uses io_submit(), so that doesn't mean much.

Slightly updated patch attached anyway, even smaller than before:

 2 files changed, 36 insertions(+), 44 deletions(-)

with several of the new lines being comments about that file pointer
placement in the union.

                Linus
fs/aio.c           | 72 ++++++++++++++++++++++--------------------------------
 include/linux/fs.h |  8 +++++-
 2 files changed, 36 insertions(+), 44 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index aaaaf4d12c73..82c08422b0f4 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -167,9 +167,13 @@ struct kioctx {
 	unsigned		id;
 };
 
+/*
+ * First field must be the file pointer in all the
+ * iocb unions! See also 'struct kiocb' in <linux/fs.h>
+ */
 struct fsync_iocb {
-	struct work_struct	work;
 	struct file		*file;
+	struct work_struct	work;
 	bool			datasync;
 };
 
@@ -183,8 +187,15 @@ struct poll_iocb {
 	struct work_struct	work;
 };
 
+/*
+ * NOTE! Each of the iocb union members has the file pointer
+ * as the first entry in their struct definition. So you can
+ * access the file pointer through any of the sub-structs,
+ * or directly as just 'ki_filp' in this struct.
+ */
 struct aio_kiocb {
 	union {
+		struct file		*ki_filp;
 		struct kiocb		rw;
 		struct fsync_iocb	fsync;
 		struct poll_iocb	poll;
@@ -1060,6 +1071,8 @@ static inline void iocb_put(struct aio_kiocb *iocb)
 {
 	if (refcount_read(&iocb->ki_refcnt) == 0 ||
 	    refcount_dec_and_test(&iocb->ki_refcnt)) {
+		if (iocb->ki_filp)
+			fput(iocb->ki_filp);
 		percpu_ref_put(&iocb->ki_ctx->reqs);
 		kmem_cache_free(kiocb_cachep, iocb);
 	}
@@ -1424,7 +1437,6 @@ static void aio_complete_rw(struct kiocb *kiocb, long res, long res2)
 		file_end_write(kiocb->ki_filp);
 	}
 
-	fput(kiocb->ki_filp);
 	aio_complete(iocb, res, res2);
 }
 
@@ -1432,9 +1444,6 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
 {
 	int ret;
 
-	req->ki_filp = fget(iocb->aio_fildes);
-	if (unlikely(!req->ki_filp))
-		return -EBADF;
 	req->ki_complete = aio_complete_rw;
 	req->private = NULL;
 	req->ki_pos = iocb->aio_offset;
@@ -1451,7 +1460,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
 		ret = ioprio_check_cap(iocb->aio_reqprio);
 		if (ret) {
 			pr_debug("aio ioprio check cap error: %d\n", ret);
-			goto out_fput;
+			return ret;
 		}
 
 		req->ki_ioprio = iocb->aio_reqprio;
@@ -1460,14 +1469,10 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
 
 	ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
 	if (unlikely(ret))
-		goto out_fput;
+		return ret;
 
 	req->ki_flags &= ~IOCB_HIPRI; /* no one is going to poll for this I/O */
 	return 0;
-
-out_fput:
-	fput(req->ki_filp);
-	return ret;
 }
 
 static int aio_setup_rw(int rw, const struct iocb *iocb, struct iovec **iovec,
@@ -1521,24 +1526,19 @@ static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb,
 	if (ret)
 		return ret;
 	file = req->ki_filp;
-
-	ret = -EBADF;
 	if (unlikely(!(file->f_mode & FMODE_READ)))
-		goto out_fput;
+		return -EBADF;
 	ret = -EINVAL;
 	if (unlikely(!file->f_op->read_iter))
-		goto out_fput;
+		return -EINVAL;
 
 	ret = aio_setup_rw(READ, iocb, &iovec, vectored, compat, &iter);
 	if (ret)
-		goto out_fput;
+		return ret;
 	ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter));
 	if (!ret)
 		aio_rw_done(req, call_read_iter(file, req, &iter));
 	kfree(iovec);
-out_fput:
-	if (unlikely(ret))
-		fput(file);
 	return ret;
 }
 
@@ -1555,16 +1555,14 @@ static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb,
 		return ret;
 	file = req->ki_filp;
 
-	ret = -EBADF;
 	if (unlikely(!(file->f_mode & FMODE_WRITE)))
-		goto out_fput;
-	ret = -EINVAL;
+		return -EBADF;
 	if (unlikely(!file->f_op->write_iter))
-		goto out_fput;
+		return -EINVAL;
 
 	ret = aio_setup_rw(WRITE, iocb, &iovec, vectored, compat, &iter);
 	if (ret)
-		goto out_fput;
+		return ret;
 	ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter));
 	if (!ret) {
 		/*
@@ -1582,9 +1580,6 @@ static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb,
 		aio_rw_done(req, call_write_iter(file, req, &iter));
 	}
 	kfree(iovec);
-out_fput:
-	if (unlikely(ret))
-		fput(file);
 	return ret;
 }
 
@@ -1594,7 +1589,6 @@ static void aio_fsync_work(struct work_struct *work)
 	int ret;
 
 	ret = vfs_fsync(req->file, req->datasync);
-	fput(req->file);
 	aio_complete(container_of(req, struct aio_kiocb, fsync), ret, 0);
 }
 
@@ -1605,13 +1599,8 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb,
 			iocb->aio_rw_flags))
 		return -EINVAL;
 
-	req->file = fget(iocb->aio_fildes);
-	if (unlikely(!req->file))
-		return -EBADF;
-	if (unlikely(!req->file->f_op->fsync)) {
-		fput(req->file);
+	if (unlikely(!req->file->f_op->fsync))
 		return -EINVAL;
-	}
 
 	req->datasync = datasync;
 	INIT_WORK(&req->work, aio_fsync_work);
@@ -1621,10 +1610,7 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb,
 
 static inline void aio_poll_complete(struct aio_kiocb *iocb, __poll_t mask)
 {
-	struct file *file = iocb->poll.file;
-
 	aio_complete(iocb, mangle_poll(mask), 0);
-	fput(file);
 }
 
 static void aio_poll_complete_work(struct work_struct *work)
@@ -1743,9 +1729,6 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
 
 	INIT_WORK(&req->work, aio_poll_complete_work);
 	req->events = demangle_poll(iocb->aio_buf) | EPOLLERR | EPOLLHUP;
-	req->file = fget(iocb->aio_fildes);
-	if (unlikely(!req->file))
-		return -EBADF;
 
 	req->head = NULL;
 	req->woken = false;
@@ -1788,10 +1771,8 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
 	spin_unlock_irq(&ctx->ctx_lock);
 
 out:
-	if (unlikely(apt.error)) {
-		fput(req->file);
+	if (unlikely(apt.error))
 		return apt.error;
-	}
 
 	if (mask)
 		aio_poll_complete(aiocb, mask);
@@ -1829,6 +1810,11 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 	if (unlikely(!req))
 		goto out_put_reqs_available;
 
+	req->ki_filp = fget(iocb->aio_fildes);
+	ret = -EBADF;
+	if (unlikely(!req->ki_filp))
+		goto out_put_req;
+
 	if (iocb->aio_flags & IOCB_FLAG_RESFD) {
 		/*
 		 * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 29d8e2cfed0e..fd423fec8d83 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -304,13 +304,19 @@ enum rw_hint {
 
 struct kiocb {
 	struct file		*ki_filp;
+
+	/* The 'ki_filp' pointer is shared in a union for aio */
+	randomized_struct_fields_start
+
 	loff_t			ki_pos;
 	void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
 	void			*private;
 	int			ki_flags;
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
-} __randomize_layout;
+
+	randomized_struct_fields_end
+};
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
 {
Al Viro March 4, 2019, 2:36 a.m. UTC | #6
On Sun, Mar 03, 2019 at 02:23:33PM -0800, Linus Torvalds wrote:

OK, having dug through the archives, the reasons were not strong.
So that part is OK...

> @@ -1060,6 +1071,8 @@ static inline void iocb_put(struct aio_kiocb *iocb)
>  {
>  	if (refcount_read(&iocb->ki_refcnt) == 0 ||
>  	    refcount_dec_and_test(&iocb->ki_refcnt)) {
> +		if (iocb->ki_filp)
> +			fput(iocb->ki_filp);
>  		percpu_ref_put(&iocb->ki_ctx->reqs);
>  		kmem_cache_free(kiocb_cachep, iocb);
>  	}

That reminds me - refcount_t here is rather ridiculous; what we
have is
	* everything other than aio_poll: ki_refcnt is 0 all along
	* aio_poll: originally 0, then set to 2, then two iocb_put()
are done (either both synchronous to aio_poll(), or one sync and one
async).

That's not a refcount at all.  It's a flag, set only for aio_poll ones.
And that test in iocb_put() is "if flag is set, clear it and bugger off".

What's worse, AFAICS the callers in aio_poll() are buggered (see below)

>  static inline void aio_poll_complete(struct aio_kiocb *iocb, __poll_t mask)
>  {
> -	struct file *file = iocb->poll.file;
> -
>  	aio_complete(iocb, mangle_poll(mask), 0);
> -	fput(file);
>  }

No reasons to keep that function at all now...

> -	if (unlikely(apt.error)) {
> -		fput(req->file);
> +	if (unlikely(apt.error))
>  		return apt.error;
> -	}
>  
>  	if (mask)
>  		aio_poll_complete(aiocb, mask);

Looking at that thing...  How does it manage to avoid leaks
when we try to use it on e.g. /dev/tty, which has
        poll_wait(file, &tty->read_wait, wait);
        poll_wait(file, &tty->write_wait, wait);
in n_tty_poll()?

AFAICS, we'll succeed adding it to the first queue, then have
aio_poll_queue_proc() fail and set apt.error to -EINVAL.
Suppose we are looking for EPOLLIN and there's nothing ready
to read.  We'll go
        mask = vfs_poll(req->file, &apt.pt) & req->events;
mask is 0.
        if (unlikely(!req->head)) {
nope - it's &tty->read_wait, not NULL
                /* we did not manage to set up a waitqueue, done */
                goto out;
        }

        spin_lock_irq(&ctx->ctx_lock);
        spin_lock(&req->head->lock);
        if (req->woken) {
nope - no wakeups so far
                /* wake_up context handles the rest */
                mask = 0;
                apt.error = 0;
        } else if (mask || apt.error) {
apt.error is non-zero
                /* if we get an error or a mask we are done */
                WARN_ON_ONCE(list_empty(&req->wait.entry));
                list_del_init(&req->wait.entry);
OK, removed from queue
        } else {
                /* actually waiting for an event */
                list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
                aiocb->ki_cancel = aio_poll_cancel;
        }
        spin_unlock(&req->head->lock);
        spin_unlock_irq(&ctx->ctx_lock);

out:
        if (unlikely(apt.error)) {
                fput(req->file);
                return apt.error;
... and we return -EINVAL to __io_submit_one(), where we hit
        /*
         * If ret is 0, we'd either done aio_complete() ourselves or have
         * arranged for that to be done asynchronously.  Anything non-zero
         * means that we need to destroy req ourselves.
         */
        if (ret)
                goto out_put_req;
        return 0;
out_put_req:
        if (req->ki_eventfd)
                eventfd_ctx_put(req->ki_eventfd);
        iocb_put(req);
out_put_reqs_available:
        put_reqs_available(ctx, 1);
        return ret;

and all knowledge of req is lost.  But we'd done only *one* iocb_put() in
that case, and ->ki_refcnt had been set to 2 by aio_poll().  How could it
avoid a leak?  The same goes for "->poll() returns something without
bothering to call poll_wait()" case, actually...

IOW, I would rather have aio_poll() (along with your fput-a-bit-later change)
do this -
out:
        if (mask && !apt.error)
                aio_complete(aiocb, mangle_poll(mask), 0);
        iocb_put(aiocb);
        return apt.error;

Comments?
Dmitry Vyukov March 4, 2019, 7:53 a.m. UTC | #7
On Sun, Mar 3, 2019 at 4:19 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sun, Mar 03, 2019 at 01:55:02PM +0000, Al Viro wrote:
>
> > Maybe unrelated to this bug, but...  What's to prevent a wakeup
> > that happens just after we'd been added to a waitqueue by ->poll()
> > triggering aio_poll_wake(), which gets to aio_poll_complete()
> > with its fput() *before* we'd reached the end of ->poll() instance?
> >
> > I wonder if adding
> >       get_file(req->file);  // make sure that early completion is safe
> > right after
> >         req->file = fget(iocb->aio_fildes);
> >         if (unlikely(!req->file))
> >                 return -EBADF;
> > paired with
> >       fput(req->file);
> > right after out: in aio_poll() is needed...  Am I missing something
> > obvious here?  Christoph?
>
> In more details - normally IOCB_CMD_POLL handling looks so:
>
> 1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll()
> 2) aio_poll() resolves the descriptor to struct file by
>         req->file = fget(iocb->aio_fildes)
> 3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
> aio_kiocb to 2 (bumps by 1, that is).
> 4) aio_poll() calls vfs_poll().  After sanity checks (basically, "poll_wait()
> had been called and only once") it locks the queue.  That's what the extra
> reference to iocb had been for - we know we can safely access it.
> 5) With queue locked, we check if ->woken has already been set to true
> (by aio_poll_wake()) and, if it had been, we unlock the queue, drop
> a reference to aio_kiocb and bugger off - at that point it's
> a responsibility to aio_poll_wake() and the stuff called/scheduled by
> it.  That code will drop the reference to file in req->file, along
> with the other reference to our aio_kiocb.
> 6) otherwise, we see whether we need to wait.  If we do, we unlock
> the queue, drop one reference to aio_kiocb and go away - eventual
> wakeup (or cancel) will deal with the reference to file and with the
> other reference to aio_kiocb
> 7) otherwise we remove ourselves from waitqueue (still under the queue
> lock), so that wakeup won't get us.  No async activity will be happening,
> so we can safely drop req->file and iocb ourselves.
>
> If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
> won't get freed under us, so we can do all the checks and locking safely.
> And we don't touch ->file if we detect that case.
>
> However, vfs_poll() most certainly *does* touch the file it had been
> given.  So wakeup coming while we are still in ->poll() might end up
> doing fput() on that file.  That case is not too rare, and usually
> we are saved by the still present reference from descriptor table -
> that fput() is not the final one.  But if another thread closes that
> descriptor right after our fget() and wakeup does happen before
> ->poll() returns, we are in trouble - final fput() done while we
> are in the middle of a method.
>
> What we need is to hold on to the file reference the same way we do with
> aio_kiocb.  No need to store the reference to what we'd got in a separate
> variable - req->file is never reassigned and we'd already made sure that
> req won't be freed under us, so dropping the extra reference with
> fput(req->file) is fine in all cases.
>
> Fixes: bfe4037e722ec
> Cc: stable@vger.kernel.org
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Please add the Reported-by tag from the report.
If you don't add it, then either:
1. somebody (you) will need to remember to later go and associate fix
the the report with "#syz fix" command
2. or the bug will stay open and syzbot will never report
use-after-frees in unix_dgram_poll again (it's still the same already
reported bug)
3. or worse somebody will spend time re-debugging this bug just to
find that you already fixed it but did not include the tag

Thanks

> ---
> diff --git a/fs/aio.c b/fs/aio.c
> index 3083180a54c8..7e88bfabdac2 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1767,6 +1767,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
>
>         /* one for removal from waitqueue, one for this function */
>         refcount_set(&aiocb->ki_refcnt, 2);
> +       get_file(req->file);
>
>         mask = vfs_poll(req->file, &apt.pt) & req->events;
>         if (unlikely(!req->head)) {
> @@ -1793,6 +1794,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
>         spin_unlock_irq(&ctx->ctx_lock);
>
>  out:
> +       fput(req->file);
>         if (unlikely(apt.error)) {
>                 fput(req->file);
>                 return apt.error;
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/20190303151846.GQ2217%40ZenIV.linux.org.uk.
> For more options, visit https://groups.google.com/d/optout.
Linus Torvalds March 4, 2019, 9:22 p.m. UTC | #8
On Sun, Mar 3, 2019 at 6:36 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OK, having dug through the archives, the reasons were not strong.
> So that part is OK...

I've committed the patch.

However, I didn't actually do the separate and independent cleanups:


> > @@ -1060,6 +1071,8 @@ static inline void iocb_put(struct aio_kiocb *iocb)
> >  {
> >       if (refcount_read(&iocb->ki_refcnt) == 0 ||
> >           refcount_dec_and_test(&iocb->ki_refcnt)) {
> > +             if (iocb->ki_filp)
> > +                     fput(iocb->ki_filp);
> >               percpu_ref_put(&iocb->ki_ctx->reqs);
> >               kmem_cache_free(kiocb_cachep, iocb);
> >       }
>
> That reminds me - refcount_t here is rather ridiculous; what we
> have is
>         * everything other than aio_poll: ki_refcnt is 0 all along
>         * aio_poll: originally 0, then set to 2, then two iocb_put()
> are done (either both synchronous to aio_poll(), or one sync and one
> async).
>
> That's not a refcount at all.  It's a flag, set only for aio_poll ones.
> And that test in iocb_put() is "if flag is set, clear it and bugger off".
>
> What's worse, AFAICS the callers in aio_poll() are buggered (see below)

Ok. Suggestions?


> > -     if (unlikely(apt.error)) {
> > -             fput(req->file);
> > +     if (unlikely(apt.error))
> >               return apt.error;
> > -     }
> >
> >       if (mask)
> >               aio_poll_complete(aiocb, mask);
>
> Looking at that thing...  How does it manage to avoid leaks
> when we try to use it on e.g. /dev/tty, which has
>         poll_wait(file, &tty->read_wait, wait);
>         poll_wait(file, &tty->write_wait, wait);
> in n_tty_poll()?

I don't think that's he only case that uses multiple poll entries.
It's now the poll() machinery is designed to be used, after all.
Although maybe everybody ends up using just a single wait-queue..

> AFAICS, we'll succeed adding it to the first queue, then have
> aio_poll_queue_proc() fail and set apt.error to -EINVAL.

Yeah, that's bogus, but documented. I guess nobody really expects to
use aio_poll on anything but a socket or something?

But your refcount leak looks valid. Hmm.

So yes, that whole ki_refcnt looks a bit odd and pointless. And
apparently buggy too.

> Comments?

Can we just

 (a) remove ki_refcnt entirely

 (b) remove the "iocb_put(aiocb);"

from aio_poll()?

Every actual successful io submission should end in aio_complete(), or
we free the aio iocb in the "out_put_req" error case. There's no point
in doing the refcount as you pointed out, and it seems to be actively
buggy.

Anyway, all of this (and your suggestion to just remove
"aio_poll_complete()" and just use "aio_complete()") is independent of
the file refcounting part, I feel. Which is why I just committed that
patch as-is (84c4e1f89fef "aio: simplify - and fix - fget/fput for
io_submit()").

                   Linus
diff mbox series

Patch

diff --git a/fs/aio.c b/fs/aio.c
index 3083180a54c8..7e88bfabdac2 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1767,6 +1767,7 @@  static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
 
 	/* one for removal from waitqueue, one for this function */
 	refcount_set(&aiocb->ki_refcnt, 2);
+	get_file(req->file);
 
 	mask = vfs_poll(req->file, &apt.pt) & req->events;
 	if (unlikely(!req->head)) {
@@ -1793,6 +1794,7 @@  static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
 	spin_unlock_irq(&ctx->ctx_lock);
 
 out:
+	fput(req->file);
 	if (unlikely(apt.error)) {
 		fput(req->file);
 		return apt.error;