
[1/3] io_uring: add support for async work inheriting files table

Message ID 20191017212858.13230-2-axboe@kernel.dk
State Changes Requested
Delegated to: David Miller

Commit Message

Jens Axboe Oct. 17, 2019, 9:28 p.m. UTC
This is in preparation for adding opcodes that need to modify files
in a process file table, either adding new ones or closing old ones.

If an opcode needs this, it must set REQ_F_NEED_FILES in the request
structure. If work that needs to get punted to async context has this
set, it will grab a reference to the process file table. When the
work is completed, the reference is dropped again.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

Comments

Jann Horn Oct. 18, 2019, 2:41 a.m. UTC | #1
On Fri, Oct 18, 2019 at 4:01 AM Jens Axboe <axboe@kernel.dk> wrote:
> This is in preparation for adding opcodes that need to modify files
> in a process file table, either adding new ones or closing old ones.

Closing old ones would be tricky. Basically if you call
get_files_struct() while you're between an fdget()/fdput() pair (e.g.
from sys_io_uring_enter()), you're not allowed to use that
files_struct reference to replace or close existing FDs through that
reference. (Or more accurately, if you go through fdget() with
files_struct refcount 1, you must not replace/close FDs in there in
any way until you've passed the corresponding fdput().)

You can avoid that if you ensure that you never use fdget()/fdput() in
the relevant places, only fget()/fput().
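Jann's fdget()/fdput() caveat can be sketched in a userspace toy model (all types and helpers below are illustrative stand-ins, not the real kernel API): fdget() may elide the per-file refcount bump when the files table is unshared, so a file borrowed that way is pinned only by its FD-table entry, and closing that entry through a get_files_struct() reference frees it mid-borrow. fget() always takes a real reference and survives the same sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of kernel file refcounting -- illustrative only. */
struct file { int f_count; int freed; };
struct files_struct {
	int count;          /* references on the table itself */
	struct file *fd[4]; /* the FD table */
};

/* fget(): always takes a real reference on the file. */
static struct file *toy_fget(struct files_struct *files, int fd)
{
	struct file *f = files->fd[fd];
	if (f)
		f->f_count++;
	return f;
}

/* fdget(): skips the refcount bump when the table is unshared,
 * relying on the table entry itself to keep the file alive. */
static struct file *toy_fdget(struct files_struct *files, int fd)
{
	if (files->count == 1)
		return files->fd[fd];   /* borrowed, no reference taken */
	return toy_fget(files, fd);
}

static void toy_fput(struct file *f)
{
	if (--f->f_count == 0)
		f->freed = 1;           /* stand-in for the actual free */
}

/* Replace/close an FD through some reference to the table. */
static void toy_close(struct files_struct *files, int fd)
{
	struct file *f = files->fd[fd];
	files->fd[fd] = NULL;
	toy_fput(f);
}

/* fdget() with table refcount 1, then a close through a
 * get_files_struct() reference: the borrowed file is freed. */
int fdget_close_race(void)
{
	struct file f = { 1, 0 };
	struct files_struct files = { 1, { &f } };

	struct file *borrowed = toy_fdget(&files, 0); /* io_uring_enter() */
	files.count++;          /* async work: get_files_struct() */
	toy_close(&files, 0);   /* async work closes fd 0 */
	return borrowed->freed; /* borrow outlived the file */
}

/* Same sequence with fget(): the extra reference keeps the file alive. */
int fget_is_safe(void)
{
	struct file f = { 1, 0 };
	struct files_struct files = { 1, { &f } };

	struct file *borrowed = toy_fget(&files, 0);
	files.count++;
	toy_close(&files, 0);
	int freed_during_borrow = borrowed->freed;
	toy_fput(borrowed);     /* fdput() counterpart drops the real ref */
	return freed_during_borrow;
}
```

The model shows only the refcount arithmetic, not the real single-threaded fast path, but it captures why close/replace through a side reference is forbidden inside the fdget() window.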

> If an opcode needs this, it must set REQ_F_NEED_FILES in the request
> structure. If work that needs to get punted to async context have this
> set, they will grab a reference to the process file table. When the
> work is completed, the reference is dropped again.
[...]
> @@ -2220,6 +2223,10 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>                                 set_fs(USER_DS);
>                         }
>                 }
> +               if (s->files && !old_files) {
> +                       old_files = current->files;
> +                       current->files = s->files;
> +               }

AFAIK e.g. stuff like proc_fd_link() in procfs can concurrently call
get_files_struct() even on kernel tasks, so you should take the
task_lock(current) while fiddling with the ->files pointer.

Also, maybe I'm too tired to read this correctly, but it seems like
when io_sq_wq_submit_work() is processing multiple elements with
->files pointers, this part will only consume a reference to the first
one?
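The locking pattern Jann asks for can be sketched like this (toy types, single-threaded; the "lock" is modeled as a flag, so this shows the invariant rather than the race): readers such as get_files_struct() snapshot task->files under task_lock(), so the work item must take the same lock when it swaps the pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Toy types -- illustrative, not kernel code. */
struct files_struct { int count; };
struct task_struct {
	int alloc_lock;              /* stands in for the task_lock() lock */
	struct files_struct *files;
};

/* task_lock()/task_unlock() modeled as a flag; in the kernel this is a
 * spinlock, and the point is that readers and writers share it. */
static void toy_task_lock(struct task_struct *t)
{
	assert(!t->alloc_lock);      /* no recursive locking in this model */
	t->alloc_lock = 1;
}

static void toy_task_unlock(struct task_struct *t)
{
	t->alloc_lock = 0;
}

/* get_files_struct(): how procfs-style readers grab a task's table.
 * It snapshots t->files and bumps the refcount under task_lock(). */
struct files_struct *toy_get_files_struct(struct task_struct *t)
{
	struct files_struct *files;

	toy_task_lock(t);
	files = t->files;
	if (files)
		files->count++;
	toy_task_unlock(t);
	return files;
}

/* The swap the io_uring work item must do: same lock as the readers,
 * so a concurrent reader never sees a half-updated task. */
void toy_swap_task_files(struct task_struct *t, struct files_struct *new_files)
{
	toy_task_lock(t);
	t->files = new_files;
	toy_task_unlock(t);
}
```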

>
>                 if (!ret) {
>                         s->has_user = cur_mm != NULL;
> @@ -2312,6 +2319,11 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>                 unuse_mm(cur_mm);
>                 mmput(cur_mm);
>         }
> +       if (old_files) {
> +               struct files_struct *files = current->files;
> +               current->files = old_files;
> +               put_files_struct(files);
> +       }

And then here the first files_struct reference is dropped, and the
rest of them leak?

>  }
>
>  /*
> @@ -2413,6 +2425,8 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>
>                         s->sqe = sqe_copy;
>                         memcpy(&req->submit, s, sizeof(*s));
> +                       if (req->flags & REQ_F_NEED_FILES)
> +                               req->submit.files = get_files_struct(current);

Stupid question: How does this interact with sqpoll mode? In that
case, this function is running on a kernel thread that isn't sharing
the application's files_struct, right?
Jens Axboe Oct. 18, 2019, 2:01 p.m. UTC | #2
On 10/17/19 8:41 PM, Jann Horn wrote:
> On Fri, Oct 18, 2019 at 4:01 AM Jens Axboe <axboe@kernel.dk> wrote:
>> This is in preparation for adding opcodes that need to modify files
>> in a process file table, either adding new ones or closing old ones.
> 
> Closing old ones would be tricky. Basically if you call
> get_files_struct() while you're between an fdget()/fdput() pair (e.g.
> from sys_io_uring_enter()), you're not allowed to use that
> files_struct reference to replace or close existing FDs through that
> reference. (Or more accurately, if you go through fdget() with
> files_struct refcount 1, you must not replace/close FDs in there in
> any way until you've passed the corresponding fdput().)
> 
> You can avoid that if you ensure that you never use fdget()/fdput() in
> the relevant places, only fget()/fput().

That's a good point, I didn't think the closing aspect through when
writing that changelog. File addition is the most interesting aspect,
obviously, and the only part that I care about in this patch set. I'll
change the wording.

>> If an opcode needs this, it must set REQ_F_NEED_FILES in the request
>> structure. If work that needs to get punted to async context have this
>> set, they will grab a reference to the process file table. When the
>> work is completed, the reference is dropped again.
> [...]
>> @@ -2220,6 +2223,10 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>                                  set_fs(USER_DS);
>>                          }
>>                  }
>> +               if (s->files && !old_files) {
>> +                       old_files = current->files;
>> +                       current->files = s->files;
>> +               }
> 
> AFAIK e.g. stuff like proc_fd_link() in procfs can concurrently call
> get_files_struct() even on kernel tasks, so you should take the
> task_lock(current) while fiddling with the ->files pointer.

Fixed up, thanks!

> Also, maybe I'm too tired to read this correctly, but it seems like
> when io_sq_wq_submit_work() is processing multiple elements with
> ->files pointers, this part will only consume a reference to the first
> one?

Like the mm, we should only have the one file table. But there's no
reason to not handle this properly, I've amended the commit to properly
swap so it works for any number of file tables.

>>                  if (!ret) {
>>                          s->has_user = cur_mm != NULL;
>> @@ -2312,6 +2319,11 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>                  unuse_mm(cur_mm);
>>                  mmput(cur_mm);
>>          }
>> +       if (old_files) {
>> +               struct files_struct *files = current->files;
>> +               current->files = old_files;
>> +               put_files_struct(files);
>> +       }
> 
> And then here the first files_struct reference is dropped, and the
> rest of them leak?

Fixed with the above change.

>> @@ -2413,6 +2425,8 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>
>>                          s->sqe = sqe_copy;
>>                          memcpy(&req->submit, s, sizeof(*s));
>> +                       if (req->flags & REQ_F_NEED_FILES)
>> +                               req->submit.files = get_files_struct(current);
> 
> Stupid question: How does this interact with sqpoll mode? In that
> case, this function is running on a kernel thread that isn't sharing
> the application's files_struct, right?

Not a stupid question! It doesn't work with sqpoll. We need to be
entered on behalf of the task, and we never see that with sqpoll (except
if NEED_WAKE is set in flags).

For now I'll just forbid it explicitly in io_accept(), just like we do
for IORING_SETUP_IOPOLL.
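A minimal sketch of that refusal, assuming it mirrors the existing IORING_SETUP_IOPOLL check; the struct, flag values, and function name here are illustrative stand-ins, not the real uapi definitions:

```c
#include <assert.h>
#include <errno.h>

/* Illustrative stand-ins for the uapi setup flags (values arbitrary). */
#define TOY_SETUP_IOPOLL (1U << 0)
#define TOY_SETUP_SQPOLL (1U << 1)

struct toy_ring_ctx { unsigned int flags; };

/* Sketch of the up-front refusal: an accept-style opcode needs the
 * submitting task's file table, which the sqpoll kernel thread never
 * shares, so reject it the same way IOPOLL is rejected. */
int toy_accept_check(const struct toy_ring_ctx *ctx)
{
	if (ctx->flags & (TOY_SETUP_IOPOLL | TOY_SETUP_SQPOLL))
		return -EINVAL;
	return 0;
}
```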

Updated patch1:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=df6caac708dae8ee9a74c9016e479b02ad78d436

and patch 3:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=442bb35fc4f8f28c29ea220475c45babb44ee49c
Jann Horn Oct. 18, 2019, 2:34 p.m. UTC | #3
On Fri, Oct 18, 2019 at 4:01 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 10/17/19 8:41 PM, Jann Horn wrote:
> > On Fri, Oct 18, 2019 at 4:01 AM Jens Axboe <axboe@kernel.dk> wrote:
> >> This is in preparation for adding opcodes that need to modify files
> >> in a process file table, either adding new ones or closing old ones.
[...]
> Updated patch1:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=df6caac708dae8ee9a74c9016e479b02ad78d436

I don't understand what you're doing with old_files in there. In the
"s->files && !old_files" branch, "current->files = s->files" happens
without holding task_lock(), but current->files and s->files are also
the same already at that point anyway. And what's the intent behind
assigning stuff to old_files inside the loop? Isn't that going to
cause the workqueue to keep a modified current->files beyond the
runtime of the work?
Jens Axboe Oct. 18, 2019, 2:37 p.m. UTC | #4
On 10/18/19 8:34 AM, Jann Horn wrote:
> On Fri, Oct 18, 2019 at 4:01 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 10/17/19 8:41 PM, Jann Horn wrote:
>>> On Fri, Oct 18, 2019 at 4:01 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>> This is in preparation for adding opcodes that need to modify files
>>>> in a process file table, either adding new ones or closing old ones.
> [...]
>> Updated patch1:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=df6caac708dae8ee9a74c9016e479b02ad78d436
> 
> I don't understand what you're doing with old_files in there. In the
> "s->files && !old_files" branch, "current->files = s->files" happens
> without holding task_lock(), but current->files and s->files are also
> the same already at that point anyway. And what's the intent behind
> assigning stuff to old_files inside the loop? Isn't that going to
> cause the workqueue to keep a modified current->files beyond the
> runtime of the work?

I simply forgot to remove the old block, it should only have this one:

if (s->files && s->files != cur_files) {
        task_lock(current);
        current->files = s->files;
        task_unlock(current);
        if (cur_files)
                put_files_struct(cur_files);
        cur_files = s->files;
}
Jann Horn Oct. 18, 2019, 2:40 p.m. UTC | #5
On Fri, Oct 18, 2019 at 4:37 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 10/18/19 8:34 AM, Jann Horn wrote:
> > On Fri, Oct 18, 2019 at 4:01 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 10/17/19 8:41 PM, Jann Horn wrote:
> >>> On Fri, Oct 18, 2019 at 4:01 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>> This is in preparation for adding opcodes that need to modify files
> >>>> in a process file table, either adding new ones or closing old ones.
> > [...]
> >> Updated patch1:
> >>
> >> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=df6caac708dae8ee9a74c9016e479b02ad78d436
> >
> > I don't understand what you're doing with old_files in there. In the
> > "s->files && !old_files" branch, "current->files = s->files" happens
> > without holding task_lock(), but current->files and s->files are also
> > the same already at that point anyway. And what's the intent behind
> > assigning stuff to old_files inside the loop? Isn't that going to
> > cause the workqueue to keep a modified current->files beyond the
> > runtime of the work?
>
> I simply forgot to remove the old block, it should only have this one:
>
> if (s->files && s->files != cur_files) {
>         task_lock(current);
>         current->files = s->files;
>         task_unlock(current);
>         if (cur_files)
>                 put_files_struct(cur_files);
>         cur_files = s->files;
> }

Don't you still need a put_files_struct() in the case where "s->files
== cur_files"?
Jens Axboe Oct. 18, 2019, 2:43 p.m. UTC | #6
On 10/18/19 8:40 AM, Jann Horn wrote:
> On Fri, Oct 18, 2019 at 4:37 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 10/18/19 8:34 AM, Jann Horn wrote:
>>> On Fri, Oct 18, 2019 at 4:01 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 10/17/19 8:41 PM, Jann Horn wrote:
>>>>> On Fri, Oct 18, 2019 at 4:01 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>> This is in preparation for adding opcodes that need to modify files
>>>>>> in a process file table, either adding new ones or closing old ones.
>>> [...]
>>>> Updated patch1:
>>>>
>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=df6caac708dae8ee9a74c9016e479b02ad78d436
>>>
>>> I don't understand what you're doing with old_files in there. In the
>>> "s->files && !old_files" branch, "current->files = s->files" happens
>>> without holding task_lock(), but current->files and s->files are also
>>> the same already at that point anyway. And what's the intent behind
>>> assigning stuff to old_files inside the loop? Isn't that going to
>>> cause the workqueue to keep a modified current->files beyond the
>>> runtime of the work?
>>
>> I simply forgot to remove the old block, it should only have this one:
>>
>> if (s->files && s->files != cur_files) {
>>          task_lock(current);
>>          current->files = s->files;
>>          task_unlock(current);
>>          if (cur_files)
>>                  put_files_struct(cur_files);
>>          cur_files = s->files;
>> }
> 
> Don't you still need a put_files_struct() in the case where "s->files
> == cur_files"?

I want to hold on to the files for as long as I can, to avoid unnecessary
shuffling of it. But I take it your worry here is that we'll be calling
something that manipulates ->files? Nothing should do that, unless
s->files is set. We didn't hide the workqueue ->files[] before this
change either.
Jann Horn Oct. 18, 2019, 2:52 p.m. UTC | #7
On Fri, Oct 18, 2019 at 4:43 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 10/18/19 8:40 AM, Jann Horn wrote:
> > On Fri, Oct 18, 2019 at 4:37 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> [...]
> >>
> >> I simply forgot to remove the old block, it should only have this one:
> >>
> >> if (s->files && s->files != cur_files) {
> >>          task_lock(current);
> >>          current->files = s->files;
> >>          task_unlock(current);
> >>          if (cur_files)
> >>                  put_files_struct(cur_files);
> >>          cur_files = s->files;
> >> }
> >
> > Don't you still need a put_files_struct() in the case where "s->files
> > == cur_files"?
>
> I want to hold on to the files for as long as I can, to avoid unnecessary
> shuffling of it. But I take it your worry here is that we'll be calling
> something that manipulates ->files? Nothing should do that, unless
> s->files is set. We didn't hide the workqueue ->files[] before this
> change either.

No, my worry is that the refcount of the files_struct is left too
high. From what I can tell, the "do" loop in io_sq_wq_submit_work()
iterates over multiple instances of struct sqe_submit. If there are
two sqe_submit instances with the same ->files (each holding a
reference from the get_files_struct() in __io_queue_sqe()), then:

When processing the first sqe_submit instance, current->files and
cur_files are set to $user_files.
When processing the second sqe_submit instance, nothing happens
(s->files == cur_files).
After the loop, at the end of the function, put_files_struct() is
called once on $user_files.

So get_files_struct() has been called twice, but put_files_struct()
has only been called once. That leaves the refcount too high, and by
repeating this, an attacker can make the refcount wrap around and then
cause a use-after-free.
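The leak Jann describes is easy to reproduce in a toy replay of the loop (illustrative types; the refcount is a plain int here): two submissions each arrive holding a get_files_struct() reference, but the `s->files != cur_files` test swallows the second put.

```c
#include <assert.h>
#include <stddef.h>

struct files_struct { int count; };  /* refcount as a plain int */
struct sqe_submit { struct files_struct *files; };

static void toy_get(struct files_struct *f) { f->count++; }
static void toy_put(struct files_struct *f) { f->count--; }

/* Replay of the buggy io_sq_wq_submit_work() loop over two requests
 * that share one file table; each arrives holding a reference taken
 * by __io_queue_sqe().  Returns how many references leaked. */
int toy_leaked_refs(void)
{
	struct files_struct user_files = { 1 };  /* the task's own ref */
	struct sqe_submit subs[2];
	struct files_struct *cur_files = NULL;
	int i;

	for (i = 0; i < 2; i++) {         /* __io_queue_sqe(), twice */
		toy_get(&user_files);
		subs[i].files = &user_files;
	}

	for (i = 0; i < 2; i++) {         /* the work loop */
		struct sqe_submit *s = &subs[i];

		if (s->files && s->files != cur_files) {
			if (cur_files)
				toy_put(cur_files);
			cur_files = s->files;
		}
		/* s->files == cur_files: the second reference is never
		 * dropped -- this is the leak */
	}
	if (cur_files)
		toy_put(cur_files);       /* end-of-function drop */

	return user_files.count - 1;      /* anything above the task's ref */
}
```

One leaked reference per repeated submission is exactly what an attacker would grind on to wrap the counter.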
Jens Axboe Oct. 18, 2019, 3 p.m. UTC | #8
On 10/18/19 8:52 AM, Jann Horn wrote:
> On Fri, Oct 18, 2019 at 4:43 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> [...]
>>>
>>> Don't you still need a put_files_struct() in the case where "s->files
>>> == cur_files"?
>>
>> I want to hold on to the files for as long as I can, to avoid unnecessary
>> shuffling of it. But I take it your worry here is that we'll be calling
>> something that manipulates ->files? Nothing should do that, unless
>> s->files is set. We didn't hide the workqueue ->files[] before this
>> change either.
> 
> No, my worry is that the refcount of the files_struct is left too
> high. From what I can tell, the "do" loop in io_sq_wq_submit_work()
> iterates over multiple instances of struct sqe_submit. If there are
> two sqe_submit instances with the same ->files (each holding a
> reference from the get_files_struct() in __io_queue_sqe()), then:
> 
> When processing the first sqe_submit instance, current->files and
> cur_files are set to $user_files.
> When processing the second sqe_submit instance, nothing happens
> (s->files == cur_files).
> After the loop, at the end of the function, put_files_struct() is
> called once on $user_files.
> 
> So get_files_struct() has been called twice, but put_files_struct()
> has only been called once. That leaves the refcount too high, and by
> repeating this, an attacker can make the refcount wrap around and then
> cause a use-after-free.

Ah now I see what you are getting at, yes that's clearly a bug! I wonder
how we best safely can batch the drops. We can track the number of times
we've used the same files, and do atomic_sub_and_test() in a
put_files_struct_many() type addition. But that would leave us open to
the issue you describe, where someone could maliciously overflow the
files ref count.

Probably not worth over-optimizing, as long as we can avoid the
current->files task lock/unlock and shuffle.

I'll update the patch.
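For reference, the batching idea Jens sketches and then sets aside would look roughly like this; put_files_struct_many() is hypothetical, and the atomic is modeled with a plain int standing in for atomic_sub_and_test():

```c
#include <assert.h>

struct files_struct { int count; };  /* atomic_t modeled as plain int */

/* Model of atomic_sub_and_test(): subtract, report whether it hit 0. */
static int toy_sub_and_test(int n, int *v)
{
	*v -= n;
	return *v == 0;
}

/* Hypothetical batched drop: consume 'refs' references in one step.
 * Returns 1 when the caller would free the files_struct. */
int toy_put_files_struct_many(struct files_struct *files, int refs)
{
	return toy_sub_and_test(refs, &files->count);
}
```

Note this only batches the puts; as the thread says, batching the gets is what would reopen the refcount-overflow concern.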
Jens Axboe Oct. 18, 2019, 3:54 p.m. UTC | #9
On 10/18/19 9:00 AM, Jens Axboe wrote:
> On 10/18/19 8:52 AM, Jann Horn wrote:
>> [...]
>>
>> No, my worry is that the refcount of the files_struct is left too
>> high. From what I can tell, the "do" loop in io_sq_wq_submit_work()
>> iterates over multiple instances of struct sqe_submit. If there are
>> two sqe_submit instances with the same ->files (each holding a
>> reference from the get_files_struct() in __io_queue_sqe()), then:
>>
>> When processing the first sqe_submit instance, current->files and
>> cur_files are set to $user_files.
>> When processing the second sqe_submit instance, nothing happens
>> (s->files == cur_files).
>> After the loop, at the end of the function, put_files_struct() is
>> called once on $user_files.
>>
>> So get_files_struct() has been called twice, but put_files_struct()
>> has only been called once. That leaves the refcount too high, and by
>> repeating this, an attacker can make the refcount wrap around and then
>> cause a use-after-free.
> 
> Ah now I see what you are getting at, yes that's clearly a bug! I wonder
> how we best safely can batch the drops. We can track the number of times
> we've used the same files, and do atomic_sub_and_test() in a
> put_files_struct_many() type addition. But that would leave us open to
> the issue you describe, where someone could maliciously overflow the
> files ref count.
> 
> Probably not worth over-optimizing, as long as we can avoid the
> current->files task lock/unlock and shuffle.
> 
> I'll update the patch.

Alright, this incremental on top should do it. And full updated patch
here:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=40449c5a3d3b16796fa13e9469c69d62986e961c

Let me know what you think.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 68eda36bcc9c..2fed0badad38 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2289,13 +2289,13 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 				set_fs(USER_DS);
 			}
 		}
-		if (s->files && s->files != cur_files) {
+		if (cur_files)
+			put_files_struct(cur_files);
+		cur_files = s->files;
+		if (cur_files && cur_files != current->files) {
 			task_lock(current);
-			current->files = s->files;
+			current->files = cur_files;
 			task_unlock(current);
-			if (cur_files)
-				put_files_struct(cur_files);
-			cur_files = s->files;
 		}
 
 		if (!ret) {
@@ -2393,8 +2393,9 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 		task_lock(current);
 		current->files = old_files;
 		task_unlock(current);
-		put_files_struct(cur_files);
 	}
+	if (cur_files)
+		put_files_struct(cur_files);
 }
 
 /*
Jann Horn Oct. 18, 2019, 4:20 p.m. UTC | #10
On Fri, Oct 18, 2019 at 5:55 PM Jens Axboe <axboe@kernel.dk> wrote:
> [...]
>
> Alright, this incremental on top should do it. And full updated patch
> here:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=40449c5a3d3b16796fa13e9469c69d62986e961c
>
> Let me know what you think.

Ignoring the locking elision, basically the logic is now this:

static void io_sq_wq_submit_work(struct work_struct *work)
{
        struct io_kiocb *req = container_of(work, struct io_kiocb, work);
        struct files_struct *cur_files = NULL, *old_files;
        [...]
        old_files = current->files;
        [...]
        do {
                struct sqe_submit *s = &req->submit;
                [...]
                if (cur_files)
                        /* drop cur_files reference; borrow lifetime must
                         * end before here */
                        put_files_struct(cur_files);
                /* move reference ownership to cur_files */
                cur_files = s->files;
                if (cur_files) {
                        task_lock(current);
                        /* current->files borrows reference from cur_files;
                         * existing borrow from previous loop ends here */
                        current->files = cur_files;
                        task_unlock(current);
                }

                [call __io_submit_sqe()]
                [...]
        } while (req);
        [...]
        /* existing borrow ends here */
        task_lock(current);
        current->files = old_files;
        task_unlock(current);
        if (cur_files)
                /* drop cur_files reference; borrow lifetime must
                 * end before here */
                put_files_struct(cur_files);
}

If you run two iterations of this loop, with a first element that has
a ->files pointer and a second element that doesn't, then in the
second run through the loop, the reference to the files_struct will be
dropped while current->files still points to it; current->files is
only reset after the loop has ended. If someone accesses
current->files through procfs directly after that, AFAICS you'd get a
use-after-free.
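That window can be shown with a toy replay of the incremental patch's loop (illustrative types): with submissions [A, NULL], where A is a files table carrying one reference, the put on the second iteration frees A while current->files still points at it.

```c
#include <assert.h>
#include <stddef.h>

struct files_struct { int count; int freed; };

static void toy_put(struct files_struct *f)
{
	if (--f->count == 0)
		f->freed = 1;   /* stand-in for the actual free */
}

/* Replay of the incremental patch's loop over two submissions: the
 * first carries a files reference, the second does not.  Returns 1
 * if current->files pointed at freed memory during the loop. */
int toy_dangling_current_files(void)
{
	struct files_struct user_files = { 1, 0 };   /* ref held by subs[0] */
	struct files_struct *subs[2] = { &user_files, NULL };
	struct files_struct *old_files = NULL;       /* worker's own table */
	struct files_struct *current_files = old_files;
	struct files_struct *cur_files = NULL;
	int dangled = 0;
	int i;

	for (i = 0; i < 2; i++) {
		if (cur_files)
			toy_put(cur_files);  /* i == 1: drops the only ref on A */
		cur_files = subs[i];
		if (cur_files)
			current_files = cur_files;   /* i == 0: ->files = A */

		/* __io_submit_sqe() would run here; a procfs reader could
		 * follow current_files at this point */
		if (current_files && current_files->freed)
			dangled = 1;
	}
	current_files = old_files;   /* only restored after the loop */
	if (cur_files)
		toy_put(cur_files);

	return dangled;
}
```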
Jens Axboe Oct. 18, 2019, 4:36 p.m. UTC | #11
On 10/18/19 10:20 AM, Jann Horn wrote:
>>> [...]
>>>
>>> Ah now I see what you are getting at, yes that's clearly a bug! I wonder
>>> how we best safely can batch the drops. We can track the number of times
>>> we've used the same files, and do atomic_sub_and_test() in a
>>> put_files_struct_many() type addition. But that would leave us open to
>>> the issue you describe, where someone could maliciously overflow the
>>> files ref count.
>>>
>>> Probably not worth over-optimizing, as long as we can avoid the
>>> current->files task lock/unlock and shuffle.
>>>
>>> I'll update the patch.
>>
>> Alright, this incremental on top should do it. And full updated patch
>> here:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=40449c5a3d3b16796fa13e9469c69d62986e961c
>>
>> Let me know what you think.
> 
> Ignoring the locking elision, basically the logic is now this:
> 
> static void io_sq_wq_submit_work(struct work_struct *work)
> {
>          struct io_kiocb *req = container_of(work, struct io_kiocb, work);
>          struct files_struct *cur_files = NULL, *old_files;
>          [...]
>          old_files = current->files;
>          [...]
>          do {
>                  struct sqe_submit *s = &req->submit;
>                  [...]
>                  if (cur_files)
>                          /* drop cur_files reference; borrow lifetime must
>                           * end before here */
>                          put_files_struct(cur_files);
>                  /* move reference ownership to cur_files */
>                  cur_files = s->files;
>                  if (cur_files) {
>                          task_lock(current);
>                          /* current->files borrows reference from cur_files;
>                           * existing borrow from previous loop ends here */
>                          current->files = cur_files;
>                          task_unlock(current);
>                  }
> 
>                  [call __io_submit_sqe()]
>                  [...]
>          } while (req);
>          [...]
>          /* existing borrow ends here */
>          task_lock(current);
>          current->files = old_files;
>          task_unlock(current);
>          if (cur_files)
>                  /* drop cur_files reference; borrow lifetime must
>                   * end before here */
>                  put_files_struct(cur_files);
> }
> 
> If you run two iterations of this loop, with a first element that has
> a ->files pointer and a second element that doesn't, then in the
> second run through the loop, the reference to the files_struct will be
> dropped while current->files still points to it; current->files is
> only reset after the loop has ended. If someone accesses
> current->files through procfs directly after that, AFAICS you'd get a
> use-after-free.

Amazing how this is still broken. You are right, and it's especially
annoying since that's exactly the case I originally talked about (not
flipping current->files if we don't have to). I just did it wrong, so
we'll leave a dangling pointer in ->files.

By far the most common case is that one sqe has a file table it needs to
attach, and any others that also have files will carry the same set. So I want
to optimize for the case where we only flip current->files once when we
first see the files, and once more when we're done with the loop.

Let me see if I can get this right...
Jens Axboe Oct. 18, 2019, 5:05 p.m. UTC | #12
On 10/18/19 10:36 AM, Jens Axboe wrote:
>> Ignoring the locking elision, basically the logic is now this:
>>
>> static void io_sq_wq_submit_work(struct work_struct *work)
>> {
>>           struct io_kiocb *req = container_of(work, struct io_kiocb, work);
>>           struct files_struct *cur_files = NULL, *old_files;
>>           [...]
>>           old_files = current->files;
>>           [...]
>>           do {
>>                   struct sqe_submit *s = &req->submit;
>>                   [...]
>>                   if (cur_files)
>>                           /* drop cur_files reference; borrow lifetime must
>>                            * end before here */
>>                           put_files_struct(cur_files);
>>                   /* move reference ownership to cur_files */
>>                   cur_files = s->files;
>>                   if (cur_files) {
>>                           task_lock(current);
>>                           /* current->files borrows reference from cur_files;
>>                            * existing borrow from previous loop ends here */
>>                           current->files = cur_files;
>>                           task_unlock(current);
>>                   }
>>
>>                   [call __io_submit_sqe()]
>>                   [...]
>>           } while (req);
>>           [...]
>>           /* existing borrow ends here */
>>           task_lock(current);
>>           current->files = old_files;
>>           task_unlock(current);
>>           if (cur_files)
>>                   /* drop cur_files reference; borrow lifetime must
>>                    * end before here */
>>                   put_files_struct(cur_files);
>> }
>>
>> If you run two iterations of this loop, with a first element that has
>> a ->files pointer and a second element that doesn't, then in the
>> second run through the loop, the reference to the files_struct will be
>> dropped while current->files still points to it; current->files is
>> only reset after the loop has ended. If someone accesses
>> current->files through procfs directly after that, AFAICS you'd get a
>> use-after-free.
> 
> Amazing how this is still broken. You are right, and it's especially
> annoying since that's exactly the case I originally talked about (not
> flipping current->files if we don't have to). I just did it wrong, so
> we'll leave a dangling pointer in ->files.
> 
> By far the most common case is that one sqe has a file table it needs to
> attach, and any others that also have files will carry the same set. So I want
> to optimize for the case where we only flip current->files once when we
> first see the files, and once more when we're done with the loop.
> 
> Let me see if I can get this right...

I _think_ the simplest way to do it is simply to have both cur_files and
current->files hold a reference to the file table. That won't really add
any extra cost as the double increments / decrements are following each
other. Something like this incremental, totally untested.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2fed0badad38..b3cf3f3d7911 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2293,9 +2293,14 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 			put_files_struct(cur_files);
 		cur_files = s->files;
 		if (cur_files && cur_files != current->files) {
+			struct files_struct *old;
+
+			atomic_inc(&cur_files->count);
 			task_lock(current);
+			old = current->files;
 			current->files = cur_files;
 			task_unlock(current);
+			put_files_struct(old);
 		}
 
 		if (!ret) {
@@ -2390,9 +2395,13 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 		mmput(cur_mm);
 	}
 	if (old_files != current->files) {
+		struct files_struct *old;
+
 		task_lock(current);
+		old = current->files;
 		current->files = old_files;
 		task_unlock(current);
+		put_files_struct(old);
 	}
 	if (cur_files)
 		put_files_struct(cur_files);
Jann Horn Oct. 18, 2019, 6:06 p.m. UTC | #13
On Fri, Oct 18, 2019 at 7:05 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 10/18/19 10:36 AM, Jens Axboe wrote:
> >> Ignoring the locking elision, basically the logic is now this:
> >>
> >> static void io_sq_wq_submit_work(struct work_struct *work)
> >> {
> >>           struct io_kiocb *req = container_of(work, struct io_kiocb, work);
> >>           struct files_struct *cur_files = NULL, *old_files;
> >>           [...]
> >>           old_files = current->files;
> >>           [...]
> >>           do {
> >>                   struct sqe_submit *s = &req->submit;
> >>                   [...]
> >>                   if (cur_files)
> >>                           /* drop cur_files reference; borrow lifetime must
> >>                            * end before here */
> >>                           put_files_struct(cur_files);
> >>                   /* move reference ownership to cur_files */
> >>                   cur_files = s->files;
> >>                   if (cur_files) {
> >>                           task_lock(current);
> >>                           /* current->files borrows reference from cur_files;
> >>                            * existing borrow from previous loop ends here */
> >>                           current->files = cur_files;
> >>                           task_unlock(current);
> >>                   }
> >>
> >>                   [call __io_submit_sqe()]
> >>                   [...]
> >>           } while (req);
> >>           [...]
> >>           /* existing borrow ends here */
> >>           task_lock(current);
> >>           current->files = old_files;
> >>           task_unlock(current);
> >>           if (cur_files)
> >>                   /* drop cur_files reference; borrow lifetime must
> >>                    * end before here */
> >>                   put_files_struct(cur_files);
> >> }
> >>
> >> If you run two iterations of this loop, with a first element that has
> >> a ->files pointer and a second element that doesn't, then in the
> >> second run through the loop, the reference to the files_struct will be
> >> dropped while current->files still points to it; current->files is
> >> only reset after the loop has ended. If someone accesses
> >> current->files through procfs directly after that, AFAICS you'd get a
> >> use-after-free.
> >
> > Amazing how this is still broken. You are right, and it's especially
> > annoying since that's exactly the case I originally talked about (not
> > flipping current->files if we don't have to). I just did it wrong, so
> > we'll leave a dangling pointer in ->files.
> >
> > By far the most common case is that one sqe has a file table it needs to
> > attach, and any others that also have files will carry the same set. So I want
> > to optimize for the case where we only flip current->files once when we
> > see the files, and once more when we're done with the loop.
> >
> > Let me see if I can get this right...
>
> I _think_ the simplest way to do it is simply to have both cur_files and
> current->files hold a reference to the file table. That won't really add
> any extra cost as the double increments / decrements are following each
> other. Something like this incremental, totally untested.
>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 2fed0badad38..b3cf3f3d7911 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2293,9 +2293,14 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>                         put_files_struct(cur_files);
>                 cur_files = s->files;
>                 if (cur_files && cur_files != current->files) {
> +                       struct files_struct *old;
> +
> +                       atomic_inc(&cur_files->count);
>                         task_lock(current);
> +                       old = current->files;
>                         current->files = cur_files;
>                         task_unlock(current);
> +                       put_files_struct(old);
>                 }
>
>                 if (!ret) {
> @@ -2390,9 +2395,13 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>                 mmput(cur_mm);
>         }
>         if (old_files != current->files) {
> +               struct files_struct *old;
> +
>                 task_lock(current);
> +               old = current->files;
>                 current->files = old_files;
>                 task_unlock(current);
> +               put_files_struct(old);
>         }
>         if (cur_files)
>                 put_files_struct(cur_files);

The only part I still feel a bit twitchy about is this part at the end:

        if (old_files != current->files) {
                struct files_struct *old;

                task_lock(current);
                old = current->files;
                current->files = old_files;
                task_unlock(current);
                put_files_struct(old);
        }

If it was possible for the initial ->files to be the same as the
->files of a submission, and we got two submissions with first a
different files_struct and then our old one, then this branch would
not be executed even though it should, which would leave the refcount
of the files_struct one too high. But that probably can't happen?
Since kernel workers should be running with &init_files (I think?) and
that thing is never used for userspace tasks. But still, I'd feel
better if you could change it like this:

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f9f5c70564f0..7673035d6bfe 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2265,6 +2265,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 {
        struct io_kiocb *req = container_of(work, struct io_kiocb, work);
        struct files_struct *cur_files = NULL, *old_files;
+       bool restore_current_files = false;
        struct io_ring_ctx *ctx = req->ctx;
        struct mm_struct *cur_mm = NULL;
        struct async_list *async_list;
@@ -2313,6 +2314,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
                        current->files = cur_files;
                        task_unlock(current);
                        put_files_struct(old);
+                       restore_current_files = true;
                }

                if (!ret) {
@@ -2406,7 +2408,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
                unuse_mm(cur_mm);
                mmput(cur_mm);
        }
-       if (old_files != current->files) {
+       if (restore_current_files) {
                struct files_struct *old;

                task_lock(current);


But actually, by the way: Is this whole files_struct thing creating a
reference loop? The files_struct has a reference to the uring file,
and the uring file has ACCEPT work that has a reference to the
files_struct. If the task gets killed and the accept work blocks, the
entire files_struct will stay alive, right?
Jens Axboe Oct. 18, 2019, 6:16 p.m. UTC | #14
On 10/18/19 12:06 PM, Jann Horn wrote:
> On Fri, Oct 18, 2019 at 7:05 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 10/18/19 10:36 AM, Jens Axboe wrote:
>>>> Ignoring the locking elision, basically the logic is now this:
>>>>
>>>> static void io_sq_wq_submit_work(struct work_struct *work)
>>>> {
>>>>            struct io_kiocb *req = container_of(work, struct io_kiocb, work);
>>>>            struct files_struct *cur_files = NULL, *old_files;
>>>>            [...]
>>>>            old_files = current->files;
>>>>            [...]
>>>>            do {
>>>>                    struct sqe_submit *s = &req->submit;
>>>>                    [...]
>>>>                    if (cur_files)
>>>>                            /* drop cur_files reference; borrow lifetime must
>>>>                             * end before here */
>>>>                            put_files_struct(cur_files);
>>>>                    /* move reference ownership to cur_files */
>>>>                    cur_files = s->files;
>>>>                    if (cur_files) {
>>>>                            task_lock(current);
>>>>                            /* current->files borrows reference from cur_files;
>>>>                             * existing borrow from previous loop ends here */
>>>>                            current->files = cur_files;
>>>>                            task_unlock(current);
>>>>                    }
>>>>
>>>>                    [call __io_submit_sqe()]
>>>>                    [...]
>>>>            } while (req);
>>>>            [...]
>>>>            /* existing borrow ends here */
>>>>            task_lock(current);
>>>>            current->files = old_files;
>>>>            task_unlock(current);
>>>>            if (cur_files)
>>>>                    /* drop cur_files reference; borrow lifetime must
>>>>                     * end before here */
>>>>                    put_files_struct(cur_files);
>>>> }
>>>>
>>>> If you run two iterations of this loop, with a first element that has
>>>> a ->files pointer and a second element that doesn't, then in the
>>>> second run through the loop, the reference to the files_struct will be
>>>> dropped while current->files still points to it; current->files is
>>>> only reset after the loop has ended. If someone accesses
>>>> current->files through procfs directly after that, AFAICS you'd get a
>>>> use-after-free.
>>>
>>> Amazing how this is still broken. You are right, and it's especially
>>> annoying since that's exactly the case I originally talked about (not
>>> flipping current->files if we don't have to). I just did it wrong, so
>>> we'll leave a dangling pointer in ->files.
>>>
>>> By far the most common case is that one sqe has a file table it needs to
>>> attach, and any others that also have files will carry the same set. So I want
>>> to optimize for the case where we only flip current->files once when we
>>> see the files, and once more when we're done with the loop.
>>>
>>> Let me see if I can get this right...
>>
>> I _think_ the simplest way to do it is simply to have both cur_files and
>> current->files hold a reference to the file table. That won't really add
>> any extra cost as the double increments / decrements are following each
>> other. Something like this incremental, totally untested.
>>
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 2fed0badad38..b3cf3f3d7911 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -2293,9 +2293,14 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>                          put_files_struct(cur_files);
>>                  cur_files = s->files;
>>                  if (cur_files && cur_files != current->files) {
>> +                       struct files_struct *old;
>> +
>> +                       atomic_inc(&cur_files->count);
>>                          task_lock(current);
>> +                       old = current->files;
>>                          current->files = cur_files;
>>                          task_unlock(current);
>> +                       put_files_struct(old);
>>                  }
>>
>>                  if (!ret) {
>> @@ -2390,9 +2395,13 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>                  mmput(cur_mm);
>>          }
>>          if (old_files != current->files) {
>> +               struct files_struct *old;
>> +
>>                  task_lock(current);
>> +               old = current->files;
>>                  current->files = old_files;
>>                  task_unlock(current);
>> +               put_files_struct(old);
>>          }
>>          if (cur_files)
>>                  put_files_struct(cur_files);
> 
> The only part I still feel a bit twitchy about is this part at the end:
> 
>          if (old_files != current->files) {
>                  struct files_struct *old;
> 
>                  task_lock(current);
>                  old = current->files;
>                  current->files = old_files;
>                  task_unlock(current);
>                  put_files_struct(old);
>          }
> 
> If it was possible for the initial ->files to be the same as the
> ->files of a submission, and we got two submissions with first a
> different files_struct and then our old one, then this branch would
> not be executed even though it should, which would leave the refcount
> of the files_struct one too high. But that probably can't happen?
> Since kernel workers should be running with &init_files (I think?) and
> that thing is never used for userspace tasks. But still, I'd feel
> better if you could change it like this:

Right, that is never going to happen. But your solution is simpler,
so... I'll throw some testing at it.

> But actually, by the way: Is this whole files_struct thing creating a
> reference loop? The files_struct has a reference to the uring file,
> and the uring file has ACCEPT work that has a reference to the
> files_struct. If the task gets killed and the accept work blocks, the
> entire files_struct will stay alive, right?

Yes, for the lifetime of the request, it does create a loop. So if the
application goes away, I think you're right, the files_struct will stay.
And so will the io_uring, for that matter, as we depend on the closing
of the files to do the final reap.

Hmm, not sure how best to handle that, to be honest. We need some way to
break the loop, if the request never finishes.
Jann Horn Oct. 18, 2019, 6:50 p.m. UTC | #15
On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 10/18/19 12:06 PM, Jann Horn wrote:
> > But actually, by the way: Is this whole files_struct thing creating a
> > reference loop? The files_struct has a reference to the uring file,
> > and the uring file has ACCEPT work that has a reference to the
> > files_struct. If the task gets killed and the accept work blocks, the
> > entire files_struct will stay alive, right?
>
> Yes, for the lifetime of the request, it does create a loop. So if the
> application goes away, I think you're right, the files_struct will stay.
> And so will the io_uring, for that matter, as we depend on the closing
> of the files to do the final reap.
>
> Hmm, not sure how best to handle that, to be honest. We need some way to
> break the loop, if the request never finishes.

A wacky and dubious approach would be to, instead of taking a
reference to the files_struct, abuse f_op->flush() to synchronously
flush out pending requests with references to the files_struct... But
it's probably a bad idea, given that in f_op->flush(), you can't
easily tell which files_struct the close is coming from. I suppose you
could keep a list of (fdtable, fd) pairs through which ACCEPT requests
have come in and then let f_op->flush() probe whether the file
pointers are gone from them...
Wolfgang Bumiller Oct. 23, 2019, 12:04 p.m. UTC | #16
On Thu, Oct 17, 2019 at 03:28:56PM -0600, Jens Axboe wrote:
> This is in preparation for adding opcodes that need to modify files
> in a process file table, either adding new ones or closing old ones.
> 
> If an opcode needs this, it must set REQ_F_NEED_FILES in the request
> structure. If work that needs to get punted to async context have this
> set, they will grab a reference to the process file table. When the
> work is completed, the reference is dropped again.

I think IORING_OP_SENDMSG and _RECVMSG need to set this flag due to
SCM_RIGHTS control messages.
Thought I'd reply here since I just now ran into the issue that I was
getting ever-increasing wrong file descriptor numbers on pretty much
every "other" async recvmsg() call I did via io-uring while receiving
file descriptors from lxc for the seccomp-notify proxy. (I'm currently
running an ubuntu based 5.3.1 kernel)
I ended up finding them in /proc - they show up in all kernel threads,
eg.:

root:/root # grep Name /proc/9/status
Name:   mm_percpu_wq
root:/root # ls -l /proc/9/fd
total 0
lr-x------ 1 root root 64 Oct 23 12:00 0 -> '/proc/512 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 1 -> /proc/512/mem
lr-x------ 1 root root 64 Oct 23 12:00 10 -> '/proc/11782 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 11 -> /proc/11782/mem
lr-x------ 1 root root 64 Oct 23 12:00 12 -> '/proc/12210 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 13 -> /proc/12210/mem
lr-x------ 1 root root 64 Oct 23 12:00 14 -> '/proc/12298 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 15 -> /proc/12298/mem
lr-x------ 1 root root 64 Oct 23 12:00 16 -> '/proc/13955 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 17 -> /proc/13955/mem
lr-x------ 1 root root 64 Oct 23 12:00 18 -> '/proc/13989 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 19 -> /proc/13989/mem
lr-x------ 1 root root 64 Oct 23 12:00 2 -> '/proc/584 (deleted)'
lr-x------ 1 root root 64 Oct 23 12:00 20 -> '/proc/15502 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 21 -> /proc/15502/mem
lr-x------ 1 root root 64 Oct 23 12:00 22 -> '/proc/15510 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 23 -> /proc/15510/mem
lr-x------ 1 root root 64 Oct 23 12:00 24 -> '/proc/17833 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 25 -> /proc/17833/mem
lr-x------ 1 root root 64 Oct 23 12:00 26 -> '/proc/17836 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 27 -> /proc/17836/mem
lr-x------ 1 root root 64 Oct 23 12:00 28 -> '/proc/21929 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 29 -> /proc/21929/mem
lrwx------ 1 root root 64 Oct 23 12:00 3 -> /proc/584/mem
lr-x------ 1 root root 64 Oct 23 12:00 30 -> '/proc/22214 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 31 -> /proc/22214/mem
lr-x------ 1 root root 64 Oct 23 12:00 32 -> '/proc/22283 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 33 -> /proc/22283/mem
lr-x------ 1 root root 64 Oct 23 12:00 34 -> '/proc/29795 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 35 -> /proc/29795/mem
lr-x------ 1 root root 64 Oct 23 12:00 36 -> '/proc/30124 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 37 -> /proc/30124/mem
lr-x------ 1 root root 64 Oct 23 12:00 38 -> '/proc/31016 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 39 -> /proc/31016/mem
lr-x------ 1 root root 64 Oct 23 12:00 4 -> '/proc/1632 (deleted)'
lr-x------ 1 root root 64 Oct 23 12:00 40 -> '/proc/4137 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 41 -> /proc/4137/mem
lrwx------ 1 root root 64 Oct 23 12:00 5 -> /proc/1632/mem
lr-x------ 1 root root 64 Oct 23 12:00 6 -> '/proc/3655 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 7 -> /proc/3655/mem
lr-x------ 1 root root 64 Oct 23 12:00 8 -> '/proc/7075 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 9 -> /proc/7075/mem
root:/root #

Those are the fds I expected to receive, and I get fd numbers
consistently increasing with them.
lxc sends the syscall-executing process' pidfd and its 'mem' fd via a
socket, but instead of making it to the receiver, they end up there...

I suspect that an async sendmsg() call could potentially end up
accessing those instead of the ones from the sender process, but I
haven't tested it...

> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/io_uring.c | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 635856023fdf..ad462237275e 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -267,10 +267,11 @@ struct io_ring_ctx {
>  struct sqe_submit {
>  	const struct io_uring_sqe	*sqe;
>  	unsigned short			index;
> +	bool				has_user : 1;
> +	bool				in_async : 1;
> +	bool				needs_fixed_file : 1;
>  	u32				sequence;
> -	bool				has_user;
> -	bool				in_async;
> -	bool				needs_fixed_file;
> +	struct files_struct		*files;
>  };
>  
>  /*
> @@ -323,6 +324,7 @@ struct io_kiocb {
>  #define REQ_F_FAIL_LINK		256	/* fail rest of links */
>  #define REQ_F_SHADOW_DRAIN	512	/* link-drain shadow req */
>  #define REQ_F_TIMEOUT		1024	/* timeout request */
> +#define REQ_F_NEED_FILES	2048	/* needs to assume file table */
>  	u64			user_data;
>  	u32			result;
>  	u32			sequence;
> @@ -2191,6 +2193,7 @@ static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
>  static void io_sq_wq_submit_work(struct work_struct *work)
>  {
>  	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
> +	struct files_struct *old_files = NULL;
>  	struct io_ring_ctx *ctx = req->ctx;
>  	struct mm_struct *cur_mm = NULL;
>  	struct async_list *async_list;
> @@ -2220,6 +2223,10 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>  				set_fs(USER_DS);
>  			}
>  		}
> +		if (s->files && !old_files) {
> +			old_files = current->files;
> +			current->files = s->files;
> +		}
>  
>  		if (!ret) {
>  			s->has_user = cur_mm != NULL;
> @@ -2312,6 +2319,11 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>  		unuse_mm(cur_mm);
>  		mmput(cur_mm);
>  	}
> +	if (old_files) {
> +		struct files_struct *files = current->files;
> +		current->files = old_files;
> +		put_files_struct(files);
> +	}
>  }
>  
>  /*
> @@ -2413,6 +2425,8 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>  
>  			s->sqe = sqe_copy;
>  			memcpy(&req->submit, s, sizeof(*s));
> +			if (req->flags & REQ_F_NEED_FILES)
> +				req->submit.files = get_files_struct(current);
>  			list = io_async_list_from_sqe(ctx, s->sqe);
>  			if (!io_add_to_prev_work(list, req)) {
>  				if (list)
> @@ -2633,6 +2647,7 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
>  		s->index = head;
>  		s->sqe = &ctx->sq_sqes[head];
>  		s->sequence = ctx->cached_sq_head;
> +		s->files = NULL;
>  		ctx->cached_sq_head++;
>  		return true;
>  	}
> -- 
> 2.17.1
Jens Axboe Oct. 23, 2019, 2:11 p.m. UTC | #17
On 10/23/19 6:04 AM, Wolfgang Bumiller wrote:
> On Thu, Oct 17, 2019 at 03:28:56PM -0600, Jens Axboe wrote:
>> This is in preparation for adding opcodes that need to modify files
>> in a process file table, either adding new ones or closing old ones.
>>
>> If an opcode needs this, it must set REQ_F_NEED_FILES in the request
>> structure. If work that needs to get punted to async context have this
>> set, they will grab a reference to the process file table. When the
>> work is completed, the reference is dropped again.
> 
> I think IORING_OP_SENDMSG and _RECVMSG need to set this flag due to
> SCM_RIGHTS control messages.
> Thought I'd reply here since I just now ran into the issue that I was
> getting ever-increasing wrong file descriptor numbers on pretty much
> every "other" async recvmsg() call I did via io-uring while receiving
> file descriptors from lxc for the seccomp-notify proxy. (I'm currently
> running an ubuntu based 5.3.1 kernel)
> I ended up finding them in /proc - they show up in all kernel threads,
> eg.:
> 
> root:/root # grep Name /proc/9/status
> Name:   mm_percpu_wq
> root:/root # ls -l /proc/9/fd
> total 0
> lr-x------ 1 root root 64 Oct 23 12:00 0 -> '/proc/512 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 1 -> /proc/512/mem
> lr-x------ 1 root root 64 Oct 23 12:00 10 -> '/proc/11782 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 11 -> /proc/11782/mem
> lr-x------ 1 root root 64 Oct 23 12:00 12 -> '/proc/12210 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 13 -> /proc/12210/mem
> lr-x------ 1 root root 64 Oct 23 12:00 14 -> '/proc/12298 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 15 -> /proc/12298/mem
> lr-x------ 1 root root 64 Oct 23 12:00 16 -> '/proc/13955 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 17 -> /proc/13955/mem
> lr-x------ 1 root root 64 Oct 23 12:00 18 -> '/proc/13989 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 19 -> /proc/13989/mem
> lr-x------ 1 root root 64 Oct 23 12:00 2 -> '/proc/584 (deleted)'
> lr-x------ 1 root root 64 Oct 23 12:00 20 -> '/proc/15502 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 21 -> /proc/15502/mem
> lr-x------ 1 root root 64 Oct 23 12:00 22 -> '/proc/15510 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 23 -> /proc/15510/mem
> lr-x------ 1 root root 64 Oct 23 12:00 24 -> '/proc/17833 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 25 -> /proc/17833/mem
> lr-x------ 1 root root 64 Oct 23 12:00 26 -> '/proc/17836 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 27 -> /proc/17836/mem
> lr-x------ 1 root root 64 Oct 23 12:00 28 -> '/proc/21929 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 29 -> /proc/21929/mem
> lrwx------ 1 root root 64 Oct 23 12:00 3 -> /proc/584/mem
> lr-x------ 1 root root 64 Oct 23 12:00 30 -> '/proc/22214 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 31 -> /proc/22214/mem
> lr-x------ 1 root root 64 Oct 23 12:00 32 -> '/proc/22283 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 33 -> /proc/22283/mem
> lr-x------ 1 root root 64 Oct 23 12:00 34 -> '/proc/29795 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 35 -> /proc/29795/mem
> lr-x------ 1 root root 64 Oct 23 12:00 36 -> '/proc/30124 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 37 -> /proc/30124/mem
> lr-x------ 1 root root 64 Oct 23 12:00 38 -> '/proc/31016 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 39 -> /proc/31016/mem
> lr-x------ 1 root root 64 Oct 23 12:00 4 -> '/proc/1632 (deleted)'
> lr-x------ 1 root root 64 Oct 23 12:00 40 -> '/proc/4137 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 41 -> /proc/4137/mem
> lrwx------ 1 root root 64 Oct 23 12:00 5 -> /proc/1632/mem
> lr-x------ 1 root root 64 Oct 23 12:00 6 -> '/proc/3655 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 7 -> /proc/3655/mem
> lr-x------ 1 root root 64 Oct 23 12:00 8 -> '/proc/7075 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 9 -> /proc/7075/mem
> root:/root #
> 
> Those are the fds I expected to receive, and I get fd numbers
> consistently increasing with them.
> lxc sends the syscall-executing process' pidfd and its 'mem' fd via a
> socket, but instead of making it to the receiver, they end up there...
> 
> I suspect that an async sendmsg() call could potentially end up
> accessing those instead of the ones from the sender process, but I
> haven't tested it...

Might "just" be a case of the sendmsg() being stuck; we can't currently
cancel work. So if they never complete, the ring won't go away.

I'm actually working on a small workqueue replacement for io_uring that
allows us to cancel things like that. It's a requirement for accept() as
well, but also for basic read/write send/recv on sockets. So far we've
been used to storage IO operations that complete in a finite amount of
time...

But yes, I hope with that, and the flush trick that Jann suggested, that
we can make this 100% reliable for any type of operation.
Jens Axboe Oct. 24, 2019, 7:41 p.m. UTC | #18
On 10/18/19 12:50 PM, Jann Horn wrote:
> On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 10/18/19 12:06 PM, Jann Horn wrote:
>>> But actually, by the way: Is this whole files_struct thing creating a
>>> reference loop? The files_struct has a reference to the uring file,
>>> and the uring file has ACCEPT work that has a reference to the
>>> files_struct. If the task gets killed and the accept work blocks, the
>>> entire files_struct will stay alive, right?
>>
>> Yes, for the lifetime of the request, it does create a loop. So if the
>> application goes away, I think you're right, the files_struct will stay.
>> And so will the io_uring, for that matter, as we depend on the closing
>> of the files to do the final reap.
>>
>> Hmm, not sure how best to handle that, to be honest. We need some way to
>> break the loop, if the request never finishes.
> 
> A wacky and dubious approach would be to, instead of taking a
> reference to the files_struct, abuse f_op->flush() to synchronously
> flush out pending requests with references to the files_struct... But
> it's probably a bad idea, given that in f_op->flush(), you can't
> easily tell which files_struct the close is coming from. I suppose you
> could keep a list of (fdtable, fd) pairs through which ACCEPT requests
> have come in and then let f_op->flush() probe whether the file
> pointers are gone from them...

Got back to this after finishing the io-wq stuff, which we need for the
cancel.

Here's an updated patch:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570

that seems to work for me (lightly tested), we correctly find and cancel
work that is holding on to the file table.

The full series sits on top of my for-5.5/io_uring-wq branch, and can be
viewed here:

http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test

Let me know what you think!
Jann Horn Oct. 24, 2019, 8:31 p.m. UTC | #19
On Thu, Oct 24, 2019 at 9:41 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 10/18/19 12:50 PM, Jann Horn wrote:
> > On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 10/18/19 12:06 PM, Jann Horn wrote:
> >>> But actually, by the way: Is this whole files_struct thing creating a
> >>> reference loop? The files_struct has a reference to the uring file,
> >>> and the uring file has ACCEPT work that has a reference to the
> >>> files_struct. If the task gets killed and the accept work blocks, the
> >>> entire files_struct will stay alive, right?
> >>
> >> Yes, for the lifetime of the request, it does create a loop. So if the
> >> application goes away, I think you're right, the files_struct will stay.
> >> And so will the io_uring, for that matter, as we depend on the closing
> >> of the files to do the final reap.
> >>
> >> Hmm, not sure how best to handle that, to be honest. We need some way to
> >> break the loop, if the request never finishes.
> >
> > A wacky and dubious approach would be to, instead of taking a
> > reference to the files_struct, abuse f_op->flush() to synchronously
> > flush out pending requests with references to the files_struct... But
> > it's probably a bad idea, given that in f_op->flush(), you can't
> > easily tell which files_struct the close is coming from. I suppose you
> > could keep a list of (fdtable, fd) pairs through which ACCEPT requests
> > have come in and then let f_op->flush() probe whether the file
> > pointers are gone from them...
>
> Got back to this after finishing the io-wq stuff, which we need for the
> cancel.
>
> Here's an updated patch:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570
>
> that seems to work for me (lightly tested), we correctly find and cancel
> work that is holding on to the file table.
>
> The full series sits on top of my for-5.5/io_uring-wq branch, and can be
> viewed here:
>
> http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test
>
> Let me know what you think!

Ah, I didn't realize that the second argument to f_op->flush is a
pointer to the files_struct. That's neat.


Security: There is no guarantee that ->flush() will run after the last
io_uring_enter() finishes. You can race like this, with threads A and
B in one process and C in another one:

A: sends uring fd to C via unix domain socket
A: starts syscall io_uring_enter(fd, ...)
A: calls fdget(fd), takes reference to file
B: starts syscall close(fd)
B: fd table entry is removed
B: f_op->flush is invoked and finds no pending transactions
B: syscall close() returns
A: continues io_uring_enter(), grabbing current->files
A: io_uring_enter() returns
A and B: exit
worker: use-after-free access to files_struct

I think the solution to this would be (unless you're fine with adding
some broad global read-write mutex) something like this in
__io_queue_sqe(), where "fd" and "f" are the variables from
io_uring_enter(), plumbed through the stack somehow:

if (req->flags & REQ_F_NEED_FILES) {
  rcu_read_lock();
  spin_lock_irq(&ctx->inflight_lock);
  if (fcheck(fd) == f) {
    list_add(&req->inflight_list,
      &ctx->inflight_list);
    req->work.files = current->files;
    ret = 0;
  } else {
    ret = -EBADF;
  }
  spin_unlock_irq(&ctx->inflight_lock);
  rcu_read_unlock();
  if (ret)
    goto put_req;
}


Minor note: If a process uses dup() to duplicate the uring fd, then
closes the duplicated fd, that will cause work cancellations - but I
guess that's fine?


Style nit: I find it a bit confusing to name both the list head and
the list member heads "inflight_list". Maybe name them "inflight_list"
and "inflight_entry", or something like that?


Correctness: Why is the wait in io_uring_flush() TASK_INTERRUPTIBLE?
Shouldn't it be TASK_UNINTERRUPTIBLE? If someone sends a signal to the
task while it's at that schedule(), it's just going to loop back
around and retry what it was doing already, right?


Security + Correctness: If there is more than one io_wqe, it seems to
me that io_uring_flush() calls io_wq_cancel_work(), which calls
io_wqe_cancel_work(), which may return IO_WQ_CANCEL_OK if the first
request it looks at is pending. In that case, io_wq_cancel_work() will
immediately return, and io_uring_flush() will also immediately return.
It looks like any other requests will continue running?
Jens Axboe Oct. 24, 2019, 10:04 p.m. UTC | #20
On 10/24/19 2:31 PM, Jann Horn wrote:
> On Thu, Oct 24, 2019 at 9:41 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 10/18/19 12:50 PM, Jann Horn wrote:
>>> On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 10/18/19 12:06 PM, Jann Horn wrote:
>>>>> But actually, by the way: Is this whole files_struct thing creating a
>>>>> reference loop? The files_struct has a reference to the uring file,
>>>>> and the uring file has ACCEPT work that has a reference to the
>>>>> files_struct. If the task gets killed and the accept work blocks, the
>>>>> entire files_struct will stay alive, right?
>>>>
>>>> Yes, for the lifetime of the request, it does create a loop. So if the
>>>> application goes away, I think you're right, the files_struct will stay.
>>>> And so will the io_uring, for that matter, as we depend on the closing
>>>> of the files to do the final reap.
>>>>
>>>> Hmm, not sure how best to handle that, to be honest. We need some way to
>>>> break the loop, if the request never finishes.
>>>
>>> A wacky and dubious approach would be to, instead of taking a
>>> reference to the files_struct, abuse f_op->flush() to synchronously
>>> flush out pending requests with references to the files_struct... But
>>> it's probably a bad idea, given that in f_op->flush(), you can't
>>> easily tell which files_struct the close is coming from. I suppose you
>>> could keep a list of (fdtable, fd) pairs through which ACCEPT requests
>>> have come in and then let f_op->flush() probe whether the file
>>> pointers are gone from them...
>>
>> Got back to this after finishing the io-wq stuff, which we need for the
>> cancel.
>>
>> Here's an updated patch:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570
>>
>> that seems to work for me (lightly tested), we correctly find and cancel
>> work that is holding on to the file table.
>>
>> The full series sits on top of my for-5.5/io_uring-wq branch, and can be
>> viewed here:
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test
>>
>> Let me know what you think!
> 
> Ah, I didn't realize that the second argument to f_op->flush is a
> pointer to the files_struct. That's neat.
> 
> 
> Security: There is no guarantee that ->flush() will run after the last
> io_uring_enter() finishes. You can race like this, with threads A and
> B in one process and C in another one:
> 
> A: sends uring fd to C via unix domain socket
> A: starts syscall io_uring_enter(fd, ...)
> A: calls fdget(fd), takes reference to file
> B: starts syscall close(fd)
> B: fd table entry is removed
> B: f_op->flush is invoked and finds no pending transactions
> B: syscall close() returns
> A: continues io_uring_enter(), grabbing current->files
> A: io_uring_enter() returns
> A and B: exit
> worker: use-after-free access to files_struct
> 
> I think the solution to this would be (unless you're fine with adding
> some broad global read-write mutex) something like this in
> __io_queue_sqe(), where "fd" and "f" are the variables from
> io_uring_enter(), plumbed through the stack somehow:
> 
> if (req->flags & REQ_F_NEED_FILES) {
>    rcu_read_lock();
>    spin_lock_irq(&ctx->inflight_lock);
>    if (fcheck(fd) == f) {
>      list_add(&req->inflight_list,
>        &ctx->inflight_list);
>      req->work.files = current->files;
>      ret = 0;
>    } else {
>      ret = -EBADF;
>    }
>    spin_unlock_irq(&ctx->inflight_lock);
>    rcu_read_unlock();
>    if (ret)
>      goto put_req;
> }

First of all, thanks for the thorough look at this! We already have f
available here, it's req->file. And we just made a copy of the sqe, so
we have sqe->fd available as well. I fixed this up.

> Minor note: If a process uses dup() to duplicate the uring fd, then
> closes the duplicated fd, that will cause work cancellations - but I
> guess that's fine?

I don't think that's a concern.

> Style nit: I find it a bit confusing to name both the list head and
> the list member heads "inflight_list". Maybe name them "inflight_list"
> and "inflight_entry", or something like that?

Fixed

> Correctness: Why is the wait in io_uring_flush() TASK_INTERRUPTIBLE?
> Shouldn't it be TASK_UNINTERRUPTIBLE? If someone sends a signal to the
> task while it's at that schedule(), it's just going to loop back
> around and retry what it was doing already, right?

Fixed

> Security + Correctness: If there is more than one io_wqe, it seems to
> me that io_uring_flush() calls io_wq_cancel_work(), which calls
> io_wqe_cancel_work(), which may return IO_WQ_CANCEL_OK if the first
> request it looks at is pending. In that case, io_wq_cancel_work() will
> immediately return, and io_uring_flush() will also immediately return.
> It looks like any other requests will continue running?

Ah good point, I missed that. We need to keep looping until we get
NOTFOUND returned. Fixed as well.

Also added cancellation if the task is going away. Here's the
incremental patch, I'll resend with the full version.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 43ae0e04fd09..ec9dadfa90d2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -326,7 +326,7 @@ struct io_kiocb {
 	u32			result;
 	u32			sequence;
 
-	struct list_head	inflight_list;
+	struct list_head	inflight_entry;
 
 	struct io_wq_work	work;
 };
@@ -688,7 +688,7 @@ static void __io_free_req(struct io_kiocb *req)
 		unsigned long flags;
 
 		spin_lock_irqsave(&ctx->inflight_lock, flags);
-		list_del(&req->inflight_list);
+		list_del(&req->inflight_entry);
 		if (wq_has_sleeper(&ctx->inflight_wait))
 			wake_up(&ctx->inflight_wait);
 		spin_unlock_irqrestore(&ctx->inflight_lock, flags);
@@ -2329,6 +2329,24 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 	return 0;
 }
 
+static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req,
+			 struct io_uring_sqe *sqe)
+{
+	int ret = -EBADF;
+
+	rcu_read_lock();
+	spin_lock_irq(&ctx->inflight_lock);
+	if (fcheck(sqe->fd) == req->file) {
+		list_add(&req->inflight_entry, &ctx->inflight_list);
+		req->work.files = current->files;
+		ret = 0;
+	}
+	spin_unlock_irq(&ctx->inflight_lock);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			struct sqe_submit *s)
 {
@@ -2349,23 +2367,24 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			s->sqe = sqe_copy;
 			memcpy(&req->submit, s, sizeof(*s));
 			if (req->flags & REQ_F_NEED_FILES) {
-				spin_lock_irq(&ctx->inflight_lock);
-				list_add(&req->inflight_list,
-						&ctx->inflight_list);
-				req->work.files = current->files;
-				spin_unlock_irq(&ctx->inflight_lock);
+				ret = io_grab_files(ctx, req, sqe_copy);
+				if (ret) {
+					kfree(sqe_copy);
+					goto err;
+				}
 			}
-			io_queue_async_work(ctx, req);
 
 			/*
 			 * Queued up for async execution, worker will release
 			 * submit reference when the iocb is actually submitted.
 			 */
+			io_queue_async_work(ctx, req);
 			return 0;
 		}
 	}
 
 	/* drop submission reference */
+err:
 	io_put_req(req, NULL);
 
 	/* and drop final reference, if we failed */
@@ -3768,38 +3787,51 @@ static int io_uring_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
-static int io_uring_flush(struct file *file, void *data)
+static void io_uring_cancel_files(struct io_ring_ctx *ctx,
+				  struct files_struct *files)
 {
-	struct io_ring_ctx *ctx = file->private_data;
-	enum io_wq_cancel ret;
 	struct io_kiocb *req;
 	DEFINE_WAIT(wait);
 
-restart:
-	ret = IO_WQ_CANCEL_NOTFOUND;
+	while (!list_empty_careful(&ctx->inflight_list)) {
+		enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND;
 
-	spin_lock_irq(&ctx->inflight_lock);
-	list_for_each_entry(req, &ctx->inflight_list, inflight_list) {
-		if (req->work.files == data) {
-			ret = io_wq_cancel_work(ctx->io_wq, &req->work);
-			break;
+		spin_lock_irq(&ctx->inflight_lock);
+		list_for_each_entry(req, &ctx->inflight_list, inflight_entry) {
+			if (req->work.files == files) {
+				ret = io_wq_cancel_work(ctx->io_wq, &req->work);
+				break;
+			}
 		}
-	}
-	if (ret == IO_WQ_CANCEL_RUNNING)
-		prepare_to_wait(&ctx->inflight_wait, &wait, TASK_INTERRUPTIBLE);
+		if (ret == IO_WQ_CANCEL_RUNNING)
+			prepare_to_wait(&ctx->inflight_wait, &wait,
+					TASK_UNINTERRUPTIBLE);
 
-	spin_unlock_irq(&ctx->inflight_lock);
+		spin_unlock_irq(&ctx->inflight_lock);
 
-	/*
-	 * If it was found running, wait for one inflight request to finish.
-	 * Retry loop after that, as it may not have been the one we were
-	 * looking for.
-	 */
-	if (ret == IO_WQ_CANCEL_RUNNING) {
+		/*
+		 * We need to keep going until we get NOTFOUND. We only cancel
+		 * one work at the time.
+		 *
+		 * If we get CANCEL_RUNNING, then wait for a work to complete
+		 * before continuing.
+		 */
+		if (ret == IO_WQ_CANCEL_OK)
+			continue;
+		else if (ret != IO_WQ_CANCEL_RUNNING)
+			break;
 		schedule();
-		goto restart;
-	}
+	};
+}
 
+static int io_uring_flush(struct file *file, void *data)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
+		io_wq_cancel_all(ctx->io_wq);
+	else
+		io_uring_cancel_files(ctx, data);
 	return 0;
 }
Jens Axboe Oct. 24, 2019, 10:09 p.m. UTC | #21
On 10/24/19 4:04 PM, Jens Axboe wrote:
> Also added cancellation if the task is going away. Here's the
> incremental patch, I'll resend with the full version.

Full version:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=24f099c6dd49fdedec464c5a7c0dfe2ac5f500d0
Jann Horn Oct. 24, 2019, 11:13 p.m. UTC | #22
On Fri, Oct 25, 2019 at 12:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 10/24/19 2:31 PM, Jann Horn wrote:
> > On Thu, Oct 24, 2019 at 9:41 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 10/18/19 12:50 PM, Jann Horn wrote:
> >>> On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>> On 10/18/19 12:06 PM, Jann Horn wrote:
> >>>>> But actually, by the way: Is this whole files_struct thing creating a
> >>>>> reference loop? The files_struct has a reference to the uring file,
> >>>>> and the uring file has ACCEPT work that has a reference to the
> >>>>> files_struct. If the task gets killed and the accept work blocks, the
> >>>>> entire files_struct will stay alive, right?
> >>>>
> >>>> Yes, for the lifetime of the request, it does create a loop. So if the
> >>>> application goes away, I think you're right, the files_struct will stay.
> >>>> And so will the io_uring, for that matter, as we depend on the closing
> >>>> of the files to do the final reap.
> >>>>
> >>>> Hmm, not sure how best to handle that, to be honest. We need some way to
> >>>> break the loop, if the request never finishes.
> >>>
> >>> A wacky and dubious approach would be to, instead of taking a
> >>> reference to the files_struct, abuse f_op->flush() to synchronously
> >>> flush out pending requests with references to the files_struct... But
> >>> it's probably a bad idea, given that in f_op->flush(), you can't
> >>> easily tell which files_struct the close is coming from. I suppose you
> >>> could keep a list of (fdtable, fd) pairs through which ACCEPT requests
> >>> have come in and then let f_op->flush() probe whether the file
> >>> pointers are gone from them...
> >>
> >> Got back to this after finishing the io-wq stuff, which we need for the
> >> cancel.
> >>
> >> Here's an updated patch:
> >>
> >> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570
> >>
> >> that seems to work for me (lightly tested), we correctly find and cancel
> >> work that is holding on to the file table.
> >>
> >> The full series sits on top of my for-5.5/io_uring-wq branch, and can be
> >> viewed here:
> >>
> >> http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test
> >>
> >> Let me know what you think!
> >
> > Ah, I didn't realize that the second argument to f_op->flush is a
> > pointer to the files_struct. That's neat.
> >
> >
> > Security: There is no guarantee that ->flush() will run after the last
> > io_uring_enter() finishes. You can race like this, with threads A and
> > B in one process and C in another one:
> >
> > A: sends uring fd to C via unix domain socket
> > A: starts syscall io_uring_enter(fd, ...)
> > A: calls fdget(fd), takes reference to file
> > B: starts syscall close(fd)
> > B: fd table entry is removed
> > B: f_op->flush is invoked and finds no pending transactions
> > B: syscall close() returns
> > A: continues io_uring_enter(), grabbing current->files
> > A: io_uring_enter() returns
> > A and B: exit
> > worker: use-after-free access to files_struct
> >
> > I think the solution to this would be (unless you're fine with adding
> > some broad global read-write mutex) something like this in
> > __io_queue_sqe(), where "fd" and "f" are the variables from
> > io_uring_enter(), plumbed through the stack somehow:
> >
> > if (req->flags & REQ_F_NEED_FILES) {
> >    rcu_read_lock();
> >    spin_lock_irq(&ctx->inflight_lock);
> >    if (fcheck(fd) == f) {
> >      list_add(&req->inflight_list,
> >        &ctx->inflight_list);
> >      req->work.files = current->files;
> >      ret = 0;
> >    } else {
> >      ret = -EBADF;
> >    }
> >    spin_unlock_irq(&ctx->inflight_lock);
> >    rcu_read_unlock();
> >    if (ret)
> >      goto put_req;
> > }
>
> First of all, thanks for the thorough look at this! We already have f
> available here, it's req->file. And we just made a copy of the sqe, so
> we have sqe->fd available as well. I fixed this up.

sqe->fd is the file descriptor we're doing I/O on, not the file
descriptor of the uring file, right? Same thing for req->file. This
check only detects whether the fd we're doing I/O on was closed, which
is irrelevant.

> > Security + Correctness: If there is more than one io_wqe, it seems to
> > me that io_uring_flush() calls io_wq_cancel_work(), which calls
> > io_wqe_cancel_work(), which may return IO_WQ_CANCEL_OK if the first
> > request it looks at is pending. In that case, io_wq_cancel_work() will
> > immediately return, and io_uring_flush() will also immediately return.
> > It looks like any other requests will continue running?
>
> Ah good point, I missed that. We need to keep looping until we get
> NOTFOUND returned. Fixed as well.
>
> Also added cancellation if the task is going away. Here's the
> incremental patch, I'll resend with the full version.
[...]
> +static int io_uring_flush(struct file *file, void *data)
> +{
> +       struct io_ring_ctx *ctx = file->private_data;
> +
> +       if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
> +               io_wq_cancel_all(ctx->io_wq);

Looking at io_wq_cancel_all(), this will just send a signal to the
task without waiting for anything, right? Isn't that unsafe?


> +       else
> +               io_uring_cancel_files(ctx, data);
>         return 0;
>  }
Jens Axboe Oct. 25, 2019, 12:35 a.m. UTC | #23
On 10/24/19 5:13 PM, Jann Horn wrote:
> On Fri, Oct 25, 2019 at 12:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>> On 10/24/19 2:31 PM, Jann Horn wrote:
>>> On Thu, Oct 24, 2019 at 9:41 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 10/18/19 12:50 PM, Jann Horn wrote:
>>>>> On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>> On 10/18/19 12:06 PM, Jann Horn wrote:
>>>>>>> But actually, by the way: Is this whole files_struct thing creating a
>>>>>>> reference loop? The files_struct has a reference to the uring file,
>>>>>>> and the uring file has ACCEPT work that has a reference to the
>>>>>>> files_struct. If the task gets killed and the accept work blocks, the
>>>>>>> entire files_struct will stay alive, right?
>>>>>>
>>>>>> Yes, for the lifetime of the request, it does create a loop. So if the
>>>>>> application goes away, I think you're right, the files_struct will stay.
>>>>>> And so will the io_uring, for that matter, as we depend on the closing
>>>>>> of the files to do the final reap.
>>>>>>
>>>>>> Hmm, not sure how best to handle that, to be honest. We need some way to
>>>>>> break the loop, if the request never finishes.
>>>>>
>>>>> A wacky and dubious approach would be to, instead of taking a
>>>>> reference to the files_struct, abuse f_op->flush() to synchronously
>>>>> flush out pending requests with references to the files_struct... But
>>>>> it's probably a bad idea, given that in f_op->flush(), you can't
>>>>> easily tell which files_struct the close is coming from. I suppose you
>>>>> could keep a list of (fdtable, fd) pairs through which ACCEPT requests
>>>>> have come in and then let f_op->flush() probe whether the file
>>>>> pointers are gone from them...
>>>>
>>>> Got back to this after finishing the io-wq stuff, which we need for the
>>>> cancel.
>>>>
>>>> Here's an updated patch:
>>>>
>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570
>>>>
>>>> that seems to work for me (lightly tested), we correctly find and cancel
>>>> work that is holding on to the file table.
>>>>
>>>> The full series sits on top of my for-5.5/io_uring-wq branch, and can be
>>>> viewed here:
>>>>
>>>> http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test
>>>>
>>>> Let me know what you think!
>>>
>>> Ah, I didn't realize that the second argument to f_op->flush is a
>>> pointer to the files_struct. That's neat.
>>>
>>>
>>> Security: There is no guarantee that ->flush() will run after the last
>>> io_uring_enter() finishes. You can race like this, with threads A and
>>> B in one process and C in another one:
>>>
>>> A: sends uring fd to C via unix domain socket
>>> A: starts syscall io_uring_enter(fd, ...)
>>> A: calls fdget(fd), takes reference to file
>>> B: starts syscall close(fd)
>>> B: fd table entry is removed
>>> B: f_op->flush is invoked and finds no pending transactions
>>> B: syscall close() returns
>>> A: continues io_uring_enter(), grabbing current->files
>>> A: io_uring_enter() returns
>>> A and B: exit
>>> worker: use-after-free access to files_struct
>>>
>>> I think the solution to this would be (unless you're fine with adding
>>> some broad global read-write mutex) something like this in
>>> __io_queue_sqe(), where "fd" and "f" are the variables from
>>> io_uring_enter(), plumbed through the stack somehow:
>>>
>>> if (req->flags & REQ_F_NEED_FILES) {
>>>     rcu_read_lock();
>>>     spin_lock_irq(&ctx->inflight_lock);
>>>     if (fcheck(fd) == f) {
>>>       list_add(&req->inflight_list,
>>>         &ctx->inflight_list);
>>>       req->work.files = current->files;
>>>       ret = 0;
>>>     } else {
>>>       ret = -EBADF;
>>>     }
>>>     spin_unlock_irq(&ctx->inflight_lock);
>>>     rcu_read_unlock();
>>>     if (ret)
>>>       goto put_req;
>>> }
>>
>> First of all, thanks for the thorough look at this! We already have f
>> available here, it's req->file. And we just made a copy of the sqe, so
>> we have sqe->fd available as well. I fixed this up.
> 
> sqe->fd is the file descriptor we're doing I/O on, not the file
> descriptor of the uring file, right? Same thing for req->file. This
> check only detects whether the fd we're doing I/O on was closed, which
> is irrelevant.

Duh yes, I'm an idiot. Easily fixable, I'll update this for the ring fd.

>>> Security + Correctness: If there is more than one io_wqe, it seems to
>>> me that io_uring_flush() calls io_wq_cancel_work(), which calls
>>> io_wqe_cancel_work(), which may return IO_WQ_CANCEL_OK if the first
>>> request it looks at is pending. In that case, io_wq_cancel_work() will
>>> immediately return, and io_uring_flush() will also immediately return.
>>> It looks like any other requests will continue running?
>>
>> Ah good point, I missed that. We need to keep looping until we get
>> NOTFOUND returned. Fixed as well.
>>
>> Also added cancellation if the task is going away. Here's the
>> incremental patch, I'll resend with the full version.
> [...]
>> +static int io_uring_flush(struct file *file, void *data)
>> +{
>> +       struct io_ring_ctx *ctx = file->private_data;
>> +
>> +       if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
>> +               io_wq_cancel_all(ctx->io_wq);
> 
> Looking at io_wq_cancel_all(), this will just send a signal to the
> task without waiting for anything, right? Isn't that unsafe?

Yes, that's a logic error, we should always do the
io_uring_cancel_files(). Ala:

	io_uring_cancel_files();
	if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
		io_wq_cancel_all(ctx->io_wq);

Thanks!
Jens Axboe Oct. 25, 2019, 12:52 a.m. UTC | #24
On 10/24/19 6:35 PM, Jens Axboe wrote:
> On 10/24/19 5:13 PM, Jann Horn wrote:
>> On Fri, Oct 25, 2019 at 12:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>>> On 10/24/19 2:31 PM, Jann Horn wrote:
>>>> On Thu, Oct 24, 2019 at 9:41 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>> On 10/18/19 12:50 PM, Jann Horn wrote:
>>>>>> On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>> On 10/18/19 12:06 PM, Jann Horn wrote:
>>>>>>>> But actually, by the way: Is this whole files_struct thing creating a
>>>>>>>> reference loop? The files_struct has a reference to the uring file,
>>>>>>>> and the uring file has ACCEPT work that has a reference to the
>>>>>>>> files_struct. If the task gets killed and the accept work blocks, the
>>>>>>>> entire files_struct will stay alive, right?
>>>>>>>
>>>>>>> Yes, for the lifetime of the request, it does create a loop. So if the
>>>>>>> application goes away, I think you're right, the files_struct will stay.
>>>>>>> And so will the io_uring, for that matter, as we depend on the closing
>>>>>>> of the files to do the final reap.
>>>>>>>
>>>>>>> Hmm, not sure how best to handle that, to be honest. We need some way to
>>>>>>> break the loop, if the request never finishes.
>>>>>>
>>>>>> A wacky and dubious approach would be to, instead of taking a
>>>>>> reference to the files_struct, abuse f_op->flush() to synchronously
>>>>>> flush out pending requests with references to the files_struct... But
>>>>>> it's probably a bad idea, given that in f_op->flush(), you can't
>>>>>> easily tell which files_struct the close is coming from. I suppose you
>>>>>> could keep a list of (fdtable, fd) pairs through which ACCEPT requests
>>>>>> have come in and then let f_op->flush() probe whether the file
>>>>>> pointers are gone from them...
>>>>>
>>>>> Got back to this after finishing the io-wq stuff, which we need for the
>>>>> cancel.
>>>>>
>>>>> Here's an updated patch:
>>>>>
>>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570
>>>>>
>>>>> that seems to work for me (lightly tested), we correctly find and cancel
>>>>> work that is holding on to the file table.
>>>>>
>>>>> The full series sits on top of my for-5.5/io_uring-wq branch, and can be
>>>>> viewed here:
>>>>>
>>>>> http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test
>>>>>
>>>>> Let me know what you think!
>>>>
>>>> Ah, I didn't realize that the second argument to f_op->flush is a
>>>> pointer to the files_struct. That's neat.
>>>>
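
[For reference, the callback Jann is referring to looks like this in v5.4-era kernels (abridged excerpt, not part of the patch). The second argument is typed as an opaque fl_owner_t, but filp_close() passes the closing task's files_struct through it:

```c
/* include/linux/fs.h (v5.4-era, abridged) */
typedef void *fl_owner_t;

struct file_operations {
	/* ... */
	int (*flush) (struct file *, fl_owner_t id);
	/* ... */
};

/* fs/open.c: filp_close() hands the owner through to ->flush().
 * __close_fd()/sys_close() invoke it as filp_close(file, files),
 * so "id" is in practice the closing task's files_struct. */
int filp_close(struct file *filp, fl_owner_t id)
{
	int retval = 0;

	/* ... */
	if (filp->f_op->flush)
		retval = filp->f_op->flush(filp, id);
	/* ... */
	return retval;
}
```
]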
>>>>
>>>> Security: There is no guarantee that ->flush() will run after the last
>>>> io_uring_enter() finishes. You can race like this, with threads A and
>>>> B in one process and C in another one:
>>>>
>>>> A: sends uring fd to C via unix domain socket
>>>> A: starts syscall io_uring_enter(fd, ...)
>>>> A: calls fdget(fd), takes reference to file
>>>> B: starts syscall close(fd)
>>>> B: fd table entry is removed
>>>> B: f_op->flush is invoked and finds no pending transactions
>>>> B: syscall close() returns
>>>> A: continues io_uring_enter(), grabbing current->files
>>>> A: io_uring_enter() returns
>>>> A and B: exit
>>>> worker: use-after-free access to files_struct
>>>>
>>>> I think the solution to this would be (unless you're fine with adding
>>>> some broad global read-write mutex) something like this in
>>>> __io_queue_sqe(), where "fd" and "f" are the variables from
>>>> io_uring_enter(), plumbed through the stack somehow:
>>>>
>>>> if (req->flags & REQ_F_NEED_FILES) {
>>>>      rcu_read_lock();
>>>>      spin_lock_irq(&ctx->inflight_lock);
>>>>      if (fcheck(fd) == f) {
>>>>        list_add(&req->inflight_list,
>>>>          &ctx->inflight_list);
>>>>        req->work.files = current->files;
>>>>        ret = 0;
>>>>      } else {
>>>>        ret = -EBADF;
>>>>      }
>>>>      spin_unlock_irq(&ctx->inflight_lock);
>>>>      rcu_read_unlock();
>>>>      if (ret)
>>>>        goto put_req;
>>>> }
>>>
>>> First of all, thanks for the thorough look at this! We already have f
>>> available here, it's req->file. And we just made a copy of the sqe, so
>>> we have sqe->fd available as well. I fixed this up.
>>
>> sqe->fd is the file descriptor we're doing I/O on, not the file
>> descriptor of the uring file, right? Same thing for req->file. This
>> check only detects whether the fd we're doing I/O on was closed, which
>> is irrelevant.
> 
> Duh yes, I'm an idiot. Easily fixable, I'll update this for the ring fd.

Incremental:

diff --git a/fs/io_uring.c b/fs/io_uring.c
index ec9dadfa90d2..4d94886a3d13 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -262,11 +262,13 @@ struct io_ring_ctx {
 
 struct sqe_submit {
 	const struct io_uring_sqe	*sqe;
+	struct file			*ring_file;
 	unsigned short			index;
 	bool				has_user : 1;
 	bool				in_async : 1;
 	bool				needs_fixed_file : 1;
 	u32				sequence;
+	int				ring_fd;
 };
 
 /*
@@ -2329,14 +2331,13 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 	return 0;
 }
 
-static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req,
-			 struct io_uring_sqe *sqe)
+static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req)
 {
 	int ret = -EBADF;
 
 	rcu_read_lock();
 	spin_lock_irq(&ctx->inflight_lock);
-	if (fcheck(sqe->fd) == req->file) {
+	if (fcheck(req->submit.ring_fd) == req->submit.ring_file) {
 		list_add(&req->inflight_entry, &ctx->inflight_list);
 		req->work.files = current->files;
 		ret = 0;
@@ -2367,7 +2368,7 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			s->sqe = sqe_copy;
 			memcpy(&req->submit, s, sizeof(*s));
 			if (req->flags & REQ_F_NEED_FILES) {
-				ret = io_grab_files(ctx, req, sqe_copy);
+				ret = io_grab_files(ctx, req);
 				if (ret) {
 					kfree(sqe_copy);
 					goto err;
@@ -2585,6 +2586,7 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 
 	head = READ_ONCE(sq_array[head & ctx->sq_mask]);
 	if (head < ctx->sq_entries) {
+		s->ring_file = NULL;
 		s->index = head;
 		s->sqe = &ctx->sq_sqes[head];
 		s->sequence = ctx->cached_sq_head;
@@ -2782,7 +2784,8 @@ static int io_sq_thread(void *data)
 	return 0;
 }
 
-static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
+static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
+			  struct file *ring_file, int ring_fd)
 {
 	struct io_submit_state state, *statep = NULL;
 	struct io_kiocb *link = NULL;
@@ -2824,9 +2827,11 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 		}
 
 out:
+		s.ring_file = ring_file;
 		s.has_user = true;
 		s.in_async = false;
 		s.needs_fixed_file = false;
+		s.ring_fd = ring_fd;
 		submit++;
 		trace_io_uring_submit_sqe(ctx, true, false);
 		io_submit_sqe(ctx, &s, statep, &link);
@@ -3828,10 +3833,9 @@ static int io_uring_flush(struct file *file, void *data)
 {
 	struct io_ring_ctx *ctx = file->private_data;
 
+	io_uring_cancel_files(ctx, data);
 	if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
 		io_wq_cancel_all(ctx->io_wq);
-	else
-		io_uring_cancel_files(ctx, data);
 	return 0;
 }
 
@@ -3903,7 +3907,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		to_submit = min(to_submit, ctx->sq_entries);
 
 		mutex_lock(&ctx->uring_lock);
-		submitted = io_ring_submit(ctx, to_submit);
+		submitted = io_ring_submit(ctx, to_submit, f.file, fd);
 		mutex_unlock(&ctx->uring_lock);
 	}
 	if (flags & IORING_ENTER_GETEVENTS) {

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 635856023fdf..ad462237275e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -267,10 +267,11 @@  struct io_ring_ctx {
 struct sqe_submit {
 	const struct io_uring_sqe	*sqe;
 	unsigned short			index;
+	bool				has_user : 1;
+	bool				in_async : 1;
+	bool				needs_fixed_file : 1;
 	u32				sequence;
-	bool				has_user;
-	bool				in_async;
-	bool				needs_fixed_file;
+	struct files_struct		*files;
 };
 
 /*
@@ -323,6 +324,7 @@  struct io_kiocb {
 #define REQ_F_FAIL_LINK		256	/* fail rest of links */
 #define REQ_F_SHADOW_DRAIN	512	/* link-drain shadow req */
 #define REQ_F_TIMEOUT		1024	/* timeout request */
+#define REQ_F_NEED_FILES	2048	/* needs to assume file table */
 	u64			user_data;
 	u32			result;
 	u32			sequence;
@@ -2191,6 +2193,7 @@  static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
 static void io_sq_wq_submit_work(struct work_struct *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+	struct files_struct *old_files = NULL;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct mm_struct *cur_mm = NULL;
 	struct async_list *async_list;
@@ -2220,6 +2223,10 @@  static void io_sq_wq_submit_work(struct work_struct *work)
 				set_fs(USER_DS);
 			}
 		}
+		if (s->files && !old_files) {
+			old_files = current->files;
+			current->files = s->files;
+		}
 
 		if (!ret) {
 			s->has_user = cur_mm != NULL;
@@ -2312,6 +2319,11 @@  static void io_sq_wq_submit_work(struct work_struct *work)
 		unuse_mm(cur_mm);
 		mmput(cur_mm);
 	}
+	if (old_files) {
+		struct files_struct *files = current->files;
+		current->files = old_files;
+		put_files_struct(files);
+	}
 }
 
 /*
@@ -2413,6 +2425,8 @@  static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 
 			s->sqe = sqe_copy;
 			memcpy(&req->submit, s, sizeof(*s));
+			if (req->flags & REQ_F_NEED_FILES)
+				req->submit.files = get_files_struct(current);
 			list = io_async_list_from_sqe(ctx, s->sqe);
 			if (!io_add_to_prev_work(list, req)) {
 				if (list)
@@ -2633,6 +2647,7 @@  static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 		s->index = head;
 		s->sqe = &ctx->sq_sqes[head];
 		s->sequence = ctx->cached_sq_head;
+		s->files = NULL;
 		ctx->cached_sq_head++;
 		return true;
 	}