diff mbox series

[v2] cifs: ignore cached share root handle closing errors

Message ID 20200401125026.4899-1-aaptel@suse.com
State New
Headers show
Series [v2] cifs: ignore cached share root handle closing errors | expand

Commit Message

Aurélien Aptel April 1, 2020, 12:50 p.m. UTC
Fix tcon use-after-free and NULL ptr deref.

Customer system crashes with the following kernel log:

[462233.169868] CIFS VFS: Cancelling wait for mid 4894753 cmd: 14       => a QUERY DIR
[462233.228045] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
[462233.305922] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
[462233.306205] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
[462233.347060] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
[462233.347107] CIFS VFS: Close unmatched open
[462233.347113] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
...
    [exception RIP: cifs_put_tcon+0xa0] (this is doing tcon->ses->server)
 #6 [...] smb2_cancelled_close_fid at ... [cifs]
 #7 [...] process_one_work at ...
 #8 [...] worker_thread at ...
 #9 [...] kthread at ...

The most likely explanation we have is:

* When we put the last reference of a tcon (refcount=0), we close the
  cached share root handle.
* If closing a handle is interupted, SMB2_close() will
  queue a SMB2_close() in a work thread.
* The queued object keeps a tcon ref so we bump the tcon
  refcount, jumping from 0 to 1.
* We reach the end of cifs_put_tcon(), we free the tcon object despite
  it now having a refcount of 1.
* The queued work now runs, but the tcon, ses & server was freed in
  the meantime resulting in a crash.

THREAD 1
========
cifs_put_tcon                 => tcon refcount reach 0
  SMB2_tdis
   close_shroot_lease
    close_shroot_lease_locked => if cached root has lease && refcount reach 0
     smb2_close_cached_fid    => if cached root valid
      SMB2_close              => retry close in a worker thread if interrupted
       smb2_handle_cancelled_close
        __smb2_handle_cancelled_close    => !! tcon refcount bump 0 => 1 !!
         INIT_WORK(&cancelled->work, smb2_cancelled_close_fid);
         queue_work(cifsiod_wq, &cancelled->work) => queue work
 tconInfoFree(tcon);    ==> freed!
 cifs_put_smb_ses(ses); ==> freed!

THREAD 2 (workqueue)
========
smb2_cancelled_close_fid
  SMB2_close(0, cancelled->tcon, ...); => use-after-free of tcon
  cifs_put_tcon(cancelled->tcon);      => tcon refcount reach 0 second time
  *CRASH*

Fixes: d9191319358d ("CIFS: Close cached root handle only if it has a lease")
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
---
 fs/cifs/smb2misc.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Pavel Shilovsky April 3, 2020, 6:04 p.m. UTC | #1
ср, 1 апр. 2020 г. в 05:50, Aurelien Aptel <aaptel@suse.com>:
>
> Fix tcon use-after-free and NULL ptr deref.
>
> Customer system crashes with the following kernel log:
>
> [462233.169868] CIFS VFS: Cancelling wait for mid 4894753 cmd: 14       => a QUERY DIR
> [462233.228045] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
> [462233.305922] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
> [462233.306205] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
> [462233.347060] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
> [462233.347107] CIFS VFS: Close unmatched open
> [462233.347113] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> ...
>     [exception RIP: cifs_put_tcon+0xa0] (this is doing tcon->ses->server)
>  #6 [...] smb2_cancelled_close_fid at ... [cifs]
>  #7 [...] process_one_work at ...
>  #8 [...] worker_thread at ...
>  #9 [...] kthread at ...
>
> The most likely explanation we have is:
>
> * When we put the last reference of a tcon (refcount=0), we close the
>   cached share root handle.
> * If closing a handle is interupted, SMB2_close() will
>   queue a SMB2_close() in a work thread.
> * The queued object keeps a tcon ref so we bump the tcon
>   refcount, jumping from 0 to 1.
> * We reach the end of cifs_put_tcon(), we free the tcon object despite
>   it now having a refcount of 1.
> * The queued work now runs, but the tcon, ses & server was freed in
>   the meantime resulting in a crash.
>
> THREAD 1
> ========
> cifs_put_tcon                 => tcon refcount reach 0
>   SMB2_tdis
>    close_shroot_lease
>     close_shroot_lease_locked => if cached root has lease && refcount reach 0
>      smb2_close_cached_fid    => if cached root valid
>       SMB2_close              => retry close in a worker thread if interrupted
>        smb2_handle_cancelled_close
>         __smb2_handle_cancelled_close    => !! tcon refcount bump 0 => 1 !!
>          INIT_WORK(&cancelled->work, smb2_cancelled_close_fid);
>          queue_work(cifsiod_wq, &cancelled->work) => queue work
>  tconInfoFree(tcon);    ==> freed!
>  cifs_put_smb_ses(ses); ==> freed!
>
> THREAD 2 (workqueue)
> ========
> smb2_cancelled_close_fid
>   SMB2_close(0, cancelled->tcon, ...); => use-after-free of tcon
>   cifs_put_tcon(cancelled->tcon);      => tcon refcount reach 0 second time
>   *CRASH*
>
> Fixes: d9191319358d ("CIFS: Close cached root handle only if it has a lease")
> Signed-off-by: Aurelien Aptel <aaptel@suse.com>
> ---
>  fs/cifs/smb2misc.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/fs/cifs/smb2misc.c b/fs/cifs/smb2misc.c
> index 0511aaf451d4..965276aeffcf 100644
> --- a/fs/cifs/smb2misc.c
> +++ b/fs/cifs/smb2misc.c
> @@ -766,6 +766,12 @@ smb2_handle_cancelled_close(struct cifs_tcon *tcon, __u64 persistent_fid,
>
>         cifs_dbg(FYI, "%s: tc_count=%d\n", __func__, tcon->tc_count);
>         spin_lock(&cifs_tcp_ses_lock);
> +       if (tcon->tc_count <= 0) {
> +               spin_unlock(&cifs_tcp_ses_lock);
> +               cifs_dbg(VFS, "tcon is closing, skipping async close retry of fid %llu %llu\n",

Thanks for a good catch! Some suggestions below:

- Why VFS? Share is being disconnected anyway, so all associated
handles should be closed by the server eventually. I think FYI is
sufficient here.
- cifs_dbg_server may be better than cifs_dbg and please include tcon
ID here otherwise FIDs may be meaningless if there are several shares
mounted on a host.
- May be WARN_ONCE if tc_count is negative since it is a possible bug?

Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>

--
Best regards,
Pavel Shilovsky
Aurélien Aptel April 6, 2020, 10:01 a.m. UTC | #2
Hi Pavel,

Thanks for the review, will send v3 :)

I thought this one would be of interest to you as it's related to your
patch in the Fixes tag.

I've noticed the shared root handle is causing us lots of issues
lately. Paulo is working on a DFS bug related to it too (when switching
servers the handle becomes invalid).

Cheers,
diff mbox series

Patch

diff --git a/fs/cifs/smb2misc.c b/fs/cifs/smb2misc.c
index 0511aaf451d4..965276aeffcf 100644
--- a/fs/cifs/smb2misc.c
+++ b/fs/cifs/smb2misc.c
@@ -766,6 +766,12 @@  smb2_handle_cancelled_close(struct cifs_tcon *tcon, __u64 persistent_fid,
 
 	cifs_dbg(FYI, "%s: tc_count=%d\n", __func__, tcon->tc_count);
 	spin_lock(&cifs_tcp_ses_lock);
+	if (tcon->tc_count <= 0) {
+		spin_unlock(&cifs_tcp_ses_lock);
+		cifs_dbg(VFS, "tcon is closing, skipping async close retry of fid %llu %llu\n",
+			 persistent_fid, volatile_fid);
+		return 0;
+	}
 	tcon->tc_count++;
 	spin_unlock(&cifs_tcp_ses_lock);