diff mbox series

[linux,dev-5.1] fsi: core: Fix NULL dereference issue

Message ID 20190614071643.18607-1-mine260309@gmail.com
State Not Applicable, archived
Headers show
Series [linux,dev-5.1] fsi: core: Fix NULL dereference issue | expand

Commit Message

Lei YU June 14, 2019, 7:16 a.m. UTC
The failure case in fsi_slave_init() is wrong and could cause NULL
dereference issue.
E.g. on FP5280G2 machine, it could get failure in fsi_slave_set_smode(),
and when it does fsi rescan, kernel crashes due to:

    Unable to handle kernel NULL pointer dereference at virtual address 00000060

The fix is to make it not calling kfree() but just goto err_free.

However, in err_free, it calls put_device() to free the device, it still
cause issue during fsi rescan, that the device is used after freed.

    WARNING: CPU: 0 PID: 1433 at lib/refcount.c:190 refcount_sub_and_test_checked+0x94/0xac
    refcount_t: underflow; use-after-free.

So the put_device() is removed and "err_free" label is renamed to "fail".

The slave device is freed by put_device() in fsi_master_remove_slave().

Signed-off-by: Lei YU <mine260309@gmail.com>
---
 drivers/fsi/fsi-core.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

Comments

Jeremy Kerr June 26, 2019, 1:33 a.m. UTC | #1
Hi Lei,

> The failure case in fsi_slave_init() is wrong and could cause NULL
> dereference issue.
> E.g. on FP5280G2 machine, it could get failure in
> fsi_slave_set_smode(),
> and when it does fsi rescan, kernel crashes due to:
> 
>     Unable to handle kernel NULL pointer dereference at virtual
> address 00000060
> 
> The fix is to make it not calling kfree() but just goto err_free.
> 
> However, in err_free, it calls put_device() to free the device, it
> still
> cause issue during fsi rescan, that the device is used after freed.
> 
>     WARNING: CPU: 0 PID: 1433 at lib/refcount.c:190
> refcount_sub_and_test_checked+0x94/0xac
>     refcount_t: underflow; use-after-free.
> 
> So the put_device() is removed and "err_free" label is renamed to
> "fail".

It looks like this will leak memory (through the struct fsi_slave) that
has been kzalloc()ed. After device_register, we need to call
put_device() to free the struct fsi_slave, but there's no mechanism for
that to happen if we remove it from fsi_slave_init().

The error paths for this function do need to be fixed, but I don't think
this is the right approach.

Do you have a backtrace of the refcount_sub_and_test_checked+0x94/0xac
warning? This may not be the actual struct device that underflows.

Cheers,


Jeremy
Lei YU June 26, 2019, 7:39 a.m. UTC | #2
On Wed, Jun 26, 2019 at 9:33 AM Jeremy Kerr <jk@ozlabs.org> wrote:
>
> Hi Lei,
>
> > The failure case in fsi_slave_init() is wrong and could cause NULL
> > dereference issue.
> > E.g. on FP5280G2 machine, it could get failure in
> > fsi_slave_set_smode(),
> > and when it does fsi rescan, kernel crashes due to:
> >
> >     Unable to handle kernel NULL pointer dereference at virtual
> > address 00000060
> >
> > The fix is to make it not calling kfree() but just goto err_free.
> >
> > However, in err_free, it calls put_device() to free the device, it
> > still
> > cause issue during fsi rescan, that the device is used after freed.
> >
> >     WARNING: CPU: 0 PID: 1433 at lib/refcount.c:190
> > refcount_sub_and_test_checked+0x94/0xac
> >     refcount_t: underflow; use-after-free.
> >
> > So the put_device() is removed and "err_free" label is renamed to
> > "fail".
>
> It looks like this will leak memory (through the struct fsi_slave) that
> has been kzalloc()ed. After device_register, we need to call
> put_device() to free the struct fsi_slave, but there's no mechanism for
> that to happen if we remove it from fsi_slave_init().

The memory is "leaked" in this function, that the slave device is not freed
here. But eventually, it will be freed in fsi_slave_release() (if I understand
the code correctly), so there is no leak, eventually.

> The error paths for this function do need to be fixed, but I don't think
> this is the right approach.
>
> Do you have a backtrace of the refcount_sub_and_test_checked+0x94/0xac
> warning? This may not be the actual struct device that underflows.

Yes, below is the full trace:

kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 0 PID: 1417 at lib/refcount.c:190
refcount_sub_and_test_checked+0x94/0xac
kernel: refcount_t: underflow; use-after-free.
kernel: CPU: 0 PID: 1417 Comm: openpower-proc- Tainted: G        W
    5.1.1-60926c8a1d00a056568cb44da9e96d3ee5f6e654 #1
kernel: Hardware name: Generic DT based system
kernel: Backtrace:
kernel: [<80107ca0>] (dump_backtrace) from [<80107ed4>] (show_stack+0x20/0x24)
kernel:  r7:803629e8 r6:00000009 r5:00000000 r4:95937d14
kernel: [<80107eb4>] (show_stack) from [<8067e944>] (dump_stack+0x20/0x28)
kernel: [<8067e924>] (dump_stack) from [<80115ff8>] (__warn.part.3+0xb4/0xdc)
kernel: [<80115f44>] (__warn.part.3) from [<8011608c>]
(warn_slowpath_fmt+0x6c/0x90)
kernel:  r6:000000be r5:807ff14c r4:80a07008
kernel: [<80116024>] (warn_slowpath_fmt) from [<803629e8>]
(refcount_sub_and_test_checked+0x94/0xac)
kernel:  r3:80a4cbda r2:807ff1b8
kernel:  r7:80410c1c r6:94506710 r5:95937ddc r4:00000000
kernel: [<80362954>] (refcount_sub_and_test_checked) from [<80362a18>]
(refcount_dec_and_test_checked+0x18/0x1c)
kernel:  r5:95937ddc r4:94503c08
kernel: [<80362a00>] (refcount_dec_and_test_checked) from [<806836a8>]
(kobject_put+0x2c/0x1e8)
kernel: [<8068367c>] (kobject_put) from [<80410c44>]
(klist_children_put+0x28/0x2c)
kernel:  r6:94506710 r5:95937ddc r4:945060fc
kernel: [<80410c1c>] (klist_children_put) from [<80683130>]
(klist_next+0x8c/0xe0)
kernel: [<806830a4>] (klist_next) from [<80412548>]
(device_for_each_child+0x4c/0xa4)
kernel:  r10:00000051 r9:95937f60 r8:95075030 r7:80a07008 r6:80525804
r5:00000000
kernel:  r4:00000000 r3:db89ad67
kernel: [<804124fc>] (device_for_each_child) from [<80525910>]
(fsi_master_rescan+0x40/0x68)
kernel:  r7:959293a0 r6:95075020 r5:94503afc r4:94503a00
kernel: [<805258d0>] (fsi_master_rescan) from [<8052562c>]
(master_rescan_store+0x1c/0x28)
kernel:  r5:00000000 r4:00000001
kernel: [<80525610>] (master_rescan_store) from [<80410354>]
(dev_attr_store+0x28/0x34)
kernel:  r5:00000000 r4:80525610
kernel: [<8041032c>] (dev_attr_store) from [<802acac8>]
(sysfs_kf_write+0x48/0x54)
kernel:  r5:00000000 r4:8041032c
kernel: [<802aca80>] (sysfs_kf_write) from [<802abef8>]
(kernfs_fop_write+0x10c/0x1ec)
kernel:  r5:00000000 r4:00000000
kernel: [<802abdec>] (kernfs_fop_write) from [<80237e80>]
(__vfs_write+0x44/0x18c)
kernel:  r10:95937f60 r9:00000001 r8:00000000 r7:95937f60 r6:802abdec
r5:80a07008
kernel:  r4:95930f40
kernel: [<80237e3c>] (__vfs_write) from [<80239864>] (vfs_write+0xb0/0x194)
kernel:  r10:00000000 r9:00000000 r8:00000000 r7:95937f60 r6:0101a3d0
r5:95930f40
kernel:  r4:00000001
kernel: [<802397b4>] (vfs_write) from [<80239b10>] (ksys_write+0x70/0xe8)
kernel:  r8:00000000 r7:0101a3d0 r6:80a07008 r5:95930f40 r4:95930f40
kernel: [<80239aa0>] (ksys_write) from [<80239ba0>] (sys_write+0x18/0x1c)
kernel:  r9:95936000 r8:801011e4 r7:00000004 r6:00000003 r5:0101a3d0 r4:00000001
kernel: [<80239b88>] (sys_write) from [<80101000>] (ret_fast_syscall+0x0/0x54)
kernel: Exception stack(0x95937fa8 to 0x95937ff0)
kernel: 7fa0:                   00000001 0101a3d0 00000003 0101a3d0
00000001 00000000
kernel: 7fc0: 00000001 0101a3d0 00000003 00000004 0006ac0c 0006ac20
7e994c88 7e994a1c
kernel: 7fe0: 00000070 7e9949a8 48a884ac 48714fec
kernel: ---[ end trace 366cbd7a56b7dc9f ]---
Milton Miller II June 26, 2019, 11:16 p.m. UTC | #3
On 06/26/2019 around 02:39AM in some timezone, Lei YU wrote:

>On Wed, Jun 26, 2019 at 9:33 AM Jeremy Kerr <jk@ozlabs.org> wrote:
>>
>> Hi Lei,
>>
>> > The failure case in fsi_slave_init() is wrong and could cause
>NULL
>> > dereference issue.
>> > E.g. on FP5280G2 machine, it could get failure in
>> > fsi_slave_set_smode(),
>> > and when it does fsi rescan, kernel crashes due to:
>> >
>> >     Unable to handle kernel NULL pointer dereference at virtual
>> > address 00000060
>> >
>> > The fix is to make it not calling kfree() but just goto err_free.
>> >
>> > However, in err_free, it calls put_device() to free the device,
>it
>> > still
>> > cause issue during fsi rescan, that the device is used after
>freed.
>> >
>> >     WARNING: CPU: 0 PID: 1433 at lib/refcount.c:190
>> > refcount_sub_and_test_checked+0x94/0xac
>> >     refcount_t: underflow; use-after-free.
>> >
>> > So the put_device() is removed and "err_free" label is renamed to
>> > "fail".
>>
>> It looks like this will leak memory (through the struct fsi_slave)
>that
>> has been kzalloc()ed. After device_register, we need to call
>> put_device() to free the struct fsi_slave, but there's no mechanism
>for
>> that to happen if we remove it from fsi_slave_init().
>
>The memory is "leaked" in this function, that the slave device is not
>freed
>here. But eventually, it will be freed in fsi_slave_release() (if I
>understand
>the code correctly), so there is no leak, eventually.

This is definitely bad.  In other paths the struct device for the
fsi-slave might be kept beyond the lifetime of the fsi-master slave
device, which would lead to use after free.

Each kobject must have its own release method and be allocated via
a separate kmalloc.

>
>> The error paths for this function do need to be fixed, but I don't
>think
>> this is the right approach.
>>
>> Do you have a backtrace of the
>refcount_sub_and_test_checked+0x94/0xac
>> warning? This may not be the actual struct device that underflows.
>
>Yes, below is the full trace:

This is just one possibility.


BTW, there doesn't seem to be any MAINTAINERS entry for drivers/fsi

Milton
Jeremy Kerr June 27, 2019, 6:38 a.m. UTC | #4
Hi Lei

> > 
> > It looks like this will leak memory (through the struct fsi_slave)
> > that
> > has been kzalloc()ed. After device_register, we need to call
> > put_device() to free the struct fsi_slave, but there's no mechanism
> > for
> > that to happen if we remove it from fsi_slave_init().
> 
> The memory is "leaked" in this function, that the slave device is not
> freed
> here. But eventually, it will be freed in fsi_slave_release() (if I
> understand
> the code correctly), so there is no leak, eventually.

But there's no way for fsi_slave_release() to be called, as the device
isn't registered with the core (d1dcd67825 effectively removed the
device_add from the slave device init).

I think that the BUG_ON that we're hitting in the release path is
through the cdev parent release, which brings the refcount down to 0,
then we do another put_device().

This could just be a matter of correcting the registration of the slave
to the device core - I'll take a look here.

Cheers,


Jeremy
diff mbox series

Patch

diff --git a/drivers/fsi/fsi-core.c b/drivers/fsi/fsi-core.c
index 2c31563fdcae..ebede90cf4bd 100644
--- a/drivers/fsi/fsi-core.c
+++ b/drivers/fsi/fsi-core.c
@@ -1041,14 +1041,14 @@  static int fsi_slave_init(struct fsi_master *master, int link, uint8_t id)
 	rc = __fsi_get_new_minor(slave, fsi_dev_cfam, &slave->dev.devt,
 				 &slave->cdev_idx);
 	if (rc)
-		goto err_free;
+		goto fail;
 
 	/* Create chardev for userspace access */
 	cdev_init(&slave->cdev, &cfam_fops);
 	rc = cdev_device_add(&slave->cdev, &slave->dev);
 	if (rc) {
 		dev_err(&slave->dev, "Error %d creating slave device\n", rc);
-		goto err_free;
+		goto fail;
 	}
 
 	rc = fsi_slave_set_smode(slave);
@@ -1056,8 +1056,8 @@  static int fsi_slave_init(struct fsi_master *master, int link, uint8_t id)
 		dev_warn(&master->dev,
 				"can't set smode on slave:%02x:%02x %d\n",
 				link, id, rc);
-		kfree(slave);
-		return -ENODEV;
+		rc = -ENODEV;
+		goto fail;
 	}
 	if (master->link_config)
 		master->link_config(master, link,
@@ -1075,10 +1075,7 @@  static int fsi_slave_init(struct fsi_master *master, int link, uint8_t id)
 		dev_dbg(&master->dev, "failed during slave scan with: %d\n",
 				rc);
 
-	return rc;
-
- err_free:
-	put_device(&slave->dev);
+ fail:
 	return rc;
 }