diff mbox series

net: eth-uclass: change state before stop() in eth_halt()

Message ID 20221130164225.766877-1-lusus@denx.de
State Rejected
Delegated to: Marek Vasut
Headers show
Series net: eth-uclass: change state before stop() in eth_halt() | expand

Commit Message

Niel Fourie Nov. 30, 2022, 4:42 p.m. UTC
In eth_halt(), change the private uclass state before calling
stop() instead of afterwards, to avoid writing to memory which
may have been freed during stop().

In the ethernet gadget implementation, the gadget device gets
probed during start() and removed during stop(), which includes
freeing `uclass_priv_` to which `priv` is pointing. Writing to
`priv` after stop() may corrupt the `fd` member of `struct
malloc_chunk`, which represents the freed block, and could cause
hard-to-debug crashes on subsequent calls to malloc()/free().

Signed-off-by: Niel Fourie <lusus@denx.de>
Cc: Ramon Fried <rfried.dev@gmail.com>
Cc: Marek Vasut <marex@denx.de>
Cc: Lukasz Majewski <lukma@denx.de>
---
 net/eth-uclass.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Lukasz Majewski Dec. 1, 2022, 8:24 a.m. UTC | #1
On Wed, 30 Nov 2022 17:42:25 +0100
Niel Fourie <lusus@denx.de> wrote:

> In eth_halt(), change the private uclass state before calling
> stop() instead of afterwards, to avoid writing to memory which
> may have been freed during stop().
> 
> In the ethernet gadget implementation, the gadget device gets
> probed during start() and removed during stop(), which includes
> freeing `uclass_priv_` to which `priv` is pointing. Writing to
> `priv` after stop() may corrupt the `fd` member of `struct
> malloc_chunk`, which represents the freed block, and could cause
> hard-to-debug crashes on subsequent calls to malloc()/free().
> 
> Signed-off-by: Niel Fourie <lusus@denx.de>
> Cc: Ramon Fried <rfried.dev@gmail.com>
> Cc: Marek Vasut <marex@denx.de>
> Cc: Lukasz Majewski <lukma@denx.de>
> ---
>  net/eth-uclass.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/eth-uclass.c b/net/eth-uclass.c
> index f41da4b37b3..bc3b9751e32 100644
> --- a/net/eth-uclass.c
> +++ b/net/eth-uclass.c
> @@ -342,9 +342,9 @@ void eth_halt(void)
>  	if (!priv || !priv->running)
>  		return;
>  
> -	eth_get_ops(current)->stop(current);
>  	priv->state = ETH_STATE_PASSIVE;
>  	priv->running = false;
> +	eth_get_ops(current)->stop(current);
>  }
>  
>  int eth_is_active(struct udevice *dev)

Reviewed-by: Lukasz Majewski <lukma@denx.de>


Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de
Marek Vasut Dec. 1, 2022, 10:44 a.m. UTC | #2
On 12/1/22 09:24, Lukasz Majewski wrote:
> On Wed, 30 Nov 2022 17:42:25 +0100
> Niel Fourie <lusus@denx.de> wrote:
> 
>> In eth_halt(), change the private uclass state before calling
>> stop() instead of afterwards, to avoid writing to memory which
>> may have been freed during stop().
>>
>> In the ethernet gadget implementation, the gadget device gets
>> probed during start() and removed during stop(), which includes
>> freeing `uclass_priv_` to which `priv` is pointing. Writing to
>> `priv` after stop() may corrupt the `fd` member of `struct
>> malloc_chunk`, which represents the freed block, and could cause
>> hard-to-debug crashes on subsequent calls to malloc()/free().
>>
>> Signed-off-by: Niel Fourie <lusus@denx.de>
>> Cc: Ramon Fried <rfried.dev@gmail.com>
>> Cc: Marek Vasut <marex@denx.de>
>> Cc: Lukasz Majewski <lukma@denx.de>
>> ---
>>   net/eth-uclass.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/eth-uclass.c b/net/eth-uclass.c
>> index f41da4b37b3..bc3b9751e32 100644
>> --- a/net/eth-uclass.c
>> +++ b/net/eth-uclass.c
>> @@ -342,9 +342,9 @@ void eth_halt(void)
>>   	if (!priv || !priv->running)
>>   		return;
>>   
>> -	eth_get_ops(current)->stop(current);
>>   	priv->state = ETH_STATE_PASSIVE;
>>   	priv->running = false;
>> +	eth_get_ops(current)->stop(current);
>>   }
>>   
>>   int eth_is_active(struct udevice *dev)
> 
> Reviewed-by: Lukasz Majewski <lukma@denx.de>

How come nobody triggered this problem with regular ethernet in U-Boot ?

If this is isolated to USB gadget ethernet, then please do not hack 
around this in core networking code, but rather fix the USB ethernet 
gadget itself. It seems that gadget code should not unregister the 
gadget in drivers/usb/gadget/ether.c _usb_eth_halt() , at least not fully.
Niel Fourie Dec. 2, 2022, 10:36 a.m. UTC | #3
Hi Marek,

On 01/12/2022 11:44, Marek Vasut wrote:
> On 12/1/22 09:24, Lukasz Majewski wrote:
>> On Wed, 30 Nov 2022 17:42:25 +0100
>> Niel Fourie <lusus@denx.de> wrote:
>>
>>> In eth_halt(), change the private uclass state before calling
>>> stop() instead of afterwards, to avoid writing to memory which
>>> may have been freed during stop().
>>>
>>> In the ethernet gadget implementation, the gadget device gets
>>> probed during start() and removed during stop(), which includes
>>> freeing `uclass_priv_` to which `priv` is pointing. Writing to
>>> `priv` after stop() may corrupt the `fd` member of `struct
>>> malloc_chunk`, which represents the freed block, and could cause
>>> hard-to-debug crashes on subsequent calls to malloc()/free().
>>>
>>> Signed-off-by: Niel Fourie <lusus@denx.de>
>>> Cc: Ramon Fried <rfried.dev@gmail.com>
>>> Cc: Marek Vasut <marex@denx.de>
>>> Cc: Lukasz Majewski <lukma@denx.de>
>>> ---
>>>   net/eth-uclass.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/net/eth-uclass.c b/net/eth-uclass.c
>>> index f41da4b37b3..bc3b9751e32 100644
>>> --- a/net/eth-uclass.c
>>> +++ b/net/eth-uclass.c
>>> @@ -342,9 +342,9 @@ void eth_halt(void)
>>>       if (!priv || !priv->running)
>>>           return;
>>> -    eth_get_ops(current)->stop(current);
>>>       priv->state = ETH_STATE_PASSIVE;
>>>       priv->running = false;
>>> +    eth_get_ops(current)->stop(current);
>>>   }
>>>   int eth_is_active(struct udevice *dev)
>>
>> Reviewed-by: Lukasz Majewski <lukma@denx.de>
> 
> How come nobody triggered this problem with regular ethernet in U-Boot ?
> 
> If this is isolated to USB gadget ethernet, then please do not hack 
> around this in core networking code, but rather fix the USB ethernet 
> gadget itself. It seems that gadget code should not unregister the 
> gadget in drivers/usb/gadget/ether.c _usb_eth_halt() , at least not fully.

The reason is simple, the regular ethernet drivers do not get removed on 
stop() like for the gadget ethernet driver, and in their case `priv` is 
still valid.

I agree on your point that reworking the ethernet gadget code would be 
preferable, but this would be a much bigger effort (and if I were to do 
it, I would probably introduce even more bugs). I am not certain whether 
this would not also affect the non-DM gadget implementation as well, 
which still contain drivers like ci_udc.c which does not appear to have 
been ported to DM gadget yet? (I only see DM USB there...)

That said, I am not certain whether this is not also bug, as I am not 
certain whether the assumption that `priv` should be available after 
stop() is valid or not.

The documentation at 
https://source.denx.de/u-boot/u-boot/-/blob/master/doc/develop/driver-model/ethernet.rst?plain=1#L135 
states:

The **stop** function should turn off / disable the hardware and place 
it back in its reset state.  It can be called at any time (before any 
call to the related start() function), so make sure it can handle this 
sort of thing.

In such a complete reset state I am not certain whether the assumption 
that `priv` should exist is still valid, at least not without another 
call to dev_get_uclass_priv() and revalidating it first?

Lastly, this assumption that priv is still valid is rather new and it 
was introduced here:

https://source.denx.de/u-boot/u-boot/-/commit/fa795f452541ce07b33be603de36cac3c5d7dfcf

This commit appears to be a workaround for drivers which cannot deal 
with stop() being called at any time as required in the above quoted 
documentation.

I would consider adding a workaround to a workaround in this case to be 
the lesser evil, as tracking down this bug in the first place was like 
looking for a needle in a haystack. This change would at least save 
everybody else from strange crashes in particular configurations without 
any negative impact. But this is fortunately not my decision. :)

Best regards,
Niel Fourie
Ramon Fried Dec. 3, 2022, 11:50 p.m. UTC | #4
On Fri, Dec 2, 2022 at 12:36 PM Niel Fourie <lusus@denx.de> wrote:
>
> Hi Marek,
>
> On 01/12/2022 11:44, Marek Vasut wrote:
> > On 12/1/22 09:24, Lukasz Majewski wrote:
> >> On Wed, 30 Nov 2022 17:42:25 +0100
> >> Niel Fourie <lusus@denx.de> wrote:
> >>
> >>> In eth_halt(), change the private uclass state before calling
> >>> stop() instead of afterwards, to avoid writing to memory which
> >>> may have been freed during stop().
> >>>
> >>> In the ethernet gadget implementation, the gadget device gets
> >>> probed during start() and removed during stop(), which includes
> >>> freeing `uclass_priv_` to which `priv` is pointing. Writing to
> >>> `priv` after stop() may corrupt the `fd` member of `struct
> >>> malloc_chunk`, which represents the freed block, and could cause
> >>> hard-to-debug crashes on subsequent calls to malloc()/free().
> >>>
> >>> Signed-off-by: Niel Fourie <lusus@denx.de>
> >>> Cc: Ramon Fried <rfried.dev@gmail.com>
> >>> Cc: Marek Vasut <marex@denx.de>
> >>> Cc: Lukasz Majewski <lukma@denx.de>
> >>> ---
> >>>   net/eth-uclass.c | 2 +-
> >>>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/net/eth-uclass.c b/net/eth-uclass.c
> >>> index f41da4b37b3..bc3b9751e32 100644
> >>> --- a/net/eth-uclass.c
> >>> +++ b/net/eth-uclass.c
> >>> @@ -342,9 +342,9 @@ void eth_halt(void)
> >>>       if (!priv || !priv->running)
> >>>           return;
> >>> -    eth_get_ops(current)->stop(current);
> >>>       priv->state = ETH_STATE_PASSIVE;
> >>>       priv->running = false;
> >>> +    eth_get_ops(current)->stop(current);
> >>>   }
> >>>   int eth_is_active(struct udevice *dev)
> >>
> >> Reviewed-by: Lukasz Majewski <lukma@denx.de>
> >
> > How come nobody triggered this problem with regular ethernet in U-Boot ?
> >
> > If this is isolated to USB gadget ethernet, then please do not hack
> > around this in core networking code, but rather fix the USB ethernet
> > gadget itself. It seems that gadget code should not unregister the
> > gadget in drivers/usb/gadget/ether.c _usb_eth_halt() , at least not fully.
>
> The reason is simple, the regular ethernet drivers do not get removed on
> stop() like for the gadget ethernet driver, and in their case `priv` is
> still valid.
>
> I agree on your point that reworking the ethernet gadget code would be
> preferable, but this would be a much bigger effort (and if I were to do
> it, I would probably introduce even more bugs). I am not certain whether
> this would not also affect the non-DM gadget implementation as well,
> which still contain drivers like ci_udc.c which does not appear to have
> been ported to DM gadget yet? (I only see DM USB there...)
>
> That said, I am not certain whether this is not also bug, as I am not
> certain whether the assumption that `priv` should be available after
> stop() is valid or not.
>
> The documentation at
> https://source.denx.de/u-boot/u-boot/-/blob/master/doc/develop/driver-model/ethernet.rst?plain=1#L135
> states:
>
> The **stop** function should turn off / disable the hardware and place
> it back in its reset state.  It can be called at any time (before any
> call to the related start() function), so make sure it can handle this
> sort of thing.
>
> In such a complete reset state I am not certain whether the assumption
> that `priv` should exist is still valid, at least not without another
> call to dev_get_uclass_priv() and revalidating it first?
>
> Lastly, this assumption that priv is still valid is rather new and it
> was introduced here:
>
> https://source.denx.de/u-boot/u-boot/-/commit/fa795f452541ce07b33be603de36cac3c5d7dfcf
>
> This commit appears to be a workaround for drivers which cannot deal
> with stop() being called at any time as required in the above quoted
> documentation.
>
> I would consider adding a workaround to a workaround in this case to be
> the lesser evil, as tracking down this bug in the first place was like
> looking for a needle in a haystack. This change would at least save
> everybody else from strange crashes in particular configurations without
> any negative impact. But this is fortunately not my decision. :)
>
> Best regards,
> Niel Fourie
>
> --
> DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> Phone: +49-8142-66989-21 Fax: +49-8142-66989-80  Email: lusus@denx.de
I will accept this patch, but please add some comment in the code
about why the order is important.
Marek Vasut Dec. 4, 2022, 2:17 a.m. UTC | #5
On 12/2/22 11:36, Niel Fourie wrote:
> Hi Marek,

Hi,

> On 01/12/2022 11:44, Marek Vasut wrote:
>> On 12/1/22 09:24, Lukasz Majewski wrote:
>>> On Wed, 30 Nov 2022 17:42:25 +0100
>>> Niel Fourie <lusus@denx.de> wrote:
>>>
>>>> In eth_halt(), change the private uclass state before calling
>>>> stop() instead of afterwards, to avoid writing to memory which
>>>> may have been freed during stop().
>>>>
>>>> In the ethernet gadget implementation, the gadget device gets
>>>> probed during start() and removed during stop(), which includes
>>>> freeing `uclass_priv_` to which `priv` is pointing. Writing to
>>>> `priv` after stop() may corrupt the `fd` member of `struct
>>>> malloc_chunk`, which represents the freed block, and could cause
>>>> hard-to-debug crashes on subsequent calls to malloc()/free().
>>>>
>>>> Signed-off-by: Niel Fourie <lusus@denx.de>
>>>> Cc: Ramon Fried <rfried.dev@gmail.com>
>>>> Cc: Marek Vasut <marex@denx.de>
>>>> Cc: Lukasz Majewski <lukma@denx.de>
>>>> ---
>>>>   net/eth-uclass.c | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/net/eth-uclass.c b/net/eth-uclass.c
>>>> index f41da4b37b3..bc3b9751e32 100644
>>>> --- a/net/eth-uclass.c
>>>> +++ b/net/eth-uclass.c
>>>> @@ -342,9 +342,9 @@ void eth_halt(void)
>>>>       if (!priv || !priv->running)
>>>>           return;
>>>> -    eth_get_ops(current)->stop(current);
>>>>       priv->state = ETH_STATE_PASSIVE;
>>>>       priv->running = false;
>>>> +    eth_get_ops(current)->stop(current);
>>>>   }
>>>>   int eth_is_active(struct udevice *dev)
>>>
>>> Reviewed-by: Lukasz Majewski <lukma@denx.de>
>>
>> How come nobody triggered this problem with regular ethernet in U-Boot ?
>>
>> If this is isolated to USB gadget ethernet, then please do not hack 
>> around this in core networking code, but rather fix the USB ethernet 
>> gadget itself. It seems that gadget code should not unregister the 
>> gadget in drivers/usb/gadget/ether.c _usb_eth_halt() , at least not 
>> fully.
> 
> The reason is simple, the regular ethernet drivers do not get removed on 
> stop() like for the gadget ethernet driver, and in their case `priv` is 
> still valid.

The suggestion for a proper fix is in the last paragraph above -- do not 
unregister the usb ethernet gadget device in halt(), keep the gadget 
device registered. Then the priv pointer would be valid. (*)

> I agree on your point that reworking the ethernet gadget code would be 
> preferable, but this would be a much bigger effort (and if I were to do 
> it, I would probably introduce even more bugs). I am not certain whether 
> this would not also affect the non-DM gadget implementation as well, 
> which still contain drivers like ci_udc.c which does not appear to have 
> been ported to DM gadget yet? (I only see DM USB there...)
> 
> That said, I am not certain whether this is not also bug, as I am not 
> certain whether the assumption that `priv` should be available after 
> stop() is valid or not.
> 
> The documentation at 
> https://source.denx.de/u-boot/u-boot/-/blob/master/doc/develop/driver-model/ethernet.rst?plain=1#L135 states:
> 
> The **stop** function should turn off / disable the hardware and place 
> it back in its reset state.  It can be called at any time (before any 
> call to the related start() function), so make sure it can handle this 
> sort of thing.

The ->stop callback is supposed to stop the interface, turn off its DMA, 
but NOT deallocate the device (and its associated data) behind it.

> In such a complete reset state I am not certain whether the assumption 
> that `priv` should exist is still valid, at least not without another 
> call to dev_get_uclass_priv() and revalidating it first?

In case a device is probe()d, its private data are also allocated and 
available, so yes, 'priv' pointer should still be valid.

> Lastly, this assumption that priv is still valid is rather new and it 
> was introduced here:
> 
> https://source.denx.de/u-boot/u-boot/-/commit/fa795f452541ce07b33be603de36cac3c5d7dfcf

I disagree, the device private data are valid during the entire lifespan 
of the device. That assumption has been baked into the driver model 
itself and far predates that commit.

In case of a usb ethernet, the lifespan of the device starts with the 
'bind' command which triggers the ->bind callback, and first use which 
triggers its ->probe callback. The lifespan ends with 'unbind' command 
or OS boot, which triggers ->remove callback and ->unbind callbacks.

> This commit appears to be a workaround for drivers which cannot deal 
> with stop() being called at any time as required in the above quoted 
> documentation.

This commit prevents network device ->start() from being called multiple 
times, which is a valid precaution, as calling start while the interface 
is already up would interfere with existing connection (e.g. the 
netconsole as mentioned in the commit message). That does not seem to be 
a workaround to me.

> I would consider adding a workaround to a workaround in this case to be 
> the lesser evil, as tracking down this bug in the first place was like 
> looking for a needle in a haystack. This change would at least save 
> everybody else from strange crashes in particular configurations without 
> any negative impact. But this is fortunately not my decision. :)

Commit fa795f45254 ("net: eth-uclass: avoid running start() twice 
without stop()") is as far as I can tell unrelated to this change and 
does not seem to me like a workaround.

It does however show that this patch introduces a bug -- this patch 
changes the order in which priv->state = ETH_STATE_PASSIVE; is assigned 
from _after_ the ->stop callback to _before_ the -> stop callback. This 
breaks drivers/net/ldpaa_eth/ldpaa_eth.c which checks the priv->state in 
its ->stop callback, either on its own in non-DM case, or in 
eth_is_active() implementation in DM case. With this patch, the 
interface would never be stopped in the ->stop callback, because the 
condition (net_dev->state == ETH_STATE_PASSIVE) test in the ldpaa stop 
callback implementation would always be true.

The proper fix is in the usb ethernet gadget code, see (*) above, let's 
not pile workarounds onto already difficult to maintain code.

[...]
Marek Vasut Dec. 4, 2022, 2:20 a.m. UTC | #6
On 12/4/22 00:50, Ramon Fried wrote:
> On Fri, Dec 2, 2022 at 12:36 PM Niel Fourie <lusus@denx.de> wrote:
>>
>> Hi Marek,
>>
>> On 01/12/2022 11:44, Marek Vasut wrote:
>>> On 12/1/22 09:24, Lukasz Majewski wrote:
>>>> On Wed, 30 Nov 2022 17:42:25 +0100
>>>> Niel Fourie <lusus@denx.de> wrote:
>>>>
>>>>> In eth_halt(), change the private uclass state before calling
>>>>> stop() instead of afterwards, to avoid writing to memory which
>>>>> may have been freed during stop().
>>>>>
>>>>> In the ethernet gadget implementation, the gadget device gets
>>>>> probed during start() and removed during stop(), which includes
>>>>> freeing `uclass_priv_` to which `priv` is pointing. Writing to
>>>>> `priv` after stop() may corrupt the `fd` member of `struct
>>>>> malloc_chunk`, which represents the freed block, and could cause
>>>>> hard-to-debug crashes on subsequent calls to malloc()/free().
>>>>>
>>>>> Signed-off-by: Niel Fourie <lusus@denx.de>
>>>>> Cc: Ramon Fried <rfried.dev@gmail.com>
>>>>> Cc: Marek Vasut <marex@denx.de>
>>>>> Cc: Lukasz Majewski <lukma@denx.de>
>>>>> ---
>>>>>    net/eth-uclass.c | 2 +-
>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/net/eth-uclass.c b/net/eth-uclass.c
>>>>> index f41da4b37b3..bc3b9751e32 100644
>>>>> --- a/net/eth-uclass.c
>>>>> +++ b/net/eth-uclass.c
>>>>> @@ -342,9 +342,9 @@ void eth_halt(void)
>>>>>        if (!priv || !priv->running)
>>>>>            return;
>>>>> -    eth_get_ops(current)->stop(current);
>>>>>        priv->state = ETH_STATE_PASSIVE;
>>>>>        priv->running = false;
>>>>> +    eth_get_ops(current)->stop(current);
>>>>>    }
>>>>>    int eth_is_active(struct udevice *dev)
>>>>
>>>> Reviewed-by: Lukasz Majewski <lukma@denx.de>
>>>
>>> How come nobody triggered this problem with regular ethernet in U-Boot ?
>>>
>>> If this is isolated to USB gadget ethernet, then please do not hack
>>> around this in core networking code, but rather fix the USB ethernet
>>> gadget itself. It seems that gadget code should not unregister the
>>> gadget in drivers/usb/gadget/ether.c _usb_eth_halt() , at least not fully.
>>
>> The reason is simple, the regular ethernet drivers do not get removed on
>> stop() like for the gadget ethernet driver, and in their case `priv` is
>> still valid.
>>
>> I agree on your point that reworking the ethernet gadget code would be
>> preferable, but this would be a much bigger effort (and if I were to do
>> it, I would probably introduce even more bugs). I am not certain whether
>> this would not also affect the non-DM gadget implementation as well,
>> which still contain drivers like ci_udc.c which does not appear to have
>> been ported to DM gadget yet? (I only see DM USB there...)
>>
>> That said, I am not certain whether this is not also bug, as I am not
>> certain whether the assumption that `priv` should be available after
>> stop() is valid or not.
>>
>> The documentation at
>> https://source.denx.de/u-boot/u-boot/-/blob/master/doc/develop/driver-model/ethernet.rst?plain=1#L135
>> states:
>>
>> The **stop** function should turn off / disable the hardware and place
>> it back in its reset state.  It can be called at any time (before any
>> call to the related start() function), so make sure it can handle this
>> sort of thing.
>>
>> In such a complete reset state I am not certain whether the assumption
>> that `priv` should exist is still valid, at least not without another
>> call to dev_get_uclass_priv() and revalidating it first?
>>
>> Lastly, this assumption that priv is still valid is rather new and it
>> was introduced here:
>>
>> https://source.denx.de/u-boot/u-boot/-/commit/fa795f452541ce07b33be603de36cac3c5d7dfcf
>>
>> This commit appears to be a workaround for drivers which cannot deal
>> with stop() being called at any time as required in the above quoted
>> documentation.
>>
>> I would consider adding a workaround to a workaround in this case to be
>> the lesser evil, as tracking down this bug in the first place was like
>> looking for a needle in a haystack. This change would at least save
>> everybody else from strange crashes in particular configurations without
>> any negative impact. But this is fortunately not my decision. :)

[...]

> I will accept this patch, but please add some comment in the code
> about why the order is important.

Let me pause you here, see my previous reply. I'm afraid this breaks 
drivers/net/ldpaa_eth/ldpaa_eth.c , which checks priv->state in its -> 
stop callback implementation, so if priv->state is changed _before_ the 
->stop() callback is called, the ldpaa would never be stopped. I 
proposed a fix on the USB gadget side -- keep the gadget device 
allocated -- I believe that's the correct approach.
Niel Fourie Dec. 5, 2022, 12:06 p.m. UTC | #7
Hi Marek,

On 04/12/2022 03:17, Marek Vasut wrote:
> On 12/2/22 11:36, Niel Fourie wrote:
>> Hi Marek,
> 
> Hi,
> 
>> On 01/12/2022 11:44, Marek Vasut wrote:
>>> On 12/1/22 09:24, Lukasz Majewski wrote:
>>>> On Wed, 30 Nov 2022 17:42:25 +0100
>>>> Niel Fourie <lusus@denx.de> wrote:
>>>>
>>>>> In eth_halt(), change the private uclass state before calling
>>>>> stop() instead of afterwards, to avoid writing to memory which
>>>>> may have been freed during stop().
>>>>>
>>>>> In the ethernet gadget implementation, the gadget device gets
>>>>> probed during start() and removed during stop(), which includes
>>>>> freeing `uclass_priv_` to which `priv` is pointing. Writing to
>>>>> `priv` after stop() may corrupt the `fd` member of `struct
>>>>> malloc_chunk`, which represents the freed block, and could cause
>>>>> hard-to-debug crashes on subsequent calls to malloc()/free().
>>>>>
>>>>> Signed-off-by: Niel Fourie <lusus@denx.de>
>>>>> Cc: Ramon Fried <rfried.dev@gmail.com>
>>>>> Cc: Marek Vasut <marex@denx.de>
>>>>> Cc: Lukasz Majewski <lukma@denx.de>
>>>>> ---
>>>>>   net/eth-uclass.c | 2 +-
>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/net/eth-uclass.c b/net/eth-uclass.c
>>>>> index f41da4b37b3..bc3b9751e32 100644
>>>>> --- a/net/eth-uclass.c
>>>>> +++ b/net/eth-uclass.c
>>>>> @@ -342,9 +342,9 @@ void eth_halt(void)
>>>>>       if (!priv || !priv->running)
>>>>>           return;
>>>>> -    eth_get_ops(current)->stop(current);
>>>>>       priv->state = ETH_STATE_PASSIVE;
>>>>>       priv->running = false;
>>>>> +    eth_get_ops(current)->stop(current);
>>>>>   }
>>>>>   int eth_is_active(struct udevice *dev)
>>>>
>>>> Reviewed-by: Lukasz Majewski <lukma@denx.de>
>>>
>>> How come nobody triggered this problem with regular ethernet in U-Boot ?
>>>
>>> If this is isolated to USB gadget ethernet, then please do not hack 
>>> around this in core networking code, but rather fix the USB ethernet 
>>> gadget itself. It seems that gadget code should not unregister the 
>>> gadget in drivers/usb/gadget/ether.c _usb_eth_halt() , at least not 
>>> fully.
>>
>> The reason is simple, the regular ethernet drivers do not get removed 
>> on stop() like for the gadget ethernet driver, and in their case 
>> `priv` is still valid.
> 
> The suggestion for a proper fix is in the last paragraph above -- do not 
> unregister the usb ethernet gadget device in halt(), keep the gadget 
> device registered. Then the priv pointer would be valid. (*)
> 
>> I agree on your point that reworking the ethernet gadget code would be 
>> preferable, but this would be a much bigger effort (and if I were to 
>> do it, I would probably introduce even more bugs). I am not certain 
>> whether this would not also affect the non-DM gadget implementation as 
>> well, which still contain drivers like ci_udc.c which does not appear 
>> to have been ported to DM gadget yet? (I only see DM USB there...)
>>
>> That said, I am not certain whether this is not also bug, as I am not 
>> certain whether the assumption that `priv` should be available after 
>> stop() is valid or not.
>>
>> The documentation at 
>> https://source.denx.de/u-boot/u-boot/-/blob/master/doc/develop/driver-model/ethernet.rst?plain=1#L135 states:
>>
>> The **stop** function should turn off / disable the hardware and place 
>> it back in its reset state.  It can be called at any time (before any 
>> call to the related start() function), so make sure it can handle this 
>> sort of thing.
> 
> The ->stop callback is supposed to stop the interface, turn off its DMA, 
> but NOT deallocate the device (and its associated data) behind it.
> 
>> In such a complete reset state I am not certain whether the assumption 
>> that `priv` should exist is still valid, at least not without another 
>> call to dev_get_uclass_priv() and revalidating it first?
> 
> In case a device is probe()d, its private data are also allocated and 
> available, so yes, 'priv' pointer should still be valid.

Granted, the usb gadget driver implementation is problematic, and it 
definitely belongs on the TODO list.

That being said, priv not being valid at this stage is not a new 
problem, as validation for it after stop() was explicitly added in 
commit c3211708 ("net: eth-uclass: Fix for DM USB ethernet support") for 
this reason 4 years ago. Fortunately in that case, it just fixed a null 
pointer de-reference, not corruption of freed memory.

>> Lastly, this assumption that priv is still valid is rather new and it 
>> was introduced here:
>>
>> https://source.denx.de/u-boot/u-boot/-/commit/fa795f452541ce07b33be603de36cac3c5d7dfcf
> 
> I disagree, the device private data are valid during the entire lifespan 
> of the device. That assumption has been baked into the driver model 
> itself and far predates that commit.
> 
> In case of a usb ethernet, the lifespan of the device starts with the 
> 'bind' command which triggers the ->bind callback, and first use which 
> triggers its ->probe callback. The lifespan ends with 'unbind' command 
> or OS boot, which triggers ->remove callback and ->unbind callbacks.
> 
>> This commit appears to be a workaround for drivers which cannot deal 
>> with stop() being called at any time as required in the above quoted 
>> documentation.
> 
> This commit prevents network device ->start() from being called multiple 
> times, which is a valid precaution, as calling start while the interface 
> is already up would interfere with existing connection (e.g. the 
> netconsole as mentioned in the commit message). That does not seem to be 
> a workaround to me.

Ah yes, the commit message indeed states that some drivers (like 
fec_mxc) cannot deal with multiple start() calls, not multiple stop() 
calls. My mistake, good point.

But preventing multiple start()s getting called is not what the code 
change in this commit is actually doing. It is instead inhibiting 
calling stop() on devices for which priv is not valid or for which 
priv->running is false. Therefore it is avoiding calling stop() on 
un-started devices.

>> I would consider adding a workaround to a workaround in this case to 
>> be the lesser evil, as tracking down this bug in the first place was 
>> like looking for a needle in a haystack. This change would at least 
>> save everybody else from strange crashes in particular configurations 
>> without any negative impact. But this is fortunately not my decision. :)
> 
> Commit fa795f45254 ("net: eth-uclass: avoid running start() twice 
> without stop()") is as far as I can tell unrelated to this change and 
> does not seem to me like a workaround.

It does introduce writing to priv after stop() is called without 
(re-)validating priv first, though.

The code previously first called stop(), then called 
dev_get_uclass_priv() to get priv and then validated priv before writing 
to it, which avoids exactly the problem of priv no longer being valid.

As mentioned above, this was explicitly addressed by commit c3211708 
("net: eth-uclass: Fix for DM USB ethernet support").

> It does however show that this patch introduces a bug -- this patch 
> changes the order in which priv->state = ETH_STATE_PASSIVE; is assigned 
> from _after_ the ->stop callback to _before_ the -> stop callback. This 
> breaks drivers/net/ldpaa_eth/ldpaa_eth.c which checks the priv->state in 
> its ->stop callback, either on its own in non-DM case, or in 
> eth_is_active() implementation in DM case. With this patch, the 
> interface would never be stopped in the ->stop callback, because the 
> condition (net_dev->state == ETH_STATE_PASSIVE) test in the ldpaa stop 
> callback implementation would always be true.
>

In drivers/net/ldpaa_eth/ldpaa_eth.c:ldpaa_eth_stop(), priv is of type
struct ldpaa_eth_priv*, defined in drivers/net/ldpaa_eth/ldpaa_eth.h and 
is accessed using dev_get_priv().

In net/eth-uclass.c:eht_halt(), priv is of type struct eth_device_priv* 
and defined in the same .c file, and is accessed using 
dev_get_uclass_priv(). As the structure is local to this file, nothing 
outside of this file should have any knowledge of its contents, and 
changing of the order of the calls should only impact this file.

I sincerely hope that these two are not interfering with each other, 
otherwise we have much bigger problems...

> The proper fix is in the usb ethernet gadget code, see (*) above, let's 
> not pile workarounds onto already difficult to maintain code.
> 
> [...]
Agreed. The problem is unfortunately just that I don't see that 
happening simply to fix this bug.

Best regards,
Niel Fourie
Niel Fourie Dec. 5, 2022, 3:33 p.m. UTC | #8
Hi Marek,

On 05/12/2022 13:06, Niel Fourie wrote:
[...]

>> It does however show that this patch introduces a bug -- this patch 
>> changes the order in which priv->state = ETH_STATE_PASSIVE; is 
>> assigned from _after_ the ->stop callback to _before_ the -> stop 
>> callback. This breaks drivers/net/ldpaa_eth/ldpaa_eth.c which checks 
>> the priv->state in its ->stop callback, either on its own in non-DM 
>> case, or in eth_is_active() implementation in DM case. With this 
>> patch, the interface would never be stopped in the ->stop callback, 
>> because the condition (net_dev->state == ETH_STATE_PASSIVE) test in 
>> the ldpaa stop callback implementation would always be true.
>>
> 
> In drivers/net/ldpaa_eth/ldpaa_eth.c:ldpaa_eth_stop(), priv is of type
> struct ldpaa_eth_priv*, defined in drivers/net/ldpaa_eth/ldpaa_eth.h and 
> is accessed using dev_get_priv().
> 
> In net/eth-uclass.c:eht_halt(), priv is of type struct eth_device_priv* 
> and defined in the same .c file, and is accessed using 
> dev_get_uclass_priv(). As the structure is local to this file, nothing 
> outside of this file should have any knowledge of its contents, and 
> changing of the order of the calls should only impact this file.
> 
> I sincerely hope that these two are not interfering with each other, 
> otherwise we have much bigger problems...
> 

Shucks, I was thrown off by the the fact that net_dev is of type struct 
eth_device, and its member state is separate from struct 
eth_device_state and its member state, that I missed the implication of 
eth_is_active() *setting* the value of struct eth_device_priv's state 
not *reading* it.

Well spotted, you are correct. The patch in its current form would 
introduce that bug. Thank you for finding that.

Adding back the call to dev_get_uclass_priv() to get priv and validating 
it again *after* stop() as it was done before commit fa795f45254 ("net: 
eth-uclass: avoid running start() twice without stop()") would fix this, 
and perhaps also make the issue with stop() and Ethernet gadget more 
obvious. A comment on why it needs to be repeated would also be useful. 
Would this be an acceptable improvement?

I agree that fixing the USB ethernet gadget is still the best solution, 
but until that happens, we could at least limit everyone's pain.

Best regards,
Niel Fourie
Marek Vasut Dec. 5, 2022, 4:18 p.m. UTC | #9
On 12/5/22 13:06, Niel Fourie wrote:

Hi,

[...]

>>>> How come nobody triggered this problem with regular ethernet in 
>>>> U-Boot ?
>>>>
>>>> If this is isolated to USB gadget ethernet, then please do not hack 
>>>> around this in core networking code, but rather fix the USB ethernet 
>>>> gadget itself. It seems that gadget code should not unregister the 
>>>> gadget in drivers/usb/gadget/ether.c _usb_eth_halt() , at least not 
>>>> fully.
>>>
>>> The reason is simple, the regular ethernet drivers do not get removed 
>>> on stop() like for the gadget ethernet driver, and in their case 
>>> `priv` is still valid.
>>
>> The suggestion for a proper fix is in the last paragraph above -- do 
>> not unregister the usb ethernet gadget device in halt(), keep the 
>> gadget device registered. Then the priv pointer would be valid. (*)
>>
>>> I agree on your point that reworking the ethernet gadget code would 
>>> be preferable, but this would be a much bigger effort (and if I were 
>>> to do it, I would probably introduce even more bugs). I am not 
>>> certain whether this would not also affect the non-DM gadget 
>>> implementation as well, which still contain drivers like ci_udc.c 
>>> which does not appear to have been ported to DM gadget yet? (I only 
>>> see DM USB there...)
>>>
>>> That said, I am not certain whether this is not also bug, as I am not 
>>> certain whether the assumption that `priv` should be available after 
>>> stop() is valid or not.
>>>
>>> The documentation at 
>>> https://source.denx.de/u-boot/u-boot/-/blob/master/doc/develop/driver-model/ethernet.rst?plain=1#L135 states:
>>>
>>> The **stop** function should turn off / disable the hardware and 
>>> place it back in its reset state.  It can be called at any time 
>>> (before any call to the related start() function), so make sure it 
>>> can handle this sort of thing.
>>
>> The ->stop callback is supposed to stop the interface, turn off its 
>> DMA, but NOT deallocate the device (and its associated data) behind it.
>>
>>> In such a complete reset state I am not certain whether the 
>>> assumption that `priv` should exist is still valid, at least not 
>>> without another call to dev_get_uclass_priv() and revalidating it first?
>>
>> In case a device is probe()d, its private data are also allocated and 
>> available, so yes, 'priv' pointer should still be valid.
> 
> Granted, the usb gadget driver implementation is problematic, and it 
> definitely belongs on the TODO list.
> 
> That being said, priv not being valid at this stage is not a new 
> problem, as validation for it after stop() was explicitly added in 
> commit c3211708 ("net: eth-uclass: Fix for DM USB ethernet support") for 
> this reason 4 years ago. Fortunately in that case, it just fixed a null 
> pointer de-reference, not corruption of freed memory.

I believe the aforementioned patch is already also wrong, since the priv 
pointer would be invalid for gadget ethernet driver got removed in the 
stop callback.

>>> Lastly, this assumption that priv is still valid is rather new and it 
>>> was introduced here:
>>>
>>> https://source.denx.de/u-boot/u-boot/-/commit/fa795f452541ce07b33be603de36cac3c5d7dfcf
>>
>> I disagree, the device private data are valid during the entire 
>> lifespan of the device. That assumption has been baked into the driver 
>> model itself and far predates that commit.
>>
>> In case of a usb ethernet, the lifespan of the device starts with the 
>> 'bind' command which triggers the ->bind callback, and first use which 
>> triggers its ->probe callback. The lifespan ends with 'unbind' command 
>> or OS boot, which triggers ->remove callback and ->unbind callbacks.
>>
>>> This commit appears to be a workaround for drivers which cannot deal 
>>> with stop() being called at any time as required in the above quoted 
>>> documentation.
>>
>> This commit prevents network device ->start() from being called 
>> multiple times, which is a valid precaution, as calling start while 
>> the interface is already up would interfere with existing connection 
>> (e.g. the netconsole as mentioned in the commit message). That does 
>> not seem to be a workaround to me.
> 
> Ah yes, the commit message indeed states that some drivers (like 
> fec_mxc) cannot deal with multiple start() calls, not multiple stop() 
> calls. My mistake, good point.
> 
> But preventing multiple start()s getting called is not what the code 
> change in this commit is actually doing. It is instead inhibiting 
> calling stop() on devices for which priv is not valid or for which 
> priv->running is false. Therefore it is avoiding calling stop() on 
> un-started devices.

I wonder if such a patch is needed to fix it ?

diff --git a/net/eth-uclass.c b/net/eth-uclass.c
index f41da4b37b3..e7b5d3943e8 100644
--- a/net/eth-uclass.c
+++ b/net/eth-uclass.c
@@ -298,7 +298,7 @@ int eth_init(void)
                 if (current) {
                         debug("Trying %s\n", current->name);

-                       if (device_active(current)) {
+                       if (!eth_is_active(current)) {

>>> I would consider adding a workaround to a workaround in this case to 
>>> be the lesser evil, as tracking down this bug in the first place was 
>>> like looking for a needle in a haystack. This change would at least 
>>> save everybody else from strange crashes in particular configurations 
>>> without any negative impact. But this is fortunately not my decision. :)
>>
>> Commit fa795f45254 ("net: eth-uclass: avoid running start() twice 
>> without stop()") is as far as I can tell unrelated to this change and 
>> does not seem to me like a workaround.
> 
> It does introduce writing to priv after stop() is called without 
> (re-)validating priv first, though.
> 
> The code previously first called stop(), then called 
> dev_get_uclass_priv() to get priv and then validated priv before writing 
> to it, which avoids exactly the problem of priv no longer being valid.
> 
> As mentioned above, this was explicitly addressed by commit c3211708 
> ("net: eth-uclass: Fix for DM USB ethernet support").

I'm afraid this is just piling workarounds on top of workarounds.

>> It does however show that this patch introduces a bug -- this patch 
>> changes the order in which priv->state = ETH_STATE_PASSIVE; is 
>> assigned from _after_ the ->stop callback to _before_ the -> stop 
>> callback. This breaks drivers/net/ldpaa_eth/ldpaa_eth.c which checks 
>> the priv->state in its ->stop callback, either on its own in non-DM 
>> case, or in eth_is_active() implementation in DM case. With this 
>> patch, the interface would never be stopped in the ->stop callback, 
>> because the condition (net_dev->state == ETH_STATE_PASSIVE) test in 
>> the ldpaa stop callback implementation would always be true.
>>
> 
> In drivers/net/ldpaa_eth/ldpaa_eth.c:ldpaa_eth_stop(), priv is of type
> struct ldpaa_eth_priv*, defined in drivers/net/ldpaa_eth/ldpaa_eth.h and 
> is accessed using dev_get_priv().
> 
> In net/eth-uclass.c:eht_halt(), priv is of type struct eth_device_priv* 
> and defined in the same .c file, and is accessed using 
> dev_get_uclass_priv(). As the structure is local to this file, nothing 
> outside of this file should have any knowledge of its contents, and 
> changing of the order of the calls should only impact this file.

As you already found out yourself in subsequent email:

  655 static void ldpaa_eth_stop(struct udevice *dev)
  656 {
  657         struct ldpaa_eth_priv *priv = dev_get_priv(dev);
...
  666 #ifdef CONFIG_DM_ETH
  667         if (!eth_is_active(dev))
...

The eth_is_active():

350 int eth_is_active(struct udevice *dev)
351 {
352         struct eth_device_priv *priv;
353
354         if (!dev || !device_active(dev))
355                 return 0;
356
357         priv = dev_get_uclass_priv(dev);
358         return priv->state == ETH_STATE_ACTIVE;
359 }

does access dev_get_uclass_priv(dev)->priv .

> I sincerely hope that these two are not interfering with each other, 
> otherwise we have much bigger problems...
> 
>> The proper fix is in the usb ethernet gadget code, see (*) above, 
>> let's not pile workarounds onto already difficult to maintain code.
>>
>> [...]
> Agreed. The problem is unfortunately just that I don't see that 
> happening simply to fix this bug.

Please, do not pile workarounds on top of workarounds, there is too many 
of those already as you found out above. Each one just slaps something 
on, which later on breaks and causes pain a few years down the line. 
This is not manageable in the long run.

The proper fix I believe is (*) above and it should not be hard to 
implement that one.
Marek Vasut Dec. 5, 2022, 4:20 p.m. UTC | #10
On 12/5/22 16:33, Niel Fourie wrote:
> Hi Marek,

Hi,

> On 05/12/2022 13:06, Niel Fourie wrote:
> [...]
> 
>>> It does however show that this patch introduces a bug -- this patch 
>>> changes the order in which priv->state = ETH_STATE_PASSIVE; is 
>>> assigned from _after_ the ->stop callback to _before_ the -> stop 
>>> callback. This breaks drivers/net/ldpaa_eth/ldpaa_eth.c which checks 
>>> the priv->state in its ->stop callback, either on its own in non-DM 
>>> case, or in eth_is_active() implementation in DM case. With this 
>>> patch, the interface would never be stopped in the ->stop callback, 
>>> because the condition (net_dev->state == ETH_STATE_PASSIVE) test in 
>>> the ldpaa stop callback implementation would always be true.
>>>
>>
>> In drivers/net/ldpaa_eth/ldpaa_eth.c:ldpaa_eth_stop(), priv is of type
>> struct ldpaa_eth_priv*, defined in drivers/net/ldpaa_eth/ldpaa_eth.h 
>> and is accessed using dev_get_priv().
>>
>> In net/eth-uclass.c:eht_halt(), priv is of type struct 
>> eth_device_priv* and defined in the same .c file, and is accessed 
>> using dev_get_uclass_priv(). As the structure is local to this file, 
>> nothing outside of this file should have any knowledge of its 
>> contents, and changing of the order of the calls should only impact 
>> this file.
>>
>> I sincerely hope that these two are not interfering with each other, 
>> otherwise we have much bigger problems...
>>
> 
> Shucks, I was thrown off by the the fact that net_dev is of type struct 
> eth_device, and its member state is separate from struct 
> eth_device_state and its member state, that I missed the implication of 
> eth_is_active() *setting* the value of struct eth_device_priv's state 
> not *reading* it.
> 
> Well spotted, you are correct. The patch in its current form would 
> introduce that bug. Thank you for finding that.

Good, so we agree this patch introduces a bug.

> Adding back the call to dev_get_uclass_priv() to get priv and validating 
> it again *after* stop() as it was done before commit fa795f45254 ("net: 
> eth-uclass: avoid running start() twice without stop()") would fix this, 
> and perhaps also make the issue with stop() and Ethernet gadget more 
> obvious. A comment on why it needs to be repeated would also be useful. 
> Would this be an acceptable improvement?
> 
> I agree that fixing the USB ethernet gadget is still the best solution, 
> but until that happens, we could at least limit everyone's pain.

See my reply to the previous email.

Keep the usb gadget device around, that should not be hard to implement 
and that should fix this problem once and for all, and for the future too.
diff mbox series

Patch

diff --git a/net/eth-uclass.c b/net/eth-uclass.c
index f41da4b37b3..bc3b9751e32 100644
--- a/net/eth-uclass.c
+++ b/net/eth-uclass.c
@@ -342,9 +342,9 @@  void eth_halt(void)
 	if (!priv || !priv->running)
 		return;
 
-	eth_get_ops(current)->stop(current);
 	priv->state = ETH_STATE_PASSIVE;
 	priv->running = false;
+	eth_get_ops(current)->stop(current);
 }
 
 int eth_is_active(struct udevice *dev)