[net-next,5/6] devlink: Reshuffle resource registration logic

Message ID 6176a137a4ded48501e8a06fda0e305f9cfc787c.1637173517.git.leonro@nvidia.com
State Awaiting Upstream
Series Devlink cleanups

Commit Message

Leon Romanovsky Nov. 17, 2021, 6:26 p.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

The devlink->lock doesn't need to be held for the whole execution of the
function, but only while the recursive path walk and the list addition are
performed. So reshuffle the resource registration logic to allocate the
resource and configure it without the lock held.

As part of this change, complain more loudly if a driver author reuses an
already existing resource_id. The check is performed outside of the lock,
as drivers are supposed to provide unique IDs and such an error can't
happen at runtime.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 net/core/devlink.c | 30 ++++++++++++------------------
 1 file changed, 12 insertions(+), 18 deletions(-)
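
For orientation, this is the shape of devlink_resource_register() after the
patch (condensed from the diff at the bottom of this page; error handling
elided):

	int devlink_resource_register(struct devlink *devlink, ...)
	{
		struct devlink_resource *resource;
		struct list_head *resource_list;
		int err = 0;

		/* Duplicate IDs indicate a buggy driver; checked without the lock. */
		WARN_ON(devlink_resource_find(devlink, NULL, resource_id));

		/* Allocation and field setup now happen outside devlink->lock. */
		resource = kzalloc(sizeof(*resource), GFP_KERNEL);
		resource->id = resource_id;
		/* ... name, size, size_params, INIT_LIST_HEAD ... */

		/* The lock only covers the parent lookup and the list addition. */
		mutex_lock(&devlink->lock);
		/* ... pick devlink->resource_list or the parent's list ... */
		list_add_tail(&resource->list, resource_list);
		mutex_unlock(&devlink->lock);
		return err;
	}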

Comments

Jakub Kicinski Nov. 18, 2021, 4:49 a.m. UTC | #1
On Wed, 17 Nov 2021 20:26:21 +0200 Leon Romanovsky wrote:
> -	top_hierarchy = parent_resource_id == DEVLINK_RESOURCE_ID_PARENT_TOP;
> -
> -	mutex_lock(&devlink->lock);
> -	resource = devlink_resource_find(devlink, NULL, resource_id);
> -	if (resource) {
> -		err = -EINVAL;
> -		goto out;
> -	}
> +	WARN_ON(devlink_resource_find(devlink, NULL, resource_id));

This is not atomic with the add now.
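
To make the race concrete, a minimal sketch of the interleaving two
concurrent (buggy) callers could produce with the check outside the lock:

	CPU0                                    CPU1
	WARN_ON(devlink_resource_find(...));    /* finds nothing */
	                                        WARN_ON(devlink_resource_find(...));  /* finds nothing */
	mutex_lock(&devlink->lock);
	list_add_tail(&resource->list, ...);
	mutex_unlock(&devlink->lock);
	                                        mutex_lock(&devlink->lock);
	                                        list_add_tail(&resource->list, ...);  /* duplicate id */
	                                        mutex_unlock(&devlink->lock);

Neither caller triggers the WARN_ON, and both entries land on the list.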

>  	resource = kzalloc(sizeof(*resource), GFP_KERNEL);
>  	if (!resource) {
> @@ -9851,7 +9843,17 @@ int devlink_resource_register(struct devlink *devlink,
>  		goto out;
>  	}
>  
> -	if (top_hierarchy) {
> +	resource->name = resource_name;
> +	resource->size = resource_size;
> +	resource->size_new = resource_size;
> +	resource->id = resource_id;
> +	resource->size_valid = true;
> +	memcpy(&resource->size_params, size_params,
> +	       sizeof(resource->size_params));
> +	INIT_LIST_HEAD(&resource->resource_list);
> +
> +	mutex_lock(&devlink->lock);
Leon Romanovsky Nov. 18, 2021, 7:50 a.m. UTC | #2
On Wed, Nov 17, 2021 at 08:49:56PM -0800, Jakub Kicinski wrote:
> On Wed, 17 Nov 2021 20:26:21 +0200 Leon Romanovsky wrote:
> > -	top_hierarchy = parent_resource_id == DEVLINK_RESOURCE_ID_PARENT_TOP;
> > -
> > -	mutex_lock(&devlink->lock);
> > -	resource = devlink_resource_find(devlink, NULL, resource_id);
> > -	if (resource) {
> > -		err = -EINVAL;
> > -		goto out;
> > -	}
> > +	WARN_ON(devlink_resource_find(devlink, NULL, resource_id));
> 
> This is not atomic with the add now.

And it shouldn't be. devlink_resource_find() will return a valid resource only
if the driver is completely bogus, with races or incorrect allocation of
resource_ids.

devlink_*_register(..)
 mutex_lock(&devlink->lock);
 if (devlink_*_find(...)) {
    mutex_unlock(&devlink->lock);
    return ....;
 }
 .....

The pattern above is almost always wrong from a locking and layering
perspective, as it is racy by definition if not protected by a top layer.

There are exceptions to the rule above, but devlink is clearly not one of
them.
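
As an illustration of "protected by top layer": in a correct driver the
resource IDs are compile-time constants and registration runs sequentially
from probe, so two registrations of the same ID simply cannot race. A
hypothetical driver (names made up; devlink API as in net-next of this era):

	enum foo_resource_id {			/* hypothetical */
		FOO_RESOURCE_RULES = 1,
		FOO_RESOURCE_COUNTERS,
	};

	static int foo_resources_register(struct devlink *devlink)
	{
		struct devlink_resource_size_params params;

		devlink_resource_size_params_init(&params, 0, 1 << 16, 1,
						  DEVLINK_RESOURCE_UNIT_ENTRY);
		/* Called once from probe, before anything else can touch
		 * this devlink instance, so no registration can race. */
		return devlink_resource_register(devlink, "rules", 1 << 10,
						 FOO_RESOURCE_RULES,
						 DEVLINK_RESOURCE_ID_PARENT_TOP,
						 &params);
	}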

Thanks
Jakub Kicinski Nov. 19, 2021, 1:48 a.m. UTC | #3
On Thu, 18 Nov 2021 09:50:20 +0200 Leon Romanovsky wrote:
> And it shouldn't be. devlink_resource_find() will return a valid resource only
> if the driver is completely bogus, with races or incorrect allocation of
> resource_ids.
> 
> devlink_*_register(..)
>  mutex_lock(&devlink->lock);
>  if (devlink_*_find(...)) {
>     mutex_unlock(&devlink->lock);
>     return ....;
>  }
>  .....
> 
> The pattern above is almost always wrong from a locking and layering
> perspective, as it is racy by definition if not protected by a top layer.
> 
> There are exceptions to the rule above, but devlink is clearly not one of
> them.

Just drop the unnecessary "cleanup" patches and limit the amount 
of driver code we'll have to revert if your approach fails.

I spent enough time going back and forth with you.

Please.
Leon Romanovsky Nov. 19, 2021, 3:38 p.m. UTC | #4
On Thu, Nov 18, 2021 at 05:48:13PM -0800, Jakub Kicinski wrote:
> On Thu, 18 Nov 2021 09:50:20 +0200 Leon Romanovsky wrote:
> > And it shouldn't be. devlink_resource_find() will return a valid resource only
> > if the driver is completely bogus, with races or incorrect allocation of
> > resource_ids.
> > 
> > devlink_*_register(..)
> >  mutex_lock(&devlink->lock);
> >  if (devlink_*_find(...)) {
> >     mutex_unlock(&devlink->lock);
> >     return ....;
> >  }
> >  .....
> > 
> > The pattern above is almost always wrong from a locking and layering
> > perspective, as it is racy by definition if not protected by a top layer.
> > 
> > There are exceptions to the rule above, but devlink is clearly not one of
> > them.
> 
> Just drop the unnecessary "cleanup" patches and limit the amount 
> of driver code we'll have to revert if your approach fails.

My approach works, exactly like it works in other subsystems.
https://lore.kernel.org/netdev/cover.1636390483.git.leonro@nvidia.com/

We are waiting to see your proposal extended to support parallel devlink
execution and to be applied to real drivers.
https://lore.kernel.org/netdev/20211030231254.2477599-1-kuba@kernel.org/

Anyway, you are the maintainer; you want half work, you will get half work.

> 
> I spent enough time going back and forth with you.
> 
> Please.

Disagreements are hard for everyone, not only for you.

Thanks
Jakub Kicinski Nov. 19, 2021, 4:10 p.m. UTC | #5
On Fri, 19 Nov 2021 17:38:53 +0200 Leon Romanovsky wrote:
> On Thu, Nov 18, 2021 at 05:48:13PM -0800, Jakub Kicinski wrote:
> > On Thu, 18 Nov 2021 09:50:20 +0200 Leon Romanovsky wrote:  
> > > And it shouldn't be. devlink_resource_find() will return a valid resource only
> > > if the driver is completely bogus, with races or incorrect allocation of
> > > resource_ids.
> > > 
> > > devlink_*_register(..)
> > >  mutex_lock(&devlink->lock);
> > >  if (devlink_*_find(...)) {
> > >     mutex_unlock(&devlink->lock);
> > >     return ....;
> > >  }
> > >  .....
> > > 
> > > The pattern above is almost always wrong from a locking and layering
> > > perspective, as it is racy by definition if not protected by a top layer.
> > > 
> > > There are exceptions to the rule above, but devlink is clearly not one of
> > > them.
> > 
> > Just drop the unnecessary "cleanup" patches and limit the amount 
> > of driver code we'll have to revert if your approach fails.  
> 
> My approach works, exactly like it works in other subsystems.
> https://lore.kernel.org/netdev/cover.1636390483.git.leonro@nvidia.com/

What "other subsystems"? I'm aware of the RFC version of these patches.

Breaking up the locks to protect sub-objects only is fine for
protecting internal lists, but now you can't guarantee that the object
exists when the driver is called.

I'm sure you'll utter your unprovable "in real drivers.." but the fact
is my approach does not suffer from any such issues, nor does it depend
on drivers registering devlink last.

I can start passing a pointer to a devlink_port to split/unsplit
functions, which is a great improvement to the devlink driver API.

> We are waiting to see your proposal extended to support parallel devlink
> execution and to be applied to real drivers.
> https://lore.kernel.org/netdev/20211030231254.2477599-1-kuba@kernel.org/

The conversion to xarray you have done is a great improvement, I don't
disagree with the way you convert to allow parallel calls either.

I already told you that real drivers can be converted rather easily,
even if it's not really necessary.

But I'm giving you time to make your proposal. If I spend time
polishing my patches I'll be even more eager to put this behind me.

> Anyway, you are the maintainer; you want half work, you will get half work.

What do you mean half work? You have a record of breaking things 
in the area and changing directions. How is my request to limit
unnecessary "cleanups" affecting drivers until the work is finished
not perfectly reasonable?!?!

> > I spent enough time going back and forth with you.
> 
> Disagreements are hard for everyone, not only for you.
Leon Romanovsky Nov. 21, 2021, 8:45 a.m. UTC | #6
On Fri, Nov 19, 2021 at 08:10:17AM -0800, Jakub Kicinski wrote:
> On Fri, 19 Nov 2021 17:38:53 +0200 Leon Romanovsky wrote:
> > On Thu, Nov 18, 2021 at 05:48:13PM -0800, Jakub Kicinski wrote:
> > > On Thu, 18 Nov 2021 09:50:20 +0200 Leon Romanovsky wrote:  
> > > > And it shouldn't be. devlink_resource_find() will return a valid resource only
> > > > if the driver is completely bogus, with races or incorrect allocation of
> > > > resource_ids.
> > > > 
> > > > devlink_*_register(..)
> > > >  mutex_lock(&devlink->lock);
> > > >  if (devlink_*_find(...)) {
> > > >     mutex_unlock(&devlink->lock);
> > > >     return ....;
> > > >  }
> > > >  .....
> > > > 
> > > > The pattern above is almost always wrong from a locking and layering
> > > > perspective, as it is racy by definition if not protected by a top layer.
> > > > 
> > > > There are exceptions to the rule above, but devlink is clearly not one of
> > > > them.
> > > 
> > > Just drop the unnecessary "cleanup" patches and limit the amount 
> > > of driver code we'll have to revert if your approach fails.  
> > 
> > My approach works, exactly like it works in other subsystems.
> > https://lore.kernel.org/netdev/cover.1636390483.git.leonro@nvidia.com/
> 
> What "other subsystems"? I'm aware of the RFC version of these patches.

The approach of having a fine-grained locking scheme, instead of one big lock.
This was done in MM for mmap_sem, and we did it for RDMA too.

> 
> > Breaking up the locks to protect sub-objects only is fine for
> > protecting internal lists, but now you can't guarantee that the object
> > exists when the driver is called.

I can only guess which objects you are talking about.

If you are talking about the various devlink sub-objects (ports, traps,
etc.), they are created by the drivers and as such should be managed by them.
Also, they are connected to a devlink which is guaranteed to exist. In the
end, they are accessed through the devlink_XXX->devlink pointer without any
existence check.

If you are talking about the devlink instance itself, we guarantee that it
exists between devlink_alloc() and devlink_free(). It seems to me a pretty
reasonable request that drivers do not access devlink before devlink_alloc()
or after devlink_free().

> 
> I'm sure you'll utter your unprovable "in real drivers.." but the fact
> is my approach does not suffer from any such issues, nor does it depend
> on drivers registering devlink last.

Registration of devlink doesn't do anything except open it to the world.
The lifetime is controlled by alloc and free. My beloved sentence "in
real drivers ..." refers to the use of devlink_put and devlink locks outside
of devlink.c and nothing more.

> 
> I can start passing a pointer to a devlink_port to split/unsplit
> functions, which is a great improvement to the devlink driver API.

You can do it with my approach too. We incremented the reference counter
of the devlink instance when devlink_nl_cmd_port_split_doit() was called,
and we can safely take the devlink->port_list_lock before returning
from pre_doit.
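
A rough sketch of that scheme (this is not the RFC code; port_list_lock is
the hypothetical per-sub-object lock):

	static int devlink_nl_pre_doit(const struct genl_ops *ops,
				       struct sk_buff *skb, struct genl_info *info)
	{
		struct devlink *devlink;

		/* Look up the instance; in this scheme a reference is
		 * taken here, keeping the devlink alive for the doit. */
		devlink = devlink_get_from_attrs(genl_info_net(info), info->attrs);
		if (IS_ERR(devlink))
			return PTR_ERR(devlink);

		/* Pin the port list instead of taking the big devlink->lock,
		 * so the devlink_port passed to ->port_split() stays valid. */
		mutex_lock(&devlink->port_list_lock);	/* hypothetical */
		info->user_ptr[0] = devlink;
		return 0;
	}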

> 
> > We are waiting to see your proposal extended to support parallel devlink
> > execution and to be applied to real drivers.
> > https://lore.kernel.org/netdev/20211030231254.2477599-1-kuba@kernel.org/
> 
> The conversion to xarray you have done is a great improvement, I don't
> disagree with the way you convert to allow parallel calls either.
> 
> I already told you that real drivers can be converted rather easily,
> even if it's not really necessary.
> 
> But I'm giving you time to make your proposal. If I spend time
> polishing my patches I'll be even more eager to put this behind me.

I see exposure of devlink internals to the drivers as a last resort, so I
stopped making proposals after your responses:

"I prefer my version."
https://lore.kernel.org/netdev/20211108101646.0a4e5ca4@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

"If by "fixed first" you mean it needs 5 locks to be added and to remove
any guarantees on sub-object lifetime then no thanks."
https://lore.kernel.org/netdev/20211108104608.378c106e@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

> 
> > Anyway, you are the maintainer; you want half work, you will get half work.
> 
> What do you mean half work? You have a record of breaking things 
> in the area and changing directions. How is my request to limit
> unnecessary "cleanups" affecting drivers until the work is finished
> not perfectly reasonable?!?!

I don't know what made you think so. My end goals (parallel execution
and safe devlink reload) and solutions didn't change:
 * Devlink instance is safe to access by kernel between devlink_alloc() and devlink_free().
 * Devlink instance is visible for users between devlink_register() and devlink_unregister().
 * Locks should be fine-grained and limited.

By saying "half work", I mean that the attempt to limit locks leaves many
functions that can't fail, and as such they should be void and not
"return 0".

And regarding "breaking things", I'm not doing it for fun, but with a real
desire to improve the kernel for everyone, not only for our driver.

> 
> > > I spent enough time going back and forth with you.
> > 
> > Disagreements are hard for everyone, not only for you.
Jakub Kicinski Nov. 23, 2021, 2:27 a.m. UTC | #7
On Sun, 21 Nov 2021 10:45:12 +0200 Leon Romanovsky wrote:
> On Fri, Nov 19, 2021 at 08:10:17AM -0800, Jakub Kicinski wrote:
> > On Fri, 19 Nov 2021 17:38:53 +0200 Leon Romanovsky wrote:  
> > > My approach works, exactly like it works in other subsystems.
> > > https://lore.kernel.org/netdev/cover.1636390483.git.leonro@nvidia.com/  
> > 
> > What "other subsystems"? I'm aware of the RFC version of these patches.  
> 
> The approach of having a fine-grained locking scheme, instead of one big lock.
> This was done in MM for mmap_sem, and we did it for RDMA too.

You're breaking things up to avoid lock ordering issues. The user can
still only run a single write command at a time.

> > Breaking up the locks to protect sub-objects only is fine for
> > protecting internal lists, but now you can't guarantee that the object
> > exists when the driver is called.
> 
> I can only guess which objects you are talking about.

It obviously refers to the port splitting I mentioned below.

> If you are talking about the various devlink sub-objects (ports, traps,
> etc.), they are created by the drivers and as such should be managed by them.
> Also, they are connected to a devlink which is guaranteed to exist. In the
> end, they are accessed through the devlink_XXX->devlink pointer without any
> existence check.
> 
> If you are talking about the devlink instance itself, we guarantee that it
> exists between devlink_alloc() and devlink_free(). It seems to me a pretty
> reasonable request that drivers do not access devlink before devlink_alloc()
> or after devlink_free().
> 
> > I'm sure you'll utter your unprovable "in real drivers.." but the fact
> > is my approach does not suffer from any such issues, nor does it depend
> > on drivers registering devlink last.
> 
> Registration of devlink doesn't do anything except open it to the world.
> The lifetime is controlled by alloc and free. My beloved sentence "in
> real drivers ..." refers to the use of devlink_put and devlink locks outside
> of devlink.c and nothing more.

As soon as there is an inter-dependency between two subsystems, "must
be last" breaks down.

> > I can start passing a pointer to a devlink_port to split/unsplit
> > functions, which is a great improvement to the devlink driver API.  
> 
> You can do it with my approach too. We incremented the reference counter
> of the devlink instance when devlink_nl_cmd_port_split_doit() was called,
> and we can safely take the devlink->port_list_lock before returning
> from pre_doit.

Wait, I thought you'd hold devlink->lock around split/unsplit.

Please look at the port splitting case, mlx5 doesn't implement it
but it's an important feature.

Either way, IDK how a ref count on devlink helps with the lifetime of a
sub-object. You must assume the sub-objects can only be created outside
of the time the devlink instance is visible, or under devlink->lock?
Leon Romanovsky Nov. 23, 2021, 8:33 a.m. UTC | #8
On Mon, Nov 22, 2021 at 06:27:28PM -0800, Jakub Kicinski wrote:
> On Sun, 21 Nov 2021 10:45:12 +0200 Leon Romanovsky wrote:
> > On Fri, Nov 19, 2021 at 08:10:17AM -0800, Jakub Kicinski wrote:
> > > On Fri, 19 Nov 2021 17:38:53 +0200 Leon Romanovsky wrote:  
> > > > My approach works, exactly like it works in other subsystems.
> > > > https://lore.kernel.org/netdev/cover.1636390483.git.leonro@nvidia.com/  
> > > 
> > > What "other subsystems"? I'm aware of the RFC version of these patches.  
> > 
> > The approach of having a fine-grained locking scheme, instead of one big lock.
> > This was done in MM for mmap_sem, and we did it for RDMA too.
> 
> You're breaking things up to avoid lock ordering issues. The user can
> still only run a single write command at a time.
> 
> > > Breaking up the locks to protect sub-objects only is fine for
> > > protecting internal lists, but now you can't guarantee that the object
> > > exists when the driver is called.
> > 
> > I can only guess which objects you are talking about.
> 
> It obviously refers to the port splitting I mentioned below.
> 
> > If you are talking about the various devlink sub-objects (ports, traps,
> > etc.), they are created by the drivers and as such should be managed by them.
> > Also, they are connected to a devlink which is guaranteed to exist. In the
> > end, they are accessed through the devlink_XXX->devlink pointer without any
> > existence check.
> > 
> > If you are talking about the devlink instance itself, we guarantee that it
> > exists between devlink_alloc() and devlink_free(). It seems to me a pretty
> > reasonable request that drivers do not access devlink before devlink_alloc()
> > or after devlink_free().
> > 
> > > I'm sure you'll utter your unprovable "in real drivers.." but the fact
> > > is my approach does not suffer from any such issues, nor does it depend
> > > on drivers registering devlink last.
> > 
> > Registration of devlink doesn't do anything except open it to the world.
> > The lifetime is controlled by alloc and free. My beloved sentence "in
> > real drivers ..." refers to the use of devlink_put and devlink locks outside
> > of devlink.c and nothing more.
> 
> As soon as there is an inter-dependency between two subsystems, "must
> be last" breaks down.

"Must be last" is better to be changed "when the device is ready".

-----------------------------------------------------------------
> The real question is whether you now require devlink_register()
> to go last in general? 

No, it is not a requirement, but my suggestion. You need to be aware that
after the call to devlink_register(), the device will be fully open for devlink
netlink access. So it is strongly advised to make devlink_register() the
last command in the PCI initialization sequence.

https://lore.kernel.org/netdev/YXhVd16heaHCegL1@unreal/
--------------------------------------------------------------------

> 
> > > I can start passing a pointer to a devlink_port to split/unsplit
> > > functions, which is a great improvement to the devlink driver API.  
> > 
> > You can do it with my approach too. We incremented the reference counter
> > of the devlink instance when devlink_nl_cmd_port_split_doit() was called,
> > and we can safely take the devlink->port_list_lock before returning
> > from pre_doit.
> 
> Wait, I thought you'd hold devlink->lock around split/unsplit.

I'm holding.

    519 static int devlink_nl_pre_doit(const struct genl_ops *ops,
    520                                struct sk_buff *skb, struct genl_info *info)
    521 {
    ...
    529
    530         mutex_lock(&devlink->lock);


> 
> Please look at the port splitting case, mlx5 doesn't implement it
> but it's an important feature.

I will, but please don't forget that it was an RFC, just to show that
devlink can be changed internally without exposing internals.

> 
> Either way, IDK how a ref count on devlink helps with the lifetime of a
> sub-object. You must assume the sub-objects can only be created outside
> of the time the devlink instance is visible, or under devlink->lock?

The devlink lifetime is:
stages:        I                   II                   III   
 devlink_alloc -> devlink_register -> devlink_unregister -> devlink_free.

All sub-objects should be created between devlink_alloc() and devlink_free().
This ensures that the ->devlink pointer is always valid.

Stage I:
 * There is no need to hold any devlink locks or increase the reference counter.
   If the driver doesn't do anything crazy during its init, nothing in devlink
   land will run in parallel.
Stage II:
 * There is a need to hold devlink->lock and/or play with the reference counter
   and/or use fine-grained locks. Users can issue "devlink ..." commands.
Stage III:
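
As a sketch of those stages in a hypothetical PCI driver (driver names made
up; devlink calls as in net-next at the time of this thread):

	static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		struct devlink *devlink;

		/* Stage I: instance exists but is invisible to userspace. */
		devlink = devlink_alloc(&foo_devlink_ops,
					sizeof(struct foo_priv), &pdev->dev);
		if (!devlink)
			return -ENOMEM;
		pci_set_drvdata(pdev, devlink);

		foo_register_subobjects(devlink);	/* ports, resources, ... */

		/* Stage II starts here: open to "devlink ..." commands. */
		devlink_register(devlink);
		return 0;
	}

	static void foo_remove(struct pci_dev *pdev)
	{
		struct devlink *devlink = pci_get_drvdata(pdev);

		/* Stage III: invisible to userspace again. */
		devlink_unregister(devlink);
		foo_unregister_subobjects(devlink);
		devlink_free(devlink);	/* no devlink access past this point */
	}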

Thanks
Jakub Kicinski Nov. 23, 2021, 11:33 p.m. UTC | #9
On Tue, 23 Nov 2021 10:33:13 +0200 Leon Romanovsky wrote:
> > > You can do it with my approach too. We incremented the reference counter
> > > of the devlink instance when devlink_nl_cmd_port_split_doit() was called,
> > > and we can safely take the devlink->port_list_lock before returning
> > > from pre_doit.
> > 
> > Wait, I thought you'd hold devlink->lock around split/unsplit.  
> 
> I'm holding.
> 
>     519 static int devlink_nl_pre_doit(const struct genl_ops *ops,
>     520                                struct sk_buff *skb, struct genl_info *info)
>     521 {
>     ...
>     529
>     530         mutex_lock(&devlink->lock);

Then I'm confused why you said you need to hold a ref count on devlink.
Is it devlink_unregister() that's not taking devlink->lock?

> > Please look at the port splitting case, mlx5 doesn't implement it
> > but it's an important feature.  
> 
> I will, but please don't forget that it was an RFC, just to show that
> devlink can be changed internally without exposing internals.
> 
> > Either way, IDK how a ref count on devlink helps with the lifetime of a
> > sub-object. You must assume the sub-objects can only be created outside
> > of the time the devlink instance is visible, or under devlink->lock?
> 
> The devlink lifetime is:
> stages:        I                   II                   III   
>  devlink_alloc -> devlink_register -> devlink_unregister -> devlink_free.
> 
> All sub-objects should be created between devlink_alloc() and devlink_free().
> This ensures that the ->devlink pointer is always valid.
> 
> Stage I:
>  * There is no need to hold any devlink locks or increase the reference counter.
>    If the driver doesn't do anything crazy during its init, nothing in devlink
>    land will run in parallel.
> Stage II:
>  * There is a need to hold devlink->lock and/or play with the reference counter
>    and/or use fine-grained locks. Users can issue "devlink ..." commands.

So sub-objects can (dis)appear only in I/III or under devlink->lock.
Why did you add the per-sub-object list locks, then?
Leon Romanovsky Nov. 25, 2021, 9:02 a.m. UTC | #10
On Tue, Nov 23, 2021 at 03:33:12PM -0800, Jakub Kicinski wrote:
> On Tue, 23 Nov 2021 10:33:13 +0200 Leon Romanovsky wrote:
> > > > You can do it with my approach too. We incremented the reference counter
> > > > of the devlink instance when devlink_nl_cmd_port_split_doit() was called,
> > > > and we can safely take the devlink->port_list_lock before returning
> > > > from pre_doit.
> > > 
> > > Wait, I thought you'd hold devlink->lock around split/unsplit.  
> > 
> > I'm holding.
> > 
> >     519 static int devlink_nl_pre_doit(const struct genl_ops *ops,
> >     520                                struct sk_buff *skb, struct genl_info *info)
> >     521 {
> >     ...
> >     529
> >     530         mutex_lock(&devlink->lock);
> 
> Then I'm confused why you said you need to hold a ref count on devlink.

This was an example in response to your sentence "I can start passing a pointer
to a devlink_port to split/unsplit functions, which is a great improvement
to the devlink driver API."
https://lore.kernel.org/all/20211119081017.6676843b@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

In my view, it is complete over-engineering and not needed at all. In the
current driver model, you can pass a devlink_port pointer pretty safely
without worrying that "->devlink" will disappear.

> Is it devlink_unregister() that's not taking devlink->lock?

Maybe, but my rationale for devlink_get in my example was slightly different.
We need to use it when the ->devlink structure and the sub-object are
managed completely independently, with different lifetimes, and the
sub-object can outlive the devlink structure.

All devlink_*_register() calls require a valid devlink structure, so as I
wrote above, devlink_get is not really needed.

However, you used that example so many times that I started to fear that
I'm missing something very basic.

> 
> > > Please look at the port splitting case, mlx5 doesn't implement it
> > > but it's an important feature.  
> > 
> > I will, but please don't forget that it was an RFC, just to show that
> > devlink can be changed internally without exposing internals.
> > 
> > > Either way, IDK how a ref count on devlink helps with the lifetime of a
> > > sub-object. You must assume the sub-objects can only be created outside
> > > of the time the devlink instance is visible, or under devlink->lock?
> > 
> > The devlink lifetime is:
> > stages:        I                   II                   III   
> >  devlink_alloc -> devlink_register -> devlink_unregister -> devlink_free.
> > 
> > All sub-objects should be created between devlink_alloc() and devlink_free().
> > This ensures that the ->devlink pointer is always valid.
> > 
> > Stage I:
> >  * There is no need to hold any devlink locks or increase the reference counter.
> >    If the driver doesn't do anything crazy during its init, nothing in devlink
> >    land will run in parallel.
> > Stage II:
> >  * There is a need to hold devlink->lock and/or play with the reference counter
> >    and/or use fine-grained locks. Users can issue "devlink ..." commands.
> 
> So sub-objects can (dis)appear only in I/III or under devlink->lock.
> Why did you add the per-sub-object list locks, then?

There are a number of reasons, and not all of them are technical.

I wanted to do that; my initial plan was to cleanly separate the user-visible
API from the in-kernel API and use one lock, or no locks at all.

But at some point, I recalculated my path, when I saw that I was failing
to explain even the simple devlink lifetime model, together with the warm
feedback from the community and the need for this patch:

[RFC PATCH 14/16] devlink: Require devlink lock during device reload
https://lore.kernel.org/netdev/ad7f5f275bcda1ee058d7bd3020b7d85cd44b9f6.1636390483.git.leonro@nvidia.com/

That patch is super important in the devlink_reload puzzle: it closes the hack
used in the devlink_reload flow that allows the same devlink_*_register()
functions to be called without taking devlink->lock, so that they can take it.
To do that, I used list locks, because that was the only thing devlink->lock
was needed for in these calls.

However, there is a way to avoid the list locks. It can be achieved if we start
to manage the devlink state machine (at least for reload) internally and add
something like this to the devlink_*_register() calls:

 if (devlink->not_in_reload)
    mutex_lock(&devlink->lock);

It doesn't look nice, and invites the immediate question: "why don't we
provide two APIs, locked and unlocked? Locked for reload, and unlocked
for all other parts". Unfortunately, this would require major changes in
the drivers, and in an offline conversation I was told "do whatever you need
in devlink as long as it doesn't require changes in the driver; we want the
same driver flow for probe and reload".
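
Sketched, such a split could look like this (hypothetical prototypes; no
such API exists at this point in the discussion):

	/* caller already holds devlink->lock (e.g. the reload path) */
	void __devlink_resource_register(struct devlink *devlink, ...);

	/* takes devlink->lock internally (the normal probe path) */
	int devlink_resource_register(struct devlink *devlink, ...);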

Thanks

Patch

diff --git a/net/core/devlink.c b/net/core/devlink.c
index 356057ea2e52..1dda313d6d1b 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -9833,17 +9833,9 @@  int devlink_resource_register(struct devlink *devlink,
 {
 	struct devlink_resource *resource;
 	struct list_head *resource_list;
-	bool top_hierarchy;
 	int err = 0;
 
-	top_hierarchy = parent_resource_id == DEVLINK_RESOURCE_ID_PARENT_TOP;
-
-	mutex_lock(&devlink->lock);
-	resource = devlink_resource_find(devlink, NULL, resource_id);
-	if (resource) {
-		err = -EINVAL;
-		goto out;
-	}
+	WARN_ON(devlink_resource_find(devlink, NULL, resource_id));
 
 	resource = kzalloc(sizeof(*resource), GFP_KERNEL);
 	if (!resource) {
@@ -9851,7 +9843,17 @@  int devlink_resource_register(struct devlink *devlink,
 		goto out;
 	}
 
-	if (top_hierarchy) {
+	resource->name = resource_name;
+	resource->size = resource_size;
+	resource->size_new = resource_size;
+	resource->id = resource_id;
+	resource->size_valid = true;
+	memcpy(&resource->size_params, size_params,
+	       sizeof(resource->size_params));
+	INIT_LIST_HEAD(&resource->resource_list);
+
+	mutex_lock(&devlink->lock);
+	if (parent_resource_id == DEVLINK_RESOURCE_ID_PARENT_TOP) {
 		resource_list = &devlink->resource_list;
 	} else {
 		struct devlink_resource *parent_resource;
@@ -9868,14 +9870,6 @@  int devlink_resource_register(struct devlink *devlink,
 		}
 	}
 
-	resource->name = resource_name;
-	resource->size = resource_size;
-	resource->size_new = resource_size;
-	resource->id = resource_id;
-	resource->size_valid = true;
-	memcpy(&resource->size_params, size_params,
-	       sizeof(resource->size_params));
-	INIT_LIST_HEAD(&resource->resource_list);
 	list_add_tail(&resource->list, resource_list);
 out:
 	mutex_unlock(&devlink->lock);