
[21/33] genirq/msi: Provide msi_domain_alloc_irq_at()

Message ID 20221111135206.463650635@linutronix.de
State New
Headers show
Series genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation

Commit Message

Thomas Gleixner Nov. 11, 2022, 1:58 p.m. UTC
For supporting post MSI-X enable allocations and for the upcoming PCI/IMS
support a separate interface is required which allows not only the
allocation of a specific index, but also the allocation of any, i.e. the
next free index. The latter is especially required for IMS because IMS
completely does away with index to functionality mappings which are
often found in MSI/MSI-X implementations.

But even with MSI-X there are devices where only the first few indices have
a fixed functionality and the rest is freely assignable by software,
e.g. to queues.

msi_domain_alloc_irq_at() is also different from the range based interfaces
as it always enforces that the MSI descriptor is allocated by the core code
and not preallocated by the caller like the PCI/MSI[-X] enable code path
does.

msi_domain_alloc_irq_at() can be invoked with the index argument set to
MSI_ANY_INDEX which makes the core code pick the next free index. The irq
domain can provide a prepare_desc() operation callback in its
msi_domain_ops to do domain specific post allocation initialization before
the actual Linux interrupt and the associated interrupt descriptor and
hierarchy allocations are conducted.

The function also takes an optional @cookie argument which is of type union
msi_dev_cookie. This cookie is not used by the core code and is stored in
the allocated msi_desc::data::cookie. The meaning of the cookie is
completely implementation defined. In case of IMS this might be a PASID or
a pointer to a device queue, but for the MSI core it's opaque and not used
in any way.

The function returns a struct msi_map which on success contains the
allocated index number and the Linux interrupt number, so the caller is
spared a separate index to Linux interrupt number lookup.

On failure map::index contains the error code and map::virq is 0.
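
For illustration, a possible caller-side sequence (the domain id, handler
and queue pointer are placeholders, not part of this patch):

      union msi_dev_cookie cookie = { .ptr = my_queue };
      struct msi_map map;

      map = msi_domain_alloc_irq_at(&pdev->dev, domid, MSI_ANY_INDEX,
                                    NULL, &cookie);
      if (map.index < 0)
              return map.index;

      /* map.index is the domain relative index, map.virq the Linux irq number */
      ret = request_irq(map.virq, my_handler, 0, "my-ims", my_queue);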

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h     |    4 +
 include/linux/msi_api.h |    7 +++
 kernel/irq/msi.c        |  105 ++++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 105 insertions(+), 11 deletions(-)

Comments

Jason Gunthorpe Nov. 16, 2022, 7:36 p.m. UTC | #1
On Fri, Nov 11, 2022 at 02:58:44PM +0100, Thomas Gleixner wrote:
> For supporting post MSI-X enable allocations and for the upcoming PCI/IMS
> support a separate interface is required which allows not only the
> allocation of a specific index, but also the allocation of any, i.e. the
> next free index. The latter is especially required for IMS because IMS
> completely does away with index to functionality mappings which are
> often found in MSI/MSI-X implementations.
> 
> But even with MSI-X there are devices where only the first few indices have
> a fixed functionality and the rest is freely assignable by software,
> e.g. to queues.
> 
> msi_domain_alloc_irq_at() is also different from the range based interfaces
> as it always enforces that the MSI descriptor is allocated by the core code
> and not preallocated by the caller like the PCI/MSI[-X] enable code path
> does.
> 
> msi_domain_alloc_irq_at() can be invoked with the index argument set to
> MSI_ANY_INDEX which makes the core code pick the next free index. The irq
> domain can provide a prepare_desc() operation callback in its
> msi_domain_ops to do domain specific post allocation initialization before
> the actual Linux interrupt and the associated interrupt descriptor and
> hierarchy allocations are conducted.
> 
> The function also takes an optional @cookie argument which is of type union
> msi_dev_cookie. This cookie is not used by the core code and is stored in
> the allocated msi_desc::data::cookie. The meaning of the cookie is
> completely implementation defined. In case of IMS this might be a PASID or
> a pointer to a device queue, but for the MSI core it's opaque and not used
> in any way.

To my mind it makes more sense to pass a 'void *' through from
msi_domain_alloc_irq_at() to the prepare_desc() op with the idea that
the driver calling msi_domain_alloc_irq_at() knows it is calling it
against the domain that it allocated. The prepare_desc can then use
the void * to properly initialize anything about the desc under the
right lock.

Before calling this the driver should have set up whatever thing is
going to originate the interrupt, eg allocated the HW object that
sources the interrupt and part of what the void * would convey is the
detailed information on how to program the HW object. eg IDXD is using
an iobase and an offset along with the enforcing PASID, but something
like mlx5 would probably want an object id, type, and SF ID.

This is again where I don't much like the use of an ID to refer to the
domain.

Having the driver allocate the device domain, retain a pointer to it,
and use that domain pointer with all these new APIs seems much clearer
than converting the pointer to an ID.

Jason
Thomas Gleixner Nov. 17, 2022, 9:40 a.m. UTC | #2
On Wed, Nov 16 2022 at 15:36, Jason Gunthorpe wrote:
> On Fri, Nov 11, 2022 at 02:58:44PM +0100, Thomas Gleixner wrote:
>> The function also takes an optional @cookie argument which is of type union
>> msi_dev_cookie. This cookie is not used by the core code and is stored in
>> the allocated msi_desc::data::cookie. The meaning of the cookie is
>> completely implementation defined. In case of IMS this might be a PASID or
>> a pointer to a device queue, but for the MSI core it's opaque and not used
>> in any way.
>
> To my mind it makes more sense to pass a 'void *' through from
> msi_domain_alloc_irq_at() to the prepare_desc() op with the idea that
> the driver calling msi_domain_alloc_irq_at() knows it is calling it
> against the domain that it allocated. The prepare_desc can then use
> the void * to properly initialize anything about the desc under the
> right lock.

You are looking at it from one particular use case. 

> Before calling this the driver should have set up whatever thing is
> going to originate the interrupt, eg allocated the HW object that
> sources the interrupt and part of what the void * would convey is the
> detailed information on how to program the HW object. eg IDXD is using
> an iobase and an offset along with the enforcing PASID, but something
> like mlx5 would probably want an object id, type, and SF ID.

Correct, and that's why the cookie is there. You can stash your pointer
into the cookie and an IDXD user stores the PASID. The IDXD user which
allocates an interrupt does not even know about iobase and offset. Nor
does it care what the IDXD irq domain implementation does with that
cookie.

Neither should your queue code care. The queue driver code puts a
pointer to struct mlx5_voodoo into the cookie when allocating the
interrupt and then the mlx5 irqdomain code, which is a completely separate
entity, gets this cookie handed into prepare_desc().

struct mlx5_voodoo contains all information for the irq domain code to
set up the necessary things in the queue. That must obviously be a
contract between the queue code and the irqdomain code, but that's no
different from MSI or MSI-X. The only difference is that in the IMS
case the contract is per device and not codified in a standard.
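
A rough sketch of the domain side of such a contract, with made up mlx5
names (mlx5_ims_data, mlx5_ims_bind_vector, mlx5_voodoo) and assuming the
prepare_desc() signature used elsewhere in this series:

      static void mlx5_ims_prepare_desc(struct irq_domain *domain,
                                        msi_alloc_info_t *arg,
                                        struct msi_desc *desc)
      {
              struct msi_domain_info *info = domain->host_data;
              struct mlx5_ims_data *ims = info->data;
              struct mlx5_voodoo *voo = desc->data.cookie.ptr;

              /*
               * Domain private part of the contract: use the cookie to
               * bind the freshly allocated descriptor to the device
               * object which the queue code set up before allocating
               * the interrupt.
               */
              mlx5_ims_bind_vector(ims, voo, desc->msi_index);
      }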

> This is again where I don't much like the use of an ID to refer to the
> domain.
>
> Having the driver allocate the device domain, retain a pointer to it,
> and use that domain pointer with all these new APIs seems much clearer
> than converting the pointer to an ID.

You're really obsessed with this irqdomain pointer, right?

You have to differentiate between the irq domain implementation and the
actual usage sites and not conflate them into one thing.

Let's look at the usage site:

      union msi_dev_cookie cookie = { .ptr = mymagicqueue, };

      pci_ims_alloc_irq(pci_dev, &cookie);

versus:

      union msi_dev_cookie cookie = { .ptr = mymagicqueue, };

      ims_alloc_irq(&pci_dev->dev, mydev->ims_domain, &cookie);

Even in the unlikely case that we have more than two domains, the usage
site still has zero interest in the domain pointer:

      union msi_dev_cookie cookie = { .ptr = mymagicqueue, };

      pci_ims_alloc_irq(pci_dev, myqueue->domid, &cookie);

where the code which instantiates myqueue sets up domid.

The usage site has absolutely no business touching the irqdomain pointer
or even knowing that one exists. All it needs to know is how the cookie
contract works, obviously.

Now the functions you need in your irqdomain implementation to
e.g. prepare the MSI descriptor surely need to know about the irqdomain
pointer, but that gets handed in from the allocation code so the prepare
function knows which instance it is operating on.

So what does the irqdomain pointer buy you? Exactly nothing!

Look at the IDXD reference implementation.

     The IDXD probe code which initializes the physical device
     instantiates the irq domain along with the iobase for the
     storage array.

     The actual queue (or whatever IDXD names it) setup code just sticks
     PASID into the cookie and allocates an interrupt. It gets a virtual
     irq number and requests the interrupt.

Where is the need for a pointer? The queue code does not even know about
the iobase of the storage array. It's completely irrelevant there. All
it has to know is the cookie contract, nothing more.

Let's take your pointer obsession to the extreme:

      struct irq_desc *desc = pci_alloc_msix_interrupt(pci_dev);

      request_irq(desc, handler, pci_dev);

versus:

      int virq = pci_alloc_msix_interrupt(pci_dev);

      request_irq(virq, handler, pci_dev);

You could argue the same way that there is no need for a Linux interrupt
number and we could just use the interrupt descriptor pointer.

Sure, you can do that, but then you violate _all_ encapsulation rules in
one go for absolutely _ZERO_ value.

Want another example based on kmalloc()?

Almost 20 years ago I did a treewide mopup of drivers which decided that
they needed to fiddle with the irq descriptor for all the wrong reasons.
I had to do that to be able to do a trivial change in the core code...

C is patently bad for encapsulation, but you can make it worse by
forcefully ignoring the design patterns which allow completely hiding
implementation details of a subsystem or infrastructure.

If you look at the last commit in the ARM part of this work:

  https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/commit/?h=devmsi-arm&id=96c97746cbb431a306e95c04d6b3c75751244716

then you can see the final move to remove the visibility of
the MSI management internals.

This makes it possible to completely overhaul the inner workings of the
MSI core without having to chase abuse all over the place.

Thanks,

        tglx
Reinette Chatre Nov. 17, 2022, 11:33 p.m. UTC | #3
Hi Thomas,

I am trying all three parts of this work out with some experimental
code within the IDXD driver that attempts to use IMS on the host.

In the test, pci_ims_alloc_irq() always encounters -EBUSY and it
seems that there is an attempt to insert the struct msi_desc into
the xarray twice, the second attempt encountering the -EBUSY.

While trying to understand what is going on I found myself looking
at this code area and I'll annotate this patch with what I learned.

On 11/11/2022 5:58 AM, Thomas Gleixner wrote:

...

> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -39,6 +39,7 @@ static inline int msi_sysfs_create_group
>  /* Invalid XA index which is outside of any searchable range */
>  #define MSI_XA_MAX_INDEX	(ULONG_MAX - 1)
>  #define MSI_XA_DOMAIN_SIZE	(MSI_MAX_INDEX + 1)
> +#define MSI_ANY_INDEX		UINT_MAX
>  
>  static inline void msi_setup_default_irqdomain(struct device *dev, struct msi_device_data *md)
>  {
> @@ -126,18 +127,34 @@ static int msi_insert_desc(struct device

When calling pci_ims_alloc_irq(), msi_insert_desc() ends up being
called twice, first with index = MSI_ANY_INDEX, second with index = 0.
(domid = 1 both times)

>  	}
>  
>  	hwsize = msi_domain_get_hwsize(dev, domid);
> -	if (index >= hwsize) {
> -		ret = -ERANGE;
> -		goto fail;
> -	}
>  
> -	desc->msi_index = index;
> -	index += baseidx;
> -	ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
> -	if (ret)
> -		goto fail;
> -	return 0;
> +	if (index == MSI_ANY_INDEX) {
> +		struct xa_limit limit;
> +		unsigned int index;
> +
> +		limit.min = baseidx;
> +		limit.max = baseidx + hwsize - 1;
>  
> +		/* Let the xarray allocate a free index within the limits */
> +		ret = xa_alloc(&md->__store, &index, desc, limit, GFP_KERNEL);
> +		if (ret)
> +			goto fail;
> +

This path (index == MSI_ANY_INDEX) is followed when msi_insert_desc()
is called the first time and the xa_alloc() succeeds at index 65536.

> +		desc->msi_index = index;

This is problematic with desc->msi_index being a u16, assigning
65536 to it becomes 0.

> +		return 0;
> +	} else {
> +		if (index >= hwsize) {
> +			ret = -ERANGE;
> +			goto fail;
> +		}
> +
> +		desc->msi_index = index;
> +		index += baseidx;
> +		ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
> +		if (ret)
> +			goto fail;

This "else" path is followed when msi_insert_desc() is called the second
time with "index = 0". The xa_insert() above fails at index 65536
(baseidx = 65536) with -EBUSY, trickling up as the return code to
pci_ims_alloc_irq().

> +		return 0;
> +	}
>  fail:
>  	msi_free_desc(desc);
>  	return ret;
> @@ -335,7 +352,7 @@ int msi_setup_device_data(struct device
>  
>  	msi_setup_default_irqdomain(dev, md);
>  
> -	xa_init(&md->__store);
> +	xa_init_flags(&md->__store, XA_FLAGS_ALLOC);
>  	mutex_init(&md->mutex);
>  	md->__iter_idx = MSI_XA_MAX_INDEX;
>  	dev->msi.data = md;
> @@ -1423,6 +1440,72 @@ int msi_domain_alloc_irqs_all_locked(str
>  	return msi_domain_alloc_locked(dev, &ctrl);
>  }
>  
> +/**
> + * msi_domain_alloc_irq_at - Allocate an interrupt from a MSI interrupt domain at
> + *			     a given index - or at the next free index
> + *
> + * @dev:	Pointer to device struct of the device for which the interrupts
> + *		are allocated
> + * @domid:	Id of the interrupt domain to operate on
> + * @index:	Index for allocation. If @index == %MSI_ANY_INDEX the allocation
> + *		uses the next free index.
> + * @affdesc:	Optional pointer to an interrupt affinity descriptor structure
> + * @cookie:	Optional pointer to a descriptor specific cookie to be stored
> + *		in msi_desc::data. Must be NULL for MSI-X allocations
> + *
> + * This requires a MSI interrupt domain which lets the core code manage the
> + * MSI descriptors.
> + *
> + * Return: struct msi_map
> + *
> + *	On success msi_map::index contains the allocated index number and
> + *	msi_map::virq the corresponding Linux interrupt number
> + *
> + *	On failure msi_map::index contains the error code and msi_map::virq
> + *	is %0.
> + */
> +struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, unsigned int index,
> +				       const struct irq_affinity_desc *affdesc,
> +				       union msi_dev_cookie *cookie)
> +{
> +	struct irq_domain *domain;
> +	struct msi_map map = { };
> +	struct msi_desc *desc;
> +	int ret;
> +
> +	msi_lock_descs(dev);
> +	domain = msi_get_device_domain(dev, domid);
> +	if (!domain) {
> +		map.index = -ENODEV;
> +		goto unlock;
> +	}
> +
> +	desc = msi_alloc_desc(dev, 1, affdesc);
> +	if (!desc) {
> +		map.index = -ENOMEM;
> +		goto unlock;
> +	}
> +
> +	if (cookie)
> +		desc->data.cookie = *cookie;
> +
> +	ret = msi_insert_desc(dev, desc, domid, index);
> +	if (ret) {
> +		map.index = ret;
> +		goto unlock;
> +	}

Above is the first call to msi_insert_desc(/* index = MSI_ANY_INDEX */)

> +
> +	map.index = desc->msi_index;

msi_insert_desc() did attempt to set desc->msi_index to 65536 but map.index ends
up being 0.

> +	ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index);

Here is where the second call to msi_insert_desc() originates:

msi_domain_alloc_irqs_range_locked() -> msi_domain_alloc_locked() -> \
__msi_domain_alloc_locked() -> msi_domain_alloc_simple_msi_descs() -> \
msi_domain_add_simple_msi_descs() -> msi_insert_desc()
		

> +	if (ret)
> +		map.index = ret;
> +	else
> +		map.virq = desc->irq;
> +unlock:
> +	msi_unlock_descs(dev);
> +	return map;
> +}
> +
>  static void __msi_domain_free_irqs(struct device *dev, struct irq_domain *domain,
>  				   struct msi_ctrl *ctrl)
>  {
> 

Reinette
Thomas Gleixner Nov. 18, 2022, 12:58 a.m. UTC | #4
On Thu, Nov 17 2022 at 15:33, Reinette Chatre wrote:
> I am trying all three parts of this work out with some experimental
> code within the IDXD driver that attempts to use IMS on the host.
>
> In the test, pci_ims_alloc_irq() always encounters -EBUSY and it
> seems that there is an attempt to insert the struct msi_desc into
> the xarray twice, the second attempt encountering the -EBUSY.
>
> While trying to understand what is going on I found myself looking
> at this code area and I'll annotate this patch with what I learned.

Ok.

> When calling pci_ims_alloc_irq(), msi_insert_desc() ends up being
> called twice, first with index = MSI_ANY_INDEX, second with index = 0.
> (domid = 1 both times)

How so?

>>  	}
>>  
>>  	hwsize = msi_domain_get_hwsize(dev, domid);
>> -	if (index >= hwsize) {
>> -		ret = -ERANGE;
>> -		goto fail;
>> -	}
>>  
>> -	desc->msi_index = index;
>> -	index += baseidx;
>> -	ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
>> -	if (ret)
>> -		goto fail;
>> -	return 0;
>> +	if (index == MSI_ANY_INDEX) {
>> +		struct xa_limit limit;
>> +		unsigned int index;
>> +
>> +		limit.min = baseidx;
>> +		limit.max = baseidx + hwsize - 1;
>>  
>> +		/* Let the xarray allocate a free index within the limits */
>> +		ret = xa_alloc(&md->__store, &index, desc, limit, GFP_KERNEL);
>> +		if (ret)
>> +			goto fail;
>> +
>
> This path (index == MSI_ANY_INDEX) is followed when msi_insert_desc()
> is called the first time and the xa_alloc() succeeds at index 65536.
>
>> +		desc->msi_index = index;
>
> This is problematic with desc->msi_index being a u16, assigning
> 65536 to it becomes 0.

You are partially right. I need to fix that and make it explicit as it's
a "works by chance or maybe not" construct right now.

But desc->msi_index is correct to be truncated because it's the index
within the domain space which is zero based.
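
(With the numbers from your trace: baseidx for domid 1 is 65536, xa_alloc()
picks 65536 as the first free slot, and the intended domain relative index
is 65536 - 65536 = 0, so the u16 truncation happens to produce the right
value here. That's the "works by chance" part.)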

>> +		return 0;
>> +	} else {
>> +		if (index >= hwsize) {
>> +			ret = -ERANGE;
>> +			goto fail;
>> +		}
>> +
>> +		desc->msi_index = index;
>> +		index += baseidx;
>> +		ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
>> +		if (ret)
>> +			goto fail;
>
> This "else" path is followed when msi_insert_desc() is called the second
> time with "index = 0". The xa_insert() above fails at index 65536
> (baseidx = 65536) with -EBUSY, trickling up as the return code to
> pci_ims_alloc_irq().

Why is it called with index=0 the second time?
>> +	desc = msi_alloc_desc(dev, 1, affdesc);
>> +	if (!desc) {
>> +		map.index = -ENOMEM;
>> +		goto unlock;
>> +	}
>> +
>> +	if (cookie)
>> +		desc->data.cookie = *cookie;
>> +
>> +	ret = msi_insert_desc(dev, desc, domid, index);
>> +	if (ret) {
>> +		map.index = ret;
>> +		goto unlock;
>> +	}
>
> Above is the first call to msi_insert_desc(/* index = MSI_ANY_INDEX */)
>
>> +
>> +	map.index = desc->msi_index;
>
> msi_insert_desc() did attempt to set desc->msi_index to 65536 but map.index ends
> up being 0.

which is kinda correct.

>> +	ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index);
>
> Here is where the second call to msi_insert_desc() originates:
>
> msi_domain_alloc_irqs_range_locked() -> msi_domain_alloc_locked() -> \
> __msi_domain_alloc_locked() -> msi_domain_alloc_simple_msi_descs() -> \
> msi_domain_add_simple_msi_descs() -> msi_insert_desc()

but yes, that's bogus because it tries to allocate what is allocated already.

Too tired to decode this circular dependency right now. Will stare at it
with brain awake in the morning. Duh!

Thanks,

        tglx
Thomas Gleixner Nov. 18, 2022, 9:15 a.m. UTC | #5
On Fri, Nov 18 2022 at 01:58, Thomas Gleixner wrote:
> On Thu, Nov 17 2022 at 15:33, Reinette Chatre wrote:
>> When calling pci_ims_alloc_irq(), msi_insert_desc() ends up being
>> called twice, first with index = MSI_ANY_INDEX, second with index = 0.
>> (domid = 1 both times)
>
> How so?
>
>>>  	}
>>>  
>>>  	hwsize = msi_domain_get_hwsize(dev, domid);
>>> -	if (index >= hwsize) {
>>> -		ret = -ERANGE;
>>> -		goto fail;
>>> -	}
>>>  
>>> -	desc->msi_index = index;
>>> -	index += baseidx;
>>> -	ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
>>> -	if (ret)
>>> -		goto fail;
>>> -	return 0;
>>> +	if (index == MSI_ANY_INDEX) {
>>> +		struct xa_limit limit;
>>> +		unsigned int index;
>>> +
>>> +		limit.min = baseidx;
>>> +		limit.max = baseidx + hwsize - 1;
>>>  
>>> +		/* Let the xarray allocate a free index within the limits */
>>> +		ret = xa_alloc(&md->__store, &index, desc, limit, GFP_KERNEL);
>>> +		if (ret)
>>> +			goto fail;
>>> +
>>
>> This path (index == MSI_ANY_INDEX) is followed when msi_insert_desc()
>> is called the first time and the xa_alloc() succeeds at index 65536.
>>
>>> +		desc->msi_index = index;
>>
>> This is problematic with desc->msi_index being a u16, assigning
>> 65536 to it becomes 0.
>
> You are partially right. I need to fix that and make it explicit as it's
> a "works by chance or maybe not" construct right now.
>
> But desc->msi_index is correct to be truncated because it's the index
> within the domain space which is zero based.

It should obviously do:

   desc->msi_index = index - baseidx;

>>> +		return 0;
>>> +	} else {
>>> +		if (index >= hwsize) {
>>> +			ret = -ERANGE;
>>> +			goto fail;
>>> +		}
>>> +
>>> +		desc->msi_index = index;
>>> +		index += baseidx;
>>> +		ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
>>> +		if (ret)
>>> +			goto fail;
>>
>> This "else" path is followed when msi_insert_desc() is called the second
>> time with "index = 0". The xa_insert() above fails at index 65536
>> (baseidx = 65536) with -EBUSY, trickling up as the return code to
>> pci_ims_alloc_irq().
>
> Why is it called with index=0 the second time?
>>> +	desc = msi_alloc_desc(dev, 1, affdesc);
>>> +	if (!desc) {
>>> +		map.index = -ENOMEM;
>>> +		goto unlock;
>>> +	}
>>> +
>>> +	if (cookie)
>>> +		desc->data.cookie = *cookie;
>>> +
>>> +	ret = msi_insert_desc(dev, desc, domid, index);
>>> +	if (ret) {
>>> +		map.index = ret;
>>> +		goto unlock;
>>> +	}
>>
>> Above is the first call to msi_insert_desc(/* index = MSI_ANY_INDEX */)
>>
>>> +
>>> +	map.index = desc->msi_index;
>>
>> msi_insert_desc() did attempt to set desc->msi_index to 65536 but map.index ends
>> up being 0.
>
> which is kinda correct.
>
>>> +	ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index);
>>
>> Here is where the second call to msi_insert_desc() originates:
>>
>> msi_domain_alloc_irqs_range_locked() -> msi_domain_alloc_locked() -> \
>> __msi_domain_alloc_locked() -> msi_domain_alloc_simple_msi_descs() -> \
>> msi_domain_add_simple_msi_descs() -> msi_insert_desc()
>
> but yes, that's bogus because it tries to allocate what is allocated already.
>
> Too tired to decode this circular dependency right now. Will stare at it
> with brain awake in the morning. Duh!

Duh. I'm a moron.

Of course I "tested" this by flipping default and secondary domain
around and doing dynamic allocations from PCI/MSI-X but that won't catch
the bug because PCI/MSI-X does not have the ALLOC_SIMPLE_DESCS flag set.

Let me fix that.

Thanks,

        tglx
Thomas Gleixner Nov. 18, 2022, 11:05 a.m. UTC | #6
On Fri, Nov 18 2022 at 10:15, Thomas Gleixner wrote:
> On Fri, Nov 18 2022 at 01:58, Thomas Gleixner wrote:
> Of course I "tested" this by flipping default and secondary domain
> around and doing dynamic allocations from PCI/MSI-X but that won't catch
> the bug because PCI/MSI-X does not have the ALLOC_SIMPLE_DESCS flag set.
>
> Let me fix that.

Delta patch against

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git devmsi-v1G-part3

below.

Thanks,

        tglx
---
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index d4f26649a185..d243ad3e5489 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc,
 		if (ret)
 			goto fail;
 
-		desc->msi_index = index;
+		desc->msi_index = index - baseidx;
 		return 0;
 	} else {
 		if (index >= hwsize) {
@@ -1476,9 +1476,10 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u
 				       const struct irq_affinity_desc *affdesc,
 				       union msi_dev_cookie *cookie)
 {
+	struct msi_ctrl ctrl = { .domid	= domid, .nirqs = 1, };
+	struct msi_domain_info *info;
 	struct irq_domain *domain;
 	struct msi_map map = { };
-	struct msi_desc *desc;
 	int ret;
 
 	msi_lock_descs(dev);
@@ -1503,12 +1504,16 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u
 		goto unlock;
 	}
 
-	map.index = desc->msi_index;
-	ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index);
-	if (ret)
+	ctrl.first = ctrl.last = desc->msi_index;
+	info = domain->host_data;
+
+	ret = __msi_domain_alloc_irqs(dev, domain, &ctrl);
+	if (ret) {
 		map.index = ret;
-	else
+		msi_domain_free_locked(dev, &ctrl);
+	} else {
 		map.virq = desc->irq;
+	}
 unlock:
 	msi_unlock_descs(dev);
 	return map;
Reinette Chatre Nov. 18, 2022, 6:18 p.m. UTC | #7
Hi Thomas,

On 11/18/2022 3:05 AM, Thomas Gleixner wrote:
> On Fri, Nov 18 2022 at 10:15, Thomas Gleixner wrote:
>> On Fri, Nov 18 2022 at 01:58, Thomas Gleixner wrote:

...

> diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
> index d4f26649a185..d243ad3e5489 100644
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc,
>  		if (ret)
>  			goto fail;
>  
> -		desc->msi_index = index;
> +		desc->msi_index = index - baseidx;

Could msi_desc->msi_index be made bigger? The hardware I am testing
on claims to support more IMS entries than what the u16 can
accommodate.

>  		return 0;
>  	} else {
>  		if (index >= hwsize) {
> @@ -1476,9 +1476,10 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u
>  				       const struct irq_affinity_desc *affdesc,
>  				       union msi_dev_cookie *cookie)
>  {
> +	struct msi_ctrl ctrl = { .domid	= domid, .nirqs = 1, };
> +	struct msi_domain_info *info;
>  	struct irq_domain *domain;
>  	struct msi_map map = { };
> -	struct msi_desc *desc;

(*desc is still needed)

>  	int ret;
>  
>  	msi_lock_descs(dev);
> @@ -1503,12 +1504,16 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u
>  		goto unlock;
>  	}
>  
> -	map.index = desc->msi_index;
> -	ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index);
> -	if (ret)
> +	ctrl.first = ctrl.last = desc->msi_index;
> +	info = domain->host_data;
> +
> +	ret = __msi_domain_alloc_irqs(dev, domain, &ctrl);
> +	if (ret) {
>  		map.index = ret;
> -	else
> +		msi_domain_free_locked(dev, &ctrl);
> +	} else {
>  		map.virq = desc->irq;
> +	}
>  unlock:
>  	msi_unlock_descs(dev);
>  	return map;

Thank you very much. With the above snippet it is possible to
allocate an IMS IRQ. I am not yet able to use the IRQ and I am working
on more tracing to figure out why. In the meantime, I did
just try the pci_ims_alloc_irq()/pci_ims_free_irq() flow and
pci_ims_free_irq() triggered the WARN below:

remove_proc_entry: removing non-empty directory 'irq/220', leaking at least 'idxd-portal'
WARNING: CPU: XX PID: 4322 at fs/proc/generic.c:718 remove_proc_entry+0x184/0x190

[SNIP]

RIP: 0010:remove_proc_entry+0x184/0x190
Code: a5 af 48 8d 90 68 ff ff ff 48 85 c0 48 0f 45 c2 48 8b 95 88 00 00 00 4c 8b 80 b0 00 00 00 48 8b 92 b0 00 00 00 e8 2d 67 c6 00 <0f> 0b e9 4d ff ff ff e8 a0 c1 ce 00 0f 1f 44 00 00 41 57 41 56 41
RSP: 0018:ff223b51cf947c80 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ff1b39f680241300 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffafa37b97 RDI: 00000000ffffffff
RBP: ff1b3a06666b8000 R08: ff1b3a15bdbfffe8 R09: 0000000000000003
R10: ff1b3a15bce00000 R11: ff1b3a15bd900000 R12: ff1b3a06666b8090
R13: 00000000000000dd R14: ff1b3a069237fb80 R15: 0000000000000001
FS:  00007fedd2dff000(0000) GS:ff1b3a15be940000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff8e5bfc01c CR3: 000000110a138006 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
unregister_irq_proc+0xe3/0x110
free_desc+0x29/0x70
irq_free_descs+0x4b/0x80
msi_domain_free_locked.part.0+0x19b/0x1d0
msi_domain_free_irqs_range+0x67/0xb0
idxd_wq_free_irq+0x89/0x150 [idxd]
drv_disable_wq+0x5f/0x90 [idxd]
idxd_dmaengine_drv_remove+0xa3/0xc0 [idxd]
device_release_driver_internal+0x1aa/0x230
driver_detach+0x44/0x90
bus_remove_driver+0x58/0xe0
idxd_exit_module+0x18/0x3a [idxd]
__do_sys_delete_module.constprop.0+0x186/0x280
? fpregs_assert_state_consistent+0x22/0x50
? exit_to_user_mode_prepare+0x40/0x150
do_syscall_64+0x40/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fedd2526c9b
Code: 73 01 c3 48 8b 0d 95 21 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 65 21 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffca85a47d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 000055af0e1a3700 RCX: 00007fedd2526c9b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055af0e1a3768
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 00007fedd25beac0 R11: 0000000000000206 R12: 00007ffca85a4a30
R13: 000055af0e1a32a0 R14: 00007ffca85a58e5 R15: 000055af0e1a3700
</TASK>

Reinette
Thomas Gleixner Nov. 18, 2022, 10:31 p.m. UTC | #8
On Fri, Nov 18 2022 at 10:18, Reinette Chatre wrote:
>> @@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc,
>>  		if (ret)
>>  			goto fail;
>>  
>> -		desc->msi_index = index;
>> +		desc->msi_index = index - baseidx;
>
> Could msi_desc->msi_index be made bigger? The hardware I am testing
> on claims to support more IMS entries than what the u16 can
> accommodate.

Sure that's trivial. How big does it claim it is?

>> @@ -1476,9 +1476,10 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u
>>  				       const struct irq_affinity_desc *affdesc,
>>  				       union msi_dev_cookie *cookie)
>>  {
>> +	struct msi_ctrl ctrl = { .domid	= domid, .nirqs = 1, };
>> +	struct msi_domain_info *info;
>>  	struct irq_domain *domain;
>>  	struct msi_map map = { };
>> -	struct msi_desc *desc;
>
> (*desc is still needed)

Yes, I figured that out later :)

> Thank you very much. With the above snippet it is possible to
> allocate an IMS IRQ. I am not yet able to use the IRQ and I am working
> on more tracing to figure out why. In the mean time, I did
> just try the pci_ims_alloc_irq()/pci_ims_free_irq() flow and
> pci_ims_free_irq() triggered the WARN below:
>
> remove_proc_entry: removing non-empty directory 'irq/220', leaking at least 'idxd-portal'

Hrm, that's the irq action directory. No idea why that is not torn down.

I assume your sequence is:

  pci_ims_alloc();
  request_irq();        <- This creates it
  free_irq();           <- This removes it
  pci_ims_free();

Right?

Thanks,

        tglx
Reinette Chatre Nov. 18, 2022, 10:59 p.m. UTC | #9
Hi Thomas,

On 11/18/2022 2:31 PM, Thomas Gleixner wrote:
> On Fri, Nov 18 2022 at 10:18, Reinette Chatre wrote:
>>> @@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc,
>>>  		if (ret)
>>>  			goto fail;
>>>  
>>> -		desc->msi_index = index;
>>> +		desc->msi_index = index - baseidx;
>>
>> Could msi_desc->msi_index be made bigger? The hardware I am testing
>> on claims to support more IMS entries than what the u16 can
>> accommodate.
> 
> Sure that's trivial. How big does it claim it is?

2048

> I assume your sequence is:
> 
>   pci_ims_alloc();
>   request_irq();        <- This creates it
>   free_irq();           <- This removes it
>   pci_ims_free();
> 
> Right?

No. My mistake. Sorry for the noise.

Reinette
Reinette Chatre Nov. 19, 2022, 12:19 a.m. UTC | #10
Hi Thomas,

On 11/18/2022 2:59 PM, Reinette Chatre wrote:
> On 11/18/2022 2:31 PM, Thomas Gleixner wrote:
>> On Fri, Nov 18 2022 at 10:18, Reinette Chatre wrote:
>>>> @@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc,
>>>>  		if (ret)
>>>>  			goto fail;
>>>>  
>>>> -		desc->msi_index = index;
>>>> +		desc->msi_index = index - baseidx;
>>>
>>> Could msi_desc->msi_index be made bigger? The hardware I am testing
>>> on claims to support more IMS entries than what the u16 can
>>> accommodate.
>>
>> Sure that's trivial. How big does it claim it is?
> 
> 2048

Dave Jiang corrected me ... the max the hardware can support
is 16128 so the current size of msi_index is sufficient.

Reinette

Patch

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -80,6 +80,7 @@  struct pci_dev;
 struct platform_msi_priv_data;
 struct device_attribute;
 struct irq_domain;
+struct irq_affinity_desc;
 
 void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
 void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg);
@@ -567,6 +568,9 @@  int msi_domain_alloc_irqs_range(struct d
 				unsigned int first, unsigned int last);
 int msi_domain_alloc_irqs_all_locked(struct device *dev, unsigned int domid, int nirqs);
 
+struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, unsigned int index,
+				       const struct irq_affinity_desc *affdesc,
+				       union msi_dev_cookie *cookie);
 
 void msi_domain_free_irqs_range_locked(struct device *dev, unsigned int domid,
 				       unsigned int first, unsigned int last);
--- a/include/linux/msi_api.h
+++ b/include/linux/msi_api.h
@@ -48,6 +48,13 @@  struct msi_map {
 	int	virq;
 };
 
+/*
+ * Constant to be used for dynamic allocations when the allocation
+ * is any free MSI index (entry in the MSI-X table or a software
+ * managed index).
+ */
+#define MSI_ANY_INDEX		UINT_MAX
+
 unsigned int msi_domain_get_virq(struct device *dev, unsigned int domid, unsigned int index);
 
 /**
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -39,6 +39,7 @@  static inline int msi_sysfs_create_group
 /* Invalid XA index which is outside of any searchable range */
 #define MSI_XA_MAX_INDEX	(ULONG_MAX - 1)
 #define MSI_XA_DOMAIN_SIZE	(MSI_MAX_INDEX + 1)
+#define MSI_ANY_INDEX		UINT_MAX
 
 static inline void msi_setup_default_irqdomain(struct device *dev, struct msi_device_data *md)
 {
@@ -126,18 +127,34 @@  static int msi_insert_desc(struct device
 	}
 
 	hwsize = msi_domain_get_hwsize(dev, domid);
-	if (index >= hwsize) {
-		ret = -ERANGE;
-		goto fail;
-	}
 
-	desc->msi_index = index;
-	index += baseidx;
-	ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
-	if (ret)
-		goto fail;
-	return 0;
+	if (index == MSI_ANY_INDEX) {
+		struct xa_limit limit;
+		unsigned int index;
+
+		limit.min = baseidx;
+		limit.max = baseidx + hwsize - 1;
 
+		/* Let the xarray allocate a free index within the limits */
+		ret = xa_alloc(&md->__store, &index, desc, limit, GFP_KERNEL);
+		if (ret)
+			goto fail;
+
+		desc->msi_index = index;
+		return 0;
+	} else {
+		if (index >= hwsize) {
+			ret = -ERANGE;
+			goto fail;
+		}
+
+		desc->msi_index = index;
+		index += baseidx;
+		ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
+		if (ret)
+			goto fail;
+		return 0;
+	}
 fail:
 	msi_free_desc(desc);
 	return ret;
@@ -335,7 +352,7 @@  int msi_setup_device_data(struct device
 
 	msi_setup_default_irqdomain(dev, md);
 
-	xa_init(&md->__store);
+	xa_init_flags(&md->__store, XA_FLAGS_ALLOC);
 	mutex_init(&md->mutex);
 	md->__iter_idx = MSI_XA_MAX_INDEX;
 	dev->msi.data = md;
@@ -1423,6 +1440,72 @@  int msi_domain_alloc_irqs_all_locked(str
 	return msi_domain_alloc_locked(dev, &ctrl);
 }
 
+/**
+ * msi_domain_alloc_irq_at - Allocate an interrupt from a MSI interrupt domain at
+ *			     a given index - or at the next free index
+ *
+ * @dev:	Pointer to device struct of the device for which the interrupts
+ *		are allocated
+ * @domid:	Id of the interrupt domain to operate on
+ * @index:	Index for allocation. If @index == %MSI_ANY_INDEX the allocation
+ *		uses the next free index.
+ * @affdesc:	Optional pointer to an interrupt affinity descriptor structure
+ * @cookie:	Optional pointer to a descriptor specific cookie to be stored
+ *		in msi_desc::data. Must be NULL for MSI-X allocations
+ *
+ * This requires a MSI interrupt domain which lets the core code manage the
+ * MSI descriptors.
+ *
+ * Return: struct msi_map
+ *
+ *	On success msi_map::index contains the allocated index number and
+ *	msi_map::virq the corresponding Linux interrupt number
+ *
+ *	On failure msi_map::index contains the error code and msi_map::virq
+ *	is %0.
+ */
+struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, unsigned int index,
+				       const struct irq_affinity_desc *affdesc,
+				       union msi_dev_cookie *cookie)
+{
+	struct irq_domain *domain;
+	struct msi_map map = { };
+	struct msi_desc *desc;
+	int ret;
+
+	msi_lock_descs(dev);
+	domain = msi_get_device_domain(dev, domid);
+	if (!domain) {
+		map.index = -ENODEV;
+		goto unlock;
+	}
+
+	desc = msi_alloc_desc(dev, 1, affdesc);
+	if (!desc) {
+		map.index = -ENOMEM;
+		goto unlock;
+	}
+
+	if (cookie)
+		desc->data.cookie = *cookie;
+
+	ret = msi_insert_desc(dev, desc, domid, index);
+	if (ret) {
+		map.index = ret;
+		goto unlock;
+	}
+
+	map.index = desc->msi_index;
+	ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index);
+	if (ret)
+		map.index = ret;
+	else
+		map.virq = desc->irq;
+unlock:
+	msi_unlock_descs(dev);
+	return map;
+}
+
 static void __msi_domain_free_irqs(struct device *dev, struct irq_domain *domain,
 				   struct msi_ctrl *ctrl)
 {