diff mbox

[v2,8/8] spapr_pci: Use XICS interrupt allocator and do not cache interrupts in PHB

Message ID 1400147999-4793-9-git-send-email-aik@ozlabs.ru
State New
Headers show

Commit Message

Alexey Kardashevskiy May 15, 2014, 9:59 a.m. UTC
Currently SPAPR PHB keeps track of all allocated MSI/MISX interrupt as
XICS used to be unable to reuse interrupts which becomes a problem for
dynamic MSI reconfiguration which is happening on guest driver reload or
PCI hot (un)plug. Another problem is that PHB has a limit of devices
supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good reason
for that.

This makes use of new XICS ability to reuse interrupts.

This removes cached MSI configuration from SPAPR PHB so the first IRQ number
of a device is stored in MSI/MSIX config space so there is no need to store
this anywhere else. From now on, SPAPR PHB only keeps flags telling what type
of interrupt for which device it has configured in order to return error if
(for example) MSIX was enabled and the guest is trying to disable MSI which
it has not enabled.

This removes a limit for the maximum number of MSIX-enabled devices per PHB,
now XICS and PCI bus capacity are the only limitation.

This changes migration stream as it fixes vmstate_spapr_pci_msi::name which was
wrong since the beginning.

This fixed traces to be more informative.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

In reality either MSIX or MSI is enabled, never both. So I could remove msi/msix
bitmaps from this patch, would it make sense?
---
 hw/ppc/spapr_pci.c          | 179 +++++++++++++++++++++++---------------------
 include/hw/pci-host/spapr.h |  11 +--
 trace-events                |   5 +-
 3 files changed, 99 insertions(+), 96 deletions(-)

Comments

Alexander Graf May 21, 2014, 8:40 a.m. UTC | #1
On 15.05.14 11:59, Alexey Kardashevskiy wrote:
> Currently SPAPR PHB keeps track of all allocated MSI/MISX interrupt as
> XICS used to be unable to reuse interrupts which becomes a problem for
> dynamic MSI reconfiguration which is happening on guest driver reload or
> PCI hot (un)plug. Another problem is that PHB has a limit of devices
> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good reason
> for that.
>
> This makes use of new XICS ability to reuse interrupts.
>
> This removes cached MSI configuration from SPAPR PHB so the first IRQ number
> of a device is stored in MSI/MSIX config space so there is no need to store
> this anywhere else. From now on, SPAPR PHB only keeps flags telling what type
> of interrupt for which device it has configured in order to return error if
> (for example) MSIX was enabled and the guest is trying to disable MSI which
> it has not enabled.
>
> This removes a limit for the maximum number of MSIX-enabled devices per PHB,
> now XICS and PCI bus capacity are the only limitation.
>
> This changes migration stream as it fixes vmstate_spapr_pci_msi::name which was
> wrong since the beginning.
>
> This fixed traces to be more informative.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>
> In reality either MSIX or MSI is enabled, never both. So I could remove msi/msix
> bitmaps from this patch, would it make sense?

Is this a hard requirement? Does a device have to choose between MSIX 
and MSI or could it theoretically have both enabled? Is this a PCI 
limitation, a PAPR/XICS limitation or just a limitation of your 
implementation?


Alex
Alexey Kardashevskiy May 21, 2014, 8:52 a.m. UTC | #2
On 05/21/2014 06:40 PM, Alexander Graf wrote:
> 
> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>> Currently SPAPR PHB keeps track of all allocated MSI/MISX interrupt as
>> XICS used to be unable to reuse interrupts which becomes a problem for
>> dynamic MSI reconfiguration which is happening on guest driver reload or
>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good reason
>> for that.
>>
>> This makes use of new XICS ability to reuse interrupts.
>>
>> This removes cached MSI configuration from SPAPR PHB so the first IRQ number
>> of a device is stored in MSI/MSIX config space so there is no need to store
>> this anywhere else. From now on, SPAPR PHB only keeps flags telling what
>> type
>> of interrupt for which device it has configured in order to return error if
>> (for example) MSIX was enabled and the guest is trying to disable MSI which
>> it has not enabled.
>>
>> This removes a limit for the maximum number of MSIX-enabled devices per PHB,
>> now XICS and PCI bus capacity are the only limitation.
>>
>> This changes migration stream as it fixes vmstate_spapr_pci_msi::name
>> which was
>> wrong since the beginning.
>>
>> This fixed traces to be more informative.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>
>> In reality either MSIX or MSI is enabled, never both. So I could remove
>> msi/msix
>> bitmaps from this patch, would it make sense?
> 
> Is this a hard requirement? Does a device have to choose between MSIX and
> MSI or could it theoretically have both enabled? Is this a PCI limitation,
> a PAPR/XICS limitation or just a limitation of your implementation?


My implementation does not have this limitation, I asked if I can simplify
code by introducing one :)

I cannot see any reason why PCI cannot have both MSI and MSIX enabled but
it does not seem to be used by anyone => cannot debug and confirm.

PAPR spec assumes that if the guest tries enabling MSIX when MSI is already
enabled, this is a "change", not enabling both types. But it also says MSI
and MSIX vector numbers are not shared. Hm.
Alexander Graf May 21, 2014, 9:06 a.m. UTC | #3
On 21.05.14 10:52, Alexey Kardashevskiy wrote:
> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX interrupt as
>>> XICS used to be unable to reuse interrupts which becomes a problem for
>>> dynamic MSI reconfiguration which is happening on guest driver reload or
>>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good reason
>>> for that.
>>>
>>> This makes use of new XICS ability to reuse interrupts.
>>>
>>> This removes cached MSI configuration from SPAPR PHB so the first IRQ number
>>> of a device is stored in MSI/MSIX config space so there is no need to store
>>> this anywhere else. From now on, SPAPR PHB only keeps flags telling what
>>> type
>>> of interrupt for which device it has configured in order to return error if
>>> (for example) MSIX was enabled and the guest is trying to disable MSI which
>>> it has not enabled.
>>>
>>> This removes a limit for the maximum number of MSIX-enabled devices per PHB,
>>> now XICS and PCI bus capacity are the only limitation.
>>>
>>> This changes migration stream as it fixes vmstate_spapr_pci_msi::name
>>> which was
>>> wrong since the beginning.
>>>
>>> This fixed traces to be more informative.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>
>>> In reality either MSIX or MSI is enabled, never both. So I could remove
>>> msi/msix
>>> bitmaps from this patch, would it make sense?
>> Is this a hard requirement? Does a device have to choose between MSIX and
>> MSI or could it theoretically have both enabled? Is this a PCI limitation,
>> a PAPR/XICS limitation or just a limitation of your implementation?
>
> My implementation does not have this limitation, I asked if I can simplify
> code by introducing one :)
>
> I cannot see any reason why PCI cannot have both MSI and MSIX enabled but
> it does not seem to be used by anyone => cannot debug and confirm.
>
> PAPR spec assumes that if the guest tries enabling MSIX when MSI is already
> enabled, this is a "change", not enabling both types. But it also says MSI
> and MSIX vector numbers are not shared. Hm.

Yeah, I'm not aware of any limitation on hardware here and I'd rather 
not impose one.

Michael, do you know of any hardware that uses MSI *and* MSI-X at the 
same time?


Alex
Michael S. Tsirkin May 21, 2014, 9:11 a.m. UTC | #4
On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
> 
> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
> >On 05/21/2014 06:40 PM, Alexander Graf wrote:
> >>On 15.05.14 11:59, Alexey Kardashevskiy wrote:
> >>>Currently SPAPR PHB keeps track of all allocated MSI/MISX interrupt as
> >>>XICS used to be unable to reuse interrupts which becomes a problem for
> >>>dynamic MSI reconfiguration which is happening on guest driver reload or
> >>>PCI hot (un)plug. Another problem is that PHB has a limit of devices
> >>>supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good reason
> >>>for that.
> >>>
> >>>This makes use of new XICS ability to reuse interrupts.
> >>>
> >>>This removes cached MSI configuration from SPAPR PHB so the first IRQ number
> >>>of a device is stored in MSI/MSIX config space so there is no need to store
> >>>this anywhere else. From now on, SPAPR PHB only keeps flags telling what
> >>>type
> >>>of interrupt for which device it has configured in order to return error if
> >>>(for example) MSIX was enabled and the guest is trying to disable MSI which
> >>>it has not enabled.
> >>>
> >>>This removes a limit for the maximum number of MSIX-enabled devices per PHB,
> >>>now XICS and PCI bus capacity are the only limitation.
> >>>
> >>>This changes migration stream as it fixes vmstate_spapr_pci_msi::name
> >>>which was
> >>>wrong since the beginning.
> >>>
> >>>This fixed traces to be more informative.
> >>>
> >>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>---
> >>>
> >>>In reality either MSIX or MSI is enabled, never both. So I could remove
> >>>msi/msix
> >>>bitmaps from this patch, would it make sense?
> >>Is this a hard requirement? Does a device have to choose between MSIX and
> >>MSI or could it theoretically have both enabled? Is this a PCI limitation,
> >>a PAPR/XICS limitation or just a limitation of your implementation?
> >
> >My implementation does not have this limitation, I asked if I can simplify
> >code by introducing one :)
> >
> >I cannot see any reason why PCI cannot have both MSI and MSIX enabled but
> >it does not seem to be used by anyone => cannot debug and confirm.
> >
> >PAPR spec assumes that if the guest tries enabling MSIX when MSI is already
> >enabled, this is a "change", not enabling both types. But it also says MSI
> >and MSIX vector numbers are not shared. Hm.
> 
> Yeah, I'm not aware of any limitation on hardware here and I'd
> rather not impose one.
> 
> Michael, do you know of any hardware that uses MSI *and* MSI-X at
> the same time?
> 
> 
> Alex

No, and the PCI spec says:
	A function is permitted to implement both MSI and MSI-X, but system
	software is
	prohibited from enabling both at the same time. If system software
	enables both at the same time, the result is undefined.
Alexander Graf May 21, 2014, 9:13 a.m. UTC | #5
On 21.05.14 11:11, Michael S. Tsirkin wrote:
> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX interrupt as
>>>>> XICS used to be unable to reuse interrupts which becomes a problem for
>>>>> dynamic MSI reconfiguration which is happening on guest driver reload or
>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good reason
>>>>> for that.
>>>>>
>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>
>>>>> This removes cached MSI configuration from SPAPR PHB so the first IRQ number
>>>>> of a device is stored in MSI/MSIX config space so there is no need to store
>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags telling what
>>>>> type
>>>>> of interrupt for which device it has configured in order to return error if
>>>>> (for example) MSIX was enabled and the guest is trying to disable MSI which
>>>>> it has not enabled.
>>>>>
>>>>> This removes a limit for the maximum number of MSIX-enabled devices per PHB,
>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>
>>>>> This changes migration stream as it fixes vmstate_spapr_pci_msi::name
>>>>> which was
>>>>> wrong since the beginning.
>>>>>
>>>>> This fixed traces to be more informative.
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>> ---
>>>>>
>>>>> In reality either MSIX or MSI is enabled, never both. So I could remove
>>>>> msi/msix
>>>>> bitmaps from this patch, would it make sense?
>>>> Is this a hard requirement? Does a device have to choose between MSIX and
>>>> MSI or could it theoretically have both enabled? Is this a PCI limitation,
>>>> a PAPR/XICS limitation or just a limitation of your implementation?
>>> My implementation does not have this limitation, I asked if I can simplify
>>> code by introducing one :)
>>>
>>> I cannot see any reason why PCI cannot have both MSI and MSIX enabled but
>>> it does not seem to be used by anyone => cannot debug and confirm.
>>>
>>> PAPR spec assumes that if the guest tries enabling MSIX when MSI is already
>>> enabled, this is a "change", not enabling both types. But it also says MSI
>>> and MSIX vector numbers are not shared. Hm.
>> Yeah, I'm not aware of any limitation on hardware here and I'd
>> rather not impose one.
>>
>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>> the same time?
>>
>>
>> Alex
> No, and the PCI spec says:
> 	A function is permitted to implement both MSI and MSI-X, but system
> 	software is
> 	prohibited from enabling both at the same time. If system software
> 	enables both at the same time, the result is undefined.

Ah, cool. So yes Alexey, feel free to impose it :).


Alex
Alexey Kardashevskiy May 21, 2014, 9:33 a.m. UTC | #6
On 05/21/2014 07:13 PM, Alexander Graf wrote:
> 
> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX interrupt as
>>>>>> XICS used to be unable to reuse interrupts which becomes a problem for
>>>>>> dynamic MSI reconfiguration which is happening on guest driver reload or
>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good reason
>>>>>> for that.
>>>>>>
>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>
>>>>>> This removes cached MSI configuration from SPAPR PHB so the first IRQ
>>>>>> number
>>>>>> of a device is stored in MSI/MSIX config space so there is no need to
>>>>>> store
>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags telling what
>>>>>> type
>>>>>> of interrupt for which device it has configured in order to return
>>>>>> error if
>>>>>> (for example) MSIX was enabled and the guest is trying to disable MSI
>>>>>> which
>>>>>> it has not enabled.
>>>>>>
>>>>>> This removes a limit for the maximum number of MSIX-enabled devices
>>>>>> per PHB,
>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>
>>>>>> This changes migration stream as it fixes vmstate_spapr_pci_msi::name
>>>>>> which was
>>>>>> wrong since the beginning.
>>>>>>
>>>>>> This fixed traces to be more informative.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>>
>>>>>> In reality either MSIX or MSI is enabled, never both. So I could remove
>>>>>> msi/msix
>>>>>> bitmaps from this patch, would it make sense?
>>>>> Is this a hard requirement? Does a device have to choose between MSIX and
>>>>> MSI or could it theoretically have both enabled? Is this a PCI
>>>>> limitation,
>>>>> a PAPR/XICS limitation or just a limitation of your implementation?
>>>> My implementation does not have this limitation, I asked if I can simplify
>>>> code by introducing one :)
>>>>
>>>> I cannot see any reason why PCI cannot have both MSI and MSIX enabled but
>>>> it does not seem to be used by anyone => cannot debug and confirm.
>>>>
>>>> PAPR spec assumes that if the guest tries enabling MSIX when MSI is
>>>> already
>>>> enabled, this is a "change", not enabling both types. But it also says MSI
>>>> and MSIX vector numbers are not shared. Hm.
>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>> rather not impose one.
>>>
>>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>>> the same time?
>>>
>>>
>>> Alex
>> No, and the PCI spec says:
>>     A function is permitted to implement both MSI and MSI-X, but system
>>     software is
>>     prohibited from enabling both at the same time. If system software
>>     enables both at the same time, the result is undefined.
> 
> Ah, cool. So yes Alexey, feel free to impose it :).

Heh. This solves just half of the problem - I still have to keep track of
what device got MSI/MSIX configured via that ibm,change-msi interface. I
was hoping I can store such flag somewhere in a device PCI config space but
MSI/MSIX enable bit is not good as it is not set when those calls are made.
And I cannot rely on address/data fields much as the guest can change them
(I already use them to store IRQ numbers and btw it is missing checks when
I read them back for disposal, I'll fix in next round).

Or on "enable" event I could put IRQ numbers to .data of MSI config space
and on "disable" check if it is not zero, then configuration took place,
then I can remove my msi[]/msix[] flag arrays. If the guest did any change
to MSI/MSIX config space (it does not on SPAPR except weird selftest
cases), I compare .data with what ICS can possibly have and either reject
"disable" or handle it and if it breaks XICS - that's too bad for the
stupid guest. Would that be acceptable?
Michael S. Tsirkin May 21, 2014, 9:38 a.m. UTC | #7
On Wed, May 21, 2014 at 07:33:36PM +1000, Alexey Kardashevskiy wrote:
> On 05/21/2014 07:13 PM, Alexander Graf wrote:
> > 
> > On 21.05.14 11:11, Michael S. Tsirkin wrote:
> >> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
> >>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
> >>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
> >>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
> >>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX interrupt as
> >>>>>> XICS used to be unable to reuse interrupts which becomes a problem for
> >>>>>> dynamic MSI reconfiguration which is happening on guest driver reload or
> >>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
> >>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good reason
> >>>>>> for that.
> >>>>>>
> >>>>>> This makes use of new XICS ability to reuse interrupts.
> >>>>>>
> >>>>>> This removes cached MSI configuration from SPAPR PHB so the first IRQ
> >>>>>> number
> >>>>>> of a device is stored in MSI/MSIX config space so there is no need to
> >>>>>> store
> >>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags telling what
> >>>>>> type
> >>>>>> of interrupt for which device it has configured in order to return
> >>>>>> error if
> >>>>>> (for example) MSIX was enabled and the guest is trying to disable MSI
> >>>>>> which
> >>>>>> it has not enabled.
> >>>>>>
> >>>>>> This removes a limit for the maximum number of MSIX-enabled devices
> >>>>>> per PHB,
> >>>>>> now XICS and PCI bus capacity are the only limitation.
> >>>>>>
> >>>>>> This changes migration stream as it fixes vmstate_spapr_pci_msi::name
> >>>>>> which was
> >>>>>> wrong since the beginning.
> >>>>>>
> >>>>>> This fixed traces to be more informative.
> >>>>>>
> >>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>> ---
> >>>>>>
> >>>>>> In reality either MSIX or MSI is enabled, never both. So I could remove
> >>>>>> msi/msix
> >>>>>> bitmaps from this patch, would it make sense?
> >>>>> Is this a hard requirement? Does a device have to choose between MSIX and
> >>>>> MSI or could it theoretically have both enabled? Is this a PCI
> >>>>> limitation,
> >>>>> a PAPR/XICS limitation or just a limitation of your implementation?
> >>>> My implementation does not have this limitation, I asked if I can simplify
> >>>> code by introducing one :)
> >>>>
> >>>> I cannot see any reason why PCI cannot have both MSI and MSIX enabled but
> >>>> it does not seem to be used by anyone => cannot debug and confirm.
> >>>>
> >>>> PAPR spec assumes that if the guest tries enabling MSIX when MSI is
> >>>> already
> >>>> enabled, this is a "change", not enabling both types. But it also says MSI
> >>>> and MSIX vector numbers are not shared. Hm.
> >>> Yeah, I'm not aware of any limitation on hardware here and I'd
> >>> rather not impose one.
> >>>
> >>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
> >>> the same time?
> >>>
> >>>
> >>> Alex
> >> No, and the PCI spec says:
> >>     A function is permitted to implement both MSI and MSI-X, but system
> >>     software is
> >>     prohibited from enabling both at the same time. If system software
> >>     enables both at the same time, the result is undefined.
> > 
> > Ah, cool. So yes Alexey, feel free to impose it :).
> 
> Heh. This solves just half of the problem - I still have to keep track of
> what device got MSI/MSIX configured via that ibm,change-msi interface. I
> was hoping I can store such flag somewhere in a device PCI config space but
> MSI/MSIX enable bit is not good as it is not set when those calls are made.

Hmm could you pls remind me why is it desirable to store this
in device? Device is not yet sending MSI interrupts after all
otherwise enable would be set.

> And I cannot rely on address/data fields much as the guest can change them
> (I already use them to store IRQ numbers and btw it is missing checks when
> I read them back for disposal, I'll fix in next round).
> 
> Or on "enable" event I could put IRQ numbers to .data of MSI config space
> and on "disable" check if it is not zero, then configuration took place,
> then I can remove my msi[]/msix[] flag arrays. If the guest did any change
> to MSI/MSIX config space (it does not on SPAPR except weird selftest
> cases), I compare .data with what ICS can possibly have and either reject
> "disable" or handle it and if it breaks XICS - that's too bad for the
> stupid guest. Would that be acceptable?


> 
> -- 
> Alexey
Alexander Graf May 21, 2014, 9:50 a.m. UTC | #8
On 21.05.14 11:33, Alexey Kardashevskiy wrote:
> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX interrupt as
>>>>>>> XICS used to be unable to reuse interrupts which becomes a problem for
>>>>>>> dynamic MSI reconfiguration which is happening on guest driver reload or
>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good reason
>>>>>>> for that.
>>>>>>>
>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>
>>>>>>> This removes cached MSI configuration from SPAPR PHB so the first IRQ
>>>>>>> number
>>>>>>> of a device is stored in MSI/MSIX config space so there is no need to
>>>>>>> store
>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags telling what
>>>>>>> type
>>>>>>> of interrupt for which device it has configured in order to return
>>>>>>> error if
>>>>>>> (for example) MSIX was enabled and the guest is trying to disable MSI
>>>>>>> which
>>>>>>> it has not enabled.
>>>>>>>
>>>>>>> This removes a limit for the maximum number of MSIX-enabled devices
>>>>>>> per PHB,
>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>
>>>>>>> This changes migration stream as it fixes vmstate_spapr_pci_msi::name
>>>>>>> which was
>>>>>>> wrong since the beginning.
>>>>>>>
>>>>>>> This fixed traces to be more informative.
>>>>>>>
>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>> ---
>>>>>>>
>>>>>>> In reality either MSIX or MSI is enabled, never both. So I could remove
>>>>>>> msi/msix
>>>>>>> bitmaps from this patch, would it make sense?
>>>>>> Is this a hard requirement? Does a device have to choose between MSIX and
>>>>>> MSI or could it theoretically have both enabled? Is this a PCI
>>>>>> limitation,
>>>>>> a PAPR/XICS limitation or just a limitation of your implementation?
>>>>> My implementation does not have this limitation, I asked if I can simplify
>>>>> code by introducing one :)
>>>>>
>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX enabled but
>>>>> it does not seem to be used by anyone => cannot debug and confirm.
>>>>>
>>>>> PAPR spec assumes that if the guest tries enabling MSIX when MSI is
>>>>> already
>>>>> enabled, this is a "change", not enabling both types. But it also says MSI
>>>>> and MSIX vector numbers are not shared. Hm.
>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>> rather not impose one.
>>>>
>>>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>>>> the same time?
>>>>
>>>>
>>>> Alex
>>> No, and the PCI spec says:
>>>      A function is permitted to implement both MSI and MSI-X, but system
>>>      software is
>>>      prohibited from enabling both at the same time. If system software
>>>      enables both at the same time, the result is undefined.
>> Ah, cool. So yes Alexey, feel free to impose it :).
> Heh. This solves just half of the problem - I still have to keep track of
> what device got MSI/MSIX configured via that ibm,change-msi interface. I
> was hoping I can store such flag somewhere in a device PCI config space but
> MSI/MSIX enable bit is not good as it is not set when those calls are made.
> And I cannot rely on address/data fields much as the guest can change them
> (I already use them to store IRQ numbers and btw it is missing checks when
> I read them back for disposal, I'll fix in next round).
>
> Or on "enable" event I could put IRQ numbers to .data of MSI config space
> and on "disable" check if it is not zero, then configuration took place,
> then I can remove my msi[]/msix[] flag arrays. If the guest did any change
> to MSI/MSIX config space (it does not on SPAPR except weird selftest
> cases), I compare .data with what ICS can possibly have and either reject
> "disable" or handle it and if it breaks XICS - that's too bad for the
> stupid guest. Would that be acceptable?

Can't you prohibit the guest from writing to the MSI configuration 
registers itself? Then you don't need to do sanity checks.


Alex
Alexey Kardashevskiy May 21, 2014, 10:03 a.m. UTC | #9
On 05/21/2014 07:38 PM, Michael S. Tsirkin wrote:
> On Wed, May 21, 2014 at 07:33:36PM +1000, Alexey Kardashevskiy wrote:
>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>
>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX interrupt as
>>>>>>>> XICS used to be unable to reuse interrupts which becomes a problem for
>>>>>>>> dynamic MSI reconfiguration which is happening on guest driver reload or
>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good reason
>>>>>>>> for that.
>>>>>>>>
>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>
>>>>>>>> This removes cached MSI configuration from SPAPR PHB so the first IRQ
>>>>>>>> number
>>>>>>>> of a device is stored in MSI/MSIX config space so there is no need to
>>>>>>>> store
>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags telling what
>>>>>>>> type
>>>>>>>> of interrupt for which device it has configured in order to return
>>>>>>>> error if
>>>>>>>> (for example) MSIX was enabled and the guest is trying to disable MSI
>>>>>>>> which
>>>>>>>> it has not enabled.
>>>>>>>>
>>>>>>>> This removes a limit for the maximum number of MSIX-enabled devices
>>>>>>>> per PHB,
>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>
>>>>>>>> This changes migration stream as it fixes vmstate_spapr_pci_msi::name
>>>>>>>> which was
>>>>>>>> wrong since the beginning.
>>>>>>>>
>>>>>>>> This fixed traces to be more informative.
>>>>>>>>
>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I could remove
>>>>>>>> msi/msix
>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>> Is this a hard requirement? Does a device have to choose between MSIX and
>>>>>>> MSI or could it theoretically have both enabled? Is this a PCI
>>>>>>> limitation,
>>>>>>> a PAPR/XICS limitation or just a limitation of your implementation?
>>>>>> My implementation does not have this limitation, I asked if I can simplify
>>>>>> code by introducing one :)
>>>>>>
>>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX enabled but
>>>>>> it does not seem to be used by anyone => cannot debug and confirm.
>>>>>>
>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when MSI is
>>>>>> already
>>>>>> enabled, this is a "change", not enabling both types. But it also says MSI
>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>>> rather not impose one.
>>>>>
>>>>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>>>>> the same time?
>>>>>
>>>>>
>>>>> Alex
>>>> No, and the PCI spec says:
>>>>     A function is permitted to implement both MSI and MSI-X, but system
>>>>     software is
>>>>     prohibited from enabling both at the same time. If system software
>>>>     enables both at the same time, the result is undefined.
>>>
>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>
>> Heh. This solves just half of the problem - I still have to keep track of
>> what device got MSI/MSIX configured via that ibm,change-msi interface. I
>> was hoping I can store such flag somewhere in a device PCI config space but
>> MSI/MSIX enable bit is not good as it is not set when those calls are made.
> 
> Hmm could you pls remind me why is it desirable to store this
> in device?

I need this flag to know if I can process "disable" or return an error. So
I need to save it somewhere. And there can be up to 256 buses * 32 dev * 8
functions = 65536 flags which is 8KB. And only a small portion of it will
ever be used for obvious reasons. Having 1 bit anywhere in config space or
QEMU's PCIDevice would help here...

At the moment I keep an array in SPAPR's PHB, it is 32 entries long so
SPAPR PHB can have only 32 MSI-enabled devices and our testers think this
is not enough :)


> Device is not yet sending MSI interrupts after all
> otherwise enable would be set.

That is correct.


> 
>> And I cannot rely on address/data fields much as the guest can change them
>> (I already use them to store IRQ numbers and btw it is missing checks when
>> I read them back for disposal, I'll fix in next round).
>>
>> Or on "enable" event I could put IRQ numbers to .data of MSI config space
>> and on "disable" check if it is not zero, then configuration took place,
>> then I can remove my msi[]/msix[] flag arrays. If the guest did any change
>> to MSI/MSIX config space (it does not on SPAPR except weird selftest
>> cases), I compare .data with what ICS can possibly have and either reject
>> "disable" or handle it and if it breaks XICS - that's too bad for the
>> stupid guest. Would that be acceptable?
> 
> 
>>
>> -- 
>> Alexey
Alexey Kardashevskiy May 21, 2014, 10:13 a.m. UTC | #10
On 05/21/2014 07:50 PM, Alexander Graf wrote:
> 
> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX interrupt as
>>>>>>>> XICS used to be unable to reuse interrupts which becomes a problem for
>>>>>>>> dynamic MSI reconfiguration which is happening on guest driver
>>>>>>>> reload or
>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good
>>>>>>>> reason
>>>>>>>> for that.
>>>>>>>>
>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>
>>>>>>>> This removes cached MSI configuration from SPAPR PHB so the first IRQ
>>>>>>>> number
>>>>>>>> of a device is stored in MSI/MSIX config space so there is no need to
>>>>>>>> store
>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags telling
>>>>>>>> what
>>>>>>>> type
>>>>>>>> of interrupt for which device it has configured in order to return
>>>>>>>> error if
>>>>>>>> (for example) MSIX was enabled and the guest is trying to disable MSI
>>>>>>>> which
>>>>>>>> it has not enabled.
>>>>>>>>
>>>>>>>> This removes a limit for the maximum number of MSIX-enabled devices
>>>>>>>> per PHB,
>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>
>>>>>>>> This changes migration stream as it fixes vmstate_spapr_pci_msi::name
>>>>>>>> which was
>>>>>>>> wrong since the beginning.
>>>>>>>>
>>>>>>>> This fixed traces to be more informative.
>>>>>>>>
>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I could
>>>>>>>> remove
>>>>>>>> msi/msix
>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>> Is this a hard requirement? Does a device have to choose between
>>>>>>> MSIX and
>>>>>>> MSI or could it theoretically have both enabled? Is this a PCI
>>>>>>> limitation,
>>>>>>> a PAPR/XICS limitation or just a limitation of your implementation?
>>>>>> My implementation does not have this limitation, I asked if I can
>>>>>> simplify
>>>>>> code by introducing one :)
>>>>>>
>>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX enabled
>>>>>> but
>>>>>> it does not seem to be used by anyone => cannot debug and confirm.
>>>>>>
>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when MSI is
>>>>>> already
>>>>>> enabled, this is a "change", not enabling both types. But it also
>>>>>> says MSI
>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>>> rather not impose one.
>>>>>
>>>>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>>>>> the same time?
>>>>>
>>>>>
>>>>> Alex
>>>> No, and the PCI spec says:
>>>>      A function is permitted to implement both MSI and MSI-X, but system
>>>>      software is
>>>>      prohibited from enabling both at the same time. If system software
>>>>      enables both at the same time, the result is undefined.
>>> Ah, cool. So yes Alexey, feel free to impose it :).
>> Heh. This solves just half of the problem - I still have to keep track of
>> what device got MSI/MSIX configured via that ibm,change-msi interface. I
>> was hoping I can store such flag somewhere in a device PCI config space but
>> MSI/MSIX enable bit is not good as it is not set when those calls are made.
>> And I cannot rely on address/data fields much as the guest can change them
>> (I already use them to store IRQ numbers and btw it is missing checks when
>> I read them back for disposal, I'll fix in next round).
>>
>> Or on "enable" event I could put IRQ numbers to .data of MSI config space
>> and on "disable" check if it is not zero, then configuration took place,
>> then I can remove my msi[]/msix[] flag arrays. If the guest did any change
>> to MSI/MSIX config space (it does not on SPAPR except weird selftest
>> cases), I compare .data with what ICS can possibly have and either reject
>> "disable" or handle it and if it breaks XICS - that's too bad for the
>> stupid guest. Would that be acceptable?
> 
> Can't you prohibit the guest from writing to the MSI configuration
> registers itself? Then you don't need to do sanity checks.


I could for emulated devices but VFIO uses the same code. For example,
there is an IBM SCSI IPR card which does a "self test". For that, it saves
MSIX BAR content, does reboot via some backdoor interface and restores MSIX
BAR. It has been solved for VFIO in the host kernel by restoring MSIX data
from cached values when guest is trying to restore it with what it thinks
is actual MSIX data (it is virtualized because of x86). But there is cache
in the host kernel and I am trying to avoid cache here (or put it into the
device).
Alexander Graf May 21, 2014, 10:35 a.m. UTC | #11
On 21.05.14 12:13, Alexey Kardashevskiy wrote:
> On 05/21/2014 07:50 PM, Alexander Graf wrote:
>> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX interrupt as
>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a problem for
>>>>>>>>> dynamic MSI reconfiguration which is happening on guest driver
>>>>>>>>> reload or
>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good
>>>>>>>>> reason
>>>>>>>>> for that.
>>>>>>>>>
>>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>>
>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so the first IRQ
>>>>>>>>> number
>>>>>>>>> of a device is stored in MSI/MSIX config space so there is no need to
>>>>>>>>> store
>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags telling
>>>>>>>>> what
>>>>>>>>> type
>>>>>>>>> of interrupt for which device it has configured in order to return
>>>>>>>>> error if
>>>>>>>>> (for example) MSIX was enabled and the guest is trying to disable MSI
>>>>>>>>> which
>>>>>>>>> it has not enabled.
>>>>>>>>>
>>>>>>>>> This removes a limit for the maximum number of MSIX-enabled devices
>>>>>>>>> per PHB,
>>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>>
>>>>>>>>> This changes migration stream as it fixes vmstate_spapr_pci_msi::name
>>>>>>>>> which was
>>>>>>>>> wrong since the beginning.
>>>>>>>>>
>>>>>>>>> This fixed traces to be more informative.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I could
>>>>>>>>> remove
>>>>>>>>> msi/msix
>>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>>> Is this a hard requirement? Does a device have to choose between
>>>>>>>> MSIX and
>>>>>>>> MSI or could it theoretically have both enabled? Is this a PCI
>>>>>>>> limitation,
>>>>>>>> a PAPR/XICS limitation or just a limitation of your implementation?
>>>>>>> My implementation does not have this limitation, I asked if I can
>>>>>>> simplify
>>>>>>> code by introducing one :)
>>>>>>>
>>>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX enabled
>>>>>>> but
>>>>>>> it does not seem to be used by anyone => cannot debug and confirm.
>>>>>>>
>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when MSI is
>>>>>>> already
>>>>>>> enabled, this is a "change", not enabling both types. But it also
>>>>>>> says MSI
>>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>>>> rather not impose one.
>>>>>>
>>>>>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>>>>>> the same time?
>>>>>>
>>>>>>
>>>>>> Alex
>>>>> No, and the PCI spec says:
>>>>>       A function is permitted to implement both MSI and MSI-X, but system
>>>>>       software is
>>>>>       prohibited from enabling both at the same time. If system software
>>>>>       enables both at the same time, the result is undefined.
>>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>> Heh. This solves just half of the problem - I still have to keep track of
>>> what device got MSI/MSIX configured via that ibm,change-msi interface. I
>>> was hoping I can store such flag somewhere in a device PCI config space but
>>> MSI/MSIX enable bit is not good as it is not set when those calls are made.
>>> And I cannot rely on address/data fields much as the guest can change them
>>> (I already use them to store IRQ numbers and btw it is missing checks when
>>> I read them back for disposal, I'll fix in next round).
>>>
>>> Or on "enable" event I could put IRQ numbers to .data of MSI config space
>>> and on "disable" check if it is not zero, then configuration took place,
>>> then I can remove my msi[]/msix[] flag arrays. If the guest did any change
>>> to MSI/MSIX config space (it does not on SPAPR except weird selftest
>>> cases), I compare .data with what ICS can possibly have and either reject
>>> "disable" or handle it and if it breaks XICS - that's too bad for the
>>> stupid guest. Would that be acceptable?
>> Can't you prohibit the guest from writing to the MSI configuration
>> registers itself? Then you don't need to do sanity checks.
>
> I could for emulated devices but VFIO uses the same code. For example,
> there is an IBM SCSI IPR card which does a "self test". For that, it saves
> MSIX BAR content, does reboot via some backdoor interface and restores MSIX
> BAR. It has been solved for VFIO in the host kernel by restoring MSIX data
> from cached values when guest is trying to restore it with what it thinks
> is actual MSIX data (it is virtualized because of x86). But there is cache

We already have a cache because we don't access the real PCI registers 
with msi_set_message(), no?


Alex
Alexey Kardashevskiy May 21, 2014, 12:42 p.m. UTC | #12
On 05/21/2014 08:35 PM, Alexander Graf wrote:
> 
> On 21.05.14 12:13, Alexey Kardashevskiy wrote:
>> On 05/21/2014 07:50 PM, Alexander Graf wrote:
>>> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>>>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX
>>>>>>>>>> interrupt as
>>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a
>>>>>>>>>> problem for
>>>>>>>>>> dynamic MSI reconfiguration which is happening on guest driver
>>>>>>>>>> reload or
>>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
>>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good
>>>>>>>>>> reason
>>>>>>>>>> for that.
>>>>>>>>>>
>>>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>>>
>>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so the first
>>>>>>>>>> IRQ
>>>>>>>>>> number
>>>>>>>>>> of a device is stored in MSI/MSIX config space so there is no
>>>>>>>>>> need to
>>>>>>>>>> store
>>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags telling
>>>>>>>>>> what
>>>>>>>>>> type
>>>>>>>>>> of interrupt for which device it has configured in order to return
>>>>>>>>>> error if
>>>>>>>>>> (for example) MSIX was enabled and the guest is trying to disable
>>>>>>>>>> MSI
>>>>>>>>>> which
>>>>>>>>>> it has not enabled.
>>>>>>>>>>
>>>>>>>>>> This removes a limit for the maximum number of MSIX-enabled devices
>>>>>>>>>> per PHB,
>>>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>>>
>>>>>>>>>> This changes migration stream as it fixes
>>>>>>>>>> vmstate_spapr_pci_msi::name
>>>>>>>>>> which was
>>>>>>>>>> wrong since the beginning.
>>>>>>>>>>
>>>>>>>>>> This fixed traces to be more informative.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I could
>>>>>>>>>> remove
>>>>>>>>>> msi/msix
>>>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>>>> Is this a hard requirement? Does a device have to choose between
>>>>>>>>> MSIX and
>>>>>>>>> MSI or could it theoretically have both enabled? Is this a PCI
>>>>>>>>> limitation,
>>>>>>>>> a PAPR/XICS limitation or just a limitation of your implementation?
>>>>>>>> My implementation does not have this limitation, I asked if I can
>>>>>>>> simplify
>>>>>>>> code by introducing one :)
>>>>>>>>
>>>>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX enabled
>>>>>>>> but
>>>>>>>> it does not seem to be used by anyone => cannot debug and confirm.
>>>>>>>>
>>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when MSI is
>>>>>>>> already
>>>>>>>> enabled, this is a "change", not enabling both types. But it also
>>>>>>>> says MSI
>>>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>>>>> rather not impose one.
>>>>>>>
>>>>>>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>>>>>>> the same time?
>>>>>>>
>>>>>>>
>>>>>>> Alex
>>>>>> No, and the PCI spec says:
>>>>>>       A function is permitted to implement both MSI and MSI-X, but
>>>>>> system
>>>>>>       software is
>>>>>>       prohibited from enabling both at the same time. If system software
>>>>>>       enables both at the same time, the result is undefined.
>>>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>>> Heh. This solves just half of the problem - I still have to keep track of
>>>> what device got MSI/MSIX configured via that ibm,change-msi interface. I
>>>> was hoping I can store such flag somewhere in a device PCI config space
>>>> but
>>>> MSI/MSIX enable bit is not good as it is not set when those calls are
>>>> made.
>>>> And I cannot rely on address/data fields much as the guest can change them
>>>> (I already use them to store IRQ numbers and btw it is missing checks when
>>>> I read them back for disposal, I'll fix in next round).
>>>>
>>>> Or on "enable" event I could put IRQ numbers to .data of MSI config space
>>>> and on "disable" check if it is not zero, then configuration took place,
>>>> then I can remove my msi[]/msix[] flag arrays. If the guest did any change
>>>> to MSI/MSIX config space (it does not on SPAPR except weird selftest
>>>> cases), I compare .data with what ICS can possibly have and either reject
>>>> "disable" or handle it and if it breaks XICS - that's too bad for the
>>>> stupid guest. Would that be acceptable?
>>> Can't you prohibit the guest from writing to the MSI configuration
>>> registers itself? Then you don't need to do sanity checks.
>>
>> I could for emulated devices but VFIO uses the same code. For example,
>> there is an IBM SCSI IPR card which does a "self test". For that, it saves
>> MSIX BAR content, does reboot via some backdoor interface and restores MSIX
>> BAR. It has been solved for VFIO in the host kernel by restoring MSIX data
>> from cached values when guest is trying to restore it with what it thinks
>> is actual MSIX data (it is virtualized because of x86). But there is cache
> 
> We already have a cache because we don't access the real PCI registers with
> msi_set_message(), no?


For emulated devices there is no cache. And in any case the guest is
allowed to write to it... Who knows what AIX does? I do not.
Alexey Kardashevskiy May 22, 2014, 6:53 a.m. UTC | #13
On 05/21/2014 10:42 PM, Alexey Kardashevskiy wrote:
> On 05/21/2014 08:35 PM, Alexander Graf wrote:
>>
>> On 21.05.14 12:13, Alexey Kardashevskiy wrote:
>>> On 05/21/2014 07:50 PM, Alexander Graf wrote:
>>>> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>>>>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX
>>>>>>>>>>> interrupt as
>>>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a
>>>>>>>>>>> problem for
>>>>>>>>>>> dynamic MSI reconfiguration which is happening on guest driver
>>>>>>>>>>> reload or
>>>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
>>>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good
>>>>>>>>>>> reason
>>>>>>>>>>> for that.
>>>>>>>>>>>
>>>>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>>>>
>>>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so the first
>>>>>>>>>>> IRQ
>>>>>>>>>>> number
>>>>>>>>>>> of a device is stored in MSI/MSIX config space so there is no
>>>>>>>>>>> need to
>>>>>>>>>>> store
>>>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags telling
>>>>>>>>>>> what
>>>>>>>>>>> type
>>>>>>>>>>> of interrupt for which device it has configured in order to return
>>>>>>>>>>> error if
>>>>>>>>>>> (for example) MSIX was enabled and the guest is trying to disable
>>>>>>>>>>> MSI
>>>>>>>>>>> which
>>>>>>>>>>> it has not enabled.
>>>>>>>>>>>
>>>>>>>>>>> This removes a limit for the maximum number of MSIX-enabled devices
>>>>>>>>>>> per PHB,
>>>>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>>>>
>>>>>>>>>>> This changes migration stream as it fixes
>>>>>>>>>>> vmstate_spapr_pci_msi::name
>>>>>>>>>>> which was
>>>>>>>>>>> wrong since the beginning.
>>>>>>>>>>>
>>>>>>>>>>> This fixed traces to be more informative.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I could
>>>>>>>>>>> remove
>>>>>>>>>>> msi/msix
>>>>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>>>>> Is this a hard requirement? Does a device have to choose between
>>>>>>>>>> MSIX and
>>>>>>>>>> MSI or could it theoretically have both enabled? Is this a PCI
>>>>>>>>>> limitation,
>>>>>>>>>> a PAPR/XICS limitation or just a limitation of your implementation?
>>>>>>>>> My implementation does not have this limitation, I asked if I can
>>>>>>>>> simplify
>>>>>>>>> code by introducing one :)
>>>>>>>>>
>>>>>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX enabled
>>>>>>>>> but
>>>>>>>>> it does not seem to be used by anyone => cannot debug and confirm.
>>>>>>>>>
>>>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when MSI is
>>>>>>>>> already
>>>>>>>>> enabled, this is a "change", not enabling both types. But it also
>>>>>>>>> says MSI
>>>>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>>>>>> rather not impose one.
>>>>>>>>
>>>>>>>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>>>>>>>> the same time?
>>>>>>>>
>>>>>>>>
>>>>>>>> Alex
>>>>>>> No, and the PCI spec says:
>>>>>>>       A function is permitted to implement both MSI and MSI-X, but
>>>>>>> system
>>>>>>>       software is
>>>>>>>       prohibited from enabling both at the same time. If system software
>>>>>>>       enables both at the same time, the result is undefined.
>>>>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>>>> Heh. This solves just half of the problem - I still have to keep track of
>>>>> what device got MSI/MSIX configured via that ibm,change-msi interface. I
>>>>> was hoping I can store such flag somewhere in a device PCI config space
>>>>> but
>>>>> MSI/MSIX enable bit is not good as it is not set when those calls are
>>>>> made.
>>>>> And I cannot rely on address/data fields much as the guest can change them
>>>>> (I already use them to store IRQ numbers and btw it is missing checks when
>>>>> I read them back for disposal, I'll fix in next round).
>>>>>
>>>>> Or on "enable" event I could put IRQ numbers to .data of MSI config space
>>>>> and on "disable" check if it is not zero, then configuration took place,
>>>>> then I can remove my msi[]/msix[] flag arrays. If the guest did any change
>>>>> to MSI/MSIX config space (it does not on SPAPR except weird selftest
>>>>> cases), I compare .data with what ICS can possibly have and either reject
>>>>> "disable" or handle it and if it breaks XICS - that's too bad for the
>>>>> stupid guest. Would that be acceptable?
>>>> Can't you prohibit the guest from writing to the MSI configuration
>>>> registers itself? Then you don't need to do sanity checks.
>>>
>>> I could for emulated devices but VFIO uses the same code. For example,
>>> there is an IBM SCSI IPR card which does a "self test". For that, it saves
>>> MSIX BAR content, does reboot via some backdoor interface and restores MSIX
>>> BAR. It has been solved for VFIO in the host kernel by restoring MSIX data
>>> from cached values when guest is trying to restore it with what it thinks
>>> is actual MSIX data (it is virtualized because of x86). But there is cache
>>
>> We already have a cache because we don't access the real PCI registers with
>> msi_set_message(), no?
> 
> 
> For emulated devices there is no cache. And in any case the guest is
> allowed to write to it... Who knows what AIX does? I do not.


Tried GHashTable for keeping bus:dev.fn <-> (irq, num), more or less ok but
how to migrate such thing? Temporary cache somewhere and then unpack it? Or
use old style migration callbacks?
Alexander Graf May 22, 2014, 7:16 a.m. UTC | #14
> Am 22.05.2014 um 08:53 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
> 
>> On 05/21/2014 10:42 PM, Alexey Kardashevskiy wrote:
>>> On 05/21/2014 08:35 PM, Alexander Graf wrote:
>>> 
>>>> On 21.05.14 12:13, Alexey Kardashevskiy wrote:
>>>>> On 05/21/2014 07:50 PM, Alexander Graf wrote:
>>>>>> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>>>>>>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>>>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>>>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>>>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX
>>>>>>>>>>>> interrupt as
>>>>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a
>>>>>>>>>>>> problem for
>>>>>>>>>>>> dynamic MSI reconfiguration which is happening on guest driver
>>>>>>>>>>>> reload or
>>>>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
>>>>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good
>>>>>>>>>>>> reason
>>>>>>>>>>>> for that.
>>>>>>>>>>>> 
>>>>>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>>>>> 
>>>>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so the first
>>>>>>>>>>>> IRQ
>>>>>>>>>>>> number
>>>>>>>>>>>> of a device is stored in MSI/MSIX config space so there is no
>>>>>>>>>>>> need to
>>>>>>>>>>>> store
>>>>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags telling
>>>>>>>>>>>> what
>>>>>>>>>>>> type
>>>>>>>>>>>> of interrupt for which device it has configured in order to return
>>>>>>>>>>>> error if
>>>>>>>>>>>> (for example) MSIX was enabled and the guest is trying to disable
>>>>>>>>>>>> MSI
>>>>>>>>>>>> which
>>>>>>>>>>>> it has not enabled.
>>>>>>>>>>>> 
>>>>>>>>>>>> This removes a limit for the maximum number of MSIX-enabled devices
>>>>>>>>>>>> per PHB,
>>>>>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>>>>> 
>>>>>>>>>>>> This changes migration stream as it fixes
>>>>>>>>>>>> vmstate_spapr_pci_msi::name
>>>>>>>>>>>> which was
>>>>>>>>>>>> wrong since the beginning.
>>>>>>>>>>>> 
>>>>>>>>>>>> This fixed traces to be more informative.
>>>>>>>>>>>> 
>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>> ---
>>>>>>>>>>>> 
>>>>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I could
>>>>>>>>>>>> remove
>>>>>>>>>>>> msi/msix
>>>>>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>>>>>> Is this a hard requirement? Does a device have to choose between
>>>>>>>>>>> MSIX and
>>>>>>>>>>> MSI or could it theoretically have both enabled? Is this a PCI
>>>>>>>>>>> limitation,
>>>>>>>>>>> a PAPR/XICS limitation or just a limitation of your implementation?
>>>>>>>>>> My implementation does not have this limitation, I asked if I can
>>>>>>>>>> simplify
>>>>>>>>>> code by introducing one :)
>>>>>>>>>> 
>>>>>>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX enabled
>>>>>>>>>> but
>>>>>>>>>> it does not seem to be used by anyone => cannot debug and confirm.
>>>>>>>>>> 
>>>>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when MSI is
>>>>>>>>>> already
>>>>>>>>>> enabled, this is a "change", not enabling both types. But it also
>>>>>>>>>> says MSI
>>>>>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>>>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>>>>>>> rather not impose one.
>>>>>>>>> 
>>>>>>>>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>>>>>>>>> the same time?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Alex
>>>>>>>> No, and the PCI spec says:
>>>>>>>>      A function is permitted to implement both MSI and MSI-X, but
>>>>>>>> system
>>>>>>>>      software is
>>>>>>>>      prohibited from enabling both at the same time. If system software
>>>>>>>>      enables both at the same time, the result is undefined.
>>>>>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>>>>> Heh. This solves just half of the problem - I still have to keep track of
>>>>>> what device got MSI/MSIX configured via that ibm,change-msi interface. I
>>>>>> was hoping I can store such flag somewhere in a device PCI config space
>>>>>> but
>>>>>> MSI/MSIX enable bit is not good as it is not set when those calls are
>>>>>> made.
>>>>>> And I cannot rely on address/data fields much as the guest can change them
>>>>>> (I already use them to store IRQ numbers and btw it is missing checks when
>>>>>> I read them back for disposal, I'll fix in next round).
>>>>>> 
>>>>>> Or on "enable" event I could put IRQ numbers to .data of MSI config space
>>>>>> and on "disable" check if it is not zero, then configuration took place,
>>>>>> then I can remove my msi[]/msix[] flag arrays. If the guest did any change
>>>>>> to MSI/MSIX config space (it does not on SPAPR except weird selftest
>>>>>> cases), I compare .data with what ICS can possibly have and either reject
>>>>>> "disable" or handle it and if it breaks XICS - that's too bad for the
>>>>>> stupid guest. Would that be acceptable?
>>>>> Can't you prohibit the guest from writing to the MSI configuration
>>>>> registers itself? Then you don't need to do sanity checks.
>>>> 
>>>> I could for emulated devices but VFIO uses the same code. For example,
>>>> there is an IBM SCSI IPR card which does a "self test". For that, it saves
>>>> MSIX BAR content, does reboot via some backdoor interface and restores MSIX
>>>> BAR. It has been solved for VFIO in the host kernel by restoring MSIX data
>>>> from cached values when guest is trying to restore it with what it thinks
>>>> is actual MSIX data (it is virtualized because of x86). But there is cache
>>> 
>>> We already have a cache because we don't access the real PCI registers with
>>> msi_set_message(), no?
>> 
>> 
>> For emulated devices there is no cache. And in any case the guest is
>> allowed to write to it... Who knows what AIX does? I do not.
> 
> 
> Tried GHashTable for keeping bus:dev.fn <-> (irq, num), more or less ok but
> how to migrate such thing? Temporary cache somewhere and then unpack it? Or
> use old style migration callbacks?

Could you try to introduce a new vmstate type that just serializes and deserializes hash tables? Maybe there is already a serialization function for it in glib?


Alex
Alexey Kardashevskiy May 22, 2014, 10:53 a.m. UTC | #15
On 05/22/2014 05:16 PM, Alexander Graf wrote:> 
> 
>> Am 22.05.2014 um 08:53 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>>
>>> On 05/21/2014 10:42 PM, Alexey Kardashevskiy wrote:
>>>> On 05/21/2014 08:35 PM, Alexander Graf wrote:
>>>>
>>>>> On 21.05.14 12:13, Alexey Kardashevskiy wrote:
>>>>>> On 05/21/2014 07:50 PM, Alexander Graf wrote:
>>>>>>> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>>>>>>>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>>>>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>>>>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>>>>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX
>>>>>>>>>>>>> interrupt as
>>>>>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a
>>>>>>>>>>>>> problem for
>>>>>>>>>>>>> dynamic MSI reconfiguration which is happening on guest driver
>>>>>>>>>>>>> reload or
>>>>>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
>>>>>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good
>>>>>>>>>>>>> reason
>>>>>>>>>>>>> for that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so the first
>>>>>>>>>>>>> IRQ
>>>>>>>>>>>>> number
>>>>>>>>>>>>> of a device is stored in MSI/MSIX config space so there is no
>>>>>>>>>>>>> need to
>>>>>>>>>>>>> store
>>>>>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags telling
>>>>>>>>>>>>> what
>>>>>>>>>>>>> type
>>>>>>>>>>>>> of interrupt for which device it has configured in order to return
>>>>>>>>>>>>> error if
>>>>>>>>>>>>> (for example) MSIX was enabled and the guest is trying to disable
>>>>>>>>>>>>> MSI
>>>>>>>>>>>>> which
>>>>>>>>>>>>> it has not enabled.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This removes a limit for the maximum number of MSIX-enabled devices
>>>>>>>>>>>>> per PHB,
>>>>>>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This changes migration stream as it fixes
>>>>>>>>>>>>> vmstate_spapr_pci_msi::name
>>>>>>>>>>>>> which was
>>>>>>>>>>>>> wrong since the beginning.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This fixed traces to be more informative.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>
>>>>>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I could
>>>>>>>>>>>>> remove
>>>>>>>>>>>>> msi/msix
>>>>>>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>>>>>>> Is this a hard requirement? Does a device have to choose between
>>>>>>>>>>>> MSIX and
>>>>>>>>>>>> MSI or could it theoretically have both enabled? Is this a PCI
>>>>>>>>>>>> limitation,
>>>>>>>>>>>> a PAPR/XICS limitation or just a limitation of your implementation?
>>>>>>>>>>> My implementation does not have this limitation, I asked if I can
>>>>>>>>>>> simplify
>>>>>>>>>>> code by introducing one :)
>>>>>>>>>>>
>>>>>>>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX enabled
>>>>>>>>>>> but
>>>>>>>>>>> it does not seem to be used by anyone => cannot debug and confirm.
>>>>>>>>>>>
>>>>>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when MSI is
>>>>>>>>>>> already
>>>>>>>>>>> enabled, this is a "change", not enabling both types. But it also
>>>>>>>>>>> says MSI
>>>>>>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>>>>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>>>>>>>> rather not impose one.
>>>>>>>>>>
>>>>>>>>>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>>>>>>>>>> the same time?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>> No, and the PCI spec says:
>>>>>>>>>      A function is permitted to implement both MSI and MSI-X, but
>>>>>>>>> system
>>>>>>>>>      software is
>>>>>>>>>      prohibited from enabling both at the same time. If system software
>>>>>>>>>      enables both at the same time, the result is undefined.
>>>>>>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>>>>>> Heh. This solves just half of the problem - I still have to keep track of
>>>>>>> what device got MSI/MSIX configured via that ibm,change-msi interface. I
>>>>>>> was hoping I can store such flag somewhere in a device PCI config space
>>>>>>> but
>>>>>>> MSI/MSIX enable bit is not good as it is not set when those calls are
>>>>>>> made.
>>>>>>> And I cannot rely on address/data fields much as the guest can change them
>>>>>>> (I already use them to store IRQ numbers and btw it is missing checks when
>>>>>>> I read them back for disposal, I'll fix in next round).
>>>>>>>
>>>>>>> Or on "enable" event I could put IRQ numbers to .data of MSI config space
>>>>>>> and on "disable" check if it is not zero, then configuration took place,
>>>>>>> then I can remove my msi[]/msix[] flag arrays. If the guest did any change
>>>>>>> to MSI/MSIX config space (it does not on SPAPR except weird selftest
>>>>>>> cases), I compare .data with what ICS can possibly have and either reject
>>>>>>> "disable" or handle it and if it breaks XICS - that's too bad for the
>>>>>>> stupid guest. Would that be acceptable?
>>>>>> Can't you prohibit the guest from writing to the MSI configuration
>>>>>> registers itself? Then you don't need to do sanity checks.
>>>>>
>>>>> I could for emulated devices but VFIO uses the same code. For example,
>>>>> there is an IBM SCSI IPR card which does a "self test". For that, it saves
>>>>> MSIX BAR content, does reboot via some backdoor interface and restores MSIX
>>>>> BAR. It has been solved for VFIO in the host kernel by restoring MSIX data
>>>>> from cached values when guest is trying to restore it with what it thinks
>>>>> is actual MSIX data (it is virtualized because of x86). But there is cache
>>>>
>>>> We already have a cache because we don't access the real PCI registers with
>>>> msi_set_message(), no?
>>>
>>>
>>> For emulated devices there is no cache. And in any case the guest is
>>> allowed to write to it... Who knows what AIX does? I do not.
>>
>>
>> Tried GHashTable for keeping bus:dev.fn <-> (irq, num), more or less ok but
>> how to migrate such thing? Temporary cache somewhere and then unpack it? Or
>> use old style migration callbacks?
> 

> Could you try to introduce a new vmstate type that just serializes and
> deserializes hash tables? Maybe there is already a serialization
> function for it in glib?

I have not found any (most likely I do not know how to search there),
I added mine, here are VMSTATE_HASH + its use for SPAPR.


Is this a movement to right direction? I need to put key/value sizes
into VMSTATE definition somehow but do not really want to touch
VMStateField.




Alexey Kardashevskiy (2):
  vmstate: Add helper to enable GHashTable migration
  spapr_pci: Use XICS interrupt allocator and do not cache interrupts in
    PHB

 hw/ppc/spapr_pci.c          | 144 +++++++++++++++++---------------------------
 include/hw/pci-host/spapr.h |  13 ++--
 include/migration/vmstate.h |  10 +++
 include/qemu-common.h       |  13 ++++
 trace-events                |   5 +-
 vmstate.c                   |  54 +++++++++++++++++
 6 files changed, 141 insertions(+), 98 deletions(-)
Alexander Graf May 22, 2014, 10:57 a.m. UTC | #16
On 22.05.14 12:53, Alexey Kardashevskiy wrote:
> On 05/22/2014 05:16 PM, Alexander Graf wrote:>
>>> Am 22.05.2014 um 08:53 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>>>
>>>> On 05/21/2014 10:42 PM, Alexey Kardashevskiy wrote:
>>>>> On 05/21/2014 08:35 PM, Alexander Graf wrote:
>>>>>
>>>>>> On 21.05.14 12:13, Alexey Kardashevskiy wrote:
>>>>>>> On 05/21/2014 07:50 PM, Alexander Graf wrote:
>>>>>>>> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>>>>>>>>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>>>>>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>>>>>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX
>>>>>>>>>>>>>> interrupt as
>>>>>>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a
>>>>>>>>>>>>>> problem for
>>>>>>>>>>>>>> dynamic MSI reconfiguration which is happening on guest driver
>>>>>>>>>>>>>> reload or
>>>>>>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of devices
>>>>>>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no good
>>>>>>>>>>>>>> reason
>>>>>>>>>>>>>> for that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so the first
>>>>>>>>>>>>>> IRQ
>>>>>>>>>>>>>> number
>>>>>>>>>>>>>> of a device is stored in MSI/MSIX config space so there is no
>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>> store
>>>>>>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags telling
>>>>>>>>>>>>>> what
>>>>>>>>>>>>>> type
>>>>>>>>>>>>>> of interrupt for which device it has configured in order to return
>>>>>>>>>>>>>> error if
>>>>>>>>>>>>>> (for example) MSIX was enabled and the guest is trying to disable
>>>>>>>>>>>>>> MSI
>>>>>>>>>>>>>> which
>>>>>>>>>>>>>> it has not enabled.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This removes a limit for the maximum number of MSIX-enabled devices
>>>>>>>>>>>>>> per PHB,
>>>>>>>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This changes migration stream as it fixes
>>>>>>>>>>>>>> vmstate_spapr_pci_msi::name
>>>>>>>>>>>>>> which was
>>>>>>>>>>>>>> wrong since the beginning.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This fixed traces to be more informative.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I could
>>>>>>>>>>>>>> remove
>>>>>>>>>>>>>> msi/msix
>>>>>>>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>>>>>>>> Is this a hard requirement? Does a device have to choose between
>>>>>>>>>>>>> MSIX and
>>>>>>>>>>>>> MSI or could it theoretically have both enabled? Is this a PCI
>>>>>>>>>>>>> limitation,
>>>>>>>>>>>>> a PAPR/XICS limitation or just a limitation of your implementation?
>>>>>>>>>>>> My implementation does not have this limitation, I asked if I can
>>>>>>>>>>>> simplify
>>>>>>>>>>>> code by introducing one :)
>>>>>>>>>>>>
>>>>>>>>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX enabled
>>>>>>>>>>>> but
>>>>>>>>>>>> it does not seem to be used by anyone => cannot debug and confirm.
>>>>>>>>>>>>
>>>>>>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when MSI is
>>>>>>>>>>>> already
>>>>>>>>>>>> enabled, this is a "change", not enabling both types. But it also
>>>>>>>>>>>> says MSI
>>>>>>>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>>>>>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>>>>>>>>> rather not impose one.
>>>>>>>>>>>
>>>>>>>>>>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>>>>>>>>>>> the same time?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Alex
>>>>>>>>>> No, and the PCI spec says:
>>>>>>>>>>       A function is permitted to implement both MSI and MSI-X, but
>>>>>>>>>> system
>>>>>>>>>>       software is
>>>>>>>>>>       prohibited from enabling both at the same time. If system software
>>>>>>>>>>       enables both at the same time, the result is undefined.
>>>>>>>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>>>>>>> Heh. This solves just half of the problem - I still have to keep track of
>>>>>>>> what device got MSI/MSIX configured via that ibm,change-msi interface. I
>>>>>>>> was hoping I can store such flag somewhere in a device PCI config space
>>>>>>>> but
>>>>>>>> MSI/MSIX enable bit is not good as it is not set when those calls are
>>>>>>>> made.
>>>>>>>> And I cannot rely on address/data fields much as the guest can change them
>>>>>>>> (I already use them to store IRQ numbers and btw it is missing checks when
>>>>>>>> I read them back for disposal, I'll fix in next round).
>>>>>>>>
>>>>>>>> Or on "enable" event I could put IRQ numbers to .data of MSI config space
>>>>>>>> and on "disable" check if it is not zero, then configuration took place,
>>>>>>>> then I can remove my msi[]/msix[] flag arrays. If the guest did any change
>>>>>>>> to MSI/MSIX config space (it does not on SPAPR except weird selftest
>>>>>>>> cases), I compare .data with what ICS can possibly have and either reject
>>>>>>>> "disable" or handle it and if it breaks XICS - that's too bad for the
>>>>>>>> stupid guest. Would that be acceptable?
>>>>>>> Can't you prohibit the guest from writing to the MSI configuration
>>>>>>> registers itself? Then you don't need to do sanity checks.
>>>>>> I could for emulated devices but VFIO uses the same code. For example,
>>>>>> there is an IBM SCSI IPR card which does a "self test". For that, it saves
>>>>>> MSIX BAR content, does reboot via some backdoor interface and restores MSIX
>>>>>> BAR. It has been solved for VFIO in the host kernel by restoring MSIX data
>>>>>> from cached values when guest is trying to restore it with what it thinks
>>>>>> is actual MSIX data (it is virtualized because of x86). But there is cache
>>>>> We already have a cache because we don't access the real PCI registers with
>>>>> msi_set_message(), no?
>>>>
>>>> For emulated devices there is no cache. And in any case the guest is
>>>> allowed to write to it... Who knows what AIX does? I do not.
>>>
>>> Tried GHashTable for keeping bus:dev.fn <-> (irq, num), more or less ok but
>>> how to migrate such thing? Temporary cache somewhere and then unpack it? Or
>>> use old style migration callbacks?
>> Could you try to introduce a new vmstate type that just serializes and
>> deserializes hash tables? Maybe there is already a serialization
>> function for it in glib?
> I have not found any (most likely I do not know how to search there),
> I added mine, here are VMSTATE_HASH + its use for SPAPR.
>
>
> Is this a movement to right direction? I need to put key/value sizes
> into VMSTATE definition somehow but do not really want to touch
> VMStateField.

I'm not sure. Juan?


Alex
Alexey Kardashevskiy May 22, 2014, 2:25 p.m. UTC | #17
On 05/22/2014 08:57 PM, Alexander Graf wrote:
> 
> On 22.05.14 12:53, Alexey Kardashevskiy wrote:
>> On 05/22/2014 05:16 PM, Alexander Graf wrote:>
>>>> Am 22.05.2014 um 08:53 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>>>>
>>>>> On 05/21/2014 10:42 PM, Alexey Kardashevskiy wrote:
>>>>>> On 05/21/2014 08:35 PM, Alexander Graf wrote:
>>>>>>
>>>>>>> On 21.05.14 12:13, Alexey Kardashevskiy wrote:
>>>>>>>> On 05/21/2014 07:50 PM, Alexander Graf wrote:
>>>>>>>>> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>>>>>>>>>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>>>>>>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>>>>>>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>>>>>>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>>>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX
>>>>>>>>>>>>>>> interrupt as
>>>>>>>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a
>>>>>>>>>>>>>>> problem for
>>>>>>>>>>>>>>> dynamic MSI reconfiguration which is happening on guest driver
>>>>>>>>>>>>>>> reload or
>>>>>>>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of
>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no
>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>> reason
>>>>>>>>>>>>>>> for that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so the
>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>> IRQ
>>>>>>>>>>>>>>> number
>>>>>>>>>>>>>>> of a device is stored in MSI/MSIX config space so there is no
>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>> store
>>>>>>>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags
>>>>>>>>>>>>>>> telling
>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>> type
>>>>>>>>>>>>>>> of interrupt for which device it has configured in order to
>>>>>>>>>>>>>>> return
>>>>>>>>>>>>>>> error if
>>>>>>>>>>>>>>> (for example) MSIX was enabled and the guest is trying to
>>>>>>>>>>>>>>> disable
>>>>>>>>>>>>>>> MSI
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>> it has not enabled.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This removes a limit for the maximum number of MSIX-enabled
>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>> per PHB,
>>>>>>>>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This changes migration stream as it fixes
>>>>>>>>>>>>>>> vmstate_spapr_pci_msi::name
>>>>>>>>>>>>>>> which was
>>>>>>>>>>>>>>> wrong since the beginning.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This fixed traces to be more informative.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I
>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>> remove
>>>>>>>>>>>>>>> msi/msix
>>>>>>>>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>>>>>>>>> Is this a hard requirement? Does a device have to choose between
>>>>>>>>>>>>>> MSIX and
>>>>>>>>>>>>>> MSI or could it theoretically have both enabled? Is this a PCI
>>>>>>>>>>>>>> limitation,
>>>>>>>>>>>>>> a PAPR/XICS limitation or just a limitation of your
>>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>> My implementation does not have this limitation, I asked if I can
>>>>>>>>>>>>> simplify
>>>>>>>>>>>>> code by introducing one :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX
>>>>>>>>>>>>> enabled
>>>>>>>>>>>>> but
>>>>>>>>>>>>> it does not seem to be used by anyone => cannot debug and
>>>>>>>>>>>>> confirm.
>>>>>>>>>>>>>
>>>>>>>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when
>>>>>>>>>>>>> MSI is
>>>>>>>>>>>>> already
>>>>>>>>>>>>> enabled, this is a "change", not enabling both types. But it also
>>>>>>>>>>>>> says MSI
>>>>>>>>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>>>>>>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>>>>>>>>>> rather not impose one.
>>>>>>>>>>>>
>>>>>>>>>>>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>>>>>>>>>>>> the same time?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Alex
>>>>>>>>>>> No, and the PCI spec says:
>>>>>>>>>>>       A function is permitted to implement both MSI and MSI-X, but
>>>>>>>>>>> system
>>>>>>>>>>>       software is
>>>>>>>>>>>       prohibited from enabling both at the same time. If system
>>>>>>>>>>> software
>>>>>>>>>>>       enables both at the same time, the result is undefined.
>>>>>>>>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>>>>>>>> Heh. This solves just half of the problem - I still have to keep
>>>>>>>>> track of
>>>>>>>>> what device got MSI/MSIX configured via that ibm,change-msi
>>>>>>>>> interface. I
>>>>>>>>> was hoping I can store such flag somewhere in a device PCI config
>>>>>>>>> space
>>>>>>>>> but
>>>>>>>>> MSI/MSIX enable bit is not good as it is not set when those calls are
>>>>>>>>> made.
>>>>>>>>> And I cannot rely on address/data fields much as the guest can
>>>>>>>>> change them
>>>>>>>>> (I already use them to store IRQ numbers and btw it is missing
>>>>>>>>> checks when
>>>>>>>>> I read them back for disposal, I'll fix in next round).
>>>>>>>>>
>>>>>>>>> Or on "enable" event I could put IRQ numbers to .data of MSI
>>>>>>>>> config space
>>>>>>>>> and on "disable" check if it is not zero, then configuration took
>>>>>>>>> place,
>>>>>>>>> then I can remove my msi[]/msix[] flag arrays. If the guest did
>>>>>>>>> any change
>>>>>>>>> to MSI/MSIX config space (it does not on SPAPR except weird selftest
>>>>>>>>> cases), I compare .data with what ICS can possibly have and either
>>>>>>>>> reject
>>>>>>>>> "disable" or handle it and if it breaks XICS - that's too bad for the
>>>>>>>>> stupid guest. Would that be acceptable?
>>>>>>>> Can't you prohibit the guest from writing to the MSI configuration
>>>>>>>> registers itself? Then you don't need to do sanity checks.
>>>>>>> I could for emulated devices but VFIO uses the same code. For example,
>>>>>>> there is an IBM SCSI IPR card which does a "self test". For that, it
>>>>>>> saves
>>>>>>> MSIX BAR content, does reboot via some backdoor interface and
>>>>>>> restores MSIX
>>>>>>> BAR. It has been solved for VFIO in the host kernel by restoring
>>>>>>> MSIX data
>>>>>>> from cached values when guest is trying to restore it with what it
>>>>>>> thinks
>>>>>>> is actual MSIX data (it is virtualized because of x86). But there is
>>>>>>> cache
>>>>>> We already have a cache because we don't access the real PCI
>>>>>> registers with
>>>>>> msi_set_message(), no?
>>>>>
>>>>> For emulated devices there is no cache. And in any case the guest is
>>>>> allowed to write to it... Who knows what AIX does? I do not.
>>>>
>>>> Tried GHashTable for keeping bus:dev.fn <-> (irq, num), more or less ok
>>>> but
>>>> how to migrate such thing? Temporary cache somewhere and then unpack
>>>> it? Or
>>>> use old style migration callbacks?
>>> Could you try to introduce a new vmstate type that just serializes and
>>> deserializes hash tables? Maybe there is already a serialization
>>> function for it in glib?
>> I have not found any (most likely I do not know how to search there),
>> I added mine, here are VMSTATE_HASH + its use for SPAPR.
>>
>>
>> Is this a movement to right direction? I need to put key/value sizes
>> into VMSTATE definition somehow but do not really want to touch
>> VMStateField.
> 
> I'm not sure. Juan?


I also tried to simplify this particular thing more by assuming that the
key is always "int" and put size of value to VMStateField::size. But there
is no way to get the size in VMStateInfo::get(). Or I could do a
"subsection" and try implementing has as an array (and have get/put defined
for items, should work) but in this case I'll need to cache number of
elements of the hash table somewhere so I'll need a wrapper around GHashTable.

Making generalized version is not obvious for me :(
Alexey Kardashevskiy May 27, 2014, 4:51 a.m. UTC | #18
On 05/23/2014 12:25 AM, Alexey Kardashevskiy wrote:
> On 05/22/2014 08:57 PM, Alexander Graf wrote:
>>
>> On 22.05.14 12:53, Alexey Kardashevskiy wrote:
>>> On 05/22/2014 05:16 PM, Alexander Graf wrote:>
>>>>> Am 22.05.2014 um 08:53 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>>>>>
>>>>>> On 05/21/2014 10:42 PM, Alexey Kardashevskiy wrote:
>>>>>>> On 05/21/2014 08:35 PM, Alexander Graf wrote:
>>>>>>>
>>>>>>>> On 21.05.14 12:13, Alexey Kardashevskiy wrote:
>>>>>>>>> On 05/21/2014 07:50 PM, Alexander Graf wrote:
>>>>>>>>>> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>>>>>>>>>>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>>>>>>>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>>>>>>>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>>>>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX
>>>>>>>>>>>>>>>> interrupt as
>>>>>>>>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a
>>>>>>>>>>>>>>>> problem for
>>>>>>>>>>>>>>>> dynamic MSI reconfiguration which is happening on guest driver
>>>>>>>>>>>>>>>> reload or
>>>>>>>>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of
>>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no
>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>> reason
>>>>>>>>>>>>>>>> for that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so the
>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>> IRQ
>>>>>>>>>>>>>>>> number
>>>>>>>>>>>>>>>> of a device is stored in MSI/MSIX config space so there is no
>>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>> store
>>>>>>>>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags
>>>>>>>>>>>>>>>> telling
>>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>>> type
>>>>>>>>>>>>>>>> of interrupt for which device it has configured in order to
>>>>>>>>>>>>>>>> return
>>>>>>>>>>>>>>>> error if
>>>>>>>>>>>>>>>> (for example) MSIX was enabled and the guest is trying to
>>>>>>>>>>>>>>>> disable
>>>>>>>>>>>>>>>> MSI
>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>> it has not enabled.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This removes a limit for the maximum number of MSIX-enabled
>>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>>> per PHB,
>>>>>>>>>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This changes migration stream as it fixes
>>>>>>>>>>>>>>>> vmstate_spapr_pci_msi::name
>>>>>>>>>>>>>>>> which was
>>>>>>>>>>>>>>>> wrong since the beginning.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This fixed traces to be more informative.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I
>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>> remove
>>>>>>>>>>>>>>>> msi/msix
>>>>>>>>>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>>>>>>>>>> Is this a hard requirement? Does a device have to choose between
>>>>>>>>>>>>>>> MSIX and
>>>>>>>>>>>>>>> MSI or could it theoretically have both enabled? Is this a PCI
>>>>>>>>>>>>>>> limitation,
>>>>>>>>>>>>>>> a PAPR/XICS limitation or just a limitation of your
>>>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>>> My implementation does not have this limitation, I asked if I can
>>>>>>>>>>>>>> simplify
>>>>>>>>>>>>>> code by introducing one :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX
>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>> but
>>>>>>>>>>>>>> it does not seem to be used by anyone => cannot debug and
>>>>>>>>>>>>>> confirm.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when
>>>>>>>>>>>>>> MSI is
>>>>>>>>>>>>>> already
>>>>>>>>>>>>>> enabled, this is a "change", not enabling both types. But it also
>>>>>>>>>>>>>> says MSI
>>>>>>>>>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>>>>>>>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>>>>>>>>>>> rather not impose one.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>>>>>>>>>>>>> the same time?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alex
>>>>>>>>>>>> No, and the PCI spec says:
>>>>>>>>>>>>       A function is permitted to implement both MSI and MSI-X, but
>>>>>>>>>>>> system
>>>>>>>>>>>>       software is
>>>>>>>>>>>>       prohibited from enabling both at the same time. If system
>>>>>>>>>>>> software
>>>>>>>>>>>>       enables both at the same time, the result is undefined.
>>>>>>>>>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>>>>>>>>> Heh. This solves just half of the problem - I still have to keep
>>>>>>>>>> track of
>>>>>>>>>> what device got MSI/MSIX configured via that ibm,change-msi
>>>>>>>>>> interface. I
>>>>>>>>>> was hoping I can store such flag somewhere in a device PCI config
>>>>>>>>>> space
>>>>>>>>>> but
>>>>>>>>>> MSI/MSIX enable bit is not good as it is not set when those calls are
>>>>>>>>>> made.
>>>>>>>>>> And I cannot rely on address/data fields much as the guest can
>>>>>>>>>> change them
>>>>>>>>>> (I already use them to store IRQ numbers and btw it is missing
>>>>>>>>>> checks when
>>>>>>>>>> I read them back for disposal, I'll fix in next round).
>>>>>>>>>>
>>>>>>>>>> Or on "enable" event I could put IRQ numbers to .data of MSI
>>>>>>>>>> config space
>>>>>>>>>> and on "disable" check if it is not zero, then configuration took
>>>>>>>>>> place,
>>>>>>>>>> then I can remove my msi[]/msix[] flag arrays. If the guest did
>>>>>>>>>> any change
>>>>>>>>>> to MSI/MSIX config space (it does not on SPAPR except weird selftest
>>>>>>>>>> cases), I compare .data with what ICS can possibly have and either
>>>>>>>>>> reject
>>>>>>>>>> "disable" or handle it and if it breaks XICS - that's too bad for the
>>>>>>>>>> stupid guest. Would that be acceptable?
>>>>>>>>> Can't you prohibit the guest from writing to the MSI configuration
>>>>>>>>> registers itself? Then you don't need to do sanity checks.
>>>>>>>> I could for emulated devices but VFIO uses the same code. For example,
>>>>>>>> there is an IBM SCSI IPR card which does a "self test". For that, it
>>>>>>>> saves
>>>>>>>> MSIX BAR content, does reboot via some backdoor interface and
>>>>>>>> restores MSIX
>>>>>>>> BAR. It has been solved for VFIO in the host kernel by restoring
>>>>>>>> MSIX data
>>>>>>>> from cached values when guest is trying to restore it with what it
>>>>>>>> thinks
>>>>>>>> is actual MSIX data (it is virtualized because of x86). But there is
>>>>>>>> cache
>>>>>>> We already have a cache because we don't access the real PCI
>>>>>>> registers with
>>>>>>> msi_set_message(), no?
>>>>>>
>>>>>> For emulated devices there is no cache. And in any case the guest is
>>>>>> allowed to write to it... Who knows what AIX does? I do not.
>>>>>
>>>>> Tried GHashTable for keeping bus:dev.fn <-> (irq, num), more or less ok
>>>>> but
>>>>> how to migrate such thing? Temporary cache somewhere and then unpack
>>>>> it? Or
>>>>> use old style migration callbacks?
>>>> Could you try to introduce a new vmstate type that just serializes and
>>>> deserializes hash tables? Maybe there is already a serialization
>>>> function for it in glib?
>>> I have not found any (most likely I do not know how to search there),
>>> I added mine, here are VMSTATE_HASH + its use for SPAPR.
>>>
>>>
>>> Is this a movement to right direction? I need to put key/value sizes
>>> into VMSTATE definition somehow but do not really want to touch
>>> VMStateField.
>>
>> I'm not sure. Juan?
> 
> 
> I also tried to simplify this particular thing more by assuming that the
> key is always "int" and put size of value to VMStateField::size. But there
> is no way to get the size in VMStateInfo::get(). Or I could do a
> "subsection" and try implementing has as an array (and have get/put defined
> for items, should work) but in this case I'll need to cache number of
> elements of the hash table somewhere so I'll need a wrapper around GHashTable.
> 
> Making generalized version is not obvious for me :(

Juan?
Alex?
Anybody?
Ping.

How do I migrate GHashTable? If I am allowed to use custom and bit more
polished get/put from "[PATCH 1/2] vmstate: Add helper to enable GHashTable
migration", I'll be fine.

Thanks!
Alexander Graf May 27, 2014, 11:55 p.m. UTC | #19
On 27.05.14 06:51, Alexey Kardashevskiy wrote:
> On 05/23/2014 12:25 AM, Alexey Kardashevskiy wrote:
>> On 05/22/2014 08:57 PM, Alexander Graf wrote:
>>> On 22.05.14 12:53, Alexey Kardashevskiy wrote:
>>>> On 05/22/2014 05:16 PM, Alexander Graf wrote:>
>>>>>> Am 22.05.2014 um 08:53 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>>>>>>
>>>>>>> On 05/21/2014 10:42 PM, Alexey Kardashevskiy wrote:
>>>>>>>> On 05/21/2014 08:35 PM, Alexander Graf wrote:
>>>>>>>>
>>>>>>>>> On 21.05.14 12:13, Alexey Kardashevskiy wrote:
>>>>>>>>>> On 05/21/2014 07:50 PM, Alexander Graf wrote:
>>>>>>>>>>> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>>>>>>>>>>>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>>>>>>>>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>>>>>>>>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>>>>>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX
>>>>>>>>>>>>>>>>> interrupt as
>>>>>>>>>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a
>>>>>>>>>>>>>>>>> problem for
>>>>>>>>>>>>>>>>> dynamic MSI reconfiguration which is happening on guest driver
>>>>>>>>>>>>>>>>> reload or
>>>>>>>>>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of
>>>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no
>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>> reason
>>>>>>>>>>>>>>>>> for that.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so the
>>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>> IRQ
>>>>>>>>>>>>>>>>> number
>>>>>>>>>>>>>>>>> of a device is stored in MSI/MSIX config space so there is no
>>>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>> store
>>>>>>>>>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags
>>>>>>>>>>>>>>>>> telling
>>>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>>>> type
>>>>>>>>>>>>>>>>> of interrupt for which device it has configured in order to
>>>>>>>>>>>>>>>>> return
>>>>>>>>>>>>>>>>> error if
>>>>>>>>>>>>>>>>> (for example) MSIX was enabled and the guest is trying to
>>>>>>>>>>>>>>>>> disable
>>>>>>>>>>>>>>>>> MSI
>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>> it has not enabled.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This removes a limit for the maximum number of MSIX-enabled
>>>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>>>> per PHB,
>>>>>>>>>>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This changes migration stream as it fixes
>>>>>>>>>>>>>>>>> vmstate_spapr_pci_msi::name
>>>>>>>>>>>>>>>>> which was
>>>>>>>>>>>>>>>>> wrong since the beginning.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This fixed traces to be more informative.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I
>>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>> remove
>>>>>>>>>>>>>>>>> msi/msix
>>>>>>>>>>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>>>>>>>>>>> Is this a hard requirement? Does a device have to choose between
>>>>>>>>>>>>>>>> MSIX and
>>>>>>>>>>>>>>>> MSI or could it theoretically have both enabled? Is this a PCI
>>>>>>>>>>>>>>>> limitation,
>>>>>>>>>>>>>>>> a PAPR/XICS limitation or just a limitation of your
>>>>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>>>> My implementation does not have this limitation, I asked if I can
>>>>>>>>>>>>>>> simplify
>>>>>>>>>>>>>>> code by introducing one :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX
>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>> it does not seem to be used by anyone => cannot debug and
>>>>>>>>>>>>>>> confirm.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when
>>>>>>>>>>>>>>> MSI is
>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>> enabled, this is a "change", not enabling both types. But it also
>>>>>>>>>>>>>>> says MSI
>>>>>>>>>>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>>>>>>>>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>>>>>>>>>>>> rather not impose one.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Michael, do you know of any hardware that uses MSI *and* MSI-X at
>>>>>>>>>>>>>> the same time?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>> No, and the PCI spec says:
>>>>>>>>>>>>>        A function is permitted to implement both MSI and MSI-X, but
>>>>>>>>>>>>> system
>>>>>>>>>>>>>        software is
>>>>>>>>>>>>>        prohibited from enabling both at the same time. If system
>>>>>>>>>>>>> software
>>>>>>>>>>>>>        enables both at the same time, the result is undefined.
>>>>>>>>>>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>>>>>>>>>> Heh. This solves just half of the problem - I still have to keep
>>>>>>>>>>> track of
>>>>>>>>>>> what device got MSI/MSIX configured via that ibm,change-msi
>>>>>>>>>>> interface. I
>>>>>>>>>>> was hoping I can store such flag somewhere in a device PCI config
>>>>>>>>>>> space
>>>>>>>>>>> but
>>>>>>>>>>> MSI/MSIX enable bit is not good as it is not set when those calls are
>>>>>>>>>>> made.
>>>>>>>>>>> And I cannot rely on address/data fields much as the guest can
>>>>>>>>>>> change them
>>>>>>>>>>> (I already use them to store IRQ numbers and btw it is missing
>>>>>>>>>>> checks when
>>>>>>>>>>> I read them back for disposal, I'll fix in next round).
>>>>>>>>>>>
>>>>>>>>>>> Or on "enable" event I could put IRQ numbers to .data of MSI
>>>>>>>>>>> config space
>>>>>>>>>>> and on "disable" check if it is not zero, then configuration took
>>>>>>>>>>> place,
>>>>>>>>>>> then I can remove my msi[]/msix[] flag arrays. If the guest did
>>>>>>>>>>> any change
>>>>>>>>>>> to MSI/MSIX config space (it does not on SPAPR except weird selftest
>>>>>>>>>>> cases), I compare .data with what ICS can possibly have and either
>>>>>>>>>>> reject
>>>>>>>>>>> "disable" or handle it and if it breaks XICS - that's too bad for the
>>>>>>>>>>> stupid guest. Would that be acceptable?
>>>>>>>>>> Can't you prohibit the guest from writing to the MSI configuration
>>>>>>>>>> registers itself? Then you don't need to do sanity checks.
>>>>>>>>> I could for emulated devices but VFIO uses the same code. For example,
>>>>>>>>> there is an IBM SCSI IPR card which does a "self test". For that, it
>>>>>>>>> saves
>>>>>>>>> MSIX BAR content, does reboot via some backdoor interface and
>>>>>>>>> restores MSIX
>>>>>>>>> BAR. It has been solved for VFIO in the host kernel by restoring
>>>>>>>>> MSIX data
>>>>>>>>> from cached values when guest is trying to restore it with what it
>>>>>>>>> thinks
>>>>>>>>> is actual MSIX data (it is virtualized because of x86). But there is
>>>>>>>>> cache
>>>>>>>> We already have a cache because we don't access the real PCI
>>>>>>>> registers with
>>>>>>>> msi_set_message(), no?
>>>>>>> For emulated devices there is no cache. And in any case the guest is
>>>>>>> allowed to write to it... Who knows what AIX does? I do not.
>>>>>> Tried GHashTable for keeping bus:dev.fn <-> (irq, num), more or less ok
>>>>>> but
>>>>>> how to migrate such thing? Temporary cache somewhere and then unpack
>>>>>> it? Or
>>>>>> use old style migration callbacks?
>>>>> Could you try to introduce a new vmstate type that just serializes and
>>>>> deserializes hash tables? Maybe there is already a serialization
>>>>> function for it in glib?
>>>> I have not found any (most likely I do not know how to search there),
>>>> I added mine, here are VMSTATE_HASH + its use for SPAPR.
>>>>
>>>>
>>>> Is this a movement to right direction? I need to put key/value sizes
>>>> into VMSTATE definition somehow but do not really want to touch
>>>> VMStateField.
>>> I'm not sure. Juan?
>>
>> I also tried to simplify this particular thing more by assuming that the
>> key is always "int" and put size of value to VMStateField::size. But there
>> is no way to get the size in VMStateInfo::get(). Or I could do a
>> "subsection" and try implementing has as an array (and have get/put defined
>> for items, should work) but in this case I'll need to cache number of
>> elements of the hash table somewhere so I'll need a wrapper around GHashTable.
>>
>> Making generalized version is not obvious for me :(
> Juan?
> Alex?
> Anybody?
> Ping.
>
> How do I migrate GHashTable? If I am allowed to use custom and bit more
> polished get/put from "[PATCH 1/2] vmstate: Add helper to enable GHashTable
> migration", I'll be fine.

Yeah, I think it's ok to be custom in this case. Or another crazy idea - 
could you flatten the hash table into an array of structs that you can 
describe using VMState? You could then convert from the flat array 
to/from the GHashTable with pre_load/post_load hooks.

The benefit of that would be that the data becomes more easily 
introspectible with tools like my live migration parser.


Alex
Alexey Kardashevskiy May 28, 2014, 12:34 a.m. UTC | #20
On 05/28/2014 09:55 AM, Alexander Graf wrote:
> 
> On 27.05.14 06:51, Alexey Kardashevskiy wrote:
>> On 05/23/2014 12:25 AM, Alexey Kardashevskiy wrote:
>>> On 05/22/2014 08:57 PM, Alexander Graf wrote:
>>>> On 22.05.14 12:53, Alexey Kardashevskiy wrote:
>>>>> On 05/22/2014 05:16 PM, Alexander Graf wrote:>
>>>>>>> Am 22.05.2014 um 08:53 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>>>>>>>
>>>>>>>> On 05/21/2014 10:42 PM, Alexey Kardashevskiy wrote:
>>>>>>>>> On 05/21/2014 08:35 PM, Alexander Graf wrote:
>>>>>>>>>
>>>>>>>>>> On 21.05.14 12:13, Alexey Kardashevskiy wrote:
>>>>>>>>>>> On 05/21/2014 07:50 PM, Alexander Graf wrote:
>>>>>>>>>>>> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>>>>>>>>>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>>>>>>>>>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>>>>>>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX
>>>>>>>>>>>>>>>>>> interrupt as
>>>>>>>>>>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a
>>>>>>>>>>>>>>>>>> problem for
>>>>>>>>>>>>>>>>>> dynamic MSI reconfiguration which is happening on guest
>>>>>>>>>>>>>>>>>> driver
>>>>>>>>>>>>>>>>>> reload or
>>>>>>>>>>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of
>>>>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no
>>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>> reason
>>>>>>>>>>>>>>>>>> for that.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so the
>>>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>>> IRQ
>>>>>>>>>>>>>>>>>> number
>>>>>>>>>>>>>>>>>> of a device is stored in MSI/MSIX config space so there
>>>>>>>>>>>>>>>>>> is no
>>>>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>> store
>>>>>>>>>>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags
>>>>>>>>>>>>>>>>>> telling
>>>>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>>>>> type
>>>>>>>>>>>>>>>>>> of interrupt for which device it has configured in order to
>>>>>>>>>>>>>>>>>> return
>>>>>>>>>>>>>>>>>> error if
>>>>>>>>>>>>>>>>>> (for example) MSIX was enabled and the guest is trying to
>>>>>>>>>>>>>>>>>> disable
>>>>>>>>>>>>>>>>>> MSI
>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>> it has not enabled.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This removes a limit for the maximum number of MSIX-enabled
>>>>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>>>>> per PHB,
>>>>>>>>>>>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This changes migration stream as it fixes
>>>>>>>>>>>>>>>>>> vmstate_spapr_pci_msi::name
>>>>>>>>>>>>>>>>>> which was
>>>>>>>>>>>>>>>>>> wrong since the beginning.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This fixed traces to be more informative.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I
>>>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>> remove
>>>>>>>>>>>>>>>>>> msi/msix
>>>>>>>>>>>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>>>>>>>>>>>> Is this a hard requirement? Does a device have to choose
>>>>>>>>>>>>>>>>> between
>>>>>>>>>>>>>>>>> MSIX and
>>>>>>>>>>>>>>>>> MSI or could it theoretically have both enabled? Is this a
>>>>>>>>>>>>>>>>> PCI
>>>>>>>>>>>>>>>>> limitation,
>>>>>>>>>>>>>>>>> a PAPR/XICS limitation or just a limitation of your
>>>>>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>>>>> My implementation does not have this limitation, I asked if
>>>>>>>>>>>>>>>> I can
>>>>>>>>>>>>>>>> simplify
>>>>>>>>>>>>>>>> code by introducing one :)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX
>>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>> it does not seem to be used by anyone => cannot debug and
>>>>>>>>>>>>>>>> confirm.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when
>>>>>>>>>>>>>>>> MSI is
>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>> enabled, this is a "change", not enabling both types. But
>>>>>>>>>>>>>>>> it also
>>>>>>>>>>>>>>>> says MSI
>>>>>>>>>>>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>>>>>>>>>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>>>>>>>>>>>>> rather not impose one.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Michael, do you know of any hardware that uses MSI *and*
>>>>>>>>>>>>>>> MSI-X at
>>>>>>>>>>>>>>> the same time?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>> No, and the PCI spec says:
>>>>>>>>>>>>>>        A function is permitted to implement both MSI and
>>>>>>>>>>>>>> MSI-X, but
>>>>>>>>>>>>>> system
>>>>>>>>>>>>>>        software is
>>>>>>>>>>>>>>        prohibited from enabling both at the same time. If system
>>>>>>>>>>>>>> software
>>>>>>>>>>>>>>        enables both at the same time, the result is undefined.
>>>>>>>>>>>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>>>>>>>>>>> Heh. This solves just half of the problem - I still have to keep
>>>>>>>>>>>> track of
>>>>>>>>>>>> what device got MSI/MSIX configured via that ibm,change-msi
>>>>>>>>>>>> interface. I
>>>>>>>>>>>> was hoping I can store such flag somewhere in a device PCI config
>>>>>>>>>>>> space
>>>>>>>>>>>> but
>>>>>>>>>>>> MSI/MSIX enable bit is not good as it is not set when those
>>>>>>>>>>>> calls are
>>>>>>>>>>>> made.
>>>>>>>>>>>> And I cannot rely on address/data fields much as the guest can
>>>>>>>>>>>> change them
>>>>>>>>>>>> (I already use them to store IRQ numbers and btw it is missing
>>>>>>>>>>>> checks when
>>>>>>>>>>>> I read them back for disposal, I'll fix in next round).
>>>>>>>>>>>>
>>>>>>>>>>>> Or on "enable" event I could put IRQ numbers to .data of MSI
>>>>>>>>>>>> config space
>>>>>>>>>>>> and on "disable" check if it is not zero, then configuration took
>>>>>>>>>>>> place,
>>>>>>>>>>>> then I can remove my msi[]/msix[] flag arrays. If the guest did
>>>>>>>>>>>> any change
>>>>>>>>>>>> to MSI/MSIX config space (it does not on SPAPR except weird
>>>>>>>>>>>> selftest
>>>>>>>>>>>> cases), I compare .data with what ICS can possibly have and either
>>>>>>>>>>>> reject
>>>>>>>>>>>> "disable" or handle it and if it breaks XICS - that's too bad
>>>>>>>>>>>> for the
>>>>>>>>>>>> stupid guest. Would that be acceptable?
>>>>>>>>>>> Can't you prohibit the guest from writing to the MSI configuration
>>>>>>>>>>> registers itself? Then you don't need to do sanity checks.
>>>>>>>>>> I could for emulated devices but VFIO uses the same code. For
>>>>>>>>>> example,
>>>>>>>>>> there is an IBM SCSI IPR card which does a "self test". For that, it
>>>>>>>>>> saves
>>>>>>>>>> MSIX BAR content, does reboot via some backdoor interface and
>>>>>>>>>> restores MSIX
>>>>>>>>>> BAR. It has been solved for VFIO in the host kernel by restoring
>>>>>>>>>> MSIX data
>>>>>>>>>> from cached values when guest is trying to restore it with what it
>>>>>>>>>> thinks
>>>>>>>>>> is actual MSIX data (it is virtualized because of x86). But there is
>>>>>>>>>> cache
>>>>>>>>> We already have a cache because we don't access the real PCI
>>>>>>>>> registers with
>>>>>>>>> msi_set_message(), no?
>>>>>>>> For emulated devices there is no cache. And in any case the guest is
>>>>>>>> allowed to write to it... Who knows what AIX does? I do not.
>>>>>>> Tried GHashTable for keeping bus:dev.fn <-> (irq, num), more or less ok
>>>>>>> but
>>>>>>> how to migrate such thing? Temporary cache somewhere and then unpack
>>>>>>> it? Or
>>>>>>> use old style migration callbacks?
>>>>>> Could you try to introduce a new vmstate type that just serializes and
>>>>>> deserializes hash tables? Maybe there is already a serialization
>>>>>> function for it in glib?
>>>>> I have not found any (most likely I do not know how to search there),
>>>>> I added mine, here are VMSTATE_HASH + its use for SPAPR.
>>>>>
>>>>>
>>>>> Is this a movement to right direction? I need to put key/value sizes
>>>>> into VMSTATE definition somehow but do not really want to touch
>>>>> VMStateField.
>>>> I'm not sure. Juan?
>>>
>>> I also tried to simplify this particular thing more by assuming that the
>>> key is always "int" and put size of value to VMStateField::size. But there
>>> is no way to get the size in VMStateInfo::get(). Or I could do a
>>> "subsection" and try implementing has as an array (and have get/put defined
>>> for items, should work) but in this case I'll need to cache number of
>>> elements of the hash table somewhere so I'll need a wrapper around
>>> GHashTable.
>>>
>>> Making generalized version is not obvious for me :(
>> Juan?
>> Alex?
>> Anybody?
>> Ping.
>>
>> How do I migrate GHashTable? If I am allowed to use custom and bit more
>> polished get/put from "[PATCH 1/2] vmstate: Add helper to enable GHashTable
>> migration", I'll be fine.
> 
> Yeah, I think it's ok to be custom in this case. Or another crazy idea -
> could you flatten the hash table into an array of structs that you can
> describe using VMState? You could then convert from the flat array to/from
> the GHashTable with pre_load/post_load hooks.


Array is exactly what I am trying to get rid of. Ok, I'll remove hashmap at
all and implement dynamic flat array (yay, yet another bicycle!).


> The benefit of that would be that the data becomes more easily
> introspectible with tools like my live migration parser.

Wow. What is that parser?
Alexander Graf May 28, 2014, 12:41 a.m. UTC | #21
On 28.05.14 02:34, Alexey Kardashevskiy wrote:
> On 05/28/2014 09:55 AM, Alexander Graf wrote:
>> On 27.05.14 06:51, Alexey Kardashevskiy wrote:
>>> On 05/23/2014 12:25 AM, Alexey Kardashevskiy wrote:
>>>> On 05/22/2014 08:57 PM, Alexander Graf wrote:
>>>>> On 22.05.14 12:53, Alexey Kardashevskiy wrote:
>>>>>> On 05/22/2014 05:16 PM, Alexander Graf wrote:>
>>>>>>>> Am 22.05.2014 um 08:53 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>>>>>>>>
>>>>>>>>> On 05/21/2014 10:42 PM, Alexey Kardashevskiy wrote:
>>>>>>>>>> On 05/21/2014 08:35 PM, Alexander Graf wrote:
>>>>>>>>>>
>>>>>>>>>>> On 21.05.14 12:13, Alexey Kardashevskiy wrote:
>>>>>>>>>>>> On 05/21/2014 07:50 PM, Alexander Graf wrote:
>>>>>>>>>>>>> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>>>>>>>>>>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf wrote:
>>>>>>>>>>>>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>>>>>>>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX
>>>>>>>>>>>>>>>>>>> interrupt as
>>>>>>>>>>>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a
>>>>>>>>>>>>>>>>>>> problem for
>>>>>>>>>>>>>>>>>>> dynamic MSI reconfiguration which is happening on guest
>>>>>>>>>>>>>>>>>>> driver
>>>>>>>>>>>>>>>>>>> reload or
>>>>>>>>>>>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a limit of
>>>>>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there is no
>>>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>>> reason
>>>>>>>>>>>>>>>>>>> for that.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so the
>>>>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>>>> IRQ
>>>>>>>>>>>>>>>>>>> number
>>>>>>>>>>>>>>>>>>> of a device is stored in MSI/MSIX config space so there
>>>>>>>>>>>>>>>>>>> is no
>>>>>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>> store
>>>>>>>>>>>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps flags
>>>>>>>>>>>>>>>>>>> telling
>>>>>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>>>>>> type
>>>>>>>>>>>>>>>>>>> of interrupt for which device it has configured in order to
>>>>>>>>>>>>>>>>>>> return
>>>>>>>>>>>>>>>>>>> error if
>>>>>>>>>>>>>>>>>>> (for example) MSIX was enabled and the guest is trying to
>>>>>>>>>>>>>>>>>>> disable
>>>>>>>>>>>>>>>>>>> MSI
>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>> it has not enabled.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This removes a limit for the maximum number of MSIX-enabled
>>>>>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>>>>>> per PHB,
>>>>>>>>>>>>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This changes migration stream as it fixes
>>>>>>>>>>>>>>>>>>> vmstate_spapr_pci_msi::name
>>>>>>>>>>>>>>>>>>> which was
>>>>>>>>>>>>>>>>>>> wrong since the beginning.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This fixed traces to be more informative.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I
>>>>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>>> remove
>>>>>>>>>>>>>>>>>>> msi/msix
>>>>>>>>>>>>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>>>>>>>>>>>>> Is this a hard requirement? Does a device have to choose
>>>>>>>>>>>>>>>>>> between
>>>>>>>>>>>>>>>>>> MSIX and
>>>>>>>>>>>>>>>>>> MSI or could it theoretically have both enabled? Is this a
>>>>>>>>>>>>>>>>>> PCI
>>>>>>>>>>>>>>>>>> limitation,
>>>>>>>>>>>>>>>>>> a PAPR/XICS limitation or just a limitation of your
>>>>>>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>>>>>> My implementation does not have this limitation, I asked if
>>>>>>>>>>>>>>>>> I can
>>>>>>>>>>>>>>>>> simplify
>>>>>>>>>>>>>>>>> code by introducing one :)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I cannot see any reason why PCI cannot have both MSI and MSIX
>>>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>> it does not seem to be used by anyone => cannot debug and
>>>>>>>>>>>>>>>>> confirm.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when
>>>>>>>>>>>>>>>>> MSI is
>>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>>> enabled, this is a "change", not enabling both types. But
>>>>>>>>>>>>>>>>> it also
>>>>>>>>>>>>>>>>> says MSI
>>>>>>>>>>>>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>>>>>>>>>>>>> Yeah, I'm not aware of any limitation on hardware here and I'd
>>>>>>>>>>>>>>>> rather not impose one.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Michael, do you know of any hardware that uses MSI *and*
>>>>>>>>>>>>>>>> MSI-X at
>>>>>>>>>>>>>>>> the same time?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>> No, and the PCI spec says:
>>>>>>>>>>>>>>>         A function is permitted to implement both MSI and
>>>>>>>>>>>>>>> MSI-X, but
>>>>>>>>>>>>>>> system
>>>>>>>>>>>>>>>         software is
>>>>>>>>>>>>>>>         prohibited from enabling both at the same time. If system
>>>>>>>>>>>>>>> software
>>>>>>>>>>>>>>>         enables both at the same time, the result is undefined.
>>>>>>>>>>>>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>>>>>>>>>>>> Heh. This solves just half of the problem - I still have to keep
>>>>>>>>>>>>> track of
>>>>>>>>>>>>> what device got MSI/MSIX configured via that ibm,change-msi
>>>>>>>>>>>>> interface. I
>>>>>>>>>>>>> was hoping I can store such flag somewhere in a device PCI config
>>>>>>>>>>>>> space
>>>>>>>>>>>>> but
>>>>>>>>>>>>> MSI/MSIX enable bit is not good as it is not set when those
>>>>>>>>>>>>> calls are
>>>>>>>>>>>>> made.
>>>>>>>>>>>>> And I cannot rely on address/data fields much as the guest can
>>>>>>>>>>>>> change them
>>>>>>>>>>>>> (I already use them to store IRQ numbers and btw it is missing
>>>>>>>>>>>>> checks when
>>>>>>>>>>>>> I read them back for disposal, I'll fix in next round).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Or on "enable" event I could put IRQ numbers to .data of MSI
>>>>>>>>>>>>> config space
>>>>>>>>>>>>> and on "disable" check if it is not zero, then configuration took
>>>>>>>>>>>>> place,
>>>>>>>>>>>>> then I can remove my msi[]/msix[] flag arrays. If the guest did
>>>>>>>>>>>>> any change
>>>>>>>>>>>>> to MSI/MSIX config space (it does not on SPAPR except weird
>>>>>>>>>>>>> selftest
>>>>>>>>>>>>> cases), I compare .data with what ICS can possibly have and either
>>>>>>>>>>>>> reject
>>>>>>>>>>>>> "disable" or handle it and if it breaks XICS - that's too bad
>>>>>>>>>>>>> for the
>>>>>>>>>>>>> stupid guest. Would that be acceptable?
>>>>>>>>>>>> Can't you prohibit the guest from writing to the MSI configuration
>>>>>>>>>>>> registers itself? Then you don't need to do sanity checks.
>>>>>>>>>>> I could for emulated devices but VFIO uses the same code. For
>>>>>>>>>>> example,
>>>>>>>>>>> there is an IBM SCSI IPR card which does a "self test". For that, it
>>>>>>>>>>> saves
>>>>>>>>>>> MSIX BAR content, does reboot via some backdoor interface and
>>>>>>>>>>> restores MSIX
>>>>>>>>>>> BAR. It has been solved for VFIO in the host kernel by restoring
>>>>>>>>>>> MSIX data
>>>>>>>>>>> from cached values when guest is trying to restore it with what it
>>>>>>>>>>> thinks
>>>>>>>>>>> is actual MSIX data (it is virtualized because of x86). But there is
>>>>>>>>>>> cache
>>>>>>>>>> We already have a cache because we don't access the real PCI
>>>>>>>>>> registers with
>>>>>>>>>> msi_set_message(), no?
>>>>>>>>> For emulated devices there is no cache. And in any case the guest is
>>>>>>>>> allowed to write to it... Who knows what AIX does? I do not.
>>>>>>>> Tried GHashTable for keeping bus:dev.fn <-> (irq, num), more or less ok
>>>>>>>> but
>>>>>>>> how to migrate such thing? Temporary cache somewhere and then unpack
>>>>>>>> it? Or
>>>>>>>> use old style migration callbacks?
>>>>>>> Could you try to introduce a new vmstate type that just serializes and
>>>>>>> deserializes hash tables? Maybe there is already a serialization
>>>>>>> function for it in glib?
>>>>>> I have not found any (most likely I do not know how to search there),
>>>>>> I added mine, here are VMSTATE_HASH + its use for SPAPR.
>>>>>>
>>>>>>
>>>>>> Is this a movement to right direction? I need to put key/value sizes
>>>>>> into VMSTATE definition somehow but do not really want to touch
>>>>>> VMStateField.
>>>>> I'm not sure. Juan?
>>>> I also tried to simplify this particular thing more by assuming that the
>>>> key is always "int" and put size of value to VMStateField::size. But there
>>>> is no way to get the size in VMStateInfo::get(). Or I could do a
>>>> "subsection" and try implementing has as an array (and have get/put defined
>>>> for items, should work) but in this case I'll need to cache number of
>>>> elements of the hash table somewhere so I'll need a wrapper around
>>>> GHashTable.
>>>>
>>>> Making generalized version is not obvious for me :(
>>> Juan?
>>> Alex?
>>> Anybody?
>>> Ping.
>>>
>>> How do I migrate GHashTable? If I am allowed to use custom and bit more
>>> polished get/put from "[PATCH 1/2] vmstate: Add helper to enable GHashTable
>>> migration", I'll be fine.
>> Yeah, I think it's ok to be custom in this case. Or another crazy idea -
>> could you flatten the hash table into an array of structs that you can
>> describe using VMState? You could then convert from the flat array to/from
>> the GHashTable with pre_load/post_load hooks.
>
> Array is exactly what I am trying to get rid of. Ok, I'll remove hashmap at
> all and implement dynamic flat array (yay, yet another bicycle!).

Huh? The array would only live during the migration. It would be size=0 
during normal execution, but in a pre_save hook we could make size = 
hash.length() and reuse the existing, working VMState infrastructure.

>
>
>> The benefit of that would be that the data becomes more easily
>> introspectible with tools like my live migration parser.
> Wow. What is that parser?

https://lists.gnu.org/archive/html/qemu-devel/2013-10/msg02949.html

I still need to rewrite that thing to always send the JSON description 
of the stream after the migration's EOF marker. And implement support 
for the HTAB stream ;). But overall it works reasonably well.


Alex
Alexey Kardashevskiy May 28, 2014, 1:18 a.m. UTC | #22
On 05/28/2014 10:41 AM, Alexander Graf wrote:
> 
> On 28.05.14 02:34, Alexey Kardashevskiy wrote:
>> On 05/28/2014 09:55 AM, Alexander Graf wrote:
>>> On 27.05.14 06:51, Alexey Kardashevskiy wrote:
>>>> On 05/23/2014 12:25 AM, Alexey Kardashevskiy wrote:
>>>>> On 05/22/2014 08:57 PM, Alexander Graf wrote:
>>>>>> On 22.05.14 12:53, Alexey Kardashevskiy wrote:
>>>>>>> On 05/22/2014 05:16 PM, Alexander Graf wrote:>
>>>>>>>>> Am 22.05.2014 um 08:53 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>>>>>>>>>
>>>>>>>>>> On 05/21/2014 10:42 PM, Alexey Kardashevskiy wrote:
>>>>>>>>>>> On 05/21/2014 08:35 PM, Alexander Graf wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 21.05.14 12:13, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>> On 05/21/2014 07:50 PM, Alexander Graf wrote:
>>>>>>>>>>>>>> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>>>>>>>>>>>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>>>>>>>>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX
>>>>>>>>>>>>>>>>>>>> interrupt as
>>>>>>>>>>>>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a
>>>>>>>>>>>>>>>>>>>> problem for
>>>>>>>>>>>>>>>>>>>> dynamic MSI reconfiguration which is happening on guest
>>>>>>>>>>>>>>>>>>>> driver
>>>>>>>>>>>>>>>>>>>> reload or
>>>>>>>>>>>>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a
>>>>>>>>>>>>>>>>>>>> limit of
>>>>>>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there
>>>>>>>>>>>>>>>>>>>> is no
>>>>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>>>> reason
>>>>>>>>>>>>>>>>>>>> for that.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>>>>> IRQ
>>>>>>>>>>>>>>>>>>>> number
>>>>>>>>>>>>>>>>>>>> of a device is stored in MSI/MSIX config space so there
>>>>>>>>>>>>>>>>>>>> is no
>>>>>>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>>> store
>>>>>>>>>>>>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps
>>>>>>>>>>>>>>>>>>>> flags
>>>>>>>>>>>>>>>>>>>> telling
>>>>>>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>>>>>>> type
>>>>>>>>>>>>>>>>>>>> of interrupt for which device it has configured in
>>>>>>>>>>>>>>>>>>>> order to
>>>>>>>>>>>>>>>>>>>> return
>>>>>>>>>>>>>>>>>>>> error if
>>>>>>>>>>>>>>>>>>>> (for example) MSIX was enabled and the guest is trying to
>>>>>>>>>>>>>>>>>>>> disable
>>>>>>>>>>>>>>>>>>>> MSI
>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>> it has not enabled.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This removes a limit for the maximum number of
>>>>>>>>>>>>>>>>>>>> MSIX-enabled
>>>>>>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>>>>>>> per PHB,
>>>>>>>>>>>>>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This changes migration stream as it fixes
>>>>>>>>>>>>>>>>>>>> vmstate_spapr_pci_msi::name
>>>>>>>>>>>>>>>>>>>> which was
>>>>>>>>>>>>>>>>>>>> wrong since the beginning.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This fixed traces to be more informative.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I
>>>>>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>>>> remove
>>>>>>>>>>>>>>>>>>>> msi/msix
>>>>>>>>>>>>>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>>>>>>>>>>>>>> Is this a hard requirement? Does a device have to choose
>>>>>>>>>>>>>>>>>>> between
>>>>>>>>>>>>>>>>>>> MSIX and
>>>>>>>>>>>>>>>>>>> MSI or could it theoretically have both enabled? Is this a
>>>>>>>>>>>>>>>>>>> PCI
>>>>>>>>>>>>>>>>>>> limitation,
>>>>>>>>>>>>>>>>>>> a PAPR/XICS limitation or just a limitation of your
>>>>>>>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>>>>>>> My implementation does not have this limitation, I asked if
>>>>>>>>>>>>>>>>>> I can
>>>>>>>>>>>>>>>>>> simplify
>>>>>>>>>>>>>>>>>> code by introducing one :)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I cannot see any reason why PCI cannot have both MSI and
>>>>>>>>>>>>>>>>>> MSIX
>>>>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>> it does not seem to be used by anyone => cannot debug and
>>>>>>>>>>>>>>>>>> confirm.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when
>>>>>>>>>>>>>>>>>> MSI is
>>>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>>>> enabled, this is a "change", not enabling both types. But
>>>>>>>>>>>>>>>>>> it also
>>>>>>>>>>>>>>>>>> says MSI
>>>>>>>>>>>>>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>>>>>>>>>>>>>> Yeah, I'm not aware of any limitation on hardware here and
>>>>>>>>>>>>>>>>> I'd
>>>>>>>>>>>>>>>>> rather not impose one.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Michael, do you know of any hardware that uses MSI *and*
>>>>>>>>>>>>>>>>> MSI-X at
>>>>>>>>>>>>>>>>> the same time?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>>> No, and the PCI spec says:
>>>>>>>>>>>>>>>>         A function is permitted to implement both MSI and
>>>>>>>>>>>>>>>> MSI-X, but
>>>>>>>>>>>>>>>> system
>>>>>>>>>>>>>>>>         software is
>>>>>>>>>>>>>>>>         prohibited from enabling both at the same time. If
>>>>>>>>>>>>>>>> system
>>>>>>>>>>>>>>>> software
>>>>>>>>>>>>>>>>         enables both at the same time, the result is
>>>>>>>>>>>>>>>> undefined.
>>>>>>>>>>>>>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>>>>>>>>>>>>> Heh. This solves just half of the problem - I still have to keep
>>>>>>>>>>>>>> track of
>>>>>>>>>>>>>> what device got MSI/MSIX configured via that ibm,change-msi
>>>>>>>>>>>>>> interface. I
>>>>>>>>>>>>>> was hoping I can store such flag somewhere in a device PCI
>>>>>>>>>>>>>> config
>>>>>>>>>>>>>> space
>>>>>>>>>>>>>> but
>>>>>>>>>>>>>> MSI/MSIX enable bit is not good as it is not set when those
>>>>>>>>>>>>>> calls are
>>>>>>>>>>>>>> made.
>>>>>>>>>>>>>> And I cannot rely on address/data fields much as the guest can
>>>>>>>>>>>>>> change them
>>>>>>>>>>>>>> (I already use them to store IRQ numbers and btw it is missing
>>>>>>>>>>>>>> checks when
>>>>>>>>>>>>>> I read them back for disposal, I'll fix in next round).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Or on "enable" event I could put IRQ numbers to .data of MSI
>>>>>>>>>>>>>> config space
>>>>>>>>>>>>>> and on "disable" check if it is not zero, then configuration
>>>>>>>>>>>>>> took
>>>>>>>>>>>>>> place,
>>>>>>>>>>>>>> then I can remove my msi[]/msix[] flag arrays. If the guest did
>>>>>>>>>>>>>> any change
>>>>>>>>>>>>>> to MSI/MSIX config space (it does not on SPAPR except weird
>>>>>>>>>>>>>> selftest
>>>>>>>>>>>>>> cases), I compare .data with what ICS can possibly have and
>>>>>>>>>>>>>> either
>>>>>>>>>>>>>> reject
>>>>>>>>>>>>>> "disable" or handle it and if it breaks XICS - that's too bad
>>>>>>>>>>>>>> for the
>>>>>>>>>>>>>> stupid guest. Would that be acceptable?
>>>>>>>>>>>>> Can't you prohibit the guest from writing to the MSI
>>>>>>>>>>>>> configuration
>>>>>>>>>>>>> registers itself? Then you don't need to do sanity checks.
>>>>>>>>>>>> I could for emulated devices but VFIO uses the same code. For
>>>>>>>>>>>> example,
>>>>>>>>>>>> there is an IBM SCSI IPR card which does a "self test". For
>>>>>>>>>>>> that, it
>>>>>>>>>>>> saves
>>>>>>>>>>>> MSIX BAR content, does reboot via some backdoor interface and
>>>>>>>>>>>> restores MSIX
>>>>>>>>>>>> BAR. It has been solved for VFIO in the host kernel by restoring
>>>>>>>>>>>> MSIX data
>>>>>>>>>>>> from cached values when guest is trying to restore it with what it
>>>>>>>>>>>> thinks
>>>>>>>>>>>> is actual MSIX data (it is virtualized because of x86). But
>>>>>>>>>>>> there is
>>>>>>>>>>>> cache
>>>>>>>>>>> We already have a cache because we don't access the real PCI
>>>>>>>>>>> registers with
>>>>>>>>>>> msi_set_message(), no?
>>>>>>>>>> For emulated devices there is no cache. And in any case the guest is
>>>>>>>>>> allowed to write to it... Who knows what AIX does? I do not.
>>>>>>>>> Tried GHashTable for keeping bus:dev.fn <-> (irq, num), more or
>>>>>>>>> less ok
>>>>>>>>> but
>>>>>>>>> how to migrate such thing? Temporary cache somewhere and then unpack
>>>>>>>>> it? Or
>>>>>>>>> use old style migration callbacks?
>>>>>>>> Could you try to introduce a new vmstate type that just serializes and
>>>>>>>> deserializes hash tables? Maybe there is already a serialization
>>>>>>>> function for it in glib?
>>>>>>> I have not found any (most likely I do not know how to search there),
>>>>>>> I added mine, here are VMSTATE_HASH + its use for SPAPR.
>>>>>>>
>>>>>>>
>>>>>>> Is this a movement to right direction? I need to put key/value sizes
>>>>>>> into VMSTATE definition somehow but do not really want to touch
>>>>>>> VMStateField.
>>>>>> I'm not sure. Juan?
>>>>> I also tried to simplify this particular thing more by assuming that the
>>>>> key is always "int" and put size of value to VMStateField::size. But
>>>>> there
>>>>> is no way to get the size in VMStateInfo::get(). Or I could do a
>>>>> "subsection" and try implementing has as an array (and have get/put
>>>>> defined
>>>>> for items, should work) but in this case I'll need to cache number of
>>>>> elements of the hash table somewhere so I'll need a wrapper around
>>>>> GHashTable.
>>>>>
>>>>> Making generalized version is not obvious for me :(
>>>> Juan?
>>>> Alex?
>>>> Anybody?
>>>> Ping.
>>>>
>>>> How do I migrate GHashTable? If I am allowed to use custom and bit more
>>>> polished get/put from "[PATCH 1/2] vmstate: Add helper to enable
>>>> GHashTable
>>>> migration", I'll be fine.
>>> Yeah, I think it's ok to be custom in this case. Or another crazy idea -
>>> could you flatten the hash table into an array of structs that you can
>>> describe using VMState? You could then convert from the flat array to/from
>>> the GHashTable with pre_load/post_load hooks.
>>
>> Array is exactly what I am trying to get rid of. Ok, I'll remove hashmap at
>> all and implement dynamic flat array (yay, yet another bicycle!).
> 
> Huh? The array would only live during the migration. It would be size=0
> during normal execution, but in a pre_save hook we could make size =
> hash.length() and reuse the existing, working VMState infrastructure.

When would I free that array? What if I continue the source guest and then
migrate again? I mean I can solve all of this for sure but duplicating data
just to make existing migration happy is bit weird. But - I'll do what you
say here, it is no big deal :)


>>> The benefit of that would be that the data becomes more easily
>>> introspectible with tools like my live migration parser.
>> Wow. What is that parser?
> 
> https://lists.gnu.org/archive/html/qemu-devel/2013-10/msg02949.html
> 
> I still need to rewrite that thing to always send the JSON description of
> the stream after the migration's EOF marker. And implement support for the
> HTAB stream ;). But overall it works reasonably well.

Uh. Cool.
Alexander Graf May 28, 2014, 11:35 a.m. UTC | #23
On 28.05.14 03:18, Alexey Kardashevskiy wrote:
> On 05/28/2014 10:41 AM, Alexander Graf wrote:
>> On 28.05.14 02:34, Alexey Kardashevskiy wrote:
>>> On 05/28/2014 09:55 AM, Alexander Graf wrote:

...

>>>> How do I migrate GHashTable? If I am allowed to use custom and bit more
>>>> polished get/put from "[PATCH 1/2] vmstate: Add helper to enable
>>>> GHashTable
>>>> migration", I'll be fine.
>>>> Yeah, I think it's ok to be custom in this case. Or another crazy idea -
>>>> could you flatten the hash table into an array of structs that you can
>>>> describe using VMState? You could then convert from the flat array to/from
>>>> the GHashTable with pre_load/post_load hooks.
>>> Array is exactly what I am trying to get rid of. Ok, I'll remove hashmap at
>>> all and implement dynamic flat array (yay, yet another bicycle!).
>> Huh? The array would only live during the migration. It would be size=0
>> during normal execution, but in a pre_save hook we could make size =
>> hash.length() and reuse the existing, working VMState infrastructure.
> When would I free that array? What if I continue the source guest and then
> migrate again?

Something like

void pre_save(...) {
     free(s->array);
     s->array_len = s->hash.number_of_keys();
     s->array = g_malloc(s->array_len * sizeof(struct array_elem));
     for (i = 0; i < s->array_len; i++) {
         s->array[i].key = s->hash.key[i];
         s->array[i].value = s->hash.value[i];
     }
}

That would waste a few bytes when we continue after migration, but it 
should at least keep that overhead to a minimum.

> I mean I can solve all of this for sure but duplicating data
> just to make existing migration happy is bit weird. But - I'll do what you
> say here, it is no big deal :)

I don't find the concept of duplicating data for migration too odd - it 
sounds like a good compromise between introspectability and abstraction. 
If you have a better suggestion I'm all open :).


Alex
diff mbox

Patch

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index e574014..49e0382 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -220,36 +220,12 @@  static void rtas_write_pci_config(PowerPCCPU *cpu, sPAPREnvironment *spapr,
 }
 
 /*
- * Find an entry with config_addr or returns the empty one if not found AND
- * alloc_new is set.
- * At the moment the msi_table entries are never released so there is
- * no point to look till the end of the list if we need to find the free entry.
- */
-static int spapr_msicfg_find(sPAPRPHBState *phb, uint32_t config_addr,
-                             bool alloc_new)
-{
-    int i;
-
-    for (i = 0; i < SPAPR_MSIX_MAX_DEVS; ++i) {
-        if (!phb->msi_table[i].nvec) {
-            break;
-        }
-        if (phb->msi_table[i].config_addr == config_addr) {
-            return i;
-        }
-    }
-    if ((i < SPAPR_MSIX_MAX_DEVS) && alloc_new) {
-        trace_spapr_pci_msi("Allocating new MSI config", i, config_addr);
-        return i;
-    }
-
-    return -1;
-}
-
-/*
  * Set MSI/MSIX message data.
  * This is required for msi_notify()/msix_notify() which
  * will write at the addresses via spapr_msi_write().
+ *
+ * If hwaddr == 0, all entries will have .data == first_irq i.e.
+ * table will be reset.
  */
 static void spapr_msi_setmsg(PCIDevice *pdev, hwaddr addr, bool msix,
                              unsigned first_irq, unsigned req_num)
@@ -263,12 +239,51 @@  static void spapr_msi_setmsg(PCIDevice *pdev, hwaddr addr, bool msix,
         return;
     }
 
-    for (i = 0; i < req_num; ++i, ++msg.data) {
+    for (i = 0; i < req_num; ++i) {
         msix_set_message(pdev, i, msg);
         trace_spapr_pci_msi_setup(pdev->name, i, msg.address);
+        if (addr) {
+            ++msg.data;
+        }
     }
 }
 
+static unsigned spapr_msi_get(sPAPRPHBState *phb, PCIDevice *pdev,
+                              unsigned *num, bool *msix)
+{
+    MSIMessage msg;
+    unsigned irq = 0;
+    uint8_t offs = (pci_bus_num(pdev->bus) << SPAPR_PCI_BUS_SHIFT) |
+        PCI_SLOT(pdev->devfn);
+
+    if ((phb->msi[offs] & (1 << PCI_FUNC(pdev->devfn))) &&
+        (phb->msix[offs] & (1 << PCI_FUNC(pdev->devfn)))) {
+        error_report("Both MSI and MSIX configured! MSIX will be used.");
+    }
+
+    if (phb->msix[offs] & (1 << PCI_FUNC(pdev->devfn))) {
+        *num = pdev->msix_entries_nr;
+        if (*num) {
+            msg = msix_get_message(pdev, 0);
+            irq = msg.data;
+            if (msix) {
+                *msix = true;
+            }
+        }
+    } else if (phb->msi[offs] & (1 << PCI_FUNC(pdev->devfn))) {
+        *num = msi_nr_vectors_allocated(pdev);
+        if (*num) {
+            msg = msi_get_message(pdev, 0);
+            irq = msg.data;
+            if (msix) {
+                *msix = false;
+            }
+        }
+    }
+
+    return irq;
+}
+
 static void rtas_ibm_change_msi(PowerPCCPU *cpu, sPAPREnvironment *spapr,
                                 uint32_t token, uint32_t nargs,
                                 target_ulong args, uint32_t nret,
@@ -280,9 +295,10 @@  static void rtas_ibm_change_msi(PowerPCCPU *cpu, sPAPREnvironment *spapr,
     unsigned int req_num = rtas_ld(args, 4); /* 0 == remove all */
     unsigned int seq_num = rtas_ld(args, 5);
     unsigned int ret_intr_type;
-    int ndev, irq, max_irqs = 0;
+    unsigned int irq, max_irqs = 0, num = 0, offs;
     sPAPRPHBState *phb = NULL;
     PCIDevice *pdev = NULL;
+    bool msix = false;
 
     switch (func) {
     case RTAS_CHANGE_MSI_FN:
@@ -307,16 +323,29 @@  static void rtas_ibm_change_msi(PowerPCCPU *cpu, sPAPREnvironment *spapr,
         rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
         return;
     }
+    offs = (pci_bus_num(pdev->bus) << SPAPR_PCI_BUS_SHIFT) |
+        PCI_SLOT(pdev->devfn);
 
     /* Releasing MSIs */
     if (!req_num) {
-        ndev = spapr_msicfg_find(phb, config_addr, false);
-        if (ndev < 0) {
-            trace_spapr_pci_msi("MSI has not been enabled", -1, config_addr);
+        irq = spapr_msi_get(phb, pdev, &num, &msix);
+        if (!num || !irq ||
+            ((func == RTAS_CHANGE_MSI_FN) && msix) ||
+            ((func == RTAS_CHANGE_MSIX_FN) && !msix)) {
+            trace_spapr_pci_msi("Releasing wrong config", config_addr);
             rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
             return;
         }
-        trace_spapr_pci_msi("Released MSIs", ndev, config_addr);
+
+        xics_free(spapr->icp, 0, irq, num);
+        spapr_msi_setmsg(pdev, 0, msix, 0, num);
+
+        if (msix) {
+            phb->msix[offs] &= ~(1 << PCI_FUNC(pdev->devfn));
+        } else {
+            phb->msi[offs] &= ~(1 << PCI_FUNC(pdev->devfn));
+        }
+        trace_spapr_pci_msi("Released MSIs", config_addr);
         rtas_st(rets, 0, RTAS_OUT_SUCCESS);
         rtas_st(rets, 1, 0);
         return;
@@ -324,15 +353,6 @@  static void rtas_ibm_change_msi(PowerPCCPU *cpu, sPAPREnvironment *spapr,
 
     /* Enabling MSI */
 
-    /* Find a device number in the map to add or reuse the existing one */
-    ndev = spapr_msicfg_find(phb, config_addr, true);
-    if (ndev >= SPAPR_MSIX_MAX_DEVS || ndev < 0) {
-        error_report("No free entry for a new MSI device");
-        rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
-        return;
-    }
-    trace_spapr_pci_msi("Configuring MSI", ndev, config_addr);
-
     /* Check if the device supports as many IRQs as requested */
     if (ret_intr_type == RTAS_TYPE_MSI) {
         max_irqs = msi_nr_vectors_allocated(pdev);
@@ -340,48 +360,44 @@  static void rtas_ibm_change_msi(PowerPCCPU *cpu, sPAPREnvironment *spapr,
         max_irqs = pdev->msix_entries_nr;
     }
     if (!max_irqs) {
-        error_report("Requested interrupt type %d is not enabled for device#%d",
-                     ret_intr_type, ndev);
+        error_report("Requested interrupt type %d is not enabled for device %x",
+                     ret_intr_type, config_addr);
         rtas_st(rets, 0, -1); /* Hardware error */
         return;
     }
     /* Correct the number if the guest asked for too many */
     if (req_num > max_irqs) {
+        trace_spapr_pci_msi_retry(config_addr, req_num, max_irqs);
         req_num = max_irqs;
+        irq = 0; /* to avoid misleading trace */
+        goto out;
     }
 
-    /* Check if there is an old config and MSI number has not changed */
-    if (phb->msi_table[ndev].nvec && (req_num != phb->msi_table[ndev].nvec)) {
-        /* Unexpected behaviour */
-        error_report("Cannot reuse MSI config for device#%d", ndev);
+    /* Allocate MSIs */
+    irq = xics_alloc_block(spapr->icp, 0, req_num, false,
+                           ret_intr_type == RTAS_TYPE_MSI);
+    if (!irq) {
+        error_report("Cannot allocate MSIs for device %x", config_addr);
         rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
         return;
     }
 
-    /* There is no cached config, allocate MSIs */
-    if (!phb->msi_table[ndev].nvec) {
-        irq = xics_alloc_block(spapr->icp, 0, req_num, false,
-                               ret_intr_type == RTAS_TYPE_MSI);
-        if (irq < 0) {
-            error_report("Cannot allocate MSIs for device#%d", ndev);
-            rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
-            return;
-        }
-        phb->msi_table[ndev].irq = irq;
-        phb->msi_table[ndev].nvec = req_num;
-        phb->msi_table[ndev].config_addr = config_addr;
-    }
-
     /* Setup MSI/MSIX vectors in the device (via cfgspace or MSIX BAR) */
     spapr_msi_setmsg(pdev, spapr->msi_win_addr, ret_intr_type == RTAS_TYPE_MSIX,
-                     phb->msi_table[ndev].irq, req_num);
+                     irq, req_num);
+    if (ret_intr_type == RTAS_TYPE_MSIX) {
+        phb->msix[offs] |= 1 << PCI_FUNC(pdev->devfn);
+    } else {
+        phb->msi[offs] |= 1 << PCI_FUNC(pdev->devfn);
+    }
 
+out:
     rtas_st(rets, 0, RTAS_OUT_SUCCESS);
     rtas_st(rets, 1, req_num);
     rtas_st(rets, 2, ++seq_num);
     rtas_st(rets, 3, ret_intr_type);
 
-    trace_spapr_pci_rtas_ibm_change_msi(func, req_num);
+    trace_spapr_pci_rtas_ibm_change_msi(config_addr, func, req_num, irq);
 }
 
 static void rtas_ibm_query_interrupt_source_number(PowerPCCPU *cpu,
@@ -395,25 +411,28 @@  static void rtas_ibm_query_interrupt_source_number(PowerPCCPU *cpu,
     uint32_t config_addr = rtas_ld(args, 0);
     uint64_t buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
     unsigned int intr_src_num = -1, ioa_intr_num = rtas_ld(args, 3);
-    int ndev;
+    unsigned irq, num = 0;
     sPAPRPHBState *phb = NULL;
+    PCIDevice *pdev = NULL;
 
     /* Fins sPAPRPHBState */
     phb = find_phb(spapr, buid);
-    if (!phb) {
+    if (phb) {
+        pdev = find_dev(spapr, buid, config_addr);
+    }
+    if (!phb || !pdev) {
         rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
         return;
     }
 
     /* Find device descriptor and start IRQ */
-    ndev = spapr_msicfg_find(phb, config_addr, false);
-    if (ndev < 0) {
-        trace_spapr_pci_msi("MSI has not been enabled", -1, config_addr);
+    irq = spapr_msi_get(phb, pdev, &num, NULL);
+    if (!irq || !num || (ioa_intr_num >= num)) {
+        trace_spapr_pci_msi("Failed to return vector", config_addr);
         rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
         return;
     }
-
-    intr_src_num = phb->msi_table[ndev].irq + ioa_intr_num;
+    intr_src_num = irq + ioa_intr_num;
     trace_spapr_pci_rtas_ibm_query_interrupt_source_number(ioa_intr_num,
                                                            intr_src_num);
 
@@ -675,20 +694,6 @@  static const VMStateDescription vmstate_spapr_pci_lsi = {
     },
 };
 
-static const VMStateDescription vmstate_spapr_pci_msi = {
-    .name = "spapr_pci/lsi",
-    .version_id = 1,
-    .minimum_version_id = 1,
-    .minimum_version_id_old = 1,
-    .fields      = (VMStateField []) {
-        VMSTATE_UINT32(config_addr, struct spapr_pci_msi),
-        VMSTATE_UINT32(irq, struct spapr_pci_msi),
-        VMSTATE_UINT32(nvec, struct spapr_pci_msi),
-
-        VMSTATE_END_OF_LIST()
-    },
-};
-
 static const VMStateDescription vmstate_spapr_pci = {
     .name = "spapr_pci",
     .version_id = 1,
@@ -703,8 +708,8 @@  static const VMStateDescription vmstate_spapr_pci = {
         VMSTATE_UINT64_EQUAL(io_win_size, sPAPRPHBState),
         VMSTATE_STRUCT_ARRAY(lsi_table, sPAPRPHBState, PCI_NUM_PINS, 0,
                              vmstate_spapr_pci_lsi, struct spapr_pci_lsi),
-        VMSTATE_STRUCT_ARRAY(msi_table, sPAPRPHBState, SPAPR_MSIX_MAX_DEVS, 0,
-                             vmstate_spapr_pci_msi, struct spapr_pci_msi),
+        VMSTATE_UINT8_ARRAY(msi, sPAPRPHBState, PCI_BUS_MAX * PCI_SLOT_MAX),
+        VMSTATE_UINT8_ARRAY(msix, sPAPRPHBState, PCI_BUS_MAX * PCI_SLOT_MAX),
 
         VMSTATE_END_OF_LIST()
     },
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 970b4a9..7ef29ab 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -27,8 +27,6 @@ 
 #include "hw/pci/pci_host.h"
 #include "hw/ppc/xics.h"
 
-#define SPAPR_MSIX_MAX_DEVS 32
-
 #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
 
 #define SPAPR_PCI_HOST_BRIDGE(obj) \
@@ -55,11 +53,10 @@  typedef struct sPAPRPHBState {
         uint32_t irq;
     } lsi_table[PCI_NUM_PINS];
 
-    struct spapr_pci_msi {
-        uint32_t config_addr;
-        uint32_t irq;
-        uint32_t nvec;
-    } msi_table[SPAPR_MSIX_MAX_DEVS];
+#define PCI_BUS_MAX             0xFF
+#define SPAPR_PCI_BUS_SHIFT     5
+    uint8_t msi[PCI_BUS_MAX * PCI_SLOT_MAX];
+    uint8_t msix[PCI_BUS_MAX * PCI_SLOT_MAX];
 
     QLIST_ENTRY(sPAPRPHBState) list;
 } sPAPRPHBState;
diff --git a/trace-events b/trace-events
index 0a66c71..c180ea1 100644
--- a/trace-events
+++ b/trace-events
@@ -1159,12 +1159,13 @@  qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride,
 qxl_render_update_area_done(void *cookie) "%p"
 
 # hw/ppc/spapr_pci.c
-spapr_pci_msi(const char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)"
+spapr_pci_msi(const char *msg, uint32_t ca) "%s (cfg=%x)"
 spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64
-spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u"
+spapr_pci_rtas_ibm_change_msi(unsigned cfg, unsigned func, unsigned req, unsigned first) "cfgaddr %x func %u, requested %u, first irq %u"
 spapr_pci_rtas_ibm_query_interrupt_source_number(unsigned ioa, unsigned intr) "queries for #%u, IRQ%u"
 spapr_pci_msi_write(uint64_t addr, uint64_t data, uint32_t dt_irq) "@%"PRIx64"<=%"PRIx64" IRQ %u"
 spapr_pci_lsi_set(const char *busname, int pin, uint32_t irq) "%s PIN%d IRQ %u"
+spapr_pci_msi_retry(unsigned config_addr, unsigned req_num, unsigned max_irqs) "Guest device at %x asked %u, have only %u"
 
 # hw/intc/xics.c
 xics_icp_check_ipi(int server, uint8_t mfrr) "CPU %d can take IPI mfrr=%#x"