hw/pci: migration: Skip config space check for vendor specific capability during restore/load

Message ID 20240130095617.31661-1-vkale@nvidia.com
State New

Commit Message

Vinayak Kale Jan. 30, 2024, 9:56 a.m. UTC
During the restore phase of migration, QEMU compares the PCI device's config space against the config space
captured in the migration stream during the save operation. If the config space data does not match, the restore operation fails.

The config space check is done in get_pci_config_device(). By default, the VSC (vendor-specific capability) in config space is included in the check.

Ideally, QEMU should not check the VSC during restore/load. This patch skips the check by not setting pdev->cmask[] for VSC offsets in pci_add_capability().
If cmask[] is not set for an offset, QEMU skips the config space check for that offset.

Signed-off-by: Vinayak Kale <vkale@nvidia.com>
---
 hw/pci/pci.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
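
The mechanism described in the commit message can be sketched as a small standalone model. This is not QEMU code: struct model_dev, model_add_capability() and model_check_config() are hypothetical names, and the comparison assumes that a byte is checked only where cmask is set and wmask is clear (the real get_pci_config_device() may apply further masks).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CFG_SIZE    256
#define CAP_ID_VNDR 0x09 /* vendor-specific capability ID (PCI spec) */

/* Hypothetical stand-in for the PCIDevice fields the patch touches. */
struct model_dev {
    uint8_t config[CFG_SIZE];
    uint8_t wmask[CFG_SIZE]; /* set bits: guest-writable */
    uint8_t cmask[CFG_SIZE]; /* set bits: compared on load */
};

/* Models the patched tail of pci_add_capability(): a vendor-specific
 * capability leaves cmask untouched, so its bytes are never compared. */
static void model_add_capability(struct model_dev *d, uint8_t cap_id,
                                 size_t offset, size_t size)
{
    assert(offset + size <= CFG_SIZE);
    d->config[offset] = cap_id;
    /* Make capability read-only by default */
    memset(d->wmask + offset, 0, size);
    if (cap_id != CAP_ID_VNDR) {
        /* Check non-vendor-specific capability by default */
        memset(d->cmask + offset, 0xFF, size);
    }
}

/* Assumed shape of the load-time check: returns 0 when every checked,
 * read-only bit of the incoming image matches, -1 on the first mismatch. */
static int model_check_config(const struct model_dev *d,
                              const uint8_t *incoming)
{
    for (size_t i = 0; i < CFG_SIZE; i++) {
        if ((incoming[i] ^ d->config[i]) & d->cmask[i] & ~d->wmask[i]) {
            return -1;
        }
    }
    return 0;
}
```

In this model, with a vendor-specific capability at 0x40 and an MSI capability (ID 0x05) at 0x50, an incoming image that differs at 0x44 passes the check, while one that differs at 0x52 is rejected.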

Comments

Vinayak Kale Jan. 30, 2024, 6:02 p.m. UTC | #1
Missed adding Michael, Marcel, Alex and Avihai earlier, apologies.

Regards,
Vinayak

On 30/01/24 3:26 pm, Vinayak Kale wrote:
> In case of migration, during restore operation, qemu checks the config space of the pci device with the config space
> in the migration stream captured during save operation. In case of config space data mismatch, restore operation is failed.
> 
> config space check is done in function get_pci_config_device(). By default VSC (vendor-specific-capability) in config space is checked.
> 
> Ideally qemu should not check VSC during restore/load. This patch skips the check by not setting pdev->cmask[] for VSC offsets in pci_add_capability().
> If cmask[] is not set for an offset, then qemu skips config space check for that offset.
> 
> Signed-off-by: Vinayak Kale <vkale@nvidia.com>
> ---
>   hw/pci/pci.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 76080af580..32429109df 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2485,8 +2485,11 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>       memset(pdev->used + offset, 0xFF, QEMU_ALIGN_UP(size, 4));
>       /* Make capability read-only by default */
>       memset(pdev->wmask + offset, 0, size);
> -    /* Check capability by default */
> -    memset(pdev->cmask + offset, 0xFF, size);
> +
> +    if (cap_id != PCI_CAP_ID_VNDR) {
> +        /* Check non-vendor specific capability by default */
> +        memset(pdev->cmask + offset, 0xFF, size);
> +    }
>       return offset;
>   }
>
Alex Williamson Jan. 30, 2024, 6:58 p.m. UTC | #2
On Tue, 30 Jan 2024 23:32:26 +0530
Vinayak Kale <vkale@nvidia.com> wrote:

> Missed adding Michael, Marcel, Alex and Avihai earlier, apologies.
> 
> Regards,
> Vinayak
> 
> On 30/01/24 3:26 pm, Vinayak Kale wrote:
> > In case of migration, during restore operation, qemu checks the config space of the pci device with the config space
> > in the migration stream captured during save operation. In case of config space data mismatch, restore operation is failed.
> > 
> > config space check is done in function get_pci_config_device(). By default VSC (vendor-specific-capability) in config space is checked.
> > 
> > Ideally qemu should not check VSC during restore/load. This patch skips the check by not setting pdev->cmask[] for VSC offsets in pci_add_capability().
> > If cmask[] is not set for an offset, then qemu skips config space check for that offset.
> > 
> > Signed-off-by: Vinayak Kale <vkale@nvidia.com>
> > ---
> >   hw/pci/pci.c | 7 +++++--
> >   1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 76080af580..32429109df 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -2485,8 +2485,11 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
> >       memset(pdev->used + offset, 0xFF, QEMU_ALIGN_UP(size, 4));
> >       /* Make capability read-only by default */
> >       memset(pdev->wmask + offset, 0, size);
> > -    /* Check capability by default */
> > -    memset(pdev->cmask + offset, 0xFF, size);
> > +
> > +    if (cap_id != PCI_CAP_ID_VNDR) {
> > +        /* Check non-vendor specific capability by default */
> > +        memset(pdev->cmask + offset, 0xFF, size);
> > +    }
> >       return offset;
> >   }
> >     
> 

If there is a possibility that the data within the vendor specific cap
can be consumed by the driver or diagnostic tools, then it's part of
the device ABI and should be consistent across migration.  A mismatch
can certainly cause a migration failure, but why shouldn't it?

This might be arguably ok (with more details) for a specific device,
but I don't think it can be the default given the arbitrary data
vendors can expose here.  Also, if this one, why not also the vendor
specific extended capability?  Thanks,

Alex
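
For reference, the vendor-specific extended capability (VSEC) raised above lives in PCIe extended config space, which pci_add_capability() does not cover. The sketch below shows how such a capability could be located in a raw 4 KiB config image. It assumes the layout from the PCIe spec (extended capability ID 0x000B, list starting at 0x100, VSEC length in the top 12 bits of the second dword); find_vsec() and read_le32() are hypothetical helpers, not QEMU functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXT_CFG_SIZE    4096
#define EXT_CAP_START   0x100  /* first extended capability header */
#define EXT_CAP_ID_VNDR 0x000B /* vendor-specific extended capability */
#define EXT_WALK_LIMIT  ((EXT_CFG_SIZE - EXT_CAP_START) / 8)

/* Config space is little-endian regardless of host byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Finds the first VSEC in a 4 KiB config image. On success returns 1 and
 * stores its offset and its length in bytes (taken from the VSEC header);
 * returns 0 if there is none. */
static int find_vsec(const uint8_t *config, uint16_t *offset,
                     uint16_t *length)
{
    uint16_t pos = EXT_CAP_START;
    int ttl = EXT_WALK_LIMIT; /* guard against a looping list */

    assert(config != NULL && offset != NULL && length != NULL);
    while (pos >= EXT_CAP_START && pos + 8 <= EXT_CFG_SIZE && ttl-- > 0) {
        uint32_t header = read_le32(config + pos);

        if (header == 0) {
            break; /* empty extended capability list */
        }
        if ((header & 0xFFFF) == EXT_CAP_ID_VNDR) {
            uint32_t vsec_header = read_le32(config + pos + 4);

            *offset = pos;
            *length = (uint16_t)(vsec_header >> 20);
            return 1;
        }
        pos = (uint16_t)((header >> 20) & 0xFFC); /* next capability */
    }
    return 0;
}
```

A caller that wanted the same relaxation for VSEC would clear cmask over the range [offset, offset + length) returned here. Whether to do that by default is the open question in this thread.
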
Vinayak Kale Jan. 31, 2024, 9:52 a.m. UTC | #3
On 31/01/24 12:28 am, Alex Williamson wrote:
> 
> On Tue, 30 Jan 2024 23:32:26 +0530
> Vinayak Kale <vkale@nvidia.com> wrote:
> 
>> Missed adding Michael, Marcel, Alex and Avihai earlier, apologies.
>>
>> Regards,
>> Vinayak
>>
>> On 30/01/24 3:26 pm, Vinayak Kale wrote:
>>> In case of migration, during restore operation, qemu checks the config space of the pci device with the config space
>>> in the migration stream captured during save operation. In case of config space data mismatch, restore operation is failed.
>>>
>>> config space check is done in function get_pci_config_device(). By default VSC (vendor-specific-capability) in config space is checked.
>>>
>>> Ideally qemu should not check VSC during restore/load. This patch skips the check by not setting pdev->cmask[] for VSC offsets in pci_add_capability().
>>> If cmask[] is not set for an offset, then qemu skips config space check for that offset.
>>>
>>> Signed-off-by: Vinayak Kale <vkale@nvidia.com>
>>> ---
>>>    hw/pci/pci.c | 7 +++++--
>>>    1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>> index 76080af580..32429109df 100644
>>> --- a/hw/pci/pci.c
>>> +++ b/hw/pci/pci.c
>>> @@ -2485,8 +2485,11 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>        memset(pdev->used + offset, 0xFF, QEMU_ALIGN_UP(size, 4));
>>>        /* Make capability read-only by default */
>>>        memset(pdev->wmask + offset, 0, size);
>>> -    /* Check capability by default */
>>> -    memset(pdev->cmask + offset, 0xFF, size);
>>> +
>>> +    if (cap_id != PCI_CAP_ID_VNDR) {
>>> +        /* Check non-vendor specific capability by default */
>>> +        memset(pdev->cmask + offset, 0xFF, size);
>>> +    }
>>>        return offset;
>>>    }
>>>
>>
> 
> If there is a possibility that the data within the vendor specific cap
> can be consumed by the driver or diagnostic tools, then it's part of
> the device ABI and should be consistent across migration.  A mismatch
> can certainly cause a migration failure, but why shouldn't it?

Sure, the device ABI should be consistent across migration. In the case of 
VSC, it should represent the same format on source and destination. But 
shouldn't the VSC content check, or its interpretation, be left to the vendor 
driver instead of QEMU?

> 
> This might be arguably ok (with more details) for a specific device,
> but I don't think it can be the default given the arbitrary data
> vendors can expose here.  Also, if this one, why not also the vendor
> specific extended capability?  Thanks,

I'll cover VSEC in the next version of the patch.

> 
> Alex
>
Alex Williamson Jan. 31, 2024, 5:38 p.m. UTC | #4
On Wed, 31 Jan 2024 15:22:59 +0530
Vinayak Kale <vkale@nvidia.com> wrote:

> On 31/01/24 12:28 am, Alex Williamson wrote:
> > 
> > On Tue, 30 Jan 2024 23:32:26 +0530
> > Vinayak Kale <vkale@nvidia.com> wrote:
> >   
> >> Missed adding Michael, Marcel, Alex and Avihai earlier, apologies.
> >>
> >> Regards,
> >> Vinayak
> >>
> >> On 30/01/24 3:26 pm, Vinayak Kale wrote:  
> >>> In case of migration, during restore operation, qemu checks the config space of the pci device with the config space
> >>> in the migration stream captured during save operation. In case of config space data mismatch, restore operation is failed.
> >>>
> >>> config space check is done in function get_pci_config_device(). By default VSC (vendor-specific-capability) in config space is checked.
> >>>
> >>> Ideally qemu should not check VSC during restore/load. This patch skips the check by not setting pdev->cmask[] for VSC offsets in pci_add_capability().
> >>> If cmask[] is not set for an offset, then qemu skips config space check for that offset.
> >>>
> >>> Signed-off-by: Vinayak Kale <vkale@nvidia.com>
> >>> ---
> >>>    hw/pci/pci.c | 7 +++++--
> >>>    1 file changed, 5 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >>> index 76080af580..32429109df 100644
> >>> --- a/hw/pci/pci.c
> >>> +++ b/hw/pci/pci.c
> >>> @@ -2485,8 +2485,11 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
> >>>        memset(pdev->used + offset, 0xFF, QEMU_ALIGN_UP(size, 4));
> >>>        /* Make capability read-only by default */
> >>>        memset(pdev->wmask + offset, 0, size);
> >>> -    /* Check capability by default */
> >>> -    memset(pdev->cmask + offset, 0xFF, size);
> >>> +
> >>> +    if (cap_id != PCI_CAP_ID_VNDR) {
> >>> +        /* Check non-vendor specific capability by default */
> >>> +        memset(pdev->cmask + offset, 0xFF, size);
> >>> +    }
> >>>        return offset;
> >>>    }
> >>>  
> >>  
> > 
> > If there is a possibility that the data within the vendor specific cap
> > can be consumed by the driver or diagnostic tools, then it's part of
> > the device ABI and should be consistent across migration.  A mismatch
> > can certainly cause a migration failure, but why shouldn't it?  
> 
> Sure, the device ABI should be consistent across migration. In case of 
> VSC, it should represent same format on source and destination. But 
> shouldn't VSC content check or its interpretation be left to vendor 
> driver instead of qemu?

By "vendor driver" here, are you suggesting that QEMU device models (ex.
hw/net/{e1000*,igb*,rtl8139*}) should perform that validation?  If so,
where's the patch that introduces any sort of validation hooks for
vendors to provide?  Where is this validation going to happen in the
case of migratable vfio-pci variant devices?  Nothing about this
patch suggests that it's deferring responsibility to some other code
entity, it only indicates "checking this breaks, let's not do it".

It's possible that the device you care about only reports volatile
diagnostic information through the vendor specific capability, but
another device might use it to report information relative to the
internal hardware configuration.  Without knowing what the vendor
specific capability contains, QEMU needs to take the most conservative
approach by default.  Thanks,

Alex
Vinayak Kale Feb. 1, 2024, 5:38 p.m. UTC | #5
On 31/01/24 11:08 pm, Alex Williamson wrote:
> 
> On Wed, 31 Jan 2024 15:22:59 +0530
> Vinayak Kale <vkale@nvidia.com> wrote:
> 
>> On 31/01/24 12:28 am, Alex Williamson wrote:
>>>
>>> On Tue, 30 Jan 2024 23:32:26 +0530
>>> Vinayak Kale <vkale@nvidia.com> wrote:
>>>
>>>> Missed adding Michael, Marcel, Alex and Avihai earlier, apologies.
>>>>
>>>> Regards,
>>>> Vinayak
>>>>
>>>> On 30/01/24 3:26 pm, Vinayak Kale wrote:
>>>>> In case of migration, during restore operation, qemu checks the config space of the pci device with the config space
>>>>> in the migration stream captured during save operation. In case of config space data mismatch, restore operation is failed.
>>>>>
>>>>> config space check is done in function get_pci_config_device(). By default VSC (vendor-specific-capability) in config space is checked.
>>>>>
>>>>> Ideally qemu should not check VSC during restore/load. This patch skips the check by not setting pdev->cmask[] for VSC offsets in pci_add_capability().
>>>>> If cmask[] is not set for an offset, then qemu skips config space check for that offset.
>>>>>
>>>>> Signed-off-by: Vinayak Kale <vkale@nvidia.com>
>>>>> ---
>>>>>     hw/pci/pci.c | 7 +++++--
>>>>>     1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>>> index 76080af580..32429109df 100644
>>>>> --- a/hw/pci/pci.c
>>>>> +++ b/hw/pci/pci.c
>>>>> @@ -2485,8 +2485,11 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>         memset(pdev->used + offset, 0xFF, QEMU_ALIGN_UP(size, 4));
>>>>>         /* Make capability read-only by default */
>>>>>         memset(pdev->wmask + offset, 0, size);
>>>>> -    /* Check capability by default */
>>>>> -    memset(pdev->cmask + offset, 0xFF, size);
>>>>> +
>>>>> +    if (cap_id != PCI_CAP_ID_VNDR) {
>>>>> +        /* Check non-vendor specific capability by default */
>>>>> +        memset(pdev->cmask + offset, 0xFF, size);
>>>>> +    }
>>>>>         return offset;
>>>>>     }
>>>>>
>>>>
>>>
>>> If there is a possibility that the data within the vendor specific cap
>>> can be consumed by the driver or diagnostic tools, then it's part of
>>> the device ABI and should be consistent across migration.  A mismatch
>>> can certainly cause a migration failure, but why shouldn't it?
>>
>> Sure, the device ABI should be consistent across migration. In case of
>> VSC, it should represent same format on source and destination. But
>> shouldn't VSC content check or its interpretation be left to vendor
>> driver instead of qemu?
> 
> By "vendor driver" here, are you suggesting that QEMU device models (ex.
> hw/net/{e1000*,igb*,rtl8139*}) should perform that validation?  If so,
> where's the patch that introduces any sort of validation hooks for
> vendors to provide?  Where is this validation going to happen in the
> case of a migratable vfio-pci variant devices?  Nothing about this
> patch suggests that it's deferring responsibility to some other code
> entity, it only indicates "checking this breaks, let's not do it".
> 
> It's possible that the device you care about only reports volatile
> diagnostic information through the vendor specific capability, but
> another device might use it to report information relative to the
> internal hardware configuration.  Without knowing what the vendor
> specific capability contains, QEMU needs to take the most conservative
> approach by default.  Thanks,

The PCI/PCIe spec doesn’t define an ABI for VSC/VSEC contents. Any code 
entity other than the vendor driver should ignore VSC contents. QEMU’s 
expectation that VSC contents be equal on source and destination seems 
incorrect, given that QEMU has no control over the ABI for VSC contents.

> 
> Alex
>
Michael S. Tsirkin Feb. 1, 2024, 6:10 p.m. UTC | #6
On Thu, Feb 01, 2024 at 11:08:58PM +0530, Vinayak Kale wrote:
> 
> On 31/01/24 11:08 pm, Alex Williamson wrote:
> > 
> > On Wed, 31 Jan 2024 15:22:59 +0530
> > Vinayak Kale <vkale@nvidia.com> wrote:
> > 
> > > On 31/01/24 12:28 am, Alex Williamson wrote:
> > > > 
> > > > On Tue, 30 Jan 2024 23:32:26 +0530
> > > > Vinayak Kale <vkale@nvidia.com> wrote:
> > > > 
> > > > > Missed adding Michael, Marcel, Alex and Avihai earlier, apologies.
> > > > > 
> > > > > Regards,
> > > > > Vinayak
> > > > > 
> > > > > On 30/01/24 3:26 pm, Vinayak Kale wrote:
> > > > > > In case of migration, during restore operation, qemu checks the config space of the pci device with the config space
> > > > > > in the migration stream captured during save operation. In case of config space data mismatch, restore operation is failed.
> > > > > > 
> > > > > > config space check is done in function get_pci_config_device(). By default VSC (vendor-specific-capability) in config space is checked.
> > > > > > 
> > > > > > Ideally qemu should not check VSC during restore/load. This patch skips the check by not setting pdev->cmask[] for VSC offsets in pci_add_capability().
> > > > > > If cmask[] is not set for an offset, then qemu skips config space check for that offset.
> > > > > > 
> > > > > > Signed-off-by: Vinayak Kale <vkale@nvidia.com>
> > > > > > ---
> > > > > >     hw/pci/pci.c | 7 +++++--
> > > > > >     1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > > > > > index 76080af580..32429109df 100644
> > > > > > --- a/hw/pci/pci.c
> > > > > > +++ b/hw/pci/pci.c
> > > > > > @@ -2485,8 +2485,11 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
> > > > > >         memset(pdev->used + offset, 0xFF, QEMU_ALIGN_UP(size, 4));
> > > > > >         /* Make capability read-only by default */
> > > > > >         memset(pdev->wmask + offset, 0, size);
> > > > > > -    /* Check capability by default */
> > > > > > -    memset(pdev->cmask + offset, 0xFF, size);
> > > > > > +
> > > > > > +    if (cap_id != PCI_CAP_ID_VNDR) {
> > > > > > +        /* Check non-vendor specific capability by default */
> > > > > > +        memset(pdev->cmask + offset, 0xFF, size);
> > > > > > +    }
> > > > > >         return offset;
> > > > > >     }
> > > > > > 
> > > > > 
> > > > 
> > > > If there is a possibility that the data within the vendor specific cap
> > > > can be consumed by the driver or diagnostic tools, then it's part of
> > > > the device ABI and should be consistent across migration.  A mismatch
> > > > can certainly cause a migration failure, but why shouldn't it?
> > > 
> > > Sure, the device ABI should be consistent across migration. In case of
> > > VSC, it should represent same format on source and destination. But
> > > shouldn't VSC content check or its interpretation be left to vendor
> > > driver instead of qemu?
> > 
> > By "vendor driver" here, are you suggesting that QEMU device models (ex.
> > hw/net/{e1000*,igb*,rtl8139*}) should perform that validation?  If so,
> > where's the patch that introduces any sort of validation hooks for
> > vendors to provide?  Where is this validation going to happen in the
> > case of a migratable vfio-pci variant devices?  Nothing about this
> > patch suggests that it's deferring responsibility to some other code
> > entity, it only indicates "checking this breaks, let's not do it".
> > 
> > It's possible that the device you care about only reports volatile
> > diagnostic information through the vendor specific capability, but
> > another device might use it to report information relative to the
> > internal hardware configuration.  Without knowing what the vendor
> > specific capability contains, QEMU needs to take the most conservative
> > approach by default.  Thanks,
> 
> PCI/PCIe spec doesn’t define ABI for VSC/VSEC contents. Any other code
> entity except vendor driver should ignore VSC contents. QEMU’s expectation
> of VSC contents to be equal on source and destination seems incorrect given
> that QEMU has no control over ABI for VSC contents.
> 
> > 
> > Alex
> > 

I don't get why this matters though. This is no different from any other
device specific register. If a register is visible to guest it generally
should not change across migration. If you are migrating a VFIO device and
you are making a vsc visible to guest then your migration routine must
make sure to migrate the contents of vsc.

Maybe there's a good reason to have a register which actually does
change. Then, please document the actual reason. When you say:

	 Ideally qemu should not check VSC during restore/load.

then that is clearly wrong in most cases.
Alex Williamson Feb. 2, 2024, 12:03 a.m. UTC | #7
On Thu, 1 Feb 2024 13:10:48 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Feb 01, 2024 at 11:08:58PM +0530, Vinayak Kale wrote:
> > 
> > On 31/01/24 11:08 pm, Alex Williamson wrote:  
> > > 
> > > On Wed, 31 Jan 2024 15:22:59 +0530
> > > Vinayak Kale <vkale@nvidia.com> wrote:
> > >   
> > > > On 31/01/24 12:28 am, Alex Williamson wrote:  
> > > > > 
> > > > > On Tue, 30 Jan 2024 23:32:26 +0530
> > > > > Vinayak Kale <vkale@nvidia.com> wrote:
> > > > >   
> > > > > > Missed adding Michael, Marcel, Alex and Avihai earlier, apologies.
> > > > > > 
> > > > > > Regards,
> > > > > > Vinayak
> > > > > > 
> > > > > > On 30/01/24 3:26 pm, Vinayak Kale wrote:  
> > > > > > > In case of migration, during restore operation, qemu checks the config space of the pci device with the config space
> > > > > > > in the migration stream captured during save operation. In case of config space data mismatch, restore operation is failed.
> > > > > > > 
> > > > > > > config space check is done in function get_pci_config_device(). By default VSC (vendor-specific-capability) in config space is checked.
> > > > > > > 
> > > > > > > Ideally qemu should not check VSC during restore/load. This patch skips the check by not setting pdev->cmask[] for VSC offsets in pci_add_capability().
> > > > > > > If cmask[] is not set for an offset, then qemu skips config space check for that offset.
> > > > > > > 
> > > > > > > Signed-off-by: Vinayak Kale <vkale@nvidia.com>
> > > > > > > ---
> > > > > > >     hw/pci/pci.c | 7 +++++--
> > > > > > >     1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > > > > > > index 76080af580..32429109df 100644
> > > > > > > --- a/hw/pci/pci.c
> > > > > > > +++ b/hw/pci/pci.c
> > > > > > > @@ -2485,8 +2485,11 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
> > > > > > >         memset(pdev->used + offset, 0xFF, QEMU_ALIGN_UP(size, 4));
> > > > > > >         /* Make capability read-only by default */
> > > > > > >         memset(pdev->wmask + offset, 0, size);
> > > > > > > -    /* Check capability by default */
> > > > > > > -    memset(pdev->cmask + offset, 0xFF, size);
> > > > > > > +
> > > > > > > +    if (cap_id != PCI_CAP_ID_VNDR) {
> > > > > > > +        /* Check non-vendor specific capability by default */
> > > > > > > +        memset(pdev->cmask + offset, 0xFF, size);
> > > > > > > +    }
> > > > > > >         return offset;
> > > > > > >     }
> > > > > > >   
> > > > > >   
> > > > > 
> > > > > If there is a possibility that the data within the vendor specific cap
> > > > > can be consumed by the driver or diagnostic tools, then it's part of
> > > > > the device ABI and should be consistent across migration.  A mismatch
> > > > > can certainly cause a migration failure, but why shouldn't it?  
> > > > 
> > > > Sure, the device ABI should be consistent across migration. In case of
> > > > VSC, it should represent same format on source and destination. But
> > > > shouldn't VSC content check or its interpretation be left to vendor
> > > > driver instead of qemu?  
> > > 
> > > By "vendor driver" here, are you suggesting that QEMU device models (ex.
> > > hw/net/{e1000*,igb*,rtl8139*}) should perform that validation?  If so,
> > > where's the patch that introduces any sort of validation hooks for
> > > vendors to provide?  Where is this validation going to happen in the
> > > case of a migratable vfio-pci variant devices?  Nothing about this
> > > patch suggests that it's deferring responsibility to some other code
> > > entity, it only indicates "checking this breaks, let's not do it".
> > > 
> > > It's possible that the device you care about only reports volatile
> > > diagnostic information through the vendor specific capability, but
> > > another device might use it to report information relative to the
> > > internal hardware configuration.  Without knowing what the vendor
> > > specific capability contains, QEMU needs to take the most conservative
> > > approach by default.  Thanks,  
> > 
> > PCI/PCIe spec doesn’t define ABI for VSC/VSEC contents. Any other code
> > entity except vendor driver should ignore VSC contents. QEMU’s expectation
> > of VSC contents to be equal on source and destination seems incorrect given
> > that QEMU has no control over ABI for VSC contents.
> >   
> > > 
> > > Alex
> > >   
> 
> I don't get why this matters though. This is no different from any other
> device specific register. If a register is visible to guest it generally
> should not change across migration. If you are migrating a VFIO device and
> you are making a vsc visible to guest then your migration routine must
> make sure to migrate the contents of vsc.
> 
> Maybe there's a good reason to have a register which actually does
> change. Then, please document the actual reason. When you say:
> 
> 	 Ideally qemu should not check VSC during restore/load.
> 
> then that is clearly wrong in most cases.

The argument as I understand it is that enforcing that the contents
remain unchanged between source and target is a policy, but QEMU has
no basis to create such a policy because the ABI for this capability is
not defined by the spec.  Furthermore since the ABI of this capability
is only defined by the device, only the driver for the device should
have any interaction with the capability.

There's some merit there, but a potential flaw is that QEMU does in
fact register several vendor specific capabilities for which it does
know the ABI.  See for example:

	vfio_add_vmd_shadow_cap()
	vfio_add_nv_gpudirect_cap()
	pci_bridge_qemu_reserve_cap_init()
	virtio_pci_add_mem_cap()

I think only the vfio VMD one can be claimed not to support migration
since afaik we could add a GPUDirect clique capability to a migratable
vGPU.  So since these are QEMU created capabilities and QEMU does
understand the ABI, what about these specific capabilities would allow
them to change between source and target VMs?

The GPUDirect clique capability describes p2p capabilities between
devices.  That information cannot spontaneously change.

The bridge vendor capability exposes reserved bridge resources, which I
think is consumed by firmware so a change would result in resources
shifting after migration and a guest reboot.  That would be bad.

The virtio vendor capability does a variety of things that I'm sure MST
could explain better than me, but also seems like defined ABI to the VM.

So actually is it the case that the only vendor specific capabilities
exposed to the VM that QEMU doesn't understand are those provided from
a device exposed through vfio-pci?  If that's the case, then QEMU's
default policy to verify that the vendor capability is unchanged seems
to be valid and I'd think the patch instead should be creating
exceptions to that default policy for vfio-pci devices (ie. nothing
says vfio-pci cannot clear cmask bits set by the pci-core).

I think we also see above that there are vendor specific capabilities
that have a scope beyond the guest driver itself, some visible through
userspace tools, some consumed by firmware.  So I still find it
difficult to determine whether any vfio-pci exception should be
unconditional or tied to specific devices.

Also, migration is only supported by vfio-pci variant drivers, so why
is it that the variant driver cannot or should not make the target
capability match the source?  If the contents of the capability
change, does it need to be exposed at all by the variant driver, or is
the vendor specific capability the only means to expose this
information for the driver?  Thanks,

Alex
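
The suggestion above, that a device model may clear cmask bits set by the pci-core, could look roughly like the following. This is a sketch under stated assumptions and not the follow-up patch: relax_vendor_caps() is a hypothetical helper operating on plain byte arrays (in QEMU these would be pdev->config and pdev->cmask), and it keeps the ID, next pointer and length bytes checked so that the capability layout must still match between source and target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CFG_SIZE        256
#define CFG_STATUS      0x06 /* Status register, low byte */
#define STATUS_CAP_LIST 0x10 /* Capabilities List bit */
#define CFG_CAP_PTR     0x34 /* Capabilities Pointer */
#define CAP_ID_VNDR     0x09 /* vendor-specific capability ID */
#define CAP_WALK_LIMIT  48   /* guard against a looping list */

/* Walks the standard capability list and clears cmask over the body of
 * every vendor-specific capability. Returns the number of capabilities
 * relaxed. The 3-byte header (ID, next pointer, length) stays checked. */
static int relax_vendor_caps(const uint8_t *config, uint8_t *cmask)
{
    int relaxed = 0;
    int ttl = CAP_WALK_LIMIT;
    uint8_t pos;

    assert(config != NULL && cmask != NULL);
    if (!(config[CFG_STATUS] & STATUS_CAP_LIST)) {
        return 0; /* device has no capability list */
    }

    pos = config[CFG_CAP_PTR] & 0xFC;
    while (pos >= 0x40 && ttl-- > 0) {
        if (config[pos] == CAP_ID_VNDR) {
            /* Byte 2 holds the capability length, header included. */
            size_t len = config[pos + 2];

            if (len > 3 && (size_t)pos + len <= CFG_SIZE) {
                memset(cmask + pos + 3, 0, len - 3);
                relaxed++;
            }
        }
        pos = config[pos + 1] & 0xFC; /* follow the next pointer */
    }
    return relaxed;
}
```

Calling this once after the device's capabilities have been registered leaves every other capability fully checked. Whether the exception should apply to all vfio-pci devices or only to specific ones is left open here, as it is in the discussion above.
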
Vinayak Kale Feb. 9, 2024, 9:17 a.m. UTC | #8
On 02/02/24 5:33 am, Alex Williamson wrote:
> 
> On Thu, 1 Feb 2024 13:10:48 -0500
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
>> On Thu, Feb 01, 2024 at 11:08:58PM +0530, Vinayak Kale wrote:
>>>
>>> On 31/01/24 11:08 pm, Alex Williamson wrote:
>>>>
>>>> On Wed, 31 Jan 2024 15:22:59 +0530
>>>> Vinayak Kale <vkale@nvidia.com> wrote:
>>>>
>>>>> On 31/01/24 12:28 am, Alex Williamson wrote:
>>>>>>
>>>>>> On Tue, 30 Jan 2024 23:32:26 +0530
>>>>>> Vinayak Kale <vkale@nvidia.com> wrote:
>>>>>>
>>>>>>> Missed adding Michael, Marcel, Alex and Avihai earlier, apologies.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Vinayak
>>>>>>>
>>>>>>> On 30/01/24 3:26 pm, Vinayak Kale wrote:
>>>>>>>> In case of migration, during restore operation, qemu checks the config space of the pci device with the config space
>>>>>>>> in the migration stream captured during save operation. In case of config space data mismatch, restore operation is failed.
>>>>>>>>
>>>>>>>> config space check is done in function get_pci_config_device(). By default VSC (vendor-specific-capability) in config space is checked.
>>>>>>>>
>>>>>>>> Ideally qemu should not check VSC during restore/load. This patch skips the check by not setting pdev->cmask[] for VSC offsets in pci_add_capability().
>>>>>>>> If cmask[] is not set for an offset, then qemu skips config space check for that offset.
>>>>>>>>
>>>>>>>> Signed-off-by: Vinayak Kale <vkale@nvidia.com>
>>>>>>>> ---
>>>>>>>>      hw/pci/pci.c | 7 +++++--
>>>>>>>>      1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>>>>>> index 76080af580..32429109df 100644
>>>>>>>> --- a/hw/pci/pci.c
>>>>>>>> +++ b/hw/pci/pci.c
>>>>>>>> @@ -2485,8 +2485,11 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>>>>          memset(pdev->used + offset, 0xFF, QEMU_ALIGN_UP(size, 4));
>>>>>>>>          /* Make capability read-only by default */
>>>>>>>>          memset(pdev->wmask + offset, 0, size);
>>>>>>>> -    /* Check capability by default */
>>>>>>>> -    memset(pdev->cmask + offset, 0xFF, size);
>>>>>>>> +
>>>>>>>> +    if (cap_id != PCI_CAP_ID_VNDR) {
>>>>>>>> +        /* Check non-vendor specific capability by default */
>>>>>>>> +        memset(pdev->cmask + offset, 0xFF, size);
>>>>>>>> +    }
>>>>>>>>          return offset;
>>>>>>>>      }
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> If there is a possibility that the data within the vendor specific cap
>>>>>> can be consumed by the driver or diagnostic tools, then it's part of
>>>>>> the device ABI and should be consistent across migration.  A mismatch
>>>>>> can certainly cause a migration failure, but why shouldn't it?
>>>>>
>>>>> Sure, the device ABI should be consistent across migration. In case of
>>>>> VSC, it should represent same format on source and destination. But
>>>>> shouldn't VSC content check or its interpretation be left to vendor
>>>>> driver instead of qemu?
>>>>
>>>> By "vendor driver" here, are you suggesting that QEMU device models (ex.
>>>> hw/net/{e1000*,igb*,rtl8139*}) should perform that validation?  If so,
>>>> where's the patch that introduces any sort of validation hooks for
>>>> vendors to provide?  Where is this validation going to happen in the
>>>> case of a migratable vfio-pci variant devices?  Nothing about this
>>>> patch suggests that it's deferring responsibility to some other code
>>>> entity, it only indicates "checking this breaks, let's not do it".
>>>>
>>>> It's possible that the device you care about only reports volatile
>>>> diagnostic information through the vendor specific capability, but
>>>> another device might use it to report information relative to the
>>>> internal hardware configuration.  Without knowing what the vendor
>>>> specific capability contains, QEMU needs to take the most conservative
>>>> approach by default.  Thanks,
>>>
>>> PCI/PCIe spec doesn’t define ABI for VSC/VSEC contents. Any other code
>>> entity except vendor driver should ignore VSC contents. QEMU’s expectation
>>> of VSC contents to be equal on source and destination seems incorrect given
>>> that QEMU has no control over ABI for VSC contents.
>>>
>>>>
>>>> Alex
>>>>
>>
>> I don't get why this matters though. This is no different from any other
>> device specific register. If a register is visible to guest it generally
>> should not change across migration. If you are migrating a VFIO device and
>> you are making a vsc visible to guest then your migration routine must
>> make sure to migrate the contents of vsc.
>>
>> Maybe there's a good reason to have a register which actually does
>> change. Then, please document the actual reason. When you say:
>>
>>         Ideally qemu should not check VSC during restore/load.
>>
>> then that is clearly wrong in most cases.
> 
> The argument as I understand it is that enforcing that the contents
> remain unchanged between source and target is a policy, but QEMU has
> no basis to create such a policy because the ABI for this capability is
> not defined by the spec.  Furthermore since the ABI of this capability
> is only defined by the device, only the driver for the device should
> have any interaction with the capability.
> 
> There's some merit there, but a potential flaw is that QEMU does in
> fact register several vendor specific capabilities for which it does
> know the ABI.  See for example:
> 
>          vfio_add_vmd_shadow_cap()
>          vfio_add_nv_gpudirect_cap()
>          pci_bridge_qemu_reserve_cap_init()
>          virtio_pci_add_mem_cap()
> 
> I think only the vfio VMD one can be claimed not to support migration
> since afaik we could add a GPUDirect clique capability to a migratable
> vGPU.  So since these are QEMU created capabilities and QEMU does
> understand the ABI, what about these specific capabilities would allow
> them to change between source and target VMs?
> 
> The GPUDirect clique capability describes p2p capabilities between
> devices.  That information cannot spontaneously change.
> 
> The bridge vendor capability exposes reserved bridge resources, which I
> think is consumed by firmware so a change would result in resources
> shifting after migration and a guest reboot.  That would be bad.
> 
> The virtio vendor capability does a variety of things that I'm sure MST
> could explain better than me, but also seems like defined ABI to the VM.
> 
> So actually is it the case that the only vendor specific capabilities
> exposed to the VM that QEMU doesn't understand are those provided from
> a device exposed through vfio-pci?  If that's the case, then QEMU's
> default policy to verify that the vendor capability is unchanged seems
> to be valid and I'd think the patch instead should be creating
> exceptions to that default policy for vfio-pci devices (ie. nothing
> says vfio-pci cannot clear cmask bits set by the pci-core).

In our case, yes: QEMU doesn't know the VSC ABI, as the device is exposed 
through vfio-pci. I will update the patch to create an exception to the 
default policy for vfio-pci. Thanks.

> 
> I think we also see above that there are vendor specific capabilities
> that have a scope beyond the guest driver itself, some visible through
> userspace tools, some consumed by firmware.  So I still find it
> difficult to determine whether any vfio-pci exception should be
> unconditional or tied to specific devices.
> 
> Also, migration is only supported by vfio-pci variant drivers, so why
> is it that the variant driver cannot or should not make the target
> capability match the source?  If the contents of the capability
> changes, does it need to be exposed at all by the variant driver, or is
> the vendor specific capability the only means to expose this
> information for the driver?  Thanks,
> 
> Alex
>

Patch

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 76080af580..32429109df 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2485,8 +2485,11 @@  int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
     memset(pdev->used + offset, 0xFF, QEMU_ALIGN_UP(size, 4));
     /* Make capability read-only by default */
     memset(pdev->wmask + offset, 0, size);
-    /* Check capability by default */
-    memset(pdev->cmask + offset, 0xFF, size);
+
+    if (cap_id != PCI_CAP_ID_VNDR) {
+        /* Check non-vendor specific capability by default */
+        memset(pdev->cmask + offset, 0xFF, size);
+    }
     return offset;
 }