diff mbox

[RFC] qemu pci: pci_add_capability enhancement to prevent damaging config space

Message ID 4FACB581.2050609@ozlabs.ru
State New
Headers show

Commit Message

Alexey Kardashevskiy May 11, 2012, 6:45 a.m. UTC
Normally the pci_add_capability is called on devices to add new
capability. This is ok for emulated devices which capabilities list
is being built by QEMU.

In the case of VFIO the capability may already exist and adding new
capability into the beginning of the linked list may create a loop.

For example, the old code destroys the following config
of PCIe Intel E1000E:

before adding PCI_CAP_ID_MSI (0x05):
0x34: 0xC8
0xC8: 0x01 0xD0
0xD0: 0x05 0xE0
0xE0: 0x10 0x00

after:
0x34: 0xD0
0xC8: 0x01 0xD0
0xD0: 0x05 0xC8
0xE0: 0x10 0x00

As result capabilities 0x01 and 0x05 point to each other.

The proposed patch does not change capability pointers when
the same type capability is about to add.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/pci.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

Comments

Alexander Graf May 11, 2012, 10:52 a.m. UTC | #1
On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote:

> Normally the pci_add_capability is called on devices to add new
> capability. This is ok for emulated devices which capabilities list
> is being built by QEMU.
> 
> In the case of VFIO the capability may already exist and adding new
> capability into the beginning of the linked list may create a loop.
> 
> For example, the old code destroys the following config
> of PCIe Intel E1000E:
> 
> before adding PCI_CAP_ID_MSI (0x05):
> 0x34: 0xC8
> 0xC8: 0x01 0xD0
> 0xD0: 0x05 0xE0
> 0xE0: 0x10 0x00
> 
> after:
> 0x34: 0xD0
> 0xC8: 0x01 0xD0
> 0xD0: 0x05 0xC8
> 0xE0: 0x10 0x00
> 
> As result capabilities 0x01 and 0x05 point to each other.
> 
> The proposed patch does not change capability pointers when
> the same type capability is about to add.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> hw/pci.c |   10 ++++++----
> 1 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/pci.c b/hw/pci.c
> index aa0c0b8..1f7c924 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>     }
> 
>     config = pdev->config + offset;
> -    config[PCI_CAP_LIST_ID] = cap_id;
> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
> +    if (config[PCI_CAP_LIST_ID] != cap_id) {

This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either

  * replace the existing one or
  * drop out and not write the new one in.

I'm not sure which way would be more natural.

> +        config[PCI_CAP_LIST_ID] = cap_id;
> +        config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
> +        pdev->config[PCI_CAPABILITY_LIST] = offset;
> +        pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
> +    }
>     memset(pdev->used + offset, 0xFF, size);
>     /* Make capability read-only by default */
>     memset(pdev->wmask + offset, 0, size);
> 
> 
> -- 
> Alexey
Alexey Kardashevskiy May 11, 2012, 12:47 p.m. UTC | #2
11.05.2012 20:52, Alexander Graf написал:
> 
> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote:
> 
>> Normally the pci_add_capability is called on devices to add new
>> capability. This is ok for emulated devices which capabilities list
>> is being built by QEMU.
>>
>> In the case of VFIO the capability may already exist and adding new
>> capability into the beginning of the linked list may create a loop.
>>
>> For example, the old code destroys the following config
>> of PCIe Intel E1000E:
>>
>> before adding PCI_CAP_ID_MSI (0x05):
>> 0x34: 0xC8
>> 0xC8: 0x01 0xD0
>> 0xD0: 0x05 0xE0
>> 0xE0: 0x10 0x00
>>
>> after:
>> 0x34: 0xD0
>> 0xC8: 0x01 0xD0
>> 0xD0: 0x05 0xC8
>> 0xE0: 0x10 0x00
>>
>> As result capabilities 0x01 and 0x05 point to each other.
>>
>> The proposed patch does not change capability pointers when
>> the same type capability is about to add.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> hw/pci.c |   10 ++++++----
>> 1 files changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/pci.c b/hw/pci.c
>> index aa0c0b8..1f7c924 100644
>> --- a/hw/pci.c
>> +++ b/hw/pci.c
>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>     }
>>
>>     config = pdev->config + offset;
>> -    config[PCI_CAP_LIST_ID] = cap_id;
>> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
>> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
> 
> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either
>   * replace the existing one or
>   * drop out and not write the new one in.
> 
> I'm not sure which way would be more natural.

There is a third option - add another function, lets call it
pci_fixup_capability() which would do whatever pci_add_capability() does
but won't touch list pointers.

When vfio, pci_add_capability() is called from the code which knows
exactly that the capability exists and where it is and it calls
pci_add_capability() based on this knowledge so doing additional loops
just for imaginery scalability is a bit weird, no?


>> +        config[PCI_CAP_LIST_ID] = cap_id;
>> +        config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>> +        pdev->config[PCI_CAPABILITY_LIST] = offset;
>> +        pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>> +    }
>>     memset(pdev->used + offset, 0xFF, size);
>>     /* Make capability read-only by default */
>>     memset(pdev->wmask + offset, 0, size);
Alexander Graf May 11, 2012, 2:13 p.m. UTC | #3
On 11.05.2012, at 14:47, Alexey Kardashevskiy wrote:

> 11.05.2012 20:52, Alexander Graf написал:
>> 
>> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote:
>> 
>>> Normally the pci_add_capability is called on devices to add new
>>> capability. This is ok for emulated devices which capabilities list
>>> is being built by QEMU.
>>> 
>>> In the case of VFIO the capability may already exist and adding new
>>> capability into the beginning of the linked list may create a loop.
>>> 
>>> For example, the old code destroys the following config
>>> of PCIe Intel E1000E:
>>> 
>>> before adding PCI_CAP_ID_MSI (0x05):
>>> 0x34: 0xC8
>>> 0xC8: 0x01 0xD0
>>> 0xD0: 0x05 0xE0
>>> 0xE0: 0x10 0x00
>>> 
>>> after:
>>> 0x34: 0xD0
>>> 0xC8: 0x01 0xD0
>>> 0xD0: 0x05 0xC8
>>> 0xE0: 0x10 0x00
>>> 
>>> As result capabilities 0x01 and 0x05 point to each other.
>>> 
>>> The proposed patch does not change capability pointers when
>>> the same type capability is about to add.
>>> 
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>> hw/pci.c |   10 ++++++----
>>> 1 files changed, 6 insertions(+), 4 deletions(-)
>>> 
>>> diff --git a/hw/pci.c b/hw/pci.c
>>> index aa0c0b8..1f7c924 100644
>>> --- a/hw/pci.c
>>> +++ b/hw/pci.c
>>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>    }
>>> 
>>>    config = pdev->config + offset;
>>> -    config[PCI_CAP_LIST_ID] = cap_id;
>>> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>>> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
>>> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>>> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
>> 
>> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either
>>  * replace the existing one or
>>  * drop out and not write the new one in.

  * hw_error :)

>> 
>> I'm not sure which way would be more natural.
> 
> There is a third option - add another function, lets call it
> pci_fixup_capability() which would do whatever pci_add_capability() does
> but won't touch list pointers.

What good is a function that breaks internal consistency?

> When vfio, pci_add_capability() is called from the code which knows
> exactly that the capability exists and where it is and it calls
> pci_add_capability() based on this knowledge so doing additional loops
> just for imaginery scalability is a bit weird, no?

Not sure I understand your proposal. The more generic a framework is, the better, no? In this code path we don't care about speed. We only care about consistency and reliability.


Alex
Jason Baron May 11, 2012, 7:20 p.m. UTC | #4
On Fri, May 11, 2012 at 04:45:21PM +1000, Alexey Kardashevskiy wrote:
> Normally the pci_add_capability is called on devices to add new
> capability. This is ok for emulated devices which capabilities list
> is being built by QEMU.
> 
> In the case of VFIO the capability may already exist and adding new
> capability into the beginning of the linked list may create a loop.

Hi,

I don't quite understand how we get a loop, if 'offset' is supplied to
'pci_add_capability' and there is an overlap we get -EINVAL. Otherwise,
we are adding the capability in a new empty space. So, I see how we
could get the capability in the list twice, but not how there is a loop.
what am I missing?

Thanks,

-Jason

> 
> For example, the old code destroys the following config
> of PCIe Intel E1000E:
> 
> before adding PCI_CAP_ID_MSI (0x05):
> 0x34: 0xC8
> 0xC8: 0x01 0xD0
> 0xD0: 0x05 0xE0
> 0xE0: 0x10 0x00
> 
> after:
> 0x34: 0xD0
> 0xC8: 0x01 0xD0
> 0xD0: 0x05 0xC8
> 0xE0: 0x10 0x00
> 
> As result capabilities 0x01 and 0x05 point to each other.
> 
> The proposed patch does not change capability pointers when
> the same type capability is about to add.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/pci.c |   10 ++++++----
>  1 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/pci.c b/hw/pci.c
> index aa0c0b8..1f7c924 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>      }
> 
>      config = pdev->config + offset;
> -    config[PCI_CAP_LIST_ID] = cap_id;
> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
> +        config[PCI_CAP_LIST_ID] = cap_id;
> +        config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
> +        pdev->config[PCI_CAPABILITY_LIST] = offset;
> +        pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
> +    }
>      memset(pdev->used + offset, 0xFF, size);
>      /* Make capability read-only by default */
>      memset(pdev->wmask + offset, 0, size);
> 
> 
> -- 
> Alexey
>
Alexey Kardashevskiy May 12, 2012, 12:27 a.m. UTC | #5
12.05.2012 5:20, Jason Baron написал:
> On Fri, May 11, 2012 at 04:45:21PM +1000, Alexey Kardashevskiy wrote:
>> Normally the pci_add_capability is called on devices to add new
>> capability. This is ok for emulated devices which capabilities list
>> is being built by QEMU.
>>
>> In the case of VFIO the capability may already exist and adding new
>> capability into the beginning of the linked list may create a loop.
> 
> Hi,
> 
> I don't quite understand how we get a loop, if 'offset' is supplied to
> 'pci_add_capability' and there is an overlap we get -EINVAL. Otherwise,
> we are adding the capability in a new empty space. So, I see how we
> could get the capability in the list twice, but not how there is a loop.
> what am I missing?


This happens only with VFIO.

The capability already exists in the config space as it is fetched from
the host kernel _before_ msi_init is called. Furthermore, msi_init() is
called when VFIO sees this capability in the config space.

We probably want to re-add all capabilities, do not know...



> Thanks,
> 
> -Jason
> 
>>
>> For example, the old code destroys the following config
>> of PCIe Intel E1000E:
>>
>> before adding PCI_CAP_ID_MSI (0x05):
>> 0x34: 0xC8
>> 0xC8: 0x01 0xD0
>> 0xD0: 0x05 0xE0
>> 0xE0: 0x10 0x00
>>
>> after:
>> 0x34: 0xD0
>> 0xC8: 0x01 0xD0
>> 0xD0: 0x05 0xC8
>> 0xE0: 0x10 0x00
>>
>> As result capabilities 0x01 and 0x05 point to each other.
>>
>> The proposed patch does not change capability pointers when
>> the same type capability is about to add.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  hw/pci.c |   10 ++++++----
>>  1 files changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/pci.c b/hw/pci.c
>> index aa0c0b8..1f7c924 100644
>> --- a/hw/pci.c
>> +++ b/hw/pci.c
>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>      }
>>
>>      config = pdev->config + offset;
>> -    config[PCI_CAP_LIST_ID] = cap_id;
>> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
>> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
>> +        config[PCI_CAP_LIST_ID] = cap_id;
>> +        config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>> +        pdev->config[PCI_CAPABILITY_LIST] = offset;
>> +        pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>> +    }
>>      memset(pdev->used + offset, 0xFF, size);
>>      /* Make capability read-only by default */
>>      memset(pdev->wmask + offset, 0, size);
Alex Williamson May 14, 2012, 2:37 a.m. UTC | #6
On Sat, 2012-05-12 at 10:27 +1000, Alexey Kardashevskiy wrote:
> 12.05.2012 5:20, Jason Baron написал:
> > On Fri, May 11, 2012 at 04:45:21PM +1000, Alexey Kardashevskiy wrote:
> >> Normally the pci_add_capability is called on devices to add new
> >> capability. This is ok for emulated devices which capabilities list
> >> is being built by QEMU.
> >>
> >> In the case of VFIO the capability may already exist and adding new
> >> capability into the beginning of the linked list may create a loop.
> > 
> > Hi,
> > 
> > I don't quite understand how we get a loop, if 'offset' is supplied to
> > 'pci_add_capability' and there is an overlap we get -EINVAL. Otherwise,
> > we are adding the capability in a new empty space. So, I see how we
> > could get the capability in the list twice, but not how there is a loop.
> > what am I missing?
> 
> 
> This happens only with VFIO.
> 
> The capability already exists in the config space as it is fetched from
> the host kernel _before_ msi_init is called. Furthermore, msi_init() is
> called when VFIO sees this capability in the config space.
> 
> We probably want to re-add all capabilities, do not know...

Yep, I've had a msi[1] and msix[2] patches in my vfio tree for a long
time, we really want to support this generically for all capabilities
though.  We either need to detect or allow the caller to specify that
the config space is already programmed.  Note that even if we don't
create a loop, particularly finicky drivers may balk at just changing
the order of the capabilities list.  Thanks,

Alex

[1]https://github.com/awilliam/qemu-vfio/commit/a9f04351610ab69e22d90a76dc85be3269000a9f
[2]https://github.com/awilliam/qemu-vfio/commit/b4de3d0436b0260fbc6fcd40787c1c92ffca2980

> >>
> >> For example, the old code destroys the following config
> >> of PCIe Intel E1000E:
> >>
> >> before adding PCI_CAP_ID_MSI (0x05):
> >> 0x34: 0xC8
> >> 0xC8: 0x01 0xD0
> >> 0xD0: 0x05 0xE0
> >> 0xE0: 0x10 0x00
> >>
> >> after:
> >> 0x34: 0xD0
> >> 0xC8: 0x01 0xD0
> >> 0xD0: 0x05 0xC8
> >> 0xE0: 0x10 0x00
> >>
> >> As result capabilities 0x01 and 0x05 point to each other.
> >>
> >> The proposed patch does not change capability pointers when
> >> the same type capability is about to add.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>  hw/pci.c |   10 ++++++----
> >>  1 files changed, 6 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/hw/pci.c b/hw/pci.c
> >> index aa0c0b8..1f7c924 100644
> >> --- a/hw/pci.c
> >> +++ b/hw/pci.c
> >> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
> >>      }
> >>
> >>      config = pdev->config + offset;
> >> -    config[PCI_CAP_LIST_ID] = cap_id;
> >> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
> >> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
> >> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
> >> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
> >> +        config[PCI_CAP_LIST_ID] = cap_id;
> >> +        config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
> >> +        pdev->config[PCI_CAPABILITY_LIST] = offset;
> >> +        pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
> >> +    }
> >>      memset(pdev->used + offset, 0xFF, size);
> >>      /* Make capability read-only by default */
> >>      memset(pdev->wmask + offset, 0, size);
> 
> 
>
Alexey Kardashevskiy May 14, 2012, 3:49 a.m. UTC | #7
On 12/05/12 00:13, Alexander Graf wrote:
> 
> On 11.05.2012, at 14:47, Alexey Kardashevskiy wrote:
> 
>> 11.05.2012 20:52, Alexander Graf написал:
>>>
>>> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote:
>>>
>>>> Normally the pci_add_capability is called on devices to add new
>>>> capability. This is ok for emulated devices which capabilities list
>>>> is being built by QEMU.
>>>>
>>>> In the case of VFIO the capability may already exist and adding new
>>>> capability into the beginning of the linked list may create a loop.
>>>>
>>>> For example, the old code destroys the following config
>>>> of PCIe Intel E1000E:
>>>>
>>>> before adding PCI_CAP_ID_MSI (0x05):
>>>> 0x34: 0xC8
>>>> 0xC8: 0x01 0xD0
>>>> 0xD0: 0x05 0xE0
>>>> 0xE0: 0x10 0x00
>>>>
>>>> after:
>>>> 0x34: 0xD0
>>>> 0xC8: 0x01 0xD0
>>>> 0xD0: 0x05 0xC8
>>>> 0xE0: 0x10 0x00
>>>>
>>>> As result capabilities 0x01 and 0x05 point to each other.
>>>>
>>>> The proposed patch does not change capability pointers when
>>>> the same type capability is about to add.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> hw/pci.c |   10 ++++++----
>>>> 1 files changed, 6 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/hw/pci.c b/hw/pci.c
>>>> index aa0c0b8..1f7c924 100644
>>>> --- a/hw/pci.c
>>>> +++ b/hw/pci.c
>>>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>    }
>>>>
>>>>    config = pdev->config + offset;
>>>> -    config[PCI_CAP_LIST_ID] = cap_id;
>>>> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>>>> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
>>>> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>>>> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
>>>
>>> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either
>>>  * replace the existing one or
>>>  * drop out and not write the new one in.
> 
>   * hw_error :)
> 
>>>
>>> I'm not sure which way would be more natural.
>>
>> There is a third option - add another function, lets call it
>> pci_fixup_capability() which would do whatever pci_add_capability() does
>> but won't touch list pointers.
> 
> What good is a function that breaks internal consistency?


It is broken already by having PCIDevice.used field. Normally pci_add_capability() would go through
the whole list and add a capability if it does not exist. Emulated devices which care about having a
capability at some fixed offset would have initialized their config space before calling this
capabilities API (as VFIO does).

If we really want to support emulated devices which want some capabilities be at fixed offset and
others at random offsets (strange, but ok), I do not see how it is bad to restore this consistency
by special function (pci_fixup_capability()) to avoid its rewriting at different location as a guest
driver may care about its offset.



>> When vfio, pci_add_capability() is called from the code which knows
>> exactly that the capability exists and where it is and it calls
>> pci_add_capability() based on this knowledge so doing additional loops
>> just for imaginery scalability is a bit weird, no?
> 
> Not sure I understand your proposal. The more generic a framework is, the better, no? In this code path we don't care about speed. We only care about consistency and reliability.
> 
> 
> Alex
>
Alexey Kardashevskiy May 18, 2012, 5:12 a.m. UTC | #8
Alexander,

Is that any better? :)


@@ -1779,11 +1779,29 @@ static void pci_del_option_rom(PCIDevice *pdev)
  * in pci config space */
 int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
                        uint8_t offset, uint8_t size)
 {
-    uint8_t *config;
+    uint8_t *config, existing;
     int i, overlapping_cap;

+    existing = pci_find_capability(pdev, cap_id);
+    if (existing) {
+        if (offset && (existing != offset)) {
+            return -EEXIST;
+        }
+        for (i = existing; i < size; ++i) {
+            if (pdev->used[i]) {
+                return -EFAULT;
+            }
+        }
+        memset(pdev->used + offset, 0xFF, size);
+        /* Make capability read-only by default */
+        memset(pdev->wmask + offset, 0, size);
+        /* Check capability by default */
+        memset(pdev->cmask + offset, 0xFF, size);
+        return existing;
+    }
+
     if (!offset) {
         offset = pci_find_space(pdev, size);
         if (!offset) {
             return -ENOSPC;






On 14/05/12 13:49, Alexey Kardashevskiy wrote:
> On 12/05/12 00:13, Alexander Graf wrote:
>>
>> On 11.05.2012, at 14:47, Alexey Kardashevskiy wrote:
>>
>>> 11.05.2012 20:52, Alexander Graf написал:
>>>>
>>>> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote:
>>>>
>>>>> Normally the pci_add_capability is called on devices to add new
>>>>> capability. This is ok for emulated devices which capabilities list
>>>>> is being built by QEMU.
>>>>>
>>>>> In the case of VFIO the capability may already exist and adding new
>>>>> capability into the beginning of the linked list may create a loop.
>>>>>
>>>>> For example, the old code destroys the following config
>>>>> of PCIe Intel E1000E:
>>>>>
>>>>> before adding PCI_CAP_ID_MSI (0x05):
>>>>> 0x34: 0xC8
>>>>> 0xC8: 0x01 0xD0
>>>>> 0xD0: 0x05 0xE0
>>>>> 0xE0: 0x10 0x00
>>>>>
>>>>> after:
>>>>> 0x34: 0xD0
>>>>> 0xC8: 0x01 0xD0
>>>>> 0xD0: 0x05 0xC8
>>>>> 0xE0: 0x10 0x00
>>>>>
>>>>> As result capabilities 0x01 and 0x05 point to each other.
>>>>>
>>>>> The proposed patch does not change capability pointers when
>>>>> the same type capability is about to add.
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>> ---
>>>>> hw/pci.c |   10 ++++++----
>>>>> 1 files changed, 6 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/hw/pci.c b/hw/pci.c
>>>>> index aa0c0b8..1f7c924 100644
>>>>> --- a/hw/pci.c
>>>>> +++ b/hw/pci.c
>>>>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>    }
>>>>>
>>>>>    config = pdev->config + offset;
>>>>> -    config[PCI_CAP_LIST_ID] = cap_id;
>>>>> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>>>>> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
>>>>> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>>>>> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
>>>>
>>>> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either
>>>>  * replace the existing one or
>>>>  * drop out and not write the new one in.
>>
>>   * hw_error :)
>>
>>>>
>>>> I'm not sure which way would be more natural.
>>>
>>> There is a third option - add another function, lets call it
>>> pci_fixup_capability() which would do whatever pci_add_capability() does
>>> but won't touch list pointers.
>>
>> What good is a function that breaks internal consistency?
> 
> 
> It is broken already by having PCIDevice.used field. Normally pci_add_capability() would go through
> the whole list and add a capability if it does not exist. Emulated devices which care about having a
> capability at some fixed offset would have initialized their config space before calling this
> capabilities API (as VFIO does).
> 
> If we really want to support emulated devices which want some capabilities be at fixed offset and
> others at random offsets (strange, but ok), I do not see how it is bad to restore this consistency
> by special function (pci_fixup_capability()) to avoid its rewriting at different location as a guest
> driver may care about its offset.
> 
> 
> 
>>> When vfio, pci_add_capability() is called from the code which knows
>>> exactly that the capability exists and where it is and it calls
>>> pci_add_capability() based on this knowledge so doing additional loops
>>> just for imaginery scalability is a bit weird, no?
>>
>> Not sure I understand your proposal. The more generic a framework is, the better, no? In this code path we don't care about speed. We only care about consistency and reliability.
>>
>>
>> Alex
>>
> 
>
Benjamin Herrenschmidt May 22, 2012, 2:02 a.m. UTC | #9
On Fri, 2012-05-18 at 15:12 +1000, Alexey Kardashevskiy wrote:
> Alexander,
> 
> Is that any better? :)

Alex (Graf that is), ping ?

The original patch from Alexey was fine btw.

VFIO will always call things with the existing capability offset so
there's no real risk of doing the wrong thing or break the list or
anything.

IE. A small simple patch that addresses the problem :-)

The new patch is a bit more "robust" I believe, I don't think we need to
go too far to fix a problem we don't have. But we need a fix for the
real issue and the simple patch does it neatly from what I can
understand.

Cheers,
Ben.

> 
> @@ -1779,11 +1779,29 @@ static void pci_del_option_rom(PCIDevice *pdev)
>   * in pci config space */
>  int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>                         uint8_t offset, uint8_t size)
>  {
> -    uint8_t *config;
> +    uint8_t *config, existing;
>      int i, overlapping_cap;
> 
> +    existing = pci_find_capability(pdev, cap_id);
> +    if (existing) {
> +        if (offset && (existing != offset)) {
> +            return -EEXIST;
> +        }
> +        for (i = existing; i < size; ++i) {
> +            if (pdev->used[i]) {
> +                return -EFAULT;
> +            }
> +        }
> +        memset(pdev->used + offset, 0xFF, size);
> +        /* Make capability read-only by default */
> +        memset(pdev->wmask + offset, 0, size);
> +        /* Check capability by default */
> +        memset(pdev->cmask + offset, 0xFF, size);
> +        return existing;
> +    }
> +
>      if (!offset) {
>          offset = pci_find_space(pdev, size);
>          if (!offset) {
>              return -ENOSPC;
> 
> 
> 
> 
> 
> 
> On 14/05/12 13:49, Alexey Kardashevskiy wrote:
> > On 12/05/12 00:13, Alexander Graf wrote:
> >>
> >> On 11.05.2012, at 14:47, Alexey Kardashevskiy wrote:
> >>
> >>> 11.05.2012 20:52, Alexander Graf написал:
> >>>>
> >>>> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote:
> >>>>
> >>>>> Normally the pci_add_capability is called on devices to add new
> >>>>> capability. This is ok for emulated devices which capabilities list
> >>>>> is being built by QEMU.
> >>>>>
> >>>>> In the case of VFIO the capability may already exist and adding new
> >>>>> capability into the beginning of the linked list may create a loop.
> >>>>>
> >>>>> For example, the old code destroys the following config
> >>>>> of PCIe Intel E1000E:
> >>>>>
> >>>>> before adding PCI_CAP_ID_MSI (0x05):
> >>>>> 0x34: 0xC8
> >>>>> 0xC8: 0x01 0xD0
> >>>>> 0xD0: 0x05 0xE0
> >>>>> 0xE0: 0x10 0x00
> >>>>>
> >>>>> after:
> >>>>> 0x34: 0xD0
> >>>>> 0xC8: 0x01 0xD0
> >>>>> 0xD0: 0x05 0xC8
> >>>>> 0xE0: 0x10 0x00
> >>>>>
> >>>>> As result capabilities 0x01 and 0x05 point to each other.
> >>>>>
> >>>>> The proposed patch does not change capability pointers when
> >>>>> the same type capability is about to add.
> >>>>>
> >>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>> ---
> >>>>> hw/pci.c |   10 ++++++----
> >>>>> 1 files changed, 6 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/hw/pci.c b/hw/pci.c
> >>>>> index aa0c0b8..1f7c924 100644
> >>>>> --- a/hw/pci.c
> >>>>> +++ b/hw/pci.c
> >>>>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
> >>>>>    }
> >>>>>
> >>>>>    config = pdev->config + offset;
> >>>>> -    config[PCI_CAP_LIST_ID] = cap_id;
> >>>>> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
> >>>>> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
> >>>>> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
> >>>>> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
> >>>>
> >>>> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either
> >>>>  * replace the existing one or
> >>>>  * drop out and not write the new one in.
> >>
> >>   * hw_error :)
> >>
> >>>>
> >>>> I'm not sure which way would be more natural.
> >>>
> >>> There is a third option - add another function, lets call it
> >>> pci_fixup_capability() which would do whatever pci_add_capability() does
> >>> but won't touch list pointers.
> >>
> >> What good is a function that breaks internal consistency?
> > 
> > 
> > It is broken already by having PCIDevice.used field. Normally pci_add_capability() would go through
> > the whole list and add a capability if it does not exist. Emulated devices which care about having a
> > capability at some fixed offset would have initialized their config space before calling this
> > capabilities API (as VFIO does).
> > 
> > If we really want to support emulated devices which want some capabilities be at fixed offset and
> > others at random offsets (strange, but ok), I do not see how it is bad to restore this consistency
> > by special function (pci_fixup_capability()) to avoid its rewriting at different location as a guest
> > driver may care about its offset.
> > 
> > 
> > 
> >>> When vfio, pci_add_capability() is called from the code which knows
> >>> exactly that the capability exists and where it is and it calls
> >>> pci_add_capability() based on this knowledge so doing additional loops
> >>> just for imaginery scalability is a bit weird, no?
> >>
> >> Not sure I understand your proposal. The more generic a framework is, the better, no? In this code path we don't care about speed. We only care about consistency and reliability.
> >>
> >>
> >> Alex
> >>
> > 
> > 
> 
>
Alexander Graf May 22, 2012, 3:21 a.m. UTC | #10
On 22.05.2012, at 04:02, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Fri, 2012-05-18 at 15:12 +1000, Alexey Kardashevskiy wrote:
>> Alexander,
>> 
>> Is that any better? :)
> 
> Alex (Graf that is), ping ?
> 
> The original patch from Alexey was fine btw.
> 
> VFIO will always call things with the existing capability offset so
> there's no real risk of doing the wrong thing or break the list or
> anything.
> 
> IE. A small simple patch that addresses the problem :-)
> 
> The new patch is a bit more "robust" I believe, I don't think we need to
> go too far to fix a problem we don't have. But we need a fix for the
> real issue and the simple patch does it neatly from what I can
> understand.
> 
> Cheers,
> Ben.
> 
>> 
>> @@ -1779,11 +1779,29 @@ static void pci_del_option_rom(PCIDevice *pdev)
>>  * in pci config space */
>> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>                        uint8_t offset, uint8_t size)
>> {
>> -    uint8_t *config;
>> +    uint8_t *config, existing;

Existing is a pointer to the target dev's config space, right?

>>     int i, overlapping_cap;
>> 
>> +    existing = pci_find_capability(pdev, cap_id);
>> +    if (existing) {
>> +        if (offset && (existing != offset)) {
>> +            return -EEXIST;
>> +        }
>> +        for (i = existing; i < size; ++i) {

So how does this possibly make sense?

>> +            if (pdev->used[i]) {
>> +                return -EFAULT;
>> +            }
>> +        }
>> +        memset(pdev->used + offset, 0xFF, size);

Why?

>> +        /* Make capability read-only by default */
>> +        memset(pdev->wmask + offset, 0, size);

Why?

>> +        /* Check capability by default */
>> +        memset(pdev->cmask + offset, 0xFF, size);

I don't understand this part either.


Alex

>> +        return existing;
>> +    }
>> +
>>     if (!offset) {
>>         offset = pci_find_space(pdev, size);
>>         if (!offset) {
>>             return -ENOSPC;
>> 
>> 
>> 
>> 
>> 
>> 
>> On 14/05/12 13:49, Alexey Kardashevskiy wrote:
>>> On 12/05/12 00:13, Alexander Graf wrote:
>>>> 
>>>> On 11.05.2012, at 14:47, Alexey Kardashevskiy wrote:
>>>> 
>>>>> 11.05.2012 20:52, Alexander Graf написал:
>>>>>> 
>>>>>> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote:
>>>>>> 
>>>>>>> Normally the pci_add_capability is called on devices to add new
>>>>>>> capability. This is ok for emulated devices which capabilities list
>>>>>>> is being built by QEMU.
>>>>>>> 
>>>>>>> In the case of VFIO the capability may already exist and adding new
>>>>>>> capability into the beginning of the linked list may create a loop.
>>>>>>> 
>>>>>>> For example, the old code destroys the following config
>>>>>>> of PCIe Intel E1000E:
>>>>>>> 
>>>>>>> before adding PCI_CAP_ID_MSI (0x05):
>>>>>>> 0x34: 0xC8
>>>>>>> 0xC8: 0x01 0xD0
>>>>>>> 0xD0: 0x05 0xE0
>>>>>>> 0xE0: 0x10 0x00
>>>>>>> 
>>>>>>> after:
>>>>>>> 0x34: 0xD0
>>>>>>> 0xC8: 0x01 0xD0
>>>>>>> 0xD0: 0x05 0xC8
>>>>>>> 0xE0: 0x10 0x00
>>>>>>> 
>>>>>>> As result capabilities 0x01 and 0x05 point to each other.
>>>>>>> 
>>>>>>> The proposed patch does not change capability pointers when
>>>>>>> the same type capability is about to add.
>>>>>>> 
>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>> ---
>>>>>>> hw/pci.c |   10 ++++++----
>>>>>>> 1 files changed, 6 insertions(+), 4 deletions(-)
>>>>>>> 
>>>>>>> diff --git a/hw/pci.c b/hw/pci.c
>>>>>>> index aa0c0b8..1f7c924 100644
>>>>>>> --- a/hw/pci.c
>>>>>>> +++ b/hw/pci.c
>>>>>>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>>>   }
>>>>>>> 
>>>>>>>   config = pdev->config + offset;
>>>>>>> -    config[PCI_CAP_LIST_ID] = cap_id;
>>>>>>> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>>>>>>> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
>>>>>>> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>>>>>>> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
>>>>>> 
>>>>>> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either
>>>>>> * replace the existing one or
>>>>>> * drop out and not write the new one in.
>>>> 
>>>>  * hw_error :)
>>>> 
>>>>>> 
>>>>>> I'm not sure which way would be more natural.
>>>>> 
>>>>> There is a third option - add another function, lets call it
>>>>> pci_fixup_capability() which would do whatever pci_add_capability() does
>>>>> but won't touch list pointers.
>>>> 
>>>> What good is a function that breaks internal consistency?
>>> 
>>> 
>>> It is broken already by having PCIDevice.used field. Normally pci_add_capability() would go through
>>> the whole list and add a capability if it does not exist. Emulated devices which care about having a
>>> capability at some fixed offset would have initialized their config space before calling this
>>> capabilities API (as VFIO does).
>>> 
>>> If we really want to support emulated devices which want some capabilities be at fixed offset and
>>> others at random offsets (strange, but ok), I do not see how it is bad to restore this consistency
>>> by special function (pci_fixup_capability()) to avoid its rewriting at different location as a guest
>>> driver may care about its offset.
>>> 
>>> 
>>> 
>>>>> When vfio, pci_add_capability() is called from the code which knows
>>>>> exactly that the capability exists and where it is and it calls
>>>>> pci_add_capability() based on this knowledge so doing additional loops
>>>>> just for imaginery scalability is a bit weird, no?
>>>> 
>>>> Not sure I understand your proposal. The more generic a framework is, the better, no? In this code path we don't care about speed. We only care about consistency and reliability.
>>>> 
>>>> 
>>>> Alex
>>>> 
>>> 
>>> 
>> 
>> 
> 
>
Alexey Kardashevskiy May 22, 2012, 3:44 a.m. UTC | #11
On 22/05/12 13:21, Alexander Graf wrote:
> 
> 
> On 22.05.2012, at 04:02, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
>> On Fri, 2012-05-18 at 15:12 +1000, Alexey Kardashevskiy wrote:
>>> Alexander,
>>>
>>> Is that any better? :)
>>
>> Alex (Graf that is), ping ?
>>
>> The original patch from Alexey was fine btw.
>>
>> VFIO will always call things with the existing capability offset so
>> there's no real risk of doing the wrong thing or break the list or
>> anything.
>>
>> IE. A small simple patch that addresses the problem :-)
>>
>> The new patch is a bit more "robust" I believe, I don't think we need to
>> go too far to fix a problem we don't have. But we need a fix for the
>> real issue and the simple patch does it neatly from what I can
>> understand.
>>
>> Cheers,
>> Ben.
>>
>>>
>>> @@ -1779,11 +1779,29 @@ static void pci_del_option_rom(PCIDevice *pdev)
>>>  * in pci config space */
>>> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>                        uint8_t offset, uint8_t size)
>>> {
>>> -    uint8_t *config;
>>> +    uint8_t *config, existing;
> 
> Existing is a pointer to the target dev's config space, right?

Yes.

>>>     int i, overlapping_cap;
>>>
>>> +    existing = pci_find_capability(pdev, cap_id);
>>> +    if (existing) {
>>> +        if (offset && (existing != offset)) {
>>> +            return -EEXIST;
>>> +        }
>>> +        for (i = existing; i < size; ++i) {
> 
> So how does this possibly make sense?

Although I do not expect VFIO to add capabilities (does not make sense), I still want to double
check that this space has not been tried to use by someone else.

>>> +            if (pdev->used[i]) {
>>> +                return -EFAULT;
>>> +            }
>>> +        }
>>> +        memset(pdev->used + offset, 0xFF, size);
> Why?

Because I am marking the space this capability takes as used.

>>> +        /* Make capability read-only by default */
>>> +        memset(pdev->wmask + offset, 0, size);
> Why?

Because the pci_add_capability() does it for a new capability by default.


>>> +        /* Check capability by default */
>>> +        memset(pdev->cmask + offset, 0xFF, size);
> 
> I don't understand this part either.

The pci_add_capability() does it for a new capability by default.



> 
> Alex
> 
>>> +        return existing;
>>> +    }
>>> +
>>>     if (!offset) {
>>>         offset = pci_find_space(pdev, size);
>>>         if (!offset) {
>>>             return -ENOSPC;
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 14/05/12 13:49, Alexey Kardashevskiy wrote:
>>>> On 12/05/12 00:13, Alexander Graf wrote:
>>>>>
>>>>> On 11.05.2012, at 14:47, Alexey Kardashevskiy wrote:
>>>>>
>>>>>> 11.05.2012 20:52, Alexander Graf написал:
>>>>>>>
>>>>>>> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote:
>>>>>>>
>>>>>>>> Normally the pci_add_capability is called on devices to add new
>>>>>>>> capability. This is ok for emulated devices which capabilities list
>>>>>>>> is being built by QEMU.
>>>>>>>>
>>>>>>>> In the case of VFIO the capability may already exist and adding new
>>>>>>>> capability into the beginning of the linked list may create a loop.
>>>>>>>>
>>>>>>>> For example, the old code destroys the following config
>>>>>>>> of PCIe Intel E1000E:
>>>>>>>>
>>>>>>>> before adding PCI_CAP_ID_MSI (0x05):
>>>>>>>> 0x34: 0xC8
>>>>>>>> 0xC8: 0x01 0xD0
>>>>>>>> 0xD0: 0x05 0xE0
>>>>>>>> 0xE0: 0x10 0x00
>>>>>>>>
>>>>>>>> after:
>>>>>>>> 0x34: 0xD0
>>>>>>>> 0xC8: 0x01 0xD0
>>>>>>>> 0xD0: 0x05 0xC8
>>>>>>>> 0xE0: 0x10 0x00
>>>>>>>>
>>>>>>>> As result capabilities 0x01 and 0x05 point to each other.
>>>>>>>>
>>>>>>>> The proposed patch does not change capability pointers when
>>>>>>>> the same type capability is about to add.
>>>>>>>>
>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>> ---
>>>>>>>> hw/pci.c |   10 ++++++----
>>>>>>>> 1 files changed, 6 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/hw/pci.c b/hw/pci.c
>>>>>>>> index aa0c0b8..1f7c924 100644
>>>>>>>> --- a/hw/pci.c
>>>>>>>> +++ b/hw/pci.c
>>>>>>>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   config = pdev->config + offset;
>>>>>>>> -    config[PCI_CAP_LIST_ID] = cap_id;
>>>>>>>> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>>>>>>>> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
>>>>>>>> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>>>>>>>> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
>>>>>>>
>>>>>>> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either
>>>>>>> * replace the existing one or
>>>>>>> * drop out and not write the new one in.
>>>>>
>>>>>  * hw_error :)
>>>>>
>>>>>>>
>>>>>>> I'm not sure which way would be more natural.
>>>>>>
>>>>>> There is a third option - add another function, lets call it
>>>>>> pci_fixup_capability() which would do whatever pci_add_capability() does
>>>>>> but won't touch list pointers.
>>>>>
>>>>> What good is a function that breaks internal consistency?
>>>>
>>>>
>>>> It is broken already by having PCIDevice.used field. Normally pci_add_capability() would go through
>>>> the whole list and add a capability if it does not exist. Emulated devices which care about having a
>>>> capability at some fixed offset would have initialized their config space before calling this
>>>> capabilities API (as VFIO does).
>>>>
>>>> If we really want to support emulated devices which want some capabilities be at fixed offset and
>>>> others at random offsets (strange, but ok), I do not see how it is bad to restore this consistency
>>>> by special function (pci_fixup_capability()) to avoid its rewriting at different location as a guest
>>>> driver may care about its offset.
>>>>
>>>>
>>>>
>>>>>> When vfio, pci_add_capability() is called from the code which knows
>>>>>> exactly that the capability exists and where it is and it calls
>>>>>> pci_add_capability() based on this knowledge so doing additional loops
>>>>>> just for imaginery scalability is a bit weird, no?
>>>>>
>>>>> Not sure I understand your proposal. The more generic a framework is, the better, no? In this code path we don't care about speed. We only care about consistency and reliability.
>>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
Alexander Graf May 22, 2012, 5:52 a.m. UTC | #12
On 22.05.2012, at 05:44, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 22/05/12 13:21, Alexander Graf wrote:
>> 
>> 
>> On 22.05.2012, at 04:02, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>> 
>>> On Fri, 2012-05-18 at 15:12 +1000, Alexey Kardashevskiy wrote:
>>>> Alexander,
>>>> 
>>>> Is that any better? :)
>>> 
>>> Alex (Graf that is), ping ?
>>> 
>>> The original patch from Alexey was fine btw.
>>> 
>>> VFIO will always call things with the existing capability offset so
>>> there's no real risk of doing the wrong thing or break the list or
>>> anything.
>>> 
>>> IE. A small simple patch that addresses the problem :-)
>>> 
>>> The new patch is a bit more "robust" I believe, I don't think we need to
>>> go too far to fix a problem we don't have. But we need a fix for the
>>> real issue and the simple patch does it neatly from what I can
>>> understand.
>>> 
>>> Cheers,
>>> Ben.
>>> 
>>>> 
>>>> @@ -1779,11 +1779,29 @@ static void pci_del_option_rom(PCIDevice *pdev)
>>>> * in pci config space */
>>>> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>                       uint8_t offset, uint8_t size)
>>>> {
>>>> -    uint8_t *config;
>>>> +    uint8_t *config, existing;
>> 
>> Existing is a pointer to the target dev's config space, right?
> 
> Yes.
> 
>>>>    int i, overlapping_cap;
>>>> 
>>>> +    existing = pci_find_capability(pdev, cap_id);
>>>> +    if (existing) {
>>>> +        if (offset && (existing != offset)) {
>>>> +            return -EEXIST;
>>>> +        }
>>>> +        for (i = existing; i < size; ++i) {
>> 
>> So how does this possibly make sense?
> 
> Although I do not expect VFIO to add capabilities (does not make sense), I still want to double
> check that this space has not been tried to use by someone else.

i is an int. existing is a uint8_t*.

> 
>>>> +            if (pdev->used[i]) {
>>>> +                return -EFAULT;
>>>> +            }
>>>> +        }
>>>> +        memset(pdev->used + offset, 0xFF, size);
>> Why?
> 
> Because I am marking the space this capability takes as used.

But if it already existed (at the same offset), it should be set used already, no? Unless size > existing size, in which case you might overwrite data in the next chunk, no?

> 
>>>> +        /* Make capability read-only by default */
>>>> +        memset(pdev->wmask + offset, 0, size);
>> Why?
> 
> Because the pci_add_capability() does it for a new capability by default.

Hrm. So you're copying code? Can't you merge the overwrite and write cases?

Alex

> 
> 
>>>> +        /* Check capability by default */
>>>> +        memset(pdev->cmask + offset, 0xFF, size);
>> 
>> I don't understand this part either.
> 
> The pci_add_capability() does it for a new capability by default.
> 
> 
> 
>> 
>> Alex
>> 
>>>> +        return existing;
>>>> +    }
>>>> +
>>>>    if (!offset) {
>>>>        offset = pci_find_space(pdev, size);
>>>>        if (!offset) {
>>>>            return -ENOSPC;
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 14/05/12 13:49, Alexey Kardashevskiy wrote:
>>>>> On 12/05/12 00:13, Alexander Graf wrote:
>>>>>> 
>>>>>> On 11.05.2012, at 14:47, Alexey Kardashevskiy wrote:
>>>>>> 
>>>>>>> 11.05.2012 20:52, Alexander Graf написал:
>>>>>>>> 
>>>>>>>> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote:
>>>>>>>> 
>>>>>>>>> Normally the pci_add_capability is called on devices to add new
>>>>>>>>> capability. This is ok for emulated devices which capabilities list
>>>>>>>>> is being built by QEMU.
>>>>>>>>> 
>>>>>>>>> In the case of VFIO the capability may already exist and adding new
>>>>>>>>> capability into the beginning of the linked list may create a loop.
>>>>>>>>> 
>>>>>>>>> For example, the old code destroys the following config
>>>>>>>>> of PCIe Intel E1000E:
>>>>>>>>> 
>>>>>>>>> before adding PCI_CAP_ID_MSI (0x05):
>>>>>>>>> 0x34: 0xC8
>>>>>>>>> 0xC8: 0x01 0xD0
>>>>>>>>> 0xD0: 0x05 0xE0
>>>>>>>>> 0xE0: 0x10 0x00
>>>>>>>>> 
>>>>>>>>> after:
>>>>>>>>> 0x34: 0xD0
>>>>>>>>> 0xC8: 0x01 0xD0
>>>>>>>>> 0xD0: 0x05 0xC8
>>>>>>>>> 0xE0: 0x10 0x00
>>>>>>>>> 
>>>>>>>>> As result capabilities 0x01 and 0x05 point to each other.
>>>>>>>>> 
>>>>>>>>> The proposed patch does not change capability pointers when
>>>>>>>>> the same type capability is about to add.
>>>>>>>>> 
>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>> ---
>>>>>>>>> hw/pci.c |   10 ++++++----
>>>>>>>>> 1 files changed, 6 insertions(+), 4 deletions(-)
>>>>>>>>> 
>>>>>>>>> diff --git a/hw/pci.c b/hw/pci.c
>>>>>>>>> index aa0c0b8..1f7c924 100644
>>>>>>>>> --- a/hw/pci.c
>>>>>>>>> +++ b/hw/pci.c
>>>>>>>>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>>>>>  }
>>>>>>>>> 
>>>>>>>>>  config = pdev->config + offset;
>>>>>>>>> -    config[PCI_CAP_LIST_ID] = cap_id;
>>>>>>>>> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>>>>>>>>> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
>>>>>>>>> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>>>>>>>>> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
>>>>>>>> 
>>>>>>>> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either
>>>>>>>> * replace the existing one or
>>>>>>>> * drop out and not write the new one in.
>>>>>> 
>>>>>> * hw_error :)
>>>>>> 
>>>>>>>> 
>>>>>>>> I'm not sure which way would be more natural.
>>>>>>> 
>>>>>>> There is a third option - add another function, lets call it
>>>>>>> pci_fixup_capability() which would do whatever pci_add_capability() does
>>>>>>> but won't touch list pointers.
>>>>>> 
>>>>>> What good is a function that breaks internal consistency?
>>>>> 
>>>>> 
>>>>> It is broken already by having PCIDevice.used field. Normally pci_add_capability() would go through
>>>>> the whole list and add a capability if it does not exist. Emulated devices which care about having a
>>>>> capability at some fixed offset would have initialized their config space before calling this
>>>>> capabilities API (as VFIO does).
>>>>> 
>>>>> If we really want to support emulated devices which want some capabilities be at fixed offset and
>>>>> others at random offsets (strange, but ok), I do not see how it is bad to restore this consistency
>>>>> by special function (pci_fixup_capability()) to avoid its rewriting at different location as a guest
>>>>> driver may care about its offset.
>>>>> 
>>>>> 
>>>>> 
>>>>>>> When vfio, pci_add_capability() is called from the code which knows
>>>>>>> exactly that the capability exists and where it is and it calls
>>>>>>> pci_add_capability() based on this knowledge so doing additional loops
>>>>>>> just for imaginery scalability is a bit weird, no?
>>>>>> 
>>>>>> Not sure I understand your proposal. The more generic a framework is, the better, no? In this code path we don't care about speed. We only care about consistency and reliability.
>>>>>> 
>>>>>> 
>>>>>> Alex
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
> 
> 
> -- 
> Alexey
diff mbox

Patch

diff --git a/hw/pci.c b/hw/pci.c
index aa0c0b8..1f7c924 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -1794,10 +1794,12 @@  int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
     }

     config = pdev->config + offset;
-    config[PCI_CAP_LIST_ID] = cap_id;
-    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
-    pdev->config[PCI_CAPABILITY_LIST] = offset;
-    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
+    if (config[PCI_CAP_LIST_ID] != cap_id) {
+        config[PCI_CAP_LIST_ID] = cap_id;
+        config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
+        pdev->config[PCI_CAPABILITY_LIST] = offset;
+        pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
+    }
     memset(pdev->used + offset, 0xFF, size);
     /* Make capability read-only by default */
     memset(pdev->wmask + offset, 0, size);