diff mbox

[RFC] qemu pci: pci_add_capability enhancement to prevent damaging config space

Message ID 4FBB2E25.2030206@ozlabs.ru
State New
Headers show

Commit Message

Alexey Kardashevskiy May 22, 2012, 6:11 a.m. UTC
On 22/05/12 15:52, Alexander Graf wrote:
> 
> 
> On 22.05.2012, at 05:44, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 22/05/12 13:21, Alexander Graf wrote:
>>>
>>>
>>> On 22.05.2012, at 04:02, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>
>>>> On Fri, 2012-05-18 at 15:12 +1000, Alexey Kardashevskiy wrote:
>>>>> Alexander,
>>>>>
>>>>> Is that any better? :)
>>>>
>>>> Alex (Graf that is), ping ?
>>>>
>>>> The original patch from Alexey was fine btw.
>>>>
>>>> VFIO will always call things with the existing capability offset so
>>>> there's no real risk of doing the wrong thing or break the list or
>>>> anything.
>>>>
>>>> IE. A small simple patch that addresses the problem :-)
>>>>
>>>> The new patch is a bit more "robust" I believe, I don't think we need to
>>>> go too far to fix a problem we don't have. But we need a fix for the
>>>> real issue and the simple patch does it neatly from what I can
>>>> understand.
>>>>
>>>> Cheers,
>>>> Ben.
>>>>
>>>>>
>>>>> @@ -1779,11 +1779,29 @@ static void pci_del_option_rom(PCIDevice *pdev)
>>>>> * in pci config space */
>>>>> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>                       uint8_t offset, uint8_t size)
>>>>> {
>>>>> -    uint8_t *config;
>>>>> +    uint8_t *config, existing;
>>>
>>> Existing is a pointer to the target dev's config space, right?
>>
>> Yes.
>>
>>>>>    int i, overlapping_cap;
>>>>>
>>>>> +    existing = pci_find_capability(pdev, cap_id);
>>>>> +    if (existing) {
>>>>> +        if (offset && (existing != offset)) {
>>>>> +            return -EEXIST;
>>>>> +        }
>>>>> +        for (i = existing; i < size; ++i) {
>>>
>>> So how does this possibly make sense?
>>
>> Although I do not expect VFIO to add capabilities (does not make sense), I still want to double
>> check that this space has not been tried to use by someone else.
> 
> i is an int. existing is a uint8_t*.


It was there before me. This function already does a loop and this is how it was coded at the first place.


>>>>> +            if (pdev->used[i]) {
>>>>> +                return -EFAULT;
>>>>> +            }
>>>>> +        }
>>>>> +        memset(pdev->used + offset, 0xFF, size);
>>> Why?
>>
>> Because I am marking the space this capability takes as used.
> 
> But if it already existed (at the same offset), it should be set used already, no? Unless size > existing size, in which case you might overwrite data in the next chunk, no?


No, it does not exist for VFIO - VFIO read the config space from the host kernel first and then calls msi_init or msix_init or whatever_else_init depending on what it got from the host kernel. And these xxx_init() functions eventually call pci_add_capability().

Sure we can either implement own msi_init/msix_init (and may be others in the future) for VFIO (which would do all the same as other QEMU devices except touching the capabilities)  OR  hack msi_init/msix_init not to touch capabilities if they exist.



>>>>> +        /* Make capability read-only by default */
>>>>> +        memset(pdev->wmask + offset, 0, size);
>>> Why?
>>
>> Because the pci_add_capability() does it for a new capability by default.
> 
> Hrm. So you're copying code? Can't you merge the overwrite and write cases?

I am trying to make it as a single chunk which is as small as possible.


If it helps, below is the same patch with extended context to see what is going on in that function.






hw/pci.c |   20 +++++++++++++++++++-
 1 files changed, 19 insertions(+), 1 deletions(-)





>>>>> +        /* Check capability by default */
>>>>> +        memset(pdev->cmask + offset, 0xFF, size);
>>>
>>> I don't understand this part either.
>>
>> The pci_add_capability() does it for a new capability by default.
>>
>>
>>
>>>
>>> Alex
>>>
>>>>> +        return existing;
>>>>> +    }
>>>>> +
>>>>>    if (!offset) {
>>>>>        offset = pci_find_space(pdev, size);
>>>>>        if (!offset) {
>>>>>            return -ENOSPC;
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 14/05/12 13:49, Alexey Kardashevskiy wrote:
>>>>>> On 12/05/12 00:13, Alexander Graf wrote:
>>>>>>>
>>>>>>> On 11.05.2012, at 14:47, Alexey Kardashevskiy wrote:
>>>>>>>
>>>>>>>> 11.05.2012 20:52, Alexander Graf =0?8A0;:
>>>>>>>>>
>>>>>>>>> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote:
>>>>>>>>>
>>>>>>>>>> Normally the pci_add_capability is called on devices to add new
>>>>>>>>>> capability. This is ok for emulated devices which capabilities list
>>>>>>>>>> is being built by QEMU.
>>>>>>>>>>
>>>>>>>>>> In the case of VFIO the capability may already exist and adding new
>>>>>>>>>> capability into the beginning of the linked list may create a loop.
>>>>>>>>>>
>>>>>>>>>> For example, the old code destroys the following config
>>>>>>>>>> of PCIe Intel E1000E:
>>>>>>>>>>
>>>>>>>>>> before adding PCI_CAP_ID_MSI (0x05):
>>>>>>>>>> 0x34: 0xC8
>>>>>>>>>> 0xC8: 0x01 0xD0
>>>>>>>>>> 0xD0: 0x05 0xE0
>>>>>>>>>> 0xE0: 0x10 0x00
>>>>>>>>>>
>>>>>>>>>> after:
>>>>>>>>>> 0x34: 0xD0
>>>>>>>>>> 0xC8: 0x01 0xD0
>>>>>>>>>> 0xD0: 0x05 0xC8
>>>>>>>>>> 0xE0: 0x10 0x00
>>>>>>>>>>
>>>>>>>>>> As result capabilities 0x01 and 0x05 point to each other.
>>>>>>>>>>
>>>>>>>>>> The proposed patch does not change capability pointers when
>>>>>>>>>> the same type capability is about to add.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>> ---
>>>>>>>>>> hw/pci.c |   10 ++++++----
>>>>>>>>>> 1 files changed, 6 insertions(+), 4 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/hw/pci.c b/hw/pci.c
>>>>>>>>>> index aa0c0b8..1f7c924 100644
>>>>>>>>>> --- a/hw/pci.c
>>>>>>>>>> +++ b/hw/pci.c
>>>>>>>>>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>>  config = pdev->config + offset;
>>>>>>>>>> -    config[PCI_CAP_LIST_ID] = cap_id;
>>>>>>>>>> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>>>>>>>>>> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
>>>>>>>>>> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>>>>>>>>>> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
>>>>>>>>>
>>>>>>>>> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either
>>>>>>>>> * replace the existing one or
>>>>>>>>> * drop out and not write the new one in.
>>>>>>>
>>>>>>> * hw_error :)
>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm not sure which way would be more natural.
>>>>>>>>
>>>>>>>> There is a third option - add another function, lets call it
>>>>>>>> pci_fixup_capability() which would do whatever pci_add_capability() does
>>>>>>>> but won't touch list pointers.
>>>>>>>
>>>>>>> What good is a function that breaks internal consistency?
>>>>>>
>>>>>>
>>>>>> It is broken already by having PCIDevice.used field. Normally pci_add_capability() would go through
>>>>>> the whole list and add a capability if it does not exist. Emulated devices which care about having a
>>>>>> capability at some fixed offset would have initialized their config space before calling this
>>>>>> capabilities API (as VFIO does).
>>>>>>
>>>>>> If we really want to support emulated devices which want some capabilities be at fixed offset and
>>>>>> others at random offsets (strange, but ok), I do not see how it is bad to restore this consistency
>>>>>> by special function (pci_fixup_capability()) to avoid its rewriting at different location as a guest
>>>>>> driver may care about its offset.
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> When vfio, pci_add_capability() is called from the code which knows
>>>>>>>> exactly that the capability exists and where it is and it calls
>>>>>>>> pci_add_capability() based on this knowledge so doing additional loops
>>>>>>>> just for imaginery scalability is a bit weird, no?
>>>>>>>
>>>>>>> Not sure I understand your proposal. The more generic a framework is, the better, no? In this code path we don't care about speed. We only care about consistency and reliability.
>>>>>>>
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>
>>
>> -- 
>> Alexey

Comments

Alexander Graf May 22, 2012, 6:31 a.m. UTC | #1
On 22.05.2012, at 08:11, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 22/05/12 15:52, Alexander Graf wrote:
>> 
>> 
>> On 22.05.2012, at 05:44, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>> 
>>> On 22/05/12 13:21, Alexander Graf wrote:
>>>> 
>>>> 
>>>> On 22.05.2012, at 04:02, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>> 
>>>>> On Fri, 2012-05-18 at 15:12 +1000, Alexey Kardashevskiy wrote:
>>>>>> Alexander,
>>>>>> 
>>>>>> Is that any better? :)
>>>>> 
>>>>> Alex (Graf that is), ping ?
>>>>> 
>>>>> The original patch from Alexey was fine btw.
>>>>> 
>>>>> VFIO will always call things with the existing capability offset so
>>>>> there's no real risk of doing the wrong thing or break the list or
>>>>> anything.
>>>>> 
>>>>> IE. A small simple patch that addresses the problem :-)
>>>>> 
>>>>> The new patch is a bit more "robust" I believe, I don't think we need to
>>>>> go too far to fix a problem we don't have. But we need a fix for the
>>>>> real issue and the simple patch does it neatly from what I can
>>>>> understand.
>>>>> 
>>>>> Cheers,
>>>>> Ben.
>>>>> 
>>>>>> 
>>>>>> @@ -1779,11 +1779,29 @@ static void pci_del_option_rom(PCIDevice *pdev)
>>>>>> * in pci config space */
>>>>>> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>>                      uint8_t offset, uint8_t size)
>>>>>> {
>>>>>> -    uint8_t *config;
>>>>>> +    uint8_t *config, existing;
>>>> 
>>>> Existing is a pointer to the target dev's config space, right?
>>> 
>>> Yes.
>>> 
>>>>>>   int i, overlapping_cap;
>>>>>> 
>>>>>> +    existing = pci_find_capability(pdev, cap_id);
>>>>>> +    if (existing) {
>>>>>> +        if (offset && (existing != offset)) {
>>>>>> +            return -EEXIST;
>>>>>> +        }
>>>>>> +        for (i = existing; i < size; ++i) {
>>>> 
>>>> So how does this possibly make sense?
>>> 
>>> Although I do not expect VFIO to add capabilities (does not make sense), I still want to double
>>> check that this space has not been tried to use by someone else.
>> 
>> i is an int. existing is a uint8_t*.
> 
> 
> It was there before me. This function already does a loop and this is how it was coded at the first place.

Ugh. Existing is a uint8_t, not a pointer. Gotta love C syntax...

> 
> 
>>>>>> +            if (pdev->used[i]) {
>>>>>> +                return -EFAULT;
>>>>>> +            }
>>>>>> +        }
>>>>>> +        memset(pdev->used + offset, 0xFF, size);
>>>> Why?
>>> 
>>> Because I am marking the space this capability takes as used.
>> 
>> But if it already existed (at the same offset), it should be set used already, no? Unless size > existing size, in which case you might overwrite data in the next chunk, no?
> 
> 
> No, it does not exist for VFIO - VFIO read the config space from the host kernel first and then calls msi_init or msix_init or whatever_else_init depending on what it got from the host kernel. And these xxx_init() functions eventually call pci_add_capability().

So why would the function that populates the config space initially not set the used flag correctly?

> 
> Sure we can either implement own msi_init/msix_init (and may be others in the future) for VFIO (which would do all the same as other QEMU devices except touching the capabilities)  OR  hack msi_init/msix_init not to touch capabilities if they exist.

No, calling the internal functions sounds fine to me. It's the step before that irritates me. VFIO shouldn't differ too much from an emulated device wrt its config space population really.

> 
> 
> 
>>>>>> +        /* Make capability read-only by default */
>>>>>> +        memset(pdev->wmask + offset, 0, size);
>>>> Why?
>>> 
>>> Because the pci_add_capability() does it for a new capability by default.
>> 
>> Hrm. So you're copying code? Can't you merge the overwrite and write cases?
> 
> I am trying to make it as a single chunk which is as small as possible.

No, you're needlessly duplicating code which is a bad idea :). Please reuse as much of the existing function as possible, unless it really doesn't make sense.

> 
> 
> If it helps, below is the same patch with extended context to see what is going on in that function.
> 
> 
> 
> 
> 
> 
> hw/pci.c |   20 +++++++++++++++++++-
> 1 files changed, 19 insertions(+), 1 deletions(-)
> 
> diff --git a/hw/pci.c b/hw/pci.c
> index 63a8219..7008a42 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -1772,75 +1772,93 @@ static int pci_add_option_rom(PCIDevice *pdev, bool is_default_rom)
>     ptr = memory_region_get_ram_ptr(&pdev->rom);
>     load_image(path, ptr);
>     g_free(path);
> 
>     if (is_default_rom) {
>         /* Only the default rom images will be patched (if needed). */
>         pci_patch_ids(pdev, ptr, size);
>     }
> 
>     qemu_put_ram_ptr(ptr);
> 
>     pci_register_bar(pdev, PCI_ROM_SLOT, 0, &pdev->rom);
> 
>     return 0;
> }
> 
> static void pci_del_option_rom(PCIDevice *pdev)
> {
>     if (!pdev->has_rom)
>         return;
> 
>     vmstate_unregister_ram(&pdev->rom, &pdev->qdev);
>     memory_region_destroy(&pdev->rom);
>     pdev->has_rom = false;
> }
> 
> /*
>  * if !offset
>  * Reserve space and add capability to the linked list in pci config space
>  *
>  * if offset = 0,
>  * Find and reserve space and add capability to the linked list
>  * in pci config space */
> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>                        uint8_t offset, uint8_t size)
> {
> -    uint8_t *config;
> +    uint8_t *config, existing;
>     int i, overlapping_cap;
> 
> +    existing = pci_find_capability(pdev, cap_id);
> +    if (existing) {
> +        if (offset && (existing != offset)) {
> +            return -EEXIST;
> +        }
> +        for (i = existing; i < size; ++i) {
> +            if (pdev->used[i]) {
> +                return -EFAULT;
> +            }
> +        }

}

> +        memset(pdev->used + offset, 0xFF, size);
> +        /* Make capability read-only by default */
> +        memset(pdev->wmask + offset, 0, size);
> +        /* Check capability by default */
> +        memset(pdev->cmask + offset, 0xFF, size);
> +        return existing;
> +    }
> +
>     if (!offset) {

&& !existing maybe?

>         offset = pci_find_space(pdev, size);
>         if (!offset) {
>             return -ENOSPC;
>         }
>     } else {
>         /* Verify that capabilities don't overlap.  Note: device assignment
>          * depends on this check to verify that the device is not broken.
>          * Should never trigger for emulated devices, but it's helpful
>          * for debugging these. */
>         for (i = offset; i < offset + size; i++) {
>             overlapping_cap = pci_find_capability_at_offset(pdev, i);
>             if (overlapping_cap) {
>                 fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
>                         "Attempt to add PCI capability %x at offset "
>                         "%x overlaps existing capability %x at offset %x\n",
>                         pci_find_domain(pdev->bus), pci_bus_num(pdev->bus),
>                         PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
>                         cap_id, offset, overlapping_cap, i);
>                 return -EINVAL;
>             }
>         }
>     }
> 

If (!existing) {

>     config = pdev->config + offset;
>     config[PCI_CAP_LIST_ID] = cap_id;
>     config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>     pdev->config[PCI_CAPABILITY_LIST] = offset;
>     pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;

}

which poses the question why the above wouldn't apply for the existing case. It would work just as well to leave that in, no?

Alex

>     memset(pdev->used + offset, 0xFF, size);
>     /* Make capability read-only by default */
>     memset(pdev->wmask + offset, 0, size);
>     /* Check capability by default */
>     memset(pdev->cmask + offset, 0xFF, size);
>     return offset;
> }
> 
> 
> 
> 
>>>>>> +        /* Check capability by default */
>>>>>> +        memset(pdev->cmask + offset, 0xFF, size);
>>>> 
>>>> I don't understand this part either.
>>> 
>>> The pci_add_capability() does it for a new capability by default.
>>> 
>>> 
>>> 
>>>> 
>>>> Alex
>>>> 
>>>>>> +        return existing;
>>>>>> +    }
>>>>>> +
>>>>>>   if (!offset) {
>>>>>>       offset = pci_find_space(pdev, size);
>>>>>>       if (!offset) {
>>>>>>           return -ENOSPC;
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 14/05/12 13:49, Alexey Kardashevskiy wrote:
>>>>>>> On 12/05/12 00:13, Alexander Graf wrote:
>>>>>>>> 
>>>>>>>> On 11.05.2012, at 14:47, Alexey Kardashevskiy wrote:
>>>>>>>> 
>>>>>>>>> 11.05.2012 20:52, Alexander Graf =0?8A0;:
>>>>>>>>>> 
>>>>>>>>>> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote:
>>>>>>>>>> 
>>>>>>>>>>> Normally the pci_add_capability is called on devices to add new
>>>>>>>>>>> capability. This is ok for emulated devices which capabilities list
>>>>>>>>>>> is being built by QEMU.
>>>>>>>>>>> 
>>>>>>>>>>> In the case of VFIO the capability may already exist and adding new
>>>>>>>>>>> capability into the beginning of the linked list may create a loop.
>>>>>>>>>>> 
>>>>>>>>>>> For example, the old code destroys the following config
>>>>>>>>>>> of PCIe Intel E1000E:
>>>>>>>>>>> 
>>>>>>>>>>> before adding PCI_CAP_ID_MSI (0x05):
>>>>>>>>>>> 0x34: 0xC8
>>>>>>>>>>> 0xC8: 0x01 0xD0
>>>>>>>>>>> 0xD0: 0x05 0xE0
>>>>>>>>>>> 0xE0: 0x10 0x00
>>>>>>>>>>> 
>>>>>>>>>>> after:
>>>>>>>>>>> 0x34: 0xD0
>>>>>>>>>>> 0xC8: 0x01 0xD0
>>>>>>>>>>> 0xD0: 0x05 0xC8
>>>>>>>>>>> 0xE0: 0x10 0x00
>>>>>>>>>>> 
>>>>>>>>>>> As result capabilities 0x01 and 0x05 point to each other.
>>>>>>>>>>> 
>>>>>>>>>>> The proposed patch does not change capability pointers when
>>>>>>>>>>> the same type capability is about to add.
>>>>>>>>>>> 
>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>> ---
>>>>>>>>>>> hw/pci.c |   10 ++++++----
>>>>>>>>>>> 1 files changed, 6 insertions(+), 4 deletions(-)
>>>>>>>>>>> 
>>>>>>>>>>> diff --git a/hw/pci.c b/hw/pci.c
>>>>>>>>>>> index aa0c0b8..1f7c924 100644
>>>>>>>>>>> --- a/hw/pci.c
>>>>>>>>>>> +++ b/hw/pci.c
>>>>>>>>>>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> config = pdev->config + offset;
>>>>>>>>>>> -    config[PCI_CAP_LIST_ID] = cap_id;
>>>>>>>>>>> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>>>>>>>>>>> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
>>>>>>>>>>> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>>>>>>>>>>> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
>>>>>>>>>> 
>>>>>>>>>> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either
>>>>>>>>>> * replace the existing one or
>>>>>>>>>> * drop out and not write the new one in.
>>>>>>>> 
>>>>>>>> * hw_error :)
>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I'm not sure which way would be more natural.
>>>>>>>>> 
>>>>>>>>> There is a third option - add another function, lets call it
>>>>>>>>> pci_fixup_capability() which would do whatever pci_add_capability() does
>>>>>>>>> but won't touch list pointers.
>>>>>>>> 
>>>>>>>> What good is a function that breaks internal consistency?
>>>>>>> 
>>>>>>> 
>>>>>>> It is broken already by having PCIDevice.used field. Normally pci_add_capability() would go through
>>>>>>> the whole list and add a capability if it does not exist. Emulated devices which care about having a
>>>>>>> capability at some fixed offset would have initialized their config space before calling this
>>>>>>> capabilities API (as VFIO does).
>>>>>>> 
>>>>>>> If we really want to support emulated devices which want some capabilities be at fixed offset and
>>>>>>> others at random offsets (strange, but ok), I do not see how it is bad to restore this consistency
>>>>>>> by special function (pci_fixup_capability()) to avoid its rewriting at different location as a guest
>>>>>>> driver may care about its offset.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>>> When vfio, pci_add_capability() is called from the code which knows
>>>>>>>>> exactly that the capability exists and where it is and it calls
>>>>>>>>> pci_add_capability() based on this knowledge so doing additional loops
>>>>>>>>> just for imaginery scalability is a bit weird, no?
>>>>>>>> 
>>>>>>>> Not sure I understand your proposal. The more generic a framework is, the better, no? In this code path we don't care about speed. We only care about consistency and reliability.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Alex
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>> -- 
>>> Alexey
> 
> 
> -- 
> Alexey
Alexander Graf May 22, 2012, 6:38 a.m. UTC | #2
On 22.05.2012, at 08:11, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 22/05/12 15:52, Alexander Graf wrote:
>> 
>> 
>> On 22.05.2012, at 05:44, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>> 
>>> On 22/05/12 13:21, Alexander Graf wrote:
>>>> 
>>>> 
>>>> On 22.05.2012, at 04:02, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>> 
>>>>> On Fri, 2012-05-18 at 15:12 +1000, Alexey Kardashevskiy wrote:
>>>>>> Alexander,
>>>>>> 
>>>>>> Is that any better? :)
>>>>> 
>>>>> Alex (Graf that is), ping ?
>>>>> 
>>>>> The original patch from Alexey was fine btw.
>>>>> 
>>>>> VFIO will always call things with the existing capability offset so
>>>>> there's no real risk of doing the wrong thing or break the list or
>>>>> anything.
>>>>> 
>>>>> IE. A small simple patch that addresses the problem :-)
>>>>> 
>>>>> The new patch is a bit more "robust" I believe, I don't think we need to
>>>>> go too far to fix a problem we don't have. But we need a fix for the
>>>>> real issue and the simple patch does it neatly from what I can
>>>>> understand.
>>>>> 
>>>>> Cheers,
>>>>> Ben.
>>>>> 
>>>>>> 
>>>>>> @@ -1779,11 +1779,29 @@ static void pci_del_option_rom(PCIDevice *pdev)
>>>>>> * in pci config space */
>>>>>> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>>                      uint8_t offset, uint8_t size)
>>>>>> {
>>>>>> -    uint8_t *config;
>>>>>> +    uint8_t *config, existing;
>>>> 
>>>> Existing is a pointer to the target dev's config space, right?
>>> 
>>> Yes.
>>> 
>>>>>>   int i, overlapping_cap;
>>>>>> 
>>>>>> +    existing = pci_find_capability(pdev, cap_id);
>>>>>> +    if (existing) {
>>>>>> +        if (offset && (existing != offset)) {
>>>>>> +            return -EEXIST;
>>>>>> +        }
>>>>>> +        for (i = existing; i < size; ++i) {
>>>> 
>>>> So how does this possibly make sense?
>>> 
>>> Although I do not expect VFIO to add capabilities (does not make sense), I still want to double
>>> check that this space has not been tried to use by someone else.
>> 
>> i is an int. existing is a uint8_t*.
> 
> 
> It was there before me. This function already does a loop and this is how it was coded at the first place.

Also, while at it, please add some comments at least for the code you add that explain why you do the things you do :).

Alex
Alexey Kardashevskiy May 22, 2012, 7:01 a.m. UTC | #3
On 22/05/12 16:31, Alexander Graf wrote:
> 
> 
> On 22.05.2012, at 08:11, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 22/05/12 15:52, Alexander Graf wrote:
>>>
>>>
>>> On 22.05.2012, at 05:44, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>
>>>> On 22/05/12 13:21, Alexander Graf wrote:
>>>>>
>>>>>
>>>>> On 22.05.2012, at 04:02, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>>>
>>>>>> On Fri, 2012-05-18 at 15:12 +1000, Alexey Kardashevskiy wrote:
>>>>>>> Alexander,
>>>>>>>
>>>>>>> Is that any better? :)
>>>>>>
>>>>>> Alex (Graf that is), ping ?
>>>>>>
>>>>>> The original patch from Alexey was fine btw.
>>>>>>
>>>>>> VFIO will always call things with the existing capability offset so
>>>>>> there's no real risk of doing the wrong thing or break the list or
>>>>>> anything.
>>>>>>
>>>>>> IE. A small simple patch that addresses the problem :-)
>>>>>>
>>>>>> The new patch is a bit more "robust" I believe, I don't think we need to
>>>>>> go too far to fix a problem we don't have. But we need a fix for the
>>>>>> real issue and the simple patch does it neatly from what I can
>>>>>> understand.
>>>>>>
>>>>>> Cheers,
>>>>>> Ben.
>>>>>>
>>>>>>>
>>>>>>> @@ -1779,11 +1779,29 @@ static void pci_del_option_rom(PCIDevice *pdev)
>>>>>>> * in pci config space */
>>>>>>> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>>>                      uint8_t offset, uint8_t size)
>>>>>>> {
>>>>>>> -    uint8_t *config;
>>>>>>> +    uint8_t *config, existing;
>>>>>
>>>>> Existing is a pointer to the target dev's config space, right?
>>>>
>>>> Yes.
>>>>
>>>>>>>   int i, overlapping_cap;
>>>>>>>
>>>>>>> +    existing = pci_find_capability(pdev, cap_id);
>>>>>>> +    if (existing) {
>>>>>>> +        if (offset && (existing != offset)) {
>>>>>>> +            return -EEXIST;
>>>>>>> +        }
>>>>>>> +        for (i = existing; i < size; ++i) {
>>>>>
>>>>> So how does this possibly make sense?
>>>>
>>>> Although I do not expect VFIO to add capabilities (does not make sense), I still want to double
>>>> check that this space has not been tried to use by someone else.
>>>
>>> i is an int. existing is a uint8_t*.
>>
>>
>> It was there before me. This function already does a loop and this is how it was coded at the first place.
> 
> Ugh. Existing is a uint8_t, not a pointer. Gotta love C syntax...


Well it is still does not make much sense to have "int i" rather than "uint8_t i" :)


>>>>>>> +            if (pdev->used[i]) {
>>>>>>> +                return -EFAULT;
>>>>>>> +            }
>>>>>>> +        }
>>>>>>> +        memset(pdev->used + offset, 0xFF, size);
>>>>> Why?
>>>>
>>>> Because I am marking the space this capability takes as used.
>>>
>>> But if it already existed (at the same offset), it should be set used already, no? Unless size > existing size, in which case you might overwrite data in the next chunk, no?
>>
>>
>> No, it does not exist for VFIO - VFIO read the config space from the host kernel first and then calls msi_init or msix_init or whatever_else_init depending on what it got from the host kernel. And these xxx_init() functions eventually call pci_add_capability().
> So why would the function that populates the config space initially not set the used flag correctly?


This is internal kitchen of PCIDevice which I do not want to touch from anywhere but pci.c. And
there is no "fixup_capability" or something.


>> Sure we can either implement own msi_init/msix_init (and may be others in the future) for VFIO (which would do all the same as other QEMU devices except touching the capabilities)  OR  hack msi_init/msix_init not to touch capabilities if they exist.
> No, calling the internal functions sounds fine to me. It's the step before that irritates me. VFIO shouldn't differ too much from an emulated device wrt its config space population really.


The last thing we want for a VFIO device is changing its capabilities list.


>>>>>>> +        /* Make capability read-only by default */
>>>>>>> +        memset(pdev->wmask + offset, 0, size);
>>>>> Why?
>>>>
>>>> Because the pci_add_capability() does it for a new capability by default.
>>>
>>> Hrm. So you're copying code? Can't you merge the overwrite and write cases?
>>
>> I am trying to make it as a single chunk which is as small as possible.
> 
> No, you're needlessly duplicating code which is a bad idea :). Please reuse as much of the existing function as possible, unless it really doesn't make sense.

I actually duplicated 4 (four) lines and did it just once. This is too little to be called
"duplicating" :) And I get very special case visually separated and easy to remove if we find a
better solution later.

But - no problemo, I'll rework it.


[no further comments]



>> If it helps, below is the same patch with extended context to see what is going on in that function.
>>
>>
>>
>>
>>
>>
>> hw/pci.c |   20 +++++++++++++++++++-
>> 1 files changed, 19 insertions(+), 1 deletions(-)
>>
>> diff --git a/hw/pci.c b/hw/pci.c
>> index 63a8219..7008a42 100644
>> --- a/hw/pci.c
>> +++ b/hw/pci.c
>> @@ -1772,75 +1772,93 @@ static int pci_add_option_rom(PCIDevice *pdev, bool is_default_rom)
>>     ptr = memory_region_get_ram_ptr(&pdev->rom);
>>     load_image(path, ptr);
>>     g_free(path);
>>
>>     if (is_default_rom) {
>>         /* Only the default rom images will be patched (if needed). */
>>         pci_patch_ids(pdev, ptr, size);
>>     }
>>
>>     qemu_put_ram_ptr(ptr);
>>
>>     pci_register_bar(pdev, PCI_ROM_SLOT, 0, &pdev->rom);
>>
>>     return 0;
>> }
>>
>> static void pci_del_option_rom(PCIDevice *pdev)
>> {
>>     if (!pdev->has_rom)
>>         return;
>>
>>     vmstate_unregister_ram(&pdev->rom, &pdev->qdev);
>>     memory_region_destroy(&pdev->rom);
>>     pdev->has_rom = false;
>> }
>>
>> /*
>>  * if !offset
>>  * Reserve space and add capability to the linked list in pci config space
>>  *
>>  * if offset = 0,
>>  * Find and reserve space and add capability to the linked list
>>  * in pci config space */
>> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>                        uint8_t offset, uint8_t size)
>> {
>> -    uint8_t *config;
>> +    uint8_t *config, existing;
>>     int i, overlapping_cap;
>>
>> +    existing = pci_find_capability(pdev, cap_id);
>> +    if (existing) {
>> +        if (offset && (existing != offset)) {
>> +            return -EEXIST;
>> +        }
>> +        for (i = existing; i < size; ++i) {
>> +            if (pdev->used[i]) {
>> +                return -EFAULT;
>> +            }
>> +        }
> 
> }
> 
>> +        memset(pdev->used + offset, 0xFF, size);
>> +        /* Make capability read-only by default */
>> +        memset(pdev->wmask + offset, 0, size);
>> +        /* Check capability by default */
>> +        memset(pdev->cmask + offset, 0xFF, size);
>> +        return existing;
>> +    }
>> +
>>     if (!offset) {
> 
> && !existing maybe?
> 
>>         offset = pci_find_space(pdev, size);
>>         if (!offset) {
>>             return -ENOSPC;
>>         }
>>     } else {
>>         /* Verify that capabilities don't overlap.  Note: device assignment
>>          * depends on this check to verify that the device is not broken.
>>          * Should never trigger for emulated devices, but it's helpful
>>          * for debugging these. */
>>         for (i = offset; i < offset + size; i++) {
>>             overlapping_cap = pci_find_capability_at_offset(pdev, i);
>>             if (overlapping_cap) {
>>                 fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
>>                         "Attempt to add PCI capability %x at offset "
>>                         "%x overlaps existing capability %x at offset %x\n",
>>                         pci_find_domain(pdev->bus), pci_bus_num(pdev->bus),
>>                         PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
>>                         cap_id, offset, overlapping_cap, i);
>>                 return -EINVAL;
>>             }
>>         }
>>     }
>>
> 
> If (!existing) {
> 
>>     config = pdev->config + offset;
>>     config[PCI_CAP_LIST_ID] = cap_id;
>>     config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>>     pdev->config[PCI_CAPABILITY_LIST] = offset;
>>     pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
> 
> }
> 
> which poses the question why the above wouldn't apply for the existing case. It would work just as well to leave that in, no?
> 
> Alex
> 
>>     memset(pdev->used + offset, 0xFF, size);
>>     /* Make capability read-only by default */
>>     memset(pdev->wmask + offset, 0, size);
>>     /* Check capability by default */
>>     memset(pdev->cmask + offset, 0xFF, size);
>>     return offset;
>> }
>>
>>
>>
>>
>>>>>>> +        /* Check capability by default */
>>>>>>> +        memset(pdev->cmask + offset, 0xFF, size);
>>>>>
>>>>> I don't understand this part either.
>>>>
>>>> The pci_add_capability() does it for a new capability by default.
>>>>
>>>>
>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>>>> +        return existing;
>>>>>>> +    }
>>>>>>> +
>>>>>>>   if (!offset) {
>>>>>>>       offset = pci_find_space(pdev, size);
>>>>>>>       if (!offset) {
>>>>>>>           return -ENOSPC;
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 14/05/12 13:49, Alexey Kardashevskiy wrote:
>>>>>>>> On 12/05/12 00:13, Alexander Graf wrote:
>>>>>>>>>
>>>>>>>>> On 11.05.2012, at 14:47, Alexey Kardashevskiy wrote:
>>>>>>>>>
>>>>>>>>>> 11.05.2012 20:52, Alexander Graf =0?8A0;:
>>>>>>>>>>>
>>>>>>>>>>> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Normally the pci_add_capability is called on devices to add new
>>>>>>>>>>>> capability. This is ok for emulated devices which capabilities list
>>>>>>>>>>>> is being built by QEMU.
>>>>>>>>>>>>
>>>>>>>>>>>> In the case of VFIO the capability may already exist and adding new
>>>>>>>>>>>> capability into the beginning of the linked list may create a loop.
>>>>>>>>>>>>
>>>>>>>>>>>> For example, the old code destroys the following config
>>>>>>>>>>>> of PCIe Intel E1000E:
>>>>>>>>>>>>
>>>>>>>>>>>> before adding PCI_CAP_ID_MSI (0x05):
>>>>>>>>>>>> 0x34: 0xC8
>>>>>>>>>>>> 0xC8: 0x01 0xD0
>>>>>>>>>>>> 0xD0: 0x05 0xE0
>>>>>>>>>>>> 0xE0: 0x10 0x00
>>>>>>>>>>>>
>>>>>>>>>>>> after:
>>>>>>>>>>>> 0x34: 0xD0
>>>>>>>>>>>> 0xC8: 0x01 0xD0
>>>>>>>>>>>> 0xD0: 0x05 0xC8
>>>>>>>>>>>> 0xE0: 0x10 0x00
>>>>>>>>>>>>
>>>>>>>>>>>> As result capabilities 0x01 and 0x05 point to each other.
>>>>>>>>>>>>
>>>>>>>>>>>> The proposed patch does not change capability pointers when
>>>>>>>>>>>> the same type capability is about to add.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>> ---
>>>>>>>>>>>> hw/pci.c |   10 ++++++----
>>>>>>>>>>>> 1 files changed, 6 insertions(+), 4 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/hw/pci.c b/hw/pci.c
>>>>>>>>>>>> index aa0c0b8..1f7c924 100644
>>>>>>>>>>>> --- a/hw/pci.c
>>>>>>>>>>>> +++ b/hw/pci.c
>>>>>>>>>>>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> config = pdev->config + offset;
>>>>>>>>>>>> -    config[PCI_CAP_LIST_ID] = cap_id;
>>>>>>>>>>>> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>>>>>>>>>>>> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
>>>>>>>>>>>> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>>>>>>>>>>>> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
>>>>>>>>>>>
>>>>>>>>>>> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either
>>>>>>>>>>> * replace the existing one or
>>>>>>>>>>> * drop out and not write the new one in.
>>>>>>>>>
>>>>>>>>> * hw_error :)
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure which way would be more natural.
>>>>>>>>>>
>>>>>>>>>> There is a third option - add another function, lets call it
>>>>>>>>>> pci_fixup_capability() which would do whatever pci_add_capability() does
>>>>>>>>>> but won't touch list pointers.
>>>>>>>>>
>>>>>>>>> What good is a function that breaks internal consistency?
>>>>>>>>
>>>>>>>>
>>>>>>>> It is broken already by having PCIDevice.used field. Normally pci_add_capability() would go through
>>>>>>>> the whole list and add a capability if it does not exist. Emulated devices which care about having a
>>>>>>>> capability at some fixed offset would have initialized their config space before calling this
>>>>>>>> capabilities API (as VFIO does).
>>>>>>>>
>>>>>>>> If we really want to support emulated devices which want some capabilities be at fixed offset and
>>>>>>>> others at random offsets (strange, but ok), I do not see how it is bad to restore this consistency
>>>>>>>> by special function (pci_fixup_capability()) to avoid its rewriting at different location as a guest
>>>>>>>> driver may care about its offset.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>> When vfio, pci_add_capability() is called from the code which knows
>>>>>>>>>> exactly that the capability exists and where it is and it calls
>>>>>>>>>> pci_add_capability() based on this knowledge so doing additional loops
>>>>>>>>>> just for imaginery scalability is a bit weird, no?
>>>>>>>>>
>>>>>>>>> Not sure I understand your proposal. The more generic a framework is, the better, no? In this code path we don't care about speed. We only care about consistency and reliability.
Alexander Graf May 22, 2012, 7:13 a.m. UTC | #4
On 22.05.2012, at 09:01, Alexey Kardashevskiy wrote:

> On 22/05/12 16:31, Alexander Graf wrote:
>> 
>> 
>> On 22.05.2012, at 08:11, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>> 
>>> On 22/05/12 15:52, Alexander Graf wrote:
>>>> 
>>>> 
>>>> On 22.05.2012, at 05:44, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>> 
>>>>> On 22/05/12 13:21, Alexander Graf wrote:
>>>>>> 
>>>>>> 
>>>>>> On 22.05.2012, at 04:02, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>>>> 
>>>>>>> On Fri, 2012-05-18 at 15:12 +1000, Alexey Kardashevskiy wrote:
>>>>>>>> Alexander,
>>>>>>>> 
>>>>>>>> Is that any better? :)
>>>>>>> 
>>>>>>> Alex (Graf that is), ping ?
>>>>>>> 
>>>>>>> The original patch from Alexey was fine btw.
>>>>>>> 
>>>>>>> VFIO will always call things with the existing capability offset so
>>>>>>> there's no real risk of doing the wrong thing or break the list or
>>>>>>> anything.
>>>>>>> 
>>>>>>> IE. A small simple patch that addresses the problem :-)
>>>>>>> 
>>>>>>> The new patch is a bit more "robust" I believe, I don't think we need to
>>>>>>> go too far to fix a problem we don't have. But we need a fix for the
>>>>>>> real issue and the simple patch does it neatly from what I can
>>>>>>> understand.
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Ben.
>>>>>>> 
>>>>>>>> 
>>>>>>>> @@ -1779,11 +1779,29 @@ static void pci_del_option_rom(PCIDevice *pdev)
>>>>>>>> * in pci config space */
>>>>>>>> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>>>>                     uint8_t offset, uint8_t size)
>>>>>>>> {
>>>>>>>> -    uint8_t *config;
>>>>>>>> +    uint8_t *config, existing;
>>>>>> 
>>>>>> Existing is a pointer to the target dev's config space, right?
>>>>> 
>>>>> Yes.
>>>>> 
>>>>>>>>  int i, overlapping_cap;
>>>>>>>> 
>>>>>>>> +    existing = pci_find_capability(pdev, cap_id);
>>>>>>>> +    if (existing) {
>>>>>>>> +        if (offset && (existing != offset)) {
>>>>>>>> +            return -EEXIST;
>>>>>>>> +        }
>>>>>>>> +        for (i = existing; i < size; ++i) {
>>>>>> 
>>>>>> So how does this possibly make sense?
>>>>> 
>>>>> Although I do not expect VFIO to add capabilities (does not make sense), I still want to double
>>>>> check that this space has not been tried to use by someone else.
>>>> 
>>>> i is an int. existing is a uint8_t*.
>>> 
>>> 
>>> It was there before me. This function already does a loop and this is how it was coded at the first place.
>> 
>> Ugh. Existing is a uint8_t, not a pointer. Gotta love C syntax...
> 
> 
> Well it is still does not make much sense to have "int i" rather than "uint8_t i" :)
> 
> 
>>>>>>>> +            if (pdev->used[i]) {
>>>>>>>> +                return -EFAULT;
>>>>>>>> +            }
>>>>>>>> +        }
>>>>>>>> +        memset(pdev->used + offset, 0xFF, size);
>>>>>> Why?
>>>>> 
>>>>> Because I am marking the space this capability takes as used.
>>>> 
>>>> But if it already existed (at the same offset), it should be set used already, no? Unless size > existing size, in which case you might overwrite data in the next chunk, no?
>>> 
>>> 
>>> No, it does not exist for VFIO - VFIO read the config space from the host kernel first and then calls msi_init or msix_init or whatever_else_init depending on what it got from the host kernel. And these xxx_init() functions eventually call pci_add_capability().
>> So why would the function that populates the config space initially not set the used flag correctly?
> 
> 
> This is internal kitchen of PCIDevice which I do not want to touch from anywhere but pci.c. And
> there is no "fixup_capability" or something.

Hrm. Maybe we should have one? :) Or instead of populating the config space with the exact data from the host device, loop through the host device capabilities and populate them using this function as we go. That should maintain the offsets, but ensure that all internal flags are set, no?

> 
> 
>>> Sure we can either implement own msi_init/msix_init (and may be others in the future) for VFIO (which would do all the same as other QEMU devices except touching the capabilities)  OR  hack msi_init/msix_init not to touch capabilities if they exist.
>> No, calling the internal functions sounds fine to me. It's the step before that irritates me. VFIO shouldn't differ too much from an emulated device wrt its config space population really.
> 
> 
> The last thing we want for a VFIO device is changing its capabilities list.

Well - we want it to look the same. The population should go through the same methods as emulated devices have to go through, no? :)

> 
> 
>>>>>>>> +        /* Make capability read-only by default */
>>>>>>>> +        memset(pdev->wmask + offset, 0, size);
>>>>>> Why?
>>>>> 
>>>>> Because the pci_add_capability() does it for a new capability by default.
>>>> 
>>>> Hrm. So you're copying code? Can't you merge the overwrite and write cases?
>>> 
>>> I am trying to make it as a single chunk which is as small as possible.
>> 
>> No, you're needlessly duplicating code which is a bad idea :). Please reuse as much of the existing function as possible, unless it really doesn't make sense.
> 
> I actually duplicated 4 (four) lines and did it just once. This is too little to be called
> "duplicating" :) And I get very special case visually separated and easy to remove if we find a
> better solution later.

Yeah, but the special case really shouldn't be all that special - that's why I was irritated.

> But - no problemo, I'll rework it.

Thanks!

> [no further comments]

What about mine below? ;)


Alex

> 
> 
> 
>>> If it helps, below is the same patch with extended context to see what is going on in that function.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> hw/pci.c |   20 +++++++++++++++++++-
>>> 1 files changed, 19 insertions(+), 1 deletions(-)
>>> 
>>> diff --git a/hw/pci.c b/hw/pci.c
>>> index 63a8219..7008a42 100644
>>> --- a/hw/pci.c
>>> +++ b/hw/pci.c
>>> @@ -1772,75 +1772,93 @@ static int pci_add_option_rom(PCIDevice *pdev, bool is_default_rom)
>>>    ptr = memory_region_get_ram_ptr(&pdev->rom);
>>>    load_image(path, ptr);
>>>    g_free(path);
>>> 
>>>    if (is_default_rom) {
>>>        /* Only the default rom images will be patched (if needed). */
>>>        pci_patch_ids(pdev, ptr, size);
>>>    }
>>> 
>>>    qemu_put_ram_ptr(ptr);
>>> 
>>>    pci_register_bar(pdev, PCI_ROM_SLOT, 0, &pdev->rom);
>>> 
>>>    return 0;
>>> }
>>> 
>>> static void pci_del_option_rom(PCIDevice *pdev)
>>> {
>>>    if (!pdev->has_rom)
>>>        return;
>>> 
>>>    vmstate_unregister_ram(&pdev->rom, &pdev->qdev);
>>>    memory_region_destroy(&pdev->rom);
>>>    pdev->has_rom = false;
>>> }
>>> 
>>> /*
>>> * if !offset
>>> * Reserve space and add capability to the linked list in pci config space
>>> *
>>> * if offset = 0,
>>> * Find and reserve space and add capability to the linked list
>>> * in pci config space */
>>> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>                       uint8_t offset, uint8_t size)
>>> {
>>> -    uint8_t *config;
>>> +    uint8_t *config, existing;
>>>    int i, overlapping_cap;
>>> 
>>> +    existing = pci_find_capability(pdev, cap_id);
>>> +    if (existing) {
>>> +        if (offset && (existing != offset)) {
>>> +            return -EEXIST;
>>> +        }
>>> +        for (i = existing; i < size; ++i) {
>>> +            if (pdev->used[i]) {
>>> +                return -EFAULT;
>>> +            }
>>> +        }
>> 
>> }
>> 
>>> +        memset(pdev->used + offset, 0xFF, size);
>>> +        /* Make capability read-only by default */
>>> +        memset(pdev->wmask + offset, 0, size);
>>> +        /* Check capability by default */
>>> +        memset(pdev->cmask + offset, 0xFF, size);
>>> +        return existing;
>>> +    }
>>> +
>>>    if (!offset) {
>> 
>> && !existing maybe?
>> 
>>>        offset = pci_find_space(pdev, size);
>>>        if (!offset) {
>>>            return -ENOSPC;
>>>        }
>>>    } else {
>>>        /* Verify that capabilities don't overlap.  Note: device assignment
>>>         * depends on this check to verify that the device is not broken.
>>>         * Should never trigger for emulated devices, but it's helpful
>>>         * for debugging these. */
>>>        for (i = offset; i < offset + size; i++) {
>>>            overlapping_cap = pci_find_capability_at_offset(pdev, i);
>>>            if (overlapping_cap) {
>>>                fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
>>>                        "Attempt to add PCI capability %x at offset "
>>>                        "%x overlaps existing capability %x at offset %x\n",
>>>                        pci_find_domain(pdev->bus), pci_bus_num(pdev->bus),
>>>                        PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
>>>                        cap_id, offset, overlapping_cap, i);
>>>                return -EINVAL;
>>>            }
>>>        }
>>>    }
>>> 
>> 
>> If (!existing) {
>> 
>>>    config = pdev->config + offset;
>>>    config[PCI_CAP_LIST_ID] = cap_id;
>>>    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>>>    pdev->config[PCI_CAPABILITY_LIST] = offset;
>>>    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>> 
>> }
>> 
>> which poses the question why the above wouldn't apply for the existing case. It would work just as well to leave that in, no?
>> 
>> Alex
>> 
>>>    memset(pdev->used + offset, 0xFF, size);
>>>    /* Make capability read-only by default */
>>>    memset(pdev->wmask + offset, 0, size);
>>>    /* Check capability by default */
>>>    memset(pdev->cmask + offset, 0xFF, size);
>>>    return offset;
>>> }
>>> 
>>> 
>>> 
>>> 
>>>>>>>> +        /* Check capability by default */
>>>>>>>> +        memset(pdev->cmask + offset, 0xFF, size);
>>>>>> 
>>>>>> I don't understand this part either.
>>>>> 
>>>>> The pci_add_capability() does it for a new capability by default.
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Alex
>>>>>> 
>>>>>>>> +        return existing;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>>  if (!offset) {
>>>>>>>>      offset = pci_find_space(pdev, size);
>>>>>>>>      if (!offset) {
>>>>>>>>          return -ENOSPC;
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 14/05/12 13:49, Alexey Kardashevskiy wrote:
>>>>>>>>> On 12/05/12 00:13, Alexander Graf wrote:
>>>>>>>>>> 
>>>>>>>>>> On 11.05.2012, at 14:47, Alexey Kardashevskiy wrote:
>>>>>>>>>> 
>>>>>>>>>>> 11.05.2012 20:52, Alexander Graf =0?8A0;:
>>>>>>>>>>>> 
>>>>>>>>>>>> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Normally the pci_add_capability is called on devices to add new
>>>>>>>>>>>>> capability. This is ok for emulated devices which capabilities list
>>>>>>>>>>>>> is being built by QEMU.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In the case of VFIO the capability may already exist and adding new
>>>>>>>>>>>>> capability into the beginning of the linked list may create a loop.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For example, the old code destroys the following config
>>>>>>>>>>>>> of PCIe Intel E1000E:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> before adding PCI_CAP_ID_MSI (0x05):
>>>>>>>>>>>>> 0x34: 0xC8
>>>>>>>>>>>>> 0xC8: 0x01 0xD0
>>>>>>>>>>>>> 0xD0: 0x05 0xE0
>>>>>>>>>>>>> 0xE0: 0x10 0x00
>>>>>>>>>>>>> 
>>>>>>>>>>>>> after:
>>>>>>>>>>>>> 0x34: 0xD0
>>>>>>>>>>>>> 0xC8: 0x01 0xD0
>>>>>>>>>>>>> 0xD0: 0x05 0xC8
>>>>>>>>>>>>> 0xE0: 0x10 0x00
>>>>>>>>>>>>> 
>>>>>>>>>>>>> As result capabilities 0x01 and 0x05 point to each other.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The proposed patch does not change capability pointers when
>>>>>>>>>>>>> the same type capability is about to add.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>> hw/pci.c |   10 ++++++----
>>>>>>>>>>>>> 1 files changed, 6 insertions(+), 4 deletions(-)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> diff --git a/hw/pci.c b/hw/pci.c
>>>>>>>>>>>>> index aa0c0b8..1f7c924 100644
>>>>>>>>>>>>> --- a/hw/pci.c
>>>>>>>>>>>>> +++ b/hw/pci.c
>>>>>>>>>>>>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>>>>>>>>>>>>> }
>>>>>>>>>>>>> 
>>>>>>>>>>>>> config = pdev->config + offset;
>>>>>>>>>>>>> -    config[PCI_CAP_LIST_ID] = cap_id;
>>>>>>>>>>>>> -    config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
>>>>>>>>>>>>> -    pdev->config[PCI_CAPABILITY_LIST] = offset;
>>>>>>>>>>>>> -    pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
>>>>>>>>>>>>> +    if (config[PCI_CAP_LIST_ID] != cap_id) {
>>>>>>>>>>>> 
>>>>>>>>>>>> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either
>>>>>>>>>>>> * replace the existing one or
>>>>>>>>>>>> * drop out and not write the new one in.
>>>>>>>>>> 
>>>>>>>>>> * hw_error :)
>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm not sure which way would be more natural.
>>>>>>>>>>> 
>>>>>>>>>>> There is a third option - add another function, lets call it
>>>>>>>>>>> pci_fixup_capability() which would do whatever pci_add_capability() does
>>>>>>>>>>> but won't touch list pointers.
>>>>>>>>>> 
>>>>>>>>>> What good is a function that breaks internal consistency?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> It is broken already by having PCIDevice.used field. Normally pci_add_capability() would go through
>>>>>>>>> the whole list and add a capability if it does not exist. Emulated devices which care about having a
>>>>>>>>> capability at some fixed offset would have initialized their config space before calling this
>>>>>>>>> capabilities API (as VFIO does).
>>>>>>>>> 
>>>>>>>>> If we really want to support emulated devices which want some capabilities be at fixed offset and
>>>>>>>>> others at random offsets (strange, but ok), I do not see how it is bad to restore this consistency
>>>>>>>>> by special function (pci_fixup_capability()) to avoid its rewriting at different location as a guest
>>>>>>>>> driver may care about its offset.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>> When vfio, pci_add_capability() is called from the code which knows
>>>>>>>>>>> exactly that the capability exists and where it is and it calls
>>>>>>>>>>> pci_add_capability() based on this knowledge so doing additional loops
>>>>>>>>>>> just for imaginery scalability is a bit weird, no?
>>>>>>>>>> 
>>>>>>>>>> Not sure I understand your proposal. The more generic a framework is, the better, no? In this code path we don't care about speed. We only care about consistency and reliability.
> 
> 
> 
> 
> -- 
> Alexey
Benjamin Herrenschmidt May 22, 2012, 7:37 a.m. UTC | #5
On Tue, 2012-05-22 at 09:13 +0200, Alexander Graf wrote:
> On 22.05.2012, at 09:01, Alexey Kardashevskiy wrote:

> > This is internal kitchen of PCIDevice which I do not want to touch
> from anywhere but pci.c. And
> > there is no "fixup_capability" or something.
> 
> Hrm. Maybe we should have one? :) Or instead of populating the config
> space with the exact data from the host device,
> loop through the host device capabilities and populate them using this
> function as we go.
>  That should maintain the offsets, but ensure that all internal flags
> are set, no?

That actually sounds reasonable, though it might be more efficient
to have something like pci_parse_config() or similar called once
with a pre-cooked config space.

Internally inside pci.c it can do that same loop you mention. The
advantage is that it can also do whatever else we might need in the
future. If for some reason, something wants to cache a cap pointer
it can be done there, whatever else that is normally initialized as
fields to generate the config space can be initialized by reading the
config space and then initializing the fields etc... from that one
function.

Cheers,
Ben.
diff mbox

Patch

diff --git a/hw/pci.c b/hw/pci.c
index 63a8219..7008a42 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -1772,75 +1772,93 @@  static int pci_add_option_rom(PCIDevice *pdev, bool is_default_rom)
     ptr = memory_region_get_ram_ptr(&pdev->rom);
     load_image(path, ptr);
     g_free(path);
 
     if (is_default_rom) {
         /* Only the default rom images will be patched (if needed). */
         pci_patch_ids(pdev, ptr, size);
     }
 
     qemu_put_ram_ptr(ptr);
 
     pci_register_bar(pdev, PCI_ROM_SLOT, 0, &pdev->rom);
 
     return 0;
 }
 
 static void pci_del_option_rom(PCIDevice *pdev)
 {
     if (!pdev->has_rom)
         return;
 
     vmstate_unregister_ram(&pdev->rom, &pdev->qdev);
     memory_region_destroy(&pdev->rom);
     pdev->has_rom = false;
 }
 
 /*
  * if !offset
  * Reserve space and add capability to the linked list in pci config space
  *
  * if offset = 0,
  * Find and reserve space and add capability to the linked list
  * in pci config space */
 int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
                        uint8_t offset, uint8_t size)
 {
-    uint8_t *config;
+    uint8_t *config, existing;
     int i, overlapping_cap;
 
+    existing = pci_find_capability(pdev, cap_id);
+    if (existing) {
+        if (offset && (existing != offset)) {
+            return -EEXIST;
+        }
+        for (i = existing; i < size; ++i) {
+            if (pdev->used[i]) {
+                return -EFAULT;
+            }
+        }
+        memset(pdev->used + offset, 0xFF, size);
+        /* Make capability read-only by default */
+        memset(pdev->wmask + offset, 0, size);
+        /* Check capability by default */
+        memset(pdev->cmask + offset, 0xFF, size);
+        return existing;
+    }
+
     if (!offset) {
         offset = pci_find_space(pdev, size);
         if (!offset) {
             return -ENOSPC;
         }
     } else {
         /* Verify that capabilities don't overlap.  Note: device assignment
          * depends on this check to verify that the device is not broken.
          * Should never trigger for emulated devices, but it's helpful
          * for debugging these. */
         for (i = offset; i < offset + size; i++) {
             overlapping_cap = pci_find_capability_at_offset(pdev, i);
             if (overlapping_cap) {
                 fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
                         "Attempt to add PCI capability %x at offset "
                         "%x overlaps existing capability %x at offset %x\n",
                         pci_find_domain(pdev->bus), pci_bus_num(pdev->bus),
                         PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
                         cap_id, offset, overlapping_cap, i);
                 return -EINVAL;
             }
         }
     }
 
     config = pdev->config + offset;
     config[PCI_CAP_LIST_ID] = cap_id;
     config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
     pdev->config[PCI_CAPABILITY_LIST] = offset;
     pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
     memset(pdev->used + offset, 0xFF, size);
     /* Make capability read-only by default */
     memset(pdev->wmask + offset, 0, size);
     /* Check capability by default */
     memset(pdev->cmask + offset, 0xFF, size);
     return offset;
 }