
[v4,for-2.3,13/25] hw/acpi: remove from root bus 0 the crs resources used by other busses.

Message ID 1425813387-31231-14-git-send-email-marcel@redhat.com
State New

Commit Message

Marcel Apfelbaum March 8, 2015, 11:16 a.m. UTC
If multiple root buses are used, root bus 0 cannot use all of the
PCI hole ranges. Remove the IO/mem ranges used by the other
primary buses from its _CRS. For example, if an extra root bus was
assigned the IO range 0x2000-0x2FFF, bus 0 now advertises
0x0D00-0x1FFF and 0x3000-0xFFFF instead of the full 0x0D00-0xFFFF
IO window.

Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
---
 hw/i386/acpi-build.c | 84 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 72 insertions(+), 12 deletions(-)

Comments

Kevin O'Connor March 8, 2015, 4:13 p.m. UTC | #1
On Sun, Mar 08, 2015 at 01:16:15PM +0200, Marcel Apfelbaum wrote:
> If multiple root buses are used, root bus 0 cannot use all of the
> PCI hole ranges. Remove the IO/mem ranges used by the other
> primary buses from its _CRS.
[...]
> -    aml_append(crs,
> -        aml_word_io(aml_min_fixed, aml_max_fixed,
> -                    aml_pos_decode, aml_entire_range,
> -                    0x0000, 0x0D00, 0xFFFF, 0x0000, 0xF300));
> +
> +    /* prepare PCI IO ranges */
> +    range.base = 0x0D00;
> +    range.limit = 0xFFFF;
> +    if (QLIST_EMPTY(&io_ranges)) {
> +        aml_append(crs,
> +            aml_word_io(aml_min_fixed, aml_max_fixed,
> +                        aml_pos_decode, aml_entire_range,
> +                        0x0000, range.base, range.limit,
> +                        0x0000, range.limit - range.base + 1));
> +    } else {
> +        QLIST_FOREACH(entry, &io_ranges, entry) {
> +            if (range.base < entry->base) {
> +                aml_append(crs,
> +                    aml_word_io(aml_min_fixed, aml_max_fixed,
> +                                aml_pos_decode, aml_entire_range,
> +                                0x0000, range.base, entry->base - 1,
> +                                0x0000, entry->base - range.base));
> +            }
> +            range.base = entry->limit + 1;
> +            if (!QLIST_NEXT(entry, entry)) {
> +                aml_append(crs,
> +                    aml_word_io(aml_min_fixed, aml_max_fixed,
> +                                aml_pos_decode, aml_entire_range,
> +                                0x0000, range.base, range.limit,
> +                                0x0000, range.limit - range.base + 1));
> +            }
> +        }
> +    }

If I read this correctly, it looks like a machine with two root buses
and 20 devices, each with one memory range and one io range, would end
up with 40 CRS ranges (ie, a CRS range for every resource).  It also
looks like this furthers the requirement that the guest firmware
assign the PCI resources prior to QEMU being able to generate the ACPI
tables.

Am I correct?  If so, that doesn't sound ideal.

-Kevin
Marcel Apfelbaum March 8, 2015, 5:51 p.m. UTC | #2
On 03/08/2015 06:13 PM, Kevin O'Connor wrote:
> On Sun, Mar 08, 2015 at 01:16:15PM +0200, Marcel Apfelbaum wrote:
>> If multiple root buses are used, root bus 0 cannot use all of the
>> PCI hole ranges. Remove the IO/mem ranges used by the other
>> primary buses from its _CRS.
> [...]
>> -    aml_append(crs,
>> -        aml_word_io(aml_min_fixed, aml_max_fixed,
>> -                    aml_pos_decode, aml_entire_range,
>> -                    0x0000, 0x0D00, 0xFFFF, 0x0000, 0xF300));
>> +
>> +    /* prepare PCI IO ranges */
>> +    range.base = 0x0D00;
>> +    range.limit = 0xFFFF;
>> +    if (QLIST_EMPTY(&io_ranges)) {
>> +        aml_append(crs,
>> +            aml_word_io(aml_min_fixed, aml_max_fixed,
>> +                        aml_pos_decode, aml_entire_range,
>> +                        0x0000, range.base, range.limit,
>> +                        0x0000, range.limit - range.base + 1));
>> +    } else {
>> +        QLIST_FOREACH(entry, &io_ranges, entry) {
>> +            if (range.base < entry->base) {
>> +                aml_append(crs,
>> +                    aml_word_io(aml_min_fixed, aml_max_fixed,
>> +                                aml_pos_decode, aml_entire_range,
>> +                                0x0000, range.base, entry->base - 1,
>> +                                0x0000, entry->base - range.base));
>> +            }
>> +            range.base = entry->limit + 1;
>> +            if (!QLIST_NEXT(entry, entry)) {
>> +                aml_append(crs,
>> +                    aml_word_io(aml_min_fixed, aml_max_fixed,
>> +                                aml_pos_decode, aml_entire_range,
>> +                                0x0000, range.base, range.limit,
>> +                                0x0000, range.limit - range.base + 1));
>> +            }
>> +        }
>> +    }
>

Hi Kevin,
Thank you for your review!

> If I read this correctly, it looks like a machine with two root buses
> and 20 devices, each with one memory range and one io range, would end
> up with 40 CRS ranges (ie, a CRS range for every resource).
Correct.

As Michael pointed out in another thread, the firmware is considered
guest code and QEMU cannot assume anything about how the resources are
assigned. This is why this solution was chosen.

However, two things make the situation a little better:
1. The PXB implementation includes a pci-bridge, and all devices are automatically
    attached to the secondary bus; this way we have one IO/MEM range per extra root bus.
2. On top of this series we can add a merge algorithm that joins
    consecutive ranges (see the sketch below). This series does not include that
    optimization; it focuses on correctness.
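
Something like this minimal sketch would do it (illustrative only:
pci_range_list_merge is a hypothetical helper, not part of this series;
it assumes the PciRangeEntry/PciRangeQ definitions used elsewhere in the
series, entries allocated with g_malloc, and a list kept sorted by base
with non-overlapping entries):

    static void pci_range_list_merge(PciRangeQ *list)
    {
        PciRangeEntry *cur, *next;

        QLIST_FOREACH(cur, list, entry) {
            /* fold every immediately adjacent successor into 'cur' */
            while ((next = QLIST_NEXT(cur, entry)) != NULL &&
                   next->base == cur->limit + 1) {
                cur->limit = next->limit;
                QLIST_REMOVE(next, entry);
                g_free(next);
            }
        }
    }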

> It also
> looks like this furthers the requirement that the guest firmware
> assign the PCI resources prior to QEMU being able to generate the ACPI
> tables.
>
> Am I correct?  If so, that doesn't sound ideal.
You are correct; however, it is not that bad, because we have the following sequence:
  - Early in the boot sequence the BIOS scans the PCI buses and assigns IO/MEM ranges.
  - At that moment all the resources QEMU needs are present in the configuration space.
  - At the end of the boot sequence the BIOS queries the ACPI tables, and *only then*
    are the tables computed.

I think we already rely on this implicitly for other features; anyway, it looks like an
elegant solution with no real drawbacks. (Our assumptions are safe.)

Thanks,
Marcel
>
> -Kevin
>
Kevin O'Connor March 8, 2015, 6:26 p.m. UTC | #3
On Sun, Mar 08, 2015 at 07:51:42PM +0200, Marcel Apfelbaum wrote:
> On 03/08/2015 06:13 PM, Kevin O'Connor wrote:
> >If I read this correctly, it looks like a machine with two root buses
> >and 20 devices, each with one memory range and one io range, would end
> >up with 40 CRS ranges (ie, a CRS range for every resource).
> Correct.
> 
> As Michael pointed out in another thread, the firmware is considered
> guest code and QEMU cannot assume anything about how the resources are
> assigned. This is why this solution was chosen.
> 
> However, two things make the situation a little better:
> 1. The PXB implementation includes a pci-bridge, and all devices are automatically
>    attached to the secondary bus; this way we have one IO/MEM range per extra root bus.

Out of curiosity, does the PXB implementation add the pci-bridge just
to simplify the IO/MEM range, or are there other technical reasons for
it?

> 2. On top of this series we can add a merge algorithm that joins
>    consecutive ranges (see the sketch below). This series does not include that
>    optimization; it focuses on correctness.
> 
> > It also
> >looks like this furthers the requirement that the guest firmware
> >assign the PCI resources prior to QEMU being able to generate the ACPI
> >tables.
> >
> >Am I correct?  If so, that doesn't sound ideal.
> You are correct; however, it is not that bad, because we have the following sequence:
>  - Early in the boot sequence the BIOS scans the PCI buses and assigns IO/MEM ranges.
>  - At that moment all the resources QEMU needs are present in the configuration space.
>  - At the end of the boot sequence the BIOS queries the ACPI tables, and *only then*
>    are the tables computed.
> 
> I think we already rely on this implicitly for other features; anyway, it looks like an
> elegant solution with no real drawbacks. (Our assumptions are safe.)

Thank you for the clarification.  I understand that it works, but I've
never been that comfortable with the QEMU<->firmware dance with PCI
resources.  I do understand that the alternatives have as many or more
problems though.  So, I'm not objecting to this implementation.

Cheers,
-Kevin
Marcel Apfelbaum March 8, 2015, 6:32 p.m. UTC | #4
On 03/08/2015 08:26 PM, Kevin O'Connor wrote:
> On Sun, Mar 08, 2015 at 07:51:42PM +0200, Marcel Apfelbaum wrote:
>> On 03/08/2015 06:13 PM, Kevin O'Connor wrote:
>>> If I read this correctly, it looks like a machine with two root buses
>>> and 20 devices, each with one memory range and one io range, would end
>>> up with 40 CRS ranges (ie, a CRS range for every resource).
>> Correct.
>>
>> As Michael pointed out in another thread, the firmware is considered
>> guest code and QEMU cannot assume anything about how the resources are
>> assigned. This is why this solution was chosen.
>>
>> However, two things make the situation a little better:
>> 1. The PXB implementation includes a pci-bridge, and all devices are automatically
>>     attached to the secondary bus; this way we have one IO/MEM range per extra root bus.
>
> Out of curiosity, does the PXB implementation add the pci-bridge just
> to simplify the IO/MEM range, or are there other technical reasons for
> it?
We have another elephant there :) -> PCI hotplug.
All the "free" memory ranges are assigned to bus 0, and this would leave
the pxb buses without hotplug capability.
Using a PCI bridge gives us some IO/MEM ranges for hotplug: the windows a
bridge decodes at the PCI spec's minimum granularity (4K for IO, 1MB for
memory), which are not currently used by any device.
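
For reference, this is roughly how those windows decode from a bridge's
type-1 config header; the offsets and granularities are per the
PCI-to-PCI Bridge spec, while pci_config_read8/16 are hypothetical
accessors, not QEMU's API:

    #include <stdint.h>

    /* standard type-1 (bridge) header offsets */
    #define PCI_IO_BASE       0x1C
    #define PCI_IO_LIMIT      0x1D
    #define PCI_MEMORY_BASE   0x20
    #define PCI_MEMORY_LIMIT  0x22

    extern uint8_t  pci_config_read8(void *dev, int offset);   /* hypothetical */
    extern uint16_t pci_config_read16(void *dev, int offset);  /* hypothetical */

    static void bridge_windows(void *dev,
                               uint32_t *io_base, uint32_t *io_limit,
                               uint32_t *mem_base, uint32_t *mem_limit)
    {
        /* IO window: the registers hold address bits 15:12, so the
         * window is always a multiple of 4K. */
        *io_base  = (uint32_t)(pci_config_read8(dev, PCI_IO_BASE) & 0xF0) << 8;
        *io_limit = ((uint32_t)(pci_config_read8(dev, PCI_IO_LIMIT) & 0xF0) << 8)
                    | 0xFFF;

        /* Memory window: the registers hold address bits 31:20, so the
         * window is always a multiple of 1MB. */
        *mem_base  = (uint32_t)(pci_config_read16(dev, PCI_MEMORY_BASE) & 0xFFF0) << 16;
        *mem_limit = ((uint32_t)(pci_config_read16(dev, PCI_MEMORY_LIMIT) & 0xFFF0) << 16)
                     | 0xFFFFF;
    }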

>
>> 2. On top of this series we can add a merge algorithm that joins
>>     consecutive ranges (see the sketch below). This series does not include that
>>     optimization; it focuses on correctness.
>>
>>> It also
>>> looks like this furthers the requirement that the guest firmware
>>> assign the PCI resources prior to QEMU being able to generate the ACPI
>>> tables.
>>>
>>> Am I correct?  If so, that doesn't sound ideal.
>> You are correct; however, it is not that bad, because we have the following sequence:
>>   - Early in the boot sequence the BIOS scans the PCI buses and assigns IO/MEM ranges.
>>   - At that moment all the resources QEMU needs are present in the configuration space.
>>   - At the end of the boot sequence the BIOS queries the ACPI tables, and *only then*
>>     are the tables computed.
>>
>> I think we already rely on this implicitly for other features; anyway, it looks like an
>> elegant solution with no real drawbacks. (Our assumptions are safe.)
>
> Thank you for the clarification.  I understand that it works, but I've
> never been that comfortable with the QEMU<->firmware dance with PCI
> resources.  I do understand that the alternatives have as many or more
> problems though.  So, I'm not objecting to this implementation.
No problem; thank you, and your review is much appreciated as always,
Marcel

>
> Cheers,
> -Kevin
>
Michael S. Tsirkin March 8, 2015, 6:34 p.m. UTC | #5
On Sun, Mar 08, 2015 at 12:13:40PM -0400, Kevin O'Connor wrote:
> On Sun, Mar 08, 2015 at 01:16:15PM +0200, Marcel Apfelbaum wrote:
> > If multiple root buses are used, root bus 0 cannot use all of the
> > PCI hole ranges. Remove the IO/mem ranges used by the other
> > primary buses from its _CRS.
> [...]
> > -    aml_append(crs,
> > -        aml_word_io(aml_min_fixed, aml_max_fixed,
> > -                    aml_pos_decode, aml_entire_range,
> > -                    0x0000, 0x0D00, 0xFFFF, 0x0000, 0xF300));
> > +
> > +    /* prepare PCI IO ranges */
> > +    range.base = 0x0D00;
> > +    range.limit = 0xFFFF;
> > +    if (QLIST_EMPTY(&io_ranges)) {
> > +        aml_append(crs,
> > +            aml_word_io(aml_min_fixed, aml_max_fixed,
> > +                        aml_pos_decode, aml_entire_range,
> > +                        0x0000, range.base, range.limit,
> > +                        0x0000, range.limit - range.base + 1));
> > +    } else {
> > +        QLIST_FOREACH(entry, &io_ranges, entry) {
> > +            if (range.base < entry->base) {
> > +                aml_append(crs,
> > +                    aml_word_io(aml_min_fixed, aml_max_fixed,
> > +                                aml_pos_decode, aml_entire_range,
> > +                                0x0000, range.base, entry->base - 1,
> > +                                0x0000, entry->base - range.base));
> > +            }
> > +            range.base = entry->limit + 1;
> > +            if (!QLIST_NEXT(entry, entry)) {
> > +                aml_append(crs,
> > +                    aml_word_io(aml_min_fixed, aml_max_fixed,
> > +                                aml_pos_decode, aml_entire_range,
> > +                                0x0000, range.base, range.limit,
> > +                                0x0000, range.limit - range.base + 1));
> > +            }
> > +        }
> > +    }
> 
> If I read this correctly, it looks like a machine with two root buses
> and 20 devices, each with one memory range and one io range, would end
> up with 40 CRS ranges (ie, a CRS range for every resource).

I think that's only if you stick multiple devices directly behind the
bridge.  Looks like with a single pci bridge behind root, there will
only be 2 ranges.

Maybe try to enforce this sane topology?

> It also
> looks like this furthers the requirement that the guest firmware
> assign the PCI resources prior to QEMU being able to generate the ACPI
> tables.

That seems unavoidable unless we want to assign ranges from
hardware/management.
Which I think would be a mistake: management doesn't really know,
or care.

> 
> -Kevin
Kevin O'Connor March 8, 2015, 6:46 p.m. UTC | #6
On Sun, Mar 08, 2015 at 07:34:34PM +0100, Michael S. Tsirkin wrote:
> On Sun, Mar 08, 2015 at 12:13:40PM -0400, Kevin O'Connor wrote:
> > If I read this correctly, it looks like a machine with two root buses
> > and 20 devices, each with one memory range and one io range, would end
> > up with 40 CRS ranges (ie, a CRS range for every resource).
> 
> I think that's only if you stick multiple devices directly behind the
> bridge.  Looks like with a single pci bridge behind root, there will
> only be 2 ranges.

Yeah, that makes sense, so doesn't seem to be a problem.

> Maybe try to enforce this sane topology?
> 
> > It also
> > looks like this furthers the requirement that the guest firmware
> > assign the PCI resources prior to QEMU being able to generate the ACPI
> > tables.
> 
> That seems unavoidable unless we want to assign ranges from
> hardware/management.
> Which I think would be a mistake: management doesn't really know,
> or care.

I understand.  I think what would help me is if we could document
somewhere that the firmware has to assign PCI resources before
querying the BIOS tables, and that this is the *only* prerequisite for
querying them.  Looking now, though, I don't see any fw_cfg
documentation in the repo, so I'm not sure where that could be added.

Thanks,
-Kevin
Michael S. Tsirkin March 9, 2015, 8:44 a.m. UTC | #7
On Sun, Mar 08, 2015 at 02:46:28PM -0400, Kevin O'Connor wrote:
> On Sun, Mar 08, 2015 at 07:34:34PM +0100, Michael S. Tsirkin wrote:
> > On Sun, Mar 08, 2015 at 12:13:40PM -0400, Kevin O'Connor wrote:
> > > If I read this correctly, it looks like a machine with two root buses
> > > and 20 devices, each with one memory range and one io range, would end
> > > up with 40 CRS ranges (ie, a CRS range for every resource).
> > 
> > I think that's only if you stick multiple devices directly behind the
> > bridge.  Looks like with a single pci bridge behind root, there will
> > only be 2 ranges.
> 
> Yeah, that makes sense, so doesn't seem to be a problem.
> 
> > Maybe try to enforce this sane topology?
> > 
> > > It also
> > > looks like this furthers the requirement that the guest firmware
> > > assign the PCI resources prior to QEMU being able to generate the ACPI
> > > tables.
> > 
> > That seems unavoidable unless we want to assign ranges from
> > hardware/management.
> > Which I think would be a mistake: management doesn't really know,
> > or care.
> 
> I understand.  I think what would help me is if we could document
> somewhere that the firmware has to assign PCI resources before
> querying the BIOS tables, and that this is the *only* prerequisite for
> querying them.  Looking now, though, I don't see any fw_cfg
> documentation in the repo, so I'm not sure where that could be added.
> 
> Thanks,
> -Kevin

Sigh. Might make a GSoC project?
Stefan Hajnoczi March 10, 2015, 1:09 p.m. UTC | #8
On Mon, Mar 09, 2015 at 09:44:24AM +0100, Michael S. Tsirkin wrote:
> On Sun, Mar 08, 2015 at 02:46:28PM -0400, Kevin O'Connor wrote:
> > On Sun, Mar 08, 2015 at 07:34:34PM +0100, Michael S. Tsirkin wrote:
> > > On Sun, Mar 08, 2015 at 12:13:40PM -0400, Kevin O'Connor wrote:
> > > > If I read this correctly, it looks like a machine with two root buses
> > > > and 20 devices, each with one memory range and one io range, would end
> > > > up with 40 CRS ranges (ie, a CRS range for every resource).
> > > 
> > > I think that's only if you stick multiple devices directly behind the
> > > bridge.  Looks like with a single pci bridge behind root, there will
> > > only be 2 ranges.
> > 
> > Yeah, that makes sense, so doesn't seem to be a problem.
> > 
> > > Maybe try to enforce this sane topology?
> > > 
> > > > It also
> > > > looks like this furthers the requirement that the guest firmware
> > > > assign the PCI resources prior to QEMU being able to generate the ACPI
> > > > tables.
> > > 
> > > That seems unavoidable unless we want to assign ranges from
> > > hardware/management.
> > > Which I think would be a mistake: management doesn't really know,
> > > or care.
> > 
> > I understand.  I think what would help me is if we could document
> > somewhere that the firmware has to assign PCI resources before
> > querying the BIOS tables, and that this is the *only* prerequisite for
> > querying them.  Looking now, though, I don't see any fw_cfg
> > documentation in the repo, so I'm not sure where that could be added.
> > 
> > Thanks,
> > -Kevin
> 
> Sigh. Might make a GSoC project?

Documentation projects are not possible under Google Summer of Code:

https://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2015/help_page#12._Are_proposals_for_documentation_work

If there is a coding project you are interested in mentoring, there is a
project idea template to fill out here:

http://qemu-project.org/Google_Summer_of_Code_2015#Project_idea_template

Stefan

Patch

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index f4d8816..44819b8 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -881,6 +881,9 @@  build_ssdt(GArray *table_data, GArray *linker,
     Aml *ssdt, *sb_scope, *scope, *pkg, *dev, *method, *crs, *field, *ifctx;
     PciRangeQ io_ranges = QLIST_HEAD_INITIALIZER(io_ranges);
     PciRangeQ mem_ranges = QLIST_HEAD_INITIALIZER(mem_ranges);
+    PciMemoryRange range;
+    PciRangeEntry *entry;
+    int root_bus_limit = 0xFF;
     int i;
 
     ssdt = init_aml_allocator();
@@ -909,6 +912,10 @@  build_ssdt(GArray *table_data, GArray *linker,
                 continue;
             }
 
+            if (bus_info->bus < root_bus_limit) {
+                root_bus_limit = bus_info->bus - 1;
+            }
+
             scope = aml_scope("\\_SB");
             dev = aml_device("PC%.02X", (uint8_t)bus_info->bus);
             aml_append(dev, aml_name_decl("_UID",
@@ -923,8 +930,6 @@  build_ssdt(GArray *table_data, GArray *linker,
             aml_append(ssdt, scope);
         }
 
-        pci_range_list_free(&io_ranges);
-        pci_range_list_free(&mem_ranges);
         qapi_free_PciInfoList(info_list);
     }
 
@@ -933,26 +938,78 @@  build_ssdt(GArray *table_data, GArray *linker,
     crs = aml_resource_template();
     aml_append(crs,
         aml_word_bus_number(aml_min_fixed, aml_max_fixed, aml_pos_decode,
-                            0x0000, 0x0000, 0x00FF, 0x0000, 0x0100));
+                            0x0000, 0x0, root_bus_limit,
+                            0x0000, root_bus_limit + 1));
     aml_append(crs, aml_io(aml_decode16, 0x0CF8, 0x0CF8, 0x01, 0x08));
 
     aml_append(crs,
         aml_word_io(aml_min_fixed, aml_max_fixed,
                     aml_pos_decode, aml_entire_range,
                     0x0000, 0x0000, 0x0CF7, 0x0000, 0x0CF8));
-    aml_append(crs,
-        aml_word_io(aml_min_fixed, aml_max_fixed,
-                    aml_pos_decode, aml_entire_range,
-                    0x0000, 0x0D00, 0xFFFF, 0x0000, 0xF300));
+
+    /* prepare PCI IO ranges */
+    range.base = 0x0D00;
+    range.limit = 0xFFFF;
+    if (QLIST_EMPTY(&io_ranges)) {
+        aml_append(crs,
+            aml_word_io(aml_min_fixed, aml_max_fixed,
+                        aml_pos_decode, aml_entire_range,
+                        0x0000, range.base, range.limit,
+                        0x0000, range.limit - range.base + 1));
+    } else {
+        QLIST_FOREACH(entry, &io_ranges, entry) {
+            if (range.base < entry->base) {
+                aml_append(crs,
+                    aml_word_io(aml_min_fixed, aml_max_fixed,
+                                aml_pos_decode, aml_entire_range,
+                                0x0000, range.base, entry->base - 1,
+                                0x0000, entry->base - range.base));
+            }
+            range.base = entry->limit + 1;
+            if (!QLIST_NEXT(entry, entry)) {
+                aml_append(crs,
+                    aml_word_io(aml_min_fixed, aml_max_fixed,
+                                aml_pos_decode, aml_entire_range,
+                                0x0000, range.base, range.limit,
+                                0x0000, range.limit - range.base + 1));
+            }
+        }
+    }
+
     aml_append(crs,
         aml_dword_memory(aml_pos_decode, aml_min_fixed, aml_max_fixed,
                          aml_cacheable, aml_ReadWrite,
                          0, 0x000A0000, 0x000BFFFF, 0, 0x00020000));
-    aml_append(crs,
-        aml_dword_memory(aml_pos_decode, aml_min_fixed, aml_max_fixed,
-                         aml_non_cacheable, aml_ReadWrite,
-                         0, pci->w32.begin, pci->w32.end - 1, 0,
-                         pci->w32.end - pci->w32.begin));
+
+    /* prepare PCI memory ranges */
+    range.base = pci->w32.begin;
+    range.limit = pci->w32.end - 1;
+    if (QLIST_EMPTY(&mem_ranges)) {
+        aml_append(crs,
+            aml_dword_memory(aml_pos_decode, aml_min_fixed, aml_max_fixed,
+                             aml_non_cacheable, aml_ReadWrite,
+                             0, range.base, range.limit,
+                             0, range.limit - range.base + 1));
+    } else {
+        QLIST_FOREACH(entry, &mem_ranges, entry) {
+            if (range.base < entry->base) {
+                aml_append(crs,
+                    aml_dword_memory(aml_pos_decode, aml_min_fixed, aml_max_fixed,
+                                     aml_non_cacheable, aml_ReadWrite,
+                                     0, range.base, entry->base - 1,
+                                     0, entry->base - range.base));
+            }
+            range.base = entry->limit + 1;
+            if (!QLIST_NEXT(entry, entry)) {
+                aml_append(crs,
+                    aml_dword_memory(aml_pos_decode, aml_min_fixed, aml_max_fixed,
+                                     aml_non_cacheable, aml_ReadWrite,
+                                     0, range.base, range.limit,
+                                     0, range.limit - range.base + 1));
+            }
+        }
+    }
+
     if (pci->w64.begin) {
         aml_append(crs,
             aml_qword_memory(aml_pos_decode, aml_min_fixed, aml_max_fixed,
@@ -975,6 +1032,9 @@  build_ssdt(GArray *table_data, GArray *linker,
     aml_append(dev, aml_name_decl("_CRS", crs));
     aml_append(scope, dev);
 
+    pci_range_list_free(&io_ranges);
+    pci_range_list_free(&mem_ranges);
+
     /* reserve PCIHP resources */
     if (pm->pcihp_io_len) {
         dev = aml_device("PHPR");