
[04/11] hw/arm/aspeed: add a 'mmio-exec' property to boot from the FMC flash module

Message ID 20180831103816.13479-5-clg@kaod.org
State New
Series aspeed: misc fixes and enhancements (SMC)

Commit Message

Cédric Le Goater Aug. 31, 2018, 10:38 a.m. UTC
Now that MMIO execution is supported, introduce a 'mmio-exec' property
to boot directly from CE0 of the FMC controller using a memory region
alias.

The overhead for the current firmware images using a custom U-Boot is
around 2 seconds, which is fine, but with a U-Boot from mainline, it
takes an extra 50 seconds or so to reach Linux. This might be related
to the fact that a device tree is used.

MMIO execution is not activated by default until boot time is
improved.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/arm/aspeed.h |  2 ++
 hw/arm/aspeed.c         | 43 ++++++++++++++++++++++++++++++++++++-----
 2 files changed, 40 insertions(+), 5 deletions(-)

Comments

Peter Maydell Sept. 18, 2018, 6:44 p.m. UTC | #1
On 31 August 2018 at 11:38, Cédric Le Goater <clg@kaod.org> wrote:
> Now that MMIO execution is supported, introduce a 'mmio-exec' property
> to boot directly from CE0 of the FMC controller using a memory region
> alias.

The name of this property seems to be a reference to QEMU's
internals: is there some other name that would make more sense
to a user who cares more about the guest and the emulated hardware?

> The overhead for the current firmware images using a custom U-Boot is
> around 2 seconds, which is fine, but with a U-Boot from mainline, it
> takes an extra 50 seconds or so to reach Linux. This might be related
> to the fact that a device tree is used.

Is this overhead just because we do a lot of execution from the
mmio region, and that's slower than execution from RAM?

> MMIO execution is not activated by default until boot time is
> improved.

Can the guest tell the difference? I'm wondering how much we
might need to care about backward compatibility if in future
we fix the performance problem, or if we can just switch the
default to execute-from-flash without breaking guests.

thanks
-- PMM
Cédric Le Goater Sept. 19, 2018, 6:51 a.m. UTC | #2
On 09/18/2018 08:44 PM, Peter Maydell wrote:
> On 31 August 2018 at 11:38, Cédric Le Goater <clg@kaod.org> wrote:
>> Now that MMIO execution is supported, introduce a 'mmio-exec' property
>> to boot directly from CE0 of the FMC controller using a memory region
>> alias.
> 
> The name of this property seems to be a reference to QEMU's
> internals: is there some other name that would make more sense
> to a user who cares more about the guest and the emulated hardware?

It is true that the user does not care about internals.

"execute-from-flash", as you proposed below?

>> The overhead for the current firmware images using a custom U-Boot is
>> around 2 seconds, which is fine, but with a U-Boot from mainline, it
>> takes an extra 50 seconds or so to reach Linux. This might be related
>> to the fact that a device tree is used.
> 
> Is this overhead just because we do a lot of execution from the
> mmio region, and that's slower than execution from RAM?

That is what it looks like, but I haven't dug further into the problem.
The mainline U-Boot also has a rewritten DRAM calibration.

Using the logs is too painful, so I would need to add some statistics
on what is being done. Maybe a count of read/write ops on the flash
model that we could dump from the monitor, or some more low-level
figures from QEMU internals.
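A minimal sketch of what such per-device counters could look like (all names here are hypothetical illustrations, not actual QEMU API; the real flash model would bump these in its read/write callbacks):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Toy stand-in for the flash model: every access bumps a counter
 * that could later be dumped from the monitor. */
typedef struct FlashModel {
    uint64_t read_ops;
    uint64_t write_ops;
    uint8_t storage[256];
} FlashModel;

static uint8_t flash_read(FlashModel *fl, uint32_t addr)
{
    fl->read_ops++;
    return fl->storage[addr % sizeof(fl->storage)];
}

static void flash_write(FlashModel *fl, uint32_t addr, uint8_t val)
{
    fl->write_ops++;
    fl->storage[addr % sizeof(fl->storage)] = val;
}

/* What a monitor-style statistics dump might print. */
static void flash_dump_stats(const FlashModel *fl)
{
    printf("read ops:  %" PRIu64 "\n", fl->read_ops);
    printf("write ops: %" PRIu64 "\n", fl->write_ops);
}
```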

>> MMIO execution is not activated by default until boot time is
>> improved.
> 
> Can the guest tell the difference? I'm wondering how much we
> might need to care about backward compatibility if in future
> we fix the performance problem, or if we can just switch the
> default to execute-from-flash without breaking guests.

Yes, that is a good point. When performance is fixed, we should remove
the ROM region and the machine option. What I am proposing is more of
a workaround to analyze a problem than something we would keep in the
future.

Thanks,

C.
Cédric Le Goater Sept. 19, 2018, 10:19 a.m. UTC | #3
On 09/19/2018 08:51 AM, Cédric Le Goater wrote:
> On 09/18/2018 08:44 PM, Peter Maydell wrote:
>> On 31 August 2018 at 11:38, Cédric Le Goater <clg@kaod.org> wrote:
>>> Now that MMIO execution is supported, introduce a 'mmio-exec' property
>>> to boot directly from CE0 of the FMC controller using a memory region
>>> alias.
>>
>> The name of this property seems to be a reference to QEMU's
>> internals: is there some other name that would make more sense
>> to a user who cares more about the guest and the emulated hardware?
> 
> It is true that the user does not care about internals.
> 
> "execute-from-flash", as you proposed below?
> 
>>> The overhead for the current firmware images using a custom U-Boot is
>>> around 2 seconds, which is fine, but with a U-Boot from mainline, it
>>> takes an extra 50 seconds or so to reach Linux. This might be related
>>> to the fact that a device tree is used.
>>
>> Is this overhead just because we do a lot of execution from the
>> mmio region, and that's slower than execution from RAM?
> 
> That is what it looks like, but I haven't dug further into the problem.
> The mainline U-Boot also has a rewritten DRAM calibration.
> 
> Using the logs is too painful, so I would need to add some statistics
> on what is being done. Maybe a count of read/write ops on the flash
> model that we could dump from the monitor, or some more low-level
> figures from QEMU internals.

These are the numbers of read operations done on the flash memory region:
 
    922478 ~ 3.5 MBytes   OpenBMC U-Boot
  20569977 ~ 80  MBytes   Mainline U-Boot 
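The byte totals follow from one 4-byte instruction word per read op; a quick sanity check of the figures above, assuming 32-bit ARM fetches:

```c
#include <assert.h>
#include <stdint.h>

/* Each MMIO read on the flash region fetches one 32-bit ARM
 * instruction word, so total bytes = read ops * 4. */
static uint64_t fetched_bytes(uint64_t read_ops)
{
    return read_ops * 4;
}
```

922478 reads come to about 3.5 MiB, and 20569977 reads to about 82 MB, matching the "~80 MBytes" figure.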


So we are thrashing the TBs, I would say. Is there a way to increase the
TB size? Or some other TB parameter? Is that directly linked to the
instruction cache size?

I am no expert in that area of QEMU, but I am willing to experiment.

Thanks,

C. 

>>> MMIO execution is not activated by default until boot time is
>>> improved.
>>
>> Can the guest tell the difference? I'm wondering how much we
>> might need to care about backward compatibility if in future
>> we fix the performance problem, or if we can just switch the
>> default to execute-from-flash without breaking guests.
> 
> Yes, that is a good point. When performance is fixed, we should remove
> the ROM region and the machine option. What I am proposing is more of
> a workaround to analyze a problem than something we would keep in the
> future.
> 
> Thanks,
> 
> C.
>
Peter Maydell Sept. 24, 2018, 11:41 a.m. UTC | #4
On 19 September 2018 at 11:19, Cédric Le Goater <clg@kaod.org> wrote:
> These are the numbers of read operations done on the flash memory region:
>
>     922478 ~ 3.5 MBytes   OpenBMC U-Boot
>   20569977 ~ 80  MBytes   Mainline U-Boot
>
>
> So we are thrashing the TBs, I would say. Is there a way to increase the
> TB size? Or some other TB parameter? Is that directly linked to the
> instruction cache size?

Well, execution direct from MMIO means we read each instruction
as we execute it -- there's no caching of TBs. (We generate a
temporary TB with a single insn and throw it away after executing
from it.) This is because we can't be sure that we would get the
same data value the next time we read from the address.
So we definitely expect things to go slower -- I'm just a little
surprised that we spend so much time executing from flash devices
that aren't in "romd" mode.
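As a rough model of the cost difference (a sketch under simplified assumptions, not QEMU's actual translator): with RAM-backed code each instruction is translated once and the cached TB is reused, while with MMIO-backed code every dynamic execution re-fetches and re-translates.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified cost model: count instruction fetch+translate events
 * for a loop body of `insns` instructions run `iters` times. */

/* RAM-backed code: translate once, cache the TB, reuse it. */
static uint64_t fetches_cached_tb(uint64_t insns, uint64_t iters)
{
    (void)iters; /* the cached TB is reused on every iteration */
    return insns;
}

/* MMIO-backed code: a throwaway single-insn TB per execution. */
static uint64_t fetches_mmio(uint64_t insns, uint64_t iters)
{
    return insns * iters;
}
```

Under this model, a 50-instruction loop executed ~250,000 times costs 50 translations from RAM but over 12 million fetches from the flash region.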

thanks
-- PMM
Cédric Le Goater Sept. 24, 2018, 1:28 p.m. UTC | #5
On 09/24/2018 01:41 PM, Peter Maydell wrote:
> On 19 September 2018 at 11:19, Cédric Le Goater <clg@kaod.org> wrote:
>> These are the numbers of read operations done on the flash memory region:
>>
>>     922478 ~ 3.5 MBytes   OpenBMC U-Boot
>>   20569977 ~ 80  MBytes   Mainline U-Boot
>>
>>
>> So we are thrashing the TBs, I would say. Is there a way to increase the
>> TB size? Or some other TB parameter? Is that directly linked to the
>> instruction cache size?
> 
> Well, execution direct from MMIO means we read each instruction
> as we execute it -- there's no caching of TBs. (We generate a
> temporary TB with a single insn and throw it away after executing
> from it.) This is because we can't be sure that we would get the
> same data value the next time we read from the address.

In our case, we should have the same data. So maybe I can introduce
a read-only region with a cache to speed up the accesses. A bit like
what was done with the previous mmio interface.
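A minimal sketch of the sort of read-side cache being suggested (hypothetical names; it assumes the flash contents do not change underneath execution, which is precisely the assumption that made the old interface racy):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FLASH_SIZE 256

typedef struct CachedFlash {
    uint8_t backing[FLASH_SIZE];   /* the slow MMIO-backed device */
    uint8_t cache[FLASH_SIZE];     /* RAM copy for fast fetches */
    bool cached[FLASH_SIZE];
    uint64_t slow_reads;           /* device accesses actually made */
} CachedFlash;

static uint8_t cached_flash_read(CachedFlash *fl, uint32_t addr)
{
    if (!fl->cached[addr]) {
        fl->slow_reads++;          /* go to the device only once */
        fl->cache[addr] = fl->backing[addr];
        fl->cached[addr] = true;
    }
    return fl->cache[addr];
}

/* On any write, the cache must be invalidated; keeping this
 * invalidation race-free is where the old interface struggled. */
static void cached_flash_write(CachedFlash *fl, uint32_t addr, uint8_t val)
{
    fl->backing[addr] = val;
    memset(fl->cached, 0, sizeof(fl->cached));
}
```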

> So we definitely expect things to go slower -- I'm just a little
> surprised that we spend so much time executing from flash devices
> that aren't in "romd" mode.

Yes, it is very slow, but some instructions are loaded more than
250,000 times. See below.

Thanks,

C.
count   address
    ... ...
 149113 0x28a18
 149113 0x28a1c
 149113 0x28a20
 149113 0x28a24
 149113 0x28a28
 149113 0x28a2c
 149113 0x28a30
 149113 0x28a34
 149113 0x28a38
 149113 0x28a3c
 149113 0x28a40
 253584 0x288b4
 253584 0x288bc
 253584 0x288c0
 253584 0x288c8
 253584 0x288cc
 253584 0x288d4
 253584 0x288e0
 253584 0x288e4
 253584 0x288f0
 253584 0x288f4
 253584 0x288fc
 253584 0x28904
 253584 0x28908
 253584 0x28910
 253584 0x28914
 253584 0x28920
 253584 0x28924
 253584 0x28928
 253584 0x2892c
 253584 0x28934
 253584 0x28938
 253584 0x2893c
 253584 0x28940
 253584 0x28944
 253584 0x2894c
 253584 0x28954
 253584 0x28958
 253584 0x2895c
 253584 0x28960
 253584 0x28964
 253584 0x28968
 253584 0x2896c
 253584 0x28970
 253584 0x28974
 253584 0x28978
 253584 0x2897c
 253584 0x28980
 253584 0x53364
 253585 0x288b8
 253585 0x288c4
 253585 0x288d0
 253585 0x288d8
 253585 0x288dc
 253585 0x288f8
 253585 0x28900
 253585 0x2890c
 253585 0x28918
 253585 0x2891c
 253585 0x28948
 253585 0x53344
 253586 0x28930
 253586 0x28950
 253587 0x288b0
 257497 0x53348
 277523 0x53354
Peter Maydell Sept. 24, 2018, 1:31 p.m. UTC | #6
On 24 September 2018 at 14:28, Cédric Le Goater <clg@kaod.org> wrote:
> On 09/24/2018 01:41 PM, Peter Maydell wrote:
>> On 19 September 2018 at 11:19, Cédric Le Goater <clg@kaod.org> wrote:
>>> These are the numbers of read operations done on the flash memory region:
>>>
>>>     922478 ~ 3.5 MBytes   OpenBMC U-Boot
>>>   20569977 ~ 80  MBytes   Mainline U-Boot
>>>
>>>
>>> So we are thrashing the TBs, I would say. Is there a way to increase the
>>> TB size? Or some other TB parameter? Is that directly linked to the
>>> instruction cache size?
>>
>> Well, execution direct from MMIO means we read each instruction
>> as we execute it -- there's no caching of TBs. (We generate a
>> temporary TB with a single insn and throw it away after executing
>> from it.) This is because we can't be sure that we would get the
>> same data value the next time we read from the address.
>
> In our case, we should have the same data. So maybe I can introduce
> a read-only region with a cache to speed up the accesses. A bit like
> what was done with the previous mmio interface.

Yes, this is in theory possible, but we dropped the old
mmio-exec interface because it had some nasty race conditions:
the problem is being able to be sure you've properly dropped
the cached TBs when the device contents change.

Which flash device is this? For pflash typically the 'romd mode'
stuff suffices to allow execution as-if-from-ram most of the time.

thanks
-- PMM
Cédric Le Goater Sept. 24, 2018, 1:59 p.m. UTC | #7
On 09/24/2018 03:31 PM, Peter Maydell wrote:
> On 24 September 2018 at 14:28, Cédric Le Goater <clg@kaod.org> wrote:
>> On 09/24/2018 01:41 PM, Peter Maydell wrote:
>>> On 19 September 2018 at 11:19, Cédric Le Goater <clg@kaod.org> wrote:
>>>> These are the numbers of read operations done on the flash memory region:
>>>>
>>>>     922478 ~ 3.5 MBytes   OpenBMC U-Boot
>>>>   20569977 ~ 80  MBytes   Mainline U-Boot
>>>>
>>>>
>>>> So we are thrashing the TBs, I would say. Is there a way to increase the
>>>> TB size? Or some other TB parameter? Is that directly linked to the
>>>> instruction cache size?
>>>
>>> Well, execution direct from MMIO means we read each instruction
>>> as we execute it -- there's no caching of TBs. (We generate a
>>> temporary TB with a single insn and throw it away after executing
>>> from it.) This is because we can't be sure that we would get the
>>> same data value the next time we read from the address.
>>
>> In our case, we should have the same data. So maybe I can introduce
>> a read-only region with a cache to speed up the accesses. A bit like
>> what was done with the previous mmio interface.
> 
> Yes, this is in theory possible, but we dropped the old
> mmio-exec interface because it had some nasty race conditions:
> the problem is being able to be sure you've properly dropped
> the cached TBs when the device contents change.
> 
> Which flash device is this? For pflash typically the 'romd mode'
> stuff suffices to allow execution as-if-from-ram most of the time.

It's a SPI slave flash device attached to the Aspeed SMC controller.

It works very well in 'romd' mode. I am just stressing the model a bit
more with MMIO execution, but it's not something we really need. So we
can just drop this patch in the v2 patchset "aspeed: misc fixes and
enhancements (SMC)", in which I fixed a few other things.

Thanks,

C.

Patch

diff --git a/include/hw/arm/aspeed.h b/include/hw/arm/aspeed.h
index 2b77f8d2b3c8..d079f4d6e5db 100644
--- a/include/hw/arm/aspeed.h
+++ b/include/hw/arm/aspeed.h
@@ -30,6 +30,8 @@  typedef struct AspeedBoardConfig {
 
 typedef struct AspeedMachine {
     MachineState parent_obj;
+
+    bool mmio_exec;
 } AspeedMachine;
 
 #define ASPEED_MACHINE_CLASS(klass) \
diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
index 6b33ecd5aa43..3a66c2dedc3e 100644
--- a/hw/arm/aspeed.c
+++ b/hw/arm/aspeed.c
@@ -216,11 +216,18 @@  static void aspeed_board_init(MachineState *machine,
          * SoC and 128MB for the AST2500 SoC, which is twice as big as
          * needed by the flash modules of the Aspeed machines.
          */
-        memory_region_init_rom(boot_rom, OBJECT(bmc), "aspeed.boot_rom",
-                               fl->size, &error_abort);
-        memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
-                                    boot_rom);
-        write_boot_rom(drive0, FIRMWARE_ADDR, fl->size, &error_abort);
+        if (ASPEED_MACHINE(machine)->mmio_exec) {
+            memory_region_init_alias(boot_rom, OBJECT(bmc), "aspeed.boot_rom",
+                                     &fl->mmio, 0, fl->size);
+            memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
+                                        boot_rom);
+        } else {
+            memory_region_init_rom(boot_rom, OBJECT(bmc), "aspeed.boot_rom",
+                                   fl->size, &error_abort);
+            memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
+                                        boot_rom);
+            write_boot_rom(drive0, FIRMWARE_ADDR, fl->size, &error_abort);
+        }
     }
 
     aspeed_board_binfo.kernel_filename = machine->kernel_filename;
@@ -313,6 +320,29 @@  static void aspeed_machine_init(MachineState *machine)
     aspeed_board_init(machine, amc->board);
 }
 
+static bool aspeed_get_mmio_exec(Object *obj, Error **errp)
+{
+    return ASPEED_MACHINE(obj)->mmio_exec;
+}
+
+static void aspeed_set_mmio_exec(Object *obj, bool value, Error **errp)
+{
+    ASPEED_MACHINE(obj)->mmio_exec = value;
+}
+
+static void aspeed_machine_instance_init(Object *obj)
+{
+    ASPEED_MACHINE(obj)->mmio_exec = false;
+}
+
+static void aspeed_machine_class_props_init(ObjectClass *oc)
+{
+    object_class_property_add_bool(oc, "mmio-exec", aspeed_get_mmio_exec,
+                                   aspeed_set_mmio_exec, &error_abort);
+    object_class_property_set_description(oc, "mmio-exec",
+                              "boot using MMIO execution", &error_abort);
+}
+
 static void aspeed_machine_class_init(ObjectClass *oc, void *data)
 {
     MachineClass *mc = MACHINE_CLASS(oc);
@@ -327,6 +357,8 @@  static void aspeed_machine_class_init(ObjectClass *oc, void *data)
     mc->no_cdrom = 1;
     mc->no_parallel = 1;
     amc->board = board;
+
+    aspeed_machine_class_props_init(oc);
 }
 
 static const TypeInfo aspeed_machine_type = {
@@ -334,6 +366,7 @@  static const TypeInfo aspeed_machine_type = {
     .parent = TYPE_MACHINE,
     .instance_size = sizeof(AspeedMachine),
     .class_size = sizeof(AspeedMachineClass),
+    .instance_init = aspeed_machine_instance_init,
     .abstract = true,
 };