Patchwork: Hack integrating SeaBIOS / LinuxBoot option ROM with QEMU trace backends

Submitter Daniel P. Berrange
Date Oct. 10, 2011, 5:08 p.m.
Message ID <20111010170803.GV9408@redhat.com>
Permalink /patch/118792/
State New

Comments

Daniel P. Berrange - Oct. 10, 2011, 5:08 p.m.
I've been investigating where time disappears to when booting Linux guests.

Initially I enabled DEBUG_BIOS in QEMU's hw/pc.c, and then hacked it so
that it could print a timestamp before each new line of debug output. The
problem with that is that it slowed down startup, so the timings I was
examining all changed.

What I really wanted was to use QEMU's trace infrastructure with a simple
SystemTAP script. This is easy enough in the QEMU layer, but I also need
to see where time goes to inside the various BIOS functions, and the
options ROMs such as LinuxBoot. So I came up with a small hack to insert
"probes" into SeaBios and LinuxBoot, which trigger a special IO port
(0x404), which then cause QEMU to emit a trace event.

The implementation is really very crude and does not allow any arguments
to be passed to each probe, but since all I care about is timing information,
it is good enough for my needs.

I'm not really expecting these patches to be merged into QEMU/SeaBios
since they're just a crude hack & I don't have time to write something
better. I figure they might be useful for someone else though...

With the attached patches applied to QEMU and SeaBios, the attached
systemtap script can be used to debug timings in QEMU startup.

For example, one execution of QEMU produced the following log:

  $ stap qemu-timing.stp
  0.000 Start
  0.036 Run
  0.038 BIOS post
  0.180 BIOS int 19
  0.181 BIOS boot OS
  0.181 LinuxBoot copy kernel
  1.371 LinuxBoot copy initrd
  1.616 LinuxBoot boot OS
  2.489 Shutdown request
  2.490 Stop

showing that LinuxBoot is responsible for by far the largest share of
execution time (~1500 ms) in my test, which runs for ~2500 ms in total.

Regards,
Daniel
Anthony Liguori - Oct. 10, 2011, 6:53 p.m.
On 10/10/2011 12:08 PM, Daniel P. Berrange wrote:
> I've been investigating where time disappears to when booting Linux guests.
>
> Initially I enabled DEBUG_BIOS in QEMU's hw/pc.c, and then hacked it so
> that it could print a timestamp before each new line of debug output. The
> problem with that is that it slowed down startup, so the timings I was
> examining all changed.
>
> What I really wanted was to use QEMU's trace infrastructure with a simple
> SystemTAP script. This is easy enough in the QEMU layer, but I also need
> to see where time goes to inside the various BIOS functions, and the
> options ROMs such as LinuxBoot. So I came up with a small hack to insert
> "probes" into SeaBios and LinuxBoot, which trigger a special IO port
> (0x404), which then cause QEMU to emit a trace event.
>
> The implementation is really very crude and does not allow any arguments
> to be passed to each probe, but since all I care about is timing information,
> it is good enough for my needs.
>
> I'm not really expecting these patches to be merged into QEMU/SeaBios
> since they're just a crude hack & I don't have time to write something
> better. I figure they might be useful for someone else though...
>
> With the attached patches applied to QEMU and SeaBios, the attached
> systemtap script can be used to debug timings in QEMU startup.
>
> For example, one execution of QEMU produced the following log:
>
>    $ stap qemu-timing.stp
>    0.000 Start
>    0.036 Run
>    0.038 BIOS post
>    0.180 BIOS int 19
>    0.181 BIOS boot OS
>    0.181 LinuxBoot copy kernel
>    1.371 LinuxBoot copy initrd

Yeah, there was a thread a bit ago about the performance of the interface to 
read the kernel/initrd.  I think it was using single byte access instructions 
and there were patches to use string accessors instead?  I can't remember where 
that thread ended up.

CC'ing Gleb and Alex who may recall more.

Regards,

Anthony Liguori


>    1.616 LinuxBoot boot OS
>    2.489 Shutdown request
>    2.490 Stop
>
> showing that LinuxBoot is responsible for by far the most execution
> time (~1500ms), in my test which runs for 2500ms in total.
>
> Regards,
> Daniel
Alexander Graf - Oct. 10, 2011, 7:01 p.m.
On 10.10.2011, at 20:53, Anthony Liguori wrote:

> On 10/10/2011 12:08 PM, Daniel P. Berrange wrote:
>> I've been investigating where time disappears to when booting Linux guests.
>> 
>> Initially I enabled DEBUG_BIOS in QEMU's hw/pc.c, and then hacked it so
>> that it could print a timestamp before each new line of debug output. The
>> problem with that is that it slowed down startup, so the timings I was
>> examining all changed.
>> 
>> What I really wanted was to use QEMU's trace infrastructure with a simple
>> SystemTAP script. This is easy enough in the QEMU layer, but I also need
>> to see where time goes to inside the various BIOS functions, and the
>> options ROMs such as LinuxBoot. So I came up with a small hack to insert
>> "probes" into SeaBios and LinuxBoot, which trigger a special IO port
>> (0x404), which then cause QEMU to emit a trace event.
>> 
>> The implementation is really very crude and does not allow any arguments
>> to be passed to each probe, but since all I care about is timing information,
>> it is good enough for my needs.
>> 
>> I'm not really expecting these patches to be merged into QEMU/SeaBios
>> since they're just a crude hack & I don't have time to write something
>> better. I figure they might be useful for someone else though...
>> 
>> With the attached patches applied to QEMU and SeaBios, the attached
>> systemtap script can be used to debug timings in QEMU startup.
>> 
>> For example, one execution of QEMU produced the following log:
>> 
>>   $ stap qemu-timing.stp
>>   0.000 Start
>>   0.036 Run
>>   0.038 BIOS post
>>   0.180 BIOS int 19
>>   0.181 BIOS boot OS
>>   0.181 LinuxBoot copy kernel
>>   1.371 LinuxBoot copy initrd
> 
> Yeah, there was a thread a bit ago about the performance of the interface to read the kernel/initrd.  I think it was using single byte access instructions and there were patches to use string accessors instead?  I can't remember where that thread ended up.

IIRC we're already using string accessors, but are still slow. Richard had a nice patch cooked up to basically have the fw_cfg interface be able to DMA its data to the guest. I like the idea. Avi did not.

And yes, bad -kernel performance does hurt in some workloads. A lot.


Alex
Kevin O'Connor - Oct. 10, 2011, 11:57 p.m.
On Mon, Oct 10, 2011 at 06:08:03PM +0100, Daniel P. Berrange wrote:
> I've been investigating where time disappears to when booting Linux guests.
> 
> Initially I enabled DEBUG_BIOS in QEMU's hw/pc.c, and then hacked it so
> that it could print a timestamp before each new line of debug output. The
> problem with that is that it slowed down startup, so the timings I was
> examining all changed.

A lot of effort went into optimizing SeaBIOS boot time.  There is a
tool in seabios git to help with benchmarking - tools/readserial.py.
The tool was designed for use with serial ports on real machines using
coreboot, but it works with qemu too:

mkfifo seabioslog

./tools/readserial.py -nf seabioslog

qemu-system-x86_64 -chardev pipe,id=seabios,path=seabioslog -device isa-debugcon,iobase=0x402,chardev=seabios -hda myimage

This will show the SeaBIOS debug output with timing info.

-Kevin
Daniel P. Berrange - Oct. 11, 2011, 8:23 a.m.
On Mon, Oct 10, 2011 at 09:01:52PM +0200, Alexander Graf wrote:
> 
> On 10.10.2011, at 20:53, Anthony Liguori wrote:
> 
> > On 10/10/2011 12:08 PM, Daniel P. Berrange wrote:
> >> With the attached patches applied to QEMU and SeaBios, the attached
> >> systemtap script can be used to debug timings in QEMU startup.
> >> 
> >> For example, one execution of QEMU produced the following log:
> >> 
> >>   $ stap qemu-timing.stp
> >>   0.000 Start
> >>   0.036 Run
> >>   0.038 BIOS post
> >>   0.180 BIOS int 19
> >>   0.181 BIOS boot OS
> >>   0.181 LinuxBoot copy kernel
> >>   1.371 LinuxBoot copy initrd
> > 
> > Yeah, there was a thread a bit ago about the performance
> > of the interface to read the kernel/initrd.  I think it
> > was using single byte access instructions and there were
> > patches to use string accessors instead?  I can't remember
> > where that thread ended up.

There was initially a huge performance problem, which was
fixed during the course of the thread, getting to the current
state where it still takes a few seconds to load large blobs.
The thread continued with many proposals & counter proposals
but nothing further really came out of it.

   https://lists.gnu.org/archive/html/qemu-devel/2010-08/msg00133.html

One core point to take away, though, is that -kernel/-initrd is
*not* just for ad-hoc testing by qemu/kernel developers. It is
critical functionality widely used by users of QEMU in production
scenarios, and its performance does matter, in some cases, a lot.

> IIRC we're already using string accessors, but are still
> slow. Richard had a nice patch cooked up to basically have
> the fw_cfg interface be able to DMA its data to the guest.
> I like the idea. Avi did not.

That's here:

  https://lists.gnu.org/archive/html/qemu-devel/2010-07/msg01037.html

> And yes, bad -kernel performance does hurt in some workloads. A lot.

Let me recap the 3 usage scenarios I believe are most common:

 - Most Linux distro installs done with libvirt + virt-manager/virt-install
   are done by directly booting the distro's PXE kernel/initrd files.
   The kernel images are typically < 5 MB, while the initrd images may
   be as large as 150 MB.  Both are compressed already. An uncompressed
   initrd image would be more like 300 MB,  so these are avoided for
   obvious reasons.

   Performance is not really an issue, within reason, since the overall
   distro installation time will easily dominate, but loading should
   still be measured in seconds, not minutes.

   The reason for using a kernel/initrd instead of a bootable ISO is
   to be able to set kernel command line arguments for the installer.

 - libguestfs directly boots its appliance using the regular host's
   kernel image and a custom built initrd image. The initrd does
   not contain the entire appliance, just enough to boot up and
   dynamically read files in from the host OS on demand. This is
   a so-called "supermin appliance".

   The kernel is < 5 MB, while the initrd is approx 100MB. The initrd
   image is used uncompressed, because decompression time needs to be
   eliminated from bootup.  Performance is very critical for libguestfs.
   100's of milliseconds really do make a big difference for it.

   The reason for using a kernel/initrd instead of bootable ISO is to
   avoid the time required to actually build the ISO, and to avoid
   having more disks visible in the guest, which could confuse apps
   using libguestfs which enumerate disks.

 - Application sandbox, directly boots the regular host's kernel and
   a custom initrd image. The initrd does not contain any files except
   for the 9p kernel modules and a custom init binary, which mounts
   the guest root FS from a 9p filesystem export.

   The kernel is < 5 MB, while the initrd is approx 700 KB compressed,
   or 1.4 MB uncompressed. Performance for the sandbox is even more
   critical than for libguestfs. Even 10's of milliseconds make a
   difference here. The commands being run in the sandbox can be
   very short lived processes, executed reasonably frequently. The
   goal is to have end-to-end runtime overhead of < 2 seconds. This
   includes libvirt guest startup, qemu startup/shutdown, bios time,
   option ROM time, kernel boot & shutdown time.

   The reason for using a kernel/initrd instead of a bootable ISO,
   is that building an ISO requires time itself, and we need to be
   able to easily pass kernel boot arguments via -append.


I'm focusing on the last use case, and if the phase of the moon
is correct, I can currently execute a sandbox command with a total
overhead of 3.5 seconds (if using a compressed initrd), of which
the QEMU execution time is 2.5 seconds.

Of this, 1.4 seconds is the time required by LinuxBoot to copy the
kernel+initrd. If I used an uncompressed initrd, which I really want
to, to avoid decompression overhead, this increases to ~1.7 seconds.
So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40%
of total sandbox execution overhead.

For comparison I also did a test building a bootable ISO using ISOLinux.
This required 700 ms for the boot time, which is approximately 1/2 the
time required for direct kernel/initrd boot. But you then have to add
on the time required to build the ISO on every boot, to add custom kernel
command line args. So while ISO is faster than LinuxBoot currently,
there is still non-negligible overhead here that I want to avoid.

For further comparison I tested with Rich Jones' patches which add a
DMA-like interface to fw_cfg. With this, the time spent in the LinuxBoot
option ROM was as close to zero as matters.

So obviously, my preference is for -kernel/-initrd to be made very fast
using the DMA-like patches, or any other patches which could achieve
similarly high performance for -kernel/-initrd.

Regards,
Daniel
Richard W.M. Jones - Oct. 11, 2011, 8:43 a.m.
On Tue, Oct 11, 2011 at 09:23:15AM +0100, Daniel P. Berrange wrote:
>  - libguestfs directly boots its appliance using the regular host's
>    kernel image and a custom built initrd image. The initrd does
>    not contain the entire appliance, just enough to boot up and
>    dynamically read files in from the host OS on demand. This is
>    a so called "supermin appliance".
> 
>    The kernel is < 5 MB, while the initrd is approx 100MB.
[...]

Actually this is how libguestfs used to work, but the performance of
such a large initrd against the poor qemu implementation meant we had
to abandon this approach.

We now use -kernel ~5MB, a small -initrd ~1.1MB and a large ext2
format root disk (of course loaded on demand, which is better anyway).

Nevertheless any improvement in -kernel and -initrd load times would
help us gain a few tenths of a second, which is still very important
for us.  Overall boot time is 3-4 seconds and we are often in a
situation where we need to repeatedly boot the appliance.

Rich.
Avi Kivity - Oct. 11, 2011, 9:08 a.m.
On 10/10/2011 09:01 PM, Alexander Graf wrote:
> >>  For example, one execution of QEMU produced the following log:
> >>
> >>    $ stap qemu-timing.stp
> >>    0.000 Start
> >>    0.036 Run
> >>    0.038 BIOS post
> >>    0.180 BIOS int 19
> >>    0.181 BIOS boot OS
> >>    0.181 LinuxBoot copy kernel
> >>    1.371 LinuxBoot copy initrd
> >
> >  Yeah, there was a thread a bit ago about the performance of the interface to read the kernel/initrd.  I think it was using single byte access instructions and there were patches to use string accessors instead?  I can't remember where that thread ended up.
>
> IIRC we're already using string accessors, but are still slow. Richard had a nice patch cooked up to basically have the fw_cfg interface be able to DMA its data to the guest. I like the idea. Avi did not.
>
> And yes, bad -kernel performance does hurt in some workloads. A lot.
>
>

The rep/ins implementation is still slow, optimizing it can help.

What does 'perf top' say when running this workload?
Avi Kivity - Oct. 11, 2011, 9:15 a.m.
On 10/11/2011 10:23 AM, Daniel P. Berrange wrote:
>   - Application sandbox, directly boots the regular host's kernel and
>     a custom initrd image. The initrd does not contain any files except
>     for the 9p kernel modules and a custom init binary, which mounts
>     the guest root FS from a 9p filesystem export.
>
>     The kernel is < 5 MB, while the initrd is approx 700 KB compressed,
>     or 1.4 MB uncompressed. Performance for the sandbox is even more
>     critical than for libguestfs. Even 10's of milliseconds make a
>     difference here. The commands being run in the sandbox can be
>     very short lived processes, executed reasonably frequently. The
>     goal is to have end-to-end runtime overhead of < 2 seconds. This
>     includes libvirt guest startup, qemu startup/shutdown, bios time,
>     option ROM time, kernel boot & shutdown time.
>
>     The reason for using a kernel/initrd instead of a bootable ISO,
>     is that building an ISO requires time itself, and we need to be
>     able to easily pass kernel boot arguments via -append.
>
>
> I'm focusing on the last use case, and if the phase of the moon
> is correct, I can currently execute a sandbox command with a total
> overhead of 3.5 seconds (if using a compressed initrd) of which
> the QEMU execution time is 2.5 seconds.
>
> Of this, 1.4 seconds is the time required by LinuxBoot to copy the
> kernel+initrd. If I used an uncompressed initrd, which I really want
> to, to avoid decompression overhead, this increases to ~1.7 seconds.
> So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40%
> of total sandbox execution overhead.

One thing we can do is boot a guest and immediately snapshot it, before 
it runs any application specific code.  Subsequent invocations will 
MAP_PRIVATE the memory image and COW their way.  This avoids the kernel 
initialization time as well.

>
> For comparison I also did a test building a bootable ISO using ISOLinux.
> This required 700 ms for the boot time, which is approximately 1/2 the
> time required for direct kernel/initrd boot. But you have to then add
> on time required to build the ISO on every boot, to add custom kernel
> command line args. So while ISO is faster than LinuxBoot currently
> there is still non-negligible overhead here that I want to avoid.

You can accept parameters from virtio-serial or some other channel.  Is 
there any reason you need them specifically as *kernel* command line 
parameters?

> For further comparison I tested with Rich Jones' patches which add a
> DMA-like interface to fw_cfg. With this the time spent in the LinuxBoot
> option ROM was as close to zero as matters.
>
> So obviously, my preference is for -kernel/-initrd to be made very fast
> using the DMA-like patches, or any other patches which could achieve
> similarly high performance for -kernel/-initrd.
Daniel P. Berrange - Oct. 11, 2011, 9:18 a.m.
On Tue, Oct 11, 2011 at 11:08:33AM +0200, Avi Kivity wrote:
> On 10/10/2011 09:01 PM, Alexander Graf wrote:
> >>>  For example, one execution of QEMU produced the following log:
> >>>
> >>>    $ stap qemu-timing.stp
> >>>    0.000 Start
> >>>    0.036 Run
> >>>    0.038 BIOS post
> >>>    0.180 BIOS int 19
> >>>    0.181 BIOS boot OS
> >>>    0.181 LinuxBoot copy kernel
> >>>    1.371 LinuxBoot copy initrd
> >>
> >>  Yeah, there was a thread a bit ago about the performance of the interface to read the kernel/initrd.  I think it was using single byte access instructions and there were patches to use string accessors instead?  I can't remember where that thread ended up.
> >
> >IIRC we're already using string accessors, but are still slow. Richard had a nice patch cooked up to basically have the fw_cfg interface be able to DMA its data to the guest. I like the idea. Avi did not.
> >
> >And yes, bad -kernel performance does hurt in some workloads. A lot.
> >
> >
> 
> The rep/ins implementation is still slow, optimizing it can help.
> 
> What does 'perf top' say when running this workload?

To ensure it only recorded the LinuxBoot code, I created a 100 MB
kernel image which takes approx 30 seconds to copy. Here is the
perf output for approx 15 seconds of that copy:

             1906.00 15.0% read_hpet                       [kernel]            
             1029.00  8.1% x86_emulate_insn                [kvm]               
              863.00  6.8% test_cc                         [kvm]               
              661.00  5.2% emulator_get_segment            [kvm]               
              631.00  5.0% kvm_mmu_pte_write               [kvm]               
              535.00  4.2% __linearize                     [kvm]               
              431.00  3.4% do_raw_spin_lock                [kernel]            
              356.00  2.8% vmx_get_segment                 [kvm_intel]         
              330.00  2.6% vmx_segment_cache_test_set      [kvm_intel]         
              308.00  2.4% segmented_write                 [kvm]               
              291.00  2.3% vread_hpet                      [kernel].vsyscall_fn
              251.00  2.0% vmx_get_cpl                     [kvm_intel]         
              230.00  1.8% trace_kvm_mmu_audit             [kvm]               
              207.00  1.6% kvm_write_guest                 [kvm]               
              199.00  1.6% emulator_write_emulated         [kvm]               
              187.00  1.5% emulator_write_emulated_onepage [kvm]               
              185.00  1.5% kvm_write_guest_page            [kvm]               
              177.00  1.4% vmx_get_segment_base            [kvm_intel]         
              158.00  1.2% fw_cfg_io_readb                 qemu-system-x86_64  
              148.00  1.2% register_address_increment      [kvm]               
              142.00  1.1% emulator_write_phys             [kvm]               
              134.00  1.1% acpi_os_read_port               [kernel]            


Daniel
Alexander Graf - Oct. 11, 2011, 9:19 a.m.
On 11.10.2011, at 11:15, Avi Kivity wrote:

> On 10/11/2011 10:23 AM, Daniel P. Berrange wrote:
>>  - Application sandbox, directly boots the regular host's kernel and
>>    a custom initrd image. The initrd does not contain any files except
>>    for the 9p kernel modules and a custom init binary, which mounts
>>    the guest root FS from a 9p filesystem export.
>> 
>>    The kernel is < 5 MB, while the initrd is approx 700 KB compressed,
>>    or 1.4 MB uncompressed. Performance for the sandbox is even more
>>    critical than for libguestfs. Even 10's of milliseconds make a
>>    difference here. The commands being run in the sandbox can be
>>    very short lived processes, executed reasonably frequently. The
>>    goal is to have end-to-end runtime overhead of < 2 seconds. This
>>    includes libvirt guest startup, qemu startup/shutdown, bios time,
>>    option ROM time, kernel boot & shutdown time.
>> 
>>    The reason for using a kernel/initrd instead of a bootable ISO,
>>    is that building an ISO requires time itself, and we need to be
>>    able to easily pass kernel boot arguments via -append.
>> 
>> 
>> I'm focusing on the last use case, and if the phase of the moon
>> is correct, I can currently execute a sandbox command with a total
>> overhead of 3.5 seconds (if using a compressed initrd) of which
>> the QEMU execution time is 2.5 seconds.
>> 
>> Of this, 1.4 seconds is the time required by LinuxBoot to copy the
>> kernel+initrd. If I used an uncompressed initrd, which I really want
>> to, to avoid decompression overhead, this increases to ~1.7 seconds.
>> So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40%
>> of total sandbox execution overhead.
> 
> One thing we can do is boot a guest and immediately snapshot it, before it runs any application specific code.  Subsequent invocations will MAP_PRIVATE the memory image and COW their way.  This avoids the kernel initialization time as well.

That doesn't allow modification of -append and gets you in a pretty bizarre state when doing updates of your host files, since then you have 2 different paths: full boot and restore. That's yet another potential source for bugs.

> 
>> 
>> For comparison I also did a test building a bootable ISO using ISOLinux.
>> This required 700 ms for the boot time, which is approximately 1/2 the
>> time required for direct kernel/initrd boot. But you have to then add
>> on time required to build the ISO on every boot, to add custom kernel
>> command line args. So while ISO is faster than LinuxBoot currently
>> there is still non-negligible overhead here that I want to avoid.
> 
> You can accept parameters from virtio-serial or some other channel.  Is there any reason you need them specifically as *kernel* command line parameters?

That doesn't work for kernel parameters. It also means things would have to be rewritten needlessly. Sometimes we can't easily change the way parameters are passed into the guest either, for example when running a random (read: old, think of RHEL5) distro installation initrd.

And I don't see why we should have to shoot yet another hole into the guest just because we're unwilling to fix a perfectly valid interface that happens to be horribly slow.


Alex
Avi Kivity - Oct. 11, 2011, 9:26 a.m.
On 10/11/2011 11:19 AM, Alexander Graf wrote:
> >>
> >>  Of this, 1.4 seconds is the time required by LinuxBoot to copy the
> >>  kernel+initrd. If I used an uncompressed initrd, which I really want
> >>  to, to avoid decompression overhead, this increases to ~1.7 seconds.
> >>  So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40%
> >>  of total sandbox execution overhead.
> >
> >  One thing we can do is boot a guest and immediately snapshot it, before it runs any application specific code.  Subsequent invocations will MAP_PRIVATE the memory image and COW their way.  This avoids the kernel initialization time as well.
>
> That doesn't allow modification of -append

Is it really needed?

> and gets you in a pretty bizarre state when doing updates of your host files, since then you have 2 different paths: full boot and restore. That's yet another potential source for bugs.

Typically you'd check the timestamps to make sure you're running an 
up-to-date version.

>
> >
> >>
> >>  For comparison I also did a test building a bootable ISO using ISOLinux.
> >>  This required 700 ms for the boot time, which is approximately 1/2 the
> >>  time required for direct kernel/initrd boot. But you have to then add
> >>  on time required to build the ISO on every boot, to add custom kernel
> >>  command line args. So while ISO is faster than LinuxBoot currently
> >>  there is still non-negligible overhead here that I want to avoid.
> >
> >  You can accept parameters from virtio-serial or some other channel.  Is there any reason you need them specifically as *kernel* command line parameters?
>
> That doesn't work for kernel parameters. It also means things would have to be rewritten needlessly. Sometimes we can't easily change the way parameters are passed into the guest either, for example when running a random (read: old, think of RHEL5) distro installation initrd.

This use case is not installation, it's for app sandboxing.

> And I don't see why we should have to shoot yet another hole into the guest just because we're unwilling to fix a perfectly valid interface that happens to be horribly slow.

rep/ins is exactly like dma+wait for this use case: provide an address, 
get a memory image in return.  There's no need to add another interface, 
we should just optimize the existing one.
Daniel P. Berrange - Oct. 11, 2011, 9:27 a.m.
On Tue, Oct 11, 2011 at 11:15:05AM +0200, Avi Kivity wrote:
> On 10/11/2011 10:23 AM, Daniel P. Berrange wrote:
> >  - Application sandbox, directly boots the regular host's kernel and
> >    a custom initrd image. The initrd does not contain any files except
> >    for the 9p kernel modules and a custom init binary, which mounts
> >    the guest root FS from a 9p filesystem export.
> >
> >    The kernel is < 5 MB, while the initrd is approx 700 KB compressed,
> >    or 1.4 MB uncompressed. Performance for the sandbox is even more
> >    critical than for libguestfs. Even 10's of milliseconds make a
> >    difference here. The commands being run in the sandbox can be
> >    very short lived processes, executed reasonably frequently. The
> >    goal is to have end-to-end runtime overhead of < 2 seconds. This
> >    includes libvirt guest startup, qemu startup/shutdown, bios time,
> >    option ROM time, kernel boot & shutdown time.
> >
> >    The reason for using a kernel/initrd instead of a bootable ISO,
> >    is that building an ISO requires time itself, and we need to be
> >    able to easily pass kernel boot arguments via -append.
> >
> >
> >I'm focusing on the last use case, and if the phase of the moon
> >is correct, I can currently execute a sandbox command with a total
> >overhead of 3.5 seconds (if using a compressed initrd) of which
> >the QEMU execution time is 2.5 seconds.
> >
> >Of this, 1.4 seconds is the time required by LinuxBoot to copy the
> >kernel+initrd. If I used an uncompressed initrd, which I really want
> >to, to avoid decompression overhead, this increases to ~1.7 seconds.
> >So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40%
> >of total sandbox execution overhead.
> 
> One thing we can do is boot a guest and immediately snapshot it,
> before it runs any application specific code.  Subsequent
> invocations will MAP_PRIVATE the memory image and COW their way.
> This avoids the kernel initialization time as well.

This is adding an awful lot of complexity to the process, just
to avoid fixing a performance problem in QEMU. You can't even
reliably snapshot in between the time of booting the kernel and
running the app code, without writing some kind of handshake
between the guest & the host app. You now also have the problem of
figuring out when the snapshot has become invalid due to host OS
software updates, which I explicitly wanted to avoid by *always*
running the current software directly.

> >For comparison I also did a test building a bootable ISO using ISOLinux.
> >This required 700 ms for the boot time, which is approximately 1/2 the
> >time required for direct kernel/initrd boot. But you have to then add
> >on time required to build the ISO on every boot, to add custom kernel
> >command line args. So while ISO is faster than LinuxBoot currently
> >there is still non-negligible overhead here that I want to avoid.
> 
> You can accept parameters from virtio-serial or some other channel.
> Is there any reason you need them specifically as *kernel* command
> line parameters?

Well, some of the parameters are actually kernel parameters :-) The rest
are things I pass to the 'init' process which runs in the initrd. When
this process first starts, the only things it can easily access are those
built into the kernel image, so data available from /proc or /sys, like
the /proc/cmdline file. It hasn't even loaded things like the virtio-serial
or virtio-9pfs kernel modules at this point.

Daniel
Avi Kivity - Oct. 11, 2011, 9:35 a.m.
On 10/11/2011 11:18 AM, Daniel P. Berrange wrote:
> >
> >  The rep/ins implementation is still slow, optimizing it can help.
> >
> >  What does 'perf top' say when running this workload?
>
> To ensure it only recorded the LinuxBoot code, I created a 100 MB
> kernel image which takes approx 30 seconds to copy. Here is the
> perf output for approx 15 seconds of that copy:
>
>               1906.00 15.0% read_hpet                       [kernel]

Recent kernels are very clock intensive...

>               1029.00  8.1% x86_emulate_insn                [kvm]
>                863.00  6.8% test_cc                         [kvm]

test_cc() is weird - not called on this path at all.

>                661.00  5.2% emulator_get_segment            [kvm]
>                631.00  5.0% kvm_mmu_pte_write               [kvm]
>                535.00  4.2% __linearize                     [kvm]
>                431.00  3.4% do_raw_spin_lock                [kernel]
>                356.00  2.8% vmx_get_segment                 [kvm_intel]
>                330.00  2.6% vmx_segment_cache_test_set      [kvm_intel]
>                308.00  2.4% segmented_write                 [kvm]
>                291.00  2.3% vread_hpet                      [kernel].vsyscall_fn
>                251.00  2.0% vmx_get_cpl                     [kvm_intel]
>                230.00  1.8% trace_kvm_mmu_audit             [kvm]
>                207.00  1.6% kvm_write_guest                 [kvm]
>                199.00  1.6% emulator_write_emulated         [kvm]
>                187.00  1.5% emulator_write_emulated_onepage [kvm]
>                185.00  1.5% kvm_write_guest_page            [kvm]
>                177.00  1.4% vmx_get_segment_base            [kvm_intel]
>                158.00  1.2% fw_cfg_io_readb                 qemu-system-x86_64

This is where something gets done.

>                148.00  1.2% register_address_increment      [kvm]
>                142.00  1.1% emulator_write_phys             [kvm]

And here too.  So 97.7% overhead, which could be reduced by a factor of 
4096 if the code is made more rep-aware.
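[Editorial note: Avi's factor-of-4096 figure can be sanity-checked with a back-of-envelope model. This is a throwaway illustration, not code from any patch; the 4 KiB page size and the function names are assumptions made here.]

```python
# Back-of-envelope model of the "factor of 4096" claim: if the segment,
# permission and page-table checks are done once per 4 KiB page instead of
# once per byte, the per-byte bookkeeping shrinks by the page size.
PAGE = 4096

def checks_per_byte(nbytes):
    # current emulator behaviour: full checks for every byte moved
    return nbytes

def checks_per_page(nbytes):
    # rep-aware emulator: checks once per page (ceiling division)
    return (nbytes + PAGE - 1) // PAGE

kernel = 100 * 1024 * 1024  # the 100 MB test image from this thread
ratio = checks_per_byte(kernel) // checks_per_page(kernel)
print(ratio)  # -> 4096
```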
Alexander Graf - Oct. 11, 2011, 9:38 a.m.
On 11.10.2011, at 11:26, Avi Kivity wrote:

> On 10/11/2011 11:19 AM, Alexander Graf wrote:
>> >>
>> >>  Of this, 1.4 seconds is the time required by LinuxBoot to copy the
>> >>  kernel+initrd. If I used an uncompressed initrd, which I really want
>> >>  to, to avoid decompression overhead, this increases to ~1.7 seconds.
>> >>  So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40%
>> >>  of total sandbox execution overhead.
>> >
>> >  One thing we can do is boot a guest and immediately snapshot it, before it runs any application specific code.  Subsequent invocations will MAP_PRIVATE the memory image and COW their way.  This avoids the kernel initialization time as well.
>> 
>> That doesn't allow modification of -append
> 
> Is it really needed?

For our use case for example yes. We pass the cifs user/pass using the kernel cmdline, so we can reuse existing initrd code and just mount it as root.

> 
>> and gets you in a pretty bizarre state when doing updates of your host files, since then you have 2 different paths: full boot and restore. That's yet another potential source for bugs.
> 
> Typically you'd check the timestamps to make sure you're running an up-to-date version.

Yes. That's why I said you end up with 2 different boot cases. Now imagine you get a bug once every 10000 bootups and try to track it down when it only happens in the non-resume case.

> 
>> 
>> >
>> >>
>> >>  For comparison I also did a test building a bootable ISO using ISOLinux.
>> >>  This required 700 ms for the boot time, which is approximately 1/2 the
>> >>  time required for direct kernel/initrd boot. But you have to then add
>> >>  on time required to build the ISO on every boot, to add custom kernel
>> >>  command line args. So while ISO is faster than LinuxBoot currently
>> >>  there is still non-negligible overhead here that I want to avoid.
>> >
>> >  You can accept parameters from virtio-serial or some other channel.  Is there any reason you need them specifically as *kernel* command line parameters?
>> 
>> That doesn't work for kernel parameters. It also means things would have to be rewritten needlessly. Some times we can't easily change the way parameters are passed into the guest either, for example when running a random (read: old, think of RHEL5) distro installation initrd.
> 
> This use case is not installation, it's for app sandboxing.

I thought we were talking about plenty different use cases here? I'm pretty sure there are even more out there that we haven't even thought about.

> 
>> And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow.
> 
> rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return.  There's no need to add another interface, we should just optimize the existing one.

Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop.

I don't see where the problem is in admitting that we were wrong back then. The fw_cfg interface as it is is great for small config variables, but nobody sane would even consider using IDE without DMA these days for example, because you're transferring bulk data. And that's exactly what we do in this case. We transfer bulk data.

However, I'll gladly see myself proven wrong with an awesomely fast rep/ins implementation that loads 100MB in < 1/10th of a second.


Alex
Avi Kivity - Oct. 11, 2011, 9:39 a.m.
On 10/11/2011 11:27 AM, Daniel P. Berrange wrote:
> >
> >  One thing we can do is boot a guest and immediately snapshot it,
> >  before it runs any application specific code.  Subsequent
> >  invocations will MAP_PRIVATE the memory image and COW their way.
> >  This avoids the kernel initialization time as well.
>
> This is adding an awful lot of complexity to the process, just
> to avoid fixing a performance problem in QEMU.

The performance problem is in the host kernel, not qemu, and I'm 
certainly not against fixing it.  I'm trying to see if we can optimize 
it even further to make it instantaneous.

>   You can't even
> reliably snapshot in between the time of booting the kernel and
> running the app code, without having to write some kind of handshake
> between guest & the host app. You now also have the problem of
> figuring out when the snapshot has become invalid due to host OS
> software updates, which I explicitly wanted to avoid by *always*
> running the current software directly.

Sure, it adds complexity, but the improvement may be worth it.

>
> >  >For comparison I also did a test building a bootable ISO using ISOLinux.
> >  >This required 700 ms for the boot time, which is approximately 1/2 the
> >  >time required for direct kernel/initrd boot. But you have to then add
> >  >on time required to build the ISO on every boot, to add custom kernel
> >  >command line args. So while ISO is faster than LinuxBoot currently
> >  >there is still non-negligible overhead here that I want to avoid.
> >
> >  You can accept parameters from virtio-serial or some other channel.
> >  Is there any reason you need them specifically as *kernel* command
> >  line parameters?
>
> Well some of the parameters are actually kernel parameters :-) The rest
> are things I pass to the 'init' process which runs in the initrd. When
> this process first starts the only things it can easily access are those
> builtin to the kernel image, so data available from /proc or /sys like
> the /proc/cmdline file. It hasn't even loaded things like the virtio-serial
> or virtio-9pfs kernel modules at this point.
>

It could, if it wanted to.  It's completely custom, yes?
Daniel P. Berrange - Oct. 11, 2011, 9:49 a.m.
On Tue, Oct 11, 2011 at 11:39:36AM +0200, Avi Kivity wrote:
> On 10/11/2011 11:27 AM, Daniel P. Berrange wrote:
> >>  >For comparison I also did a test building a bootable ISO using ISOLinux.
> >>  >This required 700 ms for the boot time, which is approximately 1/2 the
> >>  >time required for direct kernel/initrd boot. But you have to then add
> >>  >on time required to build the ISO on every boot, to add custom kernel
> >>  >command line args. So while ISO is faster than LinuxBoot currently
> >>  >there is still non-negligible overhead here that I want to avoid.
> >>
> >>  You can accept parameters from virtio-serial or some other channel.
> >>  Is there any reason you need them specifically as *kernel* command
> >>  line parameters?
> >
> >Well some of the parameters are actually kernel parameters :-) The rest
> >are things I pass to the 'init' process which runs in the initrd. When
> >this process first starts the only things it can easily access are those
> >builtin to the kernel image, so data available from /proc or /sys like
> >the /proc/cmdline file. It hasn't even loaded things like the virtio-serial
> >or virtio-9pfs kernel modules at this point.
> >
> 
> It could, if it wanted to.  It's completely custom, yes?

I'm thinking primarily about debug related parameters, which need to be
used as soon as the process starts, not delayed until after we've loaded
kernel modules at which point the step we wanted to debug is already
past.


Daniel
Avi Kivity - Oct. 11, 2011, 9:49 a.m.
On 10/11/2011 11:38 AM, Alexander Graf wrote:
> >
> >>  and gets you in a pretty bizarre state when doing updates of your host files, since then you have 2 different paths: full boot and restore. That's yet another potential source for bugs.
> >
> >  Typically you'd check the timestamps to make sure you're running an up-to-date version.
>
> Yes. That's why I said you end up with 2 different boot cases. Now imagine you get a bug once every 10000 bootups and try to trace that down that it only happens when running in the non-resume case.

That's life in virt land.  If you want nice repeatable bugs, write
single-threaded Python.

> >
> >>
> >>  >
> >>  >>
> >>  >>   For comparison I also did a test building a bootable ISO using ISOLinux.
> >>  >>   This required 700 ms for the boot time, which is approximately 1/2 the
> >>  >>   time required for direct kernel/initrd boot. But you have to then add
> >>  >>   on time required to build the ISO on every boot, to add custom kernel
> >>  >>   command line args. So while ISO is faster than LinuxBoot currently
> >>  >>   there is still non-negligible overhead here that I want to avoid.
> >>  >
> >>  >   You can accept parameters from virtio-serial or some other channel.  Is there any reason you need them specifically as *kernel* command line parameters?
> >>
> >>  That doesn't work for kernel parameters. It also means things would have to be rewritten needlessly. Some times we can't easily change the way parameters are passed into the guest either, for example when running a random (read: old, think of RHEL5) distro installation initrd.
> >
> >  This use case is not installation, it's for app sandboxing.
>
> I thought we were talking about plenty different use cases here? I'm pretty sure there are even more out there that we haven't even thought about.

I'm talking about the case he mentioned, not every possible use case.  
Usually booting an ISO image is best since it only loads on demand.

>
> >
> >>  And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow.
> >
> >  rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return.  There's no need to add another interface, we should just optimize the existing one.
>
> Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop.

We can batch per page, which makes the overhead negligible.

> I don't see where the problem is in admitting that we were wrong back then. The fw_cfg interface as it is is great for small config variables, but nobody sane would even consider using IDE without DMA these days for example, because you're transferring bulk data. And that's exactly what we do in this case. We transfer bulk data.
>
> However, I'll gladly see myself proven wrong with an awesomely fast rep/ins implementation that loads 100MB in < 1/10th of a second.
>

100 MB in 100 ms gives us 1 GB/s, or 4 us per page.  I'm not sure we can 
get exactly there, but pretty close.
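[Editorial note: a quick sanity check of Avi's arithmetic, purely illustrative; the 4 KiB page size is an assumption made here.]

```python
# 100 MB in 100 ms: resulting throughput and per-page time budget.
size_mb = 100
time_s = 0.1
pages = size_mb * 1024 * 1024 // 4096   # 25600 pages of 4 KiB
gib_per_s = size_mb / 1024 / time_s     # ~0.98 GiB/s, i.e. "1 GB/s"
us_per_page = time_s * 1e6 / pages      # ~3.9 us per page
print(pages, round(gib_per_s, 2), round(us_per_page, 2))  # -> 25600 0.98 3.91
```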
Avi Kivity - Oct. 11, 2011, 9:50 a.m.
On 10/11/2011 11:49 AM, Daniel P. Berrange wrote:
> On Tue, Oct 11, 2011 at 11:39:36AM +0200, Avi Kivity wrote:
> >  On 10/11/2011 11:27 AM, Daniel P. Berrange wrote:
> >  >>   >For comparison I also did a test building a bootable ISO using ISOLinux.
> >  >>   >This required 700 ms for the boot time, which is approximately 1/2 the
> >  >>   >time required for direct kernel/initrd boot. But you have to then add
> >  >>   >on time required to build the ISO on every boot, to add custom kernel
> >  >>   >command line args. So while ISO is faster than LinuxBoot currently
> >  >>   >there is still non-negligible overhead here that I want to avoid.
> >  >>
> >  >>   You can accept parameters from virtio-serial or some other channel.
> >  >>   Is there any reason you need them specifically as *kernel* command
> >  >>   line parameters?
> >  >
> >  >Well some of the parameters are actually kernel parameters :-) The rest
> >  >are things I pass to the 'init' process which runs in the initrd. When
> >  >this process first starts the only things it can easily access are those
> >  >builtin to the kernel image, so data available from /proc or /sys like
> >  >the /proc/cmdline file. It hasn't even loaded things like the virtio-serial
> >  >or virtio-9pfs kernel modules at this point.
> >  >
> >
> >  It could, if it wanted to.  It's completely custom, yes?
>
> I'm thinking primarily about debug related parameters, which need to be
> used as soon the process starts, not delayed until after we've loaded
> kernel modules at which point the step we wanted to debug is already
> past.

Ah, so there's no issue in regenerating the image if you want to debug.
Gleb Natapov - Oct. 11, 2011, 9:50 a.m.
On Tue, Oct 11, 2011 at 11:26:14AM +0200, Avi Kivity wrote:
> rep/ins is exactly like dma+wait for this use case: provide an
> address, get a memory image in return.  There's no need to add
> another interface, we should just optimize the existing one.
> 
rep/ins cannot be optimized to be as efficient as dma and remain
correct at the same time. There are various corner cases that a
simplified "fast" implementation will likely miss: DF flag
settings, delaying interrupts for too long, doing ins/outs to/from
iomem (this may not be a big problem unless userspace finds a way
to trigger it). There are still ways the current implementation can
be optimized, though. 

But loading MBs of data through the fw_cfg interface is just abusing it.
You wouldn't use PIO on real HW to move megabytes of data and expect
good performance. 


--
			Gleb.
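[Editorial note: the PIO-vs-DMA gap Gleb describes can be made concrete by counting kvm/userspace exits for a 100 MB transfer, using the batch sizes discussed in this thread (1024-byte rep/ins batches, 4 KiB pages, one DMA request). Purely illustrative numbers.]

```python
# Userspace-exit counts for moving 100 MB under different batching schemes.
SIZE = 100 * 1024 * 1024

def exits(batch_bytes):
    # one kvm/userspace exit per batch (ceiling division)
    return -(-SIZE // batch_bytes)

pio_1k = exits(1024)    # current rep/ins batching (per-1024-byte exits)
pio_page = exits(4096)  # per-page batching
dma = 1                 # a true DMA interface: one request for the region
print(pio_1k, pio_page, dma)  # -> 102400 25600 1
```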
Avi Kivity - Oct. 11, 2011, 9:55 a.m.
On 10/11/2011 11:50 AM, Gleb Natapov wrote:
> On Tue, Oct 11, 2011 at 11:26:14AM +0200, Avi Kivity wrote:
> >  rep/ins is exactly like dma+wait for this use case: provide an
> >  address, get a memory image in return.  There's no need to add
> >  another interface, we should just optimize the existing one.
> >
> rep/ins cannot be optimized to be as efficient as dma and remain to
> be correct at the same time. There are various corner cases that
> simplified "fast" implementation will likely miss. Like DF flag
> settings, delaying interrupts for too much, doing ins/outs to/from
> iomem (this is may be not a big problem unless userspace finds a way
> to trigger it). There are ways that current implementation can be
> optimized still though.

These can all go through the slow path, except interrupts, which need to 
be checked after every access.

> But loading MBs of data through fw_cfg interface is just abusing it.
> You wouldn't use pio on real HW to move megabytes of data and expect
> good performance.

True, this is a point in favour of a true dma interface.
Gleb Natapov - Oct. 11, 2011, 9:56 a.m.
On Tue, Oct 11, 2011 at 11:49:16AM +0200, Avi Kivity wrote:
> >Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop.
> 
> We can batch per page, which makes the overhead negligible.
> 
The current code batches userspace exits per 1024 bytes IIRC, and changing it
to a page didn't show significant improvement (also IIRC). But after the io
data is copied into the kernel, the emulator processes it byte by byte. A
possible optimization, which I didn't try, is to check that the destination
memory is not mmio and write back the whole buffer if that is the case.

--
			Gleb.
Avi Kivity - Oct. 11, 2011, 9:59 a.m.
On 10/11/2011 11:56 AM, Gleb Natapov wrote:
> On Tue, Oct 11, 2011 at 11:49:16AM +0200, Avi Kivity wrote:
> >  >Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop.
> >
> >  We can batch per page, which makes the overhead negligible.
> >
> Current code batch userspace exit per 1024 bytes IIRC and changing it to
> page didn't show significant improvement (also IIRC). But after io data
> is copied into the kernel emulator process it byte by byte. Possible
> optimization, which I didn't tried, is to check that destination memory is
> not mmio and write back the whole buffer if it is the case.
>

All the permission checks, segment checks, register_address_increment, 
page table walking, can be done per page.  Right now they are done per byte.

Btw, Intel also made this optimization: current processors copy complete 
cache lines instead of bytes, so they probably also do the checks just once.
Daniel P. Berrange - Oct. 11, 2011, 10:09 a.m.
On Tue, Oct 11, 2011 at 11:50:01AM +0200, Avi Kivity wrote:
> On 10/11/2011 11:49 AM, Daniel P. Berrange wrote:
> >On Tue, Oct 11, 2011 at 11:39:36AM +0200, Avi Kivity wrote:
> >>  On 10/11/2011 11:27 AM, Daniel P. Berrange wrote:
> >>  >>   >For comparison I also did a test building a bootable ISO using ISOLinux.
> >>  >>   >This required 700 ms for the boot time, which is approximately 1/2 the
> >>  >>   >time required for direct kernel/initrd boot. But you have to then add
> >>  >>   >on time required to build the ISO on every boot, to add custom kernel
> >>  >>   >command line args. So while ISO is faster than LinuxBoot currently
> >>  >>   >there is still non-negligible overhead here that I want to avoid.
> >>  >>
> >>  >>   You can accept parameters from virtio-serial or some other channel.
> >>  >>   Is there any reason you need them specifically as *kernel* command
> >>  >>   line parameters?
> >>  >
> >>  >Well some of the parameters are actually kernel parameters :-) The rest
> >>  >are things I pass to the 'init' process which runs in the initrd. When
> >>  >this process first starts the only things it can easily access are those
> >>  >builtin to the kernel image, so data available from /proc or /sys like
> >>  >the /proc/cmdline file. It hasn't even loaded things like the virtio-serial
> >>  >or virtio-9pfs kernel modules at this point.
> >>  >
> >>
> >>  It could, if it wanted to.  It's completely custom, yes?
> >
> >I'm thinking primarily about debug related parameters, which need to be
> >used as soon the process starts, not delayed until after we've loaded
> >kernel modules at which point the step we wanted to debug is already
> >past.
> 
> Ah, so there's no issue in regenerating the image if you want to debug.

Compared to just altering the -append arg to QEMU, rebuilding the initrd
image and init program is a PITA.

Daniel
Gleb Natapov - Oct. 11, 2011, 10:28 a.m.
On Tue, Oct 11, 2011 at 11:59:45AM +0200, Avi Kivity wrote:
> On 10/11/2011 11:56 AM, Gleb Natapov wrote:
> >On Tue, Oct 11, 2011 at 11:49:16AM +0200, Avi Kivity wrote:
> >>  >Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop.
> >>
> >>  We can batch per page, which makes the overhead negligible.
> >>
> >Current code batch userspace exit per 1024 bytes IIRC and changing it to
> >page didn't show significant improvement (also IIRC). But after io data
> >is copied into the kernel emulator process it byte by byte. Possible
> >optimization, which I didn't tried, is to check that destination memory is
> >not mmio and write back the whole buffer if it is the case.
> >
> 
> All the permission checks, segment checks,
> register_address_increment, page table walking, can be done per
> page.  Right now they are done per byte.
> 
The permission-checking result is cached in ctxt->perm_ok. I see that the
current code checks it after several function calls, but this was not the
case before. All the others are done for each iteration currently. By writing
back a whole buffer at once we would eliminate the others too. It would be
interesting to see how much that improves the situation.


> btw Intel also made this optimization, current processors copy
> complete cache lines instead of bytes, so they probably also do the
> checks just once.
> 
> -- 
> error compiling committee.c: too many arguments to function

--
			Gleb.
Anthony Liguori - Oct. 11, 2011, 1:12 p.m.
On 10/11/2011 04:38 AM, Alexander Graf wrote:
>
> On 11.10.2011, at 11:26, Avi Kivity wrote:
>
>> On 10/11/2011 11:19 AM, Alexander Graf wrote:
>>>>>
>>>>>   Of this, 1.4 seconds is the time required by LinuxBoot to copy the
>>>>>   kernel+initrd. If I used an uncompressed initrd, which I really want
>>>>>   to, to avoid decompression overhead, this increases to ~1.7 seconds.
>>>>>   So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40%
>>>>>   of total sandbox execution overhead.
>>>>
>>>>   One thing we can do is boot a guest and immediately snapshot it, before it runs any application specific code.  Subsequent invocations will MAP_PRIVATE the memory image and COW their way.  This avoids the kernel initialization time as well.
>>>
>>> That doesn't allow modification of -append
>>
>> Is it really needed?
>
> For our use case for example yes. We pass the cifs user/pass using the kernel cmdline, so we can reuse existing initrd code and just mount it as root.
>
>>
>>> and gets you in a pretty bizarre state when doing updates of your host files, since then you have 2 different paths: full boot and restore. That's yet another potential source for bugs.
>>
>> Typically you'd check the timestamps to make sure you're running an up-to-date version.
>
> Yes. That's why I said you end up with 2 different boot cases. Now imagine you get a bug once every 10000 bootups and try to trace that down that it only happens when running in the non-resume case.
>
>>
>>>
>>>>
>>>>>
>>>>>   For comparison I also did a test building a bootable ISO using ISOLinux.
>>>>>   This required 700 ms for the boot time, which is approximately 1/2 the
>>>>>   time required for direct kernel/initrd boot. But you have to then add
>>>>>   on time required to build the ISO on every boot, to add custom kernel
>>>>>   command line args. So while ISO is faster than LinuxBoot currently
>>>>>   there is still non-negligible overhead here that I want to avoid.
>>>>
>>>>   You can accept parameters from virtio-serial or some other channel.  Is there any reason you need them specifically as *kernel* command line parameters?
>>>
>>> That doesn't work for kernel parameters. It also means things would have to be rewritten needlessly. Some times we can't easily change the way parameters are passed into the guest either, for example when running a random (read: old, think of RHEL5) distro installation initrd.
>>
>> This use case is not installation, it's for app sandboxing.
>
> I thought we were talking about plenty different use cases here? I'm pretty sure there are even more out there that we haven't even thought about.
>
>>
>>> And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow.
>>
>> rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return.  There's no need to add another interface, we should just optimize the existing one.
>
> Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop.

rep/ins is effectively equivalent to DMA except in how it's handled within QEMU.

Regards,

Anthony Liguori

>
> I don't see where the problem is in admitting that we were wrong back then. The fw_cfg interface as it is is great for small config variables, but nobody sane would even consider using IDE without DMA these days for example, because you're transferring bulk data. And that's exactly what we do in this case. We transfer bulk data.
>
> However, I'll gladly see myself proven wrong with an awesomely fast rep/ins implementation that loads 100MB in < 1/10th of a second.
>
>
> Alex
>
>
Alexander Graf - Oct. 11, 2011, 1:14 p.m.
On 11.10.2011, at 15:12, Anthony Liguori wrote:

> On 10/11/2011 04:38 AM, Alexander Graf wrote:
>> 
>> On 11.10.2011, at 11:26, Avi Kivity wrote:
>> 
>>> On 10/11/2011 11:19 AM, Alexander Graf wrote:
>>>>>> 
>>>>>>  Of this, 1.4 seconds is the time required by LinuxBoot to copy the
>>>>>>  kernel+initrd. If I used an uncompressed initrd, which I really want
>>>>>>  to, to avoid decompression overhead, this increases to ~1.7 seconds.
>>>>>>  So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40%
>>>>>>  of total sandbox execution overhead.
>>>>> 
>>>>>  One thing we can do is boot a guest and immediately snapshot it, before it runs any application specific code.  Subsequent invocations will MAP_PRIVATE the memory image and COW their way.  This avoids the kernel initialization time as well.
>>>> 
>>>> That doesn't allow modification of -append
>>> 
>>> Is it really needed?
>> 
>> For our use case for example yes. We pass the cifs user/pass using the kernel cmdline, so we can reuse existing initrd code and just mount it as root.
>> 
>>> 
>>>> and gets you in a pretty bizarre state when doing updates of your host files, since then you have 2 different paths: full boot and restore. That's yet another potential source for bugs.
>>> 
>>> Typically you'd check the timestamps to make sure you're running an up-to-date version.
>> 
>> Yes. That's why I said you end up with 2 different boot cases. Now imagine you get a bug once every 10000 bootups and try to trace that down that it only happens when running in the non-resume case.
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>  For comparison I also did a test building a bootable ISO using ISOLinux.
>>>>>>  This required 700 ms for the boot time, which is approximately 1/2 the
>>>>>>  time required for direct kernel/initrd boot. But you have to then add
>>>>>>  on time required to build the ISO on every boot, to add custom kernel
>>>>>>  command line args. So while ISO is faster than LinuxBoot currently
>>>>>>  there is still non-negligible overhead here that I want to avoid.
>>>>> 
>>>>>  You can accept parameters from virtio-serial or some other channel.  Is there any reason you need them specifically as *kernel* command line parameters?
>>>> 
>>>> That doesn't work for kernel parameters. It also means things would have to be rewritten needlessly. Some times we can't easily change the way parameters are passed into the guest either, for example when running a random (read: old, think of RHEL5) distro installation initrd.
>>> 
>>> This use case is not installation, it's for app sandboxing.
>> 
>> I thought we were talking about plenty different use cases here? I'm pretty sure there are even more out there that we haven't even thought about.
>> 
>>> 
>>>> And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow.
>>> 
>>> rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return.  There's no need to add another interface, we should just optimize the existing one.
>> 
>> Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop.
> 
> rep/ins is effectively equivalent to DMA except in how it's handled within QEMU.

No, DMA has much bigger granularity in kvm/user interaction. We can easily DMA a 50MB region with a single kvm/user exit. For PIO we can at most do page granularity.


Alex
Anthony Liguori - Oct. 11, 2011, 1:17 p.m.
On 10/11/2011 04:55 AM, Avi Kivity wrote:
> On 10/11/2011 11:50 AM, Gleb Natapov wrote:
>> On Tue, Oct 11, 2011 at 11:26:14AM +0200, Avi Kivity wrote:
>> > rep/ins is exactly like dma+wait for this use case: provide an
>> > address, get a memory image in return. There's no need to add
>> > another interface, we should just optimize the existing one.
>> >
>> rep/ins cannot be optimized to be as efficient as dma and remain to
>> be correct at the same time. There are various corner cases that
>> simplified "fast" implementation will likely miss. Like DF flag
>> settings, delaying interrupts for too much, doing ins/outs to/from
>> iomem (this is may be not a big problem unless userspace finds a way
>> to trigger it). There are ways that current implementation can be
>> optimized still though.
>
> These can all go through the slow path, except interrupts, which need to be
> checked after every access.
>
>> But loading MBs of data through fw_cfg interface is just abusing it.
>> You wouldn't use pio on real HW to move megabytes of data and expect
>> good performance.
>
> True, this is a point in favour of a true dma interface.

Doing kernel loading through fw_cfg has always been a bit ugly.

A better approach would be to implement a PCI device with a ROM bar that 
contained an option ROM that read additional bars from the device to get at the 
kernel and initrd.

That also enables some potentially interesting models like having the additional 
bars be optionally persisted letting a user have direct control over which 
kernel/initrds were loaded.  It's essentially a PCI device with a flash chip on 
it that contains a kernel/initrd.

Regards,

Anthony Liguori

>
Gleb Natapov - Oct. 11, 2011, 1:17 p.m.
On Tue, Oct 11, 2011 at 03:14:52PM +0200, Alexander Graf wrote:
> > rep/ins is effectively equivalent to DMA except in how it's handled within QEMU.
> 
> No, DMA has a lot bigger granularities in kvm/user interaction. We can easily DMA a 50MB region with a single kvm/user exit. For PIO we can at most do page granularity.
> 
Not only granularity, but double copy too. Maybe Anthony is referring
to real HW, in which case I also do not see how it can be true, since one
operation is synchronous and the other is not.

--
			Gleb.
Anthony Liguori - Oct. 11, 2011, 1:19 p.m.
On 10/11/2011 08:14 AM, Alexander Graf wrote:
>>>>> And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow.
>>>>
>>>> rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return.  There's no need to add another interface, we should just optimize the existing one.
>>>
>>> Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop.
>>
>> rep/ins is effectively equivalent to DMA except in how it's handled within QEMU.
>
> No, DMA has a lot bigger granularities in kvm/user interaction. We can easily DMA a 50MB region with a single kvm/user exit. For PIO we can at most do page granularity.

So make a proper PCI device for kernel loading.  It's a much more natural 
approach, and lets us alias -kernel/-initrd/-append to -device 
kernel-pci,kernel=PATH,initrd=PATH.

Regards,

Anthony Liguori

>
> Alex
>
>
Gleb Natapov - Oct. 11, 2011, 1:22 p.m.
On Tue, Oct 11, 2011 at 08:17:28AM -0500, Anthony Liguori wrote:
> On 10/11/2011 04:55 AM, Avi Kivity wrote:
> >On 10/11/2011 11:50 AM, Gleb Natapov wrote:
> >>On Tue, Oct 11, 2011 at 11:26:14AM +0200, Avi Kivity wrote:
> >>> rep/ins is exactly like dma+wait for this use case: provide an
> >>> address, get a memory image in return. There's no need to add
> >>> another interface, we should just optimize the existing one.
> >>>
> >>rep/ins cannot be optimized to be as efficient as dma and remain to
> >>be correct at the same time. There are various corner cases that
> >>simplified "fast" implementation will likely miss. Like DF flag
> >>settings, delaying interrupts for too much, doing ins/outs to/from
> >>iomem (this is may be not a big problem unless userspace finds a way
> >>to trigger it). There are ways that current implementation can be
> >>optimized still though.
> >
> >These can all go through the slow path, except interrupts, which need to be
> >checked after every access.
> >
> >>But loading MBs of data through fw_cfg interface is just abusing it.
> >>You wouldn't use pio on real HW to move megabytes of data and expect
> >>good performance.
> >
> >True, this is a point in favour of a true dma interface.
> 
> Doing kernel loading through fw_cfg has always been a bit ugly.
> 
> A better approach would be to implement a PCI device with a ROM bar
> that contained an option ROM that read additional bars from the
> device to get at the kernel and initrd.
I thought about this too. But the initrd sizes people are mentioning here
are crazy. We can run out of PCI space very quickly. We can implement one
of the BARs as a sliding window into the initrd, though.

> 
> That also enables some potentially interesting models like having
> the additional bars be optionally persisted letting a user have
> direct control over which kernel/initrds were loaded.  It's
> essentially a PCI device with a flash chip on it that contains a
> kernel/initrd.
> 

--
			Gleb.
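Gleb's sliding-window BAR could be modelled along these lines (a sketch; the 64 KB window size, register layout, and names are all assumptions for illustration, not an actual QEMU device):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_SIZE 65536u   /* hypothetical 64 KB BAR window */

/* Hypothetical device model: the full initrd stays host-side, and the
 * guest sees at most WINDOW_SIZE bytes of it at a time through a BAR,
 * repositioned via an offset register. */
struct initrd_dev {
    const uint8_t *initrd;
    size_t initrd_len;
    uint32_t window_base;    /* guest-written offset register */
};

/* Guest writes the offset register to slide the window. */
static void dev_set_window(struct initrd_dev *d, uint32_t base)
{
    d->window_base = base;
}

/* Guest read at offset `off` inside the window BAR. */
static uint8_t dev_window_read(const struct initrd_dev *d, uint32_t off)
{
    size_t pos = (size_t)d->window_base + off;
    if (off >= WINDOW_SIZE || pos >= d->initrd_len)
        return 0xff;         /* out-of-range reads as open bus */
    return d->initrd[pos];
}
```

The guest would copy the initrd in WINDOW_SIZE chunks, writing the offset register between chunks, so an arbitrarily large initrd costs only one small BAR plus a register rather than a BAR sized to the whole image.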
Avi Kivity - Oct. 11, 2011, 1:23 p.m.
On 10/11/2011 03:19 PM, Anthony Liguori wrote:
>> No, DMA has a lot bigger granularities in kvm/user interaction. We
>> can easily DMA a 50MB region with a single kvm/user exit. For PIO we
>> can at most do page granularity.
>
>
> So make a proper PCI device for kernel loading.  It's a much more
> natural approach and let's use alias -kernel/-initrd/-append to
> -device kernel-pci,kernel=PATH,initrd=PATH

This is overkill.  First let's optimize rep/movs before introducing any
more interfaces.  If that doesn't work, then we can have a dma interface
for fwcfg.  But a new pci device?
Gleb Natapov - Oct. 11, 2011, 1:24 p.m.
On Tue, Oct 11, 2011 at 03:23:36PM +0200, Avi Kivity wrote:
> On 10/11/2011 03:19 PM, Anthony Liguori wrote:
> >> No, DMA has a lot bigger granularities in kvm/user interaction. We
> >> can easily DMA a 50MB region with a single kvm/user exit. For PIO we
> >> can at most do page granularity.
> >
> >
> > So make a proper PCI device for kernel loading.  It's a much more
> > natural approach and let's use alias -kernel/-initrd/-append to
> > -device kernel-pci,kernel=PATH,initrd=PATH
> 
> This is overkill.  First let's optimize rep/movs before introducing any
> more interfaces.  If that doesn't work, then we can have a dma interface
> for fwcfg.  But a new pci device?
> 
We can hot unplug it right after the boot :)

--
			Gleb.
Anthony Liguori - Oct. 11, 2011, 1:29 p.m.
On 10/11/2011 08:23 AM, Avi Kivity wrote:
> On 10/11/2011 03:19 PM, Anthony Liguori wrote:
>>> No, DMA has a lot bigger granularities in kvm/user interaction. We
>>> can easily DMA a 50MB region with a single kvm/user exit. For PIO we
>>> can at most do page granularity.
>>
>>
>> So make a proper PCI device for kernel loading.  It's a much more
>> natural approach and let's use alias -kernel/-initrd/-append to
>> -device kernel-pci,kernel=PATH,initrd=PATH
>
> This is overkill.  First let's optimize rep/movs before introducing any
> more interfaces.  If that doesn't work, then we can have a dma interface
> for fwcfg.  But a new pci device?

This is how it would work on bare metal.  Why is a PCI device overkill compared 
to a dma interface for fwcfg?

If we're adding dma to fwcfg, then fwcfg has become far too complex for its 
intended purpose.

Regards,

Anthony Liguori

>
Avi Kivity - Oct. 11, 2011, 1:45 p.m.
On 10/11/2011 03:29 PM, Anthony Liguori wrote:
> On 10/11/2011 08:23 AM, Avi Kivity wrote:
>> On 10/11/2011 03:19 PM, Anthony Liguori wrote:
>>>> No, DMA has a lot bigger granularities in kvm/user interaction. We
>>>> can easily DMA a 50MB region with a single kvm/user exit. For PIO we
>>>> can at most do page granularity.
>>>
>>>
>>> So make a proper PCI device for kernel loading.  It's a much more
>>> natural approach and let's use alias -kernel/-initrd/-append to
>>> -device kernel-pci,kernel=PATH,initrd=PATH
>>
>> This is overkill.  First let's optimize rep/movs before introducing any
>> more interfaces.  If that doesn't work, then we can have a dma interface
>> for fwcfg.  But a new pci device?
>
> This is how it would work on bare metal.  Why is a PCI device overkill
> compared to a dma interface for fwcfg?
>

Because it's a limited use case, despite all the talk around it.

> If we're adding dma to fwcfg, then fwcfg has become far too complex
> for it's intended purpose.
>

I have to agree with that.

btw, -net nic,model=virtio -net user is an internal DMA interface we
already have.  We can boot from it. Why not use it?
Anthony Liguori - Oct. 11, 2011, 1:58 p.m.
On 10/11/2011 08:45 AM, Avi Kivity wrote:
> On 10/11/2011 03:29 PM, Anthony Liguori wrote:
>> On 10/11/2011 08:23 AM, Avi Kivity wrote:
>>> On 10/11/2011 03:19 PM, Anthony Liguori wrote:
>>>>> No, DMA has a lot bigger granularities in kvm/user interaction. We
>>>>> can easily DMA a 50MB region with a single kvm/user exit. For PIO we
>>>>> can at most do page granularity.
>>>>
>>>>
>>>> So make a proper PCI device for kernel loading.  It's a much more
>>>> natural approach and let's use alias -kernel/-initrd/-append to
>>>> -device kernel-pci,kernel=PATH,initrd=PATH
>>>
>>> This is overkill.  First let's optimize rep/movs before introducing any
>>> more interfaces.  If that doesn't work, then we can have a dma interface
>>> for fwcfg.  But a new pci device?
>>
>> This is how it would work on bare metal.  Why is a PCI device overkill
>> compared to a dma interface for fwcfg?
>>
>
> Because it's a limited use case, despite all the talk around it.
>
>> If we're adding dma to fwcfg, then fwcfg has become far too complex
>> for it's intended purpose.
>>
>
> I have to agree to that.
>
> btw, -net nic,model=virtio -net user is an internal DMA interface we
> already have.  We can boot from it. Why not use it?

tftp over slirp is probably slower than fwcfg.  It has been every time I've looked.

Regards,

Anthony Liguori

>
Daniel P. Berrange - Oct. 11, 2011, 2:01 p.m.
On Tue, Oct 11, 2011 at 08:19:14AM -0500, Anthony Liguori wrote:
> On 10/11/2011 08:14 AM, Alexander Graf wrote:
> >>>>>And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow.
> >>>>
> >>>>rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return.  There's no need to add another interface, we should just optimize the existing one.
> >>>
> >>>Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop.
> >>
> >>rep/ins is effectively equivalent to DMA except in how it's handled within QEMU.
> >
> >No, DMA has a lot bigger granularities in kvm/user interaction. We can easily DMA a 50MB region with a single kvm/user exit. For PIO we can at most do page granularity.
> 
> So make a proper PCI device for kernel loading.  It's a much more
> natural approach and let's use alias -kernel/-initrd/-append to
> -device kernel-pci,kernel=PATH,initrd=PATH

Adding a PCI device doesn't sound very appealing, unless you
can guarantee it is never visible to the guest once LinuxBoot
has finished its dirty work, so mgmt apps don't have to worry
about PCI addressing wrt guest ABI.


Daniel
Anthony Liguori - Oct. 11, 2011, 2:33 p.m.
On 10/11/2011 09:01 AM, Daniel P. Berrange wrote:
> On Tue, Oct 11, 2011 at 08:19:14AM -0500, Anthony Liguori wrote:
>> On 10/11/2011 08:14 AM, Alexander Graf wrote:
>>>>>>> And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow.
>>>>>>
>>>>>> rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return.  There's no need to add another interface, we should just optimize the existing one.
>>>>>
>>>>> Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop.
>>>>
>>>> rep/ins is effectively equivalent to DMA except in how it's handled within QEMU.
>>>
>>> No, DMA has a lot bigger granularities in kvm/user interaction. We can easily DMA a 50MB region with a single kvm/user exit. For PIO we can at most do page granularity.
>>
>> So make a proper PCI device for kernel loading.  It's a much more
>> natural approach and let's use alias -kernel/-initrd/-append to
>> -device kernel-pci,kernel=PATH,initrd=PATH
>
> Adding a PCI device doesn't sound very appealing, unless you
> can guarentee it is never visible to the guest once LinuxBoot
> has finished its dirty work,

It'll definitely be guest visible just like fwcfg is guest visible.

Regards,

Anthony Liguori

> so mgmt apps don't have to worry
> about PCI addressing wrt guest ABI.
>
>
> Daniel
Alexander Graf - Oct. 11, 2011, 2:34 p.m.
On 11.10.2011, at 16:33, Anthony Liguori wrote:

> On 10/11/2011 09:01 AM, Daniel P. Berrange wrote:
>> On Tue, Oct 11, 2011 at 08:19:14AM -0500, Anthony Liguori wrote:
>>> On 10/11/2011 08:14 AM, Alexander Graf wrote:
>>>>>>>> And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow.
>>>>>>> 
>>>>>>> rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return.  There's no need to add another interface, we should just optimize the existing one.
>>>>>> 
>>>>>> Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop.
>>>>> 
>>>>> rep/ins is effectively equivalent to DMA except in how it's handled within QEMU.
>>>> 
>>>> No, DMA has a lot bigger granularities in kvm/user interaction. We can easily DMA a 50MB region with a single kvm/user exit. For PIO we can at most do page granularity.
>>> 
>>> So make a proper PCI device for kernel loading.  It's a much more
>>> natural approach and let's use alias -kernel/-initrd/-append to
>>> -device kernel-pci,kernel=PATH,initrd=PATH
>> 
>> Adding a PCI device doesn't sound very appealing, unless you
>> can guarentee it is never visible to the guest once LinuxBoot
>> has finished its dirty work,
> 
> It'll definitely be guest visible just like fwcfg is guest visible.

Yup, just that this time it eats up one of our precious PCI slots ;)

So far it's the best proposal I've heard though.


Alex
Daniel P. Berrange - Oct. 11, 2011, 2:36 p.m.
On Tue, Oct 11, 2011 at 09:33:49AM -0500, Anthony Liguori wrote:
> On 10/11/2011 09:01 AM, Daniel P. Berrange wrote:
> >On Tue, Oct 11, 2011 at 08:19:14AM -0500, Anthony Liguori wrote:
> >>On 10/11/2011 08:14 AM, Alexander Graf wrote:
> >>>>>>>And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow.
> >>>>>>
> >>>>>>rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return.  There's no need to add another interface, we should just optimize the existing one.
> >>>>>
> >>>>>Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop.
> >>>>
> >>>>rep/ins is effectively equivalent to DMA except in how it's handled within QEMU.
> >>>
> >>>No, DMA has a lot bigger granularities in kvm/user interaction. We can easily DMA a 50MB region with a single kvm/user exit. For PIO we can at most do page granularity.
> >>
> >>So make a proper PCI device for kernel loading.  It's a much more
> >>natural approach and let's use alias -kernel/-initrd/-append to
> >>-device kernel-pci,kernel=PATH,initrd=PATH
> >
> >Adding a PCI device doesn't sound very appealing, unless you
> >can guarentee it is never visible to the guest once LinuxBoot
> >has finished its dirty work,
> 
> It'll definitely be guest visible just like fwcfg is guest visible.

The difference is that fwcfg doesn't pose any real problems to the
guest OS. PCI devices will.

Also this means that if you have an existing VM booting with -kernel
and you update to a newer QEMU binary, the guest ABI changes due to
the new PCI device :-( Unless we keep the old code around forever too,
which means we'd really want to improve the old code anyway.

Daniel
Blue Swirl - Oct. 15, 2011, 10 a.m.
On Tue, Oct 11, 2011 at 8:23 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
> On Mon, Oct 10, 2011 at 09:01:52PM +0200, Alexander Graf wrote:
>>
>> On 10.10.2011, at 20:53, Anthony Liguori wrote:
>>
>> > On 10/10/2011 12:08 PM, Daniel P. Berrange wrote:
>> >> With the attached patches applied to QEMU and SeaBios, the attached
>> >> systemtap script can be used to debug timings in QEMU startup.
>> >>
>> >> For example, one execution of QEMU produced the following log:
>> >>
>> >>   $ stap qemu-timing.stp
>> >>   0.000 Start
>> >>   0.036 Run
>> >>   0.038 BIOS post
>> >>   0.180 BIOS int 19
>> >>   0.181 BIOS boot OS
>> >>   0.181 LinuxBoot copy kernel
>> >>   1.371 LinuxBoot copy initrd
>> >
>> > Yeah, there was a thread a bit ago about the performance
>> > of the interface to read the kernel/initrd.  I think it
>> > was using single byte access instructions and there were
>> > patches to use string accessors instead?  I can't remember
>> > where that thread ended up.
>
> There was initially a huge performance problem, which was
> fixed during the course of the thread, getting to the current
> state where it still takes a few seconds to load large blobs.
> The thread continued with many proposals & counter proposals
> but nothing further really came out of it.
>
>   https://lists.gnu.org/archive/html/qemu-devel/2010-08/msg00133.html
>
> One core point to take away though, is that -kernel/-initrd is
> *not* just for ad-hoc testing by qemu/kernel developers. It is
> critical functionality widely used by users of QEMU in production
> scenarios, and its performance does matter, in some cases, a lot.
>
>> IIRC we're already using string accessors, but are still
>> slow. Richard had a nice patch cooked up to basically have
>> the fw_cfg interface be able to DMA its data to the guest.
>> I like the idea. Avi did not.
>
> That's here:
>
>  https://lists.gnu.org/archive/html/qemu-devel/2010-07/msg01037.html
>
>> And yes, bad -kernel performance does hurt in some workloads. A lot.
>
> Let me recap the 3 usage scenarios I believe are most common:
>
>  - Most Linux distro installs done with libvirt + virt-manager/virt-install
>   are done by directly booting the distro's PXE kernel/initrd files.
>   The kernel images are typically < 5 MB, while the initrd images may
>   be as large as 150 MB.  Both are compressed already. An uncompressed
>   initrd image would be more like 300 MB,  so these are avoided for
>   obvious reasons.
>
>   Performance is not really an issue, within reason, since the overall
>   distro installation time will easily dominate, but loading should
>   still be measured in seconds, not minutes.
>
>   The reason for using a kernel/initrd instead of a bootable ISO is
>   to be able to set kernel command line arguments for the installer.
>
>  - libguestfs directly boots its appliance using the regular host's
>   kernel image and a custom built initrd image. The initrd does
>   not contain the entire appliance, just enough to boot up and
>   dynamically read files in from the host OS on demand. This is
>   a so called "supermin appliance".
>
>   The kernel is < 5 MB, while the initrd is approx 100MB. The initrd
>   image is used uncompressed, because decompression time needs to be
>   eliminated from bootup.  Performance is very critical for libguestfs.
>   100's of milliseconds really do make a big difference for it.
>
>   The reason for using a kernel/initrd instead of bootable ISO is to
>   avoid the time required to actually build the ISO, and to avoid
>   having more disks visible in the guest, which could confuse apps
>   using libguestfs which enumerate disks.
>
>  - Application sandbox, directly boots the regular host's kernel and
>   a custom initrd image. The initrd does not contain any files except
>   for the 9p kernel modules and a custom init binary, which mounts
>   the guest root FS from a 9p filesystem export.
>
>   The kernel is < 5 MB, while the initrd is approx 700 KB compressed,
>   or 1.4 MB uncompressed. Performance for the sandbox is even more
>   critical than for libguestfs. Even 10's of milliseconds make a
>   difference here. The commands being run in the sandbox can be
>   very short lived processes, executed reasonably frequently. The
>   goal is to have end-to-end runtime overhead of < 2 seconds. This
>   includes libvirt guest startup, qemu startup/shutdown, bios time,
>   option ROM time, kernel boot & shutdown time.
>
>   The reason for using a kernel/initrd instead of a bootable ISO,
>   is that building an ISO requires time itself, and we need to be
>   able to easily pass kernel boot arguments via -append.
>
>
> I'm focusing on the last use case, and if the phase of the moon
> is correct, I can currently execute a sandbox command with a total
> overhead of 3.5 seconds (if using a compressed initrd), of which
> the QEMU execution time is 2.5 seconds.
>
> Of this, 1.4 seconds is the time required by LinuxBoot to copy the
> kernel+initrd. If I used an uncompressed initrd, which I really want
> to, to avoid decompression overhead, this increases to ~1.7 seconds.
> So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40%
> of total sandbox execution overhead.
>
> For comparison I also did a test building a bootable ISO using ISOLinux.
> This required 700 ms for the boot time, which is approximately 1/2 the
> time required for direct kernel/initrd boot. But you have to then add
> on the time required to build the ISO on every boot, to add custom kernel
> command line args. So while ISO boot is currently faster than LinuxBoot,
> there is still non-negligible overhead here that I want to avoid.
>
> For further comparison I tested with Rich Jones' patches which add a
> DMA-like interface to fw_cfg. With this, the time spent in the LinuxBoot
> option ROM was as close to zero as matters.
>
> So obviously, my preference is for -kernel/-initrd to be made very fast
> using the DMA-like patches, or any other patches which could achieve
> similarly high performance for -kernel/-initrd.

I don't understand why the PC can't use the same approach as Sparc32:
have QEMU load the initrd into guest memory before boot. It should even
be possible to deduplicate the kernel and initrd images: improve the
loader to use mmap() for loading so that several guests would use the
same pages. A preloaded kernel and initrd are paravirtual anyway; there
could even be guest-visible changes if ever needed (e.g. map
kernel/initrd pages outside of normal RAM areas).
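On the loader side, Blue Swirl's mmap() suggestion might look like this (a sketch; `map_blob` is a hypothetical helper name, and the sharing comes from MAP_PRIVATE mappings of the same file being backed by the same page-cache pages until written):

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a kernel/initrd image read-only and copy-on-write.  Because the
 * mapping is backed by the page cache, several QEMU processes mapping
 * the same file share the physical pages until a write occurs. */
static void *map_blob(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    *len = (size_t)st.st_size;

    void *p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);               /* the mapping keeps the file referenced */
    return p == MAP_FAILED ? NULL : p;
}
```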
Kevin O'Connor - Oct. 15, 2011, 2:19 p.m.
On Tue, Oct 11, 2011 at 08:17:28AM -0500, Anthony Liguori wrote:
> On 10/11/2011 04:55 AM, Avi Kivity wrote:
> >On 10/11/2011 11:50 AM, Gleb Natapov wrote:
> >>But loading MBs of data through fw_cfg interface is just abusing it.
> >>You wouldn't use pio on real HW to move megabytes of data and expect
> >>good performance.
> >
> >True, this is a point in favour of a true dma interface.
> 
> Doing kernel loading through fw_cfg has always been a bit ugly.
> 
> A better approach would be to implement a PCI device with a ROM bar
> that contained an option ROM that read additional bars from the
> device to get at the kernel and initrd.

If one is willing to add a PCI device, then one could add a virtio
block device with a non-standard PCI device/vendor code and teach
SeaBIOS to scan these non-standard ids.  It's a bit of a hack, but it
would be a simple way of creating a drive visible by the bios but
hidden from the OS.

-Kevin
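Kevin's scan could amount to one extra entry in a boot-disk ID table (a sketch; 0x1af4:0x1001 is the existing transitional virtio-blk ID, while the 0x10ff device ID is made up here purely for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_VENDOR_REDHAT   0x1af4
/* Hypothetical ID for a "bios-only" virtio-blk variant; any otherwise
 * unused device ID under the 0x1af4 vendor would do. */
#define PCI_DEVICE_BOOTBLK  0x10ff

struct pci_id { uint16_t vendor, device; };

/* IDs the BIOS would treat as boot disks, in addition to the
 * standard virtio-blk ID.  The OS driver would not bind the extra
 * ID, keeping the drive hidden after boot. */
static const struct pci_id boot_ids[] = {
    { PCI_VENDOR_REDHAT, 0x1001 },              /* standard virtio-blk */
    { PCI_VENDOR_REDHAT, PCI_DEVICE_BOOTBLK },  /* bios-only variant   */
};

static int is_boot_disk(uint16_t vendor, uint16_t device)
{
    for (unsigned i = 0; i < sizeof(boot_ids) / sizeof(boot_ids[0]); i++) {
        if (boot_ids[i].vendor == vendor && boot_ids[i].device == device)
            return 1;
    }
    return 0;
}
```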
Richard W.M. Jones - Oct. 15, 2011, 4:16 p.m.
On Sat, Oct 15, 2011 at 10:00:02AM +0000, Blue Swirl wrote:
> I don't understand why PC can't use the same way of loading initrd by
> QEMU to guest memory before boot as Sparc32 uses. It should even be
> possible to deduplicate the kernel and initrd images: improve the
> loader to use mmap() for loading so that several guests would use the
> same pages. Preloaded kernel and initrd are paravirtual anyway, there
> could be even guest visible changes if ever needed (e.g. map
> kernel/initrd pages outside of normal RAM areas).

+1!

Even better if we extended Linux so it worked more like OS-9 (circa
1990): At boot, scan memory for modules and insmod them.  That way we
wouldn't even need an initrd since we could just supply the correct
list of modules that the guest needs to mount its root disk.

Rich.
Lluís Vilanova - Oct. 16, 2011, 5:20 p.m.
Richard W M Jones writes:

> On Sat, Oct 15, 2011 at 10:00:02AM +0000, Blue Swirl wrote:
>> I don't understand why PC can't use the same way of loading initrd by
>> QEMU to guest memory before boot as Sparc32 uses. It should even be
>> possible to deduplicate the kernel and initrd images: improve the
>> loader to use mmap() for loading so that several guests would use the
>> same pages. Preloaded kernel and initrd are paravirtual anyway, there
>> could be even guest visible changes if ever needed (e.g. map
>> kernel/initrd pages outside of normal RAM areas).

> +1!

> Even better if we extended Linux so it worked more like OS-9 (circa
> 1990): At boot, scan memory for modules and insmod them.  That way we
> wouldn't even need an initrd since we could just supply the correct
> list of modules that the guest needs to mount its root disk.

I'm not really knowledgeable about the topic at hand, but if the objective is to
boot the kernel image given on QEMU's cmdline, QEMU can just put the files
in memory in a way that is compliant with what grub already does, which is a
format well understood by linux itself (multiboot modules, I think they're
called).


Lluis

Patch

diff --git a/hw/pc.c b/hw/pc.c
index 203627d..76d0790 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -43,6 +43,7 @@ 
 #include "ui/qemu-spice.h"
 #include "memory.h"
 #include "exec-memory.h"
+#include "trace.h"
 
 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -516,6 +517,16 @@  static void handle_a20_line_change(void *opaque, int irq, int level)
 
 /***********************************************************/
 /* Bochs BIOS debug ports */
+enum {
+  PROBE_SEABIOS_POST = 1001,
+  PROBE_SEABIOS_INT_18 = 1002,
+  PROBE_SEABIOS_INT_19 = 1003,
+  PROBE_SEABIOS_BOOT_OS = 1004,
+
+  PROBE_LINUXBOOT_COPY_KERNEL = 2001,
+  PROBE_LINUXBOOT_COPY_INITRD = 2002,
+  PROBE_LINUXBOOT_BOOT_OS = 2003,
+};
 
 static void bochs_bios_write(void *opaque, uint32_t addr, uint32_t val)
 {
@@ -534,6 +545,31 @@  static void bochs_bios_write(void *opaque, uint32_t addr, uint32_t val)
         fprintf(stderr, "%c", val);
 #endif
         break;
+    case 0x404: {
+	switch (val) {
+	case PROBE_SEABIOS_POST:
+	    trace_seabios_post();
+	    break;
+	case PROBE_SEABIOS_INT_18:
+	    trace_seabios_int_18();
+	    break;
+	case PROBE_SEABIOS_INT_19:
+	    trace_seabios_int_19();
+	    break;
+	case PROBE_SEABIOS_BOOT_OS:
+	    trace_seabios_boot_OS();
+	    break;
+	case PROBE_LINUXBOOT_COPY_KERNEL:
+	    trace_linuxboot_copy_kernel();
+	    break;
+	case PROBE_LINUXBOOT_COPY_INITRD:
+	    trace_linuxboot_copy_initrd();
+	    break;
+	case PROBE_LINUXBOOT_BOOT_OS:
+	    trace_linuxboot_boot_OS();
+	    break;
+	}
+    }   break;
     case 0x8900:
         /* same as Bochs power off */
         if (val == shutdown_str[shutdown_index]) {
@@ -589,6 +625,7 @@  static void *bochs_bios_init(void)
     register_ioport_write(0x401, 1, 2, bochs_bios_write, NULL);
     register_ioport_write(0x402, 1, 1, bochs_bios_write, NULL);
     register_ioport_write(0x403, 1, 1, bochs_bios_write, NULL);
+    register_ioport_write(0x404, 1, 4, bochs_bios_write, NULL);
     register_ioport_write(0x8900, 1, 1, bochs_bios_write, NULL);
 
     register_ioport_write(0x501, 1, 1, bochs_bios_write, NULL);
diff --git a/pc-bios/linuxboot.bin b/pc-bios/linuxboot.bin
index e7c3669..40b9217 100644
Binary files a/pc-bios/linuxboot.bin and b/pc-bios/linuxboot.bin differ
diff --git a/pc-bios/optionrom/linuxboot.S b/pc-bios/optionrom/linuxboot.S
index 748c831..5c39fb1 100644
--- a/pc-bios/optionrom/linuxboot.S
+++ b/pc-bios/optionrom/linuxboot.S
@@ -108,11 +108,21 @@  copy_kernel:
 	/* We're now running in 16-bit CS, but 32-bit ES! */
 
 	/* Load kernel and initrd */
+	mov		$0x7d1,%eax
+	mov		$0x404,%edx
+	outl		%eax,(%dx)
 	read_fw_blob_addr32(FW_CFG_KERNEL)
+	mov		$0x7d2,%eax
+	mov		$0x404,%edx
+	outl		%eax,(%dx)
 	read_fw_blob_addr32(FW_CFG_INITRD)
 	read_fw_blob_addr32(FW_CFG_CMDLINE)
 	read_fw_blob_addr32(FW_CFG_SETUP)
 
+	mov		$0x7d3,%eax
+	mov		$0x404,%edx
+	outl		%eax,(%dx)
+
 	/* And now jump into Linux! */
 	mov		$0, %eax
 	mov		%eax, %cr0
diff --git a/trace-events b/trace-events
index a31d9aa..34ca28b 100644
--- a/trace-events
+++ b/trace-events
@@ -289,6 +289,11 @@  scsi_request_sense(int target, int lun, int tag) "target %d lun %d tag %d"
 
 # vl.c
 vm_state_notify(int running, int reason) "running %d reason %d"
+main_start(void) "startup"
+main_loop(void) "loop"
+main_stop(void) "stop"
+qemu_shutdown_request(void) "shutdown request"
+qemu_powerdown_request(void) "powerdown request"
 
 # block/qed-l2-cache.c
 qed_alloc_l2_cache_entry(void *l2_cache, void *entry) "l2_cache %p entry %p"
@@ -502,3 +507,12 @@  escc_sunkbd_event_in(int ch) "Untranslated keycode %2.2x"
 escc_sunkbd_event_out(int ch) "Translated keycode %2.2x"
 escc_kbd_command(int val) "Command %d"
 escc_sunmouse_event(int dx, int dy, int buttons_state) "dx=%d dy=%d buttons=%01x"
+
+seabios_post(void) "BIOS post"
+seabios_int_18(void) "BIOS int18"
+seabios_int_19(void) "BIOS int19"
+seabios_boot_OS(void) "BIOS boot OS"
+
+linuxboot_copy_kernel(void) "LinuxBoot Copy Kernel"
+linuxboot_copy_initrd(void) "LinuxBoot Copy InitRD"
+linuxboot_boot_OS(void) "LinuxBoot boot OS"
diff --git a/vl.c b/vl.c
index bd4a5ce..91e6f5e 100644
--- a/vl.c
+++ b/vl.c
@@ -162,7 +162,7 @@  int main(int argc, char **argv)
 #include "qemu-queue.h"
 #include "cpus.h"
 #include "arch_init.h"
-
+#include "trace.h"
 #include "ui/qemu-spice.h"
 
 //#define DEBUG_NET
@@ -1414,12 +1414,14 @@  void qemu_system_killed(int signal, pid_t pid)
 
 void qemu_system_shutdown_request(void)
 {
+    trace_qemu_shutdown_request();
     shutdown_requested = 1;
     qemu_notify_event();
 }
 
 void qemu_system_powerdown_request(void)
 {
+    trace_qemu_powerdown_request();
     powerdown_requested = 1;
     qemu_notify_event();
 }
@@ -2313,6 +2315,8 @@  int main(int argc, char **argv, char **envp)
     const char *trace_events = NULL;
     const char *trace_file = NULL;
 
+    trace_main_start();
+
     atexit(qemu_run_exit_notifiers);
     error_set_progname(argv[0]);
 
@@ -3571,10 +3575,12 @@  int main(int argc, char **argv, char **envp)
 
     os_setup_post();
 
+    trace_main_loop();
     main_loop();
     quit_timers();
     net_cleanup();
     res_free();
 
+    trace_main_stop();
     return 0;
 }
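For reference, the magic constants in the linuxboot.S hunk (0x7d1..0x7d3 written to port 0x404) are just the decimal probe IDs from the enum added to hw/pc.c. The same probe written as C would look like this (a sketch; the port write is stubbed into variables so it can run unprivileged, where real option-ROM code would do a raw `outl`):

```c
#include <assert.h>
#include <stdint.h>

/* Probe IDs matching the enum added to hw/pc.c in the patch. */
enum {
    PROBE_LINUXBOOT_COPY_KERNEL = 2001,  /* 0x7d1 */
    PROBE_LINUXBOOT_COPY_INITRD = 2002,  /* 0x7d2 */
    PROBE_LINUXBOOT_BOOT_OS     = 2003,  /* 0x7d3 */
};

#define PROBE_PORT 0x404

/* Recorded here instead of performing real port IO, so the sketch can
 * be exercised without IO privileges. */
static uint16_t last_port;
static uint32_t last_val;

static void probe(uint32_t val)
{
    /* stand-in for: outl(val, PROBE_PORT) */
    last_port = PROBE_PORT;
    last_val  = val;
}
```

Each `probe()` call triggers the 0x404 case in bochs_bios_write(), which maps the value onto one of the new trace events, so the SystemTap script sees a timestamped event per probe.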