Message ID: 20111010170803.GV9408@redhat.com
State: New
On 10/10/2011 12:08 PM, Daniel P. Berrange wrote:
> I've been investigating where time disappears to when booting Linux guests.
>
> Initially I enabled DEBUG_BIOS in QEMU's hw/pc.c, and then hacked it so that it could print a timestamp before each new line of debug output. The problem with that is that it slowed down startup, so the timings I was examining all changed.
>
> What I really wanted was to use QEMU's trace infrastructure with a simple SystemTap script. This is easy enough in the QEMU layer, but I also need to see where time goes inside the various BIOS functions, and the option ROMs such as LinuxBoot. So I came up with a small hack to insert "probes" into SeaBIOS and LinuxBoot, which trigger a special IO port (0x404), which in turn causes QEMU to emit a trace event.
>
> The implementation is really very crude and does not allow any arguments to be passed to each probe, but since all I care about is timing information, it is good enough for my needs.
>
> I'm not really expecting these patches to be merged into QEMU/SeaBIOS since they're just a crude hack & I don't have time to write something better. I figure they might be useful for someone else though...
>
> With the attached patches applied to QEMU and SeaBIOS, the attached SystemTap script can be used to debug timings in QEMU startup.
>
> For example, one execution of QEMU produced the following log:
>
> $ stap qemu-timing.stp
> 0.000 Start
> 0.036 Run
> 0.038 BIOS post
> 0.180 BIOS int 19
> 0.181 BIOS boot OS
> 0.181 LinuxBoot copy kernel
> 1.371 LinuxBoot copy initrd

Yeah, there was a thread a bit ago about the performance of the interface to read the kernel/initrd. I think it was using single byte access instructions and there were patches to use string accessors instead? I can't remember where that thread ended up. CC'ing Gleb and Alex who may recall more.
Regards,

Anthony Liguori

> 1.616 LinuxBoot boot OS
> 2.489 Shutdown request
> 2.490 Stop
>
> showing that LinuxBoot is responsible for by far the most execution time (~1500ms), in my test which runs for 2500ms in total.
>
> Regards,
> Daniel
On 10.10.2011, at 20:53, Anthony Liguori wrote:
> On 10/10/2011 12:08 PM, Daniel P. Berrange wrote:
>> I've been investigating where time disappears to when booting Linux guests.
>>
>> Initially I enabled DEBUG_BIOS in QEMU's hw/pc.c, and then hacked it so that it could print a timestamp before each new line of debug output. The problem with that is that it slowed down startup, so the timings I was examining all changed.
>>
>> What I really wanted was to use QEMU's trace infrastructure with a simple SystemTap script. This is easy enough in the QEMU layer, but I also need to see where time goes inside the various BIOS functions, and the option ROMs such as LinuxBoot. So I came up with a small hack to insert "probes" into SeaBIOS and LinuxBoot, which trigger a special IO port (0x404), which in turn causes QEMU to emit a trace event.
>>
>> The implementation is really very crude and does not allow any arguments to be passed to each probe, but since all I care about is timing information, it is good enough for my needs.
>>
>> I'm not really expecting these patches to be merged into QEMU/SeaBIOS since they're just a crude hack & I don't have time to write something better. I figure they might be useful for someone else though...
>>
>> With the attached patches applied to QEMU and SeaBIOS, the attached SystemTap script can be used to debug timings in QEMU startup.
>>
>> For example, one execution of QEMU produced the following log:
>>
>> $ stap qemu-timing.stp
>> 0.000 Start
>> 0.036 Run
>> 0.038 BIOS post
>> 0.180 BIOS int 19
>> 0.181 BIOS boot OS
>> 0.181 LinuxBoot copy kernel
>> 1.371 LinuxBoot copy initrd
>
> Yeah, there was a thread a bit ago about the performance of the interface to read the kernel/initrd. I think it was using single byte access instructions and there were patches to use string accessors instead? I can't remember where that thread ended up.

IIRC we're already using string accessors, but are still slow.
Richard had a nice patch cooked up to basically have the fw_cfg interface be able to DMA its data to the guest. I like the idea. Avi did not.

And yes, bad -kernel performance does hurt in some workloads. A lot.


Alex
On Mon, Oct 10, 2011 at 06:08:03PM +0100, Daniel P. Berrange wrote:
> I've been investigating where time disappears to when booting Linux guests.
>
> Initially I enabled DEBUG_BIOS in QEMU's hw/pc.c, and then hacked it so that it could print a timestamp before each new line of debug output. The problem with that is that it slowed down startup, so the timings I was examining all changed.

A lot of effort went into optimizing SeaBIOS boot time. There is a tool in seabios git to help with benchmarking - tools/readserial.py. The tool was designed for use with serial ports on real machines using coreboot, but it works with qemu too:

  mkfifo seabioslog
  ./tools/readserial.py -nf seabioslog
  qemu-system-x86_64 -chardev pipe,id=seabios,path=seabioslog \
      -device isa-debugcon,iobase=0x402,chardev=seabios -hda myimage

This will show the SeaBIOS debug output with timing info.

-Kevin
On Mon, Oct 10, 2011 at 09:01:52PM +0200, Alexander Graf wrote:
> On 10.10.2011, at 20:53, Anthony Liguori wrote:
>> On 10/10/2011 12:08 PM, Daniel P. Berrange wrote:
>>> With the attached patches applied to QEMU and SeaBIOS, the attached SystemTap script can be used to debug timings in QEMU startup.
>>>
>>> For example, one execution of QEMU produced the following log:
>>>
>>> $ stap qemu-timing.stp
>>> 0.000 Start
>>> 0.036 Run
>>> 0.038 BIOS post
>>> 0.180 BIOS int 19
>>> 0.181 BIOS boot OS
>>> 0.181 LinuxBoot copy kernel
>>> 1.371 LinuxBoot copy initrd
>>
>> Yeah, there was a thread a bit ago about the performance of the interface to read the kernel/initrd. I think it was using single byte access instructions and there were patches to use string accessors instead? I can't remember where that thread ended up.

There was initially a huge performance problem, which was fixed during the course of the thread, getting to the current state where it still takes a few seconds to load large blobs. The thread continued with many proposals & counter-proposals but nothing further really came out of it.

https://lists.gnu.org/archive/html/qemu-devel/2010-08/msg00133.html

One core point to take away, though, is that -kernel/-initrd is *not* just for ad-hoc testing by qemu/kernel developers. It is critical functionality widely used by users of QEMU in production scenarios, and its performance does matter; in some cases, a lot.

> IIRC we're already using string accessors, but are still slow. Richard had a nice patch cooked up to basically have the fw_cfg interface be able to DMA its data to the guest. I like the idea. Avi did not.

That's here:

https://lists.gnu.org/archive/html/qemu-devel/2010-07/msg01037.html

> And yes, bad -kernel performance does hurt in some workloads. A lot.
Let me recap the 3 usage scenarios I believe are most common:

- Most Linux distro installs done with libvirt + virt-manager/virt-install are done by directly booting the distro's PXE kernel/initrd files. The kernel images are typically < 5 MB, while the initrd images may be as large as 150 MB. Both are compressed already. An uncompressed initrd image would be more like 300 MB, so these are avoided for obvious reasons.

  Performance is not really an issue, within reason, since the overall distro installation time will easily dominate, but loading should still be measured in seconds, not minutes. The reason for using a kernel/initrd instead of a bootable ISO is to be able to set kernel command line arguments for the installer.

- libguestfs directly boots its appliance using the regular host's kernel image and a custom built initrd image. The initrd does not contain the entire appliance, just enough to boot up and dynamically read files in from the host OS on demand. This is a so-called "supermin appliance". The kernel is < 5 MB, while the initrd is approx 100 MB. The initrd image is used uncompressed, because decompression time needs to be eliminated from bootup.

  Performance is very critical for libguestfs. 100's of milliseconds really do make a big difference for it. The reason for using a kernel/initrd instead of a bootable ISO is to avoid the time required to actually build the ISO, and to avoid having more disks visible in the guest, which could confuse apps using libguestfs which enumerate disks.

- Application sandbox, directly boots the regular host's kernel and a custom initrd image. The initrd does not contain any files except for the 9p kernel modules and a custom init binary, which mounts the guest root FS from a 9p filesystem export. The kernel is < 5 MB, while the initrd is approx 700 KB compressed, or 1.4 MB uncompressed.

  Performance for the sandbox is even more critical than for libguestfs. Even 10's of milliseconds make a difference here.
The commands being run in the sandbox can be very short-lived processes, executed reasonably frequently. The goal is to have end-to-end runtime overhead of < 2 seconds. This includes libvirt guest startup, qemu startup/shutdown, bios time, option ROM time, kernel boot & shutdown time.

The reason for using a kernel/initrd instead of a bootable ISO is that building an ISO requires time itself, and we need to be able to easily pass kernel boot arguments via -append.

I'm focusing on the last use case, and if the phase of the moon is correct, I can currently execute a sandbox command with a total overhead of 3.5 seconds (if using a compressed initrd), of which the QEMU execution time is 2.5 seconds.

Of this, 1.4 seconds is the time required by LinuxBoot to copy the kernel+initrd. If I used an uncompressed initrd, which I really want to, to avoid decompression overhead, this increases to ~1.7 seconds. So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40% of total sandbox execution overhead.

For comparison I also did a test building a bootable ISO using ISOLinux. This required 700 ms for the boot time, which is approximately half the time required for direct kernel/initrd boot. But you then have to add on the time required to build the ISO on every boot, to add custom kernel command line args. So while ISO is faster than LinuxBoot currently, there is still non-negligible overhead here that I want to avoid.

For further comparison I tested with Rich Jones' patches which add a DMA-like interface to fw_cfg. With this, the time spent in the LinuxBoot option ROM was as close to zero as matters.

So obviously, my preference is for -kernel/-initrd to be made very fast using the DMA-like patches, or any other patches which could achieve similarly high performance for -kernel/-initrd.

Regards,
Daniel
On Tue, Oct 11, 2011 at 09:23:15AM +0100, Daniel P. Berrange wrote:
> - libguestfs directly boots its appliance using the regular host's kernel image and a custom built initrd image. The initrd does not contain the entire appliance, just enough to boot up and dynamically read files in from the host OS on demand. This is a so-called "supermin appliance".
>
> The kernel is < 5 MB, while the initrd is approx 100 MB.
[...]

Actually this is how libguestfs used to work, but the performance of such a large initrd against the poor qemu implementation meant we had to abandon this approach. We now use -kernel ~5MB, a small -initrd ~1.1MB and a large ext2 format root disk (of course loaded on demand, which is better anyway).

Nevertheless any improvement in -kernel and -initrd load times would help us gain a few 1/10ths of seconds, which is still very important for us. Overall boot time is 3-4 seconds and we are often in a situation where we need to repeatedly boot the appliance.

Rich.
On 10/10/2011 09:01 PM, Alexander Graf wrote:
>>> For example, one execution of QEMU produced the following log:
>>>
>>> $ stap qemu-timing.stp
>>> 0.000 Start
>>> 0.036 Run
>>> 0.038 BIOS post
>>> 0.180 BIOS int 19
>>> 0.181 BIOS boot OS
>>> 0.181 LinuxBoot copy kernel
>>> 1.371 LinuxBoot copy initrd
>>
>> Yeah, there was a thread a bit ago about the performance of the interface to read the kernel/initrd. I think it was using single byte access instructions and there were patches to use string accessors instead? I can't remember where that thread ended up.
>
> IIRC we're already using string accessors, but are still slow. Richard had a nice patch cooked up to basically have the fw_cfg interface be able to DMA its data to the guest. I like the idea. Avi did not.
>
> And yes, bad -kernel performance does hurt in some workloads. A lot.

The rep/ins implementation is still slow, optimizing it can help. What does 'perf top' say when running this workload?
On 10/11/2011 10:23 AM, Daniel P. Berrange wrote:
> - Application sandbox, directly boots the regular host's kernel and a custom initrd image. The initrd does not contain any files except for the 9p kernel modules and a custom init binary, which mounts the guest root FS from a 9p filesystem export.
>
> The kernel is < 5 MB, while the initrd is approx 700 KB compressed, or 1.4 MB uncompressed. Performance for the sandbox is even more critical than for libguestfs. Even 10's of milliseconds make a difference here. The commands being run in the sandbox can be very short-lived processes, executed reasonably frequently. The goal is to have end-to-end runtime overhead of < 2 seconds. This includes libvirt guest startup, qemu startup/shutdown, bios time, option ROM time, kernel boot & shutdown time.
>
> The reason for using a kernel/initrd instead of a bootable ISO is that building an ISO requires time itself, and we need to be able to easily pass kernel boot arguments via -append.
>
> I'm focusing on the last use case, and if the phase of the moon is correct, I can currently execute a sandbox command with a total overhead of 3.5 seconds (if using a compressed initrd), of which the QEMU execution time is 2.5 seconds.
>
> Of this, 1.4 seconds is the time required by LinuxBoot to copy the kernel+initrd. If I used an uncompressed initrd, which I really want to, to avoid decompression overhead, this increases to ~1.7 seconds. So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40% of total sandbox execution overhead.

One thing we can do is boot a guest and immediately snapshot it, before it runs any application specific code. Subsequent invocations will MAP_PRIVATE the memory image and COW their way. This avoids the kernel initialization time as well.

> For comparison I also did a test building a bootable ISO using ISOLinux. This required 700 ms for the boot time, which is approximately half the time required for direct kernel/initrd boot. But you then have to add on the time required to build the ISO on every boot, to add custom kernel command line args. So while ISO is faster than LinuxBoot currently, there is still non-negligible overhead here that I want to avoid.

You can accept parameters from virtio-serial or some other channel. Is there any reason you need them specifically as *kernel* command line parameters?

> For further comparison I tested with Rich Jones' patches which add a DMA-like interface to fw_cfg. With this, the time spent in the LinuxBoot option ROM was as close to zero as matters.
>
> So obviously, my preference is for -kernel/-initrd to be made very fast using the DMA-like patches, or any other patches which could achieve similarly high performance for -kernel/-initrd.
On Tue, Oct 11, 2011 at 11:08:33AM +0200, Avi Kivity wrote:
> On 10/10/2011 09:01 PM, Alexander Graf wrote:
>>>> For example, one execution of QEMU produced the following log:
>>>>
>>>> $ stap qemu-timing.stp
>>>> 0.000 Start
>>>> 0.036 Run
>>>> 0.038 BIOS post
>>>> 0.180 BIOS int 19
>>>> 0.181 BIOS boot OS
>>>> 0.181 LinuxBoot copy kernel
>>>> 1.371 LinuxBoot copy initrd
>>>
>>> Yeah, there was a thread a bit ago about the performance of the interface to read the kernel/initrd. I think it was using single byte access instructions and there were patches to use string accessors instead? I can't remember where that thread ended up.
>>
>> IIRC we're already using string accessors, but are still slow. Richard had a nice patch cooked up to basically have the fw_cfg interface be able to DMA its data to the guest. I like the idea. Avi did not.
>>
>> And yes, bad -kernel performance does hurt in some workloads. A lot.
>
> The rep/ins implementation is still slow, optimizing it can help.
>
> What does 'perf top' say when running this workload?

To ensure it only recorded the LinuxBoot code, I created a 100 MB kernel image which takes approx 30 seconds to copy. Here is the perf output for approx 15 seconds of that copy:

 1906.00 15.0% read_hpet                       [kernel]
 1029.00  8.1% x86_emulate_insn                [kvm]
  863.00  6.8% test_cc                         [kvm]
  661.00  5.2% emulator_get_segment            [kvm]
  631.00  5.0% kvm_mmu_pte_write               [kvm]
  535.00  4.2% __linearize                     [kvm]
  431.00  3.4% do_raw_spin_lock                [kernel]
  356.00  2.8% vmx_get_segment                 [kvm_intel]
  330.00  2.6% vmx_segment_cache_test_set      [kvm_intel]
  308.00  2.4% segmented_write                 [kvm]
  291.00  2.3% vread_hpet                      [kernel].vsyscall_fn
  251.00  2.0% vmx_get_cpl                     [kvm_intel]
  230.00  1.8% trace_kvm_mmu_audit             [kvm]
  207.00  1.6% kvm_write_guest                 [kvm]
  199.00  1.6% emulator_write_emulated         [kvm]
  187.00  1.5% emulator_write_emulated_onepage [kvm]
  185.00  1.5% kvm_write_guest_page            [kvm]
  177.00  1.4% vmx_get_segment_base            [kvm_intel]
  158.00  1.2% fw_cfg_io_readb                 qemu-system-x86_64
  148.00  1.2% register_address_increment      [kvm]
  142.00  1.1% emulator_write_phys             [kvm]
  134.00  1.1% acpi_os_read_port               [kernel]

Daniel
On 11.10.2011, at 11:15, Avi Kivity wrote:
> On 10/11/2011 10:23 AM, Daniel P. Berrange wrote:
>> - Application sandbox, directly boots the regular host's kernel and a custom initrd image. The initrd does not contain any files except for the 9p kernel modules and a custom init binary, which mounts the guest root FS from a 9p filesystem export.
>>
>> The kernel is < 5 MB, while the initrd is approx 700 KB compressed, or 1.4 MB uncompressed. Performance for the sandbox is even more critical than for libguestfs. Even 10's of milliseconds make a difference here. The commands being run in the sandbox can be very short-lived processes, executed reasonably frequently. The goal is to have end-to-end runtime overhead of < 2 seconds. This includes libvirt guest startup, qemu startup/shutdown, bios time, option ROM time, kernel boot & shutdown time.
>>
>> The reason for using a kernel/initrd instead of a bootable ISO is that building an ISO requires time itself, and we need to be able to easily pass kernel boot arguments via -append.
>>
>> I'm focusing on the last use case, and if the phase of the moon is correct, I can currently execute a sandbox command with a total overhead of 3.5 seconds (if using a compressed initrd), of which the QEMU execution time is 2.5 seconds.
>>
>> Of this, 1.4 seconds is the time required by LinuxBoot to copy the kernel+initrd. If I used an uncompressed initrd, which I really want to, to avoid decompression overhead, this increases to ~1.7 seconds. So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40% of total sandbox execution overhead.
>
> One thing we can do is boot a guest and immediately snapshot it, before it runs any application specific code. Subsequent invocations will MAP_PRIVATE the memory image and COW their way. This avoids the kernel initialization time as well.

That doesn't allow modification of -append, and gets you in a pretty bizarre state when doing updates of your host files, since then you have 2 different paths: full boot and restore. That's yet another potential source for bugs.

>> For comparison I also did a test building a bootable ISO using ISOLinux. This required 700 ms for the boot time, which is approximately half the time required for direct kernel/initrd boot. But you then have to add on the time required to build the ISO on every boot, to add custom kernel command line args. So while ISO is faster than LinuxBoot currently, there is still non-negligible overhead here that I want to avoid.
>
> You can accept parameters from virtio-serial or some other channel. Is there any reason you need them specifically as *kernel* command line parameters?

That doesn't work for kernel parameters. It also means things would have to be rewritten needlessly. Sometimes we can't easily change the way parameters are passed into the guest either, for example when running a random (read: old, think of RHEL5) distro installation initrd.

And I don't see the point why we would have to shoot yet another hole into the guest just because we're unwilling to fix a perfectly valid interface that is currently horribly slow.

Alex
On 10/11/2011 11:19 AM, Alexander Graf wrote:
>>> Of this, 1.4 seconds is the time required by LinuxBoot to copy the kernel+initrd. If I used an uncompressed initrd, which I really want to, to avoid decompression overhead, this increases to ~1.7 seconds. So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40% of total sandbox execution overhead.
>>
>> One thing we can do is boot a guest and immediately snapshot it, before it runs any application specific code. Subsequent invocations will MAP_PRIVATE the memory image and COW their way. This avoids the kernel initialization time as well.
>
> That doesn't allow modification of -append

Is it really needed?

> and gets you in a pretty bizarre state when doing updates of your host files, since then you have 2 different paths: full boot and restore. That's yet another potential source for bugs.

Typically you'd check the timestamps to make sure you're running an up-to-date version.

>>> For comparison I also did a test building a bootable ISO using ISOLinux. This required 700 ms for the boot time, which is approximately half the time required for direct kernel/initrd boot. But you then have to add on the time required to build the ISO on every boot, to add custom kernel command line args. So while ISO is faster than LinuxBoot currently, there is still non-negligible overhead here that I want to avoid.
>>
>> You can accept parameters from virtio-serial or some other channel. Is there any reason you need them specifically as *kernel* command line parameters?
>
> That doesn't work for kernel parameters. It also means things would have to be rewritten needlessly. Sometimes we can't easily change the way parameters are passed into the guest either, for example when running a random (read: old, think of RHEL5) distro installation initrd.

This use case is not installation, it's for app sandboxing.

> And I don't see the point why we would have to shoot yet another hole into the guest just because we're unwilling to fix a perfectly valid interface that is currently horribly slow.

rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return. There's no need to add another interface, we should just optimize the existing one.
On Tue, Oct 11, 2011 at 11:15:05AM +0200, Avi Kivity wrote:
> On 10/11/2011 10:23 AM, Daniel P. Berrange wrote:
>> - Application sandbox, directly boots the regular host's kernel and a custom initrd image. The initrd does not contain any files except for the 9p kernel modules and a custom init binary, which mounts the guest root FS from a 9p filesystem export.
>>
>> The kernel is < 5 MB, while the initrd is approx 700 KB compressed, or 1.4 MB uncompressed. Performance for the sandbox is even more critical than for libguestfs. Even 10's of milliseconds make a difference here. The commands being run in the sandbox can be very short-lived processes, executed reasonably frequently. The goal is to have end-to-end runtime overhead of < 2 seconds. This includes libvirt guest startup, qemu startup/shutdown, bios time, option ROM time, kernel boot & shutdown time.
>>
>> The reason for using a kernel/initrd instead of a bootable ISO is that building an ISO requires time itself, and we need to be able to easily pass kernel boot arguments via -append.
>>
>> I'm focusing on the last use case, and if the phase of the moon is correct, I can currently execute a sandbox command with a total overhead of 3.5 seconds (if using a compressed initrd), of which the QEMU execution time is 2.5 seconds.
>>
>> Of this, 1.4 seconds is the time required by LinuxBoot to copy the kernel+initrd. If I used an uncompressed initrd, which I really want to, to avoid decompression overhead, this increases to ~1.7 seconds. So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40% of total sandbox execution overhead.
>
> One thing we can do is boot a guest and immediately snapshot it, before it runs any application specific code. Subsequent invocations will MAP_PRIVATE the memory image and COW their way. This avoids the kernel initialization time as well.

This is adding an awful lot of complexity to the process, just to avoid fixing a performance problem in QEMU. You can't even reliably snapshot in between the time of booting the kernel and running the app code, without having to write some kind of handshake between guest & the host app. You now also have the problem of figuring out when the snapshot has become invalid due to host OS software updates, which I explicitly wanted to avoid by *always* running the current software directly.

>> For comparison I also did a test building a bootable ISO using ISOLinux. This required 700 ms for the boot time, which is approximately half the time required for direct kernel/initrd boot. But you then have to add on the time required to build the ISO on every boot, to add custom kernel command line args. So while ISO is faster than LinuxBoot currently, there is still non-negligible overhead here that I want to avoid.
>
> You can accept parameters from virtio-serial or some other channel. Is there any reason you need them specifically as *kernel* command line parameters?

Well some of the parameters are actually kernel parameters :-) The rest are things I pass to the 'init' process which runs in the initrd. When this process first starts, the only things it can easily access are those built in to the kernel image, so data available from /proc or /sys like the /proc/cmdline file. It hasn't even loaded things like the virtio-serial or virtio-9pfs kernel modules at this point.

Daniel
On 10/11/2011 11:18 AM, Daniel P. Berrange wrote:
>>> The rep/ins implementation is still slow, optimizing it can help.
>>>
>>> What does 'perf top' say when running this workload?
>
> To ensure it only recorded the LinuxBoot code, I created a 100 MB kernel image which takes approx 30 seconds to copy. Here is the perf output for approx 15 seconds of that copy:
>
> 1906.00 15.0% read_hpet [kernel]

Recent kernels are very clock intensive...

> 1029.00 8.1% x86_emulate_insn [kvm]
> 863.00 6.8% test_cc [kvm]

test_cc() is weird - not called on this path at all.

> 661.00 5.2% emulator_get_segment [kvm]
> 631.00 5.0% kvm_mmu_pte_write [kvm]
> 535.00 4.2% __linearize [kvm]
> 431.00 3.4% do_raw_spin_lock [kernel]
> 356.00 2.8% vmx_get_segment [kvm_intel]
> 330.00 2.6% vmx_segment_cache_test_set [kvm_intel]
> 308.00 2.4% segmented_write [kvm]
> 291.00 2.3% vread_hpet [kernel].vsyscall_fn
> 251.00 2.0% vmx_get_cpl [kvm_intel]
> 230.00 1.8% trace_kvm_mmu_audit [kvm]
> 207.00 1.6% kvm_write_guest [kvm]
> 199.00 1.6% emulator_write_emulated [kvm]
> 187.00 1.5% emulator_write_emulated_onepage [kvm]
> 185.00 1.5% kvm_write_guest_page [kvm]
> 177.00 1.4% vmx_get_segment_base [kvm_intel]
> 158.00 1.2% fw_cfg_io_readb qemu-system-x86_64

This is where something gets done.

> 148.00 1.2% register_address_increment [kvm]
> 142.00 1.1% emulator_write_phys [kvm]

And here too. So 97.7% overhead, which could be reduced by a factor of 4096 if the code is made more rep-aware.
On 11.10.2011, at 11:26, Avi Kivity wrote:
> On 10/11/2011 11:19 AM, Alexander Graf wrote:
>>>> Of this, 1.4 seconds is the time required by LinuxBoot to copy the kernel+initrd. If I used an uncompressed initrd, which I really want to, to avoid decompression overhead, this increases to ~1.7 seconds. So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40% of total sandbox execution overhead.
>>>
>>> One thing we can do is boot a guest and immediately snapshot it, before it runs any application specific code. Subsequent invocations will MAP_PRIVATE the memory image and COW their way. This avoids the kernel initialization time as well.
>>
>> That doesn't allow modification of -append
>
> Is it really needed?

For our use case for example, yes. We pass the cifs user/pass using the kernel cmdline, so we can reuse existing initrd code and just mount it as root.

>> and gets you in a pretty bizarre state when doing updates of your host files, since then you have 2 different paths: full boot and restore. That's yet another potential source for bugs.
>
> Typically you'd check the timestamps to make sure you're running an up-to-date version.

Yes. That's why I said you end up with 2 different boot cases. Now imagine you get a bug once every 10000 bootups and try to trace down something that only happens in the non-resume case.

>>>> For comparison I also did a test building a bootable ISO using ISOLinux. This required 700 ms for the boot time, which is approximately half the time required for direct kernel/initrd boot. But you then have to add on the time required to build the ISO on every boot, to add custom kernel command line args. So while ISO is faster than LinuxBoot currently, there is still non-negligible overhead here that I want to avoid.
>>>
>>> You can accept parameters from virtio-serial or some other channel. Is there any reason you need them specifically as *kernel* command line parameters?
>>
>> That doesn't work for kernel parameters. It also means things would have to be rewritten needlessly. Sometimes we can't easily change the way parameters are passed into the guest either, for example when running a random (read: old, think of RHEL5) distro installation initrd.
>
> This use case is not installation, it's for app sandboxing.

I thought we were talking about plenty of different use cases here? I'm pretty sure there are even more out there that we haven't even thought about.

>> And I don't see the point why we would have to shoot yet another hole into the guest just because we're unwilling to fix a perfectly valid interface that is currently horribly slow.
>
> rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return. There's no need to add another interface, we should just optimize the existing one.

Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests, and in QEMU again have to call our callbacks in a loop.

I don't see where the problem is in admitting that we were wrong back then. The fw_cfg interface as it is is great for small config variables, but nobody sane would even consider using IDE without DMA these days, for example, because you're transferring bulk data. And that's exactly what we do in this case. We transfer bulk data.

However, I'll gladly see myself proven wrong with an awesomely fast rep/ins implementation that loads 100MB in < 1/10th of a second.

Alex
On 10/11/2011 11:27 AM, Daniel P. Berrange wrote: > > > > One thing we can do is boot a guest and immediately snapshot it, > > before it runs any application specific code. Subsequent > > invocations will MAP_PRIVATE the memory image and COW their way. > > This avoids the kernel initialization time as well. > > This is adding an awful lot of complexity to the process, just > to avoid fixing a performance problem in QEMU. The performance problem is in the host kernel, not qemu, and I'm certainly not against fixing it. I'm trying to see if we can optimize it even further to make it instantaneous. > You can't even > reliably snapshot in between the time of booting the kerenl and > running the app code, without having to write some kind of handshake > betwen guest& the host app. You now also have the problem of > figuring out when the snapshot has become invalid due to host OS > software updates, which I explicitly wanted to avoid by *always* > running the current software directly. Sure, it adds complexity, but the improvement may be worth it. > > > >For comparison I also did a test building a bootable ISO using ISOLinux. > > >This required 700 ms for the boot time, which is appoximately 1/2 the > > >time reqiured for direct kernel/initrd boot. But you have to then add > > >on time required to build the ISO on every boot, to add custom kernel > > >command line args. So while ISO is faster than LinuxBoot currently > > >there is still non-negligable overhead here that I want to avoid. > > > > You can accept parameters from virtio-serial or some other channel. > > Is there any reason you need them specifically as *kernel* command > > line parameters? > > Well some of the parameters are actually kernel parameters :-) The rest > are things I pass to the 'init' process which runs in the initrd. When > this process first starts the only things it can easily access are those > builtin to the kernel image, so data available from /proc or /sys like > the /proc/cmdline file. 
> >It hasn't even loaded things like the virtio-serial > >or virtio-9pfs kernel modules at this point. It could, if it wanted to. It's completely custom, yes?
On Tue, Oct 11, 2011 at 11:39:36AM +0200, Avi Kivity wrote: > On 10/11/2011 11:27 AM, Daniel P. Berrange wrote: > >> >For comparison I also did a test building a bootable ISO using ISOLinux. > >> >This required 700 ms for the boot time, which is approximately 1/2 the > >> >time required for direct kernel/initrd boot. But you have to then add > >> >on time required to build the ISO on every boot, to add custom kernel > >> >command line args. So while ISO is faster than LinuxBoot currently > >> >there is still non-negligible overhead here that I want to avoid. > >> > >> You can accept parameters from virtio-serial or some other channel. > >> Is there any reason you need them specifically as *kernel* command > >> line parameters? > > > >Well some of the parameters are actually kernel parameters :-) The rest > >are things I pass to the 'init' process which runs in the initrd. When > >this process first starts the only things it can easily access are those > >builtin to the kernel image, so data available from /proc or /sys like > >the /proc/cmdline file. It hasn't even loaded things like the virtio-serial > >or virtio-9pfs kernel modules at this point. > > > > It could, if it wanted to. It's completely custom, yes? I'm thinking primarily about debug-related parameters, which need to be used as soon as the process starts, not delayed until after we've loaded kernel modules, at which point the step we wanted to debug is already past. Daniel
On 10/11/2011 11:38 AM, Alexander Graf wrote: > > > >> and gets you in a pretty bizarre state when doing updates of your host files, since then you have 2 different paths: full boot and restore. That's yet another potential source for bugs. > > > > Typically you'd check the timestamps to make sure you're running an up-to-date version. > > Yes. That's why I said you end up with 2 different boot cases. Now imagine you get a bug once every 10000 bootups and try to trace that down that it only happens when running in the non-resume case. That's life in virt land. If you want nice repeatable bugs write single threaded Python. > > > >> > >> > > >> >> > >> >> For comparison I also did a test building a bootable ISO using ISOLinux. > >> >> This required 700 ms for the boot time, which is appoximately 1/2 the > >> >> time reqiured for direct kernel/initrd boot. But you have to then add > >> >> on time required to build the ISO on every boot, to add custom kernel > >> >> command line args. So while ISO is faster than LinuxBoot currently > >> >> there is still non-negligable overhead here that I want to avoid. > >> > > >> > You can accept parameters from virtio-serial or some other channel. Is there any reason you need them specifically as *kernel* command line parameters? > >> > >> That doesn't work for kernel parameters. It also means things would have to be rewritten needlessly. Some times we can't easily change the way parameters are passed into the guest either, for example when running a random (read: old, think of RHEL5) distro installation initrd. > > > > This use case is not installation, it's for app sandboxing. > > I thought we were talking about plenty different use cases here? I'm pretty sure there are even more out there that we haven't even thought about. I'm talking about the case he mentioned, not every possible use case. Usually booting an ISO image is best since it only loads on demand. 
> > > > >> And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow. > > > > rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return. There's no need to add another interface, we should just optimize the existing one. > > Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop. We can batch per page, which makes the overhead negligible. > I don't see where the problem is in admitting that we were wrong back then. The fw_cfg interface as it is is great for small config variables, but nobody sane would even consider using IDE without DMA these days for example, because you're transferring bulk data. And that's exactly what we do in this case. We transfer bulk data. > > However, I'll gladly see myself proven wrong with an awesomely fast rep/ins implementation that loads 100MB in< 1/10th of a second. > 100 MB in 100 ms gives us 1 GB/s, or 4 us per page. I'm not sure we can get exactly there, but pretty close.
On 10/11/2011 11:49 AM, Daniel P. Berrange wrote: > On Tue, Oct 11, 2011 at 11:39:36AM +0200, Avi Kivity wrote: > > On 10/11/2011 11:27 AM, Daniel P. Berrange wrote: > > >> >For comparison I also did a test building a bootable ISO using ISOLinux. > > >> >This required 700 ms for the boot time, which is appoximately 1/2 the > > >> >time reqiured for direct kernel/initrd boot. But you have to then add > > >> >on time required to build the ISO on every boot, to add custom kernel > > >> >command line args. So while ISO is faster than LinuxBoot currently > > >> >there is still non-negligable overhead here that I want to avoid. > > >> > > >> You can accept parameters from virtio-serial or some other channel. > > >> Is there any reason you need them specifically as *kernel* command > > >> line parameters? > > > > > >Well some of the parameters are actually kernel parameters :-) The rest > > >are things I pass to the 'init' process which runs in the initrd. When > > >this process first starts the only things it can easily access are those > > >builtin to the kernel image, so data available from /proc or /sys like > > >the /proc/cmdline file. It hasn't even loaded things like the virtio-serial > > >or virtio-9pfs kernel modules at this point. > > > > > > > It could, if it wanted to. It's completely custom, yes? > > I'm thinking primarily about debug related parameters, which need to be > used as soon the process starts, not delayed until after we've loaded > kernel modules at which point the step we wanted to debug is already > past. Ah, so there's no issue in regenerating the image if you want to debug.
On Tue, Oct 11, 2011 at 11:26:14AM +0200, Avi Kivity wrote: > rep/ins is exactly like dma+wait for this use case: provide an > address, get a memory image in return. There's no need to add > another interface, we should just optimize the existing one. > rep/ins cannot be optimized to be as efficient as dma and remain correct at the same time. There are various corner cases that a simplified "fast" implementation will likely miss, like DF flag settings, delaying interrupts for too long, or doing ins/outs to/from iomem (this may not be a big problem unless userspace finds a way to trigger it). There are still ways the current implementation can be optimized, though. But loading MBs of data through the fw_cfg interface is just abusing it. You wouldn't use PIO on real HW to move megabytes of data and expect good performance. -- Gleb.
On 10/11/2011 11:50 AM, Gleb Natapov wrote: > On Tue, Oct 11, 2011 at 11:26:14AM +0200, Avi Kivity wrote: > > rep/ins is exactly like dma+wait for this use case: provide an > > address, get a memory image in return. There's no need to add > > another interface, we should just optimize the existing one. > > > rep/ins cannot be optimized to be as efficient as dma and remain to > be correct at the same time. There are various corner cases that > simplified "fast" implementation will likely miss. Like DF flag > settings, delaying interrupts for too much, doing ins/outs to/from > iomem (this is may be not a big problem unless userspace finds a way > to trigger it). There are ways that current implementation can be > optimized still though. These can all go through the slow path, except interrupts, which need to be checked after every access. > But loading MBs of data through fw_cfg interface is just abusing it. > You wouldn't use pio on real HW to move megabytes of data and expect > good performance. True, this is a point in favour of a true dma interface.
On Tue, Oct 11, 2011 at 11:49:16AM +0200, Avi Kivity wrote: > >Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop. > > We can batch per page, which makes the overhead negligible. > Current code batches a userspace exit per 1024 bytes IIRC, and changing it to a page didn't show significant improvement (also IIRC). But after the io data is copied into the kernel, the emulator processes it byte by byte. A possible optimization, which I didn't try, is to check that the destination memory is not mmio and write back the whole buffer at once in that case. -- Gleb.
On 10/11/2011 11:56 AM, Gleb Natapov wrote: > On Tue, Oct 11, 2011 at 11:49:16AM +0200, Avi Kivity wrote: > > >Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop. > > > > We can batch per page, which makes the overhead negligible. > > > Current code batch userspace exit per 1024 bytes IIRC and changing it to > page didn't show significant improvement (also IIRC). But after io data > is copied into the kernel emulator process it byte by byte. Possible > optimization, which I didn't tried, is to check that destination memory is > not mmio and write back the whole buffer if it is the case. > All the permission checks, segment checks, register_address_increment, page table walking, can be done per page. Right now they are done per byte. btw Intel also made this optimization, current processors copy complete cache lines instead of bytes, so they probably also do the checks just once.
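The per-page batching Avi describes can be sketched as follows. This is a hypothetical, self-contained illustration, not QEMU's or the kernel emulator's actual code; all names are made up:

```c
/* Sketch: batch a rep/ins-style transfer per page instead of per byte,
 * so the expensive per-iteration work (permission/mmio/page-walk checks,
 * kvm/user exits) happens once per page.  Hypothetical code. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

static char backing[3 * PAGE_SIZE]; /* stands in for the fw_cfg blob */
static size_t src_pos;              /* device read cursor */
static int exits;                   /* counts simulated kvm/user exits */

/* One simulated userspace exit delivering up to a page of data. */
static size_t device_read(char *buf, size_t len)
{
    exits++;
    memcpy(buf, backing + src_pos, len);
    src_pos += len;
    return len;
}

static int dest_is_mmio(size_t gpa) { (void)gpa; return 0; } /* plain RAM */

/* Copy 'count' bytes to guest address 'gpa', batching the checks per
 * page.  A real emulator would also recheck pending interrupts at each
 * page boundary, and fall back to the byte-wise slow path for the
 * corner cases Gleb lists (DF set, iomem destinations, ...). */
static size_t rep_ins_batched(char *dst, size_t gpa, size_t count)
{
    size_t done = 0;
    while (done < count) {
        size_t chunk = PAGE_SIZE - (gpa + done) % PAGE_SIZE;
        if (chunk > count - done)
            chunk = count - done;
        if (dest_is_mmio(gpa + done))
            break;                  /* bail out to the slow path */
        done += device_read(dst + done, chunk);
        /* page boundary: recheck interrupts here in a real emulator */
    }
    return done;
}
```

A transfer of ~2.5 pages then costs 3 simulated exits instead of ~8000 per-byte iterations, while the mmio check and the (elided) interrupt recheck still run at every page boundary.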
On Tue, Oct 11, 2011 at 11:50:01AM +0200, Avi Kivity wrote: > On 10/11/2011 11:49 AM, Daniel P. Berrange wrote: > >On Tue, Oct 11, 2011 at 11:39:36AM +0200, Avi Kivity wrote: > >> On 10/11/2011 11:27 AM, Daniel P. Berrange wrote: > >> >> >For comparison I also did a test building a bootable ISO using ISOLinux. > >> >> >This required 700 ms for the boot time, which is appoximately 1/2 the > >> >> >time reqiured for direct kernel/initrd boot. But you have to then add > >> >> >on time required to build the ISO on every boot, to add custom kernel > >> >> >command line args. So while ISO is faster than LinuxBoot currently > >> >> >there is still non-negligable overhead here that I want to avoid. > >> >> > >> >> You can accept parameters from virtio-serial or some other channel. > >> >> Is there any reason you need them specifically as *kernel* command > >> >> line parameters? > >> > > >> >Well some of the parameters are actually kernel parameters :-) The rest > >> >are things I pass to the 'init' process which runs in the initrd. When > >> >this process first starts the only things it can easily access are those > >> >builtin to the kernel image, so data available from /proc or /sys like > >> >the /proc/cmdline file. It hasn't even loaded things like the virtio-serial > >> >or virtio-9pfs kernel modules at this point. > >> > > >> > >> It could, if it wanted to. It's completely custom, yes? > > > >I'm thinking primarily about debug related parameters, which need to be > >used as soon the process starts, not delayed until after we've loaded > >kernel modules at which point the step we wanted to debug is already > >past. > > Ah, so there's no issue in regenerating the image if you want to debug. Compared to just altering the -append arg to QEMU, rebuilding the initrd image and init program is a PITA. Daniel
On Tue, Oct 11, 2011 at 11:59:45AM +0200, Avi Kivity wrote: > On 10/11/2011 11:56 AM, Gleb Natapov wrote: > >On Tue, Oct 11, 2011 at 11:49:16AM +0200, Avi Kivity wrote: > >> >Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop. > >> > >> We can batch per page, which makes the overhead negligible. > >> > >Current code batch userspace exit per 1024 bytes IIRC and changing it to > >page didn't show significant improvement (also IIRC). But after io data > >is copied into the kernel emulator process it byte by byte. Possible > >optimization, which I didn't tried, is to check that destination memory is > >not mmio and write back the whole buffer if it is the case. > > > > All the permission checks, segment checks, > register_address_increment, page table walking, can be done per > page. Right now they are done per byte. > The permission checking result is cached in ctxt->perm_ok. I see that the current code checks it after several function calls, but this was not the case before. All the other checks are currently done for each iteration. By writing back a whole buffer at once we eliminate those too. It would be interesting to see how much that improves the situation. > btw Intel also made this optimization, current processors copy complete cache lines instead of bytes, so they probably also do the checks just once. > > -- > error compiling committee.c: too many arguments to function -- Gleb.
On 10/11/2011 04:38 AM, Alexander Graf wrote: > > On 11.10.2011, at 11:26, Avi Kivity wrote: > >> On 10/11/2011 11:19 AM, Alexander Graf wrote: >>>>> >>>>> Of this, 1.4 seconds is the time required by LinuxBoot to copy the >>>>> kernel+initrd. If I used an uncompressed initrd, which I really want >>>>> to, to avoid decompression overhead, this increases to ~1.7 seconds. >>>>> So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40% >>>>> of total sandbox execution overhead. >>>> >>>> One thing we can do is boot a guest and immediately snapshot it, before it runs any application specific code. Subsequent invocations will MAP_PRIVATE the memory image and COW their way. This avoids the kernel initialization time as well. >>> >>> That doesn't allow modification of -append >> >> Is it really needed? > > For our use case for example yes. We pass the cifs user/pass using the kernel cmdline, so we can reuse existing initrd code and just mount it as root. > >> >>> and gets you in a pretty bizarre state when doing updates of your host files, since then you have 2 different paths: full boot and restore. That's yet another potential source for bugs. >> >> Typically you'd check the timestamps to make sure you're running an up-to-date version. > > Yes. That's why I said you end up with 2 different boot cases. Now imagine you get a bug once every 10000 bootups and try to trace that down that it only happens when running in the non-resume case. > >> >>> >>>> >>>>> >>>>> For comparison I also did a test building a bootable ISO using ISOLinux. >>>>> This required 700 ms for the boot time, which is appoximately 1/2 the >>>>> time reqiured for direct kernel/initrd boot. But you have to then add >>>>> on time required to build the ISO on every boot, to add custom kernel >>>>> command line args. So while ISO is faster than LinuxBoot currently >>>>> there is still non-negligable overhead here that I want to avoid. 
>>>>>> there is still non-negligible overhead here that I want to avoid.
>>>> >>>> You can accept parameters from virtio-serial or some other channel. Is there any reason you need them specifically as *kernel* command line parameters? >>> >>> That doesn't work for kernel parameters. It also means things would have to be rewritten needlessly. Some times we can't easily change the way parameters are passed into the guest either, for example when running a random (read: old, think of RHEL5) distro installation initrd. >> >> This use case is not installation, it's for app sandboxing. > > I thought we were talking about plenty different use cases here? I'm pretty sure there are even more out there that we haven't even thought about. > >> >>> And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow. >> >> rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return. There's no need to add another interface, we should just optimize the existing one. > > Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop. rep/ins is effectively equivalent to DMA except in how it's handled within QEMU. Regards, Anthony Liguori > > I don't see where the problem is in admitting that we were wrong back then. The fw_cfg interface as it is is great for small config variables, but nobody sane would even consider using IDE without DMA these days for example, because you're transferring bulk data. And that's exactly what we do in this case. We transfer bulk data. > > However, I'll gladly see myself proven wrong with an awesomely fast rep/ins implementation that loads 100MB in< 1/10th of a second. > > > Alex > >
On 11.10.2011, at 15:12, Anthony Liguori wrote: > On 10/11/2011 04:38 AM, Alexander Graf wrote: >> >> On 11.10.2011, at 11:26, Avi Kivity wrote: >> >>> On 10/11/2011 11:19 AM, Alexander Graf wrote: >>>>>> >>>>>> Of this, 1.4 seconds is the time required by LinuxBoot to copy the >>>>>> kernel+initrd. If I used an uncompressed initrd, which I really want >>>>>> to, to avoid decompression overhead, this increases to ~1.7 seconds. >>>>>> So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40% >>>>>> of total sandbox execution overhead. >>>>> >>>>> One thing we can do is boot a guest and immediately snapshot it, before it runs any application specific code. Subsequent invocations will MAP_PRIVATE the memory image and COW their way. This avoids the kernel initialization time as well. >>>> >>>> That doesn't allow modification of -append >>> >>> Is it really needed? >> >> For our use case for example yes. We pass the cifs user/pass using the kernel cmdline, so we can reuse existing initrd code and just mount it as root. >> >>> >>>> and gets you in a pretty bizarre state when doing updates of your host files, since then you have 2 different paths: full boot and restore. That's yet another potential source for bugs. >>> >>> Typically you'd check the timestamps to make sure you're running an up-to-date version. >> >> Yes. That's why I said you end up with 2 different boot cases. Now imagine you get a bug once every 10000 bootups and try to trace that down that it only happens when running in the non-resume case. >> >>> >>>> >>>>> >>>>>> >>>>>> For comparison I also did a test building a bootable ISO using ISOLinux. >>>>>> This required 700 ms for the boot time, which is appoximately 1/2 the >>>>>> time reqiured for direct kernel/initrd boot. But you have to then add >>>>>> on time required to build the ISO on every boot, to add custom kernel >>>>>> command line args. 
>>>>>> So while ISO is faster than LinuxBoot currently >>>>>> there is still non-negligible overhead here that I want to avoid. >>>>> >>>>> You can accept parameters from virtio-serial or some other channel. Is there any reason you need them specifically as *kernel* command line parameters? >>>> >>>> That doesn't work for kernel parameters. It also means things would have to be rewritten needlessly. Sometimes we can't easily change the way parameters are passed into the guest either, for example when running a random (read: old, think of RHEL5) distro installation initrd. >>> >>> This use case is not installation, it's for app sandboxing. >> >> I thought we were talking about plenty different use cases here? I'm pretty sure there are even more out there that we haven't even thought about. >> >>> >>>> And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow. >>> >>> rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return. There's no need to add another interface, we should just optimize the existing one. >> >> Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop. > > rep/ins is effectively equivalent to DMA except in how it's handled within QEMU. No, DMA has a lot bigger granularities in kvm/user interaction. We can easily DMA a 50MB region with a single kvm/user exit. For PIO we can at most do page granularity. Alex
On 10/11/2011 04:55 AM, Avi Kivity wrote: > On 10/11/2011 11:50 AM, Gleb Natapov wrote: >> On Tue, Oct 11, 2011 at 11:26:14AM +0200, Avi Kivity wrote: >> > rep/ins is exactly like dma+wait for this use case: provide an >> > address, get a memory image in return. There's no need to add >> > another interface, we should just optimize the existing one. >> > >> rep/ins cannot be optimized to be as efficient as dma and remain to >> be correct at the same time. There are various corner cases that >> simplified "fast" implementation will likely miss. Like DF flag >> settings, delaying interrupts for too much, doing ins/outs to/from >> iomem (this is may be not a big problem unless userspace finds a way >> to trigger it). There are ways that current implementation can be >> optimized still though. > > These can all go through the slow path, except interrupts, which need to be > checked after every access. > >> But loading MBs of data through fw_cfg interface is just abusing it. >> You wouldn't use pio on real HW to move megabytes of data and expect >> good performance. > > True, this is a point in favour of a true dma interface. Doing kernel loading through fw_cfg has always been a bit ugly. A better approach would be to implement a PCI device with a ROM bar that contained an option ROM that read additional bars from the device to get at the kernel and initrd. That also enables some potentially interesting models like having the additional bars be optionally persisted letting a user have direct control over which kernel/initrds were loaded. It's essentially a PCI device with a flash chip on it that contains a kernel/initrd. Regards, Anthony Liguori >
On Tue, Oct 11, 2011 at 03:14:52PM +0200, Alexander Graf wrote: > > rep/ins is effectively equivalent to DMA except in how it's handled within QEMU. > > No, DMA has a lot bigger granularities in kvm/user interaction. We can easily DMA a 50MB region with a single kvm/user exit. For PIO we can at most do page granularity. > Not only granularity, but double copying too. Maybe Anthony is referring to real HW, in which case I also do not see how it can be true, since one operation is synchronous and the other is not. -- Gleb.
On 10/11/2011 08:14 AM, Alexander Graf wrote: >>>>> And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow. >>>> >>>> rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return. There's no need to add another interface, we should just optimize the existing one. >>> >>> Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop. >> >> rep/ins is effectively equivalent to DMA except in how it's handled within QEMU. > > No, DMA has a lot bigger granularities in kvm/user interaction. We can easily DMA a 50MB region with a single kvm/user exit. For PIO we can at most do page granularity. So make a proper PCI device for kernel loading. It's a much more natural approach, and let's alias -kernel/-initrd/-append to -device kernel-pci,kernel=PATH,initrd=PATH Regards, Anthony Liguori > > Alex > >
On Tue, Oct 11, 2011 at 08:17:28AM -0500, Anthony Liguori wrote: > On 10/11/2011 04:55 AM, Avi Kivity wrote: > >On 10/11/2011 11:50 AM, Gleb Natapov wrote: > >>On Tue, Oct 11, 2011 at 11:26:14AM +0200, Avi Kivity wrote: > >>> rep/ins is exactly like dma+wait for this use case: provide an > >>> address, get a memory image in return. There's no need to add > >>> another interface, we should just optimize the existing one. > >>> > >>rep/ins cannot be optimized to be as efficient as dma and remain to > >>be correct at the same time. There are various corner cases that > >>simplified "fast" implementation will likely miss. Like DF flag > >>settings, delaying interrupts for too much, doing ins/outs to/from > >>iomem (this is may be not a big problem unless userspace finds a way > >>to trigger it). There are ways that current implementation can be > >>optimized still though. > > > >These can all go through the slow path, except interrupts, which need to be > >checked after every access. > > > >>But loading MBs of data through fw_cfg interface is just abusing it. > >>You wouldn't use pio on real HW to move megabytes of data and expect > >>good performance. > > > >True, this is a point in favour of a true dma interface. > > Doing kernel loading through fw_cfg has always been a bit ugly. > > A better approach would be to implement a PCI device with a ROM bar > that contained an option ROM that read additional bars from the > device to get at the kernel and initrd. I thought about this too. But the initrd sizes people are mentioning here are crazy. We can run out of PCI space very quickly. We could implement one of the BARs as a sliding window into the initrd, though. > > That also enables some potentially interesting models like having > the additional bars be optionally persisted letting a user have > direct control over which kernel/initrds were loaded. It's > essentially a PCI device with a flash chip on it that contains a > kernel/initrd. > -- Gleb.
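Gleb's sliding-window idea can be pictured as a small BAR plus a window-select register that chooses which chunk of an arbitrarily large initrd the BAR currently exposes. A hypothetical sketch with made-up names and sizes, not a real QEMU device model:

```c
/* Sliding-window BAR sketch: the BAR stays small; the guest slides a
 * window register to page through a large host-side image.  All names
 * and sizes here are invented for illustration. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_SIZE (64 * 1024)         /* size of the (small) BAR */

static uint8_t initrd[4 * WINDOW_SIZE]; /* large image on the host side */
static uint32_t window_sel;             /* guest-programmed window index */

/* Guest writes the window-select register to slide the window. */
static void bar_set_window(uint32_t sel) { window_sel = sel; }

/* Guest reads 'len' bytes at 'offset' within the current window. */
static size_t bar_read(uint8_t *dst, uint32_t offset, size_t len)
{
    if (offset >= WINDOW_SIZE)
        return 0;                       /* outside the BAR */
    if (len > WINDOW_SIZE - offset)
        len = WINDOW_SIZE - offset;     /* clamp to the window */
    uint64_t base = (uint64_t)window_sel * WINDOW_SIZE + offset;
    if (base + len > sizeof initrd)
        return 0;                       /* past the end of the image */
    memcpy(dst, initrd + base, len);
    return len;
}
```

The guest-visible footprint stays constant (one small BAR plus a register) no matter how large the initrd grows, which is the point of the suggestion.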
On 10/11/2011 03:19 PM, Anthony Liguori wrote: >> No, DMA has a lot bigger granularities in kvm/user interaction. We >> can easily DMA a 50MB region with a single kvm/user exit. For PIO we >> can at most do page granularity. > > > So make a proper PCI device for kernel loading. It's a much more > natural approach and let's use alias -kernel/-initrd/-append to > -device kernel-pci,kernel=PATH,initrd=PATH This is overkill. First let's optimize rep/movs before introducing any more interfaces. If that doesn't work, then we can have a dma interface for fwcfg. But a new pci device?
On Tue, Oct 11, 2011 at 03:23:36PM +0200, Avi Kivity wrote: > On 10/11/2011 03:19 PM, Anthony Liguori wrote: > >> No, DMA has a lot bigger granularities in kvm/user interaction. We > >> can easily DMA a 50MB region with a single kvm/user exit. For PIO we > >> can at most do page granularity. > > > > > > So make a proper PCI device for kernel loading. It's a much more > > natural approach and let's use alias -kernel/-initrd/-append to > > -device kernel-pci,kernel=PATH,initrd=PATH > > This is overkill. First let's optimize rep/movs before introducing any > more interfaces. If that doesn't work, then we can have a dma interface > for fwcfg. But a new pci device? > We can hot unplug it right after the boot :) -- Gleb.
On 10/11/2011 08:23 AM, Avi Kivity wrote: > On 10/11/2011 03:19 PM, Anthony Liguori wrote: >>> No, DMA has a lot bigger granularities in kvm/user interaction. We >>> can easily DMA a 50MB region with a single kvm/user exit. For PIO we >>> can at most do page granularity. >> >> >> So make a proper PCI device for kernel loading. It's a much more >> natural approach and let's use alias -kernel/-initrd/-append to >> -device kernel-pci,kernel=PATH,initrd=PATH > > This is overkill. First let's optimize rep/movs before introducing any > more interfaces. If that doesn't work, then we can have a dma interface > for fwcfg. But a new pci device? This is how it would work on bare metal. Why is a PCI device overkill compared to a dma interface for fwcfg? If we're adding dma to fwcfg, then fwcfg has become far too complex for its intended purpose. Regards, Anthony Liguori >
On 10/11/2011 03:29 PM, Anthony Liguori wrote: > On 10/11/2011 08:23 AM, Avi Kivity wrote: >> On 10/11/2011 03:19 PM, Anthony Liguori wrote: >>>> No, DMA has a lot bigger granularities in kvm/user interaction. We >>>> can easily DMA a 50MB region with a single kvm/user exit. For PIO we >>>> can at most do page granularity. >>> >>> >>> So make a proper PCI device for kernel loading. It's a much more >>> natural approach and let's use alias -kernel/-initrd/-append to >>> -device kernel-pci,kernel=PATH,initrd=PATH >> >> This is overkill. First let's optimize rep/movs before introducing any >> more interfaces. If that doesn't work, then we can have a dma interface >> for fwcfg. But a new pci device? > > This is how it would work on bare metal. Why is a PCI device overkill > compared to a dma interface for fwcfg? > Because it's a limited use case, despite all the talk around it. > If we're adding dma to fwcfg, then fwcfg has become far too complex > for it's intended purpose. > I have to agree to that. btw, -net nic,model=virtio -net user is an internal DMA interface we already have. We can boot from it. Why not use it?
On 10/11/2011 08:45 AM, Avi Kivity wrote: > On 10/11/2011 03:29 PM, Anthony Liguori wrote: >> On 10/11/2011 08:23 AM, Avi Kivity wrote: >>> On 10/11/2011 03:19 PM, Anthony Liguori wrote: >>>>> No, DMA has a lot bigger granularities in kvm/user interaction. We >>>>> can easily DMA a 50MB region with a single kvm/user exit. For PIO we >>>>> can at most do page granularity. >>>> >>>> >>>> So make a proper PCI device for kernel loading. It's a much more >>>> natural approach and let's use alias -kernel/-initrd/-append to >>>> -device kernel-pci,kernel=PATH,initrd=PATH >>> >>> This is overkill. First let's optimize rep/movs before introducing any >>> more interfaces. If that doesn't work, then we can have a dma interface >>> for fwcfg. But a new pci device? >> >> This is how it would work on bare metal. Why is a PCI device overkill >> compared to a dma interface for fwcfg? >> > > Because it's a limited use case, despite all the talk around it. > >> If we're adding dma to fwcfg, then fwcfg has become far too complex >> for its intended purpose. >> > > I have to agree to that. > > btw, -net nic,model=virtio -net user is an internal DMA interface we > already have. We can boot from it. Why not use it? tftp over slirp is probably slower than fwcfg. It has been every time I've looked. Regards, Anthony Liguori >
On Tue, Oct 11, 2011 at 08:19:14AM -0500, Anthony Liguori wrote: > On 10/11/2011 08:14 AM, Alexander Graf wrote: > >>>>>And I don't see the point why we would have to shoot yet another hole into the guest just because we're too unwilling to make an interface that's perfectly valid horribly slow. > >>>> > >>>>rep/ins is exactly like dma+wait for this use case: provide an address, get a memory image in return. There's no need to add another interface, we should just optimize the existing one. > >>> > >>>Whatever we do, the interface will never be as fast as DMA. We will always have to do sanity / permission checks for every IO operation, can batch up only so many IO requests and in QEMU again have to call our callbacks in a loop. > >> > >>rep/ins is effectively equivalent to DMA except in how it's handled within QEMU. > > > >No, DMA has a lot bigger granularities in kvm/user interaction. We can easily DMA a 50MB region with a single kvm/user exit. For PIO we can at most do page granularity. > > So make a proper PCI device for kernel loading. It's a much more > natural approach and let's use alias -kernel/-initrd/-append to > -device kernel-pci,kernel=PATH,initrd=PATH Adding a PCI device doesn't sound very appealing, unless you can guarantee it is never visible to the guest once LinuxBoot has finished its dirty work, so mgmt apps don't have to worry about PCI addressing wrt guest ABI. Daniel
On 10/11/2011 09:01 AM, Daniel P. Berrange wrote:
> On Tue, Oct 11, 2011 at 08:19:14AM -0500, Anthony Liguori wrote:
>> On 10/11/2011 08:14 AM, Alexander Graf wrote:
>>>>>>> And I don't see the point why we would have to shoot yet another
>>>>>>> hole into the guest just because we're too unwilling to make an
>>>>>>> interface that's perfectly valid horribly slow.
>>>>>>
>>>>>> rep/ins is exactly like dma+wait for this use case: provide an
>>>>>> address, get a memory image in return. There's no need to add
>>>>>> another interface, we should just optimize the existing one.
>>>>>
>>>>> Whatever we do, the interface will never be as fast as DMA. We will
>>>>> always have to do sanity / permission checks for every IO operation,
>>>>> can batch up only so many IO requests and in QEMU again have to
>>>>> call our callbacks in a loop.
>>>>
>>>> rep/ins is effectively equivalent to DMA except in how it's handled
>>>> within QEMU.
>>>
>>> No, DMA has a lot bigger granularities in kvm/user interaction. We
>>> can easily DMA a 50MB region with a single kvm/user exit. For PIO we
>>> can at most do page granularity.
>>
>> So make a proper PCI device for kernel loading. It's a much more
>> natural approach and let's alias -kernel/-initrd/-append to
>> -device kernel-pci,kernel=PATH,initrd=PATH
>
> Adding a PCI device doesn't sound very appealing, unless you can
> guarantee it is never visible to the guest once LinuxBoot has finished
> its dirty work, so mgmt apps don't have to worry about PCI addressing
> wrt guest ABI.

It'll definitely be guest visible just like fwcfg is guest visible.

Regards,

Anthony Liguori
On 11.10.2011, at 16:33, Anthony Liguori wrote:
> On 10/11/2011 09:01 AM, Daniel P. Berrange wrote:
>> On Tue, Oct 11, 2011 at 08:19:14AM -0500, Anthony Liguori wrote:
>>> On 10/11/2011 08:14 AM, Alexander Graf wrote:
>>>>>>>> And I don't see the point why we would have to shoot yet another
>>>>>>>> hole into the guest just because we're too unwilling to make an
>>>>>>>> interface that's perfectly valid horribly slow.
>>>>>>>
>>>>>>> rep/ins is exactly like dma+wait for this use case: provide an
>>>>>>> address, get a memory image in return. There's no need to add
>>>>>>> another interface, we should just optimize the existing one.
>>>>>>
>>>>>> Whatever we do, the interface will never be as fast as DMA. We
>>>>>> will always have to do sanity / permission checks for every IO
>>>>>> operation, can batch up only so many IO requests and in QEMU
>>>>>> again have to call our callbacks in a loop.
>>>>>
>>>>> rep/ins is effectively equivalent to DMA except in how it's
>>>>> handled within QEMU.
>>>>
>>>> No, DMA has a lot bigger granularities in kvm/user interaction. We
>>>> can easily DMA a 50MB region with a single kvm/user exit. For PIO
>>>> we can at most do page granularity.
>>>
>>> So make a proper PCI device for kernel loading. It's a much more
>>> natural approach and let's alias -kernel/-initrd/-append to
>>> -device kernel-pci,kernel=PATH,initrd=PATH
>>
>> Adding a PCI device doesn't sound very appealing, unless you can
>> guarantee it is never visible to the guest once LinuxBoot has
>> finished its dirty work,
>
> It'll definitely be guest visible just like fwcfg is guest visible.

Yup, just that this time it eats up one of our precious PCI slots ;)

So far it's the best proposal I've heard though.

Alex
On Tue, Oct 11, 2011 at 09:33:49AM -0500, Anthony Liguori wrote:
> On 10/11/2011 09:01 AM, Daniel P. Berrange wrote:
>> On Tue, Oct 11, 2011 at 08:19:14AM -0500, Anthony Liguori wrote:
>>> On 10/11/2011 08:14 AM, Alexander Graf wrote:
>>>>>>>> And I don't see the point why we would have to shoot yet another
>>>>>>>> hole into the guest just because we're too unwilling to make an
>>>>>>>> interface that's perfectly valid horribly slow.
>>>>>>>
>>>>>>> rep/ins is exactly like dma+wait for this use case: provide an
>>>>>>> address, get a memory image in return. There's no need to add
>>>>>>> another interface, we should just optimize the existing one.
>>>>>>
>>>>>> Whatever we do, the interface will never be as fast as DMA. We
>>>>>> will always have to do sanity / permission checks for every IO
>>>>>> operation, can batch up only so many IO requests and in QEMU
>>>>>> again have to call our callbacks in a loop.
>>>>>
>>>>> rep/ins is effectively equivalent to DMA except in how it's
>>>>> handled within QEMU.
>>>>
>>>> No, DMA has a lot bigger granularities in kvm/user interaction. We
>>>> can easily DMA a 50MB region with a single kvm/user exit. For PIO
>>>> we can at most do page granularity.
>>>
>>> So make a proper PCI device for kernel loading. It's a much more
>>> natural approach and let's alias -kernel/-initrd/-append to
>>> -device kernel-pci,kernel=PATH,initrd=PATH
>>
>> Adding a PCI device doesn't sound very appealing, unless you can
>> guarantee it is never visible to the guest once LinuxBoot has
>> finished its dirty work,
>
> It'll definitely be guest visible just like fwcfg is guest visible.

The difference is that fwcfg doesn't present any real problems to the
guest OS. PCI devices will. Also this means that if you have an
existing VM booting with -kernel and you update to a newer QEMU
binary, the guest ABI changes due to the new PCI device :-( Unless we
keep the old code around forever too, which means we'd really want to
improve the old code anyway.

Daniel
On Tue, Oct 11, 2011 at 8:23 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
> On Mon, Oct 10, 2011 at 09:01:52PM +0200, Alexander Graf wrote:
>> On 10.10.2011, at 20:53, Anthony Liguori wrote:
>>> On 10/10/2011 12:08 PM, Daniel P. Berrange wrote:
>>>> With the attached patches applied to QEMU and SeaBios, the attached
>>>> systemtap script can be used to debug timings in QEMU startup.
>>>>
>>>> For example, one execution of QEMU produced the following log:
>>>>
>>>> $ stap qemu-timing.stp
>>>> 0.000 Start
>>>> 0.036 Run
>>>> 0.038 BIOS post
>>>> 0.180 BIOS int 19
>>>> 0.181 BIOS boot OS
>>>> 0.181 LinuxBoot copy kernel
>>>> 1.371 LinuxBoot copy initrd
>>>
>>> Yeah, there was a thread a bit ago about the performance of the
>>> interface to read the kernel/initrd. I think it was using single
>>> byte access instructions and there were patches to use string
>>> accessors instead? I can't remember where that thread ended up.
>
> There was initially a huge performance problem, which was fixed during
> the course of the thread, getting to the current state where it still
> takes a few seconds to load large blobs. The thread continued with
> many proposals & counter proposals but nothing further really came
> out of it.
>
> https://lists.gnu.org/archive/html/qemu-devel/2010-08/msg00133.html
>
> One core point to take away though, is that -kernel/-initrd is *not*
> just for ad-hoc testing by qemu/kernel developers. It is critical
> functionality widely used by users of QEMU in production scenarios,
> and its performance does matter, in some cases a lot.
>
>> IIRC we're already using string accessors, but are still slow.
>> Richard had a nice patch cooked up to basically have the fw_cfg
>> interface be able to DMA its data to the guest. I like the idea.
>> Avi did not.
>
> That's here:
>
> https://lists.gnu.org/archive/html/qemu-devel/2010-07/msg01037.html
>
>> And yes, bad -kernel performance does hurt in some workloads. A lot.
>
> Let me recap the 3 usage scenarios I believe are most common:
>
> - Most Linux distro installs done with libvirt + virt-manager/virt-install
>   are done by directly booting the distro's PXE kernel/initrd files.
>   The kernel images are typically < 5 MB, while the initrd images may
>   be as large as 150 MB. Both are compressed already. An uncompressed
>   initrd image would be more like 300 MB, so these are avoided for
>   obvious reasons.
>
>   Performance is not really an issue, within reason, since the overall
>   distro installation time will easily dominate, but loading should
>   still be measured in seconds, not minutes.
>
>   The reason for using a kernel/initrd instead of a bootable ISO is
>   to be able to set kernel command line arguments for the installer.
>
> - libguestfs directly boots its appliance using the regular host's
>   kernel image and a custom built initrd image. The initrd does not
>   contain the entire appliance, just enough to boot up and dynamically
>   read files in from the host OS on demand. This is a so-called
>   "supermin appliance".
>
>   The kernel is < 5 MB, while the initrd is approx 100 MB. The initrd
>   image is used uncompressed, because decompression time needs to be
>   eliminated from bootup. Performance is very critical for libguestfs.
>   100's of milliseconds really do make a big difference for it.
>
>   The reason for using a kernel/initrd instead of a bootable ISO is
>   to avoid the time required to actually build the ISO, and to avoid
>   having more disks visible in the guest, which could confuse apps
>   using libguestfs which enumerate disks.
>
> - Application sandbox, directly boots the regular host's kernel and
>   a custom initrd image. The initrd does not contain any files except
>   for the 9p kernel modules and a custom init binary, which mounts
>   the guest root FS from a 9p filesystem export.
>
>   The kernel is < 5 MB, while the initrd is approx 700 KB compressed,
>   or 1.4 MB uncompressed. Performance for the sandbox is even more
>   critical than for libguestfs. Even 10's of milliseconds make a
>   difference here. The commands being run in the sandbox can be very
>   short-lived processes, executed reasonably frequently. The goal is
>   to have an end-to-end runtime overhead of < 2 seconds. This
>   includes libvirt guest startup, qemu startup/shutdown, BIOS time,
>   option ROM time, and kernel boot & shutdown time.
>
>   The reason for using a kernel/initrd instead of a bootable ISO is
>   that building an ISO requires time itself, and we need to be able
>   to easily pass kernel boot arguments via -append.
>
> I'm focusing on the last use case, and if the phase of the moon is
> correct, I can currently execute a sandbox command with a total
> overhead of 3.5 seconds (if using a compressed initrd), of which the
> QEMU execution time is 2.5 seconds.
>
> Of this, 1.4 seconds is the time required by LinuxBoot to copy the
> kernel+initrd. If I used an uncompressed initrd, which I really want
> to, to avoid decompression overhead, this increases to ~1.7 seconds.
> So the LinuxBoot ROM is ~60% of total QEMU execution time, or 40% of
> total sandbox execution overhead.
>
> For comparison I also did a test building a bootable ISO using
> ISOLinux. This required 700 ms for the boot time, which is
> approximately half the time required for a direct kernel/initrd boot.
> But you then have to add on the time required to build the ISO on
> every boot, to add custom kernel command line args. So while ISO is
> faster than LinuxBoot currently, there is still non-negligible
> overhead here that I want to avoid.
>
> For further comparison I tested with Rich Jones' patches which add a
> DMA-like interface to fw_cfg. With this, the time spent in the
> LinuxBoot option ROM was as close to zero as matters.
>
> So obviously, my preference is for -kernel/-initrd to be made very
> fast using the DMA-like patches, or any other patches which could
> achieve similarly high performance for -kernel/-initrd.

I don't understand why PC can't use the same way of loading the initrd
by QEMU into guest memory before boot as Sparc32 uses. It should even
be possible to deduplicate the kernel and initrd images: improve the
loader to use mmap() for loading so that several guests would use the
same pages. Preloaded kernel and initrd are paravirtual anyway; there
could even be guest-visible changes if ever needed (e.g. map
kernel/initrd pages outside of normal RAM areas).
On Tue, Oct 11, 2011 at 08:17:28AM -0500, Anthony Liguori wrote:
> On 10/11/2011 04:55 AM, Avi Kivity wrote:
>> On 10/11/2011 11:50 AM, Gleb Natapov wrote:
>>> But loading MBs of data through the fw_cfg interface is just abusing
>>> it. You wouldn't use pio on real HW to move megabytes of data and
>>> expect good performance.
>>
>> True, this is a point in favour of a true dma interface.
>
> Doing kernel loading through fw_cfg has always been a bit ugly.
>
> A better approach would be to implement a PCI device with a ROM bar
> that contained an option ROM that read additional bars from the
> device to get at the kernel and initrd.

If one is willing to add a PCI device, then one could add a virtio
block device with a non-standard PCI device/vendor code and teach
SeaBIOS to scan these non-standard ids. It's a bit of a hack, but it
would be a simple way of creating a drive visible to the BIOS but
hidden from the OS.

-Kevin
On Sat, Oct 15, 2011 at 10:00:02AM +0000, Blue Swirl wrote:
> I don't understand why PC can't use the same way of loading initrd by
> QEMU to guest memory before boot as Sparc32 uses. It should even be
> possible to deduplicate the kernel and initrd images: improve the
> loader to use mmap() for loading so that several guests would use the
> same pages. Preloaded kernel and initrd are paravirtual anyway, there
> could be even guest visible changes if ever needed (e.g. map
> kernel/initrd pages outside of normal RAM areas).

+1!

Even better if we extended Linux so it worked more like OS-9 (circa
1990): at boot, scan memory for modules and insmod them. That way we
wouldn't even need an initrd since we could just supply the correct
list of modules that the guest needs to mount its root disk.

Rich.
Richard W M Jones writes:
> On Sat, Oct 15, 2011 at 10:00:02AM +0000, Blue Swirl wrote:
>> I don't understand why PC can't use the same way of loading initrd by
>> QEMU to guest memory before boot as Sparc32 uses. It should even be
>> possible to deduplicate the kernel and initrd images: improve the
>> loader to use mmap() for loading so that several guests would use the
>> same pages. Preloaded kernel and initrd are paravirtual anyway, there
>> could be even guest visible changes if ever needed (e.g. map
>> kernel/initrd pages outside of normal RAM areas).
>
> +1!
>
> Even better if we extended Linux so it worked more like OS-9 (circa
> 1990): at boot, scan memory for modules and insmod them. That way we
> wouldn't even need an initrd since we could just supply the correct
> list of modules that the guest needs to mount its root disk.

I'm not really knowledgeable of the topic at hand, but if the objective
is to boot the kernel image given on QEMU's command line, QEMU can just
put the files in memory in a way that is compliant with what grub
already does, which is a format well understood by Linux itself
(multiboot modules, I think they're called).

Lluis
diff --git a/hw/pc.c b/hw/pc.c
index 203627d..76d0790 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -43,6 +43,7 @@
 #include "ui/qemu-spice.h"
 #include "memory.h"
 #include "exec-memory.h"
+#include "trace.h"
 
 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -516,6 +517,16 @@ static void handle_a20_line_change(void *opaque, int irq, int level)
 
 /***********************************************************/
 /* Bochs BIOS debug ports */
+enum {
+    PROBE_SEABIOS_POST = 1001,
+    PROBE_SEABIOS_INT_18 = 1002,
+    PROBE_SEABIOS_INT_19 = 1003,
+    PROBE_SEABIOS_BOOT_OS = 1004,
+
+    PROBE_LINUXBOOT_COPY_KERNEL = 2001,
+    PROBE_LINUXBOOT_COPY_INITRD = 2002,
+    PROBE_LINUXBOOT_BOOT_OS = 2003,
+};
 
 static void bochs_bios_write(void *opaque, uint32_t addr, uint32_t val)
 {
@@ -534,6 +545,31 @@ static void bochs_bios_write(void *opaque, uint32_t addr, uint32_t val)
         fprintf(stderr, "%c", val);
 #endif
         break;
+    case 0x404: {
+        switch (val) {
+        case PROBE_SEABIOS_POST:
+            trace_seabios_post();
+            break;
+        case PROBE_SEABIOS_INT_18:
+            trace_seabios_int_18();
+            break;
+        case PROBE_SEABIOS_INT_19:
+            trace_seabios_int_19();
+            break;
+        case PROBE_SEABIOS_BOOT_OS:
+            trace_seabios_boot_OS();
+            break;
+        case PROBE_LINUXBOOT_COPY_KERNEL:
+            trace_linuxboot_copy_kernel();
+            break;
+        case PROBE_LINUXBOOT_COPY_INITRD:
+            trace_linuxboot_copy_initrd();
+            break;
+        case PROBE_LINUXBOOT_BOOT_OS:
+            trace_linuxboot_boot_OS();
+            break;
+        }
+    } break;
     case 0x8900: /* same as Bochs power off */
         if (val == shutdown_str[shutdown_index]) {
@@ -589,6 +625,7 @@ static void *bochs_bios_init(void)
     register_ioport_write(0x401, 1, 2, bochs_bios_write, NULL);
     register_ioport_write(0x402, 1, 1, bochs_bios_write, NULL);
     register_ioport_write(0x403, 1, 1, bochs_bios_write, NULL);
+    register_ioport_write(0x404, 1, 4, bochs_bios_write, NULL);
     register_ioport_write(0x8900, 1, 1, bochs_bios_write, NULL);
 
     register_ioport_write(0x501, 1, 1, bochs_bios_write, NULL);
diff --git a/pc-bios/linuxboot.bin b/pc-bios/linuxboot.bin
index e7c3669..40b9217 100644
Binary files a/pc-bios/linuxboot.bin and b/pc-bios/linuxboot.bin differ
diff --git a/pc-bios/optionrom/linuxboot.S b/pc-bios/optionrom/linuxboot.S
index 748c831..5c39fb1 100644
--- a/pc-bios/optionrom/linuxboot.S
+++ b/pc-bios/optionrom/linuxboot.S
@@ -108,11 +108,21 @@ copy_kernel:
 	/* We're now running in 16-bit CS, but 32-bit ES! */
 
 	/* Load kernel and initrd */
+	mov	$0x7d1,%eax
+	mov	$0x404,%edx
+	outl	%eax,(%dx)
 	read_fw_blob_addr32(FW_CFG_KERNEL)
+	mov	$0x7d2,%eax
+	mov	$0x404,%edx
+	outl	%eax,(%dx)
 	read_fw_blob_addr32(FW_CFG_INITRD)
 	read_fw_blob_addr32(FW_CFG_CMDLINE)
 	read_fw_blob_addr32(FW_CFG_SETUP)
+	mov	$0x7d3,%eax
+	mov	$0x404,%edx
+	outl	%eax,(%dx)
+
 	/* And now jump into Linux! */
 	mov	$0, %eax
 	mov	%eax, %cr0
diff --git a/trace-events b/trace-events
index a31d9aa..34ca28b 100644
--- a/trace-events
+++ b/trace-events
@@ -289,6 +289,11 @@ scsi_request_sense(int target, int lun, int tag) "target %d lun %d tag %d"
 
 # vl.c
 vm_state_notify(int running, int reason) "running %d reason %d"
+main_start(void) "startup"
+main_loop(void) "loop"
+main_stop(void) "stop"
+qemu_shutdown_request(void) "shutdown request"
+qemu_powerdown_request(void) "powerdown request"
 
 # block/qed-l2-cache.c
 qed_alloc_l2_cache_entry(void *l2_cache, void *entry) "l2_cache %p entry %p"
@@ -502,3 +507,12 @@ escc_sunkbd_event_in(int ch) "Untranslated keycode %2.2x"
 escc_sunkbd_event_out(int ch) "Translated keycode %2.2x"
 escc_kbd_command(int val) "Command %d"
 escc_sunmouse_event(int dx, int dy, int buttons_state) "dx=%d dy=%d buttons=%01x"
+
+seabios_post(void) "BIOS post"
+seabios_int_18(void) "BIOS int18"
+seabios_int_19(void) "BIOS int19"
+seabios_boot_OS(void) "BIOS boot OS"
+
+linuxboot_copy_kernel(void) "LinuxBoot Copy Kernel"
+linuxboot_copy_initrd(void) "LinuxBoot Copy InitRD"
+linuxboot_boot_OS(void) "LinuxBoot boot OS"
diff --git a/vl.c b/vl.c
index bd4a5ce..91e6f5e 100644
--- a/vl.c
+++ b/vl.c
@@ -162,7 +162,7 @@ int main(int argc, char **argv)
 #include "qemu-queue.h"
 #include "cpus.h"
 #include "arch_init.h"
-
+#include "trace.h"
 #include "ui/qemu-spice.h"
 
 //#define DEBUG_NET
@@ -1414,12 +1414,14 @@ void qemu_system_killed(int signal, pid_t pid)
 
 void qemu_system_shutdown_request(void)
 {
+    trace_qemu_shutdown_request();
     shutdown_requested = 1;
     qemu_notify_event();
 }
 
 void qemu_system_powerdown_request(void)
 {
+    trace_qemu_powerdown_request();
     powerdown_requested = 1;
     qemu_notify_event();
 }
@@ -2313,6 +2315,8 @@ int main(int argc, char **argv, char **envp)
     const char *trace_events = NULL;
     const char *trace_file = NULL;
 
+    trace_main_start();
+
     atexit(qemu_run_exit_notifiers);
     error_set_progname(argv[0]);
 
@@ -3571,10 +3575,12 @@ int main(int argc, char **argv, char **envp)
 
     os_setup_post();
 
+    trace_main_loop();
     main_loop();
 
     quit_timers();
     net_cleanup();
     res_free();
+    trace_main_stop();
 
     return 0;
 }