From patchwork Tue Sep 27 19:05:02 2011
X-Patchwork-Submitter: Blue Swirl
X-Patchwork-Id: 116653
From: Blue Swirl
Date: Tue, 27 Sep 2011 19:05:02 +0000
To: Alexander Graf
Cc: Scott Wood, Yoder Stuart-B08248, qemu-ppc@nongnu.org,
    qemu-devel Developers, Aurelien Jarno
Subject: Re: [Qemu-devel] [PATCH 24/58] PPC: E500: Add PV spinning code

On Tue, Sep 27, 2011 at 5:23 PM, Alexander Graf wrote:
>
> On 27.09.2011, at 19:20, Blue Swirl wrote:
>
>> On Tue, Sep 27, 2011 at 5:03 PM, Alexander Graf wrote:
>>>
>>> On 27.09.2011, at 18:53, Blue Swirl wrote:
>>>
>>>> On Tue, Sep 27, 2011 at 3:59 PM, Alexander Graf wrote:
>>>>>
>>>>> On 27.09.2011, at 17:50, Blue Swirl wrote:
>>>>>
>>>>>> On Mon, Sep 26, 2011 at 11:19 PM, Scott Wood wrote:
>>>>>>> On 09/24/2011 05:00 AM, Alexander Graf wrote:
>>>>>>>> On 24.09.2011, at 10:44, Blue Swirl wrote:
>>>>>>>>> On Sat, Sep 24, 2011 at 8:03 AM, Alexander Graf wrote:
>>>>>>>>>> On 24.09.2011, at 09:41, Blue Swirl wrote:
>>>>>>>>>>> On Mon, Sep 19, 2011 at 4:12 PM, Scott Wood wrote:
>>>>>>>>>>>> The goal with the spin table stuff, suboptimal as it is, was something
>>>>>>>>>>>> that would work on any powerpc implementation.  Other
>>>>>>>>>>>> implementation-specific release mechanisms are allowed, and are
>>>>>>>>>>>> indicated by a property in the cpu node, but only if the loader knows
>>>>>>>>>>>> that the OS supports it.
>>>>>>>>>>>>
>>>>>>>>>>>>> IIUC the spec that includes these bits is not finalized yet. It is, however, in use on all u-boot versions for e500 that I'm aware of, and it is the method Linux uses to bring up secondary CPUs.
>>>>>>>>>>>>
>>>>>>>>>>>> It's in ePAPR 1.0, which has been out for a while now.  ePAPR 1.1 was
>>>>>>>>>>>> just released which clarifies some things such as WIMG.
>>>>>>>>>>>>
>>>>>>>>>>>>> Stuart / Scott, do you have any pointers to documentation where the spinning is explained?
>>>>>>>>>>>>
>>>>>>>>>>>> https://www.power.org/resources/downloads/Power_ePAPR_APPROVED_v1.1.pdf
>>>>>>>>>>>
>>>>>>>>>>> Chapter 5.5.2 describes the table. This is actually an interface
>>>>>>>>>>> between the OS and Open Firmware; obviously there can't be a real hardware
>>>>>>>>>>> device that magically loads r3 etc.
>>>>>>>
>>>>>>> Not Open Firmware, but rather an ePAPR-compliant loader.
>>>>>>
>>>>>> 'boot program to client program interface definition'.
>>>>>>
>>>>>>>>>>> The device method would break abstraction layers,
>>>>>>>
>>>>>>> Which abstraction layers?
>>>>>>
>>>>>> QEMU system emulation emulates hardware, not software. Hardware
>>>>>> devices don't touch CPU registers.
>>>>>
>>>>> The great part about this emulated device is that it's basically guest software running in host context. To the guest, it's not a device in the ordinary sense, such as vmport, but rather the same as software running on another core, just that the other core isn't running any software.
>>>>>
>>>>> Sure, if you consider this a device, it does break abstraction layers. Just consider it as the host running guest code, then it makes sense :).
>>>>>
>>>>>>
>>>>>>>>>>> it's much like
>>>>>>>>>>> the vmport stuff in x86. Using a hypercall would be a small improvement.
>>>>>>>>>>> Instead it should be possible to implement a small boot ROM which puts
>>>>>>>>>>> the secondary CPUs into a managed halt state without spinning; then the
>>>>>>>>>>> boot CPU could send an IPI to a halted CPU to wake it up based on
>>>>>>>>>>> the spin table, just like real HW would do.
>>>>>>>
>>>>>>> The spin table, with no IPI or halt state, is what real HW does (or
>>>>>>> rather, what software does on real HW) today.  It's ugly and inefficient
>>>>>>> but it should work everywhere.  Anything else would be dependent on a
>>>>>>> specific HW implementation.
>>>>>>
>>>>>> Yes. Hardware doesn't ever implement the spin table.
>>>>>>
>>>>>>>>>>> On Sparc32 OpenBIOS this
>>>>>>>>>>> is something like a few lines of ASM on both sides.
>>>>>>>>>>
>>>>>>>>>> That sounds pretty close to what I had implemented in v1. Back then the only comment was to do it using this method from Scott.
>>>>>>>
>>>>>>> I had some comments on the actual v1 implementation as well. :-)
>>>>>>>
>>>>>>>>>> So we have the choice between having code inside the guest that
>>>>>>>>>> spins, maybe even only checks every x ms by programming a timer,
>>>>>>>>>> or we can try to make an event out of the memory write. V1 was
>>>>>>>>>> the former, v2 (this one) is the latter. This version performs a
>>>>>>>>>> lot better and is easier to understand.
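
For reference, the spin table under discussion is just a per-CPU structure
in guest memory that each secondary CPU polls. A rough sketch of one entry
follows; the layout is as described in ePAPR chapter 5.5.2 (64-bit entry
address, 64-bit r3 value, two 32-bit words), but the struct and field names
here are illustrative, not taken from the spec or from this patch:

    #include <stdint.h>

    /* Sketch of one ePAPR spin-table entry (ePAPR ch. 5.5.2).
     * Names are illustrative only. */
    struct epapr_spin_table_entry {
        uint64_t addr;  /* entry point; starts as 1, meaning "keep spinning" */
        uint64_t r3;    /* value the released CPU loads into GPR r3 */
        uint32_t rsvd;  /* reserved */
        uint32_t pir;   /* processor ID associated with this entry */
    };

To release a secondary CPU, the boot CPU fills in r3 and the other fields
first and writes the entry address last; the spinning CPU polls addr until
it no longer reads 1, then branches there with r3 loaded from the table.
The question in this thread is whether QEMU should model that release as a
magic device, a hypercall, or plain guest code in a boot ROM.
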
>>>>>>>>>
>>>>>>>>> The abstraction layers should not be broken lightly; I suppose some
>>>>>>>>> performance or laziness^Wlocal optimization reasons were behind the vmport
>>>>>>>>> design too. The ideal way to solve this could be to detect a spinning
>>>>>>>>> CPU and optimize that for all architectures, though that could be tricky
>>>>>>>>> (if a CPU remains in the same TB for extended periods, inspect
>>>>>>>>> the TB: if it performs a loop with a single load instruction, replace
>>>>>>>>> the load by a special wait operation for any memory stores to that
>>>>>>>>> page).
>>>>>>>
>>>>>>> How's that going to work with KVM?
>>>>>>>
>>>>>>>> In fact, the whole kernel loading way we go today is pretty much
>>>>>>>> wrong. We should rather do it similar to OpenBIOS, where firmware
>>>>>>>> always loads and then pulls the kernel from QEMU using a PV
>>>>>>>> interface. At that point, we would have to implement such an
>>>>>>>> optimization as you suggest. Or implement a hypercall :).
>>>>>>>
>>>>>>> I think the current approach is more usable for most purposes.  If you
>>>>>>> start U-Boot instead of a kernel, how do you pass information on from the
>>>>>>> user (kernel, rfs, etc)?  Require the user to create flash images[1]?
>>>>>>
>>>>>> No; for example, OpenBIOS gets the kernel command line from the fw_cfg device.
>>>>>>
>>>>>>> Maybe that's a useful mode of operation in some cases, but I don't think
>>>>>>> we should be slavishly bound to it.  Think of the current approach as
>>>>>>> something between whole-system and userspace emulation.
>>>>>>
>>>>>> This is similar to the ARM, M68k and Xtensa semi-hosting mode, but at a
>>>>>> level below the kernel. Perhaps this mode should be enabled with the
>>>>>> -semihosting flag or a new flag. Then the bare-metal version could be
>>>>>> run without the flag.
>>>>>
>>>>> And then we'd have two implementations for running in system emulation mode and would need to maintain both. I don't think that scales very well.
>>>>
>>>> No, but such hacks are not common.
>>>>
>>>>>>
>>>>>>> Where does the device tree come from?  How do you tell the guest about
>>>>>>> what devices it has, especially in virtualization scenarios with non-PCI
>>>>>>> passthrough devices, or custom qdev instantiations?
>>>>>>>
>>>>>>>> But at least we'd always be running the same guest software stack.
>>>>>>>
>>>>>>> No we wouldn't.  Any U-Boot that runs under QEMU would have to be
>>>>>>> heavily modified, unless we want to implement a ton of random device
>>>>>>> emulation, at least one extra memory translation layer (LAWs, localbus
>>>>>>> windows, CCSRBAR, and such), hacks to allow locked cache lines to
>>>>>>> operate despite a lack of backing store, etc.
>>>>>>
>>>>>> I'd say HW emulation business as usual. Now, with the new memory API,
>>>>>> it should be possible to emulate the caches with line locking, TLBs,
>>>>>> etc.; this was not previously possible. IIRC implementing locked cache
>>>>>> lines would allow x86 to boot unmodified coreboot.
>>>>>
>>>>> So how would you emulate cache lines with line locking on KVM?
>>>>
>>>> The cache would be an MMIO device which registers to handle all memory
>>>> space. Configuring the cache controller changes how the device
>>>> operates. Put this device between the CPU and memory and other devices.
>>>> Performance would probably be horrible, so the CPU should disable the
>>>> device automatically after some time.
>>>
>>> So how would you execute code on this region then? :)
>>
>> Easy: fix QEMU to allow executing from MMIO. (Yeah, I forgot about that.)
>
> It's not quite as easy to fix KVM to do the same, though, unfortunately. We'd have to either implement a full instruction emulator in the kernel (x86 style) or transfer all state from KVM into QEMU to execute it there (hell breaks loose). Neither alternative is exactly appealing.
>
>>
>>>>
>>>>> However, we already have a number of hacks in SeaBIOS to run in QEMU, so I don't see an issue in adding a few here and there in u-boot. The memory pressure is a real issue though. I'm not sure how we'd manage that one. Maybe we could try and reuse the host u-boot binary? heh
>>>>
>>>> I don't think SeaBIOS breaks layering except for fw_cfg.
>>>
>>> I'm not saying we're breaking layering there. I'm saying that changing u-boot is not so bad, since it's the same as what we do with SeaBIOS. It was an argument in favor of your position.
>>
>> Never mind then ;-)
>>
>>>> For extremely
>>>> memory-limited situations, perhaps QEMU (or Native KVM Tool, for a lean
>>>> and mean version) could be run without glibc, inside the kernel, or even
>>>> interfacing directly with the hypervisor. I'd also continue making it
>>>> possible to disable building unused devices and features.
>>>
>>> I'm pretty sure you're not the only one with that goal ;).
>>
>> Great, let's do it.
>
> VGA comes first :)

This patch fixes the easy parts; the ISA devices remain, since they are not
qdevified. But didn't someone already send patches to do that?

diff --git a/hw/cirrus_vga.c b/hw/cirrus_vga.c
index c7e365b..a11444c 100644
--- a/hw/cirrus_vga.c
+++ b/hw/cirrus_vga.c
@@ -2955,11 +2955,6 @@ static int pci_cirrus_vga_initfn(PCIDevice *dev)
     return 0;
 }
 
-void pci_cirrus_vga_init(PCIBus *bus)
-{
-    pci_create_simple(bus, -1, "cirrus-vga");
-}
-
 static PCIDeviceInfo cirrus_vga_info = {
     .qdev.name = "cirrus-vga",
     .qdev.desc = "Cirrus CLGD 54xx VGA",
diff --git a/hw/pc.c b/hw/pc.c
index 203627d..97f93d4 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1068,7 +1068,11 @@ void pc_vga_init(PCIBus *pci_bus)
 {
     if (cirrus_vga_enabled) {
         if (pci_bus) {
-            pci_cirrus_vga_init(pci_bus);
+            if (!pci_cirrus_vga_init(pci_bus)) {
+                fprintf(stderr, "Warning: cirrus_vga not available,"
+                        " using standard VGA instead\n");
+                pci_vga_init(pci_bus);
+            }
         } else {
             isa_cirrus_vga_init(get_system_memory());
         }
diff --git a/hw/pc.h b/hw/pc.h
index 7e6ddba..90a502d 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -8,6 +8,7 @@
 #include "fdc.h"
 #include "net.h"
 #include "memory.h"
+#include "pci.h"
 
 /* PC-style peripherals (also used by other machines). */
@@ -217,13 +218,34 @@ static inline int isa_vga_init(void)
     return 1;
 }
 
-int pci_vga_init(PCIBus *bus);
+/* vga-pci.c */
+static inline bool pci_vga_init(PCIBus *bus)
+{
+    PCIDevice *dev;
+
+    dev = pci_try_create_simple(bus, -1, "VGA");
+    if (!dev) {
+        return false;
+    }
+    return true;
+}
+
 int isa_vga_mm_init(target_phys_addr_t vram_base,
                     target_phys_addr_t ctrl_base, int it_shift,
                     MemoryRegion *address_space);
 
 /* cirrus_vga.c */
-void pci_cirrus_vga_init(PCIBus *bus);
+static inline bool pci_cirrus_vga_init(PCIBus *bus)
+{
+    PCIDevice *dev;
+
+    dev = pci_try_create_simple(bus, -1, "cirrus-vga");
+    if (!dev) {
+        return false;
+    }
+    return true;
+}
+
 void isa_cirrus_vga_init(MemoryRegion *address_space);
 
 /* ne2000.c */
diff --git a/hw/pci.c b/hw/pci.c
index 749e8d8..46c01ac 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -1687,6 +1687,19 @@ PCIDevice *pci_create_simple_multifunction(PCIBus *bus, int devfn,
     return dev;
 }
 
+PCIDevice *pci_try_create_simple_multifunction(PCIBus *bus, int devfn,
+                                               bool multifunction,
+                                               const char *name)
+{
+    PCIDevice *dev = pci_try_create_multifunction(bus, devfn, multifunction,
+                                                  name);
+    if (!dev) {
+        return NULL;
+    }
+    qdev_init_nofail(&dev->qdev);
+    return dev;
+}
+
 PCIDevice *pci_create(PCIBus *bus, int devfn, const char *name)
 {
     return pci_create_multifunction(bus, devfn, false, name);
@@ -1702,6 +1715,11 @@ PCIDevice *pci_try_create(PCIBus *bus, int devfn, const char *name)
     return pci_try_create_multifunction(bus, devfn, false, name);
 }
 
+PCIDevice *pci_try_create_simple(PCIBus *bus, int devfn, const char *name)
+{
+    return pci_try_create_simple_multifunction(bus, devfn, false, name);
+}
+
 static int pci_find_space(PCIDevice *pdev, uint8_t size)
 {
     int config_size = pci_config_size(pdev);
diff --git a/hw/pci.h b/hw/pci.h
index 86a81c8..aa2e040 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -473,9 +473,13 @@ PCIDevice *pci_create_simple_multifunction(PCIBus *bus, int devfn,
 PCIDevice *pci_try_create_multifunction(PCIBus *bus, int devfn,
                                         bool multifunction,
                                         const char *name);
+PCIDevice *pci_try_create_simple_multifunction(PCIBus *bus, int devfn,
+                                               bool multifunction,
+                                               const char *name);
 PCIDevice *pci_create(PCIBus *bus, int devfn, const char *name);
 PCIDevice *pci_create_simple(PCIBus *bus, int devfn, const char *name);
 PCIDevice *pci_try_create(PCIBus *bus, int devfn, const char *name);
+PCIDevice *pci_try_create_simple(PCIBus *bus, int devfn, const char *name);
 
 static inline int pci_is_express(const PCIDevice *d)
 {
diff --git a/hw/vga-pci.c b/hw/vga-pci.c
index 3c8bcb0..f296b19 100644
--- a/hw/vga-pci.c
+++ b/hw/vga-pci.c
@@ -70,12 +70,6 @@ static int pci_vga_initfn(PCIDevice *dev)
     return 0;
 }
 
-int pci_vga_init(PCIBus *bus)
-{
-    pci_create_simple(bus, -1, "VGA");
-    return 0;
-}
-
 static PCIDeviceInfo vga_info = {
     .qdev.name = "VGA",
     .qdev.size = sizeof(PCIVGAState),
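
As a usage note: the point of the try_ variants is that they return NULL
when the named qdev model is not built in, instead of exiting, which is
what makes a graceful fallback possible. A minimal sketch of the pattern
(the function name here is hypothetical; it just mirrors what the
pc_vga_init() hunk above does):

    /* Hypothetical board init: prefer cirrus-vga, fall back to the
     * standard VGA model when the Cirrus model was compiled out. */
    static void example_vga_init(PCIBus *pci_bus)
    {
        if (!pci_try_create_simple(pci_bus, -1, "cirrus-vga")) {
            fprintf(stderr, "cirrus-vga not available, using standard VGA\n");
            pci_create_simple(pci_bus, -1, "VGA");
        }
    }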