From patchwork Tue Sep 27 19:05:02 2011
X-Patchwork-Submitter: Blue Swirl
X-Patchwork-Id: 116653
From: Blue Swirl
Date: Tue, 27 Sep 2011 19:05:02 +0000
To: Alexander Graf
Cc: Scott Wood, Yoder Stuart-B08248, qemu-ppc@nongnu.org,
    qemu-devel Developers, Aurelien Jarno
Subject: Re: [Qemu-devel] [PATCH 24/58] PPC: E500: Add PV spinning code

On Tue, Sep 27, 2011 at 5:23 PM, Alexander Graf wrote:
>
> On 27.09.2011, at 19:20, Blue Swirl wrote:
>
>> On Tue, Sep 27, 2011 at 5:03 PM, Alexander Graf wrote:
>>>
>>> On 27.09.2011, at 18:53, Blue Swirl wrote:
>>>
>>>> On Tue, Sep 27, 2011 at 3:59 PM, Alexander Graf wrote:
>>>>>
>>>>> On 27.09.2011, at 17:50, Blue Swirl wrote:
>>>>>
>>>>>> On Mon, Sep 26, 2011 at 11:19 PM, Scott Wood wrote:
>>>>>>> On 09/24/2011 05:00 AM, Alexander Graf wrote:
>>>>>>>> On 24.09.2011, at 10:44, Blue Swirl wrote:
>>>>>>>>> On Sat, Sep 24, 2011 at 8:03 AM, Alexander Graf wrote:
>>>>>>>>>> On 24.09.2011, at 09:41, Blue Swirl wrote:
>>>>>>>>>>> On Mon, Sep 19, 2011 at 4:12 PM, Scott Wood wrote:
>>>>>>>>>>>> The goal with the spin table stuff, suboptimal as it is, was something
>>>>>>>>>>>> that would work on any powerpc implementation.  Other
>>>>>>>>>>>> implementation-specific release mechanisms are allowed, and are
>>>>>>>>>>>> indicated by a property in the cpu node, but only if the loader knows
>>>>>>>>>>>> that the OS supports it.
>>>>>>>>>>>>
>>>>>>>>>>>>> IIUC the spec that includes these bits is not finalized yet. It is, however, in use on all u-boot versions for e500 that I'm aware of, and it is the method Linux uses to bring up secondary CPUs.
>>>>>>>>>>>>
>>>>>>>>>>>> It's in ePAPR 1.0, which has been out for a while now.  ePAPR 1.1 was
>>>>>>>>>>>> just released which clarifies some things such as WIMG.
>>>>>>>>>>>>
>>>>>>>>>>>>> Stuart / Scott, do you have any pointers to documentation where the spinning is explained?
>>>>>>>>>>>>
>>>>>>>>>>>> https://www.power.org/resources/downloads/Power_ePAPR_APPROVED_v1.1.pdf
>>>>>>>>>>>
>>>>>>>>>>> Chapter 5.5.2 describes the table. This is actually an interface
>>>>>>>>>>> between the OS and Open Firmware; obviously there can't be a real hardware
>>>>>>>>>>> device that magically loads r3 etc.
>>>>>>>
>>>>>>> Not Open Firmware, but rather an ePAPR-compliant loader.
>>>>>>
>>>>>> 'boot program to client program interface definition'.
>>>>>>
>>>>>>>>>>> The device method would break abstraction layers,
>>>>>>>
>>>>>>> Which abstraction layers?
>>>>>>
>>>>>> QEMU system emulation emulates hardware, not software. Hardware
>>>>>> devices don't touch CPU registers.
>>>>>
>>>>> The great part about this emulated device is that it's basically guest software running in host context. To the guest, it's not a device in the ordinary sense, such as vmport, but rather the same as software running on another core, just that the other core isn't running any software.
>>>>>
>>>>> Sure, if you consider this a device, it does break abstraction layers. Just consider it as the host running guest code, then it makes sense :).
>>>>>
>>>>>>
>>>>>>>>>>> it's much like
>>>>>>>>>>> the vmport stuff in x86. Using a hypercall would be a small improvement.
>>>>>>>>>>> Instead it should be possible to implement a small boot ROM which puts
>>>>>>>>>>> the secondary CPUs into a managed halt state without spinning; then the
>>>>>>>>>>> boot CPU could send an IPI to a halted CPU to wake it up based on
>>>>>>>>>>> the spin table, just like real HW would do.
>>>>>>>
>>>>>>> The spin table, with no IPI or halt state, is what real HW does (or
>>>>>>> rather, what software does on real HW) today.  It's ugly and inefficient
>>>>>>> but it should work everywhere.  Anything else would be dependent on a
>>>>>>> specific HW implementation.
>>>>>>
>>>>>> Yes. Hardware doesn't ever implement the spin table.
>>>>>>
>>>>>>>>>>> On Sparc32 OpenBIOS this
>>>>>>>>>>> is something like a few lines of ASM on both sides.
>>>>>>>>>>
>>>>>>>>>> That sounds pretty close to what I had implemented in v1. Back then the only comment was to do it using this method from Scott.
>>>>>>>
>>>>>>> I had some comments on the actual v1 implementation as well. :-)
>>>>>>>
>>>>>>>>>> So we have the choice between having code inside the guest that
>>>>>>>>>> spins, maybe even only checks every x ms by programming a timer,
>>>>>>>>>> or we can try to make an event out of the memory write. V1 was
>>>>>>>>>> the former, v2 (this one) is the latter. This version performs a
>>>>>>>>>> lot better and is easier to understand.
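
For reference, the spin table under discussion is just a per-CPU structure
in guest memory that each secondary CPU polls. A rough sketch of one entry
follows; the layout is as described in ePAPR chapter 5.5.2 (64-bit entry
address, 64-bit r3 value, two 32-bit words), but the struct and field names
here are illustrative, not taken from the spec or from this patch:

    #include <stdint.h>

    /* Sketch of one ePAPR spin-table entry (ePAPR ch. 5.5.2).
     * Names are illustrative only. */
    struct epapr_spin_table_entry {
        uint64_t addr;  /* entry point; starts as 1, meaning "keep spinning" */
        uint64_t r3;    /* value the released CPU loads into GPR r3 */
        uint32_t rsvd;  /* reserved */
        uint32_t pir;   /* processor ID associated with this entry */
    };

To release a secondary CPU, the boot CPU fills in r3 and the other fields
first and writes the entry address last; the spinning CPU polls addr until
it no longer reads 1, then branches there with r3 loaded from the table.
The question in this thread is whether QEMU should model that release as a
magic device, a hypercall, or plain guest code in a boot ROM.
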
>>>>>>>>>
>>>>>>>>> The abstraction layers should not be broken lightly; I suppose some
>>>>>>>>> performance or laziness^Wlocal optimization reasons were behind the vmport
>>>>>>>>> design too. The ideal way to solve this could be to detect a spinning
>>>>>>>>> CPU and optimize that for all architectures, though that could be tricky
>>>>>>>>> (if a CPU remains in the same TB for extended periods, inspect
>>>>>>>>> the TB: if it performs a loop with a single load instruction, replace
>>>>>>>>> the load by a special wait operation for any memory stores to that
>>>>>>>>> page).
>>>>>>>
>>>>>>> How's that going to work with KVM?
>>>>>>>
>>>>>>>> In fact, the whole kernel loading way we go today is pretty much
>>>>>>>> wrong. We should rather do it similar to OpenBIOS, where firmware
>>>>>>>> always loads and then pulls the kernel from QEMU using a PV
>>>>>>>> interface. At that point, we would have to implement such an
>>>>>>>> optimization as you suggest. Or implement a hypercall :).
>>>>>>>
>>>>>>> I think the current approach is more usable for most purposes.  If you
>>>>>>> start U-Boot instead of a kernel, how do you pass information on from the
>>>>>>> user (kernel, rfs, etc)?  Require the user to create flash images[1]?
>>>>>>
>>>>>> No; for example, OpenBIOS gets the kernel command line from the fw_cfg device.
>>>>>>
>>>>>>> Maybe that's a useful mode of operation in some cases, but I don't think
>>>>>>> we should be slavishly bound to it.  Think of the current approach as
>>>>>>> something between whole-system and userspace emulation.
>>>>>>
>>>>>> This is similar to the ARM, M68k and Xtensa semi-hosting mode, but at a
>>>>>> level below the kernel. Perhaps this mode should be enabled with the
>>>>>> -semihosting flag or a new flag. Then the bare-metal version could be
>>>>>> run without the flag.
>>>>>
>>>>> And then we'd have two implementations for running in system emulation mode and would need to maintain both. I don't think that scales very well.
>>>>
>>>> No, but such hacks are not common.
>>>>
>>>>>>
>>>>>>> Where does the device tree come from?  How do you tell the guest about
>>>>>>> what devices it has, especially in virtualization scenarios with non-PCI
>>>>>>> passthrough devices, or custom qdev instantiations?
>>>>>>>
>>>>>>>> But at least we'd always be running the same guest software stack.
>>>>>>>
>>>>>>> No we wouldn't.  Any U-Boot that runs under QEMU would have to be
>>>>>>> heavily modified, unless we want to implement a ton of random device
>>>>>>> emulation, at least one extra memory translation layer (LAWs, localbus
>>>>>>> windows, CCSRBAR, and such), hacks to allow locked cache lines to
>>>>>>> operate despite a lack of backing store, etc.
>>>>>>
>>>>>> I'd say HW emulation business as usual. Now, with the new memory API,
>>>>>> it should be possible to emulate the caches with line locking, TLBs,
>>>>>> etc.; this was not previously possible. IIRC implementing locked cache
>>>>>> lines would allow x86 to boot unmodified coreboot.
>>>>>
>>>>> So how would you emulate cache lines with line locking on KVM?
>>>>
>>>> The cache would be an MMIO device which registers to handle all memory
>>>> space. Configuring the cache controller changes how the device
>>>> operates. Put this device between the CPU and memory and other devices.
>>>> Performance would probably be horrible, so the CPU should disable the
>>>> device automatically after some time.
>>>
>>> So how would you execute code on this region then? :)
>>
>> Easy: fix QEMU to allow executing from MMIO. (Yeah, I forgot about that.)
>
> It's not quite as easy to fix KVM to do the same, though, unfortunately. We'd have to either implement a full instruction emulator in the kernel (x86 style) or transfer all state from KVM into QEMU to execute it there (hell breaks loose). Neither alternative is exactly appealing.
>
>>
>>>>
>>>>> However, we already have a number of hacks in SeaBIOS to run in QEMU, so I don't see an issue in adding a few here and there in u-boot. The memory pressure is a real issue though. I'm not sure how we'd manage that one. Maybe we could try and reuse the host u-boot binary? heh
>>>>
>>>> I don't think SeaBIOS breaks layering except for fw_cfg.
>>>
>>> I'm not saying we're breaking layering there. I'm saying that changing u-boot is not so bad, since it's the same as what we do with SeaBIOS. It was an argument in favor of your position.
>>
>> Never mind then ;-)
>>
>>>> For extremely
>>>> memory-limited situations, perhaps QEMU (or Native KVM Tool, for a lean
>>>> and mean version) could be run without glibc, inside the kernel, or even
>>>> interfacing directly with the hypervisor. I'd also continue making it
>>>> possible to disable building unused devices and features.
>>>
>>> I'm pretty sure you're not the only one with that goal ;).
>>
>> Great, let's do it.
>
> VGA comes first :)

This patch fixes the easy parts; the ISA devices remain, since they are not
qdevified. But didn't someone already send patches to do that?

diff --git a/hw/cirrus_vga.c b/hw/cirrus_vga.c
index c7e365b..a11444c 100644
--- a/hw/cirrus_vga.c
+++ b/hw/cirrus_vga.c
@@ -2955,11 +2955,6 @@ static int pci_cirrus_vga_initfn(PCIDevice *dev)
     return 0;
 }
 
-void pci_cirrus_vga_init(PCIBus *bus)
-{
-    pci_create_simple(bus, -1, "cirrus-vga");
-}
-
 static PCIDeviceInfo cirrus_vga_info = {
     .qdev.name = "cirrus-vga",
     .qdev.desc = "Cirrus CLGD 54xx VGA",
diff --git a/hw/pc.c b/hw/pc.c
index 203627d..97f93d4 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1068,7 +1068,11 @@ void pc_vga_init(PCIBus *pci_bus)
 {
     if (cirrus_vga_enabled) {
         if (pci_bus) {
-            pci_cirrus_vga_init(pci_bus);
+            if (!pci_cirrus_vga_init(pci_bus)) {
+                fprintf(stderr, "Warning: cirrus_vga not available,"
+                        " using standard VGA instead\n");
+                pci_vga_init(pci_bus);
+            }
         } else {
             isa_cirrus_vga_init(get_system_memory());
         }
diff --git a/hw/pc.h b/hw/pc.h
index 7e6ddba..90a502d 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -8,6 +8,7 @@
 #include "fdc.h"
 #include "net.h"
 #include "memory.h"
+#include "pci.h"
 
 /* PC-style peripherals (also used by other machines). */
@@ -217,13 +218,34 @@ static inline int isa_vga_init(void)
     return 1;
 }
 
-int pci_vga_init(PCIBus *bus);
+/* vga-pci.c */
+static inline bool pci_vga_init(PCIBus *bus)
+{
+    PCIDevice *dev;
+
+    dev = pci_try_create_simple(bus, -1, "VGA");
+    if (!dev) {
+        return false;
+    }
+    return true;
+}
+
 int isa_vga_mm_init(target_phys_addr_t vram_base,
                     target_phys_addr_t ctrl_base, int it_shift,
                     MemoryRegion *address_space);
 
 /* cirrus_vga.c */
-void pci_cirrus_vga_init(PCIBus *bus);
+static inline bool pci_cirrus_vga_init(PCIBus *bus)
+{
+    PCIDevice *dev;
+
+    dev = pci_try_create_simple(bus, -1, "cirrus-vga");
+    if (!dev) {
+        return false;
+    }
+    return true;
+}
+
 void isa_cirrus_vga_init(MemoryRegion *address_space);
 
 /* ne2000.c */
diff --git a/hw/pci.c b/hw/pci.c
index 749e8d8..46c01ac 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -1687,6 +1687,19 @@ PCIDevice *pci_create_simple_multifunction(PCIBus *bus, int devfn,
     return dev;
 }
 
+PCIDevice *pci_try_create_simple_multifunction(PCIBus *bus, int devfn,
+                                               bool multifunction,
+                                               const char *name)
+{
+    PCIDevice *dev = pci_try_create_multifunction(bus, devfn, multifunction,
+                                                  name);
+    if (!dev) {
+        return NULL;
+    }
+    qdev_init_nofail(&dev->qdev);
+    return dev;
+}
+
 PCIDevice *pci_create(PCIBus *bus, int devfn, const char *name)
 {
     return pci_create_multifunction(bus, devfn, false, name);
@@ -1702,6 +1715,11 @@ PCIDevice *pci_try_create(PCIBus *bus, int devfn, const char *name)
     return pci_try_create_multifunction(bus, devfn, false, name);
 }
 
+PCIDevice *pci_try_create_simple(PCIBus *bus, int devfn, const char *name)
+{
+    return pci_try_create_simple_multifunction(bus, devfn, false, name);
+}
+
 static int pci_find_space(PCIDevice *pdev, uint8_t size)
 {
     int config_size = pci_config_size(pdev);
diff --git a/hw/pci.h b/hw/pci.h
index 86a81c8..aa2e040 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -473,9 +473,13 @@ PCIDevice *pci_create_simple_multifunction(PCIBus *bus, int devfn,
 PCIDevice *pci_try_create_multifunction(PCIBus *bus, int devfn,
                                         bool multifunction,
                                         const char *name);
+PCIDevice *pci_try_create_simple_multifunction(PCIBus *bus, int devfn,
+                                               bool multifunction,
+                                               const char *name);
 PCIDevice *pci_create(PCIBus *bus, int devfn, const char *name);
 PCIDevice *pci_create_simple(PCIBus *bus, int devfn, const char *name);
 PCIDevice *pci_try_create(PCIBus *bus, int devfn, const char *name);
+PCIDevice *pci_try_create_simple(PCIBus *bus, int devfn, const char *name);
 
 static inline int pci_is_express(const PCIDevice *d)
 {
diff --git a/hw/vga-pci.c b/hw/vga-pci.c
index 3c8bcb0..f296b19 100644
--- a/hw/vga-pci.c
+++ b/hw/vga-pci.c
@@ -70,12 +70,6 @@ static int pci_vga_initfn(PCIDevice *dev)
     return 0;
 }
 
-int pci_vga_init(PCIBus *bus)
-{
-    pci_create_simple(bus, -1, "VGA");
-    return 0;
-}
-
 static PCIDeviceInfo vga_info = {
     .qdev.name = "VGA",
     .qdev.size = sizeof(PCIVGAState),
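
As a usage note: the point of the try_ variants is that they return NULL
when the named qdev model is not built in, instead of exiting, which is
what makes a graceful fallback possible. A minimal sketch of the pattern
(the function name here is hypothetical; it just mirrors what the
pc_vga_init() hunk above does):

    /* Hypothetical board init: prefer cirrus-vga, fall back to the
     * standard VGA model when the Cirrus model was compiled out. */
    static void example_vga_init(PCIBus *pci_bus)
    {
        if (!pci_try_create_simple(pci_bus, -1, "cirrus-vga")) {
            fprintf(stderr, "cirrus-vga not available, using standard VGA\n");
            pci_create_simple(pci_bus, -1, "VGA");
        }
    }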