diff mbox

[edk2] (PAM stuff) reset doesn't work on OVMF + SeaBIOS CSM

Message ID 1361191995.13482.107.camel@i7.infradead.org
State New
Headers show

Commit Message

David Woodhouse Feb. 18, 2013, 12:53 p.m. UTC
On Mon, 2013-02-18 at 10:40 +0000, David Woodhouse wrote:
> On Sat, 2013-02-16 at 02:37 +0100, Laszlo Ersek wrote:
> > I give up. Thanks for the help & sorry about spamming three lists.
> 
> I've managed to reproduce this on a clean F18 system. This is the stock
> qemu 1.2.2-6.fc18 on kernel 3.7.6-201.fc18.x86_64 with a newly-installed
> Fedora 18 VM in the guest.
> 
> qemu-system-x86_64 -enable-kvm -cdrom F18boot.iso -serial mon:stdio -bios OVMF.fd
> 
> On my laptop where I'd been doing most of my testing, even after running
> 'yum distro-sync qemu\*' to get back to the stock qemu, I still can't
> reproduce the issue. They are both running the *same* kernel.
> 
> I'll try reverting a whole bunch of other stuff that ought to be
> irrelevant to the stock distro packages, and see if/when it breaks...

I cannot make these two machines behave consistently. I have absolutely
no clue what is going on here.

At reset, the PAM regions are all set to '1' (read only). So the CSM
should reside in RAM at 0xffff0 but THAT SHOULDN'T MATTER. After a reset
we should be running from 0xfffffff0 and there's unconditionally ROM
there, isn't there?

Nevertheless, on my workstation as on yours, we do seem to end up
executing from the CSM in RAM when we reset. But on my laptop, it
executes the *ROM* as it should.

This patch 'fixes' it, and I think it might even be correct in itself,
but I don't think it's a correct fix for the problem we're discussing.
And I certainly want to know what's different on my laptop that makes it
work *without* this patch.

Either there's some weirdness with setting the high CS base address, on
CPU reset. Or perhaps the contents of the memory region at 0xfffffff0
have *really* been changed along with the sub-1MiB range. Or maybe the
universe just hates us...

Comments

Paolo Bonzini Feb. 18, 2013, 2:46 p.m. UTC | #1
Il 18/02/2013 13:53, David Woodhouse ha scritto:
> 
> diff --git a/hw/piix_pci.c b/hw/piix_pci.c
> index 6c77e49..6dcf1c5 100644
> --- a/hw/piix_pci.c
> +++ b/hw/piix_pci.c
> @@ -171,6 +171,23 @@ static int i440fx_load_old(QEMUFile* f, void *opaque, int version_id)
>      return 0;
>  }
>  
> +static void i440fx_reset(void *opaque)
> +{
> +    PCII440FXState *d = opaque;
> +    uint8_t *pci_conf = d->dev.config;
> +
> +    pci_conf[0x59] = 0x00; // Reset PAM setup
> +    pci_conf[0x5a] = 0x00;
> +    pci_conf[0x5b] = 0x00;
> +    pci_conf[0x5c] = 0x00;
> +    pci_conf[0x5d] = 0x00;
> +    pci_conf[0x5e] = 0x00;
> +    pci_conf[0x5f] = 0x00;
> +    pci_conf[0x72] = 0x02; // And SMM
> +
> +    i440fx_update_memory_mappings(d);
> +}
> +
>  static int i440fx_post_load(void *opaque, int version_id)
>  {
>      PCII440FXState *d = opaque;
> @@ -217,6 +234,8 @@ static int i440fx_initfn(PCIDevice *dev)
>      d->dev.config[I440FX_SMRAM] = 0x02;
>  
>      cpu_smm_register(&i440fx_set_smm, d);
> +
> +    qemu_register_reset(i440fx_reset, d);
>      return 0;
>  }

If you want to submit this patch for upstream QEMU (I agree it is a good
idea), please set dc->reset instead in i440fx_class_init.

Paolo
David Woodhouse Feb. 18, 2013, 3 p.m. UTC | #2
On Mon, 2013-02-18 at 15:46 +0100, Paolo Bonzini wrote:
> If you want to submit this patch for upstream QEMU (I agree it is a
> good idea), please set dc->reset instead in i440fx_class_init.

Thanks.

I just copied the way that PIIX3 does it... is that something that
piix3_class_init() should be doing for *its* reset function too?

I'll submit this for upstream, but I consider it a workaround for the
real bug that Laszlo has been suffering from. So I'd rather wait until
we've solved that properly, or at least until we understand why we get
such different results on different CPUs.
Paolo Bonzini Feb. 18, 2013, 3:38 p.m. UTC | #3
Il 18/02/2013 16:00, David Woodhouse ha scritto:
> On Mon, 2013-02-18 at 15:46 +0100, Paolo Bonzini wrote:
>> > If you want to submit this patch for upstream QEMU (I agree it is a
>> > good idea), please set dc->reset instead in i440fx_class_init.
> Thanks.
> 
> I just copied the way that PIIX3 does it... is that something that
> piix3_class_init() should be doing for *its* reset function too?

Yes.

> I'll submit this for upstream, but I consider it a workaround for the
> real bug that Laszlo has been suffering from. So I'd rather wait until
> we've solved that properly, or at least until we understand why we get
> such different results on different CPUs.

Indeed the difference between CPUs is puzzling.

Paolo
David Woodhouse Feb. 18, 2013, 4:48 p.m. UTC | #4
On Mon, 2013-02-18 at 16:38 +0100, Paolo Bonzini wrote:
> Indeed the difference between CPUs is puzzling.

Very much so. So far I have established that it works OK here:

cpu family	: 6
model		: 44
model name	: Intel(R) Core(TM) i7 CPU       X 980  @ 3.33GHz
stepping	: 2
microcode	: 0x5
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx
est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt aes lahf_lm arat
dtherm tpr_shadow vnmi flexpriority ept vpid


...and it works here:

cpu family	: 6
model		: 42
model name	: Intel(R) Core(TM) i7-2720QM CPU @ 2.20GHz
stepping	: 7
microcode	: 0x28
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx
smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt
tsc_deadline_timer aes xsave avx lahf_lm ida arat xsaveopt pln pts
dtherm tpr_shadow vnmi flexpriority ept vpid

... but not here:

cpu family	: 6
model		: 30
model name	: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
stepping	: 5
microcode	: 0x5
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3
cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dtherm tpr_shadow vnmi
flexpriority ept vpid

..or here:

cpu family	: 6
model		: 23
model name	: Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
stepping	: 6
microcode	: 0x606
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64
monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm
dtherm tpr_shadow vnmi flexpriority
Laszlo Ersek Feb. 18, 2013, 5:12 p.m. UTC | #5
On 02/18/13 13:53, David Woodhouse wrote:

> Nevertheless, on my workstation as on yours, we do seem to end up
> executing from the CSM in RAM when we reset. But on my laptop, it
> executes the *ROM* as it should.
>
> This patch 'fixes' it, and I think it might even be correct in itself,
> but I don't think it's a correct fix for the problem we're discussing.
> And I certainly want to know what's different on my laptop that makes it
> work *without* this patch.
>
> Either there's some weirdness with setting the high CS base address, on
> CPU reset. Or perhaps the contents of the memory region at 0xfffffff0
> have *really* been changed along with the sub-1MiB range. Or maybe the
> universe just hates us...

We're ending up in the wrong place, under 1MB (which is consistent with
your "reset the PAMs" patch -- state of PAMs should only matter below
1MB).

I single-stepped qemu-1.3.1 in x86_cpu_reset() /
cpu_x86_load_seg_cache(), and we seem to set the correct base. However
when I pause the VM when it's spinning in the reset loop, and I issue
the following in virsh:

# qemu-monitor-command --domain \
  fw-mixed.g-f18xfce2012121716.e-upstream --hmp --cmd \
  cpu 0

# qemu-monitor-command --domain \
  fw-mixed.g-f18xfce2012121716.e-upstream --hmp --cmd \
  info registers

for EIP and CS I get (from cpu_x86_dump_seg_cache(), in the
"HF_CS64_MASK clear" branch):

EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000623
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 0000f300
CS =f000 000f0000 0000ffff 0000f300
    ^    ^        ^        ^
    |    base     limit    flags
    selector

SS =0000 00000000 0000ffff 0000f300
DS =0000 00000000 0000ffff 0000f300
FS =0000 00000000 0000ffff 0000f300
GS =0000 00000000 0000ffff 0000f300
LDT=0000 00000000 0000ffff 00008200
TR =0000 feffd000 00002088 00008b00
GDT=     00000000 0000ffff
IDT=     00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=00000000000000000000000000000000 XMM01=00000000000000000000000000000000
XMM02=00000000000000000000000000000000 XMM03=00000000000000000000000000000000
XMM04=00000000000000000000000000000000 XMM05=00000000000000000000000000000000
XMM06=00000000000000000000000000000000 XMM07=00000000000000000000000000000000

(1) The three high nibbles of CS base are lost.


Furthermore, the flags value is (Intel SDM Vol.3A, 3.4.5):

  1 11  1 0011 00000000
  P DPL S type base 23:16
  ^ ^   ^
  | |   descriptor type (1 == code or data segment, 0 == system segment), DESC_S_MASK
  | descriptor privilege level (3 == least privileged)
  segment present, DESC_P_MASK

The "type" field depends on the S bit (here 1 == code/data). 0011b means
(see 3.4.5.1):

  0   0 1 1
  D/C E W A
      C R A
  ^   ^ ^ ^
  |   | | accessed, DESC_A_MASK
  |   | |
  |   | for data: 0=r/o, 1==r/w
  |   | for code: 0==exec/only, 1==exec/read, DESC_R_MASK
  |   |
  |   for data: 1==expand down
  |   for code: 1==conforming
  |
  0 == data, 1 == code, DESC_CS_MASK

The type dumped by "info registers" is "data segment, expand up,
read/write, accessed".

I believe the D/C bit (bit 11) should be set, and then 1011b would mean
"code segment, non-conforming, exec/read, accessed".

(2) x86_cpu_reset() does pass DESC_CS_MASK for R_CS, but it doesn't seem
to be present in the dumped value.


I have no idea what's going on, but vmx_set_segment() in the kernel has
a bunch of hacks for CS && selector == 0xf000 && base == 0xffff0000, and
it seems to be host processor dependent. Eg. from commit b246dd5d:

        /*
         * Fix segments for real mode guest in hosts that don't have
         * "unrestricted_mode" or it was disabled.
         * This is done to allow migration of the guests from hosts with
         * unrestricted guest like Westmere to older host that don't have
         * unrestricted guest like Nehelem.
         */
        if (vmx->rmode.vm86_active) {
                switch (seg) {
                case VCPU_SREG_CS:
                        vmcs_write32(GUEST_CS_AR_BYTES, 0xf3);
                        vmcs_write32(GUEST_CS_LIMIT, 0xffff);
                        if (vmcs_readl(GUEST_CS_BASE) == 0xffff0000)
                                vmcs_writel(GUEST_CS_BASE, 0xf0000);
                        vmcs_write16(GUEST_CS_SELECTOR,
                                     vmcs_readl(GUEST_CS_BASE) >> 4);
                        break;

Also in init_vmcb() [arch/x86/kvm/svm.c] I can see (from commit
d92899a0):

        /*
         * cs.base should really be 0xffff0000, but vmx can't handle that, so
         * be consistent with it.
         *
         * Replace when we have real mode working for vmx.
         */
        save->cs.base = 0xf0000;

Going back to vmx, vmx_vcpu_reset() [arch/x86/kvm/vmx.c]:

        /*
         * GUEST_CS_BASE should really be 0xffff0000, but VT vm86 mode
         * insists on having GUEST_CS_BASE == GUEST_CS_SELECTOR << 4.  Sigh.
         */
        if (kvm_vcpu_is_bsp(&vmx->vcpu)) {
                vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
                vmcs_writel(GUEST_CS_BASE, 0x000f0000);
        } else {
                vmcs_write16(GUEST_CS_SELECTOR, vmx->vcpu.arch.sipi_vector << 8);
                vmcs_writel(GUEST_CS_BASE, vmx->vcpu.arch.sipi_vector << 12);
        }

The leading comment and the main logic date back to commit 6aa8b732
([PATCH] kvm: userspace interface).

(3) I wanted to ask you whether your laptop CPU is "more modern" than
your workstation CPU, but from your other email I guess they're indeed
different.

Laszlo
Laszlo Ersek Feb. 18, 2013, 5:28 p.m. UTC | #6
On 02/18/13 17:48, David Woodhouse wrote:
> On Mon, 2013-02-18 at 16:38 +0100, Paolo Bonzini wrote:
>> Indeed the difference between CPUs is puzzling.
> 
> Very much so. So far I have established that it works OK here:
> 
> cpu family	: 6
> model		: 44
> model name	: Intel(R) Core(TM) i7 CPU       X 980  @ 3.33GHz
> stepping	: 2
> microcode	: 0x5
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
> xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx
> est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt aes lahf_lm arat
> dtherm tpr_shadow vnmi flexpriority ept vpid

> ... but not here:
> 
> cpu family	: 6
> model		: 30
> model name	: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
> stepping	: 5
> microcode	: 0x5
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
> nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3
> cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dtherm tpr_shadow vnmi
> flexpriority ept vpid

Maybe related to <https://patchwork.kernel.org/patch/26836/>?

Anyway, the X980 seems to have these in addition to the shared flags:

#define X86_FEATURE_AES         (4*32+25) /* AES instructions */
#define X86_FEATURE_ARAT        (7*32+ 1) /* Always Running APIC Timer */
#define X86_FEATURE_PCID        (4*32+17) /* Process Context Identifiers */
#define X86_FEATURE_PCLMULQDQ   (4*32+ 1) /* PCLMULQDQ instruction */
#define X86_FEATURE_GBPAGES     (1*32+26) /* "pdpe1gb" GB pages */

Unique to the 870 is:

#define X86_FEATURE_IDA         (7*32+ 0) /* Intel Dynamic Acceleration */
#define X86_FEATURE_SMX         (4*32+ 6) /* Safer mode */

Laszlo
Laszlo Ersek Feb. 18, 2013, 5:35 p.m. UTC | #7
On 02/18/13 18:28, Laszlo Ersek wrote:
> On 02/18/13 17:48, David Woodhouse wrote:
>> On Mon, 2013-02-18 at 16:38 +0100, Paolo Bonzini wrote:
>>> Indeed the difference between CPUs is puzzling.
>>
>> Very much so. So far I have established that it works OK here:
>>
>> cpu family	: 6
>> model		: 44
>> model name	: Intel(R) Core(TM) i7 CPU       X 980  @ 3.33GHz
>> stepping	: 2
>> microcode	: 0x5
>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
>> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
>> pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
>> xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx
>> est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt aes lahf_lm arat
>> dtherm tpr_shadow vnmi flexpriority ept vpid
> 
>> ... but not here:
>>
>> cpu family	: 6
>> model		: 30
>> model name	: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
>> stepping	: 5
>> microcode	: 0x5
>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
>> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
>> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
>> nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3
>> cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dtherm tpr_shadow vnmi
>> flexpriority ept vpid
> 
> Maybe related to <https://patchwork.kernel.org/patch/26836/>?


http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=3a624e2
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=046d871

Can you please compare the values of

  /sys/module/kvm_intel/parameters/unrestricted_guest

between the working & non-working hosts? On my workstation, it says "N".

Assuming it says "Y" on your "working" machines, what happens if you
flip it to "N" (or load the kvm_intel module with the
"unrestricted_guest=N" modparam)?

Thanks!
Laszlo
Gleb Natapov Feb. 18, 2013, 5:45 p.m. UTC | #8
On Mon, Feb 18, 2013 at 06:12:55PM +0100, Laszlo Ersek wrote:
> On 02/18/13 13:53, David Woodhouse wrote:
> 
> > Nevertheless, on my workstation as on yours, we do seem to end up
> > executing from the CSM in RAM when we reset. But on my laptop, it
> > executes the *ROM* as it should.
> >
> > This patch 'fixes' it, and I think it might even be correct in itself,
> > but I don't think it's a correct fix for the problem we're discussing.
> > And I certainly want to know what's different on my laptop that makes it
> > work *without* this patch.
> >
> > Either there's some weirdness with setting the high CS base address, on
> > CPU reset. Or perhaps the contents of the memory region at 0xfffffff0
> > have *really* been changed along with the sub-1MiB range. Or maybe the
> > universe just hates us...
> 
> We're ending up in the wrong place, under 1MB (which is consistent with
> your "reset the PAMs" patch -- state of PAMs should only matter below
> 1MB).
> 
> I single-stepped qemu-1.3.1 in x86_cpu_reset() /
> cpu_x86_load_seg_cache(), and we seem to set the correct base. However
> when I pause the VM when it's spinning in the reset loop, and I issue
> the following in virsh:
> 
> # qemu-monitor-command --domain \
>   fw-mixed.g-f18xfce2012121716.e-upstream --hmp --cmd \
>   cpu 0
> 
> # qemu-monitor-command --domain \
>   fw-mixed.g-f18xfce2012121716.e-upstream --hmp --cmd \
>   info registers
> 
> for EIP and CS I get (from cpu_x86_dump_seg_cache(), in the
> "HF_CS64_MASK clear" branch):
> 
> EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000623
> ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
> EIP=0000fff0 EFL=00000002 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0
> ES =0000 00000000 0000ffff 0000f300
> CS =f000 000f0000 0000ffff 0000f300
>     ^    ^        ^        ^
>     |    base     limit    flags
>     selector
> 
This is because real mode is emulated as vm86 mode on intel cpus without
"unrestricted guest" flag.

--
			Gleb.
Laszlo Ersek Feb. 18, 2013, 6:16 p.m. UTC | #9
On 02/18/13 18:45, Gleb Natapov wrote:
> On Mon, Feb 18, 2013 at 06:12:55PM +0100, Laszlo Ersek wrote:

>> CS =f000 000f0000 0000ffff 0000f300
>>     ^    ^        ^        ^
>>     |    base     limit    flags
>>     selector
>>
> This is because real mode is emulated as vm86 mode on intel cpus without
> "unrestricted guest" flag.

Awesome, this supports my desperate hunch in
<http://lists.nongnu.org/archive/html/qemu-devel/2013-02/msg02689.html>.
I hope David can confirm in practice!

Thanks!
Laszlo
David Woodhouse Feb. 18, 2013, 6:28 p.m. UTC | #10
On Mon, 2013-02-18 at 19:16 +0100, Laszlo Ersek wrote:
> On 02/18/13 18:45, Gleb Natapov wrote:
> > On Mon, Feb 18, 2013 at 06:12:55PM +0100, Laszlo Ersek wrote:
> 
> >> CS =f000 000f0000 0000ffff 0000f300
> >>     ^    ^        ^        ^
> >>     |    base     limit    flags
> >>     selector
> >>
> > This is because real mode is emulated as vm86 mode on intel cpus without
> > "unrestricted guest" flag.
> 
> Awesome, this supports mys desperate hunch in
> <http://lists.nongnu.org/archive/html/qemu-devel/2013-02/msg02689.html>.
> I hope David can confirm in practice!

Yes, my working machines have unrestricted_guest support, and the
non-working machines don't. So when we're emulating it in vm86, the
extended segment base handling is broken.
Gleb Natapov Feb. 18, 2013, 6:31 p.m. UTC | #11
On Mon, Feb 18, 2013 at 07:16:25PM +0100, Laszlo Ersek wrote:
> On 02/18/13 18:45, Gleb Natapov wrote:
> > On Mon, Feb 18, 2013 at 06:12:55PM +0100, Laszlo Ersek wrote:
> 
> >> CS =f000 000f0000 0000ffff 0000f300
> >>     ^    ^        ^        ^
> >>     |    base     limit    flags
> >>     selector
> >>
> > This is because real mode is emulated as vm86 mode on intel cpus without
> > "unrestricted guest" flag.
> 
> Awesome, this supports my desperate hunch in
> <http://lists.nongnu.org/archive/html/qemu-devel/2013-02/msg02689.html>.
> I hope David can confirm in practice!
>
Laszlo explained to me that the problem is that after reset we end up
in SeaBIOS reset code instead of OVMF one. This is because kvm starts
to execute from ffff0 instead of fffffff0 after reset and this memory
location is modifying during CSM loading. Seabios solves this problem
by detecting reset condition and copying pristine image of itself from
the end of 4G to the end of 1M. OVMF should do the same, but with CSM
it does not get control back after reset since Seabios reset vector is
executed instead. Why not put OVMF reset code at reset vector in CSM
built SeaBIOS to solve the problem?

--
			Gleb.
Kevin O'Connor Feb. 18, 2013, 7 p.m. UTC | #12
On Mon, Feb 18, 2013 at 08:31:01PM +0200, Gleb Natapov wrote:
> Laszlo explained to me that the problem is that after reset we end up
> in SeaBIOS reset code instead of OVMF one. This is because kvm starts
> to execute from ffff0 instead of fffffff0 after reset and this memory
> location is modifying during CSM loading. Seabios solves this problem
> by detecting reset condition and copying pristine image of itself from
> the end of 4G to the end of 1M. OVMF should do the same, but with CSM
> it does not get control back after reset since Seabios reset vector is
> executed instead. Why not put OVMF reset code at reset vector in CSM
> built SeaBIOS to solve the problem?

Why not fix KVM so that it runs at fffffff0 after reset?

The only thing SeaBIOS could do is setup the segment registers and
then jump to fffffff0, which is a bit of work for the same end result.

-Kevin
David Woodhouse Feb. 18, 2013, 7:04 p.m. UTC | #13
On Mon, 2013-02-18 at 14:00 -0500, Kevin O'Connor wrote:
> On Mon, Feb 18, 2013 at 08:31:01PM +0200, Gleb Natapov wrote:
> > Laszlo explained to me that the problem is that after reset we end up
> > in SeaBIOS reset code instead of OVMF one. This is because kvm starts
> > to execute from ffff0 instead of fffffff0 after reset and this memory
> > location is modifying during CSM loading. Seabios solves this problem
> > by detecting reset condition and copying pristine image of itself from
> > the end of 4G to the end of 1M. OVMF should do the same, but with CSM
> > it does not get control back after reset since Seabios reset vector is
> > executed instead. Why not put OVMF reset code at reset vector in CSM
> > built SeaBIOS to solve the problem?
> 
> Why not fix KVM so that it runs at fffffff0 after reset?
> 
> The only thing SeaBIOS could do is setup the segment registers and
> then jump to fffffff0, which is a bit of work for the same end result.

Well, what SeaBIOS already *does* is bash on the keyboard controller to
cause a reset. Which *ought* to work too; I have a patch to at least fix
*that*, by resetting the PAM setup in the i440.

But yes, KVM definitely ought to be running at 0xfffffff0. This is the
*vm86* code that's broken, not the native KVM version.
Laszlo Ersek Feb. 18, 2013, 7:09 p.m. UTC | #14
On 02/18/13 20:00, Kevin O'Connor wrote:
> On Mon, Feb 18, 2013 at 08:31:01PM +0200, Gleb Natapov wrote:
>> Laszlo explained to me that the problem is that after reset we end up
>> in SeaBIOS reset code instead of OVMF one. This is because kvm starts
>> to execute from ffff0 instead of fffffff0 after reset and this memory
>> location is modifying during CSM loading. Seabios solves this problem
>> by detecting reset condition and copying pristine image of itself from
>> the end of 4G to the end of 1M. OVMF should do the same, but with CSM
>> it does not get control back after reset since Seabios reset vector is
>> executed instead. Why not put OVMF reset code at reset vector in CSM
>> built SeaBIOS to solve the problem?
> 
> Why not fix KVM so that it runs at fffffff0 after reset?
> 
> The only thing SeaBIOS could do is setup the segment registers and
> then jump to fffffff0, which is a bit of work for the same end result.

Gleb told me to test under a kvm/next host kernel; there have been many
real-mode related commits. I'll report back.

Laszlo
Kevin O'Connor Feb. 18, 2013, 7:11 p.m. UTC | #15
On Mon, Feb 18, 2013 at 07:04:08PM +0000, David Woodhouse wrote:
> Well, what SeaBIOS already *does* is bash on the keyboard controller to
> cause a reset. Which *ought* to work too; I have a patch to at least fix
> *that*, by resetting the PAM setup in the i440.

The thing to be aware of here is that not all resets are equal.  There
is old code out there that will force a reset to go from 80386 mode to
8086 mode (or was it 286 to 8086?).  So, some resets are really
resumes (which must not alter memory) and some are real resets.  It's
a mystery to me which is which, but I know this came up the last time
the QEMU reset logic was discussed.

-Kevin
Gleb Natapov Feb. 18, 2013, 7:17 p.m. UTC | #16
On Mon, Feb 18, 2013 at 02:00:52PM -0500, Kevin O'Connor wrote:
> On Mon, Feb 18, 2013 at 08:31:01PM +0200, Gleb Natapov wrote:
> > Laszlo explained to me that the problem is that after reset we end up
> > in SeaBIOS reset code instead of OVMF one. This is because kvm starts
> > to execute from ffff0 instead of fffffff0 after reset and this memory
> > location is modifying during CSM loading. Seabios solves this problem
> > by detecting reset condition and copying pristine image of itself from
> > the end of 4G to the end of 1M. OVMF should do the same, but with CSM
> > it does not get control back after reset since Seabios reset vector is
> > executed instead. Why not put OVMF reset code at reset vector in CSM
> > built SeaBIOS to solve the problem?
> 
> Why not fix KVM so that it runs at fffffff0 after reset?
> 
Because KVM uses VMX extension and VMX on CPU without "unrestricted
guest" is not capable of doing so. Recent KVM code should be able
to emulate real mode from the fffffff0 address instead of trying to
enter vmx guest mode. I asked Laszlo to check if it is so, but even if
KVM in 3.9 will work it will not fix all existent kernels out there.
Old behaviour of approximating real mode by vm86 is still supported by
using emulate_invalid_guest_state=false kernel module option and it will
be nice if it will not break OVMF since it can be used as a workaround
in case unemulated instruction is encountered.

> The only thing SeaBIOS could do is setup the segment registers and
> then jump to fffffff0, which is a bit of work for the same end result.
> 
If it will jump to fffffff0 KVM will jump to ffff0 instead :) It should
restore pre-CSM loaded OVMF state and reset.

--
			Gleb.
Kevin O'Connor Feb. 18, 2013, 7:33 p.m. UTC | #17
On Mon, Feb 18, 2013 at 09:17:05PM +0200, Gleb Natapov wrote:
> On Mon, Feb 18, 2013 at 02:00:52PM -0500, Kevin O'Connor wrote:
> > Why not fix KVM so that it runs at fffffff0 after reset?
> > 
> Because KVM uses VMX extension and VMX on CPU without "unrestricted
> guest" is not capable of doing so. Recent KVM code should be able
> to emulate real mode from the fffffff0 address instead of trying to
> enter vmx guest mode. I asked Laszlo to check if it is so, but even if
> KVM in 3.9 will work it will not fix all existent kernels out there.
> Old behaviour of approximating real mode by vm86 is still supported by
> using emulate_invalid_guest_state=false kernel module option and it will
> be nice if it will not break OVMF since it can be used as a workaround
> in case unemulated instruction is encountered.

For old versions of KVM, SeaBIOS can detect the loop and issue a
shutdown.  Not nice for users to have their "reboot" turn into a
"poweroff", but likely better than just a hang.

> > The only thing SeaBIOS could do is setup the segment registers and
> > then jump to fffffff0, which is a bit of work for the same end result.
> > 
> If it will jump to fffffff0 KVM will jump to ffff0 instead :) It should
> restore pre-CSM loaded OVMF state and reset.

I take it you mean copy 0xfffe0000 to 0xe0000?  That would not be fun.
SeaBIOS would need to detect that it's in the state (it's definitely
not correct to do that on real-hardware or on "working" kvm
instances), then setup a trampoline somewhere outside of
0xe0000-0xfffff to do the memcpy, jump to that trampoline, copy the
memory, restore segment registers, and then jump to 0xfffffff0.
That's a lot of kvm specific code to add to seabios as a workaround
and it seems fragile anyway.

-Kevin
Laszlo Ersek Feb. 18, 2013, 7:39 p.m. UTC | #18
On 02/18/13 20:09, Laszlo Ersek wrote:
> On 02/18/13 20:00, Kevin O'Connor wrote:
>> On Mon, Feb 18, 2013 at 08:31:01PM +0200, Gleb Natapov wrote:
>>> Laszlo explained to me that the problem is that after reset we end up
>>> in SeaBIOS reset code instead of OVMF one. This is because kvm starts
>>> to execute from ffff0 instead of fffffff0 after reset and this memory
>>> location is modifying during CSM loading. Seabios solves this problem
>>> by detecting reset condition and copying pristine image of itself from
>>> the end of 4G to the end of 1M. OVMF should do the same, but with CSM
>>> it does not get control back after reset since Seabios reset vector is
>>> executed instead. Why not put OVMF reset code at reset vector in CSM
>>> built SeaBIOS to solve the problem?
>>
>> Why not fix KVM so that it runs at fffffff0 after reset?
>>
>> The only thing SeaBIOS could do is setup the segment registers and
>> then jump to fffffff0, which is a bit of work for the same end result.
> 
> Gleb told me to test under a kvm/next host kernel; there have been many
> real-mode related commits. I'll report back.

I built a host kernel from
<http://git.kernel.org/?p=virt/kvm/kvm.git;a=shortlog;h=refs/heads/next>, currently
at commit cbd29cb6.

The guest reboot works now. :)

Thanks all!
Laszlo
David Woodhouse Feb. 18, 2013, 9:12 p.m. UTC | #19
On Mon, 2013-02-18 at 14:11 -0500, Kevin O'Connor wrote:
> On Mon, Feb 18, 2013 at 07:04:08PM +0000, David Woodhouse wrote:
> > Well, what SeaBIOS already *does* is bash on the keyboard controller to
> > cause a reset. Which *ought* to work too; I have a patch to at least fix
> > *that*, by resetting the PAM setup in the i440.
> 
> The thing to be aware of here is that not all resets are equal.  There
> is old code out there that will force a reset to go from 80386 mode to
> 8086 mode (or was it 286 to 8086?).  So, some resets are really
> resumes (which must not alter memory) and some are real resets.  It's
> a mystery to me which is which, but I know this came up the last time
> the QEMU reset logic was discussed.

Hm, yes. It will have been 286 to 8086, because ISTR there was no
*other* way for the CPU to get back from 286 mode.

The i440fx data sheet (§3.0) appears to say that the default values are
loaded on a *hard* reset, not a soft reset. And a reset invoked by the
keyboard controller (as SeaBIOS does) is a *soft* reset. The only way to
do a *hard* reset from software that's mentioned in the datasheet is the
PMC turbo/reset control register (port 0x93). And that, presumably, is
chipset-dependent and not something we can easily use from the reset
vector without doing a bunch of hardware probing.

I suppose we could set it up in advance, during the *first*
initialisation. Just point a 'do_hard_reset()' function pointer at a
function of our choice, perhaps with the existing keyboard reset as a
default if we don't know of anything better.

So we could probably solve the software side, in the guest... but qemu
doesn't seem to distinguish between a hard reset and a soft reset, so
there's no way to make it reset the PAM registers in one case but not
the other. Does this reset for 286->8086 mode actually work in qemu at
all? Is qemu's "reset" a hard reset, or a soft reset?

I suppose given that the RCR is part of the I440FX, and the behaviour
that we want to vary for hard vs. soft reset is also within the I440FX,
we *could* contrive to reset the PAM registers *only* when reset via the
RCR. But if I propose a patch which does it that way, will someone hunt
me down and hurt me?
Kevin O'Connor Feb. 18, 2013, 10:37 p.m. UTC | #20
On Mon, Feb 18, 2013 at 09:12:46PM +0000, David Woodhouse wrote:
> The i440fx data sheet (§3.0) appears to say that the default values are
> loaded on a *hard* reset, not a soft reset. And a reset invoked by the
> keyboard controller (as SeaBIOS does) is a *soft* reset. The only way to
> do a *hard* reset from software that's mentioned in the datasheet is the
> PMC turbo/reset control register (port 0x93). And that, presumably, is
> chipset-dependent and not something we can easily use from the reset
> vector without doing a bunch of hardware probing.

The ACPI v2 spec describes a "hard" reset register.  SeaBIOS could
extract it from the FADT and then use it.  Of course, we'd probably
want to update the QEMU ACPI tables to implement ACPI v2 then.

-Kevin
David Woodhouse Feb. 19, 2013, 11:50 a.m. UTC | #21
On Mon, 2013-02-18 at 23:08 +0000, David Woodhouse wrote:
> Laszlo has hooked up the RCR on the PIIX3 already, so something like
> this ought to make it reset the PAM setup *only* if reset via that...

> +static void i440fx_reset(DeviceState *ds)
> +{
> +    PCIDevice *dev = DO_UPCAST(PCIDevice, qdev, ds);
> +    PCII440FXState *d = DO_UPCAST(PCII440FXState, dev, dev);
> +    uint8_t *pci_conf = d->dev.config;
> +
> +    if (!d->piix3->rcr_hard_reset)
> +        return;
> +

... except that bit (referring to PIIX3 state directly from
i440fx_reset(), having stashed a pointer to it) is horrible.

I've posted a 'cleaner' but much larger and more intrusive patch which
shows how we could introduce a 'reset type' as a proper concept, which
may well be useful for other platforms and situations too.

I'm not too bothered which way we go, but it would be very good to fix
the PAM reset in qemu, because it's a genuine fix and it's *extremely*
convenient to work around the KVM CS segment base bug.
David Woodhouse Feb. 19, 2013, 3:29 p.m. UTC | #22
On Mon, 2013-02-18 at 17:37 -0500, Kevin O'Connor wrote:
> The ACPI v2 spec describes a "hard" reset register.  SeaBIOS could
> extract it from the FADT and then use it.  Of course, we'd probably
> want to update the QEMU ACPI tables to implement ACPI v2 then.

This sounded great until I actually came to implement it.

The PIIX reset at 0xcf9 requires *two* writes; one to set the reset type
and then a second write with bit 2 set to actually do the reset.

The ACPI RESET_REG definition only allows for *one* value to be written.

Is that because the PIIX will actually do a hard reset when you write
0x06 to it *anyway*, despite theoretically saying that you should write
0x02 first? Or is the ACPI definition of RESET_REG simply incapable of
being used on the PIIX?
Laszlo Ersek Feb. 19, 2013, 4:33 p.m. UTC | #23
On 02/19/13 16:29, David Woodhouse wrote:
> On Mon, 2013-02-18 at 17:37 -0500, Kevin O'Connor wrote:
>> The ACPI v2 spec describes a "hard" reset register.  SeaBIOS could
>> extract it from the FADT and then use it.  Of course, we'd probably
>> want to update the QEMU ACPI tables to implement ACPI v2 then.
>
> This sounded great until I actually came to implement it.
>
> The PIIX reset at 0xcf9 requires *two* writes; one to set the reset type
> and then a second write with bit 2 set to actually do the reset.
>
> The ACPI RESET_REG definition only allows for *one* value to be written.
>
> Is that because the PIIX will actually do a hard reset when you write
> 0x06 to it *anyway*, despite theoretically saying that you should write
> 0x02 first? Or is the ACPI definition of RESET_REG simply incapable of
> being used on the PIIX?

The linux kernel actually considers BOOT_ACPI and BOOT_CF9 separate
things; see

- native_machine_emergency_restart() [arch/x86/kernel/reboot.c],
- acpi_reboot() [drivers/acpi/reboot.c],
- acpi_reset() [drivers/acpi/acpica/hwxface.c].

BOOT_ACPI looks like a single write in any case (io space, memory, pci
config).

Funnily enough, on my Thinkpad (
  acpidump --table FACP --binary -o fadt.aml
  iasl -d fadt.aml
):

  /*
   * Intel ACPI Component Architecture
   * AML Disassembler version 20090123
   *
   * Disassembly of fadt.aml, Tue Feb 19 17:13:43 2013
   *
   * ACPI Data Table [FACP]
   *
   * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue
   */

  [000h 000  4]                    Signature : "FACP"    /* Fixed ACPI Description Table */

  [070h 112  4]        Flags (decoded below) : 0000C2AD
               Reset Register Supported (V2) : 0

  [074h 116 12]               Reset Register : <Generic Address Structure>
  [074h 116  1]                     Space ID : 01 (SystemIO)
  [075h 117  1]                    Bit Width : 08
  [076h 118  1]                   Bit Offset : 00
  [077h 119  1]                 Access Width : 00
  [078h 120  8]                      Address : 0000000000000CF9
  [080h 128  1]         Value to cause reset : 06

Same on my HP Z400. "Reset register is not supported, but you could
still write 6 to 0xcf9" :)

I'd say "6 to 0xCF9" is good enough; rcr_write() in qemu is OK with it
too (including your patch at
<http://thread.gmane.org/gmane.comp.emulators.qemu/195351/focus=195387>.)

Laszlo
Gleb Natapov Feb. 19, 2013, 6:13 p.m. UTC | #24
On Mon, Feb 18, 2013 at 02:33:23PM -0500, Kevin O'Connor wrote:
> On Mon, Feb 18, 2013 at 09:17:05PM +0200, Gleb Natapov wrote:
> > On Mon, Feb 18, 2013 at 02:00:52PM -0500, Kevin O'Connor wrote:
> > > Why not fix KVM so that it runs at fffffff0 after reset?
> > > 
> > Because KVM uses VMX extension and VMX on CPU without "unrestricted
> > guest" is not capable of doing so. Recent KVM code should be able
> > to emulate real mode from the fffffff0 address instead of trying to
> > enter vmx guest mode. I asked Laszlo to check if it is so, but even if
> > KVM in 3.9 will work it will not fix all existent kernels out there.
> > Old behaviour of approximating real mode by vm86 is still supported by
> > using emulate_invalid_guest_state=false kernel module option and it will
> > be nice if it will not break OVMF since it can be used as a workaround
> > in case unemulated instruction is encountered.
> 
> For old versions of KVM, SeaBIOS can detect the loop and issue a
> shutdown.  Not nice for users to have their "reboot" turn into a
> "poweroff", but likely better than just a hang.
> 
> > > The only thing SeaBIOS could do is setup the segment registers and
> > > then jump to fffffff0, which is a bit of work for the same end result.
> > > 
> > If it will jump to fffffff0 KVM will jump to ffff0 instead :) It should
> > restore pre-CSM loaded OVMF state and reset.
> 
> I take it you mean copy 0xfffe0000 to 0xe0000?  That would not be fun.
> SeaBIOS would need to detect that it's in the state (it's definitely
> not correct to do that on real-hardware or on "working" kvm
> instances), then setup a trampoline somewhere outside of
> 0xe0000-0xfffff to do the memcpy, jump to that trampoline, copy the
> memory, restore segment registers, and then jump to 0xfffffff0.
> That's a lot of kvm specific code to add to seabios as a workaround
> and it seems fragile anyway.
> 
Isn't this exactly what qemu_prep_reset() is doing now?

--
			Gleb.
David Woodhouse Feb. 19, 2013, 6:35 p.m. UTC | #25
On Tue, 2013-02-19 at 20:13 +0200, Gleb Natapov wrote:
> 
> > I take it you mean copy 0xfffe0000 to 0xe0000?  That would not be
> fun.
> > SeaBIOS would need to detect that it's in the state (it's definitely
> > not correct to do that on real-hardware or on "working" kvm
> > instances), then setup a trampoline somewhere outside of
> > 0xe0000-0xfffff to do the memcpy, jump to that trampoline, copy the
> > memory, restore segment registers, and then jump to 0xfffffff0.
> > That's a lot of kvm specific code to add to seabios as a workaround
> > and it seems fragile anyway.
> > 
> Isn't this exactly what qemu_prep_reset() is doing now?

No. It doesn't do the trampoline thing because it doesn't *have* to;
it's copying an identical copy of the code back over itself.
Gleb Natapov Feb. 19, 2013, 6:41 p.m. UTC | #26
On Tue, Feb 19, 2013 at 06:35:03PM +0000, David Woodhouse wrote:
> On Tue, 2013-02-19 at 20:13 +0200, Gleb Natapov wrote:
> > 
> > > I take it you mean copy 0xfffe0000 to 0xe0000?  That would not be
> > fun.
> > > SeaBIOS would need to detect that it's in the state (it's definitely
> > > not correct to do that on real-hardware or on "working" kvm
> > > instances), then setup a trampoline somewhere outside of
> > > 0xe0000-0xfffff to do the memcpy, jump to that trampoline, copy the
> > > memory, restore segment registers, and then jump to 0xfffffff0.
> > > That's a lot of kvm specific code to add to seabios as a workaround
> > > and it seems fragile anyway.
> > > 
> > Isn't this exactly what qemu_prep_reset() is doing now?
> 
> No. It doesn't do the trampoline thing because it doesn't *have* to;
> it's copying an identical copy of the code back over itself.
> 
Ah, yes of course. So does CSM takes the whole 0xe0000-0xfffff segment or
it leaves OVMF code there somewhere. CSM reset code can jump into OVMF
code in 0xe0000-0xfffff range and let it do the copy.

--
			Gleb.
David Woodhouse Feb. 19, 2013, 6:48 p.m. UTC | #27
On Tue, 2013-02-19 at 20:41 +0200, Gleb Natapov wrote:
> Ah, yes of course. So does CSM takes the whole 0xe0000-0xfffff segment or
> it leaves OVMF code there somewhere. CSM reset code can jump into OVMF
> code in 0xe0000-0xfffff range and let it do the copy.

There is no OVMF code there; OVMF doesn't bother to put *anything* into
the RAM at 1MiB-δ unless there's a CSM.

CSM code isn't supposed to be hardware-specific, but I suppose for the
CSM running under KVM case we could *potentially* have a hack at the
reset vector so that when we do find ourselves there under a buggy
qemu/KVM implementation, it could set up a trampoline, reset the PAM
registers manually (so that the KVM CS base address bug doesn't actually
*hurt* us), then try again?

I'd rather implement the 0xcf9 reset properly in qemu though, and make
SeaBIOS use that (which it can do *sanely* as a CSM if it's in the ACPI
tables).
Gleb Natapov Feb. 19, 2013, 7:01 p.m. UTC | #28
On Tue, Feb 19, 2013 at 06:48:41PM +0000, David Woodhouse wrote:
> On Tue, 2013-02-19 at 20:41 +0200, Gleb Natapov wrote:
> > Ah, yes of course. So does CSM takes the whole 0xe0000-0xfffff segment or
> > it leaves OVMF code there somewhere. CSM reset code can jump into OVMF
> > code in 0xe0000-0xfffff range and let it do the copy.
> 
> There is no OVMF code there; OVMF doesn't bother to put *anything* into
> the RAM at 1MiB-δ unless there's a CSM.
> 
It runs from ROM and do not shadow itself?

> CSM code isn't supposed to be hardware-specific, but I suppose for the
> CSM running under KVM case we could *potentially* have a hack at the
> reset vector so that when we do find ourselves there under a buggy
> qemu/KVM implementation, it could set up a trampoline, reset the PAM
> registers manually (so that the KVM CS base address bug doesn't actually
> *hurt* us), then try again?
> 
Yes, we are trying to come up with qemu/KVM specific hack here.

> I'd rather implement the 0xcf9 reset properly in qemu though, and make
> SeaBIOS use that (which it can do *sanely* as a CSM if it's in the ACPI
> tables).
> 
I didn't follow that other discussion about hard/soft reset. How proper
0xcf9 reset will fix the problem? What will it do that system_reset does
not?

--
			Gleb.
David Woodhouse Feb. 19, 2013, 7:39 p.m. UTC | #29
On Tue, 2013-02-19 at 21:01 +0200, Gleb Natapov wrote:
> On Tue, Feb 19, 2013 at 06:48:41PM +0000, David Woodhouse wrote:
> > On Tue, 2013-02-19 at 20:41 +0200, Gleb Natapov wrote:
> > > Ah, yes of course. So does CSM takes the whole 0xe0000-0xfffff segment or
> > > it leaves OVMF code there somewhere. CSM reset code can jump into OVMF
> > > code in 0xe0000-0xfffff range and let it do the copy.
> > 
> > There is no OVMF code there; OVMF doesn't bother to put *anything* into
> > the RAM at 1MiB-δ unless there's a CSM.
> > 
> It runs from ROM and do not shadow itself?

It has no need to shadow itself. It loads the SeaBIOS CSM into the range
under 1MiB, if it wants to support legacy BIOS. Other than that, it
never cares about 16-bit code at all.

> > I'd rather implement the 0xcf9 reset properly in qemu though, and make
> > SeaBIOS use that (which it can do *sanely* as a CSM if it's in the ACPI
> > tables).
> > 
> I didn't follow that other discussion about hard/soft reset. How proper
> 0xcf9 reset will fix the problem? What will it do that system_reset does
> not?

A full *hard* reset (0xcf9) will reset the PAM configuration, and thus
the BIOS from 4GiB-δ *would* be shadowed into 1MiB-δ, by hardware.

But qemu doesn't *implement* a full hard reset; it doesn't reset the PAM
registers.

And making it do so the naïve way, by just hooking up a simple device
reset function to do so, would be wrong. Because it *shouldn't* happen
on a soft reset, such as a triple-fault or a reset triggered by the
keyboard controller. Since a soft reset was the only way to get back
from 80286 protected mode to 8086 mode, some software may actually *use*
it and expect it to behave correctly.

Hence the discussion about reset handling.

We'd need to fix SeaBIOS to use the 0xcf9 reset too; currently it'll sit
in an endless loop of keyboard-induced *soft* resets anyway, because it
tries that before 0xcf9.

And in fact it probably shouldn't use the hard-coded 0xcf9 reset; it
should use the one indicated by the ACPI RESET_REG field (which *is*
0xcf9... or should be).
Laszlo Ersek Feb. 19, 2013, 8:45 p.m. UTC | #30
On 02/19/13 19:41, Gleb Natapov wrote:
> On Tue, Feb 19, 2013 at 06:35:03PM +0000, David Woodhouse wrote:
>> On Tue, 2013-02-19 at 20:13 +0200, Gleb Natapov wrote:
>>>
>>>> I take it you mean copy 0xfffe0000 to 0xe0000?  That would not be
>>> fun.
>>>> SeaBIOS would need to detect that it's in the state (it's definitely
>>>> not correct to do that on real-hardware or on "working" kvm
>>>> instances), then setup a trampoline somewhere outside of
>>>> 0xe0000-0xfffff to do the memcpy, jump to that trampoline, copy the
>>>> memory, restore segment registers, and then jump to 0xfffffff0.
>>>> That's a lot of kvm specific code to add to seabios as a workaround
>>>> and it seems fragile anyway.
>>>>
>>> Isn't this exactly what qemu_prep_reset() is doing now?
>>
>> No. It doesn't do the trampoline thing because it doesn't *have* to;
>> it's copying an identical copy of the code back over itself.
>>
> Ah, yes of course. So does CSM takes the whole 0xe0000-0xfffff segment or
> it leaves OVMF code there somewhere. CSM reset code can jump into OVMF
> code in 0xe0000-0xfffff range and let it do the copy.

I think the only thing you could know about the UEFI environment
(call-wise or jump-wise) while in the CSM is ReverseThunkCallSegment /
ReverseThunkCallOffset. Theoretically those allow the CSM to call back
into EfiCompatibility (32-bit protected mode environment).

Some problems:
- Using the reverse thunk might only be allowed if we ended up in real
mode coming through the forward thunk to begin with. When qemu/kvm
simply jumps to 0xffff0 (reset request from guest OS), this doesn't hold.

- Currently no reverse thunk functions are defined in the CSM spec, and
the implementation in TianoCore seems ... absent. The directory
"IntelFrameworkModulePkg/Csm/LegacyBiosDxe/Ipf" appears to contain some
incomplete Itanium assembly.

Anyway David is fixing qemu to reset the PAMs at hard reset, so OVMF
should show up again in the f-segment, accomodating older kvm hosts.

Laszlo
Paolo Bonzini Feb. 19, 2013, 8:49 p.m. UTC | #31
Il 19/02/2013 20:39, David Woodhouse ha scritto:
> We'd need to fix SeaBIOS to use the 0xcf9 reset too; currently it'll sit
> in an endless loop of keyboard-induced *soft* resets anyway, because it
> tries that before 0xcf9.
> 
> And in fact it probably shouldn't use the hard-coded 0xcf9 reset; it
> should use the one indicated by the ACPI RESET_REG field (which *is*
> 0xcf9... or should be).

We should implement this: http://mjg59.dreamwidth.org/3561.html

> A while back I did some tests with Windows running on top of qemu.
> This is a great way to evaluate OS behaviour, because you've got
> complete control of what's handed to the OS and what the OS tries to
> do to the hardware. And what I discovered was a little surprising. In
> the absence of an ACPI reboot vector, Windows will hit the keyboard
> controller, wait a while, hit it again and then give up. If an ACPI
> reboot vector is present, windows will poke it, try the keyboard
> controller, poke the ACPI vector again and try the keyboard
> controller one more time.
> 
> This turns out to be important. The first thing it means is that it
> generates two writes to the ACPI reboot vector. The second is that it
> leaves a gap between them while it's fiddling with the keyboard
> controller. And, shockingly, it turns out that on most systems the
> ACPI reboot vector points at 0xcf9 in system IO space. Even though
> most implementations nominally require two different values be
> written, it seems that this isn't a strict requirement and the ACPI
> method works.

Paolo
David Woodhouse Feb. 19, 2013, 11:03 p.m. UTC | #32
On Tue, 2013-02-19 at 21:49 +0100, Paolo Bonzini wrote:
> > And in fact it probably shouldn't use the hard-coded 0xcf9 reset; it
> > should use the one indicated by the ACPI RESET_REG field (which *is*
> > 0xcf9... or should be).
> 
> We should implement this: http://mjg59.dreamwidth.org/3561.html

Matthew fails to distinguish between a hard reset and a soft reset. From
the CSM if we do find ourselves running at 0xffff0 (which should never
happen except under buggy KVM emulation anyway), we really do need to be
using the 0xcf9 reset (or the ACPI reset, which is going to point to the
same thing in general), and *not* the keyboard reset. And, of course, we
need it to work correctly and reset the PAM configuration (qv).

However, a single bash on the 0xcf9 register ought to suffice so the
ACPI/kbd/ACPI/kbd loop that Matthew describes is probably acceptable. As
long as it does the ACPI one *first*.

( It's also interesting that, as Laszlo observes, machines tend to set
the RESET_REG in the FADT *without* setting the enabled bit in the FADT
flags. Does Windows use it anyway? And is there are reason for *not*
setting the enabled bit, or is it just that all PC BIOSes are written by
crack-smoking hobos that they drag in off the street, and this is just
an artefact of the rule "anything they *can* get wrong and still boot
Windows, they *will* get wrong"? )
diff mbox

Patch

diff --git a/hw/piix_pci.c b/hw/piix_pci.c
index 6c77e49..6dcf1c5 100644
--- a/hw/piix_pci.c
+++ b/hw/piix_pci.c
@@ -171,6 +171,23 @@  static int i440fx_load_old(QEMUFile* f, void *opaque, int version_id)
     return 0;
 }
 
+static void i440fx_reset(void *opaque)
+{
+    PCII440FXState *d = opaque;
+    uint8_t *pci_conf = d->dev.config;
+
+    pci_conf[0x59] = 0x00; // Reset PAM setup
+    pci_conf[0x5a] = 0x00;
+    pci_conf[0x5b] = 0x00;
+    pci_conf[0x5c] = 0x00;
+    pci_conf[0x5d] = 0x00;
+    pci_conf[0x5e] = 0x00;
+    pci_conf[0x5f] = 0x00;
+    pci_conf[0x72] = 0x02; // And SMM
+
+    i440fx_update_memory_mappings(d);
+}
+
 static int i440fx_post_load(void *opaque, int version_id)
 {
     PCII440FXState *d = opaque;
@@ -217,6 +234,8 @@  static int i440fx_initfn(PCIDevice *dev)
     d->dev.config[I440FX_SMRAM] = 0x02;
 
     cpu_smm_register(&i440fx_set_smm, d);
+
+    qemu_register_reset(i440fx_reset, d);
     return 0;
 }