[uq/master,2/2] KVM: make XSAVE support more robust

Message ID 1378386382-415-3-git-send-email-pbonzini@redhat.com
State New

Commit Message

Paolo Bonzini Sept. 5, 2013, 1:06 p.m. UTC
QEMU moves state from CPUArchState to struct kvm_xsave and back when it
invokes the KVM_*_XSAVE ioctls.  Because it doesn't treat the XSAVE
region as an opaque blob, it might be impossible to set some state on
the destination if migrating to an older version.

This patch blocks migration if it finds that unsupported bits are set
in the XSTATE_BV header field.  To make this work robustly, QEMU should
only report in env->xstate_bv those fields that will actually end up
in the migration stream.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target-i386/kvm.c     | 3 ++-
 target-i386/machine.c | 4 ++++
 2 files changed, 6 insertions(+), 1 deletion(-)

Comments

Gleb Natapov Sept. 8, 2013, 11:52 a.m. UTC | #1
On Thu, Sep 05, 2013 at 03:06:22PM +0200, Paolo Bonzini wrote:
> QEMU moves state from CPUArchState to struct kvm_xsave and back when it
> invokes the KVM_*_XSAVE ioctls.  Because it doesn't treat the XSAVE
> region as an opaque blob, it might be impossible to set some state on
> the destination if migrating to an older version.
> 
> This patch blocks migration if it finds that unsupported bits are set
> in the XSTATE_BV header field.  To make this work robustly, QEMU should
> only report in env->xstate_bv those fields that will actually end up
> in the migration stream.
We usually handle host CPU differences in the CPUID layer, not by trying to
validate migration data. That is, CPUID.0D should be configurable, and
management should be able to query QEMU for what is supported and prevent
a migration attempt accordingly.

> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  target-i386/kvm.c     | 3 ++-
>  target-i386/machine.c | 4 ++++
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 749aa09..df08a4b 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -1291,7 +1291,8 @@ static int kvm_get_xsave(X86CPU *cpu)
>              sizeof env->fpregs);
>      memcpy(env->xmm_regs, &xsave->region[XSAVE_XMM_SPACE],
>              sizeof env->xmm_regs);
> -    env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV];
> +    env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV] &
> +            XSTATE_SUPPORTED;
Don't we just drop state here that will not be restored on the
destination and destination will not be able to tell since we masked
unsupported bits?

>      memcpy(env->ymmh_regs, &xsave->region[XSAVE_YMMH_SPACE],
>              sizeof env->ymmh_regs);
>      return 0;
> diff --git a/target-i386/machine.c b/target-i386/machine.c
> index dc81cde..9e2cfcf 100644
> --- a/target-i386/machine.c
> +++ b/target-i386/machine.c
> @@ -278,6 +278,10 @@ static int cpu_post_load(void *opaque, int version_id)
>      CPUX86State *env = &cpu->env;
>      int i;
>  
> +    if (env->xstate_bv & ~XSTATE_SUPPORTED) {
> +        return -EINVAL;
> +    }
> + 
>      /*
>       * Real mode guest segments register DPL should be zero.
>       * Older KVM version were setting it wrongly.
> -- 
> 1.8.3.1

--
			Gleb.
Paolo Bonzini Sept. 9, 2013, 8:51 a.m. UTC | #2
On 08/09/2013 13:52, Gleb Natapov wrote:
> On Thu, Sep 05, 2013 at 03:06:22PM +0200, Paolo Bonzini wrote:
>> QEMU moves state from CPUArchState to struct kvm_xsave and back when it
>> invokes the KVM_*_XSAVE ioctls.  Because it doesn't treat the XSAVE
>> region as an opaque blob, it might be impossible to set some state on
>> the destination if migrating to an older version.
>>
>> This patch blocks migration if it finds that unsupported bits are set
>> in the XSTATE_BV header field.  To make this work robustly, QEMU should
>> only report in env->xstate_bv those fields that will actually end up
>> in the migration stream.
> 
> We usually handle host CPU differences in the CPUID layer, not by trying to
> validate migration data.

Actually we do both.  QEMU for example detects invalid subsections and
blocks migration, and CPU differences also result in subsections that
the destination does not know.

But as far as QEMU is concerned, setting an unknown bit in XSTATE_BV is
not a CPU difference, it is simply invalid migration data.

> That is, CPUID.0D should be configurable, and
> management should be able to query QEMU for what is supported and prevent
> a migration attempt accordingly.

Management is already able to query QEMU for what is supported, because
new XSAVE state is always attached to new CPUID bits in leaves other
than 0Dh (e.g. EAX=07h, ECX=0h returns AVX512 and MPX support in EBX).
Indeed, QEMU should compute the 0Dh data based on those bits.
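
A rough sketch of how the 0Dh data could be derived from those feature
bits follows. This is not QEMU's actual code: the helper name
xstate_mask_from_cpuid and the macro names are invented for illustration.
Only the bit positions are taken from the architecture manuals. The
resulting mask is what CPUID.0Dh sub-leaf 0 would report in EDX:EAX.

```c
#include <stdint.h>

/* XCR0 / XSTATE_BV component bits (architectural bit positions). */
#define XSTATE_FP        (1ULL << 0)   /* x87 state */
#define XSTATE_SSE       (1ULL << 1)   /* XMM registers and MXCSR */
#define XSTATE_YMM       (1ULL << 2)   /* upper halves of YMM (AVX) */
#define XSTATE_BNDREGS   (1ULL << 3)   /* MPX bound registers */
#define XSTATE_BNDCSR    (1ULL << 4)   /* MPX config/status */
#define XSTATE_OPMASK    (1ULL << 5)   /* AVX-512 k0..k7 */
#define XSTATE_ZMM_HI256 (1ULL << 6)   /* upper halves of ZMM0..15 */
#define XSTATE_HI16_ZMM  (1ULL << 7)   /* ZMM16..31 */

/* Feature bits in the ordinary CPUID leaves. */
#define CPUID_1_ECX_AVX       (1U << 28)  /* CPUID.01h:ECX */
#define CPUID_7_0_EBX_MPX     (1U << 14)  /* CPUID.(EAX=07h,ECX=0):EBX */
#define CPUID_7_0_EBX_AVX512F (1U << 16)  /* CPUID.(EAX=07h,ECX=0):EBX */

/*
 * Hypothetical helper: build the set of XSAVE components that the guest
 * can see, starting from the feature bits exposed to it.
 */
static uint64_t xstate_mask_from_cpuid(uint32_t cpuid_1_ecx,
                                       uint32_t cpuid_7_0_ebx)
{
    /* x87 and SSE state are always part of the XSAVE area. */
    uint64_t mask = XSTATE_FP | XSTATE_SSE;

    if (cpuid_1_ecx & CPUID_1_ECX_AVX) {
        mask |= XSTATE_YMM;
    }
    if (cpuid_7_0_ebx & CPUID_7_0_EBX_MPX) {
        mask |= XSTATE_BNDREGS | XSTATE_BNDCSR;
    }
    if (cpuid_7_0_ebx & CPUID_7_0_EBX_AVX512F) {
        mask |= XSTATE_OPMASK | XSTATE_ZMM_HI256 | XSTATE_HI16_ZMM;
    }
    return mask;
}
```

With this mapping, a guest that sees AVX but neither MPX nor AVX-512
would get a mask of 0x7 (FP, SSE, YMM).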

However, KVM_GET/SET_XSAVE should still return all values supported by
the hypervisor, independent of the supported CPUID bits.

>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>  target-i386/kvm.c     | 3 ++-
>>  target-i386/machine.c | 4 ++++
>>  2 files changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
>> index 749aa09..df08a4b 100644
>> --- a/target-i386/kvm.c
>> +++ b/target-i386/kvm.c
>> @@ -1291,7 +1291,8 @@ static int kvm_get_xsave(X86CPU *cpu)
>>              sizeof env->fpregs);
>>      memcpy(env->xmm_regs, &xsave->region[XSAVE_XMM_SPACE],
>>              sizeof env->xmm_regs);
>> -    env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV];
>> +    env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV] &
>> +            XSTATE_SUPPORTED;
> Don't we just drop state here that will not be restored on the
> destination and destination will not be able to tell since we masked
> unsupported bits?

A well-behaved guest should not have modified that state anyway, since:

* the source and destination machines should have the same CPU

* since the destination QEMU does not support the feature, the source
should have masked it as well

* the guest should always probe CPUID before using a feature

There will be only one change for well-behaved guests with this patch
(and the change will be invisible to them since they behave well).
After the patch, KVM_SET_XSAVE will set the extended states to the
processor-reset state instead of all zeros.  However, all
currently defined states have a processor-reset state that is equal to
all zeros, so this change is theoretical.

In fact, perhaps even XSTATE_SUPPORTED is not restrictive enough here,
and we should hide all features that are not visible in CPUID.  It is
okay, however, to test it in cpu_post_load.
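
The two halves of the patch can be read together as the following
stand-alone sketch. The function names are invented for illustration,
and the definition of XSTATE_SUPPORTED as FP, SSE and YMM is an
assumption about what this QEMU version is able to migrate; the real
code lives in kvm_get_xsave() and cpu_post_load().

```c
#include <errno.h>
#include <stdint.h>

#define XSTATE_FP  (1ULL << 0)
#define XSTATE_SSE (1ULL << 1)
#define XSTATE_YMM (1ULL << 2)

/* Assumption: the components this QEMU build knows how to migrate. */
#define XSTATE_SUPPORTED (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)

/*
 * Source side (hypothetical name): report only the components that will
 * actually end up in the migration stream.
 */
static uint64_t source_xstate_bv(uint64_t kernel_xstate_bv)
{
    return kernel_xstate_bv & XSTATE_SUPPORTED;
}

/*
 * Destination side (hypothetical name): refuse a stream that claims
 * components this build cannot restore.
 */
static int check_incoming_xstate_bv(uint64_t xstate_bv)
{
    if (xstate_bv & ~XSTATE_SUPPORTED) {
        return -EINVAL;
    }
    return 0;
}
```

A newer source that did not mask would send, for example, 0x1f (MPX bits
set), and the destination check would fail the load instead of silently
dropping the state.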

Paolo

>>      memcpy(env->ymmh_regs, &xsave->region[XSAVE_YMMH_SPACE],
>>              sizeof env->ymmh_regs);
>>      return 0;
>> diff --git a/target-i386/machine.c b/target-i386/machine.c
>> index dc81cde..9e2cfcf 100644
>> --- a/target-i386/machine.c
>> +++ b/target-i386/machine.c
>> @@ -278,6 +278,10 @@ static int cpu_post_load(void *opaque, int version_id)
>>      CPUX86State *env = &cpu->env;
>>      int i;
>>  
>> +    if (env->xstate_bv & ~XSTATE_SUPPORTED) {
>> +        return -EINVAL;
>> +    }
>> + 
>>      /*
>>       * Real mode guest segments register DPL should be zero.
>>       * Older KVM version were setting it wrongly.
>> -- 
>> 1.8.3.1
> 
> --
> 			Gleb.
Gleb Natapov Sept. 9, 2013, 9:18 a.m. UTC | #3
On Mon, Sep 09, 2013 at 10:51:58AM +0200, Paolo Bonzini wrote:
> On 08/09/2013 13:52, Gleb Natapov wrote:
> > On Thu, Sep 05, 2013 at 03:06:22PM +0200, Paolo Bonzini wrote:
> >> QEMU moves state from CPUArchState to struct kvm_xsave and back when it
> >> invokes the KVM_*_XSAVE ioctls.  Because it doesn't treat the XSAVE
> >> region as an opaque blob, it might be impossible to set some state on
> >> the destination if migrating to an older version.
> >>
> >> This patch blocks migration if it finds that unsupported bits are set
> >> in the XSTATE_BV header field.  To make this work robustly, QEMU should
> >> only report in env->xstate_bv those fields that will actually end up
> >> in the migration stream.
> > 
> > We usually handle host CPU differences in the CPUID layer, not by trying to
> > validate migration data.
> 
> Actually we do both.  QEMU for example detects invalid subsections and
> blocks migration, and CPU differences also result in subsections that
> the destination does not know.
> 
That's different from what you do here though. If xstate_bv were in its
own subsection, things would be easier, but it is not.

> But as far as QEMU is concerned, setting an unknown bit in XSTATE_BV is
> not a CPU difference, it is simply invalid migration data.
> 
> > That is, CPUID.0D should be configurable, and
> > management should be able to query QEMU for what is supported and prevent
> > a migration attempt accordingly.
> 
> Management is already able to query QEMU for what is supported, because
> new XSAVE state is always attached to new CPUID bits in leaves other
> than 0Dh (e.g. EAX=07h, ECX=0h returns AVX512 and MPX support in EBX).
> Indeed, QEMU should compute the 0Dh data based on those bits.
If it is computable from other data, even better; that is easier for us.

> 
> However, KVM_GET/SET_XSAVE should still return all values supported by
> the hypervisor, independent of the supported CPUID bits.
> 
Why?

> >>
> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >> ---
> >>  target-i386/kvm.c     | 3 ++-
> >>  target-i386/machine.c | 4 ++++
> >>  2 files changed, 6 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> >> index 749aa09..df08a4b 100644
> >> --- a/target-i386/kvm.c
> >> +++ b/target-i386/kvm.c
> >> @@ -1291,7 +1291,8 @@ static int kvm_get_xsave(X86CPU *cpu)
> >>              sizeof env->fpregs);
> >>      memcpy(env->xmm_regs, &xsave->region[XSAVE_XMM_SPACE],
> >>              sizeof env->xmm_regs);
> >> -    env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV];
> >> +    env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV] &
> >> +            XSTATE_SUPPORTED;
> > Don't we just drop state here that will not be restored on the
> > destination and destination will not be able to tell since we masked
> > unsupported bits?
> 
> A well-behaved guest should not have modified that state anyway, since:
> 
> * the source and destination machines should have the same CPU
> 
> * since the destination QEMU does not support the feature, the source
> should have masked it as well
> 
> * the guest should always probe CPUID before using a feature
> 
Then I fail to see what the purpose of the patch is. I see two cases:
1. Each extended state has a separate CPUID bit (is this guaranteed?)
  - In this case, as you say, by matching CPUID on src and dst we guarantee
    that the migration data is good.
2. There is a state that is advertised in CPUID.0D, but does not have
   any regular "feature" CPUID bit associated with it.
  - In this case this patch will drop valid state that needs to be
    restored.

> There will be only one change for well-behaved guests with this patch
> (and the change will be invisible to them since they behave well).
> After the patch, KVM_SET_XSAVE will set the extended states to the
> processor-reset state instead of all zeros.  However, all
> currently defined states have a processor-reset state that is equal to
> all zeros, so this change is theoretical.
> 
> In fact, perhaps even XSTATE_SUPPORTED is not restrictive enough here,
> and we should hide all features that are not visible in CPUID.  It is
> okay, however, to test it in cpu_post_load.
The kernel should not even return state that is not visible in CPUID.

> 
> Paolo
> 
> >>      memcpy(env->ymmh_regs, &xsave->region[XSAVE_YMMH_SPACE],
> >>              sizeof env->ymmh_regs);
> >>      return 0;
> >> diff --git a/target-i386/machine.c b/target-i386/machine.c
> >> index dc81cde..9e2cfcf 100644
> >> --- a/target-i386/machine.c
> >> +++ b/target-i386/machine.c
> >> @@ -278,6 +278,10 @@ static int cpu_post_load(void *opaque, int version_id)
> >>      CPUX86State *env = &cpu->env;
> >>      int i;
> >>  
> >> +    if (env->xstate_bv & ~XSTATE_SUPPORTED) {
> >> +        return -EINVAL;
> >> +    }
> >> + 
> >>      /*
> >>       * Real mode guest segments register DPL should be zero.
> >>       * Older KVM version were setting it wrongly.
> >> -- 
> >> 1.8.3.1
> > 
> > --
> > 			Gleb.

--
			Gleb.
Paolo Bonzini Sept. 9, 2013, 9:50 a.m. UTC | #4
On 09/09/2013 11:18, Gleb Natapov wrote:
> On Mon, Sep 09, 2013 at 10:51:58AM +0200, Paolo Bonzini wrote:
>> On 08/09/2013 13:52, Gleb Natapov wrote:
>>> On Thu, Sep 05, 2013 at 03:06:22PM +0200, Paolo Bonzini wrote:
>>>> QEMU moves state from CPUArchState to struct kvm_xsave and back when it
>>>> invokes the KVM_*_XSAVE ioctls.  Because it doesn't treat the XSAVE
>>>> region as an opaque blob, it might be impossible to set some state on
>>>> the destination if migrating to an older version.
>>>>
>>>> This patch blocks migration if it finds that unsupported bits are set
>>>> in the XSTATE_BV header field.  To make this work robustly, QEMU should
>>>> only report in env->xstate_bv those fields that will actually end up
>>>> in the migration stream.
>>>
>>> We usually handle host CPU differences in the CPUID layer, not by trying to
>>> validate migration data.
>>
>> Actually we do both.  QEMU for example detects invalid subsections and
>> blocks migration, and CPU differences also result in subsections that
>> the destination does not know.
>>
> That's different from what you do here though. If xstate_bv were in its
> own subsection, things would be easier, but it is not.

I agree.  Things would also be easier if YMM were in its own subsection;
future XSAVE states will likely use subsections (whose presence is keyed
off bits in env->xstate_bv).
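
As an illustration of "keyed off bits in env->xstate_bv": a subsection is
written to the stream only when a predicate on the CPU state says it is
needed, and a destination that does not know the subsection fails the
load. The sketch below shows such a predicate in isolation. The struct
and function names are hypothetical stand-ins, not QEMU's VMState API.

```c
#include <stdbool.h>
#include <stdint.h>

#define XSTATE_YMM (1ULL << 2)

/* Hypothetical stand-in for the part of CPUX86State that matters here. */
struct cpu_state_sketch {
    uint64_t xstate_bv;
};

/*
 * Predicate for a hypothetical "YMM high halves" subsection: send the
 * subsection only if the guest actually has AVX state in use.
 */
static bool ymmh_subsection_needed(const struct cpu_state_sketch *env)
{
    return (env->xstate_bv & XSTATE_YMM) != 0;
}
```

A guest that never touched AVX state would then produce a stream that an
older destination can still load.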

>> However, KVM_GET/SET_XSAVE should still return all values supported by
>> the hypervisor, independent of the supported CPUID bits.
>
> Why?

Because this is not talking to the guest, it is talking to userspace.

The VCPU state is more than what is visible to the guest, and returning
all of it seems more consistent with the rest of the KVM API.  For
example, KVM_GET_FPU always returns SSE state even if the CPUID lacks
SSE and/or FXSR.

>> A well-behaved guest should not have modified that state anyway, since:
>>
>> * the source and destination machines should have the same CPU
>>
>> * since the destination QEMU does not support the feature, the source
>> should have masked it as well
>>
>> * the guest should always probe CPUID before using a feature
>>
> Then I fail to see what the purpose of the patch is. I see two cases:
> 1. Each extended state has a separate CPUID bit (is this guaranteed?)

Not guaranteed, but it has always happened so far (AVX, AVX-512, MPX).

>   - In this case, as you say, by matching CPUID on src and dst we guarantee
>     that the migration data is good.

But we don't match CPUID on src and destination.  This is something that
the user should do, but it's better if we can test it too.  Subsections
do that for us; I am, in some sense, emulating subsections for the XSAVE
states that are not stored in subsections.

>> In fact, perhaps even XSTATE_SUPPORTED is not restrictive enough here,
>> and we should hide all features that are not visible in CPUID.  It is
>> okay, however, to test it in cpu_post_load.
> 
> The kernel should not even return state that is not visible in CPUID.

That's an interesting point of view that I hadn't considered.  But just
like you asked me why it should return state that is not visible in
CPUID, I'm asking you why it should not...

Paolo

>>
>> Paolo
>>
>>>>      memcpy(env->ymmh_regs, &xsave->region[XSAVE_YMMH_SPACE],
>>>>              sizeof env->ymmh_regs);
>>>>      return 0;
>>>> diff --git a/target-i386/machine.c b/target-i386/machine.c
>>>> index dc81cde..9e2cfcf 100644
>>>> --- a/target-i386/machine.c
>>>> +++ b/target-i386/machine.c
>>>> @@ -278,6 +278,10 @@ static int cpu_post_load(void *opaque, int version_id)
>>>>      CPUX86State *env = &cpu->env;
>>>>      int i;
>>>>  
>>>> +    if (env->xstate_bv & ~XSTATE_SUPPORTED) {
>>>> +        return -EINVAL;
>>>> +    }
>>>> + 
>>>>      /*
>>>>       * Real mode guest segments register DPL should be zero.
>>>>       * Older KVM version were setting it wrongly.
>>>> -- 
>>>> 1.8.3.1
>>>
>>> --
>>> 			Gleb.
> 
> --
> 			Gleb.
Gleb Natapov Sept. 9, 2013, 10:41 a.m. UTC | #5
On Mon, Sep 09, 2013 at 11:50:03AM +0200, Paolo Bonzini wrote:
> On 09/09/2013 11:18, Gleb Natapov wrote:
> > On Mon, Sep 09, 2013 at 10:51:58AM +0200, Paolo Bonzini wrote:
> >> On 08/09/2013 13:52, Gleb Natapov wrote:
> >>> On Thu, Sep 05, 2013 at 03:06:22PM +0200, Paolo Bonzini wrote:
> >>>> QEMU moves state from CPUArchState to struct kvm_xsave and back when it
> >>>> invokes the KVM_*_XSAVE ioctls.  Because it doesn't treat the XSAVE
> >>>> region as an opaque blob, it might be impossible to set some state on
> >>>> the destination if migrating to an older version.
> >>>>
> >>>> This patch blocks migration if it finds that unsupported bits are set
> >>>> in the XSTATE_BV header field.  To make this work robustly, QEMU should
> >>>> only report in env->xstate_bv those fields that will actually end up
> >>>> in the migration stream.
> >>>
> >>> We usually handle host CPU differences in the CPUID layer, not by trying to
> >>> validate migration data.
> >>
> >> Actually we do both.  QEMU for example detects invalid subsections and
> >> blocks migration, and CPU differences also result in subsections that
> >> the destination does not know.
> >>
> > That's different from what you do here though. If xstate_bv were in its
> > own subsection, things would be easier, but it is not.
> 
> I agree.  Things would also be easier if YMM were in its own subsection;
> future XSAVE states will likely use subsections (whose presence is keyed
> off bits in env->xstate_bv).
> 
> >> However, KVM_GET/SET_XSAVE should still return all values supported by
> >> the hypervisor, independent of the supported CPUID bits.
> >
> > Why?
> 
> Because this is not talking to the guest, it is talking to userspace.
> 
> The VCPU state is more than what is visible to the guest, and returning
If a state does not affect the guest in any way, there is no reason to
migrate it.

> all of it seems more consistent with the rest of the KVM API.  For
> example, KVM_GET_FPU always returns SSE state even if the CPUID lacks
> SSE and/or FXSR.
There are counterexamples too :) If the APIC is not created, we do not return
fake information on GET_IRQCHIP.  I think nobody expected FPU state to
grow indefinitely, so a fixed, inflexible API was introduced, but now
that the CPU has flexible extended state management, it makes sense to
model it in the API too.

> 
> >> A well-behaved guest should not have modified that state anyway, since:
> >>
> >> * the source and destination machines should have the same CPU
> >>
> >> * since the destination QEMU does not support the feature, the source
> >> should have masked it as well
> >>
> >> * the guest should always probe CPUID before using a feature
> >>
> > Then I fail to see what the purpose of the patch is. I see two cases:
> > 1. Each extended state has a separate CPUID bit (is this guaranteed?)
> 
> Not guaranteed, but it has always happened so far (AVX, AVX-512, MPX).
> 
OK, so for now there is no need to make 0D configurable, but we need to provide
the correct one according to those flags, not pass through host values.
 
> >   - In this case, as you say, by matching CPUID on src and dst we guarantee
> >     that the migration data is good.
> 
> But we don't match CPUID on src and destination.  This is something that
Yes, I was saying that management infrastructure already knows how to handle
it.

> the user should do, but it's better if we can test it too.  Subsections
> do that for us; I am, in some sense, emulating subsections for the XSAVE
> states that are not stored in subsections.
We do not do it for other bits. It is currently possible to migrate to a
slightly different CPU without failure, and it may cause the guest to crash,
but we are not actively trying to catch those situations. Why is XSAVE
different?

> 
> >> In fact, perhaps even XSTATE_SUPPORTED is not restrictive enough here,
> >> and we should hide all features that are not visible in CPUID.  It is
> >> okay, however, to test it in cpu_post_load.
> > 
> > The kernel should not even return state that is not visible in CPUID.
> 
> That's an interesting point of view that I hadn't considered.  But just
> like you asked me why it should return state that is not visible in
> CPUID, I'm asking you why it should not...
> 
For a number of reasons. First, since a state is not used, there is no
point in migrating it. Second, to make the interface more deterministic for
QEMU, i.e. QEMU configures only the features it supports and gets
exactly the same state from the kernel no matter what the host CPU and
kernel version are. This patch will not be needed since the kernel will do
the job.

--
			Gleb.
Paolo Bonzini Sept. 9, 2013, noon UTC | #6
On 09/09/2013 12:41, Gleb Natapov wrote:
>>>> In fact, perhaps even XSTATE_SUPPORTED is not restrictive enough here,
>>>> and we should hide all features that are not visible in CPUID.  It is
>>>> okay, however, to test it in cpu_post_load.
>>>
>>> The kernel should not even return state that is not visible in CPUID.
>>
>> That's an interesting point of view that I hadn't considered.  But just
>> like you asked me why it should return state that is not visible in
>> CPUID, I'm asking you why it should not...
>>
> For a number of reasons. First, since a state is not used, there is no
> point in migrating it. Second, to make the interface more deterministic for
> QEMU, i.e. QEMU configures only the features it supports and gets
> exactly the same state from the kernel no matter what the host CPU and
> kernel version are. This patch will not be needed since the kernel will do
> the job.

Good reasons, thanks.  Let's do it in the kernel then and avoid this
patch altogether.

Paolo

Patch

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 749aa09..df08a4b 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1291,7 +1291,8 @@  static int kvm_get_xsave(X86CPU *cpu)
             sizeof env->fpregs);
     memcpy(env->xmm_regs, &xsave->region[XSAVE_XMM_SPACE],
             sizeof env->xmm_regs);
-    env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV];
+    env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV] &
+            XSTATE_SUPPORTED;
     memcpy(env->ymmh_regs, &xsave->region[XSAVE_YMMH_SPACE],
             sizeof env->ymmh_regs);
     return 0;
diff --git a/target-i386/machine.c b/target-i386/machine.c
index dc81cde..9e2cfcf 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -278,6 +278,10 @@  static int cpu_post_load(void *opaque, int version_id)
     CPUX86State *env = &cpu->env;
     int i;
 
+    if (env->xstate_bv & ~XSTATE_SUPPORTED) {
+        return -EINVAL;
+    }
+ 
     /*
      * Real mode guest segments register DPL should be zero.
      * Older KVM version were setting it wrongly.