Patchwork [7/7] KVM regsync: Fix do_kvm_cpu_synchronize_state data integrity issue

login
register
mail settings
Submitter Jason J. Herne
Date Dec. 21, 2012, 1:56 p.m.
Message ID <1356098191-4998-1-git-send-email-jjherne@us.ibm.com>
Download mbox | patch
Permalink /patch/207816/
State New
Headers show

Comments

Jason J. Herne - Dec. 21, 2012, 1:56 p.m.
From: "Jason J. Herne" <jjherne@us.ibm.com>

Modify syncing algorithm in do_kvm_cpu_synchronize_state to avoid
overwriting previously synced register data by calling
do_kvm_cpu_synchronize_state twice.

The problem occurs if the following sequence of events occurs:
1. kvm_arch_get_registers(env, KVM_REGSYNC_RUNTIME_STATE)
2. Use the runtime state
3. kvm_arch_get_registers(env, KVM_REGSYNC_FULL_STATE) (ignored)
4. Use the full state.

In step 4 the call to kvm_arch_get_registers() does nothing (to avoid squashing
local changes to the runtime registers), but the caller assumes the full
register state is now available.

This is fixed by encoding which registers are synced in env->kvm_vcpu_dirty and
calling kvm_arch_put_registers() to sync local changes back to KVM before
calling kvm_arch_get_registers() if we are expanding the set of synced
registers.

Signed-off-by: Jason J. Herne <jjherne@us.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
 include/exec/cpu-defs.h |    6 ++++++
 kvm-all.c               |   14 ++++++++++----
 2 files changed, 16 insertions(+), 4 deletions(-)
Alexander Graf - Jan. 3, 2013, 1:56 p.m.
On 21.12.2012, at 14:56, Jason J. Herne wrote:

> From: "Jason J. Herne" <jjherne@us.ibm.com>
> 
> Modify syncing algorithm in do_kvm_cpu_synchronize_state to avoid
> overwriting previously synced register data by calling
> do_kvm_cpu_synchronize_state twice.
> 
> The problem occurs if the following sequence of events occurs:
> 1. kvm_arch_get_registers(env, KVM_REGSYNC_RUNTIME_STATE)
> 2. Use the runtime state
> 3. kvm_arch_get_registers(env, KVM_REGSYNC_FULL_STATE) (ignored)
> 4. Use the full state.
> 
> In step 4 the call to kvm_arch_get_registers() does nothing (to avoid squashing
> local changes to the runtime registers), but the caller assumes the full
> register state is now available.
> 
> This is fixed by encoding which registers are synced in env->kvm_vcpu_dirty and
> calling kvm_arch_put_registers() to sync local changes back to KVM before
> calling kvm_arch_get_registers() if we are expanding the set of synced
> registers.
> 
> Signed-off-by: Jason J. Herne <jjherne@us.ibm.com>
> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
> include/exec/cpu-defs.h |    6 ++++++
> kvm-all.c               |   14 ++++++++++----
> 2 files changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index aea0ece..af3b6aa 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -208,6 +208,12 @@ typedef struct CPUWatchpoint {
>     struct KVMState *kvm_state;                                         \
>     struct kvm_run *kvm_run;                                            \
>     int kvm_fd;                                                         \
> +                                                                        \
> +    /* Register level indicating which vcpu registers have been synced  \
> +       from KVM, are potentially dirty due to local modifications, and  \
> +       will need to be written back to KVM.  Valid values are 0, which  \
> +       indicates no registers are dirty, or any of the KVM_REGSYNC_*    \
> +       constants defined in kvm.h */                                    \
>     int kvm_vcpu_dirty;
> 
> #endif
> diff --git a/kvm-all.c b/kvm-all.c
> index aee5bdd..858a636 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -230,7 +230,7 @@ int kvm_init_vcpu(CPUArchState *env)
> 
>     env->kvm_fd = ret;
>     env->kvm_state = s;
> -    env->kvm_vcpu_dirty = 1;
> +    env->kvm_vcpu_dirty = KVM_REGSYNC_FULL_STATE;
> 
>     mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
>     if (mmap_size < 0) {
> @@ -1489,10 +1489,16 @@ void kvm_flush_coalesced_mmio_buffer(void)
> static void do_kvm_cpu_synchronize_state(void *_args)
> {
>     struct kvm_cpu_syncstate_args *args = _args;
> +    CPUArchState *env = args->env;
> +    int register_level = args->register_level;
> 

This probably becomes more readable if we explicitly revert back to unsynced state first:

/* Write back local modifications at our current level */
if (register_level > env->kvm_vcpu_dirty) {
    kvm_arch_put_registers(...);
    env->kvm_vcpu_dirty = 0;
}

and then do the sync we are requested to do:

if (!env->kvm_vcpu_dirty) {
    ...
}


Alex

>     if (!args->env->kvm_vcpu_dirty) {
> -        kvm_arch_get_registers(args->env, args->register_level);
> -        args->env->kvm_vcpu_dirty = 1;
> +        kvm_arch_get_registers(env, register_level);
> +        env->kvm_vcpu_dirty = register_level;
> +    } else if (register_level > env->kvm_vcpu_dirty) {
> +        kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
> +        kvm_arch_get_registers(env, register_level);
> +        env->kvm_vcpu_dirty = register_level;
>     }
> }
> 
> @@ -1535,7 +1541,7 @@ int kvm_cpu_exec(CPUArchState *env)
> 
>     do {
>         if (env->kvm_vcpu_dirty) {
> -            kvm_arch_put_registers(env, KVM_REGSYNC_RUNTIME_STATE);
> +            kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
>             env->kvm_vcpu_dirty = 0;
>         }
> 
> -- 
> 1.7.9.5
>
Jason J. Herne - Jan. 3, 2013, 6:48 p.m.
On 01/03/2013 08:56 AM, Alexander Graf wrote:
>> static void do_kvm_cpu_synchronize_state(void *_args)
>> >{
>> >     struct kvm_cpu_syncstate_args *args = _args;
>> >+    CPUArchState *env = args->env;
>> >+    int register_level = args->register_level;
>> >
> This probably becomes more readable if we explicitly revert back to unsynced state first:
>
> /* Write back local modifications at our current level */
> if (register_level > env->kvm_vcpu_dirty) {
>      kvm_arch_put_registers(...);
>      env->kvm_vcpu_dirty = 0;
> }
>
> and then do the sync we are requested to do:
>
> if (!env->kvm_vcpu_dirty) {
>      ...
> }

I agree, but only if we add a second conditional to the if 1st statement 
  as such:

if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty)

This is to cover the case where the caller is asking for register level 
"1" and we're already dirty at level "2". In this case, nothing should 
happen and we'll need the "args->env->kvm_vcpu_dirty" to ensure that is 
the case.

static void do_kvm_cpu_synchronize_state(void *_args)
{
     struct kvm_cpu_syncstate_args *args = _args;
     CPUArchState *env = args->env;
     int register_level = args->register_level;

     /* Write back local modifications at our current level */
     if (args->env->kvm_vcpu_dirty && register_level > 
env->kvm_vcpu_dirty) {
         kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
         env->kvm_vcpu_dirty = 0;
     }

     if (!args->env->kvm_vcpu_dirty) {
         kvm_arch_get_registers(env, register_level);
         env->kvm_vcpu_dirty = register_level;
     }
}

Do you agree?  Thanks for your time. :)
Alexander Graf - Jan. 3, 2013, 7:09 p.m.
On 03.01.2013, at 19:48, Jason J. Herne wrote:

> On 01/03/2013 08:56 AM, Alexander Graf wrote:
>>> static void do_kvm_cpu_synchronize_state(void *_args)
>>> >{
>>> >     struct kvm_cpu_syncstate_args *args = _args;
>>> >+    CPUArchState *env = args->env;
>>> >+    int register_level = args->register_level;
>>> >
>> This probably becomes more readable if we explicitly revert back to unsynced state first:
>> 
>> /* Write back local modifications at our current level */
>> if (register_level > env->kvm_vcpu_dirty) {
>>     kvm_arch_put_registers(...);
>>     env->kvm_vcpu_dirty = 0;
>> }
>> 
>> and then do the sync we are requested to do:
>> 
>> if (!env->kvm_vcpu_dirty) {
>>     ...
>> }
> 
> I agree, but only if we add a second conditional to the if 1st statement  as such:
> 
> if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty)
> 
> This is to cover the case where the caller is asking for register level "1" and we're already dirty at level "2". In this case, nothing should happen and we'll need the "args->env->kvm_vcpu_dirty" to ensure that is the case.

As before, I'd prefer to make this explicit:

> 
> static void do_kvm_cpu_synchronize_state(void *_args)
> {
>    struct kvm_cpu_syncstate_args *args = _args;
>    CPUArchState *env = args->env;
>    int register_level = args->register_level;

if (register_level > env->kvm_vcpu_dirty) {
    /* We are more dirty than we need to - all is well */
    return;
}

> 
>    /* Write back local modifications at our current level */
>    if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty) {
>        kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
>        env->kvm_vcpu_dirty = 0;
>    }
> 
>    if (!args->env->kvm_vcpu_dirty) {
>        kvm_arch_get_registers(env, register_level);
>        env->kvm_vcpu_dirty = register_level;
>    }
> }
> 
> Do you agree?  Thanks for your time. :)

Please also check out the discussions I've had with Bharat about his watchdog patches. There we need a mechanism to synchronize registers only when we actually need to, in order to avoid potential race conditions with a kernel timer.

That particular case doesn't work well with levels. We can have multiple different potential race producers in the kernel that we need to avoid individually, so we can't always synchronize all of them when only one of them needs to be synchronized.

The big question is what we should be doing about this. We basically have 3 options:

  * implement levels, treat racy registers as "manually synchronized", as Bharat's latest patch set does
  * implement levels, add a bitmap for additional special synchronization bits
  * replace levels by bitmap

I'm quite frankly not sure which one of the 3 would be the best way forward.


Alex
Marcelo Tosatti - Jan. 4, 2013, 1:38 a.m.
On Thu, Jan 03, 2013 at 08:09:22PM +0100, Alexander Graf wrote:
> 
> On 03.01.2013, at 19:48, Jason J. Herne wrote:
> 
> > On 01/03/2013 08:56 AM, Alexander Graf wrote:
> >>> static void do_kvm_cpu_synchronize_state(void *_args)
> >>> >{
> >>> >     struct kvm_cpu_syncstate_args *args = _args;
> >>> >+    CPUArchState *env = args->env;
> >>> >+    int register_level = args->register_level;
> >>> >
> >> This probably becomes more readable if we explicitly revert back to unsynced state first:
> >> 
> >> /* Write back local modifications at our current level */
> >> if (register_level > env->kvm_vcpu_dirty) {
> >>     kvm_arch_put_registers(...);
> >>     env->kvm_vcpu_dirty = 0;
> >> }
> >> 
> >> and then do the sync we are requested to do:
> >> 
> >> if (!env->kvm_vcpu_dirty) {
> >>     ...
> >> }
> > 
> > I agree, but only if we add a second conditional to the if 1st statement  as such:
> > 
> > if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty)
> > 
> > This is to cover the case where the caller is asking for register level "1" and we're already dirty at level "2". In this case, nothing should happen and we'll need the "args->env->kvm_vcpu_dirty" to ensure that is the case.
> 
> As before, I'd prefer to make this explicit:
> 
> > 
> > static void do_kvm_cpu_synchronize_state(void *_args)
> > {
> >    struct kvm_cpu_syncstate_args *args = _args;
> >    CPUArchState *env = args->env;
> >    int register_level = args->register_level;
> 
> if (register_level > env->kvm_vcpu_dirty) {
>     /* We are more dirty than we need to - all is well */
>     return;
> }
> 
> > 
> >    /* Write back local modifications at our current level */
> >    if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty) {
> >        kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
> >        env->kvm_vcpu_dirty = 0;
> >    }
> > 
> >    if (!args->env->kvm_vcpu_dirty) {
> >        kvm_arch_get_registers(env, register_level);
> >        env->kvm_vcpu_dirty = register_level;
> >    }
> > }
> > 
> > Do you agree?  Thanks for your time. :)
> 
> Please also check out the discussions I've had with Bharat about his watchdog patches. There we need a mechanism to synchronize registers only when we actually need to, in order to avoid potential race conditions with a kernel timer.
> 
> That particular case doesn't work well with levels. We can have multiple different potential race producers in the kernel that we need to avoid individually, so we can't always synchronize all of them when only one of them needs to be synchronized.
> 
> The big question is what we should be doing about this. We basically have 3 options:
> 
>   * implement levels, treat racy registers as "manually synchronized", as Bharat's latest patch set does
>   * implement levels, add a bitmap for additional special synchronization bits
>   * replace levels by bitmap
> 
> I'm quite frankly not sure which one of the 3 would be the best way forward.
> 
> 
> Alex

Implement read/write accessors for individual registers, similarly to
how its done in the kernel. See kvm_cache_regs.h.

This also avoids the programmer from mapping "register X,Y,Z" to "level M"
manually (which is quite bug prone).

Can use KVM_GET/SET_ONEREG, then.
Bharat Bhushan - Jan. 4, 2013, 4:22 a.m.
> -----Original Message-----
> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Friday, January 04, 2013 7:08 AM
> To: Alexander Graf
> Cc: jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony Liguori; qemu-
> devel@nongnu.org qemu-devel; Bhushan Bharat-R65777
> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
> do_kvm_cpu_synchronize_state data integrity issue
> 
> On Thu, Jan 03, 2013 at 08:09:22PM +0100, Alexander Graf wrote:
> >
> > On 03.01.2013, at 19:48, Jason J. Herne wrote:
> >
> > > On 01/03/2013 08:56 AM, Alexander Graf wrote:
> > >>> static void do_kvm_cpu_synchronize_state(void *_args)
> > >>> >{
> > >>> >     struct kvm_cpu_syncstate_args *args = _args;
> > >>> >+    CPUArchState *env = args->env;
> > >>> >+    int register_level = args->register_level;
> > >>> >
> > >> This probably becomes more readable if we explicitly revert back to
> unsynced state first:
> > >>
> > >> /* Write back local modifications at our current level */ if
> > >> (register_level > env->kvm_vcpu_dirty) {
> > >>     kvm_arch_put_registers(...);
> > >>     env->kvm_vcpu_dirty = 0;
> > >> }
> > >>
> > >> and then do the sync we are requested to do:
> > >>
> > >> if (!env->kvm_vcpu_dirty) {
> > >>     ...
> > >> }
> > >
> > > I agree, but only if we add a second conditional to the if 1st statement  as
> such:
> > >
> > > if (args->env->kvm_vcpu_dirty && register_level >
> > > env->kvm_vcpu_dirty)
> > >
> > > This is to cover the case where the caller is asking for register level "1"
> and we're already dirty at level "2". In this case, nothing should happen and
> we'll need the "args->env->kvm_vcpu_dirty" to ensure that is the case.
> >
> > As before, I'd prefer to make this explicit:
> >
> > >
> > > static void do_kvm_cpu_synchronize_state(void *_args) {
> > >    struct kvm_cpu_syncstate_args *args = _args;
> > >    CPUArchState *env = args->env;
> > >    int register_level = args->register_level;
> >
> > if (register_level > env->kvm_vcpu_dirty) {
> >     /* We are more dirty than we need to - all is well */
> >     return;
> > }
> >
> > >
> > >    /* Write back local modifications at our current level */
> > >    if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty) {
> > >        kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
> > >        env->kvm_vcpu_dirty = 0;
> > >    }
> > >
> > >    if (!args->env->kvm_vcpu_dirty) {
> > >        kvm_arch_get_registers(env, register_level);
> > >        env->kvm_vcpu_dirty = register_level;
> > >    }
> > > }
> > >
> > > Do you agree?  Thanks for your time. :)
> >
> > Please also check out the discussions I've had with Bharat about his watchdog
> patches. There we need a mechanism to synchronize registers only when we
> actually need to, in order to avoid potential race conditions with a kernel
> timer.
> >
> > That particular case doesn't work well with levels. We can have multiple
> different potential race producers in the kernel that we need to avoid
> individually, so we can't always synchronize all of them when only one of them
> needs to be synchronized.
> >
> > The big question is what we should be doing about this. We basically have 3
> options:
> >
> >   * implement levels, treat racy registers as "manually synchronized", as
> Bharat's latest patch set does
> >   * implement levels, add a bitmap for additional special synchronization bits
> >   * replace levels by bitmap
> >
> > I'm quite frankly not sure which one of the 3 would be the best way forward.
> >
> >
> > Alex
> 
> Implement read/write accessors for individual registers, similarly to how its
> done in the kernel. See kvm_cache_regs.h.

Read/write for individual registers can be heavy for cases where multiple register needed to be updated. Is not it?

For cases where multiple register needed to be synchronized then I would like to elaborate the options as:

Option 1:
int timer_func(CPUArchState env)
{
	cpu_synchronize_state(env);
	//update env->timer_registers
	env->updated_regs_type |= TIMER_REGS_TYPE_BIT;
}

Change the kvm_arch_put_registers() to add followings:

int kvm_arch_put_registers(CPUArchState *env, int level)
{
	
	If (env->updated_regs_type) {
		struct kvm_sregs sregs;
		If (env->updated_regs_type & TIMER_REGS_TYPE_BIT) {
			// update sregs->timer_registers;
			// may be we can have a bitmap to tell kernel for what actually updated
		}
		If (env->updated_regs_type & XX_CPU_REGS_TYPE) {
			// update respective registers in sregs
		}

		ioctl(env, KVM_GET_SREGS, &sregs);
	}
}

This mechanism will get all registers state but this is required only once per entry in QEMU.


Option 2:
Define read_regs_type() and write_regs_type() for cases where it requires multiple registers updates:

int read_regs_type((CPUArchState env, void *regs_ptr, int reg_type)
{
     -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc.

	Switch(reg_type) {
	Case TIMER_REGS:
		Struct *timer_regs = regs_ptr;
		Ioctl(KVM_GET_SREGS, timer_regs, CPU_reg_type); 
		Break;

	Default:
		Printf(error);
}

Similarly will be the write function:
int write_regs_type((CPUArchState env, int reg_type)
{
     -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc.

	Switch(reg_type) {
	Case TIMER_REGS:
		Struct timer_regs;
		Timer_regs.reg1 = env->timer_regs->reg1;
		Timer_regs.reg2 = env->timer_regs->reg2;
		Ioctl(env, KVM_SET_SREGS, &timer_regs, CPU_reg_type); 
		Break;

	Default:
		Printf(error);
}

Int timer_func(CPUxxState env)
{
    Struct timer_regs;
    read_regs_type((env, &timer_regs,TIMER_REGS);
    //update env->timer_registers
    Write_regs_type(env, TIMER_REGS)
} 

This will get one type of register_types and can cause multiple IOCTL per entry to QEMU.
-------

Keep on using GET/SET_ONE_REG when only one registers needed to be updated.


Thanks
-Bharat

> 
> This also avoids the programmer from mapping "register X,Y,Z" to "level M"
> manually (which is quite bug prone).
> 
> Can use KVM_GET/SET_ONEREG, then.
>
Alexander Graf - Jan. 4, 2013, 8:40 a.m.
On 04.01.2013, at 05:22, Bhushan Bharat-R65777 wrote:

> 
> 
>> -----Original Message-----
>> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
>> Sent: Friday, January 04, 2013 7:08 AM
>> To: Alexander Graf
>> Cc: jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony Liguori; qemu-
>> devel@nongnu.org qemu-devel; Bhushan Bharat-R65777
>> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
>> do_kvm_cpu_synchronize_state data integrity issue
>> 
>> On Thu, Jan 03, 2013 at 08:09:22PM +0100, Alexander Graf wrote:
>>> 
>>> On 03.01.2013, at 19:48, Jason J. Herne wrote:
>>> 
>>>> On 01/03/2013 08:56 AM, Alexander Graf wrote:
>>>>>> static void do_kvm_cpu_synchronize_state(void *_args)
>>>>>>> {
>>>>>>>    struct kvm_cpu_syncstate_args *args = _args;
>>>>>>> +    CPUArchState *env = args->env;
>>>>>>> +    int register_level = args->register_level;
>>>>>>> 
>>>>> This probably becomes more readable if we explicitly revert back to
>> unsynced state first:
>>>>> 
>>>>> /* Write back local modifications at our current level */ if
>>>>> (register_level > env->kvm_vcpu_dirty) {
>>>>>    kvm_arch_put_registers(...);
>>>>>    env->kvm_vcpu_dirty = 0;
>>>>> }
>>>>> 
>>>>> and then do the sync we are requested to do:
>>>>> 
>>>>> if (!env->kvm_vcpu_dirty) {
>>>>>    ...
>>>>> }
>>>> 
>>>> I agree, but only if we add a second conditional to the if 1st statement  as
>> such:
>>>> 
>>>> if (args->env->kvm_vcpu_dirty && register_level >
>>>> env->kvm_vcpu_dirty)
>>>> 
>>>> This is to cover the case where the caller is asking for register level "1"
>> and we're already dirty at level "2". In this case, nothing should happen and
>> we'll need the "args->env->kvm_vcpu_dirty" to ensure that is the case.
>>> 
>>> As before, I'd prefer to make this explicit:
>>> 
>>>> 
>>>> static void do_kvm_cpu_synchronize_state(void *_args) {
>>>>   struct kvm_cpu_syncstate_args *args = _args;
>>>>   CPUArchState *env = args->env;
>>>>   int register_level = args->register_level;
>>> 
>>> if (register_level > env->kvm_vcpu_dirty) {
>>>    /* We are more dirty than we need to - all is well */
>>>    return;
>>> }
>>> 
>>>> 
>>>>   /* Write back local modifications at our current level */
>>>>   if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty) {
>>>>       kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
>>>>       env->kvm_vcpu_dirty = 0;
>>>>   }
>>>> 
>>>>   if (!args->env->kvm_vcpu_dirty) {
>>>>       kvm_arch_get_registers(env, register_level);
>>>>       env->kvm_vcpu_dirty = register_level;
>>>>   }
>>>> }
>>>> 
>>>> Do you agree?  Thanks for your time. :)
>>> 
>>> Please also check out the discussions I've had with Bharat about his watchdog
>> patches. There we need a mechanism to synchronize registers only when we
>> actually need to, in order to avoid potential race conditions with a kernel
>> timer.
>>> 
>>> That particular case doesn't work well with levels. We can have multiple
>> different potential race producers in the kernel that we need to avoid
>> individually, so we can't always synchronize all of them when only one of them
>> needs to be synchronized.
>>> 
>>> The big question is what we should be doing about this. We basically have 3
>> options:
>>> 
>>>  * implement levels, treat racy registers as "manually synchronized", as
>> Bharat's latest patch set does
>>>  * implement levels, add a bitmap for additional special synchronization bits
>>>  * replace levels by bitmap
>>> 
>>> I'm quite frankly not sure which one of the 3 would be the best way forward.
>>> 
>>> 
>>> Alex
>> 
>> Implement read/write accessors for individual registers, similarly to how its
>> done in the kernel. See kvm_cache_regs.h.
> 
> Read/write for individual registers can be heavy for cases where multiple register needed to be updated. Is not it?

It depends. We can still have classes of registers to synchronize that we could even synchronize using MANY_REG. For writes, things become even faster with individual accessors since we can easily coalesce all register writes until we hit the vcpu again into a MANY_REG array and write them in one go.

> 
> For cases where multiple register needed to be synchronized then I would like to elaborate the options as:
> 
> Option 1:
> int timer_func(CPUArchState env)
> {
> 	cpu_synchronize_state(env);
> 	//update env->timer_registers
> 	env->updated_regs_type |= TIMER_REGS_TYPE_BIT;

To scale this one, it's probably also more clever to swap the logic:

    env->kvm_sync_extra |= SYNC_TIMER_REGS;
    cpu_synchronize_state(env); /* gets timer regs */
    /* update env->timer_registers */
    /* timer regs will be synchronized later because kvm_sync_extra has SYNC_TIMER_REGS set */

Your case right now is special in that we don't need to get any register state, but only write it. However, if we do

    cpu_synchronize_state(env); /* will not get timer regs */
    env->kvm_sync_extra |= SYNC_TIMER_REGS;

then the next

    cpu_synchronize_state(env);

would overwrite the timer values again. So we would also need to indicate that the bits are dirty:

    cpu_synchronize_state(env); /* will not get timer regs */
    env->kvm_sync_extra |= SYNC_TIMER_REGS;
    env->kvm_sync_dirty |= SYNC_TIMER_REGS;

which cpu_synchronize_state() could use to make sure it doesn't read back any timer regs again. It would obviously also have to set it in its normal operation when it executed first ;).

> }
> 
> Change the kvm_arch_put_registers() to add followings:
> 
> int kvm_arch_put_registers(CPUArchState *env, int level)
> {
> 	
> 	If (env->updated_regs_type) {
> 		struct kvm_sregs sregs;
> 		If (env->updated_regs_type & TIMER_REGS_TYPE_BIT) {
> 			// update sregs->timer_registers;
> 			// may be we can have a bitmap to tell kernel for what actually updated
> 		}
> 		If (env->updated_regs_type & XX_CPU_REGS_TYPE) {
> 			// update respective registers in sregs
> 		}
> 
> 		ioctl(env, KVM_GET_SREGS, &sregs);
> 	}
> }
> 
> This mechanism will get all registers state but this is required only once per entry in QEMU.

I don't understand this bit.

> 
> 
> Option 2:
> Define read_regs_type() and write_regs_type() for cases where it requires multiple registers updates:
> 
> int read_regs_type((CPUArchState env, void *regs_ptr, int reg_type)
> {
>     -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc.
> 
> 	Switch(reg_type) {
> 	Case TIMER_REGS:
> 		Struct *timer_regs = regs_ptr;
> 		Ioctl(KVM_GET_SREGS, timer_regs, CPU_reg_type); 
> 		Break;
> 
> 	Default:
> 		Printf(error);
> }
> 
> Similarly will be the write function:
> int write_regs_type((CPUArchState env, int reg_type)
> {
>     -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc.
> 
> 	Switch(reg_type) {
> 	Case TIMER_REGS:
> 		Struct timer_regs;
> 		Timer_regs.reg1 = env->timer_regs->reg1;
> 		Timer_regs.reg2 = env->timer_regs->reg2;
> 		Ioctl(env, KVM_SET_SREGS, &timer_regs, CPU_reg_type); 
> 		Break;
> 
> 	Default:
> 		Printf(error);
> }
> 
> Int timer_func(CPUxxState env)
> {
>    Struct timer_regs;
>    read_regs_type((env, &timer_regs,TIMER_REGS);
>    //update env->timer_registers
>    Write_regs_type(env, TIMER_REGS)
> } 
> 
> This will get one type of register_types and can cause multiple IOCTL per entry to QEMU.

This is still too explicit. How about

static inline void ppc_set_tsr(CPUState *env, target_ulong val)
{
    env->kvm_sync_extra |= SYNC_TIMER_REGS;
    cpu_synchronize_registers(env);
    env->spr[SPR_TSR] = val;
}

static inline void ppc_set_tcr(CPUState *env, target_ulong val)
{
    env->kvm_sync_extra |= SYNC_TIMER_REGS;
    cpu_synchronize_registers(env);
    env->spr[SPR_TCR] = val;
}

int timer_func(CPUState *env) {
    ppc_set_tsr(env, 0);
    ppc_set_tcr(env, 0);
}

The main benefit from this method is that we can do a simple search/replace to get rid of essentially all synchronization pitfalls where we might miss out on some code path that then doesn't synchronize its register set.

> -------
> 
> Keep on using GET/SET_ONE_REG when only one registers needed to be updated.

We can always use ONE_REG for any of the variants.

The main question I have is whether it makes sense to move from

  /* state subset only touched by the VCPU itself during runtime */
  #define KVM_PUT_RUNTIME_STATE   1
  /* state subset modified during VCPU reset */
  #define KVM_PUT_RESET_STATE     2
  /* full state set, modified during initialization or on vmload */
  #define KVM_PUT_FULL_STATE      3

to a bitmap. So we would always only synchronize KVM_PUT_RUNTIME_STATE, but keep a bitmap as explained above around for any state that is more elaborate. For easy conversion, we could even add bits for SYNC_RESET and SYNC_FULL = (SYNC_RESET | SYNC_FULL_REAL).

That should get rid of all the level based logic, giving us a lot more flexibility on what we want to synchronize.


Alex
Bharat Bhushan - Jan. 4, 2013, 8:57 a.m.
> -----Original Message-----
> From: Alexander Graf [mailto:agraf@suse.de]
> Sent: Friday, January 04, 2013 2:11 PM
> To: Bhushan Bharat-R65777
> Cc: Marcelo Tosatti; jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony
> Liguori; qemu-devel@nongnu.org qemu-devel
> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
> do_kvm_cpu_synchronize_state data integrity issue
> 
> 
> On 04.01.2013, at 05:22, Bhushan Bharat-R65777 wrote:
> 
> >
> >
> >> -----Original Message-----
> >> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> >> Sent: Friday, January 04, 2013 7:08 AM
> >> To: Alexander Graf
> >> Cc: jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony
> >> Liguori; qemu- devel@nongnu.org qemu-devel; Bhushan Bharat-R65777
> >> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
> >> do_kvm_cpu_synchronize_state data integrity issue
> >>
> >> On Thu, Jan 03, 2013 at 08:09:22PM +0100, Alexander Graf wrote:
> >>>
> >>> On 03.01.2013, at 19:48, Jason J. Herne wrote:
> >>>
> >>>> On 01/03/2013 08:56 AM, Alexander Graf wrote:
> >>>>>> static void do_kvm_cpu_synchronize_state(void *_args)
> >>>>>>> {
> >>>>>>>    struct kvm_cpu_syncstate_args *args = _args;
> >>>>>>> +    CPUArchState *env = args->env;
> >>>>>>> +    int register_level = args->register_level;
> >>>>>>>
> >>>>> This probably becomes more readable if we explicitly revert back
> >>>>> to
> >> unsynced state first:
> >>>>>
> >>>>> /* Write back local modifications at our current level */ if
> >>>>> (register_level > env->kvm_vcpu_dirty) {
> >>>>>    kvm_arch_put_registers(...);
> >>>>>    env->kvm_vcpu_dirty = 0;
> >>>>> }
> >>>>>
> >>>>> and then do the sync we are requested to do:
> >>>>>
> >>>>> if (!env->kvm_vcpu_dirty) {
> >>>>>    ...
> >>>>> }
> >>>>
> >>>> I agree, but only if we add a second conditional to the if 1st
> >>>> statement  as
> >> such:
> >>>>
> >>>> if (args->env->kvm_vcpu_dirty && register_level >
> >>>> env->kvm_vcpu_dirty)
> >>>>
> >>>> This is to cover the case where the caller is asking for register level "1"
> >> and we're already dirty at level "2". In this case, nothing should
> >> happen and we'll need the "args->env->kvm_vcpu_dirty" to ensure that is the
> case.
> >>>
> >>> As before, I'd prefer to make this explicit:
> >>>
> >>>>
> >>>> static void do_kvm_cpu_synchronize_state(void *_args) {
> >>>>   struct kvm_cpu_syncstate_args *args = _args;
> >>>>   CPUArchState *env = args->env;
> >>>>   int register_level = args->register_level;
> >>>
> >>> if (register_level > env->kvm_vcpu_dirty) {
> >>>    /* We are more dirty than we need to - all is well */
> >>>    return;
> >>> }
> >>>
> >>>>
> >>>>   /* Write back local modifications at our current level */
> >>>>   if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty) {
> >>>>       kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
> >>>>       env->kvm_vcpu_dirty = 0;
> >>>>   }
> >>>>
> >>>>   if (!args->env->kvm_vcpu_dirty) {
> >>>>       kvm_arch_get_registers(env, register_level);
> >>>>       env->kvm_vcpu_dirty = register_level;
> >>>>   }
> >>>> }
> >>>>
> >>>> Do you agree?  Thanks for your time. :)
> >>>
> >>> Please also check out the discussions I've had with Bharat about his
> >>> watchdog
> >> patches. There we need a mechanism to synchronize registers only when
> >> we actually need to, in order to avoid potential race conditions with
> >> a kernel timer.
> >>>
> >>> That particular case doesn't work well with levels. We can have
> >>> multiple
> >> different potential race producers in the kernel that we need to
> >> avoid individually, so we can't always synchronize all of them when
> >> only one of them needs to be synchronized.
> >>>
> >>> The big question is what we should be doing about this. We basically
> >>> have 3
> >> options:
> >>>
> >>>  * implement levels, treat racy registers as "manually
> >>> synchronized", as
> >> Bharat's latest patch set does
> >>>  * implement levels, add a bitmap for additional special
> >>> synchronization bits
> >>>  * replace levels by bitmap
> >>>
> >>> I'm quite frankly not sure which one of the 3 would be the best way forward.
> >>>
> >>>
> >>> Alex
> >>
> >> Implement read/write accessors for individual registers, similarly to
> >> how its done in the kernel. See kvm_cache_regs.h.
> >
> > Read/write for individual registers can be heavy for cases where multiple
> register needed to be updated. Is not it?
> 
> It depends. We can still have classes of registers to synchronize that we could
> even synchronize using MANY_REG. For writes, things become even faster with
> individual accessors since we can easily coalesce all register writes until we
> hit the vcpu again into a MANY_REG array and write them in one go.
> 
> >
> > For cases where multiple register needed to be synchronized then I would like
> to elaborate the options as:
> >
> > Option 1:
> > int timer_func(CPUArchState env)
> > {
> > 	cpu_synchronize_state(env);
> > 	//update env->timer_registers
> > 	env->updated_regs_type |= TIMER_REGS_TYPE_BIT;
> 
> To scale this one, it's probably also more clever to swap the logic:
> 
>     env->kvm_sync_extra |= SYNC_TIMER_REGS;
>     cpu_synchronize_state(env); /* gets timer regs */
>     /* update env->timer_registers */
>     /* timer regs will be synchronized later because kvm_sync_extra has
> SYNC_TIMER_REGS set */
> 
> Your case right now is special in that we don't need to get any register state,
> but only write it. However, if we do
> 
>     cpu_synchronize_state(env); /* will not get timer regs */
>     env->kvm_sync_extra |= SYNC_TIMER_REGS;
> 
> then the next
> 
>     cpu_synchronize_state(env);

Currently the cpu_sychronize_state() will fetch registers only if  " !env->kvm_vcpu_dirty " . and on first cpu_synchromize_state it is set dirty. I want same logic to continue with under this option.

Once the registered, the rest of code can keep on changing registers and setting respective bitmap in env->update_reg_type. Then finally on kvm_arch_put_register() will check all bits in env->update_reg_type for SET_SREGS ioctl.

-Bharat

> 
> would overwrite the timer values again. So we would also need to indicate that
> the bits are dirty:
> 
>     cpu_synchronize_state(env); /* will not get timer regs */
>     env->kvm_sync_extra |= SYNC_TIMER_REGS;
>     env->kvm_sync_dirty |= SYNC_TIMER_REGS;
> 
> which cpu_synchronize_state() could use to make sure it doesn't read back any
> timer regs again. It would obviously also have to set it in its normal operation
> when it executed first ;).
> 
> > }
> >
> > Change the kvm_arch_put_registers() to add followings:
> >
> > int kvm_arch_put_registers(CPUArchState *env, int level) {
> >
> > 	If (env->updated_regs_type) {
> > 		struct kvm_sregs sregs;
> > 		If (env->updated_regs_type & TIMER_REGS_TYPE_BIT) {
> > 			// update sregs->timer_registers;
> > 			// may be we can have a bitmap to tell kernel for what
> actually updated
> > 		}
> > 		If (env->updated_regs_type & XX_CPU_REGS_TYPE) {
> > 			// update respective registers in sregs
> > 		}
> >
> > 		ioctl(env, KVM_GET_SREGS, &sregs);
> > 	}
> > }
> >
> > This mechanism will get all registers state but this is required only once per
> entry in QEMU.
> 
> I don't understand this bit.
> 
> >
> >
> > Option 2:
> > Define read_regs_type() and write_regs_type() for cases where it requires
> multiple registers updates:
> >
> > int read_regs_type((CPUArchState env, void *regs_ptr, int reg_type) {
> >     -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc.
> >
> > 	Switch(reg_type) {
> > 	Case TIMER_REGS:
> > 		Struct *timer_regs = regs_ptr;
> > 		Ioctl(KVM_GET_SREGS, timer_regs, CPU_reg_type);
> > 		Break;
> >
> > 	Default:
> > 		Printf(error);
> > }
> >
> > Similarly will be the write function:
> > int write_regs_type((CPUArchState env, int reg_type) {
> >     -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc.
> >
> > 	Switch(reg_type) {
> > 	Case TIMER_REGS:
> > 		Struct timer_regs;
> > 		Timer_regs.reg1 = env->timer_regs->reg1;
> > 		Timer_regs.reg2 = env->timer_regs->reg2;
> > 		Ioctl(env, KVM_SET_SREGS, &timer_regs, CPU_reg_type);
> > 		Break;
> >
> > 	Default:
> > 		Printf(error);
> > }
> >
> > Int timer_func(CPUxxState env)
> > {
> >    Struct timer_regs;
> >    read_regs_type((env, &timer_regs,TIMER_REGS);
> >    //update env->timer_registers
> >    Write_regs_type(env, TIMER_REGS)
> > }
> >
> > This will get one type of register_types and can cause multiple IOCTL per
> entry to QEMU.
> 
> This is still too explicit. How about
> 
> static inline void ppc_set_tsr(CPUState *env, target_ulong val) {
>     env->kvm_sync_extra |= SYNC_TIMER_REGS;
>     cpu_synchronize_registers(env);
>     env->spr[SPR_TSR] = val;
> }
> 
> static inline void ppc_set_tcr(CPUState *env, target_ulong val) {
>     env->kvm_sync_extra |= SYNC_TIMER_REGS;
>     cpu_synchronize_registers(env);
>     env->spr[SPR_TCR] = val;
> }
> 
> int timer_func(CPUState *env) {
>     ppc_set_tsr(env, 0);
>     ppc_set_tcr(env, 0);
> }
> 
> The main benefit from this method is that we can do a simple search/replace to
> get rid of essentially all synchronization pitfalls where we might miss out on
> some code path that then doesn't synchronize its register set.
> 
> > -------
> >
> > Keep on using GET/SET_ONE_REG when only one registers needed to be updated.
> 
> We can always use ONE_REG for any of the variants.
> 
> The main question I have is whether it makes sense to move from
> 
>   /* state subset only touched by the VCPU itself during runtime */
>   #define KVM_PUT_RUNTIME_STATE   1
>   /* state subset modified during VCPU reset */
>   #define KVM_PUT_RESET_STATE     2
>   /* full state set, modified during initialization or on vmload */
>   #define KVM_PUT_FULL_STATE      3
> 
> to a bitmap. So we would always only synchronize KVM_PUT_RUNTIME_STATE, but keep
> a bitmap as explained above around for any state that is more elaborate. For
> easy conversion, we could even add bits for SYNC_RESET and SYNC_FULL =
> (SYNC_RESET | SYNC_FULL_REAL).
> 
> That should get rid of all the level based logic, giving us a lot more
> flexibility on what we want to synchronize.
> 
> 
> Alex
>
Alexander Graf - Jan. 4, 2013, 9:01 a.m.
On 04.01.2013, at 09:57, Bhushan Bharat-R65777 wrote:

> 
> 
>> -----Original Message-----
>> From: Alexander Graf [mailto:agraf@suse.de]
>> Sent: Friday, January 04, 2013 2:11 PM
>> To: Bhushan Bharat-R65777
>> Cc: Marcelo Tosatti; jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony
>> Liguori; qemu-devel@nongnu.org qemu-devel
>> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
>> do_kvm_cpu_synchronize_state data integrity issue
>> 
>> 
>> On 04.01.2013, at 05:22, Bhushan Bharat-R65777 wrote:
>> 
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
>>>> Sent: Friday, January 04, 2013 7:08 AM
>>>> To: Alexander Graf
>>>> Cc: jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony
>>>> Liguori; qemu- devel@nongnu.org qemu-devel; Bhushan Bharat-R65777
>>>> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
>>>> do_kvm_cpu_synchronize_state data integrity issue
>>>> 
>>>> On Thu, Jan 03, 2013 at 08:09:22PM +0100, Alexander Graf wrote:
>>>>> 
>>>>> On 03.01.2013, at 19:48, Jason J. Herne wrote:
>>>>> 
>>>>>> On 01/03/2013 08:56 AM, Alexander Graf wrote:
>>>>>>>> static void do_kvm_cpu_synchronize_state(void *_args)
>>>>>>>>> {
>>>>>>>>>   struct kvm_cpu_syncstate_args *args = _args;
>>>>>>>>> +    CPUArchState *env = args->env;
>>>>>>>>> +    int register_level = args->register_level;
>>>>>>>>> 
>>>>>>> This probably becomes more readable if we explicitly revert back
>>>>>>> to
>>>> unsynced state first:
>>>>>>> 
>>>>>>> /* Write back local modifications at our current level */ if
>>>>>>> (register_level > env->kvm_vcpu_dirty) {
>>>>>>>   kvm_arch_put_registers(...);
>>>>>>>   env->kvm_vcpu_dirty = 0;
>>>>>>> }
>>>>>>> 
>>>>>>> and then do the sync we are requested to do:
>>>>>>> 
>>>>>>> if (!env->kvm_vcpu_dirty) {
>>>>>>>   ...
>>>>>>> }
>>>>>> 
>>>>>> I agree, but only if we add a second conditional to the if 1st
>>>>>> statement  as
>>>> such:
>>>>>> 
>>>>>> if (args->env->kvm_vcpu_dirty && register_level >
>>>>>> env->kvm_vcpu_dirty)
>>>>>> 
>>>>>> This is to cover the case where the caller is asking for register level "1"
>>>> and we're already dirty at level "2". In this case, nothing should
>>>> happen and we'll need the "args->env->kvm_vcpu_dirty" to ensure that is the
>> case.
>>>>> 
>>>>> As before, I'd prefer to make this explicit:
>>>>> 
>>>>>> 
>>>>>> static void do_kvm_cpu_synchronize_state(void *_args) {
>>>>>>  struct kvm_cpu_syncstate_args *args = _args;
>>>>>>  CPUArchState *env = args->env;
>>>>>>  int register_level = args->register_level;
>>>>> 
>>>>> if (register_level > env->kvm_vcpu_dirty) {
>>>>>   /* We are more dirty than we need to - all is well */
>>>>>   return;
>>>>> }
>>>>> 
>>>>>> 
>>>>>>  /* Write back local modifications at our current level */
>>>>>>  if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty) {
>>>>>>      kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
>>>>>>      env->kvm_vcpu_dirty = 0;
>>>>>>  }
>>>>>> 
>>>>>>  if (!args->env->kvm_vcpu_dirty) {
>>>>>>      kvm_arch_get_registers(env, register_level);
>>>>>>      env->kvm_vcpu_dirty = register_level;
>>>>>>  }
>>>>>> }
>>>>>> 
>>>>>> Do you agree?  Thanks for your time. :)
>>>>> 
>>>>> Please also check out the discussions I've had with Bharat about his
>>>>> watchdog
>>>> patches. There we need a mechanism to synchronize registers only when
>>>> we actually need to, in order to avoid potential race conditions with
>>>> a kernel timer.
>>>>> 
>>>>> That particular case doesn't work well with levels. We can have
>>>>> multiple
>>>> different potential race producers in the kernel that we need to
>>>> avoid individually, so we can't always synchronize all of them when
>>>> only one of them needs to be synchronized.
>>>>> 
>>>>> The big question is what we should be doing about this. We basically
>>>>> have 3
>>>> options:
>>>>> 
>>>>> * implement levels, treat racy registers as "manually
>>>>> synchronized", as
>>>> Bharat's latest patch set does
>>>>> * implement levels, add a bitmap for additional special
>>>>> synchronization bits
>>>>> * replace levels by bitmap
>>>>> 
>>>>> I'm quite frankly not sure which one of the 3 would be the best way forward.
>>>>> 
>>>>> 
>>>>> Alex
>>>> 
>>>> Implement read/write accessors for individual registers, similarly to
>>>> how its done in the kernel. See kvm_cache_regs.h.
>>> 
>>> Read/write for individual registers can be heavy for cases where multiple
>> register needed to be updated. Is not it?
>> 
>> It depends. We can still have classes of registers to synchronize that we could
>> even synchronize using MANY_REG. For writes, things become even faster with
>> individual accessors since we can easily coalesce all register writes until we
>> hit the vcpu again into a MANY_REG array and write them in one go.
>> 
>>> 
>>> For cases where multiple register needed to be synchronized then I would like
>> to elaborate the options as:
>>> 
>>> Option 1:
>>> int timer_func(CPUArchState env)
>>> {
>>> 	cpu_synchronize_state(env);
>>> 	//update env->timer_registers
>>> 	env->updated_regs_type |= TIMER_REGS_TYPE_BIT;
>> 
>> To scale this one, it's probably also more clever to swap the logic:
>> 
>>    env->kvm_sync_extra |= SYNC_TIMER_REGS;
>>    cpu_synchronize_state(env); /* gets timer regs */
>>    /* update env->timer_registers */
>>    /* timer regs will be synchronized later because kvm_sync_extra has
>> SYNC_TIMER_REGS set */
>> 
>> Your case right now is special in that we don't need to get any register state,
>> but only write it. However, if we do
>> 
>>    cpu_synchronize_state(env); /* will not get timer regs */
>>    env->kvm_sync_extra |= SYNC_TIMER_REGS;
>> 
>> then the next
>> 
>>    cpu_synchronize_state(env);
> 
> Currently the cpu_sychronize_state() will fetch registers only if  " !env->kvm_vcpu_dirty " . and on first cpu_synchromize_state it is set dirty. I want same logic to continue with under this option.

That's exactly what this patch set here is changing, because the logic is too restrictive.

> Once the registered, the rest of code can keep on changing registers and setting respective bitmap in env->update_reg_type. Then finally on kvm_arch_put_register() will check all bits in env->update_reg_type for SET_SREGS ioctl.

If we want to be extensible, we have to think outside of the "put" bottle and also consider more advanced "get" operations.


Alex
Bharat Bhushan - Jan. 4, 2013, 10:23 a.m.
> -----Original Message-----
> From: Alexander Graf [mailto:agraf@suse.de]
> Sent: Friday, January 04, 2013 2:11 PM
> To: Bhushan Bharat-R65777
> Cc: Marcelo Tosatti; jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony
> Liguori; qemu-devel@nongnu.org qemu-devel
> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
> do_kvm_cpu_synchronize_state data integrity issue
> 
> 
> On 04.01.2013, at 05:22, Bhushan Bharat-R65777 wrote:
> 
> >
> >
> >> -----Original Message-----
> >> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> >> Sent: Friday, January 04, 2013 7:08 AM
> >> To: Alexander Graf
> >> Cc: jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony
> >> Liguori; qemu- devel@nongnu.org qemu-devel; Bhushan Bharat-R65777
> >> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
> >> do_kvm_cpu_synchronize_state data integrity issue
> >>
> >> On Thu, Jan 03, 2013 at 08:09:22PM +0100, Alexander Graf wrote:
> >>>
> >>> On 03.01.2013, at 19:48, Jason J. Herne wrote:
> >>>
> >>>> On 01/03/2013 08:56 AM, Alexander Graf wrote:
> >>>>>> static void do_kvm_cpu_synchronize_state(void *_args)
> >>>>>>> {
> >>>>>>>    struct kvm_cpu_syncstate_args *args = _args;
> >>>>>>> +    CPUArchState *env = args->env;
> >>>>>>> +    int register_level = args->register_level;
> >>>>>>>
> >>>>> This probably becomes more readable if we explicitly revert back
> >>>>> to
> >> unsynced state first:
> >>>>>
> >>>>> /* Write back local modifications at our current level */ if
> >>>>> (register_level > env->kvm_vcpu_dirty) {
> >>>>>    kvm_arch_put_registers(...);
> >>>>>    env->kvm_vcpu_dirty = 0;
> >>>>> }
> >>>>>
> >>>>> and then do the sync we are requested to do:
> >>>>>
> >>>>> if (!env->kvm_vcpu_dirty) {
> >>>>>    ...
> >>>>> }
> >>>>
> >>>> I agree, but only if we add a second conditional to the if 1st
> >>>> statement  as
> >> such:
> >>>>
> >>>> if (args->env->kvm_vcpu_dirty && register_level >
> >>>> env->kvm_vcpu_dirty)
> >>>>
> >>>> This is to cover the case where the caller is asking for register level "1"
> >> and we're already dirty at level "2". In this case, nothing should
> >> happen and we'll need the "args->env->kvm_vcpu_dirty" to ensure that is the
> case.
> >>>
> >>> As before, I'd prefer to make this explicit:
> >>>
> >>>>
> >>>> static void do_kvm_cpu_synchronize_state(void *_args) {
> >>>>   struct kvm_cpu_syncstate_args *args = _args;
> >>>>   CPUArchState *env = args->env;
> >>>>   int register_level = args->register_level;
> >>>
> >>> if (register_level > env->kvm_vcpu_dirty) {
> >>>    /* We are more dirty than we need to - all is well */
> >>>    return;
> >>> }
> >>>
> >>>>
> >>>>   /* Write back local modifications at our current level */
> >>>>   if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty) {
> >>>>       kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
> >>>>       env->kvm_vcpu_dirty = 0;
> >>>>   }
> >>>>
> >>>>   if (!args->env->kvm_vcpu_dirty) {
> >>>>       kvm_arch_get_registers(env, register_level);
> >>>>       env->kvm_vcpu_dirty = register_level;
> >>>>   }
> >>>> }
> >>>>
> >>>> Do you agree?  Thanks for your time. :)
> >>>
> >>> Please also check out the discussions I've had with Bharat about his
> >>> watchdog
> >> patches. There we need a mechanism to synchronize registers only when
> >> we actually need to, in order to avoid potential race conditions with
> >> a kernel timer.
> >>>
> >>> That particular case doesn't work well with levels. We can have
> >>> multiple
> >> different potential race producers in the kernel that we need to
> >> avoid individually, so we can't always synchronize all of them when
> >> only one of them needs to be synchronized.
> >>>
> >>> The big question is what we should be doing about this. We basically
> >>> have 3
> >> options:
> >>>
> >>>  * implement levels, treat racy registers as "manually
> >>> synchronized", as
> >> Bharat's latest patch set does
> >>>  * implement levels, add a bitmap for additional special
> >>> synchronization bits
> >>>  * replace levels by bitmap
> >>>
> >>> I'm quite frankly not sure which one of the 3 would be the best way forward.
> >>>
> >>>
> >>> Alex
> >>
> >> Implement read/write accessors for individual registers, similarly to
> >> how its done in the kernel. See kvm_cache_regs.h.
> >
> > Read/write for individual registers can be heavy for cases where multiple
> register needed to be updated. Is not it?
> 
> It depends. We can still have classes of registers to synchronize that we could
> even synchronize using MANY_REG. For writes, things become even faster with
> individual accessors since we can easily coalesce all register writes until we
> hit the vcpu again into a MANY_REG array and write them in one go.
> 
> >
> > For cases where multiple register needed to be synchronized then I would like
> to elaborate the options as:
> >
> > Option 1:
> > int timer_func(CPUArchState env)
> > {
> > 	cpu_synchronize_state(env);
> > 	//update env->timer_registers
> > 	env->updated_regs_type |= TIMER_REGS_TYPE_BIT;
> 
> To scale this one, it's probably also more clever to swap the logic:
> 
>     env->kvm_sync_extra |= SYNC_TIMER_REGS;
>     cpu_synchronize_state(env); /* gets timer regs */
>     /* update env->timer_registers */
>     /* timer regs will be synchronized later because kvm_sync_extra has
> SYNC_TIMER_REGS set */
> 
> Your case right now is special in that we don't need to get any register state,
> but only write it. However, if we do
> 
>     cpu_synchronize_state(env); /* will not get timer regs */
>     env->kvm_sync_extra |= SYNC_TIMER_REGS;
> 
> then the next
> 
>     cpu_synchronize_state(env);
> 
> would overwrite the timer values again. So we would also need to indicate that
> the bits are dirty:
> 
>     cpu_synchronize_state(env); /* will not get timer regs */
>     env->kvm_sync_extra |= SYNC_TIMER_REGS;
>     env->kvm_sync_dirty |= SYNC_TIMER_REGS;
> 
> which cpu_synchronize_state() could use to make sure it doesn't read back any
> timer regs again.

yes

> It would obviously also have to set it in its normal operation
> when it executed first ;).

We also need get all registers like for debugging and we should have flag like ALL_REGS etc.

But I general I agree with the idea. 

> 
> > }
> >
> > Change the kvm_arch_put_registers() to add followings:
> >
> > int kvm_arch_put_registers(CPUArchState *env, int level) {
> >
> > 	If (env->updated_regs_type) {
> > 		struct kvm_sregs sregs;
> > 		If (env->updated_regs_type & TIMER_REGS_TYPE_BIT) {
> > 			// update sregs->timer_registers;
> > 			// may be we can have a bitmap to tell kernel for what
> actually updated
> > 		}
> > 		If (env->updated_regs_type & XX_CPU_REGS_TYPE) {
> > 			// update respective registers in sregs
> > 		}
> >
> > 		ioctl(env, KVM_GET_SREGS, &sregs);
> > 	}
> > }
> >
> > This mechanism will get all registers state but this is required only once per
> entry in QEMU.
> 
> I don't understand this bit
> 
> >
> >
> > Option 2:
> > Define read_regs_type() and write_regs_type() for cases where it requires
> multiple registers updates:
> >
> > int read_regs_type((CPUArchState env, void *regs_ptr, int reg_type) {
> >     -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc.
> >
> > 	Switch(reg_type) {
> > 	Case TIMER_REGS:
> > 		Struct *timer_regs = regs_ptr;
> > 		Ioctl(KVM_GET_SREGS, timer_regs, CPU_reg_type);
> > 		Break;
> >
> > 	Default:
> > 		Printf(error);
> > }
> >
> > Similarly will be the write function:
> > int write_regs_type((CPUArchState env, int reg_type) {
> >     -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc.
> >
> > 	Switch(reg_type) {
> > 	Case TIMER_REGS:
> > 		Struct timer_regs;
> > 		Timer_regs.reg1 = env->timer_regs->reg1;
> > 		Timer_regs.reg2 = env->timer_regs->reg2;
> > 		Ioctl(env, KVM_SET_SREGS, &timer_regs, CPU_reg_type);
> > 		Break;
> >
> > 	Default:
> > 		Printf(error);
> > }
> >
> > Int timer_func(CPUxxState env)
> > {
> >    Struct timer_regs;
> >    read_regs_type((env, &timer_regs,TIMER_REGS);
> >    //update env->timer_registers
> >    Write_regs_type(env, TIMER_REGS)
> > }
> >
> > This will get one type of register_types and can cause multiple IOCTL per
> entry to QEMU.
> 
> This is still too explicit. How about
> 
> static inline void ppc_set_tsr(CPUState *env, target_ulong val) {
>     env->kvm_sync_extra |= SYNC_TIMER_REGS;
>     cpu_synchronize_registers(env);
>     env->spr[SPR_TSR] = val;

You also want env->kvm_sync_dirty also, right?

> }
> 
> static inline void ppc_set_tcr(CPUState *env, target_ulong val) {
>     env->kvm_sync_extra |= SYNC_TIMER_REGS;
>     cpu_synchronize_registers(env);
>     env->spr[SPR_TCR] = val;

Same as above 

> }
> 
> int timer_func(CPUState *env) {
>     ppc_set_tsr(env, 0);

SYNC_TIMERS_REGS are get only once on first call.

>     ppc_set_tcr(env, 0);
> }
> 
> The main benefit from this method is that we can do a simple search/replace to
> get rid of essentially all synchronization pitfalls where we might miss out on
> some code path that then doesn't synchronize its register set.
> 
> > -------
> >
> > Keep on using GET/SET_ONE_REG when only one registers needed to be updated.
> 
> We can always use ONE_REG for any of the variants.
> 
> The main question I have is whether it makes sense to move from
> 
>   /* state subset only touched by the VCPU itself during runtime */
>   #define KVM_PUT_RUNTIME_STATE   1
>   /* state subset modified during VCPU reset */
>   #define KVM_PUT_RESET_STATE     2
>   /* full state set, modified during initialization or on vmload */
>   #define KVM_PUT_FULL_STATE      3
> 
> to a bitmap. So we would always only synchronize KVM_PUT_RUNTIME_STATE, but keep
> a bitmap as explained above around for any state that is more elaborate. For
> easy conversion, we could even add bits for SYNC_RESET and SYNC_FULL =
> (SYNC_RESET | SYNC_FULL_REAL).

Yes, that looks better to me .

-Bharat

> 
> That should get rid of all the level based logic, giving us a lot more
> flexibility on what we want to synchronize.
> 
> 
> Alex
>
Alexander Graf - Jan. 4, 2013, 10:29 a.m.
On 04.01.2013, at 11:23, Bhushan Bharat-R65777 wrote:

> 
> 
>> -----Original Message-----
>> From: Alexander Graf [mailto:agraf@suse.de]
>> Sent: Friday, January 04, 2013 2:11 PM
>> To: Bhushan Bharat-R65777
>> Cc: Marcelo Tosatti; jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony
>> Liguori; qemu-devel@nongnu.org qemu-devel
>> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
>> do_kvm_cpu_synchronize_state data integrity issue
>> 
>> 
>> On 04.01.2013, at 05:22, Bhushan Bharat-R65777 wrote:
>> 
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
>>>> Sent: Friday, January 04, 2013 7:08 AM
>>>> To: Alexander Graf
>>>> Cc: jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony
>>>> Liguori; qemu- devel@nongnu.org qemu-devel; Bhushan Bharat-R65777
>>>> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
>>>> do_kvm_cpu_synchronize_state data integrity issue
>>>> 
>>>> On Thu, Jan 03, 2013 at 08:09:22PM +0100, Alexander Graf wrote:
>>>>> 
>>>>> On 03.01.2013, at 19:48, Jason J. Herne wrote:
>>>>> 
>>>>>> On 01/03/2013 08:56 AM, Alexander Graf wrote:
>>>>>>>> static void do_kvm_cpu_synchronize_state(void *_args)
>>>>>>>>> {
>>>>>>>>>   struct kvm_cpu_syncstate_args *args = _args;
>>>>>>>>> +    CPUArchState *env = args->env;
>>>>>>>>> +    int register_level = args->register_level;
>>>>>>>>> 
>>>>>>> This probably becomes more readable if we explicitly revert back
>>>>>>> to
>>>> unsynced state first:
>>>>>>> 
>>>>>>> /* Write back local modifications at our current level */ if
>>>>>>> (register_level > env->kvm_vcpu_dirty) {
>>>>>>>   kvm_arch_put_registers(...);
>>>>>>>   env->kvm_vcpu_dirty = 0;
>>>>>>> }
>>>>>>> 
>>>>>>> and then do the sync we are requested to do:
>>>>>>> 
>>>>>>> if (!env->kvm_vcpu_dirty) {
>>>>>>>   ...
>>>>>>> }
>>>>>> 
>>>>>> I agree, but only if we add a second conditional to the if 1st
>>>>>> statement  as
>>>> such:
>>>>>> 
>>>>>> if (args->env->kvm_vcpu_dirty && register_level >
>>>>>> env->kvm_vcpu_dirty)
>>>>>> 
>>>>>> This is to cover the case where the caller is asking for register level "1"
>>>> and we're already dirty at level "2". In this case, nothing should
>>>> happen and we'll need the "args->env->kvm_vcpu_dirty" to ensure that is the
>> case.
>>>>> 
>>>>> As before, I'd prefer to make this explicit:
>>>>> 
>>>>>> 
>>>>>> static void do_kvm_cpu_synchronize_state(void *_args) {
>>>>>>  struct kvm_cpu_syncstate_args *args = _args;
>>>>>>  CPUArchState *env = args->env;
>>>>>>  int register_level = args->register_level;
>>>>> 
>>>>> if (register_level > env->kvm_vcpu_dirty) {
>>>>>   /* We are more dirty than we need to - all is well */
>>>>>   return;
>>>>> }
>>>>> 
>>>>>> 
>>>>>>  /* Write back local modifications at our current level */
>>>>>>  if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty) {
>>>>>>      kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
>>>>>>      env->kvm_vcpu_dirty = 0;
>>>>>>  }
>>>>>> 
>>>>>>  if (!args->env->kvm_vcpu_dirty) {
>>>>>>      kvm_arch_get_registers(env, register_level);
>>>>>>      env->kvm_vcpu_dirty = register_level;
>>>>>>  }
>>>>>> }
>>>>>> 
>>>>>> Do you agree?  Thanks for your time. :)
>>>>> 
>>>>> Please also check out the discussions I've had with Bharat about his
>>>>> watchdog
>>>> patches. There we need a mechanism to synchronize registers only when
>>>> we actually need to, in order to avoid potential race conditions with
>>>> a kernel timer.
>>>>> 
>>>>> That particular case doesn't work well with levels. We can have
>>>>> multiple
>>>> different potential race producers in the kernel that we need to
>>>> avoid individually, so we can't always synchronize all of them when
>>>> only one of them needs to be synchronized.
>>>>> 
>>>>> The big question is what we should be doing about this. We basically
>>>>> have 3
>>>> options:
>>>>> 
>>>>> * implement levels, treat racy registers as "manually
>>>>> synchronized", as
>>>> Bharat's latest patch set does
>>>>> * implement levels, add a bitmap for additional special
>>>>> synchronization bits
>>>>> * replace levels by bitmap
>>>>> 
>>>>> I'm quite frankly not sure which one of the 3 would be the best way forward.
>>>>> 
>>>>> 
>>>>> Alex
>>>> 
>>>> Implement read/write accessors for individual registers, similarly to
>>>> how its done in the kernel. See kvm_cache_regs.h.
>>> 
>>> Read/write for individual registers can be heavy for cases where multiple
>> register needed to be updated. Is not it?
>> 
>> It depends. We can still have classes of registers to synchronize that we could
>> even synchronize using MANY_REG. For writes, things become even faster with
>> individual accessors since we can easily coalesce all register writes until we
>> hit the vcpu again into a MANY_REG array and write them in one go.
>> 
>>> 
>>> For cases where multiple register needed to be synchronized then I would like
>> to elaborate the options as:
>>> 
>>> Option 1:
>>> int timer_func(CPUArchState env)
>>> {
>>> 	cpu_synchronize_state(env);
>>> 	//update env->timer_registers
>>> 	env->updated_regs_type |= TIMER_REGS_TYPE_BIT;
>> 
>> To scale this one, it's probably also more clever to swap the logic:
>> 
>>    env->kvm_sync_extra |= SYNC_TIMER_REGS;
>>    cpu_synchronize_state(env); /* gets timer regs */
>>    /* update env->timer_registers */
>>    /* timer regs will be synchronized later because kvm_sync_extra has
>> SYNC_TIMER_REGS set */
>> 
>> Your case right now is special in that we don't need to get any register state,
>> but only write it. However, if we do
>> 
>>    cpu_synchronize_state(env); /* will not get timer regs */
>>    env->kvm_sync_extra |= SYNC_TIMER_REGS;
>> 
>> then the next
>> 
>>    cpu_synchronize_state(env);
>> 
>> would overwrite the timer values again. So we would also need to indicate that
>> the bits are dirty:
>> 
>>    cpu_synchronize_state(env); /* will not get timer regs */
>>    env->kvm_sync_extra |= SYNC_TIMER_REGS;
>>    env->kvm_sync_dirty |= SYNC_TIMER_REGS;
>> 
>> which cpu_synchronize_state() could use to make sure it doesn't read back any
>> timer regs again.
> 
> yes
> 
>> It would obviously also have to set it in its normal operation
>> when it executed first ;).
> 
> We also need get all registers like for debugging and we should have flag like ALL_REGS etc.

Well, we could just take the current level of "general purpose registers" as one flag.

> 
> But I general I agree with the idea. 
> 
>> 
>>> }
>>> 
>>> Change the kvm_arch_put_registers() to add followings:
>>> 
>>> int kvm_arch_put_registers(CPUArchState *env, int level) {
>>> 
>>> 	If (env->updated_regs_type) {
>>> 		struct kvm_sregs sregs;
>>> 		If (env->updated_regs_type & TIMER_REGS_TYPE_BIT) {
>>> 			// update sregs->timer_registers;
>>> 			// may be we can have a bitmap to tell kernel for what
>> actually updated
>>> 		}
>>> 		If (env->updated_regs_type & XX_CPU_REGS_TYPE) {
>>> 			// update respective registers in sregs
>>> 		}
>>> 
>>> 		ioctl(env, KVM_GET_SREGS, &sregs);
>>> 	}
>>> }
>>> 
>>> This mechanism will get all registers state but this is required only once per
>> entry in QEMU.
>> 
>> I don't understand this bit
>> 
>>> 
>>> 
>>> Option 2:
>>> Define read_regs_type() and write_regs_type() for cases where it requires
>> multiple registers updates:
>>> 
>>> int read_regs_type((CPUArchState env, void *regs_ptr, int reg_type) {
>>>    -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc.
>>> 
>>> 	Switch(reg_type) {
>>> 	Case TIMER_REGS:
>>> 		Struct *timer_regs = regs_ptr;
>>> 		Ioctl(KVM_GET_SREGS, timer_regs, CPU_reg_type);
>>> 		Break;
>>> 
>>> 	Default:
>>> 		Printf(error);
>>> }
>>> 
>>> Similarly will be the write function:
>>> int write_regs_type((CPUArchState env, int reg_type) {
>>>    -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc.
>>> 
>>> 	Switch(reg_type) {
>>> 	Case TIMER_REGS:
>>> 		Struct timer_regs;
>>> 		Timer_regs.reg1 = env->timer_regs->reg1;
>>> 		Timer_regs.reg2 = env->timer_regs->reg2;
>>> 		Ioctl(env, KVM_SET_SREGS, &timer_regs, CPU_reg_type);
>>> 		Break;
>>> 
>>> 	Default:
>>> 		Printf(error);
>>> }
>>> 
>>> Int timer_func(CPUxxState env)
>>> {
>>>   Struct timer_regs;
>>>   read_regs_type((env, &timer_regs,TIMER_REGS);
>>>   //update env->timer_registers
>>>   Write_regs_type(env, TIMER_REGS)
>>> }
>>> 
>>> This will get one type of register_types and can cause multiple IOCTL per
>> entry to QEMU.
>> 
>> This is still too explicit. How about
>> 
>> static inline void ppc_set_tsr(CPUState *env, target_ulong val) {
>>    env->kvm_sync_extra |= SYNC_TIMER_REGS;
>>    cpu_synchronize_registers(env);
>>    env->spr[SPR_TSR] = val;
> 
> You also want env->kvm_sync_dirty also, right?

Not quite, since SYNC_TIMER_REGS spans more than only TSR. So we need to make sure we fetch the non-TSR register values before we can declare TIMER_REGS as dirty.

> 
>> }
>> 
>> static inline void ppc_set_tcr(CPUState *env, target_ulong val) {
>>    env->kvm_sync_extra |= SYNC_TIMER_REGS;
>>    cpu_synchronize_registers(env);
>>    env->spr[SPR_TCR] = val;
> 
> Same as above 
> 
>> }
>> 
>> int timer_func(CPUState *env) {
>>    ppc_set_tsr(env, 0);
> 
> SYNC_TIMERS_REGS are get only once on first call.

Sure, but that's implicit. A user of these functions doesn't have to care which one comes first.


Alex
Bharat Bhushan - Jan. 4, 2013, 10:32 a.m.
> -----Original Message-----
> From: Alexander Graf [mailto:agraf@suse.de]
> Sent: Friday, January 04, 2013 3:59 PM
> To: Bhushan Bharat-R65777
> Cc: Marcelo Tosatti; jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony
> Liguori; qemu-devel@nongnu.org qemu-devel
> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
> do_kvm_cpu_synchronize_state data integrity issue
> 
> 
> On 04.01.2013, at 11:23, Bhushan Bharat-R65777 wrote:
> 
> >
> >
> >> -----Original Message-----
> >> From: Alexander Graf [mailto:agraf@suse.de]
> >> Sent: Friday, January 04, 2013 2:11 PM
> >> To: Bhushan Bharat-R65777
> >> Cc: Marcelo Tosatti; jjherne@linux.vnet.ibm.com; Christian
> >> Borntraeger; Anthony Liguori; qemu-devel@nongnu.org qemu-devel
> >> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
> >> do_kvm_cpu_synchronize_state data integrity issue
> >>
> >>
> >> On 04.01.2013, at 05:22, Bhushan Bharat-R65777 wrote:
> >>
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> >>>> Sent: Friday, January 04, 2013 7:08 AM
> >>>> To: Alexander Graf
> >>>> Cc: jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony
> >>>> Liguori; qemu- devel@nongnu.org qemu-devel; Bhushan Bharat-R65777
> >>>> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
> >>>> do_kvm_cpu_synchronize_state data integrity issue
> >>>>
> >>>> On Thu, Jan 03, 2013 at 08:09:22PM +0100, Alexander Graf wrote:
> >>>>>
> >>>>> On 03.01.2013, at 19:48, Jason J. Herne wrote:
> >>>>>
> >>>>>> On 01/03/2013 08:56 AM, Alexander Graf wrote:
> >>>>>>>> static void do_kvm_cpu_synchronize_state(void *_args)
> >>>>>>>>> {
> >>>>>>>>>   struct kvm_cpu_syncstate_args *args = _args;
> >>>>>>>>> +    CPUArchState *env = args->env;
> >>>>>>>>> +    int register_level = args->register_level;
> >>>>>>>>>
> >>>>>>> This probably becomes more readable if we explicitly revert back
> >>>>>>> to
> >>>> unsynced state first:
> >>>>>>>
> >>>>>>> /* Write back local modifications at our current level */ if
> >>>>>>> (register_level > env->kvm_vcpu_dirty) {
> >>>>>>>   kvm_arch_put_registers(...);
> >>>>>>>   env->kvm_vcpu_dirty = 0;
> >>>>>>> }
> >>>>>>>
> >>>>>>> and then do the sync we are requested to do:
> >>>>>>>
> >>>>>>> if (!env->kvm_vcpu_dirty) {
> >>>>>>>   ...
> >>>>>>> }
> >>>>>>
> >>>>>> I agree, but only if we add a second conditional to the if 1st
> >>>>>> statement  as
> >>>> such:
> >>>>>>
> >>>>>> if (args->env->kvm_vcpu_dirty && register_level >
> >>>>>> env->kvm_vcpu_dirty)
> >>>>>>
> >>>>>> This is to cover the case where the caller is asking for register level
> "1"
> >>>> and we're already dirty at level "2". In this case, nothing should
> >>>> happen and we'll need the "args->env->kvm_vcpu_dirty" to ensure
> >>>> that is the
> >> case.
> >>>>>
> >>>>> As before, I'd prefer to make this explicit:
> >>>>>
> >>>>>>
> >>>>>> static void do_kvm_cpu_synchronize_state(void *_args) {  struct
> >>>>>> kvm_cpu_syncstate_args *args = _args;  CPUArchState *env =
> >>>>>> args->env;  int register_level = args->register_level;
> >>>>>
> >>>>> if (register_level > env->kvm_vcpu_dirty) {
> >>>>>   /* We are more dirty than we need to - all is well */
> >>>>>   return;
> >>>>> }
> >>>>>
> >>>>>>
> >>>>>>  /* Write back local modifications at our current level */  if
> >>>>>> (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty) {
> >>>>>>      kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
> >>>>>>      env->kvm_vcpu_dirty = 0;
> >>>>>>  }
> >>>>>>
> >>>>>>  if (!args->env->kvm_vcpu_dirty) {
> >>>>>>      kvm_arch_get_registers(env, register_level);
> >>>>>>      env->kvm_vcpu_dirty = register_level;  } }
> >>>>>>
> >>>>>> Do you agree?  Thanks for your time. :)
> >>>>>
> >>>>> Please also check out the discussions I've had with Bharat about
> >>>>> his watchdog
> >>>> patches. There we need a mechanism to synchronize registers only
> >>>> when we actually need to, in order to avoid potential race
> >>>> conditions with a kernel timer.
> >>>>>
> >>>>> That particular case doesn't work well with levels. We can have
> >>>>> multiple
> >>>> different potential race producers in the kernel that we need to
> >>>> avoid individually, so we can't always synchronize all of them when
> >>>> only one of them needs to be synchronized.
> >>>>>
> >>>>> The big question is what we should be doing about this. We
> >>>>> basically have 3
> >>>> options:
> >>>>>
> >>>>> * implement levels, treat racy registers as "manually
> >>>>> synchronized", as
> >>>> Bharat's latest patch set does
> >>>>> * implement levels, add a bitmap for additional special
> >>>>> synchronization bits
> >>>>> * replace levels by bitmap
> >>>>>
> >>>>> I'm quite frankly not sure which one of the 3 would be the best way
> forward.
> >>>>>
> >>>>>
> >>>>> Alex
> >>>>
> >>>> Implement read/write accessors for individual registers, similarly
> >>>> to how its done in the kernel. See kvm_cache_regs.h.
> >>>
> >>> Read/write for individual registers can be heavy for cases where
> >>> multiple
> >> register needed to be updated. Is not it?
> >>
> >> It depends. We can still have classes of registers to synchronize
> >> that we could even synchronize using MANY_REG. For writes, things
> >> become even faster with individual accessors since we can easily
> >> coalesce all register writes until we hit the vcpu again into a MANY_REG
> array and write them in one go.
> >>
> >>>
> >>> For cases where multiple register needed to be synchronized then I
> >>> would like
> >> to elaborate the options as:
> >>>
> >>> Option 1:
> >>> int timer_func(CPUArchState env)
> >>> {
> >>> 	cpu_synchronize_state(env);
> >>> 	//update env->timer_registers
> >>> 	env->updated_regs_type |= TIMER_REGS_TYPE_BIT;
> >>
> >> To scale this one, it's probably also more clever to swap the logic:
> >>
> >>    env->kvm_sync_extra |= SYNC_TIMER_REGS;
> >>    cpu_synchronize_state(env); /* gets timer regs */
> >>    /* update env->timer_registers */
> >>    /* timer regs will be synchronized later because kvm_sync_extra
> >> has SYNC_TIMER_REGS set */
> >>
> >> Your case right now is special in that we don't need to get any
> >> register state, but only write it. However, if we do
> >>
> >>    cpu_synchronize_state(env); /* will not get timer regs */
> >>    env->kvm_sync_extra |= SYNC_TIMER_REGS;
> >>
> >> then the next
> >>
> >>    cpu_synchronize_state(env);
> >>
> >> would overwrite the timer values again. So we would also need to
> >> indicate that the bits are dirty:
> >>
> >>    cpu_synchronize_state(env); /* will not get timer regs */
> >>    env->kvm_sync_extra |= SYNC_TIMER_REGS;
> >>    env->kvm_sync_dirty |= SYNC_TIMER_REGS;
> >>
> >> which cpu_synchronize_state() could use to make sure it doesn't read
> >> back any timer regs again.
> >
> > yes
> >
> >> It would obviously also have to set it in its normal operation when
> >> it executed first ;).
> >
> > We also need get all registers like for debugging and we should have flag like
> ALL_REGS etc.
> 
> Well, we could just take the current level of "general purpose registers" as one
> flag.
> 
> >
> > But I general I agree with the idea.
> >
> >>
> >>> }
> >>>
> >>> Change the kvm_arch_put_registers() to add followings:
> >>>
> >>> int kvm_arch_put_registers(CPUArchState *env, int level) {
> >>>
> >>> 	If (env->updated_regs_type) {
> >>> 		struct kvm_sregs sregs;
> >>> 		If (env->updated_regs_type & TIMER_REGS_TYPE_BIT) {
> >>> 			// update sregs->timer_registers;
> >>> 			// may be we can have a bitmap to tell kernel for what
> >> actually updated
> >>> 		}
> >>> 		If (env->updated_regs_type & XX_CPU_REGS_TYPE) {
> >>> 			// update respective registers in sregs
> >>> 		}
> >>>
> >>> 		ioctl(env, KVM_GET_SREGS, &sregs);
> >>> 	}
> >>> }
> >>>
> >>> This mechanism will get all registers state but this is required
> >>> only once per
> >> entry in QEMU.
> >>
> >> I don't understand this bit
> >>
> >>>
> >>>
> >>> Option 2:
> >>> Define read_regs_type() and write_regs_type() for cases where it
> >>> requires
> >> multiple registers updates:
> >>>
> >>> int read_regs_type((CPUArchState env, void *regs_ptr, int reg_type) {
> >>>    -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc.
> >>>
> >>> 	Switch(reg_type) {
> >>> 	Case TIMER_REGS:
> >>> 		Struct *timer_regs = regs_ptr;
> >>> 		Ioctl(KVM_GET_SREGS, timer_regs, CPU_reg_type);
> >>> 		Break;
> >>>
> >>> 	Default:
> >>> 		Printf(error);
> >>> }
> >>>
> >>> Similarly will be the write function:
> >>> int write_regs_type((CPUArchState env, int reg_type) {
> >>>    -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc.
> >>>
> >>> 	Switch(reg_type) {
> >>> 	Case TIMER_REGS:
> >>> 		Struct timer_regs;
> >>> 		Timer_regs.reg1 = env->timer_regs->reg1;
> >>> 		Timer_regs.reg2 = env->timer_regs->reg2;
> >>> 		Ioctl(env, KVM_SET_SREGS, &timer_regs, CPU_reg_type);
> >>> 		Break;
> >>>
> >>> 	Default:
> >>> 		Printf(error);
> >>> }
> >>>
> >>> Int timer_func(CPUxxState env)
> >>> {
> >>>   Struct timer_regs;
> >>>   read_regs_type((env, &timer_regs,TIMER_REGS);
> >>>   //update env->timer_registers
> >>>   Write_regs_type(env, TIMER_REGS)
> >>> }
> >>>
> >>> This will get one type of register_types and can cause multiple
> >>> IOCTL per
> >> entry to QEMU.
> >>
> >> This is still too explicit. How about
> >>
> >> static inline void ppc_set_tsr(CPUState *env, target_ulong val) {
> >>    env->kvm_sync_extra |= SYNC_TIMER_REGS;
> >>    cpu_synchronize_registers(env);
> >>    env->spr[SPR_TSR] = val;
> >
> > You also want env->kvm_sync_dirty also, right?
> 
> Not quite, since SYNC_TIMER_REGS spans more than only TSR. So we need to make
> sure we fetch the non-TSR register values before we can declare TIMER_REGS as
> dirty.

Right , I actually want communicate mark dirty the TIMER_REGS only :)

> 
> >
> >> }
> >>
> >> static inline void ppc_set_tcr(CPUState *env, target_ulong val) {
> >>    env->kvm_sync_extra |= SYNC_TIMER_REGS;
> >>    cpu_synchronize_registers(env);
> >>    env->spr[SPR_TCR] = val;
> >
> > Same as above
> >
> >> }
> >>
> >> int timer_func(CPUState *env) {
> >>    ppc_set_tsr(env, 0);
> >
> > SYNC_TIMERS_REGS are get only once on first call.
> 
> Sure, but that's implicit. A user of these functions doesn't have to care which
> one comes first.

Agree

-Bharat

> 
> 
> Alex
>
Alexander Graf - Jan. 4, 2013, 10:36 a.m.
On 04.01.2013, at 11:32, Bhushan Bharat-R65777 wrote:

> 
> 
>> -----Original Message-----
>>>>> 
>>>>> Int timer_func(CPUxxState env)
>>>>> {
>>>>>  Struct timer_regs;
>>>>>  read_regs_type((env, &timer_regs,TIMER_REGS);
>>>>>  //update env->timer_registers
>>>>>  Write_regs_type(env, TIMER_REGS)
>>>>> }
>>>>> 
>>>>> This will get one type of register_types and can cause multiple
>>>>> IOCTL per
>>>> entry to QEMU.
>>>> 
>>>> This is still too explicit. How about
>>>> 
>>>> static inline void ppc_set_tsr(CPUState *env, target_ulong val) {
>>>>   env->kvm_sync_extra |= SYNC_TIMER_REGS;
>>>>   cpu_synchronize_registers(env);
>>>>   env->spr[SPR_TSR] = val;
>>> 
>>> You also want env->kvm_sync_dirty also, right?
>> 
>> Not quite, since SYNC_TIMER_REGS spans more than only TSR. So we need to make
>> sure we fetch the non-TSR register values before we can declare TIMER_REGS as
>> dirty.
> 
> Right , I actually want communicate mark dirty the TIMER_REGS only :)

Imagine TIMER_REGS includes TSR and TCR. Then

ppc_set_tsr();

should still work without a respective ppc_set_tcr() call. So we can't just mark it dirty here. The only way we could optimize this bit would be by either splitting the bitmap per register or by extending the setter functions to the full scope of the sync:

static inline void ppc_set_timer_regs(CPUState *env, target_ulong tsr, target_ulong tcr)
{
    ...
}


Alex
Bharat Bhushan - Jan. 4, 2013, 11:01 a.m.
> -----Original Message-----
> From: Alexander Graf [mailto:agraf@suse.de]
> Sent: Friday, January 04, 2013 4:07 PM
> To: Bhushan Bharat-R65777
> Cc: Marcelo Tosatti; jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony
> Liguori; qemu-devel@nongnu.org qemu-devel
> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
> do_kvm_cpu_synchronize_state data integrity issue
> 
> 
> On 04.01.2013, at 11:32, Bhushan Bharat-R65777 wrote:
> 
> >
> >
> >> -----Original Message-----
> >>>>>
> >>>>> Int timer_func(CPUxxState env)
> >>>>> {
> >>>>>  Struct timer_regs;
> >>>>>  read_regs_type((env, &timer_regs,TIMER_REGS);  //update
> >>>>> env->timer_registers  Write_regs_type(env, TIMER_REGS) }
> >>>>>
> >>>>> This will get one type of register_types and can cause multiple
> >>>>> IOCTL per
> >>>> entry to QEMU.
> >>>>
> >>>> This is still too explicit. How about
> >>>>
> >>>> static inline void ppc_set_tsr(CPUState *env, target_ulong val) {
> >>>>   env->kvm_sync_extra |= SYNC_TIMER_REGS;
> >>>>   cpu_synchronize_registers(env);
> >>>>   env->spr[SPR_TSR] = val;
> >>>
> >>> You also want env->kvm_sync_dirty also, right?
> >>
> >> Not quite, since SYNC_TIMER_REGS spans more than only TSR. So we need
> >> to make sure we fetch the non-TSR register values before we can
> >> declare TIMER_REGS as dirty.
> >
> > Right , I actually want communicate mark dirty the TIMER_REGS only :)
> 
> Imagine TIMER_REGS includes TSR and TCR. Then
> 
> ppc_set_tsr();
> 
> should still work without a respective ppc_set_tcr() call. So we can't just mark
> it dirty here. The only way we could optimize this bit would be by either
> splitting the bitmap per register or by extending the setter functions to the
> full scope of the sync:

Are you talking about GET all register of a type, say TIMER (TCR and TSR) on first call to cpu_synchronization_function but SET will be only for dirty register. Say if TSR is changed then setter will only communicate that TSR is changed?

Or

(
get multiple regs of a type , say TIMER (TCR and TSR), change one or more and mark dirty if any one changed. Set all registers of a type, TIMER (TCR and TSR) if dirty.
  In this case the first call by ppc_set_tsr() to cpu_synchronization_function will implicitly get all registers of TIMER (TCR and TSR) - means one or more registers of timer can be updated by QEMU. This function will update only env->TSR.
  Later if ppc_set_tcr() is called, it will also call cpu_synchronization_function but this will see that TIMER registers are already available to QEMU, so it will not fetch again. This function will update the TCR also.
  Finally they all, TIMER (TCR and TSR) will be pushed to kernel.
)

Thanks
-Bharat

> 
> static inline void ppc_set_timer_regs(CPUState *env, target_ulong tsr,
> target_ulong tcr) {
>     ...
> }
> 
> 
> Alex
>
Alexander Graf - Jan. 4, 2013, 11:03 a.m.
On 04.01.2013, at 12:01, Bhushan Bharat-R65777 wrote:

> 
> 
>> -----Original Message-----
>> From: Alexander Graf [mailto:agraf@suse.de]
>> Sent: Friday, January 04, 2013 4:07 PM
>> To: Bhushan Bharat-R65777
>> Cc: Marcelo Tosatti; jjherne@linux.vnet.ibm.com; Christian Borntraeger; Anthony
>> Liguori; qemu-devel@nongnu.org qemu-devel
>> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
>> do_kvm_cpu_synchronize_state data integrity issue
>> 
>> 
>> On 04.01.2013, at 11:32, Bhushan Bharat-R65777 wrote:
>> 
>>> 
>>> 
>>>> -----Original Message-----
>>>>>>> 
>>>>>>> Int timer_func(CPUxxState env)
>>>>>>> {
>>>>>>> Struct timer_regs;
>>>>>>> read_regs_type((env, &timer_regs,TIMER_REGS);  //update
>>>>>>> env->timer_registers  Write_regs_type(env, TIMER_REGS) }
>>>>>>> 
>>>>>>> This will get one type of register_types and can cause multiple
>>>>>>> IOCTL per
>>>>>> entry to QEMU.
>>>>>> 
>>>>>> This is still too explicit. How about
>>>>>> 
>>>>>> static inline void ppc_set_tsr(CPUState *env, target_ulong val) {
>>>>>>  env->kvm_sync_extra |= SYNC_TIMER_REGS;
>>>>>>  cpu_synchronize_registers(env);
>>>>>>  env->spr[SPR_TSR] = val;
>>>>> 
>>>>> You also want env->kvm_sync_dirty also, right?
>>>> 
>>>> Not quite, since SYNC_TIMER_REGS spans more than only TSR. So we need
>>>> to make sure we fetch the non-TSR register values before we can
>>>> declare TIMER_REGS as dirty.
>>> 
>>> Right , I actually want communicate mark dirty the TIMER_REGS only :)
>> 
>> Imagine TIMER_REGS includes TSR and TCR. Then
>> 
>> ppc_set_tsr();
>> 
>> should still work without a respective ppc_set_tcr() call. So we can't just mark
>> it dirty here. The only way we could optimize this bit would be by either
>> splitting the bitmap per register or by extending the setter functions to the
>> full scope of the sync:
> 
> Are you talking about GET all register of a type, say TIMER (TCR and TSR) on first call to cpu_synchronization_function but SET will be only for dirty register. Say if TSR is changed then setter will only communicate that TSR is changed?
> 
> Or
> 
> (
> get multiple regs of a type , say TIMER (TCR and TSR), change one or more and mark dirty if any one changed. Set all registers of a type, TIMER (TCR and TSR) if dirty.
>  In this case the first call by ppc_set_tsr() to cpu_synchronization_function will implicitly get all registers of TIMER (TCR and TSR) - means one or more registers of timer can be updated by QEMU. This function will update only env->TSR.
>  Later if ppc_set_tcr() is called, it will also call cpu_synchronization_function but this will see that TIMER registers are already available to QEMU, so it will not fetch again. This function will update the TCR also.
>  Finally they all, TIMER (TCR and TSR) will be pushed to kernel.

Yes :)


Alex
Jason J. Herne - Jan. 4, 2013, 3:25 p.m.
If I've followed the conversation correctly this is what needs to be done:

1. Remove the level parameters from kvm_arch_get_registers and 
kvm_arch_put_registers.

2. Add a new bitmap parameter to kvm_arch_get_registers and 
kvm_arch_put_registers.

3. Define a bit that correlates to our current notion of "all runtime 
registers".  This bit, and all bits in this bitmap, would be 
architecture specific.

4. Remove the cpustate->kvm_sync_dirty field.  Replace it with a bitmap 
that tracks which bits are dirty and need to be synced back to KVM-land.

5. As we do today, we'll assume registers are dirty and turn on their 
corresponding bit in this new bitmap whenever we "get" the registers 
from KVM.

6. Add other bits as needed on a case by case basis.

Does this seem to match what was discussed, and what we want to do?
Alexander Graf - Jan. 4, 2013, 4:27 p.m.
On 04.01.2013, at 16:25, Jason J. Herne wrote:

> If I've followed the conversation correctly this is what needs to be done:
> 
> 1. Remove the level parameters from kvm_arch_get_registers and kvm_arch_put_registers.
> 
> 2. Add a new bitmap parameter to kvm_arch_get_registers and kvm_arch_put_registers.

I would combine these into "replace levels with bitmap".

> 3. Define a bit that correlates to our current notion of "all runtime registers".  This bit, and all bits in this bitmap, would be architecture specific.

Why would that bit be architecture specific? "All runtime registers" == "registers that gdb can access" IIRC. The implementation on what exactly that means obviously is architecture specific, but the bit itself would not be, as the gdbstub wants to be able to synchronize in arch independent code.

> 4. Remove the cpustate->kvm_sync_dirty field.  Replace it with a bitmap that tracks which bits are dirty and need to be synced back to KVM-land.
> 
> 5. As we do today, we'll assume registers are dirty and turn on their corresponding bit in this new bitmap whenever we "get" the registers from KVM.

Yes. Changing these semantics is nothing for today :).

> 6. Add other bits as needed on a case by case basis.
> 
> Does this seem to match what was discussed, and what we want to do?

It's probably the best way forward, keeping everyone happy.

Please coordinate with Bharat on who actually wants to sit down to implement this. Or if you're quick you might be able to beat him to it regardless thanks to time zones :).


Alex
Jason J. Herne - Jan. 4, 2013, 5:39 p.m.
On 01/04/2013 11:27 AM, Alexander Graf wrote:
>
> On 04.01.2013, at 16:25, Jason J. Herne wrote:
>
>> If I've followed the conversation correctly this is what needs to be done:
>>
>> 1. Remove the level parameters from kvm_arch_get_registers and kvm_arch_put_registers.
>>
>> 2. Add a new bitmap parameter to kvm_arch_get_registers and kvm_arch_put_registers.
>
> I would combine these into "replace levels with bitmap".
>
>> 3. Define a bit that correlates to our current notion of "all runtime registers".  This bit, and all bits in this bitmap, would be architecture specific.
>
> Why would that bit be architecture specific? "All runtime registers" == "registers that gdb can access" IIRC. The implementation on what exactly that means obviously is architecture specific, but the bit itself would not be, as the gdbstub wants to be able to synchronize in arch independent code.
>
>> 4. Remove the cpustate->kvm_sync_dirty field.  Replace it with a bitmap that tracks which bits are dirty and need to be synced back to KVM-land.
>>
>> 5. As we do today, we'll assume registers are dirty and turn on their corresponding bit in this new bitmap whenever we "get" the registers from KVM.
>
> Yes. Changing these semantics is nothing for today :).
>
>> 6. Add other bits as needed on a case by case basis.
>>
>> Does this seem to match what was discussed, and what we want to do?
>
> It's probably the best way forward, keeping everyone happy.
>
> Please coordinate with Bharat on who actually wants to sit down to implement this. Or if you're quick you might be able to beat him to it regardless thanks to time zones :).
>

Hi Bharat,

How would you like to handle these changes?  I can do them, or you could 
if you prefer. Please let me know.
Bharat Bhushan - Jan. 6, 2013, 2:43 p.m.
> -----Original Message-----
> From: Jason J. Herne [mailto:jjherne@linux.vnet.ibm.com]
> Sent: Friday, January 04, 2013 11:10 PM
> To: Alexander Graf
> Cc: Christian Borntraeger; Anthony Liguori; Marcelo Tosatti; qemu-
> devel@nongnu.org qemu-devel; Bhushan Bharat-R65777
> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix
> do_kvm_cpu_synchronize_state data integrity issue
> 
> On 01/04/2013 11:27 AM, Alexander Graf wrote:
> >
> > On 04.01.2013, at 16:25, Jason J. Herne wrote:
> >
> >> If I've followed the conversation correctly this is what needs to be done:
> >>
> >> 1. Remove the level parameters from kvm_arch_get_registers and
> kvm_arch_put_registers.
> >>
> >> 2. Add a new bitmap parameter to kvm_arch_get_registers and
> kvm_arch_put_registers.
> >
> > I would combine these into "replace levels with bitmap".
> >
> >> 3. Define a bit that correlates to our current notion of "all runtime
> registers".  This bit, and all bits in this bitmap, would be architecture
> specific.
> >
> > Why would that bit be architecture specific? "All runtime registers" ==
> "registers that gdb can access" IIRC. The implementation on what exactly that
> means obviously is architecture specific, but the bit itself would not be, as
> the gdbstub wants to be able to synchronize in arch independent code.
> >
> >> 4. Remove the cpustate->kvm_sync_dirty field.  Replace it with a bitmap that
> tracks which bits are dirty and need to be synced back to KVM-land.
> >>
> >> 5. As we do today, we'll assume registers are dirty and turn on their
> corresponding bit in this new bitmap whenever we "get" the registers from KVM.
> >
> > Yes. Changing these semantics is nothing for today :).
> >
> >> 6. Add other bits as needed on a case by case basis.
> >>
> >> Does this seem to match what was discussed, and what we want to do?
> >
> > It's probably the best way forward, keeping everyone happy.
> >
> > Please coordinate with Bharat on who actually wants to sit down to implement
> this. Or if you're quick you might be able to beat him to it regardless thanks
> to time zones :).
> >
> 
> Hi Bharat,
> 
> How would you like to handle these changes?  I can do them, or you could if you
> prefer. Please let me know.

Hi Jason,

I will be happy to see the patch from you and because of some other things if you think that it will take time then let me know, I will do the changes. This framework will be used by my watchdog patches and only thing I want is my watchdog changes get pushed in QEMU.

Thanks
-Bharat
Jason J. Herne - Jan. 7, 2013, 3:43 p.m.
On 01/04/2013 11:27 AM, Alexander Graf wrote:
>
> On 04.01.2013, at 16:25, Jason J. Herne wrote:
>
>> If I've followed the conversation correctly this is what needs to be done:
>>
>> 1. Remove the level parameters from kvm_arch_get_registers and kvm_arch_put_registers.
>>
>> 2. Add a new bitmap parameter to kvm_arch_get_registers and kvm_arch_put_registers.
>
> I would combine these into "replace levels with bitmap".
>
>> 3. Define a bit that correlates to our current notion of "all runtime registers".  This bit, and all bits in this bitmap, would be architecture specific.
>
> Why would that bit be architecture specific? "All runtime registers" == "registers that gdb can access" IIRC. The implementation on what exactly that means obviously is architecture specific, but the bit itself would not be, as the gdbstub wants to be able to synchronize in arch independent code.
>

How do we want to define these bits?  is it logical to break up the 
registers into smaller categories and then use masks to create 
RUNTIME_STATE, FULL_STATE, RESET_STATE?  If so, how should we define 
them?  Would they be arch specific and then we'd create the _STATE masks 
for each architecture?

If we do simply define a bit for each of the above three states instead, 
they should probably be 100% mutually exclusive to provide the best 
protection against complicated data synchronization issues (like the 
original 7/7 patch was trying to prevent).  Also, if we can assume 100% 
mutual exclusion the sync logic becomes trivial:

static void do_kvm_cpu_synchronize_state(void *arg)
{
     struct kvm_cpu_syncstate_args *args = arg;

     /* Do not sync regs that are already dirty */
     int regs_to_get = args->regmap & ~cpu->kvm_vcpu_dirty;

     kvm_arch_get_registers(args->cpu, regs_to_get);
     args->cpu->kvm_vcpu_dirty |= regs_to_get;
}

Thoughts?
Alexander Graf - Jan. 7, 2013, 3:49 p.m.
On 07.01.2013, at 16:43, Jason J. Herne wrote:

> On 01/04/2013 11:27 AM, Alexander Graf wrote:
>> 
>> On 04.01.2013, at 16:25, Jason J. Herne wrote:
>> 
>>> If I've followed the conversation correctly this is what needs to be done:
>>> 
>>> 1. Remove the level parameters from kvm_arch_get_registers and kvm_arch_put_registers.
>>> 
>>> 2. Add a new bitmap parameter to kvm_arch_get_registers and kvm_arch_put_registers.
>> 
>> I would combine these into "replace levels with bitmap".
>> 
>>> 3. Define a bit that correlates to our current notion of "all runtime registers".  This bit, and all bits in this bitmap, would be architecture specific.
>> 
>> Why would that bit be architecture specific? "All runtime registers" == "registers that gdb can access" IIRC. The implementation on what exactly that means obviously is architecture specific, but the bit itself would not be, as the gdbstub wants to be able to synchronize in arch independent code.
>> 
> 
> How do we want to define these bits?  is it logical to break up the registers into smaller categories and then use masks to create RUNTIME_STATE, FULL_STATE, RESET_STATE?  If so, how should we define them?  Would they be arch specific and then we'd create the _STATE masks for each architecture?

I see. So you only want to make the name arch independent, but keep its actual backing bits arch specific. I can see how that'd end up being a useful thing to do, yes.

So we could have archs that just define RUNTIME_STATE as ARCH_RUNTIME_STATE and others that define it as ARCH_STATE_REGx | ARCH_STATE_REGy. That way other code may only synchronize less than the full runtime state. Works for me :).

> If we do simply define a bit for each of the above three states instead, they should probably be 100% mutually exclusive to provide the best protection against complicated data synchronization issues (like the original 7/7 patch was trying to prevent).  Also, if we can assume 100% mutual exclusion the sync logic becomes trivial:
> 
> static void do_kvm_cpu_synchronize_state(void *arg)
> {
>    struct kvm_cpu_syncstate_args *args = arg;
> 
>    /* Do not sync regs that are already dirty */
>    int regs_to_get = args->regmap & ~cpu->kvm_vcpu_dirty;
> 
>    kvm_arch_get_registers(args->cpu, regs_to_get);
>    args->cpu->kvm_vcpu_dirty |= regs_to_get;
> }
> 
> Thoughts?

I like the idea of making the bits 100% mutually exclusive.


Alex
Jason J. Herne - Jan. 7, 2013, 6:19 p.m.
On 01/07/2013 10:49 AM, Alexander Graf wrote:
>
> On 07.01.2013, at 16:43, Jason J. Herne wrote:
>
>> On 01/04/2013 11:27 AM, Alexander Graf wrote:
>>>
>>> On 04.01.2013, at 16:25, Jason J. Herne wrote:
>>>
>>>> If I've followed the conversation correctly this is what needs to be done:
>>>>
>>>> 1. Remove the level parameters from kvm_arch_get_registers and kvm_arch_put_registers.
>>>>
>>>> 2. Add a new bitmap parameter to kvm_arch_get_registers and kvm_arch_put_registers.
>>>
>>> I would combine these into "replace levels with bitmap".
>>>
>>>> 3. Define a bit that correlates to our current notion of "all runtime registers".  This bit, and all bits in this bitmap, would be architecture specific.
>>>
>>> Why would that bit be architecture specific? "All runtime registers" == "registers that gdb can access" IIRC. The implementation on what exactly that means obviously is architecture specific, but the bit itself would not be, as the gdbstub wants to be able to synchronize in arch independent code.
>>>
>>
>> How do we want to define these bits?  is it logical to break up the registers into smaller categories and then use masks to create RUNTIME_STATE, FULL_STATE, RESET_STATE?  If so, how should we define them?  Would they be arch specific and then we'd create the _STATE masks for each architecture?
>
> I see. So you only want to make the name arch independent, but keep its actual backing bits arch specific. I can see how that'd end up being a useful thing to do, yes.
>
> So we could have archs that just define RUNTIME_STATE as ARCH_RUNTIME_STATE and others that define it as ARCH_STATE_REGx | ARCH_STATE_REGy. That way other code may only synchronize less than the full runtime state. Works for me :).
>
>> If we do simply define a bit for each of the above three states instead, they should probably be 100% mutually exclusive to provide the best protection against complicated data synchronization issues (like the original 7/7 patch was trying to prevent).  Also, if we can assume 100% mutual exclusion the sync logic becomes trivial:
>>
>> static void do_kvm_cpu_synchronize_state(void *arg)
>> {
>>     struct kvm_cpu_syncstate_args *args = arg;
>>
>>     /* Do not sync regs that are already dirty */
>>     int regs_to_get = args->regmap & ~cpu->kvm_vcpu_dirty;
>>
>>     kvm_arch_get_registers(args->cpu, regs_to_get);
>>     args->cpu->kvm_vcpu_dirty |= regs_to_get;
>> }
>>
>> Thoughts?
>
> I like the idea of making the bits 100% mutually exclusive.
>

I've started writing the code to replace the kvm_arch_put_register level 
parameter with a register bitmap and I'm hitting some problems with 
respect to the Intel/PPC targets:

1. target-i386/kvm,c : kvm_arch_put_registers() : This function syncs 
many registers "all the time".  I'm not entirely convinced from reading 
the code that I can easily group these registers into the default 
mutually exclusive groupings (runtime, reset, full).  Some of the 
comments seem to imply an ordering.  Also, a massive data structure is 
prepared and then a single IOCTL call is used to do the register sync. 
So I'm not sure if the IOCTL will expect to always see certain 
registers.  Things get messier when considering the msr's in 
kvm_put_msrs().

2. Currently no architecture has code to selectively GET register state 
(hence the initial patch set).  If we do not fully implement this now 
for all KVM targets then the selective syncing implemented in 
do_kvm_cpu_synchronize_state() will fail.  Consider the case where we 
sync the reset group of registers and make some of them dirty.  If a 
separate task syncs the full state at this point the reset regs will get 
pulled down and local changes will be lost due to 
kvm_arch_get_registers.  Sure we could hack around it but we would just 
be reverting to a "level" based system on top of our bitmap.

It looks like this is going to be a decent amount of work. I do not 
believe I have the platform specific knowledge (nor would I have the 
time) to make the changes required to the x86 and PPC platform specific 
code.  Perhaps others might volunteer if this is worth the effort :)? 
Else, perhaps we should examine a simpler solution to the problem?  Any 
chance the originally submitted patch set would suffice?
Alexander Graf - Jan. 7, 2013, 6:42 p.m.
On 07.01.2013, at 19:19, Jason J. Herne wrote:

> On 01/07/2013 10:49 AM, Alexander Graf wrote:
>> 
>> On 07.01.2013, at 16:43, Jason J. Herne wrote:
>> 
>>> On 01/04/2013 11:27 AM, Alexander Graf wrote:
>>>> 
>>>> On 04.01.2013, at 16:25, Jason J. Herne wrote:
>>>> 
>>>>> If I've followed the conversation correctly this is what needs to be done:
>>>>> 
>>>>> 1. Remove the level parameters from kvm_arch_get_registers and kvm_arch_put_registers.
>>>>> 
>>>>> 2. Add a new bitmap parameter to kvm_arch_get_registers and kvm_arch_put_registers.
>>>> 
>>>> I would combine these into "replace levels with bitmap".
>>>> 
>>>>> 3. Define a bit that correlates to our current notion of "all runtime registers".  This bit, and all bits in this bitmap, would be architecture specific.
>>>> 
>>>> Why would that bit be architecture specific? "All runtime registers" == "registers that gdb can access" IIRC. The implementation on what exactly that means obviously is architecture specific, but the bit itself would not be, as the gdbstub wants to be able to synchronize in arch independent code.
>>>> 
>>> 
>>> How do we want to define these bits?  is it logical to break up the registers into smaller categories and then use masks to create RUNTIME_STATE, FULL_STATE, RESET_STATE?  If so, how should we define them?  Would they be arch specific and then we'd create the _STATE masks for each architecture?
>> 
>> I see. So you only want to make the name arch independent, but keep its actual backing bits arch specific. I can see how that'd end up being a useful thing to do, yes.
>> 
>> So we could have archs that just define RUNTIME_STATE as ARCH_RUNTIME_STATE and others that define it as ARCH_STATE_REGx | ARCH_STATE_REGy. That way other code may only synchronize less than the full runtime state. Works for me :).
>> 
>>> If we do simply define a bit for each of the above three states instead, they should probably be 100% mutually exclusive to provide the best protection against complicated data synchronization issues (like the original 7/7 patch was trying to prevent).  Also, if we can assume 100% mutual exclusion the sync logic becomes trivial:
>>> 
>>> static void do_kvm_cpu_synchronize_state(void *arg)
>>> {
>>>    struct kvm_cpu_syncstate_args *args = arg;
>>> 
>>>    /* Do not sync regs that are already dirty */
>>>    int regs_to_get = args->regmap & ~cpu->kvm_vcpu_dirty;
>>> 
>>>    kvm_arch_get_registers(args->cpu, regs_to_get);
>>>    args->cpu->kvm_vcpu_dirty |= regs_to_get;
>>> }
>>> 
>>> Thoughts?
>> 
>> I like the idea of making the bits 100% mutually exclusive.
>> 
> 
> I've started writing the code to replace the kvm_arch_put_register level parameter with a register bitmap and I'm hitting some problems with respect to the Intel/PPC targets:
> 
> 1. target-i386/kvm,c : kvm_arch_put_registers() : This function syncs many registers "all the time".  I'm not entirely convinced from reading the code that I can easily group these registers into the default mutually exclusive groupings (runtime, reset, full).

Runtime, reset and full are special, as they are levels.

#define SYNC_RESET (SYNC_RUNTIME | SYNC_RESET_EXTRA)
#define SYNC_FULL (SYNC_RESET | SYNC_FULL_EXTRA)

But SYNC_RUNTIME and the _EXTRA bits should be individual bits. That way the bitmap contains mutual bits, but we still have friendly helper bitmaps to convert from the past level based approach.

>  Some of the comments seem to imply an ordering.  Also, a massive data structure is prepared and then a single IOCTL call is used to do the register sync. So I'm not sure if the IOCTL will expect to always see certain registers.  

The ioctl API is pretty well documented in the Linux kernel source in Documentation/virt/kvm/api.txt. But the synchronization function itself can still be one giant piece of code that simply disassembles the bitmap rather than a level, no?

> Things get messier when considering the msr's in kvm_put_msrs().
> 
> 2. Currently no architecture has code to selectively GET register state (hence the initial patch set).  If we do not fully implement this now for all KVM targets then the selective syncing implemented in do_kvm_cpu_synchronize_state() will fail.  Consider the case where we sync the reset group of registers and make some of them dirty.  If a separate task syncs the full state at this point the reset regs will get pulled down and local changes will be lost due to kvm_arch_get_registers.  Sure we could hack around it but we would just be reverting to a "level" based system on top of our bitmap.
> 
> It looks like this is going to be a decent amount of work. I do not believe I have the platform specific knowledge (nor would I have the time) to make the changes required to the x86 and PPC platform specific code.  Perhaps others might volunteer if this is worth the effort :)? Else, perhaps we should examine a simpler solution to the problem?  Any chance the originally submitted patch set would suffice?

I'd like to hear some ideas from Jan first :).


Alex

Patch

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index aea0ece..af3b6aa 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -208,6 +208,12 @@  typedef struct CPUWatchpoint {
     struct KVMState *kvm_state;                                         \
     struct kvm_run *kvm_run;                                            \
     int kvm_fd;                                                         \
+                                                                        \
+    /* Register level indicating which vcpu registers have been synced  \
+       from KVM, are potentially dirty due to local modifications, and  \
+       will need to be written back to KVM.  Valid values are 0, which  \
+       indicates no registers are dirty, or any of the KVM_REGSYNC_*    \
+       constants defined in kvm.h */                                    \
     int kvm_vcpu_dirty;
 
 #endif
diff --git a/kvm-all.c b/kvm-all.c
index aee5bdd..858a636 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -230,7 +230,7 @@  int kvm_init_vcpu(CPUArchState *env)
 
     env->kvm_fd = ret;
     env->kvm_state = s;
-    env->kvm_vcpu_dirty = 1;
+    env->kvm_vcpu_dirty = KVM_REGSYNC_FULL_STATE;
 
     mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
     if (mmap_size < 0) {
@@ -1489,10 +1489,16 @@  void kvm_flush_coalesced_mmio_buffer(void)
 static void do_kvm_cpu_synchronize_state(void *_args)
 {
     struct kvm_cpu_syncstate_args *args = _args;
+    CPUArchState *env = args->env;
+    int register_level = args->register_level;
 
     if (!args->env->kvm_vcpu_dirty) {
-        kvm_arch_get_registers(args->env, args->register_level);
-        args->env->kvm_vcpu_dirty = 1;
+        kvm_arch_get_registers(env, register_level);
+        env->kvm_vcpu_dirty = register_level;
+    } else if (register_level > env->kvm_vcpu_dirty) {
+        kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
+        kvm_arch_get_registers(env, register_level);
+        env->kvm_vcpu_dirty = register_level;
     }
 }
 
@@ -1535,7 +1541,7 @@  int kvm_cpu_exec(CPUArchState *env)
 
     do {
         if (env->kvm_vcpu_dirty) {
-            kvm_arch_put_registers(env, KVM_REGSYNC_RUNTIME_STATE);
+            kvm_arch_put_registers(env, env->kvm_vcpu_dirty);
             env->kvm_vcpu_dirty = 0;
         }