Patchwork qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)

login
register
mail settings
Submitter Anthony Liguori
Date Nov. 23, 2010, 4:49 p.m.
Message ID <1290530963-3448-1-git-send-email-aliguori@us.ibm.com>
Download mbox | patch
Permalink /patch/72699/
State New
Headers show

Comments

Anthony Liguori - Nov. 23, 2010, 4:49 p.m.
qemu-kvm vcpu threads don't response to SIGSTOP/SIGCONT.  Instead of teaching
them to respond to these signals (which cannot be trapped), use SIGUSR1 to
approximate the behavior of SIGSTOP/SIGCONT.

The purpose of this is to implement CPU hard limits using an external tool that
watches the CPU consumption and stops the VCPU as appropriate.

This provides a more elegant solution in that it allows the VCPU thread to
release qemu_mutex before going to sleep.

This current implementation uses a single signal.  I think this is too racey
in the long term so I think we should introduce a second signal.  If two signals
get coalesced into one, it could confuse the monitoring tool into giving the
VCPU the inverse of it's entitlement.

It might be better to simply move this logic entirely into QEMU to make this
more robust--the question is whether we think this is a good long term feature
to carry in QEMU?

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Blue Swirl - Nov. 23, 2010, 7:35 p.m.
On Tue, Nov 23, 2010 at 4:49 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> qemu-kvm vcpu threads don't response to SIGSTOP/SIGCONT.  Instead of teaching
> them to respond to these signals (which cannot be trapped), use SIGUSR1 to
> approximate the behavior of SIGSTOP/SIGCONT.
>
> The purpose of this is to implement CPU hard limits using an external tool that
> watches the CPU consumption and stops the VCPU as appropriate.
>
> This provides a more elegant solution in that it allows the VCPU thread to
> release qemu_mutex before going to sleep.
>
> This current implementation uses a single signal.  I think this is too racey
> in the long term so I think we should introduce a second signal.  If two signals
> get coalesced into one, it could confuse the monitoring tool into giving the
> VCPU the inverse of it's entitlement.
>
> It might be better to simply move this logic entirely into QEMU to make this
> more robust--the question is whether we think this is a good long term feature
> to carry in QEMU?

> +static __thread int sigusr1_wfd;

While OpenBSD finally updated the default compiler to 4.2.1 from 3.x
series, thread local storage is still not supported:

$ cat thread.c
static __thread int sigusr1_wfd;
$ gcc thread.c -c
thread.c:1: error: thread-local storage not supported for this target
$ gcc -v
Reading specs from /usr/lib/gcc-lib/sparc64-unknown-openbsd4.8/4.2.1/specs
Target: sparc64-unknown-openbsd4.8
Configured with: OpenBSD/sparc64 system compiler
Thread model: posix
gcc version 4.2.1 20070719
Anthony Liguori - Nov. 23, 2010, 9:46 p.m.
On 11/23/2010 01:35 PM, Blue Swirl wrote:
> On Tue, Nov 23, 2010 at 4:49 PM, Anthony Liguori<aliguori@us.ibm.com>  wrote:
>    
>> qemu-kvm vcpu threads don't response to SIGSTOP/SIGCONT.  Instead of teaching
>> them to respond to these signals (which cannot be trapped), use SIGUSR1 to
>> approximate the behavior of SIGSTOP/SIGCONT.
>>
>> The purpose of this is to implement CPU hard limits using an external tool that
>> watches the CPU consumption and stops the VCPU as appropriate.
>>
>> This provides a more elegant solution in that it allows the VCPU thread to
>> release qemu_mutex before going to sleep.
>>
>> This current implementation uses a single signal.  I think this is too racey
>> in the long term so I think we should introduce a second signal.  If two signals
>> get coalesced into one, it could confuse the monitoring tool into giving the
>> VCPU the inverse of it's entitlement.
>>
>> It might be better to simply move this logic entirely into QEMU to make this
>> more robust--the question is whether we think this is a good long term feature
>> to carry in QEMU?
>>      
>    
>> +static __thread int sigusr1_wfd;
>>      
> While OpenBSD finally updated the default compiler to 4.2.1 from 3.x
> series, thread local storage is still not supported:
>    

Hrm, is there a portable way to do this (distinguish a signal on a 
particular thread)?

Regards,

Anthony Liguori

> $ cat thread.c
> static __thread int sigusr1_wfd;
> $ gcc thread.c -c
> thread.c:1: error: thread-local storage not supported for this target
> $ gcc -v
> Reading specs from /usr/lib/gcc-lib/sparc64-unknown-openbsd4.8/4.2.1/specs
> Target: sparc64-unknown-openbsd4.8
> Configured with: OpenBSD/sparc64 system compiler
> Thread model: posix
> gcc version 4.2.1 20070719
>
Paolo Bonzini - Nov. 23, 2010, 11:43 p.m.
On 11/23/2010 10:46 PM, Anthony Liguori wrote:
>>> +static __thread int sigusr1_wfd;
>> While OpenBSD finally updated the default compiler to 4.2.1 from 3.x
>> series, thread local storage is still not supported:
>
> Hrm, is there a portable way to do this (distinguish a signal on a
> particular thread)?

You can use pthread_getspecific/pthread_setspecific.

Paolo
Anthony Liguori - Nov. 24, 2010, 1:15 a.m.
On 11/23/2010 05:43 PM, Paolo Bonzini wrote:
> On 11/23/2010 10:46 PM, Anthony Liguori wrote:
>>>> +static __thread int sigusr1_wfd;
>>> While OpenBSD finally updated the default compiler to 4.2.1 from 3.x
>>> series, thread local storage is still not supported:
>>
>> Hrm, is there a portable way to do this (distinguish a signal on a
>> particular thread)?
>
> You can use pthread_getspecific/pthread_setspecific.

Is it signal safe?

BTW, this is all only theoretical.  This is in the KVM io thread code 
which is already highly unportable.

Regards,

Anthony Liguori

> Paolo
> -- 
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini - Nov. 24, 2010, 2:08 a.m.
On 11/24/2010 02:15 AM, Anthony Liguori wrote:
>
> Is it signal safe?

Yes, at heart it is just a somewhat more expensive access to 
pthread_self()->some_array[key].

> BTW, this is all only theoretical.  This is in the KVM io thread code
> which is already highly unportable.

True, and newer versions of GCC emulate __thread even on Windows.

Paolo
Avi Kivity - Nov. 24, 2010, 8:18 a.m.
On 11/23/2010 06:49 PM, Anthony Liguori wrote:
> qemu-kvm vcpu threads don't response to SIGSTOP/SIGCONT.  Instead of teaching
> them to respond to these signals (which cannot be trapped), use SIGUSR1 to
> approximate the behavior of SIGSTOP/SIGCONT.
>
> The purpose of this is to implement CPU hard limits using an external tool that
> watches the CPU consumption and stops the VCPU as appropriate.
>
> This provides a more elegant solution in that it allows the VCPU thread to
> release qemu_mutex before going to sleep.
>
> This current implementation uses a single signal.  I think this is too racey
> in the long term so I think we should introduce a second signal.  If two signals
> get coalesced into one, it could confuse the monitoring tool into giving the
> VCPU the inverse of it's entitlement.

You can use sigqueue() to send an accompanying value.

> It might be better to simply move this logic entirely into QEMU to make this
> more robust--the question is whether we think this is a good long term feature
> to carry in QEMU?
>

I'm more concerned about lock holder preemption, and interaction of this 
mechanism with any kernel solution for LHP.

> +static __thread int sigusr1_wfd;
> +
> +static void on_sigusr1(int signo)
> +{
> +    char ch = 0;
> +    if (write(sigusr1_wfd,&ch, 1)<  0) {
> +        /* who cares */
> +    }
> +}

We do have signalfd().

> +
> +static void sigusr1_read(void *opaque)
> +{
> +    CPUState *env = opaque;
> +    ssize_t len;
> +    int caught_signal = 0;
> +
> +    do {
> +        char buffer[256];
> +        len = read(env->sigusr1_fd, buffer, sizeof(buffer));
> +        caught_signal = 1;
> +    } while (len>  0);
> +
> +    if (caught_signal) {
> +        if (env->stopped) {

env->stopped is multiplexed among multiple users, so this interferes 
with vm_stop().

We need to make ->stopped a reference count instead.
Anthony Liguori - Nov. 24, 2010, 1:58 p.m.
On 11/24/2010 02:18 AM, Avi Kivity wrote:
> On 11/23/2010 06:49 PM, Anthony Liguori wrote:
>> qemu-kvm vcpu threads don't response to SIGSTOP/SIGCONT.  Instead of 
>> teaching
>> them to respond to these signals (which cannot be trapped), use 
>> SIGUSR1 to
>> approximate the behavior of SIGSTOP/SIGCONT.
>>
>> The purpose of this is to implement CPU hard limits using an external 
>> tool that
>> watches the CPU consumption and stops the VCPU as appropriate.
>>
>> This provides a more elegant solution in that it allows the VCPU 
>> thread to
>> release qemu_mutex before going to sleep.
>>
>> This current implementation uses a single signal.  I think this is 
>> too racey
>> in the long term so I think we should introduce a second signal.  If 
>> two signals
>> get coalesced into one, it could confuse the monitoring tool into 
>> giving the
>> VCPU the inverse of it's entitlement.
>
> You can use sigqueue() to send an accompanying value.

I switched to using SIGRTMIN+5 and SIGRTMIN+6.  I think that's a nicer 
solution since it maps to SIGCONT/SIGSTOP.

>> It might be better to simply move this logic entirely into QEMU to 
>> make this
>> more robust--the question is whether we think this is a good long 
>> term feature
>> to carry in QEMU?
>>
>
> I'm more concerned about lock holder preemption, and interaction of 
> this mechanism with any kernel solution for LHP.

Can you suggest some scenarios and I'll create some test cases?  I'm 
trying figure out the best way to evaluate this.

Are you assuming the existence of a directed yield and the specific 
concern is what happens when a directed yield happens after a PLE and 
the target of the yield has been capped?

>> +static __thread int sigusr1_wfd;
>> +
>> +static void on_sigusr1(int signo)
>> +{
>> +    char ch = 0;
>> +    if (write(sigusr1_wfd,&ch, 1)<  0) {
>> +        /* who cares */
>> +    }
>> +}
>
> We do have signalfd().

This is actually called from signalfd.  I thought about refactoring that 
loop to handle signals directly but since we do this elsewhere I figured 
I'd keep things consistent.

>> +
>> +static void sigusr1_read(void *opaque)
>> +{
>> +    CPUState *env = opaque;
>> +    ssize_t len;
>> +    int caught_signal = 0;
>> +
>> +    do {
>> +        char buffer[256];
>> +        len = read(env->sigusr1_fd, buffer, sizeof(buffer));
>> +        caught_signal = 1;
>> +    } while (len>  0);
>> +
>> +    if (caught_signal) {
>> +        if (env->stopped) {
>
> env->stopped is multiplexed among multiple users, so this interferes 
> with vm_stop().
>
> We need to make ->stopped a reference count instead.

Indeed.

Regards,

Anthony Liguori
Avi Kivity - Nov. 24, 2010, 2:23 p.m.
On 11/24/2010 03:58 PM, Anthony Liguori wrote:
> On 11/24/2010 02:18 AM, Avi Kivity wrote:
>> On 11/23/2010 06:49 PM, Anthony Liguori wrote:
>>> qemu-kvm vcpu threads don't response to SIGSTOP/SIGCONT.  Instead of 
>>> teaching
>>> them to respond to these signals (which cannot be trapped), use 
>>> SIGUSR1 to
>>> approximate the behavior of SIGSTOP/SIGCONT.
>>>
>>> The purpose of this is to implement CPU hard limits using an 
>>> external tool that
>>> watches the CPU consumption and stops the VCPU as appropriate.
>>>
>>> This provides a more elegant solution in that it allows the VCPU 
>>> thread to
>>> release qemu_mutex before going to sleep.
>>>
>>> This current implementation uses a single signal.  I think this is 
>>> too racey
>>> in the long term so I think we should introduce a second signal.  If 
>>> two signals
>>> get coalesced into one, it could confuse the monitoring tool into 
>>> giving the
>>> VCPU the inverse of it's entitlement.
>>
>> You can use sigqueue() to send an accompanying value.
>
> I switched to using SIGRTMIN+5 and SIGRTMIN+6.  I think that's a nicer 
> solution since it maps to SIGCONT/SIGSTOP.
>

These may get reordered, need to check the semantics.

>>> It might be better to simply move this logic entirely into QEMU to 
>>> make this
>>> more robust--the question is whether we think this is a good long 
>>> term feature
>>> to carry in QEMU?
>>>
>>
>> I'm more concerned about lock holder preemption, and interaction of 
>> this mechanism with any kernel solution for LHP.
>
> Can you suggest some scenarios and I'll create some test cases?  I'm 
> trying figure out the best way to evaluate this.

Booting 64-vcpu Windows on a 64-cpu host with PLE but without directed 
yield takes longer than forever because PLE detects contention within 
the guest, which under our current PLE implementation (usleep(100)) 
converts guest contention into delays.

(a directed yield implementation would find that all vcpus are runnable, 
yielding optimal results under this test case).

So if you were to test something similar running with a 20% vcpu cap, 
I'm sure you'd run into similar issues.  It may show with fewer vcpus 
(I've only tested 64).

> Are you assuming the existence of a directed yield and the specific 
> concern is what happens when a directed yield happens after a PLE and 
> the target of the yield has been capped?

Yes.  My concern is that we will see the same kind of problems directed 
yield was designed to fix, but without allowing directed yield to fix 
them.  Directed yield was designed to fix lock holder preemption under 
contention, now you're inducing contention but not allowing directed 
yield to work, even when we will have it.


>
>>> +static __thread int sigusr1_wfd;
>>> +
>>> +static void on_sigusr1(int signo)
>>> +{
>>> +    char ch = 0;
>>> +    if (write(sigusr1_wfd,&ch, 1)<  0) {
>>> +        /* who cares */
>>> +    }
>>> +}
>>
>> We do have signalfd().
>
> This is actually called from signalfd.  I thought about refactoring 
> that loop to handle signals directly but since we do this elsewhere I 
> figured I'd keep things consistent.

Ah, yes.

>>> +
>>> +static void sigusr1_read(void *opaque)
>>> +{
>>> +    CPUState *env = opaque;
>>> +    ssize_t len;
>>> +    int caught_signal = 0;
>>> +
>>> +    do {
>>> +        char buffer[256];
>>> +        len = read(env->sigusr1_fd, buffer, sizeof(buffer));
>>> +        caught_signal = 1;
>>> +    } while (len>  0);
>>> +
>>> +    if (caught_signal) {
>>> +        if (env->stopped) {
>>
>> env->stopped is multiplexed among multiple users, so this interferes 
>> with vm_stop().
>>
>> We need to make ->stopped a reference count instead.
>
> Indeed.

We also need to make the global vm_stop() be reference based, since 
there are multiple consumers of that interface.
Srivatsa Vaddagiri - Dec. 1, 2010, 12:37 p.m.
On Wed, Nov 24, 2010 at 04:23:15PM +0200, Avi Kivity wrote:
> >>I'm more concerned about lock holder preemption, and interaction
> >>of this mechanism with any kernel solution for LHP.
> >
> >Can you suggest some scenarios and I'll create some test cases?
> >I'm trying figure out the best way to evaluate this.
> 
> Booting 64-vcpu Windows on a 64-cpu host with PLE but without
> directed yield takes longer than forever because PLE detects
> contention within the guest, which under our current PLE
> implementation (usleep(100)) converts guest contention into delays.

Is there any way of optimizing PLE at runtime in such special case? For ex: 
either turn off PLE feature or gradually increase (spin-)timeout when PLE should
kick in ..

> (a directed yield implementation would find that all vcpus are
> runnable, yielding optimal results under this test case).

I would think a plain yield() (rather than usleep/directed yield) would suffice
here (yield would realize that there is nobody else to yield to and continue
running the same vcpu thread). As regards to any concern of leaking cpu 
bandwidth because of a plain yield, I think it can be fixed by a more
simpler modification to yield that allows a thread to reclaim whatever timeslice
it gave up previously [1].

Regarding directed yield, do we have any reliable mechanism to find target of 
directed yield in this (unmodified/non-paravirtualized guest) case? IOW how do 
we determine the vcpu thread to which cycles need to be yielded upon contention?

> So if you were to test something similar running with a 20% vcpu
> cap, I'm sure you'd run into similar issues.  It may show with fewer
> vcpus (I've only tested 64).
> 
> >Are you assuming the existence of a directed yield and the
> >specific concern is what happens when a directed yield happens
> >after a PLE and the target of the yield has been capped?
> 
> Yes.  My concern is that we will see the same kind of problems
> directed yield was designed to fix, but without allowing directed
> yield to fix them.  Directed yield was designed to fix lock holder
> preemption under contention,

For modified guests, something like [2] seems to be the best approach to fix
lock-holder preemption (LHP) problem, which does not require any sort of 
directed yield support. Essentially upon contention, a vcpu registers its lock 
of interest and goes to sleep (via hypercall) waiting for lock-owner to wake it 
up (again via another hypercall).

For unmodified guests, IMHO a plain yield (or slightly enhanced yield [1])
should fix the LHP problem.

Fyi, Xen folks also seem to be avoiding a directed yield for some of the same
reasons [3].

Given this line of thinking, hard-limiting guests (either in user-space or
kernel-space, latter being what I prefer) should not have adverse interactions 
with LHP-related solutions.

> now you're inducing contention but not
> allowing directed yield to work, even when we will have it.

- vatsa

References:

1. http://lkml.org/lkml/2010/8/3/38
2. https://lkml.org/lkml/2010/11/16/479
3. http://www.gossamer-threads.com/lists/xen/devel/182179#182179
Avi Kivity - Dec. 1, 2010, 12:56 p.m.
On 12/01/2010 02:37 PM, Srivatsa Vaddagiri wrote:
> On Wed, Nov 24, 2010 at 04:23:15PM +0200, Avi Kivity wrote:
> >  >>I'm more concerned about lock holder preemption, and interaction
> >  >>of this mechanism with any kernel solution for LHP.
> >  >
> >  >Can you suggest some scenarios and I'll create some test cases?
> >  >I'm trying figure out the best way to evaluate this.
> >
> >  Booting 64-vcpu Windows on a 64-cpu host with PLE but without
> >  directed yield takes longer than forever because PLE detects
> >  contention within the guest, which under our current PLE
> >  implementation (usleep(100)) converts guest contention into delays.
>
> Is there any way of optimizing PLE at runtime in such special case? For ex:
> either turn off PLE feature or gradually increase (spin-)timeout when PLE should
> kick in ..

It's not a special case at all.  Both host contention and guest 
contention are perfectly normal, and can occur simultaneously.


> >  (a directed yield implementation would find that all vcpus are
> >  runnable, yielding optimal results under this test case).
>
> I would think a plain yield() (rather than usleep/directed yield) would suffice
> here (yield would realize that there is nobody else to yield to and continue
> running the same vcpu thread).

Currently yield() is a no-op on Linux.

> As regards to any concern of leaking cpu
> bandwidth because of a plain yield, I think it can be fixed by a more
> simpler modification to yield that allows a thread to reclaim whatever timeslice
> it gave up previously [1].

If some other thread used that timeslice, don't we have an accounting 
problem?

> Regarding directed yield, do we have any reliable mechanism to find target of
> directed yield in this (unmodified/non-paravirtualized guest) case? IOW how do
> we determine the vcpu thread to which cycles need to be yielded upon contention?

My idea was to yield to a random starved vcpu of the same guest.  There 
are several cases to consider:

- we hit the right vcpu; lock is released, party.
- we hit some vcpu that is doing unrelated work.  yielding thread 
doesn't make progress, but we're not wasting cpu time.
- we hit another waiter for the same lock.  it will also PLE exit and 
trigger a directed yield.  this increases the cost of directed yield by 
a factor of count_of_runnable_but_not_running_vcpus, which could be 
large, but not disasterously so (i.e. don't run a 64-vcpu guest on a 
uniprocessor host with this)

> >  So if you were to test something similar running with a 20% vcpu
> >  cap, I'm sure you'd run into similar issues.  It may show with fewer
> >  vcpus (I've only tested 64).
> >
> >  >Are you assuming the existence of a directed yield and the
> >  >specific concern is what happens when a directed yield happens
> >  >after a PLE and the target of the yield has been capped?
> >
> >  Yes.  My concern is that we will see the same kind of problems
> >  directed yield was designed to fix, but without allowing directed
> >  yield to fix them.  Directed yield was designed to fix lock holder
> >  preemption under contention,
>
> For modified guests, something like [2] seems to be the best approach to fix
> lock-holder preemption (LHP) problem, which does not require any sort of
> directed yield support. Essentially upon contention, a vcpu registers its lock
> of interest and goes to sleep (via hypercall) waiting for lock-owner to wake it
> up (again via another hypercall).

Right.

> For unmodified guests, IMHO a plain yield (or slightly enhanced yield [1])
> should fix the LHP problem.

A plain yield (ignoring no-opiness on Linux) will penalize the running 
guest wrt other guests.  We need to maintain fairness.

> Fyi, Xen folks also seem to be avoiding a directed yield for some of the same
> reasons [3].

I think that fails for unmodified guests, where you don't know when the 
lock is released and so you don't have a wake_up notification.  You lost 
a large timeslice and you can't gain it back, whereas with pv the wakeup 
means you only lose as much time as the lock was held.

> Given this line of thinking, hard-limiting guests (either in user-space or
> kernel-space, latter being what I prefer) should not have adverse interactions
> with LHP-related solutions.

If you hard-limit a vcpu that holds a lock, any waiting vcpus are also 
halted.  With directed yield you can let the lock holder make some 
progress at the expense of another vcpu.  A regular yield() will simply 
stall the waiter.
Srivatsa Vaddagiri - Dec. 1, 2010, 4:12 p.m.
On Wed, Dec 01, 2010 at 02:56:44PM +0200, Avi Kivity wrote:
> >>  (a directed yield implementation would find that all vcpus are
> >>  runnable, yielding optimal results under this test case).
> >
> >I would think a plain yield() (rather than usleep/directed yield) would suffice
> >here (yield would realize that there is nobody else to yield to and continue
> >running the same vcpu thread).
> 
> Currently yield() is a no-op on Linux.

Hmm only when there is a single task in runqueue, otherwise yield will cause
remaining threads to run before letting the yielding task to run again.

> >As regards to any concern of leaking cpu
> >bandwidth because of a plain yield, I think it can be fixed by a more
> >simpler modification to yield that allows a thread to reclaim whatever timeslice
> >it gave up previously [1].
> 
> If some other thread used that timeslice, don't we have an
> accounting problem?

Not if yield() remembers what timeslice was given up and adds that back when
thread is finally ready to run. Figure below illustrates this idea:


      A0/4    C0/4 D0/4 A0/4  C0/4 D0/4 A0/4  C0/4 D0/4 A0/4 
p0   |----|-L|----|----|----|L|----|----|----|L|----|----|----|--------------|
            \                \               \                  \
	   B0/2[2]	    B0/0[6]         B0/0[10]            B0/14[0]

 
where,
	p0	-> physical cpu0
	L	-> denotes period of lock contention
	A0/4    -> means vcpu A0 (of guest A) ran for 4 ms
	B0/2[6] -> means vcpu B0 (of guest B) ran for 2 ms (and has given up
		   6ms worth of its timeslice so far). In reality, we should
	     	   not see too much of "given up" timeslice for a vcpu.

> >Regarding directed yield, do we have any reliable mechanism to find target of
> >directed yield in this (unmodified/non-paravirtualized guest) case? IOW how do
> >we determine the vcpu thread to which cycles need to be yielded upon contention?
> 
> My idea was to yield to a random starved vcpu of the same guest.
> There are several cases to consider:
> 
> - we hit the right vcpu; lock is released, party.
> - we hit some vcpu that is doing unrelated work.  yielding thread
> doesn't make progress, but we're not wasting cpu time.
> - we hit another waiter for the same lock.  it will also PLE exit
> and trigger a directed yield.  this increases the cost of directed
> yield by a factor of count_of_runnable_but_not_running_vcpus, which
> could be large, but not disasterously so (i.e. don't run a 64-vcpu
> guest on a uniprocessor host with this)
> 
> >>  So if you were to test something similar running with a 20% vcpu
> >>  cap, I'm sure you'd run into similar issues.  It may show with fewer
> >>  vcpus (I've only tested 64).
> >>
> >>  >Are you assuming the existence of a directed yield and the
> >>  >specific concern is what happens when a directed yield happens
> >>  >after a PLE and the target of the yield has been capped?
> >>
> >>  Yes.  My concern is that we will see the same kind of problems
> >>  directed yield was designed to fix, but without allowing directed
> >>  yield to fix them.  Directed yield was designed to fix lock holder
> >>  preemption under contention,
> >
> >For modified guests, something like [2] seems to be the best approach to fix
> >lock-holder preemption (LHP) problem, which does not require any sort of
> >directed yield support. Essentially upon contention, a vcpu registers its lock
> >of interest and goes to sleep (via hypercall) waiting for lock-owner to wake it
> >up (again via another hypercall).
> 
> Right.

We don't have these hypercalls for KVM atm, which I am working on now.

> >For unmodified guests, IMHO a plain yield (or slightly enhanced yield [1])
> >should fix the LHP problem.
> 
> A plain yield (ignoring no-opiness on Linux) will penalize the
> running guest wrt other guests.  We need to maintain fairness.

Agreed on the need to maintain fairness.

> >Fyi, Xen folks also seem to be avoiding a directed yield for some of the same
> >reasons [3].
> 
> I think that fails for unmodified guests, where you don't know when
> the lock is released and so you don't have a wake_up notification.
> You lost a large timeslice and you can't gain it back, whereas with
> pv the wakeup means you only lose as much time as the lock was held.
> 
> >Given this line of thinking, hard-limiting guests (either in user-space or
> >kernel-space, latter being what I prefer) should not have adverse interactions
> >with LHP-related solutions.
> 
> If you hard-limit a vcpu that holds a lock, any waiting vcpus are
> also halted.

This can happen in normal case when lock-holders are preempted as well. So
not a new problem that hard-limits is introducing!

>  With directed yield you can let the lock holder make
> some progress at the expense of another vcpu.  A regular yield()
> will simply stall the waiter.

Agreed. Do you see any problems with slightly enhanced version of yeild
described above (rather than directed yield)? It has some advantage over 
directed yield in that it preserves not only fairness between VMs but also 
fairness between VCPUs of a VM. Also it avoids the need for a guessing game 
mentioned above and bad interactions with hard-limits.

CCing other scheduler experts for their opinion of proposed yield() extensions.

- vatsa
Peter Zijlstra - Dec. 1, 2010, 4:25 p.m.
On Wed, 2010-12-01 at 21:42 +0530, Srivatsa Vaddagiri wrote:

> Not if yield() remembers what timeslice was given up and adds that back when
> thread is finally ready to run. Figure below illustrates this idea:
> 
> 
>       A0/4    C0/4 D0/4 A0/4  C0/4 D0/4 A0/4  C0/4 D0/4 A0/4 
> p0   |----|-L|----|----|----|L|----|----|----|L|----|----|----|--------------|
>             \                \               \                  \
> 	   B0/2[2]	    B0/0[6]         B0/0[10]            B0/14[0]
> 
>  
> where,
> 	p0	-> physical cpu0
> 	L	-> denotes period of lock contention
> 	A0/4    -> means vcpu A0 (of guest A) ran for 4 ms
> 	B0/2[6] -> means vcpu B0 (of guest B) ran for 2 ms (and has given up
> 		   6ms worth of its timeslice so far). In reality, we should
> 	     	   not see too much of "given up" timeslice for a vcpu.

/me fails to parse

> > >Regarding directed yield, do we have any reliable mechanism to find target of
> > >directed yield in this (unmodified/non-paravirtualized guest) case? IOW how do
> > >we determine the vcpu thread to which cycles need to be yielded upon contention?
> > 
> > My idea was to yield to a random starved vcpu of the same guest.
> > There are several cases to consider:
> > 
> > - we hit the right vcpu; lock is released, party.
> > - we hit some vcpu that is doing unrelated work.  yielding thread
> > doesn't make progress, but we're not wasting cpu time.
> > - we hit another waiter for the same lock.  it will also PLE exit
> > and trigger a directed yield.  this increases the cost of directed
> > yield by a factor of count_of_runnable_but_not_running_vcpus, which
> > could be large, but not disasterously so (i.e. don't run a 64-vcpu
> > guest on a uniprocessor host with this)
> > 
> > >>  So if you were to test something similar running with a 20% vcpu
> > >>  cap, I'm sure you'd run into similar issues.  It may show with fewer
> > >>  vcpus (I've only tested 64).
> > >>
> > >>  >Are you assuming the existence of a directed yield and the
> > >>  >specific concern is what happens when a directed yield happens
> > >>  >after a PLE and the target of the yield has been capped?
> > >>
> > >>  Yes.  My concern is that we will see the same kind of problems
> > >>  directed yield was designed to fix, but without allowing directed
> > >>  yield to fix them.  Directed yield was designed to fix lock holder
> > >>  preemption under contention,
> > >
> > >For modified guests, something like [2] seems to be the best approach to fix
> > >lock-holder preemption (LHP) problem, which does not require any sort of
> > >directed yield support. Essentially upon contention, a vcpu registers its lock
> > >of interest and goes to sleep (via hypercall) waiting for lock-owner to wake it
> > >up (again via another hypercall).
> > 
> > Right.
> 
> We don't have these hypercalls for KVM atm, which I am working on now.
> 
> > >For unmodified guests, IMHO a plain yield (or slightly enhanced yield [1])
> > >should fix the LHP problem.
> > 
> > A plain yield (ignoring no-opiness on Linux) will penalize the
> > running guest wrt other guests.  We need to maintain fairness.
> 
> Agreed on the need to maintain fairness.

Directed yield and fairness don't mix well either. You can end up
feeding the other tasks more time than you'll ever get back.

> > >Fyi, Xen folks also seem to be avoiding a directed yield for some of the same
> > >reasons [3].
> > 
> > I think that fails for unmodified guests, where you don't know when
> > the lock is released and so you don't have a wake_up notification.
> > You lost a large timeslice and you can't gain it back, whereas with
> > pv the wakeup means you only lose as much time as the lock was held.
> > 
> > >Given this line of thinking, hard-limiting guests (either in user-space or
> > >kernel-space, latter being what I prefer) should not have adverse interactions
> > >with LHP-related solutions.
> > 
> > If you hard-limit a vcpu that holds a lock, any waiting vcpus are
> > also halted.
> 
> This can happen in normal case when lock-holders are preempted as well. So
> not a new problem that hard-limits is introducing!

No, but hard limits make it _much_ worse.

> >  With directed yield you can let the lock holder make
> > some progress at the expense of another vcpu.  A regular yield()
> > will simply stall the waiter.
> 
> Agreed. Do you see any problems with slightly enhanced version of yeild
> described above (rather than directed yield)? It has some advantage over 
> directed yield in that it preserves not only fairness between VMs but also 
> fairness between VCPUs of a VM. Also it avoids the need for a guessing game 
> mentioned above and bad interactions with hard-limits.
> 
> CCing other scheduler experts for their opinion of proposed yield() extensions.

sys_yield() usage for anything other but two FIFO threads of the same
priority goes to /dev/null.

The Xen paravirt spinlock solution is relatively sane, use that.
Unmodified guests suck anyway, there's really nothing much sane you can
do there as you don't know who owns what lock.
Chris Wright - Dec. 1, 2010, 5:17 p.m.
* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Wed, 2010-12-01 at 21:42 +0530, Srivatsa Vaddagiri wrote:
> 
> > Not if yield() remembers what timeslice was given up and adds that back when
> > thread is finally ready to run. Figure below illustrates this idea:
> > 
> > 
> >       A0/4    C0/4 D0/4 A0/4  C0/4 D0/4 A0/4  C0/4 D0/4 A0/4 
> > p0   |----|-L|----|----|----|L|----|----|----|L|----|----|----|--------------|
> >             \                \               \                  \
> > 	   B0/2[2]	    B0/0[6]         B0/0[10]            B0/14[0]
> > 
> >  
> > where,
> > 	p0	-> physical cpu0
> > 	L	-> denotes period of lock contention
> > 	A0/4    -> means vcpu A0 (of guest A) ran for 4 ms
> > 	B0/2[6] -> means vcpu B0 (of guest B) ran for 2 ms (and has given up
> > 		   6ms worth of its timeslice so far). In reality, we should
> > 	     	   not see too much of "given up" timeslice for a vcpu.
> 
> /me fails to parse
> 
> > > >Regarding directed yield, do we have any reliable mechanism to find target of
> > > >directed yield in this (unmodified/non-paravirtualized guest) case? IOW how do
> > > >we determine the vcpu thread to which cycles need to be yielded upon contention?
> > > 
> > > My idea was to yield to a random starved vcpu of the same guest.
> > > There are several cases to consider:
> > > 
> > > - we hit the right vcpu; lock is released, party.
> > > - we hit some vcpu that is doing unrelated work.  yielding thread
> > > doesn't make progress, but we're not wasting cpu time.
> > > - we hit another waiter for the same lock.  it will also PLE exit
> > > and trigger a directed yield.  this increases the cost of directed
> > > yield by a factor of count_of_runnable_but_not_running_vcpus, which
> > > could be large, but not disasterously so (i.e. don't run a 64-vcpu
> > > guest on a uniprocessor host with this)
> > > 
> > > >>  So if you were to test something similar running with a 20% vcpu
> > > >>  cap, I'm sure you'd run into similar issues.  It may show with fewer
> > > >>  vcpus (I've only tested 64).
> > > >>
> > > >>  >Are you assuming the existence of a directed yield and the
> > > >>  >specific concern is what happens when a directed yield happens
> > > >>  >after a PLE and the target of the yield has been capped?
> > > >>
> > > >>  Yes.  My concern is that we will see the same kind of problems
> > > >>  directed yield was designed to fix, but without allowing directed
> > > >>  yield to fix them.  Directed yield was designed to fix lock holder
> > > >>  preemption under contention,
> > > >
> > > >For modified guests, something like [2] seems to be the best approach to fix
> > > >lock-holder preemption (LHP) problem, which does not require any sort of
> > > >directed yield support. Essentially upon contention, a vcpu registers its lock
> > > >of interest and goes to sleep (via hypercall) waiting for lock-owner to wake it
> > > >up (again via another hypercall).
> > > 
> > > Right.
> > 
> > We don't have these hypercalls for KVM atm, which I am working on now.
> > 
> > > >For unmodified guests, IMHO a plain yield (or slightly enhanced yield [1])
> > > >should fix the LHP problem.
> > > 
> > > A plain yield (ignoring no-opiness on Linux) will penalize the
> > > running guest wrt other guests.  We need to maintain fairness.
> > 
> > Agreed on the need to maintain fairness.
> 
> Directed yield and fairness don't mix well either. You can end up
> feeding the other tasks more time than you'll ever get back.

If the directed yield is always to another task in your cgroup then
inter-guest scheduling fairness should be maintained.

> > > >Fyi, Xen folks also seem to be avoiding a directed yield for some of the same
> > > >reasons [3].
> > > 
> > > I think that fails for unmodified guests, where you don't know when
> > > the lock is released and so you don't have a wake_up notification.
> > > You lost a large timeslice and you can't gain it back, whereas with
> > > pv the wakeup means you only lose as much time as the lock was held.
> > > 
> > > >Given this line of thinking, hard-limiting guests (either in user-space or
> > > >kernel-space, latter being what I prefer) should not have adverse interactions
> > > >with LHP-related solutions.
> > > 
> > > If you hard-limit a vcpu that holds a lock, any waiting vcpus are
> > > also halted.
> > 
> > This can happen in normal case when lock-holders are preempted as well. So
> > not a new problem that hard-limits is introducing!
> 
> No, but hard limits make it _much_ worse.
> 
> > >  With directed yield you can let the lock holder make
> > > some progress at the expense of another vcpu.  A regular yield()
> > > will simply stall the waiter.
> > 
> > Agreed. Do you see any problems with slightly enhanced version of yeild
> > described above (rather than directed yield)? It has some advantage over 
> > directed yield in that it preserves not only fairness between VMs but also 
> > fairness between VCPUs of a VM. Also it avoids the need for a guessing game 
> > mentioned above and bad interactions with hard-limits.
> > 
> > CCing other scheduler experts for their opinion of proposed yield() extensions.
> 
> sys_yield() usage for anything other but two FIFO threads of the same
> priority goes to /dev/null.
> 
> The Xen paravirt spinlock solution is relatively sane, use that.
> Unmodified guests suck anyway, there's really nothing much sane you can
> do there as you don't know who owns what lock.
Peter Zijlstra - Dec. 1, 2010, 5:22 p.m.
On Wed, 2010-12-01 at 09:17 -0800, Chris Wright wrote:
> Directed yield and fairness don't mix well either. You can end up
> feeding the other tasks more time than you'll ever get back.

If the directed yield is always to another task in your cgroup then
inter-guest scheduling fairness should be maintained. 

Yes, but not the inter-vcpu fairness.
Rik van Riel - Dec. 1, 2010, 5:26 p.m.
On 12/01/2010 12:22 PM, Peter Zijlstra wrote:
> On Wed, 2010-12-01 at 09:17 -0800, Chris Wright wrote:
>> Directed yield and fairness don't mix well either. You can end up
>> feeding the other tasks more time than you'll ever get back.
>
> If the directed yield is always to another task in your cgroup then
> inter-guest scheduling fairness should be maintained.
>
> Yes, but not the inter-vcpu fairness.

The pause loop exiting & directed yield patches I am working on
preserve inter-vcpu fairness by round robining among the vcpus
inside one KVM guest.
Srivatsa Vaddagiri - Dec. 1, 2010, 5:29 p.m.
On Wed, Dec 01, 2010 at 05:25:18PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-01 at 21:42 +0530, Srivatsa Vaddagiri wrote:
> 
> > Not if yield() remembers what timeslice was given up and adds that back when
> > thread is finally ready to run. Figure below illustrates this idea:
> > 
> > 
> >       A0/4    C0/4 D0/4 A0/4  C0/4 D0/4 A0/4  C0/4 D0/4 A0/4 
> > p0   |----|-L|----|----|----|L|----|----|----|L|----|----|----|--------------|
> >             \                \               \                  \
> > 	   B0/2[2]	    B0/0[6]         B0/0[10]            B0/14[0]
> > 
> >  
> > where,
> > 	p0	-> physical cpu0
> > 	L	-> denotes period of lock contention
> > 	A0/4    -> means vcpu A0 (of guest A) ran for 4 ms
> > 	B0/2[6] -> means vcpu B0 (of guest B) ran for 2 ms (and has given up
> > 		   6ms worth of its timeslice so far). In reality, we should
> > 	     	   not see too much of "given up" timeslice for a vcpu.
> 
> /me fails to parse

Maybe ASCII art didnt get displayed well by your email reader? Essentially what
I am suggesting above is some extension to yield as below:

yield_task_fair(...)
{

+	ideal_runtime = sched_slice(cfs_rq, curr);
+       delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+	rem_time_slice = ideal_runtime - delta_exec;
+
+	current->donate_time += rem_time_slice > some_threshold ?
+				 some_threshold : rem_time_slice;

	...
}


sched_slice(...)
{
	slice = ...

+	slice += current->donate_time;

}

or something close to it. I am bit reluctant to go that route myself, unless the
fairness issue with plain yield is quite bad.

> > > A plain yield (ignoring no-opiness on Linux) will penalize the
> > > running guest wrt other guests.  We need to maintain fairness.

Avi, any idea how much penalty are we talking of here in using plain yield?
If that is acceptable in practice, I'd prefer we use the plain yield rather than
add any more sophistication to it ..

> > Agreed on the need to maintain fairness.
> 
> Directed yield and fairness don't mix well either. You can end up
> feeding the other tasks more time than you'll ever get back.

Agreed, that's the reason I was suggesting a different form of yield which
addresses the fairness between VCPUs as well.

> > This can happen in normal case when lock-holders are preempted as well. So
> > not a new problem that hard-limits is introducing!
> 
> No, but hard limits make it _much_ worse.

Sure ..

> > >  With directed yield you can let the lock holder make
> > > some progress at the expense of another vcpu.  A regular yield()
> > > will simply stall the waiter.
> > 
> > Agreed. Do you see any problems with slightly enhanced version of yeild
> > described above (rather than directed yield)? It has some advantage over 
> > directed yield in that it preserves not only fairness between VMs but also 
> > fairness between VCPUs of a VM. Also it avoids the need for a guessing game 
> > mentioned above and bad interactions with hard-limits.
> > 
> > CCing other scheduler experts for their opinion of proposed yield() extensions.
> 
> sys_yield() usage for anything other but two FIFO threads of the same
> priority goes to /dev/null.
> 
> The Xen paravirt spinlock solution is relatively sane, use that.
> Unmodified guests suck anyway, there's really nothing much sane you can
> do there as you don't know who owns what lock.

Hopefully we don't have to deal with unmodified guests for too long.
Till that time, plain yield() upon lock-contention seems like the best option
we have for such unmodified guests.

- vatsa
Peter Zijlstra - Dec. 1, 2010, 5:45 p.m.
On Wed, 2010-12-01 at 22:59 +0530, Srivatsa Vaddagiri wrote:
> 
> yield_task_fair(...)
> {
> 
> +       ideal_runtime = sched_slice(cfs_rq, curr);
> +       delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> +       rem_time_slice = ideal_runtime - delta_exec;
> +
> +       current->donate_time += rem_time_slice > some_threshold ?
> +                                some_threshold : rem_time_slice;
> 
>         ...
> }
> 
> 
> sched_slice(...)
> {
>         slice = ...
> 
> +       slice += current->donate_time;
> 
> }
> 
> or something close to it. I am bit reluctant to go that route myself, unless the
> fairness issue with plain yield is quite bad. 

That really won't do anything. You need to adjust both tasks their
vruntime. Also, I really wouldn't touch the yield() implementation, nor
would I expose any such time donation crap to userspace.
Chris Wright - Dec. 1, 2010, 5:46 p.m.
* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Wed, 2010-12-01 at 09:17 -0800, Chris Wright wrote:
> > Directed yield and fairness don't mix well either. You can end up
> > feeding the other tasks more time than you'll ever get back.
> 
> If the directed yield is always to another task in your cgroup then
> inter-guest scheduling fairness should be maintained. 
> 
> Yes, but not the inter-vcpu fairness.

That same vcpu doesn't get fair scheduling if it spends its entire
timeslice spinning on a lock held by a de-scheduled vcpu.
Srivatsa Vaddagiri - Dec. 1, 2010, 6 p.m.
On Wed, Dec 01, 2010 at 06:45:02PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-01 at 22:59 +0530, Srivatsa Vaddagiri wrote:
> > 
> > yield_task_fair(...)
> > {
> > 
> > +       ideal_runtime = sched_slice(cfs_rq, curr);
> > +       delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> > +       rem_time_slice = ideal_runtime - delta_exec;
> > +
> > +       current->donate_time += rem_time_slice > some_threshold ?
> > +                                some_threshold : rem_time_slice;
> > 
> >         ...
> > }
> > 
> > 
> > sched_slice(...)
> > {
> >         slice = ...
> > 
> > +       slice += current->donate_time;
> > 
> > }
> > 
> > or something close to it. I am bit reluctant to go that route myself, unless the
> > fairness issue with plain yield is quite bad. 
> 
> That really won't do anything. You need to adjust both tasks their
> vruntime.

We are dealing with just one task here (the task that is yielding).
After recording how much timeslice we are "giving up" in current->donate_time
(donate_time is perhaps not the right name to use), we adjust the yielding
task's vruntime as per existing logic (for ex: to make it go to back of
runqueue). When the yielding tasks gets to run again, lock is hopefully 
available for it to grab, we let it run longer than the default sched_slice()
to compensate for what time it gave up previously to other threads in same
runqueue. This ensures that because of yielding upon lock contention, we are not
leaking bandwidth in favor of other guests. Again I don't know how much of
fairness issue this is in practice, so unless we see some numbers I'd prefer
sticking to plain yield() upon lock-contention (for unmodified guests that is).

> Also, I really wouldn't touch the yield() implementation, nor
> would I expose any such time donation crap to userspace.

- vatsa
Peter Zijlstra - Dec. 1, 2010, 7:07 p.m.
On Wed, 2010-12-01 at 12:26 -0500, Rik van Riel wrote:
> On 12/01/2010 12:22 PM, Peter Zijlstra wrote:
> > On Wed, 2010-12-01 at 09:17 -0800, Chris Wright wrote:
> >> Directed yield and fairness don't mix well either. You can end up
> >> feeding the other tasks more time than you'll ever get back.
> >
> > If the directed yield is always to another task in your cgroup then
> > inter-guest scheduling fairness should be maintained.
> >
> > Yes, but not the inter-vcpu fairness.
> 
> The pause loop exiting & directed yield patches I am working on
> preserve inter-vcpu fairness by round robining among the vcpus
> inside one KVM guest.

I don't necessarily think that's enough.

Suppose you've got 4 vcpus, one is holding a lock and 3 are spinning.
They'll end up all three donating some time to the 4th.

The only way to make that fair again is if due to future contention the
4th cpu donates an equal amount of time back to the resp. cpus it got
time from. Guest lock patterns and host scheduling don't provide this
guarantee.
Peter Zijlstra - Dec. 1, 2010, 7:09 p.m.
On Wed, 2010-12-01 at 23:30 +0530, Srivatsa Vaddagiri wrote:
> On Wed, Dec 01, 2010 at 06:45:02PM +0100, Peter Zijlstra wrote:
> > On Wed, 2010-12-01 at 22:59 +0530, Srivatsa Vaddagiri wrote:
> > > 
> > > yield_task_fair(...)
> > > {
> > > 
> > > +       ideal_runtime = sched_slice(cfs_rq, curr);
> > > +       delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> > > +       rem_time_slice = ideal_runtime - delta_exec;
> > > +
> > > +       current->donate_time += rem_time_slice > some_threshold ?
> > > +                                some_threshold : rem_time_slice;
> > > 
> > >         ...
> > > }
> > > 
> > > 
> > > sched_slice(...)
> > > {
> > >         slice = ...
> > > 
> > > +       slice += current->donate_time;
> > > 
> > > }
> > > 
> > > or something close to it. I am bit reluctant to go that route myself, unless the
> > > fairness issue with plain yield is quite bad. 
> > 
> > That really won't do anything. You need to adjust both tasks their
> > vruntime.
> 
> We are dealing with just one task here (the task that is yielding).
> After recording how much timeslice we are "giving up" in current->donate_time
> (donate_time is perhaps not the right name to use), we adjust the yielding
> task's vruntime as per existing logic (for ex: to make it go to back of
> runqueue). When the yielding tasks gets to run again, lock is hopefully 
> available for it to grab, we let it run longer than the default sched_slice()
> to compensate for what time it gave up previously to other threads in same
> runqueue. This ensures that because of yielding upon lock contention, we are not
> leaking bandwidth in favor of other guests. Again I don't know how much of
> fairness issue this is in practice, so unless we see some numbers I'd prefer
> sticking to plain yield() upon lock-contention (for unmodified guests that is).

No, that won't work. Once you've given up time you cannot add it back
without destroying fairness.

You can limit the unfairness by limiting the amount of feedback, but I
really dislike such 'yield' semantics.
Rik van Riel - Dec. 1, 2010, 7:24 p.m.
On 12/01/2010 02:07 PM, Peter Zijlstra wrote:
> On Wed, 2010-12-01 at 12:26 -0500, Rik van Riel wrote:
>> On 12/01/2010 12:22 PM, Peter Zijlstra wrote:

>> The pause loop exiting&  directed yield patches I am working on
>> preserve inter-vcpu fairness by round robining among the vcpus
>> inside one KVM guest.
>
> I don't necessarily think that's enough.
>
> Suppose you've got 4 vcpus, one is holding a lock and 3 are spinning.
> They'll end up all three donating some time to the 4th.
>
> The only way to make that fair again is if due to future contention the
> 4th cpu donates an equal amount of time back to the resp. cpus it got
> time from. Guest lock patterns and host scheduling don't provide this
> guarantee.

You have no guarantees when running virtualized, guest
CPU time could be taken away by another guest just as
easily as by another VCPU.

Even if we equalized the amount of CPU time each VCPU
ends up getting across some time interval, that is no
guarantee they get useful work done, or that the time
gets fairly divided to _user processes_ running inside
the guest.

The VCPU could be running something lock-happy when
it temporarily gives up the CPU, and get extra CPU time
back when running something userspace intensive.

In-between, it may well have scheduled to another task
(allowing it to get more CPU time).

I'm not convinced the kind of fairness you suggest is
possible or useful.
Peter Zijlstra - Dec. 1, 2010, 7:35 p.m.
On Wed, 2010-12-01 at 14:24 -0500, Rik van Riel wrote:
> On 12/01/2010 02:07 PM, Peter Zijlstra wrote:
> > On Wed, 2010-12-01 at 12:26 -0500, Rik van Riel wrote:
> >> On 12/01/2010 12:22 PM, Peter Zijlstra wrote:
> 
> >> The pause loop exiting&  directed yield patches I am working on
> >> preserve inter-vcpu fairness by round robining among the vcpus
> >> inside one KVM guest.
> >
> > I don't necessarily think that's enough.
> >
> > Suppose you've got 4 vcpus, one is holding a lock and 3 are spinning.
> > They'll end up all three donating some time to the 4th.
> >
> > The only way to make that fair again is if due to future contention the
> > 4th cpu donates an equal amount of time back to the resp. cpus it got
> > time from. Guest lock patterns and host scheduling don't provide this
> > guarantee.
> 
> You have no guarantees when running virtualized, guest
> CPU time could be taken away by another guest just as
> easily as by another VCPU.
> 
> Even if we equalized the amount of CPU time each VCPU
> ends up getting across some time interval, that is no
> guarantee they get useful work done, or that the time
> gets fairly divided to _user processes_ running inside
> the guest.

Right, and Jeremy was working on making the guest load-balancer aware of
that so the user-space should get fairly scheduled on service (of
course, that's assuming you run a linux guest with that logic in).

> The VCPU could be running something lock-happy when
> it temporarily gives up the CPU, and get extra CPU time
> back when running something userspace intensive.
> 
> In-between, it may well have scheduled to another task
> (allowing it to get more CPU time).
> 
> I'm not convinced the kind of fairness you suggest is
> possible or useful.

Well, physical cpus get equal service, but yeah, time loss due to
contention could probably be talked equivalent to do non-equal service
in the vcpu case.

Anyway, don't take it as a critique per-se, your approach sounds like
the sanest proposal yet.
Rik van Riel - Dec. 1, 2010, 7:42 p.m.
On 12/01/2010 02:35 PM, Peter Zijlstra wrote:
> On Wed, 2010-12-01 at 14:24 -0500, Rik van Riel wrote:

>> Even if we equalized the amount of CPU time each VCPU
>> ends up getting across some time interval, that is no
>> guarantee they get useful work done, or that the time
>> gets fairly divided to _user processes_ running inside
>> the guest.
>
> Right, and Jeremy was working on making the guest load-balancer aware of
> that so the user-space should get fairly scheduled on service (of
> course, that's assuming you run a linux guest with that logic in).

At that point, you might not need the host side balancing
any more, since the guest can move around processes
internally (if needed).
Peter Zijlstra - Dec. 1, 2010, 7:47 p.m.
On Wed, 2010-12-01 at 14:42 -0500, Rik van Riel wrote:
> On 12/01/2010 02:35 PM, Peter Zijlstra wrote:
> > On Wed, 2010-12-01 at 14:24 -0500, Rik van Riel wrote:
> 
> >> Even if we equalized the amount of CPU time each VCPU
> >> ends up getting across some time interval, that is no
> >> guarantee they get useful work done, or that the time
> >> gets fairly divided to _user processes_ running inside
> >> the guest.
> >
> > Right, and Jeremy was working on making the guest load-balancer aware of
> > that so the user-space should get fairly scheduled on service (of
> > course, that's assuming you run a linux guest with that logic in).
> 
> At that point, you might not need the host side balancing
> any more, since the guest can move around processes
> internally (if needed).

Not quite sure what you're saying, host load-balancing is always needed,
but if you're talking about the whole directed yield thing, then yes,
paravirt spinlocks will take care of that.
Avi Kivity - Dec. 2, 2010, 9:07 a.m.
On 12/01/2010 09:07 PM, Peter Zijlstra wrote:
> >
> >  The pause loop exiting&  directed yield patches I am working on
> >  preserve inter-vcpu fairness by round robining among the vcpus
> >  inside one KVM guest.
>
> I don't necessarily think that's enough.
>
> Suppose you've got 4 vcpus, one is holding a lock and 3 are spinning.
> They'll end up all three donating some time to the 4th.
>
> The only way to make that fair again is if due to future contention the
> 4th cpu donates an equal amount of time back to the resp. cpus it got
> time from. Guest lock patterns and host scheduling don't provide this
> guarantee.
>

That is fine.  Fairness is pointless if it there's no work to be done 
for the other three threads.  Just like if you have four processes, one 
of which if processing, the others sleeping, cpu is not divided equally.
Avi Kivity - Dec. 2, 2010, 9:14 a.m.
On 12/01/2010 07:29 PM, Srivatsa Vaddagiri wrote:
> >  >  >  A plain yield (ignoring no-opiness on Linux) will penalize the
> >  >  >  running guest wrt other guests.  We need to maintain fairness.
>
> Avi, any idea how much penalty are we talking of here in using plain yield?
> If that is acceptable in practice, I'd prefer we use the plain yield rather than
> add any more sophistication to it ..

The first issue with plain yield is that it's a no-op unless some ioctl 
is enabled.  It's simply not very meaningful for CFS.  That's why the 
current PLE implementation uses sleep() (which is incredibly sucky in 
another way).

Even if we do get yield() to work, giving up a timeslice will cause a 
contending guest to lose way more than its fair share.  Consider a lock 
that is held for 100us every 500us; if each time we detect contention we 
give up a 3ms timeslice, the guest will grind down to a halt while other 
guests will happily use its timeslice.  With directed yield, the guest 
continues to run; and any excess time we donated to a vcpu is preserved.

> >  The Xen paravirt spinlock solution is relatively sane, use that.
> >  Unmodified guests suck anyway, there's really nothing much sane you can
> >  do there as you don't know who owns what lock.
>
> Hopefully we don't have to deal with unmodified guests for too long.
> Till that time, plain yield() upon lock-contention seems like the best option
> we have for such unmodified guests.

We will have to deal with unmodified guests forever.  Both older Linux 
and Windows.
Avi Kivity - Dec. 2, 2010, 9:17 a.m.
On 12/01/2010 09:09 PM, Peter Zijlstra wrote:
> >
> >  We are dealing with just one task here (the task that is yielding).
> >  After recording how much timeslice we are "giving up" in current->donate_time
> >  (donate_time is perhaps not the right name to use), we adjust the yielding
> >  task's vruntime as per existing logic (for ex: to make it go to back of
> >  runqueue). When the yielding tasks gets to run again, lock is hopefully
> >  available for it to grab, we let it run longer than the default sched_slice()
> >  to compensate for what time it gave up previously to other threads in same
> >  runqueue. This ensures that because of yielding upon lock contention, we are not
> >  leaking bandwidth in favor of other guests. Again I don't know how much of
> >  fairness issue this is in practice, so unless we see some numbers I'd prefer
> >  sticking to plain yield() upon lock-contention (for unmodified guests that is).
>
> No, that won't work. Once you've given up time you cannot add it back
> without destroying fairness.
>
> You can limit the unfairness by limiting the amount of feedback, but I
> really dislike such 'yield' semantics.

Agreed.

What I'd like to see in directed yield is donating exactly the amount of 
vruntime that's needed to make the target thread run.  The donating 
thread won't get its vruntime back, unless the other thread hits 
contention itself and does a directed yield back.  So even if your lock 
is ping-ponged around, the guest doesn't lose vruntime compared to other 
processes on the host.
Srivatsa Vaddagiri - Dec. 2, 2010, 11:47 a.m.
On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote:
> On 12/01/2010 09:09 PM, Peter Zijlstra wrote:
> >>
> >>  We are dealing with just one task here (the task that is yielding).
> >>  After recording how much timeslice we are "giving up" in current->donate_time
> >>  (donate_time is perhaps not the right name to use), we adjust the yielding
> >>  task's vruntime as per existing logic (for ex: to make it go to back of
> >>  runqueue). When the yielding tasks gets to run again, lock is hopefully
> >>  available for it to grab, we let it run longer than the default sched_slice()
> >>  to compensate for what time it gave up previously to other threads in same
> >>  runqueue. This ensures that because of yielding upon lock contention, we are not
> >>  leaking bandwidth in favor of other guests. Again I don't know how much of
> >>  fairness issue this is in practice, so unless we see some numbers I'd prefer
> >>  sticking to plain yield() upon lock-contention (for unmodified guests that is).
> >
> >No, that won't work. Once you've given up time you cannot add it back
> >without destroying fairness.

Over shorter intervals perhaps. Over longer interval (few seconds to couple of
minutes), fairness should not be affected because of this feedback? In any case,
don't we have similar issues with directed yield as well?

> >You can limit the unfairness by limiting the amount of feedback, but I
> >really dislike such 'yield' semantics.
> 
> Agreed.
> 
> What I'd like to see in directed yield is donating exactly the
> amount of vruntime that's needed to make the target thread run.

I presume this requires the target vcpu to move left in rb-tree to run 
earlier than scheduled currently and that it doesn't involve any
change to the sched_period() of target vcpu?

Just was wondering how this would work in case of buggy guests. Lets say that a
guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1
currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep
boosting each other's vruntime, potentially affecting fairtime for other guests
(to the point of starving them perhaps)?

- vatsa

> The donating thread won't get its vruntime back, unless the other thread
> hits contention itself and does a directed yield back.  So even if
> your lock is ping-ponged around, the guest doesn't lose vruntime
> compared to other processes on the host.
Srivatsa Vaddagiri - Dec. 2, 2010, 12:19 p.m.
On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote:
> What I'd like to see in directed yield is donating exactly the
> amount of vruntime that's needed to make the target thread run.  The

How would that work well with hard-limits? The target thread would have been
rate limited and no amount of vruntime donation would make it work
"immediately" ..

- vatsa
Srivatsa Vaddagiri - Dec. 2, 2010, 12:22 p.m.
On Thu, Dec 02, 2010 at 05:17:00PM +0530, Srivatsa Vaddagiri wrote:
> Just was wondering how this would work in case of buggy guests. Lets say that a
> guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1
> currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep
> boosting each other's vruntime, potentially affecting fairtime for other guests
> (to the point of starving them perhaps)?

Guests that exhibit strong spinlock contentions can cause similar symptoms as
well?

- vatsa
Avi Kivity - Dec. 2, 2010, 12:41 p.m.
On 12/02/2010 01:47 PM, Srivatsa Vaddagiri wrote:
> On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote:
> >  On 12/01/2010 09:09 PM, Peter Zijlstra wrote:
> >  >>
> >  >>   We are dealing with just one task here (the task that is yielding).
> >  >>   After recording how much timeslice we are "giving up" in current->donate_time
> >  >>   (donate_time is perhaps not the right name to use), we adjust the yielding
> >  >>   task's vruntime as per existing logic (for ex: to make it go to back of
> >  >>   runqueue). When the yielding tasks gets to run again, lock is hopefully
> >  >>   available for it to grab, we let it run longer than the default sched_slice()
> >  >>   to compensate for what time it gave up previously to other threads in same
> >  >>   runqueue. This ensures that because of yielding upon lock contention, we are not
> >  >>   leaking bandwidth in favor of other guests. Again I don't know how much of
> >  >>   fairness issue this is in practice, so unless we see some numbers I'd prefer
> >  >>   sticking to plain yield() upon lock-contention (for unmodified guests that is).
> >  >
> >  >No, that won't work. Once you've given up time you cannot add it back
> >  >without destroying fairness.
>
> Over shorter intervals perhaps. Over longer interval (few seconds to couple of
> minutes), fairness should not be affected because of this feedback? In any case,
> don't we have similar issues with directed yield as well?

Directed yield works by donating vruntime to another thread.  If you 
don't have vruntime, you can't donate it (well, you're guaranteed to 
have some, since you're running, but if all you have is a microsecond's 
worth, that's what you can donate).

> >
> >  What I'd like to see in directed yield is donating exactly the
> >  amount of vruntime that's needed to make the target thread run.
>
> I presume this requires the target vcpu to move left in rb-tree to run
> earlier than scheduled currently and that it doesn't involve any
> change to the sched_period() of target vcpu?
>
> Just was wondering how this would work in case of buggy guests. Lets say that a
> guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1
> currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep
> boosting each other's vruntime, potentially affecting fairtime for other guests
> (to the point of starving them perhaps)?

We preserve vruntime overall.  If you give vruntime to someone, it comes 
at your own expense.  Overall vruntime is preserved.
Avi Kivity - Dec. 2, 2010, 12:42 p.m.
On 12/02/2010 02:19 PM, Srivatsa Vaddagiri wrote:
> On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote:
> >  What I'd like to see in directed yield is donating exactly the
> >  amount of vruntime that's needed to make the target thread run.  The
>
> How would that work well with hard-limits? The target thread would have been
> rate limited and no amount of vruntime donation would make it work
> "immediately" ..

At least it would stop the donator from running.
Srivatsa Vaddagiri - Dec. 2, 2010, 1:13 p.m.
On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote:
> >>  What I'd like to see in directed yield is donating exactly the
> >>  amount of vruntime that's needed to make the target thread run.
> >
> >I presume this requires the target vcpu to move left in rb-tree to run
> >earlier than scheduled currently and that it doesn't involve any
> >change to the sched_period() of target vcpu?
> >
> >Just was wondering how this would work in case of buggy guests. Lets say that a
> >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1
> >currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep
> >boosting each other's vruntime, potentially affecting fairtime for other guests
> >(to the point of starving them perhaps)?
> 
> We preserve vruntime overall.  If you give vruntime to someone, it
> comes at your own expense.  Overall vruntime is preserved.

Hmm ..so I presume that this means we don't affect target thread's position in
rb-tree upon donation, rather we influence its sched_period() to include 
donated time? IOW donation has no effect on causing the target thread to run
"immediately", rather it will have the effect of causing it run "longer"
whenever it runs next?

Even that would require some precaution in directed yield to ensure that it
doesn't unduly inflate vruntime of target, hurting fairness for other guests on
same cpu as target (example guest code that can lead to this situation 
below):

vcpu0:				vcpu1:

				spinlock(A);

spinlock(A);            

                        	while(1)
				;
 
			 	spin_unlock(A);

- vatsa
Avi Kivity - Dec. 2, 2010, 1:49 p.m.
On 12/02/2010 03:13 PM, Srivatsa Vaddagiri wrote:
> On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote:
> >  >>   What I'd like to see in directed yield is donating exactly the
> >  >>   amount of vruntime that's needed to make the target thread run.
> >  >
> >  >I presume this requires the target vcpu to move left in rb-tree to run
> >  >earlier than scheduled currently and that it doesn't involve any
> >  >change to the sched_period() of target vcpu?
> >  >
> >  >Just was wondering how this would work in case of buggy guests. Lets say that a
> >  >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1
> >  >currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep
> >  >boosting each other's vruntime, potentially affecting fairtime for other guests
> >  >(to the point of starving them perhaps)?
> >
> >  We preserve vruntime overall.  If you give vruntime to someone, it
> >  comes at your own expense.  Overall vruntime is preserved.
>
> Hmm ..so I presume that this means we don't affect target thread's position in
> rb-tree upon donation, rather we influence its sched_period() to include
> donated time? IOW donation has no effect on causing the target thread to run
> "immediately", rather it will have the effect of causing it run "longer"
> whenever it runs next?

No.  The intent (at least mine, maybe Rik has other ideas) is to move 
some vruntime from current to target such that target would be placed 
before current in the timeline.

> Even that would require some precaution in directed yield to ensure that it
> doesn't unduly inflate vruntime of target, hurting fairness for other guests on
> same cpu as target (example guest code that can lead to this situation
> below):
>
> vcpu0:				vcpu1:
>
> 				spinlock(A);
>
> spinlock(A);
>
>                          	while(1)
> 				;
>
> 			 	spin_unlock(A);

directed yield should preserve the invariant that sum(vruntime) does not 
change.
Srivatsa Vaddagiri - Dec. 2, 2010, 3:27 p.m.
On Thu, Dec 02, 2010 at 03:49:44PM +0200, Avi Kivity wrote:
> On 12/02/2010 03:13 PM, Srivatsa Vaddagiri wrote:
> >On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote:
> >>  >>   What I'd like to see in directed yield is donating exactly the
> >>  >>   amount of vruntime that's needed to make the target thread run.
> >>  >
> >>  >I presume this requires the target vcpu to move left in rb-tree to run
> >>  >earlier than scheduled currently and that it doesn't involve any
> >>  >change to the sched_period() of target vcpu?
> >>  >
> >>  >Just was wondering how this would work in case of buggy guests. Lets say that a
> >>  >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1
> >>  >currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep
> >>  >boosting each other's vruntime, potentially affecting fairtime for other guests
> >>  >(to the point of starving them perhaps)?
> >>
> >>  We preserve vruntime overall.  If you give vruntime to someone, it
> >>  comes at your own expense.  Overall vruntime is preserved.
> >
> >Hmm ..so I presume that this means we don't affect target thread's position in
> >rb-tree upon donation, rather we influence its sched_period() to include
> >donated time? IOW donation has no effect on causing the target thread to run
> >"immediately", rather it will have the effect of causing it run "longer"
> >whenever it runs next?
> 
> No.  The intent (at least mine, maybe Rik has other ideas) is to

CCing Rik now ..

> move some vruntime from current to target such that target would be
> placed before current in the timeline.

Well ok, then this is what I had presumed earlier (about shifting target towards
left in rb-tree).

> >Even that would require some precaution in directed yield to ensure that it
> >doesn't unduly inflate vruntime of target, hurting fairness for other guests on
> >same cpu as target (example guest code that can lead to this situation
> >below):
> >
> >vcpu0:				vcpu1:
> >
> >				spinlock(A);
> >
> >spinlock(A);
> >
> >                         	while(1)
> >				;
> >
> >			 	spin_unlock(A);
> 
> directed yield should preserve the invariant that sum(vruntime) does
> not change.

Hmm don't think I understand this invariant sum() part. Lets take a simple
example as below:


p0	-> A0 B0 A1

p1	-> B1 C0 C1

A/B/C are VMs and A0 etc are virtual cpus. p0/1 are physical cpus

Let's say A0/A1 hit AB-BA spin-deadlock (which one can write in userspace
delibrately). When A0 spins and exits (due to PLE) what does its directed yield
do? Going by your statement, it can put target before current, leading to
perhaps this arrangement in runqueue:

p0	-> A1 B0 A0

Now A1 spins and wants to do a directed yield back to A0, leading to :

p0	-> A0 B0 A1

This can go back and forth, starving B0 (iow leading to some sort of DoS
attack).

Where does the "invariant sum" part of directed yield kick in to avoid such 
nastiness?

- vatsa
Srivatsa Vaddagiri - Dec. 2, 2010, 3:28 p.m.
Actually CCing Rik now!

On Thu, Dec 02, 2010 at 08:57:16PM +0530, Srivatsa Vaddagiri wrote:
> On Thu, Dec 02, 2010 at 03:49:44PM +0200, Avi Kivity wrote:
> > On 12/02/2010 03:13 PM, Srivatsa Vaddagiri wrote:
> > >On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote:
> > >>  >>   What I'd like to see in directed yield is donating exactly the
> > >>  >>   amount of vruntime that's needed to make the target thread run.
> > >>  >
> > >>  >I presume this requires the target vcpu to move left in rb-tree to run
> > >>  >earlier than scheduled currently and that it doesn't involve any
> > >>  >change to the sched_period() of target vcpu?
> > >>  >
> > >>  >Just was wondering how this would work in case of buggy guests. Lets say that a
> > >>  >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1
> > >>  >currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep
> > >>  >boosting each other's vruntime, potentially affecting fairtime for other guests
> > >>  >(to the point of starving them perhaps)?
> > >>
> > >>  We preserve vruntime overall.  If you give vruntime to someone, it
> > >>  comes at your own expense.  Overall vruntime is preserved.
> > >
> > >Hmm ..so I presume that this means we don't affect target thread's position in
> > >rb-tree upon donation, rather we influence its sched_period() to include
> > >donated time? IOW donation has no effect on causing the target thread to run
> > >"immediately", rather it will have the effect of causing it run "longer"
> > >whenever it runs next?
> > 
> > No.  The intent (at least mine, maybe Rik has other ideas) is to
> 
> CCing Rik now ..
> 
> > move some vruntime from current to target such that target would be
> > placed before current in the timeline.
> 
> Well ok, then this is what I had presumed earlier (about shifting target towards
> left in rb-tree).
> 
> > >Even that would require some precaution in directed yield to ensure that it
> > >doesn't unduly inflate vruntime of target, hurting fairness for other guests on
> > >same cpu as target (example guest code that can lead to this situation
> > >below):
> > >
> > >vcpu0:				vcpu1:
> > >
> > >				spinlock(A);
> > >
> > >spinlock(A);
> > >
> > >                         	while(1)
> > >				;
> > >
> > >			 	spin_unlock(A);
> > 
> > directed yield should preserve the invariant that sum(vruntime) does
> > not change.
> 
> Hmm don't think I understand this invariant sum() part. Lets take a simple
> example as below:
> 
> 
> p0	-> A0 B0 A1
> 
> p1	-> B1 C0 C1
> 
> A/B/C are VMs and A0 etc are virtual cpus. p0/1 are physical cpus
> 
> Let's say A0/A1 hit AB-BA spin-deadlock (which one can write in userspace
> delibrately). When A0 spins and exits (due to PLE) what does its directed yield
> do? Going by your statement, it can put target before current, leading to
> perhaps this arrangement in runqueue:
> 
> p0	-> A1 B0 A0
> 
> Now A1 spins and wants to do a directed yield back to A0, leading to :
> 
> p0	-> A0 B0 A1
> 
> This can go back and forth, starving B0 (iow leading to some sort of DoS
> attack).
> 
> Where does the "invariant sum" part of directed yield kick in to avoid such 
> nastiness?
> 
> - vatsa
Avi Kivity - Dec. 2, 2010, 3:33 p.m.
On 12/02/2010 05:27 PM, Srivatsa Vaddagiri wrote:
> >  >Even that would require some precaution in directed yield to ensure that it
> >  >doesn't unduly inflate vruntime of target, hurting fairness for other guests on
> >  >same cpu as target (example guest code that can lead to this situation
> >  >below):
> >  >
> >  >vcpu0:				vcpu1:
> >  >
> >  >				spinlock(A);
> >  >
> >  >spinlock(A);
> >  >
> >  >                          	while(1)
> >  >				;
> >  >
> >  >			 	spin_unlock(A);
> >
> >  directed yield should preserve the invariant that sum(vruntime) does
> >  not change.
>
> Hmm don't think I understand this invariant sum() part. Lets take a simple
> example as below:
>
>
> p0	->  A0 B0 A1
>
> p1	->  B1 C0 C1
>
> A/B/C are VMs and A0 etc are virtual cpus. p0/1 are physical cpus
>
> Let's say A0/A1 hit AB-BA spin-deadlock (which one can write in userspace
> delibrately). When A0 spins and exits (due to PLE) what does its directed yield
> do? Going by your statement, it can put target before current, leading to
> perhaps this arrangement in runqueue:
>
> p0	->  A1 B0 A0
>
> Now A1 spins and wants to do a directed yield back to A0, leading to :
>
> p0	->  A0 B0 A1
>
> This can go back and forth, starving B0 (iow leading to some sort of DoS
> attack).
>
> Where does the "invariant sum" part of directed yield kick in to avoid such
> nastiness?

A0 and A1's vruntime will keep growing, eventually B will become 
leftmost and become runnable (assuming leftmost == min vruntime, not 
sure what the terminology is).
Srivatsa Vaddagiri - Dec. 2, 2010, 3:44 p.m.
On Thu, Dec 02, 2010 at 05:33:40PM +0200, Avi Kivity wrote:
> A0 and A1's vruntime will keep growing, eventually B will become
> leftmost and become runnable (assuming leftmost == min vruntime, not
> sure what the terminology is).

Donation (in directed yield) will cause vruntime to drop as well (thats the only
way target can get to run ahead of its scheduled time), so I still think there
are nasty issues involved here. 

Anyway, I am curious to see the directed yield implementation that Rik has - I
can comment more after I have seen that!

- vatsa

Patch

diff --git a/cpu-defs.h b/cpu-defs.h
index 51533c6..6434dca 100644
--- a/cpu-defs.h
+++ b/cpu-defs.h
@@ -220,6 +220,7 @@  struct KVMCPUState {
     const char *cpu_model_str;                                          \
     struct KVMState *kvm_state;                                         \
     struct kvm_run *kvm_run;                                            \
+    int sigusr1_fd;                                                     \
     int kvm_fd;                                                         \
     int kvm_vcpu_dirty;                                                 \
     struct KVMCPUState kvm_cpu_state;
diff --git a/qemu-kvm.c b/qemu-kvm.c
index 471306b..354109f 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1351,6 +1351,29 @@  static void pause_all_threads(void)
     }
 }
 
+static void vcpu_stop(CPUState *env)
+{
+    if (env != cpu_single_env) {
+        env->stop = 1;
+        pthread_kill(env->kvm_cpu_state.thread, SIG_IPI);
+    } else {
+        env->stop = 0;
+        env->stopped = 1;
+        cpu_exit(env);
+    }
+
+    while (!env->stopped) {
+        qemu_cond_wait(&qemu_pause_cond);
+    }
+}
+
+static void vcpu_start(CPUState *env)
+{
+    env->stop = 0;
+    env->stopped = 0;
+    pthread_kill(env->kvm_cpu_state.thread, SIG_IPI);
+}
+
 static void resume_all_threads(void)
 {
     CPUState *penv = first_cpu;
@@ -1426,6 +1449,37 @@  static int kvm_main_loop_cpu(CPUState *env)
     return 0;
 }
 
+static __thread int sigusr1_wfd;
+
+static void on_sigusr1(int signo)
+{
+    char ch = 0;
+    if (write(sigusr1_wfd, &ch, 1) < 0) {
+        /* who cares */
+    }
+}
+
+static void sigusr1_read(void *opaque)
+{
+    CPUState *env = opaque;
+    ssize_t len;
+    int caught_signal = 0;
+
+    do {
+        char buffer[256];
+        len = read(env->sigusr1_fd, buffer, sizeof(buffer));
+        caught_signal = 1;
+    } while (len > 0);
+
+    if (caught_signal) {
+        if (env->stopped) {
+            vcpu_start(env);
+        } else {
+            vcpu_stop(env);
+        }
+    }
+}
+
 static void *ap_main_loop(void *_env)
 {
     CPUState *env = _env;
@@ -1433,10 +1487,12 @@  static void *ap_main_loop(void *_env)
 #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
     struct ioperm_data *data = NULL;
 #endif
+    int fds[2];
 
     current_env = env;
     env->thread_id = kvm_get_thread_id();
     sigfillset(&signals);
+    sigdelset(&signals, SIGUSR1);
     sigprocmask(SIG_BLOCK, &signals, NULL);
 
 #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
@@ -1451,6 +1507,18 @@  static void *ap_main_loop(void *_env)
     kvm_create_vcpu(env, env->cpu_index);
     setup_kernel_sigmask(env);
 
+    if (pipe(fds) == -1) {
+        /* do nothing */
+    }
+
+    fcntl(fds[0], F_SETFL, O_NONBLOCK);
+    fcntl(fds[1], F_SETFL, O_NONBLOCK);
+
+    env->sigusr1_fd = fds[0];
+    sigusr1_wfd = fds[1];
+
+    qemu_set_fd_handler2(fds[0], NULL, sigusr1_read, NULL, env);
+
     /* signal VCPU creation */
     current_env->created = 1;
     pthread_cond_signal(&qemu_vcpu_cond);
@@ -1463,6 +1531,8 @@  static void *ap_main_loop(void *_env)
     /* re-initialize cpu_single_env after re-acquiring qemu_mutex */
     cpu_single_env = env;
 
+    signal(SIGUSR1, on_sigusr1);
+
     kvm_main_loop_cpu(env);
     return NULL;
 }
diff --git a/qemu-kvm.h b/qemu-kvm.h
index 0f3fb50..3addc77 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -783,6 +783,7 @@  struct KVMState {
     int irqchip_in_kernel;
     int pit_in_kernel;
     int xsave, xcrs;
+    int sigusr2_fd;
 
     struct kvm_context kvm_context;
 };