Message ID: 1290530963-3448-1-git-send-email-aliguori@us.ibm.com
On Tue, Nov 23, 2010 at 4:49 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> qemu-kvm vcpu threads don't respond to SIGSTOP/SIGCONT. Instead of teaching
> them to respond to these signals (which cannot be trapped), use SIGUSR1 to
> approximate the behavior of SIGSTOP/SIGCONT.
>
> The purpose of this is to implement CPU hard limits using an external tool
> that watches the CPU consumption and stops the VCPU as appropriate.
>
> This provides a more elegant solution in that it allows the VCPU thread to
> release qemu_mutex before going to sleep.
>
> This current implementation uses a single signal. I think this is too racy
> in the long term, so I think we should introduce a second signal. If two
> signals get coalesced into one, it could confuse the monitoring tool into
> giving the VCPU the inverse of its entitlement.
>
> It might be better to simply move this logic entirely into QEMU to make
> this more robust--the question is whether we think this is a good long-term
> feature to carry in QEMU?
>
> +static __thread int sigusr1_wfd;

While OpenBSD finally updated the default compiler to 4.2.1 from the 3.x
series, thread-local storage is still not supported:

$ cat thread.c
static __thread int sigusr1_wfd;
$ gcc thread.c -c
thread.c:1: error: thread-local storage not supported for this target
$ gcc -v
Reading specs from /usr/lib/gcc-lib/sparc64-unknown-openbsd4.8/4.2.1/specs
Target: sparc64-unknown-openbsd4.8
Configured with: OpenBSD/sparc64 system compiler
Thread model: posix
gcc version 4.2.1 20070719
On 11/23/2010 01:35 PM, Blue Swirl wrote:
> On Tue, Nov 23, 2010 at 4:49 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
>> [...]
>>
>> +static __thread int sigusr1_wfd;
>
> While OpenBSD finally updated the default compiler to 4.2.1 from the 3.x
> series, thread-local storage is still not supported:

Hrm, is there a portable way to do this (distinguish a signal on a
particular thread)?

Regards,

Anthony Liguori
On 11/23/2010 10:46 PM, Anthony Liguori wrote:
>>> +static __thread int sigusr1_wfd;
>>
>> While OpenBSD finally updated the default compiler to 4.2.1 from the 3.x
>> series, thread-local storage is still not supported:
>
> Hrm, is there a portable way to do this (distinguish a signal on a
> particular thread)?

You can use pthread_getspecific/pthread_setspecific.

Paolo
On 11/23/2010 05:43 PM, Paolo Bonzini wrote:
> On 11/23/2010 10:46 PM, Anthony Liguori wrote:
>> Hrm, is there a portable way to do this (distinguish a signal on a
>> particular thread)?
>
> You can use pthread_getspecific/pthread_setspecific.

Is it signal safe?

BTW, this is all only theoretical. This is in the KVM io thread code, which
is already highly unportable.

Regards,

Anthony Liguori
On 11/24/2010 02:15 AM, Anthony Liguori wrote:
> Is it signal safe?

Yes, at heart it is just a somewhat more expensive access to
pthread_self()->some_array[key].

> BTW, this is all only theoretical. This is in the KVM io thread code,
> which is already highly unportable.

True, and newer versions of GCC emulate __thread even on Windows.

Paolo
On 11/23/2010 06:49 PM, Anthony Liguori wrote:
> qemu-kvm vcpu threads don't respond to SIGSTOP/SIGCONT. Instead of
> teaching them to respond to these signals (which cannot be trapped), use
> SIGUSR1 to approximate the behavior of SIGSTOP/SIGCONT.
>
> [...]
>
> This current implementation uses a single signal. I think this is too
> racy in the long term, so I think we should introduce a second signal.
> If two signals get coalesced into one, it could confuse the monitoring
> tool into giving the VCPU the inverse of its entitlement.

You can use sigqueue() to send an accompanying value.

> It might be better to simply move this logic entirely into QEMU to make
> this more robust--the question is whether we think this is a good
> long-term feature to carry in QEMU?

I'm more concerned about lock holder preemption, and interaction of this
mechanism with any kernel solution for LHP.

> +static __thread int sigusr1_wfd;
> +
> +static void on_sigusr1(int signo)
> +{
> +    char ch = 0;
> +    if (write(sigusr1_wfd, &ch, 1) < 0) {
> +        /* who cares */
> +    }
> +}

We do have signalfd().

> +static void sigusr1_read(void *opaque)
> +{
> +    CPUState *env = opaque;
> +    ssize_t len;
> +    int caught_signal = 0;
> +
> +    do {
> +        char buffer[256];
> +        len = read(env->sigusr1_fd, buffer, sizeof(buffer));
> +        caught_signal = 1;
> +    } while (len > 0);
> +
> +    if (caught_signal) {
> +        if (env->stopped) {

env->stopped is multiplexed among multiple users, so this interferes with
vm_stop(). We need to make ->stopped a reference count instead.
On 11/24/2010 02:18 AM, Avi Kivity wrote:
> On 11/23/2010 06:49 PM, Anthony Liguori wrote:
>> This current implementation uses a single signal. I think this is too
>> racy in the long term, so I think we should introduce a second signal.
>> If two signals get coalesced into one, it could confuse the monitoring
>> tool into giving the VCPU the inverse of its entitlement.
>
> You can use sigqueue() to send an accompanying value.

I switched to using SIGRTMIN+5 and SIGRTMIN+6. I think that's a nicer
solution since it maps to SIGCONT/SIGSTOP.

>> It might be better to simply move this logic entirely into QEMU to make
>> this more robust--the question is whether we think this is a good
>> long-term feature to carry in QEMU?
>
> I'm more concerned about lock holder preemption, and interaction of this
> mechanism with any kernel solution for LHP.

Can you suggest some scenarios and I'll create some test cases? I'm trying
to figure out the best way to evaluate this.

Are you assuming the existence of a directed yield, and the specific
concern is what happens when a directed yield happens after a PLE and the
target of the yield has been capped?

>> +static __thread int sigusr1_wfd;
>> +
>> +static void on_sigusr1(int signo)
>> +{
>> +    char ch = 0;
>> +    if (write(sigusr1_wfd, &ch, 1) < 0) {
>> +        /* who cares */
>> +    }
>> +}
>
> We do have signalfd().

This is actually called from signalfd. I thought about refactoring that
loop to handle signals directly, but since we do this elsewhere I figured
I'd keep things consistent.

>> +static void sigusr1_read(void *opaque)
>> +{
>> +    CPUState *env = opaque;
>> [...]
>
> env->stopped is multiplexed among multiple users, so this interferes
> with vm_stop().
>
> We need to make ->stopped a reference count instead.

Indeed.

Regards,

Anthony Liguori
On 11/24/2010 03:58 PM, Anthony Liguori wrote:
> On 11/24/2010 02:18 AM, Avi Kivity wrote:
>> You can use sigqueue() to send an accompanying value.
>
> I switched to using SIGRTMIN+5 and SIGRTMIN+6. I think that's a nicer
> solution since it maps to SIGCONT/SIGSTOP.

These may get reordered, need to check the semantics.

>> I'm more concerned about lock holder preemption, and interaction of
>> this mechanism with any kernel solution for LHP.
>
> Can you suggest some scenarios and I'll create some test cases? I'm
> trying to figure out the best way to evaluate this.

Booting 64-vcpu Windows on a 64-cpu host with PLE but without directed
yield takes longer than forever, because PLE detects contention within the
guest, which under our current PLE implementation (usleep(100)) converts
guest contention into delays. (A directed yield implementation would find
that all vcpus are runnable, yielding optimal results under this test
case.)

So if you were to test something similar running with a 20% vcpu cap, I'm
sure you'd run into similar issues. It may show with fewer vcpus (I've
only tested 64).

> Are you assuming the existence of a directed yield, and the specific
> concern is what happens when a directed yield happens after a PLE and
> the target of the yield has been capped?

Yes. My concern is that we will see the same kind of problems directed
yield was designed to fix, but without allowing directed yield to fix
them. Directed yield was designed to fix lock holder preemption under
contention; now you're inducing contention but not allowing directed yield
to work, even when we will have it.

>> We do have signalfd().
>
> This is actually called from signalfd. I thought about refactoring that
> loop to handle signals directly, but since we do this elsewhere I
> figured I'd keep things consistent.

Ah, yes.

>> env->stopped is multiplexed among multiple users, so this interferes
>> with vm_stop().
>>
>> We need to make ->stopped a reference count instead.
>
> Indeed.

We also need to make the global vm_stop() be reference based, since there
are multiple consumers of that interface.
On Wed, Nov 24, 2010 at 04:23:15PM +0200, Avi Kivity wrote:
>>> I'm more concerned about lock holder preemption, and interaction of
>>> this mechanism with any kernel solution for LHP.
>>
>> Can you suggest some scenarios and I'll create some test cases? I'm
>> trying to figure out the best way to evaluate this.
>
> Booting 64-vcpu Windows on a 64-cpu host with PLE but without directed
> yield takes longer than forever, because PLE detects contention within
> the guest, which under our current PLE implementation (usleep(100))
> converts guest contention into delays.

Is there any way of optimizing PLE at runtime in such a special case? For
example: either turn off the PLE feature, or gradually increase the
(spin-)timeout at which PLE should kick in.

> (A directed yield implementation would find that all vcpus are runnable,
> yielding optimal results under this test case.)

I would think a plain yield() (rather than usleep/directed yield) would
suffice here (yield would realize that there is nobody else to yield to
and continue running the same vcpu thread). As regards any concern of
leaking cpu bandwidth because of a plain yield, I think it can be fixed by
a simpler modification to yield that allows a thread to reclaim whatever
timeslice it gave up previously [1].

Regarding directed yield, do we have any reliable mechanism to find the
target of directed yield in this (unmodified/non-paravirtualized guest)
case? IOW, how do we determine the vcpu thread to which cycles need to be
yielded upon contention?

> So if you were to test something similar running with a 20% vcpu cap,
> I'm sure you'd run into similar issues. It may show with fewer vcpus
> (I've only tested 64).
>
>> Are you assuming the existence of a directed yield, and the specific
>> concern is what happens when a directed yield happens after a PLE and
>> the target of the yield has been capped?
>
> Yes. My concern is that we will see the same kind of problems directed
> yield was designed to fix, but without allowing directed yield to fix
> them. Directed yield was designed to fix lock holder preemption under
> contention,

For modified guests, something like [2] seems to be the best approach to
fix the lock-holder preemption (LHP) problem, which does not require any
sort of directed yield support. Essentially, upon contention, a vcpu
registers its lock of interest and goes to sleep (via hypercall), waiting
for the lock owner to wake it up (again via another hypercall).

For unmodified guests, IMHO a plain yield (or slightly enhanced yield [1])
should fix the LHP problem. Fyi, Xen folks also seem to be avoiding a
directed yield for some of the same reasons [3].

Given this line of thinking, hard-limiting guests (either in user-space or
kernel-space, the latter being what I prefer) should not have adverse
interactions with LHP-related solutions.

> now you're inducing contention but not allowing directed yield to work,
> even when we will have it.

- vatsa

References:

1. http://lkml.org/lkml/2010/8/3/38
2. https://lkml.org/lkml/2010/11/16/479
3. http://www.gossamer-threads.com/lists/xen/devel/182179#182179
On 12/01/2010 02:37 PM, Srivatsa Vaddagiri wrote:
> Is there any way of optimizing PLE at runtime in such a special case?
> For example: either turn off the PLE feature, or gradually increase the
> (spin-)timeout at which PLE should kick in.

It's not a special case at all. Both host contention and guest contention
are perfectly normal, and can occur simultaneously.

> I would think a plain yield() (rather than usleep/directed yield) would
> suffice here (yield would realize that there is nobody else to yield to
> and continue running the same vcpu thread).

Currently yield() is a no-op on Linux.

> As regards any concern of leaking cpu bandwidth because of a plain
> yield, I think it can be fixed by a simpler modification to yield that
> allows a thread to reclaim whatever timeslice it gave up previously [1].

If some other thread used that timeslice, don't we have an accounting
problem?

> Regarding directed yield, do we have any reliable mechanism to find the
> target of directed yield in this (unmodified/non-paravirtualized guest)
> case? IOW, how do we determine the vcpu thread to which cycles need to
> be yielded upon contention?

My idea was to yield to a random starved vcpu of the same guest. There are
several cases to consider:

- we hit the right vcpu; lock is released, party.
- we hit some vcpu that is doing unrelated work. The yielding thread
  doesn't make progress, but we're not wasting cpu time.
- we hit another waiter for the same lock. It will also PLE exit and
  trigger a directed yield. This increases the cost of directed yield by
  a factor of count_of_runnable_but_not_running_vcpus, which could be
  large, but not disastrously so (i.e. don't run a 64-vcpu guest on a
  uniprocessor host with this).

> For modified guests, something like [2] seems to be the best approach to
> fix the lock-holder preemption (LHP) problem, which does not require any
> sort of directed yield support. Essentially, upon contention, a vcpu
> registers its lock of interest and goes to sleep (via hypercall),
> waiting for the lock owner to wake it up (again via another hypercall).

Right.

> For unmodified guests, IMHO a plain yield (or slightly enhanced yield
> [1]) should fix the LHP problem.

A plain yield (ignoring no-opiness on Linux) will penalize the running
guest wrt other guests. We need to maintain fairness.

> Fyi, Xen folks also seem to be avoiding a directed yield for some of the
> same reasons [3].

I think that fails for unmodified guests, where you don't know when the
lock is released and so you don't have a wake_up notification. You lost a
large timeslice and you can't gain it back, whereas with pv the wakeup
means you only lose as much time as the lock was held.

> Given this line of thinking, hard-limiting guests (either in user-space
> or kernel-space, the latter being what I prefer) should not have adverse
> interactions with LHP-related solutions.

If you hard-limit a vcpu that holds a lock, any waiting vcpus are also
halted. With directed yield you can let the lock holder make some progress
at the expense of another vcpu. A regular yield() will simply stall the
waiter.
On Wed, Dec 01, 2010 at 02:56:44PM +0200, Avi Kivity wrote:
>>> (A directed yield implementation would find that all vcpus are
>>> runnable, yielding optimal results under this test case.)
>>
>> I would think a plain yield() (rather than usleep/directed yield) would
>> suffice here (yield would realize that there is nobody else to yield to
>> and continue running the same vcpu thread).
>
> Currently yield() is a no-op on Linux.

Hmm, only when there is a single task in the runqueue; otherwise yield
will cause the remaining threads to run before letting the yielding task
run again.

>> As regards any concern of leaking cpu bandwidth because of a plain
>> yield, I think it can be fixed by a simpler modification to yield that
>> allows a thread to reclaim whatever timeslice it gave up previously [1].
>
> If some other thread used that timeslice, don't we have an accounting
> problem?

Not if yield() remembers what timeslice was given up and adds that back
when the thread is finally ready to run. The figure below illustrates this
idea:

      A0/4    C0/4  D0/4  A0/4  C0/4  D0/4  A0/4  C0/4  D0/4  A0/4
  p0 |----|-L|----|----|----|L|----|----|----|L|----|----|----|--------------|
           \            \            \            \
          B0/2[2]      B0/0[6]      B0/0[10]     B0/14[0]

where,
  p0      -> physical cpu0
  L       -> denotes period of lock contention
  A0/4    -> means vcpu A0 (of guest A) ran for 4 ms
  B0/2[6] -> means vcpu B0 (of guest B) ran for 2 ms (and has given up
             6 ms worth of its timeslice so far). In reality, we should
             not see too much of "given up" timeslice for a vcpu.

>> Regarding directed yield, do we have any reliable mechanism to find the
>> target of directed yield in this (unmodified/non-paravirtualized guest)
>> case? IOW, how do we determine the vcpu thread to which cycles need to
>> be yielded upon contention?
>
> My idea was to yield to a random starved vcpu of the same guest. There
> are several cases to consider:
>
> - we hit the right vcpu; lock is released, party.
> - we hit some vcpu that is doing unrelated work. The yielding thread
>   doesn't make progress, but we're not wasting cpu time.
> - we hit another waiter for the same lock. It will also PLE exit and
>   trigger a directed yield. This increases the cost of directed yield
>   by a factor of count_of_runnable_but_not_running_vcpus, which could
>   be large, but not disastrously so (i.e. don't run a 64-vcpu guest on
>   a uniprocessor host with this).

>> For modified guests, something like [2] seems to be the best approach
>> to fix the lock-holder preemption (LHP) problem [...]
>
> Right.

We don't have these hypercalls for KVM atm, which I am working on now.

>> For unmodified guests, IMHO a plain yield (or slightly enhanced yield
>> [1]) should fix the LHP problem.
>
> A plain yield (ignoring no-opiness on Linux) will penalize the running
> guest wrt other guests. We need to maintain fairness.

Agreed on the need to maintain fairness.

>> Fyi, Xen folks also seem to be avoiding a directed yield for some of
>> the same reasons [3].
>
> I think that fails for unmodified guests, where you don't know when the
> lock is released and so you don't have a wake_up notification. You lost
> a large timeslice and you can't gain it back, whereas with pv the wakeup
> means you only lose as much time as the lock was held.
>
>> Given this line of thinking, hard-limiting guests (either in user-space
>> or kernel-space, the latter being what I prefer) should not have
>> adverse interactions with LHP-related solutions.
>
> If you hard-limit a vcpu that holds a lock, any waiting vcpus are also
> halted.

This can happen in the normal case, when lock holders are preempted, as
well. So this is not a new problem that hard limits are introducing!

> With directed yield you can let the lock holder make some progress at
> the expense of another vcpu. A regular yield() will simply stall the
> waiter.

Agreed. Do you see any problems with the slightly enhanced version of
yield described above (rather than directed yield)? It has some advantage
over directed yield in that it preserves not only fairness between VMs but
also fairness between VCPUs of a VM. It also avoids the need for the
guessing game mentioned above and bad interactions with hard limits.

CCing other scheduler experts for their opinion of the proposed yield()
extensions.

- vatsa
On Wed, 2010-12-01 at 21:42 +0530, Srivatsa Vaddagiri wrote:
> Not if yield() remembers what timeslice was given up and adds that back
> when the thread is finally ready to run. The figure below illustrates
> this idea:
>
>       A0/4    C0/4  D0/4  A0/4  C0/4  D0/4  A0/4  C0/4  D0/4  A0/4
>   p0 |----|-L|----|----|----|L|----|----|----|L|----|----|----|--------|
>            \            \            \            \
>           B0/2[2]      B0/0[6]      B0/0[10]     B0/14[0]
>
> where,
>   p0      -> physical cpu0
>   L       -> denotes period of lock contention
>   A0/4    -> means vcpu A0 (of guest A) ran for 4 ms
>   B0/2[6] -> means vcpu B0 (of guest B) ran for 2 ms (and has given up
>              6 ms worth of its timeslice so far). In reality, we should
>              not see too much of "given up" timeslice for a vcpu.

/me fails to parse

[...]

>> A plain yield (ignoring no-opiness on Linux) will penalize the running
>> guest wrt other guests. We need to maintain fairness.
>
> Agreed on the need to maintain fairness.

Directed yield and fairness don't mix well either. You can end up feeding
the other tasks more time than you'll ever get back.

>> If you hard-limit a vcpu that holds a lock, any waiting vcpus are also
>> halted.
>
> This can happen in the normal case, when lock holders are preempted, as
> well. So this is not a new problem that hard limits are introducing!

No, but hard limits make it _much_ worse.

>> With directed yield you can let the lock holder make some progress at
>> the expense of another vcpu. A regular yield() will simply stall the
>> waiter.
>
> Agreed. Do you see any problems with the slightly enhanced version of
> yield described above (rather than directed yield)? [...]
>
> CCing other scheduler experts for their opinion of the proposed yield()
> extensions.

sys_yield() usage for anything other than two FIFO threads of the same
priority goes to /dev/null.

The Xen paravirt spinlock solution is relatively sane, use that.
Unmodified guests suck anyway; there's really nothing much sane you can do
there, as you don't know who owns what lock.
* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Wed, 2010-12-01 at 21:42 +0530, Srivatsa Vaddagiri wrote:
>> [...]
>>
>> Agreed on the need to maintain fairness.
>
> Directed yield and fairness don't mix well either. You can end up
> feeding the other tasks more time than you'll ever get back.

If the directed yield is always to another task in your cgroup, then
inter-guest scheduling fairness should be maintained.

>> This can happen in the normal case, when lock holders are preempted, as
>> well. So this is not a new problem that hard limits are introducing!
>
> No, but hard limits make it _much_ worse.
>
>> CCing other scheduler experts for their opinion of the proposed yield()
>> extensions.
>
> sys_yield() usage for anything other than two FIFO threads of the same
> priority goes to /dev/null.
>
> The Xen paravirt spinlock solution is relatively sane, use that.
> Unmodified guests suck anyway; there's really nothing much sane you can
> do there, as you don't know who owns what lock.
On Wed, 2010-12-01 at 09:17 -0800, Chris Wright wrote: > Directed yield and fairness don't mix well either. You can end up > feeding the other tasks more time than you'll ever get back. If the directed yield is always to another task in your cgroup then inter-guest scheduling fairness should be maintained. Yes, but not the inter-vcpu fairness.
On 12/01/2010 12:22 PM, Peter Zijlstra wrote: > On Wed, 2010-12-01 at 09:17 -0800, Chris Wright wrote: >> Directed yield and fairness don't mix well either. You can end up >> feeding the other tasks more time than you'll ever get back. > > If the directed yield is always to another task in your cgroup then > inter-guest scheduling fairness should be maintained. > > Yes, but not the inter-vcpu fairness. The pause loop exiting & directed yield patches I am working on preserve inter-vcpu fairness by round robining among the vcpus inside one KVM guest.
On Wed, Dec 01, 2010 at 05:25:18PM +0100, Peter Zijlstra wrote: > On Wed, 2010-12-01 at 21:42 +0530, Srivatsa Vaddagiri wrote: > > > Not if yield() remembers what timeslice was given up and adds that back when > > thread is finally ready to run. Figure below illustrates this idea: > > > > > > A0/4 C0/4 D0/4 A0/4 C0/4 D0/4 A0/4 C0/4 D0/4 A0/4 > > p0 |----|-L|----|----|----|L|----|----|----|L|----|----|----|--------------| > > \ \ \ \ > > B0/2[2] B0/0[6] B0/0[10] B0/14[0] > > > > > > where, > > p0 -> physical cpu0 > > L -> denotes period of lock contention > > A0/4 -> means vcpu A0 (of guest A) ran for 4 ms > > B0/2[6] -> means vcpu B0 (of guest B) ran for 2 ms (and has given up > > 6ms worth of its timeslice so far). In reality, we should > > not see too much of "given up" timeslice for a vcpu. > > /me fails to parse Maybe the ASCII art didn't get displayed well by your email reader? Essentially what I am suggesting above is some extension to yield as below: yield_task_fair(...) { + ideal_runtime = sched_slice(cfs_rq, curr); + delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime; + rem_time_slice = ideal_runtime - delta_exec; + + current->donate_time += rem_time_slice > some_threshold ? + some_threshold : rem_time_slice; ... } sched_slice(...) { slice = ... + slice += current->donate_time; } or something close to it. I am a bit reluctant to go that route myself, unless the fairness issue with plain yield is quite bad. > > > A plain yield (ignoring no-opiness on Linux) will penalize the > > > running guest wrt other guests. We need to maintain fairness. Avi, any idea how much penalty are we talking about here in using plain yield? If that is acceptable in practice, I'd prefer we use the plain yield rather than add any more sophistication to it .. > > Agreed on the need to maintain fairness. > > Directed yield and fairness don't mix well either. You can end up > feeding the other tasks more time than you'll ever get back.
Agreed, that's the reason I was suggesting a different form of yield which addresses the fairness between VCPUs as well. > > This can happen in normal case when lock-holders are preempted as well. So > > not a new problem that hard-limits is introducing! > > No, but hard limits make it _much_ worse. Sure .. > > > With directed yield you can let the lock holder make > > > some progress at the expense of another vcpu. A regular yield() > > > will simply stall the waiter. > > > > Agreed. Do you see any problems with slightly enhanced version of yeild > > described above (rather than directed yield)? It has some advantage over > > directed yield in that it preserves not only fairness between VMs but also > > fairness between VCPUs of a VM. Also it avoids the need for a guessing game > > mentioned above and bad interactions with hard-limits. > > > > CCing other scheduler experts for their opinion of proposed yield() extensions. > > sys_yield() usage for anything other but two FIFO threads of the same > priority goes to /dev/null. > > The Xen paravirt spinlock solution is relatively sane, use that. > Unmodified guests suck anyway, there's really nothing much sane you can > do there as you don't know who owns what lock. Hopefully we don't have to deal with unmodified guests for too long. Till that time, plain yield() upon lock-contention seems like the best option we have for such unmodified guests. - vatsa
On Wed, 2010-12-01 at 22:59 +0530, Srivatsa Vaddagiri wrote: > > yield_task_fair(...) > { > > + ideal_runtime = sched_slice(cfs_rq, curr); > + delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime; > + rem_time_slice = ideal_runtime - delta_exec; > + > + current->donate_time += rem_time_slice > some_threshold ? > + some_threshold : rem_time_slice; > > ... > } > > > sched_slice(...) > { > slice = ... > > + slice += current->donate_time; > > } > > or something close to it. I am a bit reluctant to go that route myself, unless the > fairness issue with plain yield is quite bad. That really won't do anything. You need to adjust both tasks' vruntimes. Also, I really wouldn't touch the yield() implementation, nor would I expose any such time donation crap to userspace.
* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote: > On Wed, 2010-12-01 at 09:17 -0800, Chris Wright wrote: > > Directed yield and fairness don't mix well either. You can end up > > feeding the other tasks more time than you'll ever get back. > > If the directed yield is always to another task in your cgroup then > inter-guest scheduling fairness should be maintained. > > Yes, but not the inter-vcpu fairness. That same vcpu doesn't get fair scheduling if it spends its entire timeslice spinning on a lock held by a de-scheduled vcpu.
On Wed, Dec 01, 2010 at 06:45:02PM +0100, Peter Zijlstra wrote: > On Wed, 2010-12-01 at 22:59 +0530, Srivatsa Vaddagiri wrote: > > > > yield_task_fair(...) > > { > > > > + ideal_runtime = sched_slice(cfs_rq, curr); > > + delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime; > > + rem_time_slice = ideal_runtime - delta_exec; > > + > > + current->donate_time += rem_time_slice > some_threshold ? > > + some_threshold : rem_time_slice; > > > > ... > > } > > > > > > sched_slice(...) > > { > > slice = ... > > > > + slice += current->donate_time; > > > > } > > > > or something close to it. I am a bit reluctant to go that route myself, unless the > > fairness issue with plain yield is quite bad. > > That really won't do anything. You need to adjust both tasks' > vruntimes. We are dealing with just one task here (the task that is yielding). After recording how much timeslice we are "giving up" in current->donate_time (donate_time is perhaps not the right name to use), we adjust the yielding task's vruntime as per existing logic (for ex: to make it go to back of runqueue). When the yielding task gets to run again, lock is hopefully available for it to grab, we let it run longer than the default sched_slice() to compensate for what time it gave up previously to other threads in same runqueue. This ensures that because of yielding upon lock contention, we are not leaking bandwidth in favor of other guests. Again I don't know how much of a fairness issue this is in practice, so unless we see some numbers I'd prefer sticking to plain yield() upon lock-contention (for unmodified guests that is). > Also, I really wouldn't touch the yield() implementation, nor > would I expose any such time donation crap to userspace. - vatsa
On Wed, 2010-12-01 at 12:26 -0500, Rik van Riel wrote: > On 12/01/2010 12:22 PM, Peter Zijlstra wrote: > > On Wed, 2010-12-01 at 09:17 -0800, Chris Wright wrote: > >> Directed yield and fairness don't mix well either. You can end up > >> feeding the other tasks more time than you'll ever get back. > > > > If the directed yield is always to another task in your cgroup then > > inter-guest scheduling fairness should be maintained. > > > > Yes, but not the inter-vcpu fairness. > > The pause loop exiting & directed yield patches I am working on > preserve inter-vcpu fairness by round robining among the vcpus > inside one KVM guest. I don't necessarily think that's enough. Suppose you've got 4 vcpus, one is holding a lock and 3 are spinning. They'll end up all three donating some time to the 4th. The only way to make that fair again is if due to future contention the 4th cpu donates an equal amount of time back to the resp. cpus it got time from. Guest lock patterns and host scheduling don't provide this guarantee.
On Wed, 2010-12-01 at 23:30 +0530, Srivatsa Vaddagiri wrote: > On Wed, Dec 01, 2010 at 06:45:02PM +0100, Peter Zijlstra wrote: > > On Wed, 2010-12-01 at 22:59 +0530, Srivatsa Vaddagiri wrote: > > > > > > yield_task_fair(...) > > > { > > > > > > + ideal_runtime = sched_slice(cfs_rq, curr); > > > + delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime; > > > + rem_time_slice = ideal_runtime - delta_exec; > > > + > > > + current->donate_time += rem_time_slice > some_threshold ? > > > + some_threshold : rem_time_slice; > > > > > > ... > > > } > > > > > > > > > sched_slice(...) > > > { > > > slice = ... > > > > > > + slice += current->donate_time; > > > > > > } > > > > > > or something close to it. I am bit reluctant to go that route myself, unless the > > > fairness issue with plain yield is quite bad. > > > > That really won't do anything. You need to adjust both tasks their > > vruntime. > > We are dealing with just one task here (the task that is yielding). > After recording how much timeslice we are "giving up" in current->donate_time > (donate_time is perhaps not the right name to use), we adjust the yielding > task's vruntime as per existing logic (for ex: to make it go to back of > runqueue). When the yielding tasks gets to run again, lock is hopefully > available for it to grab, we let it run longer than the default sched_slice() > to compensate for what time it gave up previously to other threads in same > runqueue. This ensures that because of yielding upon lock contention, we are not > leaking bandwidth in favor of other guests. Again I don't know how much of > fairness issue this is in practice, so unless we see some numbers I'd prefer > sticking to plain yield() upon lock-contention (for unmodified guests that is). No, that won't work. Once you've given up time you cannot add it back without destroying fairness. You can limit the unfairness by limiting the amount of feedback, but I really dislike such 'yield' semantics.
On 12/01/2010 02:07 PM, Peter Zijlstra wrote: > On Wed, 2010-12-01 at 12:26 -0500, Rik van Riel wrote: >> On 12/01/2010 12:22 PM, Peter Zijlstra wrote: >> The pause loop exiting& directed yield patches I am working on >> preserve inter-vcpu fairness by round robining among the vcpus >> inside one KVM guest. > > I don't necessarily think that's enough. > > Suppose you've got 4 vcpus, one is holding a lock and 3 are spinning. > They'll end up all three donating some time to the 4th. > > The only way to make that fair again is if due to future contention the > 4th cpu donates an equal amount of time back to the resp. cpus it got > time from. Guest lock patterns and host scheduling don't provide this > guarantee. You have no guarantees when running virtualized, guest CPU time could be taken away by another guest just as easily as by another VCPU. Even if we equalized the amount of CPU time each VCPU ends up getting across some time interval, that is no guarantee they get useful work done, or that the time gets fairly divided to _user processes_ running inside the guest. The VCPU could be running something lock-happy when it temporarily gives up the CPU, and get extra CPU time back when running something userspace intensive. In-between, it may well have scheduled to another task (allowing it to get more CPU time). I'm not convinced the kind of fairness you suggest is possible or useful.
On Wed, 2010-12-01 at 14:24 -0500, Rik van Riel wrote: > On 12/01/2010 02:07 PM, Peter Zijlstra wrote: > > On Wed, 2010-12-01 at 12:26 -0500, Rik van Riel wrote: > >> On 12/01/2010 12:22 PM, Peter Zijlstra wrote: > > >> The pause loop exiting & directed yield patches I am working on > >> preserve inter-vcpu fairness by round robining among the vcpus > >> inside one KVM guest. > > > > I don't necessarily think that's enough. > > > > Suppose you've got 4 vcpus, one is holding a lock and 3 are spinning. > > They'll end up all three donating some time to the 4th. > > > > The only way to make that fair again is if due to future contention the > > 4th cpu donates an equal amount of time back to the resp. cpus it got > > time from. Guest lock patterns and host scheduling don't provide this > > guarantee. > > You have no guarantees when running virtualized, guest > CPU time could be taken away by another guest just as > easily as by another VCPU. > > Even if we equalized the amount of CPU time each VCPU > ends up getting across some time interval, that is no > guarantee they get useful work done, or that the time > gets fairly divided to _user processes_ running inside > the guest. Right, and Jeremy was working on making the guest load-balancer aware of that so the user-space should get fairly scheduled on service (of course, that's assuming you run a linux guest with that logic in). > The VCPU could be running something lock-happy when > it temporarily gives up the CPU, and get extra CPU time > back when running something userspace intensive. > > In-between, it may well have scheduled to another task > (allowing it to get more CPU time). > > I'm not convinced the kind of fairness you suggest is > possible or useful. Well, physical cpus get equal service, but yeah, time loss due to contention could probably be considered equivalent to non-equal service in the vcpu case. Anyway, don't take it as a critique per-se, your approach sounds like the sanest proposal yet.
On 12/01/2010 02:35 PM, Peter Zijlstra wrote: > On Wed, 2010-12-01 at 14:24 -0500, Rik van Riel wrote: >> Even if we equalized the amount of CPU time each VCPU >> ends up getting across some time interval, that is no >> guarantee they get useful work done, or that the time >> gets fairly divided to _user processes_ running inside >> the guest. > > Right, and Jeremy was working on making the guest load-balancer aware of > that so the user-space should get fairly scheduled on service (of > course, that's assuming you run a linux guest with that logic in). At that point, you might not need the host side balancing any more, since the guest can move around processes internally (if needed).
On Wed, 2010-12-01 at 14:42 -0500, Rik van Riel wrote: > On 12/01/2010 02:35 PM, Peter Zijlstra wrote: > > On Wed, 2010-12-01 at 14:24 -0500, Rik van Riel wrote: > > >> Even if we equalized the amount of CPU time each VCPU > >> ends up getting across some time interval, that is no > >> guarantee they get useful work done, or that the time > >> gets fairly divided to _user processes_ running inside > >> the guest. > > > > Right, and Jeremy was working on making the guest load-balancer aware of > > that so the user-space should get fairly scheduled on service (of > > course, that's assuming you run a linux guest with that logic in). > > At that point, you might not need the host side balancing > any more, since the guest can move around processes > internally (if needed). Not quite sure what you're saying, host load-balancing is always needed, but if you're talking about the whole directed yield thing, then yes, paravirt spinlocks will take care of that.
On 12/01/2010 09:07 PM, Peter Zijlstra wrote: > > > > The pause loop exiting & directed yield patches I am working on > > preserve inter-vcpu fairness by round robining among the vcpus > > inside one KVM guest. > > I don't necessarily think that's enough. > > Suppose you've got 4 vcpus, one is holding a lock and 3 are spinning. > They'll end up all three donating some time to the 4th. > > The only way to make that fair again is if due to future contention the > 4th cpu donates an equal amount of time back to the resp. cpus it got > time from. Guest lock patterns and host scheduling don't provide this > guarantee. > That is fine. Fairness is pointless if there's no work to be done for the other three threads. Just like if you have four processes, one of which is processing, the others sleeping, cpu is not divided equally.
On 12/01/2010 07:29 PM, Srivatsa Vaddagiri wrote: > > > > A plain yield (ignoring no-opiness on Linux) will penalize the > > > > running guest wrt other guests. We need to maintain fairness. > > Avi, any idea how much penalty are we talking of here in using plain yield? > If that is acceptable in practice, I'd prefer we use the plain yield rather than > add any more sophistication to it .. The first issue with plain yield is that it's a no-op unless some ioctl is enabled. It's simply not very meaningful for CFS. That's why the current PLE implementation uses sleep() (which is incredibly sucky in another way). Even if we do get yield() to work, giving up a timeslice will cause a contending guest to lose way more than its fair share. Consider a lock that is held for 100us every 500us; if each time we detect contention we give up a 3ms timeslice, the guest will grind down to a halt while other guests will happily use its timeslice. With directed yield, the guest continues to run; and any excess time we donated to a vcpu is preserved. > > The Xen paravirt spinlock solution is relatively sane, use that. > > Unmodified guests suck anyway, there's really nothing much sane you can > > do there as you don't know who owns what lock. > > Hopefully we don't have to deal with unmodified guests for too long. > Till that time, plain yield() upon lock-contention seems like the best option > we have for such unmodified guests. We will have to deal with unmodified guests forever. Both older Linux and Windows.
On 12/01/2010 09:09 PM, Peter Zijlstra wrote: > > > > We are dealing with just one task here (the task that is yielding). > > After recording how much timeslice we are "giving up" in current->donate_time > > (donate_time is perhaps not the right name to use), we adjust the yielding > > task's vruntime as per existing logic (for ex: to make it go to back of > > runqueue). When the yielding tasks gets to run again, lock is hopefully > > available for it to grab, we let it run longer than the default sched_slice() > > to compensate for what time it gave up previously to other threads in same > > runqueue. This ensures that because of yielding upon lock contention, we are not > > leaking bandwidth in favor of other guests. Again I don't know how much of > > fairness issue this is in practice, so unless we see some numbers I'd prefer > > sticking to plain yield() upon lock-contention (for unmodified guests that is). > > No, that won't work. Once you've given up time you cannot add it back > without destroying fairness. > > You can limit the unfairness by limiting the amount of feedback, but I > really dislike such 'yield' semantics. Agreed. What I'd like to see in directed yield is donating exactly the amount of vruntime that's needed to make the target thread run. The donating thread won't get its vruntime back, unless the other thread hits contention itself and does a directed yield back. So even if your lock is ping-ponged around, the guest doesn't lose vruntime compared to other processes on the host.
On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote: > On 12/01/2010 09:09 PM, Peter Zijlstra wrote: > >> > >> We are dealing with just one task here (the task that is yielding). > >> After recording how much timeslice we are "giving up" in current->donate_time > >> (donate_time is perhaps not the right name to use), we adjust the yielding > >> task's vruntime as per existing logic (for ex: to make it go to back of > >> runqueue). When the yielding task gets to run again, lock is hopefully > >> available for it to grab, we let it run longer than the default sched_slice() > >> to compensate for what time it gave up previously to other threads in same > >> runqueue. This ensures that because of yielding upon lock contention, we are not > >> leaking bandwidth in favor of other guests. Again I don't know how much of a > >> fairness issue this is in practice, so unless we see some numbers I'd prefer > >> sticking to plain yield() upon lock-contention (for unmodified guests that is). > > > >No, that won't work. Once you've given up time you cannot add it back > >without destroying fairness. Over shorter intervals perhaps. Over a longer interval (few seconds to couple of minutes), fairness should not be affected because of this feedback? In any case, don't we have similar issues with directed yield as well? > >You can limit the unfairness by limiting the amount of feedback, but I > >really dislike such 'yield' semantics. > > Agreed. > > What I'd like to see in directed yield is donating exactly the > amount of vruntime that's needed to make the target thread run. I presume this requires the target vcpu to move left in rb-tree to run earlier than scheduled currently and that it doesn't involve any change to the sched_period() of target vcpu? Just was wondering how this would work in case of buggy guests. Let's say that a guest ran into an AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 currently), while VCPU1 spins on lock A (held by VCPU0 currently). 
Both keep boosting each other's vruntime, potentially affecting fairtime for other guests (to the point of starving them perhaps)? - vatsa > The donating thread won't get its vruntime back, unless the other thread > hits contention itself and does a directed yield back. So even if > your lock is ping-ponged around, the guest doesn't lose vruntime > compared to other processes on the host.
On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote: > What I'd like to see in directed yield is donating exactly the > amount of vruntime that's needed to make the target thread run. The How would that work well with hard-limits? The target thread would have been rate limited and no amount of vruntime donation would make it work "immediately" .. - vatsa
On Thu, Dec 02, 2010 at 05:17:00PM +0530, Srivatsa Vaddagiri wrote: > Just was wondering how this would work in case of buggy guests. Let's say that a > guest ran into an AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > currently), while VCPU1 spins on lock A (held by VCPU0 currently). Both keep > boosting each other's vruntime, potentially affecting fairtime for other guests > (to the point of starving them perhaps)? Guests that exhibit strong spinlock contention can cause similar symptoms as well? - vatsa
On 12/02/2010 01:47 PM, Srivatsa Vaddagiri wrote: > On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote: > > On 12/01/2010 09:09 PM, Peter Zijlstra wrote: > > >> > > >> We are dealing with just one task here (the task that is yielding). > > >> After recording how much timeslice we are "giving up" in current->donate_time > > >> (donate_time is perhaps not the right name to use), we adjust the yielding > > >> task's vruntime as per existing logic (for ex: to make it go to back of > > >> runqueue). When the yielding tasks gets to run again, lock is hopefully > > >> available for it to grab, we let it run longer than the default sched_slice() > > >> to compensate for what time it gave up previously to other threads in same > > >> runqueue. This ensures that because of yielding upon lock contention, we are not > > >> leaking bandwidth in favor of other guests. Again I don't know how much of > > >> fairness issue this is in practice, so unless we see some numbers I'd prefer > > >> sticking to plain yield() upon lock-contention (for unmodified guests that is). > > > > > >No, that won't work. Once you've given up time you cannot add it back > > >without destroying fairness. > > Over shorter intervals perhaps. Over longer interval (few seconds to couple of > minutes), fairness should not be affected because of this feedback? In any case, > don't we have similar issues with directed yield as well? Directed yield works by donating vruntime to another thread. If you don't have vruntime, you can't donate it (well, you're guaranteed to have some, since you're running, but if all you have is a microsecond's worth, that's what you can donate). > > > > What I'd like to see in directed yield is donating exactly the > > amount of vruntime that's needed to make the target thread run. > > I presume this requires the target vcpu to move left in rb-tree to run > earlier than scheduled currently and that it doesn't involve any > change to the sched_period() of target vcpu? 
> > Just was wondering how this would work in case of buggy guests. Lets say that a > guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep > boosting each other's vruntime, potentially affecting fairtime for other guests > (to the point of starving them perhaps)? We preserve vruntime overall. If you give vruntime to someone, it comes at your own expense. Overall vruntime is preserved.
On 12/02/2010 02:19 PM, Srivatsa Vaddagiri wrote: > On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote: > > What I'd like to see in directed yield is donating exactly the > > amount of vruntime that's needed to make the target thread run. The > > How would that work well with hard-limits? The target thread would have been > rate limited and no amount of vruntime donation would make it work > "immediately" .. At least it would stop the donator from running.
On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote: > >> What I'd like to see in directed yield is donating exactly the > >> amount of vruntime that's needed to make the target thread run. > > > >I presume this requires the target vcpu to move left in rb-tree to run > >earlier than scheduled currently and that it doesn't involve any > >change to the sched_period() of target vcpu? > > > >Just was wondering how this would work in case of buggy guests. Lets say that a > >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > >currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep > >boosting each other's vruntime, potentially affecting fairtime for other guests > >(to the point of starving them perhaps)? > > We preserve vruntime overall. If you give vruntime to someone, it > comes at your own expense. Overall vruntime is preserved. Hmm ..so I presume that this means we don't affect target thread's position in rb-tree upon donation, rather we influence its sched_period() to include donated time? IOW donation has no effect on causing the target thread to run "immediately", rather it will have the effect of causing it run "longer" whenever it runs next? Even that would require some precaution in directed yield to ensure that it doesn't unduly inflate vruntime of target, hurting fairness for other guests on same cpu as target (example guest code that can lead to this situation below): vcpu0: vcpu1: spinlock(A); spinlock(A); while(1) ; spin_unlock(A); - vatsa
On 12/02/2010 03:13 PM, Srivatsa Vaddagiri wrote: > On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote: > > >> What I'd like to see in directed yield is donating exactly the > > >> amount of vruntime that's needed to make the target thread run. > > > > > >I presume this requires the target vcpu to move left in rb-tree to run > > >earlier than scheduled currently and that it doesn't involve any > > >change to the sched_period() of target vcpu? > > > > > >Just was wondering how this would work in case of buggy guests. Lets say that a > > >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > > >currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep > > >boosting each other's vruntime, potentially affecting fairtime for other guests > > >(to the point of starving them perhaps)? > > > > We preserve vruntime overall. If you give vruntime to someone, it > > comes at your own expense. Overall vruntime is preserved. > > Hmm ..so I presume that this means we don't affect target thread's position in > rb-tree upon donation, rather we influence its sched_period() to include > donated time? IOW donation has no effect on causing the target thread to run > "immediately", rather it will have the effect of causing it run "longer" > whenever it runs next? No. The intent (at least mine, maybe Rik has other ideas) is to move some vruntime from current to target such that target would be placed before current in the timeline. > Even that would require some precaution in directed yield to ensure that it > doesn't unduly inflate vruntime of target, hurting fairness for other guests on > same cpu as target (example guest code that can lead to this situation > below): > > vcpu0: vcpu1: > > spinlock(A); > > spinlock(A); > > while(1) > ; > > spin_unlock(A); directed yield should preserve the invariant that sum(vruntime) does not change.
On Thu, Dec 02, 2010 at 03:49:44PM +0200, Avi Kivity wrote:
> On 12/02/2010 03:13 PM, Srivatsa Vaddagiri wrote:
> >On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote:
> >> >> What I'd like to see in directed yield is donating exactly the
> >> >> amount of vruntime that's needed to make the target thread run.
> >> >
> >> >I presume this requires the target vcpu to move left in rb-tree to run
> >> >earlier than scheduled currently and that it doesn't involve any
> >> >change to the sched_period() of target vcpu?
> >> >
> >> >Just was wondering how this would work in case of buggy guests. Lets say that a
> >> >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1
> >> >currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep
> >> >boosting each other's vruntime, potentially affecting fairtime for other guests
> >> >(to the point of starving them perhaps)?
> >>
> >> We preserve vruntime overall. If you give vruntime to someone, it
> >> comes at your own expense. Overall vruntime is preserved.
> >
> >Hmm ..so I presume that this means we don't affect target thread's position in
> >rb-tree upon donation, rather we influence its sched_period() to include
> >donated time? IOW donation has no effect on causing the target thread to run
> >"immediately", rather it will have the effect of causing it run "longer"
> >whenever it runs next?
>
> No. The intent (at least mine, maybe Rik has other ideas) is to

CCing Rik now ..

> move some vruntime from current to target such that target would be
> placed before current in the timeline.

Well ok, then this is what I had presumed earlier (about shifting target towards
left in rb-tree).

> > >Even that would require some precaution in directed yield to ensure that it
> > >doesn't unduly inflate vruntime of target, hurting fairness for other guests on
> > >same cpu as target (example guest code that can lead to this situation
> > >below):
> > >
> > >vcpu0:			vcpu1:
> > >
> > >			spinlock(A);
> > >
> > >spinlock(A);
> > >
> > >			while(1)
> > >				;
> > >
> > >			spin_unlock(A);
> >
> > directed yield should preserve the invariant that sum(vruntime) does
> > not change.

Hmm don't think I understand this invariant sum() part. Lets take a simple
example as below:

p0 -> A0 B0 A1

p1 -> B1 C0 C1

A/B/C are VMs and A0 etc are virtual cpus. p0/1 are physical cpus

Let's say A0/A1 hit AB-BA spin-deadlock (which one can write in userspace
delibrately). When A0 spins and exits (due to PLE) what does its directed yield
do? Going by your statement, it can put target before current, leading to
perhaps this arrangement in runqueue:

p0 -> A1 B0 A0

Now A1 spins and wants to do a directed yield back to A0, leading to :

p0 -> A0 B0 A1

This can go back and forth, starving B0 (iow leading to some sort of DoS
attack).

Where does the "invariant sum" part of directed yield kick in to avoid such
nastiness?

- vatsa
Actually CCing Rik now!

On Thu, Dec 02, 2010 at 08:57:16PM +0530, Srivatsa Vaddagiri wrote:
> On Thu, Dec 02, 2010 at 03:49:44PM +0200, Avi Kivity wrote:
> > On 12/02/2010 03:13 PM, Srivatsa Vaddagiri wrote:
> > >On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote:
> > >> >> What I'd like to see in directed yield is donating exactly the
> > >> >> amount of vruntime that's needed to make the target thread run.
> > >> >
> > >> >I presume this requires the target vcpu to move left in rb-tree to run
> > >> >earlier than scheduled currently and that it doesn't involve any
> > >> >change to the sched_period() of target vcpu?
> > >> >
> > >> >Just was wondering how this would work in case of buggy guests. Lets say that a
> > >> >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1
> > >> >currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep
> > >> >boosting each other's vruntime, potentially affecting fairtime for other guests
> > >> >(to the point of starving them perhaps)?
> > >>
> > >> We preserve vruntime overall. If you give vruntime to someone, it
> > >> comes at your own expense. Overall vruntime is preserved.
> > >
> > >Hmm ..so I presume that this means we don't affect target thread's position in
> > >rb-tree upon donation, rather we influence its sched_period() to include
> > >donated time? IOW donation has no effect on causing the target thread to run
> > >"immediately", rather it will have the effect of causing it run "longer"
> > >whenever it runs next?
> >
> > No. The intent (at least mine, maybe Rik has other ideas) is to
>
> CCing Rik now ..
>
> > move some vruntime from current to target such that target would be
> > placed before current in the timeline.
>
> Well ok, then this is what I had presumed earlier (about shifting target towards
> left in rb-tree).
>
> > > >Even that would require some precaution in directed yield to ensure that it
> > > >doesn't unduly inflate vruntime of target, hurting fairness for other guests on
> > > >same cpu as target (example guest code that can lead to this situation
> > > >below):
> > > >
> > > >vcpu0:			vcpu1:
> > > >
> > > >			spinlock(A);
> > > >
> > > >spinlock(A);
> > > >
> > > >			while(1)
> > > >				;
> > > >
> > > >			spin_unlock(A);
> > >
> > > directed yield should preserve the invariant that sum(vruntime) does
> > > not change.
>
> Hmm don't think I understand this invariant sum() part. Lets take a simple
> example as below:
>
> p0 -> A0 B0 A1
>
> p1 -> B1 C0 C1
>
> A/B/C are VMs and A0 etc are virtual cpus. p0/1 are physical cpus
>
> Let's say A0/A1 hit AB-BA spin-deadlock (which one can write in userspace
> delibrately). When A0 spins and exits (due to PLE) what does its directed yield
> do? Going by your statement, it can put target before current, leading to
> perhaps this arrangement in runqueue:
>
> p0 -> A1 B0 A0
>
> Now A1 spins and wants to do a directed yield back to A0, leading to :
>
> p0 -> A0 B0 A1
>
> This can go back and forth, starving B0 (iow leading to some sort of DoS
> attack).
>
> Where does the "invariant sum" part of directed yield kick in to avoid such
> nastiness?
>
> - vatsa
On 12/02/2010 05:27 PM, Srivatsa Vaddagiri wrote:
> > >Even that would require some precaution in directed yield to ensure that it
> > >doesn't unduly inflate vruntime of target, hurting fairness for other guests on
> > >same cpu as target (example guest code that can lead to this situation
> > >below):
> > >
> > >vcpu0:			vcpu1:
> > >
> > >			spinlock(A);
> > >
> > >spinlock(A);
> > >
> > >			while(1)
> > >				;
> > >
> > >			spin_unlock(A);
> >
> > directed yield should preserve the invariant that sum(vruntime) does
> > not change.
>
> Hmm don't think I understand this invariant sum() part. Lets take a simple
> example as below:
>
> p0 -> A0 B0 A1
>
> p1 -> B1 C0 C1
>
> A/B/C are VMs and A0 etc are virtual cpus. p0/1 are physical cpus
>
> Let's say A0/A1 hit AB-BA spin-deadlock (which one can write in userspace
> delibrately). When A0 spins and exits (due to PLE) what does its directed yield
> do? Going by your statement, it can put target before current, leading to
> perhaps this arrangement in runqueue:
>
> p0 -> A1 B0 A0
>
> Now A1 spins and wants to do a directed yield back to A0, leading to :
>
> p0 -> A0 B0 A1
>
> This can go back and forth, starving B0 (iow leading to some sort of DoS
> attack).
>
> Where does the "invariant sum" part of directed yield kick in to avoid such
> nastiness?

A0 and A1's vruntime will keep growing, eventually B will become leftmost and
become runnable (assuming leftmost == min vruntime, not sure what the
terminology is).
On Thu, Dec 02, 2010 at 05:33:40PM +0200, Avi Kivity wrote:
> A0 and A1's vruntime will keep growing, eventually B will become
> leftmost and become runnable (assuming leftmost == min vruntime, not
> sure what the terminology is).

Donation (in directed yield) will cause vruntime to drop as well (that's the
only way the target can get to run ahead of its scheduled time), so I still
think there are nasty issues involved here. Anyway, I am curious to see the
directed yield implementation that Rik has - I can comment more after I have
seen that!

- vatsa
diff --git a/cpu-defs.h b/cpu-defs.h
index 51533c6..6434dca 100644
--- a/cpu-defs.h
+++ b/cpu-defs.h
@@ -220,6 +220,7 @@ struct KVMCPUState {
     const char *cpu_model_str;                  \
     struct KVMState *kvm_state;                 \
     struct kvm_run *kvm_run;                    \
+    int sigusr1_fd;                             \
     int kvm_fd;                                 \
     int kvm_vcpu_dirty;                         \
     struct KVMCPUState kvm_cpu_state;
diff --git a/qemu-kvm.c b/qemu-kvm.c
index 471306b..354109f 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1351,6 +1351,29 @@ static void pause_all_threads(void)
     }
 }
 
+static void vcpu_stop(CPUState *env)
+{
+    if (env != cpu_single_env) {
+        env->stop = 1;
+        pthread_kill(env->kvm_cpu_state.thread, SIG_IPI);
+    } else {
+        env->stop = 0;
+        env->stopped = 1;
+        cpu_exit(env);
+    }
+
+    while (!env->stopped) {
+        qemu_cond_wait(&qemu_pause_cond);
+    }
+}
+
+static void vcpu_start(CPUState *env)
+{
+    env->stop = 0;
+    env->stopped = 0;
+    pthread_kill(env->kvm_cpu_state.thread, SIG_IPI);
+}
+
 static void resume_all_threads(void)
 {
     CPUState *penv = first_cpu;
@@ -1426,6 +1449,37 @@ static int kvm_main_loop_cpu(CPUState *env)
     return 0;
 }
 
+static __thread int sigusr1_wfd;
+
+static void on_sigusr1(int signo)
+{
+    char ch = 0;
+    if (write(sigusr1_wfd, &ch, 1) < 0) {
+        /* who cares */
+    }
+}
+
+static void sigusr1_read(void *opaque)
+{
+    CPUState *env = opaque;
+    ssize_t len;
+    int caught_signal = 0;
+
+    do {
+        char buffer[256];
+        len = read(env->sigusr1_fd, buffer, sizeof(buffer));
+        caught_signal = 1;
+    } while (len > 0);
+
+    if (caught_signal) {
+        if (env->stopped) {
+            vcpu_start(env);
+        } else {
+            vcpu_stop(env);
+        }
+    }
+}
+
 static void *ap_main_loop(void *_env)
 {
     CPUState *env = _env;
@@ -1433,10 +1487,12 @@ static void *ap_main_loop(void *_env)
 #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
     struct ioperm_data *data = NULL;
 #endif
+    int fds[2];
 
     current_env = env;
     env->thread_id = kvm_get_thread_id();
 
     sigfillset(&signals);
+    sigdelset(&signals, SIGUSR1);
     sigprocmask(SIG_BLOCK, &signals, NULL);
 
 #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
@@ -1451,6 +1507,18 @@ static void *ap_main_loop(void *_env)
     kvm_create_vcpu(env, env->cpu_index);
     setup_kernel_sigmask(env);
 
+    if (pipe(fds) == -1) {
+        /* do nothing */
+    }
+
+    fcntl(fds[0], F_SETFL, O_NONBLOCK);
+    fcntl(fds[1], F_SETFL, O_NONBLOCK);
+
+    env->sigusr1_fd = fds[0];
+    sigusr1_wfd = fds[1];
+
+    qemu_set_fd_handler2(fds[0], NULL, sigusr1_read, NULL, env);
+
     /* signal VCPU creation */
     current_env->created = 1;
     pthread_cond_signal(&qemu_vcpu_cond);
@@ -1463,6 +1531,8 @@ static void *ap_main_loop(void *_env)
     /* re-initialize cpu_single_env after re-acquiring qemu_mutex */
     cpu_single_env = env;
 
+    signal(SIGUSR1, on_sigusr1);
+
     kvm_main_loop_cpu(env);
     return NULL;
 }
diff --git a/qemu-kvm.h b/qemu-kvm.h
index 0f3fb50..3addc77 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -783,6 +783,7 @@ struct KVMState {
     int irqchip_in_kernel;
     int pit_in_kernel;
     int xsave, xcrs;
+    int sigusr2_fd;
     struct kvm_context kvm_context;
 };
qemu-kvm vcpu threads don't respond to SIGSTOP/SIGCONT. Instead of teaching
them to respond to these signals (which cannot be trapped), use SIGUSR1 to
approximate the behavior of SIGSTOP/SIGCONT.

The purpose of this is to implement CPU hard limits using an external tool
that watches the CPU consumption and stops the VCPU as appropriate.

This provides a more elegant solution in that it allows the VCPU thread to
release qemu_mutex before going to sleep.

The current implementation uses a single signal. I think this is too racy in
the long term, so I think we should introduce a second signal. If two signals
get coalesced into one, it could confuse the monitoring tool into giving the
VCPU the inverse of its entitlement.

It might be better to simply move this logic entirely into QEMU to make this
more robust--the question is whether we think this is a good long-term
feature to carry in QEMU?

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>