Message ID | 20091007132006.GA9668@redhat.com |
---|---|
State | Rejected |
Headers | show |
On Wed, 7 Oct 2009, Michael S. Tsirkin wrote: > This reverts commit ee3993069ff55fa6f1c64daf1e09963e340db8e4. > > With commit ee3993069ff55fa6f1c64daf1e09963e340db8e4, > winxp installation on a raw format file fails > during disk format, with a message "your > disk may be damaged". > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com> > Cc: malc <av1474@comtv.ru> > --- > > malc, care to describe what the problem you > are solving here is? New thread is spawned, SIGALRM arrives, kernel picks up this new thread since it's mask allows it - bad thigs happen. > > All, could everyone please start being nice and post patches > and give time for review before applying? > Thanks! There's nothing to review in this particular case, the thing was plain broken, besides same thing was already discussed between me and Andrzej, see for instance: http://www.mail-archive.com/qemu-devel@nongnu.org/msg09542.html Needless to say the revert is broken also.
> There's nothing to review in this particular case, the thing was plain > broken, besides same thing was already discussed between me and > Andrzej, see for instance: > http://www.mail-archive.com/qemu-devel@nongnu.org/msg09542.html > > Needless to say the revert is broken also. > Yeah, I understand the race the original commit fixes but I don't understand how the original commit broke WinXP. Do you have further insight Michael? Blindly reverting something that appears right because of a regression is likely just papering over a bigger problem. Regards, Anthony Liguori
On Wed, Oct 07, 2009 at 11:28:21AM -0500, Anthony Liguori wrote: > >> There's nothing to review in this particular case, the thing was plain >> broken, besides same thing was already discussed between me and >> Andrzej, see for instance: >> http://www.mail-archive.com/qemu-devel@nongnu.org/msg09542.html >> >> Needless to say the revert is broken also. >> > > Yeah, I understand the race the original commit fixes but I don't > understand how the original commit broke WinXP. > > Do you have further insight Michael? Nope, I just bisected and then reverted and tested that this helps. > Blindly reverting something that > appears right because of a regression is likely just papering over a > bigger problem. > > Regards, > > Anthony Liguori >
On Wed, Oct 07, 2009 at 11:28:21AM -0500, Anthony Liguori wrote: > >> There's nothing to review in this particular case, the thing was plain >> broken, besides same thing was already discussed between me and >> Andrzej, see for instance: >> http://www.mail-archive.com/qemu-devel@nongnu.org/msg09542.html >> >> Needless to say the revert is broken also. >> > > Yeah, I understand the race the original commit fixes but I don't > understand how the original commit broke WinXP. > > Do you have further insight Michael? My guess is that SIGALARM gets lost, this somehow leads to disk request to timeout. Possible? > Blindly reverting something that appears right because of a > regression is likely just papering over a bigger problem. > > Regards, > > Anthony Liguori >
On Wed, 7 Oct 2009, Michael S. Tsirkin wrote: > On Wed, Oct 07, 2009 at 11:28:21AM -0500, Anthony Liguori wrote: > > > >> There's nothing to review in this particular case, the thing was plain > >> broken, besides same thing was already discussed between me and > >> Andrzej, see for instance: > >> http://www.mail-archive.com/qemu-devel@nongnu.org/msg09542.html > >> > >> Needless to say the revert is broken also. > >> > > > > Yeah, I understand the race the original commit fixes but I don't > > understand how the original commit broke WinXP. > > > > Do you have further insight Michael? > > My guess is that SIGALARM gets lost, this somehow leads to > disk request to timeout. Possible? Being delayed - possible, being lost - violation of the spec. > > > Blindly reverting something that appears right because of a > > regression is likely just papering over a bigger problem. > > > > Regards, > > > > Anthony Liguori > > >
On Thu, Oct 08, 2009 at 12:54:00AM +0400, malc wrote: > On Wed, 7 Oct 2009, Michael S. Tsirkin wrote: > > > On Wed, Oct 07, 2009 at 11:28:21AM -0500, Anthony Liguori wrote: > > > > > >> There's nothing to review in this particular case, the thing was plain > > >> broken, besides same thing was already discussed between me and > > >> Andrzej, see for instance: > > >> http://www.mail-archive.com/qemu-devel@nongnu.org/msg09542.html > > >> > > >> Needless to say the revert is broken also. > > >> > > > > > > Yeah, I understand the race the original commit fixes but I don't > > > understand how the original commit broke WinXP. > > > > > > Do you have further insight Michael? > > > > My guess is that SIGALARM gets lost, this somehow leads to > > disk request to timeout. Possible? > > Being delayed - possible, being lost - violation of the spec. I see this: The use of sigprocmask() is unspecified in a multithreaded process; see pthread_sigmask(3). Does it matter? > > > > > Blindly reverting something that appears right because of a > > > regression is likely just papering over a bigger problem. > > > > > > Regards, > > > > > > Anthony Liguori > > > > > > > -- > mailto:av1474@comtv.ru
On Wed, 7 Oct 2009, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2009 at 12:54:00AM +0400, malc wrote: > > On Wed, 7 Oct 2009, Michael S. Tsirkin wrote: > > > > > On Wed, Oct 07, 2009 at 11:28:21AM -0500, Anthony Liguori wrote: > > > > > > > >> There's nothing to review in this particular case, the thing was plain > > > >> broken, besides same thing was already discussed between me and > > > >> Andrzej, see for instance: > > > >> http://www.mail-archive.com/qemu-devel@nongnu.org/msg09542.html > > > >> > > > >> Needless to say the revert is broken also. > > > >> > > > > > > > > Yeah, I understand the race the original commit fixes but I don't > > > > understand how the original commit broke WinXP. > > > > > > > > Do you have further insight Michael? > > > > > > My guess is that SIGALARM gets lost, this somehow leads to > > > disk request to timeout. Possible? > > > > Being delayed - possible, being lost - violation of the spec. > > I see this: > > The use of sigprocmask() is unspecified in a > multithreaded process; see pthread_sigmask(3). > > Does it matter? One of the patches i've asked you to try today replaced sigprocmask with pthread_sigmask, you've said it did nothing. In any case, strictly speaking, the code is wrong, so yes it does matter in theory. [..snip..]
malc wrote: > > The use of sigprocmask() is unspecified in a > > multithreaded process; see pthread_sigmask(3). > > > > Does it matter? > > One of the patches i've asked you to try today replaced sigprocmask with > pthread_sigmask, you've said it did nothing. In any case, strictly > speaking, the code is wrong, so yes it does matter in theory. It won't matter on a Linux host (they are the same), but pthread_sigmask should be used because it's Right(tm) and it could make a difference on some other host. -- Jamie
On Thu, 8 Oct 2009, Jamie Lokier wrote: > malc wrote: > > > The use of sigprocmask() is unspecified in a > > > multithreaded process; see pthread_sigmask(3). > > > > > > Does it matter? > > > > One of the patches i've asked you to try today replaced sigprocmask with > > pthread_sigmask, you've said it did nothing. In any case, strictly > > speaking, the code is wrong, so yes it does matter in theory. > > It won't matter on a Linux host (they are the same), but > pthread_sigmask should be used because it's Right(tm) and it could > make a difference on some other host. What made you think i'm of a different opinion?
malc wrote: > On Thu, 8 Oct 2009, Jamie Lokier wrote: > > > malc wrote: > > > > The use of sigprocmask() is unspecified in a > > > > multithreaded process; see pthread_sigmask(3). > > > > > > > > Does it matter? > > > > > > One of the patches i've asked you to try today replaced sigprocmask with > > > pthread_sigmask, you've said it did nothing. In any case, strictly > > > speaking, the code is wrong, so yes it does matter in theory. > > > > It won't matter on a Linux host (they are the same), but > > pthread_sigmask should be used because it's Right(tm) and it could > > make a difference on some other host. > > What made you think i'm of a different opinion? The fact you asked Michael to test a patch which replaced sigprocmask with pthread_sigmask. -- Jamie
On Thu, 8 Oct 2009, Jamie Lokier wrote: > malc wrote: > > On Thu, 8 Oct 2009, Jamie Lokier wrote: > > > > > malc wrote: > > > > > The use of sigprocmask() is unspecified in a > > > > > multithreaded process; see pthread_sigmask(3). > > > > > > > > > > Does it matter? > > > > > > > > One of the patches i've asked you to try today replaced sigprocmask with > > > > pthread_sigmask, you've said it did nothing. In any case, strictly > > > > speaking, the code is wrong, so yes it does matter in theory. > > > > > > It won't matter on a Linux host (they are the same), but > > > pthread_sigmask should be used because it's Right(tm) and it could > > > make a difference on some other host. > > > > What made you think i'm of a different opinion? > > The fact you asked Michael to test a patch which replaced sigprocmask > with pthread_sigmask. That made you think i'm somehow in favour of sigprocmask? I'm confused.
On Thu, Oct 08, 2009 at 04:15:27PM +0100, Jamie Lokier wrote: > malc wrote: > > On Thu, 8 Oct 2009, Jamie Lokier wrote: > > > > > malc wrote: > > > > > The use of sigprocmask() is unspecified in a > > > > > multithreaded process; see pthread_sigmask(3). > > > > > > > > > > Does it matter? > > > > > > > > One of the patches i've asked you to try today replaced sigprocmask with > > > > pthread_sigmask, you've said it did nothing. In any case, strictly > > > > speaking, the code is wrong, so yes it does matter in theory. BTW, tried looking at the code. I saw some (unrelated) issues: static void aio_signal_handler(int signum) { if (posix_aio_state) { char byte = 0; write(posix_aio_state->wfd, &byte, sizeof(byte)); } qemu_service_io(); } And qemu_service_io does a *ton* of things. Questions: - Do we need the call to qemu_service_io? Seems to behave the same with and without it. - Are all of the data structures touched by qemu_service_io protected by blocking signals before access? Also: - let's use signalfd on linux, if available? - for SIGALARM, maybe timerfd? Thanks,
Michael S. Tsirkin wrote: > BTW, tried looking at the code. > I saw some (unrelated) issues: > > static void aio_signal_handler(int signum) > { > if (posix_aio_state) { > char byte = 0; > > write(posix_aio_state->wfd, &byte, sizeof(byte)); > } > > qemu_service_io(); > } > > And qemu_service_io does a *ton* of things. > It just does a cpu_exit(). > Questions: > - Do we need the call to qemu_service_io? Seems to > behave the same with and without it. > Where you run into trouble is when you have a guest that has no periodic timer and needs a mechanism to break out of the tcg loop whenever an IO operation completes. See Author: aliguori <aliguori@c046a42c-6fe2-441c-8c8c-71466251a162> Date: Wed Oct 8 19:50:24 2008 +0000 Fix IO performance regression in sparc > - Are all of the data structures touched by qemu_service_io > protected by blocking signals before access? > cpu_exit is pretty careful to not do anything that could be racey. > Also: > - let's use signalfd on linux, if available? > signalfd doesn't help. In fact, we originally used signalfd and it caused a regression. > - for SIGALARM, maybe timerfd? > Or just get rid of SIGALARM... Regards, Anthony Liguori
Michael S. Tsirkin wrote: > BTW, tried looking at the code. > I saw some (unrelated) issues: > > static void aio_signal_handler(int signum) > { > if (posix_aio_state) { > char byte = 0; > > write(posix_aio_state->wfd, &byte, sizeof(byte)); > } > > qemu_service_io(); > } > > And qemu_service_io does a *ton* of things. > It just does a cpu_exit(). > Questions: > - Do we need the call to qemu_service_io? Seems to > behave the same with and without it. > Where you run into trouble is when you have a guest that has no periodic timer and needs a mechanism to break out of the tcg loop whenever an IO operation completes. See Author: aliguori <aliguori@c046a42c-6fe2-441c-8c8c-71466251a162> Date: Wed Oct 8 19:50:24 2008 +0000 Fix IO performance regression in sparc > - Are all of the data structures touched by qemu_service_io > protected by blocking signals before access? > cpu_exit is pretty careful to not do anything that could be racey. > Also: > - let's use signalfd on linux, if available? > signalfd doesn't help. In fact, we originally used signalfd and it caused a regression. > - for SIGALARM, maybe timerfd? > Or just get rid of SIGALARM... Regards, Anthony Liguori
malc wrote: > On Thu, 8 Oct 2009, Jamie Lokier wrote: > > > malc wrote: > > > On Thu, 8 Oct 2009, Jamie Lokier wrote: > > > > > > > malc wrote: > > > > > > The use of sigprocmask() is unspecified in a > > > > > > multithreaded process; see pthread_sigmask(3). > > > > > > > > > > > > Does it matter? > > > > > > > > > > One of the patches i've asked you to try today replaced sigprocmask with > > > > > pthread_sigmask, you've said it did nothing. In any case, strictly > > > > > speaking, the code is wrong, so yes it does matter in theory. > > > > > > > > It won't matter on a Linux host (they are the same), but > > > > pthread_sigmask should be used because it's Right(tm) and it could > > > > make a difference on some other host. > > > > > > What made you think i'm of a different opinion? > > > > The fact you asked Michael to test a patch which replaced sigprocmask > > with pthread_sigmask. > > That made you think i'm somehow in favour of sigprocmask? I'm confused. It didn't make me think you're in favour of sigprocmask. It made me think you thought there was a difference between them that would affect the bug. Anyways, I think we're done with this subthread :-) -- Jamie
Michael S. Tsirkin wrote: > And qemu_service_io does a *ton* of things. > Questions: > - Do we need the call to qemu_service_io? Seems to > behave the same with and without it. > - Are all of the data structures touched by qemu_service_io > protected by blocking signals before access? Plus: - Are all the things done by qemu_service_io async-signal-safe? E.g. any use of pthread primitives is not async-signal-safe. I don't know if thread-local storage is. -- Jamie
On Fri, 9 Oct 2009, Jamie Lokier wrote: > malc wrote: > > On Thu, 8 Oct 2009, Jamie Lokier wrote: > > > > > malc wrote: > > > > On Thu, 8 Oct 2009, Jamie Lokier wrote: > > > > > > > > > malc wrote: > > > > > > > The use of sigprocmask() is unspecified in a > > > > > > > multithreaded process; see pthread_sigmask(3). > > > > > > > > > > > > > > Does it matter? > > > > > > > > > > > > One of the patches i've asked you to try today replaced sigprocmask with > > > > > > pthread_sigmask, you've said it did nothing. In any case, strictly > > > > > > speaking, the code is wrong, so yes it does matter in theory. > > > > > > > > > > It won't matter on a Linux host (they are the same), but > > > > > pthread_sigmask should be used because it's Right(tm) and it could > > > > > make a difference on some other host. > > > > > > > > What made you think i'm of a different opinion? > > > > > > The fact you asked Michael to test a patch which replaced sigprocmask > > > with pthread_sigmask. > > > > That made you think i'm somehow in favour of sigprocmask? I'm confused. > > It didn't make me think you're in favour of sigprocmask. > > It made me think you thought there was a difference between them that > would affect the bug. > > Anyways, I think we're done with this subthread :-) Goooood :)
diff --git a/posix-aio-compat.c b/posix-aio-compat.c index 400d898..68cbec8 100644 --- a/posix-aio-compat.c +++ b/posix-aio-compat.c @@ -301,9 +301,14 @@ static size_t handle_aiocb_rw(struct qemu_paiocb *aiocb) static void *aio_thread(void *unused) { pid_t pid; + sigset_t set; pid = getpid(); + /* block all signals */ + if (sigfillset(&set)) die("sigfillset"); + if (sigprocmask(SIG_BLOCK, &set, NULL)) die("sigprocmask"); + while (1) { struct qemu_paiocb *aiocb; size_t ret = 0; @@ -364,18 +369,9 @@ static void *aio_thread(void *unused) static void spawn_thread(void) { - sigset_t set, oldset; - cur_threads++; idle_threads++; - - /* block all signals */ - if (sigfillset(&set)) die("sigfillset"); - if (sigprocmask(SIG_SETMASK, &set, &oldset)) die("sigprocmask"); - thread_create(&thread_id, &attr, aio_thread, NULL); - - if (sigprocmask(SIG_SETMASK, &oldset, NULL)) die("sigprocmask restore"); } static void qemu_paio_submit(struct qemu_paiocb *aiocb)
This reverts commit ee3993069ff55fa6f1c64daf1e09963e340db8e4. With commit ee3993069ff55fa6f1c64daf1e09963e340db8e4, winxp installation on a raw format file fails during disk format, with a message "your disk may be damaged". Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Cc: malc <av1474@comtv.ru> --- malc, care to describe what the problem you are solving here is? All, could everyone please start being nice and post patches and give time for review before applying? Thanks! posix-aio-compat.c | 14 +++++--------- 1 files changed, 5 insertions(+), 9 deletions(-)