diff mbox

[STAGING] : Block migration segfaults

Message ID 20091203162121.67d9c120@doriath
State New
Headers show

Commit Message

Luiz Capitulino Dec. 3, 2009, 6:21 p.m. UTC
Hi there,

 Got this while testing block migration in staging:

"""
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000410cf9 in monitor_vprintf (mon=0x0, fmt=0x5ae5e7 "Start full migration for %s\n",
ap=0x7fff1f830a40) at /home/lcapitulino/src/aliguori-queue/monitor.c:192
192         if (mon->mc && !mon->mc->print_enabled) {
"""

 The problem here is that init_blk_migration() calls monitor_printf() with
a NULL 'mon' and the backtrace shows that this is true for the entire call
chain.

 You probably didn't note it before because the lowest-level monitor
print function would just return if the 'mon' parameter was NULL.

 A patch from me (4a29a in staging) changes a higher level monitor
function to touch 'mon' before passing it down and here's the segfault.

 Now, the point is: I could give the old behavior back but I think we're
hiding a bug there. Why would you call monitor_printf() with a NULL 'mon'?

 Anyways, the following patch adds the old behavior back just in case
you want to see it working...

commit 3575196202d4e54c1fc63a631ca5bd1a7778e30d
Author: Luiz Capitulino <lcapitulino@redhat.com>
Date:   Thu Dec 3 15:49:16 2009 -0200

    monitor: Fix block migration segfault
    
    The monitor_vprintf() function now touches the 'mon' variable
    before calling monitor_puts(), this causes block migration
    to segfault as its functions call monitor_printf() with a
    NULL 'mon'.
    
    This is probably hiding the real bug, but for some reason this
    has been the behavior for a long time.
    
    We also change monitor_print_object() to use monitor_print(),
    so that monitor_puts() is only called by monitor_vprintf().
    
    Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>

Comments

Jan Kiszka Dec. 3, 2009, 6:36 p.m. UTC | #1
Luiz Capitulino wrote:
>  Hi there,
> 
>  Got this while testing block migration in staging:
> 
> """
> Program terminated with signal 11, Segmentation fault.
> #0  0x0000000000410cf9 in monitor_vprintf (mon=0x0, fmt=0x5ae5e7 "Start full migration for %s\n",
> ap=0x7fff1f830a40) at /home/lcapitulino/src/aliguori-queue/monitor.c:192
> 192         if (mon->mc && !mon->mc->print_enabled) {
> """
> 
>  The problem here is that init_blk_migration() calls monitor_printf() with
> a NULL 'mon' and the backtrace shows that this is true for the entire call
> chain.

What is the backtrace? And how did you start the migration?

> 
>  You probably didn't note it before because the lowest-level monitor
> print function would just return if the 'mon' parameter was NULL.

I was aware that mon might be NULL, but the existing code handled this
gracefully.

> 
>  A patch from me (4a29a in staging) changes a higher level monitor
> function to touch 'mon' before passing it down and here's the segfault.
> 
>  Now, the point is: I could give the old behavior back but I think we're
> hiding a bug there. Why would you call monitor_printf() with a NULL 'mon'?

If there is no monitor associated with the current context, it can be
NULL. This is mostly the case during early startup. One may also use
this to suppress output (though I don't recall any real case ATM).

> 
>  Anyways, the following patch adds the old behavior back just in case
> you want to see it working...

Yes, better restore the check. Still, your call stack would be
interesting. Maybe there is actual another bug behind it.

> 
> commit 3575196202d4e54c1fc63a631ca5bd1a7778e30d
> Author: Luiz Capitulino <lcapitulino@redhat.com>
> Date:   Thu Dec 3 15:49:16 2009 -0200
> 
>     monitor: Fix block migration segfault

Let's call it by it true name: Catch printing to non-existent monitor.

>     
>     The monitor_vprintf() function now touches the 'mon' variable
>     before calling monitor_puts(), this causes block migration
>     to segfault as its functions call monitor_printf() with a
>     NULL 'mon'.
>     
>     This is probably hiding the real bug, but for some reason this
>     has been the behavior for a long time.
>     
>     We also change monitor_print_object() to use monitor_print(),
>     so that monitor_puts() is only called by monitor_vprintf().
>     
>     Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
> 
> diff --git a/monitor.c b/monitor.c
> index b035e0b..6d0b1dd 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -171,9 +171,6 @@ static void monitor_puts(Monitor *mon, const char *str)
>  {
>      char c;
>  
> -    if (!mon)
> -        return;
> -
>      for(;;) {
>          c = *str++;
>          if (c == '\0')
> @@ -189,6 +186,9 @@ static void monitor_puts(Monitor *mon, const char *str)
>  
>  void monitor_vprintf(Monitor *mon, const char *fmt, va_list ap)
>  {
> +    if (!mon)
> +        return;
> +
>      if (mon->mc && !mon->mc->print_enabled) {
>          qemu_error_new(QERR_UNDEFINED_ERROR);
>      } else {
> @@ -269,7 +269,7 @@ static void monitor_print_qobject(Monitor *mon, const QObject *data)
>              break;
>      }
>  
> -    monitor_puts(mon, "\n");
> +    monitor_printf(mon, "\n");
>  }
>  
>  static void monitor_json_emitter(Monitor *mon, const QObject *data)

Jan
Luiz Capitulino Dec. 3, 2009, 6:59 p.m. UTC | #2
On Thu, 03 Dec 2009 19:36:15 +0100
Jan Kiszka <jan.kiszka@siemens.com> wrote:

> Luiz Capitulino wrote:
> >  Hi there,
> > 
> >  Got this while testing block migration in staging:
> > 
> > """
> > Program terminated with signal 11, Segmentation fault.
> > #0  0x0000000000410cf9 in monitor_vprintf (mon=0x0, fmt=0x5ae5e7 "Start full migration for %s\n",
> > ap=0x7fff1f830a40) at /home/lcapitulino/src/aliguori-queue/monitor.c:192
> > 192         if (mon->mc && !mon->mc->print_enabled) {
> > """
> > 
> >  The problem here is that init_blk_migration() calls monitor_printf() with
> > a NULL 'mon' and the backtrace shows that this is true for the entire call
> > chain.
> 
> What is the backtrace? And how did you start the migration?

 Started the source VM with:

# qemu -hda disks/fedora-11-kratos-i386.img -enable-kvm -snapshot \
       -balloon virtio -m 1G -S

 and the destination one with:

# qemu -hda disks/fedora-11-kratos-i386.img -enable-kvm -snapshot \
       -balloon virtio -m 1G -S -incoming tcp:0:4444

 Migration command issued:

(QEMU) migrate -d -b tcp:0:4444


 I have no idea if this correct and wondered if specifying the same image
for the destination VM would be catastrophic. :)

 The backtrace follows and it's the source VM which segfaults:

"""
#0  0x0000000000410c11 in monitor_vprintf (mon=0x0, fmt=0x5ada87 "Start full migration for %s\n", 
#1  0x0000000000410d59 in monitor_printf (mon=0x0, fmt=0x5ada87 "Start full migration for %s\n")
#2  0x00000000004e584b in init_blk_migration (mon=0x0, f=0x2864130) at block-migration.c:254
#3  0x00000000004e5d8a in block_save_live (mon=0x0, f=0x2864130, stage=1, opaque=0xbea960)
#4  0x00000000004db61a in qemu_savevm_state_begin (mon=0x0, f=0x2864130, blk_enable=1, shared=0)
#5  0x00000000004d2448 in migrate_fd_connect (s=0x20bf470) at migration.c:279
#6  0x00000000004d2849 in tcp_wait_for_connect (opaque=0x20bf470) at migration-tcp.c:72
#7  0x000000000040c3f8 in main_loop_wait (timeout=5000) at /home/lcapitulino/src/aliguori-queue/vl.c:3875
#8  0x000000000040ca17 in main_loop () at /home/lcapitulino/src/aliguori-queue/vl.c:4095
#9  0x00000000004104ce in main (argc=10, argv=0x7fff6f838bf8, envp=0x7fff6f838c50)
"""

[...]

> >  A patch from me (4a29a in staging) changes a higher level monitor
> > function to touch 'mon' before passing it down and here's the segfault.
> > 
> >  Now, the point is: I could give the old behavior back but I think we're
> > hiding a bug there. Why would you call monitor_printf() with a NULL 'mon'?
> 
> If there is no monitor associated with the current context, it can be
> NULL. This is mostly the case during early startup.

 Why would one call monitor_printf() before monitor initialization?

> One may also use
> this to suppress output (though I don't recall any real case ATM).

 I would prefer functions like monitor_disable()/monitor_enable()
for this...

> >  Anyways, the following patch adds the old behavior back just in case
> > you want to see it working...
> 
> Yes, better restore the check. Still, your call stack would be
> interesting. Maybe there is actual another bug behind it.

 Ok, I'll send the fix because it's not that important right now,
but I'm not convinced this is the right thing to do.
Jan Kiszka Dec. 3, 2009, 7:10 p.m. UTC | #3
Luiz Capitulino wrote:
> On Thu, 03 Dec 2009 19:36:15 +0100
> Jan Kiszka <jan.kiszka@siemens.com> wrote:
> 
>> Luiz Capitulino wrote:
>>>  Hi there,
>>>
>>>  Got this while testing block migration in staging:
>>>
>>> """
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  0x0000000000410cf9 in monitor_vprintf (mon=0x0, fmt=0x5ae5e7 "Start full migration for %s\n",
>>> ap=0x7fff1f830a40) at /home/lcapitulino/src/aliguori-queue/monitor.c:192
>>> 192         if (mon->mc && !mon->mc->print_enabled) {
>>> """
>>>
>>>  The problem here is that init_blk_migration() calls monitor_printf() with
>>> a NULL 'mon' and the backtrace shows that this is true for the entire call
>>> chain.
>> What is the backtrace? And how did you start the migration?
> 
>  Started the source VM with:
> 
> # qemu -hda disks/fedora-11-kratos-i386.img -enable-kvm -snapshot \
>        -balloon virtio -m 1G -S
> 
>  and the destination one with:
> 
> # qemu -hda disks/fedora-11-kratos-i386.img -enable-kvm -snapshot \
>        -balloon virtio -m 1G -S -incoming tcp:0:4444
> 
>  Migration command issued:
> 
> (QEMU) migrate -d -b tcp:0:4444
> 

Ah, forgot '-d'! Yes, that was precisely the use case for suppressing
monitor output I was talking about. This depends on the monitor services
catching NULL properly, so please push the corresponding patch.

> 
>  I have no idea if this correct and wondered if specifying the same image
> for the destination VM would be catastrophic. :)
> 
>  The backtrace follows and it's the source VM which segfaults:
> 
> """
> #0  0x0000000000410c11 in monitor_vprintf (mon=0x0, fmt=0x5ada87 "Start full migration for %s\n", 
> #1  0x0000000000410d59 in monitor_printf (mon=0x0, fmt=0x5ada87 "Start full migration for %s\n")
> #2  0x00000000004e584b in init_blk_migration (mon=0x0, f=0x2864130) at block-migration.c:254
> #3  0x00000000004e5d8a in block_save_live (mon=0x0, f=0x2864130, stage=1, opaque=0xbea960)
> #4  0x00000000004db61a in qemu_savevm_state_begin (mon=0x0, f=0x2864130, blk_enable=1, shared=0)
> #5  0x00000000004d2448 in migrate_fd_connect (s=0x20bf470) at migration.c:279
> #6  0x00000000004d2849 in tcp_wait_for_connect (opaque=0x20bf470) at migration-tcp.c:72
> #7  0x000000000040c3f8 in main_loop_wait (timeout=5000) at /home/lcapitulino/src/aliguori-queue/vl.c:3875
> #8  0x000000000040ca17 in main_loop () at /home/lcapitulino/src/aliguori-queue/vl.c:4095
> #9  0x00000000004104ce in main (argc=10, argv=0x7fff6f838bf8, envp=0x7fff6f838c50)
> """
> 
> [...]
> 
>>>  A patch from me (4a29a in staging) changes a higher level monitor
>>> function to touch 'mon' before passing it down and here's the segfault.
>>>
>>>  Now, the point is: I could give the old behavior back but I think we're
>>> hiding a bug there. Why would you call monitor_printf() with a NULL 'mon'?
>> If there is no monitor associated with the current context, it can be
>> NULL. This is mostly the case during early startup.
> 
>  Why would one call monitor_printf() before monitor initialization?

Via shared code e.g.

> 
>> One may also use
>> this to suppress output (though I don't recall any real case ATM).
> 
>  I would prefer functions like monitor_disable()/monitor_enable()
> for this...

They have different meanings.

> 
>>>  Anyways, the following patch adds the old behavior back just in case
>>> you want to see it working...
>> Yes, better restore the check. Still, your call stack would be
>> interesting. Maybe there is actual another bug behind it.
> 
>  Ok, I'll send the fix because it's not that important right now,
> but I'm not convinced this is the right thing to do.

Thanks,
Jan
Anthony Liguori Dec. 3, 2009, 7:44 p.m. UTC | #4
Jan Kiszka wrote:
> Luiz Capitulino wrote:
>   
>>  Hi there,
>>
>>  Got this while testing block migration in staging:
>>
>> """
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x0000000000410cf9 in monitor_vprintf (mon=0x0, fmt=0x5ae5e7 "Start full migration for %s\n",
>> ap=0x7fff1f830a40) at /home/lcapitulino/src/aliguori-queue/monitor.c:192
>> 192         if (mon->mc && !mon->mc->print_enabled) {
>> """
>>
>>  The problem here is that init_blk_migration() calls monitor_printf() with
>> a NULL 'mon' and the backtrace shows that this is true for the entire call
>> chain.
>>     
>
> What is the backtrace? And how did you start the migration?
>
>   
>>  You probably didn't note it before because the lowest-level monitor
>> print function would just return if the 'mon' parameter was NULL.
>>     
>
> I was aware that mon might be NULL, but the existing code handled this
> gracefully.
>
>   
>>  A patch from me (4a29a in staging) changes a higher level monitor
>> function to touch 'mon' before passing it down and here's the segfault.
>>
>>  Now, the point is: I could give the old behavior back but I think we're
>> hiding a bug there. Why would you call monitor_printf() with a NULL 'mon'?
>>     
>
> If there is no monitor associated with the current context, it can be
> NULL. This is mostly the case during early startup. One may also use
> this to suppress output (though I don't recall any real case ATM).
>   

I'm a bit concerned with this explanation as there is no reason 
something should be printing to the monitor unless it's in response to a 
monitor command.  I'd like to see the full call chain to see what's 
happening too.

>>  Anyways, the following patch adds the old behavior back just in case
>> you want to see it working...
>>     
>
> Yes, better restore the check. Still, your call stack would be
> interesting. Maybe there is actual another bug behind it.
>   

I think we should have an assert in this path because mon==NULL is 
definitely wrong.

Regards,

Anthony Liguori
Anthony Liguori Dec. 3, 2009, 7:47 p.m. UTC | #5
Jan Kiszka wrote:
> Luiz Capitulino wrote:
>   
>> On Thu, 03 Dec 2009 19:36:15 +0100
>> Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>
>>     
>>> Luiz Capitulino wrote:
>>>       
>>>>  Hi there,
>>>>
>>>>  Got this while testing block migration in staging:
>>>>
>>>> """
>>>> Program terminated with signal 11, Segmentation fault.
>>>> #0  0x0000000000410cf9 in monitor_vprintf (mon=0x0, fmt=0x5ae5e7 "Start full migration for %s\n",
>>>> ap=0x7fff1f830a40) at /home/lcapitulino/src/aliguori-queue/monitor.c:192
>>>> 192         if (mon->mc && !mon->mc->print_enabled) {
>>>> """
>>>>
>>>>  The problem here is that init_blk_migration() calls monitor_printf() with
>>>> a NULL 'mon' and the backtrace shows that this is true for the entire call
>>>> chain.
>>>>         
>>> What is the backtrace? And how did you start the migration?
>>>       
>>  Started the source VM with:
>>
>> # qemu -hda disks/fedora-11-kratos-i386.img -enable-kvm -snapshot \
>>        -balloon virtio -m 1G -S
>>
>>  and the destination one with:
>>
>> # qemu -hda disks/fedora-11-kratos-i386.img -enable-kvm -snapshot \
>>        -balloon virtio -m 1G -S -incoming tcp:0:4444
>>
>>  Migration command issued:
>>
>> (QEMU) migrate -d -b tcp:0:4444
>>
>>     
>
> Ah, forgot '-d'! Yes, that was precisely the use case for suppressing
> monitor output I was talking about. This depends on the monitor services
> catching NULL properly, so please push the corresponding patch.
>   

Man I really hate detach.  Okay, I agree re: a patch.

Regards,

Anthony Liguori
diff mbox

Patch

diff --git a/monitor.c b/monitor.c
index b035e0b..6d0b1dd 100644
--- a/monitor.c
+++ b/monitor.c
@@ -171,9 +171,6 @@  static void monitor_puts(Monitor *mon, const char *str)
 {
     char c;
 
-    if (!mon)
-        return;
-
     for(;;) {
         c = *str++;
         if (c == '\0')
@@ -189,6 +186,9 @@  static void monitor_puts(Monitor *mon, const char *str)
 
 void monitor_vprintf(Monitor *mon, const char *fmt, va_list ap)
 {
+    if (!mon)
+        return;
+
     if (mon->mc && !mon->mc->print_enabled) {
         qemu_error_new(QERR_UNDEFINED_ERROR);
     } else {
@@ -269,7 +269,7 @@  static void monitor_print_qobject(Monitor *mon, const QObject *data)
             break;
     }
 
-    monitor_puts(mon, "\n");
+    monitor_printf(mon, "\n");
 }
 
 static void monitor_json_emitter(Monitor *mon, const QObject *data)