Patchwork pc: madvise(MADV_DONTNEED) memory on reset

login
register
mail settings
Submitter Anthony Liguori
Date Feb. 24, 2010, 7:10 p.m.
Message ID <1267038612-21581-1-git-send-email-aliguori@us.ibm.com>
Download mbox | patch
Permalink /patch/46202/
State New
Headers show

Comments

Anthony Liguori - Feb. 24, 2010, 7:10 p.m.
If you compare the RSS of a freshly booted guest and the same guest after a
reboot, it's very likely the freshly booted guest will have an RSS that is
much lower the the rebooted guest because the previous run of the guest faulted
in all available memory.

This patch addresses this issue by using madvise() during reset.  It only
resets RAM areas which means it has to be done in the machine.  I've only done
this for the x86 target because I'm fairly confident that this is allowed
architecturally on x86 although I'm not sure this is universely true.

This does not appear to have an observable cost with a large memory guest and
I can't really think of any down sides.

Reported-by: Karl Rister <kmr@linux.vnet.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
---
 hw/pc.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)
Blue Swirl - Feb. 24, 2010, 8:59 p.m.
On 2/24/10, Anthony Liguori <aliguori@us.ibm.com> wrote:
> If you compare the RSS of a freshly booted guest and the same guest after a
>  reboot, it's very likely the freshly booted guest will have an RSS that is
>  much lower the the rebooted guest because the previous run of the guest faulted
>  in all available memory.
>
>  This patch addresses this issue by using madvise() during reset.  It only
>  resets RAM areas which means it has to be done in the machine.  I've only done
>  this for the x86 target because I'm fairly confident that this is allowed
>  architecturally on x86 although I'm not sure this is universely true.
>
>  This does not appear to have an observable cost with a large memory guest and
>  I can't really think of any down sides.

I think it would be much cleaner to make the madvise() calls from
exec.c, now you are duplicating some of the functionality there. The
calls could be controlled by a global variable (set only in pc.c) so
non-PC architectures would not be disturbed.
Anthony Liguori - Feb. 24, 2010, 10:05 p.m.
On 02/24/2010 02:59 PM, Blue Swirl wrote:
> On 2/24/10, Anthony Liguori<aliguori@us.ibm.com>  wrote:
>    
>> If you compare the RSS of a freshly booted guest and the same guest after a
>>   reboot, it's very likely the freshly booted guest will have an RSS that is
>>   much lower the the rebooted guest because the previous run of the guest faulted
>>   in all available memory.
>>
>>   This patch addresses this issue by using madvise() during reset.  It only
>>   resets RAM areas which means it has to be done in the machine.  I've only done
>>   this for the x86 target because I'm fairly confident that this is allowed
>>   architecturally on x86 although I'm not sure this is universely true.
>>
>>   This does not appear to have an observable cost with a large memory guest and
>>   I can't really think of any down sides.
>>      
> I think it would be much cleaner to make the madvise() calls from
> exec.c, now you are duplicating some of the functionality there. The
> calls could be controlled by a global variable (set only in pc.c) so
> non-PC architectures would not be disturbed.
>    

One thing we could do (that I think has other uses) is to add a context 
parameter to qemu_ram_alloc().  We could start with a simple flag of 
something like QRAM_RAM and QRAM_ROM.  QRAM_RAM would get automatically 
madvise()'d on reset.

But that said, does anyone know of an architecture where this type of 
reset would be a problem?  Would it be a problem on sparc?

Regards,

Anthony Liguori
Samuel Thibault - Feb. 24, 2010, 11:56 p.m.
Hello,

I just want to note the fact that although Linux got it badly wrong,
according to POSIX, MADV_DONTNEED is _not_ supposed to drop the content
of the memory, but just to tune the write-back heuristics and such.
(see glibc's ./sysdeps/unix/sysv/linux/posix_madvise.c if you're not
convinced)

qemu should probably try to use MADV_REMOVE (proper linux variant) or
MADV_FREE (solaris variant).

Samuel
Paul Brook - Feb. 28, 2010, 12:57 a.m.
> > I think it would be much cleaner to make the madvise() calls from
> > exec.c, now you are duplicating some of the functionality there. The
> > calls could be controlled by a global variable (set only in pc.c) so
> > non-PC architectures would not be disturbed.
> 
> One thing we could do (that I think has other uses) is to add a context
> parameter to qemu_ram_alloc().  We could start with a simple flag of
> something like QRAM_RAM and QRAM_ROM.  QRAM_RAM would get automatically
> madvise()'d on reset.
> 
> But that said, does anyone know of an architecture where this type of
> reset would be a problem?  Would it be a problem on sparc?

I think it's simplest to just say that qemu_system_reset is a full hard reset. 
We already reload ROM images, etc.

Paul
Stefan Hajnoczi - Feb. 28, 2010, 9:13 a.m.
On Wed, Feb 24, 2010 at 7:10 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> This patch addresses this issue by using madvise() during reset.  It only
> resets RAM areas which means it has to be done in the machine.  I've only done
> this for the x86 target because I'm fairly confident that this is allowed
> architecturally on x86 although I'm not sure this is universely true.
>
> This does not appear to have an observable cost with a large memory guest and
> I can't really think of any down sides.

Preserving the contents of memory across reboot seems to be common on
real x86 hardware.  (Even if you hard power off and on again quickly.)

This behavior is useful for crash analysis.  I use it as a tool for
debugging gPXE both on real hardware and under QEMU for example:

http://git.etherboot.org/?p=people/stefanha/gpxe.git;a=commitdiff;h=67344d8c3adca51658e1b8a80541a16e9f693a8d

posix_madvise() is fine but Linux MADV_DONTNEED will prevent this sort
of tool from working.

Would it be possible to make this configurable in some way?  I can
imagine doing something like:
qemu> system_reset -preserve

Stefan

Patch

diff --git a/hw/pc.c b/hw/pc.c
index 4f6a522..10446ba 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -45,6 +45,11 @@ 
 #include "loader.h"
 #include "elf.h"
 #include "multiboot.h"
+#include "kvm.h"
+
+#ifndef _WIN32
+#include <sys/mman.h>
+#endif
 
 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -63,10 +68,19 @@ 
 
 #define MAX_IDE_BUS 2
 
+#define MAX_MEMORY_ENTRIES 10
+
+typedef struct MemoryEntry {
+    ram_addr_t addr;
+    ram_addr_t size;
+} MemoryEntry;
+
 static FDCtrl *floppy_controller;
 static RTCState *rtc_state;
 static PITState *pit;
 static PCII440FXState *i440fx_state;
+static int num_memory_entries;
+static MemoryEntry memory_entries[MAX_MEMORY_ENTRIES];
 
 #define E820_NR_ENTRIES		16
 
@@ -782,6 +796,27 @@  static CPUState *pc_new_cpu(const char *cpu_model)
     return env;
 }
 
+static void add_mem_entry(ram_addr_t addr, ram_addr_t size)
+{
+    memory_entries[num_memory_entries].addr = addr;
+    memory_entries[num_memory_entries].size = size;
+    num_memory_entries++;
+}
+
+static void pc_reset_ram(void *opaque)
+{
+    int i;
+
+    for (i = 0; i < num_memory_entries; i++) {
+#ifndef _WIN32
+        if (!kvm_enabled() || kvm_has_sync_mmu()) {
+            madvise(qemu_get_ram_ptr(memory_entries[i].addr),
+                    memory_entries[i].size, MADV_DONTNEED);
+        }
+#endif
+    }
+}
+
 /* PC hardware initialisation */
 static void pc_init1(ram_addr_t ram_size,
                      const char *boot_device,
@@ -835,6 +870,7 @@  static void pc_init1(ram_addr_t ram_size,
     /* allocate RAM */
     ram_addr = qemu_ram_alloc(0xa0000);
     cpu_register_physical_memory(0, 0xa0000, ram_addr);
+    add_mem_entry(ram_addr, 0xa0000);
 
     /* Allocate, even though we won't register, so we don't break the
      * phys_ram_base + PA assumption. This range includes vga (0xa0000 - 0xc0000),
@@ -845,6 +881,7 @@  static void pc_init1(ram_addr_t ram_size,
     cpu_register_physical_memory(0x100000,
                  below_4g_mem_size - 0x100000,
                  ram_addr);
+    add_mem_entry(ram_addr, below_4g_mem_size - 0x100000);
 
     /* above 4giga memory allocation */
     if (above_4g_mem_size > 0) {
@@ -855,6 +892,7 @@  static void pc_init1(ram_addr_t ram_size,
         cpu_register_physical_memory(0x100000000ULL,
                                      above_4g_mem_size,
                                      ram_addr);
+        add_mem_entry(ram_addr, above_4g_mem_size);
 #endif
     }
 
@@ -1050,6 +1088,8 @@  static void pc_init1(ram_addr_t ram_size,
             pci_create_simple(pci_bus, -1, "lsi53c895a");
         }
     }
+
+    qemu_register_reset(pc_reset_ram, NULL);
 }
 
 static void pc_init_pci(ram_addr_t ram_size,