Message ID | 1267038612-21581-1-git-send-email-aliguori@us.ibm.com |
---|---|
State | New |
Headers | show |
On 2/24/10, Anthony Liguori <aliguori@us.ibm.com> wrote: > If you compare the RSS of a freshly booted guest and the same guest after a > reboot, it's very likely the freshly booted guest will have an RSS that is > much lower the the rebooted guest because the previous run of the guest faulted > in all available memory. > > This patch addresses this issue by using madvise() during reset. It only > resets RAM areas which means it has to be done in the machine. I've only done > this for the x86 target because I'm fairly confident that this is allowed > architecturally on x86 although I'm not sure this is universely true. > > This does not appear to have an observable cost with a large memory guest and > I can't really think of any down sides. I think it would be much cleaner to make the madvise() calls from exec.c, now you are duplicating some of the functionality there. The calls could be controlled by a global variable (set only in pc.c) so non-PC architectures would not be disturbed.
On 02/24/2010 02:59 PM, Blue Swirl wrote: > On 2/24/10, Anthony Liguori<aliguori@us.ibm.com> wrote: > >> If you compare the RSS of a freshly booted guest and the same guest after a >> reboot, it's very likely the freshly booted guest will have an RSS that is >> much lower the the rebooted guest because the previous run of the guest faulted >> in all available memory. >> >> This patch addresses this issue by using madvise() during reset. It only >> resets RAM areas which means it has to be done in the machine. I've only done >> this for the x86 target because I'm fairly confident that this is allowed >> architecturally on x86 although I'm not sure this is universely true. >> >> This does not appear to have an observable cost with a large memory guest and >> I can't really think of any down sides. >> > I think it would be much cleaner to make the madvise() calls from > exec.c, now you are duplicating some of the functionality there. The > calls could be controlled by a global variable (set only in pc.c) so > non-PC architectures would not be disturbed. > One thing we could do (that I think has other uses) is to add a context parameter to qemu_ram_alloc(). We could start with a simple flag of something like QRAM_RAM and QRAM_ROM. QRAM_RAM would get automatically madvise()'d on reset. But that said, does anyone know of an architecture where this type of reset would be a problem? Would it be a problem on sparc? Regards, Anthony Liguori
Hello, I just want to note the fact that although Linux got it badly wrong, according to POSIX, MADV_DONTNEED is _not_ supposed to drop the content of the memory, but just to tune the write-back heuristics and such. (see glibc's ./sysdeps/unix/sysv/linux/posix_madvise.c if you're not convinced) qemu should probably try to use MADV_REMOVE (proper linux variant) or MADV_FREE (solaris variant). Samuel
> > I think it would be much cleaner to make the madvise() calls from > > exec.c, now you are duplicating some of the functionality there. The > > calls could be controlled by a global variable (set only in pc.c) so > > non-PC architectures would not be disturbed. > > One thing we could do (that I think has other uses) is to add a context > parameter to qemu_ram_alloc(). We could start with a simple flag of > something like QRAM_RAM and QRAM_ROM. QRAM_RAM would get automatically > madvise()'d on reset. > > But that said, does anyone know of an architecture where this type of > reset would be a problem? Would it be a problem on sparc? I think it's simplest to just say that qemu_system_reset is a full hard reset. We already reload ROM images, etc. Paul
On Wed, Feb 24, 2010 at 7:10 PM, Anthony Liguori <aliguori@us.ibm.com> wrote: > This patch addresses this issue by using madvise() during reset. It only > resets RAM areas which means it has to be done in the machine. I've only done > this for the x86 target because I'm fairly confident that this is allowed > architecturally on x86 although I'm not sure this is universely true. > > This does not appear to have an observable cost with a large memory guest and > I can't really think of any down sides. Preserving the contents of memory across reboot seems to be common on real x86 hardware. (Even if you hard power off and on again quickly.) This behavior is useful for crash analysis. I use it as a tool for debugging gPXE both on real hardware and under QEMU for example: http://git.etherboot.org/?p=people/stefanha/gpxe.git;a=commitdiff;h=67344d8c3adca51658e1b8a80541a16e9f693a8d posix_madvise() is fine but Linux MADV_DONTNEED will prevent this sort of tool from working. Would it be possible to make this configurable in some way? I can imagine doing something like: qemu> system_reset -preserve Stefan
diff --git a/hw/pc.c b/hw/pc.c index 4f6a522..10446ba 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -45,6 +45,11 @@ #include "loader.h" #include "elf.h" #include "multiboot.h" +#include "kvm.h" + +#ifndef _WIN32 +#include <sys/mman.h> +#endif /* output Bochs bios info messages */ //#define DEBUG_BIOS @@ -63,10 +68,19 @@ #define MAX_IDE_BUS 2 +#define MAX_MEMORY_ENTRIES 10 + +typedef struct MemoryEntry { + ram_addr_t addr; + ram_addr_t size; +} MemoryEntry; + static FDCtrl *floppy_controller; static RTCState *rtc_state; static PITState *pit; static PCII440FXState *i440fx_state; +static int num_memory_entries; +static MemoryEntry memory_entries[MAX_MEMORY_ENTRIES]; #define E820_NR_ENTRIES 16 @@ -782,6 +796,27 @@ static CPUState *pc_new_cpu(const char *cpu_model) return env; } +static void add_mem_entry(ram_addr_t addr, ram_addr_t size) +{ + memory_entries[num_memory_entries].addr = addr; + memory_entries[num_memory_entries].size = size; + num_memory_entries++; +} + +static void pc_reset_ram(void *opaque) +{ + int i; + + for (i = 0; i < num_memory_entries; i++) { +#ifndef _WIN32 + if (!kvm_enabled() || kvm_has_sync_mmu()) { + madvise(qemu_get_ram_ptr(memory_entries[i].addr), + memory_entries[i].size, MADV_DONTNEED); + } +#endif + } +} + /* PC hardware initialisation */ static void pc_init1(ram_addr_t ram_size, const char *boot_device, @@ -835,6 +870,7 @@ static void pc_init1(ram_addr_t ram_size, /* allocate RAM */ ram_addr = qemu_ram_alloc(0xa0000); cpu_register_physical_memory(0, 0xa0000, ram_addr); + add_mem_entry(ram_addr, 0xa0000); /* Allocate, even though we won't register, so we don't break the * phys_ram_base + PA assumption. This range includes vga (0xa0000 - 0xc0000), @@ -845,6 +881,7 @@ static void pc_init1(ram_addr_t ram_size, cpu_register_physical_memory(0x100000, below_4g_mem_size - 0x100000, ram_addr); + add_mem_entry(ram_addr, below_4g_mem_size - 0x100000); /* above 4giga memory allocation */ if (above_4g_mem_size > 0) { @@ -855,6 +892,7 @@ static void pc_init1(ram_addr_t ram_size, cpu_register_physical_memory(0x100000000ULL, above_4g_mem_size, ram_addr); + add_mem_entry(ram_addr, above_4g_mem_size); #endif } @@ -1050,6 +1088,8 @@ static void pc_init1(ram_addr_t ram_size, pci_create_simple(pci_bus, -1, "lsi53c895a"); } } + + qemu_register_reset(pc_reset_ram, NULL); } static void pc_init_pci(ram_addr_t ram_size,
If you compare the RSS of a freshly booted guest and the same guest after a reboot, it's very likely the freshly booted guest will have an RSS that is much lower the the rebooted guest because the previous run of the guest faulted in all available memory. This patch addresses this issue by using madvise() during reset. It only resets RAM areas which means it has to be done in the machine. I've only done this for the x86 target because I'm fairly confident that this is allowed architecturally on x86 although I'm not sure this is universely true. This does not appear to have an observable cost with a large memory guest and I can't really think of any down sides. Reported-by: Karl Rister <kmr@linux.vnet.ibm.com> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com> --- hw/pc.c | 40 ++++++++++++++++++++++++++++++++++++++++ 1 files changed, 40 insertions(+), 0 deletions(-)