Message ID: 4B78E5C5.80802@lab.ntt.co.jp
State: New
On 15.02.2010, at 07:12, OHMURA Kei wrote:

> dirty-bitmap-traveling is carried out by byte size in qemu-kvm.c.
> But we think that dirty-bitmap-traveling by long size is faster than by
> byte size, especially when most of the memory is not dirty.

"We think"? I mean - yes, I think so too. But have you actually measured it?
How much improvement are we talking here?
Is it still faster when a bswap is involved?

Alex
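For context: the patch under discussion replaces qemu-kvm's byte-wise walk of
the KVM dirty bitmap with a long-wise walk. Below is a minimal sketch of that
traversal (not the patch itself; mark_dirty() is a hypothetical stand-in for
qemu's cpu_physical_memory_set_dirty() path):

/* A minimal sketch of the traversal change under discussion: scanning the
 * dirty bitmap one unsigned long at a time instead of one byte at a time.
 * mark_dirty() is a hypothetical stand-in, not qemu code. */
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>                     /* ffsl() on glibc */

static unsigned long dirty_count;

static void mark_dirty(unsigned long page_number)
{
    (void)page_number;                  /* real code would flag this page */
    dirty_count++;
}

/* One load and one zero-test now cover 64 pages on a 64-bit host, where the
 * byte-wise loop needed eight loads and eight tests for the same range. */
static void travel_by_long(const unsigned long *bitmap, size_t words)
{
    const size_t bits = 8 * sizeof(unsigned long);

    for (size_t i = 0; i < words; i++) {
        unsigned long c = bitmap[i];    /* zero is zero in any byte order */
        while (c != 0) {
            int j = ffsl(c) - 1;        /* index of the lowest set bit */
            c &= ~(1ul << j);           /* clear it */
            mark_dirty(i * bits + j);
        }
    }
}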
> "We think"? I mean - yes, I think so too. But have you actually measured it?
> How much improvement are we talking here?
> Is it still faster when a bswap is involved?

Thanks for pointing this out.
I will post the data for x86 later.
However, I don't have a test environment to check the impact of bswap.
Would you please measure the run time of the following section if possible?

start ->
qemu-kvm.c:
static int kvm_get_dirty_bitmap_cb(unsigned long start, unsigned long len,
                                   void *bitmap, void *opaque)
{
    /* warm up each function */
    kvm_get_dirty_pages_log_range(start, bitmap, start, len);
    kvm_get_dirty_pages_log_range_new(start, bitmap, start, len);

    /* measurement */
    int64_t t1, t2;
    t1 = cpu_get_real_ticks();
    kvm_get_dirty_pages_log_range(start, bitmap, start, len);
    t1 = cpu_get_real_ticks() - t1;
    t2 = cpu_get_real_ticks();
    kvm_get_dirty_pages_log_range_new(start, bitmap, start, len);
    t2 = cpu_get_real_ticks() - t2;

    printf("## %" PRId64 ", %" PRId64 "\n", t1, t2);
    fflush(stdout);

    return kvm_get_dirty_pages_log_range_new(start, bitmap, start, len);
}
end ->
On 16.02.2010, at 12:16, OHMURA Kei wrote:

>> "We think"? I mean - yes, I think so too. But have you actually measured it?
>> How much improvement are we talking here?
>> Is it still faster when a bswap is involved?
>
> Thanks for pointing this out.
> I will post the data for x86 later.
> However, I don't have a test environment to check the impact of bswap.
> Would you please measure the run time of the following section if possible?

It'd make more sense to have a real stand-alone test program, no?
I can try to write one today, but I have some really nasty important bugs to
fix first.

Alex
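Here is a sketch of the kind of stand-alone program Alex is asking for. It is
not the test code OHMURA later ran, just a self-contained approximation: it
builds a mostly clean bitmap and times both traversals. Build with gcc -O2
(add -lrt on older glibc; ffsl() needs _GNU_SOURCE there).

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>                     /* ffsl() on glibc */
#include <time.h>

#define BITMAP_BYTES (1 << 20)          /* 1 MiB of bitmap = 32 GiB of 4 KiB pages */

static volatile unsigned long sink;     /* defeats dead-code elimination */

static void travel_by_byte(const unsigned char *bm, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = bm[i];
        while (c > 0) {
            int j = ffsl(c) - 1;
            c &= ~(1u << j);
            sink += i * 8 + j;          /* stand-in for marking the page dirty */
        }
    }
}

static void travel_by_long(const unsigned long *bm, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        unsigned long c = bm[i];        /* timing only; endianness ignored here */
        while (c != 0) {
            int j = ffsl(c) - 1;
            c &= ~(1ul << j);
            sink += i * 8 * sizeof(unsigned long) + j;
        }
    }
}

static long long ns_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    unsigned char *bm = calloc(1, BITMAP_BYTES);

    /* A mostly clean bitmap: roughly one dirty bit per thousand bytes,
     * which is the case the patch targets. */
    for (size_t i = 0; i < BITMAP_BYTES; i += 997) {
        bm[i] = 0x10;
    }

    long long t = ns_now();
    travel_by_byte(bm, BITMAP_BYTES);
    printf("byte-wise: %lld ns\n", ns_now() - t);

    t = ns_now();
    travel_by_long((const unsigned long *)bm,
                   BITMAP_BYTES / sizeof(unsigned long));
    printf("long-wise: %lld ns\n", ns_now() - t);

    free(bm);
    return 0;
}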
>>> "We think"? I mean - yes, I think so too. But have you actually measured it?
>>> How much improvement are we talking here?
>>> Is it still faster when a bswap is involved?
>> Thanks for pointing this out.
>> I will post the data for x86 later.
>> However, I don't have a test environment to check the impact of bswap.
>> Would you please measure the run time of the following section if possible?
>
> It'd make more sense to have a real stand-alone test program, no?
> I can try to write one today, but I have some really nasty important bugs to
> fix first.

OK. I will prepare test code with sample data. Since I found a ppc machine
around, I will run the code and post the results for x86 and ppc.

By the way, the following data is a result for x86 measured in QEMU/KVM.
This data shows how many times the function is called (#called), the runtime
of the original function (orig.), the runtime with this patch (patch), and
the speedup ratio (ratio).

Test1: Guest OS reads a 3GB file, which is bigger than memory.
  #called  orig.(msec)  patch(msec)  ratio
      108          1.1          0.1    7.6
      102          1.0          0.1    6.8
      132          1.6          0.2    7.1

Test2: Guest OS reads/writes a 3GB file, which is bigger than memory.
  #called  orig.(msec)  patch(msec)  ratio
     2394           33          7.7    4.3
     2100           29          7.1    4.1
     2832           40          9.9    4.0
On 17.02.2010, at 10:42, OHMURA Kei wrote:

>>>> "We think"? I mean - yes, I think so too. But have you actually measured it?
>>>> How much improvement are we talking here?
>>>> Is it still faster when a bswap is involved?
>>> Thanks for pointing this out.
>>> I will post the data for x86 later.
>>> However, I don't have a test environment to check the impact of bswap.
>>> Would you please measure the run time of the following section if possible?
>> It'd make more sense to have a real stand-alone test program, no?
>> I can try to write one today, but I have some really nasty important bugs to
>> fix first.
>
> OK. I will prepare test code with sample data. Since I found a ppc machine
> around, I will run the code and post the results for x86 and ppc.
>
> By the way, the following data is a result for x86 measured in QEMU/KVM.
> This data shows how many times the function is called (#called), the runtime
> of the original function (orig.), the runtime with this patch (patch), and
> the speedup ratio (ratio).

That does indeed look promising!

Thanks for doing this micro-benchmark. I just want to be 100% sure that it
doesn't affect performance for big endian badly.

Alex
On 02/17/2010 11:42 AM, OHMURA Kei wrote:
>>>> "We think"? I mean - yes, I think so too. But have you actually
>>>> measured it?
>>>> How much improvement are we talking here?
>>>> Is it still faster when a bswap is involved?
>>> Thanks for pointing this out.
>>> I will post the data for x86 later.
>>> However, I don't have a test environment to check the impact of bswap.
>>> Would you please measure the run time of the following section
>>> if possible?
>>
>> It'd make more sense to have a real stand-alone test program, no?
>> I can try to write one today, but I have some really nasty important
>> bugs to fix first.
>
> OK. I will prepare test code with sample data. Since I found a ppc
> machine around, I will run the code and post the results for
> x86 and ppc.
>

I've applied the patch - I think the x86 results justify it, and I'll be very
surprised if ppc doesn't show a similar gain. Skipping 7 memory accesses and
7 tests must be a win.
On 17.02.2010, at 10:47, Avi Kivity wrote:

> On 02/17/2010 11:42 AM, OHMURA Kei wrote:
>>>>> "We think"? I mean - yes, I think so too. But have you actually measured it?
>>>>> How much improvement are we talking here?
>>>>> Is it still faster when a bswap is involved?
>>>> Thanks for pointing this out.
>>>> I will post the data for x86 later.
>>>> However, I don't have a test environment to check the impact of bswap.
>>>> Would you please measure the run time of the following section if possible?
>>>
>>> It'd make more sense to have a real stand-alone test program, no?
>>> I can try to write one today, but I have some really nasty important bugs to
>>> fix first.
>>
>> OK. I will prepare test code with sample data. Since I found a ppc machine
>> around, I will run the code and post the results for x86 and ppc.
>>
>
> I've applied the patch - I think the x86 results justify it, and I'll be very
> surprised if ppc doesn't show a similar gain. Skipping 7 memory accesses and
> 7 tests must be a win.

Sounds good to me. I don't assume bswap to be horribly slow either. Just want
to be sure.

Alex
>>>>> "We think"? I mean - yes, I think so too. But have you actually measured it?
>>>>> How much improvement are we talking here?
>>>>> Is it still faster when a bswap is involved?
>>>> Thanks for pointing this out.
>>>> I will post the data for x86 later.
>>>> However, I don't have a test environment to check the impact of bswap.
>>>> Would you please measure the run time of the following section if possible?
>>> It'd make more sense to have a real stand-alone test program, no?
>>> I can try to write one today, but I have some really nasty important bugs to
>>> fix first.
>>
>> OK. I will prepare test code with sample data. Since I found a ppc machine
>> around, I will run the code and post the results for x86 and ppc.
>
> That does indeed look promising!
>
> Thanks for doing this micro-benchmark. I just want to be 100% sure that it
> doesn't affect performance for big endian badly.

I measured the runtime of the test code with sample data. My test environment
and results are described below.

x86 Test Environment:
  CPU: 4x Intel Xeon Quad Core 2.66GHz
  Mem size: 6GB

ppc Test Environment:
  CPU: 2x Dual Core PPC970MP
  Mem size: 2GB

The sample data of the dirty bitmap was produced by QEMU/KVM while the guest
OS was live migrating. To measure the runtime I copied cpu_get_real_ticks()
of QEMU to my test program.

Experimental results:

Test1: Guest OS reads a 3GB file, which is bigger than memory.
        orig.(msec)  patch(msec)  ratio
  x86           0.3          0.1    6.4
  ppc           7.9          2.7    3.0

Test2: Guest OS reads/writes a 3GB file, which is bigger than memory.
        orig.(msec)  patch(msec)  ratio
  x86          12.0          3.2    3.7
  ppc         251.1          123    2.0

I also measured the runtime of bswap itself on ppc, and found it was only
0.3% ~ 0.7% of the runtime described above.
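The bswap figure is easy to sanity-check in isolation. Below is a sketch of a
loop that bounds the per-word swap cost, using GCC's __builtin_bswap64 as a
stand-in for what le64_to_cpu does on a 64-bit big-endian host; this is
illustrative only, not OHMURA's measurement code.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    enum { WORDS = 1 << 17 };           /* same footprint as a 1 MiB bitmap */
    uint64_t *bm = calloc(WORDS, sizeof(*bm));
    volatile uint64_t sink = 0;         /* keeps the loop from being elided */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < WORDS; i++) {
        sink += __builtin_bswap64(bm[i]);   /* the swap le64_to_cpu would do */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("bswap pass over %d words: %lld ns\n", (int)WORDS,
           (t1.tv_sec - t0.tv_sec) * 1000000000LL
           + (long long)(t1.tv_nsec - t0.tv_nsec));
    free(bm);
    return 0;
}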
On 18.02.2010, at 06:57, OHMURA Kei wrote: >>>>>> "We think"? I mean - yes, I think so too. But have you actually measured it? >>>>>> How much improvement are we talking here? >>>>>> Is it still faster when a bswap is involved? >>>>> Thanks for pointing out. >>>>> I will post the data for x86 later. >>>>> However, I don't have a test environment to check the impact of bswap. >>>>> Would you please measure the run time between the following section if possible? >>>> It'd make more sense to have a real stand alone test program, no? >>>> I can try to write one today, but I have some really nasty important bugs to fix first. >>> >>> OK. I will prepare a test code with sample data. Since I found a ppc machine around, I will run the code and post the results of >>> x86 and ppc. >>> >>> >>> By the way, the following data is a result of x86 measured in QEMU/KVM. This data shows, how many times the function is called (#called), runtime of original function(orig.), runtime of this patch(patch), speedup ratio (ratio). >> That does indeed look promising! >> Thanks for doing this micro-benchmark. I just want to be 100% sure that it doesn't affect performance for big endian badly. > > > I measured runtime of the test code with sample data. My test environment and results are described below. > > x86 Test Environment: > CPU: 4x Intel Xeon Quad Core 2.66GHz > Mem size: 6GB > > ppc Test Environment: > CPU: 2x Dual Core PPC970MP > Mem size: 2GB > > The sample data of dirty bitmap was produced by QEMU/KVM while the guest OS > was live migrating. To measure the runtime I copied cpu_get_real_ticks() of > QEMU to my test program. > > > Experimental results: > Test1: Guest OS read 3GB file, which is bigger than memory. orig.(msec) patch(msec) ratio > x86 0.3 0.1 6.4 ppc 7.9 2.7 3.0 > Test2: Guest OS read/write 3GB file, which is bigger than memory. orig.(msec) patch(msec) ratio > x86 12.0 3.2 3.7 ppc 251.1 123 2.0 > > I also measured the runtime of bswap itself on ppc, and I found it was only just 0.3% ~ 0.7 % of the runtime described above. Awesome! Thank you so much for giving actual data to make me feel comfortable with it :-). Alex
diff --git a/bswap.h b/bswap.h
index 4558704..1f87e6d 100644
--- a/bswap.h
+++ b/bswap.h
@@ -205,8 +205,10 @@ static inline void cpu_to_be32wu(uint32_t *p, uint32_t v)
 
 #ifdef HOST_WORDS_BIGENDIAN
 #define cpu_to_32wu cpu_to_be32wu
+#define leul_to_cpu(v) le ## HOST_LONG_BITS ## _to_cpu(v)
 #else
 #define cpu_to_32wu cpu_to_le32wu
+#define leul_to_cpu(v) (v)
 #endif
 
 #undef le_bswap
diff --git a/qemu-kvm.c b/qemu-kvm.c
index a305907..6952aa5 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -2434,31 +2434,32 @@ int kvm_physical_memory_set_dirty_tracking(int enable)
 
 /* get kvm's dirty pages bitmap and update qemu's */
 static int kvm_get_dirty_pages_log_range(unsigned long start_addr,
-                                         unsigned char *bitmap,
+                                         unsigned long *bitmap,
                                          unsigned long offset,
                                          unsigned long mem_size)
 {
-    unsigned int i, j, n = 0;
-    unsigned char c;
-    unsigned long page_number, addr, addr1;
+    unsigned int i, j;
+    unsigned long page_number, addr, addr1, c;
     ram_addr_t ram_addr;
-    unsigned int len = ((mem_size / TARGET_PAGE_SIZE) + 7) / 8;
+    unsigned int len = ((mem_size / TARGET_PAGE_SIZE) + HOST_LONG_BITS - 1) /
+        HOST_LONG_BITS;
 
     /*
      * bitmap-traveling is faster than memory-traveling (for addr...)
      * especially when most of the memory is not dirty.
      */
     for (i = 0; i < len; i++) {
-        c = bitmap[i];
-        while (c > 0) {
-            j = ffsl(c) - 1;
-            c &= ~(1u << j);
-            page_number = i * 8 + j;
-            addr1 = page_number * TARGET_PAGE_SIZE;
-            addr = offset + addr1;
-            ram_addr = cpu_get_physical_page_desc(addr);
-            cpu_physical_memory_set_dirty(ram_addr);
-            n++;
+        if (bitmap[i] != 0) {
+            c = leul_to_cpu(bitmap[i]);
+            do {
+                j = ffsl(c) - 1;
+                c &= ~(1ul << j);
+                page_number = i * HOST_LONG_BITS + j;
+                addr1 = page_number * TARGET_PAGE_SIZE;
+                addr = offset + addr1;
+                ram_addr = cpu_get_physical_page_desc(addr);
+                cpu_physical_memory_set_dirty(ram_addr);
+            } while (c != 0);
         }
     }
     return 0;
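A note on the bswap.h hunk: the bitmap KVM hands back is treated here as
little-endian long data, so a big-endian host must byte-swap each word before
ffsl() walks it, while a little-endian host gets an identity macro. The sketch
below spells out the intended expansions and one preprocessor caveat (LEUL_GLUE
is a hypothetical name used for illustration; qemu's glue() macro serves the
same purpose):

/* How leul_to_cpu() is meant to resolve:
 *
 *   big-endian host, HOST_LONG_BITS == 64:  leul_to_cpu(v) -> le64_to_cpu(v)
 *   big-endian host, HOST_LONG_BITS == 32:  leul_to_cpu(v) -> le32_to_cpu(v)
 *   little-endian host:                     leul_to_cpu(v) -> (v), i.e. free
 *
 * The loop deliberately tests bitmap[i] against zero before converting:
 * zero is zero in any byte order, so the swap is only paid for words that
 * actually contain dirty bits.
 *
 * One caveat: operands of ## are not macro-expanded, so as written
 * "le ## HOST_LONG_BITS ## _to_cpu" pastes the literal token
 * leHOST_LONG_BITS_to_cpu.  A double-expansion helper forces
 * HOST_LONG_BITS to expand first: */
#define LEUL_GLUE_(a, b) a ## b
#define LEUL_GLUE(a, b)  LEUL_GLUE_(a, b)
#define leul_to_cpu(v)   LEUL_GLUE(LEUL_GLUE(le, HOST_LONG_BITS), _to_cpu)(v)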
dirty-bitmap-traveling is carried out by byte size in qemu-kvm.c. But we
think that dirty-bitmap-traveling by long size is faster than by byte size,
especially when most of the memory is not dirty.

Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
 bswap.h    |  2 ++
 qemu-kvm.c | 31 ++++++++++++++++---------------
 2 files changed, 18 insertions(+), 15 deletions(-)