
[07/12] migration: hold the lock only if it is really needed

Message ID 20180604095520.8563-8-xiaoguangrong@tencent.com
State New
Series migration: improve multithreads for compression and decompression

Commit Message

Xiao Guangrong June 4, 2018, 9:55 a.m. UTC
From: Xiao Guangrong <xiaoguangrong@tencent.com>

Try to hold src_page_req_mutex only if the queue is not
empty

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 include/qemu/queue.h | 1 +
 migration/ram.c      | 4 ++++
 2 files changed, 5 insertions(+)

Comments

Peter Xu June 19, 2018, 7:36 a.m. UTC | #1
On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Try to hold src_page_req_mutex only if the queue is not
> empty

Pure question: how much this patch would help?  Basically if you are
running compression tests then I think it means you are with precopy
(since postcopy cannot work with compression yet), then here the lock
has no contention at all.

Regards,
Xiao Guangrong June 28, 2018, 9:33 a.m. UTC | #2
On 06/19/2018 03:36 PM, Peter Xu wrote:
> On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> Try to hold src_page_req_mutex only if the queue is not
>> empty
> 
> Pure question: how much this patch would help?  Basically if you are
> running compression tests then I think it means you are with precopy
> (since postcopy cannot work with compression yet), then here the lock
> has no contention at all.

Yes, you are right; however, we can observe it among the top functions
(after reverting this patch):

Samples: 29K of event 'cycles', Event count (approx.): 22263412260
+   7.99%  kqemu  qemu-system-x86_64       [.] ram_bytes_total
+   6.95%  kqemu  [kernel.kallsyms]        [k] copy_user_enhanced_fast_string
+   6.23%  kqemu  qemu-system-x86_64       [.] qemu_put_qemu_file
+   6.20%  kqemu  qemu-system-x86_64       [.] qemu_event_set
+   5.80%  kqemu  qemu-system-x86_64       [.] __ring_put
+   4.82%  kqemu  qemu-system-x86_64       [.] compress_thread_data_done
+   4.11%  kqemu  qemu-system-x86_64       [.] ring_is_full
+   3.07%  kqemu  qemu-system-x86_64       [.] threads_submit_request_prepare
+   2.83%  kqemu  qemu-system-x86_64       [.] ring_mp_get
+   2.71%  kqemu  qemu-system-x86_64       [.] __ring_is_full
+   2.46%  kqemu  qemu-system-x86_64       [.] buffer_zero_sse2
+   2.40%  kqemu  qemu-system-x86_64       [.] add_to_iovec
+   2.21%  kqemu  qemu-system-x86_64       [.] ring_get
+   1.96%  kqemu  [kernel.kallsyms]        [k] __lock_acquire
+   1.90%  kqemu  libc-2.12.so             [.] memcpy
+   1.55%  kqemu  qemu-system-x86_64       [.] ring_len
+   1.12%  kqemu  libpthread-2.12.so       [.] pthread_mutex_unlock
+   1.11%  kqemu  qemu-system-x86_64       [.] ram_find_and_save_block
+   1.07%  kqemu  qemu-system-x86_64       [.] ram_save_host_page
+   1.04%  kqemu  qemu-system-x86_64       [.] qemu_put_buffer
+   0.97%  kqemu  qemu-system-x86_64       [.] compress_page_with_multi_thread
+   0.96%  kqemu  qemu-system-x86_64       [.] ram_save_target_page
+   0.93%  kqemu  libpthread-2.12.so       [.] pthread_mutex_lock

I guess its atomic operations cost CPU resources, and check-before-lock is
a common technique; I think it shouldn't have any side effects, right? :)
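
For reference, here is a minimal sketch of that check-before-lock pattern in
plain C11 atomics and pthreads; the queue type and function names are
illustrative only, not the actual QEMU migration code:

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct request {
    struct request *next;
    /* payload omitted */
};

struct request_queue {
    _Atomic(struct request *) head;   /* producers publish under the lock */
    pthread_mutex_t lock;
};

/*
 * Fast path: peek at the head without taking the mutex.  The slow path
 * re-checks under the lock, so a request queued concurrently is never
 * lost -- at worst it is picked up on the next call.
 */
static struct request *queue_try_pop(struct request_queue *q)
{
    struct request *req;

    if (atomic_load_explicit(&q->head, memory_order_acquire) == NULL) {
        return NULL;                  /* empty: no lock taken at all */
    }

    pthread_mutex_lock(&q->lock);
    req = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (req != NULL) {
        atomic_store_explicit(&q->head, req->next, memory_order_relaxed);
    }
    pthread_mutex_unlock(&q->lock);

    return req;
}

The patch at the bottom of this page applies the same idea to
unqueue_page() via the new QSIMPLEQ_EMPTY_ATOMIC() check ahead of
qemu_mutex_lock(): the queue is re-checked under the lock, so a page
queued concurrently is simply handled on a later poll.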
Dr. David Alan Gilbert June 29, 2018, 11:22 a.m. UTC | #3
* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> 
> 
> On 06/19/2018 03:36 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > Try to hold src_page_req_mutex only if the queue is not
> > > empty
> > 
> > Pure question: how much this patch would help?  Basically if you are
> > running compression tests then I think it means you are with precopy
> > (since postcopy cannot work with compression yet), then here the lock
> > has no contention at all.
> 
> Yes, you are right, however we can observe it is in the top functions
> (after revert this patch):

Can you show the matching trace with the patch in?

Dave

> Samples: 29K of event 'cycles', Event count (approx.): 22263412260
> +   7.99%  kqemu  qemu-system-x86_64       [.] ram_bytes_total
> +   6.95%  kqemu  [kernel.kallsyms]        [k] copy_user_enhanced_fast_string
> +   6.23%  kqemu  qemu-system-x86_64       [.] qemu_put_qemu_file
> +   6.20%  kqemu  qemu-system-x86_64       [.] qemu_event_set
> +   5.80%  kqemu  qemu-system-x86_64       [.] __ring_put
> +   4.82%  kqemu  qemu-system-x86_64       [.] compress_thread_data_done
> +   4.11%  kqemu  qemu-system-x86_64       [.] ring_is_full
> +   3.07%  kqemu  qemu-system-x86_64       [.] threads_submit_request_prepare
> +   2.83%  kqemu  qemu-system-x86_64       [.] ring_mp_get
> +   2.71%  kqemu  qemu-system-x86_64       [.] __ring_is_full
> +   2.46%  kqemu  qemu-system-x86_64       [.] buffer_zero_sse2
> +   2.40%  kqemu  qemu-system-x86_64       [.] add_to_iovec
> +   2.21%  kqemu  qemu-system-x86_64       [.] ring_get
> +   1.96%  kqemu  [kernel.kallsyms]        [k] __lock_acquire
> +   1.90%  kqemu  libc-2.12.so             [.] memcpy
> +   1.55%  kqemu  qemu-system-x86_64       [.] ring_len
> +   1.12%  kqemu  libpthread-2.12.so       [.] pthread_mutex_unlock
> +   1.11%  kqemu  qemu-system-x86_64       [.] ram_find_and_save_block
> +   1.07%  kqemu  qemu-system-x86_64       [.] ram_save_host_page
> +   1.04%  kqemu  qemu-system-x86_64       [.] qemu_put_buffer
> +   0.97%  kqemu  qemu-system-x86_64       [.] compress_page_with_multi_thread
> +   0.96%  kqemu  qemu-system-x86_64       [.] ram_save_target_page
> +   0.93%  kqemu  libpthread-2.12.so       [.] pthread_mutex_lock
> 
> I guess its atomic operations cost CPU resource and check-before-lock is
> a common tech, i think it shouldn't have side effect, right? :)
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Xiao Guangrong July 3, 2018, 6:27 a.m. UTC | #4
On 06/29/2018 07:22 PM, Dr. David Alan Gilbert wrote:
> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>
>>
>> On 06/19/2018 03:36 PM, Peter Xu wrote:
>>> On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> Try to hold src_page_req_mutex only if the queue is not
>>>> empty
>>>
>>> Pure question: how much this patch would help?  Basically if you are
>>> running compression tests then I think it means you are with precopy
>>> (since postcopy cannot work with compression yet), then here the lock
>>> has no contention at all.
>>
>> Yes, you are right, however we can observe it is in the top functions
>> (after revert this patch):
> 
> Can you show the matching trace with the patch in?

Sure, here it is:

+   8.38%  kqemu  [kernel.kallsyms]        [k] copy_user_enhanced_fast_string
+   8.03%  kqemu  qemu-system-x86_64       [.] ram_bytes_total
+   6.62%  kqemu  qemu-system-x86_64       [.] qemu_event_set
+   6.02%  kqemu  qemu-system-x86_64       [.] qemu_put_qemu_file
+   5.81%  kqemu  qemu-system-x86_64       [.] __ring_put
+   5.04%  kqemu  qemu-system-x86_64       [.] compress_thread_data_done
+   4.48%  kqemu  qemu-system-x86_64       [.] ring_is_full
+   4.44%  kqemu  qemu-system-x86_64       [.] ring_mp_get
+   3.39%  kqemu  qemu-system-x86_64       [.] __ring_is_full
+   2.61%  kqemu  qemu-system-x86_64       [.] add_to_iovec
+   2.48%  kqemu  qemu-system-x86_64       [.] threads_submit_request_prepare
+   2.08%  kqemu  libc-2.12.so             [.] memcpy
+   2.07%  kqemu  qemu-system-x86_64       [.] ring_len
+   1.91%  kqemu  [kernel.kallsyms]        [k] __lock_acquire
+   1.60%  kqemu  qemu-system-x86_64       [.] buffer_zero_sse2
+   1.16%  kqemu  qemu-system-x86_64       [.] ram_find_and_save_block
+   1.14%  kqemu  qemu-system-x86_64       [.] ram_save_target_page
+   1.12%  kqemu  qemu-system-x86_64       [.] compress_page_with_multi_thread
+   1.09%  kqemu  qemu-system-x86_64       [.] ram_save_host_page
+   1.07%  kqemu  qemu-system-x86_64       [.] test_and_clear_bit
+   1.07%  kqemu  qemu-system-x86_64       [.] qemu_put_buffer
+   1.03%  kqemu  qemu-system-x86_64       [.] qemu_put_byte
+   0.80%  kqemu  qemu-system-x86_64       [.] threads_submit_request_commit
+   0.74%  kqemu  qemu-system-x86_64       [.] migration_bitmap_clear_dirty
+   0.70%  kqemu  qemu-system-x86_64       [.] control_save_page
+   0.69%  kqemu  qemu-system-x86_64       [.] test_bit
+   0.69%  kqemu  qemu-system-x86_64       [.] ram_save_iterate
+   0.63%  kqemu  qemu-system-x86_64       [.] migration_bitmap_find_dirty
+   0.63%  kqemu  qemu-system-x86_64       [.] ram_control_save_page
+   0.62%  kqemu  qemu-system-x86_64       [.] rcu_read_lock
+   0.56%  kqemu  qemu-system-x86_64       [.] qemu_file_get_error
+   0.55%  kqemu  [kernel.kallsyms]        [k] lock_acquire
+   0.55%  kqemu  qemu-system-x86_64       [.] find_dirty_block
+   0.54%  kqemu  qemu-system-x86_64       [.] ring_index
+   0.53%  kqemu  qemu-system-x86_64       [.] ring_put
+   0.51%  kqemu  qemu-system-x86_64       [.] unqueue_page
+   0.50%  kqemu  qemu-system-x86_64       [.] migrate_use_compression
+   0.48%  kqemu  qemu-system-x86_64       [.] get_queued_page
+   0.46%  kqemu  qemu-system-x86_64       [.] ring_get
+   0.46%  kqemu  [i40e]                   [k] i40e_clean_tx_irq
+   0.45%  kqemu  [kernel.kallsyms]        [k] lock_release
+   0.44%  kqemu  [kernel.kallsyms]        [k] native_sched_clock
+   0.38%  kqemu  qemu-system-x86_64       [.] migrate_get_current
+   0.38%  kqemu  [kernel.kallsyms]        [k] find_held_lock
+   0.34%  kqemu  [kernel.kallsyms]        [k] __lock_release
+   0.34%  kqemu  qemu-system-x86_64       [.] qemu_ram_pagesize
+   0.29%  kqemu  [kernel.kallsyms]        [k] lock_is_held_type
+   0.27%  kqemu  [kernel.kallsyms]        [k] update_load_avg
+   0.27%  kqemu  qemu-system-x86_64       [.] save_page_use_compression
+   0.24%  kqemu  qemu-system-x86_64       [.] qemu_file_rate_limit
+   0.23%  kqemu  [kernel.kallsyms]        [k] tcp_sendmsg
+   0.23%  kqemu  [kernel.kallsyms]        [k] match_held_lock
+   0.22%  kqemu  [kernel.kallsyms]        [k] do_raw_spin_trylock
+   0.22%  kqemu  [kernel.kallsyms]        [k] cyc2ns_read_begin
Peter Xu July 11, 2018, 8:21 a.m. UTC | #5
On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
> 
> 
> On 06/19/2018 03:36 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > Try to hold src_page_req_mutex only if the queue is not
> > > empty
> > 
> > Pure question: how much this patch would help?  Basically if you are
> > running compression tests then I think it means you are with precopy
> > (since postcopy cannot work with compression yet), then here the lock
> > has no contention at all.
> 
> Yes, you are right, however we can observe it is in the top functions
> (after revert this patch):
> 
> [identical perf trace snipped; see the first reply above]

(sorry to respond late; I was busy with other stuff for the
 release...)

I tried to find anything related to unqueue_page() but failed.  Did I
miss anything obvious there?

> 
> I guess its atomic operations cost CPU resource and check-before-lock is
> a common tech, i think it shouldn't have side effect, right? :)

Yeah it makes sense to me. :)

Thanks,
Xiao Guangrong July 12, 2018, 7:47 a.m. UTC | #6
On 07/11/2018 04:21 PM, Peter Xu wrote:
> On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
>>
>>
>> On 06/19/2018 03:36 PM, Peter Xu wrote:
>>> On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> Try to hold src_page_req_mutex only if the queue is not
>>>> empty
>>>
>>> Pure question: how much this patch would help?  Basically if you are
>>> running compression tests then I think it means you are with precopy
>>> (since postcopy cannot work with compression yet), then here the lock
>>> has no contention at all.
>>
>> Yes, you are right, however we can observe it is in the top functions
>> (after revert this patch):
>>
>> [identical perf trace snipped; see the first reply above]
> 
> (sorry to respond late; I was busy with other stuff for the
>   release...)
> 

You're welcome. :)

> I am trying to find out anything related to unqueue_page() but I
> failed.  Did I miss anything obvious there?
> 

Indeed, unqueue_page() is not listed here; I think the function
itself is light enough (a check, then an immediate return), so it
did not leave a trace here.

This perf data was collected after reverting this patch, i.e., it is
based on the lockless multithread model, so unqueue_page() is the
only place using a mutex in the main thread.

And you can see the overhead of the mutex was gone after applying
this patch, in the trace I posted in my reply to Dave.
Peter Xu July 12, 2018, 8:26 a.m. UTC | #7
On Thu, Jul 12, 2018 at 03:47:57PM +0800, Xiao Guangrong wrote:
> 
> 
> On 07/11/2018 04:21 PM, Peter Xu wrote:
> > On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
> > > 
> > > 
> > > On 06/19/2018 03:36 PM, Peter Xu wrote:
> > > > On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> > > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > > 
> > > > > Try to hold src_page_req_mutex only if the queue is not
> > > > > empty
> > > > 
> > > > Pure question: how much this patch would help?  Basically if you are
> > > > running compression tests then I think it means you are with precopy
> > > > (since postcopy cannot work with compression yet), then here the lock
> > > > has no contention at all.
> > > 
> > > Yes, you are right, however we can observe it is in the top functions
> > > (after revert this patch):
> > > 
> > > [identical perf trace snipped; see the first reply above]
> > 
> > (sorry to respond late; I was busy with other stuff for the
> >   release...)
> > 
> 
> You're welcome. :)
> 
> > I am trying to find out anything related to unqueue_page() but I
> > failed.  Did I miss anything obvious there?
> > 
> 
> unqueue_page() was not listed here indeed, i think the function
> itself is light enough (a check then directly return) so it
> did not leave a trace here.
> 
> This perf data was got after reverting this patch, i.e, it's
> based on the lockless multithread model, then unqueue_page() is
> the only place using mutex in the main thread.
> 
> And you can see the overload of mutext was gone after applying
> this patch in the mail i replied to Dave.

I see.  It's not a big portion of the CPU usage, though of course I
have no reason to object to this change either.

Actually, what interests me more is why ram_bytes_total() is such a
hot spot.  AFAIU it's only called once per ram_find_and_save_block()
call, and it should be mostly constant if we don't plug/unplug
memory.  Not sure whether that means it's a better spot to work on.

Thanks,
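
As an aside, a rough sketch of the kind of caching that observation points
at -- recompute the total only when the RAM block list changes.  None of
the names below exist in QEMU; this only illustrates the idea, not a
proposed patch:

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical cache: walk the RAM block list once and reuse the result
 * until something invalidates it (e.g. memory hotplug/unplug).
 * Single-threaded use is assumed for this sketch.
 */
static uint64_t ram_total_cached;
static bool ram_total_valid;

/* Call from wherever RAM blocks are added or removed. */
static void ram_total_invalidate(void)
{
    ram_total_valid = false;
}

static uint64_t ram_total_get(uint64_t (*walk_all_blocks)(void))
{
    if (!ram_total_valid) {
        ram_total_cached = walk_all_blocks();   /* the expensive walk */
        ram_total_valid = true;
    }
    return ram_total_cached;
}

Whether that is worthwhile of course depends on confirming that the block
list walk, rather than something else, is what perf attributes to
ram_bytes_total().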
Dr. David Alan Gilbert July 13, 2018, 5:44 p.m. UTC | #8
* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> 
> 
> On 07/11/2018 04:21 PM, Peter Xu wrote:
> > On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
> > > 
> > > 
> > > On 06/19/2018 03:36 PM, Peter Xu wrote:
> > > > On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> > > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > > 
> > > > > Try to hold src_page_req_mutex only if the queue is not
> > > > > empty
> > > > 
> > > > Pure question: how much this patch would help?  Basically if you are
> > > > running compression tests then I think it means you are with precopy
> > > > (since postcopy cannot work with compression yet), then here the lock
> > > > has no contention at all.
> > > 
> > > Yes, you are right, however we can observe it is in the top functions
> > > (after revert this patch):
> > > 
> > > [identical perf trace snipped; see the first reply above]
> > 
> > (sorry to respond late; I was busy with other stuff for the
> >   release...)
> > 
> 
> You're welcome. :)
> 
> > I am trying to find out anything related to unqueue_page() but I
> > failed.  Did I miss anything obvious there?
> > 
> 
> unqueue_page() was not listed here indeed, i think the function
> itself is light enough (a check then directly return) so it
> did not leave a trace here.
> 
> This perf data was got after reverting this patch, i.e, it's
> based on the lockless multithread model, then unqueue_page() is
> the only place using mutex in the main thread.
> 
> And you can see the overload of mutext was gone after applying
> this patch in the mail i replied to Dave.

I got around to playing with this patch and using 'perf top'
to see what was going on.
What I noticed was that without this patch pthread_mutex_unlock and
qemu_mutex_lock_impl were both noticeable; with the patch they'd
pretty much vanished.  So I think it's worth it.

I couldn't honestly see a difference in total CPU usage or
bandwidth; but the migration code is so spiky in usage that
it's difficult to measure anyway.

So,


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Xiao Guangrong July 18, 2018, 8:56 a.m. UTC | #9
On 07/12/2018 04:26 PM, Peter Xu wrote:
> On Thu, Jul 12, 2018 at 03:47:57PM +0800, Xiao Guangrong wrote:
>>
>>
>> On 07/11/2018 04:21 PM, Peter Xu wrote:
>>> On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
>>>>
>>>>
>>>> On 06/19/2018 03:36 PM, Peter Xu wrote:
>>>>> On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
>>>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>>>
>>>>>> Try to hold src_page_req_mutex only if the queue is not
>>>>>> empty
>>>>>
>>>>> Pure question: how much this patch would help?  Basically if you are
>>>>> running compression tests then I think it means you are with precopy
>>>>> (since postcopy cannot work with compression yet), then here the lock
>>>>> has no contention at all.
>>>>
>>>> Yes, you are right, however we can observe it is in the top functions
>>>> (after revert this patch):
>>>>
>>>> [identical perf trace snipped; see the first reply above]
>>>
>>> (sorry to respond late; I was busy with other stuff for the
>>>    release...)
>>>
>>
>> You're welcome. :)
>>
>>> I am trying to find out anything related to unqueue_page() but I
>>> failed.  Did I miss anything obvious there?
>>>
>>
>> unqueue_page() was not listed here indeed, i think the function
>> itself is light enough (a check then directly return) so it
>> did not leave a trace here.
>>
>> This perf data was got after reverting this patch, i.e, it's
>> based on the lockless multithread model, then unqueue_page() is
>> the only place using mutex in the main thread.
>>
>> And you can see the overload of mutext was gone after applying
>> this patch in the mail i replied to Dave.
> 
> I see.  It's not a big portion of CPU resource, though of course I
> don't have reason to object to this change as well.
> 
> Actually what interested me more is why ram_bytes_total() is such a
> hot spot.  AFAIU it's only called in ram_find_and_save_block() per
> call, and it should be mostly a constant if we don't plug/unplug
> memories.  Not sure that means that's a better spot to work on.
> 

I noticed it too. That could be another piece of work for us to take on. :)
Peter Xu July 18, 2018, 10:18 a.m. UTC | #10
On Wed, Jul 18, 2018 at 04:56:13PM +0800, Xiao Guangrong wrote:
> 
> 
> On 07/12/2018 04:26 PM, Peter Xu wrote:
> > On Thu, Jul 12, 2018 at 03:47:57PM +0800, Xiao Guangrong wrote:
> > > 
> > > 
> > > On 07/11/2018 04:21 PM, Peter Xu wrote:
> > > > On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
> > > > > 
> > > > > 
> > > > > On 06/19/2018 03:36 PM, Peter Xu wrote:
> > > > > > On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> > > > > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > > > > 
> > > > > > > Try to hold src_page_req_mutex only if the queue is not
> > > > > > > empty
> > > > > > 
> > > > > > Pure question: how much this patch would help?  Basically if you are
> > > > > > running compression tests then I think it means you are with precopy
> > > > > > (since postcopy cannot work with compression yet), then here the lock
> > > > > > has no contention at all.
> > > > > 
> > > > > Yes, you are right, however we can observe it is in the top functions
> > > > > (after revert this patch):
> > > > > 
> > > > > [identical perf trace snipped; see the first reply above]
> > > > 
> > > > (sorry to respond late; I was busy with other stuff for the
> > > >    release...)
> > > > 
> > > 
> > > You're welcome. :)
> > > 
> > > > I am trying to find out anything related to unqueue_page() but I
> > > > failed.  Did I miss anything obvious there?
> > > > 
> > > 
> > > unqueue_page() was not listed here indeed, i think the function
> > > itself is light enough (a check then directly return) so it
> > > did not leave a trace here.
> > > 
> > > This perf data was got after reverting this patch, i.e, it's
> > > based on the lockless multithread model, then unqueue_page() is
> > > the only place using mutex in the main thread.
> > > 
> > > And you can see the overload of mutext was gone after applying
> > > this patch in the mail i replied to Dave.
> > 
> > I see.  It's not a big portion of CPU resource, though of course I
> > don't have reason to object to this change as well.
> > 
> > Actually what interested me more is why ram_bytes_total() is such a
> > hot spot.  AFAIU it's only called in ram_find_and_save_block() per
> > call, and it should be mostly a constant if we don't plug/unplug
> > memories.  Not sure that means that's a better spot to work on.
> > 
> 
> I noticed it too. That could be another work we will work on. :)

Yeah I'm looking forward to that. :)

Btw, please feel free to post the common code separately in your next
post if the series grows even bigger (it may help even for
no-compression migrations); I would bet it'll have a better chance
of being reviewed and merged faster (though possibly after QEMU
3.0).

Regards,

Patch

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index 59fd1203a1..ac418efc43 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -341,6 +341,7 @@  struct {                                                                \
 /*
  * Simple queue access methods.
  */
+#define QSIMPLEQ_EMPTY_ATOMIC(head) (atomic_read(&((head)->sqh_first)) == NULL)
 #define QSIMPLEQ_EMPTY(head)        ((head)->sqh_first == NULL)
 #define QSIMPLEQ_FIRST(head)        ((head)->sqh_first)
 #define QSIMPLEQ_NEXT(elm, field)   ((elm)->field.sqe_next)
diff --git a/migration/ram.c b/migration/ram.c
index 15b20d3f70..f9a8646520 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1550,6 +1550,10 @@  static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
 {
     RAMBlock *block = NULL;
 
+    if (QSIMPLEQ_EMPTY_ATOMIC(&rs->src_page_requests)) {
+        return NULL;
+    }
+
     qemu_mutex_lock(&rs->src_page_req_mutex);
     if (!QSIMPLEQ_EMPTY(&rs->src_page_requests)) {
         struct RAMSrcPageRequest *entry =