Patchwork qcow2 performance plan

login
register
mail settings
Submitter Anthony Liguori
Date Sept. 14, 2010, 4:16 p.m.
Message ID <4C8F9FDE.8050004@codemonkey.ws>
Download mbox | patch
Permalink /patch/64732/
State New
Headers show

Comments

Anthony Liguori - Sept. 14, 2010, 4:16 p.m.
On 09/14/2010 11:03 AM, Stefan Hajnoczi wrote:
> On Tue, Sep 14, 2010 at 4:47 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>    
>> Am 14.09.2010 17:20, schrieb Anthony Liguori:
>>      
>>> On 09/14/2010 10:11 AM, Kevin Wolf wrote:
>>>        
>>>> Am 14.09.2010 15:43, schrieb Anthony Liguori:
>>>>
>>>>          
>>>>> Hi Avi,
>>>>>
>>>>> On 09/14/2010 08:07 AM, Avi Kivity wrote:
>>>>>
>>>>>            
>>>>>>    Here's a draft of a plan that should improve qcow2 performance.  It's
>>>>>> written in wiki syntax for eventual upload to wiki.qemu.org; lines
>>>>>> starting with # are numbered lists, not comments.
>>>>>>
>>>>>>              
>>>>> Thanks for putting this together.  I think it's really useful to think
>>>>> through the problem before anyone jumps in and starts coding.
>>>>>
>>>>>
>>>>>            
>>>>>> = Basics =
>>>>>>
>>>>>> At the minimum level, no operation should block the main thread.  This
>>>>>> could be done in two ways: extending the state machine so that each
>>>>>> blocking operation can be performed asynchronously
>>>>>> (<code>bdrv_aio_*</code>)
>>>>>> or by threading: each new operation is handed off to a worker thread.
>>>>>> Since a full state machine is prohibitively complex, this document
>>>>>> will discuss threading.
>>>>>>
>>>>>>              
>>>>> There's two distinct requirements that must be satisfied by a fast block
>>>>> device.  The device must have fast implementations of aio functions and
>>>>> it must support concurrent request processing.
>>>>>
>>>>> If an aio function blocks in the process of submitting the request, it's
>>>>> by definition slow.  But even if you may the aio functions fast, you
>>>>> still need to be able to support concurrent request processing in order
>>>>> to achieve high throughput.
>>>>>
>>>>> I'm not going to comment in depth on your threading proposal.  When it
>>>>> comes to adding concurrency, I think any approach will require a rewrite
>>>>> of the qcow2 code and if the author of that rewrite is more comfortable
>>>>> implementing concurrency with threads than with a state machine, I'm
>>>>> happy with a threaded implementation.
>>>>>
>>>>> I'd suggest avoiding hyperbole like "a full state machine is
>>>>> prohibitively complex".  QED is a full state machine.  qcow2 adds a
>>>>> number of additional states because of the additional metadata and sync
>>>>> operations but it's not an exponential increase in complexity.
>>>>>
>>>>>            
>>>> It will be quite some additional states that qcow2 brings in, but I
>>>> suspect the really hard thing is getting the dependencies between
>>>> requests right.
>>>>
>>>> I just had a look at how QED is doing this, and it seems to take the
>>>> easy solution, namely allowing only one allocation at the same time.
>>>>          
>>> One L2 allocation, not cluster allocations.  You can allocate multiple
>>> clusters concurrently and you can read/write L2s concurrently.
>>>
>>> Since L2 allocation only happens every 2GB, it's a rare event.
>>>        
>> Then your state machine is too complicated for me to understand. :-)
>>
>> Let me try to chase function pointers for a simple cluster allocation:
>>
>> bdrv_qed_aio_writev
>> qed_aio_setup
>> qed_aio_next_io
>> qed_find_cluster
>> qed_read_l2_table
>> ...
>> qed_find_cluster_cb
>>
>> This function contains the code to check if the cluster is already
>> allocated, right?
>>
>>     n = qed_count_contiguous_clusters(s, request->l2_table->table,
>>                                       index, n,&offset);
>>     ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2;
>>
>> The callback called from there is qed_aio_write_data(..., ret =
>> QED_CLUSTER_L2, ...) which means
>>
>>     bool need_alloc = ret != QED_CLUSTER_FOUND;
>>     /* Freeze this request if another allocating write is in progress */
>>     if (need_alloc) {
>>     ...
>>
>> So where did I start to follow the path of a L2 table allocation instead
>> of a simple cluster allocation?
>>      
> qed_aio_write_main() writes the main body of data into the cluster.
> Then it decides whether to update/allocate L2 tables if this is an
> allocating write.
>    

Right, it should only freeze if the L2 table needs to be allocated, not 
if it only needs to be updated.  IOW,


It's being a bit more conservative than it needs to be.

Regards,

Anthony Liguori

> qed_aio_write_l2_update() is the function that gets called to touch
> the L2 table (it also handles the allocation case).
>
> Stefan
>
>
Anthony Liguori - Sept. 14, 2010, 5:08 p.m.
On 09/14/2010 11:28 AM, Avi Kivity wrote:
>  On 09/14/2010 06:16 PM, Anthony Liguori wrote:
>>
>> Right, it should only freeze if the L2 table needs to be allocated, 
>> not if it only needs to be updated.  IOW,
>>
>> diff --git a/block/qed.c b/block/qed.c
>> index 4c4e7a2..0357c03 100644
>> --- a/block/qed.c
>> +++ b/block/qed.c
>> @@ -948,7 +948,7 @@ static void qed_aio_write_data(void *opaque, int 
>> ret,
>>      }
>>
>>      /* Freeze this request if another allocating write is in 
>> progress */
>> -    if (need_alloc) {
>> +    if (ret == QED_CLUSTER_L1) {
>>          if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
>>              QSIMPLEQ_INSERT_TAIL(&s->allocating_write_reqs, acb, next);
>>          }
>>
>> It's being a bit more conservative than it needs to be.
>
> Yes, I hit this too.  So without this patch, it does serialize all 
> allocating writes?

Yes, but my patch is not enough as it turns out.

When dealing with O_DIRECT, we have to handle RMW on our own which means 
we need to serialize access to the same sector.

The way we're planning on addressing this in the short term is to break 
the single allocator queue into a per-L2 table queue.  So writes to the 
same L2 would be serialized but writes to different L2s would not be 
serialized.

Regards,

Anthony Liguori

> If multiple requests need to update pointers in L2, will those updates 
> generate one write per request, or just two writes (one write from the 
> first request, another from all those that serialized after it)?
>
Avi Kivity - Sept. 14, 2010, 5:23 p.m.
On 09/14/2010 07:08 PM, Anthony Liguori wrote:
>> Yes, I hit this too.  So without this patch, it does serialize all 
>> allocating writes?
>
>
> Yes, but my patch is not enough as it turns out.
>
> When dealing with O_DIRECT, we have to handle RMW on our own which 
> means we need to serialize access to the same sector.
>
> The way we're planning on addressing this in the short term is to 
> break the single allocator queue into a per-L2 table queue.  So writes 
> to the same L2 would be serialized but writes to different L2s would 
> not be serialized.
>

So at least I read the code correctly.

The next step (also addressed in the qcow2 performance plan) is to batch 
writes to L2.  You'd actually expect to have many concurrent allocating 
writes to one L2.  The first is sent to disk, but the following ones 
just mark the L2 dirty.  When the write returns, it sees it's still 
dirty and goes back to disk again.
Anthony Liguori - Sept. 14, 2010, 6:58 p.m.
On 09/14/2010 12:23 PM, Avi Kivity wrote:
>  On 09/14/2010 07:08 PM, Anthony Liguori wrote:
>>> Yes, I hit this too.  So without this patch, it does serialize all 
>>> allocating writes?
>>
>>
>> Yes, but my patch is not enough as it turns out.
>>
>> When dealing with O_DIRECT, we have to handle RMW on our own which 
>> means we need to serialize access to the same sector.
>>
>> The way we're planning on addressing this in the short term is to 
>> break the single allocator queue into a per-L2 table queue.  So 
>> writes to the same L2 would be serialized but writes to different L2s 
>> would not be serialized.
>>
>
> So at least I read the code correctly.
>
> The next step (also addressed in the qcow2 performance plan) is to 
> batch writes to L2.  You'd actually expect to have many concurrent 
> allocating writes to one L2.  The first is sent to disk, but the 
> following ones just mark the L2 dirty.  When the write returns, it 
> sees it's still dirty and goes back to disk again.

Yeah, I have to think through batching quite a bit more but I agree that 
batching should be a natural next step and can further reduce the cost 
of updating metadata in a streaming workload.

Regards,

Anthony Liguori

Patch

diff --git a/block/qed.c b/block/qed.c
index 4c4e7a2..0357c03 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -948,7 +948,7 @@  static void qed_aio_write_data(void *opaque, int ret,
      }

      /* Freeze this request if another allocating write is in progress */
-    if (need_alloc) {
+    if (ret == QED_CLUSTER_L1) {
          if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
              QSIMPLEQ_INSERT_TAIL(&s->allocating_write_reqs, acb, next);
          }