[RFC-v2,0/2] ext4: Improve ilock scalability to improve performance in DirectIO workloads
mbox series

Message ID 20190923054407.8102-1-riteshh@linux.ibm.com
Headers show
Series
  • ext4: Improve ilock scalability to improve performance in DirectIO workloads
Related show

Message

Ritesh Harjani Sept. 23, 2019, 5:44 a.m. UTC
Hello All, 

These are ilock patches which helps improve the current inode lock scalabiliy
problem in ext4 DIO mixed-read/write workload case. The problem was first
reported by Joseph [1]. These patches are based upon upstream discussion
with Jan Kara & Joseph [2]. 

Since these patches introduce ext4_ilock/unlock API for inode locking/
unlocking in shared v/s exclusive mode to help achieve better
scalability in DIO writes case, it was better to take care of inode_trylock
for RWF_NOWAIT case in the same patch set.
Discussion about this can be found here [3].
Please let me know if otherwise.

These patches are based on the top of ext4 iomap for DIO patches by
Matthew [4].

Now, since this is a very interesting scalability problem w.r.t locking,
I thought of collecting a comprehensive set of performance numbers for
DIO sync mixed random read/write workload w.r.t number of threads (ext4).
This commit msg will present a performance benefit histogram using this
patch set v/s vanilla kernel (v5.3-rc5) in DIO mixed randrw workload.

Git tree for this can be found at-
https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC-v2

Background
==========

dioread_nolock feature in ext4, is to allow multiple DIO reads in parallel.
Also, in case of overwrites DIO since there is no block allocation needed,
those I/Os too can run in parallel. This gives a good scalable I/O
performance w.r.t multiple DIO read/writes threads.
In case of buffer write with dioread_nolock feature, ext4 will allocate
uninitialized extent and once the IO is completed will convert it back
to initialized.

But recently DIO reads got changed to be under shared inode rwsem.
commit 16c54688592c ("ext4: Allow parallel DIO reads").
It did allow parallel DIO reads to work fine, but this degraded the mixed
DIO read/write workload.


Problem description
===================

Currently in DIO writes case, we start with an exclusive inode rwsem
and only once we know that it is a overwrite (with dioread_nolock mnt
option enabled),
we demote it to shared lock. Now with mixed DIO reads & writes, this
has a scalability problem. Which is, if we have any ongoing DIO reads,
then DIO writes (since it starts with exclusive) has to wait until the
shared lock is released (which only happens when DIO read is completed).
The same can be easily observed with perf-tools trace analysis.


Implementation
==============

To fix this, we start with a shared lock (unless an exclusive lock is needed
in certain cases e.g. extending/unaligned writes) in DIO write case.
Then if it is a overwrite we continue with shared lock, if not then we
release the shared lock and switch to exclusive lock and
redo certain checks again.

This approach help in achieving good scalablity for mixed DIO read/writes
worloads [5-6].

Performance results
===================

The performance numbers are collected using fio tool with dioread_nolock
mount option for DIO mixed random read/write(randrw) worload with 4K & 1M burst
sizes for increasing number of threads (1, 2, 4, 8, 12, 16, 24).
The performance historgram shown below is the percentage change in
performance by using this ilock patchset as compared to vanilla kernel
(v5.3-rc5).

To check absolute score comparison with some colorful graphs, refer [5-6].

FIO command:
fio -name=DIO-mixed-randrw -filename=./testfile -direct=1 -iodepth=1 -thread \
-rw=randrw -ioengine=psync -bs=$bs -size=10G -numjobs=$thread \
-group_reporting=1 -runtime=120

With dioread_nolock mount option
================================

Below shows the performance benefit hist with this ilock patchset in (%)
w.r.t vanilla kernel(v5.3-rc5) with dioread_nolock mount option for
mixed randrw workload (for 4K block size).
If we notice, the percentage benefit increases with increasing number of
threads. So this patchset help achieve good scalability in the mentioned
workload. Also this gives upto ~70% perf improvement in 24 threads mixed randrw
workload with 4K burst size.
The performance difference can be even higher with high speed storage
devices, since bw speeds without the patch seems to flatten due to lock
contention problem in case of multiple threads.

Absolute graph comparison can be seen here - [5]

        Performance benefit (%) data randrw (read)-4K
  80 +-+------+--------+-------+--------+--------+--------+-------+------+-+   
     |        +        +       +        +        +        +       +        |   
  70 +-+                                                           **    +-+   
     |                                                    **       **      |   
  60 +-+                                          **      **       **    +-+   
     |                                            **      **       **      |   
  50 +-+                                 **       **      **       **    +-+   
     |                                   **       **      **       **      |   
  40 +-+                                 **       **      **       **    +-+   
  30 +-+                        **       **       **      **       **    +-+   
     |                          **       **       **      **       **      |   
  20 +-+                        **       **       **      **       **    +-+   
     |                          **       **       **      **       **      |   
  10 +-+               **       **       **       **      **       **    +-+   
     |         **      **       **       **       **      **       **      |   
   0 +-+       **      **       **       **       **      **       **    +-+   
     |        +        +       +        +        +        +       +        |   
 -10 +-+------+--------+-------+--------+--------+--------+-------+------+-+   
             1,       2,      4,       8,       12,      16,     24,           
                                 No. of threads                                
                                                                               

        Performance benefit (%) data randrw (write)-4K                           
  80 +-+------+--------+-------+--------+--------+--------+-------+------+-+   
     |        +        +       +        +        +        +       +        |   
  70 +-+                                                           **    +-+   
     |                                                    **       **      |   
  60 +-+                                          **      **       **    +-+   
     |                                            **      **       **      |   
  50 +-+                                 **       **      **       **    +-+   
     |                                   **       **      **       **      |   
  40 +-+                                 **       **      **       **    +-+   
  30 +-+                        **       **       **      **       **    +-+   
     |                          **       **       **      **       **      |   
  20 +-+                        **       **       **      **       **    +-+   
     |                          **       **       **      **       **      |   
  10 +-+                        **       **       **      **       **    +-+   
     |         **      **       **       **       **      **       **      |   
   0 +-+       **      **       **       **       **      **       **    +-+   
     |        +        +       +        +        +        +       +        |   
 -10 +-+------+--------+-------+--------+--------+--------+-------+------+-+   
             1,       2,      4,       8,       12,      16,     24,           
                                 No. of threads                                


Below shows the performance benefit hist with this ilock patchset in (%)
w.r.t vanilla kernel (v5.3-rc5) with dioread_nolock mount option for
mixed randrw workload (for 1M burst size).
This shows upto ~20% perf improvement in 24 threads workload with 1M
burst size.
Absolute graph comparison can be seen here - [6]

	Performance benefit (%) data randrw (read)-1M                         
  25 +-+------+--------+-------+--------+--------+--------+-------+------+-+   
     |        +        +       +        +        +        +       +        |   
     |                                                             **      |   
  20 +-+                                                           **    +-+   
     |                                                             **      |   
     |                                                             **      |   
  15 +-+                                                           **    +-+   
     |                                   **               **       **      |   
  10 +-+                                 **               **       **    +-+   
     |                                   **       **      **       **      |   
     |                                   **       **      **       **      |   
   5 +-+                                 **       **      **       **    +-+   
     |                                   **       **      **       **      |   
     |                 **       **       **       **      **       **      |   
   0 +-+       **      **       **       **       **      **       **    +-+   
     |         **                                                          |   
     |        +        +       +        +        +        +       +        |   
  -5 +-+------+--------+-------+--------+--------+--------+-------+------+-+   
             1,       2,      4,       8,       12,      16,     24,           
                                 No. of threads                                
                                                                               
        Performance benefit (%) data randrw (write)-1M                         
  25 +-+------+--------+-------+--------+--------+--------+-------+------+-+   
     |        +        +       +        +        +        +       +        |   
     |                                                                     |   
  20 +-+                                                           **    +-+   
     |                                                             **      |   
     |                                                             **      |   
  15 +-+                                                           **    +-+   
     |                                                    **       **      |   
  10 +-+                                 **               **       **    +-+   
     |                                   **       **      **       **      |   
     |                                   **       **      **       **      |   
   5 +-+                                 **       **      **       **    +-+   
     |                          **       **       **      **       **      |   
     |                 **       **       **       **      **       **      |   
   0 +-+       **      **       **       **       **      **       **    +-+   
     |         **                                                          |   
     |        +        +       +        +        +        +       +        |   
  -5 +-+------+--------+-------+--------+--------+--------+-------+------+-+   
             1,       2,      4,       8,       12,      16,     24,           
                                 No. of threads                                


Without dioread_nolock mount option
===================================

I do see a little degradation without this mount option, but as per my performance
runs, this is within ~(-0.5 to -3)% limit. With 1M burst size, difference is even
better in some runs. So, ~(-3 to 3)% deviation is something which could be called
a run-to-run variation mostly. But for completeness sake I still present
the below data collected.

        Performance(%) data randrw (read)-4K
  3 +-+------+--------+--------+--------+-------+--------+--------+------+-+   
    |        +        +        +        +       +        +        +        |   
    |                                                                      |   
  2 +-+                                                                  +-+   
    |                                                                      |   
    |                                                                      |   
  1 +-+                                                                  +-+   
    |                                                                      |   
  0 +-+       **       **       **      **       **       **       **    +-+   
    |         **       **       **      **       **       **       **      |   
    |         **       **       **      **       **       **       **      |   
 -1 +-+                **       **      **       **       **       **    +-+   
    |                  **       **      **       **       **               |   
    |                  **       **      **       **       **               |   
 -2 +-+                **       **      **       **       **             +-+   
    |                           **      **       **       **               |   
    |        +        +        +**      **      +**      +        +        |   
 -3 +-+------+--------+--------+--------+-------+--------+--------+------+-+   
            1,       2,       4,       8,      12,      16,      24,           
                                No. of threads                                 
                                                                               
                                                                               
        Performance(%) data randrw (write)-4K                             
  3 +-+------+--------+--------+--------+-------+--------+--------+------+-+   
    |        +        +        +        +       +        +        +        |   
    |                                                                      |   
  2 +-+                                                                  +-+   
    |                                                                      |   
    |                                                                      |   
  1 +-+                                                                  +-+   
    |                                                                      |   
  0 +-+       **       **       **      **       **       **       **    +-+   
    |         **       **       **      **       **       **       **      |   
    |         **       **       **      **       **       **       **      |   
 -1 +-+                **       **      **       **       **       **    +-+   
    |                  **       **      **       **       **               |   
    |                  **       **      **       **       **               |   
 -2 +-+                **       **      **       **       **             +-+   
    |                           **      **       **       **               |   
    |        +        +        +**      +       +        +        +        |   
 -3 +-+------+--------+--------+--------+-------+--------+--------+------+-+   
            1,       2,       4,       8,      12,      16,      24,           
                                No. of threads                                 



        Performance(%) data randrw (read)-1M
  3 +-+------+--------+--------+--------+-------+--------+--------+------+-+   
    |        +        +        +        +       +        +        +        |   
    |                                                                      |   
  2 +-+       **                                                         +-+   
    |         **                                                   **      |   
    |         **                                 **                **      |   
  1 +-+       **       **                        **                **    +-+   
    |         **       **                        **                **      |   
  0 +-+       **       **       **      **       **       **       **    +-+   
    |                           **      **                **               |   
    |                           **      **                                 |   
 -1 +-+                         **                                       +-+   
    |                           **                                         |   
    |                                                                      |   
 -2 +-+                                                                  +-+   
    |                                                                      |   
    |        +        +        +        +       +        +        +        |   
 -3 +-+------+--------+--------+--------+-------+--------+--------+------+-+   
            1,       2,       4,       8,      12,      16,      24,           
                                No. of threads                                 
                                                                               
                                                                               
        Performance(%) data randrw (write)-1M
  3 +-+------+--------+--------+--------+-------+--------+--------+------+-+   
    |        +        +        +        +       +        +        +**      |   
    |                                                              **      |   
  2 +-+       **                                                   **    +-+   
    |         **                                                   **      |   
    |         **                                 **                **      |   
  1 +-+       **       **                        **       **       **    +-+   
    |         **       **                        **       **       **      |   
  0 +-+       **       **       **      **       **       **       **    +-+   
    |                           **      **                                 |   
    |                           **      **                                 |   
 -1 +-+                         **                                       +-+   
    |                           **                                         |   
    |                                                                      |   
 -2 +-+                                                                  +-+   
    |                                                                      |   
    |        +        +        +        +       +        +        +        |   
 -3 +-+------+--------+--------+--------+-------+--------+--------+------+-+   
            1,       2,       4,       8,      12,      16,      24,           
                                No. of threads                                 
                                                                              

Git tree- https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC-v2

References
==========
[1]: https://lore.kernel.org/linux-ext4/1566871552-60946-4-git-send-email-joseph.qi@linux.alibaba.com/
[2]: https://lore.kernel.org/linux-ext4/20190910215720.GA7561@quack2.suse.cz/
[3]: https://lore.kernel.org/linux-fsdevel/20190911103117.E32C34C044@d06av22.portsmouth.uk.ibm.com/
[4]: https://lwn.net/Articles/799184/

[5]: https://github.com/riteshharjani/LinuxStudy/blob/master/ext4/fio-output/vanilla-vs-ilocknew-randrw-dioread-nolock-4K.png
[6]: https://github.com/riteshharjani/LinuxStudy/blob/master/ext4/fio-output/vanilla-vs-ilocknew-randrw-dioread-nolock-1M.pg

Ritesh Harjani (2):
  ext4: Add ext4_ilock & ext4_iunlock API
  ext4: Improve DIO writes locking sequence

 fs/ext4/ext4.h    |  33 ++++++
 fs/ext4/extents.c |  16 +--
 fs/ext4/file.c    | 260 ++++++++++++++++++++++++++++++++--------------
 fs/ext4/inode.c   |   4 +-
 fs/ext4/ioctl.c   |  16 +--
 fs/ext4/super.c   |  12 +--
 fs/ext4/xattr.c   |  16 +--
 7 files changed, 249 insertions(+), 108 deletions(-)