[RFC,0/2] ext4: Improve locking sequence in DIO write path

Message ID 20190917103249.20335-1-riteshh@linux.ibm.com

Ritesh Harjani Sept. 17, 2019, 10:32 a.m. UTC
Hello,

This patch series is based on the upstream discussion with Jan
& Joseph at [1].
It is based on top of Matthew's v3 ext4 iomap patch series [2].

Patch-1: Adds the ext4_ilock/unlock APIs and replaces all
inode_lock/unlock instances in fs/ext4/*

For now I have already accounted for the trylock/lock semantics
issue (which was discussed here [3]) in the same patch.
Since this whole patch is about the inode_lock/unlock API,
I thought it would be best to address that issue in the same patch.
However, kindly let me know if you think otherwise.
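
For reference, a minimal sketch of what such wrappers could look like
on top of the existing i_rwsem helpers (illustration only; the actual
Patch-1 code may differ, e.g. in how trylock is handled):

/*
 * Illustrative sketch only - not the exact Patch-1 code.
 * EXT4_IOLOCK_EXCL/EXT4_IOLOCK_SHARED select how i_rwsem is taken.
 */
#define EXT4_IOLOCK_EXCL        (1 << 0)
#define EXT4_IOLOCK_SHARED      (1 << 1)

static inline void ext4_ilock(struct inode *inode, unsigned int iolock)
{
        if (iolock == EXT4_IOLOCK_EXCL)
                inode_lock(inode);              /* exclusive i_rwsem */
        else
                inode_lock_shared(inode);       /* shared i_rwsem */
}

static inline void ext4_iunlock(struct inode *inode, unsigned int iolock)
{
        if (iolock == EXT4_IOLOCK_EXCL)
                inode_unlock(inode);
        else
                inode_unlock_shared(inode);
}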

Patch-2: The commit message of this patch describes in detail
what it does.
In brief - we first try to take the shared lock (instead of the
exclusive lock), unless the IO is unaligned or extending. Then, in
ext4_dio_write_checks(), if we started with the shared lock, we
check whether we can really continue with it. If not, we release
the shared lock, acquire the exclusive lock, and restart
ext4_dio_write_checks(). A rough sketch of this flow is shown below.
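
For illustration only (unaligned_io, extend and can_continue_shared()
are placeholders here, not the actual patch code):

        unsigned int iolock = EXT4_IOLOCK_SHARED;

        /* Unaligned or extending writes always take the exclusive lock. */
        if (unaligned_io || extend)
                iolock = EXT4_IOLOCK_EXCL;
restart:
        ext4_ilock(inode, iolock);
        /*
         * Re-check under the lock; if the shared lock turns out to be
         * insufficient, drop it, take the exclusive lock and redo the
         * checks from scratch.
         */
        if (iolock == EXT4_IOLOCK_SHARED && !can_continue_shared(inode)) {
                ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
                iolock = EXT4_IOLOCK_EXCL;
                goto restart;
        }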


Tested against a few xfstests (with the dioread_nolock mount option);
those ran fine (ext4 & generic).

I tried collecting performance numbers on my VM (since I could not
get hold of any real h/w based test device). I could verify that
the downgrade_write() path (as reported by Joseph in [4]) is now
avoided with this patch for the fio test case.
But for the actual results, I am not sure whether VM-based testing
can really give the reliable perf numbers we want to look at.
Though I do observe some perf improvement, I could not get any
reliable numbers (not even with the same set of with/without
patches with which Joseph posted his numbers [1]).


@Joseph,
Would it be possible for you to give your test case a run with
these patches? That would be really helpful.

The branch for this is hosted at the tree below.

https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC

[1]: https://lore.kernel.org/linux-ext4/20190910215720.GA7561@quack2.suse.cz/
[2]: https://lwn.net/Articles/799184/
[3]: https://lore.kernel.org/linux-fsdevel/20190911103117.E32C34C044@d06av22.portsmouth.uk.ibm.com/
[4]: https://lore.kernel.org/linux-ext4/1566871552-60946-4-git-send-email-joseph.qi@linux.alibaba.com/


Ritesh Harjani (2):
  ext4: Add ext4_ilock & ext4_iunlock API
  ext4: Improve DIO writes locking sequence

 fs/ext4/ext4.h    |  33 ++++++
 fs/ext4/extents.c |  16 +--
 fs/ext4/file.c    | 253 ++++++++++++++++++++++++++++++++--------------
 fs/ext4/inode.c   |   4 +-
 fs/ext4/ioctl.c   |  16 +--
 fs/ext4/super.c   |  12 +--
 fs/ext4/xattr.c   |  16 +--
 7 files changed, 244 insertions(+), 106 deletions(-)

Comments

Joseph Qi Sept. 18, 2019, 12:58 a.m. UTC | #1
On 19/9/17 18:32, Ritesh Harjani wrote:
> Hello,
> 
> [...]
> 
> @Joseph,
> Would it be possible for you to give your test case a run with
> these patches? That would be really helpful.
> 

Sure, will post the result ASAP.

Thanks,
Joseph

Joseph Qi Sept. 18, 2019, 6:35 a.m. UTC | #2
Hi Ritesh,

On 19/9/17 18:32, Ritesh Harjani wrote:
> Hello,
> 
> [...]
> 
> The branch for this is hosted at the tree below.
> 
> https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC
> 
I've tested your branch; the result is:
when mounting with dioread_nolock, it behaves the same as reverting
parallel dio reads + dioread_nolock;
when mounting without dioread_nolock, there is no improvement, or it
is even worse.
Please refer to the test data below.

fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
-direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
-size=20G -numjobs=8 -runtime=600 -group_reporting

w/     = with parallel dio reads
w/o    = reverting parallel dio reads
w/o+   = reverting parallel dio reads + dioread_nolock
ilock  = ext4-ilock-RFC
ilock+ = ext4-ilock-RFC + dioread_nolock

(Each cell below reads: bandwidth, IOPS, average latency.)

bs=4k:
--------------------------------------------------------------
      |            READ           |           WRITE          |
--------------------------------------------------------------
w/    | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
--------------------------------------------------------------
w/o   | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
--------------------------------------------------------------
w/o+  | 123450KB/s,30862,245.77us | 123368KB/s,30841,12.14us |
--------------------------------------------------------------
ilock | 29964KB/s,7491,326.70us   | 29940KB/s,7485,740.62us  |
--------------------------------------------------------------
ilock+| 123685KB/s,30921,245.52us | 123601KB/s,30900,12.11us |
--------------------------------------------------------------


bs=16k:
--------------------------------------------------------------
      |            READ           |           WRITE          |
--------------------------------------------------------------
w/    | 58961KB/s,3685,835.28us   | 58877KB/s,3679,1335.98us |
--------------------------------------------------------------
w/o   | 218409KB/s,13650,554.46us | 218257KB/s,13641,29.22us |
--------------------------------------------------------------
w/o+  | 222477KB/s,13904,552.94us | 222322KB/s,13895,20.28us |
--------------------------------------------------------------
ilock | 56039KB/s,3502,632.96us   | 55943KB/s,3496,1652.72us |
--------------------------------------------------------------
ilock+| 222747KB/s,13921,552.57us | 222592KB/s,13912,20.31us |
--------------------------------------------------------------

bs=64k
----------------------------------------------------------------
      |            READ            |           WRITE           |
----------------------------------------------------------------
w/    | 119396KB/s,1865,1759.38us  | 119159KB/s,1861,2532.26us |
----------------------------------------------------------------
w/o   | 422815KB/s,6606,1146.05us  | 421619KB/s,6587,60.72us   |
----------------------------------------------------------------
w/o+  | 427406KB/s,6678,1141.52us  | 426197KB/s,6659,52.79us   |
----------------------------------------------------------------
ilock | 105800KB/s,1653,1451.68us  | 105721KB/s,1651,3388.64us |
----------------------------------------------------------------
ilock+| 427678KB/s,6682,1142.13us  | 426468KB/s,6663,52.31us   |
----------------------------------------------------------------

bs=512k
----------------------------------------------------------------
      |            READ            |           WRITE           |
----------------------------------------------------------------
w/    | 392973KB/s,767,5046.35us   | 393165KB/s,767,5359.86us  |
----------------------------------------------------------------
w/o   | 590266KB/s,1152,4312.01us  | 590554KB/s,1153,2606.82us |
----------------------------------------------------------------
w/o+  | 618752KB/s,1208,4125.82us  | 619054KB/s,1209,2487.90us |
----------------------------------------------------------------
ilock | 296239KB/s,578,4703.10us   | 296384KB/s,578,9049.32us  |
----------------------------------------------------------------
ilock+| 616636KB/s,1204,4143.38us  | 616937KB/s,1204,2490.08us |
----------------------------------------------------------------

bs=1M
----------------------------------------------------------------
      |            READ            |           WRITE           |
----------------------------------------------------------------
w/    | 487779KB/s,476,8058.55us   | 485592KB/s,474,8630.51us  |
----------------------------------------------------------------
w/o   | 593927KB/s,580,7623.63us   | 591265KB/s,577,6163.42us  |
----------------------------------------------------------------
w/o+  | 615011KB/s,600,7399.93us   | 612255KB/s,597,5936.61us  |
----------------------------------------------------------------
ilock | 394762KB/s,385,7097.55us   | 392993KB/s,383,13626.98us |
----------------------------------------------------------------
ilock+| 626183KB/s,611,7319.16us   | 623377KB/s,608,5773.24us  |
----------------------------------------------------------------

Thanks,
Joseph
Ritesh Harjani Sept. 18, 2019, 10:03 a.m. UTC | #3
Hello Joseph,

First of all, thanks a lot for collecting such thorough
performance numbers.

On 9/18/19 12:05 PM, Joseph Qi wrote:
> Hi Ritesh,
> 
> On 19/9/17 18:32, Ritesh Harjani wrote:
>> [...]
>>
> I've tested your branch; the result is:
> when mounting with dioread_nolock, it behaves the same as reverting
> parallel dio reads + dioread_nolock;

Good sign; it means the patch is doing what it is supposed to do.


> when mounting without dioread_nolock, there is no improvement, or it
> is even worse.
> Please refer to the test data below.
Actually, without dioread_nolock we take the restart path,
i.e. initially we start with the shared lock, but if dioread_nolock
is not enabled (or some other condition fails, e.g. the overwrite
check), we release the shared lock and re-acquire the exclusive lock.


But as an optimization, I have just added the below diff to first
check ext4_should_dioread_nolock() directly before taking the
shared lock.

I think with this we should not see any performance regression
(even without the dioread_nolock mount option), since it will
directly start with the exclusive lock if dioread_nolock is not
enabled.

I have updated the same branch with this diff.


ext4_dio_file_write_iter()
<...>

        if (iolock == EXT4_IOLOCK_SHARED &&
            !ext4_should_dioread_nolock(inode))
                iolock = EXT4_IOLOCK_EXCL;

<...>


> 
> fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
> -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
> -size=20G -numjobs=8 -runtime=600 -group_reporting
> 
> w/     = with parallel dio reads
> w/o    = reverting parallel dio reads
> w/o+   = reverting parallel dio reads + dioread_nolock
> ilock  = ext4-ilock-RFC
> ilock+ = ext4-ilock-RFC + dioread_nolock

I would request you to kindly also add "w/ + dioread_nolock" to
your list.


"w/ + dioread_nolock"  v/s  "ilock+" - should show some improvements.
"w/ "  v/s  "ilock" - should not show any regression.

But thanks for the exhaustive performance numbers you collected.


-ritesh


Joseph Qi Sept. 18, 2019, 10:57 a.m. UTC | #4
On 19/9/18 18:03, Ritesh Harjani wrote:
> Hello Joseph,
> 
> [...]
> 
> I would request you to kindly also add "w/ + dioread_nolock" to
> your list.
> 
I've done this test before; it still performs poorly.
You can refer to the previous RFC thread:
https://www.spinics.net/lists/linux-ext4/msg67066.html

Thanks,
Joseph

Joseph Qi Sept. 19, 2019, 2:08 a.m. UTC | #5
On 19/9/18 18:03, Ritesh Harjani wrote:
> Hello Joseph,
> 
> [...]
> 
> But as an optimization, I have just added the below diff to first
> check ext4_should_dioread_nolock() directly before taking the
> shared lock.
> 
> [...]
> 
> ext4_dio_file_write_iter()
> <...>
>         if (iolock == EXT4_IOLOCK_SHARED &&
>             !ext4_should_dioread_nolock(inode))
>                 iolock = EXT4_IOLOCK_EXCL;
> <...>
> 
With this optimization, when mounting without dioread_nolock, there
is a little improvement compared to the last ilock version, but it
is still poorer than the original, especially for big block sizes.

w/        = with parallel dio reads (original)
w/o       = reverting parallel dio reads
w/o+      = reverting parallel dio reads + dioread_nolock
ilock     = ext4-ilock-RFC
ilock+    = ext4-ilock-RFC + dioread_nolock
ilocknew  = ext4-ilock-RFC latest
ilocknew+ = ext4-ilock-RFC latest + dioread_nolock


bs=4k:
------------------------------------------------------------------
          |            READ           |           WRITE          |
------------------------------------------------------------------
w/        | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
------------------------------------------------------------------
w/o       | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
------------------------------------------------------------------
w/o+      | 123450KB/s,30862,245.77us | 123368KB/s,30841,12.14us |
------------------------------------------------------------------
ilock     | 29964KB/s,7491,326.70us   | 29940KB/s,7485,740.62us  |
------------------------------------------------------------------
ilocknew  | 30190KB/s,7547,497.47us   | 30159KB/s,7539,561.85us  |
------------------------------------------------------------------
ilock+    | 123685KB/s,30921,245.52us | 123601KB/s,30900,12.11us |
------------------------------------------------------------------
ilocknew+ | 123169KB/s,30792,245.81us | 123087KB/s,30771,12.85us |
------------------------------------------------------------------


bs=16k:
------------------------------------------------------------------
          |            READ           |           WRITE          |
------------------------------------------------------------------
w/        | 58961KB/s,3685,835.28us   | 58877KB/s,3679,1335.98us |
------------------------------------------------------------------
w/o       | 218409KB/s,13650,554.46us | 218257KB/s,13641,29.22us |
------------------------------------------------------------------
w/o+      | 222477KB/s,13904,552.94us | 222322KB/s,13895,20.28us |
------------------------------------------------------------------
ilock     | 56039KB/s,3502,632.96us   | 55943KB/s,3496,1652.72us |
------------------------------------------------------------------
ilocknew  | 57317KB/s,3582,1023.88us  | 57230KB/s,3576,1209.91us |
------------------------------------------------------------------
ilock+    | 222747KB/s,13921,552.57us | 222592KB/s,13912,20.31us |
------------------------------------------------------------------
ilocknew+ | 221945KB/s,13871,552.61us | 221791KB/s,13861,21.29us |
------------------------------------------------------------------

bs=64k
-------------------------------------------------------------------
          |            READ           |           WRITE           |
-------------------------------------------------------------------
w/        | 119396KB/s,1865,1759.38us | 119159KB/s,1861,2532.26us |
-------------------------------------------------------------------
w/o       | 422815KB/s,6606,1146.05us | 421619KB/s,6587,60.72us   |
-------------------------------------------------------------------
w/o+      | 427406KB/s,6678,1141.52us | 426197KB/s,6659,52.79us   |
-------------------------------------------------------------------
ilock     | 105800KB/s,1653,1451.68us | 105721KB/s,1651,3388.64us |
-------------------------------------------------------------------
ilocknew  | 107447KB/s,1678,1654.33us | 107322KB/s,1676,3112.96us |
-------------------------------------------------------------------
ilock+    | 427678KB/s,6682,1142.13us | 426468KB/s,6663,52.31us   |
-------------------------------------------------------------------
ilocknew+ | 427054KB/s,6672,1143.43us | 425846KB/s,6653,52.87us   |
-------------------------------------------------------------------

bs=512k
-------------------------------------------------------------------
          |            READ           |           WRITE           |
-------------------------------------------------------------------
w/        | 392973KB/s,767,5046.35us  | 393165KB/s,767,5359.86us  |
-------------------------------------------------------------------
w/o       | 590266KB/s,1152,4312.01us | 590554KB/s,1153,2606.82us |
-------------------------------------------------------------------
w/o+      | 618752KB/s,1208,4125.82us | 619054KB/s,1209,2487.90us |
-------------------------------------------------------------------
ilock     | 296239KB/s,578,4703.10us  | 296384KB/s,578,9049.32us  |
-------------------------------------------------------------------
ilocknew  | 309740KB/s,604,5485.93us  | 309892KB/s,605,7666.79us  |
-------------------------------------------------------------------
ilock+    | 616636KB/s,1204,4143.38us | 616937KB/s,1204,2490.08us |
-------------------------------------------------------------------
ilocknew+ | 618159KB/s,1207,4129.76us | 618461KB/s,1207,2486.90us |
-------------------------------------------------------------------

bs=1M
-------------------------------------------------------------------
          |            READ           |           WRITE           |
-------------------------------------------------------------------
w/        | 487779KB/s,476,8058.55us  | 485592KB/s,474,8630.51us  |
-------------------------------------------------------------------
w/o       | 593927KB/s,580,7623.63us  | 591265KB/s,577,6163.42us  |
-------------------------------------------------------------------
w/o+      | 615011KB/s,600,7399.93us  | 612255KB/s,597,5936.61us  |
-------------------------------------------------------------------
ilock     | 394762KB/s,385,7097.55us  | 392993KB/s,383,13626.98us |
-------------------------------------------------------------------
ilocknew  | 422052KB/s,412,8338.16us  | 420161KB/s,410,11008.95us |
-------------------------------------------------------------------
ilock+    | 626183KB/s,611,7319.16us  | 623377KB/s,608,5773.24us  |
-------------------------------------------------------------------
ilocknew+ | 626006KB/s,611,7281.09us  | 623200KB/s,608,5817.84us  |
-------------------------------------------------------------------
Ritesh Harjani Sept. 19, 2019, 6:48 p.m. UTC | #6
Hello,

On 9/19/19 7:38 AM, Joseph Qi wrote:
> 
> On 19/9/18 18:03, Ritesh Harjani wrote:
>> [...]
>>
> With this optimization, when mounting without dioread_nolock, there
> is a little improvement compared to the last ilock version, but it
> is still poorer than the original, especially for big block sizes.

Finally, I got hold of some HW. I am collecting the numbers as we speak.
Will post those tomorrow.

Thanks for your help!!

-ritesh



Ritesh Harjani Sept. 23, 2019, 6:19 a.m. UTC | #7
> Finally, I got hold of some HW. I am collecting the numbers as we speak.
> Will post those tomorrow.

I have sent RFC-v2 with a comprehensive set of performance numbers
and charts. It can be found below.

https://marc.info/?l=linux-ext4&m=156921748126221&w=2

Please take a look at it.

-ritesh
Jan Kara Sept. 24, 2019, 3:10 p.m. UTC | #8
Hi Joseph!

On Wed 18-09-19 14:35:15, Joseph Qi wrote:
> On 19/9/17 18:32, Ritesh Harjani wrote:
> > [...]
> > 
> I've tested your branch; the result is:
> when mounting with dioread_nolock, it behaves the same as reverting
> parallel dio reads + dioread_nolock;
> when mounting without dioread_nolock, there is no improvement, or it
> is even worse.
> Please refer to the test data below.
> 
> fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
> -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
> -size=20G -numjobs=8 -runtime=600 -group_reporting
> 
> w/     = with parallel dio reads
> w/o    = reverting parallel dio reads

This is with 16c54688592ce8 "ext4: Allow parallel DIO reads" reverted,
right?

> w/o+   = reverting parallel dio reads + dioread_nolock
> ilock  = ext4-ilock-RFC
> ilock+ = ext4-ilock-RFC + dioread_nolock
> 
> bs=4k:
> --------------------------------------------------------------
>       |            READ           |           WRITE          |
> --------------------------------------------------------------
> w/    | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
> --------------------------------------------------------------
> w/o   | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
> --------------------------------------------------------------

I'm really surprised by the numbers here. They would mean that when DIO
read takes i_rwsem exclusive lock instead of shared, it is a win for your
workload... Argh, now checking code in fs/direct-io.c I think I can see the
difference. The trick in do_blockdev_direct_IO() is:

        if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
                inode_unlock(dio->inode);
        if (dio->is_async && retval == 0 && dio->result &&
            (iov_iter_rw(iter) == READ || dio->result == count))
                retval = -EIOCBQUEUED;
        else
                dio_await_completion(dio);

So actually only direct IO read submission is protected by i_rwsem with
DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
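
In other words, the sync DIO read path under DIO_LOCKING is roughly
(an annotated simplification, not the exact fs/direct-io.c code):

        /* Read submission runs under the exclusive i_rwsem ... */
        inode_lock(inode);
        retval = do_direct_IO(dio, &sdio, &map_bh);
        /* ... but the lock is dropped before waiting for completion, */
        if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
                inode_unlock(inode);
        /* so sync DIO reads wait with i_rwsem released. */
        dio_await_completion(dio);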

After some thought I think the best solution for this is to just finally
finish the conversion of ext4 so that dioread_nolock is the only DIO path.
With i_rwsem held in shared mode even for "unlocked" DIO, it should
actually be relatively simple, and most of the dances with unwritten
extents shouldn't be needed anymore.

								Honza
Ritesh Harjani Sept. 24, 2019, 7:48 p.m. UTC | #9
On 9/24/19 8:40 PM, Jan Kara wrote:
> Hi Joseph!
> 
> On Wed 18-09-19 14:35:15, Joseph Qi wrote:
>> On 19/9/17 18:32, Ritesh Harjani wrote:
>>> [...]
>>>
>> I've tested your branch, the result is:
>> mounting with dioread_nolock, it behaves the same like reverting
>> parallel dio reads + dioread_nolock;
>> while mounting without dioread_nolock, no improvement, or even worse.
>> Please refer the test data below.
>>
>> fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
>> -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
>> -size=20G -numjobs=8 -runtime=600 -group_reporting
>>
>> w/     = with parallel dio reads
>> w/o    = reverting parallel dio reads
> 
> This is with 16c54688592ce8 "ext4: Allow parallel DIO reads" reverted,
> right?

He posted the same numbers as when he posted the previous reverts,
which I thought we had already noticed [1].
From [2] below, I assumed we knew this.

[2] - """
(note
that the patches actually improve performance of read-only DIO workload
when not using dioread_nolock as for that case, exclusive lock is 
replaced with a shared one)
"""


[1]  https://patchwork.ozlabs.org/patch/1153546/
[2] 
https://lore.kernel.org/linux-ext4/20190830153520.GB25069@quack2.suse.cz/

> 
>> w/o+   = reverting parallel dio reads + dioread_nolock
>> ilock  = ext4-ilock-RFC
>> ilock+ = ext4-ilock-RFC + dioread_nolock
>>
>> bs=4k:
>> --------------------------------------------------------------
>>        |            READ           |           WRITE          |
>> --------------------------------------------------------------
>> w/    | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
>> --------------------------------------------------------------
>> w/o   | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
>> --------------------------------------------------------------
> 
> I'm really surprised by the numbers here. They would mean that when DIO

While testing my patches I noticed this again, but then when I saw [2]
above, I thought we were aware of this.
My bad, I should have brought this point up before going
ahead with implementing our discussed solution.


> read takes i_rwsem exclusive lock instead of shared, it is a win for your
> workload... Argh, now checking code in fs/direct-io.c I think I can see the
> difference. The trick in do_blockdev_direct_IO() is:
> 
>          if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
>                  inode_unlock(dio->inode);
>          if (dio->is_async && retval == 0 && dio->result &&
>              (iov_iter_rw(iter) == READ || dio->result == count))
>                  retval = -EIOCBQUEUED;
>          else
>                  dio_await_completion(dio);
> 
> So actually only direct IO read submission is protected by i_rwsem with
> DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
> 
> After some thought I think the best solution for this is to just finally
> finish the conversion of ext4 so that dioread_nolock is the only DIO path.

Sorry, I still didn't get this completely. Could you please explain a 
bit more?


> With i_rwsem held in shared mode even for "unlocked" DIO, it should be
> actually relatively simple and most of the dances with unwritten extents
> shouldn't be needed anymore.

Again, maybe it's related to the above comment. Could you please give some
insights?


Or do you mean that we should do it like this -
As of now in dioread_nolock, we allocate blocks, mark the entries in
the extent tree as unwritten, then do the data IO, and then finally do
the conversion of unwritten to written extents.

So instead of that, we would first only reserve the disk blocks (without
making any on-disk changes in the extent tree), do the data IO, and then
finally make an entry into the on-disk extent tree. And going
forward we would keep only this as the default path.
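
In rough pseudo-code the ordering would be something like below (the
helper names are made up purely for illustration, they are not existing
ext4 functions):

	/* hypothetical flow, invented helper names, not real ext4 code */
	ext4_reserve_blocks(inode, lblk, len, &pblk);	/* claim space only;
							   no on-disk extent
							   tree change yet */
	submit_data_io(inode, pblk, len);		/* write the file data */
	ext4_insert_extent(inode, lblk, pblk, len);	/* publish the mapping
							   only after the data
							   has hit the disk */

So after a crash we could never end up with an on-disk extent pointing
at stale data, since the mapping is published only after the data IO.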

The above is something I have been looking into for enabling
dioread_nolock for powerpc platforms where blocksize < page_size.
This is based upon an upstream discussion between Ted and you :)


But even with the above, in case of extending writes we will still
have to zero out those extending blocks, no? That
will require an exclusive inode lock anyway for zeroing
(the same is done in XFS too).

So going with the currently discussed solution of mounting with
dioread_nolock to provide performance scalability for mixed read/write
workloads should also be the right approach, no?
Also, looking at the numbers here [3] & [4], this patch seems
to improve performance with the dioread_nolock mount option.
Please help me understand your thoughts on this.

[3] - https://marc.info/?l=linux-ext4&m=156921748126221&w=2
[4] - 
https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/ext4/fio-output/vanilla-vs-ilocknew-randrw-dioread-nolock-4K.png


-ritesh
Joseph Qi Sept. 25, 2019, 1:17 a.m. UTC | #10
On 19/9/24 23:10, Jan Kara wrote:
> Hi Joseph!
> 
> On Wed 18-09-19 14:35:15, Joseph Qi wrote:
>> On 19/9/17 18:32, Ritesh Harjani wrote:
>>> Hello,
>>>
>>> This patch series is based on the upstream discussion with Jan
>>> & Joseph @ [1].
>>> It is based on top of Matthew's v3 ext4 iomap patch series [2]
>>>
>>> Patch-1: Adds the ext4_ilock/unlock APIs and also replaces all
>>> inode_lock/unlock instances from fs/ext4/*
>>>
>>> For now I already accounted for trylock/lock issue symantics
>>> (which was discussed here [3]) in the same patch,
>>> since the this whole patch was around inode_lock/unlock API,
>>> so I thought it will be best to address that issue in the same patch. 
>>> However, kindly let me know if otherwise.
>>>
>>> Patch-2: Commit msg of this patch describes in detail about
>>> what it is doing.
>>> In brief - we try to first take the shared lock (instead of exclusive
>>> lock), unless it is a unaligned_io or extend_io. Then in
>>> ext4_dio_write_checks(), if we start with shared lock, we see
>>> if we can really continue with shared lock or not. If not, then
>>> we release the shared lock then acquire exclusive lock
>>> and restart ext4_dio_write_checks().
>>>
>>>
>>> Tested against few xfstests (with dioread_nolock mount option),
>>> those ran fine (ext4 & generic).
>>>
>>> I tried testing performance numbers on my VM (since I could not get
>>> hold of any real h/w based test device). I could test the fact
>>> that earlier we were trying to do downgrade_write() lock, but with
>>> this patch, that path is now avoided for fio test case
>>> (as reported by Joseph in [4]).
>>> But for the actual results, I am not sure if VM machine testing could
>>> really give the reliable perf numbers which we want to take a look at.
>>> Though I do observe some form of perf improvements, but I could not
>>> get any reliable numbers (not even with the same list of with/without
>>> patches with which Joseph posted his numbers [1]).
>>>
>>>
>>> @Joseph,
>>> Would it be possible for you to give your test case a run with this
>>> patches? That will be really helpful.
>>>
>>> Branch for this is hosted at below tree.
>>>
>>> https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC
>>>
>> I've tested your branch, the result is:
>> mounting with dioread_nolock, it behaves the same like reverting
>> parallel dio reads + dioread_nolock;
>> while mounting without dioread_nolock, no improvement, or even worse.
>> Please refer the test data below. 
>>
>> fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
>> -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
>> -size=20G -numjobs=8 -runtime=600 -group_reporting
>>
>> w/     = with parallel dio reads
>> w/o    = reverting parallel dio reads
> 
> This is with 16c54688592ce8 "ext4: Allow parallel DIO reads" reverted,
> right?

Yes, actually, it also reverts the related patches:

Revert "ext4: remove EXT4_STATE_DIOREAD_LOCK flag"
Revert "ext4: fix off-by-one error when writing back pages before dio read"
Revert "ext4: Allow parallel DIO reads"

> 
>> w/o+   = reverting parallel dio reads + dioread_nolock
>> ilock  = ext4-ilock-RFC
>> ilock+ = ext4-ilock-RFC + dioread_nolock
>>
>> bs=4k:
>> --------------------------------------------------------------
>>       |            READ           |           WRITE          |
>> --------------------------------------------------------------
>> w/    | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
>> --------------------------------------------------------------
>> w/o   | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
>> --------------------------------------------------------------
> 
> I'm really surprised by the numbers here. They would mean that when DIO
> read takes i_rwsem exclusive lock instead of shared, it is a win for your
> workload... Argh, now checking code in fs/direct-io.c I think I can see the
> difference. The trick in do_blockdev_direct_IO() is:
> 
>         if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
>                 inode_unlock(dio->inode);
>         if (dio->is_async && retval == 0 && dio->result &&
>             (iov_iter_rw(iter) == READ || dio->result == count))
>                 retval = -EIOCBQUEUED;
>         else
>                 dio_await_completion(dio);
> 
> So actually only direct IO read submission is protected by i_rwsem with
> DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
> 
> After some thought I think the best solution for this is to just finally
> finish the conversion of ext4 so that dioread_nolock is the only DIO path.
> With i_rwsem held in shared mode even for "unlocked" DIO, it should be
> actually relatively simple and most of the dances with unwritten extents
> shouldn't be needed anymore.
> 
> 								Honza
>
Jan Kara Sept. 25, 2019, 9:23 a.m. UTC | #11
On Wed 25-09-19 01:18:04, Ritesh Harjani wrote:
> > read takes i_rwsem exclusive lock instead of shared, it is a win for your
> > workload... Argh, now checking code in fs/direct-io.c I think I can see the
> > difference. The trick in do_blockdev_direct_IO() is:
> > 
> >          if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
> >                  inode_unlock(dio->inode);
> >          if (dio->is_async && retval == 0 && dio->result &&
> >              (iov_iter_rw(iter) == READ || dio->result == count))
> >                  retval = -EIOCBQUEUED;
> >          else
> >                  dio_await_completion(dio);
> > 
> > So actually only direct IO read submission is protected by i_rwsem with
> > DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
> > 
> > After some thought I think the best solution for this is to just finally
> > finish the conversion of ext4 so that dioread_nolock is the only DIO path.
> 
> Sorry, I still didn't get this completely. Could you please explain a bit
> more?

Well, currently we have two different locking schemes for DIO - the
"normal" case and the "dioread_nolock" case. And the "normal" case really
only exists because buffered writeback needed to be more careful (so that
nolock DIO cannot expose stale data) and nobody put in the effort to make that
work when blocksize < pagesize. But having two different locking schemes
for DIO is really confusing to users and a maintenance burden so we want to
get rid of the old scheme once the "dioread_nolock" scheme works for all
the configurations.
 
> > With i_rwsem held in shared mode even for "unlocked" DIO, it should be
> > actually relatively simple and most of the dances with unwritten extents
> > shouldn't be needed anymore.
> 
> Again, maybe it's related to above comment. Could you please give some
> insights?

Now that we hold i_rwsem in shared mode for all of DIO, it isn't really
"unlocked" anymore. Which actually very much limits the races with buffered
writes and thus page writeback (because we flush page cache before doing
DIO).
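
To illustrate, in the iomap-based path the read side looks roughly like
the sketch below (simplified from the series this builds on, with the
IOCB_NOWAIT trylock handling omitted, so treat it as a sketch rather
than the final code). Note the shared lock is held across submission
*and*, for sync IO, waiting for completion - unlike the old
fs/direct-io.c behaviour quoted above:

	static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
	{
		struct inode *inode = file_inode(iocb->ki_filp);
		ssize_t ret;

		inode_lock_shared(inode);
		/* Page cache is flushed inside iomap_dio_rw(). */
		ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL);
		inode_unlock_shared(inode);
		file_accessed(iocb->ki_filp);
		return ret;
	}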

> Or do you mean that we should do it like this-
> So as of now in dioread_nolock, we allocate blocks, mark the entry into
> extents as unwritten, then do the data IO, and then finally do the
> conversion of unwritten to written extents.

So this allocation of extents as unwritten in dioread_nolock mode is now
mostly unnecessary. We hold i_rwsem for reading (or even for writing) while
submitting any DIO. Because we flush page cache before starting DIO and new
pages cannot be dirtied by buffered writes (i_rwsem), DIO cannot be racing
with page writeback and thus cannot expose stale block contents. There is
one exception though - which is why I wrote "mostly" above - pages can
still be dirtied through memory mappings (there's no i_rwsem protection for
mmap writes) and thus races with page writeback exposing stale data are still
theoretically possible. In fact the "normal" DIO mode has had this kind of
race for as long as ext4 has existed.

I guess we could fix this by falling back to page writeback using unwritten
extents when we have dirty pages locked for writeback and see there's any
DIO in flight for the inode. Sadly that means we cannot get rid of the
writeback code using unwritten extents, but at least it won't be hurting
performance in the common case.

> So instead of that we first only reserve the disk blocks, (without
> making any on-disk changes in extent tree), do the data IO and then
> finally make an entry into extent tree on disk. And going
> forward only keep this as the default path.
> 
> The above is something I have been looking into for enabling
> dioread_nolock for powerpc platforms where blocksize < page_size.
> This is based upon an upstream discussion between Ted and you :)

Yes, this is an even better solution to the dioread_nolock "problem" but it
is also more work than just dropping the old DIO locking mode and fixing
writeback using unwritten extents for blocksize < pagesize.
 
> But even with above, in case of extending writes, we still
> will have to zero out those extending blocks no? Which
> will require an exclusive inode lock anyways for zeroing.
> (same which has been done in XFS too).

Yes, extending writes are a different matter.

> So going with current discussed solution of mounting with
> dioread_nolock to provide performance scalability in mixed read/write
> workload should be also the right approach, no?
> Also looking at the numbers here [3] & [4], this patch also seems
> to improve the performance with dioread_nolock mount option.
> Please help me understand your thoughts on this.

Yes, your patches are definitely going in the right direction. What I'm
discussing is mostly how to make ext4 perform well for a mixed read/write
workload by default, without the user having to use a magic mount option.

> [3] - https://marc.info/?l=linux-ext4&m=156921748126221&w=2
> [4] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/ext4/fio-output/vanilla-vs-ilocknew-randrw-dioread-nolock-4K.png


								Honza
Ritesh Harjani Sept. 26, 2019, 12:34 p.m. UTC | #12
Hello Jan,

Thanks for answering it all and giving all the info.

On 9/25/19 2:53 PM, Jan Kara wrote:
> On Wed 25-09-19 01:18:04, Ritesh Harjani wrote:
>>> read takes i_rwsem exclusive lock instead of shared, it is a win for your
>>> workload... Argh, now checking code in fs/direct-io.c I think I can see the
>>> difference. The trick in do_blockdev_direct_IO() is:
>>>
>>>           if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
>>>                   inode_unlock(dio->inode);
>>>           if (dio->is_async && retval == 0 && dio->result &&
>>>               (iov_iter_rw(iter) == READ || dio->result == count))
>>>                   retval = -EIOCBQUEUED;
>>>           else
>>>                   dio_await_completion(dio);
>>>
>>> So actually only direct IO read submission is protected by i_rwsem with
>>> DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
>>>
>>> After some thought I think the best solution for this is to just finally
>>> finish the conversion of ext4 so that dioread_nolock is the only DIO path.
>>
>> Sorry, I still didn't get this completely. Could you please explain a bit
>> more?
> 
> Well, currently we have two different locking schemes for DIO - the
> "normal" case and the "dioread_nolock" case. And the "normal" case really
> only exists because buffered writeback needed to be more careful (so that
> nolock DIO cannot expose stale data) and nobody did the effort to make that
> work when blocksize < pagesize. But having two different locking schemes
> for DIO is really confusing to users and a maintenance burden so we want to
> get rid of the old scheme once the "dioread_nolock" scheme works for all
> the configurations.

Agreed.

>   
>>> With i_rwsem held in shared mode even for "unlocked" DIO, it should be
>>> actually relatively simple and most of the dances with unwritten extents
>>> shouldn't be needed anymore.
>>
>> Again, maybe it's related to above comment. Could you please give some
>> insights?
> 
> Now that we hold i_rwsem in shared mode for all of DIO, it isn't really
> "unlocked" anymore. Which actually very much limits the races with buffered
> writes and thus page writeback (because we flush page cache before doing
> DIO).

So looking at the code again based on your inputs from above, we should
be able to remove this condition <snip below> in ext4_dio_write_checks.

What I mean is, in the DIO write path we can start with the shared lock
(which the current patch is doing), and only check the conditions in the
snippet below to decide whether we can continue with the shared lock
(i.e. we need not check for ext4_should_dioread_nolock(inode) anymore).

That means there should not be any race for the non-extent-based mode
either, right?


Because for DIO overwrites (for both extent & non-extent modes) -
(just reiterating)

1. The race between bufferedIO writes and DIO overwrites is protected
via the SHARED inode lock for DIO overwrites & the EXCL lock for
bufferedIO writes. Plus we flush the page cache before doing DIO.

2. The race between bufferedIO reads & DIO overwrites is protected
anyway, since we don't do any block allocation during a DIO overwrite.

3. Again, the race between DIO reads & DIO overwrites is not a problem,
since no block allocation is done anyway. So no stale data will be
exposed.
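
For reference, the overwrite check used in the snippet below can be as
simple as the rough sketch here (modelled on the ext4_overwrite_io()
from the proposed patches; treat the exact body as an approximation):

	static bool ext4_overwrite_io(struct inode *inode, loff_t offset,
				      loff_t len)
	{
		struct ext4_map_blocks map;
		unsigned int blkbits = inode->i_blkbits;
		int err, blklen;

		/* An extending write can never be a pure overwrite. */
		if (offset + len > i_size_read(inode))
			return false;

		map.m_lblk = offset >> blkbits;
		map.m_len = EXT4_MAX_BLOCKS(len, offset, blkbits);
		blklen = map.m_len;

		/* NULL handle: pure lookup, no block allocation. */
		err = ext4_map_blocks(NULL, inode, &map, 0);
		/* Overwrite only if the range is fully mapped & written. */
		return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
	}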


<snip of change we should do in ext4_dio_write_checks>

         if (*iolock == EXT4_IOLOCK_SHARED &&
             (!IS_NOSEC(inode) || *unaligned_io || *extend ||
-            !ext4_should_dioread_nolock(inode) ||
              !ext4_overwrite_io(inode, offset, count))) {
                 ext4_iunlock(inode, *iolock);
                 *iolock = EXT4_IOLOCK_EXCL;
                 ext4_ilock(inode, *iolock);
                 goto restart;
         }
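
(i.e. with the ext4_should_dioread_nolock() check dropped as above, only
files needing suid/security clearing (!IS_NOSEC), unaligned IO,
extending writes, or non-overwrites would escalate to the exclusive
lock and restart the checks.)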

> 
>> Or do you mean that we should do it like this-
>> So as of now in dioread_nolock, we allocate blocks, mark the entry into
>> extents as unwritten, then do the data IO, and then finally do the
>> conversion of unwritten to written extents.
> 
> So this allocation of extents as unwritten in dioread_nolock mode is now
> mostly unnecessary. We hold i_rwsem for reading (or even for writing) while
> submitting any DIO. Because we flush page cache before starting DIO and new
> pages cannot be dirtied by buffered writes (i_rwsem), DIO cannot be racing
> with page writeback and thus cannot expose stale block contents. There is
> one exception though - which is why I wrote "mostly" above - pages can
> still be dirtied through memory mappings (there's no i_rwsem protection for
> mmap writes) and thus races with page writeback exposing stale data are still
> theoretically possible. In fact the "normal" DIO mode has this kind of race
> all the time ext4 exists.

Yes, now that you've said it, I can see the below race even with the
current code. Any other race too?

i.e.
ext4_dio_read			ext4_page_mkwrite()

     filemap_write_and_wait_range()
				     ext4_get_blocks()

     submit_bio
     // this could expose the stale data.
		
				     mark_buffer_dirty()


Ok. I guess we cannot use an exclusive inode lock in ext4_page_mkwrite,
because then we lose the parallelism that it offers right now.
I wonder how other filesystems protect against this race (like XFS?)


> I guess we could fix this by falling back to page writeback using unwritten
> extents when we have dirty pages locked for writeback and see there's any
> DIO in flight for the inode. Sadly that means we we cannot get rid of
> writeback code using unwritten extents but at least it won't be hurting
> performance in the common case.


1. So why check for dirty pages locked for writeback (in
ext4_page_mkwrite)? We anyway lock the page which we modify or
allocate a block for, so a race against buffered reads should not happen.

2. And even if we call inode_dio_wait(), we still can't guarantee
that the next DIO won't snoop in while we are in ext4_page_mkwrite?

I am definitely missing something here.


> 
>> So instead of that we first only reserve the disk blocks, (without
>> making any on-disk changes in extent tree), do the data IO and then
>> finally make an entry into extent tree on disk. And going
>> forward only keep this as the default path.
>>
>> The above is something I have been looking into for enabling
>> dioread_nolock for powerpc platforms where blocksize < page_size.
>> This is based upon an upstream discussion between Ted and you :)
> 
> Yes, this is even better solution to dioread_nolock "problem" but it is
> also more work 

Yes, I agree that we should do this incrementally.


> then just dropping the old DIO locking mode and
> fix writeback using unwritten extents for blocksize < pagesize.


So, I was looking into this (to support dioread_nolock for
blocksize < pagesize) but I could not find any history in git,
nor any thread which explains the problem.

I got the below understanding while going over the code -

1. This problem may be somehow related to bufferheads in the
extent conversion from unwritten to written in the writeback path.
But I couldn't see what exactly the issue is there.

I see that the completion path happens via
ext4_end_io
   |- ext4_convert_unwritten_extents() // offset & size are properly handled
   |- ext4_release_io_end() (which is the same in both DIO & writeback).


2. One thing which I noticed is the FIXME in
mpage_map_and_submit_buffers(), where we add the whole PAGE_SIZE
to io->io_end->size. I guess we should update the size there
carefully based on the buffer_heads.


Could you please give some pointers as to what
the limitation is, or some hint which I can go
and check by myself?



>> But even with above, in case of extending writes, we still
>> will have to zero out those extending blocks no? Which
>> will require an exclusive inode lock anyways for zeroing.
>> (same which has been done in XFS too).
> 
> Yes, extending writes are a different matter.
> 
>> So going with current discussed solution of mounting with
>> dioread_nolock to provide performance scalability in mixed read/write
>> workload should be also the right approach, no?
>> Also looking at the numbers here [3] & [4], this patch also seems
>> to improve the performance with dioread_nolock mount option.
>> Please help me understand your thoughts on this.
> 
> Yes, your patches are definitely going in the right direction. What I'm
> discussing is mostly how to make ext4 perform well for mixed read/write
> workload by default without user having to use magic mount option.

Thanks a lot for your support here.

So can we get these patches upstream incrementally?
i.e.
1. As a first step, we can remove the
ext4_should_dioread_nolock() condition from the current patchset
(as mentioned above too) & get this patch rebased
on top of the iomap patchset. We should be good to merge
this patchset then, right? It should be able
to improve the performance even without the dioread_nolock
mount option.


2. Meanwhile I will continue to look into the blocksize < pagesize
requirement for dioread_nolock. We can then follow the plan
as you mentioned above.

Your thoughts?


-ritesh
Jan Kara Sept. 26, 2019, 1:47 p.m. UTC | #13
Hello,

On Thu 26-09-19 18:04:55, Ritesh Harjani wrote:
> On 9/25/19 2:53 PM, Jan Kara wrote:
> > On Wed 25-09-19 01:18:04, Ritesh Harjani wrote:
> > > > read takes i_rwsem exclusive lock instead of shared, it is a win for your
> > > > workload... Argh, now checking code in fs/direct-io.c I think I can see the
> > > > difference. The trick in do_blockdev_direct_IO() is:
> > > > 
> > > >           if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
> > > >                   inode_unlock(dio->inode);
> > > >           if (dio->is_async && retval == 0 && dio->result &&
> > > >               (iov_iter_rw(iter) == READ || dio->result == count))
> > > >                   retval = -EIOCBQUEUED;
> > > >           else
> > > >                   dio_await_completion(dio);
> > > > 
> > > > So actually only direct IO read submission is protected by i_rwsem with
> > > > DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
> > > > 
> > > > After some thought I think the best solution for this is to just finally
> > > > finish the conversion of ext4 so that dioread_nolock is the only DIO path.
> > > 
> > > Sorry, I still didn't get this completely. Could you please explain a bit
> > > more?
> > 
> > Well, currently we have two different locking schemes for DIO - the
> > "normal" case and the "dioread_nolock" case. And the "normal" case really
> > only exists because buffered writeback needed to be more careful (so that
> > nolock DIO cannot expose stale data) and nobody did the effort to make that
> > work when blocksize < pagesize. But having two different locking schemes
> > for DIO is really confusing to users and a maintenance burden so we want to
> > get rid of the old scheme once the "dioread_nolock" scheme works for all
> > the configurations.
> 
> Agreed.
> 
> > > > With i_rwsem held in shared mode even for "unlocked" DIO, it should be
> > > > actually relatively simple and most of the dances with unwritten extents
> > > > shouldn't be needed anymore.
> > > 
> > > Again, maybe it's related to above comment. Could you please give some
> > > insights?
> > 
> > Now that we hold i_rwsem in shared mode for all of DIO, it isn't really
> > "unlocked" anymore. Which actually very much limits the races with buffered
> > writes and thus page writeback (because we flush page cache before doing
> > DIO).
> 
> So looking at the code again based on your inputs from above, we should
> be able to remove this condition <snip below> in ext4_dio_write_checks.
> 
> What I meant is, in DIO writes path we can start with shared lock
> (which the current patch is doing), and only check for below conditions
> <snip below> for it to continue using shared lock.
> (i.e. we need not check for ext4_should_dioread_nolock(inode) anymore).
> 
> That means there should not be any race for non-extent based mode
> too right?

Yes, that is correct. But please do this as a separate change with a good
explanation of why this is OK.

> Because for overwrites in DIO (for both extents & non-extents)-
> (just reiterating)
> 
> 1. Race against bufferedIO writes and DIO overwrites will be protected
> via SHARED inode lock in DIO overwrites & EXCL lock in bufferedIO
> writes. Plus we flush page cache before doing DIO.
> 
> 2. Race against bufferedIO reads & DIO overwrites will be anyway protected
> since we don't do any block allocations during DIO overwrite.
> 
> 3. Again race against DIO reads & DIO overwrites is not be a problem,
> since no block allocation is done anyway. So no stale data will be
> exposed.
> 
> 
> <snip of change we should do in ext4_dio_write_checks>
> 
>         if (*iolock == EXT4_IOLOCK_SHARED &&
>             (!IS_NOSEC(inode) || *unaligned_io || *extend ||
> -            !ext4_should_dioread_nolock(inode) ||
>              !ext4_overwrite_io(inode, offset, count))) {
>                 ext4_iunlock(inode, *iolock);
>                 *iolock = EXT4_IOLOCK_EXCL;
>                 ext4_ilock(inode, *iolock);
>                 goto restart;
>         }
> 
> > 
> > > Or do you mean that we should do it like this-
> > > So as of now in dioread_nolock, we allocate blocks, mark the entry into
> > > extents as unwritten, then do the data IO, and then finally do the
> > > conversion of unwritten to written extents.
> > 
> > So this allocation of extents as unwritten in dioread_nolock mode is now
> > mostly unnecessary. We hold i_rwsem for reading (or even for writing) while
> > submitting any DIO. Because we flush page cache before starting DIO and new
> > pages cannot be dirtied by buffered writes (i_rwsem), DIO cannot be racing
> > with page writeback and thus cannot expose stale block contents. There is
> > one exception though - which is why I wrote "mostly" above - pages can
> > still be dirtied through memory mappings (there's no i_rwsem protection for
> > mmap writes) and thus races with page writeback exposing stale data are still
> > theoretically possible. In fact the "normal" DIO mode has this kind of race
> > all the time ext4 exists.
> 
> Yes, now that you said I could see this below race even with current
> code. Any other race too?
> 
> i.e.
> ext4_dio_read			ext4_page_mkwrite()
> 
>     filemap_write_and_wait_range()
> 				     ext4_get_blocks()
> 
>     submit_bio
>     // this could expose the stale data.
> 		
> 				     mark_buffer_dirty()

Yes, this is what I meant. Although note that in most cases
ext4_get_blocks() from ext4_page_mkwrite() will just reserve delalloc
blocks, so you additionally need writeback to happen on that inode to
allocate the delalloc blocks; only after that can the ext4_iomap_begin() +
submit_bio() calls from iomap_dio_rw() expose stale data. So I
don't think the race is very easy to trigger.

> Ok. I guess we cannot use any exclusive inode lock in ext4_page_mkwrite,
> because then we loose the parallelism that it offers right now.
> Wonder how other FS protect this race (like XFS?)

You cannot get i_rwsem at all in ext4_page_mkwrite() because mm code
holds mmap_sem when calling into ext4_page_mkwrite(). And mmap_sem ranks
below i_rwsem. And the deadlock is actually real because e.g. direct IO
code holds i_rwsem and then grabs mmap_sem as part of
bio_iov_iter_get_pages().
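
i.e. the lock ordering would be:

	DIO:	i_rwsem -> mmap_sem (bio_iov_iter_get_pages())
	fault:	mmap_sem -> i_rwsem (if ext4_page_mkwrite() took it)

which is a classic ABBA deadlock.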

XFS always uses unwritten extents when writing out pages which prevents
this race.

> > I guess we could fix this by falling back to page writeback using unwritten
> > extents when we have dirty pages locked for writeback and see there's any
> > DIO in flight for the inode. Sadly that means we we cannot get rid of
> > writeback code using unwritten extents but at least it won't be hurting
> > performance in the common case.
> 
> 1. So why to check for dirty pages locked for writeback (in
> ext4_page_mkwrite)? we anyway lock the page which we modify or
> allocate block for. So race against bufferedRead should not happen.
> 
> 2. And even if we check for inode_dio_wait(), we still can't gurantee
> that the next DIO may not snoopin while we are in ext4_page_mkwrite?
> 
> I am definitely missing something here.

I was speaking about code in ext4_writepages(). The block mapping in
ext4_page_mkwrite() can always use unwritten extents - that is just a
fallback path in case delayed allocation is disabled or we are running out
of disk space. The code in ext4_writepages() needs to decide whether to
write back using unwritten extents or whether we can use normal ones. In
that place I suggest replacing the current "ext4_should_dioread_nolock()"
check with "is there any DIO in flight against the inode?". And to make the check
reliable (without holding i_rwsem), you need to do it only after you have
all pages locked for writeback.
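
A minimal sketch of what I mean (the generic inode->i_dio_count counter,
maintained by inode_dio_begin()/inode_dio_end(), can serve as the "DIO
in flight" signal; where exactly the check sits in the writepages code
is of course up for discussion):

	/*
	 * Sketch only: in the extent mapping done from ext4_writepages(),
	 * pick the writeback mode based on DIO in flight, evaluated only
	 * after the pages are locked for writeback, so a newly starting
	 * DIO must first flush & wait on those pages.
	 */
	get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
			   EXT4_GET_BLOCKS_METADATA_NOFAIL;
	if (atomic_read(&inode->i_dio_count))
		/* DIO in flight - write back via unwritten extents. */
		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;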

> > > So instead of that we first only reserve the disk blocks, (without
> > > making any on-disk changes in extent tree), do the data IO and then
> > > finally make an entry into extent tree on disk. And going
> > > forward only keep this as the default path.
> > > 
> > > The above is something I have been looking into for enabling
> > > dioread_nolock for powerpc platforms where blocksize < page_size.
> > > This is based upon an upstream discussion between Ted and you :)
> > 
> > Yes, this is even better solution to dioread_nolock "problem" but it is
> > also more work
> 
> Yes, I agree that we should do this incrementally.
> 
> > then just dropping the old DIO locking mode and
> > fix writeback using unwritten extents for blocksize < pagesize.
> 
> 
> So, I was looking into this (to support dioread_nolock for
> blocksize < pagesize) but I could not find any history in git,
> nor any thread which explains the problem.

Yeah, part of the problem is that we have only a rough understanding of
what the problem actually is :)

> I got below understanding while going over code -
> 
> 1. This problem may be somehow related to bufferheads in the
> extent conversion from unwritten to written in writeback path.
> But I couldn't see what exactly is the issue there?
> 
> I see that the completion path happens via
> ext4_end_io
>    |- ext4_convert_unwritten_extents() // offset & size is properly taken
> care.
>    |- ext4_release_io_end() (which is same in both DIO & writeback).
> 
> 
> 2. One thing which I noticed is the FIXME in
> mpage_map_and_submit_buffers(). Where we
> io->io_end->size we add the whole PAGE_SIZE.
> I guess we should update the size here carefully
> based on buffer_heads.
> 
> Could you please give some pointers as to what
> is the limitation. Or some hint which I can go
> and check by myself.

You are correct that the problem is in ext4_writepages() (and functions
called from there). Essentially the whole code there needs verification
whether unwritten extent handling works correctly when blocksize <
pagesize. As you noted, there are couple of FIXMEs there where I was aware
that things would break but there may be other problems I've missed.
I remember things can get really complex there when there are multiple
unwritten extents underlying the page and we need to write them out.

> > > But even with above, in case of extending writes, we still
> > > will have to zero out those extending blocks no? Which
> > > will require an exclusive inode lock anyways for zeroing.
> > > (same which has been done in XFS too).
> > 
> > Yes, extending writes are a different matter.
> > 
> > > So going with current discussed solution of mounting with
> > > dioread_nolock to provide performance scalability in mixed read/write
> > > workload should be also the right approach, no?
> > > Also looking at the numbers here [3] & [4], this patch also seems
> > > to improve the performance with dioread_nolock mount option.
> > > Please help me understand your thoughts on this.
> > 
> > Yes, your patches are definitely going in the right direction. What I'm
> > discussing is mostly how to make ext4 perform well for mixed read/write
> > workload by default without user having to use magic mount option.
> 
> Really thanks for your support here.
> 
> So can we get these patches upstream incrementally?
> i.e.
> 1. As a first step, we can remove this condition
> ext4_should_dioread_nolock from the current patchset
> (as mentioned above too) &  get this patch rebased
> on top of iomap pathset. We should be good to merge
> this patchset then, right? Since this should be able
> to improve the performance even without dioread_nolock
> mount option.

Yes, correct.

> 2. Meanwhile I will continue to check for blocksize < pagesize
> requirement for dioread_nolock. We can follow the plan
> as you mentioned above then.

Good.
								Honza