[v5,0/7] Extend write-hint framework, and add write-hint for Ext4 journal

Message ID 1556191202-3245-1-git-send-email-joshi.k@samsung.com

Message

Kanchan Joshi April 25, 2019, 11:19 a.m. UTC
V5 series, towards extending write-hint/streams infrastructure for
kernel-components, and adding support for sending write-hint with Ext4/JBD2 journal.

Here is the history/changelog -

Changes since v4:
- Removed write-hint field from request. bi_write_hint in bio is used for
  merging checks now.
- Modified write-hint-to-stream conversion logic. Now, kernel hints are mapped
  to upper range of stream-ids, while user-hints continue to remain mapped to
  lower range of stream-ids.
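
As an illustration only (this is not the code from the series), the v4-to-v5
mapping change can be sketched as below. The user hint values mirror enum
rw_hint; the value chosen for WRITE_LIFE_KERN_MIN, the function name
hint_to_stream() and the exact arithmetic are assumptions made for the sketch.

```c
/* User hint values as in enum rw_hint (include/linux/fs.h). */
enum write_hint_sketch {
	WRITE_LIFE_NOT_SET = 0,
	WRITE_LIFE_NONE    = 1,
	WRITE_LIFE_SHORT   = 2,
	WRITE_LIFE_MEDIUM  = 3,
	WRITE_LIFE_LONG    = 4,
	WRITE_LIFE_EXTREME = 5,
	/* Assumption: in-kernel hints begin right after the user hints. */
	WRITE_LIFE_KERN_MIN = 6,
};

/*
 * Hypothetical conversion: user hints fill stream-ids from the bottom
 * (SHORT -> 1, ..., EXTREME -> 4), kernel hints fill them from the top
 * (first kernel hint -> nr_streams, next -> nr_streams - 1, ...).
 * Stream 0 means "no stream".
 */
unsigned int hint_to_stream(unsigned int hint, unsigned int nr_streams)
{
	unsigned int stream;

	/* No hint, or a device without streams: nothing to tag. */
	if (hint <= WRITE_LIFE_NONE || nr_streams == 0)
		return 0;

	if (hint < WRITE_LIFE_KERN_MIN) {
		stream = hint - 1;               /* lower range */
	} else {
		unsigned int k = hint - WRITE_LIFE_KERN_MIN;

		if (k >= nr_streams)
			return 0;
		stream = nr_streams - k;         /* upper range */
	}

	/* Anything beyond what the device reported falls back to 0. */
	return stream <= nr_streams ? stream : 0;
}
```

With 8 reported streams, WRITE_LIFE_SHORT lands on stream 1 and the first
kernel hint on stream 8 in this sketch. On a device reporting very few streams
the two ranges can overlap, which the sketch does not try to resolve.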

Changes since v3:
- Correction in grouping related changes into patches
- Rectification in commit text at places

Changes since v2:
- Introduce API in block layer so that drivers can register stream info. Added
  new limit in request queue for this purpose.
- Block layer does the conversion from write-hint to stream-id.
- Stream feature is not disabled anymore if device reports fewer streams than
  a particular number (which was set as 4 earlier).
- Any write-hint beyond the reported stream-count turns to 0.
- New macro "WRITE_LIFE_KERN_MIN" can be used as base by kernel mode components.
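
A rough sketch of the v2-era idea of drivers registering their stream count
with the block layer is given below. The struct, field and function names here
are hypothetical stand-ins, not the API added by the series.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the new request-queue limit. */
struct queue_limits_sketch {
	unsigned short nr_streams;   /* 0: device reports no streams */
};

/* Hypothetical helper a driver (e.g. nvme) would call once at setup. */
void sketch_set_stream_limits(struct queue_limits_sketch *lim,
			      unsigned short nr_streams)
{
	lim->nr_streams = nr_streams;
}

/* Streams stay usable for any non-zero count; no minimum of 4. */
bool sketch_streams_enabled(const struct queue_limits_sketch *lim)
{
	return lim->nr_streams > 0;
}

/* A stream-id beyond the reported count turns to 0 (no stream). */
unsigned short sketch_clamp_stream(const struct queue_limits_sketch *lim,
				   unsigned short stream)
{
	return stream <= lim->nr_streams ? stream : 0;
}
```

For a device that registers only 2 streams, the feature stays enabled in this
sketch, stream 2 is passed through, and stream 3 degrades to 0.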

Changes since v1:
- Introduce four more hints for in-kernel use, as recommended by Dave Chinner
  & Jens Axboe. This isolates kernel-mode hints from user-mode ones.
- Remove mount-option to specify write-hint, as recommended by Jan Kara &
  Dave Chinner. Rather, FS always sets write-hint for journal. This gets ignored
  if device does not support streams.
- Removed code-redundancy for write_dirty_buffer (Jan Kara's review comment)

V4 patch:
https://lkml.org/lkml/2019/4/17/870

V3 patch:
https://marc.info/?l=linux-block&m=155384631909082&w=2

V2 patch:
https://patchwork.kernel.org/cover/10754405/

V1 patch:
https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2


Kanchan Joshi (7):
  fs: introduce write-hint start point for in-kernel hints
  block: increase stream count for in-kernel use
  block: introduce API to register stream information with block-layer
  block: introduce write-hint to stream-id conversion
  nvme: register stream info with block layer
  fs: introduce APIs to enable passing write-hint with buffer-head
  fs/ext4,jbd2: add support for sending write-hint with journal

 block/blk-core.c            | 29 ++++++++++++++++++++++++++++-
 block/blk-merge.c           |  4 ++--
 block/blk-settings.c        | 12 ++++++++++++
 drivers/nvme/host/core.c    | 23 ++++++-----------------
 fs/buffer.c                 | 18 ++++++++++++++++--
 fs/ext4/ext4_jbd2.h         |  1 +
 fs/ext4/super.c             |  2 ++
 fs/jbd2/commit.c            | 11 +++++++----
 fs/jbd2/journal.c           |  3 ++-
 fs/jbd2/revoke.c            |  3 ++-
 include/linux/blkdev.h      |  8 ++++++--
 include/linux/buffer_head.h |  3 +++
 include/linux/fs.h          |  2 ++
 include/linux/jbd2.h        |  8 ++++++++
 14 files changed, 97 insertions(+), 30 deletions(-)

Comments

Kanchan Joshi May 10, 2019, 5:31 a.m. UTC | #1
Hi Jens & other maintainers,

If this patch-set is in fine shape now, can it please be considered for merge in the near future?

Thanks,

Christoph Hellwig May 10, 2019, 5:02 p.m. UTC | #2
I think this fundamentally goes in the wrong direction.  We explicitly
designed the block layer infrastructure around life time hints and
not the not fish not flesh streams interface, which causes all kinds
of problems.

Including the one this model causes on at least some SSDs where you
now statically allocate resources to a stream that is now not globally
available.  All for the little log with very short data lifetime that
any half decent hot/cold partitioning algorithm in the SSD should be
able to detect.
Kanchan Joshi May 17, 2019, 5:31 a.m. UTC | #3
Hi Christoph, 

> Including the one this model causes on at least some SSDs where you now
> statically allocate resources to a stream that is now not globally
> available.

Sorry, but can you please elaborate on the issue? I do not get what is being
statically allocated which was globally available earlier.
If you are referring to the nvme driver, available streams at subsystem level
are being reflected for all namespaces. This is the same as earlier.
There is no attempt to explicitly allocate (using dir-receive) or reserve
streams for any namespace.
Streams will continue to get allocated/released implicitly as and when
writes (with stream id) arrive.

> All for the little log with very short data lifetime that any half decent
> hot/cold partitioning algorithm in the SSD should be able to detect.

With streams, hot/cold segregation happens at the time of placement
itself, without any algorithm; that is a clear win over algorithms which take
time/computation to do the same.
And the infrastructure update (write-hint-to-stream-id conversion in
block-layer, in-kernel hints etc.) seems to be required anyway for streams
to extend their reach beyond nvme and user-space hints.
  
Thanks,

Christoph Hellwig May 20, 2019, 2:27 p.m. UTC | #4
On Fri, May 17, 2019 at 11:01:55AM +0530, kanchan wrote:
> Sorry but can you please elaborate the issue? I do not get what is being
> statically allocated which was globally available earlier.
> If you are referring to nvme driver,  available streams at subsystem level
> are being reflected for all namespaces. This is same as earlier. 
> There is no attempt to explicitly allocate (using dir-receive) or reserve
> streams for any namespace.  
> Streams will continue to get allocated/released implicitly as and when
> writes (with stream id) arrive.

We have made a conscious decision that we do not want to expose streams
as an awkward not fish not flesh interface, but instead life time hints.

I see no reason to change from that and burden other in-kernel callers
with the whole streams complexity.
Jan Kara May 21, 2019, 8:25 a.m. UTC | #5
On Mon 20-05-19 07:27:19, 'Christoph Hellwig' wrote:
> On Fri, May 17, 2019 at 11:01:55AM +0530, kanchan wrote:
> > Sorry but can you please elaborate the issue? I do not get what is being
> > statically allocated which was globally available earlier.
> > If you are referring to nvme driver,  available streams at subsystem level
> > are being reflected for all namespaces. This is same as earlier. 
> > There is no attempt to explicitly allocate (using dir-receive) or reserve
> > streams for any namespace.  
> > Streams will continue to get allocated/released implicitly as and when
> > writes (with stream id) arrive.
> 
> We have made a conscious decision that we do not want to expose streams
> as an awkward not fish not flesh interface, but instead life time hints.
> 
> I see no reason to change from that and burden other in-kernel callers
> with the whole streams complexity.

I'm not following the "streams complexity" you talk about. At least the
use case Kanchan speaks about here is pretty simple for the filesystem -
tagging journal writes with special stream id. I agree that something like
dynamically allocating available stream ids to different purposes is
complex and has uncertain value but this "static stream id for particular
purpose" looks simple and sensible to me and Kanchan has shown significant
performance benefits for some drives. After all you can just think about it
like RWH_WRITE_LIFE_JOURNAL type of hint available for the kernel...

								Honza
Christoph Hellwig May 21, 2019, 8:28 a.m. UTC | #6
On Tue, May 21, 2019 at 10:25:28AM +0200, Jan Kara wrote:
> performance benefits for some drives. After all you can just think about it
> like RWH_WRITE_LIFE_JOURNAL type of hint available for the kernel...

Except that it actually adds a parallel infrastructure.  A
RWH_WRITE_LIFE_JOURNAL would be much more palatable, but someone needs
to explain how that is:

 a) different from RWH_WRITE_LIFE_SHORT
 b) would not apply to a log/journal maintained in userspace that works
    exactly the same
Jan Kara May 22, 2019, 10:25 a.m. UTC | #7
On Tue 21-05-19 01:28:46, 'Christoph Hellwig' wrote:
> On Tue, May 21, 2019 at 10:25:28AM +0200, Jan Kara wrote:
> > performance benefits for some drives. After all you can just think about it
> > like RWH_WRITE_LIFE_JOURNAL type of hint available for the kernel...
> 
> Except that it actually adds a parallel infrastructure.  A
> RWH_WRITE_LIFE_JOURNAL would be much more palatable, but someone needs
> to explain how that is:
> 
>  a) different from RWH_WRITE_LIFE_SHORT

The problem I have with this is: What does "short" mean? What if
userspace's notion of short differs from the kernel notion? Also the
journal block lifetime is somewhat hard to predict. It depends on the size
of the journal and metadata load on the filesystem so there's big variance.
So all we really know is that all journal blocks are the same.

>  b) would not apply to a log/journal maintained in userspace that works
>     exactly the same

Lifetime of userspace journal/log may be significantly different from the
lifetime of the filesystem journal. So using the same hint for them does
not look like a great idea?

								Honza
Kanchan Joshi June 26, 2019, 12:47 p.m. UTC | #8
Christoph, 
May I know if you have thoughts about what Jan mentioned?

I reflected upon the whole series again, and here is my understanding of
your concern (I hope to address that, once I get it right).
The current patch-set targeted adding two things -
1. Extend write-hint infra for in-kernel callers
2. Send write-hint for FS-journal

In the process of doing 1, write-hint gets more closely connected to streams
(as hint-to-stream conversion moves to block-layer).
And perhaps this is something that you object to.
Whether write-hint converts into flash-stream or into something else is
deliberately left to the device-driver, and that's why the block layer does not
have a hint-to-stream conversion in the first place.
Is this the correct understanding of why things are the way they are?

On 2, sending write-hint for FS journal is actually important, as there is
clear data on both performance and endurance benefits.
A RWH_WRITE_LIFE_JOURNAL or REQ_JOURNAL (that Martin Petersen suggested) kind
of thing will help in identifying journal I/O, which can be useful for other
purposes (than streams) as well.
I saw this LSFMM coverage https://lwn.net/Articles/788721/ , and felt that
this could be useful for turbo-write in UFS.

BR,
Kanchan

Christoph Hellwig June 28, 2019, 7:25 a.m. UTC | #9
On Wed, Jun 26, 2019 at 06:17:29PM +0530, kanchan wrote:
> Christoph, 
> May I know if you have thoughts about what Jan mentioned below? 

As said, I fundamentally disagree with exposing the streams mess at
the block layer.  I have no problem with setting a hint on the journal,
but I do object to exposing the streams mess even more.