Patchwork [4/4] Support discard if at least one underlying device supports it

Submitter Mikulas Patocka
Date July 2, 2010, 8 p.m.
Message ID <Pine.LNX.4.64.1007021554550.11102@hs20-bc2-1.build.redhat.com>
Permalink /patch/57772/
State New

Comments

Mikulas Patocka - July 2, 2010, 8 p.m.
On Fri, 2 Jul 2010, Mikulas Patocka wrote:

> > As we discussed, we have a challenge where we need DM to avoid issuing
> > a barrier before the discard IFF a target doesn't support the discard
> > (which the barrier is paired with).
> > 
> > My understanding is that blkdev_issue_discard() only cares if the
> > discard was supported.  Barrier is used just to decorate the discard
> > (for correctness).  So by returning -EOPNOTSUPP we're saying the discard
> > isn't supported; we're not making any claims about the implicit barrier,
> > so best to avoid the barrier entirely.
> > 
> > Otherwise we'll be issuing unnecessary barriers (and associated
> > performance loss).
> > 
> > So yet another TODO item... Anyway:
> > 
> > Acked-by: Mike Snitzer <snitzer@redhat.com>
> 
> Unnecessary barriers are issued anyway. With each freed extent.
> 
> The code must issue a "SYNCHRONIZE CACHE" to flush cache for previous 
> writes, then "UNMAP" and then another "SYNCHRONIZE CACHE" to commit that 
> unmap to disk. And this in loop for all extents in 
> "release_blocks_on_commit".
> 
> One idea behind "discard barriers" was to submit a discard request and not 
> wait for it. Then the request would need a barrier so that it doesn't get 
> reordered with further writes (that may potentially write to the same area 
> as the discarded area). But discard isn't used this way anyway, 
> sb_issue_discard waits for completion, so the barrier isn't needed.
> 
> Even if ext4 developers wanted asynchronous discard requests, they should 
> fire all the discards at once and then submit one zero-sized barrier. Not 
> barrier with each discard request.
> 
> This is up to ext4 developers to optimize and remove the barriers and we 
> can't do anything with it. Just send "SYNCHRONIZE 
> CACHE"+"UNMAP"+"SYNCHRONIZE CACHE" like the barrier specification wants...
> 
> Mikulas

BTW. I understand that the current dm implementation will send two useless 
consecutive "SYNCHRONIZE CACHE" commands when a discard is directed to a part 
of the device that doesn't support it.

But the problem is that when you use discard on a part of the device that 
supports discard, it also sends two useless "SYNCHRONIZE CACHE" commands 
--- they are useless for functionality, but mandated by the barrier 
specification.
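
To put numbers on the sequences discussed above, here is a minimal user-space sketch, not kernel code. It assumes each barrier discard turns into "SYNCHRONIZE CACHE" + "UNMAP" + "SYNCHRONIZE CACHE" and that a discard without the barrier flag is a lone "UNMAP". The MODEL_* names, the flag values and both functions are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for this model only; not the kernel's values. */
#define MODEL_IFL_WAIT    (1u << 0)
#define MODEL_IFL_BARRIER (1u << 1)

/* Tally of the commands the modelled device would receive. */
struct cmd_count {
	size_t sync_cache;	/* SYNCHRONIZE CACHE commands */
	size_t unmap;		/* UNMAP commands */
};

/* One discard: a barrier wraps the UNMAP in a cache flush on each side. */
void model_issue_discard(struct cmd_count *c, unsigned int flags)
{
	assert(c != NULL);
	if (flags & MODEL_IFL_BARRIER)
		c->sync_cache++;	/* flush previous writes */
	c->unmap++;
	if (flags & MODEL_IFL_BARRIER)
		c->sync_cache++;	/* commit the unmap */
}

/* Free nr_extents extents, issuing one discard per extent. */
struct cmd_count model_release_extents(size_t nr_extents, unsigned int flags)
{
	struct cmd_count c = { 0, 0 };
	size_t i;

	for (i = 0; i < nr_extents; i++)
		model_issue_discard(&c, flags);
	return c;
}
```

Under these assumptions, freeing 8 extents with MODEL_IFL_WAIT | MODEL_IFL_BARRIER costs 16 cache flushes on top of the 8 unmaps; with MODEL_IFL_WAIT alone the flushes disappear.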

The fix is supposedly this:

---
 include/linux/blkdev.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka - July 2, 2010, 8:08 p.m.
> ---
>  include/linux/blkdev.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6.35-rc3-fast/include/linux/blkdev.h
> ===================================================================
> --- linux-2.6.35-rc3-fast.orig/include/linux/blkdev.h	2010-07-02 21:59:21.000000000 +0200
> +++ linux-2.6.35-rc3-fast/include/linux/blkdev.h	2010-07-02 21:59:37.000000000 +0200
> @@ -1021,7 +1021,7 @@ static inline int sb_issue_discard(struc
>  	block <<= (sb->s_blocksize_bits - 9);
>  	nr_blocks <<= (sb->s_blocksize_bits - 9);
>  	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL,
> -				   BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
> +				   BLKDEV_IFL_WAIT);
>  }
>  
>  extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
> 

A note for ext4 developers: is that GFP_KERNEL safe? Can't it recurse back 
to ext4 and attempt to flush more data?

I'm not familiar enough with ext4 to declare that it is/isn't a bug, but 
it looks suspicious.
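
For readers unfamiliar with the concern: in the kernel, GFP_KERNEL includes __GFP_FS, which allows memory reclaim to call back into filesystem code, while GFP_NOFS leaves that bit out. The sketch below is a user-space model of just that relationship. The MODEL_* names and bit values are made up for illustration, and it says nothing about whether the callers of sb_issue_discard are actually at risk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; only the relationships mirror the kernel. */
#define MODEL_GFP_WAIT	(1u << 0)	/* allocation may sleep */
#define MODEL_GFP_IO	(1u << 1)	/* reclaim may start block I/O */
#define MODEL_GFP_FS	(1u << 2)	/* reclaim may call into the filesystem */

#define MODEL_GFP_KERNEL (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS	 (MODEL_GFP_WAIT | MODEL_GFP_IO)

/* True if an allocation with this mask could recurse into filesystem code. */
bool model_may_enter_fs(unsigned int gfp_mask)
{
	return (gfp_mask & MODEL_GFP_FS) != 0;
}
```

In this model, model_may_enter_fs(MODEL_GFP_KERNEL) is true and model_may_enter_fs(MODEL_GFP_NOFS) is false, which is the distinction the question is about.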

Mikulas
Mike Snitzer - July 2, 2010, 8:47 p.m.
On Fri, Jul 02 2010 at  4:00pm -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Fri, 2 Jul 2010, Mikulas Patocka wrote:
> 
> > > As we discussed, we have a challenge where we need DM to avoid issuing
> > > a barrier before the discard IFF a target doesn't support the discard
> > > (which the barrier is paired with).
> > > 
> > > My understanding is that blkdev_issue_discard() only cares if the
> > > discard was supported.  Barrier is used just to decorate the discard
> > > (for correctness).  So by returning -EOPNOTSUPP we're saying the discard
> > > isn't supported; we're not making any claims about the implicit barrier,
> > > so best to avoid the barrier entirely.
> > > 
> > > Otherwise we'll be issuing unnecessary barriers (and associated
> > > performance loss).
> > > 
> > > So yet another TODO item... Anyway:
> > > 
> > > Acked-by: Mike Snitzer <snitzer@redhat.com>
> > 
> > Unnecessary barriers are issued anyway. With each freed extent.
> > 
> > The code must issue a "SYNCHRONIZE CACHE" to flush cache for previous 
> > writes, then "UNMAP" and then another "SYNCHRONIZE CACHE" to commit that 
> > unmap to disk. And this in loop for all extents in 
> > "release_blocks_on_commit".
> > 
> > One idea behind "discard barriers" was to submit a discard request and not 
> > wait for it. Then the request would need a barrier so that it doesn't get 
> > reordered with further writes (that may potentially write to the same area 
> > as the discarded area). But discard isn't used this way anyway, 
> > sb_issue_discard waits for completion, so the barrier isn't needed.
> > 
> > Even if ext4 developers wanted asynchronous discard requests, they should 
> > fire all the discards at once and then submit one zero-sized barrier. Not 
> > barrier with each discard request.
> > 
> > This is up to ext4 developers to optimize and remove the barriers and we 
> > can't do anything with it. Just send "SYNCHRONIZE 
> > CACHE"+"UNMAP"+"SYNCHRONIZE CACHE" like the barrier specification wants...
> > 
> > Mikulas
> 
> BTW. I understand that the current dm implementation will send two useless 
> consecutive "SYNCHRONIZE CACHE" commands when a discard is directed to a part 
> of the device that doesn't support it.

Issue 1 ^^^

> But the problem is that when you use discard on a part of the device that 
> supports discard, it also sends two useless "SYNCHRONIZE CACHE" commands 
> --- they are useless for functionality, but mandated by the barrier 
> specification.

Issue 2 ^^^

Those are 2 different issues.  Please don't join them as if they are one
and the same.  DM should treat a discard as a first class request (which
may or may not have a barrier).  If a region doesn't support the discard
DM has no business processing anything related to the discard (barriers
included).  It is as simple as that.

> The fix is supposedly this:
> 
> ---
>  include/linux/blkdev.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6.35-rc3-fast/include/linux/blkdev.h
> ===================================================================
> --- linux-2.6.35-rc3-fast.orig/include/linux/blkdev.h	2010-07-02 21:59:21.000000000 +0200
> +++ linux-2.6.35-rc3-fast/include/linux/blkdev.h	2010-07-02 21:59:37.000000000 +0200
> @@ -1021,7 +1021,7 @@ static inline int sb_issue_discard(struc
>  	block <<= (sb->s_blocksize_bits - 9);
>  	nr_blocks <<= (sb->s_blocksize_bits - 9);
>  	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL,
> -				   BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
> +				   BLKDEV_IFL_WAIT);
>  }
>  
>  extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);

Hmm, older kernels use DISCARD_FL_BARRIER which merely mapped to
BLKDEV_IFL_BARRIER.

Seems you've stumbled onto a bug in the conversion that commit
"blkdev: generalize flags for blkdev_issue_fn functions"
(fbd9b09a177a481eda) performed?

That commit seems to have incorrectly replaced DISCARD_FL_BARRIER with
both: BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER

Dmitry and/or Jens was this intended?

Mike
Alasdair G Kergon - July 2, 2010, 8:54 p.m.
On Fri, Jul 02, 2010 at 04:47:09PM -0400, Mike Snitzer wrote:

> If a region doesn't support the discard
> DM has no business processing anything related to the discard (barriers
> included).  It is as simple as that.
 
Indeed - if an I/O is going to fail, discover that as early as we can, trying
to avoid the relatively-expensive barrier process whenever we reasonably can.

Alasdair

Dmitry Monakhov - July 5, 2010, 7:03 a.m.
Mike Snitzer <snitzer@redhat.com> writes:

> On Fri, Jul 02 2010 at  4:00pm -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>> 
>> 
>> On Fri, 2 Jul 2010, Mikulas Patocka wrote:
>> 
>> > > As we discussed, we have a challenge where we need DM to avoid issuing
>> > > a barrier before the discard IFF a target doesn't support the discard
>> > > (which the barrier is paired with).
>> > > 
>> > > My understanding is that blkdev_issue_discard() only cares if the
>> > > discard was supported.  Barrier is used just to decorate the discard
>> > > (for correctness).  So by returning -EOPNOTSUPP we're saying the discard
>> > > isn't supported; we're not making any claims about the implicit barrier,
>> > > so best to avoid the barrier entirely.
>> > > 
>> > > Otherwise we'll be issuing unnecessary barriers (and associated
>> > > performance loss).
>> > > 
>> > > So yet another TODO item... Anyway:
>> > > 
>> > > Acked-by: Mike Snitzer <snitzer@redhat.com>
>> > 
>> > Unnecessary barriers are issued anyway. With each freed extent.
>> > 
>> > The code must issue a "SYNCHRONIZE CACHE" to flush cache for previous 
>> > writes, then "UNMAP" and then another "SYNCHRONIZE CACHE" to commit that 
>> > unmap to disk. And this in loop for all extents in 
>> > "release_blocks_on_commit".
>> > 
>> > One idea behind "discard barriers" was to submit a discard request and not 
>> > wait for it. Then the request would need a barrier so that it doesn't get 
>> > reordered with further writes (that may potentially write to the same area 
>> > as the discarded area). But discard isn't used this way anyway, 
>> > sb_issue_discard waits for completion, so the barrier isn't needed.
>> > 
>> > Even if ext4 developers wanted asynchronous discard requests, they should 
>> > fire all the discards at once and then submit one zero-sized barrier. Not 
>> > barrier with each discard request.
>> > 
>> > This is up to ext4 developers to optimize and remove the barriers and we 
>> > can't do anything with it. Just send "SYNCHRONIZE 
>> > CACHE"+"UNMAP"+"SYNCHRONIZE CACHE" like the barrier specification wants...
>> > 
>> > Mikulas
>> 
>> BTW. I understand that the current dm implementation will send two useless 
>> consecutive "SYNCHRONIZE CACHE" commands when a discard is directed to a part 
>> of the device that doesn't support it.
>
> Issue 1 ^^^
>
>> But the problem is that when you use discard on a part of the device that 
>> supports discard, it also sends two useless "SYNCHRONIZE CACHE" commands 
>> --- they are useless for functionality, but mandated by the barrier 
>> specification.
>
> Issue 2 ^^^
>
> Those are 2 different issues.  Please don't join them as if they are one
> and the same.  DM should treat a discard as a first class request (which
> may or may not have a barrier).  If a region doesn't support the discard
> DM has no business processing anything related to the discard (barriers
> included).  It is as simple as that.
>
>> The fix is supposedly this:
>> 
>> ---
>>  include/linux/blkdev.h |    2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> Index: linux-2.6.35-rc3-fast/include/linux/blkdev.h
>> ===================================================================
>> --- linux-2.6.35-rc3-fast.orig/include/linux/blkdev.h	2010-07-02 21:59:21.000000000 +0200
>> +++ linux-2.6.35-rc3-fast/include/linux/blkdev.h	2010-07-02 21:59:37.000000000 +0200
>> @@ -1021,7 +1021,7 @@ static inline int sb_issue_discard(struc
>>  	block <<= (sb->s_blocksize_bits - 9);
>>  	nr_blocks <<= (sb->s_blocksize_bits - 9);
>>  	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL,
>> -				   BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
>> +				   BLKDEV_IFL_WAIT);
>>  }
>>  
>>  extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
>
> Hmm, older kernels use DISCARD_FL_BARRIER which merely mapped to
> BLKDEV_IFL_BARRIER.
>
> Seems you've stumbled onto a bug in the conversion that commit
> "blkdev: generalize flags for blkdev_issue_fn functions"
> (fbd9b09a177a481eda) performed?
>
> That commit seems to have incorrectly replaced DISCARD_FL_BARRIER with
> both: BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER
>
> Dmitry and/or Jens was this intended?
Yes. Before the patch, the WAIT behavior was implicit; now the caller is
responsible for the exact behavior.
So, as was mentioned earlier, it is reasonable to send discard requests
with only the BLKDEV_IFL_BARRIER flag from some places in ext4. I have
optimization patches for that in my queue; I hope they will be ready
soon.
>
> Mike
Mikulas Patocka - July 5, 2010, 11:32 a.m.
On Fri, 2 Jul 2010, Mike Snitzer wrote:

> On Fri, Jul 02 2010 at  4:00pm -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Fri, 2 Jul 2010, Mikulas Patocka wrote:
> > 
> > > > As we discussed, we have a challenge where we need DM to avoid issuing
> > > > a barrier before the discard IFF a target doesn't support the discard
> > > > (which the barrier is paired with).
> > > > 
> > > > My understanding is that blkdev_issue_discard() only cares if the
> > > > discard was supported.  Barrier is used just to decorate the discard
> > > > (for correctness).  So by returning -EOPNOTSUPP we're saying the discard
> > > > isn't supported; we're not making any claims about the implicit barrier,
> > > > so best to avoid the barrier entirely.
> > > > 
> > > > Otherwise we'll be issuing unnecessary barriers (and associated
> > > > performance loss).
> > > > 
> > > > So yet another TODO item... Anyway:
> > > > 
> > > > Acked-by: Mike Snitzer <snitzer@redhat.com>
> > > 
> > > Unnecessary barriers are issued anyway. With each freed extent.
> > > 
> > > The code must issue a "SYNCHRONIZE CACHE" to flush cache for previous 
> > > writes, then "UNMAP" and then another "SYNCHRONIZE CACHE" to commit that 
> > > unmap to disk. And this in loop for all extents in 
> > > "release_blocks_on_commit".
> > > 
> > > One idea behind "discard barriers" was to submit a discard request and not 
> > > wait for it. Then the request would need a barrier so that it doesn't get 
> > > reordered with further writes (that may potentially write to the same area 
> > > as the discarded area). But discard isn't used this way anyway, 
> > > sb_issue_discard waits for completion, so the barrier isn't needed.
> > > 
> > > Even if ext4 developers wanted asynchronous discard requests, they should 
> > > fire all the discards at once and then submit one zero-sized barrier. Not 
> > > barrier with each discard request.
> > > 
> > > This is up to ext4 developers to optimize and remove the barriers and we 
> > > can't do anything with it. Just send "SYNCHRONIZE 
> > > CACHE"+"UNMAP"+"SYNCHRONIZE CACHE" like the barrier specification wants...
> > > 
> > > Mikulas
> > 
> > BTW. I understand that the current dm implementation will send two useless 
> > consecutive "SYNCHRONIZE CACHE" commands when a discard is directed to a part 
> > of the device that doesn't support it.
> 
> Issue 1 ^^^
> 
> > But the problem is that when you use discard on a part of the device that 
> > supports discard, it also sends two useless "SYNCHRONIZE CACHE" commands 
> > --- they are useless for functionality, but mandated by the barrier 
> > specification.
> 
> Issue 2 ^^^
> 
> Those are 2 different issues.  Please don't join them as if they are one
> and the same.  DM should treat a discard as a first class request (which
> may or may not have a barrier).

What I mean --- if you fix Issue 2, Issue 1 is no longer a problem.

> If a region doesn't support the discard
> DM has no business processing anything related to the discard (barriers
> included).  It is as simple as that.

You can optimize out the second SYNCHRONIZE CACHE, but not the first one 
(because when it is sent, we don't know if the discard will succeed or 
not).

Basically, the fix is to prefix the second dm_flush in process_barrier 
with if (md->barrier_error != -EOPNOTSUPP).
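
A rough user-space model of that guard follows; it is a sketch, not the dm code. Only the names process_barrier, dm_flush and barrier_error come from the text above. struct model_md, the model_* functions and the target_supports_discard parameter are hypothetical stand-ins, and the real function does considerably more.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the mapped device state. */
struct model_md {
	int barrier_error;	/* result of the barrier request's payload */
	size_t flushes;		/* number of cache flushes issued so far */
};

/* Stand-in for dm_flush: here it only counts the flush. */
void model_dm_flush(struct model_md *md)
{
	md->flushes++;
}

/* Barrier+discard: flush, attempt the discard, flush again unless unsupported. */
int model_process_barrier(struct model_md *md, bool target_supports_discard)
{
	assert(md != NULL);
	md->barrier_error = 0;

	/* First flush: sent before we know whether the discard will succeed. */
	model_dm_flush(md);

	/* The payload: fails on a region that cannot discard. */
	if (!target_supports_discard)
		md->barrier_error = -EOPNOTSUPP;

	/* Proposed guard: skip the second flush if the discard was unsupported. */
	if (md->barrier_error != -EOPNOTSUPP)
		model_dm_flush(md);

	return md->barrier_error;
}
```

In this model an unsupported region costs one flush instead of two, and a supported region still gets both.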

The "barrier+discard" concept is problematic anyway. If we specify that a 
"barrier+discard" request doesn't have to perform the barrier when the 
discard fails (as is currently the case), then the request is useless for 
maintaining disk integrity, because the discard may fail at any time (and 
the barrier with it).

I think they will eventually remove "barrier+discard" from the filesystems 
altogether.

Mikulas

Patch

Index: linux-2.6.35-rc3-fast/include/linux/blkdev.h
===================================================================
--- linux-2.6.35-rc3-fast.orig/include/linux/blkdev.h	2010-07-02 21:59:21.000000000 +0200
+++ linux-2.6.35-rc3-fast/include/linux/blkdev.h	2010-07-02 21:59:37.000000000 +0200
@@ -1021,7 +1021,7 @@  static inline int sb_issue_discard(struc
 	block <<= (sb->s_blocksize_bits - 9);
 	nr_blocks <<= (sb->s_blocksize_bits - 9);
 	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL,
-				   BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+				   BLKDEV_IFL_WAIT);
 }
 
 extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);