[PATCHSET,block#for-2.6.36-post] block: replace barrier with sequenced flush

Message ID 4C655BE5.4070809@kernel.org
State Not Applicable
Delegated to: David Miller

Commit Message

Tejun Heo Aug. 13, 2010, 2:51 p.m. UTC
Hello,

On 08/13/2010 04:38 PM, Christoph Hellwig wrote:
> On Fri, Aug 13, 2010 at 03:48:59PM +0200, Tejun Heo wrote:
>> There are two reason to avoid changing the meaning of REQ_HARDBARRIER
>> and just deprecate it.  One is to avoid breaking filesystems'
>> expectations underneath it.  Please note that there are out-of-tree
>> filesystems too.  I think it would be too dangerous to relax
>> REQ_HARDBARRIER.
> 
> Note that the renaming patch would include a move from REQ_HARDBARRIER
> to REQ_FLUSH_FUA, so things just using REQ_HARDBARRIER will fail to
> compile.  And while out-of-tree filesystems do exist, it's their
> problem to keep up with kernel changes.  They decided not to be part
> of the Linux kernel, so it'll be their job to keep up with it.

Oh, right, we can simply remove REQ_HARDBARRIER completely.

>> Another is that pseudo block layer drivers (loop, virtio_blk,
>> md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
>> would be broken in obscure ways between REQ_HARDBARRIER semantics
>> change and updates to each of those drivers, so I don't really think
>> changing the semantics while the mechanism is online is a good idea.
> 
> I don't think doing those changes in a separate commit is a good idea.

Do you want to change the whole thing in a single commit?  That would
be a pretty big invasive patch touching multiple subsystems.  Also, I
don't know what to do about drbd and would like to leave its
conversion to the maintainer (in separate patches).

Eh, well, this is mostly logistics.  Jens, what do you think?

>>> Then we can do patches to disable the reiserfs barrier "optimization" as
>>> the very first one, and DM/MD support which I'm currently working on
>>> as the last one and we can start doing the heavy testing.
>>
>> Oops, I've already converted loop, virtio_blk/lguest and am working on
>> md/dm right now too.  I'm almost done with md and now doing dm. :-)
>> Maybe we should post them right now so that we don't waste too much
>> time trying to solve the same problems?
> 
> Here's the dm patch.  It only handles normal bio based dm yet, which
> I understand and can test.  request  based dm (multipath) still needs
> work.

Here's the combined patch I've been working on.  I've verified loop
and virtio_blk/lguest.  I just (like five mins ago) got md/dm conversion
compiling, so I'm sure they're broken.  The neat part is that thanks
to the separation between REQ_FLUSH and FUA handling, bio mangling
drivers only have to sequence the pre-flush and pass FUA directly to
lower layers which in many cases saves an array-wide cache flush
cycle.
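
To illustrate that point, here is a minimal sketch (hypothetical helper
names, not code from the patch) of what the split means for a
bio-remapping driver:

#include <linux/bio.h>
#include <linux/blk_types.h>

/*
 * Sketch only: the driver has to sequence the pre-flush itself, but
 * REQ_FUA can simply travel with the bio down to the member devices.
 * struct remap_dev, issue_flush_to_members(), issue_flush_then_resubmit()
 * and remap_and_submit() are hypothetical.
 */
static int remap_make_request_sketch(struct remap_dev *d, struct bio *bio)
{
        if (bio->bi_rw & REQ_FLUSH) {
                if (!bio_has_data(bio)) {
                        /* empty flush: flush every member, then end the bio */
                        issue_flush_to_members(d, bio);
                        return 0;
                }
                /* FLUSH + data: pre-flush first, resubmit the data part
                 * (with REQ_FLUSH cleared) once the pre-flush completes */
                issue_flush_then_resubmit(d, bio);
                return 0;
        }

        /* REQ_FUA needs no sequencing here - just pass it down, which is
         * what saves the array-wide cache flush cycle mentioned above */
        remap_and_submit(d, bio);
        return 0;
}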

After getting this patch working, the only remaining bits would be
blktrace and drbd.

Thanks.

 Documentation/lguest/lguest.c   |   36 +++-----
 drivers/block/loop.c            |   18 ++--
 drivers/block/virtio_blk.c      |   26 ++---
 drivers/md/dm-io.c              |   20 ----
 drivers/md/dm-log.c             |    2
 drivers/md/dm-raid1.c           |    8 -
 drivers/md/dm-snap-persistent.c |    2
 drivers/md/dm.c                 |  176 +++++++++++++++++++--------------------
 drivers/md/linear.c             |    4
 drivers/md/md.c                 |  117 +++++---------------------
 drivers/md/md.h                 |   23 +----
 drivers/md/multipath.c          |    4
 drivers/md/raid0.c              |    4
 drivers/md/raid1.c              |  178 +++++++++++++---------------------------
 drivers/md/raid1.h              |    2
 drivers/md/raid10.c             |    6 -
 drivers/md/raid5.c              |   18 +---
 include/linux/virtio_blk.h      |    6 +
 18 files changed, 244 insertions(+), 406 deletions(-)

Comments

Christoph Hellwig Aug. 14, 2010, 10:36 a.m. UTC | #1
On Fri, Aug 13, 2010 at 04:51:17PM +0200, Tejun Heo wrote:
> Do you want to change the whole thing in a single commit?  That would
> be a pretty big invasive patch touching multiple subsystems.

We can just stop draining in the block layer in the first patch, then
stop doing the stuff in md/dm/etc in the following ones and then do the
final renaming patches.  It would still be fewer patches than now, but
keep things working through the whole transition, which would really
help bisecting any problems.

> +			if (req->cmd_flags & REQ_FUA)
> +				vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;

I'd suggest not adding FUA support to virtio yet.  Just using the flush
feature gives you a fully working barrier implementation.

Eventually we might want to add a flag in the block queue to send
REQ_FLUSH|REQ_FUA request through to virtio directly so that we can
avoid separate pre- and post flushes, but I really want to benchmark if
it makes an impact on real life setups first.
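
In other words (sketch only, mirroring the field names in the hunk
above, and assuming the device negotiated the flush feature), the
flush-only path in a virtio_blk-style request setup would be just:

/*
 * Sketch: flush-only handling, assuming VIRTIO_BLK_F_FLUSH has been
 * negotiated.  Not the actual patch hunk.
 */
if (req->cmd_flags & REQ_FLUSH) {
        vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
        vbr->out_hdr.sector = 0;
        vbr->out_hdr.ioprio = req_get_ioprio(req);
} else {
        vbr->out_hdr.type = 0;
        vbr->out_hdr.sector = blk_rq_pos(req);
        vbr->out_hdr.ioprio = req_get_ioprio(req);
}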

> Index: block/drivers/md/linear.c
> ===================================================================
> --- block.orig/drivers/md/linear.c
> +++ block/drivers/md/linear.c
> @@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
>  	dev_info_t *tmp_dev;
>  	sector_t start_sector;
> 
> -	if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
> -		md_barrier_request(mddev, bio);
> +	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
> +		md_flush_request(mddev, bio);

We only need the special md_flush_request handling for
empty REQ_FLUSH requests.  REQ_WRITE | REQ_FLUSH just need the
flag propagated to the underlying devices.

> +static void md_end_flush(struct bio *bio, int err)
>  {
>  	mdk_rdev_t *rdev = bio->bi_private;
>  	mddev_t *mddev = rdev->mddev;
> 
>  	rdev_dec_pending(rdev, mddev);
> 
>  	if (atomic_dec_and_test(&mddev->flush_pending)) {
> +		/* The pre-request flush has finished */
> +		schedule_work(&mddev->flush_work);

Once we only handle empty barriers here we can directly call bio_endio
and the super wakeup instead of first scheduling a work queue.

>  	while ((bio = bio_list_pop(writes))) {
> -		if (unlikely(bio_empty_barrier(bio))) {
> +		if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {

I kept bio_empty_barrier as bio_empty_flush, which actually is a quite
useful macro for the bio based drivers.
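
Presumably something along these lines (a sketch mirroring the
open-coded test in the hunk above, not necessarily the exact
definition):

#define bio_empty_flush(bio) \
        (((bio)->bi_rw & REQ_FLUSH) && !bio_has_data(bio))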

> @@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
>  			 */
>  			spin_lock_irqsave(&md->deferred_lock, flags);
>  			if (__noflush_suspending(md)) {
> -				if (!(io->bio->bi_rw & REQ_HARDBARRIER))
> +				if (!(io->bio->bi_rw & REQ_FLUSH))

I suspect we don't actually need to special case flushes here anymore.


> @@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
>  		io_error = io->error;
>  		bio = io->bio;
> 
> -		if (bio->bi_rw & REQ_HARDBARRIER) {
> +		if (bio->bi_rw & REQ_FLUSH) {
>  			/*
> -			 * There can be just one barrier request so we use
> +			 * There can be just one flush request so we use
>  			 * a per-device variable for error reporting.
>  			 * Note that you can't touch the bio after end_io_acct
>  			 */
> -			if (!md->barrier_error && io_error != -EOPNOTSUPP)
> -				md->barrier_error = io_error;
> +			if (!md->flush_error)
> +				md->flush_error = io_error;

And we certainly do not need any special casing here.  See my patch.

>  {
>  	int rw = rq_data_dir(clone);
>  	int run_queue = 1;
> -	bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
> +	bool is_flush = clone->cmd_flags & REQ_FLUSH;
>  	struct dm_rq_target_io *tio = clone->end_io_data;
>  	struct mapped_device *md = tio->md;
>  	struct request *rq = tio->orig;
> 
> -	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
> +	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {

We never send flush requests as REQ_TYPE_BLOCK_PC anymore, so no need
for the second half of this conditional.

> +	if (!is_flush)
> +		blk_end_request_all(rq, error);
> +	else {
>  		if (unlikely(error))
> -			store_barrier_error(md, error);
> +			store_flush_error(md, error);
>  		run_queue = 0;
> -	} else
> -		blk_end_request_all(rq, error);
> +	}

Flush requests can now be completed normally.

> @@ -1308,11 +1302,11 @@ static void __split_and_process_bio(stru
> 
>  	ci.map = dm_get_live_table(md);
>  	if (unlikely(!ci.map)) {
> -		if (!(bio->bi_rw & REQ_HARDBARRIER))
> +		if (!(bio->bi_rw & REQ_FLUSH))
>  			bio_io_error(bio);
>  		else
> -			if (!md->barrier_error)
> -				md->barrier_error = -EIO;
> +			if (!md->flush_error)
> +				md->flush_error = -EIO;

No need for the special error handling here, flush requests can now
be completed normally.

> @@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
>  	part_stat_unlock();
> 
>  	/*
> -	 * If we're suspended or the thread is processing barriers
> +	 * If we're suspended or the thread is processing flushes
>  	 * we have to queue this io for later.
>  	 */
>  	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
> -	    unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
> +	    (bio->bi_rw & REQ_FLUSH)) {
>  		up_read(&md->io_lock);

AFAICS this is only needed for the old barrier code, no need for this
for pure flushes.

> @@ -1464,10 +1466,7 @@ static int dm_request(struct request_que
> 
>  static bool dm_rq_is_flush_request(struct request *rq)
>  {
> -	if (rq->cmd_flags & REQ_FLUSH)
> -		return true;
> -	else
> -		return false;
> +	return rq->cmd_flags & REQ_FLUSH;
>  }

It's probably worth just killing this wrapper.


>  void dm_dispatch_request(struct request *rq)
> @@ -1520,7 +1519,7 @@ static int setup_clone(struct request *c
>  	if (dm_rq_is_flush_request(rq)) {
>  		blk_rq_init(NULL, clone);
>  		clone->cmd_type = REQ_TYPE_FS;
> -		clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
> +		clone->cmd_flags |= (REQ_FLUSH | WRITE);
>  	} else {
>  		r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
>  				      dm_rq_bio_constructor, tio);

My suspicion is that we can get rid of all that special casing here
and just use blk_rq_prep_clone once it's been updated to propagate
REQ_FLUSH, similar to the DISCARD flag.
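
(A sketch of what that propagation could look like in the clone setup
path; illustrative only, not the actual blk_rq_prep_clone() change:)

/* let the clone inherit REQ_FLUSH/REQ_FUA the same way REQ_DISCARD
 * is propagated today - sketch, not the real hunk */
dst->cmd_flags = rq_data_dir(src) | REQ_NOMERGE;
dst->cmd_flags |= src->cmd_flags & (REQ_DISCARD | REQ_FLUSH | REQ_FUA);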

I also suspect that there is absolutely no need for the barrier work
queue once we stop waiting for outstanding requests.  But then again
the request-based dm code still somewhat confuses me.

> +static void process_flush(struct mapped_device *md, struct bio *bio)
>  {
> +	md->flush_error = 0;
> +
> +	/* handle REQ_FLUSH */
>  	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
> 
> -	bio_init(&md->barrier_bio);
> -	md->barrier_bio.bi_bdev = md->bdev;
> -	md->barrier_bio.bi_rw = WRITE_BARRIER;
> -	__split_and_process_bio(md, &md->barrier_bio);
> +	bio_init(&md->flush_bio);
> +	md->flush_bio.bi_bdev = md->bdev;
> +	md->flush_bio.bi_rw = WRITE_FLUSH;
> +	__split_and_process_bio(md, &md->flush_bio);

There's no need to use a separate flush_bio here.
__split_and_process_bio does the right thing for empty REQ_FLUSH
requests.  See my patch for how to do this differently.  And yeah,
my version has been tested.

Tejun Heo Aug. 17, 2010, 9:59 a.m. UTC | #2
Hello, Christoph.

On 08/14/2010 12:36 PM, Christoph Hellwig wrote:
> On Fri, Aug 13, 2010 at 04:51:17PM +0200, Tejun Heo wrote:
>> Do you want to change the whole thing in a single commit?  That would
>> be a pretty big invasive patch touching multiple subsystems.
> 
> We can just stop draining in the block layer in the first patch, then
> stop doing the stuff in md/dm/etc in the following ones and then do
> the final renaming patches.  It would still be fewer patches than now,
> but keep things working through the whole transition, which would
> really help bisecting any problems.

I'm not really convinced that would help much.  If bisecting can point
to the conversion as the culprit for whatever kind of failure,
wouldn't that be enough?  No matter what we do the conversion will be
a single step thing.  If we make the filesystems enforce the ordering
first and then relax ordering in the block layer, bisection would
still just point at the later patch.  The same goes for md/dm, the
best we can find out would be whether the conversion is correct or not
anyway.

I'm not against restructuring the patchset if it makes more sense but
it just feels like it would be a bit pointless effort (and one which
would require much tighter coordination among different trees) at this
point.  Am I missing something?

>> +			if (req->cmd_flags & REQ_FUA)
>> +				vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;
> 
> I'd suggest not adding FUA support to virtio yet.  Just using the flush
> feature gives you a fully working barrier implementation.
> 
> Eventually we might want to add a flag in the block queue to send
> REQ_FLUSH|REQ_FUA request through to virtio directly so that we can
> avoid separate pre- and post flushes, but I really want to benchmark if
> it makes an impact on real life setups first.

I wrote this in the other mail but I think it would make a difference
if the backend storage is md/dm, especially if it's shared by multiple
VMs.  It cuts down on one array-wide cache flush.

>> Index: block/drivers/md/linear.c
>> ===================================================================
>> --- block.orig/drivers/md/linear.c
>> +++ block/drivers/md/linear.c
>> @@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
>>  	dev_info_t *tmp_dev;
>>  	sector_t start_sector;
>>
>> -	if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
>> -		md_barrier_request(mddev, bio);
>> +	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
>> +		md_flush_request(mddev, bio);
> 
> We only need the special md_flush_request handling for
> empty REQ_FLUSH requests.  REQ_WRITE | REQ_FLUSH just need the
> flag propagated to the underlying devices.

Hmm, not really, the WRITE should happen after all the data in cache
are committed to NV media, meaning that empty FLUSH should already
have finished by the time the WRITE starts.
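
That two-phase sequencing is what the quoted md code is building
towards; roughly (a simplified sketch, not the literal
md_flush_request()/md_submit_flush_data() from the patch):

/*
 * Sketch: a REQ_WRITE | REQ_FLUSH bio is handled in two phases - first
 * an empty flush to every member device, and only when all of those
 * have completed is the data write submitted (locking omitted).
 */
static void md_flush_request_sketch(mddev_t *mddev, struct bio *bio)
{
        mddev->flush_bio = bio;
        submit_flushes(mddev); /* empty REQ_FLUSH to each rdev; the last
                                * md_end_flush() schedules flush_work */
}

static void md_submit_flush_data_sketch(struct work_struct *ws)
{
        mddev_t *mddev = container_of(ws, mddev_t, flush_work);
        struct bio *bio = mddev->flush_bio;

        if (!bio_has_data(bio)) {
                bio_endio(bio, 0);              /* pure flush - all done */
        } else {
                bio->bi_rw &= ~REQ_FLUSH;       /* pre-flush has finished */
                if (mddev->pers->make_request(mddev, bio))
                        generic_make_request(bio);
        }
        mddev->flush_bio = NULL;
}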

>> +static void md_end_flush(struct bio *bio, int err)
>>  {
>>  	mdk_rdev_t *rdev = bio->bi_private;
>>  	mddev_t *mddev = rdev->mddev;
>>
>>  	rdev_dec_pending(rdev, mddev);
>>
>>  	if (atomic_dec_and_test(&mddev->flush_pending)) {
>> +		/* The pre-request flush has finished */
>> +		schedule_work(&mddev->flush_work);
> 
> Once we only handle empty barriers here we can directly call bio_endio
> and the super wakeup instead of first scheduling a work queue.

Yeap, right.  That would be a nice optimization.

>>  	while ((bio = bio_list_pop(writes))) {
>> -		if (unlikely(bio_empty_barrier(bio))) {
>> +		if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
> 
> I kept bio_empty_barrier as bio_empty_flush, which actually is a quite
> useful macro for the bio based drivers.

Hmm... maybe.  The reason why I removed bio_empty_flush() was that
except for the front-most sequencer (block layer for all the request
based ones and the front-most make_request for bio based ones), it
doesn't make sense to see REQ_FLUSH + data bios.  They should be
sequenced at the front-most stage anyway, so I didn't have much use
for them.  Those code paths couldn't deal with REQ_FLUSH + data bios
anyway.

>> @@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
>>  			 */
>>  			spin_lock_irqsave(&md->deferred_lock, flags);
>>  			if (__noflush_suspending(md)) {
>> -				if (!(io->bio->bi_rw & REQ_HARDBARRIER))
>> +				if (!(io->bio->bi_rw & REQ_FLUSH))
> 
> I suspect we don't actually need to special case flushes here anymore.

Oh, I'm not sure about this part at all.  I'll ask Mike.

>> @@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
>>  		io_error = io->error;
>>  		bio = io->bio;
>>
>> -		if (bio->bi_rw & REQ_HARDBARRIER) {
>> +		if (bio->bi_rw & REQ_FLUSH) {
>>  			/*
>> -			 * There can be just one barrier request so we use
>> +			 * There can be just one flush request so we use
>>  			 * a per-device variable for error reporting.
>>  			 * Note that you can't touch the bio after end_io_acct
>>  			 */
>> -			if (!md->barrier_error && io_error != -EOPNOTSUPP)
>> -				md->barrier_error = io_error;
>> +			if (!md->flush_error)
>> +				md->flush_error = io_error;
> 
> And we certainly do not need any special casing here.  See my patch.

I wasn't sure about that part.  You removed store_flush_error(), but
DM_ENDIO_REQUEUE should still have higher priority than other
failures, no?

>>  {
>>  	int rw = rq_data_dir(clone);
>>  	int run_queue = 1;
>> -	bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
>> +	bool is_flush = clone->cmd_flags & REQ_FLUSH;
>>  	struct dm_rq_target_io *tio = clone->end_io_data;
>>  	struct mapped_device *md = tio->md;
>>  	struct request *rq = tio->orig;
>>
>> -	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
>> +	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {
> 
> We never send flush requests as REQ_TYPE_BLOCK_PC anymore, so no need
> for the second half of this conditional.

I see.

>> +	if (!is_flush)
>> +		blk_end_request_all(rq, error);
>> +	else {
>>  		if (unlikely(error))
>> -			store_barrier_error(md, error);
>> +			store_flush_error(md, error);
>>  		run_queue = 0;
>> -	} else
>> -		blk_end_request_all(rq, error);
>> +	}
> 
> Flush requests can now be completed normally.

The same question as before.  I think we still need to prioritize
DM_ENDIO_REQUEUE failures.

>> @@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
>>  	part_stat_unlock();
>>
>>  	/*
>> -	 * If we're suspended or the thread is processing barriers
>> +	 * If we're suspended or the thread is processing flushes
>>  	 * we have to queue this io for later.
>>  	 */
>>  	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
>> -	    unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
>> +	    (bio->bi_rw & REQ_FLUSH)) {
>>  		up_read(&md->io_lock);
> 
> AFAICS this is only needed for the old barrier code, no need for this
> for pure flushes.

I'll ask Mike.

>> @@ -1464,10 +1466,7 @@ static int dm_request(struct request_que
>>
>>  static bool dm_rq_is_flush_request(struct request *rq)
>>  {
>> -	if (rq->cmd_flags & REQ_FLUSH)
>> -		return true;
>> -	else
>> -		return false;
>> +	return rq->cmd_flags & REQ_FLUSH;
>>  }
> 
> It's probably worth just killing this wrapper.

Yeah, probably.  It was an accidental edit to begin with and I left
this part out in the new patch.

>> +static void process_flush(struct mapped_device *md, struct bio *bio)
>>  {
>> +	md->flush_error = 0;
>> +
>> +	/* handle REQ_FLUSH */
>>  	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
>>
>> -	bio_init(&md->barrier_bio);
>> -	md->barrier_bio.bi_bdev = md->bdev;
>> -	md->barrier_bio.bi_rw = WRITE_BARRIER;
>> -	__split_and_process_bio(md, &md->barrier_bio);
>> +	bio_init(&md->flush_bio);
>> +	md->flush_bio.bi_bdev = md->bdev;
>> +	md->flush_bio.bi_rw = WRITE_FLUSH;
>> +	__split_and_process_bio(md, &md->flush_bio);
> 
> There's no need to use a separate flush_bio here.
> __split_and_process_bio does the right thing for empty REQ_FLUSH
> requests.  See my patch for how to do this differently.  And yeah,
> my version has been tested.

But how do you make sure REQ_FLUSHes for preflush finish before
starting the write?

Thanks.
Christoph Hellwig Aug. 17, 2010, 1:19 p.m. UTC | #3
On Tue, Aug 17, 2010 at 11:59:38AM +0200, Tejun Heo wrote:
> I'm not really convinced that would help much.  If bisecting can point
> to the conversion as the culprit for whatever kind of failure,
> wouldn't that be enough?  No matter what we do the conversion will be
> a single step thing.  If we make the filesystems enforce the ordering
> first and then relax ordering in the block layer, bisection would
> still just point at the later patch.  The same goes for md/dm, the
> best we can find out would be whether the conversion is correct or not
> anyway.

The filesystems already enforce the ordering, except reiserfs which
opts out if the barrier option is set.

> I'm not against restructuring the patchset if it makes more sense but
> it just feels like it would be a bit pointless effort (and one which
> would require much tighter coordination among different trees) at this
> point.  Am I missing something?

What other trees do you mean?  The conversions of the 8 filesystems
that actually support barriers need to go through this tree anyway
if we want to be able to test it.  Also the changes in the filesystems
are absolutely minimal - it's basically just
s/WRITE_BARRIER/WRITE_FUA_FLUSH/ after my initial patch to kill BH_Ordered,
and removing about 10 lines of code in reiserfs.

> > We only need the special md_flush_request handling for
> > empty REQ_FLUSH requests.  REQ_WRITE | REQ_FLUSH just need the
> > flag propagated to the underlying devices.
> 
> Hmm, not really, the WRITE should happen after all the data in cache
> are committed to NV media, meaning that empty FLUSH should already
> have finished by the time the WRITE starts.

You're right.

> >>  	while ((bio = bio_list_pop(writes))) {
> >> -		if (unlikely(bio_empty_barrier(bio))) {
> >> +		if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
> > 
> > I kept bio_empty_barrier as bio_empty_flush, which actually is a quite
> > useful macro for the bio based drivers.
> 
> Hmm... maybe.  The reason why I removed bio_empty_flush() was that
> except for the front-most sequencer (block layer for all the request
> based ones and the front-most make_request for bio based ones), it
> doesn't make sense to see REQ_FLUSH + data bios.  They should be
> sequenced at the front-most stage anyway, so I didn't have much use
> for them.  Those code paths couldn't deal with REQ_FLUSH + data bios
> anyway.

The current bio_empty_barrier is only used in dm, and indeed only makes
sense for make_request-based drivers.  But I think it's a rather useful
helper for them.  Either way, it's not a big issue and either way is
fine with me.

> >> +		if (bio->bi_rw & REQ_FLUSH) {
> >>  			/*
> >> -			 * There can be just one barrier request so we use
> >> +			 * There can be just one flush request so we use
> >>  			 * a per-device variable for error reporting.
> >>  			 * Note that you can't touch the bio after end_io_acct
> >>  			 */
> >> -			if (!md->barrier_error && io_error != -EOPNOTSUPP)
> >> -				md->barrier_error = io_error;
> >> +			if (!md->flush_error)
> >> +				md->flush_error = io_error;
> > 
> > And we certainly do not need any special casing here.  See my patch.
> 
> I wasn't sure about that part.  You removed store_flush_error(), but
> DM_ENDIO_REQUEUE should still have higher priority than other
> failures, no?

Which priority?

> >> +static void process_flush(struct mapped_device *md, struct bio *bio)
> >>  {
> >> +	md->flush_error = 0;
> >> +
> >> +	/* handle REQ_FLUSH */
> >>  	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
> >>
> >> -	bio_init(&md->barrier_bio);
> >> -	md->barrier_bio.bi_bdev = md->bdev;
> >> -	md->barrier_bio.bi_rw = WRITE_BARRIER;
> >> -	__split_and_process_bio(md, &md->barrier_bio);
> >> +	bio_init(&md->flush_bio);
> >> +	md->flush_bio.bi_bdev = md->bdev;
> >> +	md->flush_bio.bi_rw = WRITE_FLUSH;
> >> +	__split_and_process_bio(md, &md->flush_bio);
> > 
> > There's no need to use a separate flush_bio here.
> > __split_and_process_bio does the right thing for empty REQ_FLUSH
> > requests.  See my patch for how to do this differently.  And yeah,
> > my version has been tested.
> 
> But how do you make sure REQ_FLUSHes for preflush finish before
> starting the write?

Hmm, okay.  I see how the special flush_bio makes the waiting easier,
let's see if Mike or others in the DM team have a better idea.

Tejun Heo Aug. 17, 2010, 4:41 p.m. UTC | #4
Hi,

On 08/17/2010 03:19 PM, Christoph Hellwig wrote:
> On Tue, Aug 17, 2010 at 11:59:38AM +0200, Tejun Heo wrote:
>> I'm not against restructuring the patchset if it makes more sense but
>> it just feels like it would be a bit pointless effort (and one which
>> would require much tighter coordination among different trees) at this
>> point.  Am I missing something?
> 
> What other trees do you mean?

I was mostly thinking about dm/md, drbd and stuff, but you're talking
about filesystem conversion patches being routed through block tree,
right?

> The conversions of the 8 filesystems that actually support barriers
> need to go through this tree anyway if we want to be able to test
> it.  Also the changes in the filesystems are absolutely minimal -
> it's basically just s/WRITE_BARRIER/WRITE_FUA_FLUSH/ after my
> initial patch to kill BH_Ordered, and removing about 10 lines of code in
> reiserfs.

I might just resequence it to finish this part of discussion but what
does that really buy us?  It's not really gonna help bisection.
Bisection won't be able to tell anything in higher resolution than
"the new implementation doesn't work".  If you show me how it would
actually help, I'll happily reshuffle the patches.

>> I wasn't sure about that part.  You removed store_flush_error(), but
>> DM_ENDIO_REQUEUE should still have higher priority than other
>> failures, no?
> 
> Which priority?

IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
core layer to retry the whole bio later), it trumps all other failures
and the bio is retried later.  That was why DM_ENDIO_REQUEUE was
prioritized over other error codes, which actually is sort of
incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
layers as FLUSH failure implies data already lost.  So,
DM_ENDIO_REQUEUE actually should have lower priority than other
failures.  But, then again, the error codes still need to be
prioritized.
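
Concretely, that would mean something like this when collecting the
per-device flush results (sketch, not actual dm code; md->flush_error
and DM_ENDIO_REQUEUE as in the patch above):

/*
 * Sketch: keep the first "real" error and only let DM_ENDIO_REQUEUE
 * stand if nothing worse has been seen, since a genuine flush failure
 * must be reported upwards rather than retried (locking omitted).
 */
static void store_flush_error_sketch(struct mapped_device *md, int error)
{
        if (!md->flush_error || md->flush_error == DM_ENDIO_REQUEUE)
                md->flush_error = error;
}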

>> But how do you make sure REQ_FLUSHes for preflush finish before
>> starting the write?
> 
> Hmm, okay.  I see how the special flush_bio makes the waiting easier,
> let's see if Mike or others in the DM team have a better idea.

Yeah, it would be better if it can be sequenced w/o using a work but
let's leave it for later.

Thanks.
Christoph Hellwig Aug. 17, 2010, 4:59 p.m. UTC | #5
On Tue, Aug 17, 2010 at 06:41:47PM +0200, Tejun Heo wrote:
> > What other trees do you mean?
> 
> I was mostly thinking about dm/md, drbd and stuff, but you're talking
> about filesystem conversion patches being routed through block tree,
> right?

I think we really need all the conversions in one tree, block layer,
remapping drivers and filesystems.

Btw, I've done the conversion for all filesystems and I'm running tests
over them now.  Expect the series late today or tomorrow.

> I might just resequence it to finish this part of discussion but what
> does that really buy us?  It's not really gonna help bisection.
> Bisection won't be able to tell anything in higher resolution than
> "the new implementation doesn't work".  If you show me how it would
> actually help, I'll happily reshuffle the patches.

It's not about bisecting to find bugs in the barrier conversion.  We
can't easily bisect it down anyway.  The problem is when we try to
bisect other problems and get into the middle of the series, where
barriers are suddenly gone.  Which is not very helpful for things like
data integrity problems in filesystems.

> >> I wasn't sure about that part.  You removed store_flush_error(), but
> >> DM_ENDIO_REQUEUE should still have higher priority than other
> >> failures, no?
> > 
> > Which priority?
> 
> IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
> core layer to retry the whole bio later), it trumps all other failures
> and the bio is retried later.  That was why DM_ENDIO_REQUEUE was
> prioritized over other error codes, which actually is sort of
> incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
> layers as FLUSH failure implies data already lost.  So,
> DM_ENDIO_REQUEUE actually should have lower priority than other
> failures.  But, then again, the error codes still need to be
> prioritized.

I think that's something we better leave to the DM team.

Tejun Heo Aug. 18, 2010, 6:35 a.m. UTC | #6
Hello,

On 08/17/2010 06:59 PM, Christoph Hellwig wrote:
> I think we really need all the conversions in one tree, block layer,
> remapping drivers and filesystems.

I don't know.  If filesystem changes are really trivial maybe, but
md/dm changes seem a bit too invasive to go through the block tree.

> Btw, I've done the conversion for all filesystems and I'm running tests
> over them now.  Expect the series late today or tomorrow.

Cool. :-)

>> I might just resequence it to finish this part of discussion but what
>> does that really buy us?  It's not really gonna help bisection.
>> Bisection won't be able to tell anything in higher resolution than
>> "the new implementation doesn't work".  If you show me how it would
>> actually help, I'll happily reshuffle the patches.
> 
> It's not about bisecting to find bugs in the barrier conversion.  We
> can't easily bisect it down anyway.  The problem is when we try to
> bisect other problems and get into the middle of the series, where
> barriers are suddenly gone.  Which is not very helpful for things like
> data integrity problems in filesystems.

Ah, okay, hmmm.... alright, I'll resequence the patches.  If the
filesystem changes can be put into a single tree somehow, we can keep
things mostly working at least for direct devices.

>> IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
>> core layer to retry the whole bio later), it trumps all other failures
>> and the bio is retried later.  That was why DM_ENDIO_REQUEUE was
>> prioritized over other error codes, which actually is sort of
>> incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
>> layers as FLUSH failure implies data already lost.  So,
>> DM_ENDIO_REQUEUE actually should have lower priority than other
>> failures.  But, then again, the error codes still need to be
>> prioritized.
> 
> I think that's something we better leave to the DM team.

Sure, but we shouldn't be ripping out the code to do that.

Thanks.
Tejun Heo Aug. 18, 2010, 8:11 a.m. UTC | #7
Hello,

On 08/18/2010 08:35 AM, Tejun Heo wrote:
>> It's not about bisecting to find bugs in the barrier conversion.  We
>> can't easily bisect it down anyway.  The problem is when we try to
>> bisect other problems and get into the middle of the series, where
>> barriers are suddenly gone.  Which is not very helpful for things like
>> data integrity problems in filesystems.
> 
> Ah, okay, hmmm.... alright, I'll resequence the patches.  If the
> filesystem changes can be put into a single tree somehow, we can keep
> things mostly working at least for direct devices.

Sorry, but I don't think I can do that.  It just doesn't make much
sense.  I can't relax the ordering for REQ_HARDBARRIER without breaking
the remapping drivers.  So, to keep things working, I'd have to
1. relax the ordering, 2. implement the new REQ_FLUSH/FUA based
interface and 3. use them in the filesystems, all in the same patch.
That's just wrong.  And I don't think md/dm changes can or should go
through the block tree.  They're way too invasive for that.  It's a new
implementation, and barriers won't work (they will fail gracefully) for
several commits during the transition.  I don't think there's a better
way around it.

Thanks.
Kiyoshi Ueda Aug. 20, 2010, 8:26 a.m. UTC | #8
Hi Tejun, Christoph,

On Tue, Aug 17, 2010 at 06:41:47PM +0200, Tejun Heo wrote:
>>> I wasn't sure about that part.  You removed store_flush_error(), but
>>> DM_ENDIO_REQUEUE should still have higher priority than other
>>> failures, no?
>>
>> Which priority?
>
> IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
> core layer to retry the whole bio later), it trumps all other failures
> and the bio is retried later.  That was why DM_ENDIO_REQUEUE was
> prioritized over other error codes, which actually is sort of
> incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
> layers as FLUSH failure implies data already lost.  So,
> DM_ENDIO_REQUEUE actually should have lower priority than other
> failures.  But, then again, the error codes still need to be
> prioritized.

I think that's correct and changing the priority of DM_ENDIO_REQUEUE
for REQ_FLUSH down to the lowest should be fine.
(I didn't know that FLUSH failure implies data loss possibility.)

But the patch is not enough, you have to change target drivers, too.
E.g. As for multipath, you need to change
     drivers/md/dm-mpath.c:do_end_io() to return error for REQ_FLUSH
     like the REQ_DISCARD support included in 2.6.36-rc1.


By the way, if these patch-set with the change above are included,
even one path failure for REQ_FLUSH on multipath configuration will
be reported to upper layer as error, although it's retried using
other paths currently.
Then, if an upper layer won't take correct recovery action for the error,
it would be seen as a regression for users. (e.g. Frequent EXT3-error
resulting in read-only mount on multipath configuration.)

Although I think the explicit error is fine rather than implicit data
corruption, please check upper layers carefully so that users won't see
such errors as much as possible.

Thanks,
Kiyoshi Ueda
Tejun Heo Aug. 23, 2010, 12:14 p.m. UTC | #9
Hello,

On 08/20/2010 10:26 AM, Kiyoshi Ueda wrote:
> I think that's correct and changing the priority of DM_ENDIO_REQUEUE
> for REQ_FLUSH down to the lowest should be fine.
> (I didn't know that FLUSH failure implies data loss possibility.)

At least on ATA, FLUSH failure implies that data is already lost, so
the error can't be ignored or retried.

> But the patch is not enough, you have to change target drivers, too.
> E.g. As for multipath, you need to change
>      drivers/md/dm-mpath.c:do_end_io() to return error for REQ_FLUSH
>      like the REQ_DISCARD support included in 2.6.36-rc1.

I'll take a look but is there an easy way to test mpath other than
having fancy hardware?

> By the way, if these patch-set with the change above are included,
> even one path failure for REQ_FLUSH on multipath configuration will
> be reported to upper layer as error, although it's retried using
> other paths currently.
> Then, if an upper layer won't take correct recovery action for the error,
> it would be seen as a regression for users. (e.g. Frequent EXT3-error
> resulting in read-only mount on multipath configuration.)
> 
> Although I think the explicit error is fine rather than implicit data
> corruption, please check upper layers carefully so that users won't see
> such errors as much as possible.

Argh... then it will have to discern why FLUSH failed.  It can retry
for transport errors but if it got aborted by the device it should
report upwards.  Maybe just turn off barrier support in mpath for now?
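
Something along these lines in dm-mpath's do_end_io(), say (sketch
only; blk_error_is_transport() and fail_path_sketch() are hypothetical
placeholders - such a classification does not exist yet, which is
exactly the problem):

static int do_end_io_flush_sketch(struct multipath *m, struct request *clone,
                                  int error)
{
        if (!error)
                return 0;                 /* flush completed fine */

        if (blk_error_is_transport(error)) {
                fail_path_sketch(m);      /* hypothetical: mark the path faulty */
                return DM_ENDIO_REQUEUE;  /* safe to retry on another path */
        }

        /* the device itself aborted the flush: data may already be lost,
         * so the error has to be passed to the upper layers */
        return error;
}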

Thanks.
Mike Snitzer Aug. 23, 2010, 2:17 p.m. UTC | #10
On Mon, Aug 23 2010 at  8:14am -0400,
Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On 08/20/2010 10:26 AM, Kiyoshi Ueda wrote:
> > I think that's correct and changing the priority of DM_ENDIO_REQUEUE
> > for REQ_FLUSH down to the lowest should be fine.
> > (I didn't know that FLUSH failure implies data loss possibility.)
> 
> At least on ATA, FLUSH failure implies that data is already lost, so
> the error can't be ignored or retried.
> 
> > But the patch is not enough, you have to change target drivers, too.
> > E.g. As for multipath, you need to change
> >      drivers/md/dm-mpath.c:do_end_io() to return error for REQ_FLUSH
> >      like the REQ_DISCARD support included in 2.6.36-rc1.
> 
> I'll take a look but is there an easy way to test mpath other than
> having fancy hardware?

It is easy enough to set up mpath on top of a single path.  Just
verify/modify /etc/multipath.conf so that your device isn't blacklisted.

multipathd will even work with a scsi-debug device.

You obviously won't get path failover but you'll see the path get marked
faulty, etc.

> > By the way, if these patch-set with the change above are included,
> > even one path failure for REQ_FLUSH on multipath configuration will
> > be reported to upper layer as error, although it's retried using
> > other paths currently.
> > Then, if an upper layer won't take correct recovery action for the error,
> > it would be seen as a regression for users. (e.g. Frequent EXT3-error
> > resulting in read-only mount on multipath configuration.)
> > 
> > Although I think the explicit error is fine rather than implicit data
> > corruption, please check upper layers carefully so that users won't see
> > such errors as much as possible.
> 
> Argh... then it will have to discern why FLUSH failed.  It can retry
> for transport errors but if it got aborted by the device it should
> report upwards.

Yes, we discussed this issue of needing to train dm-multipath to know if
there was a transport failure or not (at LSF).  But I'm not sure when
Hannes intends to repost his work in this area (updated to account for
feedback from LSF).

> Maybe just turn off barrier support in mpath for now?

I think we'd prefer to have a device fail rather than jeopardize data
integrity.  Clearly not ideal but...
Kiyoshi Ueda Aug. 24, 2010, 10:24 a.m. UTC | #11
Hi Tejun,

On 08/23/2010 11:17 PM +0900, Mike Snitzer wrote:
> On Mon, Aug 23 2010 at  8:14am -0400, Tejun Heo <tj@kernel.org> wrote:
>> On 08/20/2010 10:26 AM, Kiyoshi Ueda wrote:
>>> By the way, if these patch-set with the change above are included,
>>> even one path failure for REQ_FLUSH on multipath configuration will
>>> be reported to upper layer as error, although it's retried using
>>> other paths currently.
>>> Then, if an upper layer won't take correct recovery action for the error,
>>> it would be seen as a regression for users. (e.g. Frequent EXT3-error
>>> resulting in read-only mount on multipath configuration.)
>>>
>>> Although I think the explicit error is fine rather than implicit data
>>> corruption, please check upper layers carefully so that users won't see
>>> such errors as much as possible.
>> 
>> Argh... then it will have to discern why FLUSH failed.  It can retry
>> for transport errors but if it got aborted by the device it should
>> report upwards.
> 
> Yes, we discussed this issue of needing to train dm-multipath to know if
> there was a transport failure or not (at LSF).  But I'm not sure when
> Hannes intends to repost his work in this area (updated to account for
> feedback from LSF).

Yes, checking whether it's a transport error in lower layer is
the right solution.
(Since I know it's not available yet, I just hoped if upper layers
 had some other options.)

Anyway, only reporting errors for REQ_FLUSH to upper layer without
such a solution would make dm-multipath almost unusable in real world,
although it's better than implicit data loss.


>> Maybe just turn off barrier support in mpath for now?

If it's possible, it could be a workaround for a short term.
But how can you do that?

I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
Underlying devices of a mpath device may have write-back cache and
it may be enabled.
So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
becomes a device which has write-back cache but doesn't support flush.
Then, upper layer can do nothing to ensure cache flush?

Thanks,
Kiyoshi Ueda
Tejun Heo Aug. 24, 2010, 4:59 p.m. UTC | #12
Hello,

On 08/24/2010 12:24 PM, Kiyoshi Ueda wrote:
> Yes, checking whether it's a transport error in lower layer is
> the right solution.
> (Since I know it's not available yet, I just hoped if upper layers
>  had some other options.)
> 
> Anyway, only reporting errors for REQ_FLUSH to upper layer without
> such a solution would make dm-multipath almost unusable in real world,
> although it's better than implicit data loss.

I see.

>>> Maybe just turn off barrier support in mpath for now?
> 
> If it's possible, it could be a workaround for a short term.
> But how can you do that?
> 
> I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
> Underlying devices of a mpath device may have write-back cache and
> it may be enabled.
> So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
> becomes a device which has write-back cache but doesn't support flush.
> Then, upper layer can do nothing to ensure cache flush?

Yeah, I was basically suggesting to forget about cache flush w/ mpath
until it can be fixed.  You're saying that if mpath just passes
REQ_FLUSH upwards without retrying, it will be almost unusable,
right?  I'm not sure how to proceed here.  How much work would
discerning between transport and IO errors take?  If it can't be done
quickly enough the retry logic can be kept around to keep the old
behavior but that already was a broken behavior, so...  :-(

Thanks.
Vladislav Bolkhovitin Aug. 24, 2010, 5:11 p.m. UTC | #13
Tejun Heo, on 08/23/2010 04:14 PM wrote:
>> I think that's correct and changing the priority of DM_ENDIO_REQUEUE
>> for REQ_FLUSH down to the lowest should be fine.
>> (I didn't know that FLUSH failure implies data loss possibility.)
>
> At least on ATA, FLUSH failure implies that data is already lost, so
> the error can't be ignored or retried.

In SCSI there are conditions under which a command, including FLUSH
(SYNC_CACHE), can fail without implying lost data. For them the caller
is expected to retry the failed command. Most common cases are Unit
Attentions and TASK QUEUE FULL status.

Vlad
Mike Snitzer Aug. 24, 2010, 5:52 p.m. UTC | #14
On Tue, Aug 24 2010 at 12:59pm -0400,
Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On 08/24/2010 12:24 PM, Kiyoshi Ueda wrote:
> > Yes, checking whether it's a transport error in lower layer is
> > the right solution.
> > (Since I know it's not available yet, I just hoped if upper layers
> >  had some other options.)
> > 
> > Anyway, only reporting errors for REQ_FLUSH to upper layer without
> > such a solution would make dm-multipath almost unusable in real world,
> > although it's better than implicit data loss.
> 
> I see.
> 
> >>> Maybe just turn off barrier support in mpath for now?
> > 
> > If it's possible, it could be a workaround for a short term.
> > But how can you do that?
> > 
> > I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
> > Underlying devices of a mpath device may have write-back cache and
> > it may be enabled.
> > So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
> > becomes a device which has write-back cache but doesn't support flush.
> > Then, upper layer can do nothing to ensure cache flush?
> 
> Yeah, I was basically suggesting to forget about cache flush w/ mpath
> until it can be fixed.  You're saying that if mpath just passes
> REQ_FLUSH upwards without retrying, it will be almost unusable,
> right?  I'm not sure how to proceed here.

Seems clear that we must fix mpath to receive the SCSI errors, in some
form, so it can decide if a retry is required/valid or not.

Such error processing was a big selling point for the transition from
bio-based to request-based multipath; so it's unfortunate that this
piece has been left until now.

> How much work would discerning between transport and IO errors take?

Hannes already proposed some patches:
https://patchwork.kernel.org/patch/61282/
https://patchwork.kernel.org/patch/61283/
https://patchwork.kernel.org/patch/61596/

This work was discussed at LSF, see "Error Handling - Hannes Reinecke"
here: http://lwn.net/Articles/400589/

I thought James, Alasdair and others offered some guidance on what he'd
like to see...

Unfortunately, even though I was at this LSF session, I can't recall any
specific consensus on how Hannes' work should be refactored (to avoid
adding SCSI sense processing code directly in dm-mpath).  Maybe James,
Hannes or others remember?

Was it enough to just have the SCSI sense processing code split out in a
new sub-section of the SCSI midlayer -- and then DM calls that code?

> If it can't be done quickly enough the retry logic can be kept around
> to keep the old behavior but that already was a broken behavior, so...
> :-(

I'll have to review this thread again to understand why mpath's existing
retry logic is broken behavior.  mpath is used with more capable SCSI
devices so I'm missing why a failed FLUSH implies data loss.

Mike
Tejun Heo Aug. 24, 2010, 6:14 p.m. UTC | #15
Hello,

On 08/24/2010 07:52 PM, Mike Snitzer wrote:
>> If it can't be done quickly enough the retry logic can be kept around
>> to keep the old behavior but that already was a broken behavior, so...
>> :-(
> 
> I'll have to review this thread again to understand why mpath's existing
> retry logic is broken behavior.  mpath is used with more capable SCSI
> devices so I'm missing why a failed FLUSH implies data loss.

SBC doesn't specify the failure behavior, so it could be that retrying
a flush is safe.  But for most disk type devices, flush failure usually
indicates that the device has exhausted all its options to commit some
of the pending data to NV media - ie. even remapping failed for
whatever reason.  Even if retrying is safe, it's more likely to simply
delay notification of the failure.

In ATA, the situation is clearer: when a device actively fails a
flush, the drive reports the first sector it failed to commit and the
next flush will continue _after_ that sector - IOW, data is already
lost.

<speculation>
I think there's no reason mpath should be tasked with retrying flush
failures.  That's up to the SCSI EH.  If the command failed in a 'safe',
transient way - ie. device busy or whatnot - SCSI EH can and does retry
the command.  There are several FAILFAST bits already and SCSI EH can
avoid retrying transport errors for mpath (maybe it already does
that?) and just needs to be able to tell the upper layer that the
failure was a fast one and that the upper layer is responsible for
retrying?  Is there
any reason to pass the whole sense information upwards?
</speculation>
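
(Sketch of that idea, illustrative only and not a patch: mark the
clones mpath sends down so the SCSI layer fails transport problems fast
and leaves the retry decision to the upper layer.)

static void map_io_sketch(struct request *clone, struct pgpath *pgpath)
{
        struct block_device *bdev = pgpath->path.dev->bdev;

        clone->q = bdev_get_queue(bdev);
        clone->rq_disk = bdev->bd_disk;
        /* fail fast on transport errors; mpath decides whether to retry */
        clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
}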

Anyways, flush failure is different from read/write failures.
Reads/writes can always be retried cleanly.  They are stateless.  I
don't know how SCSI devices would actually behave but it's a bit
scary to retry a SYNCHRONIZE_CACHE a device failed and report success
upwards.

Thanks.
Alan Cox Aug. 24, 2010, 11:14 p.m. UTC | #16
> In SCSI there are conditions under which a command, including FLUSH
> (SYNC_CACHE), can fail without implying lost data. For them the caller
> is expected to retry the failed command. Most common cases are Unit
> Attentions and TASK QUEUE FULL status.

ATA expects the command to be retried as well because a failed flush
indicates the specific sector is lost (unless the host still has a copy
of course - which is *very* likely although we don't use it) but the rest
of the flush transaction can be retried to continue to flush sectors
beyond the failed one.

Alan
Kiyoshi Ueda Aug. 25, 2010, 8 a.m. UTC | #17
Hi Tejun,

On 08/25/2010 01:59 AM +0900, Tejun Heo wrote:
> On 08/24/2010 12:24 PM, Kiyoshi Ueda wrote:
>> Anyway, only reporting errors for REQ_FLUSH to upper layer without
>> such a solution would make dm-multipath almost unusable in real world,
>> although it's better than implicit data loss.
> 
> I see.
> 
>>> Maybe just turn off barrier support in mpath for now?
>> 
>> If it's possible, it could be a workaround for a short term.
>> But how can you do that?
>>
>> I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
>> Underlying devices of a mpath device may have write-back cache and
>> it may be enabled.
>> So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
>> becomes a device which has write-back cache but doesn't support flush.
>> Then, upper layer can do nothing to ensure cache flush?
> 
> Yeah, I was basically suggesting to forget about cache flush w/ mpath
> until it can be fixed.  You're saying that if mpath just passes
> REQ_FLUSH upwards without retrying, it will be almost unusable,
> right?

Right.
If the error is safe/needed to retry using other paths, mpath should
retry even if REQ_FLUSH.  Otherwise, only one path failure may result
in system down.
Just passing any REQ_FLUSH error upwards regardless the error type
will make such situations, and users will feel the behavior as
unstable/unusable.


> I'm not sure how to proceed here.  How much work would
> discerning between transport and IO errors take?  If it can't be done
> quickly enough the retry logic can be kept around to keep the old
> behavior but that already was a broken behavior, so...  :-(

I'm not sure how long will it take.
Anyway, as you said, the flush error handling of dm-mpath is already
broken if data loss really happens on any storage used by dm-mpath.
Although it's a serious issue and quick fix is required, I think
you may leave the old behavior in your patch-set, since it's
a separate issue.

Thanks,
Kiyoshi Ueda
Mike Snitzer Aug. 25, 2010, 3:28 p.m. UTC | #18
On Wed, Aug 25 2010 at  4:00am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> Hi Tejun,
> 
> On 08/25/2010 01:59 AM +0900, Tejun Heo wrote:
> > On 08/24/2010 12:24 PM, Kiyoshi Ueda wrote:
> >> Anyway, only reporting errors for REQ_FLUSH to upper layer without
> >> such a solution would make dm-multipath almost unusable in real world,
> >> although it's better than implicit data loss.
> > 
> > I see.
> > 
> >>> Maybe just turn off barrier support in mpath for now?
> >> 
> >> If it's possible, it could be a workaround for a short term.
> >> But how can you do that?
> >>
> >> I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
> >> Underlying devices of a mpath device may have write-back cache and
> >> it may be enabled.
> >> So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
> >> becomes a device which has write-back cache but doesn't support flush.
> >> Then, upper layer can do nothing to ensure cache flush?
> > 
> > Yeah, I was basically suggesting to forget about cache flush w/ mpath
> > until it can be fixed.  You're saying that if mpath just passes
> > REQ_FLUSH upwards without retrying, it will be almost unusable,
> > right?
> 
> Right.
> If the error is safe/needed to retry using other paths, mpath should
> retry even if REQ_FLUSH.  Otherwise, only one path failure may result
> in system down.
> Just passing any REQ_FLUSH error upwards regardless of the error type
> will create such situations, and users will see the behavior as
> unstable/unusable.

Right, there are hardware configurations that lend themselves to FLUSH
retries mattering, namely:
1) a SAS drive with 2 ports and a writeback cache
2) theoretically possible: SCSI array that is mpath capable but
   advertises cache as writeback (WCE=1)

The SAS case is obviously a more concrete example of why FLUSH retries
are worthwhile in mpath.

But I understand (and agree) that we'd be better off if mpath could
differentiate between failures rather than blindly retrying on failures
like it does today (fails path and retries if additional paths
available).

> Anyway, as you said, the flush error handling of dm-mpath is already
> broken if data loss really happens on any storage used by dm-mpath.
> Although it's a serious issue and quick fix is required, I think
> you may leave the old behavior in your patch-set, since it's
> a separate issue.

I'm not seeing where anything is broken with current mpath.  If a
multipathed LUN is WCE=1 then it should be fair to assume the cache is
mirrored or shared across ports.  Therefore retrying the SYNCHRONIZE
CACHE is needed.

Do we still have fear that SYNCHRONIZE CACHE can silently drop data?
Seems unlikely especially given what Tejun shared from SBC.

It seems that at worst, with current mpath, we retry when it doesn't
make sense (e.g. target failure).

Mike
Mike Snitzer Aug. 25, 2010, 3:59 p.m. UTC | #19
On Wed, Aug 25 2010 at  4:00am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> > I'm not sure how to proceed here.  How much work would
> > discerning between transport and IO errors take?  If it can't be done
> > quickly enough the retry logic can be kept around to keep the old
> > behavior but that already was a broken behavior, so...  :-(
> 
> I'm not sure how long will it take.

We first need to understand what direction we want to go with this.  We
currently have 2 options.  But any other ideas are obviously welcome.

1)
Mike Christie has a patchset that introduces more specific
target/transport/host error codes.  Mike shared these pointers but he'd
have to put the work in to refresh them:
http://marc.info/?l=linux-scsi&m=112487427230642&w=2
http://marc.info/?l=linux-scsi&m=112487427306501&w=2
http://marc.info/?l=linux-scsi&m=112487431524436&w=2
http://marc.info/?l=linux-scsi&m=112487431524350&w=2

errno.h new EXYZ
http://marc.info/?l=linux-kernel&m=107715299008231&w=2

add block layer blkdev.h error values
http://marc.info/?l=linux-kernel&m=107961883915068&w=2

add block layer blkdev.h error values (v2 convert more drivers)
http://marc.info/?l=linux-scsi&m=112487427230642&w=2

I think that patchset's approach is fairly disruptive just to be able to
train upper layers to differentiate (e.g. mpath).  But in the end maybe
that change takes the code in a more desirable direction?

2)
Another option is Hannes' approach of having DM consume req->errors and
SCSI sense more directly.

I've refreshed Hannes' previous patchset against 2.6.36-rc2 but I
haven't finished testing it yet (should be OK.. it boots, but still have
FIXME to move scsi_uld_should_retry to scsi_error.c):
http://people.redhat.com/msnitzer/patches/dm-scsi-sense/

Would be great if James, Hannes and others had a look at this
refreshed RFC patchset.  It's clearly not polished but it gives an idea
of the approach.  Does this look worthwhile?

Follow-on work is needed to refine scsi_uld_should_retry further.  Keep
in mind that scsi_error.c is the intended location for this code.

James, please note that I've attempted to make REQ_TYPE_FS set
req->errors only for "genuine errors" by (ab)using
scsi_decide_disposition:
http://people.redhat.com/msnitzer/patches/dm-scsi-sense/scsi-Always-pass-error-result-and-sense-on-request-completion.patch

If others think this may be worthwhile I can finish testing, cleanup the
patches further, and post them.

Mike
Mike Christie Aug. 25, 2010, 7:15 p.m. UTC | #20
On 08/25/2010 10:59 AM, Mike Snitzer wrote:
> On Wed, Aug 25 2010 at  4:00am -0400,
> Kiyoshi Ueda<k-ueda@ct.jp.nec.com>  wrote:
>
>>> I'm not sure how to proceed here.  How much work would
>>> discerning between transport and IO errors take?  If it can't be done
>>> quickly enough the retry logic can be kept around to keep the old
>>> behavior but that already was a broken behavior, so...  :-(
>>
>> I'm not sure how long will it take.
>
> We first need to understand what direction we want to go with this.  We
> currently have 2 options.  But any other ideas are obviously welcome.
>
> 1)
> Mike Christie has a patchset that introduce more specific
> target/transport/host error codes.  Mike shared these pointers but he'd
> have to put the work in to refresh them:
> http://marc.info/?l=linux-scsi&m=112487427230642&w=2
> http://marc.info/?l=linux-scsi&m=112487427306501&w=2
> http://marc.info/?l=linux-scsi&m=112487431524436&w=2
> http://marc.info/?l=linux-scsi&m=112487431524350&w=2
>
> errno.h new EXYZ
> http://marc.info/?l=linux-kernel&m=107715299008231&w=2
>
> add block layer blkdev.h error values
> http://marc.info/?l=linux-kernel&m=107961883915068&w=2
>
> add block layer blkdev.h error values (v2 convert more drivers)
> http://marc.info/?l=linux-scsi&m=112487427230642&w=2
>
> I think that patchset's appoach is fairly disruptive just to be able to
> train upper layers to differentiate (e.g. mpath).  But in the end maybe
> that change takes the code in a more desirable direction?

I think it is more disruptive, but it is the cleaner approach in the end.

#2 looks hacky. In the upper layers we would end up with checks for dasd,
AOE and other drivers. And #2 does not even work for filesystems
(the ext developers said they need this).



>
> 2)
> Another option is Hannes' approach of having DM consume req->errors and
> SCSI sense more directly.
>
Kiyoshi Ueda Aug. 27, 2010, 9:47 a.m. UTC | #21
Hi Mike,

On 08/26/2010 12:28 AM +0900, Mike Snitzer wrote:
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>> Anyway, as you said, the flush error handling of dm-mpath is already
>> broken if data loss really happens on any storage used by dm-mpath.
>> Although it's a serious issue and quick fix is required, I think
>> you may leave the old behavior in your patch-set, since it's
>> a separate issue.
> 
> I'm not seeing where anything is broken with current mpath.  If a
> multipathed LUN is WCE=1 then it should be fair to assume the cache is
> mirrored or shared across ports.  Therefore retrying the SYNCHRONIZE
> CACHE is needed.
> 
> Do we still have fear that SYNCHRONIZE CACHE can silently drop data?
> Seems unlikely especially given what Tejun shared from SBC.

Do we have any proof to dispel that fear?

If retrying on flush failure is safe on all storage used with multipath
(e.g. SCSI, CCISS, DASD, etc.), then current dm-mpath should be fine in
the real world.
But I'm afraid there may be a storage device where something like the
following can happen:
    - a flush command is returned to mpath as an error because part of
      the cache physically broke at that moment, so that part of the
      data is lost and the storage shrinks its cache.
    - mpath retries the flush command on another path.
    - the flush command is returned to mpath as success.
    - mpath passes the result, success, to the upper layer, even though
      some of the data has already been lost.

Thanks,
Kiyoshi Ueda
Mike Snitzer Aug. 27, 2010, 1:49 p.m. UTC | #22
On Fri, Aug 27 2010 at  5:47am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> Hi Mike,
> 
> On 08/26/2010 12:28 AM +0900, Mike Snitzer wrote:
> > Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> >> Anyway, as you said, the flush error handling of dm-mpath is already
> >> broken if data loss really happens on any storage used by dm-mpath.
> >> Although it's a serious issue and quick fix is required, I think
> >> you may leave the old behavior in your patch-set, since it's
> >> a separate issue.
> > 
> > I'm not seeing where anything is broken with current mpath.  If a
> > multipathed LUN is WCE=1 then it should be fair to assume the cache is
> > mirrored or shared across ports.  Therefore retrying the SYNCHRONIZE
> > CACHE is needed.
> > 
> > Do we still have fear that SYNCHRONIZE CACHE can silently drop data?
> > Seems unlikely especially given what Tejun shared from SBC.
> 
> Do we have any proof to wipe that fear?
> 
> If retrying on flush failure is safe on all storages used with multipath
> (e.g. SCSI, CCISS, DASD, etc), then current dm-mpath should be fine in
> the real world.
> But I'm afraid if there is a storage where something like below can happen:
>     - a flush command is returned as error to mpath because a part of
>       cache has physically broken at the time or so, then that part of
>       data loses and the size of the cache is shrunk by the storage.
>     - mpath retries the flush command using other path.
>     - the flush command is returned as success to mpath.
>     - mpath passes the result, success, to upper layer, but some of
>       the data already lost.

That does seem like a valid concern.  But I'm not seeing why it's unique
to SYNCHRONIZE CACHE.  Any IO that fails on the target side should be
passed up once the error gets to DM.

Mike
Kiyoshi Ueda Aug. 30, 2010, 6:13 a.m. UTC | #23
Hi Mike,

On 08/27/2010 10:49 PM +0900, Mike Snitzer wrote:
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>> On 08/26/2010 12:28 AM +0900, Mike Snitzer wrote:
>>> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>>>> Anyway, as you said, the flush error handling of dm-mpath is already
>>>> broken if data loss really happens on any storage used by dm-mpath.
>>>> Although it's a serious issue and quick fix is required, I think
>>>> you may leave the old behavior in your patch-set, since it's
>>>> a separate issue.
>>> 
>>> I'm not seeing where anything is broken with current mpath.  If a
>>> multipathed LUN is WCE=1 then it should be fair to assume the cache is
>>> mirrored or shared across ports.  Therefore retrying the SYNCHRONIZE
>>> CACHE is needed.
>>>
>>> Do we still have fear that SYNCHRONIZE CACHE can silently drop data?
>>> Seems unlikely especially given what Tejun shared from SBC.
>> 
>> Do we have any proof to wipe that fear?
>>
>> If retrying on flush failure is safe on all storages used with multipath
>> (e.g. SCSI, CCISS, DASD, etc), then current dm-mpath should be fine in
>> the real world.
>> But I'm afraid if there is a storage where something like below can happen:
>>     - a flush command is returned as error to mpath because a part of
>>       cache has physically broken at the time or so, then that part of
>>       data loses and the size of the cache is shrunk by the storage.
>>     - mpath retries the flush command using other path.
>>     - the flush command is returned as success to mpath.
>>     - mpath passes the result, success, to upper layer, but some of
>>       the data already lost.
> 
> That does seem like a valid concern.  But I'm not seeing why its unique
> to SYNCHRONIZE CACHE.  Any IO that fails on the target side should be
> passed up once the error gets to DM.

See Tejun's explanation again:
    http://marc.info/?l=linux-kernel&m=128267361813859&w=2
What I'm concerned about is whether the same thing Tejun explained
for ATA can happen on other types of devices.


A normal write command carries its data, and no data is lost on error.
So it can be retried cleanly, and if the retry succeeds, it really is
a success, with no implicit data loss.

A normal read command targets a specific sector.  If that sector is
broken, all retries will fail and the error will be reported upwards.
So it can be retried cleanly as well.

Thanks,
Kiyoshi Ueda
Hannes Reinecke Aug. 30, 2010, 11:38 a.m. UTC | #24
Mike Snitzer wrote:
> On Wed, Aug 25 2010 at  4:00am -0400,
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> 
>>> I'm not sure how to proceed here.  How much work would
>>> discerning between transport and IO errors take?  If it can't be done
>>> quickly enough the retry logic can be kept around to keep the old
>>> behavior but that already was a broken behavior, so...  :-(
>> I'm not sure how long will it take.
> 
> We first need to understand what direction we want to go with this.  We
> currently have 2 options.  But any other ideas are obviously welcome.
> 
> 1)
> Mike Christie has a patchset that introduce more specific
> target/transport/host error codes.  Mike shared these pointers but he'd
> have to put the work in to refresh them:
> http://marc.info/?l=linux-scsi&m=112487427230642&w=2
> http://marc.info/?l=linux-scsi&m=112487427306501&w=2
> http://marc.info/?l=linux-scsi&m=112487431524436&w=2
> http://marc.info/?l=linux-scsi&m=112487431524350&w=2
> 
> errno.h new EXYZ
> http://marc.info/?l=linux-kernel&m=107715299008231&w=2
> 
> add block layer blkdev.h error values
> http://marc.info/?l=linux-kernel&m=107961883915068&w=2
> 
> add block layer blkdev.h error values (v2 convert more drivers)
> http://marc.info/?l=linux-scsi&m=112487427230642&w=2
> 
> I think that patchset's appoach is fairly disruptive just to be able to
> train upper layers to differentiate (e.g. mpath).  But in the end maybe
> that change takes the code in a more desirable direction?
> 
> 2)
> Another option is Hannes' approach of having DM consume req->errors and
> SCSI sense more directly.
> 

Actually, I think we have two separate issues here:
1) The need for more detailed I/O errors even in the fs layer. We've
   already discussed this at LSF; the consensus was to allow errors
   other than just 'EIO'.
   Instead of Mike's approach I would rather use existing error codes
   here; this will make the transition somewhat easier.
   Initially I would propose to return 'ENOLINK' for a transport failure,
   'EIO' for a non-retryable failure on the target, and 'ENODEV' for a
   retryable failure on the target.

2) The need to differentiate the various error conditions in the
   multipath layer. Multipath needs to distinguish the three error types
   specified in 1).

Mike has been trying to solve 1) and 2) by introducing separate/new error
codes, and I have been trying to solve 2) by parsing the sense codes
directly in multipath.

Given that the fs people have expressed their desire to know about these
error classes, too, it makes sense to have them exposed to the fs layer.

I'll see if I can come up with a patch.

Cheers,

Hannes
Sergei Shtylyov Aug. 30, 2010, 12:07 p.m. UTC | #25
Hello.

Hannes Reinecke wrote:

> Actually, I think we have two separate issues here:
> 1) The need of having more detailed I/O errors even in the fs layer. This
>    we've already discussed at the LSF, consensus here is to allow other
>    errors than just 'EIO'.
>    Instead of Mike's approach I would rather use existing error codes here;
>    this will make the transition somewhat easier.
>    Initially I would propose to return 'ENOLINK' for a transport failure,
>    'EIO' for a non-retryable failure on the target, and 'ENODEV' for a
>    retryable failure on the target.

    Are you sure it's not vice versa: EIO for retryable and ENODEV for
non-retryable failures?  ENODEV looks more like a permanent condition to me.

WBR, Sergei
Hannes Reinecke Aug. 30, 2010, 12:39 p.m. UTC | #26
Sergei Shtylyov wrote:
> Hello.
> 
> Hannes Reinecke wrote:
> 
>> Actually, I think we have two separate issues here:
>> 1) The need of having more detailed I/O errors even in the fs layer. This
>>    we've already discussed at the LSF, consensus here is to allow other
>>    errors than just 'EIO'.
>>    Instead of Mike's approach I would rather use existing error codes
>> here;
>>    this will make the transition somewhat easier.
>>    Initially I would propose to return 'ENOLINK' for a transport failure,
>>    'EIO' for a non-retryable failure on the target, and 'ENODEV' for a
>>    retryable failure on the target.
> 
>    Are you sure it's not vice versa: EIO for retryable and ENODEV for
> non-retryable failures. ENODEV looks more like permanent condition to me.
> 
Ok, can do.
And looking at the error numbers again, maybe we should be using 'EREMOTEIO'
for non-retryable failures.

So we would end up with:

ENOLINK: transport failure
EIO: retryable remote failure
EREMOTEIO: non-retryable remote failure

Does that look okay?
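
To make that concrete, here is a minimal userspace sketch of how a
multipath-style consumer could branch on those three classes.  The enum
and helper below are purely illustrative, not existing kernel code:

#include <errno.h>
#include <stdio.h>

enum failure_action {
	FAIL_PATH_AND_RETRY,	/* transport is gone, try another path */
	RETRY_SAME_TARGET,	/* remote failure that is worth retrying */
	PASS_ERROR_UP,		/* remote failure that must be reported */
};

static enum failure_action classify_io_error(int error)
{
	switch (error) {
	case ENOLINK:		/* transport failure */
		return FAIL_PATH_AND_RETRY;
	case EIO:		/* retryable remote failure */
		return RETRY_SAME_TARGET;
	case EREMOTEIO:		/* non-retryable remote failure */
	default:
		return PASS_ERROR_UP;
	}
}

int main(void)
{
	printf("ENOLINK   -> %d\n", classify_io_error(ENOLINK));
	printf("EIO       -> %d\n", classify_io_error(EIO));
	printf("EREMOTEIO -> %d\n", classify_io_error(EREMOTEIO));
	return 0;
}

The point is just that the consumer never has to look at sense data;
the errno value alone would carry the retry semantics.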

Cheers,

Hannes
Mike Snitzer Sept. 1, 2010, 12:55 a.m. UTC | #27
Hi Kiyoshi,

On Mon, Aug 30 2010 at  2:13am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> > That does seem like a valid concern.  But I'm not seeing why its unique
> > to SYNCHRONIZE CACHE.  Any IO that fails on the target side should be
> > passed up once the error gets to DM.
> 
> See the Tejun's explanation again:
>     http://marc.info/?l=linux-kernel&m=128267361813859&w=2
> What I'm concerning is whether the same thing as Tejun explained
> for ATA can happen on other types of devices.
> 
> 
> Normal write command has data and no data loss happens on error.
> So it can be retried cleanly, and if the result of the retry is
> success, it's really success, no implicit data loss.
> 
> Normal read command has a sector to read.  If the sector is broken,
> all retries will fail and the error will be reported upwards.
> So it can be retried cleanly as well.

I reached out to Fred Knight on this, to get more insight from a pure
SCSI SBC perspective, and he shared the following:

----- Forwarded message from "Knight, Frederick" <Frederick.Knight@netapp.com> -----

> Date: Tue, 31 Aug 2010 13:24:15 -0400
> From: "Knight, Frederick" <Frederick.Knight@netapp.com>
> To: Mike Snitzer <snitzer@redhat.com>
> Subject: RE: safety of retrying SYNCHRONIZE CACHE?
>
> There are requirements in SBC to maintain data integrity.  If you WRITE
> a block and READ that block, you must get the data you sent in the
> WRITE.  This will be synchronized around the completion of the WRITE.
> Before the WRITE completes, who knows what a READ will return.  Maybe
> all the old data, maybe all the new data, maybe some mix of old and new
> data.  Once the WRITE ends successful, all READs of those LBAs (from any
> port) will always get the same data.
>
> As for errors, SBC describes how the deferred errors are reported (like
> when a CACHE tries to flush but fails).  So if a write from cache to
> media does have problems, the device would tell you via a CHECK
> CONDITION (with the first byte of the sense data set to 71h or 73h).  SBC
> clauses 4.12 and 4.13 cover a lot of this information.  It is these error
> codes that prevent silent loss of data.  And, in this case, when the
> CHECK CONDITION is delivered, it will have nothing to do with the
> command that was issued (the victim command).  If you look into the
> sense data, you will see the deferred error flag, and all the additional
> information fields will relate to the original I/O
>
> SYNCHRONIZE CACHE is not substantially different than a WRITE (it puts
> data on the media).  So issuing it multiple times wouldn't be any
> different than issuing multiple WRITES (it might put a temporary dent in
> performance as everything flushes out to media).  If it or any other
> commands fail with 71h/73h, then you have to dig down into the sense
> data buffer to find out what happened.  For example, if you issue a
> WRITE command, and it completes into write back cache but later (before
> being written to the media), some of the cache breaks and loses data,
> then the device must signal a deferred error to tell the host, and cause
> a forced error on the LBA in question.
>
> Does that help?
>
>       Fred
----- End forwarded message -----

Seems like verifying/improving the handling of CHECK CONDITION is a more
pressing concern than silent data loss purely due to SYNCHRONIZE CACHE
retries.  Without proper handling we could completely miss these
deferred errors.

But how to effectively report such errors to upper layers is unclear to
me given that a particular SCSI command can carry error information for
IO that was already acknowledged as successful (e.g. to the FS).

drivers/scsi/scsi_error.c's various calls to scsi_check_sense()
illustrate Linux's current CHECK CONDITION handling.  I need to look
closer at how deferred errors propagate to upper layers.  After an
initial look it seems scsi_error.c does handle retrying commands where
appropriate.
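
For reference, the deferred-error indication Fred describes lives in the
response code of the sense data (71h for fixed format, 73h for
descriptor format).  A minimal standalone check, along the lines of what
scsi_sense_is_deferred() does in the kernel (the helper name below is
just for illustration), could look like:

#include <stdbool.h>
#include <stdio.h>

/*
 * Byte 0, bits 6:0 of the sense data hold the response code;
 * 71h and 73h mean a deferred error from an earlier command.
 */
static bool sense_is_deferred(const unsigned char *sense)
{
	unsigned char response_code = sense[0] & 0x7f;

	return response_code == 0x71 || response_code == 0x73;
}

int main(void)
{
	unsigned char current_sense[18]  = { 0x70 };	/* current error, fixed format */
	unsigned char deferred_sense[18] = { 0x71 };	/* deferred error, fixed format */

	printf("70h -> %s\n", sense_is_deferred(current_sense) ? "deferred" : "current");
	printf("71h -> %s\n", sense_is_deferred(deferred_sense) ? "deferred" : "current");
	return 0;
}

A retry that succeeds says nothing about such a deferred CHECK
CONDITION, which is why the sense data has to be inspected before the
error is dropped.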

I believe Hannes has concerns/insight here.

Mike
Hannes Reinecke Sept. 1, 2010, 7:32 a.m. UTC | #28
Mike Snitzer wrote:
> Hi Kiyoshi,
> 
> On Mon, Aug 30 2010 at  2:13am -0400,
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> 
>>> That does seem like a valid concern.  But I'm not seeing why its unique
>>> to SYNCHRONIZE CACHE.  Any IO that fails on the target side should be
>>> passed up once the error gets to DM.
>> See the Tejun's explanation again:
>>     http://marc.info/?l=linux-kernel&m=128267361813859&w=2
>> What I'm concerning is whether the same thing as Tejun explained
>> for ATA can happen on other types of devices.
>>
>>
>> Normal write command has data and no data loss happens on error.
>> So it can be retried cleanly, and if the result of the retry is
>> success, it's really success, no implicit data loss.
>>
>> Normal read command has a sector to read.  If the sector is broken,
>> all retries will fail and the error will be reported upwards.
>> So it can be retried cleanly as well.
> 
> I reached out to Fred Knight on this, to get a more insight from a pure
> SCSI SBC perspective, and he shared the following:
> 
> ----- Forwarded message from "Knight, Frederick" <Frederick.Knight@netapp.com> -----
> 
>> Date: Tue, 31 Aug 2010 13:24:15 -0400
>> From: "Knight, Frederick" <Frederick.Knight@netapp.com>
>> To: Mike Snitzer <snitzer@redhat.com>
>> Subject: RE: safety of retrying SYNCHRONIZE CACHE?
>>
>> There are requirements in SBC to maintain data integrity.  If you WRITE
>> a block and READ that block, you must get the data you sent in the
>> WRITE.  This will be synchronized around the completion of the WRITE.
>> Before the WRITE completes, who knows what a READ will return.  Maybe
>> all the old data, maybe all the new data, maybe some mix of old and new
>> data.  Once the WRITE ends successful, all READs of those LBAs (from any
>> port) will always get the same data.
>>
>> As for errors, SBC describes how the deferred errors are reported (like
>> when a CACHE tries to flush but fails).  So if a write from cache to
>> media does have problems, the device would tell you via a CHECK
>> CONDITION (with the first byte of the sense data set to 71h or 73h.  SBC
>> clause 4.12 and 4.13 cover a lot of this information.  It is these error
>> codes that prevent silent loss of data.  And, in this case, when the
>> CHECK CONDITION is delivered, it will have nothing to do with the
>> command that was issued (the victim command).  If you look into the
>> sense data, you will see the deferred error flag, and all the additional
>> information fields will relate to the original I/O
>>
>> SYNCHRONIZE CACHE is not substantially different than a WRITE (it puts
>> data on the media).  So issuing it multiple times wouldn't be any
>> different than issuing multiple WRITES (it might put a temporary dent in
>> performance as everything flushes out to media).  If it or any other
>> commands fail with 71h/73h, then you have to dig down into the sense
>> data buffer to find out what happened.  For example, if you issue a
>> WRITE command, and it completes into write back cache but later (before
>> being written to the media), some of the cache breaks and looses data,
>> then the device must signal a deferred error to tell the host, and cause
>> a forced error on the LBA in question.
>>
>> Does that help?
>>
>>       Fred
> ----- End forwarded message -----
> 
> Seems like verifying/improving the handling of CHECK CONDITION is a more
> pressing concern than silent data loss purely due to SYNCHRONIZE CACHE
> retries.  Without proper handling we could completely miss these
> deferred errors.
> 
Yes.

> But how to effectively report such errors to upper layers is unclear to
> me given that a particular SCSI command can carry error information for
> IO that was already acknowledged successful (e.g. to the FS).
> 
> drivers/scsi/scsi_error.c's various calls to scsi_check_sense()
> illustrate Linux's current CHECK CONDITION handling.  I need to look
> closer at how deferred errors propagate to upper layers.  After an
> initial look it seems scsi_error.c does handle retrying commands where
> appropriate.
> 
> I believe Hannes has concerns/insight here.
> 

Quite. We _should_ be handling deferred errors correctly;
if you check drivers/scsi/scsi_lib.c:scsi_io_completion()
you'll find this:

	if (host_byte(result) == DID_RESET) {
		/* Third party bus reset or reset for error recovery
		 * reasons.  Just retry the command and see what
		 * happens.
		 */
		action = ACTION_RETRY;
	} else if (sense_valid && !sense_deferred) {
                ...
	} else {
		description = "Unhandled error code";
		action = ACTION_FAIL;
	}

i.e. for deferred errors we're already aborting the command.  I'm not
sure I agree with this bit in drivers/scsi/scsi_error.c:

static int scsi_check_sense(struct scsi_cmnd *scmd)
{
	struct scsi_device *sdev = scmd->device;
	struct scsi_sense_hdr sshdr;

	if (! scsi_command_normalize_sense(scmd, &sshdr))
		return FAILED;	/* no valid sense data */

	if (scsi_sense_is_deferred(&sshdr))
		return NEEDS_RETRY;

I doubt we can resolve the situation by retrying the command, which
would be the wrong command to retry anyway.  I would rather have it
return 'SUCCESS' and add another case in scsi_io_completion()
to notify us about the deferred error.

I'll be sending a patch.

Cheers,

Hannes
diff mbox

Patch

Index: block/drivers/block/loop.c
===================================================================
--- block.orig/drivers/block/loop.c
+++ block/drivers/block/loop.c
@@ -477,17 +477,17 @@  static int do_bio_filebacked(struct loop
 	pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset;

 	if (bio_rw(bio) == WRITE) {
-		bool barrier = (bio->bi_rw & REQ_HARDBARRIER);
 		struct file *file = lo->lo_backing_file;

-		if (barrier) {
-			if (unlikely(!file->f_op->fsync)) {
-				ret = -EOPNOTSUPP;
-				goto out;
-			}
+		/* REQ_HARDBARRIER is deprecated */
+		if (bio->bi_rw & REQ_HARDBARRIER) {
+			ret = -EOPNOTSUPP;
+			goto out;
+		}

+		if (bio->bi_rw & REQ_FLUSH) {
 			ret = vfs_fsync(file, 0);
-			if (unlikely(ret)) {
+			if (unlikely(ret && ret != -EINVAL)) {
 				ret = -EIO;
 				goto out;
 			}
@@ -495,9 +495,9 @@  static int do_bio_filebacked(struct loop

 		ret = lo_send(lo, bio, pos);

-		if (barrier && !ret) {
+		if ((bio->bi_rw & REQ_FUA) && !ret) {
 			ret = vfs_fsync(file, 0);
-			if (unlikely(ret))
+			if (unlikely(ret && ret != -EINVAL))
 				ret = -EIO;
 		}
 	} else
Index: block/drivers/block/virtio_blk.c
===================================================================
--- block.orig/drivers/block/virtio_blk.c
+++ block/drivers/block/virtio_blk.c
@@ -128,9 +128,6 @@  static bool do_req(struct request_queue
 		}
 	}

-	if (vbr->req->cmd_flags & REQ_HARDBARRIER)
-		vbr->out_hdr.type |= VIRTIO_BLK_T_BARRIER;
-
 	sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));

 	/*
@@ -157,6 +154,8 @@  static bool do_req(struct request_queue
 		if (rq_data_dir(vbr->req) == WRITE) {
 			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
 			out += num;
+			if (req->cmd_flags & REQ_FUA)
+				vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;
 		} else {
 			vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
 			in += num;
@@ -307,6 +306,7 @@  static int __devinit virtblk_probe(struc
 {
 	struct virtio_blk *vblk;
 	struct request_queue *q;
+	unsigned int flush;
 	int err;
 	u64 cap;
 	u32 v, blk_size, sg_elems, opt_io_size;
@@ -388,15 +388,13 @@  static int __devinit virtblk_probe(struc
 	vblk->disk->driverfs_dev = &vdev->dev;
 	index++;

-	/*
-	 * If the FLUSH feature is supported we do have support for
-	 * flushing a volatile write cache on the host.  Use that to
-	 * implement write barrier support; otherwise, we must assume
-	 * that the host does not perform any kind of volatile write
-	 * caching.
-	 */
+	/* configure queue flush support */
+	flush = 0;
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
-		blk_queue_flush(q, REQ_FLUSH);
+		flush |= REQ_FLUSH;
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FUA))
+		flush |= REQ_FUA;
+	blk_queue_flush(q, flush);

 	/* If disk is read-only in the host, the guest should obey */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
@@ -515,9 +513,9 @@  static const struct virtio_device_id id_
 };

 static unsigned int features[] = {
-	VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
-	VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
-	VIRTIO_BLK_F_SCSI, VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY
+	VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
+	VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_SCSI,
+	VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_FUA,
 };

 /*
Index: block/include/linux/virtio_blk.h
===================================================================
--- block.orig/include/linux/virtio_blk.h
+++ block/include/linux/virtio_blk.h
@@ -16,6 +16,7 @@ 
 #define VIRTIO_BLK_F_SCSI	7	/* Supports scsi command passthru */
 #define VIRTIO_BLK_F_FLUSH	9	/* Cache flush command support */
 #define VIRTIO_BLK_F_TOPOLOGY	10	/* Topology information is available */
+#define VIRTIO_BLK_F_FUA	11	/* Forced Unit Access write support */

 #define VIRTIO_BLK_ID_BYTES	20	/* ID string length */

@@ -70,7 +71,10 @@  struct virtio_blk_config {
 #define VIRTIO_BLK_T_FLUSH	4

 /* Get device ID command */
-#define VIRTIO_BLK_T_GET_ID    8
+#define VIRTIO_BLK_T_GET_ID	8
+
+/* FUA command */
+#define VIRTIO_BLK_T_FUA	16

 /* Barrier before this op. */
 #define VIRTIO_BLK_T_BARRIER	0x80000000
Index: block/Documentation/lguest/lguest.c
===================================================================
--- block.orig/Documentation/lguest/lguest.c
+++ block/Documentation/lguest/lguest.c
@@ -1639,15 +1639,6 @@  static void blk_request(struct virtqueue
 	off = out->sector * 512;

 	/*
-	 * The block device implements "barriers", where the Guest indicates
-	 * that it wants all previous writes to occur before this write.  We
-	 * don't have a way of asking our kernel to do a barrier, so we just
-	 * synchronize all the data in the file.  Pretty poor, no?
-	 */
-	if (out->type & VIRTIO_BLK_T_BARRIER)
-		fdatasync(vblk->fd);
-
-	/*
 	 * In general the virtio block driver is allowed to try SCSI commands.
 	 * It'd be nice if we supported eject, for example, but we don't.
 	 */
@@ -1679,6 +1670,19 @@  static void blk_request(struct virtqueue
 			/* Die, bad Guest, die. */
 			errx(1, "Write past end %llu+%u", off, ret);
 		}
+
+		/* Honor FUA by syncing everything. */
+		if (ret >= 0 && (out->type & VIRTIO_BLK_T_FUA)) {
+			ret = fdatasync(vblk->fd);
+			verbose("FUA fdatasync: %i\n", ret);
+		}
+
+		wlen = sizeof(*in);
+		*in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
+	} else if (out->type & VIRTIO_BLK_T_FLUSH) {
+		/* Flush */
+		ret = fdatasync(vblk->fd);
+		verbose("FLUSH fdatasync: %i\n", ret);
 		wlen = sizeof(*in);
 		*in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
 	} else {
@@ -1702,15 +1706,6 @@  static void blk_request(struct virtqueue
 		}
 	}

-	/*
-	 * OK, so we noted that it was pretty poor to use an fdatasync as a
-	 * barrier.  But Christoph Hellwig points out that we need a sync
-	 * *afterwards* as well: "Barriers specify no reordering to the front
-	 * or the back."  And Jens Axboe confirmed it, so here we are:
-	 */
-	if (out->type & VIRTIO_BLK_T_BARRIER)
-		fdatasync(vblk->fd);
-
 	/* Finished that request. */
 	add_used(vq, head, wlen);
 }
@@ -1735,8 +1730,9 @@  static void setup_block_file(const char
 	vblk->fd = open_or_die(filename, O_RDWR|O_LARGEFILE);
 	vblk->len = lseek64(vblk->fd, 0, SEEK_END);

-	/* We support barriers. */
-	add_feature(dev, VIRTIO_BLK_F_BARRIER);
+	/* We support FLUSH and FUA. */
+	add_feature(dev, VIRTIO_BLK_F_FLUSH);
+	add_feature(dev, VIRTIO_BLK_F_FUA);

 	/* Tell Guest how many sectors this device has. */
 	conf.capacity = cpu_to_le64(vblk->len / 512);
Index: block/drivers/md/linear.c
===================================================================
--- block.orig/drivers/md/linear.c
+++ block/drivers/md/linear.c
@@ -294,8 +294,8 @@  static int linear_make_request (mddev_t
 	dev_info_t *tmp_dev;
 	sector_t start_sector;

-	if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
-		md_barrier_request(mddev, bio);
+	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+		md_flush_request(mddev, bio);
 		return 0;
 	}

Index: block/drivers/md/md.c
===================================================================
--- block.orig/drivers/md/md.c
+++ block/drivers/md/md.c
@@ -226,12 +226,12 @@  static int md_make_request(struct reques
 		return 0;
 	}
 	rcu_read_lock();
-	if (mddev->suspended || mddev->barrier) {
+	if (mddev->suspended) {
 		DEFINE_WAIT(__wait);
 		for (;;) {
 			prepare_to_wait(&mddev->sb_wait, &__wait,
 					TASK_UNINTERRUPTIBLE);
-			if (!mddev->suspended && !mddev->barrier)
+			if (!mddev->suspended)
 				break;
 			rcu_read_unlock();
 			schedule();
@@ -280,40 +280,29 @@  static void mddev_resume(mddev_t *mddev)

 int mddev_congested(mddev_t *mddev, int bits)
 {
-	if (mddev->barrier)
-		return 1;
 	return mddev->suspended;
 }
 EXPORT_SYMBOL(mddev_congested);

 /*
- * Generic barrier handling for md
+ * Generic flush handling for md
  */

-#define POST_REQUEST_BARRIER ((void*)1)
-
-static void md_end_barrier(struct bio *bio, int err)
+static void md_end_flush(struct bio *bio, int err)
 {
 	mdk_rdev_t *rdev = bio->bi_private;
 	mddev_t *mddev = rdev->mddev;
-	if (err == -EOPNOTSUPP && mddev->barrier != POST_REQUEST_BARRIER)
-		set_bit(BIO_EOPNOTSUPP, &mddev->barrier->bi_flags);

 	rdev_dec_pending(rdev, mddev);

 	if (atomic_dec_and_test(&mddev->flush_pending)) {
-		if (mddev->barrier == POST_REQUEST_BARRIER) {
-			/* This was a post-request barrier */
-			mddev->barrier = NULL;
-			wake_up(&mddev->sb_wait);
-		} else
-			/* The pre-request barrier has finished */
-			schedule_work(&mddev->barrier_work);
+		/* The pre-request flush has finished */
+		schedule_work(&mddev->flush_work);
 	}
 	bio_put(bio);
 }

-static void submit_barriers(mddev_t *mddev)
+static void submit_flushes(mddev_t *mddev)
 {
 	mdk_rdev_t *rdev;

@@ -330,60 +319,56 @@  static void submit_barriers(mddev_t *mdd
 			atomic_inc(&rdev->nr_pending);
 			rcu_read_unlock();
 			bi = bio_alloc(GFP_KERNEL, 0);
-			bi->bi_end_io = md_end_barrier;
+			bi->bi_end_io = md_end_flush;
 			bi->bi_private = rdev;
 			bi->bi_bdev = rdev->bdev;
 			atomic_inc(&mddev->flush_pending);
-			submit_bio(WRITE_BARRIER, bi);
+			submit_bio(WRITE_FLUSH, bi);
 			rcu_read_lock();
 			rdev_dec_pending(rdev, mddev);
 		}
 	rcu_read_unlock();
 }

-static void md_submit_barrier(struct work_struct *ws)
+static void md_submit_flush_data(struct work_struct *ws)
 {
-	mddev_t *mddev = container_of(ws, mddev_t, barrier_work);
-	struct bio *bio = mddev->barrier;
+	mddev_t *mddev = container_of(ws, mddev_t, flush_work);
+	struct bio *bio = mddev->flush_bio;

 	atomic_set(&mddev->flush_pending, 1);

-	if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
-		bio_endio(bio, -EOPNOTSUPP);
-	else if (bio->bi_size == 0)
+	if (bio->bi_size == 0)
 		/* an empty barrier - all done */
 		bio_endio(bio, 0);
 	else {
-		bio->bi_rw &= ~REQ_HARDBARRIER;
+		bio->bi_rw &= ~REQ_FLUSH;
 		if (mddev->pers->make_request(mddev, bio))
 			generic_make_request(bio);
-		mddev->barrier = POST_REQUEST_BARRIER;
-		submit_barriers(mddev);
 	}
 	if (atomic_dec_and_test(&mddev->flush_pending)) {
-		mddev->barrier = NULL;
+		mddev->flush_bio = NULL;
 		wake_up(&mddev->sb_wait);
 	}
 }

-void md_barrier_request(mddev_t *mddev, struct bio *bio)
+void md_flush_request(mddev_t *mddev, struct bio *bio)
 {
 	spin_lock_irq(&mddev->write_lock);
 	wait_event_lock_irq(mddev->sb_wait,
-			    !mddev->barrier,
+			    !mddev->flush_bio,
 			    mddev->write_lock, /*nothing*/);
-	mddev->barrier = bio;
+	mddev->flush_bio = bio;
 	spin_unlock_irq(&mddev->write_lock);

 	atomic_set(&mddev->flush_pending, 1);
-	INIT_WORK(&mddev->barrier_work, md_submit_barrier);
+	INIT_WORK(&mddev->flush_work, md_submit_flush_data);

-	submit_barriers(mddev);
+	submit_flushes(mddev);

 	if (atomic_dec_and_test(&mddev->flush_pending))
-		schedule_work(&mddev->barrier_work);
+		schedule_work(&mddev->flush_work);
 }
-EXPORT_SYMBOL(md_barrier_request);
+EXPORT_SYMBOL(md_flush_request);

 static inline mddev_t *mddev_get(mddev_t *mddev)
 {
@@ -642,31 +627,6 @@  static void super_written(struct bio *bi
 	bio_put(bio);
 }

-static void super_written_barrier(struct bio *bio, int error)
-{
-	struct bio *bio2 = bio->bi_private;
-	mdk_rdev_t *rdev = bio2->bi_private;
-	mddev_t *mddev = rdev->mddev;
-
-	if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
-	    error == -EOPNOTSUPP) {
-		unsigned long flags;
-		/* barriers don't appear to be supported :-( */
-		set_bit(BarriersNotsupp, &rdev->flags);
-		mddev->barriers_work = 0;
-		spin_lock_irqsave(&mddev->write_lock, flags);
-		bio2->bi_next = mddev->biolist;
-		mddev->biolist = bio2;
-		spin_unlock_irqrestore(&mddev->write_lock, flags);
-		wake_up(&mddev->sb_wait);
-		bio_put(bio);
-	} else {
-		bio_put(bio2);
-		bio->bi_private = rdev;
-		super_written(bio, error);
-	}
-}
-
 void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
 		   sector_t sector, int size, struct page *page)
 {
@@ -675,51 +635,28 @@  void md_super_write(mddev_t *mddev, mdk_
 	 * and decrement it on completion, waking up sb_wait
 	 * if zero is reached.
 	 * If an error occurred, call md_error
-	 *
-	 * As we might need to resubmit the request if REQ_HARDBARRIER
-	 * causes ENOTSUPP, we allocate a spare bio...
 	 */
 	struct bio *bio = bio_alloc(GFP_NOIO, 1);
-	int rw = REQ_WRITE | REQ_SYNC | REQ_UNPLUG;

 	bio->bi_bdev = rdev->bdev;
 	bio->bi_sector = sector;
 	bio_add_page(bio, page, size, 0);
 	bio->bi_private = rdev;
 	bio->bi_end_io = super_written;
-	bio->bi_rw = rw;

 	atomic_inc(&mddev->pending_writes);
-	if (!test_bit(BarriersNotsupp, &rdev->flags)) {
-		struct bio *rbio;
-		rw |= REQ_HARDBARRIER;
-		rbio = bio_clone(bio, GFP_NOIO);
-		rbio->bi_private = bio;
-		rbio->bi_end_io = super_written_barrier;
-		submit_bio(rw, rbio);
-	} else
-		submit_bio(rw, bio);
+	submit_bio(REQ_WRITE | REQ_SYNC | REQ_UNPLUG | REQ_FLUSH | REQ_FUA,
+		   bio);
 }

 void md_super_wait(mddev_t *mddev)
 {
-	/* wait for all superblock writes that were scheduled to complete.
-	 * if any had to be retried (due to BARRIER problems), retry them
-	 */
+	/* wait for all superblock writes that were scheduled to complete */
 	DEFINE_WAIT(wq);
 	for(;;) {
 		prepare_to_wait(&mddev->sb_wait, &wq, TASK_UNINTERRUPTIBLE);
 		if (atomic_read(&mddev->pending_writes)==0)
 			break;
-		while (mddev->biolist) {
-			struct bio *bio;
-			spin_lock_irq(&mddev->write_lock);
-			bio = mddev->biolist;
-			mddev->biolist = bio->bi_next ;
-			bio->bi_next = NULL;
-			spin_unlock_irq(&mddev->write_lock);
-			submit_bio(bio->bi_rw, bio);
-		}
 		schedule();
 	}
 	finish_wait(&mddev->sb_wait, &wq);
@@ -1016,7 +953,6 @@  static int super_90_validate(mddev_t *md
 	clear_bit(Faulty, &rdev->flags);
 	clear_bit(In_sync, &rdev->flags);
 	clear_bit(WriteMostly, &rdev->flags);
-	clear_bit(BarriersNotsupp, &rdev->flags);

 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 0;
@@ -1431,7 +1367,6 @@  static int super_1_validate(mddev_t *mdd
 	clear_bit(Faulty, &rdev->flags);
 	clear_bit(In_sync, &rdev->flags);
 	clear_bit(WriteMostly, &rdev->flags);
-	clear_bit(BarriersNotsupp, &rdev->flags);

 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 1;
@@ -4463,7 +4398,6 @@  static int md_run(mddev_t *mddev)
 	/* may be over-ridden by personality */
 	mddev->resync_max_sectors = mddev->dev_sectors;

-	mddev->barriers_work = 1;
 	mddev->ok_start_degraded = start_dirty_degraded;

 	if (start_readonly && mddev->ro == 0)
@@ -4638,7 +4572,6 @@  static void md_clean(mddev_t *mddev)
 	mddev->recovery = 0;
 	mddev->in_sync = 0;
 	mddev->degraded = 0;
-	mddev->barriers_work = 0;
 	mddev->safemode = 0;
 	mddev->bitmap_info.offset = 0;
 	mddev->bitmap_info.default_offset = 0;
Index: block/drivers/md/md.h
===================================================================
--- block.orig/drivers/md/md.h
+++ block/drivers/md/md.h
@@ -67,7 +67,6 @@  struct mdk_rdev_s
 #define	Faulty		1		/* device is known to have a fault */
 #define	In_sync		2		/* device is in_sync with rest of array */
 #define	WriteMostly	4		/* Avoid reading if at all possible */
-#define	BarriersNotsupp	5		/* REQ_HARDBARRIER is not supported */
 #define	AllReserved	6		/* If whole device is reserved for
 					 * one array */
 #define	AutoDetected	7		/* added by auto-detect */
@@ -249,13 +248,6 @@  struct mddev_s
 	int				degraded;	/* whether md should consider
 							 * adding a spare
 							 */
-	int				barriers_work;	/* initialised to true, cleared as soon
-							 * as a barrier request to slave
-							 * fails.  Only supported
-							 */
-	struct bio			*biolist; 	/* bios that need to be retried
-							 * because REQ_HARDBARRIER is not supported
-							 */

 	atomic_t			recovery_active; /* blocks scheduled, but not written */
 	wait_queue_head_t		recovery_wait;
@@ -308,16 +300,13 @@  struct mddev_s
 	struct list_head		all_mddevs;

 	struct attribute_group		*to_remove;
-	/* Generic barrier handling.
-	 * If there is a pending barrier request, all other
-	 * writes are blocked while the devices are flushed.
-	 * The last to finish a flush schedules a worker to
-	 * submit the barrier request (without the barrier flag),
-	 * then submit more flush requests.
+	/* Generic flush handling.
+	 * The last to finish preflush schedules a worker to submit
+	 * the rest of the request (without the REQ_FLUSH flag).
 	 */
-	struct bio *barrier;
+	struct bio *flush_bio;
 	atomic_t flush_pending;
-	struct work_struct barrier_work;
+	struct work_struct flush_work;
 };


@@ -458,7 +447,7 @@  extern void md_done_sync(mddev_t *mddev,
 extern void md_error(mddev_t *mddev, mdk_rdev_t *rdev);

 extern int mddev_congested(mddev_t *mddev, int bits);
-extern void md_barrier_request(mddev_t *mddev, struct bio *bio);
+extern void md_flush_request(mddev_t *mddev, struct bio *bio);
 extern void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
 			   sector_t sector, int size, struct page *page);
 extern void md_super_wait(mddev_t *mddev);
Index: block/drivers/md/raid0.c
===================================================================
--- block.orig/drivers/md/raid0.c
+++ block/drivers/md/raid0.c
@@ -483,8 +483,8 @@  static int raid0_make_request(mddev_t *m
 	struct strip_zone *zone;
 	mdk_rdev_t *tmp_dev;

-	if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
-		md_barrier_request(mddev, bio);
+	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+		md_flush_request(mddev, bio);
 		return 0;
 	}

Index: block/drivers/md/raid1.c
===================================================================
--- block.orig/drivers/md/raid1.c
+++ block/drivers/md/raid1.c
@@ -319,83 +319,74 @@  static void raid1_end_write_request(stru
 		if (r1_bio->bios[mirror] == bio)
 			break;

-	if (error == -EOPNOTSUPP && test_bit(R1BIO_Barrier, &r1_bio->state)) {
-		set_bit(BarriersNotsupp, &conf->mirrors[mirror].rdev->flags);
-		set_bit(R1BIO_BarrierRetry, &r1_bio->state);
-		r1_bio->mddev->barriers_work = 0;
-		/* Don't rdev_dec_pending in this branch - keep it for the retry */
-	} else {
+	/*
+	 * 'one mirror IO has finished' event handler:
+	 */
+	r1_bio->bios[mirror] = NULL;
+	to_put = bio;
+	if (!uptodate) {
+		md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
+		/* an I/O failed, we can't clear the bitmap */
+		set_bit(R1BIO_Degraded, &r1_bio->state);
+	} else
 		/*
-		 * this branch is our 'one mirror IO has finished' event handler:
+		 * Set R1BIO_Uptodate in our master bio, so that we
+		 * will return a good error code for to the higher
+		 * levels even if IO on some other mirrored buffer
+		 * fails.
+		 *
+		 * The 'master' represents the composite IO operation
+		 * to user-side. So if something waits for IO, then it
+		 * will wait for the 'master' bio.
 		 */
-		r1_bio->bios[mirror] = NULL;
-		to_put = bio;
-		if (!uptodate) {
-			md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
-			/* an I/O failed, we can't clear the bitmap */
-			set_bit(R1BIO_Degraded, &r1_bio->state);
-		} else
-			/*
-			 * Set R1BIO_Uptodate in our master bio, so that
-			 * we will return a good error code for to the higher
-			 * levels even if IO on some other mirrored buffer fails.
-			 *
-			 * The 'master' represents the composite IO operation to
-			 * user-side. So if something waits for IO, then it will
-			 * wait for the 'master' bio.
-			 */
-			set_bit(R1BIO_Uptodate, &r1_bio->state);
+		set_bit(R1BIO_Uptodate, &r1_bio->state);
+
+	update_head_pos(mirror, r1_bio);

-		update_head_pos(mirror, r1_bio);
+	if (behind) {
+		if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
+			atomic_dec(&r1_bio->behind_remaining);

-		if (behind) {
-			if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
-				atomic_dec(&r1_bio->behind_remaining);
-
-			/* In behind mode, we ACK the master bio once the I/O has safely
-			 * reached all non-writemostly disks. Setting the Returned bit
-			 * ensures that this gets done only once -- we don't ever want to
-			 * return -EIO here, instead we'll wait */
-
-			if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
-			    test_bit(R1BIO_Uptodate, &r1_bio->state)) {
-				/* Maybe we can return now */
-				if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
-					struct bio *mbio = r1_bio->master_bio;
-					PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
-					       (unsigned long long) mbio->bi_sector,
-					       (unsigned long long) mbio->bi_sector +
-					       (mbio->bi_size >> 9) - 1);
-					bio_endio(mbio, 0);
-				}
+		/*
+		 * In behind mode, we ACK the master bio once the I/O
+		 * has safely reached all non-writemostly
+		 * disks. Setting the Returned bit ensures that this
+		 * gets done only once -- we don't ever want to return
+		 * -EIO here, instead we'll wait
+		 */
+		if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
+		    test_bit(R1BIO_Uptodate, &r1_bio->state)) {
+			/* Maybe we can return now */
+			if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
+				struct bio *mbio = r1_bio->master_bio;
+				PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
+				       (unsigned long long) mbio->bi_sector,
+				       (unsigned long long) mbio->bi_sector +
+				       (mbio->bi_size >> 9) - 1);
+				bio_endio(mbio, 0);
 			}
 		}
-		rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
 	}
+	rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
+
 	/*
-	 *
 	 * Let's see if all mirrored write operations have finished
 	 * already.
 	 */
 	if (atomic_dec_and_test(&r1_bio->remaining)) {
-		if (test_bit(R1BIO_BarrierRetry, &r1_bio->state))
-			reschedule_retry(r1_bio);
-		else {
-			/* it really is the end of this request */
-			if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
-				/* free extra copy of the data pages */
-				int i = bio->bi_vcnt;
-				while (i--)
-					safe_put_page(bio->bi_io_vec[i].bv_page);
-			}
-			/* clear the bitmap if all writes complete successfully */
-			bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
-					r1_bio->sectors,
-					!test_bit(R1BIO_Degraded, &r1_bio->state),
-					behind);
-			md_write_end(r1_bio->mddev);
-			raid_end_bio_io(r1_bio);
-		}
+		if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
+			/* free extra copy of the data pages */
+			int i = bio->bi_vcnt;
+			while (i--)
+				safe_put_page(bio->bi_io_vec[i].bv_page);
+		}
+		/* clear the bitmap if all writes complete successfully */
+		bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
+				r1_bio->sectors,
+				!test_bit(R1BIO_Degraded, &r1_bio->state),
+				behind);
+		md_write_end(r1_bio->mddev);
+		raid_end_bio_io(r1_bio);
 	}

 	if (to_put)
@@ -787,17 +778,14 @@  static int make_request(mddev_t *mddev,
 	struct bio_list bl;
 	struct page **behind_pages = NULL;
 	const int rw = bio_data_dir(bio);
-	const bool do_sync = (bio->bi_rw & REQ_SYNC);
-	bool do_barriers;
+	const unsigned int do_sync = (bio->bi_rw & REQ_SYNC);
+	const unsigned int do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
 	mdk_rdev_t *blocked_rdev;

 	/*
 	 * Register the new request and wait if the reconstruction
 	 * thread has put up a bar for new requests.
 	 * Continue immediately if no resync is active currently.
-	 * We test barriers_work *after* md_write_start as md_write_start
-	 * may cause the first superblock write, and that will check out
-	 * if barriers work.
 	 */

 	md_write_start(mddev, bio); /* wait on superblock update early */
@@ -821,13 +809,6 @@  static int make_request(mddev_t *mddev,
 		}
 		finish_wait(&conf->wait_barrier, &w);
 	}
-	if (unlikely(!mddev->barriers_work &&
-		     (bio->bi_rw & REQ_HARDBARRIER))) {
-		if (rw == WRITE)
-			md_write_end(mddev);
-		bio_endio(bio, -EOPNOTSUPP);
-		return 0;
-	}

 	wait_barrier(conf);

@@ -877,7 +858,7 @@  static int make_request(mddev_t *mddev,
 		read_bio->bi_sector = r1_bio->sector + mirror->rdev->data_offset;
 		read_bio->bi_bdev = mirror->rdev->bdev;
 		read_bio->bi_end_io = raid1_end_read_request;
-		read_bio->bi_rw = READ | do_sync;
+		read_bio->bi_rw = READ | do_sync | do_flush_fua;
 		read_bio->bi_private = r1_bio;

 		generic_make_request(read_bio);
@@ -959,10 +940,6 @@  static int make_request(mddev_t *mddev,
 	atomic_set(&r1_bio->remaining, 0);
 	atomic_set(&r1_bio->behind_remaining, 0);

-	do_barriers = bio->bi_rw & REQ_HARDBARRIER;
-	if (do_barriers)
-		set_bit(R1BIO_Barrier, &r1_bio->state);
-
 	bio_list_init(&bl);
 	for (i = 0; i < disks; i++) {
 		struct bio *mbio;
@@ -975,7 +952,7 @@  static int make_request(mddev_t *mddev,
 		mbio->bi_sector	= r1_bio->sector + conf->mirrors[i].rdev->data_offset;
 		mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
 		mbio->bi_end_io	= raid1_end_write_request;
-		mbio->bi_rw = WRITE | do_barriers | do_sync;
+		mbio->bi_rw = WRITE | do_sync;
 		mbio->bi_private = r1_bio;

 		if (behind_pages) {
@@ -1631,41 +1608,6 @@  static void raid1d(mddev_t *mddev)
 		if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
 			sync_request_write(mddev, r1_bio);
 			unplug = 1;
-		} else if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
-			/* some requests in the r1bio were REQ_HARDBARRIER
-			 * requests which failed with -EOPNOTSUPP.  Hohumm..
-			 * Better resubmit without the barrier.
-			 * We know which devices to resubmit for, because
-			 * all others have had their bios[] entry cleared.
-			 * We already have a nr_pending reference on these rdevs.
-			 */
-			int i;
-			const bool do_sync = (r1_bio->master_bio->bi_rw & REQ_SYNC);
-			clear_bit(R1BIO_BarrierRetry, &r1_bio->state);
-			clear_bit(R1BIO_Barrier, &r1_bio->state);
-			for (i=0; i < conf->raid_disks; i++)
-				if (r1_bio->bios[i])
-					atomic_inc(&r1_bio->remaining);
-			for (i=0; i < conf->raid_disks; i++)
-				if (r1_bio->bios[i]) {
-					struct bio_vec *bvec;
-					int j;
-
-					bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
-					/* copy pages from the failed bio, as
-					 * this might be a write-behind device */
-					__bio_for_each_segment(bvec, bio, j, 0)
-						bvec->bv_page = bio_iovec_idx(r1_bio->bios[i], j)->bv_page;
-					bio_put(r1_bio->bios[i]);
-					bio->bi_sector = r1_bio->sector +
-						conf->mirrors[i].rdev->data_offset;
-					bio->bi_bdev = conf->mirrors[i].rdev->bdev;
-					bio->bi_end_io = raid1_end_write_request;
-					bio->bi_rw = WRITE | do_sync;
-					bio->bi_private = r1_bio;
-					r1_bio->bios[i] = bio;
-					generic_make_request(bio);
-				}
 		} else {
 			int disk;

Index: block/drivers/md/raid1.h
===================================================================
--- block.orig/drivers/md/raid1.h
+++ block/drivers/md/raid1.h
@@ -117,8 +117,6 @@  struct r1bio_s {
 #define	R1BIO_IsSync	1
 #define	R1BIO_Degraded	2
 #define	R1BIO_BehindIO	3
-#define	R1BIO_Barrier	4
-#define R1BIO_BarrierRetry 5
 /* For write-behind requests, we call bi_end_io when
  * the last non-write-behind device completes, providing
  * any write was successful.  Otherwise we call when
Index: block/drivers/md/raid5.c
===================================================================
--- block.orig/drivers/md/raid5.c
+++ block/drivers/md/raid5.c
@@ -3278,7 +3278,7 @@  static void handle_stripe5(struct stripe

 	if (dec_preread_active) {
 		/* We delay this until after ops_run_io so that if make_request
-		 * is waiting on a barrier, it won't continue until the writes
+		 * is waiting on a flush, it won't continue until the writes
 		 * have actually been submitted.
 		 */
 		atomic_dec(&conf->preread_active_stripes);
@@ -3580,7 +3580,7 @@  static void handle_stripe6(struct stripe

 	if (dec_preread_active) {
 		/* We delay this until after ops_run_io so that if make_request
-		 * is waiting on a barrier, it won't continue until the writes
+		 * is waiting on a flush, it won't continue until the writes
 		 * have actually been submitted.
 		 */
 		atomic_dec(&conf->preread_active_stripes);
@@ -3958,14 +3958,8 @@  static int make_request(mddev_t *mddev,
 	const int rw = bio_data_dir(bi);
 	int remaining;

-	if (unlikely(bi->bi_rw & REQ_HARDBARRIER)) {
-		/* Drain all pending writes.  We only really need
-		 * to ensure they have been submitted, but this is
-		 * easier.
-		 */
-		mddev->pers->quiesce(mddev, 1);
-		mddev->pers->quiesce(mddev, 0);
-		md_barrier_request(mddev, bi);
+	if (unlikely(bi->bi_rw & REQ_FLUSH)) {
+		md_flush_request(mddev, bi);
 		return 0;
 	}

@@ -4083,7 +4077,7 @@  static int make_request(mddev_t *mddev,
 			finish_wait(&conf->wait_for_overlap, &w);
 			set_bit(STRIPE_HANDLE, &sh->state);
 			clear_bit(STRIPE_DELAYED, &sh->state);
-			if (mddev->barrier &&
+			if (mddev->flush_bio &&
 			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 				atomic_inc(&conf->preread_active_stripes);
 			release_stripe(sh);
@@ -4106,7 +4100,7 @@  static int make_request(mddev_t *mddev,
 		bio_endio(bi, 0);
 	}

-	if (mddev->barrier) {
+	if (mddev->flush_bio) {
 		/* We need to wait for the stripes to all be handled.
 		 * So: wait for preread_active_stripes to drop to 0.
 		 */
Index: block/drivers/md/multipath.c
===================================================================
--- block.orig/drivers/md/multipath.c
+++ block/drivers/md/multipath.c
@@ -142,8 +142,8 @@  static int multipath_make_request(mddev_
 	struct multipath_bh * mp_bh;
 	struct multipath_info *multipath;

-	if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
-		md_barrier_request(mddev, bio);
+	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+		md_flush_request(mddev, bio);
 		return 0;
 	}

Index: block/drivers/md/raid10.c
===================================================================
--- block.orig/drivers/md/raid10.c
+++ block/drivers/md/raid10.c
@@ -799,13 +799,13 @@  static int make_request(mddev_t *mddev,
 	int i;
 	int chunk_sects = conf->chunk_mask + 1;
 	const int rw = bio_data_dir(bio);
-	const bool do_sync = (bio->bi_rw & REQ_SYNC);
+	const unsigned int do_sync = (bio->bi_rw & REQ_SYNC);
 	struct bio_list bl;
 	unsigned long flags;
 	mdk_rdev_t *blocked_rdev;

-	if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
-		md_barrier_request(mddev, bio);
+	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+		md_flush_request(mddev, bio);
 		return 0;
 	}

Index: block/drivers/md/dm-io.c
===================================================================
--- block.orig/drivers/md/dm-io.c
+++ block/drivers/md/dm-io.c
@@ -31,7 +31,6 @@  struct dm_io_client {
  */
 struct io {
 	unsigned long error_bits;
-	unsigned long eopnotsupp_bits;
 	atomic_t count;
 	struct task_struct *sleeper;
 	struct dm_io_client *client;
@@ -130,11 +129,8 @@  static void retrieve_io_and_region_from_
  *---------------------------------------------------------------*/
 static void dec_count(struct io *io, unsigned int region, int error)
 {
-	if (error) {
+	if (error)
 		set_bit(region, &io->error_bits);
-		if (error == -EOPNOTSUPP)
-			set_bit(region, &io->eopnotsupp_bits);
-	}

 	if (atomic_dec_and_test(&io->count)) {
 		if (io->sleeper)
@@ -310,8 +306,8 @@  static void do_region(int rw, unsigned r
 	sector_t remaining = where->count;

 	/*
-	 * where->count may be zero if rw holds a write barrier and we
-	 * need to send a zero-sized barrier.
+	 * where->count may be zero if rw holds a flush and we need to
+	 * send a zero-sized flush.
 	 */
 	do {
 		/*
@@ -364,7 +360,7 @@  static void dispatch_io(int rw, unsigned
 	 */
 	for (i = 0; i < num_regions; i++) {
 		*dp = old_pages;
-		if (where[i].count || (rw & REQ_HARDBARRIER))
+		if (where[i].count || (rw & REQ_FLUSH))
 			do_region(rw, i, where + i, dp, io);
 	}

@@ -393,9 +389,7 @@  static int sync_io(struct dm_io_client *
 		return -EIO;
 	}

-retry:
 	io->error_bits = 0;
-	io->eopnotsupp_bits = 0;
 	atomic_set(&io->count, 1); /* see dispatch_io() */
 	io->sleeper = current;
 	io->client = client;
@@ -412,11 +406,6 @@  retry:
 	}
 	set_current_state(TASK_RUNNING);

-	if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
-		rw &= ~REQ_HARDBARRIER;
-		goto retry;
-	}
-
 	if (error_bits)
 		*error_bits = io->error_bits;

@@ -437,7 +426,6 @@  static int async_io(struct dm_io_client

 	io = mempool_alloc(client->pool, GFP_NOIO);
 	io->error_bits = 0;
-	io->eopnotsupp_bits = 0;
 	atomic_set(&io->count, 1); /* see dispatch_io() */
 	io->sleeper = NULL;
 	io->client = client;
Index: block/drivers/md/dm-raid1.c
===================================================================
--- block.orig/drivers/md/dm-raid1.c
+++ block/drivers/md/dm-raid1.c
@@ -259,7 +259,7 @@  static int mirror_flush(struct dm_target
 	struct dm_io_region io[ms->nr_mirrors];
 	struct mirror *m;
 	struct dm_io_request io_req = {
-		.bi_rw = WRITE_BARRIER,
+		.bi_rw = WRITE_FLUSH,
 		.mem.type = DM_IO_KMEM,
 		.mem.ptr.bvec = NULL,
 		.client = ms->io_client,
@@ -629,7 +629,7 @@  static void do_write(struct mirror_set *
 	struct dm_io_region io[ms->nr_mirrors], *dest = io;
 	struct mirror *m;
 	struct dm_io_request io_req = {
-		.bi_rw = WRITE | (bio->bi_rw & WRITE_BARRIER),
+		.bi_rw = WRITE | (bio->bi_rw & (WRITE_FLUSH | WRITE_FUA)),
 		.mem.type = DM_IO_BVEC,
 		.mem.ptr.bvec = bio->bi_io_vec + bio->bi_idx,
 		.notify.fn = write_callback,
@@ -670,7 +670,7 @@  static void do_writes(struct mirror_set
 	bio_list_init(&requeue);

 	while ((bio = bio_list_pop(writes))) {
-		if (unlikely(bio_empty_barrier(bio))) {
+		if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
 			bio_list_add(&sync, bio);
 			continue;
 		}
@@ -1203,7 +1203,7 @@  static int mirror_end_io(struct dm_targe
 	 * We need to dec pending if this was a write.
 	 */
 	if (rw == WRITE) {
-		if (likely(!bio_empty_barrier(bio)))
+		if (!(bio->bi_rw & REQ_FLUSH) || bio_has_data(bio))
 			dm_rh_dec(ms->rh, map_context->ll);
 		return error;
 	}
Index: block/drivers/md/dm.c
===================================================================
--- block.orig/drivers/md/dm.c
+++ block/drivers/md/dm.c
@@ -139,21 +139,21 @@  struct mapped_device {
 	spinlock_t deferred_lock;

 	/*
-	 * An error from the barrier request currently being processed.
+	 * An error from the flush request currently being processed.
 	 */
-	int barrier_error;
+	int flush_error;

 	/*
-	 * Protect barrier_error from concurrent endio processing
+	 * Protect flush_error from concurrent endio processing
 	 * in request-based dm.
 	 */
-	spinlock_t barrier_error_lock;
+	spinlock_t flush_error_lock;

 	/*
-	 * Processing queue (flush/barriers)
+	 * Processing queue (flush)
 	 */
 	struct workqueue_struct *wq;
-	struct work_struct barrier_work;
+	struct work_struct flush_work;

 	/* A pointer to the currently processing pre/post flush request */
 	struct request *flush_request;
@@ -195,8 +195,8 @@  struct mapped_device {
 	/* sysfs handle */
 	struct kobject kobj;

-	/* zero-length barrier that will be cloned and submitted to targets */
-	struct bio barrier_bio;
+	/* zero-length flush that will be cloned and submitted to targets */
+	struct bio flush_bio;
 };

 /*
@@ -507,7 +507,7 @@  static void end_io_acct(struct dm_io *io

 	/*
 	 * After this is decremented the bio must not be touched if it is
-	 * a barrier.
+	 * a flush.
 	 */
 	dm_disk(md)->part0.in_flight[rw] = pending =
 		atomic_dec_return(&md->pending[rw]);
@@ -621,7 +621,7 @@  static void dec_pending(struct dm_io *io
 			 */
 			spin_lock_irqsave(&md->deferred_lock, flags);
 			if (__noflush_suspending(md)) {
-				if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+				if (!(io->bio->bi_rw & REQ_FLUSH))
 					bio_list_add_head(&md->deferred,
 							  io->bio);
 			} else
@@ -633,14 +633,14 @@  static void dec_pending(struct dm_io *io
 		io_error = io->error;
 		bio = io->bio;

-		if (bio->bi_rw & REQ_HARDBARRIER) {
+		if (bio->bi_rw & REQ_FLUSH) {
 			/*
-			 * There can be just one barrier request so we use
+			 * There can be just one flush request so we use
 			 * a per-device variable for error reporting.
 			 * Note that you can't touch the bio after end_io_acct
 			 */
-			if (!md->barrier_error && io_error != -EOPNOTSUPP)
-				md->barrier_error = io_error;
+			if (!md->flush_error)
+				md->flush_error = io_error;
 			end_io_acct(io);
 			free_io(md, io);
 		} else {
@@ -744,21 +744,18 @@  static void end_clone_bio(struct bio *cl
 	blk_update_request(tio->orig, 0, nr_bytes);
 }

-static void store_barrier_error(struct mapped_device *md, int error)
+static void store_flush_error(struct mapped_device *md, int error)
 {
 	unsigned long flags;

-	spin_lock_irqsave(&md->barrier_error_lock, flags);
+	spin_lock_irqsave(&md->flush_error_lock, flags);
 	/*
-	 * Basically, the first error is taken, but:
-	 *   -EOPNOTSUPP supersedes any I/O error.
-	 *   Requeue request supersedes any I/O error but -EOPNOTSUPP.
-	 */
-	if (!md->barrier_error || error == -EOPNOTSUPP ||
-	    (md->barrier_error != -EOPNOTSUPP &&
-	     error == DM_ENDIO_REQUEUE))
-		md->barrier_error = error;
-	spin_unlock_irqrestore(&md->barrier_error_lock, flags);
+	 * Basically, the first error is taken, but requeue request
+	 * supersedes any I/O error.
+	 */
+	if (!md->flush_error || error == DM_ENDIO_REQUEUE)
+		md->flush_error = error;
+	spin_unlock_irqrestore(&md->flush_error_lock, flags);
 }

 /*
@@ -799,12 +796,12 @@  static void dm_end_request(struct reques
 {
 	int rw = rq_data_dir(clone);
 	int run_queue = 1;
-	bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
+	bool is_flush = clone->cmd_flags & REQ_FLUSH;
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;

-	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {
 		rq->errors = clone->errors;
 		rq->resid_len = clone->resid_len;

@@ -819,12 +816,13 @@  static void dm_end_request(struct reques

 	free_rq_clone(clone);

-	if (unlikely(is_barrier)) {
+	if (!is_flush) {
+		blk_end_request_all(rq, error);
+	} else {
 		if (unlikely(error))
-			store_barrier_error(md, error);
+			store_flush_error(md, error);
 		run_queue = 0;
-	} else
-		blk_end_request_all(rq, error);
+	}

 	rq_completed(md, rw, run_queue);
 }
@@ -851,9 +849,9 @@  void dm_requeue_unmapped_request(struct
 	struct request_queue *q = rq->q;
 	unsigned long flags;

-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+	if (clone->cmd_flags & REQ_FLUSH) {
 		/*
-		 * Barrier clones share an original request.
+		 * Flush clones share an original request.
 		 * Leave it to dm_end_request(), which handles this special
 		 * case.
 		 */
@@ -950,14 +948,14 @@  static void dm_complete_request(struct r
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;

-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+	if (clone->cmd_flags & REQ_FLUSH) {
 		/*
-		 * Barrier clones share an original request.  So can't use
+		 * Flush clones share an original request.  So can't use
 		 * softirq_done with the original.
 		 * Pass the clone to dm_done() directly in this special case.
 		 * It is safe (even if clone->q->queue_lock is held here)
 		 * because there is no I/O dispatching during the completion
-		 * of barrier clone.
+		 * of a flush clone.
 		 */
 		dm_done(clone, error, true);
 		return;
@@ -979,9 +977,9 @@  void dm_kill_unmapped_request(struct req
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;

-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+	if (clone->cmd_flags & REQ_FLUSH) {
 		/*
-		 * Barrier clones share an original request.
+		 * Flush clones share an original request.
 		 * Leave it to dm_end_request(), which handles this special
 		 * case.
 		 */
@@ -1098,7 +1096,7 @@  static void dm_bio_destructor(struct bio
 }

 /*
- * Creates a little bio that is just does part of a bvec.
+ * Creates a little bio that is just a part of a bvec.
  */
 static struct bio *split_bvec(struct bio *bio, sector_t sector,
 			      unsigned short idx, unsigned int offset,
@@ -1113,7 +1111,7 @@  static struct bio *split_bvec(struct bio

 	clone->bi_sector = sector;
 	clone->bi_bdev = bio->bi_bdev;
-	clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+	clone->bi_rw = bio->bi_rw;
 	clone->bi_vcnt = 1;
 	clone->bi_size = to_bytes(len);
 	clone->bi_io_vec->bv_offset = offset;
@@ -1140,7 +1138,6 @@  static struct bio *clone_bio(struct bio

 	clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
 	__bio_clone(clone, bio);
-	clone->bi_rw &= ~REQ_HARDBARRIER;
 	clone->bi_destructor = dm_bio_destructor;
 	clone->bi_sector = sector;
 	clone->bi_idx = idx;
@@ -1186,7 +1183,7 @@  static void __flush_target(struct clone_
 	__map_bio(ti, clone, tio);
 }

-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_flush(struct clone_info *ci)
 {
 	unsigned target_nr = 0, flush_nr;
 	struct dm_target *ti;
@@ -1208,9 +1205,6 @@  static int __clone_and_map(struct clone_
 	sector_t len = 0, max;
 	struct dm_target_io *tio;

-	if (unlikely(bio_empty_barrier(bio)))
-		return __clone_and_map_empty_barrier(ci);
-
 	ti = dm_table_find_target(ci->map, ci->sector);
 	if (!dm_target_is_valid(ti))
 		return -EIO;
@@ -1308,11 +1302,11 @@  static void __split_and_process_bio(stru

 	ci.map = dm_get_live_table(md);
 	if (unlikely(!ci.map)) {
-		if (!(bio->bi_rw & REQ_HARDBARRIER))
+		if (!(bio->bi_rw & REQ_FLUSH))
 			bio_io_error(bio);
 		else
-			if (!md->barrier_error)
-				md->barrier_error = -EIO;
+			if (!md->flush_error)
+				md->flush_error = -EIO;
 		return;
 	}

@@ -1325,14 +1319,22 @@  static void __split_and_process_bio(stru
 	ci.io->md = md;
 	spin_lock_init(&ci.io->endio_lock);
 	ci.sector = bio->bi_sector;
-	ci.sector_count = bio_sectors(bio);
-	if (unlikely(bio_empty_barrier(bio)))
+	if (!(bio->bi_rw & REQ_FLUSH))
+		ci.sector_count = bio_sectors(bio);
+	else {
+		/* FLUSH bios reaching here should all be empty */
+		WARN_ON_ONCE(bio_has_data(bio));
 		ci.sector_count = 1;
+	}
 	ci.idx = bio->bi_idx;

 	start_io_acct(ci.io);
-	while (ci.sector_count && !error)
-		error = __clone_and_map(&ci);
+	while (ci.sector_count && !error) {
+		if (!(bio->bi_rw & REQ_FLUSH))
+			error = __clone_and_map(&ci);
+		else
+			error = __clone_and_map_flush(&ci);
+	}

 	/* drop the extra reference count */
 	dec_pending(ci.io, error);
@@ -1417,11 +1419,11 @@  static int _dm_request(struct request_qu
 	part_stat_unlock();

 	/*
-	 * If we're suspended or the thread is processing barriers
+	 * If we're suspended or the thread is processing flushes
 	 * we have to queue this io for later.
 	 */
 	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
-	    unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+	    (bio->bi_rw & REQ_FLUSH)) {
 		up_read(&md->io_lock);

 		if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1464,10 +1466,7 @@  static int dm_request(struct request_que

 static bool dm_rq_is_flush_request(struct request *rq)
 {
-	if (rq->cmd_flags & REQ_FLUSH)
-		return true;
-	else
-		return false;
+	return rq->cmd_flags & REQ_FLUSH;
 }

 void dm_dispatch_request(struct request *rq)
@@ -1520,7 +1519,7 @@  static int setup_clone(struct request *c
 	if (dm_rq_is_flush_request(rq)) {
 		blk_rq_init(NULL, clone);
 		clone->cmd_type = REQ_TYPE_FS;
-		clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
+		clone->cmd_flags |= (REQ_FLUSH | WRITE);
 	} else {
 		r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
 				      dm_rq_bio_constructor, tio);
@@ -1668,7 +1667,7 @@  static void dm_request_fn(struct request
 			BUG_ON(md->flush_request);
 			md->flush_request = rq;
 			blk_start_request(rq);
-			queue_work(md->wq, &md->barrier_work);
+			queue_work(md->wq, &md->flush_work);
 			goto out;
 		}

@@ -1843,7 +1842,7 @@  out:
 static const struct block_device_operations dm_blk_dops;

 static void dm_wq_work(struct work_struct *work);
-static void dm_rq_barrier_work(struct work_struct *work);
+static void dm_rq_flush_work(struct work_struct *work);

 /*
  * Allocate and initialise a blank device with a given minor.
@@ -1873,7 +1872,7 @@  static struct mapped_device *alloc_dev(i
 	init_rwsem(&md->io_lock);
 	mutex_init(&md->suspend_lock);
 	spin_lock_init(&md->deferred_lock);
-	spin_lock_init(&md->barrier_error_lock);
+	spin_lock_init(&md->flush_error_lock);
 	rwlock_init(&md->map_lock);
 	atomic_set(&md->holders, 1);
 	atomic_set(&md->open_count, 0);
@@ -1918,7 +1917,7 @@  static struct mapped_device *alloc_dev(i
 	atomic_set(&md->pending[1], 0);
 	init_waitqueue_head(&md->wait);
 	INIT_WORK(&md->work, dm_wq_work);
-	INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
+	INIT_WORK(&md->flush_work, dm_rq_flush_work);
 	init_waitqueue_head(&md->eventq);

 	md->disk->major = _major;
@@ -2233,31 +2232,28 @@  static int dm_wait_for_completion(struct
 	return r;
 }

-static void dm_flush(struct mapped_device *md)
+static void process_flush(struct mapped_device *md, struct bio *bio)
 {
+	md->flush_error = 0;
+
+	/* handle REQ_FLUSH */
 	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);

-	bio_init(&md->barrier_bio);
-	md->barrier_bio.bi_bdev = md->bdev;
-	md->barrier_bio.bi_rw = WRITE_BARRIER;
-	__split_and_process_bio(md, &md->barrier_bio);
+	bio_init(&md->flush_bio);
+	md->flush_bio.bi_bdev = md->bdev;
+	md->flush_bio.bi_rw = WRITE_FLUSH;
+	__split_and_process_bio(md, &md->flush_bio);

 	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}
-
-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
-	md->barrier_error = 0;

-	dm_flush(md);
+	bio->bi_rw &= ~REQ_FLUSH;

-	if (!bio_empty_barrier(bio)) {
+	/* handle data + REQ_FUA */
+	if (bio_has_data(bio))
 		__split_and_process_bio(md, bio);
-		dm_flush(md);
-	}

-	if (md->barrier_error != DM_ENDIO_REQUEUE)
-		bio_endio(bio, md->barrier_error);
+	if (md->flush_error != DM_ENDIO_REQUEUE)
+		bio_endio(bio, md->flush_error);
 	else {
 		spin_lock_irq(&md->deferred_lock);
 		bio_list_add_head(&md->deferred, bio);
@@ -2291,8 +2287,8 @@  static void dm_wq_work(struct work_struc
 		if (dm_request_based(md))
 			generic_make_request(c);
 		else {
-			if (c->bi_rw & REQ_HARDBARRIER)
-				process_barrier(md, c);
+			if (c->bi_rw & REQ_FLUSH)
+				process_flush(md, c);
 			else
 				__split_and_process_bio(md, c);
 		}
@@ -2317,8 +2313,8 @@  static void dm_rq_set_flush_nr(struct re
 	tio->info.flush_request = flush_nr;
 }

-/* Issue barrier requests to targets and wait for their completion. */
-static int dm_rq_barrier(struct mapped_device *md)
+/* Issue flush requests to targets and wait for their completion. */
+static int dm_rq_flush(struct mapped_device *md)
 {
 	int i, j;
 	struct dm_table *map = dm_get_live_table(md);
@@ -2326,7 +2322,7 @@  static int dm_rq_barrier(struct mapped_d
 	struct dm_target *ti;
 	struct request *clone;

-	md->barrier_error = 0;
+	md->flush_error = 0;

 	for (i = 0; i < num_targets; i++) {
 		ti = dm_table_get_target(map, i);
@@ -2341,26 +2337,26 @@  static int dm_rq_barrier(struct mapped_d
 	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
 	dm_table_put(map);

-	return md->barrier_error;
+	return md->flush_error;
 }

-static void dm_rq_barrier_work(struct work_struct *work)
+static void dm_rq_flush_work(struct work_struct *work)
 {
 	int error;
 	struct mapped_device *md = container_of(work, struct mapped_device,
-						barrier_work);
+						flush_work);
 	struct request_queue *q = md->queue;
 	struct request *rq;
 	unsigned long flags;

 	/*
 	 * Hold the md reference here and leave it at the last part so that
-	 * the md can't be deleted by device opener when the barrier request
+	 * the md can't be deleted by device opener when the flush request
 	 * completes.
 	 */
 	dm_get(md);

-	error = dm_rq_barrier(md);
+	error = dm_rq_flush(md);

 	rq = md->flush_request;
 	md->flush_request = NULL;
@@ -2520,7 +2516,7 @@  int dm_suspend(struct mapped_device *md,
 	up_write(&md->io_lock);

 	/*
-	 * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
+	 * Request-based dm uses md->wq for flush (dm_rq_flush_work) which
 	 * can be kicked until md->queue is stopped.  So stop md->queue before
 	 * flushing md->wq.
 	 */
Index: block/drivers/md/dm-log.c
===================================================================
--- block.orig/drivers/md/dm-log.c
+++ block/drivers/md/dm-log.c
@@ -300,7 +300,7 @@  static int flush_header(struct log_c *lc
 		.count = 0,
 	};

-	lc->io_req.bi_rw = WRITE_BARRIER;
+	lc->io_req.bi_rw = WRITE_FLUSH;

 	return dm_io(&lc->io_req, 1, &null_location, NULL);
 }
Index: block/drivers/md/dm-snap-persistent.c
===================================================================
--- block.orig/drivers/md/dm-snap-persistent.c
+++ block/drivers/md/dm-snap-persistent.c
@@ -687,7 +687,7 @@  static void persistent_commit_exception(
 	/*
 	 * Commit exceptions to disk.
 	 */
-	if (ps->valid && area_io(ps, WRITE_BARRIER))
+	if (ps->valid && area_io(ps, WRITE_FLUSH_FUA))
 		ps->valid = 0;

 	/*
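
One last note on the exception-store hunk: persistent_commit_exception() must not report a commit as done until the area is on stable media, hence WRITE_FLUSH_FUA rather than a plain flush.  As a hedged cheat sheet (the exact macro expansions live in include/linux/fs.h and also carry the usual sync hints), the flag combinations used in this series mean roughly:

/*
 *	WRITE_FLUSH	~ WRITE | REQ_FLUSH		flush the device cache before the write (if any)
 *	WRITE_FUA	~ WRITE | REQ_FUA		the write itself is forced to stable media
 *	WRITE_FLUSH_FUA	~ WRITE | REQ_FLUSH | REQ_FUA	flush first, then write with FUA
 */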