Patchwork [26/30] ext4: do not send discards as barriers

Submitter Tejun Heo
Date Aug. 25, 2010, 3:47 p.m.
Message ID <1282751267-3530-27-git-send-email-tj@kernel.org>
Permalink /patch/62694/
State Not Applicable
Delegated to: David Miller

Comments

Tejun Heo - Aug. 25, 2010, 3:47 p.m.
From: Christoph Hellwig <hch@infradead.org>

ext4 already uses synchronous discards, no need to add I/O barriers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/ext4/mballoc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
Tejun Heo - Aug. 25, 2010, 3:57 p.m.
On 08/25/2010 06:00 PM, Christoph Hellwig wrote:
> On Wed, Aug 25, 2010 at 05:58:42PM +0200, Christoph Hellwig wrote:
>> On Wed, Aug 25, 2010 at 05:47:43PM +0200, Tejun Heo wrote:
>>> From: Christoph Hellwig <hch@infradead.org>
>>>
>>> ext4 already uses synchronous discards, no need to add I/O barriers.
>>
>> This needs the patch that Jan sent in reply to my initial version merged
>> into it.
> 
> Actually the jbd2 patch needs it merged, but the point still stands.

Yeah, I wasn't sure about that one.  Has anyone tested it?  I'll be
happy to merge it, but I have no idea whether it's correct or not, and
Jan didn't seem to have tested it, so...  Jan, shall I merge the patch?

Thanks.
Christoph Hellwig - Aug. 25, 2010, 3:58 p.m.
On Wed, Aug 25, 2010 at 05:47:43PM +0200, Tejun Heo wrote:
> From: Christoph Hellwig <hch@infradead.org>
> 
> ext4 already uses synchronous discards, no need to add I/O barriers.

This needs the patch that Jan sent in reply to my initial version merged
into it.

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig - Aug. 25, 2010, 4 p.m.
On Wed, Aug 25, 2010 at 05:58:42PM +0200, Christoph Hellwig wrote:
> On Wed, Aug 25, 2010 at 05:47:43PM +0200, Tejun Heo wrote:
> > From: Christoph Hellwig <hch@infradead.org>
> > 
> > ext4 already uses synchronous discards, no need to add I/O barriers.
> 
> This needs the patch that Jan sent in reply to my initial version merged
> into it.

Actually the jbd2 patch needs it merged, but the point still stands.
Jan Kara - Aug. 25, 2010, 8:02 p.m.
On Wed 25-08-10 17:57:41, Tejun Heo wrote:
> On 08/25/2010 06:00 PM, Christoph Hellwig wrote:
> > On Wed, Aug 25, 2010 at 05:58:42PM +0200, Christoph Hellwig wrote:
> >> On Wed, Aug 25, 2010 at 05:47:43PM +0200, Tejun Heo wrote:
> >>> From: Christoph Hellwig <hch@infradead.org>
> >>>
> >>> ext4 already uses synchronous discards, no need to add I/O barriers.
> >>
> >> This needs the patch that Jan sent in reply to my initial version merged
> >> into it.
> > 
> > Actually the jbd2 patch needs it merged, but the point still stands.
> 
> Yeah, wasn't sure about that one.  Has anyone tested it?  I'll be
> happy to merge it but I have no idea whether it's correct or not and
> Jan didn't seem to have tested it so...  Jan, shall I merge the patch?
  I'm quite confident the patch is correct, so I think you can merge it, but
tomorrow I'll give it some crash testing together with the rest of your
patch set in KVM to be sure.

								Honza
Tejun Heo - Aug. 26, 2010, 8:25 a.m.
On 08/25/2010 10:02 PM, Jan Kara wrote:
> On Wed 25-08-10 17:57:41, Tejun Heo wrote:
>> On 08/25/2010 06:00 PM, Christoph Hellwig wrote:
>>> On Wed, Aug 25, 2010 at 05:58:42PM +0200, Christoph Hellwig wrote:
>>>> On Wed, Aug 25, 2010 at 05:47:43PM +0200, Tejun Heo wrote:
>>>>> From: Christoph Hellwig <hch@infradead.org>
>>>>>
>>>>> ext4 already uses synchronous discards, no need to add I/O barriers.
>>>>
>>>> This needs the patch that Jan sent in reply to my initial version merged
>>>> into it.
>>>
>>> Actually the jbd2 patch needs it merged, but the point still stands.
>>
>> Yeah, wasn't sure about that one.  Has anyone tested it?  I'll be
>> happy to merge it but I have no idea whether it's correct or not and
>> Jan didn't seem to have tested it so...  Jan, shall I merge the patch?
>   I'm quite confident the patch is correct so you can merge it I think but
> tomorrow I'll give it some crash testing together with the rest of your
> patch set in KVM to be sure.

Patch included in the series before jbd2 conversion patch.

Thanks.
Jan Kara - Aug. 27, 2010, 5:31 p.m.
On Thu 26-08-10 10:25:47, Tejun Heo wrote:
> On 08/25/2010 10:02 PM, Jan Kara wrote:
> > On Wed 25-08-10 17:57:41, Tejun Heo wrote:
> >> On 08/25/2010 06:00 PM, Christoph Hellwig wrote:
> >>> On Wed, Aug 25, 2010 at 05:58:42PM +0200, Christoph Hellwig wrote:
> >>>> On Wed, Aug 25, 2010 at 05:47:43PM +0200, Tejun Heo wrote:
> >>>>> From: Christoph Hellwig <hch@infradead.org>
> >>>>>
> >>>>> ext4 already uses synchronous discards, no need to add I/O barriers.
> >>>>
> >>>> This needs the patch that Jan sent in reply to my initial version merged
> >>>> into it.
> >>>
> >>> Actually the jbd2 patch needs it merged, but the point still stands.
> >>
> >> Yeah, wasn't sure about that one.  Has anyone tested it?  I'll be
> >> happy to merge it but I have no idea whether it's correct or not and
> >> Jan didn't seem to have tested it so...  Jan, shall I merge the patch?
> >   I'm quite confident the patch is correct so you can merge it I think but
> > tomorrow I'll give it some crash testing together with the rest of your
> > patch set in KVM to be sure.
> 
> Patch included in the series before jbd2 conversion patch.
  An update: I've set up ext4 barrier testing in KVM: run fsstress,
kill KVM at some random moment, and check that the filesystem is consistent
(KVM is run in cache=writeback mode to simulate a disk cache). About 70 runs
without journal_async_commit passed fine; now I'm running some tests with
the option enabled, and the first few rounds have passed OK as well.
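
The cycle described above could be scripted roughly as follows. This is only
a sketch: the device path, guest command, timings, and run count are
hypothetical placeholders (made overridable so the loop can be exercised
without real hardware), not the actual test setup.

```shell
# Run $RUNS boot/kill/fsck cycles against the test partition $DEV.
# GUEST boots the guest (assumed to run fsstress on $DEV at startup);
# FSCK is a read-only consistency check.  Every default here is a
# placeholder, overridable so the loop can run without hardware.
crash_cycle() {
    i=1
    while [ "$i" -le "${RUNS:-70}" ]; do
        ${GUEST:-qemu-kvm -drive file=$DEV,if=virtio,cache=writeback} &
        pid=$!
        sleep "${DELAY:-40}"                 # in the real test, a random delay
        kill -9 "$pid" 2>/dev/null || true   # simulated power cut: no flush
        wait "$pid" 2>/dev/null || true
        ${FSCK:-fsck.ext4 -f -n} "$DEV" || { echo "run $i: corruption"; return 1; }
        i=$((i + 1))
    done
    echo "all ${RUNS:-70} runs clean"
}

# Hypothetical invocation:
#   DEV=/dev/vdb1 crash_cycle
```

fsck.ext4 -f -n forces a full check without modifying the device, so a
corrupt run is caught without the check itself disturbing the evidence.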

									Honza
Jeff Moyer - Aug. 30, 2010, 7:56 p.m.
Jan Kara <jack@suse.cz> writes:

>   An update: I've set up an ext4 barrier testing in KVM - run fsstress,
> kill KVM at some random moment and check that the filesystem is consistent
> (kvm is run in cache=writeback mode to simulate disk cache). About 70 runs

But doesn't your "disk cache" survive the "power cycle" of your guest?
It's tough to tell exactly what you're testing with so few details;
care to elaborate?

Cheers,
Jeff
Jan Kara - Aug. 30, 2010, 8:20 p.m.
On Mon 30-08-10 15:56:43, Jeff Moyer wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> >   An update: I've set up an ext4 barrier testing in KVM - run fsstress,
> > kill KVM at some random moment and check that the filesystem is consistent
> > (kvm is run in cache=writeback mode to simulate disk cache). About 70 runs
> 
> But doesn't your "disk cache" survive the "power cycle" of your guest?
  Yes, you're right. Thinking about it now, the test setup was wrong because
it didn't refuse writes to the VM's data partition after the moment I
killed KVM. Thanks for catching this. I will probably have to use fault
injection on the host to disallow writing to the device at a certain moment.
Or does somebody have a better option?
  My setup is that I have a dedicated partition/drive for a filesystem
which is written to from a guest kernel running under KVM. I have set it up
using the virtio driver with cache=writeback so that the host caches the
writes similarly to the way a disk caches them. At some point I just kill
the qemu-kvm process, and at that point I'd like to also throw away the data
cached by the host...

									Honza
Ric Wheeler - Aug. 30, 2010, 8:24 p.m.
On 08/30/2010 05:20 PM, Jan Kara wrote:
> On Mon 30-08-10 15:56:43, Jeff Moyer wrote:
>> Jan Kara<jack@suse.cz>  writes:
>>
>>>    An update: I've set up an ext4 barrier testing in KVM - run fsstress,
>>> kill KVM at some random moment and check that the filesystem is consistent
>>> (kvm is run in cache=writeback mode to simulate disk cache). About 70 runs
>> But doesn't your "disk cache" survive the "power cycle" of your guest?
>    Yes, you're right. Thinking about it now the test setup was wrong because
> it didn't refuse writes to the VM's data partition after the moment I
> killed KVM. Thanks for catching this. I will probably have to use the fault
> injection on the host to disallow writing the device at a certain moment.
> Or does somebody have a better option?
>    My setup is that I have a dedicated partition / drive for a filesystem
> which is written to from a guest kernel running under KVM. I have set it up
> using virtio driver with cache=writeback so that the host caches the writes
> in a similar way disk caches them. At some point I just kill the qemu-kvm
> process and at that point I'd like to also throw away data cached by the
> host...
>
> 									Honza

Hi Jan,

Not sure if this is relevant, but what we have been using for part of the
testing is an external e-SATA enclosure that you can stick pretty much any
S-ATA disk into. It's important to drop power to the external disk (do not
pull the S-ATA cable; the firmware will destage the write cache for
some/many disks if it has power and sees link loss :)).

Once you turn the drive back on, the test is: can you mount without error,
unmount, and do an fsck -f to verify there is no metadata corruption.

Ric

Vladislav Bolkhovitin - Aug. 30, 2010, 8:39 p.m.
Jan Kara, on 08/31/2010 12:20 AM wrote:
> On Mon 30-08-10 15:56:43, Jeff Moyer wrote:
>> Jan Kara<jack@suse.cz>  writes:
>>
>>>    An update: I've set up an ext4 barrier testing in KVM - run fsstress,
>>> kill KVM at some random moment and check that the filesystem is consistent
>>> (kvm is run in cache=writeback mode to simulate disk cache). About 70 runs
>>
>> But doesn't your "disk cache" survive the "power cycle" of your guest?
>    Yes, you're right. Thinking about it now the test setup was wrong because
> it didn't refuse writes to the VM's data partition after the moment I
> killed KVM. Thanks for catching this. I will probably have to use the fault
> injection on the host to disallow writing the device at a certain moment.
> Or does somebody have a better option?

Have you considered setting up a second box as an iSCSI target (e.g. with
iSCSI-SCST)? With it, killing the connectivity is just a matter of a
single iptables command, plus you get a lot more options.
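
As a hypothetical sketch (port 3260 is the standard iSCSI port; IPT is
overridable only so the commands can be dry-run without root):

```shell
# Simulate target power loss by dropping all traffic to the iSCSI port.
# IPT defaults to iptables; override it (e.g. IPT=echo) for a dry run.
cut_iscsi()     { ${IPT:-iptables} -A OUTPUT -p tcp --dport 3260 -j DROP; }
restore_iscsi() { ${IPT:-iptables} -D OUTPUT -p tcp --dport 3260 -j DROP; }
```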

Vlad
Jeff Moyer - Aug. 30, 2010, 9:01 p.m.
Jan Kara <jack@suse.cz> writes:

> On Mon 30-08-10 15:56:43, Jeff Moyer wrote:
>> Jan Kara <jack@suse.cz> writes:
>> 
>> >   An update: I've set up an ext4 barrier testing in KVM - run fsstress,
>> > kill KVM at some random moment and check that the filesystem is consistent
>> > (kvm is run in cache=writeback mode to simulate disk cache). About 70 runs
>> 
>> But doesn't your "disk cache" survive the "power cycle" of your guest?
>   Yes, you're right. Thinking about it now the test setup was wrong because
> it didn't refuse writes to the VM's data partition after the moment I
> killed KVM. Thanks for catching this. I will probably have to use the fault
> injection on the host to disallow writing the device at a certain moment.
> Or does somebody have a better option?
>   My setup is that I have a dedicated partition / drive for a filesystem
> which is written to from a guest kernel running under KVM. I have set it up
> using virtio driver with cache=writeback so that the host caches the writes
> in a similar way disk caches them. At some point I just kill the qemu-kvm
> process and at that point I'd like to also throw away data cached by the
> host...

I've used ilo to power off the system under test from remote.  I have a
tool to automate the testing.  It works as follows:

There's a client and a server.  The server listens on an ip/port for
connections.  A client will connect, tell the server its configuration
(including what disk it's writing to, what block size it's using, and
the total amount of I/O to be done), and then start doing I/O.  The I/O
is done using the AIO api, and the data written includes a block number,
a generation number, fill, and a crc.  As each completion comes in, the
completed sectors are communicated to the server program.  Upon
completion of an entire series of writes (writing the entire data set
once), the server waits some amount of time and then power cycles the
client.  The client comes back up and is run in check mode to verify
that all of the data it reported as completed to the server is actually
intact.

I recently updated the code to run against a file on a file system
(previously it would only work on a block device).  It makes use of
stonith modules to do the power cycling.  It works, but it isn't the
most elegant bit of engineering I've ever done.  ;-)

Anyway, that code is available here:
  http://people.redhat.com/jmoyer/dainto-0.99.4.tar.bz2

Cheers,
Jeff
Jan Kara - Aug. 30, 2010, 9:02 p.m.
On Tue 31-08-10 00:39:41, Vladislav Bolkhovitin wrote:
> Jan Kara, on 08/31/2010 12:20 AM wrote:
> >On Mon 30-08-10 15:56:43, Jeff Moyer wrote:
> >>Jan Kara<jack@suse.cz>  writes:
> >>
> >>>   An update: I've set up an ext4 barrier testing in KVM - run fsstress,
> >>>kill KVM at some random moment and check that the filesystem is consistent
> >>>(kvm is run in cache=writeback mode to simulate disk cache). About 70 runs
> >>
> >>But doesn't your "disk cache" survive the "power cycle" of your guest?
> >   Yes, you're right. Thinking about it now the test setup was wrong because
> >it didn't refuse writes to the VM's data partition after the moment I
> >killed KVM. Thanks for catching this. I will probably have to use the fault
> >injection on the host to disallow writing the device at a certain moment.
> >Or does somebody have a better option?
> 
> Have you considered to setup a second box as an iSCSI target (e.g.
> with iSCSI-SCST)? With it killing the connectivity is just a matter
> of a single iptables command + a lot more options.
  Hmm, this might be an interesting option. Will try that. Thanks for the
suggestion.

								Honza
Tejun Heo - Aug. 31, 2010, 8:11 a.m.
On 08/30/2010 10:20 PM, Jan Kara wrote:
>   My setup is that I have a dedicated partition / drive for a filesystem
> which is written to from a guest kernel running under KVM. I have set it up
> using virtio driver with cache=writeback so that the host caches the writes
> in a similar way disk caches them. At some point I just kill the qemu-kvm
> process and at that point I'd like to also throw away data cached by the
> host...

$ echo 1 > /sys/block/sdX/device/delete
$ echo - - - > /sys/class/scsi_host/hostX/scan

should do the trick.
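
Combined with killing qemu-kvm, this could be wrapped up as follows (a
sketch only; the device and host names are placeholders, and SYS is
overridable purely so the helper can be exercised outside /sys):

```shell
# Detach the backing disk so its dirty host-side state is thrown away,
# then rescan the SCSI host so the disk comes back for the fsck pass.
# $1 = block device name (e.g. sdb), $2 = SCSI host (e.g. host2);
# SYS defaults to /sys and exists only to make the helper testable.
drop_and_rescan() {
    echo 1 > "${SYS:-/sys}/block/$1/device/delete"
    echo '- - -' > "${SYS:-/sys}/class/scsi_host/$2/scan"
}

# Hypothetical usage, right after killing the qemu-kvm process:
#   drop_and_rescan sdb host2
```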

Thanks.
Boaz Harrosh - Aug. 31, 2010, 9:55 a.m.
On 08/31/2010 12:02 AM, Jan Kara wrote:
> On Tue 31-08-10 00:39:41, Vladislav Bolkhovitin wrote:
>> Jan Kara, on 08/31/2010 12:20 AM wrote:
>>> On Mon 30-08-10 15:56:43, Jeff Moyer wrote:
>>>> Jan Kara<jack@suse.cz>  writes:
>>>>
>>>>>   An update: I've set up an ext4 barrier testing in KVM - run fsstress,
>>>>> kill KVM at some random moment and check that the filesystem is consistent
>>>>> (kvm is run in cache=writeback mode to simulate disk cache). About 70 runs
>>>>
>>>> But doesn't your "disk cache" survive the "power cycle" of your guest?
>>>   Yes, you're right. Thinking about it now the test setup was wrong because
>>> it didn't refuse writes to the VM's data partition after the moment I
>>> killed KVM. Thanks for catching this. I will probably have to use the fault
>>> injection on the host to disallow writing the device at a certain moment.
>>> Or does somebody have a better option?
>>
>> Have you considered to setup a second box as an iSCSI target (e.g.
>> with iSCSI-SCST)? With it killing the connectivity is just a matter
>> of a single iptables command + a lot more options.

Still the same problem, no? The data is still cached on the backing store
device; how do you trash the cached data?

>   Hmm, this might be an interesting option. Will try that. Thanks for
> suggestion. 
> 
> 								Honza

With stgt it's very simple as well. It's a user-mode application.
All on the same machine:
- run stgt application
- login + mount a filesystem
- run test
- kill -9 stgt mid flight

But how do you throw away the data in the backing store's cache?

Boaz
Boaz Harrosh - Aug. 31, 2010, 10:07 a.m.
On 08/31/2010 11:11 AM, Tejun Heo wrote:
> On 08/30/2010 10:20 PM, Jan Kara wrote:
>>   My setup is that I have a dedicated partition / drive for a filesystem
>> which is written to from a guest kernel running under KVM. I have set it up
>> using virtio driver with cache=writeback so that the host caches the writes
>> in a similar way disk caches them. At some point I just kill the qemu-kvm
>> process and at that point I'd like to also throw away data cached by the
>> host...
> 
> $ echo 1 > /sys/block/sdX/device/delete
> $ echo - - - > /sys/class/scsi_host/hostX/scan
> 

I don't know all the specifics of the virtio driver and the KVM backend, but
isn't the KVM target I/O eventually directed to a local file or device?
If so, the SCSI device has disappeared, but the bulk of the data is in the
host cache at the backing store (file or bdev). Once all files are closed,
the data is synced to disk.

Isn't this the same as Ric's problem of disconnecting the SATA cable but
not dropping power to the drive? Most of the cache is still intact.

> should do the trick.
> 
> Thanks.
> 

Thanks
Boaz
Tejun Heo - Aug. 31, 2010, 10:13 a.m.
Hello,

On 08/31/2010 12:07 PM, Boaz Harrosh wrote:
> I don't know all the specifics of the virtio driver and the KVM backend but
> don't the KVM target io is eventually directed to a local file or device?
> If so the scsi device has disappeard but the bulk of the data is in host cache
> at the backstore (file or bdev). Once all files are closed the data is synced
> to disk.
> 
> Is it not the same as Ric's problem of disconnecting the sata cable but
> not dropping power to the drive. The main of the cache is still intact.

There are two layers of caching there.

 drive cache - host page cache - guest

When the guest issues a FLUSH, qemu will translate it into fdatasync, which
will flush the host page cache, followed by a FLUSH to the drive, which
will flush the drive cache to the media.  If you delete the host disk
device, it will be detached without the host page cache being flushed.  So,
although it's not complete, it will lose a good part of the cache.  With the
writeout timeout increased and/or with laptop mode enabled, it will
probably lose most of the cache.

Thanks.
Boaz Harrosh - Aug. 31, 2010, 10:27 a.m.
On 08/31/2010 01:13 PM, Tejun Heo wrote:
> Hello,
> 
> On 08/31/2010 12:07 PM, Boaz Harrosh wrote:
>> I don't know all the specifics of the virtio driver and the KVM backend but
>> don't the KVM target io is eventually directed to a local file or device?
>> If so the scsi device has disappeard but the bulk of the data is in host cache
>> at the backstore (file or bdev). Once all files are closed the data is synced
>> to disk.
>>
>> Is it not the same as Ric's problem of disconnecting the sata cable but
>> not dropping power to the drive. The main of the cache is still intact.
> 
> There are two layers of caching there.
> 
>  drive cache - host page cache - guest
> 
> When guest issues FLUSH, qemu will translate it into fdatasync which
> will flush the host page cache followed by FLUSH to the drive which
> will flush the drive cache to the media.  If you delete the host disk
> device, it will be detached w/o host page cache flushed.  So, although
> it's not complete, it will lose good part of cache.  With out write
> out timeout increased and/or with laptop mode enabled, it will
> probably lose most of cache.
> 

Ah, OK, you meant that device. So if you have a dedicated physical device
for the backing store, that would be a very nice scriptable approach.

Thanks, that's a much better automated test than pulling drives out of
sockets.

> Thanks.
> 

Boaz
Vladislav Bolkhovitin - Sept. 2, 2010, 6:46 p.m.
Boaz Harrosh, on 08/31/2010 01:55 PM wrote:
>>>>>> An update: I've set up an ext4 barrier testing in KVM - run fsstress,
>>>>>> kill KVM at some random moment and check that the filesystem is consistent
>>>>>> (kvm is run in cache=writeback mode to simulate disk cache). About 70 runs
>>>>>
>>>>> But doesn't your "disk cache" survive the "power cycle" of your guest?
>>>> Yes, you're right. Thinking about it now the test setup was wrong because
>>>> it didn't refuse writes to the VM's data partition after the moment I
>>>> killed KVM. Thanks for catching this. I will probably have to use the fault
>>>> injection on the host to disallow writing the device at a certain moment.
>>>> Or does somebody have a better option?
>>>
>>> Have you considered to setup a second box as an iSCSI target (e.g.
>>> with iSCSI-SCST)? With it killing the connectivity is just a matter
>>> of a single iptables command + a lot more options.
>
> Still same problem no? the data is still cached on the backing store device
> how do you trash the cached data?

If you need to kill the device's cache, you can crash/panic/power off the
target. That can also be easily scripted.

Vlad
Jan Kara - Sept. 9, 2010, 10:53 p.m.
On Tue 31-08-10 10:11:34, Tejun Heo wrote:
> On 08/30/2010 10:20 PM, Jan Kara wrote:
> >   My setup is that I have a dedicated partition / drive for a filesystem
> > which is written to from a guest kernel running under KVM. I have set it up
> > using virtio driver with cache=writeback so that the host caches the writes
> > in a similar way disk caches them. At some point I just kill the qemu-kvm
> > process and at that point I'd like to also throw away data cached by the
> > host...
> 
> $ echo 1 > /sys/block/sdX/device/delete
> $ echo - - - > /sys/class/scsi_host/hostX/scan
> 
> should do the trick.
  I've tested that when mounting with the barrier=0 option inside KVM, this
indeed destroys the filesystem rather badly. With the barrier option, ext4 has
already survived several crash cycles while running fsstress with the
journal_async_commit option. So the patch seems to work as expected.

									Honza

Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index df44b34..a22bfef 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2567,7 +2567,7 @@  static inline void ext4_issue_discard(struct super_block *sb,
 	trace_ext4_discard_blocks(sb,
 			(unsigned long long) discard_block, count);
 	ret = sb_issue_discard(sb, discard_block, count, GFP_NOFS,
-			       BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+			       BLKDEV_IFL_WAIT);
 	if (ret == EOPNOTSUPP) {
 		ext4_warning(sb, "discard not supported, disabling");
 		clear_opt(EXT4_SB(sb)->s_mount_opt, DISCARD);