Message ID: 1303762999-20541-1-git-send-email-curtw@google.com
State: Accepted, archived
On 2011-04-25, at 2:23 PM, Curt Wohlgemuth wrote:
> In the bio completion routine, we should not be setting
> PageUptodate at all -- it's set at sys_write() time, and is
> unaffected by success/failure of the write to disk.
>
> This can cause a page corruption bug when
>
>     block size < page size
>
> @@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
> -		/*
> -		 * If this is a partial write which happened to make
> -		 * all buffers uptodate then we can optimize away a
> -		 * bogus readpage() for the next read(). Here we
> -		 * 'discover' whether the page went uptodate as a
> -		 * result of this (potentially partial) write.
> -		 */
> -		if (!partial_write)
> -			SetPageUptodate(page);
> -

I think this is the important part of the code - if there is a read-after-write for a file that was written in "blocksize" units (blocksize < pagesize), does the page get set uptodate when all of the blocks have been written and/or the writing is at EOF?  Otherwise, a read-after-write will always cause data to be fetched from disk needlessly, even though the uptodate information is already in cache.

Cheers, Andreas

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Andreas:

On Mon, Apr 25, 2011 at 3:40 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On 2011-04-25, at 2:23 PM, Curt Wohlgemuth wrote:
>> In the bio completion routine, we should not be setting
>> PageUptodate at all -- it's set at sys_write() time, and is
>> unaffected by success/failure of the write to disk.
>> [...]
>> -		if (!partial_write)
>> -			SetPageUptodate(page);
>> -
>
> I think this is the important part of the code - if there is a read-after-write for a file that was written in "blocksize" units (blocksize < pagesize), does the page get set uptodate when all of the blocks have been written and/or the writing is at EOF?  Otherwise, a read-after-write will always cause data to be fetched from disk needlessly, even though the uptodate information is already in cache.

Hmm, that's a good question.  I would kind of doubt that the page
would be marked uptodate when the final block was written, and this
might be what the code above was trying to do.  It wasn't doing it
correctly :-), but it might have possibly avoided the extra read when
there was no error.

I'll look at this some more, and see if I can't test for your scenario
above.  Perhaps at least checking that all BHs in the page are mapped
+ uptodate => SetPageUptodate would not be out of line.

Thanks,
Curt
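[Editor's note] The rule Curt proposes ("all BHs in the page are mapped + uptodate => SetPageUptodate") can be sketched as a toy userspace model. This is plain Python, not kernel code; the class and method names are invented purely for illustration:

```python
# Toy model of the uptodate bookkeeping under discussion: a page holds
# several block-sized buffers, and the page as a whole should only be
# marked uptodate once *every* buffer in it is uptodate.

PAGE_SIZE = 4096

class Page:
    def __init__(self, block_size):
        assert PAGE_SIZE % block_size == 0
        self.nblocks = PAGE_SIZE // block_size
        self.buffer_uptodate = [False] * self.nblocks
        self.page_uptodate = False

    def write_block(self, idx, set_page_unconditionally=False):
        """Model a block-sized write landing in this page."""
        self.buffer_uptodate[idx] = True
        if set_page_unconditionally:
            # What a buggy completion path can do: mark the whole page
            # uptodate even though some buffers never saw any data.
            self.page_uptodate = True
        elif all(self.buffer_uptodate):
            # Proposed invariant: all BHs uptodate => SetPageUptodate.
            self.page_uptodate = True

# 1k blocks in a 4k page; write only block 0.
buggy = Page(1024)
buggy.write_block(0, set_page_unconditionally=True)
print(buggy.page_uptodate)   # True -- reads of blocks 1-3 would see garbage

safe = Page(1024)
safe.write_block(0)
print(safe.page_uptodate)    # False -- a later read must still fill blocks 1-3

for i in range(1, 4):
    safe.write_block(i)
print(safe.page_uptodate)    # True once every buffer has been written
```

The first case models the partial-write hazard described in the patch; the second models the proposed invariant, where the page only becomes uptodate after every buffer has seen data.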
Hi again Andreas:

On Mon, Apr 25, 2011 at 3:45 PM, Curt Wohlgemuth <curtw@google.com> wrote:
> On Mon, Apr 25, 2011 at 3:40 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>> [...] does the page get set uptodate when all of the blocks have been written and/or the writing is at EOF?  Otherwise, a read-after-write will always cause data to be fetched from disk needlessly, even though the uptodate information is already in cache.
>
> I'll look at this some more, and see if I can't test for your scenario
> above.  Perhaps at least checking that all BHs in the page are mapped
> + uptodate => SetPageUptodate would not be out of line.

My testing is now showing the read coming through after writing to the
4 blocks of a 4K file, using 1K blocksize.

And it seems to me that this is taken care of in __block_commit_write(),
which is called from all the .write_end callbacks for ext4, at least.

Thanks,
Curt
On 2011-04-25, at 5:20 PM, Curt Wohlgemuth wrote:
> On Mon, Apr 25, 2011 at 3:45 PM, Curt Wohlgemuth <curtw@google.com> wrote:
>> [...]
>> I'll look at this some more, and see if I can't test for your scenario
>> above.  Perhaps at least checking that all BHs in the page are mapped
>> + uptodate => SetPageUptodate would not be out of line.
>
> My testing is now showing the read coming through after writing to the
> 4 blocks of a 4K file, using 1K blocksize.

Sorry, but could you please clarify?  Does "read coming through" mean that there is an IO sent to the disk, or that the page is uptodate and the read is handled from the page cache?

> And it seems to me that this is taken care of in __block_commit_write(),
> which is called from all the .write_end callbacks for ext4, at least.

It does indeed look like that should be handling this case.  It would be good to verify that this is still true with your patch, just in case theory and reality don't align.

Cheers, Andreas
On Mon, Apr 25, 2011 at 5:58 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On 2011-04-25, at 5:20 PM, Curt Wohlgemuth wrote:
>> My testing is now showing the read coming through after writing to the
>> 4 blocks of a 4K file, using 1K blocksize.
>
> Sorry, but could you please clarify?  Does "read coming through" mean that there is an IO sent to the disk, or that the page is uptodate and the read is handled from the page cache?

Wow, even I don't understand what I wrote above...

What I meant to write was: In my experiments, I do *not* see a read
going down to disk after the writes for the individual blocks.  I
basically did this:

---------------------------
dd if=/dev/zero of=bar bs=1k count=1 seek=3 conv=notrunc
dd if=/dev/zero of=bar bs=1k count=1 seek=0 conv=notrunc
dd if=/dev/zero of=bar bs=1k count=1 seek=1 conv=notrunc
dd if=/dev/zero of=bar bs=1k count=1 seek=2 conv=notrunc

sync

dd if=bar of=foo bs=1k count=4
---------------------------

>> And it seems to me that this is taken care of in __block_commit_write(),
>> which is called from all the .write_end callbacks for ext4, at least.
>
> It does indeed look like that should be handling this case.  It would be good to verify that this is still true with your patch, just in case theory and reality don't align.

I see the same *non-read from disk* with 1k blocks both with my patch,
and with a vanilla kernel.

Thanks,
Curt
On Tue, Apr 26, 2011 at 12:32 PM, Curt Wohlgemuth <curtw@google.com> wrote:
> [...]
>
> What I meant to write was: In my experiments, I do *not* see a read
> going down to disk after the writes for the individual blocks.  I
> basically did this:
>
> ---------------------------
> dd if=/dev/zero of=bar bs=1k count=1 seek=3 conv=notrunc
> dd if=/dev/zero of=bar bs=1k count=1 seek=0 conv=notrunc
> dd if=/dev/zero of=bar bs=1k count=1 seek=1 conv=notrunc
> dd if=/dev/zero of=bar bs=1k count=1 seek=2 conv=notrunc
>
> sync

Try to evict all cache pages:

    echo 1 > /proc/sys/vm/drop_caches

> dd if=bar of=foo bs=1k count=4
> ---------------------------
>
> I see the same *non-read from disk* with 1k blocks both with my patch,
> and with a vanilla kernel.
This function is only called from the write path, which flushes pages to
disk; actually, the pages' state has already been set correctly by the
time write_end() is called.  Why did we handle the pages' state in this
function?

On Tue, Apr 26, 2011 at 6:40 AM, Andreas Dilger <adilger@dilger.ca> wrote:
> On 2011-04-25, at 2:23 PM, Curt Wohlgemuth wrote:
>> In the bio completion routine, we should not be setting
>> PageUptodate at all -- it's set at sys_write() time, and is
>> unaffected by success/failure of the write to disk.
>> [...]
>> -		if (!partial_write)
>> -			SetPageUptodate(page);
>> -
>
> I think this is the important part of the code - if there is a read-after-write for a file that was written in "blocksize" units (blocksize < pagesize), does the page get set uptodate when all of the blocks have been written and/or the writing is at EOF?  Otherwise, a read-after-write will always cause data to be fetched from disk needlessly, even though the uptodate information is already in cache.
>
> Cheers, Andreas
On Tue, Apr 26, 2011 at 03:41:47PM +0800, Yongqiang Yang wrote:
> This function is only called from the write path, which flushes pages
> to disk; actually, the pages' state has already been set correctly by
> the time write_end() is called.  Why did we handle pages' state in
> this function?

This was a bug on my part.  A lot of page_io.c was copied from
fs/buffer.c, and then optimized for bulk page writes, instead of
sending down a separate 4k bio write request at a time.  I
(incorrectly) copied logic from block_commit_write() into the write
completion callback.  So I agree with Curt that the messing with the
page state should be removed from ext4_end_bio().

As far as messing with the buffer_dirty() logic, it's just wrong to be
clearing the buffer dirty bit at all in the write callback.  In the
original codepath, the buffer dirty bit was set by
block_commit_write(), only to be cleared immediately by
__block_write_full_page().  I'm pretty sure the setting and clearing
of the buffer dirty bit can be optimized out of fs/ext4/page_io.c.

What does bother me a little bit is that we have code that tests
buffer_dirty() in the delalloc logic in fs/ext4/inode.c --- note
ext4_bh_delay_on_unwritten() and write_cache_pages_da(), where we are
testing the buffer_dirty bit for the buffer heads that hang off of
struct page.  The only thing that might bear on this is
ext4_block_truncate_page(), which does call mark_buffer_dirty() on the
buffer containing the truncation point (which, by the way, has made me
very worried about the punch code getting very well tested on 1k block
size file systems, since the edge cases when the region to be
truncated is not page-aligned are obviously going to be very hard to
get right).

I *think* we should be nuking the buffer_dirty manipulations in the
delalloc path, but we need to look very closely at that to make sure
there isn't anything going on here...

						- Ted
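[Editor's note] The redundant dirty-bit dance Ted describes can be shown with a tiny toy model. This is plain Python; the function names merely mirror the kernel functions being discussed, and nothing here is kernel API:

```python
# Toy model: the write path sets the buffer dirty bit only for the
# submission path to clear it again right away, so the set/clear pair
# has no net effect and can be optimized out of the I/O path.

def commit_write(bh):
    # Analogue of block_commit_write(): mark the buffer as having new data.
    bh["dirty"] = True

def write_full_page(bh):
    # Analogue of __block_write_full_page(): about to submit the buffer
    # for I/O, so clear its dirty bit.
    bh["dirty"] = False

bh = {"dirty": False}
commit_write(bh)
write_full_page(bh)
print(bh["dirty"])   # False -- same end state as if neither call ran
```

This only models the state transition, not the locking or ordering constraints that the real code paths must also respect.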
On Mon, Apr 25, 2011 at 11:59 PM, Yongqiang Yang <xiaoqiangnk@gmail.com> wrote:
> On Tue, Apr 26, 2011 at 12:32 PM, Curt Wohlgemuth <curtw@google.com> wrote:
>> [...]
>> What I meant to write was: In my experiments, I do *not* see a read
>> going down to disk after the writes for the individual blocks.  I
>> basically did this:
>>
>> ---------------------------
>> dd if=/dev/zero of=bar bs=1k count=1 seek=3 conv=notrunc
>> dd if=/dev/zero of=bar bs=1k count=1 seek=0 conv=notrunc
>> dd if=/dev/zero of=bar bs=1k count=1 seek=1 conv=notrunc
>> dd if=/dev/zero of=bar bs=1k count=1 seek=2 conv=notrunc
>>
>> sync
>
> try to evict all cache pages
> echo 1 > /proc/sys/vm/drop_caches

But this won't prove anything.  Andreas was worried about an
extraneous read from disk when the page in the page cache was in fact
up to date.  If we evict the page cache, then we *better* read from
disk...

Curt
On Tue, Apr 26, 2011 at 11:37 PM, Curt Wohlgemuth <curtw@google.com> wrote:
> On Mon, Apr 25, 2011 at 11:59 PM, Yongqiang Yang <xiaoqiangnk@gmail.com> wrote:
>> [...]
>> try to evict all cache pages
>> echo 1 > /proc/sys/vm/drop_caches
>
> But this won't prove anything.  Andreas was worried about an
> extraneous read from disk when the page in the page cache was in fact
> up to date.  If we evict the page cache, then we *better* read from
> disk...

If the buffer is marked uptodate, a read from disk does not happen even
if the page is not uptodate; mapping->is_partially_uptodate handles that
case.
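[Editor's note] A toy model of the behavior Yongqiang points to: a read that falls entirely within uptodate buffers can be served from cache even though the page itself is not uptodate (the role played in the kernel by the is_partially_uptodate address_space operation). Plain Python with invented names, not kernel code:

```python
# Model a 4k page backed by 1k buffers, where only some buffers have
# been written.  A read is a cache hit if every buffer it touches is
# uptodate, regardless of whether the page as a whole is uptodate.

BLOCK_SIZE = 1024

def is_partially_uptodate(buffer_uptodate, offset, count):
    """True if bytes [offset, offset+count) lie entirely in uptodate buffers."""
    first = offset // BLOCK_SIZE
    last = (offset + count - 1) // BLOCK_SIZE
    return all(buffer_uptodate[b] for b in range(first, last + 1))

# 4k page, 1k blocks: only blocks 0 and 1 have been written.
buffer_uptodate = [True, True, False, False]

print(is_partially_uptodate(buffer_uptodate, 0, 2048))     # True: served from cache
print(is_partially_uptodate(buffer_uptodate, 1024, 2048))  # False: needs disk I/O
```

So with blocksize < pagesize, a read-after-write of just the written range need not touch the disk even when PageUptodate was never set.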
On Mon, 25 Apr 2011, Curt Wohlgemuth wrote:
> In the bio completion routine, we should not be setting
> PageUptodate at all -- it's set at sys_write() time, and is
> unaffected by success/failure of the write to disk.
>
> This can cause a page corruption bug when
>
>     block size < page size
>
> if we have only written a single block -- we might end up
> setting the entire PageUptodate, which will cause subsequent
> reads to get bad data.
>
> This commit also takes the opportunity to clean up error
> handling in ext4_end_bio(), and remove some extraneous code:
>
>    - fixes ext4_end_bio() to set AS_EIO in the
>      page->mapping->flags on error, which was left out by
>      mistake
>    - removes the clear_buffer_dirty() call on unmapped
>      buffers for each page
>    - consolidates page/buffer error handling in a single
>      section
>
> Signed-off-by: Curt Wohlgemuth <curtw@google.com>
> Reported-by: Jim Meyering <jim@meyering.net>
> Reported-by: Hugh Dickins <hughd@google.com>
> Cc: Mingming Cao <cmm@us.ibm.com>
> ---
> Changelog since v2:
>    - Removed clear_buffer_dirty() call
>    - Consolidated error handling for pages and buffer heads
>    - Loop over BHs in a page even for page size == block size, so
>      we emit the correct error for such a case
>
> Changelog since v1:
>    - Added commit message text about setting AS_EIO for the
>      page on error
>    - Continue to loop over all BHs in a page and emit unique
>      errors for each of them
> ---
>  fs/ext4/page-io.c |   39 +++++++++++----------------------------
>  1 files changed, 11 insertions(+), 28 deletions(-)
>
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index b6dbd05..7bb8f76 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
>  	for (i = 0; i < io_end->num_io_pages; i++) {
>  		struct page *page = io_end->pages[i]->p_page;
>  		struct buffer_head *bh, *head;
> -		int partial_write = 0;
> +		loff_t offset;
> +		loff_t io_end_offset;
>
> -		head = page_buffers(page);
> -		if (error)
> +		if (error) {
>  			SetPageError(page);
> -		BUG_ON(!head);
> -		if (head->b_size != PAGE_CACHE_SIZE) {
> -			loff_t offset;
> -			loff_t io_end_offset = io_end->offset + io_end->size;
> +			set_bit(AS_EIO, &page->mapping->flags);
> +			head = page_buffers(page);
> +			BUG_ON(!head);
> +
> +			io_end_offset = io_end->offset + io_end->size;
>
>  			offset = (sector_t) page->index << PAGE_CACHE_SHIFT;
>  			bh = head;
>  			do {
>  				if ((offset >= io_end->offset) &&
> -				    (offset+bh->b_size <= io_end_offset)) {
> -					if (error)
> -						buffer_io_error(bh);
> -
> -				}
> -				if (buffer_delay(bh))
> -					partial_write = 1;
> -				else if (!buffer_mapped(bh))
> -					clear_buffer_dirty(bh);
> -				else if (buffer_dirty(bh))
> -					partial_write = 1;
> +				    (offset+bh->b_size <= io_end_offset))
> +					buffer_io_error(bh);
> +
>  				offset += bh->b_size;
>  				bh = bh->b_this_page;
>  			} while (bh != head);
>  		}
>
> -		/*
> -		 * If this is a partial write which happened to make
> -		 * all buffers uptodate then we can optimize away a
> -		 * bogus readpage() for the next read(). Here we
> -		 * 'discover' whether the page went uptodate as a
> -		 * result of this (potentially partial) write.
> -		 */
> -		if (!partial_write)
> -			SetPageUptodate(page);
> -
>  		put_io_page(io_end->pages[i]);
>  	}
>  	io_end->num_io_pages = 0;
> --
> 1.7.3.1

I'm concerned that we've reached -rc7, with Linus planning on the
2.6.39 release next week, but Curt's fix above to the mblk_io
corruption bug seems to have fallen through the cracks.

I've been including it in all my testing over the last two weeks: it
works fine - and because of my own tmpfs bug, I even got to see its
error messages :)

Adding in the patch is easy enough for me, but surely we don't want
others to stumble into this bug.

Thanks,
Hugh
On Tue, May 10, 2011 at 10:41:56AM -0700, Hugh Dickins wrote:
>
> I'm concerned that we've reached -rc7, with Linus planning on 2.6.39
> release next week, but Curt's fix above to the mblk_io corruption bug
> seems to have fallen through the cracks.

It's in the ext4 master branch, so it's queued to be pushed during the
next merge window, and then it would go into the 2.6.39.x stable
series.  So it didn't fall through the cracks; it was just a question
of whether to push late in the rc series, or to wait until the merge
window and then backport to 2.6.39 stable.  Basically I got the patch
in -rc5, and at that point, given that the vast majority of file
systems are 4k blocksize on x86, and most Power and Itanic users are
using distribution kernels (and that's a very small number anyway), I
decided not to push it to Linus at that point.

It's a judgement call, and I could have gone the other way; it was
very much a 49/51 sort of decision.

						- Ted
On Tue, May 10, 2011 at 12:17 PM, Ted Ts'o <tytso@mit.edu> wrote:
> On Tue, May 10, 2011 at 10:41:56AM -0700, Hugh Dickins wrote:
>>
>> I'm concerned that we've reached -rc7, with Linus planning on 2.6.39
>> release next week, but Curt's fix above to the mblk_io corruption bug
>> seems to have fallen through the cracks.
>
> It's in the ext4 master branch so it's queued to be pushed during the
> next merge window, and then it would go into the 2.6.39.x stable
> series.  So it didn't fall through the cracks; it was just a question
> of whether to push late in the rc series, or waiting until the merge
> window and then backporting to 2.6.39 stable.  Basically I got the
> patch in -rc5, and at that point, given that the vast majority of the
> file systems are 4k blocksize on x86, and most Power and Itanic users
> are using distribution kernels (and that's a very small number
> anyway), I decided not to push it to Linus at that point.
>
> It's a judgement call; and I could have gone the other way; it was
> very much a 49/51 sort of decision.

Hmm.  Well, thanks for the reply.  I do disagree with your decision,
and am surprised that you're still seeing it in the past tense.  This
is a corruption bug and a regression: prime material for a fix even at
this stage.

Admittedly, in my case the data was written correctly and only
temporarily presented incorrectly; but in other cases it could then
have been written back incorrectly.  And admittedly, not many people
outside of the large-page-size world are using a blocksize smaller
than the pagesize.  But even so....

Moral: cc Rafael's regression list next time?

Hugh
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index b6dbd05..7bb8f76 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
 	for (i = 0; i < io_end->num_io_pages; i++) {
 		struct page *page = io_end->pages[i]->p_page;
 		struct buffer_head *bh, *head;
-		int partial_write = 0;
+		loff_t offset;
+		loff_t io_end_offset;

-		head = page_buffers(page);
-		if (error)
+		if (error) {
 			SetPageError(page);
-		BUG_ON(!head);
-		if (head->b_size != PAGE_CACHE_SIZE) {
-			loff_t offset;
-			loff_t io_end_offset = io_end->offset + io_end->size;
+			set_bit(AS_EIO, &page->mapping->flags);
+			head = page_buffers(page);
+			BUG_ON(!head);
+
+			io_end_offset = io_end->offset + io_end->size;

 			offset = (sector_t) page->index << PAGE_CACHE_SHIFT;
 			bh = head;
 			do {
 				if ((offset >= io_end->offset) &&
-				    (offset+bh->b_size <= io_end_offset)) {
-					if (error)
-						buffer_io_error(bh);
-
-				}
-				if (buffer_delay(bh))
-					partial_write = 1;
-				else if (!buffer_mapped(bh))
-					clear_buffer_dirty(bh);
-				else if (buffer_dirty(bh))
-					partial_write = 1;
+				    (offset+bh->b_size <= io_end_offset))
+					buffer_io_error(bh);
+
 				offset += bh->b_size;
 				bh = bh->b_this_page;
 			} while (bh != head);
 		}

-		/*
-		 * If this is a partial write which happened to make
-		 * all buffers uptodate then we can optimize away a
-		 * bogus readpage() for the next read().  Here we
-		 * 'discover' whether the page went uptodate as a
-		 * result of this (potentially partial) write.
-		 */
-		if (!partial_write)
-			SetPageUptodate(page);
-
 		put_io_page(io_end->pages[i]);
 	}
 	io_end->num_io_pages = 0;
In the bio completion routine, we should not be setting
PageUptodate at all -- it's set at sys_write() time, and is
unaffected by success/failure of the write to disk.

This can cause a page corruption bug when

    block size < page size

if we have only written a single block -- we might end up setting
the entire PageUptodate, which will cause subsequent reads to get
bad data.

This commit also takes the opportunity to clean up error handling
in ext4_end_bio(), and remove some extraneous code:

   - fixes ext4_end_bio() to set AS_EIO in the
     page->mapping->flags on error, which was left out by mistake.
   - remove the clear_buffer_dirty() call on unmapped buffers for
     each page.
   - consolidate page/buffer error handling in a single section.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Reported-by: Jim Meyering <jim@meyering.net>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: Mingming Cao <cmm@us.ibm.com>
---
Changelog since v2:
  - Removed clear_buffer_dirty() call
  - Consolidated error handling for pages and buffer heads
  - Loop over BHs in a page even for page size == block size, so we
    emit the correct error for such a case.

Changelog since v1:
  - Added commit message text about setting AS_EIO for the page on
    error.
  - Continue to loop over all BHs in a page and emit unique errors
    for each of them.
---
 fs/ext4/page-io.c |   39 +++++++++++----------------------------
 1 files changed, 11 insertions(+), 28 deletions(-)