ftruncate-mmap: pages are lost after writing to mmaped file.

Message ID 20090324173511.GJ23439@duck.suse.cz
State Not Applicable, archived

Commit Message

Jan Kara March 24, 2009, 5:35 p.m. UTC
On Tue 24-03-09 16:48:14, Jan Kara wrote:
> On Wed 25-03-09 02:03:54, Nick Piggin wrote:
> > On Wednesday 25 March 2009 01:47:09 Jan Kara wrote:
> > > On Wed 25-03-09 01:30:00, Nick Piggin wrote:
> > 
> > > > I don't think it is a very good idea for block_write_full_page recovery
> > > > to do clear_buffer_dirty for !mapped buffers. I think that should rather
> > > > be a redirty_page_for_writepage in the case that the buffer is dirty.
> > > >
> > > > Perhaps not the cleanest way to solve the problem if it is just due to
> > > > transient shortage of space in ext3, but generic code shouldn't be
> > > > allowed to throw away dirty data even if it can't be written back due
> > > > to some software or hardware error.
> > >
> > >   Well, that would be one possibility. But then we'd be left with dirty
> > > pages we cannot ever release since they are constantly dirty (when the
> > > filesystem really becomes out of space). So what I
> > 
> > If the filesystem becomes out of space and we have over-committed these
> > dirty mmapped blocks, then we most definitely want to keep them around.
> > An error of the system losing a few pages (or if it happens an insanely
> > large number of times, then slowly dying due to memory leak) is better
> > than an app suddenly seeing the contents of the page change to nulls
> > under it when the kernel decides to do some page reclaim.
>   Hmm, probably you're right. Definitely it would be much easier to track
> the problem down than it is now... Thinking a bit more... But couldn't a
> malicious user bring the machine easily to OOM this way? That would be
> unfortunate.
  OK, below is the patch which makes things work for me (i.e. no data
lost). What do you think?

									Honza

Comments

Ying Han April 1, 2009, 10:36 p.m. UTC | #1
Hi Jan:
    I feel that the problem you saw is somewhat different from mine. You
mentioned that you saw the PageError() message, which I don't see on my
system. I tried your patch (based on 2.6.21) on my system and it has run
OK for 2 days. Still, since I don't see the same error message as you, I
am not convinced this is the root cause, at least for our problem. I am
still looking into it.
    So, are you seeing the PageError() every time the problem happens?

--Ying


On Tue, Mar 24, 2009 at 10:35 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 24-03-09 16:48:14, Jan Kara wrote:
>> On Wed 25-03-09 02:03:54, Nick Piggin wrote:
>> > On Wednesday 25 March 2009 01:47:09 Jan Kara wrote:
>> > > On Wed 25-03-09 01:30:00, Nick Piggin wrote:
>> >
>> > > > I don't think it is a very good idea for block_write_full_page recovery
>> > > > to do clear_buffer_dirty for !mapped buffers. I think that should rather
>> > > > be a redirty_page_for_writepage in the case that the buffer is dirty.
>> > > >
>> > > > Perhaps not the cleanest way to solve the problem if it is just due to
>> > > > transient shortage of space in ext3, but generic code shouldn't be
>> > > > allowed to throw away dirty data even if it can't be written back due
>> > > > to some software or hardware error.
>> > >
>> > >   Well, that would be one possibility. But then we'd be left with dirty
>> > > pages we cannot ever release since they are constantly dirty (when the
>> > > filesystem really becomes out of space). So what I
>> >
>> > If the filesystem becomes out of space and we have over-committed these
>> > dirty mmapped blocks, then we most definitely want to keep them around.
>> > An error of the system losing a few pages (or if it happens an insanely
>> > large number of times, then slowly dying due to memory leak) is better
>> > than an app suddenly seeing the contents of the page change to nulls
>> > under it when the kernel decides to do some page reclaim.
>>   Hmm, probably you're right. Definitely it would be much easier to track
>> the problem down than it is now... Thinking a bit more... But couldn't a
>> malicious user bring the machine easily to OOM this way? That would be
>> unfortunate.
>  OK, below is the patch which makes things work for me (i.e. no data
> lost). What do you think?
>
>                                                                        Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>
> From f423c2964dd5afbcc40c47731724d48675dd2822 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Tue, 24 Mar 2009 16:38:22 +0100
> Subject: [PATCH] fs: Don't clear dirty bits in block_write_full_page()
>
> If getblock() fails in block_write_full_page(), we don't want to clear
> dirty bits on buffers. Actually, we even want to redirty the page. This
> way we just won't silently discard user's data (written e.g. through mmap)
> in case of ENOSPC, EDQUOT, EIO or other write error. The downside of this
> approach is that if the error is persistent we have this page pinned in
> memory forever and if there are lots of such pages, we can bring the
> machine OOM.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/buffer.c |   10 +++-------
>  1 files changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 891e1c7..ae779a0 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1833,9 +1833,11 @@ recover:
>        /*
>         * ENOSPC, or some other error.  We may already have added some
>         * blocks to the file, so we need to write these out to avoid
> -        * exposing stale data.
> +        * exposing stale data. We redirty the page so that we don't
> +        * lose data we are unable to write.
>         * The page is currently locked and not marked for writeback
>         */
> +       redirty_page_for_writepage(wbc, page);
>        bh = head;
>        /* Recovery: lock and submit the mapped buffers */
>        do {
> @@ -1843,12 +1845,6 @@ recover:
>                    !buffer_delay(bh)) {
>                        lock_buffer(bh);
>                        mark_buffer_async_write(bh);
> -               } else {
> -                       /*
> -                        * The buffer may have been set dirty during
> -                        * attachment to a dirty page.
> -                        */
> -                       clear_buffer_dirty(bh);
>                }
>        } while ((bh = bh->b_this_page) != head);
>        SetPageError(page);
> --
> 1.6.0.2
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara April 2, 2009, 10:11 a.m. UTC | #2
Hi Ying,

On Wed 01-04-09 15:36:13, Ying Han wrote:
>     I feel that the problem you saw is somewhat different from mine. You
> mentioned that you saw the PageError() message, which I don't see on my
> system. I tried your patch (based on 2.6.21) on my system and it has run
> OK for 2 days. Still, since I don't see the same error message as you, I
> am not convinced this is the root cause, at least for our problem. I am
> still looking into it.
>     So, are you seeing the PageError() every time the problem happens?
  Yes, but I agree that your problem is probably different. BTW: How do you
reproduce the problem?

								Honza

Nick Piggin April 2, 2009, 11:24 a.m. UTC | #3
On Thursday 02 April 2009 09:36:13 Ying Han wrote:
> Hi Jan:
>     I feel that the problem you saw is somewhat different from mine. You
> mentioned that you saw the PageError() message, which I don't see on my
> system. I tried your patch (based on 2.6.21) on my system and it has run
> OK for 2 days. Still, since I don't see the same error message as you, I
> am not convinced this is the root cause, at least for our problem. I am
> still looking into it.
>     So, are you seeing the PageError() every time the problem happens?

So I asked if you could test with my workaround of taking truncate_mutex
at the start of ext2_get_blocks, and report back. I never heard of any
response after that.

To reiterate: I was able to reproduce a problem with ext2 (I was testing
on brd to get IO rates high enough to reproduce it quite frequently).
I think I narrowed the problem down to block allocation or inode block
tree corruption because I was unable to reproduce it with that hack in
place.

Jan Kara April 2, 2009, 11:34 a.m. UTC | #4
On Thu 02-04-09 22:24:29, Nick Piggin wrote:
> On Thursday 02 April 2009 09:36:13 Ying Han wrote:
> > Hi Jan:
> >     I feel that the problem you saw is somewhat different from mine. You
> > mentioned that you saw the PageError() message, which I don't see on my
> > system. I tried your patch (based on 2.6.21) on my system and it has run
> > OK for 2 days. Still, since I don't see the same error message as you, I
> > am not convinced this is the root cause, at least for our problem. I am
> > still looking into it.
> >     So, are you seeing the PageError() every time the problem happens?
> 
> So I asked if you could test with my workaround of taking truncate_mutex
> at the start of ext2_get_blocks, and report back. I never heard of any
> response after that.
> 
> To reiterate: I was able to reproduce a problem with ext2 (I was testing
> on brd to get IO rates high enough to reproduce it quite frequently).
> I think I narrowed the problem down to block allocation or inode block
> tree corruption because I was unable to reproduce it with that hack in
> place.
  Nick, what load did you use for reproduction? I'll try to reproduce it
here so that I can debug ext2...

								Honza
Nick Piggin April 2, 2009, 3:51 p.m. UTC | #5
On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
> > > Hi Jan:
> > >     I feel that the problem you saw is somewhat different from mine. You
> > > mentioned that you saw the PageError() message, which I don't see on my
> > > system. I tried your patch (based on 2.6.21) on my system and it has run
> > > OK for 2 days. Still, since I don't see the same error message as you, I
> > > am not convinced this is the root cause, at least for our problem. I am
> > > still looking into it.
> > >     So, are you seeing the PageError() every time the problem happens?
> > 
> > So I asked if you could test with my workaround of taking truncate_mutex
> > at the start of ext2_get_blocks, and report back. I never heard of any
> > response after that.
> > 
> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
> > on brd to get IO rates high enough to reproduce it quite frequently).
> > I think I narrowed the problem down to block allocation or inode block
> > tree corruption because I was unable to reproduce it with that hack in
> > place.
>   Nick, what load did you use for reproduction? I'll try to reproduce it
> here so that I can debug ext2...

OK, I set up the filesystem like this:

modprobe rd rd_size=$[3*1024*1024]   #almost fill memory so we reclaim buffers
dd if=/dev/zero of=/dev/ram0 bs=4k   #prefill brd so we don't get alloc deadlock
mkfs.ext2 -b1024 /dev/ram0           #1K buffers

Test is basically unmodified except I use 64MB files, and start 8 of them
at once (8 core system, so this improves the chances of hitting the bug).
Although I do see it with only 1 running, it takes longer to trigger.

I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
know if that really helps speed up reproducing it. It is quite random to hit,
but I was able to hit it IIRC in under a minute with that setup.
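
The test program itself is not reproduced in this message. A minimal sketch
of what such a test might look like follows, assuming it grows a file with
ftruncate(), writes it through a shared mapping, and later checks whether
any page has lost its contents. The function name count_bad_pages() and the
byte pattern are hypothetical, chosen only for illustration.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the test: extend a file with ftruncate(),
 * fill it through a shared mapping, then map it again and count pages
 * that no longer hold the pattern written to them.
 * Returns the number of bad pages, or -1 if a system call fails.
 */
long count_bad_pages(const char *path, size_t npages)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page_size;
    unsigned char *map;
    long bad = 0;
    int fd;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0)
        goto fail;

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto fail;

    /* Dirty every page with a non-zero byte derived from its index. */
    for (size_t i = 0; i < npages; i++)
        memset(map + i * page_size, (int)(i % 255) + 1, page_size);

    /* Ask for writeback, then drop this mapping. */
    if (msync(map, len, MS_SYNC) < 0) {
        munmap(map, len);
        goto fail;
    }
    munmap(map, len);

    /* Map the file again and compare each page against its pattern. */
    map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto fail;
    for (size_t i = 0; i < npages; i++) {
        unsigned char want = (unsigned char)((i % 255) + 1);
        const unsigned char *p = map + i * page_size;

        if (p[0] != want || p[page_size - 1] != want)
            bad++;
    }
    munmap(map, len);
    close(fd);
    return bad;

fail:
    close(fd);
    return -1;
}
```

On a healthy system the second mapping is served from the page cache, so a
lost page only shows up if the page was reclaimed in between, which is what
the memory pressure and the drop_caches loop described above are meant to
provoke.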
Ying Han April 2, 2009, 5:44 p.m. UTC | #6
On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>> > > Hi Jan:
>> > >     I feel that the problem you saw is somewhat different from mine. You
>> > > mentioned that you saw the PageError() message, which I don't see on my
>> > > system. I tried your patch (based on 2.6.21) on my system and it has run
>> > > OK for 2 days. Still, since I don't see the same error message as you, I
>> > > am not convinced this is the root cause, at least for our problem. I am
>> > > still looking into it.
>> > >     So, are you seeing the PageError() every time the problem happens?
>> >
>> > So I asked if you could test with my workaround of taking truncate_mutex
>> > at the start of ext2_get_blocks, and report back. I never heard of any
>> > response after that.
>> >
>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
>> > on brd to get IO rates high enough to reproduce it quite frequently).
>> > I think I narrowed the problem down to block allocation or inode block
>> > tree corruption because I was unable to reproduce it with that hack in
>> > place.
>>   Nick, what load did you use for reproduction? I'll try to reproduce it
>> here so that I can debug ext2...
>
> OK, I set up the filesystem like this:
>
> modprobe rd rd_size=$[3*1024*1024]   #almost fill memory so we reclaim buffers
> dd if=/dev/zero of=/dev/ram0 bs=4k   #prefill brd so we don't get alloc deadlock
> mkfs.ext2 -b1024 /dev/ram0           #1K buffers
>
> Test is basically unmodified except I use 64MB files, and start 8 of them
> at once (8 core system, so this improves the chances of hitting the bug).
> Although I do see it with only 1 running, it takes longer to trigger.
>
> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
> know if that really helps speed up reproducing it. It is quite random to hit,
> but I was able to hit it IIRC in under a minute with that setup.
>

Here is how I reproduce it:
Filesystem is ext2 with blocksize 4096
Fill up the RAM with 95% anon memory and mlockall (to put on enough memory
pressure to trigger page reclaim and background writeout)
Run one thread of the test program

and I will see "bad pages" within a few minutes.

--Ying
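
The memory-pressure step above could be approximated with a helper along
these lines. This is only a sketch: pin_anon_memory() is a hypothetical
name, and it uses mlock() on a single region rather than the mlockall()
mentioned in the message, so that the rest of the process is left alone.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `bytes` of anonymous memory, touch every byte
 * so the pages are really allocated, and try to pin them with mlock().
 * Returns the mapping, or NULL if it could not be created. *locked is set
 * to 1 when mlock() succeeded and to 0 when it failed (for example
 * because of RLIMIT_MEMLOCK), in which case the pages stay swappable.
 */
void *pin_anon_memory(size_t bytes, int *locked)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    memset(p, 0x5a, bytes);              /* fault in every page */
    *locked = (mlock(p, bytes) == 0);    /* pin if the limits allow it */
    return p;
}
```

Sizing the region to roughly 95% of RAM, as described above, is left to the
caller; pinning that much memory normally requires root or a raised
RLIMIT_MEMLOCK.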
Ying Han April 2, 2009, 10:52 p.m. UTC | #7
On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <yinghan@google.com> wrote:
> On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
>>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
>>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>>> > > Hi Jan:
>>> > >     I feel that the problem you saw is somewhat different from mine. You
>>> > > mentioned that you saw the PageError() message, which I don't see on my
>>> > > system. I tried your patch (based on 2.6.21) on my system and it has run
>>> > > OK for 2 days. Still, since I don't see the same error message as you, I
>>> > > am not convinced this is the root cause, at least for our problem. I am
>>> > > still looking into it.
>>> > >     So, are you seeing the PageError() every time the problem happens?
>>> >
>>> > So I asked if you could test with my workaround of taking truncate_mutex
>>> > at the start of ext2_get_blocks, and report back. I never heard of any
>>> > response after that.
>>> >
>>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
>>> > on brd to get IO rates high enough to reproduce it quite frequently).
>>> > I think I narrowed the problem down to block allocation or inode block
>>> > tree corruption because I was unable to reproduce it with that hack in
>>> > place.
>>>   Nick, what load did you use for reproduction? I'll try to reproduce it
>>> here so that I can debug ext2...
>>
>> OK, I set up the filesystem like this:
>>
>> modprobe rd rd_size=$[3*1024*1024]   #almost fill memory so we reclaim buffers
>> dd if=/dev/zero of=/dev/ram0 bs=4k   #prefill brd so we don't get alloc deadlock
>> mkfs.ext2 -b1024 /dev/ram0           #1K buffers
>>
>> Test is basically unmodified except I use 64MB files, and start 8 of them
>> at once (8 core system, so this improves the chances of hitting the bug).
>> Although I do see it with only 1 running, it takes longer to trigger.
>>
>> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
>> know if that really helps speed up reproducing it. It is quite random to hit,
>> but I was able to hit it IIRC in under a minute with that setup.
>>
>
> Here is how I reproduce it:
> Filesystem is ext2 with blocksize 4096
> Fill up the RAM with 95% anon memory and mlockall (to put on enough memory
> pressure to trigger page reclaim and background writeout)
> Run one thread of the test program
>
> and I will see "bad pages" within a few minutes.

And here is the "top" output and stdout while it is getting "bad pages":
top

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3487 root      20   0 52616  50m  284 R   95  0.3   3:58.85 usemem
 3810 root      20   0  129m  99m  99m D   41  0.6   0:01.87 ftruncate_mmap
  261 root      15  -5     0    0    0 D    4  0.0   0:31.08 kswapd0
  262 root      15  -5     0    0    0 D    3  0.0   0:10.26 kswapd1

stdout:

while true; do
    ./ftruncate_mmap;
done
Running 852 bad page
Running 315 bad page
Running 999 bad page
Running 482 bad page
Running 24 bad page

--Ying

>
> --Ying
>
Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index 891e1c7..ae779a0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1833,9 +1833,11 @@  recover:
 	/*
 	 * ENOSPC, or some other error.  We may already have added some
 	 * blocks to the file, so we need to write these out to avoid
-	 * exposing stale data.
+	 * exposing stale data. We redirty the page so that we don't
+	 * lose data we are unable to write.
 	 * The page is currently locked and not marked for writeback
 	 */
+	redirty_page_for_writepage(wbc, page);
 	bh = head;
 	/* Recovery: lock and submit the mapped buffers */
 	do {
@@ -1843,12 +1845,6 @@  recover:
 		    !buffer_delay(bh)) {
 			lock_buffer(bh);
 			mark_buffer_async_write(bh);
-		} else {
-			/*
-			 * The buffer may have been set dirty during
-			 * attachment to a dirty page.
-			 */
-			clear_buffer_dirty(bh);
 		}
 	} while ((bh = bh->b_this_page) != head);
 	SetPageError(page);
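
The effect of the two hunks can be illustrated with a small user-space
model. This is a sketch, not kernel code: toy_buffer, toy_page and
toy_recover() are hypothetical names, and the first half of the loop
condition, which is not visible in the hunk, is assumed to test that the
buffer is mapped and dirty.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the buffer_head flags involved; not the kernel struct. */
struct toy_buffer {
    bool mapped;       /* has a disk block assigned */
    bool dirty;        /* holds data not yet written */
    bool delay;        /* delayed allocation pending */
    bool async_write;  /* picked for writeout by the recovery loop */
};

/* Toy stand-in for the page flags touched on the recovery path. */
struct toy_page {
    bool dirty;
    bool error;
};

/*
 * Model of the recovery loop. With keep_dirty == false it follows the
 * code the patch removes (buffers that cannot be submitted are marked
 * clean). With keep_dirty == true it follows the patched code (the page
 * is redirtied and such buffers keep their dirty bit).
 * Returns the number of buffers whose dirty state was discarded.
 */
size_t toy_recover(struct toy_page *page, struct toy_buffer *bufs,
                   size_t n, bool keep_dirty)
{
    size_t discarded = 0;

    if (keep_dirty)
        page->dirty = true;          /* redirty_page_for_writepage() */

    for (size_t i = 0; i < n; i++) {
        struct toy_buffer *b = &bufs[i];

        if (b->mapped && b->dirty && !b->delay) {
            b->async_write = true;   /* lock + mark_buffer_async_write() */
        } else if (!keep_dirty) {
            if (b->dirty)
                discarded++;
            b->dirty = false;        /* the removed clear_buffer_dirty() */
        }
    }
    page->error = true;              /* SetPageError() */
    return discarded;
}
```

For a page with one mapped dirty buffer and one unmapped dirty buffer, the
old path discards the unmapped buffer's dirty state while the patched path
keeps it and leaves the page dirty, which is the trade-off the commit
message describes: no silent data loss, at the cost of pages that may stay
pinned if the error persists.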