Patchwork Bug: Large writes can fail on ext4 if the write buffer is not empty

Submitter Zheng Liu
Date April 12, 2012, 4:06 p.m.
Message ID <20120412160658.GA9697@gmail.com>
Permalink /patch/152116/
State Superseded

Comments

Zheng Liu - April 12, 2012, 4:06 p.m.
On Thu, Apr 12, 2012 at 05:47:41PM +0300, Jouni Siren wrote:
> Hi,
> 
> I recently ran into problems when writing large blocks of data (more than about 2 GB) with a single call, if there is already some data in the write buffer. The problem seems to be specific to ext4, or at least it does not happen when writing to nfs on the same system. Also, the problem does not happen, if the write buffer is flushed before the large write.
> 
> The following C++ program should write a total of 4294967304 bytes, but I end up with a file of size 2147483664.
> 
> #include <fstream>
> 
> int
> main(int argc, char** argv)
> {
>   std::streamsize data_size = (std::streamsize)1 << 31;
>   char* data = new char[data_size];
> 
>   std::ofstream output("test.dat", std::ios_base::binary);
>   output.write(data, 8);
>   output.write(data, data_size);
>   output.write(data, data_size);
>   output.close();
> 
>   delete[] data;
>   return 0;
> }
> 
> 
> The relevant part of strace is the following:
> 
> open("test.dat", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
> writev(3, [{"\0\0\0\0\0\0\0\0", 8}, {"", 2147483648}], 2) = -2147483640
> writev(3, [{0xffffffff80c6d258, 2147483648}, {"", 2147483648}], 2) = -1 EFAULT (Bad address)
> write(3, "\0\0\0\0\0\0\0\0", 8)         = 8
> close(3)                                = 0
> 
> 
> The first two writes are combined into a single writev call that reports having written -2147483640 bytes. This is the same as 8 + 2147483648, when interpreted as a signed 32-bit integer. After the first call, everything more or less fails. This happens on a Linux system, where uname -a returns
> 
> Linux alm01 2.6.32-220.7.1.el6.x86_64 #1 SMP Tue Mar 6 15:45:33 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
> 
> 
> I believe that the bug can be found in file.c, function ext4_file_write, where variable ret has type int. Function generic_file_aio_write returns the number of bytes written as a ssize_t, and the returned value is stored in ret and eventually returned by ext4_file_write. If the number of bytes written is more than INT_MAX, the value returned by ext4_file_write will be incorrect.
> 
> If you need more information on the problem, I will be happy to provide it.

Hi Jouni,

Indeed, I think it is a bug, and the solution is straightforward.
Could you please try this patch?  Thank you.

Regards,
Zheng

From: Zheng Liu <wenqing.lz@taobao.com>
Subject: [PATCH] ext4: change return value from int to ssize_t in ext4_file_write

On 32-bit platforms, when we do a write operation with a huge amount of data,
the ret variable overflows.  So it is changed to ssize_t.

Reported-by: Jouni Siren <jouni.siren@iki.fi>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/file.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
Jan Kara - April 12, 2012, 8:20 p.m.
On Fri 13-04-12 00:06:59, Zheng Liu wrote:
> On Thu, Apr 12, 2012 at 05:47:41PM +0300, Jouni Siren wrote:
> > Hi,
> > 
> > I recently ran into problems when writing large blocks of data (more than about 2 GB) with a single call, if there is already some data in the write buffer. The problem seems to be specific to ext4, or at least it does not happen when writing to nfs on the same system. Also, the problem does not happen, if the write buffer is flushed before the large write.
> > 
> > The following C++ program should write a total of 4294967304 bytes, but I end up with a file of size 2147483664.
> > 
> > #include <fstream>
> > 
> > int
> > main(int argc, char** argv)
> > {
> >   std::streamsize data_size = (std::streamsize)1 << 31;
> >   char* data = new char[data_size];
> > 
> >   std::ofstream output("test.dat", std::ios_base::binary);
> >   output.write(data, 8);
> >   output.write(data, data_size);
> >   output.write(data, data_size);
> >   output.close();
> > 
> >   delete[] data;
> >   return 0;
> > }
> > 
> > 
> > The relevant part of strace is the following:
> > 
> > open("test.dat", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
> > writev(3, [{"\0\0\0\0\0\0\0\0", 8}, {"", 2147483648}], 2) = -2147483640
> > writev(3, [{0xffffffff80c6d258, 2147483648}, {"", 2147483648}], 2) = -1 EFAULT (Bad address)
> > write(3, "\0\0\0\0\0\0\0\0", 8)         = 8
> > close(3)                                = 0
> > 
> > 
> > The first two writes are combined into a single writev call that reports having written -2147483640 bytes. This is the same as 8 + 2147483648, when interpreted as a signed 32-bit integer. After the first call, everything more or less fails. This happens on a Linux system, where uname -a returns
> > 
> > Linux alm01 2.6.32-220.7.1.el6.x86_64 #1 SMP Tue Mar 6 15:45:33 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
> > 
> > 
> > I believe that the bug can be found in file.c, function ext4_file_write, where variable ret has type int. Function generic_file_aio_write returns the number of bytes written as a ssize_t, and the returned value is stored in ret and eventually returned by ext4_file_write. If the number of bytes written is more than INT_MAX, the value returned by ext4_file_write will be incorrect.
> > 
> > If you need more information on the problem, I will be happy to provide it.
> 
> Hi Jouni,
> 
> Indeed, I think it is a bug, and the solution is straightforward.
> Could you please try this patch?  Thank you.
> 
> Regards,
> Zheng
> 
> From: Zheng Liu <wenqing.lz@taobao.com>
> Subject: [PATCH] ext4: change return value from int to ssize_t in ext4_file_write
> 
> On 32-bit platforms, when we do a write operation with a huge amount of data,
> the ret variable overflows.  So it is changed to ssize_t.
  Actually, the problem happens on 64-bit platforms. On 32-bit platforms,
ssize_t is 32 bits, and thus we would only do a short write that fits into
2^31. But on 64-bit platforms, ssize_t is 64 bits, so there we can end up
writing more than 2^31. So please change the changelog. The patch itself is
good, so you can add:
  Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> 
> Reported-by: Jouni Siren <jouni.siren@iki.fi>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> ---
>  fs/ext4/file.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index cb70f18..8c7642a 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -95,7 +95,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
>  {
>  	struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
>  	int unaligned_aio = 0;
> -	int ret;
> +	ssize_t ret;
>  
>  	/*
>  	 * If we have encountered a bitmap-format file, the size limit
> -- 
> 1.7.4.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jouko Orava - April 19, 2012, 1:10 p.m.
Hi Zheng,

I can confirm the bug exists on latest RHEL 6 x86_64 kernels (2.6.32-220).

On current mainline kernels, all writes are limited to one page less than
2 GB, which masks the problem. I have not checked whether mainline 2.6.32
has this limit. Either way, the limit is just a band-aid to paper over
filesystem bugs, and should not be a reason to leave filesystem bugs unfixed.

I can confirm the one line patch to fs/ext4/file.c does fix the problem.
I have test kernels based on 2.6.32-220.7.1.el6.x86_64 with only
the patch applied, at
	http://www.helsinki.fi/~joorava/kernel/
if anyone else is interested in testing.

I did some (limited) testing on ext4 with the patch. It fixes the problem:
large writes now work correctly as well, and no problems popped up in testing.

Tested-by: Jouko Orava <jouko.orava@helsinki.fi>


I'd also like to point out the real bug here, in Jouni's original strace:

	writev(3, [{"\0\0\0\0\0\0\0\0", 8}, {"", 2147483648}], 2) = -2147483640

The syscall returns a negative value, which is not the actual number of
bytes written (since it is 32-bit wrapped), and errno has not been
changed. There is no way for userspace to handle this result correctly!

No one should just gloss over this by saying
"programmer fault, you're doing it wrong".
This is a real bug, and it deserves fixing sooner rather than later.

Thanks,
   Jouko Orava
Eric Sandeen - April 19, 2012, 2:15 p.m.
On 4/19/12 8:10 AM, Jouko Orava wrote:
> Hi Zheng,
> 
> I can confirm the bug exists on latest RHEL 6 x86_64 kernels (2.6.32-220).
> 
> On current mainline kernels all writes are limited to one page under 2GB,
> which masks the problem. I have not checked if mainline 2.6.32 has this
> limit or not. It does not matter: the limit is just a band-aid to paper
> over filesystem bugs, and should not mean you don't fix filesystem bugs.

FWIW, we tried fairly hard to get the limit lifted in the vfs, to no avail.

> I can confirm the one line patch to fs/ext4/file.c does fix the problem.
> I have test kernels based on 2.6.32-220.7.1.el6.x86_64 with only
> the patch applied, at
> 	http://www.helsinki.fi/~joorava/kernel/
> if anyone else is interested in testing.
> 
> I did some (limited) testing on ext4 with the patch. It fixes the problem:
> large writes work very well too. No problems popped up in testing.
> 
> Tested-by: Jouko Orava <jouko.orava@helsinki.fi>
> 
> 
> I'd also like to point at the real bug here, in Jouni's original strace:
> 
> 	writev(3, [{"\0\0\0\0\0\0\0\0", 8}, {"", 2147483648}], 2) = -2147483640
> 
> The syscall returns a negative value, which is not the actual number of
> bytes written (since it is 32-bit wrapped), and errno has not been
> changed. There is no way for userspace to handle this result correctly!
> 
> There is no way anyone sane should just gloss over this saying
> "programmer fault, you're doing it wrong".
> This is a real bug, and deserves fixing sooner rather than later.

Agreed, I think Dave was a little too quick on the draw on his reply.  :)

-Eric

> Thanks,
>    Jouko Orava

Jouko Orava - April 19, 2012, 2:38 p.m.
> FWIW, we tried fairly hard to get the limit lifted in the vfs, to no avail.

I understand. The downsides from the limit are very small, after all.

Has anyone tested the older stable kernel releases?
When was the VFS limit added, or is it something RHEL kernels patch out?

> Agreed, I think Dave was a little too quick on the draw on his reply.

It is easy to miss, since the EFAULT on the syscall that follows is so much
more obvious.

I've filed a terse report to the Red Hat Bugzilla:
	https://bugzilla.redhat.com/show_bug.cgi?id=814296
Let me know if you wish me to expand on that report.

Best regards,
    Jouko Orava
Eric Sandeen - April 19, 2012, 2:45 p.m.
On 4/19/12 9:38 AM, Jouko Orava wrote:
>> FWIW, we tried fairly hard to get the limit lifted in the vfs, to no avail.
> 
> I understand. The downsides from the limit are very small, after all.
> 
> Has anyone tested the older stable kernel releases?
> When was the VFS limit added, or is it something RHEL kernels patch out?

No, RHEL kernels have the same limits.

>> Agreed, I think Dave was a little too quick on the draw on his reply.
> 
> It is easy to miss, the EFAULT on the syscall that follows is so obvious.
> 
> I've filed a terse report to the Red Hat Bugzilla:
> 	https://bugzilla.redhat.com/show_bug.cgi?id=814296

Whoops, I just filed one as well.  I'll dup yours to mine since I've already
started the process there.

Thanks,
-Eric

> Let me know if you wish me to expand on that report.
> 
> Best regards,
>     Jouko Orava

Zheng Liu - April 19, 2012, 2:56 p.m.
On Thu, Apr 19, 2012 at 9:10 PM, Jouko Orava <jouko.orava@helsinki.fi> wrote:
> Hi Zheng,
>
> I can confirm the bug exists on latest RHEL 6 x86_64 kernels (2.6.32-220).
>
> On current mainline kernels all writes are limited to one page under 2GB,
> which masks the problem. I have not checked if mainline 2.6.32 has this
> limit or not. It does not matter: the limit is just a band-aid to paper
> over filesystem bugs, and should not mean you don't fix filesystem bugs.
>
> I can confirm the one line patch to fs/ext4/file.c does fix the problem.
> I have test kernels based on 2.6.32-220.7.1.el6.x86_64 with only
> the patch applied, at
>        http://www.helsinki.fi/~joorava/kernel/
> if anyone else is interested in testing.
>
> I did some (limited) testing on ext4 with the patch. It fixes the problem:
> large writes work very well too. No problems popped up in testing.
>
> Tested-by: Jouko Orava <jouko.orava@helsinki.fi>
>
>
> I'd also like to point at the real bug here, in Jouni's original strace:
>
>        writev(3, [{"\0\0\0\0\0\0\0\0", 8}, {"", 2147483648}], 2) = -2147483640
>
> The syscall returns a negative value, which is not the actual number of
> bytes written (since it is 32-bit wrapped), and errno has not been
> changed. There is no way for userspace to handle this result correctly!
>
> No one should just gloss over this by saying
> "programmer fault, you're doing it wrong".
> This is a real bug, and it deserves fixing sooner rather than later.
>
> Thanks,
>   Jouko Orava

Hi Jouko,

Thanks for testing.  I haven't tested this bug on the Red Hat kernel; as
Eric said, he will fix it there if it exists.  I think it might also be in
the stable kernels, so I will dig into the stable series.

Regards,
Zheng
Jouko Orava - April 19, 2012, 3:09 p.m.
> No, RHEL kernels have the same limits.

That means at least the 2.6.32.x stable series suffers from the same bug.
I'll see if I can test those mainline kernels this weekend, too.

> Whoops, I just filed one as well.  I'll dup yours to mine since I've already
> started the process there.

That's excellent, thank you. Mine was unacceptably terse anyway :)

Thanks,
  Jouko
Zheng Liu - April 19, 2012, 3:28 p.m.
On Thu, Apr 19, 2012 at 11:09 PM, Jouko Orava <jouko.orava@helsinki.fi> wrote:
>> No, RHEL kernels have the same limits.
>
> That means at least 2.6.32.x stable series will suffer from the same bug.
> I'll see if I can test those mainline kernels too this weekend.

Great!  Please test it on the stable kernels, because I am afraid I won't
be able to do this work this weekend. :-)

Regards,
Zheng

>
>> Whoops, I just filed one as well.  I'll dup yours to mine since I've already
>> started the process there.
>
> That's excellent, thank you. Mine was unacceptably terse anyway :)
>
> Thanks,
>  Jouko
Dave Chinner - April 20, 2012, 2:12 a.m.
On Thu, Apr 19, 2012 at 09:15:40AM -0500, Eric Sandeen wrote:
> On 4/19/12 8:10 AM, Jouko Orava wrote:
> > I'd also like to point at the real bug here, in Jouni's original strace:
> > 
> > 	writev(3, [{"\0\0\0\0\0\0\0\0", 8}, {"", 2147483648}], 2) = -2147483640
> > 
> > The syscall returns a negative value, which is not the actual number of
> > bytes written (since it is 32-bit wrapped), and errno has not been
> > changed. There is no way for userspace to handle this result correctly!
> > 
> > No one should just gloss over this by saying
> > "programmer fault, you're doing it wrong".
> > This is a real bug, and it deserves fixing sooner rather than later.
> 
> Agreed, I think Dave was a little too quick on the draw on his reply.  :)

I was pointing at the -EFAULT error in the second syscall, which was just
being passed the 2 GB sized buffer, not the above error from the first
syscall.

Cheers,

Dave.

Patch

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index cb70f18..8c7642a 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -95,7 +95,7 @@  ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 {
 	struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
 	int unaligned_aio = 0;
-	int ret;
+	ssize_t ret;
 
 	/*
 	 * If we have encountered a bitmap-format file, the size limit