diff mbox

[RESEND] lseek: change i_mutex usage.

Message ID 6.0.0.20.2.20090115163853.07056ed0@172.19.0.2
State Rejected, archived
Headers show

Commit Message

Hisashi Hifumi Jan. 15, 2009, 7:42 a.m. UTC
I changed i_mutex usage on generic_file_llseek.
This function is inside i_mutex, but I think there is room for optimization in some cases.
When SEEK_END is specified from caller, in this case we should handle
inode->i_size so i_mutex is needed. But in other cases such as SEEK_CUR or
SEEK_SET, i_mutex is not needed because just changing file->f_pos value without
touching i_size.

I did some test to measure i_mutex contention.
This test do:
	1. make an 128MB file.
	2. fork 100 processes. repeat 10000000 times lseeking randomly on each process to this file.
	3, gauge seconds between start and end of this test.

The result was:

	-2.6.29-rc1
	# time ./lseek_test
	315 sec

	real    5m15.407s
	user    1m19.128s
	sys     5m38.884s

	-2.6.29-rc1-patched
	# time ./lseek_test
	13 sec

	real    0m13.039s
	user    1m14.730s
	sys     2m9.633s 

Hardware environment:
CPU 2.4GHz(Quad Core) *4
Memory 64GB

This improvement is derived from just removal of lseek's i_mutex contention.
There is i_mutex contention not only around lseek, but also fsync or write.
So,  I think we also can mitigate i_mutex contention between fsync and lseek.

Thanks.

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>





Following is the test program "lseek_test.c".

#include <stdio.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#define NUM 100
#define LEN 4096
#define LOOP 32*1024

int num;

void catch_SIGCHLD(int signo)
{
	pid_t child_pid = 0;
	do {
		int child_ret;
		child_pid = waitpid(-1, &child_ret, WNOHANG);
		if (child_pid > 0)
			num++;
	} while (child_pid > 0);
}
main()
{
        int  i, pid;
	char buf[LEN];
	unsigned long offset, filesize;
	time_t t1, t2;
	struct sigaction act;
	memset(buf, 0, LEN);
	memset(&act, 0, sizeof(act));
	act.sa_handler = catch_SIGCHLD;
	act.sa_flags = SA_NOCLDSTOP|SA_RESTART;
	sigemptyset(&act.sa_mask);
	sigaction(SIGCHLD, &act, NULL);	

	filesize = LEN * LOOP;
	int fd = open("targetfile1",O_RDWR|O_CREAT);	

	/* create a 128MB file */
	for(i = 0; i < LOOP; i++)
		write(fd, buf, LEN);
	fsync(fd);
	close(fd);
	
	time(&t1);	
	for (i = 0; i < NUM; i++) {	
		pid = fork();
		if(pid == 0){
		/* child */
			int fd = open("targetfile1", O_RDWR);	
			int j;
			for (j = 0; j < 10000000; j++) {
				offset = (random() % filesize);
				lseek(fd, offset, SEEK_SET);
			}
			close(fd);
			exit(0);
		}
	}
	while(num < NUM)
		sleep(1);
	time(&t2);	
	printf("%d sec\n",t2-t1);
}


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Matthew Wilcox Jan. 15, 2009, 1:22 p.m. UTC | #1
On Thu, Jan 15, 2009 at 04:42:19PM +0900, Hisashi Hifumi wrote:
> I changed i_mutex usage on generic_file_llseek.
> This function is inside i_mutex, but I think there is room for optimization in some cases.
> When SEEK_END is specified from caller, in this case we should handle
> inode->i_size so i_mutex is needed. But in other cases such as SEEK_CUR or
> SEEK_SET, i_mutex is not needed because just changing file->f_pos value without
> touching i_size.

Of course if you have multiple threads, they will share a struct file,
and you're updating f_pos and f_version without locking.  Maybe that's
OK, but it's soemthing you didn't discuss.

I think it's the only reason to have the mutex here.  Otherwise we could
simply use i_size_read() in generic_file_llseek_unlocked() and there
would be no need for a mutex at all.

> -	mutex_lock(&file->f_dentry->d_inode->i_mutex);
> -	rval = generic_file_llseek_unlocked(file, offset, origin);
> -	mutex_unlock(&file->f_dentry->d_inode->i_mutex);
> +	if (origin == SEEK_END) {
> +		mutex_lock(&file->f_dentry->d_inode->i_mutex);
> +		rval = generic_file_llseek_unlocked(file, offset, origin);
> +		mutex_unlock(&file->f_dentry->d_inode->i_mutex);
> +	} else
> +		rval = generic_file_llseek_unlocked(file, offset, origin);

I'm pretty sure the spinning mutex work will have a significant effect
on the performance here.
Theodore Ts'o Jan. 15, 2009, 2:21 p.m. UTC | #2
On Thu, Jan 15, 2009 at 06:22:52AM -0700, Matthew Wilcox wrote:
> 
> Of course if you have multiple threads, they will share a struct file,
> and you're updating f_pos and f_version without locking.  Maybe that's
> OK, but it's soemthing you didn't discuss.

f_pos is updated by sys_write(), and friends without locking, so we're
fine on that front, or at least no worse off.  SUSv3 doesn't seem to
say one way or another what should happen if two threads try to
write() to a file at the same time using the same file descriptor in
terms of whether or not f_pos gets updated intelligently.  We've opted
for speed over determinism already.

Zero'ing out f_version is fine to do without locking.  It's only used
so we know that we need to revalidate in the readdir() case so that we
know it's pointing at a valid directory pointer.

That being said, I do see a race in fs/ext*/dir.c, but i_mutex locking
isn't the problem and it's not going to save us.  ext[234]_readdir()
uses f_pos through the routine, even between calls that might block;
so if one thread is randomly calling seekdir() (or lseek() directly)
while another read is calling readdir(), ext[234]_readdir() could get
potentially very confused.  If someone wants to take a look at it,
that would be great.  Otherwise I'll put it on my low-priority queue
of things to look at.

> I think it's the only reason to have the mutex here.  Otherwise we could
> simply use i_size_read() in generic_file_llseek_unlocked() and there
> would be no need for a mutex at all.

That's a good point.  Maybe I'm missing something, but I'm not sure we
need the mutex in generic_file_llseek() at all.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
jim owens Jan. 15, 2009, 3:36 p.m. UTC | #3
Theodore Tso wrote:
> On Thu, Jan 15, 2009 at 06:22:52AM -0700, Matthew Wilcox wrote:
>> Of course if you have multiple threads, they will share a struct file,
>> and you're updating f_pos and f_version without locking.  Maybe that's
>> OK, but it's soemthing you didn't discuss.
> 
> f_pos is updated by sys_write(), and friends without locking, so we're
> fine on that front, or at least no worse off.  SUSv3 doesn't seem to
> say one way or another what should happen if two threads try to
> write() to a file at the same time using the same file descriptor in
> terms of whether or not f_pos gets updated intelligently.  We've opted
> for speed over determinism already.
> 
> Zero'ing out f_version is fine to do without locking.  It's only used
> so we know that we need to revalidate in the readdir() case so that we
> know it's pointing at a valid directory pointer.
> 
> That being said, I do see a race in fs/ext*/dir.c, but i_mutex locking
> isn't the problem and it's not going to save us.  ext[234]_readdir()
> uses f_pos through the routine, even between calls that might block;
> so if one thread is randomly calling seekdir() (or lseek() directly)
> while another read is calling readdir(), ext[234]_readdir() could get
> potentially very confused.  If someone wants to take a look at it,
> that would be great.  Otherwise I'll put it on my low-priority queue
> of things to look at.
> 
>> I think it's the only reason to have the mutex here.  Otherwise we could
>> simply use i_size_read() in generic_file_llseek_unlocked() and there
>> would be no need for a mutex at all.
> 
> That's a good point.  Maybe I'm missing something, but I'm not sure we
> need the mutex in generic_file_llseek() at all.

I'm not arguing against this.  Just pointing out what
I suspect Jamie was thinking about and what a difference
3 months makes to the patch author :)

Subject:[RESEND] [PATCH] VFS: make file->f_pos access atomic on 32bit
http://marc.info/?l=linux-fsdevel&m=122335627224515

...that ended with Linus rightly saying users expecting to
do stuff like that are insane.

*However*, we should remember that any time you remove locks
from a path, you run the risk the lock was providing some
unplanned consistency and removing it will increase the window
for unexpected behavior.  And maybe finding more buggy apps
is good or bad only depending on your point of view.

jim
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Morton Jan. 16, 2009, 12:40 a.m. UTC | #4
On Thu, 15 Jan 2009 09:21:13 -0500
Theodore Tso <tytso@mit.edu> wrote:

> On Thu, Jan 15, 2009 at 06:22:52AM -0700, Matthew Wilcox wrote:
> > 
> > Of course if you have multiple threads, they will share a struct file,
> > and you're updating f_pos and f_version without locking.  Maybe that's
> > OK, but it's soemthing you didn't discuss.
> 
> f_pos is updated by sys_write(), and friends without locking, so we're
> fine on that front, or at least no worse off.

bug ;)

>  SUSv3 doesn't seem to
> say one way or another what should happen if two threads try to
> write() to a file at the same time using the same file descriptor in
> terms of whether or not f_pos gets updated intelligently.  We've opted
> for speed over determinism already.

I think our thinking was that if two threads are racily updating f_pos
with different values, then it should end up with one of those values.

From a quality-of-implementation POV (what _is_ that, anyway) it would
be bad if the kernel were to set f_pos to the upper 32 bits of position
A and the lower 32 bits of position B.  Which could happen if we remove
the i_mutex protection on 32-bits.

We could perhaps omit some locking if CONFIG_64BIT.  There's probably
quite a bit of locking which could be omitted in that case.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hisashi Hifumi Jan. 16, 2009, 12:53 a.m. UTC | #5
At 09:40 09/01/16, Andrew Morton wrote:
>On Thu, 15 Jan 2009 09:21:13 -0500
>Theodore Tso <tytso@mit.edu> wrote:
>
>> On Thu, Jan 15, 2009 at 06:22:52AM -0700, Matthew Wilcox wrote:
>> > 
>> > Of course if you have multiple threads, they will share a struct file,
>> > and you're updating f_pos and f_version without locking.  Maybe that's
>> > OK, but it's soemthing you didn't discuss.
>> 
>> f_pos is updated by sys_write(), and friends without locking, so we're
>> fine on that front, or at least no worse off.
>
>bug ;)
>
>>  SUSv3 doesn't seem to
>> say one way or another what should happen if two threads try to
>> write() to a file at the same time using the same file descriptor in
>> terms of whether or not f_pos gets updated intelligently.  We've opted
>> for speed over determinism already.
>
>I think our thinking was that if two threads are racily updating f_pos
>with different values, then it should end up with one of those values.
>
>From a quality-of-implementation POV (what _is_ that, anyway) it would
>be bad if the kernel were to set f_pos to the upper 32 bits of position
>A and the lower 32 bits of position B.  Which could happen if we remove
>the i_mutex protection on 32-bits.
>
>We could perhaps omit some locking if CONFIG_64BIT.  There's probably
>quite a bit of locking which could be omitted in that case.

Updating f_pos value on 32bit is not atomic, so we discussed about this
but we concluded that it does not matter whether f_pos is atomic or not
See,
Subject:[RESEND] [PATCH] VFS: make file->f_pos access atomic on 32bit
http://marc.info/?l=linux-fsdevel&m=122335627224515

I think even i_mutex is not needed. When we touch i_size, i_size_read is enough,
and we can remove i_mutex at all on lseek.


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Morton Jan. 16, 2009, 1:49 a.m. UTC | #6
On Fri, 16 Jan 2009 09:53:02 +0900 Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> wrote:

> 
> At 09:40 09/01/16, Andrew Morton wrote:
> >On Thu, 15 Jan 2009 09:21:13 -0500
> >Theodore Tso <tytso@mit.edu> wrote:
> >
> >> On Thu, Jan 15, 2009 at 06:22:52AM -0700, Matthew Wilcox wrote:
> >> > 
> >> > Of course if you have multiple threads, they will share a struct file,
> >> > and you're updating f_pos and f_version without locking.  Maybe that's
> >> > OK, but it's soemthing you didn't discuss.
> >> 
> >> f_pos is updated by sys_write(), and friends without locking, so we're
> >> fine on that front, or at least no worse off.
> >
> >bug ;)
> >
> >>  SUSv3 doesn't seem to
> >> say one way or another what should happen if two threads try to
> >> write() to a file at the same time using the same file descriptor in
> >> terms of whether or not f_pos gets updated intelligently.  We've opted
> >> for speed over determinism already.
> >
> >I think our thinking was that if two threads are racily updating f_pos
> >with different values, then it should end up with one of those values.
> >
> >From a quality-of-implementation POV (what _is_ that, anyway) it would
> >be bad if the kernel were to set f_pos to the upper 32 bits of position
> >A and the lower 32 bits of position B.  Which could happen if we remove
> >the i_mutex protection on 32-bits.
> >
> >We could perhaps omit some locking if CONFIG_64BIT.  There's probably
> >quite a bit of locking which could be omitted in that case.
> 
> Updating f_pos value on 32bit is not atomic, so we discussed about this
> but we concluded that it does not matter whether f_pos is atomic or not

It's unclear what you're saying here.

I see three issues here:

a) two racing threads update f_pos.  One of them wins, and the
   outcome in indeterminate.

b) two racing threads update f_pos and the end result is that f_pos
   contains a value which *neither* thread tried to write.

c) one thread is writing and the other reading.  There is a window
   where the reader can see an intermediate value which is a mix of the
   old and new values.

I think we decided that a) is acceptable, b) is not and that c) can only
occur on multiple-of-4G wraparounds and isn't worth bothering about.

> See,
> Subject:[RESEND] [PATCH] VFS: make file->f_pos access atomic on 32bit
> http://marc.info/?l=linux-fsdevel&m=122335627224515

Sorry, I'm disinclined to re-read a long thread, trying to work out
which bit you might be referring to.

> I think even i_mutex is not needed. When we touch i_size, i_size_read is enough,
> and we can remove i_mutex at all on lseek.

Why are we talking about i_size now?

Confused.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hisashi Hifumi Jan. 16, 2009, 2:08 a.m. UTC | #7
At 10:49 09/01/16, Andrew Morton wrote:
>On Fri, 16 Jan 2009 09:53:02 +0900 Hisashi Hifumi 
><hifumi.hisashi@oss.ntt.co.jp> wrote:
>
>> 
>> At 09:40 09/01/16, Andrew Morton wrote:
>> >On Thu, 15 Jan 2009 09:21:13 -0500
>> >Theodore Tso <tytso@mit.edu> wrote:
>> >
>> >> On Thu, Jan 15, 2009 at 06:22:52AM -0700, Matthew Wilcox wrote:
>> >> > 
>> >> > Of course if you have multiple threads, they will share a struct file,
>> >> > and you're updating f_pos and f_version without locking.  Maybe that's
>> >> > OK, but it's soemthing you didn't discuss.
>> >> 
>> >> f_pos is updated by sys_write(), and friends without locking, so we're
>> >> fine on that front, or at least no worse off.
>> >
>> >bug ;)
>> >
>> >>  SUSv3 doesn't seem to
>> >> say one way or another what should happen if two threads try to
>> >> write() to a file at the same time using the same file descriptor in
>> >> terms of whether or not f_pos gets updated intelligently.  We've opted
>> >> for speed over determinism already.
>> >
>> >I think our thinking was that if two threads are racily updating f_pos
>> >with different values, then it should end up with one of those values.
>> >
>> >From a quality-of-implementation POV (what _is_ that, anyway) it would
>> >be bad if the kernel were to set f_pos to the upper 32 bits of position
>> >A and the lower 32 bits of position B.  Which could happen if we remove
>> >the i_mutex protection on 32-bits.
>> >
>> >We could perhaps omit some locking if CONFIG_64BIT.  There's probably
>> >quite a bit of locking which could be omitted in that case.
>> 
>> Updating f_pos value on 32bit is not atomic, so we discussed about this
>> but we concluded that it does not matter whether f_pos is atomic or not
>
>It's unclear what you're saying here.
>
>I see three issues here:
>
>a) two racing threads update f_pos.  One of them wins, and the
>   outcome in indeterminate.
>
>b) two racing threads update f_pos and the end result is that f_pos
>   contains a value which *neither* thread tried to write.
>
>c) one thread is writing and the other reading.  There is a window
>   where the reader can see an intermediate value which is a mix of the
>   old and new values.
>
>I think we decided that a) is acceptable, b) is not and that c) can only
>occur on multiple-of-4G wraparounds and isn't worth bothering about.
>
>> See,
>> Subject:[RESEND] [PATCH] VFS: make file->f_pos access atomic on 32bit
>> http://marc.info/?l=linux-fsdevel&m=122335627224515
>
>Sorry, I'm disinclined to re-read a long thread, trying to work out
>which bit you might be referring to.

Following is Linus's post about this issue.
http://marc.info/?l=linux-kernel&m=122356445226680&w=2

If we decide that a) is ok and we mind f_pos value being atomic on 32bit arch,
we should use seq_counter to f_pos.

>
>> I think even i_mutex is not needed. When we touch i_size, i_size_read is 
>enough,
>> and we can remove i_mutex at all on lseek.
>
>Why are we talking about i_size now?
>
>Confused.

When caller of lseek set SEEK_END, i_size is referenced. I mentioned about this.
I thought that the reason of i_mutex existence on lseek is only touching i_size.
So if i_size_read is used to touch i_size value, i_mutex could be removed.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff -Nrup linux-2.6.29-rc1.org/fs/read_write.c linux-2.6.29-rc1.lseek/fs/read_write.c
--- linux-2.6.29-rc1.org/fs/read_write.c	2009-01-15 15:37:35.000000000 +0900
+++ linux-2.6.29-rc1.lseek/fs/read_write.c	2009-01-15 16:12:17.000000000 +0900
@@ -89,9 +89,12 @@  loff_t generic_file_llseek(struct file *
 {
 	loff_t rval;
 
-	mutex_lock(&file->f_dentry->d_inode->i_mutex);
-	rval = generic_file_llseek_unlocked(file, offset, origin);
-	mutex_unlock(&file->f_dentry->d_inode->i_mutex);
+	if (origin == SEEK_END) {
+		mutex_lock(&file->f_dentry->d_inode->i_mutex);
+		rval = generic_file_llseek_unlocked(file, offset, origin);
+		mutex_unlock(&file->f_dentry->d_inode->i_mutex);
+	} else
+		rval = generic_file_llseek_unlocked(file, offset, origin);
 
 	return rval;
 }