Message ID | 6.0.0.20.2.20090115163853.07056ed0@172.19.0.2 |
---|---|
State | Rejected, archived |
Headers | show |
On Thu, Jan 15, 2009 at 04:42:19PM +0900, Hisashi Hifumi wrote: > I changed i_mutex usage on generic_file_llseek. > This function is inside i_mutex, but I think there is room for optimization in some cases. > When SEEK_END is specified from caller, in this case we should handle > inode->i_size so i_mutex is needed. But in other cases such as SEEK_CUR or > SEEK_SET, i_mutex is not needed because just changing file->f_pos value without > touching i_size. Of course if you have multiple threads, they will share a struct file, and you're updating f_pos and f_version without locking. Maybe that's OK, but it's soemthing you didn't discuss. I think it's the only reason to have the mutex here. Otherwise we could simply use i_size_read() in generic_file_llseek_unlocked() and there would be no need for a mutex at all. > - mutex_lock(&file->f_dentry->d_inode->i_mutex); > - rval = generic_file_llseek_unlocked(file, offset, origin); > - mutex_unlock(&file->f_dentry->d_inode->i_mutex); > + if (origin == SEEK_END) { > + mutex_lock(&file->f_dentry->d_inode->i_mutex); > + rval = generic_file_llseek_unlocked(file, offset, origin); > + mutex_unlock(&file->f_dentry->d_inode->i_mutex); > + } else > + rval = generic_file_llseek_unlocked(file, offset, origin); I'm pretty sure the spinning mutex work will have a significant effect on the performance here.
On Thu, Jan 15, 2009 at 06:22:52AM -0700, Matthew Wilcox wrote: > > Of course if you have multiple threads, they will share a struct file, > and you're updating f_pos and f_version without locking. Maybe that's > OK, but it's soemthing you didn't discuss. f_pos is updated by sys_write(), and friends without locking, so we're fine on that front, or at least no worse off. SUSv3 doesn't seem to say one way or another what should happen if two threads try to write() to a file at the same time using the same file descriptor in terms of whether or not f_pos gets updated intelligently. We've opted for speed over determinism already. Zero'ing out f_version is fine to do without locking. It's only used so we know that we need to revalidate in the readdir() case so that we know it's pointing at a valid directory pointer. That being said, I do see a race in fs/ext*/dir.c, but i_mutex locking isn't the problem and it's not going to save us. ext[234]_readdir() uses f_pos through the routine, even between calls that might block; so if one thread is randomly calling seekdir() (or lseek() directly) while another read is calling readdir(), ext[234]_readdir() could get potentially very confused. If someone wants to take a look at it, that would be great. Otherwise I'll put it on my low-priority queue of things to look at. > I think it's the only reason to have the mutex here. Otherwise we could > simply use i_size_read() in generic_file_llseek_unlocked() and there > would be no need for a mutex at all. That's a good point. Maybe I'm missing something, but I'm not sure we need the mutex in generic_file_llseek() at all. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso wrote: > On Thu, Jan 15, 2009 at 06:22:52AM -0700, Matthew Wilcox wrote: >> Of course if you have multiple threads, they will share a struct file, >> and you're updating f_pos and f_version without locking. Maybe that's >> OK, but it's soemthing you didn't discuss. > > f_pos is updated by sys_write(), and friends without locking, so we're > fine on that front, or at least no worse off. SUSv3 doesn't seem to > say one way or another what should happen if two threads try to > write() to a file at the same time using the same file descriptor in > terms of whether or not f_pos gets updated intelligently. We've opted > for speed over determinism already. > > Zero'ing out f_version is fine to do without locking. It's only used > so we know that we need to revalidate in the readdir() case so that we > know it's pointing at a valid directory pointer. > > That being said, I do see a race in fs/ext*/dir.c, but i_mutex locking > isn't the problem and it's not going to save us. ext[234]_readdir() > uses f_pos through the routine, even between calls that might block; > so if one thread is randomly calling seekdir() (or lseek() directly) > while another read is calling readdir(), ext[234]_readdir() could get > potentially very confused. If someone wants to take a look at it, > that would be great. Otherwise I'll put it on my low-priority queue > of things to look at. > >> I think it's the only reason to have the mutex here. Otherwise we could >> simply use i_size_read() in generic_file_llseek_unlocked() and there >> would be no need for a mutex at all. > > That's a good point. Maybe I'm missing something, but I'm not sure we > need the mutex in generic_file_llseek() at all. I'm not arguing against this. Just pointing out what I suspect Jamie was thinking about and what a difference 3 months makes to the patch author :) Subject:[RESEND] [PATCH] VFS: make file->f_pos access atomic on 32bit http://marc.info/?l=linux-fsdevel&m=122335627224515 ...that ended with Linus rightly saying users expecting to do stuff like that are insane. *However*, we should remember that any time you remove locks from a path, you run the risk the lock was providing some unplanned consistency and removing it will increase the window for unexpected behavior. And maybe finding more buggy apps is good or bad only depending on your point of view. jim -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 15 Jan 2009 09:21:13 -0500 Theodore Tso <tytso@mit.edu> wrote: > On Thu, Jan 15, 2009 at 06:22:52AM -0700, Matthew Wilcox wrote: > > > > Of course if you have multiple threads, they will share a struct file, > > and you're updating f_pos and f_version without locking. Maybe that's > > OK, but it's soemthing you didn't discuss. > > f_pos is updated by sys_write(), and friends without locking, so we're > fine on that front, or at least no worse off. bug ;) > SUSv3 doesn't seem to > say one way or another what should happen if two threads try to > write() to a file at the same time using the same file descriptor in > terms of whether or not f_pos gets updated intelligently. We've opted > for speed over determinism already. I think our thinking was that if two threads are racily updating f_pos with different values, then it should end up with one of those values. From a quality-of-implementation POV (what _is_ that, anyway) it would be bad if the kernel were to set f_pos to the upper 32 bits of position A and the lower 32 bits of position B. Which could happen if we remove the i_mutex protection on 32-bits. We could perhaps omit some locking if CONFIG_64BIT. There's probably quite a bit of locking which could be omitted in that case. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
At 09:40 09/01/16, Andrew Morton wrote: >On Thu, 15 Jan 2009 09:21:13 -0500 >Theodore Tso <tytso@mit.edu> wrote: > >> On Thu, Jan 15, 2009 at 06:22:52AM -0700, Matthew Wilcox wrote: >> > >> > Of course if you have multiple threads, they will share a struct file, >> > and you're updating f_pos and f_version without locking. Maybe that's >> > OK, but it's soemthing you didn't discuss. >> >> f_pos is updated by sys_write(), and friends without locking, so we're >> fine on that front, or at least no worse off. > >bug ;) > >> SUSv3 doesn't seem to >> say one way or another what should happen if two threads try to >> write() to a file at the same time using the same file descriptor in >> terms of whether or not f_pos gets updated intelligently. We've opted >> for speed over determinism already. > >I think our thinking was that if two threads are racily updating f_pos >with different values, then it should end up with one of those values. > >From a quality-of-implementation POV (what _is_ that, anyway) it would >be bad if the kernel were to set f_pos to the upper 32 bits of position >A and the lower 32 bits of position B. Which could happen if we remove >the i_mutex protection on 32-bits. > >We could perhaps omit some locking if CONFIG_64BIT. There's probably >quite a bit of locking which could be omitted in that case. Updating f_pos value on 32bit is not atomic, so we discussed about this but we concluded that it does not matter whether f_pos is atomic or not See, Subject:[RESEND] [PATCH] VFS: make file->f_pos access atomic on 32bit http://marc.info/?l=linux-fsdevel&m=122335627224515 I think even i_mutex is not needed. When we touch i_size, i_size_read is enough, and we can remove i_mutex at all on lseek. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 16 Jan 2009 09:53:02 +0900 Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> wrote: > > At 09:40 09/01/16, Andrew Morton wrote: > >On Thu, 15 Jan 2009 09:21:13 -0500 > >Theodore Tso <tytso@mit.edu> wrote: > > > >> On Thu, Jan 15, 2009 at 06:22:52AM -0700, Matthew Wilcox wrote: > >> > > >> > Of course if you have multiple threads, they will share a struct file, > >> > and you're updating f_pos and f_version without locking. Maybe that's > >> > OK, but it's soemthing you didn't discuss. > >> > >> f_pos is updated by sys_write(), and friends without locking, so we're > >> fine on that front, or at least no worse off. > > > >bug ;) > > > >> SUSv3 doesn't seem to > >> say one way or another what should happen if two threads try to > >> write() to a file at the same time using the same file descriptor in > >> terms of whether or not f_pos gets updated intelligently. We've opted > >> for speed over determinism already. > > > >I think our thinking was that if two threads are racily updating f_pos > >with different values, then it should end up with one of those values. > > > >From a quality-of-implementation POV (what _is_ that, anyway) it would > >be bad if the kernel were to set f_pos to the upper 32 bits of position > >A and the lower 32 bits of position B. Which could happen if we remove > >the i_mutex protection on 32-bits. > > > >We could perhaps omit some locking if CONFIG_64BIT. There's probably > >quite a bit of locking which could be omitted in that case. > > Updating f_pos value on 32bit is not atomic, so we discussed about this > but we concluded that it does not matter whether f_pos is atomic or not It's unclear what you're saying here. I see three issues here: a) two racing threads update f_pos. One of them wins, and the outcome in indeterminate. b) two racing threads update f_pos and the end result is that f_pos contains a value which *neither* thread tried to write. c) one thread is writing and the other reading. There is a window where the reader can see an intermediate value which is a mix of the old and new values. I think we decided that a) is acceptable, b) is not and that c) can only occur on multiple-of-4G wraparounds and isn't worth bothering about. > See, > Subject:[RESEND] [PATCH] VFS: make file->f_pos access atomic on 32bit > http://marc.info/?l=linux-fsdevel&m=122335627224515 Sorry, I'm disinclined to re-read a long thread, trying to work out which bit you might be referring to. > I think even i_mutex is not needed. When we touch i_size, i_size_read is enough, > and we can remove i_mutex at all on lseek. Why are we talking about i_size now? Confused. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
At 10:49 09/01/16, Andrew Morton wrote: >On Fri, 16 Jan 2009 09:53:02 +0900 Hisashi Hifumi ><hifumi.hisashi@oss.ntt.co.jp> wrote: > >> >> At 09:40 09/01/16, Andrew Morton wrote: >> >On Thu, 15 Jan 2009 09:21:13 -0500 >> >Theodore Tso <tytso@mit.edu> wrote: >> > >> >> On Thu, Jan 15, 2009 at 06:22:52AM -0700, Matthew Wilcox wrote: >> >> > >> >> > Of course if you have multiple threads, they will share a struct file, >> >> > and you're updating f_pos and f_version without locking. Maybe that's >> >> > OK, but it's soemthing you didn't discuss. >> >> >> >> f_pos is updated by sys_write(), and friends without locking, so we're >> >> fine on that front, or at least no worse off. >> > >> >bug ;) >> > >> >> SUSv3 doesn't seem to >> >> say one way or another what should happen if two threads try to >> >> write() to a file at the same time using the same file descriptor in >> >> terms of whether or not f_pos gets updated intelligently. We've opted >> >> for speed over determinism already. >> > >> >I think our thinking was that if two threads are racily updating f_pos >> >with different values, then it should end up with one of those values. >> > >> >From a quality-of-implementation POV (what _is_ that, anyway) it would >> >be bad if the kernel were to set f_pos to the upper 32 bits of position >> >A and the lower 32 bits of position B. Which could happen if we remove >> >the i_mutex protection on 32-bits. >> > >> >We could perhaps omit some locking if CONFIG_64BIT. There's probably >> >quite a bit of locking which could be omitted in that case. >> >> Updating f_pos value on 32bit is not atomic, so we discussed about this >> but we concluded that it does not matter whether f_pos is atomic or not > >It's unclear what you're saying here. > >I see three issues here: > >a) two racing threads update f_pos. One of them wins, and the > outcome in indeterminate. > >b) two racing threads update f_pos and the end result is that f_pos > contains a value which *neither* thread tried to write. > >c) one thread is writing and the other reading. There is a window > where the reader can see an intermediate value which is a mix of the > old and new values. > >I think we decided that a) is acceptable, b) is not and that c) can only >occur on multiple-of-4G wraparounds and isn't worth bothering about. > >> See, >> Subject:[RESEND] [PATCH] VFS: make file->f_pos access atomic on 32bit >> http://marc.info/?l=linux-fsdevel&m=122335627224515 > >Sorry, I'm disinclined to re-read a long thread, trying to work out >which bit you might be referring to. Following is Linus's post about this issue. http://marc.info/?l=linux-kernel&m=122356445226680&w=2 If we decide that a) is ok and we mind f_pos value being atomic on 32bit arch, we should use seq_counter to f_pos. > >> I think even i_mutex is not needed. When we touch i_size, i_size_read is >enough, >> and we can remove i_mutex at all on lseek. > >Why are we talking about i_size now? > >Confused. When caller of lseek set SEEK_END, i_size is referenced. I mentioned about this. I thought that the reason of i_mutex existence on lseek is only touching i_size. So if i_size_read is used to touch i_size value, i_mutex could be removed. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff -Nrup linux-2.6.29-rc1.org/fs/read_write.c linux-2.6.29-rc1.lseek/fs/read_write.c --- linux-2.6.29-rc1.org/fs/read_write.c 2009-01-15 15:37:35.000000000 +0900 +++ linux-2.6.29-rc1.lseek/fs/read_write.c 2009-01-15 16:12:17.000000000 +0900 @@ -89,9 +89,12 @@ loff_t generic_file_llseek(struct file * { loff_t rval; - mutex_lock(&file->f_dentry->d_inode->i_mutex); - rval = generic_file_llseek_unlocked(file, offset, origin); - mutex_unlock(&file->f_dentry->d_inode->i_mutex); + if (origin == SEEK_END) { + mutex_lock(&file->f_dentry->d_inode->i_mutex); + rval = generic_file_llseek_unlocked(file, offset, origin); + mutex_unlock(&file->f_dentry->d_inode->i_mutex); + } else + rval = generic_file_llseek_unlocked(file, offset, origin); return rval; }
I changed i_mutex usage on generic_file_llseek. This function is inside i_mutex, but I think there is room for optimization in some cases. When SEEK_END is specified from caller, in this case we should handle inode->i_size so i_mutex is needed. But in other cases such as SEEK_CUR or SEEK_SET, i_mutex is not needed because just changing file->f_pos value without touching i_size. I did some test to measure i_mutex contention. This test do: 1. make an 128MB file. 2. fork 100 processes. repeat 10000000 times lseeking randomly on each process to this file. 3, gauge seconds between start and end of this test. The result was: -2.6.29-rc1 # time ./lseek_test 315 sec real 5m15.407s user 1m19.128s sys 5m38.884s -2.6.29-rc1-patched # time ./lseek_test 13 sec real 0m13.039s user 1m14.730s sys 2m9.633s Hardware environment: CPU 2.4GHz(Quad Core) *4 Memory 64GB This improvement is derived from just removal of lseek's i_mutex contention. There is i_mutex contention not only around lseek, but also fsync or write. So, I think we also can mitigate i_mutex contention between fsync and lseek. Thanks. Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Following is the test program "lseek_test.c". #include <stdio.h> #include <sched.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/wait.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <signal.h> #define NUM 100 #define LEN 4096 #define LOOP 32*1024 int num; void catch_SIGCHLD(int signo) { pid_t child_pid = 0; do { int child_ret; child_pid = waitpid(-1, &child_ret, WNOHANG); if (child_pid > 0) num++; } while (child_pid > 0); } main() { int i, pid; char buf[LEN]; unsigned long offset, filesize; time_t t1, t2; struct sigaction act; memset(buf, 0, LEN); memset(&act, 0, sizeof(act)); act.sa_handler = catch_SIGCHLD; act.sa_flags = SA_NOCLDSTOP|SA_RESTART; sigemptyset(&act.sa_mask); sigaction(SIGCHLD, &act, NULL); filesize = LEN * LOOP; int fd = open("targetfile1",O_RDWR|O_CREAT); /* create a 128MB file */ for(i = 0; i < LOOP; i++) write(fd, buf, LEN); fsync(fd); close(fd); time(&t1); for (i = 0; i < NUM; i++) { pid = fork(); if(pid == 0){ /* child */ int fd = open("targetfile1", O_RDWR); int j; for (j = 0; j < 10000000; j++) { offset = (random() % filesize); lseek(fd, offset, SEEK_SET); } close(fd); exit(0); } } while(num < NUM) sleep(1); time(&t2); printf("%d sec\n",t2-t1); } -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html