diff mbox

[6/6] fs: Introduce kern_mount_special() to mount special vfs

Message ID 492DDCAB.1070204@cosmosbay.com
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Nov. 26, 2008, 11:32 p.m. UTC
This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
refcounting on permanent system vfs.
Use this function for sockets, pipes, anonymous fds.

(socket8 bench result : from 2.94s to 2.23s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c      |    2 +-
 fs/pipe.c             |    2 +-
 fs/super.c            |    9 +++++++++
 include/linux/fs.h    |    1 +
 include/linux/mount.h |    5 +++--
 net/socket.c          |    2 +-
 6 files changed, 16 insertions(+), 5 deletions(-)

Comments

David Miller Nov. 27, 2008, 8:21 a.m. UTC | #1
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 27 Nov 2008 00:32:59 +0100

> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
> refcounting on permanent system vfs.
> Use this function for sockets, pipes, anonymous fds.
> 
> (socket8 bench result : from 2.94s to 2.23s)
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

For networking bits:

Acked-by: David S. Miller <davem@davemloft.net>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 27, 2008, 9:53 a.m. UTC | #2
On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
> refcounting on permanent system vfs.
> Use this function for sockets, pipes, anonymous fds.

special is not a useful name for a flag, by definition everything that
needs a flag is special compared to the version that doesn't need a
flag.

The general idea of skippign the writer counts makes sense, but please
give it a descriptive name that explains the not unmountable thing.
And please kill your kern_mount wrapper and just set the flag manually.

Also I think it should be a superblock flag, not a mount flag as you
don't want thse to differ for multiple mounts of the same filesystem.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 27, 2008, 10:04 a.m. UTC | #3
Christoph Hellwig a écrit :
> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
>> refcounting on permanent system vfs.
>> Use this function for sockets, pipes, anonymous fds.
> 
> special is not a useful name for a flag, by definition everything that
> needs a flag is special compared to the version that doesn't need a
> flag.
> 
> The general idea of skippign the writer counts makes sense, but please
> give it a descriptive name that explains the not unmountable thing.
> And please kill your kern_mount wrapper and just set the flag manually.
> 
> Also I think it should be a superblock flag, not a mount flag as you
> don't want thse to differ for multiple mounts of the same filesystem.
> 
> 

Hum.. we have a superblock flag already, but testing it in mntput()/mntget()
is going to be a litle bit expensive if we add a derefence ?

if (mnt && mnt->mnt_sb->s_flags & MS_SPECIAL) {
   ...
}

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 27, 2008, 10:10 a.m. UTC | #4
On Thu, Nov 27, 2008 at 11:04:38AM +0100, Eric Dumazet wrote:
> Hum.. we have a superblock flag already, but testing it in mntput()/mntget()
> is going to be a litle bit expensive if we add a derefence ?
>
> if (mnt && mnt->mnt_sb->s_flags & MS_SPECIAL) {
>   ...
> }

Well, run a benchmark to see if it makes any difference.  And when it
does please always set the mount flag from the common mount code when
it's set on the superblock, and document that this is the only valid way
to set it.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Nov. 28, 2008, 9:26 a.m. UTC | #5
On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
> refcounting on permanent system vfs.
> Use this function for sockets, pipes, anonymous fds.

IMO that's pushing it past the point of usefulness; unless you can show
that this really gives considerable win on pipes et.al. *AND* that it
doesn't hurt other loads...

dput() part: again, I want to see what happens on other loads; it's probably
fine (and win is certainly more than from mntput() change), but...  The
thing is, atomic_dec_and_lock() in there is often done on dentries with
d_count > 1 and that's fairly cheap (and doesn't involve contention on
dcache_lock on sane targets).

FWIW, unless there's a really good reason to do alpha atomic_dec_and_lock()
in a special way, I'd try to compare with
        if (atomic_add_unless(&dentry->d_count, -1, 1))
                return;
	if (your flag)
		sod off to special
	spin_lock(&dcache_lock);
	if (atomic_dec_and_test(&dentry->d_count)) {
		spin_unlock(&dcache_lock);
		return;
	}
	the rest as usual

	As for the alpha... unless I'm misreading the assembler in
arch/alpha/lib/dec_and_lock.c, it looks like we have essentially an
implementation of atomic_add_unless() in there and one that just
might be better than what we've got in arch/alpha/include/asm/atomic.h.
How about
1:	ldl_l	x, addr
	cmpne	x, u, y	/* y = x != u */
	beq	y, 3f	/* if !y -> bugger off, return 0 */
	addl	x, a, y
	stl_c	y, addr	/* y <- *addr has not changed since ldl_l */
	beq	y, 2f
3:	/* return value is in y */
.subsection 2 /* out of the way */
2:	br	1b
.previous
for atomic_add_unless() guts?  With that we are rid of HAVE_DEC_LOCK and
get a uniform implementation of atomic_dec_and_lock() for all targets...

AFAICS, that would be
static __inline__ int atomic_add_unless(atomic_t *v, int a, int u)
{
	unsigned long temp, res;
	__asm__ __volatile__(
	"1:     ldl_l %0,%1\n"
	"       cmpne %0,%4,%2\n"
	"       beq %4,3f\n"
	"       addl %0,%3,%4\n"
	"       stl_c %2,%1\n"
	"       beq %2,2f\n"
	"3:\n"
        ".subsection 2\n"
        "2:     br 1b\n"
        ".previous"
        :"=&r" (temp), "=m" (v->counter), "=&r" (res)
        :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
	smp_mb();
	return res;
}

static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u)
{
	unsigned long temp, res;
	__asm__ __volatile__(
	"1:     ldq_l %0,%1\n"
	"       cmpne %0,%4,%2\n"
	"       beq %4,3f\n"
	"       addq %0,%3,%4\n"
	"       stq_c %2,%1\n"
	"       beq %2,2f\n"
	"3:\n"
        ".subsection 2\n"
        "2:     br 1b\n"
        ".previous"
        :"=&r" (temp), "=m" (v->counter), "=&r" (res)
        :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
	smp_mb();
	return res;
}

Comments?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Nov. 28, 2008, 9:34 a.m. UTC | #6
On Fri, Nov 28, 2008 at 09:26:04AM +0000, Al Viro wrote:

gyah...  That would be

> static __inline__ int atomic_add_unless(atomic_t *v, int a, int u)
> {
> 	unsigned long temp, res;
> 	__asm__ __volatile__(
> 	"1:     ldl_l %0,%1\n"
> 	"       cmpne %0,%4,%2\n"
 	"       beq %2,3f\n"
 	"       addl %0,%3,%2\n"
> 	"       stl_c %2,%1\n"
> 	"       beq %2,2f\n"
> 	"3:\n"
>         ".subsection 2\n"
>         "2:     br 1b\n"
>         ".previous"
>         :"=&r" (temp), "=m" (v->counter), "=&r" (res)
>         :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
> 	smp_mb();
> 	return res;
> }
> 
> static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u)
> {
> 	unsigned long temp, res;
> 	__asm__ __volatile__(
> 	"1:     ldq_l %0,%1\n"
> 	"       cmpne %0,%4,%2\n"
 	"       beq %2,3f\n"
 	"       addq %0,%3,%2\n"
> 	"       stq_c %2,%1\n"
> 	"       beq %2,2f\n"
> 	"3:\n"
>         ".subsection 2\n"
>         "2:     br 1b\n"
>         ".previous"
>         :"=&r" (temp), "=m" (v->counter), "=&r" (res)
>         :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
> 	smp_mb();
> 	return res;
> }
> 
> Comments?
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar Nov. 28, 2008, 6:02 p.m. UTC | #7
* Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
> > This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
> > refcounting on permanent system vfs.
> > Use this function for sockets, pipes, anonymous fds.
> 
> IMO that's pushing it past the point of usefulness; unless you can show
> that this really gives considerable win on pipes et.al. *AND* that it
> doesn't hurt other loads...

The numbers look pretty convincing:

> >  (socket8 bench result : from 2.94s to 2.23s)

And i wouldnt expect it to hurt real-filesystem workloads.

Here's the contemporary trace of a typical ext3- sys_open():

 0)               |  sys_open() {
 0)               |    do_sys_open() {
 0)               |      getname() {
 0)      0.367 us |        kmem_cache_alloc();
 0)               |        strncpy_from_user(); {
 0)               |          _cond_resched() {
 0)               |            need_resched() {
 0)      0.363 us |              constant_test_bit();
 0)      1. 47 us |            }
 0)      1.815 us |          }
 0)      2.587 us |        }
 0)      4. 22 us |      }
 0)               |      alloc_fd() {
 0)      0.480 us |        _spin_lock();
 0)      0.487 us |        expand_files();
 0)      2.356 us |      }
 0)               |      do_filp_open() {
 0)               |        path_lookup_open() {
 0)               |          get_empty_filp() {
 0)      0.439 us |            kmem_cache_alloc();
 0)               |            security_file_alloc() {
 0)      0.316 us |              cap_file_alloc_security();
 0)      1. 87 us |            }
 0)      3.189 us |          }
 0)               |          do_path_lookup() {
 0)      0.366 us |            _read_lock();
 0)               |            path_walk() {
 0)               |              __link_path_walk() {
 0)               |                inode_permission() {
 0)               |                  ext3_permission() {
 0)      0.441 us |                    generic_permission();
 0)      1.247 us |                  }
 0)               |                  security_inode_permission() {
 0)      0.411 us |                    cap_inode_permission();
 0)      1.186 us |                  }
 0)      3.555 us |                }
 0)               |                do_lookup() {
 0)               |                  __d_lookup() {
 0)      0.486 us |                    _spin_lock();
 0)      1.369 us |                  }
 0)      0.442 us |                  __follow_mount();
 0)      3. 14 us |                }
 0)               |                path_to_nameidata() {
 0)      0.476 us |                  dput();
 0)      1.235 us |                }
 0)               |                inode_permission() {
 0)               |                  ext3_permission() {
 0)               |                    generic_permission() {
 0)               |                      in_group_p() {
 0)      0.410 us |                        groups_search();
 0)      1.172 us |                      }
 0)      1.994 us |                    }
 0)      2.789 us |                  }
 0)               |                  security_inode_permission() {
 0)      0.454 us |                    cap_inode_permission();
 0)      1.238 us |                  }
 0)      5.262 us |                }
 0)               |                do_lookup() {
 0)               |                  __d_lookup() {
 0)      0.480 us |                    _spin_lock();
 0)      1.621 us |                  }
 0)      0.456 us |                  __follow_mount();
 0)      3.215 us |                }
 0)               |                path_to_nameidata() {
 0)      0.420 us |                  dput();
 0)      1.193 us |                }
 0) +   23.551 us |              }
 0)               |              path_put() {
 0)      0.420 us |                dput();
 0)               |                mntput() {
 0)      0.359 us |                  mntput_no_expire();
 0)      1. 50 us |                }
 0)      2.544 us |              }
 0) +   27.253 us |            }
 0) +   28.850 us |          }
 0) +   33.217 us |        }
 0)               |        may_open() {
 0)               |          inode_permission() {
 0)               |            ext3_permission() {
 0)      0.480 us |              generic_permission();
 0)      1.229 us |            }
 0)               |            security_inode_permission() {
 0)      0.405 us |              cap_inode_permission();
 0)      1.196 us |            }
 0)      3.589 us |          }
 0)      4.600 us |        }
 0)               |        nameidata_to_filp() {
 0)               |          __dentry_open() {
 0)               |            file_move() {
 0)      0.470 us |              _spin_lock();
 0)      1.243 us |            }
 0)               |            security_dentry_open() {
 0)      0.344 us |              cap_dentry_open();
 0)      1.139 us |            }
 0)      0.412 us |            generic_file_open();
 0)      0.561 us |            file_ra_state_init();
 0)      5.714 us |          }
 0)      6.483 us |        }
 0) +   46.494 us |      }
 0)      0.453 us |      inotify_dentry_parent_queue_event();
 0)      0.403 us |      inotify_inode_queue_event();
 0)               |      fd_install() {
 0)      0.440 us |        _spin_lock();
 0)      1.247 us |      }
 0)               |      putname() {
 0)               |        kmem_cache_free() {
 0)               |          virt_to_head_page() {
 0)      0.369 us |            constant_test_bit();
 0)      1. 23 us |          }
 0)      1.738 us |        }
 0)      2.422 us |      }
 0) +   60.560 us |    }
 0) +   61.368 us |  }

and here's a sys_close():

 0)               |  sys_close() {
 0)      0.540 us |    _spin_lock();
 0)               |    filp_close() {
 0)      0.437 us |      dnotify_flush();
 0)      0.401 us |      locks_remove_posix();
 0)      0.349 us |      fput();
 0)      2.679 us |    }
 0)      4.452 us |  }

i'd be surprised to see a flag to show up in that codepath. Eric, does 
your testing confirm that?

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar Nov. 28, 2008, 6:58 p.m. UTC | #8
* Ingo Molnar <mingo@elte.hu> wrote:

> And i wouldnt expect it to hurt real-filesystem workloads.
> 
> Here's the contemporary trace of a typical ext3- sys_open():

here's a sys_open() that has to touch atime:

 0)               |  sys_open() {
 0)               |    do_sys_open() {
 0)               |      getname() {
 0)      0.377 us |        kmem_cache_alloc();
 0)               |        strncpy_from_user() {
 0)               |          _cond_resched() {
 0)               |            need_resched() {
 0)      0.353 us |              constant_test_bit();
 0)      1. 45 us |            }
 0)      1.739 us |          }
 0)      2.492 us |        }
 0)      3.934 us |      }
 0)               |      alloc_fd() {
 0)      0.374 us |        _spin_lock();
 0)      0.447 us |        expand_files();
 0)      2.124 us |      }
 0)               |      do_filp_open() {
 0)               |        path_lookup_open() {
 0)               |          get_empty_filp() {
 0)      0.689 us |            kmem_cache_alloc();
 0)               |            security_file_alloc() {
 0)      0.327 us |              cap_file_alloc_security();
 0)      1. 71 us |            }
 0)      2.869 us |          }
 0)               |          do_path_lookup() {
 0)      0.460 us |            _read_lock();
 0)               |            path_walk() {
 0)               |              __link_path_walk() {
 0)               |                inode_permission() {
 0)               |                  ext3_permission() {
 0)      0.434 us |                    generic_permission();
 0)      1.191 us |                  }
 0)               |                  security_inode_permission() {
 0)      0.400 us |                    cap_inode_permission();
 0)      1.130 us |                  }
 0)      3.453 us |                }
 0)               |                do_lookup() {
 0)               |                  __d_lookup() {
 0)      0.489 us |                    _spin_lock();
 0)      1.525 us |                  }
 0)      0.449 us |                  __follow_mount();
 0)      3.115 us |                }
 0)               |                path_to_nameidata() {
 0)      0.422 us |                  dput();
 0)      1.204 us |                }
 0)               |                inode_permission() {
 0)               |                  ext3_permission() {
 0)      0.391 us |                    generic_permission();
 0)      1.223 us |                  }
 0)               |                  security_inode_permission() {
 0)      0.406 us |                    cap_inode_permission();
 0)      1.189 us |                  }
 0)      3.565 us |                }
 0)               |                do_lookup() {
 0)               |                  __d_lookup() {
 0)      0.527 us |                    _spin_lock();
 0)      1.633 us |                  }
 0)      0.440 us |                  __follow_mount();
 0)      3.223 us |                }
 0)               |                do_follow_link() {
 0)               |                  _cond_resched() {
 0)               |                    need_resched() {
 0)      0.361 us |                      constant_test_bit();
 0)      1. 64 us |                    }
 0)      1.749 us |                  }
 0)               |                  security_inode_follow_link() {
 0)      0.390 us |                    cap_inode_follow_link();
 0)      1.260 us |                  }
 0)               |                  touch_atime() {
 0)               |                    mnt_want_write() {
 0)      0.360 us |                      _spin_lock();
 0)      1.137 us |                    }
 0)               |                    mnt_drop_write() {
 0)      0.348 us |                      _spin_lock();
 0)      1.102 us |                    }
 0)      3.402 us |                  }
 0)      0.446 us |                  ext3_follow_link();
 0)               |                  __link_path_walk() {
 0)               |                    inode_permission() {
 0)               |                      ext3_permission() {
 0)               |                        generic_permission() {
 0)      4.481 us |                      }
 0)               |                      security_inode_permission() {
 0)      0.402 us |                        cap_inode_permission();
 0)      1.127 us |                      }
 0)      6.747 us |                    }
 0)               |                    do_lookup() {
 0)               |                      __d_lookup() {
 0)      0.547 us |                        _spin_lock();
 0)      1.758 us |                      }
 0)      0.465 us |                      __follow_mount();
 0)      3.368 us |                    }
 0)               |                    path_to_nameidata() {
 0)      0.419 us |                      dput();
 0)      1.203 us |                    }
 0) +   13. 40 us |                  }
 0)               |                  path_put() {
 0)      0.429 us |                    dput();
 0)               |                    mntput() {
 0)      0.367 us |                      mntput_no_expire();
 0)      1.130 us |                    }
 0)      2.660 us |                  }
 0)               |                  path_put() {
 0)               |                    dput() {
 0)               |                      _cond_resched() {
 0)               |                        need_resched() {
 0)      0.382 us |                          constant_test_bit();
 0)      1. 67 us |                        }
 0)      1.808 us |                      }
 0)      0.399 us |                      _spin_lock();
 0)      0.452 us |                      _spin_lock();
 0)      4.270 us |                    }
 0)               |                    mntput() {
 0)      0.375 us |                      mntput_no_expire();
 0)      1. 62 us |                    }
 0)      6.547 us |                  }
 0) +   32.702 us |                }
 0) +   50.413 us |              }
 0)               |              path_put() {
 0)      0.421 us |                dput();
 0)               |                mntput() {
 0)      0.364 us |                  mntput_no_expire();
 0)      1. 64 us |                }
 0)      2.545 us |              }
 0) +   54.147 us |            }
 0) +   55.780 us |          }
 0) +   59.714 us |        }
 0)               |        may_open() {
 0)               |          inode_permission() {
 0)               |            ext3_permission() {
 0)      0.406 us |              generic_permission();
 0)      1.189 us |            }
 0)               |            security_inode_permission() {
 0)      0.388 us |              cap_inode_permission();
 0)      1.175 us |            }
 0)      3.498 us |          }
 0)      4.328 us |        }
 0)               |        nameidata_to_filp() {
 0)               |          __dentry_open() {
 0)               |            file_move() {
 0)      0.361 us |              _spin_lock();
 0)      1.102 us |            }
 0)               |            security_dentry_open() {
 0)      0.356 us |              cap_dentry_open();
 0)      1.121 us |            }
 0)      0.400 us |            generic_file_open();
 0)      0.544 us |            file_ra_state_init();
 0)      5. 11 us |          }
 0)      5.709 us |        }
 0) +   71.181 us |      }
 0)      0.453 us |      inotify_dentry_parent_queue_event();
 0)      0.403 us |      inotify_inode_queue_event();
 0)               |      fd_install() {
 0)      0.411 us |        _spin_lock();
 0)      1.217 us |      }
 0)               |      putname() {
 0)               |        kmem_cache_free() {
 0)               |          virt_to_head_page() {
 0)      0.371 us |            constant_test_bit();
 0)      1. 47 us |          }
 0)      1.752 us |        }
 0)      2.446 us |      }
 0) +   84.676 us |    }
 0) +   85.365 us |  }

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 28, 2008, 10:20 p.m. UTC | #9
Ingo Molnar a écrit :
> * Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
>> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
>>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
>>> refcounting on permanent system vfs.
>>> Use this function for sockets, pipes, anonymous fds.
>> IMO that's pushing it past the point of usefulness; unless you can show
>> that this really gives considerable win on pipes et.al. *AND* that it
>> doesn't hurt other loads...
> 
> The numbers look pretty convincing:
> 
>>>  (socket8 bench result : from 2.94s to 2.23s)
> 
> And i wouldnt expect it to hurt real-filesystem workloads.
> 
> Here's the contemporary trace of a typical ext3- sys_open():
> 
>  0)               |  sys_open() {
>  0)               |    do_sys_open() {
>  0)               |      getname() {
>  0)      0.367 us |        kmem_cache_alloc();
>  0)               |        strncpy_from_user(); {
>  0)               |          _cond_resched() {
>  0)               |            need_resched() {
>  0)      0.363 us |              constant_test_bit();
>  0)      1. 47 us |            }
>  0)      1.815 us |          }
>  0)      2.587 us |        }
>  0)      4. 22 us |      }
>  0)               |      alloc_fd() {
>  0)      0.480 us |        _spin_lock();
>  0)      0.487 us |        expand_files();
>  0)      2.356 us |      }
>  0)               |      do_filp_open() {
>  0)               |        path_lookup_open() {
>  0)               |          get_empty_filp() {
>  0)      0.439 us |            kmem_cache_alloc();
>  0)               |            security_file_alloc() {
>  0)      0.316 us |              cap_file_alloc_security();
>  0)      1. 87 us |            }
>  0)      3.189 us |          }
>  0)               |          do_path_lookup() {
>  0)      0.366 us |            _read_lock();
>  0)               |            path_walk() {
>  0)               |              __link_path_walk() {
>  0)               |                inode_permission() {
>  0)               |                  ext3_permission() {
>  0)      0.441 us |                    generic_permission();
>  0)      1.247 us |                  }
>  0)               |                  security_inode_permission() {
>  0)      0.411 us |                    cap_inode_permission();
>  0)      1.186 us |                  }
>  0)      3.555 us |                }
>  0)               |                do_lookup() {
>  0)               |                  __d_lookup() {
>  0)      0.486 us |                    _spin_lock();
>  0)      1.369 us |                  }
>  0)      0.442 us |                  __follow_mount();
>  0)      3. 14 us |                }
>  0)               |                path_to_nameidata() {
>  0)      0.476 us |                  dput();
>  0)      1.235 us |                }
>  0)               |                inode_permission() {
>  0)               |                  ext3_permission() {
>  0)               |                    generic_permission() {
>  0)               |                      in_group_p() {
>  0)      0.410 us |                        groups_search();
>  0)      1.172 us |                      }
>  0)      1.994 us |                    }
>  0)      2.789 us |                  }
>  0)               |                  security_inode_permission() {
>  0)      0.454 us |                    cap_inode_permission();
>  0)      1.238 us |                  }
>  0)      5.262 us |                }
>  0)               |                do_lookup() {
>  0)               |                  __d_lookup() {
>  0)      0.480 us |                    _spin_lock();
>  0)      1.621 us |                  }
>  0)      0.456 us |                  __follow_mount();
>  0)      3.215 us |                }
>  0)               |                path_to_nameidata() {
>  0)      0.420 us |                  dput();
>  0)      1.193 us |                }
>  0) +   23.551 us |              }
>  0)               |              path_put() {
>  0)      0.420 us |                dput();
>  0)               |                mntput() {
>  0)      0.359 us |                  mntput_no_expire();
>  0)      1. 50 us |                }
>  0)      2.544 us |              }
>  0) +   27.253 us |            }
>  0) +   28.850 us |          }
>  0) +   33.217 us |        }
>  0)               |        may_open() {
>  0)               |          inode_permission() {
>  0)               |            ext3_permission() {
>  0)      0.480 us |              generic_permission();
>  0)      1.229 us |            }
>  0)               |            security_inode_permission() {
>  0)      0.405 us |              cap_inode_permission();
>  0)      1.196 us |            }
>  0)      3.589 us |          }
>  0)      4.600 us |        }
>  0)               |        nameidata_to_filp() {
>  0)               |          __dentry_open() {
>  0)               |            file_move() {
>  0)      0.470 us |              _spin_lock();
>  0)      1.243 us |            }
>  0)               |            security_dentry_open() {
>  0)      0.344 us |              cap_dentry_open();
>  0)      1.139 us |            }
>  0)      0.412 us |            generic_file_open();
>  0)      0.561 us |            file_ra_state_init();
>  0)      5.714 us |          }
>  0)      6.483 us |        }
>  0) +   46.494 us |      }
>  0)      0.453 us |      inotify_dentry_parent_queue_event();
>  0)      0.403 us |      inotify_inode_queue_event();
>  0)               |      fd_install() {
>  0)      0.440 us |        _spin_lock();
>  0)      1.247 us |      }
>  0)               |      putname() {
>  0)               |        kmem_cache_free() {
>  0)               |          virt_to_head_page() {
>  0)      0.369 us |            constant_test_bit();
>  0)      1. 23 us |          }
>  0)      1.738 us |        }
>  0)      2.422 us |      }
>  0) +   60.560 us |    }
>  0) +   61.368 us |  }
> 
> and here's a sys_close():
> 
>  0)               |  sys_close() {
>  0)      0.540 us |    _spin_lock();
>  0)               |    filp_close() {
>  0)      0.437 us |      dnotify_flush();
>  0)      0.401 us |      locks_remove_posix();
>  0)      0.349 us |      fput();
>  0)      2.679 us |    }
>  0)      4.452 us |  }
> 
> i'd be surprised to see a flag to show up in that codepath. Eric, does 
> your testing confirm that?

On a socket/pipe, definitly no, because inode->i_sb->s_flags is not contended.

But on a shared inode, it might hurt :

offsetof(struct inode, i_count)=0x24
offsetof(struct inode, i_lock)=0x70
offsetof(struct inode, i_sb)=0x9c
offsetof(struct inode, i_writecount)=0x144

So i_sb sits in a probably contended cache line 

I wonder why i_writecount sits so far from i_count, that doesnt make sense.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 28, 2008, 10:37 p.m. UTC | #10
Al Viro a écrit :
> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
>> refcounting on permanent system vfs.
>> Use this function for sockets, pipes, anonymous fds.
> 
> IMO that's pushing it past the point of usefulness; unless you can show
> that this really gives considerable win on pipes et.al. *AND* that it
> doesn't hurt other loads...

Well, if this is the last cache line that might be shared, then yes, numbers can talk.
But coming from 10 to 1 instead of 0 is OK I guess

> 
> dput() part: again, I want to see what happens on other loads; it's probably
> fine (and win is certainly more than from mntput() change), but...  The
> thing is, atomic_dec_and_lock() in there is often done on dentries with
> d_count > 1 and that's fairly cheap (and doesn't involve contention on
> dcache_lock on sane targets).
> 
> FWIW, unless there's a really good reason to do alpha atomic_dec_and_lock()
> in a special way, I'd try to compare with

>         if (atomic_add_unless(&dentry->d_count, -1, 1))
>                 return;

I dont know, but *reading* d_count before trying to write it is expensive
on modern cpus. Oprofile clearly show that on Intel Core2.

Then, *testing* the flag before doing the atomic_something() has the same
problem. Or we should put flag in a different cache line.

I am lazy (time for a sleep here), maybe we are smart here and use a trick like that already ?

atomic_t atomic_read_with_write_intent(atomic_t *v)
{
        int val = 0;
	/*
	 * No LOCK prefix here, we only give a write intent hint to cpu
	 */
        asm volatile("xaddl %0, %1"
                     : "+r" (val), "+m" (v->counter)
                     : : "memory");
        return val;
}



> 	if (your flag)
> 		sod off to special
> 	spin_lock(&dcache_lock);
> 	if (atomic_dec_and_test(&dentry->d_count)) {
> 		spin_unlock(&dcache_lock);
> 		return;
> 	}
> 	the rest as usual
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 28, 2008, 10:43 p.m. UTC | #11
Eric Dumazet a écrit :
> Al Viro a écrit :
>> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
>>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
>>> refcounting on permanent system vfs.
>>> Use this function for sockets, pipes, anonymous fds.
>>
>> IMO that's pushing it past the point of usefulness; unless you can show
>> that this really gives considerable win on pipes et.al. *AND* that it
>> doesn't hurt other loads...
> 
> Well, if this is the last cache line that might be shared, then yes, 
> numbers can talk.
> But coming from 10 to 1 instead of 0 is OK I guess
> 
>>
>> dput() part: again, I want to see what happens on other loads; it's 
>> probably
>> fine (and win is certainly more than from mntput() change), but...  The
>> thing is, atomic_dec_and_lock() in there is often done on dentries with
>> d_count > 1 and that's fairly cheap (and doesn't involve contention on
>> dcache_lock on sane targets).
>>
>> FWIW, unless there's a really good reason to do alpha 
>> atomic_dec_and_lock()
>> in a special way, I'd try to compare with
> 
>>         if (atomic_add_unless(&dentry->d_count, -1, 1))
>>                 return;
> 
> I dont know, but *reading* d_count before trying to write it is expensive
> on modern cpus. Oprofile clearly show that on Intel Core2.
> 
> Then, *testing* the flag before doing the atomic_something() has the same
> problem. Or we should put flag in a different cache line.
> 
> I am lazy (time for a sleep here), maybe we are smart here and use a 
> trick like that already ?
> 
> atomic_t atomic_read_with_write_intent(atomic_t *v)
> {
>        int val = 0;
>     /*
>      * No LOCK prefix here, we only give a write intent hint to cpu
>      */
>        asm volatile("xaddl %0, %1"
>                     : "+r" (val), "+m" (v->counter)
>                     : : "memory");
>        return val;
> }

Forget it, its wrong... I really need to sleep :)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index a0212b3..42dfe28 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -153,7 +153,7 @@  static int __init anon_inode_init(void)
 	error = register_filesystem(&anon_inode_fs_type);
 	if (error)
 		goto err_exit;
-	anon_inode_mnt = kern_mount(&anon_inode_fs_type);
+	anon_inode_mnt = kern_mount_special(&anon_inode_fs_type);
 	if (IS_ERR(anon_inode_mnt)) {
 		error = PTR_ERR(anon_inode_mnt);
 		goto err_unregister_filesystem;
diff --git a/fs/pipe.c b/fs/pipe.c
index 6fca681..391d4fe 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1074,7 +1074,7 @@  static int __init init_pipe_fs(void)
 	int err = register_filesystem(&pipe_fs_type);
 
 	if (!err) {
-		pipe_mnt = kern_mount(&pipe_fs_type);
+		pipe_mnt = kern_mount_special(&pipe_fs_type);
 		if (IS_ERR(pipe_mnt)) {
 			err = PTR_ERR(pipe_mnt);
 			unregister_filesystem(&pipe_fs_type);
diff --git a/fs/super.c b/fs/super.c
index 400a760..a8e14f7 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -982,3 +982,12 @@  struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
 }
 
 EXPORT_SYMBOL_GPL(kern_mount_data);
+
+struct vfsmount *kern_mount_special(struct file_system_type *type)
+{
+	struct vfsmount *res = kern_mount_data(type, NULL);
+
+	if (!IS_ERR(res))
+		res->mnt_flags |= MNT_SPECIAL;
+	return res;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd0e8a5..a92544a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1591,6 +1591,7 @@  extern int register_filesystem(struct file_system_type *);
 extern int unregister_filesystem(struct file_system_type *);
 extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
 #define kern_mount(type) kern_mount_data(type, NULL)
+extern struct vfsmount *kern_mount_special(struct file_system_type *);
 extern int may_umount_tree(struct vfsmount *);
 extern int may_umount(struct vfsmount *);
 extern long do_mount(char *, char *, char *, unsigned long, void *);
diff --git a/include/linux/mount.h b/include/linux/mount.h
index cab2a85..cb4fa90 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -30,6 +30,7 @@  struct mnt_namespace;
 
 #define MNT_SHRINKABLE	0x100
 #define MNT_IMBALANCED_WRITE_COUNT	0x200 /* just for debugging */
+#define MNT_SPECIAL	0x400	/* special mount (pipes,sockets,...) */
 
 #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
@@ -73,7 +74,7 @@  struct vfsmount {
 
 static inline struct vfsmount *mntget(struct vfsmount *mnt)
 {
-	if (mnt)
+	if (mnt && !(mnt->mnt_flags & MNT_SPECIAL))
 		atomic_inc(&mnt->mnt_count);
 	return mnt;
 }
@@ -87,7 +88,7 @@  extern int __mnt_is_readonly(struct vfsmount *mnt);
 
 static inline void mntput(struct vfsmount *mnt)
 {
-	if (mnt) {
+	if (mnt && !(mnt->mnt_flags & MNT_SPECIAL)) {
 		mnt->mnt_expiry_mark = 0;
 		mntput_no_expire(mnt);
 	}
diff --git a/net/socket.c b/net/socket.c
index 4177456..2857d70 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2204,7 +2204,7 @@  static int __init sock_init(void)
 
 	init_inodecache();
 	register_filesystem(&sock_fs_type);
-	sock_mnt = kern_mount(&sock_fs_type);
+	sock_mnt = kern_mount_special(&sock_fs_type);
 	sock_mnt->mnt_sb->s_flags |= MS_SPECIAL;
 
 	/* The real protocol initialization is performed in later initcalls.