[RFC] non-preemptible kernel socket for RAMster

Message ID 4232c4b6-15be-42d8-be42-6e27f9188ce2@default
State RFC, archived
Delegated to: David Miller

Commit Message

Dan Magenheimer July 5, 2011, 3:54 p.m. UTC
In working on a kernel project called RAMster* (where RAM on a
remote system may be used for clean page cache pages and for swap
pages), I found I have need for a kernel socket to be used when
in non-preemptible state.  I admit to being a networking idiot,
but I have been successfully using the following small patch.
I'm not sure whether I am lucky so far... perhaps more
sockets or larger/different loads will require a lot more
changes (or maybe even make my objective impossible).
So I thought I'd post it for comment.  I'd appreciate
any thoughts or suggestions.

Thanks,
Dan

* http://events.linuxfoundation.org/events/linuxcon/magenheimer 


Comments

Eric Dumazet July 5, 2011, 4:30 p.m. UTC | #1
On Tuesday, 05 July 2011 at 08:54 -0700, Dan Magenheimer wrote:
> In working on a kernel project called RAMster* (where RAM on a
> remote system may be used for clean page cache pages and for swap
> pages), I found I have need for a kernel socket to be used when
> in non-preemptible state.  I admit to being a networking idiot,
> but I have been successfully using the following small patch.
> I'm not sure whether I am lucky so far... perhaps more
> sockets or larger/different loads will require a lot more
> changes (or maybe even make my objective impossible).
> So I thought I'd post it for comment.  I'd appreciate
> any thoughts or suggestions.
> 
> Thanks,
> Dan
> 
> * http://events.linuxfoundation.org/events/linuxcon/magenheimer 
> 
> diff -Napur linux-2.6.37/net/core/sock.c linux-2.6.37-ramster/net/core/sock.c
> --- linux-2.6.37/net/core/sock.c	2011-07-03 19:14:52.267853088 -0600
> +++ linux-2.6.37-ramster/net/core/sock.c	2011-07-03 19:10:04.340980799 -0600
> @@ -1587,6 +1587,14 @@ static void __lock_sock(struct sock *sk)
>  	__acquires(&sk->sk_lock.slock)
>  {
>  	DEFINE_WAIT(wait);
> +	if (!preemptible()) {
> +		while (sock_owned_by_user(sk)) {
> +			spin_unlock_bh(&sk->sk_lock.slock);
> +			cpu_relax();
> +			spin_lock_bh(&sk->sk_lock.slock);
> +		}
> +		return;
> +	}

Hmm, was this tested on a UP machine?

>  
>  	for (;;) {
>  		prepare_to_wait_exclusive(&sk->sk_lock.wq, &wait,
> @@ -1623,7 +1631,8 @@ static void __release_sock(struct sock *
>  			 * This is safe to do because we've taken the backlog
>  			 * queue private:
>  			 */
> -			cond_resched_softirq();
> +			if (preemptible())
> +				cond_resched_softirq();
>  			skb = next;
>  		} while (skb != NULL);


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Loke, Chetan July 5, 2011, 4:36 p.m. UTC | #2
> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Dan Magenheimer
> Sent: July 05, 2011 11:54 AM
> To: netdev@vger.kernel.org
> Cc: Konrad Wilk; linux-mm
> Subject: [RFC] non-preemptible kernel socket for RAMster
> 
> In working on a kernel project called RAMster* (where RAM on a
> remote system may be used for clean page cache pages and for swap
> pages), I found I have need for a kernel socket to be used when


How is RAMster+swap different than NBD's (pending etc?) support for swap
over NBD?


Chetan Loke

Dan Magenheimer July 5, 2011, 5:25 p.m. UTC | #3
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Tuesday, July 05, 2011 10:31 AM
> To: Dan Magenheimer
> Cc: netdev@vger.kernel.org; Konrad Wilk; linux-mm
> Subject: Re: [RFC] non-preemptible kernel socket for RAMster
> 
> On Tuesday, 05 July 2011 at 08:54 -0700, Dan Magenheimer wrote:
> > In working on a kernel project called RAMster* (where RAM on a
> > remote system may be used for clean page cache pages and for swap
> > pages), I found I have need for a kernel socket to be used when
> > in non-preemptible state.  I admit to being a networking idiot,
> > but I have been successfully using the following small patch.
> > I'm not sure whether I am lucky so far... perhaps more
> > sockets or larger/different loads will require a lot more
> > changes (or maybe even make my objective impossible).
> > So I thought I'd post it for comment.  I'd appreciate
> > any thoughts or suggestions.
> >
> > Thanks,
> > Dan
> >
> > * http://events.linuxfoundation.org/events/linuxcon/magenheimer
> >
> > diff -Napur linux-2.6.37/net/core/sock.c linux-2.6.37-ramster/net/core/sock.c
> > --- linux-2.6.37/net/core/sock.c	2011-07-03 19:14:52.267853088 -0600
> > +++ linux-2.6.37-ramster/net/core/sock.c	2011-07-03 19:10:04.340980799 -0600
> > @@ -1587,6 +1587,14 @@ static void __lock_sock(struct sock *sk)
> >  	__acquires(&sk->sk_lock.slock)
> >  {
> >  	DEFINE_WAIT(wait);
> > +	if (!preemptible()) {
> > +		while (sock_owned_by_user(sk)) {
> > +			spin_unlock_bh(&sk->sk_lock.slock);
> > +			cpu_relax();
> > +			spin_lock_bh(&sk->sk_lock.slock);
> > +		}
> > +		return;
> > +	}
> 
> Hmm, was this tested on UP machine ?

Hi Eric --

Thanks for the reply!

I hadn't tested UP in a while so am testing now, and it seems to
work OK so far.  However, I am just testing my socket, *not* testing
sockets in general.  Are you implying that this patch will
break (kernel) sockets in general on a UP machine?  If so,
could you be more specific as to why?  (Again, I said
I am a networking idiot. ;-)  I played a bit with adding
a new SOCK_ flag and triggering off of that, but this
version of the patch seemed much simpler.

Thanks,
Dan
Dan Magenheimer July 5, 2011, 5:25 p.m. UTC | #4
> From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> Sent: Tuesday, July 05, 2011 10:37 AM
> To: Dan Magenheimer; netdev@vger.kernel.org
> Cc: Konrad Wilk; linux-mm
> Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> 
> > In working on a kernel project called RAMster* (where RAM on a
> > remote system may be used for clean page cache pages and for swap
> > pages), I found I have need for a kernel socket to be used when
> 
> How is RAMster+swap different than NBD's (pending etc?)support for SWAP
> over NBD?

Hi Chetan --

Thanks for your question.

I may be ignorant of details about NBD, but did some quick
research using Google.  If I understand correctly, swap over
NBD is still writing to a configured swap disk on the remote
machine.  RAMster is swapping to *RAM* on the remote machine.
The idea is that most machines are very overprovisioned in
RAM, and are rarely using all of their RAM, especially when
a machine is (mostly) idle.  In other words, the "max of
the sums" of RAM usage on a group of machines is much lower
than the "sum of the max" of RAM usage.

So if the network is sufficiently faster than disk for
moving a page of data, RAMster provides a significant
performance improvement.  OR RAMster may allow a significant
reduction in the total amount of RAM across a data center.

The version of RAMster I am working on now is really
a proof-of-concept that works over sockets, using the
ocfs2 cluster layer.  One can easily envision a future
"exo-fabric" which allows one machine to write to the
RAM of another machine... for this future hardware,
RAMster becomes much more interesting.

Thanks,
Dan
Loke, Chetan July 5, 2011, 5:52 p.m. UTC | #5
> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: July 05, 2011 1:25 PM
> To: Loke, Chetan; netdev@vger.kernel.org
> Cc: Konrad Wilk; linux-mm
> Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> 
> > From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> > Sent: Tuesday, July 05, 2011 10:37 AM
> > To: Dan Magenheimer; netdev@vger.kernel.org
> > Cc: Konrad Wilk; linux-mm
> > Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> >
> > > In working on a kernel project called RAMster* (where RAM on a
> > > remote system may be used for clean page cache pages and for swap
> > > pages), I found I have need for a kernel socket to be used when
> >
> > How is RAMster+swap different than NBD's (pending etc?)support for
> > SWAP over NBD?
> 
> Hi Chetan --
> 
> Thanks for your question.
> 
> I may be ignorant of details about NBD, but did some quick
> research using google.  If I understand correctly, swap over
> NBD is still writing to a configured swap disk on the remote

Hi - I thought NBD-server needs a backing store (a file).
Now the file itself could reside on a RAM-drive or disk-drive etc.
And so a remote NBD (disk or RAM) can be mounted locally as a swap
device.
The local client should still see it as a block device.

I haven't used the RAM-drive feature myself but you may want to check if
it works or even borrow that logic in your code.


> machine.  RAMster is swapping to *RAM* on the remote machine.
> The idea is that most machines are very overprovisioned in
> RAM, and are rarely using all of their RAM, especially when
> a machine is (mostly) idle.  In other words, the "max of
> the sums" of RAM usage on a group of machines is much lower
> than the "sum of the max" of RAM usage.
> 
> So if the network is sufficiently faster than disk for
> moving a page of data, RAMster provides a significant
> performance improvement.  OR RAMster may allow a significant
> reduction in the total amount of RAM across a data center.
> 
> The version of RAMster I am working on now is really
> a proof-of-concept that works over sockets, using the
> ocfs2 cluster layer.  One can easily envision a future
> "exo-fabric" which allows one machine to write to the
> RAM of another machine... for this future hardware,
> RAMster becomes much more interesting.
> 

Or you can also try scst-in-RAM mode(if you want to experiment with
different fabrics).


> Thanks,
> Dan

Thanks
Chetan Loke
Eric Dumazet July 5, 2011, 6:23 p.m. UTC | #6
On Tuesday, 05 July 2011 at 10:25 -0700, Dan Magenheimer wrote:
> > From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> > Sent: Tuesday, July 05, 2011 10:31 AM
> > To: Dan Magenheimer
> > Cc: netdev@vger.kernel.org; Konrad Wilk; linux-mm
> > Subject: Re: [RFC] non-preemptible kernel socket for RAMster
> > 
> > On Tuesday, 05 July 2011 at 08:54 -0700, Dan Magenheimer wrote:
> > > In working on a kernel project called RAMster* (where RAM on a
> > > remote system may be used for clean page cache pages and for swap
> > > pages), I found I have need for a kernel socket to be used when
> > > in non-preemptible state.  I admit to being a networking idiot,
> > > but I have been successfully using the following small patch.
> > > I'm not sure whether I am lucky so far... perhaps more
> > > sockets or larger/different loads will require a lot more
> > > changes (or maybe even make my objective impossible).
> > > So I thought I'd post it for comment.  I'd appreciate
> > > any thoughts or suggestions.
> > >
> > > Thanks,
> > > Dan
> > >
> > > * http://events.linuxfoundation.org/events/linuxcon/magenheimer
> > >
> > > diff -Napur linux-2.6.37/net/core/sock.c linux-2.6.37-ramster/net/core/sock.c
> > > --- linux-2.6.37/net/core/sock.c	2011-07-03 19:14:52.267853088 -0600
> > > +++ linux-2.6.37-ramster/net/core/sock.c	2011-07-03 19:10:04.340980799 -0600
> > > @@ -1587,6 +1587,14 @@ static void __lock_sock(struct sock *sk)
> > >  	__acquires(&sk->sk_lock.slock)
> > >  {
> > >  	DEFINE_WAIT(wait);
> > > +	if (!preemptible()) {
> > > +		while (sock_owned_by_user(sk)) {
> > > +			spin_unlock_bh(&sk->sk_lock.slock);
> > > +			cpu_relax();
> > > +			spin_lock_bh(&sk->sk_lock.slock);
> > > +		}
> > > +		return;
> > > +	}
> > 
> > Hmm, was this tested on UP machine ?
> 
> Hi Eric --
> 
> Thanks for the reply!
> 
> I hadn't tested UP in awhile so am testing now, and it seems to
> work OK so far.  However, I am just testing my socket, *not* testing
> sockets in general.  Are you implying that this patch will
> break (kernel) sockets in general on a UP machine?  If so,
> could you be more specific as to why?  (Again, I said
> I am a networking idiot. ;-)  I played a bit with adding
> a new SOCK_ flag and triggering off of that, but this
> version of the patch seemed much simpler.

Say you have two processes, P1 and P2, and a socket S.

P1 locks socket S, and is then preempted by P2.

P2 is non-preemptible and tries to lock the same socket.

-> deadlock, since P2 spins and P1 never gets to run to release socket S
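
The scenario above can be sketched as a small userspace toy model. This is
not kernel code: simulate_up(), the task names, and the tick counts are all
hypothetical, and the model only assumes one CPU, a socket already owned by
P1, and a P2 that either sleeps (the normal path) or spins the way the
!preemptible() branch in the patch does.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one CPU (UP) and a socket whose "owned" flag is held by P1.
 * All names are hypothetical; nothing here is a kernel API. */
enum task { P1_OWNER, P2_WAITER };

/* Returns the tick at which P2 gets the socket, or -1 if it never does
 * within max_ticks. p1_ticks_left is the CPU time P1 still needs before
 * it releases the socket. */
int simulate_up(bool p2_preemptible, int p1_ticks_left, int max_ticks)
{
    bool owned = true;              /* P1 took the socket, then was preempted */
    enum task current = P2_WAITER;  /* P2 now holds the only CPU */

    assert(p1_ticks_left > 0 && max_ticks > 0);

    for (int tick = 0; tick < max_ticks; tick++) {
        if (current == P2_WAITER) {
            if (!owned)
                return tick;        /* P2 acquires the socket */
            if (p2_preemptible)
                current = P1_OWNER; /* P2 sleeps, the scheduler runs P1 */
            /* else: P2 keeps spinning, so P1 is never scheduled */
        } else {
            if (--p1_ticks_left == 0) {
                owned = false;      /* P1 releases the socket */
                current = P2_WAITER;
            }
        }
    }
    return -1;                      /* P2 never got the socket */
}
```

In this model a preemptible P2 gets the socket once P1 has had its remaining
ticks, while a non-preemptible P2 returns -1 no matter how large max_ticks
is, which is the deadlock described above.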



Dan Magenheimer July 5, 2011, 7:07 p.m. UTC | #7
> > > > +++ linux-2.6.37-ramster/net/core/sock.c	2011-07-03 19:10:04.340980799 -0600
> > > > @@ -1587,6 +1587,14 @@ static void __lock_sock(struct sock *sk)
> > > >  	__acquires(&sk->sk_lock.slock)
> > > >  {
> > > >  	DEFINE_WAIT(wait);
> > > > +	if (!preemptible()) {
> > > > +		while (sock_owned_by_user(sk)) {
> > > > +			spin_unlock_bh(&sk->sk_lock.slock);
> > > > +			cpu_relax();
> > > > +			spin_lock_bh(&sk->sk_lock.slock);
> > > > +		}
> > > > +		return;
> > > > +	}
> > >
> > > Hmm, was this tested on UP machine ?
> >
> > Hi Eric --
> >
> > Thanks for the reply!
> >
> > I hadn't tested UP in awhile so am testing now, and it seems to
> > work OK so far.  However, I am just testing my socket, *not* testing
> > sockets in general.  Are you implying that this patch will
> > break (kernel) sockets in general on a UP machine?  If so,
> > could you be more specific as to why?  (Again, I said
> > I am a networking idiot. ;-)  I played a bit with adding
> > a new SOCK_ flag and triggering off of that, but this
> > version of the patch seemed much simpler.
> 
> Say you have two processes and socket S
> 
> One process locks socket S, and is preempted by another process.
> 
> This second process is non preemptible and try to lock same socket.
> 
> -> deadlock, since P1 never releases socket S

Oh, OK.  My use model is that a socket that is used non-preemptible
must always be used non-preemptible.  In other words, this kind
of socket is an extreme form of non-blocking.  Doesn't that seem
like a reasonable constraint? 

Thanks,
Dan
Dan Magenheimer July 5, 2011, 7:18 p.m. UTC | #8
> From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> > From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> > Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> >
> > > From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> > > Sent: Tuesday, July 05, 2011 10:37 AM
> > > To: Dan Magenheimer; netdev@vger.kernel.org
> > > Cc: Konrad Wilk; linux-mm
> > > Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> > >
> > > > In working on a kernel project called RAMster* (where RAM on a
> > > > remote system may be used for clean page cache pages and for swap
> > > > pages), I found I have need for a kernel socket to be used when
> > >
> > > How is RAMster+swap different than NBD's (pending etc?)support for
> > > SWAP over NBD?
> >
> > I may be ignorant of details about NBD, but did some quick
> > research using google.  If I understand correctly, swap over
> > NBD is still writing to a configured swap disk on the remote
> 
> Hi - I thought NBD-server needs a backing store(a file).
> Now the file itself could reside on a RAM-drive or disk-drive etc.
> And so a remote NBD(disk or RAM) can be mounted locally as a swap
> device.
> The local client should still see it as a block device.
> 
> I haven't used the RAM-drive feature myself but you may want to check if
> it works or even borrow that logic in your code.

Actually, RAMster is using a much more flexible type of
RAM-drive; it is built on top of Transcendent Memory
and on top of zcache (and thus on top of cleancache and
frontswap).  A RAM-drive is fixed size so is not very suitable
for the flexibility required for RAMster.  For example,
suppose you have two machines A and B.  At one point in
time A is overcommitted and needs to swap and B is relatively
idle.  Then later, B is overcommitted and needs to swap and
A is relatively idle.  RAMster can handle this entirely
dynamically, a RAM-drive cannot.

> > machine.  RAMster is swapping to *RAM* on the remote machine.
> > The idea is that most machines are very overprovisioned in
> > RAM, and are rarely using all of their RAM, especially when
> > a machine is (mostly) idle.  In other words, the "max of
> > the sums" of RAM usage on a group of machines is much lower
> > than the "sum of the max" of RAM usage.
> >
> > So if the network is sufficiently faster than disk for
> > moving a page of data, RAMster provides a significant
> > performance improvement.  OR RAMster may allow a significant
> > reduction in the total amount of RAM across a data center.
> >
> > The version of RAMster I am working on now is really
> > a proof-of-concept that works over sockets, using the
> > ocfs2 cluster layer.  One can easily envision a future
> > "exo-fabric" which allows one machine to write to the
> > RAM of another machine... for this future hardware,
> > RAMster becomes much more interesting.
> 
> Or you can also try scst-in-RAM mode(if you want to experiment with
> different fabrics).

Thanks.  Could you provide a pointer for this?  I found
the SCST sourceforge page but no obvious references to
scst-in-ram-mode.  (But also, since it appears to be
SCSI-related, I wonder if it also assumes a fixed size
target device, RAM or disk or ??)

Dan
Loke, Chetan July 5, 2011, 10:27 p.m. UTC | #9
> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: July 05, 2011 3:19 PM
> To: Loke, Chetan; netdev@vger.kernel.org
> Cc: Konrad Wilk; linux-mm
> Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> 

> Actually, RAMster is using a much more flexible type of
> RAM-drive; it is built on top of Transcendent Memory
> and on top of zcache (and thus on top of cleancache and
> frontswap).  A RAM-drive is fixed size so is not very suitable
> for the flexibility required for RAMster.  For example,
> suppose you have two machines A and B.  At one point in
> time A is overcommitted and needs to swap and B is relatively
> idle.  Then later, B is overcommitted and needs to swap and
> A is relatively idle.  RAMster can handle this entirely
> dynamically, a RAM-drive cannot.


Again, iff NBD works with a ram-drive then you really wouldn't need to
do anything. How often are you going to re-size your remote-SWAP?  Plus,
you can make nbd-server listen on multiple ports - Google(Linux NBD)
returned: http://www.fi.muni.cz/~kripac/orac-nbd/ . Look at the
nbd-server code to see if it launches multiple kernel-threads for
servicing different ports. If not, one can enhance it and scale that way
too. But nbd-server today can service multiple-ports(that is effectively
servicing multiple clients). So why not add NBD-filesystem-filters to
make it point to local/remote swap?


> 
> Thanks.  Could you provide a pointer for this?  I found
> the SCST sourceforge page but no obvious references to
> scst-in-ram-mode.  (But also, since it appears to be
> SCSI-related, I wonder if it also assumes a fixed size
> target device, RAM or disk or ??)
> 

Yes, it is SCSI. You should be looking for SCST I/O modes. Read some
docs and then send an email to the scst-mailing-list. If you speak about
block-IO-performance then FC(in its class of price/performance factor)
is more than capable of handling any workload. FC is a protocol designed
for storage. No exotic fabric other than FC is needed.
Folks who start with ethernet for block-IO, always start with bare
minimal code and then for squeezing block-IO performance(aka version 2
of the product), keep hacking repeatedly or go for a link-speed upgrade.
Start with FC, period.


> Dan

Chetan Loke
Dan Magenheimer July 6, 2011, 1:05 a.m. UTC | #10
> From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> 
> > From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> 
> > Actually, RAMster is using a much more flexible type of
> > RAM-drive; it is built on top of Transcendent Memory
> > and on top of zcache (and thus on top of cleancache and
> > frontswap).  A RAM-drive is fixed size so is not very suitable
> > for the flexibility required for RAMster.  For example,
> > suppose you have two machines A and B.  At one point in
> > time A is overcommitted and needs to swap and B is relatively
> > idle.  Then later, B is overcommitted and needs to swap and
> > A is relatively idle.  RAMster can handle this entirely
> > dynamically, a RAM-drive cannot.
> 
> Again, iff NBD works with a ram-drive then you really wouldn't need to
> do anything. How often are you going to re-size your remote-SWAP?  Plus,
> you can make nbd-server listen on multiple ports - Google(Linux NBD)
> returned: http://www.fi.muni.cz/~kripac/orac-nbd/ . Look at the
> nbd-server code to see if it launches multiple kernel-threads for
> servicing different ports. If not, one can enhance it and scale that way
> too. But nbd-server today can service multiple-ports(that is effectively
> servicing multiple clients). So why not add NBD-filesystem-filters to
> make it point to local/remote swap?

Well, we may be talking past each other, but the RAMster answer to:

> How often are you going to re-size your remote-SWAP?

is "as often as the working set changes on any machine in the
cluster", meaning *constantly*, entirely dynamically!  How
about a more specific example:  Suppose you have 2 machines,
each with 8GB of memory.  99% of the time each machine is
chugging along just fine and doesn't really need more than 4GB,
and may even use less than 1GB a large part of the time.
But every now and then, one of the machines randomly needs
9GB, 10GB, maybe even 12GB  of memory.  This would normally
result in swapping.  (Most system administrators won't even
have this much information... they'll just know they are
seeing swapping and decide they need to buy more RAM.)

With NBD to a ram-drive, each machine would need to pre-allocate
4GB of RAM for the RAM-drive, leaving only 4GB of RAM for
the "local" RAM.  The result will actually be MORE swapping
because a fixed amount of RAM has been pre-reserved for the
other machine's swap.   With RAMster, everything is done dynamically,
so all that matters is the maximum of the sum of the RAM
used.  You may even be able to *remove* ~2GB of RAM from each
of the systems and still never see any swapping to disk.
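
The "sum of the max" versus "max of the sums" argument can be checked with a
few lines of arithmetic. The sketch below is only an illustration: the
demand_gb table is made-up data for two machines at four points in time, and
the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MACHINES 2

/* Made-up RAM demand in GB: one row per point in time, one column per
 * machine. Each machine occasionally spikes, but not at the same time. */
static const int demand_gb[4][MACHINES] = {
    { 4, 1 }, { 10, 2 }, { 3, 9 }, { 1, 4 },
};

/* RAM needed if every machine is sized for its own peak. */
int sum_of_max(const int demand[][MACHINES], size_t samples)
{
    int total = 0;

    assert(samples > 0);
    for (size_t m = 0; m < MACHINES; m++) {
        int peak = demand[0][m];
        for (size_t t = 1; t < samples; t++)
            if (demand[t][m] > peak)
                peak = demand[t][m];
        total += peak;
    }
    return total;
}

/* RAM needed if the machines can share: the peak of the combined demand. */
int max_of_sums(const int demand[][MACHINES], size_t samples)
{
    int peak = 0;

    assert(samples > 0);
    for (size_t t = 0; t < samples; t++) {
        int sum = 0;
        for (size_t m = 0; m < MACHINES; m++)
            sum += demand[t][m];
        if (t == 0 || sum > peak)
            peak = sum;
    }
    return peak;
}
```

With this made-up table, sizing each machine for its own peak needs 19GB in
total, while the combined demand never exceeds 12GB, which would fit in two
6GB machines if RAM could be shared dynamically.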

> > Thanks.  Could you provide a pointer for this?  I found
> > the SCST sourceforge page but no obvious references to
> > scst-in-ram-mode.  (But also, since it appears to be
> > SCSI-related, I wonder if it also assumes a fixed size
> > target device, RAM or disk or ??)
> 
> Yes, it is SCSI. You should be looking for SCST I/O modes. Read some
> docs and then send an email to the scst-mailing-list. If you speak about
> block-IO-performance then FC(in its class of price/performance factor)
> is more than capable of handling any workload. FC is a protocol designed
> for storage. No exotic fabric other than FC is needed.
> Folks who start with ethernet for block-IO, always start with bare
> minimal code and then for squeezing block-IO performance(aka version 2
> of the product), keep hacking repeatedly or go for a link-speed upgrade.
> Start with FC, period.

My point was that block I/O devices (AFAIK) always present a fixed
"size" to the kernel, and if this is also true of scst-in-ram-mode,
the same problem as swap-over-NBD occurs... it's not dynamic.
RAMster does not present a block-I/O storage-like interface;
it's using the Transcendent Memory interface, which is designed
for "slow RAM" of an unknown-and-dynamic size.

I'm not a storage expert either, but I do wonder if "no exotic
fabric other than FC" isn't an oxymoron ;-)  FC is certainly
too exotic for me.

Dan
Loke, Chetan July 6, 2011, 6:12 p.m. UTC | #11
> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: July 05, 2011 9:06 PM
> To: Loke, Chetan; netdev@vger.kernel.org
> Cc: Konrad Wilk; linux-mm
> Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> 
> > From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> > Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> >
> > > From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> 
> > How often are you going to re-size your remote-SWAP?
> 
> is "as often as the working set changes on any machine in the
> cluster", meaning *constantly*, entirely dynamically!  How
> about a more specific example:  Suppose you have 2 machines,
> each with 8GB of memory.  99% of the time each machine is
> chugging along just fine and doesn't really need more than 4GB,
> and may even use less than 1GB a large part of the time.
> But every now and then, one of the machines randomly needs
> 9GB, 10GB, maybe even 12GB  of memory.  This would normally
> result in swapping.  (Most system administrators won't even
> have this much information... they'll just know they are
> seeing swapping and decide they need to buy more RAM.)
> 

Ok, I understand there is interest in implementing
'remote-volatile-ballooning-variant' but how do you pick a remote
candidate(hypervisor)? Let's say, memory could be available on remote
system but what if the remote-p{NIC,CPU} is overloaded? Sure, sysadmins
won't have this info because this is so dynamic (and it's quite possible as
you mentioned above). But does the trans-remote-API know about this
resource-availability before opening a remote-channel?

Stressing the remote-p{NIC/CPU} might trick hypervisor-vmotion-plugin to
vmotion VM[s] to another hypervisor. How is trans-remote-API integrating
with remote/global vmotion policies to avoid this false vmotion?


> Dan

Chetan Loke
Dan Magenheimer July 7, 2011, 3:34 p.m. UTC | #12
> From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> 
> > -----Original Message-----
> > From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> >
> > > From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> > >
> > > > From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> >
> > > How often are you going to re-size your remote-SWAP?
> >
> > is "as often as the working set changes on any machine in the
> > cluster", meaning *constantly*, entirely dynamically!  How
> > about a more specific example:  Suppose you have 2 machines,
> > each with 8GB of memory.  99% of the time each machine is
> > chugging along just fine and doesn't really need more than 4GB,
> > and may even use less than 1GB a large part of the time.
> > But every now and then, one of the machines randomly needs
> > 9GB, 10GB, maybe even 12GB  of memory.  This would normally
> > result in swapping.  (Most system administrators won't even
> > have this much information... they'll just know they are
> > seeing swapping and decide they need to buy more RAM.)
> >
> 
> Ok, I understand there is interest in implementing
> 'remote-volatile-ballooning-variant' but how do you pick a remote
> candidate(hypervisor)? Let's say, memory could be available on remote
> system but what if the remote-p{NIC,CPU} is overloaded? Sure, sysadmins
> won't have this info because this is so dynamic (and it's quite possible as
> you mentioned above). But does the trans-remote-API know about this
> resource-availability before opening a remote-channel?
> 
> Stressing the remote-p{NIC/CPU} might trick hypervisor-vmotion-plugin to
> vmotion VM[s] to another hypervisor. How is trans-remote-API integrating
> with remote/global vmotion policies to avoid this false vmotion?

Hi Chetan --

Thanks for the continued discussion.

First, let me clarify that RAMster does not depend on virtualization.
At some time in the future, it may be a nice addition for KVM*,
but the version I am developing currently only works on a
cluster of physical machines.  So vmotion/migration is not
an issue right now.

As for choosing the remote machine, another key feature of
the Transcendent Memory mechanism is that any and every page
can be rejected.  If rejected, the page remains local.  In
essence, on *every* page-to-be-swapped, machine A *asks*
machine B, "can you take this page"?  If the answer is no,
machine A can choose another machine (C), or may choose to
swap the page to its own slow swap disk.  (Currently,
only the latter is implemented, but more complicated
policy could certainly be implemented.)
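
The ask-then-fall-back flow described above might look roughly like the
following sketch. It is a userspace illustration only: struct peer,
peer_try_put(), and place_page() are hypothetical names, and a peer is
modeled as nothing more than a count of pages it is still willing to accept.
Calling it with a single peer corresponds to the currently implemented
behavior (one rejection means local swap); passing several peers shows the
"try machine C" policy that is not implemented yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum page_dest { DEST_REMOTE, DEST_LOCAL_SWAP };

/* Hypothetical model of a remote machine: it accepts a page only while it
 * still has room, and is free to reject any individual page. */
struct peer {
    int free_pages;
};

/* "Can you take this page?" Returns false when the peer rejects it. */
static bool peer_try_put(struct peer *p)
{
    if (p->free_pages <= 0)
        return false;
    p->free_pages--;
    return true;
}

/* Ask each peer in turn. If every peer rejects the page it stays local and
 * goes to the (slow) local swap disk. On success *chosen, if non-NULL, is
 * set to the index of the peer that accepted the page. */
enum page_dest place_page(struct peer *peers, size_t npeers, size_t *chosen)
{
    assert(peers != NULL || npeers == 0);

    for (size_t i = 0; i < npeers; i++) {
        if (peer_try_put(&peers[i])) {
            if (chosen)
                *chosen = i;
            return DEST_REMOTE;
        }
    }
    return DEST_LOCAL_SWAP;
}
```

Because the decision is made per page, no capacity has to be reserved on the
remote side ahead of time. A peer that becomes busy simply starts rejecting
pages.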

Dan

* Xen doesn't have drivers so RAMster-over-network is not an option
for Xen.  (A future RAMster-over-exofabric might work with Xen though.)
And, by the way, the Transcendent Memory implementation on Xen
does handle vmotion/migration so it is a solvable problem.

Patch

diff -Napur linux-2.6.37/net/core/sock.c linux-2.6.37-ramster/net/core/sock.c
--- linux-2.6.37/net/core/sock.c	2011-07-03 19:14:52.267853088 -0600
+++ linux-2.6.37-ramster/net/core/sock.c	2011-07-03 19:10:04.340980799 -0600
@@ -1587,6 +1587,14 @@  static void __lock_sock(struct sock *sk)
 	__acquires(&sk->sk_lock.slock)
 {
 	DEFINE_WAIT(wait);
+	if (!preemptible()) {
+		while (sock_owned_by_user(sk)) {
+			spin_unlock_bh(&sk->sk_lock.slock);
+			cpu_relax();
+			spin_lock_bh(&sk->sk_lock.slock);
+		}
+		return;
+	}
 
 	for (;;) {
 		prepare_to_wait_exclusive(&sk->sk_lock.wq, &wait,
@@ -1623,7 +1631,8 @@  static void __release_sock(struct sock *
 			 * This is safe to do because we've taken the backlog
 			 * queue private:
 			 */
-			cond_resched_softirq();
+			if (preemptible())
+				cond_resched_softirq();
 			skb = next;
 		} while (skb != NULL);