Message ID | 4232c4b6-15be-42d8-be42-6e27f9188ce2@default |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
On Tuesday, 5 July 2011 at 08:54 -0700, Dan Magenheimer wrote:
> In working on a kernel project called RAMster* (where RAM on a
> remote system may be used for clean page cache pages and for swap
> pages), I found I have need for a kernel socket to be used when
> in non-preemptible state. I admit to being a networking idiot,
> but I have been successfully using the following small patch.
> I'm not sure whether I am lucky so far... perhaps more
> sockets or larger/different loads will require a lot more
> changes (or maybe even make my objective impossible).
> So I thought I'd post it for comment. I'd appreciate
> any thoughts or suggestions.
>
> Thanks,
> Dan
>
> * http://events.linuxfoundation.org/events/linuxcon/magenheimer
>
> diff -Napur linux-2.6.37/net/core/sock.c linux-2.6.37-ramster/net/core/sock.c
> --- linux-2.6.37/net/core/sock.c 2011-07-03 19:14:52.267853088 -0600
> +++ linux-2.6.37-ramster/net/core/sock.c 2011-07-03 19:10:04.340980799 -0600
> @@ -1587,6 +1587,14 @@ static void __lock_sock(struct sock *sk)
>      __acquires(&sk->sk_lock.slock)
>  {
>      DEFINE_WAIT(wait);
> +    if (!preemptible()) {
> +        while (sock_owned_by_user(sk)) {
> +            spin_unlock_bh(&sk->sk_lock.slock);
> +            cpu_relax();
> +            spin_lock_bh(&sk->sk_lock.slock);
> +        }
> +        return;
> +    }

Hmm, was this tested on a UP machine?

>
>      for (;;) {
>          prepare_to_wait_exclusive(&sk->sk_lock.wq, &wait,
> @@ -1623,7 +1631,8 @@ static void __release_sock(struct sock *
>           * This is safe to do because we've taken the backlog
>           * queue private:
>           */
> -        cond_resched_softirq();
> +        if (preemptible())
> +            cond_resched_softirq();
>          skb = next;
>      } while (skb != NULL);

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Dan Magenheimer
> Sent: July 05, 2011 11:54 AM
> To: netdev@vger.kernel.org
> Cc: Konrad Wilk; linux-mm
> Subject: [RFC] non-preemptible kernel socket for RAMster
>
> In working on a kernel project called RAMster* (where RAM on a
> remote system may be used for clean page cache pages and for swap
> pages), I found I have need for a kernel socket to be used when

How is RAMster+swap different than NBD's (pending etc?) support for
SWAP over NBD?

Chetan Loke

> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Tuesday, July 05, 2011 10:31 AM
> To: Dan Magenheimer
> Cc: netdev@vger.kernel.org; Konrad Wilk; linux-mm
> Subject: Re: [RFC] non-preemptible kernel socket for RAMster
>
> On Tuesday, 5 July 2011 at 08:54 -0700, Dan Magenheimer wrote:
> > In working on a kernel project called RAMster* (where RAM on a
> > remote system may be used for clean page cache pages and for swap
> > pages), I found I have need for a kernel socket to be used when
> > in non-preemptible state. I admit to being a networking idiot,
> > but I have been successfully using the following small patch.
> > I'm not sure whether I am lucky so far... perhaps more
> > sockets or larger/different loads will require a lot more
> > changes (or maybe even make my objective impossible).
> > So I thought I'd post it for comment. I'd appreciate
> > any thoughts or suggestions.
> >
> > Thanks,
> > Dan
> >
> > * http://events.linuxfoundation.org/events/linuxcon/magenheimer
> >
> > diff -Napur linux-2.6.37/net/core/sock.c linux-2.6.37-ramster/net/core/sock.c
> > --- linux-2.6.37/net/core/sock.c 2011-07-03 19:14:52.267853088 -0600
> > +++ linux-2.6.37-ramster/net/core/sock.c 2011-07-03 19:10:04.340980799 -0600
> > @@ -1587,6 +1587,14 @@ static void __lock_sock(struct sock *sk)
> >      __acquires(&sk->sk_lock.slock)
> >  {
> >      DEFINE_WAIT(wait);
> > +    if (!preemptible()) {
> > +        while (sock_owned_by_user(sk)) {
> > +            spin_unlock_bh(&sk->sk_lock.slock);
> > +            cpu_relax();
> > +            spin_lock_bh(&sk->sk_lock.slock);
> > +        }
> > +        return;
> > +    }
>
> Hmm, was this tested on UP machine ?

Hi Eric --

Thanks for the reply!

I hadn't tested UP in a while so am testing now, and it seems to
work OK so far. However, I am just testing my socket, *not* testing
sockets in general. Are you implying that this patch will
break (kernel) sockets in general on a UP machine? If so,
could you be more specific as to why? (Again, I said
I am a networking idiot. ;-) I played a bit with adding
a new SOCK_ flag and triggering off of that, but this
version of the patch seemed much simpler.

Thanks,
Dan

> From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> Sent: Tuesday, July 05, 2011 10:37 AM
> To: Dan Magenheimer; netdev@vger.kernel.org
> Cc: Konrad Wilk; linux-mm
> Subject: RE: [RFC] non-preemptible kernel socket for RAMster
>
> > In working on a kernel project called RAMster* (where RAM on a
> > remote system may be used for clean page cache pages and for swap
> > pages), I found I have need for a kernel socket to be used when
>
> How is RAMster+swap different than NBD's (pending etc?)support for SWAP
> over NBD?

Hi Chetan --

Thanks for your question.

I may be ignorant of details about NBD, but did some quick
research using google. If I understand correctly, swap over
NBD is still writing to a configured swap disk on the remote
machine. RAMster is swapping to *RAM* on the remote machine.
The idea is that most machines are very overprovisioned in
RAM, and are rarely using all of their RAM, especially when
a machine is (mostly) idle. In other words, the "max of
the sums" of RAM usage on a group of machines is much lower
than the "sum of the max" of RAM usage.

So if the network is sufficiently faster than disk for
moving a page of data, RAMster provides a significant
performance improvement. OR RAMster may allow a significant
reduction in the total amount of RAM across a data center.

The version of RAMster I am working on now is really
a proof-of-concept that works over sockets, using the
ocfs2 cluster layer. One can easily envision a future
"exo-fabric" which allows one machine to write to the
RAM of another machine... for this future hardware,
RAMster becomes much more interesting.

Thanks,
Dan

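[Editor's illustration, not part of the original message.] The "max of the sums" versus "sum of the max" distinction above can be made concrete with a small calculation. The sketch below assumes RAM demand is sampled per machine at a few points in time; the function names and the table layout are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * usage is a row-major table: usage[m * samples + t] is the RAM (in GB)
 * that machine m needs at sample time t. All names here are illustrative.
 */

/* "Max of the sums": the largest combined demand seen at any one time. */
unsigned max_of_sums(const unsigned *usage, size_t machines, size_t samples)
{
    unsigned best = 0;

    assert(usage != NULL);
    for (size_t t = 0; t < samples; t++) {
        unsigned sum = 0;

        for (size_t m = 0; m < machines; m++)
            sum += usage[m * samples + t];
        if (sum > best)
            best = sum;
    }
    return best;
}

/* "Sum of the max": each machine's own peak demand, added together. */
unsigned sum_of_max(const unsigned *usage, size_t machines, size_t samples)
{
    unsigned total = 0;

    assert(usage != NULL);
    for (size_t m = 0; m < machines; m++) {
        unsigned peak = 0;

        for (size_t t = 0; t < samples; t++)
            if (usage[m * samples + t] > peak)
                peak = usage[m * samples + t];
        total += peak;
    }
    return total;
}
```

For two machines whose demands over four samples are {1, 9, 2, 1} and {2, 1, 1, 10} GB, the combined demand never exceeds 11 GB, while provisioning each machine for its own peak takes 19 GB. That gap is the RAM a pooled scheme could in principle avoid buying.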
> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: July 05, 2011 1:25 PM
> To: Loke, Chetan; netdev@vger.kernel.org
> Cc: Konrad Wilk; linux-mm
> Subject: RE: [RFC] non-preemptible kernel socket for RAMster
>
> > From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> > Sent: Tuesday, July 05, 2011 10:37 AM
> > To: Dan Magenheimer; netdev@vger.kernel.org
> > Cc: Konrad Wilk; linux-mm
> > Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> >
> > > In working on a kernel project called RAMster* (where RAM on a
> > > remote system may be used for clean page cache pages and for swap
> > > pages), I found I have need for a kernel socket to be used when
> >
> > How is RAMster+swap different than NBD's (pending etc?)support for SWAP
> > over NBD?
>
> Hi Chetan --
>
> Thanks for your question.
>
> I may be ignorant of details about NBD, but did some quick
> research using google. If I understand correctly, swap over
> NBD is still writing to a configured swap disk on the remote

Hi - I thought NBD-server needs a backing store (a file).
Now the file itself could reside on a RAM-drive or disk-drive etc.
And so a remote NBD (disk or RAM) can be mounted locally as a swap device.
The local client should still see it as a block device.

I haven't used the RAM-drive feature myself but you may want to check if
it works or even borrow that logic in your code.

> machine. RAMster is swapping to *RAM* on the remote machine.
> The idea is that most machines are very overprovisioned in
> RAM, and are rarely using all of their RAM, especially when
> a machine is (mostly) idle. In other words, the "max of
> the sums" of RAM usage on a group of machines is much lower
> than the "sum of the max" of RAM usage.
>
> So if the network is sufficiently faster than disk for
> moving a page of data, RAMster provides a significant
> performance improvement. OR RAMster may allow a significant
> reduction in the total amount of RAM across a data center.
>
> The version of RAMster I am working on now is really
> a proof-of-concept that works over sockets, using the
> ocfs2 cluster layer. One can easily envision a future
> "exo-fabric" which allows one machine to write to the
> RAM of another machine... for this future hardware,
> RAMster becomes much more interesting.
>

Or you can also try scst-in-RAM mode (if you want to experiment with
different fabrics).

> Thanks,
> Dan

Thanks
Chetan Loke

On Tuesday, 5 July 2011 at 10:25 -0700, Dan Magenheimer wrote:
> > From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> > Sent: Tuesday, July 05, 2011 10:31 AM
> > To: Dan Magenheimer
> > Cc: netdev@vger.kernel.org; Konrad Wilk; linux-mm
> > Subject: Re: [RFC] non-preemptible kernel socket for RAMster
> >
> > On Tuesday, 5 July 2011 at 08:54 -0700, Dan Magenheimer wrote:
> > > In working on a kernel project called RAMster* (where RAM on a
> > > remote system may be used for clean page cache pages and for swap
> > > pages), I found I have need for a kernel socket to be used when
> > > in non-preemptible state. I admit to being a networking idiot,
> > > but I have been successfully using the following small patch.
> > > I'm not sure whether I am lucky so far... perhaps more
> > > sockets or larger/different loads will require a lot more
> > > changes (or maybe even make my objective impossible).
> > > So I thought I'd post it for comment. I'd appreciate
> > > any thoughts or suggestions.
> > >
> > > Thanks,
> > > Dan
> > >
> > > * http://events.linuxfoundation.org/events/linuxcon/magenheimer
> > >
> > > diff -Napur linux-2.6.37/net/core/sock.c linux-2.6.37-ramster/net/core/sock.c
> > > --- linux-2.6.37/net/core/sock.c 2011-07-03 19:14:52.267853088 -0600
> > > +++ linux-2.6.37-ramster/net/core/sock.c 2011-07-03 19:10:04.340980799 -0600
> > > @@ -1587,6 +1587,14 @@ static void __lock_sock(struct sock *sk)
> > >      __acquires(&sk->sk_lock.slock)
> > >  {
> > >      DEFINE_WAIT(wait);
> > > +    if (!preemptible()) {
> > > +        while (sock_owned_by_user(sk)) {
> > > +            spin_unlock_bh(&sk->sk_lock.slock);
> > > +            cpu_relax();
> > > +            spin_lock_bh(&sk->sk_lock.slock);
> > > +        }
> > > +        return;
> > > +    }
> >
> > Hmm, was this tested on UP machine ?
>
> Hi Eric --
>
> Thanks for the reply!
>
> I hadn't tested UP in awhile so am testing now, and it seems to
> work OK so far. However, I am just testing my socket, *not* testing
> sockets in general. Are you implying that this patch will
> break (kernel) sockets in general on a UP machine? If so,
> could you be more specific as to why? (Again, I said
> I am a networking idiot. ;-) I played a bit with adding
> a new SOCK_ flag and triggering off of that, but this
> version of the patch seemed much simpler.

Say you have two processes and socket S.

One process (P1) locks socket S, and is preempted by another process.

This second process is non-preemptible and tries to lock the same socket.

-> deadlock, since P1 never releases socket S

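[Editor's illustration, not part of the original message.] The deadlock described above can be shown with a toy user-space model. This is not kernel code and it calls no kernel API: the `sock_model` struct, the single "scheduling tick", and the spin bound are all assumptions made for the sketch. On one CPU, the lock owner can only run (and release the socket) if the spinning task can be preempted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the "owned by user" state of a socket lock. */
struct sock_model {
    bool owned;
};

/*
 * One scheduling opportunity on a uniprocessor. The preempted owner (P1)
 * gets CPU time, and so releases the socket, only if the task that is
 * currently spinning can itself be preempted.
 */
static void up_schedule_tick(struct sock_model *sk, bool spinner_preemptible)
{
    if (spinner_preemptible)
        sk->owned = false;   /* P1 runs and releases the socket */
}

/*
 * Busy-wait for the socket as the second process would. Returns true if
 * the lock was taken, false if max_spins passed without progress (the
 * bound exists only so this model terminates; a real spin would not).
 */
bool spin_until_free(struct sock_model *sk, bool spinner_preemptible,
                     unsigned max_spins)
{
    assert(sk != NULL);
    for (unsigned i = 0; i < max_spins; i++) {
        if (!sk->owned) {
            sk->owned = true;   /* the spinner now owns the socket */
            return true;
        }
        up_schedule_tick(sk, spinner_preemptible);
    }
    return false;
}
```

With a non-preemptible spinner the loop never observes the socket being released, whatever the spin bound. With a preemptible spinner the owner runs at the first tick and the lock is acquired on the next iteration.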
> > > > +++ linux-2.6.37-ramster/net/core/sock.c 2011-07-03 19:10:04.340980799 -0600
> > > > @@ -1587,6 +1587,14 @@ static void __lock_sock(struct sock *sk)
> > > >      __acquires(&sk->sk_lock.slock)
> > > >  {
> > > >      DEFINE_WAIT(wait);
> > > > +    if (!preemptible()) {
> > > > +        while (sock_owned_by_user(sk)) {
> > > > +            spin_unlock_bh(&sk->sk_lock.slock);
> > > > +            cpu_relax();
> > > > +            spin_lock_bh(&sk->sk_lock.slock);
> > > > +        }
> > > > +        return;
> > > > +    }
> > >
> > > Hmm, was this tested on UP machine ?
> >
> > Hi Eric --
> >
> > Thanks for the reply!
> >
> > I hadn't tested UP in awhile so am testing now, and it seems to
> > work OK so far. However, I am just testing my socket, *not* testing
> > sockets in general. Are you implying that this patch will
> > break (kernel) sockets in general on a UP machine? If so,
> > could you be more specific as to why? (Again, I said
> > I am a networking idiot. ;-) I played a bit with adding
> > a new SOCK_ flag and triggering off of that, but this
> > version of the patch seemed much simpler.
>
> Say you have two processes and socket S
>
> One process locks socket S, and is preempted by another process.
>
> This second process is non preemptible and try to lock same socket.
>
> -> deadlock, since P1 never releases socket S

Oh, OK. My use model is that a socket that is used non-preemptible
must always be used non-preemptible. In other words, this kind
of socket is an extreme form of non-blocking. Doesn't that
seem like a reasonable constraint?

Thanks,
Dan

> From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> > From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> > Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> >
> > > From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> > > Sent: Tuesday, July 05, 2011 10:37 AM
> > > To: Dan Magenheimer; netdev@vger.kernel.org
> > > Cc: Konrad Wilk; linux-mm
> > > Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> > >
> > > > In working on a kernel project called RAMster* (where RAM on a
> > > > remote system may be used for clean page cache pages and for swap
> > > > pages), I found I have need for a kernel socket to be used when
> > >
> > > How is RAMster+swap different than NBD's (pending etc?)support for
> > > SWAP over NBD?
> >
> > I may be ignorant of details about NBD, but did some quick
> > research using google. If I understand correctly, swap over
> > NBD is still writing to a configured swap disk on the remote
>
> Hi - I thought NBD-server needs a backing store(a file).
> Now the file itself could reside on a RAM-drive or disk-drive etc.
> And so a remote NBD(disk or RAM) can be mounted locally as a swap
> device.
> The local client should still see it as a block device.
>
> I haven't used the RAM-drive feature myself but you may want to check if it
> works or even borrow that logic in your code.

Actually, RAMster is using a much more flexible type of
RAM-drive; it is built on top of Transcendent Memory
and on top of zcache (and thus on top of cleancache and
frontswap). A RAM-drive is fixed size so is not very suitable
for the flexibility required for RAMster. For example,
suppose you have two machines A and B. At one point in
time A is overcommitted and needs to swap and B is relatively
idle. Then later, B is overcommitted and needs to swap and
A is relatively idle. RAMster can handle this entirely
dynamically, a RAM-drive cannot.

> > machine. RAMster is swapping to *RAM* on the remote machine.
> > The idea is that most machines are very overprovisioned in
> > RAM, and are rarely using all of their RAM, especially when
> > a machine is (mostly) idle. In other words, the "max of
> > the sums" of RAM usage on a group of machines is much lower
> > than the "sum of the max" of RAM usage.
> >
> > So if the network is sufficiently faster than disk for
> > moving a page of data, RAMster provides a significant
> > performance improvement. OR RAMster may allow a significant
> > reduction in the total amount of RAM across a data center.
> >
> > The version of RAMster I am working on now is really
> > a proof-of-concept that works over sockets, using the
> > ocfs2 cluster layer. One can easily envision a future
> > "exo-fabric" which allows one machine to write to the
> > RAM of another machine... for this future hardware,
> > RAMster becomes much more interesting.
>
> Or you can also try scst-in-RAM mode(if you want to experiment with
> different fabrics).

Thanks. Could you provide a pointer for this? I found
the SCST sourceforge page but no obvious references to
scst-in-ram-mode. (But also, since it appears to be
SCSI-related, I wonder if it also assumes a fixed size
target device, RAM or disk or ??)

Dan

> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: July 05, 2011 3:19 PM
> To: Loke, Chetan; netdev@vger.kernel.org
> Cc: Konrad Wilk; linux-mm
> Subject: RE: [RFC] non-preemptible kernel socket for RAMster
>
> Actually, RAMster is using a much more flexible type of
> RAM-drive; it is built on top of Transcendent Memory
> and on top of zcache (and thus on top of cleancache and
> frontswap). A RAM-drive is fixed size so is not very suitable
> for the flexibility required for RAMster. For example,
> suppose you have two machines A and B. At one point in
> time A is overcommitted and needs to swap and B is relatively
> idle. Then later, B is overcommitted and needs to swap and
> A is relatively idle. RAMster can handle this entirely
> dynamically, a RAM-drive cannot.

Again, iff NBD works with a ram-drive then you really wouldn't need to
do anything. How often are you going to re-size your remote-SWAP? Plus,
you can make nbd-server listen on multiple ports - Google(Linux NBD)
returned: http://www.fi.muni.cz/~kripac/orac-nbd/ . Look at the
nbd-server code to see if it launches multiple kernel-threads for
servicing different ports. If not, one can enhance it and scale that way
too. But nbd-server today can service multiple ports (that is, effectively
servicing multiple clients). So why not add NBD-filesystem-filters to
make it point to local/remote swap?

>
> Thanks. Could you provide a pointer for this? I found
> the SCST sourceforge page but no obvious references to
> scst-in-ram-mode. (But also, since it appears to be
> SCSI-related, I wonder if it also assumes a fixed size
> target device, RAM or disk or ??)
>

Yes, it is SCSI. You should be looking for SCST I/O modes. Read some
docs and then send an email to the scst-mailing-list. If you speak about
block-IO-performance then FC (in its class of price/performance factor)
is more than capable of handling any workload. FC is a protocol designed
for storage. No exotic fabric other than FC is needed.
Folks who start with ethernet for block-IO always start with bare
minimal code and then, for squeezing block-IO performance (aka version 2
of the product), keep hacking repeatedly or go for a link-speed upgrade.
Start with FC, period.

> Dan

Chetan Loke

> From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> Subject: RE: [RFC] non-preemptible kernel socket for RAMster
>
> > From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
>
> > Actually, RAMster is using a much more flexible type of
> > RAM-drive; it is built on top of Transcendent Memory
> > and on top of zcache (and thus on top of cleancache and
> > frontswap). A RAM-drive is fixed size so is not very suitable
> > for the flexibility required for RAMster. For example,
> > suppose you have two machines A and B. At one point in
> > time A is overcommitted and needs to swap and B is relatively
> > idle. Then later, B is overcommitted and needs to swap and
> > A is relatively idle. RAMster can handle this entirely
> > dynamically, a RAM-drive cannot.
>
> Again, iff NBD works with a ram-drive then you really wouldn't need to
> do anything. How often are you going to re-size your remote-SWAP? Plus,
> you can make nbd-server listen on multiple ports - Google(Linux NBD)
> returned: http://www.fi.muni.cz/~kripac/orac-nbd/ . Look at the
> nbd-server code to see if it launches multiple kernel-threads for
> servicing different ports. If not, one can enhance it and scale that way
> too. But nbd-server today can service multiple-ports(that is effectively
> servicing multiple clients). So why not add NBD-filesystem-filters to
> make it point to local/remote swap?

Well, we may be talking past each other, but the RAMster answer to:

> How often are you going to re-size your remote-SWAP?

is "as often as the working set changes on any machine in the
cluster", meaning *constantly*, entirely dynamically! How
about a more specific example: Suppose you have 2 machines,
each with 8GB of memory. 99% of the time each machine is
chugging along just fine and doesn't really need more than 4GB,
and may even use less than 1GB a large part of the time.
But every now and then, one of the machines randomly needs
9GB, 10GB, maybe even 12GB of memory. This would normally
result in swapping. (Most system administrators won't even
have this much information... they'll just know they are
seeing swapping and decide they need to buy more RAM.)

With NBD to a ram-drive, each machine would need to pre-allocate
4GB of RAM for the RAM-drive, leaving only 4GB of RAM for
the "local" RAM. The result will actually be MORE swapping
because a fixed amount of RAM has been pre-reserved for the
other machine's swap.

With RAMster, everything is done dynamically, so all that
matters is the maximum of the sum of the RAM used. You may
even be able to *remove* ~2GB of RAM from each of the systems
and still never see any swapping to disk.

> > Thanks. Could you provide a pointer for this? I found
> > the SCST sourceforge page but no obvious references to
> > scst-in-ram-mode. (But also, since it appears to be
> > SCSI-related, I wonder if it also assumes a fixed size
> > target device, RAM or disk or ??)
>
> Yes, it is SCSI. You should be looking for SCST I/O modes. Read some
> docs and then send an email to the scst-mailing-list. If you speak about
> block-IO-performance then FC(in its class of price/performance factor)
> is more than capable of handling any workload. FC is a protocol designed
> for storage. No exotic fabric other than FC is needed.
> Folks who start with ethernet for block-IO, always start with bare
> minimal code and then for squeezing block-IO performance(aka version 2
> of the product), keep hacking repeatedly or go for a link-speed upgrade.
> Start with FC, period.

My point was that block I/O devices (AFAIK) always present a
fixed "size" to the kernel, and if this is also true of
scst-in-ram-mode, the same problem as swap-over-NBD occurs...
it's not dynamic. RAMster does not present a block-I/O
storage-like interface; it's using the Transcendent Memory
interface, which is designed for "slow RAM" of an
unknown-and-dynamic size.

I'm not a storage expert either, but I do wonder if
"no exotic fabric other than FC" isn't an oxymoron ;-)
FC is certainly too exotic for me.

Dan

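[Editor's illustration, not part of the original message.] The two-machine example above reduces to simple arithmetic. The sketch below is a deliberately simplified model under stated assumptions: whole gigabytes, two identical machines, a fixed RAM-drive of `reserve` GB carved out of each machine for the peer's swap, and "dynamic" sharing treated as one common pool. The function names are hypothetical and this is not RAMster code.

```c
#include <assert.h>

static int clamp0(int v) { return v > 0 ? v : 0; }
static int min_int(int a, int b) { return a < b ? a : b; }

/*
 * Fixed scheme, one machine: it keeps (ram - reserve) GB for itself and
 * may push up to `reserve` GB to the peer's RAM-drive. Whatever is left
 * over has to go to disk.
 */
static int fixed_disk_one(int need, int ram, int reserve)
{
    int overflow = clamp0(need - (ram - reserve));

    return overflow - min_int(overflow, reserve);
}

/* GB swapped to disk across both machines with fixed RAM-drives. */
int fixed_reserve_disk_gb(int need_a, int need_b, int ram, int reserve)
{
    assert(reserve >= 0 && ram >= reserve);
    return fixed_disk_one(need_a, ram, reserve) +
           fixed_disk_one(need_b, ram, reserve);
}

/* GB swapped to disk when only the sum of demands matters. */
int dynamic_disk_gb(int need_a, int need_b, int ram)
{
    assert(ram >= 0);
    return clamp0(need_a + need_b - 2 * ram);
}
```

With 8 GB per machine, a 4 GB reserve, and demands of 12 GB and 1 GB, the fixed scheme still sends 4 GB to disk while the pooled model sends none. In this model the pooled case also stays off disk with 6 GB per machine when the demands are 10 GB and 1 GB, which matches the "remove ~2GB" remark above.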
> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: July 05, 2011 9:06 PM
> To: Loke, Chetan; netdev@vger.kernel.org
> Cc: Konrad Wilk; linux-mm
> Subject: RE: [RFC] non-preemptible kernel socket for RAMster
>
> > From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> > Subject: RE: [RFC] non-preemptible kernel socket for RAMster
> >
> > > From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> > > How often are you going to re-size your remote-SWAP?
>
> is "as often as the working set changes on any machine in the
> cluster", meaning *constantly*, entirely dynamically! How
> about a more specific example: Suppose you have 2 machines,
> each with 8GB of memory. 99% of the time each machine is
> chugging along just fine and doesn't really need more than 4GB,
> and may even use less than 1GB a large part of the time.
> But every now and then, one of the machines randomly needs
> 9GB, 10GB, maybe even 12GB of memory. This would normally
> result in swapping. (Most system administrators won't even
> have this much information... they'll just know they are
> seeing swapping and decide they need to buy more RAM.)
>

Ok, I understand there is interest in implementing a
'remote-volatile-ballooning-variant' but how do you pick a remote
candidate (hypervisor)? Let's say memory could be available on the remote
system but what if the remote-p{NIC,CPU} is overloaded? Sure, sysadmins
won't have this info because this is so dynamic (and it's quite possible as
you mentioned above). But does the trans-remote-API know about this
resource-availability before opening a remote-channel?

Stressing the remote-p{NIC/CPU} might trick the hypervisor-vmotion-plugin into
vmotioning VM[s] to another hypervisor. How is the trans-remote-API integrating
with remote/global vmotion policies to avoid this false vmotion?

> Dan

Chetan Loke

> From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> Subject: RE: [RFC] non-preemptible kernel socket for RAMster
>
> > -----Original Message-----
> > From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> >
> > > From: Loke, Chetan [mailto:Chetan.Loke@netscout.com]
> > >
> > > > From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> > > > How often are you going to re-size your remote-SWAP?
> >
> > is "as often as the working set changes on any machine in the
> > cluster", meaning *constantly*, entirely dynamically! How
> > about a more specific example: Suppose you have 2 machines,
> > each with 8GB of memory. 99% of the time each machine is
> > chugging along just fine and doesn't really need more than 4GB,
> > and may even use less than 1GB a large part of the time.
> > But every now and then, one of the machines randomly needs
> > 9GB, 10GB, maybe even 12GB of memory. This would normally
> > result in swapping. (Most system administrators won't even
> > have this much information... they'll just know they are
> > seeing swapping and decide they need to buy more RAM.)
> >
>
> Ok, I understand there is interest in implementing
> 'remote-volatile-ballooning-variant' but how do you pick a remote
> candidate(hypervisor)? Let's say, memory could be available on remote
> system but what if the remote-p{NIC,CPU} is overloaded? Sure, sysadmins
> won't have this info because this so dynamic(and it's quite possible as
> you mentioned above). But does the trans-remote-API know about this
> resource-availability before opening a remote-channel?
>
> Stressing the remote-p{NIC/CPU} might trick hypervisor-vmotion-plugin to
> vmotion VM[s] to another hypervisor. How is trans-remote-API integrating
> with remote/global vmotion policies to avoid this false vmotion?

Hi Chetan --

Thanks for the continued discussion.

First, let me clarify that RAMster does not depend on
virtualization. At some time in the future, it may be
a nice addition for KVM*, but the version I am developing
currently only works on a cluster of physical machines.
So vmotion/migration is not an issue right now.

As for choosing the remote machine, another key feature of
the Transcendent Memory mechanism is that any and every
page can be rejected. If rejected, the page remains local.
In essence, on *every* page-to-be-swapped, machine A *asks*
machine B, "can you take this page"? If the answer is no,
machine A can choose another machine (C), or may choose
to swap the page to its own slow swap disk. (Currently, only the
latter is implemented, but more complicated policy could
certainly be implemented.)

Dan

* (Xen doesn't have drivers so RAMster-over-network is not an
option for Xen. A future RAMster-over-exofabric might work
with Xen though.) And, by the way, the Transcendent Memory
implementation on Xen does handle vmotion/migration so it
is a solvable problem.

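[Editor's illustration, not part of the original message.] The per-page "ask, and fall back if rejected" policy described above can be sketched as a loop over candidate machines. Everything here is hypothetical: the `remote_node` struct, the `free_pages` counter used to decide acceptance, and the `LOCAL_SWAP` sentinel are invented for the example and are not the Transcendent Memory or RAMster interface. Note also that, per the message, trying a second remote machine was not implemented at the time; only the local-swap fallback was.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LOCAL_SWAP (-1)   /* hypothetical sentinel: page goes to local swap */

/* Hypothetical view of a peer: an id and how many pages it will accept. */
struct remote_node {
    int id;
    unsigned free_pages;
};

/* Ask one peer to take a page. The peer is free to say no. */
static bool remote_put(struct remote_node *node)
{
    if (node->free_pages == 0)
        return false;           /* rejected: the page stays with the caller */
    node->free_pages--;
    return true;
}

/*
 * Offer the page to each candidate in turn. Returns the id of the peer
 * that accepted it, or LOCAL_SWAP if every peer rejected the page.
 */
int place_page(struct remote_node *nodes, size_t count)
{
    assert(nodes != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (remote_put(&nodes[i]))
            return nodes[i].id;
    return LOCAL_SWAP;
}
```

Because acceptance is decided page by page on the receiving side, the sender needs no advance knowledge of how much memory a peer can spare. A loaded peer simply rejects, and the page falls through to the next choice.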
diff -Napur linux-2.6.37/net/core/sock.c linux-2.6.37-ramster/net/core/sock.c
--- linux-2.6.37/net/core/sock.c 2011-07-03 19:14:52.267853088 -0600
+++ linux-2.6.37-ramster/net/core/sock.c 2011-07-03 19:10:04.340980799 -0600
@@ -1587,6 +1587,14 @@ static void __lock_sock(struct sock *sk)
     __acquires(&sk->sk_lock.slock)
 {
     DEFINE_WAIT(wait);
+    if (!preemptible()) {
+        while (sock_owned_by_user(sk)) {
+            spin_unlock_bh(&sk->sk_lock.slock);
+            cpu_relax();
+            spin_lock_bh(&sk->sk_lock.slock);
+        }
+        return;
+    }
 
     for (;;) {
         prepare_to_wait_exclusive(&sk->sk_lock.wq, &wait,
@@ -1623,7 +1631,8 @@ static void __release_sock(struct sock *
          * This is safe to do because we've taken the backlog
          * queue private:
          */
-        cond_resched_softirq();
+        if (preemptible())
+            cond_resched_softirq();
         skb = next;
     } while (skb != NULL);