Patchwork [04/18] powerpc/spufs: Implement spu gang scheduling

login
register
mail settings
Submitter Andre Detsch
Date Dec. 10, 2008, 7:40 p.m.
Message ID <200812101740.58525.adetsch@br.ibm.com>
Download mbox | patch
Permalink /patch/13325/
State RFC
Headers show

Comments

Andre Detsch - Dec. 10, 2008, 7:40 p.m.
This patch provides the base support for gang scheudling, including spu
mgmt, runqueue management, placement, activation, deactivation, time slicing,
yield, and preemption.  Basically, all of the core scheduling capabilities.

All spu contexts belong to a gang.  For standalone spu contexts, an internal
gang structure is allocated to present a uniform data abstraction, so that
the gang can be queued on the runqueue.  The priority of the gang dictates
its position on the runqueue.  All gang's have a single priority, policy,
and NUMA attachment which is inherited from creator of the spu context. These
values do not currently change, although there is nothing to prohibit such
support to be added in the future.

All contexts within a gang are scheduled and unscheduled at the same time.
spu_schedule and spu_unschedule have been changed to invoke spu_bind_context
and spu_unbind_context in a loop.  The former is more complicated in that it
must allocate enough spus in a secure manner so that it can get successfully
through its critical section without running out of spus.  For this reason,
SPUs are preallocated.  A reserved spu has the following state.

(spu->alloc_state == SPU_FREE and spu->gang != <gang>)

Timeslicing follows a two step algorithm. 1) all running contexts are
timesliced. The tick counter is implemented at the context level to
simplify the logic as they are all set and decremented at the same time.
When a count goes to zero, the gang is unscheduled.  This frees up space
as much space as posible before the scheduler tries to place a job that
is queued on the runqueue. This is a critical as the size of the job
waiting to run is not known apriori. 2) Sequentially place as many gangs
as possible. Skip over gangs as necessary across all run levels. This is
consistent with spu_yield which unloads the spu across user mode calls.

A simple hueristic has been implemented to prevent too many contexts switches
in step 1.  A limit is based on the number of runnable contexts that are
available on the runqueue.  If the count is less than the number of physical
spus, some spus may not be time sliced.  This is not guaranteed as they
may be part of a gang that is time sliced.  A simple one pass scan is used.

A new gang nstarted counter has been added to the gang structure to create
a synchronization point for gang start.  The counter is incremented, when
a context calls spu_run().  When all of the contexts have been started,
the gang is considered runnable.

The start synchronization point is implemented by passing the first
N-1 contexts directly through spu_run() to the spufs_wait() as before
where they wait on a private spe event word.  As before, they update
their csa area instead of hardware registers. The Nth thread through
spu_run() either places the gang or puts it on the runqueue.

Nearly all of the spu_run() critical section is the same.  It is context
based and runs almost entirely under the context lock.  The gang lock
is only taken when the context is in the SPU_SCHED_STATE, signifying that
the context needs to be activated.  This is an important optimization
that avoids lock contention in the controlling thread.

A gang nrunnable count has been implemented that is incremented and
decremented on entry and exit of spu_run respectively. This count is
intended to provide a measure of whether all of the contexts in the
gang are executing user mode code.  In this case, all of the spus in
the gang are stopped and this is a good point to preempt the gang.  This
is implemented by spu_yield() which triggers a call to spu_deactivate
which unloads the gang.  Importantly, in this case, the gang is not added
to the runqueue as the contexts are stopped.  This is designed to prevent
the pollution of the runqueue with stopped jobs that could only be
lazily loaded.  In this case, it is safe to not queue it as the
application is expected to re-drive the context via spu_run.

Finally, this means that a gang is eligible to be run as long as
one context in the gang is runnable.  Major page faulting is the other
event that may cause a gang to be preempted.  It is implemented via a
nfaulting count and a call to yield.  In this case, it is put on the
runqueue as the context is in kernel mode.  It is sort of a step down
scheduling technique to give something else a chance to run.

Signed-off-by: Luke Browning <lukebrowning@us.ibm.com>
Signed-off-by: Andre Detsch <adetsch@br.ibm.com>
---
 arch/powerpc/include/asm/spu.h              |    1 +
 arch/powerpc/platforms/cell/spufs/context.c |   49 +-
 arch/powerpc/platforms/cell/spufs/file.c    |    2 +-
 arch/powerpc/platforms/cell/spufs/gang.c    |   12 +-
 arch/powerpc/platforms/cell/spufs/run.c     |   50 ++-
 arch/powerpc/platforms/cell/spufs/sched.c   |  985 ++++++++++++++++++---------
 arch/powerpc/platforms/cell/spufs/spufs.h   |   25 +-
 7 files changed, 791 insertions(+), 333 deletions(-)

Patch

diff --git a/arch/powerpc/include/asm/spu.h b/arch/powerpc/include/asm/spu.h
index 8b2eb04..071dbb8 100644
--- a/arch/powerpc/include/asm/spu.h
+++ b/arch/powerpc/include/asm/spu.h
@@ -137,6 +137,7 @@  struct spu {
 	unsigned int slb_replace;
 	struct mm_struct *mm;
 	struct spu_context *ctx;
+	struct spu_gang *gang;
 	struct spu_runqueue *rq;
 	unsigned long long timestamp;
 	pid_t pid;
diff --git a/arch/powerpc/platforms/cell/spufs/context.c b/arch/powerpc/platforms/cell/spufs/context.c
index 85b5c1f..b810050 100644
--- a/arch/powerpc/platforms/cell/spufs/context.c
+++ b/arch/powerpc/platforms/cell/spufs/context.c
@@ -68,9 +68,9 @@  struct spu_context *alloc_spu_context(struct spu_gang *gang)
 	init_waitqueue_head(&ctx->mfc_wq);
 	init_waitqueue_head(&ctx->run_wq);
 	ctx->state = SPU_STATE_SAVED;
+	set_bit(SPU_SCHED_JUST_CREATED, &ctx->sched_flags);
 	ctx->ops = &spu_backing_ops;
 	ctx->owner = get_task_mm(current);
-	INIT_LIST_HEAD(&ctx->rq);
 	INIT_LIST_HEAD(&ctx->aff_list);
 	spu_gang_add_ctx(gang, ctx);
 	spu_update_sched_info(ctx);
@@ -98,14 +98,19 @@  void destroy_spu_context(struct kref *kref)
 	gang = ctx->gang;
 
 	spu_context_nospu_trace(destroy_spu_context__enter, ctx);
-	mutex_lock(&ctx->state_mutex);
-	spu_deactivate(ctx);
-	mutex_unlock(&ctx->state_mutex);
+
+	/*
+	 * Deactivate and make it non-runnable while we work on it.
+	 */
+	mutex_lock(&gang->mutex);
+	WARN_ON(ctx->gang != gang);
+	spu_deactivate(gang);
+	mutex_unlock(&gang->mutex);
+
 	spu_fini_csa(&ctx->csa);
 	spu_gang_remove_ctx(ctx->gang, ctx);
 	if (ctx->prof_priv_kref)
 		kref_put(ctx->prof_priv_kref, ctx->prof_priv_release);
-	BUG_ON(!list_empty(&ctx->rq));
 	atomic_dec(&nr_spu_contexts);
 	kfree(ctx->switch_log);
 	kfree(ctx);
@@ -126,20 +131,19 @@  int put_spu_context(struct spu_context *ctx)
 void spu_forget(struct spu_context *ctx)
 {
 	struct mm_struct *mm;
+	struct spu_gang *gang = ctx->gang;
 
 	/*
-	 * This is basically an open-coded spu_acquire_saved, except that
-	 * we don't acquire the state mutex interruptible, and we don't
-	 * want this context to be rescheduled on release.
+	 * The context is being destroyed but the gang may live on as there
+	 * is no requirement that all contexts within the gang have to die
+	 * at the same time.
 	 */
-	mutex_lock(&ctx->state_mutex);
-	if (ctx->state != SPU_STATE_SAVED)
-		spu_deactivate(ctx);
-
+	mutex_lock(&gang->mutex);
+	spu_deactivate(gang);
 	mm = ctx->owner;
 	ctx->owner = NULL;
 	mmput(mm);
-	spu_release(ctx);
+	mutex_unlock(&gang->mutex);
 }
 
 void spu_unmap_mappings(struct spu_context *ctx)
@@ -168,18 +172,21 @@  void spu_unmap_mappings(struct spu_context *ctx)
  */
 int spu_acquire_saved(struct spu_context *ctx)
 {
+	struct spu_gang *gang = ctx->gang;
 	int ret;
 
 	spu_context_nospu_trace(spu_acquire_saved__enter, ctx);
 
-	ret = spu_acquire(ctx);
+	ret = mutex_lock_interruptible(&gang->mutex);
 	if (ret)
 		return ret;
 
-	if (ctx->state != SPU_STATE_SAVED) {
-		set_bit(SPU_SCHED_WAS_ACTIVE, &ctx->sched_flags);
-		spu_deactivate(ctx);
-	}
+	/*
+	 * Deactivate unconditionally as we leave holding the gang lock.
+	 * If the gang is on the runqueue, the scheduler would block on it,
+	 * if it tried to dispatch it.
+	 */
+	spu_deactivate(gang);
 
 	return 0;
 }
@@ -190,10 +197,12 @@  int spu_acquire_saved(struct spu_context *ctx)
  */
 void spu_release_saved(struct spu_context *ctx)
 {
+	mutex_lock(&ctx->state_mutex);
+	mutex_unlock(&ctx->gang->mutex);
+
 	BUG_ON(ctx->state != SPU_STATE_SAVED);
 
-	if (test_and_clear_bit(SPU_SCHED_WAS_ACTIVE, &ctx->sched_flags) &&
-			test_bit(SPU_SCHED_SPU_RUN, &ctx->sched_flags))
+	if (test_bit(SPU_SCHED_SPU_RUN, &ctx->sched_flags))
 		spu_activate(ctx, 0);
 
 	spu_release(ctx);
diff --git a/arch/powerpc/platforms/cell/spufs/file.c b/arch/powerpc/platforms/cell/spufs/file.c
index 1b26071..a02ee25 100644
--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -2649,7 +2649,7 @@  static int spufs_show_ctx(struct seq_file *s, void *private)
 		ctx->prio,
 		ctx->time_slice,
 		ctx->spu ? ctx->spu->number : -1,
-		!list_empty(&ctx->rq) ? 'q' : ' ',
+		!list_empty(&ctx->gang->rq) ? 'q' : ' ',
 		ctx->csa.class_0_pending,
 		ctx->csa.class_0_dar,
 		ctx->csa.class_1_dsisr,
diff --git a/arch/powerpc/platforms/cell/spufs/gang.c b/arch/powerpc/platforms/cell/spufs/gang.c
index 2a01271..3fcbdc7 100644
--- a/arch/powerpc/platforms/cell/spufs/gang.c
+++ b/arch/powerpc/platforms/cell/spufs/gang.c
@@ -38,6 +38,7 @@  struct spu_gang *alloc_spu_gang(void)
 	mutex_init(&gang->aff_mutex);
 	INIT_LIST_HEAD(&gang->list);
 	INIT_LIST_HEAD(&gang->aff_list_head);
+	INIT_LIST_HEAD(&gang->rq);
 
 	/*
 	 * Inherit scheduling parameters from the creator of the gang.
@@ -91,7 +92,16 @@  void spu_gang_remove_ctx(struct spu_gang *gang, struct spu_context *ctx)
 	}
 	list_del_init(&ctx->gang_list);
 	gang->contexts--;
-	mutex_unlock(&gang->mutex);
+	atomic_dec(&gang->nstarted);
+	if (spu_gang_runnable(gang)) {
+		ctx = list_first_entry(&gang->list,
+				struct spu_context, gang_list);
+		mutex_lock(&ctx->state_mutex);
+		mutex_unlock(&gang->mutex);
+		spu_activate(ctx, 0);
+		mutex_unlock(&ctx->state_mutex);
+	} else
+		mutex_unlock(&gang->mutex);
 
 	put_spu_gang(gang);
 }
diff --git a/arch/powerpc/platforms/cell/spufs/run.c b/arch/powerpc/platforms/cell/spufs/run.c
index c58bd36..cbce273 100644
--- a/arch/powerpc/platforms/cell/spufs/run.c
+++ b/arch/powerpc/platforms/cell/spufs/run.c
@@ -172,11 +172,45 @@  out:
 static int spu_run_init(struct spu_context *ctx, u32 *npc)
 {
 	unsigned long runcntl = SPU_RUNCNTL_RUNNABLE;
+	struct spu_gang *gang = ctx->gang;
 	int ret;
 
 	spuctx_switch_state(ctx, SPU_UTIL_SYSTEM);
 
 	/*
+	 * Gang start.  The nstarted variable is incremented the first
+	 * time that a context is started.  When all ctxts in a gang have
+	 * been started, the gang is considered runnable.
+	 *
+	 * Gang runnable.  The nrunnable variable is the number of
+	 * contexts that are currently being processed by spufs_run_spu.
+	 * The count is incremented on entry and decremented on return
+	 * to user mode.  When a context is in user mode, the spu is
+	 * stopped.  When the count goes to zero, the gang is unloaded
+	 * if there is another gang waiting to run.  In this case, it
+	 * is unloaded, but it is not added to the runqueue as there
+	 * are no runnable contexts in the gang.  There has to be at
+	 * least one runnable context to be added to the runq.  In this
+	 * way, we prevent the runqueue from being over run with non-
+	 * runnable contexts.  The gang is guaranteed to be put back on
+	 * the runqueue, when a context is started again.  This is
+	 * driven explcitly by the program.
+	 *
+	 * Gang preemption. In addition to the case mentioned above
+	 * when all of the contexts are invoking user mode code, the
+	 * gang is also preempted when all of the contexts are either
+	 * in user mode or processing page faults.  The nfaulting variable
+	 * is the number of contexts in a gang that are currently
+	 * processing page faults.  When it equals nrunnable, the gang
+	 * is yielded and another gang runs if there is one on the
+	 * runqueue.  In this case, the gang is added to the runqueue
+	 * as the nrunnable field is still positive.
+	 */
+	if (test_and_clear_bit(SPU_SCHED_JUST_CREATED, &ctx->sched_flags))
+		atomic_inc(&gang->nstarted);
+	atomic_inc(&gang->nrunnable);
+
+	/*
 	 * NOSCHED is synchronous scheduling with respect to the caller.
 	 * The caller waits for the context to be loaded.
 	 */
@@ -242,8 +276,6 @@  static int spu_run_fini(struct spu_context *ctx, u32 *npc,
 {
 	int ret = 0;
 
-	spu_del_from_rq(ctx);
-
 	*status = ctx->ops->status_read(ctx);
 	*npc = ctx->ops->npc_read(ctx);
 
@@ -252,6 +284,8 @@  static int spu_run_fini(struct spu_context *ctx, u32 *npc,
 	spu_switch_log_notify(NULL, ctx, SWITCH_LOG_EXIT, *status);
 	spu_release(ctx);
 
+	atomic_dec(&ctx->gang->nrunnable);
+
 	if (signal_pending(current))
 		ret = -ERESTARTSYS;
 
@@ -357,6 +391,18 @@  long spufs_run_spu(struct spu_context *ctx, u32 *npc, u32 *event)
 
 	ctx->event_return = 0;
 
+	/*
+	 * This routine is almost entirely context based and runs mostly
+	 * under the context lock.  The locking strategy is driven by the
+	 * state of the context.  On input, the gang and context locks are
+	 * taken if the context is in the non-runnable state.  Both locks
+	 * are required to activate the context as it is a gang level
+	 * operation.  This occurs before the loop in spu_run_init. The
+	 * loop itself is processed just with the context lock and if the
+	 * system call returns and is re-entered as long as the context
+	 * remains in the runnable state, the gang lock is not required
+	 * as the context does not need to be activated.
+	 */
 	ret = spu_acquire(ctx);
 	if (ret)
 		goto out_unlock;
diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 1fc0c3d..3c83bd3 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -48,11 +48,39 @@ 
 #include <asm/spu_priv1.h>
 #include "spufs.h"
 
+/*
+ * Gang scheduling locking strategy.
+ *
+ * All contexts in a gang are scheduled / unscheduled at the same time.
+ * Several locks have to be taken at the same time to make this happen
+ * atomically.  The hierarchy is gang, ctxt, runq_lock, and
+ * cbe_spu_info[].list_mutex.  The gang lock is taken principally to ensure
+ * that additional contexts cannot be added or removed while the scheduler
+ * is operating on the gang.  The ctxt lock must be taken to serialize
+ * access to physical spu resources including registers, state save,
+ * and restore.  The runqueue lock to synchronize access to the runqueue
+ * and to ensure that the view of available resources and active gangs
+ * is perfomed atomically across the system.  The list_mutex is needed
+ * to allocate and free individual spu resources.
+ */
+
+struct spu_sched_stats {
+	int inode;	/* node with the most idle spus (idle node) */
+	int inode_icnt;	/* count of idle spus on idle node */
+	int inode_pcnt;	/* count of preemptible spus on idle node */
+	int pnode;	/* node with the most preemptible spus (preempt node) */
+	int pnode_icnt;	/* count of idle spus on preempt node */
+	int pnode_pcnt; /* count of preemptible spus on preempt node */
+	int total_icnt; /* total idle spus across system */
+	int total_pcnt; /* total preemptible spus across system */
+};
+
 struct spu_prio_array {
 	DECLARE_BITMAP(bitmap, MAX_PRIO);
 	struct list_head runq[MAX_PRIO];
 	struct mutex runq_lock;
 	int nr_waiting;
+	int nr_contexts;
 };
 
 static unsigned long spu_avenrun[3];
@@ -60,6 +88,7 @@  static struct spu_prio_array *spu_prio;
 static struct task_struct *spusched_task;
 static struct timer_list spusched_timer;
 static struct timer_list spuloadavg_timer;
+static void spu_unschedule(struct spu_gang *gang);
 
 /*
  * Priority of a normal, non-rt, non-niced'd process (aka nice level 0).
@@ -85,6 +114,14 @@  static struct timer_list spuloadavg_timer;
 #define SCALE_PRIO(x, prio) \
 	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
 
+/* These constants are used when reserving spus for gang scheduling. */
+#define SPU_RESERVE		0
+#define SPU_RESERVE_UNDO	1
+
+/* Constants used by the scheduler's placement algorithm. */
+#define SPU_PLACE_NONE		-2
+#define SPU_PLACE_ALL		-1
+
 /*
  * scale user-nice values [ -20 ... 0 ... 19 ] to time slice values:
  * [800ms ... 100ms ... 5ms]
@@ -109,12 +146,6 @@  void spu_set_timeslice(struct spu_context *ctx)
 static void __spu_update_sched_info(struct spu_context *ctx)
 {
 	/*
-	 * assert that the context is not on the runqueue, so it is safe
-	 * to change its scheduling parameters.
-	 */
-	BUG_ON(!list_empty(&ctx->rq));
-
-	/*
 	 * 32-Bit assignments are atomic on powerpc, and we don't care about
 	 * memory ordering here because retrieving the controlling thread is
 	 * per definition racy.
@@ -178,25 +209,25 @@  void spu_update_sched_info(struct spu_context *ctx)
 	}
 }
 
-static int __node_allowed(struct spu_context *ctx, int node)
+static int __node_allowed(struct spu_gang *gang, int node)
 {
 	if (nr_cpus_node(node)) {
 		cpumask_t mask = node_to_cpumask(node);
 
-		if (cpus_intersects(mask, ctx->cpus_allowed))
+		if (cpus_intersects(mask, gang->cpus_allowed))
 			return 1;
 	}
 
 	return 0;
 }
 
-static int node_allowed(struct spu_context *ctx, int node)
+static int node_allowed(struct spu_gang *gang, int node)
 {
 	int rval;
 
-	mutex_lock(&spu_prio->runq_lock);
-	rval = __node_allowed(ctx, node);
-	mutex_unlock(&spu_prio->runq_lock);
+	mutex_lock(&cbe_spu_info[node].list_mutex);
+	rval = __node_allowed(gang, node);
+	mutex_unlock(&cbe_spu_info[node].list_mutex);
 
 	return rval;
 }
@@ -339,7 +370,7 @@  static struct spu *aff_ref_location(struct spu_context *ctx, int mem_aff,
 		int available_spus;
 
 		node = (node < MAX_NUMNODES) ? node : 0;
-		if (!node_allowed(ctx, node))
+		if (!node_allowed(ctx->gang, node))
 			continue;
 
 		available_spus = 0;
@@ -421,15 +452,17 @@  static struct spu *ctx_location(struct spu *ref, int offset, int node)
  * affinity_check is called each time a context is going to be scheduled.
  * It returns the spu ptr on which the context must run.
  */
-static int has_affinity(struct spu_context *ctx)
+static int has_affinity(struct spu_gang *gang)
 {
-	struct spu_gang *gang = ctx->gang;
-
-	if (list_empty(&ctx->aff_list))
+	if (list_empty(&gang->aff_list_head))
 		return 0;
 
-	if (atomic_read(&ctx->gang->aff_sched_count) == 0)
-		ctx->gang->aff_ref_spu = NULL;
+	/*
+	 * TODO: fix SPU Affinity to work with gang scheduling.
+	 */
+
+	if (atomic_read(&gang->aff_sched_count) == 0)
+		gang->aff_ref_spu = NULL;
 
 	if (!gang->aff_ref_spu) {
 		if (!(gang->aff_flags & AFF_MERGED))
@@ -501,297 +534,463 @@  static void spu_unbind_context(struct spu *spu, struct spu_context *ctx)
 }
 
 /**
- * spu_add_to_rq - add a context to the runqueue
- * @ctx:       context to add
+ * spu_add_to_rq - add a gang to the runqueue
+ * @gang:       gang to add
  */
-static void __spu_add_to_rq(struct spu_context *ctx)
+static void __spu_add_to_rq(struct spu_gang *gang)
 {
-	/*
-	 * Unfortunately this code path can be called from multiple threads
-	 * on behalf of a single context due to the way the problem state
-	 * mmap support works.
-	 *
-	 * Fortunately we need to wake up all these threads at the same time
-	 * and can simply skip the runqueue addition for every but the first
-	 * thread getting into this codepath.
-	 *
-	 * It's still quite hacky, and long-term we should proxy all other
-	 * threads through the owner thread so that spu_run is in control
-	 * of all the scheduling activity for a given context.
-	 */
-	if (list_empty(&ctx->rq)) {
-		list_add_tail(&ctx->rq, &spu_prio->runq[ctx->prio]);
-		set_bit(ctx->prio, spu_prio->bitmap);
+	if (atomic_read(&gang->nrunnable) && list_empty(&gang->rq)) {
+		int prio = gang->prio;
+
+		list_add_tail(&gang->rq, &spu_prio->runq[prio]);
+		set_bit(prio, spu_prio->bitmap);
+		spu_prio->nr_contexts += gang->contexts;
 		if (!spu_prio->nr_waiting++)
 			__mod_timer(&spusched_timer, jiffies + SPUSCHED_TICK);
 	}
 }
 
-static void spu_add_to_rq(struct spu_context *ctx)
+static void spu_add_to_rq(struct spu_gang *gang)
 {
 	mutex_lock(&spu_prio->runq_lock);
-	__spu_add_to_rq(ctx);
+	__spu_add_to_rq(gang);
 	mutex_unlock(&spu_prio->runq_lock);
 }
 
-static void __spu_del_from_rq(struct spu_context *ctx)
+static void __spu_del_from_rq(struct spu_gang *gang)
 {
-	int prio = ctx->prio;
+	if (!list_empty(&gang->rq)) {
+		int prio = gang->prio;
 
-	if (!list_empty(&ctx->rq)) {
+		spu_prio->nr_contexts -= gang->contexts;
 		if (!--spu_prio->nr_waiting)
 			del_timer(&spusched_timer);
-		list_del_init(&ctx->rq);
-
+		list_del_init(&gang->rq);
 		if (list_empty(&spu_prio->runq[prio]))
 			clear_bit(prio, spu_prio->bitmap);
 	}
 }
 
-void spu_del_from_rq(struct spu_context *ctx)
+static void spu_del_from_rq(struct spu_gang *gang)
 {
 	mutex_lock(&spu_prio->runq_lock);
-	__spu_del_from_rq(ctx);
+	__spu_del_from_rq(gang);
 	mutex_unlock(&spu_prio->runq_lock);
 }
 
-static void spu_prio_wait(struct spu_context *ctx)
-{
-	DEFINE_WAIT(wait);
-
-	/*
-	 * The caller must explicitly wait for a context to be loaded
-	 * if the nosched flag is set.  If NOSCHED is not set, the caller
-	 * queues the context and waits for an spu event or error.
-	 */
-	BUG_ON(!(ctx->flags & SPU_CREATE_NOSCHED));
 
-	mutex_lock(&spu_prio->runq_lock);
-	prepare_to_wait_exclusive(&ctx->stop_wq, &wait, TASK_INTERRUPTIBLE);
-	if (!signal_pending(current)) {
-		__spu_add_to_rq(ctx);
-		mutex_unlock(&spu_prio->runq_lock);
-		mutex_unlock(&ctx->state_mutex);
-		schedule();
-		mutex_lock(&ctx->state_mutex);
-		mutex_lock(&spu_prio->runq_lock);
-		__spu_del_from_rq(ctx);
-	}
-	mutex_unlock(&spu_prio->runq_lock);
-	__set_current_state(TASK_RUNNING);
-	remove_wait_queue(&ctx->stop_wq, &wait);
-}
-
-static struct spu *spu_get_idle(struct spu_context *ctx)
+static struct spu *spu_bind(struct spu_gang *gang,
+				struct spu_context *ctx, int node)
 {
-	struct spu *spu, *aff_ref_spu;
-	int node, n;
-
-	spu_context_nospu_trace(spu_get_idle__enter, ctx);
-
-	if (ctx->gang) {
-		mutex_lock(&ctx->gang->aff_mutex);
-		if (has_affinity(ctx)) {
-			aff_ref_spu = ctx->gang->aff_ref_spu;
-			atomic_inc(&ctx->gang->aff_sched_count);
-			mutex_unlock(&ctx->gang->aff_mutex);
-			node = aff_ref_spu->node;
+	struct spu *spu;
+	int n;
 
-			mutex_lock(&cbe_spu_info[node].list_mutex);
-			spu = ctx_location(aff_ref_spu, ctx->aff_offset, node);
-			if (spu && spu->alloc_state == SPU_FREE)
-				goto found;
-			mutex_unlock(&cbe_spu_info[node].list_mutex);
+	if (node == SPU_PLACE_ALL)
+		node = cpu_to_node(raw_smp_processor_id());
 
-			atomic_dec(&ctx->gang->aff_sched_count);
-			goto not_found;
-		}
-		mutex_unlock(&ctx->gang->aff_mutex);
-	}
-	node = cpu_to_node(raw_smp_processor_id());
 	for (n = 0; n < MAX_NUMNODES; n++, node++) {
 		node = (node < MAX_NUMNODES) ? node : 0;
-		if (!node_allowed(ctx, node))
+		if (!node_allowed(gang, node))
 			continue;
 
 		mutex_lock(&cbe_spu_info[node].list_mutex);
 		list_for_each_entry(spu, &cbe_spu_info[node].spus, cbe_list) {
-			if (spu->alloc_state == SPU_FREE)
+			if ((spu->alloc_state == SPU_FREE) &&
+			    (spu->gang == gang))
 				goto found;
 		}
 		mutex_unlock(&cbe_spu_info[node].list_mutex);
 	}
-
- not_found:
-	spu_context_nospu_trace(spu_get_idle__not_found, ctx);
 	return NULL;
 
- found:
+found:
+	cbe_spu_info[node].nr_active++;
 	spu->alloc_state = SPU_USED;
+	spu->ctx = ctx;
 	mutex_unlock(&cbe_spu_info[node].list_mutex);
-	spu_context_trace(spu_get_idle__found, ctx, spu);
-	spu_init_channels(spu);
+
 	return spu;
 }
 
+static void __spu_schedule(struct spu_gang *gang, int node_chosen)
+{
+	struct spu_context *ctx;
+	struct spu *spu;
+
+	spu_del_from_rq(gang);
+
+	list_for_each_entry(ctx, &gang->list, gang_list) {
+		mutex_lock(&ctx->state_mutex);
+		BUG_ON(ctx->spu);
+		spu = spu_bind(gang, ctx, node_chosen);
+		BUG_ON(!spu);
+		spu_bind_context(spu, ctx);
+		spu_set_timeslice(ctx);
+		wake_up_all(&ctx->run_wq);
+		mutex_unlock(&ctx->state_mutex);
+	}
+}
+
+static void spu_unschedule(struct spu_gang *gang)
+{
+	struct spu_context *ctx;
+	struct spu *spu;
+	int node;
+
+	list_for_each_entry(ctx, &gang->list, gang_list) {
+
+		mutex_lock(&ctx->state_mutex);
+		spu = ctx->spu;
+		BUG_ON(!spu);
+		BUG_ON(spu->ctx != ctx);
+		BUG_ON(spu->gang != gang);
+		BUG_ON(spu->alloc_state != SPU_USED);
+		node = spu->node;
+		spu_unbind_context(spu, ctx);
+		mutex_lock(&cbe_spu_info[node].list_mutex);
+		cbe_spu_info[node].nr_active--;
+		spu->alloc_state = SPU_FREE;
+		spu->ctx = NULL;
+		spu->gang = NULL;
+		ctx->stats.invol_ctx_switch++;
+		spu->stats.invol_ctx_switch++;
+		mutex_unlock(&cbe_spu_info[node].list_mutex);
+		mutex_unlock(&ctx->state_mutex);
+	}
+}
+
+static int spu_get_idle(struct spu_gang *gang, int node)
+{
+	struct spu *spu;
+	int n, lnode, count, mode;
+	int ret = 0, found = 0;
+
+	spu_context_nospu_trace(spu_get_idle__enter, gang);
+
+	/* TO DO: SPU affinity scheduling. */
+
+	mode = SPU_RESERVE;
+
+spu_get_idle_top:
+
+	if (node == SPU_PLACE_ALL)
+		lnode = cpu_to_node(raw_smp_processor_id());
+	else
+		lnode = node;
+
+	for (n = 0, count = 0;
+	     n < MAX_NUMNODES && count < gang->contexts;
+	     n++, lnode++) {
+
+		lnode = (lnode < MAX_NUMNODES) ? lnode : 0;
+		if (!node_allowed(gang, lnode))
+			continue;
+
+		mutex_lock(&cbe_spu_info[lnode].list_mutex);
+		list_for_each_entry(spu, &cbe_spu_info[lnode].spus, cbe_list) {
+			switch (mode) {
+			case SPU_RESERVE :
+				if ((spu->alloc_state == SPU_FREE) &&
+				    (spu->gang == NULL)) {
+					spu_init_channels(spu);
+					spu->gang = gang;
+					if (++count == gang->contexts)
+						goto spu_get_idle_found;
+				}
+				break;
+			case SPU_RESERVE_UNDO :
+				if ((spu->alloc_state == SPU_FREE) &&
+				    (spu->gang == gang)) {
+					spu->gang = NULL;
+					if (++count == found)
+						goto spu_get_idle_found;
+				}
+				break;
+			}
+		}
+		mutex_unlock(&cbe_spu_info[lnode].list_mutex);
+	}
+	BUG_ON(mode != SPU_RESERVE);
+
+	found = count;
+	ret = -1;
+	mode = SPU_RESERVE_UNDO;
+
+	if (found)
+		goto spu_get_idle_top;
+	else
+		goto spu_get_idle_out;
+
+spu_get_idle_found:
+
+	found = count;
+
+	mutex_unlock(&cbe_spu_info[lnode].list_mutex);
+
+spu_get_idle_out:
+
+	if (ret)
+		spu_gang_nospu_trace(spu_get_idle__not_found,
+					gang, gang->contexts, found)
+	else
+		spu_gang_trace(spu_get_idle__found, gang, gang->contexts)
+
+	return ret;
+}
+
 /**
- * find_victim - find a lower priority context to preempt
- * @ctx:	canidate context for running
+ * find_victim - find a lower priority gang to preempt
+ * @gang:	canidate gang for running
  *
- * Returns the freed physical spu to run the new context on.
+ * Returns 0 for success (preemption occurred), -1 for failure
  */
-static struct spu *find_victim(struct spu_context *ctx)
+static int find_victim(struct spu_gang *gang, int node)
 {
-	struct spu_context *victim = NULL;
+	struct spu_gang *victim;
+	struct spu_context *ctx;
 	struct spu *spu;
-	int node, n;
+	int n, retry = 3;
 
-	spu_context_nospu_trace(spu_find_victim__enter, ctx);
+	spu_context_nospu_trace(spu_find_victim__enter, gang);
+
+	if (node == SPU_PLACE_ALL)
+		node = cpu_to_node(raw_smp_processor_id());
 
-	/*
-	 * Look for a possible preemption candidate on the local node first.
-	 * If there is no candidate look at the other nodes.  This isn't
-	 * exactly fair, but so far the whole spu scheduler tries to keep
-	 * a strong node affinity.  We might want to fine-tune this in
-	 * the future.
-	 */
- restart:
-	node = cpu_to_node(raw_smp_processor_id());
 	for (n = 0; n < MAX_NUMNODES; n++, node++) {
+
 		node = (node < MAX_NUMNODES) ? node : 0;
-		if (!node_allowed(ctx, node))
+		if (!node_allowed(gang, node))
+			continue;
+restart_node:
+		/*
+		 * Retry to avoid algorithm deadlock.  The act of
+		 * unscheduling takes place before scheduling, so we can't
+		 * spin indefinitely in the unschedule (ie. find_victim)
+		 * waiting for a priority change to occur.
+		 */
+		if (!--retry)
 			continue;
 
+		victim = NULL;
+
 		mutex_lock(&cbe_spu_info[node].list_mutex);
 		list_for_each_entry(spu, &cbe_spu_info[node].spus, cbe_list) {
-			struct spu_context *tmp = spu->ctx;
+			struct spu_gang *tmp = spu->gang;
+			struct spu_context *tmpc = spu->ctx;
 
-			if (tmp && tmp->prio > ctx->prio &&
-			    !(tmp->flags & SPU_CREATE_NOSCHED) &&
+			if ((spu->alloc_state == SPU_USED) &&
+			    (tmp != NULL) &&
+			    (tmp->prio > gang->prio) &&
+			   !(tmpc->flags & SPU_CREATE_NOSCHED) &&
 			    (!victim || tmp->prio > victim->prio)) {
-				victim = spu->ctx;
+				victim = tmp;
 			}
 		}
-		if (victim)
-			get_spu_context(victim);
 		mutex_unlock(&cbe_spu_info[node].list_mutex);
 
 		if (victim) {
 			/*
-			 * This nests ctx->state_mutex, but we always lock
+			 * This nests gang->mutex, but we always lock
 			 * higher priority contexts before lower priority
 			 * ones, so this is safe until we introduce
 			 * priority inheritance schemes.
 			 *
-			 * XXX if the highest priority context is locked,
-			 * this can loop a long time.  Might be better to
-			 * look at another context or give up after X retries.
+			 * XXX If victim no longer exists, we are referencing
+			 * stray kernel memory and changing it with the mutex
+			 * lock and subsequent actions.
 			 */
-			if (!mutex_trylock(&victim->state_mutex)) {
-				put_spu_context(victim);
-				victim = NULL;
-				goto restart;
+			if (test_bit(SPU_SCHED_DEACTIVATED, &gang->sched_flags))
+				goto restart_node;
+
+			if (!mutex_trylock(&victim->mutex)) {
+				goto restart_node;
 			}
 
-			spu = victim->spu;
-			if (!spu || victim->prio <= ctx->prio) {
+			if (!spu_gang_runnable(victim)) {
 				/*
 				 * This race can happen because we've dropped
 				 * the active list mutex.  Not a problem, just
 				 * restart the search.
 				 */
-				mutex_unlock(&victim->state_mutex);
-				put_spu_context(victim);
-				victim = NULL;
-				goto restart;
+				mutex_unlock(&victim->mutex);
+				goto restart_node;
 			}
 
-			spu_context_trace(__spu_deactivate__unload, ctx, spu);
+			ctx = list_first_entry(&victim->list,
+					struct spu_context, gang_list);
 
-			mutex_lock(&cbe_spu_info[node].list_mutex);
-			cbe_spu_info[node].nr_active--;
-			spu_unbind_context(spu, victim);
-			mutex_unlock(&cbe_spu_info[node].list_mutex);
+			if (!ctx->spu || victim->prio <= gang->prio) {
+				/*
+				 * This race can happen because we've dropped
+				 * the active list mutex.  Not a problem, just
+				 * restart the search.
+				 */
+				mutex_unlock(&victim->mutex);
+				goto restart_node;
+			}
 
-			victim->stats.invol_ctx_switch++;
-			spu->stats.invol_ctx_switch++;
-			if (test_bit(SPU_SCHED_SPU_RUN, &victim->sched_flags))
-				spu_add_to_rq(victim);
+			spu_gang_trace(__spu_deactivate__unload,
+						victim, victim->contexts);
 
-			mutex_unlock(&victim->state_mutex);
-			put_spu_context(victim);
+			spu_unschedule(victim);
+			spu_add_to_rq(victim);
 
-			return spu;
+			mutex_unlock(&victim->mutex);
+
+			return 0;
 		}
 	}
+	return -1;
+}
 
-	return NULL;
+static int spu_gang_placeable(struct spu_gang *gang,
+	struct spu_sched_stats *stats, int *node_chosen, int *node_preempt)
+{
+	/*
+	 * Strategy is to minimize preemption and to place contexts on
+	 * a single node, if possible.  Affinity gangs must be scheduled
+	 * on one node.  Identify nodes to preempt if necessary.
+	 */
+	if (has_affinity(gang)) {
+		if (stats->inode_icnt + stats->inode_pcnt >= gang->contexts) {
+			*node_chosen = stats->inode;
+			if (stats->inode_icnt < gang->contexts)
+				*node_preempt = stats->inode_icnt;
+			else
+				*node_preempt = SPU_PLACE_NONE;
+			return 1;
+		}
+		if (stats->pnode_icnt + stats->pnode_pcnt >= gang->contexts) {
+			*node_chosen = stats->pnode;
+			if (stats->pnode_icnt < gang->contexts)
+				*node_preempt = stats->pnode;
+			else
+				*node_preempt = SPU_PLACE_NONE;
+			return 1;
+		}
+		return 0;
+	}
+	if (stats->inode_icnt >= gang->contexts) {
+		*node_chosen = stats->inode;	/* allocate idle node */
+		*node_preempt = SPU_PLACE_NONE;
+		return 1;
+	}
+	if (stats->total_icnt >= gang->contexts) {
+		*node_chosen = SPU_PLACE_ALL;	/* allocate all nodes */
+		*node_preempt = SPU_PLACE_NONE;
+		return 1;
+	}
+	if (stats->inode_icnt + stats->inode_pcnt >= gang->contexts) {
+		*node_chosen = stats->inode;	/* allocate idle node */
+		*node_preempt = stats->inode;	/* preempt on idle node */
+		return 1;
+	}
+	if (stats->pnode_icnt + stats->pnode_pcnt >= gang->contexts) {
+		*node_chosen = stats->pnode;	/* allocate other node */
+		*node_preempt = stats->pnode;	/* preempt on other node */
+		return 1;
+	}
+	if (stats->total_icnt + stats->inode_pcnt >= gang->contexts) {
+		*node_chosen = SPU_PLACE_ALL;	/* allocate all nodes */
+		*node_preempt = stats->inode;	/* preempt on idle node */
+		return 1;
+	}
+	if (stats->total_icnt + stats->pnode_pcnt >= gang->contexts) {
+		*node_chosen = SPU_PLACE_ALL;	/* allocate all nodes */
+		*node_preempt = stats->pnode;	/* preempt other node */
+		return 1;
+	}
+	if (stats->total_icnt + stats->total_pcnt >= gang->contexts) {
+		*node_chosen = SPU_PLACE_ALL;	/* allocate nodes */
+		*node_preempt = SPU_PLACE_ALL;	/* preempt all nodes */
+		return 1;
+	}
+	return 0;
 }
 
-static void __spu_schedule(struct spu *spu, struct spu_context *ctx)
+static void spu_get_run_stats(struct spu_gang *gang,
+					struct spu_sched_stats *stats)
 {
-	int node = spu->node;
-	int success = 0;
+	struct spu *spu;
+	int n, node, count_idle, count_preempt;
 
-	spu_set_timeslice(ctx);
+	stats->inode = -1;		/* node with most idle spus */
+	stats->inode_icnt = 0; 		/* n idle on idle node */
+	stats->inode_pcnt = 0;		/* n preemptable on idle node */
+	stats->pnode = -1;		/* node with most preemptable spus */
+	stats->pnode_icnt = 0; 		/* n idle on preempt node */
+	stats->pnode_pcnt = 0; 		/* n preemptable on preempt node */
+	stats->total_icnt = 0; 		/* total idle across all nodes */
+	stats->total_pcnt = 0;		/* total preemptable across all nodes */
 
-	mutex_lock(&cbe_spu_info[node].list_mutex);
-	if (spu->ctx == NULL) {
-		spu_bind_context(spu, ctx);
-		cbe_spu_info[node].nr_active++;
-		spu->alloc_state = SPU_USED;
-		success = 1;
-	}
-	mutex_unlock(&cbe_spu_info[node].list_mutex);
+	node = cpu_to_node(raw_smp_processor_id());
+	for (n = 0; n < MAX_NUMNODES; n++, node++) {
 
-	if (success)
-		wake_up_all(&ctx->run_wq);
-	else
-		spu_add_to_rq(ctx);
-}
+		node = (node < MAX_NUMNODES) ? node : 0;
+		if (!node_allowed(gang, node))
+			continue;
 
-static void spu_schedule(struct spu *spu, struct spu_context *ctx)
-{
-	/* not a candidate for interruptible because it's called either
-	   from the scheduler thread or from spu_deactivate */
-	mutex_lock(&ctx->state_mutex);
-	if (ctx->state == SPU_STATE_SAVED)
-		__spu_schedule(spu, ctx);
-	spu_release(ctx);
+		count_idle = 0;
+		count_preempt = 0;
+
+		mutex_lock(&cbe_spu_info[node].list_mutex);
+		list_for_each_entry(spu, &cbe_spu_info[node].spus, cbe_list) {
+			struct spu_gang *tmp = spu->gang;
+			struct spu_context *tmpc = spu->ctx;
+
+			switch (spu->alloc_state) {
+			case SPU_FREE :
+				if (!tmp)
+					count_idle++;
+				break;
+			case SPU_USED :
+				if (tmpc &&
+				   !(tmpc->flags & SPU_CREATE_NOSCHED)) {
+					if (tmp == gang) /* yield/deactivate */
+						count_idle++;
+					else if (tmp->prio > gang->prio)
+						count_preempt++;
+				}
+				break;
+			}
+		}
+		mutex_unlock(&cbe_spu_info[node].list_mutex);
+
+		if (stats->inode == -1 || stats->inode_icnt < count_idle) {
+			stats->inode_icnt = count_idle;
+			stats->inode_pcnt = count_preempt;
+			stats->inode = node;
+		}
+
+		if (stats->pnode == -1 || stats->pnode_pcnt < count_preempt) {
+			stats->pnode_icnt = count_idle;
+			stats->pnode_pcnt = count_preempt;
+			stats->pnode = node;
+		}
+
+		stats->total_icnt += count_idle;
+		stats->total_pcnt += count_preempt;
+	}
 }
 
-/**
- * spu_unschedule - remove a context from a spu, and possibly release it.
- * @spu:	The SPU to unschedule from
- * @ctx:	The context currently scheduled on the SPU
- * @free_spu	Whether to free the SPU for other contexts
- *
- * Unbinds the context @ctx from the SPU @spu. If @free_spu is non-zero, the
- * SPU is made available for other contexts (ie, may be returned by
- * spu_get_idle). If this is zero, the caller is expected to schedule another
- * context to this spu.
- *
- * Should be called with ctx->state_mutex held.
- */
-static void spu_unschedule(struct spu *spu, struct spu_context *ctx,
-		int free_spu)
+static int spu_gang_schedulable(struct spu_gang *gang,
+		int *node_chosen, int *node_preempt)
 {
-	int node = spu->node;
+	struct spu_sched_stats stats;
+	int ret;
 
-	mutex_lock(&cbe_spu_info[node].list_mutex);
-	cbe_spu_info[node].nr_active--;
-	if (free_spu)
-		spu->alloc_state = SPU_FREE;
-	spu_unbind_context(spu, ctx);
-	ctx->stats.invol_ctx_switch++;
-	spu->stats.invol_ctx_switch++;
-	mutex_unlock(&cbe_spu_info[node].list_mutex);
+	mutex_lock(&spu_prio->runq_lock);
+	spu_get_run_stats(gang, &stats);
+	mutex_unlock(&spu_prio->runq_lock);
+	ret = spu_gang_placeable(gang, &stats, node_chosen, node_preempt);
+	return ret;
 }
 
+
 /**
  * spu_activate - find a free spu for a context and execute it
- * @ctx:	spu context to schedule
- * @flags:	flags (currently ignored)
+ * @ctx:		spu context to schedule
+ * @flags:		flags (currently ignored)
  *
  * Tries to find a free spu to run @ctx.  If no free spu is available
  * add the context to the runqueue so it gets woken up once an spu
@@ -799,100 +998,184 @@  static void spu_unschedule(struct spu *spu, struct spu_context *ctx,
  */
 int spu_activate(struct spu_context *ctx, unsigned long flags)
 {
-	struct spu *spu;
+	struct spu_gang *gang = ctx->gang;
+	int ret, node_chosen, node_preempt;
+
+	if (signal_pending(current))
+		return -ERESTARTSYS;
 
-	/*
-	 * If there are multiple threads waiting for a single context
-	 * only one actually binds the context while the others will
-	 * only be able to acquire the state_mutex once the context
-	 * already is in runnable state.
-	 */
 	if (ctx->spu)
 		return 0;
 
-spu_activate_top:
-	if (signal_pending(current))
-		return -ERESTARTSYS;
+	if (!mutex_trylock(&gang->mutex)) {
+		spu_release(ctx);
+		mutex_lock(&gang->mutex);
+		mutex_lock(&ctx->state_mutex);
+	}
 
-	spu = spu_get_idle(ctx);
 	/*
-	 * If this is a realtime thread we try to get it running by
-	 * preempting a lower priority thread.
+	 * Recheck as we released lock. Context could have been activated
+	 * if it was previously started and sliced.
 	 */
-	if (!spu && rt_prio(ctx->prio))
-		spu = find_victim(ctx);
-	if (spu) {
-		unsigned long runcntl;
-
-		runcntl = ctx->ops->runcntl_read(ctx);
-		__spu_schedule(spu, ctx);
-		if (runcntl & SPU_RUNCNTL_RUNNABLE)
-			spuctx_switch_state(ctx, SPU_UTIL_USER);
-
+	if (ctx->spu) {
+		mutex_unlock(&gang->mutex);
 		return 0;
 	}
 
-	if (ctx->flags & SPU_CREATE_NOSCHED) {
-		spu_prio_wait(ctx);
-		goto spu_activate_top;
+	if (!spu_gang_runnable(gang)) {
+		mutex_unlock(&gang->mutex);
+		if (ctx->flags & SPU_CREATE_NOSCHED) {
+			ret = spufs_wait(ctx->run_wq,
+				ctx->state == SPU_STATE_RUNNABLE);
+			if (unlikely(ret)) {
+				spu_del_from_rq(gang);
+				mutex_lock(&ctx->state_mutex);
+			}
+			return ret;
+		}
+		return 0;
 	}
+	spu_release(ctx);
+
+	/*
+	 * Activation assumes gang is not on the runqueue as it is
+	 * about to be activated. It could be on the runqueue, if it
+	 * were time sliced while executing user mode code.
+	 */
+	spu_del_from_rq(gang);
+
+	clear_bit(SPU_SCHED_DEACTIVATED, &gang->sched_flags);
 
-	spu_add_to_rq(ctx);
+spu_activate_top:
+
+	if (spu_gang_schedulable(gang, &node_chosen, &node_preempt)) {
 
+		if (node_preempt != SPU_PLACE_NONE) {
+			ret = find_victim(gang, node_preempt);
+			if (unlikely(ret))
+				goto spu_activate_rq;
+		}
+
+		ret = spu_get_idle(gang, node_chosen);
+		if (ret)
+			goto spu_activate_top;
+		__spu_schedule(gang, node_chosen);
+	} else {
+
+spu_activate_rq:
+		spu_add_to_rq(gang);
+		if (ctx->flags & SPU_CREATE_NOSCHED) {
+			spu_acquire(ctx);
+			mutex_unlock(&gang->mutex);
+			ret = spufs_wait(ctx->run_wq,
+				ctx->state == SPU_STATE_RUNNABLE);
+			if (unlikely(ret)) {
+				spu_del_from_rq(gang);
+				mutex_lock(&ctx->state_mutex);
+			}
+			return ret;
+		}
+	}
+	mutex_lock(&ctx->state_mutex);
+	mutex_unlock(&gang->mutex);
 	return 0;
 }
 
 /**
- * grab_runnable_context - try to find a runnable context
+ * grab_runnable_gang - try to find a runnable gang
  *
- * Remove the highest priority context on the runqueue and return it
- * to the caller.  Returns %NULL if no runnable context was found.
+ * Remove the highest priority gang on the runqueue and return it
+ * to the caller.  Returns %NULL if no runnable gang was found.
  */
-static struct spu_context *grab_runnable_context(int prio, int node)
+static struct spu_gang *grab_runnable_gang(struct spu_gang *old,
+		int prio, int *node_chosen, int *node_preempt)
 {
+	struct spu_sched_stats stats, lstats;
 	struct spu_context *ctx;
-	int best;
+	struct spu_gang *gang;
+	int ret, best;
 
+	/*
+	 * If old != NULL, then caller is spu_yield or spu_deactivate.  We
+	 * can use idle spus and spus assigned to old.  Else the caller is the
+	 * scheduler thread (time slicer) and we may only use idle spus.  In
+	 * neither case are we allowed to preempt.  Preemption is only done by
+	 * spu_activation which assumes that the gang is not on the runqueue.
+	 * The fact that we are getting the gang from the runqueue implies
+	 * that it has a less favored priority than anything currently
+	 * running. spu_activate does not call this routine.
+	 */
 	mutex_lock(&spu_prio->runq_lock);
+	if (old) {
+		ctx = list_first_entry(&old->list,
+				struct spu_context, gang_list);
+		BUG_ON(!ctx->spu);
+		spu_get_run_stats(old, &stats);
+	}
 	best = find_first_bit(spu_prio->bitmap, prio);
 	while (best < prio) {
 		struct list_head *rq = &spu_prio->runq[best];
 
-		list_for_each_entry(ctx, rq, rq) {
-			/* XXX(hch): check for affinity here aswell */
-			if (__node_allowed(ctx, node)) {
-				__spu_del_from_rq(ctx);
+		list_for_each_entry(gang, rq, rq) {
+			if (old) {
+				lstats = stats;
+				if (!node_allowed(gang, lstats.inode))
+					lstats.inode_icnt = 0;
+			} else {
+				spu_get_run_stats(gang, &lstats);
+			}
+			ret = spu_gang_placeable(gang, &lstats,
+						node_chosen, node_preempt);
+			if (ret && (*node_preempt == SPU_PLACE_NONE)) {
+				__spu_del_from_rq(gang);
 				goto found;
 			}
 		}
 		best++;
 	}
-	ctx = NULL;
- found:
+	gang = NULL;
+found:
 	mutex_unlock(&spu_prio->runq_lock);
-	return ctx;
+	return gang;
 }
 
-static int __spu_deactivate(struct spu_context *ctx, int force, int max_prio)
+static int __spu_deactivate(struct spu_gang *gang, int force, int max_prio)
 {
-	struct spu *spu = ctx->spu;
-	struct spu_context *new = NULL;
+	struct spu_gang *new;
+	struct spu_context *ctx;
+	int ret, node_chosen, node_preempt;
 
-	if (spu) {
-		new = grab_runnable_context(max_prio, spu->node);
-		if (new || force) {
-			spu_unschedule(spu, ctx, new == NULL);
-			if (new) {
-				if (new->flags & SPU_CREATE_NOSCHED)
-					wake_up(&new->stop_wq);
-				else {
-					spu_release(ctx);
-					spu_schedule(spu, new);
-					/* this one can't easily be made
-					   interruptible */
-					mutex_lock(&ctx->state_mutex);
-				}
+	ctx = list_first_entry(&gang->list, struct spu_context, gang_list);
+	if (!ctx->spu)
+		return 0;
+
+	new = grab_runnable_gang(gang, max_prio, &node_chosen, &node_preempt);
+	if (new || force) {
+		spu_unschedule(gang);
+		if (new) {
+			/*
+			 * None of these locks can be easily made interruptible
+			 */
+			mutex_unlock(&gang->mutex);
+
+			/*
+			 * Schedule gang taken from runqueue.  Should fit, but
+			 * might not as we dropped the lock above.  Also, it
+			 * could have been activated by spu_run.
+			 */
+			mutex_lock(&new->mutex);
+			ctx = list_first_entry(&new->list,
+				struct spu_context, gang_list);
+			if (!ctx->spu) {
+				ret = spu_get_idle(new, node_chosen);
+				if (ret)
+					spu_add_to_rq(new);
+				else
+					__spu_schedule(new, node_chosen);
 			}
+			mutex_unlock(&new->mutex);
+
+			mutex_lock(&gang->mutex);
 		}
 	}
 
@@ -906,67 +1189,85 @@  static int __spu_deactivate(struct spu_context *ctx, int force, int max_prio)
  * Unbind @ctx from the physical spu it is running on and schedule
  * the highest priority context to run on the freed physical spu.
  */
-void spu_deactivate(struct spu_context *ctx)
+void spu_deactivate(struct spu_gang *gang)
 {
-	spu_context_nospu_trace(spu_deactivate__enter, ctx);
-	__spu_deactivate(ctx, 1, MAX_PRIO);
+	spu_context_nospu_trace(spu_deactivate__enter, gang);
+	set_bit(SPU_SCHED_DEACTIVATED, &gang->sched_flags);
+	__spu_deactivate(gang, 1, MAX_PRIO);
+	spu_del_from_rq(gang);
 }
 
 /**
  * spu_yield -	yield a physical spu if others are waiting
  * @ctx:	spu context to yield
  *
- * Check if there is a higher priority context waiting and if yes
- * unbind @ctx from the physical spu and schedule the highest
- * priority context to run on the freed physical spu instead.
+ * Check if there is a higher priority gang waiting and if yes
+ * unbind @ctx and any other within the same gang freeing one or
+ * more physical spus and schedule the highest priority gang
+ * to run on the freed physical spu(s) instead.  The gang is added
+ * back to the runqueue, so that it may be resumed by the scheduler
+ * as soon as there are idle spus.
  */
 void spu_yield(struct spu_context *ctx)
 {
-	spu_context_nospu_trace(spu_yield__enter, ctx);
+	struct spu_gang *gang = ctx->gang;
+	int ret;
+
+	spu_context_nospu_trace(spu_yield__enter, gang);
 	if (!(ctx->flags & SPU_CREATE_NOSCHED)) {
-		mutex_lock(&ctx->state_mutex);
-		__spu_deactivate(ctx, 0, MAX_PRIO);
-		mutex_unlock(&ctx->state_mutex);
+		mutex_lock(&gang->mutex);
+		ret = __spu_deactivate(gang, 0, MAX_PRIO);
+		if (ret)
+			spu_add_to_rq(gang);
+		mutex_unlock(&gang->mutex);
 	}
 }
 
-static noinline void spusched_tick(struct spu_context *ctx)
+static noinline int spusched_tick(struct spu_gang *gang,
+					struct spu_context *ctx,
+					int preempt)
 {
-	struct spu_context *new = NULL;
-	struct spu *spu = NULL;
+	int best, yield;
+	int npreempted = 0;
 
-	if (spu_acquire(ctx))
-		BUG();	/* a kernel thread never has signals pending */
+	if (test_bit(SPU_SCHED_DEACTIVATED, &gang->sched_flags))
+		return 0;
 
-	if (ctx->state != SPU_STATE_RUNNABLE)
-		goto out;
-	if (ctx->flags & SPU_CREATE_NOSCHED)
-		goto out;
-	if (ctx->policy == SCHED_FIFO)
-		goto out;
+	mutex_lock(&gang->mutex);
 
-	if (--ctx->time_slice && test_bit(SPU_SCHED_SPU_RUN, &ctx->sched_flags))
-		goto out;
+	BUG_ON(!spu_gang_runnable(gang));
 
-	spu = ctx->spu;
+	if (!ctx->spu)
+		goto out;
 
-	spu_context_trace(spusched_tick__preempt, ctx, spu);
+	if (ctx->flags & SPU_CREATE_NOSCHED)
+		goto out;
 
-	new = grab_runnable_context(ctx->prio + 1, spu->node);
-	if (new) {
-		spu_unschedule(spu, ctx, 0);
-		if (test_bit(SPU_SCHED_SPU_RUN, &ctx->sched_flags))
-			spu_add_to_rq(ctx);
-	} else {
+	/*
+	 * If nrunnable is zero, then all of the contexts are in user mode.
+	 */
+	yield = !atomic_read(&gang->nrunnable);
+
+	if (yield || ((ctx->policy != SCHED_FIFO) && (!--ctx->time_slice))) {
+		if (spu_prio->nr_waiting) {
+			best = find_first_bit(spu_prio->bitmap, gang->prio);
+			if (yield || (preempt && best <= gang->prio)) {
+				spu_context_trace(spusched_tick__preempt,
+							ctx, ctx->spu);
+				npreempted = gang->contexts;
+				spu_unschedule(gang);
+				spu_add_to_rq(gang);
+				goto out;
+			}
+		}
 		spu_context_nospu_trace(spusched_tick__newslice, ctx);
-		if (!ctx->time_slice)
+		if (ctx->policy != SCHED_FIFO)
 			ctx->time_slice++;
 	}
 out:
-	spu_release(ctx);
+	mutex_unlock(&gang->mutex);
 
-	if (new)
-		spu_schedule(spu, new);
+	return npreempted;
 }
 
 /**
@@ -1019,30 +1320,104 @@  static void spuloadavg_wake(unsigned long data)
 
 static int spusched_thread(void *unused)
 {
+	struct spu_gang *gang;
+	struct spu_context *ctx;
 	struct spu *spu;
-	int node;
+	int node, active, pnode, preempted, maxcontexts, retry, deactivated;
 
 	while (!kthread_should_stop()) {
 		set_current_state(TASK_INTERRUPTIBLE);
 		schedule();
-		for (node = 0; node < MAX_NUMNODES; node++) {
+
+		/*
+		 * Try to limit the number of context switches because
+		 * we preempt all of the contexts first and then place
+		 * them.  We preempt first as the next gang to run may
+		 * not fit.  Also, it helps us place NUMA and SPU affinity
+		 * jobs as they have special requirements.  Not that we
+		 * are doing that now, but we could. The downside is
+		 * that we will have a little more idle time.  What we
+		 * want to avoid is preempting and then rescheduling
+		 * the same job as it creates hicups in the execution.
+		 */
+		maxcontexts = spu_prio->nr_contexts;
+		preempted = 0;
+
+		/*
+		 * Time slice contexts first assuming that there are
+		 * spe jobs waiting to run.  The next gang to be scheduled
+		 * is of indeterminate size, so it is not sufficient to
+		 * follow the traditional mode which assumes a 1:1
+		 * replacement (1 spu : 1 spe context).
+		 */
+		for (node = 0, active = 0; node < MAX_NUMNODES; node++) {
 			struct mutex *mtx = &cbe_spu_info[node].list_mutex;
 
 			mutex_lock(mtx);
 			list_for_each_entry(spu, &cbe_spu_info[node].spus,
 					cbe_list) {
-				struct spu_context *ctx = spu->ctx;
-
-				if (ctx) {
+				if (spu->alloc_state == SPU_USED) {
+					gang = spu->gang;
+					BUG_ON(!gang);
+					ctx = spu->ctx;
 					get_spu_context(ctx);
 					mutex_unlock(mtx);
-					spusched_tick(ctx);
+					preempted += spusched_tick(gang, ctx,
+						preempted < maxcontexts);
 					mutex_lock(mtx);
 					put_spu_context(ctx);
+					active++;
 				}
 			}
 			mutex_unlock(mtx);
 		}
+
+		spu_sched_stats(spusched_thread__timeslice,
+					maxcontexts, active, preempted);
+
+		/*
+		 * Place as many gangs as possible.  A gang might fail to
+		 * be placed if it has NUMA bindings, SPU Affinity, or is
+		 * to big to fit.  On failure, the gang is added to the
+		 * back of the runqueue, which effectively allows us to
+		 * skip over a job, assuming it is not the only one at a
+		 * given priority level.
+		 */
+		retry = 0;
+		active = 0;
+		while (retry < maxcontexts) {
+
+			gang = grab_runnable_gang(NULL, MAX_PRIO,
+					&node, &pnode);
+			if (!gang)
+				break;
+
+			/*
+			 * Must recheck state as we have dropped all locks.
+			 * It could be running, deactivated, or destroyed. The
+			 * latter is still a problem. See find_victim (XXX).
+			 */
+			deactivated = test_bit(SPU_SCHED_DEACTIVATED,
+						&gang->sched_flags);
+			if (!deactivated) {
+				mutex_lock(&gang->mutex);
+				ctx = list_first_entry(&gang->list,
+					struct spu_context, gang_list);
+				if (!ctx->spu) {
+					if (spu_get_idle(gang, node)) {
+						spu_add_to_rq(gang);
+						retry++;
+					} else {
+						__spu_schedule(gang, node);
+						active++;
+					}
+				}
+				mutex_unlock(&gang->mutex);
+			}
+		}
+
+		spu_sched_stats(spusched_thread__scheduled,
+				spu_prio->nr_contexts, active, retry);
 	}
 
 	return 0;
@@ -1180,8 +1555,10 @@  void spu_sched_exit(void)
 	for (node = 0; node < MAX_NUMNODES; node++) {
 		mutex_lock(&cbe_spu_info[node].list_mutex);
 		list_for_each_entry(spu, &cbe_spu_info[node].spus, cbe_list)
-			if (spu->alloc_state != SPU_FREE)
+			if (spu->alloc_state != SPU_FREE) {
 				spu->alloc_state = SPU_FREE;
+				spu->gang = NULL;
+			}
 		mutex_unlock(&cbe_spu_info[node].list_mutex);
 	}
 	kfree(spu_prio);
diff --git a/arch/powerpc/platforms/cell/spufs/spufs.h b/arch/powerpc/platforms/cell/spufs/spufs.h
index a8f5203..8e84481 100644
--- a/arch/powerpc/platforms/cell/spufs/spufs.h
+++ b/arch/powerpc/platforms/cell/spufs/spufs.h
@@ -47,11 +47,12 @@  enum {
 struct spu_context_ops;
 struct spu_gang;
 
-/* ctx->sched_flags */
+/* ctx->sched_flags and gang->sched_flags */
 enum {
 	SPU_SCHED_NOTIFY_ACTIVE,
-	SPU_SCHED_WAS_ACTIVE,	/* was active upon spu_acquire_saved()  */
 	SPU_SCHED_SPU_RUN,	/* context is within spu_run */
+	SPU_SCHED_JUST_CREATED,	/* context created but not started */
+	SPU_SCHED_DEACTIVATED,	/* gang has been deactivated */
 };
 
 enum {
@@ -121,7 +122,6 @@  struct spu_context {
 	pid_t tid;
 
 	/* scheduler fields */
-	struct list_head rq;
 	unsigned int time_slice;
 	unsigned long sched_flags;
 	cpumask_t cpus_allowed;
@@ -165,7 +165,12 @@  struct spu_gang {
 	cpumask_t cpus_allowed;
 	int policy;
 	int prio;
+	atomic_t nstarted;
+	atomic_t nrunnable;
+	unsigned long sched_flags;
+	struct list_head rq;
 
+	/* spu scheduler affinity fields */
 	struct spu_context *aff_ref_ctx;
 	struct list_head aff_list_head;
 	struct mutex aff_mutex;
@@ -174,6 +179,11 @@  struct spu_gang {
 	atomic_t aff_sched_count;
 };
 
+static inline int spu_gang_runnable(struct spu_gang *g)
+{
+	return (g->contexts && (atomic_read(&g->nstarted) == g->contexts));
+}
+
 /* Flag bits for spu_gang aff_flags */
 #define AFF_OFFSETS_SET		1
 #define AFF_MERGED		2
@@ -298,9 +308,8 @@  int __must_check spu_acquire_saved(struct spu_context *ctx);
 void spu_release_saved(struct spu_context *ctx);
 
 int spu_stopped(struct spu_context *ctx, u32 * stat);
-void spu_del_from_rq(struct spu_context *ctx);
 int spu_activate(struct spu_context *ctx, unsigned long flags);
-void spu_deactivate(struct spu_context *ctx);
+void spu_deactivate(struct spu_gang *gang);
 void spu_yield(struct spu_context *ctx);
 void spu_switch_notify(struct spu *spu, struct spu_context *ctx);
 void spu_switch_log_notify(struct spu *spu, struct spu_context *ctx,
@@ -377,6 +386,12 @@  extern void spu_free_lscsa(struct spu_state *csa);
 extern void spuctx_switch_state(struct spu_context *ctx,
 		enum spu_utilization_state new_state);
 
+#define spu_sched_stats(name, maxcontexts, active, preempted) \
+	trace_mark(name, "%d %d %d", maxcontexts, active, preempted);
+#define spu_gang_trace(name, gang, cnt) \
+	trace_mark(name, "%p %d", gang, cnt);
+#define spu_gang_nospu_trace(name, gang, cnt, found) \
+	trace_mark(name, "%p %d %d", gang, cnt, found);
 #define spu_context_trace(name, ctx, spu) \
 	trace_mark(name, "ctx %p spu %p", ctx, spu);
 #define spu_context_nospu_trace(name, ctx) \