diff mbox series

[committed,amdgcn] Limit LDS usage

Message ID 3873d05a-c08b-70e9-8112-5beb0f6dbddb@codesourcery.com
State New

Commit Message

Andrew Stubbs Nov. 22, 2019, 5:02 p.m. UTC
This patch changes the amount of LDS (Local Data Store) memory requested 
for offload kernels. This allows more teams/gangs to run on the same 
compute unit, increasing potential data throughput.

For OpenMP we can reduce the allocation to almost nothing. This means we 
can have up to 40 single-thread teams per CU.

For OpenACC we need enough LDS to broadcast data between workers, and 
the algorithm is not particularly memory efficient. This means we cannot 
yet achieve the maximum thread count, but we can at least double the 
current thread count, to 32, by halving the LDS usage and relying 
on having 16 workers. (Note that I'm assuming Julian's multi-worker 
support patches will be committed soon. Without those we can allocate no 
LDS and have 40 single-worker teams. With the patches the same can also 
be true, but that's still on the to-do list.)

LDS allocation remains unchanged for non-offload compiles (this is only 
really used for running the testsuite).

Patch

Limit LDS usage.

2019-11-22  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* config/gcn/gcn.c (OMP_LDS_SIZE): Define.
	(ACC_LDS_SIZE): Define.
	(OTHER_LDS_SIZE): Define.
	(LDS_SIZE): Redefine using above.
	(gcn_expand_prologue): Initialize m0 with LDS_SIZE-1.

diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index 3a8c10ed8b4..f85d84bbe95 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -70,10 +70,15 @@  int gcn_isa = 3;		/* Default to GCN3.  */
    worker-single mode to worker-partitioned mode), per workgroup.  Global
    analysis could calculate an exact bound, but we don't do that yet.
  
-   We reserve the whole LDS, which also prevents any other workgroup
-   sharing the Compute Unit.  */
+   We want to permit full occupancy, so size accordingly.  */
 
-#define LDS_SIZE 65536
+#define OMP_LDS_SIZE 0x600    /* 0x600 is 1/40 total, rounded down.  */
+#define ACC_LDS_SIZE 32768    /* Half of the total should be fine.  */
+#define OTHER_LDS_SIZE 65536  /* If in doubt, reserve all of it.  */
+
+#define LDS_SIZE (flag_openacc ? ACC_LDS_SIZE \
+		  : flag_openmp ? OMP_LDS_SIZE \
+		  : OTHER_LDS_SIZE)
 
 /* The number of registers usable by normal non-kernel functions.
    The SGPR count includes any special extra registers such as VCC.  */
@@ -2876,8 +2881,11 @@  gcn_expand_prologue ()
   /* Ensure that the scheduler doesn't do anything unexpected.  */
   emit_insn (gen_blockage ());
 
+  /* m0 is initialized for the usual LDS DS and FLAT memory case.
+     The low-part is the address of the topmost addressable byte, which is
+     size-1.  The high-part is an offset and should be zero.  */
   emit_move_insn (gen_rtx_REG (SImode, M0_REG),
-		  gen_int_mode (LDS_SIZE, SImode));
+		  gen_int_mode (LDS_SIZE-1, SImode));
 
   emit_insn (gen_prologue_use (gen_rtx_REG (SImode, M0_REG)));