[DOC] PowerPC extended asm example

Message ID 20170404121450.GF16711@bubble.grove.modra.org
State New

Commit Message

Alan Modra April 4, 2017, 12:14 p.m. UTC
Revised patch.

	* doc/extend.texi (Extended Asm <Clobbers>): Rename to
	"Clobbers and Scratch Registers".  Add OpenBLAS example.

Comments

Sandra Loosemore April 5, 2017, 3:37 p.m. UTC | #1
On 04/04/2017 06:14 AM, Alan Modra wrote:
> Revised patch.
>
> [snip]
> +@smallexample
> +static void
> +dgemv_kernel_4x4 (long n, const double *ap, long lda,
> +                  const double *x, double *y, double alpha)
> +@{
> +  double *a0;
> +  double *a1;
> +  double *a2;
> +  double *a3;
> +
> +  __asm__
> +    (
> +       "lxvd2x		34, 0, %10	\n\t"	// x0, x1
> +       "lxvd2x		35, %11, %10	\n\t"	// x2, x3
> +       "xxspltd		32, %x9, 0	\n\t"	// alpha, alpha
> +       "sldi		%6, %13, 3	\n\t"	// lda * sizeof (double)
> +       "xvmuldp		34, 34, 32	\n\t"	// x0 * alpha, x1 * alpha
> +       "xvmuldp		35, 35, 32	\n\t"	// x2 * alpha, x3 * alpha
> +       "add		%4, %3, %6	\n\t"	// a0 = ap, a1 = a0 + lda
> +       "add		%6, %6, %6	\n\t"	// 2 * lda
> +       "xxspltd		32, 34, 0	\n\t"	// x0 * alpha, x0 * alpha
> +       "xxspltd		33, 34, 1	\n\t"	// x1 * alpha, x1 * alpha
> +       "xxspltd		34, 35, 0	\n\t"	// x2 * alpha, x2 * alpha
> +       "xxspltd		35, 35, 1	\n\t"	// x3 * alpha, x3 * alpha
> +       "add		%5, %3, %6	\n\t"	// a2 = a0 + 2 * lda
> +       "add		%6, %4, %6	\n\t"	// a3 = a1 + 2 * lda
> +     ...
> +     "#n=%1 ap=%8=%12 lda=%13 x=%7=%10 y=%0=%2 alpha=%9 o16=%11\n"
> +     "#a0=%3 a1=%4 a2=%5 a3=%6"
> +     :
> +       "+m" (*y),
> +       "+r" (n),	// 1
> +       "+b" (y),	// 2
> +       "=b" (a0),	// 3
> +       "=b" (a1),	// 4
> +       "=&b" (a2),	// 5
> +       "=&b" (a3)	// 6
> +     :
> +       "m" (*x),
> +       "m" (*ap),
> +       "d" (alpha),	// 9
> +       "r" (x),		// 10
> +       "b" (16),	// 11
> +       "3" (ap),	// 12
> +       "4" (lda)	// 13
> +     :
> +       "cr0",
> +       "vs32","vs33","vs34","vs35","vs36","vs37",
> +       "vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47"
> +     );
> +@}
> +@end smallexample

Hmmm.  My main objection to this version is that it's unintelligible to 
anyone who can't parse PowerPC assembly language without the help of an 
architecture manual, and that's probably the majority of readers.

I'm now wondering if it would be better to have a series of small 
examples showing these tricks individually instead of one giant example 
that tries to illustrate multiple things?

-Sandra

Alan Modra April 6, 2017, 1:38 a.m. UTC | #2
On Wed, Apr 05, 2017 at 09:37:04AM -0600, Sandra Loosemore wrote:
> On 04/04/2017 06:14 AM, Alan Modra wrote:
> >Revised patch.
> >
> >[snip]
> 
> Hmmm.  My main objection to this version is that it's unintelligible to
> anyone who can't parse PowerPC assembly language without the help of an
> architecture manual, and that's probably the majority of readers.

Heh, even I have trouble parsing some PowerPC assembly!  That's why
there are a few lines of text describing what the assembly code does.
I am concerned that the 14 lines of assembly shown make the example
too big, but it's harder to describe code that isn't shown than to
describe something under the nose of the reader.

> I'm now wondering if it would be better to have a series of small examples
> showing these tricks individually instead of one giant example that tries to
> illustrate multiple things?

Possibly, but this example comes after many others.  If people have
waded this far into the asm section of the manual they shouldn't have
too much trouble understanding the concepts here.

Also, there's value in a real-world example.  Maybe that's just me.
I'm not someone who tends to read manuals first, preferring to dive
right in then go back to a manual later for some detail that can't be
easily deduced.  In fact, I have a distrust of manuals.  ;)  This
isn't a criticism of the gcc manual, but other documents I've read
over the years are often just plain wrong.  I've even been the
*author* of technical documentation that had errors, some by yours
truly, and some introduced by a "technical writer" who edited my input
to make it read better, in the process accidentally changing something
that made the details incorrect.  I'm sure others have had the same
experience.  So I like *and trust* code snippets taken from working
code more than made up examples created for documentation.
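
For readers who want the scratch-register idea in isolation, a much
smaller sketch than the OpenBLAS kernel may help.  The following is
not taken from the patch or from OpenBLAS: the function name
sum_of_squares and both asm bodies are illustrative assumptions, and
the PowerPC branch in particular has not been exercised here.  The
scratch variable tmp is an early-clobber output because it is written
while b is still unread; result can be a normal output because it is
first written by the instruction that reads the last input.

```c
#include <assert.h>

/* Hypothetical helper (not from the patch): returns a*a + b*b,
   assuming the arithmetic does not overflow a long.  */
long
sum_of_squares (long a, long b)
{
  long result;
  long tmp;   /* scratch; the compiler chooses the register */

#if defined (__x86_64__)
  __asm__ ("mov   %[a], %[tmp]\n\t"           /* tmp = a */
           "imul  %[tmp], %[tmp]\n\t"         /* tmp = a * a */
           "mov   %[b], %[result]\n\t"        /* result = b */
           "imul  %[result], %[result]\n\t"   /* result = b * b */
           "add   %[tmp], %[result]"          /* result += tmp */
           : [result] "=r" (result),  /* written only once inputs are read */
             [tmp] "=&r" (tmp)        /* written while b is still live */
           : [a] "r" (a),
             [b] "r" (b)
           : "cc");
#elif defined (__powerpc64__)
  __asm__ ("mulld %[tmp], %[a], %[a]\n\t"         /* tmp = a * a */
           "mulld %[result], %[b], %[b]\n\t"      /* result = b * b */
           "add   %[result], %[result], %[tmp]"   /* result += tmp */
           : [result] "=r" (result),
             [tmp] "=&r" (tmp)
           : [a] "r" (a),
             [b] "r" (b));
#else
  /* Plain C fallback so the sketch builds on other targets.  */
  tmp = a * a;
  result = b * b + tmp;
#endif

  /* Cross-check against plain C; compiled out when NDEBUG is defined.  */
  assert (result == a * a + b * b);
  return result;
}
```

Calling sum_of_squares (3, 4) should give 25 on every branch.  No
fixed register appears in the clobber list, so the register allocator
is free to place tmp wherever it likes.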

Patch

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 0f44ece..0b0a021 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -7869,7 +7869,7 @@  A comma-separated list of C expressions read by the instructions in the
 @item Clobbers
 A comma-separated list of registers or other values changed by the 
 @var{AssemblerTemplate}, beyond those listed as outputs.
-An empty list is permitted.  @xref{Clobbers}.
+An empty list is permitted.  @xref{Clobbers and Scratch Registers}.
 
 @item GotoLabels
 When you are using the @code{goto} form of @code{asm}, this section contains 
@@ -8229,7 +8229,7 @@  The enclosing parentheses are a required part of the syntax.
 
 When the compiler selects the registers to use to 
 represent the output operands, it does not use any of the clobbered registers 
-(@pxref{Clobbers}).
+(@pxref{Clobbers and Scratch Registers}).
 
 Output operand expressions must be lvalues. The compiler cannot check whether 
 the operands have data types that are reasonable for the instruction being 
@@ -8465,7 +8465,8 @@  as input.  The enclosing parentheses are a required part of the syntax.
 @end table
 
 When the compiler selects the registers to use to represent the input 
-operands, it does not use any of the clobbered registers (@pxref{Clobbers}).
+operands, it does not use any of the clobbered registers
+(@pxref{Clobbers and Scratch Registers}).
 
 If there are no output operands but there are input operands, place two 
 consecutive colons where the output operands would go:
@@ -8516,9 +8517,10 @@  asm ("cmoveq %1, %2, %[result]"
    : "r" (test), "r" (new), "[result]" (old));
 @end example
 
-@anchor{Clobbers}
-@subsubsection Clobbers
+@anchor{Clobbers and Scratch Registers}
+@subsubsection Clobbers and Scratch Registers
 @cindex @code{asm} clobbers
+@cindex @code{asm} scratch registers
 
 While the compiler is aware of changes to entries listed in the output 
 operands, the inline @code{asm} code may modify more than just the outputs. For 
@@ -8589,6 +8591,110 @@  ten bytes of a string, use a memory input like:
 
 @end table
 
+Rather than naming fixed registers in the clobber list to obtain
+scratch registers for an @code{asm} statement, you can use techniques
+that leave the choice of register to the compiler.  There are also
+more precise ways than a @code{"memory"} clobber to tell GCC that an
+@code{asm} statement reads or modifies memory.  The following PowerPC
+example, taken from OpenBLAS, illustrates some of these
+techniques.
+
+In the function shown below, all of the function parameters are inputs
+except for the @code{y} array, which is modified by the function.
+Only the first few lines of assembly in the @code{asm} statement are
+shown, followed by a comment that is handy for checking register
+assignments.  These insns set up some registers for later use in
+loops and, in particular, set up four pointers into the @code{ap}
+array: @code{a0=ap}, @code{a1=ap+lda}, @code{a2=ap+2*lda}, and
+@code{a3=ap+3*lda}.  The rest of the assembly is too large to include here.
+
+@smallexample
+static void
+dgemv_kernel_4x4 (long n, const double *ap, long lda,
+                  const double *x, double *y, double alpha)
+@{
+  double *a0;
+  double *a1;
+  double *a2;
+  double *a3;
+
+  __asm__
+    (
+       "lxvd2x		34, 0, %10	\n\t"	// x0, x1
+       "lxvd2x		35, %11, %10	\n\t"	// x2, x3
+       "xxspltd		32, %x9, 0	\n\t"	// alpha, alpha
+       "sldi		%6, %13, 3	\n\t"	// lda * sizeof (double)
+       "xvmuldp		34, 34, 32	\n\t"	// x0 * alpha, x1 * alpha
+       "xvmuldp		35, 35, 32	\n\t"	// x2 * alpha, x3 * alpha
+       "add		%4, %3, %6	\n\t"	// a0 = ap, a1 = a0 + lda
+       "add		%6, %6, %6	\n\t"	// 2 * lda
+       "xxspltd		32, 34, 0	\n\t"	// x0 * alpha, x0 * alpha
+       "xxspltd		33, 34, 1	\n\t"	// x1 * alpha, x1 * alpha
+       "xxspltd		34, 35, 0	\n\t"	// x2 * alpha, x2 * alpha
+       "xxspltd		35, 35, 1	\n\t"	// x3 * alpha, x3 * alpha
+       "add		%5, %3, %6	\n\t"	// a2 = a0 + 2 * lda
+       "add		%6, %4, %6	\n\t"	// a3 = a1 + 2 * lda
+     ...
+     "#n=%1 ap=%8=%12 lda=%13 x=%7=%10 y=%0=%2 alpha=%9 o16=%11\n"
+     "#a0=%3 a1=%4 a2=%5 a3=%6"
+     :
+       "+m" (*y),
+       "+r" (n),	// 1
+       "+b" (y),	// 2
+       "=b" (a0),	// 3
+       "=b" (a1),	// 4
+       "=&b" (a2),	// 5
+       "=&b" (a3)	// 6
+     :
+       "m" (*x),
+       "m" (*ap),
+       "d" (alpha),	// 9
+       "r" (x),		// 10
+       "b" (16),	// 11
+       "3" (ap),	// 12
+       "4" (lda)	// 13
+     :
+       "cr0",
+       "vs32","vs33","vs34","vs35","vs36","vs37",
+       "vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47"
+     );
+@}
+@end smallexample
+
+To allocate a scratch register, declare a variable and make it an
+early-clobber @code{asm} output, as with @code{a2} and @code{a3}, or
+an output tied to an input, as with @code{a0} and @code{a1}.  You can
+use a normal @code{asm} output if all inputs that might share the same
+register are consumed before the scratch is written.  The VSX
+registers clobbered by the @code{asm} statement could have used the
+same technique were it not for GCC's limit on the number of @code{asm}
+operands.  Given the description above, it should be no surprise that
+@code{a0} is tied to @code{ap}; @code{lda} is only used in the fourth
+machine insn shown above, so its register is available for reuse as
+@code{a1}.  Note that tying an input to an output is the way to set
+up an initialized temporary register that is modified by an
+@code{asm} statement.  The example also shows an initialized register
+that is left unchanged by the @code{asm} statement: @code{"b" (16)}
+initializes @code{%11} to 16.
+
+Rather than using a @code{"memory"} clobber, the @code{asm} has
+@code{"+m" (*y)} in the list of outputs to tell GCC that the @code{y}
+array is both read and written by the @code{asm} statement.
+@code{"m" (*x)} and @code{"m" (*ap)} in the inputs tell GCC that these
+arrays are read.  At a minimum, aliasing rules allow GCC to know what
+memory @emph{doesn't} need to be flushed, and if the function were
+inlined then GCC may be able to do even better.  Also, if GCC can
+prove that all of the outputs of an @code{asm} statement are unused,
+then the @code{asm} may be deleted.  Removal of dead @code{asm}
+statements will not happen if they clobber @code{"memory"}.  Notice
+that @code{x}, @code{y}, and @code{ap} all appear twice in the
+@code{asm} operands, once to specify the memory accessed and once to
+specify a base register used by the @code{asm}.  You won't normally be
+wasting a register by doing this, as GCC can use the same register for
+both purposes.  However, @code{%0} is a memory reference and
+@code{%2} is a register, so it would be foolish to use both for
+@code{y} in the template and expect them to be interchangeable.
+
 @anchor{GotoLabels}
 @subsubsection Goto Labels
 @cindex @code{asm} goto labels
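
The memory-operand technique described in the patch text can also be
sketched on its own.  This example is an assumption-laden
illustration, not part of the patch: the names struct pair and
add_pair are hypothetical, and the asm body is written for x86-64
with a plain C fallback elsewhere.  As far as GCC's view of memory is
concerned, a memory operand covers only the object named by its
expression, so the sketch wraps both elements in a struct, in the
spirit of the ten-byte string idiom that the surrounding manual text
mentions.  Each array appears twice in the operand list, once as a
memory operand and once as a base register, and there is no "memory"
clobber.

```c
#include <assert.h>

/* Hypothetical wrapper type: lets one memory operand name both elements.  */
struct pair { long v[2]; };

/* Hypothetical helper: y[0] += x[0]; y[1] += x[1].  */
void
add_pair (long *y, const long *x)
{
#if defined (__x86_64__) && defined (__LP64__)
  long tmp;   /* scratch; the compiler chooses the register */
#endif

  assert (y != 0 && x != 0);

#if defined (__x86_64__) && defined (__LP64__)
  __asm__ ("mov  (%[xp]), %[tmp]\n\t"    /* tmp = x[0] */
           "add  %[tmp], (%[yp])\n\t"    /* y[0] += tmp */
           "mov  8(%[xp]), %[tmp]\n\t"   /* tmp = x[1] */
           "add  %[tmp], 8(%[yp])"       /* y[1] += tmp */
           : "+m" (*(struct pair *) y),        /* y[0..1] read and written */
             [tmp] "=&r" (tmp)                 /* written while xp, yp live */
           : "m" (*(const struct pair *) x),   /* x[0..1] read */
             [yp] "r" (y),                     /* base register for y */
             [xp] "r" (x)                      /* base register for x */
           : "cc");
#else
  /* Plain C fallback so the sketch builds on other targets.  */
  y[0] += x[0];
  y[1] += x[1];
#endif
}
```

The two memory operands are never referenced in the template; they
exist only to describe to the compiler which memory the asm touches,
just as %0, %7 and %8 do in the OpenBLAS example.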