@@ -7047,217 +7047,325 @@ vextq_p16 (poly16x8_t __a, poly16x8_t __b, const int __c)
__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
vrev64_s8 (int8x8_t __a)
{
- return (int8x8_t)__builtin_neon_vrev64v8qi (__a, 1);
+ int8x8_t __rv;
+ uint8x8_t __mask1 = {7, 6, 5, 4, 3, 2, 1, 0};
+ __rv = (int8x8_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
vrev64_s16 (int16x4_t __a)
{
- return (int16x4_t)__builtin_neon_vrev64v4hi (__a, 1);
+ int16x4_t __rv;
+ uint16x4_t __mask1 = {3, 2, 1, 0};
+ __rv = (int16x4_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
vrev64_s32 (int32x2_t __a)
{
- return (int32x2_t)__builtin_neon_vrev64v2si (__a, 1);
+ int32x2_t __rv;
+ uint32x2_t __mask1 = {1, 0};
+ __rv = (int32x2_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
vrev64_f32 (float32x2_t __a)
{
- return (float32x2_t)__builtin_neon_vrev64v2sf (__a, 3);
+ float32x2_t __rv;
+ uint32x2_t __mask1 = {1, 0};
+ __rv = (float32x2_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
vrev64_u8 (uint8x8_t __a)
{
- return (uint8x8_t)__builtin_neon_vrev64v8qi ((int8x8_t) __a, 0);
+ uint8x8_t __rv;
+ uint8x8_t __mask1 = {7, 6, 5, 4, 3, 2, 1, 0};
+ __rv = (uint8x8_t) __builtin_shuffle ((int8x8_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
vrev64_u16 (uint16x4_t __a)
{
- return (uint16x4_t)__builtin_neon_vrev64v4hi ((int16x4_t) __a, 0);
+ uint16x4_t __rv;
+ uint16x4_t __mask1 = {3, 2, 1, 0};
+ __rv = (uint16x4_t) __builtin_shuffle ((int16x4_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
vrev64_u32 (uint32x2_t __a)
{
- return (uint32x2_t)__builtin_neon_vrev64v2si ((int32x2_t) __a, 0);
+ uint32x2_t __rv;
+ uint32x2_t __mask1 = {1, 0};
+ __rv = (uint32x2_t) __builtin_shuffle ((int32x2_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
vrev64_p8 (poly8x8_t __a)
{
- return (poly8x8_t)__builtin_neon_vrev64v8qi ((int8x8_t) __a, 2);
+ poly8x8_t __rv;
+ uint8x8_t __mask1 = {7, 6, 5, 4, 3, 2, 1, 0};
+ __rv = (poly8x8_t) __builtin_shuffle ((int8x8_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline poly16x4_t __attribute__ ((__always_inline__))
vrev64_p16 (poly16x4_t __a)
{
- return (poly16x4_t)__builtin_neon_vrev64v4hi ((int16x4_t) __a, 2);
+ poly16x4_t __rv;
+ uint16x4_t __mask1 = {3, 2, 1, 0};
+ __rv = (poly16x4_t) __builtin_shuffle ((int16x4_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
vrev64q_s8 (int8x16_t __a)
{
- return (int8x16_t)__builtin_neon_vrev64v16qi (__a, 1);
+ int8x16_t __rv;
+ uint8x16_t __mask1 = {7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8};
+ __rv = (int8x16_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
vrev64q_s16 (int16x8_t __a)
{
- return (int16x8_t)__builtin_neon_vrev64v8hi (__a, 1);
+ int16x8_t __rv;
+ uint16x8_t __mask1 = {3, 2, 1, 0, 7, 6, 5, 4};
+ __rv = (int16x8_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
vrev64q_s32 (int32x4_t __a)
{
- return (int32x4_t)__builtin_neon_vrev64v4si (__a, 1);
+ int32x4_t __rv;
+ uint32x4_t __mask1 = {1, 0, 3, 2};
+ __rv = (int32x4_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
vrev64q_f32 (float32x4_t __a)
{
- return (float32x4_t)__builtin_neon_vrev64v4sf (__a, 3);
+ float32x4_t __rv;
+ uint32x4_t __mask1 = {1, 0, 3, 2};
+ __rv = (float32x4_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
vrev64q_u8 (uint8x16_t __a)
{
- return (uint8x16_t)__builtin_neon_vrev64v16qi ((int8x16_t) __a, 0);
+ uint8x16_t __rv;
+ uint8x16_t __mask1 = {7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8};
+ __rv = (uint8x16_t) __builtin_shuffle ((int8x16_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
vrev64q_u16 (uint16x8_t __a)
{
- return (uint16x8_t)__builtin_neon_vrev64v8hi ((int16x8_t) __a, 0);
+ uint16x8_t __rv;
+ uint16x8_t __mask1 = {3, 2, 1, 0, 7, 6, 5, 4};
+ __rv = (uint16x8_t) __builtin_shuffle ((int16x8_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
vrev64q_u32 (uint32x4_t __a)
{
- return (uint32x4_t)__builtin_neon_vrev64v4si ((int32x4_t) __a, 0);
+ uint32x4_t __rv;
+ uint32x4_t __mask1 = {1, 0, 3, 2};
+ __rv = (uint32x4_t) __builtin_shuffle ((int32x4_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline poly8x16_t __attribute__ ((__always_inline__))
vrev64q_p8 (poly8x16_t __a)
{
- return (poly8x16_t)__builtin_neon_vrev64v16qi ((int8x16_t) __a, 2);
+ poly8x16_t __rv;
+ uint8x16_t __mask1 = {7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8};
+ __rv = (poly8x16_t) __builtin_shuffle ((int8x16_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline poly16x8_t __attribute__ ((__always_inline__))
vrev64q_p16 (poly16x8_t __a)
{
- return (poly16x8_t)__builtin_neon_vrev64v8hi ((int16x8_t) __a, 2);
+ poly16x8_t __rv;
+ uint16x8_t __mask1 = {3, 2, 1, 0, 7, 6, 5, 4};
+ __rv = (poly16x8_t) __builtin_shuffle ((int16x8_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
vrev32_s8 (int8x8_t __a)
{
- return (int8x8_t)__builtin_neon_vrev32v8qi (__a, 1);
+ int8x8_t __rv;
+ uint8x8_t __mask1 = {3, 2, 1, 0, 7, 6, 5, 4};
+ __rv = (int8x8_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
vrev32_s16 (int16x4_t __a)
{
- return (int16x4_t)__builtin_neon_vrev32v4hi (__a, 1);
+ int16x4_t __rv;
+ uint16x4_t __mask1 = {1, 0, 3, 2};
+ __rv = (int16x4_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
vrev32_u8 (uint8x8_t __a)
{
- return (uint8x8_t)__builtin_neon_vrev32v8qi ((int8x8_t) __a, 0);
+ uint8x8_t __rv;
+ uint8x8_t __mask1 = {3, 2, 1, 0, 7, 6, 5, 4};
+ __rv = (uint8x8_t) __builtin_shuffle ((int8x8_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
vrev32_u16 (uint16x4_t __a)
{
- return (uint16x4_t)__builtin_neon_vrev32v4hi ((int16x4_t) __a, 0);
+ uint16x4_t __rv;
+ uint16x4_t __mask1 = {1, 0, 3, 2};
+ __rv = (uint16x4_t) __builtin_shuffle ((int16x4_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
vrev32_p8 (poly8x8_t __a)
{
- return (poly8x8_t)__builtin_neon_vrev32v8qi ((int8x8_t) __a, 2);
+ poly8x8_t __rv;
+ uint8x8_t __mask1 = {3, 2, 1, 0, 7, 6, 5, 4};
+ __rv = (poly8x8_t) __builtin_shuffle ((int8x8_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline poly16x4_t __attribute__ ((__always_inline__))
vrev32_p16 (poly16x4_t __a)
{
- return (poly16x4_t)__builtin_neon_vrev32v4hi ((int16x4_t) __a, 2);
+ poly16x4_t __rv;
+ uint16x4_t __mask1 = {1, 0, 3, 2};
+ __rv = (poly16x4_t) __builtin_shuffle ((int16x4_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
vrev32q_s8 (int8x16_t __a)
{
- return (int8x16_t)__builtin_neon_vrev32v16qi (__a, 1);
+ int8x16_t __rv;
+ uint8x16_t __mask1 = {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12};
+ __rv = (int8x16_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
vrev32q_s16 (int16x8_t __a)
{
- return (int16x8_t)__builtin_neon_vrev32v8hi (__a, 1);
+ int16x8_t __rv;
+ uint16x8_t __mask1 = {1, 0, 3, 2, 5, 4, 7, 6};
+ __rv = (int16x8_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
vrev32q_u8 (uint8x16_t __a)
{
- return (uint8x16_t)__builtin_neon_vrev32v16qi ((int8x16_t) __a, 0);
+ uint8x16_t __rv;
+ uint8x16_t __mask1 = {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12};
+ __rv = (uint8x16_t) __builtin_shuffle ((int8x16_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
vrev32q_u16 (uint16x8_t __a)
{
- return (uint16x8_t)__builtin_neon_vrev32v8hi ((int16x8_t) __a, 0);
+ uint16x8_t __rv;
+ uint16x8_t __mask1 = {1, 0, 3, 2, 5, 4, 7, 6};
+ __rv = (uint16x8_t) __builtin_shuffle ((int16x8_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline poly8x16_t __attribute__ ((__always_inline__))
vrev32q_p8 (poly8x16_t __a)
{
- return (poly8x16_t)__builtin_neon_vrev32v16qi ((int8x16_t) __a, 2);
+ poly8x16_t __rv;
+ uint8x16_t __mask1 = {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12};
+ __rv = (poly8x16_t) __builtin_shuffle ((int8x16_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline poly16x8_t __attribute__ ((__always_inline__))
vrev32q_p16 (poly16x8_t __a)
{
- return (poly16x8_t)__builtin_neon_vrev32v8hi ((int16x8_t) __a, 2);
+ poly16x8_t __rv;
+ uint16x8_t __mask1 = {1, 0, 3, 2, 5, 4, 7, 6};
+ __rv = (poly16x8_t) __builtin_shuffle ((int16x8_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
vrev16_s8 (int8x8_t __a)
{
- return (int8x8_t)__builtin_neon_vrev16v8qi (__a, 1);
+ int8x8_t __rv;
+ uint8x8_t __mask1 = {1, 0, 3, 2, 5, 4, 7, 6};
+ __rv = (int8x8_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
vrev16_u8 (uint8x8_t __a)
{
- return (uint8x8_t)__builtin_neon_vrev16v8qi ((int8x8_t) __a, 0);
+ uint8x8_t __rv;
+ uint8x8_t __mask1 = {1, 0, 3, 2, 5, 4, 7, 6};
+ __rv = (uint8x8_t) __builtin_shuffle ((int8x8_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
vrev16_p8 (poly8x8_t __a)
{
- return (poly8x8_t)__builtin_neon_vrev16v8qi ((int8x8_t) __a, 2);
+ poly8x8_t __rv;
+ uint8x8_t __mask1 = {1, 0, 3, 2, 5, 4, 7, 6};
+ __rv = (poly8x8_t) __builtin_shuffle ((int8x8_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
vrev16q_s8 (int8x16_t __a)
{
- return (int8x16_t)__builtin_neon_vrev16v16qi (__a, 1);
+ int8x16_t __rv;
+ uint8x16_t __mask1 = {1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14};
+ __rv = (int8x16_t) __builtin_shuffle (__a , __mask1);
+ return __rv;
}
__extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
vrev16q_u8 (uint8x16_t __a)
{
- return (uint8x16_t)__builtin_neon_vrev16v16qi ((int8x16_t) __a, 0);
+ uint8x16_t __rv;
+ uint8x16_t __mask1 = {1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14};
+ __rv = (uint8x16_t) __builtin_shuffle ((int8x16_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline poly8x16_t __attribute__ ((__always_inline__))
vrev16q_p8 (poly8x16_t __a)
{
- return (poly8x16_t)__builtin_neon_vrev16v16qi ((int8x16_t) __a, 2);
+ poly8x16_t __rv;
+ uint8x16_t __mask1 = {1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14};
+ __rv = (poly8x16_t) __builtin_shuffle ((int8x16_t) __a , __mask1);
+ return __rv;
}
__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
@@ -7396,7 +7504,10 @@ __extension__ static __inline int8x8x2_t __attribute__ ((__always_inline__))
vtrn_s8 (int8x8_t __a, int8x8_t __b)
{
int8x8x2_t __rv;
- __builtin_neon_vtrnv8qi (&__rv.val[0], __a, __b);
+ uint8x8_t __mask1 = {0, 8, 2, 10, 4, 12, 6, 14};
+ uint8x8_t __mask2 = {1, 9, 3, 11, 5, 13, 7, 15};
+ __rv.val[0] = (int8x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int8x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7404,7 +7515,10 @@ __extension__ static __inline int16x4x2_t __attribute__ ((__always_inline__))
vtrn_s16 (int16x4_t __a, int16x4_t __b)
{
int16x4x2_t __rv;
- __builtin_neon_vtrnv4hi (&__rv.val[0], __a, __b);
+ uint16x4_t __mask1 = {0, 4, 2, 6};
+ uint16x4_t __mask2 = {1, 5, 3, 7};
+ __rv.val[0] = (int16x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int16x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7412,7 +7526,10 @@ __extension__ static __inline int32x2x2_t __attribute__ ((__always_inline__))
vtrn_s32 (int32x2_t __a, int32x2_t __b)
{
int32x2x2_t __rv;
- __builtin_neon_vtrnv2si (&__rv.val[0], __a, __b);
+ uint32x2_t __mask1 = {0, 2};
+ uint32x2_t __mask2 = {1, 3};
+ __rv.val[0] = (int32x2_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int32x2_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7420,7 +7537,10 @@ __extension__ static __inline float32x2x2_t __attribute__ ((__always_inline__))
vtrn_f32 (float32x2_t __a, float32x2_t __b)
{
float32x2x2_t __rv;
- __builtin_neon_vtrnv2sf (&__rv.val[0], __a, __b);
+ uint32x2_t __mask1 = {0, 2};
+ uint32x2_t __mask2 = {1, 3};
+ __rv.val[0] = (float32x2_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (float32x2_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7428,7 +7548,10 @@ __extension__ static __inline uint8x8x2_t __attribute__ ((__always_inline__))
vtrn_u8 (uint8x8_t __a, uint8x8_t __b)
{
uint8x8x2_t __rv;
- __builtin_neon_vtrnv8qi ((int8x8_t *) &__rv.val[0], (int8x8_t) __a, (int8x8_t) __b);
+ uint8x8_t __mask1 = {0, 8, 2, 10, 4, 12, 6, 14};
+ uint8x8_t __mask2 = {1, 9, 3, 11, 5, 13, 7, 15};
+ __rv.val[0] = (uint8x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint8x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7436,7 +7559,10 @@ __extension__ static __inline uint16x4x2_t __attribute__ ((__always_inline__))
vtrn_u16 (uint16x4_t __a, uint16x4_t __b)
{
uint16x4x2_t __rv;
- __builtin_neon_vtrnv4hi ((int16x4_t *) &__rv.val[0], (int16x4_t) __a, (int16x4_t) __b);
+ uint16x4_t __mask1 = {0, 4, 2, 6};
+ uint16x4_t __mask2 = {1, 5, 3, 7};
+ __rv.val[0] = (uint16x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint16x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7444,7 +7570,10 @@ __extension__ static __inline uint32x2x2_t __attribute__ ((__always_inline__))
vtrn_u32 (uint32x2_t __a, uint32x2_t __b)
{
uint32x2x2_t __rv;
- __builtin_neon_vtrnv2si ((int32x2_t *) &__rv.val[0], (int32x2_t) __a, (int32x2_t) __b);
+ uint32x2_t __mask1 = {0, 2};
+ uint32x2_t __mask2 = {1, 3};
+ __rv.val[0] = (uint32x2_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint32x2_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7452,7 +7581,10 @@ __extension__ static __inline poly8x8x2_t __attribute__ ((__always_inline__))
vtrn_p8 (poly8x8_t __a, poly8x8_t __b)
{
poly8x8x2_t __rv;
- __builtin_neon_vtrnv8qi ((int8x8_t *) &__rv.val[0], (int8x8_t) __a, (int8x8_t) __b);
+ uint8x8_t __mask1 = {0, 8, 2, 10, 4, 12, 6, 14};
+ uint8x8_t __mask2 = {1, 9, 3, 11, 5, 13, 7, 15};
+ __rv.val[0] = (poly8x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (poly8x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7460,7 +7592,10 @@ __extension__ static __inline poly16x4x2_t __attribute__ ((__always_inline__))
vtrn_p16 (poly16x4_t __a, poly16x4_t __b)
{
poly16x4x2_t __rv;
- __builtin_neon_vtrnv4hi ((int16x4_t *) &__rv.val[0], (int16x4_t) __a, (int16x4_t) __b);
+ uint16x4_t __mask1 = {0, 4, 2, 6};
+ uint16x4_t __mask2 = {1, 5, 3, 7};
+ __rv.val[0] = (poly16x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (poly16x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7468,7 +7603,10 @@ __extension__ static __inline int8x16x2_t __attribute__ ((__always_inline__))
vtrnq_s8 (int8x16_t __a, int8x16_t __b)
{
int8x16x2_t __rv;
- __builtin_neon_vtrnv16qi (&__rv.val[0], __a, __b);
+ uint8x16_t __mask1 = {0, 16, 2, 18, 4, 20, 6, 22, 8, 24, 10, 26, 12, 28, 14, 30};
+ uint8x16_t __mask2 = {1, 17, 3, 19, 5, 21, 7, 23, 9, 25, 11, 27, 13, 29, 15, 31};
+ __rv.val[0] = (int8x16_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int8x16_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7476,7 +7614,10 @@ __extension__ static __inline int16x8x2_t __attribute__ ((__always_inline__))
vtrnq_s16 (int16x8_t __a, int16x8_t __b)
{
int16x8x2_t __rv;
- __builtin_neon_vtrnv8hi (&__rv.val[0], __a, __b);
+ uint16x8_t __mask1 = {0, 8, 2, 10, 4, 12, 6, 14};
+ uint16x8_t __mask2 = {1, 9, 3, 11, 5, 13, 7, 15};
+ __rv.val[0] = (int16x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int16x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7484,7 +7625,10 @@ __extension__ static __inline int32x4x2_t __attribute__ ((__always_inline__))
vtrnq_s32 (int32x4_t __a, int32x4_t __b)
{
int32x4x2_t __rv;
- __builtin_neon_vtrnv4si (&__rv.val[0], __a, __b);
+ uint32x4_t __mask1 = {0, 4, 2, 6};
+ uint32x4_t __mask2 = {1, 5, 3, 7};
+ __rv.val[0] = (int32x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int32x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7492,7 +7636,10 @@ __extension__ static __inline float32x4x2_t __attribute__ ((__always_inline__))
vtrnq_f32 (float32x4_t __a, float32x4_t __b)
{
float32x4x2_t __rv;
- __builtin_neon_vtrnv4sf (&__rv.val[0], __a, __b);
+ uint32x4_t __mask1 = {0, 4, 2, 6};
+ uint32x4_t __mask2 = {1, 5, 3, 7};
+ __rv.val[0] = (float32x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (float32x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7500,7 +7647,10 @@ __extension__ static __inline uint8x16x2_t __attribute__ ((__always_inline__))
vtrnq_u8 (uint8x16_t __a, uint8x16_t __b)
{
uint8x16x2_t __rv;
- __builtin_neon_vtrnv16qi ((int8x16_t *) &__rv.val[0], (int8x16_t) __a, (int8x16_t) __b);
+ uint8x16_t __mask1 = {0, 16, 2, 18, 4, 20, 6, 22, 8, 24, 10, 26, 12, 28, 14, 30};
+ uint8x16_t __mask2 = {1, 17, 3, 19, 5, 21, 7, 23, 9, 25, 11, 27, 13, 29, 15, 31};
+ __rv.val[0] = (uint8x16_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint8x16_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7508,7 +7658,10 @@ __extension__ static __inline uint16x8x2_t __attribute__ ((__always_inline__))
vtrnq_u16 (uint16x8_t __a, uint16x8_t __b)
{
uint16x8x2_t __rv;
- __builtin_neon_vtrnv8hi ((int16x8_t *) &__rv.val[0], (int16x8_t) __a, (int16x8_t) __b);
+ uint16x8_t __mask1 = {0, 8, 2, 10, 4, 12, 6, 14};
+ uint16x8_t __mask2 = {1, 9, 3, 11, 5, 13, 7, 15};
+ __rv.val[0] = (uint16x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint16x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7516,7 +7669,10 @@ __extension__ static __inline uint32x4x2_t __attribute__ ((__always_inline__))
vtrnq_u32 (uint32x4_t __a, uint32x4_t __b)
{
uint32x4x2_t __rv;
- __builtin_neon_vtrnv4si ((int32x4_t *) &__rv.val[0], (int32x4_t) __a, (int32x4_t) __b);
+ uint32x4_t __mask1 = {0, 4, 2, 6};
+ uint32x4_t __mask2 = {1, 5, 3, 7};
+ __rv.val[0] = (uint32x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint32x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7524,7 +7680,10 @@ __extension__ static __inline poly8x16x2_t __attribute__ ((__always_inline__))
vtrnq_p8 (poly8x16_t __a, poly8x16_t __b)
{
poly8x16x2_t __rv;
- __builtin_neon_vtrnv16qi ((int8x16_t *) &__rv.val[0], (int8x16_t) __a, (int8x16_t) __b);
+ uint8x16_t __mask1 = {0, 16, 2, 18, 4, 20, 6, 22, 8, 24, 10, 26, 12, 28, 14, 30};
+ uint8x16_t __mask2 = {1, 17, 3, 19, 5, 21, 7, 23, 9, 25, 11, 27, 13, 29, 15, 31};
+ __rv.val[0] = (poly8x16_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (poly8x16_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7532,7 +7691,10 @@ __extension__ static __inline poly16x8x2_t __attribute__ ((__always_inline__))
vtrnq_p16 (poly16x8_t __a, poly16x8_t __b)
{
poly16x8x2_t __rv;
- __builtin_neon_vtrnv8hi ((int16x8_t *) &__rv.val[0], (int16x8_t) __a, (int16x8_t) __b);
+ uint16x8_t __mask1 = {0, 8, 2, 10, 4, 12, 6, 14};
+ uint16x8_t __mask2 = {1, 9, 3, 11, 5, 13, 7, 15};
+ __rv.val[0] = (poly16x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (poly16x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7540,7 +7702,10 @@ __extension__ static __inline int8x8x2_t __attribute__ ((__always_inline__))
vzip_s8 (int8x8_t __a, int8x8_t __b)
{
int8x8x2_t __rv;
- __builtin_neon_vzipv8qi (&__rv.val[0], __a, __b);
+ uint8x8_t __mask1 = {0, 8, 1, 9, 2, 10, 3, 11};
+ uint8x8_t __mask2 = {4, 12, 5, 13, 6, 14, 7, 15};
+ __rv.val[0] = (int8x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int8x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7548,7 +7713,10 @@ __extension__ static __inline int16x4x2_t __attribute__ ((__always_inline__))
vzip_s16 (int16x4_t __a, int16x4_t __b)
{
int16x4x2_t __rv;
- __builtin_neon_vzipv4hi (&__rv.val[0], __a, __b);
+ uint16x4_t __mask1 = {0, 4, 1, 5};
+ uint16x4_t __mask2 = {2, 6, 3, 7};
+ __rv.val[0] = (int16x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int16x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7556,7 +7724,10 @@ __extension__ static __inline int32x2x2_t __attribute__ ((__always_inline__))
vzip_s32 (int32x2_t __a, int32x2_t __b)
{
int32x2x2_t __rv;
- __builtin_neon_vzipv2si (&__rv.val[0], __a, __b);
+ uint32x2_t __mask1 = {0, 2};
+ uint32x2_t __mask2 = {1, 3};
+ __rv.val[0] = (int32x2_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int32x2_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7564,7 +7735,10 @@ __extension__ static __inline float32x2x2_t __attribute__ ((__always_inline__))
vzip_f32 (float32x2_t __a, float32x2_t __b)
{
float32x2x2_t __rv;
- __builtin_neon_vzipv2sf (&__rv.val[0], __a, __b);
+ uint32x2_t __mask1 = {0, 2};
+ uint32x2_t __mask2 = {1, 3};
+ __rv.val[0] = (float32x2_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (float32x2_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7572,7 +7746,10 @@ __extension__ static __inline uint8x8x2_t __attribute__ ((__always_inline__))
vzip_u8 (uint8x8_t __a, uint8x8_t __b)
{
uint8x8x2_t __rv;
- __builtin_neon_vzipv8qi ((int8x8_t *) &__rv.val[0], (int8x8_t) __a, (int8x8_t) __b);
+ uint8x8_t __mask1 = {0, 8, 1, 9, 2, 10, 3, 11};
+ uint8x8_t __mask2 = {4, 12, 5, 13, 6, 14, 7, 15};
+ __rv.val[0] = (uint8x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint8x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7580,7 +7757,10 @@ __extension__ static __inline uint16x4x2_t __attribute__ ((__always_inline__))
vzip_u16 (uint16x4_t __a, uint16x4_t __b)
{
uint16x4x2_t __rv;
- __builtin_neon_vzipv4hi ((int16x4_t *) &__rv.val[0], (int16x4_t) __a, (int16x4_t) __b);
+ uint16x4_t __mask1 = {0, 4, 1, 5};
+ uint16x4_t __mask2 = {2, 6, 3, 7};
+ __rv.val[0] = (uint16x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint16x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7588,7 +7768,10 @@ __extension__ static __inline uint32x2x2_t __attribute__ ((__always_inline__))
vzip_u32 (uint32x2_t __a, uint32x2_t __b)
{
uint32x2x2_t __rv;
- __builtin_neon_vzipv2si ((int32x2_t *) &__rv.val[0], (int32x2_t) __a, (int32x2_t) __b);
+ uint32x2_t __mask1 = {0, 2};
+ uint32x2_t __mask2 = {1, 3};
+ __rv.val[0] = (uint32x2_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint32x2_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7596,7 +7779,10 @@ __extension__ static __inline poly8x8x2_t __attribute__ ((__always_inline__))
vzip_p8 (poly8x8_t __a, poly8x8_t __b)
{
poly8x8x2_t __rv;
- __builtin_neon_vzipv8qi ((int8x8_t *) &__rv.val[0], (int8x8_t) __a, (int8x8_t) __b);
+ uint8x8_t __mask1 = {0, 8, 1, 9, 2, 10, 3, 11};
+ uint8x8_t __mask2 = {4, 12, 5, 13, 6, 14, 7, 15};
+ __rv.val[0] = (poly8x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (poly8x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7604,7 +7790,10 @@ __extension__ static __inline poly16x4x2_t __attribute__ ((__always_inline__))
vzip_p16 (poly16x4_t __a, poly16x4_t __b)
{
poly16x4x2_t __rv;
- __builtin_neon_vzipv4hi ((int16x4_t *) &__rv.val[0], (int16x4_t) __a, (int16x4_t) __b);
+ uint16x4_t __mask1 = {0, 4, 1, 5};
+ uint16x4_t __mask2 = {2, 6, 3, 7};
+ __rv.val[0] = (poly16x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (poly16x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7612,7 +7801,10 @@ __extension__ static __inline int8x16x2_t __attribute__ ((__always_inline__))
vzipq_s8 (int8x16_t __a, int8x16_t __b)
{
int8x16x2_t __rv;
- __builtin_neon_vzipv16qi (&__rv.val[0], __a, __b);
+ uint8x16_t __mask1 = {0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23};
+ uint8x16_t __mask2 = {8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31};
+ __rv.val[0] = (int8x16_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int8x16_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7620,7 +7812,10 @@ __extension__ static __inline int16x8x2_t __attribute__ ((__always_inline__))
vzipq_s16 (int16x8_t __a, int16x8_t __b)
{
int16x8x2_t __rv;
- __builtin_neon_vzipv8hi (&__rv.val[0], __a, __b);
+ uint16x8_t __mask1 = {0, 8, 1, 9, 2, 10, 3, 11};
+ uint16x8_t __mask2 = {4, 12, 5, 13, 6, 14, 7, 15};
+ __rv.val[0] = (int16x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int16x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7628,7 +7823,10 @@ __extension__ static __inline int32x4x2_t __attribute__ ((__always_inline__))
vzipq_s32 (int32x4_t __a, int32x4_t __b)
{
int32x4x2_t __rv;
- __builtin_neon_vzipv4si (&__rv.val[0], __a, __b);
+ uint32x4_t __mask1 = {0, 4, 1, 5};
+ uint32x4_t __mask2 = {2, 6, 3, 7};
+ __rv.val[0] = (int32x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int32x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7636,7 +7834,10 @@ __extension__ static __inline float32x4x2_t __attribute__ ((__always_inline__))
vzipq_f32 (float32x4_t __a, float32x4_t __b)
{
float32x4x2_t __rv;
- __builtin_neon_vzipv4sf (&__rv.val[0], __a, __b);
+ uint32x4_t __mask1 = {0, 4, 1, 5};
+ uint32x4_t __mask2 = {2, 6, 3, 7};
+ __rv.val[0] = (float32x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (float32x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7644,7 +7845,10 @@ __extension__ static __inline uint8x16x2_t __attribute__ ((__always_inline__))
vzipq_u8 (uint8x16_t __a, uint8x16_t __b)
{
uint8x16x2_t __rv;
- __builtin_neon_vzipv16qi ((int8x16_t *) &__rv.val[0], (int8x16_t) __a, (int8x16_t) __b);
+ uint8x16_t __mask1 = {0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23};
+ uint8x16_t __mask2 = {8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31};
+ __rv.val[0] = (uint8x16_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint8x16_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7652,7 +7856,10 @@ __extension__ static __inline uint16x8x2_t __attribute__ ((__always_inline__))
vzipq_u16 (uint16x8_t __a, uint16x8_t __b)
{
uint16x8x2_t __rv;
- __builtin_neon_vzipv8hi ((int16x8_t *) &__rv.val[0], (int16x8_t) __a, (int16x8_t) __b);
+ uint16x8_t __mask1 = {0, 8, 1, 9, 2, 10, 3, 11};
+ uint16x8_t __mask2 = {4, 12, 5, 13, 6, 14, 7, 15};
+ __rv.val[0] = (uint16x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint16x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7660,7 +7867,10 @@ __extension__ static __inline uint32x4x2_t __attribute__ ((__always_inline__))
vzipq_u32 (uint32x4_t __a, uint32x4_t __b)
{
uint32x4x2_t __rv;
- __builtin_neon_vzipv4si ((int32x4_t *) &__rv.val[0], (int32x4_t) __a, (int32x4_t) __b);
+ uint32x4_t __mask1 = {0, 4, 1, 5};
+ uint32x4_t __mask2 = {2, 6, 3, 7};
+ __rv.val[0] = (uint32x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint32x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7668,7 +7878,10 @@ __extension__ static __inline poly8x16x2_t __attribute__ ((__always_inline__))
vzipq_p8 (poly8x16_t __a, poly8x16_t __b)
{
poly8x16x2_t __rv;
- __builtin_neon_vzipv16qi ((int8x16_t *) &__rv.val[0], (int8x16_t) __a, (int8x16_t) __b);
+ uint8x16_t __mask1 = {0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23};
+ uint8x16_t __mask2 = {8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31};
+ __rv.val[0] = (poly8x16_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (poly8x16_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7676,7 +7889,10 @@ __extension__ static __inline poly16x8x2_t __attribute__ ((__always_inline__))
vzipq_p16 (poly16x8_t __a, poly16x8_t __b)
{
poly16x8x2_t __rv;
- __builtin_neon_vzipv8hi ((int16x8_t *) &__rv.val[0], (int16x8_t) __a, (int16x8_t) __b);
+ uint16x8_t __mask1 = {0, 8, 1, 9, 2, 10, 3, 11};
+ uint16x8_t __mask2 = {4, 12, 5, 13, 6, 14, 7, 15};
+ __rv.val[0] = (poly16x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (poly16x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7684,7 +7900,10 @@ __extension__ static __inline int8x8x2_t __attribute__ ((__always_inline__))
vuzp_s8 (int8x8_t __a, int8x8_t __b)
{
int8x8x2_t __rv;
- __builtin_neon_vuzpv8qi (&__rv.val[0], __a, __b);
+ uint8x8_t __mask1 = {0, 2, 4, 6, 8, 10, 12, 14};
+ uint8x8_t __mask2 = {1, 3, 5, 7, 9, 11, 13, 15};
+ __rv.val[0] = (int8x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int8x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7692,7 +7911,10 @@ __extension__ static __inline int16x4x2_t __attribute__ ((__always_inline__))
vuzp_s16 (int16x4_t __a, int16x4_t __b)
{
int16x4x2_t __rv;
- __builtin_neon_vuzpv4hi (&__rv.val[0], __a, __b);
+ uint16x4_t __mask1 = {0, 2, 4, 6};
+ uint16x4_t __mask2 = {1, 3, 5, 7};
+ __rv.val[0] = (int16x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int16x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7700,7 +7922,10 @@ __extension__ static __inline int32x2x2_t __attribute__ ((__always_inline__))
vuzp_s32 (int32x2_t __a, int32x2_t __b)
{
int32x2x2_t __rv;
- __builtin_neon_vuzpv2si (&__rv.val[0], __a, __b);
+ uint32x2_t __mask1 = {0, 2};
+ uint32x2_t __mask2 = {1, 3};
+ __rv.val[0] = (int32x2_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int32x2_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7708,7 +7933,10 @@ __extension__ static __inline float32x2x2_t __attribute__ ((__always_inline__))
vuzp_f32 (float32x2_t __a, float32x2_t __b)
{
float32x2x2_t __rv;
- __builtin_neon_vuzpv2sf (&__rv.val[0], __a, __b);
+ uint32x2_t __mask1 = {0, 2};
+ uint32x2_t __mask2 = {1, 3};
+ __rv.val[0] = (float32x2_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (float32x2_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7716,7 +7944,10 @@ __extension__ static __inline uint8x8x2_t __attribute__ ((__always_inline__))
vuzp_u8 (uint8x8_t __a, uint8x8_t __b)
{
uint8x8x2_t __rv;
- __builtin_neon_vuzpv8qi ((int8x8_t *) &__rv.val[0], (int8x8_t) __a, (int8x8_t) __b);
+ uint8x8_t __mask1 = {0, 2, 4, 6, 8, 10, 12, 14};
+ uint8x8_t __mask2 = {1, 3, 5, 7, 9, 11, 13, 15};
+ __rv.val[0] = (uint8x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint8x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7724,7 +7955,10 @@ __extension__ static __inline uint16x4x2_t __attribute__ ((__always_inline__))
vuzp_u16 (uint16x4_t __a, uint16x4_t __b)
{
uint16x4x2_t __rv;
- __builtin_neon_vuzpv4hi ((int16x4_t *) &__rv.val[0], (int16x4_t) __a, (int16x4_t) __b);
+ uint16x4_t __mask1 = {0, 2, 4, 6};
+ uint16x4_t __mask2 = {1, 3, 5, 7};
+ __rv.val[0] = (uint16x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint16x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7732,7 +7966,10 @@ __extension__ static __inline uint32x2x2_t __attribute__ ((__always_inline__))
vuzp_u32 (uint32x2_t __a, uint32x2_t __b)
{
uint32x2x2_t __rv;
- __builtin_neon_vuzpv2si ((int32x2_t *) &__rv.val[0], (int32x2_t) __a, (int32x2_t) __b);
+ uint32x2_t __mask1 = {0, 2};
+ uint32x2_t __mask2 = {1, 3};
+ __rv.val[0] = (uint32x2_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint32x2_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7740,7 +7977,10 @@ __extension__ static __inline poly8x8x2_t __attribute__ ((__always_inline__))
vuzp_p8 (poly8x8_t __a, poly8x8_t __b)
{
poly8x8x2_t __rv;
- __builtin_neon_vuzpv8qi ((int8x8_t *) &__rv.val[0], (int8x8_t) __a, (int8x8_t) __b);
+ uint8x8_t __mask1 = {0, 2, 4, 6, 8, 10, 12, 14};
+ uint8x8_t __mask2 = {1, 3, 5, 7, 9, 11, 13, 15};
+ __rv.val[0] = (poly8x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (poly8x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7748,7 +7988,10 @@ __extension__ static __inline poly16x4x2_t __attribute__ ((__always_inline__))
vuzp_p16 (poly16x4_t __a, poly16x4_t __b)
{
poly16x4x2_t __rv;
- __builtin_neon_vuzpv4hi ((int16x4_t *) &__rv.val[0], (int16x4_t) __a, (int16x4_t) __b);
+ uint16x4_t __mask1 = {0, 2, 4, 6};
+ uint16x4_t __mask2 = {1, 3, 5, 7};
+ __rv.val[0] = (poly16x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (poly16x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7756,7 +7999,10 @@ __extension__ static __inline int8x16x2_t __attribute__ ((__always_inline__))
vuzpq_s8 (int8x16_t __a, int8x16_t __b)
{
int8x16x2_t __rv;
- __builtin_neon_vuzpv16qi (&__rv.val[0], __a, __b);
+ uint8x16_t __mask1 = {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30};
+ uint8x16_t __mask2 = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31};
+ __rv.val[0] = (int8x16_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int8x16_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7764,7 +8010,10 @@ __extension__ static __inline int16x8x2_t __attribute__ ((__always_inline__))
vuzpq_s16 (int16x8_t __a, int16x8_t __b)
{
int16x8x2_t __rv;
- __builtin_neon_vuzpv8hi (&__rv.val[0], __a, __b);
+ uint16x8_t __mask1 = {0, 2, 4, 6, 8, 10, 12, 14};
+ uint16x8_t __mask2 = {1, 3, 5, 7, 9, 11, 13, 15};
+ __rv.val[0] = (int16x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int16x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7772,7 +8021,10 @@ __extension__ static __inline int32x4x2_t __attribute__ ((__always_inline__))
vuzpq_s32 (int32x4_t __a, int32x4_t __b)
{
int32x4x2_t __rv;
- __builtin_neon_vuzpv4si (&__rv.val[0], __a, __b);
+ uint32x4_t __mask1 = {0, 2, 4, 6};
+ uint32x4_t __mask2 = {1, 3, 5, 7};
+ __rv.val[0] = (int32x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (int32x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7780,7 +8032,10 @@ __extension__ static __inline float32x4x2_t __attribute__ ((__always_inline__))
vuzpq_f32 (float32x4_t __a, float32x4_t __b)
{
float32x4x2_t __rv;
- __builtin_neon_vuzpv4sf (&__rv.val[0], __a, __b);
+ uint32x4_t __mask1 = {0, 2, 4, 6};
+ uint32x4_t __mask2 = {1, 3, 5, 7};
+ __rv.val[0] = (float32x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (float32x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7788,7 +8043,10 @@ __extension__ static __inline uint8x16x2_t __attribute__ ((__always_inline__))
vuzpq_u8 (uint8x16_t __a, uint8x16_t __b)
{
uint8x16x2_t __rv;
- __builtin_neon_vuzpv16qi ((int8x16_t *) &__rv.val[0], (int8x16_t) __a, (int8x16_t) __b);
+ uint8x16_t __mask1 = {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30};
+ uint8x16_t __mask2 = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31};
+ __rv.val[0] = (uint8x16_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint8x16_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7796,7 +8054,10 @@ __extension__ static __inline uint16x8x2_t __attribute__ ((__always_inline__))
vuzpq_u16 (uint16x8_t __a, uint16x8_t __b)
{
uint16x8x2_t __rv;
- __builtin_neon_vuzpv8hi ((int16x8_t *) &__rv.val[0], (int16x8_t) __a, (int16x8_t) __b);
+ uint16x8_t __mask1 = {0, 2, 4, 6, 8, 10, 12, 14};
+ uint16x8_t __mask2 = {1, 3, 5, 7, 9, 11, 13, 15};
+ __rv.val[0] = (uint16x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint16x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7804,7 +8065,10 @@ __extension__ static __inline uint32x4x2_t __attribute__ ((__always_inline__))
vuzpq_u32 (uint32x4_t __a, uint32x4_t __b)
{
uint32x4x2_t __rv;
- __builtin_neon_vuzpv4si ((int32x4_t *) &__rv.val[0], (int32x4_t) __a, (int32x4_t) __b);
+ uint32x4_t __mask1 = {0, 2, 4, 6};
+ uint32x4_t __mask2 = {1, 3, 5, 7};
+ __rv.val[0] = (uint32x4_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (uint32x4_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7812,7 +8076,10 @@ __extension__ static __inline poly8x16x2_t __attribute__ ((__always_inline__))
vuzpq_p8 (poly8x16_t __a, poly8x16_t __b)
{
poly8x16x2_t __rv;
- __builtin_neon_vuzpv16qi ((int8x16_t *) &__rv.val[0], (int8x16_t) __a, (int8x16_t) __b);
+ uint8x16_t __mask1 = {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30};
+ uint8x16_t __mask2 = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31};
+ __rv.val[0] = (poly8x16_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (poly8x16_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -7820,7 +8087,10 @@ __extension__ static __inline poly16x8x2_t __attribute__ ((__always_inline__))
vuzpq_p16 (poly16x8_t __a, poly16x8_t __b)
{
poly16x8x2_t __rv;
- __builtin_neon_vuzpv8hi ((int16x8_t *) &__rv.val[0], (int16x8_t) __a, (int16x8_t) __b);
+ uint16x8_t __mask1 = {0, 2, 4, 6, 8, 10, 12, 14};
+ uint16x8_t __mask2 = {1, 3, 5, 7, 9, 11, 13, 15};
+ __rv.val[0] = (poly16x8_t)__builtin_shuffle (__a, __b, __mask1);
+ __rv.val[1] = (poly16x8_t)__builtin_shuffle (__a, __b, __mask2);
return __rv;
}
@@ -98,7 +98,7 @@ let print_function arity fnname body =
close_braceblock ffmt;
end_function ffmt
-let return_by_ptr features = List.mem ReturnPtr features
+let gcc_builtin_shuffle features = List.exists (function GCCBuiltinShuffle (a, b) -> true | _ -> false) features
let union_string num elts base =
let itype = inttype_for_array num elts in
@@ -137,33 +137,242 @@ let add_cast ctype cval =
else
cval
+(* This function gives the base type for any vector type
+ that we care about for the permute operations. Expand as need
+ be for other cases. *)
+let rec base_type vectype =
+ match vectype with
+ T_int8x8
+ | T_uint8x8
+ | T_poly8x8
+ | T_int16x4
+ | T_uint16x4
+ | T_poly16x4
+ | T_int32x2
+ | T_uint32x2
+ | T_float32x2
+ | T_int8x16
+ | T_uint8x16
+ | T_poly8x16
+ | T_int16x8
+ | T_uint16x8
+ | T_poly16x8
+ | T_int32x4
+ | T_float32x4
+ | T_uint32x4 -> vectype
+ | T_arrayof (num, base) -> base_type base
+ | _ -> raise Not_found
+
+(* This function tells us what type to give to the mask. *)
+let rec masktype vectype =
+ match vectype with
+ T_int8x8
+ | T_uint8x8
+ | T_poly8x8 -> T_uint8x8
+ | T_int16x4
+ | T_uint16x4
+ | T_poly16x4 -> T_uint16x4
+ | T_int32x2
+ | T_uint32x2
+ | T_float32x2 -> T_uint32x2
+ | T_int8x16
+ | T_uint8x16
+ | T_poly8x16 -> T_uint8x16
+ | T_int16x8
+ | T_uint16x8
+ | T_poly16x8 -> T_uint16x8
+ | T_int32x4
+ | T_float32x4
+ | T_uint32x4 -> T_uint32x4
+ | T_arrayof (num, base) -> masktype base
+ | _ -> raise Not_found
+
+(* Return number of elements available in the underlying vector
+ type. *)
+let rec num_vec_elt vectype =
+ match vectype with
+ T_int8x8
+ | T_uint8x8
+ | T_poly8x8 -> 8
+ | T_int16x4
+ | T_uint16x4
+ | T_poly16x4 -> 4
+ | T_int32x2
+ | T_uint32x2
+ | T_float32x2 -> 2
+ | T_int8x16
+ | T_uint8x16
+ | T_poly8x16 -> 16
+ | T_int16x8
+ | T_uint16x8
+ | T_poly16x8 -> 8
+ | T_int32x4
+ | T_float32x4
+ | T_uint32x4 -> 4
+ | T_arrayof (num, base) -> (num_vec_elt base)
+ | _ -> raise Not_found
+
+
let cast_for_return to_ty = "(" ^ (string_of_vectype to_ty) ^ ")"
+(* Produce a list of integers in the descending range i ... j. *)
+let rec range i j = if i < j then [] else i :: (range (i - 1) j )
+let gen_revmask high low = List.map string_of_int (range high low)
+
+(* An initialization to produce the right value for the mask
+ that gets produced in the form of a const_vec. This could be
+ written much better in terms of proper permutations like
+ some of the zip, unzip and trn implementations below. *)
+let init_rev_mask elttype maskty =
+ (let rangelim = (match elttype with
+ I64 ->
+ (match maskty with
+ T_uint8x8 -> [(7, 0)]
+ | T_uint16x4 -> [(3, 0)]
+ | T_uint32x2 -> [(1, 0)]
+ | T_uint16x8 -> [(3, 0) ; (7, 4)]
+ | T_uint32x4 -> [(1, 0) ; (3, 2)]
+ | T_uint8x16 -> [(7, 0) ; (15, 8)]
+ | _ -> raise Not_found)
+ | I32 ->
+ (match maskty with
+ T_uint8x8 -> [(3, 0) ; (7, 4)]
+ | T_uint16x4 -> [(1, 0) ; (3, 2)]
+ | T_uint16x8 -> [(1, 0) ; (3, 2) ; (5, 4); (7, 6)]
+ | T_uint8x16 -> [(3, 0) ; (7, 4) ; (11, 8); (15, 12)]
+ | _ -> raise Not_found)
+ | I16 ->
+ (match maskty with
+ T_uint8x8 -> [(1, 0) ; (3, 2); (5, 4); (7, 6)]
+ | T_uint8x16 -> [(1, 0) ; (3, 2); (5, 4); (7, 6); (9, 8); (11, 10); (13, 12); (15, 14)]
+ | _ -> raise Not_found)
+ | _ -> raise Not_found) in
+ let rec strlist t = (match t with
+ [] -> []
+ | (x, y) :: xs -> (String.concat ", " (gen_revmask x y)) :: strlist xs) in
+ "{" ^ (String.concat ", " (strlist rangelim)) ^ "}")
+
+(* Generic helper function that produces a permutation based on
+ an initial value, stride, number of elements and an increment value. *)
+let rec permute_range i stride nelts increment =
+let j = i + stride in
+if nelts = 0
+ then []
+else
+ let ls = i :: [j] in
+ List.append ls (permute_range (i + increment) stride (nelts - 1) increment)
+
+(* Generate a list of integers suitable for vzip. *)
+let rec zip_range i stride nelts = permute_range i stride nelts 1
+(* Generate a list of integers suitable for vunzip. *)
+let rec uzip_range i stride nelts = permute_range i stride nelts 4
+(* Generate a list of integers suitable for trn. *)
+let rec trn_range i stride nelts = permute_range i stride nelts 2
+
+(* Fixme: Not fully happy with the way in which this is written. Screams for
+ slightly better factoring. *)
+let init_zip_mask shufop maskty =
+ let num_elts = num_vec_elt maskty in
+ let printstr = match shufop with
+ Ziplo -> (match maskty with
+ T_uint8x8
+ | T_uint16x4
+ | T_uint16x8
+ | T_uint32x2
+ | T_uint32x4
+ | T_uint8x16 -> (List.map string_of_int (zip_range 0 num_elts (num_elts / 2)))
+ | _ -> raise Not_found)
+ | Ziphi -> (match maskty with
+ T_uint8x8
+ | T_uint16x8
+ | T_uint16x4
+ | T_uint32x2
+ | T_uint32x4
+ | T_uint8x16 -> (List.map string_of_int (zip_range (num_elts / 2) num_elts (num_elts / 2)))
+ | _ ->raise Not_found)
+ | Uzplo -> (match maskty with
+ T_uint8x8
+ | T_uint16x4
+ | T_uint16x8
+ | T_uint32x2
+ | T_uint32x4
+ | T_uint8x16 -> (List.map string_of_int (uzip_range 0 2 (num_elts / 2)))
+ | _ -> raise Not_found)
+ | Uzphi -> (match maskty with
+ T_uint8x8
+ | T_uint16x4
+ | T_uint16x8
+ | T_uint32x2
+ | T_uint32x4
+ | T_uint8x16 -> (List.map string_of_int (uzip_range 1 2 (num_elts / 2)))
+ | _ -> raise Not_found)
+
+ | Trnlo -> (match maskty with
+ T_uint8x8
+ | T_uint16x4
+ | T_uint16x8
+ | T_uint32x2
+ | T_uint32x4
+ | T_uint8x16 -> (List.map string_of_int (trn_range 0 (num_elts) (num_elts / 2)))
+ | _ -> raise Not_found)
+ | Trnhi -> (match maskty with
+ T_uint8x8
+ | T_uint16x4
+ | T_uint16x8
+ | T_uint32x2
+ | T_uint32x4
+ | T_uint8x16 -> (List.map string_of_int (trn_range 1 (num_elts) (num_elts / 2)))
+ | _ -> raise Not_found)
+ | _ -> raise Not_found in
+ "{" ^ String.concat ", " printstr ^ "}"
+
+let perm_locode op = match op with
+ Zip -> Ziplo
+| Unzip -> Uzplo
+| Trn -> Trnlo
+| _ -> raise Not_found
+
+let perm_hicode op = match op with
+ Zip -> Ziphi
+| Unzip -> Uzphi
+| Trn -> Trnhi
+| _ -> raise Not_found
+
(* Return a tuple of a list of declarations to go at the start of the function,
and a list of statements needed to return THING. *)
-let return arity return_by_ptr thing =
+let return arity gcc_builtin_shuffle shufop shufty thing =
match arity with
Arity0 (ret) | Arity1 (ret, _) | Arity2 (ret, _, _) | Arity3 (ret, _, _, _)
| Arity4 (ret, _, _, _, _) ->
- match ret with
- T_arrayof (num, vec) ->
- if return_by_ptr then
+ if gcc_builtin_shuffle then
let sname = string_of_vectype ret in
- [Printf.sprintf "%s __rv;" sname],
- [thing ^ ";"; "return __rv;"]
- else
+ let mname = string_of_vectype (masktype ret) in
+ (match shufop with
+ Reverse -> (let mask_initializer = init_rev_mask shufty (masktype ret) in
+ [Printf.sprintf "%s __rv;" sname ; Printf.sprintf "%s __mask1 = %s;" mname mask_initializer],
+ ["__rv = " ^ (cast_for_return ret) ^ thing ^ ";" ; "return __rv;"])
+ | Unzip
+ | Trn
+ | Zip -> (let mask_initializer1 = init_zip_mask (perm_locode shufop) (masktype ret) in
+ let mask_initializer2 = init_zip_mask (perm_hicode shufop) (masktype ret) in
+ [Printf.sprintf "%s __rv;" sname ; Printf.sprintf "%s __mask1 = %s;" mname mask_initializer1 ; Printf.sprintf "%s __mask2 = %s;" mname mask_initializer2; Printf.sprintf "__rv.val[0] = " ^ (cast_for_return (base_type ret)) ^ "__builtin_shuffle (__a, __b, __mask1);" ; Printf.sprintf "__rv.val[1] = " ^ (cast_for_return (base_type ret)) ^ "__builtin_shuffle (__a, __b, __mask2);" ],
+ ["return __rv;"])
+ | _ -> raise Not_found)
+ else
+ match ret with
+ T_arrayof (num, vec) ->
let uname = union_string num vec "__rv" in
[uname ^ ";"], ["__rv.__o = " ^ thing ^ ";"; "return __rv.__i;"]
| T_void -> [], [thing ^ ";"]
| _ ->
- [], ["return " ^ (cast_for_return ret) ^ thing ^ ";"]
+ [], ["return " ^ (cast_for_return ret) ^ thing ^ ";"]
let rec element_type ctype =
match ctype with
T_arrayof (_, v) -> element_type v
| _ -> ctype
-let params return_by_ptr ps =
+let params ps =
let pdecls = ref [] in
let ptype t p =
match t with
@@ -182,11 +391,7 @@ let params return_by_ptr ps =
[ptype t1 "__a"; ptype t2 "__b"; ptype t3 "__c"; ptype t4 "__d"] in
match ps with
Arity0 ret | Arity1 (ret, _) | Arity2 (ret, _, _) | Arity3 (ret, _, _, _)
- | Arity4 (ret, _, _, _, _) ->
- if return_by_ptr then
- !pdecls, add_cast (T_ptrto (element_type ret)) "&__rv.val[0]" :: plist
- else
- !pdecls, plist
+ | Arity4 (ret, _, _, _, _) -> !pdecls, plist
let modify_params features plist =
let is_flipped =
@@ -200,9 +405,13 @@ let modify_params features plist =
plist
(* !!! Decide whether to add an extra information word based on the shape
- form. *)
-let extra_word shape features paramlist bits =
+ form. If we have a builtin shuffle we really do not need the extra
+ word. *)
+let extra_word shape features paramlist gcc_builtin_shuffle bits =
let use_word =
+ if gcc_builtin_shuffle then
+ false
+ else
match shape with
All _ | Long | Long_noreg _ | Wide | Wide_noreg _ | Narrow
| By_scalar _ | Wide_scalar | Wide_lane | Binary_imm _ | Long_imm
@@ -239,17 +448,41 @@ let rec mode_suffix elttype shape =
and srcmode = mode_of_elt src shape in
string_of_mode dstmode ^ string_of_mode srcmode
+
+let rec shuffle_inner_op features = match features with
+ [] -> NoShuffle
+ | (GCCBuiltinShuffle (a, b)) :: xs -> a
+ | x :: xs -> shuffle_inner_op xs
+
+let rec shuffle_inner_ty features = match features with
+ [] -> NoElts
+ | (GCCBuiltinShuffle (a, b)) :: xs -> b
+ | x :: xs -> shuffle_inner_ty xs
+
+let shuffle_type features = shuffle_inner_ty features
+let shuffle_op features = shuffle_inner_op features
+
let print_variant opcode features shape name (ctype, asmtype, elttype) =
let bits = infoword_value elttype features in
let modesuf = mode_suffix elttype shape in
- let return_by_ptr = return_by_ptr features in
- let pdecls, paramlist = params return_by_ptr ctype in
+ let gcc_builtin_shuffle = gcc_builtin_shuffle features in
+ let pdecls, paramlist = params ctype in
let paramlist' = modify_params features paramlist in
- let paramlist'' = extra_word shape features paramlist' bits in
+ let paramlist'' = extra_word shape features paramlist' gcc_builtin_shuffle bits in
let parstr = String.concat ", " paramlist'' in
- let builtin = Printf.sprintf "__builtin_neon_%s%s (%s)"
+ let shufty = shuffle_type features in
+ let shufop = shuffle_op features in
+ let builtin = if gcc_builtin_shuffle then
+ (match shufop with
+ Reverse -> Printf.sprintf " __builtin_shuffle (%s , __mask1)" parstr
+ | Unzip
+ | Trn
+ | Zip -> Printf.sprintf ""
+ | _ -> raise Not_found)
+ else Printf.sprintf "__builtin_neon_%s%s (%s)"
(builtin_name features name) modesuf parstr in
- let rdecls, stmts = return ctype return_by_ptr builtin in
+
+ let rdecls, stmts = return ctype gcc_builtin_shuffle shufop shufty builtin in
let body = pdecls @ rdecls @ stmts
and fnname = (intrinsic_name name) ^ "_" ^ (string_of_elt elttype) in
print_function ctype fnname body
@@ -201,6 +201,23 @@ type opcode =
(* Reinterpret casts. *)
| Vreinterp
+(* Shuffletype can be one of the below - The lo and hi variants
+ are to allow the split forms to be generated for the Zip, Unzip
+ Trn cases. These are not to be used from the toplevel ops table
+ but for the lower level routines in neon-gen.ml. *)
+type shuffletype =
+ Reverse
+ | Zip
+ | Ziplo
+ | Ziphi
+ | Unzip
+ | Uzplo
+ | Uzphi
+ | Trn
+ | Trnlo
+ | Trnhi
+ | NoShuffle
+
(* Features used for documentation, to distinguish between some instruction
variants, and to signal special requirements (e.g. swapping arguments). *)
@@ -214,7 +231,7 @@ type features =
| Flipped of string (* Builtin name to use with flipped arguments. *)
| InfoWord (* Pass an extra word for signage/rounding etc. (always passed
for All _, Long, Wide, Narrow shape_forms. *)
- | ReturnPtr (* Pass explicit pointer to return value as first argument. *)
+ | GCCBuiltinShuffle of (shuffletype * elts)
(* A specification as to the shape of instruction expected upon
disassembly, used if it differs from the shape used to build the
intrinsic prototype. Multiple entries in the constructor's argument
@@ -1317,12 +1334,12 @@ let ops =
pf_su_8_64;
(* Reverse elements. *)
- Vrev64, [], All (2, Dreg), "vrev64", bits_1, P8 :: P16 :: F32 :: su_8_32;
- Vrev64, [], All (2, Qreg), "vrev64Q", bits_1, P8 :: P16 :: F32 :: su_8_32;
- Vrev32, [], All (2, Dreg), "vrev32", bits_1, [P8; P16; S8; U8; S16; U16];
- Vrev32, [], All (2, Qreg), "vrev32Q", bits_1, [P8; P16; S8; U8; S16; U16];
- Vrev16, [], All (2, Dreg), "vrev16", bits_1, [P8; S8; U8];
- Vrev16, [], All (2, Qreg), "vrev16Q", bits_1, [P8; S8; U8];
+ Vrev64, [GCCBuiltinShuffle (Reverse, I64)], All (2, Dreg), "vrev64", bits_1, P8 :: P16 :: F32 :: su_8_32;
+ Vrev64, [GCCBuiltinShuffle (Reverse, I64)], All (2, Qreg), "vrev64Q", bits_1, P8 :: P16 :: F32 :: su_8_32;
+ Vrev32, [GCCBuiltinShuffle (Reverse, I32)], All (2, Dreg), "vrev32", bits_1, [P8; P16; S8; U8; S16; U16];
+ Vrev32, [GCCBuiltinShuffle (Reverse, I32)], All (2, Qreg), "vrev32Q", bits_1, [P8; P16; S8; U8; S16; U16];
+ Vrev16, [GCCBuiltinShuffle (Reverse, I16)], All (2, Dreg), "vrev16", bits_1, [P8; S8; U8];
+ Vrev16, [GCCBuiltinShuffle (Reverse, I16)], All (2, Qreg), "vrev16Q", bits_1, [P8; S8; U8];
(* Bit selection. *)
Vbsl,
@@ -1336,25 +1353,15 @@ let ops =
Use_operands [| Qreg; Qreg; Qreg; Qreg |], "vbslQ", bit_select,
pf_su_8_64;
- (* Transpose elements. **NOTE** ReturnPtr goes some of the way towards
- generating good code for intrinsics which return structure types --
- builtins work well by themselves (and understand that the values being
- stored on e.g. the stack also reside in registers, so can optimise the
- stores away entirely if the results are used immediately), but
- intrinsics are very much less efficient. Maybe something can be improved
- re: inlining, or tweaking the ABI used for intrinsics (a special call
- attribute?).
- *)
- Vtrn, [ReturnPtr], Pair_result Dreg, "vtrn", bits_2, pf_su_8_32;
- Vtrn, [ReturnPtr], Pair_result Qreg, "vtrnQ", bits_2, pf_su_8_32;
-
+ Vtrn, [GCCBuiltinShuffle (Trn, NoElts)], Pair_result Dreg, "vtrn", bits_2, pf_su_8_32;
+ Vtrn, [GCCBuiltinShuffle (Trn, NoElts)], Pair_result Qreg, "vtrnQ", bits_2, pf_su_8_32;
(* Zip elements. *)
- Vzip, [ReturnPtr], Pair_result Dreg, "vzip", bits_2, pf_su_8_32;
- Vzip, [ReturnPtr], Pair_result Qreg, "vzipQ", bits_2, pf_su_8_32;
+ Vzip, [GCCBuiltinShuffle (Zip, NoElts)], Pair_result Dreg, "vzip", bits_2, pf_su_8_32;
+ Vzip, [GCCBuiltinShuffle (Zip, NoElts)], Pair_result Qreg, "vzipQ", bits_2, pf_su_8_32;
(* Unzip elements. *)
- Vuzp, [ReturnPtr], Pair_result Dreg, "vuzp", bits_2, pf_su_8_32;
- Vuzp, [ReturnPtr], Pair_result Qreg, "vuzpQ", bits_2, pf_su_8_32;
+ Vuzp, [GCCBuiltinShuffle (Unzip, NoElts)], Pair_result Dreg, "vuzp", bits_2, pf_su_8_32;
+ Vuzp, [GCCBuiltinShuffle (Unzip, NoElts)], Pair_result Qreg, "vuzpQ", bits_2, pf_su_8_32;
(* Element/structure loads. VLD1 variants. *)
Vldx 1,