Patchwork RFC: ARM 64-bit shifts in NEON

Submitter Andrew Stubbs
Date Dec. 2, 2011, 12:42 p.m.
Message ID <4ED8C7B0.9000308@codesourcery.com>
Permalink /patch/128827/
State New

Comments

Andrew Stubbs - Dec. 2, 2011, 12:42 p.m.
Hi All,

I'm trying to implement DImode shifts using ARM NEON instructions. This 
wouldn't be difficult in itself, but making it play nice with the 
existing implementation is causing me problems. I'd like a few 
suggestions/pointers/comments to help me get this right, please.

The existing shift mechanisms must be kept, partly because the NEON unit 
is optional, and partly because it does not permit the full range of 
DImode operations, so sometimes it's more efficient to do 64-bit 
operations in core-registers, rather than copy all the values over to 
NEON, do the operation, and move the result back. Which set of patterns 
are used is determined by the register allocator and its costs mechanism.

The late decision means that the patterns may only use the post-reload 
splitter, and so cannot rely on many of the usual passes to sort out 
inefficiencies. In particular, the lack of combine makes it hard to 
detect and optimize extend-and-copy sequences.

So, I've attached two patches. The first is neon-shifts.patch, and does 
most of the work. The second is extendsidi2_neon.patch, and is intended 
to aid moving the shift amount from SImode registers, but doesn't go as 
far as I'd like.

I've not actually tested any of the output code just yet, so there may 
be logic errors, but those are easily fixed later, and what I'm trying 
to get right here is the GCC machine description.

Given this testcase:

    void
    f (long long *a, int b)
    {
      *a = *a << b;
    }

Without any patches, GCC gives this output, using only ARM core 
registers (in thumb2 mode):

    f:
          ldr     r2, [r0, #0]
          ldr     r3, [r0, #4]
          push    {r4, r5, r6}
          rsb     r6, r1, #32
          sub     r4, r1, #32
          lsrs    r6, r2, r6
          lsls    r5, r2, r4
          lsls    r3, r3, r1
          lsls    r1, r2, r1
          orrs    r3, r3, r6
          str     r1, [r0, #0]
          ands    r4, r3, r4, asr #32
          it      cc
          movcc   r4, r5
          str     r4, [r0, #4]
          pop     {r4, r5, r6}
          bx      lr

With just neon-shifts.patch, we get this output, now with NEON shifts:

f:
         fldd    d17, [r0, #0]   @ int
         mov     r2, r1
         movs    r3, #0
         push    {r4, r5}
         fmdrr   d18, r2, r3     @ int
         vshl.i64        d16, d17, d18
         fstd    d16, [r0, #0]   @ int
         pop     {r4, r5}
         bx      lr


As you can see, the shift is much improved, but the shift amount is 
first extended into two SImode registers, and then moved to a NEON 
DImode register, which increases core-register pressure unnecessarily.

With both patches, we now get this:

f:
         fldd    d17, [r0, #0]   @ int
         vdup.32 d16, r1
         vshr.u64        d16, d16, #32   <-- still unnecessary
         vshl.i64        d16, d17, d16
         fstd    d16, [r0, #0]   @ int
         bx      lr

Now the value is copied and then extended. I have chosen to use vdup.32 
instead of vmov.i32 because the latter can only target half the DImode 
registers. The right shift is necessary for a general zero-extend, but 
is not useful in this case as only the bottom 8 bits are interesting, 
and vdup has already done the right thing.
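To make that concrete, here is a C model of the two instructions (my own sketch for illustration, not part of the patch): vdup.32 places the 32-bit value in both halves of the D register, and the u64 right shift by #32 then leaves zeros in the top half, giving the zero-extend. Since the low half already holds the correct bottom bits after the vdup alone, the shift is redundant when only the bottom 8 bits matter.

```c
#include <assert.h>
#include <stdint.h>

/* Model of vdup.32: the 32-bit value appears in both halves of the
   64-bit D register.  */
static uint64_t
vdup32_model (uint32_t x)
{
  return ((uint64_t) x << 32) | x;
}

/* vshr.u64 #32 on the duplicated value yields the zero-extend; the low
   half already held the right bottom bits before the shift.  */
static uint64_t
zext_via_dup_shift (uint32_t x)
{
  return vdup32_model (x) >> 32;
}
```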

Note that the examples I've given are for left shifts. Right shifts are 
also implemented, but are a little more complicated (in the 
shift-by-register case) because the shift must be implemented as a left 
shift by a negative amount, and so an unspec is used to prevent the 
compiler doing anything 'clever'. Apart from an extra negation, the end 
result is much the same, but the patterns look different.
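For illustration, the right-shift-as-negated-left-shift trick can be sketched in C like this (my own model, assuming the shift count stays in 1..63; the hardware's bottom-byte extraction and out-of-range behaviour are deliberately not modelled):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: a logical right shift by n is issued to NEON as a left shift
   by -n.  Model only: assumes 0 < n < 64, so the bottom-byte count
   extraction and out-of-range counts are not modelled.  */
static uint64_t
ushr64_via_vshl (uint64_t val, unsigned n)
{
  int8_t count = (int8_t) -(int) n;	/* negated amount, as in the unspec */
  if (count >= 0)
    return val << count;
  return val >> (unsigned) -count;
}
```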


All this is a nice improvement, but I'm not happy:

1. The post-reload split means that I've had to add a clobber for CC to 
all the patterns, even though only some of them really need it. I think 
I've convinced myself that this is ok because it doesn't matter before 
scheduling, and after splitting the clobbers are only retained if 
they're really needed, but it still feels wrong.

2. The extend optimization is fine for general case extends, but it can 
be improved for the shift-amount case because we actually only need the 
bottom 8 bits, as indicated above. The problem is that there's no 
obvious way to achieve this:
    - there's no combine pass after this point, so a pattern that 
recognises and re-splits the extend, move and shift can't be used.
    - I don't believe there can be a pattern that uses SImode for the 
shift amount because the value needs to be in a DImode register 
eventually, and that means one needs to have been allocated before it 
gets split, and that means the extend needs to be separate.

3. The type of the shift-amount is determined by the type used in the 
ashldi3 pattern, and that uses SImode. This is fine for values already 
in SImode registers (probably the common case), but means that values 
already in DImode registers will have to get truncated and then 
re-extended, and this is not an operation that can generally be 
optimized away once introduced.
    - I've considered using a DImode shift-amount for the ashldi3 
pattern, and that would solve this problem - extend and truncate *can* 
be optimized away, but since it doesn't get split until post reload, the 
register allocator would already have allocated two SImode registers 
before we have any chance to make it go away.

4. I'm not sure, but I think the general-case shift in core registers is 
sufficiently long-winded that it might be worthwhile completely 
discarding that option (i.e. it might be cheaper to just always use NEON 
shifts, when NEON is available, of course). I'd keep the 
shift-by-constant-amount variants though. Does anybody have any comments 
on that?

5. The left and right shift patterns couldn't be unified because I 
couldn't find a way to do match_operand with unspecs, and anyway, the 
patterns are a slightly different shape.

6. Same with the logical and arithmetic right shifts; I couldn't find a 
way to unify those patterns either, even though the only difference is 
the unspec index number.


Any help would be appreciated. I've probably implemented this backwards, 
or something ...

Thanks a lot

Andrew
Richard Earnshaw - Dec. 7, 2011, 1:42 p.m.
On 02/12/11 12:42, Andrew Stubbs wrote:
> [...]


So it looks like the code generated for core registers with thumb2 is
pretty rubbish (no real surprise there) -- to get the best code you need
to make use of the fact that on ARM a shift by a small negative number
(< -128) will give zero.  This gives us sequences like:

For ARM state it's something like (untested)

					@ shft < 32			, shft >= 32
__ashldi3_v3:
	sub	r3, r2, #32		@ -ve            		, shft - 32
	lsl	ah, ah, r2		@ ah << shft     		, 0
	rsb	ip, r2, #32		@ 32 - shft      		, -ve
	orr	ah, ah, al, lsl r3	@ ah << shft     		, al << shft - 32
	orr	ah, ah, al, lsr ip	@ ah << shft | al >> 32 - shft	, al << shft - 32
	lsl	al, al, r2		@ al << shft     		, 0

For Thumb2 (where there is no orr with register shift)

	lsls	ah, ah, r2		@ ah << shft     		, 0
	sub	r3, r2, #32		@ -ve            		, shft - 32
	lsl	ip, al, r3		@ 0              		, al << shft - 32
	negs	r3, r3			@ 32 - shft      		, -ve
	orr	ah, ah, ip		@ ah << shft     		, al << shft - 32
	lsr	r3, al, r3		@ al >> 32 - shft		, 0
	orrs	ah, ah, r3		@ ah << shft | al >> 32 - shft	, al << shft - 32
	lsls	al, al, r2		@ al << shft     		, 0

Neither sequence needs the condition flags during execution (and both
are probably better than the code currently in lib1funcs.asm for a
modern core).  The flag-clobbering behaviour in the thumb2 variant is
only a code-size saving; that would normally be added by a late
optimization pass.
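
As a sanity check, the ARM-state sequence above can be simulated in C (my own untested sketch, assuming a register-specified shift uses only the bottom byte of the register and that amounts of 32..255 give zero):

```c
#include <assert.h>
#include <stdint.h>

/* ARM LSL/LSR by register: only the bottom byte of the amount is used,
   and any amount in 32..255 produces zero (assumption stated above).  */
static uint32_t
lsl32 (uint32_t v, uint32_t amt)
{
  unsigned n = amt & 0xff;
  return n < 32 ? v << n : 0;
}

static uint32_t
lsr32 (uint32_t v, uint32_t amt)
{
  unsigned n = amt & 0xff;
  return n < 32 ? v >> n : 0;
}

/* Step-for-step simulation of the six-instruction __ashldi3_v3
   sequence, for 1 <= shft <= 63.  */
static uint64_t
ashldi3_v3_model (uint64_t x, uint32_t shft)
{
  uint32_t al = (uint32_t) x, ah = (uint32_t) (x >> 32);
  uint32_t r3 = shft - 32;	/* sub r3, r2, #32 */
  ah = lsl32 (ah, shft);	/* lsl ah, ah, r2 */
  uint32_t ip = 32 - shft;	/* rsb ip, r2, #32 */
  ah |= lsl32 (al, r3);		/* orr ah, ah, al, lsl r3 */
  ah |= lsr32 (al, ip);		/* orr ah, ah, al, lsr ip */
  al = lsl32 (al, shft);	/* lsl al, al, r2 */
  return ((uint64_t) ah << 32) | al;
}
```

The two "impossible" column entries fall out naturally: when shft < 32, r3 is negative, so its bottom byte is >= 32 and the lsl contributes zero; when shft >= 32, the same happens to ip and the lsr contributes zero.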

None of this directly helps with your neon usage, but it does show that we
really don't need to clobber the condition code register to get an
efficient sequence.

R.
Stubbs, Andrew - Dec. 7, 2011, 2:03 p.m.
On Wed 07 Dec 2011 13:42:37 GMT, Richard Earnshaw wrote:
> So it looks like the code generated for core registers with thumb2 is
> pretty rubbish (no real surprise there -- to get the best code you need
> to make use of the fact that on ARM a shift by a small negative number
> (<  -128) will give zero.  This gives us sequences like:
>
> For ARM state it's something like (untested)
>
> 					@ shft<  32			, shft>= 32
> __ashldi3_v3:
> 	sub	r3, r2, #32		@ -ve            		, shft - 32
> 	lsl	ah, ah, r2		@ ah<<  shft     		, 0
> 	rsb	ip, r2, #32		@ 32 - shft      		, -ve
> 	orr	ah, ah, al, lsl r3	@ ah<<  shft     		, al<<  shft - 32
> 	orr	ah, ah, al, lsr ip	@ ah<<  shft | al>>  32 - shft	, al<<  shft - 32
> 	lsl	al, al, r2		@ al<<  shft     		, 0
>
> For Thumb2 (where there is no orr with register shift)
>
> 	lsls	ah, ah, r2		@ ah<<  shft     		, 0
> 	sub	r3, r2, #32		@ -ve            		, shft - 32
> 	lsl	ip, al, r3		@ 0              		, al<<  shft - 32
> 	negs	r3, r3			@ 32 - shft      		, -ve
> 	orr	ah, ah, ip		@ ah<<  shft     		, al<<  shft - 32
> 	lsr	r3, al, r3		@ al>>  32 - shft		, 0
> 	orrs	ah, ah, r3		@ ah<<  shft | al>>  32 - shft	, al<<  shft - 32
> 	lsls	al, al, r2		@ al<<  shft     		, 0
>
> Neither of which needs the condition flags during execution (and indeed
> is probably better in both cases than the code currently in lib1funcs.asm
> for a modern core).  The flag clobbering behaviour in the thumb2 variant
> is only for code size saving; that would normally be added by a late
> optimization pass.

OK, those are interesting, and I can look into making it happen, with or 
without NEON.

Would it not require an unspec to prevent 'clever things' happening to 
the negative shift, if we were to encode these in the machine 
description? I'm not too clear on what these 'clever things' might be in 
the case of shift-by-register (presumably value-range propagation is 
one), but I know the NEON shifts are encoded this way for safety.

> None of this directly helps with your neon usage, but it does show that we
> really don't need to clobber the condition code register to get an
> efficient sequence.

Except that it doesn't hold in the case of a shift by one, where there 
is a two-instruction sequence that clobbers CC. Presumably this special 
case can be treated differently though, right from expand.

Andrew
Richard Earnshaw - Dec. 7, 2011, 2:20 p.m.
On 07/12/11 14:03, Andrew Stubbs wrote:
> On Wed 07 Dec 2011 13:42:37 GMT, Richard Earnshaw wrote:
>> [...]
> 
> OK, those are interesting, and I can look into making it happen, with or 
> without NEON.
> 
> Would it not require an unspec to prevent 'clever things' happening to 
> the negative shift, if we were to encode these in the machine 
> description? I'm not too clear on what these 'clever things' might be in 
> the case of shift-by-register (presumably value-range propagation is 
> one), but I know the NEON shifts are encoded this way for safety.
> 

Given the way the shift patterns in the compiler are written today, quite possibly.  Though in the
general case of a non-constant shift the optimizer probably wouldn't be able to safely make any
assumptions that would break things.

I suspect that the shift patterns should really be changed to make the shift be by a QImode value;
this would then correctly describe the number of bits in the register that are really involved in
the shift.  Further, we could then say that, for core registers, the full value in that QI register
was used to determine the shift.  It would be quite a lot of churn to fix this though.

>> None of this directly helps with your neon usage, but it does show that we
>> really don't need to clobber the condition code register to get an
>> efficient sequence.
> 
> Except that it doesn't in the case of a shift by one where there is a 
> two-instruction sequence that clobbers CC. Presumably this special case 
> can be treated differently though, right from expand.
> 

All of the sequences above can be simplified significantly if the shift amount is constant.  I think
that, with the exception of the special case you mention (which is only for shift right by 1), you
then never need the condition codes and never need more than 3 ARM instructions:

shifts < 32

LSL	AH, AH, #n
ORR	AH, AH, AL, LSR #(32 - n)
LSL	AL, AL, #n

shifts >= 32
LSL	AH, AL, #(n - 32)
MOV	AL, #0

In fact both of the above sequences are equally good for Thumb2.  If we lost the RRX tweak it
wouldn't be a major loss (we could even put it back as a peephole2 to handle the common case where
the condition code registers were known to be dead).
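
(For completeness, a C model of the constant-amount sequences above -- just a sketch, valid for 0 < n < 64; n == 0 is a plain move and is excluded:)

```c
#include <assert.h>
#include <stdint.h>

/* C model of the constant-shift sequences above, for 0 < n < 64.
   The n == 0 case would shift by 32 - n == 32 below, which is
   undefined in C, so it is excluded (it's just a move anyway).  */
static uint64_t
shl64_const (uint64_t x, unsigned n)
{
  uint32_t al = (uint32_t) x, ah = (uint32_t) (x >> 32);
  if (n < 32)
    {
      ah = (ah << n) | (al >> (32 - n));	/* LSL ; ORR ..., LSR #(32 - n) */
      al <<= n;					/* LSL AL, AL, #n */
    }
  else
    {
      ah = al << (n - 32);			/* LSL AH, AL, #(n - 32) */
      al = 0;					/* MOV AL, #0 */
    }
  return ((uint64_t) ah << 32) | al;
}
```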

R.
Andrew Stubbs - Dec. 7, 2011, 2:36 p.m.
On Wed 07 Dec 2011 14:20:43 GMT, Richard Earnshaw wrote:
>> Would it not require an unspec to prevent 'clever things' happening to
>> the negative shift, if we were to encode these in the machine
>> description? I'm not too clear on what these 'clever things' might be in
>> the case of shift-by-register (presumably value-range propagation is
>> one), but I know the NEON shifts are encoded this way for safety.
>>
>
> Given the way the shift patterns in the compiler are written today, quite possibly.  Though in the
> general case of a non-constant shift the optimizer probably wouldn't be able to safely make any
> assumptions that would break things.

I've noticed that the right-shift NEON insns have an "EQUIV" entry in 
the dumps, so I'm suspicious that even then we're not totally safe from 
"optimization".

> I suspect that the shift patterns should really be changed to make the shift be by a QImode value;
> this would then correctly describe the number of bits in the register that are really involved in
> the shift.  Further, we could then say that, for core registers, the full value in that QI register
> was used to determine the shift.  It would be quite a lot of churn to fix this though.

Yeah, I considered this, but I couldn't figure out how to make the 
details work, and anyway, I've seen the compiler trying to truncate 
things (unnecessarily) for that sort of thing, so I haven't actually 
tried it.

Maybe I'll have a go and see what happens. I suspect it'd need extra 
patterns to combine the truncate seamlessly, and to allow actual QImode 
input as well?

>>> None of this directly helps with your neon usage, but it does show that we
>>> really don't need to clobber the condition code register to get an
>>> efficient sequence.
>>
>> Except that it doesn't in the case of a shift by one where there is a
>> two-instruction sequence that clobbers CC. Presumably this special case
>> can be treated differently though, right from expand.
>>
>
> All of the sequences above can be simplified significantly if the shift amount is constant and I
> think then, that with the exception of the special case you mention (which is only for shift right
> by 1) you never need the condition codes and you never need more than 3 ARM instructions:

Actually, there are "1bit" patterns for all the shift types.

> shifts < 32
>
> LSL	AH, AH, #n
> ORR	AH, AH, AL, LSR #(32 - n)
> LSL	AL, AL, #n
>
> shifts >= 32
> LSL	AH, AL, #(n - 32)
> MOV	AL, #0
>
> In fact both of the above sequences are equally good for Thumb2.  If we lost the RRX tweak it
> wouldn't be a major loss (we could even put it back as a peephole2 to handle the common case where
> the condition code registers were known to be dead).

Yes, these are what the compiler currently generates. With my patch, 
*sh*di3 never fails to expand (not if TARGET_NEON is true, anyway), so 
the compiler no longer does this automatically, and I have added 
splitters to do it manually.

Andrew
Stubbs, Andrew - Dec. 12, 2011, 4:28 p.m.
On 07/12/11 13:42, Richard Earnshaw wrote:
> So it looks like the code generated for core registers with thumb2 is
> pretty rubbish (no real surprise there -- to get the best code you need
> to make use of the fact that on ARM a shift by a small negative number
> (<  -128) will give zero.  This gives us sequences like:
>
> For ARM state it's something like (untested)
>
> 					@ shft < 32			, shft >= 32
> __ashldi3_v3:
> 	sub	r3, r2, #32		@ -ve            		, shft - 32
> 	lsl	ah, ah, r2		@ ah << shft     		, 0
> 	rsb	ip, r2, #32		@ 32 - shft      		, -ve
> 	orr	ah, ah, al, lsl r3	@ ah << shft     		, al << shft - 32
> 	orr	ah, ah, al, lsr ip	@ ah << shft | al >> 32 - shft	, al << shft - 32
> 	lsl	al, al, r2		@ al << shft     		, 0
>
> For Thumb2 (where there is no orr with register shift)
>
> 	lsls	ah, ah, r2		@ ah << shft     		, 0
> 	sub	r3, r2, #32		@ -ve            		, shft - 32
> 	lsl	ip, al, r3		@ 0              		, al << shft - 32
> 	negs	r3, r3			@ 32 - shft      		, -ve
> 	orr	ah, ah, ip		@ ah << shft     		, al << shft - 32
> 	lsr	r3, al, r3		@ al >> 32 - shft		, 0
> 	orrs	ah, ah, r3		@ ah << shft | al >> 32 - shft	, al << shft - 32
> 	lsls	al, al, r2		@ al << shft     		, 0
>
> Neither of which needs the condition flags during execution (and indeed
> is probably better in both cases than the code currently in lib1funcs.asm
> for a modern core).  The flag clobbering behaviour in the thumb2 variant
> is only for code size saving; that would normally be added by a late
> optimization pass.
>
> None of this directly helps with your neon usage, but it does show that we
> really don't need to clobber the condition code register to get an
> efficient sequence.

Unfortunately, both these sequences use two scratch registers, as shown, 
and that's worse than clobbering CC.

Now, I can implement this for non-Neon easily enough, I think, and that 
would be a win, but I'm trying to figure out how best to do it for both 
that case and the case where neon is available but the compiler chooses 
not to do it.

The problem is that when there is no neon available, this can be 
converted at expand or split1 time, but when neon *is* available we have 
to wait until a post-reload split, and then we'd be forced to expand 
this in early-clobber mode, which is far less optimal.

Any suggestions on how to do this without pessimizing the code in the 
case that NEON is available but not used?

In fact, is the general shift operation sufficiently expensive that I 
should just abandon the fall-back alternatives and *always* use NEON 
when available? In that case, what about A8 vs. A9?

Thanks

Andrew
Richard Earnshaw - Dec. 12, 2011, 5:26 p.m.
On 12/12/11 16:28, Andrew Stubbs wrote:
> On 07/12/11 13:42, Richard Earnshaw wrote:
>> [...]
> 
> Unfortunately, both these sequences use two scratch registers, as shown, 
> and that's worse than clobbering CC.
> 
> Now, I can implement this for non-Neon easily enough, I think, and that 
> would be a win, but I'm trying to figure out how best to do it for both 
> that case and the case where neon is available but the compiler chooses 
> not to do it.
> 
> The problem is that when there is no neon available, this can be 
> converted at expand or split1 time, but when neon *is* available we have 
> to wait until a post-reload split, and then we'd be forced to expand 
> this in early-clobber mode, which is far less optimal.
> 
> Any suggestions on how to do this without pessimizing the code in the 
> case that NEON is available but not used?
> 
> In fact, is the general shift operation sufficiently expensive that I 
> should just abandon the fall-back alternatives and *always* use NEON 
> when available? In that case, what about A8 vs. A9?
> 
> Thanks
> 
> Andrew
> 

Can't you write the pattern with the scratch registers, but use "X" as
the constraint when NEON is in use (so that no register gets allocated)?
It's easier to do that for real registers than for the condition codes
register, because they really can just go away if they aren't needed.


R.

Patch



--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -4403,33 +4403,35 @@ 
 ;; Zero and sign extension instructions.
 
 (define_insn "zero_extend<mode>di2"
-  [(set (match_operand:DI 0 "s_register_operand" "=r")
+  [(set (match_operand:DI 0 "s_register_operand" "=w, r")
         (zero_extend:DI (match_operand:QHSI 1 "<qhs_zextenddi_op>"
 					    "<qhs_zextenddi_cstr>")))]
   "TARGET_32BIT <qhs_zextenddi_cond>"
   "#"
-  [(set_attr "length" "8")
-   (set_attr "ce_count" "2")
-   (set_attr "predicable" "yes")]
+  [(set_attr "length" "8,8")
+   (set_attr "ce_count" "2,2")
+   (set_attr "predicable" "yes,yes")]
 )
 
 (define_insn "extend<mode>di2"
-  [(set (match_operand:DI 0 "s_register_operand" "=r")
+  [(set (match_operand:DI 0 "s_register_operand" "=w,r")
         (sign_extend:DI (match_operand:QHSI 1 "<qhs_extenddi_op>"
 					    "<qhs_extenddi_cstr>")))]
   "TARGET_32BIT <qhs_sextenddi_cond>"
   "#"
-  [(set_attr "length" "8")
-   (set_attr "ce_count" "2")
-   (set_attr "shift" "1")
-   (set_attr "predicable" "yes")]
+  [(set_attr "length" "8,8")
+   (set_attr "ce_count" "2,2")
+   (set_attr "shift" "1,1")
+   (set_attr "predicable" "yes,yes")]
 )
 
 ;; Splits for all extensions to DImode
 (define_split
   [(set (match_operand:DI 0 "s_register_operand" "")
         (zero_extend:DI (match_operand 1 "nonimmediate_operand" "")))]
-  "TARGET_32BIT"
+  "TARGET_32BIT && (!TARGET_NEON
+		    || (reload_completed
+			&& !(IS_VFP_REGNUM (REGNO (operands[0])))))"
   [(set (match_dup 0) (match_dup 1))]
 {
   rtx lo_part = gen_lowpart (SImode, operands[0]);
@@ -4455,7 +4457,9 @@ 
 (define_split
   [(set (match_operand:DI 0 "s_register_operand" "")
         (sign_extend:DI (match_operand 1 "nonimmediate_operand" "")))]
-  "TARGET_32BIT"
+  "TARGET_32BIT && (!TARGET_NEON
+		    || (reload_completed
+			&& !(IS_VFP_REGNUM (REGNO (operands[0])))))"
   [(set (match_dup 0) (ashiftrt:SI (match_dup 1) (const_int 31)))]
 {
   rtx lo_part = gen_lowpart (SImode, operands[0]);
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -405,8 +405,8 @@ 
 (define_mode_attr qhs_extenddi_op [(SI "s_register_operand")
 				   (HI "nonimmediate_operand")
 				   (QI "arm_reg_or_extendqisi_mem_op")])
-(define_mode_attr qhs_extenddi_cstr [(SI "r") (HI "rm") (QI "rUq")])
-(define_mode_attr qhs_zextenddi_cstr [(SI "r") (HI "rm") (QI "rm")])
+(define_mode_attr qhs_extenddi_cstr [(SI "r,r") (HI "r,rm") (QI "r,rUq")])
+(define_mode_attr qhs_zextenddi_cstr [(SI "r,r") (HI "r,rm") (QI "r,rm")])
 
 ;; Mode attributes used for fixed-point support.
 (define_mode_attr qaddsub_suf [(V4UQQ "8") (V2UHQ "16") (UQQ "8") (UHQ "16")
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -5818,3 +5818,25 @@ 
                                    (const_string "neon_fp_vadd_qqq_vabs_qq"))
                      (const_string "neon_int_5")))]
 )
+
+;; Copy from core-to-neon regs, then extend, not vice-versa
+
+(define_split
+  [(set (match_operand:DI 0 "s_register_operand" "")
+	(sign_extend:DI (match_operand:SI 1 "s_register_operand" "")))]
+  "TARGET_NEON && reload_completed && IS_VFP_REGNUM (REGNO (operands[0]))"
+  [(set (match_dup 0) (vec_duplicate:V2SI (match_dup 1)))
+   (parallel [(set (match_dup 0) (ashiftrt:DI (match_dup 0) (const_int 32)))
+	      (clobber (reg:CC CC_REGNUM))])])
+
+(define_split
+  [(set (match_operand:DI 0 "s_register_operand" "")
+	(zero_extend:DI (match_operand:SI 1 "s_register_operand" "")))]
+  "TARGET_NEON && reload_completed && IS_VFP_REGNUM (REGNO (operands[0]))"
+  [(set (match_dup 2) (vec_duplicate:V2SI (match_dup 1)))
+   (parallel [(set (match_dup 0) (lshiftrt:DI (match_dup 0) (const_int 32)))
+              (clobber (reg:CC CC_REGNUM))])]
+  "
+  {
+    operands[2] = gen_rtx_REG (V2SImode, REGNO (operands[0]));
+  }")