
[AVR] : FIX ICE in optabs due to bad rotate expander.

Message ID 4DA880F6.4070109@gjlay.de
State New

Commit Message

Georg-Johann Lay April 15, 2011, 5:31 p.m. UTC
Denis Chertykov wrote:
> 2011/4/14 Georg-Johann Lay <avr@gjlay.de>:
>> Denis Chertykov wrote:
>>> 2011/4/14 Georg-Johann Lay <avr@gjlay.de>:
>>>> The "rotl<mode>3" expanders (mode \in {HI,SI,DI}) violates synopsis by
>>>> using 4 operands instead of 3. This runs in ICE in top of
>>>> optabs.c:maybe_gen_insn.
>>>>
>>>> The right way to do this is to use match_dup, not match_operand. So
>>>> the fix is obvious.
>>>>
>>>> Regenerated avr-gcc and ran it against the avr testsuite:
>>>>
>>>> gcc.target/avr/torture/pr41885.c generates these patterns
>>>>
>>>> Johann
>>>>
>>>> 2011-04-14  Georg-Johann Lay  <avr@gjlay.de>
>>>>
>>>>        * config/avr/avr.md ("rotlhi3", "rotlsi3", "rotldi3"): Fix
>>>>        expanders operand 3 from match_operand to match_dup.
>>> Maybe it's better to use (clobber (match_scratch ...)) here and in
>>> `*rotw<mode>' and `*rotb<mode>'.
>> These are splitters that might split before reload, producing strange,
>> non-matching insns like
>>  (set (scratch:QI) ...
>> or crashing already in avr_rotate_bytes because the split condition reads
>> "&& (reload_completed || <MODE>mode == DImode)"
> 
> Generally I agree with you; changing match_operand to match_dup is easy.

I already submitted the easy patch to avoid the ICE.
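A sketch of that one-liner, following the ChangeLog above -- the removed
match_operand line is paraphrased here, only the match_dup form is
verbatim from the current expander:

    -  (clobber (match_operand 3 "register_operand" ""))])]
    +  (clobber (match_dup 3))])]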

> But IMHO the right way is:
>  - the splitters and avr_rotate_bytes are wrong and must be fixed too.
>  - operands[3] is a scratch register, and the right way is to use match_scratch.
> 
> I can approve the patch.
> 
> Denis.

Ok, here is the right-way patch.

The expander generates a SCRATCH instead of a pseudo, and in the case of
a pre-reload split the SCRATCH is replaced with a pseudo, because it is
not known whether the scratch will actually be needed.

avr_rotate_bytes now skips operations on non-REG 'scratch'.
Furthermore, I added an assertion that ensures that 'scratch' is
actually a REG when it is needed.
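Concretely, the split code in the patch below now cooks up the pseudo
like this before calling avr_rotate_bytes (excerpted from the
"*rotw<mode>" split, with the comment expanded):

    /* A pre-reload split (.split1): replace the SCRATCH placeholder
       with a real pseudo, since avr_rotate_bytes may still need a
       scratch register to break a move deadlock.  */
    if (SCRATCH == GET_CODE (operands[3])
        && !reload_completed)
      {
        operands[3] = gen_reg_rtx (GET_MODE (operands[3]));
      }
    avr_rotate_bytes (operands);
    DONE;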

Besides that, I fixed the indentation. I know it is not optimal to fix
indentation alongside functional changes... did it anyway :-)

Finally, I exposed alternative #3 of the insns to the register
allocator, because it is not possible to distinguish between
overlapping or non-overlapping regs, and #3 does not need a scratch.

Ran C-testsuite with no regressions.

Johann

2011-04-15  Georg-Johann Lay  <avr@gjlay.de>

	* config/avr/avr.md ("rotl<mode>3"): Use SCRATCH instead
	of REG in operand 3. Fix indentation. Unquote C snippet.
	("*rotw<mode>","*rotbw<mode>"): Ditto. Replace scratch
	with pseudo for pre-reload splits. Use const_int_operand for
	operand 2. Expose alternative 3 to register allocator.
	* config/avr/avr.c (avr_rotate_bytes): Handle SCRATCH in
	operands[3]. Use GNU indentation style.

Comments

Denis Chertykov April 17, 2011, 9:47 a.m. UTC | #1
2011/4/15 Georg-Johann Lay <avr@gjlay.de>:
> Denis Chertykov wrote:
>> 2011/4/14 Georg-Johann Lay <avr@gjlay.de>:
>>> Denis Chertykov wrote:
>>>> 2011/4/14 Georg-Johann Lay <avr@gjlay.de>:
>>>>> The "rotl<mode>3" expanders (mode \in {HI,SI,DI}) violates synopsis by
>>>>> using 4 operands instead of 3. This runs in ICE in top of
>>>>> optabs.c:maybe_gen_insn.
>>>>>
>>>>> The right way to do this is to use match_dup, not match_operand. So
>>>>> the fix is obvious.
>>>>>
>>>>> Regenerated avr-gcc and ran it against the avr testsuite:
>>>>>
>>>>> gcc.target/avr/torture/pr41885.c generates these patterns
>>>>>
>>>>> Johann
>>>>>
>>>>> 2011-04-14  Georg-Johann Lay  <avr@gjlay.de>
>>>>>
>>>>>        * config/avr/avr.md ("rotlhi3", "rotlsi3", "rotldi3"): Fix
>>>>>        expanders operand 3 from match_operand to match_dup.
>>>> Maybe it's better to use (clobber (match_scratch ...)) here and in
>>>> `*rotw<mode>' and `*rotb<mode>'.
>>> These are splitters that might split before reload, producing strange,
>>> non-matching insns like
>>>  (set (scratch:QI) ...
>>> or crashing already in avr_rotate_bytes because the split condition reads
>>> "&& (reload_completed || <MODE>mode == DImode)"
>>
>> Generally I agree with you; changing match_operand to match_dup is easy.
>
> I already submitted the easy patch to avoid the ICE.
>
>> But IMHO the right way is:
>>  - the splitters and avr_rotate_bytes are wrong and must be fixed too.
>>  - operands[3] is a scratch register, and the right way is to use match_scratch.
>>
>> I can approve the patch.
>>
>> Denis.
>
> Ok, here is the right-way patch.
>
> The expander generates a SCRATCH instead of a pseudo, and in the case of
> a pre-reload split the SCRATCH is replaced with a pseudo, because it is
> not known whether the scratch will actually be needed.
>
> avr_rotate_bytes now skips operations on non-REG 'scratch'.
> Furthermore, I added an assertion that ensures that 'scratch' is
> actually a REG when it is needed.
>
> Besides that, I fixed the indentation. I know it is not optimal to fix
> indentation alongside functional changes... did it anyway :-)
>
> Finally, I exposed alternative #3 of the insns to the register
> allocator, because it is not possible to distinguish between
> overlapping or non-overlapping regs, and #3 does not need a scratch.
>
> Ran C-testsuite with no regressions.

Did you encounter any difference in code size?

Denis.
Denis Chertykov April 17, 2011, 6:57 p.m. UTC | #2
2011/4/17 Denis Chertykov <chertykov@gmail.com>:
> 2011/4/15 Georg-Johann Lay <avr@gjlay.de>:
>> Finally, I exposed alternative #3 of the insns to the register
>> allocator, because it is not possible to distinguish between
>> overlapping or non-overlapping regs, and #3 does not need a scratch.
>>
>> Ran C-testsuite with no regressions.
>
> Did you encounter any difference in code size?

I asked about code size because the IRA pass doesn't work with
`scratch:MODE' at all.
This leads to bad/wrong register allocation in the IRA pass.
The reload pass will correct such a wrong allocation, but reload can't
generate optimal code (reload generates correct code).
Because of that, maybe you are right and it may be better to have
(clobber (match_operand....)) instead of (clobber (match_scratch...)).
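For reference, the two forms, as they appear in the old and new versions
of "*rotb<mode>" in the patch below:

    ;; match_scratch: IRA ignores the scratch, reload allocates it.
    (clobber (match_scratch:QI 3 "=<rotx>"))

    ;; match_operand: the expander must supply a pseudo, which IRA
    ;; then allocates like any other operand.
    (clobber (match_operand:QI 3 "register_operand" "=<rotx>"))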

Denis.
Georg-Johann Lay April 18, 2011, 10:39 a.m. UTC | #3
Denis Chertykov wrote:
> 2011/4/17 Denis Chertykov <chertykov@gmail.com>:
>> 2011/4/15 Georg-Johann Lay <avr@gjlay.de>:
>>> Finally, I exposed alternative #3 of the insns to the register
>>> allocator, because it is not possible to distinguish between
>>> overlapping or non-overlapping regs, and #3 does not need a scratch.
>>>
>>> Ran C-testsuite with no regressions.
>> Did you encounter any difference in code size?
> 
> I asked about code size because the IRA pass doesn't work with
> `scratch:MODE' at all.

I wonder what the difference is; IRA could treat scratch:MODE just the
same way as pseudo:MODE. Anyway, thanks for that insight. It's nearly
impossible to get the big picture of IRA/reload just from digging
through the sources.

So that means IRA generates garbage for scratch and reload has to fix
it, generate spill code if needed, etc?

Does this mean that if IRA sees a pseudo together with constraint "X"
it will allocate the pseudo, anyway?

> This leads to bad/wrong register allocation in the IRA pass.
> The reload pass will correct such a wrong allocation, but reload can't
> generate optimal code (reload generates correct code).
> Because of that, maybe you are right and it may be better to have
> (clobber (match_operand....)) instead of (clobber (match_scratch...)).

I don't see effects on code size, at least for the binaries left by the
testsuite.

Is there a policy on how to drive size/performance tests for avr?
Anatoly posted a result on code size some weeks ago (concerning move
costs); it would be interesting to know his setup and what
sources/options he uses. Using the testsuite does not seem appropriate
to me, because it runs with -O3, -finline, etc., and most of the code
is not real-world code, at least not for avr.

FYI, what I observe is a remarkable dependency on subreg lowering that
can be summarized as follows:
* code size decreases for mode <= SImode
* code size increases for mode > SImode

An example for gcc.target/avr/torture/pr41885.c -Os

-fno-split-wide-types

00000000 00000008 T rotl_16a
00000008 00000006 T rotl_16b
00000014 00000012 T rotl_32a
00000026 00000010 T rotl_32b
00000036 00000006 T rotl_32c
00000042 00000010 T rotl_32d
00000052 00000010 T rotl_32e
00000062 00000088 T rotl_64
00000150 00000106 T rotl_64a
00000256 00000082 T rotl_64b
00000338 00000096 T rotl_64c
00000434 00000098 T rotl_64d
00000532 00000100 T rotl_64e
00000632 00000102 T rotl_64f
00000734 00000104 T rotl_64g
00000838 00000110 T rotl_64h

-fsplit-wide-types

00000000 00000008 T rotl_16a
00000008 00000006 T rotl_16b
00000014 00000024 T rotl_32a
00000038 00000020 T rotl_32b
00000058 00000024 T rotl_32c
00000082 00000022 T rotl_32d
00000104 00000022 T rotl_32e
00000126 00000036 T rotl_64
00000162 00000050 T rotl_64a
00000212 00000026 T rotl_64b
00000238 00000036 T rotl_64c
00000274 00000038 T rotl_64d
00000312 00000040 T rotl_64e
00000352 00000042 T rotl_64f
00000394 00000044 T rotl_64g
00000438 00000038 T rotl_64h

Johann

> 
> Denis.
>
Denis Chertykov April 18, 2011, 4:31 p.m. UTC | #4
2011/4/18 Georg-Johann Lay <avr@gjlay.de>:
> Denis Chertykov wrote:
>> 2011/4/17 Denis Chertykov <chertykov@gmail.com>:
>>> 2011/4/15 Georg-Johann Lay <avr@gjlay.de>:
>>>> Finally, I exposed alternative #3 of the insns to the register
>>>> allocator, because it is not possible to distinguish between
>>>> overlapping or non-overlapping regs, and #3 does not need a scratch.
>>>>
>>>> Ran C-testsuite with no regressions.
>>> Did you encounter any difference in code size?
>>
>> I asked about code size because the IRA pass doesn't work with
>> `scratch:MODE' at all.
>
> I wonder what the difference is; IRA could treat scratch:MODE just the
> same way as pseudo:MODE. Anyway, thanks for that insight. It's nearly
> impossible to get the big picture of IRA/reload just from digging
> through the sources.
>
> So that means IRA generates garbage for scratch and reload has to fix
> it, generate spill code if needed, etc?

No garbage; IRA just ignores scratches.
The reload pass will allocate scratch registers if possible.
If it's impossible, then the reload pass will spill and generate reloads
for insns with scratches.

> Does this mean that if IRA sees a pseudo together with constraint "X"
> it will allocate the pseudo, anyway?

I think so, but it's better to test IRA.

>> This leads to bad/wrong register allocation in the IRA pass.
>> The reload pass will correct such a wrong allocation, but reload can't
>> generate optimal code (reload generates correct code).
>> Because of that, maybe you are right and it may be better to have
>> (clobber (match_operand....)) instead of (clobber (match_scratch...)).
>
> I don't see effects on code size, at least for the binaries left by the
> testsuite.
>
> Is there a policy on how to drive size/performance tests for avr?
> Anatoly posted a result on code size some weeks ago (concerning move
> costs); it would be interesting to know his setup and what
> sources/options he uses. Using the testsuite does not seem appropriate
> to me, because it runs with -O3, -finline, etc., and most of the code
> is not real-world code, at least not for avr.

IMHO libgcc is a good test case.

> FYI, what I observe is a remarkable dependency on subreg lowering that
> can be summarized as follows:
> * code size decreases for mode <= SImode
> * code size increases for mode > SImode

I think that DImode move instructions must be supported to get
representative test results for mode > SImode.


Personally, I don't want to support DImode for AVR.
IMHO DImode is overkill for an 8-bit controller.


Another note:
A few years ago I played with early splitting of everything possible
(move, add, sub, and, ...). The results were very bad.
That happened because the flow of split insns (8-bit insns) becomes
unreadable for most GCC optimization passes.
Maybe splitting is appropriate before register allocation or even after
the reload pass.

Denis.
PS: about code size tests: try to email directly to Anatoly.
PPS: I can try to call him.
Denis Chertykov April 18, 2011, 4:35 p.m. UTC | #5
2011/4/18 Denis Chertykov <chertykov@gmail.com>:
> 2011/4/18 Georg-Johann Lay <avr@gjlay.de>:
> A few years ago I played with early splitting of everything possible
> (move, add, sub, and, ...). The results were very bad.
> That happened because the flow of split insns (8-bit insns) becomes
> unreadable for most GCC optimization passes.
> Maybe splitting is appropriate before register allocation or even after
> the reload pass.

Maybe something like this (early splitting) happened with the subreg
lowering.

Denis.
Georg-Johann Lay April 19, 2011, 9:12 a.m. UTC | #6
Denis Chertykov wrote:
> 2011/4/18 Georg-Johann Lay <avr@gjlay.de>:
>> Denis Chertykov wrote:
>>> 2011/4/17 Denis Chertykov <chertykov@gmail.com>:
>>>> 2011/4/15 Georg-Johann Lay <avr@gjlay.de>:
>>>>> Finally, I exposed alternative #3 of the insns to the register
>>>>> allocator, because it is not possible to distinguish between
>>>>> overlapping or non-overlapping regs, and #3 does not need a scratch.
>>>>>
>>>>> Ran C-testsuite with no regressions.
>>>> Did you encounter any difference in code size?
>>> I asked about code size because the IRA pass doesn't work with
>>> `scratch:MODE' at all.
>> I wonder what the difference is; IRA could treat scratch:MODE just the
>> same way as pseudo:MODE. Anyway, thanks for that insight. It's nearly
>> impossible to get the big picture of IRA/reload just from digging
>> through the sources.
>>
>> So that means IRA generates garbage for scratch and reload has to fix
>> it, generate spill code if needed, etc?
> 
> No garbage; IRA just ignores scratches.
> The reload pass will allocate scratch registers if possible.
> If it's impossible, then the reload pass will spill and generate reloads
> for insns with scratches.
> 
>> Does this mean that if IRA sees a pseudo together with constraint "X"
>> it will allocate the pseudo, anyway?
> 
> I think so, but it's better to test IRA.
> 
>>> This leads to bad/wrong register allocation in the IRA pass.
>>> The reload pass will correct such a wrong allocation, but reload can't
>>> generate optimal code (reload generates correct code).
>>> Because of that, maybe you are right and it may be better to have
>>> (clobber (match_operand....)) instead of (clobber (match_scratch...)).
>> I don't see effects on code size, at least for the binaries left by the
>> testsuite.
>>
>> Is there a policy on how to drive size/performance tests for avr?
>> Anatoly posted a result on code size some weeks ago (concerning move
>> costs); it would be interesting to know his setup and what
>> sources/options he uses. Using the testsuite does not seem appropriate
>> to me, because it runs with -O3, -finline, etc., and most of the code
>> is not real-world code, at least not for avr.
> 
> IMHO libgcc is a good test case.

I don't see differences there, either, presumably because the rotate
patterns are not used, or used only very seldom.

>> FYI, what I observe is a remarkable dependency on subreg lowering that
>> can be summarized as follows:
>> * code size decreases for mode <= SImode
>> * code size increases for mode > SImode
> 
> I think that DImode move instructions must be supported to get
> representative test results for mode > SImode.
> 
> 
> Personally, I don't want to support DImode for AVR.
> IMHO DImode is overkill for an 8-bit controller.

Some people use 64-bit on AVR. But supporting a movdi would be very
painful; I think it would cause more harm (like spill failures) than good.

> Another note:
> A few years ago I played with early splitting of everything possible
> (move, add, sub, and, ...). The results were very bad.
> That happened because the flow of split insns (8-bit insns) becomes
> unreadable for most GCC optimization passes.
> Maybe splitting is appropriate before register allocation or even after
> the reload pass.

The -f[no-]split-wide-types lowering happens before reload, in .subreg1
(prior to combine) or .subreg2 (after combine), respectively.

For 64bit, without subreg lowering, there is an insn like

(clobber (reg:DI 43 [ <retval> ]))

which causes a bunch of moves and finally the setup of a frame pointer
even though nothing lives in the frame.

How can add, sub etc. be split? This would need an explicit
representation of carry.
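Purely as an illustration of what that would mean -- a hypothetical RTL
shape, not taken from any existing port (REG_CC and the use of CCmode
are invented here):

    ;; Hypothetical sketch: an 8-bit add whose carry-out is modeled
    ;; explicitly by a parallel set of an invented CC register.
    (parallel [(set (match_operand:QI 0 "register_operand" "")
                    (plus:QI (match_operand:QI 1 "register_operand" "")
                             (match_operand:QI 2 "register_operand" "")))
               (set (reg:CC REG_CC)
                    (compare:CC (plus:QI (match_dup 1) (match_dup 2))
                                (match_dup 1)))])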

> Denis.
> PS: about code size tests: try to email directly to Anatoly.
> PPS: I can try to call him.

Johann
Denis Chertykov April 19, 2011, 9:52 a.m. UTC | #7
2011/4/19 Georg-Johann Lay <avr@gjlay.de>:
> How can add, sub etc. be split? This would need an explicit
> representation of carry.

Yes.

Look at http://gcc.gnu.org/ml/gcc/2005-03/msg00871.html

Denis.
Georg-Johann Lay April 19, 2011, 10:20 a.m. UTC | #8
Denis Chertykov wrote:
> 2011/4/19 Georg-Johann Lay <avr@gjlay.de>:
>> How can add, sub etc. be split? This would need an explicit
>> representation of carry.
> 
> Yes.
> 
> Look at http://gcc.gnu.org/ml/gcc/2005-03/msg00871.html

Just skimmed the conversation. I thought about making AVR ISA's
effects on SREG explicit several times, but I always got stuck at some
point.

- It's not only about scheduling (which does not happen for avr) but
  also about moving instructions across jumps.

- Many transformations would happen before reload, but at these stages
the effect on SREG is not yet known in many cases. There is
sophisticated instruction output for many patterns, and their impact
on SREG/CC is not known before reload.

- Making CC explicit would turn many single_set insns into PARALLELs,
making the optimizers' lives much harder or their work impossible. Imagine
instructions that could be combined: explicit CC would clutter up the
insns, and combine won't try to transform the bulky patterns.

- The backend would be much more complicated, and harder to maintain and
understand. Almost every insn would have to be changed.

Johann

> 
> Denis.
>
Denis Chertykov April 19, 2011, 11:17 a.m. UTC | #9
2011/4/19 Georg-Johann Lay <avr@gjlay.de>:
> Denis Chertykov wrote:
>> 2011/4/19 Georg-Johann Lay <avr@gjlay.de>:
>>> How can add, sub etc. be split? This would need an explicit
>>> representation of carry.
>>
>> Yes.
>>
>> Look at http://gcc.gnu.org/ml/gcc/2005-03/msg00871.html
>
> Just skimmed the conversation. I thought about making AVR ISA's
> effects on SREG explicit several times, but I always got stuck at some
> point.
>
> - It's not only about scheduling (which does not happen for avr) but
>  also about moving instructions across jumps.
>
> - Many transformations would happen before reload, but at these stages
> the effect on SREG is not yet known in many cases. There is
> sophisticated instruction output for many patterns, and their impact
> on SREG/CC is not known before reload.
>
> - Making CC explicit would turn many single_set insns into PARALLELs,
> making the optimizers' lives much harder or their work impossible. Imagine
> instructions that could be combined: explicit CC would clutter up the
> insns, and combine won't try to transform the bulky patterns.
>
> - The backend would be much more complicated, and harder to maintain and
> understand. Almost every insn would have to be changed.

Generally, I agree with you; the AVR port uses CC0 because of that.

Denis.
Richard Earnshaw April 19, 2011, 12:28 p.m. UTC | #10
On Tue, 2011-04-19 at 15:17 +0400, Denis Chertykov wrote:
> 2011/4/19 Georg-Johann Lay <avr@gjlay.de>:
> > Denis Chertykov wrote:
> >> 2011/4/19 Georg-Johann Lay <avr@gjlay.de>:
> >>> How can add, sub etc. be split? This would need an explicit
> >>> representation of carry.
> >>
> >> Yes.
> >>
> >> Look at http://gcc.gnu.org/ml/gcc/2005-03/msg00871.html
> >
> > Just skimmed the conversation. I thought about making AVR ISA's
> > effects on SREG explicit several times, but I always got stuck at some
> > point.
> >
> > - It's not only about scheduling (which does not happen for avr) but
> >  also about moving instructions across jumps.
> >
> > - Many transformations would happen before reload, but at these stages
> > the effect on SREG is not yet known in many cases. There is
> > sophisticated instruction output for many patterns, and their impact
> > on SREG/CC is not known before reload.
> >
> > - Making CC explicit would turn many single_set insns into PARALLELs,
> > making the optimizers' lives much harder or their work impossible. Imagine
> > instructions that could be combined: explicit CC would clutter up the
> > insns, and combine won't try to transform the bulky patterns.
> >
> > - The backend would be much more complicated, and harder to maintain and
> > understand. Almost every insn would have to be changed.
> 
> Generally, I agree with you; the AVR port uses CC0 because of that.

Thumb-1 support in the ARM compiler has similar flag-setting properties
(i.e. most instructions set the condition codes).  It doesn't use CC0.  It
works because it doesn't model the condition code register at all, but
treats compare/branch sequences as indivisible operations.
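For illustration, the usual shape of such an indivisible operation is a
combined compare-and-branch named pattern; this is a generic sketch, not
the actual Thumb-1 code:

    ;; Generic sketch: comparison and branch form one insn, so the
    ;; condition codes never appear explicitly in the RTL.
    (define_expand "cbranchsi4"
      [(set (pc)
            (if_then_else
              (match_operator 0 "comparison_operator"
                [(match_operand:SI 1 "register_operand" "")
                 (match_operand:SI 2 "nonmemory_operand" "")])
              (label_ref (match_operand 3 "" ""))
              (pc)))]
      ""
      "")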

R.
Georg-Johann Lay April 20, 2011, 10:19 a.m. UTC | #11
Denis Chertykov wrote:
> 2011/4/17 Denis Chertykov <chertykov@gmail.com>:
>> 2011/4/15 Georg-Johann Lay <avr@gjlay.de>:
>>> Finally, I exposed alternative #3 of the insns to the register
>>> allocator, because it is not possible to distinguish between
>>> overlapping or non-overlapping regs, and #3 does not need a scratch.
>>>
>>> Ran C-testsuite with no regressions.
>> Did you encounter any difference in code size?
> 
> I asked about code size because the IRA pass doesn't work with
> `scratch:MODE' at all.
> This leads to bad/wrong register allocation in the IRA pass.
> The reload pass will correct such a wrong allocation, but reload can't
> generate optimal code (reload generates correct code).
> Because of that, maybe you are right and it may be better to have
> (clobber (match_operand....)) instead of (clobber (match_scratch...)).

So the conclusion is not to commit this patch and that the one-liner
already installed is sufficient to fix the ICE?

> 
> Denis.
>
Denis Chertykov April 20, 2011, 11:18 a.m. UTC | #12
2011/4/20 Georg-Johann Lay <avr@gjlay.de>:
> Denis Chertykov wrote:
>> 2011/4/17 Denis Chertykov <chertykov@gmail.com>:
>>> 2011/4/15 Georg-Johann Lay <avr@gjlay.de>:
>>>> Finally, I exposed alternative #3 of the insns to the register
>>>> allocator, because it is not possible to distinguish between
>>>> overlapping or non-overlapping regs, and #3 does not need a scratch.
>>>>
>>>> Ran C-testsuite with no regressions.
>>> Did you encounter any difference in code size?
>>
>> I asked about code size because the IRA pass doesn't work with
>> `scratch:MODE' at all.
>> This leads to bad/wrong register allocation in the IRA pass.
>> The reload pass will correct such a wrong allocation, but reload can't
>> generate optimal code (reload generates correct code).
>> Because of that, maybe you are right and it may be better to have
>> (clobber (match_operand....)) instead of (clobber (match_scratch...)).
>
> So the conclusion is not to commit this patch and that the one-liner
> already installed is sufficient to fix the ICE?

Yes.

Denis.
Richard Henderson April 20, 2011, 5:23 p.m. UTC | #13
On 04/19/2011 02:52 AM, Denis Chertykov wrote:
> 2011/4/19 Georg-Johann Lay <avr@gjlay.de>:
>> How can add, sub etc. be split? This would need an explicit
>> representation of carry.
> 
> Yes.
> 
> Look at http://gcc.gnu.org/ml/gcc/2005-03/msg00871.html

Well, sort-of, but not really.

It gets a tad ugly, but have a look at the adddi3* patterns in
the mn10300 and rx ports.

In particular note how both the inputs and outputs to the insn
(not the expander) are all SImode, allowing for lower_subreg to
do its job.  The patterns are split post-reload -- some of that
is for scheduling, some of that simply makes computing the 
individual insn lengths significantly easier -- but you wouldn't
really have to do that for AVR.

For AVR things would become even trickier.  You might consider

(define_predicate "concat_operator"
  (match_code "concat"))

(define_insn "addsi3_qqqq"
  [(set (match_operand:QI 0 "register_operand" "=r")
	(truncate:QI
	  (plus:SI
	    (match_operator:SI 12 "concat_operator"
	       [(match_operand:QI  4 "register_operand" "0")
		(match_operand:QI  5 "register_operand" "1")
		(match_operand:QI  6 "register_operand" "2")
		(match_operand:QI  7 "register_operand" "3")])
	    (match_operator:SI 13 "concat_operator"
	       [(match_operand:QI  8 "reg_or_0_operand" "rL")
		(match_operand:QI  9 "reg_or_0_operand" "rL")
		(match_operand:QI 10 "reg_or_0_operand" "rL")
		(match_operand:QI 11 "reg_or_0_operand" "rL")])))
   (set (match_operand:QI 1 "register_operand" "=r")
	(truncate:QI
	  (lshiftrt:SI
	    (plus:SI (match_dup 12) (match_dup 13))
	    (const_int 8))))
   (set (match_operand:QI 2 "register_operand" "=r")
	(truncate:QI
	  (lshiftrt:SI
	    (plus:SI (match_dup 12) (match_dup 13))
	    (const_int 16))))
   (set (match_operand:QI 3 "register_operand" "=r")
	(truncate:QI
	  (lshiftrt:SI
	    (plus:SI (match_dup 12) (match_dup 13))
	    (const_int 24))))]
  ""
  "add %0,%Z8\;adc %1,%Z9\;adc %2,%Z10\;adc %3,%Z11"
  [(set_attr "length" "4")]
)

This may require a little bit of code inside combine to handle
CONCAT in a reasonable way, but that should be fairly minimal.

It may also want some more special case patterns and/or peep2s
to more efficiently handle constants, particularly considering
adiw and sbci.  But I think it's at least worth investigating.


r~

Patch

Index: config/avr/avr.md
===================================================================
--- config/avr/avr.md	(Revision 172493)
+++ config/avr/avr.md	(Arbeitskopie)
@@ -1519,22 +1519,22 @@  (define_mode_attr rotsmode [(DI "QI") (S
 
 (define_expand "rotl<mode>3"
   [(parallel [(set (match_operand:HIDI 0 "register_operand" "")
-		   (rotate:HIDI (match_operand:HIDI 1 "register_operand" "")
-				(match_operand:VOID 2 "const_int_operand" "")))
-		(clobber (match_dup 3))])]
+                   (rotate:HIDI (match_operand:HIDI 1 "register_operand" "")
+                                (match_operand:VOID 2 "const_int_operand" "")))
+              (clobber (match_dup 3))])]
   ""
-  "
-{
-  if (CONST_INT_P (operands[2]) && 0 == (INTVAL (operands[2]) % 8))
   {
-  if (AVR_HAVE_MOVW && 0 == INTVAL (operands[2]) % 16)
-    operands[3] = gen_reg_rtx (<rotsmode>mode);
-  else
-    operands[3] = gen_reg_rtx (QImode);
-  }
-  else
-    FAIL;
-}")
+    if (CONST_INT_P (operands[2])
+        && 0 == INTVAL (operands[2]) % 8)
+      {
+        if (AVR_HAVE_MOVW && 0 == INTVAL (operands[2]) % 16)
+          operands[3] = gen_rtx_SCRATCH (<rotsmode>mode);
+        else
+          operands[3] = gen_rtx_SCRATCH (QImode);
+      }
+    else
+      FAIL;
+  })
 
 
 ;; Overlapping non-HImode registers often (but not always) need a scratch.
@@ -1545,35 +1545,49 @@  (define_expand "rotl<mode>3"
 
 ; Split word aligned rotates using scratch that is mode dependent.
 (define_insn_and_split "*rotw<mode>"
-  [(set (match_operand:HIDI 0 "register_operand" "=r,r,#&r")
-	(rotate:HIDI (match_operand:HIDI 1 "register_operand" "0,r,r")
-		     (match_operand 2 "immediate_operand" "n,n,n")))
-   (clobber (match_operand:<rotsmode> 3 "register_operand"  "=<rotx>" ))]
-  "(CONST_INT_P (operands[2]) &&
-     (0 == (INTVAL (operands[2]) % 16) && AVR_HAVE_MOVW))"
+  [(set (match_operand:HIDI 0 "register_operand"              "=r,r,&r")
+        (rotate:HIDI (match_operand:HIDI 1 "register_operand"  "0,r,r")
+                     (match_operand 2 "const_int_operand"      "n,n,n")))
+   (clobber (match_scratch:<rotsmode> 3                       "=<rotx>"))]
+  "AVR_HAVE_MOVW
+   && 0 == INTVAL (operands[2]) % 16"
   "#"
   "&& (reload_completed || <MODE>mode == DImode)"
   [(const_int 0)]
-  "avr_rotate_bytes (operands);
-  DONE;"
-)
+  {
+    /* Split happens in .split1: Cook up a pseudo */
+    if (SCRATCH == GET_CODE (operands[3])
+        && !reload_completed)
+      {
+        operands[3] = gen_reg_rtx (GET_MODE (operands[3]));
+      }
+    avr_rotate_bytes (operands);
+    DONE;
+  })
 
 
 ; Split byte aligned rotates using scratch that is always QI mode.
 (define_insn_and_split "*rotb<mode>"
-  [(set (match_operand:HIDI 0 "register_operand" "=r,r,#&r")
-	(rotate:HIDI (match_operand:HIDI 1 "register_operand" "0,r,r")
-		     (match_operand 2 "immediate_operand" "n,n,n")))
-   (clobber (match_operand:QI 3 "register_operand" "=<rotx>" ))]
-  "(CONST_INT_P (operands[2]) &&
-     (8 == (INTVAL (operands[2]) % 16)
-     	|| (!AVR_HAVE_MOVW && 0 == (INTVAL (operands[2]) % 16))))"
+  [(set (match_operand:HIDI 0 "register_operand"              "=r,r,&r")
+        (rotate:HIDI (match_operand:HIDI 1 "register_operand"  "0,r,r")
+                     (match_operand 2 "const_int_operand"      "n,n,n")))
+   (clobber (match_scratch:QI 3                               "=<rotx>" ))]
+  "8 == INTVAL (operands[2]) % 16
+   || (!AVR_HAVE_MOVW
+       && 0 == INTVAL (operands[2]) % 16)"
   "#"
   "&& (reload_completed || <MODE>mode == DImode)"
   [(const_int 0)]
-  "avr_rotate_bytes (operands);
-  DONE;"
-)
+  {
+    /* Split happens in .split1: Cook up a pseudo */
+    if (SCRATCH == GET_CODE (operands[3])
+        && !reload_completed)
+      {
+        operands[3] = gen_reg_rtx (QImode);
+      }
+    avr_rotate_bytes (operands);
+    DONE;
+  })
 
 
 ;;<< << << << << << << << << << << << << << << << << << << << << << << << << <<
Index: config/avr/avr.c
===================================================================
--- config/avr/avr.c	(Revision 172493)
+++ config/avr/avr.c	(Arbeitskopie)
@@ -4481,141 +4481,147 @@  lshrsi3_out (rtx insn, rtx operands[], i
 bool
 avr_rotate_bytes (rtx operands[])
 {
-    int i, j;
-    enum machine_mode mode = GET_MODE (operands[0]);
-    bool overlapped = reg_overlap_mentioned_p (operands[0], operands[1]);
-    bool same_reg = rtx_equal_p (operands[0], operands[1]);
-    int num = INTVAL (operands[2]);
-    rtx scratch = operands[3];
-    /* Work out if byte or word move is needed.  Odd byte rotates need QImode.
-       Word move if no scratch is needed, otherwise use size of scratch.  */
-    enum machine_mode move_mode = QImode;
-    int move_size, offset, size;
-
-    if (num & 0xf)
-      move_mode = QImode;
-    else if ((mode == SImode && !same_reg) || !overlapped)
-      move_mode = HImode;
-    else
-      move_mode = GET_MODE (scratch);
-
-    /* Force DI rotate to use QI moves since other DI moves are currently split
-       into QI moves so forward propagation works better.  */
-    if (mode == DImode)
-      move_mode = QImode;
-    /* Make scratch smaller if needed.  */
-    if (GET_MODE (scratch) == HImode && move_mode == QImode)
-      scratch = simplify_gen_subreg (move_mode, scratch, HImode, 0); 
-
-    move_size = GET_MODE_SIZE (move_mode);
-    /* Number of bytes/words to rotate.  */
-    offset = (num  >> 3) / move_size;
-    /* Number of moves needed.  */
-    size = GET_MODE_SIZE (mode) / move_size;
-    /* Himode byte swap is special case to avoid a scratch register.  */
-    if (mode == HImode && same_reg)
-      {
-	/* HImode byte swap, using xor.  This is as quick as using scratch.  */
-	rtx src, dst;
-	src = simplify_gen_subreg (move_mode, operands[1], mode, 0);
-	dst = simplify_gen_subreg (move_mode, operands[0], mode, 1);
-	if (!rtx_equal_p (dst, src))
-	  {
-	     emit_move_insn (dst, gen_rtx_XOR (QImode, dst, src));
-	     emit_move_insn (src, gen_rtx_XOR (QImode, src, dst));
-	     emit_move_insn (dst, gen_rtx_XOR (QImode, dst, src));
-	  }
-      }    
-    else  
-      {
+  int i, j;
+  enum machine_mode mode = GET_MODE (operands[0]);
+  bool overlapped = reg_overlap_mentioned_p (operands[0], operands[1]);
+  bool same_reg = rtx_equal_p (operands[0], operands[1]);
+  int num = INTVAL (operands[2]);
+  rtx scratch = operands[3];
+  /* Work out if byte or word move is needed.  Odd byte rotates need QImode.
+     Word move if no scratch is needed, otherwise use size of scratch.  */
+  enum machine_mode move_mode = QImode;
+  int move_size, offset, size;
+
+  if (num & 0xf)
+    move_mode = QImode;
+  else if ((mode == SImode && !same_reg) || !overlapped)
+    move_mode = HImode;
+  else
+    move_mode = GET_MODE (scratch);
+
+  /* Force DI rotate to use QI moves since other DI moves are currently split
+     into QI moves so forward propagation works better.  */
+  if (mode == DImode)
+    move_mode = QImode;
+  /* Make scratch smaller if needed.  */
+  if (GET_MODE (scratch) == HImode
+      && move_mode == QImode
+      && REG_P (scratch))
+    {
+      scratch = simplify_gen_subreg (move_mode, scratch, HImode, 0);
+    }
+
+  move_size = GET_MODE_SIZE (move_mode);
+  /* Number of bytes/words to rotate.  */
+  offset = (num  >> 3) / move_size;
+  /* Number of moves needed.  */
+  size = GET_MODE_SIZE (mode) / move_size;
+  /* Himode byte swap is special case to avoid a scratch register.  */
+  if (mode == HImode && same_reg)
+    {
+      /* HImode byte swap, using xor.  This is as quick as using scratch.  */
+      rtx src, dst;
+      src = simplify_gen_subreg (move_mode, operands[1], mode, 0);
+      dst = simplify_gen_subreg (move_mode, operands[0], mode, 1);
+      if (!rtx_equal_p (dst, src))
+        {
+          emit_move_insn (dst, gen_rtx_XOR (QImode, dst, src));
+          emit_move_insn (src, gen_rtx_XOR (QImode, src, dst));
+          emit_move_insn (dst, gen_rtx_XOR (QImode, dst, src));
+        }
+    }    
+  else  
+    {
 #define MAX_SIZE 8 /* GET_MODE_SIZE (DImode) / GET_MODE_SIZE (QImode)  */
-	/* Create linked list of moves to determine move order.  */
-	struct {
-	  rtx src, dst;
-	  int links;
-	} move[MAX_SIZE + 8];
-	int blocked, moves;
-
-	gcc_assert (size <= MAX_SIZE);
-	/* Generate list of subreg moves.  */
-	for (i = 0; i < size; i++)
-	  {
-	    int from = i;
-	    int to = (from + offset) % size;          
-	    move[i].src = simplify_gen_subreg (move_mode, operands[1],
-						mode, from * move_size);
-	    move[i].dst = simplify_gen_subreg (move_mode, operands[0],
-						mode, to   * move_size);
-	    move[i].links = -1;
-	   }
-	/* Mark dependence where a dst of one move is the src of another move.
-	   The first move is a conflict as it must wait until second is
-	   performed.  We ignore moves to self - we catch this later.  */
-	if (overlapped)
-	  for (i = 0; i < size; i++)
-	    if (reg_overlap_mentioned_p (move[i].dst, operands[1]))
-	      for (j = 0; j < size; j++)
-		if (j != i && rtx_equal_p (move[j].src, move[i].dst))
-		  {
-		    /* The dst of move i is the src of move j.  */
-		    move[i].links = j;
-		    break;
-		  }
-
-	blocked = -1;
-	moves = 0;
-	/* Go through move list and perform non-conflicting moves.  As each
-	   non-overlapping move is made, it may remove other conflicts
-	   so the process is repeated until no conflicts remain.  */
-	do
-	  {
-	    blocked = -1;
-	    moves = 0;
-	    /* Emit move where dst is not also a src or we have used that
-	       src already.  */
-	    for (i = 0; i < size; i++)
-	      if (move[i].src != NULL_RTX)
-		{
-		  if (move[i].links == -1
-		      || move[move[i].links].src == NULL_RTX)
-		    {
-		      moves++;
-		      /* Ignore NOP moves to self.  */
-		      if (!rtx_equal_p (move[i].dst, move[i].src))
-			emit_move_insn (move[i].dst, move[i].src);
-
-		      /* Remove  conflict from list.  */
-		      move[i].src = NULL_RTX;
-		    }
-		  else
-		    blocked = i;
-		}
-
-	    /* Check for deadlock. This is when no moves occurred and we have
-	       at least one blocked move.  */
-	    if (moves == 0 && blocked != -1)
-	      {
-		/* Need to use scratch register to break deadlock.
-		   Add move to put dst of blocked move into scratch.
-		   When this move occurs, it will break chain deadlock.
-		   The scratch register is substituted for real move.  */
-
-		move[size].src = move[blocked].dst;
-		move[size].dst =  scratch;
-		/* Scratch move is never blocked.  */
-		move[size].links = -1; 
-		/* Make sure we have valid link.  */
-		gcc_assert (move[blocked].links != -1);
-		/* Replace src of  blocking move with scratch reg.  */
-		move[move[blocked].links].src = scratch;
-		/* Make dependent on scratch move occuring.  */
-		move[blocked].links = size; 
-		size=size+1;
-	      }
-	  }
-	while (blocked != -1);
-      }
-    return true;
+      /* Create linked list of moves to determine move order.  */
+      struct {
+        rtx src, dst;
+        int links;
+      } move[MAX_SIZE + 8];
+      int blocked, moves;
+
+      gcc_assert (size <= MAX_SIZE);
+      /* Generate list of subreg moves.  */
+      for (i = 0; i < size; i++)
+        {
+          int from = i;
+          int to = (from + offset) % size;          
+          move[i].src = simplify_gen_subreg (move_mode, operands[1],
+                                             mode, from * move_size);
+          move[i].dst = simplify_gen_subreg (move_mode, operands[0],
+                                             mode, to   * move_size);
+          move[i].links = -1;
+        }
+      /* Mark dependence where a dst of one move is the src of another move.
+         The first move is a conflict as it must wait until second is
+         performed.  We ignore moves to self - we catch this later.  */
+      if (overlapped)
+        for (i = 0; i < size; i++)
+          if (reg_overlap_mentioned_p (move[i].dst, operands[1]))
+            for (j = 0; j < size; j++)
+              if (j != i && rtx_equal_p (move[j].src, move[i].dst))
+                {
+                  /* The dst of move i is the src of move j.  */
+                  move[i].links = j;
+                  break;
+                }
+
+      blocked = -1;
+      moves = 0;
+      /* Go through move list and perform non-conflicting moves.  As each
+         non-overlapping move is made, it may remove other conflicts
+         so the process is repeated until no conflicts remain.  */
+      do
+        {
+          blocked = -1;
+          moves = 0;
+          /* Emit move where dst is not also a src or we have used that
+             src already.  */
+          for (i = 0; i < size; i++)
+            if (move[i].src != NULL_RTX)
+              {
+                if (move[i].links == -1
+                    || move[move[i].links].src == NULL_RTX)
+                  {
+                    moves++;
+                    /* Ignore NOP moves to self.  */
+                    if (!rtx_equal_p (move[i].dst, move[i].src))
+                      emit_move_insn (move[i].dst, move[i].src);
+
+                    /* Remove  conflict from list.  */
+                    move[i].src = NULL_RTX;
+                  }
+                else
+                  blocked = i;
+              }
+
+          /* Check for deadlock. This is when no moves occurred and we have
+             at least one blocked move.  */
+          if (moves == 0 && blocked != -1)
+            {
+              /* Need to use scratch register to break deadlock.
+                 Add move to put dst of blocked move into scratch.
+                 When this move occurs, it will break chain deadlock.
+                 The scratch register is substituted for real move.  */
+
+              gcc_assert (REG_P (scratch));
+
+              move[size].src = move[blocked].dst;
+              move[size].dst =  scratch;
+              /* Scratch move is never blocked.  */
+              move[size].links = -1; 
+              /* Make sure we have valid link.  */
+              gcc_assert (move[blocked].links != -1);
+              /* Replace src of  blocking move with scratch reg.  */
+              move[move[blocked].links].src = scratch;
+              /* Make dependent on scratch move occuring.  */
+              move[blocked].links = size; 
+              size=size+1;
+            }
+        }
+      while (blocked != -1);
+    }
+  return true;
 }
 
 /* Modifies the length assigned to instruction INSN