
[AArch64,PR65375] Fix RTX cost for vector SET

Message ID 5507813E.3060106@linaro.org
State New

Commit Message

Kugan Vivekanandarajah March 17, 2015, 1:19 a.m. UTC
On 17/03/15 03:48, Kyrill Tkachov wrote:
> 
> On 16/03/15 13:15, Kugan wrote:
>> On 16/03/15 23:32, Kugan wrote:
>>>>> lower-subreg.c:compute_costs() only cares about the cost of a (set
>>>>> (reg)
>>>>> (const_int )) move but I think the intention, at least for now, is to
>>>>> return extra_cost->vect.alu for all the vector operations.
>>>> Almost, what we want at the moment is COSTS_N_INSNS (1) +
>>>> extra_cost->vect.alu
>>> Thanks Kyrill for the review.
>>>
>>>>> Regression tested on aarch64-linux-gnu with no new regression. Is this
>>>>> OK for trunk?
>>>> Are you sure it's a (set (reg) (const_int)) that's being costed here? I
>>>> thought for moves into vector registers it would be a (set (reg)
>>>> (const_vector)) which we don't handle in our rtx costs currently. I
>>>> think the correct approach would be to extend the aarch64_rtx_costs
>>>> switch statement to handle the CONST_VECTOR case. I believe you can use
>>>> aarch64_simd_valid_immediate to check whether x is a valid immediate for
>>>> a simd instruction and give it a cost of extra_cost->vect.alu. The logic
>>>> should be similar to the CONST_INT case.
>>> Sorry about the (set (reg) (const_int)) above. But the actual RTL that
>>> is being split at 220r.subreg2 is
>>>
>>> (insn 11 10 12 2 (set (subreg:V4SF (reg/v:OI 77 [ __o ]) 0)
>>>           (subreg:V4SF (reg/v:OI 73 [ __o ]) 0))
>>> /home/kugan/work/builds/gcc-fsf-gcc/tools/lib/gcc/aarch64-none-linux-gnu/5.0.0/include/arm_neon.h:22625
>>>
>>> 800 {*aarch64_simd_movv4sf}
>>>        (nil))
>>>
>>> And also, if we return an RTX cost above COSTS_N_INSNS (1), it will be
>>> split and it doesn't recover from there. Therefore we need something like
>>> the below to prevent that happening.
>>>
>> Hi Kyrill,
>>
>> How about the attached patch? It is similar to what is currently done
>> for scalar register move.
> 
> Hi Kugan,
> yeah, I think this is a better approach, though I can't approve.
> 

Here is the patch with minor comment update. Regression tested on
aarch64-linux-gnu with no new regression. Is this
OK for trunk?

Thanks,
Kugan

gcc/ChangeLog:

2015-03-17  Kugan Vivekanandarajah  <kuganv@linaro.org>
            Jim Wilson  <jim.wilson@linaro.org>

	PR target/65375
	* config/aarch64/aarch64.c (aarch64_rtx_costs): Handle
	vector register copies.

Comments

Kugan Vivekanandarajah March 26, 2015, 7:21 a.m. UTC | #1
ping?

Thanks,
Kugan

On 17/03/15 12:19, Kugan wrote:
> 
> 
> On 17/03/15 03:48, Kyrill Tkachov wrote:
>>
>> On 16/03/15 13:15, Kugan wrote:
>>> On 16/03/15 23:32, Kugan wrote:
>>>>>> lower-subreg.c:compute_costs() only cares about the cost of a (set
>>>>>> (reg)
>>>>>> (const_int )) move but I think the intention, at least for now, is to
>>>>>> return extra_cost->vect.alu for all the vector operations.
>>>>> Almost, what we want at the moment is COSTS_N_INSNS (1) +
>>>>> extra_cost->vect.alu
>>>> Thanks Kyrill for the review.
>>>>
>>>>>> Regression tested on aarch64-linux-gnu with no new regression. Is this
>>>>>> OK for trunk?
>>>>> Are you sure it's a (set (reg) (const_int)) that's being costed here? I
>>>>> thought for moves into vector registers it would be a (set (reg)
>>>>> (const_vector)) which we don't handle in our rtx costs currently. I
>>>>> think the correct approach would be to extend the aarch64_rtx_costs
>>>>> switch statement to handle the CONST_VECTOR case. I believe you can use
>>>>> aarch64_simd_valid_immediate to check whether x is a valid immediate for
>>>>> a simd instruction and give it a cost of extra_cost->vect.alu. The logic
>>>>> should be similar to the CONST_INT case.
>>>> Sorry about the (set (reg) (const_int)) above. But the actual RTL that
>>>> is being split at 220r.subreg2 is
>>>>
>>>> (insn 11 10 12 2 (set (subreg:V4SF (reg/v:OI 77 [ __o ]) 0)
>>>>           (subreg:V4SF (reg/v:OI 73 [ __o ]) 0))
>>>> /home/kugan/work/builds/gcc-fsf-gcc/tools/lib/gcc/aarch64-none-linux-gnu/5.0.0/include/arm_neon.h:22625
>>>>
>>>> 800 {*aarch64_simd_movv4sf}
>>>>        (nil))
>>>>
>>>> And also, if we return an RTX cost above COSTS_N_INSNS (1), it will be
>>>> split and it doesn't recover from there. Therefore we need something like
>>>> the below to prevent that happening.
>>>>
>>> Hi Kyrill,
>>>
>>> How about the attached patch? It is similar to what is currently done
>>> for scalar register move.
>>
>> Hi Kugan,
>> yeah, I think this is a better approach, though I can't approve.
>>
> 
> Here is the patch with minor comment update. Regression tested on
> aarch64-linux-gnu with no new regression. Is this
> OK for trunk?
> 
> Thanks,
> Kugan
> 
> gcc/ChangeLog:
> 
> 2015-03-17  Kugan Vivekanandarajah  <kuganv@linaro.org>
>             Jim Wilson  <jim.wilson@linaro.org>
> 
> 	PR target/65375
> 	* config/aarch64/aarch64.c (aarch64_rtx_costs): Handle
> 	vector register copies.
> 
> 
>
Kugan Vivekanandarajah April 14, 2015, 10:08 p.m. UTC | #2
Ping?

Now that Stage1 is open, is this OK for trunk?

Thanks,
Kugan

On 26/03/15 18:21, Kugan wrote:
> ping?
> 
> Thanks,
> Kugan
> 
> On 17/03/15 12:19, Kugan wrote:
>>
>>
>> On 17/03/15 03:48, Kyrill Tkachov wrote:
>>>
>>> On 16/03/15 13:15, Kugan wrote:
>>>> On 16/03/15 23:32, Kugan wrote:
>>>>>>> lower-subreg.c:compute_costs() only cares about the cost of a (set
>>>>>>> (reg)
>>>>>>> (const_int )) move but I think the intention, at least for now, is to
>>>>>>> return extra_cost->vect.alu for all the vector operations.
>>>>>> Almost, what we want at the moment is COSTS_N_INSNS (1) +
>>>>>> extra_cost->vect.alu
>>>>> Thanks Kyrill for the review.
>>>>>
>>>>>>> Regression tested on aarch64-linux-gnu with no new regression. Is this
>>>>>>> OK for trunk?
>>>>>> Are you sure it's a (set (reg) (const_int)) that's being costed here? I
>>>>>> thought for moves into vector registers it would be a (set (reg)
>>>>>> (const_vector)) which we don't handle in our rtx costs currently. I
>>>>>> think the correct approach would be to extend the aarch64_rtx_costs
>>>>>> switch statement to handle the CONST_VECTOR case. I believe you can use
>>>>>> aarch64_simd_valid_immediate to check whether x is a valid immediate for
>>>>>> a simd instruction and give it a cost of extra_cost->vect.alu. The logic
>>>>>> should be similar to the CONST_INT case.
>>>>> Sorry about the (set (reg) (const_int)) above. But the actual RTL that
>>>>> is being split at 220r.subreg2 is
>>>>>
>>>>> (insn 11 10 12 2 (set (subreg:V4SF (reg/v:OI 77 [ __o ]) 0)
>>>>>           (subreg:V4SF (reg/v:OI 73 [ __o ]) 0))
>>>>> /home/kugan/work/builds/gcc-fsf-gcc/tools/lib/gcc/aarch64-none-linux-gnu/5.0.0/include/arm_neon.h:22625
>>>>>
>>>>> 800 {*aarch64_simd_movv4sf}
>>>>>        (nil))
>>>>>
>>>>> And also, if we return an RTX cost above COSTS_N_INSNS (1), it will be
>>>>> split and it doesn't recover from there. Therefore we need something like
>>>>> the below to prevent that happening.
>>>>>
>>>> Hi Kyrill,
>>>>
>>>> How about the attached patch? It is similar to what is currently done
>>>> for scalar register move.
>>>
>>> Hi Kugan,
>>> yeah, I think this is a better approach, though I can't approve.
>>>
>>
>> Here is the patch with minor comment update. Regression tested on
>> aarch64-linux-gnu with no new regression. Is this
>> OK for trunk?
>>
>> Thanks,
>> Kugan
>>
>> gcc/ChangeLog:
>>
>> 2015-03-17  Kugan Vivekanandarajah  <kuganv@linaro.org>
>>             Jim Wilson  <jim.wilson@linaro.org>
>>
>> 	PR target/65375
>> 	* config/aarch64/aarch64.c (aarch64_rtx_costs): Handle
>> 	vector register copies.
>>
>>
>>
James Greenhalgh April 15, 2015, 9:25 a.m. UTC | #3
On Tue, Apr 14, 2015 at 11:08:55PM +0100, Kugan wrote:
> Now that Stage1 is open, is this OK for trunk?

Hi Kugan,

> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index cba3c1a..d6ad0af 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -5544,10 +5544,17 @@ aarch64_rtx_costs (rtx x, int code, int outer ATTRIBUTE_UNUSED,
>  
>  	  /* Fall through.  */
>  	case REG:
> +	  if (VECTOR_MODE_P (GET_MODE (op0)) && REG_P (op1))
> +	    {
> +              /* The cost is 1 per vector-register copied.  */
> +              int n_minus_1 = (GET_MODE_SIZE (GET_MODE (op0)) - 1)
> +			      / GET_MODE_SIZE (V4SImode);
> +              *cost = COSTS_N_INSNS (n_minus_1 + 1);
> +	    }
>  	  /* const0_rtx is in general free, but we will use an
>  	     instruction to set a register to 0.  */
> -          if (REG_P (op1) || op1 == const0_rtx)
> -            {
> +	  else if (REG_P (op1) || op1 == const0_rtx)
> +	    {
>                /* The cost is 1 per register copied.  */
>                int n_minus_1 = (GET_MODE_SIZE (GET_MODE (op0)) - 1)
>  			      / UNITS_PER_WORD;

I would not have expected control flow to reach this point, as we have:

>  /* TODO: The cost infrastructure currently does not handle
>     vector operations.  Assume that all vector operations
>     are equally expensive.  */
>  if (VECTOR_MODE_P (mode))
>    {
>      if (speed)
>	*cost += extra_cost->vect.alu;
>      return true;
>    }

But, I see that this check is broken for a set RTX (which has no mode).
So, your patch works, but only due to a bug in my original implementation.
This leaves the code with quite a messy design.

There are two ways I see that we could clean things up, both of which
require some reworking of your patch.

Either we remove my check above and teach the RTX costs how to properly
cost vector operations, or we fix my check to catch all vector RTX
and add the special cases for the small subset of things we understand
up there.

The correct approach in the long term is to fix the RTX costs to correctly
understand vector operations, so I'd much prefer to see a patch along
these lines, though I appreciate that is a substantially more invasive
piece of work.

Thanks,
James
Kyrylo Tkachov April 15, 2015, 10:14 a.m. UTC | #4
On 15/04/15 10:25, James Greenhalgh wrote:
> On Tue, Apr 14, 2015 at 11:08:55PM +0100, Kugan wrote:
>> Now that Stage1 is open, is this OK for trunk?
> Hi Kugan,
>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index cba3c1a..d6ad0af 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -5544,10 +5544,17 @@ aarch64_rtx_costs (rtx x, int code, int outer ATTRIBUTE_UNUSED,
>>   
>>   	  /* Fall through.  */
>>   	case REG:
>> +	  if (VECTOR_MODE_P (GET_MODE (op0)) && REG_P (op1))
>> +	    {
>> +              /* The cost is 1 per vector-register copied.  */
>> +              int n_minus_1 = (GET_MODE_SIZE (GET_MODE (op0)) - 1)
>> +			      / GET_MODE_SIZE (V4SImode);
>> +              *cost = COSTS_N_INSNS (n_minus_1 + 1);
>> +	    }
>>   	  /* const0_rtx is in general free, but we will use an
>>   	     instruction to set a register to 0.  */
>> -          if (REG_P (op1) || op1 == const0_rtx)
>> -            {
>> +	  else if (REG_P (op1) || op1 == const0_rtx)
>> +	    {
>>                 /* The cost is 1 per register copied.  */
>>                 int n_minus_1 = (GET_MODE_SIZE (GET_MODE (op0)) - 1)
>>   			      / UNITS_PER_WORD;
> I would not have expected control flow to reach this point, as we have:
>
>>   /* TODO: The cost infrastructure currently does not handle
>>      vector operations.  Assume that all vector operations
>>      are equally expensive.  */
>>   if (VECTOR_MODE_P (mode))
>>     {
>>       if (speed)
>> 	*cost += extra_cost->vect.alu;
>>       return true;
>>     }
> But, I see that this check is broken for a set RTX (which has no mode).
> So, your patch works, but only due to a bug in my original implementation.
> This leaves the code with quite a messy design.
>
> There are two ways I see that we could clean things up, both of which
> require some reworking of your patch.
>
> Either we remove my check above and teach the RTX costs how to properly
> cost vector operations, or we fix my check to catch all vector RTX
> and add the special cases for the small subset of things we understand
> up there.
>
> The correct approach in the long term is to fix the RTX costs to correctly
> understand vector operations, so I'd much prefer to see a patch along
> these lines, though I appreciate that is a substantially more invasive
> piece of work.


Would we want to catch all vector RTXes in that check at the top
and have special vector rtx handling there? (Perhaps even in a function
of its own like aarch64_vector_rtx_costs?). Or do you think it would
be cleaner to handle them in the aarch64_rtx_costs giant switch?
Vector-specific RTX codes like vec_concat, vec_select would integrate
cleanly, but handling other common rtxen could potentially be messy?

Kyrill

>
> Thanks,
> James
>
Kugan Vivekanandarajah April 15, 2015, 10:45 a.m. UTC | #5
On 15/04/15 19:25, James Greenhalgh wrote:
> On Tue, Apr 14, 2015 at 11:08:55PM +0100, Kugan wrote:
>> Now that Stage1 is open, is this OK for trunk?
> 
> Hi Kugan,
> 
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index cba3c1a..d6ad0af 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -5544,10 +5544,17 @@ aarch64_rtx_costs (rtx x, int code, int outer ATTRIBUTE_UNUSED,
>>  
>>  	  /* Fall through.  */
>>  	case REG:
>> +	  if (VECTOR_MODE_P (GET_MODE (op0)) && REG_P (op1))
>> +	    {
>> +              /* The cost is 1 per vector-register copied.  */
>> +              int n_minus_1 = (GET_MODE_SIZE (GET_MODE (op0)) - 1)
>> +			      / GET_MODE_SIZE (V4SImode);
>> +              *cost = COSTS_N_INSNS (n_minus_1 + 1);
>> +	    }
>>  	  /* const0_rtx is in general free, but we will use an
>>  	     instruction to set a register to 0.  */
>> -          if (REG_P (op1) || op1 == const0_rtx)
>> -            {
>> +	  else if (REG_P (op1) || op1 == const0_rtx)
>> +	    {
>>                /* The cost is 1 per register copied.  */
>>                int n_minus_1 = (GET_MODE_SIZE (GET_MODE (op0)) - 1)
>>  			      / UNITS_PER_WORD;
> 
> I would not have expected control flow to reach this point, as we have:

It does for mode == VOIDmode. The RTL X is, for example:

(set (reg:V8DI 67 virtual-incoming-args)
    (reg:V8DI 68 virtual-stack-vars))

> 
>>  /* TODO: The cost infrastructure currently does not handle
>>     vector operations.  Assume that all vector operations
>>     are equally expensive.  */
>>  if (VECTOR_MODE_P (mode))
>>    {
>>      if (speed)
>> 	*cost += extra_cost->vect.alu;
>>      return true;
>>    }
> 
> But, I see that this check is broken for a set RTX (which has no mode).
> So, your patch works, but only due to a bug in my original implementation.
> This leaves the code with quite a messy design.
> 
> There are two ways I see that we could clean things up, both of which
> require some reworking of your patch.
> 
> Either we remove my check above and teach the RTX costs how to properly
> cost vector operations, or we fix my check to catch all vector RTX
> and add the special cases for the small subset of things we understand
> up there.
> 
> The correct approach in the long term is to fix the RTX costs to correctly
> understand vector operations, so I'd much prefer to see a patch along
> these lines, though I appreciate that is a substantially more invasive
> piece of work.
> 


I agree that rtx cost for vectors is not handled right now. We might not
be able to separate it out completely as Kyrill suggested.  We still need
the vector SET with VOIDmode to be handled inline; this patch is that part.
We can work on the others as a separate function, if you prefer that. I
am happy to look at this as a separate patch.


Thanks,
Kugan
James Greenhalgh April 15, 2015, 11:05 a.m. UTC | #6
On Wed, Apr 15, 2015 at 11:14:11AM +0100, Kyrill Tkachov wrote:
> 
> On 15/04/15 10:25, James Greenhalgh wrote:
> > On Tue, Apr 14, 2015 at 11:08:55PM +0100, Kugan wrote:
> >> Now that Stage1 is open, is this OK for trunk?
> > Hi Kugan,
> >
> >> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> >> index cba3c1a..d6ad0af 100644
> >> --- a/gcc/config/aarch64/aarch64.c
> >> +++ b/gcc/config/aarch64/aarch64.c
> >> @@ -5544,10 +5544,17 @@ aarch64_rtx_costs (rtx x, int code, int outer ATTRIBUTE_UNUSED,
> >>   
> >>   	  /* Fall through.  */
> >>   	case REG:
> >> +	  if (VECTOR_MODE_P (GET_MODE (op0)) && REG_P (op1))
> >> +	    {
> >> +              /* The cost is 1 per vector-register copied.  */
> >> +              int n_minus_1 = (GET_MODE_SIZE (GET_MODE (op0)) - 1)
> >> +			      / GET_MODE_SIZE (V4SImode);
> >> +              *cost = COSTS_N_INSNS (n_minus_1 + 1);
> >> +	    }
> >>   	  /* const0_rtx is in general free, but we will use an
> >>   	     instruction to set a register to 0.  */
> >> -          if (REG_P (op1) || op1 == const0_rtx)
> >> -            {
> >> +	  else if (REG_P (op1) || op1 == const0_rtx)
> >> +	    {
> >>                 /* The cost is 1 per register copied.  */
> >>                 int n_minus_1 = (GET_MODE_SIZE (GET_MODE (op0)) - 1)
> >>   			      / UNITS_PER_WORD;
> > I would not have expected control flow to reach this point, as we have:
> >
> >>   /* TODO: The cost infrastructure currently does not handle
> >>      vector operations.  Assume that all vector operations
> >>      are equally expensive.  */
> >>   if (VECTOR_MODE_P (mode))
> >>     {
> >>       if (speed)
> >> 	*cost += extra_cost->vect.alu;
> >>       return true;
> >>     }
> > But, I see that this check is broken for a set RTX (which has no mode).
> > So, your patch works, but only due to a bug in my original implementation.
> > This leaves the code with quite a messy design.
> >
> > There are two ways I see that we could clean things up, both of which
> > require some reworking of your patch.
> >
> > Either we remove my check above and teach the RTX costs how to properly
> > cost vector operations, or we fix my check to catch all vector RTX
> > and add the special cases for the small subset of things we understand
> > up there.
> >
> > The correct approach in the long term is to fix the RTX costs to correctly
> > understand vector operations, so I'd much prefer to see a patch along
> > these lines, though I appreciate that is a substantially more invasive
> > piece of work.
> 
> 
> Would we want to catch all vector RTXes in that check at the top
> and have special vector rtx handling there? (Perhaps even in a function
> of its own like aarch64_vector_rtx_costs?).

No, I think this would necessitate duplicating all of the idiom
recognition and RTX walking code from aarch64_rtx_costs. However, this
would be the easiest way to fix this PR in the short term.

> Or do you think it would be cleaner to handle them in the aarch64_rtx_costs
> giant switch?  Vector-specific RTX codes like vec_concat, vec_select would
> integrate cleanly, but handling other common rtxen could potentially be
> messy?

Well, if I'm allowed to dream for a bit...

To reduce the need for spaghetti code a little, what I would really like to
see is a logical split between the recognition of the instruction and the
costing of individual modes of that instruction. So we would invent a
function like aarch64_classify_rtx which would return "You gave me something
which looks like an add immediate" - then we would leave switching on modes
to aarch64_rtx_costs.

If I can dream even more - I don't see why it makes sense for us to have a
hand-rolled instruction recognizer in the back-end and I'd like to find
a way to reuse common recog infrastructure, and then add
something like what sched1 does to guess at likely register allocations
and to then extract the type attribute. For that to work, we would need
to change a huge amount of infrastructure to ensure that a register
allocation guess was available whenever someone wanted a cost estimate - 
a huge, huge problem when a Gimple pass speculatively forms some
invalid RTX and hands it off to rtx_costs. So I think this is not a
realistic plan!

Those are huge refactoring tasks which I'm not going to get a chance to
look at any time soon, so I think we have to be pragmatic about what can
be achieved.

Adding to the common RTX recognisers will potentially be messy, but it
is a neater approach than duplicating the logic (have a look at the
amount of effort we go to in order to spot a non-fused Multiply Add
operation - we certainly don't want to duplicate that out for vectors).

Thanks,
James
Kyrylo Tkachov April 15, 2015, 11:16 a.m. UTC | #7
On 15/04/15 12:05, James Greenhalgh wrote:
> On Wed, Apr 15, 2015 at 11:14:11AM +0100, Kyrill Tkachov wrote:
>> On 15/04/15 10:25, James Greenhalgh wrote:
>>> On Tue, Apr 14, 2015 at 11:08:55PM +0100, Kugan wrote:
>>>> Now that Stage1 is open, is this OK for trunk?
>>> Hi Kugan,
>>>
>>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>>> index cba3c1a..d6ad0af 100644
>>>> --- a/gcc/config/aarch64/aarch64.c
>>>> +++ b/gcc/config/aarch64/aarch64.c
>>>> @@ -5544,10 +5544,17 @@ aarch64_rtx_costs (rtx x, int code, int outer ATTRIBUTE_UNUSED,
>>>>    
>>>>    	  /* Fall through.  */
>>>>    	case REG:
>>>> +	  if (VECTOR_MODE_P (GET_MODE (op0)) && REG_P (op1))
>>>> +	    {
>>>> +              /* The cost is 1 per vector-register copied.  */
>>>> +              int n_minus_1 = (GET_MODE_SIZE (GET_MODE (op0)) - 1)
>>>> +			      / GET_MODE_SIZE (V4SImode);
>>>> +              *cost = COSTS_N_INSNS (n_minus_1 + 1);
>>>> +	    }
>>>>    	  /* const0_rtx is in general free, but we will use an
>>>>    	     instruction to set a register to 0.  */
>>>> -          if (REG_P (op1) || op1 == const0_rtx)
>>>> -            {
>>>> +	  else if (REG_P (op1) || op1 == const0_rtx)
>>>> +	    {
>>>>                  /* The cost is 1 per register copied.  */
>>>>                  int n_minus_1 = (GET_MODE_SIZE (GET_MODE (op0)) - 1)
>>>>    			      / UNITS_PER_WORD;
>>> I would not have expected control flow to reach this point, as we have:
>>>
>>>>    /* TODO: The cost infrastructure currently does not handle
>>>>       vector operations.  Assume that all vector operations
>>>>       are equally expensive.  */
>>>>    if (VECTOR_MODE_P (mode))
>>>>      {
>>>>        if (speed)
>>>> 	*cost += extra_cost->vect.alu;
>>>>        return true;
>>>>      }
>>> But, I see that this check is broken for a set RTX (which has no mode).
>>> So, your patch works, but only due to a bug in my original implementation.
>>> This leaves the code with quite a messy design.
>>>
>>> There are two ways I see that we could clean things up, both of which
>>> require some reworking of your patch.
>>>
>>> Either we remove my check above and teach the RTX costs how to properly
>>> cost vector operations, or we fix my check to catch all vector RTX
>>> and add the special cases for the small subset of things we understand
>>> up there.
>>>
>>> The correct approach in the long term is to fix the RTX costs to correctly
>>> understand vector operations, so I'd much prefer to see a patch along
>>> these lines, though I appreciate that is a substantially more invasive
>>> piece of work.
>>
>> Would we want to catch all vector RTXes in that check at the top
>> and have special vector rtx handling there? (Perhaps even in a function
>> of its own like aarch64_vector_rtx_costs?).
> No, I think this would necessitate duplicating all of the idiom
> recognition and RTX walking code from aarch64_rtx_costs. However, this
> would be the easiest way to fix this PR in the short term.
>
>> Or do you think it would be cleaner to handle them in the aarch64_rtx_costs
>> giant switch?  Vector-specific RTX codes like vec_concat, vec_select would
>> integrate cleanly, but handling other common rtxen could potentially be
>> messy?
> Well, if I'm allowed to dream for a bit...
>
> To reduce the need for spaghetti code a little, what I would really like to
> see is a logical split between the recognition of the instruction and the
> costing of individual modes of that instruction. So we would invent a
> function like aarch64_classify_rtx which would return "You gave me something
> which looks like an add immediate" - then we would leave switching on modes
> to aarch64_rtx_costs.
>
> If I can dream even more - I don't see why it makes sense for us to have a
> hand-rolled instruction recognizer in the back-end and I'd like to find
> a way to reuse common recog infrastructure, and then add
> something like what sched1 does to guess at likely register allocations
> and to then extract the type attribute. For that to work, we would need
> to change a huge amount of infrastructure to ensure that a register
> allocation guess was available whenever someone wanted a cost estimate -
> a huge, huge problem when a Gimple pass speculatively forms some
> invalid RTX and hands it off to rtx_costs. So I think this is not a
> realistic plan!

(Unrelated to this patch) So, I find the worst offender in this
regard is expmed, which generates rtx instances of every single integer
mode from QImode to EImode with common codes like PLUS, ASHIFT, MULT etc.
and asks the backend rtx costs to assign them a number, which forces us to
handle them even though they are invalid and don't have any patterns that
match them.  I'm working on some patches to remedy that, though there are
some tree-ssa passes that generate explicit rtxes that may not be valid as
well.

Kyrill

>
> Those are huge refactoring tasks which I'm not going to get a chance to
> look at any time soon, so I think we have to be pragmatic about what can
> be achieved.
>
> Adding to the common RTX recognisers will potentially be messy, but it
> is a neater approach than duplicating the logic (have a look at the
> amount of effort we go to in order to spot a non-fused Multiply Add
> operation - we certainly don't want to duplicate that out for vectors).
>
> Thanks,
> James
>
James Greenhalgh April 15, 2015, 11:18 a.m. UTC | #8
On Wed, Apr 15, 2015 at 11:45:36AM +0100, Kugan wrote:
> > There are two ways I see that we could clean things up, both of which
> > require some reworking of your patch.
> > 
> > Either we remove my check above and teach the RTX costs how to properly
> > cost vector operations, or we fix my check to catch all vector RTX
> > and add the special cases for the small subset of things we understand
> > up there.
> > 
> > The correct approach in the long term is to fix the RTX costs to correctly
> > understand vector operations, so I'd much prefer to see a patch along
> > these lines, though I appreciate that is a substantially more invasive
> > piece of work.
> > 
> 
> 
> I agree that rtx cost for vectors is not handled right now. We might not
> be able to separate it out completely as Kyrill suggested.  We still need
> the vector SET with VOIDmode to be handled inline; this patch is that part.
> We can work on the others as a separate function, if you prefer that. I
> am happy to look at this as a separate patch.

My point is that adding your patch while keeping the logic at the top
which claims to catch ALL vector operations makes for less readable
code.

At the very least you'll need to update this comment:

  /* TODO: The cost infrastructure currently does not handle
     vector operations.  Assume that all vector operations
     are equally expensive.  */

to make it clear that this doesn't catch vector set operations.

But fixing the comment doesn't improve the messy code so I'd certainly
prefer to see one of the other approaches which have been discussed.

Thanks,
James
Kugan Vivekanandarajah April 15, 2015, 11:33 a.m. UTC | #9
On 15/04/15 21:18, James Greenhalgh wrote:
> On Wed, Apr 15, 2015 at 11:45:36AM +0100, Kugan wrote:
>>> There are two ways I see that we could clean things up, both of which
>>> require some reworking of your patch.
>>>
>>> Either we remove my check above and teach the RTX costs how to properly
>>> cost vector operations, or we fix my check to catch all vector RTX
>>> and add the special cases for the small subset of things we understand
>>> up there.
>>>
>>> The correct approach in the long term is to fix the RTX costs to correctly
>>> understand vector operations, so I'd much prefer to see a patch along
>>> these lines, though I appreciate that is a substantially more invasive
>>> piece of work.
>>>
>>
>>
>> I agree that rtx cost for vectors is not handled right now. We might not
>> be able to separate it out completely as Kyrill suggested.  We still need
>> the vector SET with VOIDmode to be handled inline; this patch is that part.
>> We can work on the others as a separate function, if you prefer that. I
>> am happy to look at this as a separate patch.
> 
> My point is that adding your patch while keeping the logic at the top
> which claims to catch ALL vector operations makes for less readable
> code.
> 
> At the very least you'll need to update this comment:
> 
>   /* TODO: The cost infrastructure currently does not handle
>      vector operations.  Assume that all vector operations
>      are equally expensive.  */
> 
> to make it clear that this doesn't catch vector set operations.
> 
> But fixing the comment doesn't improve the messy code so I'd certainly
> prefer to see one of the other approaches which have been discussed.

I see your point. Let me work on this based on your suggestions above.

Thanks,
Kugan
Maxim Kuvyrkov April 15, 2015, 11:35 a.m. UTC | #10
> On Apr 15, 2015, at 2:18 PM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> 
> On Wed, Apr 15, 2015 at 11:45:36AM +0100, Kugan wrote:
>>> There are two ways I see that we could clean things up, both of which
>>> require some reworking of your patch.
>>> 
>>> Either we remove my check above and teach the RTX costs how to properly
>>> cost vector operations, or we fix my check to catch all vector RTX
>>> and add the special cases for the small subset of things we understand
>>> up there.
>>> 
>>> The correct approach in the long term is to fix the RTX costs to correctly
>>> understand vector operations, so I'd much prefer to see a patch along
>>> these lines, though I appreciate that is a substantially more invasive
>>> piece of work.
>>> 
>> 
>> 
>> I agree that rtx cost for vectors is not handled right now. We might not
>> be able to separate it out completely as Kyrill suggested.  We still need
>> the vector SET with VOIDmode to be handled inline; this patch is that part.
>> We can work on the others as a separate function, if you prefer that. I
>> am happy to look at this as a separate patch.
> 
> My point is that adding your patch while keeping the logic at the top
> which claims to catch ALL vector operations makes for less readable
> code.
> 
> At the very least you'll need to update this comment:
> 
>  /* TODO: The cost infrastructure currently does not handle
>     vector operations.  Assume that all vector operations
>     are equally expensive.  */
> 
> to make it clear that this doesn't catch vector set operations.
> 
> But fixing the comment doesn't improve the messy code so I'd certainly
> prefer to see one of the other approaches which have been discussed.

While I am for cleaning up messy code, I want to avoid Kugan's patch being
held hostage until all the proper refactorings and cleanups are done.  If we
consider the patch on its own merits: Is it a worthwhile improvement? --
[Probably, "yes".]  Does it make the current spaghetti code significantly
more difficult to understand? -- [Probably, "no", if we update the current
comments.]

Let's discuss the effort of cleaning RTX costs as a separate task.  It can be either a joint effort for ARM and Linaro, or one of us can tackle it.

Thank you,

--
Maxim Kuvyrkov
www.linaro.org

Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index cba3c1a..d6ad0af 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -5544,10 +5544,17 @@  aarch64_rtx_costs (rtx x, int code, int outer ATTRIBUTE_UNUSED,
 
 	  /* Fall through.  */
 	case REG:
+	  if (VECTOR_MODE_P (GET_MODE (op0)) && REG_P (op1))
+	    {
+              /* The cost is 1 per vector-register copied.  */
+              int n_minus_1 = (GET_MODE_SIZE (GET_MODE (op0)) - 1)
+			      / GET_MODE_SIZE (V4SImode);
+              *cost = COSTS_N_INSNS (n_minus_1 + 1);
+	    }
 	  /* const0_rtx is in general free, but we will use an
 	     instruction to set a register to 0.  */
-          if (REG_P (op1) || op1 == const0_rtx)
-            {
+	  else if (REG_P (op1) || op1 == const0_rtx)
+	    {
               /* The cost is 1 per register copied.  */
               int n_minus_1 = (GET_MODE_SIZE (GET_MODE (op0)) - 1)
 			      / UNITS_PER_WORD;