
[PR43513,1/3] Replace vla with array - Implementation.

Message ID 4E33B110.8070708@codesourcery.com
State New

Commit Message

Tom de Vries July 30, 2011, 7:21 a.m. UTC
Hi,

On 07/28/2011 08:20 PM, Tom de Vries wrote:
> On 07/28/2011 06:25 PM, Richard Guenther wrote:
>> On Thu, 28 Jul 2011, Tom de Vries wrote:
>>
>>> On 07/28/2011 12:22 PM, Richard Guenther wrote:
>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>
>>>>> On 07/27/2011 05:27 PM, Richard Guenther wrote:
>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>
>>>>>>> On 07/27/2011 02:12 PM, Richard Guenther wrote:
>>>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>>>
>>>>>>>>> On 07/27/2011 01:50 PM, Tom de Vries wrote:
>>>>>>>>>> Hi Richard,
>>>>>>>>>>
>>>>>>>>>> I have a patch set for bug 43513 - The stack pointer is adjusted twice.
>>>>>>>>>>
>>>>>>>>>> 01_pr43513.3.patch
>>>>>>>>>> 02_pr43513.3.test.patch
>>>>>>>>>> 03_pr43513.3.mudflap.patch
>>>>>>>>>>
>>>>>>>>>> The patch set has been bootstrapped and reg-tested on x86_64.
>>>>>>>>>>
>>>>>>>>>> I will send out the patches individually.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The patch replaces a vla __builtin_alloca that has a constant argument with an
>>>>>>>>> array declaration.
>>>>>>>>>
>>>>>>>>> OK for trunk?
>>>>>>>>
>>>>>>>> I don't think it is safe to try to get at the VLA type the way you do.
>>>>>>>
>>>>>>> I don't understand in what way it's not safe. Do you mean I don't manage to find
>>>>>>> the type always, or that I find the wrong type, or something else?
>>>>>>
>>>>>> I think you might get the wrong type,
>>>>>
>>>>> Ok, I'll review that code one more time.
>>>>>
>>>>>> you also do not transform code
>>>>>> like
>>>>>>
>>>>>>   int *p = alloca(4);
>>>>>>   *p = 3;
>>>>>>
>>>>>> as there is no array type involved here.
>>>>>>
>>>>>
>>>>> I was trying to stay away from non-vla allocas.  A source-declared alloca has
>>>>> function lifetime, so we could have a single alloca in a loop, called 10 times,
>>>>> with all 10 instances live at the same time. This patch does not detect such
>>>>> cases, and thus stays away from non-vla allocas. A vla decl does not have such
>>>>> problems; its lifetime ends when it goes out of scope.
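
As a minimal illustration of the distinction above (hypothetical functions, not taken from the patch or the PR testcase): the plain alloca instances all stay live until function exit, while each VLA instance dies at the end of its scope.

  #include <alloca.h>

  void use (int *);

  void
  plain_alloca_in_loop (void)
  {
    int *p[10];
    for (int i = 0; i < 10; i++)
      {
        /* alloca memory lives until the function returns, so all ten
           allocations are live at the same time; folding this call into a
           single fixed-size array would be wrong.  */
        p[i] = alloca (4);
        *p[i] = i;
      }
    for (int i = 0; i < 10; i++)
      use (p[i]);
  }

  void
  vla_in_loop (void)
  {
    for (int i = 0; i < 10; i++)
      {
        /* The VLA's storage is released when it goes out of scope at the
           end of each iteration, so only one instance is ever live and a
           fixed-size replacement is safe.  */
        int a[i + 1];
        a[0] = i;
        use (a);
      }
  }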
>>>>
>>>> Yes indeed - that probably would require more detailed analysis.
>>>>
>>>>>>>> In fact I would simply do sth like
>>>>>>>>
>>>>>>>>   elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>>>>>>>>   n_elem = size * 8 / BITS_PER_UNIT;
>>>>>>>>   array_type = build_array_type_nelts (elem_type, n_elem);
>>>>>>>>   var = create_tmp_var (array_type, NULL);
>>>>>>>>   return fold_convert (TREE_TYPE (lhs), build_fold_addr_expr (var));
>>>>>>>>
>>>>>>>
>>>>>>> I tried this code on the example, and it works, but the newly declared type has
>>>>>>> an 8-bit alignment, while the vla base type has a 32-bit alignment.  This makes
>>>>>>> the memory access in the example potentially unaligned, which prohibits an
>>>>>>> ivopts optimization, so the resulting text size is 68 instead of the 64 achieved
>>>>>>> with my current patch.
>>>>>>
>>>>>> Ok, so then set DECL_ALIGN of the variable to something reasonable
>>>>>> like MIN (size * 8, GET_MODE_PRECISION (word_mode)).  Basically the
>>>>>> alignment that the target's alloca function would guarantee.
>>>>>>
>>>>>
>>>>> I tried that, but that doesn't help. It's the alignment of the type that
>>>>> matters, not of the decl.
>>>>
>>>> It shouldn't.  All accesses are performed with the original types and
>>>> alignment comes from that (plus the underlying decl).
>>>>
>>>
>>> I managed to get it all working by using build_aligned_type rather than DECL_ALIGN.
>>
>> That's really odd, DECL_ALIGN should just work - nothing refers to the
>> type of the decl in the IL.  Can you try also setting DECL_USER_ALIGN to 
>> 1 maybe?
>>
> 
> This doesn't work either.
> 
>   /* Declare array.  */
>   elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>   n_elem = size * 8 / BITS_PER_UNIT;
>   align = MIN (size * 8, GET_MODE_PRECISION (word_mode));
>   array_type = build_array_type_nelts (elem_type, n_elem);
>   var = create_tmp_var (array_type, NULL);
>   DECL_ALIGN (var) = align;
>   DECL_USER_ALIGN (var) = 1;
> 
> Maybe this clarifies it:
> 
> Breakpoint 1, may_be_unaligned_p (ref=0xf7d9d410, step=0xf7d3d578) at
> /home/vries/local/google/src/gcc-mainline/gcc/tree-ssa-loop-ivopts.c:1621
> (gdb) call debug_generic_expr (ref)
> MEM[(int[0:D.2579] *)&D.2595][0]
> (gdb) call debug_generic_expr (step)
> 4
> 
> 1627	  base = get_inner_reference (ref, &bitsize, &bitpos, &toffset, &mode,
> (gdb) call debug_generic_expr (base)
> D.2595
> 
> 1629	  base_type = TREE_TYPE (base);
> (gdb) call debug_generic_expr (base_type)
> <unnamed-unsigned:8>[40]
> 
> 1630	  base_align = TYPE_ALIGN (base_type);
> (gdb) p base_align
> $1 = 8
> 
> So the align is 8-bits, and we return true here:
> 
> (gdb) n
> 1632	  if (mode != BLKmode)
> (gdb) n
> 1634	      unsigned mode_align = GET_MODE_ALIGNMENT (mode);
> (gdb)
> 1636	      if (base_align < mode_align
> (gdb)
> 1639		return true;
> 
> 
> Here we can see that the base actually has the (user) align on it:
> 
> (gdb) call debug_tree (base)
>  <var_decl 0xf7e1b420 D.2595
>     type <array_type 0xf7e1b360
>         type <integer_type 0xf7e1b2a0 public unsigned QI
>             size <integer_cst 0xf7d3d604 constant 8>
>             unit size <integer_cst 0xf7d3d620 constant 1>
>             align 8 symtab 0 alias set -1 canonical type 0xf7e1b2a0 precision 8
>             min <integer_cst 0xf7dffaf0 0> max <integer_cst 0xf7dffb0c 255>
>             pointer_to_this <pointer_type 0xf7e1b3c0>>
>         BLK
>         size <integer_cst 0xf7d5d070 constant 320>
>         unit size <integer_cst 0xf7dde2a0 constant 40>
>         align 8 symtab 0 alias set -1 canonical type 0xf7e1b360
>         domain <integer_type 0xf7de12a0
>                 type <integer_type 0xf7d51000 unsigned int>
>             SI
>             size <integer_cst 0xf7d3d78c constant 32>
>             unit size <integer_cst 0xf7d3d578 constant 4>
>             align 32 symtab 0 alias set -1 canonical type 0xf7de12a0
>             precision 32 min <integer_cst 0xf7d3d594 0>
>             max <integer_cst 0xf7dde284 39>>
>         pointer_to_this <pointer_type 0xf7e1b480>>
>     addressable used ignored BLK file pr43513.c line 4 col 6
>     size <integer_cst 0xf7d5d070 320> unit size <integer_cst 0xf7dde2a0 40>
>     user align 32 context <function_decl 0xf7dfd480 foo3>>
> 
> 
>>>
>>>>> So should we try to find the base type of the vla, and use that, or use the
>>>>> nonstandard char type?
>>>>
>>>> I don't think we can reliably find the base type of the vla - well,
>>>> in practice we may because we control how we lower VLAs during
>>>> gimplification, but no IL constraint says that the
>>>> resulting pointer type should be special.
>>>>
>>>> Using a char[] decl shouldn't be a problem IMHO.
>>>>
>>>>>>>> And obviously you lose the optimization we arrange with inserting
>>>>>>>> __builtin_stack_save/restore pairs that way - stack space will no
>>>>>>>> longer be shared for subsequent VLAs.  Which means that you'd
>>>>>>>> better limit the size for which you allow this promotion.
>>>>>>>>
>>>>>>>
>>>>>>> Right, I could introduce a parameter for this.
>>>>>>
>>>>>> I would think you could use PARAM_LARGE_STACK_FRAME for now and say,
>>>>>> allow a size of PARAM_LARGE_STACK_FRAME / 10?
>>>>>>
>>>>>
>>>>> That unfortunately is too small for the example from the bug report. The default
>>>>> value of the param is 250, so that would be a threshold of 25, and the alloca
>>>>> size of the example is 40.  Perhaps we can try a threshold of
>>>>> PARAM_LARGE_STACK_FRAME - estimated_stack_size or some such?
>>>>
>>>> Hm.  estimated_stack_size is not O(1), so no.  I think we need to
>>>> find a sensible way of allowing stack sharing.  Eventually Micha's
>>>> patch for introducing points-of-death would help here, if we'd
>>>> go for folding this during stack-save/restore optimization.
>>>>
>>>
>>> I changed the heuristics to this:
>>>
>>> +  /* Heuristic: don't fold large vlas.  */
>>> +  threshold = (unsigned HOST_WIDE_INT)PARAM_VALUE (PARAM_LARGE_STACK_FRAME);
>>> +  /* In case a vla is declared at function scope, it has the same lifetime as a
>>> +     declared array, so we allow a larger size.  */
>>> +  block = gimple_block (stmt);
>>> +  if (!(cfun->after_inlining
>>> +        && TREE_CODE (BLOCK_SUPERCONTEXT (block)) == FUNCTION_DECL))
>>> +    threshold /= 10;
>>> +  if (size > threshold)
>>> +    return NULL_TREE;
>>>
>>> The heuristic distinguishes between before and after inlining.
>>>
>>> After inlining, vlas declared at function scope have the same lifetimes as
>>> declared arrays, and don't share their space. There should be no negative
>>> effects from folding an alloca in this case, but for safety we set a threshold
>>> of PARAM_LARGE_STACK_FRAME.
>>>
>>> Before inlining, such a vla might be inlined and share its space with another
>>> vla, so we stick with the normal threshold before inlining.
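
As a rough illustration of the two cases the heuristic separates (hypothetical source, assuming the default PARAM_LARGE_STACK_FRAME of 250 mentioned above): after inlining, a VLA whose enclosing block sits directly under the FUNCTION_DECL gets the full threshold of 250, while any other VLA, or any VLA before inlining, only gets 25.

  void use (char *);

  void
  function_scope_vla (int n)   /* assume n turns out to be the constant 40 */
  {
    char buf[n];    /* allocation block is the function's outermost scope;
                       after inlining the threshold is 250, so the 40-byte
                       alloca is folded to a fixed array.  */
    use (buf);
  }

  void
  block_scope_vla (int n)      /* again n becomes the constant 40 */
  {
    if (n > 0)
      {
        char buf[n];  /* inner block (or any VLA before inlining): the
                         threshold is 250 / 10 == 25, so the 40-byte alloca
                         is left alone.  */
        use (buf);
      }
  }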
>>
>> That sounds reasonable, though the block check should probably use the
>> original VLA decl block, not that of the basic-block of the allocation,
>> but unfortunately we don't have access to that.  So I suppose using
>> the allocation basic-block BLOCK is good enough (still we don't
>> really care about BLOCK boundaries when doing CFG manipulations, so
>> the allocation bb's block may not be the outermost scope in more cases
>> than necessary).
>>
>>> However, using this heuristic we still don't generate optimal code.
>>>
>>> During the first pass_ccp, the folding is not done, because the size (40) is
>>> larger than the threshold of 25; the threshold is 25 because inlining is not yet done.
>>>
>>> During pass_fold_builtins, the folding is done because it's after inlining, but
>>> it runs later than pass_iv_optimize, so that still doesn't yield the optimal size
>>> of 64.
>>>
>>> The folding is not done during any of the other invocations or pass_ccp, because
>>> the argument has already become constant in the earlier invocation.
>>
>> Yeah, that's the issue with relying on folding to do this transformation.
>>
>>> Using this change, I managed to trigger folding during the second invocation of
>>> pass_ccp, before iv_optimize, so we generate optimal code.
>>>
>>> Index: gcc/tree-ssa-ccp.c
>>> ===================================================================
>>> --- gcc/tree-ssa-ccp.c (revision 173734)
>>> +++ gcc/tree-ssa-ccp.c (working copy)
>>> @@ -1727,6 +1727,13 @@ ccp_fold_stmt (gimple_stmt_iterator *gsi
>>>  	if (gimple_call_internal_p (stmt))
>>>  	  return false;
>>>
>>> +        /* The heuristic of fold_builtin_alloca differs before and after
>>> +           inlining, so we don't require the arg to be changed into a constant
>>> +           for folding, but just to be constant.  */
>>> +        if (gimple_call_alloca_for_var_p (stmt)
>>> +            && get_constant_value (gimple_call_arg (stmt, 0)))
>>
>> Probably reverse the get_constant_value check and the transformation
> 
> Done.
> 
>> (gimple_call_alloca_for_var_p isn't a predicate as it has side-effects,
>> so its name should be changed).
>>
>>> +          return true;
>>> +
>>>  	/* Propagate into the call arguments.  Compared to replace_uses_in
>>>  	   this can use the argument slot types for type verification
>>>  	   instead of the current argument type.  We also can safely
>>>
>>> But, to me it feels like a hack. Do you have any ideas how to do this better?
>>
>> It's somewhat of a hack, but at least it is more of a defined place
>> for this transformation - which then suggests to remove it from
>> generic folding and only keep calling it from CCP this way.
>>
> 
> Done.
> 

This is an updated version of the patch. I have 2 new patches and an updated
testcase which I will send out individually.

Patch set was bootstrapped and reg-tested on x86_64.

Ok for trunk?

Thanks,
- Tom

2011-07-30  Tom de Vries  <tom@codesourcery.com>

	PR middle-end/43513
	* Makefile.in (tree-ssa-ccp.o): Add $(PARAMS_H) to rule.
	* tree-ssa-ccp.c (params.h): Include.
	(fold_builtin_alloca_for_var): New function.
	(ccp_fold_stmt): Use fold_builtin_alloca_for_var.

Comments

Tom de Vries Aug. 7, 2011, 11:23 a.m. UTC | #1
On 07/30/2011 09:21 AM, Tom de Vries wrote:
> This is an updated version of the patch. I have 2 new patches and an updated
> testcase which I will send out individually.
> 
> Patch set was bootstrapped and reg-tested on x86_64.
> 
> Ok for trunk?
> 

Ping. http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02730.html and
http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02733.html ok for trunk?

Thanks,
- Tom
Tom de Vries Aug. 17, 2011, 9:38 a.m. UTC | #2
Hi Richard,

Ping. http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02730.html and
http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02733.html ok for trunk?

Thanks,
- Tom
Tom de Vries Aug. 24, 2011, 8:47 a.m. UTC | #3
Hi Richard,

On 07/30/2011 09:21 AM, Tom de Vries wrote:
> This is an updated version of the patch. I have 2 new patches and an updated
> testcase which I will send out individually.
> 
> Patch set was bootstrapped and reg-tested on x86_64.
> 
> Ok for trunk?
> 

You already approved the 2 new patches, and I checked them in.

Ping for the updated patch and updated testcase:
http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02730.html
http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02733.html

The patch replaces a vla __builtin_alloca that has a constant argument with an
array declaration.
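
For illustration, roughly the kind of source this targets (a hypothetical example, not the actual pr43513.c testcase): once the VLA bound is propagated to a constant, the lowered __builtin_alloca call, marked as alloca-for-var, is folded into the address of a fixed-size local array.

  void bar (int *);

  void
  foo (void)
  {
    int n = 10;
    int a[n];     /* gimplified to a __builtin_alloca call with the
                     alloca-for-var flag; after CCP its argument is the
                     constant 40, so fold_builtin_alloca_for_var replaces
                     the call with the address of a 40-byte array.  */
    for (int i = 0; i < n; i++)
      a[i] = i;
    bar (a);
  }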

Thanks,
- Tom
Richard Biener Aug. 30, 2011, 10:48 a.m. UTC | #4
On Wed, Aug 24, 2011 at 10:47 AM, Tom de Vries <vries@codesourcery.com> wrote:
> Hi Richard,
>
> On 07/30/2011 09:21 AM, Tom de Vries wrote:
>> This is an updated version of the patch. I have 2 new patches and an updated
>> testcase which I will sent out individually.
>>
>> Patch set was bootstrapped and reg-tested on x86_64.
>>
>> Ok for trunk?
>>
>
> You already approved the the 2 new patches, and I checked them in.
>
> Ping for the updated patch and updated testcase:
> http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02730.html
> http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02733.html
>
> The patch replaces a vla __builtin_alloca that has a constant argument with an
> array declaration.

Ok with

+  /* Declare array.  */
+  elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
+  n_elem = size * 8 / BITS_PER_UNIT;
+  align = MIN (size * 8, GET_MODE_PRECISION (word_mode));

changed to use BIGGEST_ALIGNMENT instead of GET_MODE_PRECISION (word_mode).
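
In other words, the requested adjustment to that hunk would look roughly like this (a sketch of the reviewed change, not the committed revision):

  /* Declare array.  */
  elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
  n_elem = size * 8 / BITS_PER_UNIT;
  align = MIN (size * 8, BIGGEST_ALIGNMENT);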

For the future, can you please include the testcases in the main patch
instead of sending them separately?  You're the only one separating them
and it's really not helpful.

Thanks,
Richard.

> Thanks,
> - Tom
>
H.J. Lu Aug. 31, 2011, 12:47 p.m. UTC | #5
On Sat, Jul 30, 2011 at 12:21 AM, Tom de Vries <vries@codesourcery.com> wrote:
>
> This is an updated version of the patch. I have 2 new patches and an updated
> testcase which I will send out individually.
>
> Patch set was bootstrapped and reg-tested on x86_64.
>
> Ok for trunk?
>
> Thanks,
> - Tom
>
> 2011-07-30  Tom de Vries  <tom@codesourcery.com>
>
>        PR middle-end/43513
>        * Makefile.in (tree-ssa-ccp.o): Add $(PARAMS_H) to rule.
>        * tree-ssa-ccp.c (params.h): Include.
>        (fold_builtin_alloca_for_var): New function.
>        (ccp_fold_stmt): Use fold_builtin_alloca_for_var.
>

This caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50251

Patch

Index: gcc/tree-ssa-ccp.c
===================================================================
--- gcc/tree-ssa-ccp.c (revision 173734)
+++ gcc/tree-ssa-ccp.c (working copy)
@@ -133,6 +133,7 @@  along with GCC; see the file COPYING3.  
 #include "diagnostic-core.h"
 #include "dbgcnt.h"
 #include "gimple-fold.h"
+#include "params.h"
 
 
 /* Possible lattice values.  */
@@ -1659,6 +1660,51 @@  evaluate_stmt (gimple stmt)
   return val;
 }
 
+/* Detect a vla-related alloca with a constant argument.  If found, declare a
+   fixed-size array and return its address, otherwise return NULL_TREE.  */
+
+static tree
+fold_builtin_alloca_for_var (gimple stmt)
+{
+  unsigned HOST_WIDE_INT size, threshold, n_elem;
+  tree lhs, arg, block, var, elem_type, array_type;
+  unsigned int align;
+
+  /* Get lhs.  */
+  lhs = gimple_call_lhs (stmt);
+  if (lhs == NULL_TREE)
+    return NULL_TREE;
+
+  /* Detect constant argument.  */
+  arg = get_constant_value (gimple_call_arg (stmt, 0));
+  if (arg == NULL_TREE || TREE_CODE (arg) != INTEGER_CST
+      || !host_integerp (arg, 1))
+    return NULL_TREE;
+  size = TREE_INT_CST_LOW (arg);
+
+  /* Heuristic: don't fold large vlas.  */
+  threshold = (unsigned HOST_WIDE_INT)PARAM_VALUE (PARAM_LARGE_STACK_FRAME);
+  /* In case a vla is declared at function scope, it has the same lifetime as a
+     declared array, so we allow a larger size.  */
+  block = gimple_block (stmt);
+  if (!(cfun->after_inlining
+        && TREE_CODE (BLOCK_SUPERCONTEXT (block)) == FUNCTION_DECL))
+    threshold /= 10;
+  if (size > threshold)
+    return NULL_TREE;
+
+  /* Declare array.  */
+  elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
+  n_elem = size * 8 / BITS_PER_UNIT;
+  align = MIN (size * 8, GET_MODE_PRECISION (word_mode));
+  array_type = build_array_type_nelts (elem_type, n_elem);
+  var = create_tmp_var (array_type, NULL);
+  DECL_ALIGN (var) = align;
+
+  /* Fold alloca to the address of the array.  */
+  return fold_convert (TREE_TYPE (lhs), build_fold_addr_expr (var));
+}
+
 /* Fold the stmt at *GSI with CCP specific information that propagating
    and regular folding does not catch.  */
 
@@ -1727,6 +1773,20 @@  ccp_fold_stmt (gimple_stmt_iterator *gsi
 	if (gimple_call_internal_p (stmt))
 	  return false;
 
+        /* The heuristic of fold_builtin_alloca_for_var differs before and after
+           inlining, so we don't require the arg to be changed into a constant
+           for folding, but just to be constant.  */
+        if (gimple_call_alloca_for_var_p (stmt))
+          {
+            tree new_rhs = fold_builtin_alloca_for_var (stmt);
+            bool res;
+            if (new_rhs == NULL_TREE)
+              return false;
+            res = update_call_from_tree (gsi, new_rhs);
+            gcc_assert (res);
+            return true;
+          }
+
 	/* Propagate into the call arguments.  Compared to replace_uses_in
 	   this can use the argument slot types for type verification
 	   instead of the current argument type.  We also can safely
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in (revision 173734)
+++ gcc/Makefile.in (working copy)
@@ -3157,7 +3157,7 @@  tree-call-cdce.o : tree-call-cdce.c $(CO
 tree-ssa-ccp.o : tree-ssa-ccp.c $(TREE_FLOW_H) $(CONFIG_H) \
    $(SYSTEM_H) $(TREE_H) $(TM_P_H) $(EXPR_H) output.h \
    $(DIAGNOSTIC_H) $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
-   $(TREE_DUMP_H) $(BASIC_BLOCK_H) $(TREE_PASS_H) langhooks.h \
+   $(TREE_DUMP_H) $(BASIC_BLOCK_H) $(TREE_PASS_H) langhooks.h  $(PARAMS_H) \
    tree-ssa-propagate.h value-prof.h $(FLAGS_H) $(TARGET_H) $(DIAGNOSTIC_CORE_H) \
    $(DBGCNT_H) tree-pretty-print.h gimple-pretty-print.h gimple-fold.h
 tree-sra.o : tree-sra.c $(CONFIG_H) $(SYSTEM_H) coretypes.h alloc-pool.h \