Patchwork [PR43513,1/3] Replace vla with array - Implementation.

Submitter Tom de Vries
Date July 30, 2011, 7:24 a.m.
Message ID <4E33B1A2.3020606@codesourcery.com>
Permalink /patch/107481/
State New

Comments

Tom de Vries - July 30, 2011, 7:24 a.m.
On 07/30/2011 10:21 AM, Tom de Vries wrote:
> Hi,
> 
> On 07/28/2011 08:20 PM, Tom de Vries wrote:
>> On 07/28/2011 06:25 PM, Richard Guenther wrote:
>>> On Thu, 28 Jul 2011, Tom de Vries wrote:
>>>
>>>> On 07/28/2011 12:22 PM, Richard Guenther wrote:
>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>
>>>>>> On 07/27/2011 05:27 PM, Richard Guenther wrote:
>>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>>
>>>>>>>> On 07/27/2011 02:12 PM, Richard Guenther wrote:
>>>>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>>>>
>>>>>>>>>> On 07/27/2011 01:50 PM, Tom de Vries wrote:
>>>>>>>>>>> Hi Richard,
>>>>>>>>>>>
>>>>>>>>>>> I have a patch set for bug 43513 - The stack pointer is adjusted twice.
>>>>>>>>>>>
>>>>>>>>>>> 01_pr43513.3.patch
>>>>>>>>>>> 02_pr43513.3.test.patch
>>>>>>>>>>> 03_pr43513.3.mudflap.patch
>>>>>>>>>>>
>>>>>>>>>>> The patch set has been bootstrapped and reg-tested on x86_64.
>>>>>>>>>>>
>>>>>>>>>>> I will send out the patches individually.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The patch replaces a vla __builtin_alloca that has a constant argument with an
>>>>>>>>>> array declaration.
>>>>>>>>>>
>>>>>>>>>> OK for trunk?
>>>>>>>>>
>>>>>>>>> I don't think it is safe to try to get at the VLA type the way you do.
>>>>>>>>
>>>>>>>> I don't understand in what way it's not safe. Do you mean I don't manage to find
>>>>>>>> the type always, or that I find the wrong type, or something else?
>>>>>>>
>>>>>>> I think you might get the wrong type,
>>>>>>
>>>>>> Ok, I'll review that code one more time.
>>>>>>
>>>>>>> you also do not transform code
>>>>>>> like
>>>>>>>
>>>>>>>   int *p = alloca(4);
>>>>>>>   *p = 3;
>>>>>>>
>>>>>>> as there is no array type involved here.
>>>>>>>
>>>>>>
>>>>>> I was trying to stay away from non-vla allocas.  A source-declared alloca has
>>>>>> function lifetime, so we could have a single alloca in a loop, called 10 times,
>>>>>> with all 10 instances live at the same time. This patch does not detect such
>>>>>> cases, and thus stays away from non-vla allocas. A vla decl does not have such
>>>>>> problems: its lifetime ends when it goes out of scope.
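[The lifetime distinction described above can be sketched in plain C. This is a hypothetical illustration, not code from the patch or its testcase:]

```c
#include <alloca.h>

/* Illustration of the lifetime difference discussed above (hypothetical
   example).  Returns the sum 0+1+...+(iters-1), computed twice.  */
int demo (int iters)
{
  int total = 0;
  for (int i = 0; i < iters; i++)
    {
      /* alloca'd memory lives until the function returns, so by the last
         iteration all `iters` blocks are live simultaneously; this is why
         the patch stays away from non-vla allocas.  */
      int *p = alloca (sizeof (int));
      *p = i;
      total += *p;
    }
  for (int i = 0; i < iters; i++)
    {
      int n = 1;
      /* A VLA's lifetime ends at the closing brace, so its stack space
         can be reclaimed and reused on every iteration.  */
      int v[n];
      v[0] = i;
      total += v[0];
    }
  return total;
}
```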
>>>>>
>>>>> Yes indeed - that probably would require more detailed analysis.
>>>>>
>>>>>>>>> In fact I would simply do sth like
>>>>>>>>>
>>>>>>>>>   elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>>>>>>>>>   n_elem = size * 8 / BITS_PER_UNIT;
>>>>>>>>>   array_type = build_array_type_nelts (elem_type, n_elem);
>>>>>>>>>   var = create_tmp_var (array_type, NULL);
>>>>>>>>>   return fold_convert (TREE_TYPE (lhs), build_fold_addr_expr (var));
>>>>>>>>>
>>>>>>>>
>>>>>>>> I tried this code on the example, and it works, but the newly declared type has
>>>>>>>> an 8-bit alignment, while the vla base type has a 32-bit alignment.  This makes
>>>>>>>> the memory access in the example potentially unaligned, which prohibits an
>>>>>>>> ivopts optimization, so the resulting text size is 68 instead of the 64 achieved
>>>>>>>> with my current patch.
>>>>>>>
>>>>>>> Ok, so then set DECL_ALIGN of the variable to something reasonable
>>>>>>> like MIN (size * 8, GET_MODE_PRECISION (word_mode)).  Basically the
>>>>>>> alignment that the targets alloca function would guarantee.
>>>>>>>
>>>>>>
>>>>>> I tried that, but that doesn't help. It's the alignment of the type that
>>>>>> matters, not of the decl.
>>>>>
>>>>> It shouldn't.  All accesses are performed with the original types and
>>>>> alignment comes from that (plus the underlying decl).
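[The decl-versus-type alignment distinction at issue here has a user-level analog in C11. An illustrative sketch only; GCC's DECL_ALIGN/TYPE_ALIGN are internal notions and not modeled by this code:]

```c
#include <stdalign.h>
#include <stdint.h>

/* A char-array *type* only guarantees byte alignment, but a declared
   *object* of that type can carry a stricter alignment of its own --
   the same split as TYPE_ALIGN vs DECL_ALIGN in the discussion above.  */
typedef unsigned char byte40[40];

static _Alignas (4) byte40 buf;   /* object aligned to 4 bytes */

/* Alignment of the array type: that of its element, i.e. 1.  */
int type_align (void)  { return (int) alignof (byte40); }

/* Whether the declared object honors its stricter 4-byte alignment.  */
int obj_aligned (void) { return (uintptr_t) &buf % 4 == 0; }
```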
>>>>>
>>>>
>>>> I managed to get it all working by using build_aligned_type rather than DECL_ALIGN.
>>>
>>> That's really odd, DECL_ALIGN should just work - nothing refers to the
>>> type of the decl in the IL.  Can you try also setting DECL_USER_ALIGN to 
>>> 1 maybe?
>>>
>>
>> This doesn't work either.
>>
>>   /* Declare array.  */
>>   elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>>   n_elem = size * 8 / BITS_PER_UNIT;
>>   align = MIN (size * 8, GET_MODE_PRECISION (word_mode));
>>   array_type = build_array_type_nelts (elem_type, n_elem);
>>   var = create_tmp_var (array_type, NULL);
>>   DECL_ALIGN (var) = align;
>>   DECL_USER_ALIGN (var) = 1;
>>
>> Maybe this clarifies it:
>>
>> Breakpoint 1, may_be_unaligned_p (ref=0xf7d9d410, step=0xf7d3d578) at
>> /home/vries/local/google/src/gcc-mainline/gcc/tree-ssa-loop-ivopts.c:1621
>> (gdb) call debug_generic_expr (ref)
>> MEM[(int[0:D.2579] *)&D.2595][0]
>> (gdb) call debug_generic_expr (step)
>> 4
>>
>> 1627	  base = get_inner_reference (ref, &bitsize, &bitpos, &toffset, &mode,
>> (gdb) call debug_generic_expr (base)
>> D.2595
>>
>> 1629	  base_type = TREE_TYPE (base);
>> (gdb) call debug_generic_expr (base_type)
>> <unnamed-unsigned:8>[40]
>>
>> 1630	  base_align = TYPE_ALIGN (base_type);
>> (gdb) p base_align
>> $1 = 8
>>
>> So the align is 8-bits, and we return true here:
>>
>> (gdb) n
>> 1632	  if (mode != BLKmode)
>> (gdb) n
>> 1634	      unsigned mode_align = GET_MODE_ALIGNMENT (mode);
>> (gdb)
>> 1636	      if (base_align < mode_align
>> (gdb)
>> 1639		return true;
>>
>>
>> Here we can see that the base actually has the (user) align on it:
>>
>> (gdb) call debug_tree (base)
>>  <var_decl 0xf7e1b420 D.2595
>>     type <array_type 0xf7e1b360
>>         type <integer_type 0xf7e1b2a0 public unsigned QI
>>             size <integer_cst 0xf7d3d604 constant 8>
>>             unit size <integer_cst 0xf7d3d620 constant 1>
>>             align 8 symtab 0 alias set -1 canonical type 0xf7e1b2a0 precision 8
>>             min <integer_cst 0xf7dffaf0 0> max <integer_cst 0xf7dffb0c 255>
>>             pointer_to_this <pointer_type 0xf7e1b3c0>>
>>         BLK
>>         size <integer_cst 0xf7d5d070 constant 320>
>>         unit size <integer_cst 0xf7dde2a0 constant 40>
>>         align 8 symtab 0 alias set -1 canonical type 0xf7e1b360
>>         domain <integer_type 0xf7de12a0
>>                 type <integer_type 0xf7d51000 unsigned int>
>>             SI
>>             size <integer_cst 0xf7d3d78c constant 32>
>>             unit size <integer_cst 0xf7d3d578 constant 4>
>>             align 32 symtab 0 alias set -1 canonical type 0xf7de12a0
>>             precision 32 min <integer_cst 0xf7d3d594 0>
>>             max <integer_cst 0xf7dde284 39>>
>>         pointer_to_this <pointer_type 0xf7e1b480>>
>>     addressable used ignored BLK file pr43513.c line 4 col 6
>>     size <integer_cst 0xf7d5d070 320> unit size <integer_cst 0xf7dde2a0 40>
>>     user align 32 context <function_decl 0xf7dfd480 foo3>>
>>
>>
>>>>
>>>>>> So should we try to find the base type of the vla, and use that, or use the
>>>>>> nonstandard char type?
>>>>>
>>>>> I don't think we can reliably find the base type of the vla - well,
>>>>> in practice we may because we control how we lower VLAs during
>>>>> gimplification, but nothing in the IL constraints says that the
>>>>> resulting pointer type should be special.
>>>>>
>>>>> Using a char[] decl shouldn't be a problem IMHO.
>>>>>
>>>>>>>>> And obviously you lose the optimization we arrange with inserting
>>>>>>>>> __builtin_stack_save/restore pairs that way - stack space will no
>>>>>>>>> longer be shared for subsequent VLAs.  Which means that you'd
>>>>>>>>> better limit the size you allow this promotion.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Right, I could introduce a parameter for this.
>>>>>>>
>>>>>>> I would think you could use PARAM_LARGE_STACK_FRAME for now and say,
>>>>>>> allow a size of PARAM_LARGE_STACK_FRAME / 10?
>>>>>>>
>>>>>>
>>>>>> That unfortunately is too small for the example from the bug report. The default
>>>>>> value of the param is 250, so that would be a threshold of 25, and the alloca
>>>>>> size of the example is 40.  Perhaps we can try a threshold of
>>>>>> PARAM_LARGE_STACK_FRAME - estimated_stack_size or some such?
>>>>>
>>>>> Hm.  estimated_stack_size is not O(1), so no.  I think we need to
>>>>> find a sensible way of allowing stack sharing.  Eventually Michas
>>>>> patch for introducing points-of-death would help here, if we'd
>>>>> go for folding this during stack-save/restore optimization.
>>>>>
>>>>
>>>> I changed the heuristics to this:
>>>>
>>>> +  /* Heuristic: don't fold large vlas.  */
>>>> +  threshold = (unsigned HOST_WIDE_INT)PARAM_VALUE (PARAM_LARGE_STACK_FRAME);
>>>> +  /* In case a vla is declared at function scope, it has the same lifetime as a
>>>> +     declared array, so we allow a larger size.  */
>>>> +  block = gimple_block (stmt);
>>>> +  if (!(cfun->after_inlining
>>>> +        && TREE_CODE (BLOCK_SUPERCONTEXT (block)) == FUNCTION_DECL))
>>>> +    threshold /= 10;
>>>> +  if (size > threshold)
>>>> +    return NULL_TREE;
>>>>
>>>> The heuristic distinguishes between before and after inlining.
>>>>
>>>> After inlining, vlas declared at function scope have the same lifetimes as
>>>> declared arrays, and don't share their space. There should be no negative
>>>> effects from folding an alloca in this case, but for safety we set a threshold
>>>> of PARAM_LARGE_STACK_FRAME.
>>>>
>>>> Before inlining, such a vla might be inlined and share its space with another
>>>> vla, so we stick with the normal threshold before inlining.
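[With the default PARAM_LARGE_STACK_FRAME of 250, the threshold arithmetic described above works out as follows. The function names here are illustrative, not GCC's:]

```c
/* Sketch of the threshold heuristic quoted above (hypothetical names).
   The default value of PARAM_LARGE_STACK_FRAME is 250.  */
unsigned threshold_for (unsigned param_large_stack_frame,
                        int after_inlining_at_function_scope)
{
  unsigned threshold = param_large_stack_frame;  /* 250 after inlining */
  if (!after_inlining_at_function_scope)
    threshold /= 10;                             /* 25 before inlining */
  return threshold;
}

/* Folding happens only when size <= threshold, so the 40-byte alloca
   from the bug report folds only after inlining.  */
int folds (unsigned size, int after_inlining)
{
  return size <= threshold_for (250, after_inlining);
}
```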
>>>
>>> That sounds reasonable, though the block check should probably use the
>>> original VLA decl block, not that of the basic-block of the allocation,
>>> but unfortunately we don't have access to that.  So I suppose using
>>> the allocation basic-block BLOCK is good enough (still we don't
>>> really care about BLOCK boundaries when doing CFG manipulations, so
>>> the allocation bbs block may be not the outermost scope in more cases
>>> than necessary).
>>>
>>>> However, using this heuristic we still don't generate optimal code.
>>>>
>>>> During the first pass_ccp, the folding is not done, because the size (40) is
>>>> larger than the threshold 25. The threshold is 25, because inlining is not yet done.
>>>>
>>>> During pass_fold_builtins, the folding is done because it's after inlining, but
>>>> it's later than pass_iv_optimize, so that still doesn't yield the optimal size
>>>> of 64.
>>>>
>>>> The folding is not done during any of the other invocations of pass_ccp, because
>>>> the argument has already become constant in the earlier invocation.
>>>
>>> Yeah, that's the issue with relying on folding to do this transformation.
>>>
>>>> Using this change, I manage to trigger folding during the second invocation of
>>>> pass_ccp, before iv_optimize so we generate optimal code.
>>>>
>>>> Index: gcc/tree-ssa-ccp.c
>>>> ===================================================================
>>>> --- gcc/tree-ssa-ccp.c (revision 173734)
>>>> +++ gcc/tree-ssa-ccp.c (working copy)
>>>> @@ -1727,6 +1727,13 @@ ccp_fold_stmt (gimple_stmt_iterator *gsi
>>>>  	if (gimple_call_internal_p (stmt))
>>>>  	  return false;
>>>>
>>>> +        /* The heuristic of fold_builtin_alloca differs before and after
>>>> +           inlining, so we don't require the arg to be changed into a constant
>>>> +           for folding, but just to be constant.  */
>>>> +        if (gimple_call_alloca_for_var_p (stmt)
>>>> +            && get_constant_value (gimple_call_arg (stmt, 0)))
>>>
>>> Probably reverse the get_constant_value check and the transformation
>>
>> Done.
>>
>>> (gimple_call_alloca_for_var_p isn't a predicate as it has side-effects,
>>> so its name should be changed).
>>>
>>>> +          return true;
>>>> +
>>>>  	/* Propagate into the call arguments.  Compared to replace_uses_in
>>>>  	   this can use the argument slot types for type verification
>>>>  	   instead of the current argument type.  We also can safely
>>>>
>>>> But, to me it feels like a hack. Do you have any ideas how to do this better?
>>>
>>> It's somewhat of a hack, but at least it is more of a defined place
>>> for this transformation - which then suggests to remove it from
>>> generic folding and only keep calling it from CCP this way.
>>>
>>
>> Done.
>>
> 
> This is an updated version of the patch. I have 2 new patches and an updated
> testcase which I will send out individually.
> 

2011-07-30  Tom de Vries  <tom@codesourcery.com>

	PR middle-end/43513
	* tree-ssa-loop-ivopts.c (may_be_unaligned_p): Use get_object_alignment.
Richard Guenther - July 30, 2011, 9:27 a.m.
On Sat, Jul 30, 2011 at 9:24 AM, Tom de Vries <vries@codesourcery.com> wrote:
> [...]
>> This is an updated version of the patch. I have 2 new patches and an updated
>> testcase which I will sent out individually.
>>
>
> 2011-07-30  Tom de Vries  <tom@codesourcery.com>
>
>        PR middle-end/43513
>        * tree-ssa-loop-ivopts.c (may_be_unaligned_p): Use get_object_alignment.

I think it should do

   base_type = TREE_TYPE (base);
   base_align = get_object_alignment (base, BIGGEST_ALIGNMENT);
   base_align = MAX (base_align, TYPE_ALIGN (base_type));

for alignment that is less than the alignment of the type we rely on people
building variant types.  get_object_alignment is too conservative to be used
alone.

Ok with the above change.

As further improvement the code could use get_object_alignment_1 on the
full reference to also track misalignment using that function, removing the
seemingly duplicate code in may_be_unaligned_p.

Thanks,
Richard.

Patch

Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c (revision 173734)
+++ gcc/tree-ssa-loop-ivopts.c (working copy)
@@ -1608,7 +1608,6 @@  static bool
 may_be_unaligned_p (tree ref, tree step)
 {
   tree base;
-  tree base_type;
   HOST_WIDE_INT bitsize;
   HOST_WIDE_INT bitpos;
   tree toffset;
@@ -1626,8 +1625,7 @@  may_be_unaligned_p (tree ref, tree step)
      STRICT_ALIGNMENT is true.  */
   base = get_inner_reference (ref, &bitsize, &bitpos, &toffset, &mode,
 			      &unsignedp, &volatilep, true);
-  base_type = TREE_TYPE (base);
-  base_align = TYPE_ALIGN (base_type);
+  base_align = get_object_alignment (base, BIGGEST_ALIGNMENT);
 
   if (mode != BLKmode)
     {