[i386] Limit unroll factor for certain loops on Corei7

Message ID	CAAe5K+UcRwNSeAZCQrafXoNxW_8+eXSZ99_fHAXQMZ1JFzRYvg@mail.gmail.com
State	New
Headers	show Return-Path: <gcc-patches-return-308933-incoming=patchwork.ozlabs.org@gcc.gnu.org> MIME-Version: 1.0 Date: Thu, 1 Dec 2011 23:39:36 -0800 Message-ID: <CAAe5K+UcRwNSeAZCQrafXoNxW_8+eXSZ99_fHAXQMZ1JFzRYvg@mail.gmail.com> Subject: [Patch, i386] Limit unroll factor for certain loops on Corei7 From: Teresa Johnson <tejohnson@google.com> To: gcc-patches@gcc.gnu.org, Andreas.Krebbel@de.ibm.com Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org

Teresa Johnson Dec. 2, 2011, 7:39 a.m. UTC

The attached patch detects loops containing instructions that tend to
incur high LCP (loop changing prefix) stalls on Core i7, and limits
their unroll factor to try to keep the unrolled loop body small enough
to fit in the Corei7's loop stream detector which can hide LCP stalls
in loops.

To do this I leveraged the existing TARGET_LOOP_UNROLL_ADJUST target
hook, which was previously only defined for s390. I added one
additional call to this target hook, when unrolling for constant trip
count loops. Previously it was only called for runtime computed trip
counts. Andreas, can you comment on the effect for s390 of this
additional call of the target hook, since I can't measure that?

Successfully bootstrapped and checked with x86_64-unknown-linux-gnu.
Could someone please review?

Thanks,
Teresa

2011-12-01  Teresa Johnson  <tejohnson@google.com>

        * loop-unroll.c (loop_has_FP_comp): New function.
        (decide_unroll_constant_iterations): Call loop unroll target hook.
        * cfgloop.h (loop_has_FP_comp): Ditto.
        * config/i386/i386.c (ix86_loop_unroll_adjust): New function.
        (TARGET_LOOP_UNROLL_ADJUST): Define hook for x86.

 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -38685,6 +38755,9 @@ ix86_autovectorize_vector_sizes (void)
 #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
 #endif

+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST ix86_loop_unroll_adjust
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 ^L
 #include "gt-i386.h"

Andreas Krebbel Dec. 2, 2011, 8:54 a.m. UTC | #1

On Thu, Dec 01, 2011 at 11:39:36PM -0800, Teresa Johnson wrote:
> To do this I leveraged the existing TARGET_LOOP_UNROLL_ADJUST target
> hook, which was previously only defined for s390. I added one
> additional call to this target hook, when unrolling for constant trip
> count loops. Previously it was only called for runtime computed trip
> counts. Andreas, can you comment on the effect for s390 of this
> additional call of the target hook, since I can't measure that?

Limiting the unrolling of loops with constant iterations makes also
sense for s390.  However, the limitations are only relevant if it
actually stays a loop. If the loop gets completely peeled into a
sequential instruction stream there should be no limitation. But as I
understand it this will be done by different code paths.

So I think the change should be ok for s390 as well. It will take some
time to get measurements on that. I'll try to keep that in mind until
then.

Bye,

-Andreas-

Teresa Johnson Dec. 2, 2011, 3:47 p.m. UTC | #2

Thanks, Andreas. You are right in that fully peeling a loop is done by
a different code path (peel_loops_completely() and earlier in the tree
unroller).

Teresa

On Fri, Dec 2, 2011 at 12:54 AM, Andreas Krebbel
<krebbel@linux.vnet.ibm.com> wrote:
> On Thu, Dec 01, 2011 at 11:39:36PM -0800, Teresa Johnson wrote:
>> To do this I leveraged the existing TARGET_LOOP_UNROLL_ADJUST target
>> hook, which was previously only defined for s390. I added one
>> additional call to this target hook, when unrolling for constant trip
>> count loops. Previously it was only called for runtime computed trip
>> counts. Andreas, can you comment on the effect for s390 of this
>> additional call of the target hook, since I can't measure that?
>
> Limiting the unrolling of loops with constant iterations makes also
> sense for s390.  However, the limitations are only relevant if it
> actually stays a loop. If the loop gets completely peeled into a
> sequential instruction stream there should be no limitation. But as I
> understand it this will be done by different code paths.
>
> So I think the change should be ok for s390 as well. It will take some
> time to get measurements on that. I'll try to keep that in mind until
> then.
>
> Bye,
>
> -Andreas-
>

Andi Kleen Dec. 2, 2011, 7:36 p.m. UTC | #3

Teresa Johnson <tejohnson@google.com> writes:

Interesting optimization. I would be concerned a little bit
about compile time, does it make a measurable difference?

> The attached patch detects loops containing instructions that tend to
> incur high LCP (loop changing prefix) stalls on Core i7, and limits
> their unroll factor to try to keep the unrolled loop body small enough
> to fit in the Corei7's loop stream detector which can hide LCP stalls
> in loops.

One more optimization would be to optimize padding for this case,
the LSD only works if the loop is not spread over too many 32 byte
chunks. So if you detect the loop is LSD worthy always pad to 32 bytes
at the beginning.

> To do this I leveraged the existing TARGET_LOOP_UNROLL_ADJUST target
> hook, which was previously only defined for s390. I added one
> additional call to this target hook, when unrolling for constant trip
> count loops. Previously it was only called for runtime computed trip
> counts. Andreas, can you comment on the effect for s390 of this
> additional call of the target hook, since I can't measure that?

On Sandy-Bridge there's also the decoded icache which is much larger,
but also has some restrictions. It would be nice if this optimization
was general enough to handle this case too.

In general I notice that the tree loop unroller is too aggressive recently:
a lot of loops that probably shouldn't be unrolled (like containing
function calls etc.) are unrolled at -O3. So probably a better cost
model for unrolling would make sense anyways.

> +  /* Don't reduce unroll factor in loops with floating point
> +     computation, which tend to benefit more heavily from
> +     larger unroll factors and are less likely to bottleneck
> +     at the decoder. */
> +  has_FP = loop_has_FP_comp(loop);

You could cache the loop body and pass it in here.

Patch looks reasonable to me, but I cannot approve.

-Andi

Xinliang David Li Dec. 2, 2011, 7:59 p.m. UTC | #4

;
>
> +/* Determine whether LOOP contains floating-point computation. */
> +bool
> +loop_has_FP_comp(struct loop *loop)
> +{
> +  rtx set, dest;

This probably should be extended to detect other long latency
operations in the future.


> +
> +  if (ix86_tune != PROCESSOR_COREI7_64 &&
> +      ix86_tune != PROCESSOR_COREI7_32)
> +    return nunroll;

Is it better to generalize it and model the LSD and LSD size in the
target model description? -- probably a different patch for that.


> +
> +  /* Look for instructions that store a constant into HImode (16-bit)
> +     memory. These require a length-changing prefix and on corei7 are
> +     prone to LCP stalls. These stalls can be avoided if the loop
> +     is streamed from the loop stream detector. */
> +  body = get_loop_body (loop);
> +  for (i = 0; i < loop->num_nodes && !found; i++)
> +    {
> +      bb = body[i];
> +
> +      FOR_BB_INSNS (bb, insn)
> +        {
> +          rtx set_expr;
> +          set_expr = single_set (insn);
> +          if (set_expr != NULL_RTX
> +              && GET_MODE (SET_DEST (set_expr)) == HImode
> +              && CONST_INT_P (SET_SRC (set_expr))
> +              && MEM_P (SET_DEST (set_expr)))
> +            {
> +              found = true;
> +              break;
> +            }
> +        }
> +    }
> +  free (body);


Probably generalize this to handle other long latency FE stalls -- for
now it only handles LCP stalls.

> +
> +  if (!found)
> +    return nunroll;
> +
> +  /* Don't reduce unroll factor in loops with floating point
> +     computation, which tend to benefit more heavily from
> +     larger unroll factors and are less likely to bottleneck
> +     at the decoder. */
> +  has_FP = loop_has_FP_comp(loop);
> +  if (has_FP)
> +    return nunroll;
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file,
> +               ";; Loop contains HImode store of const (possible LCP
> stalls),\n");
> +      fprintf (dump_file,
> +               "   reduce unroll factor to fit into Loop Stream Detector\n");
> +    }
> +
> +  /* On corei7 the loop stream detector can hold about 28 instructions, so
> +     don't allow unrolling to exceed that. */
> +  newunroll = 28 / loop->av_ninsns;

Is 28 number of instructions or number of uOps?

thanks,

David

> +  if (newunroll < nunroll)
> +    return newunroll;
> +
> +  return nunroll;
> +}
> +
>  /* Initialize the GCC target structure.  */
>  #undef TARGET_RETURN_IN_MEMORY
>  #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
> @@ -38685,6 +38755,9 @@ ix86_autovectorize_vector_sizes (void)
>  #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
>  #endif
>
> +#undef TARGET_LOOP_UNROLL_ADJUST
> +#define TARGET_LOOP_UNROLL_ADJUST ix86_loop_unroll_adjust
> +
>  struct gcc_target targetm = TARGET_INITIALIZER;
>  ^L
>  #include "gt-i386.h"
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

Xinliang David Li Dec. 2, 2011, 8:01 p.m. UTC | #5

On Fri, Dec 2, 2011 at 11:36 AM, Andi Kleen <andi@firstfloor.org> wrote:
> Teresa Johnson <tejohnson@google.com> writes:
>
> Interesting optimization. I would be concerned a little bit
> about compile time, does it make a measurable difference?
>
>> The attached patch detects loops containing instructions that tend to
>> incur high LCP (loop changing prefix) stalls on Core i7, and limits
>> their unroll factor to try to keep the unrolled loop body small enough
>> to fit in the Corei7's loop stream detector which can hide LCP stalls
>> in loops.
>
> One more optimization would be to optimize padding for this case,
> the LSD only works if the loop is not spread over too many 32 byte
> chunks. So if you detect the loop is LSD worthy always pad to 32 bytes
> at the beginning.
>
>> To do this I leveraged the existing TARGET_LOOP_UNROLL_ADJUST target
>> hook, which was previously only defined for s390. I added one
>> additional call to this target hook, when unrolling for constant trip
>> count loops. Previously it was only called for runtime computed trip
>> counts. Andreas, can you comment on the effect for s390 of this
>> additional call of the target hook, since I can't measure that?
>
> On Sandy-Bridge there's also the decoded icache which is much larger,
> but also has some restrictions. It would be nice if this optimization
> was general enough to handle this case too.
>
> In general I notice that the tree loop unroller is too aggressive recently:
> a lot of loops that probably shouldn't be unrolled (like containing
> function calls etc.) are unrolled at -O3. So probably a better cost
> model for unrolling would make sense anyways.
>

Yes, I believe there are many target specific unroll tunings that can
be done -- the current unroll target independent cost/benefit analysis
is too weak.

David



>> +  /* Don't reduce unroll factor in loops with floating point
>> +     computation, which tend to benefit more heavily from
>> +     larger unroll factors and are less likely to bottleneck
>> +     at the decoder. */
>> +  has_FP = loop_has_FP_comp(loop);
>
> You could cache the loop body and pass it in here.
>
> Patch looks reasonable to me, but I cannot approve.
>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only

Teresa Johnson Dec. 2, 2011, 8:11 p.m. UTC | #6

On Fri, Dec 2, 2011 at 11:36 AM, Andi Kleen <andi@firstfloor.org> wrote:
> Teresa Johnson <tejohnson@google.com> writes:
>
> Interesting optimization. I would be concerned a little bit
> about compile time, does it make a measurable difference?

I haven't measured compile time explicitly, but I don't it should,
especially after I address your efficiency suggestion (see below),
since it will just have one pass over the instructions in innermost
loops.

>
>> The attached patch detects loops containing instructions that tend to
>> incur high LCP (loop changing prefix) stalls on Core i7, and limits
>> their unroll factor to try to keep the unrolled loop body small enough
>> to fit in the Corei7's loop stream detector which can hide LCP stalls
>> in loops.
>
> One more optimization would be to optimize padding for this case,
> the LSD only works if the loop is not spread over too many 32 byte
> chunks. So if you detect the loop is LSD worthy always pad to 32 bytes
> at the beginning.

Thanks for the suggestion, I will look at doing that in follow-on tuning.

>
>> To do this I leveraged the existing TARGET_LOOP_UNROLL_ADJUST target
>> hook, which was previously only defined for s390. I added one
>> additional call to this target hook, when unrolling for constant trip
>> count loops. Previously it was only called for runtime computed trip
>> counts. Andreas, can you comment on the effect for s390 of this
>> additional call of the target hook, since I can't measure that?
>
> On Sandy-Bridge there's also the decoded icache which is much larger,
> but also has some restrictions. It would be nice if this optimization
> was general enough to handle this case too.
>
> In general I notice that the tree loop unroller is too aggressive recently:
> a lot of loops that probably shouldn't be unrolled (like containing
> function calls etc.) are unrolled at -O3. So probably a better cost
> model for unrolling would make sense anyways.

These are both good suggestions, and I will look into Sandy Bridge
heuristics in follow-on work, since we will need to tune for that
soon.

>
>> +  /* Don't reduce unroll factor in loops with floating point
>> +     computation, which tend to benefit more heavily from
>> +     larger unroll factors and are less likely to bottleneck
>> +     at the decoder. */
>> +  has_FP = loop_has_FP_comp(loop);
>
> You could cache the loop body and pass it in here.

That is a great idea, and in fact, I think I will do away with this
separate function completely for this patch. I can more efficiently
look for the FP computation while I am looking for the half word
stores. I'll do that and send a follow up with the new patch.

>
> Patch looks reasonable to me, but I cannot approve.

Thanks!
Teresa

>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only

Teresa Johnson Dec. 5, 2011, 6:18 a.m. UTC | #7

On Fri, Dec 2, 2011 at 11:59 AM, Xinliang David Li <davidxl@google.com> wrote:
> ;
>>
>> +/* Determine whether LOOP contains floating-point computation. */
>> +bool
>> +loop_has_FP_comp(struct loop *loop)
>> +{
>> +  rtx set, dest;
>
> This probably should be extended to detect other long latency
> operations in the future.
>
>
>> +
>> +  if (ix86_tune != PROCESSOR_COREI7_64 &&
>> +      ix86_tune != PROCESSOR_COREI7_32)
>> +    return nunroll;
>
> Is it better to generalize it and model the LSD and LSD size in the
> target model description? -- probably a different patch for that.

Yes, I thought it made sense to keep the check here for now, but it
could be generalized that way to handle the limits in different
implementations.

>
>
>> +
>> +  /* Look for instructions that store a constant into HImode (16-bit)
>> +     memory. These require a length-changing prefix and on corei7 are
>> +     prone to LCP stalls. These stalls can be avoided if the loop
>> +     is streamed from the loop stream detector. */
>> +  body = get_loop_body (loop);
>> +  for (i = 0; i < loop->num_nodes && !found; i++)
>> +    {
>> +      bb = body[i];
>> +
>> +      FOR_BB_INSNS (bb, insn)
>> +        {
>> +          rtx set_expr;
>> +          set_expr = single_set (insn);
>> +          if (set_expr != NULL_RTX
>> +              && GET_MODE (SET_DEST (set_expr)) == HImode
>> +              && CONST_INT_P (SET_SRC (set_expr))
>> +              && MEM_P (SET_DEST (set_expr)))
>> +            {
>> +              found = true;
>> +              break;
>> +            }
>> +        }
>> +    }
>> +  free (body);
>
>
> Probably generalize this to handle other long latency FE stalls -- for
> now it only handles LCP stalls.
>
>> +
>> +  if (!found)
>> +    return nunroll;
>> +
>> +  /* Don't reduce unroll factor in loops with floating point
>> +     computation, which tend to benefit more heavily from
>> +     larger unroll factors and are less likely to bottleneck
>> +     at the decoder. */
>> +  has_FP = loop_has_FP_comp(loop);
>> +  if (has_FP)
>> +    return nunroll;
>> +
>> +  if (dump_file)
>> +    {
>> +      fprintf (dump_file,
>> +               ";; Loop contains HImode store of const (possible LCP
>> stalls),\n");
>> +      fprintf (dump_file,
>> +               "   reduce unroll factor to fit into Loop Stream Detector\n");
>> +    }
>> +
>> +  /* On corei7 the loop stream detector can hold about 28 instructions, so
>> +     don't allow unrolling to exceed that. */
>> +  newunroll = 28 / loop->av_ninsns;
>
> Is 28 number of instructions or number of uOps?

It is actually 28 uops, I have updated the comments in the latest
patch, which I am sending out in a follow-on email.

Thanks,
Teresa

>
> thanks,
>
> David
>
>> +  if (newunroll < nunroll)
>> +    return newunroll;
>> +
>> +  return nunroll;
>> +}
>> +
>>  /* Initialize the GCC target structure.  */
>>  #undef TARGET_RETURN_IN_MEMORY
>>  #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
>> @@ -38685,6 +38755,9 @@ ix86_autovectorize_vector_sizes (void)
>>  #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
>>  #endif
>>
>> +#undef TARGET_LOOP_UNROLL_ADJUST
>> +#define TARGET_LOOP_UNROLL_ADJUST ix86_loop_unroll_adjust
>> +
>>  struct gcc_target targetm = TARGET_INITIALIZER;
>>  ^L
>>  #include "gt-i386.h"
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

[i386] Limit unroll factor for certain loops on Corei7

Commit Message

Comments

Patch