
Guard use of modulo in cshift (speedup protein)

Message ID Pine.LNX.4.64.1204101646390.25409@wotan.suse.de
State New

Commit Message

Michael Matz April 10, 2012, 2:53 p.m. UTC
Hi,

this patch speeds up Polyhedron's protein on Bulldozer quite a bit.  The 
thing is that in this testcase cshift is called with a very short length
(<= 3) and the shift amount is always less than the length.  
Surprisingly, the division instruction takes up a considerable amount of 
time, so much that it makes sense to guard it when the shift is in bounds.

Here's some oprofile of _gfortrani_cshift0_i4 (total 31020 cycles):

    23  0.0032 :   caf00:       idiv   %r13
 13863  1.9055 :   caf03:       lea    (%rdx,%r13,1),%r12

I.e., despite the memory shuffling, one third of the cshift cycles go to 
that division.  With the patch, the time for protein drops from 0m21.367s 
to 0m20.547s on this Bulldozer machine.  I've checked that it has no 
adverse effect on older AMD or Intel cores (0:44.30elapsed vs. 
0:44.00elapsed, still an improvement).

Regstrapped on x86_64-linux.  Okay for trunk?


Ciao,
Michael.

	* m4/cshift0.m4 (cshift0_'rtype_code`): Guard use of modulo.

	* generated/cshift0_c10.c: Regenerated.
	* generated/cshift0_c16.c: Regenerated.
	* generated/cshift0_c4.c: Regenerated.
	* generated/cshift0_c8.c: Regenerated.
	* generated/cshift0_i16.c: Regenerated.
	* generated/cshift0_i1.c: Regenerated.
	* generated/cshift0_i2.c: Regenerated.
	* generated/cshift0_i4.c: Regenerated.
	* generated/cshift0_i8.c: Regenerated.
	* generated/cshift0_r10.c: Regenerated.
	* generated/cshift0_r16.c: Regenerated.
	* generated/cshift0_r4.c: Regenerated.
	* generated/cshift0_r8.c: Regenerated.

Comments

Steven Bosscher April 10, 2012, 3:16 p.m. UTC | #1
On Tue, Apr 10, 2012 at 4:53 PM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> this patch speeds up Polyhedron's protein on Bulldozer quite a bit.  The
> thing is that in this testcase cshift is called with a very short length
> (<= 3) and the shift amount is always less than the length.
> Surprisingly, the division instruction takes up a considerable amount of
> time, so much that it makes sense to guard it when the shift is in bounds.
>
> Here's some oprofile of _gfortrani_cshift0_i4 (total 31020 cycles):
>
>    23  0.0032 :   caf00:       idiv   %r13
>  13863  1.9055 :   caf03:       lea    (%rdx,%r13,1),%r12
>
> I.e. despite the memory shuffling one third of the cshift cycles are that
> division.  With the patch the time for protein drops from 0m21.367s to
> 0m20.547s on this Bulldozer machine.  I've checked that it has no adverse
> effect on older AMD or Intel cores (0:44.30elapsed vs 0:44.00elapsed,
> still an improvement).
>
> Regstrapped on x86_64-linux.  Okay for trunk?
>
>
> Ciao,
> Michael.
>
>        * m4/cshift0.m4 (cshift0_'rtype_code`): Guard use of modulo.
>
>        * generated/cshift0_c10.c: Regenerated.
>        * generated/cshift0_c16.c: Regenerated.
>        * generated/cshift0_c4.c: Regenerated.
>        * generated/cshift0_c8.c: Regenerated.
>        * generated/cshift0_i16.c: Regenerated.
>        * generated/cshift0_i1.c: Regenerated.
>        * generated/cshift0_i2.c: Regenerated.
>        * generated/cshift0_i4.c: Regenerated.
>        * generated/cshift0_i8.c: Regenerated.
>        * generated/cshift0_r10.c: Regenerated.
>        * generated/cshift0_r16.c: Regenerated.
>        * generated/cshift0_r4.c: Regenerated.
>        * generated/cshift0_r8.c: Regenerated.
>
> Index: m4/cshift0.m4
> ===================================================================
> --- m4/cshift0.m4       (revision 186272)
> +++ m4/cshift0.m4       (working copy)
> @@ -98,9 +98,13 @@ cshift0_'rtype_code` ('rtype` *ret, cons
>   rptr = ret->base_addr;
>   sptr = array->base_addr;
>
> -  shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
> -  if (shift < 0)
> -    shift += len;
> +  /* Avoid the costly modulo for trivially in-bound shifts.  */
> +  if (shift < 0 || shift >= len)
> +    {
> +      shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
> +      if (shift < 0)
> +       shift += len;
> +    }
>
>   while (rptr)
>     {

This is OK.

Do you think it would be worthwhile to do this transformation in the
middle end too, based on profile information for values? IIRC
value-prof handles constant divmod but not ranges for modulo
operations.

Steven
Michael Matz April 10, 2012, 3:40 p.m. UTC | #2
Hi,

On Tue, 10 Apr 2012, Steven Bosscher wrote:

> This is OK.

r186283.

> Do you think it would be worthwhile to do this transformation in the 
> middle end too, based on profile information for values?

I'd think so, but it probably requires a new profiler that counts how 
often 0 <= A < B holds for every "A % B".  Just profiling the range of values 
might be misleading (because A <= N and B <= M and N <= M doesn't imply 
that A <= B often holds).

But it might already be an interesting experiment to do such a 
transformation generally (without profiling) and see what it gives on some 
benchmarks, just to get a feel for what's on the plate.

> IIRC value-prof 
> handles constant divmod but not ranges for modulo operations.


Ciao,
Michael.
Richard Biener April 11, 2012, 8:27 a.m. UTC | #3
On Tue, Apr 10, 2012 at 5:40 PM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> On Tue, 10 Apr 2012, Steven Bosscher wrote:
>
>> This is OK.
>
> r186283.
>
>> Do you think it would be worthwhile to do this transformation in the
>> middle end too, based on profile information for values?
>
> I'd think so, but it probably requires a new profiler that counts for how
> often 0 <= A <= B for every "A % B".  Just profiling the range of values
> might be misleading (because A <= N and B <= M and N <= M doesn't imply
> that A <= B often holds).
>
> But it would possibly be an interesting experiment already to do such
> transformation generally (without profiling) and see what it gives on some
> benchmarks.  Just to get a feel what's on the plate.

The question is, of course, why on earth is a modulo operation in the
loop setup so expensive that avoiding it improves the performance of
the overall routine so much ... did you expect the code-gen difference
of your patch?

Richard.

>> IIRC value-prof
>> handles constant divmod but not ranges for modulo operations.
>

> Ciao,
> Michael.
Thomas Koenig April 11, 2012, 8:27 a.m. UTC | #4
Hi Michael,

could you replace

> +  if (shift < 0 || shift >= len)

by

> +  if (unlikely (shift < 0 || shift >= len))

?  This could save a few more cycles.

	Thomas
Michael Matz April 11, 2012, 12:04 p.m. UTC | #5
Hi,

On Wed, 11 Apr 2012, Richard Guenther wrote:

> > But it would possibly be an interesting experiment already to do such 
> > transformation generally (without profiling) and see what it gives on 
> > some benchmarks.  Just to get a feel what's on the plate.
> 
> The question is, of course, why on earth is a modulo operation in the 
> loop setup so expensive that avoiding it improves the performance of the 
> overall routine so much ...

Because in most cases in protein the loop actually runs only once or 
twice, or not at all, so the loop setup is more expensive than the loop 
itself.

> did you expect the code-gen difference of your patch?

Which code-gen difference?  I expected that in the protein case the 
division isn't executed, if that was what your question was about.  If it 
wasn't, please reformulate :)


Ciao,
Michael.

Patch

Index: m4/cshift0.m4
===================================================================
--- m4/cshift0.m4	(revision 186272)
+++ m4/cshift0.m4	(working copy)
@@ -98,9 +98,13 @@  cshift0_'rtype_code` ('rtype` *ret, cons
   rptr = ret->base_addr;
   sptr = array->base_addr;
 
-  shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
-  if (shift < 0)
-    shift += len;
+  /* Avoid the costly modulo for trivially in-bound shifts.  */
+  if (shift < 0 || shift >= len)
+    {
+      shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
+      if (shift < 0)
+	shift += len;
+    }
 
   while (rptr)
     {