
[wide-int] int_traits <tree>

Message ID alpine.LNX.2.00.1310161537220.11149@zhemvz.fhfr.qr
State New

Commit Message

Richard Biener Oct. 16, 2013, 1:55 p.m. UTC
Speaking in terms of a patch:

Index: tree.h
===================================================================
--- tree.h      (revision 203690)
+++ tree.h      (working copy)
@@ -5184,14 +5184,11 @@ wi::int_traits <const_tree>::decompose (
                           / HOST_BITS_PER_WIDE_INT);
   unsigned int xprecision = get_precision (x);
 
-  gcc_assert (precision >= xprecision);
+  gcc_checking_assert (precision >= xprecision && precision != 0);
 
-  /* Got to be careful of precision 0 values.  */
-  if (precision)
-    len = MIN (len, max_len);
   if (TYPE_SIGN (TREE_TYPE (x)) == UNSIGNED)
     {
-      unsigned int small_prec = precision & (HOST_BITS_PER_WIDE_INT - 1);
+      unsigned int small_prec = xprecision & (HOST_BITS_PER_WIDE_INT - 1);
        if (small_prec)
         {
          /* We have to futz with this because the canonization for
@@ -5201,7 +5198,7 @@ wi::int_traits <const_tree>::decompose (
            {
              for (unsigned int i = 0; i < len - 1; i++)
                scratch[i] = val[i];
-             scratch[len - 1] = sext_hwi (val[len - 1], precision);
+             scratch[len - 1] = sext_hwi (val[len - 1], small_prec);
              return wi::storage_ref (scratch, len, precision);
            }
        }

the precision 0 thing is gone as far as I understand?

Also sign-extending the upper hwi must be done with precision % 
HOST_BITS_PER_WIDE_INT, no?  And it assumes that we only ever
have to extend the upper most HWI, thus 'len' is never too big.

I note that we immediately return in the above case, so

      if (precision < xprecision + HOST_BITS_PER_WIDE_INT)
        {
          len = wi::force_to_size (scratch, val, len, xprecision,
                                   precision, UNSIGNED);
          return wi::storage_ref (scratch, len, precision);
        }

applies only when the desired precision is a multiple of a HWI.
I assume it adds that extra zero word in case the MSB is set,
right?  I am confused about the condition written here and
how we look at precision and not xprecision when deciding how
to extend - given a 8-bit precision unsigned 0xff and precision == 64
we do not even consider sign-extending the value because we look
at precision and not xprecision.  Don't we need to look at
xprecision here?  After all an assert precision == xprecision
does not work in this routine.

Quickly execute.exp tested only.

To avoid code quality wreckage we have to implement a different way
of allocating the 'scratch' storage of wide_int_ref_storage
(well, ideally we wouldn't need it ...).  I thought of simply
allocating it via alloca (), but we need to do that in all
callers that build a wide_int_ref (eventually via a hidden
default arg).  But as that's a template specialization of
generic_wide_int ... the option is to provide a function
wrapper inside wi:: for this, like

template <typename T>
wide_int_ref
wi::ref (const T &t,
         HOST_WIDE_INT *scratch = XALLOCAVEC (HOST_WIDE_INT,
                                              WIDE_INT_MAX_ELTS))
{
  return wide_int_ref (t, scratch);
}

and amend the storage constructors to get a scratch argument.
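
For example, roughly (just a sketch; the constructor shape and the names
are approximate, not the actual branch code):

/* Sketch: have the caller supply the scratch buffer instead of
   embedding it in the storage object.  */
class wide_int_ref_storage : public wi::storage_ref
{
public:
  template <typename T>
  wide_int_ref_storage (const T &x, HOST_WIDE_INT *scratch)
    : wi::storage_ref (wi::int_traits <T>::decompose
                       (scratch, wi::get_precision (x), x))
  {}
};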

The reason for all this is that otherwise the wide_int_ref object
escapes when we pass the scratch storage to the workers in
int_traits <const_tree>.

Thanks,
Richard.

Comments

Kenneth Zadeck Oct. 16, 2013, 3:47 p.m. UTC | #1
On 10/16/2013 09:55 AM, Richard Biener wrote:
> Speaking in terms of a patch:
>
> Index: tree.h
> ===================================================================
> --- tree.h      (revision 203690)
> +++ tree.h      (working copy)
> @@ -5184,14 +5184,11 @@ wi::int_traits <const_tree>::decompose (
>                            / HOST_BITS_PER_WIDE_INT);
>     unsigned int xprecision = get_precision (x);
>   
> -  gcc_assert (precision >= xprecision);
> +  gcc_checking_assert (precision >= xprecision && precision != 0);
>   
> -  /* Got to be careful of precision 0 values.  */
> -  if (precision)
> -    len = MIN (len, max_len);
>     if (TYPE_SIGN (TREE_TYPE (x)) == UNSIGNED)
>       {
> -      unsigned int small_prec = precision & (HOST_BITS_PER_WIDE_INT - 1);
> +      unsigned int small_prec = xprecision & (HOST_BITS_PER_WIDE_INT -
> 1);
>         if (small_prec)
>          {
>            /* We have to futz with this because the canonization for
> @@ -5201,7 +5198,7 @@ wi::int_traits <const_tree>::decompose (
>              {
>                for (unsigned int i = 0; i < len - 1; i++)
>                  scratch[i] = val[i];
> -             scratch[len - 1] = sext_hwi (val[len - 1], precision);
> +             scratch[len - 1] = sext_hwi (val[len - 1], small_prec);
>                return wi::storage_ref (scratch, len, precision);
>              }
>          }
>
> the precision 0 thing is gone as far as I understand?
There is still the case where the front ends create types with precision 
0, and that causes a few odd cases, but the precision 0 that you refer 
to is gone.
> Also sign-extending the upper hwi must be done with precision %
> HOST_BITS_PER_WIDE_INT, no?  And it assumes that we only ever
> have to extend the upper most HWI, thus 'len' is never too big.
yes
> I note that we immediately return in the above case, so
>
>        if (precision < xprecision + HOST_BITS_PER_WIDE_INT)
>          {
>            len = wi::force_to_size (scratch, val, len, xprecision,
> precision, UNSIGNED);
>            return wi::storage_ref (scratch, len, precision);
>          }
>
> applies only when the desired precision is a multiple of a HWI.
yes
> I assume it adds that extra zero word in case the MSB is set,
> right?
yes
> I am confused about the condition written here and
> how we look at precision and not xprecision when deciding how
> to extend - given a 8-bit precision unsigned 0xff and precision == 64
> we do not even consider sign-extending the value because we look
> at precision and not xprecision.  Don't we need to look at
> xprecision here?
I do not think so.   There are three cases here:

1) xprecision < precision:
say xprecision = 8 and precision = 32.

The number is between 0 and 255 and after canonicalization the number 
will be between 0 and 255.

2) xprecision == precision
not much to talk about here.

3) xprecision > precision

We need to lose the upper bits of the input and it does not matter what 
they were.    The only thing that we are concerned about is that the 
bits above what is now the sign bit pointed to by precision are matched 
in the bits above precision.


One thing I do think we could change is the "if (precision < xprecision + 
HWI)": it could become "else if (...)", which will save a test.
> After all an assert precision == xprecision
> does not work in this routine.
>
> Quickly execute.exp tested only.
>
> To avoid code quality wreckage we have to implement a different way
> of allocating the 'scratch' storage of wide_int_ref_storage
> (well, ideally we wouldn't need it ...).  I thought of simply
> allocating it via alloca (), but we need to do that in all
> callers that build a wide_int_ref (eventually via a hidden
> default arg).  But as that's a template specialization of
> generic_wide_int ... so the option is to provide a function
> wrapper inside wi:: for this, like
I want richard and mike to be the people who respond to the next 
point.    I am not a c++ person and all of this storage manager layer 
stuff gives me a headache.

> template <typename T>
> wide_int_ref
> wi::ref (const T &t, HOST_WIDE_INT *scratch = XALLOCAVEC (HOST_WIDE_INT,
> WIDE_INT_MAX_ELTS))
> {
>    return wide_int_ref (t, scratch);
> }
>
> and amend the storage constructors to get a scratch argument.
>
> The reason for all this is that otherwise the wide_int_ref object
> escapes when we pass the scratch storage to the workers in
> int_traits <const_tree>.
>
> Thanks,
> Richard.
Mike Stump Oct. 16, 2013, 5:24 p.m. UTC | #2
On Oct 16, 2013, at 8:47 AM, Kenneth Zadeck <zadeck@naturalbridge.com> wrote:
>> To avoid code quality wreckage we have to implement a different way
>> of allocating the 'scratch' storage of wide_int_ref_storage

> I want richard and mike to be the people who respond to the next point.    I am not a c++ person and all of this storage manager layer stuff gives me a headache.

I don't have a problem with that direction if it improves code-gen...
Richard Sandiford Oct. 16, 2013, 6:45 p.m. UTC | #3
Kenneth Zadeck <zadeck@naturalbridge.com> writes:
>> I note that we immediately return in the above case, so
>>
>>        if (precision < xprecision + HOST_BITS_PER_WIDE_INT)
>>          {
>>            len = wi::force_to_size (scratch, val, len, xprecision,
>> precision, UNSIGNED);
>>            return wi::storage_ref (scratch, len, precision);
>>          }
>>
>> applies only when the desired precision is a multiple of a HWI.
> yes
>> I assume it adds that extra zero word in case the MSB is set,
>> right?
> yes
>> I am confused about the condition written here and
>> how we look at precision and not xprecision when deciding how
>> to extend - given a 8-bit precision unsigned 0xff and precision == 64
>> we do not even consider sign-extending the value because we look
>> at precision and not xprecision.  Don't we need to look at
>> xprecision here?
> I do not think so.   There are three cases here:
>
> 1) xprecision < precision:
> say xprecision = 8 and precision = 32.
>
> The number is between 0 and 255 and after canonicalization the number 
> will be between 0 and 255.
>
> 2) xprecision == precision
> not much to talk about here.
>
> 3) xprecision > precision
>
> We need to loose the upper bits of the input and it does not matter what 
> they were.    The only thing that we are concerned about is that the 
> bits above what is now the sign bit pointed to by precision are matched 
> in the bits above precision.

(3) is gone with the assert though.

As mentioned in my message yesterday, I thought your new way of canonising
unsigned tree constants meant that there was always an upper zero bit.
Is that right?

If so, xprecision < precision is a no-op, because the number always
has the right form for wider precisions.  The only difficult case is
xprecision == precision, since then we need to peel off any upper -1 HWIs.
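
I.e. roughly (sketch only, ignoring the partial-top-block case):

  /* Sketch: trim to the blocks the precision actually needs, then drop
     upper all-ones HWIs that are pure sign extension of the block below.  */
  unsigned int blocks_needed
    = (precision + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT;
  len = MIN (len, blocks_needed);
  while (len > 1
         && val[len - 1] == (HOST_WIDE_INT) -1
         && val[len - 2] < 0)
    len--;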

If that's right and xprecision == precision can use val with an adjusted len,
then...

>> After all an assert precision == xprecision
>> does not work in this routine.
>>
>> Quickly execute.exp tested only.
>>
>> To avoid code quality wreckage we have to implement a different way
>> of allocating the 'scratch' storage of wide_int_ref_storage
>> (well, ideally we wouldn't need it ...).  I thought of simply
>> allocating it via alloca (), but we need to do that in all
>> callers that build a wide_int_ref (eventually via a hidden
>> default arg).  But as that's a template specialization of
>> generic_wide_int ... so the option is to provide a function
>> wrapper inside wi:: for this, like
> I want richard and mike to be the people who respond to the next 
> point.    I am not a c++ person and all of this storage manager layer 
> stuff gives me a headache.

...the use of the scratch array goes away completely for addr_wide_int
and max_wide_int.  If this code is on a hot path for wide_int then maybe
we should simply require that wide_int operations involving trees have
explicit extensions, like in rtl.  (I.e. only do the implicit extension
for addr_wide_int and max_wide_int.)

I'm not opposed to going back to a separate scratch array if all else fails,
but it's really ugly, so I'd like to look at the alternatives first...

Thanks,
Richard
Kenneth Zadeck Oct. 16, 2013, 8:50 p.m. UTC | #4
On 10/16/2013 02:45 PM, Richard Sandiford wrote:
> Kenneth Zadeck <zadeck@naturalbridge.com> writes:
>>> I note that we immediately return in the above case, so
>>>
>>>         if (precision < xprecision + HOST_BITS_PER_WIDE_INT)
>>>           {
>>>             len = wi::force_to_size (scratch, val, len, xprecision,
>>> precision, UNSIGNED);
>>>             return wi::storage_ref (scratch, len, precision);
>>>           }
>>>
>>> applies only when the desired precision is a multiple of a HWI.
>> yes
>>> I assume it adds that extra zero word in case the MSB is set,
>>> right?
>> yes
>>> I am confused about the condition written here and
>>> how we look at precision and not xprecision when deciding how
>>> to extend - given a 8-bit precision unsigned 0xff and precision == 64
>>> we do not even consider sign-extending the value because we look
>>> at precision and not xprecision.  Don't we need to look at
>>> xprecision here?
>> I do not think so.   There are three cases here:
>>
>> 1) xprecision < precision:
>> say xprecision = 8 and precision = 32.
>>
>> The number is between 0 and 255 and after canonicalization the number
>> will be between 0 and 255.
>>
>> 2) xprecision == precision
>> not much to talk about here.
>>
>> 3) xprecision > precision
>>
>> We need to loose the upper bits of the input and it does not matter what
>> they were.    The only thing that we are concerned about is that the
>> bits above what is now the sign bit pointed to by precision are matched
>> in the bits above precision.
> (3) is gone with the assert though.
you seem to be correct here.
> As mentioned in my message yesterday, I thought your new way of canonising
> unsigned tree constants meant that there was always an upper zero bit.
> Is that right?
i believe this is correct.
> If so, xprecision < precision is a no-op, because the number always
> has the right form for wider precisions.  The only difficult case is
> xprecision == precision, since then we need to peel off any upper -1 HWIs.
say my HWI size is 8 bits (just to keep from typing a million 'f's).   if 
i have a 16 bit unsigned number that is all 1s, in the tree world it is 3 
hwis:
0x00 0xff 0xff.

but inside regular wide int, it would take 1 wide int whose value is 0xff.
inside of max it would be the same as the tree, but then the test 
precision < xprecision + hbpwi never kicks in because precision is 
guaranteed to be huge.

inside of addr_wide_int, i think we tank with the assertion.

the case actually comes up on the ppc because they do a lot of 128 bit 
math.    I think i got thru the x86-64 without noticing this.



> If that's right and xprecision == precision can use val with an adjusted len,
> then...
>
>>> After all an assert precision == xprecision
>>> does not work in this routine.
>>>
>>> Quickly execute.exp tested only.
>>>
>>> To avoid code quality wreckage we have to implement a different way
>>> of allocating the 'scratch' storage of wide_int_ref_storage
>>> (well, ideally we wouldn't need it ...).  I thought of simply
>>> allocating it via alloca (), but we need to do that in all
>>> callers that build a wide_int_ref (eventually via a hidden
>>> default arg).  But as that's a template specialization of
>>> generic_wide_int ... so the option is to provide a function
>>> wrapper inside wi:: for this, like
>> I want richard and mike to be the people who respond to the next
>> point.    I am not a c++ person and all of this storage manager layer
>> stuff gives me a headache.
> ...the use of the scratch array goes away completely for addr_wide_int
> and max_wide_int.  If this code is on a hot path for wide_int then maybe
> we should simply require that wide_int operations involving trees have
> explicit extensions, like in rtl.  (I.e. only do the implicit extension
> for addr_wide_int and max_wide_int.)
i do not understand how this is true.
> I'm not opposed to going back to a separate scratch array if all else fails,
> but it's really ugly, so I'd like to look at the alternatives first...
>
> Thanks,
> Richard
Richard Sandiford Oct. 17, 2013, 8:08 a.m. UTC | #5
Kenneth Zadeck <zadeck@naturalbridge.com> writes:
>> As mentioned in my message yesterday, I thought your new way of canonising
>> unsigned tree constants meant that there was always an upper zero bit.
>> Is that right?
> i believe this is correct.
>> If so, xprecision < precision is a no-op, because the number always
>> has the right form for wider precisions.  The only difficult case is
>> xprecision == precision, since then we need to peel off any upper -1 HWIs.
> say my HWI size is 8 bits (just to keep from typing a million 'f's.   if 
> i have a 16 bit unsigned number that is all 1s in the tree world it is 3 
> hwis
> 0x00 0xff 0xff.
>
> but inside regular wide int, it would take 1 wide int whose value is 0xff.
> inside of max it would be the same as the tree, but then the test 
> precision < xprecision + hbpwi never kicks in because precision is 
> guaranteed to be huge.
>
> inside of addr_wide_int, i think we tank with the assertion.

It should be OK for addr_wide_int too.  The precision still fits 2 HWIs.
The initial length is greater than the maximum length of an addr_wide_int,
but your "len = MAX (len, max_len)" deals with that.

> the case actually comes up on the ppc because they do a lot of 128 bit 
> math.    I think i got thru the x86-64 without noticing this.

Well, it'd be suspicious if we're directly using 128-bit numbers
in addr_wide_int.  The justification for the assertion was that we
should explicitly truncate to addr_wide_int when deliberately
ignoring upper bits, beyond bit or byte address width.  128 bits
definitely falls into that category on powerpc.

Thanks,
Richard
Richard Biener Oct. 17, 2013, 8:46 a.m. UTC | #6
On Thu, 17 Oct 2013, Richard Sandiford wrote:

> Kenneth Zadeck <zadeck@naturalbridge.com> writes:
> >> As mentioned in my message yesterday, I thought your new way of canonising
> >> unsigned tree constants meant that there was always an upper zero bit.
> >> Is that right?
> > i believe this is correct.
> >> If so, xprecision < precision is a no-op, because the number always
> >> has the right form for wider precisions.  The only difficult case is
> >> xprecision == precision, since then we need to peel off any upper -1 HWIs.
> > say my HWI size is 8 bits (just to keep from typing a million 'f's.   if 
> > i have a 16 bit unsigned number that is all 1s in the tree world it is 3 
> > hwis
> > 0x00 0xff 0xff.
> >
> > but inside regular wide int, it would take 1 wide int whose value is 0xff.
> > inside of max it would be the same as the tree, but then the test 
> > precision < xprecision + hbpwi never kicks in because precision is 
> > guaranteed to be huge.
> >
> > inside of addr_wide_int, i think we tank with the assertion.
> 
> It should be OK for addr_wide_int too.  The precision still fits 2 HWIs.
> The initial length is greater than the maximum length of an addr_wide_int,
> but your "len = MAX (len, max_len)" deals with that.

It's

  len = MIN (len, max_len)

which looked suspicious to me, but with precision >= xprecision
precision can only be zero if xprecision is zero which looked to
me like it cannot happen - or rather it should be fixed.

> > the case actually comes up on the ppc because they do a lot of 128 bit 
> > math.    I think i got thru the x86-64 without noticing this.
> 
> Well, it'd be suspicious if we're directly using 128-bit numbers
> in addr_wide_int.  The justification for the assertion was that we
> should explicitly truncate to addr_wide_int when deliberately
> ignoring upper bits, beyond bit or byte address width.  128 bits
> definitely falls into that category on powerpc.

My question is whether with 8-bit HWI 0x00 0xff 0xff is a valid
wide-int value if it has precision 16.  AFAIK that is what the
code produces, but now Kenny says this is only for some kind
of wide-ints but not all?  That is, why is

inline wi::storage_ref
wi::int_traits <const_tree>::decompose (HOST_WIDE_INT *scratch,
                                        unsigned int precision,
                                        const_tree x)
{
  unsigned int len = TREE_INT_CST_NUNITS (x);
  const HOST_WIDE_INT *val
    = (const HOST_WIDE_INT *) &TREE_INT_CST_ELT (x, 0);
  return wi::storage_ref (val, len, precision);
}

not a valid implementation together with making sure that the
INTEGER_CST tree rep has that extra word of zeros if required?
I would like to see us move in that direction (given that the
tree rep of INTEGER_CST has transitioned to variable-length already).

Btw, code such as

tree
wide_int_to_tree (tree type, const wide_int_ref &pcst)
{
...
  unsigned int small_prec = prec & (HOST_BITS_PER_WIDE_INT - 1);
  bool recanonize = sgn == UNSIGNED
    && small_prec
    && (prec + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT == len;

definitely needs a comment.  I would have thought _all_ unsigned
numbers need re-canonicalization.  Well, maybe only if we're
forcing the extra zeros.

[This function shows another optimization issue:

    case BOOLEAN_TYPE:
      /* Cache false or true.  */
      limit = 2;
      if (wi::leu_p (cst, 1))
        ix = cst.to_uhwi ();

I would have expected cst <= 1 be optimized to cst.len == 1 &&
cst.val[0] <= 1.  It expands to

<L27>:
  MEM[(long int *)&D.50698 + 16B] = 1;
  MEM[(struct wide_int_ref_storage *)&D.50698] = &MEM[(struct 
wide_int_ref_storage *)&D.50698].scratch;
  MEM[(struct wide_int_ref_storage *)&D.50698 + 8B] = 1;
  MEM[(struct wide_int_ref_storage *)&D.50698 + 12B] = 32;
  _277 = MEM[(const struct wide_int_storage *)&cst + 260B];
  if (_277 <= 64)
    goto <bb 42>;
  else
    goto <bb 43>;

  <bb 42>:
  xl_491 = zext_hwi (1, 32);  // ok, checking enabled and thus out-of-line
  _494 = MEM[(const long int *)&cst];
  _495 = (long unsigned int) _494;
  yl_496 = zext_hwi (_495, _277);
  _497 = xl_491 < yl_496;
  goto <bb 44>;

  <bb 43>:
  _503 = wi::ltu_p_large (&MEM[(struct wide_int_ref_storage 
*)&D.50698].scratch, 1, 32, &MEM[(const struct wide_int_storage 
*)&cst].val, len_274, _277);

this keeps D.50698 and cst un-SRAable - inline storage is problematic
for this reason.  But the representation should guarantee the
compare with a low precision (32 bit) constant is evaluatable
at compile-time if len of the larger value is > 1, no?

  <bb 44>:
  # _504 = PHI <_497(42), _503(43)>
  D.50698 ={v} {CLOBBER};
  if (_504 != 0)
    goto <bb 45>;
  else
    goto <bb 46>;

  <bb 45>:
  pretmp_563 = MEM[(const struct wide_int_storage *)&cst + 256B];
  goto <bb 229> (<L131>);

  <bb 46>:
  _65 = generic_wide_int<wide_int_storage>::to_uhwi (&cst, 0);
  ix_66 = (int) _65;
  goto <bb 91>;

The question is whether we should try to optimize wide-int for
such cases or simply not use wi:leu_p (cst, 1) but rather

 if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)

?


> Thanks,
> Richard
> 
>
Richard Sandiford Oct. 17, 2013, 10:58 a.m. UTC | #7
Richard Biener <rguenther@suse.de> writes:
> On Thu, 17 Oct 2013, Richard Sandiford wrote:
>
>> Kenneth Zadeck <zadeck@naturalbridge.com> writes:
>> >> As mentioned in my message yesterday, I thought your new way of canonising
>> >> unsigned tree constants meant that there was always an upper zero bit.
>> >> Is that right?
>> > i believe this is correct.
>> >> If so, xprecision < precision is a no-op, because the number always
>> >> has the right form for wider precisions.  The only difficult case is
>> >> xprecision == precision, since then we need to peel off any upper -1 HWIs.
>> > say my HWI size is 8 bits (just to keep from typing a million 'f's.   if 
>> > i have a 16 bit unsigned number that is all 1s in the tree world it is 3 
>> > hwis
>> > 0x00 0xff 0xff.
>> >
>> > but inside regular wide int, it would take 1 wide int whose value is 0xff.
>> > inside of max it would be the same as the tree, but then the test 
>> > precision < xprecision + hbpwi never kicks in because precision is 
>> > guaranteed to be huge.
>> >
>> > inside of addr_wide_int, i think we tank with the assertion.
>> 
>> It should be OK for addr_wide_int too.  The precision still fits 2 HWIs.
>> The initial length is greater than the maximum length of an addr_wide_int,
>> but your "len = MAX (len, max_len)" deals with that.
>
> It's
>
>   len = MIN (len, max_len)

Oops, yeah, I meant MIN, sorry.

> which looked suspicious to me, but with precision >= xprecision
> precision can only be zero if xprecision is zero which looked to
> me like it cannot happen - or rather it should be fixed.

Despite the comment above the code, I don't think this MIN is there
for the zero-precision case.  I think it's there to handle the new
tree representation.

The new tree representation can have a length greater than max_len
for an unsigned tree constant that occupies a whole number of HWIs.
The tree representation of an unsigned 0x8000 is 0x00 0x80 0x00.
When extended to max_wide_int the representation is the same.
But a 2-HWI addr_wide_int would be 0x80 0x00, without the leading zero.
The MIN trims the length from 3 to 2 in the last case.

> > the case actually comes up on the ppc because they do a lot of 128 bit 
>> > math.    I think i got thru the x86-64 without noticing this.
>> 
>> Well, it'd be suspicious if we're directly using 128-bit numbers
>> in addr_wide_int.  The justification for the assertion was that we
>> should explicitly truncate to addr_wide_int when deliberately
>> ignoring upper bits, beyond bit or byte address width.  128 bits
>> definitely falls into that category on powerpc.
>
> My question is whether with 8-bit HWI 0x00 0xff 0xff is a valid
> wide-int value if it has precision 16.

No, for a 16-bit wide_int it should be 0xff.  0x00 0xff 0xff is correct
for any wide_int wider than 16 bits though.

> AFAIK that is what the code produces,

In which case?  This is:

   precision == 16
   xprecision == 16
   len == 3
   max_len == 2

The MIN trims the len to 2 and then the loop Kenny added trims it
again to 1, so the "0x00 0xff 0xff" becomes "0xff".  The "0x00 0xff"
is still there in the array, but not used.

> but now Kenny says this is only for some kind
> of wide-ints but not all?  That is, why is
>
> inline wi::storage_ref
> wi::int_traits <const_tree>::decompose (HOST_WIDE_INT *scratch,
>                                         unsigned int precision, const_tree 
> x)
> {
>   unsigned int len = TREE_INT_CST_NUNITS (x);
>   const HOST_WIDE_INT *val = (const HOST_WIDE_INT *) &TREE_INT_CST_ELT (x, 
> 0);
>   return wi::storage_ref (val, len, precision);
> }
>
> not a valid implementation together with making sure that the
> INTEGER_CST tree rep has that extra word of zeros if required?

The fundamental problem here is that we're trying to support two cases:

(a) doing N-bit arithmetic in cases where the inputs have N bits
(b) doing N-bit arithmetic in cases where the inputs have fewer than N bits
    and are extended according to TYPE_SIGN.

Let's assume 32-bit HWIs.  The 16-bit (4-hex-digit) constant 0x8000 is
0x8000 regardless of whether the type is signed or unsigned.  But if it's
extended to 32-bits you get two different numbers 0xffff8000 and 0x00008000,
depending on the sign.

So for one value of the "precision" parameter (i.e. xprecision), signed
and unsigned constants produce the same number.  But for another value
of the "precision" parameter (those greater than xprecision), signed and
unsigned constants produce different numbers.  Yet at the moment the tree
constant has a single representation.

So I think the possibilities are:

(1) Use the representation of an N-bit wide_int to store N-bit tree constants.
    Do work when extending them to wider wide_ints.

(2) Use the representation of max_wide_int to store N-bit tree constants.
    Do work when creating an N-bit wide_int.

(3) Store both representations in the tree constant.

(4) Require all tree arithmetic to be done in the same way as rtl arithmetic,
    with explicit extensions.  This gets rid of case (b).

(5) Require all tree arithmetic to be done in wider wide_ints than the inputs,
    which I think is what you preferred.  This gets rid of case (a).

(6) Allow the same wide_int constant to have several different representations.
    Remember that this is to some extent what Kenny's original implementation
    did, since partial HWIs were filled with don't-cares.  You and I are the
    ones who argued that each wide_int should have a single representation.

> [This function shows another optimization issue:
>
>     case BOOLEAN_TYPE:
>       /* Cache false or true.  */
>       limit = 2;
>       if (wi::leu_p (cst, 1))
>         ix = cst.to_uhwi ();
>
> I would have expected cst <= 1 be optimized to cst.len == 1 &&
> cst.val[0] <= 1.  It expands to
>
> <L27>:
>   MEM[(long int *)&D.50698 + 16B] = 1;
>   MEM[(struct wide_int_ref_storage *)&D.50698] = &MEM[(struct 
> wide_int_ref_storage *)&D.50698].scratch;
>   MEM[(struct wide_int_ref_storage *)&D.50698 + 8B] = 1;
>   MEM[(struct wide_int_ref_storage *)&D.50698 + 12B] = 32;
>   _277 = MEM[(const struct wide_int_storage *)&cst + 260B];
>   if (_277 <= 64)
>     goto <bb 42>;
>   else
>     goto <bb 43>;
>
>   <bb 42>:
>   xl_491 = zext_hwi (1, 32);  // ok, checking enabled and thus out-of-line
>   _494 = MEM[(const long int *)&cst];
>   _495 = (long unsigned int) _494;
>   yl_496 = zext_hwi (_495, _277);
>   _497 = xl_491 < yl_496;
>   goto <bb 44>;
>
>   <bb 43>:
>   _503 = wi::ltu_p_large (&MEM[(struct wide_int_ref_storage 
> *)&D.50698].scratch, 1, 32, &MEM[(const struct wide_int_storage 
> *)&cst].val, len_274, _277);
>
> this keeps D.50698 and cst un-SRAable - inline storage is problematic
> for this reason.  But the representation should guarantee the
> compare with a low precision (32 bit) constant is evaluatable
> at compile-time if len of the larger value is > 1, no?
>
>   <bb 44>:
>   # _504 = PHI <_497(42), _503(43)>
>   D.50698 ={v} {CLOBBER};
>   if (_504 != 0)
>     goto <bb 45>;
>   else
>     goto <bb 46>;
>
>   <bb 45>:
>   pretmp_563 = MEM[(const struct wide_int_storage *)&cst + 256B];
>   goto <bb 229> (<L131>);
>
>   <bb 46>:
>   _65 = generic_wide_int<wide_int_storage>::to_uhwi (&cst, 0);
>   ix_66 = (int) _65;
>   goto <bb 91>;
>
> The question is whether we should try to optimize wide-int for
> such cases or simply not use wi:leu_p (cst, 1) but rather
>
>  if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
>
> ?

I think we should do the first, trying to optimise wide_int.

Thanks,
Richard
Kenneth Zadeck Oct. 17, 2013, 11:43 a.m. UTC | #8
On 10/17/2013 04:46 AM, Richard Biener wrote:
> the case actually comes up on the ppc because they do a lot of 128 bit
> math.    I think i got thru the x86-64 without noticing this.
>> Well, it'd be suspicious if we're directly using 128-bit numbers
>> in addr_wide_int.  The justification for the assertion was that we
>> should explicitly truncate to addr_wide_int when deliberately
>> ignoring upper bits, beyond bit or byte address width.  128 bits
>> definitely falls into that category on powerpc.
> My question is whether with 8-bit HWI 0x00 0xff 0xff is a valid
> wide-int value if it has precision 16.  AFAIK that is what the
> code produces, but now Kenny says this is only for some kind
> of wide-ints but not all?  That is, why is

The issue is not that the rules are different between the different 
flavors of wide int, it is that the circumstances are different.  The 
rule is that bits above the precision exist only if the precision is 
not an even multiple of the HBPWI.   In that case, those bits are 
always an extension of the sign bits.
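
Spelled out as a predicate, the rule i am describing is roughly this
(just a sketch of the invariant, not code from the branch):

/* Sketch only: excess bits exist only in a partial top block, and they
   must be a sign extension of the in-precision bits below them.  */
static bool
excess_bits_canonical_p (const HOST_WIDE_INT *val, unsigned int len,
                         unsigned int precision)
{
  unsigned int small_prec = precision % HOST_BITS_PER_WIDE_INT;
  unsigned int blocks = (precision + HOST_BITS_PER_WIDE_INT - 1)
                        / HOST_BITS_PER_WIDE_INT;
  if (small_prec == 0 || len < blocks)
    return true;
  return val[len - 1] == sext_hwi (val[len - 1], small_prec);
}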

max_wide_int and addr_wide_int have large enough precisions so that it 
is impossible to ever generate an unsigned number on the target that is 
large enough to ever run against the precision.    However, for the 
fixed precision case, you can, and at least on the ppc, you do, generate 
unsigned numbers that are big enough to have to be recanonicalized like 
this.

> inline wi::storage_ref
> wi::int_traits <const_tree>::decompose (HOST_WIDE_INT *scratch,
>                                          unsigned int precision, const_tree
> x)
> {
>    unsigned int len = TREE_INT_CST_NUNITS (x);
>    const HOST_WIDE_INT *val = (const HOST_WIDE_INT *) &TREE_INT_CST_ELT (x,
> 0);
>    return wi::storage_ref (val, len, precision);
> }
>
> not a valid implementation together with making sure that the
> INTEGER_CST tree rep has that extra word of zeros if required?
> I would like to see us move in that direction (given that the
> tree rep of INTEGER_CST has transitioned to variable-length already).
>
> Btw, code such as
>
> tree
> wide_int_to_tree (tree type, const wide_int_ref &pcst)
> {
> ...
>    unsigned int small_prec = prec & (HOST_BITS_PER_WIDE_INT - 1);
>    bool recanonize = sgn == UNSIGNED
>      && small_prec
>      && (prec + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT ==
> len;
>
> definitely needs a comment.  I would have thought _all_ unsigned
> numbers need re-canonicalization.  Well, maybe only if we're
> forcing the extra zeros.
It will get a comment.
> [This function shows another optimization issue:
>
>      case BOOLEAN_TYPE:
>        /* Cache false or true.  */
>        limit = 2;
>        if (wi::leu_p (cst, 1))
>          ix = cst.to_uhwi ();
>
> I would have expected cst <= 1 be optimized to cst.len == 1 &&
> cst.val[0] <= 1.  It expands to
>
> <L27>:
>    MEM[(long int *)&D.50698 + 16B] = 1;
>    MEM[(struct wide_int_ref_storage *)&D.50698] = &MEM[(struct
> wide_int_ref_storage *)&D.50698].scratch;
>    MEM[(struct wide_int_ref_storage *)&D.50698 + 8B] = 1;
>    MEM[(struct wide_int_ref_storage *)&D.50698 + 12B] = 32;
>    _277 = MEM[(const struct wide_int_storage *)&cst + 260B];
>    if (_277 <= 64)
>      goto <bb 42>;
>    else
>      goto <bb 43>;
>
>    <bb 42>:
>    xl_491 = zext_hwi (1, 32);  // ok, checking enabled and thus out-of-line
>    _494 = MEM[(const long int *)&cst];
>    _495 = (long unsigned int) _494;
>    yl_496 = zext_hwi (_495, _277);
>    _497 = xl_491 < yl_496;
>    goto <bb 44>;
>
>    <bb 43>:
>    _503 = wi::ltu_p_large (&MEM[(struct wide_int_ref_storage
> *)&D.50698].scratch, 1, 32, &MEM[(const struct wide_int_storage
> *)&cst].val, len_274, _277);
>
> this keeps D.50698 and cst un-SRAable - inline storage is problematic
> for this reason.  But the representation should guarantee the
> compare with a low precision (32 bit) constant is evaluatable
> at compile-time if len of the larger value is > 1, no?
>
>    <bb 44>:
>    # _504 = PHI <_497(42), _503(43)>
>    D.50698 ={v} {CLOBBER};
>    if (_504 != 0)
>      goto <bb 45>;
>    else
>      goto <bb 46>;
>
>    <bb 45>:
>    pretmp_563 = MEM[(const struct wide_int_storage *)&cst + 256B];
>    goto <bb 229> (<L131>);
>
>    <bb 46>:
>    _65 = generic_wide_int<wide_int_storage>::to_uhwi (&cst, 0);
>    ix_66 = (int) _65;
>    goto <bb 91>;
>
> The question is whether we should try to optimize wide-int for
> such cases or simply not use wi:leu_p (cst, 1) but rather
>
>   if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
>
> ?
i find this ugly, but i see where you are coming from.   The problem is 
that both you and i know that the len has to be 1, but the optimizer 
does not.   This is a case where I think that we made a mistake getting 
rid of the wi::one_p, wi::zero_p and wi::minus_one_p.  The internals of 
one_p were return (len == 1 && val[0] ==1) and i think that is much 
nicer than what you put there.      On the other hand, it seems that a 
person more skilled than i am with c++ could specialize the comparisons 
with an integer constant, since i believe that that constant must fit in 
one hwi (I am a little concerned about large unsigned constants).
>
>> Thanks,
>> Richard
>>
>>
Richard Biener Oct. 17, 2013, 11:49 a.m. UTC | #9
On Thu, 17 Oct 2013, Kenneth Zadeck wrote:

> On 10/17/2013 04:46 AM, Richard Biener wrote:
> > the case actually comes up on the ppc because they do a lot of 128 bit
> > math.    I think i got thru the x86-64 without noticing this.
> > > Well, it'd be suspicious if we're directly using 128-bit numbers
> > > in addr_wide_int.  The justification for the assertion was that we
> > > should explicitly truncate to addr_wide_int when deliberately
> > > ignoring upper bits, beyond bit or byte address width.  128 bits
> > > definitely falls into that category on powerpc.
> > My question is whether with 8-bit HWI 0x00 0xff 0xff is a valid
> > wide-int value if it has precision 16.  AFAIK that is what the
> > code produces, but now Kenny says this is only for some kind
> > of wide-ints but not all?  That is, why is
> 
> The issue is not that the rules are different between the different flavors of
> wide int, it is that the circumstances are different. The rule is that the
> only bits above the precision that exist are if the precision is not an even
> multiple of the HBPWI.   In that case, the bits are always an extension of the
> sign bits.
> 
> max_wide_int and addr_wide_int have large enough precisions so that it is
> impossible to ever generate an unsigned number on the target that is large
> enough to ever run against the precision.    However, for the fixed precision
> case, you can, and at least on the ppc, you do, generate unsigned numbers that
> are big enough to have to be recanonicalized like this.
> 
> > inline wi::storage_ref
> > wi::int_traits <const_tree>::decompose (HOST_WIDE_INT *scratch,
> >                                          unsigned int precision, const_tree
> > x)
> > {
> >    unsigned int len = TREE_INT_CST_NUNITS (x);
> >    const HOST_WIDE_INT *val = (const HOST_WIDE_INT *) &TREE_INT_CST_ELT (x,
> > 0);
> >    return wi::storage_ref (val, len, precision);
> > }
> > 
> > not a valid implementation together with making sure that the
> > INTEGER_CST tree rep has that extra word of zeros if required?
> > I would like to see us move in that direction (given that the
> > tree rep of INTEGER_CST has transitioned to variable-length already).
> > 
> > Btw, code such as
> > 
> > tree
> > wide_int_to_tree (tree type, const wide_int_ref &pcst)
> > {
> > ...
> >    unsigned int small_prec = prec & (HOST_BITS_PER_WIDE_INT - 1);
> >    bool recanonize = sgn == UNSIGNED
> >      && small_prec
> >      && (prec + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT ==
> > len;
> > 
> > definitely needs a comment.  I would have thought _all_ unsigned
> > numbers need re-canonicalization.  Well, maybe only if we're
> > forcing the extra zeros.
> It will get a comment.
> > [This function shows another optimization issue:
> > 
> >      case BOOLEAN_TYPE:
> >        /* Cache false or true.  */
> >        limit = 2;
> >        if (wi::leu_p (cst, 1))
> >          ix = cst.to_uhwi ();
> > 
> > I would have expected cst <= 1 be optimized to cst.len == 1 &&
> > cst.val[0] <= 1.  It expands to
> > 
> > <L27>:
> >    MEM[(long int *)&D.50698 + 16B] = 1;
> >    MEM[(struct wide_int_ref_storage *)&D.50698] = &MEM[(struct
> > wide_int_ref_storage *)&D.50698].scratch;
> >    MEM[(struct wide_int_ref_storage *)&D.50698 + 8B] = 1;
> >    MEM[(struct wide_int_ref_storage *)&D.50698 + 12B] = 32;
> >    _277 = MEM[(const struct wide_int_storage *)&cst + 260B];
> >    if (_277 <= 64)
> >      goto <bb 42>;
> >    else
> >      goto <bb 43>;
> > 
> >    <bb 42>:
> >    xl_491 = zext_hwi (1, 32);  // ok, checking enabled and thus out-of-line
> >    _494 = MEM[(const long int *)&cst];
> >    _495 = (long unsigned int) _494;
> >    yl_496 = zext_hwi (_495, _277);
> >    _497 = xl_491 < yl_496;
> >    goto <bb 44>;
> > 
> >    <bb 43>:
> >    _503 = wi::ltu_p_large (&MEM[(struct wide_int_ref_storage
> > *)&D.50698].scratch, 1, 32, &MEM[(const struct wide_int_storage
> > *)&cst].val, len_274, _277);
> > 
> > this keeps D.50698 and cst un-SRAable - inline storage is problematic
> > for this reason.  But the representation should guarantee the
> > compare with a low precision (32 bit) constant is evaluatable
> > at compile-time if len of the larger value is > 1, no?
> > 
> >    <bb 44>:
> >    # _504 = PHI <_497(42), _503(43)>
> >    D.50698 ={v} {CLOBBER};
> >    if (_504 != 0)
> >      goto <bb 45>;
> >    else
> >      goto <bb 46>;
> > 
> >    <bb 45>:
> >    pretmp_563 = MEM[(const struct wide_int_storage *)&cst + 256B];
> >    goto <bb 229> (<L131>);
> > 
> >    <bb 46>:
> >    _65 = generic_wide_int<wide_int_storage>::to_uhwi (&cst, 0);
> >    ix_66 = (int) _65;
> >    goto <bb 91>;
> > 
> > The question is whether we should try to optimize wide-int for
> > such cases or simply not use wi:leu_p (cst, 1) but rather
> > 
> >   if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
> > 
> > ?
> i find this ugly, but i see where you are coming from.   The problem is that
> both you and i know that the len has to be 1, but the optimizer does not.
> This is a case where I think that we made a mistake getting rid of the
> wi::one_p, wi::zero_p and wi::minus_one_p.  The internals of one_p were return
> (len == 1 && val[0] ==1) and i think that is much nicer than what you put
> there.      On the other hand, it seem that a person more skilled than i am
> with c++ could specialize the comparisons with an integer constant, since i
> believe that that constant must fit in one hwi (I am a little concerned about
> large unsigned constants).

one_p and zero_p wouldn't catch a compare with 3 ;)  So yes, it looks
like if we cannot trick the optimizers to do the right thing, like with

inline bool
wi::lts_p (const wide_int_ref &x, const wide_int_ref &y)
{
  if (x.precision <= HOST_BITS_PER_WIDE_INT
      && y.precision <= HOST_BITS_PER_WIDE_INT)
    return x.slow () < y.slow ();
  else if (x.precision <= HOST_BITS_PER_WIDE_INT)   // why not len == 1?
    ... optimize
  else if (y.precision <= HOST_BITS_PER_WIDE_INT)
    ... optimize
  else
    return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
                        y.precision);
}

which of course pessimizes the case where the precisions are not
known at compile-time.  So maybe the above should always be

  if (__builtin_constant_p (x.precision)
      && x.precision <= HOST_BITS_PER_WIDE_INT
     ....

and lts_p_large should handle the small precision case efficiently
(but out-of-line?)

Richard.
Kenneth Zadeck Oct. 17, 2013, 12:32 p.m. UTC | #10
On 10/17/2013 07:49 AM, Richard Biener wrote:
> On Thu, 17 Oct 2013, Kenneth Zadeck wrote:
>
>> On 10/17/2013 04:46 AM, Richard Biener wrote:
>>> the case actually comes up on the ppc because they do a lot of 128 bit
>>> math.    I think i got thru the x86-64 without noticing this.
>>>> Well, it'd be suspicious if we're directly using 128-bit numbers
>>>> in addr_wide_int.  The justification for the assertion was that we
>>>> should explicitly truncate to addr_wide_int when deliberately
>>>> ignoring upper bits, beyond bit or byte address width.  128 bits
>>>> definitely falls into that category on powerpc.
>>> My question is whether with 8-bit HWI 0x00 0xff 0xff is a valid
>>> wide-int value if it has precision 16.  AFAIK that is what the
>>> code produces, but now Kenny says this is only for some kind
>>> of wide-ints but not all?  That is, why is
>> The issue is not that the rules are different between the different flavors of
>> wide int, it is that the circumstances are different. The rule is that the
>> only bits above the precision that exist are if the precision is not an even
>> multiple of the HBPWI.   In that case, the bits are always an extension of the
>> sign bits.
>>
>> max_wide_int and addr_wide_int have large enough precisions so that it is
>> impossible to ever generate an unsigned number on the target that is large
>> enough to ever run against the precision.    However, for the fixed precision
>> case, you can, and at least on the ppc, you do, generate unsigned numbers that
>> are big enough to have to be recanonicalized like this.
>>
>>> inline wi::storage_ref
>>> wi::int_traits <const_tree>::decompose (HOST_WIDE_INT *scratch,
>>>                                           unsigned int precision, const_tree
>>> x)
>>> {
>>>     unsigned int len = TREE_INT_CST_NUNITS (x);
>>>     const HOST_WIDE_INT *val = (const HOST_WIDE_INT *) &TREE_INT_CST_ELT (x,
>>> 0);
>>>     return wi::storage_ref (val, len, precision);
>>> }
>>>
>>> not a valid implementation together with making sure that the
>>> INTEGER_CST tree rep has that extra word of zeros if required?
>>> I would like to see us move in that direction (given that the
>>> tree rep of INTEGER_CST has transitioned to variable-length already).
>>>
>>> Btw, code such as
>>>
>>> tree
>>> wide_int_to_tree (tree type, const wide_int_ref &pcst)
>>> {
>>> ...
>>>     unsigned int small_prec = prec & (HOST_BITS_PER_WIDE_INT - 1);
>>>     bool recanonize = sgn == UNSIGNED
>>>       && small_prec
>>>       && (prec + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT ==
>>> len;
>>>
>>> definitely needs a comment.  I would have thought _all_ unsigned
>>> numbers need re-canonicalization.  Well, maybe only if we're
>>> forcing the extra zeros.
>> It will get a comment.
>>> [This function shows another optimization issue:
>>>
>>>       case BOOLEAN_TYPE:
>>>         /* Cache false or true.  */
>>>         limit = 2;
>>>         if (wi::leu_p (cst, 1))
>>>           ix = cst.to_uhwi ();
>>>
>>> I would have expected cst <= 1 be optimized to cst.len == 1 &&
>>> cst.val[0] <= 1.  It expands to
>>>
>>> <L27>:
>>>     MEM[(long int *)&D.50698 + 16B] = 1;
>>>     MEM[(struct wide_int_ref_storage *)&D.50698] = &MEM[(struct
>>> wide_int_ref_storage *)&D.50698].scratch;
>>>     MEM[(struct wide_int_ref_storage *)&D.50698 + 8B] = 1;
>>>     MEM[(struct wide_int_ref_storage *)&D.50698 + 12B] = 32;
>>>     _277 = MEM[(const struct wide_int_storage *)&cst + 260B];
>>>     if (_277 <= 64)
>>>       goto <bb 42>;
>>>     else
>>>       goto <bb 43>;
>>>
>>>     <bb 42>:
>>>     xl_491 = zext_hwi (1, 32);  // ok, checking enabled and thus out-of-line
>>>     _494 = MEM[(const long int *)&cst];
>>>     _495 = (long unsigned int) _494;
>>>     yl_496 = zext_hwi (_495, _277);
>>>     _497 = xl_491 < yl_496;
>>>     goto <bb 44>;
>>>
>>>     <bb 43>:
>>>     _503 = wi::ltu_p_large (&MEM[(struct wide_int_ref_storage
>>> *)&D.50698].scratch, 1, 32, &MEM[(const struct wide_int_storage
>>> *)&cst].val, len_274, _277);
>>>
>>> this keeps D.50698 and cst un-SRAable - inline storage is problematic
>>> for this reason.  But the representation should guarantee the
>>> compare with a low precision (32 bit) constant is evaluatable
>>> at compile-time if len of the larger value is > 1, no?
>>>
>>>     <bb 44>:
>>>     # _504 = PHI <_497(42), _503(43)>
>>>     D.50698 ={v} {CLOBBER};
>>>     if (_504 != 0)
>>>       goto <bb 45>;
>>>     else
>>>       goto <bb 46>;
>>>
>>>     <bb 45>:
>>>     pretmp_563 = MEM[(const struct wide_int_storage *)&cst + 256B];
>>>     goto <bb 229> (<L131>);
>>>
>>>     <bb 46>:
>>>     _65 = generic_wide_int<wide_int_storage>::to_uhwi (&cst, 0);
>>>     ix_66 = (int) _65;
>>>     goto <bb 91>;
>>>
>>> The question is whether we should try to optimize wide-int for
>>> such cases or simply not use wi:leu_p (cst, 1) but rather
>>>
>>>    if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
>>>
>>> ?
>> i find this ugly, but i see where you are coming from.   The problem is that
>> both you and i know that the len has to be 1, but the optimizer does not.
>> This is a case where I think that we made a mistake getting rid of the
>> wi::one_p, wi::zero_p and wi::minus_one_p.  The internals of one_p were return
>> (len == 1 && val[0] ==1) and i think that is much nicer than what you put
>> there.      On the other hand, it seem that a person more skilled than i am
>> with c++ could specialize the comparisons with an integer constant, since i
>> believe that that constant must fit in one hwi (I am a little concerned about
>> large unsigned constants).
> one_p and zero_p wouldn't catch a compare with 3 ;)  So yes, it looks
> like if we cannot trick the optimizers to do the right thing, like with
>
> inline bool
> wi::lts_p (const wide_int_ref &x, const wide_int_ref &y)
> {
>    if (x.precision <= HOST_BITS_PER_WIDE_INT
>        && y.precision <= HOST_BITS_PER_WIDE_INT)
>      return x.slow () < y.slow ();
>    else if (x.precision <= HOST_BITS_PER_WIDE_INT)   // why not len == 1?
>      ... optimize
>    else if (y.precision <= HOST_BITS_PER_WIDE_INT)
>      ... optimize
>    else
>      return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
>                          y.precision);
> }
>
> which of course pessimizes the case where the precisions are not
> known at compile-time.  So maybe the above should always be
>
>    if (__builtin_constant_p (x.precision)
>        && x.precision <= HOST_BITS_PER_WIDE_INT
>       ....
>
> and lts_p_large should handle the small precision case efficiently
> (but out-of-line?)
>
> Richard.
richi

what i was thinking about was to overload lts and its friends so that 
the case of the second arg being an int or a long is handled separately.   
in the cases of any int and signed longs, you know that the value can 
always be checked with a len check and a single compare.    unsigned 
longs can cause trouble and so you may not want to overload that.

The part that i do not know how to do is to write the signature of the 
function.    if i just make the sig of the second arg as HOST_WIDE_INT, 
am i safe to get int, unsigned int and long but not unsigned long?
Mike Stump Oct. 18, 2013, 12:07 a.m. UTC | #11
On Oct 17, 2013, at 1:46 AM, Richard Biener <rguenther@suse.de> wrote:
> [This function shows another optimization issue:
> 
>    case BOOLEAN_TYPE:
>      /* Cache false or true.  */
>      limit = 2;
>      if (wi::leu_p (cst, 1))
>        ix = cst.to_uhwi ();
> 
> I would have expected cst <= 1 be optimized to cst.len == 1 &&
> cst.val[0] <= 1.

So lts_p has the same problem, and is used just below.  If we do:

@@ -1461,12 +1470,22 @@ wi::ne_p (const T1 &x, const T2 &y)
 inline bool
 wi::lts_p (const wide_int_ref &x, const wide_int_ref &y)
 {
-  if (x.precision <= HOST_BITS_PER_WIDE_INT
-      && y.precision <= HOST_BITS_PER_WIDE_INT)
-    return x.slow () < y.slow ();
-  else
-    return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
-                       y.precision);
+  // We optimize w < N, where N is 64 or fewer bits.
+  if (y.precision <= HOST_BITS_PER_WIDE_INT)
+    {
+      // If it fits directly into a shwi, we can compare directly.
+      if (wi::fits_shwi_p (x))
+       return x.slow () < y.slow ();
+      // If it doesn't fit, then it must be more negative than any
+      // value in y, and hence smaller.
+      if (neg_p (x, SIGNED))
+       return true;
+      // If it is positive, then it must be larger than any value in y,
+      // and hence greater than.
+      return false;
+    }
+  return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
+                     y.precision);
 }

then for:

bool MGEN(wide_int cst) {
  return wi::lts_p (cst, 100);
}

we generate:

        movl    520(%rsp), %eax
        cmpl    $1, %eax
        je      .L5283
        subl    $1, %eax
        movq    8(%rsp,%rax,8), %rax
        shrq    $63, %rax
        ret
        .p2align 4,,10
        .p2align 3
.L5283:
        cmpq    $99, 8(%rsp)
        setle   %al
        ret

which, as you can see, never calls lts_p_large; hence, if that was the only reason the storage escapes, it should be able to optimize better.  The above code is minimal: no function calls, and the 100 is propagated all the way down into the cmpq.  Now, the reason we should do it this way is that, most of the time, types are 64 bits or less, and when that is true it will never call into lts_p_large.

If the above 100 is changed to l, from a parameter long l, then the code is the same except the last part is:

        cmpq    8(%rsp), %rdi
        setg    %al
        ret

If we have two wide_int's, then the code does this:

        .cfi_startproc
        subq    $8, %rsp
	.cfi_def_cfa_offset 16
        movl    1052(%rsp), %r9d
        movl    1048(%rsp), %r8d
        movl    532(%rsp), %edx
        movl    528(%rsp), %esi
        cmpl    $64, %r9d
        ja      .L5281
        cmpl    $1, %esi
	je      .L5284
        subl    $1, %esi
        movq    16(%rsp,%rsi,8), %rax
        addq    $8, %rsp
        .cfi_remember_state
        .cfi_def_cfa_offset 8
        shrq    $63, %rax
        ret
        .p2align 4,,10
        .p2align 3
.L5281:
        .cfi_restore_state
        leaq    536(%rsp), %rcx
        leaq    16(%rsp), %rdi
        call    _ZN2wi11lts_p_largeEPKljjS1_jj
	addq    $8, %rsp
        .cfi_remember_state
        .cfi_def_cfa_offset 8
        ret
        .p2align 4,,10
        .p2align 3
.L5284:
        .cfi_restore_state
        movq    536(%rsp), %rax
        cmpq    %rax, 16(%rsp)
        setl    %al
        addq    $8, %rsp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc

which is still pretty nice.

Does this look ok?  Kenny, can you double check the cases, think I have them right, but…  a double check would be good.

> the representation should guarantee the
> compare with a low precision (32 bit) constant is evaluatable
> at compile-time if len of the larger value is > 1, no?

I agree, though, I didn't check the symmetric case, as constants and smaller values can be put on the right by convention.

> The question is whether we should try to optimize wide-int for
> such cases or simply not use wi:leu_p (cst, 1) but rather
> 
> if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
> 
> ?

Since we can do an excellent job with the simple interface, I'd argue for the simple interface and the simple change to do the conditional better.  I'd rather up the ante on meta programming to get any performance we want, retaining the simple interfaces.
Richard Biener Oct. 18, 2013, 8:45 a.m. UTC | #12
On Thu, 17 Oct 2013, Mike Stump wrote:

> On Oct 17, 2013, at 1:46 AM, Richard Biener <rguenther@suse.de> wrote:
> > [This function shows another optimization issue:
> > 
> >    case BOOLEAN_TYPE:
> >      /* Cache false or true.  */
> >      limit = 2;
> >      if (wi::leu_p (cst, 1))
> >        ix = cst.to_uhwi ();
> > 
> > I would have expected cst <= 1 be optimized to cst.len == 1 &&
> > cst.val[0] <= 1.
> 
> So lts_p has the same problem, and is used just below.  If we do:
> 
> @@ -1461,12 +1470,22 @@ wi::ne_p (const T1 &x, const T2 &y)
>  inline bool
>  wi::lts_p (const wide_int_ref &x, const wide_int_ref &y)
>  {
> -  if (x.precision <= HOST_BITS_PER_WIDE_INT
> -      && y.precision <= HOST_BITS_PER_WIDE_INT)
> -    return x.slow () < y.slow ();
> -  else
> -    return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
> -                       y.precision);
> +  // We optimize w < N, where N is 64 or fewer bits.
> +  if (y.precision <= HOST_BITS_PER_WIDE_INT)
> +    {
> +      // If it fits directly into a shwi, we can compare directly.
> +      if (wi::fits_shwi_p (x))
> +       return x.slow () < y.slow ();
> +      // If it is doesn't fit, then it must be more negative than any
> +      // value in y, and hence smaller.
> +      if (neg_p (x, SIGNED))
> +       return true;
> +      // If it is positve, then it must be larger than any value in y,
> +      // and hence greater than.
> +      return false;
> +    }
> +  return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
> +                     y.precision);
>  }
> 
> then for:
> 
> bool MGEN(wide_int cst) {
>   return wi::lts_p (cst, 100);
> }
> 
> we generate:
> 
>         movl    520(%rsp), %eax
>         cmpl    $1, %eax
>         je      .L5283
>         subl    $1, %eax
>         movq    8(%rsp,%rax,8), %rax
>         shrq    $63, %rax
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L5283:
>         cmpq    $99, 8(%rsp)
>         setle   %al
>         ret
> 
> which as you can see, never calls lts_p_large, and hence, if that was the only reason the storage escapes, then it should be able to optimize better.  In the above code, minimal, no function calls, and the 100 is propagated all the way down into the cmpq.  Now, the reason why we should do it this way, most of the time, most types are 64 bits or less.  When that is true, then it will never call into lts_p_large.
> 
> If the above 100 is changed to l, from a parameter long l, then the code is the same except the last part is:
> 
>         cmpq    8(%rsp), %rdi
>         setg    %al
>         ret
> 
> If we have two wide_int's, then the code does this:
> 
>         .cfi_startproc
>         subq    $8, %rsp
> 	.cfi_def_cfa_offset 16
>         movl    1052(%rsp), %r9d
>         movl    1048(%rsp), %r8d
>         movl    532(%rsp), %edx
>         movl    528(%rsp), %esi
>         cmpl    $64, %r9d
>         ja      .L5281
>         cmpl    $1, %esi
> 	je      .L5284
>         subl    $1, %esi
>         movq    16(%rsp,%rsi,8), %rax
>         addq    $8, %rsp
>         .cfi_remember_state
>         .cfi_def_cfa_offset 8
>         shrq    $63, %rax
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L5281:
>         .cfi_restore_state
>         leaq    536(%rsp), %rcx
>         leaq    16(%rsp), %rdi
>         call    _ZN2wi11lts_p_largeEPKljjS1_jj
> 	addq    $8, %rsp
>         .cfi_remember_state
>         .cfi_def_cfa_offset 8
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L5284:
>         .cfi_restore_state
>         movq    536(%rsp), %rax
>         cmpq    %rax, 16(%rsp)
>         setl    %al
>         addq    $8, %rsp
>         .cfi_def_cfa_offset 8
>         ret
>         .cfi_endproc
> 
> which is still pretty nice.
> 
> Does this look ok?  Kenny, can you double check the cases, think I have them right, but…  a double check would be good.

That works for me.

> > the representation should guarantee the
> > compare with a low precision (32 bit) constant is evaluatable
> > at compile-time if len of the larger value is > 1, no?
> 
> I agree, though I didn't check the symmetric case, as constants and 
> smaller values can be put on the right by convention.
>
> > The question is whether we should try to optimize wide-int for
> > such cases or simply not use wi::leu_p (cst, 1) but rather
> > 
> > if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
> > 
> > ?
> 
> Since we can do an excellent job with the simple interface, I'd argue 
> for the simple interface and the simple change to do the conditional 
> better.  I'd rather up the ante on meta programming to get any 
> performance we want, retaining the simple interfaces.

Agreed.  Note that I'd make heavy use of __builtin_constant_p
in the inline implementation: if we cannot optimize the choice
at compile time we'll both slow things down and bloat the code
to the extent that the inliner will no longer consider
it for inlining.  That is,

> +  // We optimize w < N, where N is 64 or fewer bits.
> +  if (y.precision <= HOST_BITS_PER_WIDE_INT)

should be something like

   if (CONSTANT (y.precision <= HOST_BITS_PER_WIDE_INT))
    ...

and evaluate to false if it doesn't evaluate to compile-time
constant true.  The ultimate 'else' case then of course has
to work for all cases.

If I got it right then sth like

#define CONSTANT(x) (__builtin_constant_p (x) ? (x) : 0)

should do the job (eventually with some stmt expression
magic to evaluate x only once).
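
A sketch of that statement-expression form might look like the following
(the CONSTANT_ONCE name is made up here, and the fold still relies on the
optimizers propagating the temporary):

/* Evaluate X exactly once; yield X when it is a compile-time constant,
   and 0 (false) otherwise.  */
#define CONSTANT_ONCE(X) \
  ({ __typeof__ (X) cst_once_ = (X); \
     __builtin_constant_p (cst_once_) ? cst_once_ : 0; })

so the guard above would read

  if (CONSTANT_ONCE (y.precision <= HOST_BITS_PER_WIDE_INT))
    ...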

That said, the various tree predicates in tree.c should
ultimately just work on the tree rep - they are basic
enough that they don't need to go through wide-ints.  So,
instead of

int
integer_zerop (const_tree expr)
{
  STRIP_NOPS (expr);

  switch (TREE_CODE (expr))
    {
    case INTEGER_CST:
      return wi::eq_p (expr, 0);

I'd put there

    case INTEGER_CST:
      return TREE_INT_CST_NUNITS (expr) == 1
             && TREE_INT_CST_ELT (expr, 0) == 0;

I just used the existing code as convenient simplest users
of wide-ints to inspect code generation (originally I looked
at tree-dfa.c:get_ref_base_and_extent which is timing critical
but somewhat convoluted).

So before the merge we should adjust the tree.c changes on the
branch appropriately.

Richard.
Kenneth Zadeck Oct. 18, 2013, 1:11 p.m. UTC | #13
On 10/18/2013 04:45 AM, Richard Biener wrote:
> On Thu, 17 Oct 2013, Mike Stump wrote:
>
>> On Oct 17, 2013, at 1:46 AM, Richard Biener <rguenther@suse.de> wrote:
>>> [This function shows another optimization issue:
>>>
>>>     case BOOLEAN_TYPE:
>>>       /* Cache false or true.  */
>>>       limit = 2;
>>>       if (wi::leu_p (cst, 1))
>>>         ix = cst.to_uhwi ();
>>>
>>> I would have expected cst <= 1 be optimized to cst.len == 1 &&
>>> cst.val[0] <= 1.
>> So lts_p has the same problem, and is used just below.  If we do:
>>
>> @@ -1461,12 +1470,22 @@ wi::ne_p (const T1 &x, const T2 &y)
>>   inline bool
>>   wi::lts_p (const wide_int_ref &x, const wide_int_ref &y)
>>   {
>> -  if (x.precision <= HOST_BITS_PER_WIDE_INT
>> -      && y.precision <= HOST_BITS_PER_WIDE_INT)
>> -    return x.slow () < y.slow ();
>> -  else
>> -    return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
>> -                       y.precision);
>> +  // We optimize w < N, where N is 64 or fewer bits.
>> +  if (y.precision <= HOST_BITS_PER_WIDE_INT)
>> +    {
>> +      // If it fits directly into a shwi, we can compare directly.
>> +      if (wi::fits_shwi_p (x))
>> +       return x.slow () < y.slow ();
>> +      // If it doesn't fit, then it must be more negative than any
>> +      // value in y, and hence smaller.
>> +      if (neg_p (x, SIGNED))
>> +       return true;
>> +      // If it is positive, then it must be larger than any value in y,
>> +      // and hence greater.
>> +      return false;
>> +    }
>> +  return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
>> +                     y.precision);
>>   }
>>
>> then for:
>>
>> bool MGEN(wide_int cst) {
>>    return wi::lts_p (cst, 100);
>> }
>>
>> we generate:
>>
>>          movl    520(%rsp), %eax
>>          cmpl    $1, %eax
>>          je      .L5283
>>          subl    $1, %eax
>>          movq    8(%rsp,%rax,8), %rax
>>          shrq    $63, %rax
>>          ret
>>          .p2align 4,,10
>>          .p2align 3
>> .L5283:
>>          cmpq    $99, 8(%rsp)
>>          setle   %al
>>          ret
>>
>> which, as you can see, never calls lts_p_large, and hence, if that was the only reason the storage escapes, it should be able to optimize better.  The code above is minimal, with no function calls, and the 100 is propagated all the way down into the cmpq.  The reason we should do it this way is that, most of the time, most types are 64 bits or less; when that is true, lts_p_large is never called.
>>
>> If the above 100 is changed to l, from a parameter long l, then the code is the same except the last part is:
>>
>>          cmpq    8(%rsp), %rdi
>>          setg    %al
>>          ret
>>
>> If we have two wide_int's, then the code does this:
>>
>>          .cfi_startproc
>>          subq    $8, %rsp
>> 	.cfi_def_cfa_offset 16
>>          movl    1052(%rsp), %r9d
>>          movl    1048(%rsp), %r8d
>>          movl    532(%rsp), %edx
>>          movl    528(%rsp), %esi
>>          cmpl    $64, %r9d
>>          ja      .L5281
>>          cmpl    $1, %esi
>> 	je      .L5284
>>          subl    $1, %esi
>>          movq    16(%rsp,%rsi,8), %rax
>>          addq    $8, %rsp
>>          .cfi_remember_state
>>          .cfi_def_cfa_offset 8
>>          shrq    $63, %rax
>>          ret
>>          .p2align 4,,10
>>          .p2align 3
>> .L5281:
>>          .cfi_restore_state
>>          leaq    536(%rsp), %rcx
>>          leaq    16(%rsp), %rdi
>>          call    _ZN2wi11lts_p_largeEPKljjS1_jj
>> 	addq    $8, %rsp
>>          .cfi_remember_state
>>          .cfi_def_cfa_offset 8
>>          ret
>>          .p2align 4,,10
>>          .p2align 3
>> .L5284:
>>          .cfi_restore_state
>>          movq    536(%rsp), %rax
>>          cmpq    %rax, 16(%rsp)
>>          setl    %al
>>          addq    $8, %rsp
>>          .cfi_def_cfa_offset 8
>>          ret
>>          .cfi_endproc
>>
>> which is still pretty nice.
>>
>> Does this look ok?  Kenny, can you double-check the cases?  I think I have them right, but a second look would be good.
> That works for me.
>
I talked to Mike about this last night, but did not follow up with an 
email until now.  The problem is that this code is wrong!  He is 
working to fix that, so I would expect something from him later 
(tomorrow for those in Europe).  The problem is if the variable that 
comes in is an unsigned HWI with the top bit set.  In that case, 
the wide-int canonicalized version has two HWIs, the top one 
being a 0 block, so you cannot take the fast-path test on 
that value even though its precision fits.
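
To illustrate the problem case (the exact value is just an example, not
code from the patch):

  /* An unsigned HWI with its top bit set ...  */
  unsigned HOST_WIDE_INT v
    = (unsigned HOST_WIDE_INT) 1 << (HOST_BITS_PER_WIDE_INT - 1);

  /* ... is canonicalized with an extra zero block on top:
       len    == 2
       val[0] == (HOST_WIDE_INT) v   -- negative when read as a single HWI
       val[1] == 0                   -- zero block keeps the value positive
     so the fast path cannot assume len == 1, or compare val[0] directly,
     just because the precision fits in one HWI.  */
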
>>> the representation should guarantee the
>>> compare with a low precision (32 bit) constant is evaluatable
>>> at compile-time if len of the larger value is > 1, no?
>> I agree, though I didn't check the symmetric case, as constants and
>> smaller values can be put on the right by convention.
>>
>>> The question is whether we should try to optimize wide-int for
>>> such cases or simply not use wi::leu_p (cst, 1) but rather
>>>
>>> if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
>>>
>>> ?
>> Since we can do an excellent job with the simple interface, I'd argue
>> for the simple interface and the simple change to do the conditional
>> better.  I'd rather up the ante on meta programming to get any
>> performance we want, retaining the simple interfaces.
> Agreed.  Note that I'd make heavy use of __builtin_constant_p
> in the inline implementation: if we cannot optimize the choice
> at compile time we'll both slow things down and bloat the code
> to the extent that the inliner will no longer consider
> it for inlining.  That is,
>
>> +  // We optimize w < N, where N is 64 or fewer bits.
>> +  if (y.precision <= HOST_BITS_PER_WIDE_INT)
> should be something like
>
>     if (CONSTANT (y.precision <= HOST_BITS_PER_WIDE_INT))
>      ...
>
> and evaluate to false if it doesn't evaluate to compile-time
> constant true.  The ultimate 'else' case then of course has
> to work for all cases.
>
> If I got it right then sth like
>
> #define CONSTANT(x) (__builtin_constant_p (x) ? (x) : 0)
>
> should do the job (eventually with some stmt expression
> magic to evaluate x only once).
>
> That said, the various tree predicates in tree.c should
> ultimately just work on the tree rep - they are basic
> enough that they don't need to go through wide-ints.  So,
> instead of
>
> int
> integer_zerop (const_tree expr)
> {
>    STRIP_NOPS (expr);
>
>    switch (TREE_CODE (expr))
>      {
>      case INTEGER_CST:
>        return wi::eq_p (expr, 0);
>
> I'd put there
>
>      case INTEGER_CST:
>        return TREE_INT_CST_NUNITS (expr) == 1
>               && TREE_INT_CST_ELT (expr, 0) == 0;
>
> I just used the existing code as convenient simplest users
> of wide-ints to inspect code generation (originally I looked
> at tree-dfa.c:get_ref_base_and_extent which is timing critical
> but somewhat convoluted).
>
> So before the merge we should adjust the tree.c changes on the
> branch appropriately.
>
> Richard.
diff mbox

Patch

Index: tree.h
===================================================================
--- tree.h      (revision 203690)
+++ tree.h      (working copy)
@@ -5184,14 +5184,11 @@  wi::int_traits <const_tree>::decompose (
                          / HOST_BITS_PER_WIDE_INT);
   unsigned int xprecision = get_precision (x);
 
-  gcc_assert (precision >= xprecision);
+  gcc_checking_assert (precision >= xprecision && precision != 0);
 
-  /* Got to be careful of precision 0 values.  */
-  if (precision)
-    len = MIN (len, max_len);
   if (TYPE_SIGN (TREE_TYPE (x)) == UNSIGNED)
     {
-      unsigned int small_prec = precision & (HOST_BITS_PER_WIDE_INT - 1);
+      unsigned int small_prec = xprecision & (HOST_BITS_PER_WIDE_INT - 1);
       if (small_prec)
        {