Message ID: alpine.LNX.2.00.1310161537220.11149@zhemvz.fhfr.qr
On 10/16/2013 09:55 AM, Richard Biener wrote:
> Speaking in terms of a patch:
>
> Index: tree.h
> ===================================================================
> --- tree.h	(revision 203690)
> +++ tree.h	(working copy)
> @@ -5184,14 +5184,11 @@ wi::int_traits <const_tree>::decompose (
> 			 / HOST_BITS_PER_WIDE_INT);
>    unsigned int xprecision = get_precision (x);
>
> -  gcc_assert (precision >= xprecision);
> +  gcc_checking_assert (precision >= xprecision && precision != 0);
>
> -  /* Got to be careful of precision 0 values.  */
> -  if (precision)
> -    len = MIN (len, max_len);
>    if (TYPE_SIGN (TREE_TYPE (x)) == UNSIGNED)
>      {
> -      unsigned int small_prec = precision & (HOST_BITS_PER_WIDE_INT - 1);
> +      unsigned int small_prec = xprecision & (HOST_BITS_PER_WIDE_INT - 1);
>        if (small_prec)
> 	 {
> 	   /* We have to futz with this because the canonization for
> @@ -5201,7 +5198,7 @@ wi::int_traits <const_tree>::decompose (
> 	     {
> 	       for (unsigned int i = 0; i < len - 1; i++)
> 		 scratch[i] = val[i];
> -	       scratch[len - 1] = sext_hwi (val[len - 1], precision);
> +	       scratch[len - 1] = sext_hwi (val[len - 1], small_prec);
> 	       return wi::storage_ref (scratch, len, precision);
> 	     }
> 	 }
>
> the precision 0 thing is gone as far as I understand?

There is still the case where the front ends create types with precision 0, and that causes a few odd cases, but the precision 0 that you refer to is gone.

> Also sign-extending the upper hwi must be done with precision %
> HOST_BITS_PER_WIDE_INT, no?  And it assumes that we only ever
> have to extend the upper most HWI, thus 'len' is never too big.

yes

> I note that we immediately return in the above case, so
>
>   if (precision < xprecision + HOST_BITS_PER_WIDE_INT)
>     {
>       len = wi::force_to_size (scratch, val, len, xprecision,
>                                precision, UNSIGNED);
>       return wi::storage_ref (scratch, len, precision);
>     }
>
> applies only when the desired precision is a multiple of a HWI.

yes

> I assume it adds that extra zero word in case the MSB is set,
> right?
yes

> I am confused about the condition written here and
> how we look at precision and not xprecision when deciding how
> to extend - given a 8-bit precision unsigned 0xff and precision == 64
> we do not even consider sign-extending the value because we look
> at precision and not xprecision.  Don't we need to look at
> xprecision here?

I do not think so.  There are three cases here:

1) xprecision < precision:
say xprecision = 8 and precision = 32.

The number is between 0 and 255 and after canonicalization the number will be between 0 and 255.

2) xprecision == precision
not much to talk about here.

3) xprecision > precision

We need to lose the upper bits of the input and it does not matter what they were.  The only thing that we are concerned about is that the bits above what is now the sign bit pointed to by precision are matched in the bits above precision.

I do think we could change the "if (precision < xprecision + HWI)" to become "else if (...)", which will save a test.

> After all an assert precision == xprecision
> does not work in this routine.
>
> Quickly execute.exp tested only.
>
> To avoid code quality wreckage we have to implement a different way
> of allocating the 'scratch' storage of wide_int_ref_storage
> (well, ideally we wouldn't need it ...).  I thought of simply
> allocating it via alloca (), but we need to do that in all
> callers that build a wide_int_ref (eventually via a hidden
> default arg).  But as that's a template specialization of
> generic_wide_int ... so the option is to provide a function
> wrapper inside wi:: for this, like

I want richard and mike to be the people who respond to the next point.  I am not a c++ person and all of this storage manager layer stuff gives me a headache.

> template <typename T>
> wide_int_ref
> wi::ref (const T &t, HOST_WIDE_INT *scratch
>          = XALLOCAVEC (HOST_WIDE_INT, WIDE_INT_MAX_ELTS))
> {
>   return wide_int_ref (t, scratch);
> }
>
> and amend the storage constructors to get a scratch argument.
>
> The reason for all this is that otherwise the wide_int_ref object
> escapes when we pass the scratch storage to the workers in
> int_traits <const_tree>.
>
> Thanks,
> Richard.
On Oct 16, 2013, at 8:47 AM, Kenneth Zadeck <zadeck@naturalbridge.com> wrote:
>> To avoid code quality wreckage we have to implement a different way
>> of allocating the 'scratch' storage of wide_int_ref_storage
> I want richard and mike to be the people who respond to the next point.
> I am not a c++ person and all of this storage manager layer stuff gives
> me a headache.

I don't have a problem with that direction if it improves code-gen...
Kenneth Zadeck <zadeck@naturalbridge.com> writes:
>> I note that we immediately return in the above case, so
>>
>>   if (precision < xprecision + HOST_BITS_PER_WIDE_INT)
>>     {
>>       len = wi::force_to_size (scratch, val, len, xprecision,
>>                                precision, UNSIGNED);
>>       return wi::storage_ref (scratch, len, precision);
>>     }
>>
>> applies only when the desired precision is a multiple of a HWI.
> yes
>> I assume it adds that extra zero word in case the MSB is set,
>> right?
> yes
>> I am confused about the condition written here and
>> how we look at precision and not xprecision when deciding how
>> to extend - given a 8-bit precision unsigned 0xff and precision == 64
>> we do not even consider sign-extending the value because we look
>> at precision and not xprecision.  Don't we need to look at
>> xprecision here?
> I do not think so.  There are three cases here:
>
> 1) xprecision < precision:
> say xprecision = 8 and precision = 32.
>
> The number is between 0 and 255 and after canonicalization the number
> will be between 0 and 255.
>
> 2) xprecision == precision
> not much to talk about here.
>
> 3) xprecision > precision
>
> We need to lose the upper bits of the input and it does not matter what
> they were.  The only thing that we are concerned about is that the
> bits above what is now the sign bit pointed to by precision are matched
> in the bits above precision.

(3) is gone with the assert though.

As mentioned in my message yesterday, I thought your new way of canonising unsigned tree constants meant that there was always an upper zero bit.  Is that right?

If so, xprecision < precision is a no-op, because the number always has the right form for wider precisions.  The only difficult case is xprecision == precision, since then we need to peel off any upper -1 HWIs.

If that's right and xprecision == precision can use val with an adjusted len, then...

>> After all an assert precision == xprecision
>> does not work in this routine.
>>
>> Quickly execute.exp tested only.
>>
>> To avoid code quality wreckage we have to implement a different way
>> of allocating the 'scratch' storage of wide_int_ref_storage
>> (well, ideally we wouldn't need it ...).  I thought of simply
>> allocating it via alloca (), but we need to do that in all
>> callers that build a wide_int_ref (eventually via a hidden
>> default arg).  But as that's a template specialization of
>> generic_wide_int ... so the option is to provide a function
>> wrapper inside wi:: for this, like
> I want richard and mike to be the people who respond to the next
> point.  I am not a c++ person and all of this storage manager layer
> stuff gives me a headache.

...the use of the scratch array goes away completely for addr_wide_int and max_wide_int.  If this code is on a hot path for wide_int then maybe we should simply require that wide_int operations involving trees have explicit extensions, like in rtl.  (I.e. only do the implicit extension for addr_wide_int and max_wide_int.)

I'm not opposed to going back to a separate scratch array if all else fails, but it's really ugly, so I'd like to look at the alternatives first...

Thanks,
Richard
On 10/16/2013 02:45 PM, Richard Sandiford wrote:
> Kenneth Zadeck <zadeck@naturalbridge.com> writes:
>>> I note that we immediately return in the above case, so
>>>
>>>   if (precision < xprecision + HOST_BITS_PER_WIDE_INT)
>>>     {
>>>       len = wi::force_to_size (scratch, val, len, xprecision,
>>>                                precision, UNSIGNED);
>>>       return wi::storage_ref (scratch, len, precision);
>>>     }
>>>
>>> applies only when the desired precision is a multiple of a HWI.
>> yes
>>> I assume it adds that extra zero word in case the MSB is set,
>>> right?
>> yes
>>> I am confused about the condition written here and
>>> how we look at precision and not xprecision when deciding how
>>> to extend - given a 8-bit precision unsigned 0xff and precision == 64
>>> we do not even consider sign-extending the value because we look
>>> at precision and not xprecision.  Don't we need to look at
>>> xprecision here?
>> I do not think so.  There are three cases here:
>>
>> 1) xprecision < precision:
>> say xprecision = 8 and precision = 32.
>>
>> The number is between 0 and 255 and after canonicalization the number
>> will be between 0 and 255.
>>
>> 2) xprecision == precision
>> not much to talk about here.
>>
>> 3) xprecision > precision
>>
>> We need to lose the upper bits of the input and it does not matter what
>> they were.  The only thing that we are concerned about is that the
>> bits above what is now the sign bit pointed to by precision are matched
>> in the bits above precision.
> (3) is gone with the assert though.

you seem to be correct here.

> As mentioned in my message yesterday, I thought your new way of canonising
> unsigned tree constants meant that there was always an upper zero bit.
> Is that right?

i believe this is correct.

> If so, xprecision < precision is a no-op, because the number always
> has the right form for wider precisions.  The only difficult case is
> xprecision == precision, since then we need to peel off any upper -1 HWIs.
say my HWI size is 8 bits (just to keep from typing a million 'f's).  if i have a 16 bit unsigned number that is all 1s, in the tree world it is 3 hwis: 0x00 0xff 0xff.

but inside regular wide int, it would take 1 wide int whose value is 0xff.  inside of max it would be the same as the tree, but then the test precision < xprecision + hbpwi never kicks in because precision is guaranteed to be huge.

inside of addr_wide_int, i think we tank with the assertion.

the case actually comes up on the ppc because they do a lot of 128 bit math.  I think i got thru the x86-64 without noticing this.

> If that's right and xprecision == precision can use val with an adjusted
> len, then...
>
>>> After all an assert precision == xprecision
>>> does not work in this routine.
>>>
>>> Quickly execute.exp tested only.
>>>
>>> To avoid code quality wreckage we have to implement a different way
>>> of allocating the 'scratch' storage of wide_int_ref_storage
>>> (well, ideally we wouldn't need it ...).  I thought of simply
>>> allocating it via alloca (), but we need to do that in all
>>> callers that build a wide_int_ref (eventually via a hidden
>>> default arg).  But as that's a template specialization of
>>> generic_wide_int ... so the option is to provide a function
>>> wrapper inside wi:: for this, like
>> I want richard and mike to be the people who respond to the next
>> point.  I am not a c++ person and all of this storage manager layer
>> stuff gives me a headache.
> ...the use of the scratch array goes away completely for addr_wide_int
> and max_wide_int.  If this code is on a hot path for wide_int then maybe
> we should simply require that wide_int operations involving trees have
> explicit extensions, like in rtl.  (I.e. only do the implicit extension
> for addr_wide_int and max_wide_int.)

i do not understand how this is true.

> I'm not opposed to going back to a separate scratch array if all else fails,
> but it's really ugly, so I'd like to look at the alternatives first...
>
> Thanks,
> Richard
Kenneth Zadeck <zadeck@naturalbridge.com> writes:
>> As mentioned in my message yesterday, I thought your new way of canonising
>> unsigned tree constants meant that there was always an upper zero bit.
>> Is that right?
> i believe this is correct.
>> If so, xprecision < precision is a no-op, because the number always
>> has the right form for wider precisions.  The only difficult case is
>> xprecision == precision, since then we need to peel off any upper -1 HWIs.
> say my HWI size is 8 bits (just to keep from typing a million 'f's).  if
> i have a 16 bit unsigned number that is all 1s, in the tree world it is
> 3 hwis: 0x00 0xff 0xff.
>
> but inside regular wide int, it would take 1 wide int whose value is 0xff.
> inside of max it would be the same as the tree, but then the test
> precision < xprecision + hbpwi never kicks in because precision is
> guaranteed to be huge.
>
> inside of addr_wide_int, i think we tank with the assertion.

It should be OK for addr_wide_int too.  The precision still fits 2 HWIs.  The initial length is greater than the maximum length of an addr_wide_int, but your "len = MAX (len, max_len)" deals with that.

> the case actually comes up on the ppc because they do a lot of 128 bit
> math.  I think i got thru the x86-64 without noticing this.

Well, it'd be suspicious if we're directly using 128-bit numbers in addr_wide_int.  The justification for the assertion was that we should explicitly truncate to addr_wide_int when deliberately ignoring upper bits, beyond bit or byte address width.  128 bits definitely falls into that category on powerpc.

Thanks,
Richard
On Thu, 17 Oct 2013, Richard Sandiford wrote:

> Kenneth Zadeck <zadeck@naturalbridge.com> writes:
> >> As mentioned in my message yesterday, I thought your new way of canonising
> >> unsigned tree constants meant that there was always an upper zero bit.
> >> Is that right?
> > i believe this is correct.
> >> If so, xprecision < precision is a no-op, because the number always
> >> has the right form for wider precisions.  The only difficult case is
> >> xprecision == precision, since then we need to peel off any upper -1 HWIs.
> > say my HWI size is 8 bits (just to keep from typing a million 'f's).  if
> > i have a 16 bit unsigned number that is all 1s, in the tree world it is
> > 3 hwis: 0x00 0xff 0xff.
> >
> > but inside regular wide int, it would take 1 wide int whose value is 0xff.
> > inside of max it would be the same as the tree, but then the test
> > precision < xprecision + hbpwi never kicks in because precision is
> > guaranteed to be huge.
> >
> > inside of addr_wide_int, i think we tank with the assertion.
>
> It should be OK for addr_wide_int too.  The precision still fits 2 HWIs.
> The initial length is greater than the maximum length of an addr_wide_int,
> but your "len = MAX (len, max_len)" deals with that.

It's

  len = MIN (len, max_len)

which looked suspicious to me, but with precision >= xprecision, precision can only be zero if xprecision is zero, which looked to me like it cannot happen - or rather it should be fixed.

> > the case actually comes up on the ppc because they do a lot of 128 bit
> > math.  I think i got thru the x86-64 without noticing this.
>
> Well, it'd be suspicious if we're directly using 128-bit numbers
> in addr_wide_int.  The justification for the assertion was that we
> should explicitly truncate to addr_wide_int when deliberately
> ignoring upper bits, beyond bit or byte address width.  128 bits
> definitely falls into that category on powerpc.
My question is whether with 8-bit HWI 0x00 0xff 0xff is a valid wide-int value if it has precision 16.  AFAIK that is what the code produces, but now Kenny says this is only for some kind of wide-ints but not all?  That is, why is

inline wi::storage_ref
wi::int_traits <const_tree>::decompose (HOST_WIDE_INT *scratch,
                                        unsigned int precision, const_tree x)
{
  unsigned int len = TREE_INT_CST_NUNITS (x);
  const HOST_WIDE_INT *val
    = (const HOST_WIDE_INT *) &TREE_INT_CST_ELT (x, 0);
  return wi::storage_ref (val, len, precision);
}

not a valid implementation together with making sure that the INTEGER_CST tree rep has that extra word of zeros if required?  I would like to see us move in that direction (given that the tree rep of INTEGER_CST has transitioned to variable-length already).

Btw, code such as

tree
wide_int_to_tree (tree type, const wide_int_ref &pcst)
{
...
  unsigned int small_prec = prec & (HOST_BITS_PER_WIDE_INT - 1);
  bool recanonize = sgn == UNSIGNED
    && small_prec
    && (prec + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT == len;

definitely needs a comment.  I would have thought _all_ unsigned numbers need re-canonicalization.  Well, maybe only if we're forcing the extra zeros.

[This function shows another optimization issue:

    case BOOLEAN_TYPE:
      /* Cache false or true.  */
      limit = 2;
      if (wi::leu_p (cst, 1))
        ix = cst.to_uhwi ();

I would have expected cst <= 1 to be optimized to cst.len == 1 && cst.val[0] <= 1.
It expands to

<L27>:
  MEM[(long int *)&D.50698 + 16B] = 1;
  MEM[(struct wide_int_ref_storage *)&D.50698]
    = &MEM[(struct wide_int_ref_storage *)&D.50698].scratch;
  MEM[(struct wide_int_ref_storage *)&D.50698 + 8B] = 1;
  MEM[(struct wide_int_ref_storage *)&D.50698 + 12B] = 32;
  _277 = MEM[(const struct wide_int_storage *)&cst + 260B];
  if (_277 <= 64)
    goto <bb 42>;
  else
    goto <bb 43>;

<bb 42>:
  xl_491 = zext_hwi (1, 32);  // ok, checking enabled and thus out-of-line
  _494 = MEM[(const long int *)&cst];
  _495 = (long unsigned int) _494;
  yl_496 = zext_hwi (_495, _277);
  _497 = xl_491 < yl_496;
  goto <bb 44>;

<bb 43>:
  _503 = wi::ltu_p_large (&MEM[(struct wide_int_ref_storage *)&D.50698].scratch,
                          1, 32,
                          &MEM[(const struct wide_int_storage *)&cst].val,
                          len_274, _277);

this keeps D.50698 and cst un-SRAable - inline storage is problematic for this reason.  But the representation should guarantee the compare with a low precision (32 bit) constant is evaluatable at compile-time if len of the larger value is > 1, no?

<bb 44>:
  # _504 = PHI <_497(42), _503(43)>
  D.50698 ={v} {CLOBBER};
  if (_504 != 0)
    goto <bb 45>;
  else
    goto <bb 46>;

<bb 45>:
  pretmp_563 = MEM[(const struct wide_int_storage *)&cst + 256B];
  goto <bb 229> (<L131>);

<bb 46>:
  _65 = generic_wide_int<wide_int_storage>::to_uhwi (&cst, 0);
  ix_66 = (int) _65;
  goto <bb 91>;

The question is whether we should try to optimize wide-int for such cases or simply not use wi:leu_p (cst, 1) but rather

  if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)

?

> Thanks,
> Richard
Richard Biener <rguenther@suse.de> writes:
> On Thu, 17 Oct 2013, Richard Sandiford wrote:
>
>> Kenneth Zadeck <zadeck@naturalbridge.com> writes:
>> >> As mentioned in my message yesterday, I thought your new way of canonising
>> >> unsigned tree constants meant that there was always an upper zero bit.
>> >> Is that right?
>> > i believe this is correct.
>> >> If so, xprecision < precision is a no-op, because the number always
>> >> has the right form for wider precisions.  The only difficult case is
>> >> xprecision == precision, since then we need to peel off any upper -1 HWIs.
>> > say my HWI size is 8 bits (just to keep from typing a million 'f's).  if
>> > i have a 16 bit unsigned number that is all 1s, in the tree world it is
>> > 3 hwis: 0x00 0xff 0xff.
>> >
>> > but inside regular wide int, it would take 1 wide int whose value is 0xff.
>> > inside of max it would be the same as the tree, but then the test
>> > precision < xprecision + hbpwi never kicks in because precision is
>> > guaranteed to be huge.
>> >
>> > inside of addr_wide_int, i think we tank with the assertion.
>>
>> It should be OK for addr_wide_int too.  The precision still fits 2 HWIs.
>> The initial length is greater than the maximum length of an addr_wide_int,
>> but your "len = MAX (len, max_len)" deals with that.
>
> It's
>
>   len = MIN (len, max_len)

Oops, yeah, I meant MIN, sorry.

> which looked suspicious to me, but with precision >= xprecision
> precision can only be zero if xprecision is zero which looked to
> me like it cannot happen - or rather it should be fixed.

Despite the comment above the code, I don't think this MIN is there for the zero-precision case.  I think it's there to handle the new tree representation.

The new tree representation can have a length greater than max_len for an unsigned tree constant that occupies a whole number of HWIs.  The tree representation of an unsigned 0x8000 is 0x00 0x80 0x00.  When extended to max_wide_int the representation is the same.
But a 2-HWI addr_wide_int would be 0x80 0x00, without the leading zero.  The MIN trims the length from 3 to 2 in the last case.

>> > the case actually comes up on the ppc because they do a lot of 128 bit
>> > math.  I think i got thru the x86-64 without noticing this.
>>
>> Well, it'd be suspicious if we're directly using 128-bit numbers
>> in addr_wide_int.  The justification for the assertion was that we
>> should explicitly truncate to addr_wide_int when deliberately
>> ignoring upper bits, beyond bit or byte address width.  128 bits
>> definitely falls into that category on powerpc.
>
> My question is whether with 8-bit HWI 0x00 0xff 0xff is a valid
> wide-int value if it has precision 16.

No, for a 16-bit wide_int it should be 0xff.  0x00 0xff 0xff is correct for any wide_int wider than 16 bits though.

> AFAIK that is what the code produces,

In which case?  This is:

  precision == 16
  xprecision == 16
  len == 3
  max_len == 2

The MIN trims the len to 2 and then the loop Kenny added trims it again to 1, so the "0x00 0xff 0xff" becomes "0xff".  The "0x00 0xff" is still there in the array, but not used.

> but now Kenny says this is only for some kind
> of wide-ints but not all?  That is, why is
>
> inline wi::storage_ref
> wi::int_traits <const_tree>::decompose (HOST_WIDE_INT *scratch,
>                                         unsigned int precision, const_tree x)
> {
>   unsigned int len = TREE_INT_CST_NUNITS (x);
>   const HOST_WIDE_INT *val
>     = (const HOST_WIDE_INT *) &TREE_INT_CST_ELT (x, 0);
>   return wi::storage_ref (val, len, precision);
> }
>
> not a valid implementation together with making sure that the
> INTEGER_CST tree rep has that extra word of zeros if required?

The fundamental problem here is that we're trying to support two cases:

(a) doing N-bit arithmetic in cases where the inputs have N bits

(b) doing N-bit arithmetic in cases where the inputs have fewer than N bits and are extended according to TYPE_SIGN.

Let's assume 32-bit HWIs.
The 16-bit (4-hex-digit) constant 0x8000 is 0x8000 regardless of whether the type is signed or unsigned.  But if it's extended to 32 bits you get two different numbers, 0xffff8000 and 0x00008000, depending on the sign.  So for one value of the "precision" parameter (i.e. xprecision), signed and unsigned constants produce the same number.  But for another value of the "precision" parameter (those greater than xprecision), signed and unsigned constants produce different numbers.  Yet at the moment the tree constant has a single representation.

So I think the possibilities are:

(1) Use the representation of an N-bit wide_int to store N-bit tree constants.  Do work when extending them to wider wide_ints.

(2) Use the representation of max_wide_int to store N-bit tree constants.  Do work when creating an N-bit wide_int.

(3) Store both representations in the tree constant.

(4) Require all tree arithmetic to be done in the same way as rtl arithmetic, with explicit extensions.  This gets rid of case (b).

(5) Require all tree arithmetic to be done in wider wide_ints than the inputs, which I think is what you preferred.  This gets rid of case (a).

(6) Allow the same wide_int constant to have several different representations.  Remember that this is to some extent what Kenny's original implementation did, since partial HWIs were filled with don't-cares.  You and I are the ones who argued that each wide_int should have a single representation.

> [This function shows another optimization issue:
>
>     case BOOLEAN_TYPE:
>       /* Cache false or true.  */
>       limit = 2;
>       if (wi::leu_p (cst, 1))
>         ix = cst.to_uhwi ();
>
> I would have expected cst <= 1 be optimized to cst.len == 1 &&
> cst.val[0] <= 1.
> It expands to
>
> <L27>:
>   MEM[(long int *)&D.50698 + 16B] = 1;
>   MEM[(struct wide_int_ref_storage *)&D.50698]
>     = &MEM[(struct wide_int_ref_storage *)&D.50698].scratch;
>   MEM[(struct wide_int_ref_storage *)&D.50698 + 8B] = 1;
>   MEM[(struct wide_int_ref_storage *)&D.50698 + 12B] = 32;
>   _277 = MEM[(const struct wide_int_storage *)&cst + 260B];
>   if (_277 <= 64)
>     goto <bb 42>;
>   else
>     goto <bb 43>;
>
> <bb 42>:
>   xl_491 = zext_hwi (1, 32);  // ok, checking enabled and thus out-of-line
>   _494 = MEM[(const long int *)&cst];
>   _495 = (long unsigned int) _494;
>   yl_496 = zext_hwi (_495, _277);
>   _497 = xl_491 < yl_496;
>   goto <bb 44>;
>
> <bb 43>:
>   _503 = wi::ltu_p_large (&MEM[(struct wide_int_ref_storage
>     *)&D.50698].scratch, 1, 32, &MEM[(const struct wide_int_storage
>     *)&cst].val, len_274, _277);
>
> this keeps D.50698 and cst un-SRAable - inline storage is problematic
> for this reason.  But the representation should guarantee the
> compare with a low precision (32 bit) constant is evaluatable
> at compile-time if len of the larger value is > 1, no?
>
> <bb 44>:
>   # _504 = PHI <_497(42), _503(43)>
>   D.50698 ={v} {CLOBBER};
>   if (_504 != 0)
>     goto <bb 45>;
>   else
>     goto <bb 46>;
>
> <bb 45>:
>   pretmp_563 = MEM[(const struct wide_int_storage *)&cst + 256B];
>   goto <bb 229> (<L131>);
>
> <bb 46>:
>   _65 = generic_wide_int<wide_int_storage>::to_uhwi (&cst, 0);
>   ix_66 = (int) _65;
>   goto <bb 91>;
>
> The question is whether we should try to optimize wide-int for
> such cases or simply not use wi:leu_p (cst, 1) but rather
>
>   if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
>
> ?

I think we should do the first, trying to optimise wide_int.

Thanks,
Richard
On 10/17/2013 04:46 AM, Richard Biener wrote:
> the case actually comes up on the ppc because they do a lot of 128 bit
> math.  I think i got thru the x86-64 without noticing this.
>> Well, it'd be suspicious if we're directly using 128-bit numbers
>> in addr_wide_int.  The justification for the assertion was that we
>> should explicitly truncate to addr_wide_int when deliberately
>> ignoring upper bits, beyond bit or byte address width.  128 bits
>> definitely falls into that category on powerpc.
> My question is whether with 8-bit HWI 0x00 0xff 0xff is a valid
> wide-int value if it has precision 16.  AFAIK that is what the
> code produces, but now Kenny says this is only for some kind
> of wide-ints but not all?  That is, why is

The issue is not that the rules are different between the different flavors of wide int, it is that the circumstances are different.  The rule is that bits above the precision exist only if the precision is not an even multiple of the HBPWI.  In that case, the bits are always an extension of the sign bits.

max_wide_int and addr_wide_int have large enough precisions that it is impossible to ever generate an unsigned number on the target that is large enough to ever run against the precision.  However, for the fixed precision case, you can, and at least on the ppc, you do, generate unsigned numbers that are big enough to have to be recanonicalized like this.

> inline wi::storage_ref
> wi::int_traits <const_tree>::decompose (HOST_WIDE_INT *scratch,
>                                         unsigned int precision, const_tree x)
> {
>   unsigned int len = TREE_INT_CST_NUNITS (x);
>   const HOST_WIDE_INT *val
>     = (const HOST_WIDE_INT *) &TREE_INT_CST_ELT (x, 0);
>   return wi::storage_ref (val, len, precision);
> }
>
> not a valid implementation together with making sure that the
> INTEGER_CST tree rep has that extra word of zeros if required?
> I would like to see us move in that direction (given that the
> tree rep of INTEGER_CST has transitioned to variable-length already).
>
> Btw, code such as
>
> tree
> wide_int_to_tree (tree type, const wide_int_ref &pcst)
> {
> ...
>   unsigned int small_prec = prec & (HOST_BITS_PER_WIDE_INT - 1);
>   bool recanonize = sgn == UNSIGNED
>     && small_prec
>     && (prec + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT == len;
>
> definitely needs a comment.  I would have thought _all_ unsigned
> numbers need re-canonicalization.  Well, maybe only if we're
> forcing the extra zeros.

It will get a comment.

> [This function shows another optimization issue:
>
>     case BOOLEAN_TYPE:
>       /* Cache false or true.  */
>       limit = 2;
>       if (wi::leu_p (cst, 1))
>         ix = cst.to_uhwi ();
>
> I would have expected cst <= 1 be optimized to cst.len == 1 &&
> cst.val[0] <= 1.  It expands to
>
> <L27>:
>   MEM[(long int *)&D.50698 + 16B] = 1;
>   MEM[(struct wide_int_ref_storage *)&D.50698]
>     = &MEM[(struct wide_int_ref_storage *)&D.50698].scratch;
>   MEM[(struct wide_int_ref_storage *)&D.50698 + 8B] = 1;
>   MEM[(struct wide_int_ref_storage *)&D.50698 + 12B] = 32;
>   _277 = MEM[(const struct wide_int_storage *)&cst + 260B];
>   if (_277 <= 64)
>     goto <bb 42>;
>   else
>     goto <bb 43>;
>
> <bb 42>:
>   xl_491 = zext_hwi (1, 32);  // ok, checking enabled and thus out-of-line
>   _494 = MEM[(const long int *)&cst];
>   _495 = (long unsigned int) _494;
>   yl_496 = zext_hwi (_495, _277);
>   _497 = xl_491 < yl_496;
>   goto <bb 44>;
>
> <bb 43>:
>   _503 = wi::ltu_p_large (&MEM[(struct wide_int_ref_storage
>     *)&D.50698].scratch, 1, 32, &MEM[(const struct wide_int_storage
>     *)&cst].val, len_274, _277);
>
> this keeps D.50698 and cst un-SRAable - inline storage is problematic
> for this reason.  But the representation should guarantee the
> compare with a low precision (32 bit) constant is evaluatable
> at compile-time if len of the larger value is > 1, no?
>
> <bb 44>:
>   # _504 = PHI <_497(42), _503(43)>
>   D.50698 ={v} {CLOBBER};
>   if (_504 != 0)
>     goto <bb 45>;
>   else
>     goto <bb 46>;
>
> <bb 45>:
>   pretmp_563 = MEM[(const struct wide_int_storage *)&cst + 256B];
>   goto <bb 229> (<L131>);
>
> <bb 46>:
>   _65 = generic_wide_int<wide_int_storage>::to_uhwi (&cst, 0);
>   ix_66 = (int) _65;
>   goto <bb 91>;
>
> The question is whether we should try to optimize wide-int for
> such cases or simply not use wi:leu_p (cst, 1) but rather
>
>   if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
>
> ?

i find this ugly, but i see where you are coming from.  The problem is that both you and i know that the len has to be 1, but the optimizer does not.  This is a case where I think that we made a mistake getting rid of wi::one_p, wi::zero_p and wi::minus_one_p.  The internals of one_p were return (len == 1 && val[0] == 1) and i think that is much nicer than what you put there.  On the other hand, it seems that a person more skilled than i am with c++ could specialize the comparisons with an integer constant, since i believe that that constant must fit in one hwi (I am a little concerned about large unsigned constants).

>> Thanks,
>> Richard
On Thu, 17 Oct 2013, Kenneth Zadeck wrote:

> On 10/17/2013 04:46 AM, Richard Biener wrote:
> > the case actually comes up on the ppc because they do a lot of 128 bit
> > math.  I think i got thru the x86-64 without noticing this.
> > > Well, it'd be suspicious if we're directly using 128-bit numbers
> > > in addr_wide_int.  The justification for the assertion was that we
> > > should explicitly truncate to addr_wide_int when deliberately
> > > ignoring upper bits, beyond bit or byte address width.  128 bits
> > > definitely falls into that category on powerpc.
> > My question is whether with 8-bit HWI 0x00 0xff 0xff is a valid
> > wide-int value if it has precision 16.  AFAIK that is what the
> > code produces, but now Kenny says this is only for some kind
> > of wide-ints but not all?  That is, why is
>
> The issue is not that the rules are different between the different
> flavors of wide int, it is that the circumstances are different.  The
> rule is that bits above the precision exist only if the precision is
> not an even multiple of the HBPWI.  In that case, the bits are always
> an extension of the sign bits.
>
> max_wide_int and addr_wide_int have large enough precisions that it is
> impossible to ever generate an unsigned number on the target that is
> large enough to ever run against the precision.  However, for the fixed
> precision case, you can, and at least on the ppc, you do, generate
> unsigned numbers that are big enough to have to be recanonicalized like
> this.
>
> > inline wi::storage_ref
> > wi::int_traits <const_tree>::decompose (HOST_WIDE_INT *scratch,
> >                                         unsigned int precision, const_tree x)
> > {
> >   unsigned int len = TREE_INT_CST_NUNITS (x);
> >   const HOST_WIDE_INT *val
> >     = (const HOST_WIDE_INT *) &TREE_INT_CST_ELT (x, 0);
> >   return wi::storage_ref (val, len, precision);
> > }
> >
> > not a valid implementation together with making sure that the
> > INTEGER_CST tree rep has that extra word of zeros if required?
> > I would like to see us move in that direction (given that the > > tree rep of INTEGER_CST has transitioned to variable-length already). > > > > Btw, code such as > > > > tree > > wide_int_to_tree (tree type, const wide_int_ref &pcst) > > { > > ... > > unsigned int small_prec = prec & (HOST_BITS_PER_WIDE_INT - 1); > > bool recanonize = sgn == UNSIGNED > > && small_prec > > && (prec + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT == > > len; > > > > definitely needs a comment. I would have thought _all_ unsigned > > numbers need re-canonicalization. Well, maybe only if we're > > forcing the extra zeros. > It will get a comment. > > [This function shows another optimization issue: > > > > case BOOLEAN_TYPE: > > /* Cache false or true. */ > > limit = 2; > > if (wi::leu_p (cst, 1)) > > ix = cst.to_uhwi (); > > > > I would have expected cst <= 1 be optimized to cst.len == 1 && > > cst.val[0] <= 1. It expands to > > > > <L27>: > > MEM[(long int *)&D.50698 + 16B] = 1; > > MEM[(struct wide_int_ref_storage *)&D.50698] = &MEM[(struct > > wide_int_ref_storage *)&D.50698].scratch; > > MEM[(struct wide_int_ref_storage *)&D.50698 + 8B] = 1; > > MEM[(struct wide_int_ref_storage *)&D.50698 + 12B] = 32; > > _277 = MEM[(const struct wide_int_storage *)&cst + 260B]; > > if (_277 <= 64) > > goto <bb 42>; > > else > > goto <bb 43>; > > > > <bb 42>: > > xl_491 = zext_hwi (1, 32); // ok, checking enabled and thus out-of-line > > _494 = MEM[(const long int *)&cst]; > > _495 = (long unsigned int) _494; > > yl_496 = zext_hwi (_495, _277); > > _497 = xl_491 < yl_496; > > goto <bb 44>; > > > > <bb 43>: > > _503 = wi::ltu_p_large (&MEM[(struct wide_int_ref_storage > > *)&D.50698].scratch, 1, 32, &MEM[(const struct wide_int_storage > > *)&cst].val, len_274, _277); > > > > this keeps D.50698 and cst un-SRAable - inline storage is problematic > > for this reason. 
But the representation should guarantee the > > compare with a low precision (32 bit) constant is evaluatable > > at compile-time if len of the larger value is > 1, no? > > > > <bb 44>: > > # _504 = PHI <_497(42), _503(43)> > > D.50698 ={v} {CLOBBER}; > > if (_504 != 0) > > goto <bb 45>; > > else > > goto <bb 46>; > > > > <bb 45>: > > pretmp_563 = MEM[(const struct wide_int_storage *)&cst + 256B]; > > goto <bb 229> (<L131>); > > > > <bb 46>: > > _65 = generic_wide_int<wide_int_storage>::to_uhwi (&cst, 0); > > ix_66 = (int) _65; > > goto <bb 91>; > > > > The question is whether we should try to optimize wide-int for > > such cases or simply not use wi:leu_p (cst, 1) but rather > > > > if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1) > > > > ? > i find this ugly, but i see where you are coming from. The problem is that > both you and i know that the len has to be 1, but the optimizer does not. > This is a case where I think that we made a mistake getting rid of the > wi::one_p, wi::zero_p and wi::minus_one_p. The internals of one_p were return > (len == 1 && val[0] ==1) and i think that is much nicer than what you put > there. On the other hand, it seem that a person more skilled than i am > with c++ could specialize the comparisons with an integer constant, since i > believe that that constant must fit in one hwi (I am a little concerned about > large unsigned constants). one_p and zero_p wouldn't catch a compare with 3 ;) So yes, it looks like if we cannot trick the optimizers to do the right thing, like with inline bool wi::lts_p (const wide_int_ref &x, const wide_int_ref &y) { if (x.precision <= HOST_BITS_PER_WIDE_INT && y.precision <= HOST_BITS_PER_WIDE_INT) return x.slow () < y.slow (); else if (x.precision <= HOST_BITS_PER_WIDE_INT) // why not len == 1? ... optimize else if (y.precision <= HOST_BITS_PER_WIDE_INT) ... 
optimize else return lts_p_large (x.val, x.len, x.precision, y.val, y.len, y.precision); } which of course pessimizes the case where the precisions are not known at compile-time. So maybe the above should always be if (__builtin_constant_p (x.precision) && x.precision <= HOST_BITS_PER_WIDE_INT .... and lts_p_large should handle the small precision case efficiently (but out-of-line?) Richard.
On 10/17/2013 07:49 AM, Richard Biener wrote:
> On Thu, 17 Oct 2013, Kenneth Zadeck wrote:
>
>> On 10/17/2013 04:46 AM, Richard Biener wrote:
>>> the case actually comes up on the ppc because they do a lot of 128 bit
>>> math.  I think i got thru the x86-64 without noticing this.
>>>> Well, it'd be suspicious if we're directly using 128-bit numbers
>>>> in addr_wide_int.  The justification for the assertion was that we
>>>> should explicitly truncate to addr_wide_int when deliberately
>>>> ignoring upper bits, beyond bit or byte address width.  128 bits
>>>> definitely falls into that category on powerpc.
>>> My question is whether with 8-bit HWI 0x00 0xff 0xff is a valid
>>> wide-int value if it has precision 16.  AFAIK that is what the
>>> code produces, but now Kenny says this is only for some kind
>>> of wide-ints but not all?  That is, why is
>> The issue is not that the rules are different between the different flavors
>> of wide int, it is that the circumstances are different.  The rule is that
>> the only bits above the precision that exist are there if the precision is
>> not an even multiple of the HBPWI.  In that case, the bits are always an
>> extension of the sign bits.
>>
>> max_wide_int and addr_wide_int have large enough precisions so that it is
>> impossible to ever generate an unsigned number on the target that is large
>> enough to ever run up against the precision.  However, for the fixed
>> precision case, you can, and at least on the ppc, you do, generate unsigned
>> numbers that are big enough to have to be recanonicalized like this.
>>
>>> inline wi::storage_ref
>>> wi::int_traits <const_tree>::decompose (HOST_WIDE_INT *scratch,
>>>                                         unsigned int precision,
>>>                                         const_tree x)
>>> {
>>>   unsigned int len = TREE_INT_CST_NUNITS (x);
>>>   const HOST_WIDE_INT *val
>>>     = (const HOST_WIDE_INT *) &TREE_INT_CST_ELT (x, 0);
>>>   return wi::storage_ref (val, len, precision);
>>> }
>>>
>>> not a valid implementation together with making sure that the
>>> INTEGER_CST tree rep has that extra word of zeros if required?
>>> I would like to see us move in that direction (given that the
>>> tree rep of INTEGER_CST has transitioned to variable-length already).
>>>
>>> Btw, code such as
>>>
>>> tree
>>> wide_int_to_tree (tree type, const wide_int_ref &pcst)
>>> {
>>> ...
>>>   unsigned int small_prec = prec & (HOST_BITS_PER_WIDE_INT - 1);
>>>   bool recanonize = sgn == UNSIGNED
>>>     && small_prec
>>>     && (prec + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT ==
>>> len;
>>>
>>> definitely needs a comment.  I would have thought _all_ unsigned
>>> numbers need re-canonicalization.  Well, maybe only if we're
>>> forcing the extra zeros.
>> It will get a comment.
>>> [This function shows another optimization issue:
>>>
>>>     case BOOLEAN_TYPE:
>>>       /* Cache false or true.  */
>>>       limit = 2;
>>>       if (wi::leu_p (cst, 1))
>>>         ix = cst.to_uhwi ();
>>>
>>> I would have expected cst <= 1 be optimized to cst.len == 1 &&
>>> cst.val[0] <= 1.  It expands to
>>>
>>> <L27>:
>>>   MEM[(long int *)&D.50698 + 16B] = 1;
>>>   MEM[(struct wide_int_ref_storage *)&D.50698] = &MEM[(struct
>>>     wide_int_ref_storage *)&D.50698].scratch;
>>>   MEM[(struct wide_int_ref_storage *)&D.50698 + 8B] = 1;
>>>   MEM[(struct wide_int_ref_storage *)&D.50698 + 12B] = 32;
>>>   _277 = MEM[(const struct wide_int_storage *)&cst + 260B];
>>>   if (_277 <= 64)
>>>     goto <bb 42>;
>>>   else
>>>     goto <bb 43>;
>>>
>>> <bb 42>:
>>>   xl_491 = zext_hwi (1, 32);  // ok, checking enabled and thus out-of-line
>>>   _494 = MEM[(const long int *)&cst];
>>>   _495 = (long unsigned int) _494;
>>>   yl_496 = zext_hwi (_495, _277);
>>>   _497 = xl_491 < yl_496;
>>>   goto <bb 44>;
>>>
>>> <bb 43>:
>>>   _503 = wi::ltu_p_large (&MEM[(struct wide_int_ref_storage
>>>     *)&D.50698].scratch, 1, 32, &MEM[(const struct wide_int_storage
>>>     *)&cst].val, len_274, _277);
>>>
>>> this keeps D.50698 and cst un-SRAable - inline storage is problematic
>>> for this reason.  But the representation should guarantee the
>>> compare with a low precision (32 bit) constant is evaluatable
>>> at compile-time if len of the larger value is > 1, no?
>>>
>>> <bb 44>:
>>>   # _504 = PHI <_497(42), _503(43)>
>>>   D.50698 ={v} {CLOBBER};
>>>   if (_504 != 0)
>>>     goto <bb 45>;
>>>   else
>>>     goto <bb 46>;
>>>
>>> <bb 45>:
>>>   pretmp_563 = MEM[(const struct wide_int_storage *)&cst + 256B];
>>>   goto <bb 229> (<L131>);
>>>
>>> <bb 46>:
>>>   _65 = generic_wide_int<wide_int_storage>::to_uhwi (&cst, 0);
>>>   ix_66 = (int) _65;
>>>   goto <bb 91>;
>>>
>>> The question is whether we should try to optimize wide-int for
>>> such cases or simply not use wi:leu_p (cst, 1) but rather
>>>
>>>   if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
>>>
>>> ?
>> i find this ugly, but i see where you are coming from.  The problem is that
>> both you and i know that the len has to be 1, but the optimizer does not.
>> This is a case where I think that we made a mistake getting rid of the
>> wi::one_p, wi::zero_p and wi::minus_one_p.  The internals of one_p were
>> return (len == 1 && val[0] == 1) and i think that is much nicer than what
>> you put there.  On the other hand, it seems that a person more skilled
>> than i am with c++ could specialize the comparisons with an integer
>> constant, since i believe that that constant must fit in one hwi (I am a
>> little concerned about large unsigned constants).
> one_p and zero_p wouldn't catch a compare with 3 ;)  So yes, it looks
> like if we cannot trick the optimizers to do the right thing, like with
>
> inline bool
> wi::lts_p (const wide_int_ref &x, const wide_int_ref &y)
> {
>   if (x.precision <= HOST_BITS_PER_WIDE_INT
>       && y.precision <= HOST_BITS_PER_WIDE_INT)
>     return x.slow () < y.slow ();
>   else if (x.precision <= HOST_BITS_PER_WIDE_INT)  // why not len == 1?
>     ... optimize
>   else if (y.precision <= HOST_BITS_PER_WIDE_INT)
>     ... optimize
>   else
>     return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
>                         y.precision);
> }
>
> which of course pessimizes the case where the precisions are not
> known at compile-time.  So maybe the above should always be
>
>   if (__builtin_constant_p (x.precision)
>       && x.precision <= HOST_BITS_PER_WIDE_INT
>     ....
>
> and lts_p_large should handle the small precision case efficiently
> (but out-of-line?)
>
> Richard.

richi, what i was thinking about was to overload lts_p and its friends
so that the case of the second arg being an int or a long is handled
separately.  For any int and signed long, you know that the value can
always be checked with a len check and a single compare.  unsigned longs
can cause trouble and so you may not want to overload that.

The part that i do not know how to do is to write the signature of the
function.  if i just make the sig of the second arg HOST_WIDE_INT, am i
safe to get int, unsigned int and long but not unsigned long?
On Oct 17, 2013, at 1:46 AM, Richard Biener <rguenther@suse.de> wrote:
> [This function shows another optimization issue:
>
>     case BOOLEAN_TYPE:
>       /* Cache false or true.  */
>       limit = 2;
>       if (wi::leu_p (cst, 1))
>         ix = cst.to_uhwi ();
>
> I would have expected cst <= 1 be optimized to cst.len == 1 &&
> cst.val[0] <= 1.

So lts_p has the same problem, and is used just below.  If we do:

@@ -1461,12 +1470,22 @@ wi::ne_p (const T1 &x, const T2 &y)
 inline bool
 wi::lts_p (const wide_int_ref &x, const wide_int_ref &y)
 {
-  if (x.precision <= HOST_BITS_PER_WIDE_INT
-      && y.precision <= HOST_BITS_PER_WIDE_INT)
-    return x.slow () < y.slow ();
-  else
-    return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
-                        y.precision);
+  // We optimize w < N, where N is 64 or fewer bits.
+  if (y.precision <= HOST_BITS_PER_WIDE_INT)
+    {
+      // If it fits directly into a shwi, we can compare directly.
+      if (wi::fits_shwi_p (x))
+        return x.slow () < y.slow ();
+      // If it doesn't fit, then it must be more negative than any
+      // value in y, and hence smaller.
+      if (neg_p (x, SIGNED))
+        return true;
+      // If it is positive, then it must be larger than any value in y,
+      // and hence greater.
+      return false;
+    }
+  return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
+                      y.precision);
 }

then for:

bool MGEN(wide_int cst) {
  return wi::lts_p (cst, 100);
}

we generate:

        movl    520(%rsp), %eax
        cmpl    $1, %eax
        je      .L5283
        subl    $1, %eax
        movq    8(%rsp,%rax,8), %rax
        shrq    $63, %rax
        ret
        .p2align 4,,10
        .p2align 3
.L5283:
        cmpq    $99, 8(%rsp)
        setle   %al
        ret

which as you can see, never calls lts_p_large, and hence, if that was
the only reason the storage escapes, then it should be able to optimize
better.  In the above code, minimal, no function calls, and the 100 is
propagated all the way down into the cmpq.  Now, the reason why we
should do it this way: most of the time, most types are 64 bits or
less.  When that is true, then it will never call into lts_p_large.

If the above 100 is changed to l, from a parameter long l, then the
code is the same except the last part is:

        cmpq    8(%rsp), %rdi
        setg    %al
        ret

If we have two wide_int's, then the code does this:

        .cfi_startproc
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        movl    1052(%rsp), %r9d
        movl    1048(%rsp), %r8d
        movl    532(%rsp), %edx
        movl    528(%rsp), %esi
        cmpl    $64, %r9d
        ja      .L5281
        cmpl    $1, %esi
        je      .L5284
        subl    $1, %esi
        movq    16(%rsp,%rsi,8), %rax
        addq    $8, %rsp
        .cfi_remember_state
        .cfi_def_cfa_offset 8
        shrq    $63, %rax
        ret
        .p2align 4,,10
        .p2align 3
.L5281:
        .cfi_restore_state
        leaq    536(%rsp), %rcx
        leaq    16(%rsp), %rdi
        call    _ZN2wi11lts_p_largeEPKljjS1_jj
        addq    $8, %rsp
        .cfi_remember_state
        .cfi_def_cfa_offset 8
        ret
        .p2align 4,,10
        .p2align 3
.L5284:
        .cfi_restore_state
        movq    536(%rsp), %rax
        cmpq    %rax, 16(%rsp)
        setl    %al
        addq    $8, %rsp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc

which is still pretty nice.

Does this look ok?  Kenny, can you double check the cases, think I have
them right, but… a double check would be good.

> the representation should guarantee the
> compare with a low precision (32 bit) constant is evaluatable
> at compile-time if len of the larger value is > 1, no?

I agree, though, I didn't check the symmetric case, as constants and
smaller values can be put on the right by convention.

> The question is whether we should try to optimize wide-int for
> such cases or simply not use wi:leu_p (cst, 1) but rather
>
>   if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
>
> ?

Since we can do an excellent job with the simple interface, I'd argue
for the simple interface and the simple change to do the conditional
better.  I'd rather up the ante on meta programming to get any
performance we want, retaining the simple interfaces.
On Thu, 17 Oct 2013, Mike Stump wrote:

> On Oct 17, 2013, at 1:46 AM, Richard Biener <rguenther@suse.de> wrote:
> > [This function shows another optimization issue:
> >
> >     case BOOLEAN_TYPE:
> >       /* Cache false or true.  */
> >       limit = 2;
> >       if (wi::leu_p (cst, 1))
> >         ix = cst.to_uhwi ();
> >
> > I would have expected cst <= 1 be optimized to cst.len == 1 &&
> > cst.val[0] <= 1.
>
> So lts_p has the same problem, and is used just below.  If we do:
>
> @@ -1461,12 +1470,22 @@ wi::ne_p (const T1 &x, const T2 &y)
>  inline bool
>  wi::lts_p (const wide_int_ref &x, const wide_int_ref &y)
>  {
> -  if (x.precision <= HOST_BITS_PER_WIDE_INT
> -      && y.precision <= HOST_BITS_PER_WIDE_INT)
> -    return x.slow () < y.slow ();
> -  else
> -    return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
> -                        y.precision);
> +  // We optimize w < N, where N is 64 or fewer bits.
> +  if (y.precision <= HOST_BITS_PER_WIDE_INT)
> +    {
> +      // If it fits directly into a shwi, we can compare directly.
> +      if (wi::fits_shwi_p (x))
> +        return x.slow () < y.slow ();
> +      // If it doesn't fit, then it must be more negative than any
> +      // value in y, and hence smaller.
> +      if (neg_p (x, SIGNED))
> +        return true;
> +      // If it is positive, then it must be larger than any value in y,
> +      // and hence greater.
> +      return false;
> +    }
> +  return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
> +                      y.precision);
>  }
>
> then for:
>
> bool MGEN(wide_int cst) {
>   return wi::lts_p (cst, 100);
> }
>
> we generate:
>
>         movl    520(%rsp), %eax
>         cmpl    $1, %eax
>         je      .L5283
>         subl    $1, %eax
>         movq    8(%rsp,%rax,8), %rax
>         shrq    $63, %rax
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L5283:
>         cmpq    $99, 8(%rsp)
>         setle   %al
>         ret
>
> which as you can see, never calls lts_p_large, and hence, if that was
> the only reason the storage escapes, then it should be able to optimize
> better.  In the above code, minimal, no function calls, and the 100 is
> propagated all the way down into the cmpq.  Now, the reason why we
> should do it this way: most of the time, most types are 64 bits or
> less.  When that is true, then it will never call into lts_p_large.
>
> If the above 100 is changed to l, from a parameter long l, then the
> code is the same except the last part is:
>
>         cmpq    8(%rsp), %rdi
>         setg    %al
>         ret
>
> If we have two wide_int's, then the code does this:
>
>         .cfi_startproc
>         subq    $8, %rsp
>         .cfi_def_cfa_offset 16
>         movl    1052(%rsp), %r9d
>         movl    1048(%rsp), %r8d
>         movl    532(%rsp), %edx
>         movl    528(%rsp), %esi
>         cmpl    $64, %r9d
>         ja      .L5281
>         cmpl    $1, %esi
>         je      .L5284
>         subl    $1, %esi
>         movq    16(%rsp,%rsi,8), %rax
>         addq    $8, %rsp
>         .cfi_remember_state
>         .cfi_def_cfa_offset 8
>         shrq    $63, %rax
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L5281:
>         .cfi_restore_state
>         leaq    536(%rsp), %rcx
>         leaq    16(%rsp), %rdi
>         call    _ZN2wi11lts_p_largeEPKljjS1_jj
>         addq    $8, %rsp
>         .cfi_remember_state
>         .cfi_def_cfa_offset 8
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L5284:
>         .cfi_restore_state
>         movq    536(%rsp), %rax
>         cmpq    %rax, 16(%rsp)
>         setl    %al
>         addq    $8, %rsp
>         .cfi_def_cfa_offset 8
>         ret
>         .cfi_endproc
>
> which is still pretty nice.
>
> Does this look ok?  Kenny, can you double check the cases, think I
> have them right, but… a double check would be good.

That works for me.

> > the representation should guarantee the
> > compare with a low precision (32 bit) constant is evaluatable
> > at compile-time if len of the larger value is > 1, no?
>
> I agree, though, I didn't check the symmetric case, as constants and
> smaller values can be put on the right by convention.
>
> > The question is whether we should try to optimize wide-int for
> > such cases or simply not use wi:leu_p (cst, 1) but rather
> >
> >   if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
> >
> > ?
>
> Since we can do an excellent job with the simple interface, I'd argue
> for the simple interface and the simple change to do the conditional
> better.  I'd rather up the ante on meta programming to get any
> performance we want, retaining the simple interfaces.

Agreed.  Note that I'd make heavy use of __builtin_constant_p
in the inline implementation as if we cannot optimize the choice
at compile-time we'll both slow down things and make the code
bloated up to the extent that the inliner no longer will consider
it for inlining.  That is,

> +  // We optimize w < N, where N is 64 or fewer bits.
> +  if (y.precision <= HOST_BITS_PER_WIDE_INT)

should be something like

  if (CONSTANT (y.precision <= HOST_BITS_PER_WIDE_INT))
    ...

and evaluate to false if it doesn't evaluate to compile-time
constant true.  The ultimate 'else' case then of course has
to work for all cases.

If I got it right then sth like

#define CONSTANT(x) (__builtin_constant_p (x) ? (x) : 0)

should do the job (eventually with some stmt expression
magic to evaluate x only once).

That said, the various tree predicates in tree.c should
ultimately just work on the tree rep - they are basic
enough that they don't need to go through wide-ints.  So,
instead of

int
integer_zerop (const_tree expr)
{
  STRIP_NOPS (expr);

  switch (TREE_CODE (expr))
    {
    case INTEGER_CST:
      return wi::eq_p (expr, 0);

I'd put there

    case INTEGER_CST:
      return TREE_INT_CST_NUNITS (expr) == 1
             && TREE_INT_CST_ELT (expr, 0) == 0;

I just used the existing code as convenient simplest users
of wide-ints to inspect code generation (originally I looked
at tree-dfa.c:get_ref_base_and_extent which is timing critical
but somewhat convoluted).

So before the merge we should adjust the tree.c changes on the
branch appropriately.

Richard.
On 10/18/2013 04:45 AM, Richard Biener wrote:
> On Thu, 17 Oct 2013, Mike Stump wrote:
>
>> On Oct 17, 2013, at 1:46 AM, Richard Biener <rguenther@suse.de> wrote:
>>> [This function shows another optimization issue:
>>>
>>>     case BOOLEAN_TYPE:
>>>       /* Cache false or true.  */
>>>       limit = 2;
>>>       if (wi::leu_p (cst, 1))
>>>         ix = cst.to_uhwi ();
>>>
>>> I would have expected cst <= 1 be optimized to cst.len == 1 &&
>>> cst.val[0] <= 1.
>> So lts_p has the same problem, and is used just below.  If we do:
>>
>> @@ -1461,12 +1470,22 @@ wi::ne_p (const T1 &x, const T2 &y)
>>  inline bool
>>  wi::lts_p (const wide_int_ref &x, const wide_int_ref &y)
>>  {
>> -  if (x.precision <= HOST_BITS_PER_WIDE_INT
>> -      && y.precision <= HOST_BITS_PER_WIDE_INT)
>> -    return x.slow () < y.slow ();
>> -  else
>> -    return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
>> -                        y.precision);
>> +  // We optimize w < N, where N is 64 or fewer bits.
>> +  if (y.precision <= HOST_BITS_PER_WIDE_INT)
>> +    {
>> +      // If it fits directly into a shwi, we can compare directly.
>> +      if (wi::fits_shwi_p (x))
>> +        return x.slow () < y.slow ();
>> +      // If it doesn't fit, then it must be more negative than any
>> +      // value in y, and hence smaller.
>> +      if (neg_p (x, SIGNED))
>> +        return true;
>> +      // If it is positive, then it must be larger than any value in y,
>> +      // and hence greater.
>> +      return false;
>> +    }
>> +  return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
>> +                      y.precision);
>>  }
>>
>> then for:
>>
>> bool MGEN(wide_int cst) {
>>   return wi::lts_p (cst, 100);
>> }
>>
>> we generate:
>>
>>         movl    520(%rsp), %eax
>>         cmpl    $1, %eax
>>         je      .L5283
>>         subl    $1, %eax
>>         movq    8(%rsp,%rax,8), %rax
>>         shrq    $63, %rax
>>         ret
>>         .p2align 4,,10
>>         .p2align 3
>> .L5283:
>>         cmpq    $99, 8(%rsp)
>>         setle   %al
>>         ret
>>
>> which as you can see, never calls lts_p_large, and hence, if that was
>> the only reason the storage escapes, then it should be able to
>> optimize better.  In the above code, minimal, no function calls, and
>> the 100 is propagated all the way down into the cmpq.  Now, the
>> reason why we should do it this way: most of the time, most types are
>> 64 bits or less.  When that is true, then it will never call into
>> lts_p_large.
>>
>> If the above 100 is changed to l, from a parameter long l, then the
>> code is the same except the last part is:
>>
>>         cmpq    8(%rsp), %rdi
>>         setg    %al
>>         ret
>>
>> If we have two wide_int's, then the code does this:
>>
>>         .cfi_startproc
>>         subq    $8, %rsp
>>         .cfi_def_cfa_offset 16
>>         movl    1052(%rsp), %r9d
>>         movl    1048(%rsp), %r8d
>>         movl    532(%rsp), %edx
>>         movl    528(%rsp), %esi
>>         cmpl    $64, %r9d
>>         ja      .L5281
>>         cmpl    $1, %esi
>>         je      .L5284
>>         subl    $1, %esi
>>         movq    16(%rsp,%rsi,8), %rax
>>         addq    $8, %rsp
>>         .cfi_remember_state
>>         .cfi_def_cfa_offset 8
>>         shrq    $63, %rax
>>         ret
>>         .p2align 4,,10
>>         .p2align 3
>> .L5281:
>>         .cfi_restore_state
>>         leaq    536(%rsp), %rcx
>>         leaq    16(%rsp), %rdi
>>         call    _ZN2wi11lts_p_largeEPKljjS1_jj
>>         addq    $8, %rsp
>>         .cfi_remember_state
>>         .cfi_def_cfa_offset 8
>>         ret
>>         .p2align 4,,10
>>         .p2align 3
>> .L5284:
>>         .cfi_restore_state
>>         movq    536(%rsp), %rax
>>         cmpq    %rax, 16(%rsp)
>>         setl    %al
>>         addq    $8, %rsp
>>         .cfi_def_cfa_offset 8
>>         ret
>>         .cfi_endproc
>>
>> which is still pretty nice.
>>
>> Does this look ok?  Kenny, can you double check the cases, think I
>> have them right, but… a double check would be good.
> That works for me.
>
i talked to mike about this last night, but did not follow up with an
email until now.  The problem is that this code is wrong!!!  He is
working to fix that, and so i would expect something from him later
(tomorrow for those in europe).

The problem is if the variable that comes in is an unsigned HWI that
has the top bit set.  In that case, the wide-int canonicalized version
would have two hwi's, the top one being a 0 block, and so you cannot do
the fast path test coming in on that even though its precision fits.

>>> the representation should guarantee the
>>> compare with a low precision (32 bit) constant is evaluatable
>>> at compile-time if len of the larger value is > 1, no?
>> I agree, though, I didn't check the symmetric case, as constants and
>> smaller values can be put on the right by convention.
>>
>>> The question is whether we should try to optimize wide-int for
>>> such cases or simply not use wi:leu_p (cst, 1) but rather
>>>
>>>   if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
>>>
>>> ?
>> Since we can do an excellent job with the simple interface, I'd argue
>> for the simple interface and the simple change to do the conditional
>> better.  I'd rather up the ante on meta programming to get any
>> performance we want, retaining the simple interfaces.
> Agreed.  Note that I'd make heavy use of __builtin_constant_p
> in the inline implementation as if we cannot optimize the choice
> at compile-time we'll both slow down things and make the code
> bloated up to the extent that the inliner no longer will consider
> it for inlining.  That is,
>
>> +  // We optimize w < N, where N is 64 or fewer bits.
>> +  if (y.precision <= HOST_BITS_PER_WIDE_INT)
>
> should be something like
>
>   if (CONSTANT (y.precision <= HOST_BITS_PER_WIDE_INT))
>     ...
>
> and evaluate to false if it doesn't evaluate to compile-time
> constant true.  The ultimate 'else' case then of course has
> to work for all cases.
>
> If I got it right then sth like
>
> #define CONSTANT(x) (__builtin_constant_p (x) ? (x) : 0)
>
> should do the job (eventually with some stmt expression
> magic to evaluate x only once).
>
> That said, the various tree predicates in tree.c should
> ultimately just work on the tree rep - they are basic
> enough that they don't need to go through wide-ints.  So,
> instead of
>
> int
> integer_zerop (const_tree expr)
> {
>   STRIP_NOPS (expr);
>
>   switch (TREE_CODE (expr))
>     {
>     case INTEGER_CST:
>       return wi::eq_p (expr, 0);
>
> I'd put there
>
>     case INTEGER_CST:
>       return TREE_INT_CST_NUNITS (expr) == 1
>              && TREE_INT_CST_ELT (expr, 0) == 0;
>
> I just used the existing code as convenient simplest users
> of wide-ints to inspect code generation (originally I looked
> at tree-dfa.c:get_ref_base_and_extent which is timing critical
> but somewhat convoluted).
>
> So before the merge we should adjust the tree.c changes on the
> branch appropriately.
>
> Richard.
Index: tree.h
===================================================================
--- tree.h	(revision 203690)
+++ tree.h	(working copy)
@@ -5184,14 +5184,11 @@ wi::int_traits <const_tree>::decompose (
 	   / HOST_BITS_PER_WIDE_INT);
   unsigned int xprecision = get_precision (x);
 
-  gcc_assert (precision >= xprecision);
+  gcc_checking_assert (precision >= xprecision && precision != 0);
 
-  /* Got to be careful of precision 0 values.  */
-  if (precision)
-    len = MIN (len, max_len);
   if (TYPE_SIGN (TREE_TYPE (x)) == UNSIGNED)
     {
-      unsigned int small_prec = precision & (HOST_BITS_PER_WIDE_INT - 1);
+      unsigned int small_prec = xprecision & (HOST_BITS_PER_WIDE_INT - 1);
       if (small_prec)
 	{