Message ID | 5672D68F.3030408@foss.arm.com |
---|---|
State | New |
Headers | show |
On 12/17/2015 04:36 PM, Kyrill Tkachov wrote: > The documentation on RTL canonical forms in md.texi says: > > "Equality comparisons of a group of bits (usually a single bit) with zero > will be written using @code{zero_extract} rather than the equivalent > @code{and} or @code{sign_extract} operations. " > > However, this is not always followed in combine. If it's trying to optimise > a comparison against zero of a bitmask that is the mode mask of some mode > (255 for QImode and 65535 for HImode in the testcases of this patch) > it will instead create a subreg to that shorter mode. I suspect that this is an oversight in the documentation, and if given two choices the simpler form is intended to be the canonical one. > it ends up trying to make a QImode comparison against zero, for which > targets like > aarch64 have no pattern. So, can you define a pattern for it... > To get the benefit on aarch64 this needs patch 1/2 that adds an aarch64 > pattern > for comparing a zero_extract with zero. ... instead of this one? > What do people think of this approach? > I hope this just enforces the already documented canonicalisation rules > with minimal(none?) negative > fallout. I'm not so sure about this. Other ports have QImode comparisons and I would want to see some evidence that there are no code quality regressions. This is not stage 3 material in any case. Bernd
On 17/12/15 15:58, Bernd Schmidt wrote: > On 12/17/2015 04:36 PM, Kyrill Tkachov wrote: >> The documentation on RTL canonical forms in md.texi says: >> >> "Equality comparisons of a group of bits (usually a single bit) with zero >> will be written using @code{zero_extract} rather than the equivalent >> @code{and} or @code{sign_extract} operations. " >> >> However, this is not always followed in combine. If it's trying to optimise >> a comparison against zero of a bitmask that is the mode mask of some mode >> (255 for QImode and 65535 for HImode in the testcases of this patch) >> it will instead create a subreg to that shorter mode. > > I suspect that this is an oversight in the documentation, and if given two choices the simpler form is intended to be the canonical one. > >> it ends up trying to make a QImode comparison against zero, for which >> targets like >> aarch64 have no pattern. > > So, can you define a pattern for it... > >> To get the benefit on aarch64 this needs patch 1/2 that adds an aarch64 >> pattern >> for comparing a zero_extract with zero. > > ... instead of this one? > Yes, I had investigated that approach and it has the same effect (on aarch64). My motivation for this approach was to try avoiding defining multiple patterns for what should be equivalent expressions. But if the short subreg form is intended to be the canonical form... >> What do people think of this approach? >> I hope this just enforces the already documented canonicalisation rules >> with minimal(none?) negative >> fallout. > > I'm not so sure about this. Other ports have QImode comparisons and I would want to see some evidence that there are no code quality regressions. This is not stage 3 material in any case. > Well, this patch still produces the QImode comparison if the target has a QImode comparison (the have_insn_for check in the simplify_comparison hunk). As I said, the effects on arm and aarch64 were strictly beneficial. On x86_64 I saw no codegen difference on SPEC2006. If this is considered too risky at this stage I can propose a QImode pattern for aarch64 instead to isolate this fix to that backend. Thanks, Kyrill > > Bernd
On 12/17/2015 05:10 PM, Kyrill Tkachov wrote: > Well, this patch still produces the QImode comparison if the target has > a QImode comparison > (the have_insn_for check in the simplify_comparison hunk). Ok, I didn't look that closely because I had doubts about the approach. This kind of check also goes somewhat against the principles of just producing canonical forms of RTL. Bernd
On 17/12/15 16:12, Bernd Schmidt wrote: > On 12/17/2015 05:10 PM, Kyrill Tkachov wrote: >> Well, this patch still produces the QImode comparison if the target has >> a QImode comparison >> (the have_insn_for check in the simplify_comparison hunk). > > Ok, I didn't look that closely because I had doubts about the approach. This kind of check also goes somewhat against the principles of just producing canonical forms of RTL. > One could argue that if the target has (or advertises having) a native QImode register comparison then it's objectively a simplification to transform a comparison in a wider mode to a comparison in the shorter mode. If, however, the target doesn't have such an instruction (like aarch64 doesn't have QImode registers) then truncating the wider mode to QImode through a subreg is not less complex than a zero_extract, as both will involve some form of extracting/masking the desired QImode bits. So picking a canonical form there makes sense, and the documentation already specifies the zero_extract form as the canonical. Would be nice to get a definite clarification on whether the subreg form is indeed the canonical one. Then we can document it and I can just add a QI/HImode compare pattern to aarch64. Thanks, Kyrill > > Bernd >
On 12/17/2015 08:58 AM, Bernd Schmidt wrote: > On 12/17/2015 04:36 PM, Kyrill Tkachov wrote: >> The documentation on RTL canonical forms in md.texi says: >> >> "Equality comparisons of a group of bits (usually a single bit) with zero >> will be written using @code{zero_extract} rather than the equivalent >> @code{and} or @code{sign_extract} operations. " >> >> However, this is not always followed in combine. If it's trying to >> optimise >> a comparison against zero of a bitmask that is the mode mask of some mode >> (255 for QImode and 65535 for HImode in the testcases of this patch) >> it will instead create a subreg to that shorter mode. > > I suspect that this is an oversight in the documentation, and if given > two choices the simpler form is intended to be the canonical one. It's also the case that sometimes a SUBREG is preferred because it conveys that certain bits are "don't care". In theory this may allow things to optimize better. However, in practice, I'm not sure that's regularly the case because various passes are weak in trying to exploit the semantics of the SUBREG and passes are generally pretty strong in their handling of zero_extract and friends. IIRC I actually bumped against this in the gcc-5 cycle when fixing some suboptimal code generation issues. I think it was BZ15184. I'd check the archives for Dec 2014 and Jan 2015. There may be a mention of this issue in there from me (I can recall bumping into it, but can't recall if I ever did mentioned it publicly or if I ever submitted the change to prefer the zero_extract form over the subreg form. Jeff
On 12/17/2015 08:58 AM, Bernd Schmidt wrote: > > I suspect that this is an oversight in the documentation, and if given > two choices the simpler form is intended to be the canonical one. The other BZ I was looking at in this space was 15596. It's PPC, but shows a generic weakness in how we identify extractions and insertions. Fixing it would probably help all the ports that have relatively strong methods to set/clear a series of bits in the middle of a word. It feels like combine has all the information necessary to improve things, but the overall combiner flow and APIs are extremely uncooperative. jeff
On 12/17/2015 09:26 AM, Kyrill Tkachov wrote: > One could argue that if the target has (or advertises having) a native > QImode register comparison then it's objectively a simplification to > transform a comparison in a wider mode > to a comparison in the shorter mode. Generally true. The most commonly cited exception is any port that defines WORD_REGISTER_OPERATIONS. However, I would be comfortable with the idea that defining QImode comparisons on a target with WORD_REGISTER_OPERATIONS is a pretty explicit indication that it wants to try and shorten comparisons for one reason or another. > > If, however, the target doesn't have such an instruction (like aarch64 > doesn't have QImode registers) then > truncating the wider mode to QImode through a subreg is not less complex > than a zero_extract, as both will > involve some form of extracting/masking the desired QImode bits. So > picking a canonical form there makes sense, > and the documentation already specifies the zero_extract form as the > canonical. > > Would be nice to get a definite clarification on whether the subreg form > is indeed the canonical one. The subreg style "extension" isn't really an extension. It is a way to say that we want to look at the object in a wider mode, but we don't actually care about the upper bits. It's generally expected that the subreg won't result in the generation of any code. A zero extract defines all the bits. In theory the optimizers can use a SUBREG just like they could a REG, which should enable additional optimization. In practice I don't think that's been as true as we'd like. jeff
Hi Jeff, On 17/12/15 16:59, Jeff Law wrote: > On 12/17/2015 09:26 AM, Kyrill Tkachov wrote: >> One could argue that if the target has (or advertises having) a native >> QImode register comparison then it's objectively a simplification to >> transform a comparison in a wider mode >> to a comparison in the shorter mode. > Generally true. > > The most commonly cited exception is any port that defines WORD_REGISTER_OPERATIONS. However, I would be comfortable with the idea that defining QImode comparisons on a target with WORD_REGISTER_OPERATIONS is a pretty explicit indication > that it wants to try and shorten comparisons for one reason or another. > > I was investigating WORD_REGISTER_OPERATIONS as part of this. But we can't define it for aarch64. In any case, aarch64 doesn't have QImode registers so I thought we'd try to avoid creating them. > > >> >> If, however, the target doesn't have such an instruction (like aarch64 >> doesn't have QImode registers) then >> truncating the wider mode to QImode through a subreg is not less complex >> than a zero_extract, as both will >> involve some form of extracting/masking the desired QImode bits. So >> picking a canonical form there makes sense, >> and the documentation already specifies the zero_extract form as the >> canonical. >> >> Would be nice to get a definite clarification on whether the subreg form >> is indeed the canonical one. > The subreg style "extension" isn't really an extension. It is a way to say that we want to look at the object in a wider mode, but we don't actually care about the upper bits. It's generally expected that the subreg won't result in the > generation of any code. > > A zero extract defines all the bits. In this case, I'm expecting a QImode compare with zero to map down to the aarch64 TST reg, #255 instruction which definitely zeroes out any bits outside of QImode (as it is a bitwise AND with a bitmask), so zero_extract is the more correct expression here, no? > > > In theory the optimizers can use a SUBREG just like they could a REG, which should enable additional optimization. In practice I don't think that's been as true as we'd like. > > jeff >
On Thu, Dec 17, 2015 at 05:12:16PM +0100, Bernd Schmidt wrote: > On 12/17/2015 05:10 PM, Kyrill Tkachov wrote: > >Well, this patch still produces the QImode comparison if the target has > >a QImode comparison > >(the have_insn_for check in the simplify_comparison hunk). > > Ok, I didn't look that closely because I had doubts about the approach. > This kind of check also goes somewhat against the principles of just > producing canonical forms of RTL. The canonicalisation rules exist so that optimisers only need to match one form instead of several, and machine descriptions only need to describe one form instead of several. For this bitmasking case it perversely forces you to describe the same instruction in many ways, for many targets. This is what the change_zero_ext was about as well. It's not so easy to fix for the compare case. Maybe the idea of making genrecog make code that recognises more forms of the same insn will work out. GCC 7 in any case... Segher
On 12/17/2015 10:04 AM, Kyrill Tkachov wrote: > In this case, I'm expecting a QImode compare with zero to map down to > the aarch64 TST reg, #255 instruction which > definitely zeroes out any bits outside of QImode (as it is a bitwise AND > with a bitmask), > so zero_extract is the more correct expression here, no? It's more about the semantics of the code and how it interacts with RTL generation, optimization and analysis than it is with the final assembly generated by the backend that drives SUBREG vs zero_extract. The backend assembly code generator is free to implement stricter semantics (such as defining all the bits for a paradoxical subreg), but the rest of the compiler can not depend on those stricter semantics. The easiest way to think about the subreg case here is that it's used when we've got a narrow object that we want to view in a wider mode, but we don't actually care about the upper bits. The widening is merely to make the mode match another operand. zero_extract is still the canonical form. subreg is a specialized form for cases where the upper bits are "don't care" values. This should probably be documented as the current state of the world. I think it's an open question whether or not to drop the subreg form and always use zero-extract. I've certainly seen cases where the former is *supposed* to allow better code generation, but in fact actually gets in the way resulting in poorer code generation. Jeff
On 17/12/15 17:27, Segher Boessenkool wrote: > On Thu, Dec 17, 2015 at 05:12:16PM +0100, Bernd Schmidt wrote: >> On 12/17/2015 05:10 PM, Kyrill Tkachov wrote: >>> Well, this patch still produces the QImode comparison if the target has >>> a QImode comparison >>> (the have_insn_for check in the simplify_comparison hunk). >> Ok, I didn't look that closely because I had doubts about the approach. >> This kind of check also goes somewhat against the principles of just >> producing canonical forms of RTL. > The canonicalisation rules exist so that optimisers only need to match > one form instead of several, and machine descriptions only need to > describe one form instead of several. For this bitmasking case it > perversely forces you to describe the same instruction in many ways, > for many targets. This is what the change_zero_ext was about as well. > > It's not so easy to fix for the compare case. Maybe the idea of making > genrecog make code that recognises more forms of the same insn will work > out. GCC 7 in any case... Perhaps I had underestimated how involved this issue is :) So if I want to improve the aarch64 situation for GCC 6, would the recommended course of action be to just define the QI and HImode compare against zero patterns? Note that I think the make_extraction hunk from my patch is in line with the function comment of make_extraction that says: " IN_COMPARE is nonzero if we are in a COMPARE. This means that a ZERO_EXTRACT should be built even for bits starting at bit 0." whereas the condition that I'm adding "&& !in_compare" is explicitly trying to avoid an extraction. But anyway, if this has the potential to cause negative fallout that I had not anticipated, it can wait for later. Thanks, Kyrill > > Segher
On 12/17/2015 06:44 PM, Kyrill Tkachov wrote: > Perhaps I had underestimated how involved this issue is :) > So if I want to improve the aarch64 situation for GCC 6, > would the recommended course of action be to just define the > QI and HImode compare against zero patterns? For GCC 6 I think this is the only approach. Bernd
diff --git a/gcc/combine.c b/gcc/combine.c index 8601d8983ce345e2129dd047b3520d98c0582842..345e63f9a05f2310a5c9e5b239ed069d22565d1c 100644 --- a/gcc/combine.c +++ b/gcc/combine.c @@ -7337,10 +7337,13 @@ make_extraction (machine_mode mode, rtx inner, HOST_WIDE_INT pos, low-order bit and this is either not in the destination or we have the appropriate STRICT_LOW_PART operation available. + Don't do this if we are inside a comparison, as the canonicalization + rules call for a zero_extract form. For MEM, we can avoid an extract if the field starts on an appropriate boundary and we can change the mode of the memory reference. */ if (tmode != BLKmode + && !in_compare && ((pos_rtx == 0 && (pos % BITS_PER_WORD) == 0 && !MEM_P (inner) && (inner_mode == tmode @@ -12108,14 +12111,19 @@ simplify_comparison (enum rtx_code code, rtx *pop0, rtx *pop1) unless TRULY_NOOP_TRUNCATION allows it or the register is known to hold a value of the required mode the - transformation is invalid. */ + transformation is invalid. + If the target does not have a compare instruction of that mode + don't do this when comparing against 0 since the canonicalization + rules require such an operation to be represented as a + zero_extract, which make_extraction will produce later on. */ if ((equality_comparison_p || unsigned_comparison_p) && CONST_INT_P (XEXP (op0, 1)) && (i = exact_log2 ((UINTVAL (XEXP (op0, 1)) & GET_MODE_MASK (mode)) + 1)) >= 0 && const_op >> i == 0 - && (tmode = mode_for_size (i, MODE_INT, 1)) != BLKmode) + && (tmode = mode_for_size (i, MODE_INT, 1)) != BLKmode + && (const_op != 0 || have_insn_for (COMPARE, tmode))) { op0 = gen_lowpart_or_truncate (tmode, XEXP (op0, 0)); continue; diff --git a/gcc/testsuite/gcc.target/aarch64/tst_5.c b/gcc/testsuite/gcc.target/aarch64/tst_5.c new file mode 100644 index 0000000000000000000000000000000000000000..868a3be502e1468592c17b8a9c6a359af00e61a8 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/tst_5.c @@ -0,0 +1,21 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +int +f255 (int x) +{ + if (x & 255) + return 1; + return x; +} + +int +f65535 (int x) +{ + if (x & 65535) + return 1; + return x; +} + +/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]*.*255" } } */ +/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]*.*65535" } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/tst_6.c b/gcc/testsuite/gcc.target/aarch64/tst_6.c new file mode 100644 index 0000000000000000000000000000000000000000..2f1c78d873445b11562c5e2863a3f642ff11f318 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/tst_6.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +int +foo (long x) +{ + return ((short) x != 0) ? x : 1; +} + +/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]*.*65535" } } */