Message ID | 871sy180ie.fsf@atmel.com |
---|---|
State | New |
Headers | show |
On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote: > Hi, > > I've been analyzing a failing regtest (gcc.dg/strlenopt-8.c) for the avr > target. I found that the (dump) failure is because there are 4 > instances of memcpy, while the testcase expects only 2 for a > non-strict align target like the avr. > > Comparing that with a dump generated by x64_64-pc-linux, I found that > the extra memcpy's come from the forwprop pass, when it replaces > strcat with strlen and memcpy. For x86_64, the memcpy generated gets > folded into a load/store in gimple_fold_builtin_memory_op. That > doesn't happen for the avr because len (2) happens to be bigger than > MOVE_MAX (1). > > The avr can only move 1 byte efficiently from reg <-> memory, but it's > more efficient to load and store 2 bytes than to call memcpy, so > MOVE_MAX_PIECES is set to 2. > > Given that gimple_fold_builtin_memory_op gets to choose between > leaving the memcpy call as is, or breaking it down to a by-pieces > move, shouldn't it use MOVE_MAX_PIECES instead of > MOV_MAX? > > That is what the below patch does, and that makes the test > pass. Does this sound right? No, as we handle both memcpy and memmove this way we rely on the whole storage fit in a single register so we do the right thing for overlapping memory. Richard. > Regards > Senthil > > Index: gcc/gimple-fold.c > =================================================================== > --- gcc/gimple-fold.c (revision 242741) > +++ gcc/gimple-fold.c (working copy) > @@ -703,7 +703,7 @@ > src_align = get_pointer_alignment (src); > dest_align = get_pointer_alignment (dest); > if (tree_fits_uhwi_p (len) > - && compare_tree_int (len, MOVE_MAX) <= 0 > + && compare_tree_int (len, MOVE_MAX_PIECES) <= 0 > /* ??? Don't transform copies from strings with known length this > confuses the tree-ssa-strlen.c. This doesn't handle > the case in gcc.dg/strlenopt-8.c which is XFAILed for that > >
On Thu, 24 Nov 2016, Richard Biener wrote: > On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote: > > > Hi, > > > > I've been analyzing a failing regtest (gcc.dg/strlenopt-8.c) for the avr > > target. I found that the (dump) failure is because there are 4 > > instances of memcpy, while the testcase expects only 2 for a > > non-strict align target like the avr. > > > > Comparing that with a dump generated by x64_64-pc-linux, I found that > > the extra memcpy's come from the forwprop pass, when it replaces > > strcat with strlen and memcpy. For x86_64, the memcpy generated gets > > folded into a load/store in gimple_fold_builtin_memory_op. That > > doesn't happen for the avr because len (2) happens to be bigger than > > MOVE_MAX (1). > > > > The avr can only move 1 byte efficiently from reg <-> memory, but it's > > more efficient to load and store 2 bytes than to call memcpy, so > > MOVE_MAX_PIECES is set to 2. > > > > Given that gimple_fold_builtin_memory_op gets to choose between > > leaving the memcpy call as is, or breaking it down to a by-pieces > > move, shouldn't it use MOVE_MAX_PIECES instead of > > MOV_MAX? > > > > That is what the below patch does, and that makes the test > > pass. Does this sound right? > > No, as we handle both memcpy and memmove this way we rely on > the whole storage fit in a single register so we do the > right thing for overlapping memory. So actually your patch doesn't chnage that, the ordering is ensured by emitting a single GIMPLE load/store pair. There are only four targets defining MOVE_MAX_PIECES, and one (s390) even has a smaller MOVE_MAX_PIECES than MOVE_MAX (huh). AVR has larger MOVE_MAX_PIECES than MOVE_MAX, but that seems to not make much sense to me given their very similar description plus the fact that AVR can only load a single byte at a time... The x86 comment says /* MOVE_MAX_PIECES is the number of bytes at a time which we can move efficiently, as opposed to MOVE_MAX which is the maximum number of bytes we can move with a single instruction. which doesn't match up with @defmac MOVE_MAX The maximum number of bytes that a single instruction can move quickly between memory and registers or between two memory locations. @end defmac note "quickly" here. But OTOH @defmac MOVE_MAX_PIECES A C expression used by @code{move_by_pieces} to determine the largest unit a load or store used to copy memory is. Defaults to @code{MOVE_MAX}. @end defmac here the only difference is "copy memory". But we try to special case the one load - one store case, not generally "copy memory". So I think MOVE_MAX matches my intent when writing the code. Richard. > Richard. > > > Regards > > Senthil > > > > Index: gcc/gimple-fold.c > > =================================================================== > > --- gcc/gimple-fold.c (revision 242741) > > +++ gcc/gimple-fold.c (working copy) > > @@ -703,7 +703,7 @@ > > src_align = get_pointer_alignment (src); > > dest_align = get_pointer_alignment (dest); > > if (tree_fits_uhwi_p (len) > > - && compare_tree_int (len, MOVE_MAX) <= 0 > > + && compare_tree_int (len, MOVE_MAX_PIECES) <= 0 > > /* ??? Don't transform copies from strings with known length this > > confuses the tree-ssa-strlen.c. This doesn't handle > > the case in gcc.dg/strlenopt-8.c which is XFAILed for that > > > > > >
Richard Biener writes: > On Thu, 24 Nov 2016, Richard Biener wrote: > >> On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote: >> >> > Hi, >> > >> > I've been analyzing a failing regtest (gcc.dg/strlenopt-8.c) for the avr >> > target. I found that the (dump) failure is because there are 4 >> > instances of memcpy, while the testcase expects only 2 for a >> > non-strict align target like the avr. >> > >> > Comparing that with a dump generated by x64_64-pc-linux, I found that >> > the extra memcpy's come from the forwprop pass, when it replaces >> > strcat with strlen and memcpy. For x86_64, the memcpy generated gets >> > folded into a load/store in gimple_fold_builtin_memory_op. That >> > doesn't happen for the avr because len (2) happens to be bigger than >> > MOVE_MAX (1). >> > >> > The avr can only move 1 byte efficiently from reg <-> memory, but it's >> > more efficient to load and store 2 bytes than to call memcpy, so >> > MOVE_MAX_PIECES is set to 2. >> > >> > Given that gimple_fold_builtin_memory_op gets to choose between >> > leaving the memcpy call as is, or breaking it down to a by-pieces >> > move, shouldn't it use MOVE_MAX_PIECES instead of >> > MOV_MAX? >> > >> > That is what the below patch does, and that makes the test >> > pass. Does this sound right? >> >> No, as we handle both memcpy and memmove this way we rely on >> the whole storage fit in a single register so we do the >> right thing for overlapping memory. > > So actually your patch doesn't chnage that, the ordering is ensured > by emitting a single GIMPLE load/store pair. There are only > four targets defining MOVE_MAX_PIECES, and one (s390) even has > a smaller MOVE_MAX_PIECES than MOVE_MAX (huh). AVR has larger > MOVE_MAX_PIECES than MOVE_MAX, but that seems to not make much > sense to me given their very similar description plus the > fact that AVR can only load a single byte at a time... > > The x86 comment says > > /* MOVE_MAX_PIECES is the number of bytes at a time which we can > move efficiently, as opposed to MOVE_MAX which is the maximum > number of bytes we can move with a single instruction. > > which doesn't match up with > > @defmac MOVE_MAX > The maximum number of bytes that a single instruction can move quickly > between memory and registers or between two memory locations. > @end defmac > > note "quickly" here. But OTOH > > @defmac MOVE_MAX_PIECES > A C expression used by @code{move_by_pieces} to determine the largest unit > a load or store used to copy memory is. Defaults to @code{MOVE_MAX}. > @end defmac > > here the only difference is "copy memory". But we try to special > case the one load - one store case, not generally "copy memory". > > So I think MOVE_MAX matches my intent when writing the code. Ok, but isn't that inconsistent with tree-inline.c:estimate_move_cost, which considers MOVE_MAX_PIECES and MOVE_RATIO to decide between a libcall and by-pieces move? Regards Senthil > > Richard. > >> Richard. >> >> > Regards >> > Senthil >> > >> > Index: gcc/gimple-fold.c >> > =================================================================== >> > --- gcc/gimple-fold.c (revision 242741) >> > +++ gcc/gimple-fold.c (working copy) >> > @@ -703,7 +703,7 @@ >> > src_align = get_pointer_alignment (src); >> > dest_align = get_pointer_alignment (dest); >> > if (tree_fits_uhwi_p (len) >> > - && compare_tree_int (len, MOVE_MAX) <= 0 >> > + && compare_tree_int (len, MOVE_MAX_PIECES) <= 0 >> > /* ??? Don't transform copies from strings with known length this >> > confuses the tree-ssa-strlen.c. This doesn't handle >> > the case in gcc.dg/strlenopt-8.c which is XFAILed for that >> > >> > >> >>
On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote: > > Richard Biener writes: > > > On Thu, 24 Nov 2016, Richard Biener wrote: > > > >> On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote: > >> > >> > Hi, > >> > > >> > I've been analyzing a failing regtest (gcc.dg/strlenopt-8.c) for the avr > >> > target. I found that the (dump) failure is because there are 4 > >> > instances of memcpy, while the testcase expects only 2 for a > >> > non-strict align target like the avr. > >> > > >> > Comparing that with a dump generated by x64_64-pc-linux, I found that > >> > the extra memcpy's come from the forwprop pass, when it replaces > >> > strcat with strlen and memcpy. For x86_64, the memcpy generated gets > >> > folded into a load/store in gimple_fold_builtin_memory_op. That > >> > doesn't happen for the avr because len (2) happens to be bigger than > >> > MOVE_MAX (1). > >> > > >> > The avr can only move 1 byte efficiently from reg <-> memory, but it's > >> > more efficient to load and store 2 bytes than to call memcpy, so > >> > MOVE_MAX_PIECES is set to 2. > >> > > >> > Given that gimple_fold_builtin_memory_op gets to choose between > >> > leaving the memcpy call as is, or breaking it down to a by-pieces > >> > move, shouldn't it use MOVE_MAX_PIECES instead of > >> > MOV_MAX? > >> > > >> > That is what the below patch does, and that makes the test > >> > pass. Does this sound right? > >> > >> No, as we handle both memcpy and memmove this way we rely on > >> the whole storage fit in a single register so we do the > >> right thing for overlapping memory. > > > > So actually your patch doesn't chnage that, the ordering is ensured > > by emitting a single GIMPLE load/store pair. There are only > > four targets defining MOVE_MAX_PIECES, and one (s390) even has > > a smaller MOVE_MAX_PIECES than MOVE_MAX (huh). AVR has larger > > MOVE_MAX_PIECES than MOVE_MAX, but that seems to not make much > > sense to me given their very similar description plus the > > fact that AVR can only load a single byte at a time... > > > > The x86 comment says > > > > /* MOVE_MAX_PIECES is the number of bytes at a time which we can > > move efficiently, as opposed to MOVE_MAX which is the maximum > > number of bytes we can move with a single instruction. > > > > which doesn't match up with > > > > @defmac MOVE_MAX > > The maximum number of bytes that a single instruction can move quickly > > between memory and registers or between two memory locations. > > @end defmac > > > > note "quickly" here. But OTOH > > > > @defmac MOVE_MAX_PIECES > > A C expression used by @code{move_by_pieces} to determine the largest unit > > a load or store used to copy memory is. Defaults to @code{MOVE_MAX}. > > @end defmac > > > > here the only difference is "copy memory". But we try to special > > case the one load - one store case, not generally "copy memory". > > > > So I think MOVE_MAX matches my intent when writing the code. > > Ok, but isn't that inconsistent with tree-inline.c:estimate_move_cost, which > considers MOVE_MAX_PIECES and MOVE_RATIO to decide between a libcall and > by-pieces move? Well, I don't understand why we have both MOVE_MAX and MOVE_MAX_PIECES. There are exactly two uses of MOVE_MAX in GCC AFAICS, the gimple-fold.c one and caller-save.c which derives MOVE_MAX_WORDS from it. MOVE_MAX_PIECES has the only use in block move expansion plus the single use in tree-inline.c. So I can't give a reason why one or the other should be more valid but the tree-inline.c one tries to match memcpy expansion (obviously), while the gimple-fold.c one tries to get at the maximum possible single-insn move amount (and AVR is odd here having that lower than MOVE_MAX_PIECES, compared to say s390 which has it the opposite way around). Richard. > Regards > Senthil > > > > > Richard. > > > >> Richard. > >> > >> > Regards > >> > Senthil > >> > > >> > Index: gcc/gimple-fold.c > >> > =================================================================== > >> > --- gcc/gimple-fold.c (revision 242741) > >> > +++ gcc/gimple-fold.c (working copy) > >> > @@ -703,7 +703,7 @@ > >> > src_align = get_pointer_alignment (src); > >> > dest_align = get_pointer_alignment (dest); > >> > if (tree_fits_uhwi_p (len) > >> > - && compare_tree_int (len, MOVE_MAX) <= 0 > >> > + && compare_tree_int (len, MOVE_MAX_PIECES) <= 0 > >> > /* ??? Don't transform copies from strings with known length this > >> > confuses the tree-ssa-strlen.c. This doesn't handle > >> > the case in gcc.dg/strlenopt-8.c which is XFAILed for that > >> > > >> > > >> > >> > >
On 11/25/2016 01:04 AM, Richard Biener wrote: > On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote: > >> >> Richard Biener writes: >> >>> On Thu, 24 Nov 2016, Richard Biener wrote: >>> >>>> On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote: >>>> >>>>> Hi, >>>>> >>>>> I've been analyzing a failing regtest (gcc.dg/strlenopt-8.c) for the avr >>>>> target. I found that the (dump) failure is because there are 4 >>>>> instances of memcpy, while the testcase expects only 2 for a >>>>> non-strict align target like the avr. >>>>> >>>>> Comparing that with a dump generated by x64_64-pc-linux, I found that >>>>> the extra memcpy's come from the forwprop pass, when it replaces >>>>> strcat with strlen and memcpy. For x86_64, the memcpy generated gets >>>>> folded into a load/store in gimple_fold_builtin_memory_op. That >>>>> doesn't happen for the avr because len (2) happens to be bigger than >>>>> MOVE_MAX (1). >>>>> >>>>> The avr can only move 1 byte efficiently from reg <-> memory, but it's >>>>> more efficient to load and store 2 bytes than to call memcpy, so >>>>> MOVE_MAX_PIECES is set to 2. >>>>> >>>>> Given that gimple_fold_builtin_memory_op gets to choose between >>>>> leaving the memcpy call as is, or breaking it down to a by-pieces >>>>> move, shouldn't it use MOVE_MAX_PIECES instead of >>>>> MOV_MAX? >>>>> >>>>> That is what the below patch does, and that makes the test >>>>> pass. Does this sound right? >>>> >>>> No, as we handle both memcpy and memmove this way we rely on >>>> the whole storage fit in a single register so we do the >>>> right thing for overlapping memory. >>> >>> So actually your patch doesn't chnage that, the ordering is ensured >>> by emitting a single GIMPLE load/store pair. There are only >>> four targets defining MOVE_MAX_PIECES, and one (s390) even has >>> a smaller MOVE_MAX_PIECES than MOVE_MAX (huh). AVR has larger >>> MOVE_MAX_PIECES than MOVE_MAX, but that seems to not make much >>> sense to me given their very similar description plus the >>> fact that AVR can only load a single byte at a time... >>> >>> The x86 comment says >>> >>> /* MOVE_MAX_PIECES is the number of bytes at a time which we can >>> move efficiently, as opposed to MOVE_MAX which is the maximum >>> number of bytes we can move with a single instruction. >>> >>> which doesn't match up with >>> >>> @defmac MOVE_MAX >>> The maximum number of bytes that a single instruction can move quickly >>> between memory and registers or between two memory locations. >>> @end defmac >>> >>> note "quickly" here. But OTOH >>> >>> @defmac MOVE_MAX_PIECES >>> A C expression used by @code{move_by_pieces} to determine the largest unit >>> a load or store used to copy memory is. Defaults to @code{MOVE_MAX}. >>> @end defmac >>> >>> here the only difference is "copy memory". But we try to special >>> case the one load - one store case, not generally "copy memory". >>> >>> So I think MOVE_MAX matches my intent when writing the code. >> >> Ok, but isn't that inconsistent with tree-inline.c:estimate_move_cost, which >> considers MOVE_MAX_PIECES and MOVE_RATIO to decide between a libcall and >> by-pieces move? > > Well, I don't understand why we have both MOVE_MAX and MOVE_MAX_PIECES. > There are exactly two uses of MOVE_MAX in GCC AFAICS, the gimple-fold.c > one and caller-save.c which derives MOVE_MAX_WORDS from it. > MOVE_MAX_PIECES has the only use in block move expansion plus the > single use in tree-inline.c. > > So I can't give a reason why one or the other should be more valid > but the tree-inline.c one tries to match memcpy expansion (obviously), > while the gimple-fold.c one tries to get at the maximum possible > single-insn move amount (and AVR is odd here having that lower than > MOVE_MAX_PIECES, compared to say s390 which has it the opposite way > around). It's probably historical. MOVE_MAX_PIECES, IIRC, was primarily to guide whether or not to emit a structure copy inline vs call memcpy. MOVE_MAX, IIRC, was supposed to control how many consecutive registers we'd try to squash into a single save/restore during caller-saves. ie, it might be profitable to squash a pair of 32 bit saves/restores into a single 64 bit save/restore. But it may not have been profitable to squash a pair of 64bit saves/restores into a 128bit save/restore. There may have been other uses of each that have been removed over time. jeff
Index: gcc/gimple-fold.c =================================================================== --- gcc/gimple-fold.c (revision 242741) +++ gcc/gimple-fold.c (working copy) @@ -703,7 +703,7 @@ src_align = get_pointer_alignment (src); dest_align = get_pointer_alignment (dest); if (tree_fits_uhwi_p (len) - && compare_tree_int (len, MOVE_MAX) <= 0 + && compare_tree_int (len, MOVE_MAX_PIECES) <= 0 /* ??? Don't transform copies from strings with known length this confuses the tree-ssa-strlen.c. This doesn't handle the case in gcc.dg/strlenopt-8.c which is XFAILed for that