diff mbox

Fix PR 47272 to restore Altivec vec_ld/vec_st

Message ID 20110124213133.GA21518@hungry-tiger.westford.ibm.com
State New
Headers show

Commit Message

Michael Meissner Jan. 24, 2011, 9:31 p.m. UTC
This patch fixes bug PR target/47272, but I'm sending it out to a wider
audience to solicit feedback from other developers to resolve a sticky
situation with the PowerPC code gen.

For those of you who don't know the power architecture, particularly with the
VSX extensions, the first main vector extension was the Altivec (VMX) vector
support.  The VSX vector support adds more floating point vector registers, and
overlaps with the Altivec support.  In particular, the vector loads and stores
on Altivec ignore the bottom 3 bits of the address, while the vector loads and
stores of the VSX instruction set do not, and will do unaligned loads and
stores.  Obviously, if the address is completely aligned, either an Altivec or
a VSX memory instruction will behave the same.  If on the other hand you have
an unaligned address, you will get different bytes loaded/stored.

The PowerPC compiler has a full set of overloaded vector intrinisic builtin
functions, including builtins for doing load and store.  When I added the VSX
support, I changed the compiler to do VSX loads/stores if the user used -mvsx
or -mcpu=power7, including changing the builtin load/store functions to use the
VSX instructions.  However, as I said, you get different results for unaligned
addresses.

Richard Henderson's change to libcpp/lex.c in August 21st, 2010 added code to
use the Altivec instruction set if the compiler supports it to speed up the
preprocessor:

2010-08-21  Richard Henderson  <rth@redhat.com>
	    Andi Kleen <ak@linux.intel.com>
	    David S. Miller  <davem@davemloft.net>

	* configure.ac (AC_C_BIGENDIAN, AC_TYPE_UINTPTR_T): New tests.
	(ssize_t): Check via AC_TYPE_SSIZE_T instead of AC_CHECK_TYPE.
	(ptrdiff_t): Check via AC_CHECK_TYPE.
	* config.in, configure: Rebuild.
	* system.h: Include stdint.h, if available.
	* lex.c (WORDS_BIGENDIAN): Provide default.
	(acc_char_mask_misalign, acc_char_replicate, acc_char_cmp,
	acc_char_index, search_line_acc_char, repl_chars, search_line_mmx,
	search_line_sse2, search_line_sse42, init_vectorized_lexer,
	search_line_fast): New.
	(_cpp_clean_line): Use search_line_fast.  Restructure the fast
	loop to make it clear when we're leaving the loop.  Stay in the
	fast loop for non-trigraph '?'.

Recently we started to look at building internal versions of the GCC 4.6
compiler with the --with-cpu=power7 support, and it exposed the difference
between the two loads.

So after some debate within IBM, we've come to the conclusion that I should not
have changed the semantics of __builtin_vec_ld and __builtin_vec_st, and that
we should go back to using the Altivec form for these instructions.  However,
in doing so, it means that anybody who has written new code explicitly for
power7 since GCC 4.5 came out might now be suprised.  Unfortunately the bug
exists in GCC 4.5 as well as the Red Hat RHEL6 and SUSE Sles 11 Sp1 compilers.

I realize that we are in stage 4 of the release process, but if we are going to
change the builtins back to the 4.4 semantics, we should do it as soon as
possible.

David suggested I poll release managers and other interested parties
what path we should take (make the builtins adhere to the 4.4 semantics, or
just keep the current situation).

I'm enclosing patches to make the load/store builtins go back to the Altivec
semantics, and added vector double support to those.  In addition, I added
patches for libcpp/lex.c so that it will work with 4.5 compilers as well as 4.4
and future 4.6 compilers.  No matter whether we decide not to re-change the
builtin semantics or not, I feel the lex.c patch should go it.

Right now, I did not add an #ifdef or -m switch to toggle to the 4.5
behaviour.  I can do this if desired (it probably is a good idea to allow code
written for 4.5 to continue to be used).  I don't know how many people directly
write using the Altivec semantics.

I should mention that Darwin users and people using the host processor in PS3
that might have written Altivec specific code will not be affected, since those
machines do not have the VSX instruction set.  It is only the newer machines
being shipped by IBM that currently will have the problem.

Sorry about all this.

Comments

Mark Mitchell Jan. 24, 2011, 9:34 p.m. UTC | #1
On 1/24/2011 1:31 PM, Michael Meissner wrote:

> So after some debate within IBM, we've come to the conclusion that I should not
> have changed the semantics of __builtin_vec_ld and __builtin_vec_st, and that
> we should go back to using the Altivec form for these instructions

Can you explain why that's desirable?  I think that the first thing to
do is to convince ourselves that's technically desirable; if we can't do
that, then there's no need to think about whether to do it now or later.

My gut instinct is that having released 4.5, we should just live with
the semantics we now have; we've broken compatibility with some Altivec
code when compiled for Power 7, but breaking compatibility again seems
like it will just confuse things worse.

Thank you,
Michael Meissner Jan. 24, 2011, 9:52 p.m. UTC | #2
On Mon, Jan 24, 2011 at 01:34:50PM -0800, Mark Mitchell wrote:
> On 1/24/2011 1:31 PM, Michael Meissner wrote:
> 
> > So after some debate within IBM, we've come to the conclusion that I should not
> > have changed the semantics of __builtin_vec_ld and __builtin_vec_st, and that
> > we should go back to using the Altivec form for these instructions
> 
> Can you explain why that's desirable?  I think that the first thing to
> do is to convince ourselves that's technically desirable; if we can't do
> that, then there's no need to think about whether to do it now or later.
> 
> My gut instinct is that having released 4.5, we should just live with
> the semantics we now have; we've broken compatibility with some Altivec
> code when compiled for Power 7, but breaking compatibility again seems
> like it will just confuse things worse.

Sure, if you have a program that is dealing with unaligned data, and you know
the machine ANDs out the bottom bits, you would write the code to handle the
initial bits using something like lex.c has.  At this point, I really can see
both sides (do we cater to the 4.4 users or the 4.5 users).

I suspect that if we wanted to do it automatically, we would need to see if
LVSR or similar instruction was used.  Or just provide an #ifdef, and make the
default the 4.5 behavior.

Here is the code in lex.c that knows about this alignment quirk:

static const uchar *
search_line_fast (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
{
  typedef __attribute__((altivec(vector))) unsigned char vc;

  /* ... */

  const vc ones = {
    -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1,
  };
  const vc zero = { 0 };

  vc data, mask, t;

  /* Altivec loads automatically mask addresses with -16.  This lets us
     issue the first load as early as possible.  */
  data = __builtin_vec_ld(0, (const vc *)s);

  /* Discard bytes before the beginning of the buffer.  Do this by
     beginning with all ones and shifting in zeros according to the
     mis-alignment.  The LVSR instruction pulls the exact shift we
     want from the address.  */
  mask = __builtin_vec_lvsr(0, s);
  mask = __builtin_vec_perm(zero, ones, mask);
  data &= mask;

  /* While altivec loads mask addresses, we still need to align S so
     that the offset we compute at the end is correct.  */
  s = (const uchar *)((uintptr_t)s & -16);

  /* Main loop processing 16 bytes at a time.  */
  goto start;
  do
    {
      vc m_nl, m_cr, m_bs, m_qm;

      s += 16;
      data = __builtin_vec_ld(0, (const vc *)s);

    /* ... */
    }
  while (!__builtin_vec_vcmpeq_p(/*__CR6_LT_REV*/3, t, zero));
Mark Mitchell Jan. 25, 2011, 3:16 a.m. UTC | #3
On 1/24/2011 1:52 PM, Michael Meissner wrote:

>> My gut instinct is that having released 4.5, we should just live with
>> the semantics we now have; we've broken compatibility with some Altivec
>> code when compiled for Power 7, but breaking compatibility again seems
>> like it will just confuse things worse.

> Sure, if you have a program that is dealing with unaligned data, and you know
> the machine ANDs out the bottom bits, you would write the code to handle the
> initial bits using something like lex.c has.  At this point, I really can see
> both sides (do we cater to the 4.4 users or the 4.5 users).

OK, right, I see that.

We keep learning how important backwards compatibility is to people.  We
really need to take that incredibly seriously going forward; we (and by
"we" I mean "I") have historically under-estimated how important that is.

In any case, here we are.  I think that having come this far, we might
as well just leave things as they are.  I don't think that we really
make things better by flip-flopping on semantics from release to
release; by now, some people have probably figured out that we changed,
and have some #ifdef somewhere to deal with it, and when we change back,
their #ifdef will break, and we'll lose again.

My two cents,
Richard Biener Jan. 25, 2011, 10:15 a.m. UTC | #4
On Mon, 24 Jan 2011, Mark Mitchell wrote:

> On 1/24/2011 1:31 PM, Michael Meissner wrote:
> 
> > So after some debate within IBM, we've come to the conclusion that I should not
> > have changed the semantics of __builtin_vec_ld and __builtin_vec_st, and that
> > we should go back to using the Altivec form for these instructions
> 
> Can you explain why that's desirable?  I think that the first thing to
> do is to convince ourselves that's technically desirable; if we can't do
> that, then there's no need to think about whether to do it now or later.
> 
> My gut instinct is that having released 4.5, we should just live with
> the semantics we now have; we've broken compatibility with some Altivec
> code when compiled for Power 7, but breaking compatibility again seems
> like it will just confuse things worse.

I think we should revert to the pre-4.5 behavior and also fix 4.5.
Especially so if there are other compilers that follow the pre-4.5
behavior - are there?

Richard.

--
Richard Guenther <rguenther@suse.de>
Novell / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 - GF: Markus Rex
Jakub Jelinek Jan. 25, 2011, 10:18 a.m. UTC | #5
On Tue, Jan 25, 2011 at 11:15:25AM +0100, Richard Guenther wrote:
> On Mon, 24 Jan 2011, Mark Mitchell wrote:
> > > So after some debate within IBM, we've come to the conclusion that I should not
> > > have changed the semantics of __builtin_vec_ld and __builtin_vec_st, and that
> > > we should go back to using the Altivec form for these instructions
> > 
> > Can you explain why that's desirable?  I think that the first thing to
> > do is to convince ourselves that's technically desirable; if we can't do
> > that, then there's no need to think about whether to do it now or later.
> > 
> > My gut instinct is that having released 4.5, we should just live with
> > the semantics we now have; we've broken compatibility with some Altivec
> > code when compiled for Power 7, but breaking compatibility again seems
> > like it will just confuse things worse.
> 
> I think we should revert to the pre-4.5 behavior and also fix 4.5.

Yeah, I agree.  Especially when it is compatible not just with pre-4.5,
but also with 4.5 with -mno-vsx.  Changing behavior of an intrinsic
depending on what CPU variant it is targetting is bad, for the VSX
load/store behavior there should be a different intrinsic instead.

	Jakub
Mark Mitchell Jan. 25, 2011, 6:40 p.m. UTC | #6
On 1/25/2011 2:15 AM, Richard Guenther wrote:

> I think we should revert to the pre-4.5 behavior and also fix 4.5.
> Especially so if there are other compilers that follow the pre-4.5
> behavior - are there?

You're suggesting changing 4.5 so that 4.5.N is incompatible with
previous 4.5 releases?  That seems really unfortunate to me; we're now
introducing compatibility problems in what's supposed to be a bug-fix
release.  To justify that, we really have to decide that this was a
blatant bug.  But, it wasn't; it was a reasonable choice given the
hardware.  In retrospect, not the best choice, but a reasonable choice.

I'm certainly willing to bow to a consensus opinion here.  If the Power
maintainers have consensus that this is the right way to go, then I
would respect that.

Thank you,
Joseph Myers Jan. 26, 2011, 12:48 a.m. UTC | #7
On Mon, 24 Jan 2011, Mark Mitchell wrote:

> On 1/24/2011 1:31 PM, Michael Meissner wrote:
> 
> > So after some debate within IBM, we've come to the conclusion that I should not
> > have changed the semantics of __builtin_vec_ld and __builtin_vec_st, and that
> > we should go back to using the Altivec form for these instructions
> 
> Can you explain why that's desirable?  I think that the first thing to
> do is to convince ourselves that's technically desirable; if we can't do
> that, then there's no need to think about whether to do it now or later.
> 
> My gut instinct is that having released 4.5, we should just live with
> the semantics we now have; we've broken compatibility with some Altivec
> code when compiled for Power 7, but breaking compatibility again seems
> like it will just confuse things worse.

First, the semantics of vec_ld and vec_st (the altivec.h intrinsics, as 
opposed to the built-in functions) are documented in the AltiVec PIM.  
Both my copies (I have one PDF with a Motorola logo and one with a 
Freescale logo) explicitly say for vec_ld:

    Each operation performs a 16-byte load at a 16-byte aligned address. 
    The a is taken to be an integer value, while b is a pointer. 
    BoundAlign(a+b,16) is the largest value less than or equal to a + b 
    that is a multiple of 16.

with similar wording for vec_st.  Thus, I would say the semantics of these 
intrinsics for unaligned operations are well-defined, without reference to 
particular instructions, and breaking them for a particular -mcpu should 
be treated much the same as if we (for example) broke integer division for 
some -mcpu where it had previously followed C99 semantics: it is a 
wrong-code regression and should be fixed as such.

As for the built-in functions: we do of course document that they should 
not be used directly.  I'm not sure there's a particular reason this code 
uses them, and indeed PR 45381 has a patch to use altivec.h instead, 
though it's reported as not working to fix the underlying problem there 
(AltiVec build not working with Apple 4.0 compiler).  I don't think 
there's any good reason for them to deviate from the AltiVec PIM semantics 
even if they weren't expressly documented to keep to those semantics, and 
making vec_ld and __builtin-vec_ld different would certainly be nasty for 
the altivec.h implementation.

Unlike __sync_fetch_and_nand and __sync_nand_and_fetch, where there was a 
long history of consistently implementing semantics different from those 
intended, and where we added a warning with -Wsync-nand (enabled by 
default), I do not think a warning is needed here, where the broken 
semantics were only for a limited period with particular options.
Richard Biener Jan. 26, 2011, 10:21 a.m. UTC | #8
On Tue, 25 Jan 2011, Mark Mitchell wrote:

> On 1/25/2011 2:15 AM, Richard Guenther wrote:
> 
> > I think we should revert to the pre-4.5 behavior and also fix 4.5.
> > Especially so if there are other compilers that follow the pre-4.5
> > behavior - are there?
> 
> You're suggesting changing 4.5 so that 4.5.N is incompatible with
> previous 4.5 releases?  That seems really unfortunate to me; we're now
> introducing compatibility problems in what's supposed to be a bug-fix
> release.  To justify that, we really have to decide that this was a
> blatant bug.  But, it wasn't; it was a reasonable choice given the
> hardware.  In retrospect, not the best choice, but a reasonable choice.

I see it as a blatant bug as it appearantly breaks code that was
perfectly working with previous GCC releases.

> I'm certainly willing to bow to a consensus opinion here.  If the Power
> maintainers have consensus that this is the right way to go, then I
> would respect that.

Of course - it's up to the Power maintainers to decide what to do.

Richard.
David Edelsohn Jan. 26, 2011, 3:27 p.m. UTC | #9
On Wed, Jan 26, 2011 at 5:21 AM, Richard Guenther <rguenther@suse.de> wrote:
> On Tue, 25 Jan 2011, Mark Mitchell wrote:
>
>> On 1/25/2011 2:15 AM, Richard Guenther wrote:
>>
>> > I think we should revert to the pre-4.5 behavior and also fix 4.5.
>> > Especially so if there are other compilers that follow the pre-4.5
>> > behavior - are there?
>>
>> You're suggesting changing 4.5 so that 4.5.N is incompatible with
>> previous 4.5 releases?  That seems really unfortunate to me; we're now
>> introducing compatibility problems in what's supposed to be a bug-fix
>> release.  To justify that, we really have to decide that this was a
>> blatant bug.  But, it wasn't; it was a reasonable choice given the
>> hardware.  In retrospect, not the best choice, but a reasonable choice.
>
> I see it as a blatant bug as it appearantly breaks code that was
> perfectly working with previous GCC releases.
>
>> I'm certainly willing to bow to a consensus opinion here.  If the Power
>> maintainers have consensus that this is the right way to go, then I
>> would respect that.
>
> Of course - it's up to the Power maintainers to decide what to do.

I agree that the POWER port must revert back to the Altivec/VMX
meaning of vec_ld/vec_st and backport the fix to the GCC 4.5 branch.

- David
Mark Mitchell Jan. 26, 2011, 3:42 p.m. UTC | #10
On 1/26/2011 7:27 AM, David Edelsohn wrote:

>> Of course - it's up to the Power maintainers to decide what to do.
> 
> I agree that the POWER port must revert back to the Altivec/VMX
> meaning of vec_ld/vec_st and backport the fix to the GCC 4.5 branch.

OK, I think we've reached pretty overwhelming consensus.  For avoidance
of doubt, I'm fine with that conclusion.

Thank you,
diff mbox

Patch

Index: gcc/doc/extend.texi
===================================================================
--- gcc/doc/extend.texi	(revision 169112)
+++ gcc/doc/extend.texi	(working copy)
@@ -12354,6 +12354,12 @@  vector bool long vec_cmplt (vector doubl
 vector float vec_div (vector float, vector float);
 vector double vec_div (vector double, vector double);
 vector double vec_floor (vector double);
+vector double vec_ld (int, const vector double *);
+vector double vec_ld (int, const double *);
+vector double vec_ldl (int, const vector double *);
+vector double vec_ldl (int, const double *);
+vector unsigned char vec_lvsl (int, const volatile double *);
+vector unsigned char vec_lvsr (int, const volatile double *);
 vector double vec_madd (vector double, vector double, vector double);
 vector double vec_max (vector double, vector double);
 vector double vec_min (vector double, vector double);
@@ -12382,6 +12388,8 @@  vector double vec_sel (vector double, ve
 vector double vec_sub (vector double, vector double);
 vector float vec_sqrt (vector float);
 vector double vec_sqrt (vector double);
+void vec_st (vector double, int, vector double *);
+void vec_st (vector double, int, double *);
 vector double vec_trunc (vector double);
 vector double vec_xor (vector double, vector double);
 vector double vec_xor (vector double, vector bool long);
@@ -12412,6 +12420,10 @@  int vec_any_nlt (vector double, vector d
 int vec_any_numeric (vector double);
 @end smallexample
 
+Note that the @samp{vec_ld} and @samp{vec_st} builtins will always
+generate the Altivec @samp{LVX} and @samp{STVX} instructions even
+if the VSX instruction set is available.
+
 GCC provides a few other builtins on Powerpc to access certain instructions:
 @smallexample
 float __builtin_recipdivf (float, float);
Index: libcpp/lex.c
===================================================================
--- libcpp/lex.c	(revision 169112)
+++ libcpp/lex.c	(working copy)
@@ -547,6 +547,11 @@  search_line_fast (const uchar *s, const 
   const vc zero = { 0 };
 
   vc data, mask, t;
+  const uchar *unaligned_s = s;
+
+  /* While altivec loads mask addresses, we still need to align S so
+     that the offset we compute at the end is correct.  */
+  s = (const uchar *)((uintptr_t)s & -16);
 
   /* Altivec loads automatically mask addresses with -16.  This lets us
      issue the first load as early as possible.  */
@@ -555,15 +560,20 @@  search_line_fast (const uchar *s, const 
   /* Discard bytes before the beginning of the buffer.  Do this by
      beginning with all ones and shifting in zeros according to the
      mis-alignment.  The LVSR instruction pulls the exact shift we
-     want from the address.  */
-  mask = __builtin_vec_lvsr(0, s);
+     want from the address.
+
+     Originally, we used s in the lvsr and did the alignment afterwords, which
+     works on a system that supported just the Altivec instruction set using
+     the LVX instruction.  With the introduction of the VSX instruction, for
+     GCC 4.5, the load became LXVW4X.  LVX ignores the bottom 3 bits, and
+     LXVW4X does not.  While GCC 4.6 will revert vec_ld/vec_st to go back to
+     only produce Altivec instructions, the possibiliy exists that the stage1
+     compiler was built with a compiler that generated LXVW4X.  This code will
+     work on either system.  */
+  mask = __builtin_vec_lvsr(0, unaligned_s);
   mask = __builtin_vec_perm(zero, ones, mask);
   data &= mask;
 
-  /* While altivec loads mask addresses, we still need to align S so
-     that the offset we compute at the end is correct.  */
-  s = (const uchar *)((uintptr_t)s & -16);
-
   /* Main loop processing 16 bytes at a time.  */
   goto start;
   do