From patchwork Mon Jan 24 21:31:33 2011
From: Michael Meissner
Date: Mon, 24 Jan 2011 16:31:33 -0500
To: gcc-patches@gcc.gnu.org, dje.gcc@gmail.com, rth@redhat.com,
	rguenther@suse.de, jakub@redhat.com, berner@vnet.ibm.com,
	geoffk@geoffk.org, mark@codesourcery.com, joseph@codesourcery.com,
	pinskia@gmail.com, dominiq@lps.ens.fr
Subject: [PATCH] Fix PR 47272 to restore Altivec vec_ld/vec_st
Message-ID: <20110124213133.GA21518@hungry-tiger.westford.ibm.com>

This patch fixes bug PR target/47272, but I'm sending it out to a wider
audience to solicit feedback from other developers on how to resolve a
sticky situation with PowerPC code generation.

For those of you who don't know the Power architecture, particularly the
VSX extensions: the first main vector extension was the Altivec (VMX)
vector support.  The VSX vector support adds more floating point vector
registers and overlaps with the Altivec support.  In particular, the
vector loads and stores of Altivec ignore the bottom 4 bits of the
address, while the vector loads and stores of the VSX instruction set do
not, and will do unaligned loads and stores.  Obviously, if the address
is completely aligned, an Altivec and a VSX memory instruction behave
the same.  If, on the other hand, you have an unaligned address, you
will get different bytes loaded or stored.
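To make the difference concrete, here is a small standalone sketch (mine,
not part of the patch) that computes the effective address each
instruction family would use for a misaligned pointer:

/* Standalone sketch, not part of the patch: which 16 bytes the Altivec
   LVX and the VSX LXVW4X instructions would access for a misaligned
   pointer.  LVX masks the effective address with -16 (dropping the low
   4 bits); LXVW4X uses the address exactly as given.  */
#include <stdint.h>
#include <stdio.h>

int main (void)
{
  unsigned char buf[32] __attribute__ ((aligned (16)));
  unsigned char *p = buf + 5;			/* deliberately misaligned */

  uintptr_t lvx_ea    = (uintptr_t) p & -16;	/* LVX: low 4 bits ignored */
  uintptr_t lxvw4x_ea = (uintptr_t) p;		/* LXVW4X: used verbatim */

  /* Prints: LVX loads buf[0..15], LXVW4X loads buf[5..20].  */
  printf ("LVX    loads buf[%d..%d]\n",
	  (int) (lvx_ea - (uintptr_t) buf),
	  (int) (lvx_ea - (uintptr_t) buf + 15));
  printf ("LXVW4X loads buf[%d..%d]\n",
	  (int) (lxvw4x_ea - (uintptr_t) buf),
	  (int) (lxvw4x_ea - (uintptr_t) buf + 15));
  return 0;
}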
The PowerPC compiler has a full set of overloaded vector intrinsic
builtin functions, including builtins for doing loads and stores.  When
I added the VSX support, I changed the compiler to do VSX loads/stores
if the user used -mvsx or -mcpu=power7, including changing the builtin
load/store functions to use the VSX instructions.  However, as I said,
you get different results for unaligned addresses.

Richard Henderson's change to libcpp/lex.c on August 21st, 2010 added
code to use the Altivec instruction set, if the compiler supports it, to
speed up the preprocessor:

2010-08-21  Richard Henderson
	    Andi Kleen
	    David S. Miller

	* configure.ac (AC_C_BIGENDIAN, AC_TYPE_UINTPTR_T): New tests.
	(ssize_t): Check via AC_TYPE_SSIZE_T instead of AC_CHECK_TYPE.
	(ptrdiff_t): Check via AC_CHECK_TYPE.
	* config.in, configure: Rebuild.
	* system.h: Include stdint.h, if available.
	* lex.c (WORDS_BIGENDIAN): Provide default.
	(acc_char_mask_misalign, acc_char_replicate, acc_char_cmp,
	acc_char_index, search_line_acc_char, repl_chars, search_line_mmx,
	search_line_sse2, search_line_sse42, init_vectorized_lexer,
	search_line_fast): New.
	(_cpp_clean_line): Use search_line_fast.  Restructure the fast
	loop to make it clear when we're leaving the loop.  Stay in the
	fast loop for non-trigraph '?'.

Recently we started to look at building internal versions of the GCC 4.6
compiler with --with-cpu=power7, and it exposed the difference between
the two loads.  So after some debate within IBM, we've come to the
conclusion that I should not have changed the semantics of
__builtin_vec_ld and __builtin_vec_st, and that we should go back to
using the Altivec form for these instructions.  However, doing so means
that anybody who has written new code explicitly for power7 since GCC
4.5 came out might now be surprised.  Unfortunately, the bug exists in
GCC 4.5 as well as in the Red Hat RHEL6 and SUSE SLES 11 SP1 compilers.

I realize that we are in stage 4 of the release process, but if we are
going to change the builtins back to the 4.4 semantics, we should do it
as soon as possible.  David suggested I poll the release managers and
other interested parties on what path we should take (make the builtins
adhere to the 4.4 semantics, or just keep the current situation).

I'm enclosing patches to make the load/store builtins go back to the
Altivec semantics, and I added vector double support to them.  In
addition, I added patches for libcpp/lex.c so that it will work with 4.5
compilers as well as 4.4 and future 4.6 compilers.  Whether or not we
decide to change the builtin semantics back, I feel the lex.c patch
should go in.

Right now, I did not add an #ifdef or -m switch to toggle to the 4.5
behaviour.  I can do this if desired (it probably is a good idea to
allow code written for 4.5 to continue to be used).  I don't know how
many people directly write code using the Altivec semantics.

I should mention that Darwin users and people using the host processor
in the PS3 who might have written Altivec-specific code will not be
affected, since those machines do not have the VSX instruction set.  It
is only the newer machines being shipped by IBM that currently have the
problem.

Sorry about all this.
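To illustrate what is at stake, here is a hypothetical example (mine,
not part of the patch) of the kind of code someone could have written
against the GCC 4.5 power7 semantics:

/* Hypothetical user code, not from the patch.  Under the GCC 4.5
   -mcpu=power7 semantics, vec_ld mapped to the unaligned VSX load
   LXVW4X, so this worked for any pointer.  With the Altivec LVX
   semantics restored, the low 4 bits of the address are silently
   dropped, and a misaligned P returns the wrong 16 bytes.  */
#include <altivec.h>

vector float
load4 (const float *p)		/* P not necessarily 16-byte aligned */
{
  return vec_ld (0, p);		/* LVX in 4.4 and 4.6; LXVW4X in 4.5 */
}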
Index: gcc/doc/extend.texi
===================================================================
--- gcc/doc/extend.texi	(revision 169112)
+++ gcc/doc/extend.texi	(working copy)
@@ -12354,6 +12354,12 @@ vector bool long vec_cmplt (vector doubl
 vector float vec_div (vector float, vector float);
 vector double vec_div (vector double, vector double);
 vector double vec_floor (vector double);
+vector double vec_ld (int, const vector double *);
+vector double vec_ld (int, const double *);
+vector double vec_ldl (int, const vector double *);
+vector double vec_ldl (int, const double *);
+vector unsigned char vec_lvsl (int, const volatile double *);
+vector unsigned char vec_lvsr (int, const volatile double *);
 vector double vec_madd (vector double, vector double, vector double);
 vector double vec_max (vector double, vector double);
 vector double vec_min (vector double, vector double);
@@ -12382,6 +12388,8 @@ vector double vec_sel (vector double, ve
 vector double vec_sub (vector double, vector double);
 vector float vec_sqrt (vector float);
 vector double vec_sqrt (vector double);
+void vec_st (vector double, int, vector double *);
+void vec_st (vector double, int, double *);
 vector double vec_trunc (vector double);
 vector double vec_xor (vector double, vector double);
 vector double vec_xor (vector double, vector bool long);
@@ -12412,6 +12420,10 @@ int vec_any_nlt (vector double, vector d
 int vec_any_numeric (vector double);
 @end smallexample
 
+Note that the @samp{vec_ld} and @samp{vec_st} builtins will always
+generate the Altivec @samp{LVX} and @samp{STVX} instructions even
+if the VSX instruction set is available.
+
 GCC provides a few other builtins on Powerpc to access certain
 instructions:
 @smallexample
 float __builtin_recipdivf (float, float);
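As an aside (my sketch, not part of the patch), the newly documented
vector double overloads are used like the existing Altivec ones; with
the restored semantics they always map to LVX/STVX even when VSX is
enabled, so the addresses must be 16-byte aligned:

/* Sketch, not part of the patch: the new vector double overloads added
   here (the vector double type itself requires -mvsx).  Both operations
   emit the Altivec LVX/STVX instructions even though VSX is enabled, so
   SRC and DST are assumed to be 16-byte aligned.  */
#include <altivec.h>

void
copy_one_vector (vector double *dst, const vector double *src)
{
  vector double tmp = vec_ld (0, src);	/* LVX, not LXVW4X */
  vec_st (tmp, 0, dst);			/* STVX, not STXVW4X */
}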
Index: libcpp/lex.c
===================================================================
--- libcpp/lex.c	(revision 169112)
+++ libcpp/lex.c	(working copy)
@@ -547,6 +547,11 @@ search_line_fast (const uchar *s, const
   const vc zero = { 0 };
 
   vc data, mask, t;
+  const uchar *unaligned_s = s;
+
+  /* While altivec loads mask addresses, we still need to align S so
+     that the offset we compute at the end is correct.  */
+  s = (const uchar *)((uintptr_t)s & -16);
 
   /* Altivec loads automatically mask addresses with -16.  This lets
      us issue the first load as early as possible.  */
@@ -555,15 +560,20 @@ search_line_fast (const uchar *s, const
   /* Discard bytes before the beginning of the buffer.  Do this by
      beginning with all ones and shifting in zeros according to the
      mis-alignment.  The LVSR instruction pulls the exact shift we
-     want from the address.  */
-  mask = __builtin_vec_lvsr(0, s);
+     want from the address.
+
+     Originally, we used s in the lvsr and did the alignment afterwards,
+     which works on a system that supports just the Altivec instruction
+     set, using the LVX instruction.  With the introduction of VSX in
+     GCC 4.5, the load became LXVW4X.  LVX ignores the bottom 4 bits of
+     the address, and LXVW4X does not.  While GCC 4.6 will revert
+     vec_ld/vec_st to produce only Altivec instructions, the possibility
+     exists that the stage1 compiler was built with a compiler that
+     generated LXVW4X.  This code will work on either system.  */
+  mask = __builtin_vec_lvsr(0, unaligned_s);
   mask = __builtin_vec_perm(zero, ones, mask);
   data &= mask;
 
-  /* While altivec loads mask addresses, we still need to align S so
-     that the offset we compute at the end is correct.  */
-  s = (const uchar *)((uintptr_t)s & -16);
-
   /* Main loop processing 16 bytes at a time.  */
   goto start;
   do
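For readers unfamiliar with the LVSR idiom used above, here is a
standalone sketch (mine, not part of the patch) of the byte mask that
the lvsr/vperm pair computes:

/* Standalone sketch, not part of the patch: emulate the mask built by
   mask = vec_perm (zero, ones, lvsr (0, s)).  For an address whose low
   4 bits are N, lvsr returns the permute vector {16-N, ..., 31-N};
   permuting {zero, ones} through it yields N 0x00 bytes followed by
   16-N 0xff bytes, which zeroes out the bytes that precede the start
   of the buffer in the masked (aligned) LVX load.  */
#include <stdio.h>

int main (void)
{
  unsigned n = 5 & 15;		/* low 4 bits of a sample address */
  unsigned char mask[16];

  for (unsigned i = 0; i < 16; i++)
    {
      unsigned sel = 16 - n + i;		/* lvsr permute element */
      mask[i] = (sel < 16) ? 0x00 : 0xff;	/* pick from zero or ones */
    }

  /* Prints: 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff  */
  for (unsigned i = 0; i < 16; i++)
    printf ("%02x%c", mask[i], i == 15 ? '\n' : ' ');
  return 0;
}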