From patchwork Mon Jan 24 21:31:33 2011
From: Michael Meissner
Date: Mon, 24 Jan 2011 16:31:33 -0500
To: gcc-patches@gcc.gnu.org, dje.gcc@gmail.com, rth@redhat.com,
	rguenther@suse.de, jakub@redhat.com, berner@vnet.ibm.com,
	geoffk@geoffk.org, mark@codesourcery.com, joseph@codesourcery.com,
	pinskia@gmail.com, dominiq@lps.ens.fr
Subject: [PATCH] Fix PR 47272 to restore Altivec vec_ld/vec_st
Message-ID: <20110124213133.GA21518@hungry-tiger.westford.ibm.com>

This patch fixes bug PR target/47272, but I'm sending it out to a wider
audience to solicit feedback from other developers on how to resolve a
sticky situation with PowerPC code generation.

For those of you who don't know the Power architecture, particularly the
VSX extensions: the first main vector extension was the Altivec (VMX)
vector support.  The VSX vector support adds more floating point vector
registers and overlaps with the Altivec support.  In particular, the
vector loads and stores of Altivec ignore the bottom 4 bits of the
address, while the vector loads and stores of the VSX instruction set do
not, and will do unaligned loads and stores.  Obviously, if the address
is completely aligned, an Altivec and a VSX memory instruction behave
the same.  If, on the other hand, you have an unaligned address, you
will get different bytes loaded or stored.
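To make the difference concrete, here is a small standalone sketch (mine,
not part of the patch) that computes the effective address each
instruction family would use for a misaligned pointer:

/* Standalone sketch, not part of the patch: which 16 bytes the Altivec
   LVX and the VSX LXVW4X instructions would access for a misaligned
   pointer.  LVX masks the effective address with -16 (dropping the low
   4 bits); LXVW4X uses the address exactly as given.  */
#include <stdint.h>
#include <stdio.h>

int main (void)
{
  unsigned char buf[32] __attribute__ ((aligned (16)));
  unsigned char *p = buf + 5;			/* deliberately misaligned */

  uintptr_t lvx_ea    = (uintptr_t) p & -16;	/* LVX: low 4 bits ignored */
  uintptr_t lxvw4x_ea = (uintptr_t) p;		/* LXVW4X: used verbatim */

  /* Prints: LVX loads buf[0..15], LXVW4X loads buf[5..20].  */
  printf ("LVX    loads buf[%d..%d]\n",
	  (int) (lvx_ea - (uintptr_t) buf),
	  (int) (lvx_ea - (uintptr_t) buf + 15));
  printf ("LXVW4X loads buf[%d..%d]\n",
	  (int) (lxvw4x_ea - (uintptr_t) buf),
	  (int) (lxvw4x_ea - (uintptr_t) buf + 15));
  return 0;
}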
The PowerPC compiler has a full set of overloaded vector intrinsic
builtin functions, including builtins for doing loads and stores.  When
I added the VSX support, I changed the compiler to do VSX loads/stores
if the user used -mvsx or -mcpu=power7, including changing the builtin
load/store functions to use the VSX instructions.  However, as I said,
you get different results for unaligned addresses.

Richard Henderson's change to libcpp/lex.c on August 21st, 2010 added
code to use the Altivec instruction set, if the compiler supports it, to
speed up the preprocessor:

2010-08-21  Richard Henderson
	    Andi Kleen
	    David S. Miller

	* configure.ac (AC_C_BIGENDIAN, AC_TYPE_UINTPTR_T): New tests.
	(ssize_t): Check via AC_TYPE_SSIZE_T instead of AC_CHECK_TYPE.
	(ptrdiff_t): Check via AC_CHECK_TYPE.
	* config.in, configure: Rebuild.
	* system.h: Include stdint.h, if available.
	* lex.c (WORDS_BIGENDIAN): Provide default.
	(acc_char_mask_misalign, acc_char_replicate, acc_char_cmp,
	acc_char_index, search_line_acc_char, repl_chars, search_line_mmx,
	search_line_sse2, search_line_sse42, init_vectorized_lexer,
	search_line_fast): New.
	(_cpp_clean_line): Use search_line_fast.  Restructure the fast
	loop to make it clear when we're leaving the loop.  Stay in the
	fast loop for non-trigraph '?'.

Recently we started to look at building internal versions of the GCC 4.6
compiler with --with-cpu=power7, and it exposed the difference between
the two loads.  So after some debate within IBM, we've come to the
conclusion that I should not have changed the semantics of
__builtin_vec_ld and __builtin_vec_st, and that we should go back to
using the Altivec form for these instructions.  However, doing so means
that anybody who has written new code explicitly for power7 since GCC
4.5 came out might now be surprised.  Unfortunately, the bug exists in
GCC 4.5 as well as in the Red Hat RHEL6 and SUSE SLES 11 SP1 compilers.

I realize that we are in stage 4 of the release process, but if we are
going to change the builtins back to the 4.4 semantics, we should do it
as soon as possible.  David suggested I poll the release managers and
other interested parties on what path we should take (make the builtins
adhere to the 4.4 semantics, or just keep the current situation).

I'm enclosing patches to make the load/store builtins go back to the
Altivec semantics, and I added vector double support to them.  In
addition, I added patches for libcpp/lex.c so that it will work with 4.5
compilers as well as 4.4 and future 4.6 compilers.  Whether or not we
decide to change the builtin semantics back, I feel the lex.c patch
should go in.

Right now, I did not add an #ifdef or -m switch to toggle to the 4.5
behaviour.  I can do this if desired (it probably is a good idea to
allow code written for 4.5 to continue to be used).  I don't know how
many people directly write code using the Altivec semantics.

I should mention that Darwin users and people using the host processor
in the PS3 who might have written Altivec-specific code will not be
affected, since those machines do not have the VSX instruction set.  It
is only the newer machines being shipped by IBM that currently have the
problem.

Sorry about all this.
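To illustrate what is at stake, here is a hypothetical example (mine,
not part of the patch) of the kind of code someone could have written
against the GCC 4.5 power7 semantics:

/* Hypothetical user code, not from the patch.  Under the GCC 4.5
   -mcpu=power7 semantics, vec_ld mapped to the unaligned VSX load
   LXVW4X, so this worked for any pointer.  With the Altivec LVX
   semantics restored, the low 4 bits of the address are silently
   dropped, and a misaligned P returns the wrong 16 bytes.  */
#include <altivec.h>

vector float
load4 (const float *p)		/* P not necessarily 16-byte aligned */
{
  return vec_ld (0, p);		/* LVX in 4.4 and 4.6; LXVW4X in 4.5 */
}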
Index: gcc/doc/extend.texi
===================================================================
--- gcc/doc/extend.texi	(revision 169112)
+++ gcc/doc/extend.texi	(working copy)
@@ -12354,6 +12354,12 @@ vector bool long vec_cmplt (vector doubl
 vector float vec_div (vector float, vector float);
 vector double vec_div (vector double, vector double);
 vector double vec_floor (vector double);
+vector double vec_ld (int, const vector double *);
+vector double vec_ld (int, const double *);
+vector double vec_ldl (int, const vector double *);
+vector double vec_ldl (int, const double *);
+vector unsigned char vec_lvsl (int, const volatile double *);
+vector unsigned char vec_lvsr (int, const volatile double *);
 vector double vec_madd (vector double, vector double, vector double);
 vector double vec_max (vector double, vector double);
 vector double vec_min (vector double, vector double);
@@ -12382,6 +12388,8 @@ vector double vec_sel (vector double, ve
 vector double vec_sub (vector double, vector double);
 vector float vec_sqrt (vector float);
 vector double vec_sqrt (vector double);
+void vec_st (vector double, int, vector double *);
+void vec_st (vector double, int, double *);
 vector double vec_trunc (vector double);
 vector double vec_xor (vector double, vector double);
 vector double vec_xor (vector double, vector bool long);
@@ -12412,6 +12420,10 @@ int vec_any_nlt (vector double, vector d
 int vec_any_numeric (vector double);
 @end smallexample
 
+Note that the @samp{vec_ld} and @samp{vec_st} builtins will always
+generate the Altivec @samp{LVX} and @samp{STVX} instructions even
+if the VSX instruction set is available.
+
 GCC provides a few other builtins on Powerpc to access certain
 instructions:
 @smallexample
 float __builtin_recipdivf (float, float);
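As an aside (my sketch, not part of the patch), the newly documented
vector double overloads are used like the existing Altivec ones; with
the restored semantics they always map to LVX/STVX even when VSX is
enabled, so the addresses must be 16-byte aligned:

/* Sketch, not part of the patch: the new vector double overloads added
   here (the vector double type itself requires -mvsx).  Both operations
   emit the Altivec LVX/STVX instructions even though VSX is enabled, so
   SRC and DST are assumed to be 16-byte aligned.  */
#include <altivec.h>

void
copy_one_vector (vector double *dst, const vector double *src)
{
  vector double tmp = vec_ld (0, src);	/* LVX, not LXVW4X */
  vec_st (tmp, 0, dst);			/* STVX, not STXVW4X */
}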
Index: libcpp/lex.c
===================================================================
--- libcpp/lex.c	(revision 169112)
+++ libcpp/lex.c	(working copy)
@@ -547,6 +547,11 @@ search_line_fast (const uchar *s, const
   const vc zero = { 0 };
 
   vc data, mask, t;
+  const uchar *unaligned_s = s;
+
+  /* While altivec loads mask addresses, we still need to align S so
+     that the offset we compute at the end is correct.  */
+  s = (const uchar *)((uintptr_t)s & -16);
 
   /* Altivec loads automatically mask addresses with -16.  This lets
      us issue the first load as early as possible.  */
@@ -555,15 +560,20 @@ search_line_fast (const uchar *s, const
   /* Discard bytes before the beginning of the buffer.  Do this by
      beginning with all ones and shifting in zeros according to the
      mis-alignment.  The LVSR instruction pulls the exact shift we
-     want from the address.  */
-  mask = __builtin_vec_lvsr(0, s);
+     want from the address.
+
+     Originally, we used s in the lvsr and did the alignment afterwards,
+     which works on a system that supports just the Altivec instruction
+     set, using the LVX instruction.  With the introduction of VSX in
+     GCC 4.5, the load became LXVW4X.  LVX ignores the bottom 4 bits of
+     the address, and LXVW4X does not.  While GCC 4.6 will revert
+     vec_ld/vec_st to produce only Altivec instructions, the possibility
+     exists that the stage1 compiler was built with a compiler that
+     generated LXVW4X.  This code will work on either system.  */
+  mask = __builtin_vec_lvsr(0, unaligned_s);
   mask = __builtin_vec_perm(zero, ones, mask);
   data &= mask;
 
-  /* While altivec loads mask addresses, we still need to align S so
-     that the offset we compute at the end is correct.  */
-  s = (const uchar *)((uintptr_t)s & -16);
-
   /* Main loop processing 16 bytes at a time.  */
   goto start;
   do
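For readers unfamiliar with the LVSR idiom used above, here is a
standalone sketch (mine, not part of the patch) of the byte mask that
the lvsr/vperm pair computes:

/* Standalone sketch, not part of the patch: emulate the mask built by
   mask = vec_perm (zero, ones, lvsr (0, s)).  For an address whose low
   4 bits are N, lvsr returns the permute vector {16-N, ..., 31-N};
   permuting {zero, ones} through it yields N 0x00 bytes followed by
   16-N 0xff bytes, which zeroes out the bytes that precede the start
   of the buffer in the masked (aligned) LVX load.  */
#include <stdio.h>

int main (void)
{
  unsigned n = 5 & 15;		/* low 4 bits of a sample address */
  unsigned char mask[16];

  for (unsigned i = 0; i < 16; i++)
    {
      unsigned sel = 16 - n + i;		/* lvsr permute element */
      mask[i] = (sel < 16) ? 0x00 : 0xff;	/* pick from zero or ones */
    }

  /* Prints: 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff  */
  for (unsigned i = 0; i < 16; i++)
    printf ("%02x%c", mask[i], i == 15 ? '\n' : ' ');
  return 0;
}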