[V3] powerpc: Optimized strncmp for power10

Message ID 20240429095847.3541150-1-amritahs@linux.vnet.ibm.com
State New
Series [V3] powerpc: Optimized strncmp for power10

Commit Message

Amrita H S April 29, 2024, 9:58 a.m. UTC
This patch is based on __strcmp_power10.

Improvements from __strncmp_power9:

    1. Uses new POWER10 instructions
       - This code uses lxvp to decrease contention on load
         by loading 32 bytes per instruction.

    2. Performance implication
       - This version has around 38% better performance on average.
       - A minor performance regression is seen for a few small sizes
         and specific combinations of alignments.
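
For reference, a scalar model of what one COMPARE_16 step in the patch
computes (vcmpnezb marks the first byte pair that differs or contains
a NUL, and vctzlsbb extracts its index).  This is a sketch only, not
the patch's vector code:

    /* Scalar sketch of one 16-byte compare step: return the index of
       the first byte position that mismatches or holds a NUL, or 16
       if the whole block is equal with no NUL.  */
    static unsigned int
    first_mismatch_or_nul (const unsigned char *p1,
                           const unsigned char *p2)
    {
      for (unsigned int i = 0; i < 16; i++)
        if (p1[i] != p2[i] || p1[i] == '\0')
          return i;
      return 16;
    }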

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
---
 .../powerpc/powerpc64/le/power10/strncmp.S    | 271 ++++++++++++++++++
 sysdeps/powerpc/powerpc64/multiarch/Makefile  |   2 +-
 .../powerpc64/multiarch/ifunc-impl-list.c     |   3 +
 .../powerpc64/multiarch/strncmp-power10.S     |  25 ++
 sysdeps/powerpc/powerpc64/multiarch/strncmp.c |   4 +
 5 files changed, 304 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/powerpc/powerpc64/le/power10/strncmp.S
 create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strncmp-power10.S

Comments

Peter Bergner May 3, 2024, 9:31 p.m. UTC | #1
On 4/29/24 4:58 AM, Amrita H S wrote:
> +++ b/sysdeps/powerpc/powerpc64/le/power10/strncmp.S
> @@ -0,0 +1,271 @@
> +/* Optimized strncmp implementation for PowerPC64/POWER10.
> +   Copyright (C) 2021-2023 Free Software Foundation, Inc.

This is a new file, so I believe the Copyright date should just be "2024".


> +++ b/sysdeps/powerpc/powerpc64/multiarch/strncmp-power10.S
> @@ -0,0 +1,25 @@
> +/* Copyright (C) 2016-2023 Free Software Foundation, Inc.

Likewise.



> +ENTRY_TOCLESS (STRNCMP, 4)
> +	/* Check if size is 0.  */
> +	cmpdi	 cr0,r5,0
> +	beq	 cr0,L(ret0)
> +	andi.   r7,r3,4095
> +	andi.   r8,r4,4095
> +	cmpldi  cr0,r7,4096-16
> +	cmpldi  cr1,r8,4096-16
> +	bgt     cr0,L(crosses)
> +	bgt     cr1,L(crosses)
> +	COMPARE_16(v4,v5,0)
> +	addi	r3,r3,16
> +	addi	r4,r4,16

This code looks like it assumes the kernel is using a 4k page size.
All distros that I know of, as well as the default kernel config for
ppc64 and ppc64le kernels, use a 64K HW page size.  Is there a reason
we're not checking for a 64k page boundary here?

Adhemerval, you seem to have added the first optimized power8 strncmp.S
routine (sysdeps/powerpc/powerpc64/power8/strncmp.S), and that also uses
a 4k page boundary.  Do you remember the history of why we checked for
a 4k page boundary rather than 64k?  Was it a matter of 64k showing no
improvement over 4k, while using 4k meant we didn't have to worry about
some system maybe running a 4k page size kernel?

Peter
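
For reference, the andi./cmpldi pair being discussed implements roughly
the following predicate: a 16-byte load starting at p stays within one
4K page iff its offset into the page is at most 4096 - 16.  A minimal
C sketch:

    #include <stdint.h>

    /* Sketch of the patch's page-cross guard: true when bytes
       p..p+15 may straddle a 4K page boundary.  */
    static inline int
    may_cross_4k_page (const void *p)
    {
      return ((uintptr_t) p & 4095) > 4096 - 16;
    }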
Florian Weimer May 6, 2024, 6:13 a.m. UTC | #2
* Peter Bergner:

> This code looks like it assumes the kernel is using a 4k page size.
> All distros that I know of, as well as the default kernel config for
> ppc64 and ppc64le kernels, use a 64K HW page size.  Is there a reason
> we're not checking for a 64k page boundary here?

The ABI says the minimum page size is 4K, right?

I suppose we could add a page size check on startup and make the glibc
page size requirement a configure- and build-time setting.

Thanks,
Florian
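
A minimal sketch of the startup check Florian suggests, assuming the
build fixes a minimum page size at configure time.  ASSUMED_MIN_PAGE_SIZE
is a hypothetical configure-time macro; getauxval and AT_PAGESZ are the
standard Linux interface:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/auxv.h>

    #ifndef ASSUMED_MIN_PAGE_SIZE
    # define ASSUMED_MIN_PAGE_SIZE 4096  /* hypothetical configure value */
    #endif

    /* Abort early if the kernel page size is smaller than the page
       size the string routines were built to assume.  */
    static void
    check_page_size (void)
    {
      unsigned long ps = getauxval (AT_PAGESZ);
      if (ps < ASSUMED_MIN_PAGE_SIZE)
        {
          fprintf (stderr, "fatal: page size %lu below assumed %d\n",
                   ps, ASSUMED_MIN_PAGE_SIZE);
          abort ();
        }
    }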
Peter Bergner May 6, 2024, 12:55 p.m. UTC | #3
On 5/6/24 1:13 AM, Florian Weimer wrote:
> * Peter Bergner:
> 
>> This code looks like it assumes the kernel is using a 4k page size.
>> All distros that I know of, as well as the default kernel config for
>> ppc64 and ppc64le kernels, use a 64K HW page size.  Is there a reason
>> we're not checking for a 64k page boundary here?
> 
> The ABI says the minimum page size is 4K, right?

The HW supports a 4K page size, but none of the major distros use that;
they all use 64K pages.  That said, one of our kernel developers said
there are a couple of minor distros building with 4K pages, so I guess
we should go with that until...


> I suppose we could add a page size check on startup and make the glibc
> page size requirement a configure- and build-time setting.

Yes, this sounds like a good future project, but no need to hold
this patch up for that.


Peter
Peter Bergner May 6, 2024, 1:01 p.m. UTC | #4
On 5/3/24 4:31 PM, Peter Bergner wrote:
> This code looks like it assumes the kernel is using a 4k page size.
> All distros that I know of, as well as the default kernel config for
> ppc64 and ppc64le kernels, use a 64K HW page size.  Is there a reason
> we're not checking for a 64k page boundary here?

After talking with one of our kernel developers, I learned there are a
couple of minor Linux distros building ppc64* kernels with 4K pages, so
that answers my question about why we're using that here.

A good future project would be to either add a configure option to
state the page size to use, or detect it at runtime from the
AT_PAGESZ AUXV entry and choose the right routine (roughly as in the
sketch after this message).  Either way, this current patch...

LGTM.

Reviewed-by: Peter Bergner <bergner@linux.ibm.com>


I'll take care of updating the Copyright dates and merging it for you.
Thanks!

Peter
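
Peter's runtime-selection idea could look roughly like the sketch below.
The two variant names are hypothetical, and glibc's real ifunc resolvers
are wired up differently; this only illustrates keying the choice on
AT_PAGESZ:

    #include <stddef.h>
    #include <sys/auxv.h>

    extern int __strncmp_power10_4k (const char *, const char *, size_t);
    extern int __strncmp_power10_64k (const char *, const char *, size_t);

    /* Pick the variant whose page-cross guard matches the page size
       the kernel actually uses.  */
    static int
    (*select_strncmp (void)) (const char *, const char *, size_t)
    {
      return getauxval (AT_PAGESZ) >= 65536
             ? __strncmp_power10_64k
             : __strncmp_power10_4k;
    }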
Adhemerval Zanella Netto May 6, 2024, 1:04 p.m. UTC | #5
On 03/05/24 18:31, Peter Bergner wrote:
> On 4/29/24 4:58 AM, Amrita H S wrote:
>> +++ b/sysdeps/powerpc/powerpc64/le/power10/strncmp.S
>> @@ -0,0 +1,271 @@
>> +/* Optimized strncmp implementation for PowerPC64/POWER10.
>> +   Copyright (C) 2021-2023 Free Software Foundation, Inc.
> 
> This is a new file, so I believe the Copyright date should just be "2024".
> 
> 
>> +++ b/sysdeps/powerpc/powerpc64/multiarch/strncmp-power10.S
>> @@ -0,0 +1,25 @@
>> +/* Copyright (C) 2016-2023 Free Software Foundation, Inc.
> 
> Likewise.
> 
> 
> 
>> +ENTRY_TOCLESS (STRNCMP, 4)
>> +	/* Check if size is 0.  */
>> +	cmpdi	 cr0,r5,0
>> +	beq	 cr0,L(ret0)
>> +	andi.   r7,r3,4095
>> +	andi.   r8,r4,4095
>> +	cmpldi  cr0,r7,4096-16
>> +	cmpldi  cr1,r8,4096-16
>> +	bgt     cr0,L(crosses)
>> +	bgt     cr1,L(crosses)
>> +	COMPARE_16(v4,v5,0)
>> +	addi	r3,r3,16
>> +	addi	r4,r4,16
> 
> This code looks like it assumes the kernel is using a 4k page size.
> All distros that I know of, as well as the default kernel config for
> ppc64 and ppc64le kernels, use a 64K HW page size.  Is there a reason
> we're not checking for a 64k page boundary here?
> 
> Adhemerval, you seem to have added the first optimized power8 strncmp.S
> routine (sysdeps/powerpc/powerpc64/power8/strncmp.S), and that also uses
> a 4k page boundary.  Do you remember the history of why we checked for
> a 4k page boundary rather than 64k?  Was it a matter of 64k showing no
> improvement over 4k, while using 4k meant we didn't have to worry about
> some system maybe running a 4k page size kernel?

If I recall correctly, it was to avoid tying the implementation to a
specific page size, since the ABI still allows 4k page sizes.  I think
both branches are highly unlikely to be taken, so branch prediction
will most likely get a high hit rate.

We can also try to make it dynamic if you think these checks are really
costly; this would mean adding two extra loads and possibly an extra
cache line hit (one for the GLRO struct, another for dl_pagesize).  I
don't think this is worth it.

Another question is whether these checks still make sense for POWER10:
are cross-page reads still as costly as they were on POWER8?
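
The dynamic variant described here would read the page size from
glibc's internal GLRO (dl_pagesize) instead of hard-coding 4096; the
extra load is exactly the cost being weighed above.  A sketch
(ldsodefs.h and GLRO are glibc-internal, so this only builds inside
the glibc tree):

    #include <stddef.h>
    #include <stdint.h>
    #include <ldsodefs.h>   /* glibc-internal; provides GLRO.  */

    /* True when bytes p..p+15 may straddle a page boundary, for
       whatever page size the kernel actually uses.  */
    static inline int
    may_cross_page (const void *p)
    {
      size_t ps = GLRO (dl_pagesize);
      return ((uintptr_t) p & (ps - 1)) > ps - 16;
    }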
Peter Bergner May 6, 2024, 2:10 p.m. UTC | #6
On 5/6/24 8:01 AM, Peter Bergner wrote:
> I'll take care of updating the Copyright dates and merging it for you.

Pushed now.

Peter
Peter Bergner May 6, 2024, 2:16 p.m. UTC | #7
On 5/6/24 8:04 AM, Adhemerval Zanella Netto wrote:
> If I recall correctly, it was to avoid tying the implementation to a
> specific page size, since the ABI still allows 4k page sizes.  I think
> both branches are highly unlikely to be taken, so branch prediction
> will most likely get a high hit rate.

Even though the hardware supports 4K pages, I thought we never built it
that way and the major distros all build with 64K pages, but I learned
there are some minor distros that use 4K pages, so I agree we should
use that here.


> We can also try to make it dynamic if you think these checks are really
> costly; this would mean adding two extra loads and possibly an extra
> cache line hit (one for the GLRO struct, another for dl_pagesize).  I
> don't think this is worth it.

I don't know that they're costly; I just thought that if they're useless
because we always use 64K pages, then it seems dumb to check the 4K
boundary.  Since we can/might have 4K pages, the patch code is correct
as is.


> Another question is whether these checks still make sense for POWER10:
> are cross-page reads still as costly as they were on POWER8?

I'm not 100% sure and it would be something we'd need to test, but I suspect
it probably hasn't changed too much???

Peter

Patch

diff --git a/sysdeps/powerpc/powerpc64/le/power10/strncmp.S b/sysdeps/powerpc/powerpc64/le/power10/strncmp.S
new file mode 100644
index 0000000000..7d58a62358
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/le/power10/strncmp.S
@@ -0,0 +1,271 @@ 
+/* Optimized strncmp implementation for PowerPC64/POWER10.
+   Copyright (C) 2021-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+/* Implements the function
+
+   int [r3] strncmp (const char *s1 [r3], const char *s2 [r4], size_t [r5] n)
+
+   The implementation uses unaligned doubleword access to avoid specialized
+   code paths depending of data alignment for first 32 bytes and uses
+   vectorised loops after that.  */
+
+#ifndef STRNCMP
+# define STRNCMP strncmp
+#endif
+
+/* TODO: Change this to actual instructions when minimum binutils is upgraded
+   to 2.27.  Macros are defined below for these newer instructions in order
+   to maintain compatibility.  */
+
+#define LXVP(xtp,dq,ra)              \
+	.long(((6)<<(32-6))          \
+	| ((((xtp)-32)>>1)<<(32-10)) \
+	| ((1)<<(32-11))             \
+	| ((ra)<<(32-16))            \
+	| dq)
+
+#define COMPARE_16(vreg1,vreg2,offset) \
+	lxv	  vreg1+32,offset(r3); \
+	lxv	  vreg2+32,offset(r4); \
+	vcmpnezb. v7,vreg1,vreg2;      \
+	bne	  cr6,L(different);    \
+	cmpldi	  cr7,r5,16;           \
+	ble	  cr7,L(ret0);         \
+	addi	  r5,r5,-16;
+
+#define COMPARE_32(vreg1,vreg2,offset,label1,label2) \
+	LXVP(vreg1+32,offset,r3);                    \
+	LXVP(vreg2+32,offset,r4);                    \
+	vcmpnezb. v7,vreg1+1,vreg2+1;                \
+	bne	  cr6,L(label1);                     \
+	vcmpnezb. v7,vreg1,vreg2;                    \
+	bne	  cr6,L(label2);                     \
+	cmpldi	  cr7,r5,32;                         \
+	ble	  cr7,L(ret0);                       \
+	addi	  r5,r5,-32;
+
+#define TAIL_FIRST_16B(vreg1,vreg2) \
+	vctzlsbb r6,v7;             \
+	cmpld	 cr7,r5,r6;         \
+	ble	 cr7,L(ret0);       \
+	vextubrx r5,r6,vreg1;       \
+	vextubrx r4,r6,vreg2;       \
+	subf	 r3,r4,r5;          \
+	blr;
+
+#define TAIL_SECOND_16B(vreg1,vreg2) \
+	vctzlsbb r6,v7;              \
+	addi	 r0,r6,16;           \
+	cmpld	 cr7,r5,r0;          \
+	ble	 cr7,L(ret0);        \
+	vextubrx r5,r6,vreg1;        \
+	vextubrx r4,r6,vreg2;        \
+	subf	 r3,r4,r5;           \
+	blr;
+
+#define CHECK_N_BYTES(reg1,reg2,len_reg) \
+	sldi	  r6,len_reg,56;	 \
+	lxvl	  32+v4,reg1,r6;	 \
+	lxvl	  32+v5,reg2,r6;	 \
+	add	  reg1,reg1,len_reg;	 \
+	add	  reg2,reg2,len_reg;	 \
+	vcmpnezb  v7,v4,v5;		 \
+	vctzlsbb  r6,v7;		 \
+	cmpld	  cr7,r6,len_reg;	 \
+	blt	  cr7,L(different);	 \
+	cmpld	  cr7,r5,len_reg;	 \
+	ble	  cr7,L(ret0);		 \
+	sub	  r5,r5,len_reg;	 \
+
+	/* TODO: change this to .machine power10 when the minimum required
+	 binutils allows it.  */
+	.machine  power9
+ENTRY_TOCLESS (STRNCMP, 4)
+	/* Check if size is 0.  */
+	cmpdi	 cr0,r5,0
+	beq	 cr0,L(ret0)
+	andi.   r7,r3,4095
+	andi.   r8,r4,4095
+	cmpldi  cr0,r7,4096-16
+	cmpldi  cr1,r8,4096-16
+	bgt     cr0,L(crosses)
+	bgt     cr1,L(crosses)
+	COMPARE_16(v4,v5,0)
+	addi	r3,r3,16
+	addi	r4,r4,16
+
+L(crosses):
+	andi.	 r7,r3,15
+	subfic	 r7,r7,16	/* r7(nalign1) = 16 - (str1 & 15).  */
+	andi.	 r9,r4,15
+	subfic	 r8,r9,16	/* r8(nalign2) = 16 - (str2 & 15).  */
+	cmpld	 cr7,r7,r8
+	beq	 cr7,L(same_aligned)
+	blt	 cr7,L(nalign1_min)
+
+	/* nalign2 is minimum and s2 pointer is aligned.  */
+	CHECK_N_BYTES(r3,r4,r8)
+	/* Are we on the 64B hunk which crosses a page?  */
+	andi.   r10,r3,63       /* Determine offset into 64B hunk.  */
+	andi.   r8,r3,15        /* The offset into the 16B hunk.  */
+	neg     r7,r3
+	andi.   r9,r7,15        /* Number of bytes after a 16B cross.  */
+	rlwinm. r7,r7,26,0x3F   /* ((r4-4096))>>6&63.  */
+	beq     L(compare_64_pagecross)
+	mtctr   r7
+	b       L(compare_64B_unaligned)
+
+	/* nalign1 is minimum and s1 pointer is aligned.  */
+L(nalign1_min):
+	CHECK_N_BYTES(r3,r4,r7)
+	/* Are we on the 64B hunk which crosses a page?  */
+	andi.   r10,r4,63       /* Determine offset into 64B hunk.  */
+	andi.   r8,r4,15        /* The offset into the 16B hunk.  */
+	neg     r7,r4
+	andi.   r9,r7,15        /* Number of bytes after a 16B cross.  */
+	rlwinm. r7,r7,26,0x3F   /* ((r4-4096))>>6&63.  */
+	beq     L(compare_64_pagecross)
+	mtctr   r7
+
+	.p2align 5
+L(compare_64B_unaligned):
+	COMPARE_16(v4,v5,0)
+	COMPARE_16(v4,v5,16)
+	COMPARE_16(v4,v5,32)
+	COMPARE_16(v4,v5,48)
+	addi    r3,r3,64
+	addi    r4,r4,64
+	bdnz    L(compare_64B_unaligned)
+
+	/* Cross the page boundary of s2, carefully. Only for first
+	iteration we have to get the count of 64B blocks to be checked.
+	From second iteration and beyond, loop counter is always 63.  */
+L(compare_64_pagecross):
+	li      r11, 63
+	mtctr   r11
+	cmpldi  r10,16
+	ble     L(cross_4)
+	cmpldi  r10,32
+	ble     L(cross_3)
+	cmpldi  r10,48
+	ble     L(cross_2)
+L(cross_1):
+	CHECK_N_BYTES(r3,r4,r9)
+	CHECK_N_BYTES(r3,r4,r8)
+	COMPARE_16(v4,v5,0)
+	COMPARE_16(v4,v5,16)
+	COMPARE_16(v4,v5,32)
+	addi    r3,r3,48
+	addi    r4,r4,48
+	b       L(compare_64B_unaligned)
+L(cross_2):
+	COMPARE_16(v4,v5,0)
+	addi    r3,r3,16
+	addi    r4,r4,16
+	CHECK_N_BYTES(r3,r4,r9)
+	CHECK_N_BYTES(r3,r4,r8)
+	COMPARE_16(v4,v5,0)
+	COMPARE_16(v4,v5,16)
+	addi    r3,r3,32
+	addi    r4,r4,32
+	b       L(compare_64B_unaligned)
+L(cross_3):
+	COMPARE_16(v4,v5,0)
+	COMPARE_16(v4,v5,16)
+	addi    r3,r3,32
+	addi    r4,r4,32
+	CHECK_N_BYTES(r3,r4,r9)
+	CHECK_N_BYTES(r3,r4,r8)
+	COMPARE_16(v4,v5,0)
+	addi    r3,r3,16
+	addi    r4,r4,16
+	b       L(compare_64B_unaligned)
+L(cross_4):
+	COMPARE_16(v4,v5,0)
+	COMPARE_16(v4,v5,16)
+	COMPARE_16(v4,v5,32)
+	addi    r3,r3,48
+	addi    r4,r4,48
+	CHECK_N_BYTES(r3,r4,r9)
+	CHECK_N_BYTES(r3,r4,r8)
+	b       L(compare_64B_unaligned)
+
+L(same_aligned):
+	CHECK_N_BYTES(r3,r4,r7)
+	/* Align s1 to 32B and adjust s2 address.
+	   Use lxvp only if both s1 and s2 are 32B aligned.  */
+	COMPARE_16(v4,v5,0)
+	COMPARE_16(v4,v5,16)
+	COMPARE_16(v4,v5,32)
+	COMPARE_16(v4,v5,48)
+	addi	r3,r3,64
+	addi	r4,r4,64
+	COMPARE_16(v4,v5,0)
+	COMPARE_16(v4,v5,16)
+	addi	r5,r5,32
+
+	clrldi  r6,r3,59
+	subfic	r7,r6,32
+	add	r3,r3,r7
+	add	r4,r4,r7
+	subf	r5,r7,r5
+	andi.	r7,r4,0x1F
+	beq	cr0,L(32B_aligned_loop)
+
+	.p2align 5
+L(16B_aligned_loop):
+	COMPARE_16(v4,v5,0)
+	COMPARE_16(v4,v5,16)
+	COMPARE_16(v4,v5,32)
+	COMPARE_16(v4,v5,48)
+	addi	r3,r3,64
+	addi	r4,r4,64
+	b	L(16B_aligned_loop)
+
+	/* Calculate and return the difference.  */
+L(different):
+	TAIL_FIRST_16B(v4,v5)
+
+	.p2align 5
+L(32B_aligned_loop):
+	COMPARE_32(v14,v16,0,tail1,tail2)
+	COMPARE_32(v18,v20,32,tail3,tail4)
+	COMPARE_32(v22,v24,64,tail5,tail6)
+	COMPARE_32(v26,v28,96,tail7,tail8)
+	addi	r3,r3,128
+	addi	r4,r4,128
+	b	L(32B_aligned_loop)
+
+L(tail1): TAIL_FIRST_16B(v15,v17)
+L(tail2): TAIL_SECOND_16B(v14,v16)
+L(tail3): TAIL_FIRST_16B(v19,v21)
+L(tail4): TAIL_SECOND_16B(v18,v20)
+L(tail5): TAIL_FIRST_16B(v23,v25)
+L(tail6): TAIL_SECOND_16B(v22,v24)
+L(tail7): TAIL_FIRST_16B(v27,v29)
+L(tail8): TAIL_SECOND_16B(v26,v28)
+
+	.p2align 5
+L(ret0):
+	li	r3,0
+	blr
+
+END(STRNCMP)
+libc_hidden_builtin_def(strncmp)
diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
index 594fbb8058..2233d3e708 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
+++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
@@ -34,7 +34,7 @@  ifneq (,$(filter %le,$(config-machine)))
 sysdep_routines += memchr-power10 memcmp-power10 memcpy-power10 \
 		   memmove-power10 memset-power10 rawmemchr-power9 \
 		   rawmemchr-power10 strcmp-power9 strcmp-power10 \
-		   strncmp-power9 strcpy-power9 stpcpy-power9 \
+		   strncmp-power9 strncmp-power10 strcpy-power9 stpcpy-power9 \
 		   strlen-power9 strncpy-power9 stpncpy-power9 strlen-power10
 endif
 CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
index 1a34629db9..925d417330 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
@@ -164,6 +164,9 @@  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
   /* Support sysdeps/powerpc/powerpc64/multiarch/strncmp.c.  */
   IFUNC_IMPL (i, name, strncmp,
 #ifdef __LITTLE_ENDIAN__
+	      IFUNC_IMPL_ADD (array, i, strncmp, hwcap2 & PPC_FEATURE2_ARCH_3_1
+			      && hwcap & PPC_FEATURE_HAS_VSX,
+			      __strncmp_power10)
 	      IFUNC_IMPL_ADD (array, i, strncmp, hwcap2 & PPC_FEATURE2_ARCH_3_00
 			      && hwcap & PPC_FEATURE_HAS_ALTIVEC,
 			      __strncmp_power9)
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strncmp-power10.S b/sysdeps/powerpc/powerpc64/multiarch/strncmp-power10.S
new file mode 100644
index 0000000000..c309d3caf9
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/multiarch/strncmp-power10.S
@@ -0,0 +1,25 @@ 
+/* Copyright (C) 2016-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#if defined __LITTLE_ENDIAN__ && IS_IN (libc)
+#define STRNCMP __strncmp_power10
+
+#undef libc_hidden_builtin_def
+#define libc_hidden_builtin_def(name)
+
+#include <sysdeps/powerpc/powerpc64/le/power10/strncmp.S>
+#endif
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strncmp.c b/sysdeps/powerpc/powerpc64/multiarch/strncmp.c
index e8bab8e23d..6f430d710d 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/strncmp.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/strncmp.c
@@ -29,6 +29,7 @@  extern __typeof (strncmp) __strncmp_ppc attribute_hidden;
 extern __typeof (strncmp) __strncmp_power8 attribute_hidden;
 # ifdef __LITTLE_ENDIAN__
 extern __typeof (strncmp) __strncmp_power9 attribute_hidden;
+extern __typeof (strncmp) __strncmp_power10 attribute_hidden;
 # endif
 # undef strncmp
 
@@ -36,6 +37,9 @@  extern __typeof (strncmp) __strncmp_power9 attribute_hidden;
    ifunc symbol properly.  */
 libc_ifunc_redirected (__redirect_strncmp, strncmp,
 # ifdef __LITTLE_ENDIAN__
+			(hwcap2 & PPC_FEATURE2_ARCH_3_1
+		        && hwcap & PPC_FEATURE_HAS_VSX)
+			? __strncmp_power10 :
 			(hwcap2 & PPC_FEATURE2_ARCH_3_00
 			 && hwcap & PPC_FEATURE_HAS_ALTIVEC)
 			? __strncmp_power9 :