[25/27] S390: Optimize wmemset.

Message ID 1435319512-22245-26-git-send-email-stli@linux.vnet.ibm.com
State New

Commit Message

Stefan Liebler June 26, 2015, 11:51 a.m. UTC
This patch provides an optimized version of wmemset that uses the z13
vector instructions.

ChangeLog:

	* sysdeps/s390/multiarch/wmemset-c.c: New File.
	* sysdeps/s390/multiarch/wmemset-vx.S: Likewise.
	* sysdeps/s390/multiarch/wmemset.c: Likewise.
	* sysdeps/s390/multiarch/Makefile
	(sysdep_routines): Add wmemset functions.
	* sysdeps/s390/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add ifunc test for wmemset.
	* wcsmbs/wmemset.c: Use WMEMSET if defined.
	* string/test-memset.c: Add wmemset support.
	* wcsmbs/test-wmemset.c: New File.
	* wcsmbs/Makefile (strop-tests): Add wmemset.
	* benchtests/bench-memset.c: Add wmemset support.
	* benchtests/bench-wmemset.c: New File.
	* benchtests/Makefile (wcsmbs-bench): Add wmemset.
---
 benchtests/Makefile                      |   2 +-
 benchtests/bench-memset.c                |  63 +++++++++-----
 benchtests/bench-wmemset.c               |  20 +++++
 string/test-memset.c                     |  90 +++++++++++++-------
 sysdeps/s390/multiarch/Makefile          |   3 +-
 sysdeps/s390/multiarch/ifunc-impl-list.c |   1 +
 sysdeps/s390/multiarch/wmemset-c.c       |  37 ++++++++
 sysdeps/s390/multiarch/wmemset-vx.S      | 142 +++++++++++++++++++++++++++++++
 sysdeps/s390/multiarch/wmemset.c         |  29 +++++++
 wcsmbs/Makefile                          |   2 +-
 wcsmbs/test-wmemset.c                    |  20 +++++
 wcsmbs/wmemset.c                         |   3 +
 12 files changed, 356 insertions(+), 56 deletions(-)
 create mode 100644 benchtests/bench-wmemset.c
 create mode 100644 sysdeps/s390/multiarch/wmemset-c.c
 create mode 100644 sysdeps/s390/multiarch/wmemset-vx.S
 create mode 100644 sysdeps/s390/multiarch/wmemset.c
 create mode 100644 wcsmbs/test-wmemset.c
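
For orientation, the three new multiarch files follow the usual ifunc split: a generic C variant, a vector variant, and a resolver that picks one of them. The sketch below is only a portable model of that selection, not the glibc mechanism itself. The names `wmemset_generic_model`, `wmemset_vector_model` and `select_wmemset` and the `have_vector` flag are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

typedef wchar_t *(*wmemset_fn) (wchar_t *, wchar_t, size_t);

/* Stand-in for the generic C implementation (__wmemset_c in the patch).  */
static wchar_t *
wmemset_generic_model (wchar_t *s, wchar_t c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    s[i] = c;
  return s;
}

/* Stand-in for the vector implementation (__wmemset_vx in the patch).
   A real one would use vector stores; here it simply calls wmemset.  */
static wchar_t *
wmemset_vector_model (wchar_t *s, wchar_t c, size_t n)
{
  return wmemset (s, c, n);
}

/* Hypothetical resolver.  HAVE_VECTOR models the hardware-capability
   check that an ifunc resolver makes when the symbol is resolved.  */
static wmemset_fn
select_wmemset (int have_vector)
{
  wmemset_fn fn = have_vector ? wmemset_vector_model : wmemset_generic_model;
  assert (fn != NULL);
  return fn;
}
```

A caller would obtain the function once, for example `wmemset_fn f = select_wmemset (1);`, and then call `f (buf, L'a', n)` for every fill.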

Comments

Ondřej Bílka June 26, 2015, 1:53 p.m. UTC | #1
On Fri, Jun 26, 2015 at 01:51:50PM +0200, Stefan Liebler wrote:
> This patch provides optimized version of wmemset with the z13 vector
> instructions.
> 
Why do you optimize wmemset but not memset?

Stefan Liebler June 29, 2015, 9:23 a.m. UTC | #2
On 06/26/2015 03:53 PM, Ondřej Bílka wrote:
> On Fri, Jun 26, 2015 at 01:51:50PM +0200, Stefan Liebler wrote:
>> This patch provides optimized version of wmemset with the z13 vector
>> instructions.
>>
> Why do you optimize wmemset but not memset?
>
The current memset implementation uses the mvc instruction, which is
optimized for setting a single byte and is still the preferred way to
implement memset. Setting four bytes with mvc is not optimized in the
same way, so only wmemset is optimized with vector instructions.
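
For readers unfamiliar with the idiom: an overlapping mvc can fill memory because the instruction behaves as if it copied one byte at a time from left to right, so a byte stored at the start propagates through the buffer. The following is a rough C model of that behaviour, not the actual s390 memset code. The function name `fill_by_overlap` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Model of filling a buffer with an overlapping byte-wise copy:
   store the first byte, then copy d[i] to d[i + 1] from left to
   right, so the first byte propagates through the whole buffer.  */
static void *
fill_by_overlap (void *dst, int c, size_t n)
{
  unsigned char *d = dst;

  assert (dst != NULL || n == 0);
  if (n == 0)
    return dst;

  d[0] = (unsigned char) c;
  for (size_t i = 0; i + 1 < n; i++)
    d[i + 1] = d[i];
  return dst;
}
```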

Ondřej Bílka June 29, 2015, 12:05 p.m. UTC | #3
On Mon, Jun 29, 2015 at 11:23:52AM +0200, Stefan Liebler wrote:
> On 06/26/2015 03:53 PM, Ondřej Bílka wrote:
> >On Fri, Jun 26, 2015 at 01:51:50PM +0200, Stefan Liebler wrote:
> >>This patch provides optimized version of wmemset with the z13 vector
> >>instructions.
> >>
> >Why do you optimize wmemset but not memset?
> >
> The current memset implementation uses mvc instruction.
> It is optimized for setting one byte, which is still the preferred way
> for memset. But setting four bytes with mvc is not optimized in this
> way and thus only wmemset is optimized with vector instructions.

Why? I ran dryrun to see how often that happens. The results are a bit
of a surprise to me, as they mean that I could optimize memset a bit more.

I knew that you could assume that the destination is aligned to 8 bytes.
Now I also see that the size is quite likely a multiple of 4 or 8. See
below for the raw data.

Another characteristic is that the average size is in the hundreds of
bytes, so vectorization pays off.

It looks like the best approach for memset would be to first write a few
bytes to align the destination and make the size a multiple of 4. After
that the logic is identical to wmemset, so you could just jump to the
appropriate memset instruction.

I would use control flow like the following; the recursion is there only
because I cannot change memset's return value in C.

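#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

void *
memset (void *_x, int _c, size_t n)
{
  char *x = (char *) _x;
  unsigned char c = (unsigned char) _c;
  /* Replicate the byte into all four bytes of a wchar_t; unsigned
     arithmetic avoids signed overflow for c >= 0x80.  */
  wchar_t wc = (wchar_t) (c * 0x01010101u);

  if (n == 0)
    return _x;

  if (__glibc_unlikely ((((uintptr_t) x) | n) % 4 != 0))
    {
      if (((uintptr_t) x) & 3)
        {
          /* Write at most three bytes to align x to 4 bytes, but
             never more than n bytes.  */
          while ((((uintptr_t) x) & 3) && n > 0)
            {
              *x++ = c;
              n--;
            }
          memset (x, c, n);
          return _x;
        }

      /* x is 4-byte aligned: trim the tail until n is a multiple of 4.
         The odd byte goes first so that the 2-byte store is aligned.  */
      if (n & 1)
        x[--n] = c;
      if (n & 2)
        {
          *((uint16_t *) (x + n - 2)) = c * 0x0101;
          n -= 2;
        }
    }

  wmemset ((wchar_t *) x, wc, n / 4);
  return _x;
}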
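
The multiplication by 0x01010101 in the sketch above copies one byte into every byte of a 32-bit word. Shown in isolation below, with a made-up helper name `replicate_byte`, it is done in unsigned arithmetic so that values of 0x80 and above do not overflow a signed int.

```c
#include <assert.h>
#include <stdint.h>

/* Replicate one byte into all four bytes of a 32-bit word.
   0xAB * 0x01010101 == 0xABABABAB; unsigned arithmetic keeps the
   multiplication free of signed overflow.  */
static uint32_t
replicate_byte (unsigned char c)
{
  return (uint32_t) c * UINT32_C (0x01010101);
}
```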


replaying bash

calls 1268

average capacity  161.6 
suceed:    0.0% 
size % 4 == 0:   56.1% 
size % 8 == 0:   49.1% success probability   0.0% 
average n:  161.6    n <= 0:   0.1% n <= 4:   0.1% n <= 8:   0.1% n <= 16:   0.1% n <= 24:   0.1% n <= 32:   3.6% n <= 48:  26.0% n <= 64:  48.7% 
s aligned to 4 bytes: 100.0%  8 bytes: 100.0% 16 bytes:   0.1% 
average *s access cache latency    0.9    l <= 8:  94.8% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 

replaying awk

calls 59

average capacity   40.8 
suceed:    0.0% 
size % 4 == 0:   67.8% 
size % 8 == 0:   67.8% success probability   0.0% 
average n:   40.8    n <= 0:   1.7% n <= 4:   1.7% n <= 8:   5.1% n <= 16:  23.7% n <= 24:  40.7% n <= 32:  89.8% n <= 48:  89.8% n <= 64:  91.5% 
s aligned to 4 bytes: 100.0%  8 bytes: 100.0% 16 bytes: 100.0% 
average *s access cache latency    1.1    l <= 8:  98.3% l <= 16:  98.3% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 

replaying ssh-add

calls 149

average capacity  139.0 
suceed:    0.0% 
size % 4 == 0:   37.6% 
size % 8 == 0:   17.4% success probability   0.0% 
average n:  139.0    n <= 0:   0.7% n <= 4:  26.2% n <= 8:  35.6% n <= 16:  72.5% n <= 24:  77.9% n <= 32:  83.2% n <= 48:  87.2% n <= 64:  89.9% 
s aligned to 4 bytes:  89.3%  8 bytes:  87.9% 16 bytes:  87.2% 
average *s access cache latency    0.4    l <= 8:  99.3% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 

replaying ssh-keygen

calls 157

average capacity  144.8 
suceed:    0.0% 
size % 4 == 0:   37.6% 
size % 8 == 0:   17.8% success probability   0.0% 
average n:  144.8    n <= 0:   0.6% n <= 4:  25.5% n <= 8:  34.4% n <= 16:  69.4% n <= 24:  74.5% n <= 32:  79.6% n <= 48:  84.1% n <= 64:  87.3% 
s aligned to 4 bytes:  89.2%  8 bytes:  87.9% 16 bytes:  87.3% 
average *s access cache latency    0.3    l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 

replaying /usr/lib/gcc/x86_64-linux-gnu/5/cc1

calls 48817

average capacity  142.5 
suceed:    0.0% 
size % 4 == 0:   99.5% 
size % 8 == 0:   87.2% success probability   0.0% 
average n:  142.5    n <= 0:   2.2% n <= 4:   6.0% n <= 8:  25.7% n <= 16:  29.2% n <= 24:  32.8% n <= 32:  35.8% n <= 48:  47.6% n <= 64:  53.6% 
s aligned to 4 bytes: 100.0%  8 bytes:  99.7% 16 bytes:  73.6% 
average *s access cache latency   55.4    l <= 8:  78.7% l <= 16:  86.0% l <= 32:  90.8% l <= 64:  95.8% l <= 128:  97.0% 

replaying as

calls 424

average capacity  8829.0 
suceed:    0.0% 
size % 4 == 0:  100.0% 
size % 8 == 0:   89.4% success probability   0.0% 
average n:  8829.0    n <= 0:   0.2% n <= 4:  10.8% n <= 8:  10.8% n <= 16:  10.8% n <= 24:  10.8% n <= 32:  10.8% n <= 48:  10.8% n <= 64:  10.8% 
s aligned to 4 bytes: 100.0%  8 bytes:  94.8% 16 bytes:  79.5% 
average *s access cache latency   12.1    l <= 8:  77.1% l <= 16:  92.0% l <= 32:  97.2% l <= 64:  98.1% l <= 128:  98.1% 

replaying ar

calls 387

average capacity  347.8 
suceed:    0.0% 
size % 4 == 0:   91.5% 
size % 8 == 0:   87.6% success probability   0.0% 
average n:  347.8    n <= 0:   0.3% n <= 4:   0.3% n <= 8:   8.5% n <= 16:   8.8% n <= 24:   8.8% n <= 32:   8.8% n <= 48:   8.8% n <= 64:   8.8% 
s aligned to 4 bytes:  91.5%  8 bytes:  91.5% 16 bytes:  57.1% 
average *s access cache latency    4.6    l <= 8:  87.6% l <= 16:  92.8% l <= 32:  96.9% l <= 64:  99.0% l <= 128:  99.5% 

replaying ranlib

calls 372

average capacity  368.9 
suceed:    0.0% 
size % 4 == 0:   95.4% 
size % 8 == 0:   94.4% success probability   0.0% 
average n:  368.9    n <= 0:   0.3% n <= 4:   0.3% n <= 8:   0.8% n <= 16:   5.1% n <= 24:   5.1% n <= 32:   5.1% n <= 48:   5.1% n <= 64:   5.1% 
s aligned to 4 bytes:  99.2%  8 bytes:  99.2% 16 bytes:  59.4% 
average *s access cache latency    5.3    l <= 8:  78.5% l <= 16:  91.4% l <= 32:  97.8% l <= 64:  99.5% l <= 128:  99.5% 

replaying /usr/bin/ld

calls 1815

average capacity  754.1 
suceed:    0.0% 
size % 4 == 0:   91.1% 
size % 8 == 0:   89.5% success probability   0.0% 
average n:  754.1    n <= 0:   0.1% n <= 4:   4.6% n <= 8:   8.4% n <= 16:   9.5% n <= 24:  10.0% n <= 32:  10.8% n <= 48:  19.4% n <= 64:  19.9% 
s aligned to 4 bytes:  93.6%  8 bytes:  92.3% 16 bytes:  64.4% 
average *s access cache latency    9.1    l <= 8:  92.0% l <= 16:  94.5% l <= 32:  96.0% l <= 64:  96.5% l <= 128:  96.7% 

replaying mutt

calls 89

average capacity  1175.9 
suceed:    0.0% 
size % 4 == 0:  100.0% 
size % 8 == 0:  100.0% success probability   0.0% 
average n:  1175.9    n <= 0:   1.1% n <= 4:   1.1% n <= 8:   1.1% n <= 16:   1.1% n <= 24:   1.1% n <= 32:   1.1% n <= 48:   1.1% n <= 64:   1.1% 
s aligned to 4 bytes: 100.0%  8 bytes: 100.0% 16 bytes: 100.0% 
average *s access cache latency   61.0    l <= 8:  12.4% l <= 16:  42.7% l <= 32:  76.4% l <= 64:  77.5% l <= 128:  78.7% 

replaying mc

calls 2650

average capacity  175.1 
suceed:    0.0% 
size % 4 == 0:   87.3% 
size % 8 == 0:   83.2% success probability   0.0% 
average n:  175.1    n <= 0:   0.0% n <= 4:   6.5% n <= 8:   9.9% n <= 16:  13.5% n <= 24:  13.5% n <= 32:  39.8% n <= 48:  59.8% n <= 64:  86.8% 
s aligned to 4 bytes:  94.7%  8 bytes:  93.7% 16 bytes:  93.4% 
average *s access cache latency    6.9    l <= 8:  90.5% l <= 16:  94.9% l <= 32:  97.5% l <= 64:  97.6% l <= 128:  97.6% 

replaying gawk

calls 409

average capacity   36.9 
suceed:    0.0% 
size % 4 == 0:   98.0% 
size % 8 == 0:   98.0% success probability   0.0% 
average n:   36.9    n <= 0:   0.2% n <= 4:   0.2% n <= 8:   0.5% n <= 16:   2.7% n <= 24:   2.7% n <= 32:  95.8% n <= 48:  95.8% n <= 64:  96.1% 
s aligned to 4 bytes: 100.0%  8 bytes: 100.0% 16 bytes: 100.0% 
average *s access cache latency    0.4    l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0%
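
For anyone who wants to collect similar numbers on another platform, the size and alignment columns above can be approximated with a few counters updated on every memset call. This is a minimal sketch under that assumption and not the tool that produced the data. `struct memset_stats`, `record_memset_call` and `percent` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counters, loosely mirroring the columns above.  */
struct memset_stats
{
  size_t calls;
  size_t size_mult4;		/* size % 4 == 0 */
  size_t size_mult8;		/* size % 8 == 0 */
  size_t aligned4;		/* s aligned to 4 bytes */
  size_t aligned8;		/* s aligned to 8 bytes */
  size_t aligned16;		/* s aligned to 16 bytes */
  unsigned long long total_n;	/* sum of all sizes, for the average */
};

/* Record one memset (s, c, n) call.  */
static void
record_memset_call (struct memset_stats *st, const void *s, size_t n)
{
  uintptr_t p = (uintptr_t) s;

  assert (st != NULL);
  st->calls++;
  st->total_n += n;
  st->size_mult4 += (n % 4 == 0);
  st->size_mult8 += (n % 8 == 0);
  st->aligned4 += (p % 4 == 0);
  st->aligned8 += (p % 8 == 0);
  st->aligned16 += (p % 16 == 0);
}

/* Percentage of PART in WHOLE, 0 if WHOLE is 0.  */
static double
percent (size_t part, size_t whole)
{
  return whole == 0 ? 0.0 : 100.0 * (double) part / (double) whole;
}
```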

Patch

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 95712a3..28a5170 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -38,7 +38,7 @@  string-bench := bcopy bzero memccpy memchr memcmp memcpy memmem memmove \
 		strcoll
 wcsmbs-bench := wcslen wcsnlen wcscpy wcpcpy wcsncpy wcpncpy wcscat wcsncat \
 		wcscmp wcsncmp wcschr wcschrnul wcsrchr wcsspn wcspbrk wcscspn \
-		wmemchr
+		wmemchr wmemset
 string-bench-all := $(string-bench) ${wcsmbs-bench}
 
 # We have to generate locales
diff --git a/benchtests/bench-memset.c b/benchtests/bench-memset.c
index 9786ce7..66c3c45 100644
--- a/benchtests/bench-memset.c
+++ b/benchtests/bench-memset.c
@@ -20,12 +20,29 @@ 
 #ifdef TEST_BZERO
 # define TEST_NAME "bzero"
 #else
-# define TEST_NAME "memset"
-#endif
+# ifndef WIDE
+#  define TEST_NAME "memset"
+# else
+#  define TEST_NAME "wmemset"
+# endif /* WIDE */
+#endif /* !TEST_BZERO */
 #define MIN_PAGE_SIZE 131072
 #include "bench-string.h"
 
-char *simple_memset (char *, int, size_t);
+#ifndef WIDE
+# define MEMSET memset
+# define CHAR char
+# define SIMPLE_MEMSET simple_memset
+# define MEMCMP memcmp
+#else
+# include <wchar.h>
+# define MEMSET wmemset
+# define CHAR wchar_t
+# define SIMPLE_MEMSET simple_wmemset
+# define MEMCMP wmemcmp
+#endif /* WIDE */
+
+CHAR *SIMPLE_MEMSET (CHAR *, int, size_t);
 
 #ifdef TEST_BZERO
 typedef void (*proto_t) (char *, size_t);
@@ -39,7 +56,7 @@  IMPL (bzero, 1)
 void
 simple_bzero (char *s, size_t n)
 {
-  simple_memset (s, 0, n);
+  SIMPLE_MEMSET (s, 0, n);
 }
 
 void
@@ -48,46 +65,50 @@  builtin_bzero (char *s, size_t n)
   __builtin_bzero (s, n);
 }
 #else
-typedef char *(*proto_t) (char *, int, size_t);
-char *builtin_memset (char *, int, size_t);
+typedef CHAR *(*proto_t) (CHAR *, int, size_t);
 
-IMPL (simple_memset, 0)
+IMPL (SIMPLE_MEMSET, 0)
+# ifndef WIDE
+char *builtin_memset (char *, int, size_t);
 IMPL (builtin_memset, 0)
-IMPL (memset, 1)
+# endif /* !WIDE */
+IMPL (MEMSET, 1)
 
+# ifndef WIDE
 char *
 builtin_memset (char *s, int c, size_t n)
 {
   return __builtin_memset (s, c, n);
 }
-#endif
+# endif /* !WIDE */
+#endif /* !TEST_BZERO */
 
-char *
+CHAR *
 inhibit_loop_to_libcall
-simple_memset (char *s, int c, size_t n)
+SIMPLE_MEMSET (CHAR *s, int c, size_t n)
 {
-  char *r = s, *end = s + n;
+  CHAR *r = s, *end = s + n;
   while (r < end)
     *r++ = c;
   return s;
 }
 
 static void
-do_one_test (impl_t *impl, char *s, int c __attribute ((unused)), size_t n)
+do_one_test (impl_t *impl, CHAR *s, int c __attribute ((unused)), size_t n)
 {
   size_t i, iters = INNER_LOOP_ITERS;
   timing_t start, stop, cur;
-  char tstbuf[n];
+  CHAR tstbuf[n];
 #ifdef TEST_BZERO
   simple_bzero (tstbuf, n);
   CALL (impl, s, n);
   if (memcmp (s, tstbuf, n) != 0)
 #else
-  char *res = CALL (impl, s, c, n);
+  CHAR *res = CALL (impl, s, c, n);
   if (res != s
-      || simple_memset (tstbuf, c, n) != tstbuf
-      || memcmp (s, tstbuf, n) != 0)
-#endif
+      || SIMPLE_MEMSET (tstbuf, c, n) != tstbuf
+      || MEMCMP (s, tstbuf, n) != 0)
+#endif /* !TEST_BZERO */
     {
       error (0, 0, "Wrong result in function %s", impl->name);
       ret = 1;
@@ -101,7 +122,7 @@  do_one_test (impl_t *impl, char *s, int c __attribute ((unused)), size_t n)
       CALL (impl, s, n);
 #else
       CALL (impl, s, c, n);
-#endif
+#endif /* !TEST_BZERO */
     }
   TIMING_NOW (stop);
 
@@ -114,13 +135,13 @@  static void
 do_test (size_t align, int c, size_t len)
 {
   align &= 7;
-  if (align + len > page_size)
+  if ((align + len) * sizeof (CHAR) > page_size)
     return;
 
   printf ("Length %4zd, alignment %2zd, c %2d:", len, align, c);
 
   FOR_EACH_IMPL (impl, 0)
-    do_one_test (impl, (char *) buf1 + align, c, len);
+    do_one_test (impl, (CHAR *) (buf1) + align, c, len);
 
   putchar ('\n');
 }
diff --git a/benchtests/bench-wmemset.c b/benchtests/bench-wmemset.c
new file mode 100644
index 0000000..540829c
--- /dev/null
+++ b/benchtests/bench-wmemset.c
@@ -0,0 +1,20 @@ 
+/* Measure wmemset functions.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#define WIDE 1
+#include "bench-memset.c"
diff --git a/string/test-memset.c b/string/test-memset.c
index 9f3af46..79b2983 100644
--- a/string/test-memset.c
+++ b/string/test-memset.c
@@ -1,4 +1,4 @@ 
-/* Test and measure memset functions.
+/* Test memset functions.
    Copyright (C) 1999-2015 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
    Written by Jakub Jelinek <jakub@redhat.com>, 1999.
@@ -21,12 +21,33 @@ 
 #ifdef TEST_BZERO
 # define TEST_NAME "bzero"
 #else
-# define TEST_NAME "memset"
-#endif
+# ifndef WIDE
+#  define TEST_NAME "memset"
+# else
+#  define TEST_NAME "wmemset"
+# endif /* WIDE */
+#endif /* !TEST_BZERO */
 #define MIN_PAGE_SIZE 131072
 #include "test-string.h"
 
-char *simple_memset (char *, int, size_t);
+#ifndef WIDE
+# define MEMSET memset
+# define CHAR char
+# define UCHAR unsigned char
+# define SIMPLE_MEMSET simple_memset
+# define MEMCMP memcmp
+# define BIG_CHAR CHAR_MAX
+#else
+# include <wchar.h>
+# define MEMSET wmemset
+# define CHAR wchar_t
+# define UCHAR wchar_t
+# define SIMPLE_MEMSET simple_wmemset
+# define MEMCMP wmemcmp
+# define BIG_CHAR WCHAR_MAX
+#endif /* WIDE */
+
+CHAR *SIMPLE_MEMSET (CHAR *, int, size_t);
 
 #ifdef TEST_BZERO
 typedef void (*proto_t) (char *, size_t);
@@ -40,7 +61,7 @@  IMPL (bzero, 1)
 void
 simple_bzero (char *s, size_t n)
 {
-  simple_memset (s, 0, n);
+  SIMPLE_MEMSET (s, 0, n);
 }
 
 void
@@ -49,44 +70,48 @@  builtin_bzero (char *s, size_t n)
   __builtin_bzero (s, n);
 }
 #else
-typedef char *(*proto_t) (char *, int, size_t);
-char *builtin_memset (char *, int, size_t);
+typedef CHAR *(*proto_t) (CHAR *, int, size_t);
 
-IMPL (simple_memset, 0)
+IMPL (SIMPLE_MEMSET, 0)
+# ifndef WIDE
+char *builtin_memset (char *, int, size_t);
 IMPL (builtin_memset, 0)
-IMPL (memset, 1)
+# endif /* !WIDE */
+IMPL (MEMSET, 1)
 
+# ifndef WIDE
 char *
 builtin_memset (char *s, int c, size_t n)
 {
   return __builtin_memset (s, c, n);
 }
-#endif
+# endif /* !WIDE */
+#endif /* !TEST_BZERO */
 
-char *
+CHAR *
 inhibit_loop_to_libcall
-simple_memset (char *s, int c, size_t n)
+SIMPLE_MEMSET (CHAR *s, int c, size_t n)
 {
-  char *r = s, *end = s + n;
+  CHAR *r = s, *end = s + n;
   while (r < end)
     *r++ = c;
   return s;
 }
 
 static void
-do_one_test (impl_t *impl, char *s, int c __attribute ((unused)), size_t n)
+do_one_test (impl_t *impl, CHAR *s, int c __attribute ((unused)), size_t n)
 {
-  char tstbuf[n];
+  CHAR tstbuf[n];
 #ifdef TEST_BZERO
   simple_bzero (tstbuf, n);
   CALL (impl, s, n);
   if (memcmp (s, tstbuf, n) != 0)
 #else
-  char *res = CALL (impl, s, c, n);
+  CHAR *res = CALL (impl, s, c, n);
   if (res != s
-      || simple_memset (tstbuf, c, n) != tstbuf
-      || memcmp (s, tstbuf, n) != 0)
-#endif
+      || SIMPLE_MEMSET (tstbuf, c, n) != tstbuf
+      || MEMCMP (s, tstbuf, n) != 0)
+#endif /* !TEST_BZERO */
     {
       error (0, 0, "Wrong result in function %s", impl->name);
       ret = 1;
@@ -98,11 +123,11 @@  static void
 do_test (size_t align, int c, size_t len)
 {
   align &= 7;
-  if (align + len > page_size)
+  if ((align + len) * sizeof (CHAR) > page_size)
     return;
 
   FOR_EACH_IMPL (impl, 0)
-    do_one_test (impl, (char *) buf1 + align, c, len);
+    do_one_test (impl, (CHAR *) (buf1) + align, c, len);
 }
 
 #ifndef TEST_BZERO
@@ -111,18 +136,19 @@  do_random_tests (void)
 {
   size_t i, j, k, n, align, len, size;
   int c, o;
-  unsigned char *p, *res;
+  UCHAR *p, *res;
+  UCHAR *p2 = (UCHAR *) buf2;
 
-  for (i = 0; i < 65536; ++i)
-    buf2[i] = random () & 255;
+  for (i = 0; i < 65536 / sizeof (CHAR); ++i)
+    p2[i] = random () & BIG_CHAR;
 
   for (n = 0; n < ITERATIONS; n++)
     {
       if ((random () & 31) == 0)
-	size = 65536;
+	size = 65536 / sizeof (CHAR);
       else
 	size = 512;
-      p = buf1 + page_size - size;
+      p = (UCHAR *) (buf1 + page_size) - size;
       len = random () & (size - 1);
       align = size - len - (random () & 31);
       if (align > size)
@@ -132,10 +158,10 @@  do_random_tests (void)
       if ((random () & 7) == 0)
 	c = 0;
       else
-	c = random () & 255;
-      o = random () & 255;
+	c = random () & BIG_CHAR;
+      o = random () & BIG_CHAR;
       if (o == c)
-        o = (c + 1) & 255;
+	o = (c + 1) & BIG_CHAR;
       j = len + align + 128;
       if (j > size)
 	j = size;
@@ -152,11 +178,11 @@  do_random_tests (void)
 	{
 	  for (i = 0; i < len; ++i)
 	    {
-	      p[i + align] = buf2[i];
+	      p[i + align] = p2[i];
 	      if (p[i + align] == c)
 		p[i + align] = o;
 	    }
-	  res = (unsigned char *) CALL (impl, (char *) p + align, c, len);
+	  res = (UCHAR *) CALL (impl, (CHAR *) p + align, c, len);
 	  if (res != p + align)
 	    {
 	      error (0, 0, "Iteration %zd - wrong result in function %s (%zd, %d, %zd) %p != %p",
@@ -190,7 +216,7 @@  do_random_tests (void)
 	}
     }
 }
-#endif
+#endif /* !TEST_BZERO */
 
 int
 test_main (void)
diff --git a/sysdeps/s390/multiarch/Makefile b/sysdeps/s390/multiarch/Makefile
index 87dff0f..eac88e0 100644
--- a/sysdeps/s390/multiarch/Makefile
+++ b/sysdeps/s390/multiarch/Makefile
@@ -37,5 +37,6 @@  sysdep_routines += wcslen wcslen-vx wcslen-c \
 		   wcsspn wcsspn-vx wcsspn-c \
 		   wcspbrk wcspbrk-vx wcspbrk-c \
 		   wcscspn wcscspn-vx wcscspn-c \
-		   wmemchr wmemchr-vx wmemchr-c
+		   wmemchr wmemchr-vx wmemchr-c \
+		   wmemset wmemset-vx wmemset-c
 endif
diff --git a/sysdeps/s390/multiarch/ifunc-impl-list.c b/sysdeps/s390/multiarch/ifunc-impl-list.c
index c90fb6b..93789ac 100644
--- a/sysdeps/s390/multiarch/ifunc-impl-list.c
+++ b/sysdeps/s390/multiarch/ifunc-impl-list.c
@@ -133,6 +133,7 @@  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   IFUNC_VX_IMPL (memccpy);
 
+  IFUNC_VX_IMPL (wmemset);
 #endif /* HAVE_S390_VX_ASM_SUPPORT */
 
   return i;
diff --git a/sysdeps/s390/multiarch/wmemset-c.c b/sysdeps/s390/multiarch/wmemset-c.c
new file mode 100644
index 0000000..cec2339
--- /dev/null
+++ b/sysdeps/s390/multiarch/wmemset-c.c
@@ -0,0 +1,37 @@ 
+/* Default wmemset implementation for S/390.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#if defined HAVE_S390_VX_ASM_SUPPORT && IS_IN (libc)
+# define WMEMSET  __wmemset_c
+
+# include <wchar.h>
+extern __typeof (__wmemset) __wmemset_c;
+# undef weak_alias
+# define weak_alias(name, alias)
+# ifdef SHARED
+#  undef libc_hidden_def
+#  define libc_hidden_def(name)					\
+  __hidden_ver1 (__wmemset_c, __GI___wmemset, __wmemset_c);
+#  undef libc_hidden_weak
+#  define libc_hidden_weak(name)					\
+  strong_alias (__wmemset_c, __wmemset_c_1);				\
+  __hidden_ver1 (__wmemset_c_1, __GI_wmemset, __wmemset_c_1);
+# endif /* SHARED */
+
+# include <wcsmbs/wmemset.c>
+#endif /* HAVE_S390_VX_ASM_SUPPORT && IS_IN (libc) */
diff --git a/sysdeps/s390/multiarch/wmemset-vx.S b/sysdeps/s390/multiarch/wmemset-vx.S
new file mode 100644
index 0000000..22bddcd
--- /dev/null
+++ b/sysdeps/s390/multiarch/wmemset-vx.S
@@ -0,0 +1,142 @@ 
+/* Vector Optimized 32/64 bit S/390 version of wmemset.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#if defined HAVE_S390_VX_ASM_SUPPORT && IS_IN (libc)
+
+# include "sysdep.h"
+# include "asm-syntax.h"
+
+	.text
+
+/* wchar_t *wmemset(wchar_t *dest, wchar_t wc, size_t n)
+   Fill an array of wide-characters with a constant wide character
+   and returns dest.
+
+   Register usage:
+   -r0=tmp
+   -r1=tmp
+   -r2=dest or current-pointer
+   -r3=wc
+   -r4=n
+   -r5=tmp
+   -v16=replicated wc
+   -v17,v18,v19=copy of v16 for vstm
+   -v31=saved dest for return
+*/
+ENTRY(__wmemset_vx)
+	.machine "z13"
+	.machinemode "zarch_nohighgprs"
+
+# if !defined __s390x__
+	llgfr	%r4,%r4
+# endif /* !defined __s390x__ */
+
+	vlvgg	%v31,%r2,0	/* Save destination pointer for return.  */
+	clgije	%r4,0,.Lend
+
+	vlvgf	%v16,%r3,0	/* Generate vector with wchar_t wc.  */
+	vrepf	%v16,%v16,0
+
+	/* Check range of maxlen and convert to byte-count.  */
+# ifdef __s390x__
+	tmhh	%r4,49152	/* Test bit 0 or 1 of maxlen.  */
+	lghi	%r5,-4		/* Max byte-count is 18446744073709551612.  */
+# else
+	tmlh	%r4,49152	/* Test bit 0 or 1 of maxlen.  */
+	llilf	%r5,4294967292	/* Max byte-count is 4294967292.  */
+# endif /* !__s390x__ */
+	sllg	%r4,%r4,2	/* Convert character-count to byte-count.  */
+	locgrne	%r4,%r5		/* Use max byte-count, if bit 0/1 was one.  */
+
+	/* Align dest to 16 byte.  */
+	risbg	%r0,%r2,60,128+63,0 /* Test if s is aligned and
+				       %r0 = bits 60-63 'and' 15.  */
+	je	.Lpreloop	/* If s is aligned, loop aligned.  */
+	tmll	%r2,3		/* Test if s is 4-byte aligned?  */
+	jne	.Lfallback	/* And use common-code variant if not.  */
+	lghi	%r1,16
+	slr	%r1,%r0		/* Compute byte count to load (16-x).  */
+	clgr	%r1,%r4
+	locgrh	%r1,%r4		/* min (byte count, n)  */
+	aghik	%r5,%r1,-1	/* vstl needs highest index.  */
+	vstl	%v16,%r5,0(%r2)	/* Store remaining bytes.  */
+	clgrje	%r1,%r4,.Lend	/* Return if n bytes were set.  */
+	slgr	%r4,%r1		/* Compute remaining byte count.  */
+	la	%r2,0(%r1,%r2)
+
+.Lpreloop:
+	/* Now we are 16-byte aligned.  */
+	clgijl	%r4,17,.Lremaining
+	srlg	%r1,%r4,8	/* Split into 256byte blocks */
+	clgije	%r1,0,.Lpreloop64
+	vlr	%v17,%v16
+	vlr	%v18,%v16
+	vlr	%v19,%v16
+
+.Lloop256:
+	vstm	%v16,%v19,0(%r2)
+	vstm	%v16,%v19,64(%r2)
+	vstm	%v16,%v19,128(%r2)
+	vstm	%v16,%v19,192(%r2)
+	la	%r2,256(%r2)
+	brctg	%r1,.Lloop256	/* Loop until all blocks are processed.  */
+
+	llgfr	%r4,%r4
+	nilf	%r4,255		/* Get remaining bytes */
+	je	.Lend		/* Skip store remaining bytes if zero.  */
+
+.Lpreloop64:
+	clgijl	%r4,17,.Lremaining
+	clgijl	%r4,33,.Lpreloop16
+	srlg	%r1,%r4,5	/* Split into 32byte blocks */
+
+.Lloop32:
+	vst	%v16,0(%r2)
+	vst	%v16,16(%r2)
+	la	%r2,32(%r2)
+	brctg	%r1,.Lloop32	/* Loop until all blocks are processed.  */
+
+	llgfr	%r4,%r4
+	nilf	%r4,31		/* Get remaining bytes */
+	je	.Lend		/* Skip store remaining bytes if zero.  */
+
+.Lpreloop16:
+	clgijl	%r4,17,.Lremaining
+	srlg	%r1,%r4,4	/* Split into 16byte blocks */
+
+.Lloop16:
+	vst	%v16,0(%r2)
+	la	%r2,16(%r2)
+	brctg	%r1,.Lloop16	/* Loop until all blocks are processed.  */
+
+	llgfr	%r4,%r4
+	nilf	%r4,15		/* Get remaining bytes */
+	je	.Lend		/* Skip store remaining bytes if zero.  */
+
+.Lremaining:
+	aghi	%r4,-1		/* vstl needs highest index.  */
+	vstl	%v16,%r4,0(%r2)
+
+.Lend:
+	vlgvg	%r2,%v31,0	/* Load saved dest for return value.  */
+	br	%r14
+.Lfallback:
+	srlg	%r4,%r4,2	/* Convert byte-count to character-count.  */
+	jg	__wmemset_c
+END(__wmemset_vx)
+#endif /* HAVE_S390_VX_ASM_SUPPORT && IS_IN (libc) */
diff --git a/sysdeps/s390/multiarch/wmemset.c b/sysdeps/s390/multiarch/wmemset.c
new file mode 100644
index 0000000..884fea9
--- /dev/null
+++ b/sysdeps/s390/multiarch/wmemset.c
@@ -0,0 +1,29 @@ 
+/* Multiple versions of wmemset.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#if defined HAVE_S390_VX_ASM_SUPPORT && IS_IN (libc)
+# include <wchar.h>
+# include <ifunc-resolve.h>
+
+s390_vx_libc_ifunc (__wmemset)
+weak_alias (__wmemset, wmemset)
+libc_hidden_weak (wmemset)
+
+#else
+# include <wcsmbs/wmemset.c>
+#endif /* !(defined HAVE_S390_VX_ASM_SUPPORT && IS_IN (libc)) */
diff --git a/wcsmbs/Makefile b/wcsmbs/Makefile
index e8ce579..dc72ba4 100644
--- a/wcsmbs/Makefile
+++ b/wcsmbs/Makefile
@@ -44,7 +44,7 @@  routines := wcscat wcschr wcscmp wcscpy wcscspn wcsdup wcslen wcsncat \
 
 strop-tests :=  wcscmp wcsncmp wmemcmp wcslen wcschr wcsrchr wcscpy wcsnlen \
 		wcpcpy wcsncpy wcpncpy wcscat wcsncat wcschrnul wcsspn wcspbrk \
-		wcscspn wmemchr
+		wcscspn wmemchr wmemset
 tests := tst-wcstof wcsmbs-tst1 tst-wcsnlen tst-btowc tst-mbrtowc \
 	 tst-wcrtomb tst-wcpncpy tst-mbsrtowcs tst-wchar-h tst-mbrtowc2 \
 	 tst-c16c32-1 wcsatcliff $(addprefix test-,$(strop-tests))
diff --git a/wcsmbs/test-wmemset.c b/wcsmbs/test-wmemset.c
new file mode 100644
index 0000000..c03c759
--- /dev/null
+++ b/wcsmbs/test-wmemset.c
@@ -0,0 +1,20 @@ 
+/* Test wmemset functions.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#define WIDE 1
+#include "../string/test-memset.c"
diff --git a/wcsmbs/wmemset.c b/wcsmbs/wmemset.c
index 88fc015..1eb6b2b 100644
--- a/wcsmbs/wmemset.c
+++ b/wcsmbs/wmemset.c
@@ -18,6 +18,9 @@ 
 
 #include <wchar.h>
 
+#ifdef WMEMSET
+# define __wmemset WMEMSET
+#endif
 
 wchar_t *
 __wmemset (s, c, n)
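
The control flow of wmemset-vx.S can be followed more easily in a portable C model: saturate the byte count, fill up to the next 16-byte boundary, then store in blocks of 256, 32 and 16 bytes, and finish with the tail. The sketch below only approximates the assembly; the exact thresholds differ and plain memcpy stands in for the vector stores. `store_pattern` and `wmemset_blocked_model` are made-up names, and the code assumes that sizeof (wchar_t) divides 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

/* Store LEN bytes of the pattern WC at P; LEN must be a multiple of
   sizeof (wchar_t).  Stands in for the vst/vstm/vstl stores.  */
static void
store_pattern (unsigned char *p, wchar_t wc, size_t len)
{
  for (size_t off = 0; off < len; off += sizeof wc)
    memcpy (p + off, &wc, sizeof wc);
}

static wchar_t *
wmemset_blocked_model (wchar_t *dest, wchar_t wc, size_t n)
{
  static const size_t block_sizes[] = { 256, 32, 16 };
  unsigned char *p = (unsigned char *) dest;
  size_t bytes;

  assert (dest != NULL || n == 0);
  if (n == 0)
    return dest;

  /* Convert the character count to a byte count, saturating at the
     largest multiple of sizeof (wchar_t) instead of wrapping.  */
  bytes = (n > SIZE_MAX / sizeof wc
	   ? SIZE_MAX - SIZE_MAX % sizeof wc : n * sizeof wc);

  /* Fallback: dest is not even wchar_t aligned, use the plain loop.  */
  if ((uintptr_t) p % sizeof wc != 0)
    {
      store_pattern (p, wc, bytes);
      return dest;
    }

  /* Head: fill up to the next 16-byte boundary, at most BYTES.  */
  if ((uintptr_t) p % 16 != 0)
    {
      size_t head = 16 - (uintptr_t) p % 16;
      if (head > bytes)
	head = bytes;
      store_pattern (p, wc, head);
      p += head;
      bytes -= head;
    }

  /* Body: 256-byte blocks, then 32-byte blocks, then 16-byte blocks.  */
  for (size_t i = 0; i < sizeof block_sizes / sizeof block_sizes[0]; i++)
    while (bytes >= block_sizes[i])
      {
	store_pattern (p, wc, block_sizes[i]);
	p += block_sizes[i];
	bytes -= block_sizes[i];
      }

  /* Tail: fewer than 16 bytes remain.  */
  store_pattern (p, wc, bytes);
  return dest;
}
```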