Message ID | 20200612160755.9597-2-pbonzini@redhat.com |
---|---|
State | New |
Headers | show |
Series | None | expand |
On 6/12/20 6:07 PM, Paolo Bonzini wrote: > From: Joseph Myers <joseph@codesourcery.com> > > This corrects a bug introduced in my previous fix for SSE4.2 pcmpestri > / pcmpestrm / pcmpistri / pcmpistrm substring search, commit > ae35eea7e4a9f21dd147406dfbcd0c4c6aaf2a60. > > That commit fixed a bug that showed up in four GCC tests with one libc > implementation. The tests in question generate random inputs to the > intrinsics and compare results to a C implementation, but they only > test 1024 possible random inputs, and when the tests use the cases of > those instructions that work with word rather than byte inputs, it's > easy to have problematic cases that show up much less frequently than > that. Thus, testing with a different libc implementation, and so a > different random number generator, showed up a problem with the > previous patch. > > When investigating the previous test failures, I found the description > of these instructions in the Intel manuals (starting from computing a > 16x16 or 8x8 set of comparison results) confusing and hard to match up > with the more optimized implementation in QEMU, and referred to AMD > manuals which described the instructions in a different way. Those > AMD descriptions are very explicit that the whole of the string being > searched for must be found in the other operand, not running off the > end of that operand; they say "If the prototype and the SUT are equal > in length, the two strings must be identical for the comparison to be > TRUE.". However, that statement is incorrect. > > In my previous commit message, I noted: > > The operation in this case is a search for a string (argument d to > the helper) in another string (argument s to the helper); if a copy > of d at a particular position would run off the end of s, the > resulting output bit should be 0 whether or not the strings match in > the region where they overlap, but the QEMU implementation was > wrongly comparing only up to the point where s ends and counting it > as a match if an initial segment of d matched a terminal segment of > s. Here, "run off the end of s" means that some byte of d would > overlap some byte outside of s; thus, if d has zero length, it is > considered to match everywhere, including after the end of s. > > The description "some byte of d would overlap some byte outside of s" > is accurate only when understood to refer to overlapping some byte > *within the 16-byte operand* but at or after the zero terminator; it > is valid to run over the end of s if the end of s is the end of the > 16-byte operand. So the fix in the previous patch for the case of d > being empty was correct, but the other part of that patch was not > correct (as it never allowed partial matches even at the end of the > 16-byte operand). Nor was the code before the previous patch correct > for the case of d nonempty, as it would always have allowed partial > matches at the end of s. > > Fix with a partial revert of my previous change, combined with > inserting a check for the special case of s having maximum length to > determine where it is necessary to check for matches. > > In the added test, test 1 is for the case of empty strings, which > failed before my 2017 patch, test 2 is for the bug introduced by my > 2017 patch and test 3 deals with the case where a match of an initial > segment at the end of the string is not valid when the string ends > before the end of the 16-byte operand (that is, the case that would be > broken by a simple revert of the non-empty-string part of my 2017 > patch). > > Signed-off-by: Joseph Myers <joseph@codesourcery.com> > Message-Id: <alpine.DEB.2.21.2006121344290.9881@digraph.polyomino.org.uk> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > --- > target/i386/ops_sse.h | 4 ++-- > tests/tcg/i386/Makefile.target | 3 +++ > tests/tcg/i386/test-i386-pcmpistri.c | 33 ++++++++++++++++++++++++++++ > 3 files changed, 38 insertions(+), 2 deletions(-) > create mode 100644 tests/tcg/i386/test-i386-pcmpistri.c > > diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h > index 01d6017412..14f2b16abd 100644 > --- a/target/i386/ops_sse.h > +++ b/target/i386/ops_sse.h > @@ -2089,10 +2089,10 @@ static inline unsigned pcmpxstrx(CPUX86State *env, Reg *d, Reg *s, > res = (2 << upper) - 1; > break; > } > - for (j = valids - validd; j >= 0; j--) { > + for (j = valids == upper ? valids : valids - validd; j >= 0; j--) { > res <<= 1; > v = 1; > - for (i = validd; i >= 0; i--) { > + for (i = MIN(valids - j, validd); i >= 0; i--) { > v &= (pcmp_val(s, ctrl, i + j) == pcmp_val(d, ctrl, i)); > } > res |= v; > diff --git a/tests/tcg/i386/Makefile.target b/tests/tcg/i386/Makefile.target > index 43ee2e181e..53efec0668 100644 > --- a/tests/tcg/i386/Makefile.target > +++ b/tests/tcg/i386/Makefile.target > @@ -10,6 +10,9 @@ ALL_X86_TESTS=$(I386_SRCS:.c=) > SKIP_I386_TESTS=test-i386-ssse3 > X86_64_TESTS:=$(filter test-i386-ssse3, $(ALL_X86_TESTS)) > > +test-i386-pcmpistri: CFLAGS += -msse4.2 > +run-test-i386-pcmpistri: QEMU_OPTS += -cpu max This test fails on our CI: https://travis-ci.org/github/qemu/qemu/jobs/698006621#L4246 > + > # > # hello-i386 is a barebones app > # > diff --git a/tests/tcg/i386/test-i386-pcmpistri.c b/tests/tcg/i386/test-i386-pcmpistri.c > new file mode 100644 > index 0000000000..1e81ae611a > --- /dev/null > +++ b/tests/tcg/i386/test-i386-pcmpistri.c > @@ -0,0 +1,33 @@ > +/* Test pcmpistri instruction. */ > + > +#include <nmmintrin.h> > +#include <stdio.h> > + > +union u { > + __m128i x; > + unsigned char uc[16]; > +}; > + > +union u s0 = { .uc = { 0 } }; > +union u s1 = { .uc = "abcdefghijklmnop" }; > +union u s2 = { .uc = "bcdefghijklmnopa" }; > +union u s3 = { .uc = "bcdefghijklmnab" }; > + > +int > +main(void) > +{ > + int ret = 0; > + if (_mm_cmpistri(s0.x, s0.x, 0x4c) != 15) { > + printf("FAIL: pcmpistri test 1\n"); > + ret = 1; > + } > + if (_mm_cmpistri(s1.x, s2.x, 0x4c) != 15) { > + printf("FAIL: pcmpistri test 2\n"); > + ret = 1; > + } > + if (_mm_cmpistri(s1.x, s3.x, 0x4c) != 16) { > + printf("FAIL: pcmpistri test 3\n"); > + ret = 1; > + } > + return ret; > +} >
On 6/15/20 12:18 PM, Philippe Mathieu-Daudé wrote: > On 6/12/20 6:07 PM, Paolo Bonzini wrote: >> From: Joseph Myers <joseph@codesourcery.com> >> >> This corrects a bug introduced in my previous fix for SSE4.2 pcmpestri >> / pcmpestrm / pcmpistri / pcmpistrm substring search, commit >> ae35eea7e4a9f21dd147406dfbcd0c4c6aaf2a60. >> >> That commit fixed a bug that showed up in four GCC tests with one libc >> implementation. The tests in question generate random inputs to the >> intrinsics and compare results to a C implementation, but they only >> test 1024 possible random inputs, and when the tests use the cases of >> those instructions that work with word rather than byte inputs, it's >> easy to have problematic cases that show up much less frequently than >> that. Thus, testing with a different libc implementation, and so a >> different random number generator, showed up a problem with the >> previous patch. >> >> When investigating the previous test failures, I found the description >> of these instructions in the Intel manuals (starting from computing a >> 16x16 or 8x8 set of comparison results) confusing and hard to match up >> with the more optimized implementation in QEMU, and referred to AMD >> manuals which described the instructions in a different way. Those >> AMD descriptions are very explicit that the whole of the string being >> searched for must be found in the other operand, not running off the >> end of that operand; they say "If the prototype and the SUT are equal >> in length, the two strings must be identical for the comparison to be >> TRUE.". However, that statement is incorrect. >> >> In my previous commit message, I noted: >> >> The operation in this case is a search for a string (argument d to >> the helper) in another string (argument s to the helper); if a copy >> of d at a particular position would run off the end of s, the >> resulting output bit should be 0 whether or not the strings match in >> the region where they overlap, but the QEMU implementation was >> wrongly comparing only up to the point where s ends and counting it >> as a match if an initial segment of d matched a terminal segment of >> s. Here, "run off the end of s" means that some byte of d would >> overlap some byte outside of s; thus, if d has zero length, it is >> considered to match everywhere, including after the end of s. >> >> The description "some byte of d would overlap some byte outside of s" >> is accurate only when understood to refer to overlapping some byte >> *within the 16-byte operand* but at or after the zero terminator; it >> is valid to run over the end of s if the end of s is the end of the >> 16-byte operand. So the fix in the previous patch for the case of d >> being empty was correct, but the other part of that patch was not >> correct (as it never allowed partial matches even at the end of the >> 16-byte operand). Nor was the code before the previous patch correct >> for the case of d nonempty, as it would always have allowed partial >> matches at the end of s. >> >> Fix with a partial revert of my previous change, combined with >> inserting a check for the special case of s having maximum length to >> determine where it is necessary to check for matches. >> >> In the added test, test 1 is for the case of empty strings, which >> failed before my 2017 patch, test 2 is for the bug introduced by my >> 2017 patch and test 3 deals with the case where a match of an initial >> segment at the end of the string is not valid when the string ends >> before the end of the 16-byte operand (that is, the case that would be >> broken by a simple revert of the non-empty-string part of my 2017 >> patch). >> >> Signed-off-by: Joseph Myers <joseph@codesourcery.com> >> Message-Id: <alpine.DEB.2.21.2006121344290.9881@digraph.polyomino.org.uk> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >> --- >> target/i386/ops_sse.h | 4 ++-- >> tests/tcg/i386/Makefile.target | 3 +++ >> tests/tcg/i386/test-i386-pcmpistri.c | 33 ++++++++++++++++++++++++++++ >> 3 files changed, 38 insertions(+), 2 deletions(-) >> create mode 100644 tests/tcg/i386/test-i386-pcmpistri.c >> >> diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h >> index 01d6017412..14f2b16abd 100644 >> --- a/target/i386/ops_sse.h >> +++ b/target/i386/ops_sse.h >> @@ -2089,10 +2089,10 @@ static inline unsigned pcmpxstrx(CPUX86State *env, Reg *d, Reg *s, >> res = (2 << upper) - 1; >> break; >> } >> - for (j = valids - validd; j >= 0; j--) { >> + for (j = valids == upper ? valids : valids - validd; j >= 0; j--) { >> res <<= 1; >> v = 1; >> - for (i = validd; i >= 0; i--) { >> + for (i = MIN(valids - j, validd); i >= 0; i--) { >> v &= (pcmp_val(s, ctrl, i + j) == pcmp_val(d, ctrl, i)); >> } >> res |= v; >> diff --git a/tests/tcg/i386/Makefile.target b/tests/tcg/i386/Makefile.target >> index 43ee2e181e..53efec0668 100644 >> --- a/tests/tcg/i386/Makefile.target >> +++ b/tests/tcg/i386/Makefile.target >> @@ -10,6 +10,9 @@ ALL_X86_TESTS=$(I386_SRCS:.c=) >> SKIP_I386_TESTS=test-i386-ssse3 >> X86_64_TESTS:=$(filter test-i386-ssse3, $(ALL_X86_TESTS)) >> >> +test-i386-pcmpistri: CFLAGS += -msse4.2 >> +run-test-i386-pcmpistri: QEMU_OPTS += -cpu max > > This test fails on our CI: > https://travis-ci.org/github/qemu/qemu/jobs/698006621#L4246 FYI Paolo's analysis from 'make V=1' output https://api.travis-ci.org/v3/job/698459904/log.txt: timeout 60 /home/travis/build/philmd/qemu/build/i386-linux-user/qemu-i386 -cpu max test-i386-pcmpistri > test-i386-pcmpistri.out timeout 60 /home/travis/build/philmd/qemu/build/i386-linux-user/qemu-i386 -plugin ../../plugin/libbb.so -d plugin -D test-i386-pcmpistri-with-libbb.so.pout test-i386-pcmpistri > run-plugin-test-i386-pcmpistri-with-libbb.so.out "incorrect qemu invocation, missing -cpu max in the second". > >> + >> # >> # hello-i386 is a barebones app >> # >> diff --git a/tests/tcg/i386/test-i386-pcmpistri.c b/tests/tcg/i386/test-i386-pcmpistri.c >> new file mode 100644 >> index 0000000000..1e81ae611a >> --- /dev/null >> +++ b/tests/tcg/i386/test-i386-pcmpistri.c >> @@ -0,0 +1,33 @@ >> +/* Test pcmpistri instruction. */ >> + >> +#include <nmmintrin.h> >> +#include <stdio.h> >> + >> +union u { >> + __m128i x; >> + unsigned char uc[16]; >> +}; >> + >> +union u s0 = { .uc = { 0 } }; >> +union u s1 = { .uc = "abcdefghijklmnop" }; >> +union u s2 = { .uc = "bcdefghijklmnopa" }; >> +union u s3 = { .uc = "bcdefghijklmnab" }; >> + >> +int >> +main(void) >> +{ >> + int ret = 0; >> + if (_mm_cmpistri(s0.x, s0.x, 0x4c) != 15) { >> + printf("FAIL: pcmpistri test 1\n"); >> + ret = 1; >> + } >> + if (_mm_cmpistri(s1.x, s2.x, 0x4c) != 15) { >> + printf("FAIL: pcmpistri test 2\n"); >> + ret = 1; >> + } >> + if (_mm_cmpistri(s1.x, s3.x, 0x4c) != 16) { >> + printf("FAIL: pcmpistri test 3\n"); >> + ret = 1; >> + } >> + return ret; >> +} >> >
Philippe Mathieu-Daudé <philmd@redhat.com> writes: > On 6/15/20 12:18 PM, Philippe Mathieu-Daudé wrote: >> On 6/12/20 6:07 PM, Paolo Bonzini wrote: >>> From: Joseph Myers <joseph@codesourcery.com> >>> >>> This corrects a bug introduced in my previous fix for SSE4.2 pcmpestri >>> / pcmpestrm / pcmpistri / pcmpistrm substring search, commit >>> ae35eea7e4a9f21dd147406dfbcd0c4c6aaf2a60. >>> >>> That commit fixed a bug that showed up in four GCC tests with one libc >>> implementation. The tests in question generate random inputs to the >>> intrinsics and compare results to a C implementation, but they only >>> test 1024 possible random inputs, and when the tests use the cases of >>> those instructions that work with word rather than byte inputs, it's >>> easy to have problematic cases that show up much less frequently than >>> that. Thus, testing with a different libc implementation, and so a >>> different random number generator, showed up a problem with the >>> previous patch. >>> >>> When investigating the previous test failures, I found the description >>> of these instructions in the Intel manuals (starting from computing a >>> 16x16 or 8x8 set of comparison results) confusing and hard to match up >>> with the more optimized implementation in QEMU, and referred to AMD >>> manuals which described the instructions in a different way. Those >>> AMD descriptions are very explicit that the whole of the string being >>> searched for must be found in the other operand, not running off the >>> end of that operand; they say "If the prototype and the SUT are equal >>> in length, the two strings must be identical for the comparison to be >>> TRUE.". However, that statement is incorrect. >>> >>> In my previous commit message, I noted: >>> >>> The operation in this case is a search for a string (argument d to >>> the helper) in another string (argument s to the helper); if a copy >>> of d at a particular position would run off the end of s, the >>> resulting output bit should be 0 whether or not the strings match in >>> the region where they overlap, but the QEMU implementation was >>> wrongly comparing only up to the point where s ends and counting it >>> as a match if an initial segment of d matched a terminal segment of >>> s. Here, "run off the end of s" means that some byte of d would >>> overlap some byte outside of s; thus, if d has zero length, it is >>> considered to match everywhere, including after the end of s. >>> >>> The description "some byte of d would overlap some byte outside of s" >>> is accurate only when understood to refer to overlapping some byte >>> *within the 16-byte operand* but at or after the zero terminator; it >>> is valid to run over the end of s if the end of s is the end of the >>> 16-byte operand. So the fix in the previous patch for the case of d >>> being empty was correct, but the other part of that patch was not >>> correct (as it never allowed partial matches even at the end of the >>> 16-byte operand). Nor was the code before the previous patch correct >>> for the case of d nonempty, as it would always have allowed partial >>> matches at the end of s. >>> >>> Fix with a partial revert of my previous change, combined with >>> inserting a check for the special case of s having maximum length to >>> determine where it is necessary to check for matches. >>> >>> In the added test, test 1 is for the case of empty strings, which >>> failed before my 2017 patch, test 2 is for the bug introduced by my >>> 2017 patch and test 3 deals with the case where a match of an initial >>> segment at the end of the string is not valid when the string ends >>> before the end of the 16-byte operand (that is, the case that would be >>> broken by a simple revert of the non-empty-string part of my 2017 >>> patch). >>> >>> Signed-off-by: Joseph Myers <joseph@codesourcery.com> >>> Message-Id: <alpine.DEB.2.21.2006121344290.9881@digraph.polyomino.org.uk> >>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >>> --- >>> target/i386/ops_sse.h | 4 ++-- >>> tests/tcg/i386/Makefile.target | 3 +++ >>> tests/tcg/i386/test-i386-pcmpistri.c | 33 ++++++++++++++++++++++++++++ >>> 3 files changed, 38 insertions(+), 2 deletions(-) >>> create mode 100644 tests/tcg/i386/test-i386-pcmpistri.c >>> >>> diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h >>> index 01d6017412..14f2b16abd 100644 >>> --- a/target/i386/ops_sse.h >>> +++ b/target/i386/ops_sse.h >>> @@ -2089,10 +2089,10 @@ static inline unsigned pcmpxstrx(CPUX86State *env, Reg *d, Reg *s, >>> res = (2 << upper) - 1; >>> break; >>> } >>> - for (j = valids - validd; j >= 0; j--) { >>> + for (j = valids == upper ? valids : valids - validd; j >= 0; j--) { >>> res <<= 1; >>> v = 1; >>> - for (i = validd; i >= 0; i--) { >>> + for (i = MIN(valids - j, validd); i >= 0; i--) { >>> v &= (pcmp_val(s, ctrl, i + j) == pcmp_val(d, ctrl, i)); >>> } >>> res |= v; >>> diff --git a/tests/tcg/i386/Makefile.target b/tests/tcg/i386/Makefile.target >>> index 43ee2e181e..53efec0668 100644 >>> --- a/tests/tcg/i386/Makefile.target >>> +++ b/tests/tcg/i386/Makefile.target >>> @@ -10,6 +10,9 @@ ALL_X86_TESTS=$(I386_SRCS:.c=) >>> SKIP_I386_TESTS=test-i386-ssse3 >>> X86_64_TESTS:=$(filter test-i386-ssse3, $(ALL_X86_TESTS)) >>> >>> +test-i386-pcmpistri: CFLAGS += -msse4.2 >>> +run-test-i386-pcmpistri: QEMU_OPTS += -cpu max >> >> This test fails on our CI: >> https://travis-ci.org/github/qemu/qemu/jobs/698006621#L4246 > > FYI Paolo's analysis from 'make V=1' output > https://api.travis-ci.org/v3/job/698459904/log.txt: > > timeout 60 > /home/travis/build/philmd/qemu/build/i386-linux-user/qemu-i386 -cpu max > test-i386-pcmpistri > test-i386-pcmpistri.out > > timeout 60 > /home/travis/build/philmd/qemu/build/i386-linux-user/qemu-i386 -plugin > ../../plugin/libbb.so -d plugin -D > test-i386-pcmpistri-with-libbb.so.pout test-i386-pcmpistri > > run-plugin-test-i386-pcmpistri-with-libbb.so.out > > "incorrect qemu invocation, missing -cpu max in the second". Just testing some patches now.
diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h index 01d6017412..14f2b16abd 100644 --- a/target/i386/ops_sse.h +++ b/target/i386/ops_sse.h @@ -2089,10 +2089,10 @@ static inline unsigned pcmpxstrx(CPUX86State *env, Reg *d, Reg *s, res = (2 << upper) - 1; break; } - for (j = valids - validd; j >= 0; j--) { + for (j = valids == upper ? valids : valids - validd; j >= 0; j--) { res <<= 1; v = 1; - for (i = validd; i >= 0; i--) { + for (i = MIN(valids - j, validd); i >= 0; i--) { v &= (pcmp_val(s, ctrl, i + j) == pcmp_val(d, ctrl, i)); } res |= v; diff --git a/tests/tcg/i386/Makefile.target b/tests/tcg/i386/Makefile.target index 43ee2e181e..53efec0668 100644 --- a/tests/tcg/i386/Makefile.target +++ b/tests/tcg/i386/Makefile.target @@ -10,6 +10,9 @@ ALL_X86_TESTS=$(I386_SRCS:.c=) SKIP_I386_TESTS=test-i386-ssse3 X86_64_TESTS:=$(filter test-i386-ssse3, $(ALL_X86_TESTS)) +test-i386-pcmpistri: CFLAGS += -msse4.2 +run-test-i386-pcmpistri: QEMU_OPTS += -cpu max + # # hello-i386 is a barebones app # diff --git a/tests/tcg/i386/test-i386-pcmpistri.c b/tests/tcg/i386/test-i386-pcmpistri.c new file mode 100644 index 0000000000..1e81ae611a --- /dev/null +++ b/tests/tcg/i386/test-i386-pcmpistri.c @@ -0,0 +1,33 @@ +/* Test pcmpistri instruction. */ + +#include <nmmintrin.h> +#include <stdio.h> + +union u { + __m128i x; + unsigned char uc[16]; +}; + +union u s0 = { .uc = { 0 } }; +union u s1 = { .uc = "abcdefghijklmnop" }; +union u s2 = { .uc = "bcdefghijklmnopa" }; +union u s3 = { .uc = "bcdefghijklmnab" }; + +int +main(void) +{ + int ret = 0; + if (_mm_cmpistri(s0.x, s0.x, 0x4c) != 15) { + printf("FAIL: pcmpistri test 1\n"); + ret = 1; + } + if (_mm_cmpistri(s1.x, s2.x, 0x4c) != 15) { + printf("FAIL: pcmpistri test 2\n"); + ret = 1; + } + if (_mm_cmpistri(s1.x, s3.x, 0x4c) != 16) { + printf("FAIL: pcmpistri test 3\n"); + ret = 1; + } + return ret; +}