Message ID  AM0PR08MB336270B1CBEBF8122A3D36FD84A40@AM0PR08MB3362.eurprd08.prod.outlook.com 

State  New 
On 8/23/19 5:19 AM, Elen Kalda wrote:
> Hi all,
>
> Libgcc currently implements Smith's algorithm
> (https://dl.acm.org/citation.cfm?id=368661) for complex division. It is
> designed to prevent overflows and underflows, and it does that well in
> around 97.7% of cases. However, there has been some interest in improving
> that result. As an example, Baudin improves the robustness to ~99.5% by
> introducing an extra underflow check, and also gains in performance by
> replacing two divisions with two multiplications and one division
> (https://arxiv.org/abs/1210.4539).
>
> This patch proposes implementing Baudin's algorithm in Libgcc (in case
> the increase in rounding error is acceptable).
>
> In general, when it comes to deciding the best way of doing complex
> division, there are three things to consider:
>  - speed
>  - accuracy
>  - robustness (how often overflow/underflow happens, how many division
>    results are way off from the exact value)
>
> Improving the algorithms by looking at the failures of specific cases is
> not simple. There are 4 real numbers to be manipulated to get the final
> result, and some minimum number of operations that needs to be done. The
> quite minimal Smith's algorithm currently does 9 operations, of which 6
> are multiplications or divisions susceptible to overflow and underflow.
> It is not feasible to check for both underflow and overflow in all 6
> operations without significantly impacting performance. Most complex
> division algorithms have been created to solve some specific difficult
> cases; however, because of the abundance of failure opportunities, an
> algorithm that works well for some types of divisions is likely to fail
> for other types.
>
> The simplest way to dodge overflow and underflow without impacting
> performance much is to vary the order of operations. When naively
> expanding the complex division
> x/y = (...)/(real(y)*real(y) + imag(y)*imag(y)), the squaring in the
> denominator is the greatest source of overflow and underflow. Smith's
> improvement was to introduce a scaling factor r = real(y)/imag(y), such
> that the denominator becomes real(y)*r + imag(y). Baudin's improvement
> was to check explicitly for underflow in r and change the order of
> operations if necessary.
>
> One way of comparing the algorithms is to generate lots of random complex
> numbers and see how the different algorithms perform. That approach is
> closer to the "real world" situation, where the complex numbers are often
> random, not powers of two (which, oddly, is the assumption many authors
> make, including Baudin), and have exponents in less "dangerous" ranges.
> Smith's algorithm was compared to two versions of Baudin. The difference
> is that in one version (b2div), the real and imaginary parts of the
> result are both divided through by the same denominator d; in the other
> (b1div), the real and imaginary parts are multiplied through by the
> reciprocal of that d, effectively replacing two divisions with one
> division and two multiplications. Since division is significantly slower
> on AArch64, that swap brings gains in performance (though at the cost of
> rounding error, unfortunately).
>
> To compare the performance, I used a test that generates 1800 random
> complex double pairs (which fit nicely into the 64 kB cache) and divides
> each pair 200 000 times, effectively doing 360 000 000 operations.
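[A minimal sketch of that kind of harness follows, before the results. The
pair count and repetition count match the description above, but the RNG,
the shifted denominator range, and the use of clock () are illustrative
choices, not the actual test behind the numbers below.]

#include <complex.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NPAIRS 1800	/* fits in a 64 kB L1 cache */
#define REPS   200000	/* 1800 * 200 000 = 360 000 000 divisions */

int
main (void)
{
  static double complex xs[NPAIRS], ys[NPAIRS];
  volatile double complex sink;	/* keep the divisions from being elided */

  srand (42);
  for (int i = 0; i < NPAIRS; i++)
    {
      xs[i] = (double) rand () / RAND_MAX
	      + ((double) rand () / RAND_MAX) * I;
      /* Shift the denominator away from zero so every division is valid.  */
      ys[i] = 1.0 + (double) rand () / RAND_MAX
	      + ((double) rand () / RAND_MAX) * I;
    }

  clock_t t0 = clock ();
  for (int rep = 0; rep < REPS; rep++)
    for (int i = 0; i < NPAIRS; i++)
      sink = xs[i] / ys[i];	/* typically compiled to a __divdc3 call */

  /* Raw clock ticks, comparable to the unitless "CPU time" figures below. */
  printf ("CPU time: %ld\n", (long) (clock () - t0));
  return 0;
}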
>
>           CPU time
> smiths   7 296 924
> b1div    6 558 148
> b2div    8 658 079
>
> On AArch64, b1div is 10.13% faster than Smith's; b2div is 18.65% slower
> than Smith's.
>
> For accuracy, 1 000 000 pairs of complex doubles were divided and the
> results compared to the exact values (assuming that dividing the same
> numbers as long doubles gives the exact value).
>
>           >2ulp    >1ulp   >0.5ulp   <0.5ulp
> smiths   22 937   23 944   124 228   828 891
> b1div     4 450   59 716   350 792   585 042
> b2div     4 036   24 488   127 144   844 332
>
> Same data, but showing the percentage of divisions falling into each
> category:
>
>           >2ulp    >1ulp   >0.5ulp   <0.5ulp
> smiths    2.29%    2.39%   12.42%    82.89%
> b1div     0.45%    5.97%   35.08%    58.50%
> b2div     0.40%    2.45%   12.71%    84.43%
>
> The rightmost column (<0.5ulp) shows the number of calculations for which
> the ulp error of both the real and imaginary parts was inside 0.5 ulp,
> meaning they were correctly rounded. >0.5ulp gives the number of
> calculations for which the largest ulp error of the real and imaginary
> parts was between 0.5 and 1.0 ulp, meaning the result was rounded to one
> of the two nearest doubles, but not THE nearest double. Anything less
> accurate is not great...
>
> FMA doesn't create any problems on AArch64. Bootstrapped and tested on
> aarch64-none-linux-gnu - no problems with bootstrap, but some regressions
> in gcc/testsuite/gcc.dg/torture/builtin-math-7.c. The regressions are a
> result of the loss in accuracy - for the division
> (3. + 4.i) / 5i = 0.8 - 0.6i, Smith's gives:
> 0.800000000000000044408920985006 - 0.599999999999999977795539507497 i
> ulp error in real part: 0.4
> ulp error in imaginary part: 0.2
> b1div gives:
> 0.800000000000000044408920985006 - 0.600000000000000088817841970013 i
> ulp error in real part: 0.4
> ulp error in imaginary part: 0.8
> That means the imaginary part of the Baudin result is rounded to one of
> the two nearest doubles, but not the one which is nearest. A similar
> thing happens to another division in that test.
>
> Some questions:
> - Is the 10.13% improvement in performance with b1div worth the loss in
>   accuracy?
> - In the case of b1div, most of the results that ceased to be correctly
>   rounded as a result of replacing the two divisions with multiplications
>   ended up being within 1.0 ulp. Maybe that is acceptable?
> - The improved algorithm reduces the number of bad fails from 2.3% to
>   0.5%. Is that a significant/useful improvement?
>
> Best wishes,
> Elen
>
> libgcc/ChangeLog:
>
> 2019-07-31  Elen Kalda  <elen.kalda@arm.com>
>
> 	* libgcc2.c (CONCAT3(__div,MODE,3)): Implement Baudin's algorithm.

I'd really like to hear from Joseph on this. He's got *far* more experience
in evaluating stuff in this space than I do and is almost certainly going
to be able to give a more informed opinion than myself.

I believe he's digging out from some time off, so it might be a little
while before he can respond.

jeff
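[To make the b2div/b1div distinction concrete, here is a standalone sketch
of the two kernels for the |imag(y)| <= |real(y)| case, plus the operand
swap used otherwise. Function names and structure are illustrative, not
taken from the patch; the patch below implements the b1div variant.]

#include <math.h>
#include <stdio.h>

/* Both kernels assume |d| <= |c| so that r = d/c cannot overflow.
   a + bi is the numerator, c + di the denominator, e + fi the quotient.  */

/* b2div: both result parts are divided by the same denominator.  */
static void
baudin_2div (double a, double b, double c, double d, double *e, double *f)
{
  double r = d / c;
  double denom = c + d * r;
  if (r != 0)
    {
      *e = (a + b * r) / denom;
      *f = (b - a * r) / denom;
    }
  else
    {
      /* r underflowed: reassociate as d * (b / c) so d still contributes.  */
      *e = (a + d * (b / c)) / denom;
      *f = (b - d * (a / c)) / denom;
    }
}

/* b1div: one division to form the reciprocal, then two multiplications.
   Faster where division is slow (e.g. AArch64), at some rounding cost.  */
static void
baudin_1div (double a, double b, double c, double d, double *e, double *f)
{
  double r = d / c;
  double t = 1.0 / (c + d * r);
  if (r != 0)
    {
      *e = (a + b * r) * t;
      *f = (b - a * r) * t;
    }
  else
    {
      *e = (a + d * (b / c)) * t;
      *f = (b - d * (a / c)) * t;
    }
}

/* Driver: swapping (a, b, c, d) -> (b, a, d, c) yields the same real part
   and the negated imaginary part, so one kernel covers both orderings.  */
static void
baudin_div (double a, double b, double c, double d, double *e, double *f)
{
  if (fabs (d) <= fabs (c))
    baudin_1div (a, b, c, d, e, f);
  else
    {
      baudin_1div (b, a, d, c, e, f);
      *f = -*f;
    }
}

int
main (void)
{
  double e, f;

  baudin_div (3.0, 4.0, 0.0, 5.0, &e, &f);	/* (3 + 4i) / 5i */
  printf ("b1div: %.17g %+.17gi\n", e, f);	/* expect 0.8 - 0.6i */

  baudin_2div (4.0, 3.0, 5.0, 0.0, &e, &f);	/* swapped by hand */
  printf ("b2div: %.17g %+.17gi\n", e, -f);
  return 0;
}

[The accuracy gap in the tables above follows from this structure: b2div
rounds each part once in its final division, while b1div rounds 1/d and
then rounds again in the multiplication, so some results slip from the
<0.5ulp bucket into the >0.5ulp one.]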
diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
index 763b5fabd512d4efd3ca007d9a96d8778a5de422..9bc0168281dac231aeb4d7b9852a24e61128b77a 100644
--- a/libgcc/libgcc2.c
+++ b/libgcc/libgcc2.c
@@ -2033,58 +2033,75 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 #if defined(L_divhc3) || defined(L_divsc3) || defined(L_divdc3) \
     || defined(L_divxc3) || defined(L_divtc3)
+
+/* Complex division x/y, where
+   a = real (x)
+   b = imag (x)
+   c = real (y)
+   d = imag (y)
+*/
 CTYPE
 CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 {
-  MTYPE denom, ratio, x, y;
+  MTYPE e, f, r, t;
   CTYPE res;
 
-  /* ??? We can get better behavior from logarithmic scaling instead of
-     the division.  But that would mean starting to link libgcc against
-     libm.  We could implement something akin to ldexp/frexp as gcc builtins
-     fairly easily...  */
-  if (FABS (c) < FABS (d))
+  inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
+  {
+    r = d/c;
+    t = 1.0 / (c + (d * r));
+
+    if (r != 0)
     {
-      ratio = c / d;
-      denom = (c * ratio) + d;
-      x = ((a * ratio) + b) / denom;
-      y = ((b * ratio) - a) / denom;
+      e = (a + (b * r)) * t;
+      f = (b - (a * r)) * t;
     }
-  else
+    /* Changing the order of operations avoids the underflow of r impacting
+       the result.  */
+    else
     {
-      ratio = d / c;
-      denom = (d * ratio) + c;
-      x = ((b * ratio) + a) / denom;
-      y = (b - (a * ratio)) / denom;
+      e = (a + (d * (b / c))) * t;
+      f = (b - (d * (a / c))) * t;
     }
+  }
+
+  if (FABS (d) <= FABS (c))
+    {
+      improved_internal (a, b, c, d);
+    }
+  else
+    {
+      improved_internal (b, a, d, c);
+      f = -f;
+    }
 
   /* Recover infinities and zeros that computed as NaN+iNaN; the only
      cases are nonzero/zero, infinite/finite, and finite/infinite.  */
-  if (isnan (x) && isnan (y))
+  if (isnan (e) && isnan (f))
+    {
+      if (c == 0.0 && d == 0.0 && (!isnan (a) || !isnan (b)))
     {
-      if (c == 0.0 && d == 0.0 && (!isnan (a) || !isnan (b)))
-	{
-	  x = COPYSIGN (INFINITY, c) * a;
-	  y = COPYSIGN (INFINITY, c) * b;
-	}
-      else if ((isinf (a) || isinf (b)) && isfinite (c) && isfinite (d))
-	{
-	  a = COPYSIGN (isinf (a) ? 1 : 0, a);
-	  b = COPYSIGN (isinf (b) ? 1 : 0, b);
-	  x = INFINITY * (a * c + b * d);
-	  y = INFINITY * (b * c - a * d);
-	}
-      else if ((isinf (c) || isinf (d)) && isfinite (a) && isfinite (b))
-	{
-	  c = COPYSIGN (isinf (c) ? 1 : 0, c);
-	  d = COPYSIGN (isinf (d) ? 1 : 0, d);
-	  x = 0.0 * (a * c + b * d);
-	  y = 0.0 * (b * c - a * d);
-	}
+      e = COPYSIGN (INFINITY, c) * a;
+      f = COPYSIGN (INFINITY, c) * b;
     }
+      else if ((isinf (a) || isinf (b)) && isfinite (c) && isfinite (d))
+	{
+	  a = COPYSIGN (isinf (a) ? 1 : 0, a);
+	  b = COPYSIGN (isinf (b) ? 1 : 0, b);
+	  e = INFINITY * (a * c + b * d);
+	  f = INFINITY * (b * c - a * d);
+	}
+      else if ((isinf (c) || isinf (d)) && isfinite (a) && isfinite (b))
+	{
+	  c = COPYSIGN (isinf (c) ? 1 : 0, c);
+	  d = COPYSIGN (isinf (d) ? 1 : 0, d);
+	  e = 0.0 * (a * c + b * d);
+	  f = 0.0 * (b * c - a * d);
+	}
+    }
 
-  __real__ res = x;
-  __imag__ res = y;
+  __real__ res = e;
+  __imag__ res = f;
   return res;
 }
 #endif /* complex divide */
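[The ulp-error methodology described in the thread - dividing in double and
comparing against the same division done in long double as the reference -
can be reproduced with a small check like the following. ulp_error is an
illustrative helper, not code from the patch or the test harness, and it
does not handle zero or subnormal reference values.]

#include <complex.h>
#include <math.h>
#include <stdio.h>

/* Distance between a double result and a long double reference, measured
   in units of the ulp at the reference's magnitude.  */
static double
ulp_error (double got, long double ref)
{
  double r = (double) ref;
  double ulp = nextafter (fabs (r), INFINITY) - fabs (r);
  return (double) (fabsl ((long double) got - ref) / ulp);
}

int
main (void)
{
  double complex x = 3.0 + 4.0 * I;	/* the regression example above */
  double complex y = 0.0 + 5.0 * I;

  double complex q = x / y;		/* __divdc3 */
  long double complex qr = (long double complex) x
			   / (long double complex) y;	/* reference */

  printf ("ulp error in real part: %.2f\n", ulp_error (creal (q), creall (qr)));
  printf ("ulp error in imaginary part: %.2f\n", ulp_error (cimag (q), cimagl (qr)));
  return 0;
}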