Patchwork [U-Boot,PATCHv2] fsl_ddr: Don't use full 64-bit divides on 32-bit PowerPC

login
register
mail settings
Submitter Kyle Moffett
Date Feb. 23, 2011, 4:35 p.m.
Message ID <1298478920-22044-1-git-send-email-Kyle.D.Moffett@boeing.com>
Download mbox | patch
Permalink /patch/84207/
State Changes Requested
Delegated to: Kumar Gala
Headers show

Comments

Kyle Moffett - Feb. 23, 2011, 4:35 p.m.
The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit
integer divide operations to convert between nanoseconds and DDR clock
cycles given arbitrary DDR clock frequencies.

Since all of the inputs to this are 32-bit (nanoseconds, clock cycles,
and DDR frequencies), we can easily restructure the computation to use
the "do_div()" function to perform 64-bit/32-bit divide operations.

This decreases compute time rather significantly for each conversion and
avoids bringing in a very complicated function from libgcc.

It should be noted that nothing else in U-Boot or the Linux kernel seems
to require a full 64-bit divide on any 32-bit PowerPC.

Build-and-boot-tested on the HWW-1U-1A board using DDR2 SPD detection.

Signed-off-by: Kyle Moffett <Kyle.D.Moffett@boeing.com>
Cc: Andy Fleming <afleming@gmail.com>
Cc: Kumar Gala <kumar.gala@freescale.com>
Cc: Wolfgang Denk <wd@denx.de>
Cc: Kim Phillips <kim.phillips@freescale.com>
---

Ok, so this patch touches a rather sensitive part of the fsl_ddr logic.

I spent a fair amount of time trying to verify that the resulting math is
exactly the same as it was before, but the consequences of a mistake are
insideous timing problems.  Additional in-depth review would be much
appreciated.

Cheers,
Kyle Moffett

Changelog:
  v2: Resubmitted separately from the other HWW-1U-1A patches

 arch/powerpc/cpu/mpc8xxx/ddr/util.c |   58 ++++++++++++++++++++++++----------
 1 files changed, 41 insertions(+), 17 deletions(-)
York Sun - March 14, 2011, 6:19 p.m.
On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
> The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit
> integer divide operations to convert between nanoseconds and DDR clock
> cycles given arbitrary DDR clock frequencies.
> 
> Since all of the inputs to this are 32-bit (nanoseconds, clock cycles,
> and DDR frequencies), we can easily restructure the computation to use
> the "do_div()" function to perform 64-bit/32-bit divide operations.
> 
> This decreases compute time rather significantly for each conversion and
> avoids bringing in a very complicated function from libgcc.
> 
> It should be noted that nothing else in U-Boot or the Linux kernel seems
> to require a full 64-bit divide on any 32-bit PowerPC.
> 
> Build-and-boot-tested on the HWW-1U-1A board using DDR2 SPD detection.
> 
> Signed-off-by: Kyle Moffett <Kyle.D.Moffett@boeing.com>
> Cc: Andy Fleming <afleming@gmail.com>
> Cc: Kumar Gala <kumar.gala@freescale.com>
> Cc: Wolfgang Denk <wd@denx.de>
> Cc: Kim Phillips <kim.phillips@freescale.com>
> ---
> 
> Ok, so this patch touches a rather sensitive part of the fsl_ddr logic.
> 
> I spent a fair amount of time trying to verify that the resulting math is
> exactly the same as it was before, but the consequences of a mistake are
> insideous timing problems.  Additional in-depth review would be much
> appreciated.
> 

What's the gain of the changes? A smaller footprint? Going forward from
32- to 64-bit, are we going to change it back?

York
Kyle Moffett - March 14, 2011, 7:04 p.m.
On Mar 14, 2011, at 14:19, York Sun wrote:
> On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
>> The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit
>> integer divide operations to convert between nanoseconds and DDR clock
>> cycles given arbitrary DDR clock frequencies.
>> 
>> Since all of the inputs to this are 32-bit (nanoseconds, clock cycles,
>> and DDR frequencies), we can easily restructure the computation to use
>> the "do_div()" function to perform 64-bit/32-bit divide operations.
>> 
>> This decreases compute time rather significantly for each conversion and
>> avoids bringing in a very complicated function from libgcc.
>> 
>> It should be noted that nothing else in U-Boot or the Linux kernel seems
>> to require a full 64-bit divide on any 32-bit PowerPC.
>> 
>> Build-and-boot-tested on the HWW-1U-1A board using DDR2 SPD detection.
>> 
>> Signed-off-by: Kyle Moffett <Kyle.D.Moffett@boeing.com>
>> Cc: Andy Fleming <afleming@gmail.com>
>> Cc: Kumar Gala <kumar.gala@freescale.com>
>> Cc: Wolfgang Denk <wd@denx.de>
>> Cc: Kim Phillips <kim.phillips@freescale.com>
>> ---
>> 
>> Ok, so this patch touches a rather sensitive part of the fsl_ddr logic.
>> 
>> I spent a fair amount of time trying to verify that the resulting math is
>> exactly the same as it was before, but the consequences of a mistake are
>> insideous timing problems.  Additional in-depth review would be much
>> appreciated.
>> 
> 
> What's the gain of the changes? A smaller footprint? Going forward from
> 32- to 64-bit, are we going to change it back?

On 64-bit this change is basically a no-op, because do_div() is implemented as a literal 64-bit divide operation and the instruction scheduling works out almost the same.

On 32-bit PowerPC a fully accurate 64/64 divide (__udivdi3 in libgcc) is 1.1kb of code and hundreds or thousands of dependent cycles to compute, all of which is linked in from libgcc.  Another 1.2kb of code comes in for __umoddi3.

The original reason I wrote this patch is that the native "libgcc" on my boards is hard-float and therefore generates warnings when linking to it from soft-float code.  The toolchain is the native Debian powerpc (or powerpcspe), and most other native PowerPC distributions are also hard-float-only.

When I combine this patch with the other patch I posted to create a minimal internal libgcc with a few 64-bit shift functions from the linux kernel, I can successfully build U-Boot on those native PowerPC systems.

Cheers,
Kyle Moffett
York Sun - March 14, 2011, 7:41 p.m.
Kyle,

On Mon, 2011-03-14 at 14:04 -0500, Moffett, Kyle D wrote:
> On 64-bit this change is basically a no-op, because do_div() is implemented as a literal 64-bit divide operation and the instruction scheduling works out almost the same.
> 
> On 32-bit PowerPC a fully accurate 64/64 divide (__udivdi3 in libgcc) is 1.1kb of code and hundreds or thousands of dependent cycles to compute, all of which is linked in from libgcc.  Another 1.2kb of code comes in for __umoddi3.
> 
> The original reason I wrote this patch is that the native "libgcc" on my boards is hard-float and therefore generates warnings when linking to it from soft-float code.  The toolchain is the native Debian powerpc (or powerpcspe), and most other native PowerPC distributions are also hard-float-only.
> 
> When I combine this patch with the other patch I posted to create a minimal internal libgcc with a few 64-bit shift functions from the linux kernel, I can successfully build U-Boot on those native PowerPC systems.
> 

Points taken. Can you explain your algorithm? I see you want to do
clks/5^12/2^13, but I don't see when the clks is divided by 5^12. Did I
miss some lines?

York
Kyle Moffett - March 14, 2011, 8:01 p.m.
On Mar 14, 2011, at 15:41, York Sun wrote:
> Kyle,
> 
> On Mon, 2011-03-14 at 14:04 -0500, Moffett, Kyle D wrote:
>> On 64-bit this change is basically a no-op, because do_div() is implemented as a literal 64-bit divide operation and the instruction scheduling works out almost the same.
>> 
>> On 32-bit PowerPC a fully accurate 64/64 divide (__udivdi3 in libgcc) is 1.1kb of code and hundreds or thousands of dependent cycles to compute, all of which is linked in from libgcc.  Another 1.2kb of code comes in for __umoddi3.
>> 
>> The original reason I wrote this patch is that the native "libgcc" on my boards is hard-float and therefore generates warnings when linking to it from soft-float code.  The toolchain is the native Debian powerpc (or powerpcspe), and most other native PowerPC distributions are also hard-float-only.
>> 
>> When I combine this patch with the other patch I posted to create a minimal internal libgcc with a few 64-bit shift functions from the linux kernel, I can successfully build U-Boot on those native PowerPC systems.
>> 
> 
> Points taken. Can you explain your algorithm? I see you want to do
> clks/5^12/2^13, but I don't see when the clks is divided by 5^12. Did I
> miss some lines?

The "do_div()" macro is a bit confusing that way, it simultaneously performs an in-place division of the first argument and returns the remainder.  Unfortunately that's a clumsy interface that's been around in the Linux kernel for ages.

So this line here:
  clks_rem = do_div(clks, UL_5pow12);

It divides clks by UL_5pow12 and saves the remainder as clks_rem.

Then the 2^^13 divide and remainder is perform with shifts and masks.

Actually, it occurs to me that the one comment isn't quite right, the remainder is a 64-bit value (not a 32-bit one).

I feel relatively confident that this is the right technical direction to go, because almost all of the Linux kernel timekeeping and scheduling code uses 64-bit values and careful application of do_div() or shift+mask for efficiency.

Cheers,
Kyle Moffett
York Sun - March 14, 2011, 8:22 p.m.
On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
> +	 * Now divide by 5^12 and track the 32-bit remainder, then divide
> +	 * by 2*(2^12) using shifts (and updating the remainder).
> +	 */
> +	clks_rem = do_div(clks, UL_5pow12);
> +	clks_rem <<= 13;

Shouldn't this be clks_rem >>= 13 ?

> +	clks_rem |= clks & (UL_2pow13-1);
> +	clks >>= 13;
> +
> +	/* If we had a remainder, then round up */
> +	if (clks_rem)
>  		clks++;
> -	}


York
Kyle Moffett - March 14, 2011, 9:35 p.m.
On Mar 14, 2011, at 16:22, York Sun wrote:
> On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
>> +	 * Now divide by 5^12 and track the 32-bit remainder, then divide
>> +	 * by 2*(2^12) using shifts (and updating the remainder).
>> +	 */
>> +	clks_rem = do_div(clks, UL_5pow12);
>> +	clks_rem <<= 13;
> 
> Shouldn't this be clks_rem >>= 13 ?
>> 
>> +	clks_rem |= clks & (UL_2pow13-1);
>> +	clks >>= 13;
>> +
>> +	/* If we had a remainder, then round up */
>> +	if (clks_rem)
>> 		clks++;
>> -	}

Since I'm dividing a second time, the old remainder value represents the high bits of the new remainder value and therefore needs to be left-shifted, then the now-empty low bits of the remainder are taken from the bits of "clks" which get shifted away.

Example:

Say I want to divide 1999999999999 (IE: 2*10^^12 - 1) by 2000000000000 (2*10^^12).  Obviously the dividend is less than the divisor (by 1), so the result should be 0 and the remainder should be equal to the dividend.

So my initial value is:
  clks = 1999999999999;

Now I divide by 5^^12:
  clks_rem = do_div(clks, 244140625); /* This number is 5pow12 */

The results are:
  clks_rem == 244140624
  clks == 8191

Now I shift left:
  clks_rem <<= 13;

And get:
  clks_rem == 1999999991808

Finally, I copy the low bits of clks to clks_rem and shift them out of clks:
  clks_rem |= clks & (UL_2pow13-1);
  clks >>= 13

The result is as expected:
  clks_rem == 1999999999999;
  clks == 0;

Cheers,
Kyle Moffett
York Sun - March 14, 2011, 9:41 p.m.
On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
> The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit
> integer divide operations to convert between nanoseconds and DDR clock
> cycles given arbitrary DDR clock frequencies.
> 
> Since all of the inputs to this are 32-bit (nanoseconds, clock cycles,
> and DDR frequencies), we can easily restructure the computation to use
> the "do_div()" function to perform 64-bit/32-bit divide operations.
> 
> This decreases compute time rather significantly for each conversion and
> avoids bringing in a very complicated function from libgcc.
> 
> It should be noted that nothing else in U-Boot or the Linux kernel seems
> to require a full 64-bit divide on any 32-bit PowerPC.
> 
> Build-and-boot-tested on the HWW-1U-1A board using DDR2 SPD detection.
> 
> Signed-off-by: Kyle Moffett <Kyle.D.Moffett@boeing.com>
> Cc: Andy Fleming <afleming@gmail.com>
> Cc: Kumar Gala <kumar.gala@freescale.com>
> Cc: Wolfgang Denk <wd@denx.de>
> Cc: Kim Phillips <kim.phillips@freescale.com>
> ---

Acked-by: York Sun <yorksun@freescale.com>
Kumar Gala - March 15, 2011, 7:55 a.m.
On Feb 23, 2011, at 10:35 AM, Kyle Moffett wrote:

> The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit
> integer divide operations to convert between nanoseconds and DDR clock
> cycles given arbitrary DDR clock frequencies.
> 
> Since all of the inputs to this are 32-bit (nanoseconds, clock cycles,
> and DDR frequencies), we can easily restructure the computation to use
> the "do_div()" function to perform 64-bit/32-bit divide operations.
> 
> This decreases compute time rather significantly for each conversion and
> avoids bringing in a very complicated function from libgcc.
> 
> It should be noted that nothing else in U-Boot or the Linux kernel seems
> to require a full 64-bit divide on any 32-bit PowerPC.
> 
> Build-and-boot-tested on the HWW-1U-1A board using DDR2 SPD detection.
> 
> Signed-off-by: Kyle Moffett <Kyle.D.Moffett@boeing.com>
> Cc: Andy Fleming <afleming@gmail.com>
> Cc: Kumar Gala <kumar.gala@freescale.com>
> Cc: Wolfgang Denk <wd@denx.de>
> Cc: Kim Phillips <kim.phillips@freescale.com>
> ---
> 
> Ok, so this patch touches a rather sensitive part of the fsl_ddr logic.
> 
> I spent a fair amount of time trying to verify that the resulting math is
> exactly the same as it was before, but the consequences of a mistake are
> insideous timing problems.  Additional in-depth review would be much
> appreciated.
> 
> Cheers,
> Kyle Moffett
> 
> Changelog:
>  v2: Resubmitted separately from the other HWW-1U-1A patches
> 
> arch/powerpc/cpu/mpc8xxx/ddr/util.c |   58 ++++++++++++++++++++++++----------
> 1 files changed, 41 insertions(+), 17 deletions(-)

Can you rebase on 'next' branch of git://git.denx.de/u-boot-mpc85xx.git

- k
Wolfgang Denk - April 13, 2011, 8:50 p.m.
Dear Kyle Moffett,

In message <1298478920-22044-1-git-send-email-Kyle.D.Moffett@boeing.com> you wrote:
> The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit
> integer divide operations to convert between nanoseconds and DDR clock
> cycles given arbitrary DDR clock frequencies.
...
> +/* To avoid 64-bit full-divides, we factor this here */
> +#define ULL_2e12 2000000000000ULL
> +#define UL_5pow12 244140625UL
> +#define UL_2pow13 (1UL << 13)
> +
> +#define ULL_8Fs 0xFFFFFFFFULL

We don't allow CamelCaps macro names. Macros are all caps, please.

Best regards,

Wolfgang Denk
York Sun - July 28, 2011, 9:41 p.m.
Kyle,

On Mon, 2011-03-14 at 16:35 -0500, Moffett, Kyle D wrote:
> On Mar 14, 2011, at 16:22, York Sun wrote:
> > On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
> >> +	 * Now divide by 5^12 and track the 32-bit remainder, then divide
> >> +	 * by 2*(2^12) using shifts (and updating the remainder).
> >> +	 */
> >> +	clks_rem = do_div(clks, UL_5pow12);
> >> +	clks_rem <<= 13;
> > 
> > Shouldn't this be clks_rem >>= 13 ?
> >> 
> >> +	clks_rem |= clks & (UL_2pow13-1);
> >> +	clks >>= 13;
> >> +
> >> +	/* If we had a remainder, then round up */
> >> +	if (clks_rem)
> >> 		clks++;
> >> -	}

I found a problem with the round up. Please try

 picos_to_mclk(mclk_to_picos(3))

You will notice the result is round up. I am sure it is unexpected.

York
Kyle Moffett - July 29, 2011, 5:20 p.m.
On Jul 28, 2011, at 17:41, York Sun wrote:
> On Mon, 2011-03-14 at 16:35 -0500, Moffett, Kyle D wrote:
>> On Mar 14, 2011, at 16:22, York Sun wrote:
>>> On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
>>>> +	 * Now divide by 5^12 and track the 32-bit remainder, then divide
>>>> +	 * by 2*(2^12) using shifts (and updating the remainder).
>>>> +	 */
>>>> +	clks_rem = do_div(clks, UL_5pow12);
>>>> +	clks_rem <<= 13;
>>> 
>>> Shouldn't this be clks_rem >>= 13 ?
>>>> 
>>>> +	clks_rem |= clks & (UL_2pow13-1);
>>>> +	clks >>= 13;
>>>> +
>>>> +	/* If we had a remainder, then round up */
>>>> +	if (clks_rem)
>>>> 		clks++;
>>>> -	}
> 
> I found a problem with the round up. Please try
> 
> picos_to_mclk(mclk_to_picos(3))
> 
> You will notice the result is round up. I am sure it is unexpected.

Hmm... what does fsl_ddr_get_mem_data_rate() return on your system?
I cannot reproduce this here with my memory freq of 800MHz.

The function has obviously always done rounding, even before I made
any changes.  Have you retested what the math does with the patch
reverted to see if it makes any difference?

Hmm, looking at it, the obvious problem case would be mclk_to_picos()
using a rounded result in a subsequent multiply.  Can you retest by
replacing mclk_to_picos() with the following code; this should keep
the error down by doing the rounding-to-10ps *after* the multiply.

unsigned int mclk_to_picos(unsigned int mclk)
{
        unsigned int data_rate = get_ddr_freq(0);
        unsigned int result;

        /* Round to nearest 10ps, being careful about 64-bit multiply/divide */
        unsigned long long mclk_ps = ULL_2e12 * mclk;

        /* Add 5*data_rate, for rounding */
        mclk_ps += 5*(unsigned long long)data_rate;
 
        /* Now perform the big divide, the result fits in 32-bits */
        do_div(mclk_ps, data_rate);
        result = mclk_ps;

        /* We still need to round to 10ps */
        return 10 * (result/10);
}

Obviously you could also get rid of the rounding alltogether, but I
don't know why it was put in the code in the first place...

Cheers,
Kyle Moffett
York Sun - July 29, 2011, 6:35 p.m.
Kyle,

On Fri, 2011-07-29 at 12:20 -0500, Moffett, Kyle D wrote:
> On Jul 28, 2011, at 17:41, York Sun wrote:
> > On Mon, 2011-03-14 at 16:35 -0500, Moffett, Kyle D wrote:
> >> On Mar 14, 2011, at 16:22, York Sun wrote:
> >>> On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
> >>>> +	 * Now divide by 5^12 and track the 32-bit remainder, then divide
> >>>> +	 * by 2*(2^12) using shifts (and updating the remainder).
> >>>> +	 */
> >>>> +	clks_rem = do_div(clks, UL_5pow12);
> >>>> +	clks_rem <<= 13;
> >>> 
> >>> Shouldn't this be clks_rem >>= 13 ?
> >>>> 
> >>>> +	clks_rem |= clks & (UL_2pow13-1);
> >>>> +	clks >>= 13;
> >>>> +
> >>>> +	/* If we had a remainder, then round up */
> >>>> +	if (clks_rem)
> >>>> 		clks++;
> >>>> -	}
> > 
> > I found a problem with the round up. Please try
> > 
> > picos_to_mclk(mclk_to_picos(3))
> > 
> > You will notice the result is round up. I am sure it is unexpected.
> 
> Hmm... what does fsl_ddr_get_mem_data_rate() return on your system?
> I cannot reproduce this here with my memory freq of 800MHz.
> 
> The function has obviously always done rounding, even before I made
> any changes.  Have you retested what the math does with the patch
> reverted to see if it makes any difference?
> 
> Hmm, looking at it, the obvious problem case would be mclk_to_picos()
> using a rounded result in a subsequent multiply.  Can you retest by
> replacing mclk_to_picos() with the following code; this should keep
> the error down by doing the rounding-to-10ps *after* the multiply.
> 
> unsigned int mclk_to_picos(unsigned int mclk)
> {
>         unsigned int data_rate = get_ddr_freq(0);
>         unsigned int result;
> 
>         /* Round to nearest 10ps, being careful about 64-bit multiply/divide */
>         unsigned long long mclk_ps = ULL_2e12 * mclk;
> 
>         /* Add 5*data_rate, for rounding */
>         mclk_ps += 5*(unsigned long long)data_rate;
>  
>         /* Now perform the big divide, the result fits in 32-bits */
>         do_div(mclk_ps, data_rate);
>         result = mclk_ps;
> 
>         /* We still need to round to 10ps */
>         return 10 * (result/10);
> }
> 
> Obviously you could also get rid of the rounding alltogether, but I
> don't know why it was put in the code in the first place...

I think the problem comes from the round up. But your calculation of
remainder is not right. You explained to me with example of 2*10^^12-1.
It seems to be right until I tried another number 264*10^^7. I think the
following calculation of remainder is

	clks_rem = do_div(clks, UL_5POW12);
	clks >>= 13;
	clks_rem |= (clks & (UL_2POW13-1)) * UL_5POW12;

Then we can compare the remainder with the "error" introduced by
mclk_to_picos, which is 10ps.

	if (clks_rem > 10 * data_rate)
		clk++;

What do you think?

York
York Sun - July 29, 2011, 6:46 p.m.
On Fri, 2011-07-29 at 11:35 -0700, York Sun wrote:
> Kyle,
> 
> On Fri, 2011-07-29 at 12:20 -0500, Moffett, Kyle D wrote:
> > On Jul 28, 2011, at 17:41, York Sun wrote:
> > > On Mon, 2011-03-14 at 16:35 -0500, Moffett, Kyle D wrote:
> > >> On Mar 14, 2011, at 16:22, York Sun wrote:
> > >>> On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
> > >>>> +	 * Now divide by 5^12 and track the 32-bit remainder, then divide
> > >>>> +	 * by 2*(2^12) using shifts (and updating the remainder).
> > >>>> +	 */
> > >>>> +	clks_rem = do_div(clks, UL_5pow12);
> > >>>> +	clks_rem <<= 13;
> > >>> 
> > >>> Shouldn't this be clks_rem >>= 13 ?
> > >>>> 
> > >>>> +	clks_rem |= clks & (UL_2pow13-1);
> > >>>> +	clks >>= 13;
> > >>>> +
> > >>>> +	/* If we had a remainder, then round up */
> > >>>> +	if (clks_rem)
> > >>>> 		clks++;
> > >>>> -	}
> > > 
> > > I found a problem with the round up. Please try
> > > 
> > > picos_to_mclk(mclk_to_picos(3))
> > > 
> > > You will notice the result is round up. I am sure it is unexpected.
> > 
> > Hmm... what does fsl_ddr_get_mem_data_rate() return on your system?
> > I cannot reproduce this here with my memory freq of 800MHz.
> > 
> > The function has obviously always done rounding, even before I made
> > any changes.  Have you retested what the math does with the patch
> > reverted to see if it makes any difference?
> > 
> > Hmm, looking at it, the obvious problem case would be mclk_to_picos()
> > using a rounded result in a subsequent multiply.  Can you retest by
> > replacing mclk_to_picos() with the following code; this should keep
> > the error down by doing the rounding-to-10ps *after* the multiply.
> > 
> > unsigned int mclk_to_picos(unsigned int mclk)
> > {
> >         unsigned int data_rate = get_ddr_freq(0);
> >         unsigned int result;
> > 
> >         /* Round to nearest 10ps, being careful about 64-bit multiply/divide */
> >         unsigned long long mclk_ps = ULL_2e12 * mclk;
> > 
> >         /* Add 5*data_rate, for rounding */
> >         mclk_ps += 5*(unsigned long long)data_rate;
> >  
> >         /* Now perform the big divide, the result fits in 32-bits */
> >         do_div(mclk_ps, data_rate);
> >         result = mclk_ps;
> > 
> >         /* We still need to round to 10ps */
> >         return 10 * (result/10);
> > }
> > 
> > Obviously you could also get rid of the rounding alltogether, but I
> > don't know why it was put in the code in the first place...
> 
> I think the problem comes from the round up. But your calculation of
> remainder is not right. You explained to me with example of 2*10^^12-1.
> It seems to be right until I tried another number 264*10^^7. I think the
> following calculation of remainder is
> 
> 	clks_rem = do_div(clks, UL_5POW12);
> 	clks >>= 13;
> 	clks_rem |= (clks & (UL_2POW13-1)) * UL_5POW12;
> 
> Then we can compare the remainder with the "error" introduced by
> mclk_to_picos, which is 10ps.
> 
> 	if (clks_rem > 10 * data_rate)
> 		clk++;
> 
> What do you think?
> 

Maybe not a good idea. It may not round up when necessary.

York
Kyle Moffett - July 29, 2011, 9:26 p.m.
On Jul 29, 2011, at 14:46, York Sun wrote:
> On Fri, 2011-07-29 at 11:35 -0700, York Sun wrote:
>> On Fri, 2011-07-29 at 12:20 -0500, Moffett, Kyle D wrote:
>>> On Jul 28, 2011, at 17:41, York Sun wrote:
>>>> I found a problem with the round up. Please try
>>>> 
>>>> picos_to_mclk(mclk_to_picos(3))
>>>> 
>>>> You will notice the result is round up. I am sure it is unexpected.
>>> 
>>> Hmm... what does fsl_ddr_get_mem_data_rate() return on your system?
>>> I cannot reproduce this here with my memory freq of 800MHz.
>>> 
>>> The function has obviously always done rounding, even before I made
>>> any changes.  Have you retested what the math does with the patch
>>> reverted to see if it makes any difference?
>>> 
>>> Hmm, looking at it, the obvious problem case would be mclk_to_picos()
>>> using a rounded result in a subsequent multiply.  Can you retest by
>>> replacing mclk_to_picos() with the following code; this should keep
>>> the error down by doing the rounding-to-10ps *after* the multiply.
>>> 
>>> unsigned int mclk_to_picos(unsigned int mclk)
>>> {
>>>        unsigned int data_rate = get_ddr_freq(0);
>>>        unsigned int result;
>>> 
>>>        /* Round to nearest 10ps, being careful about 64-bit multiply/divide */
>>>        unsigned long long mclk_ps = ULL_2e12 * mclk;
>>> 
>>>        /* Add 5*data_rate, for rounding */
>>>        mclk_ps += 5*(unsigned long long)data_rate;
>>> 
>>>        /* Now perform the big divide, the result fits in 32-bits */
>>>        do_div(mclk_ps, data_rate);
>>>        result = mclk_ps;
>>> 
>>>        /* We still need to round to 10ps */
>>>        return 10 * (result/10);
>>> }
>>> 
>>> Obviously you could also get rid of the rounding alltogether, but I
>>> don't know why it was put in the code in the first place...
>> 
>> I think the problem comes from the round up. But your calculation of
>> remainder is not right. You explained to me with example of 2*10^^12-1.
>> It seems to be right until I tried another number 264*10^^7. I think the
>> following calculation of remainder is
>> 
>> 	clks_rem = do_div(clks, UL_5POW12);
>> 	clks >>= 13;
>> 	clks_rem |= (clks & (UL_2POW13-1)) * UL_5POW12;

That code you pasted is definitely not what is in the tree.
The in-tree code is:
  57  /* First multiply the time by the data rate (32x32 => 64) */
  58  clks = picos * (unsigned long long)get_ddr_freq(0);
  59
  60  /*
  61   * Now divide by 5^12 and track the 32-bit remainder, then divide
  62   * by 2*(2^12) using shifts (and updating the remainder).
  63   */
  64  clks_rem = do_div(clks, UL_5pow12);
  65  clks_rem <<= 13;
  66  clks_rem |= clks & (UL_2pow13-1);
  67  clks >>= 13;
  68
  69  /* If we had a remainder, then round up */
  70  if (clks_rem)
  71          clks++;

Can you tell me what your value of get_ddr_freq() is?  I may be able to
reproduce the issue here.

Also, can you try your test case with the replacement for mclk_to_picos()
that I pasted above and let me know what happens?

I think the *real* error is that get_memory_clk_period_ps() (which I did
not change) rounds up or down to a multiple of 10ps, and then the *result*
of get_memory_clk_period_ps() is multiplied by the number of clocks.  The
effect is that the 5ps of rounding error that can occur is multiplied by
the number of clocks, and may result in a much-too-large result.  When you
convert that number of picoseconds *back* to memory clocks you get a
result that is too big, but that's because it was too many picoseconds!

Where did you get the number 264*10^7?  That seems too big to be a number
of picoseconds (that's 2ms!), but way too small to be the product of
get_ddr_freq() and some number of picoseconds.

EG: For my system here with a DDR frequency of 800MHz, each clock is 2500ps,
so 3 clocks times 2500ps times 800MHz is 6*10^12.

In fact if you give "picos_to_mclk()" any number of picoseconds larger than
at least 1 clock, the initial value of the "clks" variable is guaranteed to
be at least 2*10^12, right?

Using my number, let's run through the math:

So if you start with picos = 7499 (to put rounding to the test) and
get_ddr_freq() is 800MHz:
  clks = 7499 * 800*1000*1000;

Now we have "clks" = 5999200000000, and we divide by 5**12:
  clks_rem = do_div(clks, UL_5pow12);

Since 5999200000000/(5**12) is 24572.7232, we get:
  clks = 24572
  clks_rem = 176562500 (which is equal to 5^12 * 0.7232)

Now left-shift clks_rem by 13:
  clks_rem = 1446400000000

Then bitwise-OR in the lower 13 bits of clks:
  clks_rem |= (24572 & 8191)
  clks_rem = 1446400008188;

Finally, rightshift clks by 13:
  clks = 2

Since there is a remainder, we "round up" from 2 to 3.

These numbers empirically make sense, because I gave the system 7499ps
(which is 3 clocks minus 1ps) and it rounded up properly.

If I give it a simple number like 7500ps, then the math is really easy
and obvious:
  clks = 7500 * 800*1000*100

Then the do_div():
  clks = 24576
  clks_rem = 0

Shifting clks_rem does nothing, because it is zero...

Oring with the lower 13 bits of clks has no effect (it has no low bits set)

Finally, when right-shifting clks by 13:
  clks = 3

This is obviously the correct result.

Cheers,
Kyle Moffett
York Sun - July 29, 2011, 9:46 p.m.
On Fri, 2011-07-29 at 16:26 -0500, Moffett, Kyle D wrote:
> On Jul 29, 2011, at 14:46, York Sun wrote:
> > On Fri, 2011-07-29 at 11:35 -0700, York Sun wrote:
> >> On Fri, 2011-07-29 at 12:20 -0500, Moffett, Kyle D wrote:
> >>> On Jul 28, 2011, at 17:41, York Sun wrote:
> >>>> I found a problem with the round up. Please try
> >>>> 
> >>>> picos_to_mclk(mclk_to_picos(3))
> >>>> 
> >>>> You will notice the result is round up. I am sure it is unexpected.
> >>> 
> >>> Hmm... what does fsl_ddr_get_mem_data_rate() return on your system?
> >>> I cannot reproduce this here with my memory freq of 800MHz.
> >>> 
> >>> The function has obviously always done rounding, even before I made
> >>> any changes.  Have you retested what the math does with the patch
> >>> reverted to see if it makes any difference?
> >>> 
> >>> Hmm, looking at it, the obvious problem case would be mclk_to_picos()
> >>> using a rounded result in a subsequent multiply.  Can you retest by
> >>> replacing mclk_to_picos() with the following code; this should keep
> >>> the error down by doing the rounding-to-10ps *after* the multiply.
> >>> 
> >>> unsigned int mclk_to_picos(unsigned int mclk)
> >>> {
> >>>        unsigned int data_rate = get_ddr_freq(0);
> >>>        unsigned int result;
> >>> 
> >>>        /* Round to nearest 10ps, being careful about 64-bit multiply/divide */
> >>>        unsigned long long mclk_ps = ULL_2e12 * mclk;
> >>> 
> >>>        /* Add 5*data_rate, for rounding */
> >>>        mclk_ps += 5*(unsigned long long)data_rate;
> >>> 
> >>>        /* Now perform the big divide, the result fits in 32-bits */
> >>>        do_div(mclk_ps, data_rate);
> >>>        result = mclk_ps;
> >>> 
> >>>        /* We still need to round to 10ps */
> >>>        return 10 * (result/10);
> >>> }
> >>> 
> >>> Obviously you could also get rid of the rounding alltogether, but I
> >>> don't know why it was put in the code in the first place...
> >> 
> >> I think the problem comes from the round up. But your calculation of
> >> remainder is not right. You explained to me with example of 2*10^^12-1.
> >> It seems to be right until I tried another number 264*10^^7. I think the
> >> following calculation of remainder is
> >> 
> >> 	clks_rem = do_div(clks, UL_5POW12);
> >> 	clks >>= 13;
> >> 	clks_rem |= (clks & (UL_2POW13-1)) * UL_5POW12;
> 
> That code you pasted is definitely not what is in the tree.
> The in-tree code is:
>   57  /* First multiply the time by the data rate (32x32 => 64) */
>   58  clks = picos * (unsigned long long)get_ddr_freq(0);
>   59
>   60  /*
>   61   * Now divide by 5^12 and track the 32-bit remainder, then divide
>   62   * by 2*(2^12) using shifts (and updating the remainder).
>   63   */
>   64  clks_rem = do_div(clks, UL_5pow12);
>   65  clks_rem <<= 13;
>   66  clks_rem |= clks & (UL_2pow13-1);
>   67  clks >>= 13;
>   68
>   69  /* If we had a remainder, then round up */
>   70  if (clks_rem)
>   71          clks++;
> 
> Can you tell me what your value of get_ddr_freq() is?  I may be able to
> reproduce the issue here.

264000000


> 
> Also, can you try your test case with the replacement for mclk_to_picos()
> that I pasted above and let me know what happens?
> 
> I think the *real* error is that get_memory_clk_period_ps() (which I did
> not change) rounds up or down to a multiple of 10ps, and then the *result*
> of get_memory_clk_period_ps() is multiplied by the number of clocks.  The
> effect is that the 5ps of rounding error that can occur is multiplied by
> the number of clocks, and may result in a much-too-large result.  When you
> convert that number of picoseconds *back* to memory clocks you get a
> result that is too big, but that's because it was too many picoseconds!

Correct. get_memory_clk_period_ps() is causing this problem.

> 
> Where did you get the number 264*10^7?  That seems too big to be a number
> of picoseconds (that's 2ms!), but way too small to be the product of
> get_ddr_freq() and some number of picoseconds.

Forget that. I was just trying.

> 
> EG: For my system here with a DDR frequency of 800MHz, each clock is 2500ps,
> so 3 clocks times 2500ps times 800MHz is 6*10^12.
> 
> In fact if you give "picos_to_mclk()" any number of picoseconds larger than
> at least 1 clock, the initial value of the "clks" variable is guaranteed to
> be at least 2*10^12, right?
> 
> Using my number, let's run through the math:
> 
> So if you start with picos = 7499 (to put rounding to the test) and
> get_ddr_freq() is 800MHz:
>   clks = 7499 * 800*1000*1000;
> 
> Now we have "clks" = 5999200000000, and we divide by 5**12:
>   clks_rem = do_div(clks, UL_5pow12);
> 
> Since 5999200000000/(5**12) is 24572.7232, we get:
>   clks = 24572
>   clks_rem = 176562500 (which is equal to 5^12 * 0.7232)
> 
> Now left-shift clks_rem by 13:
>   clks_rem = 1446400000000
> 

Here I think it is wrong. I want to use the clks_rem value later.

I change it to this

	clks_rem = do_div(clks, UL_5POW12);
	clks_rem += (clks & (UL_2POW13-1)) * UL_5POW12;
	clks >>= 13;

	/* If we had a remainder greater than the 1ps error, then round up */
	if (clks_rem > data_rate)
		clks++;

After fixing the get_memory_clk_period_ps() to round up to near 1ps, the
calculation is more accurate (well, not as floating point). And ow
clks_rem holds the correct remainder.

York
Kyle Moffett - Aug. 3, 2011, 7:07 p.m.
On Jul 29, 2011, at 17:46, York Sun wrote:
> On Fri, 2011-07-29 at 16:26 -0500, Moffett, Kyle D wrote:
>> On Jul 29, 2011, at 14:46, York Sun wrote:
>>> On Fri, 2011-07-29 at 11:35 -0700, York Sun wrote:
>>>> On Fri, 2011-07-29 at 12:20 -0500, Moffett, Kyle D wrote:
>>>>> On Jul 28, 2011, at 17:41, York Sun wrote:
>>>>>> I found a problem with the round up. Please try
>>>>>> 
>>>>>> picos_to_mclk(mclk_to_picos(3))
>>>>>> 
>>>>>> You will notice the result is round up. I am sure it is unexpected.
>>>>> 
>>>>> Hmm... what does fsl_ddr_get_mem_data_rate() return on your system?
>>>>> I cannot reproduce this here with my memory freq of 800MHz.
>>>>> 
>>>>> [...snip...]
>>>>> 
>>>>> Obviously you could also get rid of the rounding alltogether, but I
>>>>> don't know why it was put in the code in the first place...
>>>> 
>>>> I think the problem comes from the round up. But your calculation of
>>>> remainder is not right.
>> 
>> Can you tell me what your value of get_ddr_freq() is?  I may be able to
>> reproduce the issue here.
> 
> 264000000

Ok, so that's a 264MHz DDR frequency, right?  264 * 10^6?

If I plug that number in to the code I can reproduce your result:
  picos_to_mclk(mclk_to_picos(3)) == 4

But that happens with both the new code *AND* the old code.  I ran them
side-by-side in the same test-harness:
  OLD:
  get_memory_clk_period_ps(): 7580
  picos_to_mclk(mclk_to_picos(3)): 4
  ----------------------
  NEW:
  get_memory_clk_period_ps(): 7580
  picos_to_mclk(mclk_to_picos(3)): 4

So there's a bug, but it's always been there...

Just to show the floating-point math really quick:

  2 * (10^12 picosec/sec) / (264 * 10^6 cyc/sec)
     == 7575.757575..... picoseconds

Since the code rounds to the nearest 10ps, we get 7580ps.  Then 3 mclk is
22740ps, but it should be 22727.272727.....

The picos_to_mclk() function always rounds up, because we must avoid
undercutting the SPD timing margin.  If your SPD specifies a minimum of a
7590ps timing window, you MUST use a 2 mclk delay (even though it's
painfully close to 1 mclk), because otherwise you will get RAM errors.

Therefore mclk_to_picos() and get_memory_clk_period_ps() should always
round down, even though they currently round to the closest 10ps.

Continuing the example:  If we later we do picos_to_mclk(22740):
  22740ps * (264 * 10^5 cyc/sec) * 0.5 / (10^12 picosec/sec)
    == 3.00168

When we do the integer math with remainder, we get:
  3 rem 3360000000

In both cases, we correctly automatically round up to 4.

I believe I have a fix for the problem; here's my fixed test case log:
  get_memory_clk_period_ps(): 7575
  picos_to_mclk(mclk_to_picos(3)): 3

And here is the new code (the picos_to_mclk() function is unchanged):

  unsigned int mclk_to_picos(unsigned int mclk)
  {
          unsigned int data_rate = get_ddr_freq(0);

          /* Be careful to avoid a 64-bit full-divide (result fits in 32) */
          unsigned long long mclk_ps = ULL_2E12 * mclk;
          do_div(mclk_ps, data_rate);
          return mclk_ps;
  }

  unsigned int get_memory_clk_period_ps(void)
  {
          return mclk_to_picos(1);
  }

Cheers,
Kyle Moffett

Patch

diff --git a/arch/powerpc/cpu/mpc8xxx/ddr/util.c b/arch/powerpc/cpu/mpc8xxx/ddr/util.c
index 1e2d921..c545d59 100644
--- a/arch/powerpc/cpu/mpc8xxx/ddr/util.c
+++ b/arch/powerpc/cpu/mpc8xxx/ddr/util.c
@@ -8,11 +8,19 @@ 
 
 #include <common.h>
 #include <asm/fsl_law.h>
+#include <div64.h>
 
 #include "ddr.h"
 
 unsigned int fsl_ddr_get_mem_data_rate(void);
 
+/* To avoid 64-bit full-divides, we factor this here */
+#define ULL_2e12 2000000000000ULL
+#define UL_5pow12 244140625UL
+#define UL_2pow13 (1UL << 13)
+
+#define ULL_8Fs 0xFFFFFFFFULL
+
 /*
  * Round mclk_ps to nearest 10 ps in memory controller code.
  *
@@ -22,36 +30,52 @@  unsigned int fsl_ddr_get_mem_data_rate(void);
  */
 unsigned int get_memory_clk_period_ps(void)
 {
-	unsigned int mclk_ps;
+	unsigned int data_rate = fsl_ddr_get_mem_data_rate();
+	unsigned int result;
+
+	/* Round to nearest 10ps, being careful about 64-bit multiply/divide */
+	unsigned long long mclk_ps = ULL_2e12;
 
-	mclk_ps = 2000000000000ULL / fsl_ddr_get_mem_data_rate();
-	/* round to nearest 10 ps */
-	return 10 * ((mclk_ps + 5) / 10);
+	/* Add 5*data_rate, for rounding */
+	mclk_ps += 5*(unsigned long long)data_rate;
+
+	/* Now perform the big divide, the result fits in 32-bits */
+	do_div(mclk_ps, data_rate);
+	result = mclk_ps;
+
+	/* We still need to round to 10ps */
+	return 10 * (result/10);
 }
 
 /* Convert picoseconds into DRAM clock cycles (rounding up if needed). */
 unsigned int picos_to_mclk(unsigned int picos)
 {
-	const unsigned long long ULL_2e12 = 2000000000000ULL;
-	const unsigned long long ULL_8Fs = 0xFFFFFFFFULL;
-	unsigned long long clks;
-	unsigned long long clks_temp;
+	unsigned long long clks, clks_rem;
 
+	/* Short circuit for zero picos */
 	if (!picos)
 		return 0;
 
-	clks = fsl_ddr_get_mem_data_rate() * (unsigned long long) picos;
-	clks_temp = clks;
-	clks = clks / ULL_2e12;
-	if (clks_temp % ULL_2e12) {
+	/* First multiply the time by the data rate (32x32 => 64) */
+	clks = picos * (unsigned long long)fsl_ddr_get_mem_data_rate();
+
+	/*
+	 * Now divide by 5^12 and track the 32-bit remainder, then divide
+	 * by 2*(2^12) using shifts (and updating the remainder).
+	 */
+	clks_rem = do_div(clks, UL_5pow12);
+	clks_rem <<= 13;
+	clks_rem |= clks & (UL_2pow13-1);
+	clks >>= 13;
+
+	/* If we had a remainder, then round up */
+	if (clks_rem)
 		clks++;
-	}
 
-	if (clks > ULL_8Fs) {
+	/* Clamp to the maximum representable value */
+	if (clks > ULL_8Fs)
 		clks = ULL_8Fs;
-	}
-
-	return (unsigned int) clks;
+	return (unsigned int)clks;
 }
 
 unsigned int mclk_to_picos(unsigned int mclk)