
utils/genrandconfig: reduce the maximum "size" of random configurations

Message ID 20211113141703.1124706-1-thomas.petazzoni@bootlin.com
State Accepted
Series utils/genrandconfig: reduce the maximum "size" of random configurations

Commit Message

Thomas Petazzoni Nov. 13, 2021, 2:17 p.m. UTC
genrandconfig is used by the Buildroot autobuilders to generate
semi-random configurations that we build-test. As part of this, we use
"make randpackageconfig" to randomize the selection of packages,
together with a KCONFIG_PROBABILITY value, which indicates the
probability for each option to be enabled. This probability is
itself randomized, between 1% and 30% for every build.

However, with our increasing number of packages (over 2900), when we
use a 30% probability for options to be enabled, it means a *lot* of
options are enabled, causing very large configurations to be
tested. These configurations are not very realistic, and they take
ages to build on our autobuilders: we have builds that take 4, 5 or
even 7 hours to build.

In order to test a larger number of configurations and therefore a
larger variety of configurations, this commit reduces the maximum
probability to 20%.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
---
 utils/genrandconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
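
For context, the randomized invocation that this patch touches boils down
to the following minimal sketch ("output" and "buildroot" are placeholder
paths; the real script takes them from its command-line arguments):

    import subprocess
    from random import randint

    # KCONFIG_PROBABILITY is the percent chance for each kconfig option to
    # be enabled; the patch lowers the upper bound of the draw from 30 to 20.
    subprocess.check_call(["make", "O=output", "-C", "buildroot",
                           "KCONFIG_PROBABILITY=%d" % randint(1, 20),
                           "randpackageconfig"])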

Comments

Yann E. MORIN Nov. 13, 2021, 6:02 p.m. UTC | #1
Thomas, All,

On 2021-11-13 15:17 +0100, Thomas Petazzoni spake thusly:
> genrandconfig is used by the Buildroot autobuilders to generate
> semi-random configurations that we build-test. As part of this, we use
> "make randpackageconfig" to randomize the selection of packages,
> together with a KCONFIG_PROBABILITY value, which indicates the
> probability for each option to be enabled. This probability is
> itself randomized, between 1% and 30% for every build.
> 
> However, with our increasing number of packages (over 2900), when we
> use a 30% probability for options to be enabled, it means a *lot* of
> options are enabled, causing very large configurations to be
> tested. These configurations are not very realistic, and they take
> ages to build on our autobuilders: we have builds that take 4, 5 or
> even 7 hours to build.
> 
> In order to test a larger number of configurations and therefore a
> larger variety of configurations, this commit reduces the maximum
> probability to 20%.
> 
> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
> ---
>  utils/genrandconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/utils/genrandconfig b/utils/genrandconfig
> index 4fffcbad11..77c7e585f5 100755
> --- a/utils/genrandconfig
> +++ b/utils/genrandconfig
> @@ -410,7 +410,7 @@ def gen_config(args):
>              return 1
>          bounded_loop -= 1
>          subprocess.check_call(["make", "O=%s" % args.outputdir, "-C", args.buildrootdir,
> -                               "KCONFIG_PROBABILITY=%d" % randint(1, 30),
> +                               "KCONFIG_PROBABILITY=%d" % randint(1, 20),

On IRC, we discussed using a normal distribution rather than the
uniform distribution that randint() provides.

So, I've done a few tests with 100M iterations, and for example, here is
the (15,5) gaussian [0]:

     1: #    0%
     2: #    0%
     3: ##    0%
     4: ####    0%
     5: ######    1%
     6: #########    1%
     7: #############    2%
     8: #################    3%
     9: #####################    4%
    10: ##########################    5%
    11: ###############################    6%
    12: ###################################    7%
    13: ######################################    7%
    14: #######################################    7%
    15: #######################################    7%
    16: ######################################    7%
    17: ###################################    7%
    18: ###############################    6%
    19: ##########################    5%
    20: #####################    4%
    21: #################    3%
    22: #############    2%
    23: #########    1%
    24: ######    1%
    25: ####    0%
    26: ##    0%
    27: #    0%
    28: #    0%
    29:     0%
    30:     0%

And here is the (10,7) gaussian (which has my preference):

     1: ###############    3%
     2: #################    3%
     3: ####################    4%
     4: #######################    4%
     5: #########################    5%
     6: ###########################    5%
     7: #############################    5%
     8: ##############################    6%
     9: ###############################    6%
    10: ###############################    6%
    11: ##############################    6%
    12: #############################    5%
    13: ###########################    5%
    14: #########################    5%
    15: #######################    4%
    16: ####################    4%
    17: #################    3%
    18: ###############    3%
    19: ############    2%
    20: ##########    2%
    21: ########    1%
    22: ######    1%
    23: ####    0%
    24: ###    0%
    25: ##    0%
    26: #    0%
    27: #    0%
    28:     0%
    29:     0%
    30:     0%

Since a gaussian can output values all over ℝ, we must limit the range
explicitly (note that the 0% above do not represent 0 occurrences!):

    proba = 0
    while proba < 1 or proba > 100:
        proba = int(random.gauss(15, 5))
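
As a rough sketch of how that could slot into genrandconfig in place of
the randint() call (the helper name and its defaults are purely
illustrative; the (15, 5) parameters and the 1..100 bounds just mirror
the snippet above):

    import random

    def random_kconfig_probability(mu=15, sigma=5, lo=1, hi=100):
        # Redraw until the rounded gaussian sample falls within [lo, hi]
        proba = 0
        while proba < lo or proba > hi:
            proba = int(random.gauss(mu, sigma))
        return proba

    # ... then, in gen_config(), instead of the randint() call:
    #     "KCONFIG_PROBABILITY=%d" % random_kconfig_probability(),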

What do you think?

Regards,
Yann E. MORIN.

[0] Here's the ugly script to reproduce the gaussians; adapt for other
parameters...

    #!/usr/bin/env python3
    import random

    iters = 100000  # iteration count, in thousands (1000*iters draws total)

    def main():
        buckets = dict()
        for i in range(1, 31):
            buckets[i] = 0
        for i in range(1000*iters):
            r = 0
            while r < 1 or r > 30:  # 30 for the bucket size
                r = int(random.gauss(15, 5))
            buckets[r] = buckets[r]+1
        for i in range(1, 31):
            print('{:2}: '.format(i), end='')
            for j in range(int(buckets[i]/(2*iters))):
                print('#', end='')
            print('   {: 2}%'.format(int(buckets[i]/(10*iters))))

    if __name__ == '__main__':
        main()

>                                 "randpackageconfig"])
>  
>          if fixup_config(sysinfo, configfile):
> -- 
> 2.31.1
> 
> _______________________________________________
> buildroot mailing list
> buildroot@buildroot.org
> https://lists.buildroot.org/mailman/listinfo/buildroot
Peter Korsgaard Nov. 14, 2021, 12:48 p.m. UTC | #2
>>>>> "Thomas" == Thomas Petazzoni <thomas.petazzoni@bootlin.com> writes:

 > genrandconfig is used by the Buildroot autobuilders to generate
 > semi-random configurations that we build-test. As part of this, we use
 > "make randpackageconfig" to randomize the selection of packages,
 > together with a KCONFIG_PROBABILITY value, which indicates the
 > probability for each option to be enabled. This probability is
 > itself randomized, between 1% and 30% for every build.

 > However, with our increasing number of packages (over 2900), when we
 > use a 30% probability for options to be enabled, it means a *lot* of
 > options are enabled, causing very large configurations to be
 > tested. These configurations are not very realistic, and they take
 > ages to build on our autobuilders: we have builds that take 4, 5 or
 > even 7 hours to build.

 > In order to test a larger number of configurations and therefore a
 > larger variety of configurations, this commit reduces the maximum
 > probability to 20%.

 > Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>

Makes sense. Committed, thanks.
Thomas Petazzoni Nov. 14, 2021, 1 p.m. UTC | #3
Hello,

On Sat, 13 Nov 2021 19:02:06 +0100
"Yann E. MORIN" <yann.morin.1998@free.fr> wrote:

> On IRC, we discussed using a normal distribution rather than the
> uniform distribution that randint() provides.

Yeah, but after thinking more about it, I'm wondering whether this
makes sense or not. Indeed, why are we more interested in
configurations having ~10% of options enabled than in the ones
having 1% or 20%? All of them are equally interesting, aren't they?

> And here is the (10,7) gaussian (which has my preference):
> 
>      1: ###############    3%
>      2: #################    3%
>      3: ####################    4%
>      4: #######################    4%
>      5: #########################    5%
>      6: ###########################    5%
>      7: #############################    5%
>      8: ##############################    6%
>      9: ###############################    6%
>     10: ###############################    6%
>     11: ##############################    6%
>     12: #############################    5%
>     13: ###########################    5%
>     14: #########################    5%
>     15: #######################    4%
>     16: ####################    4%
>     17: #################    3%
>     18: ###############    3%
>     19: ############    2%
>     20: ##########    2%
>     21: ########    1%
>     22: ######    1%
>     23: ####    0%
>     24: ###    0%
>     25: ##    0%
>     26: #    0%
>     27: #    0%

I think the reality of Buildroot configurations is probably even more
biased towards having just a few percent of options enabled.

> Since a gaussian can output values all over ℝ, we must limit the range
> explicitly (note that the 0% above do not represent 0 occurrences!):
> 
>     proba = 0
>     while proba < 1 or proba > 100:
>         proba = int(random.gauss(15, 5))

How much CPU is this loop going to consume before it gives a value that
is between 1 and 100? Would a % 100 be better?

Best regards,

Thomas
Thomas Petazzoni Nov. 14, 2021, 1:01 p.m. UTC | #4
On Sun, 14 Nov 2021 13:48:43 +0100
Peter Korsgaard <peter@korsgaard.com> wrote:

> >>>>> "Thomas" == Thomas Petazzoni <thomas.petazzoni@bootlin.com> writes:  
> 
>  > genrandconfig is used by the Buildroot autobuilders to generate
>  > semi-random configurations that we build-test. As part of this, we use
>  > "make randpackageconfig" to randomize the selection of packages,
>  > together with a KCONFIG_PROBABILITY value, which indicates the
>  > probability for each option to be enabled. This probability is
>  > itself randomized, between 1% and 30% for every build.  
> 
>  > However, with our increasing number of packages (over 2900), when we
>  > use a 30% probability for options to be enabled, it means a *lot* of
>  > options are enabled, causing very large configurations to be
>  > tested. These configurations are not very realistic, and they take
>  > ages to build on our autobuilders: we have builds that take 4, 5 or
>  > even 7 hours to build.  
> 
>  > In order to test a larger number of configurations and therefore a
>  > larger variety of configurations, this commit reduces the maximum
>  > probability to 20%.  
> 
>  > Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>  
> 
> Makes sense. Committed, thanks.

Thanks! I think we want it in all branches currently tested by the
autobuilders, i.e. master, next, 2021.02.x and 2021.08.x.

Thomas
Yann E. MORIN Nov. 14, 2021, 6:35 p.m. UTC | #5
Thomas, All,

On 2021-11-14 14:00 +0100, Thomas Petazzoni spake thusly:
> On Sat, 13 Nov 2021 19:02:06 +0100
> "Yann E. MORIN" <yann.morin.1998@free.fr> wrote:
> > On IRC, we discussed using a normal distribution rather than the
> > uniform distribution that randint() provides.
> Yeah, but after thinking more about it, I'm wondering whether this
> makes sense or not. Indeed, why are we more interested in
> configurations having ~10% of options enabled than in the ones
> having 1% or 20%? All of them are equally interesting, aren't they?

Well, you said yourself that it is most probable that configurations
have few options turned on, so we'd probably favour smaller
configurations over bigger ones.

Especially since smaller configurations are more prone to exhibit
missing dependencies that would otherwise be inherited from other
packages. Conversely, though, bigger configurations are also more prone
to exhibit build failures on unexpected dependencies.

So YMMV...

> > And here is the (10,7) gaussian (which has my preference):
[--SNIP--]
> I think the reality of Buildroot configurations is probably even more
> biased towards having just a few percent of options enabled.

We can maybe refine the parameters, like (3,7).

> > Since a gaussian can output values all over ℝ, we must limit the range
> > explicitly (note that the 0% above do not represent 0 occurrences!):
> > 
> >     proba = 0
> >     while proba < 1 or proba > 100:
> >         proba = int(random.gauss(15, 5))
> 
> How much CPU is this loop going to consume before it gives a value that
> is between 1 and 100? Would a % 100 be better?

Well, the normal distribution is very predictable [0]. Here, for the
(10,7) gaussian above, the mean µ is 10 and the standard deviation σ is 7.
This means that:

  - 50% of values are below µ, 50% above
  - ~68.27% of values are between µ-σ=3 and µ+σ=17
  - ~95.45% of values are between µ-2σ=-4 and µ+2σ=24
  - ~99.73% of values are between µ-3σ=-11 and µ+3σ=31
  - ~84.1% of values are greater than µ-σ=3
  - ~15.9% of values are below µ-σ=3
  - ~2.3% of values are below µ-2σ=-4

And with the following test:

    >>> import random
    >>> restart = (0, 0)
    >>> for i in range(100*1000*1000):  # 100M iterations
    ...     r = int(random.gauss(10, 7)+0.5)  # Round to nearest int
    ...     if r < 1:
    ...         restart = (restart[0]+1, restart[1])
    ...     elif r > 100:
    ...         restart = (restart[0], restart[1]+1)
    ...
    >>> print('restart={}'.format(restart))
    restart=(8736128, 0)

So, it means there were ~8.736% of values below 1, and none above
100, which is consistent with the theory. Hence, we would usually need
just a few iterations of the loop to find an appropriate value:

                 Cumulative probability that, after N iterations, all
                 values drawn so far are:
                | < 1 (still looping)   | >= 1 (loop exited)
    ------------+-----------------------+-------------------
    1st iter    | 0.08736               | 0.91264
    2nd iter    | 0.00763 =0.08736^2    | 0.99237
    3rd iter    | 0.00067 =0.08736^3    | 0.99933
    4th iter    | 0.00006 =0.08736^4    | 0.99994
    5th iter    | 0.000005 =0.08736^5   | 0.99999

I.e. the odds of still getting something less than one after the 5th
iteration are less than one in 100,000.

For the other end of the range, 100, it is so far away (almost 13σ!)
that we did not even get a single such value in 100 million iterations.
That's less likely than a dinosaur-killing asteroid hitting Earth in the
next million years (or some similarly huge number!)...

Oh, and doing those 100M iterations took ~54s, which means the loop is,
well, cheap. Very cheap.
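
The numbers above can be cross-checked in a few lines (0.08736 being the
rejection probability measured in the run above; the expected number of
draws follows the usual geometric-distribution formula):

    p_reject = 0.08736  # measured probability that a draw falls below 1

    # Cumulative probability that the loop is still redrawing after k draws
    for k in range(1, 6):
        print("after %d draw(s): still looping with probability %.6f"
              % (k, p_reject ** k))

    # Expected number of draws per call: 1 / (1 - p_reject) ~= 1.096
    print("expected draws per call: %.3f" % (1.0 / (1.0 - p_reject)))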

[0] https://en.wikipedia.org/wiki/Normal_distribution

(Yeah, I love maths and probabilities. Even though I'm pretty bad at them!)

Regards,
Yann E. MORIN.
Peter Korsgaard Nov. 17, 2021, 10:15 p.m. UTC | #6
>>>>> "Peter" == Peter Korsgaard <peter@korsgaard.com> writes:

>>>>> "Thomas" == Thomas Petazzoni <thomas.petazzoni@bootlin.com> writes:
 >> genrandconfig is used by the Buildroot autobuilders to generate
 >> semi-random configurations that we build-test. As part of this, we use
 >> "make randpackageconfig" to randomize the selection of packages,
 >> together with a KCONFIG_PROBABILITY value, which indicates the
 >> probability for each option to be enabled. This probability is
 >> itself randomized, between 1% and 30% for every build.

 >> However, with our increasing number of packages (over 2900), when we
 >> use a 30% probability for options to be enabled, it means a *lot* of
 >> options are enabled, causing very large configurations to be
 >> tested. These configurations are not very realistic, and they take
 >> ages to build on our autobuilders: we have builds that take 4, 5 or
 >> even 7 hours to build.

 >> In order to test a larger number of configurations and therefore a
 >> larger variety of configurations, this commit reduces the maximum
 >> probability to 20%.

 >> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>

 > Makes sense. Committed, thanks.

Committed to 2021.02.x and 2021.08.x, thanks.

Patch

diff --git a/utils/genrandconfig b/utils/genrandconfig
index 4fffcbad11..77c7e585f5 100755
--- a/utils/genrandconfig
+++ b/utils/genrandconfig
@@ -410,7 +410,7 @@ def gen_config(args):
             return 1
         bounded_loop -= 1
         subprocess.check_call(["make", "O=%s" % args.outputdir, "-C", args.buildrootdir,
-                               "KCONFIG_PROBABILITY=%d" % randint(1, 30),
+                               "KCONFIG_PROBABILITY=%d" % randint(1, 20),
                                "randpackageconfig"])
 
         if fixup_config(sysinfo, configfile):