| Message ID | 20211113141703.1124706-1-thomas.petazzoni@bootlin.com |
| --- | --- |
| State | Accepted |
| Series | utils/genrandconfig: reduce the maximum "size" of random configurations |
Thomas, All,

On 2021-11-13 15:17 +0100, Thomas Petazzoni spake thusly:
> genrandconfig is used by the Buildroot autobuilders to generate
> semi-random configurations that we build-test. As part of this, we use
> "make randpackageconfig" to randomize the selection of packages,
> together with a KCONFIG_PROBABILITY value, which indicates the
> probability for each option to be enabled. This probability is
> itself randomized, between 1% and 30% for every build.
>
> However, with our increasing number of packages (over 2900), when we
> use a 30% probability for options to be enabled, it means a *lot* of
> options are enabled, causing very large configurations to be
> tested. These configurations are not very realistic, and they take
> ages to build on our autobuilders: we have builds that take 4, 5 or
> even 7 hours to build.
>
> In order to test a larger number of configurations and therefore a
> larger variety of configurations, this commit reduces the maximum
> probability to 20%.
>
> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
> ---
>  utils/genrandconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/utils/genrandconfig b/utils/genrandconfig
> index 4fffcbad11..77c7e585f5 100755
> --- a/utils/genrandconfig
> +++ b/utils/genrandconfig
> @@ -410,7 +410,7 @@ def gen_config(args):
>              return 1
>          bounded_loop -= 1
>          subprocess.check_call(["make", "O=%s" % args.outputdir, "-C", args.buildrootdir,
> -                               "KCONFIG_PROBABILITY=%d" % randint(1, 30),
> +                               "KCONFIG_PROBABILITY=%d" % randint(1, 20),

On IRC, we discussed using a normal distribution rather than the
uniform distribution that randint() provides.
So, I've done a few tests with 100M iterations, and for example, here
is the (15,5) gaussian [0]:

     1: # 0%
     2: # 0%
     3: ## 0%
     4: #### 0%
     5: ###### 1%
     6: ######### 1%
     7: ############# 2%
     8: ################# 3%
     9: ##################### 4%
    10: ########################## 5%
    11: ############################### 6%
    12: ################################### 7%
    13: ###################################### 7%
    14: ####################################### 7%
    15: ####################################### 7%
    16: ###################################### 7%
    17: ################################### 7%
    18: ############################### 6%
    19: ########################## 5%
    20: ##################### 4%
    21: ################# 3%
    22: ############# 2%
    23: ######### 1%
    24: ###### 1%
    25: #### 0%
    26: ## 0%
    27: # 0%
    28: # 0%
    29:  0%
    30:  0%

And here is the (10,7) gaussian (which has my preference):

     1: ############### 3%
     2: ################# 3%
     3: #################### 4%
     4: ####################### 4%
     5: ######################### 5%
     6: ########################### 5%
     7: ############################# 5%
     8: ############################## 6%
     9: ############################### 6%
    10: ############################### 6%
    11: ############################## 6%
    12: ############################# 5%
    13: ########################### 5%
    14: ######################### 5%
    15: ####################### 4%
    16: #################### 4%
    17: ################# 3%
    18: ############### 3%
    19: ############ 2%
    20: ########## 2%
    21: ######## 1%
    22: ###### 1%
    23: #### 0%
    24: ### 0%
    25: ## 0%
    26: # 0%
    27: # 0%
    28:  0%
    29:  0%
    30:  0%

Since a gaussian can output values all over ℝ, we must limit the range
explicitly (note that the 0% above are not representing 0 occurrences!):

    proba = 0
    while proba < 1 or proba > 100:
        proba = int(random.gauss(15, 5))

What do you think?

Regards,
Yann E. MORIN.

[0] Here's the ugly script to reproduce the gaussians; adapt for other
parameters...
    #!/usr/bin/env python3

    import random

    iters = 100000  # In units of 1k

    def main():
        buckets = dict()
        for i in range(1, 31):
            buckets[i] = 0
        for i in range(1000*iters):
            r = 0
            while r < 1 or r > 30:  # 30 for the bucket size
                r = int(random.gauss(15, 5))
            buckets[r] = buckets[r]+1
        for i in range(1, 31):
            print('{:2}: '.format(i), end='')
            for j in range(int(buckets[i]/(2*iters))):
                print('#', end='')
            print(' {: 2}%'.format(int(buckets[i]/(10*iters))))

    if __name__ == '__main__':
        main()

>                                "randpackageconfig"])
>
>      if fixup_config(sysinfo, configfile):
> --
> 2.31.1
>>>>> "Thomas" == Thomas Petazzoni <thomas.petazzoni@bootlin.com> writes:

 > genrandconfig is used by the Buildroot autobuilders to generate
 > semi-random configurations that we build-test. As part of this, we use
 > "make randpackageconfig" to randomize the selection of packages,
 > together with a KCONFIG_PROBABILITY value, which indicates the
 > probability for each option to be enabled. This probability is
 > itself randomized, between 1% and 30% for every build.

 > However, with our increasing number of packages (over 2900), when we
 > use a 30% probability for options to be enabled, it means a *lot* of
 > options are enabled, causing very large configurations to be
 > tested. These configurations are not very realistic, and they take
 > ages to build on our autobuilders: we have builds that take 4, 5 or
 > even 7 hours to build.

 > In order to test a larger number of configurations and therefore a
 > larger variety of configurations, this commit reduces the maximum
 > probability to 20%.

 > Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>

Makes sense. Committed, thanks.
Hello,

On Sat, 13 Nov 2021 19:02:06 +0100
"Yann E. MORIN" <yann.morin.1998@free.fr> wrote:

> On IRC, we discussed using a normal distribution rather than the
> uniform distribution that randint() provides.

Yeah, but after thinking more about it, I'm wondering whether this
makes sense or not. Indeed, why are we more interested in
configurations having ~10% of options enabled than in the ones having
1% or 20%? All of them are equally interesting, aren't they?

> And here is the (10,7) gaussian (which has my preference):
>
>     1: ############### 3%
>     2: ################# 3%
>     3: #################### 4%
>     4: ####################### 4%
>     5: ######################### 5%
>     6: ########################### 5%
>     7: ############################# 5%
>     8: ############################## 6%
>     9: ############################### 6%
>    10: ############################### 6%
>    11: ############################## 6%
>    12: ############################# 5%
>    13: ########################### 5%
>    14: ######################### 5%
>    15: ####################### 4%
>    16: #################### 4%
>    17: ################# 3%
>    18: ############### 3%
>    19: ############ 2%
>    20: ########## 2%
>    21: ######## 1%
>    22: ###### 1%
>    23: #### 0%
>    24: ### 0%
>    25: ## 0%
>    26: # 0%
>    27: # 0%

I think the reality of Buildroot configurations is probably even more
biased towards having just a few percent of options enabled.

> Since a gaussian can output values all over ℝ, we must limit the range
> explicitly (note that the 0% above are not representing 0 occurrences!):
>
>     proba = 0
>     while proba < 1 or proba > 100:
>         proba = int(random.gauss(15, 5))

How much CPU is this loop going to consume before it gives a value
that is between 1 and 100? Would a % 100 be better?

Best regards,

Thomas
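[Editorial illustration, not part of the original thread.] Thomas's `% 100` question has a subtle catch: in Python, a negative integer modulo 100 wraps to the top of the range (e.g. `-3 % 100 == 97`), so the gaussian's left tail would fold into a spurious bump near 100 instead of being redrawn, and a draw that truncates to 0 would stay 0, still outside [1, 100]. A minimal sketch, assuming the (10,7) parameters discussed later in the thread, contrasting the two approaches:

```python
import random

random.seed(42)  # reproducible illustration

rejected, folded = [], []
for _ in range(10_000):
    r = int(random.gauss(10, 7))
    # Modulo: negative draws wrap around to 97, 98, ... and draws in
    # (-1, 1) truncate to 0, which is still out of the [1, 100] range.
    folded.append(r % 100)
    # Rejection: simply redraw until the value is in range.
    while r < 1 or r > 100:
        r = int(random.gauss(10, 7))
    rejected.append(r)

# Rejection keeps everything in [1, 100] and preserves the bell shape;
# modulo leaks the left-tail mass to the high end of the range.
print(min(rejected), max(rejected), max(folded))
```

So `% 100` would be cheaper per call but would distort the distribution, which is presumably why the thread sticks with the rejection loop.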
On Sun, 14 Nov 2021 13:48:43 +0100
Peter Korsgaard <peter@korsgaard.com> wrote:

> >>>>> "Thomas" == Thomas Petazzoni <thomas.petazzoni@bootlin.com> writes:
>
> > genrandconfig is used by the Buildroot autobuilders to generate
> > semi-random configurations that we build-test. As part of this, we use
> > "make randpackageconfig" to randomize the selection of packages,
> > together with a KCONFIG_PROBABILITY value, which indicates the
> > probability for each option to be enabled. This probability is
> > itself randomized, between 1% and 30% for every build.
>
> > However, with our increasing number of packages (over 2900), when we
> > use a 30% probability for options to be enabled, it means a *lot* of
> > options are enabled, causing very large configurations to be
> > tested. These configurations are not very realistic, and they take
> > ages to build on our autobuilders: we have builds that take 4, 5 or
> > even 7 hours to build.
>
> > In order to test a larger number of configurations and therefore a
> > larger variety of configurations, this commit reduces the maximum
> > probability to 20%.
>
> > Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
>
> Makes sense. Committed, thanks.

Thanks! I think we want it in all branches currently tested by the
autobuilders, i.e. master, next, 2021.02.x and 2021.08.x.

Thomas
Thomas, All,

On 2021-11-14 14:00 +0100, Thomas Petazzoni spake thusly:
> On Sat, 13 Nov 2021 19:02:06 +0100
> "Yann E. MORIN" <yann.morin.1998@free.fr> wrote:
> > On IRC, we discussed using a normal distribution rather than the
> > uniform distribution that randint() provides.
> Yeah, but after thinking more about it, I'm wondering whether this
> makes sense or not. Indeed, why are we more interested in
> configurations having ~10% of options enabled than in the ones having
> 1% or 20%? All of them are equally interesting, aren't they?

Well, you said yourself that it is most probable that configurations
have few options turned on, so we'd probably favour smaller
configurations over bigger ones. Especially since smaller
configurations are more prone to exhibit missing or
otherwise-inherited dependencies.

Conversely, though, bigger configurations are also more prone to
exhibit build failures on unexpected dependencies. So YMMV...

> > And here is the (10,7) gaussian (which has my preference):
[--SNIP--]
> I think the reality of Buildroot configurations is probably even more
> biased towards having just a few percent of options enabled.

We can maybe refine the parameters, like (3,7).

> > Since a gaussian can output values all over ℝ, we must limit the range
> > explicitly (note that the 0% above are not representing 0 occurrences!):
> >
> >     proba = 0
> >     while proba < 1 or proba > 100:
> >         proba = int(random.gauss(15, 5))
>
> How much CPU is this loop going to consume before it gives a value
> that is between 1 and 100? Would a % 100 be better?

Well, the normal distribution is very predictable [0]. Here, the mean
µ is 10, and the standard deviation σ is 7.
This means that:

 - 50% of values are below µ, 50% above
 - ~68.27% of values are between µ-σ=3 and µ+σ=17
 - ~95.45% of values are between µ-2σ=-4 and µ+2σ=24
 - ~99.73% of values are between µ-3σ=-11 and µ+3σ=31
 - ~84.1% of values are greater than µ-σ=3
 - ~15.9% of values are below µ-σ=3
 - ~2.2% of values are below µ-2σ=-4

And with the following test:

    >>> import random
    >>> restart = (0, 0)
    >>> for i in range(100*1000*1000):  # 100M iterations
    ...     r = int(random.gauss(10, 7)+0.5)  # Round to nearest int
    ...     if r < 1:
    ...         restart = (restart[0]+1, restart[1])
    ...     elif r > 100:
    ...         restart = (restart[0], restart[1]+1)
    ...
    >>> print('restart={}'.format(restart))
    restart=(8736128, 0)

So, it means there were ~8.736% of values below 1, and none above 100,
which is consistent with the theory. So, we would usually need just a
very few loops to find an appropriate value:

    Cumulative probability the loop generates:
                |  <1                   |  >=1
    ------------+-----------------------+----------
    1st iter    | 0.08736               | 0.91264
    2nd iter    | 0.00763  =0.08736^2   | 0.99236
    3rd iter    | 0.00066  =0.08736^3   | 0.99933
    4th iter    | 0.00006  =0.08736^4   | 0.99994
    5th iter    | 0.000005 =0.08736^5   | 0.99999

I.e. the odds of getting something less than one after the 5th
iteration are less than one in 100,000.

For the other extremity of the range, 100, it is so far away (almost
13σ!) that we did not even get one in 100 million iterations. That's
less than the odds of a dinosaur-killing asteroid hitting Earth in the
next million years (or a very big number like that!)...

Oh, and doing those 100M iterations took ~54s, which means the loop
is, well, cheap. Very cheap.

[0] https://en.wikipedia.org/wiki/Normal_distribution

(Yeah, I love maths and probabilities. Even though I'm pretty bad at
it!)

Regards,
Yann E. MORIN.
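[Editorial illustration, not part of the original thread.] The rejection loop discussed above can be packaged as a small helper; this is only a sketch of what such a change could look like, with a hypothetical function name, using the (10,7) parameters and [1, 100] range from the thread:

```python
import random

def gauss_probability(mu=10, sigma=7, lo=1, hi=100):
    """Draw an integer KCONFIG_PROBABILITY from a gaussian, by rejection.

    Hypothetical helper: out-of-range draws are simply redrawn. With
    (mu=10, sigma=7) fewer than 9% of draws fall below `lo` and draws
    above `hi` are essentially impossible, so the loop almost always
    terminates within one or two iterations.
    """
    while True:
        proba = int(random.gauss(mu, sigma) + 0.5)  # round to nearest int
        if lo <= proba <= hi:
            return proba
```

The expected iteration count is geometric: with a per-draw acceptance probability of ~0.913, the mean is 1/0.913 ≈ 1.1 draws per call, consistent with the "very cheap" timing measured above.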
>>>>> "Peter" == Peter Korsgaard <peter@korsgaard.com> writes:

 >>>>> "Thomas" == Thomas Petazzoni <thomas.petazzoni@bootlin.com> writes:

 >> genrandconfig is used by the Buildroot autobuilders to generate
 >> semi-random configurations that we build-test. As part of this, we use
 >> "make randpackageconfig" to randomize the selection of packages,
 >> together with a KCONFIG_PROBABILITY value, which indicates the
 >> probability for each option to be enabled. This probability is
 >> itself randomized, between 1% and 30% for every build.

 >> However, with our increasing number of packages (over 2900), when we
 >> use a 30% probability for options to be enabled, it means a *lot* of
 >> options are enabled, causing very large configurations to be
 >> tested. These configurations are not very realistic, and they take
 >> ages to build on our autobuilders: we have builds that take 4, 5 or
 >> even 7 hours to build.

 >> In order to test a larger number of configurations and therefore a
 >> larger variety of configurations, this commit reduces the maximum
 >> probability to 20%.

 >> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>

 > Makes sense. Committed, thanks.

Committed to 2021.02.x and 2021.08.x, thanks.
    diff --git a/utils/genrandconfig b/utils/genrandconfig
    index 4fffcbad11..77c7e585f5 100755
    --- a/utils/genrandconfig
    +++ b/utils/genrandconfig
    @@ -410,7 +410,7 @@ def gen_config(args):
                 return 1
             bounded_loop -= 1
             subprocess.check_call(["make", "O=%s" % args.outputdir, "-C", args.buildrootdir,
    -                               "KCONFIG_PROBABILITY=%d" % randint(1, 30),
    +                               "KCONFIG_PROBABILITY=%d" % randint(1, 20),
                                    "randpackageconfig"])
     
         if fixup_config(sysinfo, configfile):
genrandconfig is used by the Buildroot autobuilders to generate
semi-random configurations that we build-test. As part of this, we use
"make randpackageconfig" to randomize the selection of packages,
together with a KCONFIG_PROBABILITY value, which indicates the
probability for each option to be enabled. This probability is
itself randomized, between 1% and 30% for every build.

However, with our increasing number of packages (over 2900), when we
use a 30% probability for options to be enabled, it means a *lot* of
options are enabled, causing very large configurations to be
tested. These configurations are not very realistic, and they take
ages to build on our autobuilders: we have builds that take 4, 5 or
even 7 hours to build.

In order to test a larger number of configurations and therefore a
larger variety of configurations, this commit reduces the maximum
probability to 20%.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
---
 utils/genrandconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)