
bind: fix intermittent build issues with high BR2_JLEVEL

Message ID 1446721054-15603-1-git-send-email-heyleke@gmail.com
State Superseded

Commit Message

Jan Heylen Nov. 5, 2015, 10:57 a.m. UTC
From: Peter Korsgaard <jacmet@sunsite.dk>

Build sometimes breaks with:

libtool: link: `unix/os.lo' is not a valid libtool object
make[3]: *** [rndc-confgen] Error 1
make[3]: *** Waiting for unfinished jobs....
make[4]: Leaving directory `/scratch/peko/build/bind-9.6-ESV-R4/bin/rndc/unix'

So disable parallel builds.

Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk>
[Jan Heylen: This patch was removed with commit c36b5d89c5616f7ca0a7295cbb5c231606beb71e
by Gustavo Zacarias <gustavo@zacarias.com.ar> but the problem still occurs,
so disabling parallel builds again]
Signed-off-by: Jan Heylen <heyleke@gmail.com>
---
 package/bind/bind.mk | 1 +
 1 file changed, 1 insertion(+)

Comments

Thomas Petazzoni Nov. 5, 2015, 9:42 p.m. UTC | #1
Dear Jan Heylen,

Thanks for your patch!

On Thu,  5 Nov 2015 11:57:34 +0100, Jan Heylen wrote:
> From: Peter Korsgaard <jacmet@sunsite.dk>

I am not sure it is appropriate to send a patch in the name of someone
else. Maybe you're taking one of Peter's previous commits and
re-applying it, but the context is different, so I believe it should be
under your own name.

> 
> Build sometimes breaks with:
> 
> libtool: link: `unix/os.lo' is not a valid libtool object
> make[3]: *** [rndc-confgen] Error 1
> make[3]: *** Waiting for unfinished jobs....
> make[4]: Leaving directory `/scratch/peko/build/bind-9.6-ESV-R4/bin/rndc/unix'
> 
> So disable parallel builds.

I've been trying to reproduce the parallel build issue, and I haven't
been able to do so. It seems our autobuilders also didn't catch it.

How often are you able to reproduce it? On what type of build machine?

Thanks,

Thomas
Jan Heylen Nov. 6, 2015, 7:04 a.m. UTC | #2
Hi,

On Thu, Nov 5, 2015 at 10:42 PM, Thomas Petazzoni <
thomas.petazzoni@free-electrons.com> wrote:

> Dear Jan Heylen,
>
> Thanks for your patch!
>
> On Thu,  5 Nov 2015 11:57:34 +0100, Jan Heylen wrote:
> > From: Peter Korsgaard <jacmet@sunsite.dk>
>
> I am not sure it is appropriate to send a patch in the name of someone
> else. Maybe you're taking one of Peter's previous commit, and
> re-applying it, but the context is different, so I believe it should be
> under your own name.
>
OK, just wanted to point out it is the same issue (and the same solution).

>
> >
> > Build sometimes breaks with:
> >
> > libtool: link: `unix/os.lo' is not a valid libtool object
> > make[3]: *** [rndc-confgen] Error 1
> > make[3]: *** Waiting for unfinished jobs....
> > make[4]: Leaving directory
> `/scratch/peko/build/bind-9.6-ESV-R4/bin/rndc/unix'
> >
> > So disable parallel builds.
>
> I've been trying to reproduce the parallel build issue, and I haven't
> been able to do so. It seems our autobuilders also didn't catch it.
>

Example of the output (paths shortened):


<during compilation of bind>
libtool: link: `unix/os.lo' is not a valid libtool object
make[3]: *** [named] Error 1
make[3]: *** Waiting for unfinished jobs....
make[4]: Leaving directory `<CUT>/output/build/bind-9.9.7/bin/named/unix'
make[3]: Leaving directory `<CUT>/output/build/bind-9.9.7/bin/named'
make[2]: *** [subdirs] Error 1
make[2]: Leaving directory `<CUT>/output/build/bind-9.9.7/bin'
make[1]: *** [subdirs] Error 1
make[1]: Leaving directory `<CUT>/output/build/bind-9.9.7'
make: *** [<CUT>/output/build/bind-9.9.7/.stamp_built] Error 2

> How often are you able to reproduce it ? On what type of build machine ?
>

We are on the Buildroot 2015.05 release (released May 31st, 2015).

We build a couple of times a day in a CentOS 7 environment: 8 cores, 32 GB
of memory.

 processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
stepping : 3
microcode : 0x1c
cpu MHz : 3390.171
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx
est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb
xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase
tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
bogomips : 6784.08
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

So nothing special?

BR2_JLEVEL is set to '0' (so auto)

We do build multiple defconfigs (up to 8) (in separate buildroot working
folders) at once on the same machine. But I see from the buildroot output
that BR2_JLEVEL is set to '9' (cores +1) for each of these jobs?

>>> bind 9.9.7 Building
PATH="<CUT>/output/host/bin:<CUT>/output/host/sbin:<CUT>/output/host/usr/bin:<CUT>/output/host/usr/sbin:/usr/local/bin:/usr/bin"
 /usr/bin/make -j9  -C <CUT>/output/build/bind-9.9.7/
make[1]: Entering directory `<CUT>/output/build/bind-9.9.7'
making all in <CUT>/output/build/bind-9.9.7/make


Maybe the exact condition is to have multiple buildroot jobs (8) on 8 cores
with BR2_JLEVEL set to 8 (so 8*8 = 64 'jobs').

So we might optimize that on our side ;-), but still it shouldn't trigger
this error?

Jan


> Thanks,
>
> Thomas
> --
> Thomas Petazzoni, CTO, Free Electrons
> Embedded Linux, Kernel and Android engineering
> http://free-electrons.com
>
Jan Heylen Nov. 6, 2015, 7:06 a.m. UTC | #3
On Fri, Nov 6, 2015 at 8:04 AM, Jan Heylen <heyleke@gmail.com> wrote:

> Hi,
>
> On Thu, Nov 5, 2015 at 10:42 PM, Thomas Petazzoni <
> thomas.petazzoni@free-electrons.com> wrote:
>
>> Dear Jan Heylen,
>>
>> Thanks for your patch!
>>
>> On Thu,  5 Nov 2015 11:57:34 +0100, Jan Heylen wrote:
>> > From: Peter Korsgaard <jacmet@sunsite.dk>
>>
>> I am not sure it is appropriate to send a patch in the name of someone
>> else. Maybe you're taking one of Peter's previous commit, and
>> re-applying it, but the context is different, so I believe it should be
>> under your own name.
>>
> OK, just wanted to point out it is the same issue (and the same solution).
>
>>
>> >
>> > Build sometimes breaks with:
>> >
>> > libtool: link: `unix/os.lo' is not a valid libtool object
>> > make[3]: *** [rndc-confgen] Error 1
>> > make[3]: *** Waiting for unfinished jobs....
>> > make[4]: Leaving directory
>> `/scratch/peko/build/bind-9.6-ESV-R4/bin/rndc/unix'
>> >
>> > So disable parallel builds.
>>
>> I've been trying to reproduce the parallel build issue, and I haven't
>> been able to do so. It seems our autobuilders also didn't catch it.
>>
>
> Example of the output (paths shrink-ed):
>
>
> <during compilation of bind>
> libtool: link: `unix/os.lo' is not a valid libtool object
> make[3]: *** [named] Error 1
> make[3]: *** Waiting for unfinished jobs....
> make[4]: Leaving directory `<CUT>/output/build/bind-9.9.7/bin/named/unix'
> make[3]: Leaving directory `<CUT>/output/build/bind-9.9.7/bin/named'
> make[2]: *** [subdirs] Error 1
> make[2]: Leaving directory `<CUT>/output/build/bind-9.9.7/bin'
> make[1]: *** [subdirs] Error 1
> make[1]: Leaving directory `<CUT>/output/build/bind-9.9.7'
> make: *** [<CUT>/output/build/bind-9.9.7/.stamp_built] Error 2
>
> How often are you able to reproduce it ? On what type of build machine ?
>>
>
> We are on the  2015.05, Released May 31st, 2015 Buildroot release.
>
> We build a couple of times a day on a centos 7 environment: 8 cores, 32G
> mem.
>

To be correct: 4 physical, 8 virtual cores


>
>  processor : 7
> vendor_id : GenuineIntel
> cpu family : 6
> model : 60
> model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
> stepping : 3
> microcode : 0x1c
> cpu MHz : 3390.171
> cache size : 8192 KB
> physical id : 0
> siblings : 8
> core id : 3
> cpu cores : 4
> apicid : 7
> initial apicid : 7
> fpu : yes
> fpu_exception : yes
> cpuid level : 13
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
> pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
> nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx
> est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb
> xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase
> tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
> bogomips : 6784.08
> clflush size : 64
> cache_alignment : 64
> address sizes : 39 bits physical, 48 bits virtual
> power management:
>
> So nothing special?
>
> BR2_JLEVEL is set to '0' (so auto)
>
> We do build multiple defconfigs (up to 8) (in separate buildroot working
> folders) at once on the same machine. But I see from the buildroot output
> that BR2_JLEVEL is set to '9' (cores +1) for each of these jobs?
>
> >>> bind 9.9.7 Building
> PATH="<CUT>/output/host/bin:<CUT>/output/host/sbin:<CUT>/output/host/usr/bin:<CUT>/output/host/usr/sbin:/usr/local/bin:/usr/bin"  /usr/bin/make -j9  -C <CUT>/output/build/bind-9.9.7/
> make[1]: Entering directory `<CUT>/output/build/bind-9.9.7'
> making all in <CUT>/output/build/bind-9.9.7/make
>
>
> Maybe the exact condition is to have multiple buildroot jobs (8) on 8
> cores with BR2_JLEVEL set to 8 (so 8*8 = 64 'jobs').
>
To be correct: 8 jobs * -j9 = 72 'jobs'
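
The corrected arithmetic above can be written out as a small shell sketch. The function name `total_jobs` and its two arguments are illustrative assumptions, not part of Buildroot; it only multiplies the number of concurrent Buildroot runs by the per-run job count (cores + 1) reported in this thread.

```shell
# Sketch only: total_jobs is a hypothetical helper, not a Buildroot command.
# total_jobs BUILDS CORES
#   BUILDS: number of Buildroot builds started at the same time
#   CORES:  number of logical cores seen by each build
# Each build is assumed to run make with -j(CORES + 1), as observed above.
total_jobs() {
    builds=$1
    cores=$2
    echo $((builds * (cores + 1)))
}

total_jobs 8 8   # prints 72
```

This is a worst-case figure: it assumes all 8 builds are in a fully parallel phase at the same moment.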


> So we might optimize that on our side ;-), but still it shouldn't trigger
> this error?
>
> Jan
>
>
>> Thanks,
>>
>> Thomas
>> --
>> Thomas Petazzoni, CTO, Free Electrons
>> Embedded Linux, Kernel and Android engineering
>> http://free-electrons.com
>>
>
>
Thomas Petazzoni Nov. 6, 2015, 9:19 a.m. UTC | #4
Jan,

On Fri, 6 Nov 2015 08:04:35 +0100, Jan Heylen wrote:

> > I am not sure it is appropriate to send a patch in the name of someone
> > else. Maybe you're taking one of Peter's previous commit, and
> > re-applying it, but the context is different, so I believe it should be
> > under your own name.
> >
> OK, just wanted to point out it is the same issue (and the same solution).

OK.


> make[1]: Leaving directory `<CUT>/output/build/bind-9.9.7'
> make: *** [<CUT>/output/build/bind-9.9.7/.stamp_built] Error 2
> 
> How often are you able to reproduce it ? On what type of build machine ?
> >
> 
> We are on the  2015.05, Released May 31st, 2015 Buildroot release.

On master, we updated bind to 9.9.8. Do you also reproduce the issue
with bind 9.9.8 ?

What I find weird is that our autobuilder infrastructure generally
catches pretty well the parallel build issues. And we currently have
zero failures on bind 9.9.7 and bind 9.9.8:

  http://autobuild.buildroot.org/?reason=bind-9.9.7
  http://autobuild.buildroot.org/?reason=bind-9.9.8


> We do build multiple defconfigs (up to 8) (in separate buildroot working
> folders) at once on the same machine. But I see from the buildroot output
> that BR2_JLEVEL is set to '9' (cores +1) for each of these jobs?

That's expected if you have left BR2_JLEVEL to its default of 0.
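
As a rough illustration of that default, the mapping from BR2_JLEVEL to the -j value seen in the log can be sketched in shell. The function `effective_jobs` is a hypothetical helper written for this thread, and the use of `getconf _NPROCESSORS_ONLN` to count cores is an assumption about the build host; it is not claimed to be what Buildroot itself runs.

```shell
# Sketch only: effective_jobs is a hypothetical helper, not Buildroot code.
# effective_jobs JLEVEL [CORES]
#   JLEVEL: value of BR2_JLEVEL (0 means "auto")
#   CORES:  logical core count; defaults to the online CPU count (assumption)
effective_jobs() {
    jlevel=$1
    cores=${2:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}
    if [ "$jlevel" -eq 0 ]; then
        # auto: cores + 1, which gives -j9 on the 8-core machine above
        echo $((cores + 1))
    else
        # explicit value: used as-is
        echo "$jlevel"
    fi
}
```

With 8 logical cores, `effective_jobs 0 8` gives 9, matching the `make -j9` line quoted below, while `effective_jobs 4 8` gives 4.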

> >>> bind 9.9.7 Building
> PATH="<CUT>/output/host/bin:<CUT>/output/host/sbin:<CUT>/output/host/usr/bin:<CUT>/output/host/usr/sbin:/usr/local/bin:/usr/bin"
>  /usr/bin/make -j9  -C <CUT>/output/build/bind-9.9.7/
> make[1]: Entering directory `<CUT>/output/build/bind-9.9.7'
> making all in <CUT>/output/build/bind-9.9.7/make
> 
> 
> Maybe the exact condition is to have multiple buildroot jobs (8) on 8 cores
> with BR2_JLEVEL set to 8 (so 8*8 = 64 'jobs').
> 
> So we might optimize that on our side ;-), but still it shouldn't trigger
> this error?

Indeed, it should not trigger this error.

Thomas
Thomas De Schampheleire Jan. 6, 2016, 9:10 p.m. UTC | #5
Hi Thomas,

On Fri, Nov 6, 2015 at 10:19 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Jan,
>
> On Fri, 6 Nov 2015 08:04:35 +0100, Jan Heylen wrote:
>
>> > I am not sure it is appropriate to send a patch in the name of someone
>> > else. Maybe you're taking one of Peter's previous commit, and
>> > re-applying it, but the context is different, so I believe it should be
>> > under your own name.
>> >
>> OK, just wanted to point out it is the same issue (and the same solution).
>
> OK.
>
>
>> make[1]: Leaving directory `<CUT>/output/build/bind-9.9.7'
>> make: *** [<CUT>/output/build/bind-9.9.7/.stamp_built] Error 2
>>
>> How often are you able to reproduce it ? On what type of build machine ?

We saw this issue from time to time. Definitely not always, but also
definitely more than once. I don't have exact figures.
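
One way to get such figures would be to repeat the build in a loop and count the failing runs. The sketch below is only an assumption about how that could be done; `count_failures` is a hypothetical helper, and the rebuild command in the usage note is an example to adapt to the local setup.

```shell
# Sketch only: count_failures is a hypothetical helper.
# count_failures RUNS CMD [ARGS...]
#   Runs CMD the given number of times, discards its output,
#   and prints how many of the runs exited with a non-zero status.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

For instance, something like `count_failures 20 sh -c 'make bind-dirclean && make bind'`, run from a Buildroot output directory while the other builds are loading the machine, would give a failures-out-of-20 count.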

>> >
>>
>> We are on the  2015.05, Released May 31st, 2015 Buildroot release.
>
> On master, we updated bind to 9.9.8. Do you also reproduce the issue
> with bind 9.9.8 ?
>
> What I find weird is that our autobuilder infrastructure generally
> catches pretty well the parallel build issues. And we currently have
> zero failures on bind 9.9.7 and bind 9.9.8:
>
>   http://autobuild.buildroot.org/?reason=bind-9.9.7
>   http://autobuild.buildroot.org/?reason=bind-9.9.8
>
>
>> We do build multiple defconfigs (up to 8) (in separate buildroot working
>> folders) at once on the same machine. But I see from the buildroot output
>> that BR2_JLEVEL is set to '9' (cores +1) for each of these jobs?
>
> That's expected if you have left BR2_JLEVEL to its default of 0.
>
>> >>> bind 9.9.7 Building
>> PATH="<CUT>/output/host/bin:<CUT>/output/host/sbin:<CUT>/output/host/usr/bin:<CUT>/output/host/usr/sbin:/usr/local/bin:/usr/bin"
>>  /usr/bin/make -j9  -C <CUT>/output/build/bind-9.9.7/
>> make[1]: Entering directory `<CUT>/output/build/bind-9.9.7'
>> making all in <CUT>/output/build/bind-9.9.7/make
>>
>>
>> Maybe the exact condition is to have multiple buildroot jobs (8) on 8 cores
>> with BR2_JLEVEL set to 8 (so 8*8 = 64 'jobs').
>>
>> So we might optimize that on our side ;-), but still it shouldn't trigger
>> this error?
>
> It should trigger this error indeed.

From the bind website:
https://kb.isc.org/article/AA-00291/46/Im-trying-to-compile-BIND-9-and-make-is-failing-due-to-files-not-being-found.-Why-.html

"Using a parallel or distributed "make" to build BIND 9 is not
supported, and doesn't work. If you are using one of these, use normal
make or gmake instead."

Based on our observed failures and the above upstream statement that
parallel make is not supported, shouldn't we take that into account in
Buildroot (and thus apply this patch)?

Thanks,
Thomas
Thomas De Schampheleire Feb. 5, 2016, 2:49 p.m. UTC | #6
On Wed, Jan 6, 2016 at 10:10 PM, Thomas De Schampheleire
<patrickdepinguin@gmail.com> wrote:
> Hi Thomas,
>
> On Fri, Nov 6, 2015 at 10:19 AM, Thomas Petazzoni
> <thomas.petazzoni@free-electrons.com> wrote:
>> Jan,
>>
>> On Fri, 6 Nov 2015 08:04:35 +0100, Jan Heylen wrote:
>>
>>> > I am not sure it is appropriate to send a patch in the name of someone
>>> > else. Maybe you're taking one of Peter's previous commit, and
>>> > re-applying it, but the context is different, so I believe it should be
>>> > under your own name.
>>> >
>>> OK, just wanted to point out it is the same issue (and the same solution).
>>
>> OK.
>>
>>
>>> make[1]: Leaving directory `<CUT>/output/build/bind-9.9.7'
>>> make: *** [<CUT>/output/build/bind-9.9.7/.stamp_built] Error 2
>>>
>>> How often are you able to reproduce it ? On what type of build machine ?
>
> We saw this issue from time to time. Definitely not always, but also
> definitely more than once. I don't have exact figures.
>
>>> >
>>>
>>> We are on the  2015.05, Released May 31st, 2015 Buildroot release.
>>
>> On master, we updated bind to 9.9.8. Do you also reproduce the issue
>> with bind 9.9.8 ?
>>
>> What I find weird is that our autobuilder infrastructure generally
>> catches pretty well the parallel build issues. And we currently have
>> zero failures on bind 9.9.7 and bind 9.9.8:
>>
>>   http://autobuild.buildroot.org/?reason=bind-9.9.7
>>   http://autobuild.buildroot.org/?reason=bind-9.9.8
>>
>>
>>> We do build multiple defconfigs (up to 8) (in separate buildroot working
>>> folders) at once on the same machine. But I see from the buildroot output
>>> that BR2_JLEVEL is set to '9' (cores +1) for each of these jobs?
>>
>> That's expected if you have left BR2_JLEVEL to its default of 0.
>>
>>> >>> bind 9.9.7 Building
>>> PATH="<CUT>/output/host/bin:<CUT>/output/host/sbin:<CUT>/output/host/usr/bin:<CUT>/output/host/usr/sbin:/usr/local/bin:/usr/bin"
>>>  /usr/bin/make -j9  -C <CUT>/output/build/bind-9.9.7/
>>> make[1]: Entering directory `<CUT>/output/build/bind-9.9.7'
>>> making all in <CUT>/output/build/bind-9.9.7/make
>>>
>>>
>>> Maybe the exact condition is to have multiple buildroot jobs (8) on 8 cores
>>> with BR2_JLEVEL set to 8 (so 8*8 = 64 'jobs').
>>>
>>> So we might optimize that on our side ;-), but still it shouldn't trigger
>>> this error?
>>
>> It should trigger this error indeed.
>
> From the bind website:
> https://kb.isc.org/article/AA-00291/46/Im-trying-to-compile-BIND-9-and-make-is-failing-due-to-files-not-being-found.-Why-.html
>
> "Using a parallel or distributed "make" to build BIND 9 is not
> supported, and doesn't work. If you are using one of these, use normal
> make or gmake instead."
>
> Based on our observed failures and the above upstream message that
> parallel make is not supported, shouldn't we take that into account in
> buildroot (and thus applying this patch) ?
>

Ping... I think this should go in for the release.
I should have brought this up at the Buildroot days, but forgot...

Patch

diff --git a/package/bind/bind.mk b/package/bind/bind.mk
index e93b356..3601d42 100644
--- a/package/bind/bind.mk
+++ b/package/bind/bind.mk
@@ -6,6 +6,7 @@ 
 
 BIND_VERSION = 9.9.8
 BIND_SITE = ftp://ftp.isc.org/isc/bind9/$(BIND_VERSION)
+BIND_MAKE = $(MAKE1)
 BIND_INSTALL_STAGING = YES
 BIND_CONFIG_SCRIPTS = bind9-config isc-config.sh
 BIND_LICENSE = ISC