[LEDE-DEV,RFC] brcm47xx: bump kernel to 4.4

Message ID trinity-9a34f671-de0c-4525-a66e-7dcbe02484ae-1477217711975@3capp-gmx-bs60
State RFC

Commit Message

p.wassi@gmx.at Oct. 23, 2016, 10:15 a.m. UTC
Hi there!

As one of LEDE's TODOs is "Bump all kernels to v4.4", I've done
some testing on two brcm47xx devices (brcm47xx is still on kernel 4.1).

The devices I tested on + bootlog:
Linksys WRT54GL - https://pwassi.privatedns.org/lede/brcm47xx/#wrt54gl
ASUS WL500gP V2 - https://pwassi.privatedns.org/lede/brcm47xx/#wl500gpv2

I know that there are many more devices on brcm47xx which are still untested.
However, these are the ones I have at home to play with, and everything seems
to work fine there. So what do you think about the following 'patch'?

Best regards,
P. Wassi

Comments

Rafał Miłecki Oct. 23, 2016, 5:49 p.m. UTC | #1
On 23 October 2016 at 12:15,  <p.wassi@gmx.at> wrote:
> As one of LEDE's TODOs is "Bump all kernels to v4.4", I've done
> some testing on two brcm47xx devices (brcm47xx is still on kernel 4.1).
>
> The devices I tested on + bootlog:
> Linksys WRT54GL - https://pwassi.privatedns.org/lede/brcm47xx/#wrt54gl
> ASUS WL500gP V2 - https://pwassi.privatedns.org/lede/brcm47xx/#wl500gpv2
>
> I know that there are many more devices on brcm47xx which are still untested.
> However, these are the ones I have at home to play with and everything seems
> to work fine there. So what do you think about the following 'patch'?

It breaks Linksys WRT300N V1 and that's why I still haven't bumped it.
Last time I talked with Felix we suggested some offset in lzmaloader.
I hope to finally find time to try that...
Rafał Miłecki Oct. 24, 2016, 3:25 p.m. UTC | #2
On 23 October 2016 at 19:49, Rafał Miłecki <zajec5@gmail.com> wrote:
> On 23 October 2016 at 12:15,  <p.wassi@gmx.at> wrote:
>> As one of LEDE's TODOs is "Bump all kernels to v4.4", I've done
>> some testing on two brcm47xx devices (brcm47xx is still on kernel 4.1).
>>
>> The devices I tested on + bootlog:
>> Linksys WRT54GL - https://pwassi.privatedns.org/lede/brcm47xx/#wrt54gl
>> ASUS WL500gP V2 - https://pwassi.privatedns.org/lede/brcm47xx/#wl500gpv2
>>
>> I know that there are many more devices on brcm47xx which are still untested.
>> However, these are the ones I have at home to play with and everything seems
>> to work fine there. So what do you think about the following 'patch'?
>
> It breaks Linksys WRT300N V1 and that's why I still didn't bump it.
> Last time I talked with Felix we suggested some offset in lzmaloader,
> I hope to finally find time to try that...

I started working on this today. I reproduced the issue during a
lot of tests until it magically disappeared. Then LEDE started booting
on my unit with kernel 4.4.

I bumped the kernel with a lengthy commit description (for future
reference, just in case):
https://git.lede-project.org/?p=source.git;a=commitdiff;h=06405df7a8da24b7d735b32454c7d3b1f2ebaabc
p.wassi@gmx.at Oct. 24, 2016, 4:43 p.m. UTC | #3
> I started working on this today. I reproduced the issue during a
> lot of tests until it magically disappeared. Then LEDE started booting
> on my unit with kernel 4.4.

Oh - that rings a bell.
May I point you to a discussion we had earlier this year?
http://lists.infradead.org/pipermail/lede-dev/2016-May/000980.html

I had this issue on all my WRT54GLs:
trunk images stopped booting at "Starting program at 0x80001000".
However, if I compiled the images myself (with the exact same configuration
as the buildbot), the images would boot fine.
I later checked again: buildbot's images did not boot.
(It was kernel 4.1 back then, but it's the exact same behaviour as you describe
in the commit message)
Additionally, buildbot's images worked out-of-the-box on an ASUS WL500gP V2.

Best regards,
P. Wassi
Rafał Miłecki Oct. 25, 2016, 5:02 a.m. UTC | #4
On 24 October 2016 at 18:43,  <p.wassi@gmx.at> wrote:
>> I started working on this today. I reproduced the issue during a
>> lot of tests until it magically disappeared. Then LEDE started booting
>> on my unit with kernel 4.4.
>
> Oh - that rings a bell.
> May I point you to a discussion we had earlier this year?
> http://lists.infradead.org/pipermail/lede-dev/2016-May/000980.html
>
> I had this issue on all my WRT54GLs:
> trunk images stopped booting at "Starting program at 0x80001000".
> However, if I compiled the images myself (with the exact same configuration
> as the buildbot), the images would boot fine.
> I later checked again: buildbot's images did not boot.
> (It was kernel 4.1 back then, but it's the exact same behaviour as you describe
> in the commit message)
> Additionally, buildbot's images worked out-of-the-box on an ASUS WL500gP V2.

It seems I'm experiencing the same crazy problem. Local builds work
for me (as explained in the commit message), but the image from buildbot
(https://downloads.lede-project.org/snapshots/targets/brcm47xx/legacy/)
doesn't boot. This is some totally crazy thing :|
p.wassi@gmx.at Oct. 25, 2016, 6:13 a.m. UTC | #5
> It seems I'm experiencing the same crazy problem. Local builds work
> for me (as explained in commit message), but image from buildbot
> doesn't boot. This is some totally crazy thing :|

I opt for staying at 4.4 and not reverting, since the issue already occurred
on kernel 4.1 for the WRT54GL (and probably other devices as well?).
So it's _not_ a 4.1 -> 4.4 issue! The last kernel that booted fine
on my devices (with buildbot images) was OpenWrt 15.05.1 - 3.18.23.
As you've experienced yourself, local images also work fine, with both 4.1 and 4.4.

Best regards,
P. Wassi
Hannu Nyman Oct. 25, 2016, 7:55 a.m. UTC | #6
On 25.10.2016 9:13, p.wassi@gmx.at wrote:
>> It seems I'm experiencing the same crazy problem. Local builds work
>> for me (as explained in commit message), but image from buildbot
>> doesn't boot. This is some totally crazy thing :|
> I opt for staying at 4.4 and not reverting, since the issue already occurred
> on kernel 4.1 for the WRT54GL (and probably other devices as well?).
> So it's _not_ a 4.1 -> 4.4 issue! The last kernel, that booted fine
> on my devices (with buildbot images) was OpenWRT 15.05.1 - 3.18.23
> As you've experienced yourself: local images also work fine. With 4.1 and 4.4.

This freezing after "> Starting program at 0x80001000" reminds me of an old 
issue on ar71xx/WNDR3700:

This might be related to the bootloader decompressing the image.
Kernel & firmware size growth may be exposing a compression/decompression
problem that now gets triggered semi-arbitrarily (as the image sizes have
grown due to kernel size growth?). It is possible that the old bootloader
fails to decompress the image but prints no proper error message about it.

There was something rather similar four years ago with the ar71xx-based
WNDR3700/WNDR3800 series. There the problem was fixed by decreasing the
compression dictionary size used for WNDR3700/3800 devices. I debugged that
after receiving a hint from Rafal (who had had somewhat similar problems with
the WNDR4500). Rafal's last message of a long thread:
https://lists.openwrt.org/pipermail/openwrt-devel/2012-June/015846.html
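To illustrate what "decreasing the compression dictionary size" means, here is a minimal Python sketch using the standard `lzma` module. It is not the tool or the settings the OpenWrt/LEDE build actually uses; the helper names `compress_lzma_alone` and `header_dict_size` and the sample data are made up for illustration. It writes the legacy .lzma container with an explicit dictionary size and reads that size back from the header, which is the value a decoder has to be able to accommodate.

```python
import lzma


def compress_lzma_alone(data: bytes, dict_size: int) -> bytes:
    """Compress into the legacy .lzma container with an explicit dictionary size.

    Hypothetical helper for illustration; the real image build may use
    a different encoder and different options.
    """
    filters = [{"id": lzma.FILTER_LZMA1, "dict_size": dict_size}]
    return lzma.compress(data, format=lzma.FORMAT_ALONE, filters=filters)


def header_dict_size(blob: bytes) -> int:
    """Return the dictionary size stored in bytes 1..4 (little-endian) of a .lzma header."""
    return int.from_bytes(blob[1:5], "little")


if __name__ == "__main__":
    # Sample payload standing in for a kernel image (an assumption, not real data).
    payload = bytes(range(256)) * 4096
    for dict_size in (1 << 23, 1 << 20, 1 << 16):
        blob = compress_lzma_alone(payload, dict_size)
        print(f"dict_size={header_dict_size(blob):>8}  compressed={len(blob)} bytes")
```

A smaller dictionary usually costs some compression ratio, but it lowers the amount of memory a decompressor must set aside for its history window.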

For reference, the ar71xx WNDR3700 debugging:
https://forum.openwrt.org/viewtopic.php?id=40565
https://lists.openwrt.org/pipermail/openwrt-devel/2012-November/thread.html#17445
https://dev.openwrt.org/ticket/12454#comment:12
Jo-Philipp Wich Oct. 25, 2016, 8 a.m. UTC | #7
Hi Rafal,

did you ever test local builds with all additional kmods (including all
kmods from feeds) enabled as <m> ? I guess this will bump the kernel
size somewhat due to additional subsystems which are getting enabled.

~ Jo
Rafał Miłecki Oct. 25, 2016, 8:34 a.m. UTC | #8
On 25 October 2016 at 08:13,  <p.wassi@gmx.at> wrote:
>> It seems I'm experiencing the same crazy problem. Local builds work
>> for me (as explained in commit message), but image from buildbot
>> doesn't boot. This is some totally crazy thing :|
>
> I opt for staying at 4.4 and not reverting, since the issue already occurred
> on kernel 4.1 for the WRT54GL (and probably other devices as well?).
> So it's _not_ a 4.1 -> 4.4 issue! The last kernel, that booted fine
> on my devices (with buildbot images) was OpenWRT 15.05.1 - 3.18.23
> As you've experienced yourself: local images also work fine. With 4.1 and 4.4.

I agree, it doesn't make much sense to revert.

Can you test whether (un)setting CONFIG_KERNEL_KALLSYMS makes a difference
for your local builds? Please remember to compile without -j N, as it
isn't reliable; see:
[LEDE-DEV] make -j 4: race between Image/Prepare and device images?
http://lists.infradead.org/pipermail/lede-dev/2016-October/003632.html
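Before comparing a local build against a buildbot build, it can help to confirm which state the symbol really has in each configuration. The following is a minimal Python sketch, assuming the configuration is a Kconfig-style text file in which an enabled symbol appears as `SYMBOL=y` and a disabled one as `# SYMBOL is not set`. The function name `kconfig_state` and the sample text are hypothetical.

```python
def kconfig_state(config_text: str, symbol: str) -> str:
    """Report the state of a Kconfig-style symbol in the given config text.

    Returns the assigned value (for example 'y'), 'unset' if the file says
    '# SYMBOL is not set', or 'absent' if the symbol is not mentioned at all.
    """
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if line == f"# {symbol} is not set":
            return "unset"
        if line.startswith(f"{symbol}="):
            return line.split("=", 1)[1]
    return "absent"


if __name__ == "__main__":
    # Illustrative config fragments, not taken from any real build.
    local_cfg = "CONFIG_TARGET_brcm47xx=y\n# CONFIG_KERNEL_KALLSYMS is not set\n"
    other_cfg = "CONFIG_TARGET_brcm47xx=y\nCONFIG_KERNEL_KALLSYMS=y\n"
    for name, text in (("local", local_cfg), ("other", other_cfg)):
        print(name, kconfig_state(text, "CONFIG_KERNEL_KALLSYMS"))
```

To use it on a real tree, read the file first, e.g. `kconfig_state(open(".config").read(), "CONFIG_KERNEL_KALLSYMS")`, where the `.config` path is an assumption about where the build configuration lives.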
Rafał Miłecki Oct. 25, 2016, 8:37 a.m. UTC | #9
On 25 October 2016 at 09:55, Hannu Nyman <hannu.nyman@iki.fi> wrote:
> On 25.10.2016 9:13, p.wassi@gmx.at wrote:
>>>
>>> It seems I'm experiencing the same crazy problem. Local builds work
>>> for me (as explained in commit message), but image from buildbot
>>> doesn't boot. This is some totally crazy thing :|
>>
>> I opt for staying at 4.4 and not reverting, since the issue already
>> occurred
>> on kernel 4.1 for the WRT54GL (and probably other devices as well?).
>> So it's _not_ a 4.1 -> 4.4 issue! The last kernel, that booted fine
>> on my devices (with buildbot images) was OpenWRT 15.05.1 - 3.18.23
>> As you've experienced yourself: local images also work fine. With 4.1 and
>> 4.4.
>
>
> This freezing after "> Starting program at 0x80001000" reminds me of an old
> issue on ar71xx/WNDR3700:
>
> This might be something related to uncompressing the image by the
> bootloader. Kernel & firmware size growth may be exposing some
> compression/uncompression problem that gets semi-arbitrarily triggered now
> (when the image sizes have grown due to kernel size growth?). It is
> possible that the old bootloader fails to decompress the image but has no
> proper verbose error message about that.

For Linksys WRT300N v1 we use lzma-loader compressed using gzip. It
didn't change between 4.1 and 4.4.

So CFE decompression doesn't matter as it deals with the same loader.
If there is some decompression bug it may be inside lzma-loader.
Rafał Miłecki Oct. 25, 2016, 8:41 a.m. UTC | #10
On 25 October 2016 at 10:00, Jo-Philipp Wich <jo@mein.io> wrote:
> did you ever test local builds with all additional kmods (including all
> kmods from feeds) enabled as <m> ? I guess this will bump the kernel
> size somewhat due to additional subsystems which are getting enabled.

No, I never expected extra kmod packages to affect kernel size.

Anyway, I finally traced this local vs. buildbot difference to
CONFIG_KERNEL_KALLSYMS. Images from buildbot have this symbol enabled,
which slightly increases kernel size: enough to stop it from booting
on WRT300N v1.

The reason it took me so much time to realize it's related to
CONFIG_KERNEL_KALLSYMS is a bug in the LEDE build system that I just spotted:
[LEDE-DEV] make -j 4: race between Image/Prepare and device images?
http://lists.infradead.org/pipermail/lede-dev/2016-October/003632.html
p.wassi@gmx.at Oct. 25, 2016, 9:05 a.m. UTC | #11
> Anyway I finally debugged this local vs. buildbot difference to the
> CONFIG_KERNEL_KALLSYMS. Images from buildbot have this symbol enabled
> which slightly increases kernel size. Enough to stop it from booting
> on WRT300N v1.

There must be something more...
What I had on WRT54GL:
http://lists.infradead.org/pipermail/lede-dev/2016-June/001162.html

Buildbot's images did _not_ boot (assuming they have KALLSYMS as you stated above).
However, my local images did boot fine (with KALLSYMS enabled).
So although I had KALLSYMS increasing the kernel size, a local image
booted, while buildbot's image did not (both using the same config, as jow already
showed me here: http://lists.infradead.org/pipermail/lede-dev/2016-June/001153.html )

Anyway, I'm just compiling a new set of images and will report back.
p.wassi@gmx.at Oct. 25, 2016, 3:51 p.m. UTC | #12
Ok, here's some news on this topic.
I've built some images for WRT54GL to test, here come the results:

-) Buildbot's image: does NOT boot (as expected)
-) Local image without KALLSYMS: works fine
-) Local image with KALLSYMS: does NOT boot (which is unexpected,
       as such an image booted without issues back in June)

Here are my vmlinux.lzma sizes:
1140026 bytes - without KALLSYMS - works
1229451 bytes - with KALLSYMS - does not work

To further investigate where the 'magic border' is (assuming it really is dependent on
the vmlinux.lzma size), I disabled KALLSYMS and introduced a new char array with random
contents into the kernel. That let me more or less choose what size the kernel should have.
Here are the results:
size     | status
1140026  |  Ok    <= this is the untouched kernel without KALLSYMS 
1195918  |  Ok
1206415  |  Ok
1227370  |  Ok
1229451  |  FAIL  <= this is the untouched kernel with KALLSYMS enabled
1237671  |  Ok
1279313  |  Ok

So there must be more than just the bare vmlinux-size.
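The reasoning behind that conclusion can be checked mechanically. The sketch below is plain Python with hypothetical function names; the sizes are copied from the table above. It asks whether any single size limit could separate booting from non-booting images. For a pure size limit, every working image would have to be smaller than every failing one.

```python
# (vmlinux.lzma size in bytes, booted?) taken from the table above.
RESULTS = [
    (1140026, True),   # untouched kernel without KALLSYMS
    (1195918, True),
    (1206415, True),
    (1227370, True),
    (1229451, False),  # untouched kernel with KALLSYMS enabled
    (1237671, True),
    (1279313, True),
]


def size_limit_explains(results):
    """True if every booting image is smaller than every failing image."""
    ok = [size for size, booted in results if booted]
    bad = [size for size, booted in results if not booted]
    if not ok or not bad:
        return True
    return max(ok) < min(bad)


def larger_images_that_boot(results):
    """Sizes that booted even though they exceed the smallest failing size."""
    bad = [size for size, booted in results if not booted]
    if not bad:
        return []
    return [size for size, booted in results if booted and size > min(bad)]


if __name__ == "__main__":
    print("single size limit fits the data:", size_limit_explains(RESULTS))
    print("counterexamples:", larger_images_that_boot(RESULTS))
```

The two padded images of 1237671 and 1279313 bytes are the counterexamples: both are larger than the failing 1229451-byte image, yet both boot.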

Best regards,
P. Wassi
p.wassi@gmx.at Nov. 22, 2016, 5:36 p.m. UTC | #13
Hi Rafal,

> On 25 October 2016 at 9:41, Rafał Miłecki <zajec5@gmail.com> wrote:
> Anyway I finally debugged this local vs. buildbot difference to the
> CONFIG_KERNEL_KALLSYMS. Images from buildbot have this symbol enabled
> which slightly increases kernel size. Enough to stop it from booting
> on WRT300N v1.

Yesterday, I had the serial console attached to a WRT54GL for other things I had
to do and tried buildbot's brcm47xx image for this device. It did not boot.
So I'm again using my local builds without CONFIG_KERNEL_KALLSYMS

Since some kind of release seems to be coming, I think we can't leave it in
this state. Either disable the image generation for that device, or get the image
built without KALLSYMS. What do you think about this?

Best regards,
P. Wassi
Rafał Miłecki Nov. 23, 2016, 8:19 a.m. UTC | #14
On 22 November 2016 at 18:36,  <p.wassi@gmx.at> wrote:
>> On 25 October 2016 at 9:41, Rafał Miłecki <zajec5@gmail.com> wrote:
>> Anyway I finally debugged this local vs. buildbot difference to the
>> CONFIG_KERNEL_KALLSYMS. Images from buildbot have this symbol enabled
>> which slightly increases kernel size. Enough to stop it from booting
>> on WRT300N v1.
>
> Yesterday, I had the serial console attached to a WRT54GL for other things I had
> to do and tried buildbot's brcm47xx image for this device. It did not boot.
> So I'm again using my local builds without CONFIG_KERNEL_KALLSYMS
>
> Since some kind of release seems to be coming, I think we can't leave it in
> this state. Either disable the image generation for that device, or get the image
> built without KALLSYMS. What do you think about this?

I think building OpenWrt releases without KALLSYMS was always the case
for brcm47xx / legacy or even all the targets. I expect to do the same
for LEDE.

Patch

diff --git a/target/linux/brcm47xx/Makefile b/target/linux/brcm47xx/Makefile
--- a/target/linux/brcm47xx/Makefile
+++ b/target/linux/brcm47xx/Makefile
@@ -13,7 +13,7 @@  FEATURES:=squashfs usb
 SUBTARGETS:=generic mips74k legacy
 MAINTAINER:=Hauke Mehrtens <hauke@hauke-m.de>
 
-KERNEL_PATCHVER:=4.1
+KERNEL_PATCHVER:=4.4
 
 define Target/Description
        Build firmware images for Broadcom based BCM47xx/53xx routers with MIPS CPU, *not* ARM.