diff mbox

[OpenWrt-Devel] Factory startup issues since mount_root / libfstools improvements in Chaos Calmer

Message ID 1b0f8ebd7c9748d099aea99949d7ffb8@VI1PR9003MB0253.MGDPHG.emi.philips.com
State Changes Requested
Delegated to: John Crispin
Headers show

Commit Message

Smith, Pieter March 28, 2017, 1:26 p.m. UTC
Hi Rafał and John,

I need your help in confirming a bug that I suspect can be traced back to a
patch you authored / merged. This is what I am observing:

Commit ba01996534d15dc725a2dcc56a59fbfb24b58787 in
git://git.openwrt.org/project/fstools.git seems to have broken startup for
factory-flashed jffs2 OpenWRT systems, causing substantial slowdown in factory
environments.

When a factory-flashed jffs2 on ubi system starts up, the "rootfs_data" volume
contains a deadcode marker. When this happens, mount_root temporarily mounts a
tmpfs overlay, postponing the remounting of the jffs2 overlay until the done
phase of the startup.

The refactoring in ba019965 eliminated an unneeded call to volume_find() when
done() calls jffs2_switch(). Unfortunately the refactoring did not take into
account that volume_identify() has side-effects with the mtd driver
implementation and a second call to volume_identify() on the same struct volume
may render a different result.

This is exactly what happens when the struct volume is passed to jffs2_switch()
by done(). In done(), the volume was identified as FS_DEADCODE by the first
call to volume_identify(). Passing the struct volume to jffs2_switch() then
results in a second call to volume_identify() on the same struct volume. The
second time around the volume is identified as FS_JFFS2 within jffs2_switch(),
resulting in the wrong case being run in the switch. The entire rootfs ends up
yanked out from under the OpenWRT userspace, resulting in everything failing in
unexpected ways. Rebooting recovers userspace, but is quite expensive in
factory environments.

Before the Chaos Calmer release, jffs2_switch() was not stumbling over the
volume_identify() side-effect, because it was retrieving a fresh struct volume
with volume_find() before calling volume_identify().

Adjusting the jffs2_switch() prototype so that done() can forward the result
of volume_identify() to jffs2_switch() functionally resolves the issue, but I
am not sure if this is the best approach (See attached patch below).

Will you please assist?

Regards,
Pieter Smith
Sr. Software Designer


From d0dcde51ec68497882b4d0138a019c049e699177 Mon Sep 17 00:00:00 2001
From: Pieter Smith <pieter.smith@philips.com>
Date: Mon, 27 Mar 2017 17:18:55 +0200
Subject: [PATCH] fix startup with empty "rootfs_data" UBI volume

Commit ba019965 broke startup for a factory-flashed jffs2-on-ubi systems,
causing substantial slowdown in factory environments.

When a factory-flashed jffs2 on ubi system starts up, the "rootfs_data" volume
contains a deadcode marker. When this happens, mount_root temporarily mounts a
tmpfs overlay, postponing the remounting of the jffs2 overlay until the done
phase of the startup.

The refactoring in ba019965 eliminated an "unneeded" call to volume_find() when
done() called jffs2_switch(). Unfortunately the refactoring did not take into
account that volume_identify() has side-effects with the mtd driver
implementation and a second call to volume_identify() on the same struct volume
may render a different result.

This is exactly what happens when the struct volume is passed to jffs2_switch()
by done(). In done(), the volume was identified as FS_DEADCODE by the first
call to volume_identify(). Passing the struct volume to jffs2_switch() then
results in a second call to volume_identify() on the same struct volume. The
second time around the volume is identified as FS_JFFS2 within jffs2_switch(),
resulting in the wrong case being run in the switch. The entire rootfs ends up
yanked out from under the OpenWRT userspace, resulting in everything failing in
unexpected ways.

This patch solves the issue by adjusting jffs2_switch() to also receive the
result of the first volume_identify() call.
---
 libfstools/libfstools.h | 2 +-
 libfstools/overlay.c    | 4 ++--
 mount_root.c            | 5 +++--
 3 files changed, 6 insertions(+), 5 deletions(-)

--
2.7.4

Comments

Rafał Miłecki March 28, 2017, 2:52 p.m. UTC | #1
On 28 March 2017 at 15:26, Smith, Pieter <pieter.smith@philips.com> wrote:
> I need your help in confirming a bug that I suspect can be traced back to a
> patch you authored / merged. This is what I am observing:
>
> Commit ba01996534d15dc725a2dcc56a59fbfb24b58787 in
> git://git.openwrt.org/project/fstools.git seems to have broken startup for
> factory-flashed jffs2 OpenWRT systems, causing substantial slowdown in factory
> environments.

I'll research on that, but could you meanwhile send a patch that is
not white space mangled? This one can't really be applied cleanly.
Smith, Pieter March 29, 2017, 9:53 a.m. UTC | #2
Hi Rafał,

My apologies. I am not able to get mutt working with our corporate
infrastructure. I hope it arrives unmangled as an attachment. If not, I'll use
my personal mail.

Regards,
Pieter

-----Original Message-----
From: Rafał Miłecki [mailto:zajec5@gmail.com]

Sent: 28 March 2017 16:52
To: Smith, Pieter <pieter.smith@philips.com>
Cc: John Crispin <blogic@openwrt.org>; openwrt-devel@lists.openwrt.org
Subject: Re: Factory startup issues since mount_root / libfstools improvements in Chaos Calmer

On 28 March 2017 at 15:26, Smith, Pieter <pieter.smith@philips.com> wrote:
> I need your help in confirming a bug that I suspect can be traced back

> to a patch you authored / merged. This is what I am observing:

>

> Commit ba01996534d15dc725a2dcc56a59fbfb24b58787 in

> git://git.openwrt.org/project/fstools.git seems to have broken startup

> for factory-flashed jffs2 OpenWRT systems, causing substantial

> slowdown in factory environments.


I'll research on that, but could you meanwhile send a patch that is not white space mangled? This one can't really be applied cleanly.

________________________________
The information contained in this message may be confidential and legally protected under applicable law. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, forwarding, dissemination, or reproduction of this message is strictly prohibited and may be unlawful. If you are not the intended recipient, please contact the sender by return e-mail and destroy all copies of the original message.
Rafał Miłecki March 29, 2017, 12:10 p.m. UTC | #3
On 03/28/2017 03:26 PM, Smith, Pieter wrote:
> The refactoring in ba019965 eliminated an unneeded call to volume_find() when
> done() calls jffs2_switch(). Unfortunately the refactoring did not take into
> account that volume_identify() has side-effects with the mtd driver
> implementation and a second call to volume_identify() on the same struct volume
> may render a different result.

AFAIU this is about mtd driver, so volume_identify basically means
mtd_volume_identify.

Can you do some extra debugging for me and check which exact part of
mtd_volume_identify has this side effect? I don't see which part of this
function would affect volume identification in further calls.

This may be a bit tricky to debug as you can't call volume_identify inside (at
least without some recursive check). Can you handle this somehow?

I'd like to make sure I understand this bug fully before pushing a fix.

Also: which target uses jffs2 with ubi?!
Rafał Miłecki March 29, 2017, 12:11 p.m. UTC | #4
On 03/29/2017 11:53 AM, Smith, Pieter wrote:
> My apologies. I am not able to get mutt working with our corporate
> infrastructure. I hope it arrives unmangled as an attachment. If not, I'll use
> my personal mail.

This would be acceptable for me to pick patch sent as attachment, but you
really need to sign it off with your name. Please add a proper
Signed-off-by:
line and resend.
Smith, Pieter March 29, 2017, 2:01 p.m. UTC | #5
> On 03/28/2017 03:26 PM, Smith, Pieter wrote:
> > The refactoring in ba019965 eliminated an unneeded call to
> > volume_find() when
> > done() calls jffs2_switch(). Unfortunately the refactoring did not
> > take into account that volume_identify() has side-effects with the mtd
> > driver implementation and a second call to volume_identify() on the
> > same struct volume may render a different result.
>
> AFAIU this is about mtd driver, so volume_identify basically means
> mtd_volume_identify.
>
> Can you do some extra debugging for me and check which exact part of
> mtd_volume_identify has this side effect? I don't see which part of this
> function would affect volume identification in further calls.

I will see what I can do tomorrow.

> This may be a bit tricky to debug as you can't call volume_identify inside
> (at least without some recursive check). Can you handle this somehow?
>
> I'd like to make sure I understand this bug fully before pushing a fix.
>
> Also: which target uses jffs2 with ubi?!

A concession I grudgingly had to accept from Qualcomm on their IoE platform in
order to hit our market release. OpenWRT at that point in time still lacked a
lot of infrastructure, putting UBI/FS out of reach given the time constraints.

- Pieter
Smith, Pieter March 29, 2017, 2:03 p.m. UTC | #6
> On 03/29/2017 11:53 AM, Smith, Pieter wrote:

> > My apologies. I am not able to get mutt working with our corporate

> > infrastructure. I hope it arrives unmangled as an attachment. If not,

> > I'll use my personal mail.

>

> This would be acceptable for me to pick patch sent as attachment, but you

> really need to sign it off with your name. Please add a proper

> Signed-off-by:

> line and resend.


Off-course! My bad...

________________________________
The information contained in this message may be confidential and legally protected under applicable law. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, forwarding, dissemination, or reproduction of this message is strictly prohibited and may be unlawful. If you are not the intended recipient, please contact the sender by return e-mail and destroy all copies of the original message.
Smith, Pieter March 29, 2017, 4:33 p.m. UTC | #7
Hi Rafał,

> > On 03/29/2017 11:53 AM, Smith, Pieter wrote:

> > > My apologies. I am not able to get mutt working with our corporate

> > > infrastructure. I hope it arrives unmangled as an attachment. If not,

> > > I'll use my personal mail.

> >

> > This would be acceptable for me to pick patch sent as attachment, but you

> > really need to sign it off with your name. Please add a proper

> > Signed-off-by:

> > line and resend.

>

> Off-course! My bad...


In adding more logging so that you can diagnose the issue, I tracked down the
root cause and fixed the issue. Attached, please find an updated patch with
sign-off and a problem description.

- Pieter

________________________________
The information contained in this message may be confidential and legally protected under applicable law. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, forwarding, dissemination, or reproduction of this message is strictly prohibited and may be unlawful. If you are not the intended recipient, please contact the sender by return e-mail and destroy all copies of the original message.
Smith, Pieter May 8, 2017, 11:43 a.m. UTC | #8
Hi Rafał,

I am still hoping to get confirmation from you on this patch. Is this acceptable for upstreaming?

> Hi Rafał,

>

> > > On 03/29/2017 11:53 AM, Smith, Pieter wrote:

> > > > My apologies. I am not able to get mutt working with our corporate

> > > > infrastructure. I hope it arrives unmangled as an attachment. If

> > > > not, I'll use my personal mail.

> > >

> > > This would be acceptable for me to pick patch sent as attachment,

> > > but you really need to sign it off with your name. Please add a

> > > proper

> > > Signed-off-by:

> > > line and resend.

> >

> > Off-course! My bad...

>

> In adding more logging so that you can diagnose the issue, I tracked down the root cause and fixed the issue. Attached, > please find an updated patch with sign-off and a problem description.

>

> - Pieter


Regards,
Pieter


________________________________
The information contained in this message may be confidential and legally protected under applicable law. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, forwarding, dissemination, or reproduction of this message is strictly prohibited and may be unlawful. If you are not the intended recipient, please contact the sender by return e-mail and destroy all copies of the original message.
Rafał Miłecki May 9, 2017, 8:24 a.m. UTC | #9
On 8 May 2017 at 13:43, Smith, Pieter <pieter.smith@philips.com> wrote:
> I am still hoping to get confirmation from you on this patch. Is this acceptable for upstreaming?
>
>> Hi Rafał,
>>
>> > > On 03/29/2017 11:53 AM, Smith, Pieter wrote:
>> > > > My apologies. I am not able to get mutt working with our corporate
>> > > > infrastructure. I hope it arrives unmangled as an attachment. If
>> > > > not, I'll use my personal mail.
>> > >
>> > > This would be acceptable for me to pick patch sent as attachment,
>> > > but you really need to sign it off with your name. Please add a
>> > > proper
>> > > Signed-off-by:
>> > > line and resend.
>> >
>> > Off-course! My bad...
>>
>> In adding more logging so that you can diagnose the issue, I tracked down the root cause and fixed the issue. Attached, > please find an updated patch with sign-off and a problem description.

Sorry, it took me more time than it was supposed to. Your patch went
into the fstools repository.

1) LEDE master branch
We'll bump package after fixing regression caused by commit
a19f2b3c21288 ("build: disable the format-truncation warning error to
fix gcc 7 build errors") which results in:
cc1: error: -Werror=format-truncation: no option -Wformat-truncation

2) LEDE lede-17.01 branch:
I sent patch to backport your fix:
https://patchwork.ozlabs.org/patch/759957/

3) OpenWrt 15.05 branch:
I don't really want to mess with that. I don't know if I have
privileges nor if my patch could be accepted. We need someone working
on OpenWrt to handle this.
Smith, Pieter May 9, 2017, 3:57 p.m. UTC | #10
Hi Rafał,

On 09 May 2017 at 10:25, Rafał Miłecki <zajec5@gmail.com> wrote:
> On 8 May 2017 at 13:43, Smith, Pieter <pieter.smith@philips.com> wrote:

> > I am still hoping to get confirmation from you on this patch. Is this acceptable for upstreaming?

> >

> >> Hi Rafał,

> >>

> >> > > On 03/29/2017 11:53 AM, Smith, Pieter wrote:

> >> > > > My apologies. I am not able to get mutt working with our

> >> > > > corporate infrastructure. I hope it arrives unmangled as an

> >> > > > attachment. If not, I'll use my personal mail.

> >> > >

> >> > > This would be acceptable for me to pick patch sent as attachment,

> >> > > but you really need to sign it off with your name. Please add a

> >> > > proper

> >> > > Signed-off-by:

> >> > > line and resend.

> >> >

> >> > Off-course! My bad...

> >>

> >> In adding more logging so that you can diagnose the issue, I tracked down the root cause and fixed the issue. Attached, > > please find an updated patch with sign-off and a problem description.

>

> Sorry, it took me more time than it was supposed to. Your patch went into the fstools repository.

>

> 1) LEDE master branch

> We'll bump package after fixing regression caused by commit

> a19f2b3c21288 ("build: disable the format-truncation warning error to fix gcc 7 build errors") which results in:

> cc1: error: -Werror=format-truncation: no option -Wformat-truncation

>

> 2) LEDE lede-17.01 branch:

> I sent patch to backport your fix:

> https://patchwork.ozlabs.org/patch/759957/

>

> 3) OpenWrt 15.05 branch:

> I don't really want to mess with that. I don't know if I have privileges nor if my patch could be accepted. We need someone > working on OpenWrt to handle this.

>

> --

> Rafał


Thanks for the confirmation!
- Pieter


________________________________
The information contained in this message may be confidential and legally protected under applicable law. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, forwarding, dissemination, or reproduction of this message is strictly prohibited and may be unlawful. If you are not the intended recipient, please contact the sender by return e-mail and destroy all copies of the original message.
Rafał Miłecki May 22, 2017, 9:27 a.m. UTC | #11
On 9 May 2017 at 10:24, Rafał Miłecki <zajec5@gmail.com> wrote:
> On 8 May 2017 at 13:43, Smith, Pieter <pieter.smith@philips.com> wrote:
>> I am still hoping to get confirmation from you on this patch. Is this acceptable for upstreaming?
>>
>>> Hi Rafał,
>>>
>>> > > On 03/29/2017 11:53 AM, Smith, Pieter wrote:
>>> > > > My apologies. I am not able to get mutt working with our corporate
>>> > > > infrastructure. I hope it arrives unmangled as an attachment. If
>>> > > > not, I'll use my personal mail.
>>> > >
>>> > > This would be acceptable for me to pick patch sent as attachment,
>>> > > but you really need to sign it off with your name. Please add a
>>> > > proper
>>> > > Signed-off-by:
>>> > > line and resend.
>>> >
>>> > Off-course! My bad...
>>>
>>> In adding more logging so that you can diagnose the issue, I tracked down the root cause and fixed the issue. Attached, > please find an updated patch with sign-off and a problem description.
>
> Sorry, it took me more time than it was supposed to. Your patch went
> into the fstools repository.
>
> 1) LEDE master branch
> We'll bump package after fixing regression caused by commit
> a19f2b3c21288 ("build: disable the format-truncation warning error to
> fix gcc 7 build errors") which results in:
> cc1: error: -Werror=format-truncation: no option -Wformat-truncation

commit 65a3dc277f3f4 ("fstools: update to the latest version")


> 2) LEDE lede-17.01 branch:
> I sent patch to backport your fix:
> https://patchwork.ozlabs.org/patch/759957/

commit 0bef8f8011d11 ("fstools: backport regression fix for volume_identify")


> 3) OpenWrt 15.05 branch:
> I don't really want to mess with that. I don't know if I have
> privileges nor if my patch could be accepted. We need someone working
> on OpenWrt to handle this.
diff mbox

Patch

diff --git a/libfstools/libfstools.h b/libfstools/libfstools.h
index 940c504..36ba5fb 100644
--- a/libfstools/libfstools.h
+++ b/libfstools/libfstools.h
@@ -51,7 +51,7 @@  extern char* find_mount(char *mp);
 extern char* find_mount_point(char *block, int mtd_only);
 extern int find_filesystem(char *fs);

-extern int jffs2_switch(struct volume *v);
+extern int jffs2_switch(struct volume *v, int ident);

 extern int handle_whiteout(const char *dir);
 extern void foreachdir(const char *dir, int (*cb)(const char*));
diff --git a/libfstools/overlay.c b/libfstools/overlay.c
index 7a62c23..3972af1 100644
--- a/libfstools/overlay.c
+++ b/libfstools/overlay.c
@@ -193,7 +193,7 @@  handle_whiteout(const char *dir)
 }

 int
-jffs2_switch(struct volume *v)
+jffs2_switch(struct volume *v, int ident)
 {
 char *mp;
 char buf[32];
@@ -218,7 +218,7 @@  jffs2_switch(struct volume *v)
 system(buf);
 }

-switch (volume_identify(v)) {
+switch (ident) {
 case FS_NONE:
 ULOG_ERR("no jffs2 marker found\n");
 /* fall through */
diff --git a/mount_root.c b/mount_root.c
index 1335f2b..81089a2 100644
--- a/mount_root.c
+++ b/mount_root.c
@@ -100,10 +100,11 @@  done(int argc, char *argv[1])
 if (!v)
 return -1;

-switch (volume_identify(v)) {
+int ident = volume_identify(v);
+switch (ident) {
 case FS_NONE:
 case FS_DEADCODE:
-return jffs2_switch(v);
+return jffs2_switch(v, ident);

 case FS_JFFS2:
 case FS_UBIFS: