diff mbox series

sandbox: Support signal handling only when requested

Message ID 20210322052102.2934066-1-sjg@chromium.org
State Accepted
Commit 85f718f64d65390f385111e57cfa017abd12879d
Delegated to: Simon Glass
Headers show
Series sandbox: Support signal handling only when requested | expand

Commit Message

Simon Glass March 22, 2021, 5:21 a.m. UTC
At present if sandbox crashes it prints a message and tries to exit. But
with the recently introduced signal handler, it often seems to get stuck
in a loop until the stack overflows:

Segmentation violation

Segmentation violation

Segmentation violation

Segmentation violation

Segmentation violation

Segmentation violation

Segmentation violation
...

The signal handler is only useful for a few tests, as I understand it.
Make it optional.

Signed-off-by: Simon Glass <sjg@chromium.org>
---

 arch/sandbox/cpu/start.c         | 18 +++++++++++++++---
 arch/sandbox/include/asm/state.h |  1 +
 2 files changed, 16 insertions(+), 3 deletions(-)

Comments

Heinrich Schuchardt March 22, 2021, 10:02 a.m. UTC | #1
On 22.03.21 06:21, Simon Glass wrote:
> At present if sandbox crashes it prints a message and tries to exit. But
> with the recently introduced signal handler, it often seems to get stuck
> in a loop until the stack overflows:
>
> Segmentation violation
>
> Segmentation violation
>
> Segmentation violation
>
> Segmentation violation
>
> Segmentation violation
>
> Segmentation violation
>
> Segmentation violation
> ...

Hello Simon,

do you have a reproducible example? I never have seen this.

Corrupting gd could cause an endless recursive loop, as these lines
follow printing the observed string:

        printf("pc = 0x%lx, ", pc);
        printf("pc_reloc = 0x%lx\n\n", pc - gd->reloc_off);

If we remove SA_NODEFER from the signal mask in arch/sandbox/cpu/os.c,
recursion cannot occur anymore. If a segmentation violation occurs
inside the handler it will be delegated to the default handler.

Furthermore we could consider removing the signal handler at the start
of os_signal_action().

Best regards

Heinrich

>
> The signal handler is only useful for a few tests, as I understand it.
> Make it optional.
>
> Signed-off-by: Simon Glass <sjg@chromium.org>
> ---
>
>  arch/sandbox/cpu/start.c         | 18 +++++++++++++++---
>  arch/sandbox/include/asm/state.h |  1 +
>  2 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/arch/sandbox/cpu/start.c b/arch/sandbox/cpu/start.c
> index e87365e800d..72fe293e8bc 100644
> --- a/arch/sandbox/cpu/start.c
> +++ b/arch/sandbox/cpu/start.c
> @@ -389,6 +389,16 @@ static int sandbox_cmdline_cb_select_unittests(struct sandbox_state *state,
>  }
>  SANDBOX_CMDLINE_OPT_SHORT(select_unittests, 'k', 1, "Select unit tests to run");
>
> +static int sandbox_cmdline_cb_signals(struct sandbox_state *state,
> +				      const char *arg)
> +{
> +	state->handle_signals = true;
> +
> +	return 0;
> +}
> +SANDBOX_CMDLINE_OPT_SHORT(signals, 'S', 0,
> +			  "Handle signals (such as SIGSEGV) in sandbox");
> +
>  static void setup_ram_buf(struct sandbox_state *state)
>  {
>  	/* Zero the RAM buffer if we didn't read it, to keep valgrind happy */
> @@ -472,9 +482,11 @@ int main(int argc, char *argv[])
>  	if (ret)
>  		goto err;
>
> -	ret = os_setup_signal_handlers();
> -	if (ret)
> -		goto err;
> +	if (state->handle_signals) {
> +		ret = os_setup_signal_handlers();
> +		if (ret)
> +			goto err;
> +	}
>
>  #if CONFIG_VAL(SYS_MALLOC_F_LEN)
>  	gd->malloc_base = CONFIG_MALLOC_F_ADDR;
> diff --git a/arch/sandbox/include/asm/state.h b/arch/sandbox/include/asm/state.h
> index bca13069824..1c4c571e28d 100644
> --- a/arch/sandbox/include/asm/state.h
> +++ b/arch/sandbox/include/asm/state.h
> @@ -93,6 +93,7 @@ struct sandbox_state {
>  	bool ram_buf_read;		/* true if we read the RAM buffer */
>  	bool run_unittests;		/* Run unit tests */
>  	const char *select_unittests;	/* Unit test to run */
> +	bool handle_signals;		/* Handle signals within sandbox */
>
>  	/* Pointer to information for each SPI bus/cs */
>  	struct sandbox_spi_info spi[CONFIG_SANDBOX_SPI_MAX_BUS]
>
Simon Glass March 23, 2021, 12:56 a.m. UTC | #2
Hi Heinrich,

On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt <xypron.glpk@gmx.de> wrote:
>
> On 22.03.21 06:21, Simon Glass wrote:
> > At present if sandbox crashes it prints a message and tries to exit. But
> > with the recently introduced signal handler, it often seems to get stuck
> > in a loop until the stack overflows:
> >
> > Segmentation violation
> >
> > Segmentation violation
> >
> > Segmentation violation
> >
> > Segmentation violation
> >
> > Segmentation violation
> >
> > Segmentation violation
> >
> > Segmentation violation
> > ...
>
> Hello Simon,
>
> do you have a reproducible example? I never have seen this.

https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433

You need to run that commit with pytest though...it does not happen
when run directly.

BTW this sems to expose some rather nasty bug in dlmalloc or how it is
used. I notice that as soon as the first test is run, the 'top' value
in dlmalloc is outside the range of the malloc pool, which seems
wrong. I wonder if there is something broken with how
dm_test_pre_run() and dm_test_post_run() work.

>
> Corrupting gd could cause an endless recursive loop, as these lines
> follow printing the observed string:
>
>         printf("pc = 0x%lx, ", pc);
>         printf("pc_reloc = 0x%lx\n\n", pc - gd->reloc_off);

Yes I suspect printf() is dead.

>
> If we remove SA_NODEFER from the signal mask in arch/sandbox/cpu/os.c,
> recursion cannot occur anymore. If a segmentation violation occurs
> inside the handler it will be delegated to the default handler.
>
> Furthermore we could consider removing the signal handler at the start
> of os_signal_action().

The issue is that if you get a segfault you really don't know if you
can continue and do anything else.

What is the goal with the signal handler? I don't think the user can
do anything about it.

Regards,
Simon
Simon Glass June 6, 2021, 4:44 p.m. UTC | #3
Hi Heinrich,

On Mon, 22 Mar 2021 at 18:56, Simon Glass <sjg@chromium.org> wrote:
>
> Hi Heinrich,
>
> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt <xypron.glpk@gmx.de> wrote:
> >
> > On 22.03.21 06:21, Simon Glass wrote:
> > > At present if sandbox crashes it prints a message and tries to exit. But
> > > with the recently introduced signal handler, it often seems to get stuck
> > > in a loop until the stack overflows:
> > >
> > > Segmentation violation
> > >
> > > Segmentation violation
> > >
> > > Segmentation violation
> > >
> > > Segmentation violation
> > >
> > > Segmentation violation
> > >
> > > Segmentation violation
> > >
> > > Segmentation violation
> > > ...
> >
> > Hello Simon,
> >
> > do you have a reproducible example? I never have seen this.
>
> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433
>
> You need to run that commit with pytest though...it does not happen
> when run directly.
>
> BTW this sems to expose some rather nasty bug in dlmalloc or how it is
> used. I notice that as soon as the first test is run, the 'top' value
> in dlmalloc is outside the range of the malloc pool, which seems
> wrong. I wonder if there is something broken with how
> dm_test_pre_run() and dm_test_post_run() work.
>
> >
> > Corrupting gd could cause an endless recursive loop, as these lines
> > follow printing the observed string:
> >
> >         printf("pc = 0x%lx, ", pc);
> >         printf("pc_reloc = 0x%lx\n\n", pc - gd->reloc_off);
>
> Yes I suspect printf() is dead.
>
> >
> > If we remove SA_NODEFER from the signal mask in arch/sandbox/cpu/os.c,
> > recursion cannot occur anymore. If a segmentation violation occurs
> > inside the handler it will be delegated to the default handler.
> >
> > Furthermore we could consider removing the signal handler at the start
> > of os_signal_action().
>
> The issue is that if you get a segfault you really don't know if you
> can continue and do anything else.
>
> What is the goal with the signal handler? I don't think the user can
> do anything about it.

I keep hitting this problem during development with sandbox, so I
think I need to apply this patch.

Does anything need to be updated in the tests?

Regards,
Simon
Heinrich Schuchardt June 6, 2021, 5:28 p.m. UTC | #4
On 6/6/21 6:44 PM, Simon Glass wrote:
> Hi Heinrich,
>
> On Mon, 22 Mar 2021 at 18:56, Simon Glass <sjg@chromium.org> wrote:
>>
>> Hi Heinrich,
>>
>> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt <xypron.glpk@gmx.de> wrote:
>>>
>>> On 22.03.21 06:21, Simon Glass wrote:
>>>> At present if sandbox crashes it prints a message and tries to exit. But
>>>> with the recently introduced signal handler, it often seems to get stuck
>>>> in a loop until the stack overflows:
>>>>
>>>> Segmentation violation
>>>>
>>>> Segmentation violation
>>>>
>>>> Segmentation violation
>>>>
>>>> Segmentation violation
>>>>
>>>> Segmentation violation
>>>>
>>>> Segmentation violation
>>>>
>>>> Segmentation violation
>>>> ...
>>>
>>> Hello Simon,
>>>
>>> do you have a reproducible example? I never have seen this.
>>
>> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433
>>
>> You need to run that commit with pytest though...it does not happen
>> when run directly.
>>
>> BTW this sems to expose some rather nasty bug in dlmalloc or how it is
>> used. I notice that as soon as the first test is run, the 'top' value
>> in dlmalloc is outside the range of the malloc pool, which seems
>> wrong. I wonder if there is something broken with how
>> dm_test_pre_run() and dm_test_post_run() work.
>>
>>>
>>> Corrupting gd could cause an endless recursive loop, as these lines
>>> follow printing the observed string:
>>>
>>>          printf("pc = 0x%lx, ", pc);
>>>          printf("pc_reloc = 0x%lx\n\n", pc - gd->reloc_off);
>>
>> Yes I suspect printf() is dead.
>>
>>>
>>> If we remove SA_NODEFER from the signal mask in arch/sandbox/cpu/os.c,
>>> recursion cannot occur anymore. If a segmentation violation occurs
>>> inside the handler it will be delegated to the default handler.
>>>
>>> Furthermore we could consider removing the signal handler at the start
>>> of os_signal_action().
>>
>> The issue is that if you get a segfault you really don't know if you
>> can continue and do anything else.
>>
>> What is the goal with the signal handler? I don't think the user can
>> do anything about it.

Hello Simon,

the signal handler prints out the crash location and this makes
analyzing problems much easier. It proved valuable to me several times.

>
> I keep hitting this problem during development with sandbox, so I
> think I need to apply this patch.
>
> Does anything need to be updated in the tests?
>
> Regards,
> Simon
>

Did you try removing SA_NODEFER as proposed?

Best regards

Heinrich
Simon Glass June 6, 2021, 5:37 p.m. UTC | #5
Hi Heinrich,

On Sun, 6 Jun 2021 at 11:28, Heinrich Schuchardt <xypron.glpk@gmx.de> wrote:
>
> On 6/6/21 6:44 PM, Simon Glass wrote:
> > Hi Heinrich,
> >
> > On Mon, 22 Mar 2021 at 18:56, Simon Glass <sjg@chromium.org> wrote:
> >>
> >> Hi Heinrich,
> >>
> >> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt <xypron.glpk@gmx.de> wrote:
> >>>
> >>> On 22.03.21 06:21, Simon Glass wrote:
> >>>> At present if sandbox crashes it prints a message and tries to exit. But
> >>>> with the recently introduced signal handler, it often seems to get stuck
> >>>> in a loop until the stack overflows:
> >>>>
> >>>> Segmentation violation
> >>>>
> >>>> Segmentation violation
> >>>>
> >>>> Segmentation violation
> >>>>
> >>>> Segmentation violation
> >>>>
> >>>> Segmentation violation
> >>>>
> >>>> Segmentation violation
> >>>>
> >>>> Segmentation violation
> >>>> ...
> >>>
> >>> Hello Simon,
> >>>
> >>> do you have a reproducible example? I never have seen this.
> >>
> >> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433
> >>
> >> You need to run that commit with pytest though...it does not happen
> >> when run directly.
> >>
> >> BTW this sems to expose some rather nasty bug in dlmalloc or how it is
> >> used. I notice that as soon as the first test is run, the 'top' value
> >> in dlmalloc is outside the range of the malloc pool, which seems
> >> wrong. I wonder if there is something broken with how
> >> dm_test_pre_run() and dm_test_post_run() work.
> >>
> >>>
> >>> Corrupting gd could cause an endless recursive loop, as these lines
> >>> follow printing the observed string:
> >>>
> >>>          printf("pc = 0x%lx, ", pc);
> >>>          printf("pc_reloc = 0x%lx\n\n", pc - gd->reloc_off);
> >>
> >> Yes I suspect printf() is dead.
> >>
> >>>
> >>> If we remove SA_NODEFER from the signal mask in arch/sandbox/cpu/os.c,
> >>> recursion cannot occur anymore. If a segmentation violation occurs
> >>> inside the handler it will be delegated to the default handler.
> >>>
> >>> Furthermore we could consider removing the signal handler at the start
> >>> of os_signal_action().
> >>
> >> The issue is that if you get a segfault you really don't know if you
> >> can continue and do anything else.
> >>
> >> What is the goal with the signal handler? I don't think the user can
> >> do anything about it.
>
> Hello Simon,
>
> the signal handler prints out the crash location and this makes
> analyzing problems much easier. It proved valuable to me several times.

Well I think we are at a draw on that point, as the patch has caused
me pain many times!

>
> >
> > I keep hitting this problem during development with sandbox, so I
> > think I need to apply this patch.
> >
> > Does anything need to be updated in the tests?
> >
> > Regards,
> > Simon
> >
>
> Did you try removing SA_NODEFER as proposed?

But what is the goal here...do you mean you want it to crash later? I
just want it to crash immediately.

What's actually wrong with putting this behaviour behind a flag? You
could always run with the flag enabled if needed. But I just don't
think it makes sense for the default behaviour to be to try to
continue operation.

Regards,
Simon
Heinrich Schuchardt June 6, 2021, 5:50 p.m. UTC | #6
On 6/6/21 7:37 PM, Simon Glass wrote:
> Hi Heinrich,
>
> On Sun, 6 Jun 2021 at 11:28, Heinrich Schuchardt <xypron.glpk@gmx.de> wrote:
>>
>> On 6/6/21 6:44 PM, Simon Glass wrote:
>>> Hi Heinrich,
>>>
>>> On Mon, 22 Mar 2021 at 18:56, Simon Glass <sjg@chromium.org> wrote:
>>>>
>>>> Hi Heinrich,
>>>>
>>>> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt <xypron.glpk@gmx.de> wrote:
>>>>>
>>>>> On 22.03.21 06:21, Simon Glass wrote:
>>>>>> At present if sandbox crashes it prints a message and tries to exit. But
>>>>>> with the recently introduced signal handler, it often seems to get stuck
>>>>>> in a loop until the stack overflows:
>>>>>>
>>>>>> Segmentation violation
>>>>>>
>>>>>> Segmentation violation
>>>>>>
>>>>>> Segmentation violation
>>>>>>
>>>>>> Segmentation violation
>>>>>>
>>>>>> Segmentation violation
>>>>>>
>>>>>> Segmentation violation
>>>>>>
>>>>>> Segmentation violation
>>>>>> ...
>>>>>
>>>>> Hello Simon,
>>>>>
>>>>> do you have a reproducible example? I never have seen this.
>>>>
>>>> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433
>>>>
>>>> You need to run that commit with pytest though...it does not happen
>>>> when run directly.
>>>>
>>>> BTW this sems to expose some rather nasty bug in dlmalloc or how it is
>>>> used. I notice that as soon as the first test is run, the 'top' value
>>>> in dlmalloc is outside the range of the malloc pool, which seems
>>>> wrong. I wonder if there is something broken with how
>>>> dm_test_pre_run() and dm_test_post_run() work.
>>>>
>>>>>
>>>>> Corrupting gd could cause an endless recursive loop, as these lines
>>>>> follow printing the observed string:
>>>>>
>>>>>           printf("pc = 0x%lx, ", pc);
>>>>>           printf("pc_reloc = 0x%lx\n\n", pc - gd->reloc_off);
>>>>
>>>> Yes I suspect printf() is dead.
>>>>
>>>>>
>>>>> If we remove SA_NODEFER from the signal mask in arch/sandbox/cpu/os.c,
>>>>> recursion cannot occur anymore. If a segmentation violation occurs
>>>>> inside the handler it will be delegated to the default handler.
>>>>>
>>>>> Furthermore we could consider removing the signal handler at the start
>>>>> of os_signal_action().
>>>>
>>>> The issue is that if you get a segfault you really don't know if you
>>>> can continue and do anything else.
>>>>
>>>> What is the goal with the signal handler? I don't think the user can
>>>> do anything about it.
>>
>> Hello Simon,
>>
>> the signal handler prints out the crash location and this makes
>> analyzing problems much easier. It proved valuable to me several times.
>
> Well I think we are at a draw on that point, as the patch has caused
> me pain many times!
>
>>
>>>
>>> I keep hitting this problem during development with sandbox, so I
>>> think I need to apply this patch.
>>>
>>> Does anything need to be updated in the tests?
>>>
>>> Regards,
>>> Simon
>>>
>>
>> Did you try removing SA_NODEFER as proposed?
>
> But what is the goal here...do you mean you want it to crash later? I
> just want it to crash immediately.

If you have not messed up gd, U-Boot will print the program counter
address and the program will exit. You can try this with the exception
command.

If you have messed up gd, you will still exit but you will get a SIGSEGV
warning from the operating system.

>
> What's actually wrong with putting this behaviour behind a flag? You
> could always run with the flag enabled if needed. But I just don't
> think it makes sense for the default behaviour to be to try to
> continue operation.

I want to know the relocated PC counter when the sandbox crashes.
Why do you want to hide it by default?

Best regards

Heinrich
Sean Anderson June 6, 2021, 5:52 p.m. UTC | #7
On 6/6/21 1:28 PM, Heinrich Schuchardt wrote:
> On 6/6/21 6:44 PM, Simon Glass wrote:
>> Hi Heinrich,
>>
>> On Mon, 22 Mar 2021 at 18:56, Simon Glass <sjg@chromium.org> wrote:
>>>
>>> Hi Heinrich,
>>>
>>> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt <xypron.glpk@gmx.de> wrote:
>>>>
>>>> On 22.03.21 06:21, Simon Glass wrote:
>>>>> At present if sandbox crashes it prints a message and tries to exit. But
>>>>> with the recently introduced signal handler, it often seems to get stuck
>>>>> in a loop until the stack overflows:
>>>>>
>>>>> Segmentation violation
>>>>>
>>>>> Segmentation violation
>>>>>
>>>>> Segmentation violation
>>>>>
>>>>> Segmentation violation
>>>>>
>>>>> Segmentation violation
>>>>>
>>>>> Segmentation violation
>>>>>
>>>>> Segmentation violation
>>>>> ...
>>>>
>>>> Hello Simon,
>>>>
>>>> do you have a reproducible example? I never have seen this.
>>>
>>> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433
>>>
>>> You need to run that commit with pytest though...it does not happen
>>> when run directly.
>>>
>>> BTW this sems to expose some rather nasty bug in dlmalloc or how it is
>>> used. I notice that as soon as the first test is run, the 'top' value
>>> in dlmalloc is outside the range of the malloc pool, which seems
>>> wrong. I wonder if there is something broken with how
>>> dm_test_pre_run() and dm_test_post_run() work.
>>>
>>>>
>>>> Corrupting gd could cause an endless recursive loop, as these lines
>>>> follow printing the observed string:
>>>>
>>>>          printf("pc = 0x%lx, ", pc);
>>>>          printf("pc_reloc = 0x%lx\n\n", pc - gd->reloc_off);
>>>
>>> Yes I suspect printf() is dead.
>>>
>>>>
>>>> If we remove SA_NODEFER from the signal mask in arch/sandbox/cpu/os.c,
>>>> recursion cannot occur anymore. If a segmentation violation occurs
>>>> inside the handler it will be delegated to the default handler.
>>>>
>>>> Furthermore we could consider removing the signal handler at the start
>>>> of os_signal_action().
>>>
>>> The issue is that if you get a segfault you really don't know if you
>>> can continue and do anything else.
>>>
>>> What is the goal with the signal handler? I don't think the user can
>>> do anything about it.
> 
> Hello Simon,
> 
> the signal handler prints out the crash location and this makes
> analyzing problems much easier. It proved valuable to me several times.

Can't you just rerun with gdb?

--Sean

> 
>>
>> I keep hitting this problem during development with sandbox, so I
>> think I need to apply this patch.
>>
>> Does anything need to be updated in the tests?
>>
>> Regards,
>> Simon
>>
> 
> Did you try removing SA_NODEFER as proposed?
> 
> Best regards
> 
> Heinrich
Heinrich Schuchardt June 6, 2021, 5:57 p.m. UTC | #8
On 6/6/21 7:52 PM, Sean Anderson wrote:
> On 6/6/21 1:28 PM, Heinrich Schuchardt wrote:
>> On 6/6/21 6:44 PM, Simon Glass wrote:
>>> Hi Heinrich,
>>>
>>> On Mon, 22 Mar 2021 at 18:56, Simon Glass <sjg@chromium.org> wrote:
>>>>
>>>> Hi Heinrich,
>>>>
>>>> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt
>>>> <xypron.glpk@gmx.de> wrote:
>>>>>
>>>>> On 22.03.21 06:21, Simon Glass wrote:
>>>>>> At present if sandbox crashes it prints a message and tries to
>>>>>> exit. But
>>>>>> with the recently introduced signal handler, it often seems to get
>>>>>> stuck
>>>>>> in a loop until the stack overflows:
>>>>>>
>>>>>> Segmentation violation
>>>>>>
>>>>>> Segmentation violation
>>>>>>
>>>>>> Segmentation violation
>>>>>>
>>>>>> Segmentation violation
>>>>>>
>>>>>> Segmentation violation
>>>>>>
>>>>>> Segmentation violation
>>>>>>
>>>>>> Segmentation violation
>>>>>> ...
>>>>>
>>>>> Hello Simon,
>>>>>
>>>>> do you have a reproducible example? I never have seen this.
>>>>
>>>> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433
>>>>
>>>> You need to run that commit with pytest though...it does not happen
>>>> when run directly.
>>>>
>>>> BTW this sems to expose some rather nasty bug in dlmalloc or how it is
>>>> used. I notice that as soon as the first test is run, the 'top' value
>>>> in dlmalloc is outside the range of the malloc pool, which seems
>>>> wrong. I wonder if there is something broken with how
>>>> dm_test_pre_run() and dm_test_post_run() work.
>>>>
>>>>>
>>>>> Corrupting gd could cause an endless recursive loop, as these lines
>>>>> follow printing the observed string:
>>>>>
>>>>>          printf("pc = 0x%lx, ", pc);
>>>>>          printf("pc_reloc =0x%lx\n\n", pc - gd->reloc_off);
>>>>
>>>> Yes I suspect printf() is dead.
>>>>
>>>>>
>>>>> If we remove SA_NODEFER from the signal mask in arch/sandbox/cpu/os.c,
>>>>> recursion cannot occur anymore. If a segmentation violation occurs
>>>>> inside the handler it will be delegated to the default handler.
>>>>>
>>>>> Furthermore we could consider removing the signal handler at the start
>>>>> of os_signal_action().
>>>>
>>>> The issue is that if you get a segfault you really don't know if you
>>>> can continue and do anything else.
>>>>
>>>> What is the goal with the signal handler? I don't think the user can
>>>> do anything about it.
>>
>> Hello Simon,
>>
>> the signal handler prints out the crash location and this makes
>> analyzing problems much easier. It proved valuable to me several times.
>
> Can't you just rerun with gdb?

This would require that the problem is easily reproducible which may not
be the case.

Best regards

Heinrich

>
>>
>>>
>>> I keep hitting this problem during development with sandbox, so I
>>> think I need to apply this patch.
>>>
>>> Does anything need to be updated in the tests?
>>>
>>> Regards,
>>> Simon
>>>
>>
>> Did you try removing SA_NODEFER as proposed?
>>
>> Best regards
>>
>> Heinrich
>
>
Sean Anderson June 6, 2021, 6:07 p.m. UTC | #9
On 6/6/21 1:57 PM, Heinrich Schuchardt wrote:
> On 6/6/21 7:52 PM, Sean Anderson wrote:
>> On 6/6/21 1:28 PM, Heinrich Schuchardt wrote:
>>> On 6/6/21 6:44 PM, Simon Glass wrote:
>>>> Hi Heinrich,
>>>>
>>>> On Mon, 22 Mar 2021 at 18:56, Simon Glass <sjg@chromium.org> wrote:
>>>>>
>>>>> Hi Heinrich,
>>>>>
>>>>> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt
>>>>> <xypron.glpk@gmx.de> wrote:
>>>>>>
>>>>>> On 22.03.21 06:21, Simon Glass wrote:
>>>>>>> At present if sandbox crashes it prints a message and tries to
>>>>>>> exit. But
>>>>>>> with the recently introduced signal handler, it often seems to get
>>>>>>> stuck
>>>>>>> in a loop until the stack overflows:
>>>>>>>
>>>>>>> Segmentation violation
>>>>>>>
>>>>>>> Segmentation violation
>>>>>>>
>>>>>>> Segmentation violation
>>>>>>>
>>>>>>> Segmentation violation
>>>>>>>
>>>>>>> Segmentation violation
>>>>>>>
>>>>>>> Segmentation violation
>>>>>>>
>>>>>>> Segmentation violation
>>>>>>> ...
>>>>>>
>>>>>> Hello Simon,
>>>>>>
>>>>>> do you have a reproducible example? I never have seen this.
>>>>>
>>>>> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433
>>>>>
>>>>> You need to run that commit with pytest though...it does not happen
>>>>> when run directly.
>>>>>
>>>>> BTW this sems to expose some rather nasty bug in dlmalloc or how it is
>>>>> used. I notice that as soon as the first test is run, the 'top' value
>>>>> in dlmalloc is outside the range of the malloc pool, which seems
>>>>> wrong. I wonder if there is something broken with how
>>>>> dm_test_pre_run() and dm_test_post_run() work.
>>>>>
>>>>>>
>>>>>> Corrupting gd could cause an endless recursive loop, as these lines
>>>>>> follow printing the observed string:
>>>>>>
>>>>>>          printf("pc = 0x%lx, ", pc);
>>>>>>          printf("pc_reloc =0x%lx\n\n", pc - gd->reloc_off);
>>>>>
>>>>> Yes I suspect printf() is dead.
>>>>>
>>>>>>
>>>>>> If we remove SA_NODEFER from the signal mask in arch/sandbox/cpu/os.c,
>>>>>> recursion cannot occur anymore. If a segmentation violation occurs
>>>>>> inside the handler it will be delegated to the default handler.
>>>>>>
>>>>>> Furthermore we could consider removing the signal handler at the start
>>>>>> of os_signal_action().
>>>>>
>>>>> The issue is that if you get a segfault you really don't know if you
>>>>> can continue and do anything else.
>>>>>
>>>>> What is the goal with the signal handler? I don't think the user can
>>>>> do anything about it.
>>>
>>> Hello Simon,
>>>
>>> the signal handler prints out the crash location and this makes
>>> analyzing problems much easier. It proved valuable to me several times.
>>
>> Can't you just rerun with gdb?
> 
> This would require that the problem is easily reproducible which may not
> be the case.

Hm, perhaps you could keep track of how many times we've tried to catch a signal, and bail if this is the second time around. E.g. something like

static void os_signal_handler(int sig, siginfo_t *info, void *con)
{
	/* other variables */
	static int level = 0;

	switch (level++) {
	case 0:
		break;
	case 1:
		sandbox_exit();
	default:
		os_exit(0);
	}

	/* rest of the handler */
}

--Sean

> 
> Best regards
> 
> Heinrich
> 
>>
>>>
>>>>
>>>> I keep hitting this problem during development with sandbox, so I
>>>> think I need to apply this patch.
>>>>
>>>> Does anything need to be updated in the tests?
>>>>
>>>> Regards,
>>>> Simon
>>>>
>>>
>>> Did you try removing SA_NODEFER as proposed?
>>>
>>> Best regards
>>>
>>> Heinrich
>>
>>
>
Heinrich Schuchardt June 6, 2021, 9:35 p.m. UTC | #10
Am 6. Juni 2021 20:07:31 MESZ schrieb Sean Anderson <seanga2@gmail.com>:
>On 6/6/21 1:57 PM, Heinrich Schuchardt wrote:
>> On 6/6/21 7:52 PM, Sean Anderson wrote:
>>> On 6/6/21 1:28 PM, Heinrich Schuchardt wrote:
>>>> On 6/6/21 6:44 PM, Simon Glass wrote:
>>>>> Hi Heinrich,
>>>>>
>>>>> On Mon, 22 Mar 2021 at 18:56, Simon Glass <sjg@chromium.org>
>wrote:
>>>>>>
>>>>>> Hi Heinrich,
>>>>>>
>>>>>> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt
>>>>>> <xypron.glpk@gmx.de> wrote:
>>>>>>>
>>>>>>> On 22.03.21 06:21, Simon Glass wrote:
>>>>>>>> At present if sandbox crashes it prints a message and tries to
>>>>>>>> exit. But
>>>>>>>> with the recently introduced signal handler, it often seems to
>get
>>>>>>>> stuck
>>>>>>>> in a loop until the stack overflows:
>>>>>>>>
>>>>>>>> Segmentation violation
>>>>>>>>
>>>>>>>> Segmentation violation
>>>>>>>>
>>>>>>>> Segmentation violation
>>>>>>>>
>>>>>>>> Segmentation violation
>>>>>>>>
>>>>>>>> Segmentation violation
>>>>>>>>
>>>>>>>> Segmentation violation
>>>>>>>>
>>>>>>>> Segmentation violation
>>>>>>>> ...
>>>>>>>
>>>>>>> Hello Simon,
>>>>>>>
>>>>>>> do you have a reproducible example? I never have seen this.
>>>>>>
>>>>>> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433
>>>>>>
>>>>>> You need to run that commit with pytest though...it does not
>happen
>>>>>> when run directly.
>>>>>>
>>>>>> BTW this sems to expose some rather nasty bug in dlmalloc or how
>it is
>>>>>> used. I notice that as soon as the first test is run, the 'top'
>value
>>>>>> in dlmalloc is outside the range of the malloc pool, which seems
>>>>>> wrong. I wonder if there is something broken with how
>>>>>> dm_test_pre_run() and dm_test_post_run() work.
>>>>>>
>>>>>>>
>>>>>>> Corrupting gd could cause an endless recursive loop, as these
>lines
>>>>>>> follow printing the observed string:
>>>>>>>
>>>>>>>          printf("pc = 0x%lx, ", pc);
>>>>>>>          printf("pc_reloc =0x%lx\n\n", pc - gd->reloc_off);
>>>>>>
>>>>>> Yes I suspect printf() is dead.
>>>>>>
>>>>>>>
>>>>>>> If we remove SA_NODEFER from the signal mask in
>arch/sandbox/cpu/os.c,
>>>>>>> recursion cannot occur anymore. If a segmentation violation
>occurs
>>>>>>> inside the handler it will be delegated to the default handler.
>>>>>>>
>>>>>>> Furthermore we could consider removing the signal handler at the
>start
>>>>>>> of os_signal_action().
>>>>>>
>>>>>> The issue is that if you get a segfault you really don't know if
>you
>>>>>> can continue and do anything else.
>>>>>>
>>>>>> What is the goal with the signal handler? I don't think the user
>can
>>>>>> do anything about it.
>>>>
>>>> Hello Simon,
>>>>
>>>> the signal handler prints out the crash location and this makes
>>>> analyzing problems much easier. It proved valuable to me several
>times.
>>>
>>> Can't you just rerun with gdb?
>> 
>> This would require that the problem is easily reproducible which may
>not
>> be the case.
>
>Hm, perhaps you could keep track of how many times we've tried to catch
>a signal, and bail if this is the second time around. E.g. something
>like
>

Removing SA_NODEFER from the signal mask will let the OS kick in if an exception occurs in a signal handler.

No counter is needed.

Best regards

Heinrich

>static void os_signal_handler(int sig, siginfo_t *info, void *con)
>{
>	/* other variables */
>	static int level = 0;
>
>	switch (level++) {
>	case 0:
>		break;
>	case 1:
>		sandbox_exit();
>	default:
>		os_exit(0);
>	}
>
>	/* rest of the handler */
>}
>
>--Sean
>
>> 
>> Best regards
>> 
>> Heinrich
>> 
>>>
>>>>
>>>>>
>>>>> I keep hitting this problem during development with sandbox, so I
>>>>> think I need to apply this patch.
>>>>>
>>>>> Does anything need to be updated in the tests?
>>>>>
>>>>> Regards,
>>>>> Simon
>>>>>
>>>>
>>>> Did you try removing SA_NODEFER as proposed?
>>>>
>>>> Best regards
>>>>
>>>> Heinrich
>>>
>>>
>>
Simon Glass July 4, 2021, 7:24 p.m. UTC | #11
Hi Heinrich,

On Sun, 6 Jun 2021 at 15:35, Heinrich Schuchardt <xypron.glpk@gmx.de> wrote:
>
> Am 6. Juni 2021 20:07:31 MESZ schrieb Sean Anderson <seanga2@gmail.com>:
> >On 6/6/21 1:57 PM, Heinrich Schuchardt wrote:
> >> On 6/6/21 7:52 PM, Sean Anderson wrote:
> >>> On 6/6/21 1:28 PM, Heinrich Schuchardt wrote:
> >>>> On 6/6/21 6:44 PM, Simon Glass wrote:
> >>>>> Hi Heinrich,
> >>>>>
> >>>>> On Mon, 22 Mar 2021 at 18:56, Simon Glass <sjg@chromium.org>
> >wrote:
> >>>>>>
> >>>>>> Hi Heinrich,
> >>>>>>
> >>>>>> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt
> >>>>>> <xypron.glpk@gmx.de> wrote:
> >>>>>>>
> >>>>>>> On 22.03.21 06:21, Simon Glass wrote:
> >>>>>>>> At present if sandbox crashes it prints a message and tries to
> >>>>>>>> exit. But
> >>>>>>>> with the recently introduced signal handler, it often seems to
> >get
> >>>>>>>> stuck
> >>>>>>>> in a loop until the stack overflows:
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>> ...
> >>>>>>>
> >>>>>>> Hello Simon,
> >>>>>>>
> >>>>>>> do you have a reproducible example? I never have seen this.
> >>>>>>
> >>>>>> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433
> >>>>>>
> >>>>>> You need to run that commit with pytest though...it does not
> >happen
> >>>>>> when run directly.
> >>>>>>
> >>>>>> BTW this sems to expose some rather nasty bug in dlmalloc or how
> >it is
> >>>>>> used. I notice that as soon as the first test is run, the 'top'
> >value
> >>>>>> in dlmalloc is outside the range of the malloc pool, which seems
> >>>>>> wrong. I wonder if there is something broken with how
> >>>>>> dm_test_pre_run() and dm_test_post_run() work.
> >>>>>>
> >>>>>>>
> >>>>>>> Corrupting gd could cause an endless recursive loop, as these
> >lines
> >>>>>>> follow printing the observed string:
> >>>>>>>
> >>>>>>>          printf("pc = 0x%lx, ", pc);
> >>>>>>>          printf("pc_reloc =0x%lx\n\n", pc - gd->reloc_off);
> >>>>>>
> >>>>>> Yes I suspect printf() is dead.
> >>>>>>
> >>>>>>>
> >>>>>>> If we remove SA_NODEFER from the signal mask in
> >arch/sandbox/cpu/os.c,
> >>>>>>> recursion cannot occur anymore. If a segmentation violation
> >occurs
> >>>>>>> inside the handler it will be delegated to the default handler.
> >>>>>>>
> >>>>>>> Furthermore we could consider removing the signal handler at the
> >start
> >>>>>>> of os_signal_action().
> >>>>>>
> >>>>>> The issue is that if you get a segfault you really don't know if
> >you
> >>>>>> can continue and do anything else.
> >>>>>>
> >>>>>> What is the goal with the signal handler? I don't think the user
> >can
> >>>>>> do anything about it.
> >>>>
> >>>> Hello Simon,
> >>>>
> >>>> the signal handler prints out the crash location and this makes
> >>>> analyzing problems much easier. It proved valuable to me several
> >times.
> >>>
> >>> Can't you just rerun with gdb?
> >>
> >> This would require that the problem is easily reproducible which may
> >not
> >> be the case.
> >
> >Hm, perhaps you could keep track of how many times we've tried to catch
> >a signal, and bail if this is the second time around. E.g. something
> >like
> >
>
> Removing SA_NODEFER from the signal mask will let the OS kick in if an exception occurs in a signal handler.
>
> No counter is needed.

Yes that is correct.

However I am still going to apply this patch, since I would prefer
that the default behaviour is to exit with a signal, rather than
continue. It indicates some sort of error (except if we are running a
strange test), so it seems wrong to try to continue. It may produce
other issues and sandbox is in an unknown state.

I am happy to discuss another way / patch for doing what you need.

Regards,
Simon

[..]
Simon Glass July 4, 2021, 8:15 p.m. UTC | #12
Hi Heinrich,

On Sun, 6 Jun 2021 at 15:35, Heinrich Schuchardt <xypron.glpk@gmx.de> wrote:
>
> Am 6. Juni 2021 20:07:31 MESZ schrieb Sean Anderson <seanga2@gmail.com>:
> >On 6/6/21 1:57 PM, Heinrich Schuchardt wrote:
> >> On 6/6/21 7:52 PM, Sean Anderson wrote:
> >>> On 6/6/21 1:28 PM, Heinrich Schuchardt wrote:
> >>>> On 6/6/21 6:44 PM, Simon Glass wrote:
> >>>>> Hi Heinrich,
> >>>>>
> >>>>> On Mon, 22 Mar 2021 at 18:56, Simon Glass <sjg@chromium.org>
> >wrote:
> >>>>>>
> >>>>>> Hi Heinrich,
> >>>>>>
> >>>>>> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt
> >>>>>> <xypron.glpk@gmx.de> wrote:
> >>>>>>>
> >>>>>>> On 22.03.21 06:21, Simon Glass wrote:
> >>>>>>>> At present if sandbox crashes it prints a message and tries to
> >>>>>>>> exit. But
> >>>>>>>> with the recently introduced signal handler, it often seems to
> >get
> >>>>>>>> stuck
> >>>>>>>> in a loop until the stack overflows:
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>>
> >>>>>>>> Segmentation violation
> >>>>>>>> ...
> >>>>>>>
> >>>>>>> Hello Simon,
> >>>>>>>
> >>>>>>> do you have a reproducible example? I never have seen this.
> >>>>>>
> >>>>>> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433
> >>>>>>
> >>>>>> You need to run that commit with pytest though...it does not
> >happen
> >>>>>> when run directly.
> >>>>>>
> >>>>>> BTW this sems to expose some rather nasty bug in dlmalloc or how
> >it is
> >>>>>> used. I notice that as soon as the first test is run, the 'top'
> >value
> >>>>>> in dlmalloc is outside the range of the malloc pool, which seems
> >>>>>> wrong. I wonder if there is something broken with how
> >>>>>> dm_test_pre_run() and dm_test_post_run() work.
> >>>>>>
> >>>>>>>
> >>>>>>> Corrupting gd could cause an endless recursive loop, as these
> >lines
> >>>>>>> follow printing the observed string:
> >>>>>>>
> >>>>>>>          printf("pc = 0x%lx, ", pc);
> >>>>>>>          printf("pc_reloc =0x%lx\n\n", pc - gd->reloc_off);
> >>>>>>
> >>>>>> Yes I suspect printf() is dead.
> >>>>>>
> >>>>>>>
> >>>>>>> If we remove SA_NODEFER from the signal mask in
> >arch/sandbox/cpu/os.c,
> >>>>>>> recursion cannot occur anymore. If a segmentation violation
> >occurs
> >>>>>>> inside the handler it will be delegated to the default handler.
> >>>>>>>
> >>>>>>> Furthermore we could consider removing the signal handler at the
> >start
> >>>>>>> of os_signal_action().
> >>>>>>
> >>>>>> The issue is that if you get a segfault you really don't know if
> >you
> >>>>>> can continue and do anything else.
> >>>>>>
> >>>>>> What is the goal with the signal handler? I don't think the user
> >can
> >>>>>> do anything about it.
> >>>>
> >>>> Hello Simon,
> >>>>
> >>>> the signal handler prints out the crash location and this makes
> >>>> analyzing problems much easier. It proved valuable to me several
> >times.
> >>>
> >>> Can't you just rerun with gdb?
> >>
> >> This would require that the problem is easily reproducible which may
> >not
> >> be the case.
> >
> >Hm, perhaps you could keep track of how many times we've tried to catch
> >a signal, and bail if this is the second time around. E.g. something
> >like
> >
>
> Removing SA_NODEFER from the signal mask will let the OS kick in if an exception occurs in a signal handler.
>
> No counter is needed.

Yes that is correct.

However I am still going to apply this patch, since I would prefer
that the default behaviour is to exit with a signal, rather than
continue. It indicates some sort of error (except if we are running a
strange test), so it seems wrong to try to continue. It may produce
other issues and sandbox is in an unknown state.

I am happy to discuss another way / patch for doing what you need.

Regards,
Simon

[..]

Applied to u-boot-dm/next, thanks!
Heinrich Schuchardt July 5, 2021, 5:30 p.m. UTC | #13
On 7/4/21 10:15 PM, Simon Glass wrote:
> Hi Heinrich,
>
> On Sun, 6 Jun 2021 at 15:35, Heinrich Schuchardt <xypron.glpk@gmx.de> wrote:
>>
>> Am 6. Juni 2021 20:07:31 MESZ schrieb Sean Anderson <seanga2@gmail.com>:
>>> On 6/6/21 1:57 PM, Heinrich Schuchardt wrote:
>>>> On 6/6/21 7:52 PM, Sean Anderson wrote:
>>>>> On 6/6/21 1:28 PM, Heinrich Schuchardt wrote:
>>>>>> On 6/6/21 6:44 PM, Simon Glass wrote:
>>>>>>> Hi Heinrich,
>>>>>>>
>>>>>>> On Mon, 22 Mar 2021 at 18:56, Simon Glass <sjg@chromium.org>
>>> wrote:
>>>>>>>>
>>>>>>>> Hi Heinrich,
>>>>>>>>
>>>>>>>> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt
>>>>>>>> <xypron.glpk@gmx.de> wrote:
>>>>>>>>>
>>>>>>>>> On 22.03.21 06:21, Simon Glass wrote:
>>>>>>>>>> At present if sandbox crashes it prints a message and tries to
>>>>>>>>>> exit. But
>>>>>>>>>> with the recently introduced signal handler, it often seems to
>>> get
>>>>>>>>>> stuck
>>>>>>>>>> in a loop until the stack overflows:
>>>>>>>>>>
>>>>>>>>>> Segmentation violation
>>>>>>>>>>
>>>>>>>>>> Segmentation violation
>>>>>>>>>>
>>>>>>>>>> Segmentation violation
>>>>>>>>>>
>>>>>>>>>> Segmentation violation
>>>>>>>>>>
>>>>>>>>>> Segmentation violation
>>>>>>>>>>
>>>>>>>>>> Segmentation violation
>>>>>>>>>>
>>>>>>>>>> Segmentation violation
>>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> Hello Simon,
>>>>>>>>>
>>>>>>>>> do you have a reproducible example? I never have seen this.
>>>>>>>>
>>>>>>>> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433
>>>>>>>>
>>>>>>>> You need to run that commit with pytest though...it does not
>>> happen
>>>>>>>> when run directly.
>>>>>>>>
>>>>>>>> BTW this sems to expose some rather nasty bug in dlmalloc or how
>>> it is
>>>>>>>> used. I notice that as soon as the first test is run, the 'top'
>>> value
>>>>>>>> in dlmalloc is outside the range of the malloc pool, which seems
>>>>>>>> wrong. I wonder if there is something broken with how
>>>>>>>> dm_test_pre_run() and dm_test_post_run() work.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Corrupting gd could cause an endless recursive loop, as these
>>> lines
>>>>>>>>> follow printing the observed string:
>>>>>>>>>
>>>>>>>>>           printf("pc = 0x%lx, ", pc);
>>>>>>>>>           printf("pc_reloc =0x%lx\n\n", pc - gd->reloc_off);
>>>>>>>>
>>>>>>>> Yes I suspect printf() is dead.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> If we remove SA_NODEFER from the signal mask in
>>> arch/sandbox/cpu/os.c,
>>>>>>>>> recursion cannot occur anymore. If a segmentation violation
>>> occurs
>>>>>>>>> inside the handler it will be delegated to the default handler.
>>>>>>>>>
>>>>>>>>> Furthermore we could consider removing the signal handler at the
>>> start
>>>>>>>>> of os_signal_action().
>>>>>>>>
>>>>>>>> The issue is that if you get a segfault you really don't know if
>>> you
>>>>>>>> can continue and do anything else.
>>>>>>>>
>>>>>>>> What is the goal with the signal handler? I don't think the user
>>> can
>>>>>>>> do anything about it.
>>>>>>
>>>>>> Hello Simon,
>>>>>>
>>>>>> the signal handler prints out the crash location and this makes
>>>>>> analyzing problems much easier. It proved valuable to me several
>>> times.
>>>>>
>>>>> Can't you just rerun with gdb?
>>>>
>>>> This would require that the problem is easily reproducible which may
>>> not
>>>> be the case.
>>>
>>> Hm, perhaps you could keep track of how many times we've tried to catch
>>> a signal, and bail if this is the second time around. E.g. something
>>> like
>>>
>>
>> Removing SA_NODEFER from the signal mask will let the OS kick in if an exception occurs in a signal handler.
>>
>> No counter is needed.
>
> Yes that is correct.
>
> However I am still going to apply this patch, since I would prefer
> that the default behaviour is to exit with a signal, rather than
> continue. It indicates some sort of error (except if we are running a
> strange test), so it seems wrong to try to continue. It may produce
> other issues and sandbox is in an unknown state.
>
> I am happy to discuss another way / patch for doing what you need.


Your patch is not needed to exit if a signal occurs.
If the U-Boot sandbox resets or exits is controlled by SANDBOX_CRASH_RESET.

As said, to solve your problem just remove SA_NODEFER from the signal mask.

Best regards

Heinrich
Simon Glass July 6, 2021, 2:42 p.m. UTC | #14
Hi Heinrich,

On Mon, 5 Jul 2021 at 11:31, Heinrich Schuchardt <xypron.glpk@gmx.de> wrote:
>
> On 7/4/21 10:15 PM, Simon Glass wrote:
> > Hi Heinrich,
> >
> > On Sun, 6 Jun 2021 at 15:35, Heinrich Schuchardt <xypron.glpk@gmx.de> wrote:
> >>
> >> Am 6. Juni 2021 20:07:31 MESZ schrieb Sean Anderson <seanga2@gmail.com>:
> >>> On 6/6/21 1:57 PM, Heinrich Schuchardt wrote:
> >>>> On 6/6/21 7:52 PM, Sean Anderson wrote:
> >>>>> On 6/6/21 1:28 PM, Heinrich Schuchardt wrote:
> >>>>>> On 6/6/21 6:44 PM, Simon Glass wrote:
> >>>>>>> Hi Heinrich,
> >>>>>>>
> >>>>>>> On Mon, 22 Mar 2021 at 18:56, Simon Glass <sjg@chromium.org>
> >>> wrote:
> >>>>>>>>
> >>>>>>>> Hi Heinrich,
> >>>>>>>>
> >>>>>>>> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt
> >>>>>>>> <xypron.glpk@gmx.de> wrote:
> >>>>>>>>>
> >>>>>>>>> On 22.03.21 06:21, Simon Glass wrote:
> >>>>>>>>>> At present if sandbox crashes it prints a message and tries to
> >>>>>>>>>> exit. But
> >>>>>>>>>> with the recently introduced signal handler, it often seems to
> >>> get
> >>>>>>>>>> stuck
> >>>>>>>>>> in a loop until the stack overflows:
> >>>>>>>>>>
> >>>>>>>>>> Segmentation violation
> >>>>>>>>>>
> >>>>>>>>>> Segmentation violation
> >>>>>>>>>>
> >>>>>>>>>> Segmentation violation
> >>>>>>>>>>
> >>>>>>>>>> Segmentation violation
> >>>>>>>>>>
> >>>>>>>>>> Segmentation violation
> >>>>>>>>>>
> >>>>>>>>>> Segmentation violation
> >>>>>>>>>>
> >>>>>>>>>> Segmentation violation
> >>>>>>>>>> ...
> >>>>>>>>>
> >>>>>>>>> Hello Simon,
> >>>>>>>>>
> >>>>>>>>> do you have a reproducible example? I never have seen this.
> >>>>>>>>
> >>>>>>>> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433
> >>>>>>>>
> >>>>>>>> You need to run that commit with pytest though...it does not
> >>> happen
> >>>>>>>> when run directly.
> >>>>>>>>
> >>>>>>>> BTW this sems to expose some rather nasty bug in dlmalloc or how
> >>> it is
> >>>>>>>> used. I notice that as soon as the first test is run, the 'top'
> >>> value
> >>>>>>>> in dlmalloc is outside the range of the malloc pool, which seems
> >>>>>>>> wrong. I wonder if there is something broken with how
> >>>>>>>> dm_test_pre_run() and dm_test_post_run() work.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Corrupting gd could cause an endless recursive loop, as these
> >>> lines
> >>>>>>>>> follow printing the observed string:
> >>>>>>>>>
> >>>>>>>>>           printf("pc = 0x%lx, ", pc);
> >>>>>>>>>           printf("pc_reloc =0x%lx\n\n", pc - gd->reloc_off);
> >>>>>>>>
> >>>>>>>> Yes I suspect printf() is dead.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> If we remove SA_NODEFER from the signal mask in
> >>> arch/sandbox/cpu/os.c,
> >>>>>>>>> recursion cannot occur anymore. If a segmentation violation
> >>> occurs
> >>>>>>>>> inside the handler it will be delegated to the default handler.
> >>>>>>>>>
> >>>>>>>>> Furthermore we could consider removing the signal handler at the
> >>> start
> >>>>>>>>> of os_signal_action().
> >>>>>>>>
> >>>>>>>> The issue is that if you get a segfault you really don't know if
> >>> you
> >>>>>>>> can continue and do anything else.
> >>>>>>>>
> >>>>>>>> What is the goal with the signal handler? I don't think the user
> >>> can
> >>>>>>>> do anything about it.
> >>>>>>
> >>>>>> Hello Simon,
> >>>>>>
> >>>>>> the signal handler prints out the crash location and this makes
> >>>>>> analyzing problems much easier. It proved valuable to me several
> >>> times.
> >>>>>
> >>>>> Can't you just rerun with gdb?
> >>>>
> >>>> This would require that the problem is easily reproducible which may
> >>> not
> >>>> be the case.
> >>>
> >>> Hm, perhaps you could keep track of how many times we've tried to catch
> >>> a signal, and bail if this is the second time around. E.g. something
> >>> like
> >>>
> >>
> >> Removing SA_NODEFER from the signal mask will let the OS kick in if an exception occurs in a signal handler.
> >>
> >> No counter is needed.
> >
> > Yes that is correct.
> >
> > However I am still going to apply this patch, since I would prefer
> > that the default behaviour is to exit with a signal, rather than
> > continue. It indicates some sort of error (except if we are running a
> > strange test), so it seems wrong to try to continue. It may produce
> > other issues and sandbox is in an unknown state.
> >
> > I am happy to discuss another way / patch for doing what you need.
>
>
> Your patch is not needed to exit if a signal occurs.
> If the U-Boot sandbox resets or exits is controlled by SANDBOX_CRASH_RESET.
>
> As said, to solve your problem just remove SA_NODEFER from the signal mask.

Perhaps we should discuss this on a contributor call. This thread is
getting too long...!

Regards,
Simon
diff mbox series

Patch

diff --git a/arch/sandbox/cpu/start.c b/arch/sandbox/cpu/start.c
index e87365e800d..72fe293e8bc 100644
--- a/arch/sandbox/cpu/start.c
+++ b/arch/sandbox/cpu/start.c
@@ -389,6 +389,16 @@  static int sandbox_cmdline_cb_select_unittests(struct sandbox_state *state,
 }
 SANDBOX_CMDLINE_OPT_SHORT(select_unittests, 'k', 1, "Select unit tests to run");
 
+static int sandbox_cmdline_cb_signals(struct sandbox_state *state,
+				      const char *arg)
+{
+	state->handle_signals = true;
+
+	return 0;
+}
+SANDBOX_CMDLINE_OPT_SHORT(signals, 'S', 0,
+			  "Handle signals (such as SIGSEGV) in sandbox");
+
 static void setup_ram_buf(struct sandbox_state *state)
 {
 	/* Zero the RAM buffer if we didn't read it, to keep valgrind happy */
@@ -472,9 +482,11 @@  int main(int argc, char *argv[])
 	if (ret)
 		goto err;
 
-	ret = os_setup_signal_handlers();
-	if (ret)
-		goto err;
+	if (state->handle_signals) {
+		ret = os_setup_signal_handlers();
+		if (ret)
+			goto err;
+	}
 
 #if CONFIG_VAL(SYS_MALLOC_F_LEN)
 	gd->malloc_base = CONFIG_MALLOC_F_ADDR;
diff --git a/arch/sandbox/include/asm/state.h b/arch/sandbox/include/asm/state.h
index bca13069824..1c4c571e28d 100644
--- a/arch/sandbox/include/asm/state.h
+++ b/arch/sandbox/include/asm/state.h
@@ -93,6 +93,7 @@  struct sandbox_state {
 	bool ram_buf_read;		/* true if we read the RAM buffer */
 	bool run_unittests;		/* Run unit tests */
 	const char *select_unittests;	/* Unit test to run */
+	bool handle_signals;		/* Handle signals within sandbox */
 
 	/* Pointer to information for each SPI bus/cs */
 	struct sandbox_spi_info spi[CONFIG_SANDBOX_SPI_MAX_BUS]