diff mbox series

ARC: prevent showing irrelevant exception info in signal message

Message ID 20180629193907.17227-1-Eugeniy.Paltsev@synopsys.com
State New
Headers show
Series ARC: prevent showing irrelevant exception info in signal message | expand

Commit Message

Eugeniy Paltsev June 29, 2018, 7:39 p.m. UTC
We process signals in the end of syscall/exception handler.
It the signal is fatal we print register's content using
show_regs function. show_regs() also prints information about
last exception happened.

In case of multicore system we can catch the situation when we
will print wrong information about exception. See the example:

Comments

Vineet Gupta July 2, 2018, 5:57 p.m. UTC | #1
+CC Al

On 06/29/2018 12:39 PM, Eugeniy Paltsev wrote:
> We process signals in the end of syscall/exception handler.
> It the signal is fatal we print register's content using
> show_regs function. show_regs() also prints information about
> last exception happened.
>
> In case of multicore system we can catch the situation when we
> will print wrong information about exception. See the example:
> ______________________________
> CPU-0: started to handle page fault
> CPU-1: sent signal to process, which is executed on CPU-0
> CPU-0: ended page fault handle. Started to process signal before
>        returnig to userspace. Process signal, which is send from
>        CPU-0. As th signal is fatal we call show_regs().
>        show_regs() will show information about last exception
>        which is *page fault* (instead of "trap" which is used for
>        signals and happened on CPU-0)
>
> So we will get message like this:
>     /home/waitpid02
>   potentially unexpected fatal signal 8.
>   Path: /home/waitpid02
>   CPU: 0 PID: 100 Comm: waitpid02 Not tainted 4.10.0-rc4 #2
>   task: 9f11c200 task.stack: 9f3ae000
>
>   [ECR   ]: 0x00050200 => Invalid Write @ 0x00000000 by insn @ 0x000123ec
>   [EFA   ]: 0x00000000
>   [BLINK ]: 0x123ea
>   [ERET  ]: 0x123ec
>     @off 0x123ec in [/home/waitpid02]
>     VMA: 0x00010000 to 0x00016000
>   [STAT32]: 0x80080882 : IE U
>   BTA: 0x000123ea  SP: 0x5ffd3db0  FP: 0x00000000
>   LPS: 0x20031684 LPE: 0x2003169a LPC: 0x00000006
>   [-----other-info-----]
>
> This message is confusing because it show information about page fault
> ( [ECR   ]: 0x00050200 => Invalid Write ) which is absolutely irrelevant
> to signal.

Agreed this is misleading. @Al, is there a way to identify process termination
from signals because it did something wrong vs. say unhandled signal. For former,
we want to dump additional info in show_regs() such as PC / Fault addres etc and
not in other scenario.

> This situation was reproduced with waitpid02 LTP test.
> _____________________________
>
> So remove printing information about exceptions from show_regs()
> to avoid confusing messages. Print information about exceptions
> only in required places instead of show_regs()
>
> Now we don't print information about exceptions if signal is simply
> send by another userspace app. So in case of waitpid02 we will print
> next message:
> _____________________________
>     ./waitpid02
>   potentially unexpected fatal signal 8.
>   [STAT32]: 0x80080082 : IE U
>   BTA: 0x20000fc4	 SP: 0x5ff8bd64	 FP: 0x00000000
>   LPS: 0x200524a0	LPE: 0x200524b6	LPC: 0x00000006
>   [-----other-info-----]
> _____________________________

The prints I'm seeing now, for a segv from NULL pointer access is even more
confusing !
There's a mixup of prints....

-------------------->8--------------------
Path: /segv
CPU: 0 PID: 70 Comm: segv Not tainted 4.17.0+ #412

[ECR   ]: 0x00050200 => Invalid Write @ 0x00000000 by insn @ 0x000103ac
[EFA   ]: 0x00000000
[BLINK ]: 0x20047bb0
[ERET  ]: 0x103ac
    @off 0x103ac in [/segv]
    VMA: 0x00010000 to 0x00012000

potentially unexpected fatal signal 11.
[STAT32]: 0x80080882 : IE U    
BTA: 0x00010398     SP: 0x5fc95e1c     FP: 0x5fc95e20
LPS: 0x20039ffc    LPE: 0x2003a000    LPC: 0x00000000
r00: 0x00000001    r01: 0x5fc95e94    r02: 0x00000000   
r03: 0x00000064    r04: 0x80808080    r05: 0x2f2f2f2f   
...
-------------------->8--------------------

and for the process killed by signal 8, we get below.

-------------------->8--------------------
[ARCLinux]# kill -8 71
[ARCLinux]# potentially unexpected fatal signal 8.
[STAT32]: 0x80080882 : IE U    
BTA: 0x20020660     SP: 0x5fbcddec     FP: 0x5fbcde1c
LPS: 0x20039ffc    LPE: 0x2003a000    LPC: 0x00000000
r00: 0xfffffdfc    r01: 0x5fbcddf0    r02: 0x00000000   
r03: 0x00000008    r04: 0x80808080    r05: 0x2f2f2f2f   
r06: 0x7a2f5f4a    r07: 0x00000000    r08: 0x00000065   
...


[1]+  Floating point exception   ./sleep
-------------------->8--------------------

I'm not sure whats the improvement here vs. the status quo.

For signal based kill, we don't want to dump the extra registers and if any, we
might still want to print the PC where the process was last seen in user mode to
give user of idea what it was doing at the time.

-Vineet
Eugeniy Paltsev July 3, 2018, 10:57 a.m. UTC | #2
On Mon, 2018-07-02 at 10:57 -0700, Vineet Gupta wrote:
> +CC Al
> 
> On 06/29/2018 12:39 PM, Eugeniy Paltsev wrote:
> > We process signals in the end of syscall/exception handler.
> > It the signal is fatal we print register's content using
> > show_regs function. show_regs() also prints information about
> > last exception happened.
> > 
> > In case of multicore system we can catch the situation when we
> > will print wrong information about exception. See the example:
> > ______________________________
> > CPU-0: started to handle page fault
> > CPU-1: sent signal to process, which is executed on CPU-0
> > CPU-0: ended page fault handle. Started to process signal before
> >        returnig to userspace. Process signal, which is send from
> >        CPU-0. As th signal is fatal we call show_regs().
> >        show_regs() will show information about last exception
> >        which is *page fault* (instead of "trap" which is used for
> >        signals and happened on CPU-0)
> > 
> > So we will get message like this:
> >     /home/waitpid02
> >   potentially unexpected fatal signal 8.
> >   Path: /home/waitpid02
> >   CPU: 0 PID: 100 Comm: waitpid02 Not tainted 4.10.0-rc4 #2
> >   task: 9f11c200 task.stack: 9f3ae000
> > 
> >   [ECR   ]: 0x00050200 => Invalid Write @ 0x00000000 by insn @ 0x000123ec
> >   [EFA   ]: 0x00000000
> >   [BLINK ]: 0x123ea
> >   [ERET  ]: 0x123ec
> >     @off 0x123ec in [/home/waitpid02]
> >     VMA: 0x00010000 to 0x00016000
> >   [STAT32]: 0x80080882 : IE U
> >   BTA: 0x000123ea  SP: 0x5ffd3db0  FP: 0x00000000
> >   LPS: 0x20031684 LPE: 0x2003169a LPC: 0x00000006
> >   [-----other-info-----]
> > 
> > This message is confusing because it show information about page fault
> > ( [ECR   ]: 0x00050200 => Invalid Write ) which is absolutely irrelevant
> > to signal.
> 
> Agreed this is misleading. @Al, is there a way to identify process termination
> from signals because it did something wrong vs. say unhandled signal. For former,
> we want to dump additional info in show_regs() such as PC / Fault addres etc and
> not in other scenario.
> 
> > This situation was reproduced with waitpid02 LTP test.
> > _____________________________
> > 
> > So remove printing information about exceptions from show_regs()
> > to avoid confusing messages. Print information about exceptions
> > only in required places instead of show_regs()
> > 
> > Now we don't print information about exceptions if signal is simply
> > send by another userspace app. So in case of waitpid02 we will print
> > next message:
> > _____________________________
> >     ./waitpid02
> >   potentially unexpected fatal signal 8.
> >   [STAT32]: 0x80080082 : IE U
> >   BTA: 0x20000fc4	 SP: 0x5ff8bd64	 FP: 0x00000000
> >   LPS: 0x200524a0	LPE: 0x200524b6	LPC: 0x00000006
> >   [-----other-info-----]
> > _____________________________
> 
> The prints I'm seeing now, for a segv from NULL pointer access is even more
> confusing !
> There's a mixup of prints....
> 
> -------------------->8--------------------
> Path: /segv
> CPU: 0 PID: 70 Comm: segv Not tainted 4.17.0+ #412
> 
> [ECR   ]: 0x00050200 => Invalid Write @ 0x00000000 by insn @ 0x000103ac
> [EFA   ]: 0x00000000
> [BLINK ]: 0x20047bb0
> [ERET  ]: 0x103ac
>     @off 0x103ac in [/segv]
>     VMA: 0x00010000 to 0x00012000
> 
> potentially unexpected fatal signal 11.
> [STAT32]: 0x80080882 : IE U    
> BTA: 0x00010398     SP: 0x5fc95e1c     FP: 0x5fc95e20
> LPS: 0x20039ffc    LPE: 0x2003a000    LPC: 0x00000000
> r00: 0x00000001    r01: 0x5fc95e94    r02: 0x00000000   
> r03: 0x00000064    r04: 0x80808080    r05: 0x2f2f2f2f   
> ...
> -------------------->8--------------------
> 
> and for the process killed by signal 8, we get below.
> 
> -------------------->8--------------------
> [ARCLinux]# kill -8 71
> [ARCLinux]# potentially unexpected fatal signal 8.
> [STAT32]: 0x80080882 : IE U    
> BTA: 0x20020660     SP: 0x5fbcddec     FP: 0x5fbcde1c
> LPS: 0x20039ffc    LPE: 0x2003a000    LPC: 0x00000000
> r00: 0xfffffdfc    r01: 0x5fbcddf0    r02: 0x00000000   
> r03: 0x00000008    r04: 0x80808080    r05: 0x2f2f2f2f   
> r06: 0x7a2f5f4a    r07: 0x00000000    r08: 0x00000065   
> ...
> 
> 
> [1]+  Floating point exception   ./sleep
> -------------------->8--------------------
> I'm not sure whats the improvement here vs. the status quo.

Why do you think this is confusing?
The main change is that we don't print exception registers for signal based kill.

Moreover, new behavior is more like x86-64 behavior. See the example:

NULL pointer access on x86-64:
-------------------->8--------------------
hell[3925]: segfault at 0 ip 00007f621eb56c41 sp 00007ffcaa5951b8 error 4 in libc-2.26.so[7f621ea01000+1ac000]
potentially unexpected fatal signal 11.
CPU: 1 PID: 3925 Comm: hell Not tainted 4.16.13-200.fc27.x86_64 #1
Hardware name: Dell Inc. Latitude 7480/00F6D3, BIOS 1.9.3 03/09/2018
RIP: 0033:0x7f621eb56c41
RSP: 002b:00007ffcaa5951b8 EFLAGS: 00010283
RAX: 000000000000000e RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00000000016eb260 RDI: 0000000000000000
RBP: 00007ffcaa5951f0 R08: 0000000000000003 R09: 0000000000000077
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffcaa5952d0 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f621efbf4c0(0000) GS:ffff8f6abe480000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000004556a4005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
-------------------->8--------------------

Process killed by signal 11 on x86-64:
-------------------->8--------------------
potentially unexpected fatal signal 11.
CPU: 0 PID: 3998 Comm: sh Not tainted 4.16.13-200.fc27.x86_64 #1
Hardware name: Dell Inc. Latitude 7480/00F6D3, BIOS 1.9.3 03/09/2018
RIP: 0033:0x7fc0b13d744e
RSP: 002b:00007fffba851e30 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: 0000000000000003 RBX: 000055b3e21256a0 RCX: 00007fc0b13d744e
RDX: 0000000000000241 RSI: 000055b3e2125540 RDI: ffffffffffffff9c
RBP: 00007fffba851f40 R08: 0000000000000020 R09: 0000000000000000
R10: 00000000000001b6 R11: 0000000000000246 R12: 000000000000000a
R13: 0000000000000003 R14: 0000000000000001 R15: 000055b3e2125540
FS:  00007fc0b1cdb740(0000) GS:ffff8f6abe400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff33d403000 CR3: 00000003dc7ea004 CR4: 00000000003606f0
-------------------->8--------------------


NOTE: in both cases (ARC and x86-64) "/proc/sys/kernel/print-fatal-signals" was
set to "1"
Vineet Gupta July 5, 2018, 9:26 p.m. UTC | #3
On 07/03/2018 03:57 AM, Eugeniy Paltsev wrote:
> On Mon, 2018-07-02 at 10:57 -0700, Vineet Gupta wrote:
>> +CC Al
>>
>> On 06/29/2018 12:39 PM, Eugeniy Paltsev wrote:
>>> We process signals in the end of syscall/exception handler.
>>> It the signal is fatal we print register's content using
>>> show_regs function. show_regs() also prints information about
>>> last exception happened.
>>>
>>> In case of multicore system we can catch the situation when we
>>> will print wrong information about exception. See the example:
>>> ______________________________
>>> CPU-0: started to handle page fault
>>> CPU-1: sent signal to process, which is executed on CPU-0
>>> CPU-0: ended page fault handle. Started to process signal before
>>>        returnig to userspace. Process signal, which is send from
>>>        CPU-0. As th signal is fatal we call show_regs().
>>>        show_regs() will show information about last exception
>>>        which is *page fault* (instead of "trap" which is used for
>>>        signals and happened on CPU-0)
>>>
>>> So we will get message like this:
>>>     /home/waitpid02
>>>   potentially unexpected fatal signal 8.
>>>   Path: /home/waitpid02
>>>   CPU: 0 PID: 100 Comm: waitpid02 Not tainted 4.10.0-rc4 #2
>>>   task: 9f11c200 task.stack: 9f3ae000
>>>
>>>   [ECR   ]: 0x00050200 => Invalid Write @ 0x00000000 by insn @ 0x000123ec
>>>   [EFA   ]: 0x00000000
>>>   [BLINK ]: 0x123ea
>>>   [ERET  ]: 0x123ec
>>>     @off 0x123ec in [/home/waitpid02]
>>>     VMA: 0x00010000 to 0x00016000
>>>   [STAT32]: 0x80080882 : IE U
>>>   BTA: 0x000123ea  SP: 0x5ffd3db0  FP: 0x00000000
>>>   LPS: 0x20031684 LPE: 0x2003169a LPC: 0x00000006
>>>   [-----other-info-----]
>>>
>>> This message is confusing because it show information about page fault
>>> ( [ECR   ]: 0x00050200 => Invalid Write ) which is absolutely irrelevant
>>> to signal.
>> Agreed this is misleading. @Al, is there a way to identify process termination
>> from signals because it did something wrong vs. say unhandled signal. For former,
>> we want to dump additional info in show_regs() such as PC / Fault addres etc and
>> not in other scenario.
>>
>>> This situation was reproduced with waitpid02 LTP test.
>>> _____________________________
>>>
>>> So remove printing information about exceptions from show_regs()
>>> to avoid confusing messages. Print information about exceptions
>>> only in required places instead of show_regs()
>>>
>>> Now we don't print information about exceptions if signal is simply
>>> send by another userspace app. So in case of waitpid02 we will print
>>> next message:
>>> _____________________________
>>>     ./waitpid02
>>>   potentially unexpected fatal signal 8.
>>>   [STAT32]: 0x80080082 : IE U
>>>   BTA: 0x20000fc4	 SP: 0x5ff8bd64	 FP: 0x00000000
>>>   LPS: 0x200524a0	LPE: 0x200524b6	LPC: 0x00000006
>>>   [-----other-info-----]
>>> _____________________________
>> The prints I'm seeing now, for a segv from NULL pointer access is even more
>> confusing !
>> There's a mixup of prints....
>>
>> -------------------->8--------------------
>> Path: /segv
>> CPU: 0 PID: 70 Comm: segv Not tainted 4.17.0+ #412
>>
>> [ECR   ]: 0x00050200 => Invalid Write @ 0x00000000 by insn @ 0x000103ac
>> [EFA   ]: 0x00000000
>> [BLINK ]: 0x20047bb0
>> [ERET  ]: 0x103ac
>>     @off 0x103ac in [/segv]
>>     VMA: 0x00010000 to 0x00012000
>>
>> potentially unexpected fatal signal 11.
>> [STAT32]: 0x80080882 : IE U    
>> BTA: 0x00010398     SP: 0x5fc95e1c     FP: 0x5fc95e20
>> LPS: 0x20039ffc    LPE: 0x2003a000    LPC: 0x00000000
>> r00: 0x00000001    r01: 0x5fc95e94    r02: 0x00000000   
>> r03: 0x00000064    r04: 0x80808080    r05: 0x2f2f2f2f   
>> ...
>> -------------------->8--------------------
>>
>> and for the process killed by signal 8, we get below.
>>
>> -------------------->8--------------------
>> [ARCLinux]# kill -8 71
>> [ARCLinux]# potentially unexpected fatal signal 8.
>> [STAT32]: 0x80080882 : IE U    
>> BTA: 0x20020660     SP: 0x5fbcddec     FP: 0x5fbcde1c
>> LPS: 0x20039ffc    LPE: 0x2003a000    LPC: 0x00000000
>> r00: 0xfffffdfc    r01: 0x5fbcddf0    r02: 0x00000000   
>> r03: 0x00000008    r04: 0x80808080    r05: 0x2f2f2f2f   
>> r06: 0x7a2f5f4a    r07: 0x00000000    r08: 0x00000065   
>> ...
>>
>>
>> [1]+  Floating point exception   ./sleep
>> -------------------->8--------------------
>> I'm not sure whats the improvement here vs. the status quo.
> Why do you think this is confusing?
> The main change is that we don't print exception registers for signal based kill.

For the pure signal based termination, what is the point of printing the rest of
registers. If you say "it is to give a feel of what the userspace was doing at the
time...." then we are lacking the most crucial piece which is the PC at the time
(i.e. ERET placeholder).

> Moreover, new behavior is more like x86-64 behavior. See the example:

For a null pointer based termination, we now have a ugly looking "potentially
unexpected..." print in the middle of reg file dump, that is not what x86 does.
Anyways, that print It is undesirable, but not a deal breaker.  The issue is point
above, can we remedy it.

BTW in your original patch, for a null pointer access, the printing code now
allocates 2 pages, in each of show_xxx routines and the one in show_regs() is now
pointless, as it is not used there at all there - so please fix that as well.

-Vineet
Eugeniy Paltsev July 6, 2018, 1:05 p.m. UTC | #4
Hi Vineet,

On Thu, 2018-07-05 at 14:26 -0700, Vineet Gupta wrote:
> On 07/03/2018 03:57 AM, Eugeniy Paltsev wrote:
> > On Mon, 2018-07-02 at 10:57 -0700, Vineet Gupta wrote:
> > > +CC Al
> > > 
> > > On 06/29/2018 12:39 PM, Eugeniy Paltsev wrote:
> > > > We process signals in the end of syscall/exception handler.
> > > > It the signal is fatal we print register's content using
> > > > show_regs function. show_regs() also prints information about
> > > > last exception happened.
> > > > 
> > > > In case of multicore system we can catch the situation when we
> > > > will print wrong information about exception. See the example:
> > > > ______________________________
> > > > CPU-0: started to handle page fault
> > > > CPU-1: sent signal to process, which is executed on CPU-0
> > > > CPU-0: ended page fault handle. Started to process signal before
> > > >        returnig to userspace. Process signal, which is send from
> > > >        CPU-0. As th signal is fatal we call show_regs().
> > > >        show_regs() will show information about last exception
> > > >        which is *page fault* (instead of "trap" which is used for
> > > >        signals and happened on CPU-0)
> > > > 
> > > > So we will get message like this:
> > > >     /home/waitpid02
> > > >   potentially unexpected fatal signal 8.
> > > >   Path: /home/waitpid02
> > > >   CPU: 0 PID: 100 Comm: waitpid02 Not tainted 4.10.0-rc4 #2
> > > >   task: 9f11c200 task.stack: 9f3ae000
> > > > 
> > > >   [ECR   ]: 0x00050200 => Invalid Write @ 0x00000000 by insn @ 0x000123ec
> > > >   [EFA   ]: 0x00000000
> > > >   [BLINK ]: 0x123ea
> > > >   [ERET  ]: 0x123ec
> > > >     @off 0x123ec in [/home/waitpid02]
> > > >     VMA: 0x00010000 to 0x00016000
> > > >   [STAT32]: 0x80080882 : IE U
> > > >   BTA: 0x000123ea  SP: 0x5ffd3db0  FP: 0x00000000
> > > >   LPS: 0x20031684 LPE: 0x2003169a LPC: 0x00000006
> > > >   [-----other-info-----]
> > > > 
> > > > This message is confusing because it show information about page fault
> > > > ( [ECR   ]: 0x00050200 => Invalid Write ) which is absolutely irrelevant
> > > > to signal.
> > > 
> > > Agreed this is misleading. @Al, is there a way to identify process termination
> > > from signals because it did something wrong vs. say unhandled signal. For former,
> > > we want to dump additional info in show_regs() such as PC / Fault addres etc and
> > > not in other scenario.
> > > 
> > > > This situation was reproduced with waitpid02 LTP test.
> > > > _____________________________
> > > > 
> > > > So remove printing information about exceptions from show_regs()
> > > > to avoid confusing messages. Print information about exceptions
> > > > only in required places instead of show_regs()
> > > > 
> > > > Now we don't print information about exceptions if signal is simply
> > > > send by another userspace app. So in case of waitpid02 we will print
> > > > next message:
> > > > _____________________________
> > > >     ./waitpid02
> > > >   potentially unexpected fatal signal 8.
> > > >   [STAT32]: 0x80080082 : IE U
> > > >   BTA: 0x20000fc4	 SP: 0x5ff8bd64	 FP: 0x00000000
> > > >   LPS: 0x200524a0	LPE: 0x200524b6	LPC: 0x00000006
> > > >   [-----other-info-----]
> > > > _____________________________
> > > 
> > > The prints I'm seeing now, for a segv from NULL pointer access is even more
> > > confusing !
> > > There's a mixup of prints....
> > > 
> > > -------------------->8--------------------
> > > Path: /segv
> > > CPU: 0 PID: 70 Comm: segv Not tainted 4.17.0+ #412
> > > 
> > > [ECR   ]: 0x00050200 => Invalid Write @ 0x00000000 by insn @ 0x000103ac
> > > [EFA   ]: 0x00000000
> > > [BLINK ]: 0x20047bb0
> > > [ERET  ]: 0x103ac
> > >     @off 0x103ac in [/segv]
> > >     VMA: 0x00010000 to 0x00012000
> > > 
> > > potentially unexpected fatal signal 11.
> > > [STAT32]: 0x80080882 : IE U    
> > > BTA: 0x00010398     SP: 0x5fc95e1c     FP: 0x5fc95e20
> > > LPS: 0x20039ffc    LPE: 0x2003a000    LPC: 0x00000000
> > > r00: 0x00000001    r01: 0x5fc95e94    r02: 0x00000000   
> > > r03: 0x00000064    r04: 0x80808080    r05: 0x2f2f2f2f   
> > > ...
> > > -------------------->8--------------------
> > > 
> > > and for the process killed by signal 8, we get below.
> > > 
> > > -------------------->8--------------------
> > > [ARCLinux]# kill -8 71
> > > [ARCLinux]# potentially unexpected fatal signal 8.
> > > [STAT32]: 0x80080882 : IE U    
> > > BTA: 0x20020660     SP: 0x5fbcddec     FP: 0x5fbcde1c
> > > LPS: 0x20039ffc    LPE: 0x2003a000    LPC: 0x00000000
> > > r00: 0xfffffdfc    r01: 0x5fbcddf0    r02: 0x00000000   
> > > r03: 0x00000008    r04: 0x80808080    r05: 0x2f2f2f2f   
> > > r06: 0x7a2f5f4a    r07: 0x00000000    r08: 0x00000065   
> > > ...
> > > 
> > > 
> > > [1]+  Floating point exception   ./sleep
> > > -------------------->8--------------------
> > > I'm not sure whats the improvement here vs. the status quo.
> > 
> > Why do you think this is confusing?
> > The main change is that we don't print exception registers for signal based kill.
> 
> For the pure signal based termination, what is the point of printing the rest of
> registers. If you say "it is to give a feel of what the userspace was doing at the
> time...." then we are lacking the most crucial piece which is the PC at the time
> (i.e. ERET placeholder).

We can disable printing the rest of registers by setting "print-fatal-signals" to 0
by default. In that case we will print rest of registers only for exceptions
happened in kernel space (when die() function is called)

So in case of setting "print-fatal-signals" to 0 we will get following messages:

NULL pointer access from user space:
---------->8-------------
# ./arc_hell &
# Path: /root/arc_hell
CPU: 0 PID: 79 Comm: arc_hell Not tainted 4.17.0+ #10

[ECR   ]: 0x00050100 => Invalid Read @ 0x00000000 by insn @ 0x2003a35c
[EFA   ]: 0x00000000
[BLINK ]: 0x20039ef8
[ERET  ]: 0x2003a35c
    @off 0x2e35c in [/lib/libuClibc-1.0.18.so]
    VMA: 0x2000c000 to 0x20072000

[1]+  Segmentation fault         ./arc_hell
---------->8-------------

Process killed by signal 11 via 'kill -s 11':
---------->8-------------
# sleep 1000 &
[1]+  Segmentation fault         sleep 1000
---------->8-------------



Probably the best variant is to print only information about unexpected exception
from exception handler and leave printing ECR/EFA/BLINK/ERET in show_regs().

So in case of exception based signal we will print from exception handler
something like that:
---------->8-------------
Unexpected exception: Invalid Read @ 0x00000000 by insn @ 0x2003a35c
---------->8-------------

Other information will be print in show_regs (only if "print-fatal-signals"
is enabled):
---------->8-------------
potentially unexpected fatal signal 11.
Path: /root/arc_hell
CPU: 0 PID: 79 Comm: arc_hell Not tainted 4.17.0+ #10
[ECR   ]: 0x00050100
[EFA   ]: 0x00000000
[BLINK ]: 0x20039ef8
[ERET  ]: 0x2003a35c
    @off 0x2e35c in [/lib/libuClibc-1.0.18.so]
    VMA: 0x2000c000 to 0x20072000
[STAT32]: 0x80080882 : IE U
BTA: 0x00010398     SP: 0x5fc95e1c     FP: 0x5fc95e20
LPS: 0x20039ffc    LPE: 0x2003a000    LPC: 0x00000000
r00: 0x00000001    r01: 0x5fc95e94    r02: 0x00000000
r03: 0x00000064    r04: 0x80808080    r05: 0x2f2f2f2f
...
---------->8-------------


> > Moreover, new behavior is more like x86-64 behavior. See the example:
> 
> For a null pointer based termination, we now have a ugly looking "potentially
> unexpected..." print in the middle of reg file dump, that is not what x86 does.
> Anyways, that print It is undesirable, but not a deal breaker.  The issue is point
> above, can we remedy it.
> 
> BTW in your original patch, for a null pointer access, the printing code now
> allocates 2 pages, in each of show_xxx routines and the one in show_regs() is now
> pointless, as it is not used there at all there - so please fix that as well.

Thanks for pointing, will fix in v2 patch.

> -Vineet
>
diff mbox series

Patch

diff --git a/arch/arc/include/asm/bug.h b/arch/arc/include/asm/bug.h
index 21ec82466d62..b9fd60f7d36b 100644
--- a/arch/arc/include/asm/bug.h
+++ b/arch/arc/include/asm/bug.h
@@ -16,6 +16,7 @@ 
 struct task_struct;
 
 void show_regs(struct pt_regs *regs);
+void show_exception_regs(struct pt_regs *regs);
 void show_stacktrace(struct task_struct *tsk, struct pt_regs *regs);
 void show_kernel_fault_diag(const char *str, struct pt_regs *regs,
 			    unsigned long address);
diff --git a/arch/arc/kernel/traps.c b/arch/arc/kernel/traps.c
index b123558bf0bb..fe41f0d9488f 100644
--- a/arch/arc/kernel/traps.c
+++ b/arch/arc/kernel/traps.c
@@ -50,6 +50,7 @@  unhandled_exception(const char *str, struct pt_regs *regs, siginfo_t *info)
 		tsk->thread.fault_address = (__force unsigned int)info->si_addr;
 
 		force_sig_info(info->si_signo, info, tsk);
+		show_exception_regs(regs);
 
 	} else {
 		/* If not due to copy_(to|from)_user, we are doomed */
diff --git a/arch/arc/kernel/troubleshoot.c b/arch/arc/kernel/troubleshoot.c
index 783b20354f8b..313aea25e638 100644
--- a/arch/arc/kernel/troubleshoot.c
+++ b/arch/arc/kernel/troubleshoot.c
@@ -170,14 +170,9 @@  static void show_ecr_verbose(struct pt_regs *regs)
 	}
 }
 
-/************************************************************************
- *  API called by rest of kernel
- ***********************************************************************/
-
-void show_regs(struct pt_regs *regs)
+void show_exception_regs(struct pt_regs *regs)
 {
 	struct task_struct *tsk = current;
-	struct callee_regs *cregs;
 	char *buf;
 
 	buf = (char *)__get_free_page(GFP_KERNEL);
@@ -190,12 +185,28 @@  void show_regs(struct pt_regs *regs)
 	show_ecr_verbose(regs);
 
 	pr_info("[EFA   ]: 0x%08lx\n[BLINK ]: %pS\n[ERET  ]: %pS\n",
-		current->thread.fault_address,
+		tsk->thread.fault_address,
 		(void *)regs->blink, (void *)regs->ret);
 
 	if (user_mode(regs))
 		show_faulting_vma(regs->ret, buf); /* faulting code, not data */
 
+	free_page((unsigned long)buf);
+}
+
+/************************************************************************
+ *  API called by rest of kernel
+ ***********************************************************************/
+
+void show_regs(struct pt_regs *regs)
+{
+	struct callee_regs *cregs;
+	char *buf;
+
+	buf = (char *)__get_free_page(GFP_KERNEL);
+	if (!buf)
+		return;
+
 	pr_info("[STAT32]: 0x%08lx", regs->status32);
 
 #define STS_BIT(r, bit)	r->status32 & STATUS_##bit##_MASK ? #bit" " : ""
@@ -238,6 +249,8 @@  void show_kernel_fault_diag(const char *str, struct pt_regs *regs,
 	/* Show fault description */
 	pr_info("\n%s\n", str);
 
+	show_exception_regs(regs);
+
 	/* Caller and Callee regs */
 	show_regs(regs);
 
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index a0b7bd6d030d..7e0e2591bb27 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -200,6 +200,7 @@  void do_page_fault(unsigned long address, struct pt_regs *regs)
 		/* info.si_code has been set above */
 		info.si_addr = (void __user *)address;
 		force_sig_info(SIGSEGV, &info, tsk);
+		show_exception_regs(regs);
 		return;
 	}
 
@@ -239,4 +240,5 @@  void do_page_fault(unsigned long address, struct pt_regs *regs)
 	info.si_code = BUS_ADRERR;
 	info.si_addr = (void __user *)address;
 	force_sig_info(SIGBUS, &info, tsk);
+	show_exception_regs(regs);
 }