Message ID: alpine.LFD.1.10.0810021142370.5549@apollo.tec.linutronix.de
State: Not Applicable, archived
On Thu, 2 Oct 2008 12:42:44 +0200 (CEST) Thomas Gleixner <tglx@linutronix.de> wrote:

> echo sym-verbose > /debug/tracing/iter_ctrl

Ok, this one I changed to:

echo verbose > /debug/tracing/iter_ctrl

;) since there's no sym-verbose.

> Find below a patch which disables the tracer when the listen overflow
> is detected. It could occasionally be hit in a legitimate case as well,
> but we did not come up with any better idea.

Ok, I applied the patch, but it returns this error:

  CC      kernel/trace/ftrace.o
kernel/trace/ftrace.c: In function 'stop_event_tracer_now':
kernel/trace/ftrace.c:73: error: 'tracer_enabled' undeclared (first use in this function)
kernel/trace/ftrace.c:73: error: (Each undeclared identifier is reported only once
kernel/trace/ftrace.c:73: error: for each function it appears in.)
make[2]: *** [kernel/trace/ftrace.o] Error 1
make[1]: *** [kernel/trace] Error 2
make: *** [kernel] Error 2

> If it triggers, please cat /debug/tracing/trace to a file. Can you
> bzip2 and upload it to some place. The trace does not contain any
> sensitive private information.

Ok. Thanks.
On Thu, 2 Oct 2008 14:08:14 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:
> grmbl. tracer_enabled should have been ftrace_enabled of course.
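For what it's worth, the intent of the helper is just a one-shot kill switch plus a log marker. Below is a user-space sketch of that idea, not the kernel code itself — here `ftrace_enabled` is an ordinary global standing in for the kernel's flag, and `tracer_is_enabled()` is a made-up accessor for illustration:

```c
#include <stdio.h>

/* Stand-in for the kernel's ftrace_enabled flag (the name the patch
 * should have used instead of tracer_enabled). */
static int ftrace_enabled = 1;

/* One-shot kill switch: disable tracing and leave a marker in the
 * log, so the ring buffer keeps the events leading up to the trigger. */
void stop_event_tracer_now(void)
{
    ftrace_enabled = 0;
    printf("Tracer stopped\n");
}

/* Hypothetical accessor, only for this sketch. */
int tracer_is_enabled(void)
{
    return ftrace_enabled;
}
```

Calling `stop_event_tracer_now()` a second time is harmless; the flag simply stays zero, which is why wiring it into a hot path like the listen-overflow label is safe.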
Ok Thomas. I got a stall on my desktop.
It happens when I try to use sudo and it just stays stalled (I
think trying to access /dev/log or something like that).
I got the trace here:
http://www.abusar.org/ftrace/trace.txt.bz2
after the stall.
Could you please check if there's something wrong?
I'd prefer the stall to happen on the server, but it hasn't
happened there yet. Anyway, it's the same problem.
Ps: this time I couldn't solve it with nmap (sometimes nmap
solves it, sometimes it doesn't).
Ps2: in this stall, the tracing wasn't disabled automatically
by your patch, because it happened in a different situation (when I
tried to use sudo). But I know it's the same problem, because the
"solution" I use is always the same (nmap), and until the stall
clears I can't connect to any port on my machine.
Ps3: if this trace is useless, wait until I get something on
the server.
On Thu, 2 Oct 2008 21:16:32 +0200 (CEST) Thomas Gleixner <tglx@linutronix.de> wrote:

> Hmm, not sure. sudo is a different beast. All I can see in the trace
> is that reiserfs is turning itself in circles (which is nothing new),
> but there is nothing which might tell us about a stall. At the point
> where you stopped the trace the relevant info was already replaced by
> new stuff. It's a circular buffer which gets overwritten.

Ok, no problem. I'll keep waiting for a stall on the server side.
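The overwriting Thomas describes is just how a circular trace buffer behaves: writes always succeed, and once the buffer is full each new entry silently replaces the oldest one — which is why stopping the trace too late loses the interesting events. A minimal sketch (sizes and names are illustrative, not ftrace's actual implementation):

```c
#include <string.h>

#define RING_SIZE 4  /* ftrace's buffer is far larger; tiny here to show wrap-around */

struct ring {
    char entries[RING_SIZE][32];
    unsigned long head;   /* total number of entries ever written */
};

/* Writing never fails; when the buffer is full, the slot holding the
 * oldest entry is overwritten. */
void ring_write(struct ring *r, const char *msg)
{
    strncpy(r->entries[r->head % RING_SIZE], msg, 31);
    r->entries[r->head % RING_SIZE][31] = '\0';
    r->head++;
}

/* Oldest entry still present in the buffer. */
const char *ring_oldest(const struct ring *r)
{
    unsigned long idx = (r->head >= RING_SIZE) ? r->head % RING_SIZE : 0;
    return r->entries[idx];
}
```

After six writes into a four-slot ring, the first two entries are gone for good — the same reason the relevant info around the stall had already been replaced by the time the trace was read.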
On Sun, 5 Oct 2008 20:09:22 +0200 (CEST) Thomas Gleixner <tglx@linutronix.de> wrote:

> did you get the content of /debug/tracing/trace nevertheless ?

Yes! I got several... maybe one is valid...

> The cat /debug/tracing/tracing_enabled == 1 might be just a user space
> interface thingy.

Ok, here they are (I hope one of them is correct):

http://www.abusar.org/ftrace/

These I got after I saw the kernel message "Tracer stopped":

http://www.abusar.org/ftrace/trace.txt.bz2
http://www.abusar.org/ftrace/trace2.txt.bz2
http://www.abusar.org/ftrace/trace3.txt.bz2
http://www.abusar.org/ftrace/trace4.txt.bz2
http://www.abusar.org/ftrace/trace5.txt.bz2

These were taken during a stall and after an nmap:

http://www.abusar.org/ftrace/trace-stop.txt.bz2
http://www.abusar.org/ftrace/trace-stop2.txt.bz2

If none of these is valid, I will use monit to check for "Tracer stopped" and force the tracing to be disabled.
On Sun, 5 Oct 2008 20:09:22 +0200 (CEST) Thomas Gleixner <tglx@linutronix.de> wrote:

> did you get the content of /debug/tracing/trace nevertheless ?
>
> The cat /debug/tracing/tracing_enabled == 1 might be just a user space
> interface thingy.

Hi Thomas, I added this one too (from my desktop):

http://www.abusar.org/ftrace/trace-desktop.txt.bz2

I don't know if it will help, but if you could look at the entries generated by sudo and fetchnews, they were stalled. sudo has a lot of:

 sudo-1081 [01] 34460.250986: _spin_lock+0x4/0x20 <ffffffff80453754> <-sys_close+0x3b/0x100 <ffffffff8029e06b>
 sudo-1081 [01] 34460.250986: sys_close+0x14/0x100 <ffffffff8029e044> <-system_call_fastpath+0x16/0x1b <ffffffff8020bd1b>
 sudo-1081 [01] 34460.250987: _spin_lock+0x4/0x20 <ffffffff80453754> <-sys_close+0x3b/0x100 <ffffffff8029e06b>
 sudo-1081 [01] 34460.250987: sys_close+0x14/0x100 <ffffffff8029e044> <-system_call_fastpath+0x16/0x1b <ffffffff8020bd1b>
 sudo-1081 [01] 34460.250987: _spin_lock+0x4/0x20 <ffffffff80453754> <-sys_close+0x3b/0x100 <ffffffff8029e06b>
 sudo-1081 [01] 34460.250987: sys_close+0x14/0x100 <ffffffff8029e044> <-system_call_fastpath+0x16/0x1b <ffffffff8020bd1b>
 sudo-1081 [01] 34460.250987: _spin_lock+0x4/0x20 <ffffffff80453754> <-sys_close+0x3b/0x100 <ffffffff8029e06b>
 sudo-1081 [01] 34460.250988: sys_close+0x14/0x100 <ffffffff8029e044> <-system_call_fastpath+0x16/0x1b <ffffffff8020bd1b>
 sudo-1081 [01] 34460.250988: _spin_lock+0x4/0x20 <ffffffff80453754> <-sys_close+0x3b/0x100 <ffffffff8029e06b>

I don't know if it's normal... Thank you.
On Sun, 5 Oct 2008 20:09:22 +0200 (CEST) Thomas Gleixner <tglx@linutronix.de> wrote:

> The cat /debug/tracing/tracing_enabled == 1 might be just a user space
> interface thingy.

Thanks, Thomas, Ilpo etc. I got a trace exactly after the patch provided by Thomas detected the stall:

http://www.abusar.org/ftrace/trace-final.txt.bz2

Thomas, since you didn't reply to the previous messages, I assume the previous traces were invalid. But could you please check this one? It would be wise to search for "monit" in the trace, since monit ran the script "stop-tracing.sh", which stops the tracing and dumps the trace output to a file. So what probably interests you is what comes *before* "monit". "monit" appears for the first time at line 16921.

To make it clearer, I configured monit with the following:

check file syslog with path /var/adm/messages
   if match "Tracer stopped" then exec "/home/fraga/stop-tracing.sh"

And the stop-tracing.sh script is just this:

#!/bin/sh
/usr/local/bin/echo 0 > /debug/tracing/tracing_enabled
/usr/local/bin/cat /debug/tracing/trace > /home/fraga/trace.txt

So I expect this trace is right. If monit was too slow in detecting the "Tracer stopped" log entry, I'll use a Perl script, which is faster.

Thanks, I await your reply.

Ps: flood_mail.pl is just a Perl script I use to prevent people from flooding my smtp server with mail.

Ps2: I just don't know whether it would be better to stop the tracing *before* dumping the trace, or to dump the trace without stopping it... I noticed that even the "echo" I use to stop the tracing generated a lot of useless tracing information.
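The monit rule above is essentially "scan the log for a marker string, then run a script". A small C sketch of the same check (the marker and file come from the setup above; the helper name `log_contains` is made up for illustration):

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if any line of the stream contains the marker, else 0.
 * This is the same kind of substring match monit performs against
 * /var/adm/messages before exec'ing stop-tracing.sh. */
int log_contains(FILE *fp, const char *marker)
{
    char line[1024];

    while (fgets(line, sizeof line, fp))
        if (strstr(line, marker))
            return 1;
    return 0;
}
```

A dedicated watcher like this (or the Perl script mentioned above) can poll the log far more often than monit's default cycle, shrinking the window in which the circular buffer overwrites the events of interest.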
On Wed, 8 Oct 2008 21:56:00 +0200 (CEST) Thomas Gleixner <tglx@linutronix.de> wrote:

> Will do. Sorry was distracted. Steven Rostedt posted a patch to stop
> the tracer really from the kernel, I need to verify that it works on
> .27. Will do tomorrow when my brain is more awake than now.

Hi Thomas, did you have time to verify that patch? If so, I'd like to test it. Thanks.
On Mon, 13 Oct 2008 16:35:08 +0300 (EEST) "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

Just reporting another stall event... Who knows, it may help in some way...

> ...might turn out related. It has somewhat similar problem though
> different syscall. Not enough info on what kernel version etc.
> there though.

Well, I may be wrong, but I think those stalls are related to USB (or USB helped to "solve" the stall temporarily)... I explain below (I have a monitor with a USB hub connected):

Oct 17 19:35:55 tux sudo: pam_unix(sudo:session): session opened for user root by fraga(uid=0)
Oct 17 19:35:55 tux sudo: pam_unix(sudo:session): session closed for user root

^---------- sudo opened the session at 19:35

Oct 17 19:38:16 tux sudo: fraga : TTY=pts/2 ; PWD=/usr/local/lib ; USER=root ; COMMAND=/usr/local/bin/rm libdvdread.a libdvdread.la libdvdread.so libdvdread.so.3 libdvdread.so.3.2.1 libdvdread.so.4 libdvdread.so.4.1.2

^---------- I tried to run this command with sudo. And it stalled. Notice that it was logged only at 19:38.

Oct 17 19:38:16 tux sudo: pam_unix(sudo:session): session opened for user root by fraga(uid=0)
Oct 17 19:38:16 tux sudo: pam_unix(sudo:session): session closed for user root
Oct 17 19:40:20 tux vmunix: ci000/000:21ub/-/-./-../-..:./nu/nu7

^----------- Here is the stall. You can see that the log is corrupted, every other line.

Oct 17 19:40:20 tux vmunix: 6ipt S I 11 os Lgtc pia S os]o s-000:21112

^----------- Every other line. Here is when I turned off my monitor with USB... Unless it was an extreme coincidence, I'm sure this event of turning it off and...

Oct 17 19:40:25 tux vmunix: sd 9:0:0:0: [sdb] Attached SCSI removable disk
Oct 17 19:40:25 tux scsi.agent[5498]: disk at /devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host9/target9:0:0/9:0:0:0

^---------- ...turning it on again helped to "solve" the stall. The stall ended when I turned off the monitor.
Oct 17 19:40:28 tux sudo: fraga : TTY=pts/2 ; PWD=/usr/local/lib ; USER=root ; COMMAND=/usr/local/bin/ls
Oct 17 19:40:28 tux sudo: pam_unix(sudo:session): session opened for user root by fraga(uid=0)
Oct 17 19:40:28 tux sudo: pam_unix(sudo:session): session closed for user root
Oct 17 19:40:30 tux scsi.agent[5310]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host9/type does not exist
Oct 17 19:40:35 tux scsi.agent[5493]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host9/target9:0:0/type does not exist
Oct 17 20:12:37 tux vmunix: usb-storage: device found at 14
Oct 17 20:12:37 tux vmunix: usb-storage: waiting for device to settle before scanning
Oct 17 20:12:42 tux vmunix: scsi 10:0:0:0: Direct-Access Generic ls SC 53 Q NI
Oct 17 20:12:42 tux vmunix: k
Oct 17 20:12:42 tux scsi.agent[7439]: disk at /devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host10/target10:0:0/10:0:0:0
Oct 17 20:12:47 tux vmunix: DROP INPUT: IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:15:e9:e2:00:db:08:00 SRC=192.168.0.1 DST=255.255.255.255 LEN=334 TOS=0x00 PREC=0x00 TTL=64 ID=6668 PROTO=UDP SPT=67 DPT=68 LEN=314
Oct 17 20:12:47 tux scsi.agent[7252]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host10/type does not exist
Oct 17 20:12:53 tux scsi.agent[7435]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host10/target10:0:0/type does not exist

***

At the server I have a USB device connected too...

Or in the end it's just a symptom of the problem... I don't know. The interesting thing is that as soon as I turned off the monitor (at the exact moment), the speaker beeped, because I had used TAB to auto-complete my command, and I knew that sudo wasn't stalled anymore.

Maybe what the kernel does when we remove a USB hub could give us a hint about the stalling issue...
On Fri, 17 Oct 2008, Dâniel Fraga wrote:
> On Mon, 13 Oct 2008 16:35:08 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
>
> Just reporting another stall event... Who knows, it may help in some way...
>
> > ...might turn out related. It has somewhat similar problem though
> > different syscall. Not enough info on what kernel version etc.
> > there though.
>
> Well, I may be wrong, but I think those stalls are related to USB (or USB helped to
> "solve" the stall temporarily)... I explain below (I have a monitor with a USB hub connected):
>
> Oct 17 19:35:55 tux sudo: pam_unix(sudo:session): session opened for user root by fraga(uid=0)
> Oct 17 19:35:55 tux sudo: pam_unix(sudo:session): session closed for user root
>
> ^---------- sudo opened the session at 19:35
>
> Oct 17 19:38:16 tux sudo: fraga : TTY=pts/2 ; PWD=/usr/local/lib ; USER=root ; COMMAND=/usr/local/bin/rm libdvdread.a libdvdread.la libdvdread.so libdvdread.so.3 libdvdread.so.3.2.1 libdvdread.so.4 libdvdread.so.4.1.2
>
> ^---------- I tried to run this command with sudo. And it stalled. Notice that it was logged only at 19:38.
>
> Oct 17 19:38:16 tux sudo: pam_unix(sudo:session): session opened for user root by fraga(uid=0)
> Oct 17 19:38:16 tux sudo: pam_unix(sudo:session): session closed for user root
> Oct 17 19:40:20 tux vmunix: ci000/000:21ub/-/-./-../-..:./nu/nu7
>
> ^----------- Here is the stall. You can see that the log is corrupted, every other line.
>
> Oct 17 19:40:20 tux vmunix: 6ipt S I 11 os Lgtc pia S os]o s-000:21112
>
> ^----------- Every other line. Here is when I turned off my monitor with USB... Unless it was an extreme coincidence, I'm sure this event
> of turning it off and...

You probably meant every other character... :-) The same as with the netfilter logs... I have two open questions related to it: where did every other letter go...? And why is only a single char missing at a time, rather than a larger block?
I've no idea about printk's internals; maybe I'll try to ping somebody about it tomorrow.

> Oct 17 19:40:25 tux vmunix: sd 9:0:0:0: [sdb] Attached SCSI removable disk
> Oct 17 19:40:25 tux scsi.agent[5498]: disk at /devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host9/target9:0:0/9:0:0:0
>
> ^---------- ...turning it on again helped to "solve" the stall. The stall ended when I turned off the monitor.
>
> Oct 17 19:40:28 tux sudo: fraga : TTY=pts/2 ; PWD=/usr/local/lib ; USER=root ; COMMAND=/usr/local/bin/ls
> Oct 17 19:40:28 tux sudo: pam_unix(sudo:session): session opened for user root by fraga(uid=0)
> Oct 17 19:40:28 tux sudo: pam_unix(sudo:session): session closed for user root
> Oct 17 19:40:30 tux scsi.agent[5310]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host9/type does not exist
> Oct 17 19:40:35 tux scsi.agent[5493]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host9/target9:0:0/type does not exist
> Oct 17 20:12:37 tux vmunix: usb-storage: device found at 14
> Oct 17 20:12:37 tux vmunix: usb-storage: waiting for device to settle before scanning
> Oct 17 20:12:42 tux vmunix: scsi 10:0:0:0: Direct-Access Generic ls SC 53 Q NI
> Oct 17 20:12:42 tux vmunix: k
> Oct 17 20:12:42 tux scsi.agent[7439]: disk at /devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host10/target10:0:0/10:0:0:0
> Oct 17 20:12:47 tux vmunix: DROP INPUT: IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:15:e9:e2:00:db:08:00 SRC=192.168.0.1 DST=255.255.255.255 LEN=334 TOS=0x00 PREC=0x00 TTL=64 ID=6668 PROTO=UDP SPT=67 DPT=68 LEN=314
> Oct 17 20:12:47 tux scsi.agent[7252]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host10/type does not exist
> Oct 17 20:12:53 tux scsi.agent[7435]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host10/target10:0:0/type does not exist
>
> ***
>
> At the server I have a USB device connected too...
>
> Or in the end it's just a symptom of the problem... I don't
> know. The interesting thing is that as soon as I turned off the monitor
> (at the exact moment), the speaker beeped, because I had used TAB to auto-
> complete my command, and I knew that sudo wasn't stalled anymore.
>
> Maybe what the kernel does when we remove a USB hub could give us a hint about the stalling issue...

I doubt we'll find the cause from networking... Trying something a bit more generic: try running vmstat 1 > somelog.file to see if there's some particularly interesting behavior during the stall.
On Tue, 21 Oct 2008, Ilpo Järvinen wrote:

> Trying something a bit more generic:
>
> Try running vmstat 1 > somelog.file to see if there's some
> particularly interesting behavior during the stall.

Won't hurt doing this on both the server and the client... :-)
On Tue, 21 Oct 2008 00:48:16 +0300 (EEST) "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> Try running vmstat 1 > somelog.file to see if there's some
> particularly interesting behavior during the stall.

Here it is. I used nmap more or less in the middle of this log. The stall happened just on my client machine (stalled sudo), so there's only one log, ok? Thanks.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 1 0 52540 69928 63832 1001580 0 1 17 48 30 119 18 1 80 1
 0 0 52540 69844 63832 1001580 0 0 0 0 633 921 0 0 100 0
 0 0 52540 69844 63832 1001580 0 0 0 100 621 916 0 0 99 0
 1 0 52540 69844 63832 1001580 0 0 0 0 611 893 0 0 100 0
 0 0 52540 69716 63832 1001580 0 0 0 0 671 961 0 0 100 0
 1 0 52540 69552 63832 1001580 0 0 0 0 813 975 4 0 95 0
 0 0 52540 69596 63832 1001580 0 0 0 0 716 947 2 0 98 0
 0 0 52540 69596 63832 1001580 0 0 0 36 866 2047 2 2 96 0
 0 0 52540 69596 63832 1001580 0 0 0 0 688 964 0 0 100 0
 0 0 52540 69596 63832 1001580 0 0 0 0 721 1005 0 0 99 0
 0 0 52540 69720 63832 1001580 0 0 0 0 679 958 0 0 99 0
 0 0 52540 69720 63832 1001580 0 0 4 0 651 935 0 0 99 1
 1 0 52540 69760 63832 1001584 0 0 0 36 990 2105 2 2 96 0
 0 0 52540 69720 63832 1001584 0 0 0 0 869 1187 0 0 99 0
 0 0 52540 69720 63832 1001584 0 0 0 0 937 1272 0 0 100 0
 0 0 52540 69472 63832 1001584 0 0 0 0 1486 1702 17 3 79 0
 1 0 52540 68092 63832 1001604 0 0 0 0 2263 2553 36 13 50 0
 0 0 52540 68092 63832 1001604 0 0 0 336 1214 1223 3 4 92 0
 3 0 52540 67752 63832 1001604 0 0 0 0 1031 1192 1 0 99 0
 0 0 52540 67712 63832 1001604 0 0 0 0 988 1149 1 0 99 0
 0 0 52540 67712 63832 1001608 0 0 0 0 989 1127 0 0 100 0
 1 0 52540 67712 63832 1001608 0 0 0 0 985 1124 0 0 100 0
 0 0 52540 67752 63832 1001608 0 0 0 512 1149 1106 0 0 99 0
 2 0 52540 67544 63832 1001608 0 0 0 0 1463 1329 7 7 86 0
 2 0 52540 66496 63832 1001612 0 0 0 0 1755 2439 21 5 74 0
 0 0 52540 66332 63832 1001620 0 0 0 0 1597 1534 23 12 65 0
 0 0 52540 65992 63836 1001620 0 0 72 0 861 1155 0 0 97 2
 0 0 52540 65952 63836 1001688 0 0 0 304 819 1111 1 0 98 0
 0 0 52540 65704 63840 1001696 0 0 12 0 847 1194 1 0 97 1
 0 0 52540 65704 63840 1001696 0 0 0 0 797 1123 0 0 99 0
 2 0 52540 65620 63840 1001696 0 0 0 0 827 1155 1 0 99 0
 0 0 52540 65580 63840 1001696 0 0 0 0 747 1033 0 0 100 0
 0 0 52540 65496 63840 1001696 0 0 0 84 969 2128 2 3 95 0
 0 0 52540 64368 63924 1002608 480 0 1444 148 1593 1471 10 5 57 28
 0 0 52540 64480 63924 1002600 0 0 0 0 854 1141 1 1 98 0
 1 0 52540 64480 63924 1002604 0 0 12 0 1517 1409 18 1 81 0
 0 0 52540 64476 63924 1002624 0 0 0 0 1717 1331 27 1 72 0
 0 0 52540 64268 63980 1002600 0 0 56 300 1816 1543 17 14 66 3
 0 0 52540 64240 64004 1002616 0 0 24 0 805 1063 2 0 93 4
 0 0 52540 64240 64020 1002628 0 0 16 0 877 1033 1 0 96 3
 1 0 52540 64272 64048 1002632 0 0 28 0 1202 2408 3 3 90 3
 0 0 52540 64232 64068 1002640 0 0 20 0 961 1121 2 0 93 4
 0 0 52540 64232 64076 1002664 0 0 8 288 975 1197 1 0 97 2
 1 0 52540 64232 64100 1002668 0 0 24 0 974 1247 1 0 93 6
 0 0 52540 64232 64100 1002676 0 0 0 0 961 1177 1 0 98 0
 0 0 52540 64108 64300 1002684 0 0 200 0 1157 1337 4 0 74 21
 1 0 52540 64108 64328 1002668 0 0 28 0 909 1156 1 1 92 6
 0 0 52540 64108 64352 1002688 0 0 24 244 879 1117 1 0 94 4
 0 1 52540 64024 64376 1002712 0 0 24 0 932 1171 1 0 94 4
 0 0 52540 63984 64396 1002728 0 0 20 0 974 1190 3 1 91 4
 1 0 52540 63900 64412 1002732 0 0 16 0 908 1143 1 0 96 3
 0 0 52540 63860 64436 1002760 0 0 24 0 857 1105 0 0 92 7
 0 0 52540 63608 64444 1002796 0 0 4 380 961 1129 1 1 97 1
 0 0 52540 63612 64456 1002824 0 0 12 0 881 1088 2 0 95 3
On Wed, 22 Oct 2008, Dâniel Fraga wrote:
> On Tue, 21 Oct 2008 00:48:16 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
>
> > Try running vmstat 1 > somelog.file to see if there's some
> > particularly interesting behavior during the stall.
>
> Here it is. I used nmap more or less in the middle of this log.
> The stall happened just on my client machine (stalled sudo) so there's only one
> log ok? Thanks.

Nothing particularly interesting in it (sorry for the delay; there were some other bugs to fix in the meantime, and partially I was out of ideas). Minor increase in cpu/wa after nmap; before that it was 0-2, mostly zero.

Another thing that occurred to me today is that all these events I know of are related to syslog/klogd (I don't know too well how they relate to each other and what functionality is provided by each, thus I refer to them here as one :-)):

- sudo sendmsg'ed to /dev/log and stalled
- the every-2nd-char loss in printk output relates to it
- usb does cause printks
- server processes likely do some logging (and can then hang waiting
  until that finishes, so eventually no worker is available to
  process requests)

...Is there some other type of case I'm missing here? Perhaps we could try to solve it by stracing syslogd...

Did you ever actually try, btw, with something else than NO_TICK? We discussed it at some point, but I'm not sure what the end result was of what got tested and what not.
On Thu, 30 Oct 2008 12:43:05 +0200 (EET) "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> Nothing particularly interesting in it (sorry for the delay, there were
> some other bugs in the meantime to fix and partially because I was out
> of ideas). Minor increase in cpu/wa after nmap, before that it was 0-2,
> mostly zero.

Hi Ilpo. No problem, better late than never ;)

> Another thing that occurred to me today is that all these events I know of
> are related to syslog/klogd (I don't know too well how they relate to each
> other and what functionality is provided by each, thus I refer them here
> as one :-)).
>
> - sudo sendmsg'ed to /dev/log and stalled
> - every-2nd-char missing in printk relates to it
> - usb does cause printks
> - server processes likely do some logging (and can then hang on waiting
>   until that finishes and therefore eventually no worker is available to
>   process requests)
>
> ...Some other type of case I'm missing here?

Yes, it is that. Anyway, I'm talking directly to the netfilter people and they suggested I use ulogd to capture, with pcap, the exact packet at the moment the stall happens. I'll try that. If they can figure out what makes the log print every other char, maybe we have a hint.

> Perhaps we could try to solve it through stracing syslogd...

Yes, I can strace it.

> Did you ever actually try btw with something else than NO_TICK, we
> discussed it at some point of time but I'm not sure what was the end
> result of what got tested and what not?

I tried many options: TICK enabled and disabled, high-res timers enabled and disabled, and other kernel options, but it didn't make any difference. I'll work with the netfilter people, and if I get something I will reply, ok?
On Thu, 30 Oct 2008 12:43:05 +0200 (EET)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> Perhaps we could try to solve it through stracing syslogd...
Well Ilpo, you're right. What I'm about to write here makes
me very ashamed, but the truth must be told! The culprit was syslogd!
Almost unbelievable, but I had been using an old syslogd version for
about 5 years!
How can I be sure that it's syslogd's fault? Simply because I
had a stall today, and when I killed syslogd everything went back to
normal.
Well, I reinstalled GNU inetutils 1.5 (which I had already
installed before), but I don't know why it put syslogd in the
/usr/local/libexec directory.
But no problem. I'll just wait a few more days to verify that
syslogd is solely responsible for this, but I'm 90% sure it is.
So I'm just posting this so that if someone, some day, has
a similar problem, they can read this message and avoid all the
problems I had.
I apologize for thinking it was a kernel fault. Anyway, one
more lesson learned: do not keep old binaries lying around... ;)
Thanks everyone, mainly Ilpo, for giving me all the tools to
reach this point.
Ps: just out of curiosity, I was using a syslogd binary from
Mar 3, 2003! Extremely old! It's so old it was compiled for Linux
2.2.5. Or maybe I was too lazy and copied it from another machine...

syslogd: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for
GNU/Linux 2.2.5, dynamically linked (uses shared libs), not stripped
Ps2: I'll close the bug I opened on bugzilla.
Ps3: anyway, it's interesting how a small piece of the system
(syslogd) can generate these kinds of problems... I mean, a simple
error in syslogd could lead to a complete stall of connections, just
because everything is waiting for it to log through /dev/log. Of course
the problem was the binary, but there could be a timeout, so even a
genuinely buggy syslogd wouldn't cause such a stall on the system. I
really don't know what changed from 2.6.24 to 2.6.25, but maybe 2.6.24
had such a timeout? Maybe I'm just being silly writing that... you guys
know much more than me.
Ps4: maybe now we can understand why nmap solved the issue...
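The timeout suggested in Ps3 does exist in some form as a per-socket option: a logging client can set SO_SNDTIMEO so that a blocking send() gives up with EAGAIN after a bounded wait instead of stalling forever behind a jammed syslogd. A hedged user-space sketch (a socketpair whose other end is never read stands in for /dev/log; the function name is made up):

```c
#include <sys/socket.h>
#include <sys/time.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Fill one end of an AF_UNIX datagram pair, with a short send timeout
 * set, until a send fails; return the errno of the failing send.
 * With SO_SNDTIMEO, a full /dev/log-style socket yields EAGAIN after
 * the timeout rather than an indefinite stall.
 * Returns -1 on setup failure. */
int send_with_timeout_errno(void)
{
    int fds[2];
    struct timeval tv = { 0, 100 * 1000 };  /* give up after 100 ms */
    char msg[512];

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) < 0)
        return -1;
    if (setsockopt(fds[0], SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv) < 0)
        return -1;
    memset(msg, 'x', sizeof msg);
    for (;;) {
        if (send(fds[0], msg, sizeof msg, 0) < 0) {
            int err = errno;   /* queue full and timeout expired */
            close(fds[0]);
            close(fds[1]);
            return err;
        }
        /* fds[1] is never read, like a stuck syslogd */
    }
}
```

Whether a sender *should* give up is a policy question (it trades a stall for a lost log line), which is exactly the balance discussed later in the thread.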
On Sun, 2 Nov 2008, Dâniel Fraga wrote:
> On Thu, 30 Oct 2008 12:43:05 +0200 (EET)
> "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
>
> > Perhaps we could try to solve it through stracing syslogd...
>
> Well Ilpo, you're right, what I'm about to write here will make
> me very ashamed, but the truth must be told! The culprit was syslogd!
> Almost unbelievable, but I had been using an old syslogd version for
> about 5 years!
>
> How can I be sure that it's syslogd's fault? Simply because I
> had a stall today and when I killed syslogd everything was back to
> normal.

Once there's any kind of flow control, anything jamming downstream will eventually make upstream stall as well (or appear to not work as expected. Sadly, that's exactly the opposite from a correctness point of view, as flow control is a feature in TCP, not a bug :-)). Thus I occasionally run into these "TCP flow control not working" reports which turn out to be totally unrelated.

This still doesn't explain everything though, afaik... E.g., why did the sendto() to the SOCK_DGRAM socket hang?

> But no problem. I'll just wait a few more days to test if
> syslogd is the only responsible for this, but I'm 90% sure it is.

And you had the same old syslogd on both hosts?

In any case, the deterministic loss of every other character sounds like a real bug in syslogd, since it doesn't make much sense for that to happen in the kernel->syslogd communication (where I'd expect it not to show up in such a consistent pattern, but rather to cause more randomness).

> I apologize for thinking that it was a kernel fault.

It's not clear what caused this to happen _now_, nor the exact mechanism.

> Ps: just for curiosity, I was using a syslogd binary from Mar,
> 3, 2003! Extremely old! This is so old, it was compiled for Linux
> 2.2.5. Or maybe I was too lazy and copied it from another machine...
In theory this shouldn't be too big a problem, but I'm hardly an expert on those things, and syslogd is anyway more tightly coupled to the kernel than some random app.

> Ps3: anyway, it's interesting how a small piece of the system
> (syslogd) can generate those kinds of problems... I mean, a simple
> error on syslogd could lead to a complete stall on connections, just
> because everything is waiting for it to log through /dev/log.

This is more of a philosophical question than anything else... it's always a balance between data loss (= possibly losing the log line of an important event) and the possibility of a stall. But this shouldn't be a concern in the case where SOCK_DGRAM was used by sudo (as in the strace you sent to the sudo people); in general UDP doesn't guarantee reliability, so not delivering wouldn't be a problem, but I don't know if the PF_FILE domain does something different there.

> Of course
> the problem was the binary, but it could have a time out, so even if it
> was in fact a buggy syslogd, it won't cause such a stall on the
> system. I really don't know what changed from 2.6.24 to 2.6.25, but
> maybe 2.6.24 had such a timeout? Maybe I'm just silly writing that...
> you guys know much more than me.

Until we know more details than "killing syslogd helped", it's hard to tell what the actual cause is. And I have no clue about the semantics of /dev/log anyway.

> Ps4: maybe now we can understand why nmap solved the issue...

Not very clear, but at least sudo does some writing there too.
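Ilpo's open question — why a sendto() on a SOCK_DGRAM socket can hang at all — comes down to AF_UNIX datagram sockets being flow-controlled, unlike UDP: once the peer's unread queue fills, further sends block (or fail with EAGAIN when non-blocking). A small sketch of this, with the never-read end of a socketpair standing in for a jammed syslogd (the function name is mine):

```c
#include <sys/socket.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Count how many datagrams fit before a non-blocking send fails.
 * Unlike UDP, AF_UNIX datagram sockets apply backpressure: a blocking
 * sender would simply sleep here, which is the stall seen by sudo.
 * Stores the failing errno in *out_errno; returns -1 on setup failure. */
int count_sends_until_full(int *out_errno)
{
    int fds[2], n = 0;
    char msg[512];

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) < 0)
        return -1;
    fcntl(fds[0], F_SETFL, O_NONBLOCK);
    memset(msg, 'x', sizeof msg);
    while (send(fds[0], msg, sizeof msg, 0) == (ssize_t)sizeof msg)
        n++;                    /* fds[1] is never read, like a stuck syslogd */
    *out_errno = errno;
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

This is only an illustration of the backpressure mechanism on Linux AF_UNIX sockets; whether sudo's PF_FILE/SOCK_DGRAM path behaved exactly this way in 2.6.25 is the open question in the thread.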
On Mon, 3 Nov 2008 17:37:09 +0200 (EET) "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> Once there's any kind of flow control, anything jamming downstream will
> eventually make upstream stall as well (or appear to not work as
> expected. Sadly, that's exactly the opposite from a correctness point of
> view, as flow control is a feature in TCP, not a bug :-)). Thus I
> occasionally run into these "TCP flow control not working" reports which
> turn out to be totally unrelated.
>
> This still doesn't explain everything though, afaik... E.g., why did the
> sendto() to the SOCK_DGRAM socket hang?

Well, the fact that the problem has only happened since the 2.6.25 kernel makes me believe a kernel issue could exist too, but I think most of it was caused by syslogd.

> And you had the same old syslogd on both hosts?

Yes. My desktop and server have the same installation.

> In any case, the deterministic loss of every other character sounds
> like a real bug in syslogd, since it doesn't make much sense for that to
> happen in the kernel->syslogd communication (where I'd expect it not to
> show up in such a consistent pattern, but rather to cause more
> randomness).

Yes. With the newly compiled syslogd it doesn't happen anymore. And I don't get stalls either.

> It's not clear what caused this to happen _now_, nor the exact mechanism.

Ok.

> This is more of a philosophical question than anything else... it's
> always a balance between data loss (= possibly losing the log line of an
> important event) and the possibility of a stall. But this shouldn't be a
> concern in the case where SOCK_DGRAM was used by sudo (as in the
> strace you sent to the sudo people); in general UDP doesn't guarantee
> reliability, so not delivering wouldn't be a problem, but I don't know if
> the PF_FILE domain does something different there.

I see.

> Until we know more details than "killing syslogd helped", it's hard to
> tell what the actual cause is. And I have no clue about the semantics of
> /dev/log anyway.

Ok.
Anyway, at least the problem is now on record, and if something related shows up in the future, maybe this thread can help someone.
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index f6e3af3..a1cecb0 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -68,6 +68,12 @@ static void ftrace_list_func(unsigned long ip, unsigned long parent_ip)
 	};
 }
 
+void stop_event_tracer_now(void)
+{
+	tracer_enabled = 0;
+	printk(KERN_INFO "Tracer stopped\n");
+}
+
 /**
  * clear_ftrace_function - reset the ftrace function
  *
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f976fc5..527dd45 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -485,6 +485,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 	return newsk;
 }
 
+extern void stop_event_tracer_now(void);
+
 /*
  * Process an incoming packet for SYN_RECV sockets represented
  * as a request_sock.
@@ -687,6 +689,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	return child;
 
 listen_overflow:
+	stop_event_tracer_now();
 	if (!sysctl_tcp_abort_on_overflow) {
 		inet_rsk(req)->acked = 1;
 		return NULL;