Message ID | ye6qhakx7bxl.fsf@elbrus2.mtv.corp.google.com |
---|---|
State | New |
Headers | show |
On Wed, Feb 27, 2013 at 3:13 PM, Paul Pluzhnikov <ppluzhnikov@google.com> wrote:
> Ok for google/gcc-4_7 and google/integration?
OK.
Diego.
On Wed, Feb 27, 2013 at 12:13 PM, Paul Pluzhnikov <ppluzhnikov@google.com> wrote: > Greetings, > > Google ref b/8187733 > > Build libstdc++-v3/src/c++11/debug.cc with -fno-omit-frame-pointer, so > frame-based unwinder can step through it. Just an aside which program does not understand dwarf2 unwinding? Perf is setup to understand it. Valgrind is setup to understand it. ASAN should be setup to understand it. Thanks, Andrew Pinski > > Tested: bootstrap build and verified debug.cc is built with > -fno-omit-frame-pointer. > > Ok for google/gcc-4_7 and google/integration? > > Thanks, > > -- > Paul Pluzhnikov > > > Index: libstdc++-v3/src/c++11/Makefile.am > =================================================================== > --- libstdc++-v3/src/c++11/Makefile.am (revision 196316) > +++ libstdc++-v3/src/c++11/Makefile.am (working copy) > @@ -124,3 +124,4 @@ > > # Google-specific pessimization > functexcept.lo_no_omit_frame_pointer = -fno-omit-frame-pointer > +debug.lo_no_omit_frame_pointer = -fno-omit-frame-pointer > Index: libstdc++-v3/src/c++11/Makefile.in > =================================================================== > --- libstdc++-v3/src/c++11/Makefile.in (revision 196316) > +++ libstdc++-v3/src/c++11/Makefile.in (working copy) > @@ -391,6 +391,7 @@ > > # Google-specific pessimization > functexcept.lo_no_omit_frame_pointer = -fno-omit-frame-pointer > +debug.lo_no_omit_frame_pointer = -fno-omit-frame-pointer > all: all-am > > .SUFFIXES: > >
On Wed, Feb 27, 2013 at 12:17 PM, Andrew Pinski <pinskia@gmail.com> wrote:
> Just an aside which program does not understand dwarf2 unwinding?
All Google executables currently link in a frame-based unwinder.
This allows us to report crashes and record runtime statistics from the
executable itself, in ways that are convenient when dealing with thousands
of executables spread across millions of machines, and which are much
harder to achieve using external tools.
There is of course libunwind, but it turns out that it's very hard to
beat speed and simplicity of a frame-based unwinder (which matters when
you collect stack traces across live production services).
I hope this answers your question.
On 02/27/2013 01:50 PM, Paul Pluzhnikov wrote: > On Wed, Feb 27, 2013 at 12:17 PM, Andrew Pinski <pinskia@gmail.com> wrote: > >> Just an aside which program does not understand dwarf2 unwinding? > > All Google executables currently link in a frame-based unwinder. > > This allows us to report crashes and record runtime statistics from the > executable itself, in ways that are convenient when dealing with thousands > of executables spread across millions of machines, and which are much > harder to achieve using external tools. > > There is of course libunwind, but it turns out that it's very hard to > beat speed and simplicity of a frame-based unwinder (which matters when > you collect stack traces across live production services). > > I hope this answers your question. Or imagine something like trying to get traces out of low level profilers and such. There are situations where dwarf2 unwinders are just too damn bloated. jeff
On Wed, Feb 27, 2013 at 12:53 PM, Jeff Law <law@redhat.com> wrote: > On 02/27/2013 01:50 PM, Paul Pluzhnikov wrote: >> >> On Wed, Feb 27, 2013 at 12:17 PM, Andrew Pinski <pinskia@gmail.com> wrote: >> >>> Just an aside which program does not understand dwarf2 unwinding? >> >> >> All Google executables currently link in a frame-based unwinder. >> >> This allows us to report crashes and record runtime statistics from the >> executable itself, in ways that are convenient when dealing with thousands >> of executables spread across millions of machines, and which are much >> harder to achieve using external tools. >> >> There is of course libunwind, but it turns out that it's very hard to >> beat speed and simplicity of a frame-based unwinder (which matters when >> you collect stack traces across live production services). libunwind is not needed since there is already a dwarf2 based unwinder that is accessible in libgcc already. I don't know why people still promote libunwind when libgcc already has similar facilities. >> >> I hope this answers your question. > > Or imagine something like trying to get traces out of low level profilers > and such. There are situations where dwarf2 unwinders are just too damn > bloated. Perf handles this by saving off some of the stack space instead and then post-process it. This is why I said perf handles this case already. Now oprofile does not but oprofile is really going away. Thanks, Andrew > > jeff
On 02/27/2013 02:46 PM, Andrew Pinski wrote: > Perf handles this by saving off some of the stack space instead and > then post-process it. This is why I said perf handles this case > already. Now oprofile does not but oprofile is really going away. I'm well aware of how perf handles this, having had numerous discussions with Red Hat's kernel team about this issue. jeff
On Wed, Feb 27, 2013 at 1:46 PM, Andrew Pinski <pinskia@gmail.com> wrote: > libunwind is not needed since there is already a dwarf2 based unwinder > that is accessible in libgcc already. Last I checked, libgcc dwarf handling code called malloc, and thus wasn't suitable when you want to instrument malloc *itself* to collect stack traces, nor for SIGPROF profiling. Is libgcc dwarf code async-signal safe now? If not, will it ever be? > I don't know why people still > promote libunwind when libgcc already has similar facilities. Because these facilities are not (or at least were not in the recent past) suitable for the unwinder requirements that people have? Thanks,
On Wed, Feb 27, 2013 at 2:01 PM, Paul Pluzhnikov <ppluzhnikov@google.com> wrote: > > Last I checked, libgcc dwarf handling code called malloc, and thus wasn't > suitable when you want to instrument malloc *itself* to collect stack > traces, nor for SIGPROF profiling. > > Is libgcc dwarf code async-signal safe now? > If not, will it ever be? I think that the libgcc unwinder only calls malloc if somebody calls __register_frame_info. And in ordinary circumstances nobody ever calls that. I don't think there is an malloc issue there. Unless I've forgotten something. I thought a more serious issue is that the libgcc unwinder has no way of dealing with a corrupt stack--it will simply crash. Ian
On Wed, Feb 27, 2013 at 5:52 PM, Ian Lance Taylor <iant@google.com> wrote: > I think that the libgcc unwinder only calls malloc if somebody calls > __register_frame_info. So I looked. Indeed it doesn't call malloc in our environment, but does call dl_iterate_phdr, which is just as deadly. To be fair, libunwind does that as well, and we have to provide a workaround for that. Still, out-of-the-box libgcc unwinder is not suitable for our unwind requirements. > I thought a more serious issue is that the libgcc unwinder has no way > of dealing with a corrupt stack--it will simply crash. That too ... Thanks,
On Thu, Feb 28, 2013 at 8:23 AM, Paul Pluzhnikov <ppluzhnikov@google.com> wrote: > On Wed, Feb 27, 2013 at 5:52 PM, Ian Lance Taylor <iant@google.com> wrote: > >> I think that the libgcc unwinder only calls malloc if somebody calls >> __register_frame_info. > > So I looked. Indeed it doesn't call malloc in our environment, but > does call dl_iterate_phdr, which is just as deadly. Hmmm, I guess I can't remember why. dl_iterate_phdr uses a recursive lock so there shouldn't be any deadlock problem. If the problem is the dynamic lookup of the dl_iterate_phdr symbol for lazy PLT evaluation, that could be finessed by calling it beforehand. Ian
On Thu, Feb 28, 2013 at 9:28 AM, Ian Lance Taylor <iant@google.com> wrote: >> does call dl_iterate_phdr, which is just as deadly. > > Hmmm, I guess I can't remember why. Let me refresh your memory. You've seen this deadlock before: Thread 1 Thread 2 dlopen malloc ... takes loader lock ... takes malloc lock malloc _Unwind_Backtrace ... needs malloc lock dl_iterate_phdr held by T2 ... needs loader lock held by T1 -- Paul Pluzhnikov
On Thu, Feb 28, 2013 at 10:10 AM, Paul Pluzhnikov <ppluzhnikov@google.com> wrote: > On Thu, Feb 28, 2013 at 9:28 AM, Ian Lance Taylor <iant@google.com> wrote: > >>> does call dl_iterate_phdr, which is just as deadly. >> >> Hmmm, I guess I can't remember why. > > Let me refresh your memory. You've seen this deadlock before: > > Thread 1 Thread 2 > dlopen malloc > ... takes loader lock ... takes malloc lock > malloc _Unwind_Backtrace > ... needs malloc lock dl_iterate_phdr > held by T2 ... needs loader lock held by T1 Oh yeah. Thanks. Ian
> Thread 1 Thread 2 > dlopen malloc > ... takes loader lock ... takes malloc lock > malloc _Unwind_Backtrace > ... needs malloc lock dl_iterate_phdr > held by T2 ... needs loader lock held by T1 Am I missing something, or wouldn't it be feasible when instrumenting malloc to call _Unwind_Backtrace before obtaining the malloc lock (or drop the malloc lock while calling _Unwind_Backtrace)? Similarly, couldn't dlopen drop the loader lock while calling malloc? -cary
On Thu, Feb 28, 2013 at 11:57:48AM -0800, Cary Coutant wrote:
> Similarly, couldn't dlopen drop the loader lock while calling malloc?
It can't, but perhaps it could call some alternative malloc instead
(the simpler malloc version in ld.so or similar).
Jakub
On Thu, Feb 28, 2013 at 9:07 AM, Xinliang David Li <davidxl@google.com> wrote:
> Any insight about the relative performance of the two implementations?
We have a benchmark for the speed of unwinder. Here are results.
The number /1, /2, etc. is the number of levels in the stack trace.
Using frame-based unwinder:
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------
BM_GetStackTrace/1 29 29 24137931
BM_GetStackTrace/2 38 38 17948718
BM_GetStackTrace/3 43 44 16279070
BM_GetStackTrace/4 57 57 10000000
BM_GetStackTrace/5 62 62 10000000
BM_GetStackTrace/6 65 65 10000000
BM_GetStackTrace/7 65 64 10000000
BM_GetStackTrace/8 65 65 10000000
BM_GetStackTrace/9 65 65 10000000
BM_GetStackTrace/10 65 65 10000000
Using libgcc:
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------
BM_GetStackTrace/1 1543 1543 466667
BM_GetStackTrace/2 2042 2057 350000
BM_GetStackTrace/3 2378 2366 291667
BM_GetStackTrace/4 2754 2720 250000
BM_GetStackTrace/5 3212 3200 218750
BM_GetStackTrace/6 3655 3651 194444
BM_GetStackTrace/7 4039 4000 175000
BM_GetStackTrace/8 4009 4000 175000
BM_GetStackTrace/9 4002 4000 175000
BM_GetStackTrace/10 4017 4000 175000
Using libunwind:
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------
BM_GetStackTrace/1 109 108 6363636
BM_GetStackTrace/2 121 122 5833333
BM_GetStackTrace/3 130 130 5000000
BM_GetStackTrace/4 133 132 5000000
BM_GetStackTrace/5 148 148 4666667
BM_GetStackTrace/6 162 162 4375000
BM_GetStackTrace/7 174 175 4117647
BM_GetStackTrace/8 185 185 3888889
BM_GetStackTrace/9 188 187 3684211
BM_GetStackTrace/10 188 187 3684211
Conclusions:
- frame based unwinder is hard to beat (re-confirmed :-)
- libunwind is getting really close at 3x the overhead
- libgcc is nowhere close at 50x.
On Thu, Feb 28, 2013 at 11:59 AM, Jakub Jelinek <jakub@redhat.com> wrote: > On Thu, Feb 28, 2013 at 11:57:48AM -0800, Cary Coutant wrote: >> Similarly, couldn't dlopen drop the loader lock while calling malloc? > > It can't, but perhaps it could call some alternative malloc instead > (the simpler malloc version in ld.so or similar). ld-linux starts calling the simpler malloc in dl-minimal.c, then switches to "real" libc.so.6 malloc later on. This behavior causes a lot of pain to anyone who tries to interpose malloc and use dlsym(RTLD_NEXT,...) or similar from the interposer. Roland explained to me ~15 years ago why it is that way (it had something to do with Hurd); an explanation I can't find at the moment.
Index: libstdc++-v3/src/c++11/Makefile.am =================================================================== --- libstdc++-v3/src/c++11/Makefile.am (revision 196316) +++ libstdc++-v3/src/c++11/Makefile.am (working copy) @@ -124,3 +124,4 @@ # Google-specific pessimization functexcept.lo_no_omit_frame_pointer = -fno-omit-frame-pointer +debug.lo_no_omit_frame_pointer = -fno-omit-frame-pointer Index: libstdc++-v3/src/c++11/Makefile.in =================================================================== --- libstdc++-v3/src/c++11/Makefile.in (revision 196316) +++ libstdc++-v3/src/c++11/Makefile.in (working copy) @@ -391,6 +391,7 @@ # Google-specific pessimization functexcept.lo_no_omit_frame_pointer = -fno-omit-frame-pointer +debug.lo_no_omit_frame_pointer = -fno-omit-frame-pointer all: all-am .SUFFIXES: