Sign Up
Log In
Log In
or
Sign Up
Places
All Projects
Status Monitor
Collapse sidebar
home:Ledest:erlang:20
erlang
1638-Add-documentation-about-debbuging-NIFs-dri...
Overview
Repositories
Revisions
Requests
Users
Attributes
Meta
File 1638-Add-documentation-about-debbuging-NIFs-drivers.patch of Package erlang
From abb60125ae543ef75346423f6ca24dcae4ccdd69 Mon Sep 17 00:00:00 2001 From: Sverker Eriksson <sverker@erlang.org> Date: Wed, 29 Jun 2022 21:21:27 +0200 Subject: [PATCH 2/2] Add documentation about debbuging NIFs/drivers in Interoperability Tutorial. --- erts/doc/src/erl_cmd.xml | 5 +- erts/doc/src/erl_nif.xml | 7 +- system/doc/tutorial/debugging.xml | 346 ++++++++++++++++++++++++++++++ system/doc/tutorial/part.xml | 1 + system/doc/tutorial/xmlfiles.mk | 4 +- 5 files changed, 359 insertions(+), 4 deletions(-) create mode 100644 system/doc/tutorial/debugging.xml diff --git a/erts/doc/src/erl.xml b/erts/doc/src/erl.xml index 8f63d73b08..9f764d2b4e 100644 --- a/erts/doc/src/erl.xml +++ b/erts/doc/src/erl.xml @@ -338,7 +338,10 @@ $ <input>erl \ <item> <p>Useful for debugging. Prints the arguments sent to the emulator.</p> </item> - <tag><c><![CDATA[-emu_type Type]]></c></tag> + <tag> + <marker id="emu_type"></marker> + <c><![CDATA[-emu_type Type]]></c> + </tag> <item> <p>Start an emulator of a different type. For example, to start the lock-counter emualator, use <c>-emu_type lcnt</c>. (The emulator diff --git a/erts/doc/src/erl_nif.xml b/erts/doc/src/erl_nif.xml index 2803f2b868..4208aa269a 100644 --- a/erts/doc/src/erl_nif.xml +++ b/erts/doc/src/erl_nif.xml @@ -3765,7 +3765,10 @@ if (retval & ERL_NIF_SELECT_STOP_CALLED) { <section> <title>See Also</title> - <p><seealso marker="erlang#load_nif-2"> - <c>erlang:load_nif/2</c></seealso></p> + <p> + <seealso marker="erlang#load_nif-2"><c>erlang:load_nif/2</c></seealso><br/> + <seealso marker="system/tutorial:nif">NIFs (tutorial)</seealso><br/> + <seealso marker="system/tutorial:debugging">Debugging NIFs and Port Drivers</seealso> + </p> </section> </cref> diff --git a/system/doc/tutorial/debugging.xml b/system/doc/tutorial/debugging.xml new file mode 100644 index 0000000000..bd7814c441 --- /dev/null +++ b/system/doc/tutorial/debugging.xml @@ -0,0 +1,346 @@ +<?xml version="1.0" encoding="utf-8"?> +<!DOCTYPE chapter SYSTEM "chapter.dtd"> +<chapter> + <header> + <copyright> + <year>2022</year> + <holder>Ericsson AB. All Rights Reserved.</holder> + </copyright> + <legalnotice> + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + + </legalnotice> + + <title>Debugging NIFs and Port Drivers</title> + <prepared/> + <docno/> + <date/> + <rev/> + <file>debugging.xml</file> + </header> + + <section> + <title>With great power comes great responsibilty</title> + <p> + NIFs and port driver code run inside the Erlang VM OS process (the + "Beam"). To maximize performance the code is called directly by the same + threads executing Erlang beam code and has full access to all the memory + of the OS process. A buggy NIF/driver can thus make severe damage by + corrupting memory. + </p> + <p> + In a best case scenario such memory corruption is detected immediately + causing the Beam to crash generating a core dump file which can be + analyzed to find the bug. However, it is very common for memory corruption + bugs to not be immediately detected when the faulty write happens, but + instead much later, for example when the calling Erlang process is garbage + collected. When that happens it can be very hard to find the root cause of + the memory corruption by analysing the core dump. All traces that could + have indicated which specific buggy NIF/driver that caused the corruption + may be long gone. + </p> + </section> + <section> + <title>The debug emulator</title> + <p> + One way to make debugging easier is to run an emulator built with target + <c>debug</c>. It will + </p> + <list type="bulleted"> + <item> + <p> + <em>Increase probability of detecting bugs earlier</em>. It contains a + lot more runtime checks to ensure correct use of internal interfaces + and data structures. + </p> + </item> + <item> + <p> + <em>Generate a core dump that is easier to analyze</em>. Compiler + optimizations are turned off, which stops the compiler from + "optimizing away" variables, thus making it easier/possible to inspect + their state. + </p> + </item> + <item> + <p> + <em>Detect lock order violations</em>. A runtime lock checker will + verify that the locks in the + <seecref marker="erts:erl_nif"><c>erl_nif</c></seecref> and + <seecref marker="erts:erl_driver"><c>erl_driver</c></seecref> + APIs are seized in a consistent order that cannot result in deadlock + bugs. + </p> + </item> + </list> + <p> + In fact, we recommend to use the debug emulator as default during + development of NIFs and drivers, regardless if you are troubleshooting + bugs or not. Some subtle bugs may not be detected by the normal emulator + and just happen to work anyway by chance. However, another version of the + emulator, or even different circumstances within the same emulator, may + cause the bug to later provoke all kinds of problems. + </p> + <p> + The main disadvantage of the <c>debug</c> emulator is its reduced + performance. The extra runtime checks and lack of compiler optimizations + may result in a slowdown with a factor of two or more depending on + load. The memory footprint should be about the same. + </p> + <p> + If the <c>debug</c> emulator is part of the Erlang/OTP installation, it can be + started with</p> + <pre> +> <input>erl <seecom marker="erts:erl#emu_type">-emu_type</seecom> debug</input> +Erlang/OTP 25 [erts-13.0.2] ... <em>[type-assertions] [debug-compiled] [lock-checking]</em> + +Eshell V13.0.2 (abort with ^G) +1> +</pre> + <p> + If the <c>debug</c> emulator is not part of the installation, you need to + <seeguide marker="system/installation_guide:INSTALL#Advanced-configuration-and-build-of-ErlangOTP_Building_How-to-Build-a-Debug-Enabled-Erlang-RunTime-System"> + build it from the Erlang/OTP source code</seeguide>. After building from source + either make an Erlang/OTP installation or you can run the debug emulator + directly in the source tree with the <c>cerl</c> script: + </p> + <pre> +> <input>$ERL_TOP/bin/cerl -debug</input> +Erlang/OTP 25 [erts-13.0.2] ... <em>[type-assertions] [debug-compiled] [lock-checking]</em> + +Eshell V13.0.2 (abort with ^G) +1> +</pre> + <p> + The <c>cerl</c> script can also be used as a convenient way to start + the debugger <c>gdb</c> for core dump analysis: + </p> + <pre> +> <input>$ERL_TOP/bin/cerl -debug -core core.12345</input> +or +> <input>$ERL_TOP/bin/cerl -debug -rcore core.12345</input> +</pre> + <p> + The first variant starts Emacs and runs <c>gdb</c> within, while + the other <c>-rcore</c> runs <c>gdb</c> directly in the terminal. Apart + from starting <c>gdb</c> with the correct <c>beam.debug.smp</c> executable + file it will also read the file <c>$ERL_TOP/erts/etc/unix/etp-commands</c> + which contains a lot of <c>gdb</c> command for inspecting a beam core + dump. For example, the command <c>etp</c> that will print the content of + an Erlang term (<c>Eterm</c>) in plain Erlang syntax. + </p> + </section> + <section> + <title>Address Sanitizer</title> + <p> + <url href="https://clang.llvm.org/docs/AddressSanitizer.html"> + AddressSanitizer</url> (asan) is an open source programming tool that + detects memory corruption bugs such as buffer overflows, use-after-free + and memory leaks. AddressSanitizer is based on compiler instrumentation + and is supported by both gcc and clang. + </p> + <p> + Similar to the <c>debug</c> emulator, the <c>asan</c> emulator runs slower + than normal, about 2-3 times slower. However, it also has a larger memory + footprint, about 3 times more memory than normal. + </p> + <p> + To get full effect you should compile both your own NIF/driver code as + well as the Erlang emulator with AddressSanitizer instrumentation. Compile + your own code by passing option <c>-fsanitize=address</c> to gcc or + clang. Other recommended options that will improve the fault + identification are <c>-fno-common</c> and <c>-fno-omit-frame-pointer</c>. + </p> + <p> + Build and run the emulator with AddressSanitizer support by using the same + procedure as for the debug emulator, except use the <c>asan</c> build + target instead of <c>debug</c>. + </p> + <taglist> + <tag>Run in source tree</tag> + <item> + <p> + If you run the <c>asan</c> emulator directly in the source tree with the + <c>cerl</c> script you only need to set environment variable + <c>ASAN_LOG_DIR</c> to the directory where the error log files will be + generated. + </p> + <pre> +> <input>export ASAN_LOG_DIR=/my/asan/log/dir</input> +> <input>$ERL_TOP/bin/cerl -asan</input> +Erlang/OTP 25 [erts-13.0.2] ... <em>[address-sanitizer]</em> + +Eshell V13.0.2 (abort with ^G) +1> +</pre> + <p> + You may however also want to set <c>ASAN_OPTIONS="halt_on_error=true"</c> + if you want the emulator to crash when an error is detected. + </p> + </item> + <tag>Run installed Erlang/OTP</tag> + <item> + <p> + If you run the <c>asan</c> emulator in an installed Erlang/OTP with <c>erl + -emu_type asan</c> you need to set the path to the error log + <em>file</em> with + </p> + <pre> +> <input>export ASAN_OPTIONS="log_path=/my/asan/log/file"</input></pre> + <p> + To avoid false positive memory leak reports from the emulator + itself set <c>LSAN_OPTIONS</c> (LSAN=LeakSanitizer): + </p> + <pre> +> <input>export LSAN_OPTIONS="suppressions=$ERL_TOP/erts/emulator/asan/suppress"</input></pre> + <p> + The <c>suppress</c> file is currently not installed but can be copied + manually from the source tree to wherever you want it. + </p> + </item> + </taglist> + <p> + Memory corruption errors are reported by AddressSanitizer when they + happen, but memory leaks are only checked and reported by default then the + emulator terminates. + </p> + </section> + <section> + <title>Valgrind</title> + <p> + An even more heavy weight debugging tool is <url + href="https://valgrind.org">Valgrind</url>. It can also find memory + corruption bugs and memory leaks similar to <c>asan</c>. Valgrind is not + as good at buffer overflow bugs, but it will find use of undefined data, + which is a type of error that <c>asan</c> cannot detect. + </p> + <p> + Valgrind is much slower than <c>asan</c> and it is incapable at + exploiting CPU multicore processing. We therefore recommend <c>asan</c> as + the first choice before trying valgrind. + </p> + <p> + Valgrind runs as a virtual machine itself, emulating execution of hardware + machine instructions. This means you can run almost any program unchanged + on valgrind. However, we have found that the beam executable benefits from + being compiled with special adaptions for running on valgrind. + </p> + <p> + Build the emulator with <c>valgrind</c> target the same as is done for + <c>debug</c> and <c>asan</c>. Note that <c>valgrind</c> needs to be + installed on the machine before the build starts. + </p> + <p> + Run the <c>valgrind</c> emulator directly in the source tree with the + <c>cerl</c> script. Set environment variable <c>VALGRIND_LOG_DIR</c> to + the directory where the error log files will be generated. + </p> + <pre> +> <input>export VALGRIND_LOG_DIR=/my/valgrind/log/dir</input> +> <input>$ERL_TOP/bin/cerl -valgrind</input> +Erlang/OTP 25 [erts-13.0.2] ... <em>[valgrind-compiled]</em> + +Eshell V13.0.2 (abort with ^G) +1> +</pre> + </section> + <section> + <title>rr - Record and Replay</title> + <p> + Last but not least, the fantastic interactive debugging tool <url + href="https://rr-project.org/"><c>rr</c></url>, developed by Mozilla as + open source. <c>rr</c> stands for Record and Replay. While a core dump + represents only a static snapshot of the OS process when it crashed, with + <c>rr</c> you instead record the entire session, from start of the OS + process to the end (the crash). You can then replay that session from + within <c>gdb</c>. Single step, set breakpoints and watchpoints, and even + <em>execute backwards</em>. + </p> + <p> + Considering its powerful utility, <c>rr</c> is remarkably light weight. + It runs on Linux with any reasonably modern x86 CPU. You may get a two + times slowdown when executing in recording mode. The big weakness is its + inability to exploite CPU multicore processing. If the bug is a race + condition between concurrently running threads, it may be hard to + reproduce with <c>rr</c>. + </p> + <p> + <c>rr</c> does not require any special instrumented compilation. However, + if possible, run it together with the <c>debug</c> emulator, as that will + result in a much nicer debugging experience. You run <c>rr</c> in the + source tree using the <c>cerl</c> script. + </p> + <p> + Here is an example of a typical session. First we catch the crash in an rr + recording session: + </p> + <pre> +> <input>$ERL_TOP/bin/cerl -debug -rr</input> +rr: Saving execution to trace directory /home/foobar/.local/share/rr/beam.debug.smp-1. +Erlang/OTP 25 [erts-13.0.2] + +Eshell V13.0.2 (abort with ^G) +1> <input>mymod:buggy_nif().</input> +Segmentation fault</pre> + <p> + Now we can replay that session with <c>rr replay</c>: + </p> + <pre> +> <input>rr replay</input> +GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2 +: +(rr) <input>continue</input> +: +Thread 2 received signal SIGSEGV, Segmentation fault. +(rr) <input>backtrace</input></pre> + <p> + You get the call stack at the moment of the crash. Bad luck, it is + somewhere deep down in the garbage collection of the beam. But you manage + to figure out that variable <c>hp</c> points to a broken Erlang term. + </p> + <p> + Set a watch point on that memory position and resume execution + <em>backwards</em>. The debugger will then stop at the exact position when + that memory position <c>*hp</c> was written. + </p> + <pre> +(rr) <input>watch -l *hp</input> +Hardware watchpoint 1: -location *hp +(rr) <input>reverse-continue</input> +Continuing. + +Thread 2 received signal SIGSEGV, Segmentation fault.</pre> + <p> + This is a quirk to be aware about. We started by executing forward until + it crashed with SIGSEGV. We are now executing backwards from that point, + so we are hitting the same SIGSEGV again but from the other + direction. Just continue backwards once more to move past it. + </p> + <pre> +(rr) <input>reverse-continue</input> +Continuing. + +Thread 2 hit Hardware watchpoint 1: -location *hp + +Old value = 42 +New value = 0</pre> + <p> + And here we are at the position when someone wrote a broken term on the + process heap. Note that "Old value" and "New value" are reversed when we + execute backwards. In this case the value 42 was written on the heap. + Let's see who the guilty one is: + </p> + <pre> +(rr) <input>backtrace</input></pre> + </section> +</chapter> diff --git a/system/doc/tutorial/part.xml b/system/doc/tutorial/part.xml index 4a66f0cb22..44bca4c4cd 100644 --- a/system/doc/tutorial/part.xml +++ b/system/doc/tutorial/part.xml @@ -36,5 +36,6 @@ <xi:include href="c_portdriver.xml"/> <xi:include href="cnode.xml"/> <xi:include href="nif.xml"/> + <xi:include href="debugging.xml"/> </part> diff --git a/system/doc/tutorial/xmlfiles.mk b/system/doc/tutorial/xmlfiles.mk index 74e174f6d4..ccb572123b 100644 --- a/system/doc/tutorial/xmlfiles.mk +++ b/system/doc/tutorial/xmlfiles.mk @@ -1,3 +1,4 @@ + # # %CopyrightBegin% # @@ -19,7 +20,8 @@ # TUTORIAL_CHAPTER_FILES = \ introduction.xml\ - overview.xml + overview.xml\ + debugging.xml TUTORIAL_CHAPTER_GEN_FILES = \ cnode.xml\ -- 2.35.3
Locations
Projects
Search
Status Monitor
Help
OpenBuildService.org
Documentation
API Documentation
Code of Conduct
Contact
Support
@OBShq
Terms
openSUSE Build Service is sponsored by
The Open Build Service is an
openSUSE project
.
Sign Up
Log In
Places
Places
All Projects
Status Monitor