In Progress
Unit 1, Lesson 21
In Progress

Strace

Have you ever found yourself confusedly speculating about how a program works? Googling documentation and Stack Overflow answers, poking through the source code, but still feeling like you’re missing important information?

In this episode you’ll learn to use strace, a powerful tool for gathering hard data about exactly how a program interacts with the operating system. We’ll walk through a concrete example, in which we derive and drill into comprehensive intelligence on every file the Ruby bundle tool interacts with.

Video transcript & code

The other day I had a question about Ruby’s bundler tool.

$ bundle
Could not locate Gemfile

Specifically, I wanted to understand all the places it looks for configuration. Like really understand. Sure, this message is straightforward enough: it looked for a Gemfile and couldn’t find one. But what are all the paths it checked for that file? And what other configuration sfiles did it silently open or look for?

I spent a fair amount of time dredging through documentation and google searches. But documentation is sometimes out of date and incomplete, and so is the web. I didn’t want hypotheses; I wanted facts.

It’s tempting in a situation like this to search through the code. After all the, code is definitive, right? But the trouble with code is that it delegates to other code, which delegates to other code, which eventually delegates to operating system calls in ways that may not be obvious.

The wonderful thing about working with software is that, unlike in the physical world, we can usually get conclusive and comprehensive data about what’s going on. The trick is knowing where to find the right microscope.

One such microscope available on UNIX-like systems is called strace.

$ apt install strace

strace is a tool for tracing system calls. And it’s very easy to get started with.

All we have to do us put the strace command in front of the command we want to trace.

$ strace bundle
# ...
rt_sigaction(SIGINT, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f50f39d6730}, {sa_handler=0x7f50f3dffa30, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7f50f39d6730}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f50f39d6730}, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f50f39d6730}, 8) = 0
close(7)                                = 0
close(8)                                = 0
getpid()                                = 6583
getpid()                                = 6583
timer_delete(0)                         = 0
munmap(0x7f50efadf000, 790528)          = 0
exit_group(10)                          = ?
+++ exited with 10 +++

Yikes, that was a lot of output!

Let’s pipe it through a pager so we can scroll through it more slowly.

$ strace bundle | less 

Er, wait, what just happened?

Lesson one about strace: it dumps to the standard error stream, not standard out.

To pipe its output into a pager, we first need to redirect stderr to stdout.

$ strace bundle 2>&1 | less

Now we can page through the output at our leisure.

What we’re looking at here is a comprehensive listing of every single system call that was invoked while running the bundle command. Bear in mind that most of these aren’t invoked by the Bundler code directly. Many of them are side effects of Ruby starting up and loading standard libraries. Many are also triggered deep in the C libraries that Ruby depends on, or the libraries they depend on, and so on.

The cool thing about this is that we’ve found a choke point: every interaction a program has with the system is ultimately done with system calls, and strace lets us see all of them. There’s no more speculation, only cold hard facts.

The less cool thing about this is that, well… there’s a lot of facts here to sift through.

Fortunately, we can filter this output.

If we add a -e flag, we can tell strace which system calls we’re interested in. This requires some familiarity with the Linux programming API, which is beyond the scope of this episode. But let’s say we happen to remember you use the open() syscall to open files.

We might be tempted to just look for calls to open().

$ strace -e open bundle 2>&1 

Surprisingly though, this lists no system calls at all.

One thing to try when we don’t see the results we’re looking for from strace is the -f flag.

$ strace -f -e open bundle 2>&1 

This flag tells strace to trace not only the immediate process spawned by the command we passed, but also to attach itself to any subprocesses it spawns. This can be helpful for commands which are actually just stubs or wrappers that then delegate to some other command to get the work done.

In this case though, while we see some indications that it attached to subprocesses, we still don’t see any system call traces.

$ strace -f -e open bundle 
strace: Process 6158 attached
strace: Process 6159 attached
[pid  6159] +++ exited with 0 +++
[pid  6158] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=6159, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
[pid  6158] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=6158, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
Could not locate Gemfile
+++ exited with 10 +++

As it turns out, on this machine this program doesn’t use the basic open system call at all.

Instead, it uses an open() variant called openat(). When we search for that, we see lots and lots of output.

$ strace -f -e openat bundle 2>&1 | tail
[pid  6684] openat(AT_FDCWD, "/usr/lib/locale/en_US/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  6684] openat(AT_FDCWD, "/usr/lib/locale/en.UTF-8/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  6684] openat(AT_FDCWD, "/usr/lib/locale/en.utf8/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  6684] openat(AT_FDCWD, "/usr/lib/locale/en/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  6684] +++ exited with 0 +++
[pid  6683] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=6684, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
[pid  6683] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=6683, si_uid=0, si_status=0, si_utime=0, si_stime=1} ---
Could not locate Gemfile
+++ exited with 10 +++

Now we’re getting somewhere! But the fact that we didn’t know what syscall to search for might make us wonder: are there other open() variants that we’re not seeing here?

This is almost the right question. A better question to ask is: what are all the possible syscalls this program might use to interact with files?

Because as it turns out, just looking for file opens won’t give us the information we’re seeking. A common pattern in code is to first test for the existence of a file, and then only if it is found, try to open it. So just looking for file opens isn’t going to teach us much about where Bundler goes looking for configuration.

Happily, we don’t have to spend all day learning the Linux programming API to figure out which system calls we ought to be filtering for.

We can provide the %file shorthand to the -e flag. This asks strace to show us every system call that interacts with files, and nothing else.

# strace -f -e %file bundle 2>&1 | tail
[pid  7332] openat(AT_FDCWD, "/usr/lib/locale/en_US/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  7332] openat(AT_FDCWD, "/usr/lib/locale/en.UTF-8/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  7332] openat(AT_FDCWD, "/usr/lib/locale/en.utf8/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  7332] openat(AT_FDCWD, "/usr/lib/locale/en/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  7332] +++ exited with 0 +++
[pid  7331] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=7332, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
[pid  7331] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=7331, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
Could not locate Gemfile
+++ exited with 10 +++

When we run strace this way, one thing we can immediately see is that there are a lot of stat and lstat calls, which are used to test for the existence of a file. So it’s a good thing we switched to using the %file shortcut!

But this is still an awful lot of information to sift through. And most of it has to do with loading libraries, not searching for configuration files.

After scrolling through this output for a while, we realize that most of the paths we don’t care about start with /lib or /usr. At this point we can turn to good old grep to add another layer of filtering.

We pipe the output through grep, with the -v flag that tells grep to invert its usual behavior and filter out any lines that match the provided expression. Then we give it patterns to match /lib or /usr for exclusion.

This time when we scroll through the output, we see we’ve narrowed it down enough to be interesting!

We can see Bundler looking for filenames like gems.deps.rb, gems.rb, Gemfile, and Isolate in various directories up the tree from the current directory. Later, we see .bundle/config referenced. That’s exactly the kind of data we’ve been looking for!

# strace -f -e %file bundle 2>&1 | grep -v -e /lib -e /usr | head
openat(AT_FDCWD, "/proc/self/maps", O_RDONLY|O_CLOEXEC) = 3
stat("/root", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
stat("/root/.gem/ruby/2.7.0", 0x7ffdc31436d0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/root/.gem/ruby/2.7.0/specifications", O_RDONLY|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/root/.gem/ruby/2.7.0/specifications", O_RDONLY|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
getcwd("/workspaces/examples", 4096)    = 21
getcwd("/workspaces/examples", 4096)    = 21
chdir("/workspaces/examples")           = 0
stat("gem.deps.rb", 0x7ffdc3143de0)     = -1 ENOENT (No such file or directory)
stat("gems.rb", 0x7ffdc3143ed0)         = -1 ENOENT (No such file or directory)

This output will warrant some careful study. But we now have a source of hard data, that we can cross-reference against the Bundler documentation and the source code to really understand exactly what user-files Bundler looks for to do its work.

The immediate moral of this story is: strace is awesome, and if you’re working on a platform that supports it, it should totally be in your toolbox.

The larger moral is this: when you are trying to understand a tool or a library, and you find yourself building airy castles of speculation… stop, and look for a way to collect concrete data. In computers, there’s usually a choke point that can be instrumented and interrogated for facts. But finding it may require taking a step back and working at a level of the system than you’re not used to thinking about.

Happy hacking!

Responses