Deconstruct Seattle, WA - Thu & Fri, Apr 23-24 2020


Bibliography

Flame Graphs (Brendan Gregg)

Transcript

(Editor's note: transcripts don't do talks justice. This transcript is useful for searching and reference, but we recommend watching the video rather than reading the transcript alone! For a reader of typical speed, reading this will take 15% less time than watching the video, but you'll miss out on body language and the speaker's slides!)

All right. I'm Julia, and this talk is called Build Impossible Programs. So this is about a program that I built that I really wasn't sure was going to work when I started building it. It's a Ruby profiler.

So something that is important to understand about me for this talk is that I really love debugging and profiling tools. I've written a lot of zines, which are like, hello, did you know that you can use strace to figure out what system calls your programs are using? What about tcpdump? Oh my goodness! So I'm kind of obsessed with tools that can tell you what your programs are doing.

And I ran into this problem with Ruby where sometimes I would have a Ruby program that was using 100% of my CPU. And I'd be like, what? Why? Like, what's happening? I should be able to know. I am entitled to this knowledge, right? And I wanted to be able to find out. And I was really upset, even when it wasn't that big of a deal. I just really wanted to know.

So I came up with this idea in my head. And I was like, well, what if I could write a profiler that could tell me for any Ruby program I have running anywhere, no matter what, it will tell me what it's doing, like, why it's using the CPU? So the kind of interface I wanted this to have is I would have my profiler, and I would start it up. And I would be like, hey, this PID, like, 2345, what is it doing? And then it would tell me, well, this is what function it's running right now, and maybe it would take, like, 100,000 samples, and then it would draw me a beautiful graph.

But the basic thing it needed to be able to do is to be able to find out, at least once, what the program was doing. And this was not possible, really. And what I knew was that profilers like this existed for C and for Java, and for a few other languages. For Java you have YourKit, which is this really great profiler, and you just attach it to any Java program and it gives you this beautiful interface and you can find out anything, and it's incredible.

For C, you have perf, which I also wrote a zine about because it's delightful. You just run perf top, and then it tells you what functions are running. It's like top, but for functions. And I was like, I want that for Ruby, right? Like, why not? What's going on?

And so at the time I thought like, well, maybe the reason that no one has built this is that it's impossible, because it can't be done. The thing I want to address in this talk is like, I think there are a lot of myths about doing innovative work, perhaps spoken by your inner critic who is like, hey, if you have an idea for something you probably shouldn't start working on it because it's not possible. You don't want to waste your time. And I really don't like wasting my time, so this is a pretty compelling argument, right?

So the three myths that I want to start out by talking about are myth one-- to do something new and innovative you need to be an expert-- myth two-- if it were possible and worthwhile, someone would have done it already so you probably shouldn't try-- and three-- if you want to do a new open source project, you need to code a lot on the weekend and your evenings.

So myth one-- you need to be an expert. So before I started working on this project, I had a few, like, minor issues. I did not know anything about Ruby internals. I had never contributed to a Ruby open source project. I had never written a profiler or debugger. To do this, I needed to use one of C++, Rust, or C, and my skills in those were, like, pretty beginner. I'd written snake in C, and I could like, allocate-- but not free-- memory. So I wasn't really in the best spot, you might think.

But there's a very useful alternative to being an expert, which is you can, step one, find a starting point, step two, spend some time learning about stuff, and then build a prototype, right? And prototypes are really cool because they don't have to work. And if they don't work, you're like, well, I was just building a prototype. So you have a lot of plausible deniability, right?

So this is what I tried to do. I was talking to my friend Julian about this in 2016, about this idea. And he was like, oh yeah. That's totally possible. Did you know about the system call called process_vm_readv that lets you read arbitrary memory from another process? And I was like, that sounds useful, right? Like, that would help me figure out what my program was doing, potentially, if I can read the Ruby interpreter's memory. I don't know what memory I would read yet, but being able to read the Ruby interpreter's memory sounds like it could tell me what the Ruby interpreter's doing.
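(Editor's note: a minimal sketch, not from the talk, of what calling process_vm_readv can look like. It uses Python's ctypes to reach the libc wrapper, and the helper name read_process_memory is made up for illustration. It assumes Linux with glibc, and reading another process's memory generally needs the same permissions as attaching a debugger to it. The profiler described in this talk is written in Rust, not Python.)

```python
import ctypes
import os


class IoVec(ctypes.Structure):
    """Mirror of C's `struct iovec`: a base address plus a length."""
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]


def read_process_memory(pid, address, length):
    """Copy `length` bytes starting at `address` out of process `pid`.

    Hypothetical helper for illustration; Linux/glibc only.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    fn = libc.process_vm_readv
    fn.restype = ctypes.c_ssize_t
    fn.argtypes = [
        ctypes.c_int,                           # pid
        ctypes.POINTER(IoVec), ctypes.c_ulong,  # local buffers and their count
        ctypes.POINTER(IoVec), ctypes.c_ulong,  # remote ranges and their count
        ctypes.c_ulong,                         # flags (must be 0)
    ]
    buf = ctypes.create_string_buffer(length)
    local = IoVec(ctypes.addressof(buf), length)
    remote = IoVec(address, length)
    n = fn(pid, ctypes.byref(local), 1, ctypes.byref(remote), 1, 0)
    if n < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return buf.raw[:n]
```

A process is allowed to read its own memory this way, so a quick check is to create a buffer with ctypes.create_string_buffer, pass os.getpid() and ctypes.addressof(buffer), and compare the bytes that come back.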

And the second thing that happened in 2016 is I was talking to this guy Scott Francis, who works at Shopify, and he works on performance there. And he said hey, I wrote this blog post that tells you how to figure out what your Ruby program is doing with GDB because the Ruby interpreter is a C program. GDB is a debugger for C programs. So you can use GDB to figure out what Ruby is doing.

So that was really exciting. It wasn't really what I-- like, I wanted something with a better interface and that was easier to use than GDB. But I was going to spend a week at the Recurse Center, which is like this writing retreat for programmers, for Never Graduate Week, which is like the alumni reunion week. And I was like, oh, I'll spend this delightful week with these delightful people building this cool prototype of this certain thing I'm excited about. And if it doesn't work, it doesn't matter. I'll learn something.

So here is everything I'm going to tell you about the Ruby interpreter in this talk. So if you type this incantation at the top right, ruby_current_thread arrow cfp arrow iseq arrow body arrow location arrow label, and then you cast it to struct RString*, and you print it, you will get the name of the currently running Ruby function in your Ruby interpreter. And this, like-- at that time, I didn't understand this, right? If you've not seen this before, you might also not understand this. But I was like, sweet, that's what I want, right? Like, I want to know what's running. And basically what happens behind the scenes is it just reads a bunch of memory and follows a bunch of pointers, and then it eventually gets to the right string inside the Ruby interpreter's memory.
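(Editor's note: a toy sketch, not from the talk, of "reads a bunch of memory and follows a bunch of pointers". The memory is a fake byte buffer built by the example itself, and every offset, address, and helper name is invented. Real Ruby offsets differ per version, and the real label is a Ruby string object rather than the plain NUL-terminated string used here.)

```python
import struct

BASE = 0x1000  # pretend the fake memory image starts at this address


def read_u64(mem, addr):
    """Read a little-endian 64-bit pointer stored at an absolute address."""
    return struct.unpack_from("<Q", mem, addr - BASE)[0]


def read_cstring(mem, addr):
    """Read a NUL-terminated string stored at an absolute address."""
    start = addr - BASE
    end = mem.index(b"\x00", start)
    return mem[start:end].decode("utf-8")


# Hypothetical (field, offset) pairs: which pointer to follow at each hop.
CHAIN = [("cfp", 0x10), ("iseq", 0x08), ("body", 0x00), ("label", 0x18)]


def current_function_name(mem, thread_addr):
    """Follow thread -> cfp -> iseq -> body -> label and return the text."""
    addr = thread_addr
    for _field, offset in CHAIN:
        addr = read_u64(mem, addr + offset)
    return read_cstring(mem, addr)


def build_fake_memory(name):
    """Lay out four fake structs so that the chain above ends at `name`."""
    mem = bytearray(0x100)
    struct.pack_into("<Q", mem, 0x00 + 0x10, BASE + 0x20)  # thread.cfp
    struct.pack_into("<Q", mem, 0x20 + 0x08, BASE + 0x40)  # cfp.iseq
    struct.pack_into("<Q", mem, 0x40 + 0x00, BASE + 0x60)  # iseq.body
    struct.pack_into("<Q", mem, 0x60 + 0x18, BASE + 0x80)  # body label pointer
    data = name.encode("utf-8") + b"\x00"
    mem[0x80:0x80 + len(data)] = data                      # the string itself
    return bytes(mem)
```

Calling current_function_name(build_fake_memory("fetch_users"), BASE) walks the four pointers and returns the string stored at the end of the chain.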

Anyway, so once I had this incantation, I was like, cool, I have a starting point, right? I can reverse engineer how GDB does this, and then I can write my own program that does the same thing. And I spent a lot of time staring at it with this face kind of like, what's happening? But eventually I figured it out. And then a week later, I had, like, an extremely sketchy demo, which only worked on my computer, which was great, right? This was a huge improvement over the current, like, this is impossible.

I wrote a blog post about it. I was like, hey, this is so fun. People were like, that's cool, Julia. But it wasn't like a viable tool, right? Like, it didn't work. And I didn't work on it for over a year after that. And so my prototype, at the time, had really serious problems. It depended on a lot of really unstable details about Ruby interpreter internals, right? Like, the Ruby interpreter changes how it works all the time, which is an issue. Like, if it works on Ruby 2.3.2 and not Ruby 2.3.3, that's not a good user experience. And it required you to have debugging symbols in your Ruby installation, which was also not great.

And so I kind of came to the second myth, which was like, well, OK. I have this cool prototype, but if it were really possible to build a real version of this that actually works, like, probably someone would've done it already, right? Like, I think it's a cool idea. Probably someone else thought it was a cool idea. Like, maybe they would have done it.

And so three months later I gave this talk at RustConf where I was basically just like, hey, I love Rust. It helps me learn stuff. It's great. And I talked about my profiler that I built because I built it in Rust, because I can't free memory in C, basically.

And so I was talking to Yehuda Katz, who's this extremely accomplished open source developer who also works on profilers. And I was like, hey, I want to make this profiler. And I kind of described what I wanted to do. And he was like, oh, that sounds, like, difficult, and maybe potentially impossible. I don't know. And I was like, oh, but I have this prototype. It works. I think it could work. And he was like, well, go for it. Like, probably the reason no one has done it is that no one kind of thought to spend a lot of time on it, right? And if you think it's a cool idea, you should just try. And I was like oh, wow! Interesting! Right? Like, maybe interesting points.

Of course, the reality is that not that many people work on Ruby profilers, right? And I think the same is true for a lot of other things. Like, for most topics, not that many people work on that thing. And so just because I have an idea which is potentially a good idea, maybe just no one else has decided to spend a lot of time working on that thing before.

OK, so now let's talk about time, which is a major issue. Well, this is a myth. It's frequently true. Like, a lot of people do work on their ideas for innovative projects in their spare time. But this is kind of the issue of, like, I think I spent 400 hours working on this project in the end. So it's like, where am I going to find 400 hours, right? Like, where? Where does it happen? Am I going to just spend all my weekends doing this?

I don't do that. Like, I don't program after work or on the weekend, really. I do write blog posts and draw a lot of comics. But I don't program. And so I was like, OK, well I'm never going to work on this, right? Which is what I did-- I didn't work on it. And where I ended up finding the time is I got funded to work on it for three months. So here's how that happened.

There's this thing called the Segment Open Fellowship. And I saw this announcement. And it was like, hey, this is a three-month program that supports you to focus on your project for three months. And I was like, oh. That sounds cool. Like, that sounds really fun. I would like to do that.

And I didn't think that they would accept me because I'm not like an experienced open source developer. I had never built a real open source project ever, which I didn't mention in my application. I just-- like, I didn't say that I had. But I also didn't say that I hadn't. But I was just like, well, if I'm not getting rejected I'm not being ambitious enough. I should just apply for this and see what happens. What happened was that they accepted me, surprisingly, which was very exciting.

And then I was like, OK, cool. I have three months to work on this. What do I do now? Like, how do I actually spend my time-- which is what the rest of this talk is about. I was like how I-- because I wanted to make it a real project that actually worked and that people actually used, which I'd never done before. So like, how do you do that, right? So my tactics here were to use Rust, which we're going to talk about, survey the space, do some testing, work on usability, and do some documentation.

So about choosing Rust-- I didn't really know Rust that well when I started. I think there's kind of this idea that if you're trying to do something new and ambitious you should use something that you're already comfortable with, and this was not true. Like, when I started writing Rust for this project again, I had done some Rust before, but I was like, what is a reference? Like, how does it work? I had a lot of pretty basic questions about the Rust programming language which were still unanswered.

But the reason Rust was important for me was that I needed to work with C data structures, I needed my program to be fast-- because if you're writing a profiler that's going to figure out why another program is slow, your program cannot also take up 100% of the CPU, right? Like, the profiler actually does need to be fast. And I probably wanted it to run, like, whatever, like maybe 100 times faster than the Ruby program I'm profiling, which is doable, right, because it's in Rust. And I needed to work with C data structures.

And I also didn't-- like, I wanted to give people my program to run on their computers. And I didn't want it to have a lot of memory leaks and segfaults. And given the fact that I didn't know any of Rust or C or C++ that well, I think the only really viable language for me personally to use was Rust, right? Because with Rust, my program doesn't have memory leaks and it doesn't segfault and it works, which is kind of shocking. Like, I don't think that could have happened in another language.

So I think if you were trying to do something ambitious, then you might need to learn something new to do it. And it was fine. Like, I can accomplish a lot as, like, a mediocre Rust programmer. And it's really great.

The next thing I did, kind of like right before I started, was I was like, OK. I want to build something new in this space, and I want it to work. So what I'm going to do is I'm going to survey every past Ruby and Python profiler, right? I'm just going to get a list of, like, 15 or 20 of them. I think there's maybe like really 10 Ruby and Python profilers, or so. And I'm just going to figure out exactly how each of them works.

This might seem like a daunting project, but it actually only took one day because there are basically three kinds of profilers for dynamic programming languages. The first kind of profiler you have is a tracing profiler. So the way a tracing profiler works is it lives inside the same process as your program. And every single time a function is called it saves it. And it's like, OK, this function got called. OK, this function got called. OK, this function got called. These profilers have a lot of overhead, and they weren't what I was interested in. But they're a useful class of profiler, right?
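(Editor's note: a minimal sketch, not from the talk, of the tracing idea, using Python's built-in sys.setprofile hook. The function names trace_calls, helper, and work are made up for illustration. The hook runs inside the same process and fires on every function call, which is where the overhead comes from.)

```python
import sys
from collections import Counter


def trace_calls(func, *args, **kwargs):
    """Run `func` and count every Python-level function call it makes."""
    calls = Counter()

    def on_event(frame, event, arg):
        # "call" fires once each time a Python function is entered.
        if event == "call":
            calls[frame.f_code.co_name] += 1

    sys.setprofile(on_event)
    try:
        func(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always remove the hook again
    return calls


def helper(n):
    return n * 2


def work():
    total = 0
    for i in range(5):
        total += helper(i)
    return total
```

Running trace_calls(work) records one call to work and five calls to helper.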

The next kind of profiler is a sampling profiler, which also lives in the same process as your program, but instead of tracing every single function call, it will, instead, like every 10 milliseconds or whatever, be like, OK, this is what happened now. OK, this is what happened now. OK, this is what happened now. And the way that they typically run every 10 milliseconds is they'll ask like, for example, the Linux kernel, to give them a signal every 10 milliseconds using a system call called setitimer.

And the reason it's easy to tell if a profiler is like this is you can just literally do grep setitimer in its source code. And if it's there, then you're like done. Like, I know how it works, right? And so I didn't need to read all the code. I just needed to grep for setitimer once I learned that that was a way that profilers worked.
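(Editor's note: a minimal sketch, not from the talk, of an in-process sampling profiler built on setitimer, using Python's signal module. The names sample_profile and busy are made up. It assumes a Unix system and must be called from the main thread. Note that this is the in-process kind described above, not the outside-the-process approach this talk is about.)

```python
import signal
import time
from collections import Counter


def sample_profile(func, interval=0.01):
    """Run `func`, noting which function is running every `interval`
    seconds of CPU time. Unix only; call from the main thread."""
    samples = Counter()

    def on_tick(signum, frame):
        # `frame` is the Python frame that was executing when the
        # timer signal was handled.
        if frame is not None:
            samples[frame.f_code.co_name] += 1

    old_handler = signal.signal(signal.SIGPROF, on_tick)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        func()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop the timer
        if old_handler is None:
            old_handler = signal.SIG_DFL
        signal.signal(signal.SIGPROF, old_handler)
    return samples


def busy(cpu_seconds=0.3):
    """Burn CPU so the profiling timer has something to sample."""
    deadline = time.process_time() + cpu_seconds
    total = 0
    while time.process_time() < deadline:
        total += 1
    return total
```

Calling sample_profile(busy) returns a counter in which busy should account for essentially all of the samples, since that is where the CPU time goes.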

And then there was the last profiler, which is Pyflame, which was exactly what I wanted to do, except for Python. And I was like, oh, amazing! I'm going to spend the rest of my time learning how this thing works. I think I don't have time to explain a lot about how Pyflame works, but the main thing that was relevant to me was-- so if you're reading the Ruby interpreter's internals, you have all these different data structures. And you need to know how they're laid out in order to navigate them.

And so the way that I was figuring out how they were laid out was at runtime, using this debugging information in this format called DWARF. And the way Pyflame did it was at compile time by using the Python interpreter's header files. And this compile time thing was really attractive to me because I thought it would be good for performance, and I also thought that it would be easier to test. And it would also make it work for people if I could just decide ahead of time how to make it work for all people.

And so what I ended up doing was I ended up making a, like, major design change from my prototype to the actual thing that I built where I also found what the [INAUDIBLE] was at compile time. And this was super useful. And I think it's a good example of why doing sort of like a literature review of existing work is really important, because there might be some ideas out there that you can steal, right, and be like, I will do that too.

This was a lot like-- so for Python, the header files that it used were Python's public header files. And instead, I had to use header files that were internal to the Ruby interpreter, and changed every Ruby version. So I have, like, a very large list of, effectively, header files that change every time Ruby changes. Like, first you have 2.3.2, and then 2.3.3. And it's like a little janky. And also, there's some ifdefs, which we're not going to talk about ifdefs. Anyway, I just guessed about all the ifdefs, and hoped that they were later in the struct. Anyway, we won't talk about that.
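(Editor's note: a toy sketch, not from the talk, of keeping one struct layout per interpreter version and picking the right one by version string. The struct names, fields, and version numbers are invented, and ctypes stands in for bindings generated from header files. The point is only that the same field can sit at a different offset in each version, so the layout has to be chosen before any memory is interpreted.)

```python
import ctypes


class ControlFrameV1(ctypes.Structure):
    """Made-up layout for a hypothetical interpreter version "1.0"."""
    _fields_ = [("pc", ctypes.c_uint64), ("iseq", ctypes.c_uint64)]


class ControlFrameV2(ctypes.Structure):
    """Made-up layout for version "2.0": a new field pushes `iseq` down."""
    _fields_ = [
        ("pc", ctypes.c_uint64),
        ("sp", ctypes.c_uint64),
        ("iseq", ctypes.c_uint64),
    ]


# One entry per supported version, decided ahead of time.
LAYOUTS = {"1.0": ControlFrameV1, "2.0": ControlFrameV2}


def read_iseq_pointer(version, raw_bytes):
    """Interpret raw bytes using the layout that matches `version`."""
    layout = LAYOUTS[version]
    frame = layout.from_buffer_copy(raw_bytes)
    return frame.iseq
```

Here ControlFrameV1.iseq.offset is 8 while ControlFrameV2.iseq.offset is 16, so reading version "2.0" memory with the "1.0" layout would silently return the wrong pointer.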

Oh yeah, testing! OK, so we're trying to write this profiler, right? And it needs to work on all these different Ruby versions. And it needs to kind of be, like, binary compatible with them, right? It's like, oh, if someone is using Ruby 1.9.3, it needs to be able to navigate the Ruby 1.9.3 program's memory. That's possibly like running a 10-year-old-- or, I don't know how old Ruby 1.9 is. Anyway, whatever-- like, six-year-old version of Ruby. And it needs to just work, right? And like, how do you make that happen?

And so what I ended up doing, which was my amazing partner Kamal's idea, was to collect core dumps, like memory snapshots, from a bunch of different Ruby versions, and then make a unit test that tested on those. So basically, you take a bunch of different Ruby interpreters, and then you just save their memory to disk. And then you run the profiler on them, as if that program was running. And then you just have unit tests that are testing this, like, sort of extremely difficult test thing.
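(Editor's note: a toy sketch, not from the talk, of testing against saved memory images. The class SnapshotMemory, the function read_label, and the fixture contents are all invented. In a real project the snapshot files would be captured from real interpreters and checked in, rather than generated inside the test as they are here. The idea is that the reading code only needs something with a read method, so a file on disk can stand in for a live process.)

```python
import os
import tempfile
import unittest


class SnapshotMemory:
    """Serves reads from a saved memory image instead of a live process."""

    def __init__(self, path, base_address):
        with open(path, "rb") as f:
            self.data = f.read()
        self.base = base_address

    def read(self, address, length):
        start = address - self.base
        if start < 0 or start + length > len(self.data):
            raise ValueError("address outside snapshot")
        return self.data[start:start + length]


def read_label(memory, address, max_len=64):
    """Toy stand-in for profiler logic: read a NUL-terminated label."""
    raw = memory.read(address, max_len)
    return raw.split(b"\x00", 1)[0].decode("utf-8")


class SnapshotTests(unittest.TestCase):
    # One fixture per interpreter version; the contents are invented.
    FIXTURES = {"1.9.3": b"parse_args", "2.3.3": b"render_page"}

    def test_every_version(self):
        for version, label in self.FIXTURES.items():
            with self.subTest(version=version):
                with tempfile.TemporaryDirectory() as tmp:
                    path = os.path.join(tmp, version + ".core")
                    with open(path, "wb") as f:
                        f.write(label.ljust(64, b"\x00"))
                    memory = SnapshotMemory(path, base_address=0x4000)
                    self.assertEqual(
                        read_label(memory, 0x4000), label.decode("utf-8")
                    )
```

Because each version is its own subTest, a layout mistake for one version shows up as a named failure without hiding the results for the others.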

And this was way better than just, like, releasing the profiler into the world and be like, tell me if it doesn't work, right? Because like, it probably wasn't going to work. And then I would have to deal with a lot of bug reports, and it would have been really difficult to develop. So this testing strategy was super important.

And the result was that, like, I was really worried about binary incompatibility bugs, where someone would just be like, hey, it just like mysteriously crashes on my Ruby version, because it is reading, like, the wrong pointer. And I've only had one bug report like that in the last two months, which I think is really remarkable because it's like, originally-- my original prototype didn't work at all, right? And now it actually works for most people, which I think is kind of shocking.

Another thing I did was I invested a lot of time in usability. Like, I think I spent kind of half my time on these weird issues related to how do you make it work with an arbitrary Ruby binary running anywhere? And then I spent like half my time on how do you make it so that a human can use this and understand what's going on? Because I basically believe that features that nobody can use might as well not exist.

Like, I spend a lot of time explaining strace to people, which is this really cool tool that lets you see what system calls Python is running-- not Python, any program is running. And I kind of have this attitude which is like I invented strace, because I think a lot of people didn't know about it, and then I evangelize it a lot. And it's like, oh. If you don't know about it, it doesn't exist, right? I did not invent strace.

And so one feature that I think is important is that there's this visualization tool called the flame graph here, which basically is a way of visualizing the, like, execution of your program. So you can see sort of like-- you can't really see this. But you can see it's spent different amounts of time in different parts of the program. And it's a way to see this. And I found it to be a super useful tool for figuring out what my programs are doing.

There's this really great flame graph library by Brendan Gregg. And the way that profiler authors usually expose flame graphs to you-- they're like, oh, OK. We give you this output. You can git clone Brendan Gregg's flame graph library. You can add it to your path. Then you can, like, cat some data into this flame graph Perl script. Then it'll write an SVG, and you can open the SVG in your browser.
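(Editor's note: a minimal sketch, not from the talk, of the kind of data a profiler hands to the flame graph script. The script reads "folded" stacks: one line per distinct stack, with frames joined by semicolons from outermost to innermost, followed by a space and a sample count. The function name fold_stacks and the sample data are made up for illustration.)

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples into 'outer;...;inner count' lines.

    Each sample is a list of function names ordered from the outermost
    caller to the function that was running when the sample was taken.
    """
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


samples = [
    ["main", "handle_request", "render"],
    ["main", "handle_request", "render"],
    ["main", "handle_request", "query_db"],
    ["main", "gc"],
]

for line in fold_stacks(samples):
    print(line)
# prints:
# main;gc 1
# main;handle_request;query_db 1
# main;handle_request;render 2
```

In the resulting graph, the width of each box is proportional to how many samples included that frame, so render would be drawn twice as wide as query_db.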

And look, none of this is really all that complicated. But I thought it was too complicated. And I thought that maybe people who had never done this before might not realize that that was what you had to do, or they might not want to go through the steps. So what I did instead was I made it just work, right? Like, if you say, Rbspy, record my Ruby program, it just will create a flame graph of your program like that automatically.

And the way I did that was like this flamegraph.pl script, I checked the license on it. And I was like, oh. I think I can just like include this with my program. So I just compiled it into my binary. And now it's always there. You don't need to clone this thing, right? And it doesn't change that much. So it says, OK. Yeah, now it's compiled into my binary. And it's not that big, so it doesn't matter.

And someone left this really heartwarming comment that I felt really confirmed my approach, which was like, they were like, "I'm not sure if flame graphs are available in other tools. But the ease and accessibility of them in Rbspy brought them to me for the first time." And they mentioned that they'd used a couple of other profilers. And those profilers do support flame graphs, right? And they absolutely could have used flame graphs with those tools. But it was just, like, more complicated, right, so that they didn't realize it was possible, which is totally reasonable.

And so I think making things easy is sort of like a way to invent a feature for people, right? It's like, I didn't invent flame graphs. But I brought flame graphs to this person through this tool, just by making it the default.

The last thing I did was write documentation, because documentation is really important. I mean, I kind of tried to make it pretty a little bit. I'm not that good at web design. But I stole a thing from somewhere. And I also made a logo. I did not make this logo. I made a substantially less pretty logo. And then Ashley McNamara very kindly made a beautiful version of it for me, which was delightful. But I wanted to kind of communicate, like, this is a thing which is for you, right? It has nice documentation and it has a cool logo. Like, this is something that maybe you could be using. Please try it out.

And so I ended up spending like 400 hours on this project, which felt like a long time, right? Like, I spent maybe three months on it. And the surprising thing is that after all that, it mostly works. I have a story about this, if I have time. Do I have time? I think so. I think so. No one will tell me. I'm going to tell you a story about it mostly working.

OK, so this comment is just someone saying like, hey. I have this Rails app. It's like 10 years old. Its tests are really slow. I ran Rbspy. And then now our tests, like if you run them in a single thread, are like an hour faster. That's amazing, right? And I was really happy, because they were just reporting just using this thing which, previously, I thought was, like, impossible to build, right? And they're like, oh. It just works. And I was like, wow.

Another kind of like surprising thing to me about it just working is-- so at some point, I decided to build Mac support. And I had never programmed for Mac before. But I rented a Mac VM for, like, $40 a week, which is a good incentive to program fast, because I was like, if I could finish Mac support faster I can stop paying the $40 a week for the Mac VM.

Anyway, so I sort of like built the support, and then people started testing it out, even before I asked anyone to test it out. And they were like, hey. So I ran the unit test, and it froze my Mac. And I was like, oh. Strange.

Anyway, the fallout was it turns out that in Mac OS High Sierra, there is a bug in the kernel which has a race condition, where if you do the wrong thing, it will freeze your Mac, which was kind of exciting. It was like oh, from the time-- from like zero to like Mac kernel bug was like three days, right?

But this is not really meant to make fun of Mac. It's more just like, I think, to me, making something, like the fact that it works at all-- now people use it on Mac. And they're like oh, yeah. It works on Mac. And I'm like, whoa. Like, it's not freezing your computer? Like, it just works? That's, like, outrageous. I'm so happy. And like, fixing bugs and seeing them stay fixed and seeing the thing work, I think, is really incredible. And it's a cool feature.

And I was really happy at the end that I gave myself some time to do something which was, like, a little ambitious for me personally. So 2017 Julia had to be like-- I think I decided to work on this. I got the fellowship in June. And then I was like, hey, can you delay for six months because I actually can't just take time off work right now. And then I asked my manager, and I was like, hey, in six months, can I take three months off work? And he said yes.

And so I kind of orchestrated this thing six months in the future. And then when I got to 2018, I was like oh, wow, thank you past Julia for organizing this cool thing for me. I'm so happy. And it made me want to plan more like delightful surprises for myself in the future, and more time to do something that's important to me.

I think the issue of finding time to do ambitious work is a big issue. I feel like sabbaticals are really cool, because quitting your job to work on your weird idea for, like, a Ruby profiler, is maybe not a great idea. And so I think sabbaticals are a really cool thing. I think funding open source is a really cool thing. Thank you, Segment, for doing that. It's really delightful.

And the last thing I want to say is I saw this cool tweet by [INAUDIBLE] which was like, why should you work in software development, right? Like, what is cool about working in software development? And there are a lot of things that are cool about working in software development.

But I think that this thing about, like, there are a lot of places where there are a lot of improvements that can be made, and that there are a lot of ideas which are available for non-experts to work on, is really, really cool, right? Like, I think there are a lot of places where you can relatively easily survey the state of the art and be like, OK, cool. This is what's going on right now. This is my idea. And then maybe I can just spend a lot of time and make something which is a little bit better or just a little bit new. And I think that's really exciting. So maybe go build something impossible. Thank you.