maxaf 3 months ago

I'm curious to see what the decompiler would show if it were run over binaries produced by Go or Rust. This might be an invaluable tool to help programmers in higher level languages than C understand what really goes on under the hood.

  • runevault 3 months ago

    Depends. With Rust, for example, the generated code may not be that different in some cases; the difference is that the compiler already checked at compile time to make certain you weren't doing certain bad things.

  • bennofs 3 months ago

    I haven't tried it, but I would be very surprised if it yielded good results for go code. Go assembly, from what I've seen, looks very different from the assembly produced by a C compiler. In particular, the calling convention will be different and everything involving strings or other non-integer types will be completely different.

    • matt-noonan 3 months ago

      Actually, I'd expect Go to work really well! Not if you tried to generate C code that compiles to a given Go binary, of course.

      But that's not what you'd do. Instead, you'd use the Go compiler as the black box and search for Go code that produces the given binary. And for that, Go might work better than C: the speed of the compiler would help churn through generations that much faster.
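
      Roughly the shape I have in mind -- a toy sketch, written with cc and C here just for brevity (substitute `go build` and Go source for the Go version), and with a deliberately strawman mutation step:

          /* Toy "compiler as fitness oracle" loop: compile a candidate, score it
           * by how many bytes of the target it reproduces, keep the best guess.
           * File names and the mutate step are placeholders, not the real thing. */
          #include <stdio.h>
          #include <stdlib.h>
          #include <string.h>
          #include <time.h>

          static char *read_file(const char *path, size_t *len) {
              FILE *f = fopen(path, "rb");
              if (!f) { perror(path); exit(1); }
              fseek(f, 0, SEEK_END);
              *len = (size_t)ftell(f);
              rewind(f);
              char *buf = malloc(*len);
              fread(buf, 1, *len, f);
              fclose(f);
              return buf;
          }

          /* Fitness: how many bytes of the compiled candidate match the target. */
          static size_t score(const char *src, const char *target, size_t tlen) {
              FILE *f = fopen("cand.c", "w");
              fputs(src, f);
              fclose(f);
              if (system("cc -O2 -c cand.c -o cand.o 2>/dev/null") != 0)
                  return 0;                      /* doesn't even compile */
              size_t clen, same = 0;
              char *cand = read_file("cand.o", &clen);
              for (size_t i = 0; i < clen && i < tlen; i++)
                  same += (cand[i] == target[i]);
              free(cand);
              return same;
          }

          /* Strawman mutation: flip one character. A real system would mutate
           * tokens or AST nodes drawn from a corpus of real source code. */
          static char *mutate(const char *src) {
              static const char alpha[] = " \n{}();=+-*<>!&|ixn0123456789";
              char *out = strdup(src);
              out[rand() % strlen(out)] = alpha[rand() % (sizeof alpha - 1)];
              return out;
          }

          int main(void) {
              srand((unsigned)time(NULL));
              size_t tlen;
              char *target = read_file("target.o", &tlen);
              char *best = strdup("int main(void){return 0;}\n");
              size_t best_s = score(best, target, tlen);
              for (int gen = 0; gen < 100000; gen++) {
                  char *cand = mutate(best);
                  size_t s = score(cand, target, tlen);
                  if (s > best_s) { best_s = s; free(best); best = cand; }
                  else { free(cand); }
              }
              printf("best candidate (%zu/%zu bytes matched):\n%s", best_s, tlen, best);
              return 0;
          }

      The whole game is then in making mutate() smart -- drawing candidates from a corpus of real code, seeding with an existing decompiler's output, and so on -- which, from the other comments here, sounds like roughly what the paper does.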

      • XR0CSWV3h3kZWg 3 months ago

        Is go compilation that much faster?

        • zik 3 months ago

          Go compilation prioritises compile speed over executable speed by default. It's very fast.

          • XR0CSWV3h3kZWg 3 months ago

            Is it faster than gcc -O1?

            • zik 3 months ago

              From memory, yes -- it's a lot faster. Compiling and linking a program of a few thousand lines typically takes somewhere between a fraction of a second and a second.

bennofs 3 months ago

This looks really nice, but I think it will be hard to scale to larger binaries that contain a significant number of non-trivial algorithms (and not just copy-pasted stackoverflow answers). It could still be useful for decompiling individual functions or groups of functions, though.

One good application for this might be targets for which there are no good existing decompilers, such as some of the microcontroller instruction sets. As long as you have a compiler for the target (which you probably do if the architecture is in use), this technique works. Also, code for microcontrollers is often unoptimized, which should make the assembly produced by the compiler more deterministic.

chatmasta 3 months ago

Really cool project. I’ve been thinking about this problem for a few years now, and have often wanted to try something very similar. Two years ago I posted a comment on HN [0] outlining my thoughts on how a system similar to this could be used for vulnerability mining, by decompiling security patch bindiffs and looking for other software with binary regions similar to the one patched with the security update.

The idea of decompiling code into human readable form has a certain romance to it. Taken to its extreme, it’s the first step in enabling an AI to write a program.

Most researchers are too scared to touch the problem because getting it “perfect” is almost certainly NP-hard, effectively impossible. So kudos to you for trying these techniques. The success you’ve seen, however limited, certainly pushes the boundaries and hopefully gets more people interested in this problem space.

There is a lot of cool stuff you can do with decompilation. Despite it being a hard problem, you’ve demonstrated that there are many shortcuts and heuristics we can use to make solving it easier.

I really liked the quote you cited from Gabel and Su:

> general lack of uniqueness in software at levels of granularity equivalent to approximately one to seven lines of source code [...] crossing both project and programming language boundaries.

That is the key point that makes this research viable IMO. Programmers are not writing unique code, because they depend on idioms and conventions that are the most “human readable” way of expressing certain bundles of logic. Think how many times you've written a for loop that breaks on a condition, for example. Or an if(a && !b) {...} else {...}.

I’m not surprised at all that so many lines of code are just reinventing tiny wheels over and over again, because ultimately those 1-10 line snippets of control structures are the second-order building blocks of programming. Simplifying things a bit, the only changes between implementations of a common idiom are the variable names and scope, because these are what contextualize the idiom to a human reader. But the compiler doesn’t care about that.
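
To make that concrete, here's a toy illustration of my own (not from the paper): two functions that look "different" to a human but are the same idiom underneath:

    #include <stddef.h>

    /* Same idiom, different names: linear scan that breaks on a condition. */
    int find_user(const int *ids, size_t n, int wanted) {
        for (size_t i = 0; i < n; i++)
            if (ids[i] == wanted)
                return (int)i;
        return -1;
    }

    int port_index(const int *ports, size_t count, int port) {
        for (size_t j = 0; j < count; j++)
            if (ports[j] == port)
                return (int)j;
        return -1;
    }

Compile both at -O2 and you should get essentially identical machine code for each; only the symbol names differ, and a stripped binary doesn't even keep those. The names and scope exist purely for the human reader.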

[0] https://news.ycombinator.com/item?id=11573547

jcranmer 3 months ago

Having read the paper, I think this effort, while interesting, raises some red flags as to how viable the approach actually is.

The first big one is that their approach relies on matching the compiler and flags. While they claim this is relatively easy to determine (though the cited paper doesn't attempt to match flags), there is no indication of what impact missing or incorrect information has on the results. I'm left to guesstimate from their main experiment.

Let's talk about that. It seems that a substantial fraction of the good results comes from having a substantially correct seed. Without any decompiler helpers, they were able to byte-equivalently decompile exactly 4 programs, one of which was hello world. With the decompiler's output as a seed, they got 10 programs--with the decompiler's output already providing byte-equivalent results for 6 of those. Without an oracle for program correctness, it's hard to get an idea of what, say, a 60% equivalent program really means. Their own example--"the answer was argv[i], but we did argv instead"--is concerning: it suggests that close is not the same as accurate.

One thing that's completely understated: they had to seed values. Having to seed strings is understandable, but then they give the example of the compiler generating (x + x + x) << 2 for 12 * x. The Project Euler example they couldn't decompile is "largest prime factor", and given that they did worse on the optimized binary than on the unoptimized one, I'm guessing there are some small-constant division and/or modulo operations that they can't seed.
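
To illustrate what that seeding problem looks like (my own sketch of typical -O2 behaviour; the *12 case is the one they cite):

    unsigned times12(unsigned x) {
        /* Usually lowered to an lea plus a shift, i.e. (x + x + x) << 2;
         * the literal 12 never appears in the binary. */
        return x * 12;
    }

    unsigned div7(unsigned x) {
        /* Usually no div instruction at all: a multiply by a fixed-point
         * "magic" reciprocal (something like 0x24924925) plus shifts,
         * so the 7 doesn't appear in the binary either. */
        return x / 7;
    }

If neither the 12 nor the 7 survives into the binary, a source-level search has nothing obvious to recover those constants from, which would fit the largest-prime-factor failure.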

The examples they evaluate on are tiny--the largest one is 88 lines (I'm guessing that's including whitespace and #include <stdio.h>, judging from the size of hello world). They also look guaranteed not to trigger many advanced compiler optimizations (e.g., vectorization). It would be more interesting and useful to see results on code from programming competitions or things like the Benchmarks Game, which are somewhat designed to exercise more advanced features.

Scaling is, as others have noted, not demonstrated at all. They claim that it should be solved by doing functions individually, but it is worth noting that even finding the functions is a very challenging problem in decompilation. Moreover, interprocedural optimizations are an important part of optimizing compilers, and some of those optimizations are not going to be easily undone by a function-by-function approach (devirtualization comes to mind).
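
As a concrete (made-up) illustration of that function boundary problem:

    /* A small static helper like this is almost always inlined at -O2,
     * so in the stripped binary there is no call, no symbol, and no
     * boundary left for a decompiler to carve the functions apart at. */
    static int clamp(int v, int lo, int hi) {
        return v < lo ? lo : (v > hi ? hi : v);
    }

    int scale(int v) {
        return clamp(v, 0, 255) * 3;
    }

A function-at-a-time search would have to rediscover that scale was originally two functions, and LTO or devirtualization only makes that worse.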

At the end of the day, while intriguing, this approach doesn't seem fruitful. The quality is still quite poor, despite running on very tiny programs. There's no evidence that it can handle advanced compiler techniques (such as vectorization), and there are some a priori reasons to doubt that it can. It doesn't free the decompiler writer from having to put a fair amount of effort into solving hard problems (identifying functions), nor does it free them from having to do brittle pattern matching (the extent to which that is necessary is not made clear).

westoncb 3 months ago

I recently had an idea for doing something like this with encoded/somehow transformed/encrypted natural language documents—anyone know if such things already exist?

Havoc 3 months ago

That's actually a really cool application of machine learning.