Python for Reverse Engineering 1: ELF Binaries

173 points by xrisk 5 years ago

> I’m not sure why it uses puts here? I might be missing something; perhaps printf calls puts.

It’s because you passed a constant string with no format specifiers to printf, so the compiler decided a full printf call was not worth it and used puts instead.

Icyphox 5 years ago

Thanks! I’d actually figured that out a little while after publishing.

billfruit 5 years ago

In general though, dealing with binary data in Python isn't particularly intuitive. Also, many Python tutorials and books fail to mention how to manipulate binary data. I feel that is one of the places where the standard library is not that rich.

civility 5 years ago

I disagree. The struct and array (not Numpy) modules are pretty great at cutting up binary data. You provide a format string and it just works.
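
To illustrate the format-string approach described above, here is a minimal sketch that uses struct to decode the 16-byte e_ident block at the start of an ELF file. The helper name parse_e_ident and the keys of the returned dictionary are made up for this example; the field layout is taken from the ELF specification, not from the thread.

```python
import struct

# e_ident layout per the ELF specification: 4-byte magic, class, data
# encoding, version, OS ABI, ABI version, then 7 padding bytes (16 total).
E_IDENT = struct.Struct("<4sBBBBB7x")

def parse_e_ident(data):
    """Decode the first 16 bytes of an ELF file (hypothetical helper)."""
    magic, ei_class, ei_data, version, osabi, abiversion = E_IDENT.unpack_from(data)
    if magic != b"\x7fELF":
        raise ValueError("not an ELF file")
    return {
        "bits": {1: 32, 2: 64}.get(ei_class),          # EI_CLASS
        "endian": {1: "little", 2: "big"}.get(ei_data),  # EI_DATA
        "version": version,
        "osabi": osabi,
        "abiversion": abiversion,
    }

# Hand-built sample header: 64-bit, little-endian, version 1, System V ABI.
sample = b"\x7fELF" + bytes([2, 1, 1, 0, 0]) + bytes(7)
info = parse_e_ident(sample)
print(info["bits"], info["endian"])
```

On a real binary, the first 16 bytes read from a file opened in "rb" mode could be passed in the same way.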
- vram22 5 years ago
  
  Might be useful for beginners to binary file I/O in Python and reverse engineering of data formats:
  DBFReader.py [1], which is part of my xtopdf toolkit [2], is a program that calls struct.unpack() multiple times to decode the fields in a DBF (XBASE) file.
  [1] https://bitbucket.org/vasudevram/xtopdf/src/default/DBFReade...
  [2] http://slides.com/vasudevram/xtopdf
  I had first written a Pascal program to read and dump DBF file data after reverse-engineering the DBF format, based on some sketchy info I had access to, years ago. Later wrote the same program in C, and later still in Python, i.e. DBFReader.py .
- jgalt212 5 years ago
  
  Your statement and his statement are both True (Python spelling variant). Of course, I don't think there are many books or tutorials geared towards beginners that deal with manipulating binary data.
  > Also, many Python tutorials and books fail to mention how to manipulate binary data
- billfruit 5 years ago
  
  I thought the format string was unintuitive if there are nested binary structures or arrays of nested binary structures.
  
  civility 5 years ago
  
  I do wish they were combined. It would be nice to handle arrays of structs and structs of arrays more gracefully, and it's unfortunate how the format strings almost (but not really) agree with each other.
  And so long as I'm asking for ponies, it would also be nice if they handled complex numbers gracefully.
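
For the arrays-of-structs case raised above, one possible approach that stays within the struct module is to give each record its own Struct and walk the array with iter_unpack. The record layout here (a uint32 count followed by id/x/y points) and the function names are hypothetical, chosen only for illustration.

```python
import struct

# Hypothetical layout: a header holding a record count, followed by that
# many fixed-size point records, each with an id and an (x, y) pair.
HEADER = struct.Struct("<I")    # uint32 count
POINT = struct.Struct("<Hff")   # uint16 id, float32 x, float32 y

def pack_points(points):
    """Serialize a list of (id, x, y) tuples into header + records."""
    body = b"".join(POINT.pack(pid, x, y) for pid, x, y in points)
    return HEADER.pack(len(points)) + body

def unpack_points(blob):
    """Read the count, then decode that many records in one pass."""
    (count,) = HEADER.unpack_from(blob, 0)
    body = blob[HEADER.size:HEADER.size + count * POINT.size]
    return list(POINT.iter_unpack(body))

points = [(1, 0.5, 1.5), (2, -2.0, 4.0)]
blob = pack_points(points)
decoded = unpack_points(blob)
print(decoded)
```

This still needs one Struct per nesting level and manual offset bookkeeping, which is roughly the awkwardness being described in the comments above.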

hultner 5 years ago

Is it just me, or is the scrolling on this site horribly broken? A shame, because the content looks great.

bhargav 5 years ago

Default behaviour seems to be overridden. I read the article and would recommend you look past the scrolling. If you are on an iDevice, reader mode will help!
Edit: Spelling

RayDonnelly 5 years ago

If you haven't seen it, also check out Project LIEF. It is very good indeed. We use it for a lot of post-build binary verification in the conda ecosystem.

Windows, macOS and Linux are all supported.

https://lief.quarkslab.com/

Icyphox 5 years ago

Hi, I’m the author of this post. Feel free to ask questions, if any.

matmann2001 5 years ago

Hey. In your C code, you write to memory beyond what you malloc'd. You malloc'd 9 bytes for 'pw', but later do "pw[9] = '\0'", which accesses the 10th byte, which doesn't belong to you.
- blattimwind 5 years ago
  
  malloc allocates aligned memory [1], so technically it's correct that he writes past the allocated memory, but technically it's also impossible for that write to fail or for that write to overwrite something else.
  [1] bonus point: for what kind of alignment? (The minimum is quite well specified, for C standards)
  
  spieglt 5 years ago
  
  https://www.gnu.org/software/libc/manual/html_node/Aligned-M...
  "The address of a block returned by malloc or realloc in GNU systems is always a multiple of eight (or sixteen on 64-bit systems)."
  I was about to say, "what if they're on a 32-bit system and so were only allocated one 8-byte block?" but then realized that since they'd requested 9 bytes, they'd be given two 8-byte blocks, or one 16-byte block on a 64-bit system. Is that right?
  
  spieglt 5 years ago
  
  Well, I guess alignment doesn't say anything about how large of a block is allocated.... And this is the clearest source I can find, which says 32 bytes. https://prog21.dadgum.com/179.html
  
  blattimwind 5 years ago
  
  > Well, I guess alignment doesn't say anything about how large of a block is allocated
  It tells you where something can't be, and because virtual memory is allocated in whole pages the "padding" so to speak will always be accessible.
  There's also the obvious truism that if you can access something in a cache line, all addresses in the cache line are safe to access. (Vectorized algorithms frequently implicitly rely on this for short reads, IOW there is no way reading a 128 or 256 bit vector can fault if just reading the first lane would not fault).
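
The page argument above can be checked with a little arithmetic. This is only a sketch, assuming 4096-byte pages and a load whose address is aligned to its own width; the function name crosses_page is made up for the example, and the result does not carry over to unaligned loads.

```python
PAGE_SIZE = 4096  # assumed page size; other sizes exist

def crosses_page(addr, width, page_size=PAGE_SIZE):
    """Return True if the byte range [addr, addr + width) spans two pages."""
    return addr // page_size != (addr + width - 1) // page_size

# A 16-byte load at a 16-byte-aligned address stays inside one page.
aligned_ok = not any(crosses_page(a, 16) for a in range(0, 3 * PAGE_SIZE, 16))

# The same load at an unaligned address near the end of a page does not.
unaligned_crosses = crosses_page(PAGE_SIZE - 8, 16)

print(aligned_ok, unaligned_crosses)
```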
  
  saagarjha 5 years ago
  
  > Vectorized algorithms frequently implicitly rely on this for short reads
  This is extremely processor-dependent and you should not be writing C if you’re relying on this.
  
  blattimwind 5 years ago
  
  > This is extremely processor-dependent
  No, it's not.
  > you should not be writing C if you’re relying on this.
  Luckily you are in no position to tell anyone what they should or shouldn't do.
  
  saagarjha 5 years ago
  
  Sorry, I misunderstood the context of that statement and thought you were talking about vectorized algorithms exploiting out-of-bounds reads in general, which is pretty dependent on the processor as to when it will work (depending on how page boundaries and cache lines are set up). And I didn't really mean my statement about using C in the prescriptive way you seem to have taken it: I was merely trying to say that you should probably be using assembly in this case, because you are relying on details of your processor that your compiler is likely to be unaware of and may penalize you for. For example, the vectorized string routines in libSystem do overshoot the end of the string because they use pcmpeqb, and they are written in assembly because they rely on alignment guarantees that are difficult to express in C. Plus it guarantees vectorization ;)
  
  blattimwind 5 years ago
  
  Ah, true, it is my turn to apologize then for interpreting your post in a rather uncharitable way.
  
  jmts 5 years ago
  
  Then one day you come back and resize the array to a multiple of the memory alignment, and BAM! Off-by-one errors, or even vulnerabilities.
  Or you enable more strict build settings and BAM! You have to go back and deal with all the places your code allows you to write off the end of a buffer because you just didn't give a damn before.
  
  saagarjha 5 years ago
  
  For Glibc on Linux, I believe this is 32 bytes. I think musl does 16 bytes, as does libSystem on macOS.
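
As a rough sketch of where a figure like 32 bytes could come from, the following models how glibc's allocator is commonly described as rounding request sizes on 64-bit Linux. The constants (an 8-byte size field, 16-byte alignment, 32-byte minimum chunk) and both function names are assumptions made for this illustration, not something stated in the thread, and other allocators or glibc builds may differ.

```python
# Assumed values for 64-bit glibc; treat them as illustrative only.
SIZE_SZ = 8     # bytes used for the chunk size field
ALIGNMENT = 16  # chunk sizes are rounded up to a multiple of this
MIN_CHUNK = 32  # smallest chunk the allocator hands out

def chunk_size(request):
    """Round a malloc request up to an assumed glibc chunk size."""
    padded = (request + SIZE_SZ + ALIGNMENT - 1) & ~(ALIGNMENT - 1)
    return max(padded, MIN_CHUNK)

def usable_size(request):
    """Bytes the caller could touch within the chunk under this model."""
    return chunk_size(request) - SIZE_SZ

for request in (1, 9, 24, 25):
    print(request, chunk_size(request), usable_size(request))
```

Under these assumptions a 9-byte request would land in a 32-byte chunk, so a write to pw[9] would stay inside the chunk. It would still be out of bounds as far as the C standard is concerned, which is the point made in the replies.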
  
  sgillen 5 years ago
  
  Still feels dirty though doesn't it? Would never want to rely on this fact..
  
  saagarjha 5 years ago
  
  Yeah, this is undefined behavior and your compiler might bite you for it.
- w0mbat 5 years ago
  
  Yes, that jumped off the page at me too, and distracted me from the rest of the article.
  
  matmann2001 5 years ago
  
  Especially given the topic, I kept jumping around to see if it was intentional. Like maybe they would use these RE tools to exploit it.
- Icyphox 5 years ago
  
  Ah my bad. I’ll make sure to fix it. Sorry about that.
75dvtwin 5 years ago

Could you briefly outline where this framework sits relative to others (e.g. https://github.com/cea-sec/miasm)? I would very much appreciate it.
Also, besides the security aspect (e.g. intrusion/virus detection), I was looking at these frameworks as something 'higher-level than assembler, and less hardware-architecture-dependent than LLVM IR'. Is there an angle where reverse engineering tools have a separate life as a better-than-assembler toolchain for low-level programming?
- Icyphox 5 years ago
  
  For starters, the purpose of this post was never to build an entire framework, like the one you’ve linked, but rather a small set of scripts to try and understand what disassemblers do under the hood. These scripts can also be tossed into some kind of automation pipeline of sorts, something like a CI/CD perhaps. There's a lot you can (potentially) do.
  As for your second question, I'm not sure I understand what you're trying to convey.

monocasa 5 years ago

Neat!

You can see some similar code I wrote in Rust here: https://github.com/monocasa/exeutils

Icyphox 5 years ago

Nice. I’ve been planning to rewrite `readelf(1)` in Nim, I’ll check out your code for some pointers :)
- monocasa 5 years ago
  
  Word, you should check out the backing library I wrote too then.
  https://github.com/monocasa/exefmt

qaq 5 years ago

Wonder why security topics never get much interest on HN. It's a huge industry with a ton of VC funding going to security startups.

daeken 5 years ago

Eh, it depends on the topic. Binary reversing stuff rarely gets much love, but there frankly just aren't too many people doing that stuff. Web security things get lots of love, usually -- I both launched and sold a web security class via HN, very successfully -- because there are just so many people who are interested in it; it's the bread and butter of the industry nowadays. And anything privacy-oriented or seriously pwned always gets clicks and upvotes.
But yeah, this stuff is good content but doesn't have much reach.
dang 5 years ago

I'd have said it's of consistently high interest. What makes you say it isn't?
- qaq 5 years ago
  
  I might be really off but it seems they rarely get more than 50 comments (unless it's some major breach).
  
  dang 5 years ago
  
  Did you see https://news.ycombinator.com/item?id=19315273? It's just one data point but you might find it interesting.
  There is a pattern where highly specialized technical posts don't get as many comments, relative to votes. Possibly reverse engineering and other security-related specialties fall into this. One can see the same thing in e.g. articles about type theory: people are interested, but don't necessarily feel qualified to add to the discussion. That's probably good if it prevents the dumb sorts of comment from getting posted, but maybe the threads would be more valuable if more users would ask questions. Then the users who know could explain, and more learning would take place.
  
  saagarjha 5 years ago
  
  Then again Ghidra was hyped for months prior to its release.
  
  qaq 5 years ago
  
  Good point. That might be because it was an NSA tool.
  
  pjc50 5 years ago
  
  Upvotes and comments are very different; even commenting beyond a certain limit counts negatively towards the article's front page position. If you want a lot of comments start an argument.
z3phyr 5 years ago

Binary, firmware and hardware level security topics are academically most satisfying and fun to me. But there is a lot of mystery in these topics, given the inherent negativity and legal grey areas people have to deal with. I guess that is one of the reasons..
rhexs 5 years ago

For one, the article seems to be impossible to read on an iPhone via safari.
- danmg 5 years ago
  
  Add this to your iphone's rss reader:
  https://damng.github.io/hackernews-rss-with-inlined-content/...
  Content will be made readable and inlined into the feed.
- kiddico 5 years ago
  
  It seems to break in a different way every time I reload the page.
  
  Icyphox 5 years ago
  
  Mind telling me which model? Could be my piss-poor CSS acting up at that resolution.
- xrisk 5 years ago
  
  What's the issue? I'll message the author.
benj111 5 years ago

I got here via the front page, which would seem to discredit your theory.
Anyway VC funding doesn't necessarily equate to being interesting.