Decompiling Binary Code with Large Language Models

cm0002@lemdro.id · 2 months ago

Beej Jorgensen@lemmy.sdf.org · 2 months ago

Now this is a great use of LLMs. Love it. So many old apps and games exist only in compiled form.

Lucy :3@feddit.org · edit-2 2 months ago

If it actually works.

I’d guess training a model on nothing but C and the resulting ASM would be much better.

WolfLink@sh.itjust.works · 2 months ago

It doesn’t look like it works very well. If I’m reading their results section correctly, it works less than 20% of the time on real-world problems.

Jankatarch@lemmy.world · 2 months ago

Why LLM? What was wrong with training a model specifically for decompiling?

MotoAsh@piefed.social · edit-2 2 months ago

LLM is being used in a colloquial way here. It just describes how the algorithm is arranged: tokenize the input, then generate output by appending the most likely next token, over and over.

It still differentiates it from simpler neural networks and other more basic forms of machine “learning” (god what an anthropomorphized term from the start…).
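
A toy sketch of the tokenize-then-append-the-likeliest-token loop described above. A hand-written probability table stands in for a trained model, and every name in it (`NEXT_TOKEN_PROBS`, `tokenize`, `greedy_decode`) is made up for illustration. Real models score tokens against the whole preceding context, not just the last token, and often sample rather than always taking the top choice.

```python
# Toy illustration only: a hand-written table stands in for a trained model.
# All names here are hypothetical and not taken from the linked project.
NEXT_TOKEN_PROBS = {
    "mov": {"eax": 0.6, "ebx": 0.4},
    "eax": {",": 0.9, "<end>": 0.1},
    ",": {"1": 0.7, "0": 0.3},
    "1": {"<end>": 1.0},
    "0": {"<end>": 1.0},
    "ebx": {"<end>": 1.0},
}


def tokenize(text):
    """Split on whitespace; real tokenizers use learned subword vocabularies."""
    return text.split()


def greedy_decode(prompt, max_new_tokens=10):
    """Repeatedly append the single most likely next token (greedy decoding)."""
    tokens = tokenize(prompt)
    if not tokens:
        return tokens
    for _ in range(max_new_tokens):
        # This toy "model" only looks at the last token.
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break  # unknown token: nothing to predict
        best = max(candidates, key=candidates.get)
        if best == "<end>":
            break  # the model decided the sequence is finished
        tokens.append(best)
    return tokens


print(greedy_decode("mov"))
```

The point is only the shape of the loop: nothing in it knows anything about decompiling, which is why the same arrangement can be trained on assembly and C instead of prose.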

WolfLink@sh.itjust.works · 2 months ago

They did train a model specifically for decompiling.

zygo_histo_morpheus@programming.dev · 2 months ago

Is the decompiled code guaranteed to be equivalent to the compiled code? While this might be cool, it doesn’t seem that useful if you can’t reason about the correctness of the output. I skimmed the README and didn’t manage to figure it out.

jacksilver@lemmy.world · 2 months ago

I can’t speak for this specific approach/system, but no. LLMs never really guarantee anything, and for translation roles like this, it’s hard to say how much help they provide. The main issue is that you now have to understand what the LLM generated before you can start fixing and/or debugging it.

cm0002@lemdro.id · edit-2 2 months ago

From my understanding, it tries to tackle the hardest part: getting from assembly back to something human-readable, though not necessarily compilable out of the gate.

A large part of the tedious and intensive process of decompilation is just figuring out which chunks of ASM do what and working them out into named functions and variables.

oplkill@lemmy.world · 2 months ago

I don’t get it. How is it better than Ghidra? Or does it try to name functions, variables, and types too, which is the hard work?

cm0002@lemdro.id · 2 months ago

Or does it try to name functions, variables, and types too

It tries to do exactly that. It actually uses Ghidra for the initial decompilation.

oplkill@lemmy.world · 2 months ago

Mmm, exciting. Will it guess unknown global array variables, where god knows where they start and end? From the example in the git repo, it seems to work only on individual functions, not on the whole program with its global variable space.

GitHub - albertan017/LLM4Decompile: Reverse Engineering: Decompiling Binary Code with Large Language Models