|
| This answer is reassuring.
Based on it, I went and read the readme. The readme was also excellent, and answered every question I had. Great job, thank you, I'll be trying this. |
|
| Depends on how many tokens you want to spend.
Making the code, fully commenting it, and also giving an example after that might cost three times as much. |
|
| On a structural level it's exactly 1-to-1: HumanifyJS only does renames, no refactoring. It may come up with better names for variables than the original code, though. |
|
| Is it possible to add a mode that doesn't depend on API access (e.g. copy and paste this prompt to get your answer)? Or do you make roundtrips? |
|
| Super interesting! Since you're generating code with LLMs, you should check out this paper:
https://arxiv.org/pdf/2405.15793 It uses smart feedback to fix the code when LLMs occasionally hiccup. You could also have a "supervisor LLM" that asserts that the resulting code matches the specification, and gives feedback if it doesn't. |
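A minimal sketch of what such a supervisor loop could look like, assuming a generic `llm(prompt)` async helper (the function names and prompts here are made up for illustration, not any real API):

```js
// Hypothetical supervisor loop: a second LLM pass checks the generated code
// against the specification and feeds problems back to the generator.
// `llm` is an assumed async (prompt) => string helper, not a real library call.
async function generateWithSupervisor(spec, llm, maxRounds = 3) {
  let code = await llm(`Write code for this specification:\n${spec}`);
  for (let round = 0; round < maxRounds; round++) {
    const verdict = await llm(
      `Does this code satisfy the specification?\n` +
      `Specification:\n${spec}\n\nCode:\n${code}\n\n` +
      `Answer "OK" if it does, otherwise describe the problems.`
    );
    if (verdict.trim().startsWith("OK")) return code;
    // Pass the supervisor's feedback back to the generator and retry.
    code = await llm(
      `Fix this code so it meets the specification.\n` +
      `Specification:\n${spec}\n\nCode:\n${code}\n\nProblems:\n${verdict}`
    );
  }
  return code; // best effort after maxRounds
}
```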
|
| JS minification is fairly mechanical and comparably simple, so the inversion should be relatively easy. It would of course be tedious to do manually in general, but the transformations themselves are fairly limited, so it is possible to read minified code with just some notes to track mangled identifiers.
A more general unminification or unobfuscation still seems to be an open problem. I have written a handful of intentionally obfuscated programs in the past, and in my experience ChatGPT couldn't understand them even at a surface level. For example, the gist for my 160-byte-long Brainfuck interpreter in C has a comment trying to use GPT-4 to explain the code [1], but the "clarified version" bore zero similarity to the original code...
[1] https://gist.github.com/lifthrasiir/596667#gistcomment-47512... |
|
| Of course, it is not generalizable! In my experience though, most minifiers do only the following:
- Whitespace removal, which is trivially invertible.
- Comment removal, which we never expect to recover via unminification.
- Renaming to shorter names, which is tedious to track but still mechanical. And most minifiers have little understanding of underlying types anyway, so they are usually very conservative and rarely reuse the same mangled identifier for multiple uses. (Google Closure Compiler is a significant counterexample here, but it is also known to be much slower.)
- Constant folding and inlining, which is annoying but can still be tracked. Again, most minifiers are too limited in their reasoning to do extensive constant folding and inlining.
- Language-specific transformations, like turning `a; b; c;` into `a, b, c;` and `if (a) b;` into `a && b;` whenever possible. They will be hard to understand if you don't know them in advance, but there aren't too many of them anyway (see the small before/after sketch below).
As a result, minified code still remains comparably human-readable with some note taking and perseverance. And since these transformations are mostly local, I would expect LLMs can pick them up on their own as well. (But why? Because I do inspect such programs fairly regularly, for example for comments like https://news.ycombinator.com/item?id=39066262) |
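To make the list above concrete, here is a hand-written before/after sketch of the kind of output a typical minifier produces (illustrative only; real minifier output varies by tool and settings):

```js
// Before: readable source with whitespace, comments, and descriptive names.
function totalPrice(items, taxRate) {
  // sum the item prices, then apply tax if there is any
  let sum = 0;
  for (const item of items) {
    sum += item.price;
  }
  if (taxRate) {
    sum *= 1 + taxRate;
  }
  return sum;
}

// After: whitespace and comments stripped, identifiers shortened, and
// `if (a) b;` rewritten using `&&` plus the comma operator in the return.
function t(a, b) { let c = 0; for (const d of a) c += d.price; return b && (c *= 1 + b), c; }
```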
|
| I feel you’re downplaying the obfuscatory power of name-mangling. Reversing that (giving everything meaningful names) is surely a difficult problem? |
|
| Yeah, having run some state of the art obfuscated code through ChatGPT, it still fails miserably. Even what was state of the art 20 years ago it can't make heads or tails of. |
|
| Yep, I've tried to use LLMs to disassemble and decompile binaries (giving them the hex bytes as plaintext), they do OK on trivial/artificial cases but quickly fail after that. |
|
| There is a certain justice in the use of OpenAI as a name for their product, given that OpenAI has turned the generic technical GPT name into a brand. |
|
| Growing up in India over the past 4 decades, 'Xerox' was/is the default and most common word used for photocopying... only recently have I started using/hearing the term 'photocopy'.
Every town and every street had "XEROX shops" where people went to get various documents photocopied, for INR 1 per page for example. Most photocopy centers are still called XEROX shops -- and their boards say that in big bold text: https://www.google.com/search?q=xerox+shop+india&udm=2 It doesn't matter if they use Canon, HP, or other brands of machines. |
|
| I don’t claim expertise in AI or understanding intelligence, but could we also say that a pocket calculator really understands arithmetic and has superior intellectual performance compared to humans? |
|
| That's interesting. It's gotten a lot better I guess. A little over a year ago, I tried to use GPT to assist me in deobfuscating malicious code (someone emailed me asking for help with their hacked WP site via custom plugin). I got much further just stepping through the code myself.
After reading through this article, I tried again [0]. It gave me something to understand, though it's obfuscated enough to essentially eval unreadable strings (via the Window object), so it's not enough on its own. Here was an excerpt of the report I sent to the person:
> For what it’s worth, I dug through the heavily obfuscated JavaScript code and was able to decipher logic that it:
> - Listens for a page load
> - Invokes a facade of calculations which are in theory constant
> - Redirects the page to a malicious site (unk or something)
[0] https://chatgpt.com/share/f51fbd50-8df0-49e9-86ef-fc972bca6b... |
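For readers unfamiliar with the pattern being described, the "eval unreadable strings via the Window object" trick usually looks something like this harmless, made-up illustration (not the actual malware; browser context assumed):

```js
// The property name and the payload are both hidden in encoded strings,
// and the payload is only executed indirectly at runtime, so a static
// beautifier or a single LLM pass can't show what actually runs.
var k = "\x65\x76\x61\x6c"; // decodes to "eval"
var payload = String.fromCharCode(97, 108, 101, 114, 116, 40, 49, 41); // decodes to "alert(1)"
window[k](payload); // indirect eval of the decoded string
```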
|
| Most definitely; if I use "View >> Repair Text Encoding" in Firefox, it shows the block characters. But I have to admit, it's strange that Firefox does not choose UTF-8 by default in this case. |
|
| He also told it to reimplement from JavaScript to TypeScript.
I would guess that if he had just told it to rename the variables and methods first, it would have been closer to the original. |
|
| Apologizing to a program seems rather silly though. Do you apologize to your compiler when you have a typo in your code, and have to make it do all that work again? |
|
| I asked Claude 3.5 Sonnet a question in Italian in rot13 and it replied in Italian in rot13; there are a few typos but it's perfectly understandable. |
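For reference, rot13 just rotates each letter 13 places through the alphabet, so it is trivial to apply and to undo; a minimal JavaScript version, for illustration only:

```js
// rot13: rotate A-Z and a-z by 13 positions, leave everything else untouched.
// Applying it twice returns the original text.
function rot13(s) {
  return s.replace(/[a-zA-Z]/g, (ch) => {
    const base = ch <= "Z" ? 65 : 97; // char code of 'A' or 'a'
    return String.fromCharCode(((ch.charCodeAt(0) - base + 13) % 26) + base);
  });
}

console.log(rot13("Ciao, come stai?"));        // "Pvnb, pbzr fgnv?"
console.log(rot13(rot13("Ciao, come stai?"))); // back to "Ciao, come stai?"
```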
|
| I agree, it is fun!
LLM source recovery from binaries is a thing. The amazing part is that they are pretty good at adding back meaningful variable names to the generated source code. |
|
| I’m hoping LLMs get better at decompiling/RE’ing assembly because it’s a very laborious process. Currently I don’t think they have enough in their training sets to be very good at it. |
|
| You can do this on minified code with beautifiers like js-beautify, for example. It's not clear why we need to make this an LLM task when we have existing simple scripts to do it? |
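For context, the beautifier route looks roughly like this (using the js-beautify npm package; the exact call and options shown are from memory, so check the package docs): it restores formatting, but the mangled one-letter names stay as they are, which is the part the LLM approach tries to add on top.

```js
// js-beautify only pretty-prints: whitespace and layout come back,
// but identifiers like t, a, b, c, d remain meaningless.
const beautify = require("js-beautify").js;

const minified =
  "function t(a,b){let c=0;for(const d of a)c+=d.price;return b&&(c*=1+b),c}";

console.log(beautify(minified, { indent_size: 2 }));
```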
|
| This post basically says that I don't need to document my code anymore. No more comments, they can be generated automatically. Hurray! |
|
| Unfortunately the comments that could be generated are exactly the ones that should never be written. You want the comment to explain why, the information missing from the code. |
|
| You can put these comments into the name of a function, getting rid of the redundancy; whoever is just reading the code then gets that information from the name without being distracted by separate comments. |
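A tiny illustration of that idea (the names and data below are made up):

```js
const items = [{ price: 10 }, { price: 20 }];
const taxRate = 0.2;

// Before: the comment just repeats what the expression does.
// compute the total with tax applied
const before = items.reduce((s, i) => s + i.price, 0) * (1 + taxRate);

// After: the same information lives in the function name, so readers of the
// call site get it without a separate comment to skip over.
function totalWithTax(list, rate) {
  return list.reduce((sum, item) => sum + item.price, 0) * (1 + rate);
}
const after = totalWithTax(items, taxRate);

console.log(before, after); // 36 36
```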
|
| It is true that at least some jurisdictions do also explicitly allow for reverse engineering to achieve interoperability, but I don't know if such provision is widespread. |
|
| Sometimes when I refactor, I do this manually with an LLM. It is useful in at least two ways: it can reveal better (more canonical) terminology for names (eg: 'antiparallel_line' instead of 'parallel_line_opposite_direction'), and it can also reveal names that could be generalized (eg: 'find_instance_in_list' instead of 'find_animal_instance_in_animals').