> compilers behaving in a deterministic and predictable way is an important fundamental of pipelines. LLMs are inherently unpredictable, and so using an LLM for compilation / decompilation -- even an LLM that has 99.99% accuracy

You're confusing different concepts here. An LLM is technically not unpredictable by itself (at least the ones we are talking about here; there are different problems with beasts like GPT-4 [1]). The "randomness" of LLMs you are probably experiencing stems from the autoregressive completion, which samples from the output probabilities at a temperature T > 0 (very common because it makes sense in chat applications). But nothing prevents you from simply choosing greedy sampling, which makes the output 100% deterministic and reproducible.

That is particularly useful for disassembling/decompiling and has a chance to vastly improve over existing tools, because it is common knowledge that they are often not the sharpest tools and humans are much better at piecing together working code.

The other question is accuracy for compiling. For that, what matters is whether the LLM can follow a specification correctly, because once you write unspecified behaviour, your code is fair game for other compilers as well. So the real questions are how well it follows the spec and how well it deals with situations where normal compilers flounder.

[1] https://152334h.github.io/blog/non-determinism-in-gpt-4/
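A minimal sketch of the greedy-decoding point, assuming a Hugging Face `transformers` causal LM; the model name and prompt are illustrative, not the setup from the paper:

```python
# Greedy decoding: pick the argmax token at every step, so repeated runs on
# the same weights produce the same output (no sampling randomness at all).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-7b-hf"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "define i32 @add(i32 %a, i32 %b) {"  # toy LLVM IR prompt
inputs = tok(prompt, return_tensors="pt")

# do_sample=False disables temperature/top-p sampling entirely.
out = model.generate(**inputs, do_sample=False, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```

(Bit-exact reproducibility still assumes fixed weights and numerics, which is the separate issue other commenters raise.)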
In parallel computing you run into nondeterminism pretty quickly anyway -- especially with CUDA, because of undetermined execution order and floating-point rounding.
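For the floating-point part specifically, a tiny illustration of why reduction order matters (plain Python, nothing CUDA-specific assumed):

```python
# Floating-point addition is not associative, so a parallel reduction whose
# execution order varies between runs can produce slightly different sums.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False
```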
LLMs can be deterministic if you set the random seed and pin them to a specific version of the weights.

My bigger concern is that bugs in the generated machine code would be very, very difficult to track down.
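For what it's worth, a minimal sketch of that pinning with Hugging Face `transformers` -- the model name is illustrative, and in practice you would pin an exact commit hash of the weights rather than a branch:

```python
from transformers import AutoModelForCausalLM, set_seed

set_seed(0)  # fixes the Python, NumPy, and PyTorch RNG state used for sampling
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf",  # illustrative model name
    revision="main",              # in practice, pin an exact commit hash
)
```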
It is very important for a compiler to be deterministic. Otherwise you can't validate the integrity of binaries! We already have issues with reproducibility without adding this shit into the mix.
I am curious about the CUDA assembly: does it work at the CUDA -> PTX level, or PTX -> SASS? I have done some work on SASS optimization, and it would be a lot easier if an LLM could be applied at the SASS level.
Thank you for freeing me from one of my to-do projects. I wanted to do a similar autoencoder with optimisations. Did you write about it anywhere? I'd love to read the details.
Then maybe don't name it "LLM Compiler" -- just "Compiler Guidance with LLMs" or "LLM-aided Compiler Optimization" or something. It would be much more to the point without overpromising.
As this LLM operates on the LLVM intermediate representation, its output can be fed into https://alive2.llvm.org/ce/ and formally verified. For those who don't know what to put there, here is an example with the C++ spaceship operator: https://alive2.llvm.org/ce/z/YJPr84 (try replacing -1 with -2 there to break it). This is something of a Swiss Army knife for LLVM developers; they often start optimization work with this tool.

What they missed is any mention of verification (they probably don't know about alive2) and a comparison with other compilers. It is very likely that LLM Compiler "learned" from GCC and, with huge computational effort, simply generates what GCC can do out of the box.
Yep! No GCC on this one. And yep, that's not far off how the pretraining data was gathered -- but with random optimisations to give it a bit of variety.
C++ has operator overloading, so you can define the spaceship operator for any class and get every comparison operator from the fallback definitions, which use `<=>` in some obvious ways.
> Presumably a very generalized model would be good at even doing the inverse: given some assembly, write code that will produce the given assembly.

ChatGPT does this, unreliably.
Unlike many other AI-themed papers from Meta, this one omits any mention of the model's output being used at Instagram, Facebook, or Meta itself. Research is great! But this doesn't seem all that actionable today.
This would be difficult to deploy as-is in production.

There are correctness issues mentioned in the paper around moving the phase orderings away from the well-trodden O0/O1/O2/O3/Os/Oz path. Their methodology works quite well for a research project, but I personally wouldn't trust it in production: while some obvious issues can be caught by a small test suite and unit tests, others won't be, and that's really risky in production scenarios. There are also practical software-engineering questions, like how to deploy this inside the compiler. There is actually tooling in upstream LLVM to do this (https://www.youtube.com/watch?v=mQu1CLZ3uWs), but running models on a GPU would be difficult, and I would expect CPU inference to massively blow up compile times.
I don't understand the purpose of this. It feels like a task for function calling, handing the code off to an actual compiler.

Is there an obvious use case I'm missing?
GPT-6 could write software directly (as assembly) instead of writing C first. There's lots of training data for binaries, and it could train itself by checking whether the program does what it expects it to do.
Reading the title, I thought this was a tool for optimizing and disassembling LLMs, not an LLM designed to optimize and disassemble. Seeing that it's just a model is a little disappointing in comparison.
It's pretty important for compilers / decompilers to be reliable and accurate -- compilers behaving in a deterministic and predictable way is an important fundamental of pipelines.
LLMs are inherently unpredictable, and so using an LLM for compilation / decompilation -- even an LLM that has 99.99% accuracy -- feels a bit odd to include as a piece in my build pipeline.
That said, let's look at the paper and see what they did.
They essentially started with Code Llama, and then trained the model further on three tasks -- one primary, and two downstream.
The first task is compilation: given input code and a set of compiler flags, can we predict the output assembly? Given the inability to verify correctness without using a traditional compiler, this feels like it's of limited use on its own. However, training a model on this as a primary task enables a couple of downstream tasks. Namely:
The second task (and first downstream task) is compiler flag prediction: choosing the set of flags that yields the smallest assembly. It's a bit disappointing that they only seem to be able to optimize for assembly size (and not execution speed), but it's not without its uses. Because the output of this task (compiler flags) is then passed to a deterministic function (a traditional compiler), the instability of the LLM is mitigated.
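This is only an illustration of that pipeline, not the paper's actual interface: `predict_flags` stands in for a call to the model, and the clang invocation and baseline flags are assumptions.

```python
# Sketch of the flag-tuning pipeline: the LLM only proposes flags, and all
# code generation is done by a deterministic, traditional compiler.
import subprocess

def predict_flags(src: str) -> list[str]:
    # Hypothetical stand-in for a call to the model; it returns a fixed
    # baseline here so the sketch runs without any model at all.
    return ["-Oz"]

def compile_with_predicted_flags(src: str, out: str) -> None:
    flags = predict_flags(src)
    # Any nondeterminism in the model only affects which flags get tried,
    # not the correctness of the emitted code.
    subprocess.run(["clang", *flags, src, "-o", out], check=True)

compile_with_predicted_flags("foo.c", "foo")
```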
The third task (second downstream task) is decompilation. This is not the first time that LLMs have been trained to do better decompilation -- however, because of the pretraining that they did on the primary task, they feel that this provides some advantages over previous approaches. Sadly, they only compare LLM Compiler to Code Llama and GPT-4 Turbo, and not against any other LLMs fine-tuned for the decompilation task, so it's difficult to see in context how much better their approach is.
Regarding the verifiability of the decompilation approach, the authors note that there are issues with correctness, so they employ round-tripping -- recompiling the decompiled code (using the same compiler flags) and checking for an exact match against the original assembly. This still puts accuracy at only around 45% (if I understand their numbers correctly), so it's not entirely trustworthy yet, but it might still be useful -- especially if used alongside a traditional decompiler, with this model's outputs used only when they are verifiably correct.
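A minimal sketch of that round-trip check, under the assumption that the original and the model's decompiled output are both available as C source (the clang invocation and exact-match criterion are my illustration, not the paper's exact harness):

```python
# Round-trip check: recompile the model's decompiled source with the same
# flags and accept it only if the emitted assembly matches exactly.
import subprocess

def asm_of(c_path: str, flags: list[str]) -> str:
    # Emit assembly to stdout so the two versions can be compared as text.
    result = subprocess.run(["clang", *flags, "-S", "-o", "-", c_path],
                            check=True, capture_output=True, text=True)
    return result.stdout

def round_trip_ok(original_c: str, decompiled_c: str, flags: list[str]) -> bool:
    # Exact match is conservative; on a mismatch you would fall back to a
    # traditional decompiler rather than trusting the model's output.
    return asm_of(original_c, flags) == asm_of(decompiled_c, flags)
```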
Overall I'm happy to see this model released, as it seems like an interesting use case. I may need to read more, but at first blush I'm not immediately excited by the possibilities this unlocks. Most of all, I would like to see whether these methods could be extended to optimize for performance -- not just assembly size.