Advice to Tenstorrent

Original link: https://github.com/geohot/tt-tiny

Tenstorrent's edge in AI compute is its programmability, but its current approach blocks progress; LLK is a major problem. The focus should be a simplified three-layer architecture: frontend (PyTorch, ONNX, etc.), compiler, and runtime/driver. The runtime should be a lean, application-agnostic layer that exposes hardware capability (compilation, dispatch, queuing) through a simple C API, similar to CUDA; op-specific implementations such as ELU should be eliminated from it. The compiler should handle memory placement, op scheduling, and kernel fusion; MLIR/LLVM is a viable choice. Finally, frontend performance matters: before implementing a custom op like ELU, ensure it performs as well as the equivalent composition of basic functions like ReLU. The current layered architecture is too complex and prevents Tenstorrent from exploiting its hardware's programmability; a lean, efficient stack is essential to success.

A Hacker News thread discusses George Hotz's (geohot's) advice to the hardware company Tenstorrent. A PhD student and systems programmer expressed frustration with Tenstorrent's complex abstractions, finding them hard to understand even after significant effort. Another commenter echoed this, noting that despite experience with JAX and ML compiler development, training a VLM on Blackhole hardware remained difficult; they felt Tenstorrent's engineering teams are understaffed and that the company does not dogfood its own products. Commenters debated geohot's credentials and track record: some praised his insight, others criticized his bombastic style and perceived failures. His earlier AMD critique also came up, seen by some as a valuable warning and by others as a self-serving attack. Some defended his work at Comma.ai, while others pointed to his brief internship at Twitter. The discussion touched on the tension between user-friendly APIs and complex graph compilers, and the challenge of building a software platform that complements the hardware. Opinions on geohot's credibility were split: some felt he has earned the right to be blunt, others that his past achievements do not justify his current behavior.

Original text
Advice to Tenstorrent

If you want to get acquired / become scam IP licensing co...I can't help you.

If you want to win AI compute, read on

===

This is your 7th stack?

Plz bro one more stack this stack will be good i promise
bro bro bro plz one more make it all back one trade

You can't build a castle on a shit swamp. LLK is the wrong approach.

===

Tenstorrent's advantage is more programmability wrt GPUs. Hardware shapes model arch.

If you don't expose that programmability, you are guaranteed to lose. sfpi_elu is a problem.

You aren't going to get better deals on tapeouts/IP than NVIDIA/AMD. You need some advantage.

But but but it's all open source.

===

If you want a dataflow graph compiler, build a dataflow graph compiler.
This is not 6 layers of abstraction, it's 3 (and only 2 you have to build).

1. frontend <PyTorch, ONNX, tensor.py>
2. compiler
3. runtime/driver

===

Start with 3.

The driver is fine.

The runtime should JUST BE A RUNTIME. I better never see mention of an elu.

Make the runtime expose hardware in an application-agnostic way. Compilation, dispatch, queuing, etc...

As long as LLK sits under tt-metalium, you aren't doing this.

CUDA is a simple C API for this. I advise doing the same.
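As a sketch of what "just a runtime" could mean, here is a hypothetical toy model in Python (every name here is invented for illustration; this is not a real Tenstorrent or CUDA API). The point is the shape of the surface: compile, alloc, enqueue, sync, and zero knowledge of any particular op.

```python
# Toy model of an application-agnostic runtime. All names invented.
class ToyRuntime:
    def __init__(self):
        self.queue = []

    def compile(self, source):
        # stand-in for a real device compiler: "device source" is Python here
        env = {}
        exec(source, env)
        return env["kernel"]

    def alloc(self, n):
        return [0.0] * n  # stand-in for a device buffer

    def enqueue(self, kernel, *bufs):
        self.queue.append((kernel, bufs))  # deferred dispatch

    def sync(self):
        for kernel, bufs in self.queue:
            kernel(*bufs)
        self.queue.clear()

# The op (here, a scale-by-two) lives above the runtime, never inside it.
rt = ToyRuntime()
k = rt.compile("def kernel(dst, src):\n    dst[:] = [2 * x for x in src]")
out = rt.alloc(3)
rt.enqueue(k, out, [1.0, 2.0, 3.0])
rt.sync()
# out is now [2.0, 4.0, 6.0]
```

Whether the kernel computes ELU, matmul, or anything else is invisible to this layer, which is the whole argument.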

===

Now for 2.

tinygrad is this, but you don't have to use it. MLIR/LLVM is probably fine.

ELU still should not be here!!!!

This should deal with memory placement, op scheduling, kernel fusion. Not ELU.
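For a sense of what kernel fusion buys, a toy sketch in plain Python (invented example, not anything from the Tenstorrent stack):

```python
# Unfused: each elementwise op runs as its own kernel, materializing
# an intermediate buffer between them.
def unfused(xs, alpha):
    tmp = [x * alpha for x in xs]   # kernel 1 writes a full buffer...
    return [t + 1.0 for t in tmp]   # ...kernel 2 reads it back

# Fused: a graph compiler composes both ops into one loop, so the
# intermediate values stay in registers and never touch memory.
def fused(xs, alpha):
    return [x * alpha + 1.0 for x in xs]
```

Deciding which ops to fuse, where buffers live, and in what order kernels run is the compiler's job, and none of it requires the compiler to know what ELU is.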

This is not easy. But importing 6 abstraction layers of cruft doesn't fix that!!!!

===

Now for 1.

self.elu() needs to have the same perf as self.relu() - alpha*(1-self.exp()).relu()

If it doesn't, you messed up. Only once it does are you ready to write elu.

HINT for how to write ELU: def elu(self): return self.relu() - alpha*(1-self.exp()).relu()

HINT is not a hint, it's the actual code.
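The identity does check out: for x > 0 the second term vanishes, and for x <= 0 it reduces to alpha*(exp(x) - 1), the textbook ELU. A standalone numeric check in plain Python (assuming elementwise relu/exp semantics and the common default alpha = 1.0):

```python
import math

ALPHA = 1.0  # common default for ELU

def relu(x):
    return max(x, 0.0)

def elu_reference(x, alpha=ALPHA):
    # textbook piecewise definition of ELU
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_composed(x, alpha=ALPHA):
    # the hint: ELU built only from relu/exp primitives
    return relu(x) - alpha * relu(1.0 - math.exp(x))

# the two agree across negative, zero, and positive inputs
for x in [-3.0, -0.5, 0.0, 0.7, 2.0]:
    assert abs(elu_reference(x) - elu_composed(x)) < 1e-12
```

So the frontend can define elu as a one-line composition, and it is then the compiler's job (via fusion) to make that composition as fast as a hand-written kernel.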







