Easier said than done. I've got a dual X5690 at home in Kiev, Ukraine and I just couldn't find anything to run on it 24x7. And it doesn't produce much heat idling. I mean at all.
You don't need to use C++ to interface with CUDA, or even write it. A while ago NVIDIA and the GraalVM team demoed grCUDA, which makes it easy to share memory with CUDA kernels and invoke them from any managed language that runs on GraalVM (which includes JIT-compiled Python). Because it's integrated with the compiler, the invocation overhead is low: https://developer.nvidia.com/blog/grcuda-a-polyglot-language...

And TornadoVM lets you write kernels in JVM languages that are compiled through to CUDA. There are similar technologies for other languages/runtimes too, so I don't think that will cause NVIDIA to lose ground.
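For a concrete sense of what "writing kernels in JVM languages" looks like, here is a rough TornadoVM-style sketch: a plain Java method whose loop carries TornadoVM's `@Parallel` annotation, launched through the older `TaskSchedule` API. Exact package and class names differ between TornadoVM releases (newer ones use `TaskGraph` and execution plans), so treat this as illustrative rather than copy-paste-ready.

```java
import uk.ac.manchester.tornado.api.TaskSchedule;
import uk.ac.manchester.tornado.api.annotations.Parallel;

public class VectorAddExample {

    // A plain Java method; the @Parallel loop is what TornadoVM offloads to the
    // accelerator (via its OpenCL/PTX/SPIR-V backends), with no hand-written CUDA.
    public static void vectorAdd(float[] a, float[] b, float[] c) {
        for (@Parallel int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int n = 1 << 20;
        float[] a = new float[n];
        float[] b = new float[n];
        float[] c = new float[n];
        java.util.Arrays.fill(a, 1.0f);
        java.util.Arrays.fill(b, 2.0f);

        // Older TaskSchedule-style API; newer TornadoVM versions express the same
        // thing with TaskGraph + TornadoExecutionPlan.
        new TaskSchedule("s0")
                .task("t0", VectorAddExample::vectorAdd, a, b, c)
                .streamOut(c)
                .execute();
    }
}
```

Either way, the point stands: the kernel body is ordinary JVM code, and the runtime decides how to compile and dispatch it.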
I can't tell if your comment is sarcastic or genuine :). It goes to show how out of touch I am on AI hardware and software matters. Yesterday I thought about installing https://news.ycombinator.com/item?id=39372159 (Reor, an open-source AI note-taking app that runs models locally) and feeding it my markdown folder, but I stopped midway, asking myself "don't I need some kind of powerful GPU for that?". And now I'm thinking "wait, should I wait for a `standard` pluggable AI computing hardware device? Is this Intel Gaudi 3 something like that?".
Itanium only failed because AMD was allowed to come up with AMD64. Intel would have managed to push Itanium through no matter what if there had been no alternative in the form of a 64-bit-compatible x86 CPU.
I think it's a valid question. Intel has a habit of whispering away anything that doesn't immediately ship millions of units or that it isn't contractually obligated to support.
No, this means Intel has woken up and is trying. There's no guarantee of anything. I'm more of an AMD person, but I want to see fierce competition, not a monopoly, even if it's "my team's" monopoly.
See https://www.nextplatform.com/2024/04/09/with-gaudi-3-intel-c... for more details. Here are the relevant bits, although you should visit the article to see the networking diagrams:

> The Gaudi 3 accelerators inside of the nodes are connected using the same OSFP links to the outside world as happened with the Gaudi 2 designs, but in this case the doubling of the speed means that Intel has had to add retimers between the Ethernet ports on the Gaudi 3 cards and the six 800 Gb/sec OSFP ports that come out of the back of the system board. Of the 24 ports on each Gaudi 3, 21 of them are used to make a high-bandwidth all-to-all network linking those Gaudi 3 devices tightly to each other. Like this:

> As you scale, you build a sub-cluster with sixteen of these eight-way Gaudi 3 nodes, with three leaf switches – generally based on the 51.2 Tb/sec “Tomahawk 5” StrataXGS switch ASICs from Broadcom, according to Medina – that have half of their 64 ports running at 800 GB/sec pointing down to the servers and half of their ports pointing up to the spine network. You need three leaf switches to do the trick:

> To get to 4,096 Gaudi 3 accelerators across 512 server nodes, you build 32 sub-clusters and you cross link the 96 leaf switches with three banks of sixteen spine switches, which will give you three different paths to link any Gaudi 3 to any other Gaudi 3 through two layers of network. Like this:

The cabling works out neatly in the rack configurations they envision. The idea here is to use standard Ethernet instead of proprietary InfiniBand (which Nvidia got from acquiring Mellanox). Because each accelerator can reach the other accelerators via multiple paths that will (ideally) not be over-utilized, you can perform large operations across them efficiently without having to be especially careful about how your software manages communication.
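To sanity-check the scaling numbers in the quoted passage, here is a small back-of-the-envelope calculation. The constants are taken from the article; only the arithmetic is added here, so treat it as illustrative rather than a vendor-published bill of materials.

```java
// Back-of-the-envelope check of the Gaudi 3 cluster topology quoted above.
// All constants come from the Next Platform description; illustrative only.
public class Gaudi3Topology {
    public static void main(String[] args) {
        int acceleratorsPerNode = 8;   // eight-way Gaudi 3 node
        int nodesPerSubCluster  = 16;  // sixteen nodes per sub-cluster
        int subClusters         = 32;  // 32 sub-clusters in the full build-out
        int leafPerSubCluster   = 3;   // three leaf switches per sub-cluster
        int spineBanks          = 3;   // three banks of spine switches
        int spinesPerBank       = 16;  // sixteen spine switches per bank

        int totalNodes         = nodesPerSubCluster * subClusters;   // 512
        int totalAccelerators  = acceleratorsPerNode * totalNodes;   // 4,096
        int totalLeafSwitches  = leafPerSubCluster * subClusters;    // 96
        int totalSpineSwitches = spineBanks * spinesPerBank;         // 48

        System.out.printf("nodes=%d accelerators=%d leaf=%d spine=%d%n",
                totalNodes, totalAccelerators, totalLeafSwitches, totalSpineSwitches);
    }
}
```

The three leaf switches per sub-cluster and the three spine banks are what give any pair of Gaudi 3s the three independent paths mentioned at the end of the quote.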
It will most likely use copper QSFP56 cables, since these interfaces are used either for in-rack or adjacent-rack direct attachments or to the nearest switch. 0.5-1.5/2 m copper cables are easily available and cheap, and 4-8 m (and even longer) is also possible with copper, but those tend to be more expensive and harder to come by. Even 800 Gb is possible with copper cables these days, but you'll end up spending just as much, if not more, on cabling as on the rest of your kit… https://www.fibermall.com/sale-460634-800g-osfp-acc-3m-flt.h...
Those cables definitely exist for Ethernet, and regarding crosstalk, that's what shielding is for. Although not for 200 Gbps; at that rate you either use big twinax DACs or go to fibre.
Surely Nvidia's pricing reflects what the market will bear rather than the intrinsic cost to build. Intel, being the underdog, should be willing to offer a discount just to get their foot in the door.
> blessed Ubuntu version with the blessed kernel version

To an SRE, this is a nightmare to read. CUDA is bad in this regard (it can often prevent major kernel version updates), but this is worse.
Hey man, I've seen you around here, very knowledgeable, thanks for your input! What's your take on projects like https://github.com/corundum/corundum ? I'm trying to get better at FPGA design, perhaps learn PCIe and some such, but Vivado is intimidating (as opposed to Yosys/nextpnr, which you seem to hate). Should I just get involved with a project like this to acclimatise somewhat?
> the only MLPerf-benchmarked alternative for LLMs on the market

I hope to work on this for AMD MI300x soon. My company just got added to the MLCommons organization.
> Memory Boost for LLM Capacity Requirements: 128 gigabytes (GB) of HBMe2 memory capacity, 3.7 terabytes (TB) of memory bandwidth ...

I didn't know "terabytes (TB)" was a unit of memory bandwidth...
I wonder if someone knowledgeable could comment on oneAPI vs CUDA. I feel like if Intel is going to be a serious competitor to Nvidia, software and hardware are going to be equally important.
Gaudi 3 has PCIe 4.0 (vs. PCIe 5.0 on the H100, which has 2x the bandwidth). Probably not a deal-breaker, but it's strange for Intel (of all vendors) to lag behind in PCIe.
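For rough context on that "2x" figure, per-direction x16 bandwidth is roughly the raw line rate times the 128b/130b encoding efficiency. A minimal sketch, ignoring packet/protocol overhead:

```java
// Approximate per-direction PCIe x16 bandwidth from the raw line rate.
// Ignores TLP/DLLP packet overhead; only 128b/130b line coding is applied.
public class PcieBandwidth {

    static double x16GBps(double gtPerSecPerLane) {
        double encoding = 128.0 / 130.0;              // 128b/130b (PCIe 3.0 and later)
        return gtPerSecPerLane * encoding / 8.0 * 16; // GT/s -> GB/s per lane, times 16 lanes
    }

    public static void main(String[] args) {
        System.out.printf("PCIe 4.0 x16: ~%.1f GB/s%n", x16GBps(16.0)); // ~31.5 GB/s
        System.out.printf("PCIe 5.0 x16: ~%.1f GB/s%n", x16GBps(32.0)); // ~63.0 GB/s
    }
}
```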
Vector floating-point performance comes in at 14 Tflop/s for FP32 and 28 Tflop/s for FP16. Not the best of times for stuff that doesn't fit matrix processing units.
It's crazy that Intel can't manufacture its own chips at the moment, but it looks like that might change in the coming years as new fabs come online.
What else is on the BOM? Volume? At that price you likely want to use whatever resources are on the SoC that runs the thing and work around that. Feel free to e-mail me.
With Nvidia, the SXM connection pinouts have always been kept proprietary and confidential. For example, P100s and V100s have standard PCIe lanes connected to one of the two sides of their MegArray connectors, and if you knew that pinout you could literally build PCIe cards with SXM2/3 connectors to repurpose those now-obsolete chips (this has been done by one person).
There are thousands, maybe tens of thousands, of P100s you could pick up for literally <$50 apiece these days, which technically gives you more Tflops/$ than anything on the market, but they are useless because their interface was never made open, it has not been openly reverse-engineered, and the OEM baseboards (mainly Dell and Supermicro) are still hideously expensive outside China.
I'm one of those people who find 'retro-supercomputing' a cool hobby, so open interfaces like OAM mean these devices may actually have a life for hobbyists in 8-10 years, instead of being sent straight to the bin due to secret interfaces and obfuscated backplane specifications.