Benchmarking AMD's MI300X and Nvidia's H100 and H200; in theory, AMD's GPU has advantages in specs and total cost of ownership, but software bugs hold it back
This Nvidia monopoly on #ai hardware is not good for anybody. …

X: Nicholas Wilt / @cudahandbook: CUDA's software stack has a few distinct pillars that are triumphs of software engineering (let alone software architecture). The driver API was built in C, portable across both operating systems and CPU architectures. Across the 6 years I worked on CUDA, no one questioned why I was making sure it ran on Windows as well as Linux. It was healthy for the code base. Today, NVIDIA has parlayed CUDA's Windows support into a monopoly position in GPU workstations, because 1,200 workstation apps use CUDA. Another pillar is PTX, which enables NVIDIA to churn the hardware instruction set with abandon, sometimes even committing featurecide and relying on the PTX translator to emulate instructions that were removed. The PTX translation code is in the driver as well as the offline toolchain (ptxas) and is multithreaded, so it can exploit modern multicore CPUs for performance gains proportional to the core count. Another triumph of software engineering. All of this great software runs on a span of platforms from tiny SoCs for cars and drones and robots, to the biggest supercomputers in the world. So yeah, CUDA is a deep, deep moat. I get pretty offended when folks intimate that any luck was involved. We knew exactly what we were doing and why. /fin

George Hotz / @realgeorgehotz: "AMD's software experience is riddled with bugs, rendering out-of-the-box training with AMD impossible." "It's not just that it's immature software; they need to change how they do development." Good luck @dylan522p!

You Jiacheng / @youjiacheng: > Tensorwave, the largest AMD GPU cloud, has given GPU time for free to a team at AMD to fix software issues, which is insane given they paid for the GPUs. This is INSANE...
Indrajit Bhosale / @devnull_0: Recently tried to (painstakingly) write a convolution layer kernel on AWS Trainium with their NKI programming model (their latest and greatest), only to realize the CUDA moat is well and truly alive!

Dylan Patel / @dylan522p: Our 5-month journey conducting independent analysis & benchmarking of AMD MI300X vs Nvidia H100 + H200. Detailed, open-source low-level benchmarks. Performance vs TCO. Comprehensive public recommendations. It's not just immature software; they need to change how they do development.

@vintrotweets: > Tensorwave, the largest AMD GPU cloud, has given GPU time for free to a team at AMD to fix software issues, which is insane given they paid for the GPUs. Glad we got the tinybox green.

Ben Thompson / @benthompson: This is incredible work. And yeah, AMD has sucked at software for 50 years.

Ben Bajarin / @benbajarin: Repeat after me the reason Nvidia GPUs remain so well positioned: a massive (and growing) installed base and architectural compatibility. Good analysis below.

Saurabh Dash / @theycallmemr_: "AMD" is insane because it appears to be one of the most legitimately dangerous companies with the potential to gigafry the market but exclusively employs literal turbonormies who unironically want to like design x86 processors and basically get oneshotted by their own drivers.

Daniel Lemire / @lemire: NVIDIA's GPUs excel due to their mature CUDA software stack, which provides better performance in practical scenarios even when they have inferior hardware specs. via @GeimanThiesen

LinkedIn: Phillip Hendrickson, Ph.D.: Commentary about SemiAnalysis' MI300X deep dive is currently making the rounds in the news and on social media platforms, understandably so given what's in the article. …