The co-founder and CEO of Lamini, an artificial intelligence (AI) large language model (LLM) startup, posted a video to Twitter/X poking fun at the ongoing Nvidia GPU shortage. The Lamini boss is feeling rather smug at the moment, largely because the firm’s LLM platform runs exclusively on readily available AMD GPUs. Moreover, the firm claims that AMD GPUs using ROCm have reached “software parity” with the long-dominant Nvidia CUDA platform.
Just grilling up some GPUs 💁🏻♀️ Kudos to Jensen for baking them first https://t.co/4448NNf2JP pic.twitter.com/IV4UqIS7OR (September 26, 2023)
The video shows Sharon Zhou, CEO of Lamini, checking an oven in search of some AI LLM-accelerating GPUs. First she ventures into a kitchen superficially similar to Jensen Huang’s famous Californian one, but upon checking the oven she notes that there is a “52 weeks lead time – not ready.” Frustrated, Zhou checks the grill in the yard, where a freshly BBQed AMD Instinct GPU is ready for the taking.
We don’t know the technical reasons why Nvidia GPUs require lengthy oven cooking while AMD GPUs can be prepared on a grill. Hopefully, our readers can shine some light on this semiconductor conundrum in the comments.
On a more serious note, a closer look at Lamini shows the headlining LLM startup is no joke. CRN provided some background coverage of the Palo Alto, Calif.-based startup on Tuesday. Among the highlights: Lamini CEO Sharon Zhou is a machine learning expert, and CTO Greg Diamos is a former Nvidia CUDA software architect.
It turns out that Lamini has been “secretly” running LLMs on AMD Instinct GPUs for the past year, with a number of enterprises benefiting from private LLMs during the testing period. The most notable Lamini customer is probably AMD itself, which “deployed Lamini in our internal Kubernetes cluster with AMD Instinct GPUs, and are using finetuning to create models that are trained on AMD code base across multiple components for specific developer tasks.”
A key claim from Lamini is that it needs only “3 lines of code” to run production-ready LLMs on AMD Instinct GPUs. Additionally, Lamini is said to have the key advantage of working on readily available AMD GPUs. CTO Diamos also asserts that Lamini’s performance isn’t overshadowed by Nvidia solutions, as AMD ROCm has achieved “software parity” with Nvidia CUDA for LLMs.
We’d expect as much from a company focused on providing LLM solutions using AMD hardware, though the claims aren’t inherently wrong. AMD Instinct GPUs can be competitive with Nvidia A100 and H100 GPUs, particularly if you have enough of them. The Instinct MI250, for example, offers up to 362 teraflops of peak BF16/FP16 compute for AI workloads, and the MI250X pushes that to 383 teraflops. Both have 128GB of HBM2e memory as well, which can be critical for running LLMs.
AMD’s upcoming Instinct MI300X meanwhile bumps the memory capacity up to 192GB, well over double the 80GB you get with Nvidia’s Hopper H100. However, AMD hasn’t officially revealed the compute performance of the MI300 family yet; it’s a safe bet it will be higher than the MI250X, but how much higher isn’t fully known.
By way of comparison, Nvidia’s A100 offers up to 312 teraflops of BF16/FP16 compute, or 624 teraflops peak with sparsity (sparsity “skips” multiply-by-zero calculations, since the result is already known, potentially doubling throughput). The H100 offers up to 1,979 teraflops of BF16/FP16 compute with sparsity, and half that without. On paper, then, AMD can take on the A100 but falls behind the H100. That assumes you can actually get H100 GPUs, however, which as Lamini notes currently means wait times of a year or more.
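To put those paper specs side by side, here’s a minimal Python sketch that tallies the peak BF16/FP16 figures quoted above. The numbers are the vendors’ quoted dense/sparse peaks as cited in this article; real-world LLM throughput depends heavily on software, memory bandwidth, and interconnect, so treat this as back-of-the-envelope only.

```python
# Peak BF16/FP16 throughput (teraflops) as quoted above. The "sparse" figures
# rely on Nvidia's structured-sparsity feature; no equivalent is claimed for AMD here.
peak_tflops = {
    "AMD MI250":   {"dense": 362},
    "AMD MI250X":  {"dense": 383},
    "Nvidia A100": {"dense": 312, "sparse": 624},
    "Nvidia H100": {"dense": 1979 / 2, "sparse": 1979},
}

for gpu, figures in peak_tflops.items():
    dense = figures["dense"]
    sparse = figures.get("sparse", dense)  # fall back to dense where no sparse figure is quoted
    print(f"{gpu:>12}: {dense:7.1f} TFLOPS dense, {sparse:7.1f} TFLOPS sparse")

# Rough ratio implied below: how many MI250X cards match one H100 at its sparse peak?
print("MI250X per H100 (sparse peak):", round(1979 / 383, 1))  # ~5.2
```

Run as-is, that last ratio lands at roughly five MI250X boards per H100, which is where the “five of them” figure in the next paragraph comes from.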
The alternative in the meantime is to run LLMs on AMD’s Instinct GPUs. A single MI250X might not be a match for the H100, but five of them, running optimized ROCm code, should prove competitive. There’s also the question of how much memory the LLMs require, and as noted, 128GB is more than the 80GB or 94GB maximum on current H100 cards (unless you include the dual-GPU H100 NVL). An LLM that needs 800GB of memory, like ChatGPT, would potentially need a cluster of ten or more H100 or A100 GPUs, or seven MI250X GPUs.
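That GPU-count arithmetic is just a ceiling division of model memory by per-GPU memory. A quick sketch, using the article’s 800GB ChatGPT-class estimate (an assumption, not a measured figure) and ignoring the extra room real deployments need for activations and KV cache:

```python
import math

MODEL_MEMORY_GB = 800  # ChatGPT-class estimate used above, not a measured value

gpu_memory_gb = {
    "Nvidia A100 (80GB)": 80,
    "Nvidia H100 (80GB)": 80,
    "Nvidia H100 NVL (per GPU)": 94,
    "AMD Instinct MI250X": 128,
}

for gpu, mem in gpu_memory_gb.items():
    count = math.ceil(MODEL_MEMORY_GB / mem)  # minimum cards just to hold the weights
    print(f"{gpu:>26}: at least {count} GPUs for {MODEL_MEMORY_GB}GB of model memory")
```

That works out to ten 80GB A100 or H100 cards versus seven MI250X cards, matching the figures above; real clusters would need more headroom on top of the raw weight storage.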
It’s only natural that an AMD partner like Lamini is going to highlight the best of its solution, and cherry-pick data and benchmarks to reinforce its stance. It cannot be denied, though, that the ready availability of AMD GPUs and their non-scarcity pricing mean the red team’s chips may currently deliver the best price per teraflop, or the best price per GB of GPU memory.