| GPU | VRAM | Price (€) | Bandwidth (TB/s) | TFLOP16 | €/GB | €/TB/s | €/TFLOP16 |
|---|---|---|---|---|---|---|---|
| NVIDIA H200 NVL | 141GB | 36284 | 4.89 | 1671 | 257 | 7423 | 21 |
| NVIDIA RTX PRO 6000 Blackwell | 96GB | 8450 | 1.79 | 126.0 | 88 | 4720 | 67 |
| NVIDIA RTX 5090 | 32GB | 2299 | 1.79 | 104.8 | 71 | 1284 | 22 |
| AMD RADEON 9070XT | 16GB | 665 | 0.6446 | 97.32 | 41 | 1031 | 7 |
| AMD RADEON 9070 | 16GB | 619 | 0.6446 | 72.25 | 38 | 960 | 8.5 |
| AMD RADEON 9060XT | 16GB | 382 | 0.3223 | 51.28 | 23 | 1186 | 7.45 |
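For anyone who wants to redo this comparison with current street prices, here is a minimal sketch that recomputes the derived columns from the raw specs (all numbers copied from the table above):

```python
# Recompute the table's derived columns from the raw specs.
# name: (vram_gb, price_eur, bandwidth_tb_s, fp16_tflops)
cards = {
    "NVIDIA H200 NVL":     (141, 36284, 4.89,   1671),
    "NVIDIA RTX PRO 6000": (96,  8450,  1.79,   126.0),
    "NVIDIA RTX 5090":     (32,  2299,  1.79,   104.8),
    "AMD RADEON 9070 XT":  (16,  665,   0.6446, 97.32),
    "AMD RADEON 9070":     (16,  619,   0.6446, 72.25),
    "AMD RADEON 9060 XT":  (16,  382,   0.3223, 51.28),
}

for name, (vram, price, bw, tflops) in cards.items():
    print(f"{name:22s} €/GB={price / vram:4.0f}  "
          f"€/(TB/s)={price / bw:5.0f}  €/TFLOP16={price / tflops:5.1f}")
```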
This post is part “hear me out” and part asking for advice.
Looking at the table above, AI GPUs are a pure scam, and it would make much more sense (at least going by these numbers) to use gaming GPUs instead, either through a Frankenstein build of PCIe switches or a high-bandwidth network.
So my question is whether somebody has built a similar setup and what their experience has been. Also, what is the expected overhead/performance hit, and can it be made up for by having just way more raw performance for the same price?
Well, a few issues:
- For hosting or training large models you want high bandwidth between GPUs. PCIe is too slow; NVLink has literally an order of magnitude more bandwidth (rough numbers in the sketch after this list). See what Nvidia is doing with NVLink and AMD is doing with Infinity Fabric. It's only available if you pay the premium, and if you need the bandwidth, you are most likely happy to pay.
- Same thing as above, but with memory bandwidth. The HBM chips in an H200 will run circles around the GDDR garbage they hand out to the poor people with filthy consumer cards. By the way, your inference and training are most likely bottlenecked by memory bandwidth, not available compute.
- Commercially supported cooling of gaming GPUs in rack servers? Lol. Good luck getting any reputable hardware vendor to sell you that, and definitely not at the power densities you want in a data center.
- TFLOP16 isn’t enough. Look at the 4-bit and 8-bit tensor numbers; that’s where the expensive silicon is used.
- Nvidia’s licensing terms basically prohibit gaming cards in servers. No one will sell that to you at any scale.
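To put rough numbers on the interconnect point (nominal figures; PCIe and NVLink bandwidths aren't quoted the same way, so treat the ratios as indicative rather than exact):

```python
# Nominal per-direction PCIe vs aggregate NVLink bandwidth, in GB/s.
links = {
    "PCIe 4.0 x16": 32,            # ~31.5 GB/s per direction
    "PCIe 5.0 x16": 64,            # ~63 GB/s per direction
    "NVLink 4 (H100/H200)": 900,   # aggregate per GPU, NVIDIA's figure
}
base = links["PCIe 5.0 x16"]
for name, bw in links.items():
    print(f"{name:22s} {bw:4d} GB/s  ({bw / base:4.1f}x PCIe 5.0 x16)")
```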
For fun, home use, research, or small-time hacking? Sure, buy all the gaming cards you can. If you actually need support and have a commercial use case? Pony up. Either way, benchmark your workload; don’t look at marketing numbers.
Is it a scam? Of course, but you can’t avoid it.
- I know that more bandwidth is better, but I wonder how it scales. I can only test my own setup, which is less than optimal for this purpose (PCIe 4.0 x16, no P2P), but it goes as follows: a single 4090 gets 40.9 t/s while two get 58.5 t/s using tensor parallelism, tested on Qwen/Qwen3-8B-FP8 with vLLM (a minimal version of that test is sketched after this list). I am really curious how this scales past 2 cards on PCIe 5.0 with P2P, which all the cards listed here except the 5090 support.
- The theory goes: yes, the H200 has a very impressive bandwidth of 4.89 TB/s, but for the same price you can get 37 TB/s spread across 58 RX 9070s. Whether this actually works in practice, I don’t know.
- I don’t need to build a datacenter; I’m fine with building a rack myself in my garage. And I don’t think that requires higher volumes than just purchasing at different retailers.
- I intend to run at FP8, so I wanted to show that instead of FP16, but it’s surprisingly difficult to find the numbers. Only the H200 datasheet clearly lists FP8 Tensor Core performance; the RTX PRO 6000 datasheet keeps it vague by only mentioning “AI TOPS”, which they define as “Effective FP4 TOPS with sparsity”, and they didn’t even bother writing a datasheet for the 5090, only saying “3352 AI TOPS”, which I suppose is FP4 then. The AMD datasheets only list FP16 and INT8 matrix, and whether INT8 matrix is comparable to FP8 I don’t know. So FP16 was the common denominator I could find for all the cards without comparing apples with oranges.
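For reference, a minimal sketch of the kind of vLLM throughput test described in the first bullet. This measures batch throughput rather than my exact single-stream harness, and assumes a working multi-GPU vLLM install:

```python
# Minimal vLLM throughput check; run once with tensor_parallel_size=1
# and once with 2 to get the scaling ratio.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B-FP8", tensor_parallel_size=2)
params = SamplingParams(max_tokens=512, temperature=0.8)
prompts = ["Explain how tensor parallelism splits a matmul."] * 8

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tok/s across {len(prompts)} prompts")
```

With more PCIe 5.0 cards and P2P enabled, the interesting question is whether the 1-to-2-card scaling ratio holds as you keep adding GPUs.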
> I don’t need to build a datacenter; I’m fine with building a rack myself in my garage.
During the last GPU mining craze, I helped build a 3-rack mining operation. GPUs are unregulated pieces of power-sucking shit from a power-management perspective. You do not have the power capacity to do this on residential service, even at 300 A.
Think of a microwave’s behaviour: yes, a 1000 W microwave pulls between 700 and 900 W while cooking, but the startup load is massive, sometimes almost 1800 W, depending on how cheap the thing is.
GPUs also behave like this, but not just at startup. They spin up load on demand, which means the hardware demands more power to get the job done; it doesn’t scale down the job to save power. Multiply that by 58 RX 9070s. Now add cooling.
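A rough back-of-envelope for the 58-card idea. Every number here is an assumption (nominal board power, guessed host overhead, guessed transient factor), not a measurement:

```python
# Rough power estimate for 58x RX 9070; all figures are assumptions.
n_gpus = 58
board_w = 220            # nominal RX 9070 board power
spike_factor = 1.6       # transient spikes well above board power
host_w_per_node = 800    # CPU/board/fans/PSU losses per 8-GPU node (guess)
nodes = n_gpus / 8

steady_kw = (n_gpus * board_w + nodes * host_w_per_node) / 1000
peak_kw = (n_gpus * board_w * spike_factor + nodes * host_w_per_node) / 1000
print(f"steady ~{steady_kw:.1f} kW, transient peaks ~{peak_kw:.1f} kW, before cooling")
```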
You cannot do this.
If you have three-phase power you could reasonably do this. It’s not very common, but some people have it, in which case running about 50 RX 9070s plus a strong AC unit should be possible, I think.
I guess. I don’t know why a person would do this, though… Especially just for an LLM.
🤷‍♂️
Thanks. While I would still like to know the performance scaling of a cheap cluster, this does answer the question: pay way more for high-end cards like the H200 for greater efficiency, or pay less and have to deal with these issues.
> the H200 has a very impressive bandwidth of 4.89 TB/s, but for the same price you can get 37 TB/s spread across 58 RX 9070s, but whether this actually works in practice I don’t know
Your math checks out, but only for some workloads. Other workloads scale out like shit, and then you want all your bandwidth concentrated. At some point you’ll also want to consider power draw:
- One H200 is like 1500W when including support infrastructure like networking, motherboard, CPUs, storage, etc.
- 58 consumer cards will be like 8 servers loaded with GPUs, at like 5kW each, so say 40kW in total.
Now include power and cooling over a few years and do the same calculations.
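A quick sketch of that calculation, assuming an electricity price of €0.30/kWh and a 100% duty cycle (both are assumptions; adjust for your region and actual load):

```python
# Three years of power at an assumed €0.30/kWh, running 24/7.
eur_per_kwh = 0.30
hours = 3 * 365 * 24

for name, kw in {"1x H200 node": 1.5, "8 consumer-GPU nodes": 40.0}.items():
    print(f"{name:22s} {kw:5.1f} kW  ~€{kw * hours * eur_per_kwh:,.0f}")
```

Under these assumptions the power bill alone on the consumer-card cluster dwarfs the hardware price difference.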
As for apples and oranges, this is why you can’t look at the marketing numbers, you need to benchmark your workload yourself.
“AI” in its current form is a scam. Nvidia is making the most of this grift. They are now worth more than any other company in the world.
The AI cards prioritize compute density instead of frame rate, etc., so you can’t directly compare price points between them like that without including that data. You could cluster gaming cards, though, using NVLink or the AMD Infinity Fabric thing. You aren’t going to get anywhere near the same performance, and you are really going to rely on quantization to make it work, but depending on your self-hosting use case you probably don’t need a $30,000 card.
It’s not a scam, but it’s also something you probably don’t need.
Initially, a lot of AI was trained on lower-class GPUs, and none of these special AI cards/blades existed. The problem is that the models are quite large and hence require a lot of VRAM to work on, or you split them and pay enormous latency penalties going across the network. Putting it all into one giant package costs a lot more, but it also performs a lot better, because AI is not an embarrassingly parallel problem that can be easily split across many GPUs without penalty. So the goal is often to reduce the number of GPUs you need to get a result quickly enough, and that brings its own set of problems with power density in server racks.
The table you’re referencing leaves out CUDA/tensor cores (count + generation), which are a big part of these GPUs, and it also doesn’t factor in the type of memory. From the comments it looks like you want to use a large MoE model. You aren’t going to be able to just stack raw power and expect to run this without a major deterioration of performance, if it runs at all.
Don’t forget your MoE model needs all-to-all communication for expert routing.
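To get a feel for the traffic involved, a back-of-envelope estimate. All the model dimensions here are made-up illustrative values, not any specific model:

```python
# Back-of-envelope all-to-all traffic for MoE expert routing.
hidden = 4096           # hidden size (assumed)
top_k = 8               # experts activated per token (assumed)
n_moe_layers = 48       # MoE layers in the model (assumed)
act_bytes = 1           # fp8 activations
tokens_per_s = 10_000   # target aggregate decode throughput (assumed)

# Per MoE layer, each token's activation may cross GPUs to its experts
# and come back again (hence the factor of 2).
bytes_per_token = hidden * act_bytes * top_k * 2 * n_moe_layers
print(f"~{tokens_per_s * bytes_per_token / 1e9:.1f} GB/s of all-to-all traffic")
```

Under these made-up numbers that is already in the ballpark of a full PCIe 4.0 x16 link, before counting any tensor-parallel traffic.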
Why do core counts and memory type matter when the table already includes memory bandwidth and TFLOP16?
The H200 has HBM and a lot of tensor cores, which is reflected in its high stats in the table, and the AMD GPUs don’t have CUDA cores at all.
I know a major deterioration is to be expected, but how major? Even in an extreme case where you only get 10% of the total efficiency, it’s still competitive against the H200, since you can get way more for the price even if you can only use 10% of it.
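For what it’s worth, the breakeven math from the table numbers alone; the real-world scaling efficiency is exactly the unknown here:

```python
# Breakeven check: 58x RX 9070 vs one H200, table numbers only.
h200_bw_tbs, h200_tf16 = 4.89, 1671
n = 58
agg_bw_tbs, agg_tf16 = n * 0.6446, n * 72.25   # ~37.4 TB/s, ~4190 TFLOPS

for eff in (0.10, 0.20, 0.50):   # hypothetical cluster scaling efficiency
    print(f"{eff:4.0%}: {agg_bw_tbs * eff:5.1f} vs {h200_bw_tbs} TB/s | "
          f"{agg_tf16 * eff:6.0f} vs {h200_tf16} TFLOP16")
```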
TFLOPS is a generic measurement, not actual utilization, and not specific to a given type of workload. Not all workloads saturate GPU utilization equally, and AI models will depend on CUDA/tensor cores; the generation and count of your cores determine how well optimized they are for AI workloads and how well they can actually use those TFLOPS for your task. And yes, AMD uses ROCm, which I didn’t feel I needed to specify since it’s a given (and years behind CUDA’s capabilities). The point is that these things are not equal, and there are major differences here alone.
I mentioned memory type since the cards you listed use different kinds (HBM vs GDDR), so you can’t just compare capacity alone and expect equal performance.
And again, for your specific use case of a large MoE model, you’d need to solve the GPU-to-GPU communication issue (ensuring both the connections and sufficient speed without getting bottlenecked).
I think you’re going to need to do an actual analysis of the specific setup you’re proposing. Good luck.
Products targeted towards businesses have always been unreasonably more expensive than those targeted towards consumers. It sucks for us AI hobbyists that Nvidia are stingy with VRAM on consumer cards, but I don’t find it surprising.
Personally I only have a single RTX 3090, but I know a lot of people online who are stacking multiple consumer cards to run AI. Buying used 3090s and putting them in a mining rig is probably still the best value for money if you need a large amount of VRAM.
How much VRAM do you actually need btw?
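If it helps the estimate, here is a rough VRAM calculator (weights plus KV cache). The default dimensions are illustrative assumptions, not a specific model:

```python
# Rough serving VRAM: weights + KV cache. Defaults are illustrative only.
def vram_gb(params_b,
            weight_bytes=1,                        # fp8 weights
            layers=48, kv_heads=8, head_dim=128,   # assumed architecture
            context=32_768, batch=4, kv_bytes=2):  # fp16 KV cache
    weights = params_b * 1e9 * weight_bytes
    kv = 2 * layers * kv_heads * head_dim * context * batch * kv_bytes
    return (weights + kv) / 1e9

print(f"~{vram_gb(70):.0f} GB for a 70B model under these assumptions")
```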


