
We did run vLLM on the 3090s: it measured ~3 tok/s slower on generation for our single-to-few-user pattern, with less flexibility on quantization and slower startup (actual minutes vs. single-digit seconds). We may do more with it in the future. There isn't unlimited time for us to tinker; I'm sharing our journey so far and our reasoning.

vLLM is the right call for concurrent batched serving (barrkel's point downthread is spot on), but for how we use it, llama.cpp is still the better fit.

The Spark/GX10 route is a genuinely different bet, though, and I appreciate you sharing your numbers. At the time (several months ago) the consensus was that GX10s were for fine-tuning only, and the numbers were very low.

And the card was never about replacing a Claude Max subscription. For the workloads we actually bought it for, it's giving us 140-200 tok/s (which matters).




I hear you on the insane amount of time vLLM takes to launch (atlas is a move in the right direction in that regard).

But mostly I wanted to make readers of your article aware that, if you want to do inference, paying 15K for a single 96GB card almost certainly makes no sense. Buy 4 GX10s with the same money, and enjoy dramatically better models and user scalability.

Regardless, thanks for putting in the effort to share your findings! I keep postponing doing the same... there are tons of things everyone is rediscovering on their own.


Wanna chime in: I recently tried vLLM to run an NVFP4 Gemma4 safetensors model and see how batching shows up in the t/s numbers. It's slow to start, it's Linux-only, it doesn't like WSL much, and I ended up with either old or nightly container builds, so I have more or less given up. I appreciate how llama.cpp simply works and does things fast and in an obvious way.


