Good value but a 12GB card isn't going to let you do too much given the low quality of small models. Curious what "home AI" use cases small models are being used for?
It would be nice to see best-value home AI setups under different budgets or RAM tiers, e.g. the best-value configuration for 128GB of GPU VRAM, etc.
My 48GB GPU VRAM "Home AI Server" cost ~$3100, all parts from eBay, running 3x A4000s in a Supermicro 1U rack server with 128GB RAM and a 32/64 core Xeon. Nothing amazing but I wanted the most GPU VRAM before paying the premium Nvidia tax on their larger GPUs.
This works well for Ollama/llama-server, which can make use of all GPU VRAM. Unfortunately ComfyUI can't make use of all GPU VRAM to run larger models, so I'm on the lookout for a lot more RAM in my next GPU server.
Really hoping Intel can deliver with its upcoming Arc Pro B60 Dual GPU for a great value 48GB option which can be run 4x in an affordable 192GB VRAM workstation [1]. If it runs Ollama and ComfyUI efficiently I'm sold.
I use a Proxmox server with an RTX 3060 to generate paintings (I have a couple of old jailbroken Amazon Kindles attached to walls for that purpose), and to run ollama, which is connected to Home Assistant & their voice preview device, allowing me to talk with an LLM without transmitting anything to cloud services.
Admittedly with that amount of VRAM the models I can run are fairly useless for stuff like controlling lights via Home Assistant: it occasionally does what I tell it to do, but usually not. It is pretty okay for telling me information, like the temperature or the value of some sensors I have connected to HA. For generating AI paintings it's enough. My server also hosts tons of virtual machines and docker containers and is used for remote gameplay, so the AI thing is just an extra.
msgodel 1 hours ago [-]
It's really not going to let you train much which IMO is the only reason I'd personally bother with a big GPU. Gradients get huge and everything does them with single/half precision floating point.
jononor 3 hours ago [-]
Agreed, 12 GB does not seem useful. For coding LLM, it seems 128 GB is needed to be even close to the frontier models.
For generative image processing (not video), it looks like one can get started with 16GB.
itake 3 hours ago [-]
My home AI machine does image classification.
mythz 3 hours ago [-]
Using just an Ollama VL Model (gemma3/mistral-small3.1/qwen2.5vl) or a specific library?
itake 1 hours ago [-]
My home server detects NSFW images in user generated content on my side project.
I'm curious why OP didn't go for the more recent Nvidia RTX 4060 Ti with 16 GB VRAM, which costs less (~USD500) brand new and has lower power consumption at 165W [1].
[1] RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI:
And if you're gonna be fine with 12GB, why not a 2080ti instead?
iamnotagenius 50 seconds ago [-]
[dead]
7speter 3 days ago [-]
I dunno everyone, but I think Intel has something big on their hands with their announced workstation gpus. The b50 is a low profile card that doesn’t have a powersupply hookup because it only uses something like 60 watts, and comes with 16gb vram at a msrp of 300 dollars.
I imagine companies will have first dibs via the likes of agreements with suppliers like CDW, etc, but if Intel had enough of these battlemage dies accumulated, it could also drastically change the local ai enthusiast/hobbyist landscape; for starters this could drive down the price of workstation cards that are ideal for inference, at the very least. I’m cautiously excited.
On the AMD front (really, a sort of open compute front), Vulkan Kompute is picking up steam and it would be really cool to have a standard that mostly(?) ships with Linux, and older ports available for Freebsd, so that we can actually run free as in freedom inference locally.
JKCalhoun 13 hours ago [-]
Someone posted that they had used a "mining rig" [0] from AliExpress for less than $100. It even has RAM and a CPU. He picked up a 2000W (!) DELL server PS for cheap off eBay. The GPUs were NVIDIA TESLAs (M40 for example) since they often have a lot of RAM and are less expensive.
I followed in those footsteps to create my own [1] (photo [2]).
I picked up a 24GB M40 for around $300 off eBay. I 3D printed a "cowl" for the GPU that I found online and picked up two small fans from Amazon that go in the cowl. Once attached, the cowl + fans keep the GPU cool. (These TESLA server GPUs have no fan since they're expected to live in one of those wind-tunnels called a server rack).
I bought the same cheap DELL server PS that the original person had used and I also had to get a break-out board (and power-supply cables and adapters) for the GPU.
Thanks to LLMs, I was able to successfully install Rocky Linux as well as CUDA and NVIDIA drivers. I SSH into it and run ollama commands.
My own hurdle at this point is: I have a 2nd 24 GB M40 TESLA but when installed on the motherboard, Linux will not boot. LLMs are helping me try to set up BIOS correctly or otherwise determine what the issue is. (We'll see.) I would love to get to 48 GB.
I had an old Tesla M40 12 GB lying around and figured I’d try it out with some 8-13B llms, but was disappointed to find that it’s around the same speed as my mac mini m2. I suppose the mac mini is a 10 years newer chip, but it’s crazy that mobile today matches data center from 10 years ago
Uehreka 3 days ago [-]
Love the attention to detail, I can tell this was a lot of work to put together and I hope it helps people new to PC building.
I will note though, 12GB of VRAM and 32GB of system RAM is a ceiling you’re going to hit pretty quickly if you’re into messing with LLMs. There’s basically no way to do a better job at the budget you’re working with though.
One thing I hear about a lot is people using things like RunPod to briefly get access to powerful GPUs/servers when they need one. If you spend $2/hr you can get access to an H100. If you have a budget of $1300 that could get you about 600 hours of compute time, which (unless you’re doing training runs) should last you several months.
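The rental arithmetic above can be checked in a few lines of Python. This is only a sketch: the $1300 budget and $2/hr rate come from the comment, while the 20 hours per week usage figure is a made-up assumption for illustration.

```python
def rental_hours(budget_usd: float, rate_usd_per_hour: float) -> float:
    """Hours of rented GPU time that a fixed budget buys."""
    if rate_usd_per_hour <= 0:
        raise ValueError("rate must be positive")
    return budget_usd / rate_usd_per_hour


# Budget and hourly rate are the figures quoted in the comment.
hours = rental_hours(1300, 2.0)
print(f"{hours:.0f} GPU-hours")

# Hypothetical usage pattern (assumption, not from the thread).
HOURS_PER_WEEK = 20
print(f"{hours / HOURS_PER_WEEK:.1f} weeks at {HOURS_PER_WEEK} h/week")
```

At exactly $2/hr the budget buys 650 hours, so "about 600" leaves a little room for storage or idle charges.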
In several months time the specs required to run good models will be different again in ways that are hard to predict, so this approach can help save on the heartbreak of buying an RTX 5090 only to find that even that doesn’t help much with LLM inference and we’re all gonna need the cheaper-but-more-VRAM Intel Arc B60s.
numpad0 7 hours ago [-]
I don't understand why some people build a "rig", put a lot of thought into ever so slightly differently binned CPUs, and then don't max out RAM (put aside DDR5 quirk considerations). It's like buying a sports car only to cheap out on tires. It makes no sense.
semi-extrinsic 12 hours ago [-]
> save on the heartbreak of buying an RTX 5090 only to find that even that doesn’t help much with LLM inference and we’re all gonna need the cheaper-but-more-VRAM Intel Arc B60s
When going for more VRAM, with an RTX 5090 currently sitting at $3000 for 32GB, I'm curious why people aren't trying to get the Dell C4140s. Those seem to go for $3000-$4000 for the whole server with 4x V100 16GB, so 64GB total VRAM.
Maybe it's just because they produce heat and noise like a small turbojet.
nickpsecurity 5 hours ago [-]
Don't the parallelizing techniques of a 4x build make using them more difficult than a 1x build with no extra parallelism? Couldn't the 32GB 5090 handle more models in their original configurations?
zargon 37 minutes ago [-]
> Don't the parallelizing techniques of a 4x build make using them more difficult than a 1x build with no extra parallelism?
For inference, no.
Havoc 9 hours ago [-]
> You pay a lot upfront for the hardware, but if your usage of the GPU is heavy, then you save a lot of money in the long run.
Last I saw data on this, it wasn’t true. In a like-for-like comparison (same model and quant), API is cheaper than elec, so you never make back the hardware cost. That was a year ago and API costs have plummeted, so I’d imagine it’s even worse now.
Datacenters have cheaper elec, can do batch inference at scale and more efficient cards. And that’s before we consider the huge free allowances by Google etc
Own AI gear is cool…but not due to economics
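One way to test the electricity-versus-API claim for your own situation is a small break-even calculation. The sketch below is illustrative only: the wattage, tokens per second, electricity price, and API prices are all hypothetical placeholders, not measurements from this thread.

```python
from typing import Optional


def local_usd_per_million_tokens(watts: float, tokens_per_second: float,
                                 usd_per_kwh: float) -> float:
    """Electricity cost of generating one million tokens locally."""
    seconds = 1_000_000 / tokens_per_second
    kwh = watts * seconds / 3_600_000  # watt-seconds (joules) to kWh
    return kwh * usd_per_kwh


def payback_million_tokens(hardware_usd: float, api_usd_per_mtok: float,
                           local_usd_per_mtok: float) -> Optional[float]:
    """Millions of tokens needed to recoup the hardware, or None if never."""
    saving = api_usd_per_mtok - local_usd_per_mtok
    if saving <= 0:
        return None  # the API is cheaper than electricity alone
    return hardware_usd / saving


# All inputs below are hypothetical assumptions.
local = local_usd_per_million_tokens(watts=300, tokens_per_second=30,
                                     usd_per_kwh=0.30)
print(f"local electricity: ${local:.2f} per million tokens")
print("payback vs $0.50/Mtok API:", payback_million_tokens(1300, 0.50, local))
print("payback vs $2.00/Mtok API:", payback_million_tokens(1300, 2.00, local))
```

With these placeholder numbers the cheaper API never pays back the hardware, which is the scenario the comment describes. Batch size, idle power, and resale value are ignored.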
edg5000 3 hours ago [-]
Is this also the case for token-heavy uses such as Claude Code? Not sure if I will end up using CC for development in the future, but if I end up leaning on that, I wonder if there would be a desire to essentially have it run 24/7. When run 24/7, CC would possibly incur more API fees than residential electricity would cost when running on your own gear? I have no idea about the numbers. Just wondering.
Jedd 3 days ago [-]
In January 2024 there was a similar post ( https://news.ycombinator.com/item?id=38985152 ) wherein the author selected dual NVidia 4060 Ti's for an at-home-LLM-with-voice-control -- because they were the cheapest cost per GB of well-supported VRAM at the time.
(They probably still are, or at least pretty close to it.)
That informed my decision shortly after, when I built something similar - that video card model was widely panned by gamers (or more accurately, gamer 'influencers'), but it was an excellent choice if you wanted 16GB of VRAM with relatively low power draw (150W peak).
TFA doesn't say where they are, or what currency they're using (which implies the hubris of a North American) - at which point that pricing for a second hand, smaller-capacity, higher-power-drawing 4070 just seems weird.
Appreciate the 'on a budget' aspect, it just seems like an objectively worse path, as upgrades are going to require replacement, rather than augment.
As per other comments here, 32 / 12 is going to be really limiting. Yes - lower parameter / smaller-quant models are becoming more capable, but at the same time we're seeing increasing interest in larger context for these at home use cases, and that chews up memory real fast.
T-A 11 hours ago [-]
> TFA doesn't say where they are
"the 1,440W limit on wall outlets in California" is a pretty good hint.
zxexz 3 hours ago [-]
Bringing back memories of testing the breakers in my college apartments to verify exactly which outlets were on which circuit, so I could pool as much as possible as needed. I distinctly remember pulling 20 kW once, celebrating with a beer; the memory of all those cables snaking through the old apartment makes me almost uneasy now. I do remember we didn’t have to pay for heat that winter, which felt like a major win in Massachusetts. Come to think of it, I’m pretty sure there are still some servers tucked away in a crawlspace in that basement.
dcassett 9 hours ago [-]
San Francisco specifically:
"I prompted ChatGPT to give me recommendations. Prompt: ... The final build will be located at my residence in San Francisco, CA, ..."
1shooner 9 hours ago [-]
>TFA doesn't say where they are, or what currency they're using
They say California, and I'm seeing the dollar amount in the title and metadata as $1.3k, was that an edit?
throwaway314155 13 hours ago [-]
> which implies the hubris of a North American
No need for that.
Jedd 2 hours ago [-]
Probably true.
But for those of us outside the USA bubble, it's incredibly tiring to have to intuit geo information (when geo information would add to the understanding).
As others noted in sibling comments, TFA had in fact mentioned in passing their location (in their quoted prompt to chatgpt, and at the very end of the third supporting point for the decision to go for an Nvidia 4070) 'California, CA'. I confess that I skimmed over both those paragraphs.
Now, sure, CA is a country code, but I stand corrected that the author completely hid their location. Had I spotted those clues I'd not have to have made any assumptions around wall power capabilities & costs, new & second hand market availability / costs, etc.
I think I mostly catered for those considerations in the rest of my original comment though - asserted power sensitivity makes it surprising that a higher-power-requiring, smaller-RAM-capacity, more-expensive-than-a-sibling-generation-16GB card was selected.
topato 12 hours ago [-]
True, though
topato 12 hours ago [-]
He did soften the blow by saying North American, rather than the more correctly apropos, American
dfc 9 hours ago [-]
The author also refers to Californian power limits. So it seems the criticism is misplaced.
PeterStuer 3 hours ago [-]
For image generation the article's setup might be viable, but do not expect to run LLMs with satisfactory quality and speed on 12GB VRAM.
DogRunner 3 days ago [-]
I used a similar budget and built something like this:
7x RTX 3060 - 12 GB which results in 84GB Vram
AMD Ryzen 5 - 5500GT with 32GB Ram
All in a 19-inch rack with a nice cooling solution and a beefy power supply.
My costs? 1300 Euro, but yeah, I sourced my parts on ebay / second hand.
My power consumption is below 500 Watt at the wall when using LLMs, since I did some optimizations:
* Worked on power optimizations and after many weeks of benchmarking, the sweet spot on the RTX3060 12GB cards is a 105 Watt limit
* Created patches for Ollama (https://github.com/ollama/ollama/pull/10678) to group models by exact memory allocation instead of spreading them over all available GPUs (this also reduces the VRAM overhead)
* ensured that ASPM is used on all relevant PCI components (Powertop is your friend)
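As a rough sketch of the power-limit step in the list above: `nvidia-smi -i <index> -pl <watts>` is the standard way to cap a card's power draw, and it usually needs root. The Python wrapper below, including its function names and the dry-run default, is a hypothetical convenience and not the commenter's actual script. The 7 cards and the 105 W limit are taken from the comment.

```python
import subprocess
from typing import List


def power_limit_commands(gpu_count: int, watts: int) -> List[List[str]]:
    """Build one `nvidia-smi -i <index> -pl <watts>` command per GPU."""
    return [["nvidia-smi", "-i", str(i), "-pl", str(watts)]
            for i in range(gpu_count)]


def apply(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or run them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if nvidia-smi fails


# 7 cards at 105 W each, as in the comment; dry run only prints.
apply(power_limit_commands(7, 105))
```

Keeping the dry run as the default makes it easy to inspect the commands before touching real hardware.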
It's not all shiny:
* I still use PCIe3 X1 for most of the cards, which limits their capability, but all I found so far (PCIe Gen4 x4 extender and bifurcation/special PCIE routers) are just too expensive to be used on such low powered cards
* Due to the slow PCIe bandwidth, the performance drops significantly
* Max VRAM per GPU is king. If you split up a model over several cards, the RAM allocation overhead is huge! (See the examples in my ollama patch above.) I would rather use 3x 48GB instead of 7x 12GB.
* Some RTX 3060 12GB Cards do idle at 11-15 Watt, which is unacceptable. Good BIOSes like the one from Gigabyte (Windforce xxx) do idle at 3 Watt, which is a huge difference when you use 7 or more cards. These BIOSes can be patched, but this can be risky
All in all, this server idles at 90-100Watt currently, which is perfect as a central service for my tinkerings and my family usage.
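To put the idle-power point in numbers, here is a small sketch using the figures quoted above (7 cards, 11-15 W idle on a bad BIOS versus 3 W on a good one). The yearly conversion assumes the server idles around the clock, which is a simplification.

```python
def idle_savings_watts(cards: int, bad_idle_w: float, good_idle_w: float) -> float:
    """Total idle power saved by moving every card to the better BIOS."""
    return cards * (bad_idle_w - good_idle_w)


def yearly_kwh(watts: float) -> float:
    """Energy used in a year by a constant draw (assumes 24/7 operation)."""
    return watts * 24 * 365 / 1000


low = idle_savings_watts(7, 11, 3)    # best case for the bad BIOS
high = idle_savings_watts(7, 15, 3)   # worst case for the bad BIOS
print(f"saved at idle: {low:.0f}-{high:.0f} W")
print(f"per year: {yearly_kwh(low):.0f}-{yearly_kwh(high):.0f} kWh")
```

With the server idling at 90-100 W in total, a difference of this size is a large share of the whole idle budget.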
jononor 3 hours ago [-]
Impressive! What kind of motherboard do you use to host 7 GPUs?
noufalibrahim 7 hours ago [-]
This is interesting. We recently built a similar machine to implement a product that we're building on a customer site.
I didn't buy second hand parts since I wasn't sure of the quality, so it was a little pricey, but we have the entire thing working now and over the last week we added the LLM server to the mix. Haven't released it yet though.
I wrote about some "fun" we had getting it together here but it's not as technically detailed as the original article.
The caveat is that sometimes a library might be expecting an older version of cuda.
The vram on the GPU does make a difference, so it would at some point be worth looking at another GPU or increasing your system ram if you start running into limits.
However I wouldn't worry too much right away, it's more important to get started and get an understanding of how these local LLMs operate and take advantage of the optimisations that the community is making to make it more accessible. Not everyone has a 5090, and if LLMs remain in the realms of high end hardware, it's not worth the time.
throwaway314155 13 hours ago [-]
The other main caveat is that installing from custom sources using apt is a massive pain in the ass.
koakuma-chan 5 hours ago [-]
I tried running an LLM locally today, installed the CUDA toolkit, and it was missing cudnn.h
I gave up.
vunderba 3 days ago [-]
The RTX market is particularly irritating right now, even second-hand 4090s are still going for MSRP if you can find them at all.
Most of the recommendations for this budget AI system are on point - the only thing I'd recommend is more RAM. 32GB is not a lot - particularly if you start to load larger models through formats such as GGUF and want to take advantage of system ram to split the layers at the cost of inference speed. I'd recommend at least 2 x 32GB or even 4 x 32GB if you can swing it budget-wise.
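The layer-splitting idea above can be estimated before downloading anything. The sketch below assumes all layers are the same size and reserves a fixed amount of VRAM for the KV cache and runtime overhead. Both are simplifications, and the 20 GB / 64-layer model is a hypothetical example. The result is the kind of number you would pass to llama.cpp's `--n-gpu-layers` option.

```python
def gpu_layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM.

    reserve_gb is an assumed allowance for KV cache and runtime overhead.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical 20 GB GGUF with 64 layers on a 12 GB card.
MODEL_GB, N_LAYERS = 20.0, 64
on_gpu = gpu_layers_that_fit(MODEL_GB, N_LAYERS, vram_gb=12.0)
in_ram_gb = (N_LAYERS - on_gpu) * MODEL_GB / N_LAYERS
print(f"{on_gpu} of {N_LAYERS} layers on GPU, ~{in_ram_gb:.1f} GB left in system RAM")
```

The part that stays in system RAM has to fit alongside the OS and everything else, which is why 32GB gets tight quickly.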
Author mentioned using Claude for recommendations, but another great resource for building machines is PC Part Picker. They'll even show warnings if you try pairing incompatible parts or try to use a PSU that won't supply the minimum recommended power.
I thought those 4090’s were weird. You pay more for them than the brand new 5090. And then there’s AMD, which everyone loves to hate, but has similar GPU’s that cost 1/4th of what a similar Nvidia GPU costs.
zlies 2 hours ago [-]
Did you not use any thermal paste at all, or did you just forget to mention it in your post?
danielhep 7 hours ago [-]
What are the practical uses of a self hosted LLM? Is it actually possible to approach the likes of Claude or one of the other big ones on your own hardware for a reasonable budget? I don’t know if this is something that’s actually worth it or if people are just building these rigs for fun or niche use cases that don’t require the intelligence of a hosted LLM.
Isn't the new computer that NVIDIA is about to release much better than this one, and the same price? Why would anyone buy this one now? Seems like a waste of money.
numpad0 8 hours ago [-]
Couple best vram for buck && borderline space heater GPUs off top of my head: Tesla K80(12GBx2), M40(24GB), Radeon Instinct MI(25|50|60|100)(8-32GB?), Radeon Pro V340(16GBx2), bunch of other Radeon Vega 8GB cards e.g. Vega 56, NVIDIA P102/P104(~16GB), Intel A770(16GB). Note: some of these are truly just space heaters.
I'm not sure if right now is the best timing for building an LLM rig, as Intel Arc B60(24GBx2) is about to go on sale. Or maybe it is to secure multiples of 16GB cards hastily offloaded before its launch?
djhworld 13 hours ago [-]
With system builds like this I always feel the VRAM is the limiting factor when it comes to what models you can run, and consumer grade stuff tends to max out at 16GB or (sometimes) 24GB for more expensive models.
It does make me wonder whether we'll start to see more and more computers with unified memory architecture (like the Mac) - I know nvidia have the Digits thing which has been renamed to something else
JKCalhoun 12 hours ago [-]
Go server GPU (TESLA) and 24 GB is not unusual. (And also about $300 used on eBay.)
pshirshov 11 hours ago [-]
3090 for ~1000 is much more solid choice. Also these old mining mobos play very well for multi-gpu ollama.
msp26 12 hours ago [-]
> 12GB vram
waste of effort, why would you go through the trouble of building + blogging for this?
jacekm 12 hours ago [-]
For $100 more you could get a used 3090 with twice as much VRAM. You could also get a 4060 Ti, which is cheaper than a 4070 and has 16 GB VRAM (although it's less powerful too, so I guess it depends on the use case)
AJRF 9 hours ago [-]
Why a 4070 over a 3090? A 4070 has half the VRAM. In the UK you can get a 3090 for like 600GBP.
golly_ned 3 days ago [-]
Whenever I get to a section that was clearly autogenerated by an LLM I lose interest in the entire article. Suddenly the entire thing is suspect and I feel like I’m wasting my time, since I’m no longer encountering the mind of another person, just interacting with a system.
bravesoul2 3 days ago [-]
I didn't see anything like that here. Yeah they used bullets.
golly_ned 2 days ago [-]
There’s a section that says what the parts of a pc are, and what that part is.
Nevermark 14 hours ago [-]
> I used the AI-generated recommendations as a starting point, and refined the options with my own research.
Referring to this section?
I don't see a problem with that. This isn't an article about a design intended for 10,000 systems. Just one person's follow through on an interesting project. With disclosure of methodology.
throwaway314155 13 hours ago [-]
Eh, yeah - the article starts off pretty specific but then gets into the weeds of stuff like how to put your PC together, which is far from novel information and certainly not on-topic in my opinion.
rcarmo 3 days ago [-]
The trouble with these things is that “on a budget” doesn’t deliver much when most interesting and truly useful models are creeping beyond the 16GB VRAM limit and/or require a lot of wattage. Even a Mac mini with enough RAM is starting to look like an expensive proposition, and the AMD Strix Halo APUs (the SKUs that matter, like the Framework Desktop at 128GB) are around $2K.
As someone who built a period-equivalent rig (with a 12GB 3060 and 128GB RAM) a few years ago, I am not overly optimistic that local models will keep being a cheap alternative (never mind the geopolitics). And yeah, there are very cheap ways to run inference, but they become pointless - I can run Qwen and Phi4 locally on an ARM chip like the RK3588, but it is still dog slow.
uniposterz 3 days ago [-]
I had a similar setup for a local LLM, 32GB was not enough. I recommend going for 64GB.
alganet 1 hours ago [-]
Let me try to put this in the scale of coffee:
--
Using LLM via api: Starbucks.
Inference at home: Nespresso capsules.
Fine-tune a small model at home: Owning a grinder and an italian espresso machine.
Pre-training a model: Owning a moderate coffee plantation.
incomingpain 3 days ago [-]
I've been dreaming on pcpartpicker.
I think Radeon RX 7900 XT - 20 GB has been the best bang for your buck. Enables full gpu 32B?
Looking at what other people have been doing lately, they aren't doing this.
They are getting 64+ core cpus and 512GB of ram. Keeping it on cpu and enabling massive models. This setup lets you do deepseek 671B.
It makes me wonder, how much better is 671B vs 32B?
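A back-of-the-envelope check of which model sizes fit where: weight memory is roughly parameter count times bits per weight. The sketch below ignores quantization metadata, KV cache, and runtime overhead, so real files are somewhat larger. The bit widths shown are illustrative choices, not a claim about any specific release.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, memory_gb: float) -> bool:
    """True if the weights alone fit in the given memory."""
    return weight_gb(params_billion, bits_per_weight) <= memory_gb


for params, bits, mem, where in [
    (32, 4, 20, "20 GB GPU"),
    (32, 16, 20, "20 GB GPU"),
    (671, 4, 512, "512 GB system RAM"),
    (671, 8, 512, "512 GB system RAM"),
]:
    size = weight_gb(params, bits)
    verdict = "fits" if fits(params, bits, mem) else "does not fit"
    print(f"{params}B at {bits}-bit: ~{size:.1f} GB, {verdict} in {where}")
```

Under these assumptions a 32B model at 4-bit is about 16 GB, which is why it lands on a 20 GB card, and a 671B model needs a low-bit quant to fit in 512 GB.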
Aeolun 10 hours ago [-]
I bought an RX 7900 XTX with 24GB, and it’s everything I expected of it. It’s absolutely massive though. I thought I could add one extra for more memory, but that’s a pipe dream in my little desktop box.
Cheap too, compared to a lot of what I’m seeing.
atentaten 13 hours ago [-]
Enjoyed the article as I am interested in the same. I would like to have seen more about the specific use cases and how they performed on the rig.
v5v3 3 days ago [-]
I thought prevailing wisdom was that a used 3090 with its larger vram was the best budget gpu choice?
And in general, if on a budget then why not buy used and not new? And more so as the author himself talks about the resale value for when he sells it on.
olowe 3 days ago [-]
> I thought prevailing wisdom was that a used 3090 with its larger vram was the best budget gpu choice?
The trick is memory bandwidth - not just the amount of VRAM - is important for LLM inference. For example, the B50 specs list a memory bandwidth of 224 GB/s [1], whereas the Nvidia RTX 3090 has over 900GB/s [2]. The 4070's bandwidth is "just" 500GB/s [3].
More VRAM helps run larger models but with lower bandwidth tokens could be generating so slowly it's not really practical for day-to-day use or experimenting.
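The bandwidth argument can be turned into a rough upper bound: if every weight has to be read once per generated token, tokens per second cannot exceed memory bandwidth divided by model size. The sketch below uses the bandwidth figures quoted above (with "over 900" rounded down to 900) and a hypothetical 8 GB model. It ignores KV cache reads and compute time, so real throughput will be lower.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed when each token reads all weights once."""
    if model_gb <= 0:
        raise ValueError("model size must be positive")
    return bandwidth_gb_s / model_gb


MODEL_GB = 8.0  # hypothetical quantized model size (assumption)

# Bandwidth figures (GB/s) as quoted in the comment above.
cards = {"B50": 224.0, "RTX 4070": 500.0, "RTX 3090": 900.0}
for name, bandwidth in cards.items():
    bound = max_tokens_per_second(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{bound:.0f} tokens/s")
```

The ratio between cards is what matters here, since it stays the same whatever model size you plug in.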
> The trick is memory bandwidth - not just the amount of VRAM - is important for LLM inference.
I'm not really knowledgeable about this space, so maybe I'm missing something:
Why does the bus performance affect token generation? I would expect it to cause a slow startup when loading the model, but once the model is loaded, just how much bandwidth can the token generation possibly use?
Token generation is completely on the card using the memory on the card, without any bus IO at all, no?
IOW, I'm trying to think of what IO the card is going to need for token generation, and I can't think of any other than returning the tokens (which, even on a slow 100MB/s transfer, is still going to be about 100x the rate at which tokens are being generated).
stevenhuang 7 hours ago [-]
During inference, each token passes through each parameter of the model as matrix-vector products. And then as context grows, each new token passes through all current context tokens as matrix-vector products.
This means bandwidth requirements grow as context sizes grow.
For datacenter workloads batching can be used to efficiently use this memory bandwidth and make things compute bound instead
lelanthran 3 hours ago [-]
[I'm still not understanding]
It seems to me that even if you pass in a long context on every prompt, that context is still tiny compared to the execution time on the processor/GPU/tensorcore/etc.
Let's say I load up a model of 12GB on my 12GB VRAM GPU. I pass in a prompt with 1MB of context which causes a response of 500KB after 1s. That's still only 1.5MB of IO transferred in 1s, which kept the GPU busy for 1s. Increasing the prompt is going to increase the duration to a response accordingly.
Unless the GPU is not fully utilised on each prompt-response cycle, I feel that the GPU is still the bottleneck here, not the bus performance.
jononor 3 hours ago [-]
GPU memory bandwidth is the limiting factor, not PCIe bandwidth.
The memory bandwidth is critical because the models rely on getting all the parameters from memory to do computation, and there is a low amount of computation per parameter, so memory tends to be the bottleneck.
imtringued 1 hours ago [-]
1MB of context can maybe hold 10 tokens depending on your model.
For reference, llama 3.2 8B used to take 4 KiB per token per layer. At 32 layers that is 128KiB per token, or 8 tokens per MiB of KV cache (context). If your context holds 8000 tokens including responses then you need around 1GB.
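The KV cache figures above can be reproduced with a short calculation. The 4 KiB per token per layer, 32 layers, and 8000 tokens come from the comment. The attention configuration used to arrive at 4 KiB (8 KV heads, head dimension 128, 2-byte values) is an assumption chosen because it yields that figure, not something stated in the thread.

```python
KIB = 1024
MIB = 1024 * KIB


def kv_bytes_per_token_per_layer(kv_heads: int, head_dim: int,
                                 bytes_per_value: int) -> int:
    """Bytes of K and V stored for one token in one layer."""
    return 2 * kv_heads * head_dim * bytes_per_value  # 2 = keys + values


def kv_cache_bytes(tokens: int, layers: int, per_token_per_layer: int) -> int:
    """Total KV cache size for a context of `tokens` tokens."""
    return tokens * layers * per_token_per_layer


# Assumed configuration that gives 4 KiB per token per layer.
per_layer = kv_bytes_per_token_per_layer(kv_heads=8, head_dim=128,
                                         bytes_per_value=2)
per_token = kv_cache_bytes(1, 32, per_layer)
total = kv_cache_bytes(8000, 32, per_layer)

print(f"{per_layer // KIB} KiB per token per layer")
print(f"{per_token // KIB} KiB per token, {MIB // per_token} tokens per MiB")
print(f"{total / MIB:.0f} MiB for 8000 tokens")
```

8000 tokens comes to 1000 MiB, which matches the "around 1GB" estimate.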
>Unless the GPU is not fully utilised on each prompt-response cycle, I feel that the GPU is still the bottleneck here, not the bus performance.
Matrix vector multiplication implies a single floating point multiplication and addition (2 flops) per parameter. Your GPU can do way more flops than that without using tensor cores at all. In fact, this workload bores your GPU to death.
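The "bores your GPU" point can be expressed as arithmetic intensity: a matrix-vector product does 2 flops per parameter, while the GPU has to fetch each parameter from memory first. The sketch below compares that ratio with what a card could sustain. The peak compute and bandwidth figures are hypothetical round numbers, not the specs of any particular GPU.

```python
def matvec_flops_per_byte(bytes_per_param: float) -> float:
    """Flops performed per byte of weights read (2 flops per parameter)."""
    return 2.0 / bytes_per_param


def machine_balance(peak_flops: float, bandwidth_bytes_s: float) -> float:
    """Flops the hardware can do in the time it takes to read one byte."""
    return peak_flops / bandwidth_bytes_s


def compute_utilization_bound(bytes_per_param: float, peak_flops: float,
                              bandwidth_bytes_s: float) -> float:
    """Upper bound on the fraction of peak compute that decode can use."""
    ratio = matvec_flops_per_byte(bytes_per_param) / machine_balance(
        peak_flops, bandwidth_bytes_s)
    return min(1.0, ratio)


# Hypothetical GPU: 30 TFLOP/s peak, 500 GB/s memory bandwidth (assumptions).
PEAK_FLOPS = 30e12
BANDWIDTH = 500e9

for label, bytes_per_param in [("FP16", 2.0), ("4-bit", 0.5)]:
    bound = compute_utilization_bound(bytes_per_param, PEAK_FLOPS, BANDWIDTH)
    print(f"{label}: at most {bound:.1%} of peak compute is usable")
```

With these numbers single-stream decoding leaves well over 90% of the compute idle, which is why batching in datacenters changes the picture.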
retinaros 14 hours ago [-]
yes it is
ww520 13 hours ago [-]
I use a 10-year old laptop to run a local LLM. The time between prompts is 10-30 seconds. Not for speedy interactive usage.
usercvapp 11 hours ago [-]
I have a server at home sitting IDLE for the last 2 years with 2 TB of RAM and 4 CPUs.
I am gonna push it this week and launch some LLM models to see how they perform!
How efficient are they on the electric bill when running locally?
whalesalad 10 hours ago [-]
I would rather spend $1,300 on openai/anthropic credits. The performance from that 4070 cannot be worth the squeeze.
[1] https://www.servethehome.com/maxsun-intel-arc-pro-b60-dual-g...
source code: https://github.com/KevinColemanInc/NSFW-FLASK
[1] RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI:
https://news.ycombinator.com/item?id=44196991
[0] https://www.aliexpress.us/item/3256806580127486.html
[1] https://bsky.app/profile/engineersneedart.com/post/3lmg4kiz4...
[2] https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:oxjqlam...
[1] https://www.tomshardware.com/pc-components/gpus/crazed-modde...
I will note though, 12GB of VRAM and 32GB of system RAM is a ceiling you’re going to hit pretty quickly if you’re into messing with LLMs. There’s basically no way to do a better job at the budget you’re working with though.
One thing I hear about a lot is people using things like RunPod to briefly get access to powerful GPUs/servers when they need one. If you spend $2/hr you can get access to an H100. If you have a budget of $1300 that could get you about 600 hours of compute time, which (unless you’re doing training runs) should last you several months.
In several months time the specs required to run good models will be different again in ways that are hard to predict, so this approach can help save on the heartbreak of buying an RTX 5090 only to find that even that doesn’t help much with LLM inference and we’re all gonna need the cheaper-but-more-VRAM Intel Arc B60s.
When going for more VRAM, with an RTX 5090 currently sitting at $3000 for 32GB, I'm curious why people aren't trying to get the Dell C4140s. Those seem to go for $3000-$4000 for the whole server with 4x V100 16GB, so 64GB total VRAM.
Maybe it's just because they produce heat and noise like a small turbojet.
For inference, no.
Last I saw data on this wasn’t true. A like for like comparison (same model and quant) API is cheaper than elec so you never make back hardware cost. That was a year ago and api costs have plummeted so I’d imagine it’s even worse now.
Datacenters have cheaper elec, can do batch inference at scale and more efficient cards. And that’s before we consider the huge free allowances by Google etc
Own AI gear is cool…but not due to economics
(They probably still are, or at least pretty close to it.)
That informed my decision shortly after, when I built something similar - that video card model was widely panned by gamers (or more accurately, gamer 'influencers'), but it was an excellent choice if you wanted 16GB of VRAM with relatively low power draw (150W peak).
TFA doesn't say where they are, or what currency they're using (which implies the hubris of a North American) - at which point that pricing for a second hand, smaller-capacity, higher-power-drawing 4070 just seems weird.
Appreciate the 'on a budget' aspect, it just seems like an objectively worse path, as upgrades are going to require replacement, rather than augment.
As per other comments here, 32 / 12 is going to be really limiting. Yes - lower parameter / smaller-quant models are becoming more capable, but at the same time we're seeing increasing interest in larger context for these at home use cases, and that chews up memory real fast.
"the 1,440W limit on wall outlets in California" is a pretty good hint.
"I prompted ChatGPT to give me recommendations. Prompt: ... The final build will be located at my residence in San Francisco, CA, ..."
They say California, and I'm seeing the dollar amount in the title and metadata as $1,3k, was that an edit?
No need for that.
But for those of us outside the USA bubble, it's incredibly tiring to have to intuit geo information (when geo information would add to the understanding).
As others noted in sibling comments, TFA had in fact mentioned in passing their location (in their quoted prompt to chatgpt, and at the very end of the third supporting point for the decision to go for an Nvidia 4070) 'California, CA'. I confess that I skimmed over both those paragraphs.
Now, sure, CA is a country code, but I stand corrected that the author completely hid their location. Had I spotted those clues I'd not have to have made any assumptions around wall power capabilities & costs, new & second hand market availability / costs, etc.
I think I mostly catered for those considerations in the rest of my original comment though - asserted power sensitivity makes it surprising that a higher-power-requiring, smaller-RAM-capacity, more-expensive-than-a-sibling-generation-16GB card was selected.
7x RTX 3060 12GB, which results in 84GB VRAM
AMD Ryzen 5 5500GT with 32GB RAM
All in a 19-inch rack with a nice cooling solution and a beefy power supply.
My costs? 1300 Euro, but yeah, I sourced my parts on ebay / second hand.
(Added some 3d printed parts into the mix: https://www.printables.com/model/1142963-inter-tech-and-gene... https://www.printables.com/model/1142973-120mm-5mm-rised-noc... https://www.printables.com/model/1142962-cable-management-fu... if you think about building something similar)
My power consumption is below 500 Watt at the wall when using LLMs, since I did some optimizations:
* Worked on power optimizations and after many weeks of benchmarking, the sweet spot on the RTX3060 12GB cards is a 105 Watt limit
* Created patches for Ollama (https://github.com/ollama/ollama/pull/10678) to place models according to their exact memory allocation instead of spreading them over all available GPUs (this also reduces the VRAM overhead)
* ensured that ASPM is used on all relevant PCI components (Powertop is your friend)
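For anyone wanting to try the power-limit part, here is a minimal sketch that builds one `nvidia-smi -i <index> -pl <watts>` command per card. The 7 cards and 105 W figure are from the list above; the helper names are hypothetical, setting a limit normally needs root, and the limit typically has to be reapplied after a reboot.

```python
import subprocess


def power_limit_commands(num_gpus: int, watts: int) -> list[list[str]]:
    """Build one nvidia-smi power-limit command per GPU index."""
    return [["nvidia-smi", "-i", str(i), "-pl", str(watts)] for i in range(num_gpus)]


def apply_power_limits(num_gpus: int, watts: int) -> None:
    """Run the commands; fails loudly if nvidia-smi rejects a limit."""
    for cmd in power_limit_commands(num_gpus, watts):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: only print the commands instead of executing them.
    for cmd in power_limit_commands(7, 105):
        print(" ".join(cmd))
```

Calling `apply_power_limits(7, 105)` would actually apply the limits; the dry run is the default here so nothing changes by accident.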
It's not all shiny:
* I still use PCIe3 X1 for most of the cards, which limits their capability, but all I found so far (PCIe Gen4 x4 extender and bifurcation/special PCIE routers) are just too expensive to be used on such low powered cards
* Due to the slow PCIe bandwidth, the performance drops significantly
* Max VRAM per GPU is king. If you split up a model over several cards, the RAM allocation overhead is huge! (See the examples in my Ollama patch above.) I would rather use 3x 48GB instead of 7x 12GB.
* Some RTX 3060 12GB Cards do idle at 11-15 Watt, which is unacceptable. Good BIOSes like the one from Gigabyte (Windforce xxx) do idle at 3 Watt, which is a huge difference when you use 7 or more cards. These BIOSes can be patched, but this can be risky
All in all, this server idles at 90-100Watt currently, which is perfect as a central service for my tinkerings and my family usage.
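To put the idle-power point in numbers, a small sketch using the per-card figures quoted above (11-15 W versus 3 W, 7 cards). Running around the clock for a full year is an assumption.

```python
HOURS_PER_YEAR = 24 * 365  # assumes the server is never powered off


def idle_savings_watts(num_cards: int, bad_idle_w: float, good_idle_w: float) -> float:
    """Extra idle draw avoided by using cards with the better-behaved BIOS."""
    return num_cards * (bad_idle_w - good_idle_w)


def kwh_per_year(watts: float) -> float:
    """Convert a constant draw in watts to energy per year in kWh."""
    return watts * HOURS_PER_YEAR / 1000


low = idle_savings_watts(7, 11, 3)    # best case for the "bad" cards
high = idle_savings_watts(7, 15, 3)   # worst case for the "bad" cards
print(f"{low:.0f}-{high:.0f} W saved at idle")
print(f"{kwh_per_year(low):.0f}-{kwh_per_year(high):.0f} kWh per year")
```

That is a difference of 56 to 84 W on a box that idles at 90-100 W in total, which is why the BIOS matters so much with this many cards.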
I didn't buy second hand parts since I wasn't sure of the quality, so it was a little pricey, but we have the entire thing working now and over the last week we added the LLM server to the mix. Haven't released it yet though.
I wrote about some "fun" we had getting it together here but it's not as technically detailed as the original article.
https://blog.hpcinfra.com/when-linkedin-met-reality-our-bang...
https://www.bosgamepc.com/products/bosgame-m5-ai-mini-deskto...
The caveat is that sometimes a library might be expecting an older version of cuda.
The vram on the GPU does make a difference, so it would at some point be worth looking at another GPU or increasing your system ram if you start running into limits.
However I wouldn't worry too much right away, it's more important to get started and get an understanding of how these local LLMs operate and take advantage of the optimisations that the community is making to make it more accessible. Not everyone has a 5090, and if LLMs remain in the realms of high end hardware, it's not worth the time.
I gave up.
Most of the recommendations for this budget AI system are on point - the only thing I'd recommend is more RAM. 32GB is not a lot - particularly if you start to load larger models through formats such as GGUF and want to take advantage of system ram to split the layers at the cost of inference speed. I'd recommend at least 2 x 32GB or even 4 x 32GB if you can swing it budget-wise.
Author mentioned using Claude for recommendations, but another great resource for building machines is PC Part Picker. They'll even show warnings if you try pairing incompatible parts or try to use a PSU that won't supply the minimum recommended power.
https://pcpartpicker.com
https://www.amazon.sg/NVIDIA-Jetson-Orin-64GB-Developer/dp/B...
I'm not sure if right now is the best timing for building an LLM rig, as Intel Arc B60(24GBx2) is about to go on sale. Or maybe it is to secure multiples of 16GB cards hastily offloaded before its launch?
It does make me wonder whether we'll start to see more and more computers with unified memory architecture (like the Mac) - I know nvidia have the Digits thing which has been renamed to something else
waste of effort, why would you go through the trouble of building + blogging for this?
Referring to this section?
I don't see a problem with that. This isn't an article about a design intended for 10,000 systems. Just one person's follow through on an interesting project. With disclosure of methodology.
As someone who built a period-equivalent rig (with a 12GB 3060 and 128GB RAM) a few years ago, I am not overly optimistic that local models will keep being a cheap alternative (never mind the geopolitics). And yeah, there are very cheap ways to run inference, but they become pointless - I can run Qwen and Phi4 locally on an ARM chip like the RK3588, but it is still dog slow.
--
Using LLM via api: Starbucks.
Inference at home: Nespresso capsules.
Fine-tune a small model at home: Owning a grinder and an italian espresso machine.
Pre-training a model: Owning a moderate coffee plantation.
I think Radeon RX 7900 XT - 20 GB has been the best bang for your buck. Enables full gpu 32B?
Looking at what other people have been doing lately, they aren't doing this.
They are getting 64+ core cpus and 512GB of ram. Keeping it on cpu and enabling massive models. This setup lets you do deepseek 671B.
It makes me wonder, how much better is 671B vs 32B?
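A rough way to see why 512GB of RAM is the number people aim for: weight memory is roughly parameters times bits per weight divided by 8. This sketch is idealized. It ignores KV cache, runtime overhead, and the fact that real quant formats use somewhat more than their nominal bit width.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, memory_gb: float) -> bool:
    """True if the weights alone fit; real usage needs extra headroom."""
    return weights_gb(params_billion, bits_per_weight) <= memory_gb


for bits in (4, 8, 16):
    print(f"671B @ {bits}-bit: {weights_gb(671, bits):.1f} GB, "
          f"fits in 512 GB RAM: {fits(671, bits, 512)}")

print(f"32B @ 4-bit: {weights_gb(32, 4):.1f} GB, "
      f"fits in 20 GB VRAM: {fits(32, 4, 20)}")
```

By this estimate a 671B model only fits in 512GB at around 4 bits per weight, and a 32B model at 4-bit leaves just a few GB of a 20GB card for context.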
Cheap too, compared to a lot of what I’m seeing.
And in general, if on a budget then why not buy used and not new? And more so as the author himself talks about the resale value for when he sells it on.
The trick is memory bandwidth - not just the amount of VRAM - is important for LLM inference. For example, the B50 specs list a memory bandwidth of 224 GB/s [1], whereas the Nvidia RTX 3090 has over 900GB/s [2]. The 4070's bandwidth is "just" 500GB/s [3].
More VRAM helps run larger models but with lower bandwidth tokens could be generating so slowly it's not really practical for day-to-day use or experimenting.
[1]: https://www.intel.com/content/www/us/en/products/sku/242615/...
[2]: https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622
[3]: https://www.thefpsreview.com/gpu-family/nvidia-geforce-rtx-4...
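As a back-of-the-envelope illustration of the bandwidth point: if every generated token has to read all the weights once, tokens per second can't exceed bandwidth divided by model size. The bandwidth figures are the ones quoted above (with "over 900" rounded down to 900); the 8 GB model size is an assumption picked for illustration.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed when each token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 8.0  # hypothetical model size, not from the thread

cards = {"B50": 224.0, "RTX 3090": 900.0, "RTX 4070": 500.0}  # GB/s, as quoted
for name, bandwidth in cards.items():
    bound = max_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: at most ~{bound:.1f} tokens/s")
```

Real throughput will be lower than this bound, but the ratio between cards is the useful part.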
I'm not really knowledgeable about this space, so maybe I'm missing something:
Why does the bus performance affect token generation? I would expect it to cause a slow startup when loading the model, but once the model is loaded, just how much bandwidth can the token generation possibly use?
Token generation is completely on the card using the memory on the card, without any bus IO at all, no?
IOW, I'm trying to think of what IO the card is going to need for token generation, and I can't think of any other than returning the tokens (which, even on a slow 100MB/s transfer, is still going to be about 100x the rate at which tokens are being generated).
This means bandwidth requirements grow as context sizes grow.
For datacenter workloads batching can be used to efficiently use this memory bandwidth and make things compute bound instead
It seems to me that even if you pass in a long context on every prompt, that context is still tiny compared to the execution time on the processor/GPU/tensorcore/etc.
Let's say I load up a model of 12GB on my 12GB VRAM GPU. I pass in a prompt with 1MB of context which causes a response of 500KB after 1s. That's still only 1.5MB of IO transferred in 1s, which kept the GPU busy for 1s. Increasing the prompt is going to increase the duration to a response accordingly.
Unless the GPU is not fully utilised on each prompt-response cycle, I feel that the GPU is still the bottleneck here, not the bus performance.
For reference, Llama 3.1 8B used to take 4 KiB per token per layer. At 32 layers that is 128 KiB per token, or 8 tokens per MiB of KV cache (context). If your context holds 8000 tokens including responses then you need around 1GB.
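The 4 KiB figure can be reproduced with a short sketch. The 32 layers and 8000 tokens are from the comment; the 8 KV heads, head dimension of 128, and 2-byte (fp16) values are assumptions about the model's configuration.

```python
def kv_bytes_per_token_per_layer(n_kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """One key vector plus one value vector per KV head, per layer."""
    return 2 * n_kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(n_tokens: int, n_layers: int, per_token_per_layer: int) -> int:
    """Total KV cache size for a context of n_tokens."""
    return n_tokens * n_layers * per_token_per_layer


per_layer = kv_bytes_per_token_per_layer(n_kv_heads=8, head_dim=128, bytes_per_value=2)
per_token = per_layer * 32                      # 32 layers, from the comment
total = kv_cache_bytes(8000, 32, per_layer)     # 8000-token context

print(f"{per_layer // 1024} KiB per token per layer")
print(f"{per_token // 1024} KiB per token, {2**20 // per_token} tokens per MiB")
print(f"{total / 2**20:.0f} MiB for 8000 tokens")
```

All of this has to be read again for every new token, on top of the weights, which is where the growing bandwidth requirement comes from.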
>Unless the GPU is not fully utilised on each prompt-response cycle, I feel that the GPU is still the bottleneck here, not the bus performance.
Matrix vector multiplication implies a single floating point multiplication and addition (2 flops) per parameter. Your GPU can do way more flops than that without using tensor cores at all. In fact, this workload bores your GPU to death.
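A rough sketch of why the GPU is bored: compare the time to do 2 flops per parameter against the time to read each parameter from VRAM once. The 8B parameters at 2 bytes each and the 500 GB/s bandwidth echo figures mentioned in the thread; the 20 TFLOPS compute figure is a made-up round number, not a spec for any particular card.

```python
def compute_time_s(n_params: float, flops_per_s: float) -> float:
    """Time for one token at 2 flops (multiply + add) per parameter."""
    return 2 * n_params / flops_per_s


def memory_time_s(n_params: float, bytes_per_param: float, bandwidth_bytes_s: float) -> float:
    """Time to stream every parameter from VRAM once per token."""
    return n_params * bytes_per_param / bandwidth_bytes_s


N_PARAMS = 8e9        # 8B-parameter model
BYTES_PER_PARAM = 2   # fp16 weights (assumption)
BANDWIDTH = 500e9     # 500 GB/s
FLOPS = 20e12         # hypothetical 20 TFLOPS

t_compute = compute_time_s(N_PARAMS, FLOPS)
t_memory = memory_time_s(N_PARAMS, BYTES_PER_PARAM, BANDWIDTH)
print(f"compute: {t_compute * 1000:.1f} ms/token, memory: {t_memory * 1000:.1f} ms/token")
print(f"memory takes ~{t_memory / t_compute:.0f}x longer than the math")
```

Under these assumptions the arithmetic finishes in under a millisecond while the memory reads take about 32 ms, so single-user decoding is bound by VRAM bandwidth rather than compute.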
I am gonna push it this week and launch some LLM models to see how they perform!
How efficient are they in terms of the electric bill when running locally?
I'll be that guy™ that says if you're going to do any computing half-way reliably, only use ECC RAM. Silent bit flips suck.