
Posted

I use VSCode and have been a GitHub Copilot subscriber for a few years. I didn't pay much attention to the plan I was on until I started using agentic coding in ~October 2025. That's when I realized what made GitHub Copilot a ridiculously good value, as others discovered as well: it worked on a "per-request" billing model. In short, if you knew what you were doing (I didn't fully realize this at the time), you could use a high-end model like Opus 4.5, which costs 3 requests, and just have it rip for HOURS on a task, and it would still only cost those 3 requests (lower-end models cost 1 request). The cheapest GitHub Copilot plan is (well, now WAS) $10/month, which gave you 300 requests. A lot of people took advantage of this... imagine paying $10/month and getting something like $5,000 to $10,000 worth of value (i.e., what the same usage would cost under per-token billing) out of it per month! That's 100 multi-hour Opus runs for $10, where each run could easily burn $50 to $100 at per-token rates. Absolutely insane.

Microsoft understandably put an end to that last week because they were losing their shirts:
https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/

In short, they are one of the first to move to per-token/per-usage billing, and I suspect others like Anthropic and OpenAI will eventually follow suit. It's only a matter of time given the economics of it all.

However, as you may know, the Chinese AI labs are extremely competitive with their AI offerings: they have both monthly plans and token-based plans (generally ~90% cheaper), and of course they release the model weights outright. Because of what Microsoft did, I've been experimenting with the various Chinese models via OpenRouter, while still using VSCode GitHub Copilot's "Bring Your Own Key" support, so it's basically the same experience.
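If you want to poke at OpenRouter before wiring it into the editor, it exposes an OpenAI-compatible endpoint, so a plain curl call is enough to test a model. A minimal sketch, assuming you have an OpenRouter API key; the model slug here is just an example, so check their model list for current IDs:

```
# Minimal chat request against OpenRouter's OpenAI-compatible API.
# Assumes OPENROUTER_API_KEY is set; the model slug is illustrative only.
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek/deepseek-chat",
    "messages": [{"role": "user", "content": "Hello from OpenRouter"}]
  }'
```

The same key is what you'd plug into Copilot's BYOK setup.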

However, there seem to be a lot of advancements in high-density, low-parameter models (Qwen3.6 27B, Deepseek V4 Flash) which can be run on consumer hardware at good speeds, with output that is not far behind something like Sonnet or Opus, or at least catching up quickly. I haven't owned a discrete GPU in many, many years (I don't game), but I believe with an RTX 5090 and 64-128GB of RAM it can be done. Don't quote me on any of that, however... I haven't dived into this world yet and don't yet understand all the settings that determine how well an LLM runs locally and how they affect its intelligence.
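From what I've gathered so far, the main knobs seem to be the quantization of the model file, the context window size, and how many layers you can offload to the GPU. Something like llama.cpp's `llama-server` appears to be a common way to serve GGUF models; here's a rough sketch (the filename is made up, and flags can vary by version), so don't treat it as a recipe:

```
# Hypothetical: serve a ~27B model quantized to 4 bits (Q4_K_M GGUF).
# -ngl 99 offloads as many layers as possible to the GPU;
# -c sets the context window. Smaller quants (Q4 vs Q8) use less VRAM
# and run faster, at some cost in output quality.
llama-server -m qwen3.6-27b-Q4_K_M.gguf -c 16384 -ngl 99 --port 8080
# Exposes an OpenAI-compatible API at http://localhost:8080/v1
```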

I'd be interested to hear anyone who is playing with this idea: What models are you using?  What hardware?  What software are you using to run it?

Posted (edited)

I was experimenting with Ollama's cloud models the other day, on its $20/month plan, and had good results. For my next project, I'm considering Kimi-K2.6, GLM 5+, and others.

You can use Opencode or Claude Code with it: `ollama launch claude --model minimax-m2.5:cloud`
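If you just want to chat with one of the cloud models outside an agent, plain `ollama run` works too (the tag here mirrors the command above; check Ollama's model library for current ones):

```
# Interactive chat with a cloud-hosted model; the :cloud tag runs it on
# Ollama's servers rather than local hardware.
ollama run minimax-m2.5:cloud
```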

Edited by Sergio
Posted

Hi Jonathan. It was quite difficult to buy a Mac Mini recently when Open Claw was released. People realised the little machines were quite happy chugging along running local LLMs, and they were selling like hotcakes.
Is that of any interest to you? I have to admit, I like the idea of having a replaceable GPU, but I'm not sure my gaming machine would be up for any serious work.
