Open source language models have closed most of the gap with closed frontier models. In 2026, the question is no longer “is open source good enough” but “which open model fits our license terms, our context needs, and the hardware we actually have.” This article ranks the top open models on exactly those three axes.
Before the ranking, the licenses. They matter more than most teams realize until legal review starts.
Open source vs open weights
Most models marketed as “open source” are technically open weights. The difference is real:
- Open source (strict): Weights, training code, and training data are all public under a permissive license. You can fully audit, retrain, and redistribute. Almost no frontier model qualifies. OLMo from AI2 and Pythia from EleutherAI come closest.
- Open weights: The trained model weights are downloadable, but training data and pipeline may be private. This is where Llama, Qwen, Gemma, DeepSeek, Kimi, GLM, and Mistral all sit.
For most commercial work the distinction is academic. What actually matters is the license attached to the weights.
The licenses that matter
Apache 2.0
The cleanest option for commercial use. No usage caps, no royalties, no geographic restrictions, and crucially an explicit patent grant. The patent grant matters in enterprise contexts; MIT does not include one. Qwen 3 and 3.5, Mistral Large 3, Mistral Small 4, and Gemma 4 (26B A4B) ship under Apache 2.0.
MIT license
Equally permissive in practice, slightly simpler text, no patent grant. Kimi K2.5, GLM-4.7, GLM-5, MiMo-V2-Flash, GPT-oss 120B, DeepSeek R1, DeepSeek V4 (Pro and Flash), and Phi-4 all use MIT. For most companies this license has zero friction.
Llama community license (Meta)
Permissive enough for almost every business, but with three specific traps:
- Commercial use is allowed for organizations with under 700 million monthly active users. Above that threshold, you need a separate license from Meta. Irrelevant unless you are larger than Facebook.
- Llama 4’s Community License blocks EU-based companies from accepting the license terms outright. For EU companies, this is a hard stop. Read the model card before you build anything serious on Llama 4 from an EU entity.
- Attribution required (“Built with Llama”) on derivative products.
Gemma terms (Google)
While Gemma’s terms look permissive, be cautious with fine-tuning: distributing a model trained on proprietary data creates ambiguity around whether derivative models inherit Google’s restrictions. Internal use is safe, but shipping externally requires legal sign-off.
The same scrutiny applies across the broader Llama family. Focusing solely on Llama 4 Scout for its 10-million-token context window overlooks other variants in the ecosystem. A complete evaluation should also cover Llama 4 Maverick, the more balanced, higher-performing core reasoning model for standard workflows, and Behemoth, the 2T-parameter model used primarily for enterprise distillation pipelines.
DeepSeek license
DeepSeek V3.2 has its own non-standard license that needs review for commercial use. DeepSeek V4 (released April 23, 2026) ships under MIT, which is a clear improvement.
Non-commercial licenses
A few models marketed as “open” prohibit commercial use entirely (Command R+ under CC-BY-NC) or restrict it in other ways (some Grok variants bar using the weights to train other models). Useful for research, useless for products.
Practical rule of thumb
If license cleanliness is the priority, the safe defaults are Qwen (Apache 2.0), DeepSeek V4 (MIT), GLM-5 (MIT), Mistral (Apache 2.0), or Phi-4 (MIT). Anything outside this group needs a careful read of the terms.
What the ranking is based on
The ranking uses three criteria:
- Context window. How many tokens the model can hold in working memory. This determines whether you need to chunk documents, codebases, and long conversations, or whether you can throw the whole thing in at once.
- Model size (parameters). Total parameters and, for Mixture-of-Experts (MoE) architectures, active parameters per token. Active parameters drive inference cost; total parameters drive memory needs.
- Hardware needed. What you actually have to put in a rack to run this thing at usable quality. Consumer GPU, single H100, multiple H100s, or full data-center cluster.
This ranking deliberately ignores raw benchmark scores. A model that wins MMLU but needs eight H100s is useless to a team with a single workstation.
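To make the context-window criterion concrete, here is a minimal sketch that estimates whether a body of text fits in one pass or has to be chunked. The 4-characters-per-token ratio and the 4,000-token reserve are assumptions for illustration, not measured values; a real tokenizer will give different counts.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough average for English text


def estimate_tokens(text_chars: int, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Rough token count derived from a character count."""
    return math.ceil(text_chars / chars_per_token)


def chunks_needed(text_chars: int, context_window: int, reserve: int = 4_000) -> int:
    """Number of chunks needed to process the text, keeping `reserve`
    tokens free for the prompt and the model's answer."""
    usable = context_window - reserve
    if usable <= 0:
        raise ValueError("reserve must be smaller than the context window")
    return max(1, math.ceil(estimate_tokens(text_chars) / usable))


# A 2,000,000-character codebase is roughly 500,000 tokens under this heuristic.
codebase_chars = 2_000_000
for label, window in [("128K", 128_000), ("256K", 256_000), ("1M", 1_000_000)]:
    print(f"{label} window -> {chunks_needed(codebase_chars, window)} chunk(s)")
```

Under these assumptions the same codebase needs five passes at 128K, two at 256K, and one at 1M, which is the practical difference the criterion is meant to capture.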
The Ranking
1. The Llama 4 Family
- Variants: Llama 4 Scout (The Context King), Llama 4 Maverick (The Agentic Core), Llama 4 Behemoth (The Enterprise Teacher)
- Architecture: Sparse Mixture-of-Experts (MoE) with native multimodality
- Context Windows: 130K tokens (Maverick) up to an industry-leading 10 million tokens (Scout)
- License: Llama 4 Community License (note the EU restriction)
Meta’s model cards outline a specialized lineup that partitions enterprise workloads. Llama 4 Scout (109B total, 17B active) targets massive retrieval tasks, with a 10M token context window suited to whole-codebase reviews and deep research archives where chunking breaks the analysis. Llama 4 Maverick (400B total, 17B active across 128 experts) shifts the focus to depth, acting as the higher-performing, more balanced core reasoning model for standard agentic and multi-tool workflows. At the frontier sits Llama 4 Behemoth (2T total, 288B active), a model that has not been released for deployment and is built to serve as a high-capacity “teacher” for generating synthetic data and powering enterprise distillation pipelines.
The catch: The EU licensing block. For EU companies, Meta’s regional restrictions introduce massive compliance hurdles that rule the entire Llama 4 herd out for most commercial deployments. Getting legal sign-off is a prerequisite before building out any architecture here.
2. DeepSeek V4 Pro: The Reasoning Flagship
- Context window: 1 million tokens
- Parameters: 1.6T total, 49B active (MoE)
- Hardware: 8x H100 80GB class for full-precision inference; quantized variants on smaller setups
- License: MIT
DeepSeek V4 Pro is the flagship model for maximum reasoning, coding, and agentic performance. It is the largest open-weight model ever released and uses a hybrid attention design that drops single-token FLOPs to roughly 10% of dense equivalents. MIT licensed, so commercial use is straightforward.
3. Kimi K2.6 Thinking: The Coding Specialist
- Context window: 256K tokens
- Parameters: 1T total, 32B active (MoE)
- Hardware: 4x H100 80GB class or equivalent
- License: MIT
On the May 12, 2026 snapshot, Kimi K2.6 Thinking is the strongest open-source entry across both key coding metrics: 78.57 Coding Avg and 58.33 Agentic Coding Avg. If your use case is agentic coding, software engineering, or long-horizon tool use, this is the top pick. MIT license, no commercial friction.
4. DeepSeek V4 Flash: The Best Mid-Range Option
- Context window: 1 million tokens
- Parameters: 284 billion total, 13 billion active per token (MoE)
- Hardware: A 128 GB Mac Studio can just about reach it; single H100 with quantization
- License: MIT
The sweet spot for teams that want a frontier-class model without an enterprise GPU rack. DeepSeek V4 Flash supports a 1 million token context window by default and offers three reasoning effort modes (Non-Thinking, Thinking, Max). Same hybrid attention as Pro, much lower memory footprint.
5. Qwen 3.6 / Qwen 3 235B: The Apache 2.0 Workhorse
- Context window: Up to 256K tokens
- Parameters: 235B total in the flagship; smaller variants from 1.5B to 35B
- Hardware: Single H100 for mid-range; multi-GPU for 235B
- License: Apache 2.0 (most variants)
Qwen is the default recommendation when license cleanliness matters. Qwen 3.5 leads on reasoning among Apache 2.0 models. The smaller Qwen variants (1.5B through 32B under Apache 2.0) cover everything from on-device deployment to mid-range GPU work. Strong multilingual support.
6. GLM-5.1: The SWE-Bench Leader
- Context window: 128K tokens (extended variants available)
- Parameters: Multiple sizes; flagship in the hundreds of billions
- Hardware: Multi-GPU enterprise class
- License: MIT
GLM-5 posts 77.8% on SWE-bench Verified, the strongest coding benchmark result among open models. The right pick if your use case is autonomously fixing real software bugs rather than generating code from scratch.
7. Gemma 4 26B A4B: The Consumer-Hardware Default
- Context window: 256K tokens
- Parameters: 26B (MoE, A4B = roughly 4B active per token)
- Hardware: Runs in 16 GB of RAM at Q4, supported on day one in every major local inference stack
- License: Apache 2.0
The model most readers will actually run. Native multimodality, Apache 2.0 license, fits on a Mac with 16GB unified memory or a single consumer GPU. Performance is not at the level of the giants above, but for the hardware budget it is excellent. Watch the Gemma fine-tuning terms if you plan to distribute a tuned version.
8. Mistral Large 3 / Small 4: The European Choice
- Context window: 128K tokens
- Parameters: Small 4 is efficient; Large 3 is the flagship
- Hardware: Wide range; Small 4 runs comfortably on a single H100
- License: Apache 2.0
Both Mistral Large 3 and Mistral Small 4 now ship under Apache 2.0, a significant shift from Mistral’s earlier restrictive licensing. European company, no EU access issues, clean license. For organizations choosing on jurisdiction as well as technical fit, Mistral is the natural starting point. DataNorth runs a dedicated Mistral workshop for teams that want to evaluate deployment options, fine-tuning, and the practical trade-offs against the other open alternatives.
9. Phi-4: The On-Device Option
- Context window: 16K to 128K depending on variant
- Parameters: From 3.8B (Phi-3.5 Mini) up
- Hardware: Runs on phones and laptops at small sizes
- License: MIT
The right answer when the hardware is a laptop, an edge device, or a phone. MIT license, surprisingly strong reasoning for its size, and small enough to ship inside an application.
Decision guide for each use case
| If your priority is: | Pick: |
|---|---|
| Longest possible context | Llama 4 Scout (watch the EU license) |
| Maximum reasoning quality | DeepSeek V4 Pro |
| Agentic coding | Kimi K2.6 Thinking |
| Fixing real bugs | GLM-5.1 |
| Clean Apache 2.0 license | Qwen 3 or Mistral |
| Running on a single workstation | Gemma 4 26B or Qwen 3.6-35B-A3B |
| Running on a phone or edge device | Phi-4 |
| European jurisdiction, no license risk | Mistral Large 3 or Small 4 |
| Best price/quality on hosted API | DeepSeek V4 Flash |
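The guide above picks on one priority at a time, while in practice license and context constraints usually combine. A minimal sketch of that filtering, using the context windows and licenses listed in the ranking; the `Candidate` and `shortlist` names are hypothetical, and the Qwen entry assumes the Apache 2.0 flagship with the full 256K window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    context_tokens: int
    license: str


# Figures copied from the ranking in this article, in ranking order.
RANKING = [
    Candidate("Llama 4 Scout", 10_000_000, "Llama 4 Community License"),
    Candidate("DeepSeek V4 Pro", 1_000_000, "MIT"),
    Candidate("Kimi K2.6 Thinking", 256_000, "MIT"),
    Candidate("DeepSeek V4 Flash", 1_000_000, "MIT"),
    Candidate("Qwen 3 235B", 256_000, "Apache 2.0"),
    Candidate("GLM-5.1", 128_000, "MIT"),
    Candidate("Gemma 4 26B A4B", 256_000, "Apache 2.0"),
    Candidate("Mistral Large 3", 128_000, "Apache 2.0"),
]


def shortlist(min_context: int, allowed_licenses: set[str]) -> list[str]:
    """Names of models that meet both the context and license constraints."""
    return [
        c.name
        for c in RANKING
        if c.context_tokens >= min_context and c.license in allowed_licenses
    ]


print(shortlist(1_000_000, {"MIT", "Apache 2.0"}))
print(shortlist(256_000, {"Apache 2.0"}))
```

Hardware is left out on purpose: whether a model fits depends on quantization and memory, which is better estimated separately than stored as a fixed label.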
What hardware do you need?
A few practical notes that get glossed over in benchmark posts:
- 8GB VRAM is enough for smaller 7B to 8B models, 24GB VRAM is a more practical floor for 30B-class models, and 40GB+ is usually required once you move into 70B territory unless you quantize aggressively.
- 4-bit quantization (Q4_K_M) roughly halves VRAM requirements relative to 8-bit weights, and cuts them to about 30% of FP16, with minimal quality loss. Llama 3.3 70B at Q4_K_M runs in around 40GB.
- Apple Silicon with high unified memory (64GB and up) is genuinely viable for mid-sized open models. Less viable for the 100B+ class without quantization.
- For MoE models, all expert weights still need to fit in memory even though only a fraction activate per token. A 1T MoE model with 32B active parameters needs memory for all 1T parameters, not just the 32B that fire per token.
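The notes above reduce to simple arithmetic on parameter count and bits per weight. A rough sketch follows; the 4.85 bits-per-weight figure for Q4_K_M and the 20% headroom are assumptions for illustration, and KV cache and runtime overhead vary by inference stack.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    KV cache, activations, and runtime overhead come on top of this.
    """
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float, memory_gb: float,
         headroom: float = 0.2) -> bool:
    """True if the weights fit while leaving `headroom` (a fraction of
    total memory) free for KV cache and overhead."""
    needed = weight_memory_gb(params_billions, bits_per_weight)
    return needed <= memory_gb * (1 - headroom)


Q4_K_M_BITS = 4.85  # assumption: approximate average bits per weight

# A 70B dense model at three precisions.
print(round(weight_memory_gb(70, 16)))           # FP16   -> 140
print(round(weight_memory_gb(70, 8)))            # 8-bit  -> 70
print(round(weight_memory_gb(70, Q4_K_M_BITS)))  # Q4_K_M -> 42

# An 8B model at Q4_K_M on an 8GB card.
print(fits(8, Q4_K_M_BITS, memory_gb=8))         # True

# A 1T-parameter MoE at 4 bits on a single 80GB GPU: memory follows
# total parameters, not active ones.
print(fits(1000, 4, memory_gb=80))               # False
```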
Conclusion
In 2026, picking an open model is not a question of finding “the best one.” It is a question of matching three constraints: the license your legal team will accept, the context window your workflow actually needs, and the hardware budget that already exists or can be approved.
For most SMEs and enterprise teams, the practical path is hybrid: a Gemma 4 or Qwen 3.6 running locally on existing hardware for sensitive data and high-volume work, paired with a hosted API to a frontier open model (DeepSeek V4 Flash or Kimi K2.6) for the harder tasks. That setup gives full GDPR control over the sensitive workload, predictable cost on the bulk workload, and access to frontier reasoning when it actually matters.
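The hybrid setup comes down to a routing rule. A minimal sketch, in which the `Task` fields and the `route` function are hypothetical names; how a task gets flagged as sensitive or hard is left to the team's own policy.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    contains_sensitive_data: bool
    needs_frontier_reasoning: bool


def route(task: Task) -> str:
    """Return "local" or "hosted" for a task."""
    # Sensitive data never leaves local hardware, however hard the task is.
    if task.contains_sensitive_data:
        return "local"
    # Non-sensitive but difficult work goes to the hosted frontier model.
    if task.needs_frontier_reasoning:
        return "hosted"
    # Everything else is bulk work that stays local for predictable cost.
    return "local"


print(route(Task("Summarize this patient file", True, True)))        # local
print(route(Task("Plan a multi-step refactor", False, True)))        # hosted
print(route(Task("Classify this support ticket", False, False)))     # local
```

Checking sensitivity first is the deliberate design choice here: it keeps the data-control guarantee independent of how the difficulty flag is set.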
The open source LLM stack is finally good enough to build serious products on. Pick the license first, the context window second, and the hardware third. The benchmark numbers sort themselves out from there.
Frequently asked questions
Is open source actually good enough to replace closed models like GPT-5 or Claude?
For most workloads, yes. Kimi K2.6 Thinking leads open models on agentic coding, DeepSeek V4 Pro competes on raw reasoning, and GLM-5.1 tops SWE-bench Verified. The gap that remains is mostly in the very hardest reasoning tasks and in tool ecosystems around the closed APIs. For 80% of practical use cases (RAG, summarization, classification, mid-complexity coding, agent workflows), an open model is good enough and often cheaper per token.
Can I use Llama 4 if I’m based in the EU?
Not straightforwardly. Llama 4’s Community License explicitly blocks EU-based companies from accepting the license terms. For European organizations, this is a hard stop on commercial deployment. Older Llama 3.x models do not have this restriction, and Qwen, Mistral, DeepSeek V4, GLM-5, and Gemma are all viable alternatives without the EU issue.
What’s the difference between total and active parameters in a MoE model?
Total parameters determine how much memory the model needs. Active parameters determine how much compute each token costs. DeepSeek V4 Pro has 1.6T total but only 49B active, which means you need enough GPU memory for the full 1.6T weights, but inference cost per token is closer to a 49B dense model. This is why MoE is attractive for hosted inference and painful for local deployment: the memory bill does not shrink.
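A back-of-the-envelope comparison shows the split. This sketch assumes 8-bit weights and the common approximation of about 2 FLOPs per active parameter per token for a forward pass; both are simplifications, and `moe_profile` is a hypothetical helper.

```python
def moe_profile(total_params_b: float, active_params_b: float,
                bits_per_weight: float = 8) -> dict:
    """Rough memory and per-token compute for a model.

    Memory scales with total parameters; compute scales with active ones.
    """
    return {
        "weights_gb": total_params_b * bits_per_weight / 8,
        "gflops_per_token": 2 * active_params_b,  # approximation
    }


v4_pro = moe_profile(total_params_b=1600, active_params_b=49)
dense_49b = moe_profile(total_params_b=49, active_params_b=49)

print(v4_pro)     # same per-token compute as the dense model...
print(dense_49b)  # ...but far less weight memory on the dense side
```

Under these assumptions the two models cost the same compute per token, while the MoE needs roughly 33 times the weight memory.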
Do I really need an H100 to run any of these?
No. Gemma 4 26B A4B runs in 16GB of RAM at Q4 quantization, which fits on a MacBook Pro or a consumer GPU. Phi-4 runs on a phone. Qwen 3 has variants from 1.5B up. The H100 question only matters if you want frontier-class quality on-prem. For most internal tooling, a workstation with 24GB to 64GB of VRAM or unified memory covers it.
For an SME that wants to start somewhere, which model should I pick?
Default to Gemma 4 26B or Qwen 3 (Apache 2.0, runs on existing hardware, no license risk) for local workloads, and pair it with a hosted API to DeepSeek V4 Flash or Kimi K2.6 for the harder tasks. That gives you GDPR control on sensitive data and frontier reasoning when you need it, without committing to an H100 cluster. If you want help scoping which workloads belong local versus hosted, that is the kind of question a DataNorth AI assessment is built to answer.
Is quantization safe to use in production?
For Q4_K_M and Q5_K_M, yes, for most use cases. Quality loss is small and the weights shrink to roughly half their 8-bit size. Where it matters is precise tasks: structured output, math, long-context retrieval. Test on your actual workload before assuming a quantized model behaves the same as full precision. For sensitive domains (medical, legal), validate with your own evals rather than relying on aggregate benchmarks.
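One simple way to run that test is to send the same prompts to both versions and measure how often the answers agree. A minimal sketch: the two model functions below are toy stand-ins, and in practice each would wrap a call to your own inference stack. Exact-match comparison is an assumption that suits classification or structured output; free-form text needs a task-specific scorer.

```python
from typing import Callable, Sequence

Model = Callable[[str], str]


def agreement_rate(prompts: Sequence[str], reference: Model, candidate: Model) -> float:
    """Fraction of prompts where both models give the same answer
    (compared after stripping surrounding whitespace)."""
    if not prompts:
        raise ValueError("need at least one prompt")
    matches = sum(reference(p).strip() == candidate(p).strip() for p in prompts)
    return matches / len(prompts)


# Toy stand-ins for a full-precision and a quantized model.
def full_precision(prompt: str) -> str:
    return prompt.upper()


def quantized(prompt: str) -> str:
    return "?" if prompt == "d" else prompt.upper()


rate = agreement_rate(["a", "b", "c", "d"], full_precision, quantized)
print(f"agreement: {rate:.0%}")  # agreement: 75%
```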
How do I check if a license is safe for commercial use?
Three quick checks. First, does the license name a commercial-use threshold (Llama’s 700M MAU clause)? Second, does it restrict by jurisdiction (Llama 4 and the EU)? Third, does it carry forward to fine-tuned derivatives you might distribute (the Gemma ambiguity)? Apache 2.0 and MIT pass all three. Anything else, get legal sign-off before you build on it.
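The three checks can be written down as a small checklist. In this sketch the `LicenseTerms` fields and the flag values are this article's summary of each license, encoded for illustration; they are assumptions, not a reading of the license texts themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    name: str
    has_usage_threshold: bool      # e.g. a monthly-active-user cap
    restricts_jurisdiction: bool   # e.g. excludes entities in a region
    binds_derivatives: bool        # terms carry over to fine-tuned models


def review_reasons(terms: LicenseTerms) -> list[str]:
    """Reasons a license needs legal sign-off; an empty list means it
    passes all three checks."""
    reasons = []
    if terms.has_usage_threshold:
        reasons.append("commercial-use threshold")
    if terms.restricts_jurisdiction:
        reasons.append("jurisdiction restriction")
    if terms.binds_derivatives:
        reasons.append("terms carry forward to derivatives")
    return reasons


# Flag values follow the descriptions earlier in this article.
APACHE_2 = LicenseTerms("Apache 2.0", False, False, False)
MIT = LicenseTerms("MIT", False, False, False)
LLAMA_4 = LicenseTerms("Llama 4 Community License", True, True, True)
GEMMA = LicenseTerms("Gemma terms", False, False, True)

for terms in (APACHE_2, MIT, LLAMA_4, GEMMA):
    reasons = review_reasons(terms)
    verdict = "needs legal sign-off: " + ", ".join(reasons) if reasons else "passes all three checks"
    print(f"{terms.name}: {verdict}")
```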