Unsloth Adds Part-by-Part Qwen3.5 Fine-Tuning
Unsloth now lets you fine-tune Qwen3.5 vision models by layer type, with faster training, lower VRAM, and 201-language support.

Unsloth just made Qwen3.5 training more flexible, and the numbers are hard to ignore: its docs say Qwen3.5-35B-A3B bf16 LoRA fits in 74GB of VRAM, while smaller models need anywhere from 3GB to 22GB depending on size. The bigger update, though, is control: for vision models, you can now fine-tune only the vision layers, only the language layers, or just the attention and MLP blocks.
That matters because multimodal fine-tuning has usually been an all-or-nothing affair. If you wanted to adapt a vision-language model, you often had to update the whole stack and pay for it in time, memory, and debugging pain. Unsloth’s new knobs let teams trim that scope and make a more deliberate tradeoff between quality and cost.
What changed in Qwen3.5 fine-tuning
Qwen3.5 is a broad family, covering 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, and 122B-A10B models. Unsloth says its support now spans text, vision, and reinforcement learning workflows, which is a useful spread if you are building one pipeline for chat, document understanding, and agent-style training.

The practical headline is that Unsloth says Qwen3.5 trains about 1.5x faster and uses 50% less VRAM than FA2 setups. That is the kind of delta that changes which GPU you can rent, or whether you can train at all without waiting in a queue.
Unsloth also warns that QLoRA-style 4-bit training is a bad fit for Qwen3.5, in both dense and MoE variants, because the quality gap from quantization is larger than usual. In other words, the cheap path is not always the smart one here. For bf16 LoRA, the docs list approximate VRAM requirements by model size:
- Qwen3.5-0.8B bf16 LoRA: about 3GB VRAM
- Qwen3.5-2B bf16 LoRA: about 5GB VRAM
- Qwen3.5-4B bf16 LoRA: about 10GB VRAM
- Qwen3.5-9B bf16 LoRA: about 22GB VRAM
- Qwen3.5-27B bf16 LoRA: about 56GB VRAM
For smaller teams, those figures are the real story. A 4B model on 10GB VRAM is the kind of setup that can run on a single modest GPU. A 27B model at 56GB pushes you into serious hardware, but it is still far from the kind of cluster-only workflow that bigger frontier models demand.
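Given the bf16 recommendation, loading is short. Here is a minimal sketch; the checkpoint name and sequence length are illustrative assumptions, not values from Unsloth's docs:

```python
import torch
from unsloth import FastLanguageModel

# Load in plain bf16 rather than 4-bit, per the QLoRA warning above.
# "unsloth/Qwen3.5-4B" is an assumed checkpoint name for illustration.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3.5-4B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=False,  # skip 4-bit quantization for Qwen3.5
)
```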
Why partial vision tuning matters
The new layer selection is the most interesting part of the update. In Unsloth, you can now choose whether to fine-tune the vision layers, the language layers, the attention modules, or the MLP modules. All of them are on by default, but the option to switch pieces off gives practitioners a cleaner way to isolate what actually needs adaptation.
That is especially useful for multimodal work where the image encoder may need domain adaptation while the text side should stay mostly intact. Think medical imaging, industrial inspection, or document parsing. If your text behavior is already solid, there is no reason to pay to rewrite it just because your image inputs changed.
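In code, that selectivity is just four flags. A minimal sketch, assuming Unsloth's existing FastVisionModel API carries over to Qwen3.5 (the checkpoint name and LoRA rank are illustrative):

```python
from unsloth import FastVisionModel

# Assumed checkpoint name for illustration.
model, tokenizer = FastVisionModel.from_pretrained("unsloth/Qwen3.5-9B")

# Adapt only the image side: vision layers train, the text stack stays frozen.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,     # domain-shifted image encoder
    finetune_language_layers=False,  # text behavior is already solid
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,            # illustrative LoRA rank
    lora_alpha=16,
)
```

Flipping finetune_language_layers back on recovers the everything-on default.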
“The future of AI will be the era of the foundation model.” — Jensen Huang, NVIDIA GTC 2023 keynote
Huang’s line is relevant here because foundation models are no longer being treated as fixed artifacts. They are becoming modular systems, and this update is another sign that training is moving toward targeted edits instead of full rewrites.
That everything-on default is a sensible choice for beginners. But for teams doing repeated experiments, the ability to freeze parts of the model can save a lot of wasted runs.
MoE, RL, and the hardware reality
Qwen3.5 also includes MoE models such as 35B-A3B and 122B-A10B, and Unsloth says it supports those with recent MoE training improvements that are about 12x faster, use more than 35% less VRAM, and allow around 6x longer context. That is a meaningful jump, especially for long-context tasks where memory pressure usually becomes the bottleneck first.

There are also some clear boundaries. Unsloth recommends bf16 setups for MoE fine-tuning and says MoE QLoRA is not advised because of BitsandBytes limitations. Router-layer fine-tuning is disabled by default for stability, which tells you this is still a system where power comes with sharp edges.
On the reinforcement learning side, Unsloth says Qwen3.5 can be trained with GRPO, GSPO, and related methods, even though vLLM does not yet cover the model for that flow. The workaround is to set fast_inference=False so generation runs through Unsloth inference instead.
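Concretely, that workaround is one flag at load time. A hedged sketch of a GRPO run, assuming Unsloth's usual TRL integration; the checkpoint name, dataset, and reward function are placeholders:

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer
from unsloth import FastLanguageModel

# fast_inference=False routes generation through Unsloth inference,
# since vLLM does not yet support Qwen3.5.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3.5-4B",  # assumed checkpoint name
    max_seq_length=2048,
    fast_inference=False,
)

# Toy prompt dataset and reward; replace with real task data.
dataset = Dataset.from_dict({"prompt": ["Explain LoRA in one sentence."] * 8})

def reward_short(completions, **kwargs):
    # Placeholder reward: prefer shorter completions.
    return [-float(len(c)) for c in completions]

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=reward_short,
    args=GRPOConfig(output_dir="qwen35-grpo", max_steps=50),
    train_dataset=dataset,
)
trainer.train()
```

For the bigger variants, the docs list: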
- Qwen3.5-35B-A3B bf16 LoRA: 74GB VRAM
- Qwen3.5-122B-A10B bf16 LoRA: 256GB VRAM
- Qwen3.5 supports 201 languages
- Full fine-tuning uses about 4x more VRAM than LoRA
- Unsloth says its training is 1.5x faster than FA2 setups
Those numbers tell a blunt story: the model family is broad, but the hardware bar rises fast as soon as you move past the smaller variants. If you are planning a production experiment, the first question is no longer “Can we fine-tune it?” It is “Which version fits our memory budget without forcing compromises elsewhere?”
Export, deployment, and where this fits next
Unsloth also keeps the post-training path fairly practical. You can export to llama.cpp and Ollama through GGUF, or save for vLLM once you are on a supported version. The docs even call out a version issue: vLLM 0.16.0 does not support Qwen3.5, so users should wait for 0.17.0 or try a nightly build.
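Assuming the same save helpers Unsloth exposes for other model families, and a fine-tuned model and tokenizer in hand, the export step is short (directory names and quantization method below are illustrative):

```python
# Export for llama.cpp / Ollama via GGUF.
model.save_pretrained_gguf("qwen35-gguf", tokenizer, quantization_method="q8_0")

# Save merged 16-bit weights for vLLM once a supported release (0.17.0+) lands.
model.save_pretrained_merged("qwen35-vllm", tokenizer, save_method="merged_16bit")
```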
That kind of deployment advice is more valuable than it looks. Fine-tuning is easy to celebrate in a notebook, then painful to ship if the export format or chat template is wrong. Unsloth explicitly warns that mismatched chat templates or EOS tokens can make an exported model behave worse in another runtime.
For teams using local AI tools, Unsloth Studio adds another path. It runs on macOS, Windows, and Linux, and the docs say it can train models about 2x faster with 70% less VRAM while also handling model search, downloads, inference, and export. If your workflow starts on a laptop and ends on a server, that kind of bridge is handy.
My read: the most important part of this update is not the benchmark claim or the new notebooks. It is the fact that multimodal fine-tuning is becoming more selective. If Unsloth keeps pushing that direction, the next practical question for Qwen3.5 users will be simple: when do you freeze the image tower, when do you freeze the text stack, and when do you touch the router at all?
For now, the answer depends on the task, but the new control surface makes that decision much easier to test instead of guess.