Running Qwen3.5-0.8B on M3 Mac — Direct Install Without Ollama (All Errors & Fixes)
A complete honest walkthrough of getting Qwen3.5-0.8B running on M3 Mac using HuggingFace Transformers directly — no Ollama, no shortcuts. Every error hit, every fix applied, model running at the end.
Running Qwen3.5-0.8B on M3 Mac — Direct Install Without Ollama
This is a real install log — not a cleaned-up tutorial where everything works first time. I wanted to learn how to run LLMs directly using HuggingFace Transformers, bypassing Ollama completely. Here’s every command I ran, every error I hit, and exactly how I fixed each one.
What you’ll have at the end:
- ✅ Qwen3.5-0.8B running directly via HuggingFace Transformers
- ✅ Apple Metal (MPS) GPU acceleration on M3 Mac
- ✅ Interactive chat loop with no Ollama dependency
- ✅ Understanding of what’s actually happening under the hood
Model: Qwen/Qwen3.5-0.8B Size: ~1.77GB License: Apache 2.0 Machine: M3 MacBook Pro (18GB RAM) Python: 3.11 via Miniconda
Why Not Ollama?
Ollama is a great tool but it’s a black box. I wanted to understand the actual stack:
1
HuggingFace Hub → SafeTensors weights → Transformers library → PyTorch MPS → M3 GPU
Once you understand this, you can swap any model, change any parameter, and build real applications on top.
The Final Working Script
Download: run_qwen.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
"""
Run Qwen3.5-0.8B locally on M3 Mac using HuggingFace Transformers
No Ollama — direct PyTorch + Metal (MPS) inference
"""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_PATH = "Qwen/Qwen3.5-0.8B"
# Detect best device
if torch.backends.mps.is_available():
DEVICE = "mps"
DTYPE = torch.float16
print("Using Apple Metal (MPS) GPU")
elif torch.cuda.is_available():
DEVICE = "cuda"
DTYPE = torch.float16
print("Using CUDA GPU")
else:
DEVICE = "cpu"
DTYPE = torch.float32
print("Using CPU (slower)")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
dtype=DTYPE,
device_map=DEVICE
)
model.eval()
def chat(user_message: str, max_tokens: int = 500) -> str:
messages = [{"role": "user", "content": user_message}]
text = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=0.7,
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
return tokenizer.decode(
outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print("Qwen3.5-0.8B ready. Type 'quit' to exit.\n")
while True:
try:
user_input = input("You: ").strip()
if user_input.lower() in ("quit", "exit", "q"):
break
if not user_input:
continue
print("Qwen:", chat(user_input))
print("-" * 50)
except KeyboardInterrupt:
break
Step 1 — Set Up Conda Environment
I use a conda env called ranger for AI/ML work. First problem: the env had a broken Python path because it was originally created when an external drive (/Volumes/KaliPro) was mounted.
Error:
1
zsh: bad interpreter: /Volumes/KaliPro/Applications/miniconda3/envs/ranger/bin/python3: no such file or directory
Fix — Recreate the env with local Python:
1
2
3
conda remove -n ranger --all -y
conda create -n ranger python=3.11 -y
conda activate ranger
Step 2 — Install Dependencies
1
pip install transformers torch accelerate huggingface_hub
Step 3 — Download and Run the Model
1
python3 ~/models/run_qwen.py
What followed was a chain of 6 errors, each one teaching something new.
Error 1 — huggingface-cli: bad interpreter
Command:
1
huggingface-cli download Qwen/Qwen3.5-0.8B --local-dir ~/models/Qwen3.5-0.8B
Error:
1
2
zsh: bad interpreter: /Volumes/KaliPro/Applications/miniconda3/envs/ranger/bin/python3: no such file or directory
Fetching 0 files: 0it [00:00, ?it/s]
What happened: The CLI script was installed when KaliPro external drive was mounted. Shebang line points to a path that no longer exists. The download silently failed — it printed the cache path as if it worked, but fetched 0 files.
Fix: Recreate the conda env (done above), then use the model ID directly in Python — let Transformers handle the download automatically:
1
MODEL_PATH = "Qwen/Qwen3.5-0.8B" # Transformers downloads and caches this
Error 2 — OSError: Incorrect path_or_model_id
Error:
1
2
OSError: Incorrect path_or_model_id: '/Users/ranger/.cache/huggingface/hub/models--Qwen--Qwen3.5-0.8B/snapshots/2fc06364...'
Please provide either the path to a local folder or the repo_id of a model on the Hub.
What happened: I pointed the script at the HuggingFace cache snapshot directory, thinking the model was there. But since the download failed (Error 1), the directory was empty. Transformers couldn’t find tokenizer_config.json and threw an error.
Fix: Use the HuggingFace repo ID directly:
1
MODEL_PATH = "Qwen/Qwen3.5-0.8B"
Transformers downloads from HuggingFace and caches automatically. No manual download needed.
Error 3 — KeyError: ‘qwen3_5’ (Transformers too old)
Error:
1
2
3
KeyError: 'qwen3_5'
ValueError: The checkpoint you are trying to load has model type `qwen3_5`
but Transformers does not recognize this architecture.
What happened: Transformers was too old — didn’t know about the qwen3_5 architecture. Qwen3.5 is a recent model (2026) that needs a newer Transformers version.
Fix:
1
pip install --upgrade transformers
After upgrade: transformers==5.2.0 ✅
Error 4 — PyTorch too old (< 2.4 required)
Error:
1
2
Disabling PyTorch because PyTorch >= 2.4 is required but found 2.3.0
ImportError: AutoModelForCausalLM requires the PyTorch library but it was not found in your environment.
What happened: Transformers 5.2.0 requires PyTorch 2.4+. I had 2.3.0.
Fix:
1
pip install "torch>=2.4" torchvision torchaudio --upgrade
After upgrade: PyTorch 2.10.0 ✅
Verify Metal GPU works:
1
2
3
python3 -c "import torch; print('PyTorch:', torch.__version__); print('MPS:', torch.backends.mps.is_available())"
# PyTorch: 2.10.0
# MPS: True
Error 5 — NumPy version conflict / soxr crash
This was the persistent enemy. Hit it three times.
Error:
1
2
3
A module that was compiled using NumPy 1.x cannot be run in NumPy 2.x as it may crash.
ImportError: numpy.core.multiarray failed to import
File "src/soxr/cysoxr.pyx", line 1, in init soxr.cysoxr
What happened: soxr (an audio resampling library, pulled in by Transformers for audio model support) was compiled against NumPy 1.x. NumPy 2.x was installed. The Cython extension couldn’t load.
Why it kept coming back: Every pip install of a new package pulled in a newer NumPy, overriding any downgrade:
- First run: NumPy 2.3.1
- After
pip install accelerate: NumPy 2.4.2 (accelerate pulled it in as a dependency)
Things that didn’t fully work:
1
2
pip install "numpy<2.0" # Overridden by next install
pip install "numpy==1.26.4" --force-reinstall # Overridden by accelerate
Fix that actually worked — rebuild soxr from source:
1
pip install soxr --no-binary :all: --force-reinstall
This compiles soxr fresh against whatever NumPy version is currently installed, so there’s no version mismatch. If this fails:
1
2
3
pip install cython
brew install libsoxr
pip install soxr --no-binary :all: --force-reinstall
Error 6 — Missing accelerate
Error:
1
2
3
ValueError: Using a `device_map`, `tp_plan`, `torch.device` context manager or setting
`torch.set_default_device(device)` requires `accelerate`.
You can install it with `pip install accelerate`
What happened: device_map='mps' in the model loading call requires the accelerate library. It wasn’t installed yet.
Fix:
1
pip install accelerate
Final Working Install Sequence
After all the above, here’s the clean install sequence that works:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# 1. Create fresh conda env
conda create -n ranger python=3.11 -y
conda activate ranger
# 2. Install PyTorch for Apple Silicon
pip install "torch>=2.4" torchvision torchaudio
# 3. Install Transformers and accelerate
pip install transformers accelerate huggingface_hub
# 4. Fix soxr NumPy conflict (build from source)
pip install soxr --no-binary :all: --force-reinstall
# 5. Run
python3 ~/models/run_qwen.py
What It Looks Like Running
1
2
3
4
5
6
7
8
9
10
11
12
Using Apple Metal (MPS) GPU
Loading tokenizer from Qwen/Qwen3.5-0.8B...
Loading model...
Qwen3.5-0.8B ready. Type 'quit' to exit.
--------------------------------------------------
You: Tell me about cybersecurity
Qwen: Cybersecurity is the practice of protecting systems, networks,
and programs from digital attacks...
--------------------------------------------------
You: quit
Rangers lead the way!
First run downloads ~1.77GB of model weights. Every run after that uses the local cache — instant load.
What Each Part Does (Learning Notes)
| Component | What it does |
|---|---|
AutoTokenizer | Converts text → token IDs the model understands |
AutoModelForCausalLM | Loads the neural network weights from SafeTensors files |
device_map='mps' | Routes computation to M3 GPU via Apple Metal |
torch.float16 | Half precision — 2x memory saving, minimal quality loss |
apply_chat_template | Formats messages in Qwen’s expected <|im_start|>user... format |
max_new_tokens=500 | Maximum tokens to generate (~375 words) |
temperature=0.7 | Creativity level (0=deterministic, 1=more random) |
do_sample=True | Use sampling — needed when temperature > 0 |
Ollama vs Direct Transformers
| Ollama | Direct (Transformers) | |
|---|---|---|
| Setup | ollama pull model | 5 steps + fix dependencies |
| Model format | GGUF (quantised, smaller) | SafeTensors (full precision) |
| RAM usage | Lower | Higher (full weights) |
| Speed | Fast (optimised) | Good on MPS |
| Control | Limited CLI flags | Full Python — change everything |
| Learning value | Black box | You see the entire stack |
| Best for | Quick use | Building, learning, customising |
Errors Summary Table
| Error | Cause | Fix |
|---|---|---|
bad interpreter: /Volumes/KaliPro/... | Conda env built on external drive, now unmounted | Recreate env: conda remove -n ranger --all -y && conda create -n ranger python=3.11 |
OSError: Incorrect path_or_model_id | Empty cache dir (download failed silently) | Use repo ID "Qwen/Qwen3.5-0.8B" directly |
KeyError: 'qwen3_5' | Transformers too old for this architecture | pip install --upgrade transformers |
PyTorch >= 2.4 required but found 2.3.0 | PyTorch too old | pip install "torch>=2.4" --upgrade |
soxr cysoxr ImportError (numpy mismatch) | soxr compiled against NumPy 1.x, NumPy 2.x installed | pip install soxr --no-binary :all: --force-reinstall |
ValueError: requires accelerate | Missing library for device_map | pip install accelerate |
Resources
- Qwen3.5-0.8B on HuggingFace
- run_qwen.py script (this site)
- HuggingFace Transformers docs
- PyTorch MPS Backend
- Apple Metal Performance Shaders
Next Steps
- Learn llama.cpp for GGUF quantised models (smaller, faster)
- Try Qwen3.5-9B on M4 Max (128GB RAM — beast mode)
- Build an n8n workflow using this as a local AI backend
- Try vLLM for high-throughput batch inference
Support This Content
If this saved you time (and frustration), consider buying me a coffee!
Written 2026-03-03 — every error in this post was real, hit during a live session on M3 MacBook Pro. Model confirmed working after 6 errors and 1 hour of dependency battles.
Rangers lead the way! 🎖️