In the evolving world of large language models (LLMs), how a model is stored, optimized, and run is as important as what the model does. Whether you're deploying a chatbot on a Raspberry Pi, fine-tuning LLaMA on a GPU cluster, or running LLMs on your MacBook, your choice of model format and runtime matters.
This article explores the three most popular model formats in local and cloud LLM workflows: Hugging Face (Transformers), GGUF, and MLX. We'll also cover how quantization fits into all of them.
Hugging Face models are the de facto standard in the NLP ecosystem. These models use formats like:

- `.bin` or `.pt`: PyTorch serialized weights
- `.safetensors`: a safe and fast alternative to pickle
- `.onnx`: a portable, inference-optimized format

Best for: model development, training, and fine-tuning workflows.
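As a quick illustration, here is a minimal sketch of loading a Hugging Face checkpoint with the `transformers` library. The model ID below is just an example; any causal-LM repo with `.safetensors` weights works the same way, and `device_map="auto"` assumes the `accelerate` package is installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model ID; substitute any causal-LM repo you have access to.
model_id = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use FP16/BF16 where the checkpoint provides it
    device_map="auto",    # spread weights across available GPUs/CPU (needs accelerate)
)

inputs = tokenizer("Explain GGUF in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```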
GGUF is the modern successor to the GGML `.bin` files used by llama.cpp. It consolidates model weights, config, tokenizer, and metadata into a single binary format optimized for local inference, especially with quantized models.
Supported by:

- llama.cpp
- Ollama
- LM Studio
- text-generation-webui

Best for: running large LLMs on laptops, desktops, or local servers with minimal memory and power.
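For example, the `llama-cpp-python` bindings can run a quantized GGUF file directly on CPU. A minimal sketch, where the file path is a placeholder for whatever GGUF model you've downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Path is a placeholder; point it at any downloaded GGUF file.
llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=4096,     # context window size
    n_threads=8,    # CPU threads to use
)

result = llm("Q: What is GGUF? A:", max_tokens=64, stop=["Q:"])
print(result["choices"][0]["text"])
```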
MLX is Apple's native machine learning framework, built to maximize GPU and memory efficiency on M1, M2, and M3 chips.
Best for: MacBook or Mac Studio developers who want native LLM performance with minimal setup.
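With the `mlx-lm` package (assuming it is installed via `pip install mlx-lm` on an Apple silicon Mac), running a model from the Hugging Face Hub takes only a few lines; the repo name here is only an example:

```python
from mlx_lm import load, generate  # Apple silicon only

# Example repo; any MLX-compatible model on the Hugging Face Hub works.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2")

text = generate(model, tokenizer, prompt="Explain MLX in one sentence.", max_tokens=64)
print(text)
```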
Quantization reduces the precision of model weights (e.g., from 16-bit floats to 4-bit integers), resulting in:

- ⚡ Faster inference
- 💾 Lower memory usage
- 📱 More devices supported
| Type | Precision | Use Cases |
|---|---|---|
| FP16 | 16-bit float | Balanced speed/accuracy |
| INT8 | 8-bit integer | Edge and embedded devices |
| INT4 | 4-bit integer | CPUs, low-RAM systems |
| QLoRA | 4-bit quant + LoRA fine-tuning | Train on GPUs with <24GB VRAM |
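To make the QLoRA row above concrete, here is a hedged sketch of loading a Hugging Face model in 4-bit with `bitsandbytes`, the typical first step before attaching LoRA adapters. The model ID is again only an example, and a CUDA GPU plus the `bitsandbytes` and `accelerate` packages are assumed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization config, as commonly used for QLoRA-style fine-tuning.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",    # example model ID
    quantization_config=bnb_config,
    device_map="auto",              # requires the accelerate package
)
# From here, PEFT's LoraConfig / get_peft_model would add the trainable adapters.
```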
| Format | Quantization Support |
|---|---|
| Hugging Face | ✅ (via QLoRA, bitsandbytes) |
| GGUF | ✅ Native (Q4_0, Q4_K_M, etc.) |
| MLX | ❌ Not yet supported |
| Feature | Hugging Face | GGUF | MLX |
|---|---|---|---|
| Training Support | ✅ | ❌ | ⚠️ (WIP) |
| Inference on CPU | ⚠️ Slow | ✅ Fast | ⚠️ Mac-only |
| Quantized Weights | ✅ (external) | ✅ Native | ❌ |
| Mac GPU Optimization | ⚠️ via MPS | ✅ (Ollama) | ✅ Native |
| Deployment Simplicity | ⚠️ Medium | ✅ Easy | ✅ Easy |
| If you want to... | Use this |
|---|---|
| Train or fine-tune models | Hugging Face |
| Run models efficiently on CPU (any OS) | GGUF with llama.cpp |
| Use LLMs on a MacBook without effort | MLX or LM Studio |
| Run quantized 4-bit models locally | GGUF or QLoRA |
| Provide an API for your local model | Ollama + GGUF |
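As an example of that last row, Ollama exposes a local REST API (on port 11434 by default). A minimal request with Python's `requests` library, assuming a model such as `llama3` has already been pulled with `ollama pull`, might look like this:

```python
import requests

# Ollama's default local endpoint; the model name assumes `ollama pull llama3` was run.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize the difference between GGUF and safetensors.",
        "stream": False,   # return a single JSON response instead of a stream
    },
    timeout=120,
)
print(response.json()["response"])
```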
Each format has its own sweet spot, and choosing the right one depends on whether you're training, serving, or exploring large language models.