IntermediatePython

vllm-mlxRun LLMs at 400+ tok/s on Apple Silicon

vllm-mlx is a native MLX inference server optimized for Apple Silicon, offering OpenAI and Anthropic API compatibility. It supports text and vision-language models with continuous batching, achieving over 400 tokens per second on M1 Ultra—ideal for local development and privacy-sensitive deployments.

1.4K Stars
189 forks
56 issues
101 browse
Python
Apache-2.0
Indexed

Project Overview

vllm-mlx is a native MLX inference server optimized for Apple Silicon, offering OpenAI and Anthropic API compatibility. It supports text and vision-language models with continuous batching, achieving over 400 tokens per second on M1 Ultra—ideal for local development and privacy-sensitive deployments.

If you own a Mac and want to run large language models locally, ollama or llama.cpp are usually the first choices. But vllm-mlx gives Apple Silicon users a fresh alternative: a native MLX inference server that speaks OpenAI and Anthropic APIs. Named after the popular vLLM project, it's rebuilt from the ground up for Apple's MLX framework.

The Speed Advantage

What makes vllm-mlx stand out is raw speed. Benchmarks show over 400 tokens per second on an M1 Ultra—a number that makes chat feel nearly instant. The secret is native MLX acceleration and continuous batching, which lets the server handle multiple requests simultaneously without dropping performance. For developers, this means squeezing more concurrent users out of a single Mac.

It also supports vision-language models like Qwen-VL and LLaVA, so you can feed images directly to the model for description or question answering. Multimodal local inference is still rare, and vllm-mlx handles it well.

API Compatibility and Ecosystem

A key selling point is its compatibility with OpenAI's Chat Completions API and Anthropic's Messages API. That means existing tools—LangChain, LlamaIndex, even Claude Code—can switch to a local backend by simply changing the endpoint URL. No code rewrites needed. This is a big deal for teams prioritizing privacy or cutting costs on API calls.

Another thoughtful addition is MCP (Model Context Protocol) tool calling. It allows the model to invoke external tools like search or database queries through a standard protocol, breaking the LLM out of the chat box. Still early, but the direction is promising.

Getting Started and Limitations

Installation requires Python—preferably in a conda or venv. The project depends on MLX, which means Intel Macs are out of luck. If you're on M1 or later, setup is straightforward:

  • Clone the repo: git clone https://github.com/waybarrios/vllm-mlx
  • Install dependencies: pip install -r requirements.txt
  • Start the server: python -m vllm_mlx.server --model meta-llama/Llama-3.2-3B-Instruct

Models download automatically from Hugging Face and cache locally. The first run is slow, but subsequent loads are fast. Speed is genuinely impressive even with smaller 3B models, delivering responsive interactions.

However, there are caveats. The supported model list is limited to Llama, Qwen, and LLaVA families; others require manual conversion. Documentation is still basic, and debugging may involve digging through GitHub issues. The community is small but active.

Who Should Try It?

This tool is for Mac developers who want fast, local model inference without cloud dependencies. It's also a solid choice for teams building a low-latency inference service on a budget. The native MLX acceleration outperforms generic solutions, and API compatibility lowers integration friction. For production, consider wrapping it in Docker or systemd, and watch memory usage—large models can strain a 16GB unified memory Mac.

Bottom line: vllm-mlx is currently one of the most compelling local inference servers for Apple Silicon. It's not perfect, but it's heading in the right direction.

vllm-mlxApple SiliconMLXLLM inferencevision-language modellocal inferencecontinuous batchingMCP tool callingOpenAI compatibleAnthropic compatible

Project Rating

0.0 (0 Evaluation)

Share

Frequently Asked Questions

What is vllm-mlx: Run LLMs at 400+ tok/s on Apple Silicon?

vllm-mlx is a native MLX inference server optimized for Apple Silicon, offering OpenAI and Anthropic API compatibility. It supports text and vision-language models with continuous batching, achieving over 400 tokens per second on M1 Ultra—ideal for local development and privacy-sensitive deployments.

What language is vllm-mlx: Run LLMs at 400+ tok/s on Apple Silicon written in?

vllm-mlx: Run LLMs at 400+ tok/s on Apple Silicon is primarily written in Python.

What license is vllm-mlx: Run LLMs at 400+ tok/s on Apple Silicon under?

vllm-mlx: Run LLMs at 400+ tok/s on Apple Silicon is released under the Apache-2.0 license.

Related Projects

No results yet

Explore More

Similar Tools

Cursor

Cursor

A smart code editor based on secondary development of VS Code, with "native built-in AI" as its core selling point. It does not rely on plugins but deeply integrates AI into the underlying architecture of the editor, enabling it to understand the context of the entire project's codebase. It also supports seamless migration of all VS Code configurations and plugins.

Google Antigravity

Google Antigravity

Antigravity supports multiple models, including Gemini 3 Pro, Claude Sonnet 4.5, and GPT-OSS, allowing developers to select the most suitable model for their tasks within the same environment.

Codex

Codex

OpenAI Codex is an AI programming model and assistant developed by OpenAI, capable of translating natural language instructions into corresponding source code. It provides developers with intelligent code completion and code generation functionalities. Initially launched in 2021 as the code model for the OpenAI API, it once served as the core engine for GitHub Copilot. With the evolution of OpenAI's technology, Codex returned in 2025 in a new form as an "AI programming agent," capable of understanding complex requirements and automatically writing and debugging code, significantly enhancing development efficiency and software delivery speed.

Kiro

Kiro

Kiro is an AI-powered programming IDE launched by AWS, which adopts a specification-driven development model. It transforms natural language requirements into clear specification documents and tasks, then uses built-in AI agents to generate code, debug, and optimize, providing comprehensive assistance throughout the development process of large-scale projects.

Trae

Trae

Trae (official website: trae.ai) is an AI-native integrated development environment (IDE) launched by ByteDance. It is not merely a programming assistant but rather a "collaborative partner" that deeply integrates large language models (LLMs) to help developers achieve more intelligent and automated software development—from requirements analysis and code construction to debugging and deployment.

Claude

Claude

Claude is an intelligent language interaction platform developed by the American AI company Anthropic. It integrates capabilities such as deep text understanding, information organization, code assistance, and task analysis, enabling it to handle more complex tasks beyond simple chat conversations. These include long-text summarization, image analysis, logical reasoning, and programming assistance, among others. Compared to some single-purpose Q&A bots, Claude functions more like an intelligent tool equipped with reasoning logic and scalable features.

Comments

Comments

0
0/500 Characters

No comments yet

Be the first to comment

Open Source Project

Explore, learn and contribute to open source AI projects to advance the development of artificial intelligence technology

View All