Notes · Other People's Talks · FOSDEM 2026 · Multimodal support in llama.cpp - Achievements and Future Directions
https://fosdem.org/2026/schedule/event/LRZJEH-llama-cpp-multimodal/
- Support timeline for multimodal: initial support in October 2023, removed due to its buggy nature in May 2024, hacky touch-and-go implementations until `libmtmd` landed in May 2025
- The previous implementation used `libllava`, which interfaced with `clip.cpp` to produce embeddings that were then passed to `libllama`
  - This proved troublesome due to having to interface multiple libraries with different instances
- `libmtmd` abstracts all input to be fed to the model
  - Encapsulates all input - including bundling `clip.cpp` - to pass data to `libllama`
  - Does true multimodal input + audio (LFM2.5-Audio), and is extensible
  - `mtmd_tokenize` is a single function that can tokenise an input prompt + mixed-modality inputs and match them together (assuming that there are matching markers in the text prompt corresponding to the multimodal inputs); see the sketch just below
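
    A rough caller-side sketch of that single entry point. The names follow the mtmd C API in tools/mtmd/mtmd.h (`mtmd_tokenize`, `mtmd_input_text`, `mtmd_default_marker`), but the exact signatures here are from memory and may have drifted, so treat this as illustrative rather than copy-paste ready:

    ```cpp
    #include <string>
    #include "mtmd.h"
    #include "mtmd-helper.h"

    // Tokenise a prompt mixing text and one image. mtmd_tokenize splits the
    // text on the media marker and returns an ordered list of chunks: text
    // chunks (plain llama tokens) and media chunks (encoded via the bundled
    // clip.cpp), matched positionally against the markers in the prompt.
    mtmd_input_chunks * tokenize_mixed_prompt(mtmd_context * mctx, const char * image_path) {
        // One marker per multimodal input, in order of appearance.
        std::string prompt = std::string("Describe this image: ") + mtmd_default_marker();

        mtmd_input_text text;
        text.text          = prompt.c_str();
        text.add_special   = true;  // add BOS etc.
        text.parse_special = true;  // recognise the media marker

        // Decode the image file into a bitmap (helper signature is an assumption).
        mtmd_bitmap * bmp = mtmd_helper_bitmap_init_from_file(mctx, image_path);
        if (!bmp) return nullptr;

        mtmd_input_chunks * chunks    = mtmd_input_chunks_init();
        const mtmd_bitmap * bitmaps[] = { bmp };
        int32_t res = mtmd_tokenize(mctx, chunks, &text, bitmaps, /*n_bitmaps=*/1);
        mtmd_bitmap_free(bmp);
        if (res != 0) {
            mtmd_input_chunks_free(chunks);
            return nullptr;
        }
        return chunks; // feed to the model, e.g. via mtmd_helper_eval_chunks(...)
    }
    ```

    The point of the design is that the caller never talks to `clip.cpp` directly; text and media go in through one call and come back as a single ordered chunk list.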
- Future work:
  - Multimodal output: image and audio generation, by having `libllama` generate embeddings that are then passed to `mtmd`, which can then process them into the specific output format; a hypothetical control loop is sketched below
    - Some models support interleaved text and multimodal output generation, so it's a bidirectional problem, as `mtmd` might need to give control back to `libllama`
    - Actual implementation is non-trivial; there are multiple ways to implement audio decoders (convolution, transformer, diffusion), and vision requires diffusion
    - There's an existing `diffusion.cpp` implementation, but it's not appropriate; image generation is a long way out
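
    Nothing like this exists in the tree yet; purely to make the bidirectional control-flow problem concrete, an interleaved decode loop might look like this (every helper below is invented):

    ```cpp
    #include "llama.h"
    #include "mtmd.h"

    // Hypothetical sketch, not a real llama.cpp API. The text model decodes
    // until it emits a token that opens a media segment; control then passes
    // to an imagined mtmd output decoder, which turns embeddings into
    // pixels/audio and hands control back to libllama when the segment closes.
    enum class Segment { Text, Media };

    void interleaved_decode(llama_context * lctx, mtmd_context * mctx) {
        Segment seg = Segment::Text;
        while (!done(lctx)) {                           // invented helper
            if (seg == Segment::Text) {
                llama_token tok = sample_next(lctx);    // invented helper
                if (opens_media_segment(tok)) {         // invented helper
                    seg = Segment::Media;               // libllama yields control
                } else {
                    emit_text(tok);                     // invented helper
                }
            } else if (mtmd_output_step(mctx, lctx)) {  // invented decoder step
                seg = Segment::Text;                    // mtmd hands control back
            }
        }
    }
    ```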
  - Video input: interleaved image and audio
    - This is complicated by needing memory for all of the frames + audio buffers
    - `libmtmd` is developing a streaming API to deal with this; seems to be a pull interface where `libmtmd` can pull frames from a video decoder? (sketched below)
    - Also, all of the usual packaging concerns: how do you decode video in a portable way? Forced ffmpeg install? Separate versions? How do you deal with the UX hit? Discussion here: https://github.com/ggml-org/llama.cpp/issues/18389
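
    The streaming API is still being designed; assuming the pull model described above, the caller-facing shape might be roughly this (all names hypothetical, nothing here is the real API):

    ```cpp
    #include <cstdint>
    #include <optional>

    // Hypothetical pull interface: rather than the caller decoding and
    // buffering every frame up front, libmtmd would pull frames one at a time
    // from a user-supplied source, so only the current frame is resident
    // while it is being encoded.
    struct VideoFrame {
        int64_t         pts_ms;  // presentation timestamp
        uint32_t        nx, ny;  // dimensions in pixels
        const uint8_t * rgb;     // packed RGB data, valid until the next pull
    };

    struct FrameSource {
        // Return the next frame, or std::nullopt at end of stream. A concrete
        // implementation would wrap ffmpeg or a platform video decoder, which
        // is exactly the packaging question raised in the issue linked above.
        virtual std::optional<VideoFrame> next_frame() = 0;
        virtual ~FrameSource() = default;
    };
    ```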
- Contributing:
- Look around the codebase, use what's already there
- Open a discussion if you want to push significant changes
- Try to reuse existing functionality
- Use AI to discover, but not to write code
- Keep things KISS and model-agnostic
- Question: How long until llama.cpp has a first working version of multimodal output? No timeline right now, but 1-2 months maybe? Image generation will take more time.