FOSDEM 2026 · Multimodal support in llama.cpp - Achievements and Future Directions

https://fosdem.org/2026/schedule/event/LRZJEH-llama-cpp-multimodal/

  • Support timeline for multimodal: initial support landed in October 2023, was removed in May 2024 due to its buggy nature, and hacky touch-and-go implementations followed until libmtmd arrived in May 2025
  • The previous implementation used libllava, which interfaced with clip.cpp to produce embeddings, which were then passed to libllama
    • This proved troublesome because multiple libraries with different instances had to be wired together
  • libmtmd abstracts all input to be fed to the model
    • Encapsulates all input - including bundling clip.cpp - to pass data to libllama
    • Supports true multimodal input + audio (LFM2.5-Audio), and is extensible
  • mtmd_tokenize is a single function that tokenises an input prompt together with mixed-modality inputs and matches them up, assuming the text prompt contains markers corresponding to each multimodal input (see the marker-matching sketch at the end of these notes)
  • Future work:
    • Multimodal output: image and audio generation, by having libllama generate embeddings that are then passed to mtmd, which can then process that into the specific output format
      • Some models support interleaved text and multimodal output generation, so it's a bidirectional problem, as mtmd might need to give control back to libllama (see the hand-off sketch at the end of these notes)
      • Actual implementation is non-trivial; multiple ways to implement audio decoders (convolution, transformer, diffusion), and vision requires diffusion
      • There's an existing diffusion.cpp implementation, but it's not appropriate; image generation is a long way out
    • Video input: interleaved images and audio
      • This is complicated by needing memory for all of these frames + audio buffers
      • libmtmd is developing a streaming API to deal with this; seems to be a pull interface where libmtmd can pull frames from a video decoder? (a sketch of that shape is at the end of these notes)
      • Also, all of the usual packaging concerns: how do you decode video in a portable way? Forced ffmpeg install? Separate versions? How do you deal with the UX hit? Discussion here: https://github.com/ggml-org/llama.cpp/issues/18389
  • Contributing:
    • Look around the codebase, use what's already there
    • Open a discussion if you want to push significant changes
    • Try to reuse existing functionality
    • Use AI to discover, but not to write code
    • Keep things KISS and model-agnostic
  • Question: How long until llama.cpp has a first working version of multimodal output? No timeline right now, but maybe 1-2 months? Image generation will take more time.
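
Marker-matching sketch (for the mtmd_tokenize bullet above). Everything here is made up for illustration: the Chunk type, tokenize_mixed, and the "<media>" marker string are not the real libmtmd API, and the real marker depends on the model. It only shows the behaviour described in the talk, i.e. split the prompt at media markers and pair each marker with one of the supplied multimodal inputs.

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct Chunk {
    enum Kind { Text, Media } kind;
    std::string text;   // valid when kind == Text
    size_t      media;  // index into the media list when kind == Media
};

// Split `prompt` at every occurrence of `marker`, producing alternating text
// and media chunks. Returns false if the number of markers does not match the
// number of media inputs supplied alongside the prompt.
static bool tokenize_mixed(const std::string &prompt,
                           const std::string &marker,
                           size_t n_media,
                           std::vector<Chunk> &out) {
    size_t pos = 0, media_idx = 0;
    while (true) {
        size_t hit = prompt.find(marker, pos);
        if (hit == std::string::npos) break;
        if (hit > pos) out.push_back({Chunk::Text, prompt.substr(pos, hit - pos), 0});
        if (media_idx >= n_media) return false;      // marker without a matching input
        out.push_back({Chunk::Media, "", media_idx++});
        pos = hit + marker.size();
    }
    if (pos < prompt.size()) out.push_back({Chunk::Text, prompt.substr(pos), 0});
    return media_idx == n_media;                     // every input must be referenced
}

int main() {
    std::vector<Chunk> chunks;
    bool ok = tokenize_mixed("Describe <media> and compare it with <media>.",
                             "<media>", /*n_media=*/2, chunks);
    printf("ok=%d, %zu chunks\n", ok, chunks.size());
}
```

In the real library each media chunk would then go through the appropriate encoder (e.g. clip.cpp for images) so that its embeddings can be handed to libllama alongside the text tokens.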
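
Hand-off sketch (for the interleaved multimodal output bullet above). This is speculative, since the feature doesn't exist yet; all names are invented to show the control-loop shape, not a proposed llama.cpp API. The text model's role (libllama) is to emit either text tokens or embeddings destined for a media decoder (mtmd's role), and control returns to the text model when a media segment ends.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <variant>
#include <vector>

using Embedding = std::vector<float>;

struct TextToken  { std::string piece; };
struct MediaChunk { std::vector<Embedding> embeddings; };  // one media segment
using Step = std::variant<TextToken, MediaChunk>;

// The decoder side of the hand-off: turn a segment of embeddings into bytes in
// the target format (audio samples, image pixels, ...). Stubbed out here.
static std::vector<uint8_t> decode_media(const MediaChunk &chunk) {
    return std::vector<uint8_t>(chunk.embeddings.size(), 0);
}

int main() {
    // A fake generation stream: text, then a media segment, then more text.
    std::vector<Step> stream = {
        TextToken{"Here is the sound you asked for: "},
        MediaChunk{{Embedding(8, 0.1f), Embedding(8, 0.2f)}},
        TextToken{"Anything else?"},
    };

    for (const Step &s : stream) {
        if (auto *t = std::get_if<TextToken>(&s)) {
            printf("text: %s\n", t->piece.c_str());           // stays with the text model
        } else if (auto *m = std::get_if<MediaChunk>(&s)) {
            auto bytes = decode_media(*m);                     // handed to the media decoder
            printf("media: decoded %zu bytes\n", bytes.size());
        }
    }
}
```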
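
Pull-interface sketch (for the video streaming bullet above). The FrameSource/FakeDecoder names are invented for illustration and the actual API under development in libmtmd may look quite different; the point is only the shape: the consumer asks a decoder for one frame at a time instead of buffering the whole clip.

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>
#include <vector>

struct Frame {
    int64_t              pts_ms;  // presentation timestamp
    int                  width, height;
    std::vector<uint8_t> rgb;     // width * height * 3 bytes
};

// What the consumer would depend on: any source that can be pulled for frames.
struct FrameSource {
    virtual ~FrameSource() = default;
    virtual std::optional<Frame> next_frame() = 0;   // nullopt at end of stream
};

// Stand-in "decoder" that synthesises a few tiny frames, so the sketch runs
// without ffmpeg or any real video backend.
struct FakeDecoder : FrameSource {
    int i = 0;
    int total = 3;
    std::optional<Frame> next_frame() override {
        if (i >= total) return std::nullopt;
        Frame f{i * 40, 2, 2, std::vector<uint8_t>(2 * 2 * 3, 0)};
        ++i;
        return f;
    }
};

// Pull frames one at a time, process, then discard, so only one frame's worth
// of pixels is resident at once.
static void consume(FrameSource &src) {
    while (auto f = src.next_frame()) {
        printf("encode frame at %lld ms (%dx%d)\n",
               (long long)f->pts_ms, f->width, f->height);
    }
}

int main() {
    FakeDecoder dec;
    consume(dec);
}
```

The pull shape means only one frame (plus whatever audio slice accompanies it) needs to be held in memory at a time, which is exactly the buffering concern raised in the video-input bullet.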