LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation
LlamaFusion, transformer-modules, diffusion, multimodal-generation, text-and-image-processing, pretrained-language-models, self-attention-layers, autoregressive-processing, image-understanding, vision-language-models
University of Washington • FAIR, Meta • Stanford University
We present LlamaFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LlamaFusion leverages existing Llama-3 weights for processing text...
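The abstract is truncated here, but the tags above (transformer-modules, self-attention-layers) suggest the core idea: tokens of each modality flow through modality-specific projections while attention is shared across the whole sequence, letting the pretrained text weights stay frozen. The following is a minimal sketch of that routing pattern under those assumptions; the function name, shapes, and weights are illustrative, not the paper's actual API.

```python
import numpy as np

def modality_routed_ffn(hidden, image_mask, w_text, w_image):
    """Route each token through a modality-specific projection.

    hidden:     (seq_len, d_model) token representations after shared attention
    image_mask: (seq_len,) boolean, True where a token is an image token
    w_text:     (d_model, d_model) text-module weights (could stay frozen at
                their pretrained Llama-3 values, per the framework's premise)
    w_image:    (d_model, d_model) image-module weights (trained from scratch)
    """
    out = np.empty_like(hidden)
    out[~image_mask] = hidden[~image_mask] @ w_text   # text tokens -> text module
    out[image_mask] = hidden[image_mask] @ w_image    # image tokens -> image module
    return out

# Toy example: 4 tokens, the last two are image tokens.
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))
mask = np.array([False, False, True, True])
w_t = rng.standard_normal((8, 8))
w_i = rng.standard_normal((8, 8))
y = modality_routed_ffn(h, mask, w_t, w_i)
print(y.shape)  # (4, 8)
```

Because the routing is purely mask-based, the text path is untouched when a sequence contains no image tokens, which is consistent with the stated goal of preserving the base LLM's text ability.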