nanoVLM: The simplest repository to train your VLM in pure PyTorch
nanoVLM is a minimal, pure-PyTorch toolkit for training a Vision Language Model (VLM), small enough to run on a free-tier Colab notebook. It pairs a SigLIP-based vision transformer with a Llama 3 language backbone. A modality projection module (pixel shuffle followed by a linear layer) maps the image embeddings into the language model's embedding space, so that image and text embeddings can be decoded together. The post gives two quickstart paths: clone the repo and run train.py, or open the Colab notebook to start training without any local setup.
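To make the modality projection step concrete, here is a minimal sketch of a pixel shuffle followed by a linear layer. It is an illustration under stated assumptions, not the repository's actual code: the class name `ModalityProjector`, the `scale` parameter, and all dimensions below are hypothetical, and the patch tokens are assumed to form a square grid. The idea is that each `scale x scale` block of neighbouring patch embeddings is concatenated into one wider token (so there are fewer image tokens), and a linear layer then maps that token to the language model's hidden size.

```python
import math

import torch
import torch.nn as nn


class ModalityProjector(nn.Module):
    """Pixel shuffle + linear projection (hypothetical sketch, not nanoVLM's code)."""

    def __init__(self, vit_dim: int, lm_dim: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # After the shuffle, each token holds scale**2 patch embeddings.
        self.proj = nn.Linear(vit_dim * scale**2, lm_dim, bias=False)

    def pixel_shuffle(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, vit_dim), patches in row-major grid order.
        batch, seq_len, dim = x.shape
        side = math.isqrt(seq_len)
        if side * side != seq_len:
            raise ValueError("expected a square grid of patch tokens")
        if side % self.scale != 0:
            raise ValueError("grid side must be divisible by scale")
        s = self.scale
        out_side = side // s
        # Split the grid into (out_side x out_side) blocks of (s x s) patches.
        x = x.reshape(batch, out_side, s, out_side, s, dim)
        # Bring the two in-block axes next to the feature axis.
        x = x.permute(0, 1, 3, 2, 4, 5)
        # Merge each block into a single token of width s * s * dim.
        return x.reshape(batch, out_side * out_side, s * s * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.pixel_shuffle(x))


# Hypothetical sizes: 64 patch tokens (an 8x8 grid) of width 32,
# projected to an assumed language-model hidden size of 48.
patches = torch.randn(2, 64, 32)
projector = ModalityProjector(vit_dim=32, lm_dim=48, scale=2)
tokens = projector(patches)
print(tokens.shape)  # torch.Size([2, 16, 48])
```

With `scale=2`, the 64 patch tokens become 16 image tokens, which shortens the sequence the language model has to process. The resulting `tokens` tensor could then be concatenated with text embeddings along the sequence dimension before decoding.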