Article · HuggingFace Blog
Mini-R1: Reproduce Deepseek R1 „aha moment“ a RL tutorial
Replaying the DeepSeek-R1 'aha moment', this post uses Group Relative Policy Optimization (GRPO) and the Countdown Game to train an open model via RL. It details a distributed setup with DeepSpeed and vLLM on 4× NVIDIA H100 GPUs and explains how GRPO replaces a value function with group-based baselines. The aim is self-verification and search abilities learned with minimal human data, illustrating a concrete RL workflow for LLMs.
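The group-based baseline mentioned in the summary can be illustrated with a small sketch. It assumes the commonly described GRPO formulation, in which several completions are sampled for the same prompt and each completion's reward is normalized by the mean and standard deviation of its group. The function name, the `eps` term, and the sample rewards are illustrative assumptions, not code from the post.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    The group mean serves as the baseline, so no learned value
    function is needed. `eps` (an assumption here) avoids division
    by zero when every completion in the group gets the same reward.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards)
    return [(r - baseline) / (scale + eps) for r in rewards]


# Hypothetical rewards for four completions of one Countdown prompt:
# 1.0 if the equation reaches the target, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average receive a positive advantage and the rest a negative one. When all rewards in a group are equal, every advantage is zero and that group contributes no learning signal.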
Published 31 Jan. 2025 · ★★★★★
Read the source: huggingface.co/blog/open-r1/mini-r1-contdown-game
- Source
- HuggingFace Blog
- Ingested
- 31 Jan. 2025 · 19:10
- Editorial score
- 3.0 / 5