Article · HuggingFace Blog

Mini-R1: Reproduce Deepseek R1 „aha moment“ a RL tutorial

This post reproduces the DeepSeek-R1 'aha moment' by training an open model with reinforcement learning (RL), using Group Relative Policy Optimization (GRPO) on the Countdown Game. It details a distributed setup with DeepSpeed and vLLM on 4× NVIDIA H100 GPUs and explains how GRPO replaces a learned value function with a group-based baseline: several completions are sampled per prompt, and each one is scored relative to the others in its group. The aim is for the model to learn self-verification and search abilities with minimal human data, illustrating a concrete RL workflow for LLMs.
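
The group-based baseline mentioned in the summary can be illustrated with a minimal sketch. It assumes the commonly described GRPO normalization, where each completion's reward is centered on the group mean and scaled by the group standard deviation. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, not code taken from the post.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to its own sampling group.

    `rewards` holds one scalar reward per completion sampled for the
    same prompt. The group mean plays the role of the baseline that a
    learned value function would otherwise provide.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps (hypothetical constant) avoids division by zero when all rewards match
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 completions: two solved the puzzle, two did not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average receive a positive advantage and the rest a negative one. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.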

Published: 31 Jan 2025

Read the source: huggingface.co/blog/open-r1/mini-r1-contdown-game

Source: HuggingFace Blog
Ingested: 31 Jan 2025 · 19:10
Editorial score: 3.0 / 5