Building a fully local deep researcher with DeepSeek-R1
AI Summary
Summary of DeepSeek’s R1 Reasoning Model Release
- DeepSeek released R1, a new open-source reasoning model, with an accompanying paper detailing its training strategy.
- Reasoning models represent a new scaling paradigm for large language models (LLMs), focusing on System 2 reasoning (deliberate and logical) rather than System 1 (fast and intuitive).
- R1 is trained using a combination of fine-tuning and reinforcement learning (RL), with an RL method called GRPO (Group Relative Policy Optimization).
- The training process involves:
  - Fine-tuning DeepSeek-V3 on thousands of chain-of-thought reasoning examples (exact number unspecified).
  - Running GRPO RL on 144,000 hard, verifiable problems in math and coding, generating 64 attempts per example and scoring each attempt on correctness.
  - Comparing each attempt's score to the mean of its batch and adjusting token generation probabilities accordingly, so that above-average attempts are reinforced and good reasoning patterns become more likely.
  - Filtering the RL outputs to obtain 600,000 high-quality reasoning traces for further training.
  - A second fine-tuning phase on these traces plus an additional 200,000 non-reasoning examples to restore general capabilities.
  - A second RL phase with rewards for helpfulness, harmlessness, and reasoning, using a mix of reasoning and general problems.
- DeepSeek also created smaller, distilled R1 models that can run on personal laptops.
- Results show that R1 performs comparably to OpenAI’s o1 on coding and math challenges, and the distilled 14B model runs effectively on a 32GB MacBook Pro.
- The video demonstrates using R1 with Ollama (a tool for running models locally) in a notebook environment, highlighting the presence of “think tokens” in the outputs, which can be removed in JSON mode.
- An example workflow using R1 for report writing is shown, involving generating queries, performing web searches, summarizing results, and reflecting to generate new queries in a loop.
- The presenter concludes that while the “think token” issue is a minor annoyance, the ability to run reasoning models locally is a significant advancement, and the open-source nature of the training process is commendable.
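The group-relative scoring step in the training process above can be sketched in a few lines of Python. This is a minimal illustration, not DeepSeek's training code: the function name and the reward values are hypothetical, and dividing by the group's standard deviation follows the published GRPO formulation rather than anything stated in this summary. It shows only how each attempt is scored against the mean of its own group; the policy update that uses these scores is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each attempt relative to the other attempts at the same problem.

    Attempts above the group mean get a positive advantage (their tokens
    would be made more likely); attempts below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every attempt earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 8 attempts at one problem (1.0 = verified correct).
# The summary describes 64 attempts per problem; 8 keeps the example short.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f}  advantage={advantage:+.3f}")
```

When every attempt in a group earns the same reward, all advantages are zero, so a problem that is always solved or never solved contributes no learning signal in this scheme.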
Detailed Instructions and URLs
- No specific CLI commands, website URLs, or detailed instructions were provided in the transcript.
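As an illustration of the report-writing workflow summarized above, here is a minimal Python sketch. It is not code from the video. The function names, prompts, and loop count are hypothetical, and the model and web search are passed in as plain callables so the example runs without any external service. It also assumes the "think tokens" appear as a `<think>...</think>` block in the model's text output, which is removed with a regular expression before the text is used.

```python
import re
from typing import Callable

# Assumption: the model wraps its reasoning in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> reasoning blocks from model output."""
    return THINK_BLOCK.sub("", text).strip()


def research_loop(
    topic: str,
    llm: Callable[[str], str],     # hypothetical: prompt in, raw text out
    search: Callable[[str], str],  # hypothetical: query in, results text out
    max_loops: int = 3,
) -> str:
    """Generate a query, search, summarize, reflect, and repeat."""
    query = strip_think(llm(f"Write one web search query about: {topic}"))
    summary = ""
    for _ in range(max_loops):
        results = search(query)
        # Fold the new search results into the running summary.
        summary = strip_think(llm(
            f"Update the summary of '{topic}'.\n"
            f"Current summary: {summary}\nNew results: {results}"
        ))
        # Reflect on the summary to produce the next query.
        query = strip_think(llm(
            f"Write one follow-up search query that fills a gap in:\n{summary}"
        ))
    return summary


# Stand-ins for a local model and a search tool, for demonstration only.
def demo_llm(prompt: str) -> str:
    return f"<think>planning...</think>\nresponse to: {prompt[:30]}"


def demo_search(query: str) -> str:
    return f"results for: {query}"


report = research_loop("reasoning models", demo_llm, demo_search, max_loops=2)
print(report)
```

In practice `llm` would wrap a call to a locally served R1 model and `search` would wrap a web search API; keeping both as parameters means the loop itself can be tested without either.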