Authors

Johannes

Sami

Jackmin

Vincent

Announcements

November 29, 2024

INTELLECT-1 Release: The First Globally Trained 10B Parameter Model

We're excited to release INTELLECT-1, the first 10B parameter language model collaboratively trained across the globe. This represents a 10× scale-up from our previous research and demonstrates that large-scale model training is no longer confined to large corporations but can be achieved through distributed, community-driven approaches. The next step is scaling this even further to frontier model sizes and, ultimately, open-source AGI.


Today, we're releasing:

Detailed Technical Report
INTELLECT-1 base model, intermediate checkpoints and post-trained model
Chat interface to try it out: chat.primeintellect.ai
Pre-training Dataset
Post-training datasets by Arcee AI
PRIME framework

Scaling Decentralized Training

We present the first large-scale experiment collaboratively training a 10 billion parameter model over 1 trillion tokens across five countries and three continents on up to 112 H100 GPUs simultaneously. We achieve an overall compute utilization of 83% across continents and 96% when training exclusively on nodes distributed across the entire United States, introducing minimal overhead compared to centralized training approaches.

Our results show that INTELLECT-1 can maintain training convergence and high compute utilization despite severe bandwidth constraints and node volatility, opening new possibilities for decentralized, community-driven training of frontier foundation models.

Technical Progress

Our research and engineering work on PRIME helped us achieve this milestone.

Key innovations in PRIME include the ElasticDeviceMesh, which manages dynamic global process groups for fault-tolerant communication across the internet and local process groups for communication within a node, live checkpoint recovery, kernels, and a hybrid DiLoCo-FSDP2 implementation.

Using PRIME with DiLoCo and our custom int8 all-reduce, we achieve an overall 400x reduction in communication bandwidth compared to traditional data-parallel training settings while delivering comparable performance at the 10B scale.
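One way to see how a figure of this size can arise is to multiply two independent savings: synchronizing less often, and sending fewer bytes per value. The sketch below is illustrative only. The 100 inner steps between synchronizations and the fp32 (4-byte) baseline are assumptions made for the example, not values stated in this post.

```python
def bandwidth_reduction(inner_steps: int,
                        baseline_bytes: int = 4,
                        compressed_bytes: int = 1) -> float:
    """Estimate the communication reduction versus syncing every step.

    inner_steps:      local steps between two global syncs (assumed value).
    baseline_bytes:   bytes per value in the baseline all-reduce (fp32 assumed).
    compressed_bytes: bytes per value after quantization (int8 -> 1 byte).
    """
    sync_frequency_gain = inner_steps                       # one sync instead of many
    quantization_gain = baseline_bytes / compressed_bytes   # smaller payload per sync
    return sync_frequency_gain * quantization_gain


# Hypothetical setting: 100 inner steps, fp32 baseline, int8 all-reduce.
reduction = bandwidth_reduction(inner_steps=100)
print(f"{reduction:.0f}x")  # prints "400x"
```

Under these assumptions the two factors (100× from infrequent synchronization, 4× from int8) multiply to 400×.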

Training Details and Datasets

INTELLECT-1 is based on the Llama-3 architecture, comprising:

42 layers with 4,096 hidden dim
32 attention heads
8,192 sequence length
128,256 vocab size
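For readers who want these hyperparameters in code form, here is a minimal sketch that records them in a dataclass and derives two quantities that follow directly from the numbers above. The class and field names are hypothetical and are not taken from the PRIME codebase. Values not listed in this post (for example the feed-forward width) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Architecture values as listed in the post; names are illustrative."""
    num_layers: int = 42
    hidden_dim: int = 4096
    num_attention_heads: int = 32
    max_seq_len: int = 8192
    vocab_size: int = 128_256

    @property
    def head_dim(self) -> int:
        # Each attention head gets an equal slice of the hidden dimension.
        assert self.hidden_dim % self.num_attention_heads == 0
        return self.hidden_dim // self.num_attention_heads

    @property
    def embedding_params(self) -> int:
        # Size of a single token-embedding matrix (vocab x hidden).
        return self.vocab_size * self.hidden_dim


config = ModelConfig()
print(config.head_dim)          # prints 128
print(config.embedding_params)  # prints 525336576
```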

The model was trained on a carefully curated 1T-token dataset mix (available on Hugging Face):

55% FineWeb-Edu
20% Stack v2
10% FineWeb
10% DCLM-baseline
5% OpenWebMath
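The percentages above translate into per-dataset token budgets as follows. This is a small arithmetic sketch that assumes the shares apply to exactly 1 trillion tokens. The variable names are illustrative.

```python
TOTAL_TOKENS = 1_000_000_000_000  # 1T tokens, as stated in the post

# Dataset shares in percent, copied from the list above.
MIX_PERCENT = {
    "FineWeb-Edu": 55,
    "Stack v2": 20,
    "FineWeb": 10,
    "DCLM-baseline": 10,
    "OpenWebMath": 5,
}

# The shares should cover the whole budget.
assert sum(MIX_PERCENT.values()) == 100

# Integer arithmetic keeps the budgets exact.
token_budget = {name: TOTAL_TOKENS * pct // 100 for name, pct in MIX_PERCENT.items()}

for name, tokens in token_budget.items():
    print(f"{name}: {tokens / 1e9:.0f}B tokens")
```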

Training completed over 42 days using:

WSD learning rate scheduler
7.5e-5 inner learning rate
Auxiliary max-z-loss for stability
Nesterov momentum outer optimizer
Dynamic on-/off-boarding of compute resources with up to 14 nodes
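To make the role of the outer optimizer concrete, the following is a minimal sketch of a DiLoCo-style outer step with Nesterov momentum, written in plain Python. It is not PRIME's implementation. It assumes the commonly described formulation in which the "pseudo-gradient" is the difference between the global parameters and the average of the workers' locally trained parameters. The default `outer_lr` and `momentum` values are placeholders, not values from this post.

```python
from typing import List, Tuple


def outer_nesterov_step(global_params: List[float],
                        worker_params: List[List[float]],
                        velocity: List[float],
                        outer_lr: float = 0.7,    # placeholder value
                        momentum: float = 0.9     # placeholder value
                        ) -> Tuple[List[float], List[float]]:
    """Apply one outer update after all workers finish their inner steps."""
    num_workers = len(worker_params)
    new_params, new_velocity = [], []
    for i, theta in enumerate(global_params):
        # Average the i-th parameter across workers (the all-reduce result).
        avg_local = sum(w[i] for w in worker_params) / num_workers
        # Pseudo-gradient: how far the workers moved away from the global value.
        pseudo_grad = theta - avg_local
        # Momentum buffer update, then Nesterov look-ahead step.
        v = momentum * velocity[i] + pseudo_grad
        step = pseudo_grad + momentum * v
        new_params.append(theta - outer_lr * step)
        new_velocity.append(v)
    return new_params, new_velocity


# With outer_lr=1 and momentum=0 the update reduces to plain parameter averaging.
params, vel = outer_nesterov_step([1.0, 2.0], [[0.5, 2.0], [0.5, 2.0]],
                                  [0.0, 0.0], outer_lr=1.0, momentum=0.0)
print(params)  # prints [0.5, 2.0]
```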

Number of active training nodes over training steps. The graph demonstrates PRIME’s ability to handle dynamic node participation, starting with 4 nodes and scaling up to 14 nodes, while maintaining training stability despite frequent node fluctuations.


Training dynamics showing model perplexity and learning rate over training steps, including warmup, stable, and annealing phases.
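The three phases named in the caption correspond to a warmup-stable-decay (WSD) schedule. A minimal sketch is shown below. Only the peak learning rate of 7.5e-5 comes from this post. The warmup length, the fraction of training spent annealing, and the linear shape of the decay are assumptions chosen for illustration.

```python
def wsd_lr(step: int,
           total_steps: int,
           peak_lr: float = 7.5e-5,      # inner learning rate from the post
           warmup_steps: int = 1000,     # assumed
           decay_fraction: float = 0.2,  # assumed share of steps used for annealing
           min_lr: float = 0.0) -> float:
    """Return the learning rate at `step` for a warmup-stable-decay schedule."""
    decay_steps = int(total_steps * decay_fraction)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay (annealing): interpolate linearly from peak_lr down to min_lr.
    progress = min((step - decay_start) / max(decay_steps, 1), 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


for s in (0, 999, 5000, 9000, 10000):
    print(s, wsd_lr(s, total_steps=10000))
```

Compared with a cosine schedule, the long constant phase means the total number of steps does not have to be fixed before training starts.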


Compute Efficiency

The system achieved high compute utilization across different geographical settings:

96% compute utilization training across the entire United States (103s median sync time)
85.6% compute utilization for transatlantic training (382s median sync time)
83% compute utilization for global distributed training (469s median sync time)
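These figures can be related to each other with a simple model. If compute utilization is taken to be compute time divided by compute time plus synchronization time (an assumed definition, which may differ from how the numbers above were measured), then each pair of utilization and sync time implies an amount of computation between two synchronizations:

```python
def compute_seconds_per_sync(utilization: float, sync_seconds: float) -> float:
    """Implied compute time between syncs.

    Assumes utilization = compute / (compute + sync), which rearranges to
    compute = sync * utilization / (1 - utilization).
    """
    if not 0.0 < utilization < 1.0:
        raise ValueError("utilization must be strictly between 0 and 1")
    return sync_seconds * utilization / (1.0 - utilization)


# (utilization, median sync time in seconds) from the list above.
settings = {
    "United States": (0.96, 103),
    "Transatlantic": (0.856, 382),
    "Global": (0.83, 469),
}

for name, (util, sync) in settings.items():
    minutes = compute_seconds_per_sync(util, sync) / 60
    print(f"{name}: about {minutes:.0f} minutes of compute per sync")
```

Under this simplified model, all three settings imply a similar amount of compute per synchronization, which is what one would expect if the number of inner steps is fixed and only the communication cost changes with geography.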

Performance metrics for training across different geographical configurations.

Post-Training

After completing the globally distributed pretraining phase, we applied several post-training techniques in collaboration with Arcee AI to enhance INTELLECT-1's capabilities and task-specific performance. Our post-training methodology consisted of three main phases: extensive SFT (16 runs), DPO (8 runs), and strategic model merging using MergeKit.

For more info check out our detailed technical report.

Conclusion and Next Steps: Scaling to the Frontier

The successful training of INTELLECT-1 demonstrates a key technical advancement in enabling a more open and decentralized AI ecosystem, with significant implications for the future development and governance of advanced artificial intelligence systems.

Through our open-source PRIME framework, we've established a foundation for distributed AI development that can rival centralized training facilities.

Looking ahead, we envision scaling this approach to frontier model sizes by:

Expanding our global compute network
Implementing new economic incentives to drive community participation
Further optimizing our distributed training architecture for even larger models

This work represents a crucial step toward democratizing AI development and preventing the consolidation of AI capabilities within a few organizations. By open-sourcing INTELLECT-1's model, checkpoints, and training framework, we invite the global AI community to join us in pushing the boundaries of decentralized training.

To get involved, visit our GitHub repository or join our Discord community. Together, we can build a more open, collaborative future for AI development.


Eurus-2-PRIME

System Prompt

When tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.

[ASSESS] [ADVANCE] [VERIFY] [SIMPLIFY] [SYNTHESIZE] [PIVOT] [OUTPUT]

You should strictly follow the format below:

[ACTION NAME]
# Your action step 1
# Your action step 2
# Your action step 3
...
Next action: [NEXT ACTION NAME]

User

Quadratic polynomials $P(x)$ and $Q(x)$ have leading coefficients $2$ and $-2$, respectively. The graphs of both polynomials pass through the two points $(16, 54)$ and $(20, 53)$. Find $P(0) + Q(0)$.

Eurus-2-PRIME

[ASSESS]

We are given two quadratic polynomials $P(x)$ and $Q(x)$ with leading coefficients $2$ and $-2$, respectively. The graphs of both polynomials pass through the points $(16, 54)$ and $(20, 53)$. We need to find the value of $P(0) + Q(0)$.

[ADVANCE]

Let's express the quadratic polynomials $P(x)$ and $Q(x)$ in the general form:

\[ P(x) = 2x^{2} + bx + c \]

\[ Q(x) = -2x^{2} + dx + e \]

Since the graphs of both polynomials pass through the points $(16, 54)$ and $(20, 53)$, we can substitute these points into the equations of $P(x)$ and $Q(x)$ to form a system of equations.

For $P(x)$:

\[ P(16) = 2(16)^{2} + 16b + c = 54 \]

\[ 512 + 16b + c = 54 \]

\[ 16b + c = -458 \quad \text{(Equation 1)} \]

\[ P(20) = 2(20)^{2} + 20b + c = 53 \]

\[ 800 + 20b + c = 53 \]

\[ 20b + c = -747 \quad \text{(Equation 2)} \]

For $Q(x)$:

\[ Q(16) = -2(16)^{2} + 16d + e = 54 \]

\[ -512 + 16d + e = 54 \]

\[ 16d + e = 566 \quad \text{(Equation 3)} \]

\[ Q(20) = -2(20)^{2} + 20d + e = 53 \]

\[ -800 + 20d + e = 53 \]

\[ 20d + e = 853 \quad \text{(Equation 4)} \]

[ADVANCE]

Now, we can solve the system of equations for $P(x)$. Subtract Equation 1 from Equation 2:

\[ (20b + c) - (16b + c) = -747 - (-458) \]

\[ 4b = -289 \]

\[ b = -\frac{289}{4} \]

Substitute $b = -\frac{289}{4}$ into Equation 1:

\[ 16\left(-\frac{289}{4}\right) + c = -458 \]

\[ -1156 + c = -458 \]

\[ c = 698 \]

So, for $P(x)$, we have:

\[ P(x) = 2x^{2} - \frac{289}{4}x + 698 \]

\[ P(0) = 698 \]

Next, we can solve the system of equations for $Q(x)$. Subtract Equation 3 from Equation 4:

\[ (20d + e) - (16d + e) = 853 - 566 \]

\[ 4d = 287 \]

\[ d = \frac{287}{4} \]

Substitute $d = \frac{287}{4}$ into Equation 3:

\[ 16\left(\frac{287}{4}\right) + e = 566 \]

\[ 1148 + e = 566 \]

\[ e = -582 \]

So, for $Q(x)$, we have:

\[ Q(x) = -2x^{2} + \frac{287}{4}x - 582 \]

\[ Q(0) = -582 \]

[ADVANCE]

Finally, we can find $P(0) + Q(0)$:

\[ P(0) + Q(0) = 698 + (-582) = 116 \]

So, the value of $P(0) + Q(0)$ is $116$.

INTELLECT-MATH-7B-SFT (ours)

System Prompt

Solve the following math problem efficiently and clearly. Think carefully and step by step about your response and reason before providing a final response. Conclude your response with: \n\nTherefore, the final answer is: $\boxed{answer}$. I hope it is correct.\n\nWhere [answer] is just the final number or expression that solves the problem.

User

Quadratic polynomials $P(x)$ and $Q(x)$ have leading coefficients $2$ and $-2$, respectively. The graphs of both polynomials pass through the two points $(16, 54)$ and $(20, 53)$. Find $P(0) + Q(0)$.

INTELLECT-MATH-7B-SFT

So I've got this problem here about quadratic polynomials $P(x)$ and $Q(x)$. $P(x)$ has a leading coefficient of $2$, and $Q(x)$ has a leading coefficient of $-2$. Both polynomials pass through the points $(16, 54)$ and $(20, 53)$. I need to find $P(0) + Q(0)$.

First, since both $P(x)$ and $Q(x)$ are quadratic polynomials, I can express them in the general form:

\[ P(x) = 2x^{2} + bx + c \]

\[ Q(x) = -2x^{2} + dx + e \]

Wait, actually, the leading coefficient is the coefficient of the highest power of $x$, which for a quadratic is $x^{2}$. So yes, $P(x) = 2x^{2} + bx + c$ and $Q(x) = -2x^{2} + dx + e$.

Now, both polynomials pass through the points $(16, 54)$ and $(20, 53)$. That means when $x = 16$, $P(16) = 54$, and when $x = 20$, $P(20) = 53$. Similarly for $Q(x)$.

So, I can set up equations using these points.

First, for $P(x)$:

\[ P(16) = 2 \cdot (16)^{2} + b \cdot 16 + c = 54 \]

Similarly,

\[ P(20) = 2 \cdot (20)^{2} + b \cdot 20 + c = 53 \]

Let me calculate $16^{2}$ and $20^{2}$:

\[ 16^{2} = 256 \]

\[ 20^{2} = 400 \]

So,

\[ P(16): 2 \cdot 256 + 16b + c = 54 \Rightarrow 512 + 16b + c = 54 \Rightarrow 16b + c = 54 - 512 \Rightarrow 16b + c = -458 \]

\[ P(20): 2 \cdot 400 + 20b + c = 53 \Rightarrow 800 + 20b + c = 53 \Rightarrow 20b + c = 53 - 800 \Rightarrow 20b + c = -747 \]

Now, I have two equations:

1) $16b + c = -458$

2) $20b + c = -747$

I can subtract equation 1 from equation 2 to eliminate $c$:

\[ (20b + c) - (16b + c) = -747 - (-458) \]

\[ 20b + c - 16b - c = -747 + 458 \]

\[ 4b = -289 \]

So, $b = -289/4 = -72.25$

Wait, but coefficients are usually integers, but maybe not. Let me check.

Now, plug $b$ back into one of the equations to find $c$. Let's use equation 1:

\[ 16 \cdot (-72.25) + c = -458 \]

\[ -1156 + c = -458 \]

\[ c = -458 + 1156 = 698 \]

So, $P(x) = 2x^{2} - 72.25x + 698$

Hmm, $P(0)$ would be $c$, which is $698$.

Now, let's do the same for $Q(x)$:

\[ Q(16) = -2 \cdot (16)^{2} + d \cdot 16 + e = 54 \Rightarrow -2 \cdot 256 + 16d + e = 54 \Rightarrow -512 + 16d + e = 54 \Rightarrow 16d + e = 54 + 512 \Rightarrow 16d + e = 566 \]

\[ Q(20) = -2 \cdot (20)^{2} + d \cdot 20 + e = 53 \Rightarrow -2 \cdot 400 + 20d + e = 53 \Rightarrow -800 + 20d + e = 53 \Rightarrow 20d + e = 53 + 800 \Rightarrow 20d + e = 853 \]

Now, equations for $Q(x)$:

1) $16d + e = 566$

2) $20d + e = 853$

Subtract equation 1 from equation 2:

\[ (20d + e) - (16d + e) = 853 - 566 \]

\[ 4d = 287 \]

\[ d = 287/4 = 71.75 \]

Now, plug $d$ back into equation 1:

\[ 16 \cdot 71.75 + e = 566 \Rightarrow 1148 + e = 566 \Rightarrow e = 566 - 1148 = -582 \]

So, $Q(x) = -2x^{2} + 71.75x - 582$

Therefore, $Q(0) = e = -582$

Now, $P(0) + Q(0) = 698 + (-582) = 116$

Wait, but the problem says "present the answer in LaTeX format: \boxed{Your answer}"

So, I should write $\boxed{116}$

But let me double-check my calculations to make sure I didn't make any mistakes.

First, for $P(x)$:

\[ P(16) = 2 \cdot (256) + 16b + c = 512 + 16b + c = 54 \Rightarrow 16b + c = -458 \]

\[ P(20) = 2 \cdot (400) + 20b + c = 800 + 20b + c = 53 \Rightarrow 20b + c = -747 \]

Subtracting: $4b = -289 \Rightarrow b = -72.25$

Then, $c = -458 - 16 \cdot (-72.25) = -458 + 1156 = 698$

Similarly for $Q(x)$:

\[ Q(16) = -2 \cdot (256) + 16d + e = -512 + 16d + e = 54 \Rightarrow 16d + e = 566 \]

\[ Q(20) = -2 \cdot (400) + 20d + e = -800 + 20d + e = 53 \Rightarrow 20d + e = 853 \]

Subtracting: $4d = 287 \Rightarrow d = 71.75$

Then, $e = 566 - 16 \cdot 71.75 = 566 - 1148 = -582$

So, $P(0) = 698$, $Q(0) = -582$, sum is $116$.

Seems correct.

Final Answer

116

INTELLECT-MATH-7B (ours)

System Prompt

$\boxed{answer}$. I hope it is correct.\n\nWhere [answer] is just the final number or expression that solves the problem.

User

Quadratic polynomials $P(x)$ and $Q(x)$ have leading coefficients $2$ and $-2$, respectively. The graphs of both polynomials pass through the two points $(16, 54)$ and $(20, 53)$. Find $P(0) + Q(0)$.

INTELLECT-MATH-7B

So I have two quadratic polynomials, $P(x)$ and $Q(x)$, with leading coefficients 2 and -2, respectively. Both pass through the points $(16, 54)$ and $(20, 53)$. I need to find $P(0) + Q(0)$.

First, since both are quadratic polynomials, I can express them in the general form:

\[ P(x) = 2x^{2} + bx + c \]

\[ Q(x) = -2x^{2} + dx + e \]

But since both pass through the points $(16, 54)$ and $(20, 53)$, I can set up equations using these points.

For $P(x)$:

\[ P(16) = 2 \cdot (16)^{2} + b \cdot 16 + c = 54 \]

\[ P(20) = 2 \cdot (20)^{2} + b \cdot 20 + c = 53 \]

Similarly for $Q(x)$:

\[ Q(16) = -2 \cdot (16)^{2} + d \cdot 16 + e = 54 \]

\[ Q(20) = -2 \cdot (20)^{2} + d \cdot 20 + e = 53 \]

Let's compute the squares:

\[ 16^{2} = 256 \]

\[ 20^{2} = 400 \]

So for $P(x)$:

\[ 2 \cdot 256 + 16b + c = 54 \Rightarrow 512 + 16b + c = 54 \Rightarrow 16b + c = 54 - 512 = -458 \quad \ldots (1) \]

\[ 2 \cdot 400 + 20b + c = 53 \Rightarrow 800 + 20b + c = 53 \Rightarrow 20b + c = 53 - 800 = -747 \quad \ldots (2) \]

For $Q(x)$:

\[ -2 \cdot 256 + 16d + e = 54 \Rightarrow -512 + 16d + e = 54 \Rightarrow 16d + e = 54 + 512 = 566 \quad \ldots (3) \]

\[ -2 \cdot 400 + 20d + e = 53 \Rightarrow -800 + 20d + e = 53 \Rightarrow 20d + e = 53 + 800 = 853 \quad \ldots (4) \]

Now, I have two systems of linear equations:

For $P(x)$:

\[ 16b + c = -458 \quad \ldots (1) \]

\[ 20b + c = -747 \quad \ldots (2) \]

Subtract equation (1) from equation (2):

\[ (20b + c) - (16b + c) = -747 - (-458) \]

\[ 4b = -289 \Rightarrow b = -\frac{289}{4} \]

Now plug $b$ back into equation (1):

\[ 16\left(-\frac{289}{4}\right) + c = -458 \Rightarrow -1156 + c = -458 \Rightarrow c = -458 + 1156 = 698 \]

\[ P(x) = 2x^{2} - \frac{289}{4}x + 698 \]

For $Q(x)$:

\[ 16d + e = 566 \quad \ldots (3) \]

\[ 20d + e = 853 \quad \ldots (4) \]

Subtract equation (3) from equation (4):

\[ (20d + e) - (16d + e) = 853 - 566 \]

\[ 4d = 287 \Rightarrow d = \frac{287}{4} \]

Now plug $d$ back into equation (3):

\[ 16\left(\frac{287}{4}\right) + e = 566 \Rightarrow 1148 + e = 566 \Rightarrow e = 566 - 1148 = -582 \]

\[ Q(x) = -2x^{2} + \frac{287}{4}x - 582 \]

Now, $P(0) = c = 698$ and $Q(0) = e = -582$.

Therefore, $P(0) + Q(0) = 698 + (-582) = 116$.

Final Answer

116
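The arithmetic in the three transcripts above can be checked independently. The sketch below is not part of any model's output. It uses exact rational arithmetic to recover the constant term of a quadratic with a known leading coefficient that passes through two given points. The helper name `constant_term` is hypothetical.

```python
from fractions import Fraction
from typing import Tuple

Point = Tuple[int, int]


def constant_term(leading: int, p1: Point, p2: Point) -> Fraction:
    """Return f(0) for f(x) = leading*x^2 + b*x + c through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    # Remove the known quadratic part; what remains is the line b*x + c.
    r1 = Fraction(y1) - leading * x1 ** 2
    r2 = Fraction(y2) - leading * x2 ** 2
    b = (r2 - r1) / (x2 - x1)   # slope of the remaining line
    return r1 - b * x1          # intercept c, which equals f(0)


p0 = constant_term(2, (16, 54), (20, 53))
q0 = constant_term(-2, (16, 54), (20, 53))
print(p0, q0, p0 + q0)  # prints 698 -582 116
```

A shortcut gives the same result: the leading terms cancel in $P(x) + Q(x)$, so the sum is a line through $(16, 108)$ and $(20, 106)$ with slope $-\frac{1}{2}$, and its value at $x = 0$ is $108 + 8 = 116$.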

Acknowledgements

To our team working tirelessly on the research, engineering, and design of this run: Sami, Jackmin, Johannes, Mike, Jannik, Burnpiro, Manveer, Kemal, and Leonardo. Special thanks to Elie Bakouch for his work on the dataset, Max Ryabinin for his invaluable research advice, and the Arcee.ai team—Lucas, Maziyar, Charlie, and others—for their exceptional contributions to post-training INTELLECT-1. We also extend our gratitude to Arthur Douillard and collaborators for their work on DiLoCo and to everyone who shared ideas and discussions that made this project a success.
