November 29, 2024

INTELLECT-1 Release: The First Globally Trained 10B Parameter Model

We're excited to release INTELLECT-1, the first 10B parameter language model collaboratively trained across the globe. This represents a 10× scale-up from our previous research and demonstrates that large-scale model training is no longer confined to large corporations but can be achieved through distributed, community-driven approaches. The next step is scaling this even further to frontier model sizes and, ultimately, open-source AGI.

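Today, we're releasing the INTELLECT-1 model, its checkpoints, the PRIME training framework, and a detailed technical report.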

Scaling Decentralized Training

We present the first large-scale experiment collaboratively training a 10 billion parameter model over 1 trillion tokens across five countries and three continents on up to 112 H100 GPUs simultaneously. We achieve an overall compute utilization of 83% across continents and 96% when training exclusively on nodes distributed across the entire United States, introducing minimal overhead compared to centralized training approaches.

Our results show that INTELLECT-1 can maintain training convergence and high compute utilization despite severe bandwidth constraints and node volatility, opening new possibilities for decentralized, community-driven training of frontier foundation models.

Technical Progress

Our research and engineering work on PRIME helped us achieve this milestone.

Key innovations in PRIME include the ElasticDeviceMesh, which manages dynamic global process groups for fault-tolerant communication across the internet and local process groups for communication within a node, live checkpoint recovery, kernels, and a hybrid DiLoCo-FSDP2 implementation.
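To make the DiLoCo part concrete, here is a minimal sketch of one outer synchronization step with a Nesterov-momentum outer optimizer. It is an illustration of the general DiLoCo idea, not PRIME's implementation: the function name `diloco_outer_step`, the flat-list parameter representation, and the default `outer_lr` and `momentum` values are all assumptions made for this example.

```python
from typing import List, Tuple


def diloco_outer_step(
    global_params: List[float],
    worker_params: List[List[float]],
    momentum_buf: List[float],
    outer_lr: float = 0.7,   # illustrative value, not taken from this post
    momentum: float = 0.9,   # illustrative value, not taken from this post
) -> Tuple[List[float], List[float]]:
    """One outer step: average pseudo-gradients, then apply Nesterov SGD."""
    n_workers = len(worker_params)
    new_params, new_buf = [], []
    for i, theta in enumerate(global_params):
        # Pseudo-gradient: how far each worker moved away from the shared
        # starting point during its local (inner) steps, averaged over workers.
        pseudo_grad = sum(theta - w[i] for w in worker_params) / n_workers
        # Nesterov momentum update (same form as SGD with nesterov=True).
        buf = momentum * momentum_buf[i] + pseudo_grad
        step = pseudo_grad + momentum * buf
        new_params.append(theta - outer_lr * step)
        new_buf.append(buf)
    return new_params, new_buf


# With outer_lr=1 and no momentum, the outer step reduces to plain averaging
# of the workers' parameters.
params, buf = diloco_outer_step([1.0], [[0.5], [0.7]], [0.0], outer_lr=1.0, momentum=0.0)
```

Only the averaged pseudo-gradient has to cross the network, and only once per outer step, which is what makes the approach suitable for low-bandwidth links.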

Using PRIME with DiLoCo and our custom int8 all-reduce, we achieve an overall 400x reduction in communication bandwidth compared to traditional data-parallel training settings while delivering comparable performance at the 10B scale.
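One way to see where a 400x figure could come from is to multiply two savings: synchronizing once every `inner_steps` local steps instead of every step, and sending 8-bit instead of 32-bit values. The sketch below assumes `inner_steps = 100` and an fp32 baseline, neither of which is stated in this post, and uses a simple symmetric per-tensor int8 quantizer that may differ from the scheme in PRIME's custom all-reduce.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the int8 range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


def bandwidth_reduction(inner_steps: int, full_bits: int = 32, quant_bits: int = 8) -> float:
    """Reduction vs. per-step full-precision gradient sync (simplified model)."""
    return inner_steps * (full_bits / quant_bits)


pseudo_grads = [0.5, -1.0, 0.25]           # made-up values for illustration
codes, scale = quantize_int8(pseudo_grads)
restored = dequantize_int8(codes, scale)
factor = bandwidth_reduction(inner_steps=100)  # assumed sync interval
```

Under these assumptions the two factors are 100 and 4, which multiply to the 400x quoted above; the rounding error of each restored value is bounded by half the quantization step.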

Training Details and Datasets

INTELLECT-1 is based on the Llama-3 architecture, comprising:

  • 42 layers with 4,096 hidden dim
  • 32 attention heads
  • 8,192 sequence length
  • 128,256 vocab size
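As a rough sanity check that these dimensions land near 10B parameters, the sketch below counts weights for a Llama-style decoder. The post does not give the feed-forward width, the number of key/value heads, or whether embeddings are tied, so `ffn_dim`, `n_kv_heads`, and `tied_embeddings` are assumptions borrowed from common Llama-3 settings; the resulting count is only an estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Values stated in the post.
    n_layers: int = 42
    hidden_dim: int = 4096
    n_heads: int = 32
    seq_len: int = 8192
    vocab_size: int = 128_256
    # ASSUMPTIONS: not stated in the post, typical Llama-3 choices.
    n_kv_heads: int = 8
    ffn_dim: int = 14_336
    tied_embeddings: bool = False


def estimate_params(cfg: ModelConfig) -> int:
    """Approximate weight count of a Llama-style decoder-only transformer."""
    head_dim = cfg.hidden_dim // cfg.n_heads
    kv_dim = head_dim * cfg.n_kv_heads
    # Attention: query and output projections are square, key and value
    # projections are narrower under grouped-query attention.
    attn = 2 * cfg.hidden_dim * cfg.hidden_dim + 2 * cfg.hidden_dim * kv_dim
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * cfg.hidden_dim * cfg.ffn_dim
    norms = 2 * cfg.hidden_dim  # two RMSNorm weight vectors per layer
    embed = cfg.vocab_size * cfg.hidden_dim
    embed_total = embed if cfg.tied_embeddings else 2 * embed
    # The trailing hidden_dim term is the final RMSNorm.
    return cfg.n_layers * (attn + mlp + norms) + embed_total + cfg.hidden_dim


cfg = ModelConfig()
total_params = estimate_params(cfg)
print(f"head dim: {cfg.hidden_dim // cfg.n_heads}, params: {total_params / 1e9:.2f}B")
```

With the assumed values the estimate comes out at about 10.2 billion parameters, consistent with the "10B" label.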

The model was trained on a carefully curated 1T-token dataset mix:

  • 55% FineWeb-Edu
  • 20% Stack v2
  • 10% FineWeb
  • 10% DCLM-baseline
  • 5% OpenWebMath
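The percentages above translate directly into per-source token budgets. The short sketch below does that arithmetic; it assumes the 1T total is exactly 10^12 tokens and that the shares are by token count, and the helper name `token_budget` is made up for this example.

```python
from typing import Dict

TOTAL_TOKENS = 1_000_000_000_000  # "1T", taken as exactly 10^12 for illustration

MIX_PERCENT: Dict[str, int] = {
    "FineWeb-Edu": 55,
    "Stack v2": 20,
    "FineWeb": 10,
    "DCLM-baseline": 10,
    "OpenWebMath": 5,
}


def token_budget(total: int, mix: Dict[str, int]) -> Dict[str, int]:
    """Split a total token count according to integer percentage shares."""
    if sum(mix.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mix.items()}


budget = token_budget(TOTAL_TOKENS, MIX_PERCENT)
for name, tokens in budget.items():
    print(f"{name:>14}: {tokens / 1e9:.0f}B tokens")
```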

Training completed over 42 days using:

  • WSD learning rate scheduler
  • 7.5e-5 inner learning rate
  • Auxiliary max-z-loss for stability
  • Nesterov momentum outer optimizer
  • Dynamic on-/off-boarding of compute resources with up to 14 nodes

Figure: Number of active training nodes over training steps. The graph demonstrates PRIME’s ability to handle dynamic node participation, starting with 4 nodes and scaling up to 14 nodes, while maintaining training stability despite frequent node fluctuations.

Training dynamics showing model perplexity and learning rate over training steps, including warmup, stable, and annealing phases.
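The warmup, stable, and annealing phases correspond to a warmup-stable-decay (WSD) schedule. A minimal sketch follows, using the 7.5e-5 peak learning rate from the list above. The warmup length, the fraction of training spent annealing, the linear decay shape, and the final learning rate are assumptions chosen for illustration, not the values used in the run.

```python
def wsd_lr(
    step: int,
    total_steps: int,
    peak_lr: float = 7.5e-5,       # inner learning rate from the post
    warmup_steps: int = 1_000,     # ASSUMPTION: illustrative warmup length
    decay_fraction: float = 0.2,   # ASSUMPTION: share of steps spent annealing
    min_lr: float = 0.0,           # ASSUMPTION: final learning rate
) -> float:
    """Warmup-stable-decay schedule with linear warmup and linear decay."""
    decay_steps = int(total_steps * decay_fraction)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from near zero up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Annealing phase: interpolate linearly from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress


schedule = [wsd_lr(s, total_steps=10_000) for s in (0, 500, 5_000, 9_000, 10_000)]
```

A practical appeal of this shape is that the stable phase has no fixed end, so the annealing phase can be started whenever the token budget is reached.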

Compute Efficiency

The system achieved high training efficiency across different geographical settings:

  • 96% compute utilization training across the entire United States (103s median sync time)
  • 85.6% compute utilization for transatlantic training (382s median sync time)
  • 83% compute utilization for global distributed training (469s median sync time)

Figure: Performance metrics for training across different geographical configurations.
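The utilization and sync-time figures can be related with a simple model in which each outer step consists of a block of local compute followed by a blocking synchronization, so that utilization = compute / (compute + sync). This definition is an assumption for illustration; the post does not say how utilization was measured, and medians do not combine exactly this way. The function name `implied_compute_seconds` is hypothetical.

```python
def implied_compute_seconds(utilization: float, sync_seconds: float) -> float:
    """Compute time per outer step implied by utilization = c / (c + sync)."""
    if not 0.0 < utilization < 1.0:
        raise ValueError("utilization must be strictly between 0 and 1")
    return sync_seconds * utilization / (1.0 - utilization)


# (utilization, median sync time in seconds) as reported in the list above.
settings = {
    "United States": (0.96, 103.0),
    "Transatlantic": (0.856, 382.0),
    "Global": (0.83, 469.0),
}

for name, (util, sync) in settings.items():
    compute = implied_compute_seconds(util, sync)
    print(f"{name:>13}: ~{compute / 60:.0f} min of compute per sync")
```

Under this model all three settings imply a similar amount of compute between synchronizations (on the order of 40 minutes), which suggests the utilization differences come mainly from the sync time rather than from the compute phase.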

Post-Training

After completing the globally distributed pretraining phase, we applied several post-training techniques in collaboration with Arcee AI to enhance INTELLECT-1's capabilities and task-specific performance. Our post-training methodology consisted of three main phases: extensive SFT (16 runs), DPO (8 runs), and strategic model merging using MergeKit.

For more information, see our detailed technical report.

Conclusion and Next Steps: Scaling to the Frontier

The successful training of INTELLECT-1 demonstrates a key technical advancement in enabling a more open and decentralized AI ecosystem, with significant implications for the future development and governance of advanced artificial intelligence systems.

Through our open-source PRIME framework, we've established a foundation for distributed AI development that can rival centralized training facilities.

Looking ahead, we envision scaling this approach to frontier model sizes by:

  • Expanding our global compute network
  • Implementing new economic incentives to drive community participation
  • Further optimizing our distributed training architecture for even larger models

This work represents a crucial step toward democratizing AI development and preventing the consolidation of AI capabilities within a few organizations. By open-sourcing INTELLECT-1's model, checkpoints, and training framework, we invite the global AI community to join us in pushing the boundaries of decentralized training.

To get involved, visit our GitHub repository or join our Discord community. Together, we can build a more open, collaborative future for AI development.

Eurus-2-PRIME
System Prompt
When tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process. [ASSESS] [ADVANCE] [VERIFY] [SIMPLIFY] [SYNTHESIZE] [PIVOT] [OUTPUT] You should strictly follow the format below: [ACTION NAME] # Your action step 1 # Your action step 2 # Your action step 3 ... Next action: [NEXT ACTION NAME]
User
Quadratic polynomials $P(x)$ and $Q(x)$ have leading coefficients $2$ and $-2$, respectively. The graphs of both polynomials pass through the two points $(16, 54)$ and $(20, 53)$. Find $P(0) + Q(0)$.
Eurus-2-PRIME
[ASSESS]

We are given two quadratic polynomials $P(x)$ and $Q(x)$ with leading coefficients $2$ and $-2$, respectively. The graphs of both polynomials pass through the points $(16, 54)$ and $(20, 53)$. We need to find the value of $P(0) + Q(0)$.

[ADVANCE]

Let's express the quadratic polynomials $P(x)$ and $Q(x)$ in the general form:
$P(x) = 2x^2 + bx + c$
$Q(x) = -2x^2 + dx + e$

Since the graphs of both polynomials pass through the points $(16, 54)$ and $(20, 53)$, we can substitute these points into the equations of $P(x)$ and $Q(x)$ to form a system of equations.

For $P(x)$:
$P(16) = 2(16)^2 + 16b + c = 54$
$512 + 16b + c = 54$
$16b + c = -458$ (Equation 1)

$P(20) = 2(20)^2 + 20b + c = 53$
$800 + 20b + c = 53$
$20b + c = -747$ (Equation 2)

For $Q(x)$:
$Q(16) = -2(16)^2 + 16d + e = 54$
$-512 + 16d + e = 54$
$16d + e = 566$ (Equation 3)

$Q(20) = -2(20)^2 + 20d + e = 53$
$-800 + 20d + e = 53$
$20d + e = 853$ (Equation 4)

[ADVANCE]

Now, we can solve the system of equations for $P(x)$:
Subtract Equation 1 from Equation 2:
$(20b + c) - (16b + c) = -747 - (-458)$
$4b = -289$
$b = -\frac{289}{4}$

Substitute $b = -\frac{289}{4}$ into Equation 1:
$16\left(-\frac{289}{4}\right) + c = -458$
$-1156 + c = -458$
$c = 698$

So, for $P(x)$, we have:
$P(x) = 2x^2 - \frac{289}{4}x + 698$
$P(0) = 698$

Next, we can solve the system of equations for $Q(x)$:
Subtract Equation 3 from Equation 4:
$(20d + e) - (16d + e) = 853 - 566$
$4d = 287$
$d = \frac{287}{4}$

Substitute $d = \frac{287}{4}$ into Equation 3:
$16\left(\frac{287}{4}\right) + e = 566$
$1148 + e = 566$
$e = -582$

So, for $Q(x)$, we have:
$Q(x) = -2x^2 + \frac{287}{4}x - 582$
$Q(0) = -582$

[ADVANCE]

Finally, we can find $P(0) + Q(0)$:
$P(0) + Q(0) = 698 + (-582) = 116$

So, the value of $P(0) + Q(0)$ is $116$.
INTELLECT-MATH-7B-SFT (ours)
System Prompt
Solve the following math problem efficiently and clearly. Think carefully and step by step about your response and reason before providing a final response. Conclude your response with: \n\nTherefore, the final answer is: $\boxed{answer}$. I hope it is correct.\n\nWhere [answer] is just the final number or expression that solves the problem.
User
Quadratic polynomials $P(x)$ and $Q(x)$ have leading coefficients $2$ and $-2$, respectively. The graphs of both polynomials pass through the two points $(16, 54)$ and $(20, 53)$. Find $P(0) + Q(0)$.
INTELLECT-MATH-7B-SFT
So I've got this problem here about quadratic polynomials $P(x)$ and $Q(x)$. $P(x)$ has a leading coefficient of $2$, and $Q(x)$ has a leading coefficient of $-2$. Both polynomials pass through the points $(16, 54)$ and $(20, 53)$. I need to find $P(0) + Q(0)$.

First, since both $P(x)$ and $Q(x)$ are quadratic polynomials, I can express them in the general form:

$P(x) = 2x^2 + bx + c$

$Q(x) = -2x^2 + dx + e$

Wait, actually, the leading coefficient is the coefficient of the highest power of $x$, which for a quadratic is $x^2$. So yes, $P(x) = 2x^2 + bx + c$ and $Q(x) = -2x^2 + dx + e$.

Now, both polynomials pass through the points $(16, 54)$ and $(20, 53)$. That means when $x = 16$, $P(16) = 54$, and when $x = 20$, $P(20) = 53$. Similarly for $Q(x)$.

So, I can set up equations using these points.

First, for $P(x)$:

$P(16) = 2(16)^2 + b \cdot 16 + c = 54$

Similarly, $P(20) = 2(20)^2 + b \cdot 20 + c = 53$

Let me calculate $16^2$ and $20^2$:

$16^2 = 256$

$20^2 = 400$

So,

$P(16)$: $2 \cdot 256 + 16b + c = 54 \Rightarrow 512 + 16b + c = 54 \Rightarrow 16b + c = 54 - 512 \Rightarrow 16b + c = -458$

$P(20)$: $2 \cdot 400 + 20b + c = 53 \Rightarrow 800 + 20b + c = 53 \Rightarrow 20b + c = 53 - 800 \Rightarrow 20b + c = -747$

Now, I have two equations:

1) $16b + c = -458$

2) $20b + c = -747$

I can subtract equation 1 from equation 2 to eliminate $c$:

$(20b + c) - (16b + c) = -747 - (-458)$

$20b + c - 16b - c = -747 + 458$

$4b = -289$

So, $b = -289/4 = -72.25$

Wait, but coefficients are usually integers, but maybe not. Let me check.

Now, plug $b$ back into one of the equations to find $c$. Let's use equation 1:

$16(-72.25) + c = -458$

$-1156 + c = -458$

$c = -458 + 1156 = 698$

So, $P(x) = 2x^2 - 72.25x + 698$

Hmm, $P(0)$ would be $c$, which is $698$.

Now, let's do the same for $Q(x)$.

$Q(16) = -2(16)^2 + d \cdot 16 + e = 54 \Rightarrow -2 \cdot 256 + 16d + e = 54 \Rightarrow -512 + 16d + e = 54 \Rightarrow 16d + e = 54 + 512 \Rightarrow 16d + e = 566$

$Q(20) = -2(20)^2 + d \cdot 20 + e = 53 \Rightarrow -2 \cdot 400 + 20d + e = 53 \Rightarrow -800 + 20d + e = 53 \Rightarrow 20d + e = 53 + 800 \Rightarrow 20d + e = 853$

Now, equations for $Q(x)$:

1) $16d + e = 566$

2) $20d + e = 853$

Subtract equation 1 from equation 2:

$(20d + e) - (16d + e) = 853 - 566$

$4d = 287$

$d = 287/4 = 71.75$

Now, plug $d$ back into equation 1:

$16 \cdot 71.75 + e = 566 \Rightarrow 1148 + e = 566 \Rightarrow e = 566 - 1148 = -582$

So, $Q(x) = -2x^2 + 71.75x - 582$

Therefore, $Q(0) = e = -582$

Now, $P(0) + Q(0) = 698 + (-582) = 116$

Wait, but the problem says "present the answer in LaTeX format: \boxed{Your answer}"

So, I should write $\boxed{116}$

But let me double-check my calculations to make sure I didn't make any mistakes.

First, for $P(x)$:

$P(16) = 2(256) + 16b + c = 512 + 16b + c = 54 \Rightarrow 16b + c = -458$

$P(20) = 2(400) + 20b + c = 800 + 20b + c = 53 \Rightarrow 20b + c = -747$

Subtracting: $4b = -289 \Rightarrow b = -72.25$

Then, $c = -458 - 16(-72.25) = -458 + 1156 = 698$

Similarly for $Q(x)$:

$Q(16) = -2(256) + 16d + e = -512 + 16d + e = 54 \Rightarrow 16d + e = 566$

$Q(20) = -2(400) + 20d + e = -800 + 20d + e = 53 \Rightarrow 20d + e = 853$

Subtracting: $4d = 287 \Rightarrow d = 71.75$

Then, $e = 566 - 16 \cdot 71.75 = 566 - 1148 = -582$

So, $P(0) = 698$, $Q(0) = -582$, sum is $116$.

Seems correct.

Final Answer

$\boxed{116}$
INTELLECT-MATH-7B (ours)
System Prompt
Solve the following math problem efficiently and clearly. Think carefully and step by step about your response and reason before providing a final response. Conclude your response with: \n\nTherefore, the final answer is: $\boxed{answer}$. I hope it is correct.\n\nWhere [answer] is just the final number or expression that solves the problem.
User
Quadratic polynomials $P(x)$ and $Q(x)$ have leading coefficients $2$ and $-2$, respectively. The graphs of both polynomials pass through the two points $(16, 54)$ and $(20, 53)$. Find $P(0) + Q(0)$.
INTELLECT-MATH-7B
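So I have two quadratic polynomials, $P(x)$ and $Q(x)$, with leading coefficients $2$ and $-2$, respectively. Both pass through the points $(16, 54)$ and $(20, 53)$. I need to find $P(0) + Q(0)$.

First, since both are quadratic polynomials, I can express them in the general form:

$P(x) = 2x^2 + bx + c$

$Q(x) = -2x^2 + dx + e$

But since both pass through the points $(16, 54)$ and $(20, 53)$, I can set up equations using these points.

For $P(x)$:

$P(16) = 2(16)^2 + b \cdot 16 + c = 54$

$P(20) = 2(20)^2 + b \cdot 20 + c = 53$

Similarly for $Q(x)$:

$Q(16) = -2(16)^2 + d \cdot 16 + e = 54$

$Q(20) = -2(20)^2 + d \cdot 20 + e = 53$

Let's compute the squares:

$16^2 = 256$

$20^2 = 400$

So for $P(x)$:

$2 \cdot 256 + 16b + c = 54 \Rightarrow 512 + 16b + c = 54 \Rightarrow 16b + c = 54 - 512 = -458$ ...(1)

$2 \cdot 400 + 20b + c = 53 \Rightarrow 800 + 20b + c = 53 \Rightarrow 20b + c = 53 - 800 = -747$ ...(2)

For $Q(x)$:

$-2 \cdot 256 + 16d + e = 54 \Rightarrow -512 + 16d + e = 54 \Rightarrow 16d + e = 54 + 512 = 566$ ...(3)

$-2 \cdot 400 + 20d + e = 53 \Rightarrow -800 + 20d + e = 53 \Rightarrow 20d + e = 53 + 800 = 853$ ...(4)

Now, I have two systems of linear equations:

For $P(x)$:

$16b + c = -458$ ...(1)

$20b + c = -747$ ...(2)

Subtract equation (1) from equation (2):

$(20b + c) - (16b + c) = -747 - (-458)$

$4b = -289 \Rightarrow b = -\frac{289}{4}$

Now plug $b$ back into equation (1):

$16\left(-\frac{289}{4}\right) + c = -458 \Rightarrow -1156 + c = -458 \Rightarrow c = -458 + 1156 = 698$

So $P(x) = 2x^2 - \frac{289}{4}x + 698$

For $Q(x)$:

$16d + e = 566$ ...(3)

$20d + e = 853$ ...(4)

Subtract equation (3) from equation (4):

$(20d + e) - (16d + e) = 853 - 566$

$4d = 287 \Rightarrow d = \frac{287}{4}$

Now plug $d$ back into equation (3):

$16\left(\frac{287}{4}\right) + e = 566 \Rightarrow 1148 + e = 566 \Rightarrow e = 566 - 1148 = -582$

So $Q(x) = -2x^2 + \frac{287}{4}x - 582$.

Now, $P(0) = c = 698$.

$Q(0) = e = -582$.

Therefore, $P(0) + Q(0) = 698 + (-582) = 116$.

Final Answer

$\boxed{116}$.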
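The answer reached in the transcripts can be checked independently with exact rational arithmetic. The sketch below solves for the linear and constant coefficients of a quadratic with a known leading coefficient passing through two points; the helper name `quadratic_through` is made up for this check and is not part of any model output.

```python
from fractions import Fraction
from typing import Tuple

Point = Tuple[int, int]


def quadratic_through(lead: int, p1: Point, p2: Point) -> Tuple[Fraction, Fraction]:
    """Return (b, c) such that lead*x^2 + b*x + c passes through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    # Move the known quadratic term to the right-hand side: b*x + c = y - lead*x^2.
    r1 = Fraction(y1) - lead * x1 * x1
    r2 = Fraction(y2) - lead * x2 * x2
    # Subtracting the two linear equations eliminates c.
    b = (r2 - r1) / (x2 - x1)
    c = r1 - b * x1
    return b, c


points = ((16, 54), (20, 53))
b, c = quadratic_through(2, *points)    # P(x) = 2x^2 + bx + c
d, e = quadratic_through(-2, *points)   # Q(x) = -2x^2 + dx + e
total = c + e                           # P(0) + Q(0)
print(b, c, d, e, total)
```

Because the leading coefficients are opposite, the quadratic terms cancel in $P(x) + Q(x)$, so the sum is a line through $(16, 108)$ and $(20, 106)$; evaluating that line at $x = 0$ gives the same value as `total`.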
Acknowledgements

Thank you to our team working tirelessly on the research, engineering, and design of this run: Sami, Jackmin, Johannes, Mike, Jannik, Burnpiro, Manveer, Kemal, and Leonardo. Special thanks to Elie Bakouch for his work on the dataset, Max Ryabinin for his invaluable research advice, and the Arcee.ai team (Lucas, Maziyar, Charlie, and others) for their exceptional contributions to post-training INTELLECT-1. We also extend our gratitude to Arthur Douillard and collaborators for their work on DiLoCo and to everyone who shared ideas and discussions that made this project a success.