SupraLabs_

Small weights.
Huge potential.

We ran a controlled data-mixture sweep to find the best training data mix for Small Language Models (SLMs). We trained a 5,114,304-parameter Llama-architecture model from scratch on each mixture, scaling every experiment to exactly 100,000,000 tokens using custom tokenizers.

// Project_Overview_&_Learnings

The Core Discoveries

Our empirical study tested two competing schools of thought in modern SLM training on consumer-grade hardware:

The "Llama/Web-First" Approach: Relying mostly on organic web crawls (FineWeb-Edu) to maximize linguistic fluency and a natural, human-like text distribution.
The "Phi/SmolLM/Synthetic-First" Approach: Treating high-density synthetic data (Cosmopedia-v2) as the primary textbook material to accelerate learning of logic and structure.

Crucial Insight on Phase Transitions: Small models have almost no tolerance for abrupt shifts between noisy web text and rigidly formatted synthetic data. When mixing datasets, leaving out a grammatical "stabilizer" or "glue" (DCLM-Edu) degraded the model: logic scores such as BoolQ dropped and the outputs showed token fragmentation (glitch text).

The Final Triumph of Test #4: By combining a heavy synthetic majority (60% Cosmopedia) with a natural backbone (30% FineWeb) and a grammatical stabilizer (10% DCLM), we observed an early code-generation ability. The model spontaneously began outputting syntactically structured Python/NumPy code when explaining physics topics, and it topped the evaluation leaderboard with a relative Overall Score of 48 points.
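
As a rough illustration of what these mixtures mean in tokens, the sketch below converts each composition from the Composition row of the evaluation table into per-source token budgets. It assumes the percentages are shares of the 100,000,000-token budget (the page does not say whether the split is by tokens or by documents), and the names `MIXES` and `token_budgets` are hypothetical.

```python
# Token budgets per source for each mixture in the sweep.
# Assumption: percentages are shares of the 100,000,000-token budget.
TOTAL_TOKENS = 100_000_000

# Compositions as listed in the evaluation table, as integer percentages
# (F = FineWeb-Edu, C = Cosmopedia-v2, D = DCLM-Edu).
MIXES = {
    "test_1": {"dclm-edu": 60, "cosmopedia-v2": 30, "fineweb-edu": 10},
    "test_2": {"fineweb-edu": 60, "cosmopedia-v2": 30, "dclm-edu": 10},
    "test_3": {"fineweb-edu": 60, "cosmopedia-v2": 40},
    "test_4": {"cosmopedia-v2": 60, "fineweb-edu": 30, "dclm-edu": 10},
}


def token_budgets(mix, total=TOTAL_TOKENS):
    """Return the number of tokens to draw from each source."""
    if sum(mix.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    # Integer arithmetic keeps the budgets exact.
    return {source: total * pct // 100 for source, pct in mix.items()}


for name, mix in MIXES.items():
    print(name, token_budgets(mix))
```

Under that assumption, Test #4 draws 60,000,000 tokens from Cosmopedia-v2, 30,000,000 from FineWeb-Edu, and 10,000,000 from DCLM-Edu.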

// Empirical_Data_Matrix

Evaluation Standings

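Scores use a relative ranking scheme applied across 10 metrics: in a comparison of N models, the best model on a metric earns N points and the worst earns 1 point. The normalized variant of a metric is used where applicable. In the Composition row, F = FineWeb-Edu, C = Cosmopedia-v2, and D = DCLM-Edu.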

Metric / Benchmark	FineWeb Baseline	Cosmopedia Baseline	DCLM Baseline	Test #1 (Mix)	Test #2 (Mix)	Test #3 (Mix)	Test #4 (Supra Mix)
Composition	100% FW	100% Cosmo	100% DCLM	60D / 30C / 10F	60F / 30C / 10D	60F / 40C	60C / 30F / 10D
Final Pretrain Loss	5.056	4.463	5.070	4.977	4.941	4.907	4.775
Wikitext Perplexity (↓)	783.42	1779.99	835.91	861.13	767.99	903.93	953.99
BLiMP (Grammar)	61.95%	57.19%	62.00%	60.34%	59.39%	59.96%	60.64%
BoolQ (Reading Logic)	37.83%	60.28%	41.50%	53.64%	38.29%	37.83%	38.62%
ARC-Easy	28.96%	28.54%	28.87%	28.75%	28.37%	29.21%	29.63%
ARC-Challenge	22.35%	24.91%	22.78%	22.95%	22.78%	23.55%	23.04%
OpenBookQA	26.20%	26.40%	26.00%	25.60%	25.20%	27.00%	26.80%
PIQA	52.34%	53.16%	50.33%	52.61%	51.69%	51.47%	52.61%
SciQ	27.80%	24.30%	25.70%	25.30%	27.30%	26.80%	25.40%
Winogrande	49.72%	50.04%	49.09%	49.33%	50.20%	49.88%	50.83%
OVERALL SCORE	36	41	29	37	37	35	48 (🏆 WIN)
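
The relative grading scheme behind the OVERALL SCORE row can be sketched as below, using the Winogrande and Wikitext Perplexity rows as input. This is a minimal sketch, not the scoring script used for the table. The function name `rank_points` is hypothetical, and the tie rule (tied models share the average of the points they span) is an assumption. Because the tie rule and the exact metric variant behind each row are not stated, summing this sketch over all rows need not reproduce the published totals.

```python
def rank_points(values, higher_is_better=True):
    """Give N points to the best of N models, down to 1 for the worst.

    Assumption: tied models share the average of the points they span.
    """
    n = len(values)
    # Sort model names from worst to best.
    order = sorted(values, key=values.get, reverse=not higher_is_better)
    points = {}
    i = 0
    while i < n:
        j = i
        # Extend j over any models tied with the one at position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # average of points i+1 .. j+1
        for model in order[i:j + 1]:
            points[model] = shared
        i = j + 1
    return points


# Two rows copied from the table above.
winogrande = {
    "FineWeb": 49.72, "Cosmopedia": 50.04, "DCLM": 49.09,
    "Test #1": 49.33, "Test #2": 50.20, "Test #3": 49.88, "Test #4": 50.83,
}
wikitext_ppl = {
    "FineWeb": 783.42, "Cosmopedia": 1779.99, "DCLM": 835.91,
    "Test #1": 861.13, "Test #2": 767.99, "Test #3": 903.93, "Test #4": 953.99,
}

print(rank_points(winogrande))
print(rank_points(wikitext_ppl, higher_is_better=False))
```

On these two rows, Test #4 takes the top Winogrande points and Test #2 takes the top Wikitext points, since perplexity is ranked with lower values as better.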

// Visualizing_Performance

Overall Relative Sweep Scores

Linguistic Fluidity (Wikitext PPL) vs. Cognitive Logic (BoolQ)

Note: Lower Wikitext perplexity is better (more fluent text), whereas higher BoolQ accuracy reflects stronger logical reasoning.
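
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood a model assigns to the evaluation text, which is why lower is better. The sketch below uses made-up toy values, and the function name `perplexity` is hypothetical. Whether the reported Wikitext numbers are normalized per token or per word is not stated on this page.

```python
import math


def perplexity(neg_log_likelihoods):
    """Exponential of the mean negative log-likelihood (natural log)."""
    if not neg_log_likelihoods:
        raise ValueError("need at least one value")
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))


# Toy example: a model that assigns probability 1/4 to every token
# has a perplexity of about 4, as if choosing among 4 equally likely options.
uniform = [math.log(4)] * 10
print(perplexity(uniform))
```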

// Data_Repository_&_Artifact_Exports

Upstream Datasets utilized

HuggingFaceTB/smollm-corpus

The base repository hosting the 'cosmopedia-v2' subset. Highly distilled synthetic data generated by LLMs.

HuggingFaceFW/fineweb-edu

Web-crawl text filtered for educational quality. We drew from the 'sample-10BT' subset.

HuggingFaceTB/dclm-edu

Model-filtered web-crawl text with clean, well-structured prose. Serves as our stabilizer between the synthetic and natural sources.

SupraLabs Model Repo

Our official profile holding finalized model checkpoints, evaluation suites, and training scripts.
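
The three upstream datasets listed above could be streamed and mixed in the Test #4 proportions roughly as follows. This is a sketch under stated assumptions, not our training script. It uses `load_dataset` and `interleave_datasets` from the Hugging Face `datasets` library, and it assumes each source exposes a `train` split with a `text` column. The names `MIX` and `build_stream` are hypothetical. Note that `interleave_datasets` samples per document, so the resulting token shares only match 60/30/10 if documents are of similar length across sources.

```python
# Sketch only: needs the `datasets` package and network access to run.
# Assumption: sampling probabilities mirror the Test #4 composition.
MIX = [
    # (repository, config name or None, sampling probability)
    ("HuggingFaceTB/smollm-corpus", "cosmopedia-v2", 0.6),
    ("HuggingFaceFW/fineweb-edu", "sample-10BT", 0.3),
    ("HuggingFaceTB/dclm-edu", None, 0.1),
]


def build_stream(mix=MIX, seed=0):
    """Return one streaming dataset that interleaves the three sources."""
    # Imported here so the module loads even without `datasets` installed.
    from datasets import interleave_datasets, load_dataset

    streams = []
    for repo, config, _ in mix:
        ds = load_dataset(repo, config, split="train", streaming=True)
        # Keep only the text column so the schemas line up (assumed present).
        streams.append(ds.select_columns(["text"]))
    return interleave_datasets(
        streams,
        probabilities=[p for _, _, p in mix],
        seed=seed,
        stopping_strategy="first_exhausted",
    )
```

Iterating over `build_stream()` yields rows with a `text` field. A training pipeline would tokenize them and stop once the token budget is reached.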

Kaggle Working Directory Full Exports (.ZIP)

FineWeb-Edu Baseline Export

Cosmopedia-v2 Baseline Export

DCLM-Edu Baseline Export

Test Run Mix #1 Export

Test Run Mix #2 Export

Test Run Mix #3 Export

Test Run Mix #4 (Winner Custom Build)
