pretraining

7 stories

Artificial Intelligence #scaling laws#pretraining

New Research Shows Pretraining Data Composition Can Engineer Neural Scaling Laws for Particle Physics

A new arXiv paper demonstrates that neural scaling laws in particle physics can be engineered by adjusting pretraining data composition. The study shows that including more diverse and task-aligned synthetic data can shift scaling behavior to require more data rather than larger models, offering insights for efficient AI training.

Jul 8, 2026 1 source

IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

Technology

Artificial Intelligence #natural language processing#persian nlp

Researchers present IHUBERT, a monolingual Persian language model pretrained on a 45GB curated subset of the Sepahr-Danesh collection using a multi-stage pipeline that includes vector-database-based semantic deduplication and domain-balanced pretraining. IHUBERT achieves top scores on the extractive QA benchmarks PQuAD and ParsiNLU-RC and the best results on FarsTail NLI, while remaining competitive on NER and topic classification.

Jun 20, 2026 1 source

Researchers Identify Shrinkage Bias in LLM FP4 Pretraining, Propose UFP4 Recipe for Stability

Technology

Artificial Intelligence #llm#pretraining

A new arXiv study identifies 'Shrinkage Bias', a systematic error that accumulates across layers, in E2M1-based FP4 pretraining for large language models. The proposed UFP4 recipe, which uses uniform grids such as E1M2/INT4, shows lower BF16-relative loss degradation on models up to 124B parameters, and the authors urge hardware support for uniform 4-bit formats.
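
To make the format distinction concrete, the sketch below lists the non-negative magnitudes representable in E2M1 and compares them with a uniform 4-bit integer grid. The E2M1 value set follows the standard FP4 definition (2 exponent bits, 1 mantissa bit, bias 1). The helper names and the symmetric INT4 range are illustrative assumptions, not details taken from the study.

```python
def e2m1_magnitudes():
    """Non-negative values of FP4 E2M1 (2 exponent bits, 1 mantissa bit, bias 1)."""
    values = []
    for exp in range(4):          # 2 exponent bits
        for man in range(2):      # 1 mantissa bit
            if exp == 0:
                # Subnormal: no implicit leading 1, exponent fixed at 1 - bias = 0
                values.append(man * 0.5)
            else:
                values.append((1 + man * 0.5) * 2.0 ** (exp - 1))
    return values


def int4_magnitudes():
    """Non-negative values of a symmetric INT4 grid (assumed range -7..7)."""
    return [float(v) for v in range(8)]


def gaps(values):
    """Spacing between neighbouring grid points."""
    return [b - a for a, b in zip(values, values[1:])]


if __name__ == "__main__":
    print("E2M1:", e2m1_magnitudes(), "gaps:", gaps(e2m1_magnitudes()))
    print("INT4:", int4_magnitudes(), "gaps:", gaps(int4_magnitudes()))
```

The E2M1 spacing widens as magnitudes grow, while the integer grid keeps a constant step. Real 4-bit training recipes typically pair either grid with a scale factor, which this sketch omits.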

Jun 20, 2026 1 source

Spokes Optimizes Diverse Pretraining Data Selection for LLMs, Boosting Performance

Technology

Artificial Intelligence #spokes#diverse pretraining

Researchers introduce Spokes, a method that directly optimizes diversity in pretraining data selection for large language models. Using a probabilistic framework based on the G-Vendi score and exponentiated gradient descent, Spokes achieves significantly more diverse subsets and improves downstream performance by up to 1.5 points over random sampling.

Jun 16, 2026 1 source

EyeMVP AI Model Enhances Retinal Screening by Learning OCT Insights from Fundus Photos

Technology

Artificial Intelligence #artificial intelligence#computer vision

Researchers developed EyeMVP, a cross-modal retinal foundation model that enriches color fundus photography (CFP) with depth-resolved information from optical coherence tomography (OCT). Pretrained on 674,893 paired images from 112,642 patients across eight Chinese hospitals, EyeMVP outperforms leading models on 16 downstream tasks including macular edema detection (AUROC 0.948 vs 0.852) and myopic macular schisis (0.825).

Jun 16, 2026 1 source

Data Augmentations Offer Path to Efficient Language Model Pretraining Under Data Constraints

Technology

Artificial Intelligence #data augmentation#language model

As AI labs face a data ceiling where compute capacity outpaces new high-quality text, researchers propose data augmentations to enable productive multi-epoch training on fixed corpora. Three categories—token-level noise, sequence permutations, and target offset prediction—are shown to delay overfitting and lower validation loss compared to standard autoregressive pretraining. Random token replacement achieved the best minimum loss among individual methods, with combined augmentations further improving results.
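
As a rough illustration of the token-level noise category, the sketch below replaces a random fraction of input tokens with tokens drawn uniformly from the vocabulary. The function name, the replacement rate, and the uniform sampling are assumptions for illustration, not details from the paper.

```python
import random


def random_token_replacement(tokens, vocab_size, rate=0.1, seed=None):
    """Return a copy of `tokens` with roughly `rate` of positions resampled.

    Hypothetical helper: each position is independently replaced, with
    probability `rate`, by a token id drawn uniformly from [0, vocab_size).
    """
    rng = random.Random(seed)
    return [
        rng.randrange(vocab_size) if rng.random() < rate else tok
        for tok in tokens
    ]


if __name__ == "__main__":
    clean = list(range(20))
    noisy = random_token_replacement(clean, vocab_size=50_000, rate=0.2, seed=0)
    changed = sum(a != b for a, b in zip(clean, noisy))
    print(f"{changed} of {len(clean)} tokens replaced")
```

Drawing a fresh noise pattern on each epoch would give the model a different corrupted view of the same sequence on every pass, which is one plausible way such an augmentation could support multi-epoch training.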

Jun 16, 2026 1 source

X-Tokenizer: Semantic Action Tokenizer Boosts Robot Control by 13.5% Over FAST

Technology

Artificial Intelligence #x-tokenizer#multimodal

Researchers propose X-Tokenizer, a new action tokenizer that treats tokenization as semantic interface learning rather than mere compression. Using a lightweight architecture that places Semantic Residual Quantization (SRQ) between an encoder and a decoder, it improves multimodal grounding by 13.5% and long-horizon task performance by 8.25 points over existing methods like FAST.

Jun 16, 2026 1 source