Speaker
Jinjie Ni is an AI researcher and individual contributor at the National University of Singapore, working with Prof. Michael Shieh. His work centers on next-generation modeling paradigms and the design of scalable foundation model systems. He currently focuses on several core areas: (1) LLM pretraining, scaling, and architectural innovation; (2) reinforcement learning methods that enhance reasoning capabilities in large language models; and (3) algorithm–system co-design aimed at pushing model efficiency and performance. Jinjie approaches research with both scientific rigor and practical engineering insight, contributing to the broader effort of advancing modern AI systems.
Abstract
Several core scaling paths toward AGI, such as parameter count, dataset size, and test-time compute, have been extensively studied over the years. We introduce order scaling, another scalable axis that further advances model intelligence, especially when data, rather than compute, is the bottleneck in the long run. We view autoregression as an inductive bias for efficiency suited to the era around 2020. Instead of forcing the model to learn autoregressively from left to right, we learn more orders from a fixed amount of data, with larger models, more training- and test-time compute, and simpler architectures. We show evidence at scale (up to 8B parameters, 1.5T tokens, 480 epochs) that performance grows rapidly as we scale the number of modeling orders, and that it consistently surpasses autoregressive language models under strictly controlled settings. We establish large-scale compute- and data-constrained scaling laws to model the trend, and train MoE language models from scratch to capture the larger modeling space. We also note that order scaling is tightly coupled with test-time compute scaling, and further discuss how it boosts reasoning capability, as well as the challenges in developing RL algorithms for such models.