Towards Packing the Intelligence of Large Foundation Models on Low-Resource Devices

Speaker

Dr. Souvik Kundu (Member, IEEE and ACM) is currently a Staff Research Scientist at Intel Labs, USA, leading research efforts in scalable and novel AI primitives for foundation models. Souvik received the prestigious AI Rising Star recognition in 2025 from CPAL and Stanford Data Science. He was among the youngest recipients of the Semiconductor Research Corporation Outstanding Industry Liaison Award in 2023. Souvik has played an active role in key efficiency innovations including LLM KV cache quantization, N:M sparsity, and efficient long-context understanding and generalization for LLMs/VLMs. He is a founding program chair (PC) of the ICLR Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models and serves as an Area Chair (AC) for flagship venues including NeurIPS, ACL, and DAC.

Abstract

With the emergence of large foundation models (LFMs), artificial intelligence (AI) has found use cases in automation across multiple modalities. With this surge of AI assistance, demand is growing to deploy these models at the edge, including on AI personal computers (AIPCs) and mobile devices. However, such deployments at scale face a fundamental challenge: fitting large models within small compute and memory budgets. Moreover, AI-assisted tasks like long-context reasoning incur the additional memory overhead of storing long prefixes. In the quest to bring the potential of AI intelligence to the edge, we at Intel Labs are exploring various avenues to tackle the memory and compute challenges of LFMs. Specifically, in this talk I will highlight some of our key research outcomes over the past year in the space of post-training optimization of LFMs. These optimizations not only reduce compute and memory costs, improving the tokens-per-watt budget, but also open new opportunities for context extension at inference time.

Video

Coming soon. Stay tuned. :-)