Location: Mines Paris - PSL University, 60 Boulevard Saint-Michel, Paris
Main Room: L108-B
DP2E-AI 2025 PROGRAM
Program Chair: Serge G. Petiton
Monday, June 16th
8:30-9:00 | Registration, welcome coffee
9:00-9:30 | Opening session
-
Welcome address from Ecole des Mines de Paris
Agnes Laboudigue, Deputy Director of Research at Mines Paris – PSL
Corinne Ancourt, Mines Paris – PSL, France
-
Introduction to the workshop
Serge G. Petiton, University of Lille and CNRS, France
[Bio]Serge G. Petiton received the B.S. degree in mathematics, the Ph.D. degree in computer science, and the “Habilitation à diriger des recherches” from Sorbonne University, Pierre et Marie Curie Campus. He was a post-doctoral researcher, registered at the graduate school, and a junior research scientist at Yale University in 1989-1990. He was a researcher at the “Site Experimental en Hyperparallelisme” (supported by CNRS and CEA) from 1991 to 1994, and during that period he was also an affiliate research scientist at Yale and a visiting research fellow in several US laboratories (NASA/ICASE, AHPCRC,..). Since 1994, Serge G. Petiton has been a tenured full Professor at the University of Lille in France, and he has held CNRS and/or INRIA associated senior positions in several laboratories (LIFL and CRISTAL in Lille, and ASCI, LRI and the “Maison de la Simulation” in Paris-Saclay). He was an awarded visiting Professor at the Chinese Academy of Sciences for a few weeks in 2016. He has been P.I. of several international projects with Japan and Germany (ANR, CNRS, SPPEXA,..) and has had many industrial collaborations (TOTAL, CEA, Airbus, Nvidia, Intel, Huawei…). Serge G. Petiton has been scientific director of more than 30 Ph.D.s and has authored more than 150 articles in international journals, books, and conferences. His main current research interests are in “Parallel and Distributed Computing”, “Sparse Linear Algebra”, “Language and Programming Paradigms”, and “AI methods”.[Slides]
9:30-10:50 | Keynotes
Chair: Serge G. Petiton
-
An Overview of High Performance Computing and Responsibly Reckless Algorithms. [9:30-10:10]
Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory, USA, University of Manchester, UK
[Bio]Jack Dongarra specializes in numerical algorithms in linear algebra, parallel computing, the use of advanced computer architectures, programming methodology, and tools for parallel computers. He holds appointments at the University of Manchester, Oak Ridge National Laboratory, and the University of Tennessee. In 2019 he received the ACM/SIAM Computational Science and Engineering Prize. In 2020 he received the IEEE-CS Computer Pioneer Award. In 2021 he received the ACM A.M. Turing Award for his pioneering contributions to numerical algorithms and software that have driven decades of extraordinary progress in computing performance and applications. He is a Fellow of the AAAS, ACM, IEEE, and SIAM; a foreign member of the British Royal Society and a member of the U.S. National Academy of Sciences and the U.S. National Academy of Engineering.[Abstract]In this talk, we examine how high-performance computing has changed over the last 10 years and look at trends in the future. These changes have impacted and will continue to impact our software significantly. Some of the software and algorithm challenges have already been encountered, such as management of communication and memory hierarchies through a combination of compile-time and run-time techniques, but the increased scale of computation, depth of memory hierarchies, range of latencies, and increased run-time environment variability will make these problems much harder.[Slides]
Mixed precision numerical methods are paramount for increasing the throughput of traditional and artificial intelligence (AI) workloads beyond riding the wave of the hardware alone. Reducing precision comes at the price of trading away some accuracy for performance (reckless behavior) but in noncritical segments of the workflow (responsible behavior) so that the accuracy requirements of the application can still be satisfied. -
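As an illustration of the mixed-precision idea sketched above, here is a minimal Python/NumPy sketch (my own, not from the talk) of iterative refinement: the factorization is done "recklessly" in single precision, while the double-precision residual correction keeps the final accuracy on target. The function name and test matrix are illustrative assumptions.

```python
# Minimal sketch (assumption: a well-conditioned dense system) of mixed-precision
# iterative refinement: factorize "recklessly" in float32, refine "responsibly"
# in float64 until the residual meets the accuracy target.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, max_iter=10, tol=1e-12):
    lu, piv = lu_factor(A.astype(np.float32))            # low-precision factorization
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                     # residual in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        d = lu_solve((lu, piv), r.astype(np.float32))     # correction from float32 factors
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500)) + 500 * np.eye(500)  # diagonally dominant test matrix
b = rng.standard_normal(500)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b))                          # small residual despite float32 factors
```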
TBA [10:10-10:50]
Satoshi Matsuoka, RIKEN, Japan
[Bio][Abstract][Slides]
10:50-12:50 | Session: Distributed and parallel computing, and AI
Chair: Kengo Nakajima
-
Deep Learning is More Than Just Dense Matrix Multiplication. [10:50-11:20]
Rio Yokota, RIKEN, Japan.
[Bio]Rio Yokota is a Professor at the Supercomputing Research Center, Institute of Integrated Research, Institute of Science Tokyo. His research interests lie at the intersection of high performance computing, linear algebra, and machine learning. He has been optimizing algorithms on GPUs since 2007, and was part of a team that received the Gordon Bell prize in 2009 using the first GPU supercomputer. He is the developer of scalable libraries for fast multipole methods (ExaFMM), hierarchical low-rank matrices (Hatrix), and second-order optimizers for deep learning (ASDL). He is involved in many efforts to train large language models in Japan. Rio is a member of ACM, IEEE, and SIAM.[Abstract]In most cases, the majority of computation in deep neural networks is spent on dense matrix multiplication, but there are some exceptions. Second-order optimization, quantization, continual learning, and model merging often rely on linear solvers and eigenvalue solvers to make use of Hessian, Gauss-Newton, or Fisher matrices. In such cases, sophisticated linear algebra techniques can greatly improve performance. When such problems are scaled up, distributed versions of these sophisticated algorithms that can efficiently run on thousands of GPUs become necessary.[Slides] -
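To make the point above concrete, a small illustrative sketch (not the speaker's code) of a damped Gauss-Newton/natural-gradient update: the dominant cost is a linear solve with a curvature (Fisher-like) matrix rather than a plain dense matmul. All names and sizes here are assumptions for illustration.

```python
# Small illustrative sketch (not the speaker's code): a damped natural-gradient /
# Gauss-Newton step. The curvature matrix F = J^T J / n approximates a Fisher or
# Gauss-Newton matrix, and the work is a linear solve rather than a dense matmul.
import numpy as np

def natural_gradient_step(J, grad, damping=1e-3):
    n, p = J.shape
    F = J.T @ J / n                                   # Fisher / Gauss-Newton approximation
    return np.linalg.solve(F + damping * np.eye(p), grad)

rng = np.random.default_rng(1)
J = rng.standard_normal((256, 10))                    # per-sample Jacobian (hypothetical sizes)
grad = rng.standard_normal(10)
print(natural_gradient_step(J, grad))
```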
Enabling Sparsity in AI Workloads. [11:20-11:50]
Maryam Dehnavi, University of Toronto and Nvidia
[Bio]Maryam Mehri Dehnavi is an Associate Professor in the Department of Computer Science at the University of Toronto and a Lead Research Scientist at NVIDIA Research. Between 2021 and 2023, she was a researcher at Microsoft Research and also served as the Associate Chair of Research at the University of Toronto. She holds the prestigious Canada Research Chair in Parallel and Distributed Computing and is the recipient of the Ontario Early Researcher Award. Additionally, she has served as the General Chair of PPoPP and is an Associate Editor for the Journal of Parallel and Distributed Computing. Maryam’s research focuses on algorithmic and systems/compiler techniques for model compression, sparsity, and computation restructuring, with applications in machine learning and graphics. Her work is published in top venues such as PLDI, ICML, ICLR, NeurIPS, SC, SIGGRAPH, and IPDPS.[Abstract]In this talk, I’ll present our recent work on enabling efficient training and deployment of large neural networks through structured sparsity, quantization, and low-rank adaptation. These techniques reduce computation and memory overhead with minimal accuracy degradation, and can be applied without retraining. I’ll also discuss the compiler and kernel-level mechanisms we developed to support these algorithms, including support for irregular sparsity patterns and optimized data reuse. Together, these contributions offer a path toward scalable, hardware-efficient execution of compressed models.[Slides] -
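A minimal sketch of one kind of structured sparsity mentioned above, 2:4 magnitude pruning (keep the two largest-magnitude weights in each group of four); this is a generic illustration, not the speaker's implementation, and prune_2_of_4 is a hypothetical helper name.

```python
# Illustrative sketch (not the speaker's implementation): 2:4 structured sparsity,
# keeping the two largest-magnitude weights in every contiguous group of four.
import numpy as np

def prune_2_of_4(W):
    rows, cols = W.shape
    assert cols % 4 == 0, "illustration assumes the width is a multiple of 4"
    groups = W.reshape(rows, cols // 4, 4)
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]   # two smallest per group of four
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

W = np.random.default_rng(2).standard_normal((4, 8))
W_sparse = prune_2_of_4(W)
print((W_sparse == 0).mean())                             # exactly 50% of weights zeroed
```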
Beyond double precision: AI-driven innovations in HPC offer a quantum leap to scientific computing. [11:50-12:20]
Harun Bayraktar, Nvidia, USA
[Bio]Since joining NVIDIA in 2017, Harun has been leading the Math Libraries organization, which builds software to help accelerate applications in science, engineering, quantum computing, and artificial intelligence (AI). Prior to joining NVIDIA, his career included high-performance computational mechanics software development for an ISV and physics-based simulation research and technology development for advanced composite materials in aerospace. Harun holds a PhD in mechanical engineering from UC Berkeley.[Abstract]Artificial intelligence (AI) and most recently large language models (LLMs) are driving significant innovations in mixed-precision computing, processor, and system architecture, creating both opportunities and challenges for scientific computing. On one end, the scientific computing community continues to measure the impact of a system using an almost half-century-old benchmark that runs in double precision (64-bit floating point); at the other end, LLMs utilize 4-bit floats with block scaling in mixed-precision algorithms on the latest GPUs to achieve unprecedented performance and capabilities. Indeed, major scientific breakthroughs have been enabled by AI, most notably by AlphaFold predicting the structure of 200 million proteins. The advances in hardware capabilities, especially around reduced- and mixed-precision computing, have also led to innovations in algorithms that can leverage these to significantly accelerate scientific applications. In this talk, we will challenge the notion that newer systems are not suitable for scientific computing and present opportunities to harness their power. Specifically, we will go into detail on both the hardware and algorithmic advancements around reduced- and mixed-precision types and demonstrate how higher computational throughput can be achieved with great power efficiency.[Slides] -
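The abstract's reference to 4-bit floats with block scaling can be illustrated with a toy Python sketch (assumptions only, not NVIDIA's libraries): each small block of values gets its own scale factor, so a very narrow grid, standing in here for a 4-bit float format, still covers a useful dynamic range.

```python
# Toy sketch of block scaling (illustrative assumptions, not an NVIDIA library):
# each block of 32 values gets its own scale, and the values are stored on a very
# narrow symmetric grid (int4-like levels stand in for 4-bit floats).
import numpy as np

def block_quantize(x, block=32, levels=7):
    pad = (-len(x)) % block
    blocks = np.pad(x, (0, pad)).reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / levels
    scale[scale == 0] = 1.0                               # avoid division by zero
    q = np.clip(np.round(blocks / scale), -levels, levels)
    return q.astype(np.int8), scale

def block_dequantize(q, scale, n):
    return (q * scale).ravel()[:n]

x = np.random.default_rng(3).standard_normal(1000).astype(np.float32)
q, s = block_quantize(x)
print(np.abs(x - block_dequantize(q, s, len(x))).max())   # per-block scales bound the error
```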
Challenges of training and deploying foundation models at scale. [12:20-12:50]
Maxime Hugues, AWS, USA
[Bio]Dr. Maxime Hugues is a Principal Applied Scientist in GenAI at AWS, which he joined in 2020. He focuses on training performance, system reliability, and large-scale simulation. Prior to joining AWS, he worked as an HPC Research Scientist at TotalEnergies and as an HPC Consultant at Google. He holds an M.E. from the French National Engineer School “ISEN-Toulon”, an M.S. degree from the University of Science, and a Ph.D. degree in Computer Science (2011) from the University of Lille 1.[Abstract]Generative AI has begun to transform many industries, such as healthcare, finance, legal, technology, automotive, and others. Enterprises and startups are embracing AI to accelerate their business through innovation or optimization. Many choose AWS to quickly access GPUs to train and deploy their models; it also lets them focus on their business instead of infrastructure management. The primary focus is on delivering AI value, while training speed is secondary or absent. In this talk, we will discuss why training performance is secondary, why it is difficult, and the challenges to solve to simplify the work of the machine learning engineer. Then, we will present the challenge of serving AI at world scale and why many customers choose AWS to do so.[Slides]
12:50-13:50 | Lunch (Mines de Paris)
13:50-16:30 | Keynotes
Chair: Georges-Andre Silber
-
Mind the Middleware: From Data to Discovery in Brain-Scale Agentic AI. [13:50-14:30]
Ian Foster, University of Chicago and ANL, USA
[Bio][Abstract]Brain-scale AI is transforming how we pose questions and harvest answers, yet the glitter of trillion-parameter models conceals a widening middleware gap that now limits scientific discovery itself. This talk reframes “extreme-scale AI” as a middleware challenge and outlines three grand opportunities where the HPC, linear-algebra, and systems communities can make outsized contributions:[Slides]
1. Living data pipelines. Tomorrow’s training corpora are petabytes in motion—ingested hourly, carrying complex provenance, and spanning burst buffers to archival tape. I will describe policy-aware data fabrics that let datasets evolve safely and reproducibly while remaining shardable across heterogeneous storage hierarchies.
2. Agent evolution ecosystems. Future applications will stitch together thousands of quasi-autonomous agents that learn, negotiate, and self-reconfigure. I will present middleware patterns that blend classic actor/workflow abstractions with online learning, enabling safe “hot-swap” of agent policies and collective memory at runtime.
3. Scientific performance telemetry. FLOPS are no proxy for insight. We need observability stacks that couple hardware counters with scientific metrics—prediction skill, uncertainty, novelty detection, even hypothesis provenance—and return actionable feedback in near real time. I will sketch an architecture that streams these signals back into data curation and agent selection loops, closing the discovery cycle.
By mapping these needs onto enduring ideas from grid middleware, distributed linear algebra, and emerging AI hardware, I will argue that interoperable middleware is the bridge that can keep HPC and AI innovation on a unified, discovery-driven roadmap—and invite the workshop community to help build it. -
Dimensionality Reduction in High-Performance Language Models. [14:30-15:10]
Nahid Emad, University of Paris-Saclay, France
[Bio]Nahid Emad received the Habilitation to Direct Research in computer science from the University of Paris-Saclay/Versailles, the PhD and MS in applied mathematics from Pierre and Marie Curie University (Sorbonne University), and a BS in pure mathematics from the University of Arak (Iran). She is a Professor at the University of Paris-Saclay/Versailles and affiliated with the Maison de la Simulation and LI-PARAD laboratories, where she heads the Intensive Numerical Computing group. She was appointed by the French Prime Minister to the rank of Knight in the Order of Academic Palms (promotion of July 14, 2022). She maintains long-standing international collaborations, notably with Japan and Germany through the bi- and trilateral ANR and SPPEXA projects, and with the United States, where she is an affiliate professor at the University of California, Berkeley. She has been scientific supervisor of 20 PhDs and HDRs and is the author of more than 150 articles in international journals, conferences, and book chapters. Her main research interests include numerical algorithms, linear algebra, parallel and distributed programming methodology, software engineering for parallel and distributed numerical computing, and high-performance data analysis.[Abstract]Dimensionality reduction plays a significant role in the performance and applicability of LLMs. By distilling large and complex raw data sets, it transforms them into refined, focused, and usable formats. Models can thus accelerate learning processes, extracting crucial insights, trends, and patterns from the distilled data sets. Appropriate data dimensionality reduction, combined with high-performance computing techniques, helps avoid significant additional costs and risks in all application areas.[Slides]
This talk offers an overview of dimensionality reduction techniques, with a strong focus on the role of advanced high-performance numerical algorithms in enhancing large model accuracy. We will introduce contemporary approaches such as Unite and Conquer, demonstrating how they can improve both solution convergence and computational efficiency. Particular attention is given to the challenges these methods pose when scaled to high-dimensional settings typical of large language models, including issues of numerical stability, scalability, and parallelization. We will also review recent advances in the field, highlighting their relevance to LLMs and outlining potential directions to push the current state of the art forward. -
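As a concrete, hedged example of the dimensionality-reduction building blocks discussed above (a generic randomized projection, not the Unite and Conquer method itself):

```python
# Generic sketch of randomized low-rank dimensionality reduction (an illustration,
# not the Unite and Conquer method): only tall-skinny products, a QR, and a small
# SVD are needed, which is why such kernels scale well on parallel machines.
import numpy as np

def randomized_embedding(A, k, oversample=10, seed=0):
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((A.shape[1], k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(A @ omega)                             # basis for the range of A
    _, _, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)     # small projected SVD
    return A @ Vt[:k].T, Vt[:k]                                # embedding and projection

X = np.random.default_rng(4).standard_normal((1000, 300))
Z, P = randomized_embedding(X, k=20)
print(Z.shape)                                                 # (1000, 20)
```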
Vision AI for Science and Engineering Applications. [15:10-15:50]
Mohamed Wahib, RIKEN, Japan.
[Bio]Mohamed Wahib is a team leader of the “High Performance Artificial Intelligence Systems Research Team” at the RIKEN Center for Computational Science (R-CCS), Kobe, Japan. Prior to that, he worked as a senior scientist at the AIST/TokyoTech Open Innovation Laboratory, Tokyo, Japan. He received his Ph.D. in Computer Science from Hokkaido University, Japan. His research interests revolve around the central topic of high-performance programming systems, in the context of HPC and AI. He is actively working on several projects including AI-based science, as well as high-level frameworks for programming traditional scientific applications.[Abstract]Large-scale vision foundation models demand substantial computing resources. In this talk, we highlight the challenges of scaling Vision Transformer (ViT) architectures to enable hardware efficiency, multi-modality support, and computational scalability. We present approaches to enhance both efficiency and scalability, key enablers for accelerating AI-driven scientific discovery on supercomputers at HPC facilities.[Slides] -
AI for Science: Exploring Software Sustainability through Couplers. [15:50-16:30]
Kengo Nakajima, University of Tokyo, Japan
[Bio]Kengo Nakajima has been a professor in the Supercomputing Research Division of the Information Technology Center at the University of Tokyo since 2008. Prior to joining the University of Tokyo in 2004, he spent 19 years in industry. He has also been a deputy director of the RIKEN Center for Computational Science (R-CCS) since 2018. His research interests cover computational mechanics, parallel numerical algorithms, and high performance computing (HPC). Kengo holds a B.Eng in aeronautics (University of Tokyo, 1985), an MS in aerospace engineering (University of Texas at Austin, 1993), and a PhD in engineering mechanics (University of Tokyo, 2003).[Abstract]A "coupler" is originally a tool for coupling multiple simulation models, such as atmosphere and ocean, or structure and fluid. In recent years, computer systems and workloads have become more diverse, and the role of couplers in supercomputing has become more important. In this talk, we focus on the "history" of couplers and consider what software sustainability means. We briefly describe three projects. In the first project (ppOpen-HPC: 2011-2018), we developed an MPI-based scalable coupler for multi-physics simulations. In the second project (h3-Open-BDEC: 2019-2024), we extended the idea of the multi-physics coupler to the integration of Simulation/Data/Learning (S+D+L) on the heterogeneous supercomputer system Wisteria/BDEC-01 at the University of Tokyo, which consists of computing nodes for computational science and engineering with A64FX (Odyssey) and nodes for Data Analytics/AI with NVIDIA A100 GPUs (Aquarius). The third project (JHPC-quantum: 2023-2028) started in November 2023, further expanding h3-Open-BDEC to realize Quantum-HPC hybrid computing. In this talk, we will introduce how couplers have evolved and what role they have been playing in supercomputing.[Slides]
16:30-17:00 | Coffee break
17:00-17:50 | Panel and general discussion on HPC, LA, and AI: programming and optimisation.
Moderator: Serge G. Petiton
Participants: Jack Dongarra, Nahid Emad, Ian Foster, Kengo Nakajima
17:50-18:00 | Closing remarks of the first day
19:00-21:00 | Workshop Dinner in Paris
Dinner at Bouillon Racine, 3 rue Racine, 75006 Paris.
Tuesday June 17th
8:30-9:00 | Registration, welcome coffee
9:00-11:00 | Keynotes
Chair: Nahid Emad
-
Exponential Technologies: The Role of High-Performance Computing in Shaping the Future of AI. [9:00-9:40]
Horst Simon, ADIA Lab, United Arab Emirates.
[Bio]Dr. Horst Simon is an internationally recognized expert in the development of parallel computational methods for large-scale scientific challenges. Since 2023, he has been the director of ADIA Lab, where he leads initiatives that leverage artificial intelligence and computational science to address complex global issues. His extensive research background includes pioneering work in sparse matrix algorithms, large-scale eigenvalue problems, and domain decomposition algorithms. Dr. Simon’s recursive spectral bisection algorithm is regarded as a landmark achievement in parallel algorithms. With 40 years of experience across high-performance computing, numerical algorithms, and AI, Dr. Simon has contributed significantly to advancements in industry (Boeing, SGI), research labs (NASA Ames, Berkeley Lab), and academia (Stony Brook University, UC Berkeley).[Abstract]Artificial Intelligence is not just the result of clever algorithms or vast datasets—it is fundamentally powered by the exponential progress in high-performance computing (HPC). This presentation explores the critical role of HPC in driving AI innovation, from training large language models to breakthroughs in generative and reinforcement learning. By tracing the trajectory of HPC—from its early role in scientific computing to its central position in AI—we highlight the convergence of semiconductors, data infrastructure, and algorithms that has created a powerful feedback loop of progress.[Slides]
Major global initiatives such as the U.S. Exascale Computing Project and international benchmarking efforts like the TOP500 list serve as milestones in this journey, underscoring the rapid advances in computing power over the past decades. Today’s AI workloads demand specialized architectures, immense parallelism, and unprecedented scale—demands met only through sustained investments in HPC infrastructure.
Yet AI is only part of the story. Bitcoin mining and other blockchain-based technologies are also reshaping the global computing landscape. While distinct in function, they share a common reliance on massive, energy-intensive compute resources. As the digital economy expands, the competition for computational power will intensify—whether for training trillion-parameter models or mining the next Bitcoin block.
This exponential growth raises urgent questions about energy consumption and sustainability. Estimates suggest that global computing demand will continue to double every few years, with AI and crypto accounting for an increasing share of electricity usage. The energy footprint of future HPC systems—especially when deployed at hyperscale—necessitates radical improvements in efficiency, cooling, and architectural design.
To meet these challenges, we must rethink how we build and deploy HPC systems. The next era of progress will hinge on advances not only in hardware but also in software, systems integration, and energy-aware computing. This includes novel accelerators, hybrid cloud/HPC environments, and a renewed focus on sustainability as a design principle.
In the Exponential Age, HPC is both an enabler and a beneficiary of transformative technologies. Its role extends beyond scientific discovery to encompass the infrastructure of AI, finance, and emerging digital economies. To shape this future, continued investment in HPC is not optional—it is essential. -
From Large Language Models to Reasoning Language Models. [9:40-10:20]
Torsten Hoefler, Federal Institute of Technology in Zurich (ETH Zurich), Switzerland
[Bio]Torsten Hoefler is a Professor of Computer Science at ETH Zurich, a member of Academia Europaea, and a Fellow of the ACM, IEEE, and ELLIS. He received the 2024 ACM Prize in Computing, one of the highest honors in the field. His research interests revolve around the central topic of "Performance-centric System Design" and include scalable networks, parallel programming techniques, and performance modeling. Torsten won best paper awards at the ACM/IEEE Supercomputing Conference SC10, SC13, SC14, SC19, SC22, SC23, SC24, HPDC'15, HPDC'16, IPDPS'15, and other conferences. He published hundreds of peer-reviewed scientific conference and journal articles and authored chapters of the MPI-2.2 and MPI-3.0 standards. He received the IEEE CS Sidney Fernbach Award, the ACM Gordon Bell Prize, the ISC Jack Dongarra award, the Latsis prize of ETH Zurich, as well as the German Max Planck-Humboldt Medal. Additional information about Torsten can be found on his homepage at htor.ethz.ch.[Abstract]In this talk, we explore the fascinating evolution of Large Language Models (LLMs) and their transformative journey through the lenses of computation and optimization. We begin by tracing the origins of LLMs, highlighting how advances in computation and optimization were pivotal in their development. We then delve into the key optimizations that have achieved a staggering 1,000x cost reduction, making LLMs widely accessible even on portable devices. Moving forward, we address the limitations of human-generated data and introduce the concept of constructive hallucination in LLMs. This technique allows for the generation of new hypotheses and their validation through reasoning chains, pushing the boundaries of knowledge creation. Next, we provide an overview of the technology fundamentals and early successes of reasoning models, such as OpenAI's o1 and o3 preview. These models, while significantly enhancing computational capabilities, also exponentially increase computational demands. Finally, we conclude by presenting our ambitious Ultra Ethernet effort, which aims to establish the interconnect standard for future AI workloads. This initiative is crucial in meeting the growing demands at the system level, ensuring seamless and efficient operation in the age of reasoning models.[Slides] -
Optimizing inference engines for large MoE language models: experience and lessons. [10:20-11:00]
Kun Tan, Huawei, China
[Bio]Dr. Kun Tan is Director and Chief Expert of the Distributed and Parallel Software Lab, 2012 Labs, Huawei. His team develops cutting-edge AI framework, cloud-native, big-data analytics, and cloud networking technologies for many Huawei products. He is a Huawei Scientist. Before joining Huawei, he was a Research Manager and Senior Researcher in the Wireless and Networking Group at Microsoft Research Asia. He won the USENIX Test-of-Time Award in 2019 and the USENIX NSDI Best Paper Award in 2009.[Abstract]In this talk, we will introduce our experience and lessons from building an LLM inference engine, named JiuSi, for Ascend NPU clusters. Specifically, we optimize our inference engine for the DeepSeek R1 model. DeepSeek R1 is a recent open-source model with 671B parameters and a massive mixture of experts. We exploit this massive expert parallelism to implement high-throughput and low-latency DeepSeek R1 inference based on JiuSi and MindSpore. We will discuss several key techniques to optimize the inference engine, including adaptive Prefill and Decode separation, distributed KV cache with shared memory, and dynamic load balancing among experts. We show that our implementation delivers state-of-the-art performance on Ascend hardware.[Slides]
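A toy sketch of the expert-parallel routing that the talk's load-balancing techniques target; this is a generic top-k router in NumPy, not the JiuSi or MindSpore implementation, and all names are illustrative.

```python
# Toy sketch (not the JiuSi engine): top-k expert routing for a mixture-of-experts
# layer, plus the per-expert token count that a dynamic load balancer would watch.
import numpy as np

def route_tokens(logits, top_k=2):
    # logits: (num_tokens, num_experts) router scores
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]                # experts picked per token
    scores = np.take_along_axis(logits, chosen, axis=-1)
    gates = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax over chosen experts
    load = np.bincount(chosen.ravel(), minlength=logits.shape[1])   # tokens per expert
    return chosen, gates, load

logits = np.random.default_rng(5).standard_normal((16, 8))          # 16 tokens, 8 experts
experts, gates, load = route_tokens(logits)
print(load)   # an uneven count is exactly what expert load balancing tries to flatten
```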
11:00-11:50 | Panel and general discussion on The Future of AI, from LLMs to Agents and Beyond
Moderator: Ian Foster
Participants: Torsten Hoefler, Horst Simon, Kun Tan and Wilfried Kirschenmann.
11:50-12:20 | Poster presentation
Chair: Soraya Zertal
-
[Poster List]
- Exploring Fine-Grained GPU Sharing for Low-Latency Inference Workloads, Zixi CHEN, Junqiao QIU [Poster]
- Simple Idea discovery in a minimalist LLM Architecture Implementation, Robert Chihaia, Maria Trocan, Florin Leon [Poster]
- Spectral Embedding to Compress Neural Architectures Without Performance Loss, Quentin Peti, Chong Li, Nahid Emad [Poster]
- Unified Symbolic Modeling and Adaptive Tensor Partitioning for Efficient Deep Learning Parallelization, Hongxing Wang, Zhengdao Yu, Chong Li, Serge Petiton [Poster]
- Symbolic Computation-Memory Optimization for Pipeline Efficiency in ultra-scale DNN training, Ruiwen Wang, Chong Li, Thibaut Tachon, Raja Appuswamy [Poster]
- Improving Giant Neural Network Performance with Symbolic Analytical Formulas, Shijie Shen, Walid Astaoui, Thibaut Tachon, Chong Li [Poster]
- Performance of Very Large Very Sparse Matrix-Matrix and Very Large Very Sparse Matrix-Vector Multiplication on Different Cluster Architectures, Maxence Buisson, Geraud Krawezik [Poster]
- Rockmate: an Efficient, Fast, Automatic and Generic Tool for Re-materialization in PyTorch, Xunyi Zhao, Théotime Le Hellard, Lionel Eyraud-Dubois, Julia Gusak, Olivier Beaumont
- Faho: multitasking on UPMEM's processing-in-memory technology, Maxime Collette, Weihao Ni, Alain Tchana, Renaud Lachaize
- ArmoniK: An Open-Source Solution for Computation Orchestration and Distribution, Jérôme Gurhem, Wilfried Kirschenmann
- Benchmarking of Deep Learning Convolutions on Energy-constrained CPUs, Enrique Galvez, Adrien Cassagne, Alix Munier Kordon, Manuel Bouyer
12:20-13:20 | Lunch (Mines de Paris)
13:20-14:00 | Poster session
Chair: Soraya Zertal
14:00-16:30 | Session: HPC, LA and AI
Chair: Corinne Ancourt (Mines Paris – PSL)
-
Pretraining Large Language Models: from distributed to decentralized settings. [14:00-14:30]
Ferdinand Mom, HuggingFace, France.
[Bio][Abstract]This talk will trace the evolution of LLM training from distributed to decentralized approaches. We'll begin by reviewing key parallelism strategies, such as Data Parallel, Tensor Parallel, and Pipeline Parallel techniques that enabled models like DeepSeek to scale to billions of parameters. Next, we'll demonstrate how these methods efficiently distribute computation across homogeneous clusters while managing memory constraints and communication overhead. We'll then share our analysis of the scaling limitations of centralized training paradigms and introduce emerging decentralized techniques that address these challenges. Finally, we'll highlight practical insights and discuss how these decentralized approaches, combined with distributed ones, democratize LLM development by enabling broader participation in the training process across diverse computational environments.[Slides] -
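As a minimal, hedged illustration of the data-parallel strategy mentioned above (a NumPy sketch in which the all-reduce is replaced by a plain mean over per-worker gradients; not Hugging Face code):

```python
# Minimal sketch (assumptions, not Hugging Face code): data parallelism as gradient
# averaging. Each "worker" holds a shard of the data; the mean over per-worker
# gradients plays the role of the all-reduce that keeps replicas in sync.
import numpy as np

def local_gradient(w, X, y):
    return X.T @ (X @ w - y) / len(y)          # gradient of 0.5*||Xw - y||^2 on one shard

rng = np.random.default_rng(6)
X, y = rng.standard_normal((1024, 32)), rng.standard_normal(1024)
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))   # 4 data-parallel workers
w, lr = np.zeros(32), 0.1
for _ in range(100):
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]     # independent local work
    w -= lr * np.mean(grads, axis=0)                             # the "all-reduce" step
print(np.linalg.norm(X @ w - y) / np.linalg.norm(y))             # relative residual shrinks
```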
Converging AI and High-Performance Computing for Scalable and Robust Anomaly Detection. [14:30-15:00]
Zineb Ziani, AnotherBrain, France.
[Bio]An engineer from Ensimag with a specialization in applied mathematics and a PhD in mathematical computer science from Université Paris-Saclay, I am currently an AI research engineer at AnotherBrain. My research focuses on artificial intelligence and high-performance computing, with a particular interest in anomaly detection and the design of accurate, efficient, and scalable machine learning models. I have conducted experiments on large-scale infrastructures such as the Ruche cluster and the Fugaku supercomputer. At AnotherBrain, I investigate bio-inspired intelligence paradigms based on the structure of cortical columns and memory-based mechanisms for visual recognition.[Abstract]Anomaly detection, a subfield of AI, focuses on identifying data patterns or instances that deviate from expected behavior. This capability is critical across various domains. However, existing techniques often face limitations in terms of accuracy and computational efficiency, especially in high-dimensional, dynamic, or large-scale environments. Generalizing these methods to effectively detect diverse types of anomalies remains a key challenge, often requiring hybrid or ensemble-based approaches.[Slides]
Ensemble learning has shown promise in enhancing the robustness, adaptability, and generalizability of anomaly detection systems. Yet, such methods demand considerable computational resources to process vast volumes of data in real time, which is an essential requirement for operational anomaly detection platforms.
To address this, we propose applying the "Unite and Conquer (UC)" approach to ensemble learning. This approach builds a global model by orchestrating several co-models that collaboratively solve the same problem. UC improves both detection accuracy and computational efficiency by enabling co-models to share intermediate results and converge faster. Furthermore, it enhances system resilience through fault tolerance and optimized load balancing, making it particularly suitable for deployment on massive distributed infrastructures such as Fugaku. -
Resilient Orchestration of Compute-Intensive Workflows: The ArmoniK Framework for HPC and ML. [15:00-15:30]
Jérôme Gurhem, Anéo, France
[Bio]Jérôme Gurhem, PhD, is a senior consultant at Aneo, specialising in high-performance computing (HPC), cloud-native orchestration, and distributed systems. He is a key contributor to ArmoniK, an open-source task orchestration platform designed to efficiently manage large-scale parallel workloads across heterogeneous infrastructures. Gurhem's expertise encompasses the development and deployment of scalable computing solutions, with applications in finance, healthcare, and scientific research. He has also been involved in research on distributed matrix-vector computations for sparse and irregular matrices, contributing to advancements in computational methods for large-scale simulations.[Abstract]As demand for high-performance computing (HPC) and large-scale data processing continues to grow, ArmoniK offers an open-source platform for efficient, scalable execution of parallel workloads. It works with heterogeneous infrastructures and simplifies the development and deployment of distributed computing. ArmoniK helps optimize resource usage across public clouds, private clouds, and soon, HPC clusters. It lets users focus on building applications without dealing with the complexity of distributed execution.[Slides]
The ArmoniK compute orchestrator handles dynamic distribution of task graphs and their data. It offers fault tolerance, elasticity, portability, and observability by default. ArmoniK can run both independent and dependent tasks in parallel. It automatically schedules and distributes them across available resources. Built-in failure handling restarts tasks if they fail, ensuring reliable execution. Its elastic design supports dynamic scaling, adjusting to changing workloads while maintaining strong performance and horizontal scalability.
With its multi-language SDKs (C++, Python, Rust, C#, Java), ArmoniK supports a broad range of applications across diverse ecosystems. In particular, its integration potential with JAX makes it a strong fit for machine learning workflows and use cases. It can manage and scale ML workflows involving distributed training, parameter sweeps, or model parallelism. JAX’s functional and hardware-accelerated model pairs well with ArmoniK’s ability to orchestrate large, compute-heavy pipelines.
This integration would bridge HPC orchestration with modern ML workflows, making ArmoniK a powerful tool for AI and scientific computing. It enables teams to scale from prototype to production with greater speed and efficiency. -
Condenser: noun, an apparatus for compiling science to the cloud. [15:30-16:00]
Albert Cohen, Google, France
[Bio]Albert Cohen leads cutting-edge research on the acceleration and energy efficiency of artificial intelligence applications at Google DeepMind. Graduating from École Normale Supérieure de Lyon and the University of Versailles (Paris Saclay), he joined INRIA as a research scientist and worked as a part-time associate professor at École Polytechnique. He has been a visiting scholar at the University of Illinois, an invited professor at Philips Research as a recipient of a Marie Curie fellowship on technology transfer, and a visiting professor at Facebook Artificial Intelligence Research. Albert’s work ranges from the theory of programming languages to the engineering of high-performance AI systems. His work resulted in 250 peer-reviewed publications together with 30 PhD students and international collaborators, with technology transfers to both open source platforms and commercial products. In particular, Albert played a pioneering role in the design and adoption of the MLIR software platform for scalable and efficient machine learning.[Abstract]Advances in ML-driven supercomputing offer unprecedented potential for scientific research. What about running state of the art simulations on specialized tensor accelerators, leveraging disruptive interconnect technology, real-world automatic parallelization through domain-specific compilers, and even complementing physics-based algorithms with ML modeling? This is where condensers come in: compilers to finally put an end to vaporware science in the cloud. Unfortunately scientific applications such as ocean models are built for traditional HPC systems, often written in Fortran, C++ or more recently Julia, and remain largely incompatible with these technologies. Portable abstractions exist, such as Kokkos in C++, but remain much lower level than the compute graphs of popular ML frameworks. The abstraction and domain-specialization gap isolates scientific computing from rapid cloud-based innovation for AI workloads. With a compiler-centric focus, we will survey the main compiler-assisted approaches to port HPC workloads onto commercial cloud systems with GPUs or Google TPUs. We will showcase recent achievements on a Julia-based ocean model, leveraging the MLIR infrastructure to deploy a range of techniques, from collective communication optimization to low-level code generation and automatic differentiation.[Slides] -
Pretraining Scientific Foundation Models: from Spatiotemporal Surrogate Models to Large Multimodal Data Models. [16:00-16:30]
Francois Lanusse, CNRS, France
[Bio]Dr. Lanusse is a permanent CNRS researcher at the Astrophysics Department of CEA Paris-Saclay (France), and a member of the Polymathic AI team. He received his PhD in cosmology and inverse problems in 2015 in Paris, and further developed an interdisciplinary expertise in Deep Learning for cosmology as a postdoctoral researcher at Carnegie Mellon University (2015-2018) and UC Berkeley (2018-2019) through multiple collaborations with their respective Machine Learning and Statistics Departments. He is broadly interested in developing scientific applications of state of the art Deep Learning techniques, by combining concepts of Bayesian inference, deep neural networks, and physical forward modeling.[Abstract]Scientific simulations and observations routinely outstrip the capacity of traditional machine-learning pipelines: the data are expensive to generate, heterogeneous in format, and often too sparse for task-specific models. In this talk I will show how foundation-model pretraining, which has been so successful for text and vision, can be adapted to these scientific constraints, and how modern high-performance hardware lets us scale these ideas in practice. I will cover two specific examples:[Slides]
1. Multiple-Physics Pretraining (MPP). We first introduce MPP, an autoregressive, task-agnostic surrogate that learns shared spatiotemporal representations by jointly predicting the dynamics of dozens of disparate physical systems. Training a single model across 15 TB of simulations in The Well dataset required aggressive data-parallel and sequence-parallel strategies on the Flatiron Institute’s H100 GPU cluster. The result is a reusable surrogate that accelerates downstream fluid, plasma, and biophysical solvers by one to two orders of magnitude while preserving high-fidelity temporal evolution.
2. AstronomIcal Omnimodal Network (AION). On a second line of work, I will present our work on multimodal generative self-supervision to fuse highly inhomogeneous observations (spectra, irregular time-series, images, even instrument metadata) into a single large scale model. Training this model was carried out on the Jean Zay supercomputer during the 2024 Grand Challenge, on 120TB of data, and scaled up to 13B parameters. The pretrained model can be linearly probed for physical parameter estimation, object classification, or rare-object search, achieving near-optimal accuracy with trivial fine-tuning.
Across both projects I will highlight some of the infrastructure and training choices that allowed us to scale the training of these models, which are quite different from more traditional language models.
16:30-17:00 | General discussion
17:00-17:15 | Closing remarks
Contact Us
Please send any questions related to the DP2E-AI 2025 workshop to dp2eai2025@gmail.com