Supercomputing Resources
ALCF supercomputing resources support large-scale, computationally intensive projects aimed at solving some of the world’s most complex and challenging scientific problems.
| System Name | Purpose | Architecture | Peak Performance | Processors per Node | GPUs per Node | Nodes | Cores | Memory | Interconnect | Racks |
|---|---|---|---|---|---|---|---|---|---|---|
| Aurora | Science Campaigns | HPE Cray EX | 2 EF | 2 Intel Xeon CPU Max Series | 6 Intel Data Center GPU Max Series | 10,624 (21,248 CPUs; 63,744 GPUs) | 9,264,128 | 20.4 PB | HPE Slingshot 11 with Dragonfly Configuration | 166 |
| Polaris | Science Campaigns | HPE Apollo 6500 Gen10+ | 34 PF (44 PF Tensor Core) | 1 3rd Gen AMD EPYC (Milan) | 4 NVIDIA A100 Tensor Core | 560 | 17,920 | 280 TB (DDR4), 87.5 TB (HBM) | HPE Slingshot 11 with Dragonfly Configuration | 40 |
| Sophia | Science Campaigns | NVIDIA DGX A100 | 3.9 PF (FP64) | 2 AMD EPYC 7742 (Rome) | 8 NVIDIA A100 Tensor Core | 24 | 3,072 | 26 TB (DDR4), 8.32 TB (GPU) | NVIDIA HDR InfiniBand | 7 |
| Crux | Science Campaigns | HPE Cray EX | 1.18 PF | 2 AMD EPYC 7742 (Rome) | — | 256 | 16,384 | 64 TB (DDR4) | HPE Slingshot 11 | 1 |
| Minerva | AI Training & Inference | NVIDIA DGX B200 | 72 PF (FP8), 144 PF (FP4) per node | 2 Intel Xeon Platinum | 8 NVIDIA B200 Tensor Core | 8 | 1,024 | 16 TB (DDR5), 11.5 TB (HBM) | InfiniBand | 5 |
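The per-node figures in the table roll up directly into the system totals. The short Python sketch below illustrates the arithmetic using Aurora's numbers from the table above; it is a quick sanity check, not an authoritative accounting of the machine.

```python
# Roll up per-node hardware counts into system totals,
# using Aurora's figures from the table above.
nodes = 10_624
cpus_per_node = 2   # Intel Xeon CPU Max Series
gpus_per_node = 6   # Intel Data Center GPU Max Series

total_cpus = nodes * cpus_per_node  # 21,248 CPUs
total_gpus = nodes * gpus_per_node  # 63,744 GPUs

print(f"Aurora: {total_cpus:,} CPUs and {total_gpus:,} GPUs across {nodes:,} nodes")
```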
ALCF AI Testbed
The ALCF AI Testbed provides an infrastructure of next-generation AI-accelerator machines for research campaigns at the intersection of AI and science. AI testbeds include:
| System Name | System Size | Compute Units per Accelerator | Single Accelerator Performance (TFlops) | Software Stack Support | Interconnect |
|---|---|---|---|---|---|
| Cerebras CS-2 | 2 Nodes (Each with a Wafer-Scale Engine), Including MemoryX and SwarmX | 850,000 Cores | >5,780 (FP16) | Cerebras SDK, TensorFlow, PyTorch | Ethernet-based |
| Cerebras CS-3 | 4 Nodes (Each with a Wafer-Scale Engine), Including MemoryX and SwarmX | 900,000 Cores | 125,000 (FP16) | Cerebras Model Zoo, PyTorch | Ethernet-based |
| SambaNova Cardinal SN30 | 64 Accelerators (8 Nodes, 8 Accelerators per Node) | 1,280 Programmable Compute Units | >660 (BF16) | SambaFlow, PyTorch | Ethernet-based |
| SambaNova Metis SN40L | 32 Accelerators (16 Nodes, 2 Accelerators per Node) | 1,040 | 637.5 (BF16) | SambaStudio, SambaStack | Ethernet-based |
| GroqRack | 72 Accelerators (9 Nodes, 8 Accelerators per Node) | 5,120 Vector ALUs | >188 (FP16), >750 (INT8) | GroqWare SDK, ONNX | RealScale™ |
| Graphcore Bow Pod-64 | 64 Accelerators (4 Nodes, 16 Accelerators per Node) | 1,472 Independent Processing Units | >250 (FP16) | PopART, TensorFlow, PyTorch, ONNX | IPU Link |
Data Storage Systems
ALCF disk storage systems provide intermediate-term storage for users to access, analyze, and share computational and experimental data. Tape storage is used to archive data from completed projects.
| System Name | File System | Storage System | Usable Capacity | Sustained Data Transfer Rate | Disk Drives |
|---|---|---|---|---|---|
| Aurora DAOS (Preproduction) | — | HPE Distributed Asynchronous Object Storage | 220 PB | 25 TB/s (not validated) | 16,384 SSDs |
| Eagle | File System Lustre | Storage System HPE ClusterStor E1000 | Usable Capacity 100 PB | Sustained Data Transfer Rate 650 GB/s | Disk Drives 8,480 |
| Grand | File System Lustre | Storage System HPE ClusterStor E1000 | Usable Capacity 100 PB | Sustained Data Transfer Rate 650 GB/s | Disk Drives 8,480 |
| Swift | File System Lustre | Storage System All NVMe Flash Storage Array | Usable Capacity 123 TB | Sustained Data Transfer Rate 48 GB/s | Disk Drives 24 |
| Tape Storage | — | LTO6 and LTO8 Tape Technology | 300 PB | — | — |
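The sustained data transfer rates in the table set an upper bound on how quickly data can move on or off each file system. The sketch below is a back-of-the-envelope estimate assuming the full sustained rate is achieved, which real workloads rarely do; the 500 TB dataset size is a hypothetical example, not an ALCF figure.

```python
# Back-of-the-envelope: time to move a dataset at a file system's
# sustained data transfer rate (rates taken from the table above).
def transfer_time_hours(dataset_tb: float, rate_gb_per_s: float) -> float:
    seconds = dataset_tb * 1_000 / rate_gb_per_s  # 1 TB = 1,000 GB
    return seconds / 3600

# Example: a hypothetical 500 TB dataset on Eagle (650 GB/s sustained)
print(f"{transfer_time_hours(500, 650):.2f} hours")  # ~0.21 hours
```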
Networking
Networking is the fabric that ties all of the ALCF’s computing systems together. InfiniBand enables communication between system I/O nodes and the ALCF’s various storage systems. The production HPC SAN is built upon NVIDIA Mellanox High Data Rate (HDR) InfiniBand hardware. Two 800-port core switches provide the backbone links between 80 edge switches, yielding 1600 total available host ports, each at 200 Gbps, in a non-blocking fat-tree topology. The full bisection bandwidth of this fabric is 320 Tbps. The HPC SAN is maintained by the NVIDIA Mellanox Unified Fabric Manager (UFM), providing adaptive routing to avoid congestion, as well as the NVIDIA Mellanox Self-Healing Interconnect Enhancement for Intelligent Datacenters (SHIELD) resiliency system for link fault detection and recovery.
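As a rough sanity check on the figures above, the short sketch below reproduces the fabric arithmetic: 1,600 host ports at 200 Gbps each works out to 320 Tbps, consistent with the quoted full bisection bandwidth of the non-blocking fat tree. The ports-per-edge-switch number is derived here for illustration and is not stated in the text.

```python
# Quick arithmetic behind the HPC SAN figures quoted above.
edge_switches = 80
host_ports = 1_600        # total available host ports
port_rate_gbps = 200      # HDR InfiniBand, per port

host_ports_per_edge = host_ports / edge_switches          # 20 host ports per edge switch
aggregate_bw_tbps = host_ports * port_rate_gbps / 1_000   # 320 Tbps

print(f"{host_ports_per_edge:.0f} host ports per edge switch")
print(f"{aggregate_bw_tbps:.0f} Tbps aggregate host bandwidth, "
      "matching the quoted full bisection bandwidth")
```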
When external communications are required, Ethernet is the interconnect of choice. Remote user access, systems maintenance and management, and high-performance data transfers are all enabled by the local area network (LAN) and wide area network (WAN) Ethernet infrastructure. This connectivity is built upon a combination of Extreme Networks SLX and MLXe routers and NVIDIA Mellanox Ethernet switches.
ALCF systems connect to other research institutions over multiple 100 Gbps connections that link to many high-performance research networks, including regional networks like the Metropolitan Research and Education Network (MREN), as well as national and international networks like the Energy Sciences Network (ESnet) and Internet2.
Joint Laboratory for System Evaluation
Argonne’s Joint Laboratory for System Evaluation (JLSE) provides access to leading-edge testbeds for research aimed at evaluating future extreme-scale computing systems, technologies, and capabilities. Here is a partial listing of the novel technology that makes up the JLSE.
- Arm Ecosystem: Apollo 80 Fujitsu A64FX Arm system, NVIDIA Ampere Arm and A100 test kits, and an HPE Comanche with Marvell ARM64 CPU platform provide an ecosystem for porting applications and measuring performance on next-generation systems
- Edge Testbed: NVIDIA Jetson Xavier and Jetson Nano platforms provide a resource for testing and developing edge computing applications
- NVIDIA GPUs: Clusters of NVIDIA GH200, H100, V100, A100, and A40 GPUs for preparing applications for heterogeneous computing architectures
- AMD GPUs: Clusters of AMD MI300A, MI300X, MI250, MI50, and MI100 GPUs for preparing applications for heterogeneous computing architectures
- Intel GPUs: Intel Data Center GPU Max 1550 (PVC)
- NVIDIA BlueField-2 DPU SmartNICs: Platform used for confidential computing, MPICH offloading, and APS data transfer acceleration
- NextSilicon Maverick: First-generation product being tested by Argonne researchers
- Atos Quantum Learning Machine: Platform for testing and developing quantum algorithms and applications