Awesome HPC
High Performance Computing tools and resources for engineers and administrators.
High Performance Computing (HPC) most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.
Contents
- [Provisioning](#provisioning)
- [Workload Managers](#workload-managers)
- [Pipelines](#pipelines)
- [Applications](#applications)
- [Compilers](#compilers)
- [MPI](#mpi)
- [Parallel Computing](#parallel-computing)
- [Benchmarking](#benchmarking)
- [Miscellaneous](#miscellaneous)
- [Performance](#performance)
- [Parallel Shells](#parallel-shells)
- [Containers](#containers)
- [Environment Management](#environment-management)
- [Visualization](#visualization)
- [Parallel Filesystems](#parallel-filesystems)
- [Programming Languages](#programming-languages)
- [Monitoring](#monitoring)
- [Journals](#journals)
- [Podcasts](#podcasts)
- [Blogs](#blogs)
- [Conferences](#conferences)
- [Websites](#websites)
- [User Groups](#user-groups)
Provisioning
- Grendel - Bare Metal Provisioning system for HPC Linux clusters (Source Code) `GPL-3`.
- xCAT - xCAT is a toolkit for deployment and administration of clusters of all sizes (Source Code) `EPL-1.0`.
- Warewulf - Warewulf is a stateless and diskless container operating system provisioning system for large clusters of bare metal and/or virtual systems (Source Code) `BSD-3`.
- Rocks - A Linux distribution for developing Linux clusters `other`.
- Cobbler - Cobbler is a Linux installation server that allows for rapid setup of network installation environments (Source Code) `GPL-2.0`.
- Base Command Manager - Base Command Manager allows administrators to quickly build and manage heterogeneous clusters `Proprietary`.
- Scyld - Scyld ClusterWare is developed based on the continuing evolution of Beowulf clusters first developed at NASA in the 1990s `Proprietary`.
- BlueBanquise - BlueBanquise is an open source cluster deployment and management stack built on Python and Ansible (Source Code) `MIT`.
Workload Managers
- Slurm - A free and open source job scheduler (Source Code) `OSS`.
- LSF - A job scheduler and workload management software developed by IBM `Proprietary`.
- Moab - Moab is a workload manager and job scheduler `other`.
- Torque - Torque is a workload manager and job scheduler `other`.
- OpenLava - OpenLava is a workload manager and job scheduler `other`.
- UGE/SGE - Univa Grid Engine is a workload management engine for HPC `Proprietary`.
- Volcano - Volcano is a batch system built on Kubernetes `Apache-2.0`.
- Maui - Maui is a workload manager and job scheduler `other`.
- Kube Batch - A batch scheduler for Kubernetes for high-performance workloads, e.g. AI/ML, BigData, HPC `Apache-2.0`.
- OpenPBS - OpenPBS® software optimizes job scheduling and workload management in high-performance computing (HPC) environments (Source Code) `other`.
Pipelines
- Nextflow - Data-driven computational pipelines `Apache-2.0`.
- Cromwell - Scientific workflow engine designed for simplicity & scalability (Source Code) `BSD-3`.
- Pegasus - A configurable system for mapping and executing scientific workflows over a wide range of computational infrastructure (Source Code) `Apache-2.0`.
Applications
- Spack - A flexible package manager that supports multiple versions, configurations, platforms, and compilers (Source Code) `other`.
- EasyBuild - EasyBuild - building software with ease (Source Code) `GPL-2`.
Compilers
- Nvidia - NVIDIA HPC compiler suite for Fortran, C/C++ with OpenACC `Proprietary`.
- Portland Group - The Portland Group compilers were Fortran, C/C++ compilers now integrated into NVIDIA HPC SDK `Proprietary`.
- Intel - The Intel compiler suite offers many language compilers for use in the HPC space `Proprietary`.
- Cray - A suite of compilers designed and optimized to target the AMD Interlagos instruction set `Proprietary`.
- GNU - The GNU Compiler Collection is a suite of compilers targeting many languages (Source Code) `GPL-3`.
- LLVM - The LLVM project is a collection of modular compilers and toolchains (Source Code) `OSS`.
MPI
- OpenMPI - OpenMPI is an open source implementation of the MPI-3.1 standard (Source Code) `BSD`.
- MPICH - MPICH is a high-performance and widely portable implementation of the MPI-3.1 standard (Source Code) `other`.
- MVAPICH - MVAPICH is an open source implementation of the MPI-3.1 standard developed by Ohio State University `BSD`.
- Intel-MPI - Intel-MPI is Intel’s MPI-3.1 implementation included in their compiler suite `other`.
Parallel Computing
- ArrayFire - A general purpose tensor library that simplifies the process of software development for parallel architectures `other`.
- OpenMP - OpenMP is an application programming interface that supports multi-platform shared-memory multiprocessing programming `other`.
Benchmarking
- OSU Benchmarks - A collection of benchmarking tools for MPI developed by Ohio State University `other`.
- Intel MPI Benchmarks - A set of benchmarks developed by Intel for use with their Intel MPI `other`.
- HPCC Systems - HPCC Systems (High Performance Computing Cluster) is an open source, massive parallel-processing computing platform for big data processing and analytics (Source Code) `other`.
- LINPACK - LINPACK is a set of efficient Fortran subroutines for solving linear systems, whose benchmarks are useful for HPC `other`.
- IOzone - IOzone is a filesystem benchmark tool `OSS`.
- IOR - Interleaved or Random is a useful benchmarking tool for testing parallel filesystems `other`.
- MDtest - MDtest is an MPI-based application for evaluating the metadata performance of a file system `other`.
- FIO - Flexible I/O is an advanced disk benchmark that depends upon the kernel’s AIO access library (Source Code) `GPL-2`.
- elbencho - A distributed storage benchmark for files, objects & blocks with support for GPUs `GPL-3`.
Miscellaneous
- OpenOnDemand - Open OnDemand helps computational researchers and students efficiently utilize remote computing resources by making them easy to access from any device (Source Code) `MIT`.
- Open XDMod - Open XDMoD is an open source tool to facilitate the management of high performance computing resources (Source Code) `LGPL-3`.
- Coldfront - ColdFront is an open source resource allocation system designed to provide a central portal for administration, reporting, and measuring scientific impact of HPC resources (Source Code) `GPL-3`.
- Pavilion2 - Pavilion is a Python 3 (3.6+) based framework for running and analyzing tests targeting HPC systems (Source Code) `other`.
- Reframe - A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems (Source Code) `BSD-3`.
- OLCF Test Harness - The OLCF Test Harness (OTH) helps automate the testing of applications, tools, and other system software (Source Code) `other`.
- GoSlmailer - Goslmailer is a drop-in notification delivery solution for Slurm that can deliver to Slack, Mattermost, Teams, and more.
Performance
- TotalView - TotalView is a debugging tool for HPC applications `Proprietary`.
- Tau - TAU Performance System® is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, UPC, Java, Python `other`.
- Valgrind - Valgrind is a tool designed to profile programs and detect memory leaks (Source Code) `GPL-2`.
- Paraver - Paraver is a very flexible data browser that is part of the CEPBA-Tools toolkit `other`.
- PAPI - Performance Application Programming Interface (PAPI) is a performance analysis tool (Source Code) `other`.
Parallel Shells
Containers
- Apptainer - Apptainer is an open source container system (Source Code) `BSD`.
- Charliecloud - Charliecloud provides user-defined software stacks (UDSS) for high-performance computing (HPC) centers (Source Code) `Apache-2.0`.
- Docker - Docker is a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers `other`.
- uDocker - A basic user tool to execute simple docker containers in batch or interactive systems without root privileges (Source Code) `Apache-2.0`.
- Shifter - Shifter is Linux containers for HPC (Source Code) `other`.
- HPC Container Maker - HPC Container Maker is an open source tool to make it easier to generate container specification files `Apache-2.0`.
- Sarus - An OCI-compatible container engine for HPC `BSD`.
- Singularity HPC - Singularity Registry HPC (shpc) allows you to install containers as modules (Source Code) `MPL 2.0`.
Environment Management
- Lmod - Lmod: An Environment Module System based on Lua, Reads TCL Modules, Supports a Software Hierarchy (Source Code) `other`.
- Environment Modules - Environment Modules: provides dynamic modification of a user’s environment (Source Code) `GPL-2`.
- Anaconda - Anaconda is a Python and R distribution for use in computational science `other`.
- Mamba - Mamba is a reimplementation of the conda package manager in C++ (Source Code) `BSD`.
Visualization
- Visit - VisIt - Visualization and Data Analysis for Mesh-based Scientific Data (Source Code) `BSD-3`.
- Paraview - ParaView is an open-source, multi-platform data analysis and visualization application based on Visualization Toolkit (VTK) (Source Code) `BSD-3`.
Parallel Filesystems
- GPFS - GPFS is a high-performance clustered file system software developed by IBM `Proprietary`.
- Quobyte - A high performance filesystem `Proprietary`.
- Ceph - Ceph is a distributed object, block, and file storage platform (Source Code) `other`.
- Weka - A file system designed for HPC `Proprietary`.
- Lustre/Exascaler - Lustre is an open-source, distributed parallel file system software platform designed for scalability, high-performance, and high-availability (Source Code) `other`.
- BeeGFS - BeeGFS is a hardware-independent POSIX parallel file system developed with a strong focus on performance and designed for ease of use, simple installation, and management `Proprietary`.
- OrangeFS - OrangeFS is a next generation parallel file system for Linux clusters (Source Code) `other`.
- MooseFS - Moose File System is an open-source, POSIX-compliant distributed file system developed by Core Technology (Source Code) `GPL-2.0`.
Programming Languages
- Julia - Julia is a high-level, high-performance dynamic language for technical computing `MIT`.
- Futhark - Futhark is a purely functional data-parallel programming language in the ML family `ISC`.
- Chapel - Chapel is a programming language designed for productive parallel computing at scale `Apache-2.0`.
Monitoring
Prometheus Based
- Slurm Exporter - Prometheus exporter for performance metrics from Slurm `GPL-3.0`.
- Slurm Exporter - Slurm Exporter for Prometheus using Rest API `GPL-3.0`.
- Infiniband Exporter - The InfiniBand exporter collects counters from InfiniBand switches and HCAs `Apache-2.0`.
- Cgroup Exporter - Produces metrics from cgroups `Apache-2.0`.
- Cgroup Exporter - A Prometheus exporter for cgroup-level metrics `unknown`.
- GPFS Exporter - The GPFS exporter collects metrics from the GPFS filesystem `Apache-2.0`.
- Lustre Exporter - Prometheus exporter for use with the Lustre parallel filesystem `GPL-3.0`.
- DCGM Exporter - NVIDIA GPU metrics exporter for Prometheus leveraging DCGM `Apache-2.0`.
Journals
Podcasts
- This Week in HPC - Each week, Intersect360 Research CEO Addison Snell and HPCwire editor Tiffany Trader dissect the week’s top HPC stories.
- Exascale Project - ECP’s Let’s Talk Exascale podcast goes behind the scenes to chat with some of the people who are bringing a capable and sustainable exascale computing ecosystem to fruition.
- @HPCpodcast - Join Shahin Khan and Doug Black as they discuss Supercomputing technologies and the applications, markets, and policies that shape them.
Blogs
- HPCWire - Since 1987 covering the fastest computers in the world and the people who run them.
- InsideHPC - insideHPC is a global publication recognized for its comprehensive and insightful coverage of the HPC-AI community, linking vendors, end-users and HPC strategists.
- The Next Platform - Offers in-depth coverage of high-end computing at large enterprises, supercomputing centers, hyperscale data centers, and public clouds.
- The Register HPC - The Register is a leading and trusted global online enterprise technology news publication, reaching roughly 40 million readers worldwide.
- HPC at Dell - High-Performance Computing knowledge base articles from Dell.
Conferences
- PEARC - Practice & Experience in Advanced Research Computing.
- Supercomputing (SC) - The International Conference for High Performance Computing, Networking, Storage, and Analysis.
- ISC High Performance (ISC) - Formerly the International Supercomputing Conference, an annual event for high performance computing held in Germany.
- CCGrid - IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing.
- IEEE-HPEC - IEEE High Performance Embedded Computing.
- Hot Chips - Semiconductor industry’s leading conference on high-performance microprocessors and related circuits.
- Hot Interconnects - IEEE conference on software architectures and implementations for interconnection networks of all scales.
- ESSA - Workshop on Extreme-Scale Storage and Analysis.
- IEEE-IPDPS - IEEE International Parallel & Distributed Processing Symposium.
- ESPM2 Workshop - International Workshop on Extreme Scale Programming Models and Middleware.
- LCI Workshops - The Linux Clusters Institute (LCI) provides education and advanced technical training for the deployment and use of computing clusters to the high performance computing community worldwide.
- HPC Carpentry - Teaching basic skills for high-performance computing.
Websites
- Top500 - The TOP500 project ranks and details the 500 most powerful non-distributed computer systems in the world.
User Groups
- MVAPICH - The MUG conference provides an open forum for all attendees (users, system administrators, researchers, engineers, and students) to discuss and share their knowledge on using MVAPICH libraries.
- Slurm - The annual Slurm user group meeting.
Contributing
Contributing guidelines can be found in contributing.md.