Rethinking Collective Communication for CPU–GPU–DPU Heterogeneous Systems
Designed by the PADSYS Lab at the University of Florida
Why HCCL?
Traditional collectives assume homogeneous nodes and devices
NCCL and MPI do not deeply model:
DPU/SmartNIC offloading
Heterogeneous capabilities
Multi-rail topologies
Hierarchical AI workloads
Large-scale LLM training exposes:
Cross-device imbalance
Congestion
Inefficient gradient aggregation
What HCCL Provides
Heterogeneity-aware scheduling
Multi-rail path optimization
DPU/SmartNIC-offloaded computation and reduction
Topology-aware collective decomposition
Performance-model-guided algorithm selection (LogGP-based)
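To illustrate how a LogGP-style cost model can drive topology-aware algorithm selection, here is a minimal sketch in Python. The parameter values, function names, and the flat-ring vs. hierarchical comparison are illustrative assumptions, not HCCL's actual API; the LogGP gap term g is omitted for brevity.

```python
# Sketch: LogGP-guided selection between a flat ring allreduce and a
# topology-aware hierarchical decomposition. All names and numbers here
# are hypothetical, chosen only to demonstrate the modeling idea.
from dataclasses import dataclass

@dataclass
class LogGP:
    L: float  # one-way latency (us)
    o: float  # per-message send/receive overhead (us)
    G: float  # gap per byte (us/byte), roughly 1/bandwidth

def ring_allreduce_cost(p: int, m: float, link: LogGP) -> float:
    """LogGP estimate for ring allreduce on p ranks over m bytes:
    2(p-1) steps, each moving an m/p-byte chunk."""
    step = link.L + 2 * link.o + (m / p) * link.G
    return 2 * (p - 1) * step

def hierarchical_allreduce_cost(nodes: int, gpus: int, m: float,
                                intra: LogGP, inter: LogGP) -> float:
    """Topology-aware decomposition: allreduce inside each node over the
    fast fabric, then allreduce across one leader rank per node."""
    return (ring_allreduce_cost(gpus, m, intra) +
            ring_allreduce_cost(nodes, m, inter))

def select_allreduce(nodes: int, gpus: int, m: float,
                     intra: LogGP, inter: LogGP) -> str:
    """Pick whichever schedule the model predicts to be cheaper."""
    flat = ring_allreduce_cost(nodes * gpus, m, inter)
    hier = hierarchical_allreduce_cost(nodes, gpus, m, intra, inter)
    return "hierarchical" if hier < flat else "flat-ring"

# Example: 4 nodes x 8 GPUs, 100 MB gradient buffer, NVLink-class
# intra-node fabric vs. a much slower inter-node network.
nvlink = LogGP(L=2.0, o=1.0, G=1.0 / 300_000)  # ~300 GB/s
rdma = LogGP(L=5.0, o=1.0, G=1.0 / 12_500)     # ~12.5 GB/s
print(select_allreduce(4, 8, 100e6, nvlink, rdma))
```

Under these assumed parameters, the hierarchical schedule wins because the inter-node ring shrinks from 32 ranks to 4, while the extra intra-node phase runs over the fast fabric.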
Contact [email] for more information about the project.