CS349F
Fabric Architectures For AI Systems
Computer Science
ENGR - School of Engineering
Course Description
This course covers the design and operation of the network fabrics that interconnect large-scale compute and storage nodes in modern AI GPU clusters, cloud computing systems, and "time-sensitive systems" such as financial trading platforms and massively multiplayer games. We will consider architectures, protocols, and algorithms that enable these network fabrics to deliver deterministic, ultra-low latency at near-100% goodput. Topics include data center fabric architectures (the fat-tree topology), transport protocols (congestion control and load balancing), and scheduling algorithms (job scheduling, fabric scheduling, and their interaction). A particular focus will be the contrast and synergy between the edge-centric and network-centric approaches to building high-performance network fabrics. Students will hear from industry experts who design, operate, and use large-scale GPU and CPU clusters. Recommended: basic knowledge of networking, operating systems, or distributed systems (CS 144, CS 140, or equivalent); basic EE coursework (EE 178) will also be useful.
Grading Basis
RSN - Satisfactory/No Credit
Min
2
Max
2
Course Repeatable for Degree Credit?
No
Course Component
Lecture
Enrollment Optional?
No
Does this course satisfy the University Language Requirement?
No
Programs
CS349F is a completion requirement for:
- (from the following course set: )
- (from the following course set: )
- (from the following course set: )