CS349F

Fabric Architectures For AI Systems

Computer Science

ENGR - School of Engineering

Course Description

This course covers the design and operation of the network fabrics that interconnect large-scale compute and storage nodes in modern AI GPU clusters, cloud computing systems, and "time-sensitive systems" such as financial trading platforms and massively multiplayer games. We will consider architectures, protocols, and algorithms that enable these fabrics to deliver deterministic, ultra-low latency at near-100% goodput. Topics include data center fabric architectures (the fat-tree topology), transport protocols (congestion control and load balancing), and scheduling algorithms (job scheduling, fabric scheduling, and their interaction). A particular focus will be the contrast and synergy between edge-centric and network-centric approaches to building high-performance network fabrics. Students will hear from industry experts who design, operate, and use large-scale GPU and CPU clusters. Recommended: basic knowledge of networking, operating systems, or distributed systems (CS 144, CS 140, or equivalent) and of basic EE coursework (EE 178).

Grading Basis

RSN - Satisfactory/No Credit

Min Units

2

Max Units

2

Course Repeatable for Degree Credit?

No

Course Component

Lecture

Enrollment Optional?

No

Does this course satisfy the University Language Requirement?

No

Programs

CS349F is a completion requirement for:
  • (from the following course set: )