Tutorial V of The “Scale-Up” HPC & AI Tutorial Series: Scalable AI

Location

Engineering : 102

Date & Time

May 5, 2026, 1:00 pm – 2:00 pm

Description

Moving from basic machine learning to scalable AI requires more than just code; it requires a robust strategy for managing massive datasets, monitoring hardware limits, and tracking experiment versions. Building on the foundational GPU skills from Tutorial IV, this session focuses on the end-to-end workflow of training large-scale models on UMBC's chip cluster.


We will begin by pulling research repositories directly from GitHub and sourcing large-scale datasets from platforms like HuggingFace. You will learn to navigate the complexities of deep learning at scale, including monitoring real-time CUDA memory usage and visualizing training progress with professional experiment-tracking tools. By the end of this session, you will be able to optimize model performance, interpret detailed log files, and push your final model versions back to your repository.
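As a small taste of the "interpret detailed log files" skill above, here is a minimal sketch of pulling loss values out of training output with Python's standard library. The log format shown is a hypothetical example for illustration; the actual format produced by jobs on the cluster may differ.

```python
import re

# Hypothetical training-log lines; the real format from your cluster
# jobs may differ (this is an illustrative assumption).
LOG_LINES = [
    "epoch 1 | step 100 | loss 2.3145",
    "epoch 1 | step 200 | loss 1.9872",
    "epoch 2 | step 300 | loss 1.5510",
]

# Match a floating-point value following the word "loss".
PATTERN = re.compile(r"loss\s+([0-9.]+)")

def extract_losses(lines):
    """Return the loss value from each log line that matches PATTERN."""
    losses = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            losses.append(float(match.group(1)))
    return losses

print(extract_losses(LOG_LINES))  # → [2.3145, 1.9872, 1.551]
```

Once losses are extracted this way, they can be fed into whatever plotting or experiment-tracking tool you prefer to visualize training progress over time.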


As in the other tutorials in this series, the training follows a "flipped-classroom" active-learning style. Before coming to the lab, participants complete a self‑paced Blackboard module that includes short hands‑on activities (called "DoIT Yourself Activities"). During the in‑person lab session, we will focus on questions, troubleshooting, and practice examples so that you can deepen your understanding and strengthen your workflow.


These events are hybrid, but there is limited support for online participants. For those attending in-person: Be sure to bring your laptop!


Please RSVP by Monday, 4/27/2026, so that we can provision your account on the chip HPC and make the Blackboard content available to you.


Each participant should review the available content before the event to make the best use of the synchronous classroom session (this event). We'll send a reminder to those who have RSVPed a day or so before the event!
