BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism

Authors: Runzhe Chen, Guandong Lu, Yakai Wang, Rui Zhang, Zheng Hu, Yanming Miao, Zhifang Cai, Jingwen Leng, Minyi Guo
Publisher: Springer Science and Business Media LLC
Publish date: 2024-11-11
ISSN: 2095-2228
DOI: 10.1007/s11704-023-3401-5

1. Are the authors confident that profiling bubbles in just the initial few iterations is sufficient for the entire training run, given how dynamic and unpredictable distributed training can be?
Training workloads are rarely static: varying data-loading times, adaptive optimizers, mixed precision, and even thermal throttling can all shift execution timelines. Yet the proposed method profiles bubbles only at the start of training and assumes their duration and distribution remain stable. How does the framework adapt when timing patterns drift over epochs? Is there a mechanism for re-profiling or dynamically updating the checkpoint plan, along the lines of the sketch below? If not, isn't this a major limitation that could lead to missed checkpoints or unexpected overhead, especially during long training runs on shared infrastructure?
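To make the question concrete, here is a minimal sketch of what such a drift check might look like. Everything in it is hypothetical: `BubbleMonitor`, its thresholds, and the toy measurements are illustrations of the idea, not part of BAFT's actual interface.

```python
# Hypothetical sketch: detect bubble-timing drift and trigger re-profiling.
import statistics

class BubbleMonitor:
    """Tracks observed bubble durations and flags drift from the profiled baseline."""

    def __init__(self, baseline_ms: float, tolerance: float = 0.2, window: int = 50):
        self.baseline_ms = baseline_ms  # bubble length from the initial profiling phase
        self.tolerance = tolerance      # relative drift that should trigger re-profiling
        self.window = window            # number of iterations per drift check
        self.samples = []

    def record(self, observed_ms: float) -> bool:
        """Record one measured bubble; return True once drift exceeds tolerance."""
        self.samples.append(observed_ms)
        if len(self.samples) < self.window:
            return False
        median = statistics.median(self.samples)
        self.samples.clear()
        return abs(median - self.baseline_ms) / self.baseline_ms > self.tolerance

# Toy usage: bubbles profiled at 12 ms slowly stretch toward 16 ms as the run ages.
monitor = BubbleMonitor(baseline_ms=12.0)
for step in range(200):
    observed = 12.0 + step * 0.02       # stand-in for a real per-iteration measurement
    if monitor.record(observed):
        print(f"step {step}: bubble timing drifted; re-profile and rebuild the plan")
        monitor.baseline_ms = observed  # placeholder for a full re-profiling pass
```

The point is that such a monitor is cheap (a median over a sliding window), so the absence of any re-profiling path in the paper seems like an oversight rather than a cost trade-off.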

2. Why didn’t the authors evaluate actual failure scenarios to demonstrate recovery correctness and its impact on model convergence or training integrity?
The paper focuses heavily on timing overhead and efficiency metrics but never shows what happens when a node actually fails mid-training. Do recovered models maintain accuracy? Is the loss curve preserved after resuming from a checkpoint? Were faults artificially injected during testing to assess whether BAFT's "full recovery" claim holds under real error conditions? (Even a small harness like the one sketched below would help.) Without such results, how can we trust that the recovery mechanism is not only fast but also correct and safe for critical deployments? Optimizing checkpoint overhead is important, but isn't recovery robustness the core requirement for any fault-tolerant framework?
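For reference, a single-process harness along these lines would already be telling: train, checkpoint, simulate a failure, restore, and compare loss curves step for step. The model, optimizer, and checkpoint format here are deliberately toy stand-ins, not BAFT's mechanism.

```python
# Hypothetical fault-injection harness; saving/loading a state-dict copy stands in
# for whatever BAFT's checkpoint path actually does.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
data = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(20)]

def train_step(m, o, batch):
    x, y = batch
    o.zero_grad()
    loss = nn.functional.mse_loss(m(x), y)
    loss.backward()
    o.step()
    return loss.item()

# Train 10 steps, checkpoint, then train 10 more: this is the reference loss curve.
for b in data[:10]:
    train_step(model, opt, b)
ckpt = {"model": copy.deepcopy(model.state_dict()),
        "opt": copy.deepcopy(opt.state_dict())}
reference = [train_step(model, opt, b) for b in data[10:]]

# Simulate a failure right after the checkpoint: restore and replay the same batches.
model.load_state_dict(ckpt["model"])
opt.load_state_dict(ckpt["opt"])
recovered = [train_step(model, opt, b) for b in data[10:]]

# Matching loss curves imply recovery preserved the training state exactly.
assert all(abs(a - b) < 1e-9 for a, b in zip(reference, recovered))
```

A distributed version of this test, with a real process killed mid-iteration, is exactly the evidence the evaluation is missing.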

3. How realistic is the assumption that checkpoint transfers won’t interfere with other communications, especially on bandwidth-constrained or multi-tenant systems?
The authors claim that BAFT hides transfer overhead within the so-called "bubble time," but this assumes network bandwidth is actually free during those bubbles. What happens when checkpoint transfers overlap with gradient synchronization, forward/backward passes, or competing jobs in a shared cluster? Did the authors evaluate performance under heavy communication load or simulate network congestion? (A microbenchmark like the one sketched below would be a start.) In production, bandwidth is often the bottleneck; without quantifying contention or demonstrating graceful degradation, isn't the claim of "negligible overhead" potentially misleading for real-world deployments?
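As one possible experiment, a microbenchmark like the following could quantify the interference: time an allreduce alone, then again while an asynchronous "checkpoint" broadcast is in flight. The backend choice, tensor sizes, and two-process setup are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical contention microbenchmark using torch.distributed (gloo backend).
import os
import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world)

    grads = torch.randn(1 << 22)   # stand-in for a gradient bucket (~16 MB)
    ckpt = torch.randn(1 << 24)    # stand-in for a checkpoint shard (~64 MB)

    def timed_allreduce() -> float:
        t0 = time.perf_counter()
        dist.all_reduce(grads)
        return time.perf_counter() - t0

    baseline = min(timed_allreduce() for _ in range(5))

    # Launch an async checkpoint transfer, then run allreduce "inside the bubble".
    handle = dist.broadcast(ckpt, src=0, async_op=True)
    contended = timed_allreduce()
    handle.wait()

    if rank == 0:
        print(f"allreduce alone: {baseline * 1e3:.1f} ms, "
              f"with concurrent transfer: {contended * 1e3:.1f} ms")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

If the contended time is materially higher than the baseline, the "free" bubble bandwidth assumption does not hold, and the hidden overhead resurfaces as slower iterations.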
