Mnemosyne

A lightweight, fast, and transparent framework for just-in-time error recovery in LLM training.

Overview

Figure: The workflow of Mnemosyne. The left side shows steady-state operation, and the right side shows the recovery process.

Advantages

Note

The device proxy prototype is in the device-proxy directory, and the flexible CCL prototype is in the flexible-ccl directory.

Cite

Your citations are greatly appreciated. 🥰