In a blog post, IBM researchers described the architecture and supporting features of Vela, an AI supercomputer that was launched in May 2022 on IBM Cloud.
Traditional supercomputers designed for modeling and simulation can accommodate AI workloads, but they were not built for large-scale AI processes, the researchers said. In designing a world-class AI modeling system, the scientists had to balance performance against productivity.
They noted that a cloud-based AI supercomputer can be more productive despite some compromises in performance because it provides access to other cloud services, especially storage and security.
Vela’s compute nodes feature 80-gigabyte A100 graphics processing units and are linked by multiple 100G network interfaces to support distributed training. For redundancy, each network interface card is connected to a different top-of-rack switch, and the top-of-rack switches are linked to four spine switches that provide 1.6 terabytes of cross-rack bandwidth, so that operation can continue in case of system failures.
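The redundancy design described above can be illustrated with a small sketch. It is not IBM's implementation: the switch and node names, the choice of two top-of-rack switches per node, and the assumption that every top-of-rack switch links to all four spine switches are hypothetical and used only for illustration. The code models one compute node and checks which spine switches remain reachable when a switch fails.

```python
from collections import deque

# Four spine switches, as stated in the article. The names are hypothetical.
SPINES = [f"spine{i}" for i in range(4)]


def build_topology(num_tors=2):
    """Build an adjacency map for one node (assumed: two ToR switches)."""
    links = {}

    def connect(a, b):
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)

    for t in range(num_tors):
        tor = f"tor{t}"
        # Each network interface card goes to a different top-of-rack switch.
        connect("node0", tor)
        # Assumption: every top-of-rack switch links to every spine switch.
        for spine in SPINES:
            connect(tor, spine)
    return links


def reachable_spines(links, failed=()):
    """Breadth-first search from node0 that skips failed switches."""
    failed = set(failed)
    seen = {"node0"}
    queue = deque(["node0"])
    while queue:
        current = queue.popleft()
        for neighbor in links.get(current, ()):
            if neighbor not in seen and neighbor not in failed:
                seen.add(neighbor)
                queue.append(neighbor)
    return sorted(s for s in SPINES if s in seen)


if __name__ == "__main__":
    topology = build_topology()
    print("No failures:", reachable_spines(topology))
    print("tor0 failed:", reachable_spines(topology, failed={"tor0"}))
```

In this toy model, losing a single top-of-rack switch still leaves the node with a path to every spine switch, which is the property the redundant wiring is meant to provide.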
The research team intends to share updates on upcoming improvements aimed at achieving both high end-user productivity and high-performance computing with the cloud-native Vela AI supercomputer.