Ant Group's NextEvo Fully Open-Sources Its AI Infra Technology
[Datayuan Lead-in] Ant Group's NextEvo fully open-sources its AI Infra technology
On February 1, NextEvo, Ant Group's AI innovation research and development department, fully open-sourced its AI Infra technology. The technology can keep effective training time above 95% when large models are trained on thousand-GPU clusters, and it puts training on "autopilot", improving the efficiency of AI research and development. The framework, called DLRover, aims to make large-scale distributed training intelligent. The latest addition to DLRover is the Flash Checkpoint (FCP) scheme. Model training generally requires periodic checkpoints so that, after an interruption, the job can be restored to a recent state. Conventional checkpointing is slow: frequent checkpoints eat into the time available for training, while infrequent checkpoints lose too much progress on recovery. When FCP was applied to training a 100-billion-parameter model on a thousand GPUs, the training time wasted on checkpointing fell by a factor of about 5, the persistence time fell by a factor of about 70, and effective training time rose from 90% to 95%.
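The checkpoint-frequency trade-off described above can be illustrated with a rough back-of-the-envelope model. The sketch below is not DLRover's or Flash Checkpoint's implementation. The function name `effective_fraction`, its parameters, and all the numbers are hypothetical, chosen only to show why both very frequent and very infrequent checkpoints hurt, and why a cheaper save helps.

```python
def effective_fraction(interval_s: float, save_cost_s: float, mtbf_s: float) -> float:
    """Rough share of wall-clock time spent on useful training.

    Assumptions (illustrative only, not taken from the article):
    - training blocks for save_cost_s seconds at every checkpoint;
    - failures arrive on average every mtbf_s seconds;
    - each failure discards, on average, half an interval of work.
    """
    cycle = interval_s + save_cost_s          # one train-then-save cycle
    failures_per_cycle = cycle / mtbf_s       # expected failures in that cycle
    lost_work = failures_per_cycle * (interval_s / 2)
    useful = max(interval_s - lost_work, 0.0)
    return useful / cycle


if __name__ == "__main__":
    mtbf = 10 * 3600  # hypothetical: one failure every 10 hours
    for interval, cost in [(10, 60), (600, 60), (36000, 60), (600, 1)]:
        share = effective_fraction(interval, cost, mtbf)
        print(f"interval={interval:>6}s save={cost:>3}s -> {share:.1%} useful")
```

In this toy model, a 60-second save is costly at a 10-second interval and wasteful at a 10-hour interval, while cutting the save cost raises the useful share at any interval. That is the general effect a faster checkpoint scheme targets.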
Source: DIYuan