ChannelLife UK - Industry insider news for technology resellers
iFLYTEK wins CNCF award for AI model training with Volcano

Yesterday

iFLYTEK has been named the winner of the Cloud Native Computing Foundation's End User Case Study Contest for advancements in scalable artificial intelligence infrastructure using the Volcano project.

The selection recognises iFLYTEK's deployment of Volcano to address operational inefficiencies and resource management issues that arose as the company expanded its AI workloads. iFLYTEK, which specialises in speech and language artificial intelligence, reported underutilised GPUs, increasingly complex workflows, and competition among teams for resources as its computing demands grew. These problems slowed development and placed additional strain on infrastructure.

With Volcano, iFLYTEK introduced elastic scheduling, directed acyclic graph (DAG)-based workflows, and multi-tenant isolation into its AI model training operations. The change allowed the business to use its infrastructure more efficiently and simplify the management of large-scale training projects. Key operational improvements cited include higher resource utilisation and fewer system disruptions.

DongJiang, Senior Platform Architect at iFLYTEK, said, "Before Volcano, coordinating training under large-scale GPU clusters across teams meant constant firefighting, from resource bottlenecks and job failures to debugging tangled training pipelines. Volcano gave us the flexibility and control to scale AI training reliably and efficiently. We're honoured to have our work recognized by CNCF, and we're excited to share our journey with the broader community at KubeCon + CloudNativeCon China."

Volcano is a cloud native batch system built on Kubernetes and designed to support performance-focused workloads such as artificial intelligence and machine learning training, big data processing, and scientific computing. Its features include job orchestration, resource fairness, and queue management, all intended to make distributed workloads more efficient to run. Volcano was accepted into the CNCF Sandbox in 2020 and reached the Incubating maturity level in 2022, reflecting increasing adoption for compute-intensive operations.

iFLYTEK's engineering team cited the need for infrastructure that could adapt to the rising scale and complexity of AI model training. Its objectives were to improve the allocation of computing resources, manage multi-stage workflows efficiently, and limit disruptions to jobs while ensuring equitable resource access among multiple internal teams.

The adoption of Volcano yielded several measurable outcomes for iFLYTEK's AI infrastructure. The company reported a 40% increase in GPU utilisation, contributing to lower infrastructure costs and reduced idle periods. It also reported 70% faster recovery from training job failures, which contributed to more consistent and uninterrupted AI development. Hyperparameter searches, a process integral to AI model optimisation, ran 50% faster, allowing the company's teams to test and refine models more quickly.

Chris Aniszczyk, Chief Technology Officer at CNCF, said, "iFLYTEK's case study shows how open source can solve complex, high-stakes challenges at scale. By using Volcano to boost GPU efficiency and streamline training workflows, they've cut costs, sped up development, and built a more reliable AI platform on top of Kubernetes, which is essential for any organization striving to lead in AI."

As artificial intelligence workloads grow more complex and more reliant on large-scale compute resources, the use of tools like Volcano has expanded among organisations looking to run those workloads more efficiently.

iFLYTEK will present its case study, titled "Scaling Large Model Training in Kubernetes Clusters with Volcano," at KubeCon + CloudNativeCon China. Company representatives will outline approaches to managing distributed model training within Kubernetes-based environments and share technical and practical insights with participants seeking to optimise large-scale artificial intelligence training infrastructure.
