MiniMax Built a Cost-Effective, High-Performance AI Platform with JuiceFS

2024-09-02
Juicedata Team

Founded in December 2021, MiniMax is a leading general artificial intelligence (AI) technology company dedicated to co-creating intelligence with users. The company has developed various general foundation models, including trillion-parameter MoE text models, voice models, and image models. Notable releases include video-01, music-01, and abab 6.5s models. Their Talkie app, which allows users to interact with virtual chatbot characters, is currently ranked No. 4 in entertainment apps on Google Play in the US.

Faced with large data volumes, POSIX compatibility requirements, and a hybrid cloud architecture, MiniMax chose JuiceFS Enterprise Edition for its scalability, cost-efficiency, and data management capabilities. Based on JuiceFS, they built a high-performance AI platform that integrates caching and metadata services to optimize performance and reduce costs.

In this article, we'll dive into MiniMax’s storage challenges, why they chose JuiceFS, and how JuiceFS addressed their storage needs.

Storage challenges in multimodal foundation model development

As a startup, MiniMax focused on flexibility and cost efficiency when building its infrastructure. The company chose to deploy critical workloads (such as GPU resources) in local data centers, while other components were deployed in the cloud to leverage cloud platform technologies, elasticity, and flexibility.

Consequently, MiniMax adopted a hybrid cloud solution combining local data centers with multi-cloud environments. To manage the complexity and challenges of underlying infrastructure, the company utilized Kubernetes as a unified management layer.

Key storage challenges included:

  • High performance: Training and inference of foundation models required handling and storing vast amounts of data. This demanded high-capacity storage solutions with fast read and write speeds.
  • POSIX compatibility: Deep learning frameworks and algorithm engineers relied on POSIX interfaces. This required full POSIX compatibility in the storage system to ensure proper functioning of AI tasks.
  • Hybrid cloud architecture: Compute resources, especially GPUs, were distributed across different geographic regions and service providers. Therefore, the storage system had to support various vendors and hardware environments, with high flexibility for cross-region data replication, access, and migration.
  • Storage cost optimization: As data volume grew, especially in big data and AI applications, balancing storage capacity expansion with cost control was a major challenge. MiniMax needed cost-effective storage technologies that integrated seamlessly with its existing IT architecture.

Why choose JuiceFS Enterprise Edition

During the initial selection process, MiniMax evaluated CephFS, which had bottlenecks in metadata services. The company also tested high-performance file storage solutions from public cloud services but found them too costly. MiniMax sought a storage system that offered good flexibility, high scalability, cost-effectiveness, and compatibility with a hybrid cloud architecture.

Finally, MiniMax chose JuiceFS Enterprise Edition as the storage foundation for its company-wide AI platform. JuiceFS supports various models (including text, voice, image, and multimodal) and meets high-performance data access needs for data cleaning, model training, and inference. Particularly in distributed training scenarios on ultra-large-scale GPU clusters, JuiceFS' performance has been crucial for model iteration and GPU utilization. Its key advantages include:

  • Compatibility: Supports POSIX, HDFS, and S3 interfaces, providing a unified storage solution that reduces data copying and migration.
  • Optimized I/O efficiency: Enhances I/O performance significantly through multi-level caching, readahead, and concurrent reading strategies.
  • High-performance metadata services: Custom metadata services handle millions of requests per second with sub-millisecond response times, meeting the stringent demands of AI training stages.
  • Multi-cloud/hybrid cloud data management: Automatic cross-cloud and cross-region data replication keeps data in sync as compute workloads migrate, which suits globally distributed computing.
  • Cost efficiency: JuiceFS' object storage-based design significantly reduces data storage costs, allowing MiniMax to leverage cost-effective object storage. In addition, JuiceFS’ simple operation helps lower overall costs for MiniMax.
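Because the file system is exposed through a POSIX interface, training and data-cleaning code can use ordinary file I/O against the mount point. The following is a minimal Python sketch of reading a dataset directory concurrently in large sequential chunks, the access pattern that readahead and concurrent reading are meant to serve. The function names, chunk size, worker count, and the mount path in the usage note are illustrative assumptions, not details from MiniMax's setup.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_file(path, chunk_size=4 * 1024 * 1024):
    """Read one file sequentially in large chunks; return bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def read_dataset(root, workers=16):
    """Read every file under root concurrently; return {path: bytes_read}."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
    ]
    # Threads are sufficient here: the work is I/O-bound, so the GIL is
    # released while each worker waits on the file system.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sizes = list(pool.map(read_file, paths))
    return dict(zip(paths, sizes))
```

For example, `read_dataset("/mnt/jfs/dataset")` would walk a hypothetical mount point and read all files in parallel. Nothing in the code is specific to JuiceFS, which is the point of POSIX compatibility.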

Building a unified storage system in a hybrid cloud architecture

Initially, MiniMax faced frequent read operations with few writes. To address this, the company utilized JuiceFS' distributed cache feature. They used NVMe SSDs to accelerate read operations. As data processing needs and cluster size expanded, a single cluster's capacity no longer met MiniMax’s requirements. Consequently, MiniMax built a centralized metadata distribution engine and an edge cluster architecture for high-speed read/write operations based on JuiceFS.

MiniMax used JuiceFS' mirror file system feature to automatically replicate metadata from the central cluster to edge clusters. Edge clusters do not store actual data but use dedicated connections to warm up data from the central cluster into JuiceFS' distributed cache. This improves read bandwidth and reduces data duplication.

In addition, data can be downloaded on demand to edge JuiceFS clusters and managed through lifecycle management by upper-layer services.
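The central/edge relationship described above can be illustrated with a toy in-memory model. This is only a conceptual sketch under stated assumptions, not how JuiceFS implements the mirror file system: the class and method names (`CentralCluster`, `EdgeCluster`, `sync_metadata`, `warm_up`) are hypothetical, and whole files stand in for data blocks. It shows an edge cluster that mirrors metadata, holds no data of its own, and fills its cache either by explicit warm-up or on demand at first read.

```python
class CentralCluster:
    """Toy central cluster: holds both metadata and the actual data."""

    def __init__(self):
        self.metadata = {}  # path -> size in bytes
        self.data = {}      # path -> file contents

    def write(self, path, content):
        self.data[path] = content
        self.metadata[path] = len(content)


class EdgeCluster:
    """Toy edge cluster: mirrored metadata plus a cache, no primary data."""

    def __init__(self, central):
        self.central = central
        self.metadata = {}     # replicated from the central cluster
        self.cache = {}        # path -> cached contents
        self.remote_reads = 0  # fetches over the dedicated connection

    def sync_metadata(self):
        """Replicate metadata only; no file contents are copied."""
        self.metadata = dict(self.central.metadata)

    def _fetch(self, path):
        self.remote_reads += 1
        self.cache[path] = self.central.data[path]

    def warm_up(self, paths):
        """Pull selected files into the cache ahead of time."""
        for path in paths:
            if path in self.metadata and path not in self.cache:
                self._fetch(path)

    def read(self, path):
        """Serve from cache; fetch on demand on a cache miss."""
        if path not in self.metadata:
            raise FileNotFoundError(path)
        if path not in self.cache:
            self._fetch(path)
        return self.cache[path]
```

In this model, a file that was warmed up is read without another trip to the central cluster, while a file that was not warmed up costs one remote fetch on first access and none afterward.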

With this unified storage system, MiniMax established a large nationwide ring network with core access points in key cities. Data centers are closely connected to these access points to support efficient data distribution.

"JuiceFS not only provided a storage system that met our hybrid cloud needs but also optimized data processing with high-performance metadata services and multi-interface compatibility. This significantly reduced our operational costs. It’s the ideal choice for running our foundation model platform."

—Long Xing, Technical Director at MiniMax

If you have any questions about this article, feel free to join JuiceFS discussions on GitHub and our community on Slack.
