As artificial intelligence (AI) technology rapidly develops, the scale and complexity of models, as well as the volume of data to be processed, are rising sharply. These trends make high-performance computing (HPC) increasingly necessary. HPC integrates powerful computing resources such as GPU and CPU clusters to provide the computing power needed to process and analyze large-scale data.
However, this also brings new challenges, especially for storage systems: how to handle massive amounts of data effectively while keeping data access efficient and operating costs under control.
Distributed file systems, as a cost-effective solution, are being widely adopted in AI and HPC scenarios. They distribute storage resources across multiple nodes to handle and manage large datasets effectively, meeting HPC's high demands for data access speed.
Renmin University of China, one of the top universities in China, has conducted research on file systems for AI. Our HPC center provides strong support for scientific research. We've conducted preliminary performance evaluations of mainstream distributed file systems, including Lustre, Alluxio, and JuiceFS.
In this article, we'll take a deep dive into the data patterns and characteristics of HPC, big data, and AI applications. Then, we'll share common performance evaluation tools and our evaluation results for several file systems. We hope this post provides a valuable reference for users choosing HPC and AI storage solutions.
Large-scale data application scenarios: HPC vs. big data vs. AI
High-performance computing
HPC plays a critical role in scientific research and engineering applications, such as climate prediction, protein folding, and computational fluid dynamics. Unlike the methods employed by machine learning and AI, HPC often relies on simulation and scientific formulas to solve complex problems.
In HPC, tasks can be divided into compute-intensive and data-intensive ones. For example, weather forecasting requires a large amount of computing resources to simulate weather and also needs to process and analyze large volumes of data. In contrast, tasks such as molecular dynamics simulations are compute-intensive and less dependent on data.
The following figure (source: Introduction to HPC: What are HPC & HPC Clusters?) shows an HPC cluster architecture:
HPC environments typically use high-bandwidth, low-latency networks such as InfiniBand to support fast data transfer. In terms of software configuration, efficient data communication between multiple nodes depends on standards like the message passing interface (MPI).
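To make this concrete, here's a minimal sketch of multi-process communication with MPI, using the mpi4py bindings (an assumption on our part; any MPI implementation with Python bindings would do). Each rank computes a partial result and rank 0 aggregates it:

```python
# A minimal sketch of inter-node communication with mpi4py (assumed
# installed); launch with e.g. `mpirun -n 4 python mpi_demo.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID within the job
size = comm.Get_size()   # total number of processes in the job

# Each rank contributes a partial result; rank 0 gathers the sum.
partial = float(rank)
total = comm.reduce(partial, op=MPI.SUM, root=0)

if rank == 0:
    print(f"sum over {size} ranks: {total}")
```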
In addition, the application of GPUs in HPC is increasing, which accelerates various computing tasks. Compared to traditional data centers, HPC clusters significantly differ in their network configuration and use of shared file systems.
Big data
Different from HPC, big data applications are more common in internet companies. This is mainly because these companies continuously collect large volumes of user data, which is usually a mix of structured data, such as user behavior logs, and unstructured data, such as images, audio, and video.
In these companies, big data engineers are often responsible for extract-transform-load (ETL) work. This involves extracting, transforming, and loading data into data warehouses, where data is processed and organized for analysis or machine learning use. Unlike compute-intensive tasks in HPC, big data applications focus more on data processing and analysis.
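As a toy illustration of such an ETL step, the following pandas sketch extracts raw events, cleans and transforms them, and loads them into a columnar file a warehouse could ingest (the file names and columns are hypothetical; production pipelines typically use engines like Spark):

```python
# A toy ETL step with pandas; file names and columns are hypothetical,
# and production pipelines would typically use an engine like Spark.
import pandas as pd

# Extract: newline-delimited JSON events exported from application logs.
events = pd.read_json("raw_events.jsonl", lines=True)

# Transform: drop incomplete rows, normalize timestamps, derive a date key.
events = events.dropna(subset=["user_id", "ts"])
events["ts"] = pd.to_datetime(events["ts"], unit="s")
events["date"] = events["ts"].dt.date

# Load: write a columnar file for the warehouse (needs pyarrow installed).
events.to_parquet("events_clean.parquet", index=False)
```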
The following figure (source: Data Warehouse Architecture: Traditional vs. Cloud Models) shows a big data analysis architecture:
Cost is a significant consideration in big data applications. Internet companies prefer to use cost-effective standard hardware and open-source software to control expenses. For example, open-source software like Hadoop is widely adopted due to its robust community support and maturity. Internet companies can choose suitable software solutions based on their needs, whether opting for open-source versions or selecting commercial versions for support.
Artificial intelligence (AI)
In recent years, the rapid development of the AI field has been widely acknowledged. AI workloads include training and inference. In particular, with the emergence of large models such as GPT and BERT, AI's dependence on HPC has deepened. These complex models urgently require powerful computing resources to shorten training time; at the same time, they bring challenges in efficiently managing large datasets.
The following figure (source: ML Workflows: What Can You Automate?) shows an AI pipeline:
In this context, file systems face specific requirements, especially when handling numerous small image or video files. These files typically range from a few kilobytes to several megabytes, imposing high demands on IOPS.
A notable characteristic of AI model training is that the same data is often used by multiple training jobs at the same time: machine learning engineers and scientists experiment with different hyperparameters, modify network architectures, and try various optimization techniques on the same dataset.
Therefore, AI model training is not only a compute-intensive task but also a process that requires meticulous adjustments and fine-tuning. This method of continuously adjusting parameters and model structures is key to achieving efficient and accurate AI models.
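To illustrate the access pattern, here's a minimal PyTorch input pipeline over a directory of small image files (the dataset path is a placeholder; torchvision is assumed). Every sample is a separate small file, so each epoch turns into a stream of small reads that stresses IOPS rather than bandwidth:

```python
# A minimal PyTorch input pipeline over many small image files.
# The dataset path is a placeholder; torchvision must be installed.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# ImageFolder opens one small file per sample, so a full epoch over
# ImageNet issues over a million small reads against the file system.
dataset = datasets.ImageFolder("/data/imagenet/train", transform=transform)
loader = torch.utils.data.DataLoader(
    dataset, batch_size=256, shuffle=True, num_workers=8
)

for images, labels in loader:
    pass  # forward/backward pass would go here
```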
Against this background, data preprocessing is particularly critical in AI model training. For example, the RefinedWeb pipeline behind the Falcon LLM demonstrates the importance of data preprocessing for large models: a long processing pipeline that starts from raw external web data and goes through multiple preprocessing steps, such as URL filtering and deduplication.
As the figure (source: The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only) below shows, each step gradually reduces the dataset size. After several processing stages, less than 10% of the data remains suitable for training large language models (LLMs). Thus, we can see that data preprocessing constitutes a substantial portion of the work in training large models, involving tasks like data cleaning, HTML tag extraction, ad filtering, and text recognition. These steps are crucial for ensuring data quality and training results.
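The sketch below shows, in miniature, what two of those steps might look like: URL filtering against a blocklist and exact deduplication by content hash. The records and blocked domains are made up for illustration; real pipelines such as RefinedWeb operate at a vastly larger scale with fuzzier matching:

```python
# An illustrative miniature of two preprocessing steps: URL filtering
# against a blocklist and exact deduplication by content hash.
# The records and blocked domains are made up for this example.
import hashlib

BLOCKED_DOMAINS = {"ads.example.com", "spam.example.org"}

raw_records = [
    {"url": "https://ads.example.com/x", "text": "buy now"},
    {"url": "https://news.example.net/a", "text": "hello world"},
    {"url": "https://news.example.net/b", "text": "hello world"},  # duplicate
]

seen_hashes = set()
corpus = []
for record in raw_records:
    # URL filtering: drop documents served from blocked domains.
    if any(domain in record["url"] for domain in BLOCKED_DOMAINS):
        continue
    # Exact deduplication: drop documents whose text was already seen.
    digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        continue
    seen_hashes.add(digest)
    corpus.append(record)

print(len(corpus))  # 1 of 3 records survives, mirroring heavy attrition
```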
Evolution of AI applications and distributed file systems
First, let's briefly review the evolution of file systems, where emerging needs at each stage have spurred the development of new-generation file systems.
- Lustre, one of the earliest distributed file systems, was designed for HPC and funded by the U.S. government.
- File systems and storage services like HDFS (part of Hadoop) and S3 emerged primarily to address the explosive growth of internet data, along with big-data-oriented systems like Ceph. These systems aimed to support big data applications.
- With the advent of Kubernetes, cloud-native data management and applications have gradually become the focus. This has led to the development of various emerging data management systems and applications like JuiceFS.
As the foundational infrastructure for data storage and access, file systems directly impact the training and inference processes of AI models. When applying file systems to AI scenarios, they often face two key challenges:
- High IOPS: The primary challenge is processing datasets with large numbers of small files, such as images and videos. This places high demands on a file system's IOPS: bandwidth is usually sufficient, but IOPS capability often limits performance.
- Portability: The complexity of AI workflows and collaboration across teams require file systems to be highly portable and compatible. This ensures support for various data processing and model training tools and facilitates the smooth exchange and processing of data between teams.
To improve performance, some solutions adopt high-performance storage media such as SSD or NVMe and implement local caching. Here's a brief introduction to a common concept in academia: burst buffer, which refers to a cache layer between the front-end computing processes and the back-end storage systems. Different file systems provide their unique solutions. We’ll introduce them one by one in the following text.
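As a rough illustration of the burst buffer idea, the following sketch serves reads from a fast local tier and stages files in from slow shared storage on a miss (the paths are hypothetical; a real burst buffer also handles eviction, write-back, and consistency):

```python
# A minimal read-through cache sketch: serve files from fast local
# storage when present; stage them in from the backing store on a miss.
# Paths are hypothetical; real burst buffers also handle eviction,
# write-back, and cross-node consistency.
import shutil
from pathlib import Path

CACHE_DIR = Path("/nvme/cache")          # fast local NVMe tier
BACKING_DIR = Path("/mnt/parallel_fs")   # slow shared storage

def open_cached(relpath):
    cached = CACHE_DIR / relpath
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(BACKING_DIR / relpath, cached)  # stage in on miss
    return open(cached, "rb")

with open_cached("dataset/shard-0001.bin") as f:
    data = f.read()
```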
In addressing compatibility issues, providing POSIX interfaces ensures compatibility and portability and also lowers users' learning costs. POSIX (Portable Operating System Interface) defines the standard file interfaces of Unix-like operating systems such as Linux, so a POSIX-compatible file system lets users work with a distributed file system just as they would with their local machine. However, from a development perspective, implementing POSIX compatibility may incur additional development cost and complexity. Products like HDFS and Amazon S3 do not offer full POSIX compatibility; they require developers or users to use their predefined APIs or tools, which may mean changing existing habits and learning new tools.
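The difference is easy to see in code. Below, the same object is read once through a POSIX path on a mounted file system and once through the S3 API with boto3 (the mount point, bucket, and key are placeholders; boto3 and configured credentials are assumed):

```python
# The same read expressed both ways. Mount point, bucket, and key are
# placeholders; boto3 and AWS credentials are assumed to be configured.
import boto3

# POSIX: any tool that can read a local path works unchanged.
with open("/mnt/jfs/dataset/sample.txt", "rb") as f:
    data_posix = f.read()

# Non-POSIX: code must be written against the storage service's own API.
s3 = boto3.client("s3")
resp = s3.get_object(Bucket="my-bucket", Key="dataset/sample.txt")
data_s3 = resp["Body"].read()
```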
The table below compares POSIX and non-POSIX file systems:
Category | Pros | Cons | File systems |
---|---|---|---|
POSIX | Devs, ops, and software rely on POSIX; portable | Implementation overhead | Lustre, JuiceFS |
Non-POSIX | Low cost | Limited capabilities; users must write extra code | HDFS, S3 |
Next, we’ll introduce several file systems suitable for AI scenarios to help you better understand the advantages and limitations of various file systems in AI applications.
Lustre
Lustre performs well in HPC scenarios because it was designed to meet the needs of supercomputers. Its support for the POSIX protocol and development using C and C++ makes Lustre very suitable for handling large-scale parallel computing tasks.
The figure below shows Lustre’s architecture:
Lustre's architecture is particularly suitable for typical HPC cluster environments, including high-speed data transmission over an InfiniBand network. The cluster contains multiple computing nodes (“Computation” in the figure); metadata is handled by dedicated metadata servers (“Data Management” in the figure), while the actual file data is stored on object storage servers (“Object Storage” in the figure). This design provides optimized support for MPI applications. In particular, MPI-IO, which lets multiple processes read from and write to the same file simultaneously, is especially important for parallel computing and scientific research applications.
The direct use of MPI-IO is less common in AI and machine learning workloads, mainly because file operations there largely involve writing checkpoints rather than complex distributed I/O. However, MPI-IO remains highly important for scientific computing applications that require parallel processing and complex data interaction.
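For reference, here's a minimal MPI-IO sketch using mpi4py, in which every rank writes a non-overlapping slice of one shared file, the pattern parallel scientific codes rely on (the file name and sizes are illustrative):

```python
# A minimal MPI-IO sketch with mpi4py: every rank writes its own slice
# of one shared file at a non-overlapping byte offset. File name and
# block size are illustrative; run with `mpirun -n 4 python mpiio_demo.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank produces a fixed-size block of data.
block = np.full(1024, rank, dtype=np.int32)

fh = MPI.File.Open(comm, "shared.dat",
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
# Write_at_all is a collective call: all ranks write concurrently,
# each at its own offset within the same file.
fh.Write_at_all(rank * block.nbytes, block)
fh.Close()
```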
In terms of caching, Lustre recently introduced a feature called Persistent Client Cache (PCC). However, in practice, it requires substantial configuration work from operations staff.
Alluxio
We attempted to deploy Alluxio as a data caching layer. Alluxio is designed mainly as a cache above underlying storage systems, such as HDFS or S3, to accelerate data access. It's particularly suitable for big data scenarios, for example, paired with processing engines like Spark.
However, our tests in AI and machine learning scenarios showed that its performance did not meet our expectations. In AI scenarios, especially when a job first touches a dataset containing a large number of small files, the initial load into Alluxio is extremely slow. Taking the ImageNet dataset as an example, we observed that the initial load into Alluxio could take several hours, which severely affected performance. According to recent news, Alluxio has optimized for this issue and now offers a storage solution designed specifically for AI, but it's currently commercial only and not open source.
The following figure (source: Alluxio architecture) shows the Alluxio architecture:
JuiceFS
We conducted preliminary proof of concept (PoC) tests on JuiceFS Community Edition and have not yet fully deployed it in our production environment.
JuiceFS is a high-performance, cloud-native, distributed file system designed to provide efficient data storage and access in cloud environments.
In JuiceFS, data is stored in chunks in object storage such as S3, while metadata is stored in a database such as Redis, MySQL, or PostgreSQL. This design makes metadata management efficient and flexible. Clients mount JuiceFS via FUSE (Filesystem in Userspace), which lets it run seamlessly on various operating systems. In addition, JuiceFS offers a POSIX file system interface, which is familiar to users, and its local caching feature directly serves as a burst buffer.
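As a small sketch of what this looks like in practice with Community Edition: once a volume is formatted and mounted (the Redis address, bucket, and paths below are placeholders), ordinary POSIX file operations work unchanged:

```python
# Placeholders throughout: Redis address, bucket, volume name, and paths.
# Shell setup with JuiceFS Community Edition (run once, outside Python):
#   juicefs format --storage s3 --bucket https://mybucket.s3.amazonaws.com \
#       redis://127.0.0.1:6379/1 myjfs
#   juicefs mount redis://127.0.0.1:6379/1 /jfs
import os

# After mounting, the volume behaves like any local directory:
os.makedirs("/jfs/experiments", exist_ok=True)

with open("/jfs/experiments/config.yaml", "w") as f:
    f.write("batch_size: 256\n")

with open("/jfs/experiments/config.yaml") as f:
    print(f.read())  # ordinary POSIX reads and writes, no special SDK
```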
The figure below shows the architecture of JuiceFS Community Edition:
From a cost perspective, JuiceFS' operating costs are much lower than traditional disk arrays. This is mainly due to its cloud-native design, which effectively utilizes cloud storage resources and reduces reliance on physical hardware.
Benchmark tools and PoC results
When evaluating file system performance, benchmark tools like IOzone, MDTest, and fio are widely used to test IOPS and bandwidth. However, standard benchmark tools cannot fully reflect the requirements and load characteristics of real-world applications. To obtain performance evaluations closer to actual workloads, the MLPerf benchmark suite was designed.
MLPerf evaluates systems by simulating real AI workloads and provides a series of tests. In the computer vision field, it uses residual neural network (ResNet) models; in text processing, it uses GPT-3 or Bidirectional Encoder Representations from Transformers (BERT) models and tests on the Wikipedia or C4 datasets.
The following table outlines standard AI performance benchmarks across domains, including vision, language processing, and recommendation systems. It specifies the dataset used in each test, the quality target, such as accuracy or error rate, and the reference model, like ResNet, BERT, or GPT-3. This provides a unified framework for evaluating and comparing AI model performance.
Area | Benchmark | Dataset | Quality target | Reference implementation model |
---|---|---|---|---|
Vision | Image classification | ImageNet | 75.90% classification accuracy | ResNet-50 v1.5 |
Vision | Image segmentation (medical) | KiTS19 | 0.908 mean DICE score | 3D U-Net |
Vision | Object detection (lightweight) | Open Images | 34.0% mAP | RetinaNet |
Vision | Object detection (heavyweight) | COCO | 0.377 box min AP and 0.339 mask min AP | Mask R-CNN |
Language | Speech recognition | LibriSpeech | 0.058 word error rate | RNN-T |
Language | NLP | Wikipedia 2020/01/01 | 0.72 Mask-LM accuracy | BERT-large |
Language | LLM | C4 | 2.69 log perplexity | GPT-3 |
Commerce | Recommendation | Criteo 4TB multi-hot | 0.8032 AUC | DLRM-dcnv2 |
We used fio to compare Lustre, JuiceFS, and XFS. Below are our preliminary test results:
- In terms of read IOPS and bandwidth, Lustre configured with all-NVMe flash showed very high performance.
- In the ImageNet training test on PyTorch, all file systems completed the task in similar times, with JuiceFS + S3 and XFS + local SSD tied for the fastest.
- In the LLM checkpoint test, Lustre + all flash took only 1 minute, much faster than the 10 minutes of Lustre + HDD.
Item | Lustre + all flash | Lustre + HDD | JuiceFS + S3 | XFS + local SSD |
---|---|---|---|---|
Fio IOPS READ | 2,700k | 20k | 14k | 40k |
Fio BW READ | 30 GB/s | 12 GB/s | 2.6 GB/s | 0.9 GB/s |
ImageNet PyTorch | 1,600s | 1,640s | 1,570s | 1,570s |
LLM checkpoint (LLaMA 70B) | 1 min | 10 min | | |
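For context, the sketch below shows the general shape of a fio random-read run like the ones behind the IOPS row above, driven from Python; the job parameters here are illustrative, not the exact values we used:

```python
# An illustrative fio random-read job driven from Python. These
# parameters are examples only, not the exact values behind our table;
# fio must be installed on the test node.
import subprocess

def run_fio(test_dir):
    subprocess.run([
        "fio",
        "--name=randread",
        f"--directory={test_dir}",
        "--rw=randread",        # random reads, the IOPS-bound pattern
        "--bs=4k",              # small blocks stress IOPS, not bandwidth
        "--size=4G",
        "--numjobs=8",
        "--iodepth=32",
        "--ioengine=libaio",
        "--direct=1",           # bypass the page cache
        "--group_reporting",
    ], check=True)

run_fio("/mnt/lustre/fio-test")  # repeat per file system under test
```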
Conclusion
This article discussed scenarios that require processing large-scale data, such as HPC, big data, and AI, and their data characteristics. It introduced common file systems suitable for AI scenarios, with their pros and cons, and showcased internal test results from Renmin University of China's HPC center.
When choosing a file system, enterprises should consider not only benchmark results but also actual application needs and costs. Companies with strong operational capabilities may prefer cost-effective open-source solutions. The final choice should weigh costs, operational capabilities, and other factors comprehensively.
If you have any questions about this article, feel free to join JuiceFS discussions on GitHub and their community on Slack.