<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>JuiceFS Blog</title><link>https://www.juicefs.com/en/blog/</link><description>Latest news from JuiceFS</description><atom:link href="http://juicefs.com/en/blog/latest/feed/" rel="self"/><language>en</language><lastBuildDate>Thu, 18 Jun 2026 06:10:00 +0000</lastBuildDate><item><title>JuiceFS 1.4: Faster Metadata Operations with Batch Unlink, Batch Clone, and Redis Client-Side Caching</title><link>https://www.juicefs.com/en/blog/engineering/improve-metadata-operation-performance</link><description>&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;In large-scale file access scenarios such as AI training and dataset management, metadata often becomes the first performance bottleneck as file counts and concurrency grow. Whether you're deleting millions of small files, cloning large datasets, or traversing directories under heavy concurrency, metadata performance directly impacts application efficiency.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://juicefs.com/docs/community/introduction"&gt;JuiceFS Community Edition&lt;/a&gt; 1.4 introduces three major metadata optimizations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Batch unlink&lt;/strong&gt; for large-scale file deletion  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Batch clone&lt;/strong&gt; for metadata cloning  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis client-side caching&lt;/strong&gt; for hot metadata reads&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These improvements reduce transaction commits, network round trips, and redundant metadata lookups. In tests on a flat directory containing 100,000 files, batch unlink improved performance by up to &lt;strong&gt;93×&lt;/strong&gt;, while batch clone achieved up to &lt;strong&gt;24×&lt;/strong&gt; speedup.&lt;/p&gt;
&lt;p&gt;In this article, we’ll explain the motivation, design, and performance benefits behind these optimizations.&lt;/p&gt;
&lt;h2&gt;Deletion: From one‑by‑one to batched transactions&lt;/h2&gt;
&lt;p&gt;Under &lt;a href="https://juicefs.com/docs/community/architecture"&gt;JuiceFS' metadata-data separation architecture&lt;/a&gt;, deleting a file involves much more than removing a directory entry. The system must also:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Update inode reference counts  &lt;/li&gt;
&lt;li&gt;Reclaim inode and space resources  &lt;/li&gt;
&lt;li&gt;Process trash entries  &lt;/li&gt;
&lt;li&gt;Update quota statistics&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These operations must typically be completed within the same transaction.&lt;/p&gt;
&lt;p&gt;When a directory contains hundreds of thousands or even millions of files, the traditional file-by-file deletion approach used by &lt;code&gt;rm -rf&lt;/code&gt; quickly becomes a bottleneck. Each &lt;code&gt;unlink&lt;/code&gt; request goes through the &lt;a href="https://www.kernel.org/doc/html/next/filesystems/fuse.html"&gt;FUSE protocol&lt;/a&gt;, switches between kernel and user space, and triggers a separate metadata transaction.&lt;/p&gt;
&lt;p&gt;As the number of files grows, the overhead from system calls, context switches, network round trips, and transaction commits accumulates rapidly.&lt;/p&gt;
&lt;p&gt;To mitigate this issue, JuiceFS previously introduced the &lt;code&gt;juicefs rmr&lt;/code&gt; command. Unlike &lt;code&gt;rm -rf&lt;/code&gt;, &lt;code&gt;rmr&lt;/code&gt; bypasses the FUSE layer and sends deletion requests directly to the client. It also supports multi-threaded deletion (50 threads by default), significantly improving throughput.&lt;/p&gt;
&lt;p&gt;However, each file deletion still requires its own metadata transaction. Deleting 100,000 files still means executing 100,000 transactions.&lt;/p&gt;
&lt;p&gt;Batch unlink goes one step further: it merges many independent deletion operations within the same directory into a single batch transaction, which cuts both the number of transaction commits and the number of network round trips.&lt;/p&gt;
&lt;h3&gt;Core design&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The key is to turn many small transactions into fewer large ones. JuiceFS adds a batch unlink interface at the metadata engine layer. It allows the client to delete multiple non‑directory files under the same directory in one call.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When recursively clearing a directory, JuiceFS reduces deletion overhead in two ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Different subdirectories are handled concurrently with multi‑threaded deletion.  &lt;/li&gt;
&lt;li&gt;Inside each directory, normal files and symlinks are grouped into batches and sent to &lt;code&gt;BatchUnlink&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This merges many unlink operations into fewer batch transactions at the metadata level.&lt;br&gt;
It's important to note that &lt;code&gt;BatchUnlink&lt;/code&gt; does not directly delete directories. Directory removal still follows the standard recursive workflow: empty the subdirectory first, and then delete the subdirectory itself.  Therefore, &lt;code&gt;BatchUnlink&lt;/code&gt; only applies to regular files and symbolic links within the same directory.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This restriction preserves correct recursive deletion semantics while avoiding consistency risks to the directory tree structure.&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
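&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;The recursive workflow described above can be sketched in a few lines of Python. This is a minimal in-memory model, not the JuiceFS implementation: the &lt;code&gt;Dir&lt;/code&gt; class, the &lt;code&gt;remove_tree&lt;/code&gt; function, and the batch size of 250 are assumptions made for illustration, and subdirectories are visited sequentially here, whereas JuiceFS handles them with concurrent threads.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;from dataclasses import dataclass, field

BATCH_SIZE = 250  # assumed batch size for this sketch


@dataclass
class Dir:
    # files: names of regular files and symlinks; subdirs: name -> Dir
    files: list = field(default_factory=list)
    subdirs: dict = field(default_factory=dict)


def remove_tree(directory, stats=None):
    """Empty a directory recursively and count metadata transactions."""
    if stats is None:
        stats = {'batch_unlink': 0, 'rmdir': 0, 'files': 0}
    # Directories are never batched: empty each subdirectory, then remove it.
    for name in list(directory.subdirs):
        remove_tree(directory.subdirs[name], stats)
        del directory.subdirs[name]
        stats['rmdir'] += 1
    # Non-directory entries of this one directory are unlinked in batches.
    for start in range(0, len(directory.files), BATCH_SIZE):
        batch = directory.files[start:start + BATCH_SIZE]
        stats['batch_unlink'] += 1  # one transaction per batch
        stats['files'] += len(batch)
    directory.files.clear()
    return stats
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Under the assumed batch size, a flat directory of 100,000 files would need 400 batch transactions in this model instead of 100,000 single-file transactions.&lt;/p&gt;&lt;/div&gt;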
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;JuiceFS delete operations&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Optimization path of JuiceFS delete operations&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;h3&gt;Implementation across metadata engines&lt;/h3&gt;
&lt;p&gt;JuiceFS uses different batching strategies depending on the &lt;a href="https://juicefs.com/docs/community/databases_for_metadata/"&gt;metadata backend&lt;/a&gt; to minimize transaction commits and network round trips.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SQL backends (MySQL, PostgreSQL, etc.):&lt;/strong&gt; Previously, each file deletion required its own sequence of &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, and &lt;code&gt;UPDATE&lt;/code&gt; statements. With &lt;code&gt;BatchUnlink&lt;/code&gt;, the system:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Fetches all edge records for the target entries in a single batch query.  &lt;/li&gt;
&lt;li&gt;Retrieves the relevant inode attributes in a single locked batch query.  &lt;/li&gt;
&lt;li&gt;Executes edge deletions, inode state updates (decrementing nlink or marking for cleanup), and delfile entry insertions — all within one transaction.  &lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Instead of executing one transaction per file, the entire batch can now be completed in a single transaction.&lt;/p&gt;
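&lt;p&gt;The three steps above can be illustrated with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. This is only a sketch under stated assumptions: the &lt;code&gt;edge&lt;/code&gt;, &lt;code&gt;node&lt;/code&gt;, and &lt;code&gt;delfile&lt;/code&gt; tables are a simplified, hypothetical schema, &lt;code&gt;batch_unlink&lt;/code&gt; is an illustrative name, and SQLite has no row locks, so the locked read is only noted in a comment.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;import sqlite3


def setup(conn):
    # Hypothetical, simplified schema; real metadata tables have more columns.
    conn.executescript('''
        CREATE TABLE edge (parent INTEGER, name TEXT, inode INTEGER,
                           PRIMARY KEY (parent, name));
        CREATE TABLE node (inode INTEGER PRIMARY KEY, nlink INTEGER, length INTEGER);
        CREATE TABLE delfile (inode INTEGER PRIMARY KEY, length INTEGER);
    ''')


def batch_unlink(conn, parent, names):
    """Unlink several entries of one directory inside a single transaction."""
    marks = ','.join('?' for _ in names)
    with conn:  # commits on success, rolls back on error
        # 1. Fetch all edge records for the target entries in one query.
        edges = conn.execute(
            f'SELECT name, inode FROM edge WHERE parent = ? AND name IN ({marks})',
            [parent, *names]).fetchall()
        inodes = [inode for _, inode in edges]
        imarks = ','.join('?' for _ in inodes)
        # 2. Fetch the inode attributes in one query
        #    (a server database would lock these rows, e.g. with FOR UPDATE).
        nodes = conn.execute(
            f'SELECT inode, nlink, length FROM node WHERE inode IN ({imarks})',
            inodes).fetchall()
        # 3. Apply every change with batched statements.
        conn.executemany('DELETE FROM edge WHERE parent = ? AND name = ?',
                         [(parent, name) for name, _ in edges])
        conn.executemany('UPDATE node SET nlink = nlink - 1 WHERE inode = ?',
                         [(inode,) for inode in inodes])
        # Files whose last link is gone are queued for cleanup.
        conn.executemany('INSERT INTO delfile (inode, length) VALUES (?, ?)',
                         [(inode, length) for inode, nlink, length in nodes
                          if nlink == 1])
    return len(edges)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The sketch ignores hard links that appear twice in the same batch, trash handling, and quota updates, all of which a real metadata engine has to cover.&lt;/p&gt;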
&lt;p&gt;&lt;strong&gt;Redis backend:&lt;/strong&gt; &lt;strong&gt;The optimization uses Redis pipelines and transactions.&lt;/strong&gt; Where individual deletions previously required separate command round trips, &lt;code&gt;BatchUnlink&lt;/code&gt; collects all &lt;code&gt;HDEL&lt;/code&gt; (dentry removal), &lt;code&gt;ZADD&lt;/code&gt; (enqueue for cleanup), &lt;code&gt;SET&lt;/code&gt; (inode attribute update), and &lt;code&gt;INCRBY&lt;/code&gt; (counter update) commands for multiple files into a single pipeline, executed atomically within one &lt;code&gt;MULTI&lt;/code&gt;/&lt;code&gt;EXEC&lt;/code&gt; transaction. To avoid blocking Redis' single-threaded event loop for too long, batch size is capped at 250 entries.&lt;/p&gt;
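&lt;p&gt;As a rough illustration of the Redis strategy, the following Python sketch groups entries into batches of at most 250 and assembles the commands that would go into each &lt;code&gt;MULTI&lt;/code&gt;/&lt;code&gt;EXEC&lt;/code&gt; block. It only builds command tuples and does not talk to a server. The function name, the key names, and the stored values are hypothetical placeholders, not the keys JuiceFS actually uses.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;MAX_BATCH = 250  # cap mentioned above; keeps each MULTI/EXEC block short


def build_unlink_batches(parent, entries, now):
    """Return one command list per batch of (name, inode, size) entries."""
    batches = []
    for start in range(0, len(entries), MAX_BATCH):
        commands = [('MULTI',)]
        for name, inode, size in entries[start:start + MAX_BATCH]:
            # Key names and values below are hypothetical placeholders.
            commands.append(('HDEL', f'dir:{parent}', name))  # remove the dentry
            commands.append(('ZADD', 'delfiles', now, f'{inode}:{size}'))  # enqueue for cleanup
            commands.append(('SET', f'inode:{inode}', 'nlink=0'))  # update inode attributes
            commands.append(('INCRBY', 'usedSpace', -size))  # update a counter
        commands.append(('EXEC',))
        batches.append(commands)
    return batches
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each returned list corresponds to one pipeline, so the number of round trips scales with the number of batches rather than with the number of files.&lt;/p&gt;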
&lt;p&gt;&lt;strong&gt;TiKV backend:&lt;/strong&gt; &lt;code&gt;BatchUnlink&lt;/code&gt; consolidates multiple deletions into a single transaction, using TiKV's batch write capability to reduce network round trips and transaction overhead. &lt;strong&gt;For distributed key-value backends, this kind of batching allows the backend's concurrent write capacity to be more fully utilized.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The figure below shows benchmark results on a flat directory of 100,000 files using &lt;code&gt;juicefs rmr --threads 16&lt;/code&gt;. &lt;code&gt;BatchUnlink&lt;/code&gt; delivers meaningful improvements across all metadata backends, with TiKV and Redis showing the largest gains.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;deletion&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Throughput improvement for batch deletion across metadata backends&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;h2&gt;Clone: From one‑by‑one copy to batched references&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://juicefs.com/docs/community/guide/clone/"&gt;&lt;code&gt;juicefs clone&lt;/code&gt;&lt;/a&gt; creates fast copies of files or directories for training dataset version management, experiment snapshots, and large-scale directory duplication. Its efficiency comes from the fact that cloning doesn't immediately copy the underlying data blocks. Instead, it creates new file records at the metadata layer and reuses the source file's existing block references. New data blocks are only allocated when the clone is actually written to. This avoids the time and storage overhead of a full copy.&lt;/p&gt;
&lt;p&gt;For large directory clones, the same problem as deletion arises: processing files one by one generates a large number of short transactions and network round trips. &lt;strong&gt;The core idea behind batch clone is to merge the clone operations for multiple files in the same directory into a single batch transaction.&lt;/strong&gt; When recursively cloning a directory, the system reads directory entries in batches as a stream. For each batch, all non-directory entries are collected and cloned together in one operation.&lt;/p&gt;
&lt;p&gt;One key implementation detail is &lt;strong&gt;inode pre-allocation&lt;/strong&gt;: before entering the transaction, the system uses &lt;code&gt;nextInode&lt;/code&gt; to pre-allocate target inodes for all entries to be cloned. This avoids lock contention from repeatedly requesting inodes inside the transaction. Once inside the transaction, the system batch-queries all source file attributes (with row locks), builds all the insertion data for target nodes, edges, chunks, symlinks, and xattrs, and then inserts everything in a single batch.&lt;/p&gt;
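&lt;p&gt;The pre-allocation idea can be sketched as follows. This is an in-memory model under stated assumptions: &lt;code&gt;InodeAllocator&lt;/code&gt; stands in for the engine's &lt;code&gt;nextInode&lt;/code&gt; counter, &lt;code&gt;plan_batch_clone&lt;/code&gt; and the row layouts are hypothetical, and symlinks and xattrs are left out for brevity.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;import itertools


class InodeAllocator:
    """Hands out inode numbers; a stand-in for the engine's inode counter."""

    def __init__(self, start=2):
        self._counter = itertools.count(start)
        self.calls = 0

    def reserve(self, n):
        # One reservation for the whole batch instead of one per file.
        self.calls += 1
        return [next(self._counter) for _ in range(n)]


def plan_batch_clone(allocator, dst_parent, sources):
    """Build the rows a batch clone would insert in one transaction.

    sources is a list of (name, src_inode, chunk_ids) tuples.
    """
    # Before the transaction: pre-allocate every target inode.
    new_inodes = allocator.reserve(len(sources))
    nodes, edges, chunks = [], [], []
    # Inside the transaction (not modeled here): read the source attributes,
    # then insert all of the rows built below in one batch.
    for (name, src_inode, chunk_ids), inode in zip(sources, new_inodes):
        nodes.append({'inode': inode, 'cloned_from': src_inode})
        edges.append({'parent': dst_parent, 'name': name, 'inode': inode})
        # The clone reuses the source's block references; no data is copied.
        chunks.extend({'inode': inode, 'chunk': c} for c in chunk_ids)
    return nodes, edges, chunks
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the allocator is touched once per batch, the transaction itself only reads source rows and inserts target rows.&lt;/p&gt;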
&lt;p&gt;Batch clone uses each backend's native batch write capabilities in a similar way to batch unlink. The per-backend implementation details won't be repeated here.&lt;/p&gt;
&lt;p&gt;The performance gains vary across backends depending on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transaction models  &lt;/li&gt;
&lt;li&gt;Network communication overhead  &lt;/li&gt;
&lt;li&gt;Batch insertion efficiency for metadata records such as nodes, edges, and chunk references&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Results on a flat directory of 100,000 files are shown below. MySQL sees the largest improvement at approximately 24x; Redis at approximately 5x; TiKV at approximately 2x.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;clone&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Throughput improvement for batch clone across metadata backends&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;h2&gt;Redis client-side caching: Keeping hot metadata local&lt;/h2&gt;
&lt;p&gt;In high-concurrency metadata workloads such as AI training dataset access and large-scale container startup, network round trips between JuiceFS clients and Redis often become a major performance bottleneck.&lt;/p&gt;
&lt;p&gt;Consider the following operation:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;open(&amp;quot;/mnt/jfs/dataset/images/cat.jpg&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Before the file can be opened, the Linux Virtual File System (VFS) must resolve every component in the path:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Look up &lt;code&gt;dataset&lt;/code&gt;.  &lt;/li&gt;
&lt;li&gt;Look up &lt;code&gt;images&lt;/code&gt;.  &lt;/li&gt;
&lt;li&gt;Look up &lt;code&gt;cat.jpg&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;VFS&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;VFS path resolution&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;If the &lt;code&gt;images&lt;/code&gt; directory contains hundreds of thousands of files and training jobs perform random access across the dataset, each lookup requires a &lt;code&gt;GET&lt;/code&gt; request to Redis.&lt;br&gt;
Under heavy concurrency, this results in large numbers of network round trips and increased Redis CPU utilization. &lt;strong&gt;Even though a single Redis query takes only a few dozen microseconds, network latency pushes each lookup to hundreds of microseconds or even milliseconds. When thousands of training processes are accessing files simultaneously, this overhead becomes significant.&lt;/strong&gt;&lt;/p&gt;
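&lt;p&gt;To make the cost concrete, here is a small Python model of path resolution that counts how many lookups one &lt;code&gt;open&lt;/code&gt; triggers. The &lt;code&gt;resolve&lt;/code&gt; function and the &lt;code&gt;lookup&lt;/code&gt; callback are hypothetical, the root inode number of 1 is an assumption, and the model ignores any caching done by the kernel.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;def resolve(path, mount_point, lookup, root_inode=1):
    """Walk a path below mount_point, calling lookup(parent, name) per component."""
    relative = path[len(mount_point):]
    inode, calls = root_inode, 0
    for name in [part for part in relative.split('/') if part]:
        inode = lookup(inode, name)  # without a local cache: one remote GET
        calls += 1
    return inode, calls
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For &lt;code&gt;/mnt/jfs/dataset/images/cat.jpg&lt;/code&gt; mounted at &lt;code&gt;/mnt/jfs&lt;/code&gt;, the callback runs three times, once for each path component listed above.&lt;/p&gt;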
&lt;h3&gt;How it works: Redis 6.0 client-side caching&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://redis.io/docs/latest/develop/reference/client-side-caching/"&gt;Redis 6.0 introduced &lt;strong&gt;client-side caching&lt;/strong&gt;&lt;/a&gt;, which allows clients to cache hot keys locally and receive invalidation notifications whenever those keys are modified.&lt;/p&gt;
&lt;p&gt;Based on this capability, JuiceFS caches two categories of metadata in client memory:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Inode attribute cache.&lt;/strong&gt; Keyed by inode number, this stores the complete attribute data for a file, such as type, size, permissions, and timestamps. The caching is implemented transparently through hook mechanisms in the Redis driver layer. On query, it first checks the local cache; on hit, it returns immediately without any network request. On modification, it automatically invalidates the corresponding cache. Application logic requires no awareness of the cache.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Directory entry cache.&lt;/strong&gt; Keyed by "parent inode + path separator + filename," this caches the results of directory lookups. Unlike the inode attribute cache, the lookup logic for entry cache is embedded directly in the directory lookup path rather than being intercepted transparently at the driver layer. When entries for a directory are invalidated, all related cache entries under that directory are cleared using prefix matching. This allows path resolution and repeated access to hot entries in the same directory to be served from local memory.&lt;/li&gt;
&lt;/ul&gt;
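&lt;p&gt;The two caches can be pictured with a minimal Python sketch. It is not the JuiceFS data structure: the &lt;code&gt;MetaCache&lt;/code&gt; class and its method names are hypothetical, and &lt;code&gt;/&lt;/code&gt; is assumed as the separator between the parent inode and the file name.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;SEP = '/'  # assumed separator between parent inode and file name


class MetaCache:
    def __init__(self):
        self.attrs = {}    # inode number -&gt; attribute dict
        self.entries = {}  # 'parent/name' -&gt; child inode

    def put_entry(self, parent, name, inode):
        self.entries[f'{parent}{SEP}{name}'] = inode

    def get_entry(self, parent, name):
        return self.entries.get(f'{parent}{SEP}{name}')

    def invalidate_inode(self, inode):
        self.attrs.pop(inode, None)

    def invalidate_dir(self, parent):
        # Prefix match: drop every cached entry that lives under this directory.
        prefix = f'{parent}{SEP}'
        for key in [k for k in self.entries if k.startswith(prefix)]:
            del self.entries[key]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Including the separator in the prefix matters: invalidating directory 3 must not touch entries that belong to directory 30.&lt;/p&gt;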
&lt;p&gt;&lt;strong&gt;Introducing client-side caching creates a consistency challenge in multi-mount scenarios.&lt;/strong&gt; When multiple clients share the same JuiceFS file system, an operation on one client — creating, deleting, renaming, or updating attributes of a file or directory — can invalidate cached inode attributes or directory entries on other clients. Without an effective invalidation mechanism, subsequent reads could hit stale metadata, causing the directory entries or file attributes seen by one client to diverge from the actual state in the backend.&lt;/p&gt;
&lt;p&gt;To address this, JuiceFS introduces a &lt;a href="https://redis.io/docs/latest/commands/client-tracking/"&gt;&lt;strong&gt;Tracking and Broadcast Invalidation&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(BCAST)&lt;/strong&gt; model on top of Redis' client-side caching mechanism. After connecting to Redis, each client declares the metadata key prefixes it wants to track. When those keys are modified, Redis sends invalidation notifications to the relevant clients. On receiving a notification, the client clears the corresponding inode attribute cache or entry cache entries, so that subsequent accesses fetch fresh data from the metadata engine.&lt;/p&gt;
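&lt;p&gt;In Redis terms, broadcast tracking is enabled with &lt;code&gt;CLIENT TRACKING ON BCAST PREFIX ...&lt;/code&gt;, and the server then pushes the names of modified keys to the client. The Python sketch below builds that command and reacts to an invalidation message. The key layout (&lt;code&gt;i&lt;/code&gt; plus inode for attributes, &lt;code&gt;d&lt;/code&gt; plus inode for a directory), the cache shapes, and both function names are assumptions for illustration.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;def tracking_command(prefixes):
    """Arguments that enable broadcast-mode tracking for the given key prefixes."""
    args = ['CLIENT', 'TRACKING', 'ON', 'BCAST']
    for prefix in prefixes:
        args += ['PREFIX', prefix]
    return args


def handle_invalidation(keys, attr_cache, entry_cache):
    """Drop local copies for the keys named in an invalidation message."""
    for key in keys:
        # Hypothetical key layout: 'i4' is inode 4, 'd3' is directory 3.
        kind, inode = key[0], int(key[1:])
        if kind == 'i':
            attr_cache.pop(inode, None)
        elif kind == 'd':
            prefix = f'{inode}/'
            for cached in [k for k in entry_cache if k.startswith(prefix)]:
                del entry_cache[cached]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In broadcast mode the server does not remember which keys a client has read; it notifies every client subscribed to a matching prefix, which keeps server-side bookkeeping small at the cost of some extra notifications.&lt;/p&gt;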
&lt;p&gt;In addition, at client initialization, JuiceFS warms up metadata for the root directory of the mount point. Since these files are typically the most frequently accessed, benchmarks show this warm-up significantly improves overall access performance.&lt;/p&gt;
&lt;p&gt;Through this mechanism, hot metadata can be reused locally. When the metadata changes, the related caches are evicted in time, reducing the risk of stale metadata.&lt;/p&gt;
&lt;h3&gt;When to use it&lt;/h3&gt;
&lt;p&gt;Redis client‑side caching works best in read‑heavy, write‑light scenarios with repeated access to hot metadata. AI training dataset loading is a good example: the dataset is usually read‑only during training, and tasks repeatedly access the same directories and files, so inode attribute cache and entry cache hit often, reducing redundant lookups and remote metadata queries.&lt;/p&gt;
&lt;p&gt;The benefit is even more obvious when there is higher network latency between the client and the Redis metadata engine, such as in cross-availability-zone deployments.&lt;/p&gt;
&lt;p&gt;Redis 6.0 or later is required to use this feature. The default cache expiration time is 1 minute, which provides a safety net in case of network interruptions or connection anomalies where invalidation notifications may not arrive, preventing stale entries from persisting indefinitely. For workloads with stricter consistency requirements, the expiration time can be shortened or client-side caching can be disabled entirely to reduce the risk of reading stale metadata.&lt;/p&gt;
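&lt;p&gt;The expiration safety net can be sketched as a small cache whose entries lapse after a fixed time even when no invalidation arrives. The &lt;code&gt;ExpiringCache&lt;/code&gt; class is hypothetical; only the 60-second default mirrors the 1-minute expiration mentioned above.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;import time


class ExpiringCache:
    """Local cache whose entries lapse after ttl seconds."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._data = {}  # key -&gt; (value, time stored)

    def put(self, key, value):
        self._data[key] = (value, self.clock())

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, stored_at = item
        if self.clock() - stored_at &gt;= self.ttl:
            del self._data[key]  # too old: the caller must read from Redis again
            return None
        return value
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A shorter &lt;code&gt;ttl&lt;/code&gt; narrows the window in which a missed notification can leave stale metadata visible, at the price of more remote reads.&lt;/p&gt;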
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;These three optimizations each target a different path through the metadata layer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Batch unlink&lt;/strong&gt; merges multiple independent unlink operations within the same directory into a single batch transaction.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Batch clone&lt;/strong&gt; merges multiple independent clone operations within the same directory into a single batch transaction.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis client-side caching&lt;/strong&gt; keeps hot metadata in client memory, bringing read latency from network-level down to memory-level, with broadcast invalidation to maintain consistency across multiple clients.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;BatchUnlink&lt;/code&gt; and &lt;code&gt;BatchClone&lt;/code&gt; are internal interfaces. Users do not call them directly. Just use the right commands: &lt;code&gt;juicefs rmr&lt;/code&gt; for deleting large directories, &lt;code&gt;juicefs clone&lt;/code&gt; for copying directories. The optimization will be applied automatically.&lt;/p&gt;
&lt;p&gt;One thing worth noting: both batch operations work by merging regular files within the same directory into a single batch transaction. Subdirectories are handled recursively by concurrent goroutines. The larger the directory, the greater the benefit.&lt;/p&gt;
&lt;p&gt;All optimizations above are available in JuiceFS Community Edition 1.4. Upgrade the client to get the performance gains.  &lt;/p&gt;
&lt;p&gt;If you have any questions about this article, feel free to join the &lt;a href="https://github.com/juicedata/juicefs/discussions/"&gt;JuiceFS discussions on GitHub&lt;/a&gt; and the &lt;a href="http://go.juicefs.com/discord"&gt;community on Discord&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 18 Jun 2026 06:10:00 +0000</pubDate><guid>https://www.juicefs.com/en/blog/engineering/improve-metadata-operation-performance</guid></item><item><title>How Gongjiyun Keeps Model Distribution Fast Enough for Cross-Cloud Elastic Inference</title><link>https://www.juicefs.com/en/blog/user-stories/ai-storage-model-distribution-cross-cloud-inference</link><description>&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;Founded in 2023 at Tsinghua University, &lt;a href="https://www.techinasia.com/companies/gongjiyun"&gt;Gongjiyun&lt;/a&gt; provides compute platforms and Model as a Service (MaaS) for artificial intelligence generated content (AIGC) enterprises and research institutions. We aim to alleviate the mismatch between elastic compute demand and supply. By aggregating idle IDC resources and edge resources, the platform offers containerized services, delivering rapidly schedulable compute for volatile workloads such as AI inference, video rendering, data processing, and data synthesis.&lt;/p&gt;
&lt;p&gt;In cross-cloud elastic inference scenarios, compute tasks can be scheduled to different regions, cloud environments, and clusters, but model files and application data are large and cannot be migrated as quickly as compute resources. Especially in online inference, the model repository is read‑heavy and frequently accessed – storage access performance directly affects service startup, elastic scaling, and request latency.&lt;/p&gt;
&lt;p&gt;To address this, we built an &lt;strong&gt;object storage acceleration&lt;/strong&gt; solution on top of &lt;a href="https://juicefs.com/docs/community/introduction/"&gt;JuiceFS&lt;/a&gt;, integrating users’ existing object storage into elastic inference clusters. Through a unified namespace, metadata import, FUSE mount, distributed cache, and data warm-up, it improves access efficiency for model repositories across clouds and clusters. In a case study with a leading text‑to‑image model community, the solution supports a tens‑of‑TB model repository, dynamic loading of checkpoints and low-rank adaptations (LoRAs), and elastic scaling of hundreds of GPUs at peak, while keeping additional latency within the customer’s acceptance range.&lt;/p&gt;
&lt;p&gt;In this post, we'll walk through why storage — not compute — is the real bottleneck in cross-cloud elastic inference, how we evaluated and chose JuiceFS, and the step-by-step optimizations that brought latency from +10s down to under 2s in production.&lt;/p&gt;
&lt;h2&gt;Elastic demand is widespread, but supply is hard to match&lt;/h2&gt;
&lt;p&gt;As AI applications grow rapidly, compute demand continues to increase, but resource usage patterns differ across scenarios. &lt;strong&gt;Compared to training, which has stable resource needs, &lt;a href="https://www.ibm.com/think/topics/ai-inference"&gt;AI inference&lt;/a&gt;, data processing, and data synthesis are often more volatile&lt;/strong&gt;: office applications may see higher traffic during the day, entertainment apps during evenings or weekends, and project‑based data processing may consume large amounts of compute in short bursts then idle. For small teams or exploratory applications, elastic compute also helps them better evaluate the relationship between per‑request cost and application value.&lt;/p&gt;
&lt;p&gt;On the supply side, compute infrastructure is capital‑intensive. Resource providers are not incapable of offering elastic services, but they prefer long‑term dedicated leases to recover costs and reduce risk. As a result, low price, stability, and elasticity are difficult to achieve together:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dedicated leases are low‑cost and stable but lack elasticity.&lt;/li&gt;
&lt;li&gt;Spot resources are cheap and elastic but uncertain.&lt;/li&gt;
&lt;li&gt;On‑demand resources are elastic and stable but expensive.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In China, this contradiction is further reflected by a market dominated by dedicated leases, with elastic supply accounting for a small share.&lt;/p&gt;
&lt;p&gt;We aim to resolve this mismatch between elastic demand and supply. &lt;strong&gt;By aggregating idle IDC and edge resources, the platform offers containerized services, providing rapidly schedulable compute for AI inference, video rendering, data processing, and data synthesis.&lt;/strong&gt; At lower resource costs, we help users quickly spin up tasks during peaks, schedule them across clusters, and handle elastic demand, while enabling resource providers to improve utilization and monetize idle capacity beyond dedicated leases.&lt;/p&gt;
&lt;h2&gt;Compute can be scheduled: How does storage keep up?&lt;/h2&gt;
&lt;p&gt;As elastic compute platforms have matured, scheduling compute resources has become relatively straightforward. Container images can be synchronized across clusters via registries and distribution networks, tasks can be launched in different resource pools by schedulers, and traffic can be distributed via unified ingress and traffic management.&lt;/p&gt;
&lt;p&gt;But model and data files are typically large, so migrating them across clouds and clusters is costly and slow, and it cannot keep pace with the sub‑second startup and release of compute. Therefore, &lt;strong&gt;in cross‑cloud elastic inference architectures, the real limitation on system elasticity is often not compute scheduling, but the efficiency of data and model distribution&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Different application scenarios have different storage requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://www.ibm.com/think/topics/model-training"&gt;&lt;strong&gt;Model training&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, development, and debugging:&lt;/strong&gt; These involve complex read‑write needs, including code repositories, model files, experiment results, and intermediate state. They also require high environment stability; users cannot tolerate state loss from frequent host switching. Thus, the platform typically provides long‑term stable compute resources and runtime environments, and storage needs can be met by existing stable storage systems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data processing:&lt;/strong&gt; This splits into two cases. If a single processing job has high application value and can cover cross‑cloud network transfer costs, you can build a pipeline that continuously pulls data from S3 or other object storage, processes it in the compute cluster, and streams the results back, so the system does not need large local storage. If the data scale is larger or per‑job value is low, local storage acts as a one‑time cache: data flows through and does not need to be persisted.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;The truly challenging scenario is online inference&lt;/strong&gt;. Online inference services cannot tolerate downtime. However, the resources used by an elastic computing platform may come from idle resource pools and can be preempted. Once resources in a data center or cluster become unavailable, the platform must migrate tasks to other providers or clusters in time. This means that not only the compute tasks must be migrated; the model files and the related storage access capabilities must move with them.&lt;/p&gt;
&lt;p&gt;Online inference places higher demands on service continuity and cross-cluster migration, but its storage access pattern is also clearer. Compared to training, development, and debugging, inference workloads are typically read-heavy. The core needs are efficient model loading, reading model weights, and accessing the model repository. For large models and online applications, model loading speed directly affects service startup time, elastic scaling efficiency, and request response stability. A traditional general-purpose read-write storage architecture is therefore a poor fit; inference is better served by optimizations targeted at model distribution, read-only access, and cache acceleration.&lt;/p&gt;
&lt;p&gt;In addition, an elastic computing platform usually does not host a user's complete application system. The user's primary cloud account, application database, model management system, and even some fixed computing resources often already exist in other clouds or on premises. For the platform to integrate with the user's application, it must be compatible with the user's existing model repository and model management processes. It cannot require the user to fully migrate the entire system.&lt;/p&gt;
&lt;p&gt;Therefore, &lt;strong&gt;to support cross-cloud elastic inference, we need more than just compute scheduling capabilities. We need a cross-cloud high-performance storage and model distribution solution tailored for model inference scenarios&lt;/strong&gt;. This solution must host a large model repository and serve high-performance reads, adapt to the user's existing model management system, and provide stable data access when resources are migrated across clouds and clusters.&lt;/p&gt;
&lt;h2&gt;Why JuiceFS: Unified cross-cloud access, strongly consistent metadata, and high-performance cache&lt;/h2&gt;
&lt;p&gt;Facing cross-cloud elastic inference scenarios, the storage system needs to meet several conditions at the same time:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;It must provide a unified access point across different clouds and clusters. It must support shared read-write access and unified metadata management.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It must be compatible with the user's existing &lt;a href="https://en.wikipedia.org/wiki/Object_storage"&gt;object storage&lt;/a&gt; and model repository to avoid data migration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It needs low operational complexity and good read performance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When evaluating storage options, we considered Ceph:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Ceph is mature. It’s suitable for building unified storage within a single data center or a stable resource domain.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;However, in cross-cloud elastic inference scenarios, Ceph demands a highly stable network and strong operational skills, and the overall integration cost is higher, so we did not choose it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We also evaluated Alluxio. However, in a &lt;a href="https://en.wikipedia.org/wiki/Multicloud"&gt;multi-cloud&lt;/a&gt; environment, multiple clusters need to access the same underlying object storage data concurrently. The workload is not purely read-only; there are also occasional writes. This scenario requires strong data consistency, so we did not choose Alluxio for production.&lt;/p&gt;
&lt;p&gt;We finally chose JuiceFS mainly because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;It uses object storage as the underlying data store.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It provides a unified namespace and consistent file system view through an independent metadata service. This allows multiple clusters to access the same model data as a file system.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;This architecture is suitable for cross-cloud and cross-cluster model distribution and shared reading.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It’s also compatible with the user's existing object storage and model repository, reducing data migration and application integration costs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The decision to further adopt &lt;a href="https://juicefs.com/docs/cloud/"&gt;JuiceFS Enterprise Edition&lt;/a&gt; was mainly due to its &lt;strong&gt;distributed caching capabilities and managed metadata service&lt;/strong&gt;. In this scenario, the value of JuiceFS is not just providing a file system interface. It combines object storage, unified namespace, metadata management, and cache acceleration into a storage access layer that is better suited for cross-cloud elastic inference.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;JuiceFS architecture diagram (4th edition), page 2, winfsp (2)&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;JuiceFS Enterprise Edition architecture&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;h2&gt;In practice: Object storage acceleration based on JuiceFS&lt;/h2&gt;
&lt;p&gt;On top of JuiceFS, the platform offers an object storage acceleration product. This product connects the user's existing object storage to the elastic inference cluster and presents that storage to the application as a high-performance file system. The overall process is as follows.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create a file system.&lt;/strong&gt; The user provides object storage access credentials, for example, AK/SK for S3-compatible storage. The credential permissions can be configured as read-only or read-write based on application needs. The platform creates a corresponding JuiceFS file system based on that object storage.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Import metadata.&lt;/strong&gt; The platform uses the JuiceFS import feature to scan the metadata of files in object storage. Then, it imports that metadata into the JuiceFS metadata service. In this way, the model files originally stored by the user in object storage can be accessed as file system directories in JuiceFS.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create a cache group.&lt;/strong&gt; Within each cluster that may host workloads, the platform sets up a JuiceFS cache group, which serves as a distributed cache for that cluster. Before running a task, the platform can warm up model files, caching hot data in the target cluster in advance. This reduces the time needed to pull data from remote object storage when the inference service starts.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Mount to application Pods.&lt;/strong&gt; When the user's application runs, the platform uses the FUSE client to mount the JuiceFS file system into the application Pod. For the application, model files appear as local file system paths. Therefore, the original model reading logic usually does not need modification.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enable node-local cache.&lt;/strong&gt; Besides the cluster-level cache group, the node where the FUSE client runs can also provide a local cache. This improves repeated read and model loading performance. It further reduces direct access to remote object storage.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
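&lt;p&gt;The mount, node-local cache, and warm-up steps above can be sketched with the Community Edition CLI. This is a dry-run illustration that only prints commands, not the platform's actual tooling: the metadata URL, mount point, cache directory, cache size, and model path are all placeholders, and the Enterprise Edition metadata import and cache group configuration are not shown.&lt;/p&gt;

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# Every value below is a placeholder, not the platform's real setting.
META_URL="redis://localhost"   # metadata engine address (placeholder)
MOUNT_POINT="/mnt/jfs"         # where the file system is mounted
CACHE_DIR="/data/jfs-cache"    # node-local cache directory
CACHE_SIZE_MIB=512000          # node-local cache quota, in MiB

# plan_mount_and_warmup MODEL_SUBDIR
plan_mount_and_warmup() {
  # Mount in the background with a node-local cache.
  echo "juicefs mount -d --cache-dir $CACHE_DIR --cache-size $CACHE_SIZE_MIB $META_URL $MOUNT_POINT"
  # Pre-load hot model files into the cache before tasks start.
  echo "juicefs warmup $MOUNT_POINT/$1"
}

plan_mount_and_warmup "models/core"
```

&lt;p&gt;Printing the commands keeps the sketch runnable on a machine without JuiceFS installed. In a real deployment the same two steps would run inside the node or Pod that hosts the FUSE client.&lt;/p&gt;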
&lt;p&gt;This object storage acceleration product essentially productizes the JuiceFS metadata import, distributed cache, data warm-up, and FUSE mounting process. It allows the user's existing object storage to serve cross-cloud inference tasks in a way that feels closer to a local file system.&lt;/p&gt;
&lt;p&gt;In addition, the JuiceFS cache group is independent from the file system access point. This characteristic, on one hand, adds management complexity on the platform side, because the platform needs to manage the relationships among the file system, cache groups, mount points, and task scheduling. On the other hand, it provides a foundation for cache isolation, independent scheduling, and fine-grained management based on clusters, users, or application scenarios in the future.&lt;/p&gt;
&lt;h2&gt;Production case study: A leading text-to-image model community&lt;/h2&gt;
&lt;h3&gt;Scenario, challenges, and acceptance criteria&lt;/h3&gt;
&lt;p&gt;One of the most representative cases in this object storage acceleration solution involves a leading Chinese text-to-image model community hosting tens of terabytes of model data, including large checkpoint base models and a larger number of smaller LoRA models. In practice, inference jobs typically load a checkpoint first, then load one or more LoRA models to perform combined inference.&lt;/p&gt;
&lt;p&gt;The company already operated compute infrastructure at scale — several thousand GPUs — but its workload, serving creative design and production use cases, exhibited significant variability. &lt;strong&gt;Overall average utilization was below 50%, yet during morning and afternoon peak hours on weekdays, load could reach 140% of normal capacity, degrading the user experience&lt;/strong&gt;. The customer therefore needed a highly elastic compute supply.&lt;/p&gt;
&lt;p&gt;We provided a high-elasticity resource model: compute at the scale of hundreds of GPUs was available only during weekday peak hours (10:00 AM–12:00 PM and 2:00–6:00 PM), with resources scaling to zero at all other times.&lt;/p&gt;
&lt;p&gt;This meant the platform needed to provision hundreds of GPUs within a window of minutes, while consuming zero resources outside peak hours. For the customer, this model delivers large-scale compute during peak periods while avoiding payment for idle capacity. For the platform, it enables more efficient utilization and monetization of idle compute resources.&lt;/p&gt;
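&lt;p&gt;The scale-to-zero schedule can be expressed as a small decision function. The sketch below is only an illustration of the weekday peak windows described above (10:00 to 12:00 and 14:00 to 18:00); the function names and the peak GPU count passed in are hypothetical, and the platform's real scheduler is not described in this article.&lt;/p&gt;

```shell
#!/bin/sh
# in_peak_window DAY_OF_WEEK HH:MM
# DAY_OF_WEEK follows `date +%u`: 1 = Monday ... 7 = Sunday.
# Exit status 0 means "inside a weekday peak window".
in_peak_window() {
  dow="$1"
  hh="${2%%:*}"; mm="${2##*:}"
  # Strip one leading zero so 08 and 09 are not parsed as octal.
  hh="${hh#0}"; mm="${mm#0}"
  minutes=$(( hh * 60 + mm ))
  # Weekends never scale up.
  if [ "$dow" -ge 6 ]; then return 1; fi
  # Morning window: 10:00 (600) up to, not including, 12:00 (720).
  if [ "$minutes" -ge 600 ]; then
    if [ "$minutes" -lt 720 ]; then return 0; fi
  fi
  # Afternoon window: 14:00 (840) up to, not including, 18:00 (1080).
  if [ "$minutes" -ge 840 ]; then
    if [ "$minutes" -lt 1080 ]; then return 0; fi
  fi
  return 1
}

# desired_gpus DAY_OF_WEEK HH:MM PEAK_GPUS: prints the target GPU count.
desired_gpus() {
  if in_peak_window "$1" "$2"; then echo "$3"; else echo 0; fi
}
```

&lt;p&gt;For example, &lt;code&gt;desired_gpus 1 10:30 300&lt;/code&gt; prints the peak count for a Monday morning, while &lt;code&gt;desired_gpus 6 10:30 300&lt;/code&gt; prints 0 because Saturday is outside the schedule.&lt;/p&gt;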
&lt;p&gt;The technical challenges were significant:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A model repository of this scale cannot simply be replicated to every elastic cluster.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Inference services do not load all models once at startup. Model reads and switches happen continuously as user requests arrive, resulting in high access frequency. Therefore, the object storage acceleration solution needed to support not just large-scale model repository access, but stable read performance under continuous dynamic loading.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The customer's performance requirements were also strict. During acceptance testing, a portion of production traffic was routed to the elastic cluster. The requirement was that both the median and mean inference latency of the elastic cluster must stay within 2 seconds of the customer's own cluster. Given that individual inference jobs take on the order of tens of seconds, this requirement left virtually no room for additional latency introduced by the storage layer. In the first few rounds of testing, both median and mean inference latency on the elastic cluster exceeded the customer's own cluster by approximately 10 seconds — failing the acceptance criteria.&lt;/p&gt;
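&lt;p&gt;To make the acceptance rule concrete, the sketch below computes the median and mean of two latency samples and checks each gap against a 2-second budget. The sample values are invented for illustration (they are not the customer's measurements), and the helper names are hypothetical. Latencies are integer milliseconds.&lt;/p&gt;

```shell
#!/bin/sh
# median_ms V1 V2 ...: prints the median of the integer arguments.
median_ms() {
  printf '%s\n' "$@" | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) { print v[(NR + 1) / 2] }
      else { print int((v[NR / 2] + v[NR / 2 + 1]) / 2) }
    }'
}

# mean_ms V1 V2 ...: prints the mean of the integer arguments, truncated.
mean_ms() {
  printf '%s\n' "$@" | awk '{ s += $1 } END { print int(s / NR) }'
}

# within_budget BASELINE_MS ELASTIC_MS BUDGET_MS
# Succeeds when the elastic cluster is at most BUDGET_MS slower.
within_budget() {
  [ $(( $2 - $1 )) -le "$3" ]
}

# Hypothetical samples: the elastic cluster has one long-tail request.
OWN="30000 31000 32000 33000 34000"
ELASTIC="31000 32000 33000 34000 60000"

# Word splitting of $OWN and $ELASTIC is intentional here.
if within_budget "$(median_ms $OWN)" "$(median_ms $ELASTIC)" 2000; then
  echo "median: pass"
else
  echo "median: fail"
fi
if within_budget "$(mean_ms $OWN)" "$(mean_ms $ELASTIC)" 2000; then
  echo "mean: pass"
else
  echo "mean: fail"
fi
```

&lt;p&gt;With these sample values the median gap is 1 second and passes, while the single 60-second request pushes the mean gap to 6 seconds and fails. This is why the two metrics have to be checked separately.&lt;/p&gt;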
&lt;h3&gt;Performance optimization: Reducing additional latency on the elastic cluster&lt;/h3&gt;
&lt;p&gt;Optimization began with the median. &lt;strong&gt;A high median indicates that a significant proportion of requests are experiencing performance degradation, not just a small number of outliers inflating the tail.&lt;/strong&gt; JuiceFS monitoring revealed that the cluster's cache hit rate was not reaching the expected level. In the current architecture, a cache miss requires a round trip over the public internet to the customer's object storage on Alibaba Cloud. This significantly increases model loading time and, in turn, inference request latency.&lt;/p&gt;
&lt;p&gt;To solve this, the platform used the isolation capability of the JuiceFS cache group. It assigned dedicated cache nodes to this customer, reserved enough cache space, and warmed up the core model data. After warm-up, the access path for core models reached a cache hit rate of nearly 100%. This effectively avoided the performance loss of fetching data back from object storage over the public internet.&lt;/p&gt;
&lt;p&gt;The second factor affecting the median was metadata access latency. Because the platform uses a unified cross-cluster architecture, the metadata service is accessed over the public internet, for example, via JuiceFS Cloud Service or a deployment on a remote host, and this latency affects overall model read performance.&lt;/p&gt;
&lt;p&gt;The platform took two measures to address this issue:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enabling JuiceFS' open cache to keep metadata in local memory as much as possible.&lt;/strong&gt; Since this workload is predominantly read-only, caching is an effective way to reduce metadata access overhead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tuning the cluster's network rate-limiting policy&lt;/strong&gt;. While the platform cannot directly control network equipment in edge data centers, it can apply node-level rate limiting to prevent any single node from saturating the available bandwidth, improving overall network stability. After these optimizations, cluster-wide performance improved meaningfully and the median metric gradually reached the customer's requirement.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
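&lt;p&gt;As one way to picture the first measure, the sketch below prints a Community Edition mount command with longer metadata cache timeouts. The &lt;code&gt;--open-cache&lt;/code&gt;, &lt;code&gt;--attr-cache&lt;/code&gt;, &lt;code&gt;--entry-cache&lt;/code&gt;, and &lt;code&gt;--dir-entry-cache&lt;/code&gt; options each take a timeout in seconds. The 3600-second value, metadata URL, and mount point are placeholders, and the options the platform actually uses with the Enterprise Edition client are not stated in this article.&lt;/p&gt;

```shell
#!/bin/sh
# Dry-run sketch: prints a mount command with longer metadata cache timeouts.
# The timeout value is illustrative, not the platform's real setting.

# metadata_cache_flags TTL_SECONDS
metadata_cache_flags() {
  ttl="$1"
  echo "--open-cache $ttl --attr-cache $ttl --entry-cache $ttl --dir-entry-cache $ttl"
}

echo "juicefs mount -d $(metadata_cache_flags 3600) redis://localhost /mnt/jfs"
```

&lt;p&gt;Longer timeouts mean a client may serve stale metadata until the cache expires, so this kind of setting fits workloads like this one, where the data is predominantly read-only.&lt;/p&gt;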
&lt;p&gt;&lt;strong&gt;Once the median met the target, the mean still showed a gap. This indicated that long-tail requests remained, with a small number of requests taking significantly longer than normal and pulling up the overall average.&lt;/strong&gt; Further analysis traced this to node-level local cache — specifically, the FUSE cache quota. With limited cache capacity, the elastic cluster experienced more frequent cache evictions than the customer's own cluster, causing some requests to reload model data from scratch and increasing mean inference latency. The platform addressed this by increasing the FUSE local cache quota in the production environment, reducing eviction frequency, improving tail latency, and ultimately bringing the mean metric within acceptance. The system passed validation and has been running stably since.&lt;/p&gt;
&lt;h3&gt;Multi-tenant cache management&lt;/h3&gt;
&lt;p&gt;After the single-tenant case was validated, the solution entered multi-tenant operation. As different tenants began time-sharing the same elastic nodes, a new issue emerged: cache contention between tenants.&lt;/p&gt;
&lt;p&gt;In the elastic resource model, FUSE clients do not actively clear node cache on exit. This is a reasonable design in single-tenant scenarios, where cached data from previous jobs can be reused by subsequent jobs to improve hit rates. &lt;strong&gt;However, in multi-tenant scenarios, one tenant's data can occupy node cache for extended periods. This leaves insufficient cache capacity for the next tenant, who is then forced to fall back to object storage, causing a noticeable performance drop.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;To address this, we deployed an independent daemon process on each node that performs a global cache garbage collection (GC) pass before the application FUSE client starts. The eviction strategy references the JuiceFS FUSE client implementation, using a 2-random policy to balance collection efficiency and performance overhead. Coordination across nodes is handled via Kubernetes distributed locks: only the client that acquires the lock executes GC, preventing multiple clients from running cache collection simultaneously and creating excessive network and I/O pressure.&lt;/p&gt;
&lt;p&gt;This mechanism effectively mitigates the problem of historical jobs occupying cache resources in multi-tenant scenarios, allowing different tenants sharing elastic resources to maintain consistent cache performance.&lt;/p&gt;
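&lt;p&gt;The 2-random idea can be illustrated in a few lines: repeatedly sample two cache entries at random and evict the one that was accessed longer ago, until the cache fits the target size. The sketch below is not the daemon's code or the JuiceFS client implementation. It works on a plain text listing with a hypothetical format (name, last access time, size), and the Kubernetes lock coordination is omitted.&lt;/p&gt;

```shell
#!/bin/sh
# two_random_evict TARGET_BYTES [SEED]
# stdin: one cache entry per line as "name last_access_epoch size_bytes".
# stdout: names chosen for eviction, until the total size fits TARGET_BYTES.
two_random_evict() {
  awk -v target="$1" -v seed="${2:-1}" '
    { name[NR] = $1; atime[NR] = $2; size[NR] = $3; total += $3 }
    END {
      srand(seed)
      n = NR
      while (total > target) {
        if (n == 0) break
        # Sample two entries at random (they may coincide).
        i = int(rand() * n) + 1
        j = int(rand() * n) + 1
        # Evict whichever of the two was accessed longer ago.
        victim = i
        if (atime[i] > atime[j]) victim = j
        print name[victim]
        total -= size[victim]
        # Remove the victim by moving the last entry into its slot.
        name[victim] = name[n]; atime[victim] = atime[n]; size[victim] = size[n]
        n--
      }
    }'
}
```

&lt;p&gt;Compared with a strict least-recently-used scan, sampling two candidates avoids sorting or tracking every entry, at the cost of occasionally evicting a fairly recent block. That is the balance between collection efficiency and overhead mentioned above.&lt;/p&gt;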
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;For elastic compute to reliably serve production traffic, compute scheduling alone is not enough. Model data and hot data must remain stably accessible across clouds and clusters.&lt;/p&gt;
&lt;p&gt;Built on JuiceFS, we’ve combined object storage, unified namespace, metadata management, distributed caching, and FUSE mounting into an object storage acceleration solution purpose-built for elastic inference. This is not simply about mounting object storage as a file system. It’s about building a data access layer around the access patterns of model inference: one that supports warm-up, caching, isolation, and management.&lt;/p&gt;
&lt;p&gt;This represents Gongjiyun's current progress in elastic compute and cross-cloud storage acceleration. As AI inference scenarios continue to evolve, model distribution, cache management, and multi-cluster data access will continue to surface new engineering challenges. We look forward to exchanging ideas with developers, AI application teams, and infrastructure practitioners, and to exploring more stable and efficient data access solutions for elastic compute environments.&lt;/p&gt;
&lt;p&gt;If you have any questions for this article, feel free to join &lt;a href="https://github.com/juicedata/juicefs/discussions/"&gt;JuiceFS discussions on GitHub&lt;/a&gt; and &lt;a href="http://go.juicefs.com/discord"&gt;community on Discord&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 10 Jun 2026 09:48:58 +0000</pubDate><guid>https://www.juicefs.com/en/blog/user-stories/ai-storage-model-distribution-cross-cloud-inference</guid></item><item><title>Reducing Data Storage Costs: A Deep Dive into JuiceFS 1.4 Tiered Storage</title><link>https://www.juicefs.com/en/blog/engineering/juicefs-tiered-storage</link><description>&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;&lt;a href="https://github.com/juicedata/juicefs/releases/tag/v1.4.0-beta1"&gt;JuiceFS Community Edition 1.4&lt;/a&gt; introduces enhanced tiered storage capabilities, allowing users to set object storage classes at the file or directory level. This makes it possible to manage different storage tiers for data under a unified file system interface. In this article, we’ll discuss this feature’s application background, evolution, usage model, implementation, and future plans.&lt;/p&gt;
&lt;h2&gt;Application background&lt;/h2&gt;
&lt;p&gt;In real‑world scenarios, different files have different access patterns and performance requirements. Some data is read or written frequently and demands low latency and high throughput. Other data is rarely accessed after being written, and the main concern is long‑term storage cost. Tiered storage addresses this by placing data in the appropriate storage layer based on access patterns, balancing performance and cost.&lt;br&gt;
Typically, data can be classified into three categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hot data:&lt;/strong&gt; Frequently accessed, requires low latency and high throughput.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Warm (infrequent access) data:&lt;/strong&gt; Accessed occasionally, but still requires fast retrieval when needed.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cold (archival) data:&lt;/strong&gt; Primarily for long‑term retention, very low access frequency, can tolerate some restoration delay in exchange for lower cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Object_storage"&gt;Object storage&lt;/a&gt; already offers tiering capabilities. For example, Amazon S3 provides S3 Standard for frequently accessed data, S3 Standard‑IA for infrequent but still millisecond‑accessible data, and Glacier / Deep Archive for long‑term archiving. These storage classes differ in access latency, minimum storage duration, and pricing.&lt;br&gt;
The table below compares the main S3 storage classes:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left;"&gt;Storage class&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Use case&lt;/th&gt;
&lt;th style="text-align: left;"&gt;First byte latency&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Minimum storage duration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;S3 Standard&lt;/td&gt;
&lt;td style="text-align: left;"&gt;General-purpose storage for frequently accessed data&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Milliseconds&lt;/td&gt;
&lt;td style="text-align: left;"&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;S3 Standard-IA&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Infrequently accessed data requiring millisecond access&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Milliseconds&lt;/td&gt;
&lt;td style="text-align: left;"&gt;30 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;S3 Glacier Deep Archive&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Archiving very rarely accessed data with very low cost&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Hours&lt;/td&gt;
&lt;td style="text-align: left;"&gt;180 days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
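&lt;p&gt;The minimum storage duration matters for cost because an object removed or transitioned earlier is still billed for the full minimum period. A tiny worked example, using the durations from the table (the function name is hypothetical):&lt;/p&gt;

```shell
#!/bin/sh
# billable_days MIN_DAYS STORED_DAYS
# Prints the number of days billed for an object that is deleted or
# transitioned after STORED_DAYS in a class with a MIN_DAYS minimum.
billable_days() {
  if [ "$2" -ge "$1" ]; then echo "$2"; else echo "$1"; fi
}

billable_days 30 10     # Standard-IA object removed after 10 days
billable_days 180 45    # Deep Archive object removed after 45 days
billable_days 180 200   # Deep Archive object kept past the minimum
```

&lt;p&gt;The three calls print 30, 180, and 200. Data that changes or is deleted frequently therefore gains little from colder classes, even though their per-GB price is lower.&lt;/p&gt;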
&lt;p&gt;For JuiceFS, which is built on top of object storage, the key is to translate these capabilities into file‑system‑level tiering: users set storage tiers for files, directories, or datasets, and JuiceFS maps them to the underlying object storage while handling writes, migrations, and restore operations.&lt;/p&gt;
&lt;h2&gt;Evolution of JuiceFS tiering capabilities&lt;/h2&gt;
&lt;p&gt;The evolution of JuiceFS tiering has moved from being “passively unaware of object storage classes” to “actively managing storage tiers at file and directory granularity.”  &lt;/p&gt;
&lt;p&gt;Before v1.1, JuiceFS did not provide a way to configure storage classes. While users could manually change the storage class of objects at the object storage side, these changes were not recognized or managed by JuiceFS at the file system level. For standard and infrequent‑access classes that support direct access, normal read/write operations usually continued to work. However, if objects were moved to archival storage, access would fail because those objects cannot be read directly.  &lt;/p&gt;
&lt;p&gt;Starting with v1.1, &lt;a href="https://juicefs.com/docs/community/reference/how_to_set_up_object_storage/#storage-class"&gt;JuiceFS supports setting the object storage class via &lt;code&gt;--storage-class&lt;/code&gt;&lt;/a&gt;. For example, you can specify the default storage class for the file system at format time or override the storage class used for data written to a specific mount point during mount. This gave JuiceFS a basic ability to leverage object storage tiering. However, the configuration granularity remained coarse – primarily at the file system default or mount‑point level – and did not allow fine‑grained management per directory, per file, or per dataset.  &lt;/p&gt;
&lt;p&gt;Version 1.4 further advances tiering capabilities to the file and directory level. You can assign a storage tier to individual files or directories based on data temperature. When a directory is assigned a tier, newly created files and subdirectories under it automatically inherit that configuration. Compared to the previous default or mount‑point level settings, v1.4 is better suited for tiered management by project, directory, dataset, or data temperature.&lt;/p&gt;
&lt;h2&gt;How to configure tiered storage&lt;/h2&gt;
&lt;p&gt;The key to tiered storage in JuiceFS 1.4 is translating object storage classes into file‑system‑manageable tiers. The usage model consists of two steps:  &lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Map tier IDs to object storage classes.  &lt;/li&gt;
&lt;li&gt;Assign files or directories to those tier IDs.  &lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This allows users to organize tiering policies by file, directory, or dataset without specifying the underlying storage class on every write.  &lt;/p&gt;
&lt;p&gt;The figure below shows mapping tier IDs to storage classes:&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;Mapping tier IDs to storage classes&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Mapping tier IDs to storage classes&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;For example, map tier IDs 1–3 to different storage classes:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Map tier IDs 1-3 to object storage classes (-y answers "yes" to prompts)
juicefs config redis://localhost --tier-id 1 --tier-sc STANDARD_IA -y
juicefs config redis://localhost --tier-id 2 --tier-sc INTELLIGENT_TIERING -y
juicefs config redis://localhost --tier-id 3 --tier-sc GLACIER_IR -y
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After mapping, set the storage tier for a file or directory:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;juicefs tier set redis://localhost --id 1 /path/to/file  
juicefs tier set redis://localhost --id 2 /path/to/dir  
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Directory‑level settings have inheritance semantics. Once a directory is assigned a tier ID, newly created files and subdirectories will inherit that tier. To apply the tier to existing data under the directory, use &lt;code&gt;-r&lt;/code&gt; to recursively set the tier:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;juicefs tier set redis://localhost --id 2 /path/to/dir -r  
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;existing data&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Directory inheritance and recursive setting for existing data&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;For archival storage classes such as Glacier, a restore request must be issued before reading:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;juicefs tier restore redis://localhost /path/to/dir -r  
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Implementation&lt;/h2&gt;
&lt;p&gt;From an implementation perspective, the key to tiered storage in v1.4 is storing tier information in metadata and using the tier ID to decide the object storage behavior during writes, migrations, and reads.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;Write, migration, and archival read flow&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Write, migration, and archival read flow&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;h3&gt;Metadata design&lt;/h3&gt;
&lt;p&gt;JuiceFS uses &lt;code&gt;tier-id&lt;/code&gt; on files and directories to indicate the storage tier. A value of &lt;code&gt;0&lt;/code&gt; means the default storage tier; values &lt;code&gt;1&lt;/code&gt; to &lt;code&gt;3&lt;/code&gt; correspond to user‑configured object storage classes.  &lt;/p&gt;
&lt;p&gt;Thus, the storage tier is no longer just an external state at the object storage side, but becomes part of the file system metadata that JuiceFS can understand and manage. When writing new data, migrating existing data, or checking file status, JuiceFS can determine the intended storage class based on this metadata.&lt;/p&gt;
&lt;h3&gt;Migrating existing data&lt;/h3&gt;
&lt;p&gt;For existing data, changing the storage tier involves not only updating the metadata &lt;code&gt;tier-id&lt;/code&gt; but also changing the actual storage class of the underlying objects. When a directory is set recursively, JuiceFS processes all files and subdirectories under it and uses the object storage’s copy capability to migrate existing objects to the new storage class.  &lt;/p&gt;
&lt;p&gt;If only the mapping from a tier ID to a storage class is changed, the actual storage class of existing objects is not automatically updated. In that case, you must use &lt;code&gt;tier set --force&lt;/code&gt; to explicitly trigger the change. This will rewrite the objects with the new storage class.&lt;/p&gt;
&lt;h3&gt;Write path&lt;/h3&gt;
&lt;p&gt;When a new file is written, JuiceFS determines the target storage class based on the file’s own &lt;code&gt;tier-id&lt;/code&gt; or, if not set, the inherited &lt;code&gt;tier-id&lt;/code&gt; from its parent directory. For directories that already have a storage tier assigned, new data can be written directly to the corresponding storage tier. This avoids the overhead of first writing to the default tier and then migrating later.&lt;/p&gt;
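&lt;p&gt;The lookup described above can be modeled in a few lines. This is a simplified sketch, not JuiceFS source code: the mapping mirrors the example tier IDs configured earlier in this article, the class shown for tier 0 is an assumed file system default, and an empty value is used here to represent a file with no tier of its own.&lt;/p&gt;

```shell
#!/bin/sh
# storage_class_for_tier TIER_ID
# Example mapping only; tier 0 stands for the file system default,
# which is assumed to be STANDARD in this sketch.
storage_class_for_tier() {
  case "$1" in
    0) echo "STANDARD" ;;
    1) echo "STANDARD_IA" ;;
    2) echo "INTELLIGENT_TIERING" ;;
    3) echo "GLACIER_IR" ;;
    *) return 1 ;;
  esac
}

# effective_tier FILE_TIER PARENT_TIER
# An empty FILE_TIER means "not set", so the parent directory's tier applies.
effective_tier() {
  if [ -n "$1" ]; then echo "$1"; else echo "${2:-0}"; fi
}

# A new file with no tier of its own, created under a directory set to tier 2.
storage_class_for_tier "$(effective_tier "" 2)"
```

&lt;p&gt;The last line prints &lt;code&gt;INTELLIGENT_TIERING&lt;/code&gt;: the file inherits tier 2 from its parent, so its data can be written straight to that class.&lt;/p&gt;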
&lt;h3&gt;Read path&lt;/h3&gt;
&lt;p&gt;For storage classes that support immediate access (for example, Standard and Standard‑IA), reads are transparent to the application, and JuiceFS simply reads the data from object storage as usual.  &lt;/p&gt;
&lt;p&gt;For archival classes such as Glacier and Deep Archive, objects cannot be read directly. You must first issue a restore request using &lt;code&gt;juicefs tier restore&lt;/code&gt;. This sends a request to the object storage service. Whether and when the objects become readable depends on the cloud provider’s restore mechanism. After restoration completes, applications can retry the read.  &lt;/p&gt;
&lt;p&gt;Therefore, archival storage is suitable for data that is accessed very infrequently and can tolerate restoration delay. It’s not appropriate for workloads that require online access at any time. When using archival tiers, you must consider storage cost, restoration time, and restoration costs.&lt;/p&gt;
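&lt;p&gt;Because restoration is asynchronous, an application or operator typically issues the restore and then retries the read later. The helper below is a generic retry loop written for illustration; its name, the attempt count, and the delay are assumptions, and it is not part of the JuiceFS CLI.&lt;/p&gt;

```shell
#!/bin/sh
# wait_until_readable PATH ATTEMPTS DELAY_SECONDS
# Tries to read the first byte of PATH, sleeping between attempts.
# Returns 0 once a read succeeds, 1 if every attempt fails.
wait_until_readable() {
  attempt=1
  while [ "$attempt" -le "$2" ]; do
    if head -c 1 "$1" > /dev/null 2> /dev/null; then
      return 0
    fi
    sleep "$3"
    attempt=$(( attempt + 1 ))
  done
  return 1
}
```

&lt;p&gt;A possible usage is to run &lt;code&gt;juicefs tier restore redis://localhost /path/to/dir -r&lt;/code&gt; first, then call &lt;code&gt;wait_until_readable&lt;/code&gt; on a file under the mount point with a delay that matches the provider's expected restore time, which can be hours for deep archive classes.&lt;/p&gt;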
&lt;h2&gt;Future plans&lt;/h2&gt;
&lt;h3&gt;Reducing operational costs of archival storage&lt;/h3&gt;
&lt;p&gt;Archival storage classes have low long‑term storage costs, but they often come with complex cost models covering writes, restores, early deletion, and lifecycle transitions. Writing data directly to archival storage may incur extra costs in scenarios with frequent changes or bulk migrations.  &lt;/p&gt;
&lt;p&gt;In the future, JuiceFS could integrate with object storage lifecycle management. Data could first be written to standard storage with specific object tags. Users could then use cloud‑vendor lifecycle rules to automatically and cost‑effectively transition data to infrequent‑access or archival tiers based on those tags. This would preserve JuiceFS’ file‑system‑level tiering capabilities while leveraging native batch transition mechanisms to reduce overhead.&lt;/p&gt;
&lt;h3&gt;Extending tiering to multi‑bucket, multi‑cloud&lt;/h3&gt;
&lt;p&gt;Currently, tiered storage works on different storage classes within the same object storage backend. In the future, JuiceFS could extend “tier” to different buckets, different object storage services, or even different cloud providers. Tiering would no longer be limited to a single backend.  &lt;/p&gt;
&lt;p&gt;For example, hot data could be stored in a local high‑performance MinIO cluster backed by SSDs, while cold or archival data resides in low‑cost cloud archival buckets. Policies could then gradually move data from the hot tier to the cold tier. With such an architecture, JuiceFS could offer cross‑bucket, cross‑cloud, and cross‑media tiered data management under a unified file system namespace.  &lt;/p&gt;
&lt;p&gt;If you have any questions for this article, feel free to join &lt;a href="https://github.com/juicedata/juicefs/discussions/"&gt;JuiceFS discussions on GitHub&lt;/a&gt; and &lt;a href="http://go.juicefs.com/discord"&gt;community on Discord&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 03 Jun 2026 08:45:00 +0000</pubDate><guid>https://www.juicefs.com/en/blog/engineering/juicefs-tiered-storage</guid></item><item><title>JuiceFS at Xiaomi: Unified Storage for AI, Big Data, and Cloud‑Native Workloads</title><link>https://www.juicefs.com/en/blog/user-stories/artificial-intelligence-big-data-cloud-native-storage</link><description>&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Xiaomi"&gt;Xiaomi&lt;/a&gt; is one of the world's leading smartphone companies. Since 2021, its storage team has been building a file storage platform based on &lt;a href="https://juicefs.com/docs/community/introduction/"&gt;JuiceFS&lt;/a&gt;, initially providing file storage capabilities for cloud‑native and some application scenarios. After Xiaomi announced its comprehensive AI strategy in 2024, issues with the previous heterogeneous storage system became more evident in areas such as technology selection, data flow, and development/operations. Leveraging multi‑protocol access, elastic scalability, multi‑cloud adaptability, and high performance, the team decided to build a unified file storage foundation centered on JuiceFS to support big data, cloud‑native, and AI workloads.&lt;/p&gt;
&lt;p&gt;To achieve this goal, the platform further developed core capabilities, including a capacity layer, a performance layer, and a cache layer. These reduce the complexity of multi‑system access and data movement while balancing large‑scale storage with high‑performance access. &lt;strong&gt;Over the past two years, with the rapid growth of generative AI and autonomous driving, the platform has supported typical scenarios such as large‑model training, autonomous driving training, inference acceleration, and big‑data cloud migration. Today, the platform can handle hundreds of billions of files and EB‑scale storage, covering the entire AI storage chain from raw data and training data to model file distribution.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Storage architecture challenges under the AI strategy&lt;/h2&gt;
&lt;p&gt;Before 2023, Xiaomi, like most companies, had built multiple storage systems for different application scenarios. In the &lt;a href="https://en.wikipedia.org/wiki/Big_data"&gt;big data&lt;/a&gt; area, the data platform was mainly based on HDFS; AI workloads, before the rise of large language models, relied primarily on high‑performance file storage services on the cloud, such as Parallel File System (PFS) and Network Attached Storage (NAS).&lt;/p&gt;
&lt;p&gt;During this period, we also began to introduce JuiceFS and built an internal self‑developed File Storage Service (FDS), using components like &lt;a href="https://juicefs.com/docs/csi/introduction/"&gt;JuiceFS CSI Driver&lt;/a&gt; to provide file storage for cloud‑native and some application scenarios. As application needs evolved, these storage systems grew independently. This led to a complex heterogeneous storage landscape.&lt;/p&gt;
&lt;p&gt;In 2024, after Xiaomi announced its comprehensive AI strategy, the shortcomings of the previous storage system became more pronounced in areas such as technology selection, access, data flow, and development/operations.&lt;/p&gt;
&lt;p&gt;These challenges included:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High selection and access costs:&lt;/strong&gt; With many storage systems and inconsistent capabilities, application teams had to understand and adapt to each one, raising the barrier to entry.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Low data flow efficiency:&lt;/strong&gt; The lack of a unified access method across systems led to frequent cross‑system data copying. This hurt development efficiency.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scattered development and operations efforts:&lt;/strong&gt; Multiple systems were maintained and evolved independently, making it difficult to focus resources on the mission-critical infrastructure required for the AI strategy.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To address these issues, we conducted in‑depth internal discussions and architectural adjustments in 2024, and began redesigning a unified storage architecture for AI, big data, and &lt;a href="https://en.wikipedia.org/wiki/Cloud-native_computing"&gt;cloud‑native&lt;/a&gt; scenarios.&lt;/p&gt;
&lt;h2&gt;Building a unified file foundation with JuiceFS&lt;/h2&gt;
&lt;h3&gt;Selection rationale: Multi‑protocol support, elasticity, multi‑cloud, high performance&lt;/h3&gt;
&lt;p&gt;JuiceFS is a distributed file system that natively supports multi‑protocol access, elastic scaling, and high‑performance reads/writes. This makes it a strong fit for both AI and big data storage needs.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;JuiceFS architecture diagram (4th edition), page 2, winfsp (2)&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;JuiceFS Community Edition architecture&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;In the cloud-native field, we’ve been using JuiceFS since 2021, continuously conducting internal development and iterative optimization. At the same time, we maintain close collaboration with the JuiceFS open-source community to jointly drive technology evolution and real-world adoption.&lt;/p&gt;
&lt;p&gt;In AI scenarios, model training and inference rely heavily on &lt;a href="https://en.wikipedia.org/wiki/POSIX"&gt;POSIX&lt;/a&gt; semantics, which aligns naturally with JuiceFS capabilities. Meanwhile, in the big data area, we were already promoting HDFS replacement during cloud migration, a practice with many mature industry examples, so adapting the HDFS protocol was also feasible.&lt;/p&gt;
&lt;p&gt;Considering multi-protocol support, elastic scalability, &lt;a href="https://en.wikipedia.org/wiki/Multicloud"&gt;multi-cloud&lt;/a&gt; adaptability, and high-performance read/write, we ultimately chose JuiceFS as the core component of our unified file storage foundation. This solved the problems of complex data flow, high access costs, and scattered operations caused by using different file systems across multiple platforms and application units.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;xiaomi Evolution&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Evolution of the unified file storage foundation based on JuiceFS&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;h3&gt;Storage layer capability construction&lt;/h3&gt;
&lt;p&gt;Our core goal is to build a unified file storage layer on top of JuiceFS, providing large capacity, high performance, and standardized access interfaces to uniformly support the three core application scenarios: big data, cloud-native, and AI.&lt;/p&gt;
&lt;p&gt;On the client side, we fully leverage JuiceFS’ multi-protocol capabilities, offering access methods including POSIX, Hadoop SDK, Python SDK, and &lt;a href="https://juicefs.com/docs/community/guide/gateway/"&gt;S3 Gateway&lt;/a&gt;. They’re all already in use internally.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;JuiceFS unified file storage&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Architecture of the JuiceFS unified file storage foundation&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;On the data plane, the architecture consists of three layers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Capacity layer:&lt;/strong&gt; Built on &lt;a href="https://aws.amazon.com/what-is/public-cloud/"&gt;public cloud&lt;/a&gt; object storage, designed for EB‑scale storage, supporting multi-cloud deployments across different strategic data centers and multiple cloud providers.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance layer:&lt;/strong&gt; Large‑scale tuning based on Ceph and all‑flash nodes, designed for AI training and other scenarios with high throughput and low latency requirements.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache layer:&lt;/strong&gt; Given the “write once, read many, seldom modify” characteristic of AI training datasets, we developed a high‑performance distributed cache system based on NVMe and RDMA to reduce repeated read costs and improve training data access efficiency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On the control plane, we made custom enhancements to the &lt;a href="https://juicefs.com/docs/community/introduction/"&gt;Community Edition&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For metadata, we built a distributed metadata service based on the Raft protocol to integrate with internal infrastructure systems and support multi-system access, improving reliability and scalability.  &lt;/li&gt;
&lt;li&gt;For backend management, we built a unified management service responsible for data lifecycle management, tiered storage, garbage collection, and warm-up of hot data from the capacity layer to the performance or cache layers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Through these efforts, JuiceFS has gradually become the unified file storage foundation at Xiaomi, supporting both large‑scale capacity storage and high‑performance access for AI training. The architecture is now running in production and provides the high throughput required for large model training.&lt;/p&gt;
&lt;h2&gt;Our practices&lt;/h2&gt;
&lt;p&gt;During the construction of the unified file storage foundation, JuiceFS has gradually covered Xiaomi’s mission-critical application scenarios, including big data, cloud-native, and AI:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;In terms of scale, the solution can support EB‑level storage and hundreds of billions of files.&lt;/strong&gt;  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;In terms of capability, the coordinated design of the capacity, performance, and cache layers balances large‑scale storage with high performance.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Below we describe two typical scenarios: big data cloud migration and the &lt;a href="https://www.hpe.com/hk/en/what-is/ai-storage.html"&gt;AI storage&lt;/a&gt; pipeline.&lt;/p&gt;
&lt;h3&gt;Big data cloud migration and unified lakehouse storage&lt;/h3&gt;
&lt;p&gt;In its early days, our big data system was mainly built on the Hadoop ecosystem, where HDFS used a previous‑generation coupled architecture. Over time, this architecture showed problems such as performance fluctuations, complex operations, and high total cost. In contrast, cloud storage offers significant advantages in elastic scaling, resource utilization, and cost control. Therefore, starting in 2021, we systematically began migrating big data to the cloud.&lt;/p&gt;
&lt;h4&gt;From cold data to the lakehouse layer&lt;/h4&gt;
&lt;p&gt;Our big data cloud migration went through three stages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Cold data migration:&lt;/strong&gt; We first migrated cold data from HDFS to cloud storage, a process lasting over two years.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lakehouse layer migration:&lt;/strong&gt; We developed our own unified lakehouse file system, which drove the shift from coupled to decoupled storage and compute.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified storage foundation based on JuiceFS:&lt;/strong&gt; After selecting JuiceFS, we migrated the entire lakehouse layer to JuiceFS.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Lakehouse construction can leverage Iceberg’s native support for &lt;a href="https://en.wikipedia.org/wiki/Object_storage"&gt;object storage&lt;/a&gt; access (such as OSS or S3). However, our applications span multiple regions worldwide and run on several cloud vendors. Adapting to each vendor individually would incur high access and maintenance costs.&lt;/p&gt;
&lt;p&gt;Thus, we chose JuiceFS to provide uniform access to different cloud storage services. Upper‑layer services only need to switch the backend storage address via the SDK to work in a different cloud environment, which greatly reduces multi‑cloud complexity.&lt;/p&gt;
&lt;p&gt;For data migration, our in‑house data‑factory platform can transparently switch a table’s underlying storage to the new architecture and gradually migrate existing data to the cloud in the background, with little or no impact on applications. Moreover, JuiceFS supports multi-cloud and on‑premises deployment. If future cost or strategic considerations require switching to self‑built storage, data can be smoothly migrated back via JuiceFS. This preserves architectural flexibility.&lt;/p&gt;
&lt;h4&gt;Hot table cache acceleration for compute efficiency&lt;/h4&gt;
&lt;p&gt;After data was in the cloud, we further analyzed access patterns of the lakehouse layer. For daily reporting and analysis tasks, computation is usually concentrated on day‑level or week‑level hot data, not requiring frequent full scans. Therefore, the performance focus for the lakehouse layer was not simply improving full‑scan throughput but rather increasing hot data access efficiency and task execution stability.&lt;/p&gt;
&lt;p&gt;Based on this, we built a hot table warm-up capability in cooperation with the lakehouse layer. The system identifies hot tables and their hot partitions based on daily access statistics, and preloads related data into the cache layer before task execution via a warm-up interface. For periodic reporting tasks that must be completed by 8 AM, hot data is warmed up before computation. This reduces remote reads and repeated access.&lt;/p&gt;
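&lt;p&gt;As a rough illustration of the warm-up step, the sketch below uses the Community Edition's &lt;code&gt;juicefs warmup&lt;/code&gt; command rather than our internal warm-up interface. The helper name &lt;code&gt;warm_hot_partitions&lt;/code&gt;, the mount point, and the partition paths are hypothetical placeholders. By default it only prints the commands it would run; set &lt;code&gt;DRY_RUN=0&lt;/code&gt; to execute them.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-shell"&gt;# Hypothetical helper: preload hot partitions into the cache before a report job.
# Usage: warm_hot_partitions MOUNT_POINT PARTITION_PATH...
# DRY_RUN defaults to 1, so the commands are printed instead of executed.
warm_hot_partitions() {
  mount_point="$1"
  shift
  for partition in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "juicefs warmup --threads 50 ${mount_point}/${partition}"
    else
      juicefs warmup --threads 50 "${mount_point}/${partition}"
    fi
  done
}

# Example with placeholder paths: warm up two daily partitions of one hot table.
warm_hot_partitions /mnt/jfs warehouse/orders/dt=2026-05-25 warehouse/orders/dt=2026-05-26
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In practice, the partition list would come from the daily access statistics, and the call would be scheduled early enough to finish before the reporting deadline.&lt;/p&gt;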
&lt;p&gt;Through offline and online testing, after hot table caching, compute efficiency improved by about 10-20%, with reductions in both computation time and resource consumption. The cache size has reached PB level, with average throughput around 200 GB/s. The cache layer also reduces cross‑cloud bandwidth pressure and cloud storage API call costs: by improving the hot data hit rate, repeated cross-cloud reads can be reduced, thereby lowering bandwidth consumption and access costs.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;Hot table warm-up architecture&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Hot table warm-up architecture&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;h4&gt;Benefits for big data&lt;/h4&gt;
&lt;p&gt;Benefits for our big data application include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Performance:&lt;/strong&gt; After switching to JuiceFS, sequential read/write performance improved significantly, more than doubling in some scenarios. Overall task duration decreased by about 10–30%.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost:&lt;/strong&gt; By Xiaomi's internal cost metrics, the unified storage architecture has greatly lowered storage costs – about 70% in China and 90% in overseas regions. The overseas legacy solution, which used HDFS with three replicas on cloud instances and EBS, had a high replication factor and thus higher costs.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stability and operations:&lt;/strong&gt; Under the previous mixed architecture, many compute tasks easily consumed node resources, raising node load and affecting storage performance. With the decoupled storage‑compute architecture, compute tasks run on dedicated nodes, task durations are more stable, and scaling and management are more flexible.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;AI one‑stop storage&lt;/h3&gt;
&lt;p&gt;AI storage consists of three stages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Raw data stage:&lt;/strong&gt; Storing large volumes of raw data, which undergoes processing (for example, ETL) to produce training datasets, then is fed into high‑performance training environments.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Training stage:&lt;/strong&gt; &lt;a href="https://www.ibm.com/think/topics/model-training"&gt;Training&lt;/a&gt; tasks require high throughput and low latency to reduce I/O wait time and increase GPU utilization. After training, model files are generated for subsequent inference.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inference stage:&lt;/strong&gt; Model files must be quickly distributed to specific nodes for rapid startup of inference tasks.&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;AI storage flow&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;AI storage flow&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;Previously, data flowed among multiple systems, causing inconvenience for both application teams and internal operations. By adopting JuiceFS uniformly, we can meet diverse needs based on different storage tiers.&lt;/p&gt;
&lt;h4&gt;Requirements and solutions by stage&lt;/h4&gt;
&lt;p&gt;AI one-stop storage needs to cover three stages: raw data, training data, and model files. The requirements for capacity, performance, cost, and distribution efficiency differ at each stage. The table below compares the application needs for each stage with previous and current solutions.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left;"&gt;Use case&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Application requirements&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Previous solution&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Current solution (JuiceFS)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Raw data&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Large capacity, low cost; support high‑concurrency data processing; scale to PB+&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Direct use of object storage; HDFS; other low‑cost storage&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Capacity‑oriented JuiceFS: backed by multi‑cloud object storage, hiding vendor differences; EB‑scale capacity, hundreds of billions of files; millions of concurrent tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Training data&lt;/td&gt;
&lt;td style="text-align: left;"&gt;High throughput, low latency; reduce I/O wait time; improve GPU utilization&lt;/td&gt;
&lt;td style="text-align: left;"&gt;PFS, NAS (good performance but high cost)&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Performance‑oriented/cache‑oriented JuiceFS: TB/s throughput, low latency; async checkpoint to reduce I/O wait; cache acceleration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Model files&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Fast distribution; efficient loading; quick inference startup&lt;/td&gt;
&lt;td style="text-align: left;"&gt;P2P distribution; workflow distribution; PFS&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Cache‑accelerated JuiceFS: cache improves model loading; up to 16 GB/s sequential load per node; several times faster than local disk or FDS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;High‑performance cache acceleration: improving efficiency and cutting costs&lt;/h4&gt;
&lt;p&gt;In AI training, training datasets typically have the characteristics of "write once, read many times, and modify very little." This is a typical read-heavy, write-light access pattern, making it suitable for improving data access efficiency through caching.&lt;/p&gt;
&lt;p&gt;Take our internal &lt;a href="https://en.wikipedia.org/wiki/Self-driving_car"&gt;autonomous driving&lt;/a&gt; training as an example. Once a dataset version matures, its data volume may continue to grow within the version cycle, but existing data is rarely modified. While the previous high‑performance file storage met training performance requirements, it was over‑provisioned and unnecessarily expensive for such repetitive reads. Therefore, we began rolling out a high‑performance cache acceleration solution based on JuiceFS.&lt;/p&gt;
&lt;p&gt;The cache solution offers several advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Short I/O path:&lt;/strong&gt; Clients operate on files directly, greatly shortening the I/O path for fast responses.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance optimization:&lt;/strong&gt; Through RDMA and zero‑copy optimization, performance has significantly improved – throughput more than 20% higher than previous high‑performance storage, with ongoing optimization.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost reduction:&lt;/strong&gt; The previous PFS‑based storage used replication (though some used EC, replication was more common for stability). With the cache solution, single‑copy storage reduces costs by more than 60%.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resource consolidation:&lt;/strong&gt; For GPU training, the GPU nodes typically have NVMe drives (about 10 TB each), which were previously used in a scattered way with low utilization. Now, we consolidate these NVMe resources into a unified cache pool to accelerate nearby GPU training and data processing tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Future plans&lt;/h2&gt;
&lt;p&gt;Looking ahead, we’ll focus on three directions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Continuously improve the stability, performance, and scalability of the unified file storage foundation.&lt;/strong&gt; As AI applications grow rapidly, training, inference, and data processing tasks demand higher throughput, lower latency, and greater reliability. We’ll continue optimizing the underlying architecture and critical paths to enhance service capabilities under large‑scale concurrent access.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengthen lifecycle management for massive data.&lt;/strong&gt; Current data volumes continue to grow, but management across storage tiers, access frequencies, and retention periods can be further optimized. We’ll refine tiered storage, archiving, warm-up, and cleanup strategies based on data temperature, access patterns, and cost models, reducing unit storage cost and improving resource utilization.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhance data management and analysis capabilities.&lt;/strong&gt; On top of the unified file storage foundation, we’ll build data management capabilities for application users, helping them better understand data distribution, access behavior, and resource usage, supporting data management, cost optimization, and application decisions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We look forward to continuous exchanges with industry peers to explore more technical practices. If you have any questions about this article, feel free to join &lt;a href="https://github.com/juicedata/juicefs/discussions/"&gt;JuiceFS discussions on GitHub&lt;/a&gt; and our &lt;a href="http://go.juicefs.com/discord"&gt;community on Discord&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 27 May 2026 08:03:25 +0000</pubDate><guid>https://www.juicefs.com/en/blog/user-stories/artificial-intelligence-big-data-cloud-native-storage</guid></item><item><title>From Metadata Backup to Mirror File System: Resilience Strategies for JuiceFS</title><link>https://www.juicefs.com/en/blog/engineering/metadata-backup-mirror-file-system-resilience-strategies</link><description>&lt;div class="w-block-markdown block-markdown"&gt;&lt;h2&gt;Murphy's law: Anything that can go wrong will go wrong&lt;/h2&gt;
&lt;p&gt;Let me paint you a picture.&lt;/p&gt;
&lt;p&gt;It's a Thursday morning. You're the head of infrastructure for a massive financial institution, let's say, a pension fund managing over $100 billion in assets for more than 600,000 members. Your team spent months migrating everything to a private Google Cloud deployment. You have redundancy in two geographic regions. You have backups. You followed every best practice in the book.&lt;/p&gt;
&lt;p&gt;Then you wake up to this: &lt;strong&gt;everything is gone&lt;/strong&gt;. Not just a server. Not just a database. The entire private cloud account is deleted, like it never existed.&lt;/p&gt;
&lt;p&gt;If you're unfamiliar with this story, allow me to introduce you to UniSuper, an Australian pension fund that experienced every sysadmin's worst nightmare. &lt;a href="https://youtu.be/3GOAUyipnM4?si=qb_EVhAguPoA2ks-"&gt;On May 1, 2024, UniSuper lost complete access to its private cloud infrastructure&lt;/a&gt;. No member portals. No transaction processing. No account balances. For two full weeks, customers couldn't access their retirement savings. Eventually, and thankfully, the incident was resolved, and the data was recovered. &lt;strong&gt;Shoutout to the engineers who didn't just back things up—they also made backups offsite and then backed those up too!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;But how did the incident happen exactly? The answer is almost too absurd to believe.&lt;/p&gt;
&lt;p&gt;When provisioning the infrastructure, a parameter was left blank. That blank field triggered an automatic default: the private cloud environment was set to a fixed one-year term, with automatic deletion at the end. Fast forward one year, and the system did exactly what it was told: it deleted everything. Both geographic regions, all the data, and yes, even the backups stored in those regions. There was no warning, no confirmation dialog asking, "Are you absolutely certain you wish to delete everything?"&lt;/p&gt;
&lt;p&gt;Human errors and automation failures are inevitable, so what do we do about it? We can't prevent every mistake, but we can build systems that survive them, just like what UniSuper engineers did. This is where your choices of core data infrastructure (i.e., databases, storage, and file systems) matter more than you think. JuiceFS offers multiple layers of defense specifically designed for scenarios like these. Let me walk you through all of them today.&lt;/p&gt;
&lt;h2&gt;Your first line of defense: protecting the brain&lt;/h2&gt;
&lt;p&gt;In JuiceFS' architecture, there's a clear separation of duties. The actual file content (the data) lives in object storage services of your choice, such as S3, GCS, and MinIO. In the meantime, the map (the metadata) that tells which data blocks belong to which file is stored in the metadata engine supported by JuiceFS, such as Redis, MySQL, or the JuiceFS Enterprise metadata store.&lt;/p&gt;
&lt;p&gt;Here's the slightly scary part: The metadata is the brain. Lose your Redis backend, and you have petabytes of random data objects with no way to assemble them back into files. It's like having a million puzzle pieces without knowing what the original painting looks like.&lt;/p&gt;
&lt;p&gt;But flip that around. &lt;strong&gt;If the metadata is the brain, then backing it up alone means you've backed up the entire file system&lt;/strong&gt;. This assumes that the object storage is always intact, and we will question this assumption later.&lt;/p&gt;
&lt;h3&gt;Metadata backup&lt;/h3&gt;
&lt;p&gt;JuiceFS has a brilliant feature that doesn't rely on you to remember: &lt;a href="https://juicefs.com/docs/community/metadata_dump_load"&gt;automatic metadata backup&lt;/a&gt;. The JuiceFS client can be configured to automatically dump metadata to your object storage. By default, it does this operation every hour. Under the hood, the JuiceFS client uses the  &lt;code&gt;juicefs dump&lt;/code&gt;  command, serializing the entire directory tree into a snapshot, which can be either human-readable JSON or a compact binary format. And don't worry about performance: with optimizations in the JuiceFS Community Edition v1.3, &lt;a href="https://juicefs.com/en/blog/release-notes/juicefs-1-3-binary-backup"&gt;backing up 100 million files takes just minutes&lt;/a&gt; with controlled memory usage.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-shell"&gt;# Back up metadata every 8 hours.
juicefs mount -d --backup-meta 8h redis://127.0.0.1:6379/1 /mnt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The backup frequency is configurable with JuiceFS, along with some reasonable defaults taking the number of files and cleanup policies into account. It's worth mentioning that JuiceFS supports various databases as metadata engines. A database often comes with its own backup/snapshot mechanism as well, which can be utilized as an additional layer of backup. However, the backup file generated by  &lt;code&gt;juicefs dump&lt;/code&gt;  can be used across different engines as a uniform format specifically for JuiceFS metadata management.&lt;/p&gt;
&lt;p&gt;Once it's set, you don't think about it again. Until the day you need it, and when that day comes, the  &lt;code&gt;juicefs load&lt;/code&gt;  command restores everything from that snapshot. Of course, this all relies on one critical assumption: that your object storage is perfectly safe, which should often be the case. But for the sake of being extra skeptical, let's dig into that next.&lt;/p&gt;
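&lt;p&gt;To make the recovery path concrete, here is a minimal rehearsal sketch. It assumes two placeholder Redis URLs: the production engine and a scratch engine that must be empty before loading. The helper names &lt;code&gt;run&lt;/code&gt; and &lt;code&gt;rehearse_restore&lt;/code&gt; are hypothetical. By default the script only prints the commands; set &lt;code&gt;DRY_RUN=0&lt;/code&gt; to execute them.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-shell"&gt;# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: rehearse_restore SRC_META_URL SCRATCH_META_URL DUMP_FILE
rehearse_restore() {
  src_meta="$1"
  scratch_meta="$2"
  dump_file="$3"
  # Export the metadata from the source engine to a JSON dump file.
  run juicefs dump "$src_meta" "$dump_file" || return 1
  # Import the dump into the scratch engine (it must be empty).
  run juicefs load "$scratch_meta" "$dump_file" || return 1
  # Print the restored volume's settings as a basic sanity check.
  run juicefs status "$scratch_meta"
}

# Example with placeholder URLs and file name.
rehearse_restore redis://127.0.0.1:6379/1 redis://127.0.0.1:6379/2 meta-dump.json
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The same dump file could be loaded into a different type of metadata engine, since the format is engine-independent.&lt;/p&gt;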
&lt;h2&gt;Do I really need to back up object storage?&lt;/h2&gt;
&lt;p&gt;So now you have hourly metadata backups. That's a solid first step. Still, a common objection arises: "Amazon S3 provides 99.999999999% durability." Isn't that sufficient?&lt;/p&gt;
&lt;p&gt;Recall how this post began. Mainstream databases and object stores are extremely reliable. However, that reliability doesn't help when higher-level control logic fails, as it did for UniSuper. Other failure scenarios exist beyond vendor errors: lightning strikes (acts of God), accidental bucket deletions, misconfigured IAM policies, or cyber attacks compromising your access keys.&lt;/p&gt;
&lt;p&gt;Here is the essential truth: Your metadata backup is a map. If the actual data chunks in object storage are corrupted or deleted, that map serves no purpose. For critical data, backing up your object storage can be a sensible addition, because complex systems have many failure modes.&lt;/p&gt;
&lt;h2&gt;Beyond metadata backup: enterprise-grade resilience&lt;/h2&gt;
&lt;p&gt;Metadata backups are essential, and backing up object storage is pragmatic. But for organizations running critical workloads, especially across multiple regions or even cloud providers, sometimes a "cold" backup is not enough and can be wasteful as well. There are several additional JuiceFS features that go beyond traditional backup, which we will cover in this section.&lt;/p&gt;
&lt;h3&gt;Data synchronization&lt;/h3&gt;
&lt;p&gt;For straightforward data copying between storage systems, JuiceFS provides the &lt;a href="https://juicefs.com/docs/community/guide/sync"&gt;data synchronization&lt;/a&gt; feature via the  &lt;code&gt;juicefs sync&lt;/code&gt;  command, available in both the JuiceFS Community and Enterprise Editions. &lt;strong&gt;This tool can copy data between object storage, JuiceFS volumes, local file systems, and even remote servers via SSH or HDFS&lt;/strong&gt;. It supports incremental synchronization and pattern matching (similar to rsync), making it suitable for one-time data migrations or periodic copying of specific datasets.&lt;/p&gt;
&lt;p&gt;Here is a basic example of syncing from an S3 bucket to a JuiceFS volume, excluding any files with the  &lt;code&gt;.log&lt;/code&gt;  suffix:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-shell"&gt;juicefs sync s3://mybucket.s3.us-east-2.amazonaws.com/ jfs://VOL_NAME/ --exclude '*.log'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The  &lt;code&gt;juicefs sync&lt;/code&gt;  tool is versatile and is not limited to JuiceFS volumes: you can sync from MinIO to Azure Blob, from a local directory to JuiceFS, or between two entirely different cloud providers. The  &lt;code&gt;--include&lt;/code&gt;  and  &lt;code&gt;--exclude&lt;/code&gt;  patterns allow fine-grained control over exactly which files are copied.&lt;/p&gt;
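&lt;p&gt;As a sketch of a periodic copy job, the helper below syncs a local directory into a mounted JuiceFS volume while skipping temporary and log files. The function names &lt;code&gt;run&lt;/code&gt; and &lt;code&gt;backup_dataset&lt;/code&gt; and both paths are hypothetical placeholders. The commands are only printed unless &lt;code&gt;DRY_RUN=0&lt;/code&gt; is set, and the printed form drops the quotes around the patterns.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-shell"&gt;# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: backup_dataset SRC DST
# --update skips files whose copy at the destination is newer.
# Each --exclude pattern filters out matching file names.
backup_dataset() {
  src="$1"
  dst="$2"
  run juicefs sync --update --exclude '*.tmp' --exclude '*.log' "$src" "$dst"
}

# Example with placeholder paths: local dataset directory to a JuiceFS mount.
backup_dataset /data/datasets/ /mnt/jfs/backup/datasets/
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Scheduling a call like this from cron is one simple way to get periodic copies of a specific dataset.&lt;/p&gt;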
&lt;h3&gt;Data replication&lt;/h3&gt;
&lt;p&gt;For continuous, asynchronous replication across clouds or regions, JuiceFS Enterprise Edition offers &lt;a href="https://juicefs.com/docs/cloud/guide/replication"&gt;data replication&lt;/a&gt;. Once enabled, &lt;strong&gt;every write to the primary object storage bucket is automatically copied to a target bucket in another region or cloud provider&lt;/strong&gt;. This feature serves multiple purposes: cross-region data sharing, seamless object storage migration, and disaster recovery. If the primary object store fails, clients can be manually switched to the target bucket to restore service with minimal downtime.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;JuiceFS data replication&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;JuiceFS data replication&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;As shown in the diagram above, taking the primary region (main) writes and the replicated region (target) reads as an example, the data copy logic is straightforward: writes go to the primary region's object storage first and then are asynchronously replicated to the target region. Reads prefer the local region's bucket and fall back to the remote bucket if data hasn't yet arrived.&lt;/p&gt;
&lt;p&gt;It's notable that replication runs continuously and asynchronously within the client process itself. The client writes to both buckets directly, rather than relying on a background job mechanism. Because both regions share the same JuiceFS Enterprise metadata engine, metadata remains fully consistent between them. However, the target region inherently operates with higher latency and may experience reduced performance.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;JuiceFS Web Console&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Enabling data replication in the JuiceFS Web Console&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;From a cost perspective, replication is free to enable, as JuiceFS does not charge for this feature, nor does it generate additional metadata that would affect billing. However, you remain responsible for your object storage provider's standard usage fees, as now you have more than one copy of data.&lt;/p&gt;
&lt;p&gt;In short, data replication is designed to keep the underlying object storage in sync across regions using a shared metadata engine, with the additional benefits described above.&lt;/p&gt;
&lt;h3&gt;Mirror file system&lt;/h3&gt;
&lt;p&gt;The most sophisticated option is the &lt;a href="https://juicefs.com/docs/cloud/guide/mirror"&gt;mirror file system&lt;/a&gt;—&lt;strong&gt;a complete, writable replica of your entire JuiceFS deployment&lt;/strong&gt;. Here's how it works and why it matters.&lt;/p&gt;
&lt;p&gt;A mirror file system creates one or more full copies (both the metadata engine and the object storage) of an existing file system in different regions. Metadata is automatically synchronized from the source, allowing clients in the mirrored region to access the file system locally with dramatically lower latency. It is noteworthy that you can still choose not to replicate the object storage and rely on the caching layers to accelerate data access and reduce storage costs. But for simplicity, we will skip that setup here, as the mirror file system itself deserves a full long blog post.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;JuiceFS mirror file system architecture&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;JuiceFS mirror file system architecture&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;Prior to JuiceFS Enterprise version 5.1, mirrors were read-only, and write operations had to go back to the source region. Starting with version 5.1, &lt;a href="https://juicefs.com/en/blog/release-notes/uicefs-enterprise-5-1-artificial-intelligence"&gt;mirrors support writes as well&lt;/a&gt;. In a write-enabled mirror, clients connect to both the source and mirror metadata services simultaneously. Read requests go to the local mirror for low latency. Write requests go directly to the source. After the source responds, the client waits briefly for the change to sync back to the mirror before returning to the application.&lt;/p&gt;
&lt;p&gt;Consider an AI training scenario: your GPUs are in Region B, but your primary storage is in Region A. Without a mirror, every metadata operation crosses the network, potentially hundreds of milliseconds per request. With a mirror, reads are local, and the performance difference can be dramatic.&lt;/p&gt;
&lt;p&gt;With that said, a mirror file system is not merely a backup. It is also a cross-cloud, low-latency disaster recovery solution. A JuiceFS Enterprise deployment is unlikely to fail on its own, because high availability and resilience are built in: the JuiceFS Enterprise metadata engine runs on at least three nodes with Raft as the consensus algorithm, and the object storage service is generally robust. On top of that, the mirror file system feature enables consistent replication across regions and clouds. In the unlikely event of a primary region failure (again, maybe the entire data center is gone), you still have one or more copies of the entire file system with all committed writes available, each with its metadata engine and object storage highly available.&lt;/p&gt;
&lt;h2&gt;Final thoughts&lt;/h2&gt;
&lt;p&gt;Stop treating any single system as a magic shield. The UniSuper incident shows that even some of the most trusted systems can lose everything due to human error. Assume your entire cloud account could be deleted at any moment. So test your backups! A backup is just a file until you load it, which is exactly why you should consider running &lt;code&gt;juicefs load&lt;/code&gt; regularly in a test environment to rehearse the recovery process.&lt;/p&gt;
&lt;p&gt;Follow the &lt;a href="https://en.wikipedia.org/wiki/Backup#3-2-1_Backup_Rule"&gt;3-2-1 rule&lt;/a&gt;: Keep at least three copies of your data, stored on two different media types, with one copy kept offsite. Your downtime matters, so prepare accordingly. And here is how JuiceFS features covered in this blog post can help make your infrastructure resilient:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left;"&gt;Feature&lt;/th&gt;
&lt;th style="text-align: left;"&gt;JuiceFS edition&lt;/th&gt;
&lt;th style="text-align: left;"&gt;What it does&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Metadata Backup&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Community &amp;amp; Enterprise&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Metadata backups saved to object storage with configurable frequency&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Protecting against metadata engine loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Data Synchronization&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Community &amp;amp; Enterprise&lt;/td&gt;
&lt;td style="text-align: left;"&gt;One-time or periodic syncing between storage systems&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Migrations and scheduled data syncing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Data Replication&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Enterprise&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Continuous async copy of object storage to another region with shared metadata&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Active-passive disaster recovery and data sharing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Mirror File System&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Enterprise&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Writable mirror with local metadata service and local object storage&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Cross-cloud, cross-region active-active deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;No single feature fits every use case. Assess your recovery needs, complexity tolerance, and budget. Let's recall Murphy's Law: anything that can go wrong will go wrong. The question now is just whether you will be ready for it.  &lt;/p&gt;
&lt;p&gt;If you have any questions for this article, feel free to join &lt;a href="https://github.com/juicedata/juicefs/discussions/"&gt;JuiceFS discussions on GitHub&lt;/a&gt; and &lt;a href="http://go.juicefs.com/discord"&gt;community on Discord&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 14 May 2026 09:19:00 +0000</pubDate><guid>https://www.juicefs.com/en/blog/engineering/metadata-backup-mirror-file-system-resilience-strategies</guid></item><item><title>Quota Design in Distributed Architectures: Implementation and Use Cases in JuiceFS</title><link>https://www.juicefs.com/en/blog/engineering/quota-design-in-distributed-architecture</link><description>&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;In distributed storage environments, storage resources are typically shared across multiple users, projects, and applications. Without effective constraint mechanisms, abnormal writes or erroneous operations from a single tenant can quickly consume large amounts of space or inodes, impacting system stability and cost control. Quota management provides a way to establish predictable resource boundaries in shared environments.  &lt;/p&gt;
&lt;p&gt;In distributed systems, quota management is far more than just "setting a limit." The system must balance concurrent writes from multiple clients, asynchronous metadata updates, and overall throughput. At the same time, quota rules must be enforced at different levels of control. To address this, &lt;a href="https://juicefs.com/docs/community/introduction/"&gt;JuiceFS&lt;/a&gt; provides multi-level quota capabilities covering the entire file system, directories, and users, supporting scenarios ranging from overall capacity control to individual and team-level constraints.  &lt;/p&gt;
&lt;p&gt;In this article, we’ll introduce the design and implementation of JuiceFS' quota mechanism, including its core data structures, synchronization model, and the validation and accounting logic in write and delete processes. We’ll also include typical use cases that highlight common issues around quota changes, space reclamation, and over-limit writes.&lt;/p&gt;
&lt;h2&gt;Quota types and resource dimensions supported by JuiceFS&lt;/h2&gt;
&lt;p&gt;JuiceFS quotas support two resource dimensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Space: Used storage capacity. Statistics are based on the file system's usage perspective and are aligned to block granularity. The write path section later will explain how incremental usage is estimated under 4 KiB alignment.  &lt;/li&gt;
&lt;li&gt;Inodes: Number of used inodes. For workloads with a large number of small files, inodes often become the constraint bottleneck earlier than space. Therefore, inode quotas must also be part of the management strategy.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Based on these two resource dimensions, JuiceFS currently supports four types of quotas.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left;"&gt;Quota type&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Scope&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Design goal&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Typical use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Total file system quota&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Entire file system&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Prevents overall resource runaway&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Cost budget control, capacity limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Subdirectory quota&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Directory subtree&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Blocks abnormal write behavior&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Prevents misoperations, small‑file storms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;User quota&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Per user&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Isolates impact between different applications&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Multi‑tenant data management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;User group quota&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Project or department&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Cost allocation and team limits&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Shared environment for AI projects&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;User quotas and user group quotas are expected to be released in &lt;a href="https://juicefs.com/docs/community/introduction/"&gt;JuiceFS Community Edition&lt;/a&gt; 1.4.&lt;br&gt;
In practice, a common and effective strategy combines the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Total file system quota as a safety net.  &lt;/li&gt;
&lt;li&gt;Directory quotas to address individual abuse and small‑file storms.  &lt;/li&gt;
&lt;li&gt;User/group quotas for &lt;a href="https://en.wikipedia.org/wiki/Multitenancy"&gt;multi‑tenant&lt;/a&gt; management.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This layered approach controls overall resource limits while preventing abnormal growth of a single entity from affecting other workloads.&lt;/p&gt;
&lt;h2&gt;Quota implementation mechanism&lt;/h2&gt;
&lt;h3&gt;Synchronization model and data structures&lt;/h3&gt;
&lt;p&gt;The main challenge of implementing quotas is how to perform checking, accounting, and convergence at an acceptable cost under concurrent writes from multiple clients. JuiceFS clients run on various nodes and continuously issue resource‑changing operations such as creation, writing, truncation, and deletion. If every operation required a strongly consistent server‑side check and update, the write path would incur unacceptable overhead.&lt;br&gt;
Therefore, the quota mechanism must satisfy two goals:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Performance: Avoid a strongly consistent server‑side update on every write.  &lt;/li&gt;
&lt;li&gt;Consistency: Ensure that system usage eventually converges under concurrent writes from multiple clients and prevent over‑limit operations before they happen, as much as possible.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Based on this trade‑off, JuiceFS adopts a synchronization model that works as "local accumulation, periodic flush, and periodic refresh."&lt;/strong&gt; Clients first accumulate resource deltas in local memory, with background tasks periodically persisting them to the metadata engine in batches. At the same time, each client periodically pulls the latest quota configuration and baseline usage from the server, gradually aligning its own global view. Clients do not communicate directly with each other; instead, the metadata engine serves as the central coordination point.  &lt;/p&gt;
&lt;p&gt;In other words, JuiceFS quotas do not pursue &lt;a href="https://en.wikipedia.org/wiki/Strong_consistency"&gt;strong consistency&lt;/a&gt; on each operation but achieve eventually consistent resource control through periodic synchronization.  &lt;/p&gt;
&lt;p&gt;In the current implementation, quota deltas are flushed every &lt;strong&gt;3 seconds&lt;/strong&gt; (&lt;code&gt;flushQuotas&lt;/code&gt;). Clients reload the latest quota configuration and baseline usage from the backend approximately every 12 seconds (via a refresh call triggered by the mount heartbeat). This means that under extreme conditions, the global views seen by different clients may diverge by up to about 12 seconds, but they will gradually converge in subsequent sync cycles.  &lt;/p&gt;
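&lt;p&gt;The convergence behavior can be illustrated with a minimal, deterministic Go sketch. The &lt;code&gt;backend&lt;/code&gt; and &lt;code&gt;client&lt;/code&gt; types and their methods are hypothetical names chosen for illustration, not the JuiceFS implementation, and the timers are replaced by explicit calls so that divergence and convergence are easy to follow.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-go"&gt;package main

import "fmt"

// backend stands in for the metadata engine: it holds the persisted usage.
type backend struct {
    usedSpace int64
}

// client keeps a cached baseline plus a local, not-yet-flushed delta.
type client struct {
    meta      *backend
    usedSpace int64 // baseline last pulled from the backend
    newSpace  int64 // local delta awaiting the next flush
}

// write records a usage change locally; nothing is sent to the backend yet.
func (c *client) write(space int64) { c.newSpace += space }

// flush persists the local delta in one batch (every 3 s in the text above).
func (c *client) flush() {
    c.meta.usedSpace += c.newSpace
    c.usedSpace += c.newSpace
    c.newSpace = 0
}

// refresh reloads the baseline from the backend (about every 12 s in the text above).
func (c *client) refresh() { c.usedSpace = c.meta.usedSpace }

// view is the usage this client currently believes to be true.
func (c *client) view() int64 { return c.usedSpace + c.newSpace }

func main() {
    meta := new(backend)
    a := client{meta: meta}
    b := client{meta: meta}

    a.write(100)
    b.write(50)
    fmt.Println(a.view(), b.view()) // 100 50: the two views diverge

    a.flush()
    b.flush()
    a.refresh()
    b.refresh()
    fmt.Println(a.view(), b.view()) // 150 150: the two views converge
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Between a flush and the next refresh, each client only knows about its own writes, which is where the short-lived divergence described above comes from.&lt;/p&gt;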
&lt;p&gt;Quota information is managed uniformly by the quota structure. It represents a single quota entity and can adapt to different types of managed objects such as directories, users, and user groups. Its core design decouples baseline usage from incremental usage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;UsedSpace&lt;/code&gt;/&lt;code&gt;UsedInodes&lt;/code&gt;: Represents the baseline usage already persisted in the backend.  &lt;/li&gt;
&lt;li&gt;&lt;code&gt;newSpace&lt;/code&gt;/&lt;code&gt;newInodes&lt;/code&gt;: Represents the locally accumulated deltas on this client that have not yet been flushed to the backend.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;type Quota struct {  
    MaxSpace, MaxInodes   int64  // Maximum space and inode limits  
    UsedSpace, UsedInodes int64  // Used space and inodes  
    newSpace, newInodes   int64  // Pending usage deltas to be synced  
}  
&lt;/code&gt;&lt;/pre&gt;
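&lt;p&gt;To show why the baseline and the delta are kept apart, here is a minimal sketch of how a check and a local update could use these fields. The &lt;code&gt;over&lt;/code&gt;, &lt;code&gt;exceeds&lt;/code&gt;, and &lt;code&gt;update&lt;/code&gt; helpers are hypothetical, the sketch treats a limit of 0 as "no limit," and it leaves out the locking or atomic access that a concurrent client would need.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-go"&gt;package main

import "fmt"

type Quota struct {
    MaxSpace, MaxInodes   int64 // limits; 0 is treated as "no limit" in this sketch
    UsedSpace, UsedInodes int64 // baseline persisted in the backend
    newSpace, newInodes   int64 // local deltas awaiting flush
}

// over reports whether growing used by delta would pass limit.
// Non-positive deltas (deletions, truncations) are never rejected.
func over(limit, used, delta int64) bool {
    if limit == 0 || !(delta > 0) {
        return false
    }
    return used+delta > limit
}

// exceeds checks both resource dimensions against baseline plus local delta.
func (q *Quota) exceeds(space, inodes int64) bool {
    return over(q.MaxSpace, q.UsedSpace+q.newSpace, space) ||
        over(q.MaxInodes, q.UsedInodes+q.newInodes, inodes)
}

// update accumulates a delta locally; negative values model deletions.
func (q *Quota) update(space, inodes int64) {
    q.newSpace += space
    q.newInodes += inodes
}

func main() {
    q := Quota{MaxSpace: 1024, MaxInodes: 2, UsedSpace: 512, UsedInodes: 1}
    fmt.Println(q.exceeds(256, 1)) // false: 768 of 1024 bytes, 2 of 2 inodes
    q.update(256, 1)
    fmt.Println(q.exceeds(512, 0)) // true: 512+256+512 is more than 1024
    fmt.Println(q.exceeds(0, 1))   // true: 1+1+1 is more than 2
}
&lt;/code&gt;&lt;/pre&gt;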

&lt;p&gt;For inode accounting, hard links require special attention, because quota types differ in how they count them. Directory quotas count directory entries: when a hard link is created under a directory, that directory’s space usage and inode count both increase (the inode count by 1), and they decrease accordingly when the hard link is removed. User quotas and user group quotas are deduplicated by the file object (inode): even if a file has multiple hard links, it’s counted only once per &lt;a href="https://en.wikipedia.org/wiki/User_identifier"&gt;UID&lt;/a&gt;/&lt;a href="https://en.wikipedia.org/wiki/Group_identifier"&gt;GID&lt;/a&gt; dimension. Therefore, creating or deleting hard links does not change the usage for the associated user or user group.&lt;/p&gt;
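&lt;p&gt;The two counting rules can be compared in a small sketch. The &lt;code&gt;entry&lt;/code&gt; type and both functions are illustrative assumptions that model only the inode dimension, not JuiceFS code.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-go"&gt;package main

import "fmt"

// entry is one directory entry: a name pointing at an inode owned by a user.
type entry struct {
    name  string
    inode uint64
    uid   uint32
}

// dirInodes counts directory entries, so every hard link adds one.
func dirInodes(entries []entry) int { return len(entries) }

// userInodes counts distinct inodes per UID, so extra hard links add nothing.
func userInodes(entries []entry) map[uint32]int {
    seen := map[uint64]bool{}
    counts := map[uint32]int{}
    for _, e := range entries {
        if seen[e.inode] {
            continue
        }
        seen[e.inode] = true
        counts[e.uid]++
    }
    return counts
}

func main() {
    entries := []entry{
        {name: "data.bin", inode: 42, uid: 1000},
        {name: "data-link.bin", inode: 42, uid: 1000}, // hard link to inode 42
        {name: "notes.txt", inode: 43, uid: 1000},
    }
    fmt.Println(dirInodes(entries))        // 3
    fmt.Println(userInodes(entries)[1000]) // 2
}
&lt;/code&gt;&lt;/pre&gt;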
&lt;h3&gt;Quota storage&lt;/h3&gt;
&lt;p&gt;Regarding the quota storage mechanism, the total file system quota (the global "red line") has its capacity and inode limits directly persisted in the metadata engine. Clients load this configuration during mount and enforce hard limits, ensuring the underlying resources are not exceeded.  &lt;/p&gt;
&lt;p&gt;In contrast, checks and delta accumulation for directory, user, and user group quotas rely more on the client side. Clients maintain in‑memory indexing structures keyed by inode, UID, and GID, and periodically synchronize the corresponding quota information from the backend. This keeps lookup overhead low in high‑frequency I/O scenarios. It’s important to emphasize that the client in‑memory state is only a runtime cache and incremental view; the authoritative source for quota configuration and baseline usage remains the metadata backend.&lt;/p&gt;
&lt;h3&gt;Quota checks&lt;/h3&gt;
&lt;p&gt;A synchronization model and data structures alone are not sufficient; quota logic must also be embedded into the specific resource‑changing paths. A single write operation may not be a simple data append: it can simultaneously involve inode creation, block allocation, directory entry changes, and parent‑directory statistics updates. Under multi‑client concurrency, these changes collectively affect the same set of quota constraints. Therefore, only by placing checks and statistics updates directly into the operation paths (write, create, truncate, and delete) can we avoid out‑of‑limit writes and statistical inaccuracies.&lt;/p&gt;
&lt;h4&gt;Pre‑write: incremental estimation and multi‑dimensional quota check&lt;/h4&gt;
&lt;p&gt;When a user initiates an operation that may change resource usage (such as write, create, and truncate), the client first estimates the expected resource delta, including both space and inode changes.  &lt;/p&gt;
&lt;p&gt;The space delta is estimated from the actual allocation granularity of the underlying data blocks (for example, 4 KiB alignment), so the calculation must be block‑aligned. Inode deltas primarily occur in creation operations (such as creating a new file or directory).  &lt;/p&gt;
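&lt;p&gt;As a worked example of block‑aligned estimation, the sketch below rounds file lengths up to 4 KiB and derives the accounted growth of a write. The helper names are hypothetical, and the sketch assumes that an empty file accounts for zero bytes, which may differ from the real implementation.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-go"&gt;package main

import "fmt"

const blockSize = 4096 // 4 KiB accounting granularity

// align4K rounds a length up to the next multiple of 4 KiB.
func align4K(length int64) int64 {
    return (length + blockSize - 1) / blockSize * blockSize
}

// spaceDelta estimates the accounted growth when a file grows from oldLen to newLen.
func spaceDelta(oldLen, newLen int64) int64 {
    return align4K(newLen) - align4K(oldLen)
}

func main() {
    fmt.Println(align4K(1))             // 4096
    fmt.Println(align4K(4097))          // 8192
    fmt.Println(spaceDelta(100, 3000))  // 0: still inside the first 4 KiB
    fmt.Println(spaceDelta(4096, 4097)) // 4096: one extra byte needs a new block
}
&lt;/code&gt;&lt;/pre&gt;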
&lt;p&gt;After obtaining the resource delta for the operation, the client performs a quota check before actually writing. The check covers multiple dimensions: user and user group quotas, total file system quota, and directory quotas for the target directory tree. If any dimension would exceed its limit after this operation, the request is rejected with an error such as quota exceeded or out of space.  &lt;/p&gt;
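&lt;p&gt;A minimal sketch of such a multi‑dimensional check follows. The &lt;code&gt;limit&lt;/code&gt; and &lt;code&gt;quotas&lt;/code&gt; types, the path‑keyed directory map, and the error value are all assumptions made for illustration; a real client keys directory quotas by inode and also checks the inode dimension.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-go"&gt;package main

import (
    "errors"
    "fmt"
    "path"
)

var errQuota = errors.New("quota exceeded")

// limit is one quota dimension; max == 0 means "no quota set" in this sketch.
type limit struct{ max, used int64 }

func (l limit) allows(delta int64) bool {
    return l.max == 0 || l.max >= l.used+delta
}

// quotas groups the dimensions named in the text.
type quotas struct {
    total  limit
    users  map[uint32]limit
    groups map[uint32]limit
    dirs   map[string]limit // keyed by directory path
}

// check returns an error naming the first dimension the write would exceed.
func (q quotas) check(dir string, uid, gid uint32, delta int64) error {
    if !q.users[uid].allows(delta) {
        return fmt.Errorf("user %d: %w", uid, errQuota)
    }
    if !q.groups[gid].allows(delta) {
        return fmt.Errorf("group %d: %w", gid, errQuota)
    }
    if !q.total.allows(delta) {
        return fmt.Errorf("file system: %w", errQuota)
    }
    // Every ancestor of the target directory may carry its own quota.
    for d := path.Clean(dir); ; d = path.Dir(d) {
        if !q.dirs[d].allows(delta) {
            return fmt.Errorf("directory %s: %w", d, errQuota)
        }
        if d == "/" || d == "." {
            return nil
        }
    }
}

func main() {
    q := quotas{
        total: limit{max: 1000, used: 400},
        users: map[uint32]limit{1000: {max: 500, used: 100}},
        dirs:  map[string]limit{"/data": {max: 300, used: 250}},
    }
    fmt.Println(q.check("/data/train", 1000, 1000, 40))  // accepted, no error
    fmt.Println(q.check("/data/train", 1000, 1000, 100)) // directory /data: quota exceeded
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second call is rejected by the quota on &lt;code&gt;/data&lt;/code&gt; even though the user and file system dimensions still have room, because one failing dimension is enough to refuse the write.&lt;/p&gt;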
&lt;p&gt;By placing the check in the write path before the resource change, the system can block risky operations before they happen, avoiding complex cleanup or rollback afterwards.&lt;/p&gt;
&lt;h4&gt;Post‑write: local delta accumulation and background batched sync&lt;/h4&gt;
&lt;p&gt;After a successful write, the resource delta generated by the operation is incorporated into the corresponding usage statistics and gradually aligns with the global state according to the defined convergence mechanism. Specifically, three categories of statistics are affected:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Global level: The overall file system usage increases (or decreases).  &lt;/li&gt;
&lt;li&gt;Directory level: The usage of the relevant directory subtree changes accordingly.  &lt;/li&gt;
&lt;li&gt;User / user group level: The usage of the corresponding subject also accumulates.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These updates are first reflected in the client’s local accumulated deltas and are not immediately written back to the backend in a strongly consistent way. Later, background tasks flush them in batches, and periodic refresh operations gradually align them with other clients, achieving global convergence.&lt;/p&gt;
&lt;h2&gt;Usage statistics (&lt;code&gt;stats&lt;/code&gt;): foundation for the quota system&lt;/h2&gt;
&lt;p&gt;For quotas to work effectively, the system must be able to track current resource usage with low overhead. Whether for large directory trees or many users and user groups, if every check requires a real‑time full scan, the performance cost will be unacceptable. Therefore, an efficient and reliable usage statistics mechanism is a prerequisite for implementing quotas.&lt;/p&gt;
&lt;h3&gt;Directory statistics&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://juicefs.com/en/blog/engineering/design-juicefs-directory-quotas"&gt;Directory quotas&lt;/a&gt; constrain the total space and inode usage of an entire directory subtree, not the size of individual files. Consequently, they rely on directory‑level usage statistics.  &lt;/p&gt;
&lt;p&gt;It’s important to note that directory statistics (&lt;code&gt;DirStats&lt;/code&gt;) and quota statistics have different scopes: &lt;code&gt;DirStats&lt;/code&gt; only sums up the usage of immediate children (files and subdirectories) under a given directory – a single‑level statistic. In contrast, directory quotas recursively sum up the entire subtree. This design allows &lt;code&gt;DirStats&lt;/code&gt; to be maintained with lower overhead, while directory quotas provide a full subtree view.  &lt;/p&gt;
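&lt;p&gt;The difference in scope can be sketched as follows. The &lt;code&gt;node&lt;/code&gt; type and both functions are hypothetical, and the sketch assumes that a subdirectory contributes one inode and no space to its parent’s single‑level statistic.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-go"&gt;package main

import "fmt"

// node is a directory: files holds the sizes of its immediate files.
type node struct {
    files    []int64
    children []node
}

// dirStat sums only the immediate children: one level, cheap to maintain.
func dirStat(n node) (space int64, inodes int64) {
    for _, size := range n.files {
        space += size
    }
    inodes = int64(len(n.files) + len(n.children))
    return space, inodes
}

// subtreeUsage recurses through the whole subtree, as a directory quota must.
func subtreeUsage(n node) (space int64, inodes int64) {
    space, inodes = dirStat(n)
    for _, c := range n.children {
        s, i := subtreeUsage(c)
        space += s
        inodes += i
    }
    return space, inodes
}

func main() {
    root := node{
        files: []int64{4096},
        children: []node{
            {files: []int64{8192, 4096}},
        },
    }
    fmt.Println(dirStat(root))      // 4096 2
    fmt.Println(subtreeUsage(root)) // 16384 4
}
&lt;/code&gt;&lt;/pre&gt;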
&lt;p&gt;&lt;strong&gt;The key to implementing such statistics is maintaining low overhead and high availability for large directory trees.&lt;/strong&gt; JuiceFS follows the same approach as the quota mechanism: high‑frequency local updates and batched background persistence. Clients maintain directory usage deltas in memory; when operations such as writes or deletions occur, the changes are first recorded locally and then periodically synced in batches to the metadata engine by background tasks.  &lt;/p&gt;
&lt;p&gt;In addition, the system does not load all directory statistics at mount time. For large directory trees, a full load would cause significant latency and memory overhead. Therefore, directory statistics adopt an on‑demand fetch strategy: only when precise usage is required (such as quota checks, usage summarization, and administrative queries) does the system load the statistics of the corresponding directory from the backend.  &lt;/p&gt;
&lt;p&gt;When users query usage information via &lt;code&gt;df&lt;/code&gt; or an application calls &lt;code&gt;statfs&lt;/code&gt;, JuiceFS makes a trade‑off between performance and accuracy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It first uses locally cached used space and inodes for fast calculation.  &lt;/li&gt;
&lt;li&gt;If the local baseline is incomplete (for example, just after startup) or higher real‑time accuracy is needed, it fetches the latest global counters from the backend for calibration.  &lt;/li&gt;
&lt;li&gt;Finally, it adds locally accumulated (not yet synced) deltas to make the result more accurate for the current node’s write state.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After obtaining the used amounts, the client calculates &lt;code&gt;total&lt;/code&gt; and &lt;code&gt;avail&lt;/code&gt; based on whether a total capacity limit is configured:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If a limit is configured, total capacity equals that limit, and available capacity is "limit minus used."  &lt;/li&gt;
&lt;li&gt;If no limit is configured, it returns a dynamically estimated total capacity so that tools like &lt;code&gt;df&lt;/code&gt; can display normally.&lt;/li&gt;
&lt;/ul&gt;
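&lt;p&gt;The calculation above can be sketched in a few lines. The &lt;code&gt;statfs&lt;/code&gt; function and the &lt;code&gt;fallbackTotal&lt;/code&gt; constant are hypothetical; in particular, the fixed 1 PiB value is only a placeholder for the dynamic estimate, whose exact method is not described here.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-go"&gt;package main

import "fmt"

// fallbackTotal is a placeholder reported when no limit is configured (1 PiB).
const fallbackTotal int64 = 1125899906842624

// statfs derives total and avail from the cached baseline, the local
// not-yet-synced delta, and an optional capacity limit (0 means no limit).
func statfs(baseline, localDelta, limit int64) (total, avail int64) {
    used := baseline + localDelta
    total = limit
    if total == 0 {
        total = fallbackTotal
    }
    avail = total - used
    if used > total {
        avail = 0 // never report negative free space
    }
    return total, avail
}

func main() {
    fmt.Println(statfs(600, 50, 1000)) // 1000 350
    fmt.Println(statfs(990, 20, 1000)) // 1000 0
}
&lt;/code&gt;&lt;/pre&gt;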
&lt;p&gt;Moreover, when querying quotas from the root directory, the system displays the maximum space and inode limits, allowing administrators to see the global resource boundaries.  &lt;/p&gt;
&lt;p&gt;In addition, JuiceFS will support real‑time updates of directory statistics for the trash in version 1.4. When files are deleted (moved to the trash), restored from the trash, or permanently cleaned up, the system updates the trash directory’s statistics immediately. This enables administrators to accurately track space usage of the trash.&lt;/p&gt;
&lt;h3&gt;User and user group statistics&lt;/h3&gt;
&lt;p&gt;User and user group statistics are collected only after the corresponding quota feature is enabled. Before enabling, the &lt;code&gt;updateUserGroupStat&lt;/code&gt; call in the core path returns directly without generating any statistics. After enabling, clients maintain usage data in an in‑memory map keyed by &lt;code&gt;uid&lt;/code&gt; and &lt;code&gt;gid&lt;/code&gt; and update the relevant statistics on all paths that may cause usage changes.  &lt;/p&gt;
&lt;p&gt;A special note: when setting a quota for a user or user group for the first time via &lt;code&gt;juicefs quota set --uid&lt;/code&gt; or &lt;code&gt;juicefs quota set --gid&lt;/code&gt;, the system immediately performs a full scan of existing files to initialize the baseline usage. After this initialization, subsequent writes and deletions become incremental updates, and no further full scan is required.&lt;/p&gt;
&lt;h2&gt;Common scenarios&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;1. A file has been deleted. Why hasn’t the total file system quota decreased? Why hasn’t the object storage billing changed?&lt;/strong&gt;  &lt;/p&gt;
&lt;p&gt;This is usually not a statistics error, but a result of file system semantics combined with the statistical model.  &lt;/p&gt;
&lt;p&gt;For example, after enabling the trash in JuiceFS, a deletion operation does not immediately free space. The file is first moved to the trash for possible recovery. Therefore, files in the trash are still counted in the total file system quota and user / user group quotas, but are no longer counted in the original directory quota.  &lt;/p&gt;
&lt;p&gt;Another common reason is the time lag between file system statistics and &lt;a href="https://en.wikipedia.org/wiki/Object_storage"&gt;object storage&lt;/a&gt; billing. JuiceFS quota statistics use a local accumulation + periodic background sync model, so it’s possible that different clients or different statistical interfaces have not yet converged in a short time. At the same time, object storage may not have completed garbage collection or lifecycle cleanup. Therefore, temporarily seeing inconsistency between file system usage, quota statistics, and object storage billing is generally expected. This is not considered a system anomaly as long as they gradually converge over time.  &lt;/p&gt;
&lt;p&gt;In addition, note that quota and &lt;code&gt;statfs&lt;/code&gt; show the file system perspective of space usage and availability, while object storage billing is based on the underlying object storage model – affected by factors such as chunking, merging, delayed reclamation, and lifecycle rules. &lt;/p&gt;
&lt;p&gt;The two are not required to be the same.  &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Quota is full, but appending to an existing file did not report an error immediately.&lt;/strong&gt;&lt;br&gt;
This is often related to the asynchronous commit path in some JuiceFS writes. From the application’s perspective, the write system call may return success early, while the actual data commit and corresponding quota check happen later. Thus, appending may appear to "succeed," but the data may not be fully persisted; if the later commit stage determines that the quota would be exceeded, the write may still fail.  &lt;/p&gt;
&lt;p&gt;In other words, a successful write return does not guarantee that the write has been finally committed. In scenarios involving quota limits, a safer approach is to check the return status of &lt;code&gt;close&lt;/code&gt; and the final file size, and to handle any errors accordingly.  &lt;/p&gt;
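&lt;p&gt;On the application side, this advice could look like the following sketch, which uses only the Go standard library. The &lt;code&gt;appendChecked&lt;/code&gt; helper and the file name are hypothetical, and the size comparison assumes that no other process writes to the file at the same time.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-go"&gt;package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// appendChecked appends data to path and verifies the result. It reports
// errors from Write and Close, then compares the final size with the
// size expected after the append.
func appendChecked(path string, data []byte) error {
    before := int64(0)
    if info, err := os.Stat(path); err == nil {
        before = info.Size()
    }
    f, err := os.OpenFile(path, os.O_WRONLY|os.O_APPEND|os.O_CREATE, 0o644)
    if err != nil {
        return err
    }
    if _, err := f.Write(data); err != nil {
        f.Close()
        return fmt.Errorf("write: %w", err)
    }
    // A failure in a deferred commit (for example, quota exceeded) can surface here.
    if err := f.Close(); err != nil {
        return fmt.Errorf("close: %w", err)
    }
    info, err := os.Stat(path)
    if err != nil {
        return err
    }
    if want := before + int64(len(data)); info.Size() != want {
        return fmt.Errorf("size is %d bytes, expected %d", info.Size(), want)
    }
    return nil
}

func main() {
    path := filepath.Join(os.TempDir(), "quota-append-demo.log")
    defer os.Remove(path)
    if err := appendChecked(path, []byte("hello\n")); err != nil {
        fmt.Println("append failed:", err)
        return
    }
    fmt.Println("append committed")
}
&lt;/code&gt;&lt;/pre&gt;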
&lt;p&gt;&lt;strong&gt;3. Quota is not yet full, but file creation fails.&lt;/strong&gt;&lt;br&gt;
This phenomenon is usually related to temporary view divergence under the eventually consistent statistical model.  &lt;/p&gt;
&lt;p&gt;Example: a volume has a total quota of 2,000 inodes, and there are currently 1,999 files. One more file should be creatable. However, in extreme concurrency or unusual refresh timing, the client’s local cache may diverge briefly from the backend baseline count. This may cause the in‑memory used inode count to be temporarily too high, thus rejecting a legitimate creation request.  &lt;/p&gt;
&lt;p&gt;This type of problem inherently stems from the local accumulation + periodic sync convergence model. It avoids the high overhead of strongly consistent backend updates on every operation, but in extreme cases the system may have short‑term false positives. &lt;/p&gt;
&lt;p&gt;Typically, such false positives disappear with the next sync cycle, and retries can mitigate the issue.  &lt;/p&gt;
&lt;p&gt;This also illustrates that, in a distributed environment, quotas are best understood as an efficient, near‑real‑time constraint mechanism, not a fully synchronous, strongly consistent judgment for every concurrent operation.  &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. After a write exceeds the quota, why does the "failed" file remain in the directory?&lt;/strong&gt;  &lt;/p&gt;
&lt;p&gt;This is not unique to JuiceFS; it is common among file systems that follow &lt;a href="https://ja.wikipedia.org/wiki/POSIX"&gt;POSIX&lt;/a&gt; semantics.  &lt;/p&gt;
&lt;p&gt;For example, a user sets a 1 GiB quota on a directory and then tries to write a 2 GiB file using &lt;code&gt;dd&lt;/code&gt;. The file system first allows the first 1 GiB of valid writes; only when the subsequent write exceeds the quota does it return “Disk quota exceeded.” Consequently, a "partial file" of about 1 GiB is left behind. This does not indicate abnormal behavior. It simply means the first part of the data was written successfully, while the remainder failed due to the quota.  &lt;/p&gt;
&lt;p&gt;The file system's responsibility is to report the error, not to decide whether to delete the successfully written data. Whether to clean up such an incomplete file is left to the application. This follows standard POSIX semantics: the file system returns the error, and the application handles subsequent cleanup and recovery.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;In a distributed file system, quotas are not a simple "counter" feature, but a system design that must balance performance, consistency, and management granularity. Through pre‑write checks, local accumulation, and periodic background synchronization, JuiceFS minimizes overhead on the write path while allowing various usage statistics to gradually converge under an eventual consistency model. Based on this mechanism, quota control covers not only total file system capacity, but also multiple levels such as directories, users, and user groups, thereby meeting the needs of typical scenarios including multi‑tenant isolation, individual constraints, and team‑level resource management.  &lt;/p&gt;
&lt;p&gt;If you have any questions for this article, feel free to join &lt;a href="https://github.com/juicedata/juicefs/discussions/"&gt;JuiceFS discussions on GitHub&lt;/a&gt; and &lt;a href="http://go.juicefs.com/discord"&gt;community on Discord&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 30 Apr 2026 11:43:00 +0000</pubDate><guid>https://www.juicefs.com/en/blog/engineering/quota-design-in-distributed-architecture</guid></item><item><title>JuiceFS Performance Optimization for AI Scenarios</title><link>https://www.juicefs.com/en/blog/engineering/juicefs-ai-workload-performance-optimization</link><description>&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;The scale of computing power for &lt;a href="https://en.wikipedia.org/wiki/Large_language_model"&gt;large language model&lt;/a&gt; (LLM) training continues to expand. While GPU performance keeps improving, data access bottlenecks are becoming increasingly prominent in overall system performance. Local storage offers excellent performance but has limited scalability. Object storage excels in cost and scalability but suffers from insufficient throughput in massive small‑file and high‑concurrency scenarios. Teams often struggle to choose between them.  &lt;/p&gt;
&lt;p&gt;Therefore, &lt;a href="https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems"&gt;distributed file systems&lt;/a&gt; have become a key solution to balance high performance and scalability. &lt;a href="https://juicefs.com/docs/community/introduction/"&gt;JuiceFS&lt;/a&gt; has been widely deployed in AI scenarios across multiple industries. Its distributed architecture delivers high performance, strong scalability, and low cost simultaneously for large‑scale data access.  &lt;/p&gt;
&lt;p&gt;In this article, we’ll introduce JuiceFS’ architecture from a performance perspective and analyze core performance bottlenecks and optimization methods under different access patterns. We’ll also link to further reading on key points, helping you understand JuiceFS’ performance mechanisms and master common tuning strategies.&lt;/p&gt;
&lt;h3&gt;Performance foundations from the JuiceFS architecture&lt;/h3&gt;
&lt;p&gt;JuiceFS comes in &lt;a href="https://juicefs.com/docs/community/introduction/"&gt;Community Edition&lt;/a&gt; and &lt;a href="https://juicefs.com/docs/cloud/"&gt;Enterprise Edition&lt;/a&gt;. Both share the same architecture: metadata and data are separated. The client adopts a rich‑client design, handling core logic including some metadata operations, and provides both metadata and data caching. These modules work together for efficient data location and access. The underlying data is stored in object storage, with local caches further improving access performance. For external interfaces, JuiceFS supports multiple access methods – FUSE is the most common, and it also provides various SDKs and an S3 gateway.  &lt;/p&gt;
&lt;p&gt;JuiceFS Community Edition is designed as a general‑purpose file system. Users can choose different metadata engines based on their needs. For small‑scale deployments, Redis delivers lightweight, low‑latency metadata management. For large‑scale file scenarios, &lt;a href="https://tikv.org/"&gt;TiKV&lt;/a&gt; provides good horizontal scalability.  &lt;/p&gt;
&lt;p&gt;JuiceFS Enterprise Edition targets complex, high‑performance scenarios. It differs from Community Edition in two ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It uses a self‑developed multi‑zone metadata engine built on Raft that runs as an in‑memory cluster, offering low latency and strong horizontal scalability. It supports up to 500 billion files. Operations that require multiple key-value requests in the Community Edition often need only one or two in the Enterprise Edition, and complex logic can be processed inside the metadata cluster.  &lt;/li&gt;
&lt;li&gt;The Enterprise Edition supports distributed cache sharing: clients in the same group can access each other’s local caches via consistent hashing. This improves cache hit rates and access efficiency. In multi‑node, high‑concurrency scenarios, the cache space scales horizontally, and most required data can be warmed up before job execution. This accelerates AI training and inference while boosting performance and stability. See &lt;a href="https://juicefs.com/en/blog/release-notes/juicefs-enterprise-5-3-rdma-support"&gt;JuiceFS Enterprise 5.3: 500B+ Files per File System &amp;amp; RDMA Support&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;JuiceFS Community Edition and Enterprise Edition architecture&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;JuiceFS Community Edition and Enterprise Edition architectures&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;h4&gt;Data chunking&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://juicefs.com/docs/community/internals/io_processing"&gt;JuiceFS splits data into chunks&lt;/a&gt; and stores them in object storage. This design is key to its performance, affecting data read efficiency, cache hit rate, and throughput under high concurrency.  &lt;/p&gt;
&lt;p&gt;JuiceFS breaks a file into multiple chunks. Inside each chunk, the system maintains a management structure called a slice to track writes and updates. When data is written, new data does not overwrite existing slices; instead, a new slice is appended on top of the chunk.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;chunk&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;How JuiceFS stores data&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;Ideally, each chunk ends up containing only one slice. Each slice consists of several 4 MB blocks, which are the smallest unit stored in object storage. By default, the caching system also manages data at the block level.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;block&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;How JuiceFS stores data&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;As shown in the diagram on the upper right, file updates use an append‑only write pattern: existing slices are shown in red, and new data is appended as a new slice. During reads, the system combines the slices to form the current view. When fragmentation becomes excessive, a compaction process merges slices to optimize access performance. For more details on data chunking, refer to &lt;a href="https://juicefs.com/en/blog/engineering/design-metadata-data-storage"&gt;Code-Level Analysis: Design Principles of JuiceFS Metadata and Data Storage&lt;/a&gt;.&lt;/p&gt;
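&lt;p&gt;The layout described above can be illustrated with a minimal Python sketch. It assumes a 64 MiB chunk size (a figure not stated in this article) and, for simplicity, a chunk that holds a single slice, so that blocks line up with chunk offsets. &lt;code&gt;locate&lt;/code&gt; and &lt;code&gt;read_view&lt;/code&gt; are hypothetical helper names, not JuiceFS APIs.&lt;/p&gt;

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assumed chunk size (64 MiB); not stated in this article
BLOCK_SIZE = 4 * 1024 * 1024   # 4 MB blocks, as described above


def locate(offset):
    """Map a file offset to (chunk index, block index in chunk, offset in block)."""
    chunk_index, pos_in_chunk = divmod(offset, CHUNK_SIZE)
    block_index, pos_in_block = divmod(pos_in_chunk, BLOCK_SIZE)
    return chunk_index, block_index, pos_in_block


def read_view(length, slices):
    """Overlay slices in write order; later slices hide the older data they overlap."""
    view = bytearray(length)
    for start, data in slices:
        view[start:start + len(data)] = data
    return bytes(view)


# An offset of 70 MiB falls in the second chunk, second block, 2 MiB into that block.
print(locate(70 * 1024 * 1024))
# An 8-byte region written once, then partially updated by a newer slice.
print(read_view(8, [(0, b"aaaaaaaa"), (2, b"BBB")]))
```

&lt;p&gt;The second call mirrors the append‑only pattern: the older slice is never modified, and the reader sees the newer bytes wherever the two overlap.&lt;/p&gt;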
&lt;h4&gt;Caching&lt;/h4&gt;
&lt;p&gt;JuiceFS’ performance advantage over direct object storage access comes largely from its caching mechanism. The JuiceFS client comes with a high‑performance local cache module. Key configuration options include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;cache-dir&lt;/code&gt;: specifies the cache directory.  &lt;/li&gt;
&lt;li&gt;&lt;code&gt;cache-size&lt;/code&gt;: sets the maximum cache space.  &lt;/li&gt;
&lt;li&gt;Prefetch: a parameter in the cache module that controls prefetching. When a request hits a block, a background thread fetches the entire block.  &lt;/li&gt;
&lt;li&gt;Write‑back related settings: improves write IOPS by writing data blocks that need to be uploaded to object storage into the local cache first, then asynchronously uploading them to object storage.&lt;/li&gt;
&lt;/ul&gt;
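&lt;p&gt;As a rough illustration, the options above could be combined into a mount command as follows. This is a sketch only: the flag names follow the Community Edition &lt;code&gt;juicefs mount&lt;/code&gt; options (check &lt;code&gt;juicefs mount --help&lt;/code&gt; for your version), and the metadata URL, mount point, cache path, and sizes are placeholder assumptions.&lt;/p&gt;

```python
import shlex


def mount_command(meta_url, mount_point, cache_dir, cache_size_mib, prefetch=1, writeback=False):
    """Build a `juicefs mount` argument list with the cache options discussed above."""
    args = [
        "juicefs", "mount",
        "--cache-dir", cache_dir,             # where cached blocks are stored
        "--cache-size", str(cache_size_mib),  # maximum cache space (MiB assumed here)
        "--prefetch", str(prefetch),          # number of blocks prefetched in parallel
    ]
    if writeback:
        args.append("--writeback")            # stage writes locally, upload asynchronously
    return args + [meta_url, mount_point]


# Placeholder values; replace with your own metadata engine URL and paths.
cmd = mount_command("redis://localhost:6379/1", "/mnt/jfs", "/var/jfsCache", 102400)
print(shlex.join(cmd))
```

&lt;p&gt;Building the command as a list keeps each option and its value together, which makes it easy to pass to &lt;code&gt;subprocess.run&lt;/code&gt; without shell quoting issues.&lt;/p&gt;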
&lt;p&gt;JuiceFS Enterprise Edition also provides advanced configurations. For example, a &lt;a href="https://juicefs.com/docs/cloud/guide/cache/"&gt;cache group&lt;/a&gt; can be used to designate a set of clients whose local caches form a distributed cache group, enabling cache sharing. In addition, the no sharing option (related to cache groups) allows a client to read data only from a specified cache group without serving its own cache to others. This creates a two‑level cache:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The first level is the local cache.  &lt;/li&gt;
&lt;li&gt;The second level is the cache on other nodes in the group.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Another performance‑boosting mechanism is the memory buffer (read buffer), which provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I/O request merging: multiple consecutive I/O requests can be merged in memory. For example, three I/O requests issued by the system may be reduced to just one after being processed by the memory buffer.  &lt;/li&gt;
&lt;li&gt;Adaptive read‑ahead: in large‑file sequential read scenarios, adaptive read‑ahead increases request concurrency by prefetching data. This fully utilizes cache and object storage resources and improves overall I/O performance.&lt;/li&gt;
&lt;/ul&gt;
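&lt;p&gt;The idea behind I/O request merging can be sketched in a few lines of Python. This is not the JuiceFS implementation; &lt;code&gt;merge_requests&lt;/code&gt; is a hypothetical helper that coalesces reads whose byte ranges are exactly adjacent.&lt;/p&gt;

```python
def merge_requests(requests):
    """Merge (offset, length) reads that are exactly adjacent into larger reads."""
    merged = []
    for offset, length in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # This request starts where the previous one ends: extend it.
            prev_offset, prev_length = merged[-1]
            merged[-1] = (prev_offset, prev_length + length)
        else:
            merged.append((offset, length))
    return merged


KIB = 1024
# Three consecutive 128 KiB reads collapse into a single 384 KiB read.
print(merge_requests([(0, 128 * KIB), (128 * KIB, 128 * KIB), (256 * KIB, 128 * KIB)]))
```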
&lt;p&gt;The Enterprise Edition also offers advanced read‑ahead settings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;max read ahead&lt;/code&gt;: sets the maximum read‑ahead range.  &lt;/li&gt;
&lt;li&gt;&lt;code&gt;initial read ahead&lt;/code&gt;: sets the initial read‑ahead window size (default unit is 4 MB blocks).  &lt;/li&gt;
&lt;li&gt;&lt;code&gt;read ahead ratio&lt;/code&gt;: a configuration added last year that controls the read‑ahead ratio for large‑file random reads, reducing bandwidth waste caused by read amplification. Overly aggressive read‑ahead can negatively impact random read performance; read ahead ratio helps mitigate this. In AI scenarios, when large‑file sequential or random reads cause bandwidth or IOPS bottlenecks, adjusting these parameters can optimize overall performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;JuiceFS benchmark I/O tests and bottleneck analysis&lt;/h2&gt;
&lt;p&gt;Before diving into performance tuning for common &lt;a href="https://en.wikipedia.org/wiki/Artificial_intelligence"&gt;AI&lt;/a&gt; scenarios, let’s first examine JuiceFS’ I/O behavior under ideal conditions through sequential and random read benchmarks. This helps us understand throughput and latency under different access patterns, providing a reference for the read/write patterns of subsequent AI/ML workloads.&lt;/p&gt;
&lt;h3&gt;Sequential read performance&lt;/h3&gt;
&lt;p&gt;In JuiceFS, sequential read performance is typically bandwidth‑bound. In cold read scenarios, performance is mainly limited by object storage bandwidth; in distributed cache scenarios, network bandwidth can become the bottleneck. For example, a node with a 40 Gbps NIC may achieve less than 5 GB/s of usable bandwidth. In addition, the user‑kernel transition overhead in the FUSE layer limits single‑thread throughput. Tests showed single‑thread sequential read bandwidth around 3.5 GB/s. To break this limit, multi‑threaded or higher‑concurrency strategies are needed to fully utilize storage and network resources.  &lt;/p&gt;
&lt;p&gt;The table below shows test results of JuiceFS sequential read performance:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left;"&gt;Threads&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Bandwidth (GB/s)&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Bandwidth per thread (GB/s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;1&lt;/td&gt;
&lt;td style="text-align: left;"&gt;3.5&lt;/td&gt;
&lt;td style="text-align: left;"&gt;3.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;2&lt;/td&gt;
&lt;td style="text-align: left;"&gt;6.3&lt;/td&gt;
&lt;td style="text-align: left;"&gt;3.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;3&lt;/td&gt;
&lt;td style="text-align: left;"&gt;9.5&lt;/td&gt;
&lt;td style="text-align: left;"&gt;3.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;4&lt;/td&gt;
&lt;td style="text-align: left;"&gt;9.7&lt;/td&gt;
&lt;td style="text-align: left;"&gt;2.43&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;6&lt;/td&gt;
&lt;td style="text-align: left;"&gt;14.0&lt;/td&gt;
&lt;td style="text-align: left;"&gt;2.33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;8&lt;/td&gt;
&lt;td style="text-align: left;"&gt;17.0&lt;/td&gt;
&lt;td style="text-align: left;"&gt;2.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;10&lt;/td&gt;
&lt;td style="text-align: left;"&gt;18.6&lt;/td&gt;
&lt;td style="text-align: left;"&gt;1.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;15&lt;/td&gt;
&lt;td style="text-align: left;"&gt;21&lt;/td&gt;
&lt;td style="text-align: left;"&gt;1.4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In the performance test, single‑thread sequential read bandwidth was about 3.5 GB/s. As the number of threads increased, total throughput gradually approached the network bandwidth limit. To help users evaluate the performance ceiling of their own environment, JuiceFS provides the &lt;code&gt;objbench&lt;/code&gt; subcommand for testing object storage bandwidth.  &lt;/p&gt;
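&lt;p&gt;To make the diminishing returns in the table easier to see, the short Python sketch below recomputes per‑thread bandwidth and compares each result with ideal linear scaling from the single‑thread figure. The numbers are copied from the table; &lt;code&gt;scaling_efficiency&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
# Measured totals from the table above: threads -> bandwidth in GB/s.
RESULTS = {1: 3.5, 2: 6.3, 3: 9.5, 4: 9.7, 6: 14.0, 8: 17.0, 10: 18.6, 15: 21.0}


def scaling_efficiency(results):
    """Return {threads: fraction of ideal linear scaling from the 1-thread result}."""
    single = results[1]
    return {threads: total / (threads * single) for threads, total in results.items()}


for threads, efficiency in scaling_efficiency(RESULTS).items():
    per_thread = RESULTS[threads] / threads
    print(f"{threads:2d} threads: {per_thread:.2f} GB/s per thread, {efficiency:.0%} of linear")
```

&lt;p&gt;Efficiency falls steadily as threads are added, which is consistent with total throughput approaching a shared bandwidth limit rather than a per‑thread one.&lt;/p&gt;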
&lt;p&gt;In real workloads, caching is more common than direct object storage access. In such cases, increasing the buffer size raises the number of background prefetch requests, thereby improving concurrency and overall throughput. For example, after increasing the buffer size to 400 MB (corresponding to 100 background prefetch requests of 4 MB each), concurrency improved significantly and overall throughput increased.&lt;/p&gt;
&lt;h3&gt;Random read performance&lt;/h3&gt;
&lt;h4&gt;Low‑concurrency random reads&lt;/h4&gt;
&lt;p&gt;In low‑concurrency, non‑asynchronous access scenarios, each request must wait for the previous one to complete before being issued. As a result, latency has a significant impact on overall performance. I/O latency can come from many sources, including metadata query latency, object storage access latency, and local or distributed cache read latency. When analyzing random read performance, we must closely examine these latency factors.  &lt;/p&gt;
&lt;p&gt;In a 4 KB cold random read scenario, if the IOPS is only 8 and object storage latency is about 125 ms, the concurrency level is roughly 1 (8 requests per second × 0.125 s per request ≈ 1 request in flight).  &lt;/p&gt;
&lt;p&gt;This indicates a near‑single‑concurrent, serial‑blocked state. In such cases, the optimization focus should be on shortening the access path and reducing per‑request latency rather than increasing concurrency – for example, by warming up data into the local cache. After data warm-up, the random read path switches from object storage to local cache, and IOPS can increase to about 12,000, approaching the I/O level of a local disk.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;Viewing performance with the juicefs stats command&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Using the juicefs stats command to view performance&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;Performance after warm-up&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Performance after data warm-up&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;h4&gt;High‑concurrency random reads&lt;/h4&gt;
&lt;p&gt;High‑concurrency random reads typically occur in scenarios with high thread counts or asynchronous I/O. The main performance bottleneck is often IOPS limits – including metadata IOPS, object storage IOPS, and cache IOPS. JuiceFS allows you to observe these metrics and pinpoint the bottleneck. Client machine resources (CPU, memory) can also affect performance, but such bottlenecks are easy to monitor.  &lt;/p&gt;
&lt;p&gt;In a cold read scenario using &lt;a href="https://github.com/anlongfei/libaio"&gt;Libaio&lt;/a&gt; for random reads, the object‑side IOPS ceiling is around 7,000/s. When caching is enabled and data is warmed up, the access path shifts from object storage to the cache layer, and IOPS can further increase to over 20,000. This shows that the bottleneck for high‑concurrency random reads shifts as the access path changes.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;Before warm-up&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Before data warm-up&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;After warm-up&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;After data warm-up&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;For a deeper dive into JuiceFS’ complete data access path, refer to &lt;a href="https://juicefs.com/en/blog/engineering/optimize-read-performance"&gt;Optimizing JuiceFS Read Performance: Readahead, Prefetch, and Cache&lt;/a&gt;.&lt;/p&gt;
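&lt;p&gt;The concurrency estimate used in the low‑concurrency analysis above follows Little’s law (average requests in flight = throughput × latency). A minimal Python sketch, using the figures quoted in the text; the function names are hypothetical, and the warm‑cache latency is only what the numbers imply if concurrency stays at 1:&lt;/p&gt;

```python
def in_flight_requests(iops, latency_seconds):
    """Little's law: average concurrency = throughput x latency."""
    return iops * latency_seconds


def iops_ceiling(concurrency, latency_seconds):
    """Upper bound on IOPS when `concurrency` requests are outstanding at once."""
    return concurrency / latency_seconds


def implied_latency_us(concurrency, iops):
    """Average per-request latency (microseconds) implied by an observed IOPS."""
    return concurrency / iops * 1_000_000


# Cold 4 KB random reads: 8 IOPS at about 125 ms per request.
print(in_flight_requests(8, 0.125))
# With one request in flight, 125 ms of latency caps throughput at 8 IOPS.
print(iops_ceiling(1, 0.125))
# After warm-up: about 12,000 IOPS, assuming concurrency is still 1.
print(round(implied_latency_us(1, 12000), 1))
```

&lt;p&gt;Read this way, warming up the cache raises IOPS by cutting per‑request latency, while asynchronous I/O or more threads raise it by increasing the number of requests in flight.&lt;/p&gt;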
&lt;h2&gt;I/O characteristics and performance tuning for common AI scenarios&lt;/h2&gt;
&lt;h3&gt;Large‑file sequential reads&lt;/h3&gt;
&lt;p&gt;A typical large‑file sequential read scenario is model loading, such as loading PyTorch .pt files saved via pickle serialization. In this process, performance is limited by two factors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/library/pickle.html"&gt;Pickle&lt;/a&gt; deserialization efficiency determines data processing speed.  &lt;/li&gt;
&lt;li&gt;Data reading is usually single‑threaded and limited by FUSE bandwidth and CPU performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To increase throughput, you can raise concurrency through multi‑threaded or sharded loading, fully utilizing I/O capacity. For large‑file sequential reads, the best performance is achieved when the entire dataset can be cached locally. If only on‑demand reading is required, the implementation is simple.&lt;br&gt;
For more details on optimizing large‑file sequential reads, see &lt;a href="https://juicefs.com/en/blog/solutions/idle-resources-elastic-high-throughput-storage-cache-pool"&gt;How JuiceFS Transformed Idle Resources into a 70 GB/s Cache Pool&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Massive small files&lt;/h3&gt;
&lt;p&gt;In computer vision and multimodal tasks, training datasets often consist of many individual files, for example, single images, video frames, or text annotations. Such massive small‑file scenarios place heavy pressure on metadata services.  &lt;/p&gt;
&lt;p&gt;In massive small-file scenarios, metadata performance is critical. On one hand, each file carries only a small amount of data; on the other hand, directory metadata access efficiency is low when a directory holds a huge number of small files.&lt;br&gt;
For read‑only workloads, enabling client metadata caching and extending the cache lifetime can improve performance. &lt;/p&gt;
&lt;p&gt;Moreover, the data read layer experiences higher IOPS pressure: small files gain little from read‑ahead, so requests are more fragmented and per‑file latency tends to be higher. Common optimizations include increasing local cache capacity; for the Enterprise Edition, you can also scale out the distributed cache cluster horizontally.  &lt;/p&gt;
&lt;p&gt;For performance tuning in this scenario, see &lt;a href="https://juicefs.com/en/blog/user-stories/multi-cloud-store-massive-small-files"&gt;How D-Robotics Manages Massive Small Files in a Multi-Cloud Environment with JuiceFS&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Large‑file random reads&lt;/h3&gt;
&lt;p&gt;This scenario is common in AI training, for example, when randomly accessing datasets in TFRecord, HDF5, or LMDB format by sample. Take data loading as an example: if the dataset is accessed randomly and each read size equals the sample size (for example, 1 MB to 4 MB images or short videos), read‑ahead can waste bandwidth. Such scenarios can often break through IOPS bottlenecks by increasing concurrency.  &lt;/p&gt;
&lt;p&gt;Recommended measures include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Increase the number of data‑loading &lt;code&gt;reader&lt;/code&gt; threads.  &lt;/li&gt;
&lt;li&gt;Use asynchronous I/O to raise concurrency and saturate IOPS.  &lt;/li&gt;
&lt;li&gt;Improve the caching system, for example, pre‑map data into cache to boost underlying IOPS.  &lt;/li&gt;
&lt;li&gt;Adjust the &lt;code&gt;read ahead ratio&lt;/code&gt; parameter (for example, set it to &lt;code&gt;0.5&lt;/code&gt;) to reduce bandwidth waste from read‑ahead. For instance, a 4 MB sequential read would previously prefetch 4 MB; after adjustment, only 2 MB is prefetched.&lt;/li&gt;
&lt;/ul&gt;
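&lt;p&gt;The first two measures amount to keeping more read requests in flight. The sketch below shows one way to do that from Python with a thread pool and positional reads; it is an illustration rather than a tuned data loader. &lt;code&gt;random_reads&lt;/code&gt; is a hypothetical helper, a temporary file stands in for a dataset on a JuiceFS mount, and &lt;code&gt;os.pread&lt;/code&gt; requires a Unix‑like system.&lt;/p&gt;

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor


def random_reads(path, sample_size, count, workers):
    """Read `count` randomly chosen samples of `sample_size` bytes with a pool of reader threads."""
    file_size = os.path.getsize(path)
    slots = file_size // sample_size
    offsets = [random.randrange(slots) * sample_size for _ in range(count)]
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # os.pread takes an explicit offset, so threads can share one descriptor.
            chunks = list(pool.map(lambda off: os.pread(fd, sample_size, off), offsets))
    finally:
        os.close(fd)
    return sum(len(chunk) for chunk in chunks)


if __name__ == "__main__":
    # Stand-in file; in practice `path` would point at a dataset on a JuiceFS mount.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(1024 * 1024))
    print(random_reads(tmp.name, sample_size=4096, count=64, workers=16))
    os.remove(tmp.name)
```

&lt;p&gt;Raising &lt;code&gt;workers&lt;/code&gt; increases the number of outstanding reads, which is what allows a latency‑bound random read workload to approach the IOPS limit of the cache or object storage layer.&lt;/p&gt;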
&lt;p&gt;In this article, we’ve analyzed JuiceFS’ architecture from a performance perspective, covered benchmark I/O tests, and discussed tuning methods for typical AI scenarios. This provides an introductory reference for system performance. JuiceFS has been deployed in many production environments, and its distributed architecture offers a feasible balance between performance and cost.  &lt;/p&gt;
&lt;p&gt;If you have any questions about this article, feel free to join &lt;a href="https://github.com/juicedata/juicefs/discussions/"&gt;JuiceFS discussions on GitHub&lt;/a&gt; and &lt;a href="http://go.juicefs.com/discord"&gt;community on Discord&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 15 Apr 2026 07:22:00 +0000</pubDate><guid>https://www.juicefs.com/en/blog/engineering/juicefs-ai-workload-performance-optimization</guid></item><item><title>Optimizing JuiceFS on the Arm Architecture: MLPerf-Based Performance Tuning</title><link>https://www.juicefs.com/en/blog/engineering/arm-juicefs-performance-optimization-mlperf-tuning</link><description>&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;Recently, building high-performance storage infrastructure on Arm platforms has become a technical focal point. &lt;a href="https://www.linaro.org/"&gt;Linaro&lt;/a&gt; is an international technology organization focused on the Arm ecosystem and open-source software. We collaborate with upstream and downstream industry players to address common issues and assist enterprise customers in productizing their solutions on an open-source foundation. Our team conducted systematic stress testing on &lt;a href="https://juicefs.com/docs/community/introduction/"&gt;JuiceFS Community Edition&lt;/a&gt; (using Redis for metadata storage) during MLPerf Storage benchmarks, covering a variety of typical machine learning training workloads.  &lt;/p&gt;
&lt;p&gt;Our test results show that system performance is largely influenced by memory bandwidth and metadata access efficiency. JuiceFS’ throughput directly determines GPU utilization and training efficiency. Through testing workloads such as 3D U-Net, ResNet-50, and CosmoFlow, the analysis revealed: in single-node scenarios, GPU utilization is primarily limited by memory copy latency; in two-node or multi-node scenarios, metadata access and inter-node synchronization become the main bottlenecks. In the article, we also provide tuning strategies and practical results to address these bottlenecks.  &lt;/p&gt;
&lt;p&gt;In summary, large-scale AI training performance tuning is a systematic engineering effort that requires coordinated optimization across storage systems, memory bandwidth, CPU scheduling, caching strategies, and more to achieve efficient deep learning data supply on Arm platforms.&lt;/p&gt;
&lt;h2&gt;Arm64 vs. x86_64 architecture differences and concurrency characteristics&lt;/h2&gt;
&lt;p&gt;Compared to x86, Arm’s application scope continues to expand, extending from mobile devices to IoT, wearables, PCs, automotive, and servers. Its high performance per watt is a key reason for its widespread adoption.&lt;br&gt;
From an architectural design perspective, Arm is a &lt;a href="https://en.wikipedia.org/wiki/Reduced_instruction_set_computer"&gt;reduced instruction set computer&lt;/a&gt; (RISC), while x86 is a &lt;a href="https://en.wikipedia.org/wiki/Complex_instruction_set_computer"&gt;complex instruction set computer&lt;/a&gt; (CISC). This design difference also affects how processors execute instructions. Arm64 instructions have a fixed length of 4 bytes, whereas x86 instructions have variable lengths ranging from 1 to 15 bytes. Consequently, x86 often requires more complex decoders. In contrast, Arm’s instructions are simpler and rely more heavily on effective instruction organization during compilation and code generation, thus requiring longer compilation times.  &lt;/p&gt;
&lt;p&gt;From an engineer’s perspective, there are other architectural differences that directly impact program behavior. &lt;strong&gt;Code that seems intuitive on x86 may not behave the same way on Arm. Several of the common pitfalls discussed later are fundamentally related to these underlying differences.&lt;/strong&gt;  &lt;/p&gt;
&lt;p&gt;One typical issue is the alignment requirement for atomic operations. Whether using Load-Link/Store-Conditional (LL/SC) or Large System Extensions (LSE), read-modify-write operations like atomic increments typically require aligned memory addresses. Newer LSE2 relaxes this restriction, supporting unaligned accesses within a 16-byte window. Data alignment is not mandatory for x86, but maintaining good alignment helps improve performance. See &lt;a href="https://developer.arm.com/documentation/ddi0487/maa/-Part-B-The-AArch64-Application-Level-Architecture/-Chapter-B2-The-AArch64-Application-Level-Memory-Model/-B2-8-Alignment-support/-B2-8-2-Alignment-of-data-accesses?lang=en#chdffegj"&gt;Arm Architecture Reference Manual for A-profile architecture&lt;/a&gt;.  &lt;/p&gt;
&lt;p&gt;Another key feature to note is that Arm employs a weakly ordered / relaxed memory model. The difference lies in the strength of constraints on memory access ordering. In multi-threaded scenarios, the same read/write operations are more likely to appear in program order on x86, whereas Arm permits more reordering. Thus, the order observed by other threads may differ from the source code order. When debugging issues on Arm, memory ordering effects must be carefully considered. For more details, see the Arm white paper: &lt;a href="https://developer.arm.com/documentation/107630/1-0/?lang=en"&gt;Synchronization Overview and Case Study on Arm Architecture&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Overview of JuiceFS and MLPerf&lt;/h2&gt;
&lt;p&gt;JuiceFS is an open-source, high-performance &lt;a href="https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems"&gt;distributed file system&lt;/a&gt; built on object storage. It leverages the cost advantages of object storage while delivering a user experience close to traditional file systems. It supports POSIX, HDFS SDK, Python SDK, and S3-compatible interfaces, adapting to various applications and data processing frameworks. It also supports cloud-native extensions, data security, and compression, making it widely applicable to AI training, inference, big data processing, and more.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;1JuiceFS 架构图.drawio&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;JuiceFS Community Edition architecture&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;To evaluate JuiceFS’ data supply capability under high-load scenarios like AI training, we can use the MLPerf Storage benchmark. Developed by MLCommons, this benchmark focuses on measuring a storage system’s ability to consistently and efficiently supply data to compute nodes.  &lt;/p&gt;
&lt;p&gt;Version 2.0 divides tests into training workloads and checkpoint workloads. The training workloads include 3D U-Net, ResNet-50, and CosmoFlow. They differ significantly in sample size and access patterns. Minimum GPU utilization requirements are set: 90% for 3D U-Net and ResNet-50, and 70% for CosmoFlow.  &lt;/p&gt;
&lt;p&gt;The table below shows MLPerf Storage 2.0 training workloads:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left;"&gt;Task&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Reference network&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Data loader&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Sample size&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Batch size&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Accelerator utilization&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Time per batch run (s)&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Storage capability evaluated&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Image segmentation (medical)&lt;/td&gt;
&lt;td style="text-align: left;"&gt;3D U-Net&lt;/td&gt;
&lt;td style="text-align: left;"&gt;PyTorch&lt;/td&gt;
&lt;td style="text-align: left;"&gt;146 MiB&lt;/td&gt;
&lt;td style="text-align: left;"&gt;7 x 146 = 1,022 MiB&lt;/td&gt;
&lt;td style="text-align: left;"&gt;90%&lt;/td&gt;
&lt;td style="text-align: left;"&gt;0.323 / 0.9 = 0.359; data load time: 0.359 - 0.323 = 0.036&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Bandwidth, concurrent large block sequential reads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Image classification&lt;/td&gt;
&lt;td style="text-align: left;"&gt;ResNet-50&lt;/td&gt;
&lt;td style="text-align: left;"&gt;TensorFlow&lt;/td&gt;
&lt;td style="text-align: left;"&gt;150 KiB&lt;/td&gt;
&lt;td style="text-align: left;"&gt;400 x 150 = 58.5 MiB&lt;/td&gt;
&lt;td style="text-align: left;"&gt;90%&lt;/td&gt;
&lt;td style="text-align: left;"&gt;0.224 / 0.9 = 0.249; data load time: 0.249 - 0.224 = 0.025&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Bandwidth, IOPS, high concurrency medium block sequential reads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Scientific (cosmology parameter prediction)&lt;/td&gt;
&lt;td style="text-align: left;"&gt;CosmoFlow&lt;/td&gt;
&lt;td style="text-align: left;"&gt;TensorFlow&lt;/td&gt;
&lt;td style="text-align: left;"&gt;2.7 MiB&lt;/td&gt;
&lt;td style="text-align: left;"&gt;1 x 2.7 = 2.7 MiB&lt;/td&gt;
&lt;td style="text-align: left;"&gt;70%&lt;/td&gt;
&lt;td style="text-align: left;"&gt;0.0035 / 0.7 = 0.005; data load time: 0.005 - 0.0035 = 0.0015&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Bandwidth, IOPS, metadata latency, high concurrency sequential reads of many small files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;LLM checkpointing (new)&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Llama3&lt;/td&gt;
&lt;td style="text-align: left;"&gt;PyTorch&lt;/td&gt;
&lt;td style="text-align: left;"&gt;105 GiB to 18 TiB&lt;/td&gt;
&lt;td style="text-align: left;"&gt;—&lt;/td&gt;
&lt;td style="text-align: left;"&gt;—&lt;/td&gt;
&lt;td style="text-align: left;"&gt;—&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Bandwidth, concurrent sequential writes of extremely large files&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
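&lt;p&gt;The “time per batch” arithmetic in the table can be reproduced with a short Python sketch. The batch sizes, compute times, and utilization targets are taken from the table; the sustained data rate per accelerator is an extra estimate that assumes data loading overlaps with compute through prefetching, which is an assumption of this sketch rather than a figure from the benchmark.&lt;/p&gt;

```python
WORKLOADS = {
    # name: (batch size in MiB, compute time per batch in s, required utilization)
    "3D U-Net": (7 * 146, 0.323, 0.90),
    "ResNet-50": (400 * 150 / 1024, 0.224, 0.90),
    "CosmoFlow": (1 * 2.7, 0.0035, 0.70),
}


def batch_budget(batch_mib, compute_s, utilization):
    """Return (time per batch, data load budget, estimated MiB/s per accelerator)."""
    batch_time = compute_s / utilization   # longest batch time that still meets the target
    load_budget = batch_time - compute_s   # time left over for waiting on data
    return batch_time, load_budget, batch_mib / batch_time


for name, spec in WORKLOADS.items():
    batch_time, load_budget, rate = batch_budget(*spec)
    print(f"{name}: {batch_time:.4f} s per batch, {load_budget:.4f} s load budget, {rate:.0f} MiB/s")
```

&lt;p&gt;Multiplying the per‑accelerator rate by the number of simulated accelerators gives a first‑order estimate of the bandwidth the storage system must sustain for a given test scale.&lt;/p&gt;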
&lt;p&gt;In the test flow, data is first read from the storage system into host memory before entering the compute phase. Training time is simulated to replicate the data flow of real training scenarios, eliminating the need for actual GPU deployment, lowering experimental barriers, and improving operational convenience.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;MLPerf Storage data flow&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;MLPerf Storage data flow&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;h2&gt;MLPerf Storage v2.0 test principles and tuning&lt;/h2&gt;
&lt;p&gt;Before detailing specific model test results, it’s essential to understand the data access principles of distributed training. This helps readers see what drives GPU utilization and storage throughput and where performance bottlenecks arise, making the subsequent test results and tuning strategies easier to follow.  &lt;/p&gt;
&lt;p&gt;Distributed &lt;a href="https://en.wikipedia.org/wiki/Machine_learning"&gt;machine learning&lt;/a&gt; typically uses data parallelism, where multiple parallel processes share the same dataset, and each process handles reading and processing its corresponding training batches.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;Distributed training data access principle diagram&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Distributed training data access principle&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;MLPerf Storage training tests follow this approach: each training process reads data from the storage system in batches and simulates computation to evaluate the storage system’s ability to sustain data supply.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;MLPerf Storage training data flow diagram&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;MLPerf Storage training data flow&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;To understand the source of performance during testing, it’s also necessary to understand the data processing path within the JuiceFS client.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;JuiceFS client threads and data flow diagram&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;JuiceFS client threads and data flow&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;As illustrated, when testing with JuiceFS, the execution flow can be roughly divided into three parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Left side: Application-side I/O threads, such as fio or MLPerf Storage’s DataLoader threads, which initiate read/write requests and wait for completion.  &lt;/li&gt;
&lt;li&gt;Middle: The main goroutine in the FUSE daemon, which handles FUSE requests from kernel space, places file data into memory buffers and caches, and triggers backend metadata and object storage access.  &lt;/li&gt;
&lt;li&gt;Right side: Asynchronous goroutines for the Meta client and ObjectStore client, which interact with the backend MetaDB and ObjectStore clusters for data and metadata operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;From a performance analysis perspective, we need to note two types of issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data copying, corresponding to steps like 2.1, 3, 4, 5, and 6 in the diagram. These steps introduce additional memory copy overhead and are often key areas for analyzing latency and CPU usage.  &lt;/li&gt;
&lt;li&gt;Synchronization and asynchronous boundaries. As shown, steps 1, 2, 3, 4, 5, and 6 are part of the synchronous path, where the request must wait for the current stage to complete before proceeding. Step 7 is part of the asynchronous path, handled by background goroutines interacting with backend storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Test 1: 3D U-Net&lt;/h3&gt;
&lt;p&gt;In this test, the sample size was 146 MiB per image file, and we focused on large-block read performance. The test results showed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In a single-node environment, the system could stably run up to 5 GPUs, with GPU utilization at about 50%.  &lt;/li&gt;
&lt;li&gt;In a two-node scenario, it could support 10 GPUs, also with GPU utilization around 50%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To improve data read efficiency, we optimized the training parameters: we increased the number of reader threads from 4 to 16 to accelerate data generation, and switched to direct I/O to reduce buffer and memory copy overhead.&lt;br&gt;
&lt;strong&gt;Operational metrics showed that when mounting 6 GPUs on a single node, GPU utilization dropped to 83%, corresponding to a bandwidth of about 15.1 GB/s. This fell short of the expected high utilization target.&lt;/strong&gt; Further testing with fio on the storage side revealed similar bandwidth of about 15.1 GB/s. &lt;strong&gt;This indicated that the bottleneck had shifted to the JuiceFS client bandwidth rather than the GPU compute side.&lt;/strong&gt;&lt;/p&gt;
&lt;h4&gt;Optimization analysis 1: CPU pinning&lt;/h4&gt;
&lt;p&gt;To further investigate the cause of the client bandwidth limitation, we pinned the process to a specific CPU (running on NUMA nodes 2 and 3). Monitoring showed that all 48 CPU cores were nearly fully utilized. Further analysis of &lt;code&gt;top-down&lt;/code&gt;, &lt;code&gt;memory&lt;/code&gt;, and &lt;code&gt;miss&lt;/code&gt; metrics revealed a clear memory-bound condition, with most time spent on memory copying. This indicated that in the CPU-pinned scenario, the performance bottleneck of JuiceFS primarily came from CPU processing capacity and the additional latency caused by cross-NUMA node memory copying.&lt;/p&gt;
&lt;h4&gt;Optimization analysis 2: no CPU pinning&lt;/h4&gt;
&lt;p&gt;To understand the bandwidth limitations under more general conditions, we observed the scenario without CPU pinning. The results showed that while the CPU was not fully saturated, the &lt;code&gt;devkit tuner numafast&lt;/code&gt; metric indicated that remote memory access accounted for about 80% of total memory accesses. This meant a large number of memory accesses were crossing NUMA node boundaries, potentially even CPU sockets, introducing significant bandwidth loss and access latency.  &lt;/p&gt;
&lt;p&gt;From the perspective of hardware bandwidth, cross-socket memory access has inherent limitations. For example, on the Arm platform, the theoretical physical bandwidth across sockets was about 60 GB/s. Further measurements showed cross-socket copy bandwidth on Arm1 to be around 48 GB/s, while on two x86 platforms it was about 37 GB/s and 28 GB/s, respectively.  &lt;/p&gt;
&lt;p&gt;This suggested that in the scenario without CPU pinning, even though the compute cores were not fully exhausted, extensive cross-node, cross-socket remote memory access had become a major source of overhead. Therefore, we inferred that the inability to further increase JuiceFS bandwidth was likely not solely due to CPU compute power, but rather constrained by the bandwidth and latency of cross-socket memory access. &lt;strong&gt;In other words, the system bottleneck had shifted from “local CPU being too busy” to “remote memory access being too costly.”&lt;/strong&gt;  &lt;/p&gt;
&lt;p&gt;In summary, the reasons for the JuiceFS bandwidth limitation differed between the two scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;With CPU pinning, the bottleneck was primarily CPU resource consumption and the overhead of extensive memory copying.  &lt;/li&gt;
&lt;li&gt;Without CPU pinning, the bottleneck was largely due to a high proportion of non-local memory accesses, especially the bandwidth and latency penalties from cross-socket accesses.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Test 2: ResNet-50&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://blog.roboflow.com/what-is-resnet-50/"&gt;ResNet-50&lt;/a&gt; uses small samples (about 150 KiB each), with each batch containing 400 samples totaling about 58.5 MiB. This I/O test focused on data loading efficiency and training throughput under high GPU concurrency. The system maintained high utilization at large GPU scales:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single node: 50 GPUs, 95% GPU utilization, about 9.2 GB/s bandwidth.  &lt;/li&gt;
&lt;li&gt;Two nodes: 96 GPUs, 90% GPU utilization, about 16.9 GB/s bandwidth.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;During testing, we adjusted the &lt;code&gt;reader.read_threads&lt;/code&gt; parameter from 8 to 1. For this model (medium-sized images), a single thread sufficed for data supply.&lt;/p&gt;
&lt;h4&gt;Optimization analysis 1: single-node bottleneck and memory bandwidth impact&lt;/h4&gt;
&lt;p&gt;With 55 GPUs on a single node, GPU utilization dropped to 86% while bandwidth remained at about 9.2 GB/s. This indicated the bottleneck had shifted to JuiceFS client bandwidth.  &lt;/p&gt;
&lt;p&gt;Further analysis revealed that the ResNet-50 tests used buffered I/O mode. Beyond reading data, memory copies during dataset processing consumed part of the memory bandwidth.&lt;/p&gt;
&lt;p&gt;System memory copy bandwidth depends on memory channel count, memory frequency, and CPU frequency. STREAM tests on nodes with different configurations showed that single-node sequential read bandwidth tracked the measured system memory bandwidth, indicating that read throughput largely depends on system memory bandwidth. &lt;strong&gt;For training tasks requiring high throughput and GPU utilization, we recommend selecting nodes with higher memory bandwidth, which significantly improves data supply capacity and training efficiency.&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left;"&gt;Platform&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Single-CPU memory copy bandwidth&lt;/th&gt;
&lt;th style="text-align: left;"&gt;JuiceFS single-node deployment read bandwidth&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Arm3&lt;/td&gt;
&lt;td style="text-align: left;"&gt;171 GB/s&lt;/td&gt;
&lt;td style="text-align: left;"&gt;25.3 GiB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Arm2&lt;/td&gt;
&lt;td style="text-align: left;"&gt;114 GB/s&lt;/td&gt;
&lt;td style="text-align: left;"&gt;21.6 GiB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Arm1&lt;/td&gt;
&lt;td style="text-align: left;"&gt;106 GB/s&lt;/td&gt;
&lt;td style="text-align: left;"&gt;18.3 GiB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;x862&lt;/td&gt;
&lt;td style="text-align: left;"&gt;90 GB/s&lt;/td&gt;
&lt;td style="text-align: left;"&gt;17.9 GiB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;x861&lt;/td&gt;
&lt;td style="text-align: left;"&gt;82 GB/s&lt;/td&gt;
&lt;td style="text-align: left;"&gt;16.6 GiB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
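&lt;p&gt;As a rough cross-check of the table above, the following sketch computes how much of each platform's memory copy bandwidth shows up as JuiceFS read bandwidth. It takes the units at face value (GB/s for memory copy, GiB/s for reads), which is an assumption about how the table was recorded. The helper name &lt;code&gt;read_to_memcpy_ratio&lt;/code&gt; is ours, for illustration only.&lt;/p&gt;

```python
# Values copied from the table above:
# platform: (memory copy bandwidth in GB/s, JuiceFS read bandwidth in GiB/s)
MEASURED = {
    "Arm3": (171, 25.3),
    "Arm2": (114, 21.6),
    "Arm1": (106, 18.3),
    "x862": (90, 17.9),
    "x861": (82, 16.6),
}

GIB_IN_GB = 2**30 / 1e9  # 1 GiB/s expressed in GB/s


def read_to_memcpy_ratio(memcpy_gb_s, read_gib_s):
    """Fraction of memory copy bandwidth that appears as file read bandwidth."""
    return read_gib_s * GIB_IN_GB / memcpy_gb_s


for platform, (memcpy_gb_s, read_gib_s) in MEASURED.items():
    ratio = read_to_memcpy_ratio(memcpy_gb_s, read_gib_s)
    print(f"{platform}: {ratio:.1%} of memory copy bandwidth")
```

&lt;p&gt;Under these assumptions, read bandwidth comes out at roughly 16% to 22% of memory copy bandwidth on every platform, and the platforms rank in the same order on both metrics. This is consistent with read throughput being bounded by memory bandwidth.&lt;/p&gt;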
&lt;h4&gt;Optimization analysis 2: two-node scaling bottlenecks and distributed limitations&lt;/h4&gt;
&lt;p&gt;In multi-node deployments, in addition to single-node performance limits, cross-node memory access, network transfer, and metadata latency become new bottlenecks. Therefore, two-node testing after single-node analysis helped identify these distributed constraints and guide system optimization.  &lt;/p&gt;
&lt;p&gt;In the two-node scenario, the system theoretically supported up to 100 GPUs, but in actual testing only 96 GPUs could be sustained. Analysis showed that per-operation read latency had increased. Although file data was already cached on local disks, metadata access latency became the primary limiting factor.&lt;/p&gt;
&lt;p&gt;To address this issue, we made multiple optimizations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We grouped CPU cores to ensure training threads and I/O threads ran on the same NUMA node.  &lt;/li&gt;
&lt;li&gt;We assigned pure data processing and metadata access to different CPU cores and storage paths.  &lt;/li&gt;
&lt;li&gt;We adjusted Redis cache and local cache policies to reduce latency under high-concurrency metadata access.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After these tunings, the two-node scenario stably supported 100 GPUs, with GPU utilization reaching the expected level.&lt;/p&gt;
&lt;h3&gt;Test 3: CosmoFlow&lt;/h3&gt;
&lt;p&gt;Compared with previous models, this model had a much smaller size per sample. This imposed higher demands on I/O and metadata access. In both single-node and two-node scenarios, the CosmoFlow test showed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single node: Stably supported up to 10 GPUs (occasionally up to 12 GPUs), GPU utilization around 75%, bandwidth about 5.6 GB/s.  &lt;/li&gt;
&lt;li&gt;Key parameter adjustment: &lt;code&gt;reader.read_threads&lt;/code&gt; was reduced from &lt;code&gt;4&lt;/code&gt; to &lt;code&gt;1&lt;/code&gt;, batch size was set to 2 MiB, and a single thread was sufficient to meet data supply requirements.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Optimization analysis 1: single-node bottleneck – memory copy limiting GPU utilization&lt;/h4&gt;
&lt;p&gt;When we tried to increase the number of GPUs beyond 10, GPU utilization dropped. Log and performance data analysis revealed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data read time increased, while metadata access latency did not change significantly.  &lt;/li&gt;
&lt;li&gt;File data was cached on local disks, disk queues were not full, and latency was low, so the bottleneck was not in the storage device.  &lt;/li&gt;
&lt;li&gt;Profiling showed that the key bottleneck was memory copy (&lt;code&gt;memcpy&lt;/code&gt;) – cumulative delays from multiple copy operations in the data read path increased total read time.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thus, we inferred that when the system demanded more memory bandwidth, memory copy latency became the main factor limiting read performance and GPU utilization.&lt;/p&gt;
&lt;h4&gt;Optimization analysis 2: two-node bottleneck – distributed synchronization and metadata latency&lt;/h4&gt;
&lt;p&gt;In the two-node scenario with 20 GPUs, the first round of testing showed significantly lower GPU utilization. Further analysis found:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One node had started training while the other was still performing dataset preprocessing (reading file lists and sharding).  &lt;/li&gt;
&lt;li&gt;Because CosmoFlow has a large data volume, reading high-index files took a long time. This caused the two nodes to start training out of sync, leading to lower GPU utilization in the first round.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To resolve this, we added a synchronization mechanism to ensure that all nodes completed dataset preprocessing before starting training. After this adjustment, the two-node test stably supported 20 GPUs, and GPU utilization reached the expected level.&lt;/p&gt;
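&lt;p&gt;The article does not say which synchronization mechanism was used. The sketch below only illustrates the idea with Python's standard &lt;code&gt;threading.Barrier&lt;/code&gt;, using two threads as stand-ins for the two nodes. In a real distributed job, the equivalent would be a collective barrier provided by the training framework or MPI. The function names and sleep durations are hypothetical.&lt;/p&gt;

```python
import threading
import time

NUM_NODES = 2  # two-node scenario described above
start_barrier = threading.Barrier(NUM_NODES)
events = []  # ordered log of what each simulated node did
events_lock = threading.Lock()


def log(message):
    with events_lock:
        events.append(message)


def run_node(rank, preprocess_seconds):
    # Hypothetical stand-in for reading file lists and sharding the dataset.
    time.sleep(preprocess_seconds)
    log(f"node{rank}:preprocessed")
    # No node passes this point until every node has finished preprocessing.
    start_barrier.wait()
    log(f"node{rank}:training")


def run_cluster():
    events.clear()
    threads = [
        threading.Thread(target=run_node, args=(0, 0.01)),
        threading.Thread(target=run_node, args=(1, 0.05)),  # slower node
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(events)


if __name__ == "__main__":
    print(run_cluster())
```

&lt;p&gt;Because each node logs its preprocessing step before waiting on the barrier, every "preprocessed" entry appears before any "training" entry, regardless of which node is slower.&lt;/p&gt;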
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;The key findings and optimization insights from our tests are summarized as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mlcommons.org/working-groups/benchmarks/storage/"&gt;MLPerf Storage&lt;/a&gt; evaluates various file system capabilities through different combinations of sample sizes, file sizes, and batch sizes, including large/medium/small sequential read performance, file concurrency, total read bandwidth, metadata access latency, file read latency, and file operation stability. In read-only scenarios, fully utilizing high-speed caches close to the compute nodes (including data and metadata caches) significantly improved read performance. Note that the smaller the file, the higher the requirements for IOPS and latency.  &lt;/li&gt;
&lt;li&gt;System memory and bandwidth have a decisive impact on performance. In memory‑copy‑intensive workloads, memory copies consume both bandwidth and CPU cycles, creating the illusion of "CPU busy" while the CPU actually spends most of its time waiting for data. Higher memory bandwidth directly leads to better storage throughput – a key reference for server selection.  &lt;/li&gt;
&lt;li&gt;The Go runtime has limited NUMA awareness. On machines with many cores, performance may be worse than when fewer cores are used. Cross‑NUMA (especially cross‑socket) memory accesses should be avoided because cross‑socket bandwidth is typically low (tens of GB/s) and access latency is higher. In practice, allocate only as many CPU cores as the workload needs, not all of them, to prevent extra memory access delays.  &lt;/li&gt;
&lt;li&gt;System‑level optimizations exist. For memory‑copy‑intensive operations, newer Arm systems provide specialized instructions. We collaborated with the Arm community to push configuration improvements, achieving up to tens of percentage points higher bandwidth in some scenarios.  &lt;/li&gt;
&lt;li&gt;For operations involving heavy kernel‑userspace interaction (for example, file I/O and metadata processing), reducing unnecessary system calls lowers latency. Concentrating file processing within the same production node and avoiding cross‑NUMA/socket access further improves performance and stability.  &lt;/li&gt;
&lt;li&gt;Cache policy tuning matters. Under high single‑node load, adjusting JuiceFS memory cache policies to reduce wasted memory bandwidth effectively increases GPU utilization and storage throughput. Overall, tuning for the MLPerf Storage benchmark is a systems engineering problem that requires coordinated optimization of the file system, memory bandwidth, CPU scheduling, and caching strategies.&lt;/li&gt;
&lt;/ul&gt;
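&lt;p&gt;To make the advice about allocating only enough CPU cores concrete, here is a minimal, Linux-only sketch that restricts a process to the cores of one NUMA node. It is not the configuration used in our tests: the helper names are hypothetical, and it relies only on the standard sysfs &lt;code&gt;cpulist&lt;/code&gt; file and Python's &lt;code&gt;os.sched_setaffinity&lt;/code&gt;. CPU affinity alone does not bind memory, so tools such as &lt;code&gt;numactl&lt;/code&gt; are the usual choice when memory placement must also be controlled.&lt;/p&gt;

```python
import os


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8-11" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def cpus_of_numa_node(node):
    """Read the CPUs that belong to one NUMA node from sysfs (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        return parse_cpulist(f.read())


def pin_to_numa_node(node, max_cpus=None):
    """Restrict the current process to (a subset of) one NUMA node's CPUs."""
    cpus = cpus_of_numa_node(node)
    if max_cpus is not None:
        cpus = cpus[:max_cpus]  # allocate only as many cores as needed
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus


if __name__ == "__main__":
    print(parse_cpulist("0-3,8-11"))
```

&lt;p&gt;For example, calling &lt;code&gt;pin_to_numa_node(2, max_cpus=16)&lt;/code&gt; would keep the process on at most 16 cores of NUMA node 2. The node number and core count here are placeholders, not recommendations.&lt;/p&gt;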
&lt;p&gt;If you have any questions about this article, feel free to join &lt;a href="https://github.com/juicedata/juicefs/discussions/"&gt;JuiceFS discussions on GitHub&lt;/a&gt; and &lt;a href="http://go.juicefs.com/discord"&gt;community on Discord&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 02 Apr 2026 12:45:00 +0000</pubDate><guid>https://www.juicefs.com/en/blog/engineering/arm-juicefs-performance-optimization-mlperf-tuning</guid></item><item><title>How D-Robotics Manages Massive Small Files in a Multi-Cloud Environment with JuiceFS</title><link>https://www.juicefs.com/en/blog/user-stories/multi-cloud-store-massive-small-files</link><description>&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;&lt;a href="https://en.d-robotics.cc/"&gt;D-Robotics&lt;/a&gt;, founded in 2024 and spun off from &lt;a href="https://en.wikipedia.org/wiki/Horizon_Robotics"&gt;Horizon Robotics&lt;/a&gt;' robotics division, specializes in the research and development of foundational computing platforms for consumer-grade robots. In 2025, we released an &lt;a href="https://www.nvidia.com/en-us/glossary/embodied-ai/"&gt;embodied AI&lt;/a&gt; foundation model.  &lt;/p&gt;
&lt;p&gt;In robot data management, training, and inference, the sheer volume of data is immense. Using object storage presents challenges such as handling small files and managing multi-cloud data. After trying some solutions and replacing private MinIO with SSD storage, we still faced difficulties in addressing these challenges. Ultimately, we selected &lt;a href="https://juicefs.com/docs/community/introduction/"&gt;JuiceFS&lt;/a&gt; as our core storage solution.  &lt;/p&gt;
&lt;p&gt;JuiceFS' inherent adaptability for cross-cloud operations efficiently supports data sharing needs in multi-cloud environments. In training scenarios, JuiceFS' cache mechanism, specifically designed for small file data, effectively replaces traditional caching solutions while striking a balance between cost and efficiency, fully meeting storage performance requirements. Currently, we manage tens of millions of files.&lt;/p&gt;
&lt;p&gt;In this article, we’ll share our application characteristics, storage pain points, solution selection, implementation practices, and production tuning experiences. We hope our experience offers useful insights for those facing similar challenges in the industry.&lt;/p&gt;
&lt;h2&gt;Storage pain points in the robotics industry&lt;/h2&gt;
&lt;p&gt;The cloud platform serves as our core technical hub, undertaking key application functions such as simulation environment setup, data generation and &lt;a href="https://www.ibm.com/think/topics/model-training"&gt;model training&lt;/a&gt;, model lightweighting and deployment, and visual verification. The data types involved in the platform are diverse, mainly including sensor image data, LiDAR point cloud data, model weights and configuration data, motor operational data, and map construction data.  &lt;/p&gt;
&lt;p&gt;While &lt;a href="https://en.wikipedia.org/wiki/Object_storage"&gt;object storage&lt;/a&gt; meets basic storage needs for massive data, its performance limitations become particularly obvious when handling the massive small files frequently encountered in robotics applications. Our storage system faced four challenges:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Metadata performance bottleneck with massive small files:&lt;/strong&gt; Robot model training involves tens of millions to billions of sensor images, LiDAR data, and model files. Traditional object storage (like standard S3) exhibits significant metadata operation bottlenecks at this scale. The fixed API latency for routine operations like listing files or retrieving attributes is typically 10–30 ms. This directly constrains queries per second (QPS) performance during training and inference and impacts overall R&amp;amp;D efficiency.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inefficient &lt;a href="https://en.wikipedia.org/wiki/Multicloud"&gt;multi-cloud&lt;/a&gt; collaboration and data flow:&lt;/strong&gt; As robotics companies increasingly adopt multi-cloud architectures for their R&amp;amp;D and production applications, ensuring efficient data synchronization and sharing across different cloud platforms and geographical regions has become a common challenge for the industry. Traditional storage solutions typically suffer from low cross-cloud data transfer efficiency and are often deeply integrated with a single cloud provider. This leads to technical lock-in and makes it difficult to achieve flexible cross-cloud deployment and data collaboration.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The impossible trinity of performance, cost, and operations:&lt;/strong&gt; High-performance parallel file systems offer high throughput and low latency but typically rely on all-flash arrays or dedicated hardware. This leads to high hardware investment and ongoing operational costs, plus complex deployment. Low-cost object storage offers good elasticity but struggles to support the high-throughput I/O demands of GPU clusters in AI training scenarios. A common industry workaround is using a high-speed file system as a cache synchronized with S3. However, the extra data synchronization steps significantly reduce usability and fail to achieve efficient storage-compute synergy.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Difficulty in dataset version management:&lt;/strong&gt; The rapid iteration cycle of robot models requires efficient and granular management of multiple dataset versions. Using physical copies for version control multiplies underlying storage consumption with every version, significantly increasing costs. Moreover, retrieving, reusing, and maintaining multi-version data becomes substantially harder.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Storage selection: JuiceFS vs. MinIO/S3 vs. PFS&lt;/h2&gt;
&lt;p&gt;To address these storage challenges, we established a clear evaluation framework for storage selection. A comprehensive comparative test was conducted on mainstream storage solutions across seven core dimensions: storage architecture, protocol compatibility, metadata performance, scalability, multi-cloud adaptability, cost efficiency, and operational complexity.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left;"&gt;Comparison basis&lt;/th&gt;
&lt;th style="text-align: left;"&gt;JuiceFS&lt;/th&gt;
&lt;th style="text-align: left;"&gt;MinIO / Public Cloud S3&lt;/th&gt;
&lt;th style="text-align: left;"&gt;CephFS / Public Cloud FS (CPFS)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Storage architecture&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Separation of metadata and data&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Unified object storage&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Metadata and data typically coupled, often with kernel-space parallel design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Protocol support&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Full compatibility: POSIX, HDFS, S3 API, Kubernetes CSI&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Primarily S3 API, with weak POSIX compatibility&lt;/td&gt;
&lt;td style="text-align: left;"&gt;POSIX-oriented; HDFS or S3 compatibility often requires plugins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Metadata performance&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Very high: sub-millisecond latency, supports hundreds of billions of files per volume&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Lower: high metadata overhead for massive small files; API call overhead about 10–30 ms&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Medium to high: performance bottlenecks and complexity challenges at ultra-large scale (100M+ files)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Scalability&lt;/td&gt;
&lt;td style="text-align: left;"&gt;High: horizontal scaling, supports tens to hundreds of billions of files per volume&lt;/td&gt;
&lt;td style="text-align: left;"&gt;High: near-infinite storage capacity, but small-file management efficiency degrades with scale&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Moderate: scaling limited by metadata nodes; operational complexity grows exponentially with scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Multi-cloud adaptability&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Native support&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Relies on sync tools; cross-cloud data flow inefficient; global unified view difficult&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Limited: often tightly bound to specific hardware or cloud provider; cross-cloud deployment is complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Cost efficiency&lt;/td&gt;
&lt;td style="text-align: left;"&gt;High performance-to-cost ratio&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Low (storage only): cheap storage, but low GPU utilization in high-throughput scenarios like AI training&lt;/td&gt;
&lt;td style="text-align: left;"&gt;High: often requires all-flash architecture or dedicated hardware; high operational labor cost&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Based on the comparison results above, JuiceFS demonstrates significant advantages in core performance, scalability, multi-cloud adaptability, and cost efficiency. This makes it the preferred choice for our unified storage solution.&lt;br&gt;
Furthermore, JuiceFS has been widely adopted in the &lt;a href="https://juicefs.com/en/blog?tag=AI%20storage"&gt;autonomous driving&lt;/a&gt; industry. Leading companies such as Horizon Robotics have leveraged JuiceFS to manage data at the exabyte scale. This demonstrates its maturity and effectiveness in large-scale production environments.&lt;br&gt;
For our specific application scenarios, JuiceFS offers the following core technical advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Decoupled architecture:&lt;/strong&gt; JuiceFS adopts a metadata-data separation architecture, persisting data in cost-effective object storage (like S3 or OSS) while storing metadata in databases like Redis or TiKV. This decoupled design enables elastic storage scaling and reduces dependence on any single cloud provider.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunking and caching mechanisms:&lt;/strong&gt; JuiceFS &lt;a href="https://juicefs.com/docs/community/architecture#how-juicefs-store-files"&gt;uses chunks, slices, and blocks&lt;/a&gt; to significantly improve small file read efficiency and enhance concurrent read/write performance. In addition, multi-level caching (memory, local SSD, distributed cache) reduces access latency for hot data. This meets the demands of high-throughput training workloads.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud-native adaptability:&lt;/strong&gt; By providing a &lt;a href="https://juicefs.com/docs/csi/introduction"&gt;CSI Driver&lt;/a&gt;, JuiceFS delivers persistent storage decoupled from compute nodes in Kubernetes environments, supporting stateless container deployment and cross-cloud migration. It enables data sharing, enhances application high availability and flexibility, and adapts to various Kubernetes deployment methods.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full-stack support for AI training:&lt;/strong&gt; JuiceFS fully supports POSIX, HDFS, and S3 API, and is compatible with mainstream AI frameworks such as PyTorch and TensorFlow. It can be integrated without code modifications, lowering the technical barrier for adoption.  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-cloud support:&lt;/strong&gt; Its cross-cloud capabilities and high-performance metadata engine ensure efficient data flow, perfectly aligning with our strategy of "computing power on demand."&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;From a cost perspective, JuiceFS does not offer a significant cost advantage in the early stages of small-scale deployment. However, when data volume reaches the petabyte level, especially at the 10 PB or 100 PB scale, the cost advantage of its object-storage-based architecture over all-flash storage solutions becomes fully evident.&lt;/strong&gt; In addition, JuiceFS requires minimal operational overhead. Currently, we need only one engineer to manage the entire cloud platform and storage system, a fraction of the personnel required by traditional solutions.&lt;/p&gt;
&lt;h2&gt;From Community Edition to Enterprise Edition: addressing larger-scale scenarios&lt;/h2&gt;
&lt;p&gt;As our application continued to expand, we encountered limitations when using Redis as the &lt;a href="https://juicefs.com/docs/community/databases_for_metadata/"&gt;metadata engine&lt;/a&gt;—specifically, physical memory capacity constrained data scalability. When the number of files approached the hundred-million level, metadata query latency increased significantly. This impacted the concurrency efficiency of training tasks. After using the clone feature, the metadata volume grew substantially. In addition, in cross-cloud scenarios, we faced higher demands for metadata synchronization and mirror file system capabilities. We also required more granular capacity controls and permission management at the directory level.  &lt;/p&gt;
&lt;p&gt;Considering these requirements—along with our desire to leverage local SSDs on GPU nodes to build a distributed cache layer for improved performance—we decided to deploy &lt;a href="https://juicefs.com/docs/cloud/"&gt;JuiceFS Enterprise Edition&lt;/a&gt; in parallel, migrating core scenarios such as ultra-large-scale directory management and multi-node collaborative training to this version. Through this scenario-based approach, we’ve effectively enhanced the adaptability of our overall storage system and established a solid foundation for future application growth. Below are the key features of the Enterprise Edition that we’ve applied in real-world scenarios.&lt;/p&gt;
&lt;h3&gt;High-performance metadata engine: solving the bottleneck of large-scale directory retrieval&lt;/h3&gt;
&lt;p&gt;For high-frequency operations such as traversing directories with hundreds of millions of files and deep pagination queries, we previously encountered the "slower as you query" problem with traditional storage solutions. When the number of files in a single directory exceeded 10 million, and the pagination offset surpassed 100,000 entries, response latency would spike from hundreds of milliseconds to several seconds. This severely impacted data filtering efficiency.  &lt;/p&gt;
&lt;p&gt;After switching to JuiceFS Enterprise Edition, its native tree-structured metadata storage architecture played a key role. Unlike the flat key-value storage we used before, which stores file metadata in no particular order, this tree structure allows direct navigation to directory levels, reducing the scope of metadata scans. In our tests, deep pagination queries (with an offset of 500,000 entries) in a directory containing 120 million files saw latency drop from 3.8 seconds to just 210 milliseconds. This fully met the retrieval needs of large-scale datasets. In addition, this engine supports storing hundreds of billions of files per volume, and we’ve already used it to stably manage three petabyte-scale training datasets, in line with our application growth expectations.&lt;/p&gt;
&lt;h3&gt;Enterprise-grade distributed cache: improving data sharing efficiency in multi-node, multi-GPU training&lt;/h3&gt;
&lt;p&gt;In multi-node, multi-GPU training scenarios, we previously faced challenges such as low cache hit rates and cross-node bandwidth congestion. The open-source version only supports local caching on each node. This means that when multiple nodes pull the same dataset simultaneously, each node must access object storage independently. This resulted in single-node bandwidth utilization exceeding 90%, with average training job startup delays of up to 20 minutes.  &lt;/p&gt;
&lt;p&gt;With JuiceFS Enterprise Edition's &lt;a href="https://juicefs.com/docs/cloud/guide/distributed-cache/"&gt;distributed caching&lt;/a&gt; feature, we set up a distributed cache across a 12-node training cluster using just three commands. The dataset only needs to be pulled from object storage once and is cached in a pool built from local SSDs across the nodes. As a result, &lt;strong&gt;the cache hit rate for multi-node collaborative training increased from 45% to 92%, cross-node bandwidth utilization dropped to below 15%, and training job startup time was reduced to under three minutes&lt;/strong&gt;. This significantly improved compute utilization.&lt;/p&gt;
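&lt;p&gt;To put the hit-rate change in perspective, the following back-of-the-envelope sketch estimates how much traffic still falls through to object storage. It assumes that every cache miss results in one object storage read and that all reads are of similar size. These are simplifying assumptions of ours, not measurements.&lt;/p&gt;

```python
def object_storage_fraction(cache_hit_rate):
    """Fraction of reads that miss the cache and fall through to object storage."""
    return 1.0 - cache_hit_rate


before = object_storage_fraction(0.45)  # hit rate before the distributed cache
after = object_storage_fraction(0.92)   # hit rate with the distributed cache
reduction = before / after

print(f"miss rate before: {before:.0%}, after: {after:.0%}")
print(f"object storage reads reduced about {reduction:.1f}x")
```

&lt;p&gt;Under these assumptions, the miss rate falls from 55% to 8%, so roughly 6.9 times fewer reads reach object storage.&lt;/p&gt;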
&lt;h3&gt;Enhanced cross-cloud collaboration: building a low-operational-cost cross-cloud data foundation&lt;/h3&gt;
&lt;p&gt;Since our R&amp;amp;D environments are distributed across two cloud environments, we previously encountered challenges with &lt;strong&gt;slow cross-cloud data synchronization and high operational costs&lt;/strong&gt;. Using traditional synchronization tools to maintain data consistency between the two clouds required configuring eight scheduled tasks, with an average synchronization delay of four hours, and dedicated personnel needed to investigate sync failures weekly.  &lt;/p&gt;
&lt;p&gt;By using the JuiceFS sync tool combined with our internal AI operations tools, we achieved automated configuration of synchronization policies. The system automatically adjusts sync priorities based on data heat levels, keeping cross-cloud data latency within 10 minutes. In addition, tasks such as failure retries and log alerts for synchronization are fully automated, eliminating the need for dedicated monitoring. &lt;strong&gt;This has reduced operational overhead by 70%&lt;/strong&gt;, and we now stably support multiple training projects across two cloud platforms sharing the same dataset. Going forward, we plan to use the Enterprise Edition's mirror file system feature to further enhance cross-cloud data collaboration.&lt;/p&gt;
&lt;h2&gt;JuiceFS optimization&lt;/h2&gt;
&lt;h3&gt;Client cache and write performance tuning&lt;/h3&gt;
&lt;p&gt;We need to pay attention to compatibility issues between caching strategies and Kubernetes resource limits. For example, using memory as a local cache path with improper configuration may lead to abnormal memory growth in the Mount Pod, or insufficient resource quota reservations may cause checkpoint loss or file handle write exceptions during long-running training tasks.  &lt;/p&gt;
&lt;p&gt;Regarding write performance tuning, enabling writeback mode can improve small file write throughput to some extent. However, considering production environment requirements for data consistency, we still adopt write-through synchronous mode to reduce data risks in extreme crash scenarios. It’s recommended to cautiously enable writeback mode only in scenarios with lower data reliability requirements, such as temporary computing or offline data cleaning, based on actual needs.&lt;/p&gt;
&lt;h3&gt;Deployment and network topology optimization&lt;/h3&gt;
&lt;p&gt;For more stable performance, we strongly recommend deploying the metadata engine and compute nodes within the same region. In actual operations, we observed that cross-region deployment could increase metadata operation latency by a factor of several to ten. This significantly impacted I/O-intensive operations such as data decompression. Deploying metadata services and GPU computing resources within the same region helps maintain performance while controlling network transmission costs, improving overall resource utilization efficiency.&lt;/p&gt;
&lt;h3&gt;Data warm-up and cache optimization&lt;/h3&gt;
&lt;p&gt;In a 10-gigabit network environment, fully utilizing JuiceFS' data &lt;a href="https://juicefs.com/docs/cloud/reference/command_reference/#warmup"&gt;warm-up&lt;/a&gt; and adjusting data block sizes to fit the application scenario makes better use of network bandwidth and improves read throughput. Combined with the distributed cache architecture, this can effectively enhance data sharing efficiency in multi-node concurrent scenarios and improve cache hit rates during high-concurrency reads. This in turn improves the overall performance of large-scale AI training tasks.&lt;/p&gt;
&lt;h3&gt;Resource quotas and high availability guarantee&lt;/h3&gt;
&lt;p&gt;In enterprise-level multi-role operations and storage responsibility separation scenarios, to avoid operational risks caused by inconsistent configurations, it’s recommended to finely control resource quotas for &lt;a href="https://juicefs.com/docs/csi/introduction/"&gt;JuiceFS CSI Driver&lt;/a&gt; in Kubernetes environments. By appropriately setting CPU and memory request/limit for Mount Pods, Pod restarts or node anomalies caused by resource preemption can be reduced. In practice, resource reservation ratios can be dynamically adjusted based on cluster load.  &lt;/p&gt;
&lt;p&gt;In addition, for scenarios with high application continuity requirements, the automatic mount point recovery feature for Mount Pods can be enabled to achieve automated fault recovery for storage services, further ensuring underlying storage stability.&lt;/p&gt;
&lt;h3&gt;Multi-tenancy&lt;/h3&gt;
&lt;p&gt;We provide independent &lt;a href="https://en.wikipedia.org/wiki/File_system"&gt;file systems&lt;/a&gt; and storage buckets for large enterprise customers, while isolating small and medium-sized enterprises and end users through subdirectory-level isolation and permission control.  &lt;/p&gt;
&lt;p&gt;Large enterprises can flexibly scale throughput and capacity, avoiding performance bottlenecks associated with shared storage buckets. For small and medium-sized enterprises and end users, we ensure data security and independence through subdirectory isolation and permission control, while enabling accurate metering and billing.  &lt;/p&gt;
&lt;p&gt;This architecture ensures tenant isolation while flexibly allocating resources, improving system management efficiency.&lt;/p&gt;
&lt;h3&gt;Version management&lt;/h3&gt;
&lt;p&gt;Using the &lt;code&gt;juicefs clone&lt;/code&gt; command, copies of original datasets can be created quickly and modified independently without affecting the source data. The clone operation copies only file metadata; only subsequent changes to the data consume additional space, which saves underlying storage. This feature supports managing multiple versions, making rollback and recovery easier and improving data security and version control.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;JuiceFS' characteristics in metadata performance, scalability, cross-cloud adaptability, and comprehensive cost efficiency have made it our choice for building a unified storage layer. Currently, we adopt both JuiceFS Community Edition and Enterprise Edition to accommodate different storage requirements across various application scenarios. &lt;/p&gt;
&lt;p&gt;In the future, we plan to further implement JuiceFS in the embodied intelligence field, addressing specific storage needs in this scenario. These include high-throughput processing of time-series data, precise multi-modal data alignment, edge-cloud collaborative storage, and integrated management of simulation and real-world data.  &lt;/p&gt;
&lt;p&gt;If you have any questions about this article, feel free to join &lt;a href="https://github.com/juicedata/juicefs/discussions/"&gt;JuiceFS discussions on GitHub&lt;/a&gt; and &lt;a href="https://go.juicefs.com/slack/"&gt;community on Slack&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 05 Mar 2026 09:22:00 +0000</pubDate><guid>https://www.juicefs.com/en/blog/user-stories/multi-cloud-store-massive-small-files</guid></item><item><title>The Design Journey of FUSE: From Kernel-Space to User-Space File Systems</title><link>https://www.juicefs.com/en/blog/engineering/design-fuse-kernel-user-space</link><description>&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Filesystem_in_Userspace"&gt;Filesystem in Userspace&lt;/a&gt; (FUSE), introduced in 2001, enables users to create custom file systems in user space. By lowering the barrier to file system development, FUSE empowers developers to innovate without modifying kernel code. &lt;a href="https://juicefs.com/docs/community/introduction/"&gt;JuiceFS&lt;/a&gt;, a high-performance distributed file system, leverages FUSE’s flexibility and extensibility to deliver robust storage solutions.  &lt;/p&gt;
&lt;p&gt;In this article, we’ll explore FUSE’s architecture and advantages, tracing the evolution of kernel file systems and network file systems that laid the groundwork for FUSE. Finally, we’ll share JuiceFS’ practical insights into optimizing FUSE performance for &lt;a href="https://en.wikipedia.org/wiki/Artificial_intelligence"&gt;AI&lt;/a&gt; workloads. Because FUSE requires switching between user space and kernel space, it introduces some overhead and may add I/O latency, so many people have doubts about its performance. In our practical experience, FUSE can meet performance requirements in most AI scenarios. We’ll elaborate on the relevant details in this article. &lt;/p&gt;
&lt;h2&gt;Standalone file systems: Kernel space and VFS&lt;/h2&gt;
&lt;p&gt;The file system, as a core underlying component of the operating system, is responsible for frequent operations on storage devices. It was initially designed entirely in kernel space. The &lt;em&gt;kernel&lt;/em&gt; concept emerged as computer hardware and software became increasingly complex, and operating systems separated the code for managing underlying resources from user programs.&lt;/p&gt;
&lt;h3&gt;Kernel space and user space&lt;/h3&gt;
&lt;p&gt;Kernel space:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The kernel is code with super privileges. It manages a computer’s core resources, such as CPU, memory, storage, and network.  &lt;/li&gt;
&lt;li&gt;When kernel code runs, the program enters kernel space, enabling full access to and control over underlying hardware. Due to the kernel’s high privileges, its code must undergo stringent testing and verification, and ordinary users cannot modify it freely.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;User space:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;User space is where the code of everyday applications, such as browsers and games, runs.  &lt;/li&gt;
&lt;li&gt;In user space, programs have strictly limited permissions and cannot directly access important underlying resources.&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;Kernel space and user space&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Kernel space and user space&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;If an application needs to use a file system, it must access it through the interface designed by the operating system, such as the commonly used &lt;code&gt;OPEN&lt;/code&gt;, &lt;code&gt;READ&lt;/code&gt;, and &lt;code&gt;WRITE&lt;/code&gt;, which are system calls. The role of system calls is to build a bridge between user space and kernel space. Modern operating systems often define hundreds of system calls, each with its own clear name, number, and parameters.  &lt;/p&gt;
&lt;p&gt;When an application makes a system call, it enters a section of kernel space code and returns the results to user space after execution. It’s worth noting that this entire process from user space to kernel space and then back to user space belongs to the same process category.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;System call architecture&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;System call architecture&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;h3&gt;Virtual file systems&lt;/h3&gt;
&lt;p&gt;With this background in place, let’s briefly look at how user space and kernel space interact when a user calls a &lt;a href="https://en.wikipedia.org/wiki/File_system"&gt;file system&lt;/a&gt; interface.  &lt;/p&gt;
&lt;p&gt;Through the &lt;a href="https://en.wikipedia.org/wiki/Virtual_file_system"&gt;virtual file system&lt;/a&gt; (VFS), the kernel encapsulates a set of universal file system interfaces, exposes them to user space via system calls, and provides programming interfaces to the underlying file systems. Each underlying file system implements its own file system interfaces according to the VFS format.   &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The standard process for user space to access the underlying file system:&lt;/strong&gt; a system call -&amp;gt; the VFS -&amp;gt; the underlying file system -&amp;gt; the physical device  &lt;/p&gt;
&lt;p&gt;For example, when we call &lt;code&gt;open&lt;/code&gt; in an application, it carries a path as its parameter. After this call reaches the VFS layer, the VFS searches level by level in its tree structure based on this path. Ultimately, it finds the corresponding target and the underlying file system it belongs to. That file system has its own implementation of the &lt;code&gt;open&lt;/code&gt; method, so the VFS passes the &lt;code&gt;open&lt;/code&gt; call on to it.  &lt;/p&gt;
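&lt;p&gt;The dispatch idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how the Linux VFS is implemented: the real kernel walks the path from the root through cached directory entries, while this sketch simply searches upward for the deepest mount point. The class names &lt;code&gt;ToyVFS&lt;/code&gt; and &lt;code&gt;MemFS&lt;/code&gt; and the sample paths are hypothetical.&lt;/p&gt;

```python
import posixpath


class MemFS:
    """Toy in-memory file system standing in for an underlying file system."""

    def __init__(self, name, files):
        self.name = name
        self.files = files  # maps a path relative to the mount point to bytes

    def open(self, relpath):
        if relpath not in self.files:
            raise FileNotFoundError(relpath)
        return (self.name, relpath)


class ToyVFS:
    """Routes a path to the file system mounted at the deepest matching point."""

    def __init__(self):
        self.mounts = {}

    def mount(self, mountpoint, fs):
        self.mounts[posixpath.normpath(mountpoint)] = fs

    def resolve(self, path):
        full = posixpath.normpath(path)
        current = full
        # Walk from the full path up towards "/" until a mount point matches.
        while True:
            if current in self.mounts:
                return self.mounts[current], posixpath.relpath(full, current)
            if current == "/":
                raise FileNotFoundError(path)
            current = posixpath.dirname(current)

    def open(self, path):
        fs, rel = self.resolve(path)
        return fs.open(rel)  # delegate to that file system's own open


vfs = ToyVFS()
vfs.mount("/", MemFS("rootfs", {"etc/hosts": b"127.0.0.1 localhost"}))
vfs.mount("/mnt/jfs", MemFS("fusefs", {"data/a.bin": b"\x00"}))

print(vfs.open("/etc/hosts"))
print(vfs.open("/mnt/jfs/data/a.bin"))
```

&lt;p&gt;The caller uses the same &lt;code&gt;open&lt;/code&gt; for both paths; only the mount table decides which implementation runs.&lt;/p&gt;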
&lt;p&gt;The Linux kernel supports dozens of different file systems. For different storage media such as memory or networks, different file systems are used for management. &lt;strong&gt;The most critical point is that the extensibility of VFS enables the Linux system to easily support a variety of file systems to meet various complex storage needs; at the same time, this extensibility also provides a foundation for FUSE to implement kernel-space functions in user space later.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Network file systems: Breaking kernel boundaries&lt;/h2&gt;
&lt;p&gt;As computing needs grew, a single computer could no longer meet the increasing computing and storage requirements. People began to use multiple computers to share the load and improve overall efficiency.  &lt;/p&gt;
&lt;p&gt;In this scenario, an application often needs to access data distributed across multiple computers. &lt;strong&gt;To solve this problem, people proposed introducing a virtual storage layer in the network: the remote computer’s file system (such as a certain directory) is virtually mounted, through a network interface, to a node on the local computer. The goal is to let the local computer access remote data seamlessly, as if the data were stored locally.&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;Network file system (NFS) architecture&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;Network file system (NFS) architecture&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;Specifically, if a local computer needs to access remote data, a subdirectory of the remote computer can be virtually mounted to a node of the local computer through a network interface. In this process, the application does not need to make any modifications and can still access these paths through standard file system interfaces as if they were local data.&lt;/p&gt;
&lt;p&gt;When the application performs operations on these network paths (such as hierarchical directory lookup), the operations are converted into network requests and sent to the remote computer as remote procedure calls (RPCs) for execution. After receiving these requests, the remote computer performs the corresponding operations (such as finding files and reading data) and returns the results to the local computer.&lt;/p&gt;
&lt;p&gt;The process above is a simplified description of how the &lt;a href="https://en.wikipedia.org/wiki/Network_File_System"&gt;NFS&lt;/a&gt; protocol works. As a network file system protocol, NFS provides an efficient solution for resource sharing between multiple computers. It allows users to mount and access remote file systems as conveniently as local ones.&lt;/p&gt;
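&lt;p&gt;The request and response exchange can be sketched as follows. This is only an illustration of the RPC idea: the JSON message format, the operation names, and the sample file are assumptions made for this example and are not the actual NFS wire protocol. The network transport is replaced by a direct function call so that the sketch runs on a single machine.&lt;/p&gt;

```python
import json

# Files "stored" on the remote machine (hypothetical sample data).
REMOTE_FILES = {"/export/readme.txt": "hello from the remote host"}


def server_handle(raw_request):
    """Remote side: decode the request, run the operation, encode the reply."""
    request = json.loads(raw_request)
    path = request["path"]
    if request["op"] == "lookup":
        reply = {"ok": path in REMOTE_FILES}
    elif request["op"] == "read" and path in REMOTE_FILES:
        reply = {"ok": True, "data": REMOTE_FILES[path]}
    else:
        reply = {"ok": False, "error": "ENOENT"}
    return json.dumps(reply)


def rpc(op, path, transport=server_handle):
    """Local side: turn a file operation into a message and send it."""
    # A real client would send these bytes over the network; here the
    # transport is a plain function call so the sketch runs standalone.
    return json.loads(transport(json.dumps({"op": op, "path": path})))


print(rpc("lookup", "/export/readme.txt"))
print(rpc("read", "/export/readme.txt"))
print(rpc("read", "/export/missing.txt"))
```

&lt;p&gt;The local caller never touches the remote data structures directly; it only sees the decoded reply, which is what lets the application treat a remote path like a local one.&lt;/p&gt;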
&lt;p&gt;Traditional file systems typically run entirely in the kernel space of a single node, while NFS was the first to break this limitation. The server-side implementation combines kernel space with user space. The subsequent design of FUSE was inspired by this approach.&lt;/p&gt;
&lt;h2&gt;FUSE: File system innovation from kernel to user space&lt;/h2&gt;
&lt;p&gt;As computer technology continued to develop, many emerging application scenarios required custom file systems. &lt;strong&gt;Traditional kernel-space file systems are difficult to implement and suffer from version compatibility issues. The NFS architecture was the first to break through the limitations of the kernel.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Based on this, someone proposed an idea: Can we transplant the NFS network protocol to a single node, transfer the server-side functionality to a user-space process, while retaining the client running in the kernel, and use system calls instead of network communication to realize file system functions in user space? This idea eventually led to the birth of FUSE.&lt;/p&gt;
&lt;p&gt;In 2001, Hungarian computer scientist Miklos Szeredi introduced FUSE, a framework that allows developers to implement file systems in user space. &lt;strong&gt;The core of FUSE is divided into two parts: the kernel module and the user-space library (&lt;a href="https://github.com/libfuse/libfuse"&gt;libfuse&lt;/a&gt;).&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Its kernel module, as part of the operating system kernel, interacts with VFS, forwarding file system requests from VFS to user space, and returning the processing results of user space to VFS. This design allows FUSE to implement custom file system functions without modifying kernel code.&lt;/p&gt;
&lt;p&gt;The FUSE user-space library (libfuse) provides an API library that interacts with the FUSE kernel module and helps users implement a daemon running in user space. The daemon handles file system requests from the kernel and implements specific file system logic.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;FUSE workflow&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;FUSE workflow&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;In specific implementations, the user-space daemon and kernel module collaborate through the following steps to complete file operations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Request reception&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;1.1 The kernel module registers a character device (&lt;code&gt;/dev/fuse&lt;/code&gt;) as a communication channel. The daemon reads requests from this device by calling &lt;code&gt;read()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;1.2 If the kernel’s FUSE request queue is empty, &lt;code&gt;read()&lt;/code&gt; blocks. The daemon pauses execution and releases the CPU until a new request appears in the queue (implemented through the kernel’s wait queue mechanism).&lt;/p&gt;
&lt;p&gt;1.3 When an application initiates a file operation (such as &lt;code&gt;open&lt;/code&gt; or &lt;code&gt;read&lt;/code&gt;), the kernel module encapsulates the request into a specially formatted data packet and inserts it into the request queue, waking the blocked daemon.&lt;/p&gt;
&lt;ol start="2"&gt;
&lt;li&gt;
&lt;p&gt;Request processing&lt;br&gt;
After the daemon reads the request data packet from the character device, it calls the corresponding user-space processing function according to the operation type (such as reading, writing, and creating a file).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Result returning&lt;br&gt;
After processing is complete, the daemon serializes the result (such as the content of the read file or error code) according to the FUSE protocol format and writes the data packet back to the character device through &lt;code&gt;write()&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After the kernel module receives the response:&lt;/p&gt;
&lt;p&gt;3.1 It parses the data packet content and passes the results to the waiting application.&lt;/p&gt;
&lt;p&gt;3.2 It wakes up the system call blocked in the application to continue executing subsequent logic.&lt;/p&gt;
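&lt;p&gt;The three steps above can be modeled with a short Python simulation. It does not open &lt;code&gt;/dev/fuse&lt;/code&gt; or speak the FUSE protocol: a &lt;code&gt;queue.Queue&lt;/code&gt; stands in for the kernel request queue, and a &lt;code&gt;threading.Event&lt;/code&gt; stands in for the reply written back to the device. The function names &lt;code&gt;daemon&lt;/code&gt;, &lt;code&gt;handle&lt;/code&gt;, and &lt;code&gt;syscall&lt;/code&gt; and the sample file are hypothetical.&lt;/p&gt;

```python
import queue
import threading

requests = queue.Queue()          # stands in for the kernel FUSE request queue
FILES = {"/hello.txt": b"hello"}  # hypothetical file system contents


def handle(op, path):
    """User-space handlers, chosen by operation type (step 2)."""
    if path not in FILES:
        return {"error": "ENOENT"}
    if op == "open":
        return {"error": None}
    if op == "read":
        return {"error": None, "data": FILES[path]}
    return {"error": "ENOSYS"}


def daemon():
    while True:
        item = requests.get()  # blocks while the queue is empty (step 1.2)
        if item is None:       # sentinel used only to stop this sketch
            break
        op, path, reply_box, done = item
        reply_box.append(handle(op, path))  # process the request (step 2)
        done.set()                          # deliver the reply, wake the caller (step 3)


def syscall(op, path):
    """Application side: enqueue a request, then block until the reply arrives."""
    reply_box, done = [], threading.Event()
    requests.put((op, path, reply_box, done))  # enqueue and wake the daemon (step 1.3)
    done.wait()
    return reply_box[0]


worker = threading.Thread(target=daemon)
worker.start()
opened = syscall("open", "/hello.txt")
content = syscall("read", "/hello.txt")
missing = syscall("read", "/missing")
requests.put(None)
worker.join()

print(opened, content, missing)
```

&lt;p&gt;The application thread blocks for the full round trip, which is why each FUSE request costs at least two transitions between kernel space and user space.&lt;/p&gt;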
&lt;p&gt;The emergence of FUSE brought revolutionary changes to file system development. By migrating the implementation of file systems from kernel space to user space, FUSE significantly reduced development difficulty, improved system flexibility and extensibility, and was widely applied in various scenarios such as network file systems, encrypted file systems, and virtual file systems.&lt;/p&gt;
&lt;h2&gt;JuiceFS: A FUSE user-space distributed file system&lt;/h2&gt;
&lt;p&gt;In 2017, as IT infrastructure fully entered the cloud era, its architecture faced unprecedented challenges. Against this background, JuiceFS was born. As a &lt;a href="https://www.geeksforgeeks.org/distributed-systems/what-is-dfsdistributed-file-system/"&gt;distributed file system&lt;/a&gt; based on object storage, it builds its file system architecture on FUSE, relying on FUSE’s flexibility and extensibility to meet the diverse needs of cloud computing environments.&lt;/p&gt;
&lt;p&gt;Through FUSE, the JuiceFS file system can be mounted to servers in a &lt;a href="https://en.wikipedia.org/wiki/POSIX#:~:text=The%20Portable%20Operating%20System%20Interface,maintaining%20compatibility%20between%20operating%20systems."&gt;POSIX&lt;/a&gt;-compatible manner. It treats massive cloud storage as local storage. Common file system commands, such as &lt;code&gt;ls&lt;/code&gt;, &lt;code&gt;cp&lt;/code&gt;, and &lt;code&gt;mkdir&lt;/code&gt;, can be used to manage files and directories in JuiceFS.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;JuiceFS architecture diagram (4th edition) - page 2 - winfsp (2)&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;JuiceFS Community Edition architecture&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;Let’s take a user mounting JuiceFS and then opening one of its files as an example. The request first goes through the kernel VFS, then is passed to the kernel's FUSE module, and communicates with the JuiceFS client process through &lt;code&gt;/dev/fuse&lt;/code&gt; device. The relationship between VFS and FUSE can be simply regarded as a client-server protocol, with VFS acting as the client requesting service, and the user-space JuiceFS acting as the server role, handling these requests.&lt;/p&gt;
&lt;p&gt;The workflow is as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;After JuiceFS is mounted, the &lt;code&gt;go-fuse&lt;/code&gt; module inside JuiceFS opens &lt;code&gt;/dev/fuse&lt;/code&gt; to obtain the &lt;code&gt;mount fd&lt;/code&gt; and starts several threads to read FUSE requests from the kernel.  &lt;/li&gt;
&lt;li&gt;The user calls the &lt;code&gt;open&lt;/code&gt; function, which enters the VFS layer through the C library and a system call. The VFS layer then passes the request to the kernel’s FUSE module.  &lt;/li&gt;
&lt;li&gt;Following the protocol, the kernel FUSE module puts the &lt;code&gt;open&lt;/code&gt; request into the queue corresponding to the &lt;code&gt;fd&lt;/code&gt; of &lt;code&gt;juicefs mount&lt;/code&gt; and wakes up a &lt;code&gt;go-fuse&lt;/code&gt; thread that is reading requests. The calling application thread then waits for the processing result.  &lt;/li&gt;
&lt;li&gt;The user-space &lt;code&gt;go-fuse&lt;/code&gt; module reads the FUSE request, parses it, and calls the corresponding JuiceFS implementation.  &lt;/li&gt;
&lt;li&gt;&lt;code&gt;go-fuse&lt;/code&gt; writes the processing result of this request into the &lt;code&gt;mount fd&lt;/code&gt;, that is, into the FUSE result queue, which wakes up the waiting application thread.  &lt;/li&gt;
&lt;li&gt;The application thread is awakened, gets the processing result of this request, and then returns to the upper layer.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Because FUSE requires frequent switching between user space and kernel space, many people have doubts about its performance. In practice, these concerns are not always warranted. To check, we ran a set of tests using JuiceFS.&lt;/p&gt;
&lt;p&gt;The testing environment: a 176-core Intel Xeon machine with 1.5 TB of RAM, and a 512 GB sparse file on JuiceFS&lt;/p&gt;
&lt;p&gt;We used fio to perform sequential read tests on it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Mount parameters: &lt;code&gt;./cmd/mount/mount mount --no-update --conf-dir=/root/jfs/deploy/docker --cache-dir /tmp/jfsCache0 --enable-xattr --enable-acl -o allow_other test-volume /tmp/jfs&lt;/code&gt;  &lt;/li&gt;
&lt;li&gt;The fio command: &lt;code&gt;fio --name=seq_read --filename=/tmp/jfs/holefile  --rw=read --bs=4M  --numjobs=1  --runtime=60 --time_based --group_reporting&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This setup removed disk hardware as a constraint, so the test measured the maximum bandwidth of the FUSE file system itself.&lt;/p&gt;&lt;/div&gt;
&lt;div class="w-block-ImageWithCaption block-ImageWithCaption"&gt;&lt;dl&gt;
    &lt;dt&gt;image&lt;/dt&gt;
    &lt;dd&gt;all-memory&lt;/dd&gt;
    &lt;dt&gt;caption&lt;/dt&gt;
    &lt;dd&gt;JuiceFS sequential read throughput in an all-memory test&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="w-block-markdown block-markdown"&gt;&lt;p&gt;Testing results:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The single-thread bandwidth under a single mount point reached 2.4 GiB/s.  &lt;/li&gt;
&lt;li&gt;As the number of threads increased, bandwidth kept scaling up. At 20 threads, it reached 25.1 GiB/s, roughly 10 times the single-thread figure. This throughput is sufficient for most real-world application scenarios.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On top of FUSE, JuiceFS has implemented a smooth upgrade feature. By keeping the &lt;code&gt;mount fd&lt;/code&gt; consistent across the upgrade, users can upgrade the JuiceFS version or modify the mount parameters without re-mounting the file system or interrupting the application. For details, see &lt;a href="https://juicefs.com/en/blog/engineering/smooth-upgrade"&gt;Smooth Upgrade: Implementation and Usage&lt;/a&gt;.  &lt;/p&gt;
&lt;p&gt;FUSE also has some limitations. For example, processes accessing the FUSE device require high permissions, especially in container environments, usually requiring privileged mode to be enabled. In addition, containers are usually transient and stateless. If a container exits unexpectedly and data is not written to disk in time, there is a risk of data loss.  &lt;/p&gt;
&lt;p&gt;Therefore, for Kubernetes scenarios, the &lt;a href="https://juicefs.com/docs/csi/introduction/"&gt;JuiceFS CSI Driver&lt;/a&gt; allows applications to access the JuiceFS file system with non-privileged containers. The CSI driver manages the lifecycle of the FUSE process to ensure that data can be written to disk in time and will not be lost. For details, see &lt;a href="https://juicefs.com/en/blog/usage-tips/kubernetes-data-persistence-juicefs"&gt;K8s Data Persistence: Getting Started with JuiceFS CSI Driver&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;FUSE decouples file system implementation from the kernel, giving developers great flexibility and convenience in implementing file systems in user space. Especially in modern computing environments such as cloud computing and distributed storage, FUSE makes building and maintaining complex storage systems more efficient, more customizable, and easier to extend. &lt;/p&gt;
&lt;p&gt;JuiceFS is based on FUSE and implements a high-performance distributed file system in user space. In the future, we’ll continue exploring optimization methods for FUSE and continuously improving the performance and reliability of file systems to meet increasingly complex storage needs and provide users with stronger data management capabilities.&lt;/p&gt;
&lt;p&gt;If you have any questions about this article, feel free to join &lt;a href="https://github.com/juicedata/juicefs/discussions/"&gt;JuiceFS discussions on GitHub&lt;/a&gt; and &lt;a href="https://go.juicefs.com/slack/"&gt;community on Slack&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Sat, 14 Feb 2026 07:27:00 +0000</pubDate><guid>https://www.juicefs.com/en/blog/engineering/design-fuse-kernel-user-space</guid></item></channel></rss>