JuiceFS Blog

JuiceFS Sync for PB-Scale Data Transfers: Resumable Sync, Encryption, and Bandwidth Control

Wed, 08 Jul 2026 09:42:00 +0000

In scenarios such as data migration, cross-cloud synchronization, and object storage backup, juicefs sync is commonly used to transfer large volumes of data. When datasets grow to TB or PB scale, with millions or even billions of objects, a single synchronization task may run for hours or even days.

As these long-running jobs progress, several common challenges tend to emerge:

After network interruptions, process crashes, or node restarts, tasks often struggle to resume from a consistent state and may need to rescan or reprocess data.
Backup workflows may expose plaintext data and face compliance or security requirements.
When multiple sync jobs run concurrently, bandwidth contention becomes significant, while the overall transfer process lacks effective global control.

To address these challenges, JuiceFS 1.4 introduces three major enhancements to sync: resumable sync, data encryption/decryption, and global traffic control.

In this article, we’ll explain the use cases, implementation details, and configuration methods for each feature.

Resumable sync

In earlier versions, if a synchronization task failed or was interrupted, rerunning juicefs sync required rescanning both the source and destination before determining which objects had already been synchronized and which still needed to be copied.

For workloads involving hundreds of millions of objects or large files, the scan itself could incur substantial time and object-storage request costs.

To address this issue, JuiceFS 1.4 adds a resumable sync mechanism to juicefs sync. When enabled, synchronization progress is periodically saved to the destination. If the task is interrupted, rerunning the same command automatically locates and loads the matching checkpoint and resumes from the last unfinished position, avoiding a full restart.

How it works

When resumable sync is enabled, sync stores a JSON state file on the destination side:

.juicefs-sync-checkpoint.<hash>.json

The <hash> value is derived from the source, destination, and key synchronization parameters. This ensures that a task only loads checkpoints created for itself, preventing accidental reuse across different jobs.
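As a rough illustration of how a task-specific file name like this could be derived, the sketch below hashes the source, destination, and a few options. The hash function (SHA-256), the truncation to 16 hex characters, and the option names are assumptions made for this example; the article does not specify how juicefs sync actually computes the hash.

```python
import hashlib
import json


def checkpoint_name(src: str, dst: str, options: dict) -> str:
    """Return a checkpoint file name that is stable for one task definition.

    Hypothetical scheme: the real hash inputs and algorithm used by
    juicefs sync are not described in this article.
    """
    # Serialize with sorted keys so that option order does not change the hash.
    payload = json.dumps({"src": src, "dst": dst, "options": options}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f".juicefs-sync-checkpoint.{digest}.json"


a = checkpoint_name("s3://bucket-a/data", "s3://bucket-b/data", {"include": "*.parquet"})
b = checkpoint_name("s3://bucket-a/data", "s3://bucket-b/data", {"include": "*.parquet"})
c = checkpoint_name("s3://bucket-a/data", "s3://bucket-b/other", {"include": "*.parquet"})
print(a == b)  # same task definition, same checkpoint name
print(a == c)  # different destination, different checkpoint name
```

Because the name depends only on the task definition, rerunning the same command finds the same file, while a changed destination or option yields a different name and therefore a fresh start.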
The workflow is shown below:

image: Checkpoint workflow
caption: Resumable sync workflow

Checkpoint save, restore, and cleanup workflow in juicefs sync:

When sync starts, it first looks for a checkpoint matching the current task.
If a matching checkpoint is found, execution resumes from the saved state. Otherwise, synchronization starts normally with a fresh scan.
sync traverses multiple prefixes concurrently, maintaining independent state for each prefix, including:
- Whether traversal is complete
- The last scanned position
- Pending objects to synchronize
- Failed objects
When restoring from a checkpoint:
- Pending and failed objects recorded in the checkpoint are re-added to the task queue.
- Prefixes that were not fully traversed resume scanning from their saved positions.
- Fully traversed prefixes only continue processing unfinished objects recorded in the checkpoint.
During execution, progress is saved asynchronously at a configurable interval, which defaults to every 10 seconds.
After successful completion, the checkpoint file is automatically removed. If the task fails or is interrupted, the checkpoint is retained for resumption on the next execution of the same command.
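The restore rules above can be sketched as a small in-memory model. The class and field names below are illustrative assumptions, not the actual structure of the checkpoint file:

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PrefixState:
    """Progress for one prefix (field names are illustrative)."""
    done: bool = False                           # whether traversal is complete
    marker: str = ""                             # last scanned position
    pending: list = field(default_factory=list)  # objects still to synchronize
    failed: list = field(default_factory=list)   # objects that failed earlier


def restore(checkpoint: dict):
    """Rebuild the task queue and the list of scans that must continue."""
    queue = deque()
    resume_scans = []
    for prefix, state in checkpoint.items():
        # Pending and failed objects always go back into the task queue.
        queue.extend(state.pending)
        queue.extend(state.failed)
        # Only prefixes that were not fully traversed continue listing.
        if not state.done:
            resume_scans.append((prefix, state.marker))
    return queue, resume_scans


checkpoint = {
    "logs/": PrefixState(done=True, pending=["logs/a"], failed=["logs/b"]),
    "data/": PrefixState(done=False, marker="data/0042", pending=["data/0041"]),
}
queue, resume_scans = restore(checkpoint)
print(list(queue))
print(resume_scans)
```

In this model, the fully traversed prefix logs/ contributes only its unfinished objects, while data/ both requeues its pending object and resumes listing from its saved marker.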

image: Cluster mode
caption: Checkpoint architecture in cluster mode

In cluster mode, only a single checkpoint exists and is maintained centrally by Manager.
Workers do not directly read or write checkpoint files on the destination. Instead, they:

Pull tasks from Manager
Execute synchronization
Report results back to Manager

Manager aggregates completed objects, failed objects, statistics, and multipart-upload state into the global checkpoint.

Usage

# Enable resumable sync.
juicefs sync --enable-checkpoint SRC DST

# Customize checkpoint save interval (default: 10s).
juicefs sync --enable-checkpoint --checkpoint-interval 30s SRC DST

# Ignore existing checkpoints and restart from scratch.
juicefs sync --enable-checkpoint --checkpoint-force-reset SRC DST

Data encryption and decryption

For cross-cloud backup and archival workflows, client-side encryption is often required to satisfy compliance requirements such as data sovereignty, encryption at rest, and secure migration of sensitive data.

Previously, juicefs sync did not provide built-in encryption capabilities. Users who wanted to write encrypted data to the destination typically had to use external tools for additional processing.

In JuiceFS 1.4, streaming encryption and decryption are integrated directly into the synchronization pipeline, enabling three common workflows:

Encrypt-on-write: Encrypt plaintext data before writing it to the destination, suitable for encrypted backup and archiving.
Decrypt-on-read: Read encrypted data from the source and write decrypted data to the destination, suitable for data recovery or plaintext migration.
Re-encryption: Decrypt source data with an old key and re-encrypt it with a new key before writing to the destination, suitable for key rotation or cryptographic algorithm migration.

Chunk-based streaming encryption

To support object storage Range GET operations while avoiding loading large files into memory all at once, sync uses a streaming encryption scheme based on fixed-size 1 MiB chunks.

A file is first divided into plaintext chunks:

[chunk 1: 1 MiB][chunk 2: 1 MiB] ... [chunk N: ≤1 MiB]

Each plaintext chunk is encrypted independently.

Each encrypted chunk consists of a 4-byte header and the ciphertext data, where the 4-byte header stores the actual ciphertext length (ct_len):

Each encrypted chunk: [4B ct_len][ciphertext + padding]

Encrypted file: [encrypted chunk 1][encrypted chunk 2] ... [encrypted chunk N]

The size of each encrypted chunk is plainChunkSize + overhead, where plainChunkSize is fixed at 1 MiB and the overhead depends on the encryption algorithm and key type used.

image: Chunk-based streaming encryption
caption: Chunk-based streaming encryption architecture in juicefs sync

This design allows random reads to retrieve only the required encrypted chunk rather than downloading the entire file. Because encrypted objects contain additional headers, padding, and encryption metadata, the destination object is typically larger than the original plaintext file.
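To make the random-read benefit concrete, the sketch below maps a plaintext byte range to the encrypted byte range that would need to be fetched. It assumes that encrypted chunks are stored back to back with a uniform on-disk size; that size is passed in as a parameter because the real overhead depends on the algorithm and key type, and the 32-byte overhead used in the example is purely hypothetical.

```python
PLAIN_CHUNK = 1 << 20  # 1 MiB of plaintext per chunk, as stated above


def encrypted_range(offset: int, length: int, enc_chunk_size: int):
    """Map a plaintext byte range to the encrypted byte range to fetch.

    enc_chunk_size is the assumed on-disk size of one encrypted chunk.
    Returns (first_chunk, last_chunk, start_byte, end_byte_inclusive).
    The end byte may need clamping to the object size for the last chunk.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first = offset // PLAIN_CHUNK
    last = (offset + length - 1) // PLAIN_CHUNK
    start = first * enc_chunk_size
    end = (last + 1) * enc_chunk_size - 1
    return first, last, start, end


# Hypothetical per-chunk overhead of 32 bytes, for illustration only.
ENC_CHUNK = PLAIN_CHUNK + 32

# Read 4 KiB starting 10 bytes before the end of the third chunk (index 2).
result = encrypted_range(3 * PLAIN_CHUNK - 10, 4096, ENC_CHUNK)
print(result)
```

A 4 KiB read that straddles a chunk boundary touches only two encrypted chunks, so a Range GET for roughly 2 MiB replaces a download of the whole object.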

Supported algorithms

The table below shows the supported algorithms:

Option	Symmetric cipher	Key encapsulation	Typical use case
aes256gcm-rsa (default)	AES-256-GCM	RSA	General-purpose workloads
chacha20-rsa	ChaCha20-Poly1305	RSA	Environments without efficient AES hardware acceleration
sm4gcm	SM4-GCM	SM2	Scenarios requiring Chinese commercial cryptography standards

Usage

The following examples use RSA keys.

Generate a key pair:

# Generate an RSA private key (the public key is derived automatically).
openssl genrsa -out private.pem 2048

# Generate a password-protected private key.
openssl genrsa -aes256 -out private.pem 2048

Scenario 1: Encrypt and write to destination

juicefs sync /local/data s3://mybucket/backup \
    --encrypt-rsa-key /path/to/private.pem

Scenario 2: Decrypt and read from source for data recovery or plaintext migration

juicefs sync s3://mybucket/backup /local/data \
    --decrypt-rsa-key /path/to/private.pem

Scenario 3: Re-encrypt for key rotation or algorithm migration

# Decrypt data encrypted with the old key and re-encrypt with the new key to new storage.
juicefs sync s3://old-bucket/encrypted s3://new-bucket/re-encrypted \
    --decrypt-rsa-key /path/to/old-private.pem \
    --encrypt-rsa-key /path/to/new-private.pem

If the private key is password-protected, the password can be provided via environment variables:

# For encryption scenarios, use JFS_ENCRYPT_RSA_PASSPHRASE.
export JFS_ENCRYPT_RSA_PASSPHRASE="your-passphrase"
juicefs sync /local/data s3://mybucket/backup --encrypt-rsa-key private.pem

# For decryption scenarios, use JFS_DECRYPT_RSA_PASSPHRASE.
export JFS_DECRYPT_RSA_PASSPHRASE="your-passphrase"
juicefs sync s3://mybucket/backup /local/data --decrypt-rsa-key private.pem

Notes:

Encrypted data is stored using a JuiceFS-specific format and can only be decrypted through juicefs sync with the corresponding key.
Back up encryption keys carefully. Once a private key is lost, encrypted data cannot be recovered.

Global traffic control

In earlier versions, juicefs sync already supported per-process rate limiting via --bwlimit. However, when multiple sync processes run concurrently—such as multiple Workers in a distributed sync, or multiple independent sync tasks sharing the same egress link—per-process limiting cannot constrain total bandwidth usage. The egress link may still be saturated, affecting other application traffic.

JuiceFS 1.4 introduces the --traffic-control-url parameter. Multiple sync processes can connect to the same external traffic control service, which allocates bandwidth quotas uniformly, enabling cross-process, cross-task global rate limiting.

How it works

Global traffic control uses a token bucket model. Before transmitting data, each sync process requests byte credits from the same traffic-control service.

image: Global traffic control
caption: Global traffic-control workflow in juicefs sync

Each process periodically requests a certain number of bytes (credit) before data transfer.

The traffic-control service determines:

How many bytes to grant
How long the granted quota remains valid

When credits are exhausted, the process requests additional credits.

If a quota is about to expire before being fully consumed, the unused portion is returned to the service in advance.

The service exposes a simple HTTP API for granting and reclaiming quotas. This must be implemented by the user or integrated with an existing service:

POST /traffic-control
Content-Type: application/json

Request:
{"bytes": 1048576}
  bytes > 0: Request byte credits.
  bytes < 0: Return unused credits.


Response:
{"granted": 524288, "expired": 1000}
  granted: Number of bytes granted this time.
  expired: Credit validity period (milliseconds).
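Since the service has to be supplied by the user, here is a minimal sketch of one possible implementation using only the Python standard library. The token bucket parameters, the one-second credit validity, and the choice to answer returned credits with a grant of 0 are assumptions for this example; only the request and response fields come from the API described above.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class TokenBucket:
    """Byte-based token bucket shared by every caller of the service."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # refill rate in bytes per second
        self.capacity = capacity  # maximum burst in bytes
        self.tokens = capacity
        self.clock = clock
        self.last = clock()
        self.lock = threading.Lock()

    def request(self, nbytes):
        """Grant up to nbytes; a negative value hands credits back."""
        with self.lock:
            now = self.clock()
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last = now
            if nbytes < 0:
                self.tokens = min(self.capacity, self.tokens - nbytes)
                return 0
            granted = int(min(nbytes, self.tokens))
            self.tokens -= granted
            return granted


# 100 Mbps is 12.5 million bytes per second; allow a one-second burst.
BUCKET = TokenBucket(rate=12_500_000, capacity=12_500_000)
CREDIT_TTL_MS = 1000  # assumed validity period reported with each grant


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/traffic-control":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        granted = BUCKET.request(int(body.get("bytes", 0)))
        payload = json.dumps({"granted": granted, "expired": CREDIT_TTL_MS}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host="127.0.0.1", port=8080):
    """Run the service until interrupted."""
    ThreadingHTTPServer((host, port), Handler).serve_forever()
```

Calling serve() starts the service on port 8080. A production deployment would likely add authentication, per-task fairness, and persistence, none of which are covered here.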

During synchronization, sync requests quotas from the traffic control service before transmitting data. If no credits are available, transmission blocks until new credits are obtained. In this way, multiple sync tasks can share a single global bandwidth limit, preventing the total traffic from becoming uncontrolled even when individual tasks have their own limits.
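The client-side behavior described here (request credits, block when none are available, hand back what is left before it expires) can be modeled roughly as follows. This is a simplified, single-threaded sketch and not the juicefs sync implementation; the class name, batch size, and retry delay are all assumptions.

```python
import json
import time
import urllib.request


def http_ask(url: str):
    """Build a function that posts {"bytes": n} and returns (granted, ttl_ms)."""
    def ask(nbytes: int):
        req = urllib.request.Request(
            url,
            data=json.dumps({"bytes": nbytes}).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            reply = json.load(resp)
        return reply["granted"], reply["expired"]
    return ask


class CreditPool:
    """Holds byte credits for one sender."""

    def __init__(self, ask, batch=1 << 20, clock=time.monotonic, sleep=time.sleep):
        self.ask, self.batch = ask, batch
        self.clock, self.sleep = clock, sleep
        self.credits = 0
        self.deadline = 0.0

    def acquire(self, nbytes: int) -> None:
        """Block until nbytes of credit have been consumed."""
        while nbytes > 0:
            if self.clock() >= self.deadline:
                self.credits = 0  # expired credits can no longer be used
            if self.credits == 0:
                granted, ttl_ms = self.ask(max(self.batch, nbytes))
                if granted == 0:
                    self.sleep(0.05)  # no credit available: wait and retry
                    continue
                self.credits = granted
                self.deadline = self.clock() + ttl_ms / 1000
            used = min(nbytes, self.credits)
            self.credits -= used
            nbytes -= used

    def release_unused(self) -> None:
        """Return leftover credits, for example shortly before they expire."""
        if self.credits > 0 and self.clock() < self.deadline:
            self.ask(-self.credits)
        self.credits = 0
```

A sender would create the pool with CreditPool(http_ask("http://127.0.0.1:8080/traffic-control")) and call acquire(len(buffer)) before each write.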

Usage

# Deploy a traffic-control service first.
# (Example: listen on port 8080 and cap total bandwidth at 100 Mbps)
# (Service implementation is user-defined; JuiceFS only calls the API)

# Multiple sync processes share the same control service.
juicefs sync SRC1 DST1 --traffic-control-url http://127.0.0.1:8080/traffic-control &
juicefs sync SRC2 DST2 --traffic-control-url http://127.0.0.1:8080/traffic-control &

--traffic-control-url can be combined with --bwlimit.

The two mechanisms are independent:

--bwlimit limits the bandwidth of a single sync process.
--traffic-control-url limits aggregate bandwidth across multiple processes.

# Per-process limit: 50 Mbps. All processes combined respect the service-side cap.
juicefs sync SRC DST \
    --bwlimit 50 \
    --traffic-control-url http://controller:8080/traffic-control
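When both limits are in force, the aggregate rate is bounded by whichever is tighter: the sum of the per-process limits or the service-side cap. A small worked example, assuming a hypothetical 100 Mbps cap configured in the traffic-control service:

```python
def aggregate_ceiling_mbps(per_process_limits, global_cap):
    """Upper bound on total throughput when both limits are enforced."""
    return min(sum(per_process_limits), global_cap)


# Three processes at --bwlimit 50 behind a hypothetical 100 Mbps service cap.
three = aggregate_ceiling_mbps([50, 50, 50], 100)
# A single process is still held to its own 50 Mbps limit.
one = aggregate_ceiling_mbps([50], 100)
print(three, one)
```

How the 100 Mbps is divided among the three processes at any moment is up to the traffic-control service.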

Summary

JuiceFS 1.4 enhancements to sync include:

Resumable sync reduces recovery costs after task interruptions.
Encryption and decryption improve the security of backups and archival data.
Global traffic control enables multiple synchronization tasks to share bandwidth in a coordinated manner.

For scenarios such as data migration, cross-cloud sync, object storage backup, and encrypted archiving, users can combine these capabilities flexibly based on task scale, network environment, and security requirements.

If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Discord.

42x Faster Writes & 85% Throughput Gain: JuiceFS for Multi-Cloud AI Agents

Wed, 01 Jul 2026 09:59:51 +0000

Xingchen Zhengtu Technology Co., Ltd. is a startup focused on AI search and multi‑modal artificial intelligence generated content (AIGC) applications for e‑commerce. Our products include:

Gensmo, which focuses on fashion and provides virtual try‑on, style recommendations, and product search.
ZooClaw, which offers AI agent services for a wider range of daily and work scenarios.

In this article, we’ll share our thinking and experience in unified storage selection, architecture design, and performance tuning. We’ve been running JuiceFS in production for more than a year. It manages over 100 million files. Our application runs across Oracle Cloud, DigitalOcean, and Google Cloud Platform (GCP). JuiceFS has become the unified storage layer supporting model training, inference, data processing, and online agents.

Unified storage requirements and design approach

Four scenarios, four I/O profiles

So far, we have four main storage scenarios that support Gensmo and ZooClaw.

image: Application-types-Xingchen-Zhengtu
caption: Application types at Xingchen Zhengtu Technology

Scenario 1: Model training
We build our own models, including Gensmo's Try‑on model and video generation models. They show try‑on results, 360‑degree model actions, or special effects to B2B and B2C customers. Model training involves large‑file sequential writes and checkpoint saving. Storage requirements: high capacity and high‑performance sequential I/O.

Scenario 2: Model inference
Inference services primarily need high‑concurrency sequential reads. Data is loaded into local cache to increase hit rates.

Scenario 3: Data processing
We crawl product, clothing, and review data from overseas independent e‑commerce sites. This data is used for model training and application operations analysis. This scenario involves many small files (a few hundred kilobytes per image). High IOPS and concurrency are challenges.
To optimize data processing, we use Ray Data to process datasets in parallel and consolidate large volumes of small files into reusable Parquet datasets ranging from tens to hundreds of gigabytes. These datasets serve as a shared data foundation that can be reused across downstream workloads such as embedding generation, retrieval, recommendation, training, and inference. This greatly reduces pressure on the file system and serves both training and inference needs.

Scenario 4: Online agents
Online agents are different from the offline scenarios above. There are still many small files, but these files are generated by online services. Each agent only reads and writes its own data. No cross‑agent distributed processing is required. The storage system must support high concurrency and fast response. It does not need cross‑agent data coordination.

In summary, these four scenarios have two types of storage requirements:

Offline training, inference, and data processing need high throughput, high concurrency, and caching.
Online agents care more about low latency, data isolation, and stability.

After clarifying these application needs, a natural question arises: do we need a multi‑cloud architecture? From the very beginning of platform construction, our answer was yes.

Cloud neutrality is not a concept; it’s bargaining power

The goal of cloud neutrality is not technical purity. It’s to meet the core needs of the infrastructure team: keeping compute resources and workloads portable, and maintaining bargaining power with different cloud providers.

For overseas business, if compute and storage are locked into a single cloud provider for a long time, the ability to flexibly adjust compute resources becomes limited as the application grows or prices change. This is especially true in AI, where GPU resource prices and availability fluctuate greatly. A resource that is cheap today may become expensive or scarce later. The compute scale needed after application growth may also exceed what the original cloud provider can offer.

Therefore, we want to decouple the storage layer from any specific cloud vendor. Data should stay cloud‑neutral. This way, training, inference, or online agent workloads can move to a cloud that better meets cost and performance requirements. There is no need to copy or reconfigure data repeatedly.

POSIX: The foundation of a unified storage experience

Another critical question in platform construction was: how do we give our engineering team a consistent operational experience across multiple clouds and multiple object storage backends?

For a single application scenario, using object storage directly is enough. But when training, inference, data processing, and online agents share the same data system, the development and operations cost of different object storage interfaces grows quickly. Therefore, we wanted a unified abstraction on top of the underlying storage. POSIX file system semantics are the best fit for this.

With JuiceFS, we map underlying object storage (whether it’s GCS, S3, or R2) into a POSIX file system and mount it as a local path. From local development to production, the engineering team always sees the same file system interface and access path. They don't need to care about which cloud or which object storage holds the data.

In short, an ideal cloud storage experience makes the multi-cloud environment invisible to engineers. They always see a local path. This was a major reason we later chose JuiceFS.

Selection: From GCS Fuse and S3 Fuse to JuiceFS

Because offline and online scenarios have very different needs, our storage selection took two different paths.

Offline: Choosing JuiceFS from the start

For offline scenarios, we faced a multi‑cloud environment and high-throughput requirements. Therefore, before building the system, our team evaluated mainstream solutions and compared them against our critical needs:

Self‑built parallel file systems: Highest performance, but high cost, hardware‑locked, limited cross‑cloud capability.
Cloud-managed parallel file systems: Convenient, but lock you into a single cloud and are still expensive.
Plain FUSE: Low cost, but insufficient POSIX semantics and performance.
Cache orchestration layers: Need an underlying storage layer and add operational complexity.

Solution	Cloud-neutral	POSIX semantics	High throughput	Distributed cache	Cost/Operations
Self-built parallel file system (for example, Lustre)	❌ Hardware-dependent	✅	✅✅	Partial	High cost, operationally intensive
Cloud‑managed parallel file system (for example, Filestore)	❌ Locked to a single cloud	✅	✅	✅	High cost, lower operational overhead
Object storage + FUSE (S3FS / GCS Fuse)	⚠️ Cloud-dependent	❌	❌	❌	Low cost, lightweight operations
Cache orchestration layer (Alluxio/Fluid)	✅	✅	✅	✅	Requires an underlying storage layer, operationally complex
JuiceFS	✅ Backend of your choice	✅ Full	✅	✅ Built-in	Object storage cost with CSI integration

Compared with the alternatives, JuiceFS was the only solution that simultaneously met our core requirements: cloud neutrality, full POSIX compatibility, built-in distributed caching, and object storage as the backend. Every other option fell short in at least one of these areas. As a result, for our offline workloads, choosing JuiceFS was a straightforward decision from the very beginning.

Agent: Migrating from GCS Fuse to JuiceFS

Our early infrastructure was primarily deployed on Google Cloud, where Google Cloud Storage (GCS) was mounted into GKE Pods using GCS Fuse. In practice, however, we found that this architecture could not meet the reliability, performance, and cloud-neutrality requirements of our online agent workloads.

The biggest problem was data loss under SIGKILL. GCS Fuse uses an asynchronous writeback mechanism. After an application's write call returns success, data may still sit in a local buffer and not be truly written to GCS. If a Pod is terminated by an OOM kill or a SIGKILL signal, data that appears to have been written successfully may be permanently lost. In an agent workload, this directly causes session data loss.

The second issue was inadequate small-file performance and incomplete POSIX semantics. An agent's working directory typically contains numerous small files with frequent append operations. GCS Fuse introduces high latency for operations such as open and stat, while providing incomplete support for POSIX features such as rename, flock, and symlink. As a result, it falls short of the stability requirements for online services.

The third issue was cloud lock-in and limited stability under high concurrency. GCS Fuse is tightly coupled to the GCP ecosystem, making it incompatible with our cloud-neutral infrastructure strategy. In addition, its stability under highly concurrent agent workloads proved insufficient.

These limitations ultimately led us to migrate our online agent workloads to JuiceFS.

JuiceFS addresses the data loss problem through its write path and independent metadata engine. It separates data from metadata: data chunks are first uploaded to object storage, and only after the corresponding metadata is atomically committed to the independent metadata engine is the write considered successful. In other words, a successful write truly means the data has been durably persisted, so a SIGKILL cannot cause the loss of already acknowledged data.
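The ordering described here, data first and metadata commit second, can be illustrated with a toy model. The class below is not JuiceFS code; the two dictionaries merely stand in for object storage and the metadata engine, and the crash flag simulates a SIGKILL between the two steps.

```python
class ToyFileSystem:
    """Toy model of 'upload data first, then commit metadata'."""

    def __init__(self):
        self.object_store = {}  # stand-in for object storage
        self.metadata = {}      # stand-in for the metadata engine

    def write(self, path: str, data: bytes, crash_before_commit: bool = False) -> bool:
        chunk_key = f"chunks/{len(self.object_store)}"
        # Step 1: upload the data chunk.
        self.object_store[chunk_key] = data
        if crash_before_commit:
            # Simulated SIGKILL: the caller never receives a success result.
            raise RuntimeError("process killed before metadata commit")
        # Step 2: commit metadata; only now is the write acknowledged.
        self.metadata[path] = chunk_key
        return True

    def read(self, path: str) -> bytes:
        return self.object_store[self.metadata[path]]


fs = ToyFileSystem()
fs.write("/agent-1/session.json", b"{}")
try:
    fs.write("/agent-1/draft.txt", b"partial", crash_before_commit=True)
except RuntimeError:
    pass

print(sorted(fs.metadata))  # only the acknowledged file is visible
print(fs.read("/agent-1/session.json"))
```

In this model an interrupted write leaves at most an unreferenced chunk behind, while every write that returned success remains readable.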

More fundamentally, GCS Fuse exposes object storage through a file system interface, whereas JuiceFS builds a true file system on top of object storage. Its independent metadata engine, together with full POSIX compatibility, cloud neutrality, built-in distributed caching, and a mature ecosystem of tools, makes JuiceFS a much better fit for online agent workloads that demand reliability, consistency, and high-concurrency access.

Today, our online agent platform runs stably in production. JuiceFS has become the unified storage foundation for multiple workloads across the company.

A new architecture: Deploying JuiceFS across multiple clouds

Offline workloads: Compute mobility across clouds with unified metadata + R2

To meet the requirements of cloud neutrality, compute mobility, and high throughput for offline workloads, we designed the following architecture:

We chose Cloudflare R2 as the underlying object storage backend. Because R2 is independent of any cloud provider and does not charge for egress traffic, it’s well suited for high-throughput training workloads spanning multiple clouds. By comparison, while services such as GCS and Amazon S3 offer low storage costs, their egress fees can be substantial, significantly increasing the cost of offline training. For example, storing 1 TB of data in GCS costs roughly $20 per month, while egress charges for transferring that same 1 TB out can range from $20 to $140.
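A back-of-the-envelope calculation shows why egress dominates once data is read across clouds more than once. The storage and egress prices below are the approximate figures quoted above, read as per-TB rates; the number of full cross-cloud reads per month is a made-up example, and the zero-egress case simply reuses the same storage price for comparison.

```python
def monthly_cost_usd(dataset_tb, storage_per_tb, egress_per_tb, egress_reads):
    """Storage cost plus the cost of reading the dataset out of the cloud."""
    storage = dataset_tb * storage_per_tb
    egress = dataset_tb * egress_per_tb * egress_reads
    return storage + egress


# About $20 per TB-month for storage, $20 to $140 per TB for egress.
# Five full cross-cloud reads per month is a hypothetical workload.
low = monthly_cost_usd(1, 20, 20, egress_reads=5)
high = monthly_cost_usd(1, 20, 140, egress_reads=5)
zero_egress = monthly_cost_usd(1, 20, 0, egress_reads=5)
print(low, high, zero_egress)
```

Under these assumptions, egress accounts for most of the bill even at the low end of the range, which is the cost a zero-egress backend removes.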

image: Offline-storage-architecture
caption: Offline storage architecture

On top of R2, we deployed JuiceFS Enterprise Edition to provide a unified file system across multiple clouds. Whether compute resources are running on Oracle Cloud or DigitalOcean, training, inference, and data processing jobs all access the same file system through a consistent namespace. This allows engineers to work without worrying about the underlying cloud infrastructure.

Our compute layer consists of H100 GPUs on Oracle Cloud and H200 GPUs on DigitalOcean, running a unified training and inference platform based on Slurm and KubeRay. Each GPU node contributes its local NVMe SSD to JuiceFS' distributed cache, forming a shared cache pool across the cluster. Datasets are fetched from R2 on first access, while subsequent reads are served almost entirely from the distributed cache, effectively masking the latency of cross-cloud data access.

The entire infrastructure is managed as code using Terraform. Networks, storage, training jobs, Ray clusters, and inference services can all be provisioned automatically. As long as a cloud provider supports Kubernetes, compute resources and workloads can be deployed seamlessly, enabling rapid cross-cloud scaling and flexible resource allocation.

Online workloads: Prioritizing low latency with in-cloud metadata

For online agent workloads, using ZooClaw as an example, our primary objective is to provide a unified storage foundation for a large number of agents while supporting centralized management, directory isolation, and usage-based billing. Unlike offline workloads, the online environment prioritizes low latency, efficient small-file writes, and high-concurrency access. Cross-cloud storage paths inevitably introduce additional I/O latency, making them unsuitable for latency-sensitive online services. Therefore, we keep the object storage, metadata service, and application Pods within the same cloud whenever possible.

Our current online deployment runs entirely on GCP. GCS serves as the object storage backend, while the metadata layer is provided by an independent three-node Raft cluster deployed within a private GCP VPC. This keeps object storage, metadata service, and application Pods within the same cloud. It reduces access latency and improves IOPS for small‑file write‑intensive scenarios.

On Kubernetes, we mount a single RWX PersistentVolumeClaim (PVC) through JuiceFS CSI. Each bot Pod accesses its own isolated directory through a dedicated subPath, while access is restricted by environment-specific tokens, providing file system-level isolation. From each agent's perspective, it sees only its own local working directory. From the platform's perspective, however, all data resides in a single unified storage system, simplifying centralized management and billing.

image: Online-storage-architecture
caption: Online storage architecture

If GCP resources or costs become unsuitable in the future, this architecture can still be migrated. Because we use Terraform and Kubernetes for orchestration, we can spin up the same compute and storage structure on another cloud and then synchronize the metadata and data to it. Online agent workloads can naturally be switched bot by bot, user by user, or tenant by tenant, so there is no need for a one-time full migration.

Looking back at both deployment models, their objectives are fundamentally different. Offline workloads emphasize cross-cloud data sharing, compute mobility, and high throughput, while online agent workloads prioritize low latency and high concurrency, without sacrificing the flexibility to migrate when needed. Rather than adopting a one-size-fits-all backend, we tailored the JuiceFS deployment to the requirements of each workload. This approach preserves a unified data management model and developer experience while allowing each scenario to use the most appropriate metadata and object storage architecture.

Optimization practice: Distributed cache, writeback, and S3 Gateway

After establishing a unified storage architecture, we continued to optimize performance and fine-tune access strategies for different workloads.

One cache, two optimization strategies

Distributed caching is one of the most important capabilities of JuiceFS, directly affecting IOPS, throughput, and access latency. Although both our offline and online deployments rely on distributed caching, their goals and optimization strategies differ significantly.

For offline workloads, the primary objective is to support high-throughput training and data processing while enabling cross-cloud data sharing and compute mobility. To achieve this, we cache as much data from Cloudflare R2 as possible on local storage. Training, inference, and data processing jobs run on H100 and H200 GPU nodes equipped with NVMe SSDs. Each node contributes roughly 50 TB of local cache, allowing a cluster of more than a dozen nodes to provide several hundred terabytes of distributed cache capacity.

The first access to a dataset requires fetching data from R2, which is slow. Once the initial read completes, however, subsequent training, data processing, and inference workloads are served almost entirely from the distributed cache, delivering I/O performance close to that of local NVMe storage. Since offline workloads write large checkpoint and model weight files—often hundreds of gigabytes or even several terabytes in size—data integrity is critical. Therefore, we typically leave writeback disabled to ensure that writes are durably persisted before being acknowledged.

For online agent workloads, the priorities are low-latency, high-concurrency access to small files while maintaining data isolation for each agent. Here, the cache is primarily used to accelerate small-file writes and access. Each agent Pod mounts the same RWX PVC and accesses its own isolated directory using a dedicated subPath. The cache expiration time is set to 3,600 seconds to maximize cache hits for frequently accessed files.
Because each agent typically accesses only its own directory, the cache strategy does not require strong cross-agent consistency. Data only needs to become eventually consistent with object storage, which matters mainly for occasional offline analysis or operational debugging.

In online scenarios, to further improve small file write and high concurrency performance, we can use writeback together with caching. Writeback trades a controlled amount of data safety risk for higher write throughput. This means that for multiple agents running on the same node, if one agent fails during writing, only that single agent's output, such as a presentation, an image, or a temporary document, is affected. That data can be regenerated. With writeback, online agents get a significant performance boost for high‑concurrency small file writes, while the overall system remains stable and data isolation is preserved.

One copy of data, multiple interfaces

In our architecture, JuiceFS S3 Gateway serves as the data distribution layer, exposing data stored in JuiceFS through a standard S3-compatible interface.

In agent workloads, whether the data consists of configuration files or generated artifacts such as presentations, images, or videos, everything is ultimately stored in the same JuiceFS file system. However, these files often need to be shared with external users through URLs, making POSIX mounts unsuitable for external access.

To address this, we expose the same data directly through JuiceFS S3 Gateway. Internal services continue to access files through the POSIX interface, while external applications retrieve exactly the same data through standard S3 or HTTP APIs, eliminating the need for duplicate copies.

To further improve security and performance, we place Cloudflare Workers and a CDN in front of the S3 Gateway. Incoming requests first pass through a worker for path validation and access control before being forwarded to the gateway. CDN edge caching and ETag validation further reduce origin fetches.

This design provides two major benefits:

Multiple layers of isolation ensure data security, including JuiceFS directory isolation, S3 Gateway access controls, and code-level validation implemented in Cloudflare workers.
CDN caching reduces cross-region access latency and significantly improves download performance for large files such as videos and images. For users around the world, this means that even when the data resides in a GCP region in the eastern United States, content can still be served efficiently from the nearest CDN edge location.
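The kind of path validation a worker might perform before forwarding a request can be sketched as follows. Cloudflare Workers are written in JavaScript or TypeScript; Python is used here only to show the logic, and the directory layout and function name are hypothetical.

```python
import posixpath


def resolve_public_path(tenant_root: str, requested: str):
    """Return the gateway path for a request, or None if it must be rejected."""
    # Collapse "." and ".." segments before checking the prefix.
    candidate = posixpath.normpath(posixpath.join(tenant_root, requested.lstrip("/")))
    root = posixpath.normpath(tenant_root)
    if candidate != root and not candidate.startswith(root + "/"):
        return None
    return candidate


allowed = resolve_public_path("/public/tenant-42", "slides/deck.pdf")
rejected = resolve_public_path("/public/tenant-42", "../tenant-7/secret.txt")
print(allowed)
print(rejected)
```

Normalizing the path before the prefix check is what stops a request containing ".." from escaping one tenant's directory into another's.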

From an architectural perspective, internal training, inference, and agent services access data through the POSIX file system, while external distribution is handled through the S3 Gateway. A single copy of the data can support multiple access methods without any additional replication.

Performance results

Offline workloads: ~4× higher sequential write throughput, 7–8 GB/s cached read performance

For our offline workloads, we conducted benchmark tests on sequential read and write performance. The results before and after optimization are summarized below.

For sequential writes, a single process achieved approximately 700 MB/s when writing model outputs or checkpoints. By leveraging parallel writes across multiple processes and multiple nodes, throughput exceeded 1 GB/s, providing sufficient performance for large-scale training workloads.
For sequential reads, after aggregating numerous small files into larger files during data processing and loading them into the distributed cache, cached sequential read throughput reached 6.7–7.8 GB/s, approaching the performance of local NVMe storage. Inference workloads can also load checkpoints directly from the local cache, eliminating the need to copy data across nodes.

Benchmark (JuiceFS on R2, offline)	Baseline	Optimized (distributed cache + tuning)
Sequential writes (large blocks)	~231 MB/s	~714 MB/s
Sequential writes (20–50 GB datasets)	~256–265 MB/s	840 MB/s–1.1 GB/s
Sequential reads (cache hit)	—	6.7–7.8 GB/s
Sequential reads (cold read from R2)	—	~427 MB/s

Beyond performance improvements, distributed caching also streamlined our engineering workflows. Training, inference, and data processing all share the same file path, eliminating the need to copy checkpoints between nodes or services. Newly generated model weights can be loaded directly by inference services, reducing data movement while shortening the path from model training to deployment.

Online workloads: ~42× faster small-file writes and ~85% higher large-file throughput

Our initial deployment placed the metadata service on OCI while using Cloudflare R2 as the object storage backend. Since the online services ran on GCP, every metadata request had to traverse the public Internet, resulting in a metadata round-trip time (RTT) of approximately 12.7 ms. Small-file throughput was limited to around 24 files/s, and occasional 30-second PUT timeouts from R2 could even affect bot stability.

We introduced two major optimizations.

We enabled writeback and tuned the cache TTL, improving large-file write throughput by approximately 85%.
We migrated both the metadata service and object storage into GCP. The metadata layer now runs as a three-node Raft cluster within a private VPC, while the object storage backend was migrated to GCS and paired with local NVMe caching.

After these optimizations, metadata RTT dropped to approximately 5.8 ms, while small-file throughput increased to around 1,000 files/s, representing an overall performance improvement of roughly 42×.
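
As a quick sanity check, the headline ratio can be recomputed from the figures reported above. The snippet uses only numbers from this section; the variable names are ours.

```python
# Figures as reported in this section.
small_file_before = 24     # files/s, metadata on OCI, objects on R2
small_file_after = 1000    # files/s, metadata and objects inside GCP
rtt_before_ms = 12.7
rtt_after_ms = 5.8

small_file_speedup = small_file_after / small_file_before
rtt_reduction = rtt_before_ms / rtt_after_ms

print(f"small-file speedup: {small_file_speedup:.1f}x")      # 41.7x
print(f"metadata RTT reduction: {rtt_reduction:.1f}x")       # 2.2x
```

The RTT improvement alone (about 2.2×) is much smaller than the small-file gain (about 42×), which suggests that most of the gain comes from the other changes, presumably writeback and local NVMe caching.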

Conclusion

After more than a year of production use, JuiceFS has become the core storage layer of Xingchen Zhengtu Technology's infrastructure. Today, it reliably supports more than 100 million files across three cloud providers and multiple application scenarios. More importantly, it has unified the storage architecture for training, inference, data processing, and online agent workloads.

For a startup serving a global market, infrastructure must be both flexible and easy to operate so that engineering efforts can remain focused on product innovation. Our unified storage platform provides a consistent interface for applications and developers, while allowing the underlying infrastructure to be optimized for different workloads. Offline workloads prioritize compute mobility and cost optimization, whereas online workloads focus on low latency and high concurrency while retaining the flexibility to migrate when necessary.

This design keeps a consistent upper‑layer experience, while making compute cost negotiable and resources portable. It sets a solid foundation for future expansion to more clouds and regions.

If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Discord.

Monitoring JuiceFS with Better Stack

Thu, 25 Jun 2026 06:57:00 +0000

After deployment, JuiceFS feels like a local drive, but underneath it's a sophisticated distributed system. This perfectly reflects one of its core design principles: distributed systems are complex, but from a user's perspective, they should be simple to use.

Even so, that simplicity on the surface doesn't negate the need for deep visibility. For any critical storage system, gaining real-time visibility into its operations is crucial to prevent subtle performance degradations from escalating into significant incidents.

Fortunately, JuiceFS exposes a suite of monitoring metrics, including throughput, IOPS, latency, data size, and many more, in the widely adopted Prometheus format, making it ready for modern monitoring stacks. Traditionally, you would probably pair Prometheus with Grafana to collect these metrics and visualize them. This is indeed a powerful combination. However, deploying, managing, and maintaining these systems yourself adds operational overhead again. Ironically, you may want to monitor them too, and trust me, you would rather not create yet another monitoring stack just to monitor your Prometheus and Grafana combo.

That's where Better Stack comes in. It is a fully managed SaaS observability platform that combines user-friendly dashboards, tracing, logging, error tracking, incident management, automatic alerting, and even AI-powered SRE, all for a predictable, cost-effective price. With Better Stack, you get the power of the best-in-class tools out of the box without the operational overhead.

image: Better Stack
caption: Better Stack

In this post, we'll guide you through setting up a comprehensive monitoring system for JuiceFS using Better Stack, from metric ingestion to intelligent alerting, so you can ensure your file system remains healthy and performant.

Preparing the JuiceFS file system

Before diving into setting up Better Stack for monitoring, you'll need an existing JuiceFS file system that is actively publishing metrics. JuiceFS Community Edition and JuiceFS Enterprise Edition (our cloud service is based on JuiceFS Enterprise Edition) both expose real-time status metrics in Prometheus format, but they do it in slightly different ways.

For the JuiceFS Community Edition, after mounting the file system, JuiceFS automatically exposes metrics via http://localhost:9567/metrics by default on the mounting host where the JuiceFS client is running. You can customize this port using the --metrics option if needed.

On the other hand, for JuiceFS Enterprise Edition & Cloud Service, metrics are exposed through the console via dedicated API endpoints. You'll need to replace VOLUME_NAME with your file system name and API_TOKEN with your API token. In this case, both Prometheus and JSON formats are available for metrics:

Prometheus: https://juicefs.com/api/vol/VOLUME_NAME/metrics?token=API_TOKEN
JSON: https://juicefs.com/api/volume/VOLUME_NAME/status?token=API_TOKEN

A quick but important note: metrics are only generated when the file system is mounted. So before proceeding, ensure your JuiceFS file system is properly mounted and accessible. In this guide, we will use the JuiceFS Cloud Service, as it's the simplest to get started. If you haven't set up JuiceFS yet, please refer to the documentation for detailed instructions. Once you have created the first file system, URLs for the metrics mentioned above would be available under its Monitor tab.
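
Before wiring the endpoint into a monitoring platform, it can help to confirm that it returns data. The following is a minimal sketch that fetches a Prometheus-format endpoint and parses it into a dictionary, using only the Python standard library. The default URL is the Community Edition address mentioned above; `parse_metrics` and `fetch_metrics` are illustrative helpers, not part of JuiceFS.

```python
import re
import urllib.request

# Community Edition default from the text; adjust if --metrics was changed.
DEFAULT_URL = "http://localhost:9567/metrics"

# name, optional {labels}, value, optional timestamp
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$'
)


def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition format into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match:
            name, labels, value = match.group(1), match.group(2) or "", match.group(3)
            samples[name + labels] = float(value)
    return samples


def fetch_metrics(url: str = DEFAULT_URL, timeout: float = 5.0) -> dict:
    """Download and parse a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `fetch_metrics()` on a host with a mounted Community Edition file system should return a non-empty dictionary; for the cloud service, pass the Prometheus URL shown in the Monitor tab instead.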

image: JuiceFS file system metrics APIs
caption: JuiceFS file system metrics APIs in the Monitor tab

Setting up a metrics source in Better Stack

With your JuiceFS file system up and running (don't forget to mount the file system to a host machine) and publishing metrics, the next step is to configure Better Stack to start ingesting that data.

First, if you haven't already, register for a Better Stack account. The process is seamless. Using a work email is recommended, and the platform provides clear guidance to help you set up your account and organization.

image: Logging into Better Stack
caption: Logging into Better Stack

Once you're logged in, follow these steps:

In the left-hand navigation panel, head to Telemetry.
Under the Sources section, click Connect source.
Give your telemetry data source a descriptive name, such as "jfs-better-stack" or "juicefs-production", to easily identify it later.

image: Configuring telemetry data source
caption: Configuring telemetry data source

Now, you'll configure how Better Stack should collect your metrics. In the collector settings:

Under Metrics, choose the Prometheus scrape option and click Connect source.
In the URLs to scrape section, input the JuiceFS metrics endpoint as described above.

image: Configuring URLs to scrape
caption: Configuring URLs to scrape

Note that if you are not using the JuiceFS Cloud Service and your JuiceFS endpoint is behind a firewall, you'll need to allow traffic from Better Stack's scrape servers. The list of IP addresses to add to the allowlist is available in their documentation.

After saving the configuration, Better Stack will begin scraping the endpoint. Your JuiceFS metrics should be received within a few seconds.

Creating a dashboard with AI SRE

With your JuiceFS metrics flowing into Better Stack, it's time to visualize them. You could build a dashboard manually, but Better Stack provides a smarter and more efficient way to do it by using AI SRE.

What is AI SRE?

AI SRE (Site Reliability Engineering) is Better Stack's chat-based site reliability assistant. It's an autonomous AI agent that can read your telemetry data, analyze incidents, build dashboards, and even write code to fix errors. Instead of waiting for humans to manually set up charts and queries, AI SRE can generate comprehensive dashboards for you based on a prompt.

Note that AI SRE is a paid feature. If you're on the free plan, you can still create dashboards manually using the drag-and-drop chart builder.

Creating a JuiceFS monitoring dashboard with a single prompt

Once your metrics source is ready, follow these steps:

From the left panel, head to Telemetry and then Metrics.
Click Create dashboard and select the Create with AI option.
In the prompt field, give AI SRE a clear description of what you need. For example: "Create me a dashboard to track ALL JuiceFS metrics, such as latency, data size, etc."
Also make sure to select the metrics Source you created earlier (for example, "jfs-better-stack") so that AI SRE has the proper context and data to work with.

image: Creating a dashboard with AI SRE
caption: Creating a dashboard with AI SRE

Give the platform a few minutes to create the dashboard. AI SRE will analyze your JuiceFS metrics and automatically generate a complete set of charts and panels for important performance indicators such as throughput, IOPS, latency, and storage utilization. On my first try, it worked like a charm, as shown below.

image: AI SRE
caption: Charts and panels generated by AI SRE

AI SRE is a powerful feature that does so much more than create dashboards. It can analyze incidents, perform root cause analysis, suggest fixes, and even open pull requests. We've only scratched the surface in this post. This is your first step toward a smarter, AI-assisted observability workflow. After building your dashboard, you can further customize it by adding panels, editing queries, or setting alerts directly from the graphs.

Conclusion

In this post, we have walked through how to build a complete observability system for JuiceFS with Better Stack. We started by setting up the JuiceFS file system and getting its Prometheus-formatted metrics, then created a metrics source in Better Stack to ingest the data. Finally, we used AI SRE to generate a full dashboard from a single prompt.

We hope this guide helps you gain better visibility into your JuiceFS deployment. If you have any questions or run into issues, we'd love to hear from you. Join the JuiceFS community on GitHub or Discord. And don't forget to check out Better Stack's documentation and their amazing YouTube channel for practical insights about distributed file storage, observability, AI, and more.

JuiceFS 1.4: Faster Metadata Operations with Batch Unlink, Batch Clone, and Redis Client-Side Caching

Thu, 18 Jun 2026 06:10:00 +0000

In large-scale file access scenarios such as AI training and dataset management, metadata often becomes the first performance bottleneck as file counts and concurrency grow. Whether you're deleting millions of small files, cloning large datasets, or traversing directories under heavy concurrency, metadata performance directly impacts application efficiency.

JuiceFS Community Edition 1.4 introduces three major metadata optimizations:

Batch unlink for large-scale file deletion
Batch clone for metadata cloning
Redis client-side caching for hot metadata reads

These improvements reduce transaction commits, network round trips, and redundant metadata lookups. In tests on a flat directory containing 100,000 files, batch unlink improved performance by up to 93×, while batch clone achieved up to 24× speedup.

In this article, we’ll explain the motivation, design, and performance benefits behind these optimizations.

Deletion: From one‑by‑one to batched transactions

Under JuiceFS' metadata-data separation architecture, deleting a file involves much more than removing a directory entry. The system must also:

Update inode reference counts
Reclaim inode and space resources
Process trash entries
Update quota statistics

These operations must typically be completed within the same transaction.

When a directory contains hundreds of thousands or even millions of files, the traditional file-by-file deletion approach used by rm -rf quickly becomes a bottleneck. Each unlink request goes through the FUSE protocol, switches between kernel and user space, and triggers a separate metadata transaction.

As the number of files grows, the overhead from system calls, context switches, network round trips, and transaction commits accumulates rapidly.

To mitigate this issue, JuiceFS previously introduced the juicefs rmr command. Unlike rm -rf, rmr bypasses the FUSE layer and sends deletion requests directly to the client. It also supports multi-threaded deletion (50 threads by default), significantly improving throughput.

However, each file deletion still requires its own metadata transaction. Deleting 100,000 files still means executing 100,000 transactions.

Batch unlink takes optimization one step further by merging many independent deletion operations within the same directory into a single batch transaction, further reducing transaction and network overhead.

Core design

The key is to turn many small transactions into fewer large ones. JuiceFS adds a batch unlink interface at the metadata engine layer. It allows the client to delete multiple non‑directory files under the same directory in one call.

When recursively clearing a directory, JuiceFS reduces deletion overhead in two ways:

Different subdirectories are handled concurrently with multi‑threaded deletion.
Inside each directory, normal files and symlinks are grouped into batches and sent to BatchUnlink.

This merges many unlink operations into fewer batch transactions at the metadata level.

It's important to note that BatchUnlink does not directly delete directories. Directory removal still follows the standard recursive workflow: empty the subdirectory first, and then delete the subdirectory itself. Therefore, BatchUnlink only applies to regular files and symbolic links within the same directory.

This restriction preserves correct recursive deletion semantics while avoiding consistency risks to the directory tree structure.
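
To make the effect concrete, here is a minimal sketch that counts the metadata transactions needed to empty a directory tree, with and without batching. It models the tree as nested dictionaries and is not JuiceFS code; `count_transactions` and the batch size passed to it are assumptions for illustration, and concurrency across subdirectories is left out.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def count_transactions(tree: dict, batch_size: int) -> tuple:
    """Return (per_file_txns, batched_txns) needed to empty `tree`.

    `tree` maps names to a nested dict (subdirectory) or None (file/symlink).
    """
    files = [name for name, child in tree.items() if child is None]
    subdirs = [child for child in tree.values() if child is not None]

    per_file = len(files)                                  # one txn per unlink
    batched = sum(1 for _ in batches(files, batch_size))   # one txn per batch

    for sub in subdirs:
        p, b = count_transactions(sub, batch_size)
        # +1: an emptied subdirectory is still removed by its own operation,
        # because batching only covers regular files and symlinks.
        per_file += p + 1
        batched += b + 1
    return per_file, batched
```

For a flat directory of 100,000 files and a batch size of 250, this gives 100,000 transactions for the file-by-file approach versus 400 with batching.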

image: JuiceFS delete operations
caption: Optimization path of JuiceFS delete operations

Implementation across metadata engines

JuiceFS uses different batching strategies depending on the metadata backend to minimize transaction commits and network round trips.

SQL backends (MySQL, PostgreSQL, etc.): Previously, each file deletion required its own sequence of INSERT, DELETE, and UPDATE statements. With BatchUnlink, the system:

Fetches all edge records for the target entries in a single batch query.
Retrieves the relevant inode attributes in a single locked batch query.
Executes edge deletions, inode state updates (decrementing nlink or marking for cleanup), and delfile entry insertions — all within one transaction.

Instead of executing one transaction per file, the entire batch can now be completed in a single transaction.
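
The three steps above can be sketched with an in-memory SQLite database. This is a simplified illustration under stated assumptions: the `edge`, `node`, and `delfile` tables are a hypothetical, cut-down schema rather than the real JuiceFS tables, and SQLite has no row-level locks, so the locked read is not modeled.

```python
import sqlite3


def make_schema(conn):
    """Create a simplified, hypothetical schema (not the real JuiceFS tables)."""
    conn.executescript("""
        CREATE TABLE edge (parent INTEGER, name TEXT, inode INTEGER,
                           PRIMARY KEY (parent, name));
        CREATE TABLE node (inode INTEGER PRIMARY KEY, nlink INTEGER);
        CREATE TABLE delfile (inode INTEGER PRIMARY KEY);
    """)


def batch_unlink(conn, parent, names):
    """Unlink several non-directory entries of `parent` in one transaction."""
    if not names:
        return 0
    marks = ",".join("?" * len(names))
    with conn:  # one commit for the whole batch; rolls back on error
        # 1. Fetch all edge records for the batch in a single query.
        rows = conn.execute(
            f"SELECT name, inode FROM edge WHERE parent = ? AND name IN ({marks})",
            (parent, *names),
        ).fetchall()
        if not rows:
            return 0
        inodes = [inode for _, inode in rows]
        id_marks = ",".join("?" * len(inodes))
        # 2. Remove the directory entries and decrement link counts.
        conn.executemany("DELETE FROM edge WHERE parent = ? AND name = ?",
                         [(parent, name) for name, _ in rows])
        conn.executemany("UPDATE node SET nlink = nlink - 1 WHERE inode = ?",
                         [(i,) for i in inodes])
        # 3. Queue inodes with no remaining links for later data cleanup.
        conn.execute(
            f"INSERT INTO delfile (inode) SELECT inode FROM node "
            f"WHERE nlink = 0 AND inode IN ({id_marks})", inodes)
        conn.execute(
            f"DELETE FROM node WHERE nlink = 0 AND inode IN ({id_marks})", inodes)
    return len(rows)
```

The point of the sketch is the single `with conn:` block: however many names are passed in, the database sees one commit instead of one per file.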

Redis backend: The optimization uses Redis pipelines and transactions. Where individual deletions previously required separate command round trips, BatchUnlink collects all HDEL (dentry removal), ZADD (enqueue for cleanup), SET (inode attribute update), and INCRBY (counter update) commands for multiple files into a single pipeline, executed atomically within one MULTI/EXEC transaction. To avoid blocking Redis' single-threaded event loop for too long, batch size is capped at 250 entries.
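
A hedged sketch of the Redis pattern follows. It assumes a client that behaves like `redis.Redis` from the redis-py library, where `pipeline(transaction=True)` wraps the queued commands in MULTI/EXEC. The key names (`d<parent>`, `i<inode>`, `delfiles`, `usedSpace`) and the tuple layout of `entries` are illustrative assumptions, not the exact JuiceFS key layout.

```python
from itertools import islice

BATCH_CAP = 250  # per-batch cap mentioned in the text


def batch_unlink_redis(client, parent, entries, now):
    """Unlink entries in capped batches; returns the number of round trips.

    `entries` is a list of (name, inode, new_attr_bytes, size) tuples.
    `client` is expected to offer the redis-py pipeline interface.
    """
    it = iter(entries)
    round_trips = 0
    while chunk := list(islice(it, BATCH_CAP)):
        # transaction=True: commands are sent together inside MULTI/EXEC.
        pipe = client.pipeline(transaction=True)
        for name, inode, new_attr, size in chunk:
            pipe.hdel(f"d{parent}", name)              # remove the dentry
            pipe.zadd("delfiles", {str(inode): now})   # enqueue for cleanup
            pipe.set(f"i{inode}", new_attr)            # update inode attribute
            pipe.incrby("usedSpace", -size)            # adjust the counter
        pipe.execute()                                 # one round trip per batch
        round_trips += 1
    return round_trips
```

With 600 entries, this issues three pipelines (250, 250, and 100 entries) instead of several commands per file, while the cap keeps each MULTI/EXEC block short.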

TiKV backend: BatchUnlink consolidates multiple deletions into a single transaction, using TiKV's batch write capability to reduce network round trips and transaction overhead. For distributed key-value backends, this kind of batching allows the backend's concurrent write capacity to be more fully utilized.

The figure below shows benchmark results on a flat directory of 100,000 files using juicefs rmr --threads 16. BatchUnlink delivers meaningful improvements across all metadata backends, with TiKV and Redis showing the largest gains.

image: deletion
caption: Throughput improvement for batch deletion across metadata backends

Clone: From one‑by‑one copy to batched references

juicefs clone creates fast copies of files or directories for training dataset version management, experiment snapshots, and large-scale directory duplication. Its efficiency comes from the fact that cloning doesn't immediately copy the underlying data blocks. Instead, it creates new file records at the metadata layer and reuses the source file's existing block references. New data blocks are only allocated when the clone is actually written to. This avoids the time and storage overhead of a full copy.

For large directory clones, the same problem as deletion arises: processing files one by one generates a large number of short transactions and network round trips. The core idea behind batch clone is to merge the clone operations for multiple files in the same directory into a single batch transaction. When recursively cloning a directory, the system reads directory entries in batches as a stream. For each batch, all non-directory entries are collected and cloned together in one operation.

One key implementation detail is inode pre-allocation: before entering the transaction, the system uses nextInode to pre-allocate target inodes for all entries to be cloned. This avoids lock contention from repeatedly requesting inodes inside the transaction. Once inside the transaction, the system batch-queries all source file attributes (with row locks), builds all the insertion data for target nodes, edges, chunks, symlinks, and xattrs, and then inserts everything in a single batch.
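
The following toy model illustrates the two ideas in this section: inodes are pre-allocated before the "transaction", and cloned files reuse the source's block references instead of copying data. `MetaStore` is an in-memory stand-in invented for this sketch; its `next_inodes` method only mirrors the role of `nextInode` described above and says nothing about the real implementation.

```python
from itertools import count


class MetaStore:
    """Toy in-memory metadata store; not the JuiceFS schema."""

    def __init__(self):
        self._next = count(100)
        self.nodes = {}        # inode -> attribute dict
        self.edges = {}        # (parent inode, name) -> inode
        self.chunks = {}       # inode -> list of slice ids
        self.slice_refs = {}   # slice id -> reference count
        self.transactions = 0

    def next_inodes(self, n):
        """Pre-allocate n inode numbers outside any transaction."""
        return [next(self._next) for _ in range(n)]

    def batch_clone(self, src_parent, dst_parent, names):
        new_inodes = self.next_inodes(len(names))  # done before the transaction
        # Read all source records and build every target record first...
        staged = []
        for name, dst in zip(names, new_inodes):
            src = self.edges[(src_parent, name)]
            staged.append((name, dst, dict(self.nodes[src]),
                           list(self.chunks.get(src, []))))
        # ...then insert everything as a single batch.
        for name, dst, attr, slices in staged:
            self.nodes[dst] = attr
            self.edges[(dst_parent, name)] = dst
            self.chunks[dst] = slices  # same slice ids, no data copied
            for s in slices:
                self.slice_refs[s] = self.slice_refs.get(s, 0) + 1
        self.transactions += 1
        return new_inodes
```

After a clone, the source and target share the same slice IDs with a reference count of two; new blocks would only be needed once either side is written.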

Batch clone uses each backend's native batch write capabilities in a similar way to batch unlink. The per-backend implementation details won't be repeated here.

The performance gains vary across backends depending on:

Transaction models
Network communication overhead
Batch insertion efficiency for metadata records such as nodes, edges, and chunk references

Results on a flat directory of 100,000 files are shown below. MySQL sees the largest improvement at approximately 24×, followed by Redis at approximately 5× and TiKV at approximately 2×.

image: clone
caption: Throughput improvement for batch clone across metadata backends

Redis client-side caching: Keeping hot metadata local

In high-concurrency metadata workloads such as AI training dataset access and large-scale container startup, network round trips between JuiceFS clients and Redis often become a major performance bottleneck.

Consider the following operation:

open("/mnt/jfs/dataset/images/cat.jpg")

Before the file can be opened, the Linux Virtual File System (VFS) must resolve every component in the path:

Look up dataset.
Look up images.
Look up cat.jpg.

image: VFS
caption: VFS path resolution

If the images directory contains hundreds of thousands of files and training jobs perform random access across the dataset, each lookup requires a GET request to Redis.

Under heavy concurrency, this results in large numbers of network round trips and increased Redis CPU utilization. Even though a single Redis query takes only a few dozen microseconds, network latency pushes each lookup to hundreds of microseconds or even milliseconds. When thousands of training processes are accessing files simultaneously, this overhead becomes significant.
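
The per-component cost can be illustrated with a toy resolver that counts one round trip per lookup. `CountingMeta` is a hypothetical stand-in for the metadata backend, and inode numbers are arbitrary; the path is taken relative to the mount point.

```python
class CountingMeta:
    """Toy metadata backend that counts one round trip per lookup."""

    def __init__(self, entries):
        self.entries = entries   # (parent inode, name) -> child inode
        self.round_trips = 0

    def lookup(self, parent, name):
        self.round_trips += 1    # every uncached lookup goes over the network
        return self.entries[(parent, name)]


def resolve(meta, path, root=1):
    """Walk `path` one component at a time, starting from the root inode."""
    inode = root
    for name in path.strip("/").split("/"):
        inode = meta.lookup(inode, name)
    return inode
```

Resolving `dataset/images/cat.jpg` this way costs three lookups; at a few hundred microseconds each, that overhead is repeated for every file a training process opens unless some of the results are cached.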

How it works: Redis 6.0 client-side caching

Redis 6.0 introduced client-side caching, which allows clients to cache hot keys locally and receive invalidation notifications whenever those keys are modified.

Based on this capability, JuiceFS caches two categories of metadata in client memory:

Inode attribute cache. Keyed by inode number, this stores the complete attribute data for a file, such as type, size, permissions, and timestamps. The caching is implemented transparently through hook mechanisms in the Redis driver layer. On query, it first checks the local cache; on hit, it returns immediately without any network request. On modification, it automatically invalidates the corresponding cache. Application logic requires no awareness of the cache.
Directory entry cache. Keyed by "parent inode + path separator + filename," this caches the results of directory lookups. Unlike the inode attribute cache, the lookup logic for entry cache is embedded directly in the directory lookup path rather than being intercepted transparently at the driver layer. When entries for a directory are invalidated, all related cache entries under that directory are cleared using prefix matching. This allows path resolution and repeated access to hot entries in the same directory to be served from local memory.

Introducing client-side caching creates a consistency challenge in multi-mount scenarios. When multiple clients share the same JuiceFS file system, an operation on one client (creating, deleting, renaming, or updating the attributes of a file or directory) can leave the inode attributes or directory entries cached on other clients stale. Without an effective invalidation mechanism, subsequent reads could hit stale metadata, causing the directory entries or file attributes seen by one client to diverge from the actual state in the backend.

To address this, JuiceFS introduces a Tracking and Broadcast Invalidation (BCAST) model on top of Redis' client-side caching mechanism. After connecting to Redis, each client declares the metadata key prefixes it wants to track. When those keys are modified, Redis sends invalidation notifications to the relevant clients. On receiving a notification, the client clears the corresponding inode attribute cache or entry cache entries, so that subsequent accesses fetch fresh data from the metadata engine.

In addition, at client initialization, JuiceFS warms up metadata for the root directory of the mount point. Since these files are typically the most frequently accessed, benchmarks show this warm-up significantly improves overall access performance.

Through this mechanism, hot metadata can be reused locally. When the metadata changes, the related caches are evicted in time, reducing the risk of stale metadata.
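
As a minimal sketch of the client side, the class below models a directory-entry cache with prefix-based invalidation and a time-based expiry as a fallback. It is an assumption-laden illustration, not the JuiceFS implementation: `EntryCache`, its key format, and the injectable `clock` are invented here, and the Redis side (key tracking and delivery of invalidation messages) is left out entirely.

```python
import time


class EntryCache:
    """Toy directory-entry cache keyed by "<parent inode>/<name>"."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._items = {}  # key -> (child inode, expiry time)

    @staticmethod
    def key(parent, name):
        return f"{parent}/{name}"

    def get(self, parent, name):
        hit = self._items.get(self.key(parent, name))
        if hit is None or hit[1] <= self.clock():
            return None  # miss or expired: caller falls back to the backend
        return hit[0]

    def put(self, parent, name, inode):
        self._items[self.key(parent, name)] = (inode, self.clock() + self.ttl)

    def invalidate_dir(self, parent):
        """Drop every cached entry under `parent` using prefix matching."""
        prefix = f"{parent}/"
        for k in [k for k in self._items if k.startswith(prefix)]:
            del self._items[k]
```

In this model, an invalidation notification for a directory would translate into a call to `invalidate_dir`, while the expiry bounds how long a stale entry can survive if a notification is lost.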

When to use it

Redis client‑side caching works best in read‑heavy, write‑light scenarios with repeated access to hot metadata. AI training dataset loading is a good example: the dataset is usually read‑only during training, and tasks repeatedly access the same directories and files, so inode attribute cache and entry cache hit often, reducing redundant lookups and remote metadata queries.

The benefit is even more obvious when there is higher network latency between the client and the Redis metadata engine, such as in cross-availability-zone deployments.

Redis 6.0 or later is required to use this feature. The default cache expiration time is 1 minute, which provides a safety net in case of network interruptions or connection anomalies where invalidation notifications may not arrive, preventing stale entries from persisting indefinitely. For workloads with stricter consistency requirements, the expiration time can be shortened or client-side caching can be disabled entirely to reduce the risk of reading stale metadata.

Summary

These three optimizations each target a different path through the metadata layer:

Batch unlink merges multiple independent unlink operations within the same directory into a single batch transaction.
Batch clone merges multiple independent clone operations within the same directory into a single batch transaction.
Redis client-side caching keeps hot metadata in client memory, bringing read latency from network-level down to memory-level, with broadcast invalidation to maintain consistency across multiple clients.

BatchUnlink and BatchClone are internal interfaces. Users do not call them directly. Just use the right commands: juicefs rmr for deleting large directories, juicefs clone for copying directories. The optimization will be applied automatically.

One thing worth noting: both batch operations work by merging regular files within the same directory into a single batch transaction. Subdirectories are handled recursively by concurrent goroutines. The larger the directory, the greater the benefit.

All optimizations above are available in JuiceFS Community Edition 1.4. Upgrade the client to get the performance gains.

If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Discord.

How Gongjiyun Keeps Model Distribution Fast Enough for Cross-Cloud Elastic Inference

Wed, 10 Jun 2026 09:48:58 +0000

Founded in 2023 at Tsinghua University, Gongjiyun provides compute platforms and Model as a Service (MaaS) for artificial intelligence generated content (AIGC) enterprises and research institutions. We aim to alleviate the mismatch between elastic compute demand and supply. By aggregating idle IDC resources and edge resources, the platform offers containerized services, delivering rapidly schedulable compute for volatile workloads such as AI inference, video rendering, data processing, and data synthesis.

In cross-cloud elastic inference scenarios, compute tasks can be scheduled to different regions, cloud environments, and clusters, but model files and application data are large and cannot be migrated as quickly as compute resources. Especially in online inference, the model repository is read‑heavy and frequently accessed, so storage access performance directly affects service startup, elastic scaling, and request latency.

To address this, we built an object storage acceleration solution on top of JuiceFS, integrating users’ existing object storage into elastic inference clusters. Through a unified namespace, metadata import, FUSE mount, distributed cache, and data warm-up, it improves access efficiency for model repositories across clouds and clusters. In a case study with a leading text‑to‑image model community, the solution supports a tens‑of‑TB model repository, dynamic loading of checkpoints and low-rank adaptations (LoRAs), and elastic scaling of hundreds of GPUs at peak, while keeping additional latency within the customer’s acceptance range.

In this post, we'll walk through why storage — not compute — is the real bottleneck in cross-cloud elastic inference, how we evaluated and chose JuiceFS, and the step-by-step optimizations that brought latency from +10s down to under 2s in production.

Elastic demand is widespread, but supply is hard to match

As AI applications grow rapidly, compute demand continues to increase, but resource usage patterns differ across scenarios. Compared to training, which has stable resource needs, AI inference, data processing, and data synthesis are often more volatile: office applications may see higher traffic during the day, entertainment apps during evenings or weekends, and project‑based data processing may consume large amounts of compute in short bursts then idle. For small teams or exploratory applications, elastic compute also helps them better evaluate the relationship between per‑request cost and application value.

On the supply side, compute infrastructure is capital‑intensive. Resource providers are not incapable of offering elastic services, but they prefer long‑term dedicated leases to recover costs and reduce risk. As a result, low price, stability, and elasticity are difficult to achieve together:

Dedicated leases are low‑cost and stable but lack elasticity.
Spot resources are cheap and elastic but uncertain.
On‑demand resources are elastic and stable but expensive.

In China, this contradiction is further reflected by a market dominated by dedicated leases, with elastic supply accounting for a small share.

We aim to resolve this mismatch between elastic demand and supply. By aggregating idle IDC and edge resources, the platform offers containerized services, providing rapidly schedulable compute for AI inference, video rendering, data processing, and data synthesis. At lower resource costs, we help users quickly spin up tasks during peaks, schedule them across clusters, and handle elastic demand, while enabling resource providers to improve utilization and monetize idle capacity beyond dedicated leases.

Compute can be scheduled: How does storage keep up?

As elastic compute platforms have matured, scheduling compute resources has become relatively easy. Container images can be synchronized across clusters via registries and distribution networks, tasks can be launched in different resource pools by schedulers, and traffic can be distributed via unified ingress and traffic management.

But model and data files are typically large, making cross‑cloud, cross‑cluster migration costly and slow, unable to match the sub‑second startup and release of compute. Therefore, in cross‑cloud elastic inference architectures, the real limitation on system elasticity is often not compute scheduling, but the efficiency of data and model distribution.

Different application scenarios have different storage requirements:

Model training, development, and debugging: These involve complex read‑write needs, including code repositories, model files, experiment results, and intermediate state. They also require high environment stability; users cannot tolerate state loss from frequent host switching. Thus, the platform typically provides long‑term stable compute resources and runtime environments, and storage needs can be met by existing stable storage systems.
Data processing: This can be split further. If a single processing job has high application value and can cover cross‑cloud network transfer costs, you can build a pipeline that continuously pulls data from S3 or other object storage, processes it in the compute cluster, and streams the results back. In this case, the system does not need large local storage. If the data scale is larger or the per‑job value is low, local storage acts as a one‑time cache: data flows through and does not need to be persisted.

The more challenging case is online inference. Online inference services cannot tolerate downtime, yet the resources used by an elastic computing platform may come from idle resource pools and could be preempted. Once resources in a data center or cluster become unavailable, the platform must migrate tasks to other providers or clusters in time. This means that migrating the computing tasks is not enough: model files and the related storage access capabilities must be migrated at the same time.

Online inference has higher requirements for service continuity and cross-cluster migration capabilities, but its storage access pattern is also clearer. Compared to training, development, and debugging scenarios, inference workloads are typically read-heavy. The core needs focus on efficient model loading, reading model weights, and accessing the model repository. For large models and online applications, model loading speed directly affects service startup time, elastic scaling efficiency, and request response stability. Therefore, inference scenarios are not suitable for simply adopting traditional read-write hybrid storage architectures. Instead, they are better suited for specialized optimizations around model distribution, read-only access, and cache acceleration.

In addition, an elastic computing platform usually does not host a user's complete application system. The user's primary cloud account, application database, model management system, and even some fixed computing resources often already exist in other clouds or on premises. For the platform to integrate with the user's application, it must be compatible with the user's existing model repository and model management processes. It cannot require the user to fully migrate the entire system.

Therefore, to support cross-cloud elastic inference, we need more than just compute scheduling capabilities. We need a cross-cloud high-performance storage and model distribution solution tailored for model inference scenarios. This solution must support hosting a large model repository with high-performance reads, adapt to the user's existing model management system, and provide stable data access when resources are migrated across clouds and clusters.

Why JuiceFS: Unified cross-cloud access, strongly consistent metadata, and high-performance cache

Facing cross-cloud elastic inference scenarios, the storage system needs to meet several conditions at the same time:

It must provide a unified access point across different clouds and clusters. It must support shared read-write access and unified metadata management.
It must be compatible with the user's existing object storage and model repository to avoid data migration.
It needs low operational complexity and good read performance.

When evaluating storage options, we considered Ceph:

Ceph is mature. It’s suitable for building unified storage within a single data center or a stable resource domain.
However, in cross-cloud elastic inference scenarios, Ceph requires high network stability and operational skills. The overall integration cost is higher. So we did not choose it.

We also evaluated Alluxio. However, in a multi-cloud environment, multiple clusters need to access the same underlying object storage data concurrently. The workload is not purely read-only; there are also occasional writes. This scenario requires strong data consistency. Therefore, Alluxio was not chosen for production.

We finally chose JuiceFS mainly because:

It uses object storage as the underlying data store.
It provides a unified namespace and consistent file system view through an independent metadata service. This allows multiple clusters to access the same model data as a file system.
This architecture is suitable for cross-cloud and cross-cluster model distribution and shared reading.
It’s also compatible with the user's existing object storage and model repository, reducing data migration and application integration costs.

The decision to further adopt JuiceFS Enterprise Edition was mainly due to its distributed caching capabilities and managed metadata service. In this scenario, the value of JuiceFS is not just providing a file system interface. It combines object storage, unified namespace, metadata management, and cache acceleration into a storage access layer that is better suited for cross-cloud elastic inference.

image: JuiceFS architecture diagram (4th edition), page 2, winfsp (2)
caption: JuiceFS Enterprise Edition architecture

In practice: Object storage acceleration based on JuiceFS

Based on JuiceFS, the platform encapsulates an object storage acceleration product. This product connects the user's existing object storage to the elastic inference cluster. It provides the storage as a high-performance file system for the application. The overall process is as follows.

Create a file system. The user provides object storage access credentials, for example, AK/SK for S3-compatible storage. The credential permissions can be configured as read only or read-write based on application needs. The platform creates a corresponding JuiceFS file system based on that object storage.
Import metadata. The platform uses the JuiceFS import feature to scan the metadata of files in object storage. Then, it imports that metadata into the JuiceFS metadata service. In this way, the model files originally stored by the user in object storage can be accessed as file system directories in JuiceFS.
Create a cache group. Within each cluster that may host workloads, the platform sets up a JuiceFS distributed cache group. Before running a task, the platform can warm up model files, caching hot data in the target cluster in advance. This reduces the time needed to pull data from remote object storage when the inference service starts.
Mount to application Pods. When the user's application runs, the platform uses the FUSE client to mount the JuiceFS file system into the application Pod. For the application, model files appear as local file system paths. Therefore, the original model reading logic usually does not need modification.
Enable node-local cache. Besides the cluster-level cache group, the node where the FUSE client runs can also provide local cache. This improves repeated read and model loading performance. It further reduces direct access to remote object storage.

This object storage acceleration product essentially productizes the JuiceFS metadata import, distributed cache, data warm-up, and FUSE mounting process. It allows the user's existing object storage to serve cross-cloud inference tasks in a way that feels closer to a local file system.

In addition, the JuiceFS cache group is independent from the file system access point. This characteristic, on one hand, adds management complexity on the platform side, because the platform needs to manage the relationships among the file system, cache groups, mount points, and task scheduling. On the other hand, it provides a foundation for cache isolation, independent scheduling, and fine-grained management based on clusters, users, or application scenarios in the future.

Production case study: A leading text-to-image model community

Scenario, challenges, and acceptance criteria

One of the most representative cases in this object storage acceleration solution involves a leading Chinese text-to-image model community hosting tens of terabytes of model data, including large checkpoint base models and a larger number of smaller LoRA models. In practice, inference jobs typically load a checkpoint first, then load one or more LoRA models to perform combined inference.

The company already operated compute infrastructure at scale — several thousand GPUs — but its workload, serving creative design and production use cases, exhibited significant variability. Overall average utilization was below 50%, yet during morning and afternoon peak hours on weekdays, load could reach 140% of normal capacity, degrading the user experience. The customer therefore needed a highly elastic compute supply.

We provided a high-elasticity resource model: compute at the scale of hundreds of GPUs was available only during weekday peak hours (10:00 AM–12:00 PM and 2:00–6:00 PM), with resources scaling to zero at all other times.

This meant the platform needed to provision hundreds of GPUs within a window of minutes, while consuming zero resources outside peak hours. For the customer, this model delivers large-scale compute during peak periods while avoiding payment for idle capacity. For the platform, it enables more efficient utilization and monetization of idle compute resources.
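To make the cost argument concrete, the arithmetic behind this model can be sketched as below. The peak windows come from the description above; the GPU count of 300 is a hypothetical stand-in for "hundreds of GPUs", and public holidays are ignored for simplicity.

```python
from datetime import time

# Peak windows from the text: weekdays, 10:00-12:00 and 14:00-18:00.
PEAK_WINDOWS = [(time(10, 0), time(12, 0)), (time(14, 0), time(18, 0))]
WEEKDAYS_PER_WEEK = 5
HOURS_PER_WEEK = 7 * 24


def window_hours(start: time, end: time) -> float:
    """Length of a same-day time window, in hours."""
    minutes = (end.hour * 60 + end.minute) - (start.hour * 60 + start.minute)
    return minutes / 60


def weekly_peak_hours() -> float:
    """Total hours per week during which elastic GPUs are allocated."""
    return WEEKDAYS_PER_WEEK * sum(window_hours(s, e) for s, e in PEAK_WINDOWS)


def allocated_fraction() -> float:
    """Share of the week covered by the peak windows."""
    return weekly_peak_hours() / HOURS_PER_WEEK


if __name__ == "__main__":
    gpus = 300  # hypothetical; the article only says "hundreds of GPUs"
    print(f"peak hours per week: {weekly_peak_hours():.0f}")
    print(f"elastic GPU-hours per week: {gpus * weekly_peak_hours():.0f}")
    print(f"always-on GPU-hours per week: {gpus * HOURS_PER_WEEK}")
    print(f"share of always-on usage: {allocated_fraction():.1%}")
```

Under these assumptions the elastic pool is allocated for 30 of the 168 hours in a week, which is why scaling to zero outside the peak windows matters for both sides.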

The technical challenges were significant:

A model repository of this scale cannot simply be replicated to every elastic cluster.
Inference services do not load all models once at startup. Model reads and switches happen continuously as user requests arrive, resulting in high access frequency. Therefore, the object storage acceleration solution needed to support not just large-scale model repository access, but stable read performance under continuous dynamic loading.

The customer's performance requirements were also strict. During acceptance testing, a portion of production traffic was routed to the elastic cluster. The requirement was that both the median and mean inference latency of the elastic cluster must stay within 2 seconds of the customer's own cluster. Given that individual inference jobs take on the order of tens of seconds, this requirement left virtually no room for additional latency introduced by the storage layer. In the first few rounds of testing, both median and mean inference latency on the elastic cluster exceeded the customer's own cluster by approximately 10 seconds — failing the acceptance criteria.
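As a rough illustration of this acceptance criterion, the sketch below compares median and mean latency between two clusters. The function names and the sample latencies are hypothetical; only the 2-second tolerance comes from the text.

```python
from __future__ import annotations

from statistics import mean, median


def latency_gap(elastic: list[float], baseline: list[float]) -> dict[str, float]:
    """Extra latency (seconds) of the elastic cluster over the baseline cluster."""
    return {
        "median": median(elastic) - median(baseline),
        "mean": mean(elastic) - mean(baseline),
    }


def passes_acceptance(elastic: list[float], baseline: list[float],
                      tolerance_s: float = 2.0) -> bool:
    """Both the median gap and the mean gap must stay within the tolerance."""
    gap = latency_gap(elastic, baseline)
    return gap["median"] <= tolerance_s and gap["mean"] <= tolerance_s


if __name__ == "__main__":
    # Hypothetical per-request inference latencies in seconds.
    baseline = [30.0, 31.0, 29.0, 30.0, 30.0]
    long_tail = [31.0, 31.0, 30.0, 31.0, 60.0]  # one slow request
    print(latency_gap(long_tail, baseline))
    print(passes_acceptance(long_tail, baseline))
```

With these sample values, a single slow request leaves the median gap small but pushes the mean gap past the tolerance, so the two metrics have to be diagnosed separately.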

Performance optimization: Reducing additional latency on the elastic cluster

Optimization began with the median. A high median indicates that a significant proportion of requests are experiencing performance degradation, not just a small number of outliers inflating the tail. JuiceFS monitoring revealed that the cluster's cache hit rate was not reaching the expected level. In the current architecture, a cache miss requires a round trip over the public internet to the customer's object storage on Alibaba Cloud. This significantly increases model loading time and then affects inference request latency.

To solve this, the platform used the isolation capability of JuiceFS cache groups. It assigned dedicated cache nodes to this customer, reserved enough cache space, and warmed up the core model data. After warm-up, the access path for core models achieved a nearly 100% cache hit rate. This effectively avoided the performance loss of fetching missed data back from object storage over the public internet.
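A simple weighted-average model shows why the hit rate dominates model loading time. The load times used here (1 second from cache, 20 seconds over the public internet) are hypothetical illustration values, not measurements from this deployment.

```python
def expected_load_seconds(hit_rate: float, cached_s: float, remote_s: float) -> float:
    """Expected model load time given a cache hit rate between 0 and 1."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_s + (1.0 - hit_rate) * remote_s


if __name__ == "__main__":
    cached_s, remote_s = 1.0, 20.0  # hypothetical load times in seconds
    for hit_rate in (0.70, 0.90, 0.99):
        load = expected_load_seconds(hit_rate, cached_s, remote_s)
        print(f"hit rate {hit_rate:.0%}: expected load {load:.2f}s")
```

In this model every percentage point of misses adds a share of the slow remote path to the expected load time, which is consistent with dedicating cache nodes and warming up core models first.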

The second factor affecting the median was metadata access latency. Because the platform uses a unified cross-cluster architecture, the metadata service is accessed over the public internet, for example, via JuiceFS Cloud Service or a deployment on a remote host, and this latency affects overall model read performance.

The platform took two measures to address this issue:

Enabling JuiceFS' open cache to keep metadata in local memory as much as possible. Since this workload is predominantly read-only, caching is an effective way to reduce metadata access overhead.
Tuning the cluster's network rate-limiting policy. While the platform cannot directly control network equipment in edge data centers, it can apply node-level rate limiting to prevent any single node from saturating the available bandwidth, improving overall network stability.

After these optimizations, cluster-wide performance improved meaningfully and the median metric gradually reached the customer's requirement.
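The article does not say how the node-level rate limiting is implemented. One common technique for capping per-node bandwidth is a token bucket, sketched below as an assumption rather than the platform's actual mechanism; the class name and parameters are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Per-node byte budget: refills at a fixed rate, capped at a burst size."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        # Add tokens for the elapsed time, never exceeding the burst capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_consume(self, nbytes: float) -> bool:
        """Return True and deduct tokens if the transfer fits the current budget."""
        self._refill()
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A caller would check `try_consume(len(chunk))` before sending each chunk and wait briefly when it returns False, so a single node cannot exceed its configured share of the link for long.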

Once the median met the target, the mean still showed a gap. This indicated that long-tail requests remained, with a small number of requests taking significantly longer than normal and pulling up the overall average. Further analysis traced this to node-level local cache — specifically, the FUSE cache quota. With limited cache capacity, the elastic cluster experienced more frequent cache evictions than the customer's own cluster, causing some requests to reload model data from scratch and increasing mean inference latency. The platform addressed this by increasing the FUSE local cache quota in the production environment, reducing eviction frequency, improving tail latency, and ultimately bringing the mean metric within acceptance. The system passed validation and has been running stably since.

Multi-tenant cache management

After the single-tenant case was validated, the solution entered multi-tenant operation. As different tenants began time-sharing the same elastic nodes, a new issue emerged: cache contention between tenants.

In the elastic resource model, FUSE clients do not actively clear node cache on exit. This is a reasonable design in single-tenant scenarios, where cached data from previous jobs can be reused by subsequent jobs to improve hit rates. However, in multi-tenant scenarios, one tenant's data can occupy node cache for extended periods. This leaves insufficient cache capacity for the next tenant, who is then forced to fall back to object storage, causing a noticeable performance drop.

To address this, we deployed an independent daemon process on each node that performs a global cache garbage collection (GC) pass before the application FUSE client starts. The eviction strategy references the JuiceFS FUSE client implementation, using a 2-random policy to balance collection efficiency and performance overhead. Coordination across nodes is handled via Kubernetes distributed locks: only the client that acquires the lock executes GC, preventing multiple clients from running cache collection simultaneously and creating excessive network and I/O pressure.
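The idea behind a 2-random eviction policy can be sketched as follows: pick two cached entries at random and evict the one that was accessed less recently, repeating until usage falls below a target. This is a simplified model with hypothetical names, not the JuiceFS client code or the daemon described above, and it leaves out the Kubernetes lock.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class CacheEntry:
    size: int     # bytes occupied on disk
    atime: float  # last access time (larger means more recent)


def evict_two_random(cache: dict[str, CacheEntry], target_bytes: int,
                     rng: random.Random | None = None) -> list[str]:
    """Evict entries in place until total size <= target_bytes; return evicted keys."""
    rng = rng or random.Random()
    used = sum(entry.size for entry in cache.values())
    evicted: list[str] = []
    while used > target_bytes and cache:
        keys = list(cache)
        if len(keys) == 1:
            victim = keys[0]
        else:
            # Sample two distinct candidates and drop the colder one.
            a, b = rng.sample(keys, 2)
            victim = a if cache[a].atime <= cache[b].atime else b
        used -= cache.pop(victim).size
        evicted.append(victim)
    return evicted
```

Compared with a strict LRU, this approach needs no globally ordered list of entries: each step only compares two access times, at the price of occasionally evicting an entry that is not the coldest one.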

This mechanism effectively mitigates the problem of historical jobs occupying cache resources in multi-tenant scenarios, allowing different tenants sharing elastic resources to maintain consistent cache performance.

Conclusion

For elastic compute to reliably serve production traffic, compute scheduling alone is not enough. Model data and hot data must remain stably accessible across clouds and clusters.

Built on JuiceFS, we’ve combined object storage, unified namespace, metadata management, distributed caching, and FUSE mounting into an object storage acceleration solution purpose-built for elastic inference. This is not simply about mounting object storage as a file system. It’s about building a data access layer around the access patterns of model inference: one that supports warm-up, caching, isolation, and management.

This represents Gongjiyun's current progress in elastic compute and cross-cloud storage acceleration. As AI inference scenarios continue to evolve, model distribution, cache management, and multi-cluster data access will continue to surface new engineering challenges. We look forward to exchanging ideas with developers, AI application teams, and infrastructure practitioners, and to exploring more stable and efficient data access solutions for elastic compute environments.

If you have any questions about this article, feel free to join JuiceFS discussions on GitHub and community on Discord.

Reducing Data Storage Costs: A Deep Dive into JuiceFS 1.4 Tiered Storage

Wed, 03 Jun 2026 08:45:00 +0000

JuiceFS Community Edition 1.4 introduces enhanced tiered storage capabilities, allowing users to set object storage classes at the file or directory level. This makes it possible to manage different storage tiers for data under a unified file system interface. In this article, we’ll discuss this feature’s application background, evolution, usage model, implementation, and future plans.

Application background

In real‑world scenarios, different files have different access patterns and performance requirements. Some data is read or written frequently and demands low latency and high throughput. Other data is rarely accessed after being written, and the main concern is long‑term storage cost. Tiered storage addresses this by placing data in the appropriate storage layer based on access patterns, balancing performance and cost.

Typically, data can be classified into three categories:

Hot data: Frequently accessed, requires low latency and high throughput.
Warm (infrequent access) data: Accessed occasionally, but still requires fast retrieval when needed.
Cold (archival) data: Primarily for long‑term retention, very low access frequency, can tolerate some restoration delay in exchange for lower cost.

Object storage already offers tiering capabilities. For example, Amazon S3 provides S3 Standard for frequently accessed data, S3 Standard‑IA for infrequent but still millisecond‑accessible data, and Glacier / Deep Archive for long‑term archiving. These storage classes differ in access latency, minimum storage duration, and pricing.

The table below compares the main S3 storage classes:

Storage class	Use case	First byte latency	Minimum storage duration fee
S3 Standard	General-purpose storage for frequently accessed data	Milliseconds	N/A
S3 Standard-IA	Infrequently accessed data requiring millisecond access	Milliseconds	30 days
S3 Glacier Deep Archive	Archiving very rarely accessed data with very low cost	Hours	180 days

For JuiceFS, which is built on top of object storage, the key is to translate these capabilities into file‑system‑level tiering: users set storage tiers for files, directories, or datasets, and JuiceFS maps them to the underlying object storage while handling writes, migrations, and restore operations.

Evolution of JuiceFS tiering capabilities

The evolution of JuiceFS tiering has moved from being “passively unaware of object storage classes” to “actively managing storage tiers at file and directory granularity.”

Before v1.1, JuiceFS did not provide a way to configure storage classes. While users could manually change the storage class of objects at the object storage side, these changes were not recognized or managed by JuiceFS at the file system level. For standard and infrequent‑access classes that support direct access, normal read/write operations usually continued to work. However, if objects were moved to archival storage, access would fail because those objects cannot be read directly.

Starting with v1.1, JuiceFS supports setting the object storage class via --storage-class. For example, you can specify the default storage class for the file system at format time or override the storage class used for data written to a specific mount point during mount. This gave JuiceFS a basic ability to leverage object storage tiering. However, the configuration granularity remained coarse – primarily at the file system default or mount‑point level – and did not allow fine‑grained management per directory, per file, or per dataset.

Version 1.4 further advances tiering capabilities to the file and directory level. You can assign a storage tier to individual files or directories based on data temperature. When a directory is assigned a tier, newly created files and subdirectories under it automatically inherit that configuration. Compared to the previous default or mount‑point level settings, v1.4 is better suited for tiered management by project, directory, dataset, or data temperature.

How to configure tiered storage

The key to tiered storage in JuiceFS 1.4 is translating object storage classes into file‑system‑manageable tiers. The usage model consists of two steps:

Map tier IDs to object storage classes.
Assign files or directories to those tier IDs.

This allows users to organize tiering policies by file, directory, or dataset without specifying the underlying storage class on every write.

The figure below shows mapping tier IDs to storage classes:

image: Mapping tier IDs to storage classes
caption: Mapping tier IDs to storage classes

For example, map tier IDs 1–3 to different storage classes:

juicefs config redis://localhost --tier-id 1 --tier-sc STANDARD_IA -y
juicefs config redis://localhost --tier-id 2 --tier-sc INTELLIGENT_TIERING -y
juicefs config redis://localhost --tier-id 3 --tier-sc GLACIER_IR -y

After mapping, set the storage tier for a file or directory:

juicefs tier set redis://localhost --id 1 /path/to/file
juicefs tier set redis://localhost --id 2 /path/to/dir

Directory‑level settings have inheritance semantics. Once a directory is assigned a tier ID, newly created files and subdirectories will inherit that tier. To apply the tier to existing data under the directory, use -r to recursively set the tier:

juicefs tier set redis://localhost --id 2 /path/to/dir -r

image: existing data
caption: Directory inheritance and recursive setting for existing data

For archival storage classes such as Glacier, a restore request must be issued before reading:

juicefs tier restore redis://localhost /path/to/dir -r

Implementation

From an implementation perspective, the key to tiered storage in v1.4 is storing tier information in metadata and using the tier ID to decide the object storage behavior during writes, migrations, and reads.

image: Write, migration, and archival read flow
caption: Write, migration, and archival read flow

Metadata design

JuiceFS uses tier-id on files and directories to indicate the storage tier. A value of 0 means the default storage tier; values 1 to 3 correspond to user‑configured object storage classes.

Thus, the storage tier is no longer just an external state at the object storage side, but becomes part of the file system metadata that JuiceFS can understand and manage. When writing new data, migrating existing data, or checking file status, JuiceFS can determine the intended storage class based on this metadata.

Migrating existing data

For existing data, changing the storage tier involves not only updating the metadata tier-id but also changing the actual storage class of the underlying objects. When a directory is set recursively, JuiceFS processes all files and subdirectories under it and uses the object storage’s copy capability to migrate existing objects to the new storage class.

If only the mapping from a tier ID to a storage class is changed, the actual storage class of existing objects is not automatically updated. In that case, you must use tier set --force to explicitly trigger the change. This will rewrite the objects with the new storage class.

Write path

When a new file is written, JuiceFS determines the target storage class based on the file’s own tier-id or, if not set, the inherited tier-id from its parent directory. For directories that already have a storage tier assigned, new data can be written directly to the corresponding storage tier. This avoids the overhead of first writing to the default tier and then migrating later.
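The inheritance and write-path behavior described above can be modeled with a small in-memory sketch. It is a toy model, not JuiceFS code: the `Namespace` class is hypothetical, and mapping tier 0 to `STANDARD` is an assumption. The tier-to-class mapping mirrors the earlier `juicefs config` example.

```python
from __future__ import annotations

import posixpath

DEFAULT_TIER = 0
DEFAULT_CLASS = "STANDARD"  # assumption: what tier 0 resolves to


class Namespace:
    """Toy file tree that stores a tier ID per path, like per-inode metadata."""

    def __init__(self, tier_classes: dict[int, str]) -> None:
        self.tier_classes = dict(tier_classes)  # tier ID -> storage class
        self.tier_of: dict[str, int] = {"/": DEFAULT_TIER}

    def create(self, path: str) -> None:
        """New files and directories inherit the parent's tier at creation time."""
        parent = posixpath.dirname(path) or "/"
        self.tier_of[path] = self.tier_of[parent]

    def set_tier(self, path: str, tier_id: int, recursive: bool = False) -> None:
        """Set a tier on one path; with recursive=True also update existing children."""
        if tier_id != DEFAULT_TIER and tier_id not in self.tier_classes:
            raise ValueError(f"tier {tier_id} is not mapped to a storage class")
        self.tier_of[path] = tier_id
        if recursive:
            prefix = path.rstrip("/") + "/"
            for child in list(self.tier_of):
                if child.startswith(prefix):
                    self.tier_of[child] = tier_id

    def storage_class(self, path: str) -> str:
        """Storage class a write to this path would target."""
        return self.tier_classes.get(self.tier_of[path], DEFAULT_CLASS)
```

In this model, a file created before the directory was assigned a tier keeps the default class until a recursive set is applied, while files created afterwards pick up the directory's tier immediately.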

Read path

For storage classes that support immediate access (for example, Standard and Standard‑IA), reads are transparent to the application, and JuiceFS simply reads the data from object storage as usual.

For archival classes such as Glacier and Deep Archive, objects cannot be read directly. You must first issue a restore request using juicefs tier restore. This sends a request to the object storage service. Whether and when the objects become readable depends on the cloud provider’s restore mechanism. After restoration completes, applications can retry the read.
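A restore-then-retry flow might look like the sketch below. The command line is the `juicefs tier restore` invocation shown earlier; the helper names, the retry interval, and the assumption that reading an object that is not yet restored surfaces as an `OSError` are all hypothetical.

```python
import subprocess
import time
from typing import Callable, List


def restore_command(meta_url: str, path: str, recursive: bool = True) -> List[str]:
    """Build the restore command line used earlier in this article."""
    cmd = ["juicefs", "tier", "restore", meta_url, path]
    if recursive:
        cmd.append("-r")
    return cmd


def request_restore(meta_url: str, path: str, recursive: bool = True) -> None:
    """Issue the restore request; raises CalledProcessError if the command fails."""
    subprocess.run(restore_command(meta_url, path, recursive), check=True)


def read_after_restore(read_fn: Callable[[], bytes], attempts: int = 5,
                       delay_s: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Retry a read until it succeeds or the attempts run out.

    Assumption: a read of a not-yet-restored object raises OSError.
    """
    for attempt in range(1, attempts + 1):
        try:
            return read_fn()
        except OSError:
            if attempt == attempts:
                raise
            sleep(delay_s)
    raise ValueError("attempts must be at least 1")
```

Because restore time for archival classes can be hours, a real job would likely use a much longer delay or an external scheduler rather than a tight loop.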

Therefore, archival storage is suitable for data that is accessed very infrequently and can tolerate restoration delay. It’s not appropriate for workloads that require online access at any time. When using archival tiers, you must consider storage cost, restoration time, and restoration costs.

Future plans

Reducing operational costs of archival storage

Archival storage classes have low long-term storage costs, but they often come with complex cost models covering writes, restores, early deletion, and lifecycle transitions. Writing data directly to archival storage may incur extra costs in scenarios with frequent changes or bulk migrations.

In the future, JuiceFS could integrate with object storage lifecycle management. Data could first be written to standard storage with specific object tags. Users could then use cloud-vendor lifecycle rules to automatically and cost-effectively transition data to infrequent-access or archival tiers based on those tags. This would preserve JuiceFS' file-system-level tiering capabilities while leveraging native batch transition mechanisms to reduce overhead.

Extending tiering to multi‑bucket, multi‑cloud

Currently, tiered storage works on different storage classes within the same object storage backend. In the future, JuiceFS could extend “tier” to different buckets, different object storage services, or even different cloud providers. Tiering would no longer be limited to a single backend.

For example, hot data could be stored in a local high‑performance MinIO cluster backed by SSDs, while cold or archival data resides in low‑cost cloud archival buckets. Policies could then gradually move data from the hot tier to the cold tier. With such an architecture, JuiceFS could offer cross‑bucket, cross‑cloud, and cross‑media tiered data management under a unified file system namespace.

If you have any questions about this article, feel free to join JuiceFS discussions on GitHub and community on Discord.

JuiceFS at Xiaomi: Unified Storage for AI, Big Data, and Cloud‑Native Workloads

Wed, 27 May 2026 08:03:25 +0000

Xiaomi is one of the world's leading smartphone companies. Since 2021, its storage team has been building a file storage platform based on JuiceFS, initially providing file storage capabilities for cloud-native and some application scenarios. After Xiaomi announced its comprehensive AI strategy in 2024, issues with the previous heterogeneous storage system became more evident in areas such as technology selection, data flow, and development/operations. Drawing on JuiceFS' multi-protocol access, elastic scalability, multi-cloud adaptability, and high performance, the team decided to build a unified file storage foundation centered on JuiceFS to support big data, cloud-native, and AI workloads.

To achieve this goal, the platform further developed core capabilities, including a capacity layer, a performance layer, and a cache layer. These reduce the complexity of multi‑system access and data movement while balancing large‑scale storage with high‑performance access. Over the past two years, with the rapid growth of generative AI and autonomous driving, the platform has supported typical scenarios such as large‑model training, autonomous driving training, inference acceleration, and big‑data cloud migration. Today, the platform can handle hundreds of billions of files and EB‑scale storage, covering the entire AI storage chain from raw data and training data to model file distribution.

Storage architecture challenges under the AI strategy

Before 2023, Xiaomi, like most companies, had built multiple storage systems for different application scenarios. In the big data area, the data platform was mainly based on HDFS; AI workloads, before the rise of large language models, relied primarily on high‑performance file storage services on the cloud, such as Parallel File System (PFS) and Network Attached Storage (NAS).

During this period, we also began to introduce JuiceFS and built an internal self‑developed File Storage Service (FDS), using components like JuiceFS CSI Driver to provide file storage for cloud‑native and some application scenarios. As application needs evolved, these storage systems grew independently. This led to a complex heterogeneous storage landscape.

In 2024, after Xiaomi announced its comprehensive AI strategy, the shortcomings of the previous storage system became more pronounced in areas such as technology selection, access, data flow, and development/operations.

These challenges included:

High selection and access costs: With many storage systems and inconsistent capabilities, application teams had to understand and adapt to each one, raising the barrier to entry.
Low data flow efficiency: The lack of a unified access method across systems led to frequent cross‑system data copying. This hurt development efficiency.
Scattered development and operations efforts: Multiple systems were maintained and evolved independently, making it difficult to focus resources on the mission-critical infrastructure required for the AI strategy.

To address these issues, we conducted in‑depth internal discussions and architectural adjustments in 2024, and began redesigning a unified storage architecture for AI, big data, and cloud‑native scenarios.

Building a unified file foundation with JuiceFS

Selection rationale: Multi-protocol support, elasticity, multi-cloud, high performance

JuiceFS is a distributed file system that natively supports multi-protocol access, elastic scaling, and high-performance reads/writes. This makes it a good fit for both AI and big data storage needs.

image: JuiceFS architecture diagram (4th edition), page 2, winfsp (2)
caption: JuiceFS Community Edition architecture

In the cloud-native field, we’ve been using JuiceFS since 2021, continuously conducting internal development and iterative optimization. At the same time, we maintain close collaboration with the JuiceFS open-source community to jointly drive technology evolution and real-world adoption.

In AI scenarios, model training and inference rely heavily on POSIX semantics, which aligns naturally with JuiceFS capabilities. Meanwhile, in the big data area, we were already promoting HDFS replacement during cloud migration, a practice with many mature industry examples, so adapting the HDFS protocol was also feasible.

Considering multi-protocol support, elastic scalability, multi-cloud adaptability, and high-performance read/write, we ultimately chose JuiceFS as the core component of our unified file storage foundation. This solved the problems of complex data flow, high access costs, and scattered operations caused by using different file systems across multiple platforms and application units.

image: xiaomi Evolution
caption: Evolution of the unified file storage foundation based on JuiceFS

Storage layer capability construction

Our core goal is to build a unified file storage layer on top of JuiceFS, providing large capacity, high performance, and standardized access interfaces to uniformly support the three core application scenarios: big data, cloud-native, and AI.

On the client side, we fully leverage JuiceFS’ multi-protocol capabilities, offering access methods including POSIX, Hadoop SDK, Python SDK, and S3 Gateway. They’re all already in use internally.

image: JuiceFS unified file storage
caption: Architecture of the JuiceFS unified file storage foundation

On the data plane, the architecture consists of three layers:

Capacity layer: Built on public cloud object storage, designed for EB‑scale storage, supporting multi-cloud deployments across different strategic data centers and multiple cloud providers.
Performance layer: Large‑scale tuning based on Ceph and all‑flash nodes, designed for AI training and other scenarios with high throughput and low latency requirements.
Cache layer: Given the “write once, read many, seldom modify” characteristic of AI training datasets, we developed a high‑performance distributed cache system based on NVMe and RDMA to reduce repeated read costs and improve training data access efficiency.

On the control plane, we made custom enhancements to the Community Edition:

For metadata, we built a distributed metadata service based on the Raft protocol to integrate with internal infrastructure systems and support multi-system access, improving reliability and scalability.
For backend management, we built a unified management service responsible for data lifecycle management, tiered storage, garbage collection, and warm-up of hot data from the capacity layer to the performance or cache layers.

Through these efforts, JuiceFS has gradually become the unified file storage foundation at Xiaomi, supporting both large‑scale capacity storage and high‑performance access for AI training. The architecture is now running in production and provides the high throughput required for large model training.

Our practices

During the construction of the unified file storage foundation, JuiceFS has gradually covered Xiaomi’s mission-critical application scenarios, including big data, cloud-native, and AI:

In terms of scale, the solution can support EB‑level storage and hundreds of billions of files.
In terms of capability, the coordinated design of the capacity, performance, and cache layers balances large‑scale storage with high performance.

Below we describe two typical scenarios: big data cloud migration and the AI storage pipeline.

Big data cloud migration and unified lakehouse storage

In its early days, our big data system was mainly built on the Hadoop ecosystem, where HDFS used a previous‑generation coupled architecture. Over time, this architecture showed problems such as performance fluctuations, complex operations, and high total cost. In contrast, cloud storage offers significant advantages in elastic scaling, resource utilization, and cost control. Therefore, starting in 2021, we systematically began migrating big data to the cloud.

From cold data to the lakehouse layer

Our big data cloud migration went through three stages:

Cold data migration: We first migrated cold data from HDFS to cloud storage, a process lasting over two years.
Lakehouse layer migration: We developed a unified lakehouse file system in house, driving the evolution from coupled to decoupled storage and compute.
Unified storage foundation based on JuiceFS: After selecting JuiceFS, we migrated the entire lakehouse layer to JuiceFS.

Lakehouse construction can leverage Iceberg’s native support for object storage access (like OSS or S3). However, our applications span multiple regions worldwide and run on several cloud vendors. Adapting to each vendor individually would incur high access and maintenance costs.

Thus, we chose JuiceFS to provide uniform access to different cloud storage services. Upper‑layer services simply switch the backend storage address via the SDK to work in different cloud environments, greatly reducing multi‑cloud complexity.

For data migration, our in-house data‑factory platform can transparently switch a table’s underlying storage to the new architecture and gradually migrate existing data to the cloud in the background, with little or no impact on applications. Moreover, JuiceFS supports multi-cloud and on‑premises deployment. If future cost or strategic considerations require switching to self‑built storage, data can be smoothly migrated back via JuiceFS. This preserves architectural flexibility.

Hot table cache acceleration for compute efficiency

After data was in the cloud, we further analyzed access patterns of the lakehouse layer. For daily reporting and analysis tasks, computation is usually concentrated on day‑level or week‑level hot data, not requiring frequent full scans. Therefore, the performance focus for the lakehouse layer was not simply improving full‑scan throughput but rather increasing hot data access efficiency and task execution stability.

Based on this, we built a hot table warm-up capability in cooperation with the lakehouse layer. The system identifies hot tables and their hot partitions based on daily access statistics, and preloads related data into the cache layer before task execution via a warm-up interface. For periodic reporting tasks that must be completed by 8 AM, hot data is warmed up before computation. This reduces remote reads and repeated access.
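The selection step described above can be sketched as follows. This is only a minimal illustration, not the system's actual logic: the `partitionStat` type, the `hotPartitions` function, the read-count threshold, and the limit are all hypothetical names and parameters chosen for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// partitionStat is a hypothetical record of one day's access statistics.
type partitionStat struct {
	Table, Partition string
	Reads            int64
}

// hotPartitions keeps partitions with at least minReads reads and returns
// the most-read ones first, capped at limit entries.
func hotPartitions(stats []partitionStat, minReads int64, limit int) []partitionStat {
	hot := make([]partitionStat, 0, len(stats))
	for _, s := range stats {
		if s.Reads >= minReads {
			hot = append(hot, s)
		}
	}
	sort.SliceStable(hot, func(i, j int) bool { return hot[i].Reads > hot[j].Reads })
	if limit >= 0 && len(hot) > limit {
		hot = hot[:limit]
	}
	return hot
}

func main() {
	stats := []partitionStat{
		{"orders", "dt=2026-07-07", 900},
		{"orders", "dt=2026-07-01", 12},
		{"users", "dt=2026-07-07", 450},
		{"clicks", "dt=2026-07-07", 3000},
	}
	// The selected partitions would then be handed to the warm-up interface.
	for _, p := range hotPartitions(stats, 100, 2) {
		fmt.Printf("%s/%s reads=%d\n", p.Table, p.Partition, p.Reads)
	}
	// Prints:
	// clicks/dt=2026-07-07 reads=3000
	// orders/dt=2026-07-07 reads=900
}
```

A cap on the number of selected partitions is one simple way to keep the warm-up within the cache budget before the reporting deadline.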

Through offline and online testing, after hot table caching, compute efficiency improved by about 10-20%, with reductions in both computation time and resource consumption. The cache size has reached PB level, with average throughput around 200 GB/s. The cache layer also reduces cross‑cloud bandwidth pressure and cloud storage API call costs: by improving the hot data hit rate, repeated cross-cloud reads can be reduced, thereby lowering bandwidth consumption and access costs.

image: Hot table warm-up architecture
caption: Hot table warm-up architecture

Benefits for big data

Benefits for our big data application include:

Performance: After switching to JuiceFS, sequential read/write performance improved significantly, more than doubling in some scenarios. Overall task duration decreased by about 10–30%.
Cost: By Xiaomi's internal cost metrics, the unified storage architecture has greatly lowered storage costs – about 70% in China and 90% in overseas regions. The overseas legacy solution, which used HDFS with three replicas on cloud instances and EBS, had a high replication factor and thus higher costs.
Stability and operations: Under the previous mixed architecture, many compute tasks easily consumed node resources, raising node load and affecting storage performance. With the decoupled storage‑compute architecture, compute tasks run on dedicated nodes, task durations are more stable, and scaling and management are more flexible.

AI one‑stop storage

AI storage consists of three stages:

Raw data stage: Storing large volumes of raw data, which undergoes processing (for example, ETL) to produce training datasets, then is fed into high‑performance training environments.
Training stage: Training tasks require high throughput and low latency to reduce I/O wait time and increase GPU utilization. After training, model files are generated for subsequent inference.
Inference stage: Model files must be quickly distributed to specific nodes for rapid startup of inference tasks.

image: AI storage flow
caption: AI storage flow

Previously, data flowed among multiple systems, causing inconvenience for both application teams and internal operations. By adopting JuiceFS uniformly, we can meet diverse needs based on different storage tiers.

Requirements and solutions by stage

AI one-stop storage needs to cover three stages: raw data, training data, and model files. The requirements for capacity, performance, cost, and distribution efficiency differ at each stage. The table below compares the application needs for each stage with previous and current solutions.

Use case	Application requirements	Previous solution	Current solution (JuiceFS)
Raw data	Large capacity, low cost; support high‑concurrency data processing; scale to PB+	Direct use of object storage; HDFS; other low‑cost storage	Capacity‑oriented JuiceFS: multi‑cloud object storage underlying, shielding vendor differences; EB capacity, hundreds of billions of files; millions of concurrent tasks
Training data	High throughput, low latency; reduce I/O wait time; improve GPU utilization	PFS, NAS (good performance but high cost)	Performance‑oriented/cache‑oriented JuiceFS: TB/s throughput, low latency; async checkpoint to reduce I/O wait; cache acceleration
Model files	Fast distribution; efficient loading; quick inference startup	P2P distribution; workflow distribution; PFS	Cache‑accelerated JuiceFS: cache improves model loading; up to 16 GB/s sequential load per node; several times faster than local disk or FDS

High‑performance cache acceleration: improving efficiency and cutting costs

In AI training, training datasets typically have the characteristics of "write once, read many times, and modify very little." This is a typical read-heavy, write-light access pattern, making it suitable for improving data access efficiency through caching.

Take our internal autonomous driving training as an example. Once a dataset version matures, its data volume may continue to grow within the version cycle, but existing data is rarely modified. While the previous high‑performance file storage met training performance requirements, it was over-provisioned for such repetitive reads and therefore wasted cost. Therefore, we began promoting a high‑performance cache acceleration solution based on JuiceFS.

The cache solution offers several advantages:

Short I/O path: Clients operate on files directly, greatly shortening the I/O path for fast responses.
Performance optimization: Through RDMA and zero‑copy optimization, performance has significantly improved – throughput more than 20% higher than previous high‑performance storage, with ongoing optimization.
Cost reduction: The previous PFS‑based storage used replication (though some used EC, replication was more common for stability). With the cache solution, single‑copy storage reduces costs by more than 60%.
Resource consolidation: In GPU training clusters, GPU nodes typically have NVMe drives (about 10 TB each), which were previously used in scattered ways with low utilization. Now, we consolidate these NVMe resources into a unified cache pool to accelerate nearby GPU training and data processing tasks.

Future plans

Looking ahead, we’ll focus on three directions:

Continuously improve the stability, performance, and scalability of the unified file storage foundation. As AI applications grow rapidly, training, inference, and data processing tasks demand higher throughput, lower latency, and greater reliability. We’ll continue optimizing the underlying architecture and critical paths to enhance service capabilities under large‑scale concurrent access.
Strengthen lifecycle management for massive data. Current data volumes continue to grow, but management across storage tiers, access frequencies, and retention periods can be further optimized. We’ll refine tiered storage, archiving, warm-up, and cleanup strategies based on data temperature, access patterns, and cost models, reducing unit storage cost and improving resource utilization.
Enhance data management and analysis capabilities. On top of the unified file storage foundation, we’ll build data management capabilities for application users, helping them better understand data distribution, access behavior, and resource usage, supporting data management, cost optimization, and application decisions.

We look forward to continued exchanges with industry peers to explore more technical practices. If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Discord.

From Metadata Backup to Mirror File System: Resilience Strategies for JuiceFS

Thu, 14 May 2026 09:19:00 +0000

Murphy's law: Anything that can go wrong will go wrong

Let me paint you a picture.

It's a Thursday morning. You're the head of infrastructure for a massive financial institution, let's say, a pension fund managing over $100 billion in assets for more than 600,000 members. Your team spent months migrating everything to a private Google Cloud deployment. You have redundancy in two geographic regions. You have backups. You followed every best practice in the book.

Then you wake up to this: everything is gone. Not just a server. Not just a database. The entire private cloud account is deleted, like it never existed.

If you're unfamiliar with this story, allow me to introduce you to UniSuper, an Australian pension fund that experienced every sysadmin's worst nightmare. On May 1, 2024, UniSuper lost complete access to its private cloud infrastructure. No member portals. No transaction processing. No account balances. For two full weeks, customers couldn't access their retirement savings. Eventually, and thankfully, the incident was resolved, and the data was recovered. Shoutout to the engineers who didn't just back things up—they also made backups offsite and then backed those up too!

But how did the incident happen exactly? The answer is almost too absurd to believe.

When provisioning the infrastructure, a parameter was left blank. That blank field triggered an automatic default: the private cloud environment was set to a fixed one-year term, with automatic deletion at the end. Fast forward one year, and the system did exactly what it was told: it deleted everything. Both geographic regions, all the data, and yes, even the backups stored in those regions. There was no warning, no confirmation dialog asking, "Are you absolutely certain you wish to delete everything?"

Human errors and automation failures are inevitable, so what do we do about it? We can't prevent every mistake, but we can build systems that survive them, just like what UniSuper engineers did. This is where your choices of core data infrastructure (i.e., databases, storage, and file systems) matter more than you think. JuiceFS offers multiple layers of defense specifically designed for scenarios like these. Let me walk you through all of them today.

Your first line of defense: protecting the brain

In JuiceFS' architecture, there's a clear separation of duties. The actual file content (the data) lives in object storage services of your choice, such as S3, GCS, and MinIO. In the meantime, the map (the metadata) that tells which data blocks belong to which file is stored in the metadata engine supported by JuiceFS, such as Redis, MySQL, or the JuiceFS Enterprise metadata store.

Here's the "a little bit scary" part: The metadata is the brain. Lose your Redis backend, and you have petabytes of random data objects with no way to assemble them back into files. It's like having a million puzzle pieces without knowing what the original painting looks like.

But flip that around. If the metadata is the brain, then backing it up alone means you've backed up the entire file system. This assumes that the object storage is always intact, and we will question this assumption later.

Metadata backup

JuiceFS has a brilliant feature that doesn't rely on you to remember: automatic metadata backup. The JuiceFS client can be configured to automatically dump metadata to your object storage. By default, it does this operation every hour. Under the hood, the JuiceFS client uses the juicefs dump command, serializing the entire directory tree into a snapshot, which can be either human-readable JSON or a compact binary format. And don't worry about performance: with optimizations in the JuiceFS Community Edition v1.3, backing up 100 million files takes just minutes with controlled memory usage.

# Back up metadata every 8 hours.
juicefs mount -d --backup-meta 8h redis://127.0.0.1:6379/1 /mnt

The backup frequency is configurable with JuiceFS, along with some reasonable defaults taking the number of files and cleanup policies into account. It's worth mentioning that JuiceFS supports various databases as metadata engines. A database often comes with its own backup/snapshot mechanism as well, which can be utilized as an additional layer of backup. However, the backup file generated by juicefs dump can be used across different engines as a uniform format specifically for JuiceFS metadata management.

Once it's set, you don't think about it again. Until the day you need it, and when that day comes, the juicefs load command restores everything from that snapshot. Of course, this all relies on one critical assumption: that your object storage is perfectly safe, which should often be the case. But for the sake of being extra skeptical, let's dig into that next.

Do I really need to back up object storage?

So now you have hourly metadata backups. That's a solid first step. Still, a common objection arises: "Amazon S3 provides 99.999999999% durability." Isn't that sufficient?

Recall how this post began. Mainstream databases and object stores are extremely reliable. However, that reliability offers no protection against accidents such as a failure in higher-level control logic. Other scenarios also exist beyond vendor errors: lightning strikes (acts of God), accidental bucket deletions, misconfigured IAM policies, or cyber attacks compromising your access keys.

Here is the essential truth: Your metadata backup is a map. If the actual data chunks in object storage are corrupted or deleted, that map serves no purpose. For critical data, backing up your object storage can be a sensible addition, because complex systems have many failure modes.

Beyond metadata backup: enterprise-grade resilience

Metadata backups are essential, and backing up object storage is pragmatic. But for organizations running critical workloads, especially across multiple regions or even cloud providers, sometimes a "cold" backup is not enough and can be wasteful as well. There are several additional JuiceFS features that go beyond traditional backup, which we will cover in this section.

Data synchronization

For straightforward data copying between storage systems, JuiceFS provides the data synchronization feature via the juicefs sync command, available in both the JuiceFS Community and Enterprise Editions. This tool can copy data between object storage, JuiceFS volumes, local file systems, and even remote servers via SSH or HDFS. It supports incremental synchronization and pattern matching (similar to rsync), making it suitable for one-time data migrations or periodic copying of specific datasets.

Here is a basic example of syncing from an S3 bucket to a JuiceFS volume, excluding any files with the .log suffix:

juicefs sync s3://mybucket.s3.us-east-2.amazonaws.com/ jfs://VOL_NAME/ --exclude '*.log'

The juicefs sync tool is versatile and isn't limited to JuiceFS volumes: you can sync from MinIO to Azure Blob, from a local directory to JuiceFS, or between two entirely different cloud providers. The --include and --exclude patterns allow fine-grained control over exactly which files are copied.

Data replication

For continuous, asynchronous replication across clouds or regions, JuiceFS Enterprise Edition offers data replication. Once enabled, every write to the primary object storage bucket is automatically copied to a target bucket in another region or cloud provider. This feature serves multiple purposes: cross-region data sharing, seamless object storage migration, and disaster recovery. If the primary object store fails, clients can be manually switched to the target bucket to restore service with minimal downtime.

image: JuiceFS data replication
caption: JuiceFS data replication

As shown in the diagram above, taking the primary region (main) writes and the replicated region (target) reads as an example, the data copy logic is straightforward: writes go to the primary region's object storage first and then are asynchronously replicated to the target region. Reads prefer the local region's bucket and fall back to the remote bucket if data hasn't yet arrived.

It's notable that replication runs continuously and asynchronously within the client process itself. The client writes to both buckets directly, rather than relying on a background job mechanism. Because both regions share the same JuiceFS Enterprise metadata engine, metadata remains fully consistent between them. However, the target region inherently operates with higher latency and may experience reduced performance.

image: JuiceFS Web Console
caption: Enabling data replication in the JuiceFS Web Console

From a cost perspective, replication is free to enable, as JuiceFS does not charge for this feature, nor does it generate additional metadata that would affect billing. However, you remain responsible for your object storage provider's standard usage fees, as now you have more than one copy of data.

In short, data replication is designed to keep the underlying object storage in sync across regions using a shared metadata engine, with the additional benefits described above.

Mirror file system

The most sophisticated option is the mirror file system—a complete, writable replica of your entire JuiceFS deployment. Here's how it works and why it matters.

A mirror file system creates one or more full copies (both the metadata engine and the object storage) of an existing file system in different regions. Metadata is automatically synchronized from the source, allowing clients in the mirrored region to access the file system locally with dramatically lower latency. It is noteworthy that you can still choose not to replicate the object storage and rely on the caching layers to accelerate data access and reduce storage costs. But for simplicity, we will skip that setup here, as the mirror file system itself deserves a full long blog post.

image: JuiceFS mirror file system architecture
caption: JuiceFS mirror file system architecture

Prior to JuiceFS Enterprise version 5.1, mirrors were read-only, and write operations had to go back to the source region. Starting with version 5.1, mirrors support writes as well. In a write-enabled mirror, clients connect to both the source and mirror metadata services simultaneously. Read requests go to the local mirror for low latency. Write requests go directly to the source. After the source responds, the client waits briefly for the change to sync back to the mirror before returning to the application.

Consider an AI training scenario: your GPUs are in Region B, but your primary storage is in Region A. Without a mirror, every metadata operation crosses the network, potentially hundreds of milliseconds per request. With a mirror, reads are local, and the performance difference can be dramatic.

With that said, a mirror file system is not merely a backup. It is also a cross-cloud, low-latency disaster recovery solution. It is not likely that a JuiceFS Enterprise setup fails, as everything has high availability and resilience built-in: the JuiceFS Enterprise metadata engine is at least a 3-node topology using Raft as the consensus algorithm, and the object store service is generally robust. On top of that, the mirror file system feature enables consistent replication across regions and clouds. In the unlikely event of a primary region failure (again, maybe the entire data center is gone), you still have one or more copies of the entire file system with all committed writes available, each with its metadata engine and object storage highly available.

Final thoughts

Stop trusting anything as a magic shield. The UniSuper incident proves that even some of the most trusted systems can lose everything due to human errors. Assume your entire cloud console could be deleted at any second. Thus, test your backups! A backup is just a file until you load it, and that's exactly why you should consider running juicefs load regularly in a test environment to rehearse the recovery process.

Follow the 3-2-1 rule: Keep at least three copies of your data, stored on two different media types, with one copy kept offsite. Your downtime matters, so prepare accordingly. And here is how JuiceFS features covered in this blog post can help make your infrastructure resilient:

Feature	JuiceFS edition	What it does	Best for
Metadata Backup	Community & Enterprise	Metadata backups saved to object storage with configurable frequency	Protecting against metadata engine loss
Data Synchronization	Community & Enterprise	One-time or periodic syncing between storage systems	Migrations and scheduled data syncing
Data Replication	Enterprise	Continuous async copy of object storage to another region with shared metadata	Active-passive disaster recovery and data sharing
Mirror File System	Enterprise	Writable mirror with local metadata service and local object storage	Cross-cloud, cross-region active-active deployments

No single feature fits every use case. Assess your recovery needs, complexity tolerance, and budget. Let's recall Murphy's Law: anything that can go wrong will go wrong. The question now is just whether you will be ready for it.

If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Discord.

Quota Design in Distributed Architectures: Implementation and Use Cases in JuiceFS

Thu, 30 Apr 2026 11:43:00 +0000

In distributed storage environments, storage resources are typically shared across multiple users, projects, and applications. Without effective constraint mechanisms, abnormal writes or erroneous operations from a single tenant can quickly consume large amounts of space or inodes, impacting system stability and cost control. Quota management provides a way to establish predictable resource boundaries in shared environments.

In distributed systems, quota management is far more than just "setting a limit." The system must balance concurrent writes from multiple clients, asynchronous metadata updates, and overall throughput. At the same time, quota rules must be enforced at different levels of control. To address this, JuiceFS provides multi-level quota capabilities covering the entire file system, directories, and users, supporting scenarios ranging from overall capacity control to individual and team-level constraints.

In this article, we’ll introduce the design and implementation of JuiceFS' quota mechanism, including its core data structures, synchronization model, and the validation and accounting logic in write and delete processes. We’ll also include typical use cases that highlight common issues around quota changes, space reclamation, and over-limit writes.

Quota types and resource dimensions supported by JuiceFS

JuiceFS quotas support two resource dimensions:

Space: Used storage capacity. Statistics are based on the file system's usage perspective and are aligned to block granularity. The write path section later will explain how incremental usage is estimated under 4 KiB alignment.
Inodes: Number of used inodes. For workloads with a large number of small files, inodes often become the constraint bottleneck earlier than space. Therefore, inode quotas must also be part of the management strategy.

Based on these two resource dimensions, JuiceFS currently supports four types of quotas.

Quota type	Scope	Design goal	Typical use case
Total file system quota	Entire file system	Prevents overall resource runaway	Cost budget control, capacity limit
Subdirectory quota	Directory subtree	Blocks abnormal write behavior	Prevents misoperations, small‑file storms
User quota	Per user	Isolates impact between different applications	Multi‑tenant data management
User group quota	Project or department	Cost allocation and team limits	Shared environment for AI projects

User quotas and user group quotas are expected to be released in JuiceFS Community Edition 1.4.

In practice, a common and effective strategy combines the following:

Total file system quota as a safety net.
Directory quotas to address individual abuse and small‑file storms.
User/group quotas for multi‑tenant management.

This layered approach controls overall resource limits while preventing abnormal growth of a single entity from affecting other workloads.

Quota implementation mechanism

Synchronization model and data structures

The main challenge of implementing quotas is how to perform checking, accounting, and convergence at an acceptable cost under concurrent writes from multiple clients. JuiceFS clients run on various nodes and continuously issue resource‑changing operations such as creation, writing, truncation, and deletion. If every operation required a strongly consistent server‑side check and update, the write path would incur unacceptable overhead.

Therefore, the quota mechanism must satisfy two goals:

Performance: Avoid a strongly consistent server‑side update on every write.
Consistency: Ensure that system usage eventually converges under concurrent writes from multiple clients and prevent over‑limit operations before they happen, as much as possible.

Based on this trade‑off, JuiceFS adopts a synchronization model that works as "local accumulation, periodic flush, and periodic refresh." Clients first accumulate resource deltas in local memory, with background tasks periodically persisting them to the metadata engine in batches. At the same time, each client periodically pulls the latest quota configuration and baseline usage from the server, gradually aligning its own global view. Clients do not communicate directly with each other; instead, the metadata engine serves as the central coordination point.

In other words, JuiceFS quotas do not pursue strong consistency on each operation but achieve eventually consistent resource control through periodic synchronization.

In the current implementation, quota deltas are flushed every 3 seconds (flushQuotas). Clients reload the latest quota configuration and baseline usage from the backend approximately every 12 seconds (via a refresh call triggered by the mount heartbeat). This means that under extreme conditions, the global views seen by different clients may diverge by up to about 12 seconds, but they will gradually converge in subsequent sync cycles.
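The "local accumulation, periodic flush, and periodic refresh" cycle can be sketched in a few lines of Go. This is a simplified model rather than the JuiceFS implementation: the `backend` and `client` types and their methods are hypothetical, the flush and refresh steps are called by hand instead of from 3-second and 12-second timers, and moving the flushed delta into the local baseline is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// backend stands in for the metadata engine, the authoritative usage record.
type backend struct {
	mu        sync.Mutex
	usedSpace int64
}

func (b *backend) add(delta int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.usedSpace += delta
}

func (b *backend) load() int64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.usedSpace
}

// client keeps a baseline copied from the backend plus an unflushed delta.
type client struct {
	mu        sync.Mutex
	usedSpace int64 // baseline as of the last flush or refresh
	newSpace  int64 // local delta not yet flushed
	be        *backend
}

// record accumulates a delta locally; no backend round trip is needed.
func (c *client) record(delta int64) {
	c.mu.Lock()
	c.newSpace += delta
	c.mu.Unlock()
}

// flush pushes the pending delta to the backend in one batch.
func (c *client) flush() {
	c.mu.Lock()
	delta := c.newSpace
	c.newSpace = 0
	c.usedSpace += delta
	c.mu.Unlock()
	if delta != 0 {
		c.be.add(delta)
	}
}

// refresh reloads the baseline, picking up other clients' flushed deltas.
func (c *client) refresh() {
	base := c.be.load()
	c.mu.Lock()
	c.usedSpace = base
	c.mu.Unlock()
}

// view is the usage this client currently believes to be true.
func (c *client) view() int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.usedSpace + c.newSpace
}

func main() {
	be := &backend{}
	a, b := &client{be: be}, &client{be: be}
	a.record(4096)
	b.record(8192)
	fmt.Println(a.view(), b.view(), be.load()) // 4096 8192 0
	a.flush()
	b.flush()
	fmt.Println(a.view(), b.view(), be.load()) // 4096 8192 12288
	a.refresh()
	b.refresh()
	fmt.Println(a.view(), b.view(), be.load()) // 12288 12288 12288
}
```

The middle line of output is the divergence window: both deltas have reached the backend, but neither client has yet seen the other's writes.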

Quota information is managed uniformly by the quota structure. It represents a single quota entity and can adapt to different types of managed objects such as directories, users, and user groups. Its core design decouples baseline usage from incremental usage:

UsedSpace/UsedInodes: Represents the baseline usage already persisted in the backend.
newSpace/newInodes: Represents the locally accumulated deltas on this client that have not yet been flushed to the backend.

type Quota struct {  
    MaxSpace, MaxInodes   int64  // Maximum space and inode limits  
    UsedSpace, UsedInodes int64  // Used space and inodes  
    newSpace, newInodes   int64  // Pending usage deltas to be synced  
}

For inode accounting, hard links require special attention. Different quota types have different counting semantics for hard links. For directory quotas, counting is based on directory entries: when a hard link is created under a directory, that directory's usage grows by one entry (its inode count increases by 1, and its space usage increases as well), and both decrease accordingly when the hard link is removed. For user quotas and user group quotas, counting is deduplicated by the file object (inode). Even if a file has multiple hard links, it’s counted only once per UID/GID dimension. Therefore, creating or deleting hard links does not change the usage for the associated user or user group.
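The two counting semantics can be contrasted in a small sketch. Everything here is hypothetical (the `accounting`, `file`, and `usage` types and their methods), and it assumes that a hard link charges the directory with the file's block-aligned size plus one inode, which the text above does not spell out.

```go
package main

import "fmt"

// usage is a pair of counters for one quota entity.
type usage struct{ space, inodes int64 }

func (u *usage) add(space, inodes int64) {
	u.space += space
	u.inodes += inodes
}

// file is a hypothetical inode: one owner, one size, any number of links.
type file struct {
	uid   uint32
	size  int64 // assumed to be block-aligned already
	nlink int
}

// accounting tracks usage per directory (by path) and per user (by UID).
type accounting struct {
	dirs  map[string]*usage
	users map[uint32]*usage
}

func newAccounting() *accounting {
	return &accounting{dirs: map[string]*usage{}, users: map[uint32]*usage{}}
}

func (a *accounting) dirUsage(path string) *usage {
	if a.dirs[path] == nil {
		a.dirs[path] = &usage{}
	}
	return a.dirs[path]
}

func (a *accounting) userUsage(uid uint32) *usage {
	if a.users[uid] == nil {
		a.users[uid] = &usage{}
	}
	return a.users[uid]
}

// create adds a new file: charged to both the directory and the owner.
func (a *accounting) create(dir string, f *file) {
	f.nlink = 1
	a.dirUsage(dir).add(f.size, 1)
	a.userUsage(f.uid).add(f.size, 1)
}

// link adds another directory entry for an existing file. Only the
// directory is charged, because the user dimension counts each inode once.
func (a *accounting) link(dir string, f *file) {
	f.nlink++
	a.dirUsage(dir).add(f.size, 1)
}

// unlink removes one entry; the owner is credited only when the last
// link disappears.
func (a *accounting) unlink(dir string, f *file) {
	f.nlink--
	a.dirUsage(dir).add(-f.size, -1)
	if f.nlink == 0 {
		a.userUsage(f.uid).add(-f.size, -1)
	}
}

func main() {
	a := newAccounting()
	f := &file{uid: 1000, size: 4096}
	a.create("/a", f)
	a.link("/b", f)
	fmt.Println(*a.dirUsage("/a"), *a.dirUsage("/b"), *a.userUsage(1000))
	// {4096 1} {4096 1} {4096 1}
	a.unlink("/a", f)
	fmt.Println(*a.dirUsage("/a"), *a.dirUsage("/b"), *a.userUsage(1000))
	// {0 0} {4096 1} {4096 1}
}
```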

Quota storage

Regarding the quota storage mechanism, the total file system quota (the global "red line") has its capacity and inode limits directly persisted in the metadata engine. Clients load this configuration during mount and enforce hard limits, ensuring the underlying resources are not exceeded.

In contrast, checks and delta accumulation for directory, user, and user group quotas rely more on the client side. Clients maintain in‑memory indexing structures keyed by inode, UID, and GID, and periodically synchronize the corresponding quota information from the backend. This keeps lookup overhead low in high‑frequency I/O scenarios. It’s important to emphasize that the client in‑memory state is only a runtime cache and incremental view; the authoritative source for quota configuration and baseline usage remains the metadata backend.

Quota checks

A synchronization model and data structures alone are not sufficient; quota logic must also be embedded into the specific resource‑changing paths. A single write operation may not be a simple data append; it can simultaneously involve inode creation, block allocation, directory entry changes, and parent‑directory statistics updates. Under multi‑client concurrency, these changes collectively affect the same set of quota constraints. Therefore, only by placing checks and statistics updates directly into the operation paths (write, create, truncate, and delete) can we avoid out‑of‑limit writes and statistical inaccuracies.

Pre‑write: incremental estimation and multi‑dimensional quota check

When a user initiates an operation that may change resource usage (such as write, create, and truncate), the client first estimates the expected resource delta, including both space and inode changes.

The space delta is estimated based on the actual allocation granularity of the underlying data blocks (for example, 4 KiB alignment), so it must be calculated in block-aligned units. Inode deltas primarily occur in creation operations (such as creating a new file or directory).
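As a worked illustration of block-aligned estimation, the sketch below rounds lengths up to 4 KiB before taking the difference. The function names `alignUp` and `spaceDelta` are chosen for this example and are not taken from the JuiceFS code base.

```go
package main

import "fmt"

// blockSize is the alignment granularity assumed in this sketch (4 KiB).
const blockSize int64 = 4 << 10

// alignUp rounds a byte length up to the next multiple of blockSize.
func alignUp(length int64) int64 {
	if length <= 0 {
		return 0
	}
	return (length + blockSize - 1) / blockSize * blockSize
}

// spaceDelta estimates the change in accounted space when a file's length
// changes from oldLen to newLen bytes.
func spaceDelta(oldLen, newLen int64) int64 {
	return alignUp(newLen) - alignUp(oldLen)
}

func main() {
	fmt.Println(spaceDelta(0, 1))       // 4096: a 1-byte file occupies one block
	fmt.Println(spaceDelta(4096, 4097)) // 4096: crossing a block boundary
	fmt.Println(spaceDelta(100, 3000))  // 0: growth within the same block
	fmt.Println(spaceDelta(8192, 0))    // -8192: truncating to empty
}
```

Note that a write that stays within an already accounted block produces a zero delta, so it can never trip a space quota on its own.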

After obtaining the resource delta for the operation, the client performs a quota check before actually writing. The check covers multiple dimensions: user and user group quotas, total file system quota, and directory quotas for the target directory tree. If any dimension would exceed its limit after this operation, the request is rejected with an error such as quota exceeded or out of space.

By placing the check in the write path before the resource change, the system can block risky operations before they happen, avoiding complex cleanup or rollback afterwards.

Post‑write: local delta accumulation and background batched sync

After a successful write, the resource delta generated by the operation is incorporated into the corresponding usage statistics and gradually aligns with the global state according to the defined convergence mechanism. Specifically, three categories of statistics are affected:

Global level: The overall file system usage increases (or decreases).
Directory level: The usage of the relevant directory subtree changes accordingly.
User / user group level: The usage of the corresponding subject also accumulates.

These updates are first reflected in the client’s local accumulated deltas and are not immediately written back to the backend in a strongly consistent way. Later, background tasks flush them in batches, and periodic refresh operations gradually align them with other clients, achieving global convergence.
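As a rough sketch of this "accumulate locally, flush in batches" pattern, the class below keeps per-key deltas in memory and hands them to a backend callback in one batch. The class name, the key layout, and the `flush_fn` callback are hypothetical; how JuiceFS actually schedules and persists these flushes is not shown here.

```python
import threading
from collections import defaultdict


class DeltaAccumulator:
    """Collects (space, inodes) deltas per key and flushes them in batches."""

    def __init__(self, flush_fn):
        self._lock = threading.Lock()
        self._pending = defaultdict(lambda: [0, 0])  # key -> [space, inodes]
        self._flush_fn = flush_fn                    # hypothetical backend writer

    def add(self, key, d_space: int, d_inodes: int) -> None:
        """Record a delta locally; no backend round trip on the write path."""
        with self._lock:
            entry = self._pending[key]
            entry[0] += d_space
            entry[1] += d_inodes

    def flush(self) -> int:
        """Swap out the pending map and send non-zero deltas as one batch."""
        with self._lock:
            batch, self._pending = dict(self._pending), defaultdict(lambda: [0, 0])
        batch = {k: tuple(v) for k, v in batch.items() if v != [0, 0]}
        if batch:
            self._flush_fn(batch)
        return len(batch)
```

A background thread or timer would call `flush()` periodically. Keys such as `("global",)`, `("dir", inode)`, and `("uid", 1000)` would cover the three categories listed above.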

Usage statistics (`stats`): foundation for the quota system

For quotas to work effectively, the system must be able to track current resource usage with low overhead. Whether for large directory trees or many users and user groups, if every check requires a real‑time full scan, the performance cost will be unacceptable. Therefore, an efficient and reliable usage statistics mechanism is a prerequisite for implementing quotas.

Directory statistics

Directory quotas constrain the total space and inode usage of an entire directory subtree, not the size of individual files. Consequently, they rely on directory‑level usage statistics.

It’s important to note that directory statistics (DirStats) and quota statistics have different scopes: DirStats only sums up the usage of immediate children (files and subdirectories) under a given directory – a single‑level statistic. In contrast, directory quotas recursively sum up the entire subtree. This design allows DirStats to be maintained with lower overhead, while directory quotas provide a full subtree view.
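The difference in scope can be illustrated with a toy in-memory tree. The `Dir` type is hypothetical, and for simplicity each subdirectory is counted as one inode with zero space, which is an assumption of this sketch rather than a statement about JuiceFS accounting.

```python
from dataclasses import dataclass, field


@dataclass
class Dir:
    """Toy directory node (hypothetical): files map name -> size in bytes."""
    files: dict[str, int] = field(default_factory=dict)
    subdirs: dict[str, "Dir"] = field(default_factory=dict)


def dir_stat(d: Dir) -> tuple[int, int]:
    """Single-level statistic: only the immediate children of d."""
    space = sum(d.files.values())
    inodes = len(d.files) + len(d.subdirs)
    return space, inodes


def subtree_usage(d: Dir) -> tuple[int, int]:
    """Quota-style usage: the single-level statistic summed over the whole subtree."""
    space, inodes = dir_stat(d)
    for sub in d.subdirs.values():
        sub_space, sub_inodes = subtree_usage(sub)
        space += sub_space
        inodes += sub_inodes
    return space, inodes
```

For a root holding one 100-byte file and one subdirectory with a 200-byte file, `dir_stat` reports only the 100 bytes and two entries, while `subtree_usage` reports all 300 bytes and three entries.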

The key to implementing such statistics is maintaining low overhead and high availability for large directory trees. JuiceFS follows the same approach as the quota mechanism: high‑frequency local updates and batched background persistence. Clients maintain directory usage deltas in memory; when operations such as writes or deletions occur, the changes are first recorded locally and then periodically synced in batches to the metadata engine by background tasks.

In addition, the system does not load all directory statistics at mount time. For large directory trees, a full load would cause significant latency and memory overhead. Therefore, directory statistics adopt an on‑demand fetch strategy: only when precise usage is required (such as quota checks, usage summarisation, and administrative queries) does the system load the statistics of the corresponding directory from the backend.

When users query usage information via df or an application calls statfs, JuiceFS makes a trade‑off between performance and accuracy:

It first uses locally cached used space and inodes for fast calculation.
If the local baseline is incomplete (for example, just after startup) or higher real‑time accuracy is needed, it fetches the latest global counters from the backend for calibration.
Finally, it adds locally accumulated (not yet synced) deltas to make the result more accurate for the current node’s write state.

After obtaining the used amounts, the client calculates total and avail based on whether a total capacity limit is configured:

If a limit is configured, total capacity equals that limit, and available capacity is "limit minus used."
If no limit is configured, it returns a dynamically estimated total capacity so that tools like df can display normally.

Moreover, when querying quotas from the root directory, the system displays the maximum space and inode limits, allowing administrators to see the global resource boundaries.

In addition, JuiceFS will support real‑time updates of directory statistics for the trash in version 1.4. When files are deleted (moved to the trash), restored from the trash, or permanently cleaned up, the system updates the trash directory’s statistics immediately. This enables administrators to accurately track space usage of the trash.

User and user group statistics

User and user group statistics are collected only after the corresponding quota feature is enabled. Before enabling, the updateUserGroupStat call in the kernel path returns directly without generating any statistics. After enabling, clients maintain usage data in an in‑memory map keyed by uid and gid and update the relevant statistics on all paths that may cause usage changes.

A special note: when setting a quota for a user or user group for the first time via juicefs quota set --uid or juicefs quota set --gid, the system immediately performs a full scan of existing files to initialise the baseline usage. After this initialisation, subsequent writes and deletions become incremental updates, and no further full scan is required.

Common scenarios

1. A file has been deleted. Why hasn’t the total file system quota usage decreased? Why hasn’t the object storage billing changed?

This is usually not a statistics error, but a result of file system semantics combined with the statistical model.

For example, after enabling the trash in JuiceFS, a deletion operation does not immediately free space. The file is first moved to the trash for possible recovery. Therefore, files in the trash are still counted in the total file system quota and user / user group quotas, but are no longer counted in the original directory quota.

Another common reason is the time lag between file system statistics and object storage billing. JuiceFS quota statistics use a local accumulation + periodic background sync model, so it’s possible that different clients or different statistical interfaces have not yet converged in a short time. At the same time, object storage may not have completed garbage collection or lifecycle cleanup. Therefore, temporarily seeing inconsistency between file system usage, quota statistics, and object storage billing is generally expected. This is not considered a system anomaly as long as they gradually converge over time.

In addition, note that quota and statfs show the file system perspective of space usage and availability, while object storage billing is based on the underlying object storage model – affected by factors such as chunking, merging, delayed reclamation, and lifecycle rules.

The two are not required to be the same.

2. Quota is full, but appending to an existing file did not report an error immediately.

This is often related to the asynchronous commit path in some JuiceFS writes. From the application’s perspective, the write system call may return success early, while the actual data commit and corresponding quota check happen later. Thus, appending may appear to "succeed," but the data may not be fully persisted; if the later commit stage determines that the quota would be exceeded, the write may still fail.

In other words, a successful write return does not guarantee that the write has been finally committed. In scenarios involving quota limits, a safer approach is to check the return status of close, verify the final file size, and handle any errors accordingly.
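From the application side, that defensive pattern could look like the sketch below. It uses only standard POSIX calls exposed by Python's `os` module; the helper name `append_checked` is made up for this example, and the size comparison assumes no other process appends to the same file at the same time.

```python
import errno
import os


def append_checked(path: str, data: bytes) -> int:
    """Append data and surface errors from write, fsync, and close.

    Returns the final file size. Raises OSError (for example EDQUOT or ENOSPC)
    if any stage fails, or EIO if the file did not grow by len(data).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        size_before = os.fstat(fd).st_size
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)   # may write fewer bytes than requested
            view = view[written:]
        os.fsync(fd)                       # force the commit stage before returning
        size_after = os.fstat(fd).st_size
    finally:
        os.close(fd)                       # close can also raise; let it propagate
    if size_after - size_before != len(data):
        raise OSError(errno.EIO, "file did not grow by the expected amount", path)
    return size_after
```

The point of the sketch is that an error may surface at `fsync` or `close` rather than at `write`, so all three need to be inside the error-handling path.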

3. Quota is not yet full, but file creation fails.

This phenomenon is usually related to temporary view divergence under the eventually consistent statistical model.

Example: a volume has a total quota of 2,000 inodes, and there are currently 1,999 files. One more file should be creatable. However, in extreme concurrency or unusual refresh timing, the client’s local cache may diverge briefly from the backend baseline count. This may cause the in‑memory used inode count to be temporarily too high, thus rejecting a legitimate creation request.

This type of problem stems from the local accumulation + periodic sync convergence model. It avoids the high overhead of strongly consistent backend updates on every operation, but in extreme cases the system may produce short‑term false positives.

Typically, such false positives disappear with the next sync cycle, and retries can mitigate the issue.

This also illustrates that, in a distributed environment, quotas are best understood as an efficient, near‑real‑time constraint mechanism, not a fully synchronous, strongly consistent judgement for every concurrent operation.

4. After a write exceeds the quota, why does the "failed" file remain in the directory?

This is not unique to JuiceFS; it is common in file systems that follow POSIX semantics.

For example, a user sets a 1 GiB quota on a directory and then tries to write a 2 GiB file using dd. The file system first allows the first 1 GiB of valid writes; only when the subsequent write exceeds the quota does it return “Disk quota exceeded.” Consequently, a "partial file" of about 1 GiB is left behind. This does not indicate abnormal behaviour. It simply means the first part of the data was written successfully, while the remainder failed due to the quota.

The file system's responsibility is to report the error, not to decide whether to delete the successfully written data. Whether to clean up such an incomplete file is left to the application. This follows standard POSIX semantics: the file system returns the error, and the application handles subsequent cleanup and recovery.

Summary

In a distributed file system, quotas are not a simple "counter" feature, but a system design that must balance performance, consistency, and management granularity. Through pre‑write checks, local accumulation, and periodic background synchronization, JuiceFS minimizes overhead on the write path while allowing various usage statistics to gradually converge under an eventual consistency model. Based on this mechanism, quota control covers not only total file system capacity, but also multiple levels such as directories, users, and user groups, thereby meeting the needs of typical scenarios including multi‑tenant isolation, individual constraints, and team‑level resource management.

If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Discord.

JuiceFS Performance Optimization for AI Scenarios

Wed, 15 Apr 2026 07:22:00 +0000

The scale of computing power for large language model (LLM) training continues to expand. While GPU performance keeps improving, data access bottlenecks are becoming increasingly prominent in overall system performance. Local storage offers excellent performance but has limited scalability. Object storage excels in cost and scalability but suffers from insufficient throughput in massive small‑file and high‑concurrency scenarios. Teams often struggle to choose between them.

Therefore, distributed file systems have become a key solution to balance high performance and scalability. JuiceFS has been widely deployed in AI scenarios across multiple industries. Its distributed architecture delivers high performance, strong scalability, and low cost simultaneously for large‑scale data access.

In this article, we’ll introduce JuiceFS’ architecture from a performance perspective and analyze core performance bottlenecks and optimization methods under different access patterns. We’ll also link to further reading on key points to help you understand JuiceFS’ performance mechanisms and master common tuning strategies.

Performance foundations from the JuiceFS architecture

JuiceFS comes in Community Edition and Enterprise Edition. Both share the same architecture: metadata and data are separated. The client adopts a rich‑client design, handling core logic including some metadata operations, and provides both metadata and data caching. These modules work together for efficient data location and access. The underlying data is stored in object storage, with local caches further improving access performance. For external interfaces, JuiceFS supports multiple access methods – FUSE is the most common, and it also provides various SDKs and an S3 gateway.

JuiceFS Community Edition is designed as a general‑purpose file system. Users can choose different metadata engines based on their needs. For small‑scale deployments, Redis delivers lightweight, low‑latency metadata management. For large‑scale file scenarios, TiKV provides good horizontal scalability.

JuiceFS Enterprise Edition targets complex, high‑performance scenarios. It differs from Community Edition in two ways:

It uses a self‑developed multi‑zone metadata engine built on Raft that runs as an in‑memory cluster, offering low latency and strong horizontal scalability. It supports up to 500 billion files. Operations that require multiple key-value requests in the Community Edition often need only one or two in the Enterprise Edition, and complex logic can be processed inside the metadata cluster.
The Enterprise Edition supports distributed cache sharing: clients in the same group can access each other’s local caches via consistent hashing. This improves cache hit rates and access efficiency. In multi‑node, high‑concurrency scenarios, the cache space scales horizontally, and most required data can be warmed up before job execution. This accelerates AI training and inference while boosting performance and stability. See JuiceFS Enterprise 5.3: 500B+ Files per File System & RDMA Support.

image: JuiceFS Community Edition and Enterprise Edition architectures
caption: JuiceFS Community Edition and Enterprise Edition architectures

Data chunking

JuiceFS splits data into chunks and stores them in object storage. This design is key to its performance, affecting data read efficiency, cache hit rate, and throughput under high concurrency.

JuiceFS breaks a file into multiple chunks. Inside each chunk, the system maintains a management structure called a slice to track writes and updates. When data is written, new data does not overwrite existing slices; instead, a new slice is appended on top of the chunk.

image: chunk
caption: How JuiceFS stores data

Ideally, each chunk ends up containing only one slice. Each slice consists of several 4 MB blocks, which are the smallest unit stored in object storage. By default, the caching system also manages data at the block level.

image: block
caption: How JuiceFS stores data

As shown in the diagram above, file updates use an append‑only write pattern: existing slices are shown in red, and new data is appended as a new slice. During reads, the system combines the slices to form the current view. When fragmentation becomes excessive, a compaction process merges slices to optimize access performance. For more details on data chunking, refer to Code-Level Analysis: Design Principles of JuiceFS Metadata and Data Storage.
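To make the "later slices shadow earlier ones" idea concrete, here is a small sketch that resolves the current view of one chunk from its append-only slice list. The tuple layout and function names are invented for illustration, and the 64 MiB chunk size is an assumption taken from JuiceFS's documented default rather than from this article.

```python
CHUNK_SIZE = 64 << 20   # assumed default chunk size (64 MiB), not stated in this article


def locate_chunk(file_offset: int) -> tuple[int, int]:
    """Map a file offset to (chunk index, offset inside that chunk)."""
    return divmod(file_offset, CHUNK_SIZE)


def resolve_view(slices: list[tuple[int, int, int]]) -> list[tuple[int, int, int]]:
    """Resolve which slice serves each byte range of a chunk.

    slices: (slice_id, offset_in_chunk, length) in write order, oldest first.
    Returns (start, end, slice_id) segments; newer slices win on overlap.
    """
    # Every slice boundary is a potential change of owner.
    points = sorted({p for _, off, ln in slices for p in (off, off + ln)})
    view: list[tuple[int, int, int]] = []
    for start, end in zip(points, points[1:]):
        owner = None
        for slice_id, off, ln in slices:       # later entries overwrite earlier ones
            if off <= start and end <= off + ln:
                owner = slice_id
        if owner is None:
            continue                           # a hole: nothing written here
        if view and view[-1][2] == owner and view[-1][1] == start:
            view[-1] = (view[-1][0], end, owner)   # extend the previous segment
        else:
            view.append((start, end, owner))
    return view
```

The more segments a read has to stitch together, the more work each read does, which is why compaction that rewrites many slices into one helps read performance.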

Caching

Much of JuiceFS’ performance advantage over direct object storage access comes from its caching mechanism. The JuiceFS client comes with a high‑performance local cache module. Key configuration options include:

cache-dir: specifies the cache directory.
cache-size: sets the maximum cache space.
prefetch: controls prefetching in the cache module. When a request hits a block, a background thread fetches the entire block.
Write‑back related settings: improve write IOPS by first writing data blocks that need to be uploaded into the local cache, then uploading them to object storage asynchronously.

JuiceFS Enterprise Edition also provides advanced configurations. For example, a cache group can be used to designate a set of clients whose local caches form a distributed cache group, enabling cache sharing. In addition, the no sharing option (related to cache groups) allows a client to read data only from a specified cache group without serving its own cache to others. This creates a two‑level cache:

The first level is the local cache.
The second level is the cache on other nodes in the group.

Another performance‑boosting mechanism is the memory buffer (read buffer), which provides:

I/O request merging: multiple consecutive I/O requests can be merged in memory. For example, three I/O requests issued by the system may be reduced to just one after being processed by the memory buffer.
Adaptive read‑ahead: in large‑file sequential read scenarios, adaptive read‑ahead increases request concurrency by prefetching data. This fully utilizes cache and object storage resources and improves overall I/O performance.
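The request-merging idea can be shown with a few lines of code. This is a generic range-coalescing sketch, not the JuiceFS read buffer itself; requests are modeled as hypothetical `(offset, length)` pairs.

```python
def merge_requests(requests: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Coalesce contiguous or overlapping (offset, length) read requests."""
    merged: list[tuple[int, int]] = []
    for off, length in sorted(requests):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            # Touches or overlaps the previous range: extend it.
            last_off, last_len = merged[-1]
            merged[-1] = (last_off, max(last_len, off + length - last_off))
        else:
            merged.append((off, length))
    return merged
```

Three back-to-back 4 KiB requests starting at offset 0 collapse into a single 12 KiB request, matching the "three requests become one" example above, while requests separated by a gap stay separate.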

The Enterprise Edition also offers advanced read‑ahead settings:

max read ahead: sets the maximum read‑ahead range.
initial read ahead: sets the initial read‑ahead window size (default unit is 4 MB blocks).
read ahead ratio: a configuration added last year that controls the read‑ahead ratio for large‑file random reads, reducing bandwidth waste caused by read amplification. Overly aggressive read‑ahead can negatively impact random read performance; read ahead ratio helps mitigate this. In AI scenarios, when large‑file sequential or random reads cause bandwidth or IOPS bottlenecks, adjusting these parameters can optimize overall performance.

JuiceFS benchmark I/O tests and bottleneck analysis

Before diving into performance tuning for common AI scenarios, let’s first examine JuiceFS’ I/O behavior under ideal conditions through sequential and random read benchmarks. This helps us understand throughput and latency under different access patterns, providing a reference for the read/write patterns of subsequent AI/ML workloads.

Sequential read performance

In JuiceFS, sequential read performance is typically bandwidth‑bound. In cold read scenarios, performance is mainly limited by object storage bandwidth; in distributed cache scenarios, network bandwidth can become the bottleneck. For example, a node with a 40 Gbps NIC achieves less than 5 GB/s of usable bandwidth. In addition, the user‑kernel transition overhead in the FUSE layer limits single‑thread throughput. Tests showed single‑thread sequential read bandwidth of around 3.5 GB/s. To break this limit, multi‑threaded or higher‑concurrency strategies are needed to fully utilize storage and network resources.

The table below shows test results of JuiceFS sequential read performance:

Threads	Bandwidth (GB/s)	Bandwidth per thread (GB/s)
1	3.5	3.5
2	6.3	3.15
3	9.5	3.16
4	9.7	2.43
6	14.0	2.33
8	17.0	2.13
10	18.6	1.9
15	21	1.4

In the performance test, single‑thread sequential read bandwidth was about 3.5 GB/s. As the number of threads increased, total throughput gradually approached the network bandwidth limit. To help users evaluate the performance ceiling of their own environment, JuiceFS provides the juicefs objbench subcommand for testing object storage bandwidth.
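One way to read the table is to compare each row against perfect linear scaling from the single-thread result. The snippet below does that with the figures from the table; the "scaling efficiency" metric is an illustrative choice for this sketch, not something the benchmark itself reports.

```python
# Threads -> total bandwidth in GB/s, copied from the table above.
RESULTS = {1: 3.5, 2: 6.3, 3: 9.5, 4: 9.7, 6: 14.0, 8: 17.0, 10: 18.6, 15: 21.0}


def scaling_efficiency(results: dict[int, float]) -> dict[int, float]:
    """Measured bandwidth divided by (threads x single-thread bandwidth)."""
    base = results[1]
    return {n: round(bw / (n * base), 2) for n, bw in results.items()}


if __name__ == "__main__":
    for threads, eff in scaling_efficiency(RESULTS).items():
        print(f"{threads:>2} threads: {eff:.0%} of linear scaling")
```

Efficiency falls steadily as threads are added, which is consistent with total throughput approaching a shared limit rather than each thread being individually constrained.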

In real workloads, caching is more common than direct object storage access. In such cases, increasing the buffer size raises the number of background prefetch requests, thereby improving concurrency and overall throughput. For example, after increasing the buffer size to 400 MB (corresponding to 100 background prefetch requests of 4 MB each), concurrency improved significantly and overall throughput increased.

Random read performance

Low‑concurrency random reads

In low‑concurrency, non‑asynchronous access scenarios, each request must wait for the previous one to complete before being issued. As a result, latency has a significant impact on overall performance. I/O latency can come from many sources, including metadata query latency, object storage access latency, and local or distributed cache read latency. When analyzing random read performance, we must closely examine these latency factors.

In a 4 KB cold random read scenario, if the IOPS is only 8 and object storage latency is about 125 ms, the effective concurrency is roughly 1 (8 requests/s × 0.125 s per request = 1 request in flight).

This indicates a near‑single‑concurrent, serial‑blocked state. In such cases, the optimization focus should be on shortening the access path and reducing per‑request latency rather than increasing concurrency – for example, by warming up data into the local cache. After data warm-up, the random read path switches from object storage to local cache, and IOPS can increase to about 12,000, approaching the I/O level of a local disk.
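The relationship used here is Little's law: requests in flight equal throughput multiplied by latency. A tiny helper makes the arithmetic explicit; the function names are just for this example.

```python
def effective_concurrency(iops: float, latency_s: float) -> float:
    """Little's law: average requests in flight = throughput x latency."""
    return iops * latency_s


def iops_ceiling(concurrency: float, latency_s: float) -> float:
    """Best-case IOPS for a given number of in-flight requests and per-request latency."""
    return concurrency / latency_s


if __name__ == "__main__":
    # Cold random read from the example: 8 IOPS at 125 ms per request.
    print(effective_concurrency(8, 0.125))
    # The same single stream against a faster path, for comparison.
    print(iops_ceiling(1, 0.001))
```

If the workload stays at a concurrency of about 1 after warm-up, reaching roughly 12,000 IOPS implies a per-request latency on the order of 1/12,000 s, or about 83 µs, which is why shortening the access path matters more than adding threads in this case.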

image: Viewing performance with the juicefs stats command
caption: Using the juicefs stats command to view performance

image: Performance after warm-up
caption: Performance after data warm-up

High‑concurrency random reads

High‑concurrency random reads typically occur in scenarios with high thread counts or asynchronous I/O. The main performance bottleneck is often IOPS limits – including metadata IOPS, object storage IOPS, and cache IOPS. JuiceFS allows you to observe these metrics and pinpoint the bottleneck. Client machine resources (CPU, memory) can also affect performance, but such bottlenecks are easy to monitor.

In a cold read scenario using libaio for random reads, the IOPS ceiling on the object storage side is around 7,000. When caching is enabled and data is warmed up, the access path shifts from object storage to the cache layer, and IOPS can further increase to over 20,000. This shows that the bottleneck for high‑concurrency random reads shifts as the access path changes.

image: Before warm-up
caption: Before data warm-up

image: After warm-up
caption: After data warm-up

For a deeper dive into JuiceFS’ complete data access path, refer to Optimizing JuiceFS Read Performance: Readahead, Prefetch, and Cache.

I/O characteristics and performance tuning for common AI scenarios

Large‑file sequential reads

A typical large‑file sequential read scenario is model loading, such as loading PyTorch .pt files saved via pickle serialization. In this process, performance is limited by two factors:

Pickle deserialization efficiency determines data processing speed.
Data reading is usually single‑threaded and limited by FUSE bandwidth and CPU performance.

To increase throughput, you can raise concurrency through multi‑threaded or sharded loading, fully utilizing I/O capacity. For large‑file sequential reads, the best performance is achieved when the entire dataset can be cached locally. If only on‑demand reading is required, the implementation is simple.

For more details on optimizing large‑file sequential reads, see How JuiceFS Transformed Idle Resources into a 70 GB/s Cache Pool.

Massive small files

In computer vision and multimodal tasks, training datasets often consist of many individual files, for example, single images, video frames, or text annotations. Such massive small‑file scenarios place heavy pressure on metadata services.

In massive small-file scenarios, metadata performance is critical. On one hand, each file carries only a small amount of data; on the other hand, directory metadata access efficiency is low when a directory holds a huge number of small files.

For read‑only workloads, enabling client metadata caching and extending the cache lifetime can improve performance.

Moreover, the data read layer experiences higher IOPS pressure because small files cannot take advantage of read‑ahead. This makes requests more fragmented. Common optimizations include increasing local cache capacity; for the Enterprise Edition, you can also scale out the distributed cache cluster horizontally. Because small files derive little benefit from read‑ahead, their latency tends to be higher.

For performance tuning in this scenario, see How D-Robotics Manages Massive Small Files in a Multi-Cloud Environment with JuiceFS.

Large‑file random reads

This scenario is common in AI training, for example, when randomly accessing datasets in TFRecord, HDF5, or LMDB format by sample. Take model loading: if the dataset is accessed randomly and each read size equals the sample size (for example, 1 MB to 4 MB images or short videos), read‑ahead can waste bandwidth. Such scenarios can often break through IOPS bottlenecks by increasing concurrency.

Recommended measures include:

Increase the number of data‑loading reader threads.
Use asynchronous I/O to raise concurrency and saturate IOPS.
Improve the caching system, for example, pre‑map data into cache to boost underlying IOPS.
Adjust the read ahead ratio parameter (for example, set it to 0.5) to reduce bandwidth waste from read‑ahead. For instance, a 4 MB sequential read would previously prefetch 4 MB; after adjustment, only 2 MB is prefetched.
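The first two measures amount to keeping more reads in flight at once. A minimal sketch of that idea with a thread pool and positional reads is shown below; the helper name, the `(offset, length)` sample layout, and the default of 16 workers are illustrative assumptions, and `os.pread` is only available on Unix-like systems.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_samples(path: str, samples: list[tuple[int, int]], workers: int = 16) -> list[bytes]:
    """Read many (offset, length) samples from one large file concurrently.

    Results are returned in the same order as the input samples.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pread takes an explicit offset, so threads can share one descriptor.
            return list(pool.map(lambda s: os.pread(fd, s[1], s[0]), samples))
    finally:
        os.close(fd)
```

In a real training job the same effect usually comes from raising the number of data-loader workers in the framework; the sketch only shows why more concurrent readers translate into more outstanding I/O requests.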

In this article, we’ve analyzed JuiceFS’ architecture from a performance perspective, covered benchmark I/O tests, and discussed tuning methods for typical AI scenarios. This provides an introductory reference for system performance. JuiceFS has been deployed in many production environments, and its distributed architecture offers a feasible balance between performance and cost.

If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Discord.