# Data migration
`juicefs sync` is a powerful data migration tool that can copy data across all supported storage systems, including object storage, JuiceFS itself, and local file systems; you can freely copy data between any of them. It also supports remote directories accessed through SSH, HDFS, WebDAV, etc., while providing advanced features such as incremental synchronization, pattern matching (like rsync), and distributed syncing.

`juicefs sync` shares code between the Community Edition and the Enterprise Edition, which is why `sync` continues to work normally even if you use the other edition, with one exception involving the `jfs://` protocol: CE and EE use different metadata implementations, so whenever `jfs://` is involved, you cannot mix the two editions.
`juicefs sync` works like this:

```shell
juicefs sync [command options] SRC DST

# Sync objects from OSS to S3
juicefs sync oss://mybucket.oss-cn-shanghai.aliyuncs.com s3://mybucket.s3.us-east-2.amazonaws.com

# Sync objects from S3 to JuiceFS
juicefs sync s3://mybucket.s3.us-east-2.amazonaws.com/ jfs://VOL_NAME/

# SRC: a1/b1,a2/b2,aaa/b1   DST: empty   sync result: aaa/b1
juicefs sync --exclude='a?/b*' s3://mybucket.s3.us-east-2.amazonaws.com/ jfs://VOL_NAME/

# SRC: a1/b1,a2/b2,aaa/b1   DST: empty   sync result: a1/b1,aaa/b1
juicefs sync --include='a1/b1' --exclude='a[1-9]/b*' s3://mybucket.s3.us-east-2.amazonaws.com/ jfs://VOL_NAME/

# SRC: a1/b1,a2/b2,aaa/b1,b1,b2   DST: empty   sync result: a1/b1,b2
juicefs sync --include='a1/b1' --exclude='a*' --include='b2' --exclude='b?' s3://mybucket.s3.us-east-2.amazonaws.com/ jfs://VOL_NAME/
```
## Pattern matching
When composing a sync command, keep in mind that `SRC` and `DST` paths ending with a trailing `/` are treated as directories (e.g. `movie/`). Those without a trailing `/` are treated as prefixes and used for pattern matching. For example, assuming we have `test` and `text` directories in the current directory, the following command synchronizes both of them into the destination `~/mnt/`:

```shell
juicefs sync ./te ~/mnt/
```

The subcommand `sync` takes `te` as a prefix to find all matching directories, i.e. `test` and `text`, and copies them to `~/mnt/`.
You can also change the destination to add a prefix, effectively renaming while copying:

```shell
# The `test` directory synchronized to the destination will be renamed `abst`, and `text` will become `abxt`
juicefs sync ./te ~/mnt/ab
```
Similar to rsync, you can use `--exclude` and `--include` patterns to build filters:

- Patterns ending with `/` only match directories; otherwise, they match files, links or devices.
- Patterns containing `*`, `?` or `[` match as wildcards, otherwise they match as regular strings; `*` matches any non-empty path component (it stops at `/`); `?` matches any single character except `/`; `[` matches a set of characters, for example `[a-z]` or `[[:alpha:]]`.
- Backslashes can be used to escape characters in wildcard patterns, while they match literally when no wildcards are present.
- Matching is always performed recursively, with patterns treated as prefixes.
- Earlier options have higher priority than later ones, so `--include` options should come before `--exclude`. Otherwise, any `--include` options (such as `--include 'pic/' --include '4.png'`) that appear after `--exclude '*'` will be ignored.
To sync everything but exclude hidden directories and files (names starting with `.` are regarded as hidden):

```shell
# Exclude hidden directories and files
juicefs sync --exclude '.*' /tmp/dir/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/
```

You can use this option multiple times with different parameters to exclude multiple patterns. For example, the following command excludes all hidden files, the `pic/` directory, and `4.png`:

```shell
juicefs sync --exclude '.*' --exclude 'pic/' --exclude '4.png' /tmp/dir/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com
```

The `--include` option can be used to include patterns you don't want to exclude. For example, to synchronize only `pic/` and `4.png` and exclude everything else:

```shell
juicefs sync --include 'pic/' --include '4.png' --exclude '*' /tmp/dir/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com
```
Pattern matching can also be used to delete specific files from the storage system. The trick is to use an empty local directory as `SRC`.

The demonstrations below all add the `--dry --debug` options as a precaution, so the commands do not actually delete any files but only print the execution plan. After verifying the plan, remove these two options to actually execute it:

```shell
mkdir empty-dir

# Delete all objects in mybucket, but keep files with the .gz suffix
juicefs sync ./empty-dir/ s3://xxx:xxx@mybucket.s3.us-east-2.amazonaws.com/ --delete-dst --exclude=*.gz --dry --debug

# Delete all files with the .gz extension in mybucket
juicefs sync ./empty-dir/ s3://xxx:xxx@mybucket.s3.us-east-2.amazonaws.com/ --delete-dst --include=*.gz --dry --debug
```
## Sync behavior

### Sync without a mount point
For data migrations that involve JuiceFS, it is recommended to use the `jfs://` protocol rather than mounting JuiceFS and accessing its local directory; this bypasses the FUSE mount point and accesses JuiceFS directly. The process still requires the client configuration file, which you should prepare in advance using `juicefs auth`. At large scale, bypassing FUSE saves precious resources and increases performance. When using the `jfs://` protocol, you can also pass in parameters of the `juicefs mount` command to improve performance, such as `--max-downloads`, `--max-uploads`, `--buffer-size`, etc.
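For instance, a minimal sketch assuming an Enterprise Edition volume named `myvol` and placeholder credentials (adjust the `juicefs auth` arguments to your environment):

```shell
# Prepare the client configuration file for the volume (token and keys are placeholders)
juicefs auth myvol --token YOUR_TOKEN --access-key ACCESS_KEY --secret-key SECRET_KEY

# Sync into the volume directly via jfs://, passing mount parameters to tune performance
juicefs sync --threads 20 --max-uploads 30 --buffer-size 300 \
  s3://ABCDEFG:HIJKLMN@mybucket.s3.us-east-2.amazonaws.com/ jfs://myvol/data/
```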
### Incremental and full synchronization
By default, `juicefs sync` performs incremental synchronization: existing files are only overwritten when their sizes differ. On top of this, you can use `--update` to also copy files whose source `mtime` is newer. For scenarios with higher demand for data integrity, use `--check-new` or `--check-all` to perform byte stream comparison between SRC and DST.

For full synchronization (copy everything, whether it already exists or not), use `--force-update`.
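For example (bucket and volume names below are placeholders):

```shell
# Incremental sync; additionally overwrite destination files when the source mtime is newer
juicefs sync --update s3://ABCDEFG:HIJKLMN@mybucket.s3.us-east-2.amazonaws.com/ jfs://myvol/backup/

# Incremental sync with byte stream comparison of all files, for stronger integrity guarantees
juicefs sync --check-all s3://ABCDEFG:HIJKLMN@mybucket.s3.us-east-2.amazonaws.com/ jfs://myvol/backup/

# Full sync: copy everything regardless of what already exists at the destination
juicefs sync --force-update s3://ABCDEFG:HIJKLMN@mybucket.s3.us-east-2.amazonaws.com/ jfs://myvol/backup/
```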
### Directory structure and file permissions
`juicefs sync` skips empty directories by default. To synchronize empty directories, use the `--dirs` option.

In addition, when synchronizing between file systems such as local, SFTP and HDFS, the `--perms` option can be used to synchronize file permissions.
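For example, copying a local tree to an SFTP server while preserving empty directories and file permissions (paths and host are placeholders):

```shell
juicefs sync --dirs --perms /data/src/ user@192.168.1.10:/data/dst/
```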
### Copy symbolic links
You can use the `--links` option to disable symbolic link resolving when synchronizing local directories, i.e. to synchronize only the symbolic links themselves rather than the directories or files they point to. The new symbolic links created by the synchronization refer to the same paths as the original ones, without any conversion, regardless of whether those paths are reachable before or after the synchronization. Also note that:

- The `mtime` of a symbolic link will not be synchronized;
- The `--check-new` and `--perms` options are ignored when synchronizing symbolic links.
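For example (paths are placeholders):

```shell
# Copy the symbolic links themselves instead of the files or directories they point to
juicefs sync --links /data/src/ /data/dst/
```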
### Data sync and compaction
For sequential write scenarios, always try to guarantee that every write file handler has at least 4M (the default block size) of buffer available. If write concurrency is too high or the buffer too small, the client cannot maintain the desired "write in large chunks" pattern and can only write small slices, which, combined with compaction, can seriously deteriorate performance due to write amplification.

Compaction traffic can be monitored using `juicefs_compact_bytes`, which is already graphed in our Web Console and in on-prem Grafana, within the object storage section. If compaction traffic is substantial during a `sync` operation, the following performance tuning measures are recommended:
- If object storage bandwidth is limited, avoid setting high concurrency (`--threads`). Instead, start with low concurrency and gradually increase it until you reach the desired speed.
- When the destination is a JuiceFS file system, use the `jfs://` protocol, because it bypasses the FUSE mount point (reducing overhead) and is already optimized for file fragmentation problems. See the next point for details.
- If you must use a FUSE mount point for data migration, for large file transfers it is recommended to add the `--flush-wait=60s` option. This effectively helps reduce the number of slices and mitigates write amplification. For more information, see troubleshooting write amplification.
- When the destination is a JuiceFS file system, ensure the destination has sufficient available buffer capacity. Each write file handler must have at least 4MB of reserved memory, which means `--buffer-size` should be at least 4 times the `--threads` value; if higher write concurrency is needed, consider setting it to 8 or 12 times that value. Depending on the destination file system's deployment model, buffer size is configured in different places (see the sketch after this list):
  - When the destination uses the `jfs://` protocol, the JuiceFS client is part of the `juicefs sync` command itself. In this case, `--buffer-size` needs to be appended to the `juicefs sync` command.
  - When the destination is a FUSE mount point, the JuiceFS client runs as the `juicefs mount` process on the host machine. In this case, `--buffer-size` needs to be added directly to the mount command.
- If you need to limit bandwidth via `--bwlimit`, you must also lower the `--threads` value to avoid write fragmentation caused by concurrency congestion. Since storage systems come with different performance levels, exact calculations cannot be provided here; it is recommended to start with low concurrency and adjust as needed.
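The sketch below illustrates the buffer sizing rule in the two deployment models; the volume name, bucket and numbers are placeholders, assuming an Enterprise Edition volume named `myvol`:

```shell
# Destination uses jfs://: the JuiceFS client lives inside the sync process,
# so --buffer-size is appended to the sync command.
# 50 threads x 4 MB per write handler = 200 MB minimum; 12x the thread count gives 600 MB.
juicefs sync --threads 50 --buffer-size 600 \
  s3://ABCDEFG:HIJKLMN@mybucket.s3.us-east-2.amazonaws.com/ jfs://myvol/data/

# Destination is a FUSE mount point: the JuiceFS client is the mount process,
# so --buffer-size is added to the mount command instead.
juicefs mount myvol /jfs --buffer-size 600
juicefs sync --threads 50 \
  s3://ABCDEFG:HIJKLMN@mybucket.s3.us-east-2.amazonaws.com/ /jfs/data/
```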
## Accelerate synchronization
`juicefs sync` starts 10 threads by default to run syncing jobs; you can set the `--threads` option to increase or decrease the number of threads as needed (see the example after the list below). Also note that, due to resource limits on a single node, blindly increasing `--threads` may not always help and can cause OOM instead. You should also consider:

- The `SRC` and `DST` storage systems may have already reached their bandwidth limits. If this is indeed the bottleneck, further increasing concurrency will not improve the situation.
- Running `juicefs sync` on a single host may be limited by host resources, e.g. CPU or network throttling. If this is the case, consider the following:
  - If there is a higher spec machine available in your environment, preferably with better CPU or bandwidth, use that node to run `juicefs sync` and access the source data via SSH, for example `juicefs sync root@src:/data /jfs/data`;
  - Use distributed synchronization (introduced below).
- If the synchronized data is mainly small files, and the `list` API of the `SRC` storage system has excellent performance, then the default single-threaded `list` of `juicefs sync` may become a bottleneck. Consider enabling concurrent `list` (introduced below).
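For example, raising concurrency from the default 10 threads to 50 (endpoints are placeholders); watch CPU and memory usage while doing so:

```shell
juicefs sync --threads 50 \
  s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/ oss://ABCDEFG:HIJKLMN@bbb.oss-cn-hangzhou.aliyuncs.com/
```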
### Concurrent list
From the output of `juicefs sync`, pay attention to the `Pending objects` count. If this value stays at zero, consumption is faster than production and you should increase `--list-threads` to enable concurrent `list`, and then use `--list-depth` to control the `list` depth.

For example, if you are dealing with an object storage bucket used by JuiceFS, the directory structure will be `/<vol-name>/chunks/xxx/xxx/...`, so using `--list-depth=2` will perform concurrent listing on `/<vol-name>/chunks`, which usually renders the best performance.
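A combined example (bucket and volume names are placeholders):

```shell
# List the source with 10 concurrent threads, parallelizing down to the second path level
juicefs sync --list-threads=10 --list-depth=2 \
  s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/ jfs://myvol/
```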
## Distributed synchronization
Synchronizing between two object storages is essentially pulling data from one and pushing it to the other; the efficiency of the synchronization depends on the bandwidth between the client and the cloud.

When copying data at large scale, node bandwidth can easily become the bottleneck. For this scenario, `juicefs sync` provides a multi-machine concurrent solution, as shown in the figure below.

The manager node executes the `sync` command as the master and defines multiple worker nodes with the `--worker` option (the manager node itself also serves as a worker). JuiceFS splits the workload and distributes it to the workers for distributed synchronization. This increases the amount of data that can be processed per unit of time, and the total bandwidth is multiplied as well.

When using distributed syncing, configure SSH logins so that the manager can access all worker nodes without a password. If the SSH port is not the default 22, you will also have to specify it in the manager's `~/.ssh/config`. The manager distributes the JuiceFS client to all worker nodes, so they should all use the same CPU architecture to avoid compatibility problems.
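For example, if the worker `tom@192.168.8.10` listens on a non-default SSH port, a manager-side `~/.ssh/config` entry might look like this (port and key path are placeholders):

```
Host 192.168.8.10
    User tom
    Port 2222
    IdentityFile ~/.ssh/id_rsa
```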
For example, to synchronize data between two object storage services:

```shell
juicefs sync --worker bob@192.168.1.20,tom@192.168.8.10 s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com oss://ABCDEFG:HIJKLMN@bbb.oss-cn-hangzhou.aliyuncs.com
```

The synchronization workload between the two object storages is shared by the manager machine and the two workers `bob@192.168.1.20` and `tom@192.168.8.10`.
## Observation
Simply put, when using `sync` to transfer big files, the progress bar might move slowly or appear stuck. If this happens, you can observe the progress using other methods.

`sync` assumes it is mainly used to copy a large number of files, and its progress bar is designed for this scenario: progress only updates when a file has been fully transferred. With large files, every file transfers slowly, hence the slow or even static progress bar. This is worse for destinations without multipart upload support (e.g. the `file`, `sftp` and `jfs` schemes), where every file is transferred in a single thread.
If the progress bar is not moving, use the methods below to observe and troubleshoot:

- Add the `--verbose` or `--debug` option to the `juicefs sync` command to print debug logs.
- If either end is a JuiceFS mount point:
  - Use `juicefs stats` to quickly check the current I/O status;
  - Read the client log (defaults to `/var/log/juicefs.log`) to see if there are slow requests or timeout errors.
- If the destination is a local disk, look for temporary files ending with `.tmp.xxx`; these are the temp files created by `sync`, which are renamed when a transfer completes. Watch for size changes in these temp files to verify the current I/O status.
- If both ends are object storage services, use tools like `nethogs` to check network I/O.
- When none of the above methods retrieves useful debug information, collect the goroutine of the stuck process and send it to Juicedata engineers:
```shell
# Replace <PID> with the actual PID of the stuck sync process
# This command prints its pprof listening port
lsof -p <PID> | grep TCP | grep LISTEN

# The pprof port is typically 6061, but in case of a port conflict
# the port number is automatically increased
curl -s localhost:6061/debug/pprof/goroutine?debug=1
```
## Sync across regions using S3 Gateway
When transferring a large number of small files across regions via FUSE mount points, clients inevitably talk to the metadata service in the opposite region over the public internet (or a dedicated network connection with limited bandwidth). In such cases, metadata latency can become the bottleneck of the data transfer.

The S3 Gateway comes to the rescue in these circumstances: deploy a gateway in the source region, and since this gateway accesses metadata via the private network, metadata latency is reduced to a minimum, bringing the best performance for small-file-intensive scenarios.

Read S3 Gateway to learn about its deployment and use.
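As a rough sketch, assuming the Community Edition gateway invocation, a volume named `myvol`, and placeholder addresses and credentials (the Enterprise Edition gateway is started differently; see the S3 Gateway documentation):

```shell
# In the source region: expose the JuiceFS volume as an S3-compatible endpoint
export MINIO_ROOT_USER=admin
export MINIO_ROOT_PASSWORD=12345678
juicefs gateway redis://127.0.0.1:6379/1 0.0.0.0:9000

# In the destination region: pull data through the gateway instead of a FUSE mount point
juicefs sync minio://admin:12345678@192.168.1.100:9000/myvol/ /jfs/data/
```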