How to Build a Ceph Cluster and Integrate with the JuiceFS File System

2023-12-07
timfeirg

Ceph provides object storage and can be used as a storage engine in JuiceFS. This combination is well-suited for data-intensive applications such as cloud computing, big data analytics, and machine learning.

In production deployments, JuiceFS can use Ceph RADOS directly; there is no need to deploy RGW. Drawing on hands-on experience with this setup and the issues that commonly arise in operations, this post was put together with reference to the official Ceph documentation. The goal is to help engineers who may not be familiar with Ceph deploy and maintain it smoothly in combination with JuiceFS.

Background knowledge

Before diving into the tutorial, let's briefly cover some basic Ceph concepts involved in the upcoming operations:

  • cephadm: the dedicated deployment and management tool for Ceph clusters, used here to initialize the cluster.

  • Object Storage Daemon (OSD): A storage service instance in the Ceph cluster responsible for managing each disk allocated for Ceph's use.

  • Placement Group (PG): Think of it as a "virtual directory" that groups the large number of objects stored in Ceph. Operations within the same PG are serialized with locks, while operations on different PGs can run concurrently. If there are too few PGs, lock contention causes performance problems; if there are too many, resource consumption rises. In the tutorial below, we follow the official best-practice formula to calculate and set this value.

Get prepared

Set up the environment

The tutorial is demonstrated using the following environment:

  • Operating system: Ubuntu 22.04 LTS
  • Kernel version: 5.15.0-72-generic
  • Data disks: 8 local NVMe SSDs per node
  • Memory: 256 GB
  • Ceph version: quincy v17.2.6

Total of 16 nodes, each with 8 SSD disks (unformatted)

In the preparation phase, perform the following steps:

1. Configure password-free login from node 1 to all other nodes, and record the node information in /etc/hosts. For example:

10.38.0.1 ceph-node01
10.38.0.2 ceph-node02
10.32.0.3 ceph-node03
...
10.32.0.16 ceph-node16

Once this operation is done, ceph-node01 will be used as the operational (and Ceph control) node. All subsequent Ceph-related installations and operational tasks will be performed on this machine.
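If password-free login is not configured yet, a minimal sketch of setting it up from node 1 (assuming root login and the default key path; adjust to your environment):

# Generate a key pair on the control node (skip if one already exists).
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa

# Copy the public key to every other node.
ssh-copy-id root@ceph-node02
ssh-copy-id root@ceph-node03
...
ssh-copy-id root@ceph-node16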

2. It is recommended to install pssh in advance for convenient batch operations:

apt install pssh

pssh, short for Parallel SSH, runs commands over SSH on multiple nodes concurrently for batch execution. It reads the target hosts from a host list file.

3. Record the hostnames of all nodes (excluding the control node) in a text file, hosts.txt, which is used by the batch commands below:

ceph-node02
ceph-node03
...
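Before moving on, you can verify that batch SSH works against this host list (assuming it is saved as hosts.txt, the file name used by the commands throughout this post):

parallel-ssh -h hosts.txt -i hostname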

Configure node infrastructure

1. If clock synchronization has not been set up yet, install chrony manually:

apt install chrony

# Check which time servers are reachable.
chronyc sources
# If the nodes have no external network access, change the chrony configuration to point to an internal time server.
# If you are not sure how to configure this, ask your cloud provider's technical support for the recommended chrony sources.

# Clock synchronization must be configured on all nodes. The following command installs chrony in batch;
# any additional configuration still needs to be adjusted manually:
parallel-ssh -h hosts.txt -i apt install chrony

2. If swap is enabled on the nodes (check for "swap" entries in /etc/fstab), disable it on all of them:

swapoff -a
# Comment out swap-related entries in fstab:
sed -i.bak "s/\/swap.img/# \/swap.img/" /etc/fstab

# After executing the above command on the control node, perform the following commands on all other nodes:
parallel-ssh -h hosts.txt -i swapoff -a
parallel-ssh -h hosts.txt -i 'sed -i.bak "s/\/swap.img/# \/swap.img/" /etc/fstab'

Install Ceph

1. Transfer Docker and images

Install Docker on all nodes:

apt install docker.io
parallel-ssh -h hosts.txt -i apt install docker.io

During the subsequent installation, both the ceph and node-exporter images are needed on all nodes. It is recommended to pull them and distribute them to every node in advance; if you skip this step, cephadm pulls them from the official registry, which may be slow depending on your local network.
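As a rough sketch of pulling the images on the control node first (the ceph tag below matches the one used in the bootstrap command later; the node-exporter tag is deliberately left unpinned as an assumption, so verify which tag cephadm expects in your environment):

docker pull quay.io/ceph/ceph:v17.2.6-20230920
docker pull quay.io/prometheus/node-exporter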

docker image save quay.io/ceph/ceph quay.io/prometheus/node-exporter | gzip > images.tar.gz
parallel-scp -h hosts.txt ./images.tar.gz /tmp/
parallel-ssh -h hosts.txt -i 'docker load -i /tmp/images.tar.gz'

# Once you confirm that all nodes have successfully loaded the images, clean up the images.
parallel-ssh -h hosts.txt -i rm /tmp/images.tar.gz

2. Deploy the Ceph control node

On Ubuntu 22.04, cephadm can be installed directly from the system package source. For other Ubuntu versions or distributions, manual installation may be required; for details, see the Ceph download page.

# Install the deployment tool on the control node, ensuring it's version quincy (17.2.6), the default version for Ubuntu 22.
apt install cephadm

# If the version from the system apt source doesn't meet requirements, manually download from the Ceph official website.
# For example, for Ubuntu 20 (focal), use the following command for manual installation.
# wget https://download.ceph.com/debian-17.2.6/pool/main/c/ceph/cephadm_17.2.6-1focal_amd64.deb
# dpkg -i ./cephadm_17.2.6-1focal_amd64.deb

# After manually installing cephadm, add Ceph-related software sources to the package manager's repo list.
# This command is needed to install the required version of ceph-common in the next step.
# cephadm add-repo --release quincy

# Install various Ceph tools.
apt install ceph-common

Initialize the cluster, where mon-ip is the internal IP of the control node. For nodes with multiple network interfaces, an internal cluster network can also be specified, for example --cluster-network 10.90.90.0/24.

# After the command runs, it will print the account and password for the Ceph Dashboard. Be sure to save them.
cephadm --image quay.io/ceph/ceph:v17.2.6-20230920 bootstrap --mon-ip 10.38.0.1 --log-to-file

After the Ceph control process comes online, it generates its own SSH key pair. The public key needs to be installed on all nodes:

# Print the public key and copy its content.
cat /etc/ceph/ceph.pub

# Append the public key copied above to authorized_keys on all nodes in batch.
parallel-ssh -h hosts.txt -i 'echo ssh-rsa xxx ceph-xxx >> /root/.ssh/authorized_keys'

Once the public key is installed, the control node can log in to all worker nodes for operational tasks. You can enter the Ceph management container with cephadm shell, and subsequent commands will then run inside it. However, since the control node already has all the required tools installed on the host, entering the cephadm shell is optional.

# Running this command will enter the container, but the prompt won't change.
cephadm shell

# You can check the processes to identify whether you are in the host or the container.
ps -ef

Confirm the initial status of the cluster and perform preliminary optimizations:

# View cluster status.
ceph -s

# Disable unused nfs modules.
ceph mgr module disable nfs

# Default memory limit per OSD is 4 GB.
# In a Ceph cluster, each disk (block device) is deployed as an OSD.
# With 8 disks per node and 256 GB of memory, a larger limit of 16 GB per OSD can be allocated.
# After the adjustment, the OSDs on each node are limited to 128 GB in total, leaving plenty of headroom.
ceph config set osd osd_memory_target 17179869184
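You can read the value back to confirm the change took effect (the value is shown in bytes):

ceph config get osd osd_memory_target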

3. Set up the Ceph cluster

Add all nodes to the Ceph cluster. Attach the _admin label to the first five nodes (including the control node); the remaining nodes do not need it. Considering the large number of nodes, write a script for this step (a sketch follows the commands below):

The following commands illustrate the process:

# The control node is already a cluster member and doesn't need the _admin label added here.
# Therefore, add the _admin label to ceph-node[02:05]. Other nodes don't need it.
ceph orch host add ceph-node02 10.32.0.2 --labels _admin
...
ceph orch host add ceph-node16 10.32.0.16
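A minimal sketch of such a script, assuming the node naming and IP scheme shown above (adjust the IP prefix, node count, and label range to your environment):

# Hypothetical helper script: nodes 02-05 get the _admin label; node 01 was already set up during bootstrap.
for i in $(seq 2 16); do
  n=$(printf "%02d" "$i")
  if [ "$i" -le 5 ]; then
    ceph orch host add "ceph-node$n" "10.32.0.$i" --labels _admin
  else
    ceph orch host add "ceph-node$n" "10.32.0.$i"
  fi
done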

Once all nodes have joined the Ceph cluster, the system automatically detects the available block devices on each node, that is, disks that are unformatted and not in use. This detection does not modify any disks; nothing changes until you add them as OSDs.

# View available disks on all nodes to ensure they match the actual configuration, that is, all empty disks have been detected and there are no bad disks.
ceph orch device ls

# For non-new disks, formatting is required.
# ceph orch device zap host0 /dev/nvme0n1 --force

# Add all available SSDs on all nodes as Ceph OSDs.
# After running this command, the disk will be managed by Ceph.
# Add them one by one so that the OSD IDs stay in order. This takes a long time, so it is recommended to run it as a script (see the sketch below).
ceph orch daemon add osd ceph-node01:/dev/nvme0n1
ceph orch daemon add osd ceph-node01:/dev/nvme1n1
...
ceph orch daemon add osd ceph-node16:/dev/nvme7n1
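A hedged sketch of such a script, assuming every node exposes its disks as /dev/nvme0n1 through /dev/nvme7n1 as in this environment (check ceph orch device ls and adjust hostnames and device names if yours differ):

# Hypothetical helper script; runs sequentially so the OSD IDs stay in order.
for i in $(seq 1 16); do
  node="ceph-node$(printf "%02d" "$i")"
  for d in $(seq 0 7); do
    ceph orch daemon add osd "$node:/dev/nvme${d}n1"
  done
done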

# Monitor the progress and status during OSD deployment.
ceph orch ps --daemon_type osd

4. Create pools

After you deploy all disks as OSDs, you can create pools. It's essential to understand two storage modes supported by Ceph before creating pools:

  • Replication mode. By default, each object is stored in 3 copies (size=3). Alongside this, there is another concept known as the "minimum available replicas," which defaults to 2 (min_size=2). This means that a minimum of 2 properly functioning replicas is required for I/O operations. In the default configuration with size=3, min_size=2, it allows for a maximum of 1 replica to be damaged. This is an optimal setting for production environments. If changed to size=2, min_size=1, all objects will have only 2 replicas, increasing the risk of data loss if any one replica fails. The fault domain in replication mode is also dependent on the cluster configuration. In cases where the number of nodes is greater than or equal to 3, Ceph automatically distributes replicas across different nodes, allowing for the failure of any single node without affecting the service.
  • Erasure code (EC) mode. This mode offers better storage efficiency compared to replication mode but introduces some computational resource overhead due to the nature of erasure codes. If the priority is to conserve disk space, this mode can be selected.
An example of creating a replicated pool for JuiceFS:

# The two numbers are the placement group count; they need to be the same.
# Rule of thumb: num_osd * 100 / 3, rounded to the nearest power of 2. With num_osd = 128 this gives about 4267, so we take 4096.
ceph osd pool create jfsvol 4096 4096 replicated
ceph osd pool application enable jfsvol juicefs
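To double-check the pool, its settings can be read back after creation (note that the PG autoscaler, if enabled, may adjust pg_num over time):

ceph osd pool get jfsvol pg_num
ceph osd pool get jfsvol size
ceph osd pool get jfsvol min_size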

An example of the commands for creating an erasure-coded pool:

# EC 4+2 pool (default stripe unit is 4K)
# ceph osd erasure-code-profile set jfs-ec k=4 m=2 stripe_unit=8192
# ceph osd pool create ec-pool 32 32 erasure jfs-ec

Now, Ceph is deployed and ready for use. It is recommended to skip the monitoring and alerting section for now and proceed to integrate the JuiceFS file system. You can set up monitoring after completing the integration.

Integrate with the JuiceFS file system

Create a JuiceFS file system whose bucket name matches the storage pool created earlier (jfsvol). Then follow the documentation to mount it.
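For reference, with JuiceFS Community Edition the file system backed by this pool would be created roughly as follows (a sketch, not the exact procedure of this post: the Redis metadata URL is a placeholder, and on the cloud/enterprise edition the volume is created in the web console instead):

# Hypothetical example; replace the metadata URL with your own.
juicefs format --storage ceph \
    --bucket ceph://jfsvol \
    --access-key ceph \
    --secret-key client.admin \
    redis://192.168.1.6/1 jfsvol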

To mount the JuiceFS file system backed by Ceph through the CSI Driver, the Ceph configuration files must be visible to the JuiceFS client. Upload them as a Kubernetes Secret:

# Before uploading, make sure there is a newline character at the end of the configuration file, otherwise subsequent mounts may cause an exception.
echo >> ceph.conf
echo >> ceph.client.admin.keyring

kubectl create secret generic ceph-secret --from-file=ceph.conf=ceph.conf --from-file=ceph.client.admin.keyring=ceph.client.admin.keyring

After creating the Secret, add the configs field to the file system authentication information:

apiVersion: v1
kind: Secret
metadata:
  name: jfsvol-secret
type: Opaque
stringData:
  name: jfsvol
  token: xxx
  access-key: ceph
  secret-key: client.admin
  # Mount the ceph-secret created above to /etc/ceph.
  configs: '{"ceph-secret": "/etc/ceph/"}'

Operations and maintenance

Ceph is a mature storage solution that runs stably when properly planned and maintained. The following sections cover the most common operational concerns.

Deploy monitoring

When you deploy with cephadm, it installs its own Grafana, integrated with the Ceph Dashboard by default. You can use this dashboard for monitoring and alerting, or connect Ceph to an existing external Grafana instead.

To reuse the existing external Grafana in the environment:

  1. Open the Grafana address in the browser.
  2. Click Administration on the left sidebar.
  3. Add a data source and fill in the Prometheus address of the Ceph cluster. The default is http://<ceph-master-ip>:9095; if you are not sure of the port, run ss -lntp | grep prometheus on the control node to confirm. Test and save the data source.

After the Prometheus data source is added, import the community's Ceph Grafana dashboard and monitoring is in place. Alert rules are built into this dashboard; once you configure an alert notification channel, alerting works right away.

Monitor capacity and recover from failures

You can check current capacity usage with ceph -s. The dashboard imported in the previous section also includes alert rules covering several aspects of Ceph's capacity usage:

  • CephNodeRootFilesystemFull: Node root partition usage monitoring
  • CephOSDNearFull|CephOSDFull|CephOSDBackfillFull: OSD-related capacity monitoring
  • CephPoolNearFull|CephPoolFull|CephPoolBackfillFull: Pool-related capacity monitoring

If the Ceph cluster becomes full, it triggers the CephOSDFull|CephPoolFull alerts, and Ceph stops accepting write operations. This is an awkward situation because, to Ceph, a "deletion" is essentially a write request: under CephPoolFull, even deletion requests from clients are intercepted and hang until they time out. If such a failure occurs and you cannot immediately add storage to scale out, take the following steps:

1. Identify the problematic file system, which will be cleaned up to free capacity. Before cleaning, you must:

  • Set its trash retention to 0 in advance; otherwise, even deleted files are kept in the trash and the object storage capacity is not released (see the sketch after this list).
  • Preferably stop writes on the application side (Ceph is already over capacity at this point, so all write requests are stuck anyway). This is not strictly necessary, but it makes the subsequent operations more fault-tolerant.
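As a sketch for JuiceFS Community Edition, trash retention can be turned off with the config subcommand (the metadata URL is a placeholder; on the cloud/enterprise edition this is configured in the console instead):

# Hypothetical example; replace the metadata URL with your own.
juicefs config redis://192.168.1.6/1 --trash-days 0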

2. Choose an available client of the file system, enter the mount point, and delete files to free up capacity.

3. Delete a sufficient number of files at the JuiceFS mount point. If capacity is not released on the Ceph side, you have hit the deadlock described above, where deletions cannot proceed while the cluster is over capacity. In that case, log in to the Ceph control node and temporarily raise the full threshold so that deletion requests can get through. Make sure the application has stopped writing before doing this; otherwise, once the threshold is raised, the cluster quickly fills up again and the failure recurs.

# The mon_osd_full_ratio defaults to 0.95; temporarily raise it to allow writes.
ceph osd set-full-ratio 0.96

# After turning up the threshold, closely observe whether capacity is released.
ceph -s

4. Confirm that capacity has been released and prepare to roll back the full-ratio setting. This step also needs care: before rolling back, make sure the change will not put the cluster over capacity again and interrupt application writes. Run ceph df and check the used space percentage (%USED); if %USED is below 90%, you can roll back with the following command:

ceph osd set-full-ratio 0.95

To make capacity overruns easier to handle, the JuiceFS client now supports deletion even when the Ceph cluster is full (see the related code change in JuiceFS Community Edition). With newer client versions, there is therefore no need to temporarily adjust set-full-ratio.

Scale up

To expand the cluster's capacity by adding new disks, perform the following steps. After the disks have been installed in all nodes, confirm that they are detected:

# In the output table, AVAILABLE=Yes indicates the detected disks.
# Confirm that the new disk parameters are as expected.
ceph orch device ls

It is recommended to disable data balancing manually before creating the new OSDs. There may be many disks to add, OSD creation formats each disk, and the whole operation takes a long time; with balancing enabled (the default), data migration starts after each individual disk is added. To avoid this repeated, inefficient migration, disable balancing first and re-enable it only after all new disks have been added to Ceph.

# Disable data balancing.
ceph osd set nobackfill

# Confirm that the operation has taken effect.
ceph -s | grep backfill

Next, run the commands to add the new disks. The procedure is exactly the same as in the cluster setup section: reuse the ceph orch daemon add osd command described there and add all new disks in order. If there are many disks, it is again recommended to prepare a script in advance.

After all disks have been added, run the following command to finish:

# Confirm that all new disks have been added to the cluster (AVAILABLE=No).
ceph orch device ls

# Restart data balancing.
ceph osd unset nobackfill

If you have any questions or would like to learn more details, feel free to join discussions about JuiceFS on GitHub and the JuiceFS community on Slack.
