Planning Your Deployment

This topic provides a planning checklist for deploying Ceph distributed storage on Alauda Container Platform (ACP). It summarizes architecture choices, security options, infrastructure sizing, network constraints, and disaster recovery considerations so that you can decide on a deployment model before performing the actual installation.

For product background, see Introduction and Architecture. For deployment procedures, see the documents under Install and How To.

Deployment Architecture

ACP distributed storage is based on Ceph and Rook. At a high level, the platform combines the following layers:

  • Ceph daemons such as MON, MGR, OSD, MDS, and RGW to provide block, file, and object storage capabilities
  • Rook and CSI components to automate deployment, provisioning, expansion, and lifecycle management
  • ACP platform integration to expose storage pools, observability, and operational entry points

Before deployment, decide whether your environment should use storage services from the local cluster or consume storage from an external Ceph environment.

Internal and External Deployment Models

You can plan ACP distributed storage in one of the following ways:

| Deployment pattern | Where storage services run | Who manages the storage cluster | Best fit | Key tradeoff |
| --- | --- | --- | --- | --- |
| Internal, co-resident | Ceph components run on the same ACP worker nodes that also run business workloads | The ACP platform team or cluster admin | Early-stage environments, bare metal clusters, or situations where storage requirements are not fully clear yet | Simpler rollout, but resource contention between apps and storage is more likely |
| Internal, dedicated nodes | Ceph components run on dedicated storage or infrastructure nodes inside the same ACP cluster | The ACP platform team or cluster admin | Production environments with predictable storage demand and stricter isolation requirements | Better operational isolation and sizing control, but requires more reserved nodes and capacity planning |
| External | ACP consumes storage classes from an external Ceph environment | A separate storage team, SRE team, or an existing external storage owner | Large-scale environments, multiple consumer clusters, or organizations that already operate a separate Ceph cluster | Clear ownership boundary, but more cross-cluster networking, authentication, and dependency management |

Internal deployment is easier to roll out and manage because storage services and the consuming workloads are planned within the same ACP environment. Within internal deployment, the first design choice is whether storage should share nodes with business workloads or use dedicated nodes. External deployment is better when you need stronger separation between storage and application clusters or when multiple business clusters need to share the same storage backend.

The main planning decision points are:

  • Choose co-resident deployment when you want faster rollout and can tolerate storage and application workloads sharing the same worker pool.
  • Choose dedicated-node deployment when storage demand is known and you want clearer capacity control, fault isolation, and maintenance boundaries.
  • Choose external deployment when storage is already managed elsewhere or when a single external cluster must serve multiple ACP clusters.

Node Roles

When planning node placement, separate the responsibilities of control plane nodes, infrastructure nodes, and worker nodes:

  • Control plane nodes maintain cluster management functions and should not be treated as general-purpose storage nodes unless the deployment model explicitly supports it.
  • Infrastructure nodes are suitable when you want to isolate storage platform components from business workloads.
  • Worker nodes can host storage services in co-resident deployments, but this increases resource contention between applications and storage daemons.

For production use, plan at least three failure domains for highly available storage services. Spread storage nodes across racks, zones, or host groups wherever possible.

Security Considerations

Before deployment, confirm whether encryption in transit is required for the storage design and validate the operational impact before enabling it.

Encryption in Transit

ACP currently supports encryption in transit for Ceph distributed storage. This feature protects traffic between Ceph components and clients and is typically planned around Ceph msgr2 and the cluster networking model.

Before enabling in-transit encryption, verify:

  • Kernel and operating system support on storage and client nodes
  • Expected CPU overhead on busy storage nodes
  • Throughput and latency impact on the target hardware

For implementation details, see Configure in-transit encryption.

Infrastructure Requirements

Plan node count, storage devices, and available resources before creating the cluster.

| Item | Minimum configuration | Recommended configuration |
| --- | --- | --- |
| Storage nodes | 3 nodes | 3 or more nodes distributed across failure domains |
| Storage devices | 1 available storage device per node | Multiple dedicated devices per node, with consistent type and size |
| Node distribution | 3 nodes available to host Ceph services | 3 failure domains such as racks or zones |
| Device usage | Separate system disk and storage disk | Dedicated raw disks for Ceph data and future expansion headroom |

At minimum, the cluster should have three nodes and one usable storage device on each node. For production use, deploy the cluster across at least three failure domains and reserve enough free resources to absorb rebalance, repair, and future growth.

Resource Sizing

Ceph storage services consume CPU, memory, and device capacity continuously. Plan resources for storage daemons first, then reserve additional headroom for recovery, rebalance, upgrades, and background tasks.

As a baseline:

  • Start with at least three storage nodes for a highly available cluster
  • Reserve enough CPU and memory for MON, MGR, OSD, and any enabled MDS or RGW services
  • Keep growth headroom for new pools, additional devices, and cluster recovery events
  • Avoid planning a cluster that is already near saturation at day one

If your design uses dedicated storage nodes, resource planning is more predictable. If storage runs together with business workloads, reserve extra headroom to absorb contention during peak load and node failures.

Aggregate Cluster Planning Budget

For early sizing, start from an aggregate cluster budget rather than from per-component values alone. The following table is intended as a planning reference for a three-node highly available cluster before workload-specific tuning:

| Deployment pattern | Aggregate CPU to reserve for storage | Aggregate memory to reserve for storage | Notes |
| --- | --- | --- | --- |
| Internal, minimum baseline | 24 logical CPUs | 72 GiB | Entry-level three-node planning baseline when only the minimum deployment target is being met |
| Internal, standard baseline | 30 logical CPUs | 72 GiB | Better starting point for general production planning and future expansion |
| Internal, performance-oriented baseline | 45 logical CPUs | 96 GiB | Suitable when higher throughput or lower latency is required from the beginning |
| External consumer cluster | Size for connectivity and client access only | Size for connectivity and client access only | Storage daemons run outside the ACP cluster, so the ACP cluster mainly needs network reachability, credentials, and client-side capacity |

These values should be treated as cluster-level planning targets, not exact scheduler reservations. To estimate per-node budget for a three-node cluster, divide the aggregate numbers evenly across the participating storage nodes.
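The even split described above can be sketched as a small helper. The function name is illustrative, and the baseline values come from the planning table; treat the result as a per-node planning target, not a scheduler reservation.

```python
# Hypothetical helper: split an aggregate cluster planning budget evenly
# across the participating storage nodes.

def per_node_budget(aggregate_cpus, aggregate_mem_gib, node_count):
    """Return the (CPU, memory GiB) share each storage node should plan for."""
    return aggregate_cpus / node_count, aggregate_mem_gib / node_count

# Internal, standard baseline: 30 logical CPUs and 72 GiB across 3 nodes.
cpu, mem = per_node_budget(30, 72, 3)
print(f"Plan ~{cpu:g} CPUs and ~{mem:g} GiB per storage node")
```

For the standard baseline this yields roughly 10 CPUs and 24 GiB per node on a three-node cluster.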

The following recommendations are suitable for early planning:

| Component | Recommended CPU | Recommended memory |
| --- | --- | --- |
| MON | 2 cores | 3 GiB |
| MGR | 3 cores | 4 GiB |
| MDS | 3 cores | 8 GiB |
| RGW | 2 cores | 4 GiB |
| OSD | 4 cores | 8 GiB |

These values are planning references rather than hard scheduling guarantees. Actual requirements depend on the number of devices, enabled services, and workload intensity.

How to Estimate Cluster Size

Use the following order when sizing a cluster:

  1. Choose the deployment pattern: co-resident, dedicated-node, or external.
  2. Determine the minimum node count and failure-domain layout.
  3. Decide whether block, file, object, or mixed storage services are required.
  4. Start from the aggregate cluster planning budget.
  5. Add headroom for additional device sets, recovery, monitoring, and expected growth.

If file and object services are both required, or if the cluster will host heavy business workloads at the same time, size above the minimum baseline rather than directly at it.
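The estimation steps above can be sketched by summing the per-component planning values from the table for an assumed daemon layout. The daemon counts below (3 MONs, 2 MGRs, one OSD per device, 2 MDS daemons for file storage) are illustrative assumptions, not required values; substitute the service mix your design actually needs.

```python
# Illustrative sizing sketch: total CPU and memory from per-component
# planning values. PLANNING values are taken from the component table
# above; the daemon counts passed in are assumptions for the example.

PLANNING = {  # component: (cores, memory GiB)
    "mon": (2, 3), "mgr": (3, 4), "mds": (3, 8), "rgw": (2, 4), "osd": (4, 8),
}

def estimate(daemon_counts):
    """Return (total cores, total memory GiB) for a daemon count map."""
    cores = sum(PLANNING[d][0] * n for d, n in daemon_counts.items())
    mem = sum(PLANNING[d][1] * n for d, n in daemon_counts.items())
    return cores, mem

# Assumed layout: 3 MONs, 2 MGRs, 6 OSDs (two devices per node on three
# nodes), plus 2 MDS daemons because file storage is required.
cores, mem = estimate({"mon": 3, "mgr": 2, "osd": 6, "mds": 2})
print(f"Estimated storage footprint: {cores} cores, {mem} GiB")
```

Add headroom on top of the estimate for recovery, monitoring, and growth rather than deploying at the computed number exactly.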

Pod Placement

Pod placement rules directly affect resilience. Plan the cluster so that:

  • Highly available components can be spread across different failure domains
  • Every failure domain has accessible storage devices and enough allocatable resources
  • New device sets or future expansion can still follow the same placement pattern

In practice, this means that simply having three nodes is not enough. The nodes also need to be distributed in a way that avoids a single rack, host group, or zone becoming a single point of failure.
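A placement sanity check along these lines can be sketched as follows. The node-to-domain mapping and the function name are illustrative; the check only encodes the two rules above: at least three failure domains, and no single domain holding a majority of the storage nodes.

```python
# Sketch of a failure-domain spread check for planned storage nodes.
from collections import Counter

def placement_ok(node_domains, min_domains=3):
    """node_domains maps node name -> failure domain (rack, zone, host group)."""
    domains = Counter(node_domains.values())
    if len(domains) < min_domains:
        return False
    # No domain may hold a strict majority of the storage nodes.
    return max(domains.values()) <= len(node_domains) // 2

# Three nodes spread across three racks: acceptable.
print(placement_ok({"node-1": "rack-a", "node-2": "rack-b", "node-3": "rack-c"}))
# Three nodes in only two racks: a single rack failure can break quorum.
print(placement_ok({"node-1": "rack-a", "node-2": "rack-a", "node-3": "rack-b"}))
```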

Storage Device Planning

When selecting storage devices, standardize device size and class as much as possible. Mixed devices complicate performance tuning and capacity planning.

Use the following principles:

  • Reserve one system disk for the operating system and separate storage devices for Ceph data
  • Prefer raw disks or dedicated devices instead of partitioning shared disks
  • Keep device counts per node at a manageable level so that recovery and maintenance remain practical
  • Track usable capacity rather than raw capacity because replication reduces effective storage space

Capacity planning should also include alert thresholds and expansion policy. Plan expansion before the cluster reaches a near-full state. Running close to full capacity increases rebalance pressure and makes recovery harder.
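An alert-threshold check of this kind can be sketched as below. The default thresholds shown (85% nearfull, 95% full) mirror Ceph's default ratios, but treat them as assumptions and substitute whatever ratios your cluster actually configures.

```python
# Sketch of a capacity-alert check for expansion planning.

def capacity_status(used_tib, raw_tib, nearfull=0.85, full=0.95):
    """Classify utilization against nearfull/full planning thresholds."""
    ratio = used_tib / raw_tib
    if ratio >= full:
        return "full"
    if ratio >= nearfull:
        return "nearfull: plan expansion now"
    return "ok"

print(capacity_status(4.8, 6.0))  # 80% utilization
print(capacity_status(5.4, 6.0))  # 90% utilization
```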

For related operational guidance, see Managing Storage Pools and Adding Devices/Device Classes.

Capacity Planning

When planning cluster capacity, calculate usable capacity rather than raw disk capacity. In a replicated Ceph deployment, a portion of raw storage is always consumed by data protection.

Use the following planning principles:

  • Keep available capacity ahead of expected business growth instead of expanding only after the cluster is almost full
  • Reserve additional headroom for recovery, rebalance, snapshots, and temporary bursts in data usage
  • Expand storage in a balanced way across nodes and failure domains so that new capacity does not create skewed utilization
  • Review both current utilization and projected growth before adding new workloads to the cluster

The following examples can be used as early planning references for a three-node cluster with one device per node and a 3-replica data protection policy:

| Device size per node | Raw cluster capacity | Approximate usable capacity with 3 replicas |
| --- | --- | --- |
| 0.5 TiB | 1.5 TiB | 0.5 TiB |
| 2 TiB | 6 TiB | 2 TiB |
| 4 TiB | 12 TiB | 4 TiB |

These values are examples only. Usable capacity varies with the actual data protection policy and should not be treated as a general rule for every cluster design.
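The examples above follow from a simple relationship for a replicated pool: usable capacity is roughly raw capacity divided by the replica count, before recovery headroom and metadata overhead. A minimal sketch, assuming one device per node and a 3-replica policy:

```python
# Illustrative usable-capacity estimate for a replicated Ceph pool.

def usable_tib(nodes, device_tib_per_node, replicas=3):
    """Return (raw TiB, approximate usable TiB) before overhead and headroom."""
    raw = nodes * device_tib_per_node
    return raw, raw / replicas

for size in (0.5, 2, 4):
    raw, usable = usable_tib(3, size)
    print(f"{size} TiB/node -> {raw} TiB raw, ~{usable:g} TiB usable")
```

Erasure-coded pools and mixed policies change the ratio, so recompute rather than reusing the 3-replica numbers.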

In day-two operations, capacity should be reviewed before the cluster reaches warning levels. If growth is predictable, expand early rather than waiting for a near-full or full condition.

Network Requirements

Ceph is sensitive to network quality. Before deployment, validate the following:

  • The cluster network can provide stable throughput for replication and recovery traffic
  • Latency between failure domains is within the supported range for the selected deployment model
  • Required ports are open between storage nodes and consuming clusters
  • Any dedicated network design, such as Multus-based separation, is decided in advance

If you plan to isolate storage traffic from general application traffic, confirm the network interfaces, routing policy, and operational ownership before deployment. Network isolation improves security and performance, but it also increases design complexity.

IPv6 Support

ACP distributed storage planning must follow the cluster network stack selected for the platform.

  • IPv6 is supported in single-stack IPv6 environments.
  • Dual-stack planning must be validated against the ACP cluster network design before storage deployment.
  • Storage nodes and client nodes should use the same address family strategy to avoid connectivity and service discovery issues.

If your environment uses IPv6, confirm the following before installation:

  • The ACP cluster network is already configured for IPv6 operation
  • All storage nodes can communicate over the required IPv6 routes
  • Monitoring, alerting, and external integrations that access storage endpoints also support IPv6

IPv6 should be treated as an installation-time architecture decision. Do not assume that an existing IPv4-oriented storage design can be converted later without revalidation.

Disaster Recovery Planning

ACP distributed storage can be planned with different recovery objectives. Choose a model based on your recovery point objective (RPO), recovery time objective (RTO), and site topology.

Regional-DR

ACP supports Regional-DR for cross-region or cross-site disaster recovery scenarios where asynchronous replication and a small amount of potential data loss are acceptable.

When planning Regional-DR, confirm the following items in advance:

  • The source and destination clusters have compatible storage and network designs
  • Replication latency and failover expectations match the business recovery objectives
  • The protected workload type is clear, such as block, file system, or object data

For implementation details, see Disaster Recovery.

Stretch Cluster

A stretch cluster is appropriate only when the latency between sites is tightly controlled and the topology is designed specifically for this pattern. In general, plan for:

  • Two data sites and one quorum or arbiter site
  • A minimum of five nodes across three zones
  • Manual and explicit failure-domain labels before cluster creation
  • Sufficient nodes in each data site to preserve storage service availability
  • Inter-zone latency that remains within a low-latency design envelope, typically no more than 10 ms RTT between the data sites

WARNING

Do not treat a stretch cluster as a general solution for long-distance, high-latency, multi-datacenter deployment. If inter-site latency is not tightly controlled, use a dedicated disaster recovery architecture instead.

For ACP-specific stretch cluster deployment guidance, see Create Stretch Type Cluster.

Performance Planning

Performance should be planned from workload characteristics rather than from raw device counts alone. Before deployment, identify:

  • Whether the primary workloads are block, file, or object oriented
  • Whether the workload is latency sensitive, throughput sensitive, or capacity heavy
  • Whether hot data, backup traffic, or analytics jobs will dominate the cluster

Also confirm whether special tuning or feature-specific design is required. For example, object workloads may need separate planning for gateway capacity, and some environments may require cache-oriented or dedicated-cluster designs.

Next Steps

After you complete planning, proceed to the deployment guide that matches your selected deployment model:

  • Internal deployment
  • External deployment