
Building Enterprise-Grade ClickHouse Clusters with Ansible

Rakesh Therani


Introduction

ClickHouse has established itself as one of the leading columnar database management systems for analytical workloads, processing billions of rows in milliseconds. However, deploying a production-ready ClickHouse cluster comes with significant complexity — from configuring proper sharding and replication to optimizing system parameters for peak performance. In this technical deep-dive, I’ll walk through an enterprise-grade ClickHouse automation solution that eliminates deployment complexity while enabling hardware-aware configuration for optimal performance.

The automation solution we’ll explore allows engineers to deploy highly available, secure, and optimized ClickHouse clusters using a single configuration file and two commands. By the end of this article, you’ll understand how this Ansible project is structured, the key configurations it manages, and how it automatically tunes ClickHouse based on your hardware specifications.

Real-World Challenges in ClickHouse Deployment

Before diving into the solution, let’s understand the common challenges organizations face when deploying ClickHouse:

  1. Configuration Complexity: ClickHouse has hundreds of configuration parameters, many of which require careful tuning based on workload and hardware.
  2. Resource Optimization: Setting memory limits, cache sizes, and thread pools incorrectly can lead to performance issues or out-of-memory errors.
  3. High Availability Design: Implementing proper sharding and replication requires careful planning and configuration.
  4. Operational Readiness: Production deployments need comprehensive monitoring, backup solutions, and health checks.
  5. Security Implementation: Enterprise deployments require proper encryption, authentication, and authorization.

Key Features of the Ansible Automation

Let’s highlight what this automation solution provides:

  • Hardware-aware configuration: Automatically optimizes ClickHouse settings based on available CPU cores and RAM
  • Flexible cluster topologies: Supports arbitrary shard and replica combinations
  • High availability: Configures ClickHouse Keeper (replacement for ZooKeeper) for distributed coordination
  • Security hardening: Implements SSL/TLS, user authentication, and network restrictions
  • Monitoring integration: Sets up Prometheus metrics endpoints and health checks
  • Backup automation: Configures automated backup with optional S3 storage
  • Schema management: Provides a framework for database and table creation

Getting Started: Initial Project Setup

The first step in using this automation framework is to run the setup script that generates the entire project structure. The script takes parameters for CPU cores and RAM to properly configure ClickHouse’s hardware-aware settings:

```bash
sudo ./setup-clickhouse-ansible.sh --cpu 32 --ram 256 --version 25.4.1.1
```

This command generates output similar to:

```
ClickHouse Ansible Project Setup
Configuration:
  CPU Cores: 32
  RAM (GB): 256
  ClickHouse Version: 25.4.1.1

Creating directory structure...
Creating initial configuration files...
Creating README.md...

Improved ClickHouse Ansible Project Structure Created Successfully!

Usage:
  ./setup-clickhouse-ansible.sh --cpu 16 --ram 64                      - Create project with 16 CPU cores and 64GB RAM
  ./setup-clickhouse-ansible.sh --cpu 32 --ram 128 --version 25.4.1.1  - Specify CPU, RAM and ClickHouse version

Next steps:
  1. Edit config.yml to configure your cluster settings
  2. Run ansible-playbook -i localhost, setup_inventory.yml -c local to generate inventory
  3. Run ansible-playbook -i inventory.yml deploy_clickhouse.yml to deploy your cluster
```

The setup script creates a complete Ansible project structure with all necessary roles, templates, and configuration files. It uses the CPU and RAM parameters to set default values in the configuration, but these can be further customized later.

Architecture of the Automation Framework

The framework implements a comprehensive infrastructure-as-code approach with a clear separation of concerns:

Figure 1: High-level architecture of the ClickHouse Ansible automation framework

Understanding the Directory Structure

The Ansible project follows a well-organized structure that separates concerns and promotes maintainability:

```
clickhouse-ansible/
├── config.yml                      # Central configuration file
├── inventory.yml                   # Generated inventory file
├── deploy_clickhouse.yml           # Main deployment playbook
├── setup_inventory.yml             # Inventory generator playbook
├── group_vars/                     # Group variables
│   └── all.yml                     # Common variables (generated)
├── roles/
│   ├── common/                     # Common setup tasks
│   │   └── tasks/
│   │       ├── main.yml
│   │       ├── install_pre_req.yml
│   │       ├── system_optimizations.yml
│   │       ├── monitoring.yml
│   │       ├── health_checks.yml
│   │       ├── verify_cluster.yml
│   │       ├── clickhouse_keeper/
│   │       └── clickhouse_server/
│   ├── clickhouse_server/          # Server role
│   │   ├── handlers/
│   │   ├── tasks/
│   │   └── templates/
│   └── clickhouse_keeper/          # Keeper role
│       ├── handlers/
│       ├── tasks/
│       └── templates/
└── templates/                      # Templates for generators
    ├── inventory.j2
    └── all.yml.j2
```

This structure adheres to Ansible best practices with role-based organization:

  • Roles: Define the server and keeper configurations independently
  • Tasks: Modular, reusable configuration steps
  • Templates: Jinja2-powered configuration generation
  • Handlers: Service restart notifications

The Configuration Hub: config.yml

The config.yml file serves as the single source of truth for all deployment parameters. Let's examine some key sections:

```yaml
# Hardware configuration
cpu_cores: 16
ram_gb: 64

# ClickHouse version
clickhouse_version: "25.3.2.39"

# Cluster configuration
cluster_name: "clickhouse_cluster"
cluster_secret: "mysecretphrase"
shard_count: 1
replica_count: 3

# Network ports
keeper_port: 9181
keeper_raft_port: 9234
clickhouse_port: 9000
clickhouse_http_port: 8123

# Keeper and server nodes
keeper_ips:
  - "13.91.32.134"
  - "13.91.224.109"
  - "13.91.246.177"

server_ips:
  - "13.64.100.15"
  - "40.112.129.86"
  - "40.112.134.238"

# Performance tuning
hardware_profile: "auto"  # Options: auto, small, medium, large, custom

# Auto-tuning parameters
memory_ratio:
  server_usage_to_ram_ratio: 0.8
  mark_cache_percent: 0.2
  uncompressed_cache_percent: 0.2
```

This configuration allows users to:

  • Specify hardware resources (CPU cores, RAM)
  • Define the cluster topology (shards, replicas)
  • List IP addresses for server and keeper nodes
  • Select a hardware profile for performance tuning

The hardware profile selector is particularly powerful, automatically mapping to optimized resource allocation settings based on available hardware.
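To make the mapping concrete, the selection can be reduced to a single variable lookup in the generated group variables. The sketch below is illustrative of how such a resolution might be wired; the variable name `active_profile` is an assumption, not taken from the project:

```yaml
# group_vars/all.yml (illustrative sketch; active_profile is a hypothetical name)
# "auto" computes values dynamically, "custom" uses the user-supplied block,
# anything else indexes into the predefined hw_profile_params map.
active_profile: "{{ auto_profile if hardware_profile == 'auto' else (custom_profile if hardware_profile == 'custom' else hw_profile_params[hardware_profile]) }}"

# Individual settings are then read from the resolved profile, e.g.:
# max_server_memory_usage: "{{ active_profile.max_server_memory_usage }}"
```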

Hardware-Aware Configuration: The Secret Sauce

One of the most powerful features of this automation is its ability to automatically calculate optimal ClickHouse settings based on available hardware. This is implemented through hardware profiles:

```yaml
hw_profile_params:
  small:
    max_server_memory_usage_to_ram_ratio: 0.7
    max_server_memory_usage: '{{ (ram_bytes | int * 0.7) | int }}'
    background_pool_size: '{{ [4, cpu_cores | int * 0.5] | min | int }}'
    mark_cache_size: '{{ (ram_bytes | int * 0.1) | int }}'
    uncompressed_cache_size: '{{ (ram_bytes | int * 0.1) | int }}'
  medium:
    max_server_memory_usage_to_ram_ratio: 0.75
    max_server_memory_usage: '{{ (ram_bytes | int * 0.75) | int }}'
    background_pool_size: '{{ [8, cpu_cores | int * 0.75] | min | int }}'
    mark_cache_size: '{{ (ram_bytes | int * 0.15) | int }}'
    uncompressed_cache_size: '{{ (ram_bytes | int * 0.15) | int }}'
  large:
    max_server_memory_usage_to_ram_ratio: 0.8
    max_server_memory_usage: '{{ (ram_bytes | int * 0.8) | int }}'
    background_pool_size: '{{ cpu_cores | int }}'
    mark_cache_size: '{{ (ram_bytes | int * 0.2) | int }}'
    uncompressed_cache_size: '{{ (ram_bytes | int * 0.2) | int }}'
```

For the “auto” profile, the automation calculates settings dynamically:

```yaml
auto_profile:
  max_server_memory_usage_to_ram_ratio: '{{ memory_ratio.server_usage_to_ram_ratio }}'
  max_server_memory_usage: '{{ (ram_bytes | int * memory_ratio.server_usage_to_ram_ratio) | int }}'
  background_pool_size: '{{ cpu_cores | int }}'
  mark_cache_size: '{{ (ram_bytes | int * memory_ratio.server_usage_to_ram_ratio * memory_ratio.mark_cache_percent) | int }}'
  uncompressed_cache_size: '{{ (ram_bytes | int * memory_ratio.server_usage_to_ram_ratio * memory_ratio.uncompressed_cache_percent) | int }}'
```

These calculations ensure that critical parameters like memory limits, cache sizes, and thread pools are proportioned appropriately for the available hardware, eliminating guesswork and manual tuning.
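As a concrete illustration, with the setup parameters from earlier (32 cores, 256 GB RAM) and the default ratios shown in config.yml, the auto profile resolves to roughly the following values. These numbers are computed here for illustration and assume ram_bytes is derived as ram_gb * 1024^3; they are not output copied from the tool:

```yaml
# Approximate resolved auto profile for cpu_cores: 32, ram_gb: 256 (illustrative)
max_server_memory_usage_to_ram_ratio: 0.8
max_server_memory_usage: 219902325555     # ~204.8 GiB (256 GiB * 0.8)
background_pool_size: 32                  # equal to cpu_cores
mark_cache_size: 43980465111              # ~40.96 GiB (204.8 GiB * 0.2)
uncompressed_cache_size: 43980465111      # ~40.96 GiB (204.8 GiB * 0.2)
```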

Dynamic Inventory Generation

Rather than manually maintaining an inventory file, the solution dynamically generates it based on the configuration. The setup_inventory.yml playbook processes the configuration and creates:

  1. An inventory.yml file with the server layout
  2. A group_vars/all.yml file with derived variables

The inventory generation template implements the shard and replica assignment logic:

```yaml
{% for i in range(total_nodes|int) %}
{% set current_shard = (i // replica_count|int) + 1 %}
{% set current_replica = (i % replica_count|int) + 1 %}
{% if i < server_ips|length %}
    clickhouse-s{{ '%02d' % current_shard }}-r{{ '%02d' % current_replica }}:
      ansible_host: {{ server_ips[i] }}
      ansible_ssh_private_key_file: "{{ server_ssh_key_path }}"
      shard: "{{ '%02d' % current_shard }}"
      replica: "{{ '%02d' % current_replica }}"
{% endif %}
{% endfor %}
```

This template:

  1. Calculates the appropriate shard and replica ID for each server
  2. Assigns sequential IDs to Keeper nodes
  3. Creates consistent hostname patterns
  4. Sets appropriate SSH connection parameters
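For the single-shard, three-replica layout in the example config.yml above, the rendered inventory ends up looking roughly like this. The group name and exact layout below are illustrative, based on the template fragment shown, rather than copied from a generated file:

```yaml
# inventory.yml (illustrative rendering for shard_count: 1, replica_count: 3)
clickhouse_servers:
  hosts:
    clickhouse-s01-r01:
      ansible_host: 13.64.100.15
      shard: "01"
      replica: "01"
    clickhouse-s01-r02:
      ansible_host: 40.112.129.86
      shard: "01"
      replica: "02"
    clickhouse-s01-r03:
      ansible_host: 40.112.134.238
      shard: "01"
      replica: "03"
```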

System Optimization Deep Dive

ClickHouse’s performance depends significantly on system-level optimizations. The automation applies key optimizations:

```yaml
- name: Configure sysctl parameters for ClickHouse
  sysctl:
    name: '{{ item.name }}'
    value: '{{ item.value }}'
    state: present
    reload: yes
  with_items:
    - { name: 'vm.swappiness', value: '0' }                       # Minimize swapping
    - { name: 'vm.max_map_count', value: '1048576' }              # Increase memory map areas
    - { name: 'net.core.somaxconn', value: '4096' }               # TCP connection queue
    - { name: 'net.ipv4.tcp_max_syn_backlog', value: '4096' }     # SYN backlog
    - { name: 'net.core.netdev_max_backlog', value: '10000' }     # Network packet backlog
    - { name: 'net.ipv4.tcp_slow_start_after_idle', value: '0' }  # Disable TCP slow start
    - { name: 'net.ipv4.tcp_fin_timeout', value: '10' }           # Faster TCP connection cleanup
    - { name: 'net.ipv4.tcp_keepalive_time', value: '60' }        # Faster dead connection detection
    - { name: 'net.ipv4.tcp_keepalive_intvl', value: '10' }
    - { name: 'net.ipv4.tcp_keepalive_probes', value: '6' }
    - { name: 'fs.file-max', value: '9223372036854775807' }       # Maximum file handles
    - { name: 'fs.aio-max-nr', value: '1048576' }                 # Async IO operations limit
```

These optimizations focus on:

  1. Memory Management: Minimize swapping and increase memory mapping limits
  2. Network Performance: Optimize TCP connection handling and backlog queues
  3. File Handling: Increase file descriptor limits for high connection counts
  4. Disk I/O: Configure asynchronous I/O parameters

Additionally, the playbook disables transparent huge pages (THP), which can cause performance issues for databases:

```yaml
- name: Disable transparent huge pages
  shell: |
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
```

And creates a systemd service to ensure this setting persists across reboots.
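A minimal sketch of how such persistence can be wired up is shown below. The unit name, file path, and task names here are illustrative assumptions, not necessarily what the project generates:

```yaml
- name: Install systemd unit to disable THP at boot (illustrative sketch)
  copy:
    dest: /etc/systemd/system/disable-thp.service   # hypothetical unit name
    content: |
      [Unit]
      Description=Disable Transparent Huge Pages
      After=sysinit.target local-fs.target

      [Service]
      Type=oneshot
      ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
      ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'

      [Install]
      WantedBy=multi-user.target

- name: Enable and start the THP unit so the setting survives reboots
  systemd:
    name: disable-thp.service
    enabled: yes
    state: started
    daemon_reload: yes
```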

ClickHouse Keeper Configuration

ClickHouse Keeper (ClickHouse's built-in replacement for ZooKeeper) is configured as a distributed coordination service:

```xml
<clickhouse>
    <keeper_server>
        <tcp_port>{{ clickhouse_keeper_port }}</tcp_port>
        <server_id>{{ server_id }}</server_id>
        <log_storage_path>{{ clickhouse_keeper_coordination_dir }}/logs</log_storage_path>
        <snapshot_storage_path>{{ clickhouse_keeper_coordination_dir }}/snapshots</snapshot_storage_path>

        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <min_session_timeout_ms>10000</min_session_timeout_ms>
            <session_timeout_ms>100000</session_timeout_ms>
            <raft_logs_level>{{ clickhouse_keeper_log_level }}</raft_logs_level>
        </coordination_settings>

        <raft_configuration>
{% for host in groups['clickhouse_keepers'] %}
            <server>
                <id>{{ hostvars[host].server_id }}</id>
                <hostname>{{ hostvars[host].ansible_host }}</hostname>
                <port>{{ clickhouse_keeper_raft_port }}</port>
            </server>
{% endfor %}
        </raft_configuration>
    </keeper_server>
</clickhouse>
```

This configuration:

  1. Sets dedicated log and snapshot storage paths
  2. Configures session and operation timeouts
  3. Establishes unique server IDs
  4. Defines RAFT consensus protocol parameters
  5. Creates a complete list of all Keeper nodes for coordination
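Once the Keeper ensemble is up, its health can be spot-checked with the four-letter-word commands on the client port. The snippet below is a hedged sketch, assuming `nc` is installed on the keeper nodes and that `mntr` is in the keeper's four_letter_word_white_list (it typically is by default):

```yaml
- name: Query Keeper status with the mntr four-letter command (illustrative)
  shell: echo mntr | nc -w 2 127.0.0.1 9181   # 9181 matches keeper_port in config.yml
  register: keeper_mntr
  changed_when: false

- name: Show Keeper role, follower count, and other metrics
  debug:
    var: keeper_mntr.stdout_lines
```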

ClickHouse Server Cluster Configuration

The server configuration manages the distributed cluster setup through several key files:

1. Macros for Sharding/Replication

```xml
<clickhouse>
    <macros>
        <shard>{{ shard }}</shard>
        <replica>{{ replica }}</replica>
        <cluster>{{ clickhouse_cluster_name }}</cluster>
    </macros>
</clickhouse>
```

These macros identify each server’s role in the cluster and are used in table definitions.

2. Remote Servers (Cluster Definition)

```xml
<clickhouse>
    <remote_servers replace="true">
        <{{ clickhouse_cluster_name }}>
            <secret>{{ clickhouse_secret }}</secret>
{% set ns = namespace(shard_hosts={}) %}
{# Group servers by shard #}
{% for host in groups['clickhouse_servers'] %}
{% set shard_num = hostvars[host].shard | int %}
{% if shard_num not in ns.shard_hosts %}
{% set ns.shard_hosts = ns.shard_hosts | combine({shard_num: []}) %}
{% endif %}
{% set _ = ns.shard_hosts[shard_num].append(host) %}
{% endfor %}
{# Create shards with proper replica configuration #}
{% for shard_num, hosts in ns.shard_hosts.items() | sort %}
            <shard>
                <internal_replication>true</internal_replication>
{% for host in hosts %}
                <replica>
                    <host>{{ hostvars[host].ansible_host }}</host>
                    <port>{{ clickhouse_server_port }}</port>
                </replica>
{% endfor %}
            </shard>
{% endfor %}
        </{{ clickhouse_cluster_name }}>
    </remote_servers>
</clickhouse>
```

This sophisticated template:

  1. Uses Jinja2 namespacing to create temporary data structures
  2. Groups servers by shard number
  3. Creates nested shard and replica definitions
  4. Enables internal_replication, so writes are replicated by the ReplicatedMergeTree engines (coordinated through Keeper) rather than by the Distributed engine

Security Implementation

The security configuration is comprehensive, including:

SSL/TLS Setup

```yaml
- name: Generate self-signed SSL certificate if not exists
  shell: |
    openssl req -new -newkey rsa:2048 -days 365 -nodes -x509 \
      -subj "/C=US/ST=CA/L=SF/O=ClickHouse/CN={{ inventory_hostname }}" \
      -keyout /etc/clickhouse-server/ssl/server.key \
      -out /etc/clickhouse-server/ssl/server.crt
  args:
    creates: /etc/clickhouse-server/ssl/server.crt
  when: ssl_enabled | bool

This task, together with the rest of the SSL setup in the role, provides:

  • Self-signed certificates per server
  • Strong Diffie-Hellman parameters for secure key exchange
  • Proper file permissions for security

User Authentication and Authorization

```xml
<clickhouse>
    <users>
        <default>
            <!-- Disable default user or set a strong password -->
            <password></password>
            <profile>default</profile>
            <quota>default</quota>
            <access_management>0</access_management>
        </default>
        <admin>
            <password_sha256_hex>{{ admin_password_hash | default('8d969eef6ecad3c29a3a629280e686cf0c3f5d5a86aff3ca12020c923adc6c92') }}</password_sha256_hex>
            <profile>admin</profile>
            <quota>default</quota>
            <access_management>1</access_management>
            <networks>
                <ip>::1/128</ip>
                <ip>127.0.0.1/32</ip>
                <ip>{{ network_access }}</ip>
            </networks>
        </admin>
    </users>
</clickhouse>
```

This configuration implements:

  • Hashed password storage
  • IP-based access restrictions
  • Resource profiles to limit memory and CPU usage
  • Query execution quotas
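The admin_password_hash value itself can be produced with any SHA-256 tool. Within Ansible, one hedged way to derive it from a vaulted plaintext variable is shown below; the variable name vault_admin_password is illustrative, not part of the project:

```yaml
# group_vars/all.yml (illustrative; vault_admin_password would live in an Ansible Vault file)
admin_password_hash: "{{ vault_admin_password | hash('sha256') }}"
```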

Monitoring and Backup Solutions

Prometheus Integration

```xml
<clickhouse>
    <prometheus>
        <endpoint>/metrics</endpoint>
        <port>{{ prometheus_port }}</port>
        <metrics>true</metrics>
        <events>true</events>
        <asynchronous_metrics>true</asynchronous_metrics>
    </prometheus>
</clickhouse>
```

This exposes ClickHouse metrics in Prometheus format on a dedicated port.
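On the Prometheus side, scraping these endpoints only requires a job entry per node. A minimal sketch, assuming prometheus_port is set to 9363 and using the server IPs from the example config.yml:

```yaml
# prometheus.yml fragment (illustrative; port value is an assumption)
scrape_configs:
  - job_name: clickhouse
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - 13.64.100.15:9363
          - 40.112.129.86:9363
          - 40.112.134.238:9363
```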

Health Check Scripts

The automation deploys custom health check scripts for proactive monitoring:

```bash
#!/bin/bash
# ClickHouse Server Health Check Script

# Check if ClickHouse server is running
if ! pgrep -x "clickhouse-server" > /dev/null; then
    echo "ERROR: ClickHouse server is not running!"
    exit 1
fi

# Check if we can connect to the server
if ! clickhouse-client --query "SELECT 1" &>/dev/null; then
    echo "ERROR: Cannot connect to ClickHouse server!"
    exit 1
fi

# Check server uptime
UPTIME=$(clickhouse-client --query "SELECT uptime()")
echo "ClickHouse server uptime: ${UPTIME} seconds"

# Check system.errors count
ERRORS=$(clickhouse-client --query "SELECT count() FROM system.errors")
if [ "$ERRORS" -gt 0 ]; then
    echo "WARNING: Found ${ERRORS} errors in system.errors table!"
else
    echo "No errors found in system.errors table"
fi

# Additional checks for memory, disk usage, replication status...
```

Backup Solution

The automation implements the clickhouse-backup tool for consistent backups:

```yaml
- name: Create backup cron job
  cron:
    name: 'ClickHouse backup'
    user: clickhouse
    hour: '1'
    minute: '0'
    job: '/usr/local/bin/clickhouse-backup create{% if remote_backup_enabled | bool %} && /usr/local/bin/clickhouse-backup upload{% endif %}'
    state: present
  when: backup_enabled | bool
```

Note that the `&&` is emitted only when remote upload is enabled, so the cron entry remains a valid shell command for local-only backups.

The backup configuration supports both local backups and remote S3-compatible storage.
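The clickhouse-backup tool reads its own settings from /etc/clickhouse-backup/config.yml. The sketch below shows a plausible S3-enabled configuration; the bucket, region, retention values, and credentials are placeholders, not values from the project:

```yaml
# /etc/clickhouse-backup/config.yml (hedged sketch; placeholder values)
general:
  remote_storage: s3            # "none" keeps backups local-only
  backups_to_keep_local: 7
  backups_to_keep_remote: 31
s3:
  bucket: my-clickhouse-backups # placeholder
  region: us-west-2             # placeholder
  access_key: "<AWS_ACCESS_KEY>"
  secret_key: "<AWS_SECRET_KEY>"
  path: "backups/node-01"       # placeholder per-node prefix
```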

Schema Management

The automation includes initial schema setup with support for replicated tables:

```sql
CREATE TABLE IF NOT EXISTS analytics.events
(
    event_date Date,
    event_time DateTime,
    event_type String,
    user_id String,
    session_id String,
    properties String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/analytics.events', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, event_type, user_id);
```

And distributed tables for cluster-wide queries:

```sql
CREATE TABLE IF NOT EXISTS analytics.events_distributed
(
    event_date Date,
    event_time DateTime,
    event_type String,
    user_id String,
    session_id String,
    properties String
)
ENGINE = Distributed('{{ clickhouse_cluster_name }}', 'analytics', 'events', rand());
```

Deployment Workflow

The complete deployment workflow is a two-step process:

  1. Generate the Inventory:

```bash
ansible-playbook -i localhost, setup_inventory.yml -c local
```

  2. Deploy the Cluster:

```bash
ansible-playbook -i inventory.yml deploy_clickhouse.yml
```

The deployment playbook executes in two phases:

  • First, deploy and configure ClickHouse Keeper instances
  • Then, deploy and configure ClickHouse Server instances

Each phase includes:

  • OS preparation with package installation and optimization
  • Core service installation
  • Configuration generation
  • Security hardening
  • Monitoring setup
  • Health check configuration

Advanced Usage Scenarios

Multi-Shard Configuration

For horizontal scaling, you can configure multiple shards:

```yaml
shard_count: 3
replica_count: 2
server_ips:
  - "10.0.1.10"   # Shard 1, Replica 1
  - "10.0.1.11"   # Shard 1, Replica 2
  - "10.0.1.12"   # Shard 2, Replica 1
  - "10.0.1.13"   # Shard 2, Replica 2
  - "10.0.1.14"   # Shard 3, Replica 1
  - "10.0.1.15"   # Shard 3, Replica 2
```

Geographic Distribution

For multi-datacenter setups, you can distribute replicas:

```yaml
shard_count: 2
replica_count: 3
server_ips:
  - "10.1.1.10"   # DC1, Shard 1, Replica 1
  - "10.2.1.10"   # DC2, Shard 1, Replica 2
  - "10.3.1.10"   # DC3, Shard 1, Replica 3
  - "10.1.1.11"   # DC1, Shard 2, Replica 1
  - "10.2.1.11"   # DC2, Shard 2, Replica 2
  - "10.3.1.11"   # DC3, Shard 2, Replica 3
```

Custom Hardware Profiles

For specialized workloads, you can create custom hardware profiles:

```yaml
hardware_profile: "custom"

custom_profile:
  max_server_memory_usage_to_ram_ratio: 0.75
  max_server_memory_usage: 0   # Will be calculated based on RAM if set to 0
  background_pool_size: 0      # Will be set to CPU count if 0
  mark_cache_size: 0           # Will be calculated if 0
  uncompressed_cache_size: 0   # Will be calculated if 0
```

Conclusion

This Ansible automation provides a robust, enterprise-grade solution for deploying ClickHouse clusters with optimal configuration. The key advantages are:

  1. Hardware-aware configuration: Automatically tunes ClickHouse based on available resources
  2. Simplicity: Manages complex deployments with minimal configuration
  3. Security: Implements comprehensive security best practices
  4. Flexibility: Supports various cluster topologies and use cases
  5. Maintainability: Follows structured, role-based Ansible best practices

With this automation, organizations can rapidly deploy production-ready ClickHouse clusters without worrying about the intricacies of configuration optimization, security hardening, or operational readiness. The script provided sets up the entire Ansible project structure, creating all the necessary files and directories to implement this comprehensive ClickHouse deployment solution.

By leveraging this automation, teams can focus on using ClickHouse’s analytical capabilities rather than managing infrastructure complexity, ultimately accelerating their data analytics initiatives while maintaining operational excellence.


