Introduction
ClickHouse has established itself as one of the leading columnar database management systems for analytical workloads, processing billions of rows in milliseconds. However, deploying a production-ready ClickHouse cluster comes with significant complexity — from configuring proper sharding and replication to optimizing system parameters for peak performance. In this technical deep-dive, I’ll walk through an enterprise-grade ClickHouse automation solution that eliminates deployment complexity while enabling hardware-aware configuration for optimal performance.
The automation solution we’ll explore allows engineers to deploy highly available, secure, and optimized ClickHouse clusters using a single configuration file and two commands. By the end of this article, you’ll understand how this Ansible project is structured, the key configurations it manages, and how it automatically tunes ClickHouse based on your hardware specifications.
Real-World Challenges in ClickHouse Deployment
Before diving into the solution, let’s understand the common challenges organizations face when deploying ClickHouse:
- Configuration Complexity: ClickHouse has hundreds of configuration parameters, many of which require careful tuning based on workload and hardware.
- Resource Optimization: Setting memory limits, cache sizes, and thread pools incorrectly can lead to performance issues or out-of-memory errors.
- High Availability Design: Implementing proper sharding and replication requires careful planning and configuration.
- Operational Readiness: Production deployments need comprehensive monitoring, backup solutions, and health checks.
- Security Implementation: Enterprise deployments require proper encryption, authentication, and authorization.
Key Features of the Ansible Automation
Let’s highlight what this automation solution provides:
- Hardware-aware configuration: Automatically optimizes ClickHouse settings based on available CPU cores and RAM
- Flexible cluster topologies: Supports arbitrary shard and replica combinations
- High availability: Configures ClickHouse Keeper (replacement for ZooKeeper) for distributed coordination
- Security hardening: Implements SSL/TLS, user authentication, and network restrictions
- Monitoring integration: Sets up Prometheus metrics endpoints and health checks
- Backup automation: Configures automated backup with optional S3 storage
- Schema management: Provides a framework for database and table creation
Getting Started: Initial Project Setup
The first step in using this automation framework is to run the setup script that generates the entire project structure. The script takes parameters for CPU cores and RAM to properly configure ClickHouse’s hardware-aware settings:
```bash
sudo ./setup-clickhouse-ansible.sh --cpu 32 --ram 256 --version 25.4.1.1
```
This command generates output similar to:
```text
ClickHouse Ansible Project Setup

Configuration:
  CPU Cores: 32
  RAM (GB): 256
  ClickHouse Version: 25.4.1.1

Creating directory structure...
Creating initial configuration files...
Creating README.md...

Improved ClickHouse Ansible Project Structure Created Successfully!

Usage:
  ./setup-clickhouse-ansible.sh --cpu 16 --ram 64                     - Create project with 16 CPU cores and 64GB RAM
  ./setup-clickhouse-ansible.sh --cpu 32 --ram 128 --version 25.4.1.1 - Specify CPU, RAM and ClickHouse version

Next steps:
1. Edit config.yml to configure your cluster settings
2. Run ansible-playbook -i localhost, setup_inventory.yml -c local to generate inventory
3. Run ansible-playbook -i inventory.yml deploy_clickhouse.yml to deploy your cluster
```
The setup script creates a complete Ansible project structure with all necessary roles, templates, and configuration files. It uses the CPU and RAM parameters to set default values in the configuration, but these can be further customized later.
Architecture of the Automation Framework
The framework implements a comprehensive infrastructure-as-code approach with a clear separation of concerns:
Figure 1: High-level architecture of the ClickHouse Ansible automation framework
Understanding the Directory Structure
The Ansible project follows a well-organized structure that separates concerns and promotes maintainability:
```text
clickhouse-ansible/
├── config.yml                  # Central configuration file
├── inventory.yml               # Generated inventory file
├── deploy_clickhouse.yml       # Main deployment playbook
├── setup_inventory.yml         # Inventory generator playbook
├── group_vars/                 # Group variables
│   └── all.yml                 # Common variables (generated)
├── roles/
│   ├── common/                 # Common setup tasks
│   │   └── tasks/
│   │       ├── main.yml
│   │       ├── install_pre_req.yml
│   │       ├── system_optimizations.yml
│   │       ├── monitoring.yml
│   │       ├── health_checks.yml
│   │       ├── verify_cluster.yml
│   │       ├── clickhouse_keeper/
│   │       └── clickhouse_server/
│   ├── clickhouse_server/      # Server role
│   │   ├── handlers/
│   │   ├── tasks/
│   │   └── templates/
│   └── clickhouse_keeper/      # Keeper role
│       ├── handlers/
│       ├── tasks/
│       └── templates/
└── templates/                  # Templates for generators
    ├── inventory.j2
    └── all.yml.j2
```
This structure adheres to Ansible best practices with role-based organization:
- Roles: Define the server and keeper configurations independently
- Tasks: Modular, reusable configuration steps
- Templates: Jinja2-powered configuration generation
- Handlers: Service restart notifications
The Configuration Hub: config.yml
The config.yml file serves as the single source of truth for all deployment parameters. Let's examine some key sections:
```yaml
# Hardware configuration
cpu_cores: 16
ram_gb: 64

# ClickHouse version
clickhouse_version: "25.3.2.39"

# Cluster configuration
cluster_name: "clickhouse_cluster"
cluster_secret: "mysecretphrase"
shard_count: 1
replica_count: 3

# Network ports
keeper_port: 9181
keeper_raft_port: 9234
clickhouse_port: 9000
clickhouse_http_port: 8123

# Keeper and server nodes
keeper_ips:
  - "13.91.32.134"
  - "13.91.224.109"
  - "13.91.246.177"

server_ips:
  - "13.64.100.15"
  - "40.112.129.86"
  - "40.112.134.238"

# Performance tuning
hardware_profile: "auto"   # Options: auto, small, medium, large, custom

# Auto-tuning parameters
memory_ratio:
  server_usage_to_ram_ratio: 0.8
  mark_cache_percent: 0.2
  uncompressed_cache_percent: 0.2
```
This configuration allows users to:
- Specify hardware resources (CPU cores, RAM)
- Define the cluster topology (shards, replicas)
- List IP addresses for server and keeper nodes
- Select a hardware profile for performance tuning
The hardware profile selector is particularly powerful, automatically mapping to optimized resource allocation settings based on available hardware.
Hardware-Aware Configuration: The Secret Sauce
One of the most powerful features of this automation is its ability to automatically calculate optimal ClickHouse settings based on available hardware. This is implemented through hardware profiles:
```yaml
hw_profile_params:
  small:
    max_server_memory_usage_to_ram_ratio: 0.7
    max_server_memory_usage: '{{ (ram_bytes | int * 0.7) | int }}'
    background_pool_size: '{{ [4, cpu_cores | int * 0.5] | min | int }}'
    mark_cache_size: '{{ (ram_bytes | int * 0.1) | int }}'
    uncompressed_cache_size: '{{ (ram_bytes | int * 0.1) | int }}'
  medium:
    max_server_memory_usage_to_ram_ratio: 0.75
    max_server_memory_usage: '{{ (ram_bytes | int * 0.75) | int }}'
    background_pool_size: '{{ [8, cpu_cores | int * 0.75] | min | int }}'
    mark_cache_size: '{{ (ram_bytes | int * 0.15) | int }}'
    uncompressed_cache_size: '{{ (ram_bytes | int * 0.15) | int }}'
  large:
    max_server_memory_usage_to_ram_ratio: 0.8
    max_server_memory_usage: '{{ (ram_bytes | int * 0.8) | int }}'
    background_pool_size: '{{ cpu_cores | int }}'
    mark_cache_size: '{{ (ram_bytes | int * 0.2) | int }}'
    uncompressed_cache_size: '{{ (ram_bytes | int * 0.2) | int }}'
```
For the “auto” profile, the automation calculates settings dynamically:
```yaml
auto_profile:
  max_server_memory_usage_to_ram_ratio: '{{ memory_ratio.server_usage_to_ram_ratio }}'
  max_server_memory_usage: '{{ (ram_bytes | int * memory_ratio.server_usage_to_ram_ratio) | int }}'
  background_pool_size: '{{ cpu_cores | int }}'
  mark_cache_size: '{{ (ram_bytes | int * memory_ratio.server_usage_to_ram_ratio * memory_ratio.mark_cache_percent) | int }}'
  uncompressed_cache_size: '{{ (ram_bytes | int * memory_ratio.server_usage_to_ram_ratio * memory_ratio.uncompressed_cache_percent) | int }}'
```
These calculations ensure that critical parameters like memory limits, cache sizes, and thread pools are proportioned appropriately for the available hardware, eliminating guesswork and manual tuning.
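As a rough worked example (my arithmetic, assuming ram_bytes is derived as 64 × 1024³; the project may compute it slightly differently), a 16-core, 64 GB node running the auto profile with the default ratios above resolves to approximately:

```yaml
# Illustrative resolved values for cpu_cores: 16, ram_gb: 64, hardware_profile: "auto"
max_server_memory_usage_to_ram_ratio: 0.8
max_server_memory_usage: 54975581388     # 0.8 * 68719476736 bytes ≈ 51.2 GiB
background_pool_size: 16                 # one background thread per core
mark_cache_size: 10995116277             # 0.8 * 0.2 of RAM ≈ 10.2 GiB
uncompressed_cache_size: 10995116277     # 0.8 * 0.2 of RAM ≈ 10.2 GiB
```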
Dynamic Inventory Generation
Rather than manually maintaining an inventory file, the solution generates it dynamically from the configuration. The setup_inventory.yml playbook processes the configuration and creates:
- An inventory.yml file with the server layout
- A group_vars/all.yml file with derived variables
The inventory generation template implements the mapping from the flat server_ips list to shard and replica assignments:
```yaml
{% for i in range(total_nodes|int) %}
{% set current_shard = (i // replica_count|int) + 1 %}
{% set current_replica = (i % replica_count|int) + 1 %}
{% if i < server_ips|length %}
    clickhouse-s{{ '%02d' % current_shard }}-r{{ '%02d' % current_replica }}:
      ansible_host: {{ server_ips[i] }}
      ansible_ssh_private_key_file: "{{ server_ssh_key_path }}"
      shard: "{{ '%02d' % current_shard }}"
      replica: "{{ '%02d' % current_replica }}"
{% endif %}
{% endfor %}
```
This template:
- Calculates the appropriate shard and replica ID for each server
- Assigns sequential IDs to Keeper nodes
- Creates consistent hostname patterns
- Sets appropriate SSH connection parameters
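For the single-shard, three-replica layout from the earlier config.yml, the rendered inventory would look roughly like this (the hostnames follow the template above; the surrounding group structure is an assumption about inventory.j2):

```yaml
clickhouse_servers:
  hosts:
    clickhouse-s01-r01:
      ansible_host: 13.64.100.15
      shard: "01"
      replica: "01"
    clickhouse-s01-r02:
      ansible_host: 40.112.129.86
      shard: "01"
      replica: "02"
    clickhouse-s01-r03:
      ansible_host: 40.112.134.238
      shard: "01"
      replica: "03"
```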
System Optimization Deep Dive
ClickHouse’s performance depends significantly on system-level optimizations. The automation applies key optimizations:
```yaml
- name: Configure sysctl parameters for ClickHouse
  sysctl:
    name: '{{ item.name }}'
    value: '{{ item.value }}'
    state: present
    reload: yes
  with_items:
    - { name: 'vm.swappiness', value: '0' }                        # Minimize swapping
    - { name: 'vm.max_map_count', value: '1048576' }               # Increase memory map areas
    - { name: 'net.core.somaxconn', value: '4096' }                # TCP connection queue
    - { name: 'net.ipv4.tcp_max_syn_backlog', value: '4096' }      # SYN backlog
    - { name: 'net.core.netdev_max_backlog', value: '10000' }      # Network packet backlog
    - { name: 'net.ipv4.tcp_slow_start_after_idle', value: '0' }   # Disable TCP slow start after idle
    - { name: 'net.ipv4.tcp_fin_timeout', value: '10' }            # Faster TCP connection cleanup
    - { name: 'net.ipv4.tcp_keepalive_time', value: '60' }         # Faster dead connection detection
    - { name: 'net.ipv4.tcp_keepalive_intvl', value: '10' }
    - { name: 'net.ipv4.tcp_keepalive_probes', value: '6' }
    - { name: 'fs.file-max', value: '9223372036854775807' }        # Maximum file handles
    - { name: 'fs.aio-max-nr', value: '1048576' }                  # Async IO operations limit
```
These optimizations focus on:
- Memory Management: Minimize swapping and increase memory mapping limits
- Network Performance: Optimize TCP connection handling and backlog queues
- File Handling: Increase file descriptor limits for high connection counts
- Disk I/O: Configure asynchronous I/O parameters
Additionally, the playbook disables transparent huge pages (THP), which can cause performance issues for databases:
```yaml
- name: Disable transparent huge pages
  shell: |
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
```
And creates a systemd service to ensure this setting persists across reboots.
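A minimal sketch of how that persistence can be wired up (the unit name and file path here are illustrative, not necessarily what the playbook uses):

```yaml
- name: Install systemd unit that disables THP at boot
  copy:
    dest: /etc/systemd/system/disable-thp.service
    content: |
      [Unit]
      Description=Disable transparent huge pages for ClickHouse
      After=sysinit.target

      [Service]
      Type=oneshot
      ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
      ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'
      RemainAfterExit=yes

      [Install]
      WantedBy=multi-user.target

- name: Enable and start the THP service
  systemd:
    name: disable-thp.service
    enabled: yes
    state: started
    daemon_reload: yes
```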
ClickHouse Keeper Configuration
ClickHouse Keeper (ClickHouse's built-in replacement for ZooKeeper) is configured as the distributed coordination service:
```xml
<clickhouse>
    <keeper_server>
        <tcp_port>{{ clickhouse_keeper_port }}</tcp_port>
        <server_id>{{ server_id }}</server_id>
        <log_storage_path>{{ clickhouse_keeper_coordination_dir }}/logs</log_storage_path>
        <snapshot_storage_path>{{ clickhouse_keeper_coordination_dir }}/snapshots</snapshot_storage_path>
        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <min_session_timeout_ms>10000</min_session_timeout_ms>
            <session_timeout_ms>100000</session_timeout_ms>
            <raft_logs_level>{{ clickhouse_keeper_log_level }}</raft_logs_level>
        </coordination_settings>
        <raft_configuration>
{% for host in groups['clickhouse_keepers'] %}
            <server>
                <id>{{ hostvars[host].server_id }}</id>
                <hostname>{{ hostvars[host].ansible_host }}</hostname>
                <port>{{ clickhouse_keeper_raft_port }}</port>
            </server>
{% endfor %}
        </raft_configuration>
    </keeper_server>
</clickhouse>
```
This configuration, together with the rest of the Keeper template:
- Sets up proper logging with rotation
- Configures connection handling
- Establishes unique server IDs
- Defines RAFT consensus protocol parameters
- Creates a complete list of all Keeper nodes for coordination
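The template above draws on a handful of Keeper variables; the ports come straight from config.yml, while the coordination directory and log level shown here are assumed defaults:

```yaml
clickhouse_keeper_port: 9181              # keeper_port in config.yml
clickhouse_keeper_raft_port: 9234         # keeper_raft_port in config.yml
clickhouse_keeper_coordination_dir: /var/lib/clickhouse/coordination   # assumed default path
clickhouse_keeper_log_level: information  # assumed default Raft log level
```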
ClickHouse Server Cluster Configuration
The server configuration manages the distributed cluster setup through several key files:
1. Macros for Sharding/Replication
```xml
<clickhouse>
    <macros>
        <shard>{{ shard }}</shard>
        <replica>{{ replica }}</replica>
        <cluster>{{ clickhouse_cluster_name }}</cluster>
    </macros>
</clickhouse>
```
These macros identify each server’s role in the cluster and are used in table definitions.
2. Remote Servers (Cluster Definition)
```xml
<clickhouse>
    <remote_servers replace="true">
        <{{ clickhouse_cluster_name }}>
            <secret>{{ clickhouse_secret }}</secret>
{% set ns = namespace(shard_hosts={}) %}
{# Group servers by shard #}
{% for host in groups['clickhouse_servers'] %}
{% set shard_num = hostvars[host].shard | int %}
{% if shard_num not in ns.shard_hosts %}
{% set ns.shard_hosts = ns.shard_hosts | combine({shard_num: []}) %}
{% endif %}
{% set _ = ns.shard_hosts[shard_num].append(host) %}
{% endfor %}
{# Create shards with proper replica configuration #}
{% for shard_num, hosts in ns.shard_hosts.items() | sort %}
            <shard>
                <internal_replication>true</internal_replication>
{% for host in hosts %}
                <replica>
                    <host>{{ hostvars[host].ansible_host }}</host>
                    <port>{{ clickhouse_server_port }}</port>
                </replica>
{% endfor %}
            </shard>
{% endfor %}
        </{{ clickhouse_cluster_name }}>
    </remote_servers>
</clickhouse>
```
This sophisticated template:
- Uses Jinja2 namespacing to create temporary data structures
- Groups servers by shard number
- Creates nested shard and replica definitions
- Enables internal replication, so inserts through a Distributed table go to one replica per shard and ReplicatedMergeTree propagates them to the others via Keeper
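Once both roles have run, a verification step along these lines (a sketch in the spirit of roles/common/tasks/verify_cluster.yml, not its literal contents) confirms that every shard and replica is registered:

```yaml
- name: Verify cluster topology in system.clusters
  command: >
    clickhouse-client --query
    "SELECT shard_num, replica_num, host_name
     FROM system.clusters
     WHERE cluster = '{{ clickhouse_cluster_name }}'"
  register: cluster_topology
  changed_when: false

- name: Show discovered topology
  debug:
    var: cluster_topology.stdout_lines
```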
Security Implementation
The security configuration is comprehensive, including:
SSL/TLS Setup
```yaml
- name: Generate self-signed SSL certificate if not exists
  shell: |
    openssl req -new -newkey rsa:2048 -days 365 -nodes -x509 \
      -subj "/C=US/ST=CA/L=SF/O=ClickHouse/CN={{ inventory_hostname }}" \
      -keyout /etc/clickhouse-server/ssl/server.key \
      -out /etc/clickhouse-server/ssl/server.crt
  args:
    creates: /etc/clickhouse-server/ssl/server.crt
  when: ssl_enabled | bool
```
Together with related tasks in the role, this creates:
- Self-signed certificates per server
- Strong Diffie-Hellman parameters for secure key exchange
- Proper file permissions for security
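The Diffie-Hellman parameters are generated by a companion task; a hedged sketch of what that step typically looks like (the output path is an assumption):

```yaml
- name: Generate Diffie-Hellman parameters if not present
  command: openssl dhparam -out /etc/clickhouse-server/ssl/dhparam.pem 2048
  args:
    creates: /etc/clickhouse-server/ssl/dhparam.pem
  when: ssl_enabled | bool
```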
User Authentication and Authorization
```xml
<clickhouse>
    <users>
        <default>
            <!-- Disable default user or set a strong password -->
            <password></password>
            <profile>default</profile>
            <quota>default</quota>
            <access_management>0</access_management>
        </default>
        <admin>
            <password_sha256_hex>{{ admin_password_hash | default('8d969eef6ecad3c29a3a629280e686cf0c3f5d5a86aff3ca12020c923adc6c92') }}</password_sha256_hex>
            <profile>admin</profile>
            <quota>default</quota>
            <access_management>1</access_management>
            <networks>
                <ip>::1/128</ip>
                <ip>127.0.0.1/32</ip>
                <ip>{{ network_access }}</ip>
            </networks>
        </admin>
    </users>
</clickhouse>
```
This configuration implements:
- Hashed password storage
- IP-based access restrictions
- Resource profiles to limit memory and CPU usage
- Query execution quotas
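In practice you would override the default admin hash; one hedged way to supply it (variable names are illustrative) is to derive it from a vaulted secret with Ansible's hash filter in group_vars/all.yml:

```yaml
# group_vars/all.yml (illustrative)
admin_password_hash: "{{ vault_admin_password | hash('sha256') }}"
network_access: "10.0.0.0/8"   # CIDR range allowed to reach the admin account
```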
Monitoring and Backup Solutions
Prometheus Integration
```xml
<clickhouse>
    <prometheus>
        <endpoint>/metrics</endpoint>
        <port>{{ prometheus_port }}</port>
        <metrics>true</metrics>
        <events>true</events>
        <asynchronous_metrics>true</asynchronous_metrics>
    </prometheus>
</clickhouse>
```
This exposes ClickHouse metrics in Prometheus format on a dedicated port.
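On the Prometheus side, the endpoints can then be scraped with an ordinary static job; the sketch below assumes prometheus_port is set to 9363 (ClickHouse's conventional default) and reuses the example server IPs:

```yaml
scrape_configs:
  - job_name: clickhouse
    metrics_path: /metrics
    static_configs:
      - targets:
          - 13.64.100.15:9363
          - 40.112.129.86:9363
          - 40.112.134.238:9363
```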
Health Check Scripts
The automation deploys custom health check scripts for proactive monitoring:
```bash
#!/bin/bash
# ClickHouse Server Health Check Script

# Check if ClickHouse server is running
if ! pgrep -x "clickhouse-server" > /dev/null; then
    echo "ERROR: ClickHouse server is not running!"
    exit 1
fi

# Check if we can connect to the server
if ! clickhouse-client --query "SELECT 1" &>/dev/null; then
    echo "ERROR: Cannot connect to ClickHouse server!"
    exit 1
fi

# Check server uptime
UPTIME=$(clickhouse-client --query "SELECT uptime()")
echo "ClickHouse server uptime: ${UPTIME} seconds"

# Check system.errors count
ERRORS=$(clickhouse-client --query "SELECT count() FROM system.errors")
if [ "$ERRORS" -gt 0 ]; then
    echo "WARNING: Found ${ERRORS} errors in system.errors table!"
else
    echo "No errors found in system.errors table"
fi

# Additional checks for memory, disk usage, replication status...
```
Backup Solution
The automation implements the clickhouse-backup tool for consistent backups:
```yaml
- name: Create backup cron job
  cron:
    name: 'ClickHouse backup'
    user: clickhouse
    hour: '1'
    minute: '0'
    job: '/usr/local/bin/clickhouse-backup create && {% if remote_backup_enabled | bool %}/usr/local/bin/clickhouse-backup upload{% endif %}'
    state: present
  when: backup_enabled | bool
```
The backup configuration supports both local backups and remote S3-compatible storage.
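For reference, the clickhouse-backup configuration implied by the S3 option looks roughly like this (keys follow the tool's documented format; bucket, region, and credentials are placeholders):

```yaml
# /etc/clickhouse-backup/config.yml (illustrative)
general:
  remote_storage: s3
  backups_to_keep_local: 3
  backups_to_keep_remote: 7
s3:
  bucket: my-clickhouse-backups
  region: us-west-2
  path: "{{ inventory_hostname }}"
  access_key: "{{ s3_access_key }}"
  secret_key: "{{ s3_secret_key }}"
```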
Schema Management
The automation includes initial schema setup with support for replicated tables:
```sql
CREATE TABLE IF NOT EXISTS analytics.events
(
    event_date Date,
    event_time DateTime,
    event_type String,
    user_id String,
    session_id String,
    properties String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/analytics.events', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, event_type, user_id);
```
And distributed tables for cluster-wide queries:
```sql
CREATE TABLE IF NOT EXISTS analytics.events_distributed
(
    event_date Date,
    event_time DateTime,
    event_type String,
    user_id String,
    session_id String,
    properties String
)
ENGINE = Distributed('{{ clickhouse_cluster_name }}', 'analytics', 'events', rand());
```
Deployment Workflow
The complete deployment workflow is a two-step process:
1. Generate the inventory:

```bash
ansible-playbook -i localhost, setup_inventory.yml -c local
```

2. Deploy the cluster:

```bash
ansible-playbook -i inventory.yml deploy_clickhouse.yml
```
The deployment playbook executes in two phases:
- First, deploy and configure ClickHouse Keeper instances
- Then, deploy and configure ClickHouse Server instances
Each phase includes:
- OS preparation with package installation and optimization
- Core service installation
- Configuration generation
- Security hardening
- Monitoring setup
- Health check configuration
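Structurally, deploy_clickhouse.yml boils down to two plays, roughly along these lines (role and group names match the project layout shown earlier; the exact task breakdown may differ):

```yaml
- name: Deploy ClickHouse Keeper ensemble
  hosts: clickhouse_keepers
  become: true
  roles:
    - common
    - clickhouse_keeper

- name: Deploy ClickHouse Server nodes
  hosts: clickhouse_servers
  become: true
  roles:
    - common
    - clickhouse_server
```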
Advanced Usage Scenarios
Multi-Shard Configuration
For horizontal scaling, you can configure multiple shards:
```yaml
shard_count: 3
replica_count: 2

server_ips:
  - "10.0.1.10"   # Shard 1, Replica 1
  - "10.0.1.11"   # Shard 1, Replica 2
  - "10.0.1.12"   # Shard 2, Replica 1
  - "10.0.1.13"   # Shard 2, Replica 2
  - "10.0.1.14"   # Shard 3, Replica 1
  - "10.0.1.15"   # Shard 3, Replica 2
```
Geographic Distribution
For multi-datacenter setups, you can distribute replicas:
```yaml
shard_count: 2
replica_count: 3

server_ips:
  - "10.1.1.10"   # DC1, Shard 1, Replica 1
  - "10.2.1.10"   # DC2, Shard 1, Replica 2
  - "10.3.1.10"   # DC3, Shard 1, Replica 3
  - "10.1.1.11"   # DC1, Shard 2, Replica 1
  - "10.2.1.11"   # DC2, Shard 2, Replica 2
  - "10.3.1.11"   # DC3, Shard 2, Replica 3
```
Custom Hardware Profiles
For specialized workloads, you can create custom hardware profiles:
hardware_profile: "custom"
custom_profile:
max_server_memory_usage_to_ram_ratio: 0.75
max_server_memory_usage: 0 # Will be calculated based on RAM if set to 0
background_pool_size: 0 # Will be set to CPU count if 0
mark_cache_size: 0 # Will be calculated if 0
uncompressed_cache_size: 0 # Will be calculated if 0
bash
Conclusion
This Ansible automation provides a robust, enterprise-grade solution for deploying ClickHouse clusters with optimal configuration. The key advantages are:
- Hardware-aware configuration: Automatically tunes ClickHouse based on available resources
- Simplicity: Manages complex deployments with minimal configuration
- Security: Implements comprehensive security best practices
- Flexibility: Supports various cluster topologies and use cases
- Maintainability: Follows structured, role-based Ansible best practices
With this automation, organizations can rapidly deploy production-ready ClickHouse clusters without worrying about the intricacies of configuration optimization, security hardening, or operational readiness. The script provided sets up the entire Ansible project structure, creating all the necessary files and directories to implement this comprehensive ClickHouse deployment solution.
By leveraging this automation, teams can focus on using ClickHouse’s analytical capabilities rather than managing infrastructure complexity, ultimately accelerating their data analytics initiatives while maintaining operational excellence.
Looking for Enterprise-grade ClickHouse Support and Consulting?
Discover how our team of experts can help you optimize your ClickHouse deployment, improve performance, and ensure reliability.