My service is stuck in an awakening state (15+ Minutes)

Troubleshoot ClickHouse Cloud services that take 15+ minutes to wake from idle, including dictionary loading issues, DDL queue synchronization, and resource constraints.

Service Stuck in Awakening State (15+ Minutes)

When you start, scale, or wake a ClickHouse Cloud service, you may sometimes experience longer-than-expected wait times. This guide explains normal startup timelines, common causes of delays, and self-service troubleshooting steps.

Startup duration depends on several factors:

| Scenario | Typical duration | Investigation threshold |
|----------|------------------|--------------------------|
| Starting idle service (minimal metadata) | 2-3 minutes | >10 minutes |
| Starting service with dictionaries | 5-15 minutes | >30 minutes |
| Scaling operations (adding replicas) | 10-20 minutes | >45 minutes |
| Release channel changes with upgrades | 15-30 minutes | >60 minutes |
| Services with large metadata volumes | 20-40 minutes | >60 minutes |

Normal behavior

Services typically show a "Starting" or "Awakening" status while:

  • Kubernetes pods are being created and scheduled
  • Dictionaries are being loaded from ClickHouse Keeper
  • Metadata is being synchronized across replicas
  • Readiness health checks are passing

If your ClickHouse Cloud service takes longer than the durations above to wake from idle, the delay is usually caused by one of a few specific issues. The sections below will help you diagnose and resolve long startup times.

Common causes

1. Dictionary loading blocks startup (most common)

In this case, your service waits 15+ minutes trying to load dictionaries that connect to external sources.

Example from logs:

2025-09-19 06:19:01 | Context: Waiting for dictionaries to be loaded
2025-09-19 06:24:01 | WARN: Connection failed at try #1
2025-09-19 06:29:01 | WARN: Connection failed at try #2
2025-09-19 06:34:01 | ERROR: Could not load external dictionary

The issue is caused by a dictionary with an external source like:

CREATE DICTIONARY testing_env.failure_status_dict
SOURCE(CLICKHOUSE(HOST 'other-service.clickhouse.cloud' ...))
LIFETIME(MIN 0 MAX 0)
LAYOUT(...)

If the target service is also idle or starting, it creates a circular wait. By default, ClickHouse waits for all dictionaries to load before accepting queries.

To solve the issue:

  • Contact support to enable lazy dictionary loading (see the check after this list) by setting:
    • dictionaries_lazy_load=true
    • wait_dictionaries_load_at_startup=false
  • Use LIFETIME(MIN 0 MAX 0) only for small, local dictionaries
  • Avoid cross-service dictionary dependencies
  • Consider using materialized views or regular tables instead of dictionaries for external data
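
Once your service is awake, you can check whether lazy dictionary loading is already enabled with a query like the one below. This is a sketch: the system.server_settings table is only available on recent ClickHouse versions, and whether these two settings are exposed there can vary by release.

-- Sketch: inspect the current values of the dictionary-loading settings
SELECT name, value, changed
FROM system.server_settings
WHERE name IN ('dictionaries_lazy_load', 'wait_dictionaries_load_at_startup');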

2. DDL queue synchronization

Long waits on SYNC DATABASE REPLICA operations are the typical symptom in this case.

Services with heavy DDL activity (frequent CREATE/DROP/ALTER) must replay all DDL changes when waking from idle.

Example from the logs:

Code: 159. TIMEOUT_EXCEEDED: SYNC DATABASE REPLICA prod_events:
database is readonly or command timed out after 180 seconds

This is particularly slow after a long idle period because:

  • New services start faster because they only copy current metadata
  • Idled services must replay hours or days of DDL changes sequentially
  • Each DDL operation is replayed one at a time

To solve this issue, you can:

  • Increase idle timeout to match your DDL frequency (e.g., if you run DDL hourly, use at least 2-hour idle timeout)
  • Consider keeping critical services always-on
  • Reduce DDL operations:
    • Use materialized views instead of frequent schema changes
    • Batch DDL operations together (see the sketch after this list)
    • Avoid frequent CREATE/DROP operations
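
For example, several schema changes that would otherwise be replayed as separate entries in the DDL queue can often be combined into a single statement. The table and column names below are illustrative only:

-- Illustrative sketch: one ALTER with multiple actions instead of three separate DDL statements
ALTER TABLE events
    ADD COLUMN IF NOT EXISTS user_agent String,
    ADD COLUMN IF NOT EXISTS referrer String,
    DROP COLUMN IF EXISTS legacy_flag;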

3. Idle timeout is too short for startup time

A typical example is a service with a 5-minute idle timeout that takes 15+ minutes to start.

This creates a never-ending cycle:

09:15 - Service goes idle
09:16 - Port scan or query wakes it up
09:30 - Still starting...
09:31 - Timeout triggers, pods restart
09:32 - Start over...

In this case, you should check your metrics:

{
  "serverInitDurationSec": "915",  // 15+ minutes!
  "idleTimeout": "1h0m0s"
}

The following idle timeouts are recommended based on startup time:

  • 15-30 min startup → 30 min idle timeout minimum
  • 5-15 min startup → 15 min idle timeout minimum
  • < 5 min startup → 5 min idle timeout acceptable
Tip

Always ensure your idle timeout is at least 2x your typical startup time to prevent restart loops.

To solve this issue:

  • Contact support to increase your idle timeout
  • Consider keeping the service always-on if startup times are consistently high

4. Insufficient resources

Services with minimal resources (e.g., 2 CPU / 8GB RAM) can take 30+ minutes to:

  • Load table metadata
  • Initialize merge operations
  • Synchronize database replicas

The solution here is to:

  • Scale up your service manually via the Cloud console or API
  • Employ automatic scaling

How to diagnose your service

Check for dictionary issues

Connect to your service (once it's awake) and run:

SELECT
    name,
    status,
    last_exception
FROM system.dictionaries
WHERE last_exception != '';

If you see timeout errors or connection failures, dictionaries are likely blocking startup.
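
You can also see how long each dictionary took (or is still taking) to load; loading_duration is reported in seconds:

-- Dictionaries with the longest load times are the most likely startup blockers
SELECT
    name,
    status,
    loading_start_time,
    loading_duration
FROM system.dictionaries
ORDER BY loading_duration DESC;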

Check for frequent DDL operations

Review your query logs:

SELECT
    event_date,
    count() as ddl_count
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind IN ('Create', 'Drop', 'Alter', 'Rename', 'Truncate')
  AND event_date >= today() - 7
GROUP BY event_date
ORDER BY event_date DESC;

If you see hundreds of DDL operations per day, DDL replay may be causing slow startups.
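
To see which kinds of DDL dominate, you can group the same log by query_kind:

-- Break the last 7 days of DDL down by statement type
SELECT
    query_kind,
    count() AS ddl_count
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind IN ('Create', 'Drop', 'Alter', 'Rename', 'Truncate')
  AND event_date >= today() - 7
GROUP BY query_kind
ORDER BY ddl_count DESC;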

Review your configuration

Check these settings in the Cloud console:

  1. Idle timeout: Should be at least 2x your startup time
  2. IP allowlist: If set to 0.0.0.0/0, port scans can wake your service unnecessarily
  3. Resource allocation: Ensure adequate CPU and memory

Additional resources

If startup times remain high after trying these solutions, contact ClickHouse Cloud support with your service details for further investigation.
