Service Stuck in Awakening State (15+ Minutes)
When you start, scale, or wake a ClickHouse Cloud service, you may sometimes experience longer-than-expected wait times. This guide explains normal startup timelines, common causes of delays, and self-service troubleshooting steps.
Startup duration depends on several factors:
| Scenario | Typical duration | Investigation threshold |
|---|---|---|
| Starting idle service (minimal metadata) | 2-3 minutes | >10 minutes |
| Starting service with dictionaries | 5-15 minutes | >30 minutes |
| Scaling operations (adding replicas) | 10-20 minutes | >45 minutes |
| Release channel changes with upgrades | 15-30 minutes | >60 minutes |
| Services with large metadata volumes | 20-40 minutes | >60 minutes |
Services typically show a "Starting" or "Awakening" status while:
- Kubernetes pods are being created and scheduled
- Dictionaries are being loaded from ClickHouse Keeper
- Metadata is being synchronized across replicas
- Readiness health checks are passing
If your ClickHouse Cloud service takes longer than the durations described above to wake from idle, the delay is usually caused by one of a few specific issues. The sections below will help you diagnose and resolve long startup times.
Common causes
1. Dictionary loading blocks startup (most common)
In this case, your service waits 15+ minutes trying to load dictionaries that connect to external sources.
The issue is typically caused by a dictionary with an external source, such as a dictionary that reads from another ClickHouse Cloud service.
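For illustration, a sketch of such a dictionary; the host, table, and credential values below are hypothetical:

```sql
-- Hypothetical example: a dictionary that loads its data from another
-- ClickHouse Cloud service over the network during startup
CREATE DICTIONARY default.customer_lookup
(
    id UInt64,
    name String
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(
    HOST 'other-service.region.clickhouse.cloud'
    PORT 9440
    SECURE 1
    USER 'default'
    PASSWORD '<password>'
    DB 'default'
    TABLE 'customers'
))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);
```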
If the target service is also idle or starting, it creates a circular wait. By default, ClickHouse waits for all dictionaries to load before accepting queries.
To solve the issue:
- Contact support to enable lazy dictionary loading by setting `dictionaries_lazy_load=true` and `wait_dictionaries_load_at_startup=false`
- Use `LIFETIME(MIN 0 MAX 0)` only for small, local dictionaries
- Avoid cross-service dictionary dependencies
- Consider using materialized views or regular tables instead of dictionaries for external data (see the sketch below)
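As a rough illustration of the last point, the following sketch copies external data into a regular local table instead of loading it through a dictionary at startup; the host, table, and credential values are hypothetical:

```sql
-- Hypothetical sketch: keep lookup data in a local MergeTree table and refresh it
-- on your own schedule, so startup does not depend on a remote service
CREATE TABLE default.customer_lookup_local
(
    id UInt64,
    name String
)
ENGINE = MergeTree
ORDER BY id;

-- Refresh when the source data changes (run manually or from your scheduler)
INSERT INTO default.customer_lookup_local
SELECT id, name
FROM remoteSecure(
    'other-service.region.clickhouse.cloud:9440',
    'default', 'customers', 'default', '<password>'
);
```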
2. DDL queue synchronization
Long waits on `SYNC DATABASE REPLICA` operations are the typical symptom of this case.
Services with heavy DDL activity (frequent CREATE/DROP/ALTER) must replay all DDL changes when waking from idle.
This is particularly noticeable after waking from idle because:
- New services start faster because they only copy current metadata
- Idled services must replay hours or days of DDL changes sequentially
- Each DDL operation is replayed one at a time
To solve this issue, you can:
- Increase idle timeout to match your DDL frequency (e.g., if you run DDL hourly, use at least 2-hour idle timeout)
- Consider keeping critical services always-on
- Reduce DDL operations:
  - Use materialized views instead of frequent schema changes
  - Batch DDL operations together (see the sketch below)
  - Avoid frequent CREATE/DROP operations
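As a minimal sketch of batching (the table and column names are hypothetical), a single ALTER with several actions is one DDL statement to replay instead of several:

```sql
-- Instead of several separate statements, each adding its own DDL queue entry:
--   ALTER TABLE events ADD COLUMN country String;
--   ALTER TABLE events ADD COLUMN city String;

-- Batch the changes into a single statement:
ALTER TABLE events
    ADD COLUMN country String,
    ADD COLUMN city String;
```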
3. Idle timeout is too short for startup time
For example, a service with a 5-minute idle timeout may take 15+ minutes to start.
This creates a never-ending cycle: the service idles after 5 minutes of inactivity, the next query triggers a 15-minute wake-up, and shortly after becoming available it idles again.
To confirm this, compare your startup duration with your idle timeout.
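One rough way to check from SQL once the service is awake (a sketch, not an official startup metric):

```sql
-- How long has the service been running since its last wake-up?
SELECT uptime() AS seconds_since_startup;

-- When did queries start arriving today? Compare the first timestamp with the
-- time you issued the wake-up request to estimate startup duration.
SELECT min(event_time) AS first_query_today
FROM system.query_log
WHERE event_date = today()
  AND type = 'QueryStart';
```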
The following idle timeouts are recommended based on startup time:
- 15-30 min startup → 30 min idle timeout minimum
- 5-15 min startup → 15 min idle timeout minimum
- < 5 min startup → 5 min idle timeout acceptable
Always ensure your idle timeout is at least 2x your typical startup time to prevent restart loops.
To solve this issue:
- Contact support to increase your idle timeout
- Consider keeping the service always-on if startup times are consistently high
4. Insufficient resources
Services with minimal resources (e.g., 2 CPU / 8GB RAM) can take 30+ minutes to:
- Load table metadata
- Initialize merge operations
- Synchronize database replicas
The solution here is to:
- Scale up your service manually via the Cloud console or API
- Employ automatic scaling
How to diagnose your service
Check dictionary issues
Connect to your service (once it's awake) and check the status of your dictionaries.
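A query along these lines (a sketch; it reads the standard `system.dictionaries` system table) shows which dictionaries failed to load and why:

```sql
-- List dictionary load status and the last error, slowest loads first
SELECT
    database,
    name,
    status,
    loading_duration,
    last_exception
FROM system.dictionaries
ORDER BY loading_duration DESC;
```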
If you see timeout errors or connection failures, dictionaries are likely blocking startup.
Check for frequent DDL operations
Review your query logs for DDL activity.
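A sketch of such a check, counting DDL statements per day in `system.query_log` over the last week:

```sql
-- Count DDL statements (CREATE / DROP / ALTER / RENAME) per day
SELECT
    toDate(event_time) AS day,
    count() AS ddl_statements
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind IN ('Create', 'Drop', 'Alter', 'Rename')
  AND event_time > now() - INTERVAL 7 DAY
GROUP BY day
ORDER BY day;
```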
If you see hundreds of DDL operations per day, DDL replay may be causing slow startups.
Review your configuration
Check these settings in the Cloud console:
- Idle timeout: Should be at least 2x your startup time
- IP allowlist: If set to `0.0.0.0/0`, port scans can wake your service unnecessarily
- Resource allocation: Ensure adequate CPU and memory
Additional resources
If startup times remain high after trying these solutions, contact ClickHouse Cloud support with your service details for further investigation.