Fix metadata digest mismatch after failed SYSTEM RESTART REPLICA #98469

alexey-milovidov wants to merge 1 commit into master
Conversation
When `SYSTEM RESTART REPLICA` fails after exhausting all retries (e.g. due to fault injection causing `MEMORY_LIMIT_EXCEEDED`) or after query cancellation, the table is left permanently detached from the in-memory `tables` map. The `RestartReplicaGuard` (which suppresses digest checks during the restart) is destroyed when the function exits, but the table is never reattached. Subsequent DDL operations that call `checkDigestValid` see a mismatch between `local_digest` (computed from the `tables` map, which excludes the missing table) and `tables_metadata_digest` (which still includes the table's hash), causing a "Digest does not match" `LOGICAL_ERROR` exception.

Fix by adjusting `tables_metadata_digest` in the error path to subtract the hash of the lost table, keeping the in-memory digest consistent with the `tables` map. The table's metadata file still exists on disk and will be restored on server restart or `DatabaseReplicated` recovery.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=acf6aca664cc5cf8f7b0f75e4c7094ef3b6b17bf&name_0=MasterCI&name_1=BuzzHouse%20%28amd_ubsan%29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
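The fix relies on the digest being additive over tables, so one table's contribution can be removed by subtraction. The following is a hypothetical toy sketch of that scheme, not the actual ClickHouse implementation; `ReplicatedDbSketch`, `tableHash`, and `adjustDigestOnTableLost` are illustrative names, and `std::hash` stands in for the real SipHash-based metadata hash.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

/// Toy model of a replicated database's in-memory state: the digest is a sum
/// of per-table metadata hashes, kept in sync with the tables map.
struct ReplicatedDbSketch
{
    std::map<std::string, std::string> tables;  /// name -> CREATE statement
    uint64_t tables_metadata_digest = 0;

    static uint64_t tableHash(const std::string & name, const std::string & metadata)
    {
        /// Stand-in for the real SipHash-based metadata hash.
        return static_cast<uint64_t>(std::hash<std::string>{}(name + '\0' + metadata));
    }

    void attach(const std::string & name, const std::string & metadata)
    {
        tables[name] = metadata;
        tables_metadata_digest += tableHash(name, metadata);
    }

    /// Analogous to the PR's adjustDigestOnTableLostFromRestart(): subtract the
    /// lost table's hash so the digest matches the tables map again.
    void adjustDigestOnTableLost(const std::string & name, const std::string & metadata)
    {
        tables.erase(name);
        tables_metadata_digest -= tableHash(name, metadata);
    }

    /// What a checkDigestValid-style assertion would recompute from the map.
    uint64_t localDigest() const
    {
        uint64_t d = 0;
        for (const auto & [n, m] : tables)
            d += tableHash(n, m);
        return d;
    }
};
```

Because addition over `uint64_t` wraps modulo 2^64, subtracting exactly the lost table's hash restores the invariant `localDigest() == tables_metadata_digest` regardless of which other tables remain.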
Pull request overview
This PR addresses a failure mode in SYSTEM RESTART REPLICA where, if the restart process exits on error/cancellation after detaching the table, DatabaseReplicated can later hit a debug/sanitizer “Digest does not match” assertion due to the table being missing from the in-memory tables map while the digest still reflects the pre-detach state.
Changes:
- Wrap the post-detach table re-creation retry loop in `doRestartReplica()` with an outer `try/catch` and, on failure, call into `DatabaseReplicated` to adjust the in-memory metadata digest.
- Add a `DatabaseReplicated::adjustDigestOnTableLostFromRestart()` API and implement it to subtract the table's metadata hash from `tables_metadata_digest`.
- Add a warning log when this adjustment occurs.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/Interpreters/InterpreterSystemQuery.cpp | Adds failure-path handling for SYSTEM RESTART REPLICA to adjust DatabaseReplicated's digest when restart ultimately fails. |
| src/Databases/DatabaseReplicated.h | Declares adjustDigestOnTableLostFromRestart() for use by the restart path. |
| src/Databases/DatabaseReplicated.cpp | Implements digest adjustment (subtract metadata hash) and logs a warning when invoked. |
```cpp
/// The table is left permanently detached from the in-memory tables map.
/// Adjust the metadata digest so it stays consistent with the tables map,
/// preventing false "Digest does not match" LOGICAL_ERROR exceptions.
/// The table's metadata file still exists on disk, so it will be restored
/// on server restart or after DatabaseReplicated recovery.
if (auto * replicated_db = typeid_cast<DatabaseReplicated *>(database.get()))
{
    try
    {
        replicated_db->adjustDigestOnTableLostFromRestart(replica_table_id.table_name);
    }
    catch (...)
    {
        tryLogCurrentException(log, "Failed to adjust digest after failed SYSTEM RESTART REPLICA");
    }
}
```
Adjusting DatabaseReplicated::tables_metadata_digest here makes it diverge from its documented meaning (sum of local on-disk metadata files). Since the metadata file still exists, subsequent DatabaseReplicated DDL commits may base /replicas/<replica>/digest updates on this adjusted value, potentially masking the inconsistency with ZooKeeper metadata or triggering recovery decisions based on an artificial digest. Consider instead rolling back the detach on failure (reattach the original table instance, or reload/ATTACH from metadata) so both the tables map and digest stay consistent, or add an explicit "restart failed" state that suppresses digest assertions without rewriting the digest value.
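The rollback alternative the reviewer describes can be modeled in isolation. This is a hypothetical toy sketch (not ClickHouse code; `Tables`, `digestOf`, and `restartWithRollback` are illustrative names): on failure, the previously detached table instance is re-attached, so the digest keeps its documented meaning without any adjustment.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

using Tables = std::map<std::string, std::string>;  /// name -> metadata

/// Recompute the additive digest from the tables map.
inline uint64_t digestOf(const Tables & tables)
{
    uint64_t d = 0;
    for (const auto & [n, m] : tables)
        d += static_cast<uint64_t>(std::hash<std::string>{}(n + '\0' + m));
    return d;
}

/// Detach the table, try to re-create it, and re-attach the original instance
/// if re-creation fails; the original error is rethrown to the caller.
inline void restartWithRollback(Tables & tables, const std::string & name,
                                const std::function<std::string()> & recreate)
{
    auto node = tables.extract(name);   /// detach
    try
    {
        tables[name] = recreate();      /// re-create (may throw)
    }
    catch (...)
    {
        if (!node.empty())
            tables.insert(std::move(node));  /// rollback: re-attach as-is
        throw;
    }
}
```

In this model the tables map and the digest stay consistent through a failed restart by construction, at the cost of keeping the possibly stale table instance attached.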
Suggested change:

```diff
-/// The table is left permanently detached from the in-memory tables map.
-/// Adjust the metadata digest so it stays consistent with the tables map,
-/// preventing false "Digest does not match" LOGICAL_ERROR exceptions.
-/// The table's metadata file still exists on disk, so it will be restored
-/// on server restart or after DatabaseReplicated recovery.
-if (auto * replicated_db = typeid_cast<DatabaseReplicated *>(database.get()))
-{
-    try
-    {
-        replicated_db->adjustDigestOnTableLostFromRestart(replica_table_id.table_name);
-    }
-    catch (...)
-    {
-        tryLogCurrentException(log, "Failed to adjust digest after failed SYSTEM RESTART REPLICA");
-    }
-}
+/// Failed to restart replica; table remains detached in the in-memory tables map.
+/// Do not adjust DatabaseReplicated metadata digest here, because on-disk metadata
+/// has not changed and the digest must continue to reflect the sum of local metadata files.
+tryLogCurrentException(log, "Failed to restart replica, table will remain detached");
```
```cpp
/// If all retries fail (or the query is cancelled), we adjust the in-memory metadata digest
/// to reflect the table's absence, preventing false "Digest does not match" assertions.
StoragePtr new_table;
try
{
    size_t non_zk_retries = 0;
    constexpr size_t max_non_zk_retries = 10;
    while (true)
    {
        try
        {
            new_table = StorageFactory::instance().get(create,
                data_path,
                system_context,
                system_context->getGlobalContext(),
                columns,
                constraints,
                LoadingStrictnessLevel::ATTACH);

            break;
        }
        catch (const Coordination::Exception & e)
        {
            /// Only retry on transient ZooKeeper errors (connection loss, session expired, etc.)
            if (!Coordination::isHardwareError(e.code))
                throw;

            tryLogCurrentException(
                getLogger("InterpreterSystemQuery"),
                fmt::format("Failed to restart replica {}, will retry", replica_table_id.getNameForLogs()));

            /// Check if the query was cancelled (e.g. server is shutting down)
            if (auto process_list_element = getContext()->getProcessListElementSafe())
                process_list_element->checkTimeLimit();

            sleepForSeconds(1);
        }
        catch (...)
        {
            if (++non_zk_retries > max_non_zk_retries)
                throw;

            tryLogCurrentException(
                getLogger("InterpreterSystemQuery"),
                fmt::format("Failed to restart replica {} (attempt {}/{}), will retry",
                    replica_table_id.getNameForLogs(), non_zk_retries, max_non_zk_retries));

            if (auto process_list_element = getContext()->getProcessListElementSafe())
                process_list_element->checkTimeLimit();

            sleepForSeconds(1);
        }
    }
}
catch (...)
{
    /// The table is left permanently detached from the in-memory tables map.
    /// Adjust the metadata digest so it stays consistent with the tables map,
    /// preventing false "Digest does not match" LOGICAL_ERROR exceptions.
    /// The table's metadata file still exists on disk, so it will be restored
    /// on server restart or after DatabaseReplicated recovery.
    if (auto * replicated_db = typeid_cast<DatabaseReplicated *>(database.get()))
    {
        try
        {
            replicated_db->adjustDigestOnTableLostFromRestart(replica_table_id.table_name);
        }
        catch (...)
        {
            tryLogCurrentException(log, "Failed to adjust digest after failed SYSTEM RESTART REPLICA");
        }
    }
    throw;
}
```
This new failure-path behavior is hard to validate without a regression test. Please add a stateless/integration test that forces the post-detach re-create step to fail (e.g. via fault injection / memory limit) and then verifies we don’t hit the debug/sanitizer "Digest does not match" path and that the table is eventually recoverable (restart or DatabaseReplicated recovery).
Summary

When `SYSTEM RESTART REPLICA` fails after exhausting all retries (e.g. due to fault injection causing `MEMORY_LIMIT_EXCEEDED`) or after query cancellation, the table is left permanently detached from the in-memory `tables` map. The `RestartReplicaGuard` is destroyed when the function exits, but the table is never reattached. Subsequent DDL operations that call `checkDigestValid` see a mismatch between `local_digest` (from the `tables` map, excluding the missing table) and `tables_metadata_digest` (which still includes the table's hash), causing a "Digest does not match" `LOGICAL_ERROR` exception.

Fix by adjusting `tables_metadata_digest` in the error path to subtract the hash of the lost table, keeping the in-memory digest consistent with the `tables` map. The table's metadata file still exists on disk and will be restored on server restart or `DatabaseReplicated` recovery.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=acf6aca664cc5cf8f7b0f75e4c7094ef3b6b17bf&name_0=MasterCI&name_1=BuzzHouse%20%28amd_ubsan%29

Test plan

- Existing `SYSTEM RESTART REPLICA` behavior is unaffected

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...
🤖 Generated with Claude Code