
Fix metadata digest mismatch after failed SYSTEM RESTART REPLICA#98469

Open
alexey-milovidov wants to merge 1 commit into master from
fix-digest-mismatch-on-restart-replica-failure

Conversation

@alexey-milovidov
Member

Summary

  • When SYSTEM RESTART REPLICA fails after exhausting all retries (e.g. due to fault injection causing MEMORY_LIMIT_EXCEEDED) or after query cancellation, the table is left permanently detached from the in-memory tables map. The RestartReplicaGuard is destroyed when the function exits, but the table is never reattached. Subsequent DDL operations that call checkDigestValid see a mismatch between local_digest (from the tables map, excluding the missing table) and tables_metadata_digest (which still includes the table's hash), causing a "Digest does not match" LOGICAL_ERROR exception.
  • Fix by adjusting tables_metadata_digest in the error path to subtract the hash of the lost table, keeping the in-memory digest consistent with the tables map. The table's metadata file still exists on disk and will be restored on server restart or DatabaseReplicated recovery.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=acf6aca664cc5cf8f7b0f75e4c7094ef3b6b17bf&name_0=MasterCI&name_1=BuzzHouse%20%28amd_ubsan%29

Test plan

  • Verify the BuzzHouse (amd_ubsan) test no longer fails with "Digest does not match"
  • Verify the fix handles both retry exhaustion and query cancellation paths
  • Verify normal SYSTEM RESTART REPLICA behavior is unaffected

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

🤖 Generated with Claude Code

When `SYSTEM RESTART REPLICA` fails after exhausting all retries (e.g.
due to fault injection causing `MEMORY_LIMIT_EXCEEDED`) or after query
cancellation, the table is left permanently detached from the in-memory
`tables` map. The `RestartReplicaGuard` (which suppresses digest checks
during the restart) is destroyed when the function exits, but the table
is never reattached. Subsequent DDL operations that call `checkDigestValid`
see a mismatch between `local_digest` (computed from the `tables` map,
which excludes the missing table) and `tables_metadata_digest` (which
still includes the table's hash), causing a "Digest does not match"
LOGICAL_ERROR exception.

Fix by adjusting `tables_metadata_digest` in the error path to subtract
the hash of the lost table, keeping the in-memory digest consistent with
the `tables` map. The table's metadata file still exists on disk and will
be restored on server restart or `DatabaseReplicated` recovery.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=acf6aca664cc5cf8f7b0f75e4c7094ef3b6b17bf&name_0=MasterCI&name_1=BuzzHouse%20%28amd_ubsan%29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@clickhouse-gh
Contributor

clickhouse-gh bot commented Mar 2, 2026

Workflow [PR], commit [ff73aa4]

Summary:

Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a failure mode in SYSTEM RESTART REPLICA where, if the restart process exits on error/cancellation after detaching the table, DatabaseReplicated can later hit a debug/sanitizer “Digest does not match” assertion due to the table being missing from the in-memory tables map while the digest still reflects the pre-detach state.

Changes:

  • Wrap the post-detach table re-creation retry loop in doRestartReplica() with an outer try/catch and, on failure, call into DatabaseReplicated to adjust the in-memory metadata digest.
  • Add DatabaseReplicated::adjustDigestOnTableLostFromRestart() API and implement it to subtract the table’s metadata hash from tables_metadata_digest.
  • Add a warning log when this adjustment occurs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Changed files:

  • src/Interpreters/InterpreterSystemQuery.cpp: adds failure-path handling for SYSTEM RESTART REPLICA to adjust DatabaseReplicated's digest when restart ultimately fails.
  • src/Databases/DatabaseReplicated.h: declares adjustDigestOnTableLostFromRestart() for use by the restart path.
  • src/Databases/DatabaseReplicated.cpp: implements the digest adjustment (subtracting the metadata hash) and logs a warning when invoked.

Comment on lines +1292 to 1307
        /// The table is left permanently detached from the in-memory tables map.
        /// Adjust the metadata digest so it stays consistent with the tables map,
        /// preventing false "Digest does not match" LOGICAL_ERROR exceptions.
        /// The table's metadata file still exists on disk, so it will be restored
        /// on server restart or after DatabaseReplicated recovery.
        if (auto * replicated_db = typeid_cast<DatabaseReplicated *>(database.get()))
        {
            try
            {
                replicated_db->adjustDigestOnTableLostFromRestart(replica_table_id.table_name);
            }
            catch (...)
            {
                tryLogCurrentException(log, "Failed to adjust digest after failed SYSTEM RESTART REPLICA");
            }
        }

Copilot AI Mar 2, 2026


Adjusting DatabaseReplicated::tables_metadata_digest here makes it diverge from its documented meaning (sum of local on-disk metadata files). Since the metadata file still exists, subsequent DatabaseReplicated DDL commits may base /replicas/<replica>/digest updates on this adjusted value, potentially masking the inconsistency with ZooKeeper metadata or triggering recovery decisions based on an artificial digest. Consider instead rolling back the detach on failure (reattach the original table instance, or reload/ATTACH from metadata) so both the tables map and digest stay consistent, or add an explicit "restart failed" state that suppresses digest assertions without rewriting the digest value.

Suggested change
-        /// The table is left permanently detached from the in-memory tables map.
-        /// Adjust the metadata digest so it stays consistent with the tables map,
-        /// preventing false "Digest does not match" LOGICAL_ERROR exceptions.
-        /// The table's metadata file still exists on disk, so it will be restored
-        /// on server restart or after DatabaseReplicated recovery.
-        if (auto * replicated_db = typeid_cast<DatabaseReplicated *>(database.get()))
-        {
-            try
-            {
-                replicated_db->adjustDigestOnTableLostFromRestart(replica_table_id.table_name);
-            }
-            catch (...)
-            {
-                tryLogCurrentException(log, "Failed to adjust digest after failed SYSTEM RESTART REPLICA");
-            }
-        }
+        /// Failed to restart replica; table remains detached in the in-memory tables map.
+        /// Do not adjust DatabaseReplicated metadata digest here, because on-disk metadata
+        /// has not changed and the digest must continue to reflect the sum of local metadata files.
+        tryLogCurrentException(log, "Failed to restart replica, table will remain detached");

Comment on lines +1235 to +1308
    /// If all retries fail (or the query is cancelled), we adjust the in-memory metadata digest
    /// to reflect the table's absence, preventing false "Digest does not match" assertions.

    StoragePtr new_table;
    try
    {
        size_t non_zk_retries = 0;
        constexpr size_t max_non_zk_retries = 10;
        while (true)
        {
            try
            {
                new_table = StorageFactory::instance().get(create,
                    data_path,
                    system_context,
                    system_context->getGlobalContext(),
                    columns,
                    constraints,
                    LoadingStrictnessLevel::ATTACH);

                break;
            }
            catch (const Coordination::Exception & e)
            {
                /// Only retry on transient ZooKeeper errors (connection loss, session expired, etc.)
                if (!Coordination::isHardwareError(e.code))
                    throw;

                tryLogCurrentException(
                    getLogger("InterpreterSystemQuery"),
                    fmt::format("Failed to restart replica {}, will retry", replica_table_id.getNameForLogs()));

                /// Check if the query was cancelled (e.g. server is shutting down)
                if (auto process_list_element = getContext()->getProcessListElementSafe())
                    process_list_element->checkTimeLimit();

                sleepForSeconds(1);
            }
            catch (...)
            {
                if (++non_zk_retries > max_non_zk_retries)
                    throw;

                tryLogCurrentException(
                    getLogger("InterpreterSystemQuery"),
                    fmt::format("Failed to restart replica {} (attempt {}/{}), will retry",
                        replica_table_id.getNameForLogs(), non_zk_retries, max_non_zk_retries));

                if (auto process_list_element = getContext()->getProcessListElementSafe())
                    process_list_element->checkTimeLimit();

                sleepForSeconds(1);
            }
        }
    }
    catch (...)
    {
        /// The table is left permanently detached from the in-memory tables map.
        /// Adjust the metadata digest so it stays consistent with the tables map,
        /// preventing false "Digest does not match" LOGICAL_ERROR exceptions.
        /// The table's metadata file still exists on disk, so it will be restored
        /// on server restart or after DatabaseReplicated recovery.
        if (auto * replicated_db = typeid_cast<DatabaseReplicated *>(database.get()))
        {
            try
            {
                replicated_db->adjustDigestOnTableLostFromRestart(replica_table_id.table_name);
            }
            catch (...)
            {
                tryLogCurrentException(log, "Failed to adjust digest after failed SYSTEM RESTART REPLICA");
            }
        }
        throw;

Copilot AI Mar 2, 2026


This new failure-path behavior is hard to validate without a regression test. Please add a stateless/integration test that forces the post-detach re-create step to fail (e.g. via fault injection / memory limit) and then verifies we don’t hit the debug/sanitizer "Digest does not match" path and that the table is eventually recoverable (restart or DatabaseReplicated recovery).

Copilot generated this review using guidance from repository custom instructions.
@tiandiwonder tiandiwonder self-assigned this Mar 2, 2026

3 participants