Detect and handle corrupt or missing blocks or indexes #537

Open
2 tasks
jmjatlanta opened this issue Apr 4, 2022 · 4 comments

Comments

@jmjatlanta

Situation:

When a node encounters a problem, the block files or index files can become corrupted or have incomplete data.

Weapons:

Corruption and missing data can be detected. Other nodes can provide the information the corrupted node lacks.

Objective:

Detect the problem and repair it before allowing the node to report that it is fully synchronized.

Tactics:

  1. For the case of corrupted block files, the chain must be downloaded from the network starting from the point the corruption was detected.
  2. For missing block files, the chain must be downloaded and the node re-indexed.
  3. For corrupted or missing indexes, the problem must be accurately detected and the user must be prompted to restart the node with the -reindex option.
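The three tactics above can be sketched as a small decision function. This is only an illustration: `StorageProblem`, `RecoveryAction`, `RecoveryPlan`, and `PlanRecovery` are hypothetical names invented for this sketch and are not part of the Komodo codebase.

```cpp
// Hypothetical classification of what a startup check found.
enum class StorageProblem { None, CorruptBlockFile, MissingBlockFile, BadIndex };

// Hypothetical recovery actions, one per tactic in the list above.
enum class RecoveryAction { None, RedownloadFromHeight, RedownloadAndReindex, PromptReindex };

struct RecoveryPlan {
    RecoveryAction action;
    int fromHeight;  // only meaningful for RedownloadFromHeight, otherwise -1
};

// Map a detected problem to the recovery tactic proposed in this issue.
RecoveryPlan PlanRecovery(StorageProblem problem, int firstBadHeight) {
    switch (problem) {
    case StorageProblem::CorruptBlockFile:
        // Tactic 1: re-download starting where corruption was detected.
        return {RecoveryAction::RedownloadFromHeight, firstBadHeight};
    case StorageProblem::MissingBlockFile:
        // Tactic 2: download the chain and re-index.
        return {RecoveryAction::RedownloadAndReindex, -1};
    case StorageProblem::BadIndex:
        // Tactic 3: ask the operator to restart with -reindex.
        return {RecoveryAction::PromptReindex, -1};
    case StorageProblem::None:
        break;
    }
    return {RecoveryAction::None, -1};
}
```

The node would only report itself as fully synchronized once the returned action is `RecoveryAction::None`.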

Where we are now:

Some situations of block or index corruption are detected, others are not.

As an example, having the node crash after writing the block to disk but before the index file is written will lead to missing transactions when the node restarts. The node does restart, but the data is inaccurate. Looking for the block returns nullptr and looking for a transaction that was within that block returns that the transaction does not exist.

Additional Information:

The function LoadIndexDB() does some checks to verify the integrity of the block files. This may be a good place to add additional checks to verify that the blocks and the index are synchronized.
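A rough sketch of what a block/index synchronization check could look like follows. It assumes the block hashes found on disk and the hashes known to the index can each be enumerated; `ConsistencyReport` and `CompareBlocksToIndex` are hypothetical names, and the real structures used by `LoadIndexDB()` are not modeled here.

```cpp
#include <set>
#include <string>
#include <vector>

// Result of comparing the block files against the index (hypothetical type).
struct ConsistencyReport {
    std::vector<std::string> unindexedBlocks;  // on disk, but no index entry
    std::vector<std::string> danglingEntries;  // index entry, but no block on disk
    bool Consistent() const {
        return unindexedBlocks.empty() && danglingEntries.empty();
    }
};

// Compare the hashes of blocks read from disk with the hashes the index knows.
ConsistencyReport CompareBlocksToIndex(const std::vector<std::string>& blocksOnDisk,
                                       const std::set<std::string>& indexedHashes) {
    ConsistencyReport report;
    const std::set<std::string> onDisk(blocksOnDisk.begin(), blocksOnDisk.end());
    for (const std::string& hash : blocksOnDisk) {
        if (indexedHashes.count(hash) == 0) report.unindexedBlocks.push_back(hash);
    }
    for (const std::string& hash : indexedHashes) {
        if (onDisk.count(hash) == 0) report.danglingEntries.push_back(hash);
    }
    return report;
}
```

An unindexed block corresponds to the crash scenario described above (block written, index not), while a dangling entry corresponds to an index that references a block missing from the block files.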

Note: The attempt is to assist node operators when hardware/software issues make a mess within the persisted data on disk. Detecting malicious modification to distort data is not considered here.

Note: Having a block whose previous block does not exist may not be an indication of corruption. It is a valid (temporary) situation that must be planned for.

In my testing:

  1. Having an incomplete index file (an entire entry about a block does not exist) is not detected, and the node starts with incomplete data.
  2. Having a corrupted index file (data truncated off the end) is not detected, and the node starts with a shortened chain. I have yet to test to see if it re-syncs correctly.
  3. Having an incomplete block file (an entire block does not exist but does exist in the index) is not detected, but would probably be a very rare occurrence. We could test for it, but we may not want to concentrate heavily on detecting/fixing it.
  4. Having a corrupted block file (block cannot be de-serialized) is detected. I have yet to test what options are available for a node operator beyond a full re-sync.

ToDo:

  • Verify the findings above are accurate for different combinations of corruption / missing data.
  • Run tests on a multi-node chain to determine current abilities for recovering from corruption.

See also:

Bitcoin issue 19274

@dimxy
Collaborator

dimxy commented Apr 5, 2022

What I think about this issue:
Basically, blocks and indexes are two databases that need to be updated atomically.
In other systems this is done via a two-phase resource coordinator, which ensures that either both databases are updated or neither is.
Our code has no such coordinator, so a crash can leave only one of the block and index databases updated.
If this happens, the databases are not necessarily corrupted; each may be in a good state, just out of sync with the other.
So it is good to enhance failure detection in the chain databases, but maybe we should also detect abnormal terminations and tell the user on startup that they need to reindex or restart from the bootstrap, perhaps asking for y/n confirmation to continue.
(I know this could be a problem for auto-maintained nodes that restart automatically after a crash, but it is still better than nothing.)
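Detecting an abnormal end could be as simple as a marker file that exists only while the node is running. The sketch below is an assumption about how that might look, not existing Komodo code: `ShutdownMarker`, its methods, and the `.running` file name are all hypothetical.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: a marker file that is present only while the node runs.
class ShutdownMarker {
public:
    explicit ShutdownMarker(const std::string& dataDir)
        : path_(dataDir + "/.running") {}

    // Call at startup. Returns true if the previous run did not end cleanly,
    // i.e. the marker from that run is still on disk. Then (re)creates it.
    bool BeginRun() {
        bool unclean = false;
        if (std::FILE* f = std::fopen(path_.c_str(), "r")) {
            unclean = true;
            std::fclose(f);
        }
        if (std::FILE* f = std::fopen(path_.c_str(), "w")) {
            std::fputs("running\n", f);
            std::fclose(f);
        }
        return unclean;
    }

    // Call as the last step of an orderly shutdown.
    void EndRunCleanly() { std::remove(path_.c_str()); }

private:
    std::string path_;
};
```

When `BeginRun()` returns true, the node could print the reindex/bootstrap suggestion and, for interactive use, wait for confirmation. Note that this only detects the abnormal end; it does not make the two updates atomic.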

@TheComputerGenie

If it's known where the block is, wouldn't it be a better option to invalidate it and then reconsider it after connecting to peers, rather than having the user down for 12-20 hours with a reindex?

@jmjatlanta
Author

Based on my tests, one common problem is when the two are out of sync at the end, i.e. a block is written to disk but the daemon dies before writing the index. I believe that could be solved without re-indexing everything, if we choose to do so.
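As a minimal sketch of that idea, assuming the block file can be read back as an ordered list of block hashes and the index maps each hash to its position, only the unindexed tail needs to be added. `BlockIndex` and `RepairIndexTail` are hypothetical names for illustration and do not reflect the actual on-disk formats.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical index: block hash -> position of the block in the block file.
using BlockIndex = std::map<std::string, std::size_t>;

// Index only the trailing blocks that the index does not know about.
// Returns the number of entries added. Damage in the middle of the file
// is deliberately not handled here.
std::size_t RepairIndexTail(const std::vector<std::string>& blockFileHashes,
                            BlockIndex& index) {
    // Walk back from the end of the file to the last block already indexed.
    std::size_t firstUnindexed = blockFileHashes.size();
    while (firstUnindexed > 0 &&
           index.count(blockFileHashes[firstUnindexed - 1]) == 0) {
        --firstUnindexed;
    }
    // Add entries for everything after it.
    std::size_t added = 0;
    for (std::size_t pos = firstUnindexed; pos < blockFileHashes.size(); ++pos) {
        index[blockFileHashes[pos]] = pos;
        ++added;
    }
    return added;
}
```

In a real node each tail block would still have to be validated before being indexed; the sketch only shows that the work is proportional to the length of the unindexed tail rather than the whole chain.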

Another problem is when the files get damaged in the middle. It then becomes very difficult to trust anything after the damage, so a download across the wire and a re-index is probably inevitable.

As for 2 phase commit, that is a possible solution. There are actual journaling filesystem libraries that can help with that. I am unsure if any open-source libraries exist for that, but I imagine so. Or we could roll our own. The "costs" of such a solution must be weighed (i.e. maintenance).

@dimxy
Collaborator

dimxy commented Apr 5, 2022

I think we should try and test the revalidation proposal

who-biz pushed a commit to who-biz/komodo that referenced this issue Jul 29, 2024