Fault Tolerance & Recovery

Hello, can someone clarify what mechanism Nebula uses to recover failed nodes in a single cluster (e.g., when one or a few of the graph, meta, or storage nodes fail)?

You may want to take a look at the Nebula Graph architecture: Architecture overview - Nebula Graph Database Manual
Meta and storage are alike. It’s suggested to have at least 3 replicas in production. The replicas are kept consistent by the Raft protocol. When one meta or storage node fails, the cluster keeps functioning as long as a majority of the replicas is still up, so simply bring the failed server back up and it will catch up.
Graph is stateless, so it’s much easier to explain: simply bringing the failed node back up will do.
Hope this helps.
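
To make the "at least 3 replicas" advice concrete, here is a minimal sketch of the majority arithmetic behind Raft. It is plain Python and not Nebula code; the function names are made up for illustration.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a Raft group with `replicas` members."""
    if replicas < 1:
        raise ValueError("a Raft group needs at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be down while a majority is still up."""
    return replicas - quorum_size(replicas)


def is_available(replicas: int, failed: int) -> bool:
    """True if the surviving replicas still form a majority."""
    return replicas - failed >= quorum_size(replicas)


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"{n} replicas: quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

With 3 replicas the quorum is 2, so one failed meta or storage node is tolerated; a second simultaneous failure would leave the group without a majority.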

Thank you for the kind reply and sharing the details!

Regarding “When one meta or storage is failed, the cluster is still functioning, so simply bring the failed server up and it will catch up”: in this case, does the user need a routine that regularly creates snapshots and applies the right one when bringing up the failed node (meta/storage)? Or will the cluster automatically fill the data gaps from the healthy nodes? And what is the typical time for a restarted node to catch up?

Regarding snapshots and checkpoints: how long does it normally take to recover from a snapshot/checkpoint, given that writes will be blocked during this time?

Snapshots are not required in the case I described above. The Raft protocol will do the trick. There are still 2 working replicas out there. When the failed node is up and healthy again, it will pull the missing logs from the other replicas. In the worst case, if the node failed and lost all its data (or it was reinstalled or even replaced), it will pull all the logs from the other replicas, and that will take much longer.
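
As a rough illustration of the two cases above, here is a toy model of log catch-up. It is a sketch, not how Nebula implements Raft: the log is just a Python list, the function name is hypothetical, and it assumes the restarted replica's log is a prefix of a healthy replica's log (real Raft also handles diverged logs and log compaction).

```python
from typing import List


def missing_entries(healthy_log: List[str], restarted_log: List[str]) -> List[str]:
    """Entries the restarted replica still has to pull from a healthy one."""
    # Assumption of this sketch: the restarted log is a prefix of the healthy log.
    if healthy_log[:len(restarted_log)] != restarted_log:
        raise ValueError("logs diverged; not handled in this sketch")
    return healthy_log[len(restarted_log):]


if __name__ == "__main__":
    healthy = ["w1", "w2", "w3", "w4", "w5"]

    # Node was down for a while but kept its data: only the gap is pulled.
    print(missing_entries(healthy, ["w1", "w2", "w3"]))

    # Node lost everything (reinstalled or replaced): the whole log is pulled.
    print(missing_entries(healthy, []))
```

The amount of data to transfer grows with the gap, which is why a wiped or replaced node takes much longer to catch up than one that was only briefly down.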

Got it, that makes sense!

So snapshots are used to change the current cluster to the specified (snapshot) state, not for failover recovery, right?

Correct. Snapshots are particularly useful before you update the graph schema or make some other important changes and would like to save a checkpoint first. You can also use them for backup/restore for now. In the future, there’ll be an official backup/restore utility too.
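
A minimal sketch of that "snapshot first, then change" routine might look like the following. The `execute` callable is a hypothetical stand-in for whatever client call you use to run nGQL statements, and the `ALTER TAG` statement is only an illustrative change; `CREATE SNAPSHOT` is the nGQL statement that creates the snapshot.

```python
from typing import Callable, Iterable, List


def snapshot_then_apply(execute: Callable[[str], None],
                        changes: Iterable[str]) -> List[str]:
    """Create a snapshot, then run the given statements in order.

    `execute` is assumed to raise on failure, so if the snapshot cannot be
    created, none of the changes are sent.
    """
    issued: List[str] = []
    for statement in ["CREATE SNAPSHOT", *changes]:
        execute(statement)
        issued.append(statement)
    return issued


if __name__ == "__main__":
    # Stand-in for a real client: just record what would be sent.
    sent: List[str] = []
    snapshot_then_apply(sent.append, ["ALTER TAG player ADD (age int)"])
    print(sent)
```

Running the snapshot as the first statement means a failed snapshot stops the routine before any schema change is applied.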

Got it, thank you!