Users often ask us this question, especially when they are just getting started with Nebula Graph: they have a rough estimate of their dataset size and want to know what hardware resources they should provision for storing and querying it in Nebula Graph.
In this topic, we break the question down into three parts, hard disk, CPU, and memory, so that you have a basic understanding of the data processing logic in Nebula Graph.
Hard disk capacity requirement.
The following factors affect the hard disk capacity you need:
- Vertices. Suppose there are 1 billion vertices, each with 1 KB of properties attached; storing them alone takes about 1,000 GB (1 TB). Note that Nebula Graph compresses properties, but not at a fixed ratio: the ratio varies by data type. If no properties are attached to vertices, no compression happens.
- Edges. Suppose there are 10 billion edges, each with 1 KB of properties attached; storing them alone takes about 10 TB. Nebula Graph automatically creates a reverse edge for each edge to speed up reverse queries, which doubles the hard disk storage to 20 TB. Properties on edges follow the same compression rules as properties on vertices.
- Replicas. Suppose you deploy a three-replica cluster. Each replica needs at least 21 TB to store the 1 billion vertices and 10 billion edges above, because the WAL may not be cleared promptly while data is being inserted into Nebula Graph. Currently the WAL is deleted every four hours, though this interval is configurable.
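The WAL retention interval can be tuned in the storage daemon's configuration. A hypothetical excerpt from nebula-storaged.conf; the flag name wal_ttl and its default of 14400 seconds (four hours) are assumptions based on common Nebula Graph deployments:

```
# How long (in seconds) WAL files are kept before deletion.
# 14400 s = 4 hours; shorten this if disk space is tight.
--wal_ttl=14400
```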
So to sum up, for a three-replica cluster with a dataset of 1 billion vertices (each with 1 KB of properties attached) and 10 billion edges (each with 1 KB of properties attached), and 100 partitions per graph space, the hard disk capacity required is 21 TB × 3 = 63 TB. However, Nebula Graph compresses properties on vertices/edges by roughly 5 to 20 times, so in theory the minimum hard disk capacity required is 63 TB / 5 = 12.6 TB.
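The disk-sizing arithmetic above can be reproduced with a short sketch. All figures are approximate, and the 5x compression ratio is the low end of the 5-20x range mentioned in the text:

```python
# Rough disk sizing for the example dataset in the text.
KB = 1000  # ~1 KB of properties per vertex/edge

vertices_tb = 1_000_000_000 * KB / 1e12       # 1 billion vertices  ≈ 1 TB
edges_tb = 10_000_000_000 * KB / 1e12 * 2     # 10 billion edges, x2 for reverse edges ≈ 20 TB
per_replica_tb = vertices_tb + edges_tb       # ≈ 21 TB per replica
raw_total_tb = per_replica_tb * 3             # three replicas ≈ 63 TB
compressed_tb = raw_total_tb / 5              # at 5x compression ≈ 12.6 TB

print(per_replica_tb, raw_total_tb, compressed_tb)
```

Plugging in your own vertex/edge counts and property sizes gives a first-order estimate before any benchmarking.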
CPU capacity requirement.
Both the query engine and the storage engine affect the CPU capacity you need.
Currently the Nebula Graph query engine will use all available threads on all machines for a query if necessary. The storage engine occupies at most 16 CPU cores; see the configuration of the two parameters num_io_threads and num_worker_threads.
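As a sketch, the two thread-pool parameters mentioned above might be set in nebula-storaged.conf like this; the values are illustrative only, not recommendations:

```
# Number of threads handling disk I/O.
--num_io_threads=16
# Number of threads handling storage worker tasks.
--num_worker_threads=32
```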
Memory capacity requirement.
Both storage and query affect the memory capacity you need:
- RocksDB. The rocksdb_block_cache of one RocksDB instance is 1024 MB (1 GB), and the write_buffer_size is 256 MB, so one RocksDB instance needs about 1.25 GB of memory.
- Vertex cache. For better read performance, Nebula Graph caches at most 16 * 1000 * 1000 (16 million) vertices by default (tune the vertex_cache_num parameter to your own business needs). With about 1 KB of properties per vertex, this cache requires approximately 16 GB of memory.
- WAL buffers. With wal_buffer_size at 8 MB, two buffers per partition, and 100 partitions: 8 * 1024 * 1024 * 2 * 100 bytes ≈ 1.6 GB.
That said, at least 1.25 GB + 16 GB + 1.6 GB ≈ 18.85 GB of memory is required for data storage and query with 1 billion vertices and 10 billion edges. This is in addition to the memory needed by the rest of your application, so reserve at least that much on top of your baseline, adjusted to your own business scenarios.
Generally speaking, the more data you query in Nebula Graph, the more memory you need.
So to sum up, for a three-replica cluster with a dataset of 1 billion vertices (each with 1 KB of properties attached) and 10 billion edges (each with 1 KB of properties attached), and 100 partitions per graph space, the minimum memory requirement is about 18.85 GB.
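The memory figures above can also be reproduced with a short sketch. It assumes one RocksDB instance, the default vertex cache size, about 1 KB of properties per cached vertex, and 100 partitions; all figures are approximate:

```python
# Rough memory sizing for the storage/query example in the text.
rocksdb_gb = 1.0 + 0.25                    # 1 GB block cache + 256 MB write buffer
vertex_cache_gb = 16_000_000 * 1000 / 1e9  # 16 M vertices x ~1 KB each ≈ 16 GB
wal_buffer_gb = 8 * 2 * 100 / 1000         # 8 MB x 2 buffers x 100 partitions ≈ 1.6 GB

total_gb = rocksdb_gb + vertex_cache_gb + wal_buffer_gb
print(round(total_gb, 2))
```

Adjust the vertex count, property size, and partition count to match your own deployment.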
Your business dataset may vary in size, but following the logic in this topic, you can get an overall idea of the hardware resources required to run Nebula Graph.