
Practice Nebula Graph on Boss Zhipin

Business background

Boss Zhipin, a Chinese recruitment platform, relies on large-scale graph storage and mining computation in its security and risk-control business. The company first built a self-hosted HA Neo4j cluster to handle these needs. However, Neo4j did not work well for real-time analysis, because it could not support a daily data increase of 1 billion relationships.

We then adopted Dgraph to meet our needs. After half a year of tricky workarounds and meetings with the Dgraph team, we finally made up our minds to migrate to Nebula Graph, a database that fits our scenarios better. This post won't cover benchmarks, because there are plenty of them on the forum. Instead, we will share our technical evaluation and selection process, plus a comparison between the two databases, which I think you will find more interesting.

Technical qualifications

Hardware

Hardware configuration:

  • CPU: Intel® Xeon® Gold 6230 @ 2.10GHz, 80 cores
  • Memory: DDR4, 128 GB
  • Storage: 1.8 TB SSD
  • Network: 10 GbE

We deployed Nebula Graph on five nodes as suggested: 3 metad, 5 graphd, and 5 storaged processes.

Software

  • Nebula Graph version: v1.1.0
  • Operating system: CentOS Linux release 7.3.1611 (Core)

Configuration

Modify the storage-related configurations:
# According to the document, set the following value to one third of the configured memory
--rocksdb_block_cache=40960
# Modify these configurations to decrease memory cost
--enable_partitioned_index_filter=true
--max_edge_returned_per_vertex=100000
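As a sanity check on the one-third rule in the comment above, the cache size can be derived from total memory. A minimal sketch (assuming, per the Nebula Graph docs, that --rocksdb_block_cache takes a value in MB; the configuration above rounds the result down to 40 GB = 40960 MB):

```python
def block_cache_mb(total_memory_gb: int) -> int:
    # One third of total memory, expressed in MB
    # (the assumed unit of --rocksdb_block_cache).
    return total_memory_gb * 1024 // 3

# For a 128 GB node, one third is roughly 43690 MB;
# the configuration above rounds this down to 40960 MB (40 GB).
print(block_cache_mb(128))
```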

Qualifications

Currently, the security behavior graph saves the behaviors in the latest 3 months, that is nearly 50 billion edges. The aggregated write is updated every 10 minutes. The mean daily vertex write is 30 million. The mean daily edge write is 550 million. The insertion latency is less than 20 ms.
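To sustain such write volumes, edges are written in batches rather than one at a time. A minimal sketch of building a batched nGQL INSERT EDGE statement, where the follow edge type, its degree property, and the integer vertex IDs are all hypothetical:

```python
def batch_insert_edges(edge_type, prop, rows):
    """Build one nGQL statement that inserts many edges at once,
    so each network round trip carries a batch instead of a single edge."""
    values = ", ".join(f"{src} -> {dst}:({val})" for src, dst, val in rows)
    return f"INSERT EDGE {edge_type}({prop}) VALUES {values};"

stmt = batch_insert_edges("follow", "degree", [(100, 200, 95), (101, 201, 90)])
# INSERT EDGE follow(degree) VALUES 100 -> 200:(95), 101 -> 201:(90);
```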

Read latency is less than 100 ms, and read latency for the business API is less than 200 ms. Some time-consuming requests take up to 1 second.

Disk usage is currently about 600 GB per node (600 GB × 5).

CPU consumption is about 500%, and memory usage stabilizes at about 60 GB.

Comparison with Dgraph

At this time, there are two main native distributed graph databases available in China: Dgraph and Nebula Graph. We used Dgraph for half a year; the comparison between the two is listed below.


| Item | Nebula Graph | Dgraph |
| --- | --- | --- |
| Storage architecture | A common solution: vertices are sharded, and both in- and out-edges are stored redundantly alongside them. This works well for the common one-hop queries. | Dgraph shards the edges; each edge is stored with a vertex, so network transmission is better for multi-hop queries. But data tends to skew and leads to OOM, so you need to take special care to keep the shards balanced, and super-large edges need further sharding. |
| Sharding method | Shards are balanced manually, so rebalancing can be scheduled for off-peak hours, but you need to trigger it periodically. Providing both manual and automatic balancing would be better. | Auto balancing is triggered by default. This makes operation and maintenance simpler, but online traffic is affected more easily. You can either set a very long interval to avoid the auto trigger or turn the feature off by changing the code. |
| Import speed | Faster, because imports can go through SST files. | Data import is a distributed request, so large-scale imports are slow and cause OOM. Refer to this PR for an improvement. |
| Write amplification (WA) | Small, because the storage relies on RocksDB. We suggest writing in batches. | Theoretically, WA will be smaller after the Badger optimization. If strong consistency is not required, use the ludicrous mode. WA increases a lot when writing frequently, and writes tend to be blocked when a snapshot is triggered. You can improve this by decreasing the snapshot frequency, but that risks data loss. |
| Read amplification (RA) | Small, because the storage relies on RocksDB. | High-frequency writes lead to large RA; especially when a snapshot is triggered, reads and writes are blocked. |
| Disk space amplification | Small, because the storage relies on RocksDB. | Related to the storage structure. This is not well optimized; especially with frequent writes, amplification becomes rather high. |
| Stability | Great stability owing to RocksDB. We have run Nebula Graph for more than half a year. | Dgraph itself is stable, but Badger and Ristretto are written from scratch by the Dgraph company and lack large-scale production use compared with RocksDB. For now, Dgraph crashes when writing at high QPS. |
| Query language | nGQL is easy to use and similar to SQL, which is closer to existing habits, though it is still under development. Compared with Dgraph, nGQL queries are longer; for example, querying multiple edge types at once requires combining the returned results yourself. | GraphQL is easier than nGQL but troublesome for complex queries. |
| Community | There are always on-call staff for you on the forum. | Dgraph does not provide a Chinese community. Feel free to contact me; I run a Chinese WeChat group. |
| Interface deployment | I developed a program myself to work with nGQL. | Dgraph provides a native GraphQL interface. |
| Future plan | Nebula Graph plans to support openCypher and graph computing frameworks. | As a native GraphQL database, Dgraph tends to develop toward a data middle platform. |

Based on our experience, Dgraph has a good design, but it doesn't fit our needs. Even though native GraphQL is very attractive, OOM is inevitable due to its storage structure; Dgraph plans to improve this by further sharding the edge storage. In addition, the self-developed Badger and Ristretto have not been verified by large-scale production use cases, and we suffered many crashes and OOMs when dealing with large-scale data at high QPS. To store massive data on SSD with Dgraph, we would have to optimize its disk amplification and memory cost.

Putting high-QPS writes aside, Dgraph is worth trying. As a native GraphQL graph database, it is very suitable for a data middle platform in many scenarios; this is a current trend, Dgraph has released its own cloud service, and TigerGraph is exploring this field too. Transactions are another strength of Dgraph, but we don't need that feature, so we didn't try it. In practice, Dgraph runs stably when the online write concurrency is low or when importing data offline. If high availability and transactions meet your needs, Dgraph is worth a try.

To conclude, Nebula Graph works well, especially in engineering terms; you can see this in many details of its design. It strikes an excellent balance between design and practice. For example:

  1. Nebula Graph supports manual balancing. Automatic balancing, convenient as it is, is prone to many problems.
  2. Nebula Graph controls memory usage smartly via the enable_partitioned_index_filter parameter and by limiting the maximum number of returned edges, balancing well between data size and performance.
  3. Different graph spaces are physically separated.
  4. nGQL is very close to SQL and is gradually becoming compatible with openCypher. GraphQL is powerful in some aspects, but given the complexity of long queries, it is more of a data-middle-platform language.
  5. The recently released Spark GraphX support is very powerful. Originally, our graph computing was based on GraphX with data extracted from Neo4j; we plan to extract directly from Nebula Graph in the future.

The preceding conclusion is based on our experience at this time. Since both databases are developing persistently, you’d better refer to the latest release note to know their updates.

Suggestions

Nebula Graph performs well, but based on our scenarios we propose the following suggestions:

  1. Provide more offline algorithms, including support for existing graph neural networks. Online graph queries are mostly used for analysis; in fact, data for online applications is computed offline and then imported into the database.
  2. **Support the Plato platform.** Spark GraphX is relatively inefficient in computing; it would be better to integrate with Tencent's Plato.
  3. Inspired by TigerGraph and Dgraph, I hope Nebula Graph can support solidifying nGQL queries to generate REST endpoints directly, with HTTP parameters passed into the query. In this way, data query interfaces could be generated very quickly, with no need to connect to the database to provide a SQL-like service.
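The third suggestion could look roughly like this. The endpoint name, the template, and its parameters below are all hypothetical; the sketch only illustrates substituting HTTP query parameters into a solidified nGQL query:

```python
from string import Template

# Hypothetical registry of "solidified" nGQL query templates,
# keyed by the REST endpoint name they would be exposed under.
ENDPOINTS = {
    "neighbors": Template("GO FROM $vid OVER $edge YIELD $edge._dst;"),
}

def render(endpoint, params):
    """Substitute HTTP query parameters into the solidified template."""
    return ENDPOINTS[endpoint].substitute(params)

print(render("neighbors", {"vid": "123", "edge": "follow"}))
# GO FROM 123 OVER follow YIELD follow._dst;
```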

Currently, Boss Zhipin applies Nebula Graph in its security business, and it has been running stably for more than half a year. I have shared some of our experience in this post and hope it triggers more thoughts on Nebula Graph.

Some problems we encountered when using Dgraph:

  • The issues we raised.
  • The PRs we contributed to Dgraph.

Reference

This post was written by Wen Zhou at Boss Zhipin.
