Data Migration from JanusGraph to Nebula Graph - Practice at 360 Finance

Speaking of graph data processing, we have had experience with several graph databases. In the beginning, we used the stand-alone edition of AgensGraph. Later, due to its performance limitations, we switched to JanusGraph, a distributed graph database. I described how we migrated the data in my article “Migrate tens of billions of graph data into JanusGraph (only in Chinese)”. As the data size and the number of business calls grew, a new problem appeared: each query consumed too much time. In some business scenarios, a single query took up to 10 seconds, and as the data size increased, a more complicated single query needed two or three seconds. These problems seriously affected the performance of the entire business process and the development of related businesses.

The architecture of JanusGraph is what makes a single query time-consuming. The core reason is that its storage depends on an external storage system that JanusGraph cannot control well. Our production environment uses an HBase cluster, which means queries cannot be fully pushed down to the storage layer for processing. Instead, the data has to be read from HBase into the memory of JanusGraph Server and filtered there.

For example, suppose we want to find the users older than 50 who have a one-hop relationship with a specified user. 1,000 users have such a relationship with that user, but only two of them are older than 50. In JanusGraph, when the query is sent to HBase, the one-hop vertices cannot be filtered by their properties. Therefore, JanusGraph has to send concurrent requests to HBase to fetch the properties of all 1,000 people, filter them in the memory of JanusGraph Server, and then return the two users who meet the condition to the client.
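The difference between filtering in the query server and filtering in the storage layer can be sketched in a few lines of Python. This is only a toy model of the example above, not code from JanusGraph, HBase, or Nebula Graph: the `User` class, the function names, and the generated data are all hypothetical, and "rows moved" simply counts how many records would have to leave the storage layer.

```python
from dataclasses import dataclass


@dataclass
class User:
    user_id: int
    age: int


def build_neighbors(total=1000, older_than_50=2):
    """Hypothetical data: `total` one-hop neighbors, only a few older than 50."""
    users = [User(i, 30) for i in range(total)]
    for user in users[:older_than_50]:
        user.age = 60
    return users


def filter_in_query_server(storage):
    """Fetch every neighbor's properties first, then filter in memory."""
    fetched = list(storage)  # every row leaves the storage layer
    matches = [u for u in fetched if u.age > 50]
    return matches, len(fetched)


def filter_in_storage(storage):
    """Apply the predicate inside the storage layer before returning rows."""
    matches = [u for u in storage if u.age > 50]
    return matches, len(matches)  # only matching rows leave the storage layer


neighbors = build_neighbors()
server_side, rows_moved_server = filter_in_query_server(neighbors)
pushed_down, rows_moved_storage = filter_in_storage(neighbors)
print(rows_moved_server, rows_moved_storage)
```

Both functions return the same two users, but the first one moves all 1,000 rows to get them, which is where the extra disk and network I/O comes from.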

Such operations waste a lot of disk I/O and network I/O, and most of the data returned for the query is not used in subsequent queries. The HBase cluster in our production environment runs on 19 high-spec SSD servers. Its network I/O and disk I/O are as follows.

[Figure: HBase network I/O]

[Figure: HBase disk I/O]

In the same business scenario, if we use Nebula Graph to process graph data, only 6 SSD servers with the same configuration are needed. The disk I/O and network I/O are as follows.

[Figure: Nebula Graph network I/O]

[Figure: Nebula Graph disk I/O]

From the previous comparison, it can be seen that Nebula Graph performs much better. It is especially worth noting that it does so with only about 30% of the machine resources of the HBase cluster. Let’s take a look at the time consumption in the business scenarios: where JanusGraph needs two or three seconds to process a query, Nebula Graph takes only about 100 ms; where JanusGraph requires 10 to 20 seconds, Nebula Graph takes only about two seconds. Moreover, in Nebula Graph, the average time consumption is about 500 ms, and the performance is improved by at least 20 times.

[Figure: cat time consumed]

If you are still using JanusGraph, after reading this performance comparison, I guess you will forward this article to your team immediately, requesting a project to migrate graph data to Nebula Graph.

Migration of Historical Data

Now let’s talk about how to migrate data. Our data size is relatively large, about 2 billion vertices and 20 billion edges, and migrating data of this size is difficult. Luckily for us, Nebula Graph provides a Spark-based import tool, Spark Writer, which made the migration much easier. We have two lessons to share. First, asynchronous data import may not be what you need, because it may introduce a lot of errors; we recommend changing the import method to synchronous writing. Second, when using Spark to import a relatively large amount of data, the partitions parameter should be set to a large value. In our case, we tried setting it to 80,000. If the value is too small, each partition holds a relatively large amount of data, which can easily cause your Spark tasks to fail with out-of-memory (OOM) errors.
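To see why the partition count matters, here is a rough back-of-the-envelope sketch in Python. It only does the arithmetic: the helper names are hypothetical, the totals are the approximate figures quoted above, and it assumes records are spread evenly across partitions, which real data rarely is.

```python
import math


def rows_per_partition(total_rows: int, partitions: int) -> int:
    """Average number of records a single Spark partition would hold."""
    return math.ceil(total_rows / partitions)


def partitions_needed(total_rows: int, max_rows_per_partition: int) -> int:
    """Smallest partition count that keeps each partition under the given size."""
    return math.ceil(total_rows / max_rows_per_partition)


vertices = 2_000_000_000   # approximate figure from the article
edges = 20_000_000_000     # approximate figure from the article
total = vertices + edges

print(rows_per_partition(total, 80_000))   # records per partition at 80,000 partitions
print(rows_per_partition(total, 8_000))    # ten times fewer partitions, ten times more records each
```

With 80,000 partitions each task handles on the order of a few hundred thousand records; with a much smaller partition count the same task has to hold millions, which is the situation that tends to end in OOM.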

Query Tuning

We are now using Nebula Graph 1.0 in the production environment. In this environment, we use the HASH() function instead of the UUID() function to generate IDs, because the UUID() function consumes more time during the data import process. Besides, it is said that Nebula Graph will no longer support UUID() in the future.
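The idea behind hash-based IDs can be illustrated with a small Python sketch. This is explicitly not Nebula Graph's HASH() implementation: `hash_id` is a hypothetical stand-in that uses BLAKE2b from the standard library, and the key format is made up. It only shows the property that makes such IDs convenient during import, namely that the ID is a pure function of the business key, so the same key always maps to the same 64-bit integer without any lookup.

```python
import hashlib


def hash_id(key: str) -> int:
    """Stand-in for a hash-based vertex ID: a signed 64-bit integer derived from the key.

    NOT Nebula Graph's HASH() algorithm; it only illustrates the idea.
    """
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)


# The same (hypothetical) business key always yields the same ID,
# so vertices and the edges that reference them can be written independently.
first = hash_id("user:13800000000")
second = hash_id("user:13800000000")
other = hash_id("user:13900000000")
print(first == second, first == other)
```

One trade-off to keep in mind with any hash-based scheme is that two different keys can, in principle, collide on the same ID.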

In our production environment, the major tuning configurations are as follows. Most of them are for the tuning of nebula-storage.

# The default reserved bytes for one batch operation


# The default block cache size used in BlockBasedTable

# The unit is MB. The memory capacity of our production server is 128 GB


############## rocksdb Options ##############

# rocksdb DBOptions in json, each name and value of option is a string, given as "option_name":"option_value" separated by comma


# rocksdb ColumnFamilyOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma


# rocksdb BlockBasedTableOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma



# Newly added parameters


# Parameters to reduce memory usage


Regarding the tuning of the Linux server: the main change is disabling swap, because when swap is enabled, the extra disk I/O may reduce query performance. As for minor and major compaction: in our production environment, we enabled minor compaction but disabled major compaction. If major compaction is enabled, it may take up a lot of disk I/O, and it is difficult to control by setting the number of threads (--rocksdb_db_options={"max_subcompactions":"3","max_background_jobs":"3"}). It is said that this function will be optimized in future versions of Nebula Graph.
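Because the RocksDB options are passed as JSON in which every name and value is a string, it is easy to get the quoting wrong when writing the flag by hand. The following Python sketch builds the flag shown above; `rocksdb_flag` is a hypothetical helper, not part of any Nebula Graph tooling, and the option values are just the ones quoted in this article.

```python
import json


def rocksdb_flag(name: str, options: dict) -> str:
    """Render a flag whose value is a JSON object with string names and string values."""
    as_strings = {key: str(value) for key, value in options.items()}
    # Compact separators avoid spaces, which would otherwise split the flag in a shell.
    return f"--{name}={json.dumps(as_strings, separators=(',', ':'))}"


flag = rocksdb_flag(
    "rocksdb_db_options",
    {"max_subcompactions": 3, "max_background_jobs": 3},
)
print(flag)
```

When the flag is passed on a command line rather than in a configuration file, the JSON value usually needs to be wrapped in single quotes so the shell does not strip the double quotes.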

Finally, let’s praise the max_edge_returned_per_vertex parameter of Nebula Graph. In my opinion, this one parameter alone is enough to recommend Nebula Graph as a veteran of the graph database industry. Our graph queries have always been troubled by super vertices that have millions of edges to other vertices. In the production environment, if you query these vertices in JanusGraph backed by HBase, crashes may become part of your routine work. It happened to us several times. With JanusGraph, we could not solve this problem well by adding various limit steps to our Gremlin statements. With Nebula Graph, however, the max_edge_returned_per_vertex parameter comes as a savior. With this parameter, Nebula Graph limits the data directly in the underlying storage layer, which saves us from the super vertex trouble in the production environment. Based on this parameter alone, we give Nebula Graph a FIVE STAR!
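The effect of a per-vertex edge cap applied in the storage layer can be sketched as follows. This is a toy model in Python, not Nebula Graph code: `scan_edges` and the adjacency data are hypothetical, and only the parameter name max_edge_returned_per_vertex comes from the article. The point is that the scan stops early for each vertex, so a super vertex never hands its full edge list to the query layer.

```python
from itertools import islice


def scan_edges(adjacency, vertex_ids, max_edge_returned_per_vertex):
    """Return at most `max_edge_returned_per_vertex` edges for each requested vertex."""
    result = {}
    for vid in vertex_ids:
        edges = adjacency.get(vid, ())
        # islice stops reading after the cap, so the remaining edges are never touched.
        result[vid] = list(islice(edges, max_edge_returned_per_vertex))
    return result


# Hypothetical graph: one ordinary vertex and one super vertex with a million edges.
adjacency = {
    "normal_user": range(10),
    "super_vertex": range(1_000_000),
}

capped = scan_edges(adjacency, ["normal_user", "super_vertex"], 1000)
print({vid: len(edges) for vid, edges in capped.items()})
```

A limit applied after the data has already been fetched into the query server's memory cannot give the same protection, because the expensive read has already happened by then.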