
Everything you need to know about importing data from CSV files into Nebula Graph

Recently we have received a number of questions from Nebula users about CSV data import. We have collected the answers here to help new users import data into Nebula Graph smoothly.

Table of Contents

  • ErrMsg: SyntaxError: syntax error near xx, Vertex ID type not supported

  • How to import data without header

  • How long does it take to hash string to int

  • Is data import concurrent?

  • Data configuration in graph client

  • How to check the count of vertices and edges after data import has finished?

  • Parameter meanings: Latency AVG, Batches Req AVG, and Rows AVG

  • Data import rate vs host configuration

  • How to validate the modifications in the storaged config file?

  • Will the import stop when the shell window is closed?

  • How to check data import progress?

  • Is the data import sequential or random?

  • Does the tool support continuous transmission from breakpoint?

ErrMsg: SyntaxError: syntax error near xx, Vertex ID type not supported

Question: I got ErrMsg: SyntaxError: syntax error near xx. Does this mean the vertex ID type is not supported? Is only int supported?

Answer: Yes, currently only int64 is supported as the vertex ID type. If your vid is a string, specify function: hash or function: uuid in the vid configuration, and that function will be called automatically during import to convert the string into an int64 vid, as sketched below.
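
For reference, here is a minimal sketch of the relevant vid section of the importer's YAML config (field names follow the nebula-importer layout for Nebula 1.x; check your importer version's docs for the exact names):

schema:
  type: vertex
  vertex:
    vid:
      index: 0        # column that holds the string vid
      function: hash  # or uuid; converts the string to an int64 vid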

How to import data without header?

Question: How do I ignore a specific field when the data to be imported has no header? I have only found the guide for importing data with a header.

Answer: Assign column indexes to the props of a tag; any columns that are not assigned will be ignored. See the sketch below.
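
A sketch of what this could look like without a header (the tag and prop names are made up for illustration; the layout assumes the nebula-importer YAML format):

csv:
  withHeader: false
schema:
  type: vertex
  vertex:
    vid:
      index: 0            # column 0 is the vertex ID
    tags:
      - name: person
        props:
          - name: name
            type: string
            index: 1      # column 1 -> person.name
          - name: age
            type: int
            index: 3      # column 3 -> person.age; column 2 is ignored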

How long does it take to hash string to int?

Question: My VID data type is string. Does the import tool count the time spent hashing strings into int64 as part of the data import?

Answer: Correct. It is like turning INSERT VERTEX v() VALUES 234:() into INSERT VERTEX v() VALUES hash("234"):().

Is data import concurrent?

Question: Is data import concurrent? What decides the degree of concurrency? The graph clients?


Answer: Correct, data import is concurrent. The best practice is to set the concurrency parameter to the number of your CPU cores. This tells the import tool how many graph clients to start for connecting to the Nebula server; see the sketch below.
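
For example, on an 8-core machine the client section of the importer config could look like this (a sketch assuming the clientSettings section of the nebula-importer YAML; the address and credentials are placeholders):

clientSettings:
  concurrency: 8          # number of graph clients, ideally = number of CPU cores
  connection:
    user: user
    password: password
    address: 127.0.0.1:3699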

Data configuration in graph client

Question: The more graph clients, the faster data can be imported, right?

Answer: Theoretically, yes. But do not set the number of graph clients too high. The best practice is to set it to the number of your CPU cores.

How to check the count of vertices and edges after data import has finished?

Question: Is there an API that can acquire the number of vertices and edges after data is successfully imported?

Answer: There’s no command for that. But you can use this tool: https://github.com/vesoft-inc/nebula/blob/master/docs/manual-CN/3.build-develop-and-administration/3.deploy-and-administrations/server-administration/storage-service-administration/data-export/dump-tool.md


--mode=stat  # For statistics purposes only
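
A sketch of how the dump tool could be invoked for statistics (the space name, path, and meta address are placeholders; check the linked doc for the exact flags of your version):

./db_dump --mode=stat --space=your_space \
          --db_path=/data/storage/nebula/data \
          --meta_server=127.0.0.1:45500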

Parameter meanings: Latency AVG, Batches Req AVG, and Rows AVG

Question: What do these parameters mean: Latency AVG, Batches Req AVG, Rows AVG?

Answer: Latency AVG is the average latency of all insert operations, similar to the latency returned in the console. Batches Req AVG is the average time spent on a batch insert request. Rows AVG is the average number of records inserted per second. Finished is the total number of records imported into Nebula so far.

Data import rate vs host configuration

Question: It took me 15 hours to import 230 million vertices and 5 billion edges on a single host. Is that expected?

Answer: Unfortunately, no; that is slower than expected. Please check your host configuration, the Nebula config files, the go-importer config, and whether you are using indexes. Also refer to the following storage.conf:


########## storage ##########

# Root data path, multiple paths should be split by comma.

# One path per instance, if --engine_type is `rocksdb'

--data_path=/mnt/ssd1/storage,/mnt/ssd2/storage,/mnt/ssd3/storage

# The default block cache size used in BlockBasedTable.

# The unit is MB.

--rocksdb_block_cache=65536 # About 1/3 of your total memory

# rocksdb ColumnFamilyOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma

--rocksdb_column_family_options={"write_buffer_size":"67108864","max_write_buffer_number":"6","max_bytes_for_level_base":"268435456"}

Update the three items shown above in nebula-storage.conf and leave the .yaml file as it is.


--data_path=/mnt/ssd1/storage,/mnt/ssd2/storage,/mnt/ssd3/storage # 3 disks here

Change the number of disks per your own scenario.

How to validate the modifications in the storaged config file?

Question: Should I restart all services or the storage service only after updating the storage config file?

Answer: Stop storaged, wait a while (about 30 seconds), and then start the service again, because it takes time to flush the new config to disk.
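
With the default deployment scripts, that looks roughly like this (a sketch; adjust the install path to your environment):

scripts/nebula.service stop storaged
sleep 30   # give the service time to exit and flush
scripts/nebula.service start storaged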

Will the import stop when the shell window is closed?

Question: Will the data import stop if I’ve accidentally closed the shell window where the import is happening?

Answer: Correct, the import will stop.
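
If you need the import to survive a closed terminal, one common workaround is to detach it from the shell, for example (a sketch assuming the importer binary is started with a YAML config):

nohup ./nebula-importer --config ./your-config.yaml > importer.log 2>&1 &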

How to check data import progress?

Question: How do I know the progress of the data import?

Answer: Check the Finished field to see how much data has already been written to Nebula.

Is the data import sequential or random?

Question: Does the tool import vertices and edges by sequence or randomly?

Answer: The import tool reads vertices and edges in the order they appear in your CSV files, then distributes the data to different graph clients, which insert it concurrently, so the insertion order is not guaranteed. For each file, the tool starts a thread to read the data and hands it to multiple threads for insertion. Therefore, if you have multiple files, data from different files is likely inserted into Nebula concurrently by different threads.

Does the tool support continuous transmission from breakpoint?

Question: Does the import tool support continuous transmission from a breakpoint? For example, if the import stops for some reason, can I resume instead of re-importing everything?

Answer: Not supported. It does not add much value when importing from local CSV files. You can try implementing it yourself, since it is not technically difficult.

Finally, big thanks to Xiaohui and other users for their contributions to these tips! :blush:
