Practicing data importing and nGQL against LDBC SNB dataset

Lately I have been practicing with the LDBC SNB dataset: I imported it into a local Nebula test environment and then tried to solve the basic “Short Reads” problems with nGQL queries.

Importing the dataset

First and foremost, the dataset has to get into Nebula. Luckily there is already a project, nebula-bench, which is built for Nebula load testing but can also generate and import the LDBC SNB dataset. I simply followed its guide and all went smoothly, except for two small issues:

  • By default, running “python3 run.py importer” generates a YAML file in which the space replica factor is set to 3. That failed in my environment, which has only one storaged up and running. I had to do a dry run, grab the generated YAML file, and run nebula importer on it myself.
  • Entirely different tags may share the same vertex IDs, e.g. Person and Organisation. The guide mentions this as a limitation, and I guess there is no plan to fix it since it doesn’t really matter for load testing. Fine, just don’t get confused by it.
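
The replica tweak from the first bullet can be scripted. This is only a minimal sketch: the YAML excerpt is hypothetical (I am assuming the generated file carries a CREATE SPACE statement with a replica_factor setting), and set_replica_factor is a made-up helper name, not part of nebula-bench or nebula importer.

```python
import re


def set_replica_factor(config_text: str, replicas: int = 1) -> str:
    """Rewrite every `replica_factor = N` in the config text to the given value."""
    pattern = re.compile(r"(replica_factor\s*=\s*)\d+", re.IGNORECASE)
    # A function replacement avoids ambiguity between the group number and the digit.
    return pattern.sub(lambda m: f"{m.group(1)}{replicas}", config_text)


# Hypothetical excerpt; the file generated by the dry run may look different.
sample = """\
clientSettings:
  postStart:
    commands: |
      CREATE SPACE IF NOT EXISTS ldbc1(partition_num = 24, replica_factor = 3, vid_type = int64);
"""

patched = set_replica_factor(sample)
print(patched)
```

The patched text would then be written back to the file that is handed to nebula importer.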

Regardless, the dataset was imported quickly and successfully. Have a look.

(root@nebula) [ldbc1]> show stats;
+---------+------------------+----------+
| Type    | Name             | Count    |
+---------+------------------+----------+
| "Tag"   | "Comment"        | 2052169  |
+---------+------------------+----------+
| "Tag"   | "Forum"          | 90492    |
+---------+------------------+----------+
| "Tag"   | "Organisation"   | 7955     |
+---------+------------------+----------+
| "Tag"   | "Person"         | 9892     |
+---------+------------------+----------+
| "Tag"   | "Place"          | 1460     |
+---------+------------------+----------+
| "Tag"   | "Post"           | 1003605  |
+---------+------------------+----------+
| "Tag"   | "Tag"            | 16080    |
+---------+------------------+----------+
| "Tag"   | "Tagclass"       | 71       |
+---------+------------------+----------+
| "Edge"  | "CONTAINER_OF"   | 1003605  |
+---------+------------------+----------+
| "Edge"  | "HAS_CREATOR"    | 3055774  |
+---------+------------------+----------+
| "Edge"  | "HAS_INTEREST"   | 229166   |
+---------+------------------+----------+
| "Edge"  | "HAS_MEMBER"     | 1611869  |
+---------+------------------+----------+
| "Edge"  | "HAS_MODERATOR"  | 90492    |
+---------+------------------+----------+
| "Edge"  | "HAS_TAG"        | 3721409  |
+---------+------------------+----------+
| "Edge"  | "HAS_TYPE"       | 16080    |
+---------+------------------+----------+
| "Edge"  | "IS_LOCATED_IN"  | 3073620  |
+---------+------------------+----------+
| "Edge"  | "IS_PART_OF"     | 1454     |
+---------+------------------+----------+
| "Edge"  | "IS_SUBCLASS_OF" | 70       |
+---------+------------------+----------+
| "Edge"  | "KNOWS"          | 180623   |
+---------+------------------+----------+
| "Edge"  | "LIKES"          | 2190095  |
+---------+------------------+----------+
| "Edge"  | "REPLY_OF"       | 2052169  |
+---------+------------------+----------+
| "Edge"  | "STUDY_AT"       | 7949     |
+---------+------------------+----------+
| "Edge"  | "WORK_AT"        | 21654    |
+---------+------------------+----------+
| "Space" | "vertices"       | 3165488  |
+---------+------------------+----------+
| "Space" | "edges"          | 17256029 |
+---------+------------------+----------+
Got 25 rows (time spent 1344/16017 us)

“Short Reads”

Now the more interesting part: I tried to solve the 7 “Short Reads” problems in the LDBC SNB Interactive workload. Refer to the spec for details.

Short Reads #1 - Profile of a person

Problem: Given a start Person, retrieve their first name, last name, birthday, IP address, browser, and city of residence.

match (v1:Person)-[:IS_LOCATED_IN]->(v2:Place) where id(v1)==$person_id
return v1.firstName, v1.lastName, v1.birthday, v1.locationIP, v1.browserUsed, id(v2), v1.gender, v1.creationDate
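
I read $person_id and $message_id in these queries as placeholders to fill in before the statement is sent, not as nGQL syntax. A minimal sketch of doing that client-side, assuming integer vertex IDs; fill_placeholders is a hypothetical helper and 933 is a made-up ID.

```python
import re


def fill_placeholders(query: str, **params: int) -> str:
    """Replace $name placeholders with integer literals.

    nGQL's own references such as `$$.Person.firstName` and `$-.postId`
    are left untouched, as are names for which no value is supplied.
    """
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in params:
            return match.group(0)
        # int() rejects anything that is not a plain integer ID.
        return str(int(params[name]))

    # (?<!\$) skips the second "$" of "$$"; requiring a letter or
    # underscore right after "$" skips "$$." and "$-.".
    return re.sub(r"(?<!\$)\$([A-Za-z_]\w*)", substitute, query)


profile_query = (
    "match (v1:Person)-[:IS_LOCATED_IN]->(v2:Place) where id(v1)==$person_id "
    "return v1.firstName, v1.lastName, id(v2)"
)
print(fill_placeholders(profile_query, person_id=933))
```

The resulting string can be pasted into the console or passed to whatever client is in use.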

Short Reads #2 - Recent messages of a person

Problem: Given a start Person, retrieve the last 10 Messages created by that user. For each Message, return that Message, the original Post in its conversation (post), and the author of that Post (originalPoster). If any of the Messages is a Post, then the original Post (post) will be the same Message, i.e. that Message will appear twice in that result.

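Well, looking for the original Post of a Message could involve an “unlimited” number of hops, which unfortunately is not yet supported by Nebula. As a workaround, a maximum number of hops needs to be specified in the query. I gave it 5 in my testing. Note that the query below only covers Comments; Posts created by the person are not returned.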

match(p1:Person)<-[:HAS_CREATOR]-(m:`Comment`)-[:REPLY_OF*..5]->(p:Post)-[:HAS_CREATOR]->(p2:Person) 
where id(p1)==$person_id return id(m) as messageId, 
(case m.content is null when false then m.content when true then m.imageFile end) as content,
id(p),id(p2),p2.firstName,p2.lastName,
m.creationDate as creationDate order by creationDate desc, messageId desc limit 10;
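
To make the effect of the hop cap concrete, here is a small sketch outside Nebula. It assumes a toy thread stored as a plain dict (comment ID to the ID of the message it replies to); find_root_post and the IDs are made up for illustration. Unlike the query above, the sketch treats a Post as its own root.

```python
from typing import Dict, Optional


def find_root_post(message_id: int, reply_of: Dict[int, int], max_hops: int = 5) -> Optional[int]:
    """Follow REPLY_OF edges from a message.

    Returns the root Post ID, or None when the root lies more than
    max_hops away, which is what a capped traversal would miss.
    """
    current = message_id
    for _ in range(max_hops):
        if current not in reply_of:
            return current  # no outgoing REPLY_OF edge: this is the Post
        current = reply_of[current]
    return current if current not in reply_of else None


# Toy thread: Post 100, then Comments 101..106, each replying to the previous one.
reply_of = {101: 100, 102: 101, 103: 102, 104: 103, 105: 104, 106: 105}

print(find_root_post(103, reply_of))  # 100 (3 hops)
print(find_root_post(106, reply_of))  # None (root is 6 hops away)
```

So with a cap of 5, any Comment nested deeper than 5 replies silently drops out of the result.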

Short Reads #3 - Friends of a person

Problem: Given a start Person, retrieve all of their friends, and the date at which they became friends.

match (p1:Person)-[k:KNOWS]-(p2:Person) where id(p1)==$person_id 
return id(p2) as friendId,p2.firstName,p2.lastName,k.creationDate as creationDate 
order by creationDate desc, friendId;

Short Reads #4 - Content of a message

Problem: Given a Message, retrieve its content and creation date.

A simple “Fetch” will do.

fetch prop on Post $message_id 
yield Post.creationDate, Post.content, Post.imageFile

Short Reads #5 - Creator of a message

Problem: Given a Message, retrieve its author.

A simple “Go” will do :)

go from $message_id over HAS_CREATOR yield HAS_CREATOR._dst as personId, $$.Person.firstName, $$.Person.lastName;

Short Reads #6 - Forum of a message

Problem: Given a Message, retrieve the Forum that contains it and the Person that moderates that Forum. Since Comments are not directly contained in Forums, for Comments, return the Forum containing the original Post in the thread which the Comment is replying to.

Again, “unlimited” hops are necessary but not yet supported by Nebula, so I capped the traversal at 5 steps.

go 0 to 5 steps from $message_id over REPLY_OF yield REPLY_OF._dst as postId 
| go from $-.postId over CONTAINER_OF REVERSELY yield CONTAINER_OF._dst as forumId, $$.Forum.title as title
| go from $-.forumId over HAS_MODERATOR yield $-.forumId, $-.title, HAS_MODERATOR._dst as moderatorId, $$.Person.firstName, $$.Person.lastName

Short Reads #7 - Replies of a message

Problem: Given a Message, retrieve the (1-hop) Comments that reply to it. In addition, return a boolean flag knows indicating if the author of the reply (replyAuthor) knows the author of the original message (messageAuthor). If author is same as original author, return False for knows flag.

I couldn’t figure this one out using nGQL alone. It seems Optional Match is a must, and it is still missing from the latest release of Nebula. I look forward to it being added in a future release.
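
One possible workaround, which I have not run against Nebula, is to fetch the pieces with two separate queries (the replies with their authors, and the friends of the message author as in Short Reads #3) and compute the knows flag client-side. A minimal sketch of that last step, with made-up IDs and a hypothetical helper name:

```python
from typing import Iterable, List, Set, Tuple


def add_knows_flag(
    replies: Iterable[Tuple[int, int]],
    message_author_id: int,
    friends_of_author: Set[int],
) -> List[Tuple[int, int, bool]]:
    """Attach the knows flag to (comment_id, reply_author_id) rows.

    The flag is False when the reply author is the message author,
    as the problem statement requires.
    """
    return [
        (
            comment_id,
            reply_author_id,
            reply_author_id != message_author_id
            and reply_author_id in friends_of_author,
        )
        for comment_id, reply_author_id in replies
    ]


# Made-up rows standing in for the two query results.
replies = [(201, 11), (202, 12), (203, 10)]
rows = add_knows_flag(replies, message_author_id=10, friends_of_author={11})
print(rows)  # [(201, 11, True), (202, 12, False), (203, 10, False)]
```

This costs an extra round trip compared to a single query, which is probably acceptable for practice but not for benchmarking.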

The end :)
