
Lookup executor was partially performed

I’ve inserted vertices and edges, but when I query with LOOKUP I get only part of the expected rows in the response, together with a warning about a partial result.
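For illustration, the queries are plain index lookups of this shape (the tag, edge, and property names here are placeholders, not my real schema):

LOOKUP ON person WHERE person.name == "Alice" YIELD person.age;
LOOKUP ON follow WHERE follow.degree > 90;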

[WARNING]: Lookup executor was partially performed.

The cluster is operating normally and all nodes are in the online state.
What is causing this?

Cluster info: 4 machines, bare-metal install
Nebula Graph version 1.2.0
CentOS 7.8, kernel 4.4
64 cores, 256 GB RAM, 4 x 3.5 TB NVMe disks

Spaces:

======================================================================
| ID | Name | Partition number | Replica Factor | Charset | Collate  |
======================================================================
| 1  | rl   | 128              | 2              | utf8    | utf8_bin |
----------------------------------------------------------------------
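The space was created roughly like this (reconstructed from the SHOW SPACES output above; the exact statement may have differed):

CREATE SPACE rl(partition_num=128, replica_factor=2);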

storaged log (from one node, last few lines):

E0218 12:45:33.935384 6882 RaftPart.cpp:1075] [Port: 44501, Space: 1, Part: 85] Receive response about askForVote from [10.200.133.39:44501], error code is -6
E0218 12:45:34.170789 6881 RaftPart.cpp:1075] [Port: 44501, Space: 1, Part: 5] Receive response about askForVote from [10.200.133.39:44501], error code is -6
E0218 12:45:34.541381 6879 RaftPart.cpp:1075] [Port: 44501, Space: 1, Part: 61] Receive response about askForVote from [10.200.133.39:44501], error code is -6
E0218 12:45:34.863966 6881 RaftPart.cpp:1075] [Port: 44501, Space: 1, Part: 5] Receive response about askForVote from [10.200.133.39:44501], error code is -6
E0218 12:45:35.267017 6880 RaftPart.cpp:1075] [Port: 44501, Space: 1, Part: 85] Receive response about askForVote from [10.200.133.39:44501], error code is -6
E0218 12:45:35.356729 6882 RaftPart.cpp:1075] [Port: 44501, Space: 1, Part: 125] Receive response about askForVote from [10.200.133.39:44501], error code is -6
E0218 21:37:31.496176 6537 RaftPart.cpp:909] [Port: 44501, Space: 1, Part: 9] processAppendLogResponses failed!
E0218 22:01:51.450465 6526 LookUpIndexProcessor.cpp:39] Execute Execution Plan! ret = -5, spaceId = 1, partId = 1
E0218 22:02:30.601763 6521 LookUpIndexProcessor.cpp:39] Execute Execution Plan! ret = -5, spaceId = 1, partId = 9
E0218 22:22:02.391155 6527 LookUpIndexProcessor.cpp:39] Execute Execution Plan! ret = -5, spaceId = 1, partId = 13
E0219 08:45:54.192809 6542 LookUpIndexProcessor.cpp:39] Execute Execution Plan! ret = -5, spaceId = 1, partId = 1

Hi @goranc, thanks for asking the question here! To help the on-call dev debug the issue quickly, could you please provide the following info?

Nebula Graph version you are using:
Deployment method (Distributed/Single Host/Docker):
Hardware type (SSD/HDD):
Memory size:
How did you create your graph space?

Please provide the info @jamieliu1023 asked for, and share the storaged logs.

Please set the “Replica Factor” to 3.

@goranc The issue may be that you have set the number of replicas to an even number, so leader election may not work properly, because Nebula uses a quorum (majority) protocol for its Raft replication.

For your reference: CREATE SPACE - Nebula Graph Database Manual
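For example, recreating the space with an odd replica factor would look roughly like this (partition count taken from your SHOW SPACES output; adjust as needed):

CREATE SPACE rl(partition_num=128, replica_factor=3);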

Hope it helps!

Thanks,

Will try setting the replica factor to 3.

Just to show that there is strange behavior on the cluster, here is what I found using ordinary commands in the console.

SHOW PARTS doesn’t show leaders for some parts.
After issuing BALANCE LEADER, the leaders for those parts are populated, but they are lost again after some queries are executed.

...
| 14           | 10.200.133.40:44500 | 10.200.133.39:44500, 10.200.133.40:44500 |       |
-----------------------------------------------------------------------------------------
| 15           | 10.200.133.40:44500 | 10.200.133.40:44500, 10.200.133.37:44500 |       |
-----------------------------------------------------------------------------------------
| 16           | 10.200.133.37:44500 | 10.200.133.37:44500, 10.200.133.38:44500 |       |
-----------------------------------------------------------------------------------------
| 17           |                     | 10.200.133.38:44500, 10.200.133.39:44500 |       |
-----------------------------------------------------------------------------------------
| 18           | 10.200.133.40:44500 | 10.200.133.39:44500, 10.200.133.40:44500 |       |
-----------------------------------------------------------------------------------------
| 19           | 10.200.133.40:44500 | 10.200.133.40:44500, 10.200.133.37:44500 |       |
-----------------------------------------------------------------------------------------
| 20           | 10.200.133.37:44500 | 10.200.133.37:44500, 10.200.133.38:44500 |       |
-----------------------------------------------------------------------------------------
| 21           |                     | 10.200.133.38:44500, 10.200.133.39:44500 |       |
-----------------------------------------------------------------------------------------
| 22           | 10.200.133.40:44500 | 10.200.133.39:44500, 10.200.133.40:44500 |       |
-----------------------------------------------------------------------------------------
| 23           | 10.200.133.37:44500 | 10.200.133.40:44500, 10.200.133.37:44500 |       |
-----------------------------------------------------------------------------------------
| 24           | 10.200.133.37:44500 | 10.200.133.37:44500, 10.200.133.38:44500 |       |
...

SHOW EDGE INDEXES lists the indexes for edges.
SHOW EDGE INDEX STATUS returns an empty list.
For tags, both commands show the list of indexes and their status.
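For reference, these are the console commands I ran for the checks above (the tag variants being the analogous SHOW TAG INDEXES / SHOW TAG INDEX STATUS):

SHOW PARTS;
BALANCE LEADER;
SHOW PARTS;
SHOW EDGE INDEXES;
SHOW EDGE INDEX STATUS;
SHOW TAG INDEXES;
SHOW TAG INDEX STATUS;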

Yes, that seems to be the issue as far as I can see.

If a majority quorum (e.g. 2 of 3) is mandatory, then an error should be reported when you try to create a space with 2 replicas.
If we can live with 2 replicas and the quorum is a majority, i.e. floor(N/2) + 1, then the election algorithm should be able to reach quorum whenever all nodes are up and running (or even with one node down when the replica factor is 3).
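To spell out the majority arithmetic (standard Raft, on which Nebula’s storage replication is based):

majority(N) = floor(N/2) + 1
majority(2) = 2   -> both replicas must respond, so no node may be down
majority(3) = 2   -> a leader can still be elected with one replica down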

I’m not sure it is acceptable that data is unavailable for queries on an otherwise healthy cluster.

@bright-starry-sky @critical27 Please help take a look at this problem and see whether it is the same issue as in the topic "Is it safe to change rocksdb parameters after data are loaded".
Thanks!

It looks like the same problem as in "Is it safe to change rocksdb parameters after data are loaded"; the configuration and environment need to be checked.


I have set up a space with replica factor 3 on the same cluster layout (fresh cluster nodes), repeated the same data model, and loaded a subset of the data (3 billion vertices and 10 billion edges).

I tried stopping the services on all nodes with nebula.service stop all and then starting them again with nebula.service start all.
After the restart I verified the cluster state and noticed some issues:
SHOW HOSTS - all hosts are online, but leaders are not balanced across the nodes
SHOW PARTS - not all parts have a leader

Is there a correct way to restart the cluster nodes?

E.g. stop the graphd service on the first node and verify the status of the service, then continue to the next node.
After finishing with graphd, stop the storaged service and verify on each node that the service has stopped (this can take some time depending on the data size) before proceeding to the next node.
The last step is to stop the metad service on each node (usually 3 nodes in a production system).
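Something like the sketch below, assuming the stock scripts/nebula.service helper (the install path is an assumption; adjust to your layout):

# run on each node in turn, one service at a time
cd /usr/local/nebula/scripts        # assumed install path
./nebula.service stop graphd        # stop the query service first
./nebula.service status graphd      # confirm it is down before moving to the next node

# once graphd is stopped on every node, repeat per node for storaged
./nebula.service stop storaged      # flushing can take a while with large data
./nebula.service status storaged

# finally, stop metad on the (usually 3) meta nodes
./nebula.service stop metad
./nebula.service status metad

# to bring the cluster back, presumably start in reverse order:
# metad first, then storaged, then graphd (or nebula.service start all)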

If all hosts in the cluster are running fine but not all partitions have leaders (no matter whether the replica factor is 2 or 3), that sounds like a bug to me.

To debug this, would it be possible for you to provide the log files of the storage daemons?