
Kasper

08/30/2022, 7:41 AM
I'm testing on SNB SF100 generated with
nebula-bench
. Graph is deployed via docker-compose running on a 64 vCPU machine with 256GB RAM. I add a tag index to Post and rebuild:
CREATE TAG INDEX IF NOT EXISTS post_index ON Post();
REBUILD TAG INDEX post_index;
After that I use LOOKUP. Querying for Posts (~60M instances), this query returns in around 5 minutes:
LOOKUP ON Post yield id(vertex) | yield count(*)
whereas
LOOKUP ON Post yield id(vertex)
stalls. I tested both using the console as well as the Python client. Am I missing something? How should I retrieve a large number of results then? Use vertex scan? Some kind of pagination?

Goran Cvijanovic

08/30/2022, 7:51 AM
Be aware that rebuilding an index takes some time. You can also create the index in advance; that slows down inserting data a little, but the index stays up to date. Use SHOW TAG INDEX STATUS, SHOW JOBS, and SHOW JOB <id> to verify the status of the rebuild process.
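A minimal sketch of that check with nebula3-python (address, credentials, space name, and the job id are placeholders; the real id is printed when the REBUILD statement is submitted):

from nebula3.Config import Config
from nebula3.gclient.net import ConnectionPool

pool = ConnectionPool()
pool.init([('127.0.0.1', 9669)], Config())
session = pool.get_session('root', 'nebula')
session.execute('USE sf100')  # space name assumed

# Verify the index exists and the rebuild job has finished.
for stmt in ('SHOW TAG INDEX STATUS;', 'SHOW JOBS;', 'SHOW JOB 42;'):  # 42 = placeholder job id
    result = session.execute(stmt)
    if not result.is_succeeded():
        print(stmt, '->', result.error_msg())
        continue
    for i in range(result.row_size()):
        print(result.row_values(i))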

Jingchun

08/30/2022, 8:10 AM
Try SUBMIT JOB STATS then SHOW STATS if you only need the numbers of vertices and edges. It's better to use LIMIT and pagination when retrieving large amounts of data.
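For example (a sketch, assuming a connected nebula3-python session like the one above; SHOW STATS only reflects the most recently finished STATS job):

# Kick off the stats job, then read the per-tag/per-edge counts.
session.execute('SUBMIT JOB STATS;')
# ... poll SHOW JOBS until the STATS job is FINISHED ...
stats = session.execute('SHOW STATS;')
for i in range(stats.row_size()):
    print(stats.row_values(i))  # rows of [Type, Name, Count]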

Kasper

08/30/2022, 8:23 AM
Did that, the index job completed successfully.
Regarding pagination, you mean using SKIP and LIMIT, correct? Say I do MATCH ... SKIP 1000000 LIMIT 1000000, doesn't it still retrieve 2M results and then just discard the first 1M? Then each subsequent page would take more time.
Is LOOKUP then not the right way to retrieve large numbers of results/vertex IDs? As far as I'm aware, you cannot attach a LIMIT to a LOOKUP query.

Jingchun

08/30/2022, 8:26 AM
You can do that in LOOKUP, such as
LOOKUP ON Post yield id(vertex) | LIMIT 10, 10
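and page through it from Python like this (a sketch, assuming a connected nebula3-python session; each page still produces and then discards the offset rows, so later pages get slower):

# Page through Post vertex ids 10k at a time via | LIMIT <offset>, <count>.
page_size = 10000
offset = 0
while True:
    result = session.execute(
        f'LOOKUP ON Post YIELD id(vertex) AS vid | LIMIT {offset}, {page_size};')
    if result.row_size() == 0:
        break
    # cast() turns each ValueWrapper into a native Python value.
    vids = [result.row_values(i)[0].cast() for i in range(result.row_size())]
    # ... feed vids to the next stage ...
    offset += page_size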
You're right about the 2M results in your example.
Why do you need to retrieve that much data in one query? Could you shed some light on the scenario? Thanks.

Kasper

08/30/2022, 8:30 AM
I'm fetching a node-induced subgraph. These vertex IDs are my starting points.

Jingchun

08/30/2022, 8:38 AM
I see, so you will need all 58M Post vertices? In this case, would you consider using a client SDK instead of queries?
For example, in the Java client SDK there's an API to scan vertices.
❤️ 1

Kasper

08/30/2022, 9:25 AM
There's one for Python as well, no?
So scanning vertices is the preferred way, correct?
Scan will be much, much faster

Kasper

08/30/2022, 9:28 AM
I've been using the Python client already. Will switch to the scan approach then. Thanks!

Jingchun

08/30/2022, 9:28 AM
welcome 🙂

Goran Cvijanovic

08/30/2022, 10:53 AM
I had a use case where I needed to pick up newly inserted vertices and do something with them. I set an indexed timestamp property and used it to fetch vertices in a specific time range: I pick, e.g., a one-minute range, get the thousands or even hundreds of thousands of vertices inserted in that range, and process them with some graph queries. Then I get the next batch in a contiguous time frame without overlap, and so can process all vertices continuously. Maybe the idea fits your use case.
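A sketch of that windowed fetch (assuming a connected nebula3-python session and an indexed integer timestamp property; all names and bounds here are illustrative):

# Walk non-overlapping one-minute windows over an indexed `created` property.
window = 60                           # window size in seconds
start, end = 1262304000, 1262390400   # placeholder epoch bounds of the data
t = start
while t < end:
    result = session.execute(
        f'LOOKUP ON Post WHERE Post.created >= {t} AND Post.created < {t + window} '
        f'YIELD id(vertex) AS vid;')
    # ... process this window's vertices with further graph queries ...
    t += window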
❤️ 1

wey

08/31/2022, 12:19 AM
@Kasper for the Python storage client, please use the master version for now, as there is a bug in scanning data that is being fixed; a release to PyPI will be done later 🙂. Another thing to note: the storage client requires direct access to metad and storaged.
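A minimal sketch of such a scan (metad address, space, and tag names are placeholders; note the client talks to metad on its own port, typically 9559, not to graphd):

from nebula3.mclient import MetaCache
from nebula3.sclient.GraphStorageClient import GraphStorageClient

# The storage client resolves partitions via metad and then reads from
# storaged directly, so both must be reachable from this process.
meta_cache = MetaCache([('127.0.0.1', 9559)], 50000)
client = GraphStorageClient(meta_cache)

resp = client.scan_vertex(space_name='sf100', tag_name='Post')
while resp.has_next():
    for vertex_data in resp.next():
        print(vertex_data.get_id())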
👍 1