# nebula-users
k
I'm testing on SNB SF100 generated with `nebula-bench`. The graph is deployed via docker-compose on a 64 vCPU machine with 256 GB RAM. I add a tag index to Post and rebuild:
```
CREATE TAG INDEX IF NOT EXISTS post_index ON Post();
REBUILD TAG INDEX post_index;
```
After that I use `LOOKUP`. Querying for Posts (~60M instances), this query returns in around 5 minutes:

```
LOOKUP ON Post YIELD id(vertex) | YIELD count(*)
```

whereas

```
LOOKUP ON Post YIELD id(vertex)
```

stalls. I tested both from the console as well as the Python client. Am I missing something? How should I retrieve a large number of results then? Use vertex scan? Some kind of pagination?
g
Be aware that rebuilding an index takes some time. You can create the index in advance; that slows down inserting data a little, but the index stays up to date. Use `SHOW INDEX STATUS`, `SHOW JOBS`, and `SHOW JOB <id>` to verify the status of the rebuild process.
j
Try `SUBMIT JOB STATS` then `SHOW STATS` if you only need the counts of vertices and edges. It's better to use LIMIT and pagination when retrieving a large amount of data.
k
Did that, index job was successfully completed.
Regarding pagination, you mean using `SKIP` and `LIMIT`, correct? So say if I do `MATCH ... SKIP 1000000 LIMIT 1000000`, doesn't it still retrieve 2M results and then just discard the first 1M? Then each subsequent page would take more time.
Is `LOOKUP` then not the right way to retrieve large numbers of results/vertex IDs? As far as I'm aware, you cannot attach a `LIMIT` to a `LOOKUP` query.
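The cost k is worried about can be illustrated with a toy model (pure Python, nothing Nebula-specific; the function name and numbers are just for illustration): offset-style pagination still has to walk every skipped row before it can emit the requested page, so later pages get progressively more expensive.

```python
# Toy model of SKIP/LIMIT pagination: the "server" walks past all
# skipped rows before it can return the requested page.
def paged(rows, skip, limit):
    walked = 0   # rows the server had to touch to serve this page
    page = []
    for row in rows:
        walked += 1
        if walked > skip:
            page.append(row)
            if len(page) == limit:
                break
    return page, walked

rows = range(2_000_000)
page, walked = paged(rows, skip=1_000_000, limit=1_000_000)
# The second page of 1M rows costs 2M rows of work, as suspected above.
```

In this toy run, `walked` comes out at 2,000,000 for a page of 1,000,000 rows, which is exactly the "retrieve 2M, discard 1M" behaviour described above.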
j
You can do that in `LOOKUP`, e.g.

```
LOOKUP ON Post YIELD id(vertex) | LIMIT 10, 10
```

You're right about the 2M results in your example. Why do you need to retrieve that much data in one query? Could you shed some light on the scenario? Thx.
k
I'm fetching a node-induced subgraph. These vertex ids are my starting points.
j
I see, so you will need all 58M Post vertices? In that case, would you consider using a client SDK instead of queries? For example, the Java client SDK has an API to scan vertices.
❤️ 1
k
there's one for python as well, no?
so scanning vertices is the preferred way, correct?
Scan will be much much faster
k
I've been using the Python client already. I'll switch to the scan approach then. Thanks!
j
welcome 🙂
g
I had a use case where I needed to pick up vertices as they were inserted and do something with them. I set a timestamp property with an index and used it to fetch vertices in a specific time range: pick, e.g., a one-minute range, get the thousands (or even hundreds of thousands) of vertices inserted in that range, and process them with some graph queries. Then fetch the next batch in a contiguous time frame without overlap, so all vertices get processed continuously. Maybe the idea fits your use case.
❤️ 1
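g's batching idea can be sketched as a small pure-Python helper that produces contiguous, non-overlapping time windows; each window would then drive an indexed range query. The one-minute step and the `Post.ts` property name in the comment are assumptions for illustration, not details from the thread.

```python
from datetime import datetime, timedelta

def time_windows(start, end, step=timedelta(minutes=1)):
    """Yield contiguous, non-overlapping [lo, hi) windows covering [start, end)."""
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        yield lo, hi
        lo = hi

# Each window would drive an indexed query on a hypothetical timestamp
# property, along the lines of:
#   LOOKUP ON Post WHERE Post.ts >= <lo> AND Post.ts < <hi> YIELD id(vertex)
windows = list(time_windows(datetime(2024, 1, 1, 0, 0),
                            datetime(2024, 1, 1, 0, 5)))
```

Because consecutive windows share a boundary (`hi` of one is `lo` of the next) and the ranges are half-open, no vertex is fetched twice and none is skipped.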
w
@Kasper for the Python storage client, please use the master version for now, as there is a bug in data scanning that is being fixed; a release to PyPI will follow later 🙂. Another thing to note: the storage client needs direct access to metad and storaged.
👍 1
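The scan approach discussed above might look roughly like the following with nebula3-python's storage client. This is an unverified sketch based on the shape of that client's documented scan example: the metad host/port, the space name `sf100`, and the `get_id()` accessor are assumptions to check against the nebula3-python README. The nebula3 imports live inside the function so the file loads even where the package is not installed.

```python
def scan_post_ids(meta_hosts, space_name="sf100", tag_name="Post"):
    """Yield vertex IDs of every Post vertex by scanning storaged directly.

    Sketch only: assumes nebula3-python's GraphStorageClient/MetaCache API,
    and that the process can reach metad and storaged directly (as w notes).
    """
    # Imports kept inside the function so this file imports without nebula3.
    from nebula3.mclient import MetaCache
    from nebula3.sclient.GraphStorageClient import GraphStorageClient

    meta_cache = MetaCache(meta_hosts, 50000)
    client = GraphStorageClient(meta_cache)
    try:
        resp = client.scan_vertex(space_name=space_name, tag_name=tag_name)
        while resp.has_next():
            for vertex_data in resp.next():
                yield vertex_data.get_id()
    finally:
        client.close()

# Usage against a live cluster (addresses are hypothetical):
# ids = list(scan_post_ids([("metad0", 9559)]))
```

Since this streams batches from storaged rather than funnelling 58M rows through graphd, it avoids both the stalled `LOOKUP` and the offset-pagination cost discussed earlier.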