Graph Data Science with Neo4j

In GDS we refer to this process as projecting a graph and refer to the in-memory graph as a graph projection. GDS can hold multiple graph projections at once and they are managed by a component called the Graph Catalog.

This includes classic graph algorithms such as centrality, community detection, path finding, etc. It also includes embeddings, a form of robust graph feature engineering, as well as machine learning pipelines. There are a few things you may want to do with the output/result of graph algorithms. GDS enables you to write results back to the database, export to disk in csv format, or stream results into another application or downstream workflow.

In summary, read and load, execute and store

Of the above, two main types of memory can be allocated in configuration:

  • Heap Space: Used for storing in-memory graphs, executing GDS algorithms, query execution, and transaction state

  • Page Cache: Used for indexes and to cache the Neo4j data stored on disk. Improves performance for querying the database and projecting graphs.

Allocation of heap memory can be done with the dbms.memory.heap.initial_size and dbms.memory.heap.max_size in the Neo4j configuration

setting a minimum page cache size is still important when projecting graphs. This minimum can be estimated at approximately 8KB * 100 * readConcurrency for standard, native, projections. Page cache size can be set via dbms.memory.pagecache.size in the Neo4j configuration.

What is Graph Catalog?

  • creating (a.k.a projecting) graphs

  • viewing details about graphs

  • dropping graph projections

  • exporting graph projections

  • writing graph projection properties back to the database

How to work with GDS?

We create graph projection to create sparsh, so that we can run algorithms on them.

Then We can store the data from the algos, into node properties.

YIELD nodeId, propertyValue
RETURN gds.util.asNode(nodeId).name AS actorName, propertyValue AS numberOfMoviesActedIn
ORDER BY numberOfMoviesActedIn DESCENDING, actorName LIMIT 10

For, example we can count the num of movie actor has acted in with data centrality algorithm.

We can also export data from the graphs,

two methods for that:

  1. gds.graph.export

  2. gds.beta.graph.export.csv

Dropping and deleting

CALL gds.graph.drop('my-graph-projection')

We list graphs

CALL gds.graph.list()

Native Projections

Native Projections provide the best performance and we can call gds.graph.project()


‘Name’, ‘Node project’, ‘relationshipProjection’

CALL gds.graph.project('native-proj',['User', 'Movie'], ['RATED']);

We set and use nodeProperites to Node and Relationship properties, as they can be used as weights in graph algo and features in ML.

CALL gds.graph.project(
    ['User', 'Movie'],
    {RATED: {orientation: 'UNDIRECTED'}},
            revenue: {defaultValue: 0}, // (1)
            budget: {defaultValue: 0},
            runtime: {defaultValue: 0}
        relationshipProperties: ['rating'] // (2)

Why native projection are good ?

CALL gds.graph.project(
    SENT_MONEY_TO: { aggregation: 'SINGLE' }

Sum Aggregator

CALL gds.graph.project(
      properties: {
        totalAmount: {
          property: 'amount',
          aggregation: 'SUM'

Cypher Projections

Cypher projections are intended to be used in exploratory analysis and developmental phases where additional flexibility and/or customization is needed.


A Cypher projection takes three mandatory arguments: graphName, nodeQuery, and relationshipQuery. In addition, the optional configuration parameter allows us to further configure graph creation.

Since we cannot do complex filtering in native project that is one of the reasons to use Cypher

CALL gds.graph.project.cypher(
  'MATCH (a:Actor) RETURN id(a) AS id, labels(a) AS labels',
  'MATCH (a1:Actor)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(a2)
   WHERE m.year >= 1990 AND m.revenue >= 1000000
   RETURN id(a1) AS source , id(a2) AS target, count(*) AS actedWithCount, "ACTED_WITH" AS type'


CALL'proj-cypher',{relationshipWeightProperty: 'actedWithCount'})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score

Migrating from Cypher to native projection

//set a node label based on recent release and revenue conditions
MATCH (m:Movie)
WHERE m.year >= 1990 AND m.revenue >= 1000000
SET m:RecentBigMovie;

//native projection with reverse relationships
CALL gds.graph.project('proj-native',
    HAS_ACTOR:{type:'ACTED_IN', orientation: 'REVERSE'}

//collapse path utility for relationship aggregation - no weight property
CALL gds.beta.collapsePath.mutate('proj-native',{
    pathTemplates: [['ACTED_IN', 'HAS_ACTOR']],
    allowSelfLoops: false,
    mutateRelationshipType: 'ACTED_WITH'
CALL gds.graph.project.cypher(
  // Projection name
  // Cypher statement to find all nodes in projection
  'MATCH (m:Movie) RETURN id(m) AS id, labels(m) AS labels',
  // Cypher statement to find every User that rated a Movie
  // where the rating property is greater than or equal to 4
  // and the movie was released after 2014 (year > 2014)
  'MATCH (u:User)-[r:RATED]->(m:Movie)
   WHERE m.releaseYear > 2014 AND r.stars >= 4
   RETURN id(u) AS source , id(m) AS target, count(*) AS actedWithCount, "RATED" AS type'
) YIELD relationshipCount;


CALL gds.graph.project.cypher(
  // Projection name
  // Cypher statement to find all nodes in projection
    MATCH (u:User) RETURN id(u) AS id, labels(u) AS labels
    UNION MATCH (m:Movie) WHERE m.year > 2014 RETURN id(m) AS id, labels(m) AS labels
  // Cypher statement to find every User that rated a Movie
  // where the rating property is greater than or equal to 4
  // and the movie was released after 2014 (year > 2014)
    MATCH (u:User)-[r:RATED]->(m:Movie)
    WHERE r.rating >= 4 AND m.year > 2014
    RETURN id(u) AS source,
        id(m) AS target,
        r.rating AS rating,
        "RATED" AS type

Centrality and Importance

Centrality algorithms are used to determine the importance of distinct nodes in a graph.

Degree Centrality Example

IN GDS, we calcualte the no of out relationshhips from a node

Count the number of movies Actor has acted in

CALL gds.graph.project('proj', ['Actor','Movie'], 'ACTED_IN');
//get top 5 most prolific actors (those in the most movies)
//using degree centrality which counts number of `ACTED_IN` relationships
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS actorName, score AS numberOfMoviesActedIn
ORDER BY numberOfMoviesActedIn DESCENDING, actorName LIMIT 5


PageRank estimates the importance of a node by counting the number of incoming relationships from neighboring nodes weighted by the importance and out-degree centrality of those neighbors. The underlying assumption is that more important nodes are likely to have proportionately more incoming relationships from other important nodes

//drop last graph projection
CALL gds.graph.drop('proj', false);

//create Cypher projection for network of people directing actors
//filter to recent high grossing movies
CALL gds.graph.project.cypher(
  'MATCH (a:Person) RETURN id(a) AS id, labels(a) AS labels',
  'MATCH (a1:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(a2)
   WHERE m.year >= 1990 AND m.revenue >= 10000000
   RETURN id(a1) AS source , id(a2) AS target, count(*) AS actedWithCount,

Next stream PageRank to find the top 5 most influential people in director-actor network.

YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS personName, score AS influence
ORDER BY influence DESCENDING, personName LIMIT 5
  • Betweenness Centrality: Measures the extent to which a node stands between the other nodes in a graph. It is often used to find nodes that serve as a bridge from one part of a graph to another.

  • Eigenvector Centrality: Measures the transitive influence of nodes. Similar to PageRank, but works only on the largest eigenvector of the adjacency matrix so does not converge in the same way and tends to more strongly favor high degree nodes. It can be more appropriate in certain use cases, particularly those with undirected relationships.

  • Article Rank: A variant of PageRank which assumes that relationships originating from low-degree nodes have a higher influence than relationships from high-degree nodes.

Path Finding

shortest path between two or more nodes and evaluate the availability and quality of path

Use Case, Supply chain analytics and Customer Journey

  • A Shortest Path:* An extension of Dijkstra that uses a heuristic function to speed up computation.

  • Yen’s Algorithm Shortest Path: An extension of Dijkstra that allows you to find multiple, the top k, shortest paths. [For multiple paths as well]

  • Dijkstra Single-Source Shortest Path: Dijkstra implementation for shortest path between one source and multiple targets.

  • Delta-Stepping Single-Source Shortest Path: Parallelized shortest path computation. Computes faster than Dijkstra single-source shortest Path but uses more memory.

CALL gds.graph.project('proj',
MATCH (a:Actor)
WHERE IN ['Kevin Bacon', 'Denzel Washington']
WITH collect(id(a)) AS nodeIds
CALL'proj', {sourceNode:nodeIds[0], TargetNode:nodeIds[1]})
YIELD sourceNode, targetNode, path
RETURN gds.util.asNode(sourceNode).name AS sourceNodeName,
    gds.util.asNode(targetNode).name AS targetNodeName,
    nodes(path) as path;

Community Detection

group of nodes for clustering or Partitioned in the graph.

Louvain Community Detection

Louvain maximizes a modularity score for each community, where the modularity quantifies the quality of an assignment of nodes to communities. This means evaluating how much more densely connected the nodes within a community are, compared to how connected they would be in a random network.

  • hierarchical clustering

  • recursively merges communities together

  • stochastic algorithm

  • Label Propagation: Similar intent as Louvain. Fast algorithm that parallelizes well. Great for large graphs.

  • Weakly Connected Components (WCC): Partitions the graph into sets of connected nodes such that

    1. Every node is reachable from any other node in the same set 2. No path exists between nodes from different sets
  • Triangle Count: Counts the number of triangles for each node. Can be used to detect the cohesiveness of communities and stability of the graph.

  • Local Clustering Coefficient: Computes the local clustering coefficient for each node in the graph which is an indicator for how the node clusters with its neighbors


Node Embedding

node embedding is to compute low-dimensional vector representations of nodes such that similarity between vectors (eg. dot product) approximates similarity between nodes in the original graph.

Use Cases:

  • features for ML

  • Similarity Measurements

  • Exploratory Data Analysis(EDA)


GDS offers a custom implementation of a node embedding technique called Fast Random Projection, or FastRP for short.

  • embeddingDimension: Applies to all node embedding algorithms in GDS. Controls the length of the embedding vectors. Setting this parameter is a trade-off between dimensionality reduction and accuracy.

  • IterationWeights: his controls two aspects: the number of iterations for intermediate embeddings, and their relative impact on the final node embedding. The parameter is a list of numbers, indicating one iteration per number where the number is the weight applied to that iteration

Example of FastRP

CALL gds.graph.project('proj', ['Movie', 'Person'], {

Then Run FastRP

CALL'proj',  {embeddingDimension:64, randomSeed:7474})
YIELD nodeId, embedding
WITH gds.util.asNode(nodeId) AS n, embedding
WHERE n:Person
RETURN id(n),, embedding LIMIT 10

Other Algo:

  • Node2Vec

  • GraphSage


Similarity algorithms, as the name implies, are used to infer similarity between pairs of nodes

USe Cases:

  • Fraud Detection

  • Recommendation System

  • Entity Resolution


  • Node Similarity: Determines similarity between nodes based on the relative proportion of shared neighboring nodes in the graph.

  • k-nearest Neighbor(KNN):Determines similarity based off node properties. The GDS KNN implementation can scale well for global inference over large graphs when tuned appropriately.

Both Node Similarity and KNN provide choices between different similarity metrics. Node Similarity has choices between Jaccard and Overlap similarity. KNN choice of metric is driven by the node property types. List of integers are subject to Jaccard and Overlap, list of floating point numbers to Cosine Similarity, Pearson, and Euclidean.

Graph Machine Learning

  • Node Classificatioin

  • Link Prediction

  • Node Classification Pipelines: Supervised binary and multi-class classification for nodes

  • Link Prediction Pipelines: Supervised prediction for whether a relationship or "link" should exist between pairs of nodes

These pipelines have a train procedure that, once run, produces a trained model object. These trained model objects, in turn, have a predict procedure that can be used to produce predictions on the data.

Node Classification Patterns in GDS

Shorted Path with Neo4j

We can either non-weighted or weighted shortestpath with neo4j. non-weighted in built in neo4j where for weighted we have to create a projections in noe4j.

Challenge: Dijkstra’s Source-Target Shortest Path

CALL gds.graph.project(

Find the Source-Target Shortest Path

MATCH (source:Airport {iata: "BNA"}),
      (target:Airport {iata:"HKT"})

) YIELD totalCost
RETURN totalCost;

Get K Multiple shorted Paths with yen’s algorithm

MATCH (source:Airport {iata: "CDG"}),
      (target:Airport {iata:"LIS"})

    k:3 // (1)
YIELD index, nodeIds, totalCost // (2)
RETURN index, [nodeId in nodeIds | gds.util.asNode(nodeId).descr] AS airports, totalCost

Did you find this article valuable?

Support Pratik Sharma by becoming a sponsor. Any amount is appreciated!