From 24ec84b3261e8b351fde8b0b70039310f9a22f8b Mon Sep 17 00:00:00 2001
From: Achal Agarwal <achalagarwal.01@gmail.com>
Date: Wed, 7 Oct 2020 16:22:36 -0700
Subject: [PATCH 01/12] Create hive_gcn.md

in progress
---
 docs/hive/hive_gcn.md | 139 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 139 insertions(+)
 create mode 100644 docs/hive/hive_gcn.md
diff --git a/docs/hive/hive_gcn.md b/docs/hive/hive_gcn.md
new file mode 100644
index 0000000000..bab3609afb
--- /dev/null
+++ b/docs/hive/hive_gcn.md
@@ -0,0 +1,139 @@
+# Graph Convolutional Network
+
+The most promising use case for a Graph Convolutional Network (GCN) is semi-supervised learning, where we are given a set of nodes, each with some observed numeric attributes x<sub>i</sub>.
+
+The task is to predict an output/label for each node based on partial observations, i.e., labels for some, but not all, of the nodes.
+
+We might also be given a set of weighted edges, summarised by an adjacency matrix A. The main assumption is that when predicting the output y<sub>i</sub> for node i, the attributes and connectivity of nearby nodes provide useful side information or additional context.
+
+
+## Summary of Results
+
+-- Talk to JDO about this. Write it last, probably.
+
+## Summary of Gunrock Implementation
+
+The implementation has been hugely guided by
+
+- http://proceedings.mlr.press/v97/wu19e.html
+- https://arxiv.org/abs/1609.02907
+
+
+The GCN algorithm can be mapped into the following steps:
+
+1. 
+2. 
+3.
+4.
+5.
+6.
+
+The description of the lower level operators used to implement some of the steps described above:
+
+1. 
+
+2. 
+
+3.
+
+What was implemented with respect to the entire workflow?
+
+
+## How To Run This Application on DARPA's DGX-1
+
+
+### Prereqs/input
+
+1. Build Gunrock -- https://github.com/gunrock/gunrock/pull/805
+2. Make sure the following datafiles are available:
+  - feature file (edge weights / node vectors)
+  - split file (for each vertex, a value 0/1/2 to specify train/test/validation node)
+  - graph file (adjacency list format)
+  
+### Running the application
+
+<code>
+// building gunrock 
+// cd to build/bin folder
+
+./gcn --feature_file <featurefile> --graph_file <graphfile> --split_file <splitfile>
+
+// can decide a fixed number of training iterations 
+
+</code>
+
+Note: This run / these runs are faster on DARPA's DGX-1.
+
+### Output
+
+1. Relevant data is printed per epoch:
+
+- Training loss/acc
+- Validation loss/acc
+- Time taken
+
+2. Output after training:
+
+- test loss/acc
+- time taken by various operators
+
+#### To extract weights:
+
+How do you make sure your output is correct/meaningful? (What are you comparing against?)
+
+The operators have tests that are verified internally and the backpropagation has been verified using python scripts with the theoretical implementation
+
+## Performance and Analysis
+
+### runtime
+### metrics
+
+### Implementation limitations
+
+e.g.:
+
+- Size of dataset that fits into GPU memory (what is the specific limitation?)
+- Restrictions on the type/nature of the dataset
+
+### Comparison against existing implementations
+
+- Reference implementation (python? Matlab?)
+- OpenMP reference
+
+Comparison is both performance and accuracy/quality.
+
+
+
+### Performance limitations
+
+e.g., random memory access?
+
+## Next Steps
+
+### Alternate approaches
+
+If you had an infinite amount of time, is there another way (algorithm/approach) we should consider to implement this?
+
+### Gunrock implications
+
+What did we learn about Gunrock? What is hard to use, or slow? What potential Gunrock features would have been helpful in implementing this workflow?
+
+### Notes on multi-GPU parallelization
+
+What will be the challenges in parallelizing this to multiple GPUs on the same node?
+
+Can the dataset be effectively divided across multiple GPUs, or must it be replicated?
+
+### Notes on dynamic graphs
+
+(Only if appropriate)
+
+Does this workload have a dynamic-graph component? If so, what are the implications of that? How would your implementation change? What support would Gunrock need to add?
+
+### Notes on larger datasets
+
+What if the dataset was larger than can fit into GPU memory or the aggregate GPU memory of multiple GPUs on a node? What implications would that have on performance? What support would Gunrock need to add?
+
+### Notes on other pieces of this workload
+
+Briefly: What are the important other (non-graph) pieces of this workload? Any thoughts on how we might implement them / what existing approaches/libraries might implement them?

From 23cb27e521c8eabac0abe15894d62263693830f4 Mon Sep 17 00:00:00 2001
From: Achal Agarwal <achalagarwal.01@gmail.com>
Date: Thu, 8 Oct 2020 10:06:15 -0700
Subject: [PATCH 02/12] Update hive_gcn.md

committing  to prevent accidental loss of text
---
 docs/hive/hive_gcn.md | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)
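
Note on step 2 (edge dropout): a minimal CPU-side sketch of the masking this patch describes, which may be handy as a reference when checking the GPU operator. It is not the code in the linked `dropout.cuh`. The function name `edge_dropout`, the returned mask, and the rescaling of surviving values by 1 / (1 - p) ("inverted dropout") are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical CPU reference for edge dropout; not Gunrock API.
// Each value is zeroed with probability p. Survivors are rescaled by
// 1 / (1 - p), which is an assumption of this sketch.
// The keep-mask is returned so that a backward pass could reuse it.
std::vector<bool> edge_dropout(std::vector<float> &values, float p,
                               std::mt19937 &rng) {
  assert(p >= 0.0f && p < 1.0f);
  std::bernoulli_distribution drop(p);
  const float scale = 1.0f / (1.0f - p);
  std::vector<bool> kept(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    kept[i] = !drop(rng);
    values[i] = kept[i] ? values[i] * scale : 0.0f;
  }
  return kept;
}
```

Returning the mask is a design choice: backpropagation through dropout only needs to know which entries survived, so the forward pass can hand that information over directly.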

diff --git a/docs/hive/hive_gcn.md b/docs/hive/hive_gcn.md
index bab3609afb..88a7078a11 100644
--- a/docs/hive/hive_gcn.md
+++ b/docs/hive/hive_gcn.md
@@ -21,12 +21,26 @@ The implementation has been hugely guided by
 
 The GCN algorithm can be mapped into the following steps:
 
-1. 
-2. 
-3.
-4.
-5.
-6.
+1. Initialization
+-   Data Reading/Parsing
+-   Parameter Initialization
+-   [Random weight initialisation](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/gcn_problem.cuh#L225) W<sub>1</sub> and W<sub>2</sub>
+
+<sup><sub>__Forward Propagation__</sub></sup>
+
+2. Edge Dropout
+  - [With probability `p`, mask (disable) an edge value](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/dropout/dropout.cuh#L53)
+3. Edge Weight Sparse Multiplication
+  - Multiplication of edge values with trainable weights
+4. Graph Sum
+5. ReLU
+6. Dropout
+7. Matrix Multiplication
+8. Graph Sum
+9. Cross Entropy Loss
+
+<sup><sub>__Backward Propagation__</sub></sup>
+
 
 The description of the lower level operators used to implement some of the steps described above:
 

From 279e78092c23e6801e7d35cf0d71f50a7df3201c Mon Sep 17 00:00:00 2001
From: Achal Agarwal <achalagarwal.01@gmail.com>
Date: Thu, 8 Oct 2020 11:33:31 -0700
Subject: [PATCH 03/12] Update hive_gcn.md

temporary checkpoint
---
 docs/hive/hive_gcn.md | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)
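
Note on steps 4, 5 and 8 (graph neighbor sum and ReLU): a serial CPU sketch of the aggregation this patch documents, intended only as a reference to compare GPU output against. The `Csr` struct and the names `graph_sum` and `relu` are hypothetical and are not Gunrock types. Any adjacency normalization is left out, since the text only describes summing neighbor vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container for this sketch; not Gunrock's graph type.
struct Csr {
  std::vector<int> row_offsets;     // size: num_vertices + 1
  std::vector<int> column_indices;  // size: num_edges
};

// out[v] = sum of in[u] over the neighbors u of v.
// `in` and the result are row-major, num_vertices x dim.
std::vector<float> graph_sum(const Csr &g, const std::vector<float> &in,
                             int dim) {
  const int n = static_cast<int>(g.row_offsets.size()) - 1;
  assert(in.size() ==
         static_cast<std::size_t>(n) * static_cast<std::size_t>(dim));
  std::vector<float> out(in.size(), 0.0f);
  for (int v = 0; v < n; ++v) {
    for (int e = g.row_offsets[v]; e < g.row_offsets[v + 1]; ++e) {
      const int u = g.column_indices[e];
      for (int i = 0; i < dim; ++i) out[v * dim + i] += in[u * dim + i];
    }
  }
  return out;
}

// ReLU applied in place: negative entries become zero.
void relu(std::vector<float> &m) {
  for (float &x : m) x = x > 0.0f ? x : 0.0f;
}
```

Because each output row is written by one loop iteration, the serial version needs no atomics, which makes it a convenient baseline for an atomics-based GPU kernel.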

diff --git a/docs/hive/hive_gcn.md b/docs/hive/hive_gcn.md
index 88a7078a11..c8677e213c 100644
--- a/docs/hive/hive_gcn.md
+++ b/docs/hive/hive_gcn.md
@@ -24,23 +24,44 @@ The GCN algorithm can be mapped into the following steps:
 1. Initialization
 -   Data Reading/Parsing
 -   Parameter Initialization
--   [Random weight initialisation](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/gcn_problem.cuh#L225) W<sub>1</sub> and W<sub>2</sub>
+-   [Random initialization of weight matrices](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/gcn_problem.cuh#L225) W<sub>0</sub> and W<sub>1</sub>
 
 <sup><sub>__Forward Propagation__</sub></sup>
 
 2. Edge Dropout
   - [With probability `p`, mask (disable) an edge value](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/dropout/dropout.cuh#L53)
+  - Results in new edge values
 3. Edge Weight Sparse Multiplication
-  - Multiplication of edge values with trainable weights
-4. Graph Sum
+  - [Multiplication of edge values with trainable weights](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/sparseMatMul/sparseMatMul_enactor.cuh#L89)
+  - Summing the result from the multiplication to result in the XW<sub>0</sub> matrix
+4. Graph Neighbor Sum 
+  - [Summing the neighbour vectors for each vertex](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/graphsum/graphsum_enactor.cuh#L99)
+  - Results in the AXW<sub>0</sub> matrix
 5. ReLU
+  - on the AXW<sub>0</sub> matrix
 6. Dropout
-7. Matrix Multiplication
-8. Graph Sum
+  - on the rectified AXW<sub>0</sub> matrix
+7. Multiplication of W<sub>1</sub> weight matrix
+  - results in AXW<sub>0</sub>W<sub>1</sub>
+8. Repeat Graph Neighbor Sum 
+  - [Summing the neighbour vectors for each vertex](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/graphsum/graphsum_enactor.cuh#L99)
+  - Results in the AAXW<sub>0</sub>W<sub>1</sub> matrix
 9. Cross Entropy Loss
-
+  - Compute training loss
+  - Backprop with the loss value obtained
+  
 <sup><sub>__Backward Propagation__</sub></sup>
 
+8. `backprop` 
+7. `backprop` 
+6. `backprop` 
+5. `backprop` 
+4. `backprop` 
+3. `backprop` 
+2. `backprop` 
+1. `backprop` 
+* 7B.
+* 6B.
 
 The description of the lower level operators used to implement some of the steps described above:
 

From b6cbfba7870972aa4f5e58174df2c23d8cf6413c Mon Sep 17 00:00:00 2001
From: Achal Agarwal <achalagarwal.01@gmail.com>
Date: Thu, 8 Oct 2020 12:19:18 -0700
Subject: [PATCH 04/12] Update hive_gcn.md

checkpoint
---
 docs/hive/hive_gcn.md | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)
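
Note on step 9 (cross entropy loss): a per-vertex sketch of a softmax cross-entropy loss together with its gradient with respect to the logits, which is the quantity backpropagation starts from. Pairing the loss with a softmax, the name `cross_entropy`, and leaving the averaging over training vertices to the caller are assumptions of this sketch, not statements about the Gunrock operator.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: softmax cross-entropy for one vertex.
// Returns the loss and writes d(loss)/d(logits) into `grad`.
float cross_entropy(const std::vector<float> &logits, int label,
                    std::vector<float> &grad) {
  assert(!logits.empty());
  assert(label >= 0 && static_cast<std::size_t>(label) < logits.size());
  // Subtract the maximum logit before exponentiating, for stability.
  float max_logit = logits[0];
  for (float z : logits) max_logit = std::max(max_logit, z);
  float sum = 0.0f;
  for (float z : logits) sum += std::exp(z - max_logit);
  grad.resize(logits.size());
  for (std::size_t c = 0; c < logits.size(); ++c) {
    const float prob = std::exp(logits[c] - max_logit) / sum;
    // Gradient of softmax cross-entropy: probability minus one-hot label.
    grad[c] = prob - (static_cast<int>(c) == label ? 1.0f : 0.0f);
  }
  return std::log(sum) - (logits[label] - max_logit);
}
```

With four equal logits the loss is log(4) and the gradient entries sum to zero, which gives a quick sanity check.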

diff --git a/docs/hive/hive_gcn.md b/docs/hive/hive_gcn.md
index c8677e213c..24af9d68b2 100644
--- a/docs/hive/hive_gcn.md
+++ b/docs/hive/hive_gcn.md
@@ -47,21 +47,25 @@ The GCN algorithm can be mapped into the following steps:
   - [Summing the neighbour vectors for each vertex](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/graphsum/graphsum_enactor.cuh#L99)
   - Results in the AAXW<sub>0</sub>W<sub>1</sub> matrix
 9. Cross Entropy Loss
-  - Compute training loss
-  - Backprop with the loss value obtained
+  - Compute training loss and likewise gradients of AAXW<sub>0</sub>W<sub>1</sub> matrix
+  - Start Backprop
   
 <sup><sub>__Backward Propagation__</sub></sup>
 
-8. `backprop` 
-7. `backprop` 
-6. `backprop` 
-5. `backprop` 
-4. `backprop` 
-3. `backprop` 
-2. `backprop` 
-1. `backprop` 
-* 7B.
-* 6B.
+10. `backprop 8.` 
+  - Results in the gradients of AXW<sub>0</sub>W<sub>1</sub> matrix
+2. `backprop 7.` 
+  - Compute the gradients of W<sub>1</sub> matrix and stores it to update the W<sub>1</sub> weight matrix later
+  - Results in the gradients of AXW<sub>0</sub> matrix
+3. `backprop 6.` 
+  - Results in the updated gradients of AXW<sub>0</sub> matrix
+4. `backprop 5.` 
+  - Results in the updated gradients of AXW<sub>0</sub> matrix
+5. `backprop 4.` 
+6. `backprop 3.` 
+7. `backprop 2.` 
+8. `backprop 1.` 
+
 
 The description of the lower level operators used to implement some of the steps described above:
 

From 6aab1e4933bcce0e4f2b909351c0d20698308f7b Mon Sep 17 00:00:00 2001
From: Achal Agarwal <achalagarwal.01@gmail.com>
Date: Thu, 8 Oct 2020 13:38:34 -0700
Subject: [PATCH 05/12] Update hive_gcn.md

checkpoint
---
 docs/hive/hive_gcn.md | 85 +++++++++++++++++++++++++++++++------------
 1 file changed, 62 insertions(+), 23 deletions(-)
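
Note on steps 11, 15 and 16 (weight gradients and weight update): a CPU sketch of how a weight-matrix gradient can be formed from a layer input and the gradient of its output, followed by an element-wise update in the spirit of the `ForEach` pattern. The names `weight_gradient` and `sgd_update` are hypothetical. Plain gradient descent is an assumption here, since the document does not say which optimizer is used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper. For out = h * w with h (n x k) and w (k x c),
// the gradient of w is h^T * grad_out, of shape k x c. Row-major storage.
std::vector<float> weight_gradient(const std::vector<float> &h,
                                   const std::vector<float> &grad_out,
                                   int n, int k, int c) {
  assert(h.size() ==
         static_cast<std::size_t>(n) * static_cast<std::size_t>(k));
  assert(grad_out.size() ==
         static_cast<std::size_t>(n) * static_cast<std::size_t>(c));
  std::vector<float> grad_w(
      static_cast<std::size_t>(k) * static_cast<std::size_t>(c), 0.0f);
  for (int v = 0; v < n; ++v)
    for (int i = 0; i < k; ++i)
      for (int j = 0; j < c; ++j)
        grad_w[i * c + j] += h[v * k + i] * grad_out[v * c + j];
  return grad_w;
}

// Element-wise update, assuming plain gradient descent (an assumption).
void sgd_update(std::vector<float> &w, const std::vector<float> &grad_w,
                float learning_rate) {
  assert(w.size() == grad_w.size());
  for (std::size_t i = 0; i < w.size(); ++i)
    w[i] -= learning_rate * grad_w[i];
}
```

Storing the gradient first and applying it in a separate pass mirrors the text, which computes the gradients during backpropagation and updates both weight matrices afterwards.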

diff --git a/docs/hive/hive_gcn.md b/docs/hive/hive_gcn.md
index 24af9d68b2..a1469e585e 100644
--- a/docs/hive/hive_gcn.md
+++ b/docs/hive/hive_gcn.md
@@ -19,13 +19,15 @@ The implementation has been hugely guided by
 - https://arxiv.org/abs/1609.02907
 
 
-The GCN algorithm can be mapped into the following steps:
+### The GCN algorithm can be mapped into the following steps:
 
 1. Initialization
 -   Data Reading/Parsing
 -   Parameter Initialization
 -   [Random initialization of weight matrices](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/gcn_problem.cuh#L225) W<sub>0</sub> and W<sub>1</sub>
 
+[comment]: <> (Forward propagation and Backpropagation are explained for a single epoch, the same process is iterated for --num_iterations)
+
 <sup><sub>__Forward Propagation__</sub></sup>
 
 2. Edge Dropout
@@ -54,26 +56,59 @@ The GCN algorithm can be mapped into the following steps:
 
 10. `backprop 8.` 
   - Results in the gradients of AXW<sub>0</sub>W<sub>1</sub> matrix
-2. `backprop 7.` 
+11. `backprop 7.` 
   - Compute the gradients of W<sub>1</sub> matrix and stores it to update the W<sub>1</sub> weight matrix later
   - Results in the gradients of AXW<sub>0</sub> matrix
-3. `backprop 6.` 
+12. `backprop 6.` 
   - Results in the updated gradients of AXW<sub>0</sub> matrix
-4. `backprop 5.` 
+13. `backprop 5.` 
   - Results in the updated gradients of AXW<sub>0</sub> matrix
-5. `backprop 4.` 
-6. `backprop 3.` 
-7. `backprop 2.` 
-8. `backprop 1.` 
-
-
-The description of the lower level operators used to implement some of the steps described above:
-
-1. 
-
-2. 
-
-3.
+14. `backprop 4.` 
+  - Results in the updated gradients of XW<sub>0</sub> matrix
+15. `backprop 3.`
+  - Compute the gradients of W<sub>0</sub> matrix and stores it to update the W<sub>0</sub> weight matrix later
+[16.]: <> (backprop 2 does not exist as that computation does not involve any trainable weight)
+16. Update weight matrices W<sub>0</sub> and W<sub>1</sub>
+  - Use the new weight matrices in the next epoch (iteration)
+
+<sup><sub>__End of training__</sub></sup>
+
+17. Export the trained weight matrices along with the loss/accuracy/runtime metrics
+
+### The description of the lower level operators used to implement some of the steps described above:
+
+The 17 steps above share computation patterns
+
+1. To update an Array1D, the `ForEach` op has been used
+
+```CUDA
+ GUARD_CU (arr.ForEach (
+          [params]__host__ __device__(ValueT &x) {
+            x = update(x, params);
+          }
+      ))
+```
+
+2. To multiply dense matrix (X: nodes * features) with dense matrix (W<sub>0</sub>: features * dimension_0):
+
+```CUDA
+auto denseMM =
+        [X, output, dimension_0, W0] __host__ __device__(
+            const VertexT &src, VertexT &dest, const SizeT &edge_id,
+            const VertexT &input_item, const SizeT &input_pos,
+            SizeT &output_pos) -> bool {
+      for (int i = 0; i < dimension_0; i++) {
+        atomicAdd(output + src * dimension_0 + i, W0[edge_id] * X[dest * dimension_0 + i]);
+      }
+      return true;
+    };
+   
+GUARD_CU(oprtr::Advance<oprtr::OprtrType_V2V> (
+            graph.csr (), &local_vertices, null_ptr, oprtr_parameters,
+            denseMM));
+```
+
+3. 
 
 What was implemented with respect to the entire workflow?
 
@@ -91,15 +126,19 @@ What was implemented with respect to the entire workflow?
   
 ### Running the application
 
-<code>
-// building gunrock 
-// cd to build/bin folder
+<!-- <code> -->
+```bash
+# build gunrock 
 
-./gcn --feature_file <featurefile> --graph_file <graphfile> --split_file <splitfile>
+# cd to bin folder
+cd ./build/bin
 
-// can decide a fixed number of training iterations 
+# run gcn binary
+./gcn --feature_file <featurefile> --graph_file <graphfile> --split_file <splitfile>
 
-</code>
+# vary parameters like: Number of training iterations, silent run, etc. 
+```
+<!-- </code> -->
 
 Note: This run / these runs are faster on DARPA's DGX-1.
 

From bd34d93d0462cab0836fc3f2e77c31afa4558596 Mon Sep 17 00:00:00 2001
From: Achal Agarwal <achalagarwal.01@gmail.com>
Date: Thu, 8 Oct 2020 16:34:06 -0700
Subject: [PATCH 06/12] Update hive_gcn.md

checkpoint
---
 docs/hive/hive_gcn.md | 81 +++++++++++++++++++++++++++++++------------
 1 file changed, 59 insertions(+), 22 deletions(-)
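
Note on verifying backpropagation: the document reports that the hand-written gradients were checked with autodiff in Python. The sketch below swaps in a different, simpler technique, a central finite-difference check, which can be run next to the C++ code without an autodiff library. The helper name `max_gradient_error` and the step size are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: largest absolute difference between an analytic
// gradient and a central finite-difference estimate of df/dx.
double max_gradient_error(
    const std::function<double(const std::vector<double> &)> &f,
    std::vector<double> x, const std::vector<double> &analytic_grad,
    double eps = 1e-5) {
  assert(x.size() == analytic_grad.size());
  double worst = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double saved = x[i];
    x[i] = saved + eps;
    const double up = f(x);
    x[i] = saved - eps;
    const double down = f(x);
    x[i] = saved;  // restore before moving to the next coordinate
    const double numeric = (up - down) / (2.0 * eps);
    worst = std::max(worst, std::fabs(numeric - analytic_grad[i]));
  }
  return worst;
}
```

A small returned value suggests the analytic gradient matches the numeric estimate; a large value points at the coordinate-wise derivative that needs another look.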

diff --git a/docs/hive/hive_gcn.md b/docs/hive/hive_gcn.md
index a1469e585e..429bc78069 100644
--- a/docs/hive/hive_gcn.md
+++ b/docs/hive/hive_gcn.md
@@ -33,10 +33,10 @@ The implementation has been hugely guided by
 2. Edge Dropout
   - [With probability `p`, mask (disable) an edge value](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/dropout/dropout.cuh#L53)
   - Results in new edge values
-3. Edge Weight Sparse Multiplication
+3. Edge Weight Sparse Multiplication (Neighborhood Gather)
   - [Multiplication of edge values with trainable weights](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/sparseMatMul/sparseMatMul_enactor.cuh#L89)
-  - Summing the result from the multiplication to result in the XW<sub>0</sub> matrix
-4. Graph Neighbor Sum 
+  - Result in the XW<sub>0</sub> matrix
+4. Graph Neighbor Sum (Aggregation)
   - [Summing the neighbour vectors for each vertex](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/graphsum/graphsum_enactor.cuh#L99)
   - Results in the AXW<sub>0</sub> matrix
 5. ReLU
@@ -54,18 +54,18 @@ The implementation has been hugely guided by
   
 <sup><sub>__Backward Propagation__</sub></sup>
 
-10. `backprop 8.` 
+10. backprop for 8.
   - Results in the gradients of AXW<sub>0</sub>W<sub>1</sub> matrix
-11. `backprop 7.` 
+11. backprop for 7. 
   - Compute the gradients of W<sub>1</sub> matrix and stores it to update the W<sub>1</sub> weight matrix later
   - Results in the gradients of AXW<sub>0</sub> matrix
-12. `backprop 6.` 
+12. backprop for 6. 
   - Results in the updated gradients of AXW<sub>0</sub> matrix
-13. `backprop 5.` 
+13. backprop for 5. 
   - Results in the updated gradients of AXW<sub>0</sub> matrix
-14. `backprop 4.` 
+14. backprop for 4. 
   - Results in the updated gradients of XW<sub>0</sub> matrix
-15. `backprop 3.`
+15. backprop for 3.
   - Compute the gradients of W<sub>0</sub> matrix and stores it to update the W<sub>0</sub> weight matrix later
 [16.]: <> (backprop 2 does not exist as that computation does not involve any trainable weight)
 16. Update weight matrices W<sub>0</sub> and W<sub>1</sub>
@@ -89,28 +89,57 @@ The 17 steps above share computation patterns
       ))
 ```
 
-2. To multiply dense matrix (X: nodes * features) with dense matrix (W<sub>0</sub>: features * dimension_0):
+2. Neighborhood Gather / Scatter
+Gather neighborhood features (X) for all vertices after multiplication with the edge weights (W)
 
 ```CUDA
-auto denseMM =
-        [X, output, dimension_0, W0] __host__ __device__(
+    auto denseMM =
+        [X, output, C, W] __host__ __device__(
             const VertexT &src, VertexT &dest, const SizeT &edge_id,
             const VertexT &input_item, const SizeT &input_pos,
             SizeT &output_pos) -> bool {
-      for (int i = 0; i < dimension_0; i++) {
-        atomicAdd(output + src * dimension_0 + i, W0[edge_id] * X[dest * dimension_0 + i]);
+      for (int i = 0; i < C; i++) {
+        atomicAdd(output + src * C + i, W[edge_id] * X[dest * C + i]);
       }
       return true;
     };
    
-GUARD_CU(oprtr::Advance<oprtr::OprtrType_V2V> (
+    GUARD_CU(oprtr::Advance<oprtr::OprtrType_V2V> (
             graph.csr (), &local_vertices, null_ptr, oprtr_parameters,
             denseMM));
 ```
 
-3. 
+3. To aggregate feature matrix M (shape: YxZ) along the adjacency list
 
-What was implemented with respect to the entire workflow?
+```CUDA
+    auto sumNeighbors =
+        [M, output, Z] __host__ __device__(
+            const VertexT &src, VertexT &dest, const SizeT &edge_id,
+            const VertexT &input_item, const SizeT &input_pos,
+            SizeT &output_pos) -> bool {
+     
+      for (int i = 0; i < Z; i++)
+        atomicAdd(output + src * Z + i, *(M + dest * Z + i));
+      return true;
+    };
+    
+    GUARD_CU(oprtr::Advance<oprtr::OprtrType_V2V> (
+            graph.csr (), &local_vertices, null_ptr, oprtr_parameters,
+            sumNeighbors));
+    
+```
+
+
+### What was implemented with respect to the entire workflow?
+
+The following modules were implemented:
+
+- Activation Function (ReLU)
+- Dropout
+- Graph Sum
+- Loss Function (Cross Entropy)
+- SpMM Multiplication
+- Manual differentiation for all the operators
 
 
 ## How To Run This Application on DARPA's DGX-1
@@ -157,9 +186,13 @@ Note: This run / these runs are faster on DARPA's DGX-1.
 
 #### To extract weights:
 
-How do you make sure your output is correct/meaningful? (What are you comparing against?)
+Uncomment [call to `Extract()` function](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/gcn_app.cu#L119) which provides both W<sub>0</sub> and W<sub>1</sub> trained matrices
 
-The operators have tests that are verified internally and the backpropagation has been verified using python scripts with the theoretical implementation
+#### How do you make sure your output is correct/meaningful? (What are you comparing against?)
+
+- Some of the operators have unit tests
+- Backpropagation has been verified using autodiff in Python
+- Manual verification of the overall algorithm against the theoretical reference
 
 ## Performance and Analysis
 
@@ -168,6 +201,10 @@ The operators have tests that are verified internally and the backpropagation ha
 
 ### Implementation limitations
 
+- No provision for using local weight matrices (to checkpoint training on disk)
+- No provision for Learning Rate modulation
+- No provision for hyperparameter grid search
+
 e.g.:
 
 - Size of dataset that fits into GPU memory (what is the specific limitation?)
@@ -175,12 +212,12 @@ e.g.:
 
 ### Comparison against existing implementations
 
-- Reference implementation (python? Matlab?)
-- OpenMP reference
+- Reference implementation (python: https://github.com/Tiiiger/SGC + https://github.com/zhouchunpong/Simplifying-Graph-Convolutional-Networks)
 
-Comparison is both performance and accuracy/quality.
 
+Comparison is both performance and accuracy/quality.
 
+- 
 
 ### Performance limitations
 

From 1e0718febb4965de5122d0ada3471eb6edc1c8bb Mon Sep 17 00:00:00 2001
From: Achal Agarwal <achalagarwal.01@gmail.com>
Date: Thu, 8 Oct 2020 19:29:08 -0700
Subject: [PATCH 07/12] Update hive_gcn.md

checkpoint
---
 docs/hive/hive_gcn.md | 53 +++++++++++++++++++++++++++++++++++++------
 1 file changed, 46 insertions(+), 7 deletions(-)

diff --git a/docs/hive/hive_gcn.md b/docs/hive/hive_gcn.md
index 429bc78069..0c9bd98989 100644
--- a/docs/hive/hive_gcn.md
+++ b/docs/hive/hive_gcn.md
@@ -196,6 +196,8 @@ Uncomment [call to `Extract()` function](https://github.com/achalagarwal/gunrock
 
 ## Performance and Analysis
 
+Latest runs will be carried out (on V100) and results will be updated (TODO)
+
 ### runtime
 ### metrics
 
@@ -214,8 +216,8 @@ e.g.:
 
 - Reference implementation (python: https://github.com/Tiiiger/SGC + https://github.com/zhouchunpong/Simplifying-Graph-Convolutional-Networks)
 
+The accuracy should not be affected; a performance benchmark will be added (TODO)
 
-Comparison is both performance and accuracy/quality.
 
 - 
 
@@ -225,25 +227,62 @@ e.g., random memory access?
 
 ## Next Steps
 
-### Alternate approaches
+### Alternate/Next approaches
+
+1. Auto backpropagation
+
+Integrate autodiff in Gunrock so that a developer does not need to write backpropagation code manually. One candidate library: https://github.com/mitsuba-renderer/enoki
+
+2. Use optimised Sparse Matrix multiplication and aggregations
+
+Currently we use plain atomic additions for all such operations; shifting to better-optimised algorithms should make training faster.
+
+3. Providing a queue of graphs to read from disk (depends on application)
+
+To better leverage the speed that Gunrock provides for training GNNs, we should batch graph transfers from CPU to GPU so that I/O time is minimized.
+
+4. Move to better GNN architectures
+
+GNN research has led to many GNN architectures that perform better on certain datasets/tasks. A next step is to provide a canonical set of operators that supports multiple GNN architectures.
 
-If you had an infinite amount of time, is there another way (algorithm/approach) we should consider to implement this?
 
 ### Gunrock implications
 
-What did we learn about Gunrock? What is hard to use, or slow? What potential Gunrock features would have been helpful in implementing this workflow?
+> What did we learn about Gunrock? What is hard to use, or slow? What potential Gunrock features would have been helpful in implementing this workflow?
+
+1. As a GNN framework, Gunrock would not be for the masses; it is better suited as a library behind a Python interface, so that users can iterate quickly on their code.
+
+2. Gunrock has optimised traversal and frontier operators that make certain operations faster. However, optimised implementations of dense and sparse matrix multiplication, support for 2D arrays, and easier integration between apps would make it faster for users to develop their architectures.
+
 
 ### Notes on multi-GPU parallelization
 
-What will be the challenges in parallelizing this to multiple GPUs on the same node?
+> What will be the challenges in parallelizing this to multiple GPUs on the same node?
+
+1. Effectively dividing across multiple GPUs
+
+- Model Parallelization
+
+This is straightforward: a glue module can receive data from all the split components and complete the pipeline.
 
-Can the dataset be effectively divided across multiple GPUs, or must it be replicated?
+- Large Graph (Monolithic model)
+
+There is existing work on graph partitioning specifically for GNN training that can be leveraged.
+Secondly, all the provided operators need to support multi gpu mode and that will be a substantial amount of work.
 
 ### Notes on dynamic graphs
 
 (Only if appropriate)
 
-Does this workload have a dynamic-graph component? If so, what are the implications of that? How would your implementation change? What support would Gunrock need to add?
+> Does this workload have a dynamic-graph component?
+Not currently but it could benefit from support for dynamic graphs (Pooling operators)
+
+> If so, what are the implications of that? 
+Pooling operators have been shown to help increase the quality as well as the performance of training
+
+> How would your implementation change? What support would Gunrock need to add?
+Gunrock needs to provide support for Union-Find on graphs, edge contraction etc. 
+
 
 ### Notes on larger datasets
 

From 9b9a4ce742f076a160ef3c5c54cfd8e4fbb1ba04 Mon Sep 17 00:00:00 2001
From: Achal Agarwal <achalagarwal.01@gmail.com>
Date: Thu, 8 Oct 2020 19:45:00 -0700
Subject: [PATCH 08/12] Update hive_gcn.md

---
 docs/hive/hive_gcn.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/hive/hive_gcn.md b/docs/hive/hive_gcn.md
index 0c9bd98989..047919a93e 100644
--- a/docs/hive/hive_gcn.md
+++ b/docs/hive/hive_gcn.md
@@ -35,7 +35,7 @@ The implementation has been hugely guided by
   - Results in new edge values
 3. Edge Weight Sparse Multiplication (Neighborhood Gather)
   - [Multiplication of edge values with trainable weights](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/sparseMatMul/sparseMatMul_enactor.cuh#L89)
-  - Result in the XW<sub>0</sub> matrix
+  - Results in the XW<sub>0</sub> matrix
 4. Graph Neighbor Sum (Aggregation)
   - [Summing the neighbour vectors for each vertex](https://github.com/achalagarwal/gunrock/blob/d0202e3bbb88560bc97666675c0a94aa9e491c9c/gunrock/app/GuNNrock/graphsum/graphsum_enactor.cuh#L99)
   - Results in the AXW<sub>0</sub> matrix
@@ -140,7 +140,7 @@ The various modules viz.:
 - Loss Function (Cross Entropy)
 - SpMM Multiplication
 - Manual differentiation for all the operators
-
+- ...
 
 ## How To Run This Application on DARPA's DGX-1
 
@@ -274,13 +274,13 @@ Secondly, all the provided operators need to support multi gpu mode and that wil
 
 (Only if appropriate)
 
-> Does this workload have a dynamic-graph component?
+> Does this workload have a dynamic-graph component? </br>
 Not currently but it could benefit from support for dynamic graphs (Pooling operators)
 
-> If so, what are the implications of that? 
+> If so, what are the implications of that? </br>
 Pooling operators have been shown to help increase the quality as well as the performance of training
 
-> How would your implementation change? What support would Gunrock need to add?
+> How would your implementation change? What support would Gunrock need to add? </br>
 Gunrock needs to provide support for Union-Find on graphs, edge contraction etc. 
 
 

From ee9df64e699e328d0984c55500b5134764b07ece Mon Sep 17 00:00:00 2001
From: Achal Agarwal <achalagarwal.01@gmail.com>
Date: Thu, 8 Oct 2020 21:33:11 -0700
Subject: [PATCH 09/12] Update hive_gcn.md

---
 docs/hive/hive_gcn.md | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/docs/hive/hive_gcn.md b/docs/hive/hive_gcn.md
index 047919a93e..5072c939a6 100644
--- a/docs/hive/hive_gcn.md
+++ b/docs/hive/hive_gcn.md
@@ -196,10 +196,26 @@ Uncomment [call to `Extract()` function](https://github.com/achalagarwal/gunrock
 
 ## Performance and Analysis
 
-Latest runs will be carried out (on V100) and results will be updated (TODO)
+```bash
+
+# citeseer dataset (default)
+time ./gcn --max_iter=1000
+
+real	0m6.419s
+user	0m5.263s
+sys	0m2.112s
+```
+
+*Average training time:* 5.73ms
+*Lowest training time:* 3.3ms
+*Highest training time:* 11.8ms
+
+**Modules (% time taken)**
+1. Graph Sum (Total: 45%, Forward: 30%, Backprop: 15%)
+2. Sparse Matrix Multiplication (Forward: 15.2%, Backprop: 8.2%)
+3. Cross Entropy Loss (15.6%)
+4. Matrix Multiplication (Forward: 10.2%, Backprop: 12.3%)
 
-### runtime
-### metrics
 
 ### Implementation limitations
 
@@ -284,6 +300,7 @@ Pooling operators have been shown to help increase the quality as well as the pe
 Gunrock needs to provide support for Union-Find on graphs, edge contraction etc. 
 
 
+
 ### Notes on larger datasets
 
 What if the dataset was larger than can fit into GPU memory or the aggregate GPU memory of multiple GPUs on a node? What implications would that have on performance? What support would Gunrock need to add?

From f4028c427e2f72f512f67acf8c9b14fb9fc60012 Mon Sep 17 00:00:00 2001
From: Achal Agarwal <achalagarwal.01@gmail.com>
Date: Thu, 8 Oct 2020 21:35:43 -0700
Subject: [PATCH 10/12] Update hive_gcn.md

---
 docs/hive/hive_gcn.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/hive/hive_gcn.md b/docs/hive/hive_gcn.md
index 5072c939a6..1a69a68192 100644
--- a/docs/hive/hive_gcn.md
+++ b/docs/hive/hive_gcn.md
@@ -206,9 +206,9 @@ user	0m5.263s
 sys	0m2.112s
 ```
 
-*Average training time:* 5.73ms
-*Lowest training time:* 3.3ms
-*Highest training time:* 11.8ms
+Average training time: **5.73ms** </br>
+Lowest training time: **3.3ms** </br>
+Highest training time: **11.8ms** </br>
 
 **Modules (% time taken)**
 1. Graph Sum (Total: 45%, Forward: 30%, Backprop: 15%)

From 8dae2357f2778df9d737160d445b28bb97210272 Mon Sep 17 00:00:00 2001
From: Achal Agarwal <achalagarwal.01@gmail.com>
Date: Thu, 8 Oct 2020 21:36:45 -0700
Subject: [PATCH 11/12] Update hive_gcn.md

---
 docs/hive/hive_gcn.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/hive/hive_gcn.md b/docs/hive/hive_gcn.md
index 1a69a68192..85978912b1 100644
--- a/docs/hive/hive_gcn.md
+++ b/docs/hive/hive_gcn.md
@@ -212,9 +212,9 @@ Highest training time: **11.8ms** </br>
 
 **Modules (% time taken)**
 1. Graph Sum (Total: 45%, Forward: 30%, Backprop: 15%)
-2. Sparse Matrix Multiplication (Forward: 15.2%, Backprop: 8.2%)
+2. Sparse Matrix Multiplication (Total: 23.4%, Forward: 15.2%, Backprop: 8.2%)
 3. Cross Entropy Loss (15.6%)
-4. Matrix Multiplication (Forward: 10.2%, Backprop: 12.3%)
+4. Matrix Multiplication (Total: 22.5%, Forward: 10.2%, Backprop: 12.3%)
 
 
 ### Implementation limitations

From df23e0c93f526db906c7ae42ea00a593ca747883 Mon Sep 17 00:00:00 2001
From: Achal Agarwal <achalagarwal.01@gmail.com>
Date: Thu, 8 Oct 2020 21:46:18 -0700
Subject: [PATCH 12/12] Update hive_gcn.md

---
 docs/hive/hive_gcn.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/hive/hive_gcn.md b/docs/hive/hive_gcn.md
index 85978912b1..124f9d3b3e 100644
--- a/docs/hive/hive_gcn.md
+++ b/docs/hive/hive_gcn.md
@@ -196,6 +196,9 @@ Uncomment [call to `Extract()` function](https://github.com/achalagarwal/gunrock
 
 ## Performance and Analysis
 
+@JDO
+Will be adding results for more datasets. Are there any specific metrics that you think I should focus on? I could share the high-level metrics from Nsight.
+
 ```bash
 
 # citeseer dataset (default)