# Collaborative Filtering (aka. CF)

Collaborative Filtering (CF) is a technique widely used by **Recommendation Systems**. These systems model users' taste patterns based on the items they have rated, viewed or consumed, and suggest new items that match each user's preferences (personalization).

Collaborative Filtering is about relating two entities, items and users, by using the bipartite graph formed by their interactions (e.g. users rating items). Broadly speaking, there are two types of approaches to CF: **1) the Neighborhood Approach** and **2) Latent Factor Models**. The former directly computes the similarities between items or users and uses these similarities to make predictions for new items. The latter approach, the so-called model-based approach, characterizes users and items with latent factors and embeds both entities into the same latent factor space. These latent factors are then used to predict how a user would rate an unseen item.

Okapi provides a number of latent factor algorithms for CF; these are based on Alternating Least Squares (ALS), Stochastic Gradient Descent (SGD) and Singular Value Decomposition (SVD++). Okapi also includes state-of-the-art "learning to rank" technology (CLiMF) and will incorporate further CF methods that are a product of our research in the domain.

Try them out and leave us a comment. Or two.

### Alternating Least Squares (ALS)

ALS is a matrix factorization algorithm for CF. Users and items are characterized by latent factor vectors; the algorithm fits these vectors by solving a least-squares error model over the users, assuming the item factors are fixed, and over the items, assuming the user factors are fixed.

ALS takes as input a set of ratings from users on items and tries to minimize the error between the predicted and the known ratings by alternating between two steps:

- Fix the item factors and solve the least-squares problem for the user factors.
- Fix the user factors and solve the least-squares problem for the item factors.

The two steps are repeated until the training error converges.
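
To make the two steps concrete, here is a minimal single-machine NumPy sketch of the alternating least-squares updates; the variable names and the regularization parameter `lam` are illustrative, not the Okapi API:

```python
import numpy as np

def als_step(R, fixed, k, lam):
    """Solve the least-squares problem for one side (users or items),
    keeping the factors of the other side fixed."""
    solved = np.zeros((R.shape[0], k))
    for i in range(R.shape[0]):
        rated = R[i] > 0                      # positions with known ratings
        F = fixed[rated]                      # factors of the rated counterparts
        A = F.T @ F + lam * np.eye(k)         # regularized normal equations
        b = F.T @ R[i, rated]
        solved[i] = np.linalg.solve(A, b)
    return solved

# Toy user x item rating matrix; 0 marks an unknown rating.
R = np.array([[5., 3., 0.], [4., 0., 1.], [0., 2., 4.]])
k, lam = 2, 0.1
U, V = np.random.rand(3, k), np.random.rand(3, k)
for _ in range(10):
    U = als_step(R, V, k, lam)      # step 1: item factors fixed, solve for users
    V = als_step(R.T, U, k, lam)    # step 2: user factors fixed, solve for items
```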

#### Input format

The input to the ALS algorithm is a set of user-item ratings stored in text files, where each line has the following format:

`userID itemID rating `

#### Output Format

The output of ALS is a set of latent vectors for each user and item. Each line in the output has the following format:

`ID type vector`

where:

#### Configuration parameters

#### Running ALS

The basic ALS execution command line:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  -libjars $OKAPI_JAR \
  ml.grafos.okapi.cf.als.Als\$InitUsersComputation \
  -mc ml.grafos.okapi.cf.als.Als\$MasterCompute \
  -eif ml.grafos.okapi.cf.CfLongIdFloatTextInputFormat \
  -eip $INPUT_RATINGS \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS
```

The above command will run ALS with the default values for all the configuration parameters. You can change one or more parameters by appending Giraph *custom arguments* to the command. For example, to change the number of iterations, append the following to the command:

`-ca als.iterations=20`

#### WWW

Wikipedia - Least Squares Method

#### Publications

**Large-Scale Parallel Collaborative Filtering for the Netflix Prize**, Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan, International Conference on Algorithmic Aspects in Information and Management, Shanghai, China 2008.

#### Contributors

Maria | Dionysis |

### Stochastic Gradient Descent (SGD)

SGD is a general-purpose optimization algorithm, commonly used for training neural networks, that we use here for matrix factorization in CF.

In the context of recommendations, the algorithm takes as input a training set with user-item pairs and their corresponding known ratings. It outputs latent vectors for both users and items that can then be used to predict ratings between any pair of user and item.

SGD works by iteratively refining the user and item latent vectors to minimize the regularized squared error loss. The algorithm starts with user nodes randomly initializing their latent vectors and every user node sending its latent vector to the item nodes that have been rated by the user. Subsequently the item nodes update their own latent vectors and each item node sends its updated latent vector to the users that have rated the item. This process continues until a maximum number of iterations or until a target RMSE is reached.
We provide two versions of SGD, *basic SGD* and the *incremental SGD*. These apply different policies with respect to when a user or item node sends its updated latent vector.

**Basic SGD**

In the basic SGD, during an iteration user nodes always send updates to item nodes and vice versa.
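
As an illustration, the per-rating update in a basic SGD sweep looks roughly like the following single-machine sketch; the learning rate `gamma`, the regularizer `lam` and the function names are illustrative, not Okapi parameter names:

```python
import numpy as np

def sgd_update(u, v, rating, gamma=0.005, lam=0.01):
    """One stochastic step on the regularized squared error for a single rating.
    u and v are the latent vectors of the user and the item (updated in place)."""
    err = rating - u @ v                 # prediction error for this rating
    u_old = u.copy()
    u += gamma * (err * v - lam * u)     # gradient step on the user factors
    v += gamma * (err * u_old - lam * v) # gradient step on the item factors
    return err

u, v = np.random.rand(10), np.random.rand(10)
for _ in range(100):
    sgd_update(u, v, rating=4.0)
```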

**Incremental SGD**

In the incremental version of SGD, if a vertex value (i.e. a user or item latent vector) does not change by more than a threshold, then the vertex does not propagate any update to other vertices.

Our study has shown that this causes the algorithm to converge faster and that it is also cheaper in terms of total messages sent. This method, however, requires you to set the threshold manually, which takes some experimentation with different thresholds for your specific data set.

By default, the incremental mode is disabled. To enable it, set a threshold using the corresponding configuration parameter.

#### Input format

The input to the SGD algorithm is a set of user-item ratings stored in text files, where each line has the following format:

`userID itemID rating `

#### Output Format

The output of SGD is a set of latent vectors for each user and item. Each line in the output has the following format:

`ID type vector`

where:


#### Configuration parameters

#### Running

This is the basic SGD execution command line:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  -libjars $OKAPI_JAR \
  ml.grafos.okapi.cf.sgd.Sgd\$InitUsersComputation \
  -mc ml.grafos.okapi.cf.sgd.Sgd\$MasterCompute \
  -eif ml.grafos.okapi.cf.CfLongIdFloatTextInputFormat \
  -eip $INPUT_RATINGS \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS
```

The above command will run SGD with the default values for all the configuration parameters. You can change one or more parameters by appending Giraph *custom arguments* to the command. For example, to change the number of iterations, append the following to the command:

`-ca sgd.iterations=20`

#### WWW

Wikipedia - Stochastic Gradient Descent (SGD)

#### Publications

**Matrix Factorization Techniques for Recommender Systems**, Yehuda Koren, Robert Bell, Chris Volinsky, In IEEE Computer, Vol. 42, No. 8. August 2009, pp. 30-37.

#### Contributors

Maria | Dionysis |

### Singular Value Decomposition (SVD++)

Singular Value Decomposition is a matrix factorization technique for CF.

We have implemented the SVD++ variant of Singular Value Decomposition. Similarly to other techniques, SVD++ iteratively updates the latent vectors of users and items.

In SVD++, every node in the graph representation (user or item) maintains the typical factor vector but also a vector of weights of the same size. The impact on the Giraph implementation is that this algorithm requires more memory.
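
For reference, the SVD++ prediction rule from the cited Koren (2008) paper combines the user factor vector with the weight vectors of the items the user has rated; a minimal sketch with illustrative names:

```python
import numpy as np

def svdpp_predict(mu, b_u, b_i, p_u, q_i, rated_item_weights):
    """SVD++ rating prediction (Koren, KDD 2008): the user factors are augmented
    with the normalized sum of the weight vectors of the items the user rated."""
    implicit = np.sum(rated_item_weights, axis=0) / np.sqrt(len(rated_item_weights))
    return mu + b_u + b_i + q_i @ (p_u + implicit)

# Toy example with 2-dimensional factors and two rated items.
print(svdpp_predict(3.5, 0.1, -0.2,
                    np.array([0.3, 0.1]), np.array([0.2, 0.4]),
                    [np.array([0.05, 0.0]), np.array([0.1, 0.2])]))
```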

#### Input format

The input to the SVD++ algorithm is a set of user-item ratings stored in text files, where each line has the following format:

`userID itemID rating `

#### Output Format

The output of SVD++ is a set of latent vectors for each user and item. Each line in the output has the following format:

`ID type vector`

where:


#### Configuration parameters

#### Running

This is the basic SVD++ execution command line:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  -libjars $OKAPI_JAR \
  ml.grafos.okapi.cf.svd.Svdpp\$InitUsersComputation \
  -mc ml.grafos.okapi.cf.svd.Svdpp\$MasterCompute \
  -eif ml.grafos.okapi.cf.CfLongIdFloatTextInputFormat \
  -eip $INPUT_RATINGS \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS
```

The above command will run SVD++ with the default values for all the configuration parameters. You can change one or more parameters by appending Giraph *custom arguments* to the command. For example, to change the number of iterations, append the following to the command:

`-ca svd.iterations=20`

#### WWW

ACM RecSys Wiki - SVD++

#### Publications

**Factorization meets the neighborhood: a multifaceted collaborative filtering model**, Yehuda Koren, In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.

#### Contributors

Maria | Dionysis |

### Bayesian Personalized Ranking (BPR)

Bayesian Personalized Ranking from implicit feedback (BPR) optimizes the Area Under the Curve (AUC) in a Collaborative Filtering setting. The implementation in Okapi is a port of the MyMediaLite implementation to the Giraph framework.

The main difference between the original BPR and this implementation is that here the sampling is done per user, instead of randomly sampling (user, item, item) triplets.

This leads to two main differences: 1) each user is updated at each iteration, and 2) within an iteration an item can be updated more than once. In other words, item factors are updated in batches. For example, imagine that the original BPR samples the triplets (u1, i1, i2) and (u2, i1, i3). After the update for the first triplet, the values of i1 have changed, so the update for the second triplet already uses the new i1. In the Okapi implementation these updates are computed concurrently: both use the same i1 factors, and after the iteration i1 is updated twice with the corresponding deltas.

#### Algorithm implementation

The idea is to represent users and items as a bipartite graph. Each node holds its latent factors. Each edge holds a rating score (> 0), or 0 if the item was sampled as irrelevant. The algorithm proceeds as follows:

- Each user samples its relevant items and the same number of irrelevant items. To sample irrelevant items we simply create edges with value 0. The user then asks all these items to send it their factors.
- Item nodes do nothing but send their factors back to the requesting users.
- User nodes compute predictions for all relevant and irrelevant items and compute update vectors based on these predictions.
- Each user node updates itself with the computed gradient and sends the computed updates to the items. After sending the updates, it erases all edges to irrelevant items (those marked with 0).
- Items update themselves.
- Repeat from the first step for the configured number of iterations.

We apply an additional trick. We want the final model to be U*V, with the item and user biases folded into the factors. So if we have d=10, we make the U and V vectors of size d=11, with U[0]=1 and V[0] holding the item bias. In this way we can use the same evaluation framework for all the methods.
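
To make the update concrete, here is a minimal single-machine sketch of one BPR gradient step on a triplet (u, i, j), including the bias trick described above; the learning rate, regularizer and names are illustrative, not the Okapi parameters:

```python
import numpy as np

def bpr_update(U, V, u, i, j, gamma=0.05, lam=0.01):
    """One BPR gradient step on the triplet (u, i, j): item i should rank above item j.
    With the bias trick, U[:, 0] is fixed to 1 and V[:, 0] holds the item biases."""
    x_uij = U[u] @ V[i] - U[u] @ V[j]              # pairwise score difference
    sig = 1.0 / (1.0 + np.exp(x_uij))              # derivative of the log-sigmoid
    u_old = U[u].copy()
    U[u, 1:] += gamma * (sig * (V[i, 1:] - V[j, 1:]) - lam * U[u, 1:])  # keep U[u, 0] = 1
    V[i] += gamma * (sig * u_old - lam * V[i])
    V[j] += gamma * (-sig * u_old - lam * V[j])

d = 10
U = np.hstack([np.ones((5, 1)), np.random.rand(5, d)])  # U[:, 0] = 1
V = np.random.rand(7, d + 1)                            # V[:, 0] = item bias
bpr_update(U, V, u=0, i=2, j=4)
```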

Similarly to the previous algorithms, BPR is ready to go right after we set its input and output.

#### Input format

The input to the algorithm is a set of user-item ratings stored in a text file. Each line is shown in the following format: `userID itemID rating `

where:

#### Output Format

The output of BPR is a set of users and item values in the following format: `userID userValue`

where:

#### Configuration parameters

#### Running

The BPR execution command line is similar to the other methods; the parameters required to be specified are:

The above parameters form the following execution command line:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  -libjars $OKAPI_JAR \
  ml.grafos.okapi.cf.ranking.BPRRankingComputation \
  -eif ml.grafos.okapi.cf.CfLongIdFloatTextInputFormat \
  -eip $INPUT_RATINGS \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS
```

In the above command line, we do not set any configuration parameters, thus the default values are used. To change one or more configuration parameters you only have to add one line per parameter at the end of the command line. The new line should look like this:

`-ca parameter_name=new_value`

#### Publications

**BPR: Bayesian Personalized Ranking from Implicit Feedback**, S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, In UAI '09, pages 452–461, 2009.

#### Contributors

### CLiMF

Collaborative Less-is-More Filtering (CLiMF) tackles the problem of recommendation in scenarios with binary relevance data, where only a few (k) items are recommended to each user. In CLiMF the latent factors are learned by directly maximizing the Mean Reciprocal Rank (MRR), a well-known Information Retrieval metric for measuring the performance of top-k recommendations. Linear computational complexity is achieved by optimizing a lower bound of the smoothed reciprocal rank metric.
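
For reference, the Mean Reciprocal Rank that CLiMF targets can be computed over ranked recommendation lists as in this small sketch; the data layout and names are illustrative, and this shows the evaluation metric, not the CLiMF training objective:

```python
def mean_reciprocal_rank(ranked_items_per_user, relevant_items_per_user):
    """MRR: average, over users, of 1 / rank of the first relevant item (0 if none)."""
    total = 0.0
    for user, ranked in ranked_items_per_user.items():
        relevant = relevant_items_per_user.get(user, set())
        rr = 0.0
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_items_per_user)

print(mean_reciprocal_rank({1: [5, 3, 7]}, {1: {3}}))  # first relevant item at rank 2 -> 0.5
```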

#### Implementation of the algorithm:

The idea is to represent users and items as a bipartite graph. Each node holds its latent factors. Each edge holds a rating score (> 0), or 0 if the item was sampled as irrelevant. The algorithm proceeds as follows:

- Item nodes do nothing but send their factors back to the requesting user.
- User nodes compute predictions for all relevant and irrelevant items and compute update vectors based on these predictions.
- Each user node updates itself with the computed gradient and sends the computed updates to the items. After sending the updates, it erases all edges to irrelevant items (those marked with 0).
- Items update themselves.
- Repeat from the first step for the configured number of iterations.

We apply an additional trick. We want the final model to be U*V, with the item and user biases folded into the factors. So if we have d=10, we increase the size of the U and V vectors to d=11, with U[0]=1 and V[0] holding the item bias. In this way we can use the same evaluation framework for all the methods.

Similarly to the previous algorithms, CLiMF is ready to go right after we set its input and output.

#### Input format

The input to the algorithm is a set of user-item ratings stored in a text file. Each line is shown in the following format: `userID itemID rating `

where:

#### Output Format

The output of CLiMF is a set of users and item values in the following format: `userID userValue`

where:

#### Configuration parameters

#### Running

The CLiMF execution command line is similar to the other methods; the parameters required to be specified are:

The above parameters form the following execution command line:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  -libjars $OKAPI_JAR \
  ml.grafos.okapi.cf.ranking.CLiMFRankingComputation \
  -eif ml.grafos.okapi.cf.CfLongIdFloatTextInputFormat \
  -eip $INPUT_RATINGS \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS
```

In the above command line, we do not set any configuration parameters, thus the default values are used. To change one or more configuration parameters you only have to add one line per parameter at the end of the command line. The new line should look like this:

`-ca parameter_name=new_value`

#### Publications

**CLiMF: Learning to Maximize Reciprocal Rank with Collaborative Less-is-More Filtering**, Y. Shi, A. Karatzoglou, L. Baltrunas, M. Larson, N. Oliver, A. Hanjalic, ACM RecSys 2012.

#### Contributors

### TFMAP

TFMAP is a collaborative ranking method that optimizes Mean Average Precision (MAP) for top-N context-aware recommendation. More documentation is coming soon.

#### Publications

Shi, Yue, et al. "TFMAP: Optimizing MAP for top-n context-aware recommendation." Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2012.

#### Contributors

Linas | Alexandros |

### Popularity Ranking

Computes the popularity ranking based on the number of times users have rated or seen each item. The idea is to represent users and items as a bipartite graph and then count how many messages each item receives, which equals the number of users that rated the item.
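
On a single machine this boils down to counting ratings per item, as in this small sketch (the data layout is illustrative):

```python
from collections import Counter

ratings = [(1, 10, 5.0), (2, 10, 3.0), (2, 11, 4.0)]   # (userID, itemID, rating)
popularity = Counter(item for _, item, _ in ratings)   # messages received per item
print(popularity.most_common())                        # [(10, 2), (11, 1)]
```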

#### Publications

**BPR: Bayesian Personalized Ranking from Implicit Feedback**, S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, In UAI '09, pages 452–461, 2009.

#### Contributors

### Random

Adding more documentation soon...

#### Details

Adding more documentation soon...

#### How to Run

Adding more documentation soon...

#### WWW

Adding more documentation soon...

#### Contributors

Linas | Alexandros |

# Graph Analytics

Mining large-scale graphs has become important in several contexts, such as Online Social Network analysis, biological network analysis and Call Detail Record mining. Such graphs may consist of billions of vertices and edges, which makes the computation of even simple metrics challenging. Okapi provides a set of useful algorithms that include clustering, graph partitioning, fake-account detection in OSNs and others. We are constantly enriching this library with new algorithms.

Try them and drop us a line if you think there are more algorithms that would be useful to have. Feel free to contribute yourself too! Our goal is to grow this list to make a rich graph analytics toolkit.

### Adamic-Adar similarity

The Adamic-Adar similarity represents how similar two nodes of a graph are, based on their common neighbors. However, it extends the simple counting of common neighbors, by weighting the similarity value according to the nodes' degrees. In particular, it is computed as the sum of the log of the inverse degree of each common neighbor of the two nodes.

The negative of the computed value gives a distance measure, with values in the range [0, INF].

This implementation computes the Adamic-Adar index only for existing edges of the input graph and not for every pair of nodes.
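
A small single-machine sketch of the per-edge computation as described above, i.e. summing the log of the inverse degree over the common neighbors of the edge's endpoints (names are illustrative):

```python
import math

def adamic_adar(neighbors, u, v):
    """Adamic-Adar value of edge (u, v) as described above:
    sum of log(1 / degree(z)) over the common neighbors z of u and v."""
    common = neighbors[u] & neighbors[v]
    return sum(math.log(1.0 / len(neighbors[z])) for z in common)

neighbors = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(adamic_adar(neighbors, 1, 2))   # common neighbor 3 has degree 3 -> log(1/3)
```

Negating this value gives the distance measure mentioned above.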

#### Input format

The input format is an edge list, containing one edge per line, in the format `srcID dstID`.

#### Output Format

The output of this program is the edges of the graph with their Adamic-Adar value, one edge per line, in the form `srcID dstID value`.

#### Configuration parameters

#### Running

To run the exact Adamic-Adar computation:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  ml.grafos.okapi.graphs.AdamicAdar\$ComputeLogOfInverseDegree \
  -mc ml.grafos.okapi.graphs.AdamicAdar\$MasterCompute \
  -eif ml.grafos.okapi.io.formats.LongDoubleZerosTextEdgeInputFormat \
  -eip $INPUT_EDGES \
  -eof org.apache.giraph.io.formats.SrcIdDstIdEdgeValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS \
  -ca giraph.oneToAllMsgSending=true \
  -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges
```

To run the approximate Adamic-Adar computation:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  ml.grafos.okapi.graphs.AdamicAdar\$ComputeLogOfInverseDegree \
  -mc ml.grafos.okapi.graphs.AdamicAdar\$MasterCompute \
  -eif ml.grafos.okapi.io.formats.LongDoubleZerosTextEdgeInputFormat \
  -eip $INPUT_EDGES \
  -eof org.apache.giraph.io.formats.SrcIdDstIdEdgeValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS \
  -ca giraph.oneToAllMsgSending=true \
  -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges \
  -ca adamicadar.approximation.enabled=true
```

#### Publications

**Friends and neighbors on the web**, Lada A. Adamic and Eytan Adar. Social Networks, 25(3):211–230, July 2003.

#### Contributors

Vasia |

### B-matching

Given a weighted undirected graph with integer capacities b(v) assigned to each vertex v, the maximum b-matching problem is to select a subgraph of maximum weight such that the number of edges incident to each vertex in the subgraph does not exceed its capacity.

The sequential greedy algorithm for this problem provides a 1/2-approximation guarantee, and this implementation is a parallel version of it that preserves the same guarantee.

The greedy algorithm sorts all edges in the graph by decreasing weight and then processes them one by one. An edge is included in the solution if the capacities of both its endpoint vertexes are larger than zero. Once an edge is included, the capacities of both its endpoint vertexes are decreased by one.

The parallel algorithm works as follows. At each superstep, each vertex v proposes its b(v) edges with maximum weight to its neighbors. At the same time, each vertex computes the intersection between its own proposals and the proposals of its neighbors. The set of edges in the intersection is included in the final solution. Then, each vertex updates its capacity b(v). If it becomes 0, the vertex is removed from the graph.

This is the pseudocode describing the algorithm:

```
while E is non-empty do
  for all v ∈ V in parallel do
    Let L_v be the set of b(v) edges incident to v with maximum weight;
    Let F be L_v ∩ L_U, where U = { u ∈ V : ∃ e(v,u) ∈ E } is the set of vertexes sharing an edge with v;
    Update M ← M ∪ F;
    Update E ← E \ F;
    Update b(v) ← b(v) − b_F(v);
    If b(v) = 0, remove v from V and remove all edges incident to v from E;
  end for
end while
return M;
```

#### Input format

The input is an undirected graph in adjacency list format, one vertex per line. The algorithm assumes that each edge is present twice in the input, once for each end-point. Each line is composed of a vertex and a list of edges. The vertex is composed of a long ID and an integer capacity. Each edge is composed of a long ID (the destination vertex ID) and a double weight. The elements of the list are tab or space separated.

For instance, the line `1 2 3 0.4 5 1.3` describes vertex 1 with capacity 2, and edges to vertices 3 and 5 with weights 0.4 and 1.3 respectively.

#### Output Format

The output is the matched subgraph.

#### Configuration parameters

None.

#### Running

This is the basic B-matching execution command line:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  ml.grafos.okapi.graphs.maxbmatching.MaxBMatching \
  -vif ml.grafos.okapi.graphs.maxbmatching.MBMTextInputFormat \
  -vip $INPUT_VERTICES \
  -vof ml.grafos.okapi.graphs.maxbmatching.MBMTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS \
  -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges
```

#### Publications

**Social Content Matching in MapReduce**, G. De Francisci Morales, A. Gionis, M. Sozio, Proceedings of the VLDB Endowment, 4(7):460-469, 2011.

#### Contributors

Gianmarco |

### Clustering Coefficient

The *local clustering coefficient* is defined for each vertex and measures how close a vertex and its neighbors are to being a clique. Given a vertex *v*, the clustering coefficient of *v* is the ratio e(v) / e_max(v), where e(v) is the number of edges that exist among the neighbors of *v* and e_max(v) is the maximum number of edges that could exist among them.

The *global clustering coefficient* is used to measure the overall
clustering of the vertices in the graph and is defined as the average of the
local clustering coefficients across all vertices in a graph.

This implementation computes the local clustering coefficient for every vertex in the graph as well as the global clustering coefficient. The local clustering coefficients are written as the output of the job, while the global clustering coefficient is set as a Hadoop counter. You can check its value in the standard output of the terminal or in the Hadoop web interface.

This computation works for both directed and undirected graphs.
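
For an undirected graph, the local clustering coefficient of a single vertex can be sketched as follows (an illustrative, non-distributed equivalent of the computation; names are illustrative):

```python
from itertools import combinations

def local_clustering_coefficient(neighbors, v):
    """Fraction of pairs of neighbors of v that are themselves connected."""
    nbrs = neighbors[v]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
    return links / (len(nbrs) * (len(nbrs) - 1) / 2)

neighbors = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(local_clustering_coefficient(neighbors, 3))   # neighbors 1, 2, 4: only (1, 2) linked -> 1/3
```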

Note that because every vertex sends its friend list to each of its neighbors, this algorithm can incur a high memory footprint and network traffic.

#### Input format

The input to the Clustering Coefficient algorithm is a set of edges describing the graph, with the following format:

`srcID dstID`

#### Output Format

For each vertex, the algorithm outputs its local clustering coefficient in the format:

`ID coefficient`

The global clustering coefficient is set as a Hadoop counter that you can see either at your terminal output or the Hadoop web interface when the execution of the algorithm finishes.

#### Configuration parameters

None.

#### Running

This is the basic Clustering Coefficient execution command line:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  -libjars $OKAPI_JAR \
  ml.grafos.okapi.graphs.ClusteringCoefficient\$SendFriendsList \
  -mc ml.grafos.okapi.graphs.ClusteringCoefficient\$MasterCompute \
  -eif ml.grafos.okapi.io.formats.LongNullTextEdgeInputFormat \
  -eip $INPUT_EDGES \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS \
  -ca giraph.oneToAllMsgSending=true
```

Note that because this algorithm can incur a large memory footprint and network traffic, we suggest you enable Giraph's out-of-core messages option.

### Graph partitioning

We implement the *Spinner* graph partitioning algorithm. Spinner computes an edge-based balanced k-way partitioning of a graph and is designed for massive graphs (e.g. up to 1 billion vertices and beyond). It supports a number of interesting features such as incremental computation and the ability to adapt a partitioning to changes to the graph (e.g. vertices and edges are added or removed) and to the partitions (e.g. partitions are added or removed). These features allow Spinner to partition massive graphs and keep the partitioning up-to-date with minimal re-computation.

The partitioning produced by Spinner can be used as an internal graph partitioning inside of Giraph, to minimise network communication between workers, and maximise load balance.

Spinner splits the vertices across k partitions, trying to maximise *locality* and *balancing*. The former means that vertices that are connected through an edge tend to be assigned to the same partition. The latter means that a similar number of edges is assigned to each partition.

Spinner is based on iterative vertex migrations, with decisions taken independently based on (per-vertex) local information. When a graph is partitioned for the first time, vertices are assigned randomly to partitions. After initialisation, vertices communicate their partition to their neighbours through messages, and they try to migrate to the partition where most of their neighbours occur. This pattern is also known as label-propagation.
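
As a rough illustration of the label-propagation idea, here is a highly simplified single-machine sketch of one migration decision; it ignores Spinner's balance penalty and probabilistic migration, which are described in the paper, and the names are illustrative:

```python
from collections import Counter

def propose_partition(vertex_partition, neighbors, v):
    """Propose moving v towards the partition where most of its neighbors are."""
    counts = Counter(vertex_partition[u] for u in neighbors[v])
    best, _ = counts.most_common(1)[0]
    return best

vertex_partition = {1: 0, 2: 1, 3: 1, 4: 0}
neighbors = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(propose_partition(vertex_partition, neighbors, 1))   # most neighbors of 1 are in partition 1
```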

Spinner is also able to adapt a previously computed partitioning to changes to the graph or to the number of partitions, hence practically avoiding re-partitioning the graph from scratch upon changes.

#### Input format

Spinner assumes long integers as vertex IDs. Hence, you can load a graph through SpinnerEdgeInputFormat (but in principle you can easily write a VertexInputFormat with the same type signature). The format of the input file is simple: it requires a pair of vertex IDs per line, separated by either a space or a tab.

To load a previous partitioning, you also need to use SpinnerVertexValueInputFormat, which reads the vertex-partition assignments computed during a previous run. The format of this file is also simple: it requires a vertex ID and a partition ID per line, separated by either a space or a tab.

#### Output Format

Spinner outputs vertex-partition assignments as the result of the computation. This output can be used as-is for future partitionings, or it can be used to map vertices to workers when running a Giraph computation, taking advantage of Spinner's computed partitioning. The class that writes the partitioning to HDFS is SpinnerVertexValueOutputFormat.

#### Configuration parameters

Spinner only requires you to specify the number of partitions k into which the graph should be split. However, a number of additional parameters are supported by the class. See the paper for more details about the effect of each parameter.

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  -libjars $OKAPI_JAR \
  ml.grafos.okapi.spinner.Spinner\$ConverterPropagate \
  -mc ml.grafos.okapi.spinner.Spinner\$PartitionerMasterCompute \
  -eif ml.grafos.okapi.spinner.Spinner\$SpinnerEdgeInputFormat \
  -eip $INPUT_EDGES \
  -vif ml.grafos.okapi.spinner.Spinner\$SpinnerVertexValueInputFormat \
  -vip $INPUT_PARTITIONING \
  -vof ml.grafos.okapi.spinner.Spinner\$SpinnerVertexValueOutputFormat \
  -op $OUTPUT \
  -w $WORKERS \
  -ca giraph.outEdgesClass=ml.grafos.okapi.spinner.OpenHashMapEdges
```


#### Publications

**Spinner: Large-scale Graph Partitioning for the Cloud**, Claudio Martella, Dionysios Logothetis, Georgos Siganos, arXiv 1404.3861, April 2014.

#### Contributors

Claudio |

### Jaccard similarity

The Jaccard similarity represents how similar two nodes of a graph are, based on their common neighbors. In particular, it is computed as the size of the intersection of the nodes’ neighbor sets over the size of the union of the nodes’ neighbor sets. The similarity value between two nodes is a number in the range [0, 1]. The closer the value is to 1, the more similar the nodes.

The same metric can be easily converted to a distance with values in the range [0, INF].

This implementation only computes similarity or distance for existing edges of the input graph and not for every pair of nodes.

The computation consists of 2 supersteps. During the first superstep, each vertex sends a list of its friends to all its neighbors in the graph. In the second superstep, each vertex computes the Jaccard similarity value for each of its edges. For every message received, it compares the friend list with its own neighborhood and computes common and total number of friends. It then sets the edge value to the Jaccard similarity index. If the configuration option for distance conversion is enabled, then an additional superstep is executed, where the Jaccard index is converted to a distance index, using the function f(x) = (1/x) - 1.
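
A small single-machine sketch of the per-edge Jaccard computation and the distance conversion described above (names are illustrative):

```python
def jaccard(neighbors, u, v):
    """|N(u) ∩ N(v)| / |N(u) ∪ N(v)|"""
    inter = len(neighbors[u] & neighbors[v])
    union = len(neighbors[u] | neighbors[v])
    return inter / union

def jaccard_distance(similarity):
    """Distance conversion used above: f(x) = 1/x - 1 (infinite when x = 0)."""
    return float('inf') if similarity == 0 else 1.0 / similarity - 1.0

neighbors = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(jaccard(neighbors, 1, 2), jaccard_distance(jaccard(neighbors, 1, 2)))   # 1/3 and 2.0
```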

If the input graph is large, the computation of the Jaccard index for all the edges might be very expensive. For this reason, an approximate implementation is also provided. In this implementation, each vertex uses a BloomFilter to store its neighborhood and sends this filter as a message to its neighbors, instead of the exact friend list. The hash functions to use and the size of the BloomFilter are configurable.

#### Input format

The input format is an edge list, containing one edge per line, in the format `srcID dstID`.

#### Output Format

The output of this program is the edges of the graph with their Jaccard index, one edge per line, in the form `srcID dstID value`.

#### Configuration parameters

#### Running

To run the exact Jaccard computation:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  ml.grafos.okapi.graphs.Jaccard\$SendFriendsList \
  -mc ml.grafos.okapi.graphs.Jaccard\$MasterCompute \
  -eif ml.grafos.okapi.io.formats.LongDoubleZerosTextEdgeInputFormat \
  -eip $INPUT_EDGES \
  -eof org.apache.giraph.io.formats.SrcIdDstIdEdgeValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS \
  -ca giraph.oneToAllMsgSending=true \
  -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges
```

To run the approximate Jaccard computation:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  ml.grafos.okapi.graphs.Jaccard\$SendFriendsBloomFilter \
  -mc ml.grafos.okapi.graphs.Jaccard\$MasterCompute \
  -eif ml.grafos.okapi.io.formats.LongDoubleZerosTextEdgeInputFormat \
  -eip $INPUT_EDGES \
  -eof org.apache.giraph.io.formats.SrcIdDstIdEdgeValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS \
  -ca giraph.oneToAllMsgSending=true \
  -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges \
  -ca jaccard.approximation.enabled=true
```

### K-core

The *k-core* of a graph is the maximal subgraph in which all vertices have a degree of at least *k*.

We find the k-core of a graph by iteratively removing vertices with degree less than *k*.

The algorithm stops when there are no more vertex removals. At the end of the execution, the remaining graph represents the k-core. It is possible that the result is an empty graph.
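
The peeling process can be sketched on a single machine as follows (an illustrative, non-distributed equivalent of the Giraph implementation; names are illustrative):

```python
def k_core(neighbors, k):
    """Iteratively remove vertices with degree < k until none remain."""
    nbrs = {v: set(ns) for v, ns in neighbors.items()}
    while True:
        to_remove = [v for v, ns in nbrs.items() if len(ns) < k]
        if not to_remove:
            return nbrs                          # possibly empty
        for v in to_remove:
            for u in nbrs[v]:
                if u in nbrs:
                    nbrs[u].discard(v)
            del nbrs[v]

graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(k_core(graph, 2))   # vertex 4 is peeled away, leaving the triangle 1-2-3
```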

#### Input format

The input is a set of edges where vertex IDs are of type long.

#### Output Format

If a k-core exists, the output contains the k-core in an adjacency list format.

#### Configuration parameters

#### Running

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  -libjars $OKAPI_JAR \
  ml.grafos.okapi.graphs.KCore\$KCoreComputation \
  -eif ml.grafos.okapi.io.formats.LongNullTextEdgeInputFormat \
  -eip $INPUT_EDGES \
  -vof ml.grafos.okapi.io.formats.LongNullNullAdjacencyListTextVertexOutputFormat \
  -op $OUTPUT \
  -w $WORKERS \
  -ca giraph.vertexResolverClass=ml.grafos.okapi.graphs.KCore\$KCoreVertexResolver \
  -ca core.k=3
```

#### WWW

#### Publications

**Using Pregel-like Large Scale Graph Processing Frameworks for Social Network Analysis**, Louise Quick, Paul Wilkinson, David Hardcastle, IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2012.

#### Contributors

Dionysis |

### Multi-source shortest paths

This algorithm takes as input a weighted graph and computes the shortest distances from a selected set of source vertices to all other vertices in the graph.

This can be used, for instance, to compute distances from a set of "landmark" vertices in a graph. The distances from the landmarks may then be used to compute other metrics such as approximate all-pair shortest distances and diameter estimation.

Although this can be done by running the single-source shortest paths algorithm multiple times, this algorithm is more efficient.

Every vertex maintains a map with the current shortest distance from each source. Initially, every source starts propagating its distance to its neighbors. After that, if the distance from one of the sources changes, a vertex propagates only the change with respect to that source.

The execution finishes when no changes occur. At the end every vertex's value holds the distance from each source. If a vertex does not contain a specific source ID, then its distance to the specific source is considered to be infinite.
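
For reference, the state each vertex ends up with (a map from source to shortest distance) corresponds to the result of the following single-machine sketch, which simply runs Dijkstra once per source (names are illustrative):

```python
import heapq

def multi_source_shortest_paths(edges, sources):
    """Return {vertex: {source: distance}} for a weighted graph given as
    {vertex: [(neighbor, weight), ...]}; unreachable sources are simply absent."""
    result = {v: {} for v in edges}
    for s in sources:
        dist = {s: 0.0}
        heap = [(0.0, s)]
        while heap:
            d, v = heapq.heappop(heap)
            if d > dist.get(v, float('inf')):
                continue                         # stale queue entry
            result[v][s] = d
            for u, w in edges[v]:
                if d + w < dist.get(u, float('inf')):
                    dist[u] = d + w
                    heapq.heappush(heap, (d + w, u))
    return result

edges = {1: [(2, 1.0)], 2: [(1, 1.0), (3, 2.0)], 3: [(2, 2.0)]}
print(multi_source_shortest_paths(edges, sources=[1, 3]))
```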

Note that the larger the number of sources selected, the larger the computational overhead, memory footprint and network traffic incurred by this algorithm, so be cautious about how many sources you select.

The user can either define a fraction of vertices to be selected randomly as sources, or explicitly define a list of source IDs separated by ':'. If none is specified, the vertex with ID 1 will be selected as the single source.

#### Input format

The input to the algorithm is a weighted graph specified as a set of edges, with the following format:

`srcID dstID weight`

#### Output Format

For each vertex, the algorithm outputs its distance from each source in the format:

`ID (src1, distance1)(src2, distance2) ...`

#### Configuration parameters

If no fraction or explicit list is defined, then the vertex with ID 1 will be selected as the single source.

#### Running

This is the basic Multi-source shortest paths execution command line:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  -libjars $OKAPI_JAR \
  ml.grafos.okapi.graphs.MultiSourceShortestPaths\$InitSources \
  -mc ml.grafos.okapi.graphs.MultiSourceShortestPaths\$MasterCompute \
  -eif ml.grafos.okapi.io.formats.LongFloatTextEdgeInputFormat \
  -eip $INPUT_EDGES \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS \
  -ca sources.fraction=0.1
```

#### Contributors

Dionysis |

### Semi-Clustering

In social graphs, vertices represent users and edges represent connections between them. Edges may have weights that represent the strength of the connection between two users.

A semi-cluster is formed by users who have stronger connections with each other compared to other users. The main difference between a semi-cluster and a typical cluster is that a node may belong to more than one semi-cluster.

This algorithm receives a graph as an input and computes semi-clusters based on their edge weights.

The main objective of semi-clustering is to create semi-clusters whose vertices have the highest interaction among them. The logic behind this lies in the calculation of a score for each semi-cluster and the dismissal of the semi-clusters with the lowest scores. The score of a semi-cluster c is calculated as S_c = (I_c - f_B * B_c) / (V_c * (V_c - 1) / 2), where I_c is the sum of the weights of all internal edges of the semi-cluster, B_c is the sum of the weights of all boundary edges, V_c is the number of vertices in the semi-cluster, and f_B is the boundary edge score factor.
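
Computing the score of a candidate semi-cluster then amounts to the following small sketch (names are illustrative):

```python
def semi_cluster_score(internal_weight, boundary_weight, num_vertices, boundary_factor=0.5):
    """S_c = (I_c - f_B * B_c) / (V_c * (V_c - 1) / 2); defined for at least 2 vertices."""
    if num_vertices < 2:
        return 0.0
    return (internal_weight - boundary_factor * boundary_weight) / (num_vertices * (num_vertices - 1) / 2)

print(semi_cluster_score(internal_weight=6.0, boundary_weight=2.0, num_vertices=3))  # (6 - 1) / 3
```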

Each vertex maintains a list of the semi-clusters to which it belongs. The list contains at most Cmax semi-clusters, a number defined by the user, sorted by their score. In each superstep, every vertex processes the semi-clusters received from its neighbors, adds itself to those it is not yet part of (as long as they are not full), keeps the Cmax highest-scoring ones, and sends them to its neighbors.

In the first superstep, each vertex creates an empty cluster and adds itself to it. Thus, the first set of messages sent in the network contains the vertices themselves.

#### Input format

The input to the Semi-Clustering algorithm is a weighted, undirected graph. Note that a user cannot be connected to itself; thus, lines containing the same userID twice should be avoided.

#### Output Format

The output consists of one line per vertex in the graph, containing the IDs of the clusters the vertex belongs to.

#### Configuration parameters

`iterations`: the maximum number of iterations to perform. The default value is 10.

`max.clusters`: the maximum number of clusters to form. The default value is 2.

`cluster.capacity`: the maximum number of vertices a cluster can have. The default value is 4.

`score.factor`: the boundary edge score factor. The default value is 0.5.

#### Running

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  -libjars $OKAPI_JAR \
  ml.grafos.okapi.graphs.SemiClustering \
  -eif ml.grafos.okapi.io.formats.LongDoubleTextEdgeInputFormat \
  -eip $INPUT_EDGES \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS
```

#### Publications

**Pregel: a system for large-scale graph processing**, Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski, Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010.

#### Contributors

Maria | Dionysis |

### SybilRank

The SybilRank algorithm is used in Online Social Network analysis to detect fake accounts. When seeded with a set of known trusted accounts, it ranks the nodes in the graph according to their probability of being fake.

The algorithm consists of three types of computation:

**TrustedNodeAggregation**: In the first step, if a vertex is trusted, it updates the aggregator that counts the total number of trusted vertices.

**Initializer**: It initializes the rank according to whether a node is trusted or not and then starts the power iterations by distributing the rank of each vertex.

**SybilRankComputation**: This is the implementation of the power iteration. It processes messages, updates state and sends messages.
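
For intuition, the power iteration boils down to a degree-normalized trust propagation. The following single-machine sketch is a simplification of the scheme in the cited papers; the seeding, iteration count and names are illustrative:

```python
import math

def sybilrank(neighbors, trusted, total_trust=1000.0):
    """Propagate trust for O(log n) iterations, then rank by degree-normalized trust."""
    rank = {v: (total_trust / len(trusted) if v in trusted else 0.0) for v in neighbors}
    for _ in range(int(math.ceil(math.log2(len(neighbors))))):
        rank = {v: sum(rank[u] / len(neighbors[u]) for u in neighbors[v]) for v in neighbors}
    return {v: rank[v] / len(neighbors[v]) for v in neighbors}   # lower rank = more likely fake

neighbors = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(sybilrank(neighbors, trusted={1}))
```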

#### Input format

The input to the algorithm consists of two sets: the set of edges of the graph, with one edge per line in the format below, and the set of trusted vertices.

`srcID dstID weight`

#### Configuration parameters

#### Running

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  -libjars $OKAPI_JAR \
  ml.grafos.okapi.graphs.SybilRank\$TrustAggregation \
  -mc ml.grafos.okapi.graphs.SybilRank\$SybilRankMasterCompute \
  -eif ml.grafos.okapi.io.formats.LongDoubleTextEdgeInputFormat \
  -eip $INPUT_EDGES \
  -vif ml.grafos.okapi.graphs.SybilRank\$SybilRankVertexValueInputFormat \
  -vip $TRUSTED_VERTICES \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS
```

#### Publications

**Aiding the detection of fake accounts in large scale social online services**, C. Qiang, M. Sirivianos, X. Yang, T. Pregueiro, In NSDI 2012.

**Integro: Leveraging Victim Prediction for Robust Fake Account Detection in OSNs**, Y. Boshmaf, D. Logothetis, G. Siganos, J. Leria, J. Lorenzo, M. Ripeanu, K. Beznosov, In Network and Distributed System Security Symposium (NDSS '15).

#### Contributors

Dionysis | Yazan |

### Triangles

We have implemented a set of algorithms related to the existence of triangles in graphs. More specifically, we provide (i) an implementation that counts the unique triangles in a graph and (ii) an implementation that enumerates the actual triangles in a graph.

#### Input format

We assume that the input graph is unweighted and undirected.

The input is a set of edges, with the following format:

`srcID dstID`

#### Output Format

When counting the number of unique triangles, this algorithm outputs a line for each vertex that participates in a triangle and is the vertex with the highest ID in that triangle. Such a line has the format:

`ID count`

This is a vertex ID and the number of times it participates in a triangle as the vertex with the highest ID. Summing up the counts in the second column across the entire output gives you the total number of unique triangles in the graph.

When enumerating the triangles, this algorithm outputs a line for each vertex that participates in a triangle and is the vertex with the highest ID in that triangle. Such a line has the format:

`id [[id11 id12] [id21 id22] ...]`

This is the vertex ID and a list of pairs that form a triangle with this vertex. For instance, the line:

`7 [[2 3] [1 4]]`

means that vertex 7 forms a triangle with vertices 2 and 3 and another triangle with vertices 1 and 4.

#### Configuration parameters

None.

#### Running

You can compute the different triangle-related metrics by specifying the right implementation of the MasterCompute class.

To count the number of unique triangles in the graph, run the following command:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  ml.grafos.okapi.graphs.Triangles\$Initialize \
  -mc ml.grafos.okapi.graphs.Triangles\$TriangleCount \
  -eif ml.grafos.okapi.io.formats.LongNullTextEdgeInputFormat \
  -eip $INPUT_EDGES \
  -vof ml.grafos.okapi.graphs.Triangles\$TriangleOutputFormat \
  -op $OUTPUT \
  -w $WORKERS \
  -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges
```

To enumerate all the unique triangles in the graph, run the following command:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  ml.grafos.okapi.graphs.Triangles\$Initialize \
  -mc ml.grafos.okapi.graphs.Triangles\$TriangleFind \
  -eif ml.grafos.okapi.io.formats.LongNullTextEdgeInputFormat \
  -eip $INPUT_EDGES \
  -vof ml.grafos.okapi.graphs.Triangles\$TriangleOutputFormat \
  -op $OUTPUT \
  -w $WORKERS \
  -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges
```

#### Publications

**Investigating Graph Algorithms in the BSP Model on the Cray XMT**, D. Ediger, D. Bader, 7th Workshop on Multithreaded Architectures and Applications (MTAAP), Boston, MA, May 24, 2013.

**Using Pregel-like Large Scale Graph Processing Frameworks for Social Network Analysis**, Louise Quick, Paul Wilkinson, David Hardcastle, IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2012.

#### Contributors

Dionysis |

### Triangle semi-metricity

In a weighted graph, we say that a direct edge between two nodes is semi-metric if there exists an indirect path between these two nodes with a shorter distance. This program takes as input an undirected, weighted graph and removes all 1st-order semi-metric edges from it, that is, all semi-metric edges for which the shorter indirect path has length two.

To find all 1st-order semi-metric edges, we first find all the triangles in the graph and then compare their edge weights. If vertices A, B, C form a triangle, then edge AB is semi-metric if D(A,B) > D(A,C) + D(C,B), where D(u,v) is the weight of the edge (u,v).
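
For a triangle formed by A, B and C, the check is a simple weight comparison, as in this tiny sketch (names are illustrative):

```python
def is_semimetric(d_ab, d_ac, d_cb):
    """Edge (A, B) is 1st-order semi-metric if the two-hop path through C is shorter."""
    return d_ab > d_ac + d_cb

print(is_semimetric(5.0, 2.0, 2.0))   # True: the indirect path A-C-B has length 4 < 5
```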

The output of this algorithm is the graph without the 1st-order semi-metric edges.

We provide two implementations: a basic one and an optimized one that reduces the number of messages that have to be sent at the same time, making better use of memory.

#### Basic implementation

The algorithm consists of four supersteps and discovers each triangle exactly once. We assume numeric vertex IDs and a total ordering on them. In the first superstep, every node propagates its ID to all its neighbors with higher IDs. In the second superstep, each node retrieves the ID from each received message and the weight of the edge formed between this node and the message sender. Then, the node constructs a new message with the sender ID, the edge weight and its own ID, and sends this message to all the neighbors with higher IDs. In the third superstep, for every message received, a node checks whether a triangle is formed and, if it is, whether it contains a semi-metric edge. If the node detects a semi-metric edge, it marks this edge and the edge in the opposite direction for removal. In the final superstep, all marked edges are removed.

#### Optimized implementation

This implementation divides the algorithm into several megasteps, each containing the three supersteps of the main computation. In each megastep, some of the vertices of the graph execute the algorithm, while the rest are idle (but still active). The algorithm finishes when all vertices have executed the computation. In the case of semi-metric edge removal, this model is possible because messages are neither aggregated nor combined. This implementation assumes numeric vertex IDs.

#### Input format

The input format is an edge list, containing one edge with its value per line, in the format `srcID dstID weight`.

#### Output Format

The output format is an edge list, containing one edge with its value per line, in the format `srcID dstID weight`.

#### Configuration parameters

#### Running

To run the basic implementation execute:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  ml.grafos.okapi.graphs.SemimetricTriangles\$PropagateId \
  -mc ml.grafos.okapi.graphs.SemimetricTriangles\$SemimetricMasterCompute \
  -eif ml.grafos.okapi.io.formats.LongDoubleTextEdgeInputFormat \
  -eip $INPUT_EDGES \
  -eof org.apache.giraph.io.formats.SrcIdDstIdEdgeValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS \
  -ca giraph.oneToAllMsgSending=true \
  -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges
```

To run the optimized implementation execute:

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  ml.grafos.okapi.graphs.ScalableSemimetric\$PropagateId \
  -mc ml.grafos.okapi.graphs.ScalableSemimetric\$SemimetricMasterCompute \
  -eif ml.grafos.okapi.io.formats.LongDoubleTextEdgeInputFormat \
  -eip $INPUT_EDGES \
  -eof org.apache.giraph.io.formats.SrcIdDstIdEdgeValueTextOutputFormat \
  -op $OUTPUT \
  -w $WORKERS \
  -ca giraph.oneToAllMsgSending=true \
  -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges
```

#### Publications

**Investigating Graph Algorithms in the BSP Model on the Cray XMT**, D. Ediger and D. A. Bader, 7th Workshop on Multithreaded Architectures and Applications (MTAAP), Boston, MA, May 24, 2013.

#### Contributors

Dionysis |

# Clustering

### K-means

K-means clustering is an iterative algorithm that partitions *N* data points (observations) into *k* clusters. During the initialization phase, the algorithm chooses *k* points from the input set at random; these become the initial cluster centers.

At each iteration: (i) each data point is assigned to the closest cluster center, in terms of Euclidean distance, and (ii) new cluster centers are computed by taking the arithmetic mean of the points assigned to each cluster.

The algorithm converges when the positions of the cluster centers do not change more than a specified threshold.

Each data point corresponds to a vertex. During a superstep, each point reads the current cluster centers and finds the one which is closest to it. It then aggregates its own coordinates to the corresponding cluster center aggregator and updates its vertex value with the new cluster ID.

The random initialization of the cluster centers is implemented using a custom aggregator. The aggregator builds a list of k (number of cluster centers) elements and their coordinates. If the current list size is less than k, the aggregate method appends the new element at the end of the list. Otherwise, it replaces an element at a random position in the list, with probability k/N, where N is the total number of input points.

The positions of the k centers are stored in aggregators and they are updated by the Master. Each of the aggregators aggregates the vector coordinates of the points assigned to the corresponding cluster center. At the beginning of a superstep, the master computes the coordinates of the new clusters, as the means of the coordinates of the assigned points.
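
For reference, a compact single-machine sketch of one k-means iteration mirroring the assignment and re-computation steps described above (names are illustrative, and the random initialization here is plain sampling rather than the custom aggregator):

```python
import numpy as np

def kmeans_iteration(points, centers):
    """Assign each point to its closest center, then recompute centers as the means."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)                      # closest center per point
    new_centers = np.array([
        points[assignment == c].mean(axis=0) if np.any(assignment == c) else centers[c]
        for c in range(len(centers))
    ])
    return assignment, new_centers

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centers = points[np.random.choice(len(points), size=2, replace=False)]  # random init
for _ in range(10):
    assignment, centers = kmeans_iteration(points, centers)
```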

#### Input format

The input to the k-means clustering algorithm is a set of points, described by an ID and their coordinates, with the following format:

`pointID, coord_1\tcoord_2\t…\tcoord_n`

where *n* is the dimension of the point space.

#### Output format

For each point, the algorithm outputs the ID of the cluster to which it belongs, in the form: `pointID, clusterID`

#### Configuration parameters

K-means clustering requires setting the desired number of cluster centers, the total number of input data points and the dimensionality of the data points. The maximum number of iterations can also be optionally set.

#### Running

```
hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
  ml.grafos.okapi.clustering.kmeans.KMeansClustering\$RandomCentersInitialization \
  -mc ml.grafos.okapi.clustering.kmeans.KMeansClustering\$KMeansMasterCompute \
  -vif ml.grafos.okapi.clustering.kmeans.KMeansTextInputFormat \
  -vip $INPUT_POINTS \
  -vof ml.grafos.okapi.clustering.kmeans.KMeansTextOutputFormat \
  -ca giraph.outEdgesClass=ml.grafos.okapi.common.graph.NullOutEdges \
  -ca kmeans.cluster.centers.count=$CLUSTER_COUNT \
  -ca kmeans.points.count=$POINT_COUNT \
  -ca kmeans.points.dimensions=$POINT_DIMENSIONS \
  -op $OUTPUT \
  -w $WORKERS
```