Collaborative Filtering (CF) is a technique widely used by recommendation systems. These systems model users' taste patterns based on the items they have rated, viewed or consumed, and suggest new items that match each user's preferences (personalization).

Collaborative Filtering relates two entities, items and users, through the bipartite graph formed by their interactions (e.g. users rating items). Broadly speaking, there are two types of approaches to CF: 1) the neighborhood approach and 2) latent factor models. The former directly computes the similarities between items or users and uses these similarities to make predictions for new items. The latter approach, the so-called model-based approach, characterizes users and items with latent factors and embeds both entities in the same latent factor space. These latent factors can then be used to predict a user's preference for an item, typically as the inner product of the two factor vectors.

Okapi provides a number of latent factor algorithms for CF, based on Alternating Least Squares (ALS), Stochastic Gradient Descent (SGD) and Singular Value Decomposition (SVD++). Okapi also includes state-of-the-art "learning to rank" technology (CLiMF) and will incorporate CF methods that are a product of our research in the domain.

Try them out and leave us a comment. Or two.


 

Alternating Least Squares (ALS)

ALS is a matrix factorization algorithm for CF. Users and items are characterized in terms of latent factor vectors; the algorithm fits these values by solving a least-squares error model over the users assuming the item factors are fixed, and over the items assuming the user factors are fixed.

ALS takes as input a set of ratings from users on items and minimizes the error between the predicted and known ratings by alternating between two steps:

  • Fix pu (user factors) and compute qi (item factors) by solving a least-squares problem.
  • Fix qi and compute pu by solving a least-squares problem.

    The two steps are repeated until the training error converges.
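
    To make the alternation concrete, here is a minimal single-machine sketch in Python/NumPy (illustrative only — not the Okapi implementation, and all names are our own). Each half-step solves an independent regularized least-squares problem per user or per item:

    import numpy as np

    def als(R, mask, dim=5, lam=0.01, iterations=10):
        # R: ratings matrix (users x items); mask: 1 where a rating is known
        n_users, n_items = R.shape
        P = np.random.rand(n_users, dim)   # user factors p_u
        Q = np.random.rand(n_items, dim)   # item factors q_i
        for _ in range(iterations):
            for u in range(n_users):       # fix Q, solve for each p_u
                idx = mask[u] > 0
                A = Q[idx].T @ Q[idx] + lam * np.eye(dim)
                P[u] = np.linalg.solve(A, Q[idx].T @ R[u, idx])
            for i in range(n_items):      # fix P, solve for each q_i
                idx = mask[:, i] > 0
                A = P[idx].T @ P[idx] + lam * np.eye(dim)
                Q[i] = np.linalg.solve(A, P[idx].T @ R[idx, i])
        return P, Q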

    Input format

    The input to the ALS algorithm is a set of user-item ratings stored in text files, where each line has the following format:

    userID itemID rating

    where:

  • userID and itemID are integer identifiers representing users and items.
  • rating is a float value representing the rating of a userID for itemID.

    Output Format

    The output of ALS is a set of latent vectors for each user and item. Each line in the output has the following format

    ID type vector

    where:

  • ID is the integer ID of the user or item.
  • type indicates whether this is a user (type=0) or item (type=1).
  • vector is the computed latent vector for this user or item in the format [val1; val2; ...; valN].

    Note that the type field is there so that you can distinguish between users and items that may potentially have the same integer ID.

    Configuration parameters

    iterations: the maximum number of iterations to perform. The default value is 10.

    dim: the number of factors in the latent vectors. The default value is 50.

    lambda: the ALS regularization parameter. The default value is 0.01.

    rmse: the target train RMSE to reach for termination. This is optional.

    Running ALS

    The basic ALS execution command line:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
           -libjars $OKAPI_JAR \
           ml.grafos.okapi.cf.als.Als\$InitUsersComputation \
           -mc ml.grafos.okapi.cf.als.Als\$MasterCompute \
           -eif ml.grafos.okapi.cf.CfLongIdFloatTextInputFormat \
           -eip $INPUT_RATINGS \
           -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
           -op $OUTPUT \
           -w $WORKERS

  • $OKAPI_JAR: Location of the Okapi jar file.
  • $INPUT_RATINGS: Location of input ratings on HDFS.
  • $OUTPUT: Directory to output the computed latent vectors.
  • $WORKERS: Number of Giraph workers.

    The above command will run ALS with the default values for all the configuration parameters. You can change one or more parameters by appending Giraph custom arguments to the command. For example, to change the number of iterations, append the following to the command:

    -ca als.iterations=20

    WWW

    Wikipedia - Least Squares Method

    Publications

    Large-Scale Parallel Collaborative Filtering for the Netflix Prize, Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan, International Conference on Algorithmic Aspects in Information and Management, Shanghai, China 2008.

    Contributors

    Maria, Dionysis

     

    Stochastic Gradient Descent (SGD)

    SGD is a general-purpose optimization algorithm, typically used for training neural networks; here we use it for matrix factorization in CF.

    In the context of recommendations, the algorithm takes as input a training set with user-item pairs and their corresponding known ratings. It outputs latent vectors for both users and items that can then be used to predict ratings between any pair of user and item.

    SGD works by iteratively refining the user and item latent vectors to minimize the regularized squared error loss. The algorithm starts with user nodes randomly initializing their latent vectors; every user node then sends its latent vector to the item nodes the user has rated. Subsequently, the item nodes update their own latent vectors and each item node sends its updated latent vector to the users that have rated the item. This process continues until a maximum number of iterations is performed or a target RMSE is reached. We provide two versions of SGD, basic SGD and incremental SGD. These apply different policies with respect to when a user or item node sends its updated latent vector.
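
    For reference, a single update for one observed rating r_ui looks roughly as follows (a sketch with our own naming, not Okapi code); gamma is the learning rate and lambda the regularization weight:

    def sgd_update(p_u, q_i, r_ui, gamma=0.005, lam=0.01):
        # error of the current prediction for this rating
        err = r_ui - sum(p * q for p, q in zip(p_u, q_i))
        # gradient step on the regularized squared error
        p_new = [p + gamma * (err * q - lam * p) for p, q in zip(p_u, q_i)]
        q_new = [q + gamma * (err * p - lam * q) for p, q in zip(p_u, q_i)]
        return p_new, q_new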

    Basic SGD

    In the basic SGD, during an iteration user nodes always send updates to item nodes and vice versa.

    Incremental SGD

    In the incremental version of SGD, if a vertex value (i.e. a user or item latent vector) does not change by more than a threshold, then the vertex does not propagate any update to other vertices.

    Our study has shown that this causes the algorithm to converge faster and it is also cheaper in terms of total messages sent. This method, however, requires you to set the threshold manually. This requires some experimentation with different thresholds for your specific data set.

    By default, the incremental mode is disabled. To enable it, just set a threshold using the threshold configuration parameter (more details in the How to Run section).

    Input format

    The input to the SGD algorithm is a set of user-item ratings stored in text files, where each line has the following format:

    userID itemID rating

    where:
  • userID and itemID are integer identifiers representing users and items.
  • rating is a float value representing the rating of userID for itemID.

    Output Format

    The output of SGD is a set of latent vectors for each user and item. Each line in the output has the following format:

    ID type vector

    where:

  • ID is the integer ID of the user or item.
  • type indicates whether this is a user (type=0) or item (type=1).
  • vector is the computed latent vector for this user or item in the format [val1; val2; ...; valN].

    Note that the type field is there so that you can distinguish between users and items that may potentially have the same integer ID.

    Configuration parameters

  • iterations: the maximum number of iterations to perform. The default value is 10.
  • dim: the number of factors in the latent vectors. The default value is 50.
  • min.rating: the minimum rating possible. The default value is 0.
  • max.rating: the maximum rating possible. The default value is 5.
  • lambda: the SGD regularization parameter. The default value is 0.01.
  • gamma: the SGD learning rate. The default value is 0.005.
  • rmse: the target train RMSE to reach. This is optional.
  • tolerance: used in the incremental version of the algorithm. If the difference (L2-norm) between two successive values of a node (user or item) is less than this number then this node does not send its updated latent vector. This is optional.

    Running

    This is the basic SGD execution command line:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
           -libjars $OKAPI_JAR \
           ml.grafos.okapi.cf.sgd.Sgd\$InitUsersComputation \
           -mc ml.grafos.okapi.cf.sgd.Sgd\$MasterCompute \
           -eif ml.grafos.okapi.cf.CfLongIdFloatTextInputFormat \
           -eip $INPUT_RATINGS \
           -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
           -op $OUTPUT \
           -w $WORKERS

  • $OKAPI_JAR: Location of the Okapi jar file.
  • $INPUT_RATINGS: Location of input ratings on HDFS.
  • $OUTPUT: Directory to output the computed latent vectors.
  • $WORKERS: Number of Giraph workers.

    The above command will run SGD with the default values for all the configuration parameters. You can change one or more parameters by appending Giraph custom arguments to the command. For example, to change the number of iterations, append the following to the command:

    -ca sgd.iterations=20

    WWW

    Wikipedia - Stochastic Gradient Descent (SGD)


    Publications

    Matrix Factorization Techniques for Recommender Systems, Yehuda Koren, Robert Bell, Chris Volinsky, In IEEE Computer, Vol. 42, No. 8. August 2009, pp. 30-37.


    Contributors

    Maria, Dionysis

     

    Singular Value Decomposition (SVD++)

    Singular Value Decomposition is a matrix factorization technique for CF.

    We have implemented the SVD++ variant of Singular Value Decomposition. Similarly to other techniques, SVD++ iteratively updates the latent vectors of users and items.

    In SVD++, every node in the graph representation (user or item) maintains the typical factor vector but also a vector of weights of the same size. The impact on the Giraph implementation is that this algorithm requires more memory.
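
    For reference, the standard SVD++ prediction rule (Koren, 2008) combines the global mean, user and item biases, the user factors, and the implicit-feedback weight vectors of the items the user has rated. A sketch with our own naming (not the Okapi API):

    import math

    def svdpp_predict(mu, b_u, b_i, p_u, q_i, y_rated):
        # y_rated: weight vectors y_j of the items rated by this user
        if y_rated:
            norm = 1.0 / math.sqrt(len(y_rated))
            feedback = [norm * sum(col) for col in zip(*y_rated)]
        else:
            feedback = [0.0] * len(p_u)
        # r_hat = mu + b_u + b_i + q_i . (p_u + |N(u)|^-1/2 * sum_j y_j)
        return mu + b_u + b_i + sum(
            q * (p + f) for q, p, f in zip(q_i, p_u, feedback))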

    Input format

    The input to the SVD++ algorithm is a set of user-item ratings stored in text files, where each line has the following format:

    userID itemID rating

    where:
  • userID and itemID are integer identifiers representing users and items.
  • rating is a float value representing the rating of userID for itemID.

    Output Format

    The output of SVD++ is a set of latent vectors for each user and item. Each line in the output has the following format:

    ID type vector

    where:

  • ID is the integer ID of the user or item.
  • type indicates whether this is a user (type=0) or item (type=1).
  • vector is the computed latent vector for this user or item in the format [val1; val2; ...; valN].

    Note that the type field is there so that you can distinguish between users and items that may potentially have the same integer ID.

    Configuration parameters

  • iterations: the maximum number of iterations to perform. The default value is 10.
  • dim: the number of factors in the latent vectors. The default value is 50.
  • min.rating: the minimum rating possible. The default value is 0.
  • max.rating: the maximum rating possible. The default value is 5.
  • lambda.factor: latent vector regularization parameter. The default value is 0.01.
  • lambda.bias: bias vector regularization parameter. The default value is 0.01.
  • gamma.factor: factor vector learning rate. The default value is 0.005.
  • gamma.bias: bias vector learning rate. The default value is 0.005.
  • rmse: the target RMSE to reach. This is optional.

    Running

    This is the basic SVD++ execution command line:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
           -libjars $OKAPI_JAR \
           ml.grafos.okapi.cf.svd.Svdpp\$InitUsersComputation \
           -mc ml.grafos.okapi.cf.svd.Svdpp\$MasterCompute \
           -eif ml.grafos.okapi.cf.CfLongIdFloatTextInputFormat \
           -eip $INPUT_RATINGS \
           -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
           -op $OUTPUT \
           -w $WORKERS

  • $OKAPI_JAR: Location of the Okapi jar file.
  • $INPUT_RATINGS: Location of input ratings on HDFS.
  • $OUTPUT: Directory to output the computed latent vectors.
  • $WORKERS: Number of Giraph workers.

    The above command will run SVD++ with the default values for all the configuration parameters. You can change one or more parameters by appending Giraph custom arguments to the command. For example, to change the number of iterations, append the following to the command:

    -ca svd.iterations=20

    WWW

    ACM RecSys Wiki - SVD++

    Publications

    Factorization meets the neighborhood: a multifaceted collaborative filtering model, Yehuda Koren, In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.


    Contributors

    Maria, Dionysis

    Bayesian Personalized Ranking (BPR)

    Bayesian Personalized Ranking from implicit feedback (BPR) optimizes the Area Under the Curve (AUC) in a Collaborative Filtering setting. The implementation in Okapi is a port of the MyMediaLite implementation to the Giraph framework.

    The algorithm is in beta, and is not guaranteed to produce the correct results.

    The main difference between the original BPR and this implementation is that here the sampling is done per user, instead of sampling randomly over (user, item, item) triples.

    This leads to two main differences: 1) each user is updated in each iteration, and 2) within an iteration an item can be updated more than once. In other words, item factors are updated in batches. For example, imagine that the original BPR samples the triples (u1, i1, i2) and (u2, i1, i3). After the update for the first sample, the factors of i1 change; therefore, when computing the update for the second sample, the value of i1 is already new. In the Okapi implementation we process these samples concurrently, so both use the same i1 factors, and after the iteration i1 is updated twice with the corresponding deltas.

    Algorithm implementation:

    The idea is to represent users and items as a bipartite graph. Each node holds its latent factors. Each edge carries a positive score, or 0 if it was sampled as irrelevant.
    1. Each user samples its relevant items and an equal number of irrelevant items. To sample an irrelevant item, we create an edge with score 0.
    2. The user then asks all these items to send it their factors.
    3. Item nodes do nothing but send their factors back to the requesting users.
    4. User nodes compute predictions for all relevant and irrelevant items and compute update vectors based on these predictions.
    5. Each user node updates itself with the computed gradient and sends the computed updates to the items. After sending the updates, it erases all edges to irrelevant items (those marked with 0).
    6. Items update themselves.
    7. Repeat from step 1 for the configured number of iterations.

    We apply an additional trick. We want the final model to be U*V and to remove all the item and user biases. So if we have d=10, we make the U and V vectors d=11, with U[0]=1 and V[0] holding the item bias. This way we can use the same evaluation framework for all the methods.
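
    For reference, a single BPR gradient step for one (user, relevant item, irrelevant item) triple looks roughly like this (a sketch with our own naming, not Okapi code; the arguments are 1-D NumPy arrays):

    import numpy as np

    def bpr_update(p_u, q_i, q_j, gamma=0.005, lam=0.01):
        # i is a relevant item for user u, j an irrelevant (sampled) one
        x_uij = p_u @ (q_i - q_j)
        m = 1.0 / (1.0 + np.exp(x_uij))   # sigmoid(-x_uij)
        p_old = p_u.copy()
        p_u += gamma * (m * (q_i - q_j) - lam * p_u)
        q_i += gamma * (m * p_old - lam * q_i)
        q_j += gamma * (-m * p_old - lam * q_j)
        return p_u, q_i, q_j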

    Similarly to the previous algorithms, BPR is ready to go once we set the input and output of the algorithm.

    Input format

    The input to the algorithm is a set of user-item ratings stored in a text file, where each line has the following format:

    userID itemID rating

    where:

  • userID and itemID are unique identity values for users and items respectively.
  • rating is an integer value of 1 (one), signaling that userID has interacted with itemID (implicit feedback).

    Output Format

    The output of BPR is a set of user and item values in the following format:

    ID value

    where value is the latent vector of the corresponding user or item.

    Configuration parameters

  • iterations: the maximum number of iterations to perform. The default value is 10.
  • dim: the number of factors in the latent vectors. The default value is 50.
  • lambda: the BPR regularization parameter. The default value is 0.01.
  • gamma: the BPR learning rate for the stochastic gradient descent steps. The default value is 0.005.

    Running

    The BPR execution command line is similar to the other methods; the parameters that need to be specified are:

  • The BPR computation class.
  • The MasterCompute class for the BPR algorithm (optional).
  • The input format class.
  • $INPUT_RATINGS: Location of the input edges file on HDFS.
  • The output format class.
  • $OUTPUT: Directory where the output file is stored on HDFS.
  • $WORKERS: Number of workers (integer).

    These parameters form the following execution command line:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
           -libjars $OKAPI_JAR \
           ml.grafos.okapi.cf.ranking.BPRRankingComputation \
           -eif ml.grafos.okapi.cf.CfLongIdFloatTextInputFormat \
           -eip $INPUT_RATINGS \
           -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
           -op $OUTPUT \
           -w $WORKERS

    In the above command line, we do not set any configuration parameters, thus the default values are used. To change one or more configuration parameters you only have to add one line per parameter at the end of the command line. The new line should look like this:

    -ca parameter_name=new_value

    Publications

    S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. UAI ’09, pages 452–461, 2009.

    Contributors

    Linas

     

    CLiMF

    Collaborative Less-is-More Filtering (CLiMF) tackles the problem of recommendation in scenarios with binary relevance data, where only a few (k) items are recommended to each user. In CLiMF the latent factors are learned by directly maximizing the Mean Reciprocal Rank (MRR), a well-known Information Retrieval metric for measuring the performance of top-k recommendations. We achieve linear computational complexity by introducing a lower bound of the smoothed reciprocal rank metric.
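
    As a reminder, MRR is the average over users of the reciprocal rank of the first relevant item in the recommendation list. A small sketch of the metric itself (illustrative only):

    def mean_reciprocal_rank(ranked_items, relevant_sets):
        # ranked_items: per-user ranked item lists; relevant_sets: per-user sets
        total = 0.0
        for items, relevant in zip(ranked_items, relevant_sets):
            rank = next((pos + 1 for pos, it in enumerate(items)
                         if it in relevant), None)
            total += 1.0 / rank if rank else 0.0
        return total / len(ranked_items)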

    The algorithm is in beta, and is not guaranteed to produce the correct results.

    The main difference between the original CLiMF and this implementation is that learning happens in semi-batch mode, i.e. all users are updated, then all items are updated, and so on, following the Giraph/Pregel computing model. This leads to two main differences from the original CLiMF implementation: 1) each user is updated in each iteration, and 2) within an iteration an item can be updated more than once. In other words, item factors are updated in batches.

    Implementation of the algorithm:

    The idea is to represent users and items as a bipartite graph. Each node holds its latent factors. Each edge carries a positive score, or 0 if it was sampled as irrelevant.
    1. Each user samples its relevant items and an equal number of irrelevant items. To sample an irrelevant item, we create an edge with score 0.
    2. The user then asks all these items to send it their factors.
    3. Item nodes do nothing but send their factors back to the requesting users.
    4. User nodes compute predictions for all relevant and irrelevant items and compute update vectors based on these predictions.
    5. Each user node updates itself with the computed gradient and sends the computed updates to the items. After sending the updates, it erases all edges to irrelevant items (those marked with 0).
    6. Items update themselves.
    7. Repeat from step 1 for the configured number of iterations.

    We apply an additional trick. We want the final model to be U*V and to remove all the item and user biases. So if we have d=10, we increase the size of the U and V vectors to d=11, with U[0]=1 and V[0] holding the item bias. This way we can use the same evaluation framework for all the methods.

    Similarly to the previous algorithms, CLiMF is ready to go once we set the input and output of the algorithm.

    Input format

    The input to the algorithm is a set of user-item ratings stored in a text file, where each line has the following format:

    userID itemID rating

    where:

  • userID and itemID are unique identity values for users and items respectively.
  • rating is an integer value between 1 and 5 and represents the rating from userID for itemID.

    Output Format

    The output of CLiMF is a set of user and item values in the following format:

    ID value

    where value is the latent vector of the corresponding user or item.

    Configuration parameters

  • iterations: the maximum number of iterations to perform. The default value is 10.
  • dim: the number of factors in the latent vectors. The default value is 50.
  • lambda: the CLiMF regularization parameter. The default value is 0.01.
  • gamma: the CLiMF learning rate for the stochastic gradient descent steps. The default value is 0.005.

    Running

    The CLiMF execution command line is similar to the other methods; the parameters that need to be specified are:

  • The CLiMF computation class.
  • The MasterCompute class for the CLiMF algorithm (optional).
  • The input format class.
  • $INPUT_RATINGS: Location of the input edges file on HDFS.
  • The output format class.
  • $OUTPUT: Directory where the output file is stored on HDFS.
  • $WORKERS: Number of workers (integer).

    These parameters form the following execution command line:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
           -libjars $OKAPI_JAR \
           ml.grafos.okapi.cf.ranking.CLiMFRankingComputation \
           -eif ml.grafos.okapi.cf.CfLongIdFloatTextInputFormat \
           -eip $INPUT_RATINGS \
           -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
           -op $OUTPUT \
           -w $WORKERS

    In the above command line, we do not set any configuration parameters, thus the default values are used. To change one or more configuration parameters you only have to add one line per parameter at the end of the command line. The new line should look like this:

    -ca parameter_name=new_value

    Publications

    CLiMF: Learning to Maximize Reciprocal Rank with Collaborative Less-is-More Filtering, Y. Shi, A. Karatzoglou, L. Baltrunas, M. Larson, N. Oliver, A. Hanjalic, ACM RecSys 2012.

    Contributors

    Alexandros

     

    TFMAP

    TFMAP is a collaborative filtering method for top-N context-aware recommendation that learns latent factors by directly optimizing Mean Average Precision (MAP).

    The algorithm is in beta, and is not guaranteed to produce the correct results.

    Publications

    Shi, Yue, et al. "TFMAP: Optimizing MAP for top-n context-aware recommendation." Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2012.

    Contributors

    Linas, Alexandros

    Popularity Ranking

    Computes a popularity ranking based on the number of times each item has been rated or viewed by users. The idea is to represent users and items as a bipartite graph and compute how many messages each item receives, which equals the number of users that rated the item.
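
    The computation is equivalent to the following single-machine sketch (illustrative only):

    from collections import Counter

    def popularity_ranking(ratings):
        # ratings: iterable of (userID, itemID, rating) tuples
        counts = Counter(item for _, item, _ in ratings)
        return counts.most_common()   # items sorted by number of ratings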

    Publications

    S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. UAI ’09, pages 452–461, 2009.

    Contributors

    Linas

     

    Random

    Adding more documentation soon...

    Contributors

    Linas, Alexandros

    Mining large-scale graphs has become important in several contexts, such as Online Social Network analysis, biological network analysis and Call Detail Record mining. Such graphs may consist of billions of vertices and edges, which makes computing even simple metrics challenging. Okapi provides a set of useful algorithms, including clustering, graph partitioning, fake-account detection in OSNs and others. We are constantly enriching this library with new algorithms.

    Try them and drop us a line if you think there are more algorithms that would be useful to have. Feel free to contribute yourself too! Our goal is to grow this list to make a rich graph analytics toolkit.


     

    Adamic-Adar similarity

    The Adamic-Adar similarity represents how similar two nodes of a graph are, based on their common neighbors. However, it extends the simple counting of common neighbors, by weighting the similarity value according to the nodes' degrees. In particular, it is computed as the sum of the log of the inverse degree of each common neighbor of the two nodes.

    Negating the computed metric gives a distance measure, with values in the range [0, INF].

    This implementation computes the Adamic-Adar index only for existing edges of the input graph and not for every pair of nodes.

    The computation consists of three supersteps. During the first superstep, each vertex computes the logarithm of its inverse degree and sets this value as its vertex value. In the second superstep, each vertex sends a list of its friends and its vertex value to all its neighbors in the graph. In the third superstep, each vertex computes the Adamic-Adar index value for each of its edges. If the configuration option for distance conversion is enabled, an additional superstep is executed, where the Adamic-Adar index is converted to a distance index by negating the value.

    If the input graph is large, the computation of the Adamic-Adar index for all the edges might be very expensive. For this reason, an approximate implementation is also provided. In this implementation, each vertex uses a BloomFilter to store its neighborhood and sends this filter as a message to its neighbors, instead of the exact friend list. The hash functions to use and the size of the BloomFilter are configurable.
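
    The per-edge computation is equivalent to the following single-machine sketch (illustrative only; it follows the definition above, summing the log of the inverse degree over the common neighbors):

    import math

    def adamic_adar(adj, u, v):
        # adj: dict vertex -> set of neighbors
        common = adj[u] & adj[v]
        return sum(math.log(1.0 / len(adj[z])) for z in common)

    def adamic_adar_distance(index):
        return -index   # negation turns the index into a distance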

    Input format

    The input format is an edge list, containing one edge per line, in the format srcID<tab>targetID where the IDs are of type Long.

    Output Format

    The output of this program is the edges of the graph with their Adamic-Adar index, one edge per line, in the form: srcID<tab>targetID<tab>similarity where the IDs are of type Long and similarity is Double.

    Configuration parameters

  • adamicadar.approximation.enabled: When set to true, the bloom filter implementation will be used and the program will compute the approximate Adamic-Adar index. The default value is false.
  • adamicadar.bloom.filter.bits: Size of the bloom filter, in bits. The default value is 16.
  • adamicadar.bloom.filter.functions: The number of hash functions to use for the bloom filter. The default value is 1.
  • adamicadar.bloom.filter.hash.type: The type of the hash functions to use for the bloom filter. The default value is Hash.MURMUR_HASH.
  • distance.conversion.enabled: When set to true, the program will compute the Adamic-Adar distance. The default value is false.

    Running

    To run the exact Adamic-Adar computation:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      ml.grafos.okapi.graphs.AdamicAdar\$ComputeLogOfInverseDegree  \
      -mc  ml.grafos.okapi.graphs.AdamicAdar\$MasterCompute  \
      -eif ml.grafos.okapi.io.formats.LongDoubleZerosTextEdgeInputFormat  \
      -eip $INPUT_EDGES \
      -eof org.apache.giraph.io.formats.SrcIdDstIdEdgeValueTextOutputFormat \
      -op $OUTPUT \
      -w $WORKERS \
      -ca giraph.oneToAllMsgSending=true \
      -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges 
    

    To run the approximate Adamic-Adar computation:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      ml.grafos.okapi.graphs.AdamicAdar\$ComputeLogOfInverseDegree  \
      -mc  ml.grafos.okapi.graphs.AdamicAdar\$MasterCompute  \
      -eif ml.grafos.okapi.io.formats.LongDoubleZerosTextEdgeInputFormat  \
      -eip $INPUT_EDGES \
      -eof org.apache.giraph.io.formats.SrcIdDstIdEdgeValueTextOutputFormat \
      -op $OUTPUT \
      -w $WORKERS \
      -ca giraph.oneToAllMsgSending=true \
      -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges  \
      -ca adamicadar.approximation.enabled=true
    

  • $OKAPI_JAR: Location of the Okapi jar file.
  • $INPUT_EDGES: Location of input edges on HDFS.
  • $OUTPUT: Directory to output the computed similarities.
  • $WORKERS: Number of Giraph workers.

    Publications

    Friends and neighbors on the web, Lada A. Adamic and Eytan Adar. Social Networks, 25(3):211–230, July 2003.


    Contributors

    Vasia

     

    B-matching

    Given a weighted undirected graph with integer capacities b(v) assigned to each vertex v, the maximum b-matching problem is to select a subgraph of maximum weight such that the number of edges incident to each vertex in the subgraph does not exceed its capacity.

    This implementation is a parallel version of a greedy algorithm that provides a 1/2-approximation guarantee.

    The greedy algorithm sorts all edges in the graph by decreasing weight and then processes them one by one. An edge is included in the solution if the capacities of both its endpoint vertices are larger than zero. Once an edge is included, the capacities of both its endpoint vertices are decreased by one.
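
    The sequential greedy algorithm is equivalent to the following sketch (illustrative only):

    def greedy_b_matching(edges, b):
        # edges: list of (weight, u, v); b: dict vertex -> remaining capacity
        matched = []
        for w, u, v in sorted(edges, reverse=True):
            if b[u] > 0 and b[v] > 0:
                matched.append((u, v, w))
                b[u] -= 1
                b[v] -= 1
        return matched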

    The parallel algorithm works as follows. At each superstep, each vertex v proposes its b(v) edges with maximum weight to its neighbors. At the same time, each vertex computes the intersection between its own proposals and the proposals of its neighbors. The set of edges in the intersection is included in the final solution. Then, each vertex updates its capacity b(v). If it becomes 0, the vertex is removed from the graph.

    This is the pseudocode describing the algorithm:

    1: while E is non empty do
    2:   for all v ∈ V in parallel do
    3:     Let L_v be the set of b(v) edges incident to v with maximum weight;
    4:     Let F be L_v ∩ L_U where U={ u ∈ V : ∃ e(v,u) ∈ E } is the set of vertices sharing an edge with v; -- the intersection
    5:     Update M ← M ∪ F;
    6:     Update E ← E\F;
    7:     Update b(v) ← b(v) − b_F(v);
    8:     If b(v) = 0 remove v from V and remove all edges incident to v from E;
    9:   end for
    10: end while
    11: return M;
    

    Input format

    The input is an undirected graph in adjacency list format, one vertex per line. The algorithm assumes that each edge is present twice in the input, once for each end-point. Each line consists of a vertex followed by a list of edges. The vertex consists of a long ID and an integer capacity. Each edge consists of a long ID (the destination vertex ID) and a double weight. The elements of the list are tab or space separated.

    For instance, the line 1 5 2 0.5 3 0.1 creates a vertex with id=1, capacity=5, and two edges to vertices 2 and 3, with weights 0.5 and 0.1, respectively.

    Output Format

    The output is the matched subgraph.

    Configuration parameters

    None.

    Running

    This is the basic B-matching execution command line:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      ml.grafos.okapi.graphs.maxbmatching.MaxBMatching  \
      -vif ml.grafos.okapi.graphs.maxbmatching.MBMTextInputFormat \
      -vip $INPUT_VERTICES \
      -vof ml.grafos.okapi.graphs.maxbmatching.MBMTextOutputFormat \
      -op $OUTPUT \
      -w $WORKERS \
      -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges
    

  • $OKAPI_JAR: Location of the Okapi jar file.
  • $INPUT_VERTICES: Location of input vertices on HDFS.
  • $OUTPUT: Directory to output the computed matching.
  • $WORKERS: Number of Giraph workers.

    Publications

    Social Content Matching in MapReduce, G. De Francisci Morales, A. Gionis, M. Sozio, Proceedings of the VLDB Endowment, 4(7):460-469, 2011.


    Contributors

    Gianmarco

     

    Clustering Coefficient

    The clustering coefficient measures how well the vertices of a graph are connected to each other. There are two types of clustering coefficient metrics. The local clustering coefficient is defined for each vertex and measures how close a vertex and its neighbors are to a clique. Given a vertex v, the local clustering coefficient C(v) is defined as:

    C(v) = N/Nmax

    where N is the number of edges between vertices that are neighbors of v and Nmax is the maximum possible number of edges between vertices that are neighbors of v. If k is the number of neighbors of v then in directed graphs:

    C(v) = N/(k(k-1))

    In undirected graphs:

    C(v) = N/(k(k-1)/2)

    The global clustering coefficient is used to measure the overall clustering of the vertices in the graph and is defined as the average of the local clustering coefficients across all vertices in a graph.

    This implementation computes the local clustering coefficient for every vertex in the graph as well as the global clustering coefficient. The local clustering coefficients are written as the output of the job, while the global clustering coefficient is set as a Hadoop counter. You can check its value in the standard output of the terminal or in the Hadoop web interface.

    This computation works for both directed and undirected graphs.
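
    For the undirected case, the local coefficient of a single vertex can be computed as in the following sketch (illustrative only):

    from itertools import combinations

    def local_clustering(adj, v):
        # adj: dict vertex -> set of neighbors (undirected graph)
        neighbors = adj[v]
        k = len(neighbors)
        if k < 2:
            return 0.0
        # count edges between pairs of neighbors of v
        links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
        return links / (k * (k - 1) / 2.0)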

    Note that because every vertex sends its friend list to each of its neighbors, this algorithm can incur a high memory footprint and network traffic.

    Input format

    The input to the Clustering Coefficient algorithm is a set of edges describing the graph, with the following format:

    srcID dstID

    where:
  • srcID and dstID are identifiers of type long representing nodes.

    Output Format

    For each vertex, the algorithm outputs its local clustering coefficient in the format:

    ID coefficient

    The global clustering coefficient is set as a Hadoop counter that you can see either at your terminal output or the Hadoop web interface when the execution of the algorithm finishes.

    Configuration parameters

    None.

    Running

    This is the basic Clustering Coefficient execution command line:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      -libjars $OKAPI_JAR \
      ml.grafos.okapi.graphs.ClusteringCoefficient\$SendFriendsList  \
      -mc  ml.grafos.okapi.graphs.ClusteringCoefficient\$MasterCompute  \
      -eif ml.grafos.okapi.io.formats.LongNullTextEdgeInputFormat  \
      -eip $INPUT_EDGES \
      -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
      -op $OUTPUT \
      -w $WORKERS \
      -ca giraph.oneToAllMsgSending=true
    

  • $OKAPI_JAR: Location of the Okapi jar file.
  • $INPUT_EDGES: Location of input edges on HDFS.
  • $OUTPUT: Directory to output the computed clustering coefficients.
  • $WORKERS: Number of Giraph workers.

    Note that because this algorithm can incur a large memory footprint and network traffic, we suggest you enable the giraph.oneToAllMsgSending Giraph option. This option optimizes algorithms where vertices send messages to all their neighbors.

    WWW

    Wikipedia - Clustering Coefficient


    Contributors

    Dionysis

     

    Graph partitioning

    We implement the Spinner graph partitioning algorithm. Spinner computes an edge-based balanced k-way partitioning of a graph and is designed for massive graphs (e.g. up to 1 billion vertices and beyond). It supports a number of interesting features, such as incremental computation and the ability to adapt a partitioning to changes to the graph (e.g. vertices and edges are added or removed) and to the partitions (e.g. partitions are added or removed). These features allow Spinner to partition massive graphs and keep the partitioning up-to-date with minimal re-computation.

    The partitioning produced by Spinner can be used as an internal graph partitioning inside of Giraph, to minimise network communication between workers, and maximise load balance.

    Spinner splits the vertices across k partitions, trying to maximise locality and balancing. The former means that vertices that are connected through an edge tend to be assigned to the same partition. The latter means that a similar number of edges is assigned to each partition.

    Spinner is based on iterative vertex migrations, with decisions taken independently based on (per-vertex) local information. When a graph is partitioned for the first time, vertices are assigned randomly to partitions. After initialisation, vertices communicate their partition to their neighbours through messages, and they try to migrate to the partition where most of their neighbours occur. This pattern is also known as label-propagation.
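
    One migration round is essentially a weighted label-propagation step, as in the following simplified sketch (illustrative only; it ignores Spinner's balance constraints and probabilistic migration decisions):

    from collections import Counter

    def migration_step(adj, partition):
        # adj: dict vertex -> list of neighbors; partition: dict vertex -> label
        updated = {}
        for v, neighbors in adj.items():
            counts = Counter(partition[u] for u in neighbors)
            # move to the partition where most neighbors currently are
            updated[v] = (counts.most_common(1)[0][0]
                          if neighbors else partition[v])
        return updated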

    Spinner is also able to adapt a previously computed partitioning to changes to the graph or to the number of partitions, hence practically avoiding re-partitioning the graph from scratch upon changes.

    Input format

    Spinner assumes long integers as vertex IDs. Hence, you can load a graph through SpinnerEdgeInputFormat (but in principle you can easily write a VertexInputFormat with the same I V M signature). The format of the input file is simple, and it requires a pair of vertex IDs per line, divided by either a space or a tab.

    To load a previous partitioning, you also need to use SpinnerVertexValueInputFormat, which only reads vertex-partition assignments computed during a previous partitioning. The format of this file is also simple, and it requires a pair composed of a vertex ID and the partition id, divided by either a space or a tab.

    Output Format

    Spinner emits vertex-partition assignments as the result of the computation. Such output can be used as-is for future partitionings, or it can be translated into partition assignments for running a Giraph computation that takes advantage of Spinner’s computed partitioning. The class to write the partitioning to HDFS is SpinnerVertexValueOutputFormat.

    Configuration parameters

    Spinner only requires you to specify the number of partitions k into which the graph should be split. However, a number of additional parameters are supported by the class. See the paper for more details about the effect of each parameter.

  • spinner.numberOfPartitions: the number of partitions to split the graph to. The default value is 32.
  • spinner.maxIterations: the maximum number of iterations to run if convergence is not reached. The default value is 290.
  • spinner.repartition: the number of partitions to add or remove compared to a previously computed partitioning, either +n or -n (default 0). This parameter, if set, will trigger a repartitioning due to a change to the number of partitions. In this case, spinner.numberOfPartitions should contain the current number of partitions. Keep in mind that you can re-balance a partitioning to a different number of partitions only if the graph has been already partitioned (i.e. you cannot add edges and vertices AND change the number of partitions).

    Running

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      -libjars $OKAPI_JAR \
      ml.grafos.okapi.spinner.Spinner\$ConverterPropagate \
      -mc ml.grafos.okapi.spinner.Spinner\$PartitionerMasterCompute \
      -eif ml.grafos.okapi.spinner.Spinner\$SpinnerEdgeInputFormat \
      -eip $INPUT_EDGES \
      -vif ml.grafos.okapi.spinner.Spinner\$SpinnerVertexValueInputFormat \
      -vip $INPUT_PARTITIONING \
      -vof ml.grafos.okapi.spinner.Spinner\$SpinnerVertexValueOutputFormat \
      -op $OUTPUT \
      -w $WORKERS \
      -ca giraph.outEdgesClass=ml.grafos.okapi.spinner.OpenHashMapEdges
    
  • $OKAPI_JAR: Location of the Okapi jar.
  • $INPUT_EDGES: Directory of the input edges on HDFS.
  • $INPUT_PARTITIONING: Directory where the previous partitioning was stored (this is necessary only if we’re adapting a previous partitioning).
  • $OUTPUT: Directory where the computed partitioning will be stored on HDFS.
  • $WORKERS: Number of workers (integer).

    Note that the -vif and -vip options are necessary only when a previous partitioning is adapted to new changes.

    Publications

    Spinner: Large-scale Graph Partitioning for the Cloud, Claudio Martella, Dionysios Logothetis, Georgos Siganos, arXiv 1404.3861, April 2014.

    Contributors

    Claudio

    Jaccard similarity

    The Jaccard similarity represents how similar two nodes of a graph are, based on their common neighbors. In particular, it is computed as the size of the intersection of the nodes’ neighbor sets over the size of the union of the nodes’ neighbor sets. The similarity value between two nodes is a number in the range [0, 1]. The closer the value is to 1, the more similar the nodes.

    The same metric can be easily converted to a distance with values in the range [0, INF].

    This implementation only computes similarity or distance for existing edges of the input graph and not for every pair of nodes.

    The computation consists of 2 supersteps. During the first superstep, each vertex sends a list of its friends to all its neighbors in the graph. In the second superstep, each vertex computes the Jaccard similarity value for each of its edges. For every message received, it compares the friend list with its own neighborhood and computes common and total number of friends. It then sets the edge value to the Jaccard similarity index. If the configuration option for distance conversion is enabled, then an additional superstep is executed, where the Jaccard index is converted to a distance index, using the function f(x) = (1/x) - 1.
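
    The per-edge computation is equivalent to this sketch (illustrative only):

    def jaccard(adj, u, v):
        # adj: dict vertex -> set of neighbors
        union = adj[u] | adj[v]
        return len(adj[u] & adj[v]) / len(union) if union else 0.0

    def jaccard_distance(similarity):
        # f(x) = (1/x) - 1; zero similarity maps to infinite distance
        return float("inf") if similarity == 0 else 1.0 / similarity - 1.0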

    If the input graph is large, the computation of the Jaccard index for all the edges might be very expensive. For this reason, an approximate implementation is also provided. In this implementation, each vertex uses a BloomFilter to store its neighborhood and sends this filter as a message to its neighbors, instead of the exact friend list. The hash functions to use and the size of the BloomFilter are configurable.

    Input format

    The input format is an edge list, containing one edge per line, in the format srcID<tab>targetID where the IDs are of type Long.

    Output Format

    The output of this program is the edges of the graph with their Jaccard index, one edge per line, in the form: srcID<tab>targetID<tab>similarity where the IDs are of type Long and similarity is Double.

    Configuration parameters

  • jaccard.approximation.enabled: When set to true, the bloom filter implementation will be used and the program will compute the approximate Jaccard index. The default value is false.
  • jaccard.bloom.filter.bits: Size of the bloom filter, in bits. The default value is 16.
  • jaccard.bloom.filter.functions: The number of hash functions to use for the bloom filter. The default value is 1.
  • jaccard.bloom.filter.hash.type: The type of the hash functions to use for the bloom filter. The default value is Hash.MURMUR_HASH.
  • distance.conversion.enabled: When set to true, the program will compute the Jaccard distance. The default value is false.

    Running

    To run the exact Jaccard computation:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      ml.grafos.okapi.graphs.Jaccard\$SendFriendsList  \
      -mc  ml.grafos.okapi.graphs.Jaccard\$MasterCompute  \
      -eif ml.grafos.okapi.io.formats.LongDoubleZerosTextEdgeInputFormat  \
      -eip $INPUT_EDGES \
      -eof org.apache.giraph.io.formats.SrcIdDstIdEdgeValueTextOutputFormat \
      -op $OUTPUT \
      -w $WORKERS \
      -ca giraph.oneToAllMsgSending=true \
      -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges 
    

    To run the approximate Jaccard computation:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      ml.grafos.okapi.graphs.Jaccard\$SendFriendsBloomFilter  \
      -mc  ml.grafos.okapi.graphs.Jaccard\$MasterCompute  \
      -eif ml.grafos.okapi.io.formats.LongDoubleZerosTextEdgeInputFormat  \
      -eip $INPUT_EDGES \
      -eof org.apache.giraph.io.formats.SrcIdDstIdEdgeValueTextOutputFormat \
      -op $OUTPUT \
      -w $WORKERS \
      -ca giraph.oneToAllMsgSending=true \
      -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges  \
      -ca jaccard.approximation.enabled=true
    

  • $OKAPI_JAR: Location of the Okapi jar file.
  • $INPUT_EDGES: Location of input edges on HDFS.
  • $OUTPUT: Directory to output the computed similarities.
  • $WORKERS: Number of Giraph workers.

    WWW

    Wikipedia - Jaccard index


    Contributors

    Vasia, Dionysios

     

    K-core

    The k-core of a graph is the subgraph in which all vertices have degree of at least k.

    We find the k-core of a graph by iteratively removing vertices with degree less than k.

    The algorithm stops when there are no more vertex removals. At the end of the execution, the remaining graph represents the k-core. It is possible that the result is an empty graph.
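
    The iterative peeling is equivalent to this single-machine sketch (illustrative only):

    def k_core(adj, k):
        # adj: dict vertex -> set of neighbors (undirected)
        adj = {v: set(ns) for v, ns in adj.items()}
        removed = True
        while removed:
            low = [v for v, ns in adj.items() if len(ns) < k]
            removed = bool(low)
            for v in low:
                for u in adj.pop(v):
                    if u in adj:
                        adj[u].discard(v)
        return adj   # possibly empty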

    Input format

    The input is a set of edges where vertex IDs are of type long.

    Output Format

    If a k-core exists, the output contains the k-core in an adjacency list format.

    Configuration parameters

  • core.k: The size of the graph core to find.

    Running

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      -libjars $OKAPI_JAR \
      ml.grafos.okapi.graphs.KCore\$KCoreComputation \
      -eif ml.grafos.okapi.io.formats.LongNullTextEdgeInputFormat \
      -eip $INPUT_EDGES \
      -vof ml.grafos.okapi.io.formats.LongNullNullAdjacencyListTextVertexOutputFormat \
      -op $OUTPUT \
      -w $WORKERS \
      -ca giraph.vertexResolverClass=ml.grafos.okapi.graphs.KCore\$KCoreVertexResolver \
      -ca core.k=3
    
  • $OKAPI_JAR: Location of the Okapi jar.
  • $INPUT_EDGES: Directory of the input edges on HDFS.
  • $OUTPUT: Directory where the computed core will be stored on HDFS.
  • $WORKERS: Number of workers.

    WWW

    Wikipedia - Graph degeneracy

    Publications

    Using Pregel-like Large Scale Graph Processing Frameworks for Social Network Analysis, Louise Quick, Paul Wilkinson, David Hardcastle, IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2012.

    Contributors

    Dionysis

    Multi-source shortest paths

    This algorithm takes as input a weighted graph and computes the shortest distances from a selected set of source vertices to all other vertices in the graph.

    This can be used, for instance, to compute distances from a set of "landmark" vertices in a graph. The distances from the landmarks may then be used to compute other metrics such as approximate all-pair shortest distances and diameter estimation.

    Although this can be done by running the single-source shortest paths algorithm multiple times, this algorithm is more efficient.

    Every vertex maintains a map with the current shortest distances from each source. Initially, every source propagates its distance to its neighbors. After that, if the distance from one of the sources changes, a vertex propagates only the change with respect to that source.

    The execution finishes when no changes occur. At the end every vertex's value holds the distance from each source. If a vertex does not contain a specific source ID, then its distance to the specific source is considered to be infinite.
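
    Conceptually, the computation is a per-source shortest-path relaxation running in parallel, as in this sketch (illustrative only; it assumes every vertex appears as a key of adj):

    import heapq

    def multi_source_shortest_paths(adj, sources):
        # adj: dict vertex -> list of (neighbor, weight)
        dist = {v: {} for v in adj}          # vertex -> {source: distance}
        heap = []
        for s in sources:
            dist[s][s] = 0.0
            heapq.heappush(heap, (0.0, s, s))
        while heap:
            d, v, s = heapq.heappop(heap)
            if d > dist[v].get(s, float("inf")):
                continue                     # stale heap entry
            for u, w in adj[v]:
                if d + w < dist[u].get(s, float("inf")):
                    dist[u][s] = d + w
                    heapq.heappush(heap, (d + w, u, s))
        return dist   # missing source key => infinite distance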

    Note that the larger the number of sources selected, the larger the computational overhead, memory footprint and network traffic incurred by this algorithm, so be cautious about how many sources you select.

    The user can either define a fraction of vertices to be selected randomly as sources, or explicitly define a list of source IDs separated by ':'. If none are specified, the vertex with ID=1 will be selected as the single source.

    Input format

    The input to the algorithm is a weighted graph specified as a set of edges, with the following format:

    srcID dstID weight

    where:
  • srcID and dstID are identifiers of type long representing nodes.
  • weight is the edge weight of type float.

    Output Format

    For each vertex, the algorithm outputs its distance from each source in the format:

    ID (src1, distance1)(src2, distance2) ...

    Configuration parameters

  • sources.fraction: fraction of vertices to select randomly as sources.
  • sources.list: list of vertex IDs separated by ':' to select as sources.

    If no fraction or explicit list is defined, then the vertex with ID 1 will be selected as the single source.

    Running

    This is the basic multi-source shortest paths execution command line:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      -libjars $OKAPI_JAR \
      ml.grafos.okapi.graphs.MultiSourceShortestPaths\$InitSources  \
      -mc  ml.grafos.okapi.graphs.MultiSourceShortestPaths\$MasterCompute  \
      -eif ml.grafos.okapi.io.formats.LongFloatTextEdgeInputFormat  \
      -eip $INPUT_EDGES \
      -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
      -op $OUTPUT \
      -w $WORKERS \
      -ca sources.fraction=0.1
    

  • $OKAPI_JAR: Location of the Okapi jar file.
  • $INPUT_EDGES: Location of input edges on HDFS.
  • $OUTPUT: Directory to output the computed distances.
  • $WORKERS: Number of Giraph workers.

    Contributors

    Dionysis

     

    Semi-Clustering

    In social graphs, vertices represent users and edges represent connections between them. Edges may have weights that represent the strength of the connection between two users.

    A semi-cluster is formed by users who have stronger connections with each other compared to other users. The main difference between a semi-cluster and a typical cluster is that a vertex may belong to more than one semi-cluster.

    This algorithm receives a graph as input and computes semi-clusters based on its edge weights.

    The main objective of semi-clustering is to create semi-clusters of vertices that have the highest interaction among them. The logic behind this lies in the calculation of a score for each semi-cluster and the dismissal of the semi-clusters with the lowest scores. A semi-cluster score is calculated by the equation Sc = (Ic - fB * Bc) / (Vc * (Vc - 1) / 2), where:

    Ic: the sum of weights of all internal edges.

    Bc: the sum of weights of all outgoing edges.

    Vc: the number of vertices in the semi-cluster.

    fB: the boundary edge score factor, in the range [0..1]. This is specified by the parameter score.factor.
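
    In code, the score computation looks like this (a sketch; giving singleton clusters a score of 0 is our own assumption):

    def semicluster_score(i_c, b_c, v_c, f_b=0.5):
        # i_c: internal edge weight; b_c: boundary edge weight;
        # v_c: number of vertices; f_b: boundary edge score factor
        if v_c < 2:
            return 0.0   # assumed handling for single-vertex clusters
        return (i_c - f_b * b_c) / (v_c * (v_c - 1) / 2.0)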

    Each vertex maintains a list of semi-clusters to which it belongs. The list contains a maximum number of semi-clusters, Cmax, defined by the user, and is sorted by score. In each superstep, every vertex:

  • Receives messages, which are clusters created in previous supersteps.
  • For each message, checks whether it is included in the message/cluster. If not, it adds itself to the cluster and calculates the score for the newly created semi-cluster. At any moment, there are at most Cmax clusters; hence, if a new semi-cluster has a higher score than the last semi-cluster in the vertex's list, the latter is discarded. Note that a vertex cannot add itself to a cluster that already contains the maximum number of vertices, defined by the user.
  • Sends a message with its updated list of clusters.

    In the first superstep, each vertex creates an empty cluster and adds itself to it. Thus, the first set of messages sent in the network contains the vertices themselves.

    Input format

    The input to the Semi-Clustering algorithm is a weighted, undirected graph. Note that a user cannot be connected to itself; thus lines containing the same userID twice should be avoided.

    Output Format

    The output consists of a line for each vertex in the graph, containing the IDs of the clusters the vertex belongs to.

    Configuration parameters

  • iterations: the maximum number of iterations to perform. The default value is 10.
  • max.clusters: the maximum number of clusters to shape. The default value is 2.
  • cluster.capacity: the maximum number of vertices a cluster can have. The default value is 4.
  • score.factor: the boundary edge score factor. The default value is 0.5.

    Running

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      -libjars $OKAPI_JAR \
      ml.grafos.okapi.graphs.SemiClustering \
      -eif ml.grafos.okapi.io.formats.LongDoubleTextEdgeInputFormat \
      -eip $INPUT_EDGES \
      -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
      -op $OUTPUT \
      -w $WORKERS

    Publications

    Pregel: a system for large-scale graph processing, Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski, Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010.

    Contributors

    Maria, Dionysis

    SybilRank

    The SybilRank algorithm is used in Online Social Network analysis to detect fake accounts. When seeded with a set of known trusted accounts, it ranks the nodes in the graph according to their probability of being fake.

    The algorithm consists of three types of computation:

  • TrustedNodeAggregation: In the first step, if a vertex is trusted, it updates the NUM_TRUSTED aggregator.
  • Initializer: Initializes the rank according to whether a node is trusted or not, and then starts the power iterations by distributing the rank of each vertex.
  • SybilRankComputation: The implementation of the power iteration. It processes messages, updates state and sends messages (see the sketch below).
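
    The following sketch shows the essence of the power iteration (illustrative only — our own naming, not the Okapi classes): trust seeded on the trusted nodes is spread along edges proportionally to their weights for about log(N) rounds, and nodes are finally ranked by their degree-normalized trust:

    import math

    def sybil_rank(adj, trusted, iterations=None):
        # adj: dict node -> dict of neighbor -> edge weight (undirected)
        n = len(adj)
        iterations = iterations or int(math.log(n)) + 1
        trust = {v: (1.0 / len(trusted) if v in trusted else 0.0) for v in adj}
        for _ in range(iterations):
            spread = {v: 0.0 for v in adj}
            for v, neighbors in adj.items():
                if not neighbors:
                    continue
                total = sum(neighbors.values())
                for u, w in neighbors.items():
                    spread[u] += trust[v] * w / total
            trust = spread
        # lower degree-normalized trust => more likely to be fake
        return {v: trust[v] / (sum(adj[v].values()) or 1.0) for v in adj}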

    Input format

    The input to the algorithm consists of two sets:

  • The graph structure specified as a set of weighted edges of the form:

    srcID dstID weight

    The algorithm assumes that the graph is undirected. We add an initial computation superstep that ensures that the input graph is converted to an undirected one. For instance, if the input contains only edge A->B with weight W, then the first computation adds the edge B->A as well with the same weight W.
  • The set of vertices that are considered trusted, used to seed the algorithm. This is just a set of IDs that can be located in a single file or distributed across multiple files in an HDFS directory.

    Configuration parameters

  • sybilrank.beta: The number of power iterations that the algorithm is going to perform. If no value is specified, the algorithm will set this value to logN, where N is the number of vertices in the graph.

    Running

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      -libjars $OKAPI_JAR \
      ml.grafos.okapi.graphs.SybilRank\$TrustAggregation \
      -mc ml.grafos.okapi.graphs.SybilRank\$SybilRankMasterCompute \
      -eif ml.grafos.okapi.io.formats.LongDoubleTextEdgeInputFormat \
      -eip $INPUT_EDGES \
      -vif ml.grafos.okapi.graphs.SybilRank\$SybilRankVertexValueInputFormat \
      -vip $TRUSTED_VERTICES \
      -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
      -op $OUTPUT \
      -w $WORKERS
    

    Publications

    Aiding the detection of fake accounts in large scale social online services, C. Qiang, M. Sirivianos, X. Yang, T. Pregueiro, In NSDI 2012.

    Integro: Leveraging Victim Prediction for Robust Fake Account Detection in OSNs, Y. Boshmaf, D. Logothetis, G. Siganos, J. Leria, J. Lorenzo, M. Ripeanu, K. Beznosov, In Network and Distributed System Security Symposium (NDSS’15).

    Contributors

    Dionysis, Yazan

    Triangles

    We have implemented a set of algorithms related to the existence of triangles in graphs. More specifically, we provide (i) an implementation that counts the unique triangles in a graph and (ii) an implementation that enumerates the actual triangles in a graph.

    The implementations of the triangle-related metrics are based on ideas from the papers listed in the Publications section below.
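
    Counting unique triangles amounts to the following sketch (illustrative only): each triangle is discovered exactly once, at its highest-ID vertex, which matches the output format described below:

    from itertools import combinations

    def count_triangles_per_vertex(adj):
        # adj: dict vertex -> set of neighbors (undirected); IDs are comparable
        counts = {}
        for v in adj:
            lower = sorted(u for u in adj[v] if u < v)
            counts[v] = sum(1 for a, b in combinations(lower, 2)
                            if b in adj[a])
        return counts   # sum of values = total number of unique triangles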

    Input format

    We assume that the input graph is unweighted and undirected.

    The input is a set of edges, with the following format:

    srcID dstID

    where:
  • srcID and dstID are identifiers of type long representing nodes.

    Output Format

    When counting the number of unique triangles this algorithm outputs a line for each vertex that participates in a triangle and is the vertex with the highest ID in the triangle. Such a line has the format: id count

    This is a vertex ID and the number of times it participates in a triangle as the vertex with the highest ID. Summing up the counts in the second column across the entire output gives you the total number of unique triangles in the graph.

    When enumerating the triangles this algorithm outputs a line for each vertex that participates in a triangle and is the vertex with the highest ID in the triangle. Such a line has the format: id [[id11 id12] [id21 id22] ...]

    This is the vertex ID and a list of pairs, each of which forms a triangle with this vertex. For instance, the line:

    7 [[2 3] [1 4]]

    means that vertex 7 forms a triangle with vertices 2 and 3 and another triangle with vertices 1 and 4.

    Configuration parameters

    None.

    Running

    You can compute the different triangle-related metrics by specifying the right implementation of the MasterCompute class.

    To count the number of unique triangles in the graph, run the following command:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
           ml.grafos.okapi.graphs.Triangles\$Initialize  \
           -mc  ml.grafos.okapi.graphs.Triangles\$TriangleCount  \
           -eif ml.grafos.okapi.io.formats.LongNullTextEdgeInputFormat  \
           -eip $INPUT_EDGES \
           -vof ml.grafos.okapi.graphs.Triangles\$TriangleOutputFormat \
           -op $OUTPUT \
           -w $WORKERS \
           -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges
           

    To enumerate all the unique triangles in the graph, run the following command:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
           ml.grafos.okapi.graphs.Triangles\$Initialize  \
           -mc  ml.grafos.okapi.graphs.Triangles\$TriangleFind  \
           -eif ml.grafos.okapi.io.formats.LongNullTextEdgeInputFormat  \
           -eip $INPUT_EDGES \
           -vof ml.grafos.okapi.graphs.Triangles\$TriangleOutputFormat \
           -op $OUTPUT \
           -w $WORKERS \
           -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges
           

  • $OKAPI_JAR: Location of the Okapi jar file.
  • $INPUT_EDGES: Location of input edges on HDFS.
  • $OUTPUT: Directory to output the triangle counts or lists.
  • $WORKERS: Number of Giraph workers.
    Publications

    Investigating Graph Algorithms in the BSP Model on the Cray XMT, D. Ediger, D. Bader, 7th Workshop on Multithreaded Architectures and Applications (MTAAP), Boston, MA, May 24, 2013.

    Using Pregel-like Large Scale Graph Processing Frameworks for Social Network Analysis, Louise Quick, Paul Wilkinson, David Hardcastle, IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2012.

    Contributors

    Dionysis

     

    Triangle semi-metricity

    In a weighted graph, we say that a direct edge between two nodes is semi-metric if there exists an indirect path between these two nodes with a shorter distance. This program takes as input an undirected, weighted graph and removes all 1st-order semi-metric edges from it, that is, all semi-metric edges for which the shorter indirect path has length two.

    To find all 1st-order semi-metric edges, we first find all the triangles in the graph and then compare their edge weights. If vertices A, B, C form a triangle, then edge AB is semi-metric if D(A,B) > D(A,C) + D(C,B), where D(u,v) is the weight of the edge (u, v).
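
    The test itself is a plain comparison. The following Java helper (a hypothetical name, not part of the Okapi code) makes it explicit; for a triangle on vertices A, B, C it is applied once per edge:

    // Hypothetical helper class, not part of Okapi.
    class SemiMetricSketch {
      // The direct edge is 1st-order semi-metric iff the two-hop path
      // through the third triangle vertex is strictly shorter.
      static boolean isSemiMetric(double direct, double hop1, double hop2) {
        return direct > hop1 + hop2;
      }
    }

    // For a triangle (A, B, C) with weights dAB, dAC, dCB:
    //   isSemiMetric(dAB, dAC, dCB)   tests edge AB
    //   isSemiMetric(dAC, dAB, dCB)   tests edge AC
    //   isSemiMetric(dCB, dAC, dAB)   tests edge CB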

    The output of this algorithm is the input graph without the 1st-order semi-metric edges.

    We provide two implementations: a basic one, and an optimized one that reduces the number of messages that have to be sent at the same time, making better use of memory.

    Basic implementation

    The algorithm consists of four supersteps and discovers each triangle exactly once. We assume numeric vertex IDs and a total ordering on them. In the first superstep, every node propagates its ID to all its neighbors with higher IDs. In the second superstep, each node retrieves the ID from each received message and the weight of the edge that is formed between this node and the message sender. Then, the node constructs a new message with the sender ID, the edge weight and its own ID, and sends this message to all the neighbors with higher IDs. In the third superstep, for every message received, a node checks whether a triangle is formed and, in case it is, whether it contains a semi-metric edge. If the node detects a semi-metric edge, it marks both this edge and the edge in the opposite direction for removal. In the final superstep, all marked edges are removed.

    Optimized implementation

    This implementation divides the algorithm into several megasteps, each of which contains the three supersteps of the main computation. In each megastep, some of the vertices of the graph execute the algorithm, while the rest are idle (but still active). The algorithm finishes when all vertices have executed the computation. This model is possible in the case of semi-metric edge removal because messages are neither aggregated nor combined. This implementation also assumes numeric vertex IDs; one possible assignment of vertices to megasteps is sketched below.
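
    As an illustration only (the actual partitioning scheme in Okapi may differ), vertices could be assigned to megasteps by their ID modulo the number of megasteps:

    // Hypothetical scheme, not necessarily the one Okapi uses: vertex v
    // executes the main computation in megastep m iff its ID falls in
    // m's residue class. Assumes non-negative numeric vertex IDs.
    class MegastepSketch {
      static boolean activeInMegastep(long vertexId, int megastep,
                                      int numMegasteps) {
        return vertexId % numMegasteps == megastep;
      }
    }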

    Input format

    The input format is an edge list, containing one edge with its value per line, in the format srcID<tab>targetID<tab>weight where the IDs are of type Long and the weight is Double.

    Output Format

    The output format is an edge list, containing one edge with its value per line, in the format srcID<tab>targetID<tab>weight where the IDs are of type Long and the weight is Double.

    Configuration parameters

  • semimetric.megasteps: Number of megasteps to divide the computation. This is used only in the optimized implementation. The default value is 2.
    Running

    To run the basic implementation execute:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      ml.grafos.okapi.graphs.SemimetricTriangles\$PropagateId  \
      -mc  ml.grafos.okapi.graphs.SemimetricTriangles\$SemimetricMasterCompute  \
      -eif ml.grafos.okapi.io.formats.LongDoubleTextEdgeInputFormat  \
      -eip $INPUT_EDGES \
      -eof org.apache.giraph.io.formats.SrcIdDstIdEdgeValueTextOutputFormat \
      -op $OUTPUT \
      -w $WORKERS \
      -ca giraph.oneToAllMsgSending=true  \
      -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges
    

    To run the optimized implementation execute:

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      ml.grafos.okapi.graphs.ScalableSemimetric\$PropagateId  \
      -mc  ml.grafos.okapi.graphs.ScalableSemimetric\$SemimetricMasterCompute  \
      -eif ml.grafos.okapi.io.formats.LongDoubleTextEdgeInputFormat  \
      -eip $INPUT_EDGES \
      -eof org.apache.giraph.io.formats.SrcIdDstIdEdgeValueTextOutputFormat \
      -op $OUTPUT \
      -w $WORKERS \
      -ca giraph.oneToAllMsgSending=true  \
      -ca giraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges
    

  • $OKAPI_JAR: Location of the Okapi jar file.
  • $INPUT_EDGES: Location of input edges on HDFS.
  • $OUTPUT: Directory to output the resulting graph without the semi-metric edges.
  • $WORKERS: Number of Giraph workers.
    Publications

    Investigating Graph Algorithms in the BSP Model on the Cray XMT, D. Ediger, D.A. Bader, 7th Workshop on Multithreaded Architectures and Applications (MTAAP), Boston, MA, May 24, 2013.

    Contributors

    Vasia
    Dionysis

     

    Kmeans

    K-means clustering is an iterative algorithm that partitions N data points (observations) into k clusters. During the initialization phase, the algorithm chooses k points at random from the set of input points. These are assigned to be the initial cluster centers.

    At each iteration: (i) each data point is assigned to the cluster center closest to it, by means of Euclidean distance, and (ii) new cluster centers are recomputed by calculating the arithmetic mean of the assigned points.

    The algorithm converges when the positions of the cluster centers do not change by more than a specified threshold.

    Each data point corresponds to a vertex. During a superstep, each point reads the current cluster centers and finds the one which is closest to it. It then aggregates its own coordinates to the corresponding cluster center aggregator and updates its vertex value with the new cluster ID.
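
    As a minimal sketch of that assignment step (plain Java with hypothetical names, not the actual Okapi classes), finding the closest center by Euclidean distance looks like this:

    // Hypothetical helper, not the actual Okapi code: returns the index
    // of the center closest to 'point' in Euclidean distance.
    class AssignmentSketch {
      static int closestCenter(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
          double dist = 0.0;
          for (int d = 0; d < point.length; d++) {
            double diff = point[d] - centers[c][d];
            dist += diff * diff;  // squared distance; the argmin is the same
          }
          if (dist < bestDist) {
            bestDist = dist;
            best = c;
          }
        }
        return best;
      }
    }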

    The random initialization of the cluster centers is implemented using a custom aggregator. The aggregator builds a list of k (the number of cluster centers) elements and their coordinates. If the current list size is less than k, the aggregate method appends the new element at the end of the list. Otherwise, it replaces an element at a random position in the list with probability k/N, where N is the total number of input points.
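
    The following minimal sketch (hypothetical names, not the actual Okapi aggregator) mirrors that description:

    import java.util.List;
    import java.util.Random;

    // Hypothetical sketch of the sampling step described above: keep the
    // first k points, then replace a random slot with probability k/N for
    // each subsequent point.
    class CenterInitSketch {
      static void offer(double[] point, List<double[]> reservoir,
                        int k, long totalPoints, Random rnd) {
        if (reservoir.size() < k) {
          reservoir.add(point);
        } else if (rnd.nextDouble() < (double) k / totalPoints) {
          reservoir.set(rnd.nextInt(k), point);
        }
      }
    }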

    The positions of the k centers are stored in aggregators and are updated by the Master. Each aggregator aggregates the vector coordinates of the points assigned to the corresponding cluster center. At the beginning of a superstep, the master computes the coordinates of the new centers as the means of the coordinates of the assigned points.

    Input format

    The input to the k-means clustering algorithm is a set of points, each described by an ID and its coordinates, with the following format:

    pointID, coord_1\tcoord_2\t…\tcoord_n

    where n is the dimension of the point space, pointID is of type Long, and coord_1 … coord_n are of type Double.

    Output format

    For each point, the algorithm outputs the clusterID where it belongs, in the form:

    pointID, clusterID

    The coordinates of the final cluster centers can be printed by setting the configuration parameter kmeans.print.final.centers.

    Configuration parameters

    K-means clustering requires setting the desired number of cluster centers, the total number of input data points and the dimensionality of the data points. The maximum number of iterations can also be optionally set.

  • kmeans.cluster.centers.count: the number of desired cluster centers. The default value is 3.
  • kmeans.points.count: the number of input data points.
  • kmeans.points.dimensions: the dimensionality of the input data points.
  • kmeans.iterations: the maximum number of iterations. The default value is 100.
    Running

    hadoop jar $OKAPI_JAR org.apache.giraph.GiraphRunner \
      ml.grafos.okapi.clustering.kmeans.KMeansClustering\$RandomCentersInitialization \
      -mc ml.grafos.okapi.clustering.kmeans.KMeansClustering\$KMeansMasterCompute \
      -vif ml.grafos.okapi.clustering.kmeans.KMeansTextInputFormat \
      -vip $INPUT_POINTS \
      -vof ml.grafos.okapi.clustering.kmeans.KMeansTextOutputFormat \
      -ca giraph.outEdgesClass=ml.grafos.okapi.common.graph.NullOutEdges \
      -ca kmeans.cluster.centers.count=$CLUSTER_COUNT \
      -ca kmeans.points.count=$POINT_COUNT \
      -ca kmeans.points.dimensions=$POINT_DIMENSIONS \
      -op $OUTPUT \
      -w $WORKERS
    

    WWW

    Wikipedia - Kmeans clustering

    Contributors

    Vasia