Algorithms¶
Buffalo provides the following algorithm implementations:
- Alternating Least Squares
- Bayesian Personalized Ranking Matrix Factorization
- Weighted Approximate-Rank Pairwise
- Word2Vec
- CoFactors
- pLSI
All algorithms inherit from common parent classes such as Algo, Serializable, TensorboardExtension, Optimizable, and Evaluable.
Algo¶
class buffalo.algo.base.Algo(*args, **kwargs)¶
Bases: abc.ABC
build_itemid_map()¶
build_userid_map()¶
get_feature(name, group='item')¶
get_index(keys, group='item')¶
Get the index, or list of indexes, for the given key(s). If there is no index for a key, None is returned.
Parameters: - keys (str or list) – Query key(s)
- group (str) – Data group to search (default: item)
Return type: int or list
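The lookup behaviour described for get_index can be sketched in plain Python. This is an illustration only, not Buffalo's implementation: ITEM_INDEX and lookup_index are hypothetical names, and reporting a missing key inside a list as a per-key None is an assumption.

```python
from typing import Dict, List, Optional, Union

# Hypothetical key-to-index table; Buffalo builds its own maps internally.
ITEM_INDEX: Dict[str, int] = {"apple": 0, "banana": 1, "cherry": 2}


def lookup_index(
    keys: Union[str, List[str]], table: Dict[str, int]
) -> Union[Optional[int], List[Optional[int]]]:
    """Return one index for a single key, or a list of indexes for a list of keys.

    A key without an index maps to None (assumed behaviour).
    """
    if isinstance(keys, str):
        return table.get(keys)
    return [table.get(key) for key in keys]


print(lookup_index("banana", ITEM_INDEX))
print(lookup_index(["cherry", "durian"], ITEM_INDEX))
```

A single string gives a single result and a list gives a list, which matches the documented return type of int or list.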
get_index_pool(pool, group='item')¶
Simple wrapper around get_index. If pool is an np.ndarray, it is returned as is. If pool is a list, get_index is called with the keys in pool.
Parameters: - pool – The list of keys.
- group (str) – Data group to search (default: item)
Return type: np.ndarray
get_weighted_feature(weights, group='item', min_length=1)¶
initialize()¶
most_similar(key, topk=10, group='item', pool=None)¶
Return the top-k items most similar to the query key.
Parameters: - key (str) – Query key
- topk (int) – The number of results (default: 10)
- group (str) – Data group to search (default: item)
- pool (list or numpy.ndarray) – The list of item keys to search within. If it is a numpy.ndarray instance, it is treated as item indexes, which speeds up the calculation. (default: None)
Returns: Top-k most similar items for the given query.
Return type: list
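As a rough illustration of what a top-k similarity query with an optional pool involves, the sketch below ranks items by cosine similarity over a small hand-written factor table. FACTORS, cosine, and most_similar_sketch are hypothetical names. The use of cosine similarity and the (key, score) result shape are assumptions, not statements about Buffalo's internals.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical item factors; in practice these would come from a trained model.
FACTORS: Dict[str, Sequence[float]] = {
    "a": (1.0, 0.0),
    "b": (0.9, 0.1),
    "c": (0.0, 1.0),
    "d": (-1.0, 0.0),
}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def most_similar_sketch(
    key: str, topk: int = 10, pool: Optional[List[str]] = None
) -> List[Tuple[str, float]]:
    """Rank candidate items by similarity to `key`, restricted to `pool` if given."""
    if key not in FACTORS:
        return []
    candidates = list(FACTORS) if pool is None else pool
    scored = [
        (other, cosine(FACTORS[key], FACTORS[other]))
        for other in candidates
        if other != key and other in FACTORS
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topk]


print(most_similar_sketch("a", topk=2))
print(most_similar_sketch("a", topk=2, pool=["c", "d"]))
```

Restricting the candidates with pool shrinks the set that has to be scored, which is why passing a pool can make the query cheaper.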
normalize(group='item')¶
topk_recommendation(keys, topk=10, pool=None)¶
Return the top-k recommendations for each user in keys.
Parameters: - keys (list or str) – Query key(s)
- topk (int) – Number of recommendations
- pool – See the pool parameter of most_similar
Return type: dict or list
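A minimal sketch of top-k recommendation by scoring user and item factors with a dot product follows. All names here (USER_FACTORS, ITEM_FACTORS, recommend_sketch) are hypothetical. Returning a list for a single key and a dict for a list of keys is an assumption based on the documented return type.

```python
from typing import Dict, List, Optional, Sequence, Union

# Hypothetical factors standing in for a trained model.
USER_FACTORS: Dict[str, Sequence[float]] = {"u1": (1.0, 0.0), "u2": (0.0, 1.0)}
ITEM_FACTORS: Dict[str, Sequence[float]] = {
    "i1": (0.9, 0.1),
    "i2": (0.2, 0.8),
    "i3": (0.5, 0.5),
}


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def recommend_sketch(
    keys: Union[str, List[str]], topk: int = 10, pool: Optional[List[str]] = None
) -> Union[List[str], Dict[str, List[str]]]:
    """Rank items for each user by dot-product score and keep the best `topk`."""
    single = isinstance(keys, str)
    users = [keys] if single else keys
    if pool is None:
        candidates = list(ITEM_FACTORS)
    else:
        candidates = [item for item in pool if item in ITEM_FACTORS]
    results: Dict[str, List[str]] = {}
    for user in users:
        if user not in USER_FACTORS:
            continue  # unknown users are skipped in this sketch
        vector = USER_FACTORS[user]
        ranked = sorted(
            candidates, key=lambda item: dot(vector, ITEM_FACTORS[item]), reverse=True
        )
        results[user] = ranked[:topk]
    return results.get(keys, []) if single else results


print(recommend_sketch("u1", topk=2))
print(recommend_sketch(["u1", "u2"], topk=1))
```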
Serializable¶
TensorboardExtension¶
Optimizable¶
Alternating Least Squares¶
class buffalo.algo.als.ALS(opt_path=None, *args, **kwargs)¶
Bases: buffalo.algo.base.Algo, buffalo.algo.options.ALSOption, buffalo.evaluate.base.Evaluable, buffalo.algo.base.Serializable, buffalo.algo.optimize.Optimizable, buffalo.algo.base.TensorboardExtension
Python implementation for C-ALS.
Implementation of Collaborative Filtering for Implicit Feedback datasets.
Reference: http://yifanhu.net/PUB/cf.pdf
get_scores(row_col_pairs)¶
initialize()¶
static new(path, data_fields=[])¶
normalize(group='item')¶
train()¶
class buffalo.algo.options.ALSOption(*args, **kwargs)¶
Bases: buffalo.algo.options.AlgoOption
get_default_optimize_option()¶
Optimization options for ALS.
Variables: - loss (str) – Target loss to optimize.
- max_trials (int) – The maximum number of experiments for optimization. If not given, optimization runs forever.
- min_trials (int) – The minimum number of experiments before deploying the model. (Since the best parameter may not be found after min_trials, the first best parameter is always deployed.)
- deployment (bool) – Set True to train the model with the best parameter. During optimization, it tries to dump any model that beats the previous best loss.
- start_with_default_parameters (bool) – If set to True, the loss value of the default parameter is used as the starting loss to beat.
- space (dict) – The parameter space definition. For more information, see hyperopt’s reference on parameter expressions. Note: because hyperopt’s randint does not accept a lower bound, we had to implement it in a slightly tricky way. See optimize.py for how randint is handled.
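The note on randint describes a sampler that only draws from a range starting at zero. One common workaround is to draw an offset and shift it by the desired lower bound. The sketch below shows that idea with Python's standard random module standing in for the sampler. It is not the code in optimize.py, and randint_with_lower is a hypothetical name.

```python
import random


def randint_with_lower(rng: random.Random, low: int, high: int) -> int:
    """Draw an integer in [low, high) using a sampler that only covers [0, n)."""
    if high <= low:
        raise ValueError("high must be greater than low")
    # rng.randrange(n) stands in for a randint that has no lower bound.
    offset = rng.randrange(high - low)
    return low + offset


rng = random.Random(0)
samples = [randint_with_lower(rng, 10, 30) for _ in range(1000)]
print(min(samples), max(samples))
```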
get_default_option()¶
Options for Alternating Least Squares.
Variables: - adaptive_reg (bool) – Set True, for adaptive regularization. (default: False)
- save_factors (bool) – Set True, to save models. (default: False)
- accelerator (bool) – Set True, to accelerate training using GPU. (default: False)
- d (int) – The number of latent feature dimensions. (default: 20)
- num_iters (int) – The number of iterations for training. (default: 10)
- num_workers (int) – The number of threads. (default: 1)
- hyper_threads (int) – The number of hyper threads when using cuda cores. (default: 256)
- reg_u (float) – The L2 regularization coefficient for user embedding matrix. (default: 0.1)
- reg_i (float) – The L2 regularization coefficient for item embedding matrix. (default: 0.1)
- alpha (float) – The coefficient that gives more weight to losses on positive samples. (default: 8)
- eps (float) – epsilon for numerical stability (default: 1e-10)
- cg_tolerance (float) – tolerance of conjugate gradient for early stopping iterations (default: 1e-10)
- optimizer (str) – The name of optimizer, should be in [llt, ldlt, manual_cg, eigen_cg, eigen_bicg, eigen_gmres, eigen_dgmres, eigen_minres]. (default: manual_cg)
- num_cg_max_iters (int) – The maximum number of iterations for the conjugate gradient optimizer. (default: 3)
- model_path (str) – Where to save model.
- data_opt (dict) – This option will be used to load data if given.
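To make the roles of num_cg_max_iters and cg_tolerance concrete, here is a small conjugate gradient solver for a symmetric positive definite system, the kind of system each ALS factor update solves. It is a generic textbook sketch, not Buffalo's manual_cg code. The function names are hypothetical, and comparing the squared residual norm against the tolerance is an assumption.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def matvec(a: Matrix, x: Vector) -> Vector:
    return [dot(row, x) for row in a]


def conjugate_gradient(
    a: Matrix, b: Vector, max_iters: int = 3, tolerance: float = 1e-10
) -> Vector:
    """Approximately solve a @ x = b for symmetric positive definite `a`."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - a @ x, with x starting at zero
    p = list(r)  # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iters):
        if rs_old < tolerance:
            break  # early stop: the residual is already small enough
        ap = matvec(a, p)
        step = rs_old / dot(p, ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(solution)
```

A small iteration cap trades exactness for speed: each update is only approximate, but the alternating updates are repeated over num_iters training iterations anyway.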
Bayesian Personalized Ranking Matrix Factorization¶
class buffalo.algo.bpr.BPRMF(opt_path=None, *args, **kwargs)¶
Bases: buffalo.algo.base.Algo, buffalo.algo.options.BPRMFOption, buffalo.evaluate.base.Evaluable, buffalo.algo.base.Serializable, buffalo.algo.optimize.Optimizable, buffalo.algo.base.TensorboardExtension
Python implementation for C-BPRMF.
get_scores(row_col_pairs)¶
initialize()¶
static new(path, data_fields=[])¶
normalize(group='item')¶
train()¶
class buffalo.algo.options.BPRMFOption(*args, **kwargs)¶
Bases: buffalo.algo.options.AlgoOption
get_default_optimize_option()¶
Optimization options for BPRMF.
get_default_option()¶
Options for Bayesian Personalized Ranking Matrix Factorization.
Variables: - accelerator (bool) – Set True, to accelerate training using GPU. (default: False)
- use_bias (bool) – Set True, to use bias term for the model.
- evaluation_period (int) – (default: 100)
- num_workers (int) – The number of threads. (default: 1)
- hyper_threads (int) – The number of hyper threads when using cuda cores. (default: 256)
- num_iters (int) – The number of iterations for training. (default: 100)
- d (int) – The number of latent feature dimensions. (default: 20)
- update_i (bool) – Set True, to update positive item feature. (default: True)
- update_j (bool) – Set True, to update negative item feature. (default: True)
- reg_u (float) – The L2 regularization coefficient for user embedding matrix. (default: 0.025)
- reg_i (float) – The L2 regularization coefficient for positive item embedding matrix. (default: 0.025)
- reg_j (float) – The L2 regularization coefficient for negative item embedding matrix. (default: 0.025)
- reg_b (float) – The L2 regularization coefficient for bias term. (default: 0.025)
- optimizer (str) – The name of optimizer, should be one of [sgd, adagrad, adam]. (default: sgd)
- lr (float) – The learning rate.
- min_lr (float) – The minimum learning rate, which prevents learning rate decay from driving the rate to zero. (default: 0.0001)
- beta1 (float) – The parameter of Adam optimizer. (default: 0.9)
- beta2 (float) – The parameter of Adam optimizer. (default: 0.999)
- per_coordinate_normalize (bool) – A somewhat subtle option for the Adam optimizer. Before factors are updated with gradients, the gradients are normalized per class by the number of samples that contributed to them. (default: False)
- sampling_power (float) – Controls the negative sampling distribution. When set to 0, negative items are drawn from a uniform distribution; when set to 1, they are drawn according to item popularity in the given data. (default: 0.0)
- random_positive (bool) – Set True to draw positive samples uniformly instead of taking them straight from the data. Only implemented in CUDA mode. The original paper sets this to True, but we found that False usually produces better results. (default: False)
- verify_neg (bool) – Set True to ensure that a negative sample does not belong to the positive samples. (default: True)
- model_path (str) – Where to save model.
- data_opt (dict) – This option will be used to load data if given.
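The interaction of sampling_power and verify_neg can be sketched as follows. Weighting each item by popularity raised to sampling_power gives a uniform draw at 0 and a popularity-proportional draw at 1. How Buffalo handles values in between is not stated above, so the power form is an assumption, and negative_weights, draw_negative, and max_attempts are hypothetical names.

```python
import random
from typing import Dict, Set


def negative_weights(popularity: Dict[str, int], sampling_power: float) -> Dict[str, float]:
    """Weight each item by popularity ** sampling_power (0 gives uniform weights)."""
    return {item: float(count) ** sampling_power for item, count in popularity.items()}


def draw_negative(
    rng: random.Random,
    popularity: Dict[str, int],
    positives: Set[str],
    sampling_power: float = 0.0,
    verify_neg: bool = True,
    max_attempts: int = 100,
) -> str:
    """Draw one negative item, optionally rejecting items the user already has."""
    weights = negative_weights(popularity, sampling_power)
    items = list(weights)
    weight_list = [weights[item] for item in items]
    for _ in range(max_attempts):
        candidate = rng.choices(items, weights=weight_list, k=1)[0]
        if not verify_neg or candidate not in positives:
            return candidate
    raise RuntimeError("no valid negative sample found")


popularity = {"a": 1, "b": 4}
rng = random.Random(0)
print(negative_weights(popularity, 0.0))
print(negative_weights(popularity, 1.0))
print(draw_negative(rng, popularity, positives={"a"}, sampling_power=1.0))
```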
Weighted Approximate-Rank Pairwise¶
class buffalo.algo.warp.WARP(opt_path=None, *args, **kwargs)¶
Bases: buffalo.algo.base.Algo, buffalo.algo.options.WARPOption, buffalo.evaluate.base.Evaluable, buffalo.algo.base.Serializable, buffalo.algo.optimize.Optimizable, buffalo.algo.base.TensorboardExtension
Python implementation for C-WARP.
get_scores(row_col_pairs)¶
initialize()¶
static new(path, data_fields=[])¶
normalize(group='item')¶
train()¶
class buffalo.algo.options.WARPOption(*args, **kwargs)¶
Bases: buffalo.algo.options.AlgoOption
get_default_optimize_option()¶
Optimization options for WARP.
get_default_option()¶
Options for WARP Matrix Factorization.
Variables: - accelerator (bool) – Set True, to accelerate training using GPU. (default: False)
- use_bias (bool) – Set True, to use bias term for the model.
- evaluation_period (int) – (default: 15)
- num_workers (int) – The number of threads. (default: 1)
- hyper_threads (int) – The number of hyper threads when using cuda cores. (default: 256)
- num_iters (int) – The number of iterations for training. (default: 15)
- d (int) – The number of latent feature dimensions. (default: 30)
- max_trials (int) – The maximum number of attempts to find a violating negative sample during training.
- update_i (bool) – Set True, to update positive item feature. (default: True)
- update_j (bool) – Set True, to update negative item feature. (default: True)
- reg_u (float) – The L2 regularization coefficient for user embedding matrix. (default: 0.0)
- reg_i (float) – The L2 regularization coefficient for positive item embedding matrix. (default: 0.0)
- reg_j (float) – The L2 regularization coefficient for negative item embedding matrix. (default: 0.0)
- reg_b (float) – The L2 regularization coefficient for bias term. (default: 0.0)
- optimizer (str) – The name of optimizer, should be one of [adagrad, adam]. (default: adagrad)
- lr (float) – The learning rate. (default: 0.1)
- min_lr (float) – The minimum learning rate, which prevents learning rate decay from driving the rate to zero. (default: 0.0001)
- beta1 (float) – The parameter of Adam optimizer. (default: 0.9)
- beta2 (float) – The parameter of Adam optimizer. (default: 0.999)
- per_coordinate_normalize (bool) – A somewhat subtle option for the Adam optimizer. Before factors are updated with gradients, the gradients are normalized per class by the number of samples that contributed to them. (default: False)
- random_positive (bool) – Set True to draw positive samples uniformly instead of taking them straight from the data. Only implemented in CUDA mode. The original paper sets this to True, but we found that False usually produces better results. (default: False)
- model_path (str) – Where to save model.
- data_opt (dict) – This option will be used to load data if given.
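The max_trials option bounds the search for a violating negative sample. The sketch below follows the general WARP recipe: sample negatives until one scores within a margin of the positive item, then turn the number of trials into a rank-based weight. Names such as find_violator, rank_weight, and margin are hypothetical. The harmonic weighting is one common choice, and whether Buffalo uses it is not stated above.

```python
import random
from typing import Dict, List, Optional, Tuple


def find_violator(
    rng: random.Random,
    scores: Dict[str, float],
    positive: str,
    negatives: List[str],
    max_trials: int,
    margin: float = 1.0,
) -> Tuple[Optional[str], int]:
    """Sample negatives until one scores within `margin` of the positive item.

    Returns (negative, trials_used), or (None, max_trials) if none was found.
    """
    positive_score = scores[positive]
    for trial in range(1, max_trials + 1):
        candidate = rng.choice(negatives)
        if scores[candidate] + margin > positive_score:
            return candidate, trial
    return None, max_trials


def rank_weight(num_items: int, trials: int) -> float:
    """Harmonic weight of the rank estimated as floor((num_items - 1) / trials)."""
    estimated_rank = max((num_items - 1) // trials, 1)
    return sum(1.0 / j for j in range(1, estimated_rank + 1))


SCORES = {"p": 2.0, "n1": 0.0, "n2": 1.5}
rng = random.Random(0)
negative, trials = find_violator(rng, SCORES, "p", ["n1", "n2"], max_trials=50)
print(negative, trials, rank_weight(len(SCORES), trials))
```

Finding a violator quickly suggests the positive item is ranked poorly, so the estimated rank and the update weight are larger; if no violator turns up within max_trials, the sample is skipped.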
CoFactors¶
class buffalo.algo.cfr.CFR(opt_path=None, *args, **kwargs)¶
Bases: buffalo.algo.base.Algo, buffalo.algo.options.CFROption, buffalo.evaluate.base.Evaluable, buffalo.algo.base.Serializable, buffalo.algo.optimize.Optimizable, buffalo.algo.base.TensorboardExtension
Python implementation for CoFactor.
Reference: Factorization Meets the Item Embedding: Regularizing Matrix Factorization with Item Co-occurrence
Paper link: http://dawenl.github.io/publications/LiangACB16-cofactor.pdf
get_scores(row_col_pairs)¶
initialize()¶
static new(path, data_fields=[])¶
normalize(group='item')¶
train()¶
class buffalo.algo.options.CFROption(*args, **kwargs)¶
Bases: buffalo.algo.options.AlgoOption
get_default_optimize_option()¶
Optimization options for CoFactor.
Variables: - loss (str) – Target loss to optimize.
- max_trials (int) – The maximum number of experiments for optimization. If not given, optimization runs forever.
- min_trials (int) – The minimum number of experiments before deploying the model. (Since the best parameter may not be found after min_trials, the first best parameter is always deployed.)
- deployment (bool) – Set True to train the model with the best parameter. During optimization, it tries to dump any model that beats the previous best loss.
- start_with_default_parameters (bool) – If set to True, the loss value of the default parameter is used as the starting loss to beat.
- space (dict) – The parameter space definition. For more information, see hyperopt’s reference on parameter expressions. Note: because hyperopt’s randint does not accept a lower bound, we had to implement it in a slightly tricky way. See optimize.py for how randint is handled.
get_default_option()¶
Basic options for CoFactor.
Variables: - d (int) – The number of latent feature dimensions. (default: 20)
- num_iters (int) – The number of iterations for training. (default: 10)
- num_workers (int) – The number of threads. (default: 1)
- reg_u (float) – The L2 regularization coefficient for user embedding matrix. (default: 0.1)
- reg_i (float) – The L2 regularization coefficient for item embedding matrix. (default: 0.1)
- reg_c (float) – The L2 regularization coefficient for context embedding matrix. (default: 0.1)
- eps (float) – epsilon for numerical stability (default: 1e-10)
- cg_tolerance (float) – The tolerance for early stopping conjugate gradient optimizer. (default: 1e-10)
- alpha (float) – The coefficient that gives more weight to losses on positive samples. (default: 8.0)
- l (float) – The relative weight of loss on user-item relation over item-context relation. (default: 1.0)
- optimizer (str) – The name of optimizer, should be in [llt, ldlt, manual_cg, eigen_cg, eigen_bicg, eigen_gmres, eigen_dgmres, eigen_minres]. (default: manual_cg)
- num_cg_max_iters (int) – The maximum number of iterations for the conjugate gradient optimizer. (default: 3)
- model_path (str) – Where to save model. (default: ‘’)
- data_opt (dict) – This option will be used to load data if given. (default: {})
is_valid_option(opt)¶
Word2Vec¶
class buffalo.algo.w2v.W2V(opt_path=None, *args, **kwargs)¶
Bases: buffalo.algo.base.Algo, buffalo.algo.options.W2VOption, buffalo.evaluate.base.Evaluable, buffalo.algo.base.Serializable, buffalo.algo.optimize.Optimizable, buffalo.algo.base.TensorboardExtension
Python implementation for C-W2V.
get_index(key, group='item')¶
Get the index, or list of indexes, for the given key(s). If there is no index for a key, None is returned.
Parameters: - key (str or list) – Query key(s)
- group (str) – Data group to search (default: item)
Return type: int or list
get_scores(row_col_pairs)¶
initialize()¶
static new(path, data_fields=[])¶
normalize(group='item')¶
train()¶
class buffalo.algo.options.W2VOption(*args, **kwargs)¶
Bases: buffalo.algo.options.AlgoOption
get_default_optimize_option()¶
Optimization options for W2V.
get_default_option()¶
Options for Word2Vec.
Variables: - evaluation_on_learning (bool) – Set True to run evaluation during the training phase. (default: False)
- num_workers (int) – The number of threads. (default: 1)
- num_iters (int) – The number of iterations for training. (default: 100)
- d (int) – The number of latent feature dimensions. (default: 20)
- window (int) – The window size. (default: 5)
- min_count (int) – The minimum frequency a word needs to be included in the training vocabulary. (default: 5)
- sample (float) – The sampling ratio to downsample the frequent words. (default: 0.001)
- num_negative_samples (int) – The number of negative noise words. (default: 5)
- lr (float) – The learning rate.
- model_path (str) – Where to save model.
- data_opt (dict) – This option will be used to load data if given.
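The min_count and sample options can be illustrated with a short sketch of vocabulary filtering and frequent-word downsampling. The keep probability below is the formula used in the original word2vec C code. Whether Buffalo uses exactly this form is an assumption, and build_vocab and keep_probability are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def build_vocab(tokens: Iterable[str], min_count: int = 5) -> Dict[str, int]:
    """Keep only words that occur at least `min_count` times."""
    counts = Counter(tokens)
    return {word: count for word, count in counts.items() if count >= min_count}


def keep_probability(count: int, total: int, sample: float = 0.001) -> float:
    """Probability of keeping one occurrence of a word during training.

    With f = count / total, this is (sqrt(f / sample) + 1) * sample / f,
    capped at 1.0, so rare words are always kept.
    """
    if sample <= 0:
        return 1.0  # downsampling disabled
    frequency = count / total
    return min(1.0, (math.sqrt(frequency / sample) + 1) * sample / frequency)


vocab = build_vocab(["a"] * 6 + ["b"] * 2, min_count=5)
print(vocab)
print(keep_probability(100_000, 1_000_000))
print(keep_probability(1, 1_000_000))
```

A word that makes up 10% of the corpus keeps only about one occurrence in nine at the default sample value, while a word seen once in a million tokens is never dropped.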
pLSI¶
class buffalo.algo.plsi.PLSI(opt_path=None, *args, **kwargs)¶
Bases: buffalo.algo.base.Algo, buffalo.algo.options.PLSIOption, buffalo.evaluate.base.Evaluable, buffalo.algo.base.Serializable, buffalo.algo.optimize.Optimizable, buffalo.algo.base.TensorboardExtension
Python implementation for pLSI.
get_scores(row_col_pairs)¶
inherit()¶
initialize()¶
static new(path, data_fields=[])¶
normalize(group='item')¶
train()¶
class buffalo.algo.options.PLSIOption(*args, **kwargs)¶
Bases: buffalo.algo.options.AlgoOption
get_default_optimize_option()¶
Optimization options for pLSI.
Variables: - loss (str) – Target loss to optimize.
- max_trials (int) – The maximum number of experiments for optimization. If not given, optimization runs forever.
- min_trials (int) – The minimum number of experiments before deploying the model. (Since the best parameter may not be found after min_trials, the first best parameter is always deployed.)
- deployment (bool) – Set True to train the model with the best parameter. During optimization, it tries to dump any model that beats the previous best loss.
- start_with_default_parameters (bool) – If set to True, the loss value of the default parameter is used as the starting loss to beat.
- space (dict) – The parameter space definition. For more information, see hyperopt’s reference on parameter expressions. Note: because hyperopt’s randint does not accept a lower bound, we had to implement it in a slightly tricky way. See optimize.py for how randint is handled.
get_default_option()¶
Basic options for pLSI.
Variables: - d (int) – The number of latent feature dimensions. (default: 20)
- num_iters (int) – The number of iterations for training. (default: 10)
- num_workers (int) – The number of threads. (default: 1)
- alpha1 (float) – The coefficient of regularization term for clustering assignment. (default: 1.0)
- alpha2 (float) – The coefficient of regularization term for item preference in each cluster. (default: 1.0)
- eps (float) – epsilon for numerical stability (default: 1e-10)
- model_path (str) – Where to save model. (default: ‘’)
- save_factors (bool) – Set True, to save models. (default: False)
- data_opt (dict) – This option will be used to load data if given. (default: {})