striatum.bandit package
Submodules
striatum.bandit.exp3 module
Exp3: Exponential-weight algorithm for Exploration and Exploitation. This module contains a class that implements EXP3, a bandit algorithm that randomly chooses an action according to a learned probability distribution.
-
class striatum.bandit.exp3.Exp3(history_storage, model_storage, action_storage, recommendation_cls=None, gamma=0.3, random_state=None)
Bases: striatum.bandit.bandit.BaseBandit
Exp3 algorithm.
Parameters:
history_storage : HistoryStorage object
The HistoryStorage object to store history context, actions and rewards.
model_storage : ModelStorage object
The ModelStorage object to store model parameters.
action_storage : ActionStorage object
The ActionStorage object to store actions.
recommendation_cls : class (default: None)
The class used to initiate the recommendations. If None, the default Recommendation class is used.
gamma : float, 0 < gamma <= 1
The parameter used to control the minimum chosen probability for each action.
random_state : {int, np.random.RandomState} (default: None)
If int, it is used as the seed for np.random.RandomState. If None, a random seed is used.
References
[R1] Peter Auer, Nicolo Cesa-Bianchi, et al. “The non-stochastic multi-armed bandit problem.” SIAM Journal on Computing. 2002.
Attributes
history_storage
Methods
add_action(actions): Add new actions (if needed).
calculate_avg_reward(): Calculate average reward with respect to time.
calculate_cum_reward(): Calculate cumulative reward with respect to time.
get_action([context, n_actions]): Return the action to perform.
plot_avg_regret(): Plot average regret with respect to time.
plot_avg_reward(): Plot average reward with respect to time.
remove_action(action_id): Remove action by id.
reward(history_id, rewards): Reward the previous action with the given reward.
update_action(action): Update action.
add_action(actions)
Add new actions (if needed).
Parameters:
actions : iterable
A list of Action objects for recommendation.
get_action(context=None, n_actions=None)
Return the action to perform.
Parameters:
context : {array-like, None}
The context of the current state; None if no context is available.
n_actions : int (default: None)
Number of actions to recommend. If None, return only one action. If -1, return all actions.
Returns:
history_id : int
The history id of the action.
recommendations : list of dict
Each dict contains {Action object, estimated_reward, uncertainty}.
remove_action(action_id)
Remove action by id.
Parameters:
action_id : int
The id of the action to remove.
reward(history_id, rewards)
Reward the previous action with the given reward.
Parameters:
history_id : int
The history id of the action to reward.
rewards : dictionary
The dictionary {action_id: reward}, where reward is a float.
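To make the learned probability distribution concrete, here is a minimal standalone sketch of the Exp3 probability mixing and weight update. This is an illustration of the algorithm itself, not striatum's internal code; `gamma` plays the same mixing role as the constructor parameter above.

```python
import math

def exp3_probs(weights, gamma):
    """Mix the normalized weights with the uniform distribution:
    p_i = (1 - gamma) * w_i / sum(w) + gamma / k,
    so every action keeps probability at least gamma / k."""
    total = sum(weights)
    k = len(weights)
    return [(1 - gamma) * w / total + gamma / k for w in weights]

def exp3_update(weights, probs, chosen, reward, gamma):
    """Exponentially boost the chosen arm by its importance-weighted reward."""
    k = len(weights)
    weights[chosen] *= math.exp(gamma * reward / (probs[chosen] * k))
    return weights
```

With equal weights the distribution is uniform; rewarding an arm increases its weight, and hence its probability of being chosen in later rounds.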
-
striatum.bandit.exp4p module
EXP4.P: An extension of the exponential-weight algorithm for exploration and exploitation. This module contains a class that implements EXP4.P, a contextual bandit algorithm with expert advice.
-
class striatum.bandit.exp4p.Exp4P(actions, historystorage, modelstorage, delta=0.1, p_min=None, max_rounds=10000)
Bases: striatum.bandit.bandit.BaseBandit
Exp4.P with pre-trained supervised learning algorithms.
Parameters:
actions : list of Action objects
List of actions to be chosen from.
historystorage : a HistoryStorage object
The place where we store the histories of contexts and rewards.
modelstorage : a ModelStorage object
The place where we store the model parameters.
delta : float, 0 < delta <= 1
With probability 1 - delta, Exp4.P satisfies the theoretical regret bound.
p_min : float, 0 < p_min < 1/k
The minimum probability to choose each action.
References
[R2] Beygelzimer, Alina, et al. “Contextual bandit algorithms with supervised learning guarantees.” International Conference on Artificial Intelligence and Statistics (AISTATS). 2011.
Attributes
history_storage
Methods
add_action(actions): Add new actions (if needed).
calculate_avg_reward(): Calculate average reward with respect to time.
calculate_cum_reward(): Calculate cumulative reward with respect to time.
get_action([context, n_actions]): Return the action to perform.
plot_avg_regret(): Plot average regret with respect to time.
plot_avg_reward(): Plot average reward with respect to time.
remove_action(action_id): Remove action by id.
reward(history_id, rewards): Reward the previous action with the given reward.
update_action(action): Update action.
get_action(context=None, n_actions=1)
Return the action to perform.
Parameters:
context : dictionary
Contexts {expert_id: {action_id: expert_prediction}} of different actions.
n_actions : int
Number of actions to recommend.
Returns:
history_id : int
The history id of the action.
action_recommendation : list of dictionaries
Each dictionary contains {Action object, estimated_reward, uncertainty}.
reward(history_id, rewards)
Reward the previous action with the given reward.
Parameters:
history_id : int
The history id of the action to reward.
rewards : dictionary
The dictionary {action_id: reward}, where reward is a float.
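The expert-advice mechanism can be sketched as follows. This is a simplified standalone illustration (not striatum's implementation) of how expert probability vectors shaped like the `context` argument above are mixed into one action distribution and floored at `p_min`:

```python
def exp4p_action_probs(expert_weights, advice, p_min):
    """Combine expert advice {expert_id: {action_id: prob}} into one action
    distribution, weighting each expert by its current weight, then mix with
    the uniform floor so every action has probability at least p_min."""
    action_ids = next(iter(advice.values())).keys()
    k = len(action_ids)
    total_w = sum(expert_weights.values())
    probs = {}
    for action_id in action_ids:
        mixed = sum(expert_weights[e] * advice[e][action_id]
                    for e in advice) / total_w
        probs[action_id] = (1 - k * p_min) * mixed + p_min
    return probs
```

Experts that accumulate reward get larger weights, so their advice dominates the mixture; the `p_min` floor keeps every action explorable.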
-
striatum.bandit.linthompsamp module
Thompson Sampling with Linear Payoff. This module contains a class that implements Thompson Sampling with Linear Payoff, a contextual multi-armed bandit algorithm which assumes that the underlying relationship between rewards and contexts is linear. Sampling is used to balance exploration and exploitation. Please check the reference for more details.
-
class striatum.bandit.linthompsamp.LinThompSamp(history_storage, model_storage, action_storage, recommendation_cls=None, context_dimension=128, delta=0.5, R=0.01, epsilon=0.5, random_state=None)
Bases: striatum.bandit.bandit.BaseBandit
Thompson sampling with linear payoff.
Parameters:
history_storage : HistoryStorage object
The HistoryStorage object to store history context, actions and rewards.
model_storage : ModelStorage object
The ModelStorage object to store model parameters.
action_storage : ActionStorage object
The ActionStorage object to store actions.
recommendation_cls : class (default: None)
The class used to initiate the recommendations. If None, the default Recommendation class is used.
context_dimension : int
The dimension of the context.
delta : float, 0 < delta < 1
With probability 1 - delta, LinThompSamp satisfies the theoretical regret bound.
R : float, R >= 0
Assume that the residual \(r_i(t) - b_i(t)^T \hat{\mu}\) is R-sub-Gaussian. In this case, R^2 represents the variance of the residuals of the linear model \(b_i(t)^T \hat{\mu}\).
epsilon : float, 0 < epsilon < 1
A parameter used by the Thompson Sampling algorithm. If the total number of trials T is known, we can choose epsilon = 1/ln(T).
random_state : {int, np.random.RandomState} (default: None)
If int, it is used as the seed for np.random.RandomState. If None, a random seed is used.
References
[R3] Shipra Agrawal, and Navin Goyal. “Thompson Sampling for Contextual Bandits with Linear Payoffs.” Advances in Neural Information Processing Systems 24. 2011.
Attributes
history_storage
Methods
add_action(actions): Add new actions (if needed).
calculate_avg_reward(): Calculate average reward with respect to time.
calculate_cum_reward(): Calculate cumulative reward with respect to time.
get_action(context[, n_actions]): Return the action to perform.
plot_avg_regret(): Plot average regret with respect to time.
plot_avg_reward(): Plot average reward with respect to time.
remove_action(action_id): Remove action by id.
reward(history_id, rewards): Reward the previous action with the given reward.
update_action(action): Update action.
add_action(actions)
Add new actions (if needed).
Parameters:
actions : iterable
A list of Action objects for recommendation.
get_action(context, n_actions=None)
Return the action to perform.
Parameters:
context : dictionary
Contexts {action_id: context} of different actions.
n_actions : int (default: None)
Number of actions to recommend. If None, return only one action. If -1, return all actions.
Returns:
history_id : int
The history id of the action.
recommendations : list of dict
Each dict contains {Action object, estimated_reward, uncertainty}.
remove_action(action_id)
Remove action by id.
Parameters:
action_id : int
The id of the action to remove.
reward(history_id, rewards)
Reward the previous action with the given reward.
Parameters:
history_id : int
The history id of the action to reward.
rewards : dictionary
The dictionary {action_id: reward}, where reward is a float.
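The sampling step that balances exploration and exploitation can be sketched as a standalone illustration (not striatum's internals): sample a parameter vector from the current Gaussian posterior, then act greedily with respect to that sample. Names `B`, `f`, and `v` follow the common presentation of the algorithm, where `B` accumulates context outer products and `v` scales the posterior covariance.

```python
import numpy as np

def linthompsamp_choose(B, f, context, v, rng):
    """One Thompson Sampling step for linear payoffs: sample
    mu_tilde ~ N(B^-1 f, v^2 B^-1) and pick the action whose context
    vector maximizes the sampled expected reward.
    context: dict {action_id: d-dimensional np.ndarray}."""
    B_inv = np.linalg.inv(B)
    mu_hat = B_inv @ f
    mu_tilde = rng.multivariate_normal(mu_hat, v ** 2 * B_inv)
    return max(context, key=lambda a: float(context[a] @ mu_tilde))

def linthompsamp_update(B, f, x, reward):
    """Rank-one posterior update after observing reward for context x."""
    B += np.outer(x, x)
    f += reward * x
    return B, f
```

Because a fresh parameter vector is drawn each round, actions with uncertain estimates still get chosen occasionally, which is the exploration mechanism.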
-
striatum.bandit.linucb module
LinUCB with Disjoint Linear Models. This module contains a class that implements LinUCB with disjoint linear models, a contextual bandit algorithm that assumes the expected reward is a linear function of the context.
-
class striatum.bandit.linucb.LinUCB(history_storage, model_storage, action_storage, recommendation_cls=None, context_dimension=128, alpha=0.5)
Bases: striatum.bandit.bandit.BaseBandit
LinUCB with Disjoint Linear Models
Parameters:
history_storage : HistoryStorage object
The HistoryStorage object to store history context, actions and rewards.
model_storage : ModelStorage object
The ModelStorage object to store model parameters.
action_storage : ActionStorage object
The ActionStorage object to store actions.
recommendation_cls : class (default: None)
The class used to initiate the recommendations. If None, the default Recommendation class is used.
context_dimension : int
The dimension of the context.
alpha : float
The constant that determines the width of the upper confidence bound.
References
[R4] Lihong Li, et al. “A Contextual-Bandit Approach to Personalized News Article Recommendation.” In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010.
Attributes
history_storage
Methods
add_action(actions): Add new actions (if needed).
calculate_avg_reward(): Calculate average reward with respect to time.
calculate_cum_reward(): Calculate cumulative reward with respect to time.
get_action(context[, n_actions]): Return the action to perform.
plot_avg_regret(): Plot average regret with respect to time.
plot_avg_reward(): Plot average reward with respect to time.
remove_action(action_id): Remove action by id.
reward(history_id, rewards): Reward the previous action with the given reward.
update_action(action): Update action.
add_action(actions)
Add new actions (if needed).
Parameters:
actions : iterable
A list of Action objects for recommendation.
get_action(context, n_actions=None)
Return the action to perform.
Parameters:
context : dict
Contexts {action_id: context} of different actions.
n_actions : int (default: None)
Number of actions to recommend. If None, return only one action. If -1, return all actions.
Returns:
history_id : int
The history id of the action.
recommendations : list of dict
Each dict contains {Action object, estimated_reward, uncertainty}.
remove_action(action_id)
Remove action by id.
Parameters:
action_id : int
The id of the action to remove.
reward(history_id, rewards)
Reward the previous action with the given reward.
Parameters:
history_id : int
The history id of the action to reward.
rewards : dictionary
The dictionary {action_id: reward}, where reward is a float.
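The disjoint-model scoring that `alpha` controls can be sketched per arm as follows. This is a standalone illustration of the LinUCB arm statistics (design matrix `A` and response vector `b`), not striatum's internal code:

```python
import numpy as np

def linucb_score(A, b, x, alpha):
    """UCB score for one disjoint arm: ridge estimate theta = A^-1 b,
    scored as theta^T x + alpha * sqrt(x^T A^-1 x)."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b
    return float(theta @ x + alpha * np.sqrt(x @ A_inv @ x))

def linucb_update(A, b, x, reward):
    """After observing reward for context x, update this arm's statistics."""
    A += np.outer(x, x)
    b += reward * x
    return A, b
```

Each arm starts with `A = I` and `b = 0`; the `alpha * sqrt(...)` term shrinks as an arm accumulates observations in the direction of `x`, so well-explored arms are scored mostly by their estimated reward.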
-
striatum.bandit.ucb1 module
Upper Confidence Bound 1. This module contains a class that implements the UCB1 algorithm, a well-known multi-armed bandit algorithm that uses no context.
-
class striatum.bandit.ucb1.UCB1(history_storage, model_storage, action_storage, recommendation_cls=None)
Bases: striatum.bandit.bandit.BaseBandit
Upper Confidence Bound 1
Parameters:
history_storage : HistoryStorage object
The HistoryStorage object to store history context, actions and rewards.
model_storage : ModelStorage object
The ModelStorage object to store model parameters.
action_storage : ActionStorage object
The ActionStorage object to store actions.
recommendation_cls : class (default: None)
The class used to initiate the recommendations. If None, the default Recommendation class is used.
References
[R5] Peter Auer, et al. “Finite-time Analysis of the Multiarmed Bandit Problem.” Machine Learning, 47. 2002.
Attributes
history_storage
Methods
add_action(actions): Add new actions (if needed).
calculate_avg_reward(): Calculate average reward with respect to time.
calculate_cum_reward(): Calculate cumulative reward with respect to time.
get_action([context, n_actions]): Return the action to perform.
plot_avg_regret(): Plot average regret with respect to time.
plot_avg_reward(): Plot average reward with respect to time.
remove_action(action_id): Remove action by id.
reward(history_id, rewards): Reward the previous action with the given reward.
update_action(action): Update action.
add_action(actions)
Add new actions (if needed).
Parameters:
actions : iterable
A list of Action objects for recommendation.
get_action(context=None, n_actions=None)
Return the action to perform.
Parameters:
context : {array-like, None}
The context of the current state; None if no context is available.
n_actions : int (default: None)
Number of actions to recommend. If None, return only one action. If -1, return all actions.
Returns:
history_id : int
The history id of the action.
recommendations : list of dict
Each dict contains {Action object, estimated_reward, uncertainty}.
remove_action(action_id)
Remove action by id.
Parameters:
action_id : int
The id of the action to remove.
reward(history_id, rewards)
Reward the previous action with the given reward.
Parameters:
history_id : int
The history id of the action to reward.
rewards : dictionary
The dictionary {action_id: reward}, where reward is a float.
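The UCB1 selection rule is simple enough to sketch in a few lines. This is a standalone illustration of the algorithm from the reference above, not striatum's implementation:

```python
import math

def ucb1_choose(counts, reward_sums, t):
    """Pick the arm maximizing empirical mean + sqrt(2 ln t / n_i).
    Any arm that has not yet been tried is chosen first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(range(len(counts)),
               key=lambda i: reward_sums[i] / counts[i]
                             + math.sqrt(2.0 * math.log(t) / counts[i]))
```

The confidence term grows with the round count `t` and shrinks with each arm's pull count, so an arm with a mediocre mean but few pulls can still beat a well-explored arm with a slightly higher mean.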
-
Module contents
Bandit algorithm classes
-
class striatum.bandit.Exp3(history_storage, model_storage, action_storage, recommendation_cls=None, gamma=0.3, random_state=None)
Bases: striatum.bandit.bandit.BaseBandit
Exp3 algorithm.
Parameters:
history_storage : HistoryStorage object
The HistoryStorage object to store history context, actions and rewards.
model_storage : ModelStorage object
The ModelStorage object to store model parameters.
action_storage : ActionStorage object
The ActionStorage object to store actions.
recommendation_cls : class (default: None)
The class used to initiate the recommendations. If None, the default Recommendation class is used.
gamma : float, 0 < gamma <= 1
The parameter used to control the minimum chosen probability for each action.
random_state : {int, np.random.RandomState} (default: None)
If int, it is used as the seed for np.random.RandomState. If None, a random seed is used.
References
[R6] Peter Auer, Nicolo Cesa-Bianchi, et al. “The non-stochastic multi-armed bandit problem.” SIAM Journal on Computing. 2002.
Attributes
history_storage
Methods
add_action(actions): Add new actions (if needed).
calculate_avg_reward(): Calculate average reward with respect to time.
calculate_cum_reward(): Calculate cumulative reward with respect to time.
get_action([context, n_actions]): Return the action to perform.
plot_avg_regret(): Plot average regret with respect to time.
plot_avg_reward(): Plot average reward with respect to time.
remove_action(action_id): Remove action by id.
reward(history_id, rewards): Reward the previous action with the given reward.
update_action(action): Update action.
add_action(actions)
Add new actions (if needed).
Parameters:
actions : iterable
A list of Action objects for recommendation.
get_action(context=None, n_actions=None)
Return the action to perform.
Parameters:
context : {array-like, None}
The context of the current state; None if no context is available.
n_actions : int (default: None)
Number of actions to recommend. If None, return only one action. If -1, return all actions.
Returns:
history_id : int
The history id of the action.
recommendations : list of dict
Each dict contains {Action object, estimated_reward, uncertainty}.
remove_action(action_id)
Remove action by id.
Parameters:
action_id : int
The id of the action to remove.
reward(history_id, rewards)
Reward the previous action with the given reward.
Parameters:
history_id : int
The history id of the action to reward.
rewards : dictionary
The dictionary {action_id: reward}, where reward is a float.
-
class striatum.bandit.Exp4P(actions, historystorage, modelstorage, delta=0.1, p_min=None, max_rounds=10000)
Bases: striatum.bandit.bandit.BaseBandit
Exp4.P with pre-trained supervised learning algorithms.
Parameters:
actions : list of Action objects
List of actions to be chosen from.
historystorage : a HistoryStorage object
The place where we store the histories of contexts and rewards.
modelstorage : a ModelStorage object
The place where we store the model parameters.
delta : float, 0 < delta <= 1
With probability 1 - delta, Exp4.P satisfies the theoretical regret bound.
p_min : float, 0 < p_min < 1/k
The minimum probability to choose each action.
References
[R7] Beygelzimer, Alina, et al. “Contextual bandit algorithms with supervised learning guarantees.” International Conference on Artificial Intelligence and Statistics (AISTATS). 2011.
Attributes
history_storage
Methods
add_action(actions): Add new actions (if needed).
calculate_avg_reward(): Calculate average reward with respect to time.
calculate_cum_reward(): Calculate cumulative reward with respect to time.
get_action([context, n_actions]): Return the action to perform.
plot_avg_regret(): Plot average regret with respect to time.
plot_avg_reward(): Plot average reward with respect to time.
remove_action(action_id): Remove action by id.
reward(history_id, rewards): Reward the previous action with the given reward.
update_action(action): Update action.
get_action(context=None, n_actions=1)
Return the action to perform.
Parameters:
context : dictionary
Contexts {expert_id: {action_id: expert_prediction}} of different actions.
n_actions : int
Number of actions to recommend.
Returns:
history_id : int
The history id of the action.
action_recommendation : list of dictionaries
Each dictionary contains {Action object, estimated_reward, uncertainty}.
reward(history_id, rewards)
Reward the previous action with the given reward.
Parameters:
history_id : int
The history id of the action to reward.
rewards : dictionary
The dictionary {action_id: reward}, where reward is a float.
-
class striatum.bandit.LinThompSamp(history_storage, model_storage, action_storage, recommendation_cls=None, context_dimension=128, delta=0.5, R=0.01, epsilon=0.5, random_state=None)
Bases: striatum.bandit.bandit.BaseBandit
Thompson sampling with linear payoff.
Parameters:
history_storage : HistoryStorage object
The HistoryStorage object to store history context, actions and rewards.
model_storage : ModelStorage object
The ModelStorage object to store model parameters.
action_storage : ActionStorage object
The ActionStorage object to store actions.
recommendation_cls : class (default: None)
The class used to initiate the recommendations. If None, the default Recommendation class is used.
context_dimension : int
The dimension of the context.
delta : float, 0 < delta < 1
With probability 1 - delta, LinThompSamp satisfies the theoretical regret bound.
R : float, R >= 0
Assume that the residual \(r_i(t) - b_i(t)^T \hat{\mu}\) is R-sub-Gaussian. In this case, R^2 represents the variance of the residuals of the linear model \(b_i(t)^T \hat{\mu}\).
epsilon : float, 0 < epsilon < 1
A parameter used by the Thompson Sampling algorithm. If the total number of trials T is known, we can choose epsilon = 1/ln(T).
random_state : {int, np.random.RandomState} (default: None)
If int, it is used as the seed for np.random.RandomState. If None, a random seed is used.
References
[R8] Shipra Agrawal, and Navin Goyal. “Thompson Sampling for Contextual Bandits with Linear Payoffs.” Advances in Neural Information Processing Systems 24. 2011.
Attributes
history_storage
Methods
add_action(actions): Add new actions (if needed).
calculate_avg_reward(): Calculate average reward with respect to time.
calculate_cum_reward(): Calculate cumulative reward with respect to time.
get_action(context[, n_actions]): Return the action to perform.
plot_avg_regret(): Plot average regret with respect to time.
plot_avg_reward(): Plot average reward with respect to time.
remove_action(action_id): Remove action by id.
reward(history_id, rewards): Reward the previous action with the given reward.
update_action(action): Update action.
add_action(actions)
Add new actions (if needed).
Parameters:
actions : iterable
A list of Action objects for recommendation.
get_action(context, n_actions=None)
Return the action to perform.
Parameters:
context : dictionary
Contexts {action_id: context} of different actions.
n_actions : int (default: None)
Number of actions to recommend. If None, return only one action. If -1, return all actions.
Returns:
history_id : int
The history id of the action.
recommendations : list of dict
Each dict contains {Action object, estimated_reward, uncertainty}.
remove_action(action_id)
Remove action by id.
Parameters:
action_id : int
The id of the action to remove.
reward(history_id, rewards)
Reward the previous action with the given reward.
Parameters:
history_id : int
The history id of the action to reward.
rewards : dictionary
The dictionary {action_id: reward}, where reward is a float.
-
class striatum.bandit.LinUCB(history_storage, model_storage, action_storage, recommendation_cls=None, context_dimension=128, alpha=0.5)
Bases: striatum.bandit.bandit.BaseBandit
LinUCB with Disjoint Linear Models
Parameters:
history_storage : HistoryStorage object
The HistoryStorage object to store history context, actions and rewards.
model_storage : ModelStorage object
The ModelStorage object to store model parameters.
action_storage : ActionStorage object
The ActionStorage object to store actions.
recommendation_cls : class (default: None)
The class used to initiate the recommendations. If None, the default Recommendation class is used.
context_dimension : int
The dimension of the context.
alpha : float
The constant that determines the width of the upper confidence bound.
References
[R9] Lihong Li, et al. “A Contextual-Bandit Approach to Personalized News Article Recommendation.” In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010.
Attributes
history_storage
Methods
add_action(actions): Add new actions (if needed).
calculate_avg_reward(): Calculate average reward with respect to time.
calculate_cum_reward(): Calculate cumulative reward with respect to time.
get_action(context[, n_actions]): Return the action to perform.
plot_avg_regret(): Plot average regret with respect to time.
plot_avg_reward(): Plot average reward with respect to time.
remove_action(action_id): Remove action by id.
reward(history_id, rewards): Reward the previous action with the given reward.
update_action(action): Update action.
add_action(actions)
Add new actions (if needed).
Parameters:
actions : iterable
A list of Action objects for recommendation.
get_action(context, n_actions=None)
Return the action to perform.
Parameters:
context : dict
Contexts {action_id: context} of different actions.
n_actions : int (default: None)
Number of actions to recommend. If None, return only one action. If -1, return all actions.
Returns:
history_id : int
The history id of the action.
recommendations : list of dict
Each dict contains {Action object, estimated_reward, uncertainty}.
remove_action(action_id)
Remove action by id.
Parameters:
action_id : int
The id of the action to remove.
reward(history_id, rewards)
Reward the previous action with the given reward.
Parameters:
history_id : int
The history id of the action to reward.
rewards : dictionary
The dictionary {action_id: reward}, where reward is a float.
-
class striatum.bandit.UCB1(history_storage, model_storage, action_storage, recommendation_cls=None)
Bases: striatum.bandit.bandit.BaseBandit
Upper Confidence Bound 1
Parameters:
history_storage : HistoryStorage object
The HistoryStorage object to store history context, actions and rewards.
model_storage : ModelStorage object
The ModelStorage object to store model parameters.
action_storage : ActionStorage object
The ActionStorage object to store actions.
recommendation_cls : class (default: None)
The class used to initiate the recommendations. If None, the default Recommendation class is used.
References
[R10] Peter Auer, et al. “Finite-time Analysis of the Multiarmed Bandit Problem.” Machine Learning, 47. 2002.
Attributes
history_storage
Methods
add_action(actions): Add new actions (if needed).
calculate_avg_reward(): Calculate average reward with respect to time.
calculate_cum_reward(): Calculate cumulative reward with respect to time.
get_action([context, n_actions]): Return the action to perform.
plot_avg_regret(): Plot average regret with respect to time.
plot_avg_reward(): Plot average reward with respect to time.
remove_action(action_id): Remove action by id.
reward(history_id, rewards): Reward the previous action with the given reward.
update_action(action): Update action.
add_action(actions)
Add new actions (if needed).
Parameters:
actions : iterable
A list of Action objects for recommendation.
get_action(context=None, n_actions=None)
Return the action to perform.
Parameters:
context : {array-like, None}
The context of the current state; None if no context is available.
n_actions : int (default: None)
Number of actions to recommend. If None, return only one action. If -1, return all actions.
Returns:
history_id : int
The history id of the action.
recommendations : list of dict
Each dict contains {Action object, estimated_reward, uncertainty}.
remove_action(action_id)
Remove action by id.
Parameters:
action_id : int
The id of the action to remove.
reward(history_id, rewards)
Reward the previous action with the given reward.
Parameters:
history_id : int
The history id of the action to reward.
rewards : dictionary
The dictionary {action_id: reward}, where reward is a float.