striatum.bandit package

Submodules

striatum.bandit.exp3 module

Exp3: Exponential-weight algorithm for Exploration and Exploitation. This module contains a class that implements EXP3, a bandit algorithm that randomly chooses an action according to a learned probability distribution.

class striatum.bandit.exp3.Exp3(history_storage, model_storage, action_storage, recommendation_cls=None, gamma=0.3, random_state=None)

Bases: striatum.bandit.bandit.BaseBandit

Exp3 algorithm.

Parameters:

history_storage : HistoryStorage object

The HistoryStorage object to store history context, actions and rewards.

model_storage : ModelStorage object

The ModelStorage object to store model parameters.

action_storage : ActionStorage object

The ActionStorage object to store actions.

recommendation_cls : class (default: None)

The class used to instantiate the recommendations. If None, the default Recommendation class is used.

gamma: float, 0 < gamma <= 1

The parameter used to control the minimum probability with which each action is chosen.

random_state: {int, np.random.RandomState} (default: None)

If int, np.random.RandomState will use it as a seed. If None, a random seed will be used.

References

[R1] Peter Auer, Nicolo Cesa-Bianchi, et al. “The non-stochastic multi-armed bandit problem.” SIAM Journal on Computing, 2002.

Attributes

history_storage

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action([context, n_actions]) Return the action to perform
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
remove_action(action_id) Remove action by id.
reward(history_id, rewards) Reward the previous action with reward.
update_action(action) Update action.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context=None, n_actions=None)

Return the action to perform

Parameters:

context : {array-like, None}

The context of current state, None if no context available.

n_actions: int (default: None)

Number of actions to recommend to users. If None, return only one action. If -1, return all actions.

Returns:

history_id : int

The history id of the action.

recommendations : list of dict

Each dict contains {Action object, estimated_reward, uncertainty}.

remove_action(action_id)

Remove action by id.

Parameters:

action_id : int

The id of the action to remove.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

The dictionary {action_id: reward}, where reward is a float.
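
A minimal usage sketch for Exp3, assuming the in-memory storage classes MemoryHistoryStorage, MemoryModelStorage, MemoryActionStorage and the Action class are provided by striatum.storage, and that each returned recommendation exposes the chosen Action through an action attribute:

    from striatum.storage import (MemoryHistoryStorage, MemoryModelStorage,
                                  MemoryActionStorage, Action)
    from striatum.bandit import Exp3

    # In-memory storages for history, model parameters, and actions.
    policy = Exp3(MemoryHistoryStorage(), MemoryModelStorage(),
                  MemoryActionStorage(), gamma=0.3, random_state=42)
    policy.add_action([Action(1), Action(2), Action(3)])

    # Exp3 is context-free, so the context stays None.
    history_id, recommendations = policy.get_action(context=None, n_actions=1)
    chosen_id = recommendations[0].action.id  # assumed attribute layout

    # Report the observed reward for the chosen action.
    policy.reward(history_id, {chosen_id: 1.0})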

striatum.bandit.exp4p module

EXP4.P: An extension of the exponential-weight algorithm for exploration and exploitation. This module contains a class that implements EXP4.P, a contextual bandit algorithm with expert advice.

class striatum.bandit.exp4p.Exp4P(actions, historystorage, modelstorage, delta=0.1, p_min=None, max_rounds=10000)

Bases: striatum.bandit.bandit.BaseBandit

Exp4.P with pre-trained supervised learning algorithm.

Parameters:

actions : list of Action objects

List of actions to be chosen from.

historystorage: a HistoryStorage object

The place where we store the histories of contexts and rewards.

modelstorage: a ModelStorage object

The place where we store the model parameters.

delta: float, 0 < delta <= 1

With probability 1 - delta, Exp4.P satisfies the theoretical regret bound.

p_min: float, 0 < p_min < 1/k

The minimum probability to choose each action.

References

[R2] Alina Beygelzimer, et al. “Contextual bandit algorithms with supervised learning guarantees.” International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Attributes

history_storage

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action([context, n_actions]) Return the action to perform
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
remove_action(action_id) Remove action by id.
reward(history_id, rewards) Reward the previous action with reward.
update_action(action) Update action.
get_action(context=None, n_actions=1)

Return the action to perform

Parameters:

context : dictionary

Contexts {expert_id: {action_id: expert_prediction}} of different actions.

n_actions: int

Number of actions to recommend to users.

Returns:

history_id : int

The history id of the action.

action_recommendation : list of dictionaries

Each dictionary contains {Action object, estimated_reward, uncertainty}.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

The dictionary {action_id: reward}, where reward is a float.
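
A minimal usage sketch for Exp4.P, following the constructor shown above; the storage classes (from striatum.storage) and the recommendation attribute names are assumptions:

    from striatum.storage import MemoryHistoryStorage, MemoryModelStorage, Action
    from striatum.bandit import Exp4P

    actions = [Action(1), Action(2), Action(3)]
    policy = Exp4P(actions, MemoryHistoryStorage(), MemoryModelStorage(),
                   delta=0.1, p_min=None, max_rounds=10000)

    # Expert advice: each expert assigns a probability to every action.
    context = {
        'expert_a': {1: 0.7, 2: 0.2, 3: 0.1},
        'expert_b': {1: 0.2, 2: 0.5, 3: 0.3},
    }
    history_id, recommendations = policy.get_action(context, n_actions=1)
    chosen_id = recommendations[0].action.id  # assumed attribute layout
    policy.reward(history_id, {chosen_id: 1.0})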

striatum.bandit.linthompsamp module

Thompson Sampling with Linear Payoff. This module contains a class that implements Thompson Sampling with Linear Payoff, a contextual multi-armed bandit algorithm that assumes the underlying relationship between rewards and contexts is linear. Sampling is used to balance exploration and exploitation. Please check the reference for more details.

class striatum.bandit.linthompsamp.LinThompSamp(history_storage, model_storage, action_storage, recommendation_cls=None, context_dimension=128, delta=0.5, R=0.01, epsilon=0.5, random_state=None)

Bases: striatum.bandit.bandit.BaseBandit

Thompson sampling with linear payoff.

Parameters:

history_storage : HistoryStorage object

The HistoryStorage object to store history context, actions and rewards.

model_storage : ModelStorage object

The ModelStorage object to store model parameters.

action_storage : ActionStorage object

The ActionStorage object to store actions.

recommendation_cls : class (default: None)

The class used to instantiate the recommendations. If None, the default Recommendation class is used.

delta: float, 0 < delta < 1

With probability 1 - delta, LinThompSamp satisfies the theoretical regret bound.

R: float, R >= 0

Assume that the residual \(r_i(t) - b_i(t)^T \hat{\mu}\) is R-sub-Gaussian. In this case, \(R^2\) represents the variance of the residuals of the linear model \(b_i(t)^T \hat{\mu}\).

epsilon: float, 0 < epsilon < 1

A parameter used by the Thompson Sampling algorithm. If the total trials T is known, we can choose epsilon = 1/ln(T).

random_state: {int, np.random.RandomState} (default: None)

If int, np.random.RandomState will use it as a seed. If None, a random seed will be used.

References

[R3]Shipra Agrawal, and Navin Goyal. “Thompson Sampling for Contextual Bandits with Linear Payoffs.” Advances in Neural Information Processing Systems 24. 2011.

Attributes

history_storage

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action(context[, n_actions]) Return the action to perform
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
remove_action(action_id) Remove action by id.
reward(history_id, rewards) Reward the previous action with reward.
update_action(action) Update action.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context, n_actions=None)

Return the action to perform

Parameters:

context : dictionary

Contexts {action_id: context} of different actions.

n_actions: int (default: None)

Number of actions to recommend to users. If None, return only one action. If -1, return all actions.

Returns:

history_id : int

The history id of the action.

recommendations : list of dict

Each dict contains {Action object, estimated_reward, uncertainty}.

remove_action(action_id)

Remove action by id.

Parameters:

action_id : int

The id of the action to remove.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

The dictionary {action_id: reward}, where reward is a float.
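
A minimal usage sketch for LinThompSamp, assuming the in-memory storage classes come from striatum.storage and that a NumPy array of length context_dimension is accepted as each per-action context:

    import numpy as np

    from striatum.storage import (MemoryHistoryStorage, MemoryModelStorage,
                                  MemoryActionStorage, Action)
    from striatum.bandit import LinThompSamp

    context_dimension = 4
    policy = LinThompSamp(MemoryHistoryStorage(), MemoryModelStorage(),
                          MemoryActionStorage(),
                          context_dimension=context_dimension,
                          delta=0.5, R=0.01, epsilon=0.5, random_state=42)
    policy.add_action([Action(1), Action(2)])

    # One context vector per action id.
    context = {1: np.random.rand(context_dimension),
               2: np.random.rand(context_dimension)}
    history_id, recommendations = policy.get_action(context, n_actions=1)
    policy.reward(history_id, {recommendations[0].action.id: 0.0})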

striatum.bandit.linucb module

LinUCB with Disjoint Linear Models

This module contains a class that implements LinUCB with disjoint linear models, a contextual bandit algorithm that assumes the reward is a linear function of the context.

class striatum.bandit.linucb.LinUCB(history_storage, model_storage, action_storage, recommendation_cls=None, context_dimension=128, alpha=0.5)

Bases: striatum.bandit.bandit.BaseBandit

LinUCB with Disjoint Linear Models

Parameters:

history_storage : HistoryStorage object

The HistoryStorage object to store history context, actions and rewards.

model_storage : ModelStorage object

The ModelStorage object to store model parameters.

action_storage : ActionStorage object

The ActionStorage object to store actions.

recommendation_cls : class (default: None)

The class used to instantiate the recommendations. If None, the default Recommendation class is used.

context_dimension: int

The dimension of the context.

alpha: float

The constant that determines the width of the upper confidence bound.

References

[R4]Lihong Li, et al. “A Contextual-Bandit Approach to Personalized News Article Recommendation.” In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010.

Attributes

history_storage

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action(context[, n_actions]) Return the action to perform
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
remove_action(action_id) Remove action by id.
reward(history_id, rewards) Reward the previous action with reward.
update_action(action) Update action.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context, n_actions=None)

Return the action to perform

Parameters:

context : dict

Contexts {action_id: context} of different actions.

n_actions: int (default: None)

Number of actions to recommend to users. If None, return only one action. If -1, return all actions.

Returns:

history_id : int

The history id of the action.

recommendations : list of dict

Each dict contains {Action object, estimated_reward, uncertainty}.

remove_action(action_id)

Remove action by id.

Parameters:

action_id : int

The id of the action to remove.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

The dictionary {action_id: reward}, where reward is a float.
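
A minimal usage sketch for LinUCB under the same assumptions as the LinThompSamp example (in-memory storage classes from striatum.storage, NumPy arrays as per-action contexts):

    import numpy as np

    from striatum.storage import (MemoryHistoryStorage, MemoryModelStorage,
                                  MemoryActionStorage, Action)
    from striatum.bandit import LinUCB

    context_dimension = 4
    policy = LinUCB(MemoryHistoryStorage(), MemoryModelStorage(),
                    MemoryActionStorage(),
                    context_dimension=context_dimension, alpha=0.5)
    policy.add_action([Action(1), Action(2)])

    # Contexts are keyed by action id, one vector per action.
    context = {1: np.random.rand(context_dimension),
               2: np.random.rand(context_dimension)}
    history_id, recommendations = policy.get_action(context, n_actions=1)
    policy.reward(history_id, {recommendations[0].action.id: 1.0})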

striatum.bandit.ucb1 module

Upper Confidence Bound 1. This module contains a class that implements the UCB1 algorithm, a well-known multi-armed bandit algorithm that does not use context.

class striatum.bandit.ucb1.UCB1(history_storage, model_storage, action_storage, recommendation_cls=None)

Bases: striatum.bandit.bandit.BaseBandit

Upper Confidence Bound 1

Parameters:

history_storage : HistoryStorage object

The HistoryStorage object to store history context, actions and rewards.

model_storage : ModelStorage object

The ModelStorage object to store model parameters.

action_storage : ActionStorage object

The ActionStorage object to store actions.

recommendation_cls : class (default: None)

The class used to instantiate the recommendations. If None, the default Recommendation class is used.

References

[R5]Peter Auer, et al. “Finite-time Analysis of the Multiarmed Bandit Problem.” Machine Learning, 47. 2002.

Attributes

history_storage

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action([context, n_actions]) Return the action to perform
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
remove_action(action_id) Remove action by id.
reward(history_id, rewards) Reward the previous action with reward.
update_action(action) Update action.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context=None, n_actions=None)

Return the action to perform

Parameters:

context : {array-like, None}

The context of current state, None if no context available.

n_actions: int (default: None)

Number of actions to recommend to users. If None, return only one action. If -1, return all actions.

Returns:

history_id : int

The history id of the action.

recommendations : list of dict

Each dict contains {Action object, estimated_reward, uncertainty}.

remove_action(action_id)

Remove action by id.

Parameters:

action_id : int

The id of the action to remove.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

The dictionary {action_id: reward}, where reward is a float.
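
A minimal usage sketch for UCB1, again assuming the in-memory storage classes and the Action class are available from striatum.storage:

    from striatum.storage import (MemoryHistoryStorage, MemoryModelStorage,
                                  MemoryActionStorage, Action)
    from striatum.bandit import UCB1

    policy = UCB1(MemoryHistoryStorage(), MemoryModelStorage(),
                  MemoryActionStorage())
    policy.add_action([Action(1), Action(2), Action(3)])

    # UCB1 ignores context.
    history_id, recommendations = policy.get_action(context=None, n_actions=1)
    policy.reward(history_id, {recommendations[0].action.id: 1.0})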

Module contents

Bandit algorithm classes
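
The classes documented below are exposed at the package level, so they can be imported directly from striatum.bandit, for example:

    from striatum.bandit import Exp3, Exp4P, LinThompSamp, LinUCB, UCB1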

class striatum.bandit.Exp3(history_storage, model_storage, action_storage, recommendation_cls=None, gamma=0.3, random_state=None)

Bases: striatum.bandit.bandit.BaseBandit

Exp3 algorithm.

Parameters:

history_storage : HistoryStorage object

The HistoryStorage object to store history context, actions and rewards.

model_storage : ModelStorage object

The ModelStorage object to store model parameters.

action_storage : ActionStorage object

The ActionStorage object to store actions.

recommendation_cls : class (default: None)

The class used to instantiate the recommendations. If None, the default Recommendation class is used.

gamma: float, 0 < gamma <= 1

The parameter used to control the minimum probability with which each action is chosen.

random_state: {int, np.random.RandomState} (default: None)

If int, np.random.RandomState will use it as a seed. If None, a random seed will be used.

References

[R6] Peter Auer, Nicolo Cesa-Bianchi, et al. “The non-stochastic multi-armed bandit problem.” SIAM Journal on Computing, 2002.

Attributes

history_storage

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action([context, n_actions]) Return the action to perform
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
remove_action(action_id) Remove action by id.
reward(history_id, rewards) Reward the previous action with reward.
update_action(action) Update action.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context=None, n_actions=None)

Return the action to perform

Parameters:

context : {array-like, None}

The context of current state, None if no context available.

n_actions: int (default: None)

Number of actions to recommend to users. If None, return only one action. If -1, return all actions.

Returns:

history_id : int

The history id of the action.

recommendations : list of dict

Each dict contains {Action object, estimated_reward, uncertainty}.

remove_action(action_id)

Remove action by id.

Parameters:

action_id : int

The id of the action to remove.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

The dictionary {action_id: reward}, where reward is a float.

class striatum.bandit.Exp4P(actions, historystorage, modelstorage, delta=0.1, p_min=None, max_rounds=10000)

Bases: striatum.bandit.bandit.BaseBandit

Exp4.P with pre-trained supervised learning algorithm.

Parameters:

actions : list of Action objects

List of actions to be chosen from.

historystorage: a HistoryStorage object

The place where we store the histories of contexts and rewards.

modelstorage: a ModelStorage object

The place where we store the model parameters.

delta: float, 0 < delta <= 1

With probability 1 - delta, Exp4.P satisfies the theoretical regret bound.

p_min: float, 0 < p_min < 1/k

The minimum probability to choose each action.

References

[R7] Alina Beygelzimer, et al. “Contextual bandit algorithms with supervised learning guarantees.” International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Attributes

history_storage

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action([context, n_actions]) Return the action to perform
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
remove_action(action_id) Remove action by id.
reward(history_id, rewards) Reward the previous action with reward.
update_action(action) Update action.
get_action(context=None, n_actions=1)

Return the action to perform

Parameters:

context : dictionary

Contexts {expert_id: {action_id: expert_prediction}} of different actions.

n_actions: int

Number of actions to recommend to users.

Returns:

history_id : int

The history id of the action.

action_recommendation : list of dictionaries

Each dictionary contains {Action object, estimated_reward, uncertainty}.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

The dictionary {action_id: reward}, where reward is a float.

class striatum.bandit.LinThompSamp(history_storage, model_storage, action_storage, recommendation_cls=None, context_dimension=128, delta=0.5, R=0.01, epsilon=0.5, random_state=None)

Bases: striatum.bandit.bandit.BaseBandit

Thompson sampling with linear payoff.

Parameters:

history_storage : HistoryStorage object

The HistoryStorage object to store history context, actions and rewards.

model_storage : ModelStorage object

The ModelStorage object to store model parameters.

action_storage : ActionStorage object

The ActionStorage object to store actions.

recommendation_cls : class (default: None)

The class used to instantiate the recommendations. If None, the default Recommendation class is used.

delta: float, 0 < delta < 1

With probability 1 - delta, LinThompSamp satisfies the theoretical regret bound.

R: float, R >= 0

Assume that the residual \(r_i(t) - b_i(t)^T \hat{\mu}\) is R-sub-Gaussian. In this case, \(R^2\) represents the variance of the residuals of the linear model \(b_i(t)^T \hat{\mu}\).

epsilon: float, 0 < epsilon < 1

A parameter used by the Thompson Sampling algorithm. If the total trials T is known, we can choose epsilon = 1/ln(T).

random_state: {int, np.random.RandomState} (default: None)

If int, np.random.RandomState will use it as a seed. If None, a random seed will be used.

References

[R8]Shipra Agrawal, and Navin Goyal. “Thompson Sampling for Contextual Bandits with Linear Payoffs.” Advances in Neural Information Processing Systems 24. 2011.

Attributes

history_storage

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action(context[, n_actions]) Return the action to perform
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
remove_action(action_id) Remove action by id.
reward(history_id, rewards) Reward the previous action with reward.
update_action(action) Update action.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context, n_actions=None)

Return the action to perform

Parameters:

context : dictionary

Contexts {action_id: context} of different actions.

n_actions: int (default: None)

Number of actions to recommend to users. If None, return only one action. If -1, return all actions.

Returns:

history_id : int

The history id of the action.

recommendations : list of dict

Each dict contains {Action object, estimated_reward, uncertainty}.

remove_action(action_id)

Remove action by id.

Parameters:

action_id : int

The id of the action to remove.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

The dictionary {action_id: reward}, where reward is a float.

class striatum.bandit.LinUCB(history_storage, model_storage, action_storage, recommendation_cls=None, context_dimension=128, alpha=0.5)

Bases: striatum.bandit.bandit.BaseBandit

LinUCB with Disjoint Linear Models

Parameters:

history_storage : HistoryStorage object

The HistoryStorage object to store history context, actions and rewards.

model_storage : ModelStorage object

The ModelStorage object to store model parameters.

action_storage : ActionStorage object

The ActionStorage object to store actions.

recommendation_cls : class (default: None)

The class used to instantiate the recommendations. If None, the default Recommendation class is used.

context_dimension: int

The dimension of the context.

alpha: float

The constant that determines the width of the upper confidence bound.

References

[R9]Lihong Li, et al. “A Contextual-Bandit Approach to Personalized News Article Recommendation.” In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010.

Attributes

history_storage

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action(context[, n_actions]) Return the action to perform
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
remove_action(action_id) Remove action by id.
reward(history_id, rewards) Reward the previous action with reward.
update_action(action) Update action.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context, n_actions=None)

Return the action to perform

Parameters:

context : dict

Contexts {action_id: context} of different actions.

n_actions: int (default: None)

Number of actions to recommend to users. If None, return only one action. If -1, return all actions.

Returns:

history_id : int

The history id of the action.

recommendations : list of dict

Each dict contains {Action object, estimated_reward, uncertainty}.

remove_action(action_id)

Remove action by id.

Parameters:

action_id : int

The id of the action to remove.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

The dictionary {action_id: reward}, where reward is a float.

class striatum.bandit.UCB1(history_storage, model_storage, action_storage, recommendation_cls=None)

Bases: striatum.bandit.bandit.BaseBandit

Upper Confidence Bound 1

Parameters:

history_storage : HistoryStorage object

The HistoryStorage object to store history context, actions and rewards.

model_storage : ModelStorage object

The ModelStorage object to store model parameters.

action_storage : ActionStorage object

The ActionStorage object to store actions.

recommendation_cls : class (default: None)

The class used to instantiate the recommendations. If None, the default Recommendation class is used.

References

[R10]Peter Auer, et al. “Finite-time Analysis of the Multiarmed Bandit Problem.” Machine Learning, 47. 2002.

Attributes

history_storage

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action([context, n_actions]) Return the action to perform
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
remove_action(action_id) Remove action by id.
reward(history_id, rewards) Reward the previous action with reward.
update_action(action) Update action.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context=None, n_actions=None)

Return the action to perform

Parameters:

context : {array-like, None}

The context of current state, None if no context available.

n_actions: int (default: None)

Number of actions to recommend to users. If None, return only one action. If -1, return all actions.

Returns:

history_id : int

The history id of the action.

recommendations : list of dict

Each dict contains {Action object, estimated_reward, uncertainty}.

remove_action(action_id)

Remove action by id.

Parameters:

action_id : int

The id of the action to remove.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

The dictionary {action_id: reward}, where reward is a float.