striatum.bandit package

Submodules

striatum.bandit.exp3 module

Exp3: Exponential-weight algorithm for Exploration and Exploitation. This module contains a class that implements EXP3, a bandit algorithm that randomly chooses an action according to a learned probability distribution.

class striatum.bandit.exp3.Exp3(actions, historystorage, modelstorage, gamma)

Bases: striatum.bandit.bandit.BaseBandit

Exp3 algorithm.

Parameters:

actions : list of Action objects

List of actions to be chosen from.

historystorage: a HistoryStorage object

The place where we store the histories of contexts and rewards.

modelstorage: a ModelStorage object

The place where we store the model parameters.

gamma: float, 0 < gamma <= 1

The parameter used to control the minimum chosen probability for each action.

References

[R1]Peter Auer, Nicolo Cesa-Bianchi, et al. “The non-stochastic multi-armed bandit problem.” SIAM Journal on Computing. 2002.

Attributes

exp3_ (‘exp3’ object instance) The contextual bandit algorithm instance

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
exp3() The generator which implements the main part of Exp3.
get_action(context[, n_actions]) Return the action to perform
get_action_with_id(action_id)
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
reward(history_id, rewards) Reward the previous action with reward.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

exp3()

The generator which implements the main part of Exp3.

get_action(context, n_actions=1)

Return the action to perform

Parameters:

context : {array-like, None}

The context of current state, None if no context available.

n_actions: int

Number of actions to recommend to the user.

Returns:

history_id : int

The history id of the action.

action_recommendation : list of dictionaries

Each dictionary contains {Action object, estimated_reward, uncertainty}.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

A dictionary {action_id: reward}, where reward is a float.
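A minimal usage sketch follows. It assumes in-memory storage classes MemoryHistoryStorage and MemoryModelStorage in striatum.storage, an Action class in striatum.bandit.bandit, and recommendation dictionaries keyed by ‘action’; these import paths, constructors, and key names may differ in your installed version.

    # Hypothetical Exp3 usage sketch; import paths and key names are assumptions.
    from striatum.storage import MemoryHistoryStorage, MemoryModelStorage
    from striatum.bandit.bandit import Action
    from striatum.bandit.exp3 import Exp3

    actions = [Action(action_id) for action_id in range(1, 6)]  # five arms (constructor assumed)
    policy = Exp3(actions, MemoryHistoryStorage(), MemoryModelStorage(), gamma=0.1)

    # Exp3 is context-free, so the context is passed as None.
    history_id, recommendations = policy.get_action(context=None, n_actions=1)
    chosen = recommendations[0]['action']                # dict key name is an assumption

    # Feed the observed reward back, keyed by the action id.
    policy.reward(history_id, {chosen.action_id: 1.0})   # attribute name is an assumption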

striatum.bandit.exp4p module

EXP4.P: An extension of the exponential-weight algorithm for exploration and exploitation. This module contains a class that implements EXP4.P, a contextual bandit algorithm with expert advice.

class striatum.bandit.exp4p.Exp4P(actions, historystorage, modelstorage, delta=0.1, p_min=None, max_rounds=10000)

Bases: striatum.bandit.bandit.BaseBandit

Exp4.P with pre-trained supervised learning algorithm.

Parameters:

actions : list of Action objects

List of actions to be chosen from.

historystorage: a HistoryStorage object

The place where we store the histories of contexts and rewards.

modelstorage: a ModelStorage object

The place where we store the model parameters.

delta: float, 0 < delta <= 1

With probability 1 - delta, Exp4.P satisfies the theoretical regret bound.

p_min: float, 0 < p_min < 1/k

The minimum probability to choose each action.

References

[R2]Beygelzimer, Alina, et al. “Contextual bandit algorithms with supervised learning guarantees.” International Conference on Artificial Intelligence and Statistics (AISTATS). 2011.

Attributes

exp4p_ (‘exp4p’ object instance) The contextual bandit algorithm instance

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action([context, n_actions]) Return the action to perform
get_action_with_id(action_id)
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
reward(history_id, rewards) Reward the previous action with reward.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context=None, n_actions=1)

Return the action to perform

Parameters:

context : dictionary

Contexts {expert_id: {action_id: expert_prediction}} of different actions.

n_actions: int

Number of actions to recommend to the user.

Returns:

history_id : int

The history id of the action.

action_recommendation : list of dictionaries

Each dictionary contains {Action object, estimated_reward, uncertainty}.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

A dictionary {action_id: reward}, where reward is a float.
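A minimal usage sketch, under the same assumed imports as in the Exp3 example above, showing the expert-advice context format {expert_id: {action_id: expert_prediction}}; expert ids and key names here are purely illustrative.

    # Hypothetical Exp4.P usage sketch; import paths and key names are assumptions.
    from striatum.storage import MemoryHistoryStorage, MemoryModelStorage
    from striatum.bandit.bandit import Action
    from striatum.bandit.exp4p import Exp4P

    actions = [Action(i) for i in (1, 2, 3)]
    policy = Exp4P(actions, MemoryHistoryStorage(), MemoryModelStorage(),
                   delta=0.1, p_min=None, max_rounds=10000)

    # Advice from two hypothetical experts: a probability over action ids per expert.
    context = {
        'expert_a': {1: 0.7, 2: 0.2, 3: 0.1},
        'expert_b': {1: 0.3, 2: 0.3, 3: 0.4},
    }
    history_id, recommendations = policy.get_action(context=context, n_actions=1)
    chosen = recommendations[0]['action']                # dict key name is an assumption
    policy.reward(history_id, {chosen.action_id: 0.0})   # attribute name is an assumption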

striatum.bandit.linthompsamp module

Thompson Sampling with Linear Payoff. This module contains a class that implements Thompson Sampling with Linear Payoff, a contextual multi-armed bandit algorithm which assumes the underlying relationship between rewards and contexts is linear. Sampling is used to balance exploration and exploitation. Please check the reference for more details.

class striatum.bandit.linthompsamp.LinThompSamp(actions, historystorage, modelstorage, context_dimension, delta=0.5, R=0.5, epsilon=0.1, random_state=None)

Bases: striatum.bandit.bandit.BaseBandit

Thompson sampling with linear payoff.

Parameters:

actions : list of Action objects

List of actions to be chosen from.

historystorage: a HistoryStorage object

The place where we store the histories of contexts and rewards.

modelstorage: a ModelStorage object

The place where we store the model parameters.

delta: float, 0 < delta < 1

With probability 1 - delta, LinThompSamp satisfies the theoretical regret bound.

R: float, R >= 0

Assume that the residual \(r_i(t) - b_i(t)^T \hat{\mu}\) is R-sub-Gaussian. In this case, \(R^2\) represents the variance of the residuals of the linear model \(b_i(t)^T \hat{\mu}\).

epsilon: float, 0 < epsilon < 1

A parameter used by the Thompson Sampling algorithm. If the total number of trials T is known, we can choose epsilon = 1/ln(T).

References

[R3]Shipra Agrawal, and Navin Goyal. “Thompson Sampling for Contextual Bandits with Linear Payoffs.” Advances in Neural Information Processing Systems 24. 2011.

Attributes

linthomp_ (‘linthomp’ object instance) The contextual bandit algorithm instance

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action(context[, n_actions]) Return the action to perform
get_action_with_id(action_id)
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
reward(history_id, rewards) Reward the previous action with reward.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context, n_actions=1)

Return the action to perform

Parameters:

context : dictionary

Contexts {action_id: context} of different actions.

n_actions: int

Number of actions to recommend to the user.

Returns:

history_id : int

The history id of the action.

action_recommendation : list of dictionaries

Each dictionary contains {Action object, estimated_reward, uncertainty}.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

A dictionary {action_id: reward}, where reward is a float.
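A minimal usage sketch, under the same assumed imports as in the earlier examples, showing the per-action context format {action_id: context} with vectors of length context_dimension.

    # Hypothetical LinThompSamp usage sketch; import paths and key names are assumptions.
    import numpy as np
    from striatum.storage import MemoryHistoryStorage, MemoryModelStorage
    from striatum.bandit.bandit import Action
    from striatum.bandit.linthompsamp import LinThompSamp

    actions = [Action(i) for i in (1, 2, 3)]
    policy = LinThompSamp(actions, MemoryHistoryStorage(), MemoryModelStorage(),
                          context_dimension=2, delta=0.5, R=0.5, epsilon=0.1)

    # One context vector per action id, each of length context_dimension.
    context = {i: np.random.uniform(size=2) for i in (1, 2, 3)}
    history_id, recommendations = policy.get_action(context, n_actions=1)
    chosen = recommendations[0]['action']                # dict key name is an assumption
    policy.reward(history_id, {chosen.action_id: 1.0})   # attribute name is an assumption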

striatum.bandit.linucb module

LinUCB with Disjoint Linear Models

This module contains a class that implements LinUCB with disjoint linear models, a contextual bandit algorithm assuming the expected reward is a linear function of the context.

class striatum.bandit.linucb.LinUCB(actions, historystorage, modelstorage, alpha, context_dimension=1)

Bases: striatum.bandit.bandit.BaseBandit

LinUCB with Disjoint Linear Models

Parameters:

actions : list of Action objects

List of actions to be chosen from.

historystorage: a HistoryStorage object

The place where we store the histories of contexts and rewards.

modelstorage: a ModelStorage object

The place where we store the model parameters.

alpha: float

The constant that determines the width of the upper confidence bound.

context_dimension: int

The dimension of the context.

References

[R4]Lihong Li, et al. “A Contextual-Bandit Approach to Personalized News Article Recommendation.” In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010.

Attributes

linucb_ (‘linucb’ object instance) The contextual bandit algorithm instance.

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action(context[, n_actions]) Return the action to perform
get_action_with_id(action_id)
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
reward(history_id, rewards) Reward the previous action with reward.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context, n_actions=1)

Return the action to perform

Parameters:

context : dict

Contexts {action_id: context} of different actions.

n_actions: int

Number of actions to recommend to the user.

Returns:

history_id : int

The history id of the action.

action_recommendation : list of dict

Each dict contains {Action object, estimated_reward, uncertainty}

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

A dictionary {action_id: reward}, where reward is a float.
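A minimal usage sketch, under the same assumed imports as in the earlier examples, showing the per-action context format {action_id: context} and the alpha and context_dimension parameters.

    # Hypothetical LinUCB usage sketch; import paths and key names are assumptions.
    import numpy as np
    from striatum.storage import MemoryHistoryStorage, MemoryModelStorage
    from striatum.bandit.bandit import Action
    from striatum.bandit.linucb import LinUCB

    actions = [Action(i) for i in (1, 2, 3)]
    policy = LinUCB(actions, MemoryHistoryStorage(), MemoryModelStorage(),
                    alpha=0.5, context_dimension=2)

    # One context vector per action id, each of length context_dimension.
    context = {i: np.random.uniform(size=2) for i in (1, 2, 3)}
    history_id, recommendations = policy.get_action(context, n_actions=1)
    chosen = recommendations[0]['action']                # dict key name is an assumption
    policy.reward(history_id, {chosen.action_id: 1.0})   # attribute name is an assumption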

striatum.bandit.ucb1 module

Upper Confidence Bound 1. This module contains a class that implements the UCB1 algorithm, a well-known multi-armed bandit algorithm that uses no context.

class striatum.bandit.ucb1.UCB1(actions, historystorage, modelstorage)

Bases: striatum.bandit.bandit.BaseBandit

Upper Confidence Bound 1

Parameters:

actions : {array-like, None}

Actions (arms) for recommendation

historystorage: a :py:mod:`striatum.storage.HistoryStorage` object

The object where we store the histories of contexts and rewards.

modelstorage: a :py:mod:`striatum.storage.ModelStorage` object

The object where we store the model parameters.

References

[R5]Peter Auer, et al. “Finite-time Analysis of the Multiarmed Bandit Problem.” Machine Learning, 47. 2002.

Attributes

ucb1_ (‘ucb1’ object instance) The multi-armed bandit algorithm instance.

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action(context[, n_actions]) Return the action to perform
get_action_with_id(action_id)
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
reward(history_id, rewards) Reward the previous action with reward.
ucb1()
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context, n_actions=1)

Return the action to perform

Parameters:

context : {array-like, None}

The context of current state, None if no context available.

n_actions: int

Number of actions to recommend to the user.

Returns:

history_id : int

The history id of the action.

action : list of dictionaries

Each dictionary contains {Action object, estimated_reward, uncertainty}.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

A dictionary {action_id: reward}, where reward is a float.

ucb1()
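A minimal usage sketch, under the same assumed imports as in the earlier examples. Since UCB1 uses no context, None is passed to get_action.

    # Hypothetical UCB1 usage sketch; import paths and key names are assumptions.
    from striatum.storage import MemoryHistoryStorage, MemoryModelStorage
    from striatum.bandit.bandit import Action
    from striatum.bandit.ucb1 import UCB1

    actions = [Action(i) for i in (1, 2, 3)]
    policy = UCB1(actions, MemoryHistoryStorage(), MemoryModelStorage())

    # UCB1 is context-free, so the context is passed as None.
    history_id, recommendations = policy.get_action(context=None, n_actions=1)
    chosen = recommendations[0]['action']                # dict key name is an assumption
    policy.reward(history_id, {chosen.action_id: 1.0})   # attribute name is an assumption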

Module contents

Bandit algorithm classes
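The classes documented below are re-exported at the package level, so they can be imported directly from striatum.bandit instead of from the individual submodules:

    from striatum.bandit import Exp3, Exp4P, LinThompSamp, LinUCB, UCB1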

class striatum.bandit.Exp3(actions, historystorage, modelstorage, gamma)

Bases: striatum.bandit.bandit.BaseBandit

Exp3 algorithm.

Parameters:

actions : list of Action objects

List of actions to be chosen from.

historystorage: a HistoryStorage object

The place where we store the histories of contexts and rewards.

modelstorage: a ModelStorage object

The place where we store the model parameters.

gamma: float, 0 < gamma <= 1

The parameter used to control the minimum chosen probability for each action.

References

[R6]Peter Auer, Nicolo Cesa-Bianchi, et al. “The non-stochastic multi-armed bandit problem.” SIAM Journal on Computing. 2002.

Attributes

exp3_ (‘exp3’ object instance) The contextual bandit algorithm instance

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
exp3() The generator which implements the main part of Exp3.
get_action(context[, n_actions]) Return the action to perform
get_action_with_id(action_id)
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
reward(history_id, rewards) Reward the previous action with reward.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

exp3()

The generator which implements the main part of Exp3.

get_action(context, n_actions=1)

Return the action to perform

Parameters:

context : {array-like, None}

The context of current state, None if no context available.

n_actions: int

Number of actions to recommend to the user.

Returns:

history_id : int

The history id of the action.

action_recommendation : list of dictionaries

Each dictionary contains {Action object, estimated_reward, uncertainty}.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

A dictionary {action_id: reward}, where reward is a float.

class striatum.bandit.Exp4P(actions, historystorage, modelstorage, delta=0.1, p_min=None, max_rounds=10000)

Bases: striatum.bandit.bandit.BaseBandit

Exp4.P with pre-trained supervised learning algorithm.

Parameters:

actions : list of Action objects

List of actions to be chosen from.

historystorage: a HistoryStorage object

The place where we store the histories of contexts and rewards.

modelstorage: a ModelStorage object

The place where we store the model parameters.

delta: float, 0 < delta <= 1

With probability 1 - delta, Exp4.P satisfies the theoretical regret bound.

p_min: float, 0 < p_min < 1/k

The minimum probability to choose each action.

References

[R7]Beygelzimer, Alina, et al. “Contextual bandit algorithms with supervised learning guarantees.” International Conference on Artificial Intelligence and Statistics (AISTATS). 2011.

Attributes

exp4p_ (‘exp4p’ object instance) The contextual bandit algorithm instance

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action([context, n_actions]) Return the action to perform
get_action_with_id(action_id)
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
reward(history_id, rewards) Reward the previous action with reward.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context=None, n_actions=1)

Return the action to perform

Parameters:

context : dictionary

Contexts {expert_id: {action_id: expert_prediction}} of different actions.

n_actions: int

Number of actions to recommend to the user.

Returns:

history_id : int

The history id of the action.

action_recommendation : list of dictionaries

Each dictionary contains {Action object, estimated_reward, uncertainty}.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

A dictionary {action_id: reward}, where reward is a float.

class striatum.bandit.LinThompSamp(actions, historystorage, modelstorage, context_dimension, delta=0.5, R=0.5, epsilon=0.1, random_state=None)

Bases: striatum.bandit.bandit.BaseBandit

Thompson sampling with linear payoff.

Parameters:

actions : list of Action objects

List of actions to be chosen from.

historystorage: a HistoryStorage object

The place where we store the histories of contexts and rewards.

modelstorage: a ModelStorage object

The place where we store the model parameters.

delta: float, 0 < delta < 1

With probability 1 - delta, LinThompSamp satisfies the theoretical regret bound.

R: float, R >= 0

Assume that the residual \(r_i(t) - b_i(t)^T \hat{\mu}\) is R-sub-Gaussian. In this case, \(R^2\) represents the variance of the residuals of the linear model \(b_i(t)^T \hat{\mu}\).

epsilon: float, 0 < epsilon < 1

A parameter used by the Thompson Sampling algorithm. If the total number of trials T is known, we can choose epsilon = 1/ln(T).

References

[R8]Shipra Agrawal, and Navin Goyal. “Thompson Sampling for Contextual Bandits with Linear Payoffs.” Advances in Neural Information Processing Systems 24. 2011.

Attributes

linthomp_ (‘linthomp’ object instance) The contextual bandit algorithm instance

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action(context[, n_actions]) Return the action to perform
get_action_with_id(action_id)
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
reward(history_id, rewards) Reward the previous action with reward.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context, n_actions=1)

Return the action to perform

Parameters:

context : dictionary

Contexts {action_id: context} of different actions.

n_actions: int

Number of actions to recommend to the user.

Returns:

history_id : int

The history id of the action.

action_recommendation : list of dictionaries

Each dictionary contains {Action object, estimated_reward, uncertainty}.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

A dictionary {action_id: reward}, where reward is a float.

class striatum.bandit.LinUCB(actions, historystorage, modelstorage, alpha, context_dimension=1)

Bases: striatum.bandit.bandit.BaseBandit

LinUCB with Disjoint Linear Models

Parameters:

actions : list of Action objects

List of actions to be chosen from.

historystorage: a HistoryStorage object

The place where we store the histories of contexts and rewards.

modelstorage: a ModelStorage object

The place where we store the model parameters.

alpha: float

The constant that determines the width of the upper confidence bound.

context_dimension: int

The dimension of the context.

References

[R9]Lihong Li, et al. “A Contextual-Bandit Approach to Personalized News Article Recommendation.” In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010.

Attributes

linucb_ (‘linucb’ object instance) The contextual bandit algorithm instance.

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action(context[, n_actions]) Return the action to perform
get_action_with_id(action_id)
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
reward(history_id, rewards) Reward the previous action with reward.
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context, n_actions=1)

Return the action to perform

Parameters:

context : dict

Contexts {action_id: context} of different actions.

n_actions: int

Number of actions to recommend to the user.

Returns:

history_id : int

The history id of the action.

action_recommendation : list of dict

Each dict contains {Action object, estimated_reward, uncertainty}

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

A dictionary {action_id: reward}, where reward is a float.

class striatum.bandit.UCB1(actions, historystorage, modelstorage)

Bases: striatum.bandit.bandit.BaseBandit

Upper Confidence Bound 1

Parameters:

actions : {array-like, None}

Actions (arms) for recommendation

historystorage: a :py:mod:`striatum.storage.HistoryStorage` object

The object where we store the histories of contexts and rewards.

modelstorage: a :py:mod:`striatum.storage.ModelStorage` object

The object where we store the model parameters.

References

[R10]Peter Auer, et al. “Finite-time Analysis of the Multiarmed Bandit Problem.” Machine Learning, 47. 2002.

Attributes

ucb1_ (‘ucb1’ object instance) The multi-armed bandit algorithm instance.

Methods

add_action(actions) Add new actions (if needed).
calculate_avg_reward() Calculate average reward with respect to time.
calculate_cum_reward() Calculate cumulative reward with respect to time.
get_action(context[, n_actions]) Return the action to perform
get_action_with_id(action_id)
plot_avg_regret() Plot average regret with respect to time.
plot_avg_reward() Plot average reward with respect to time.
reward(history_id, rewards) Reward the previous action with reward.
ucb1()
add_action(actions)

Add new actions (if needed).

Parameters:

actions : iterable

A list of Action objects for recommendation

get_action(context, n_actions=1)

Return the action to perform

Parameters:

context : {array-like, None}

The context of current state, None if no context available.

n_actions: int

Number of actions to recommend to the user.

Returns:

history_id : int

The history id of the action.

action : list of dictionaries

Each dictionary contains {Action object, estimated_reward, uncertainty}.

reward(history_id, rewards)

Reward the previous action with reward.

Parameters:

history_id : int

The history id of the action to reward.

rewards : dictionary

A dictionary {action_id: reward}, where reward is a float.

ucb1()