Developer Interface
This part of the documentation covers the public interface of itembed.
Preprocessing Tools
A few helpers are provided to clean the data and convert to the expected format.
- itembed.index_batch_stream(num_index, batch_size)
Indices generator.
- itembed.pack_itemsets(itemsets, *, min_count=1, min_length=1)
Convert itemset collection to packed indices.
- Parameters
itemsets (list of list of object) – List of sets of hashable objects.
min_count (int, optional) – Minimum frequency for an item to be kept.
min_length (int, optional) – Minimum itemset length.
- Returns
labels (list of object) – Mapping from indices to labels.
indices (int32, num_item) – Packed index array.
offsets (int32, num_itemset + 1) – Itemsets offsets in packed array.
Example
>>> itemsets = [
...     ["apple"],
...     ["apple", "sugar", "flour"],
...     ["pear", "sugar", "flour", "butter"],
...     ["apple", "pear", "sugar", "butter", "cinnamon"],
...     ["salt", "flour", "oil"],
... ]
>>> pack_itemsets(itemsets, min_length=2)
(['apple', 'sugar', 'flour', 'pear', 'butter', 'cinnamon', 'salt', 'oil'],
 array([0, 1, 2, 3, 1, 2, 4, 0, 3, 1, 4, 5, 6, 2, 7]),
 array([ 0,  3,  7, 12, 15]))
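To make the packed format concrete, here is a minimal pure-Python sketch that reproduces the output of the example above with plain lists. It is not the library's implementation: the names pack_itemsets_sketch and unpack_itemsets_sketch are hypothetical, and both the label numbering (order of first appearance) and the order of the two filters are assumptions.

```python
from collections import Counter


def pack_itemsets_sketch(itemsets, min_count=1, min_length=1):
    """Pure-Python illustration of the packed format (not the library code)."""
    # Count how often each label occurs across all itemsets.
    counts = Counter(item for itemset in itemsets for item in itemset)
    label_to_index = {}
    labels, indices, offsets = [], [], [0]
    for itemset in itemsets:
        # Assumption: rare items are dropped before the length check.
        kept = [item for item in itemset if counts[item] >= min_count]
        if len(kept) < min_length:
            continue
        for item in kept:
            # Assumption: labels are numbered in order of first appearance.
            if item not in label_to_index:
                label_to_index[item] = len(labels)
                labels.append(item)
            indices.append(label_to_index[item])
        # Each offset marks where the next itemset starts.
        offsets.append(len(indices))
    return labels, indices, offsets


def unpack_itemsets_sketch(labels, indices, offsets):
    """Rebuild the list-of-lists view from packed arrays."""
    return [
        [labels[i] for i in indices[start:end]]
        for start, end in zip(offsets[:-1], offsets[1:])
    ]
```

Itemset number k occupies indices[offsets[k]:offsets[k + 1]], which is why offsets has one more entry than there are itemsets.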
- itembed.prune_itemsets(indices, offsets, *, mask=None, min_length=None)
Filter packed indices.
At least one of an explicit mask or a length threshold must be given; when both are given, an itemset must satisfy both to be kept.
- Parameters
indices (int32, num_item) – Packed index array.
offsets (int32, num_itemset + 1) – Itemsets offsets in packed array.
mask (bool, num_itemset) – Boolean mask.
min_length (int) – Minimum length, inclusive.
- Returns
indices (int32, num_item) – Packed index array.
offsets (int32, num_itemset + 1) – Itemsets offsets in packed array.
Example
>>> indices = np.array([0, 0, 1, 0, 1, 2, 0, 1, 2, 3])
>>> offsets = np.array([0, 1, 3, 6, 10])
>>> mask = np.array([True, True, False, True])
>>> prune_itemsets(indices, offsets, mask=mask, min_length=2)
(array([0, 1, 0, 1, 2, 3]), array([0, 2, 6]))
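The filtering in the example above can be traced with a short pure-Python sketch. The function name prune_itemsets_sketch is hypothetical and plain lists stand in for the int32 arrays; it only illustrates how the mask and the length threshold combine.

```python
def prune_itemsets_sketch(indices, offsets, mask=None, min_length=None):
    """Keep only itemsets that pass the mask and the length threshold."""
    if mask is None and min_length is None:
        raise ValueError("either mask or min_length must be given")
    new_indices, new_offsets = [], [0]
    for k in range(len(offsets) - 1):
        start, end = offsets[k], offsets[k + 1]
        # Drop itemsets rejected by the explicit mask.
        if mask is not None and not mask[k]:
            continue
        # Drop itemsets shorter than the (inclusive) minimum length.
        if min_length is not None and end - start < min_length:
            continue
        new_indices.extend(indices[start:end])
        new_offsets.append(len(new_indices))
    return new_indices, new_offsets
```

With the data from the example, the first itemset is dropped for being too short and the third is dropped by the mask, leaving two itemsets of lengths 2 and 4.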
Tasks
Tasks are high-level building blocks used to define an optimization problem.
- class itembed.Task(learning_rate_scale)
Abstract training task.
- do_batch(learning_rate)
Apply training step.
- class itembed.UnsupervisedTask(items, offsets, syn0, syn1, *, weights=None, num_negative=5, learning_rate_scale=1.0, batch_size=64)
Unsupervised training task.
- Parameters
items (int32, num_item) – Itemsets, concatenated.
offsets (int32, num_itemset + 1) – Boundaries in packed items.
syn0 (float32, num_label x num_dimension) – First set of embeddings.
syn1 (float32, num_label x num_dimension) – Second set of embeddings.
weights (float32, num_item, optional) – Item weights, concatenated.
num_negative (int32, optional) – Number of negative samples.
learning_rate_scale (float32, optional) – Learning rate multiplier.
batch_size (int32, optional) – Batch size.
- do_batch(learning_rate)
Apply training step.
- class itembed.SupervisedTask(left_items, left_offsets, right_items, right_offsets, left_syn, right_syn, *, left_weights=None, right_weights=None, num_negative=5, learning_rate_scale=1.0, batch_size=64)
Supervised training task.
- Parameters
left_items (int32, num_left_item) – Itemsets, concatenated.
left_offsets (int32, num_itemset + 1) – Boundaries in packed items.
right_items (int32, num_right_item) – Itemsets, concatenated.
right_offsets (int32, num_itemset + 1) – Boundaries in packed items.
left_syn (float32, num_left_label x num_dimension) – Feature embeddings.
right_syn (float32, num_right_label x num_dimension) – Label embeddings.
left_weights (float32, num_left_item, optional) – Item weights, concatenated.
right_weights (float32, num_right_item, optional) – Item weights, concatenated.
num_negative (int32, optional) – Number of negative samples.
learning_rate_scale (float32, optional) – Learning rate multiplier.
batch_size (int32, optional) – Batch size.
- do_batch(learning_rate)
Apply training step.
Training Tools
Embeddings initialization and training loop helpers:
- itembed.initialize_syn(num_label, num_dimension, method='uniform')
Allocate and initialize embedding set.
- Parameters
num_label (int32) – Number of labels.
num_dimension (int32) – Size of embeddings.
method ({"uniform", "zero"}, optional) – Initialization method.
- Returns
syn – Embedding set.
- Return type
float32, num_label x num_dimension
- itembed.train(task, *, num_epoch=10, initial_learning_rate=0.025, final_learning_rate=0.0)
Train loop.
Learning rate decreases linearly from initial_learning_rate down to final_learning_rate (zero by default).
Keyboard interrupts are caught silently and stop the training process.
A progress bar is shown, using tqdm.
- Parameters
task (Task) – Top-level task to train.
num_epoch (int) – Number of passes across the whole task.
initial_learning_rate (float) – Maximum learning rate (inclusive).
final_learning_rate (float) – Minimum learning rate (exclusive).
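The schedule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's loop: the names linear_learning_rates and train_sketch are hypothetical, the number of steps is passed in directly (how the library derives it from num_epoch and the task is not covered here), and the progress bar is omitted.

```python
def linear_learning_rates(num_step, initial_learning_rate=0.025,
                          final_learning_rate=0.0):
    """Yield one learning rate per step, decreasing linearly."""
    span = final_learning_rate - initial_learning_rate
    for step in range(num_step):
        # step / num_step runs over [0, 1): the initial rate is reached
        # (inclusive) and the final rate is never reached (exclusive).
        yield initial_learning_rate + span * step / num_step


def train_sketch(do_batch, num_step, initial_learning_rate=0.025,
                 final_learning_rate=0.0):
    """Call do_batch once per step; stop quietly on Ctrl+C."""
    done = 0
    try:
        for learning_rate in linear_learning_rates(
            num_step, initial_learning_rate, final_learning_rate
        ):
            do_batch(learning_rate)
            done += 1
    except KeyboardInterrupt:
        # Interruption ends training early; work done so far is kept.
        pass
    return done
```

Because the interruption is swallowed, embeddings updated before the interrupt remain usable, which is convenient for long interactive sessions.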
Postprocessing Tools
Once embeddings are trained, some methods are provided to normalize and use them.
- itembed.softmax(x)
Compute softmax.
- itembed.norm(x)
L2 norm.
- itembed.normalize(x)
L2 normalization.
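For reference, the three operations can be written for a single vector in pure Python as follows. The *_sketch names are hypothetical, and how the library versions handle batches or axes is not specified here, so this only shows the underlying formulas.

```python
import math


def softmax_sketch(x):
    """exp(x_i) / sum_j exp(x_j), shifted by max(x) for numerical stability."""
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def norm_sketch(x):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(v * v for v in x))


def normalize_sketch(x):
    """Scale a non-zero vector to unit L2 length."""
    length = norm_sketch(x)
    return [v / length for v in x]


def cosine_similarity_sketch(a, b):
    """Dot product of the two normalized vectors."""
    return sum(u * v for u, v in zip(normalize_sketch(a), normalize_sketch(b)))
```

Normalizing embeddings once up front turns every later cosine similarity into a plain dot product.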
Low-Level Optimization Methods
At its core, itembed is a set of optimized methods.
- itembed.expit(x)
Compute logistic activation.
- itembed.do_step(left, right, syn_left, syn_right, tmp_syn, num_negative, learning_rate)
Apply a single training step.
- Parameters
left (int32) – Left-hand item.
right (int32) – Right-hand item.
syn_left (float32, num_left x num_dimension) – Left-hand embeddings.
syn_right (float32, num_right x num_dimension) – Right-hand embeddings.
tmp_syn (float32, num_dimension) – Internal buffer (allocated only once, for performance).
num_negative (int32) – Number of negative samples.
learning_rate (float32) – Learning rate.
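The parameters above match the usual shape of a skip-gram negative-sampling update, which the following pure-Python sketch illustrates. It is a textbook version under stated assumptions, not the library's code: the *_sketch names and the rng argument are hypothetical, negatives are assumed to be drawn uniformly, and lists of lists stand in for float32 arrays.

```python
import math
import random


def expit_sketch(x):
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def do_step_sketch(left, right, syn_left, syn_right, tmp_syn,
                   num_negative, learning_rate, rng=random):
    """One positive pair plus num_negative sampled negatives."""
    left_vec = syn_left[left]
    num_dimension = len(tmp_syn)
    # tmp_syn accumulates the left-hand update, so every sample is
    # scored against the same, not yet modified, left-hand vector.
    for c in range(num_dimension):
        tmp_syn[c] = 0.0
    for n in range(num_negative + 1):
        if n == 0:
            target, label = right, 1.0  # observed (positive) pair
        else:
            # Assumption: negatives are drawn uniformly at random.
            target = rng.randrange(len(syn_right))
            if target == right:
                continue
            label = 0.0
        target_vec = syn_right[target]
        logit = sum(a * b for a, b in zip(left_vec, target_vec))
        gradient = (label - expit_sketch(logit)) * learning_rate
        for c in range(num_dimension):
            tmp_syn[c] += gradient * target_vec[c]
            target_vec[c] += gradient * left_vec[c]
    # Apply the accumulated update to the left-hand embedding.
    for c in range(num_dimension):
        left_vec[c] += tmp_syn[c]
```

Passing tmp_syn in from the caller avoids allocating a scratch vector on every step, which is the reason given above for the buffer.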
- itembed.do_supervised_steps(left_itemset, right_itemset, left_weights, right_weights, left_syn, right_syn, tmp_syn, num_negative, learning_rate)
Apply steps from two itemsets.
This is used in a supervised setting, where left-hand items are features and right-hand items are labels.
- Parameters
left_itemset (int32, left_length) – Feature items.
right_itemset (int32, right_length) – Label items.
left_weights (float32, left_length) – Feature item weights.
right_weights (float32, right_length) – Label item weights.
left_syn (float32, num_left_label x num_dimension) – Feature embeddings.
right_syn (float32, num_right_label x num_dimension) – Label embeddings.
tmp_syn (float32, num_dimension) – Internal buffer (allocated only once, for performance).
num_negative (int32) – Number of negative samples.
learning_rate (float32) – Learning rate.
- itembed.do_unsupervised_steps(itemset, weights, syn0, syn1, tmp_syn, num_negative, learning_rate)
Apply steps from a single itemset.
This is used in an unsupervised setting, where co-occurrence is used as a knowledge source. It follows the skip-gram method, as introduced by Mikolov et al.
For each item, a single random neighbor is sampled to define a pair. This means that only a subset of possible pairs is considered. The reason is twofold: training stays in linear complexity w.r.t. itemset lengths and large itemsets do not dominate smaller ones.
Itemset must have at least 2 items. Length is not checked, for efficiency.
- Parameters
itemset (int32, length) – Items.
weights (float32, length) – Item weights.
syn0 (float32, num_label x num_dimension) – First set of embeddings.
syn1 (float32, num_label x num_dimension) – Second set of embeddings.
tmp_syn (float32, num_dimension) – Internal buffer (allocated only once, for performance).
num_negative (int32) – Number of negative samples.
learning_rate (float32) – Learning rate.
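The pair sampling described above (one random neighbor per item) can be sketched as follows. The function name sample_neighbor_pairs and the rng argument are hypothetical, and item weights are left out because their exact role is not described here.

```python
import random


def sample_neighbor_pairs(itemset, rng=random):
    """Pair each item with one other item of the same itemset.

    Produces len(itemset) pairs, so the cost stays linear in the
    itemset length instead of quadratic for all possible pairs.
    Requires at least 2 items, as stated above.
    """
    length = len(itemset)
    pairs = []
    for i in range(length):
        # Draw a position among the other length - 1 items, skipping i.
        j = rng.randrange(length - 1)
        if j >= i:
            j += 1
        pairs.append((itemset[i], itemset[j]))
    return pairs
```

A one-item itemset would make rng.randrange(0) raise an error in this sketch, which mirrors the requirement that callers filter short itemsets beforehand.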
- itembed.do_supervised_batch(left_items, left_weights, left_offsets, left_indices, right_items, right_weights, right_offsets, right_indices, left_syn, right_syn, tmp_syn, num_negative, learning_rate)
Apply supervised steps from multiple itemsets.
- Parameters
left_items (int32, num_left_item) – Itemsets, concatenated.
left_weights (float32, num_left_item) – Item weights, concatenated.
left_offsets (int32, num_itemset + 1) – Boundaries in packed items.
left_indices (int32, num_step) – Subset of offsets to consider.
right_items (int32, num_right_item) – Itemsets, concatenated.
right_weights (float32, num_right_item) – Item weights, concatenated.
right_offsets (int32, num_itemset + 1) – Boundaries in packed items.
right_indices (int32, num_step) – Subset of offsets to consider.
left_syn (float32, num_left_label x num_dimension) – Feature embeddings.
right_syn (float32, num_right_label x num_dimension) – Label embeddings.
tmp_syn (float32, num_dimension) – Internal buffer (allocated only once, for performance).
num_negative (int32) – Number of negative samples.
learning_rate (float32) – Learning rate.
- itembed.do_unsupervised_batch(items, weights, offsets, indices, syn0, syn1, tmp_syn, num_negative, learning_rate)
Apply unsupervised steps from multiple itemsets.
- Parameters
items (int32, num_item) – Itemsets, concatenated.
weights (float32, num_item) – Item weights, concatenated.
offsets (int32, num_itemset + 1) – Boundaries in packed items.
indices (int32, num_step) – Subset of offsets to consider.
syn0 (float32, num_label x num_dimension) – First set of embeddings.
syn1 (float32, num_label x num_dimension) – Second set of embeddings.
tmp_syn (float32, num_dimension) – Internal buffer (allocated only once, for performance).
num_negative (int32) – Number of negative samples.
learning_rate (float32) – Learning rate.
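The relationship between items, weights, offsets and indices in the batch functions can be illustrated with a short sketch. It assumes that each entry of indices is an itemset number k whose items live at items[offsets[k]:offsets[k + 1]]; the helper name iter_selected_itemsets is hypothetical and plain lists stand in for the arrays.

```python
def iter_selected_itemsets(items, weights, offsets, indices):
    """Yield (itemset, itemset_weights) for each selected itemset."""
    for k in indices:
        # offsets[k] and offsets[k + 1] bound itemset k in the packed arrays.
        start, end = offsets[k], offsets[k + 1]
        yield items[start:end], weights[start:end]


# Packed data for four itemsets of lengths 3, 4, 5 and 3.
items = [0, 1, 2, 3, 1, 2, 4, 0, 3, 1, 4, 5, 6, 2, 7]
weights = [1.0] * len(items)
offsets = [0, 3, 7, 12, 15]

# A batch that visits the last itemset, then the second one.
batch = list(iter_selected_itemsets(items, weights, offsets, [3, 1]))
```

Selecting by index means a batch can visit itemsets in any order without copying or reshuffling the packed arrays.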