Developer Interface

This part of the documentation covers the public interface of itembed.

Preprocessing Tools

A few helpers are provided to clean the data and convert it to the expected format.

itembed.index_batch_stream(num_index, batch_size)

Generator yielding batches of indices.

Parameters
  • num_index (int) – Total number of indices to cover.

  • batch_size (int) – Number of indices per batch.
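
Such a stream can be sketched as an endless generator of shuffled index batches. This is a hypothetical re-implementation for illustration only, not itembed's actual code:

```python
import numpy as np

def index_batch_stream_sketch(num_index, batch_size):
    """Yield batches of indices, reshuffling on every pass."""
    while True:
        order = np.random.permutation(num_index)
        for start in range(0, num_index, batch_size):
            yield order[start:start + batch_size]

stream = index_batch_stream_sketch(10, 4)
first = next(stream)  # 4 indices drawn from 0..9
```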

itembed.pack_itemsets(itemsets, *, min_count=1, min_length=1)

Convert itemset collection to packed indices.

Parameters
  • itemsets (list of list of object) – List of itemsets, each a list of hashable objects.

  • min_count (int, optional) – Minimum number of occurrences for an item to be kept.

  • min_length (int, optional) – Minimum itemset length for an itemset to be kept.

Returns

  • labels (list of object) – Mapping from indices to labels.

  • indices (int32, num_item) – Packed index array.

  • offsets (int32, num_itemset + 1) – Itemsets offsets in packed array.

Example

>>> itemsets = [
...     ["apple"],
...     ["apple", "sugar", "flour"],
...     ["pear", "sugar", "flour", "butter"],
...     ["apple", "pear", "sugar", "butter", "cinnamon"],
...     ["salt", "flour", "oil"],
... ]
>>> pack_itemsets(itemsets, min_length=2)
(['apple', 'sugar', 'flour', 'pear', 'butter', 'cinnamon', 'salt', 'oil'],
 array([0, 1, 2, 3, 1, 2, 4, 0, 3, 1, 4, 5, 6, 2, 7]),
 array([ 0,  3,  7, 12, 15]))
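
Reusing the arrays returned above, the i-th itemset can be recovered by slicing indices between consecutive offsets and mapping through labels (unpack is a small illustrative helper, not part of itembed):

```python
import numpy as np

labels = ['apple', 'sugar', 'flour', 'pear', 'butter', 'cinnamon', 'salt', 'oil']
indices = np.array([0, 1, 2, 3, 1, 2, 4, 0, 3, 1, 4, 5, 6, 2, 7])
offsets = np.array([0, 3, 7, 12, 15])

def unpack(i):
    """Recover the i-th itemset as a list of labels."""
    return [labels[j] for j in indices[offsets[i]:offsets[i + 1]]]

print(unpack(1))  # ['pear', 'sugar', 'flour', 'butter']
```
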
itembed.prune_itemsets(indices, offsets, *, mask=None, min_length=None)

Filter packed indices.

Either an explicit mask or a length threshold must be defined.

Parameters
  • indices (int32, num_item) – Packed index array.

  • offsets (int32, num_itemset + 1) – Itemsets offsets in packed array.

  • mask (bool, num_itemset, optional) – Boolean mask; itemsets marked False are dropped.

  • min_length (int, optional) – Minimum length, inclusive.

Returns

  • indices (int32, num_item) – Packed index array.

  • offsets (int32, num_itemset + 1) – Itemsets offsets in packed array.

Example

>>> indices = np.array([0, 0, 1, 0, 1, 2, 0, 1, 2, 3])
>>> offsets = np.array([0, 1, 3, 6, 10])
>>> mask = np.array([True, True, False, True])
>>> prune_itemsets(indices, offsets, mask=mask, min_length=2)
(array([0, 1, 0, 1, 2, 3]), array([0, 2, 6]))
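
An itemset is kept only when its mask entry is True and its length reaches the threshold. That combined effect can be reproduced in plain NumPy; this is an illustrative sketch, not the actual implementation:

```python
import numpy as np

indices = np.array([0, 0, 1, 0, 1, 2, 0, 1, 2, 3])
offsets = np.array([0, 1, 3, 6, 10])
mask = np.array([True, True, False, True])

lengths = np.diff(offsets)    # itemset lengths: [1, 2, 3, 4]
keep = mask & (lengths >= 2)  # both criteria: [False, True, False, True]

new_indices = np.concatenate(
    [indices[offsets[i]:offsets[i + 1]] for i in np.flatnonzero(keep)]
)
new_offsets = np.concatenate([[0], np.cumsum(lengths[keep])])
print(new_indices)  # [0 1 0 1 2 3]
print(new_offsets)  # [0 2 6]
```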

Tasks

Tasks are high-level building blocks used to define an optimization problem.

class itembed.Task(learning_rate_scale)

Abstract training task.

do_batch(learning_rate)

Apply training step.

class itembed.UnsupervisedTask(items, offsets, syn0, syn1, *, weights=None, num_negative=5, learning_rate_scale=1.0, batch_size=64)

Unsupervised training task.

Parameters
  • items (int32, num_item) – Itemsets, concatenated.

  • offsets (int32, num_itemset + 1) – Boundaries in packed items.

  • syn0 (float32, num_label x num_dimension) – First set of embeddings.

  • syn1 (float32, num_label x num_dimension) – Second set of embeddings.

  • weights (float32, num_item, optional) – Item weights, concatenated.

  • num_negative (int32, optional) – Number of negative samples.

  • learning_rate_scale (float32, optional) – Learning rate multiplier.

  • batch_size (int32, optional) – Batch size.

do_batch(learning_rate)

Apply training step.

class itembed.SupervisedTask(left_items, left_offsets, right_items, right_offsets, left_syn, right_syn, *, left_weights=None, right_weights=None, num_negative=5, learning_rate_scale=1.0, batch_size=64)

Supervised training task.

Parameters
  • left_items (int32, num_left_item) – Itemsets, concatenated.

  • left_offsets (int32, num_itemset + 1) – Boundaries in packed items.

  • right_items (int32, num_right_item) – Itemsets, concatenated.

  • right_offsets (int32, num_itemset + 1) – Boundaries in packed items.

  • left_syn (float32, num_left_label x num_dimension) – Feature embeddings.

  • right_syn (float32, num_right_label x num_dimension) – Label embeddings.

  • left_weights (float32, num_left_item, optional) – Item weights, concatenated.

  • right_weights (float32, num_right_item, optional) – Item weights, concatenated.

  • num_negative (int32, optional) – Number of negative samples.

  • learning_rate_scale (float32, optional) – Learning rate multiplier.

  • batch_size (int32, optional) – Batch size.

do_batch(learning_rate)

Apply training step.

class itembed.CompoundTask(*tasks, learning_rate_scale=1.0)

Group multiple sub-tasks together.

Parameters
  • *tasks (list of Task) – Collection of tasks to train jointly.

  • learning_rate_scale (float32, optional) – Learning rate multiplier.

do_batch(learning_rate)

Apply training step.
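
Conceptually, a compound task forwards each batch to its sub-tasks, with the learning rate scaled by the multiplier. A simplified sketch (the actual class may differ in details):

```python
class CompoundTaskSketch:
    """Minimal illustration: delegate do_batch to every sub-task."""

    def __init__(self, *tasks, learning_rate_scale=1.0):
        self.tasks = tasks
        self.learning_rate_scale = learning_rate_scale

    def do_batch(self, learning_rate):
        for task in self.tasks:
            task.do_batch(learning_rate * self.learning_rate_scale)

# Usage with stand-in sub-tasks that record the learning rate they receive
calls = []

class RecordingTask:
    def do_batch(self, learning_rate):
        calls.append(learning_rate)

compound = CompoundTaskSketch(RecordingTask(), RecordingTask(),
                              learning_rate_scale=0.5)
compound.do_batch(1.0)  # each sub-task receives 0.5
```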

Training Tools

Embeddings initialization and training loop helpers:

itembed.initialize_syn(num_label, num_dimension, method='uniform')

Allocate and initialize embedding set.

Parameters
  • num_label (int32) – Number of labels.

  • num_dimension (int32) – Size of embeddings.

  • method ({"uniform", "zero"}, optional) – Initialization method.

Returns

syn – Embedding set.

Return type

float32, num_label x num_dimension
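
A uniform initializer can be sketched as follows. This is a hypothetical illustration: the exact scaling used by itembed is an assumption here (small values around zero, in the style of word2vec):

```python
import numpy as np

def initialize_syn_sketch(num_label, num_dimension, method="uniform"):
    """Allocate a (num_label, num_dimension) float32 embedding set."""
    if method == "uniform":
        # Small values centered on zero; the exact scale is an assumption
        syn = (np.random.rand(num_label, num_dimension) - 0.5) / num_dimension
    elif method == "zero":
        syn = np.zeros((num_label, num_dimension))
    else:
        raise ValueError(method)
    return syn.astype(np.float32)

syn = initialize_syn_sketch(100, 16)
```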

itembed.train(task, *, num_epoch=10, initial_learning_rate=0.025, final_learning_rate=0.0)

Training loop.

The learning rate decreases linearly from initial_learning_rate down to final_learning_rate.

Keyboard interruptions are caught silently and simply stop the training loop.

A progress bar is shown, using tqdm.

Parameters
  • task (Task) – Top-level task to train.

  • num_epoch (int) – Number of passes across the whole task.

  • initial_learning_rate (float) – Maximum learning rate (inclusive).

  • final_learning_rate (float) – Minimum learning rate (exclusive).
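
The linear decay can be made explicit: at step t out of T total batches, the learning rate interpolates between the two bounds, starting at the initial value and approaching (without reaching) the final one. An illustrative sketch:

```python
def learning_rate_at(step, num_step, initial=0.025, final=0.0):
    """Linearly interpolate from initial (at step 0) towards final."""
    alpha = step / num_step
    return (1.0 - alpha) * initial + alpha * final

rates = [learning_rate_at(t, 10) for t in range(10)]
# rates[0] == 0.025, and the rate shrinks towards 0.0 without reaching it
```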

Postprocessing Tools

Once embeddings are trained, a few helpers are provided to normalize and make use of them.

itembed.softmax(x)

Compute softmax.

itembed.norm(x)

Compute the L2 norm.

itembed.normalize(x)

Apply L2 normalization.
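
These helpers correspond to standard definitions. Plain-NumPy sketches are shown below; they are not the actual implementations, and the axis convention is an assumption:

```python
import numpy as np

def softmax_sketch(x):
    # Subtract the max for numerical stability
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def norm_sketch(x):
    return np.sqrt((x * x).sum(axis=-1, keepdims=True))

def normalize_sketch(x):
    return x / norm_sketch(x)

v = np.array([3.0, 4.0])
print(norm_sketch(v))       # [5.]
print(normalize_sketch(v))  # [0.6 0.8]
```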

Low-Level Optimization Methods

At its core, itembed is a small set of optimized low-level routines.

itembed.expit(x)

Compute logistic activation.

itembed.do_step(left, right, syn_left, syn_right, tmp_syn, num_negative, learning_rate)

Apply a single training step.

Parameters
  • left (int32) – Left-hand item.

  • right (int32) – Right-hand item.

  • syn_left (float32, num_left x num_dimension) – Left-hand embeddings.

  • syn_right (float32, num_right x num_dimension) – Right-hand embeddings.

  • tmp_syn (float32, num_dimension) – Internal buffer (allocated only once, for performance).

  • num_negative (int32) – Number of negative samples.

  • learning_rate (float32) – Learning rate.
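
Such a step amounts to one positive update plus num_negative negative-sampling updates, in the spirit of skip-gram with negative sampling. A plain-NumPy sketch under that assumption (illustrative only, not itembed's compiled implementation):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def do_step_sketch(left, right, syn_left, syn_right, tmp_syn,
                   num_negative, learning_rate):
    num_right = syn_right.shape[0]
    tmp_syn[:] = 0.0  # accumulates the gradient for the left embedding
    # One positive target, then num_negative random negative targets
    targets = [(right, 1.0)]
    targets += [(np.random.randint(num_right), 0.0)
                for _ in range(num_negative)]
    for target, label in targets:
        g = (label - expit(syn_left[left] @ syn_right[target])) * learning_rate
        tmp_syn += g * syn_right[target]
        syn_right[target] += g * syn_left[left]
    syn_left[left] += tmp_syn

rng = np.random.default_rng(0)
syn_a = rng.normal(scale=0.1, size=(5, 8)).astype(np.float32)
syn_b = rng.normal(scale=0.1, size=(5, 8)).astype(np.float32)
tmp = np.zeros(8, dtype=np.float32)

np.random.seed(0)  # make negative sampling reproducible
before = syn_a[0].copy()
do_step_sketch(0, 3, syn_a, syn_b, tmp, num_negative=2, learning_rate=0.05)
```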

itembed.do_supervised_steps(left_itemset, right_itemset, left_weights, right_weights, left_syn, right_syn, tmp_syn, num_negative, learning_rate)

Apply steps from two itemsets.

This is used in a supervised setting, where left-hand items are features and right-hand items are labels.

Parameters
  • left_itemset (int32, left_length) – Feature items.

  • right_itemset (int32, right_length) – Label items.

  • left_weights (float32, left_length) – Feature item weights.

  • right_weights (float32, right_length) – Label item weights.

  • left_syn (float32, num_left_label x num_dimension) – Feature embeddings.

  • right_syn (float32, num_right_label x num_dimension) – Label embeddings.

  • tmp_syn (float32, num_dimension) – Internal buffer (allocated only once, for performance).

  • num_negative (int32) – Number of negative samples.

  • learning_rate (float32) – Learning rate.

itembed.do_unsupervised_steps(itemset, weights, syn0, syn1, tmp_syn, num_negative, learning_rate)

Apply steps from a single itemset.

This is used in an unsupervised setting, where co-occurrence is used as a knowledge source. It follows the skip-gram method, as introduced by Mikolov et al.

For each item, a single random neighbor is sampled to define a pair, so only a subset of all possible pairs is considered. The reason is twofold: training complexity stays linear in itemset length, and large itemsets do not dominate smaller ones.

Itemsets must contain at least 2 items; for efficiency, this is not checked.

Parameters
  • itemset (int32, length) – Items.

  • weights (float32, length) – Item weights.

  • syn0 (float32, num_label x num_dimension) – First set of embeddings.

  • syn1 (float32, num_label x num_dimension) – Second set of embeddings.

  • tmp_syn (float32, num_dimension) – Internal buffer (allocated only once, for performance).

  • num_negative (int32) – Number of negative samples.

  • learning_rate (float32) – Learning rate.
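
The pair sampling described above can be sketched in pure Python: each position draws one random other position, yielding exactly one pair per item. This is illustrative only, not the actual implementation:

```python
import random

def sample_pairs(itemset):
    """For each item, pair it with one random other item of the itemset."""
    length = len(itemset)
    assert length >= 2
    pairs = []
    for i in range(length):
        j = random.randrange(length - 1)
        if j >= i:  # skip position i itself
            j += 1
        pairs.append((itemset[i], itemset[j]))
    return pairs

pairs = sample_pairs([10, 11, 12, 13])
# Exactly one pair per item; an item is never paired with its own position
```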

itembed.do_supervised_batch(left_items, left_weights, left_offsets, left_indices, right_items, right_weights, right_offsets, right_indices, left_syn, right_syn, tmp_syn, num_negative, learning_rate)

Apply supervised steps from multiple itemsets.

Parameters
  • left_items (int32, num_left_item) – Itemsets, concatenated.

  • left_weights (float32, num_left_item) – Item weights, concatenated.

  • left_offsets (int32, num_itemset + 1) – Boundaries in packed items.

  • left_indices (int32, num_step) – Subset of offsets to consider.

  • right_items (int32, num_right_item) – Itemsets, concatenated.

  • right_weights (float32, num_right_item) – Item weights, concatenated.

  • right_offsets (int32, num_itemset + 1) – Boundaries in packed items.

  • right_indices (int32, num_step) – Subset of offsets to consider.

  • left_syn (float32, num_left_label x num_dimension) – Feature embeddings.

  • right_syn (float32, num_right_label x num_dimension) – Label embeddings.

  • tmp_syn (float32, num_dimension) – Internal buffer (allocated only once, for performance).

  • num_negative (int32) – Number of negative samples.

  • learning_rate (float32) – Learning rate.

itembed.do_unsupervised_batch(items, weights, offsets, indices, syn0, syn1, tmp_syn, num_negative, learning_rate)

Apply unsupervised steps from multiple itemsets.

Parameters
  • items (int32, num_item) – Itemsets, concatenated.

  • weights (float32, num_item) – Item weights, concatenated.

  • offsets (int32, num_itemset + 1) – Boundaries in packed items.

  • indices (int32, num_step) – Subset of offsets to consider.

  • syn0 (float32, num_label x num_dimension) – First set of embeddings.

  • syn1 (float32, num_label x num_dimension) – Second set of embeddings.

  • tmp_syn (float32, num_dimension) – Internal buffer (allocated only once, for performance).

  • num_negative (int32) – Number of negative samples.

  • learning_rate (float32) – Learning rate.
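
Conceptually, the batch routine iterates over the selected itemsets, slices them out of the packed arrays, and applies the per-itemset steps. A simplified sketch with a stand-in step function (the real routine passes embeddings and sampling parameters as well):

```python
import numpy as np

def do_unsupervised_batch_sketch(items, weights, offsets, indices, step):
    """Apply `step` to each selected itemset slice; `step` stands in
    for the per-itemset update (e.g. do_unsupervised_steps)."""
    for i in indices:
        begin, end = offsets[i], offsets[i + 1]
        step(items[begin:end], weights[begin:end])

items = np.array([0, 1, 2, 3, 1, 2, 4], dtype=np.int32)
weights = np.ones_like(items, dtype=np.float32)
offsets = np.array([0, 3, 7], dtype=np.int32)

seen = []
do_unsupervised_batch_sketch(items, weights, offsets, [0, 1],
                             lambda it, w: seen.append(it.tolist()))
print(seen)  # [[0, 1, 2], [3, 1, 2, 4]]
```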