thirdai.bolt

UDT

UniversalDeepTransformer.__init__(*args, **kwargs)

Overloaded function.

  1. __init__(self: thirdai._thirdai.bolt.UniversalDeepTransformer, data_types: Dict[str, thirdai::automl::DataType], temporal_tracking_relationships: Dict[str, List[Union[str, thirdai::automl::TemporalConfig]]] = {}, target: str, n_target_classes: Optional[int] = None, integer_target: bool = False, time_granularity: str = ‘daily’, lookahead: int = 0, delimiter: str = ‘,’, model_config: Optional[str] = None, options: dict = {}) -> None

UniversalDeepTransformer (UDT) Constructor.

Parameters:
  • data_types (Dict[str, bolt.types.ColumnType]) –

    A mapping from column name to column type. This map specifies the columns that we want to pass into the model; it does not need to include all columns in the dataset.

    Column type is one of:
    - bolt.types.categorical
    - bolt.types.numerical
    - bolt.types.text
    - bolt.types.date

    See bolt.types for details.

    If temporal_tracking_relationships is non-empty, there must be exactly one bolt.types.date() column. This column contains date strings in YYYY-MM-DD format.

  • temporal_tracking_relationships (Dict[str, List[Union[str, bolt.temporal.TemporalConfig]]]) –

    Optional. A mapping from column name to a list of either other column names or bolt.temporal objects. This mapping tells UDT which columns can be tracked over time for each key. For example, we can tell UDT to track a user's watch history by passing in a map like {"user_id": ["movie_id"]}.

    If we provide a mapping from a string to a list of strings like the above, the temporal tracking configuration will be autotuned. You can achieve finer-grained control by passing in bolt.temporal objects instead of strings.

    A bolt.temporal object is one of:
    - bolt.temporal.categorical
    - bolt.temporal.numerical

    See bolt.temporal for details.

  • target (str) – Name of the column that contains the value to be predicted by UDT. The target column has to be a categorical column.

  • n_target_classes (int) – Number of target classes.

  • integer_target (bool) – Whether the target classes are integers in the range 0 to n_target_classes - 1.

  • time_granularity (str) – Optional. One of "daily"/"d", "weekly"/"w", "biweekly"/"b", or "monthly"/"m". The interval of time that UDT uses for temporal features. Temporal numerical features are aggregated according to this time granularity. E.g. if time_granularity="w" and the numerical values on days 1 and 2 are 345.25 and 201.1 respectively, then UDT captures a single numerical value of 546.35 for the week instead of individual values for the two days. Defaults to "daily".

  • lookahead (int) – Optional. How far into the future the model should predict, measured in units of time_granularity. E.g. time_granularity="daily" and lookahead=5 means that the model should learn to predict 5 days into the future. Defaults to 0 (predict the current value of the target).

  • delimiter (str) – Optional. Defaults to ‘,’. A single character (length-1 string) that separates the columns of the CSV training / validation dataset.

  • model_config (Optional[str]) – This overwrites the autotuned model with a custom model defined by the given config file.
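
To make the time_granularity behaviour more concrete, the sketch below buckets daily values by week and sums them, which is one plausible reading of how temporal numerical features are combined. It is not UDT's implementation: the aggregate_weekly helper, the choice of summation, and the use of ISO calendar weeks as bucket boundaries are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def aggregate_weekly(daily_values):
    """Sum (date, value) pairs into one value per ISO (year, week) bucket.

    Hypothetical helper: how UDT actually draws week boundaries is not
    documented here, so ISO weeks are only an assumption.
    """
    totals = defaultdict(float)
    for day, value in daily_values:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(iso_year, iso_week)] += value
    return dict(totals)


# Two consecutive days (a Monday and a Tuesday) in the same ISO week.
weekly = aggregate_weekly([
    (date(2022, 1, 3), 345.25),
    (date(2022, 1, 4), 201.1),
])
```

Both days land in the same bucket, so the model would see one combined weekly value rather than two daily ones.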

Examples

>>> # Suppose each row of our data has the following columns: "product_id", "timestamp", "ad_spend", "sales_quantity", "sales_performance"
>>> # We want to predict each product's sales performance two weeks ahead using temporal context.
>>> # For each product ID, we would like to track both their ad spend and sales quantity over time.
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "product_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "ad_spend": bolt.types.numerical(range=(0, 10000)),
            "sales_quantity": bolt.types.numerical(range=(0, 20)),
            "sales_performance": bolt.types.categorical(),
        },
        temporal_tracking_relationships={
            "product_id": [
                # We can use multiple bolt.temporal objects with the same column name but
                # different history lengths to track different intervals of the same variable
                # Track last 5 weeks of ad spend
                bolt.temporal.numerical(column_name="ad_spend", history_length=5),
                # Track last 10 weeks of ad spend
                bolt.temporal.numerical(column_name="ad_spend", history_length=10),
                # Track last 5 weeks of sales performance
                bolt.temporal.categorical(column_name="sales_performance", history_length=5),
            ]
        },
        target="sales_performance",
        n_target_classes=5,
        time_granularity="weekly",
        lookahead=2 # predict 2 weeks ahead
    )
>>> # Alternatively suppose our data has the following columns: "user_id", "movie_id", "hours_watched", "timestamp"
>>> # We want to build a movie recommendation system.
>>> # Then we may configure UDT as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "movie_id": bolt.types.categorical(),
            "hours_watched": bolt.types.numerical(range=(0, 25)),
        },
        temporal_tracking_relationships={
            "user_id": [
                "movie_id", # autotuned movie temporal tracking
                bolt.temporal.numerical(column_name="hours_watched", history_length=5) # track last 5 days of hours watched.
            ]
        },
        target="movie_id",
        n_target_classes=3000
    )
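
As a rough illustration of what lookahead means for the training signal, the sketch below pairs the value observed in each time period with the target value a fixed number of periods later. This is only a conceptual model, not UDT's internal logic; the make_lookahead_pairs helper and the sample labels are hypothetical.

```python
def make_lookahead_pairs(series, lookahead):
    """Pair each period's value with the target `lookahead` periods later.

    `series` is a list of target values ordered by time period.
    Hypothetical helper for illustration only.
    """
    if lookahead < 0:
        raise ValueError("lookahead must be non-negative")
    return [
        (t, series[t], series[t + lookahead])
        for t in range(len(series) - lookahead)
    ]


# One made-up value per week for a single product.
weekly_performance = ["low", "low", "medium", "high", "high"]
pairs = make_lookahead_pairs(weekly_performance, lookahead=2)
```

With lookahead=0 every period is paired with itself, which matches the documented default of predicting the current value of the target.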

Notes

  • Refer to the documentation bolt.types.ColumnType and bolt.temporal.TemporalConfig to better understand column types and temporal tracking configurations.

  2. __init__(self: thirdai._thirdai.bolt.UniversalDeepTransformer, source_column: str, target_column: str, dataset_size: str, use_spell_checker: bool = False, delimiter: str = ‘,’, model_config: Optional[str] = None, options: dict = {}) -> None

UniversalDeepTransformer (UDT) Constructor.

Parameters:
  • source_column (str) – Optional. Column name specifying the source queries in the input dataset. If provided, the model can use these queries to augment its training. If not provided, the model will be trained from the target queries directly. If the source column is specified, the model can be trained both in a supervised setting, where (incorrect query, correct query) pairs are provided, and in an unsupervised setting, where only correct queries are provided. If the source column is not specified, the model can only be trained in an unsupervised setting.

  • target_column (str) – Column name specifying the target queries in the input dataset. Queries in this column are the target that the UDT model learns to predict.

  • dataset_size (str) –

    The size of the input dataset. This size factor informs what UDT model to create.

    The dataset size can be one of the following:
    - small
    - medium
    - large

Example

>>> # Suppose we have an input CSV dataset consisting of grammatically or syntactically
>>> # incorrect queries that we want to reformulate. We will assume that the dataset also
>>> # has a target correct query for each incorrect query. We can initialize a UDT model
>>> # for query reformulation as follows:
>>> model = bolt.UniversalDeepTransformer(
        target_column="queries_for_prediction",
        source_column="incorrect_queries",
        dataset_size="medium"
    )

  3. __init__(self: thirdai._thirdai.bolt.UniversalDeepTransformer, target_column: str, dataset_size: str, use_spell_checker: bool = False, delimiter: str = ‘,’, model_config: Optional[str] = None, options: dict = {}) -> None

UniversalDeepTransformer (UDT) Constructor.

Parameters:
  • target_column (str) – Column name specifying the target queries in the input dataset. Queries in this column are the target that the UDT model learns to predict. This overload takes no source column, so the model is trained from the target queries directly and can only be trained in an unsupervised setting.

  • dataset_size (str) –

    The size of the input dataset. This size factor informs what UDT model to create.

    The dataset size can be one of the following:
    - small
    - medium
    - large

Example

>>> # Suppose we have an input CSV dataset that contains only correct queries, with no
>>> # incorrect counterparts. We can initialize a UDT model for unsupervised query
>>> # reformulation as follows:
>>> model = bolt.UniversalDeepTransformer(
        target_column="queries_for_prediction",
        dataset_size="medium"
    )

  4. __init__(self: thirdai._thirdai.bolt.UniversalDeepTransformer, file_format: str, n_target_classes: int, input_dim: int, model_config: Optional[str] = None, options: dict = {}) -> None

UniversalDeepTransformer.train(filename: str, learning_rate: float = 0.001, epochs: int = 5, validation: Validation | None = None, batch_size: int | None = None, max_in_memory_batches: int | None = None, verbose: bool = True, callbacks: List[Callback] = [], metrics: List[str] = [], logging_interval: int | None = None, shuffle_reservoir_size: int = 64000, comm=None)

Trains a UniversalDeepTransformer (UDT) on a given dataset using a file on disk or in a cloud storage bucket, such as S3 or Google Cloud Storage (GCS). If the file is on S3, it should be in the normal S3 form, i.e. s3://bucket/path/to/key. For files in GCS, the path should have the form gcs://bucket/path/to/filename. We currently support CSV and Parquet files. If the file is Parquet, its name should end in .parquet or .pqt. Otherwise, we will assume it is a CSV file.
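
The path conventions above can be summarised in a few lines of code. The sketch below is not the logic inside train(); describe_dataset_path is a hypothetical helper that simply restates the documented rules (s3:// and gcs:// prefixes, Parquet detected by extension, CSV otherwise).

```python
def describe_dataset_path(filename):
    """Return (storage, file_format) following the documented conventions.

    Hypothetical helper for illustration; it does not touch the file.
    """
    if filename.startswith("s3://"):
        storage = "s3"
    elif filename.startswith("gcs://"):
        storage = "gcs"
    else:
        storage = "local"
    # Parquet is recognised by extension; everything else is treated as CSV.
    is_parquet = filename.endswith((".parquet", ".pqt"))
    file_format = "parquet" if is_parquet else "csv"
    return storage, file_format


examples = {
    path: describe_dataset_path(path)
    for path in ["./train_file", "s3://bucket/path/to/key", "gcs://bucket/data.pqt"]
}
```

Note that a file without an extension, such as ./train_file, falls through to the CSV case.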

Parameters:
  • filename (str) – Path to the dataset file. It can be a path to a file on disk or an S3 or GCS resource identifier. If the file is on S3 or GCS, regular credentials files will be required for authentication.

  • learning_rate (float) – Optional, uses default if not provided.

  • epochs (int) – Optional, uses default if not provided.

  • validation (Optional[bolt.Validation]) – This is an optional parameter that specifies a validation dataset, metrics, and interval to use during training.

  • batch_size (Optional[int]) – This is an optional parameter indicating which batch size to use for training. If not specified, the batch size will be autotuned.

  • max_in_memory_batches (Optional[int]) – The maximum number of batches to load in memory at a given time. If this is specified, the dataset will be processed in a streaming fashion.

  • verbose (bool) – Optional, defaults to True. Controls if additional information is printed during training.

  • callbacks (List[bolt.train.callbacks.Callback]) – List of callbacks to use during training.

  • metrics (List[str]) – List of metrics to compute during training. These are logged if logging is enabled, and are accessible by any callbacks.

  • logging_interval (Optional[int]) – How frequently to log training metrics, expressed as the number of batches between logging events. If not specified, logging is done at the end of each epoch.

Returns:

The train method returns a dictionary providing the values of any metrics computed during training. The format is: {“name of metric”: [list of values]}.

Return type:

(Dict[str, List[float]])
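
Because the return value is a plain dictionary of lists, it can be inspected with ordinary Python. The sketch below is a minimal example of doing so; summarize_metrics is a hypothetical helper, the numbers are made up, and treating the maximum as "best" is an assumption that only suits accuracy-like metrics.

```python
def summarize_metrics(history):
    """Return {metric: (final_value, best_value, best_index)}.

    "Best" is taken to mean the maximum, which is an assumption that
    fits accuracy-like metrics but not losses.
    """
    summary = {}
    for name, values in history.items():
        if not values:
            continue
        best_index = max(range(len(values)), key=values.__getitem__)
        summary[name] = (values[-1], values[best_index], best_index)
    return summary


# Made-up values standing in for the dictionary returned by model.train().
history = {"categorical_accuracy": [0.61, 0.74, 0.79, 0.78, 0.80]}
summary = summarize_metrics(history)
```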

Examples

>>> model.train(
        filename="./train_file", epochs=5, learning_rate=0.01, max_in_memory_batches=12
    )
>>> model.train(
        filename="s3://bucket/path/to/key"
    )

Notes

  • If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. model.train() automatically updates UDT’s temporal context.

  • If the prediction task is binary classification, the model will attempt to find an optimal prediction threshold, which is used when return_predicted_class=True is passed to evaluate, predict, or predict_batch. The selected threshold is the one that maximizes the first validation metric on the validation data. If no validation data or metrics are passed in, the model uses the first 100 batches of the training data and the first training metric instead. If there are no training metrics either, no prediction threshold is chosen.
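
The threshold selection described in the last note can be pictured as a simple search over candidate thresholds. The sketch below is not UDT's tuning code: best_threshold is a hypothetical helper, accuracy stands in for "the first validation metric", and the candidate grid and sample scores are invented.

```python
def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest accuracy.

    scores: predicted probability of the positive class for each sample.
    labels: true labels (0 or 1). Ties keep the earliest candidate.
    """
    def accuracy(threshold):
        predictions = [1 if score >= threshold else 0 for score in scores]
        correct = sum(p == y for p, y in zip(predictions, labels))
        return correct / len(labels)

    return max(candidates, key=accuracy)


# Made-up validation scores and labels.
scores = [0.10, 0.35, 0.40, 0.65, 0.80, 0.90]
labels = [0, 0, 1, 1, 1, 1]
threshold = best_threshold(scores, labels, candidates=[0.3, 0.4, 0.5, 0.6, 0.7])
```

On this toy data only the 0.4 candidate separates the two classes perfectly, so it is the one selected.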

UniversalDeepTransformer.cold_start(filename: str, strong_column_names: List[str], weak_column_names: List[str], variable_length: data.transformations.VariableLengthConfig | None = <data.transformations.VariableLengthConfig object>, learning_rate: float = 0.001, epochs: int = 5, batch_size: int | None = None, metrics: List[str] = [], validation: bolt.Validation | None = None, callbacks: List[bolt.train.callbacks.Callback] = [], max_in_memory_batches: int | None = None, verbose: bool = True, logging_interval: int | None = None, comm=None, shuffle_reservoir_size: int = 64000)

This method will perform cold start pretraining for UDT. This is a type of pretraining for text classification models that is especially useful for query to product recommendation models. It requires that the model takes in a single text input and has a categorical/multi-categorical output.

The cold start pretraining typically takes in an unsupervised dataset of objects where each object corresponds to one or more columns of textual metadata. This could be something like a product catalog (with product ids as objects, and titles, descriptions, and tags as metadata). The goal with cold start is to pre-train UDT on unsupervised data so in the future it may be able to answer text search queries and return the relevant objects. The dataset it takes in should be a csv file that gives a class id column and some number of text columns, where for a given row the text is related to the class also specified by that row.

You may cold_start the model and train with supervised data afterwards, typically leading to faster convergence on the supervised data.
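
To illustrate the expected dataset layout, the sketch below builds a tiny product catalog as CSV text with one class id column and two text columns. The column names, rows, and the split into strong and weak columns are hypothetical; in practice the text would be written to a file whose path is passed as filename.

```python
import csv
import io

# Made-up catalog: one class id column plus textual metadata columns.
catalog = [
    {"product_id": "0", "title": "Stainless steel water bottle",
     "description": "Keeps drinks cold for hours. Ships in recyclable packaging."},
    {"product_id": "1", "title": "Trail running shoes",
     "description": "Lightweight and durable. Ships in recyclable packaging."},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["product_id", "title", "description"])
writer.writeheader()
writer.writerows(catalog)
csv_text = buffer.getvalue()

# Titles identify a product almost word for word; descriptions share
# generic phrases across products, so they are only loosely related.
strong_column_names = ["title"]
weak_column_names = ["description"]
```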

Parameters:
  • filename (str) – Path to the dataset used for pretraining.

  • strong_column_names (List[str]) – The strong column names indicate which text columns are most closely related to the output class. Here, closely related means that all of the words in the text are useful in identifying the output class in that row. For example, in a product catalog a strong column could be the full title of the product.

  • weak_column_names (List[str]) – The weak column names indicate which text columns are more loosely related to the output class. Here, loosely related means that parts of the text are useful in identifying the output class, but other parts may contain more generic words or phrases that are less strongly correlated with it. For example, in a product catalog the description of the product could be a weak column: while there is a correlation, parts of the description may be fairly similar between products or be too general to completely identify which products they correspond to.

  • learning_rate (float) – Optional, uses default if not provided.

  • epochs (int) – Optional, uses default if not provided.

  • batch_size (Optional[int]) – This is an optional parameter indicating which batch size to use for training. If not specified, the batch size will be autotuned.

  • metrics (List[str]) – List of metrics to compute during training. These are logged if logging is enabled, and are accessible by any callbacks.

  • validation (Optional[bolt.Validation]) – This is an optional parameter that specifies a validation dataset, metrics, and interval to use during training.

  • callbacks (List[bolt.train.callbacks.Callback]) – List of callbacks to use during training.

  • max_in_memory_batches (Optional[int]) – The maximum number of batches to load in memory at a given time. If this is specified, the dataset will be processed in a streaming fashion.

  • verbose (bool) – Optional, defaults to True. Controls if additional information is printed during training.

  • logging_interval (Optional[int]) – How frequently to log training metrics, expressed as the number of batches between logging events. If not specified, logging is done at the end of each epoch.

Returns:

The cold_start method returns a dictionary providing the values of any metrics computed during training. The format is: {"name of metric": [list of values]}.

Return type:

(Dict[str, List[float]])

Examples

>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "query": bolt.types.text(),
            "product": bolt.types.categorical(),
        },
        target="product",
        n_target_classes=1000,
        integer_target=True,
    )
>>> model.cold_start(
        filename="product_catalog.csv",
        strong_column_names=["title"],
        weak_column_names=["description", "bullet_points"],
        learning_rate=0.001,
        epochs=5,
        metrics=["f_measure(0.95)"]
    )
>>> model.train(
        filename=supervised_query_product_data,
    )
>>> result = model.predict({"query": query})

UniversalDeepTransformer.train_batch(batch: List[Dict[str, str]], learning_rate: float = 0.001) -> object

Trains the model on the given training batch.

Parameters:
  • batch (List[Dict[str, str]]) – The raw data comprising the training batch. Each sample should be a dictionary of the form {"column_name": "column_value"} with an entry for each column the model expects.

  • learning_rate (float) – Optional, uses default if not provided.

Returns:

None
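
A batch in this format can be assembled from delimited text with the standard library. The sketch below is one way to do it; rows_to_batch is a hypothetical helper and the column names and rows are made up. Note that every value stays a string, which matches the List[Dict[str, str]] type.

```python
import csv
import io

# Made-up rows; in practice these might come from a file or a stream.
raw_csv = (
    "user_id,timestamp,movie_id\n"
    "A33225,2022-12-25,17\n"
    "A25978,2022-12-26,4\n"
)


def rows_to_batch(csv_text, delimiter=","):
    """Turn delimited text into a list of {column_name: column_value} dicts."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [dict(row) for row in reader]


batch = rows_to_batch(raw_csv)
```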

UniversalDeepTransformer.evaluate(filename: str, metrics: List[str] = [], use_sparse_inference: bool = False, verbose: bool = True, top_k: int | None = None)

Evaluates the UniversalDeepTransformer (UDT) on the given dataset and returns the values of the specified metrics. We currently support CSV and Parquet files. If the file is Parquet, its name should end in .parquet or .pqt. Otherwise, we will assume it is a CSV file.

Parameters:
  • filename (str) – Path to the dataset file. As with train, this can be a path to a local file or to a file that lives in an S3 or Google Cloud Storage (GCS) bucket.

  • metrics (List[str]) – List of metrics to compute during evaluation.

  • use_sparse_inference (bool) – Optional, defaults to False, determines if sparse inference is used during evaluation.

  • verbose (bool) – Optional, defaults to True. Controls if additional information is printed during evaluation.

  • top_k (Optional[int]) – Optional, defaults to None. This parameter is only used for query reformulation models to determine how many candidates to select before computing evaluation metrics.

Returns:

Returns a dictionary mapping each specified metric name to its value.

Return type:

(Dict[str, float])

Examples

>>> metrics = model.evaluate(filename="./test_file", metrics=["categorical_accuracy"])

Notes

  • If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. model.evaluate() automatically updates UDT’s temporal context.

UniversalDeepTransformer.predict(sample: Dict[str, str], sparse_inference: bool = False, return_predicted_class: bool = False, top_k: int | None = None) -> object

Performs inference on a single sample.

Parameters:
  • sample (Dict[str, str]) – The input sample as a dictionary where the keys are column names and the values are the respective column values.

  • sparse_inference (bool) – Whether or not to use sparse inference.

  • return_predicted_class (bool) – If true then the model will return the id of the predicted class instead of the activations of the output layer. This argument is only applicable to classification models.

  • top_k (Optional[int]) – If specified then the model will return the ids of the top k predicted classes instead of the activations of the output layer. This argument is only applicable to classification models.

Returns:

Returns a numpy array of the activations if the output is dense, or a tuple of the active neurons and activations if the output is sparse. The shape of each array will be (num_nonzeros_in_output, ). If return_predicted_class is True, the class id (an integer) will be returned instead. If top_k is specified, a list of integer class ids will be returned. You can map neuron ids back to target class names by calling the class_name() method. If the target column is a sequence, UDT will perform inference recursively and return a sequence in the same format as the target column.

Return type:

(np.ndarray, Tuple[np.ndarray, np.ndarray], List[int], or int)
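
Since the result may be either dense or sparse, downstream code usually needs to handle both shapes. The sketch below shows one way to extract the highest-scoring neuron ids from either form. top_k_neurons is a hypothetical helper, and plain Python lists stand in for the numpy arrays that predict() is documented to return.

```python
def top_k_neurons(output, k):
    """Return the ids of the k largest activations.

    `output` is either a dense sequence of activations, or a sparse
    (active_neurons, activations) tuple as described above.
    """
    if isinstance(output, tuple):
        neuron_ids, activations = output
    else:
        neuron_ids, activations = range(len(output)), output
    ranked = sorted(
        zip(activations, neuron_ids), key=lambda pair: pair[0], reverse=True
    )
    return [int(neuron_id) for _, neuron_id in ranked[:k]]


# Made-up outputs: the sparse form lists only the active neurons.
dense = [0.05, 0.70, 0.10, 0.15]
sparse = ([1, 3], [0.70, 0.15])
dense_top = top_k_neurons(dense, 2)
sparse_top = top_k_neurons(sparse, 2)
```

The returned ids could then be passed to class_name() to recover the target class names.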

Examples

>>> # Suppose we configure UDT as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "special_event": bolt.types.categorical(),
            "movie_title": bolt.types.categorical()
        },
        temporal_tracking_relationships={
            "user_id": ["movie_title"]
        },
        target="movie_title",
        n_target_classes=500
    )
>>> # Make a single prediction
>>> activations = model.predict(
        sample={"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas"}
    )

Notes

  • The values of columns that are tracked temporally may be unknown during inference (the column_known_during_inference attribute of bolt.temporal objects is False by default). These columns do not need to be passed into model.predict(). For example, we did not pass the "movie_title" column to model.predict(). All other columns must be passed in.

  • If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. Thus, UDT is at its best when its internal temporal context gets updated with new true samples. model.predict() does not update UDT’s temporal context. To do this without retraining the model, we need to use model.index() or model.index_batch(). Read about model.index() and model.index_batch() for details.

UniversalDeepTransformer.predict_batch(samples: List[Dict[str, str]], sparse_inference: bool = False, return_predicted_class: bool = False, top_k: int | None = None) -> object

Performs inference on a batch of samples in parallel.

Parameters:
  • samples (List[Dict[str, str]]) – A list of input samples as dictionaries where the keys are column names as specified in data_types and the values are the respective column values.

  • sparse_inference (bool, default=False) – Whether or not to use sparse inference.

  • return_predicted_class (bool) – If true then the model will return the id of the predicted class instead of the activations of the output layer. This argument is only applicable to classification models.

  • top_k (Optional[int]) – If specified then the model will return the ids of the top k predicted classes instead of the activations of the output layer. This argument is only applicable to classification models.

Returns:

Returns a numpy array of the activations if the output is dense, or a tuple of the active neurons and activations if the output is sparse. The shape of each array will be (batch_size, num_nonzeros_in_output). If return_predicted_class is True, a list with one class id (an integer) per sample will be returned instead. If top_k is specified, a list of integer class ids will be returned for each sample. You can map neuron ids back to target class names by calling the class_name() method. If the target column is a sequence, UDT will perform inference recursively and return a sequence in the same format as the target column.

Return type:

(np.ndarray, Tuple[np.ndarray, np.ndarray], List[List[int]], or List[int])

Examples

>>> activations = model.predict_batch([
        {"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas"},
        {"user_id": "A25978", "timestamp": "2022-12-25", "special_event": "christmas"},
        {"user_id": "A25978", "timestamp": "2022-12-26", "special_event": "christmas"}
    ])

Notes

  • The values of columns that are tracked temporally may be unknown during inference (the column_known_during_inference attribute of bolt.temporal objects is False by default). These columns do not need to be passed into model.predict_batch(). For example, we did not pass the "movie_title" column to model.predict_batch(). All other columns must be passed in.

  • If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. Thus, UDT is at its best when its internal temporal context gets updated with new true samples. model.predict_batch() does not update UDT’s temporal context. To do this without retraining the model, we need to use model.index() or model.index_batch(). Read about model.index() and model.index_batch() for details.

UniversalDeepTransformer.explain(input_sample: Dict[str, str], target_class: int | str | None = None) -> List[Tuple[str, float]]

Identifies the columns that are most responsible for a predicted outcome and provides a brief description of the column’s value.

If a target is provided, the model will identify the columns that need to change for the model to predict the target class.

Parameters:
  • input_sample (Dict[str, str]) – The input sample as a dictionary where the keys are column names as specified in data_types and the values are the respective column values.

  • target_class (Optional[Union[int, str]]) – Optional. The desired target class. If provided, the model will identify the columns that need to change for the model to predict the target class.

Returns:

A list of explanations derived from the input features, each paired with a weight representing the significance of that feature.

Return type:

List[Tuple[str, float]]
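
A list of (explanation, weight) tuples is easy to post-process. The sketch below ranks explanations by the magnitude of their weight; rank_explanations is a hypothetical helper, the sample values are invented, and the assumption that weights may be negative is not confirmed by this documentation.

```python
def rank_explanations(explanations, top_n=3):
    """Return the top_n explanations ordered by absolute weight.

    Using the absolute value is an assumption: it treats large negative
    and large positive weights as equally significant.
    """
    ordered = sorted(explanations, key=lambda item: abs(item[1]), reverse=True)
    return ordered[:top_n]


# Made-up output standing in for the result of model.explain().
explanations = [("user_id", 0.12), ("special_event", -0.55), ("timestamp", 0.30)]
ranked = rank_explanations(explanations, top_n=2)
```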

Example

>>> # Suppose we configure UDT as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "special_event": bolt.types.categorical(),
            "movie_title": bolt.types.categorical()
        },
        temporal_tracking_relationships={
            "user_id": ["movie_title"]
        },
        target="movie_title",
        n_target_classes=500,
    )
>>> # Explain a single prediction
>>> explanations = model.explain(
        input_sample={"user_id": "A33225", "timestamp": "2022-02-02", "special_event": "christmas"}, target_class=35
    )

UniversalDeepTransformer.save(filename: str) -> None

Serializes an instance of UniversalDeepTransformer (UDT) into a file on disk. The serialized UDT includes its current temporal context. The save method saves only the model parameters, whereas the checkpoint method also saves additional information, such as the optimizer state, for use if training is resumed.

Parameters:

filename (str) – The file on disk to serialize this instance of UDT into.

Example

>>> model.save("udt_savefile.bolt")
>>> model.checkpoint("udt_savefile.bolt")

UniversalDeepTransformer.checkpoint(filename: str) -> None

Serializes an instance of UniversalDeepTransformer (UDT) into a file on disk. The serialized UDT includes its current temporal context. The save method saves only the model parameters, whereas the checkpoint method also saves additional information, such as the optimizer state, for use if training is resumed.

Parameters:

filename (str) – The file on disk to serialize this instance of UDT into.

Example

>>> model.save("udt_savefile.bolt")
>>> model.checkpoint("udt_savefile.bolt")

static UniversalDeepTransformer.load(filename: str) -> bolt.UniversalDeepTransformer

Loads a serialized instance of a UniversalDeepTransformer (UDT) model from a file on disk.

Parameters:

filename (str) – The file on disk from where to load an instance of UDT.

Returns:

The loaded instance of UDT

Return type:

UniversalDeepTransformer

Example

>>> model = bolt.UniversalDeepTransformer(...)
>>> model = bolt.UniversalDeepTransformer.load("udt_savefile.bolt")

UniversalDeepTransformer.embedding_representation(input_sample: List[Dict[str, str]]) -> object

Performs inference on a single sample and returns the penultimate layer of UniversalDeepTransformer (UDT) so that it can be used as an embedding representation for downstream applications.

Parameters:

input_sample (Dict[str, str]) – The input sample as a dictionary where the keys are column names as specified in data_types and the values are the respective column values.

Returns:

Returns a numpy array of the penultimate layer’s activations.

Return type:

np.ndarray

Examples

>>> # Suppose we configure UDT as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "special_event": bolt.types.categorical(),
            "movie_title": bolt.types.categorical()
        },
        temporal_tracking_relationships={
            "user_id": ["movie_title"]
        },
        target="movie_title",
        n_target_classes=500,
    )
>>> # Get an embedding representation
>>> embedding = model.embedding_representation(
        input_sample={"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas"}
    )

Notes

  • The values of columns that are tracked temporally may be unknown during inference (the column_known_during_inference attribute of bolt.temporal objects is False by default). These columns do not need to be passed into model.embedding_representation(). For example, we did not pass the "movie_title" column to model.embedding_representation(). All other columns must be passed in.

  • If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. Thus, UDT is at its best when its internal temporal context gets updated with new true samples. model.predict() does not update UDT’s temporal context. To do this without retraining the model, we need to use model.index() or model.index_batch(). Read about model.index() and model.index_batch() for details.

UniversalDeepTransformer.get_entity_embedding(label_id: int | str) -> object

Returns an embedding representation for a given output entity, an entity being the name of a class predicted as output.

Parameters:
  • label_id (Union[int, str]) – The name of the entity to get an embedding for. If integer_target=True, this function should take in an integer from 0 to n_target_classes - 1 instead of a string.

Returns:

A 1D numpy array of floats representing a dense embedding of that entity.

UniversalDeepTransformer.class_name(arg0: int) -> str

Returns the target class name associated with an output neuron ID.

Parameters:

neuron_id (int) – The index of the neuron in UDT’s output layer. This is useful for mapping the activations returned by evaluate() and predict() back to class names.

Returns:

The class name that corresponds to the given neuron_id.

Return type:

str
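
Because class_name() maps one neuron index to one class name, ranking several activations takes only a small wrapper. The sketch below is illustrative: top_k_classes is a hypothetical helper, and the toy lists stand in for the output of model.predict() and for model.class_name.

```python
def top_k_classes(activations, class_name, k=3):
    """Return the k (class name, activation) pairs with the largest activations."""
    ranked = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    return [(class_name(i), activations[i]) for i in ranked[:k]]


# Stand-ins for illustration only: in practice, activations would come from
# model.predict() and the lookup function would be model.class_name.
toy_names = ["Die Hard", "Home Alone", "Elf", "Die Hard 3"]
toy_activations = [0.55, 0.25, 0.05, 0.15]

top_two = top_k_classes(toy_activations, toy_names.__getitem__, k=2)
```

With a real model, the call might look like top_k_classes(activations, model.class_name, k=5), assuming activations is a 1D sequence.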

Example

>>> activations = model.predict(
        input_sample={"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas"}
    )
>>> top_recommendation = np.argmax(activations)
>>> model.class_name(top_recommendation)
"Die Hard"
UniversalDeepTransformer.index(input_sample: Dict[str, str]) None

Indexes a single true sample to keep UniversalDeepTransformer’s (UDT) temporal context up to date.

If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. Thus, UDT is at its best when its internal temporal context gets updated with new true samples. model.index() does exactly this.

Parameters:

input_sample (Dict[str, str]) – The input sample as a dictionary where the keys are column names as specified in data_types and the values are the respective column values.

Example

>>> # Suppose we configure UDT to do movie recommendation as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "special_event": bolt.types.categorical(),
            "movie_title": bolt.types.categorical()
        },
        temporal_tracking_relationships={
            "user_id": ["movie_title"]
        },
        target="movie_title",
        n_target_classes=500,
    )
>>> # We then deploy the model for inference. Inference is performed by calling model.predict()
>>> activations = model.predict(
        input_sample={"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas"}
    )
>>> # Suppose we later learn that user "A33225" ends up watching "Die Hard 3".
>>> # We can call model.index() to keep UDT's temporal context up to date.
>>> model.index(
        input_sample={"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas", "movie_title": "Die Hard 3"}
    )
UniversalDeepTransformer.index_batch(input_samples: List[Dict[str, str]]) None

Indexes a batch of true samples to keep UniversalDeepTransformer’s (UDT) temporal context up to date.

If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. Thus, UDT is at its best when its internal temporal context gets updated with new true samples. model.index_batch() does exactly this with a batch of samples.

Parameters:

input_samples (List[Dict[str, str]]) – The input samples as a list of dictionaries, where the keys of each dictionary are column names as specified in data_types and the values are the respective column values.

Example

>>> # Suppose we configure UDT to do movie recommendation as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "special_event": bolt.types.categorical(),
            "movie_title": bolt.types.categorical()
        },
        temporal_tracking_relationships={
            "user_id": ["movie_title"]
        },
        target="movie_title",
        n_target_classes=500,
    )
>>> # We then deploy the model for inference. Inference is performed by calling model.predict()
>>> activations = model.predict(
        input_sample={"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas"}
    )
>>> # Suppose we later learn what users actually watched.
>>> # We can call model.index_batch() to keep UDT's temporal context up to date.
>>> model.index_batch(
        input_samples=[
            {"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas", "movie_title": "Die Hard 3"},
            {"user_id": "A39574", "timestamp": "2022-12-25", "special_event": "christmas", "movie_title": "Home Alone"},
            {"user_id": "A39574", "timestamp": "2022-12-26", "special_event": "christmas", "movie_title": "Home Alone 2"},
        ]
    )
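
When temporal relationships are supplied, UDT assumes that samples arrive in chronological order (see reset_temporal_trackers()), so it may help to sort observed samples by date before indexing them. This is a minimal sketch: sort_chronologically is a hypothetical helper, and it assumes the date column is named "timestamp" and holds YYYY-MM-DD strings.

```python
from datetime import date


def sort_chronologically(samples, date_column="timestamp"):
    """Return the samples sorted by their YYYY-MM-DD date column (stable sort)."""
    return sorted(samples, key=lambda s: date.fromisoformat(s[date_column]))


observed = [
    {"user_id": "A39574", "timestamp": "2022-12-26", "special_event": "christmas", "movie_title": "Home Alone 2"},
    {"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas", "movie_title": "Die Hard 3"},
    {"user_id": "A39574", "timestamp": "2022-12-25", "special_event": "christmas", "movie_title": "Home Alone"},
]

ordered = sort_chronologically(observed)
# `ordered` could then be passed as model.index_batch(input_samples=ordered).
```

Samples that share a date keep their original relative order because Python's sort is stable.
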
UniversalDeepTransformer.reset_temporal_trackers() None

Resets UniversalDeepTransformer’s (UDT) temporal context. When temporal relationships are supplied, UDT assumes that we feed it data in chronological order. If we break this assumption, we need to reset the temporal trackers first. For example, when we repeat the UDT training routine on the same dataset, we train on data from the same time period as before, so we need to reset the temporal trackers first to avoid double counting events.

Parameters:

None

Returns:

None

Example

>>> model.reset_temporal_trackers()
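
To see why a reset matters, consider a toy stand-in for a temporal tracker. ToyTracker is hypothetical and greatly simplified; it is not UDT's internal implementation. It only keeps the last n values per key, which is enough to show how replaying the same events without a reset double counts them.

```python
from collections import defaultdict, deque


class ToyTracker:
    """Keeps the last n values seen per key (toy model, not UDT internals)."""

    def __init__(self, track_last_n):
        self.history = defaultdict(lambda: deque(maxlen=track_last_n))

    def index(self, key, value):
        self.history[key].append(value)

    def context(self, key):
        return list(self.history[key])

    def reset(self):
        self.history.clear()


events = [("A33225", "Die Hard"), ("A33225", "Die Hard 2")]

tracker = ToyTracker(track_last_n=4)
for key, value in events:
    tracker.index(key, value)
# A second pass over the same events without a reset double counts them.
for key, value in events:
    tracker.index(key, value)
double_counted = tracker.context("A33225")

# Resetting first gives a clean context for the repeated pass.
tracker.reset()
for key, value in events:
    tracker.index(key, value)
after_reset = tracker.context("A33225")
```
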
UniversalDeepTransformer.index_nodes(data_source: dataset.DataSource) None

Updates the graph that the UDT model is performing graph node classification on. The file should have the same node id, neighbors, and features columns as the model is configured to accept.

Parameters:

data_source (dataset.DataSource) – The data source to load the graph from.

Returns:

None

UniversalDeepTransformer.clear_graph() None

Clears all graph info that is being tracked by the model.

Returns:

None

class thirdai.bolt.Validation

Bases: pybind11_object

__init__(filename: str, metrics: List[str], interval: int | None = None, use_sparse_inference: bool = False) None

Creates a validation object that stores the necessary information for the model to perform validation during training.

Parameters:
  • filename (str) – The name of the validation file.

  • metrics (List[str]) – The metrics to compute for validation.

  • interval (Optional[int]) – The interval, in number of batches, between computing validation. For instance, interval=10 means that validation metrics will be computed every 10 batches. If it is not specified then validation will be done after each epoch.

  • use_sparse_inference (bool) – Optional, defaults to False. When True, sparse inference will be used during validation.

Examples

>>> validation = bolt.Validation(
        filename="validation.csv", metrics=["categorical_accuracy"], interval=10
    )
>>> model.train("train.csv", epochs=5, validation=validation)
property filename
property metrics
property sparse_validation
property steps_per_validation
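
The effect of the interval parameter can be pictured with a small calculation. The sketch below is illustrative only: validation_batches is a hypothetical helper, and it assumes validation runs after every interval-th batch, which is one reading of "every 10 batches" rather than confirmed behavior.

```python
def validation_batches(num_batches, interval):
    """1-based batch counts after which validation would run, assuming it
    triggers after every `interval` batches (an assumption about exact timing)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return list(range(interval, num_batches + 1, interval))


# With 35 training batches and interval=10, validation would run three times.
points = validation_batches(num_batches=35, interval=10)
```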

UDT Input Column Types

class thirdai.bolt.types.ColumnType

Bases: pybind11_object

Base class for bolt types.

class thirdai.bolt.types.categorical

Bases: ColumnType

__init__(delimiter: str | None = None, metadata: bolt.types.metadata = None) None

Categorical column type. Use this object if a column contains categorical data (each unique value is treated as a class).

Parameters:
  • delimiter (Optional[str]) – Optional. Defaults to None. A single character (length-1 string) that separates multiple values within the same column.

  • metadata (bolt.types.metadata) – Optional. Defaults to None. A metadata configuration object used to enrich this categorical column. See bolt.types.metadata for details.

Example

>>> # Suppose each row of our data has the following columns: "product_id", "timestamp", "ad_spend_level", "sales_performance"
>>> # We want to predict the current week's sales performance for each product using temporal context.
>>> # For each product ID, we would like to track both their ad spend level and sales performance over time.
>>> # Ad spend level is known at the time of inference but sales performance is not. Then we can configure UDT as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "product_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "ad_spend_level": bolt.types.categorical(),
            "sales_performance": bolt.types.categorical(),
        },
        temporal_tracking_relationships={
            "product_id": [
                bolt.temporal.categorical(column_name="ad_spend_level", track_last_n=5, column_known_during_inference=True),
                bolt.temporal.categorical(column_name="ad_spend_level", track_last_n=25, column_known_during_inference=True),
                bolt.temporal.categorical(column_name="sales_performance", track_last_n=5), # column_known_during_inference defaults to False
            ]
        },
        ...
    )

Notes

  • Temporal categorical features are tracked as a set; if we track the last 5 ad spend levels, we capture what the last 5 ad spend levels are, but we do not capture their order.

  • The same column can be tracked more than once, allowing us to capture both short and long term trends.

property delimiter
class thirdai.bolt.types.date

Bases: ColumnType

__init__() None

Date column type. Use this object if a column contains date strings. Date strings must be in YYYY-MM-DD format.

Example

>>> bolt.UniversalDeepTransformer(
        data_types={
            "timestamp": bolt.types.date()
        },
        ...
    )
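
Since date strings must be in YYYY-MM-DD format, it can be useful to check a column before handing it to UDT. This is a minimal sketch using only the standard library; is_valid_date_string is a hypothetical helper and not part of thirdai.

```python
from datetime import datetime


def is_valid_date_string(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except (TypeError, ValueError):
        return False
    # strptime also accepts unpadded fields such as "2022-1-5",
    # so require an exact round trip to enforce zero padding.
    return parsed.strftime("%Y-%m-%d") == value


checks = {
    "2022-12-25": is_valid_date_string("2022-12-25"),
    "12/25/2022": is_valid_date_string("12/25/2022"),
    "2022-02-30": is_valid_date_string("2022-02-30"),
    "2022-1-5": is_valid_date_string("2022-1-5"),
}
```
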
class thirdai.bolt.types.metadata

Bases: pybind11_object

__init__(filename: str, key_column_name: str, data_types: Dict[str, bolt.types.ColumnType], delimiter: str = ',') None

A configuration object for processing a metadata file to enrich categorical features from the main dataset. To illustrate when this is useful, suppose we are building a movie recommendation system. The contents of the training dataset may look something like the following:

    user_id,movie_id,timestamp
    A526,B894,2022-01-01
    A339,B801,2022-01-01
    A293,B801,2022-01-01
    …

If you have additional information about users or movies, such as users’ age groups or movie genres, you can use that information to enrich your model. Adding these features into the main dataset as new columns is wasteful because the same user and movie IDs will be repeated many times throughout the dataset. Instead, we can put them all in a metadata file and UDT will inject these features where appropriate.

Parameters:
  • filename (str) – Path to metadata file. The file should be in CSV format.

  • key_column_name (str) – The name of the column whose values are used as keys to map metadata features back to values in the main dataset. This column does not need to be passed into the data_types argument.

  • data_types (Dict[str, bolt.types.ColumnType]) – A mapping from column name to column type. Column type is one of: - bolt.types.categorical - bolt.types.numerical - bolt.types.text - bolt.types.date

  • delimiter (str) – Optional. Defaults to ‘,’. A single character (length-1 string) that separates the columns of the metadata file.

Example

>>> for line in open("user_meta.csv"):
...     print(line, end="")
user_id,age
A526,52
A531,22
A339,29
...
>>> bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(
                delimiter=' ',
                metadata=bolt.types.metadata(
                    filename="user_meta.csv",
                    data_types={"age": bolt.types.numerical(range=(0, 100))}, # illustrative range
                    key_column_name="user_id"
                )
            )
        },
        ...
    )
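
Conceptually, a metadata file acts like a lookup table keyed by key_column_name. The sketch below illustrates that idea with the standard csv module; it is not how UDT processes metadata internally, and enrich_rows is a hypothetical helper operating on in-memory CSV text.

```python
import csv
import io


def enrich_rows(main_csv, metadata_csv, key_column_name):
    """Attach metadata columns to each main-dataset row by key lookup."""
    metadata = {
        row[key_column_name]: {k: v for k, v in row.items() if k != key_column_name}
        for row in csv.DictReader(io.StringIO(metadata_csv))
    }
    enriched = []
    for row in csv.DictReader(io.StringIO(main_csv)):
        # Rows whose key has no metadata entry are kept unchanged.
        extra = metadata.get(row[key_column_name], {})
        enriched.append({**row, **extra})
    return enriched


main_csv = "user_id,movie_id,timestamp\nA526,B894,2022-01-01\nA339,B801,2022-01-01\n"
metadata_csv = "user_id,age\nA526,52\nA531,22\nA339,29\n"

rows = enrich_rows(main_csv, metadata_csv, key_column_name="user_id")
```

Each user's age is stored once in the metadata text, however many times that user appears in the main dataset.
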
class thirdai.bolt.types.neighbors

Bases: ColumnType

__init__() None
class thirdai.bolt.types.node_id

Bases: ColumnType

__init__() None
class thirdai.bolt.types.numerical

Bases: ColumnType

__init__(range: Tuple[float, float], granularity: str = 'm') None

Numerical column type. Use this object if a column contains numerical data (the value is treated as a quantity). Examples include hours of a movie watched, sale quantity, or population size.

Parameters:
  • range (tuple(float, float)) – The expected range (min to max) of the numeric quantity. The more closely this range matches the test data, the better the model performs.

  • granularity (str) – Optional. One of “extrasmall”/“xs”, “small”/“s”, “medium”/“m”, “large”/“l”, or “extralarge”/“xl”. Defaults to “m”.

Example

>>> bolt.UniversalDeepTransformer(
        data_types={
            "hours_watched": bolt.types.numerical(range=(0, 25), granularity="xs")
        },
        ...
    )
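
Because the range should reflect the data, one option is to derive it from observed values. This is a minimal sketch: observed_range is a hypothetical helper, and the sample values are made up.

```python
def observed_range(values):
    """Return (min, max) of the given numeric values as floats."""
    numbers = [float(v) for v in values]
    if not numbers:
        raise ValueError("need at least one value")
    return (min(numbers), max(numbers))


# Made-up column values, as they might be read from a CSV file.
hours_watched = ["0.5", "3", "24.75", "12"]

value_range = observed_range(hours_watched)
# value_range could then be passed as bolt.types.numerical(range=value_range).
```

In practice it may be worth widening the observed range slightly if later data is expected to exceed it.
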
class thirdai.bolt.types.sequence

Bases: ColumnType

__init__(delimiter: str = ' ', max_length: int | None = None) None

Sequence column type. Use this object if a column contains an ordered sequence of strings delimited by a character. The sequence delimiter must be different from the delimiter between columns.

When the target column is a sequence type, then UDT will perform inferences recursively.

Parameters:
  • delimiter (str) – Optional. The sequence delimiter. Defaults to a single space (“ ”).

  • max_length (int) – Required if the column is the target. The maximum length of the sequence. If UDT sees longer sequences, elements beyond the provided upper bound will be ignored.

Example

>>> bolt.UniversalDeepTransformer(
        data_types={
            "input_sequence": bolt.types.sequence(delimiter='\t'),
            "output_sequence": bolt.types.sequence(max_length=30) # max_length must be provided for target sequence.
        },
        target="output_sequence",
        n_target_classes=26,
        ...
    )
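
The roles of delimiter and max_length can be illustrated with plain string handling. The sketch below is an approximation, not UDT's parser: parse_sequence is a hypothetical helper that splits on the delimiter and ignores elements beyond max_length.

```python
def parse_sequence(value, delimiter=" ", max_length=None):
    """Split a delimited sequence and drop elements beyond max_length."""
    elements = value.split(delimiter)
    if max_length is not None:
        elements = elements[:max_length]
    return elements


truncated = parse_sequence("a b c d", max_length=3)
tab_separated = parse_sequence("x\ty", delimiter="\t")
```
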
class thirdai.bolt.types.text

Bases: ColumnType

__init__(*args, **kwargs)

Overloaded function.

  1. __init__(self: thirdai._thirdai.bolt.types.text, tokenizer: str = ‘words’, contextual_encoding: str = ‘none’, lowercase: bool = True) -> None

Text column type. Use this object if a column contains text data (the meaning of the text matters). Examples include descriptions, search queries, and user bios.

Parameters:
  • tokenizer (str) – Optional. Either “words”, “words-punct” or “char-k” (k is a number, e.g. “char-5”). Defaults to “words”.

  • contextual_encoding (str) – Optional. One of “local”, “global”, “ngram-N”, or “none”. Defaults to “none”.

  • lowercase (bool) – Optional. Whether to lowercase the text. Defaults to True.

Example

>>> bolt.UniversalDeepTransformer(
        data_types={
            "user_motto": bolt.types.text(),
            "user_bio": bolt.types.text(contextual_encoding="local")
        },
        ...
    )
  2. __init__(self: thirdai._thirdai.bolt.types.text, tokenizer: thirdai._thirdai.dataset.WordpieceTokenizer, contextual_encoding: str = ‘none’) -> None

Text column type. Use this object if a column contains text data (the meaning of the text matters). Examples include descriptions, search queries, and user bios.

Parameters:
  • tokenizer (dataset.WordpieceTokenizer) – A wordpiece tokenizer object used to tokenize the text.

  • contextual_encoding (str) – Optional. One of “local”, “global”, “ngram-N”, or “none”. Defaults to “none”.

Example

>>> bolt.UniversalDeepTransformer(
        data_types={
            "user_motto": bolt.types.text(),
            "user_bio": bolt.types.text(contextual_encoding="local")
        },
        ...
    )
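
The string tokenizer options can be pictured with a rough sketch. The exact behavior of thirdai's tokenizers is not specified here, so both functions are assumptions: word_tokens splits on whitespace, and char_k_tokens emits overlapping k-character windows. Both names are hypothetical.

```python
def word_tokens(text, lowercase=True):
    """Whitespace tokenization (assumed reading of the "words" option)."""
    if lowercase:
        text = text.lower()
    return text.split()


def char_k_tokens(text, k, lowercase=True):
    """Overlapping k-character windows (assumed reading of "char-k")."""
    if k <= 0:
        raise ValueError("k must be positive")
    if lowercase:
        text = text.lower()
    if len(text) <= k:
        # Text shorter than one window yields a single token, or none if empty.
        return [text] if text else []
    return [text[i:i + k] for i in range(len(text) - k + 1)]


words = word_tokens("Hello World")
trigrams = char_k_tokens("Bolt", 3)
```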

UDT Temporal Options

class thirdai.bolt.temporal.TemporalConfig

Bases: pybind11_object

Base class for temporal feature configs.

thirdai.bolt.temporal.categorical(column_name: str, track_last_n: int, column_known_during_inference: bool = False, use_metadata: bool = False) bolt.temporal.TemporalConfig

Temporal categorical config. Use this object to configure how a categorical column is tracked over time.

Parameters:
  • column_name (str) – The name of the tracked column.

  • track_last_n (int) – Number of last categorical values to track per tracking id.

  • column_known_during_inference (bool) – Optional. Whether the value of the tracked column is known during inference. Defaults to False.

  • use_metadata (bool) – Optional. Whether to use the metadata of the N tracked items, if metadata is provided in the corresponding categorical column type object. Ignored if no metadata is provided. Defaults to False.

Example

>>> # Suppose each row of our data has the following columns: "product_id", "timestamp", "ad_spend_level", "sales_performance"
>>> # We want to predict the current week's sales performance for each product using temporal context.
>>> # For each product ID, we would like to track both their ad spend level and sales performance over time.
>>> # Ad spend level is known at the time of inference but sales performance is not. Then we can configure UDT as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "product_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "ad_spend_level": bolt.types.categorical(),
            "sales_performance": bolt.types.categorical(),
        },
        temporal_tracking_relationships={
            "product_id": [
                bolt.temporal.categorical(column_name="ad_spend_level", track_last_n=5, column_known_during_inference=True),
                bolt.temporal.categorical(column_name="ad_spend_level", track_last_n=25, column_known_during_inference=True),
                bolt.temporal.categorical(column_name="sales_performance", track_last_n=5), # column_known_during_inference defaults to False
            ]
        },
        ...
    )

Notes

  • Temporal categorical features are tracked as a set; if we track the last 5 ad spend levels, we capture what the last 5 ad spend levels are, but we do not capture their order.

  • The same column can be tracked more than once, allowing us to capture both short and long term trends.
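
The set semantics described in the notes can be shown with a toy function. This is an illustration only, not UDT's implementation: last_n_as_set is a hypothetical helper that keeps the last n values and discards their order.

```python
from collections import deque


def last_n_as_set(values, track_last_n):
    """Keep the last n values and discard their order."""
    return frozenset(deque(values, maxlen=track_last_n))


# Two histories with the same recent values in a different order
# produce the same tracked feature.
history_a = ["low", "high", "medium"]
history_b = ["medium", "low", "high"]
same_feature = last_n_as_set(history_a, 3) == last_n_as_set(history_b, 3)

# Only the most recent track_last_n values are retained.
recent = last_n_as_set(["low", "low", "high", "medium"], track_last_n=2)
```

Tracking the same column with two values of track_last_n, as in the example above, yields one short-window set and one long-window set for the same key.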

thirdai.bolt.temporal.numerical(column_name: str, history_length: int, column_known_during_inference: bool = False) bolt.temporal.TemporalConfig

Temporal numerical config. Use this object to configure how a numerical column is tracked over time.

Parameters:
  • column_name (str) – The name of the tracked column.

  • history_length (int) – Amount of time to look back. Time is in terms of the time granularity passed to the UDT constructor.

  • column_known_during_inference (bool) – Optional. Whether the value of the tracked column is known during inference. Defaults to False.

Example

>>> # Suppose each row of our data has the following columns: "product_id", "timestamp", "ad_spend", "sales_quantity", "sales_performance"
>>> # We want to predict the current week's sales performance for each product using temporal context.
>>> # For each product ID, we would like to track both their ad spend and sales quantity over time.
>>> # Ad spend is known at the time of inference but sales quantity is not. Then we can configure UDT as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "product_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "ad_spend": bolt.types.numerical(range=(0, 10000)),
            "sales_quantity": bolt.types.numerical(range=(0, 10000)), # illustrative range
            "sales_performance": bolt.types.categorical(),
        },
        target="sales_performance",
        time_granularity="weekly",
        temporal_tracking_relationships={
            "product_id": [
                # Track last 5 weeks of ad spend
                bolt.temporal.numerical(column_name="ad_spend", history_length=5, column_known_during_inference=True),
                # Track last 10 weeks of ad spend
                bolt.temporal.numerical(column_name="ad_spend", history_length=10, column_known_during_inference=True),
                # Track last 5 weeks of sales quantity
                bolt.temporal.numerical(column_name="sales_quantity", history_length=5), # column_known_during_inference defaults to False
            ]
        },
    )

Notes

  • The same column can be tracked more than once, allowing us to capture both short and long term trends.