thirdai.bolt

UDT

UniversalDeepTransformer.__init__(target: str | None = None, data_types=None, dataset_path: str | None = None, **kwargs)

UniversalDeepTransformer.train(filename: str, learning_rate: float = 0.001, epochs: int = 5, validation: Validation | None = None, batch_size: int | None = None, max_in_memory_batches: int | None = None, verbose: bool = True, callbacks: List[Callback] = [], metrics: List[str] = [], logging_interval: int | None = None, shuffle_reservoir_size: int = 64000, comm=None, **kwargs)

Trains a UniversalDeepTransformer (UDT) on a dataset stored in a file on disk or in a cloud storage bucket, such as S3 or Google Cloud Storage (GCS). If the file is on S3, it should be in the normal S3 form, i.e. s3://bucket/path/to/key. For files in GCS, the path should have the form gcs://bucket/path/to/filename. We currently support csv and parquet format files. If the file is parquet, it should end in .parquet or .pqt. Otherwise, we will assume it is a csv file.

Parameters:

filename (str) – Path to the dataset file. It can be a path to a file on disk or an S3 or GCS resource identifier. If the file is on S3 or GCS, regular credentials files will be required for authentication.
learning_rate (float) – Optional, defaults to 0.001.
epochs (int) – Optional, defaults to 5.
validation (Optional[bolt.Validation]) – This is an optional parameter that specifies a validation dataset, metrics, and interval to use during training.
batch_size (Optional[int]) – This is an optional parameter indicating which batch size to use for training. If not specified, the batch size will be autotuned.
max_in_memory_batches (Optional[int]) – The maximum number of batches to load in memory at a given time. If this is specified then the dataset will be processed in a streaming fashion.
verbose (bool) – Optional, defaults to True. Controls if additional information is printed during training.
callbacks (List[bolt.train.callbacks.Callback]) – List of callbacks to use during training.
metrics (List[str]) – List of metrics to compute during training. These are logged if logging is enabled, and are accessible by any callbacks.
logging_interval (Optional[int]) – How frequently to log training metrics, represents the number of batches between logging metrics. If not specified logging is done at the end of each epoch.

Returns:

The train method returns a dictionary providing the values of any metrics computed during training. The format is: {“name of metric”: [list of values]}.

Return type:

(Dict[str, List[float]])
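
The returned dictionary can be post-processed with ordinary Python. The sketch below assumes a history in the documented {“name of metric”: [list of values]} format; the helper name last_values and the numbers are made up for illustration and are not part of thirdai.

```python
from typing import Dict, List


def last_values(history: Dict[str, List[float]]) -> Dict[str, float]:
    """Return the most recent value recorded for each metric."""
    return {name: values[-1] for name, values in history.items() if values}


# Made-up history in the documented {"name of metric": [list of values]} shape.
history = {"categorical_accuracy": [0.61, 0.74, 0.80]}
print(last_values(history))  # {'categorical_accuracy': 0.8}
```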

Examples

>>> model.train(
        filename="./train_file", epochs=5, learning_rate=0.01, max_in_memory_batches=12
    )
>>> model.train(
        filename="s3://bucket/path/to/key"
    )
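
The path conventions described above can be sketched in plain Python. This helper is not part of thirdai: describe_dataset_path is a hypothetical name, and the logic only mirrors the documented rules (s3:// and gcs:// prefixes, .parquet or .pqt suffixes, csv otherwise).

```python
from typing import Tuple


def describe_dataset_path(path: str) -> Tuple[str, str]:
    """Return (storage, file_format) following the conventions documented for train()."""
    if path.startswith("s3://"):
        storage = "s3"
    elif path.startswith("gcs://"):
        storage = "gcs"
    else:
        storage = "local"
    # Parquet files must end in .parquet or .pqt; anything else is treated as csv.
    file_format = "parquet" if path.endswith((".parquet", ".pqt")) else "csv"
    return storage, file_format


print(describe_dataset_path("s3://bucket/path/to/key"))      # ('s3', 'csv')
print(describe_dataset_path("gcs://bucket/data/train.pqt"))  # ('gcs', 'parquet')
print(describe_dataset_path("./train_file"))                 # ('local', 'csv')
```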

Notes

If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. model.train() automatically updates UDT’s temporal context.
If the prediction task is binary classification, the model will attempt to find an optimal threshold for predictions, which is used when return_predicted_class=True is passed to evaluate, predict, or predict_batch. The threshold is chosen to maximize the first validation metric on the validation data. If no validation data or metrics are passed in, the model uses the first 100 batches of the training data and the first training metric. If there are also no training metrics, it will not choose a prediction threshold.
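
To make the threshold selection concrete, here is a minimal sketch of picking the threshold that maximizes a metric on held-out scores. The documentation does not describe how UDT searches for the threshold, so the grid of candidates, the >= comparison, and the names best_threshold and accuracy are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence


def accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(int(l == p) for l, p in zip(labels, predictions)) / len(labels)


def best_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float] = accuracy,
    candidates: Optional[List[float]] = None,
) -> float:
    """Return the candidate threshold with the highest metric value (first one on ties)."""
    if candidates is None:
        # Assumed search grid: 0.01, 0.02, ..., 0.99.
        candidates = [i / 100 for i in range(1, 100)]
    best, best_value = candidates[0], float("-inf")
    for threshold in candidates:
        predictions = [int(score >= threshold) for score in scores]
        value = metric(labels, predictions)
        if value > best_value:
            best, best_value = threshold, value
    return best


# Made-up positive-class scores and labels for four validation samples.
threshold = best_threshold(scores=[0.2, 0.3, 0.7, 0.9], labels=[0, 0, 1, 1])
print(threshold)
```

Any threshold above 0.3 and at most 0.7 separates this toy data perfectly; the sketch returns the first such candidate on its grid.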

UniversalDeepTransformer.cold_start(filename: str, strong_column_names: ~typing.List[str], weak_column_names: ~typing.List[str], variable_length: ~data.transformations.VariableLengthConfig | None = <data.transformations.VariableLengthConfig object>, learning_rate: float = 0.001, epochs: int = 5, batch_size: int | None = None, metrics: ~typing.List[str] = [], validation: ~bolt.Validation | None = None, callbacks: ~typing.List[~bolt.train.callbacks.Callback] = [], max_in_memory_batches: int | None = None, verbose: bool = True, logging_interval: int | None = None, comm=None, shuffle_reservoir_size: int = 64000, **kwargs)

This method will perform cold start pretraining for UDT. This is a type of pretraining for text classification models that is especially useful for query to product recommendation models. It requires that the model takes in a single text input and has a categorical/multi-categorical output.

The cold start pretraining typically takes in an unsupervised dataset of objects where each object corresponds to one or more columns of textual metadata. This could be something like a product catalog (with product ids as objects, and titles, descriptions, and tags as metadata). The goal with cold start is to pre-train UDT on unsupervised data so in the future it may be able to answer text search queries and return the relevant objects. The dataset it takes in should be a csv file that gives a class id column and some number of text columns, where for a given row the text is related to the class also specified by that row.

You may cold_start the model and train with supervised data afterwards, typically leading to faster convergence on the supervised data.

Parameters:

filename (str) – Path to the dataset used for pretraining.
strong_column_names (List[str]) – The strong column names indicate which text columns are most closely related to the output class. In this case closely related means that all of the words in the text are useful in identifying the output class in that row. For example, in a product catalog a strong column could be the full title of the product.
weak_column_names (List[str]) – The weak column names indicate which text columns are more loosely related to the output class. In this case loosely related means that parts of the text are useful in identifying the output class, but other parts may contain more generic words or phrases that are less strongly correlated with it. For example, in a product catalog the description of the product could be a weak column: while there is a correlation, parts of the description may be fairly similar between products or too general to completely identify which products they correspond to.
learning_rate (float) – Optional, defaults to 0.001.
epochs (int) – Optional, defaults to 5.
batch_size (Optional[int]) – This is an optional parameter indicating which batch size to use for training. If not specified, the batch size will be autotuned.
metrics (List[str]) – List of metrics to compute during training. These are logged if logging is enabled, and are accessible by any callbacks.
validation (Optional[bolt.Validation]) – This is an optional parameter that specifies a validation dataset, metrics, and interval to use during training.
callbacks (List[bolt.train.callbacks.Callback]) – List of callbacks to use during training.
max_in_memory_batches (Optional[int]) – The maximum number of batches to load in memory at a given time. If this is specified then the dataset will be processed in a streaming fashion.
verbose (bool) – Optional, defaults to True. Controls if additional information is printed during training.
logging_interval (Optional[int]) – How frequently to log training metrics, represents the number of batches between logging metrics. If not specified logging is done at the end of each epoch.

Returns:

The cold_start method returns a dictionary providing the values of any metrics computed during training. The format is: {“name of metric”: [list of values]}.

Return type:

(Dict[str, List[float]])

Examples

>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "query": bolt.types.text(),
            "product": bolt.types.categorical(n_classes=1000),
        },
        target="product",
    )
>>> model.cold_start(
        filename="product_catalog.csv",
        strong_column_names=["title"],
        weak_column_names=["description", "bullet_points"],
        learning_rate=0.001,
        epochs=5,
        metrics=["f_measure(0.95)"]
    )
>>> model.train(
        filename=supervised_query_product_data,
    )
>>> result = model.predict({"query": query})
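
For reference, a cold start dataset of the kind described above (one class id column plus text columns) can be written with the standard csv module. This is only a sketch: the rows are made up, write_catalog is a hypothetical helper, and the column names simply match the ones used in the example (product, title, description, bullet_points).

```python
import csv
import os
import tempfile
from typing import Dict, List

FIELDNAMES = ["product", "title", "description", "bullet_points"]

# Made-up catalog rows: a class id plus strong (title) and weak (other) text columns.
ROWS = [
    {
        "product": "0",
        "title": "Trail running shoes",
        "description": "Lightweight shoes with a grippy sole for uneven terrain.",
        "bullet_points": "breathable mesh; cushioned midsole",
    },
    {
        "product": "1",
        "title": "Wireless optical mouse",
        "description": "Compact mouse with a USB receiver and long battery life.",
        "bullet_points": "ambidextrous; three buttons",
    },
]


def write_catalog(path: str, rows: List[Dict[str, str]]) -> None:
    """Write the catalog rows to a csv file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


catalog_path = os.path.join(tempfile.mkdtemp(), "product_catalog.csv")
write_catalog(catalog_path, ROWS)
print(open(catalog_path).read())
```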

UniversalDeepTransformer.train_batch(batch: List[Dict[str, str]], learning_rate: float = 0.001) → object

Trains the model on the given training batch.

Parameters:

batch (List[Dict[str, str]]) – The raw data comprising the training batch. This should be in the form {“column_name”: “column_value”} for each column the model expects.
learning_rate (float) – Optional, uses default if not provided.

Returns:

None
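
A batch in the List[Dict[str, str]] form described above can be built from csv text with the standard library. The sketch below is an illustration only: csv_to_batches is a hypothetical helper, the data is made up, and the column names (query, product) are assumed to be the columns the model expects.

```python
import csv
import io
from typing import Dict, List

# Made-up supervised data with a text column and a target column.
CSV_TEXT = """query,product
red running shoes,17
wireless mouse,42
usb c cable,8
"""


def csv_to_batches(csv_text: str, batch_size: int) -> List[List[Dict[str, str]]]:
    """Split csv rows into batches of {"column_name": "column_value"} dictionaries."""
    rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


batches = csv_to_batches(CSV_TEXT, batch_size=2)
for batch in batches:
    # Each batch has the shape that train_batch() is documented to accept.
    print(batch)
```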

UniversalDeepTransformer.evaluate(filename: str, metrics: List[str] = [], use_sparse_inference: bool = False, verbose: bool = True, **kwargs)

Evaluates the UniversalDeepTransformer (UDT) on the given dataset and returns the values of the requested metrics. We currently support csv and parquet format files. If the file is parquet, it should end in .parquet or .pqt. Otherwise, we will assume it is a csv file.

Parameters:

filename (str) – Path to the dataset file. Like train, this can be a path to a local file or a path to a file that lives in an S3 or Google Cloud Storage (GCS) bucket.
metrics (List[str]) – List of metrics to compute during evaluation.
use_sparse_inference (bool) – Optional, defaults to False, determines if sparse inference is used during evaluation.
verbose (bool) – Optional, defaults to True. Controls if additional information is printed during evaluation.
top_k (Optional[int]) – Optional, defaults to None. This parameter is only used for query reformulation models to determine how many candidates to select before computing evaluation metrics.

Returns:

Returns a dictionary mapping each specified metric name to its computed value.

Return type:

(Dict[str, float])

Examples

>>> metrics = model.evaluate(filename="./test_file", metrics=["categorical_accuracy"])

Notes

If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. model.evaluate() automatically updates UDT’s temporal context.

UniversalDeepTransformer.predict(sample: Dict[str, str], sparse_inference: bool = False, return_predicted_class: bool = False, top_k: int | None = None, **kwargs) → object

Performs inference on a single sample.

Parameters:

input_sample (Dict[str, str]) – The input sample as a dictionary where the keys are column names and the values are the respective column values.
use_sparse_inference (bool) – Whether or not to use sparse inference.
return_predicted_class (bool) – If true then the model will return the id of the predicted class instead of the activations of the output layer. This argument is only applicable to classification models.
top_k (Optional[int]) – If specified then the model will return the ids of the top k predicted classes instead of the activations of the output layer. This argument is only applicable to classification models.

Returns:

Returns a numpy array of the activations if the output is dense, or a tuple of the active neurons and activations if the output is sparse. The shape of each array will be (num_nonzeros_in_output, ). If return_predicted_class is True then the class id (an integer) will be returned. If top_k is specified then a list of integer class ids will be returned. You can map neuron ids back to target class names by calling the class_name() method. If the target column is a sequence, UDT will perform inference recursively and return a sequence in the same format as the target column.

Return type:

(np.ndarray, Tuple[np.ndarray, np.ndarray], List[int], or int)

Examples

>>> # Suppose we configure UDT as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "special_event": bolt.types.categorical(),
            "movie_title": bolt.types.categorical(n_classes=500)
        },
        temporal_tracking_relationships={
            "user_id": ["movie_title"]
        },
        target="movie_title",
    )
>>> # Make a single prediction
>>> activations = model.predict(
        input_sample={"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas"}
    )

Notes

The values of columns that are tracked temporally may be unknown during inference (the column_known_during_inference attribute of the bolt.temporal objects is False by default). These columns do not need to be passed into model.predict(). For example, we did not pass the “movie_title” column to model.predict(). All other columns must be passed in.
If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. Thus, UDT is at its best when its internal temporal context gets updated with new true samples. model.predict() does not update UDT’s temporal context. To do this without retraining the model, we need to use model.index() or model.index_batch(). Read about model.index() and model.index_batch() for details.
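
The two output shapes described under Returns (a dense array of activations, or a tuple of active neurons and activations) can be handled with one small helper. This is a sketch, not thirdai code: top_neuron is a hypothetical name, and plain lists stand in for the numpy arrays so that the example runs without any dependencies.

```python
from typing import Sequence, Tuple, Union

Dense = Sequence[float]
Sparse = Tuple[Sequence[int], Sequence[float]]


def top_neuron(output: Union[Dense, Sparse]) -> int:
    """Return the id of the output neuron with the largest activation."""
    if isinstance(output, tuple):
        # Sparse output: (active neuron ids, their activations).
        active_neurons, activations = output
        best = max(range(len(activations)), key=lambda i: activations[i])
        return int(active_neurons[best])
    # Dense output: the position in the array is the neuron id.
    return int(max(range(len(output)), key=lambda i: output[i]))


print(top_neuron([0.1, 0.7, 0.2]))                # 1
print(top_neuron(([4, 9, 17], [0.2, 0.5, 0.3])))  # 9
```

The resulting neuron id is the kind of value that class_name() is documented to map back to a target class name.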

UniversalDeepTransformer.predict_batch(samples: List[Dict[str, str]], sparse_inference: bool = False, return_predicted_class: bool = False, top_k: int | None = None, **kwargs) → object

Performs inference on a batch of samples in parallel.

Parameters:

input_samples (List[Dict[str, str]]) – A list of input samples as dictionaries where the keys are column names as specified in data_types and the values are the respective column values.
use_sparse_inference (bool, default=False) – Whether or not to use sparse inference.
return_predicted_class (bool) – If true then the model will return the id of the predicted class instead of the activations of the output layer. This argument is only applicable to classification models.
top_k (Optional[int]) – If specified then the model will return the ids of the top k predicted classes instead of the activations of the output layer. This argument is only applicable to classification models.

Returns:

Returns a numpy array of the activations if the output is dense, or a tuple of the active neurons and activations if the output is sparse. The shape of each array will be (batch_size, num_nonzeros_in_output). If return_predicted_class is True then a list of class ids (integers), one per sample, will be returned. If top_k is specified then a list of lists of integer class ids, one list per sample, will be returned. You can map neuron ids back to target class names by calling the class_name() method. If the target column is a sequence, UDT will perform inference recursively and return a sequence in the same format as the target column.

Return type:

(np.ndarray, Tuple[np.ndarray, np.ndarray], List[List[int]], or List[int])

Examples

>>> activations = model.predict_batch([
        {"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas"},
        {"user_id": "A25978", "timestamp": "2022-12-25", "special_event": "christmas"},
        {"user_id": "A25978", "timestamp": "2022-12-26", "special_event": "christmas"}
    ])

Notes

The values of columns that are tracked temporally may be unknown during inference (the column_known_during_inference attribute of the bolt.temporal objects is False by default). These columns do not need to be passed into model.predict_batch(). For example, we did not pass the “movie_title” column to model.predict_batch(). All other columns must be passed in.
If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. Thus, UDT is at its best when its internal temporal context gets updated with new true samples. model.predict_batch() does not update UDT’s temporal context. To do this without retraining the model, we need to use model.index() or model.index_batch(). Read about model.index() and model.index_batch() for details.

UniversalDeepTransformer.explain(input_sample: Dict[str, str], target_class: int | str | None = None) → List[Tuple[str, float]]

Identifies the columns that are most responsible for a predicted outcome and provides a brief description of the column’s value.

If a target is provided, the model will identify the columns that need to change for the model to predict the target class.

Parameters:

input_sample (Dict[str, str]) – The input sample as a dictionary where the keys are column names as specified in data_types and the values are the respective column values.
target_class (Optional[Union[int, str]]) – Optional. The desired target class. If provided, the model will identify the columns that need to change for the model to predict the target class.

Returns:

A list of explanations from the input features along with weights representing the significance of that feature.

Return type:

List[Tuple[str, float]]

Example

>>> # Suppose we configure UDT as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "special_event": bolt.types.categorical(),
            "movie_title": bolt.types.categorical(n_classes=500)
        },
        temporal_tracking_relationships={
            "user_id": ["movie_title"]
        },
        target="movie_title",
    )
>>> # Explain a single prediction
>>> explanations = model.explain(
        input_sample={"user_id": "A33225", "timestamp": "2022-02-02", "special_event": "christmas"}, target_class=35
    )
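
Since explain() is documented to return a List[Tuple[str, float]], the result can be ranked with ordinary Python. The following sketch uses made-up explanation strings and weights; rank_explanations is a hypothetical helper, and sorting by absolute weight is an assumption about how one might want to read the output, not documented behavior.

```python
from typing import List, Tuple


def rank_explanations(
    explanations: List[Tuple[str, float]], top_n: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_n explanations with the largest absolute weight."""
    return sorted(explanations, key=lambda item: abs(item[1]), reverse=True)[:top_n]


# Made-up output in the documented List[Tuple[str, float]] shape.
explanations = [("user_id", 0.12), ("timestamp", -0.40), ("special_event", 0.33)]
for description, weight in rank_explanations(explanations, top_n=2):
    print(f"{description}: {weight:+.2f}")
```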

UniversalDeepTransformer.save(filename: str) → None

Serializes an instance of UniversalDeepTransformer (UDT) into a file on disk. The serialized UDT includes its current temporal context. The save method saves only the model parameters, whereas the checkpoint method also saves additional information, such as the optimizer state, for use if training is resumed.

Parameters: filename (str) – The file on disk to serialize this instance of UDT into.

Example

>>> model.save("udt_savefile.bolt")
>>> model.checkpoint("udt_savefile.bolt")

UniversalDeepTransformer.checkpoint(filename: str) → None

Serializes an instance of UniversalDeepTransformer (UDT) into a file on disk. The serialized UDT includes its current temporal context. The save method saves only the model parameters, whereas the checkpoint method also saves additional information, such as the optimizer state, for use if training is resumed.

Parameters: filename (str) – The file on disk to serialize this instance of UDT into.

Example

>>> model.save("udt_savefile.bolt")
>>> model.checkpoint("udt_savefile.bolt")

static UniversalDeepTransformer.load(filename: str) → bolt.UniversalDeepTransformer

Loads a serialized instance of a UniversalDeepTransformer (UDT) model from a file on disk.

Parameters: filename (str) – The file on disk from which to load an instance of UDT.
Returns: The loaded instance of UDT.
Return type: UniversalDeepTransformer

Example

>>> model = bolt.UniversalDeepTransformer(...)
>>> model = bolt.UniversalDeepTransformer.load("udt_savefile.bolt")

UniversalDeepTransformer.embedding_representation(input_sample: List[Dict[str, str]]) → object

Performs inference on a single sample and returns the penultimate layer of UniversalDeepTransformer (UDT) so that it can be used as an embedding representation for downstream applications.

Parameters: input_sample (Dict[str, str]) – The input sample as a dictionary where the keys are column names as specified in data_types and the values are the respective column values.
Returns: A numpy array of the penultimate layer’s activations.
Return type: np.ndarray

Examples

>>> # Suppose we configure UDT as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "special_event": bolt.types.categorical(),
            "movie_title": bolt.types.categorical(n_classes=500)
        },
        temporal_tracking_relationships={
            "user_id": ["movie_title"]
        },
        target="movie_title",
    )
>>> # Get an embedding representation
>>> embedding = model.embedding_representation(
        input_sample={"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas"}
    )

Notes

The values of columns that are tracked temporally may be unknown during inference (the column_known_during_inference attribute of the bolt.temporal objects is False by default). These columns do not need to be passed into model.embedding_representation(). For example, we did not pass the “movie_title” column to model.embedding_representation(). All other columns must be passed in.
If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. Thus, UDT is at its best when its internal temporal context gets updated with new true samples. model.embedding_representation() does not update UDT’s temporal context. To do this without retraining the model, we need to use model.index() or model.index_batch(). Read about model.index() and model.index_batch() for details.

UniversalDeepTransformer.get_entity_embedding(label_id: int | str) → object

Returns an embedding representation for a given output entity, an entity being the name of a class predicted as output.

Parameters:

label_id (Union[int, str]) – The name of the entity to get an embedding for. If type='int' for the target, this function should take in an integer from 0 to n_classes - 1 instead of a string.

Returns:

A 1D numpy array of floats representing a dense embedding of that entity.

UniversalDeepTransformer.class_name(arg0: int) → str

Returns the target class name associated with an output neuron ID.

Parameters: neuron_id (int) – The index of the neuron in UDT’s output layer. This is useful for mapping the activations returned by evaluate() and predict() back to class names.
Returns: The class name that corresponds to the given neuron_id.
Return type: str

Example

>>> import numpy as np
>>> activations = model.predict(
        input_sample={"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas"}
    )
>>> top_recommendation = np.argmax(activations)
>>> model.class_name(top_recommendation)
"Die Hard"

UniversalDeepTransformer.index(input_sample: Dict[str, str]) → None

Indexes a single true sample to keep UniversalDeepTransformer’s (UDT) temporal context up to date.

If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. Thus, UDT is at its best when its internal temporal context gets updated with new true samples. model.index() does exactly this.

Parameters: input_sample (Dict[str, str]) – The input sample as a dictionary where the keys are column names as specified in data_types and the values are the respective column values.

Example

>>> # Suppose we configure UDT to do movie recommendation as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "special_event": bolt.types.categorical(),
            "movie_title": bolt.types.categorical(n_classes=500)
        },
        temporal_tracking_relationships={
            "user_id": ["movie_title"]
        },
        target="movie_title",
    )
>>> # We then deploy the model for inference. Inference is performed by calling model.predict()
>>> activations = model.predict(
        input_sample={"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas"}
    )
>>> # Suppose we later learn that user "A33225" ends up watching "Die Hard 3".
>>> # We can call model.index() to keep UDT's temporal context up to date.
>>> model.index(
        input_sample={"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas", "movie_title": "Die Hard 3"}
    )

UniversalDeepTransformer.index_batch(input_samples: List[Dict[str, str]]) → None

Indexes a batch of true samples to keep UniversalDeepTransformer’s (UDT) temporal context up to date.

If temporal tracking relationships are provided, UDT can make better predictions by taking temporal context into account. For example, UDT may keep track of the last few movies that a user has watched to better recommend the next movie. Thus, UDT is at its best when its internal temporal context gets updated with new true samples. model.index_batch() does exactly this with a batch of samples.

Parameters: input_samples (List[Dict[str, str]]) – A list of input samples as dictionaries where the keys are column names as specified in data_types and the values are the respective column values.

Example

>>> # Suppose we configure UDT to do movie recommendation as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "special_event": bolt.types.categorical(),
            "movie_title": bolt.types.categorical(n_classes=500)
        },
        temporal_tracking_relationships={
            "user_id": ["movie_title"]
        },
        target="movie_title",
    )
>>> # We then deploy the model for inference. Inference is performed by calling model.predict()
>>> activations = model.predict(
        input_sample={"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas"}
    )
>>> # Suppose we later learn what users actually watched.
>>> # We can call model.index_batch() to keep UDT's temporal context up to date.
>>> model.index_batch(
        input_samples=[
            {"user_id": "A33225", "timestamp": "2022-12-25", "special_event": "christmas", "movie_title": "Die Hard 3"},
            {"user_id": "A39574", "timestamp": "2022-12-25", "special_event": "christmas", "movie_title": "Home Alone"},
            {"user_id": "A39574", "timestamp": "2022-12-26", "special_event": "christmas", "movie_title": "Home Alone 2"},
        ]
    )

UniversalDeepTransformer.reset_temporal_trackers() → None

Resets UniversalDeepTransformer’s (UDT) temporal context. When temporal relationships are supplied, UDT assumes that we feed it data in chronological order. Thus, if we break this assumption, we need to first reset the temporal trackers. An example of when you would use this is when you want to repeat the UDT training routine on the same dataset. Since you would be training on data from the same time period as before, we need to first reset the temporal trackers so that we don’t double count events.

Parameters: None
Returns: None

Example

>>> model.reset_temporal_trackers()
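
The double counting mentioned above can be illustrated with a toy tracker. This is a deliberately simplified model and not a description of UDT internals: ToyTemporalTracker is a hypothetical class that only remembers the last few items per user, which is enough to show why replaying the same data without a reset distorts the history.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class ToyTemporalTracker:
    """Remembers the last `history_length` items seen for each user."""

    def __init__(self, history_length: int = 3) -> None:
        self._history: Dict[str, Deque[str]] = defaultdict(
            lambda: deque(maxlen=history_length)
        )

    def index(self, user_id: str, item: str) -> None:
        self._history[user_id].append(item)

    def context(self, user_id: str) -> List[str]:
        return list(self._history[user_id])

    def reset(self) -> None:
        self._history.clear()


events = [("A33225", "Die Hard"), ("A33225", "Die Hard 2")]

tracker = ToyTemporalTracker()
for _ in range(2):  # replay the same period twice without resetting
    for user_id, movie in events:
        tracker.index(user_id, movie)
print(tracker.context("A33225"))  # ['Die Hard 2', 'Die Hard', 'Die Hard 2']

tracker.reset()  # analogous in spirit to reset_temporal_trackers()
for user_id, movie in events:
    tracker.index(user_id, movie)
print(tracker.context("A33225"))  # ['Die Hard', 'Die Hard 2']
```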

UniversalDeepTransformer.index_nodes(data_source: dataset.DataSource) → None

Updates the graph that the UDT model is performing graph node classification on. The file should have the same node id, neighbors, and features columns as the model is configured to accept.

Parameters: filename (str) – The filename to load the graph from.
Returns: None

UniversalDeepTransformer.clear_graph() → None

Clears all graph info that is being tracked by the model.

Returns: None

class thirdai.bolt.Validation

Bases: pybind11_object

__init__(filename: str, metrics: List[str], interval: int | None = None, use_sparse_inference: bool = False) → None

Creates a validation object that stores the necessary information for the model to perform validation during training.

Parameters:

filename (str) – The name of the validation file.
metrics (List[str]) – The metrics to compute for validation.
interval (Optional[int]) – The interval, in number of batches, between computing validation. For instance, interval=10 means that validation metrics will be computed every 10 batches. If it is not specified then validation will be done after each epoch.
use_sparse_inference (bool) – Optional, defaults to False. When True, sparse inference will be used during validation.

Examples

>>> validation = bolt.Validation(
        filename="validation.csv", metrics=["categorical_accuracy"], interval=10
    )
>>> model.train("train.csv", epochs=5, validation=validation)
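
To see how the interval parameter affects the validation schedule, the sketch below lists the batch counts at which validation would run. It assumes that batches are counted continuously across epochs, which the documentation does not state; validation_points is a hypothetical helper, not a thirdai function.

```python
from typing import List, Optional


def validation_points(
    batches_per_epoch: int, epochs: int, interval: Optional[int] = None
) -> List[int]:
    """Return the cumulative batch counts after which validation runs.

    Assumption: the batch counter is not reset between epochs.
    """
    total_batches = batches_per_epoch * epochs
    # Without an interval, validation happens at the end of each epoch.
    step = interval if interval is not None else batches_per_epoch
    return list(range(step, total_batches + 1, step))


print(validation_points(batches_per_epoch=25, epochs=2, interval=10))  # [10, 20, 30, 40, 50]
print(validation_points(batches_per_epoch=25, epochs=2))               # [25, 50]
```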

property filename

property metrics

property sparse_validation

property steps_per_validation

UDT Input Column Types

class thirdai.bolt.types.ColumnType

Bases: pybind11_object

Base class for bolt types.

class thirdai.bolt.types.categorical

Bases: ColumnType

__init__(n_classes: int | None = None, type: str = 'str', delimiter: str | None = None, metadata: bolt.types.metadata = None) → None

Categorical column type. Use this object if a column contains categorical data (each unique value is treated as a class). Examples include user IDs, movie titles, or age groups.

Parameters:

n_classes (Optional[int]) – Optional, defaults to None. If the data type is for the target column in a classification task then this must be specified to indicate how many classes are present in the dataset.
type (str) – Optional, defaults to “str”. How the categories are represented in the dataset, i.e. are they stored as strings or integers. This is primarily relevant for target columns since string categories will be mapped to integers to correspond to output neurons, however if the categories are already represented as integers then they can be used for labels directly.
delimiter (str) – Optional. Defaults to None. A single character (length-1 string) that separates multiple values in the same column. This can only be used for the target column. If not provided, UDT assumes that there is only one value in the column.
metadata (metadata) – Optional. A metadata object to be used when there is a separate metadata file corresponding to this categorical column.

Example

>>> bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(
                delimiter=' ',
                metadata=bolt.types.metadata(filename="user_meta.csv", data_types={"age": bolt.types.numerical()}, key_column_name="user_id")
            )
        },
        ...
    )

property delimiter

class thirdai.bolt.types.date

Bases: ColumnType

__init__() → None

Date column type. Use this object if a column contains date strings. Date strings must be in YYYY-MM-DD format.

Example

>>> bolt.UniversalDeepTransformer(
        data_types={
            "timestamp": bolt.types.date()
        },
        ...
    )
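
Because date strings must be in YYYY-MM-DD format, it can help to check a column before training. The helper below is a plain-Python sketch, not part of thirdai; is_valid_date is a hypothetical name, and it checks both the zero-padded shape and that the date exists on the calendar.

```python
import re
from datetime import datetime

_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_date(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    if not _DATE_PATTERN.fullmatch(value):
        return False  # wrong shape, e.g. missing zero padding or other separators
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False  # right shape but not a real date, e.g. month 13
    return True


print(is_valid_date("2022-12-25"))  # True
print(is_valid_date("12/25/2022"))  # False
print(is_valid_date("2022-13-01"))  # False
```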

class thirdai.bolt.types.metadata

Bases: pybind11_object

__init__(filename: str, key_column_name: str, data_types: Dict[str, bolt.types.ColumnType], delimiter: str = ',') → None

A configuration object for processing a metadata file to enrich categorical features from the main dataset. To illustrate when this is useful, suppose we are building a movie recommendation system. The contents of the training dataset may look something like the following:

    user_id,movie_id,timestamp
    A526,B894,2022-01-01
    A339,B801,2022-01-01
    A293,B801,2022-01-01
    …

If you have additional information about users or movies, such as users’ age groups or movie genres, you can use that information to enrich your model. Adding these features into the main dataset as new columns is wasteful because the same user and movie ids will be repeated many times throughout the dataset. Instead, we can put them all in a metadata file and UDT will inject these features where appropriate.

Parameters:

filename (str) – Path to metadata file. The file should be in CSV format.
key_column_name (str) – The name of the column whose values are used as keys to map metadata features back to values in the main dataset. This column does not need to be passed into the data_types argument.
data_types (Dict[str, bolt.types.ColumnType]) – A mapping from column name to column type. Column type is one of: bolt.types.categorical, bolt.types.numerical, bolt.types.text, bolt.types.date.
delimiter (str) – Optional. Defaults to ‘,’. A single character (length-1 string) that separates the columns of the metadata file.

Example

>>> for line in open("user_meta.csv"):
...     print(line, end="")
user_id,age
A526,52
A531,22
A339,29
...
>>> bolt.UniversalDeepTransformer(
        data_types={
            "user_id": bolt.types.categorical(
                delimiter=' ',
                metadata=bolt.types.metadata(
                    filename="user_meta.csv",
                    data_types={"age": bolt.types.numerical(range=(0, 120))},
                    key_column_name="user_id"
                )
            )
        },
        ...
    )
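
Conceptually, the metadata file acts as a lookup table keyed by key_column_name. The sketch below reproduces that idea with the standard csv module only; load_metadata and enrich are hypothetical helpers, and leaving rows with an unknown key unenriched is an assumption of this sketch, not documented UDT behavior.

```python
import csv
import io


def load_metadata(text, key_column_name, delimiter=","):
    """Build a {key: {feature: value}} table from CSV text."""
    table = {}
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        key = row.pop(key_column_name)  # the key column is not a feature itself
        table[key] = row
    return table


def enrich(rows, column, metadata):
    """Attach the metadata features of row[column] to each row."""
    return [{**row, **metadata.get(row[column], {})} for row in rows]


user_meta = "user_id,age\nA526,52\nA531,22\nA339,29\n"
main = (
    "user_id,movie_id,timestamp\n"
    "A526,B894,2022-01-01\n"
    "A339,B801,2022-01-01\n"
    "A293,B801,2022-01-01\n"
)

metadata = load_metadata(user_meta, key_column_name="user_id")
rows = list(csv.DictReader(io.StringIO(main)))
enriched = enrich(rows, "user_id", metadata)
for row in enriched:
    print(row)
```

Each age is stored once per user in the metadata table rather than once per interaction in the main dataset.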

class thirdai.bolt.types.neighbors

Bases: ColumnType

__init__() → None

class thirdai.bolt.types.node_id

Bases: ColumnType

__init__() → None

class thirdai.bolt.types.numerical

Bases: ColumnType

__init__(range: Tuple[float, float], granularity: str = 'm', explicit_granularity: int | None = None) → None

Numerical column type. Use this object if a column contains numerical data (the value is treated as a quantity). Examples include hours of a movie watched, sale quantity, or population size.

Parameters:

range (tuple(float, float)) – The expected range (min to max) of the numeric quantity. The more closely this range matches the test data, the better the model performs.
granularity (str) – Optional. One of “extrasmall”/“xs”, “small”/“s”, “medium”/“m”, “large”/“l” or “extralarge”/“xl”. Defaults to “m”.

Example

>>> bolt.UniversalDeepTransformer(
        data_types={
            "hours_watched": bolt.types.numerical(range=(0, 25), granularity="xs")
        },
        ...
    )
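
One way to see why an accurate range matters is to assume, purely for illustration, that a numerical value is mapped to one of a fixed number of equal-width bins over the range. This is not taken from the library: to_bin and the bin count of 10 are hypothetical, and how granularity maps to a number of bins is not specified here.

```python
from typing import Tuple


def to_bin(value: float, value_range: Tuple[float, float], n_bins: int) -> int:
    """Map a value to an equal-width bin index in [0, n_bins - 1] (illustration only)."""
    low, high = value_range
    if not low < high:
        raise ValueError("range must satisfy min < max")
    # Values outside the range are clamped, so they all share an edge bin.
    clamped = min(max(value, low), high)
    fraction = (clamped - low) / (high - low)
    return min(int(fraction * n_bins), n_bins - 1)


for hours in (0, 12.5, 25, 100):
    print(hours, to_bin(hours, (0, 25), n_bins=10))
```

Under this assumption, every value above the declared maximum lands in the same bin as the maximum, so a range that is too narrow discards information.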

class thirdai.bolt.types.sequence

Bases: ColumnType

__init__(n_classes: int | None = None, delimiter: str = ' ', max_length: int | None = None) → None

Sequence column type. Use this object if a column contains an ordered sequence of strings delimited by a character. The delimiter must be different from the delimiter between columns.

When the target column is a sequence type, UDT performs inference recursively.

Parameters:

delimiter (str) – Optional. The sequence delimiter. Defaults to “ ” (a single space).
max_length (int) – Required if the column is the target. The maximum length of the sequence. If UDT sees longer sequences, elements beyond the provided upper bound will be ignored.

Example

>>> bolt.UniversalDeepTransformer(
        data_types={
            "input_sequence": bolt.types.sequence(delimiter='\t'),
            "output_sequence": bolt.types.sequence(n_classes=26, max_length=30) # max_length must be provided for target sequence.
        },
        target="output_sequence",
        ...
    )
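
The two behaviors described above, truncation at max_length and recursive inference, can be sketched without thirdai. parse_sequence, predict_sequence, and next_letter are hypothetical; in particular, feeding each prediction back in as context for the next one is this sketch's reading of "recursively", not a statement of how UDT is implemented.

```python
from typing import Callable, List, Optional


def parse_sequence(cell: str, delimiter: str = " ",
                   max_length: Optional[int] = None) -> List[str]:
    """Split a cell into ordered elements, ignoring any beyond max_length."""
    elements = cell.split(delimiter)
    return elements if max_length is None else elements[:max_length]


def predict_sequence(predict_next: Callable[[str, List[str]], str],
                     inputs: str, max_length: int) -> List[str]:
    """Build an output sequence one element at a time."""
    generated: List[str] = []
    for _ in range(max_length):
        # Each step sees the inputs plus everything predicted so far.
        generated.append(predict_next(inputs, generated))
    return generated


def next_letter(inputs: str, generated: List[str]) -> str:
    """Toy stand-in for a model: predicts the letter after the previous one."""
    previous = generated[-1] if generated else inputs[-1]
    return chr(ord(previous) + 1)


print(parse_sequence("a b c d", max_length=3))
print(predict_sequence(next_letter, "a", max_length=3))
```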

class thirdai.bolt.types.text

Bases: ColumnType

__init__(*args, **kwargs)

Overloaded function.

__init__(self: thirdai._thirdai.bolt.types.text, tokenizer: str = 'words', contextual_encoding: str = 'none', lowercase: bool = True) -> None

Text column type. Use this object if a column contains text data (the meaning of the text matters). Examples include descriptions, search queries, and user bios.

Parameters:

tokenizer (str) – Optional. Either “words”, “words-punct” or “char-k” (k is a number, e.g. “char-5”). Defaults to “words”.
contextual_encoding (str) – Optional. Either “local”, “global”, “ngram-N”, or “none”, defaults to “none”.

Example

>>> bolt.UniversalDeepTransformer(
        data_types={
            "user_motto": bolt.types.text(),
            "user_bio": bolt.types.text(contextual_encoding="local")
        },
        ...
    )
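
To give a rough feel for the tokenizer options, the sketch below approximates “words” as whitespace splitting and “char-k” as overlapping k-character substrings. Both interpretations are assumptions made for illustration, tokenize is a hypothetical helper, and the library's actual tokenization may differ.

```python
from typing import List


def tokenize(text: str, tokenizer: str = "words", lowercase: bool = True) -> List[str]:
    """Approximate the "words" and "char-k" tokenizer options (illustration only)."""
    if lowercase:
        text = text.lower()
    if tokenizer == "words":
        return text.split()
    if tokenizer.startswith("char-"):
        k = int(tokenizer[len("char-"):])
        if k <= 0:
            raise ValueError("k must be positive")
        if len(text) <= k:
            return [text] if text else []
        # Slide a window of k characters across the text.
        return [text[i:i + k] for i in range(len(text) - k + 1)]
    raise ValueError(f"unsupported tokenizer in this sketch: {tokenizer!r}")


print(tokenize("Hello World"))
print(tokenize("abcd", tokenizer="char-3"))
```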

__init__(self: thirdai._thirdai.bolt.types.text, tokenizer: thirdai._thirdai.dataset.WordpieceTokenizer, contextual_encoding: str = 'none') -> None

Text column type. Use this object if a column contains text data (the meaning of the text matters). Examples include descriptions, search queries, and user bios.

Parameters:

tokenizer (dataset.WordpieceTokenizer) – Required. A WordpieceTokenizer instance used to tokenize the text, in place of a named tokenizer.
contextual_encoding (str) – Optional. Either “local”, “global”, “ngram-N”, or “none”, defaults to “none”.

Example

>>> bolt.UniversalDeepTransformer(
        data_types={
            "user_motto": bolt.types.text(),
            "user_bio": bolt.types.text(contextual_encoding="local")
        },
        ...
    )

class thirdai.bolt.types.token_tags

Bases: ColumnType

__init__(tags: List[str | data.transformations.NERLearnedTag], default_tag: str) → None

UDT Temporal Options

class thirdai.bolt.temporal.TemporalConfig

Bases: pybind11_object

Base class for temporal feature configs.

thirdai.bolt.temporal.categorical(column_name: str, track_last_n: int, column_known_during_inference: bool = False, use_metadata: bool = False) → bolt.temporal.TemporalConfig

Temporal categorical config. Use this object to configure how a categorical column is tracked over time.

Parameters:

column_name (str) – The name of the tracked column.
track_last_n (int) – Number of last categorical values to track per tracking id.
column_known_during_inference (bool) – Optional. Whether the value of the tracked column is known during inference. Defaults to False.
use_metadata (bool) – Optional. Whether to use the metadata of the N tracked items, if metadata is provided in the corresponding categorical column type object. Ignored if no metadata is provided. Defaults to False.

Example

>>> # Suppose each row of our data has the following columns: "product_id", "timestamp", "ad_spend_level", "sales_performance"
>>> # We want to predict the current week's sales performance for each product using temporal context.
>>> # For each product ID, we would like to track both their ad spend level and sales performance over time.
>>> # Ad spend level is known at the time of inference but sales performance is not. Then we can configure UDT as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "product_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "ad_spend_level": bolt.types.categorical(),
            "sales_performance": bolt.types.categorical(),
        },
        temporal_tracking_relationships={
            "product_id": [
                bolt.temporal.categorical(column_name="ad_spend_level", track_last_n=5, column_known_during_inference=True),
                bolt.temporal.categorical(column_name="ad_spend_level", track_last_n=25, column_known_during_inference=True),
                bolt.temporal.categorical(column_name="sales_performance", track_last_n=5), # column_known_during_inference defaults to False
            ]
        },
        ...
    )

Notes

Temporal categorical features are tracked as a set; if we track the last 5 ad spend levels,
we capture what the last 5 ad spend levels are, but we do not capture their order.
The same column can be tracked more than once, allowing us to capture both short-
and long-term trends.
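
The set semantics described in these notes can be sketched with the standard library. LastNTracker is a hypothetical class written for illustration and is not part of thirdai; it keeps the last N values per tracking id and exposes them as an unordered set.

```python
from collections import defaultdict, deque


class LastNTracker:
    """Keep the last N categorical values per tracking id, exposed as a set."""

    def __init__(self, track_last_n: int):
        # A bounded deque drops the oldest value once N values are stored.
        self._history = defaultdict(lambda: deque(maxlen=track_last_n))

    def update(self, tracking_id: str, value: str) -> None:
        self._history[tracking_id].append(value)

    def features(self, tracking_id: str) -> frozenset:
        # Converting to a set discards order and duplicates.
        return frozenset(self._history.get(tracking_id, ()))


short_term = LastNTracker(track_last_n=2)
long_term = LastNTracker(track_last_n=5)
for level in ["low", "high", "high", "medium"]:
    short_term.update("P1", level)
    long_term.update("P1", level)

print(sorted(short_term.features("P1")))
print(sorted(long_term.features("P1")))
```

Tracking the same column with two window sizes gives one short-term and one long-term view of the same history.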

thirdai.bolt.temporal.numerical(column_name: str, history_length: int, column_known_during_inference: bool = False) → bolt.temporal.TemporalConfig

Temporal numerical config. Use this object to configure how a numerical column is tracked over time.

Parameters:

column_name (str) – The name of the tracked column.
history_length (int) – Amount of time to look back. Time is in terms of the time granularity passed to the UDT constructor.
column_known_during_inference (bool) – Optional. Whether the value of the tracked column is known during inference. Defaults to False.

Example

>>> # Suppose each row of our data has the following columns: "product_id", "timestamp", "ad_spend", "sales_performance"
>>> # We want to predict the current week's sales performance for each product using temporal context.
>>> # For each product ID, we would like to track both their ad spend and sales performance over time.
>>> # Ad spend is known at the time of inference but sales performance is not. Then we can configure UDT as follows:
>>> model = bolt.UniversalDeepTransformer(
        data_types={
            "product_id": bolt.types.categorical(),
            "timestamp": bolt.types.date(),
            "ad_spend": bolt.types.numerical(range=(0, 10000)),
            "sales_performance": bolt.types.categorical(),
        },
        target="sales_performance",
        time_granularity="weekly",
        temporal_tracking_relationships={
            "product_id": [
                # Track last 5 weeks of ad spend
                bolt.temporal.numerical(column_name="ad_spend", history_length=5, column_known_during_inference=True),
                # Track last 10 weeks of ad spend
                bolt.temporal.numerical(column_name="ad_spend", history_length=10, column_known_during_inference=True),
                # Track last 5 weeks of sales quantity
                bolt.temporal.numerical(column_name="sales_quantity", history_length=5), # column_known_during_inference defaults to False
            ]
        },
    )

Notes

The same column can be tracked more than once, allowing us to capture both short-
and long-term trends.
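
As a rough illustration of history_length under a weekly time granularity, the sketch below collects per-week totals for the weeks preceding the current timestamp. weekly_history is hypothetical, and the choices it makes are assumptions of this sketch rather than documented UDT behavior: weeks are fixed 7-day buckets, values within a week are summed, and the current week is excluded.

```python
from datetime import date, timedelta
from typing import Iterable, List, Tuple


def week_index(day: date) -> int:
    """Assign each date to a fixed 7-day bucket."""
    return day.toordinal() // 7


def weekly_history(records: Iterable[Tuple[str, date, float]], tracking_id: str,
                   current: date, history_length: int) -> List[float]:
    """Totals for the history_length weeks before current, oldest first."""
    current_week = week_index(current)
    totals = [0.0] * history_length
    for record_id, day, value in records:
        offset = current_week - week_index(day)  # 1 means the previous week
        if record_id == tracking_id and 1 <= offset <= history_length:
            totals[history_length - offset] += value
    return totals


today = date(2022, 1, 31)
records = [
    ("P1", today - timedelta(weeks=1), 100.0),
    ("P1", today - timedelta(weeks=1), 50.0),
    ("P1", today - timedelta(weeks=3), 10.0),
    ("P1", today - timedelta(weeks=6), 999.0),  # older than the window
    ("P2", today - timedelta(weeks=1), 7.0),    # different tracking id
    ("P1", today, 1.0),                         # current week, excluded
]
print(weekly_history(records, "P1", today, history_length=5))
print(weekly_history(records, "P1", today, history_length=2))
```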