thirdai.neural_db

class thirdai.neural_db.NeuralDB

Bases: object

NeuralDB is a search and retrieval system that can be used to search over knowledge bases and documents. It can also be used in RAG pipelines for the search retrieval phase.

Examples

>>> ndb = NeuralDB()
>>> ndb.insert([CSV(...), PDF(...), DOCX(...)])
>>> results = ndb.search("how to make chocolate chip cookies")
__init__(user_id: str = 'user', num_shards: int = 1, num_models_per_shard: int = 1, retriever='finetunable_retriever', low_memory=None, **kwargs) None

Constructs an empty NeuralDB.

Parameters:
  • user_id (str) – Optional, used to identify user/session in logging.

  • retriever (str) – One of ‘finetunable_retriever’, ‘mach’, or ‘hybrid’. Identifies which retriever to use as the backend. Defaults to ‘finetunable_retriever’.

Returns:

A NeuralDB.
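
Examples

A minimal construction sketch; the non-default retriever shown is purely illustrative.

>>> ndb = NeuralDB()
>>> ndb = NeuralDB(retriever="hybrid")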

associate(source: str, target: str, strength: Strength = Strength.Strong, **kwargs)

Teaches the underlying model in the NeuralDB that two different texts correspond to similar concepts or queries.

Parameters:
  • source (str) – The source is the new text you want to teach the model about.

  • target (str) – The target is the known text that is provided to the model as an example of the type of information or query the source resembles.

Examples

>>> ndb.associate("asap", "as soon as possible")
>>> ndb.associate("what is a 401k", "explain different types of retirement savings")
associate_batch(text_pairs: List[Tuple[str, str]], strength: Strength = Strength.Strong, **kwargs)

Same as associate, but the process is applied to a batch of (source, target) pairs at once.
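
Examples

A batch version of the associate() examples above; the pairs are illustrative.

>>> ndb.associate_batch([
...     ("asap", "as soon as possible"),
...     ("what is a 401k", "explain different types of retirement savings"),
... ])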

clear_sources() None

Removes all documents stored in the NeuralDB.

delete(source_ids: List[str])

Deletes documents from the NeuralDB.

static from_checkpoint(checkpoint_path: str, user_id: str = 'user', on_progress: ~typing.Callable = <function no_op>, **kwargs)

Constructs a NeuralDB from a checkpoint. This can be used to save and reload NeuralDBs; it is also used for loading pretrained NeuralDB models.

Parameters:
  • checkpoint_path (str) – The path to the checkpoint directory.

  • user_id (str) – Optional, used to identify user/session in logging.

  • on_progress (Callable) – Optional, a callback that is invoked periodically as checkpoint loading progresses.

Returns:

A NeuralDB.
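
Examples

A loading sketch; the checkpoint path shown is a placeholder.

>>> ndb = NeuralDB.from_checkpoint("my_db.ndb")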

static from_udt(udt: UniversalDeepTransformer, user_id: str = 'user', csv: str | None = None, csv_id_column: str | None = None, csv_strong_columns: List[str] | None = None, csv_weak_columns: List[str] | None = None, csv_reference_columns: List[str] | None = None)

Instantiates a NeuralDB using the given UDT as the underlying model. This is typically used to port a pretrained model into the NeuralDB format. Use the optional csv-related arguments to insert the pretraining dataset into the NeuralDB instance.

Parameters:
  • udt (bolt.UniversalDeepTransformer) – The udt model to use in the NeuralDB.

  • user_id (str) – Optional, used to identify user/session in logging.

  • csv (Optional[str]) – Optional, default None. The path to the CSV file used to train the udt model. If supplied, the CSV file will be inserted into NeuralDB.

  • csv_id_column (Optional[str]) – Optional, default None. The id column of the training dataset. Required only if the data is being inserted via the csv arg.

  • csv_strong_columns (Optional[List[str]]) – Optional, default None. The strong signal columns from the training data. Required only if the data is being inserted via the csv arg.

  • csv_weak_columns (Optional[List[str]]) – Optional, default None. The weak signal columns from the training data. Required only if the data is being inserted via the csv arg.

  • csv_reference_columns (Optional[List[str]]) – Optional, default None. The columns whose data should be returned as search results to queries. Required only if the data is being inserted via the csv arg.

Returns:

A NeuralDB.
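
Examples

A porting sketch; the saved model path, CSV path, and column names are placeholder assumptions, and the UDT is assumed to have been saved with UniversalDeepTransformer's save method.

>>> from thirdai import bolt
>>> udt = bolt.UniversalDeepTransformer.load("pretrained_model.bolt")
>>> ndb = NeuralDB.from_udt(
...     udt,
...     csv="catalog.csv",
...     csv_id_column="id",
...     csv_strong_columns=["title"],
...     csv_weak_columns=["description"],
...     csv_reference_columns=["title", "description"],
... )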

get_associate_samples()

Get past associate() and associate_batch() samples from NeuralDB logs.

get_rlhf_samples()

Get past associate(), associate_batch(), text_to_result(), and text_to_result_batch() samples from NeuralDB logs.

get_upvote_samples()

Get past text_to_result() and text_to_result_batch() samples from NeuralDB logs.

insert(sources: ~typing.List[~thirdai.neural_db.documents.Document], train: bool = True, fast_approximation: bool = True, num_buckets_to_sample: int | None = None, on_progress: ~typing.Callable = <function no_op>, on_success: ~typing.Callable = <function no_op>, on_error: ~typing.Callable | None = None, cancel_state: ~thirdai.neural_db.models.model_interface.CancelState | None = None, max_in_memory_batches: int | None = None, variable_length: ~data.transformations.VariableLengthConfig | None = <data.transformations.VariableLengthConfig object>, checkpoint_config: ~thirdai.neural_db.trainer.checkpoint_config.CheckpointConfig | None = None, callbacks: ~typing.List[~bolt.train.callbacks.Callback] | None = None, **kwargs) List[str]

Inserts documents/resources into the database.

Parameters:
  • sources (List[Doc]) – List of NeuralDB documents to be inserted.

  • train (bool) – Optional, defaults to True. When True, the underlying model in the NeuralDB undergoes unsupervised pretraining on the inserted documents.

  • fast_approximation (bool) – Optional, default True. Much faster insertion with a slight drop in performance.

  • num_buckets_to_sample (Optional[int]) – Used to control load balancing when inserting entities into the NeuralDB.

  • on_progress (Callable) – Optional, a callback that is called at intervals as documents are inserted.

  • on_success (Callable) – Optional, a callback that is invoked when document insertion is finished successfully.

  • on_error (Callable) – Optional, a callback that is invoked if an error occurs during insertion.

  • cancel_state (CancelState) – An object that can be used to stop an ongoing insertion. Primarily used for PocketLLM.

  • max_in_memory_batches (int) – Optional, default None. When supplied this limits the maximum amount of data that is loaded into memory at once during training. Useful for lower memory paradigms or with large datasets.

  • checkpoint_config (CheckpointConfig) – Optional, default None. Configuration for checkpointing during insertion. No checkpoints are created if checkpoint_config is unspecified.

Returns:

A list of the ids assigned to the inserted documents.
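
Examples

An insertion sketch; the file paths are placeholders, and the argument-agnostic callbacks shown are an assumption about the callback signatures rather than a documented interface.

>>> doc_ids = ndb.insert(
...     sources=[PDF("manual.pdf"), DOCX("notes.docx")],
...     train=True,
...     on_progress=lambda *args: print("progress", *args),
...     on_success=lambda *args: print("insertion complete"),
... )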

pretrain_distributed(documents, scaling_config, run_config, learning_rate: float = 0.001, epochs: int = 5, batch_size: int | None = None, metrics: List[str] = [], max_in_memory_batches: int | None = None, communication_backend='gloo', log_folder=None)

Pretrains a model in a distributed manner using the provided documents.

Parameters:
  • documents – List of documents for pretraining. All the documents must have the same id column.

  • scaling_config – Configuration related to the scaling aspects for Ray trainer. Read https://docs.ray.io/en/latest/train/api/doc/ray.train.ScalingConfig.html

  • run_config – Configuration related to the runtime aspects for the Ray trainer. Read https://docs.ray.io/en/latest/train/api/doc/ray.train.RunConfig.html. Note: storage_path must be specified in RunConfig and must be a networked file system or cloud storage path accessible by all workers (Ray 2.7.0 onwards).

  • learning_rate (float, optional) – Learning rate for the optimizer. Default is 0.001.

  • epochs (int, optional) – Number of epochs to train. Default is 5.

  • batch_size (int, optional) – Size of each batch for training. If not provided, will be determined automatically.

  • metrics (List[str], optional) – List of metrics to evaluate during training. Default is an empty list.

  • max_in_memory_batches (Optional[int], optional) – Number of batches to load in memory at once. Useful for streaming support when dataset is too large to fit in memory. If None, all batches will be loaded.

  • communication_backend (str, optional) – Bolt distributed training uses a Torch communication backend. This refers to the backend used for inter-worker communication. Default is “gloo”.

Notes

  • Make sure to pass id_column to neural_db.CSV(), and ensure that the ids are in ascending order starting from 0.

  • The scaling_config, run_config, and resume_from_checkpoint arguments are related to the Ray trainer configuration. Read https://docs.ray.io/en/latest/ray-air/trainers.html#trainer-basics

  • Ensure that the communication backend specified is compatible with the hardware and network setup for MPI/Gloo backend.
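
Examples

A distributed pretraining sketch; the document, worker count, and storage path are placeholder assumptions. ScalingConfig and RunConfig are imported from ray.train.

>>> from ray.train import RunConfig, ScalingConfig
>>> docs = [CSV("catalog.csv", id_column="id", strong_columns=["title"], weak_columns=["description"])]
>>> ndb.pretrain_distributed(
...     documents=docs,
...     scaling_config=ScalingConfig(num_workers=2),
...     run_config=RunConfig(storage_path="s3://my-bucket/ndb-checkpoints"),
...     epochs=5,
... )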

ready_to_search() bool

Returns True if documents have been inserted and the model is prepared to serve queries, False otherwise.

reference(element_id: int)

Returns a reference containing the text and other information for a given entity id.

retrain(text_pairs: List[Tuple[str, str]] = [], learning_rate: float = 0.0001, epochs: int = 3, strength: Strength = Strength.Strong)

Train NeuralDB on all inserted documents and logged RLHF samples.

save(save_to: str | ~pathlib.Path, on_progress: ~typing.Callable = <function no_op>) str
search(query: str, top_k: int, constraints=None, rerank=False, reranker=None, top_k_rerank=100, rerank_threshold=1.5, top_k_threshold=None, retriever=None, label_probing=False, mach_first=False) List[Reference]

Searches the contents of the NeuralDB for documents relevant to the given query.

Parameters:
  • query (str) – The query to search with.

  • top_k (int) – The number of results to return.

  • constraints (Dict[str, Any]) – A dictionary containing constraints to apply to the metadata field of each document in the NeuralDB. This allows for queries that only return results with a certain property. The constraints are in the form {“metadata_key”: <constraint>} where <constraint> is either an explicit value for the key in the metadata, or a Filter object.

  • rerank (bool) – Optional, default False. When True an additional reranking step is applied to results.

  • top_k_rerank (int) – Optional, default 100. If rerank=True then this argument determines how many candidates are retrieved, before reranking and returning the top_k.

  • rerank_threshold (float) – Optional, default 1.5. In reranking all candidates with a score under a certain threshold are reranked. This threshold is computed as this argument (rerank_threshold) times the average score over the first top_k_threshold candidates. Candidates with scores lower than this threshold will be reranked. Thus, increasing this value causes more candidates to be reranked.

  • top_k_threshold (Optional[float]) – Optional, default None, which means the arg top_k will be used. If specified this argument controls how many of the top candidates’ scores are averaged to obtain the mean that is used to determine which candidates are reranked. For example passing rerank_threshold=2 and top_k_threshold=4 means that the scores of the top 4 elements are averaged, and all elements below 2x this average are reranked.

  • retriever (Optional[str]) – Optional, default None. This arg controls which retriever to use for search when a hybrid retrieval model is used. Passing None means that NeuralDB will automatically decide which retrievers (or combination of retrievers) to use.

Returns:

A list of Reference objects. Each reference object contains text data matching the query, along with information about which document contained that text.

Return type:

List[Reference]

Examples

>>> ndb.search("what is ...", top_k=5)
>>> ndb.search("what is ...", top_k=5, constraints={"file_type": "pdf", "file_created": GreaterThan(10)})
search_batch(queries: List[str], top_k: int, constraints=None, rerank=False, reranker=None, top_k_rerank=100, rerank_threshold=1.5, top_k_threshold=None, retriever=None, label_probing=False, mach_first=False)

Runs search on a batch of queries for much faster throughput.

Parameters:

queries (List[str]) – The queries to search.

Returns:

Combines each result of db.search into a list.

Return type:

List[List[Reference]]
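
Examples

A batch search sketch with placeholder queries.

>>> all_results = ndb.search_batch(["how to make cookies", "how to frost a cake"], top_k=5)
>>> first_query_results = all_results[0]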

sources() Dict[str, Document]

Returns a mapping from source IDs to their corresponding document objects. This is useful when you need to know the source ID of a document you inserted, e.g. for creating a Sup object for supervised_train().
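
Examples

A sketch of using the source ids returned by sources(), e.g. to delete a previously inserted document; which id corresponds to which document depends on what was inserted.

>>> source_ids = list(ndb.sources().keys())
>>> ndb.delete([source_ids[0]])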

supervised_train(data: List[Sup], learning_rate=0.0001, epochs=3, batch_size: int | None = None, max_in_memory_batches: int | None = None, metrics: List[str] = [], callbacks: List[Callback] = [], checkpoint_config: CheckpointConfig | None = None, **kwargs)

Train on supervised datasets that correspond to specific sources. Suppose you inserted a “sports” product catalog and a “furniture” product catalog. You also have supervised datasets - pairs of queries and correct products - for both categories. You can use this method to train NeuralDB on these supervised datasets.

Parameters:
  • data (List[Sup]) – Supervised training samples.

  • learning_rate (float) – Optional. The learning rate to use for training.

  • epochs (int) – Optional. The number of epochs to train for.

supervised_train_with_ref_ids(csv: str | None = None, query_column: str | None = None, id_column: str | None = None, id_delimiter: str | None = None, queries: Sequence[str] | None = None, labels: Sequence[Sequence[int]] | None = None, learning_rate=0.0001, epochs=3, batch_size: int | None = None, max_in_memory_batches: int | None = None, metrics: List[str] = [], callbacks: List[Callback] = [], checkpoint_config: CheckpointConfig | None = None, **kwargs)

Train on supervised datasets that correspond to specific sources. Suppose you inserted a “sports” product catalog and a “furniture” product catalog. You also have supervised datasets - pairs of queries and correct products - for both categories. You can use this method to train NeuralDB on these supervised datasets. This method must be invoked with either A) a csv file with the query and id columns within it, or B) an explicit list of queries and expected labels.
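
Examples

A sketch of invoking this method with explicit queries and labels (option B); the queries and reference ids are placeholders.

>>> ndb.supervised_train_with_ref_ids(
...     queries=["running shoes", "oak dining table"],
...     labels=[[0], [12]],
... )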

text_to_result(text: str, result_id: int, **kwargs) None

Trains NeuralDB to map the given text to the given entity ID. Also known as “upvoting”.

Example

>>> ndb.text_to_result("a new query", result_id=4)
text_to_result_batch(text_id_pairs: List[Tuple[str, int]], **kwargs) None

Trains NeuralDB to map the given texts to the given entity IDs. Also known as “batch upvoting”.
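
Example

A batch upvoting sketch; the queries and result ids are placeholders.

>>> ndb.text_to_result_batch([("a new query", 4), ("another query", 9)])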

class thirdai.neural_db.Reference

Bases: object

__init__(document: Document, element_id: int, text: str, source: str, metadata: dict, upvote_ids: List[int] | None = None, retriever: str | None = None)
context(radius: int)
property document
property id
property id_in_document
property metadata
property retriever
property score
property source
property text
property upvote_ids
class thirdai.neural_db.Sup

Bases: object

An object that contains supervised samples. This object is to be passed into NeuralDB.supervised_train().

It can be initialized either with a CSV file, in which case it needs query and ID column names, or with sequences of queries and labels. It also needs to know which source object (i.e. which inserted CSV or PDF object) contains the relevant entities to the supervised samples.

If uses_db_id is True, then the labels are assumed to use database-assigned IDs and will not be converted.

__init__(csv: str | None = None, query_column: str | None = None, id_column: str | None = None, id_delimiter: str | None = None, queries: Sequence[str] | None = None, labels: Sequence[Sequence[int]] | None = None, source_id: str = '', uses_db_id: bool = False)
property size

Document Types

NeuralDB currently supports the following types of documents and resources.

class thirdai.neural_db.CSV

Bases: Document

A document containing the rows of a csv file.

Parameters:
  • path (str) – The path to the csv file.

  • id_column (Optional[str]) – Optional, defaults to None. Ids in this column are used to identify the rows in NeuralDB. If not provided, ids are assigned automatically.

  • strong_columns (Optional[List[str]]) – Optional, defaults to None. This argument can be used to provide NeuralDB with information about which columns are likely to contain the strongest signal in matching with a given query. For example this could be something like the name of a product.

  • weak_columns (Optional[List[str]]) – Optional, defaults to None. This argument can be used to provide NeuralDB with information about which columns are likely to contain weaker signals in matching with a given query. For example this could be something like the description of a product.

  • reference_columns (Optional[List[str]]) – Optional, defaults to None. If provided, the specified columns are returned by NeuralDB as responses to queries. If not specified, all columns are returned.

  • metadata (Dict[str, Any]) – Optional, defaults to {}. Specifies metadata to associate with entities from this file. Queries to NeuralDB can provide constraints to restrict results based on the metadata.
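
Examples

A construction sketch; the file path and column names are placeholder assumptions.

>>> doc = CSV(
...     path="catalog.csv",
...     id_column="id",
...     strong_columns=["title"],
...     weak_columns=["description"],
...     reference_columns=["title", "description"],
...     metadata={"category": "sports"},
... )
>>> ndb.insert([doc])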

__init__(path: str, id_column: str | None = None, strong_columns: List[str] | None = None, weak_columns: List[str] | None = None, reference_columns: List[str] | None = None, save_extra_info=True, metadata=None, has_offset=False, on_disk=False, use_dask=False, blocksize=None) None
all_entity_ids() List[int]
context(element_id: int, radius) str
filter_entity_ids(filters: Dict[str, Filter])
property hash: str
id_map() Dict[str, int] | None
load_meta(directory: Path)
property matched_constraints
property name: str
reference(element_id: int) Reference
remove_spaces()
remove_spaces_from_list()
row_iterator()
save_meta(directory: Path)
property size: int
property source: str
strong_text(element_id: int) str
strong_text_from_row(row) str
valid_id_column()
weak_text(element_id: int) str
weak_text_from_row(row) str
class thirdai.neural_db.PDF

Bases: Extracted

Parses a PDF document into chunks of text that can be indexed by NeuralDB.

Parameters:
  • path (str) – path to PDF file

  • version (str) – Either “v1” or “v2”. If “v1”, the parser splits the PDF into paragraphs. If “v2”, the parser creates overlapping chunks comprised of entire lines from the PDF. “v2” does more data cleaning and therefore supports more options, which are outlined below.

  • chunk_size (int) – Only relevant if version = “v2”. The number of words in each chunk of text. Defaults to 100

  • stride (int) – Only relevant if version = “v2”. The number of words between each chunk of text. When stride < chunk_size, the text chunks overlap. When stride = chunk_size, the text chunks do not overlap. Defaults to 40 so adjacent chunks have a 60% overlap.

  • emphasize_first_words (int) – Only relevant if version = “v2”. The number of words at the beginning of the document to be passed into NeuralDB as a strong signal. For example, if your document starts with a descriptive title that is 3 words long, then you can set emphasize_first_words to 3 so that NeuralDB captures this strong signal. Defaults to 0.

  • ignore_header_footer (bool) – Only relevant if version = “v2”. Whether the parser should remove headers and footers. Defaults to True; headers and footers are removed by default.

  • ignore_nonstandard_orientation (bool) – Only relevant if version = “v2”. Whether the parser should remove lines of text that have a nonstandard orientation, such as margins that are oriented vertically. Defaults to True; lines with nonstandard orientation are removed by default.

  • metadata (Dict[str, Any]) – Optional, defaults to {}. Specifies metadata to associate with entities from this file. Queries to NeuralDB can provide constraints to restrict results based on the metadata.

  • on_disk (bool) – If True, the processed chunks will be stored in a lightweight on-disk database. Otherwise, processed chunks will be stored in-memory. Defaults to False.

  • doc_keywords (str) – Only relevant if version = “v2”. If provided, the keywords will be prepended to every chunk in the document. It is helpful for use cases where a NeuralDB instance contains multiple documents. Defaults to an empty string.

  • emphasize_section_titles (bool) – Only relevant if version = “v2”. If True, infers section titles based on font properties and prepends the latest section title to each chunk. Defaults to False.

  • table_parsing (bool) – Only relevant if version = “v2”. If True, the contents of a table are considered to be contained in a single line, ensuring that any chunk that contains a table contains the entire table. Defaults to False.

  • save_extra_info (bool) – If True, the original PDF file will be saved in .ndb checkpoint. Defaults to True.
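
Examples

A parsing sketch; the file path and option values are illustrative.

>>> doc = PDF("report.pdf", version="v2", chunk_size=100, stride=40, emphasize_first_words=3)
>>> ndb.insert([doc])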

__init__(path: str, version: str = 'v1', chunk_size=100, stride=40, emphasize_first_words=0, ignore_header_footer=True, ignore_nonstandard_orientation=True, metadata=None, on_disk=False, doc_keywords='', emphasize_section_titles=False, table_parsing=False, save_extra_info=True)
static highlighted_doc(reference: Reference)
process_data(path: str) DataFrame
class thirdai.neural_db.DOCX

Bases: Extracted

__init__(path: str, metadata=None, on_disk=False)
process_data(path: str) DataFrame
class thirdai.neural_db.URL

Bases: Document

A URL document takes the data found at the provided URL (or in the provided response) and creates entities that can be inserted into NeuralDB.

Parameters:
  • url (str) – The URL where the data is located.

  • url_response (Response) – Optional, defaults to None. If provided, the data in the response is used to create the entities; otherwise a GET request is sent to the URL.

  • title_is_strong (bool) – Optional, defaults to False. If True, the title is used as a strong signal for NeuralDB.

  • metadata (Dict[str, Any]) – Optional, defaults to {}. Specifies metadata to associate with entities from this file. Queries to NeuralDB can provide constraints to restrict results based on the metadata.
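
Examples

A construction sketch; the URL is a placeholder.

>>> doc = URL("https://example.com/knowledge-base/article-1", title_is_strong=True)
>>> ndb.insert([doc])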

__init__(url: str, url_response: Response | None = None, save_extra_info: bool = True, title_is_strong: bool = False, metadata=None, on_disk=False)
all_entity_ids() List[int]
context(element_id, radius) str
property hash: str
load_meta(directory: Path)
property matched_constraints: Dict[str, ConstraintValue]
property name: str
process_data(url, url_response=None) DataFrame
reference(element_id: int) Reference
save_meta(directory: Path)
property size: int
property source: str
strong_text(element_id: int) str
weak_text(element_id: int) str
class thirdai.neural_db.SQLDatabase

Bases: DocumentConnector

Class for handling SQL database connections and data retrieval for training the neural_db model.

This class encapsulates functionality for connecting to an SQL database, executing SQL queries, and retrieving data for use in training the model.

NOTE: The table is expected to remain static in terms of both rows and columns.
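
Examples

A connection sketch; the connection string, table name, and column names are placeholder assumptions. The engine is assumed to be a SQLAlchemy Connection, as in the __init__ type hint.

>>> from sqlalchemy import create_engine
>>> connection = create_engine("postgresql://user:password@host/dbname").connect()
>>> doc = SQLDatabase(
...     engine=connection,
...     table_name="products",
...     id_col="product_id",
...     strong_columns=["name"],
...     weak_columns=["description"],
...     reference_columns=["name", "description"],
... )
>>> ndb.insert([doc])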

__init__(engine: Connection, table_name: str, id_col: str, strong_columns: List[str] | None = None, weak_columns: List[str] | None = None, reference_columns: List[str] | None = None, chunk_size: int = 10000, save_extra_info: bool = False, metadata: dict = {}) None
all_entity_ids() List[int]
assert_valid_columns()
assert_valid_id()
chunk_iterator() DataFrame
get_engine()
get_strong_columns()
get_weak_columns()
property hash
property matched_constraints: Dict[str, ConstraintValue]

This method is called when the document is being added to a DocumentManager in order to build an index for constrained search.

property meta_table: DataFrame | None

Stores the mapping from id_in_document to the metadata of the document. It can be used to fetch a minimal document result if the connection is lost.

property name
reference(element_id: int) Reference
setup_connection(engine: Connection)

This is a helper function to re-establish the connection upon loading the saved ndb model containing this SQLDatabase document.

Parameters:

engine – SQLAlchemy Connection object. NOTE: Provide the same connection object that was originally used.

NOTE: The same table will be used to establish the connection.

property size: int
property source: str
strong_text_from_chunk(id_in_chunk: int, chunk: DataFrame) str
weak_text_from_chunk(id_in_chunk: int, chunk: DataFrame) str
class thirdai.neural_db.SalesForce

Bases: DocumentConnector

Class for handling Salesforce object connections and data retrieval for training the neural_db model.

This class encapsulates functionality for connecting to a Salesforce object, executing Salesforce Object Query Language (SOQL) queries, and retrieving data for use in training the model.

NOTE: Bulk API access must be allowed for the provided object. The object is also expected to remain static in terms of both rows and columns.
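
Examples

A connection sketch assuming the simple_salesforce client; the credentials, object name, and field names are placeholders.

>>> from simple_salesforce import Salesforce
>>> sf = Salesforce(username="user@example.com", password="password", security_token="token")
>>> doc = SalesForce(
...     instance=sf,
...     object_name="Product2",
...     id_col="ProductCode",
...     strong_columns=["Name"],
...     weak_columns=["Description"],
... )
>>> ndb.insert([doc])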

__init__(instance: Salesforce, object_name: str, id_col: str, strong_columns: List[str] | None = None, weak_columns: List[str] | None = None, reference_columns: List[str] | None = None, save_extra_info: bool = True, metadata: dict = {}) None
all_entity_ids() List[int]
assert_field_inclusion(all_fields: List[OrderedDict])
assert_field_type(all_fields: List[OrderedDict], supported_text_types: Tuple[str])
assert_valid_fields(supported_text_types: Tuple[str] = ('string', 'textarea'))
assert_valid_id()
chunk_iterator() DataFrame
default_fields(all_fields: List[OrderedDict], supported_text_types: Tuple[str])
get_strong_columns()
get_weak_columns()
property hash: str
property matched_constraints: Dict[str, ConstraintValue]

This method is called when the document is being added to a DocumentManager in order to build an index for constrained search.

property meta_table: DataFrame | None

Stores the mapping from id_in_document to the metadata of the document. It can be used to fetch a minimal document result if the connection is lost.

property name: str
reference(element_id: int) Reference
row_iterator()
setup_connection(instance: Salesforce)

This is a helper function to re-establish the connection upon loading a saved ndb model containing this SalesForce document.

Parameters:

instance – Salesforce instance. NOTE: Provide the same connection object that was originally used.

NOTE: The same object name will be used to establish the connection.

property size: int
property source: str
strong_text_from_chunk(id_in_chunk: int, chunk: DataFrame) str
weak_text_from_chunk(id_in_chunk: int, chunk: DataFrame) str
class thirdai.neural_db.SharePoint

Bases: DocumentConnector

Class for handling a SharePoint connection, retrieving documents, and processing them for training the neural_db model.

Parameters:
  • ctx (ClientContext) – A ClientContext object for SharePoint connection.

  • library_path (str) – The server-relative directory path where documents are stored. Default: ‘Shared Documents’

  • chunk_size (int) – The maximum amount of data (in bytes) that can be fetched at a time. (This limit may not apply if there are no files within this range.) Default: 10MB

  • metadata (Dict[str, Any]) – Optional, defaults to {}. Specifies metadata to associate with entities from this file. Queries to NeuralDB can provide constraints to restrict results based on the metadata.
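
Examples

A connection sketch; the site URL, credential keys, and library path are placeholder assumptions.

>>> ctx = SharePoint.setup_clientContext(
...     base_url="https://example.sharepoint.com/sites/mysite",
...     credentials={"client_id": "my-client-id", "client_secret": "my-client-secret"},
... )
>>> doc = SharePoint(ctx=ctx, library_path="Shared Documents")
>>> ndb.insert([doc])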

__init__(ctx: ClientContext, library_path: str = 'Shared Documents', chunk_size: int = 10485760, save_extra_info: bool = False, metadata: dict = {}) None
all_entity_ids() List[int]
build_meta_table()
chunk_iterator() DataFrame
static dummy_query(ctx: ClientContext)
get_strong_columns()
get_weak_columns()
property hash: str
property matched_constraints: Dict[str, ConstraintValue]

Each constraint will get applied to each supported document on the sharepoint. This method is called when the document is being added to a DocumentManager in order to build an index for constrained search.

property meta_table: DataFrame | None

Stores the mapping from id_in_document to the metadata of the document. It can be used to fetch a minimal document result if the connection is lost.

property name: str
reference(element_id: int) Reference
static setup_clientContext(base_url: str, credentials: Dict[str, str]) ClientContext

Method to create a ClientContext object given base_url and credentials in the form (username, password) OR (client_id, client_secret)

setup_connection(ctx: ClientContext)

This is a helper function to re-establish the connection upon loading the saved ndb model containing this SharePoint document.

Parameters:

ctx – ClientContext object. NOTE: Provide the same connection object that was originally used.

NOTE: The same library path will be used.

property size: int
property source: str
class thirdai.neural_db.SentenceLevelPDF

Bases: SentenceLevelExtracted

Parses a document into sentences and creates a NeuralDB entry for each sentence. The strong column of the entry is the sentence itself while the weak column is the paragraph from which the sentence came. A NeuralDB reference produced by this object displays the paragraph instead of the sentence to increase recall.

Parameters:
  • path (str) – The path to the pdf file.

  • metadata (Dict[str, Any]) – Optional, defaults to {}. Specifies metadata to associate with entities from this file. Queries to NeuralDB can provide constraints to restrict results based on the metadata.

__init__(path: str, metadata=None, on_disk=False)
process_data(path: str) DataFrame
class thirdai.neural_db.SentenceLevelDOCX

Bases: SentenceLevelExtracted

Parses a document into sentences and creates a NeuralDB entry for each sentence. The strong column of the entry is the sentence itself while the weak column is the paragraph from which the sentence came. A NeuralDB reference produced by this object displays the paragraph instead of the sentence to increase recall.

Parameters:
  • path (str) – The path to the docx file.

  • metadata (Dict[str, Any]) – Optional, defaults to {}. Specifies metadata to associate with entities from this file. Queries to NeuralDB can provide constraints to restrict results based on the metadata.

__init__(path: str, metadata=None, on_disk=False)
process_data(path: str) DataFrame
class thirdai.neural_db.InMemoryText

Bases: Document

A wrapper around a batch of texts and their metadata so that they fit into the NeuralDB Document framework.

Parameters:
  • name (str) – A name for the batch of texts.

  • texts (List[str]) – A batch of texts.

  • metadatas (List[Dict[str, Any]]) – Optional. Metadata for each text.

  • global_metadata (Dict[str, Any]) – Optional. Metadata for the whole batch of texts.
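
Examples

A construction sketch; the name, texts, and metadata are placeholders.

>>> doc = InMemoryText(
...     name="faqs",
...     texts=["You can reset your password from the settings page.", "Invoices are emailed monthly."],
...     metadatas=[{"topic": "account"}, {"topic": "billing"}],
... )
>>> ndb.insert([doc])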

__init__(name: str, texts: List[str], metadatas: List[dict] | None = None, global_metadata=None, on_disk=False)
all_entity_ids() List[int]
context(element_id, radius) str
filter_entity_ids(filters: Dict[str, Filter])
property hash: str
load_meta(directory: Path)
property matched_constraints: Dict[str, ConstraintValue]
property name: str
reference(element_id: int) Reference
save_meta(directory: Path)
property size: int
property source: str
strong_text(element_id: int) str
weak_text(element_id: int) str

NeuralDB Enterprise Python Client

A deployed instance of NeuralDB Enterprise can be accessed via this Python API.

class thirdai.neural_db.ModelBazaar

Bases: Bazaar

A class representing ModelBazaar, providing functionality for managing models and deployments.

_base_url

The base URL for the Model Bazaar.

Type:

str

_cache_dir

The directory for caching downloads.

Type:

Union[Path, str]

__init__(self, base_url: str, cache_dir: Union[Path, str] = './bazaar_cache') -> None: Initializes a new instance of the ModelBazaar class.

sign_up(self, email: str, password: str, username: str) -> None: Signs up a user and sets the username for the ModelBazaar instance.

log_in(self, email: str, password: str) -> None: Logs in a user and sets user-related attributes for the ModelBazaar instance.

push_model(self, model_name: str, local_path: str, access_level: str = 'private') -> None: Pushes a model to the Model Bazaar.

pull_model(self, model_identifier: str) -> NeuralDBClient: Pulls a model from the Model Bazaar and returns a NeuralDBClient instance.

list_models(self) -> List[dict]: Lists available models in the Model Bazaar.

train(self, model_name: str, unsupervised_docs: Optional[List[str]] = None, supervised_docs: Optional[List[Tuple[str, str]]] = None, test_doc: Optional[str] = None, doc_type: str = 'local', sharded: bool = False, is_async: bool = False, base_model_identifier: str = None, train_extra_options: Optional[dict] = None, metadata: Optional[List[Dict[str, str]]] = None) -> Model: Initiates training for a model and returns a Model instance.

await_train(self, model: Model) -> None: Waits for the training of a model to complete.

test(self, model_identifier: str, test_doc: str, doc_type: str = 'local', test_extra_options: dict = {}, is_async: bool = False) -> str: Starts the model testing on the given test file.

await_test(self, model_identifier: str, test_id: str) -> None: Waits for the testing of a model on that test_id to complete.

deploy(self, model_identifier: str, deployment_name: str, is_async: bool = False) -> NeuralDBClient: Deploys a model and returns a NeuralDBClient instance.

await_deploy(self, ndb_client: NeuralDBClient) -> None: Waits for the deployment of a model to complete.

undeploy(self, ndb_client: NeuralDBClient) -> None: Undeploys a deployed model.

list_deployments(self) -> List[dict]: Lists the deployments in the Model Bazaar.

connect(self, deployment_identifier: str) -> NeuralDBClient: Connects to a deployed model and returns a NeuralDBClient instance.

__init__(base_url: str, cache_dir: Path | str = './bazaar_cache')

Initializes a new instance of the ModelBazaar class.

Parameters:
  • base_url (str) – The base URL for the Model Bazaar.

  • cache_dir (Union[Path, str]) – The directory for caching downloads.
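
Examples

A connection sketch; the base URL and credentials are placeholders.

>>> bazaar = ModelBazaar(base_url="https://modelbazaar.example.com/api/")
>>> bazaar.log_in(email="user@example.com", password="password")
>>> models = bazaar.list_models()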

await_deploy(ndb_client: NeuralDBClient)

Waits for the deployment of a model to complete.

Parameters:

ndb_client (NeuralDBClient) – The NeuralDBClient instance.

await_test(model_identifier: str, test_id: str)

Waits for the testing of the model to complete.

Parameters:
  • model_identifier – The identifier of the model.

  • test_id – Unique id for the test.

await_train(model: Model)

Waits for the training of a model to complete.

Parameters:

model (Model) – The Model instance.

connect(deployment_identifier: str)

Connects to a deployed model and returns a NeuralDBClient instance.

Parameters:

deployment_identifier (str) – The identifier of the deployment.

Returns:

A NeuralDBClient instance.

Return type:

NeuralDBClient

deploy(model_identifier: str, deployment_name: str, memory: int | None = None, is_async=False)

Deploys a model and returns a NeuralDBClient instance.

Parameters:
  • model_identifier (str) – The identifier of the model.

  • deployment_name (str) – The name for the deployment.

  • is_async (bool) – Whether deployment should be asynchronous (default is False).

Returns:

A NeuralDBClient instance.

Return type:

NeuralDBClient

list_deployments()

Lists the deployments in the Model Bazaar.

Returns:

A list of dictionaries containing information about deployments.

Return type:

List[dict]

list_models()

Lists available models in the Model Bazaar.

Returns:

A list of dictionaries containing information about available models.

Return type:

List[dict]

log_in(email, password)

Logs in a user and sets user-related attributes for the ModelBazaar instance.

Parameters:
  • email (str) – The email of the user.

  • password (str) – The password of the user.

pull_model(model_identifier: str)

Pulls a model from the Model Bazaar and returns a NeuralDBClient instance.

Parameters:

model_identifier (str) – The identifier of the model.

Returns:

A NeuralDBClient instance.

Return type:

NeuralDBClient

push_model(model_name: str, local_path: str, access_level: str = 'private')

Pushes a model to the Model Bazaar.

Parameters:
  • model_name (str) – The name of the model.

  • local_path (str) – The local path of the model.

  • access_level (str) – The access level for the model (default is “private”).

sign_up(email, password, username)

Signs up a user and sets the username for the ModelBazaar instance.

Parameters:
  • email (str) – The email of the user.

  • password (str) – The password of the user.

  • username (str) – The desired username.

test(model_identifier: str, test_doc: str, doc_type: str = 'local', test_extra_options: dict = {}, is_async: bool = False)

Initiates testing for a model and returns the test_id (a unique identifier for this test).

Parameters:
  • model_identifier (str) – The identifier of the model.

  • test_doc (str) – A path to a test file for evaluating the trained NeuralDB.

  • doc_type (str) – Specifies the document location type: “local” (default), “nfs”, or “s3”.

  • test_extra_options – (Optional[dict])

  • is_async (bool) – Whether testing should be asynchronous (default is False).

Returns:

The test_id, which is unique for the given test.

Return type:

str

test_status(test_id: str)

Checks the status of the model testing.

Parameters:
  • test_id (str) – The unique id with which the test can be recognized; the user will get this id in the response when they trigger the test.

train(model_name: str, unsupervised_docs: List[str] | None = None, supervised_docs: List[Tuple[str, str]] | None = None, test_doc: str | None = None, doc_type: str = 'local', sharded: bool = False, is_async: bool = False, base_model_identifier: str | None = None, train_extra_options: dict | None = None, metadata: List[Dict[str, str]] | None = None)

Initiates training for a model and returns a Model instance.

Parameters:
  • model_name (str) – The name of the model.

  • unsupervised_docs (Optional[List[str]]) – A list of document paths for unsupervised training.

  • supervised_docs (Optional[List[Tuple[str, str]]]) – A list of document path and source id pairs.

  • test_doc (Optional[str]) – A path to a test file for evaluating the trained NeuralDB.

  • doc_type (str) – Specifies the document location type: “local” (default), “nfs”, or “s3”.

  • sharded (bool) – Whether NeuralDB training will be distributed over NeuralDB shards.

  • is_async (bool) – Whether training should be asynchronous (default is False).

  • train_extra_options – (Optional[dict])

  • base_model_identifier (Optional[str]) – The identifier of the base model.

  • metadata (Optional[List[Dict[str, str]]]) – A list of metadata dicts. Each dict corresponds to an unsupervised file.

Returns:

A Model instance.

Return type:

Model

train_status(model: Model)

Checks the status of the model training.

Parameters:

model (Model) – The Model instance.

undeploy(ndb_client: NeuralDBClient)

Undeploys a deployed model.

Parameters:

ndb_client (NeuralDBClient) – The NeuralDBClient instance.

update_model(model_name: str, base_model_identifier: str)

Creates a new model with the given name by updating the existing model with RLHF logs.

Parameters:
  • model_name (str) – Name for the new model.

  • base_model_identifier (str) – The identifier of the base model.

Returns:

A Model instance.

Return type:

Model