thirdai.neural_db
- class thirdai.neural_db.NeuralDB
Bases:
object
NeuralDB is a search and retrieval system that can be used to search over knowledge bases and documents. It can also be used in RAG pipelines for the search retrieval phase.
Examples
>>> ndb = NeuralDB()
>>> ndb.insert([CSV(...), PDF(...), DOCX(...)])
>>> results = ndb.search("how to make chocolate chip cookies")
- __init__(user_id: str = 'user', num_shards: int = 1, num_models_per_shard: int = 1, retriever='finetunable_retriever', low_memory=None, **kwargs) None
Constructs an empty NeuralDB.
- Parameters:
user_id (str) – Optional, used to identify user/session in logging.
retriever (str) – One of ‘finetunable_retriever’, ‘mach’, or ‘hybrid’. Identifies which retriever to use as the backend. Defaults to ‘finetunable_retriever’.
- Returns:
A NeuralDB.
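Example (an illustrative sketch; the user_id value is arbitrary and 'mach' is simply one of the documented retriever options):
>>> ndb = NeuralDB(user_id="search_session_1", retriever="mach")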
- associate(source: str, target: str, strength: Strength = Strength.Strong, **kwargs)
Teaches the underlying model in the NeuralDB that two different texts correspond to similar concepts or queries.
- Parameters:
source (str) – The source is the new text you want to teach the model about.
target (str) – The target is the known text that is provided to the model as an example of the type of information or query the source resembles.
Examples
>>> ndb.associate("asap", "as soon as possible")
>>> ndb.associate("what is a 401k", "explain different types of retirement savings")
- associate_batch(text_pairs: List[Tuple[str, str]], strength: Strength = Strength.Strong, **kwargs)
Same as associate, but the process is applied to a batch of (source, target) pairs at once.
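Example (a hypothetical batch mirroring the associate() examples above):
>>> ndb.associate_batch([("asap", "as soon as possible"), ("what is a 401k", "explain different types of retirement savings")])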
- clear_sources() None
Removes all documents stored in the NeuralDB.
- delete(source_ids: List[str])
Deletes documents from the NeuralDB.
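Example (an illustrative sketch that deletes the first inserted source; sources() returns the mapping from source id to document):
>>> first_source_id = next(iter(ndb.sources()))
>>> ndb.delete([first_source_id])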
- static from_checkpoint(checkpoint_path: str, user_id: str = 'user', on_progress: ~typing.Callable = <function no_op>, **kwargs)
Constructs a NeuralDB from a checkpoint. This can be used to save and reload NeuralDBs; it is also used for loading pretrained NeuralDB models.
- Parameters:
checkpoint_path (str) – The path to the checkpoint directory.
user_id (str) – Optional, used to identify user/session in logging.
on_progress (Callable) – Optional, callback that can be called as loading the checkpoint progresses.
- Returns:
A NeuralDB.
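Example (the checkpoint path is illustrative):
>>> ndb = NeuralDB.from_checkpoint("checkpoints/my_ndb")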
- static from_udt(udt: UniversalDeepTransformer, user_id: str = 'user', csv: str | None = None, csv_id_column: str | None = None, csv_strong_columns: List[str] | None = None, csv_weak_columns: List[str] | None = None, csv_reference_columns: List[str] | None = None)
Instantiates a NeuralDB using the given UDT as the underlying model. Typically used for porting a pretrained model into the NeuralDB format. Use the optional csv-related arguments to insert the pretraining dataset into the NeuralDB instance.
- Parameters:
udt (bolt.UniversalDeepTransformer) – The udt model to use in the NeuralDB.
user_id (str) – Optional, used to identify user/session in logging.
csv (Optional[str]) – Optional, default None. The path to the CSV file used to train the udt model. If supplied, the CSV file will be inserted into NeuralDB.
csv_id_column (Optional[str]) – Optional, default None. The id column of the training dataset. Required only if the data is being inserted via the csv arg.
csv_strong_columns (Optional[List[str]]) – Optional, default None. The strong signal columns from the training data. Required only if the data is being inserted via the csv arg.
csv_weak_columns (Optional[List[str]]) – Optional, default None. The weak signal columns from the training data. Required only if the data is being inserted via the csv arg.
csv_reference_columns (Optional[List[str]]) – Optional, default None. The columns whose data should be returned as search results to queries. Required only if the data is being inserted via the csv arg.
- Returns:
A NeuralDB.
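Example (an illustrative sketch; udt_model is assumed to be a trained bolt.UniversalDeepTransformer, and the CSV path and column names are hypothetical):
>>> ndb = NeuralDB.from_udt(udt_model, csv="catalog.csv", csv_id_column="id", csv_strong_columns=["title"], csv_weak_columns=["description"], csv_reference_columns=["title", "description"])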
- get_associate_samples()
Get past associate() and associate_batch() samples from NeuralDB logs.
- get_rlhf_samples()
Get past associate(), associate_batch(), text_to_result(), and text_to_result_batch() samples from NeuralDB logs.
- get_upvote_samples()
Get past text_to_result() and text_to_result_batch() samples from NeuralDB logs.
- insert(sources: ~typing.List[~thirdai.neural_db.documents.Document], train: bool = True, fast_approximation: bool = True, num_buckets_to_sample: int | None = None, on_progress: ~typing.Callable = <function no_op>, on_success: ~typing.Callable = <function no_op>, on_error: ~typing.Callable | None = None, cancel_state: ~thirdai.neural_db.models.model_interface.CancelState | None = None, max_in_memory_batches: int | None = None, variable_length: ~data.transformations.VariableLengthConfig | None = <data.transformations.VariableLengthConfig object>, checkpoint_config: ~thirdai.neural_db.trainer.checkpoint_config.CheckpointConfig | None = None, callbacks: ~typing.List[~bolt.train.callbacks.Callback] | None = None, **kwargs) List[str]
Inserts documents/resources into the database.
- Parameters:
sources (List[Doc]) – List of NeuralDB documents to be inserted.
train (bool) – Optional, defaults to True. When True, the underlying model in the NeuralDB will undergo unsupervised pretraining on the inserted documents.
fast_approximation (bool) – Optional, default True. Much faster insertion with a slight drop in performance.
num_buckets_to_sample (Optional[int]) – Used to control load balancing when inserting entities into the NeuralDB.
on_progress (Callable) – Optional, a callback that is called at intervals as documents are inserted.
on_success (Callable) – Optional, a callback that is invoked when document insertion is finished successfully.
on_error (Callable) – Optional, a callback that is invoked if an error occurs during insertion.
cancel_state (CancelState) – An object that can be used to stop an ongoing insertion. Primarily used for PocketLLM.
max_in_memory_batches (int) – Optional, default None. When supplied this limits the maximum amount of data that is loaded into memory at once during training. Useful for lower memory paradigms or with large datasets.
checkpoint_config (CheckpointConfig) – Optional, default None. Configuration for checkpointing during insertion. No checkpoints are created if checkpoint_config is unspecified.
- Returns:
A list of the ids assigned to the inserted documents.
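Example (an illustrative sketch; the file paths and id column are hypothetical):
>>> source_ids = ndb.insert([CSV("catalog.csv", id_column="id"), PDF("manual.pdf")], train=True)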
- pretrain_distributed(documents, scaling_config, run_config, learning_rate: float = 0.001, epochs: int = 5, batch_size: int | None = None, metrics: List[str] = [], max_in_memory_batches: int | None = None, communication_backend='gloo', log_folder=None)
Pretrains a model in a distributed manner using the provided documents.
- Parameters:
documents – List of documents for pretraining. All the documents must have the same id column.
scaling_config – Configuration related to the scaling aspects for Ray trainer. Read https://docs.ray.io/en/latest/train/api/doc/ray.train.ScalingConfig.html
run_config – Configuration related to the runtime aspects for Ray trainer. Read https://docs.ray.io/en/latest/train/api/doc/ray.train.RunConfig.html. Note: storage_path must be specified in RunConfig, and it must be a networked file system or cloud storage path accessible by all workers (Ray 2.7.0 onwards).
learning_rate (float, optional) – Learning rate for the optimizer. Default is 0.001.
epochs (int, optional) – Number of epochs to train. Default is 5.
batch_size (int, optional) – Size of each batch for training. If not provided, will be determined automatically.
metrics (List[str], optional) – List of metrics to evaluate during training. Default is an empty list.
max_in_memory_batches (Optional[int], optional) – Number of batches to load in memory at once. Useful for streaming support when dataset is too large to fit in memory. If None, all batches will be loaded.
communication_backend (str, optional) – Bolt distributed training uses a Torch communication backend for inter-worker communication. Default is "gloo".
Notes
Make sure to pass id_column to neural_db.CSV(), ensuring the ids are in ascending order starting from 0.
- The scaling_config, run_config, and resume_from_checkpoint arguments are related to the Ray trainer configuration. Read https://docs.ray.io/en/latest/ray-air/trainers.html#trainer-basics
Ensure that the specified communication backend (Gloo or MPI) is compatible with your hardware and network setup.
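Example (a minimal sketch assuming a running Ray cluster; the worker count, storage path, and document are illustrative):
>>> from ray.train import ScalingConfig, RunConfig
>>> ndb.pretrain_distributed(
...     documents=[CSV("catalog.csv", id_column="id")],
...     scaling_config=ScalingConfig(num_workers=2),
...     run_config=RunConfig(storage_path="s3://my-bucket/ndb_checkpoints"),
...     epochs=5,
... )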
- ready_to_search() bool
Returns True if documents have been inserted and the model is prepared to serve queries, False otherwise.
- reference(element_id: int)
Returns a reference containing the text and other information for a given entity id.
- retrain(text_pairs: List[Tuple[str, str]] = [], learning_rate: float = 0.0001, epochs: int = 3, strength: Strength = Strength.Strong)
Train NeuralDB on all inserted documents and logged RLHF samples.
- save(save_to: str | ~pathlib.Path, on_progress: ~typing.Callable = <function no_op>) str
- search(query: str, top_k: int, constraints=None, rerank=False, reranker=None, top_k_rerank=100, rerank_threshold=1.5, top_k_threshold=None, retriever=None, label_probing=False, mach_first=False) List[Reference]
Searches the contents of the NeuralDB for documents relevant to the given query.
- Parameters:
query (str) – The query to search with.
top_k (int) – The number of results to return.
constraints (Dict[str, Any]) – A dictionary containing constraints to apply to the metadata field of each document in the NeuralDB. This allows for queries that will only return results with a certain property. The constraints are in the form {"metadata_key": <constraint>} where <constraint> is either an explicit value for the key in the metadata, or a Filter object.
rerank (bool) – Optional, default False. When True an additional reranking step is applied to results.
top_k_rerank (int) – Optional, default 100. If rerank=True then this argument determines how many candidates are retrieved, before reranking and returning the top_k.
rerank_threshold (float) – Optional, default 1.5. In reranking, all candidates with a score under a certain threshold are reranked. This threshold is computed as this argument (rerank_threshold) times the average score of the first top_k_threshold candidates. Candidates with scores lower than this threshold will be reranked; thus, increasing this value causes more candidates to be reranked.
top_k_threshold (Optional[float]) – Optional, default None, which means the arg top_k will be used. If specified, this argument controls how many of the top candidates' scores are averaged to obtain the mean that is used to determine which candidates are reranked. For example, passing rerank_threshold=2 and top_k_threshold=4 means that the scores of the top 4 candidates are averaged, and all candidates scoring below 2x this average are reranked.
retriever (Optional[str]) – Optional, default None. This arg controls which retriever to use for search when a hybrid retrieval model is used. Passing None means that NeuralDB will automatically decide which retrievers (or combination of retrievers) to use.
- Returns:
A list of Reference objects. Each reference object contains text data matching the query, along with information about which document contained that text.
- Return type:
List[Reference]
Examples
>>> ndb.search("what is ...", top_k=5)
>>> ndb.search("what is ...", top_k=5, constraints={"file_type": "pdf", "file_created": GreaterThan(10)})
- search_batch(queries: List[str], top_k: int, constraints=None, rerank=False, reranker=None, top_k_rerank=100, rerank_threshold=1.5, top_k_threshold=None, retriever=None, label_probing=False, mach_first=False)
Runs search on a batch of queries for much faster throughput.
- Parameters:
queries (List[str]) – The queries to search.
- Returns:
A list containing the db.search results for each query.
- Return type:
List[List[Reference]]
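Example (illustrative queries):
>>> batch_results = ndb.search_batch(["how to reset a password", "what is the refund policy"], top_k=5)
>>> first_query_results = batch_results[0]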
- sources() Dict[str, Document]
Returns a mapping from source IDs to their corresponding document objects. This is useful when you need to know the source ID of a document you inserted, e.g. for creating a Sup object for supervised_train().
- supervised_train(data: List[Sup], learning_rate=0.0001, epochs=3, batch_size: int | None = None, max_in_memory_batches: int | None = None, metrics: List[str] = [], callbacks: List[Callback] = [], checkpoint_config: CheckpointConfig | None = None, **kwargs)
Train on supervised datasets that correspond to specific sources. Suppose you inserted a “sports” product catalog and a “furniture” product catalog. You also have supervised datasets - pairs of queries and correct products - for both categories. You can use this method to train NeuralDB on these supervised datasets.
- Parameters:
data (List[Sup]) – Supervised training samples.
learning_rate (float) – Optional. The learning rate to use for training.
epochs (int) – Optional. The number of epochs to train for.
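Example (an illustrative sketch; the CSV path and column names are hypothetical, and the source id is looked up via sources()):
>>> sports_source_id = next(iter(ndb.sources()))
>>> sup = Sup(csv="sports_queries.csv", query_column="query", id_column="product_id", source_id=sports_source_id)
>>> ndb.supervised_train([sup], learning_rate=0.0001, epochs=3)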
- supervised_train_with_ref_ids(csv: str | None = None, query_column: str | None = None, id_column: str | None = None, id_delimiter: str | None = None, queries: Sequence[str] | None = None, labels: Sequence[Sequence[int]] | None = None, learning_rate=0.0001, epochs=3, batch_size: int | None = None, max_in_memory_batches: int | None = None, metrics: List[str] = [], callbacks: List[Callback] = [], checkpoint_config: CheckpointConfig | None = None, **kwargs)
Train on supervised datasets that correspond to specific sources. Suppose you inserted a “sports” product catalog and a “furniture” product catalog. You also have supervised datasets - pairs of queries and correct products - for both categories. You can use this method to train NeuralDB on these supervised datasets. This method must be invoked with either A) a csv file with the query and id columns within it, or B) an explicit list of queries and expected labels.
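Example (an illustrative sketch using explicit queries and NeuralDB-assigned reference ids; the values are hypothetical):
>>> ndb.supervised_train_with_ref_ids(queries=["trail running shoes"], labels=[[12]], epochs=3)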
- text_to_result(text: str, result_id: int, **kwargs) None
Trains NeuralDB to map the given text to the given entity ID. Also known as “upvoting”.
Example
>>> ndb.text_to_result("a new query", result_id=4)
- text_to_result_batch(text_id_pairs: List[Tuple[str, int]], **kwargs) None
Trains NeuralDB to map the given texts to the given entity IDs. Also known as “batch upvoting”.
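Example (hypothetical queries and entity ids):
>>> ndb.text_to_result_batch([("a new query", 4), ("another new query", 7)])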
- class thirdai.neural_db.Reference
Bases:
object
- __init__(document: Document, element_id: int, text: str, source: str, metadata: dict, upvote_ids: List[int] | None = None, retriever: str | None = None)
- context(radius: int)
- property document
- property id
- property id_in_document
- property metadata
- property retriever
- property score
- property source
- property text
- property upvote_ids
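Example (an illustrative sketch of inspecting a search result; the query is hypothetical):
>>> results = ndb.search("how to make chocolate chip cookies", top_k=3)
>>> ref = results[0]
>>> ref.text, ref.source, ref.score
>>> surrounding_text = ref.context(radius=1)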
- class thirdai.neural_db.Sup
Bases:
object
An object that contains supervised samples. This object is to be passed into NeuralDB.supervised_train().
It can be initialized either with a CSV file, in which case it needs query and ID column names, or with sequences of queries and labels. It also needs to know which source object (i.e. which inserted CSV or PDF object) contains the relevant entities to the supervised samples.
If uses_db_id is True, then the labels are assumed to use database-assigned IDs and will not be converted.
- __init__(csv: str | None = None, query_column: str | None = None, id_column: str | None = None, id_delimiter: str | None = None, queries: Sequence[str] | None = None, labels: Sequence[Sequence[int]] | None = None, source_id: str = '', uses_db_id: bool = False)
- property size
Document Types
NeuralDB currently supports the following types of documents and resources.
- class thirdai.neural_db.CSV
Bases:
Document
A document containing the rows of a csv file.
- Parameters:
path (str) – The path to the csv file.
id_column (Optional[str]) – Optional. Ids in this column are used to identify the rows in NeuralDB. If not provided, ids are assigned automatically.
strong_columns (Optional[List[str]]) – Optional, defaults to None. This argument can be used to provide NeuralDB with information about which columns are likely to contain the strongest signal in matching with a given query. For example this could be something like the name of a product.
weak_columns (Optional[List[str]]) – Optional, defaults to None. This argument can be used to provide NeuralDB with information about which columns are likely to contain weaker signals in matching with a given query. For example this could be something like the description of a product.
reference_columns (Optional[List[str]]) – Optional, defaults to None. If provided, the specified columns are returned by NeuralDB as responses to queries. If not specified, all columns are returned.
metadata (Dict[str, Any]) – Optional, defaults to {}. Specifies metadata to associate with entities from this file. Queries to NeuralDB can provide constraints to restrict results based on the metadata.
- __init__(path: str, id_column: str | None = None, strong_columns: List[str] | None = None, weak_columns: List[str] | None = None, reference_columns: List[str] | None = None, save_extra_info=True, metadata=None, has_offset=False, on_disk=False, use_dask=False, blocksize=None) None
- all_entity_ids() List[int]
- context(element_id: int, radius) str
- filter_entity_ids(filters: Dict[str, Filter])
- property hash: str
- id_map() Dict[str, int] | None
- load_meta(directory: Path)
- property matched_constraints
- property name: str
- remove_spaces()
- remove_spaces_from_list()
- row_iterator()
- save_meta(directory: Path)
- property size: int
- property source: str
- strong_text(element_id: int) str
- strong_text_from_row(row) str
- valid_id_column()
- weak_text(element_id: int) str
- weak_text_from_row(row) str
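Example (an illustrative sketch; the file path and column names are hypothetical):
>>> doc = CSV("products.csv", id_column="id", strong_columns=["title"], weak_columns=["description"], reference_columns=["title", "description"], metadata={"category": "catalog"})
>>> ndb.insert([doc])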
- class thirdai.neural_db.PDF
Bases:
Extracted
Parses a PDF document into chunks of text that can be indexed by NeuralDB.
- Parameters:
path (str) – path to PDF file
version (str) – Either “v1” or “v2”. If “v1”, the parser splits the PDF into paragraphs. If “v2”, the parser creates overlapping chunks comprised of entire lines from the PDF. “v2” does more data cleaning and therefore supports more options, which are outlined below.
chunk_size (int) – Only relevant if version = “v2”. The number of words in each chunk of text. Defaults to 100
stride (int) – Only relevant if version = “v2”. The number of words between each chunk of text. When stride < chunk_size, the text chunks overlap. When stride = chunk_size, the text chunks do not overlap. Defaults to 40 so adjacent chunks have a 60% overlap.
emphasize_first_words (int) – Only relevant if version = “v2”. The number of words at the beginning of the document to be passed into NeuralDB as a strong signal. For example, if your document starts with a descriptive title that is 3 words long, then you can set emphasize_first_words to 3 so that NeuralDB captures this strong signal. Defaults to 0.
ignore_header_footer (bool) – Only relevant if version = “v2”. Whether the parser should remove headers and footers. Defaults to True; headers and footers are removed by default.
ignore_nonstandard_orientation (bool) – Only relevant if version = “v2”. Whether the parser should remove lines of text that have a nonstandard orientation, such as margins that are oriented vertically. Defaults to True; lines with nonstandard orientation are removed by default.
metadata (Dict[str, Any]) – Optional, defaults to {}. Specifies metadata to associate with entities from this file. Queries to NeuralDB can provide constraints to restrict results based on the metadata.
on_disk (bool) – If True, the processed chunks will be stored in a lightweight on-disk database. Otherwise, processed chunks will be stored in-memory. Defaults to False.
doc_keywords (str) – Only relevant if version = “v2”. If provided, the keywords will be prepended to every chunk in the document. It is helpful for use cases where a NeuralDB instance contains multiple documents. Defaults to an empty string.
emphasize_section_titles (bool) – Only relevant if version = “v2”. If True, infers section titles based on font properties and prepends the latest section title to each chunk. Defaults to False.
table_parsing (bool) – Only relevant if version = “v2”. If True, the contents of a table are considered to be contained in a single line, ensuring that any chunk that contains a table contains the entire table. Defaults to False.
save_extra_info (bool) – If True, the original PDF file will be saved in .ndb checkpoint. Defaults to True.
- __init__(path: str, version: str = 'v1', chunk_size=100, stride=40, emphasize_first_words=0, ignore_header_footer=True, ignore_nonstandard_orientation=True, metadata=None, on_disk=False, doc_keywords='', emphasize_section_titles=False, table_parsing=False, save_extra_info=True)
- process_data(path: str) DataFrame
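Example (an illustrative sketch; the file path and option values are hypothetical):
>>> doc = PDF("employee_handbook.pdf", version="v2", chunk_size=100, stride=40, emphasize_first_words=3)
>>> ndb.insert([doc])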
- class thirdai.neural_db.DOCX
Bases:
Extracted
- __init__(path: str, metadata=None, on_disk=False)
- process_data(path: str) DataFrame
- class thirdai.neural_db.URL
Bases:
Document
A URL document takes the data found at the provided URL (or in the provided response) and creates entities that can be inserted into NeuralDB.
- Parameters:
url (str) – The URL where the data is located.
url_response (Response) – Optional, defaults to None. If provided, the data in the response is used to create the entities; otherwise a GET request is sent to the url.
title_is_strong (bool) – Optional, defaults to False. If True, the title is used as a strong signal for NeuralDB.
metadata (Dict[str, Any]) – Optional, defaults to {}. Specifies metadata to associate with entities from this file. Queries to NeuralDB can provide constraints to restrict results based on the metadata.
- __init__(url: str, url_response: Response | None = None, save_extra_info: bool = True, title_is_strong: bool = False, metadata=None, on_disk=False)
- all_entity_ids() List[int]
- context(element_id, radius) str
- property hash: str
- load_meta(directory: Path)
- property matched_constraints: Dict[str, ConstraintValue]
- property name: str
- process_data(url, url_response=None) DataFrame
- save_meta(directory: Path)
- property size: int
- property source: str
- strong_text(element_id: int) str
- weak_text(element_id: int) str
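Example (an illustrative sketch; the URL is a placeholder):
>>> doc = URL("https://example.com/faq.html", title_is_strong=True)
>>> ndb.insert([doc])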
- class thirdai.neural_db.SQLDatabase
Bases:
DocumentConnector
Class for handling SQL database connections and data retrieval for training the neural_db model.
This class encapsulates functionality for connecting to an SQL database, executing SQL queries, and retrieving data for use in training the model.
NOTE: The table is expected to remain static in terms of both rows and columns.
- __init__(engine: Connection, table_name: str, id_col: str, strong_columns: List[str] | None = None, weak_columns: List[str] | None = None, reference_columns: List[str] | None = None, chunk_size: int = 10000, save_extra_info: bool = False, metadata: dict = {}) None
- all_entity_ids() List[int]
- assert_valid_columns()
- assert_valid_id()
- chunk_iterator() DataFrame
- get_engine()
- get_strong_columns()
- get_weak_columns()
- property hash
- property matched_constraints: Dict[str, ConstraintValue]
This method is called when the document is being added to a DocumentManager in order to build an index for constrained search.
- property meta_table: DataFrame | None
Stores the mapping from id_in_document to the document's metadata. It can be used to fetch a minimal document result if the connection is lost.
- property name
- setup_connection(engine: Connection)
This is a helper function to re-establish the connection upon loading the saved ndb model containing this SQLDatabase document.
- Parameters:
engine – SQLAlchemy Connection object. NOTE: Provide the same connection object; the same table will be used to establish the connection.
- property size: int
- property source: str
- strong_text_from_chunk(id_in_chunk: int, chunk: DataFrame) str
- weak_text_from_chunk(id_in_chunk: int, chunk: DataFrame) str
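Example (an illustrative sketch assuming a SQLAlchemy connection; the database, table, and column names are hypothetical):
>>> from sqlalchemy import create_engine
>>> connection = create_engine("sqlite:///products.db").connect()
>>> doc = SQLDatabase(engine=connection, table_name="products", id_col="product_id", strong_columns=["name"], weak_columns=["description"], reference_columns=["name", "description"])
>>> ndb.insert([doc])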
- class thirdai.neural_db.SalesForce
Bases:
DocumentConnector
Class for handling Salesforce object connections and data retrieval for training the neural_db model.
This class encapsulates functionality for connecting to a Salesforce object, executing Salesforce Object Query Language (SOQL) queries, and retrieving data for use in training the model.
NOTE: Enable Bulk API access for the provided object. Also, the object is expected to remain static in terms of both rows and columns.
- __init__(instance: Salesforce, object_name: str, id_col: str, strong_columns: List[str] | None = None, weak_columns: List[str] | None = None, reference_columns: List[str] | None = None, save_extra_info: bool = True, metadata: dict = {}) None
- all_entity_ids() List[int]
- assert_field_inclusion(all_fields: List[OrderedDict])
- assert_field_type(all_fields: List[OrderedDict], supported_text_types: Tuple[str])
- assert_valid_fields(supported_text_types: Tuple[str] = ('string', 'textarea'))
- assert_valid_id()
- chunk_iterator() DataFrame
- default_fields(all_fields: List[OrderedDict], supported_text_types: Tuple[str])
- get_strong_columns()
- get_weak_columns()
- property hash: str
- property matched_constraints: Dict[str, ConstraintValue]
This method is called when the document is being added to a DocumentManager in order to build an index for constrained search.
- property meta_table: DataFrame | None
Stores the mapping from id_in_document to the document's metadata. It can be used to fetch a minimal document result if the connection is lost.
- property name: str
- row_iterator()
- setup_connection(instance: Salesforce)
This is a helper function to re-establish the connection upon loading a saved ndb model containing this SalesForce document.
- Parameters:
instance – Salesforce instance. NOTE: Provide the same connection object; the same object name will be used to establish the connection.
- property size: int
- property source: str
- strong_text_from_chunk(id_in_chunk: int, chunk: DataFrame) str
- weak_text_from_chunk(id_in_chunk: int, chunk: DataFrame) str
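Example (an illustrative sketch assuming a simple_salesforce connection; the credentials, object name, and field names are hypothetical):
>>> from simple_salesforce import Salesforce
>>> sf = Salesforce(username="user@example.com", password="password", security_token="token")
>>> doc = SalesForce(instance=sf, object_name="Knowledge__kav", id_col="Id", strong_columns=["Title"], weak_columns=["Summary"])
>>> ndb.insert([doc])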
- class thirdai.neural_db.SharePoint
Bases:
DocumentConnector
Class for handling a SharePoint connection, retrieving documents, processing them, and training the neural_db model.
- Parameters:
ctx (ClientContext) – A ClientContext object for the SharePoint connection.
library_path (str) – The server-relative directory path where documents are stored. Default: ‘Shared Documents’
chunk_size (int) – The maximum amount of data (in bytes) that can be fetched at a time. (This limit may not apply if there are no files within this range.) Default: 10MB
metadata (Dict[str, Any]) – Optional, defaults to {}. Specifies metadata to associate with entities from this file. Queries to NeuralDB can provide constraints to restrict results based on the metadata.
Each constraint is applied to each supported document on the SharePoint. This method is called when the document is being added to a DocumentManager in order to build an index for constrained search.
Stores the mapping from id_in_document to the document's metadata. It can be used to fetch a minimal document result if the connection is lost.
Method to create a ClientContext object given base_url and credentials in the form (username, password) OR (client_id, client_secret).
This is a helper function to re-establish the connection upon loading a saved ndb model containing this SharePoint document.
- Parameters:
engine – SQLAlchemy Connection object. NOTE: Provide the same connection object; the same library path will be used.
- class thirdai.neural_db.SentenceLevelPDF
Bases:
SentenceLevelExtracted
Parses a document into sentences and creates a NeuralDB entry for each sentence. The strong column of the entry is the sentence itself while the weak column is the paragraph from which the sentence came. A NeuralDB reference produced by this object displays the paragraph instead of the sentence to increase recall.
- Parameters:
path (str) – The path to the pdf file.
metadata (Dict[str, Any]) – Optional, defaults to {}. Specifies metadata to associate with entities from this file. Queries to NeuralDB can provide constraints to restrict results based on the metadata.
- __init__(path: str, metadata=None, on_disk=False)
- process_data(path: str) DataFrame
- class thirdai.neural_db.SentenceLevelDOCX
Bases:
SentenceLevelExtracted
Parses a document into sentences and creates a NeuralDB entry for each sentence. The strong column of the entry is the sentence itself while the weak column is the paragraph from which the sentence came. A NeuralDB reference produced by this object displays the paragraph instead of the sentence to increase recall.
- Parameters:
path (str) – The path to the docx file.
metadata (Dict[str, Any]) – Optional, defaults to {}. Specifies metadata to associate with entities from this file. Queries to NeuralDB can provide constraints to restrict results based on the metadata.
- __init__(path: str, metadata=None, on_disk=False)
- process_data(path: str) DataFrame
- class thirdai.neural_db.InMemoryText
Bases:
Document
A wrapper around a batch of texts and their metadata to fit it in the NeuralDB Document framework.
- Parameters:
name (str) – A name for the batch of texts.
texts (List[str]) – A batch of texts.
metadatas (List[Dict[str, Any]]) – Optional. Metadata for each text.
global_metadata (Dict[str, Any]) – Optional. Metadata for the whole batch of texts.
- __init__(name: str, texts: List[str], metadatas: List[dict] | None = None, global_metadata=None, on_disk=False)
- all_entity_ids() List[int]
- context(element_id, radius) str
- filter_entity_ids(filters: Dict[str, Filter])
- property hash: str
- load_meta(directory: Path)
- property matched_constraints: Dict[str, ConstraintValue]
- property name: str
- save_meta(directory: Path)
- property size: int
- property source: str
- strong_text(element_id: int) str
- weak_text(element_id: int) str
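Example (an illustrative sketch with hypothetical texts and metadata):
>>> doc = InMemoryText(name="faq_snippets", texts=["Refunds are processed within 5 business days.", "Standard shipping takes 3-7 days."], metadatas=[{"topic": "refunds"}, {"topic": "shipping"}])
>>> ndb.insert([doc])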
Constrained Search
NeuralDB supports the following constraints that can be passed to the search method.
- class thirdai.neural_db.AnyOf
Bases:
Filter[ItemT]
- __init__(values: Iterable[Any])
- filter(value_to_items: SortedDict) Set[ItemT]
- filter_df_column(df: DataFrame, column_name: str)
- sql_condition(column_name: str)
- class thirdai.neural_db.GreaterThan
Bases:
InRange[ItemT]
- __init__(minimum: Any, include_equal=False)
- filter(value_to_items: SortedDict) Set[ItemT]
- filter_df_column(df: DataFrame, column_name: str)
- sql_condition(column_name: str)
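Example (an illustrative sketch; the metadata keys and values are hypothetical):
>>> ndb.search("return policy", top_k=5, constraints={"file_type": AnyOf(["pdf", "docx"]), "file_created": GreaterThan(10, include_equal=True)})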
NeuralDB Enterprise Python Client
A deployed instance of NeuralDB Enterprise can be accessed via this Python API.
- class thirdai.neural_db.ModelBazaar
Bases:
Bazaar
A class representing ModelBazaar, providing functionality for managing models and deployments.
- _base_url
The base URL for the Model Bazaar.
- Type:
str
- _cache_dir
The directory for caching downloads.
- Type:
Union[Path, str]
- __init__(self, base_url: str, cache_dir: Union[Path, str] = "./bazaar_cache") -> None: Initializes a new instance of the ModelBazaar class.
- sign_up(self, email: str, password: str, username: str) -> None: Signs up a user and sets the username for the ModelBazaar instance.
- log_in(self, email: str, password: str) -> None: Logs in a user and sets user-related attributes for the ModelBazaar instance.
- push_model(self, model_name: str, local_path: str, access_level: str = "private") -> None: Pushes a model to the Model Bazaar.
- pull_model(self, model_identifier: str) -> NeuralDBClient: Pulls a model from the Model Bazaar and returns a NeuralDBClient instance.
- list_models(self) -> List[dict]: Lists available models in the Model Bazaar.
- train(self, model_name: str, unsupervised_docs: Optional[List[str]] = None, supervised_docs: Optional[List[Tuple[str, str]]] = None, test_doc: Optional[str] = None, doc_type: str = "local", sharded: bool = False, is_async: bool = False, base_model_identifier: str = None, train_extra_options: Optional[dict] = None, metadata: Optional[List[Dict[str, str]]] = None) -> Model: Initiates training for a model and returns a Model instance.
- await_train(self, model: Model) -> None: Waits for the training of a model to complete.
- test(self, model_identifier: str, test_doc: str, doc_type: str = "local", test_extra_options: dict = {}, is_async: bool = False) -> str: Starts model testing on the given test file.
- await_test(self, model_identifier: str, test_id: str) -> None: Waits for the testing of a model on that test_id to complete.
- deploy(self, model_identifier: str, deployment_name: str, is_async: bool = False) -> NeuralDBClient: Deploys a model and returns a NeuralDBClient instance.
- await_deploy(self, ndb_client: NeuralDBClient) -> None: Waits for the deployment of a model to complete.
- undeploy(self, ndb_client: NeuralDBClient) -> None: Undeploys a deployed model.
- list_deployments(self) -> List[dict]: Lists the deployments in the Model Bazaar.
- connect(self, deployment_identifier: str) -> NeuralDBClient: Connects to a deployed model and returns a NeuralDBClient instance.
- __init__(base_url: str, cache_dir: Path | str = './bazaar_cache')
Initializes a new instance of the ModelBazaar class.
- Parameters:
base_url (str) – The base URL for the Model Bazaar.
cache_dir (Union[Path, str]) – The directory for caching downloads.
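Example (an illustrative end-to-end sketch; the URL, credentials, file path, and identifiers are hypothetical):
>>> bazaar = ModelBazaar(base_url="https://modelbazaar.example.com/api/")
>>> bazaar.log_in(email="jane@example.com", password="my_password")
>>> model = bazaar.train(model_name="support_articles", unsupervised_docs=["articles.csv"])
>>> bazaar.await_train(model)
>>> client = bazaar.deploy(model_identifier="jane/support_articles", deployment_name="support_search")
>>> bazaar.await_deploy(client)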
- await_deploy(ndb_client: NeuralDBClient)
Waits for the deployment of a model to complete.
- Parameters:
ndb_client (NeuralDBClient) – The NeuralDBClient instance.
- await_test(model_identifier: str, test_id: str)
Waits for the testing of the model to complete.
- Parameters:
model_identifier – The identifier of the model.
test_id – Unique id for the test.
- await_train(model: Model)
Waits for the training of a model to complete.
- Parameters:
model (Model) – The Model instance.
- connect(deployment_identifier: str)
Connects to a deployed model and returns a NeuralDBClient instance.
- Parameters:
deployment_identifier (str) – The identifier of the deployment.
- Returns:
A NeuralDBClient instance.
- Return type:
NeuralDBClient
- deploy(model_identifier: str, deployment_name: str, memory: int | None = None, is_async=False)
Deploys a model and returns a NeuralDBClient instance.
- Parameters:
model_identifier (str) – The identifier of the model.
deployment_name (str) – The name for the deployment.
is_async (bool) – Whether deployment should be asynchronous (default is False).
- Returns:
A NeuralDBClient instance.
- Return type:
NeuralDBClient
- list_deployments()
Lists the deployments in the Model Bazaar.
- Returns:
A list of dictionaries containing information about deployments.
- Return type:
List[dict]
- list_models()
Lists available models in the Model Bazaar.
- Returns:
A list of dictionaries containing information about available models.
- Return type:
List[dict]
- log_in(email, password)
Logs in a user and sets user-related attributes for the ModelBazaar instance.
- Parameters:
email (str) – The email of the user.
password (str) – The password of the user.
- pull_model(model_identifier: str)
Pulls a model from the Model Bazaar and returns a NeuralDBClient instance.
- Parameters:
model_identifier (str) – The identifier of the model.
- Returns:
A NeuralDBClient instance.
- Return type:
NeuralDBClient
- push_model(model_name: str, local_path: str, access_level: str = 'private')
Pushes a model to the Model Bazaar.
- Parameters:
model_name (str) – The name of the model.
local_path (str) – The local path of the model.
access_level (str) – The access level for the model (default is “private”).
- sign_up(email, password, username)
Signs up a user and sets the username for the ModelBazaar instance.
- Parameters:
email (str) – The email of the user.
password (str) – The password of the user.
username (str) – The desired username.
- test(model_identifier: str, test_doc: str, doc_type: str = 'local', test_extra_options: dict = {}, is_async: bool = False)
Initiates testing for a model and returns the test_id (a unique identifier for this test).
- Parameters:
model_identifier (str) – The identifier of the model.
test_doc (str) – A path to a test file for evaluating the trained NeuralDB.
doc_type (str) – Specifies the document location type: "local" (default), "nfs", or "s3".
test_extra_options – (Optional[dict])
is_async (bool) – Whether testing should be asynchronous (default is False).
- Returns:
The test_id which is unique for given testing.
- Return type:
str
- test_status(test_id: str)
Checks the status of the model testing.
- Parameters:
test_id (str) – The unique id with which the test can be recognized; the user will get this id in the response when they trigger the test.
- train(model_name: str, unsupervised_docs: List[str] | None = None, supervised_docs: List[Tuple[str, str]] | None = None, test_doc: str | None = None, doc_type: str = 'local', sharded: bool = False, is_async: bool = False, base_model_identifier: str | None = None, train_extra_options: dict | None = None, metadata: List[Dict[str, str]] | None = None)
Initiates training for a model and returns a Model instance.
- Parameters:
model_name (str) – The name of the model.
unsupervised_docs (Optional[List[str]]) – A list of document paths for unsupervised training.
supervised_docs (Optional[List[Tuple[str, str]]]) – A list of document path and source id pairs.
test_doc (Optional[str]) – A path to a test file for evaluating the trained NeuralDB.
doc_type (str) – Specifies the document location type: "local" (default), "nfs", or "s3".
sharded (bool) – Whether NeuralDB training will be distributed over NeuralDB shards.
is_async (bool) – Whether training should be asynchronous (default is False).
train_extra_options – (Optional[dict])
base_model_identifier (Optional[str]) – The identifier of the base model.
metadata (Optional[List[Dict[str, str]]]) – A list of metadata dicts. Each dict corresponds to an unsupervised file.
- Returns:
A Model instance.
- Return type:
Model
- train_status(model: Model)
Checks the status of the model training.
- Parameters:
model (Model) – The Model instance.
- undeploy(ndb_client: NeuralDBClient)
Undeploys a deployed model.
- Parameters:
ndb_client (NeuralDBClient) – The NeuralDBClient instance.
- update_model(model_name: str, base_model_identifier: str)
Creates a new model with the given name by updating the existing model with RLHF logs.
- Parameters:
model_name (str) – Name for the new model.
base_model_identifier (str) – The identifier of the base model.
- Returns:
A Model instance.
- Return type:
Model