Database

Both BerryDB.connect and BerryDB.create_database return a Database object. This object has the following methods:

Database.settings(settings)

Applies the given settings to the database. The settings argument can be either the name of saved settings (str) or a Settings object, as shown in the example below.

Note

Refer to Settings to learn how to create and save settings

Example:

# import Settings
from berrydb import Settings

# Pass either the name of saved settings or a Settings object
settings = database.settings("settings-name")
settings = database.settings(settings)

Database.enable_fts(fields=None, override=False)

Creates a new full-text search (FTS) index on the database.

Parameters:

  • fields (list): List of fields (JSON paths) to build the index on. If empty, the index fields set on the schema are used.

  • override (bool): If True, replaces any existing index.

Returns:

  • FTS: An instance of the FTS object.

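Example

A minimal sketch; the field paths below are illustrative and should match your schema.

# 'database' as an instance of the connected database (See connect/create_database methods)
# Create an FTS index on the given JSON paths, replacing any existing index
fts = database.enable_fts(fields=["name", "description"], override=True)
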
Database.database_name()

The database_name method retrieves the name of the currently connected database. This name is essential for identifying and working with the specific database in your system, especially when managing multiple databases.

Parameters:

  • None: This method does not require any parameters to be passed.

Returns:

  • str: A string representing the name of the connected database.

Example

# 'database' as an instance of the connected database (See connect/create_database methods)
# Retrieve the name of the connected database
db_name = database.database_name()

# Print the database name
print(f"Connected to database: {db_name}")

Database.get_all_documents(document_ids=None)

The get_all_documents method retrieves all documents in the currently connected database. This functionality allows users to access the entire dataset, which is useful for reviewing, processing, or analyzing the data in bulk.

Parameters:

  • document_ids (list, optional): An optional list of document IDs to restrict retrieval to specific documents. If omitted, all documents in the connected database are returned.

Returns:

  • List[Dict]: A list of documents from the connected database. Each document is typically represented as a dictionary, with key-value pairs corresponding to the document’s fields and their respective values. If no documents are found, an empty list is returned.

Example

# 'database' as an instance of the connected database (See connect/create_database methods)
documents = database.get_all_documents()

# Check if any documents were returned and print them
if documents:
    print("Documents retrieved from the database:")
    for doc in documents:
        print(doc['BerryDb'])

Database.get_all_documents_with_col_filter(col_filter=['*'])

The get_all_documents_with_col_filter method retrieves documents from the currently connected database while applying a filter to specify which columns should be included in the returned documents. This method is useful for narrowing down the data returned, allowing you to focus on specific fields of interest.

Parameters:

  • col_filter (list, optional): A list of column names to filter the documents. If no specific columns are provided, the method defaults to ["*"], which retrieves all columns for each document. You can specify column names as strings (e.g., ["name", "age"]) to limit the returned fields.

Returns:

  • List[Dict]: A list of documents from the connected database, with each document represented as a dictionary. The returned documents will only include the columns specified in col_filter. If no documents match the criteria or the database is empty, an empty list is returned.

Example

# 'database' as an instance of the connected database (See connect/create_database methods)
column_filter = ["name", "age"]  # Specify columns to retrieve
filtered_documents = database.get_all_documents_with_col_filter(column_filter)
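
# Each returned document contains only the requested fields
for doc in filtered_documents:
    print(doc)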

Database.get_document_by_object_id(document_id, key_name=None, key_value=None)

The get_document_by_object_id method retrieves documents from the connected database using a specified object/document ID. This method can also apply additional filtering based on an optional key-value pair, allowing for more precise data retrieval. It is useful for locating specific documents or for filtering results based on certain criteria.

Parameters:

  • document_id (str): The unique key or ID of the document you wish to retrieve. This identifier should correspond to an existing document in the connected database.

  • key_name (str, optional): The name of an optional key to filter the documents further. If provided, the method will return documents that match both the document_id and the specified key.

  • key_value (str, optional): The value associated with the key_name that you want to filter by. This value should match the corresponding field in the document. If not specified, only the document_id will be used for retrieval.

Returns:

  • List[Dict]: A list of documents that match the provided document_id or the additional filters. If no documents are found, an empty list will be returned.

Example

document_id = "DOCUMENT_ID"
key_name = "status"
key_value = "active"

# 'database' as an instance of the connected database (See connect/create_database methods)
# Retrieve documents based on the document ID and optional filters
matching_documents = database.get_document_by_object_id(document_id, key_name, key_value)

if matching_documents:
    print("Documents retrieved:")
    for doc in matching_documents:
        # Do operations here
        print(doc)

Database.query(query)

The query method allows users to execute SQL-like queries on the currently connected database. This method provides powerful capabilities for retrieving specific documents based on various conditions, making it essential for data retrieval and analysis.

Parameters:

  • query (str): An SQL-like query string that defines the criteria for retrieving documents from the database. The query can include various clauses such as SELECT, WHERE, ORDER BY, and other SQL commands supported by the database.

Returns:

  • List[Dict]: A list of documents that match the criteria specified in the query. Each document is represented as a dictionary, containing key-value pairs corresponding to the document’s fields. If no documents match the query, an empty list will be returned.

Example

# 'database' as an instance of the connected database (See connect/create_database methods)
database_id = database.databaseId()
query_string = f'SELECT * FROM `BerryDB` WHERE databaseId = "{database_id}" AND age > 30 ORDER BY name'

# Run the query on the database
results = database.query(query_string)

if results:
    print("Documents retrieved from the database:")
    for doc in results:
        # Do operations here
        print(doc)

Database.upsert(documents)

The upsert method allows users to add new documents to the connected database or update existing documents if they already exist. This functionality is useful for maintaining up-to-date records without the need to manually check for existing documents.

Note

To update a document in the database, the document must contain a key called id that matches the ID of the document being edited. Additionally, the document must belong to the currently connected database, which should be part of the organization. If the key id is not present, a random string is assigned as the identifier and a new document is created in the connected database.

Parameters:

  • documents (List[Dict]): A list of document objects to add or update. Each document should have a key "id"; if not, a random string will be assigned.

Note

It is recommended that the “id” key not be included in the documents when creating new entries in the connected database. Allow BerryDB to handle ID creation to prevent clashes and avoid overwriting or inadvertently updating existing documents.

Returns:

  • str: A message indicating the outcome of the operation. This message will specify whether the operation was successful or if there was a failure, providing context for any issues encountered.

Example

documents_to_upsert = [
    {"id": "doc_1", "name": "Alice", "age": 30},
    {"name": "Bob", "age": 25}  # This document will be assigned a random ID
]

# 'database' as an instance of the connected database (See connect/create_database methods)
# Add or update the documents in the database
result = database.upsert(documents_to_upsert)

print(result)

Database.ingest_pdf(file_list, extract_json_path=None)

The ingest_pdf method processes a list of PDF files, extracting their content and adding the resulting documents to the connected database. The extracted data can optionally be stored at a specified JSON path, depending on your needs and schema design.

Parameters:

  • file_list (list[File]): A list of PDF files to be processed. Each file in the list should be a valid PDF that can be ingested and parsed (the example below passes file paths).

  • extract_json_path (str, optional): The JSON path where the extracted data should be added in the JSON document. If not provided, the extracted data will be stored under the ‘content’ field by default

Returns:

  • List[Dict]: Returns a list of extracted documents, with each document represented as a dictionary containing the extracted content.

Example

file_list = ["file1.pdf", "file2.pdf"]  # List of PDF files to ingest
extract_json_path = "content"  # Optional path to save extracted data (Default: "content")
# If, for example, extract_json_path = "content", the extracted text will be under the key "content.text"
# "content" key will be of type 'dict'

# 'database' as an instance of the connected database (See connect/create_database methods)
extracted_documents = database.ingest_pdf(
    file_list=file_list,
    extract_json_path=extract_json_path
)

# Print the documents
for doc in extracted_documents:
    print(doc)
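
# The extracted text lives under "content.text" (per extract_json_path above)
for doc in extracted_documents:
    print(doc["content"]["text"])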

Database.embed(embedding_api_key)

The embed method allows users to generate embeddings for the documents in the database using a specified embedding function, often powered by an OpenAI language model. This is useful for tasks such as similarity search, clustering, or any application that requires semantic understanding of the data.

Parameters:

  • embedding_api_key (str): The API key used to authenticate requests to the embedding provider (for example, OpenAI). This key must be valid and associated with your account.

Returns:

  • str: A message indicating the success or failure of the embedding operation. This message will provide context for any errors encountered during the process.

Example

embedding_api_key = "YOUR_EMBEDDING_API_KEY"

# 'database' as an instance of the connected database (See connect/create_database methods)
# Embed the documents in the database
result = database.embed(
    embedding_api_key=embedding_api_key
)

print(result)

Database.chat(llm_api_key, question, embedding_model_api_key=None)

The chat method is designed to query a database using a language model (LLM). This method takes a user-defined question and returns an answer generated by the LLM, allowing for interaction with the database in a conversational manner.

Parameters:

  • llm_api_key (str): The API key used to authenticate requests to the LLM provider.

  • question (str): The query or question that you want to ask regarding the database. This can be a natural language question or a specific request for information.

  • embedding_model_api_key (str, optional): The API key/token of your embedding model (Only used if the embedding and chat providers do not match)

Returns:

  • str: The generated answer to the query or an error message if the operation fails.

Example

llm_api_key = "OPENAI_API_KEY"
question = "What are the benefits of using machine learning?"

# 'database' as an instance of the connected database (See connect/create_database methods)
# Get answers from the database using the OpenAI API
response = database.chat(
    llm_api_key=llm_api_key,
    question=question
)

print(f"Response: {response}")

Database.chat_for_eval(llm_api_key, question, embedding_model_api_key=None)

The chat_for_eval method is designed to query a database using a language model (LLM) while also tracing the LLM responses for evaluation. This method takes a user-defined question and returns the answer generated by the LLM along with the documents used as context, allowing for interaction with the database in a conversational manner.

Parameters:

  • llm_api_key (str): The API key used to authenticate requests to the LLM provider.

  • question (str): The query or question that you want to ask regarding the database. This can be a natural language question or a specific request for information.

  • embedding_model_api_key (str, optional): The API key/token of your embedding model (Only used if the embedding and chat providers do not match)

Returns:

  • dict: A dictionary containing the generated answer along with the documents used as context for the response, or an error message if the operation fails.

Example

llm_api_key = "OPENAI_API_KEY"
question = "What are the benefits of using machine learning?"

# 'database' as an instance of the connected database (See connect/create_database methods)
# Get answers from the database using the LLM API
response = database.chat_for_eval(
    llm_api_key=llm_api_key,
    question=question,
)

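# The response dict includes the generated answer and the documents used as context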
print(f"Response: {response['answer']}")

Database.similarity_search(llm_api_key, query)

The similarity_search method performs a search of the database and returns matching documents based on the query and the configured settings.

Parameters:

  • llm_api_key (str): The API key used to authenticate requests to the LLM provider.

  • query (str): The query or question that you want to ask regarding the database. This can be a natural language question or a specific request for information.

Returns:

  • dict: A dictionary of search results. Its 'results' key holds the matching documents grouped by search type: 'vector', 'fts', and 'keyword_search'.

Example

llm_api_key = "OPENAI_API_KEY"
query = "What are the benefits of using machine learning?"

# 'database' as an instance of the connected database (See connect/create_database methods)
# Get a list of matching documents for the database
response = database.similarity_search(
    llm_api_key=llm_api_key,
    query=query,
)
results = response['results']
print("vectorDocuments", res['vector'])
print("ftsDocuments", res['fts'])
print("keywordSearchDocuments", res['keyword_search'])

See also

Refer to Settings to learn how to create and save settings

Database.evaluator(llm_api_key, embedding_api_key=None, metrics_database_name='EvalMetricsDB')

The evaluator method initializes and returns an instance of BerryDBRAGEvaluator, configured to assess various metrics for a specific project in BerryDB. This evaluator can be used to measure and track key performance indicators within the database.

Parameters:

  • llm_api_key (str): The API key used to authenticate requests to the LLM API. This key must be valid and associated with your account

  • embedding_api_key (str, optional): This is required only when the chat and embedding models are different. Provide the API key for the embedding model as per settings.

  • metrics_database_name (str, optional): The name of the database where evaluation metrics will be stored. Defaults to “EvalMetricsDB”

Note

This method requires that the database specified by metrics_database_name (or the default EvalMetricsDB) already exists. The method does not create a new database and will raise an error if the specified database is not found.

Returns:

  • BerryDBRAGEvaluator: An instance of the BerryDBRAGEvaluator class, initialized with the specified API keys and project/database names for evaluating and storing metrics.

Example:

# Initialize the evaluator with the required API keys
llm_api_key = "LLM_API_KEY"
embedding_api_key = "EMBEDDING_API_KEY"
evaluator = database.evaluator(
    llm_api_key=llm_api_key,
    embedding_api_key=embedding_api_key,
    metrics_database_name="MyMetricsDB"
)

See also

See eval for instructions on performing additional operations with the evaluator.


Database.ner(json_path, document_ids=[], annotate=False)

The ner (Named Entity Recognition) method processes the text at the specified JSON path and extracts semantic data from the specified documents. This method can also optionally annotate the extracted semantic data back into the documents. The return value varies based on the annotate flag. Please refer to the example for a clearer understanding.

Parameters:

  • json_path (str): The JSON path to the key containing the text for which Named Entity Recognition (NER) should be performed

  • document_ids (list, optional): A list of document IDs representing the specific documents you want to extract semantic data from. If annotate is False and no document IDs are provided, a validation error will be raised

  • annotate (bool, optional): A flag indicating whether to add the extracted semantic data back into the original documents as annotations. Defaults to False. If True, the method returns a hash that can be used to track the annotation job; otherwise, it returns the predictions for the specified document IDs

Returns:

  • If annotate is True: A hash to track the status of the job.

  • If annotate is False: A dictionary with keys corresponding to the individual document IDs specified in document_ids, and the predictions as their values.

Example

# 'database' as an instance of the connected database (See connect/create_database methods)

# Scenario 1: annotate = True
# If the 'document_ids' parameter is not specified, Named Entity Recognition (NER) is
# performed on all documents in the database; if specific document IDs are provided, NER
# is applied only to those documents. The predictions are then added to the individual
# items within the specified documents. This process runs asynchronously, so it may take
# some time for the predictions to appear in the documents. The returned hash can be used
# to track the status of the job; however, this feature is not yet functional and will be
# implemented in the near future.
json_path = "content.text"  # JSON path of the text to analyze (illustrative)
document_ids = ["doc-1", "doc-2"] # (Optional)
hash = database.ner(
        json_path=json_path,
        document_ids=document_ids,
        annotate=True
    )

# Scenario 2: annotate = False
# A list of document IDs in the 'document_ids' parameter is required. If 'annotate' is set
# to False, the predictions are returned as a dict, with the keys corresponding to the
# individual document IDs specified in 'document_ids'. These predictions are not
# automatically added to the documents in the database; instead, they should be added to
# the respective documents using the upsert or query APIs (an example using query is
# provided below).
document_ids = ["doc-3", "doc-4"]
predictions = database.ner(
        json_path=json_path,
        document_ids=document_ids,
        annotate=False
    )

def upsert_annotation_to_document(database_id, document_id, annotation):
    query = f'UPDATE `BerryDb` SET annotations = CASE WHEN ARRAY_LENGTH(annotations) > 0 THEN ARRAY_APPEND(annotations, {annotation}) ELSE [{annotation}] END WHERE databaseId = "{database_id}" AND id = "{document_id}"'
    print("Adding annotation to document with ID: ", document_id)
    database.query(query)

for doc_id in predictions:
    upsert_annotation_to_document(database.databaseId(), doc_id, predictions[doc_id])

Database.text_classification(json_path, labels, document_ids=[], annotate=False)

The text_classification method classifies the text at the specified JSON path against the provided labels for the specified documents. This method can also optionally annotate the predicted labels back into the documents. The return value varies based on the annotate flag. Please refer to the example for a clearer understanding.

Parameters:

  • json_path (str): The JSON path to the key containing the text for which classification should be performed

  • labels (list[str]): A list of strings representing the categories or classes for text classification, with a maximum of 10 labels allowed

  • document_ids (list, optional): A list of document IDs representing the specific documents you want to extract semantic data from. If annotate is False and no document IDs are provided, a validation error will be raised

  • annotate (bool, optional): A flag indicating whether to add the predicted labels back into the original documents as annotations. Defaults to False. If True, the method returns a hash that can be used to track the annotation job; otherwise, it returns the predictions for the specified document IDs

Returns:

  • If annotate is True: A hash to track the status of the job.

  • If annotate is False: A dictionary with keys corresponding to the individual document IDs specified in document_ids, and the predictions as their values.

Example

# 'database' as an instance of the connected database (See connect/create_database methods)

# Scenario 1: annotate = True
# If the 'document_ids' parameter is not specified, Text Classification is performed on
# all documents in the database; if specific document IDs are provided, Text
# Classification is applied only to those documents. The predictions are then added to
# the individual items within the specified documents. This process runs asynchronously,
# so it may take some time for the predictions to appear in the documents. The returned
# hash can be used to track the status of the job; however, this feature is not yet
# functional and will be implemented in the near future.
json_path = "content.text"  # JSON path of the text to classify (illustrative)
document_ids = ["doc-1", "doc-2"] # (Optional)
hash = database.text_classification(
        json_path=json_path,
        labels=['POSITIVE', 'NEGATIVE', 'NEUTRAL'],
        document_ids=document_ids,
        annotate=True
    )

# Scenario 2: annotate = False
# A list of document IDs in the 'document_ids' parameter is required. If 'annotate' is set
# to False, the predictions are returned as a dict, with the keys corresponding to the
# individual document IDs specified in 'document_ids'. These predictions are not
# automatically added to the documents in the database; instead, they should be added to
# the respective documents using the upsert or query APIs (an example using query is
# provided below).
document_ids = ["doc-3", "doc-4"]
predictions = database.text_classification(
        json_path=json_path,
        labels=['POSITIVE', 'NEGATIVE', 'NEUTRAL'],
        document_ids=document_ids,
        annotate=False
    )

def upsert_annotation_to_document(database_id, document_id, annotation):
    query = f'UPDATE `BerryDb` SET annotations = CASE WHEN ARRAY_LENGTH(annotations) > 0 THEN ARRAY_APPEND(annotations, {annotation}) ELSE [{annotation}] END WHERE databaseId = "{database_id}" AND id = "{document_id}"'
    print("Adding annotation to document with ID: ", document_id)
    database.query(query)

for doc_id in predictions:
    upsert_annotation_to_document(database.databaseId(), doc_id, predictions[doc_id])