
feat: add AsyncBigtableVectorStore Class for the async-only vector value store implementation#186

Merged
mic-k3y merged 63 commits into googleapis:main from mic-k3y:main
Aug 27, 2025

Conversation

@mic-k3y
Copy link
Contributor

@mic-k3y mic-k3y commented Aug 22, 2025

This PR adds the async-only vector store class that handles the underlying data operations for the main vector store class.

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #184 🦕

mic-k3y and others added 30 commits July 17, 2025 16:51
…context

Added the engine class that manages the client and execution context which handles the async and sync conversion via a background event loop.

Added a test file for the engine class.
…tion

Removes the option to pre-instantiate and pass a BigtableDataClientAsync
to the BigtableEngine. The engine now creates its own client instance,
making it more self-contained.
Adjusted test cases and assertions to align with the recent modifications
in the BigtableEngine class API and behavior.
This change introduces BigtableEngine.async_initialize for non-blocking
asynchronous initialization.
…n LangChain's key-value store.

This commit introduces the `AsyncBigtableByteStore` class. This class is designed to handle setting,
getting, deleting, and yielding keys asynchronously. It will be used by the main public-facing
`BigtableByteStore` for all the data operations on the table. It also includes a test suite.
The `shutdown_default_loop` method was not closing the asyncio event loop; only the class-level variables were reassigned. This commit adds a line that closes the loop once the thread is successfully terminated.
# Conflicts:
#	src/langchain_google_bigtable/engine.py
#	tests/test_engine.py
…context (googleapis#163)

* feat: add BigtableEngine Class for managing the client and execution context

Added the engine class that manages the client and execution context which handles the async and sync conversion via a background event loop.

Added a test file for the engine class.

* fix(engine): update BigtableEngine setup by internalizing client creation

Removes the option to pre-instantiate and pass a BigtableDataClientAsync
to the BigtableEngine. The engine now creates its own client instance,
making it more self-contained.

* test(engine): update tests to match BigtableEngine class changes

Adjusted test cases and assertions to align with the recent modifications
in the BigtableEngine class API and behavior.

* style(engine): update test_engine.py and engine.py format

* test(engine): add tests for async_initialize workflow

* feat(engine): add async_initialize factory and shared loop management

This change introduces BigtableEngine.async_initialize for non-blocking
asynchronous initialization.

* style(engine): update test_engine.py
# Conflicts:
#	src/langchain_google_bigtable/async_key_value_store.py
…lementation.

This class will be used for interacting with LangChain's key-value store for Bigtable. It contains methods that add, delete, get, and yield key-value pairs from the store. It's an implementation of LangChain's BaseStore(ByteStore) abstract interface. This commit also contains comprehensive test suites, as well as a change to the table creation fixture for the engine test, simplifying it from the previous approach.
… raises error.

address error handling test compatibility for Python 3.9
@mic-k3y mic-k3y requested a review from a team August 22, 2025 22:50
@mic-k3y mic-k3y requested review from ad548 and georgecma August 22, 2025 22:50
@mic-k3y mic-k3y assigned mic-k3y and unassigned georgecma Aug 22, 2025
@mic-k3y mic-k3y requested review from ad548 and removed request for ad548 August 26, 2025 15:19
"cell_type": "code",
"source": [
"# Embedded document/text have been added to the store\n",
"# Embedded document/text have been added to the store. Hashed key for the raw text is used as the key.\n",
Member

is this hashing done by the cached embedding class?

Contributor Author

Yes. The hashing is done by the cached embedding class.
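For context, the cached embedding class derives the store key by hashing the raw text under a namespace, so identical text always hits the same cache entry. A minimal self-contained sketch of that idea (the helper name `make_cache_key` and the namespace are illustrative, not the library's API):

```python
import hashlib

def make_cache_key(namespace: str, text: str) -> str:
    """Derive a deterministic store key by hashing the raw text.

    Mirrors the idea behind cache-backed embeddings: the same text
    always maps to the same key, so cached vectors can be reused.
    """
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return f"{namespace}{digest}"

key = make_cache_key("fake-model/", "a document about cats")
# Stable across calls for identical text.
assert key == make_cache_key("fake-model/", "a document about cats")
```

The real key format used by LangChain's cache-backed embeddings may differ; the point is only that the key is a pure function of the text.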

embedding_service (Embeddings): The embedding service to use.
content_column (ColumnConfig): ColumnConfig for document content.
embedding_column (ColumnConfig): ColumnConfig for vector embeddings.
collection (Optional[str]): The name of the collection (optional). It is used as a prefix for row keys.
Member

I thought we wanted this to not be optional?

Contributor Author

The collection is forced in the main store class and is not optional there, which would force the end user to pass the collection string. I put this here as optional for flexibility in case there are any changes. Should I make the argument required here as well?

Contributor Author

I just saw the comment about the collection filter and made the change. The collection filter is now a required argument, and I removed the block of code below.

content_column (ColumnConfig): ColumnConfig for document content.
embedding_column (ColumnConfig): ColumnConfig for vector embeddings.
collection (Optional[str]): The name of the collection (optional). It is used as a prefix for row keys.
metadata_as_json_column (Optional[ColumnConfig]): ColumnConfig for metadata as JSON column.
Member

can you explain this field a bit more

Contributor Author

Yes. This puts all the metadata in one serialized JSON object and stores it in the column the user defines here. It is helpful for accessing all of a document's metadata at once, and it can hold metadata not defined in the metadata mappings. It is used by the aget_by_ids method to retrieve the metadata in one read; however, it is not used for filtering or in the BTQL query string.
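As a small self-contained sketch of that behavior (function names are illustrative, not the store's API), the whole metadata dict round-trips through a single cell value:

```python
import json

def encode_metadata_as_json(metadata: dict) -> bytes:
    """Serialize the full metadata dict into one Bigtable cell value.

    Storing a single JSON blob lets a get-by-ids path recover every
    metadata field in one read, including keys that were never declared
    in the per-column metadata mappings.
    """
    return json.dumps(metadata, sort_keys=True).encode("utf-8")

def decode_metadata_from_json(cell_value: bytes) -> dict:
    """Inverse of encode_metadata_as_json."""
    return json.loads(cell_value.decode("utf-8"))

metadata = {"color": "blue", "source": "docs", "page": 3}
cell = encode_metadata_as_json(metadata)
assert decode_metadata_from_json(cell) == metadata
```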

Member

Sg. Maybe expand the function comment a bit more.

Contributor Author

Done.

encoding (Encoding): The data encoding to use for the column's value.
"""

column_family: str
Member

the family could default to METADATA_COLUMN_FAMILY here

Contributor Author

Done.

def __init__(
self,
metadata_key: str,
encoding: Encoding,
Member

why not take in the ColumnConfig as the parameter instead of encoding and column_qualifier separately

Contributor Author
@mic-k3y mic-k3y Aug 26, 2025

This is used to enforce that all metadata is stored under METADATA_COLUMN_FAMILY ("md"). If we let the user pass in the ColumnConfig, we would also be giving them the option to choose a different column family for the metadata, which could diverge from the "md" family the class uses internally for filtering and retrieval.
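A minimal sketch of that design choice (class and constant names are illustrative, not the library's): by taking only the key and encoding, the mapping pins every metadata column to the "md" family and removes the option of choosing another one.

```python
from dataclasses import dataclass, field

# Family the store always scans for metadata filters and retrieval.
METADATA_COLUMN_FAMILY = "md"

@dataclass
class MetadataMapping:
    """Mapping that pins every metadata column to the 'md' family.

    Accepting only metadata_key and encoding (instead of a full
    ColumnConfig) means callers cannot pick a different column family,
    so filtering and retrieval can rely on METADATA_COLUMN_FAMILY.
    """
    metadata_key: str
    encoding: str = "utf-8"
    # init=False: callers cannot override the family at construction.
    column_family: str = field(default=METADATA_COLUMN_FAMILY, init=False)

mapping = MetadataMapping(metadata_key="color")
assert mapping.column_family == "md"
```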

Member

Hmm I see. Does it need to inherit from ColumnConfig then? Not necessarily opposed to it but just wondering about the choice.

Member

can you also expand the function comment to clarify that cf is always md

Contributor Author

Done. It doesn't need to inherit from ColumnConfig, but it might be useful to signal to the user that each metadata mapping is an individual column, not just a regular mapping.


async def asimilarity_search_with_score_by_vector(
self,
embedding: List[float],
Member

this embedding parameter is really the query vector? Can you use that naming consistently throughout?

Contributor Author

Yes, it is the query vector. The problem is that the vector store abstract interface expects these methods to name the parameter 'embedding'. It also uses 'embedding' in the factory methods ("from_texts", ...) for the embedding service, which is misleading.

Contributor Author

I changed all the "query_vector" arguments to "embedding". It might be confusing, but that is better than overriding the abstract class.

Member

Okay, just make sure to clarify it as the 'embedding query vector' in the comments.

Contributor Author

Done.

Args:
query (str): The text to find relevant and diverse documents for.
k (int): Number of documents to return. Defaults to 4.
fetch_k (int): Number of documents to fetch to pass to MMR algorithm.
Member

maybe there should be a check that fetch_k is greater than k

Contributor Author

Good point. I added a check.
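A hedged sketch of such a guard (the function name is illustrative; the PR's actual check may differ in message and placement):

```python
def validate_mmr_args(k: int, fetch_k: int) -> None:
    """Reject MMR arguments where fewer candidates are fetched than returned.

    MMR re-ranks fetch_k candidates down to k diverse results, so
    fetch_k must be at least k for the request to make sense.
    """
    if k < 1:
        raise ValueError(f"k must be >= 1, got {k}")
    if fetch_k < k:
        raise ValueError(
            f"fetch_k ({fetch_k}) must be >= k ({k}); the MMR algorithm "
            "re-ranks fetch_k candidates down to k results."
        )

validate_mmr_args(k=4, fetch_k=20)  # ok: 20 candidates, 4 results
```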

f"(STARTS_WITH(_key, @{param_key})) {new_line}"
)
param_count += 1
if filters and "collectionFilter" in filters:
Member

how does the user provide a row key prefix filter? something like "collection:phone#"?

Contributor Author
@mic-k3y mic-k3y Aug 26, 2025

Yes. It would be something like {"collectionFilter": "mydocs:phone#"} for filtering in collection "mydocs" with row key prefix "mydocs:phone#", since every row in that collection has "mydocs:" as its prefix.

Contributor Author

If the user doesn't pass the collection filter, the store filters the table by the row key prefix "{collection name}:" by default.
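The prefix construction the discussion describes can be sketched as follows (the helper name is illustrative; the store builds this inside its query builder):

```python
from typing import Optional

def build_row_key_prefix(collection: str, row_key_filter: Optional[str] = None) -> str:
    """Build the row key prefix used for a collection scan.

    Every row in a collection starts with '<collection>:'. A
    user-supplied row key filter is appended after that mandatory
    collection prefix rather than replacing it.
    """
    prefix = f"{collection}:"
    if row_key_filter:
        prefix += row_key_filter
    return prefix

assert build_row_key_prefix("mydocs") == "mydocs:"
assert build_row_key_prefix("mydocs", "phone#") == "mydocs:phone#"
```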


def _process_value_filters(
self,
filter_dict: Dict[str, Any],
Member

It would be good to provide some sort of structure/strong typing to this FilterDict. We could use a TypeAlias here. Maybe something like this

FilterType: TypeAlias = Literal[
    "RowKeyPrefix",
    "Qualifiers",
    "ColumnQualifierPrefix",
    "ColumnQualifierRegex",
]

OperatorType: TypeAlias = Literal["==", ">", "in", "nin"]

OperatorDict: TypeAlias = Dict[OperatorType, str]

# A forward reference (string literal "FilterDict") is used for recursion.
FilterDict: TypeAlias = Dict[
    FilterType,
    Union[
        OperatorDict,  # For a metadata key like "color": {"==": "blue"}
        "FilterDict",  # For nesting with "QualifierChainFilter" or "QualifierUnionFilter"
    ],
]

def _process_value_filters(
    self,
    filter_dict: FilterDict,
    # ... other parameters
):
    ...  # implementation

Member

also leaning towards ColumnValueChainFilter instead of QualifierChainFilter. I think it's useful to have the term "Value there"

Contributor Author
@mic-k3y mic-k3y Aug 26, 2025

Done. I also changed the chain filter names to ColumnValueChainFilter and ColumnValueUnionFilter. I will change the names in the docs as well.

# User provided a 'collectionPrefix' filter, overriding the default.
# Update the parameter with the user-specified prefix.
params[param_key] = filters["collectionFilter"].encode("utf-8")
if not self.collection and filters and "collectionFilter" in filters:
Member

I think we should simplify all this by enforcing a collection during instantiation

Contributor Author

Done.

@mic-k3y mic-k3y requested a review from ad548 August 26, 2025 22:32

# Represents the top-level filter dictionary that users can provide.
FilterDict: TypeAlias = Dict[
Literal["collectionFilter", "metadataFilter"],
Member

can we not provide both at the same time? could you update the tests to support both? Also since we want to support collection + remaining row key prefix, I'd just call this rowKeyFilter and clarify that collection is always present by default

Contributor Author
@mic-k3y mic-k3y Aug 27, 2025

Done. Both can be provided at the same time. It will be something like

filters={"rowKeyFilter": "...", Qualifier Filters, Value and Logic Filters}

instead of partitioning the filters into collectionFilter and metadataFilter at the top level as before:

filters={"collectionFilter": ..., "metadataFilter": {Qualifier Filters, Value and Logic Filters}}.

I have also updated the tests to support both at the same time and in the same dictionary.
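To make the flattened shape concrete, here is a hedged example of what a combined filters dict might look like. The key names follow the naming discussed in this thread, but the exact accepted keys and value shapes are defined by the store's query builder, not by this sketch:

```python
# Flattened filters dict: the row key filter sits at the same top level
# as qualifier and value filters, instead of being split into separate
# collectionFilter / metadataFilter sub-dicts.
filters = {
    # Appended after the mandatory "<collection>:" prefix.
    "RowKeyFilter": "phone#",
    # Restrict the scan to these metadata column qualifiers.
    "ColumnQualifiers": ["color", "source"],
    # Value filters combined with AND semantics.
    "ColumnValueChainFilter": {
        "color": {"==": "blue"},
        "page": {">": "2"},
    },
}

assert set(filters) == {"RowKeyFilter", "ColumnQualifiers", "ColumnValueChainFilter"}
```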

@mic-k3y mic-k3y requested a review from ad548 August 27, 2025 12:30
Member
@ad548 ad548 left a comment

Last few comments.

# Represents the top-level filter dictionary. It can contain a `rowKeyFilter`
# to specify a row key prefix that follows the collection prefix by default.
# It also supports high-level qualifier filters and nested value-based filters.
FilterDict: TypeAlias = Dict[str, Union[str, List[str], OperatorDict, ValueFilterDict]]
Member

could the keys here and in ValueFilterDict be constrained to some literal values?

Contributor Author

It is possible, but we expect the user to have different metadata keys. Constraining the keys here might constrain the metadata keys the user can filter on.

def __init__(
self,
metadata_key: str,
encoding: Encoding,
Member

Hmm I see. Does it need to inherit from ColumnConfig then? Not necessarily opposed to it but just wondering about the choice.

def __init__(
self,
metadata_key: str,
encoding: Encoding,
Member

can you also expand the function comment to clarify that cf is always md

content_column (ColumnConfig): ColumnConfig for document content.
embedding_column (ColumnConfig): ColumnConfig for vector embeddings.
collection (Optional[str]): The name of the collection (optional). It is used as a prefix for row keys.
metadata_as_json_column (Optional[ColumnConfig]): ColumnConfig for metadata as JSON column.
Member

Sg. Maybe expand the function comment a bit more.

param_count += 1

# Process top-level qualifier filters from the flattened structure.
if "Qualifiers" in remaining_filters:
Member

maybe ColumnQualifiers to keep it consistent?

Contributor Author

Done.

row_key_prefix = f"{self.collection}:"

# If a 'rowKeyFilter' is provided, append it to the collection prefix.
if "rowKeyFilter" in remaining_filters:
Member

nit: "RowKeyFilter" - pascal case to keep it consistent with others

Contributor Author

Done.


async def asimilarity_search_with_score_by_vector(
self,
embedding: List[float],
Member

Okay, just make sure to clarify it as the 'embedding query vector' in the comments.

if final_query_params.distance_strategy == DistanceStrategy.COSINE:
relevance_score_fn = self._cosine_relevance_score_fn
else: # If distance to use not declared as COSINE, uses EUCLIDEAN by default
relevance_score_fn = self._euclidean_relevance_score_fn
Member

where is self._euclidean_relevance_score_fn defined?

Contributor Author

The class inherits those methods from the abstract VectorStore class.
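For reference, the base class's relevance score helpers behave roughly like the following self-contained sketch (function names here are standalone stand-ins for the inherited `_cosine_relevance_score_fn` / `_euclidean_relevance_score_fn`; exact formulas belong to LangChain's VectorStore base class):

```python
import math

def cosine_relevance_score(distance: float) -> float:
    """Map a cosine distance to a relevance score where 1.0 is identical."""
    return 1.0 - distance

def euclidean_relevance_score(distance: float) -> float:
    """Map a euclidean distance between unit vectors into a 0..1 score.

    sqrt(2) is the maximum euclidean distance between two unit vectors
    with non-negative similarity, so the score hits 0 at that distance.
    """
    return 1.0 - distance / math.sqrt(2)

assert cosine_relevance_score(0.0) == 1.0
assert euclidean_relevance_score(0.0) == 1.0
```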


def test_filter_equal(self, store: AsyncBigtableVectorStore) -> None:
"""Tests the '==' (equal) operator and verifies UTF8 encoding."""
params = QueryParameters(filters={"color": {"==": "red"}})
Member

I wonder if these standalone column value filters should be in there own "ColumnValueFilter" dict? Maybe that's what's preventing us from having the dict keys be constrained to certain literals? Let me know if that makes it more complex or something.

Contributor Author
@mic-k3y mic-k3y Aug 27, 2025

Yeah. It's those standalone column value filters preventing us from having the dict keys be constrained to certain literals. I delegated the ColumnValueFilter dict translation task to the method we already have for value filters(_process_value_filters) and rewrote the tests for this change. Now, I have literals for the top-level filters dict.

Member

This works!

@mic-k3y mic-k3y requested a review from ad548 August 27, 2025 19:41
ad548
ad548 previously approved these changes Aug 27, 2025
assert len(results_cosine) == 1
doc, score = results_cosine[0]
assert isinstance(doc, Document)
assert doc.page_content == "a document about cats"
Member

maybe add some test cases where the query text is not exactly the same as the stored text

Contributor Author

I can't add test cases where the query text differs from the stored text unless I use a real embedding model. I am using LangChain's DeterministicFakeEmbedding class for testing; it hashes the text to seed a random embedding, so the same text always produces the same vector, but similar texts produce unrelated vectors.
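A tiny self-contained stand-in shows why such a fake embedder cannot support fuzzy-match tests (this is an illustrative re-implementation of the idea, not LangChain's actual class):

```python
import hashlib
import random
from typing import List

class TinyDeterministicEmbedding:
    """Illustrative stand-in for a deterministic fake embedding.

    The text is hashed to seed a PRNG, so identical text always yields
    an identical vector, while near-identical texts yield unrelated
    vectors. That is why "similar but not equal" test cases require a
    real embedding model.
    """

    def __init__(self, size: int = 8) -> None:
        self.size = size

    def embed_query(self, text: str) -> List[float]:
        seed = int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16) % (2**32)
        rng = random.Random(seed)
        return [rng.uniform(-1.0, 1.0) for _ in range(self.size)]

emb = TinyDeterministicEmbedding(size=4)
assert emb.embed_query("cats") == emb.embed_query("cats")
assert emb.embed_query("cats") != emb.embed_query("cats!")  # unrelated vectors
```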

self, store: AsyncBigtableVectorStore
) -> None:
"""Tests amax_marginal_relevance_search returns the correct top result."""
texts = ["foo", "bar", "baz", "boo"]
Member

same here, but maybe test data with some redundant values and make sure the top results are varied - like foo, foo2, foo3, bar, baz, etc.

@mic-k3y mic-k3y merged commit 18d4a33 into googleapis:main Aug 27, 2025
11 checks passed
yangyzs pushed a commit to yangyzs/langchain-google-bigtable-python that referenced this pull request Aug 31, 2025
…lementation (googleapis#176)

This class will be used for interacting with LangChain's key-value store for Bigtable. It contains methods that add, delete, get, and yield key-value pairs from the store. It's an implementation of LangChain's BaseStore(ByteStore) abstract interface. This commit also contains comprehensive test suites. It also contains a change for the table creation fixture for the engine test, simplifying it from previous approach.

chore(main): release 0.5.0 (googleapis#167)

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

feat: add playground notebook for key-value store (googleapis#179)

* feat: add playground notebook for key-value store

This commit adds a notebook for how to use key-value stores.

* fix: update embedding model used in key_value_store notebook

This commit also updates the embedding model used for playground notebook.

* fix: update readme.rst file

* fix: address PR comments for the notebook

chore(main): release 0.6.0 (googleapis#182)

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

initial get_table and execute_query implementation

feat: add `AsyncBigtableVectorStore` Class for the async-only vector value store implementation (googleapis#186)

* feat: add AsyncBigtableVectorStore Class for the async-only vector value store implementation

This commit adds the async-only vector store class that handles the underlying data operations for the main vector store class along with its test files.

* fix: include comments on the key-value store usage cells

Included comments explaining how the embedding cache key-value store works.

* fix: include comments on the key-value store usage cells

Included comments explaining how the embedding cache key-value store works.

* fix: header date for async_vector_store.py

fix: include comments on the key-value store usage cells

Included comments explaining how the embedding cache key-value store works.

* fix: typing for tests and async vector store class

fix: add os environment function for the tests

* fix: isort for imports for test files

* fix: revert removed test_async_key_value_store.py file.

* fix: address PR comments

* fix: address PR comments

* fix: address PR comments

* fix: address PR comments

* fix: address PR comments

* fix: address PR comments

* fix: address PR comments

* fix: add check to make sure each document has an ID.

feat: add `BigtableVectorStore` Class for LangChain Vector Store integration (googleapis#189)

* feat: add `BigtableVectorStore` class for Bigtable <> LangChain Vector Store integration.

* feat: add `BigtableVectorStore` class for Bigtable <> LangChain Vector Store integration.

* fix: isort imports for test file.

* fix: fixture return types.

* chore: add more test cases

* fix: remove await from sync store test suite

* fix: reformat test_vector_store.py

* fix: address PR comments.

chore(main): release 0.7.0 (googleapis#187)

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

feat: add playground notebook for the vector store (googleapis#191)

* feat: add BigtableVectorStore Playground notebook.

* feat: add BigtableVectorStore Playground notebook.

fix: update vector_store.ipynb

* fix: update vector_store.ipynb

* fix: update vector_store.ipynb

* fix: update vector_store.ipynb

* feat: change documentation for the library

feat: add BigtableVectorStore Playground notebook.

fix: make sure the typing_extensions is supported across all python versions supported by this library

feat: change documentation for the library

fix: update vector_store.ipynb

fix: update vector_store.ipynb

fix: update vector_store.ipynb

feat: add BigtableVectorStore Playground notebook.

fix: update vector_store.ipynb

* feat: update documentation for the library

feat: update documentation for the library

* feat: update documentation for the library

feat: update documentation for the library

feat: update documentation for the library

* feat: update documentation for the library

* fix: reformat vector_store.ipynb

* fix: address PR comments

* fix: address PR comments.

* fix: address PR comments.

chore(main): release 0.8.0 (googleapis#192)

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

resolve merge conflict with upstream

Labels

api: bigtable Issues related to the googleapis/langchain-google-bigtable-python API.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add AsyncBigtableVectorStore Class for the async-only vector value store implementation

3 participants