feat: add AsyncBigtableVectorStore Class for the async-only vector value store implementation #186

mic-k3y merged 63 commits into googleapis:main

Conversation
…context Added the engine class that manages the client and execution context, handling the async/sync conversion via a background event loop. Added a test file for the engine class.
…tion Removes the option to pre-instantiate and pass a BigtableDataClientAsync to the BigtableEngine. The engine now creates its own client instance, making it more self-contained.
Adjusted test cases and assertions to align with the recent modifications in the BigtableEngine class API and behavior.
This change introduces BigtableEngine.async_initialize for non-blocking asynchronous initialization.
…n LangChain's key-value store. This commit introduces the `AsyncBigtableByteStore` class. This class is designed to handle setting, getting, deleting, and yielding keys asynchronously. It will be used by the main public-facing `BigtableByteStore` for all the data operations on the table. It also includes a test suite.
The `shutdown_default_loop` method was not closing the asyncio event loop; it only reassigned the class-level variables. This commit adds a line that closes the loop once the thread is successfully terminated.
# Conflicts: # src/langchain_google_bigtable/engine.py # tests/test_engine.py
…context (googleapis#163)

* feat: add BigtableEngine Class for managing the client and execution context. Adds the engine class that manages the client and execution context, handling the async/sync conversion via a background event loop. Also adds a test file for the engine class.
* fix(engine): update BigtableEngine setup by internalizing client creation. Removes the option to pre-instantiate and pass a BigtableDataClientAsync to the BigtableEngine; the engine now creates its own client instance, making it more self-contained.
* test(engine): update tests to match BigtableEngine class changes. Adjusts test cases and assertions to align with the recent modifications in the BigtableEngine class API and behavior.
* style(engine): update test_engine.py and engine.py format
* test(engine): add tests for async_initialize workflow
* feat(engine): add async_initialize factory and shared loop management. Introduces BigtableEngine.async_initialize for non-blocking asynchronous initialization.
* style(engine): update test_engine.py
# Conflicts: # src/langchain_google_bigtable/async_key_value_store.py
…lementation. This class will be used for interacting with LangChain's key-value store for Bigtable. It contains methods that add, delete, get, and yield key-value pairs from the store, and it implements LangChain's BaseStore(ByteStore) abstract interface. This commit also contains comprehensive test suites, along with a change to the table creation fixture for the engine test that simplifies it from the previous approach.
… raises error. Addresses error-handling test compatibility for Python 3.9.
  "cell_type": "code",
  "source": [
-   "# Embedded document/text have been added to the store\n",
+   "# Embedded document/text have been added to the store. Hashed key for the raw text is used as the key.\n",
is this hashing done by the cached embedding class?
Yes. The hashing is done by the cached embedding class.
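To illustrate the point above, here is a minimal sketch of how a cache-backed embedder can derive a store key from the raw text. LangChain's CacheBackedEmbeddings uses a similar hash-of-text scheme; the function name and namespace format here are illustrative, not the library's actual internals.

```python
import hashlib

def cache_key(namespace: str, text: str) -> str:
    """Build a deterministic store key by hashing the raw text."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return f"{namespace}{digest}"

# Identical text always produces the identical key, so a second
# embed call for the same text becomes a cache hit in the byte store.
key = cache_key("fake-model/", "some document text")
assert key == cache_key("fake-model/", "some document text")
```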
  embedding_service (Embeddings): The embedding service to use.
  content_column (ColumnConfig): ColumnConfig for document content.
  embedding_column (ColumnConfig): ColumnConfig for vector embeddings.
  collection (Optional[str]): The name of the collection (optional). It is used as a prefix for row keys.
I thought we wanted this to not be optional?
The collection is forced in the main store class and is not optional there, which would force the end user to pass the collection string. I put this here as optional for flexibility in case there are any changes. Should I make the argument required here as well?
I just saw the comment about the collection filter and made the change. The collection filter is now a required argument, and I removed the block of code below.
  content_column (ColumnConfig): ColumnConfig for document content.
  embedding_column (ColumnConfig): ColumnConfig for vector embeddings.
  collection (Optional[str]): The name of the collection (optional). It is used as a prefix for row keys.
  metadata_as_json_column (Optional[ColumnConfig]): ColumnConfig for metadata as JSON column.
can you explain this field a bit more
Yes. This puts all the metadata in one serialized JSON object and stores it in the column the user defines here. It is helpful for accessing all the metadata for a document at once, and it can store metadata not defined in the metadata mappings. It is used to retrieve all the metadata in one read in the aget_by_ids method. It is not used for filtering or in the BTQL query string, however.
Sg. Maybe expand the function comment a bit more.
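A small sketch of the behavior described above, assuming illustrative variable names: the same metadata dict is written both as individual mapped columns and as one serialized JSON cell, so a single cell read recovers everything at once.

```python
import json

metadata = {"color": "blue", "year": 2021, "tags": ["a", "b"]}

# Individual mapped columns: one Bigtable column per metadata key,
# usable for filtering in the BTQL query.
mapped_cells = {key: str(value).encode("utf-8") for key, value in metadata.items()}

# The metadata-as-JSON column: everything in one serialized blob,
# convenient when aget_by_ids needs the full metadata in one read.
json_cell = json.dumps(metadata, sort_keys=True).encode("utf-8")

# Reading the JSON cell back restores the complete dict.
restored = json.loads(json_cell)
assert restored == metadata
```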
  encoding (Encoding): The data encoding to use for the column's value.
  """

  column_family: str
the family could default to METADATA_COLUMN_FAMILY here
  def __init__(
      self,
      metadata_key: str,
      encoding: Encoding,
why not take in the ColumnConfig as the parameter instead of encoding and column_qualifier separately
This is used to enforce that all metadata is stored under METADATA_COLUMN_FAMILY ("md"). If we allowed the user to pass in a ColumnConfig, we would also give them the option to use a different column family for the metadata, which could differ from the "md" family used internally in the class for filtering and retrieval.
Hmm I see. Does it need to inherit from ColumnConfig then? Not necessarily opposed to it but just wondering about the choice.
can you also expand the function comment to clarify that cf is always md
Done. It doesn't need to inherit from ColumnConfig, but it might be useful for letting the user know that each metadata mapping is an individual column and not just a regular mapping.
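The design discussed above can be sketched as follows, under stated assumptions: class, enum, and constant names are illustrative stand-ins, not the library's actual identifiers. The point is that the constructor takes only the metadata key and encoding, while the column family is fixed to "md" and cannot be overridden.

```python
from dataclasses import dataclass, field
from enum import Enum

# Assumed constant mirroring the internal family used for filtering/retrieval.
METADATA_COLUMN_FAMILY = "md"

class Encoding(Enum):
    """Stand-in for the library's value-encoding enum."""
    UTF8 = "utf-8"
    INT_BIG_ENDIAN = "int_big_endian"

@dataclass
class VectorMetadataMapping:
    """Maps one metadata key to its own column, always under the 'md' family."""
    metadata_key: str
    encoding: Encoding
    # init=False keeps the family out of the constructor entirely,
    # so every mapping is guaranteed to live under "md".
    column_family: str = field(default=METADATA_COLUMN_FAMILY, init=False)

mapping = VectorMetadataMapping("color", Encoding.UTF8)
assert mapping.column_family == "md"
```

Fixing the family this way means filtering and retrieval code can assume a single known family, at the cost of flexibility the reviewer and author weighed above.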
  async def asimilarity_search_with_score_by_vector(
      self,
      embedding: List[float],
this embedding parameter is really the query vector? Can you use that naming consistently throughout?
Yes. It is the query vector. The problem is that the vector store abstract interface expects the naming to be 'embedding' for these methods. It also uses the 'embedding' argument in the factory methods (from_texts, ...) for the embedding service, which is misleading.
I changed all the "query_vector" arguments to "embedding". It might be confusing, but it is better than overriding the abstract class's signatures.
Okay, just make sure to clarify it as the 'embedding query vector' in the comments.
  Args:
      query (str): The text to find relevant and diverse documents for.
      k (int): Number of documents to return. Defaults to 4.
      fetch_k (int): Number of documents to fetch to pass to MMR algorithm.
maybe there should be a check that fetch_k is greater than k
Good point. I added a check.
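A minimal sketch of the validation being discussed, with an assumed helper name: MMR needs the candidate pool (fetch_k) to be at least as large as the number of results requested (k), so the check rejects inconsistent arguments up front.

```python
def validate_mmr_args(k: int = 4, fetch_k: int = 20) -> None:
    """Reject MMR arguments where the candidate pool is smaller than k."""
    if fetch_k < k:
        raise ValueError(
            f"fetch_k ({fetch_k}) must be greater than or equal to k ({k})"
        )

validate_mmr_args(k=4, fetch_k=20)  # passes silently
try:
    validate_mmr_args(k=10, fetch_k=5)
except ValueError:
    pass  # raised as expected
```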
      f"(STARTS_WITH(_key, @{param_key})) {new_line}"
  )
  param_count += 1
  if filters and "collectionFilter" in filters:
how does the user provide a row key prefix filter? something like "collection:phone#"?
Yes. It would be something like {"collectionFilter": "mydocs:phone#"} to filter within collection "mydocs" using the row key prefix "mydocs:phone#", since every row in that collection has "mydocs:" as its prefix.
If the user doesn't pass the collection filter, the store filters the table by the row key prefix "{collection name}:" by default.
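The prefix logic described in this thread (including the later rename to rowKeyFilter, where the user's filter is appended after the collection prefix) can be sketched like this; the function name is illustrative.

```python
from typing import Optional

def build_row_key_prefix(collection: str, row_key_filter: Optional[str] = None) -> str:
    """The collection always contributes '{collection}:' as the scan prefix;
    an optional user-supplied filter narrows the scan further."""
    prefix = f"{collection}:"
    if row_key_filter:
        prefix += row_key_filter
    return prefix

assert build_row_key_prefix("mydocs") == "mydocs:"          # default scan
assert build_row_key_prefix("mydocs", "phone#") == "mydocs:phone#"
```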
  def _process_value_filters(
      self,
      filter_dict: Dict[str, Any],
It would be good to provide some sort of structure/strong typing to this FilterDict. We could use a TypeAlias here. Maybe something like this:

FilterType: TypeAlias = Literal[
    "RowKeyPrefix",
    "Qualifiers",
    "ColumnQualifierPrefix",
    "ColumnQualifierRegex",
]
OperatorType: TypeAlias = Literal["==", ">", "in", "nin"]
OperatorDict: TypeAlias = Dict[OperatorType, Any]

# A forward reference (string literal "FilterDict") is used for recursion.
FilterDict: TypeAlias = Dict[
    FilterType,
    Union[
        OperatorDict,  # For a metadata key like "color": {"==": "blue"}
        "FilterDict",  # For nesting with "QualifierChainFilter" or "QualifierUnionFilter"
    ],
]

def _process_value_filters(
    self,
    filter_dict: FilterDict,
    # ... other parameters
):
    ...  # implementation
also leaning towards ColumnValueChainFilter instead of QualifierChainFilter. I think it's useful to have the term "Value" there.
Done. I also changed the chain filter names to ColumnValueChainFilter and ColumnValueUnionFilter. I will change the names in the docs as well.
  # User provided a 'collectionPrefix' filter, overriding the default.
  # Update the parameter with the user-specified prefix.
  params[param_key] = filters["collectionFilter"].encode("utf-8")
  if not self.collection and filters and "collectionFilter" in filters:
I think we should simplify all this by enforcing a collection during instantiation
  # Represents the top-level filter dictionary that users can provide.
  FilterDict: TypeAlias = Dict[
      Literal["collectionFilter", "metadataFilter"],
can we not provide both at the same time? Could you update the tests to cover both? Also, since we want to support collection + remaining row key prefix, I'd just call this rowKeyFilter and clarify that the collection is always present by default.
Done. Both can be provided at the same time. It will be something like

filters={"rowKeyFilter": "...", <qualifier filters>, <value and logic filters>}

instead of partitioning the filter into collectionFilter and metadataFilter at the top level as we had before:

filters={"collectionFilter": ..., "metadataFilter": {<qualifier filters>, <value and logic filters>}}

I have also updated the tests to cover both at the same time, in the same dictionary.
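To make the flattened shape concrete, here is an illustrative filters dict combining a row key filter with qualifier and value filters in one top level. Key names follow this discussion; the library's exact identifiers may differ.

```python
# A single flat dict replaces the old collectionFilter/metadataFilter split.
filters = {
    "rowKeyFilter": "phone#",               # appended after "{collection}:"
    "ColumnQualifiers": ["color", "year"],  # qualifier-level filter
    "ColumnValueChainFilter": {             # value filters ANDed together
        "color": {"==": "blue"},
        "year": {">": 2020},
    },
}

# Row key and value filters now coexist at the same level.
assert "rowKeyFilter" in filters and "ColumnValueChainFilter" in filters
```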
  # Represents the top-level filter dictionary. It can contain a `rowKeyFilter`
  # to specify a row key prefix that follows the collection prefix by default.
  # It also supports high-level qualifier filters and nested value-based filters.
  FilterDict: TypeAlias = Dict[str, Union[str, List[str], OperatorDict, ValueFilterDict]]
could the keys here and in ValueFilterDict be constrained to some literal values?
It is possible, but we expect the user to have different metadata keys. Constraining the keys here might constrain the metadata keys the user can filter on.
  param_count += 1

  # Process top-level qualifier filters from the flattened structure.
  if "Qualifiers" in remaining_filters:
maybe ColumnQualifiers to keep it consistent?
  row_key_prefix = f"{self.collection}:"

  # If a 'rowKeyFilter' is provided, append it to the collection prefix.
  if "rowKeyFilter" in remaining_filters:
nit: "RowKeyFilter" - pascal case to keep it consistent with the others
  if final_query_params.distance_strategy == DistanceStrategy.COSINE:
      relevance_score_fn = self._cosine_relevance_score_fn
  else:  # If distance strategy is not declared as COSINE, use EUCLIDEAN by default
      relevance_score_fn = self._euclidean_relevance_score_fn
where is self._euclidean_relevance_score_fn defined?
The class inherits those methods from the abstract VectorStore class.
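For context on the inherited methods, here is a standalone sketch of the distance-to-relevance conversions. The formulas mirror the versions commonly found in LangChain's base VectorStore (cosine: 1 - distance; euclidean: 1 - distance/sqrt(2), assuming unit-normalized embeddings); verify against the installed langchain version before relying on them.

```python
import math

def cosine_relevance_score(distance: float) -> float:
    """Convert cosine distance to a relevance score in [0, 1]."""
    return 1.0 - distance

def euclidean_relevance_score(distance: float) -> float:
    """Convert euclidean distance to relevance, assuming unit-normalized
    embeddings so distances fall in [0, sqrt(2)]."""
    return 1.0 - distance / math.sqrt(2)

assert cosine_relevance_score(0.0) == 1.0       # identical vectors
assert euclidean_relevance_score(0.0) == 1.0    # identical vectors
```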
tests/test_btql_query_builder.py (outdated)
  def test_filter_equal(self, store: AsyncBigtableVectorStore) -> None:
      """Tests the '==' (equal) operator and verifies UTF8 encoding."""
      params = QueryParameters(filters={"color": {"==": "red"}})
I wonder if these standalone column value filters should be in their own "ColumnValueFilter" dict? Maybe that's what's preventing us from having the dict keys be constrained to certain literals? Let me know if that makes it more complex.
Yeah, it's those standalone column value filters that prevent us from constraining the dict keys to certain literals. I delegated the ColumnValueFilter dict translation to the method we already have for value filters (_process_value_filters) and rewrote the tests for this change. Now the top-level filters dict has literal keys.
  assert len(results_cosine) == 1
  doc, score = results_cosine[0]
  assert isinstance(doc, Document)
  assert doc.page_content == "a document about cats"
maybe add some test cases where the query text is not exactly the same as the stored text
I can't add test cases where the query text differs from the stored text unless I use a real embedding model. I am using LangChain's DeterministicFakeEmbedding class for testing, which generates a random embedding seeded by a hash of the text, so only identical text produces matching vectors.
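A minimal standalone sketch of how such a deterministic fake embedding works: the text is hashed to seed an RNG, so identical text always yields the identical vector while different texts yield unrelated ones. This mirrors the idea behind LangChain's DeterministicFakeEmbedding; the function below is illustrative, not the library's implementation.

```python
import hashlib
import random
from typing import List

def fake_embed(text: str, size: int = 8) -> List[float]:
    """Deterministic pseudo-embedding: hash the text, seed an RNG with it."""
    seed = int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]

# Same text -> same vector, which is why only exact-match query text can
# be tested without a real embedding model.
assert fake_embed("a document about cats") == fake_embed("a document about cats")
```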
      self, store: AsyncBigtableVectorStore
  ) -> None:
      """Tests amax_marginal_relevance_search returns the correct top result."""
      texts = ["foo", "bar", "baz", "boo"]
same here, but maybe test data with some redundant values and make sure the top results are varied, like foo, foo2, foo3, bar, baz, etc.
…lementation (googleapis#176)
This class will be used for interacting with LangChain's key-value store for Bigtable. It contains methods that add, delete, get, and yield key-value pairs from the store. It's an implementation of LangChain's BaseStore(ByteStore) abstract interface. This commit also contains comprehensive test suites, along with a change to the table creation fixture for the engine test, simplifying it from the previous approach.

chore(main): release 0.5.0 (googleapis#167)
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

feat: add playground notebook for key-value store (googleapis#179)
* feat: add playground notebook for key-value store. This commit adds a notebook showing how to use key-value stores.
* fix: update embedding model used in the key_value_store notebook
* fix: update readme.rst file
* fix: address PR comments for the notebook

chore(main): release 0.6.0 (googleapis#182)
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

initial get_table and execute_query implementation

feat: add `AsyncBigtableVectorStore` Class for the async-only vector value store implementation (googleapis#186)
* feat: add AsyncBigtableVectorStore Class for the async-only vector value store implementation. This commit adds the async-only vector store class that handles the underlying data operations for the main vector store class, along with its test files.
* fix: include comments on the key-value store usage cells, explaining how the embedding cache key-value store works.
* fix: header date for async_vector_store.py
* fix: typing for tests and async vector store class; add os environment function for the tests
* fix: isort for imports for test files
* fix: revert removed test_async_key_value_store.py file
* fix: address PR comments (several rounds)
* fix: add check to make sure each document has an ID

feat: add `BigtableVectorStore` Class for LangChain Vector Store integration (googleapis#189)
* feat: add `BigtableVectorStore` class for Bigtable <> LangChain Vector Store integration
* fix: isort imports for test file
* fix: fixture return types
* chore: add more test cases
* fix: remove await from sync store test suite
* fix: reformat test_vector_store.py
* fix: address PR comments

chore(main): release 0.7.0 (googleapis#187)
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

feat: add playground notebook for the vector store (googleapis#191)
* feat: add BigtableVectorStore Playground notebook
* fix: update vector_store.ipynb (several rounds)
* feat: update documentation for the library
* fix: make sure typing_extensions is supported across all Python versions supported by this library
* fix: reformat vector_store.ipynb
* fix: address PR comments (several rounds)

chore(main): release 0.8.0 (googleapis#192)
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

resolve merge conflict with upstream
This PR adds the async-only vector store class that handles the underlying data operations for the main vector store class.
Fixes #184 🦕