feat: add Pandas DataFrame support to TabularDataset#1185
feat: add Pandas DataFrame support to TabularDataset#1185sararob merged 12 commits intogoogleapis:mainfrom
Conversation
| df_source (pd.DataFrame): | ||
| Required. Pandas DataFrame containing the source data for | ||
| ingestion as a TabularDataset. | ||
| staging_path (str): |
There was a problem hiding this comment.
It seems the location requirements should also be documented or a reference to this documentation should be provided: https://cloud.google.com/vertex-ai/docs/general/locations#bq-locations
If possible they should be validated but not a hard requirement. Is it possible for the dataset create to fail because of the regional requirements?
There was a problem hiding this comment.
Good point, I just tested it and it can fail if the dataset location doesn't match the project location or the service doesn't have the right access to the dataset. I'll update the docstring to link to that page.
In terms of validating, the BQ client throws this error: google.api_core.exceptions.FailedPrecondition: 400 BigQuery Dataset location eu must be in the same location as the service location us-central1.
Do you think we should validate as well or let the BQ client handle validation? If we do validation, we'd need to use the BQ client to check the location of the provided BQ dataset string.
There was a problem hiding this comment.
Agree with relying on BQ client.
| parquet_options = bigquery.format_options.ParquetOptions() | ||
| parquet_options.enable_list_inference = True | ||
|
|
||
| job_config = bigquery.LoadJobConfig( |
There was a problem hiding this comment.
Will this config infer all the types? I see the enable_list_inference but I couldn't find a reference in the BQ docs for non list type inference.
There was a problem hiding this comment.
Yes, this will infer the data types from the DF. From the BQ docs:
The destination table to use for loading the data. If it is an
existing table, the schema of the :class:`~pandas.DataFrame`
must match the schema of the destination table. If the table
does not yet exist, the schema is inferred from the
:class:`~pandas.DataFrame`.
I added a bq_schema param to give the user the option to override the data type autodection, but I think I may need to add more client-side validation on that.
There was a problem hiding this comment.
I think we can rely on BQ client validation.
|
|
||
| _TEST_LABELS = {"my_key": "my_value"} | ||
|
|
||
| # create_from_dataframe |
There was a problem hiding this comment.
Preference for an integration test as well.
There was a problem hiding this comment.
Added 2 integration tests: one with the bq_schema param and one without.
| df_source (pd.DataFrame): | ||
| Required. Pandas DataFrame containing the source data for | ||
| ingestion as a TabularDataset. | ||
| staging_path (str): |
There was a problem hiding this comment.
Agree with relying on BQ client.
| from google.auth import credentials as auth_credentials | ||
|
|
||
| from google.cloud import bigquery | ||
| from google.cloud.bigquery import _pandas_helpers |
There was a problem hiding this comment.
We shouldn't rely on a private module from a dependent python package because it may break between minor versions.
| Required. The user-provided BigQuery schema. | ||
| """ | ||
| try: | ||
| _pandas_helpers.dataframe_to_arrow(dataframe, schema) |
There was a problem hiding this comment.
If the BQ client handles this validation we can rely on the client and avoid importing this private module.
There was a problem hiding this comment.
Removed this validation check and letting the BQ client handle it.
| parquet_options = bigquery.format_options.ParquetOptions() | ||
| parquet_options.enable_list_inference = True | ||
|
|
||
| job_config = bigquery.LoadJobConfig( |
There was a problem hiding this comment.
I think we can rely on BQ client validation.
| dataset.location = _TEST_LOCATION | ||
| shared_state["bigquery_dataset"] = bigquery_client.create_dataset(dataset) | ||
|
|
||
| yield |
There was a problem hiding this comment.
It seems like BQ dataset deletion should be handled after the yield.
|
|
||
| aiplatform.init(project=_TEST_PROJECT, location=_TEST_LOCATION) | ||
|
|
||
| tabular_dataset = aiplatform.TabularDataset.create_from_dataframe( |
There was a problem hiding this comment.
tabular_dataset should be appended to shared_state['resources'] so it's deleted after the test
see example: https://github.com/googleapis/python-aiplatform/blob/main/tests/system/aiplatform/test_e2e_tabular.py#L89
There was a problem hiding this comment.
Updated both new tests to use shared_state['resources']
| ] | ||
|
|
||
|
|
||
| class TestDataset: |
There was a problem hiding this comment.
This should enherit from the base e2e class so it can take advantage of the resource cleanup functionality: https://github.com/googleapis/python-aiplatform/blob/main/tests/system/aiplatform/e2e_base.py#L37
There was a problem hiding this comment.
Updated to inherit from e2e_base. I noticed there was some duplication of other fixtures in this file so I removed that and inherited the fixtures from the base class.
| bq_schema (Optional[Union[str, bigquery.SchemaField]]): | ||
| Optional. The schema to use when creating the staging table in BigQuery. For more details, | ||
| see: https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.LoadJobConfig#google_cloud_bigquery_job_LoadJobConfig_schema | ||
| This is not needed if the BigQuery table provided in `staging_path` already exists. |
There was a problem hiding this comment.
Does this imply when the staging table does not exit, user need to pass in BQ schema? No existing staging table is the most common case, and this would be a bit hard to use - users need to learn how to create BQ schema, etc.
Given df_source already has a schema, can we auto generate the BQ schema for users?
There was a problem hiding this comment.
The best case is BQ client support inferencing schema natively, could you check on that? BQ UI supports inferencing schema for JSONL. Haven't tried Parquet.
There was a problem hiding this comment.
The BQ client will autodetect the schema using the DataFrame column types if the user doesn't provide this bq_schema parameter, this happens here. So I'm guessing most users won't use this schema param to override the autodetect.
I think my docstring was confusing. I updated it, let me know if this is clearer:
Optional. If not set, BigQuery will autodetect the schema using your DataFrame's column types.
If set, BigQuery will use the schema you provide when creating the staging table. For more details,
see: https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.LoadJobConfig#google_cloud_bigquery_job_LoadJobConfig_schema
Adds support for uploading a local Pandas DataFrame as a Vertex TabularDataset via a
create_from_dataframemethod on theTabularDatasetclass.Also added relevant tests to
test_datasets.pyFixes b/189369695 🦕