BigQuery: max_results is ignored if bqstorage_client is used in to_dataframe or to_arrow #9174

@tswast

Description

Steps to reproduce

  1. Call list_rows with max_results set.
  2. Call to_dataframe or to_arrow.
  3. Observe that more rows were returned than were requested.

Code example

from google.cloud import bigquery
from google.cloud import bigquery_storage
bqclient = bigquery.Client()
bqstorage_client = bigquery_storage.BigQueryStorageClient()

df_tabledata_list = bqclient.list_rows(
    "bigquery-public-data.utility_us.country_code_iso",
    selected_fields=[bigquery.SchemaField("country_name", "STRING")],
    max_results=100,
).to_dataframe()
print("tabledata.list: {} rows".format(len(df_tabledata_list.index)))

df_bqstorage = bqclient.list_rows(
    "bigquery-public-data.utility_us.country_code_iso",
    selected_fields=[bigquery.SchemaField("country_name", "STRING")],
    max_results=100,
).to_dataframe(bqstorage_client=bqstorage_client)
print("bqstorage: {} rows".format(len(df_bqstorage.index)))

Output

tabledata.list: 100 rows
bqstorage: 278 rows

Possible fixes

  1. (Harder) Keep track of how many rows you've downloaded in a BQ Storage session so far. Once you've downloaded enough rows, close all streams (is this even possible?).
  2. (Easier, but acceptable) If max_results is set, always download data with tabledata.list.
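As a rough sketch of fix (1), the consumer could count rows across the session's streams and stop once the cap is reached. The name `limit_rows` and its arguments are hypothetical, not part of the google-cloud-bigquery API, and actually closing the underlying gRPC streams is the open question noted above:

```python
def limit_rows(streams, max_results):
    """Yield at most max_results rows from an iterable of row streams.

    Hypothetical sketch for fix (1). Unread streams are simply
    abandoned here; cleanly closing them server-side is the hard part.
    """
    count = 0
    for stream in streams:
        for row in stream:
            if count >= max_results:
                return  # remaining streams are never read
            yield row
            count += 1


# Simulate three streams totaling 278 rows, analogous to the repro above.
fake_streams = [range(100), range(100), range(78)]
rows = list(limit_rows(fake_streams, 100))
print(len(rows))  # 100
```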

I think we should implement fix (2): when max_results is set, it's unlikely we are downloading enough rows for the BQ Storage API to provide a benefit over tabledata.list.

Metadata

Labels

  * api: bigquery (Issues related to the BigQuery API.)
  * api: bigquerystorage (Issues related to the BigQuery Storage API.)
  * priority: p2 (Moderately-important priority. Fix may not be included in next release.)
  * type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
