BigQuery: Add support to Dataset for project_ids with org prefix.#8877
Conversation
added support to Dataset for project_ids with org prefix
updated tests to check dataset chgs
|
@emar-kar, you should run black reformat on |
| if with_prefix is None: | ||
| parts = dataset_id.split(".") | ||
| else: | ||
| parts = with_prefix.group("ref").split(".") |
There was a problem hiding this comment.
I'm confused. What is this doing? I think the prefix needs to be part of the project ID, right?
There was a problem hiding this comment.
From the issue's Stack trace:
The error occurs due to the prefix google.com:. Previously the passed string was separated only by ., what led to ValueError raising because of the len(parts) > 2 at google.com:[project].ryan_dataset. As I see here prefix is not the part of the Project ID itself. I was trying to find out, how could I parse the string and fulfill both the previous and new format. That is why I decided to use regular expressions. Now, with template's help, it will separate prefix and solve several situations:
string-project.string_dataset- will pass the template successfully as it was before;prefix:string-project.string_dataset- will group the part without prefix and then will divide it;string-project:string_dataset- if thedefault_projectwas not defined raisesValueError;google.com:project:dataset_id- same as above.
ValueError: Too many parts in dataset_id. Expected a fully-qualified dataset ID in standard SQL format. e.g. "project.dataset_id", got google.com:[project].ryan_dataset
There was a problem hiding this comment.
Right, but it appears to me that we're discarding the prefix? Is that correct?
There was a problem hiding this comment.
Yeah, that is what I was thinking before this conversation. So, now I'm a bit confused. I thought the prefix is an extra part and should be just removed. But if it is actually the part of the Project ID, I'll need to reconfigure the pattern.
Applying requested chgs. // Removed description for 'single prefix'.
| from google.cloud.bigquery.table import TableReference | ||
|
|
||
|
|
||
| _W_PREFIX = re.compile( |
There was a problem hiding this comment.
Can we pick a better name for this? Maybe _PROJECT_PREFIX_PATTERN?
|
|
||
| _W_PREFIX = re.compile( | ||
| r""" | ||
| (\S*)\:(?P<ref>\S*) |
There was a problem hiding this comment.
Since at least one character is required, this should probably be \S+, right?
Also, ref isn't all that meaningful to me. How about remaining, since it's everything after the : character?
| if with_prefix is None: | ||
| parts = dataset_id.split(".") | ||
| else: | ||
| parts = with_prefix.group("ref").split(".") |
There was a problem hiding this comment.
Right, but it appears to me that we're discarding the prefix? Is that correct?
bigquery/tests/unit/test_dataset.py
Outdated
| def test_from_string_w_prefix(self): | ||
| cls = self._get_target_class() | ||
| got = cls.from_string("prefix:string-project.string_dataset") | ||
| self.assertEqual(got.project, "string-project") |
There was a problem hiding this comment.
Shouldn't this be prefix:string-project, since the prefix is actually part of the project ID?
Complete template change.
|
|
||
| _PROJECT_PREFIX_PATTERN = re.compile( | ||
| r""" | ||
| (?P<prefix>\S+\:\S+)\.+(?P<remaining>\S*) |
There was a problem hiding this comment.
Now that we're matching this way, prefix isn't the right term. Should be project_id. Likewise, remaining should be renamed to dataset_id.
Also, instead of \S, we should be matching for characters other than ., that is [^.]+.
We want to match the whole string, so we should probably end this pattern with $.
There was a problem hiding this comment.
I renamed parts of the pattern, but the second comment about [^.] seems inappropriate to me. As we know the string could be google.com:project.dataset, that means that dot could be a part of the prefix. I checked couple of variants and as I see \S fits more.
|
|
||
|
|
||
| _PROJECT_PREFIX_PATTERN = re.compile( | ||
| r""" |
There was a problem hiding this comment.
Why are we using a multi-line string here?
There was a problem hiding this comment.
Just for the readability. I think I’ll switch this to the single line, after correcting the pattern implementation.
minor corrections
| with self.assertRaises(ValueError): | ||
| cls.from_string("string-project:string_dataset") | ||
|
|
||
| def test_from_string_w_incorrect_prefix(self): |
There was a problem hiding this comment.
Let's add an additional test where the project ID / dataset ID contains an illegal . character. Another way to say that, is the string contains too many "parts". e.g. google.com:project-id.dataset_id.table_id. This should also fail with ValueError.
| from google.cloud.bigquery.table import TableReference | ||
|
|
||
|
|
||
| _PROJECT_PREFIX_PATTERN = re.compile(r"(?P<project_id>\S+\:\S+)\.+(?P<dataset_id>\S+)$") |
There was a problem hiding this comment.
I think this will match patterns with too many . characters. Let's try something like:
(?P<project_id>\S+\:[^.]+)\.(?P<dataset_id>[^.]+)$
There was a problem hiding this comment.
I see what you mean, sorry for misunderstanding. Appreciate your help.
pattern rewrote with the '[^.]' and .VERBOSE (due to blacken session) added test to check extra parts within the string with the prefix reconf prefix in an existed test
Closes: #8646