Speech Redaction Framework

This repository contains the Speech Redaction Framework, a collection of components and code from Google Cloud that you can use to redact sensitive information from audio files.

It can:

Process audio files uploaded to Cloud Storage.
Write the findings to Google Cloud Storage.
Redact sensitive information from the audio file using the Google Cloud Data Loss Prevention API.

Speech Redaction Framework Limitations:

The framework can only process .wav or .flac files. This is a limitation of the framework code, not of the Cloud Speech-to-Text API.
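
Because of this limitation, it can help to check a file name before uploading it. The following is a minimal sketch, not part of the framework: is_supported_audio is a hypothetical helper, and matching only lowercase extensions is an assumption, since the text above does not say whether uppercase extensions are accepted.

```shell
# Hypothetical helper (not part of the framework): succeeds only when the
# file name ends in .wav or .flac. Matching lowercase extensions only is an
# assumption; the framework's own check may differ.
is_supported_audio() {
  case "$1" in
    *.wav|*.flac) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, is_supported_audio "call.wav" && echo "ok to upload" prints the message, while a name such as call.mp3 is rejected.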

The process works as follows:

An audio file is uploaded to Cloud Storage.
The Audio Process Cloud Function is triggered on object.create.
The Audio Process Cloud Function sends a long-running job request to Cloud Speech-to-Text.
Speech-to-Text processes the audio file.
The Cloud Function then sends the job ID from Cloud Speech-to-Text, along with additional metadata, to Cloud Pub/Sub.
The Cloud Dataflow job identifies sensitive information and writes the findings to a JSON file on Cloud Storage.
A second Cloud Function is triggered on object.create. It reads the findings JSON file, redacts sensitive information from the audio file, and writes the redacted audio file to Cloud Storage.

Required IAM Permissions

The service account or user executing these commands requires the following IAM Roles at the GCP Project level:

Cloud Functions Developer
Compute Admin
Compute Network Admin
Dataflow Developer
DLP Administrator
Eventarc Admin
Logs Viewer
Project IAM Admin
Pub/Sub Admin
Service Account User
Service Usage Admin
Storage Admin

Grant the following IAM role to the user that will impersonate the Service Account:

Workload Identity User

Use the following snippet to grant the necessary IAM roles to the service account using gcloud:

export PROJECT_ID="your-project-id"
export SA_EMAIL="your-service-account@$PROJECT_ID.iam.gserviceaccount.com"

ROLES=(
  "roles/cloudfunctions.developer"
  "roles/compute.admin"
  "roles/compute.networkAdmin"
  "roles/dataflow.developer"
  "roles/dlp.admin"
  "roles/eventarc.admin"
  "roles/logging.viewer"
  "roles/resourcemanager.projectIamAdmin"
  "roles/pubsub.admin"
  "roles/iam.serviceAccountUser"
  "roles/serviceusage.serviceUsageAdmin"
  "roles/storage.admin"
)

for ROLE in "${ROLES[@]}"; do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:$SA_EMAIL" \
    --role="$ROLE" \
    --condition=None
done
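
The loop above covers the project-level roles only. The Workload Identity User role for the impersonating user is granted on the service account itself. The following is a sketch of one way to do that: grant_impersonation is a hypothetical helper, and both of its arguments are placeholders that you supply.

```shell
# Hypothetical helper: grants Workload Identity User on one service account
# to a single user. Both arguments are placeholders supplied by the caller.
grant_impersonation() {
  local sa_email="$1"    # service account to be impersonated
  local user_email="$2"  # user who will impersonate it
  gcloud iam service-accounts add-iam-policy-binding "$sa_email" \
    --member="user:$user_email" \
    --role="roles/iam.workloadIdentityUser"
}
```

A call would look like grant_impersonation "$SA_EMAIL" "you@example.com", where the user address is illustrative.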

APIs

Enable the required APIs:

export PROJECT_ID="<PROJECT_ID>"

gcloud services enable \
    cloudresourcemanager.googleapis.com \
    eventarc.googleapis.com \
    --project="${PROJECT_ID}"
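
To confirm that both APIs are enabled, you can compare them against the project's enabled services. This is a minimal sketch: missing_apis is a hypothetical helper that reads enabled service names on standard input and prints any of the two APIs above that are absent.

```shell
# Hypothetical helper: prints each required API that is absent from the
# list of enabled services read from stdin (one service name per line).
# Intended usage (assumes PROJECT_ID is set as above):
#   gcloud services list --enabled --project="${PROJECT_ID}" \
#     --format="value(config.name)" | missing_apis
missing_apis() {
  local enabled api
  enabled=$(cat)
  for api in cloudresourcemanager.googleapis.com eventarc.googleapis.com; do
    # -x matches the whole line; -F treats the dots literally.
    printf '%s\n' "$enabled" | grep -qxF "$api" || echo "$api"
  done
}
```

Empty output means nothing from the list is missing.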

Deployment

Clone the GitHub repository:

git clone https://github.com/GoogleCloudPlatform/dataflow-speech-redaction.git
cd dataflow-speech-redaction

Modify the following files to specify the correct values for each variable:
- backend.tf
- terraform.tfvars
- providers.tf (optional, only needed if you deploy with a service account)

Optional: Modify the "google_data_loss_prevention_inspect_template" "dlp_template" resource in main.tf to customize the DLP inspect template with the infoTypes that must be redacted.

Validate that the subnet specified in terraform.tfvars has Private Google Access enabled and resides in the region specified by var.region.

Initialize, review, and apply the Terraform configuration:

terraform init
terraform plan
terraform apply
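
The Private Google Access requirement on the subnet can be checked with gcloud before you run terraform apply. The following is a sketch under stated assumptions: check_private_google_access is a hypothetical helper, and the subnet name and region are placeholders for the values in your terraform.tfvars.

```shell
# Hypothetical helper: succeeds only if the subnet reports Private Google
# Access as enabled. The subnet name and region are caller-supplied
# placeholders matching the values in terraform.tfvars.
check_private_google_access() {
  local subnet="$1"
  local region="$2"
  local pga
  pga=$(gcloud compute networks subnets describe "$subnet" \
    --region="$region" \
    --format="value(privateIpGoogleAccess)")
  # Accept either capitalization of the boolean printed by gcloud.
  case "$pga" in
    True|true) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, check_private_google_access "my-subnet" "us-central1" || echo "Enable Private Google Access first" prints the warning when the check fails; both argument values are illustrative.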

Usage

To trigger the pipeline, upload an audio file (.wav or .flac) to the input bucket defined in your configuration (variable: audio_files_bucket_name in terraform.tfvars).

Once processing is complete, the redacted audio file will appear in the output bucket (variable: redacted_audio_bucket_name in terraform.tfvars).
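
The upload and the follow-up check can be done from the command line. This is a minimal sketch: upload_audio and list_redacted are hypothetical helpers, and the bucket names are whatever you set for audio_files_bucket_name and redacted_audio_bucket_name.

```shell
# Hypothetical helpers wrapping gcloud storage; bucket names are the values
# of audio_files_bucket_name and redacted_audio_bucket_name in terraform.tfvars.

# Copies a local audio file into the input bucket, which triggers the pipeline.
upload_audio() {
  local file="$1"
  local input_bucket="$2"
  gcloud storage cp "$file" "gs://$input_bucket/"
}

# Lists the objects in the output bucket to see whether the redacted file exists.
list_redacted() {
  local output_bucket="$1"
  gcloud storage ls "gs://$output_bucket/"
}
```

For example, upload_audio "call.wav" "my-input-bucket" followed later by list_redacted "my-output-bucket", with both bucket names being illustrative.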

Scalability

This architecture uses Cloud Functions, which are powered by Cloud Run. They automatically scale up based on incoming traffic (uploaded audio files) and scale down to zero when idle.

To control costs and prevent downstream systems from being overwhelmed, the functions are currently capped at 10 concurrent instances.

How to Scale the Functions

To increase the processing throughput or allow more concurrent executions, you must modify the Terraform configuration in main.tf.

Locate the google_cloudfunctions2_function resources (specifically audio_process_func and redaction_func).
Inside the service_config block, update the max_instance_count value.

resource "google_cloudfunctions2_function" "audio_process_func" {
  # ...
  service_config {
    max_instance_count = 50   # <--- Update this value (Default: 10)
    # ...
  }
}

Save the file and apply the changes:

terraform plan
terraform apply
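
After the apply finishes, you can read the limit back from the deployed function to confirm the change. The following is a sketch: get_max_instances is a hypothetical helper, and the deployed function name and region are placeholders, because the names that Terraform assigns are not stated here.

```shell
# Hypothetical helper: prints the max instance count of a deployed
# 2nd gen Cloud Function. The function name and region are placeholders;
# use the names from your own deployment.
get_max_instances() {
  local function_name="$1"
  local region="$2"
  gcloud functions describe "$function_name" \
    --gen2 \
    --region="$region" \
    --format="value(serviceConfig.maxInstanceCount)"
}
```

For example, get_max_instances "my-audio-function" "us-central1", where both values are illustrative.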

This is not an officially supported Google product.
