Free Practice Questions•PDE - Professional Data Engineer•30 Questions with Answers
FREE QUESTIONS
PDE - Professional Data Engineer Practice Questions
30 free questions with correct answers and detailed explanations.
30 Free Questions
2 Free Exams
100% With Explanations
PDE Practice Set-01
15 questions
Q1
You need to store and analyze social media postings in Google BigQuery at a rate of 10,000 messages per minute in near real-time. You initially designed the application to use streaming inserts for individual postings. Your application also performs data aggregations right after the streaming inserts. You discover that the queries after streaming inserts do not exhibit strong consistency, and reports from the queries might miss in-flight data. How can you adjust your application design?
A
Re-write the application to load accumulated data every 2 minutes.
B
Convert the streaming insert code to batch load for individual messages.
C
Load the original message to Google Cloud SQL, and export the table every hour to BigQuery via streaming inserts.
D
Estimate the average latency for data availability after streaming inserts, and always run queries after waiting twice as long.
Correct Answer
Estimate the average latency for data availability after streaming inserts, and always run queries after waiting twice as long.
Explanation
BigQuery streaming inserts have eventual consistency; data may not be immediately available for queries. The recommended workaround is to account for latency by waiting a safe multiple of the typical propagation delay before querying. See: https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery#dataavailability
Q2
Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams have freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse. You need to discover what everyone is doing. What should you do first?
A
Use Google Stackdriver Audit Logs to review data access.
B
Get the identity and access management (IAM) policy of each table.
C
Use Stackdriver Monitoring to see the usage of BigQuery query slots.
D
Use the Google Cloud Billing API to see what account the warehouse is being billed to.
Correct Answer
Use Google Stackdriver Audit Logs to review data access.
Explanation
Cloud Audit Logs (Data Access logs) record who accessed which BigQuery resources and when, making them the correct first step to discover usage patterns before tightening permissions. See: https://cloud.google.com/bigquery/docs/reference/auditlogs
Q3
Business owners at your company have given you a database of bank transactions. Each row contains the user ID, transaction type, transaction location, and transaction amount. They ask you to investigate what type of machine learning can be applied to the data. Which three machine learning applications can you use?
A
Supervised learning to determine which transactions are most likely to be fraudulent.
B
Unsupervised learning to determine which transactions are most likely to be fraudulent.
C
Clustering to divide the transactions into N categories based on feature similarity.
D
Supervised learning to predict the location of a transaction.
E
Reinforcement learning to predict the location of a transaction.
F
Unsupervised learning to predict the location of a transaction.
Correct Answers
Unsupervised learning to determine which transactions are most likely to be fraudulent.
Clustering to divide the transactions into N categories based on feature similarity.
Supervised learning to predict the location of a transaction.
Explanation
With no fraud labels in the data, unsupervised learning (anomaly detection) can surface transactions that are likely fraudulent; clustering can divide transactions into N categories based on feature similarity; and supervised learning can predict a transaction's location, using location as the label and the other columns as features. Reinforcement learning, which requires an agent interacting with an environment, does not apply to a static dataset. See: https://cloud.google.com/bigquery/docs/bqml-introduction
Q4
Your company's on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage. You want to minimize the storage cost of the migration. What should you do?
A
Put the data into Google Cloud Storage.
B
Use preemptible virtual machines (VMs) for the Cloud Dataproc cluster.
C
Tune the Cloud Dataproc cluster so that there is just enough disk for all data.
D
Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk.
Correct Answer
Put the data into Google Cloud Storage.
Explanation
Using Cloud Storage (GCS) instead of HDFS/Persistent Disk for Dataproc significantly reduces storage costs because GCS is cheaper than block storage and Dataproc can read data directly from GCS. See: https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
Q5
You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a custom HTTPS endpoint that you have created to take action on these anomalous events as they occur. Your custom HTTPS endpoint keeps getting an inordinate amount of duplicate messages. What is the most likely cause of these duplicate messages?
A
The message body for the sensor event is too large.
B
Your custom endpoint has an out-of-date SSL certificate.
C
The Cloud Pub/Sub topic has too many messages published to it.
D
Your custom endpoint is not acknowledging messages within the acknowledgement deadline.
Correct Answer
Your custom endpoint is not acknowledging messages within the acknowledgement deadline.
Explanation
Pub/Sub push subscriptions resend messages when the HTTPS endpoint does not return a success (2xx) response within the acknowledgment deadline; if the endpoint is slow or returns errors, Pub/Sub retries causing duplicates. See: https://cloud.google.com/pubsub/docs/push
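The fix implied by this answer is to return a success response before doing the slow work. Below is a minimal plain-Python sketch of that pattern (the names `handle_push` and `work_queue` are hypothetical, and this is not the Pub/Sub client API):

```python
# Hypothetical sketch of a push-subscription handler that acknowledges
# fast and defers slow processing, so Pub/Sub does not redeliver.
from queue import Queue

work_queue = Queue()  # stands in for a background worker or task queue


def handle_push(message_body):
    """Enqueue the message and return HTTP 204 immediately.

    Pub/Sub treats any 2xx response as an acknowledgement. Doing the
    heavy anomaly handling inline would risk exceeding the ack
    deadline and triggering redelivery (i.e., duplicates).
    """
    work_queue.put(message_body)  # cheap, bounded-time operation
    return 204                    # 2xx => message acknowledged
```

In a real service the queue would be backed by a worker pool or a task queue; the key property is simply that the 2xx goes back within the acknowledgement deadline.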
Q6
Your company's customer and order databases are often under heavy load. This makes performing analytics against them difficult without harming operations. The databases are in a MySQL cluster, with nightly backups taken using mysqldump. You want to perform analytics with minimal impact on operations. What should you do?
A
Add a node to the MySQL cluster and build an OLAP cube there.
B
Use an ETL tool to load the data from MySQL into Google BigQuery.
C
Connect an on-premises Apache Hadoop cluster to MySQL and perform ETL.
D
Mount the backups to Google Cloud SQL, and then process the data using Google Cloud Dataproc.
Correct Answer
Use an ETL tool to load the data from MySQL into Google BigQuery.
Explanation
Loading the data into BigQuery with an ETL tool moves analytics onto a separate, fully managed warehouse, so heavy analytical queries no longer compete with operational traffic on the MySQL cluster. See: https://cloud.google.com/solutions/mysql-to-bigquery
Q7
You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible. What should you do?
A
Load the data every 30 minutes into a new partitioned table in BigQuery.
B
Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery
C
Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore
D
Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.
Correct Answer
Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery
Explanation
A federated (external) data source lets BigQuery query the file in place in Cloud Storage, so overwriting the object every 30 minutes keeps the data current with no load jobs to run and no duplicate storage in BigQuery, making it the cheapest way to keep a small, frequently updated reference dataset fresh. See: https://cloud.google.com/bigquery/docs/external-data-sources
Q8
You are preparing an organization-wide dataset. You need to preprocess customer data stored in a restricted bucket in Cloud Storage. The data will be used to create consumer analyses. You need to follow data privacy requirements, including protecting certain sensitive data elements, while also retaining all of the data for potential future use cases. What should you do?
A
Use the Cloud Data Loss Prevention API and Dataflow to detect and remove sensitive fields from the data in Cloud Storage. Write the filtered data in BigQuery.
B
Use customer-managed encryption keys (CMEK) to directly encrypt the data in Cloud Storage. Use federated queries from BigQuery. Share the encryption key by following the principle of least privilege.
C
Use Dataflow and the Cloud Data Loss Prevention API to mask sensitive data. Write the processed data in BigQuery.
D
Use Dataflow and Cloud KMS to encrypt sensitive fields and write the encrypted data in BigQuery. Share the encryption key by following the principle of least privilege.
Correct Answer
Use Dataflow and the Cloud Data Loss Prevention API to mask sensitive data. Write the processed data in BigQuery.
Explanation
Cloud Data Loss Prevention (DLP) API can de-identify or tokenize sensitive fields before loading data into BigQuery, allowing analytics while protecting PII and retaining all original data in Cloud Storage. See: https://cloud.google.com/dlp/docs/deidentify-sensitive-data
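As a toy illustration of masking (this is not the DLP API; a real pipeline would call DLP's de-identify methods from Dataflow), the following hypothetical function redacts card numbers while keeping the rest of the record intact:

```python
# Illustrative masking sketch: replace all but the last 4 digits of a
# 16-digit card number, preserving the surrounding record for analysis.
import re


def mask_card_numbers(text):
    """Mask 16-digit card numbers, keeping only the last 4 digits."""
    return re.sub(r"\b\d{12}(\d{4})\b", r"************\1", text)
```

The point mirrors the answer: sensitive elements are transformed, while all non-sensitive data (and the untouched originals in the restricted bucket) remain available for future use cases.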
Q9
Your company uses Looker Studio connected to BigQuery for reporting. Users are experiencing slow dashboard load times due to complex queries on a large table. The queries involve aggregations and filtering on several columns. You need to optimize query performance to decrease the dashboard load times. What should you do?
A
Configure Looker Studio to use a shorter data refresh interval to ensure fresh data is always displayed.
B
Create a materialized view in BigQuery that pre-calculates the aggregations and filters used in the Looker Studio dashboards.
C
Implement row-level security in BigQuery to restrict data access and reduce the amount of data processed by the queries.
D
Use BigQuery BI Engine to accelerate query performance by caching frequently accessed data.
Correct Answers
Create a materialized view in BigQuery that pre-calculates the aggregations and filters used in the Looker Studio dashboards.
Use BigQuery BI Engine to accelerate query performance by caching frequently accessed data.
Explanation
BigQuery materialized views pre-compute and cache aggregation results, dramatically reducing latency for dashboards that run the same aggregations repeatedly, and BI Engine accelerates the remaining interactive queries by caching frequently accessed data in memory. See: https://cloud.google.com/bigquery/docs/materialized-views-intro
Q10
You need to orchestrate a pipeline with several Google Cloud services: a batch Dataflow job, then a BigQuery query job followed by a Vertex AI batch prediction. The logic is sequential. You want a lightweight, serverless orchestration solution with minimal operational overhead. What service should you use?
A
Select Cloud Composer.
B
Select Compute Engine with cron.
C
Select Dataproc with Apache Oozie.
D
Select Cloud Workflows.
Correct Answer
Select Cloud Workflows.
Explanation
Cloud Workflows is a fully managed, serverless orchestration service designed to chain Google Cloud service calls sequentially with minimal overhead, making it ideal for lightweight pipeline orchestration. See: https://cloud.google.com/workflows/docs/overview
Q11
You have a data analyst team member who needs to analyze data by using BigQuery. The data analyst wants to create a data pipeline that would load 200 CSV files with an average size of 15MB from a Cloud Storage bucket into BigQuery daily. The data needs to be ingested and transformed before being accessed in BigQuery for analysis. You need to recommend a fully managed, no-code solution for the data analyst. What should you do?
A
Create a Cloud Run function and schedule it to run daily using Cloud Scheduler to load the data into BigQuery.
B
Use the BigQuery Data Transfer Service to load files from Cloud Storage to BigQuery, create a BigQuery job which transforms the data using BigQuery SQL and schedule it to run daily.
C
Build a custom Apache Beam pipeline and run it on Dataflow to load the file from Cloud Storage to BigQuery and schedule it to run daily using Cloud Composer.
D
Create a pipeline by using BigQuery pipelines and schedule it to load the data into BigQuery daily.
Correct Answer
Create a pipeline by using BigQuery pipelines and schedule it to load the data into BigQuery daily.
Explanation
BigQuery pipelines give the analyst a fully managed, no-code way to ingest the CSV files from Cloud Storage on a daily schedule and transform them with SQL before analysis, with no custom code or infrastructure to operate.
Q12
You are developing a fraud detection model using BigQuery ML. You have a raw transaction dataset and need to create new features such as the average_transaction_amount_last_24_hours and time_since_last_transaction. These features require aggregation and time-window calculations on the existing data. The goal is to ensure that these features are consistently applied during both model training and prediction without manual intervention. You need to prepare these features efficiently for your model. What should you do?
A
Implement a Cloud Run function that triggers on new transactions, calculates the features, and inserts them into a feature store before model serving.
B
Export the BigQuery data to Cloud Storage, perform feature engineering using a custom Python script in a Dataflow job, and then re-import the engineered features into BigQuery.
C
Use the TRANSFORM clause within the CREATE MODEL statement, leveraging SQL functions for aggregations and time-based calculations.
D
Create a separate BigQuery table containing pre-computed features using complex SQL queries and join this table with the raw data during model training and serving.
Correct Answer
Use the TRANSFORM clause within the CREATE MODEL statement, leveraging SQL functions for aggregations and time-based calculations.
Explanation
BigQuery ML TRANSFORM clause lets you define feature preprocessing that is automatically applied during training and prediction, enabling consistent feature engineering without separate pipelines. See: https://cloud.google.com/bigquery/docs/bigqueryml-transform
Q13
You are building a system to process financial transactions. The system must handle a high throughput of concurrent user operations and each operation requires low-latency reads and writes to individual records. You need to identify a storage solution that guarantees ACID compliance for the processed transactions. You want a Google Cloud managed service. What should you do?
A
Select Bigtable.
B
Select BigQuery.
C
Select Cloud Storage.
D
Select Spanner.
Correct Answer
Select Spanner.
Explanation
Cloud Spanner provides globally distributed ACID transactions with strong consistency and horizontal scalability, making it ideal for high-throughput financial transaction systems. See: https://cloud.google.com/spanner/docs/transactions
Q14
You are preparing data to serve a sales demand prediction model. The training data undergoes several pre-processing steps, including scaling numerical features and one-hot encoding categorical features. The model is deployed on Vertex AI Endpoints. You need to prevent training-serving skew and ensure accurate predictions in production. You want a solution that is easy to implement. What should you do?
A
Implement a custom handler within the Vertex AI Endpoint to automatically perform data transformations before the model makes a prediction.
B
Replicate the exact same pre-processing logic in the inference pipeline that was used during model training.
C
Store the raw, unprocessed data in a separate Cloud Storage bucket exclusively for serving.
D
Ensure the serving data is a smaller, random sample of the training data.
Correct Answer
Replicate the exact same pre-processing logic in the inference pipeline that was used during model training.
Explanation
Training-serving skew arises when features are computed differently at prediction time than at training time. Replicating the exact scaling and one-hot encoding logic in the inference pipeline guarantees that both paths produce identical features, which keeps production predictions accurate.
Q15
Your retail company is concerned about their BigQuery analytics spend. The company runs several queries that require the use of the same aggregation for the store ID and real-time sales volume. You need to implement the optimal solution that minimizes analytics spend and returns faster results. What should you do?
A
Create a new table from a CSV file with the repeated aggregation for the other queries to reference for faster processing.
B
Create a materialized view to minimize repetitive computations.
C
Use join acceleration with primary and foreign keys to increase query joining to live data.
D
Leverage partitioning to minimize the number of bytes read.
Correct Answer
Create a materialized view to minimize repetitive computations.
Explanation
A materialized view pre-computes and incrementally maintains the shared aggregation, and BigQuery automatically rewrites qualifying queries to read it, so the repeated queries scan far fewer bytes, return faster, and cost less under on-demand pricing. See: https://cloud.google.com/bigquery/docs/materialized-views-intro
PDE Practice Set-02
15 questions
Q1
You are defining the data governance strategy for a new BigQuery table with medical and financial data. You want a scalable solution that ensures the clinical researchers can access patient medical data without financial information, while allowing the accounting team to access only financial data with minimal patient identifiers. What should you do?
A
Implement column-level security policies in BigQuery tables with IAM permissions.
B
Create separate tables for personally identifiable information (PII), financial data, and anonymized medical data. Use IAM permissions to control access to each table.
C
Implement row-level security policies in BigQuery tables with IAM permissions.
D
Create separate datasets with authorized views exposing only approved data.
Correct Answer
Create separate datasets with authorized views exposing only approved data.
Explanation
Separate datasets with authorized views expose only the approved columns to each team: medical data without financial information to the researchers, and financial data with minimal patient identifiers to accounting. The underlying table stays locked down, and access is managed at the dataset level, so the approach scales as users are added. See: https://cloud.google.com/bigquery/docs/authorized-views
Q2
You are designing a data lake on Google Cloud to store vast amounts of customer interaction data from various sources, such as websites, mobile apps, and social media. You need to ensure that this data, which arrives in different formats, is consistently cataloged and easy for data analysts to discover and use. You also want to perform basic data quality checks and transformations before the data is consumed by downstream applications. You need an automated and managed data governance solution. What should you do?
A
Use Cloud Storage as the central repository. Use Vertex AI to classify and process the data and perform data quality checks.
B
Stream all the data directly into BigQuery, where it is automatically cataloged and governed.
C
Use Cloud Storage and BigQuery as repositories. Use Dataplex Universal Catalog for metadata discovery, data quality checks, and transformations.
D
Use Cloud Storage as the central repository. Use a Cloud Run function to catalog, transform the data, and perform data quality checks.
Correct Answer
Use Cloud Storage and BigQuery as repositories. Use Dataplex Universal Catalog for metadata discovery, data quality checks, and transformations.
Explanation
Dataplex Universal Catalog automatically discovers and catalogs metadata for data in Cloud Storage and BigQuery, and provides managed data quality checks and transformation tasks, giving analysts a single governed view of the lake without custom tooling. See: https://cloud.google.com/dataplex/docs
Q3
You are using Dataflow to build a streaming data pipeline to analyze user website click activity from Pub/Sub. You need to calculate the number of clicks for each user site visit. A site visit is defined as a period of activity followed by 30 minutes of inactivity for a specific user. What should you do?
A
Use tumbling windows with a 30-minute window.
B
Use hopping windows with a 30-minute window, and a 1-minute period.
C
Use hopping windows with a 30-minute window, and a 30-minute period.
D
Use session windows with a 30-minute gap duration.
Correct Answer
Use session windows with a 30-minute gap duration.
Explanation
Session windows group a user's events into bursts of activity separated by a configurable gap of inactivity. With a 30-minute gap duration, each window corresponds exactly to one site visit as defined, so counting the events in each window yields clicks per visit. See: https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub
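Session-window semantics can be sketched in plain Python (illustrative only; in a real pipeline you would use Beam's built-in Sessions windowing rather than hand-rolled code):

```python
# Illustrative sessionization: a user's sorted event timestamps are
# split into sessions whenever the gap between consecutive events
# exceeds the gap duration.
GAP_SECONDS = 30 * 60  # 30-minute inactivity gap, as in the question


def sessionize(timestamps, gap=GAP_SECONDS):
    """Group event timestamps (seconds) for one user into sessions."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)  # still within the current visit
        else:
            sessions.append([t])    # 30+ min of inactivity => new visit
    return sessions
```

The clicks-per-visit metric is then just the length of each session.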
Q4
Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits well for the training data. However, when tested against new data, it performs poorly. What method can you employ to address this?
A
Threading
B
Serialization
C
Dropout Methods
D
Dimensionality Reduction
Correct Answer
Dropout Methods
Explanation
The model is overfitting: it memorizes the training data but generalizes poorly. Dropout randomly disables a fraction of neurons during each training step, which prevents co-adaptation and is a standard remedy for overfitting in large networks. See: https://cloud.google.com/vertex-ai/docs/training/overview
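A minimal sketch of inverted dropout in plain Python (illustrative; frameworks like TensorFlow provide this as a built-in layer):

```python
# Inverted dropout: each activation is zeroed with probability p, and
# survivors are scaled by 1/(1-p) so the expected activation is
# unchanged between training and inference.
import random


def dropout(activations, p, rng=None):
    """Apply dropout with drop probability p to a list of activations."""
    rng = rng or random.Random(0)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

With p = 0 every unit survives unchanged; with p = 0.5 roughly half the units are zeroed and the rest are doubled, keeping the expected sum constant.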
Q5
You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?
A
Continuously retrain the model on just the new data.
B
Continuously retrain the model on a combination of existing data and the new data.
C
Train on the existing data while using the new data as your test set.
D
Train on the new data while using the existing data as your test set.
Correct Answer
Continuously retrain the model on a combination of existing data and the new data.
Explanation
Retraining on a combination of existing and new data lets the model track shifting preferences while retaining historical signal. Training only on the new data discards what the model has already learned, and using the new data purely as a test set never updates the model at all. See: https://cloud.google.com/vertex-ai/docs/predictions/overview
Q6
You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?
A
Add capacity (memory and disk space) to the database server by the order of 200.
B
Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
C
Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
D
Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.
Correct Answer
Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
Explanation
Normalizing the single wide table into patient and visits tables (plus supporting tables) eliminates the expensive self-joins. Reports then join smaller, targeted tables instead of joining the largest table to itself, so query cost grows with the data instead of exploding, and the design continues to work at 100 times the record count.
Q7
You create an important report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old. What should you do?
A
Disable caching by editing the report settings.
B
Disable caching in BigQuery by editing table details.
C
Refresh your browser tab showing the visualizations.
D
Clear your browser history for the past hour then reload the tab showing the visualizations.
Correct Answer
Disable caching by editing the report settings.
Explanation
Data Studio caches query results, so a report can display data as old as the cache. Disabling caching in the report settings forces each view to query BigQuery directly, which surfaces rows less than an hour old. See: https://cloud.google.com/looker/docs/looker-studio
Q8
An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?
A
Use federated data sources, and check data in the SQL query.
B
Enable BigQuery monitoring in Google Stackdriver and create an alert.
C
Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
D
Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
Correct Answer
Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
Explanation
A Dataflow batch pipeline can validate each CSV row, load well-formed rows into BigQuery, and route malformed or corrupted rows to a dead-letter table. Bad input then never blocks the load and remains available for later analysis. See: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv
Q9
Your weather app queries a database every 15 minutes to get the current temperature. The frontend is powered by Google App Engine and serves millions of users. How should you design the frontend to respond to a database failure?
A
Issue a command to restart the database servers.
B
Retry the query with exponential backoff, up to a cap of 15 minutes.
C
Retry the query every second until it comes back online to minimize staleness of data.
D
Reduce the query frequency to once every hour until the database comes back online.
Correct Answer
Retry the query with exponential backoff, up to a cap of 15 minutes.
Explanation
Retrying with exponential backoff spaces out requests so millions of clients do not hammer a recovering database, while capping the backoff at 15 minutes matches the app's normal query interval, so data is never staler than one regular refresh cycle.
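The capped exponential backoff in the correct answer can be sketched as follows (the function name is illustrative, and production code would usually also add random jitter):

```python
# Capped exponential backoff: the wait doubles after each failed
# attempt but never exceeds the 15-minute cap, so a recovering
# database is not overwhelmed by retries.
def backoff_delay(attempt, base=1.0, cap=15 * 60):
    """Seconds to wait before retry number `attempt` (0-based)."""
    return min(base * (2 ** attempt), cap)
```

So attempt 0 waits 1 s, attempt 4 waits 16 s, and from roughly attempt 10 onward every retry waits the full 15-minute cap.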
Q10
You are creating a model to predict housing prices. Due to budget constraints, you must run it on a single resource-constrained virtual machine. Which learning algorithm should you use?
A
Linear regression
B
Logistic classification
C
Recurrent neural network
D
Feedforward neural network
Correct Answer
Linear regression
Explanation
Linear regression is the appropriate starting model for housing price prediction on resource-constrained hardware; it requires minimal compute and is interpretable. See: https://cloud.google.com/bigquery/docs/bigqueryml-analyst-start
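To see why linear regression fits a resource-constrained VM, here is a closed-form one-feature ordinary-least-squares fit in plain Python: no iterative training loop, GPU, or framework is required (illustrative sketch only):

```python
# One-feature OLS fit: slope and intercept come from a closed-form
# formula, so training cost is a single pass over the data.
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept
```

A recurrent or feedforward neural network would need iterative gradient-descent training, which is exactly what a budget-constrained single VM is poorly suited for.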
Q11
You are building a new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will be sent only once, but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?
A
Include ORDER BY DESC on timestamp column and LIMIT to 1.
B
Use GROUP BY on the unique ID column and timestamp column and SUM on the values.
C
Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.
D
Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.
Correct Answer
Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.
Explanation
Streaming delivery is at-least-once, so the same row can arrive more than once. Using ROW_NUMBER() with PARTITION BY the unique ID, ordered by the event timestamp, and filtering to row number 1 keeps exactly one copy of each row at query time. See: https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery
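The deduplication logic behind the ROW_NUMBER pattern, keeping only the newest row per unique ID, can be sketched in plain Python (illustrative; in BigQuery you would express it directly in SQL):

```python
# Pure-Python analogue of the dedup pattern
#   ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) = 1
# i.e. retain only the latest row for each unique ID.
def dedup_latest(rows):
    """rows: iterable of (unique_id, event_ts, payload) tuples."""
    latest = {}
    for uid, ts, payload in rows:
        if uid not in latest or ts > latest[uid][0]:
            latest[uid] = (ts, payload)  # newer row wins
    return {uid: payload for uid, (ts, payload) in latest.items()}
```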
Q12
Your company is using WILDCARD tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error: Which table name will make the SQL statement work correctly?
A
'bigquery-public-data.noaa_gsod.gsod'
B
bigquery-public-data.noaa_gsod.gsod*
C
'bigquery-public-data.noaa_gsod.gsod'*
D
`bigquery-public-data.noaa_gsod.gsod*`
Correct Answer
`bigquery-public-data.noaa_gsod.gsod*`
Explanation
Wildcard table references must be enclosed in backticks and end with the * wildcard character; single quotes are not valid table quoting in BigQuery standard SQL, so `bigquery-public-data.noaa_gsod.gsod*` is the only option that parses. See: https://cloud.google.com/bigquery/docs/querying-wildcard-tables
Q13
Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information required to do their jobs. You want to enforce this requirement with Google BigQuery. Which three approaches can you take?
A
Disable writes to certain tables.
B
Restrict access to tables by role.
C
Ensure that the data is encrypted at all times.
D
Restrict BigQuery API access to approved users.
E
Segregate data across multiple tables or databases.
F
Use Google Stackdriver Audit Logging to determine policy violations.
Correct Answers
Restrict access to tables by role.
Restrict BigQuery API access to approved users.
Segregate data across multiple tables or databases.
Explanation
Least privilege is enforced by restricting table access by role, limiting who may call the BigQuery API at all, and segregating data so that a user's grants expose nothing beyond their job's scope. Encryption and audit logging are valuable controls, but neither restricts what an authorized user can see. See: https://cloud.google.com/bigquery/docs/table-access-controls-intro
Q14
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules: no interaction by the user on the site for 1 hour; has added more than $30 worth of products to the basket; and has not completed a transaction. You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?
A
Use a fixed-time window with a duration of 60 minutes.
B
Use a sliding time window with a duration of 60 minutes.
C
Use a session window with a gap time duration of 60 minutes.
D
Use a global window with a time based trigger with a delay of 60 minutes.
Correct Answer
Use a session window with a gap time duration of 60 minutes.
Explanation
A session window with a 60-minute gap duration closes exactly when a user has been inactive for one hour. When the window closes, the pipeline can evaluate the remaining rules (more than $30 in the basket, no completed transaction) and decide whether to send the message. See: https://cloud.google.com/architecture/ecommerce-recommendation-pubsub-dataflow-bigquery
Q15
Your company handles data processing for a number of different clients. Each client prefers to use their own suite of analytics tools, with some allowing direct query access via Google BigQuery. You need to secure the data so that clients cannot see each other's data. You want to ensure appropriate access to the data. Which three steps should you take?
A
Load data into different partitions.
B
Load data into a different dataset for each client.
C
Put each client's BigQuery dataset into a different table.
D
Restrict a client's dataset to approved users.
E
Only allow a service account to access the datasets.
F
Use the appropriate identity and access management (IAM) roles for each client's users.
Correct Answers
Load data into a different dataset for each client.
Restrict a client's dataset to approved users.
Use the appropriate identity and access management (IAM) roles for each client's users.
Explanation
A dataset per client is a natural isolation boundary: dataset-level access controls restrict each dataset to that client's approved users, and appropriate IAM roles give each user only the capabilities they need. Clients can then query their own data directly without any chance of seeing another client's. See: https://cloud.google.com/architecture/data-lifecycle-cloud-platform
Want More Practice?
These are just the free questions. Unlock the full PDE - Professional Data Engineer exam library with hundreds of additional questions, timed practice mode, and progress tracking.