Updated Feb 01, 2026 Test Engine to Practice Test for Associate-Developer-Apache-Spark-3.5 Valid and Updated Dumps
Exam Questions for Associate-Developer-Apache-Spark-3.5 Updated Versions With Test Engine
NEW QUESTION # 47
Given a DataFrame df that has 10 partitions, after running the code:
result = df.coalesce(20)
How many partitions will the result DataFrame have?
- A. Same number as the cluster executors
- B. 10
- C. 1
- D. 20
Answer: B
Explanation:
The coalesce(numPartitions) function reduces the number of partitions in a DataFrame; it never increases it. If the requested number is greater than the current number, the DataFrame keeps its current partitioning.
From the official Spark documentation:
"coalesce() results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim one or more of the current partitions."
If you try to increase the partition count with coalesce (e.g., from 10 to 20), the number of partitions remains unchanged; use repartition() when more partitions are needed.
Hence, df.coalesce(20) still returns a DataFrame with 10 partitions.
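As a rough illustration of the rule above, the following plain-Python sketch models how the resulting partition count is chosen. It is not Spark code: the helper names coalesced_partitions and repartitioned_partitions are made up for this example and only mirror the documented behaviour that coalesce() never raises the partition count, while repartition() sets it to whatever is requested.

```python
def coalesced_partitions(current: int, requested: int) -> int:
    """Hypothetical model of the partition count after df.coalesce(requested)."""
    # coalesce() can only merge partitions, so the count never grows.
    return min(current, requested)


def repartitioned_partitions(current: int, requested: int) -> int:
    """Hypothetical model of the partition count after df.repartition(requested)."""
    # repartition() performs a full shuffle, so it can grow or shrink the count.
    return requested


print(coalesced_partitions(10, 20))      # 10
print(coalesced_partitions(10, 4))       # 4
print(repartitioned_partitions(10, 20))  # 20
```

On a real DataFrame the same check would be made with df.coalesce(20).rdd.getNumPartitions().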
NEW QUESTION # 48
A data engineer needs to write a Streaming DataFrame as Parquet files.
Given the code:
Which code fragment should be inserted to meet the requirement?
- A. .format("parquet")
.option("path", "path/to/destination/dir")
- B. .option("format", "parquet")
.option("location", "path/to/destination/dir")
- C. .format("parquet")
.option("location", "path/to/destination/dir")
- D. .option("format", "parquet")
.option("destination", "path/to/destination/dir")
Answer: A
Explanation:
To write a structured streaming DataFrame to Parquet files, the correct way to specify the format and output directory is:
.writeStream
.format("parquet")
.option("path", "path/to/destination/dir")
For file-based sinks such as Parquet, the output directory is supplied through the path option (or as the argument to start()), and the streaming query is launched with start() rather than the batch save() method.
Option A is correct: .format("parquet") + .option("path", ...) is the required syntax.
Option B incorrectly sets the format via .option("format", ...) and the directory via .option("location", ...); neither is how the file sink is configured.
Option C uses .format("parquet") correctly but passes the directory as .option("location", ...), which the Parquet sink does not recognize.
Option D repeats the .option("format", ...) mistake and uses .option("destination", ...), which is also not a recognized option.
Final answer: A
NEW QUESTION # 49
A developer is working with a pandas DataFrame containing user behavior data from a web application.
Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?
- A. Use a regular Spark UDF:
from pyspark.sql.functions import mean
df.groupBy("user_id").agg(mean("value")).show()
- B. Use the mapInPandas API:
df.mapInPandas(mean_func, schema="user_id long, value double").show()
- C. Use a Pandas UDF:
@pandas_udf("double")
def mean_func(value: pd.Series) -> float:
    return value.mean()
df.groupby("user_id").agg(mean_func(df["value"])).show()
- D. Use the applyInPandas API:
df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()
Answer: D
Explanation:
The correct approach to perform a parallelized groupBy operation across Spark worker nodes using Pandas API is via applyInPandas. This function enables grouped map operations using Pandas logic in a distributed Spark environment. It applies a user-defined function to each group of data represented as a Pandas DataFrame.
As per the Databricks documentation:
"applyInPandas() allows for vectorized operations on grouped data in Spark. It applies a user-defined function to each group of a DataFrame and outputs a new DataFrame. This is the recommended approach for using Pandas logic across grouped data with parallel execution."
Option A uses built-in aggregation functions, which are efficient but not customizable with Pandas logic.
Option B (mapInPandas) applies to the entire DataFrame, not grouped operations.
Option C creates a scalar Pandas UDF, which does not perform a group-wise transformation.
Option D is correct and achieves this parallel execution.
Therefore, to run a groupBy with parallel Pandas logic on Spark workers, Option D using applyInPandas is the only correct answer.
NEW QUESTION # 50
A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The cluster has 10 nodes, each with 16 CPUs. Spark UI shows:
Low number of Active Tasks
Many tasks complete in milliseconds
Fewer tasks than available CPUs
Which approach should be used to adjust the partitioning for optimal resource allocation?
- A. Set the number of partitions equal to the total number of CPUs in the cluster
- B. Set the number of partitions equal to the number of nodes in the cluster
- C. Set the number of partitions to a fixed value, such as 200
- D. Set the number of partitions by dividing the dataset size (1 TB) by a reasonable partition size, such as 128 MB
Answer: D
Explanation:
Spark's best practice is to estimate the partition count from the data volume and a reasonable partition size, typically 128 MB to 256 MB per partition.
With 1 TB of data: 1 TB / 128 MB ≈ 8,000 partitions (8,192 in binary units).
This ensures that tasks are distributed across all 160 available CPUs for parallelism and that each task processes a reasonable volume of data.
Option A (equal to the total number of CPUs, 160) may result in partitions that are too large, roughly 6.4 GB each.
Option B (equal to the number of nodes) gives too few partitions (10), limiting parallelism.
Option C (a fixed value such as 200) is arbitrary and not derived from the data volume.
Reference: Databricks Spark Tuning Guide - Partitioning Strategy
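The arithmetic behind this sizing rule can be sketched in a few lines of Python. The helper name target_partitions is hypothetical, and the 128 MB default is simply the partition size suggested in the question, not a Spark constant.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4


def target_partitions(dataset_bytes: int, partition_bytes: int = 128 * MB) -> int:
    """Number of partitions needed so each holds at most partition_bytes."""
    return math.ceil(dataset_bytes / partition_bytes)


total_cores = 10 * 16                  # 10 nodes x 16 CPUs, from the question
partitions = target_partitions(1 * TB)

print(partitions)                      # 8192
print(partitions / total_cores)        # 51.2 tasks per core on average
```

A value computed this way would typically be passed to df.repartition(...) or used to set spark.sql.shuffle.partitions, depending on where the partitioning is decided.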
NEW QUESTION # 51
You have:
DataFrame A: 128 GB of transactions
DataFrame B: 1 GB user lookup table
Which strategy is correct for broadcasting?
- A. DataFrame A should be broadcasted because it is smaller and will eliminate the need for shuffling itself
- B. DataFrame A should be broadcasted because it is larger and will eliminate the need for shuffling DataFrame B
- C. DataFrame B should be broadcasted because it is smaller and will eliminate the need for shuffling itself
- D. DataFrame B should be broadcasted because it is smaller and will eliminate the need for shuffling DataFrame A
Answer: D
Explanation:
Broadcast joins work by sending the smaller DataFrame to all executors, eliminating the shuffle of the larger DataFrame.
From Spark documentation:
"Broadcast joins are efficient when one DataFrame is small enough to fit in memory. Spark avoids shuffling the larger table."
DataFrame B (1 GB) is the smaller side and is the one to broadcast. At 1 GB it is well above the default automatic broadcast threshold (spark.sql.autoBroadcastJoinThreshold, 10 MB), so it needs an explicit broadcast hint or a raised threshold, plus enough driver and executor memory to hold it.
Broadcasting it eliminates the need to shuffle the large DataFrame A.
Final Answer: D
NEW QUESTION # 52
A data engineer is working on the DataFrame:
(Referring to the table image: it has columns Id, Name, count, and timestamp.) Which code fragment should the engineer use to extract the unique values in the Name column into an alphabetically ordered list?
- A. df.select("Name").distinct().orderBy(df["Name"].desc())
- B. df.select("Name").distinct()
- C. df.select("Name").orderBy(df["Name"].asc())
- D. df.select("Name").distinct().orderBy(df["Name"])
Answer: D
Explanation:
To extract unique values from a column and sort them alphabetically:
distinct() is required to remove duplicate values.
orderBy() is needed to sort the results alphabetically (ascending by default).
Correct code:
df.select("Name").distinct().orderBy(df["Name"])
This is directly aligned with standard DataFrame API usage in PySpark, as documented in the official Databricks Spark APIs.
Option A sorts in descending order, which doesn't meet the requirement for alphabetical (ascending) order.
Option B omits sorting.
Option C is incorrect because it does not remove duplicates.
NEW QUESTION # 53
A data engineer replaces the exact percentile() function with approx_percentile() to improve performance, but the results are drifting too far from expected values.
Which change should be made to solve the issue?
- A. Increase the last value of the percentage parameter to increase the accuracy of the percentile ranges
- B. Increase the value of the accuracy parameter in order to increase the memory usage but also improve the accuracy
- C. Decrease the value of the accuracy parameter in order to decrease the memory usage but also improve the accuracy
- D. Decrease the first value of the percentage parameter to increase the accuracy of the percentile ranges
Answer: B
Explanation:
The approx_percentile function in Spark is a performance-optimized alternative to percentile. It takes an optional accuracy parameter:
approx_percentile(column, percentage, accuracy)
Higher accuracy values → more precise results, but increased memory/computation.
Lower values → faster but less accurate.
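To put numbers on this trade-off: the Spark SQL function reference describes 1.0/accuracy as the relative error of the approximation, with a default accuracy of 10000. The short sketch below only evaluates that formula; the helper name relative_error is made up for this example.

```python
def relative_error(accuracy: int) -> float:
    """Relative error documented for approx_percentile: 1.0 / accuracy."""
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return 1.0 / accuracy


# A larger accuracy value means a smaller error bound (and more memory used).
for accuracy in (100, 10_000, 1_000_000):
    print(accuracy, relative_error(accuracy))
```

Raising accuracy from 10000 to 1000000 therefore tightens the error bound by a factor of 100.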
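The documentation describes accuracy (default 10000) as a positive numeric literal that controls approximation accuracy at the cost of memory: a higher value yields better accuracy, and 1.0/accuracy is the relative error of the approximation.
Final answer: B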
NEW QUESTION # 54
A data engineer is working on the DataFrame:
(Referring to the table image: it has columns Id, Name, count, and timestamp.) Which code fragment should the engineer use to extract the unique values in the Name column into an alphabetically ordered list?
- A. df.select("Name").distinct().orderBy(df["Name"].desc())
- B. df.select("Name").distinct()
- C. df.select("Name").orderBy(df["Name"].asc())
- D. df.select("Name").distinct().orderBy(df["Name"])
Answer: D
Explanation:
To extract unique values from a column and sort them alphabetically:
distinct() is required to remove duplicate values.
orderBy() is needed to sort the results alphabetically (ascending by default).
Correct code:
df.select("Name").distinct().orderBy(df["Name"])
This is directly aligned with standard DataFrame API usage in PySpark, as documented in the official Databricks Spark APIs.
Option A sorts in descending order, which doesn't meet the requirement for alphabetical (ascending) order.
Option B omits sorting.
Option C is incorrect because it does not remove duplicates.
NEW QUESTION # 55
A data scientist has identified that some records in the user profile table contain null values in any of the fields, and such records should be removed from the dataset before processing. The schema includes fields like user_id, username, date_of_birth, created_ts, etc.
The schema of the user profile table looks like this:
Which block of Spark code can be used to achieve this requirement?
Options:
- A. filtered_df = users_raw_df.na.drop(thresh=0)
- B. filtered_df = users_raw_df.na.drop(how='all', thresh=None)
- C. filtered_df = users_raw_df.na.drop(how='all')
- D. filtered_df = users_raw_df.na.drop(how='any')
Answer: D
Explanation:
na.drop(how='any') drops any row that has at least one null value.
This is exactly what's needed when the goal is to retain only fully complete records.
Usage:
filtered_df = users_raw_df.na.drop(how='any')
Explanation of incorrect options:
A: thresh=0 keeps every row, because no row can have fewer than 0 non-null values; nothing is dropped.
B: how='all' with thresh=None behaves the same as how='all'; passing thresh=None just leaves the threshold unset.
C: how='all' drops only rows where all columns are null (too lenient).
Reference: PySpark DataFrameNaFunctions.drop()
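The difference between how='any' and how='all' can be sketched with a plain-Python model. This is an illustration under stated assumptions, not PySpark itself: rows are ordinary dicts, None stands in for null, and keeps_row is a hypothetical helper that follows the documented rule that thresh, when given, overrides how.

```python
from typing import Optional


def keeps_row(row: dict, how: str = "any", thresh: Optional[int] = None) -> bool:
    """Hypothetical model of whether na.drop(how=..., thresh=...) keeps a row."""
    non_null = sum(value is not None for value in row.values())
    if thresh is None:
        # how='any' demands every column be non-null; how='all' needs just one.
        thresh = len(row) if how == "any" else 1
    # A row survives when it has at least `thresh` non-null values.
    return non_null >= thresh


complete = {"user_id": 1, "username": "ana", "date_of_birth": "1990-01-01"}
partial = {"user_id": 2, "username": None, "date_of_birth": "1985-05-05"}
empty = {"user_id": None, "username": None, "date_of_birth": None}

for how in ("any", "all"):
    print(how, [keeps_row(r, how=how) for r in (complete, partial, empty)])
print("thresh=2", [keeps_row(r, thresh=2) for r in (complete, partial, empty)])
```

Only how='any' removes the partially filled record, which is what the requirement asks for.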
NEW QUESTION # 56
A developer wants to test Spark Connect with an existing Spark application.
What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)
- A. Set the environment variable SPARK_REMOTE="sc://localhost" before starting the pyspark shell
- B. Execute their pyspark shell with the option --remote "https://localhost"
- C. Ensure the Spark property spark.connect.grpc.binding.port is set to 15002 in the application code
- D. Execute their pyspark shell with the option --remote "sc://localhost"
- E. Add .remote("sc://localhost") to their SparkSession.builder calls in their Spark code
Answer: A,D
Explanation:
Spark Connect decouples the client from the Spark driver process, allowing remote access. The remote Spark Connect endpoint can be configured in several ways without touching application code:
From Databricks and Spark documentation:
Option A (setting the SPARK_REMOTE environment variable) is a supported way to configure the remote endpoint.
Option D (--remote "sc://localhost") is a valid command-line argument for the pyspark shell to connect using Spark Connect.
Option B is incorrect because Spark Connect uses the sc:// scheme, not https://.
Option C configures the port on the server side but doesn't start a client connection.
Option E requires modifying the code, which the question explicitly rules out.
Final Answers: A and D
NEW QUESTION # 57
Which feature of Spark Connect should be considered when designing an application that plans to enable remote interaction with a Spark cluster?
- A. It is primarily used for data ingestion into Spark from external sources.
- B. It allows for remote execution of Spark jobs.
- C. It provides a way to run Spark applications remotely in any programming language.
- D. It can be used to interact with any remote cluster using the REST API.
Answer: B
Explanation:
Spark Connect enables remote execution of Spark jobs by decoupling the client from the driver using the Spark Connect protocol (gRPC).
It allows users to run Spark code from different environments (like notebooks, IDEs, or remote clients) while executing jobs on the cluster.
Key Features:
Enables remote interaction between client and Spark driver.
Supports interactive development and lightweight client sessions.
Improves developer productivity without needing driver resources locally.
Why the other options are incorrect:
A: Spark Connect is not limited to ingestion tasks.
C: It allows multi-language clients (Python, Scala, etc.) but runs via the Spark Connect API, not arbitrary remote code in any language.
D: It uses the gRPC protocol, not REST.
Reference:
Databricks Exam Guide (June 2025): Section "Using Spark Connect to Deploy Applications" - describes Spark Connect architecture and remote execution model.
Spark 3.5 Documentation - Spark Connect overview and client-server protocol.
NEW QUESTION # 58
Given this code:
.withWatermark("event_time", "10 minutes")
.groupBy(window("event_time", "15 minutes"))
.count()
What happens to data that arrives after the watermark threshold?
Options:
- A. The watermark ensures that late data arriving within 10 minutes of the latest event_time will be processed and included in the windowed aggregation.
- B. Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.
- C. Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.
- D. Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.
Answer: B
Explanation:
According to Spark's watermarking rules:
"Records that are older than the watermark (event time < current watermark) are considered too late and are dropped."
So, if a record's event_time is earlier than (max event_time seen so far - 10 minutes), it is discarded.
NEW QUESTION # 59
A data engineer wants to create a Streaming DataFrame that reads from a Kafka topic called feed.
Which code fragment should be inserted in line 5 to meet the requirement?
Code context:
spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers","host1:port1,host2:port2") \
.[LINE5] \
.load()
Options:
- A. .option("kafka.topic", "feed")
- B. .option("subscribe", "feed")
- C. .option("subscribe.topic", "feed")
- D. .option("topic", "feed")
Answer: B
Explanation:
To read from a specific Kafka topic using Structured Streaming, the correct syntax is:
.option("subscribe", "feed")
According to the Apache Spark Structured Streaming + Kafka Integration Guide, subscribe takes a comma-separated list of topics to subscribe to, and only one of assign, subscribe, or subscribePattern can be specified for a Kafka source.
A: "kafka.topic" is not a recognized option.
C: "subscribe.topic" is invalid.
D: "topic" is not valid for the Kafka source in Spark.
NEW QUESTION # 60
Which components of Apache Spark's Architecture are responsible for carrying out tasks when assigned to them?
- A. Executors
- B. Worker Nodes
- C. Driver Nodes
- D. CPU Cores
Answer: A
Explanation:
In Spark's distributed architecture:
The Driver Node coordinates the execution of a Spark application. It converts the logical plan into a physical plan of stages and tasks.
The Executors, running on Worker Nodes, are responsible for executing tasks assigned by the driver and storing data (in memory or disk) during execution.
Key point:
Executors are the active agents that perform the actual computations on data partitions. Each executor runs multiple tasks in parallel using available CPU cores.
Why the other options are incorrect:
B (Worker Nodes): Worker nodes host executors but do not directly execute tasks; executors do.
C (Driver Nodes): The driver schedules tasks; it doesn't execute them.
D (CPU Cores): CPU cores execute within executors, but they are hardware, not Spark architectural components.
Reference (Databricks Apache Spark 3.5 - Python / Study Guide):
Spark Architecture Components - Driver, Executors, Cluster Manager, Worker Nodes.
Databricks Exam Guide (June 2025): Section "Apache Spark Architecture and Components" - describes the roles of driver and executor nodes in distributed processing.
NEW QUESTION # 61
What is the difference between df.cache() and df.persist() in Spark DataFrame?
- A. Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)
- B. cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK) and persist() - Can be used to set different storage levels to persist the contents of the DataFrame
- C. Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.
- D. persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER) and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.
Answer: B
Explanation:
df.cache() is shorthand for df.persist(StorageLevel.MEMORY_AND_DISK)
df.persist() allows specifying any storage level such as MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK_SER, etc.
By default, persist() uses MEMORY_AND_DISK, unless specified otherwise.
NEW QUESTION # 62
A data engineer is working on a Streaming DataFrame (streaming_df) with the following streaming data:
id | name       | count | timestamp
1  | Delhi      | 20    | 2024-09-19T10:11
1  | Delhi      | 50    | 2024-09-19T10:12
2  | London     | 50    | 2024-09-19T10:15
3  | Paris      | 30    | 2024-09-19T10:18
3  | Paris      | 20    | 2024-09-19T10:20
4  | Washington | 10    | 2024-09-19T10:22
Which operation is supported with streaming_df?
- A. streaming_df.select(countDistinct("name"))
- B. streaming_df.show()
- C. streaming_df.count()
- D. streaming_df.filter("count < 30")
Answer: D
Explanation:
In Structured Streaming, only transformation operations are allowed on streaming DataFrames. These include select(), filter(), where(), groupBy(), withColumn(), etc.
Example of supported transformation:
filtered_df = streaming_df.filter("count < 30")
However, actions such as count(), show(), and collect() are not supported directly on streaming DataFrames because streaming queries are unbounded and never finish until stopped.
To perform aggregations, the query must be executed through writeStream and an output sink.
Why the other options are incorrect:
A: countDistinct() is a distinct aggregation, which is not supported on streaming DataFrames.
B: show() is an action, unsupported on streaming DataFrames.
C: count() is an action, not allowed directly on streaming DataFrames.
Reference:
PySpark Structured Streaming Programming Guide - supported transformations and actions.
Databricks Exam Guide (June 2025): Section "Structured Streaming" - performing operations on streaming DataFrames and understanding supported transformations.
NEW QUESTION # 63
Which feature of Spark Connect is considered when designing an application to enable remote interaction with the Spark cluster?
- A. It allows for remote execution of Spark jobs
- B. It is primarily used for data ingestion into Spark from external sources
- C. It can be used to interact with any remote cluster using the REST API
- D. It provides a way to run Spark applications remotely in any programming language
Answer: A
Explanation:
Spark Connect introduces a decoupled client-server architecture. Its key feature is enabling Spark job submission and execution from remote clients in Python, Java, etc.
From Databricks documentation:
"Spark Connect allows remote clients to connect to a Spark cluster and execute Spark jobs without being co-located with the Spark driver."
B is incorrect; Spark Connect isn't focused on ingestion.
C refers to REST, which is not Spark Connect's mechanism.
D is close, but "any language" is overstated (it currently supports Python, Java, etc., not literally all).
Final Answer: A
NEW QUESTION # 64
A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.
Which two characteristics of Apache Spark's execution model explain this behavior? (Choose 2 answers)
- A. Transformations are executed immediately to build the lineage graph.
- B. Transformations are evaluated lazily.
- C. The Spark engine optimizes the execution plan during the transformations, causing delays.
- D. Only actions trigger the execution of the transformation pipeline.
- E. The Spark engine requires manual intervention to start executing transformations.
Answer: B,D
Explanation:
Apache Spark follows a lazy evaluation model, meaning transformations (like filter(), select(), map()) are not executed immediately. Instead, they build a logical plan (lineage graph) that represents the sequence of operations to be applied.
Execution only begins when an action (e.g., count(), collect(), save(), show()) is called. At that point, Spark's engine:
Optimizes the logical plan into a physical plan.
Divides it into stages and tasks.
Executes them across the cluster.
This design helps Spark optimize execution paths and avoid unnecessary computations.
Why the other options are incorrect:
A: Transformations do not execute immediately; they are deferred.
C: Optimization happens when a job is executed (after an action), not during the transformations.
E: Execution starts automatically once an action is triggered; no manual intervention is needed.
Reference:
Databricks Exam Guide (June 2025): Section "Apache Spark Architecture and Components" - covers lazy evaluation, actions vs. transformations, and execution hierarchy.
Spark 3.5 Documentation - Lazy Evaluation model and DAG scheduling.
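Lazy evaluation can be seen in miniature with Python's built-in lazy iterators. The sketch below is only an analogy and uses no Spark at all: map() and filter() stand in for transformations, list() stands in for an action, and the executed list is a hypothetical probe added to show when work actually happens. Unlike Spark, plain iterators do not optimize the plan before running it.

```python
executed = []  # records every element that is actually processed


def double(x: int) -> int:
    executed.append(x)
    return x * 2


# "Transformations": map() and filter() only describe the pipeline.
pipeline = filter(lambda v: v > 2, map(double, [1, 2, 3]))
print(executed)          # [] - nothing has run yet

# "Action": materialising the result forces the whole pipeline to run.
result = list(pipeline)
print(result)            # [4, 6]
print(executed)          # [1, 2, 3]
```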
NEW QUESTION # 65
An application architect has been investigating Spark Connect as a way to modernize existing Spark applications running in their organization.
Which requirement blocks the adoption of Spark Connect in this organization?
- A. Complete Spark API support: the ability to migrate all existing code to Spark Connect without modification, including the RDD APIs
- B. Debuggability: the ability to perform interactive debugging directly from the application code
- C. Upgradability: the ability to upgrade the Spark applications independently from the Spark driver itself
- D. Stability: isolation of application code and dependencies from each other and the Spark driver
Answer: A
Explanation:
Spark Connect enables a decoupled client-server architecture, allowing remote clients to run Spark code via gRPC.
However, as of Spark 3.5, Spark Connect supports DataFrame and SQL APIs, but not RDD APIs.
Limitation:
Applications that rely heavily on RDD-based transformations or actions cannot be migrated directly to Spark Connect.
These APIs require tight driver integration, which Spark Connect intentionally decouples.
Thus, complete Spark API compatibility is not yet achieved - this is the key adoption blocker.
Why the other options are incorrect:
B: Debugging is possible through IDE integration and logs on the client side.
C: Spark Connect actually supports upgradable clients independent of the driver - this is an advantage, not a limitation.
D: Spark Connect provides strong isolation between the client and driver processes.
Reference:
Spark 3.5 Documentation - Spark Connect architecture and supported APIs.
Databricks Exam Guide (June 2025): Section "Using Spark Connect to Deploy Applications" - Spark Connect limitations (no RDD API support).
NEW QUESTION # 66
A data engineer is working on a real-time analytics pipeline using Apache Spark Structured Streaming. The engineer wants to process incoming data and ensure that triggers control when the query is executed. The system needs to process data in micro-batches with a fixed interval of 5 seconds.
Which code snippet could the data engineer use to fulfil this requirement?
Options:
- A. Uses trigger(processingTime='5 seconds') - correct micro-batch trigger with interval.
- B. Uses trigger(continuous='5 seconds') - continuous processing mode.
- C. Uses trigger() - default micro-batch trigger without interval.
- D. Uses trigger(processingTime=5000) - invalid, as processingTime expects a string.
Answer: A
Explanation:
To define a micro-batch interval, the correct syntax is:
query = df.writeStream \
    .outputMode("append") \
    .trigger(processingTime='5 seconds') \
    .start()
This schedules the query to execute every 5 seconds.
Continuous mode (used in Option B) is experimental and has limited sink support.
Option C triggers as fast as possible without interval control.
Option D is incorrect because processingTime must be a string (not an integer).
Reference: Spark Structured Streaming - Triggers
NEW QUESTION # 67
How can a Spark developer ensure optimal resource utilization when running Spark jobs in Local Mode for testing?
Options:
- A. Increase the number of local threads based on the number of CPU cores.
- B. Use the spark.dynamicAllocation.enabled property to scale resources dynamically.
- C. Set the spark.executor.memory property to a large value.
- D. Configure the application to run in cluster mode instead of local mode.
Answer: A
Explanation:
When running in local mode (e.g., local[4]), the number inside the brackets defines how many threads Spark will use.
Using local[*] ensures Spark uses all available CPU cores for parallelism.
Example:
spark-submit --master local[*]
Dynamic allocation and executor memory apply to cluster-based deployments, not local mode.
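A small helper can make the choice of master URL explicit in test setups. This is a minimal sketch: local_master is a hypothetical function name, and only the local[N] and local[*] URL formats come from Spark.

```python
import os
from typing import Optional


def local_master(threads: Optional[int] = None) -> str:
    """Build a Spark master URL for local mode."""
    if threads is None:
        return "local[*]"          # let Spark use every available core
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return f"local[{threads}]"


cores = os.cpu_count() or 1        # os.cpu_count() may return None
print(local_master())              # local[*]
print(local_master(cores))
```

The resulting string can be passed to spark-submit via --master or to SparkSession.builder.master(...).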
NEW QUESTION # 68
A data engineer is working on a Streaming DataFrame streaming_df with the given streaming data:
Which operation is supported with streaming_df?
- A. streaming_df.groupby("Id").count()
- B. streaming_df.select(countDistinct("Name"))
- C. streaming_df.filter(col("count") < 30).show()
- D. streaming_df.orderBy("timestamp").limit(4)
Answer: A
Explanation:
In Structured Streaming, only a limited subset of operations is supported due to the nature of unbounded data.
Operations like sorting (orderBy) and global aggregation (countDistinct) require a full view of the dataset, which is not possible with streaming data unless specific watermarks or windows are defined.
Review of Each Option:
A). groupby("Id").count()
Supported - Streaming aggregations over a key (like groupBy("Id")) are supported. Spark maintains intermediate state for each key.
Reference: Databricks Docs - Aggregations in Structured Streaming (https://docs.databricks.com/structured-streaming/aggregation.html)
B). select(countDistinct("Name"))
Not allowed - Global aggregation like countDistinct() requires the full dataset and is not supported directly in streaming without watermark and windowing logic.
Reference: Databricks Structured Streaming Guide - Unsupported Operations.
C). filter(col("count") < 30).show()
Not allowed - show() is a blocking operation used for debugging batch DataFrames; it's not allowed on streaming DataFrames.
Reference: Structured Streaming Programming Guide - Output operations like show() are not supported.
D). orderBy("timestamp").limit(4)
Not allowed - Sorting and limiting require a full view of the stream (which is infinite), so this is unsupported in streaming DataFrames.
Reference: Spark Structured Streaming - Unsupported Operations (ordering without watermark/window not allowed).
Reference Extract from Official Guide:
"Operations like orderBy, limit, show, and countDistinct are not supported in Structured Streaming because they require the full dataset to compute a result. Use groupBy(...).agg(...) instead for incremental aggregations."- Databricks Structured Streaming Programming Guide
NEW QUESTION # 69
A data scientist wants each record in the DataFrame to contain:
The entire contents of a file
The full file path
The first attempt at the code does read the text files, but each record contains a single line: the files are read line by line rather than as full text per file. This code is shown below:
corpus = spark.read.text("/datasets/raw_txt/*") \
.select('*','_metadata.file_path')
Which change will ensure one record per file?
Options:
- A. Add the option lineSep='\n' to the text() function
- B. Add the option wholetext=True to the text() function
- C. Add the option lineSep=", " to the text() function
- D. Add the option wholetext=False to the text() function
Answer: B
Explanation:
To read each file as a single record, use:
spark.read.text(path, wholetext=True)
This ensures that Spark reads the entire file contents into one row.
Reference: Spark read.text() with wholetext
NEW QUESTION # 70
What is the benefit of Adaptive Query Execution (AQE)?
- A. It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.
- B. It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.
- C. It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.
- D. It allows Spark to optimize the query plan before execution but does not adapt during runtime.
Answer: B
Explanation:
Adaptive Query Execution (AQE) is a powerful optimization framework introduced in Apache Spark 3.0 and enabled by default since Spark 3.2. It dynamically adjusts query execution plans based on runtime statistics, leading to significant performance improvements. The key benefits of AQE include:
Dynamic Join Strategy Selection: AQE can switch join strategies at runtime. For instance, it can convert a sort-merge join to a broadcast hash join if it detects that one side of the join is small enough to be broadcasted, thus optimizing the join operation.
Handling Skewed Data: AQE detects skewed partitions during join operations and splits them into smaller partitions. This approach balances the workload across tasks, preventing scenarios where certain tasks take significantly longer due to data skew.
Coalescing Post-Shuffle Partitions: AQE dynamically coalesces small shuffle partitions into larger ones based on the actual data size, reducing the overhead of managing numerous small tasks and improving overall query performance.
These runtime optimizations allow Spark to adapt to the actual data characteristics during query execution, leading to more efficient resource utilization and faster query processing times.
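The features listed above map onto a handful of Spark SQL settings. The sketch below collects them in a dictionary and renders them as spark-submit arguments; the configuration keys are real Spark properties, while AQE_CONF and to_submit_args are names invented for this example.

```python
# Real Spark SQL configuration keys; the values shown enable each feature.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def to_submit_args(conf: dict) -> list:
    """Render settings as --conf key=value arguments for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(AQE_CONF)))
```

The same key/value pairs could instead be applied one by one with SparkSession.builder.config(key, value) when building a session.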
NEW QUESTION # 71
......

