A data engineer is working with a large JSON dataset containing order information. The dataset is stored in a distributed file system and must be loaded into a Spark DataFrame for analysis. The engineer wants the schema to be correctly defined and the data to be read efficiently. Which approach should the data engineer use to efficiently load the JSON data into a Spark DataFrame with a predefined schema?
A Spark application suffers from too many small tasks due to excessive partitioning. How can this be fixed without a full shuffle?
Which Spark configuration controls the number of tasks that can run in parallel on an executor?
A data engineer observes that an upstream streaming source sends duplicate records, where duplicates share the same key and differ by at most 30 minutes in event_timestamp. The engineer adds: dropDuplicatesWithinWatermark("event_timestamp", "30 minutes"). What is the result?
Given a DataFrame df that has 10 partitions, after running the code: result = df.coalesce(20) How many partitions will the result DataFrame have?
© Copyright FreePDFQuestions 2026. All Rights Reserved.