Congratulations! You’ve got a job interview doing Apache Spark work. After all of your training and experience, you can put it to the test and show a prospective employer that you’re the right one for your dream job.
Your interview is probably going to jump right to some pretty technical questions to test your mettle. We have a collection of the top potential questions you might face in an interview, so read on and see how you do.
Keep in mind that these questions are a good overall guideline to help in your preparation and studying. In other words, your mileage may vary.
What is Apache Spark Anyway?
Spark is an open-source distributed data processing engine built for fast cluster computing. It provides implicit data parallelism and fault tolerance. It was originally developed at UC Berkeley and is now maintained by the Apache Software Foundation.
Spark is a go-to open-source option for iterative machine learning, interactive data analytics, and stream processing, among other things. It’s used by a number of companies, such as Netflix, Pinterest, Conviva, Shopify, and OpenTable.
What Are Some of the Key Features of Spark?
Although you probably don’t need to sell your potential employer on using Spark, it’s good for you to demonstrate that you’re well aware of the advantages of working with it. Some points to bring up:
- Support for a number of programming languages: Code for Spark can be written in a number of languages, including Java, Python, R, and Scala.
- Lazy evaluation: Spark uses something called lazy evaluation for its computations. Transformations on a dataset are only recorded when you declare them; nothing is computed until an action needs the result. Delaying the work this way lets Spark plan the whole chain of operations at once and skip anything that is never used.
- Built-in machine learning: Spark ships with its own machine learning library, MLlib, meaning it doesn’t need a separate engine for this work.
- Multiple data source support: Spark supports a number of data sources and formats, including JSON, Hive, Cassandra, and Parquet, all accessible through Spark SQL.
- Real-time computation: Spark keeps data in memory, so its computation is low latency and close to real time, and it scales out across large clusters.
- Speed: Spark is designed for processing very large sets of data and it does so fast. It can be up to a hundred times faster than Hadoop MapReduce for in-memory workloads. It does this through controlled partitioning, which parallelizes distributed processing very efficiently.
- Hadoop integration: Because they’re both Apache products, Spark connects easily to Hadoop. It can serve as a replacement for the Hadoop MapReduce functions if need be and can run on top of an existing Hadoop cluster via YARN.
- Built on RDDs (resilient distributed datasets): This means that data in Spark can be cached across the computing nodes in a cluster.
- Supports multiple analytic tools: These include query analysis, real-time analysis, and graph processing.
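To make the lazy evaluation point concrete, here is a minimal plain-Python sketch. It does not use Spark at all: `LazyDataset` and `traced_square` are hypothetical names invented for this illustration, and the class only mimics the idea that transformations are recorded and nothing runs until an action (here `collect`) is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD (an assumption, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: only remember what to do, return a new dataset.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: same idea, nothing is evaluated here.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: now the recorded steps are actually applied.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every element the map function really touches


def traced_square(x):
    calls.append(x)
    return x * x


pipeline = LazyDataset(range(1, 6)).map(traced_square).filter(lambda v: v % 2 == 1)
calls_before_action = len(calls)  # still 0: nothing has been computed yet
result = pipeline.collect()       # the action triggers the whole chain
```

In an interview, the point to draw out is that `calls` stays empty until `collect()` runs, which is the same behavior Spark shows between transformations and actions.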
So, What Makes it Better Than Hadoop MapReduce?
Again, you probably don’t need to convince your employer of this, but they will want to see that you know you’re using the right tool for the job and that you’re thoughtful about it. Some points to remember are:
- Enhanced speed: MapReduce writes to persistent storage during its data processing tasks, which adds to its processing time. Spark processes data in memory, which can be roughly ten to a hundred times faster.
- Multitasking: Spark has a number of built-in libraries that run different workloads on the same core engine. This means you can have batch processing, interactive SQL queries, machine learning, and streaming all in the same application. Hadoop MapReduce only offers batch processing through its built-in libraries.
- Disk dependency: MapReduce depends heavily on disk reads and writes. Spark uses caching and in-memory data storage, which cuts down on disk I/O and makes it faster.
- Iterative computation: Spark is able to perform numerous computations over and over on the same data set with ease, because the data can stay cached in memory. MapReduce has no built-in support for this, since each job reads its input from disk and writes its output back.
How Would You Describe the Apache Spark Architecture?
A Spark application consists of a driver program and a set of executors that run on worker nodes. The driver creates a SparkContext, which connects to a cluster manager (Spark’s standalone manager, YARN, Mesos, or Kubernetes) that allocates resources for the application. A Spark application can also run locally using threads on a single machine, and in a distributed environment it typically reads its data from storage such as S3 or HDFS.
A worker node could be described as any node that can run application code in a cluster. The driver program splits the job into tasks, assigns them to the executors on the worker nodes, and takes in anything they report back. Tasks are scheduled based on the resources the cluster manager makes available.
What Are the Various Components of Spark?
- Spark Core: This is the base engine for all the data processing, the heart of the Spark system.
- Spark Streaming: Takes care of any real-time streaming data.
- Spark SQL: This integrates relational processing with Spark’s functional programming API.
- GraphX: Used for building graphs as well as graph-parallel computation.
- MLlib: Used for all machine learning.
What is an RDD?
As stated earlier, RDD stands for Resilient Distributed Dataset. In other words, an RDD is data in memory that’s distributed over many nodes but still accessible as a single collection. It is a fault-tolerant collection of elements that can be operated on in parallel. The data in an RDD is partitioned across the cluster and is immutable: transformations produce new RDDs rather than changing existing ones.
There are two ways of making an RDD in Spark. The first is to parallelize a collection using the Driver program and the second is to load an external dataset.
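The two creation routes can be sketched in plain Python. This is a toy model, not Spark code: `toy_parallelize` and `toy_text_file` are hypothetical helpers invented here, and `sample.txt` is a throwaway file written just for the example. In PySpark the corresponding real calls are `SparkContext.parallelize` and `SparkContext.textFile`.

```python
import os
import tempfile


def toy_parallelize(data, num_partitions):
    """Split an in-memory collection into roughly equal partitions."""
    items = list(data)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


def toy_text_file(path, num_partitions):
    """Load an external dataset (one record per line) and partition it."""
    with open(path, encoding="utf-8") as handle:
        lines = [line.rstrip("\n") for line in handle]
    return toy_parallelize(lines, num_partitions)


# Route 1: parallelize a collection that already lives in the driver.
from_collection = toy_parallelize(range(10), 3)

# Route 2: load an external dataset (a temporary file stands in for HDFS/S3).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("alpha\nbeta\ngamma\ndelta\n")
    from_file = toy_text_file(sample_path, 2)
```

Either way the result is a set of partitions, which is the unit Spark hands out to the worker nodes.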
What API is Used for Implementing Data Graphs in Spark?
This is where GraphX comes in. It’s the API used for implementing graphs and graph-parallel computing by extending the Spark RDD with a resilient distributed property graph. Each edge and vertex of this graph can have properties defined by the user. GraphX exposes operators such as joinVertices, subgraph, and aggregateMessages (which replaced the older mapReduceTriplets).
What Is a Broadcast Variable and When Would It Be Used?
A broadcast variable is a read-only variable that is cached on every node in the cluster. Broadcast variables are used any time you want to avoid shipping a copy of a variable with each running task. For large read-only data such as lookup tables, this cuts network traffic and makes processing more efficient.
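The saving is easy to see in a toy model. The sketch below is plain Python, not Spark: `ToyNode`, its methods, and the lookup table are all hypothetical, and bytes received are counted with `pickle` purely for illustration. In PySpark the real entry point is `SparkContext.broadcast(value)`, whose result is read through `.value`.

```python
import pickle


class ToyNode:
    """Toy worker node that counts how many bytes it has been sent."""

    def __init__(self):
        self.received_bytes = 0
        self._broadcast_cache = {}

    def run_task_with_copy(self, serialized_table, key):
        # Without broadcast: the table travels with every single task.
        self.received_bytes += len(serialized_table)
        return pickle.loads(serialized_table)[key]

    def run_task_with_broadcast(self, broadcast_id, fetch, key):
        # With broadcast: fetch once per node, then reuse the cached copy.
        if broadcast_id not in self._broadcast_cache:
            blob = fetch()
            self.received_bytes += len(blob)
            self._broadcast_cache[broadcast_id] = pickle.loads(blob)
        return self._broadcast_cache[broadcast_id][key]


lookup = {code: f"country-{code}" for code in range(1000)}
blob = pickle.dumps(lookup)
keys = [3, 14, 15, 92, 65]  # one key per task

node_a, node_b = ToyNode(), ToyNode()
results_copy = [node_a.run_task_with_copy(blob, k) for k in keys]
results_bcast = [node_b.run_task_with_broadcast("lookup", lambda: blob, k) for k in keys]
```

Both nodes produce the same answers, but `node_a` received the table five times while `node_b` received it once.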
Speaking of Variables, How is Data Persistence Managed in Spark?
There are several persistence levels for storing RDDs on disk or in memory. They are:
- DISK_ONLY: This stores the RDD partition on – you guessed it – the disk only.
- MEMORY_AND_DISK: This starts by storing the RDD as deserialized Java objects in the JVM. Partitions that don’t fit in memory are stored on disk and read from there when they’re needed.
- MEMORY_ONLY_SER: Here the RDD is stored as serialized Java objects, one byte array per partition.
- MEMORY_AND_DISK_SER: Just like MEMORY_ONLY_SER, except partitions that don’t fit in memory are spilled to disk instead of being recomputed.
- MEMORY_ONLY: Stores the RDD as deserialized Java objects in the JVM. Partitions that don’t fit are not cached and are recomputed on the fly each time they’re needed. This is the default level.
- OFF_HEAP: Same as MEMORY_ONLY_SER but stores in an off-heap memory location.
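The practical difference between MEMORY_ONLY and MEMORY_AND_DISK is how often a partition gets recomputed. The following plain-Python sketch models just that. `ToyCache` and `build_partition` are hypothetical names, a dict stands in for the disk, and eviction is left out (real Spark evicts cached partitions, which this toy does not attempt to model). In PySpark the level is chosen with `rdd.persist(StorageLevel.MEMORY_AND_DISK)` and similar.

```python
class ToyCache:
    """Toy model of a persisted dataset with limited memory (assumption)."""

    def __init__(self, compute, memory_slots, spill_to_disk):
        self.compute = compute            # function: partition index -> data
        self.memory_slots = memory_slots  # how many partitions fit in memory
        self.spill_to_disk = spill_to_disk
        self.memory = {}
        self.disk = {}                    # a dict standing in for local disk
        self.compute_calls = 0

    def get(self, index):
        if index in self.memory:
            return self.memory[index]
        if index in self.disk:
            return self.disk[index]
        # Not stored anywhere: rebuild the partition from scratch.
        self.compute_calls += 1
        data = self.compute(index)
        if len(self.memory) < self.memory_slots:
            self.memory[index] = data
        elif self.spill_to_disk:
            self.disk[index] = data
        return data


def build_partition(index):
    return [index * 10 + offset for offset in range(3)]


memory_only = ToyCache(build_partition, memory_slots=2, spill_to_disk=False)
memory_and_disk = ToyCache(build_partition, memory_slots=2, spill_to_disk=True)

# Read all four partitions twice under each level.
for cache in (memory_only, memory_and_disk):
    for _ in range(2):
        for index in range(4):
            cache.get(index)
```

With room for only two of the four partitions, the MEMORY_ONLY model recomputes the other two on the second pass, while the MEMORY_AND_DISK model reads them back from its "disk".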
What is the Shark Tool?
Shark is a tool for users coming from a non-development background. It’s a way of accessing Spark’s Scala MLlib abilities through an SQL-style interface, which makes it easier to run Hive on Spark. Shark has since been retired in favor of Spark SQL, so be ready to mention that as well.
What Are Some Pitfalls in Using Spark?
Any tool is only as good as its user. Certainly being a skilled developer who understands the system will minimize any problems using Spark. Specifically, being diligent about memory use is important.
Spark processes a huge amount of data, so if the processing of that data is not optimized, it can grind to a halt. It’s also important not to run everything on a local node, because then you’re not taking advantage of Spark’s distributed power. Conversely, having multiple nodes hit the same web service over and over again is redundant and will slow processing down.
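One common way to soften the web service problem is to do lookups once per partition instead of once per record. The sketch below is plain Python with hypothetical names (`fake_service`, `lookup_per_record`, `lookup_per_partition`); in Spark, a per-partition function like this is the kind of thing you would pass to `mapPartitions`.

```python
service_calls = []  # records every request that reaches the "service"


def fake_service(key):
    """Stand-in for a remote web service (an assumption for this sketch)."""
    service_calls.append(key)
    return key.upper()


def lookup_per_record(partition, service):
    # Naive approach: one remote call for every record.
    return [service(key) for key in partition]


def lookup_per_partition(partition, service):
    # Cache answers within the partition so each distinct key is fetched once.
    cache = {}
    results = []
    for key in partition:
        if key not in cache:
            cache[key] = service(key)
        results.append(cache[key])
    return results


partition = ["us", "de", "us", "us", "de", "fr"]

naive_results = lookup_per_record(partition, fake_service)
naive_call_count = len(service_calls)

service_calls.clear()
cached_results = lookup_per_partition(partition, fake_service)
cached_call_count = len(service_calls)
```

The results are identical, but the cached version makes one call per distinct key rather than one per record.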
Congratulations! You’ve gotten a job interview working with some of the most sophisticated and complex software currently in widespread use, Apache Spark. While we can’t prepare you for everything an interviewer might throw at you, this list should help you focus your preparation and studying so that you’ll be able to showcase your Spark knowledge and skills in the best possible light.
If after reading this you’re thinking you’d like to boost your Apache Spark knowledge for an interview (or it just sounds like a platform you’d like to learn), check and see if your local coding bootcamp offers courses in Spark.