The Zen of Spark: What is it, exactly?
Apache Spark is a general-purpose, distributed, memory-based, parallel computing engine.
Man, that’s a mouthful.
Although technically accurate, that description comes across as just so much marketing-speak. But the fact is, we can learn a lot about Spark’s key characteristics by breaking it down.
- General Purpose: For quite some time, the MapReduce algorithm was the primary modus operandi for processing vast amounts of data in parallel at scale. Unfortunately, some computing tasks do not fit neatly into that way of organizing a problem. Spark can do things differently, including sitting back and listening for streaming data, which it packages up into “microbatches” and hands to your code. Your code can do whatever it needs to with the microbatch, or with a batch from the “regular” batch interface. In addition, Spark can manage state and hand it to your code with each element of a batch, enabling long-running, stateful operations (a minimal sketch of this appears just after this list).
- Distributed: Like most “internet-scale” software, Spark is designed to scale horizontally on what is essentially commodity hardware. It supports pluggable resource managers, such as YARN, Mesos, or its own built-in standalone scheduler, to manage the allocation of resources across the cluster. This means you can potentially run more than one application at a time on the same cluster. I use the word “application” for the sake of familiarity, but these are also sometimes referred to as “jobs”. The term “job” can be a bit of a misnomer, since it implies something that runs once and disappears; Spark applications do not necessarily behave that way.
- Memory-based: Spark can cache data, and in fact holds entire chunks (partitions) of data sets in memory as a matter of course. Best of all, thanks to its versatile caching abilities, you can reuse these chunks of data in subsequent operations; they do not need to be recomputed or re-read each time they are used (see the caching sketch after this list). Even more importantly, if the node holding a partition is lost, that partition can be rebuilt on another node without intervention. You don’t need to lift a finger. That is darn-near magical, and it is a big attraction for developers of mission-critical applications.
- Parallel Computing Engine: Spark’s superb parallel-execution abilities are quite likely another huge cause of its staggering growth in the marketplace: it simplifies the way we write programs that need to run in parallel. It is only a small exaggeration to say that, in Spark, parallel execution is more a function of configuration than of changes to your code. Nevertheless, it is important to understand Spark’s parallelism to some degree in order to write efficient Spark applications: the unwary can trigger unnecessary “shuffling” of huge amounts of data around the cluster (the shuffle sketch after this list shows one common pitfall).
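Here is a minimal sketch of that “microbatch plus state” idea, using the classic word-count pattern on the DStream API. The socket source, ten-second batch interval, and checkpoint path are arbitrary choices made for illustration, not a recommended production setup:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A minimal sketch of micro-batching plus state; source and interval are
// illustrative assumptions only.
object StatefulStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulStreamSketch")
    val ssc  = new StreamingContext(conf, Seconds(10)) // one micro-batch every 10s
    ssc.checkpoint("/tmp/spark-checkpoint")            // required for stateful operations

    // Every 10 seconds, whatever text arrived on the socket becomes a micro-batch.
    val words = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // Spark hands your function the values from this micro-batch together with
    // the running state it has kept for you across previous batches.
    val runningCounts = words.updateStateByKey[Int] {
      (newValues: Seq[Int], running: Option[Int]) =>
        Some(newValues.sum + running.getOrElse(0))
    }

    runningCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```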
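To make the caching point concrete, here is a small sketch of the kind you might type into spark-shell (which provides the `spark` session for you). The file path and column names are hypothetical stand-ins:

```scala
// Assumes the `spark` SparkSession that spark-shell provides.
// The path and column names below are hypothetical.
val readings = spark.read.json("hdfs:///data/sensor-readings")

readings.cache()   // keep the partitions in memory after the first action

// Both queries reuse the cached partitions instead of re-reading the source.
// If an executor is lost, Spark recomputes the missing partitions from
// lineage; no intervention needed.
val overVoltage = readings.filter("voltage > 250").count()
val avgByMeter  = readings.groupBy("meterId").avg("voltage")
avgByMeter.show()
```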
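And to illustrate the “configuration, not code” point alongside the shuffle warning, here is another spark-shell-style sketch (the `sc` SparkContext is provided by the shell; the input path is hypothetical):

```scala
// Parallelism is mostly configuration: spark.default.parallelism, the number
// of executors, and how the source data is partitioned. The path is hypothetical.
val pairs = sc.textFile("hdfs:///data/events", minPartitions = 64)
  .map(line => (line.split(",")(0), 1))

// groupByKey ships every individual record across the network before anything
// is combined: a full shuffle of the raw data.
val slow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values locally on each partition first, so far less
// data crosses the network. Same answer, much less shuffling.
val fast = pairs.reduceByKey(_ + _)

fast.take(10).foreach(println)
```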
So what kind of applications are a good fit for parallel computing in general and Spark’s brand in particular? The answer is not cut and dried, but there are two common usage scenarios:
- Applications with enormous or even unbounded data sets: One example is an application that processes sensor data. Readings may arrive every four seconds (the standard interval for electric utilities), so even a simple voltage meter generates huge data sets over time. Spark lets you process and analyze (or, perhaps more importantly, re-process and re-analyze) these data sets in a reasonable amount of time. You can scale up your cluster to handle a burst of processing and then scale it back down, or turn it off altogether, when there is no more data. This potentially saves money, and a lot of it.
- Applications requiring high-speed processing: To continue with the sensor-data example, consider an application that monitors the temperature in a million buildings worldwide on a one-minute interval. To minimize latency, those million readings could be partitioned across a Spark cluster, with appropriate and timely notifications or alarms raised whenever readings exceed configured thresholds. Spark Streaming provides a mechanism for doing this in near real-time, with all of Spark’s considerable scaling and partitioning capabilities behind it (a sketch of this scenario follows below). It also lets you do considerable analytical work in near real-time, on the fly, a fact we have exploited to remarkable benefit at Blueprint Consulting Services.
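As a rough illustration of that temperature-monitoring scenario, here is a hedged sketch using Spark Streaming. The socket feed, the “buildingId,tempC” record format, and the 30-degree threshold are assumptions made purely for illustration; a real deployment would more likely read from Kafka and write to a proper alerting sink:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: source, record format, and threshold are illustrative assumptions.
object TempAlarmSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TempAlarmSketch")
    val ssc  = new StreamingContext(conf, Seconds(60)) // one-minute micro-batches

    // Each incoming line is assumed to look like "buildingId,tempC".
    val readings = ssc.socketTextStream("collector-host", 9999)
      .map { line =>
        val Array(buildingId, temp) = line.split(",")
        (buildingId, temp.toDouble)
      }

    // Each reading is checked within the micro-batch it arrived in, so an alarm
    // goes out at most one batch interval after the reading was taken.
    val alarms = readings.filter { case (_, tempC) => tempC > 30.0 }

    alarms.foreachRDD { rdd =>
      rdd.collect().foreach { case (id, t) => println(s"ALARM: building $id at $t C") }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```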
The first class of application is the type most frequently associated with Spark: you feed the beast a ton of data, and it chews through the lot and runs whatever analytics you ask of it. Again, the speed is more a function of configuration and cluster size than of anything in your code.
The second class of application (those requiring near real-time responses) is a growing subset of the Spark community’s usage of the platform, thanks largely to Spark Streaming.
In the real world, things are often not so neatly divided or categorized: We now want real-time monitoring *and* rich analytics over the same data. Thankfully, Spark can handle both.