You’ve just learned about a new streaming data processing technology that would solve many of the technical challenges your organization faces today. Unfortunately, integrating and operationalizing it within your current solution would require significant time and budget.
Enter Apache Beam.
According to the project’s website, “Apache Beam provides an advanced unified programming model, allowing you to implement batch and streaming data processing jobs that can run on any execution engine.” It’s analogous to hiring a general contractor: the contractor engages specialized subcontractors to perform the work, yet you interact only with the general contractor. If you need a new roof on your home because a previous subcontractor did a sub-par job, you deal only with the general contractor. They don’t rebuild the entire house; they simply hire a new subcontractor to put on a new roof.
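To make the general-contractor analogy concrete, here is a minimal sketch of a Beam pipeline using the Python SDK (a toy word count; the input strings are invented for illustration). Notice that nothing in the pipeline code names an execution engine; that choice is made separately, at launch time.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The pipeline logic is runner-agnostic; the execution engine
# (the "subcontractor") is supplied through PipelineOptions at launch.
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "Create" >> beam.Create(["to be", "or not", "to be"])  # toy input
     | "Split" >> beam.FlatMap(str.split)                     # lines to words
     | "Pair" >> beam.Map(lambda word: (word, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))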
Dealing with “Out of Scope”
Today’s agile sprint teams are driven by their solution backlog, filled with bugs, feature requests and spikes describing what the current solution should deliver. Yet how often does a feature get requested, only to have the technical team dismiss it as “out of scope”? They note that the original specification document never mentioned the need for stateful computations, event-time windowing or some other fancy set of words describing the technical approach your request demands. “If only you had made it part of the original requirements,” they say, “then we could have accounted for it in our architecture and approach.”
So another project team is started, tasked with creating the “v-next” version of the original solution: all of the current functionality plus the newly requested features. It will be leaner, meaner and built on the latest technology so as to avoid the mistakes of the past. “It will scale with all your needs,” the highly motivated project team touts. Product backlogs are created. Releases are made. The world rejoices, until an “out of scope” feature is requested and the cycle repeats itself. As a decision maker, how do you break this cycle?
Enter Apache Beam.
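Those “fancy words” are first-class concepts in Beam’s programming model, not architecture-breaking change requests. As a hedged sketch (the event values and timestamps below are invented for illustration), here is event-time windowing in the Python SDK:

import apache_beam as beam
from apache_beam.transforms import window

# Invented sample events: (value, event time in seconds since the epoch)
events = [("click", 1.0), ("click", 30.0), ("click", 75.0)]

with beam.Pipeline() as p:
    (p
     | beam.Create(events)
     # Attach each element's event-time timestamp
     | beam.Map(lambda e: window.TimestampedValue(e[0], e[1]))
     # Group elements into fixed 60-second event-time windows
     | beam.WindowInto(window.FixedWindows(60))
     | beam.Map(lambda value: (value, 1))
     | beam.CombinePerKey(sum)  # counts are computed per window
     | beam.Map(print))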
Beam gives you a unified, portable and extensible foundation on which to make your top-level streaming architecture decisions. I’ve had the pleasure of meeting and talking with Andrew Psaltis, author of “Streaming Data: Understanding the real-time pipeline,” on several occasions. In his Apache Beam presentation at QCon in 2016, he noted:
“You can switch to whatever is more performant, more scalable, maybe something that requires a smaller footprint. Whatever your requirements are, it becomes easy to switch.”
You can view his presentation in its entirety at https://www.infoq.com/presentations/apache-beam.
Encouraging The “New Hotness”
Engineers and developers love working with new frameworks, libraries and APIs. Whether it’s for performance, ease of development, speed of deployment or plain intellectual curiosity, the desire to utilize < insert new technology here /> will always be a topic of conversation within technical teams.
Consider stream processing computation engines. In the last six years we’ve seen Storm, Spark, Flink and Apex (to name a few) grow in popularity. Each was, in its moment, the “new hotness,” and each promises scalable, performant and fault-tolerant solutions to today’s streaming data problems. In practice, each has its pros and cons within any given organization’s solution. How do you enable a technical team to stay relevant, curious and motivated to experiment with the next big thing without draining your budget?
Enter Apache Beam.
Admittedly, my interest in Apache Beam grew from a conversation I had with another engineer, Ryan Harris, at a local Apache Spark meetup. I’ve spent a lot of time with Spark and wanted to see what his excitement was all about.
I ran through the Python quick start at https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python with a local runner. Next I gave it a go with Google’s Cloud Dataflow runner. Finally, I ran it using the Spark runner. Aside from a few local development environment configuration adjustments (those were my own fault), Apache Beam let me quickly experiment with the capabilities of several different technologies.
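Switching engines amounted to changing pipeline options, not pipeline code. A sketch of the idea (the project and bucket names below are placeholders, and a real Dataflow job also needs credentials and, depending on the Beam version, a few additional options):

from apache_beam.options.pipeline_options import PipelineOptions

# Local testing with the DirectRunner (Beam's default)
local_options = PipelineOptions(runner="DirectRunner")

# Google Cloud Dataflow; "my-gcp-project" and "my-bucket" are placeholders
dataflow_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    temp_location="gs://my-bucket/tmp/",
)

# The same pipeline accepts either set of options unchanged, e.g.
# beam.Pipeline(options=local_options)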
You can check out the current Apache Beam capability matrix at https://beam.apache.org/documentation/runners/capability-matrix/. Don’t see the latest technology listed? Apache Beam is open source and has well-documented SDKs, so new runners can be created. Plus, Apache Beam is a core component of Google’s Cloud Dataflow service, so expect regular additions to the project.
Conclusion
As a decision maker, you want the peace of mind that a technical solution can scale with future business needs and enable innovation within your organization through technology experimentation. Apache Beam is a worthwhile addition to a streaming data architecture, and it delivers exactly that peace of mind.