Wednesday, January 20, 2016

Apache Dataflow Proposal


Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSLs in different languages, allowing users to easily implement their data integration processes.
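For concreteness, here's roughly what such a pipeline looks like: a minimal word-count sketch in the Beam-style Java SDK that grew out of this proposal. The package names, transforms, and file paths are illustrative of the model, not taken from the proposal text, and reflect the later Apache Beam naming rather than the 2016 SDK.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCount {
  public static void main(String[] args) {
    // One pipeline definition; the runner (Flink, Spark, Cloud Dataflow)
    // is picked via options at launch time, not in the code.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("Read", TextIO.read().from("input.txt"))
     .apply("Split", FlatMapElements.into(TypeDescriptors.strings())
         .via((String line) -> Arrays.asList(line.split("\\s+"))))
     .apply("Count", Count.perElement()) // yields (word, occurrences) pairs
     .apply("Format", MapElements.into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply("Write", TextIO.write().to("counts"));

    p.run().waitUntilFinish();
  }
}
```

The same pipeline object gets handed to whichever runner you configure; that runner-independence is the whole point of the unified model.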

Yeah, you can do it this way. Inevitably, people are moving away from map-reduce, and one of the best ways to design for richer functionality is to map stream-processing functions onto cluster hardware.

That sounds simple enough for a language, but it isn't. There are two major hurdles. First, you're dealing with mobile code, and only languages like Java and Python seem to provide that (and, of course, Erlang). Second, you're dealing with mobile data, which means you're going to fall back on a form of explicit memory management, since it's very costly to move large objects around.
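To make the mobile-code point concrete, here's roughly how the Java world handles it: a closure is shipped by serializing an object graph, which only works because the receiving JVM can load the same class bytes. A self-contained sketch, with the round-trip happening in one process as a stand-in for a network hop:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class MobileCode {
  // A function type that is also Serializable: the compiler then emits
  // serializable lambdas and method references for it.
  interface SerializableFunction<A, B> extends Function<A, B>, Serializable {}

  public static void main(String[] args) throws Exception {
    SerializableFunction<String, Integer> fn = String::length;

    // "Ship" the code: serialize the closure to bytes...
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
      out.writeObject(fn);
    }

    // ...and revive it "on the worker". This only works because the
    // receiving JVM already has (or can fetch) the defining class --
    // the real mobile-code problem hiding under Java serialization.
    try (ObjectInputStream in =
             new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
      @SuppressWarnings("unchecked")
      SerializableFunction<String, Integer> shipped =
          (SerializableFunction<String, Integer>) in.readObject();
      System.out.println(shipped.apply("mobile")); // prints 6
    }
  }
}
```

This is essentially how frameworks in this space distribute user functions to workers; the class-distribution problem is solved out of band, by shipping jars.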

Mobile code, mobile data. It's all solvable with current techniques, but all in all it makes the most sense to target an impure imperative OO language with support for mobile code and combinator-based programming. Preferably with primitives to turn an abstract data type into a service, or the reverse, when needed, to avoid major refactoring and allow seamless scalability.
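A sketch of what such a primitive would buy you, using a hypothetical KeyValueStore ADT (nothing here is from the proposal): the local and the service version share one interface, so code written against the ADT scales out without refactoring. Transport details are deliberately stubbed out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical ADT, purely illustrative of "ADT <-> service" as a primitive.
interface KeyValueStore {
  void put(String key, String value);
  String get(String key);
}

// The plain in-process version of the ADT.
class LocalStore implements KeyValueStore {
  private final Map<String, String> data = new HashMap<>();
  public void put(String key, String value) { data.put(key, value); }
  public String get(String key) { return data.get(key); }
}

// The "turned into a service" version: same interface, but every call
// would be forwarded to a remote node. Client code is identical either
// way -- that is the seamless part.
class RemoteStore implements KeyValueStore {
  private final String endpoint;
  RemoteStore(String endpoint) { this.endpoint = endpoint; }
  public void put(String key, String value) {
    throw new UnsupportedOperationException("wire protocol omitted: " + endpoint);
  }
  public String get(String key) {
    throw new UnsupportedOperationException("wire protocol omitted: " + endpoint);
  }
}

class Demo {
  public static void main(String[] args) {
    KeyValueStore store = new LocalStore(); // swap in new RemoteStore(...) to scale out
    store.put("lang", "Java");
    System.out.println(store.get("lang")); // prints Java
  }
}
```

Today you build this by hand out of proxies or RPC stubs; a language that offered it as a primitive would remove exactly the refactoring cost complained about above.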

So I'm stuck on designing a language for this, since a Java library with a Scala interactive front-end will probably be so much better.

I should get back to my language, but it ain't easy. Mobile code isn't trivial, and mobile data and seamless scalability are even further away.