Cloud Dataflow is a managed Google cloud service for processing data. It lets developers build pipelines, monitor their execution, and transform and analyse data, all in the cloud.
Cloud Dataflow is based on a highly efficient and popular model used internally at Google, which evolved from MapReduce and successor technologies like Flume and MillWheel. The underlying service is language-agnostic.
Cloud Dataflow represents all datasets, irrespective of size, uniformly via PCollections (“parallel collections”). A PCollection might be an in-memory collection, read from files on Cloud Storage, queried from a BigQuery table, read as a stream from a Pub/Sub topic, or calculated on demand by your custom code.
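To make the "one abstraction for every source" idea concrete, here is a minimal sketch in plain Python. It is not the Dataflow SDK: `ToyCollection`, its `from_*` constructors, and `word_lengths` are hypothetical names made up for illustration, and everything runs locally on a single machine rather than in parallel. The point is only that the transform does not care where its input came from.

```python
from typing import Callable, Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class ToyCollection(Generic[T]):
    """Hypothetical stand-in for a PCollection (not the real SDK class).

    It wraps a zero-argument factory that yields the elements, so the
    same object can be backed by memory, a file, or custom code.
    """

    def __init__(self, source: Callable[[], Iterable[T]]) -> None:
        self._source = source

    @classmethod
    def from_memory(cls, items: Iterable[T]) -> "ToyCollection[T]":
        data = list(items)
        return cls(lambda: data)

    @classmethod
    def from_text_file(cls, path: str) -> "ToyCollection[str]":
        def read_lines() -> Iterator[str]:
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    yield line.rstrip("\n")
        return cls(read_lines)

    @classmethod
    def from_function(cls, generate: Callable[[], Iterable[T]]) -> "ToyCollection[T]":
        # Elements are computed on demand each time the collection is read.
        return cls(generate)

    def map(self, fn: Callable[[T], U]) -> "ToyCollection[U]":
        return ToyCollection(lambda: (fn(x) for x in self._source()))

    def filter(self, keep: Callable[[T], bool]) -> "ToyCollection[T]":
        return ToyCollection(lambda: (x for x in self._source() if keep(x)))

    def __iter__(self) -> Iterator[T]:
        return iter(self._source())


def word_lengths(words: ToyCollection[str]) -> ToyCollection[int]:
    """A transform written once, usable on any ToyCollection of strings."""
    return words.filter(bool).map(len)


in_memory = ToyCollection.from_memory(["cloud", "", "dataflow"])
computed = ToyCollection.from_function(lambda: (w for w in ["pub", "sub"]))

memory_result = list(word_lengths(in_memory))
computed_result = list(word_lengths(computed))
```

The same `word_lengths` call works on both collections, and would work on `ToyCollection.from_text_file(...)` too. In the real service, the equivalent sources would be Cloud Storage, BigQuery, or Pub/Sub rather than these local stand-ins.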
Dataflow is designed to complement the rest of Google’s existing cloud portfolio. If you’re already using Google BigQuery, Dataflow lets you clean, prepare, and filter your data before it is written to BigQuery. Dataflow can also read from BigQuery if you want to join your BigQuery data with other sources, and the joined result can be written back to BigQuery.
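As a rough illustration of the clean, filter, and join steps described above, here is a small sketch in plain Python. It makes no BigQuery or Dataflow calls: the lists stand in for tables, and the field names (`user_id`, `amount`, `country`) and the helpers `clean_row` and `join_on_user` are assumptions invented for this example.

```python
from typing import Optional


def clean_row(raw: dict) -> Optional[dict]:
    """Normalise one raw record; return None if it should be dropped."""
    user_id = str(raw.get("user_id", "")).strip()
    amount_text = str(raw.get("amount", "")).strip()
    if not user_id:
        return None  # filter: records without a user are unusable
    try:
        amount = float(amount_text)
    except ValueError:
        return None  # filter: amount is not a number
    return {"user_id": user_id.lower(), "amount": amount}


def join_on_user(rows: list, users: list) -> list:
    """Inner join of cleaned rows with a second source, keyed on user_id."""
    by_id = {user["user_id"]: user for user in users}
    return [
        {**row, "country": by_id[row["user_id"]]["country"]}
        for row in rows
        if row["user_id"] in by_id
    ]


# Hypothetical raw input, as it might arrive before cleaning.
raw_events = [
    {"user_id": " A1 ", "amount": "10.50"},
    {"user_id": "", "amount": "3"},
    {"user_id": "b2", "amount": "n/a"},
    {"user_id": "B2", "amount": 4},
]

# Hypothetical rows standing in for data already held in a table.
existing_users = [
    {"user_id": "a1", "country": "IN"},
    {"user_id": "b2", "country": "US"},
]

cleaned = [row for row in map(clean_row, raw_events) if row is not None]
joined = join_on_user(cleaned, existing_users)
```

Here `cleaned` keeps only the two valid events, and `joined` is what you would write back to the destination table. In a real pipeline each of these steps would be a transform over a PCollection instead of a list comprehension.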
Because the service runs on Google’s infrastructure, Google handles operations and scaling, which removes most of the operational overhead from our side. All we as developers need to focus on is the application layer and its logic. Watch this space: this is quite interesting, and it may be a game-changer in the big data space.
For more information, refer to Google Cloud Data Processing Service and Google Cloud Dataflow.