Data analytics

Taming the Data Streams — Microservices Style

A case for, and a guide to, building your own simple data streaming pipeline

Dinesh Bakiaraj

--


Data loses value quickly over time. While that statement is a very broad stroke, the success of solutions and architectures today depends on this rapid race to convert raw data into actionable detail. If actionable insights are not inferred within an extremely tight window (usually seconds, I dare say milliseconds), then that data becomes exponentially less actionable or, as Forrester’s Perishable Insights calls it, less harvestable.

The key cogwheel in this machinery is the pipeline that transports this data from disparate sources to consumers. This is make-or-break infrastructure in data analytics. Any rust here will doom the effectiveness of the solution.

Excitingly, this problem has been solved to some extent, AWS Kinesis being one such managed, on-demand, cloud-native service. The traversal of data within the pipeline is managed by the cloud provider on our behalf, along with implicit pub-sub, elastic pipelines, IAM, garbage collection, etc. For all practical purposes one could move on, saying “problem solved”.

However, when architecting a microservices-based platform for data streaming and analysis, it may not be feasible for many solutions to punt the problem to an external service provider and throw a Hail Mary at it. Why?

  • Services have a recurring cost, which is not viable for many solutions.
  • Cloud services work best if you build the entire data analytics pipeline on them, end to end. Relying on them for only a portion of the pipeline, say just data streaming, may be detrimental for a number of reasons, latency being the most important one.
  • You will have to relinquish control over a crucial cogwheel in your machinery. As an engineer, to me, that is unsettling 😉

What I set out to do was to design a data streaming pipeline that I could implement as an alternative to the offerings of our “big brothers” of the cloud.

So what’s new here? Doesn’t it look the same as any other?

Let’s start with the goals of this pipeline:

  • The design must have most, if not all, benefits of existing cloud native data streaming solutions.
  • Should be modeled for and implementable as microservices.
  • Should guarantee time series sequencing. Time based sequencing will be the pipeline’s responsibility and not that of producer or consumer.
  • Its cogs/segments should be elastically and independently scalable.
  • Should be SIMPLE (reasonably, that is 😃). Seriously, this is crucial. No garbage collection, no strict consistency, no multicast, no memory sharing.

What differentiates this model is not new functions/segments but rather the attributes and characteristics of each segment, which make it possible to meet the above goals.

Producers

Anybody and anything in the system is a producer of data. Period. Data comes in all forms, shapes and sizes, and so do its producers. The pipeline should not have to know the form, shape or size of a producer. Producers send data to the pipeline and not to the consumer.

What a producer should know, however, is the data classification and the endpoint associated with each class, or as it is known in the wild: the Topic. So the producer knows the topic(s) to which the data should be sent.

A producer does not have to worry about complexities such as:

  • time based sequencing of data
  • multiple instances of itself producing the same class of data to the same topic

So essentially, a producer identifies the appropriate topic(s) for the data and sends them out. Nothing more, nothing less.
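
To make this concrete, here is a minimal producer sketch in Python. The topic endpoint, the message envelope (type, ts, data) and the HTTP transport are illustrative assumptions of mine, not part of the design itself; the only real responsibility is “pick a topic, send the data”.

    # A minimal producer sketch. Endpoint URL and message envelope are hypothetical.
    import json
    import time
    import urllib.request

    TOPIC_ENDPOINTS = {
        "sensor-metrics": "http://topics.local/sensor-metrics",  # assumed endpoint
    }

    def produce(topic: str, msg_type: str, payload: dict) -> bool:
        """Send one message to a topic; time-based ordering is the pipeline's job, not ours."""
        body = json.dumps({"type": msg_type, "ts": time.time(), "data": payload}).encode()
        req = urllib.request.Request(
            TOPIC_ENDPOINTS[topic],
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        try:
            with urllib.request.urlopen(req, timeout=2) as resp:
                return resp.status in (200, 201, 202)
        except OSError:
            return False  # topic queue full or unreachable; the producer decides whether to retry

    # produce("sensor-metrics", "cpu-metric", {"cpu": 0.73, "host": "node-1"})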

Topics

The implementation of a Topic (i.e., pub-sub) infra will have the following attributes:

  • A simple FIFO queue.
  • Messages are deleted on read, meaning there is no need for implicit garbage collection.
  • If the queue is full, push from the producer will fail.

The topic does not care about:

  • who produced the data
  • in which order the data was produced
  • how many instances of the producer application are producing
  • type of the data

So essentially, data is put on a queue by the producer and it waits there.
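
A minimal in-memory sketch of these Topic semantics, assuming a single process and a fixed capacity, might look like this (the class and method names are mine, purely for illustration):

    # Bounded FIFO, delete-on-read, push fails when full. Illustrative only.
    from collections import deque

    class Topic:
        def __init__(self, name: str, capacity: int = 10_000):
            self.name = name
            self._queue: deque = deque()
            self._capacity = capacity

        def push(self, message: bytes) -> bool:
            """Producer-facing: returns False instead of blocking when the queue is full."""
            if len(self._queue) >= self._capacity:
                return False
            self._queue.append(message)
            return True

        def pop(self, max_messages: int = 100) -> list:
            """Dispatcher-facing: messages are removed on read, so no garbage collection is needed."""
            batch = []
            while self._queue and len(batch) < max_messages:
                batch.append(self._queue.popleft())
            return batch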

Dispatcher

I know it sounds scary 😉 but in all reality, we only need to implement the following functions in a Dispatcher:

  • Read from topic.
  • Demux based on message types.
  • Reorder the messages based on time.
  • Dispatch to registered consumers.

Fine, there are some more details to these, but let’s only focus on its core attributes at this point and not on how it’s implemented. That is:

  • Allows consumers to subscribe to a message type.
  • Consumption can be: real time, near real time or historical.

In “real time” consumption type, data will not be batched but streamed one message at a time.

In “near real time” consumption, data will be batched for a specific time window and dispatched when the time window is “determined” to be complete. This determination is best effort, meaning that if we end up receiving data from producers outside this window, things get tricky. The simple action would be to drop it, since this is a best-effort scheme; however, the engineer in me is horrified by such a proposition 😉 and this is the part of the system that should expose enough levers to influence the desired behavior.

In “historical” consumption, data should be batched to a bigger time window and/or size upper bounds.

  • Each consumer gets its own dispatch stack, i.e., no memory sharing or locking with other consumers. The dispatch transport can be REST or streams (HTTP/2, gRPC, etc.) and is not really that important or interesting. What is interesting, though, is that the scheme should support real time, near real time and historical consumption.
  • Consumption is check-pointed, so consumers can start where they left off. This makes the life of a consumer so much simpler.
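
Putting the core dispatcher functions together, here is a rough sketch of one pass of the loop: read from the topic, demux by message type, reorder by producer timestamp and dispatch to subscribers past their checkpoint. The message envelope (type, ts) and the callback-based delivery are assumptions carried over from the producer sketch above; batching windows and transports are left out for brevity.

    # One illustrative pass of the dispatcher loop, built on the Topic sketch above.
    import json
    from collections import defaultdict

    class Dispatcher:
        def __init__(self, topic):
            self.topic = topic
            self.subscribers = defaultdict(list)  # message type -> [(consumer id, callback)]
            self.checkpoints = {}                 # consumer id -> last dispatched timestamp

        def subscribe(self, msg_type: str, consumer_id: str, callback):
            self.subscribers[msg_type].append((consumer_id, callback))
            self.checkpoints.setdefault(consumer_id, 0.0)

        def run_once(self):
            # 1. Read a batch from the topic (messages are deleted on read).
            messages = [json.loads(m) for m in self.topic.pop()]
            # 2. Demux based on message type.
            by_type = defaultdict(list)
            for msg in messages:
                by_type[msg.get("type", "unknown")].append(msg)
            # 3. Reorder each group by producer timestamp.
            for msg_type, group in by_type.items():
                group.sort(key=lambda m: m["ts"])
                # 4. Dispatch to registered consumers, skipping anything before their checkpoint.
                for consumer_id, callback in self.subscribers.get(msg_type, []):
                    pending = [m for m in group if m["ts"] > self.checkpoints[consumer_id]]
                    if pending:
                        callback(pending)
                        self.checkpoints[consumer_id] = pending[-1]["ts"]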

Consumers

Consumers come in all sizes, shapes and forms as well. Consumers deal with the pipeline and not with producers. The consumers and the pipeline should share/negotiate the following (a sketch follows the list below):

  • message type of interest
  • consumption type: real time, near real time and historical
  • transport type: REST, streaming, etc.
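
As a sketch, this negotiation could be as simple as posting a subscription contract to the dispatcher. The endpoint, field names and values below are assumptions made purely for illustration, not a defined API.

    # A hypothetical consumer registering its interests with the pipeline.
    import json
    import urllib.request

    subscription = {
        "consumer_id": "anomaly-detector-1",      # hypothetical consumer
        "message_type": "cpu-metric",             # message type of interest
        "consumption": "near-real-time",          # or "real-time" / "historical"
        "transport": "rest",                      # or "grpc", "http2-stream"
        "callback_url": "http://anomaly-detector.local/ingest",
    }

    req = urllib.request.Request(
        "http://dispatcher.local/subscriptions",  # assumed dispatcher endpoint
        data=json.dumps(subscription).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        urllib.request.urlopen(req, timeout=2).close()
    except OSError:
        pass  # endpoint is illustrative; a real consumer would retry or alert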

Gist

As you can see, the number of functions in each segment of the above data streaming pipeline is quite manageable and feasible. Keeping it simple and clean is all we need to get our own data streaming pipeline running in our systems. All said, the devil is still in the details 😃 But again, details are details. They have never been a good enough reason not to give anything we do our best.

To Be Continued …

The details captured here on the data streaming pipeline are just the start of the conversation. The intent was to point out that a data streaming pipeline can be as simple or as complex as you would like it to be. One doesn’t have to outsource this to external solutions out of fear.

I am hoping to expand more on each function of this pipeline in the coming days (or weeks 😉). So, one bite at a time.

Stay Safe. ☮️
