Data Streaming Application: handle with care

Data Intensive Dreamer
5 min read · Nov 6, 2020
Photo by Eric Muhr on Unsplash

I’ve decided to write this article (and maybe a series) because I am just in love with this topic. My passion for data streaming, and pretty much everything I know on the subject, I owe to Riccardo, my ex-colleague, who, by the way, is now one of my best friends :).

Dealing with streaming data is an art. Every data engineer is capable of handling data that is just sitting there. Batch processing, regardless of the size of the data, is easy stuff when compared with streaming data.

If you are wondering what streaming is, here is the Wikipedia definition:

Streaming data is data that is continuously generated by different sources.

The keyword is continuous. A continuous flow of data that keeps coming, too big to store, too fast to do complex processing on.

In this article, we will analyze some of the typical problems of a data streaming application.

Do not stare at data

Photo by Vasileia Eleftheriou on Unsplash

Yes, with stream processing you cannot do complex operations on data. No technology lets you stack up heavy operations like joins without paying the price. Every single function counts.

In one of my past projects, the client needed a streaming application to handle massive data aggregations. As if that were not enough, those operations had to be performed on a relational database!

As expected, the resulting pipeline was taking too much time to get things done. There are only two solutions in a case like this: remove the massive aggregations or try to optimise them. I opted for the latter in this particular project, but not every project is the same.

For example, suppose you have this situation: a table of punctual data containing every user’s click event. Your objective is to populate, for every event that arrives in your streaming pipeline, a daily and a monthly aggregation table. An easy but cumbersome solution is to compute both aggregations starting from the punctual table. A more lightweight solution is to use the daily aggregation table as the starting point for the monthly aggregation.
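
To make the idea concrete, here is a minimal sketch in plain Python (the table contents and names are invented for the example; the real project ran on a relational database) of how the monthly aggregate can be derived from the much smaller daily table instead of from the raw click events:

```python
from collections import defaultdict

# daily_clicks[(user_id, day)] -> number of clicks that day (hypothetical data)
daily_clicks = {
    ("alice", "2020-11-05"): 12,
    ("alice", "2020-11-06"): 7,
    ("bob", "2020-11-06"): 3,
}

def monthly_from_daily(daily):
    """Build the monthly aggregate from the (much smaller) daily table,
    instead of re-scanning every raw click event."""
    monthly = defaultdict(int)
    for (user_id, day), count in daily.items():
        month = day[:7]  # "2020-11"
        monthly[(user_id, month)] += count
    return dict(monthly)

print(monthly_from_daily(daily_clicks))
# {('alice', '2020-11'): 19, ('bob', '2020-11'): 3}
```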

Idempotency

Photo by boris misevic on Unsplash

Usually, when you are managing streaming data, you will deal with delivery semantics. This is a complex topic that deserves an entire article (maybe I will write one on it), so let’s just say that there are three types of delivery semantics:

  • At most once, this is the bad one :). You should always avoid it: it means that you are not sure that every message reaches its destination;
  • At least once, this is our case; we ensure that every message reaches its destination, but we accept that we can generate duplicated messages (see the sketch right after this list);
  • Exactly once, this is perfection; no need to explain it further.
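
As an illustration only (no real broker involved, all names are made up for the example), here is a tiny sketch of why at-least-once delivery can produce duplicates: the message is processed first and acknowledged afterwards, so a crash in between leads to a redelivery.

```python
# Simulated at-least-once consumer: process first, acknowledge afterwards.
queue = ["click-1", "click-2", "click-3"]
processed = []
acked = set()

def consume(crash_before_ack_on=None):
    for msg in queue:
        if msg in acked:
            continue
        processed.append(msg)          # side effect happens first...
        if msg == crash_before_ack_on:
            return                     # ...crash: the ack is never sent
        acked.add(msg)                 # ...then the message is acknowledged

consume(crash_before_ack_on="click-2")  # first run dies while handling click-2
consume()                               # restart: click-2 is delivered again
print(processed)  # ['click-1', 'click-2', 'click-2', 'click-3'] -> duplicate
```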

So if at least once is our case, we must ensure the idempotency of our operations. Idempotency means that no matter how many times our function is executed, the result stays consistent.

Going back to our previous example, remember the user click operation? If we have to produce a daily aggregate on a relational database, we have two choices:

  • For every message that arrives, we can sum its value into the daily table; but this is not idempotent: if a message is duplicated, we will add it twice;
  • An idempotent solution is to handle the whole process with a GROUP BY over the punctual table, as in the sketch below. No matter how many times we execute it, the result will always be correct.
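
To make the two choices concrete, here is a minimal sketch using SQLite from the Python standard library (table and column names are invented for the example, they are not from the original project):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id TEXT, day TEXT)")
conn.execute("CREATE TABLE daily (user_id TEXT, day TEXT, n INTEGER, "
             "PRIMARY KEY (user_id, day))")

def on_message_not_idempotent(user_id, day):
    # Non-idempotent: every (possibly duplicated) message adds 1 to the counter.
    conn.execute(
        "INSERT INTO daily VALUES (?, ?, 1) "
        "ON CONFLICT(user_id, day) DO UPDATE SET n = n + 1",
        (user_id, day),
    )

def rebuild_daily_idempotent():
    # Idempotent: recompute the aggregate with a GROUP BY over the punctual
    # table. Running it once or ten times gives exactly the same result.
    conn.execute("DELETE FROM daily")
    conn.execute("INSERT INTO daily "
                 "SELECT user_id, day, COUNT(*) FROM clicks GROUP BY user_id, day")

# One real click stored in the punctual table, but the message is handled twice.
conn.execute("INSERT INTO clicks VALUES ('alice', '2020-11-06')")
on_message_not_idempotent("alice", "2020-11-06")
on_message_not_idempotent("alice", "2020-11-06")        # duplicate -> counted twice
print(conn.execute("SELECT n FROM daily").fetchone())   # (2,)

rebuild_daily_idempotent()
rebuild_daily_idempotent()                               # run again: same result
print(conn.execute("SELECT n FROM daily").fetchone())   # (1,)
```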

Late data

Photo by Pierre Bamin on Unsplash

Last but not least, there is late data. Sometimes a message can simply be late, and you have to choose how to handle it.
In the most straightforward case, you can ignore it. In more complex cases, you have to deal with it.

For example, you can send late data to a different pipeline or choose to give those events an additional weight. It’s your choice! The most important thing is that you should never forget about late data. Can you imagine the impact of “forgotten” late data? What if your use case is counting a user’s click events during a session? If late data arrives, your statistics might not be consistent.

Remember the example? In that case, you can ignore it, because with the group-by technique late data will be treated in the right way. Speaking about late events, you will come into contact with a complex topic, “watermarks,” but, again, watermarks deserve an article of their own.
The lesson here is to always think about the possibility of late data.
One thing that makes the handling feasible is the availability of an event time in the message’s body. With that, you can handle late data easily; that’s precisely the case of our example.
Without an event time, well, it is hard (if not impossible) to deal with late data. One possible solution can be to use concepts like sessions and windowing in general, but, like watermarks, those concepts deserve a more in-depth analysis.
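
As a minimal sketch of the idea (the one-hour tolerance and the pipeline names are assumptions for the example, not from any specific framework), this is roughly what routing events by their embedded event time looks like:

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(hours=1)  # assumed tolerance for this example

def route(event, now=None):
    """Decide what to do with an event based on its embedded event time."""
    now = now or datetime.utcnow()
    lateness = now - event["event_time"]
    if lateness <= ALLOWED_LATENESS:
        return "main_pipeline"   # on time, or late within tolerance
    return "late_pipeline"       # too late: handle separately, e.g. re-weight or audit

now = datetime(2020, 11, 6, 12, 0)
on_time = {"user_id": "alice", "event_time": datetime(2020, 11, 6, 11, 30)}
too_late = {"user_id": "bob", "event_time": datetime(2020, 11, 6, 9, 0)}

print(route(on_time, now))   # main_pipeline
print(route(too_late, now))  # late_pipeline
```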

In conclusion, the problems described in this article are some of the most common ones I think about when developing a streaming application.

What about you? What are the things that you always keep in mind during streaming application development?
