
Data Processing Programming (1): Introduction

What is the problem?

According to the “Data Never Sleeps 5.0” report in 2017, “90% of all data today was created in the past 2 years[1],” and this rapid growth appears to be continuing. As the volume of data increases, so does its importance for the functioning of organizations and businesses. Business decision-making and operations have become more and more data-driven. With this development, the number of people who work with data, in one way or another, has grown dramatically. Disciplines such as Data Science, Machine Learning and Artificial Intelligence have emerged, alongside applied statistics and mathematics and many others, in which the bulk of the work consists of working with data. Data-processing code is no longer written only by programmers, analysts, engineers, data scientists and statisticians; more and more jobs require some level of working with data. Data literacy has become a standard skill and a requirement for many functions.

Working with data generally requires programming. To differentiate this type of programming from the traditional kind (software application development), let’s call it data-processing programming (DPP), or data programming for short. For our purposes, DPP can be defined as follows:

Data-processing programming is writing code that works on data (creating, reading, analyzing, transforming, operating on and managing it), irrespective of language (be it SQL, Python, SAS, R, Scala, Java …) and complexity (from a simple SQL query to large data pipelines to implementations of complex data science models).

Taken in this broad sense, one can reasonably say that the combined amount of code written for data processing far exceeds the code written by software developers and engineers strictly for application development. Yet, despite the growth of data-processing programming and its importance for organizations, the practice suffers from a serious lack of discipline and method. Most of the code written for DPP is of poor quality and does not measure up to proper programming standards. The problem is in fact two-fold:

  1. Data-processing programs mostly lack discipline and method, and
  2. There is not much literature available to address this problem.

For traditional programming and software development, there is a rich body of literature addressing various aspects of the practice, from concepts, principles and patterns to coding, design and best practices. For DPP, there is very little. The material that does exist mostly focuses either on specific tools or on learning the syntax of a particular language.

One example of this lack of discipline and of a properly defined foundation: something like “how to read a file in Python” is sometimes presented as a “Data Science design pattern”, although it is not a design pattern at all but a language idiom (this topic will be explored in detail in later parts). In other cases, traditional object-oriented design patterns are sold as “Data Science design patterns”. While it is true that most existing design patterns are generic and can be used in data science programming among other areas, calling them “Data Science design patterns” is misleading. A more appropriate name would be something like “some OOP design patterns that can be used in some data science programs”.
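
To make the distinction concrete, here is a minimal sketch of the kind of snippet that is sometimes labelled a pattern. It shows a Python language idiom (the `with` statement for safe file handling) and involves no design decision about how a program is structured. The function name `read_lines` and the sample file contents are illustrative assumptions, not taken from any particular source.

```python
import tempfile
from pathlib import Path


def read_lines(path):
    """Return the lines of a text file without their trailing newlines."""
    # The `with` statement closes the file even if an error occurs.
    # This is a language idiom, not a design pattern.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


if __name__ == "__main__":
    # Write a small hypothetical file, then read it back.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.txt"
        sample.write_text("a,1\nb,2\n", encoding="utf-8")
        print(read_lines(sample))
```

Knowing this idiom is useful, but it says nothing about how responsibilities are divided in a data-processing program, which is what a design pattern addresses.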

As yet another example, the term “Data Science design patterns” is sometimes used where the content is in fact about modelling, or about commonly occurring models relevant to data science. It is not clear in what sense these are “design” patterns. Not all patterns are design patterns. Design patterns as an engineering or programming concept are entirely different from mathematical or statistical patterns; they belong to separate categories. To make it clear from the outset: whenever the term design pattern is used in this writing, it is meant in the programming and engineering sense.

All the things mentioned above are important in their own right: one needs tools, knowledge of the relevant languages and knowledge of the domain. There is no denying that. But the programming problem must be clearly separated from the rest and treated in its own right, with a purely programmatic and engineering treatment of the subject in its general sense. It is time to pay attention to this long-overdue and much-ignored problem.

In DPP, programming is often treated as something of secondary importance, or at least is not given its due attention, which is simply wrong. As a result, both individual programmers and organizations incur huge costs. This series of articles intends to address this problem and to help promote a methodical and logical approach to the practice of data-processing programming.

Content organization

The material presented is drawn from experience with data-processing programming: problems, patterns and solutions that occur again and again. To formalize and generalize the content, principles from Software Engineering are combined with this experience, and new concepts are introduced where needed to cover the topic.

The content in this writing is divided into three parts. Part one introduces some fundamental concepts and principles, which lay a conceptual framework for the subject and for later discussions. Part two puts the principles into action by applying them to specific examples, introducing more concepts and patterns along the way. Part three is devoted to one of the most frequently occurring patterns in data processing, and in my view one of the most important design patterns in DPP, which will be called the Table-driven Design Pattern. This pattern will be investigated extensively from different angles, through which various engineering and design considerations are explained.
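
The Table-driven Design Pattern is defined properly in part three. As a rough preview only, the sketch below shows the general table-driven idea as it is commonly understood: behaviour is looked up in a data structure instead of being hard-coded in branching logic. It is not necessarily the formulation this series will use, and the rule table `CLEANING_RULES`, the function `clean_record` and the column names are all hypothetical.

```python
# Hypothetical rule table: column name -> transformation to apply.
CLEANING_RULES = {
    "name": str.strip,     # remove surrounding whitespace
    "age": int,            # convert text to an integer
    "country": str.upper,  # normalize to upper case
}


def clean_record(record, rules=CLEANING_RULES):
    """Apply the rule registered for each column.

    Columns without a rule pass through unchanged.
    """
    cleaned = {}
    for column, value in record.items():
        transform = rules.get(column)
        cleaned[column] = transform(value) if transform else value
    return cleaned


if __name__ == "__main__":
    raw = {"name": " Ada ", "age": "36", "country": "uk", "note": "x"}
    print(clean_record(raw))
```

In a sketch like this, supporting a new column means adding a row to the table; the processing function itself stays unchanged.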

Note: Part one does not contain many actual coding examples. Some may get impatient with this method, but it is done on purpose. Understanding the context and conceptual framework of a problem before attempting particular cases is of crucial importance for any learning process.

Desired outcome

This writing addresses the problem stated above: the lack of discipline in data-processing programming. It goes without saying that one article, one book, or even several books cannot cover all aspects of a subject such as this. But, within the capacity of this writing, an attempt is made to 1) raise awareness of the existence of the problem and its nature and 2) provide a fairly good grounding in the fundamentals of programming and design. The goal is to move away from random coding behaviour towards more conscious and analytical programming. The material should help one become a better programmer who writes maintainable, structured and well-designed programs.

Programming, besides being important in itself, is also an extremely powerful analytical, creative and problem-solving device:
A disciplined approach to programming goes a long way beyond the immediate coding need. In fact, it is a mode of thinking about things, a mode of looking at and solving problems, which in general promotes analytical and logical thinking.