Categories
Data Engineering Data Science Programming Software Engineering

Data Processing Programming: a Software Engineering Approach (4)

PART II: Principles in Action

Now that we have some tools at hand to work with (principles and concepts), it is time to apply them to concrete examples. It is worth noting that the purpose of these examples is only to illustrate the programming points, not their particular business content or meaning.

Note: The examples in this section are written in SQL because SQL illustrates the idea of structural code complexity well. The point, however, is not SQL coding technique. The purpose is to present general programming ideas, not tied to any one language.

In each example, we start with problematic code, explain what the problem is, and reason through how it can be changed and improved. Problems and complexities are resolved through various methods, mostly based on the principle of Separation of Concerns (SoC). Along the way, more new engineering and programming concepts are introduced.

Example 1

Look at the following code and, before reading any further explanation, try to understand what it does and whether there is any problem with it; and if so, what that problem is.

SELECT e.employee_id AS "Employee #",
       e.first_name || ' ' || e.last_name AS "Name",
       e.email AS "Email",
       e.phone_number AS "Phone",
       TO_CHAR(e.hire_date, 'MM/DD/YYYY') AS "Hire Date",
       TO_CHAR(e.salary, 'L99G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,'' NLS_CURRENCY = ''$''') AS "Salary",
       e.commission_pct AS "Commission %",
       'works as ' || j.job_title || ' in ' || d.department_name || ' department (manager: ' || dm.first_name || ' ' || dm.last_name || ') and immediate supervisor: ' || m.first_name || ' ' || m.last_name AS "Current Job",
       TO_CHAR(j.min_salary, 'L99G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,'' NLS_CURRENCY = ''$''') || ' - ' || TO_CHAR(j.max_salary, 'L99G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,'' NLS_CURRENCY = ''$''') AS "Current Salary",
       l.street_address || ', ' || l.postal_code || ', ' || l.city || ', ' || l.state_province || ', ' || c.country_name || ' (' || r.region_name || ')' AS "Location",
       jh.job_id AS "History Job ID",
       'worked from ' || TO_CHAR(jh.start_date, 'MM/DD/YYYY') || ' to ' || TO_CHAR(jh.end_date, 'MM/DD/YYYY') || ' as ' || jj.job_title || ' in ' || dd.department_name || ' department' AS "History Job Title"
FROM   employees e
       JOIN jobs j
         ON e.job_id = j.job_id
       LEFT JOIN employees m
              ON e.manager_id = m.employee_id
       LEFT JOIN departments d
              ON d.department_id = e.department_id
       LEFT JOIN employees dm
              ON d.manager_id = dm.employee_id
       LEFT JOIN locations l
              ON d.location_id = l.location_id
       LEFT JOIN countries c
              ON l.country_id = c.country_id
       LEFT JOIN regions r
              ON c.region_id = r.region_id
       LEFT JOIN job_history jh
              ON e.employee_id = jh.employee_id
       LEFT JOIN jobs jj
              ON jj.job_id = jh.job_id
       LEFT JOIN departments dd
              ON dd.department_id = jh.department_id
ORDER BY e.employee_id;

Code from DEV

PAUSE!!!

Well, the above code is simply not easy to decipher, at least not without serious effort. Yet once deciphered, the query turns out not to carry any complex logic at all, and for that reason it should be much easier to understand than it is. Namely, it extracts some employee-related data from a number of tables and performs some transformation and formatting on the extracted data. So what makes such simple logic difficult to understand? The problem is that the code does too many things at once, which makes it hard to read. Readability was given earlier as a criterion for code quality.

Doing too many things at once, in the same function, step, program or unit, is one of the most frequently occurring coding problems in data programming (actually, in any programming). It results in complex coding structures: bad code.

Code that is hard to understand is hard to change, maintain or test as well. It is prone to errors, and when an error happens, which in such cases it almost always does, it is hard to find.

Going back to the example code: specifically, it mixes two things together, data extraction and data transformation (two concerns). Realizing this fact and then separating the two concerns completely changes and improves the code. The first step, extracting the data, then looks like this:

SELECT e.employee_id,
       e.first_name,
       e.last_name,
       e.email,
       e.phone_number,
       e.hire_date,
       e.salary,
       e.commission_pct,
       d.department_name,
       dm.first_name,
       dm.last_name,
       m.first_name,
       m.last_name,
       j.job_title,
       j.min_salary,
       j.max_salary,
       jh.job_id,
       jh.start_date,
       jh.end_date,
       jj.job_title,
       dd.department_name,
       l.street_address,
       l.postal_code,
       l.city,
       l.state_province,
       c.country_name,
       r.region_name
FROM   employees e
       INNER JOIN jobs j
               ON e.job_id = j.job_id
       LEFT JOIN employees m
              ON e.manager_id = m.employee_id
       LEFT JOIN departments d
              ON d.department_id = e.department_id
       LEFT JOIN employees dm
              ON d.manager_id = dm.employee_id
       LEFT JOIN locations l
              ON d.location_id = l.location_id
       LEFT JOIN countries c
              ON l.country_id = c.country_id
       LEFT JOIN regions r
              ON c.region_id = r.region_id
       LEFT JOIN job_history jh
              ON e.employee_id = jh.employee_id
       LEFT JOIN jobs jj
              ON jj.job_id = jh.job_id
       LEFT JOIN departments dd
              ON dd.department_id = jh.department_id
ORDER  BY e.employee_id;

The new code is much better (though still not optimal):

  • Its intent is clear and definite: extracting employee data.
  • The relation between fields and source tables is clearer.
  • It is easier to see what kind of information is extracted and how it groups logically: employee personal info, job info, job history info, location info.

How to identify separate concerns in general?

But before revising this code further, let's pause and ask how, in general, to identify separate concerns in any code. In object-oriented programming (OOP), code is partitioned into units called classes; that is, software is designed in terms of classes. One rule for class design is the Single Responsibility Principle: each class should do one thing and only one thing. This idea can be generalized to any unit of programming beyond OOP. One method is to make a descriptive statement about the unit under consideration, answering "what does this unit do?". If the statement points to only one thing or action, the unit has one concern. If it points to more than one, there is potentially more than one concern. Applying this rule to the original example code: it extracts data and transforms it. It does two things.
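To make the "what does this unit do" test concrete, here is a small illustration (the ideas here are language-general, so Python is used; the function names and sample data are made up). The first function both selects fields and formats them, two concerns in one unit; the split version gives each concern its own unit.

```python
# One unit, two concerns: it extracts fields AND formats them.
def get_employee_report(records):
    return [f"{r['first_name']} {r['last_name']} <{r['email']}>" for r in records]

# Separated: one unit per concern.
def extract_employees(records):
    """Concern 1: extraction. Select only the fields we need."""
    return [(r["first_name"], r["last_name"], r["email"]) for r in records]

def format_employee(first, last, email):
    """Concern 2: presentation. Turn a record into display text."""
    return f"{first} {last} <{email}>"

rows = [{"first_name": "Ada", "last_name": "King", "email": "ada@example.com"}]
report = [format_employee(*e) for e in extract_employees(rows)]
```

Each of the two smaller units now passes the descriptive-statement test: its description points to exactly one action.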

Further revision

The revised extraction code, though clearer and cleaner than the original, is still not good enough. Namely, it has a giant join clause with too many tables. This kind of structure, although some mistakenly view it as advanced coding, is in fact a programming problem, which I would like to call the illusion of advanced coding. This illusion quite often results in poor code quality or sometimes complete project failure. (Complex queries are one of the major problems with data-warehousing projects and a cause of their failure.)

The code above can be made simpler by breaking the join into multiple steps. In some cases it is clear how to divide the joins because the tables belong to clearly separate logical groupings. Say a query joins three tables containing customer-level data and two others containing account-level info; then it can easily be broken along this line into two units. In this particular case, for instance, the code can be divided into current employee info, historical job info, job location info, and so on. Another thing that makes this code complex is that tables in the join appear multiple times under different aliases, playing different semantic roles. For example, the employees table is used once to extract employee info and again to extract the employee's manager info (since a manager is also an employee, after all). This is perhaps an even bigger source of complexity, and it suggests separating the query into at least two steps, with the first occurrence of each table in the first step and the repeated occurrences in the second.

FROM   employees e
       INNER JOIN jobs j
               ON e.job_id = j.job_id
       LEFT JOIN employees m
              ON e.manager_id = m.employee_id
       LEFT JOIN departments d
              ON d.department_id = e.department_id
       LEFT JOIN employees dm
              ON d.manager_id = dm.employee_id
       LEFT JOIN locations l
              ON d.location_id = l.location_id
       LEFT JOIN countries c
              ON l.country_id = c.country_id
       LEFT JOIN regions r
              ON c.region_id = r.region_id
       LEFT JOIN job_history jh
              ON e.employee_id = jh.employee_id
       LEFT JOIN jobs jj
              ON jj.job_id = jh.job_id
       LEFT JOIN departments dd
              ON dd.department_id = jh.department_id

This type of construct, especially when there are many tables, makes the code complex and hard to understand; it is hard to trace which information comes from which alias of which table, and so on. One way of dividing the code into two steps may look like the following:

Big code constructs (including large JOIN clauses) are common examples of a code smell. According to Martin Fowler,

A code smell is a surface indication that usually corresponds to a deeper problem in the system. 

WITH employee_info_1 AS (
SELECT e.employee_id,
       e.first_name,
       e.last_name,
       e.email,
       e.phone_number,
       e.hire_date,
       e.salary,
       e.commission_pct,
       e.manager_id,
       d.department_name,
       d.manager_id AS dept_manager_id,
       j.job_title,
       j.min_salary,
       j.max_salary,
       jh.job_id AS history_job_id,
       jh.department_id AS history_department_id,
       jh.start_date,
       jh.end_date,
       l.street_address,
       l.postal_code,
       l.city,
       l.state_province,
       c.country_name,
       r.region_name
FROM   employees e
       INNER JOIN jobs j
               ON e.job_id = j.job_id
       LEFT JOIN departments d
              ON d.department_id = e.department_id
       LEFT JOIN locations l
              ON d.location_id = l.location_id
       LEFT JOIN countries c
              ON l.country_id = c.country_id
       LEFT JOIN regions r
              ON c.region_id = r.region_id
       LEFT JOIN job_history jh
              ON e.employee_id = jh.employee_id
)
SELECT ei.*,
       m.first_name,
       m.last_name,
       dm.first_name,
       dm.last_name,
       jj.job_title,
       dd.department_name
FROM   employee_info_1 ei
       LEFT JOIN employees m
              ON ei.manager_id = m.employee_id
       LEFT JOIN employees dm
              ON ei.dept_manager_id = dm.employee_id
       LEFT JOIN jobs jj
              ON jj.job_id = ei.history_job_id
       LEFT JOIN departments dd
              ON dd.department_id = ei.history_department_id;

Now, with the data extracted, the operations on this data are encapsulated in their own step, as follows. (You don't need to understand the details of the code; that is not important for our purposes.)

SELECT employee_id AS "Employee #",
       first_name || ' ' || last_name AS "Name",
       email AS "Email",
       phone_number AS "Phone",
       TO_CHAR(hire_date, 'MM/DD/YYYY') AS "Hire Date",
       TO_CHAR(salary, 'L99G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,'' NLS_CURRENCY = ''$''') AS "Salary",
       commission_pct AS "Commission %",
       'works as ' || job_title || ' in ' || department_name || ' department (manager: ' || first_name || ' ' || last_name || ') and immediate supervisor: ' || first_name || ' ' || last_name AS "Current Job",
       TO_CHAR(min_salary, 'L99G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,'' NLS_CURRENCY = ''$''') || ' - ' || TO_CHAR(max_salary, 'L99G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,'' NLS_CURRENCY = ''$''') AS "Current Salary",
       street_address || ', ' || postal_code || ', ' || city || ', ' || state_province || ', ' || country_name || ' (' || region_name || ')' AS "Location",
       job_id AS "History Job ID",
       'worked from ' || TO_CHAR(start_date, 'MM/DD/YYYY') || ' to ' || TO_CHAR(end_date, 'MM/DD/YYYY') || ' as ' || job_title || ' in ' || department_name || ' department' AS "History Job Title"
FROM   table1;

The new code, though longer by number of lines, is much better than the original. It is more readable. It is more evident that information about employees is being extracted; one can tell which columns are extracted and which tables they come from, and it is easier to identify related fields: the logical grouping is more apparent. Also, the need for comments pretty much disappears; comments on the joins were only necessitated because the code was not clear. In general, one should write code in such a way that there is no need for comments. This does not mean comments should always be avoided; rather, the code itself should be readable enough for anyone to understand it. This is a principle of good programming.

Perhaps one could break the joins further into smaller ones, or along different lines (as noted above, along the logical grouping of tables), but that is not necessary here; the example is used only to the extent needed to illustrate the point. In fact, this whole discussion is not tied or limited to SQL coding either. It is about general ideas applicable to any programming: the complexity of code, principles for dealing with complexity and, most important of all, the thought process underlying all of this. It shows how, in general, a reasoned approach to programming improves it.

Refactoring

In the above example, the method applied to simplify and improve the code is called Refactoring, a very important concept in programming and software design. According to Martin Fowler, the originator of the concept,

Refactoring is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior.

Its heart is a series of small behavior-preserving transformations. Each transformation (called a “refactoring”) does little, but a sequence of these transformations can produce a significant restructuring. Since each refactoring is small, it’s less likely to go wrong. The system is kept fully working after each refactoring, reducing the chances that a system can get seriously broken during the restructuring.
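A small, behavior-preserving transformation of the kind Fowler describes can be sketched in a few lines (the function names, numbers and the flat shipping fee are made up for illustration; the refactoring applied is "Extract Function"):

```python
# Before: one expression mixes two ideas, the discounted net price
# and a flat shipping fee.
def total_before(price, qty):
    return price * qty - price * qty * 0.10 + 5.0

# After: each idea is extracted into its own named unit.
def net_price(price, qty):
    return price * qty * (1 - 0.10)   # 10% discount applied

def total_after(price, qty):
    return net_price(price, qty) + 5.0  # flat shipping fee

# The external behavior is preserved: both versions agree.
assert abs(total_before(20.0, 3) - total_after(20.0, 3)) < 1e-9
```

Each such step is tiny, but a sequence of them can restructure a system significantly while keeping it working throughout.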

According to another definition, 

In computer programming and software design, code refactoring is the process of restructuring existing computer code—changing the factoring—without changing its external behavior. Refactoring is intended to improve the design, structure, and/or implementation of the software (its non-functional attributes), while preserving its functionality. Potential advantages of refactoring may include improved code readability and reduced complexity; these can improve the source code’s maintainability and create a simpler, cleaner, or more expressive internal architecture or object model to improve extensibility.

There is one important point, though, that is not included in these definitions but in my view should be: with refactoring, it is important to consider two aspects, or two meanings, of it: the literal and the logical. In the literal sense, refactoring is applied to existing, already-written code to improve it. In the logical sense, it does not have to be that way. That is, code should not be written badly first and then improved through refactoring; rather, it should be written with refactoring already in mind, such that it requires the least amount of refactoring (properly designed) and ideally no refactoring at all (in principle).

To recap and further clarify the topic: 

  • Code complexity was the problem.
  • Refactoring was the method to resolve this problem and improve the code.
  • Separation of Concerns is one of the main principles by which refactoring, or any solution for complexity, works.

Example 2

The following example further illustrates and reinforces the idea of structural code complexity and why it is a problem. To begin with, like the first example, try to read and understand the code as is, without any extra information or context. 

SELECT ens.company,
       ens.state,
       ens.zip_code,
       ens.complaint_count
FROM   (SELECT company,
               state,
               zip_code,
               Count(complaint_id) AS complaint_count
        FROM   credit_card_complaints
        WHERE  state IS NOT NULL
        GROUP  BY company,
                  state,
                  zip_code) ens
       INNER JOIN (SELECT ppx.company,
                          Max(ppx.complaint_count) AS complaint_count
                   FROM   (SELECT ppt.company,
                                  ppt.state,
                                  Max(ppt.complaint_count) AS complaint_count
                           FROM   (SELECT company,
                                          state,
                                          zip_code,
                                          Count(complaint_id) AS complaint_count
                                   FROM   credit_card_complaints
                                   WHERE  state IS NOT NULL
                                   GROUP  BY company,
                                             state,
                                             zip_code
                                   ORDER  BY 4 DESC) ppt
                           GROUP  BY ppt.company,
                                     ppt.state
                           ORDER  BY 3 DESC) ppx
                   GROUP  BY ppx.company) apx
               ON apx.company = ens.company
                  AND apx.complaint_count = ens.complaint_count
ORDER  BY 4 DESC;

Code example from DataQuest

You may ask: how can one understand code without explanatory information, without someone already familiar with it explaining what it does? True, this is the kind of expectation that exists in practice in most places and cases, but it is simply wrong!

When someone inherits a code base, the original author is generally not there to explain it, and such explanation should not be assumed as part of any programming. The right approach is to assume the exact opposite and write code to be as self-explanatory as possible. For this reason, no explanation is provided at first, precisely to test the expressiveness and readability of the code exactly as it stands.

Back to the example code: admittedly, it is not easy to understand what it does. One could figure it out with some toil, but why should that be the case? What happens when one inherits code like this that goes on for many pages, perhaps many tens of pages?

Anyway, this code uses the credit_card_complaints table which, as its name implies, keeps data about credit card complaints. Among other fields, each record contains:

Company | State | Zip code | Complaint ID

What the query does is this: for each company, find the state/zip code(s) with the highest number of complaints.

Based on this logic, this is a straightforward MIN/MAX problem, which is very common in data analysis and should not be difficult or complex at all. Hence, the complexity of the given code comes not from the logic it implements but from the way the code is written.

Breaking the complex structure into simpler steps, one can construct the following algorithm:

  1. Count the number of complaints per company/state/zip.
  2. For each company, find the highest count from step 1.
  3. For each company, get the state/zip (one or more) associated with the highest count.

Note: It is critically important in programming to design before coding. If this principle is followed properly, it fundamentally changes the resulting program or software.

Step 1: For each company/state/zip, count the number of complaints:

SELECT company,
       state,
       zip_code,
       Count(complaint_id) AS complaint_count
FROM   credit_card_complaints
WHERE  state IS NOT NULL
       AND zip_code IS NOT NULL
GROUP  BY company,
          state,
          zip_code;

Let's call the above result the complaints_count table. Then,

Step 2: For each company, find the highest number of complaints:

company_max_count AS (
SELECT company,
       Max(complaint_count) AS max_complaint_count
FROM   complaints_count
GROUP  BY company
)

Step 3: For each company, retain the record(s) with the highest count:

SELECT cc.company,
       cc.state,
       cc.zip_code,
       cc.complaint_count
FROM   complaints_count cc
       INNER JOIN company_max_count cmc
               ON cc.company = cmc.company
                  AND cc.complaint_count = cmc.max_complaint_count
ORDER  BY 4 DESC;

Unlike the original query with its many nested subqueries, this new algorithm breaks the problem into three sequential, linear steps (analysis), each of which is simple and independent. In general, when dealing with structural code complexity: N linear one-dimensional constructs are better than one N-dimensional construct.

To justify this statement, compare the two versions of the example in terms of testability. In the revised code, each step does only one thing, which is easily testable.
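The three steps can also be exercised end to end. The following is a minimal sketch using Python's sqlite3 with an in-memory database; the sample rows and company names are made up purely for illustration:

```python
import sqlite3

# Build a tiny made-up complaints table in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE credit_card_complaints (
    complaint_id INTEGER, company TEXT, state TEXT, zip_code TEXT);
INSERT INTO credit_card_complaints VALUES
    (1, 'Acme', 'NY', '10001'),
    (2, 'Acme', 'NY', '10001'),
    (3, 'Acme', 'CA', '94103'),
    (4, 'Globex', 'TX', '73301');
""")

rows = conn.execute("""
WITH complaints_count AS (            -- step 1: count per company/state/zip
    SELECT company, state, zip_code,
           COUNT(complaint_id) AS complaint_count
    FROM   credit_card_complaints
    WHERE  state IS NOT NULL AND zip_code IS NOT NULL
    GROUP  BY company, state, zip_code
),
company_max_count AS (                -- step 2: highest count per company
    SELECT company, MAX(complaint_count) AS max_complaint_count
    FROM   complaints_count
    GROUP  BY company
)
SELECT cc.company, cc.state, cc.zip_code, cc.complaint_count
FROM   complaints_count cc            -- step 3: keep the matching record(s)
       JOIN company_max_count cmc
         ON cc.company = cmc.company
        AND cc.complaint_count = cmc.max_complaint_count
ORDER  BY cc.complaint_count DESC
""").fetchall()
# Acme's busiest location is NY/10001 with 2 complaints;
# Globex has a single location with 1.
```

Because each CTE maps one-to-one onto a step of the algorithm, each step can be queried and tested in isolation before being composed.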

Final Notes

In both examples the code does not perform much computation. If complex computation were added (as is the case in the real world), the code could become much more complex still; that is why avoiding complexity is so important in programming.

Tying together the ideas of complexity and separation of concerns, it can be stated:

The degree of complexity of a code structure depends on the number of concerns implemented in the same structure. The more concerns in the same step, the more complex it is.

As discussed in Part I and shown with examples in this part, the complexity and quality of data-processing code is not a matter of smaller code size or fewer lines of code.


Data Processing Programming, Part 3: Separation of Concerns

Separation of Concerns: a Fundamental Principle

In the last part, the concept of complexity was introduced and explored to some extent. Now the question is how to deal with it. The well-known rule is: break the complex thing into simpler, smaller elements, each simple enough to be comprehended easily. We have all heard this time and time again, and it is indeed the principle by which complexity is resolved; it is called analysis, and it is the essence of problem-solving. But how this is actually done is shown far less often. For our purposes (programming and design) there is one principle that helps achieve this: the Separation of Concerns principle (SoC). When it comes to designing programs, software applications or systems, this is one of the most fundamental principles there is, so much so that most other principles can be derived from it one way or another.

In computer science, separation of concerns (SoC) is a design principle for separating a computer program into distinct sections such that each section addresses a separate concern. A concern is a set of information that affects the code of a computer program. 

As an example of separation of concerns at work, consider the REST architectural style. Before REST (and web services), traditional web applications mixed back-end logic and front-end. For example, a web application would fetch data from the server (say, a partially processed Java data structure), perform the rest of the logic on the client, and then display it; each client request was tied to a specific function on the server. Then it was realized that these two aspects are separate and need not, and should not, be mixed. The back-end service provides data and does not know who uses it or how: the data could be shown on a web page or a mobile app, or read by a query from another application. The client does not know how the data is produced; it receives it in a platform-independent format, e.g. JSON.
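This decoupling can be sketched in a few lines (a toy illustration, not a real web service; the function names and data are made up). The provider knows nothing about presentation, and each consumer knows nothing about how the data was produced:

```python
import json

# "Back end": produces data in a platform-independent format (JSON).
def get_employee(employee_id):
    data = {"id": employee_id, "name": "Ada King", "title": "Engineer"}
    return json.dumps(data)

# Two independent "clients" consume the same payload differently.
def render_html(payload):
    e = json.loads(payload)
    return f"<h1>{e['name']}</h1><p>{e['title']}</p>"

def render_plain(payload):
    e = json.loads(payload)
    return f"{e['name']} ({e['title']})"

payload = get_employee(7)
```

A new client (say, a mobile app) can be added without touching the provider, which is exactly what separating the two concerns buys.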

https://api.zestard.com/wp-content/uploads/2015/12/What-is-Rest-API-02-1.jpg

One caution, though: most design principles, including SoC, were developed or mostly used within the context of object-oriented programming (OOP), which has dominated software engineering and design since the 1990s. Hence their explanations, including the source cited above, mostly refer to object-oriented terms such as information hiding, encapsulation, etc. In practice, many people have tied these principles so closely to OOP that few consider using them outside that realm. But in essence they are general, and the reader should focus on their general aspect. SoC, for example, is a very general principle that goes not only beyond OOP but even beyond programming; it applies to many diverse fields, e.g. the division of labour between collaborating teams. The objective of this writing is to do exactly this: bring out the general ideas from the principles, and adapt and apply them to data-processing programming.

Modularity

Modularity is of crucial importance in any program design, and the principle of modularity in design is based on SoC.

Modular programming is a software design technique that emphasizes separating the functionality of a program into independent, interchangeable modules, such that each contains everything necessary to execute only one aspect of the desired functionality.

So far, all these principles and concepts (complexity, SoC, concerns, modularity, etc.) may seem abstract to someone not very familiar with them. What exactly a concern is, how to identify and separate concerns, and so on may not yet be clear. Starting from the next part, the principles are put into action by showing how each is applied to concrete programming problems.


Data Processing Programming 2: Code Complexity

Part I: Conceptual Framework

The concept of Complexity

In this part, some of the most important concepts of programming and design are introduced, laying a foundation and a starting point for the subject. In subsequent parts, more concepts and principles will be introduced as the topic requires.

Instead of opening a new subject with a bunch of definitions, I think it is more useful to develop it by going through the thought process together with the reader. (In general, one has to be careful with definitions: they can be useful when used properly, but in many cases they act as inhibitors of a subject's exposition, since by setting the boundaries of a problem they at once limit its expansion. This is a question for the theory of definitions in philosophy, and we will not dwell on it here.)

So, let's begin. When programming, the right way is to look at the code from another person's perspective. This idea can be stated as the following principle:

Write the code such that someone else, completely unfamiliar with it, requires the least amount of effort to understand it.

This means, to put the same meaning in other words, that the programmer should aim for simplicity: the program should be simple. Thus, simplicity is the first concept encountered in trying to formulate a conceptual framework for data-processing programming (DPP). For now, no proof is given for the above statement; let's accept it as an axiom. It will be justified throughout this writing.

Okay, then what is simplicity? It turns out that defining simplicity directly is not easy (without restating the same thing in other words), nor is it that useful to do so. A better approach is to understand it through its opposite: complexity. At least in the context of coding, as will be shown, simplicity is nothing but avoiding complexity, and avoiding complexity is possible only by understanding it. Now that we have the second important concept in our investigation of the subject of programming, let's understand what complexity is.

What is Complexity?

Complexity, depending on the context, may refer to different things, and it needs to be clarified from the outset which meaning is intended in our exposition. Let's begin by describing which meanings of the term are NOT intended here.

It is NOT about Computational Complexity

Perhaps the most widespread use of the term complexity in computer science and related fields is computational complexity, often called just complexity. Computational complexity has a very specialized meaning: it is a measure of the runtime efficiency of an algorithm, the amount of resources the algorithm requires to run. The two most important resources considered in complexity analysis are time and space. For example, there are different algorithms for sorting arrays: Selection Sort, Bubble Sort, Insertion Sort, Merge Sort, Quick Sort, Heap Sort, etc. For the same array, each of these algorithms requires a different amount of space and time to execute, and this consumption of resources determines each algorithm's computational complexity. The more resources an algorithm uses, the more complex it is. Mathematically, computational complexity is expressed in what is called Big O notation.

Computational complexity is not of interest to us and is not our subject of discussion. As stated earlier, it is a measure of the runtime behaviour of algorithms, which is its fundamental property. We are interested instead in the complexity of the code of a program. From this point onward, we will set computational complexity aside completely.

Now let’s move on to the next topic, cyclomatic complexity.

Cyclomatic Complexity

According to Wikipedia, 

Cyclomatic complexity is a software metric used to indicate the complexity of a program. It is a quantitative measure of the number of linearly independent paths through a program’s source code.

To put it simply, by this metric, the more possible paths of execution a piece of code has, the more complex it is. For example, if the following is the whole code under consideration,

z = x * 2

its complexity is 1, since there is only one path through this code.

But for the following code

if (condition is true) then 

z = x * 2

else 

z = x * 4

the complexity increases to 2 because, depending on whether the condition evaluates to true or false, there are two paths along which the program can run. Thus, the second program is more complex than the first.
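Written out as runnable code, the branching version above has cyclomatic complexity 2. The same semantics can also be recast declaratively, so the branch (and with it the control-flow concern) disappears; this sketch uses made-up function names for illustration:

```python
# The if/else from the text: two independent execution paths,
# so a cyclomatic complexity of 2.
def scale_branching(x, condition):
    if condition:
        z = x * 2
    else:
        z = x * 4
    return z

# Declarative recast: the condition-to-multiplier mapping is data,
# and the function body has a single path again.
MULTIPLIER = {True: 2, False: 4}

def scale_declarative(x, condition):
    return x * MULTIPLIER[condition]
```

Both functions compute the same result; only the structure of the code, and hence its complexity by this metric, differs.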

Photo from Craftofcoding showing the code and its corresponding cyclomatic graph

The concept of cyclomatic complexity is useful because it provides insight into the nature of code complexity; any serious programmer has to know it. Yet, as important as it is, it is not broad enough and does not explain the types of complexity that commonly arise in data-processing programming. Why? Because it focuses on measuring the control flow of a program. While control flow analysis is important in many programming paradigms, in DPP many computations are declarative in nature (as in SQL): there is no control flow, and their complexity comes from other sources. Even most cases of if-else, switch or other conditional statements can be semantically recast in declarative form, eliminating the control-flow concern (this will be a major topic in Part III). To understand the type of complexity associated with DPP, a new concept, which I call structural complexity, is introduced.

Structural complexity

Perhaps a good way to explain structural complexity is to start somewhere else, with an analogy from mathematics. Although at first this material may seem trivial, the idea is analytically powerful and helps in understanding code complexity in DPP, and perhaps complexity in general. 

In mathematics, linear equations are the simplest of all equation types: they are the easiest to understand and to solve. That is why they are the first type of equation one learns at school. (Note: the equations and examples are intentionally simple, only to make the point without taking us away from our subject of enquiry.)

Anyways, the following is a linear equation:

24 = 2x (1)

Easy to solve, isn’t it? Just divide both sides by 2 and the solution is x = 12. Now, consider the next equation. 

24 = 2x + 5x – 3x + 6x + 4x (2)

This equation appears more complicated than the first one: it has more terms, and so on. But it is not. It can be reduced to 

24 = 14x

Therefore, in spite of its more complicated-looking form, in terms of complexity it is exactly equal to the first equation. In other words, they have the same degree of complexity.

Now consider the following quadratic equation:

24 = 2x² (3)

Although it looks simpler (shorter and more compact) than equation (2), it is in fact of a higher degree of complexity. It is not as intuitive and easy to understand as the linear equation. It can no longer be solved with the basic arithmetic operations (+, -, *, /); it requires a separate method (the quadratic formula) and more concepts such as radicals, irrational numbers and multiple solutions. (In fact, the modern formulation had to wait until René Descartes in 1637.) Speaking in terms of structural complexity, 2x² is a more complex structure than 2x + 5x – 3x + 6x + 4x, as it is more difficult to understand and to solve.
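The difference can be made concrete in a few lines of Python (a sketch using the equations above): the linear equation is dispatched with one arithmetic step, while the quadratic one needs a radical and yields two solutions.

```python
import math

# Linear: 24 = 2x -- one division solves it.
x_linear = 24 / 2

# Quadratic: 24 = 2x^2 -- requires a radical and has two
# solutions, x = +sqrt(12) and x = -sqrt(12).
x_pos = math.sqrt(24 / 2)
x_neg = -math.sqrt(24 / 2)
```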

Taking this analogy to the field of programming, the following nested loop

for i in array1:
    for j in array2:
        do something involving i and j at once

is more complex than the next two loops.

for i in array1:
    do something with i

for j in array2:
    do something with j

Why? Because in the first case, one needs to trace the values of both indices, i and j, together with any variables associated with them, at once, as both loops progress (and it gets harder as the loops and the operations within them grow). The two loops are intertwined and one cannot be separated from the other. In the second case, each loop is independent of the other, and one only needs to trace the progression of one loop at a time. Hence, it is simpler.
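To make the contrast concrete, here is a runnable sketch of the two shapes (the operations are placeholders, since the pseudocode above leaves them unspecified):

```python
array1 = [1, 2, 3]
array2 = [10, 20]

# Nested: every (i, j) combination must be tracked at once.
pair_products = []
for i in array1:
    for j in array2:
        pair_products.append(i * j)

# Separated: each loop can be read and traced on its own.
doubled = []
for i in array1:
    doubled.append(i * 2)

halved = []
for j in array2:
    halved.append(j // 2)
```

Reading the nested version requires holding both indices (and anything derived from them) in mind simultaneously; each of the separated loops can be understood in isolation.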

Now let’s take this idea one step further, or, one step closer to our subject of study, an example from data-processing programming. A nested query, like the following,

SELECT job_id,
       AVG(salary)
FROM   employees
GROUP  BY job_id
HAVING AVG(salary) < (SELECT MAX(AVG(min_salary))
                      FROM   jobs
                      WHERE  job_id IN (SELECT job_id
                                        FROM   job_history
                                        WHERE  department_id BETWEEN 50 AND 100)
                      GROUP  BY job_id);

involving three layers of SELECT, is more complex than three SELECT queries in sequence. The following is another example of complex code structure:

Code example from Quora

Structurally more complex code involves more dimensions to consider at once, hence it can also be called dimensional complexity. 
Emphasis: it is not being suggested here that things like nested loops or nested queries are always bad and should be avoided; it would be naive to say so. Just as mathematics needs equations of different degrees and types, these constructs are part of programming and exist for a reason. The problem is not the language constructs themselves but how they are used. Here they serve only to illustrate the concept of structural code complexity; their usage will be discussed later.
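As a sketch of the sequential alternative mentioned above (three SELECTs in sequence rather than three nested layers), the query can be broken into independent steps. The snippet below uses SQLite with tiny made-up stand-ins for the HR tables; all table contents and values are hypothetical, purely for illustration. (Note that SQLite, unlike Oracle, does not accept nested aggregates such as MAX(AVG(…)) in one query, which the decomposition sidesteps.)

```python
import sqlite3

# Tiny stand-ins for the HR schema tables; all data here is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees   (job_id TEXT, salary REAL);
    CREATE TABLE jobs        (job_id TEXT, min_salary REAL);
    CREATE TABLE job_history (job_id TEXT, department_id INTEGER);
    INSERT INTO employees   VALUES ('A', 1000), ('A', 2000), ('B', 9000);
    INSERT INTO jobs        VALUES ('A', 2000), ('A', 3000), ('B', 5000);
    INSERT INTO job_history VALUES ('A', 60), ('B', 200);
""")

# Step 1: which jobs appear in departments 50-100?
job_ids = [row[0] for row in conn.execute(
    "SELECT DISTINCT job_id FROM job_history "
    "WHERE department_id BETWEEN 50 AND 100")]

# Step 2: the highest per-job average of min_salary among those jobs.
placeholders = ",".join("?" * len(job_ids))
avgs = conn.execute(
    f"SELECT AVG(min_salary) FROM jobs WHERE job_id IN ({placeholders}) "
    "GROUP BY job_id", job_ids).fetchall()
threshold = max(row[0] for row in avgs)

# Step 3: jobs whose average salary falls below that threshold.
result = conn.execute(
    "SELECT job_id, AVG(salary) FROM employees "
    "GROUP BY job_id HAVING AVG(salary) < ?",
    (threshold,)).fetchall()
```

Each step can be inspected and tested on its own, whereas the nested form must be understood as a whole.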


Data Processing Programming (1): Introduction

What is The Problem?

According to the “Data Never Sleeps 5.0” report in 2017, “90% of all data today was created in the past 2 years[1],” and this growth obviously continues at an exponential rate. As the volume increases, the importance of data for the functioning of organizations and businesses increases as well: business decision-making and operations have become more and more data-driven. With this development, the number of people who work with data, one way or another, has grown dramatically. New disciplines such as Data Science, Machine Learning and Artificial Intelligence, along with applied statistics and mathematics and many more, have emerged in which the bulk of the work consists of working with data. Not only do programmers, analysts, engineers, data scientists and statisticians write data-processing code, but increasingly more jobs require some level of working with data. Data literacy has become the new standard skill and a requirement for many functions.

Working with data, in general, requires programming. To differentiate this type of programming from the traditional one – software application development – let’s call it: Data-processing programming (DPP) or Data-programming for short. For our purposes we can define DPP as: 

Data-processing programming is writing code that works on data (creates, reads, analyzes, transforms, operates on, or manages data …) irrespective of language (be it SQL, Python, SAS, R, Scala, Java …) and complexity (from a simple SQL query to large data pipelines to implementing complex data science models). 

Taken in this broad sense, one can safely say that the combined amount of code written for data processing far exceeds that written strictly by software developers and engineers for application development. Yet, despite this growth in data-processing programming and its importance for organizations, the practice suffers from a serious lack of discipline and method. Most of the code written for DPP is of poor quality and does not measure up to proper programming standards. In fact, there is a two-fold problem we are dealing with in this regard:

  1. Data-processing programs mostly lack discipline and method, and
  2. There is not much literature available to address this problem.

For traditional programming and software development, there is a rich body of literature addressing various aspects of the practice, from concepts, principles and patterns to coding, design and best practices. But when it comes to DPP, not much is available, and what materials exist are mostly focused either on tools (tool-specific) or on learning a particular language (syntax).  

As an example of this lack of discipline and properly defined foundations, there are cases where something like “how to read a file in Python” is presented as a “Data Science Design Pattern,” which is not truly a design pattern (this topic will be explored in detail in later parts) but rather a language idiom. In other cases, traditional object-oriented design patterns are sold as “Data Science design patterns.” While it is true that most existing design patterns are generic and can be used in data science programming among other fields, calling them “Data Science design patterns” as such is misleading. A more appropriate name would be something like “some OOP design patterns that can be used in some data science programs.” 

As yet another example, in some cases the term “Data science design patterns” is used where the content is in fact about modelling, or about commonly occurring models relevant to data science. But it is not clear in what sense these are “design” patterns. Not all patterns are design patterns: design patterns as an engineering or programming concept are entirely different from mathematical or statistical patterns; they belong to separate categories. To make it clear from the outset: in this writing, whenever the term design pattern is used, it is in the engineering and programming sense. 

In any case, all the things mentioned above are important in their own right: one needs tools, knowledge of the relevant languages and knowledge of the domain. There is no denying that. But the programming problem must be clearly separated from the rest and treated in its own right: a purely programmatic and engineering treatment of the subject in its general sense. It is time to pay attention to this long-overdue and much-ignored problem. 

Oftentimes in DPP, programming is treated as something of secondary importance, or at least not given its due attention, which is simply wrong; both individual programmers and organizations incur huge costs as a result. This series of articles intends to address this problem and to promote a methodical and logical approach to the practice of data-processing programming.

Content organization

The materials presented are drawn from the experience of data-processing programming: problems, patterns and solutions that occur again and again. To formalize and generalize the content, principles from Software Engineering are synthesized with this experience, and where needed new concepts are introduced to cover the topic. 

The content in this writing is divided into three parts. Part one introduces some fundamental concepts and principles which lay a conceptual framework for the subject and later discussions. Part two puts the principles into action by applying them to specific examples, introducing more concepts and patterns along the way. Part three is devoted to one of the frequently occurring patterns in data processing, and in my view one of the most important design patterns in DPP, which will be called the Table-driven Design Pattern. This pattern will be investigated extensively from different angles, through which various engineering and design considerations are explained.  

Note: In part one, there are not many actual coding examples. Some may get impatient with this approach, but it is deliberate. Understanding the context and conceptual framework of the problem, before attempting particular cases, is of crucial importance for any learning process. 

Desired outcome

This writing addresses the above-stated problem: the lack of discipline in data-processing programming. Of course, it goes without saying that one article, one book, or even many books cannot cover all aspects of a subject such as this. But, within the capacity of this writing, an attempt is made to 1) raise awareness of the existence of the problem and its nature, and 2) provide a fairly good grounding in the fundamentals of programming and design. The goal is to move away from random coding behaviour towards more conscious and analytical programming. The material should help one become a better programmer who writes maintainable, structured and well-designed programs. 

Programming, besides being important in itself, is also an extremely powerful analytical, creative and problem-solving device.
A disciplined approach to programming goes a long way beyond the immediate coding need. In fact, it is a mode of thinking about things, a mode of looking at and solving problems, which in general promotes analytical and logical thinking.