Categories
Data Engineering Data Science Programming Software Engineering

Data Processing Programming: a Software Engineering Approach (4)

PART II: Principles in Action

Now that we have some tools – principles and concepts – at hand to work with, it is time to begin by applying to concrete examples. It is worth noting that the purpose of these examples is only to illustrate the programming point and not their particular business content or meaning. 

Note: Examples in this section are chosen from SQL because they better illustrate the idea of structural code complexity. But, it is not about showing SQL coding techniques etc. The purpose is to present general programming ideas, not tied to any one language.  

In the examples, we start with a problematic code, explain what the problem is, and to reason and to show how they can be changed or improved. Problems or complexities are resolved through various methods, mostly based on the principle of Separation of Concerns (SoC). Also, in the process more and new engineering and programming concepts are introduced. 

Example 1

Look at the following code and before reading further explanations, try to understand what it does and if there is any problem with the code; and if so, what it is. 

SELECT

   e.employee_id AS “Employee # “,

   e.first_name || ‘ ‘ || e.last_name AS “Name”,

   e.email AS “Email”,

   e.phone_number AS “Phone”,

   TO_CHAR(e.hire_date, ‘MM/DD/YYYY’) AS “Hire Date”,

   TO_CHAR(e.salary, ‘L99G999D99’, ‘NLS_NUMERIC_CHARACTERS = ”.,” NLS_CURRENCY = ”$”’) AS “Salary”,

   e.commission_pct AS “Comission % “,

   ‘works as ‘ || j.job_title || ‘ in ‘ || d.department_name || ‘ department (manager: ‘ || dm.first_name || ‘ ‘ || dm.last_name || ‘) and immediate supervisor: ‘ || m.first_name || ‘ ‘ || m.last_name AS “Current Job”,

   TO_CHAR(j.min_salary, ‘L99G999D99’, ‘NLS_NUMERIC_CHARACTERS = ”.,” NLS_CURRENCY = ”$”’) || ‘ – ‘ || TO_CHAR(j.max_salary, ‘L99G999D99’, ‘NLS_NUMERIC_CHARACTERS = ”.,” NLS_CURRENCY = ”$”’) AS “Current Salary”,

   l.street_address || ‘, ‘ || l.postal_code || ‘, ‘ || l.city || ‘, ‘ || l.state_province || ‘, ‘ || c.country_name || ‘ (‘ || r.region_name || ‘)’ AS “Location”,

   jh.job_id AS “History Job ID”,

   ‘worked from ‘ || TO_CHAR(jh.start_date, ‘MM/DD/YYYY’) || ‘ to ‘ || TO_CHAR(jh.end_date, ‘MM/DD/YYYY’) || ‘ as ‘ || jj.job_title || ‘ in ‘ || dd.department_name || ‘ department’ AS “History Job Title” 

FROM

   employees e   JOIN jobs j 

    ON e.job_id = j.job_id

   LEFT JOIN employees m 

    ON e.manager_id = m.employee_id

   LEFT JOIN departments d 

    ON d.department_id = e.department_id

   LEFT JOIN employees dm 

    ON d.manager_id = dm.employee_id

   LEFT JOIN locations l 

    ON d.location_id = l.location_id 

   LEFT JOIN countries c 

   ON l.country_id = c.country_id 

   LEFT JOIN       regions r 

   ON c.region_id = r.region_id

   LEFT JOIN job_history jh 

   ON e.employee_id = jh.employee_id

   LEFT JOIN jobs jj 

   ON jj.job_id = jh.job_id

   LEFT JOIN departments dd 

   ON dd.department_id = jh.department_id 

ORDER BY

   E.employee_id;

Code from DEV

PAUSE!!!

Well, the above code is simply not easy to decipher, at least not without serious effort. In fact, once the query is understood, this query does not really carry any complex logic and for that reason it should be much easier to understand. Namely, it extracts some employee-related data from a number tables and does some transformation and formatting on the extracted data. But, what makes such a simple logic difficult to understand? Well, the problem with this code is that it does too many things at once which makes code hard to read. Readability was earlier given as a criterion for code quality. 

Doing too many things at once – in the same function, step, program or unit- is one of the most frequently occurring coding problems in data-programming (actually in any progrogramming) which results in complex coding structures, bad code. 

Code that is hard to understand, is hard to change, maintain or test as well. It is prone to errors and when error happens, which almost always does in such cases, is hard to find. 

Anyways, going back to the example code, specifically it mixes two things together: Data extraction and data transformation (two concerns). Realizing this fact and then separating two concerns, completely changes the code and improves it. Then the first step for extracting data looks like this:

SELECT   e.employee_id,

       e.first_name 

       e.last_name, 

       e.email, 

       e.phone_number, 

       e.hire_date, 

       e.salary, 

       e.commission_pct, 

       d.department_name, 

       dm.first_name, 

       dm.last_name, 

       m.first_name, 

       m.last_name, 

       j.job_title, 

       j.min_salary, 

       j.max_salary, 

       jh.job_id, 

       jh.start_date, 

       jh.end_date, 

       jj.job_title, 

       dd.department_name, 

       l.street_address, 

       l.postal_code, 

       l.city, 

       l.state_province, 

       c.country_name, 

       r.region_name 

FROM   employees e 

       INNER JOIN jobs j 

               ON e.job_id = j.job_id 

       LEFT JOIN employees m 

              ON e.manager_id = m.employee_id 

       LEFT JOIN departments d 

              ON d.department_id = e.department_id 

       LEFT JOIN employees dm 

              ON d.manager_id = dm.employee_id 

       LEFT JOIN locations l 

              ON d.location_id = l.location_id 

       LEFT JOIN countries c 

              ON l.country_id = c.country_id 

       LEFT JOIN regions r 

              ON c.region_id = r.region_id 

       LEFT JOIN job_history jh 

              ON e.employee_id = jh.employee_id 

       LEFT JOIN jobs jj 

              ON jj.job_id = jh.job_id 

       LEFT JOIN departments dd 

              ON dd.department_id = jh.department_id 

ORDER  BY e.employee_id; 

The new code is much better (though still not optimal):

  • its intent is clear and definite: extracting employee data. 
  • The relation between fields and source tables are clearer.
  •  It is easier to see what kind of information is extracted and their logical grouping: employee personal info, job info, job history info, location info

How to identify separate concerns in general?

But, before moving on further revising this code, let’s pause and ask, in general how to identify separate concerns with any code. In object-oriented programming (OOP) code is partitioned into units called class. That is, software is designed in terms of classes. One rule for class design is called the Single Responsibility Principle. According to this rule, each class should do one thing and only one thing. This idea can be generalized to any unit of programming beyond OOP. One method to go about this is to provide a descriptive statement about the unit under consideration, such as “what does this unit do”. If the statement points to only one thing or action, then the unit has one concern. If it points to more than one, then there are potentially more than one concern. Applying this rule to the original example code: it extracts data and transforms it – does two things. 

Further revision

The revised code for extracting data, though clearer and cleaner than the original code, still not good enough. Namely, it has a giant Join clause with too many tables. This kinds of structures, although some mistakenly view them as advanced coding, in fact is a programming problem which I would like to call the Illusion of advanced coding. This illusion, quite often results in poor code quality or sometimes complete project failure.  (complex queries is one of the major problems with data-warehousing projects and their failure). 

The above can be made simpler by breaking the join into multiple steps. In some cases it is more clear how to divide joins when tables belong to clearly separate logical grouping. Let’s say,  if in a query three tables contain customer-level data and two other containaining account-level info, then it can be easily broken along this line into two units. In this particular case, for instance, the code can be divided into current employee info, historical job info, job location info etc. Another thing which makes this code complex is that tables in the JOIN appear multiple times under different aliases with different semantic roles. For example, table EMPLOYEE is used once to extract employee info and another time to extract employee’s manager info (since manager is also an employee after all). Perhaps this is a bigger source of complexity and separates the query at least into two steps with the first occurrence of the tables in the first step and the next ones in the second step. 

FROM   employees e 

       INNER JOIN jobs j 

               ON e.job_id = j.job_id 

       LEFT JOIN employees m 

              ON e.manager_id = m.employee_id 

       LEFT JOIN departments d 

              ON d.department_id = e.department_id 

       LEFT JOIN employees dm 

              ON d.manager_id = dm.employee_id 

       LEFT JOIN locations l 

              ON d.location_id = l.location_id 

       LEFT JOIN countries c 

              ON l.country_id = c.country_id 

       LEFT JOIN regions r 

              ON c.region_id = r.region_id 

       LEFT JOIN job_history jh 

              ON e.employee_id = jh.employee_id 

       LEFT JOIN jobs jj 

              ON jj.job_id = jh.job_id 

       LEFT JOIN departments dd 

              ON dd.department_id = jh.department_id 

This type of construct, especially when there are many tables, make the code complex and hard to understand. Hard to trace which information comes from which alias of the table and so on. As a result, one way of revised code divided into two steps may look like the following:

Big code constructs (including JOIN statements) are common examples of code smell. According to Martin Fowler, 

A code smell is a surface indication that usually corresponds to a deeper problem in the system. 

WITH employee_info_1 (

SELECT e.employee_id, 

       e.first_name, 

       e.last_name, 

       e.email, 

       e.phone_number, 

       e.hire_date, 

       e.salary, 

       e.commission_pct, 

       d.department_name, 

       j.job_title, 

       j.min_salary, 

       j.max_salary, 

       jh.job_id, 

       jh.start_date, 

       jh.end_date, 

       l.street_address, 

       l.postal_code, 

       l.city, 

       l.state_province, 

       c.country_name, 

       r.region_name 

FROM   employees  e 

       INNER JOIN jobs  j 

               ON e.job_id = j.job_id 

       LEFT JOIN departments  d 

              ON d.department_id = e.department_id 

       LEFT JOIN locations  l 

              ON d.location_id = l.location_id 

       LEFT JOIN countries  c 

              ON l.country_id = c.country_id 

       LEFT JOIN regions  r 

              ON c.region_id = r.region_id 

       LEFT JOIN job_history  jh 

              ON e.employee_id = jh.employee_id 

)

SELECT ei.*,, 

m.first_name, 

m.last_name, 

dm.first_name, 

dm.last_name, 

jj.job_title, 

Dd.department_name 

FROM   employee_info_1 ei 

       LEFT JOIN employees m 

              ON e.manager_id = m.employee_id 

       LEFT JOIN employees dm 

              ON d.manager_id = dm.employee_id 

       LEFT JOIN jobs jj 

              ON jj.job_id = jh.job_id 

       LEFT JOIN departments dd 

              ON dd.department_id = jh.department_id; 

Now with data extracted, operations on this data is encapsulated in its own step as follows (You don’t need to understand details of the code. That is not important for our purposes.)

SELECT

   employee_id AS “Employee # “,

   first_name || ‘ ‘ || last_name AS “Name”,

   email AS “Email”,

   phone_number AS “Phone”,

   To_char(hire_date, ‘MM/DD/YYYY’) AS “Hire Date”,

   To_char(salary, ‘L99G999D99’, ‘NLS_NUMERIC_CHARACTERS = ”.,” NLS_CURRENCY = ”$”’) 

AS “Salary”,

   commission_pct AS “Comission % “,

   ‘works as ‘ || job_title || ‘ in ‘ || department_name || ‘ department (manager: ‘ || first_name || ‘ ‘ || last_name || ‘) and immediate supervisor: ‘ || first_name || ‘ ‘ || last_name

AS “Current Job”,

   To_char(min_salary, ‘L99G999D99’, ‘NLS_NUMERIC_CHARACTERS = ”.,” NLS_CURRENCY = ”$”’) || ‘ – ‘ || To_char(max_salary, ‘L99G999D99’, ‘NLS_NUMERIC_CHARACTERS = ”.,” NLS_CURRENCY = ”$”’)  

AS “Current Salary”,

   street_address || ‘, ‘ || postal_code || ‘, ‘ || city || ‘, ‘ || state_province || ‘, ‘ || c.country_name || ‘ (‘ || r.region_name || ‘)’ 

AS “Location”,

   job_id AS “History Job ID”,

   ‘worked from ‘ || To_char(start_date, ‘MM/DD/YYYY’) || ‘ to ‘ || To_char(end_date, ‘MM/DD/YYYY’) || ‘ as ‘ || job_title || ‘ in ‘ || department_name || ‘ department’ 

AS “History Job Title” 

FROM

   Table1;

The new code, though looks longer (by number of lines) is much better than the original code. It is more readable. It is more evident that information about employees is being extracted. One can tell which data/columns are extracted, from which table column come from and easier to identify fields that are related; logical grouping is more apparent.  Also, the need for comments pretty much disappears. Those comments on joins were necessitated because the code was not clear. In general, one should write code is such a way that there shouldn’t be any need for comments. It doesn’t mean that comments should always be avoided. Rather, the code should be tthe code itself is readable enough for anyone to understand it. This is 

   a principle of good programming.

Perhaps one can further break the joins into smaller ones, or in different ways (as was stated above along logical grouping of tables) but that wouldn’t be necessary. The example is used only to the extent to illustrate the point. In fact, this whole discussion is not tied or limited to SQL coding either. It is about the general ideas that are applicable to any programming: complexity of the code, principles for dealing with complexity and most important of all, the thought process underlying all this. It shows how in general a reasoned approach to programming improves it. 

Refactoring

In the above example, the method applied to simplify and improve the code is called Refactoring and it’s a very important concept in programming and software design. According to Martin Fowler, the originator of the concept, 

Refactoring is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior.

Its heart is a series of small behavior-preserving transformations. Each transformation (called a “refactoring”) does little, but a sequence of these transformations can produce a significant restructuring. Since each refactoring is small, it’s less likely to go wrong. The system is kept fully working after each refactoring, reducing the chances that a system can get seriously broken during the restructuring.

According to another definition, 

In computer programming and software design, code refactoring is the process of restructuring existing computer code—changing the factoring—without changing its external behavior. Refactoring is intended to improve the design, structure, and/or implementation of the software (its non-functional attributes), while preserving its functionality. Potential advantages of refactoring may include improved code readability and reduced complexity; these can improve the source code’s maintainability and create a simpler, cleaner, or more expressive internal architecture or object model to improve extensibility.

There is one important point though that is not included in these definitions but In my view should have been: 

with Refactoring, it is important to consider two aspects of it, or two meanings of it: the literal meaning and the logical meaning. In the literal sense, it is applied to existing, already-written code to improve it. But, in the logical sense it does not have to be that way. That is, the code should not be written badly first and then improved through refactoring, rather it has to be written with the refactoring already in mind, such that it should require the least amount of refactoring (properly designed) and ideally no refractory at all (in principle).

To recap and further clarify the topic: 

  • Code Complexity was the problem.
  • Refactoring was a method to resolve this problem or improve the code. 
  • Separation of Concern is one of the main principle by which the method of refactoring mainly, or any solution for complexity, works. 

Example 2

The following example further illustrates and reinforces the idea of structural code complexity and why it is a problem. To begin with, like the first example, try to read and understand the code as is, without any extra information or context. 

SELECT ens.company, 

       ens.state, 

       ens.zip_code, 

       ens.complaint_count 

FROM   (SELECT company, 

               state, 

               zip_code, 

               Count(complaint_id) AS complaint_count 

        FROM   credit_card_complaints 

        WHERE  state IS NOT NULL 

        GROUP  BY company, 

                  state, 

                  zip_code) ens 

       INNER JOIN (SELECT ppx.company, 

                          Max(ppx.complaint_count) AS complaint_count 

                   FROM   (SELECT ppt.company, 

                                  ppt.state, 

                                  Max(ppt.complaint_count) AS complaint_count

                           FROM   (SELECT company, 

                                          state, 

                                          zip_code, 

                                          Count(complaint_id) AS complaint_count

                                   FROM   credit_card_complaints 

                                   WHERE  state IS NOT NULL 

                                   GROUP  BY company, 

                                             state, 

                                             zip_code 

                                   ORDER  BY 4 DESC) ppt 

                           GROUP  BY ppt.company, 

                                     ppt.state 

                           ORDER  BY 3 DESC) ppx 

                   GROUP  BY ppx.company) apx 

               ON apx.company = ens.company 

                  AND apx.complaint_count = ens.complaint_count 

ORDER  BY 4 DESC; 

Code example from DataQuest

You may ask, how can one understand a code without explanatory information, without someone already familiar with the code explaining to me what it does. True, this is the kind of expectation that exists in practice in most places and cases, but it is simply wrong!

When someone inherits a code base or gets to work with it, in general, the original author is not there to explain the code. It should not be assumed as part of any programming. The right way is, really, to assume the exact opposite and write the code to be as self-explanatory as possible. For this reason, no explanation is provided at first, in order to test the very expressiveness and readability of the code exactly as is.

Back to the example code, admittedly, it is not easy to understand what it does. One could possibly figure it out with some toiling, but why should that be the case? What happens if one gets code like this that goes over many pages, perhaps many tens of pages?

Anyways, this code uses credit_card_complaints table, which as its name implies keeps data about credit card complaints. The records contain: company, state, zip and complaints ID and some other info. 

CompanyState Zip code tComplaint ID

What the query is doing is to get:

For each company, find the state/zip code(s) with the highest number of complaints. 

Well, based on this logic, this is a straight MIN/MAX problem which is very common in data analysis and should not be difficult or complex at all. Hence, the complexity of the given code does not come from the logic it implements but from the way the code is written. 

Breaking the complex structure into simpler steps, one can construct the following algorithm:

  1. Sum the number of complaints per each company/state/zip.
  2. For each company, find the highest count from step 1.
  3. For each company, get the state/zip ( 1 or more ) associated with the highest count. 

Note: It is absolutely important in programming to design before any coding. If this principle is followed properly, it will fundamentally change the resulting program or software.)

Step 1: For each company/state/zip count the number of  complaints: SELECT company,

  state,

  zip_code,

  Count(complaint_id) AS complaint_count

FROM   credit_card_complaints

WHERE  state IS NOT NULL

  AND zip_code IS NOT NULL

GROUP  BY company,

  state,

  zip_code  

;

Let’s call the above results, complaints_count table. Then, 

Step 2: For each company/state, find the highest number of complaints */

company_max_count as

SELECT company,  

      Max(complaint_count) AS max_complaint_count 

FROM   complaints_count 

GROUP  BY company

/* For each company/state, retain the record with highest count */

SELECT cc.company, 

      cc.state, 

      cc.zip_code, 

      cc.complaint_count 

FROM   complaints_count cc 

      INNER JOIN company_max_count cmc 

      ON cc.company = cmc.company 

      AND cc.complaint_count = cmc.max_complaint_count 

ORDER  BY 4 DESC

Unlike the original complex query with many subqueries, this new algorithm breaks the problem into three sequential or linear steps (analysis) each of which is simple and independent. In general, as for dealing with structural code complexity: N linear 1-dimensional constructs are better than one N-dimensional construct. 

To justify this statement, consider the two versions of the example and compare them in terms of testability. In the revised code, each step does only one thing which is easily testable. 

Final Notes

In both examples the code does not really perform much computation. If complex computation is added to them (as it is the case in the real world) then it can become much more complex therefore avoiding complexity is so important in programming. 

Tying the ideas of complexity and separation of concern, it can be 

Degree of complexity of a code structure depends on the number of concerns implemented in the same structure. The more concerns in the same step, the more complex it is. 

As it was discussed in part one and shown with examples in this part, complexity and quality of data-processing code does not depend on smaller code size, or fewer number of lines of code.