MRRisley:=("Programming for Analytics")

Github Icon with Link
1. Intro

Conceptually, a computer program is an algorithm: a set of rules, logic, and/or procedures in order to accomplish a task. In order for the program to run, it requires a language that translates these rules, logic, and/or procedures in a way that the program understands. Generally speaking, programs accept inputs and translate them into outputs.

Considered from end-to-end, data analytics requires a broad range of programs in order to extract and deliver insights from data:

  1. Data Acquisition and Pre-processing
    • includes data queries, establishing data definitions, and preliminary cleaning
    • throughout the project, it is usually necessary to return to this stage to introduce additional layers of granularity and segmentation
  2. Data Exploration
    • this step includes data visualization, summary, and outlier analysis in order to understand the data
    • often the universe of possible analytical solutions is narrowed in this stage as pieces of information are assigned greater or lesser importance
    • in a work environment, the choice of program depends on the company you work for and the nature of the task
    • also depending on the nature of the task, data exploration may be the extent of the analysis
  3. Data Analysis
    • this step may include more focused analysis using statistics, modeling, algorithms, etc.
    • the degree to which this step is successful is completely dependent on the success of the two prior stages
    • analytics degrees give you the tools for this step, but real-world data is rarely ready for them as-is
    • "garbage in, garbage out" is the mantra of analytics
    • the key to success in analytics is a healthy skepticism of all data you see balances against a willingness to accept data as "good enough" after sufficient cleaning and review
  4. Insight Delivery
    • through several iterations of steps 1-3, a summary and conclusion from the analysis is usually given to management
    • nature of the delivery depends on the audience and their level of technical experience
    • it may be completely appropriate to make no mention of statistics or modeling choices but only to deliver the results (executive or senior management)
    • in other settings, it may be appropriate to dive immediately into technical details
    • for non-technical audiences, it may be preferable to develop graphs/figures in Excel or Tableau as they are typically presentation-ready with little work
    • for technical audiences, direct output from statistical software may be acceptable or even expected
  5. Insight Maintenance
    • depending on the nature of the problem, you may be required to replicate the analysis in the future
    • automation of the analysis is sometimes preferred if it is easier than the alternative
    • this stage is usually referred to as placing the process into "production"
    • processes in production need monitoring to ensure it is performing as intended
Top
     a. Types of Programming
Most analysts work in multiple programming languages. While an Oracle Data Base (ODB) can be accessed within R, SQL will still be required to query the ODB. However, as many can tell you, R and SQL are very different programming languages with different uses. There are generally classed into "paradigms" based on the features of the language, including whether the language specifies sequence of operations and where modifications can occur. In general, a programming language for data analysis is used for its procedural function. A language in the procedural paradigm gives the computer specific operations to be completed in sequence. For example, SAS's PROC statement and VBA's subroutine is at its most fundamental level a procedure.

However, most languages share features of other paradigms. For example, R and Python also use concepts from the object-oriented paradigm, which includes objects and classes. Everything in R is an object of a specific class, which itself is usually an attribute of the object. For example, a 2x2 matrix in R is of class "matrix," and there are specific methods that can be used with a matrix object. (FYI--some purists will say a strict interpretation of object-oriented programming means that most R objects are not true "objects".)

In general, an analyst does not need to know the programming paradigm of a language and will never engage in semantic debates about them. However, it is helpful to understand the concepts and whether you're simply programming a procedure or programming an object. A procedure, such as calculating the sum of two variables and storing it in memory, is generally self-evident when you see it within source code. On the other hand, programming an object, such as telling the Excel application in VBA to spit out a user-friendly error for invalid inputs, may be more difficult to understand. Working with programming objects, which are common in most languages, requires knowledge of the universe of possible objects as well as the objects themselves.

A list of common programs utilized for analytics and a description follows:

  1. R
  2. R utilizes procedural languages C and FORTRAN for computation as well as the R language. It is a widely-used and flexible statistical computing software with some generalized programming ability. It was developed in the 90s. As an open-source platform, users can submit packages that include functions, tools, and statistical procedures. There are currently thousands of packages. Most uses of R require installation and loading of several packages.

  3. Python
  4. Unfortunately, I know roughly nothing about the Python language, but it is gaining significantly more popularity. It is similar to R in that it is open source with individual/joint contributors. In contrast to R, it is a more generalized programming platform. As with most programming languages, there are a vast number of resources on the web to get you started if you're interested.

  5. SAS
  6. SAS has been the workhorse of statistics for decades. Most dates in programs have numeric representation of dates in order to facilitate adding and subtracting. January 1, 1960 represents the number 0 in SAS, which is around the time the software began to be developed. Excel users will likely recognize that Microsoft uses December 31, 1899 as the zero point such that January 1, 1900 = 1. SAS is purely a statistical computing software that consists of two primary steps: DATA and PROC. DATA steps can create and manipulate data whereas PROC steps perform an action on the data. The SAS language is primarily procedure-based, but there is some flexibility through user-defined functions and macros.

  7. SQL
  8. For analytics in the private sector, SQL is almost universally a requirement for data retrieval, manipulation, cleaning, and segmentation. Corporations tend to rely heavily on Relational Databases, for which the SQL language was designed. SQL is generally unlike the three programs listed above. By itself, SQL (Structured Query Language) follows the declarative paradigm: the query tells the program what it requires from a database and how to structure the results. However, the structure of the results are not customizable beyond the SQL language. Importantly, however, there are several procedural extensions to SQL, with PL/SQL and T-SQL being the most popular. These extensions bring procedural elements to SQL, such as logic and loops. SQL is almost exclusively used with a procedural extension, which allows the user to customize the output through data definitions, etc.

  9. VBA
  10. Visual Basic (VB), the underlying language of VBA, is becoming increasingly obsolete outside of a handful of cases. In fact, VB is no longer supported by Microsoft. However, its implementation in Visual Basic for Applications (VBA) makes it a nearly ubiquitous feature of every corporation. Because of the heavy reliance on Microsoft applications, particularly Excel, VBA will continue to be relevant for computer programmers and analysts. VBA uses a host application (such as Excel) and allows the user to control the host application and its objects. VBA utilizes concepts from several programming language paradigms, including procedural, object-oriented, and event-driven. Unlike procedural and object-oriented programming paradigms, execution of an event-driven program can be influenced by an end user of the program. For example, the user selecting "yes" to a prompt delivered by the program can alter the course of the program. As anyone who has used a macro knows, the end user frequently does not need to know anything about VBA in order to use the developed program (referred to as a macro).

  11. Others
  12. Plenty of other statistical computing and programming softwares are available to analysts, each with their merits and limitations. Some are better equipped for large-scale implementations of analytical solutions; others are better equipped for day-to-day, smaller-scale analytical tasks. Many are open source because the coding/analytics community has thankfully championed democratization. Some of the softwares I've encountered are:

    • SPSS, Stata, EViews, and Power BI are off-the-shelf solutions for statistics and data, but they do not offer the general programmability such as R and Python. They have a Graphical User Interface (GUI) that provides ease of use to those who do not require additional flexibility from programming.
    • Tableau is a very popular program for data visualization and has a very large user base and following. It facilitates dashboard creation and works well with visualizing big data. As an off-the-shelf solution, however, it is difficult to add additional flexibility outside of the updates Tableau pushes out.
    • Hadoop and Spark are solutions for large-scale data and computations. They are useful production environments for complex algorithms that work on large scale data. They are part of the Apache Software Foundation, which offers many other solutions (Cassandra and Hive, for example).
    • H2O is an artifical intelligence and machine learning platform. It is available as a package in R and Python.
Top
     b. Programming Concepts

Despite different types of programming, they involve similar concepts that are shared across languages. For example, object-oriented and non-object-oriented programming use variables and procedures such as functions. The syntax varies from language to language, but the concepts are shared. The concepts are:

  1. Variables
  2. Variables allow the program to be dynamic. For example, a program can assign a numeric variable an approximately infinite number of possible values. In many programs, objects can also be treated as variables. For example, in VBA, you can assign a specific worksheet to a variable (i.e., a worksheet object).

  3. Functions
  4. For users of Excel, functions are a familiar concept. They take arguments (sometimes called parameters), which are then passed through the function's algorithm(s). For example, in Excel, the SUM function computes the sum of numbers. =SUM(1, 2, 3) provides the sum of the numbers 1, 2, and 3, which is 6. In programming, it is common to create functions in order to automate a task that would otherwise require multiple lines of code.

  5. Logic
  6. Logic is necessary for almost every program. It tells the program what to do if a certain condition is met (or is not met). This usually takes the form of an If...Then... statement and is one of the fundamental components of the procedural paradigm.

  7. Loops
  8. Loops are utilized across programming languages in order to perform many iterations of the same task. For example, if I wanted to color every cell in an Excel worksheet based on its value, I'd need to loop through each cell within that Worksheet. Even with thousands of cells, a loop condenses the procedure into only a handful of lines of code.

  9. Code Formatting
  10. It is very good practice to format code consistently and with commonly used standards. Never underestimate the power of whitespace, indentation, and comments for making your code more readable. Remember, code readability is not only for posterity's sake, but for yours as well.

  11. Debugging
  12. Even the best programmer will encounter errors during development. Knowing how to work through the error is known as "debugging". The program usually has built-in error handling to notify the programmer of the type of error and the location. Errors can sometimes lead to an hour of head-scratching before the solution is found. Therefore, understanding basic errors and what to look for when they occur are important measures to maintain sanity while coding.

Top
2. Variables

Variables are a requirement of any program. They usually are of a specific type or "class," ranging from integers to variables of a specific type of object within the language.

Variables in programming offer flexibility. Imagine you are trying to program a calculator-type program that adds two numbers. The user inputs these two numbers, and then the program displays the answer. Without variables, you cannot program all possible inputs and outputs:


1+1 = 2
1+1.1 = 2.1
1+1.11 = 2.11
....	
	
However, with variables, the task becomes much simpler:

x+y = answer
	
In the first example, the program would have to search for the user inputs in a field of already calculated outputs. In the second example, the program calculates the output (answer) based on the user inputs (x and y).

Theoretically, a program cannot handle all real numbers because there is an infinite set: you can always add an extra decimal place to an already existing number. Under double point precision, for example, each number requires 8 bytes of memory. This generally results in precision up to 17 or so decimal places. Different variable types can frequently add additional decimal precision if the nature of the problem requires it. Also, in cases where decimal precision is less of a concern than efficiency and memory usage, a variable type with less decimal precision is preferred.

Top
     a. Numeric Variables

Numeric variables range from integers (no decimal precision) to float numeric variables, which replicate approximations of the real number line. Integers are counting numbers (1, 2, 3, ...) and can be negative. "Float" is shorthand for the fact that a decimal place can occur anywhere in a string of numbers (i.e., it can "float"). In other words, float numeric variables have decimal places. Floats can technically be integers (1 is an integer and a float) but integers cannot be a float (1.1 is not an integer, but it is a float). All programming languages will at least have these two numeric data types. For example, R uses decimal values by default. Even if variable x is assigned an integer, it is stored as a number with decimal precision. However, x can be coerced to an integer type.

Beyond the basic integer and float, some programming languages offer variations of them that have different memory requirements and levels of precision. In T-SQL for example, the variable type bigint (short-hand for "big integer") stores integers from -2^63 to 2^63-1 and requires 8 bytes of memory. The variable type int, on the other hand, requires 4 bytes and ranges from -2^31 to 2^31-1. tinyint ranges from 0 to 255, which is equivalent to 1 byte. For those familiar with binary, 8 bits equals 1 byte, and 8-bit encryption allows 256 possible values. For float variables, two types are common: single- and double-precision floating-point. Single requires 4 bytes of memory while double requires 8 bytes. Some programs allow precision up to 16 bytes (128 bits).

You do not need to know bits and bytes in order to program effectively. However, in general, you need to know which types require more memory and why you might need them (i.e., for more decimal precision). It is also important to understand the types and the ultimate precision of your results. Consider the examples below from R and T-SQL. We're simply trying to divide 1/100, which represents 0.01, or 1%.

In R, the results are as we expect:


> 1/100
[1] 0.01
		
In T-SQL, they are not. The query below returns the result 0.

select 1/100
		
Why? R's default numeric variable type is a float. Therefore, the result can also be a float. T-SQL, on the other hand, will "guess" your variable type. In the example provided, It encounters a 1 and a 100, both of which it reads as an integer. It then returns the result as an integer, which has no decimal precision. So, you get 0. In order to get the expected result, you need to coerce at least one of the numbers into a float. The code below returns the expected result by coercing 1 to a float through the built-in cast function.

select cast(1 as float)/100
		

Top
     b. String Variables

Strings are usually used for non-numeric values, but virtually every character and symbol (and number) can be represented in a string. Strings are sometimes referred to as character(s). The following are all examples of strings. I use the double quotation mark (") in order to denote the start and stop of a string. The use of quotation marks to designate strings is common to almost every programming language.


"string"

"this is a string"

"stRing 123"

"string_123"

"123string"

"123"
		
Note in the examples above that there are generally no rules on strings: Anything can be a string. For data purposes, strings are common to use for categorical data. For example, if a car dealership's inventory were stored as data, one could imagine fields for vehicle characteristics such as color, make, and model. In these instances, these fields would be programmed as string variables, as shown in the data set below:

     Model    Color  
[1,] "Honda"  "Blue" 
[2,] "Honda"  "Red"  
[3,] "Toyota" "Black"
		
In this example, both of these fields are strings, as denoted by the double quotation marks. However, not all programming languages will display quotation marks in output for string variables. This example was created in R, which always displays quotation marks in output to designate strings. T-SQL and SAS output, on the other hand, will not display the quotation marks.

The most counter-intuitive part about strings are the fact that numeric values can be strings. This can cause a host of issues that are sometimes dependent on the programming language. Imagine merging two data sets that contain numeric account numbers. In the first data set, the variable type for account numbers is "string" to allow for leading zeroes. If not classed in this way, the account number string "099988" would become 99988 when evaluated as a numeric value. If the data type of the account number in the second data set is numeric, merging the data will likely result in a "type mismatch" error. This is a result of the fact that, without any instruction to the program language, string variables cannot be evaluated as numeric values and vice versa. In the R language, explicitly summing a character value and a numeric value will return an error:

> class("15")
[1] "character"
> class(15)
[1] "numeric"
> "15"+15
Error in "15" + 15 : non-numeric argument to binary operator
		
The first lines of code tell us that "15" is a character value and that 15 is a numeric value. The difference between a string and a numeric in this instance is simply the use of the quotation marks. This is true of all programming languages I've worked with. When these two variables are summed in R, an error is returned letting us know that we're attempting to perform a mathematical operation on a non-numeric value, which the program cannot evaluate.

In other programming applications, however, the program will attempt to coerce a value to the appropriate variable type. Consider the following line of code in T-SQL:

select '1' + 1
		
This code returns the value 2 without errors or warnings. Also, note that a string in T-SQL is represented through a single quotation mark (').

To complicate matters further, in some languages, it is impossible to tell visually whether a seemingly numeric value is truly a numeric value. This is the case with SQL databases, which require you to examine the structure of the table in order to determine the variable type of a field. Therefore, you may assume two numeric-seeming variables can be summed, and in many cases they likely can be without error. But if the query encounters an observation where this is not the case, an error will be thrown.

Finally, the last consideration for strings is case sensitivity. In most cases, such as in R and Excel VBA, programming languages treat "hello" and "Hello" as two unique strings. In other words, the two are not equivalent, as shown below.

> "hello"=="Hello"
[1] FALSE
		
Therefore, in R, if you search the data for "hello" and it is coded as "Hello", the search will not return any results. In other cases, like T-SQL and Excel VBA, you can search the data for "hello" and it will return any results that match it regardless of case. This disregard for case sensitivity is useful because you can find records that meet a criteria regardless of case. If working with raw data in R, tidying the cases on string variables may be required where it is not necessarily required in a SQL query. Of course, almost all raw data needs tidying to some degree, but it is important to know whether the programming languages accounts for case with string variables. It is also important to know whether in the original data case did, in fact, matter. Otherwise, your analysis may incorrectly assume equivalence between "b" and "B", for example, which may not have been the intention of the person or system who created the data.

Top
     c. Booleans

Boolean variables are used for programming logic. They have two possible values: True and False. Most programs involve checking conditions and performing tasks (or not performing tasks) based on whether one or more conditions are met. For example, imagine a Bank charges two different interest rates depending on a borrower's credit score. For a high credit customer with a credit score of at least 700, the bank will offer a more competitive rate of 3.5%. For those who do not meet that criterion, the rate is 4.0%. Imagine a system is designed to automatically assign the rate when the application is processed. This logic would involve:

  1. Get the customer's credit score.
  2. Is the customer score >= 700? If yes, store True in a boolean variable. Otherwise, store False.
  3. When the boolean variable is True, return 3.5%. When False, return 4.0%

Evaluating conditions and utilizing boolean variables are a critical component to programming more complex tasks and to answering more complex analytical questions.

Top
     d. Other Variables

Given the flexibility required to accomplish tasks with a programming language, each language will likely have a wide variety of additional variable types.

One somewhat common variable type is a "catch all" variable type. For example, in the Excel VBA language, variant variable types allow the variables to work with different types of values. During the course of a program execution, a variant variable type may at one point be a numeric, at another a string, and at yet another a Worksheet object. In general, it is advisable that when the variable type is known in advance to use the explicit variable type rather than a catch-all variable type such as variant. type in the Excel VBA language

The most common other variable type is an object variable. Precisely what kinds of objects are available depends on the programming language. For example, in R, I could have a variable that represents a vector or one that represents the output of a linear regression (referred to as a lm object). In Excel VBA, I could have a variable that represents a cell, a range, a worksheet, or a workbook. As mentioned above, these object variables require the most specialized programming-specific knowledge as they are usually only relevant to one language.

Top
3. Functions

Functions are self-contained programs that translate inputs into outputs. Even with no programming experience, you've probably already encountered functions. If I want to know the square root of 1,500 on a calculator, I'll use the √x button to return the answer. Pressing this button tells the calculator to perform the mathematical operation x^(1/2) (which itself is then a type of function). This conversion process is a function.

The most common example of a function and how one might use it in a program language is through the Excel built-in functions. Within a cell, I type the following: =SUM(2, 3, 4). This uses the SUM() function in order to sum a series of arguments, which in this example are the numbers 2, 3, 4. Evaluating the function will return the result 9 in the cell. Excel has many, many more built-in functions, which you can see here.

As the universe of possible Excel functions makes clear, built-in functions typically fall into different classes. For example, there are string functions that parse or combine substrings (a string within a string). There are others that perform statistical operations, such as returning a random sample from a normal distribution. And there are others yet that work with dates, which is actually a very common programming requirement.

Outside of what a function is used for, there are two different types of functions: built-in and user-defined. The examples above are all built-in functions. These functions are "built in" to the programming language and generally do not require the programmer to do anything in order to use them. Therefore, when I pull up an instance of R, without loading any package, I can type the following and return a result.


> max(1, 2, 3)
[1] 3
	

In this example, we can see that the max function returns the maximum of the inputs, which is 3. Maximum functions are built-in functions for most programming languages.

User-defined functions are functions built by the programmer. Many analytical and programming tasks require manipulation of data under a precise set of rules of formulas. Consider the following example. Say that a rental application approval requires three criteria to be met: (1) credit score > 600, (2) no history of bankruptcies, and (3) no felony charges. A simple function can be created to automate the decisions. You could design a function to which you pass an applicant's name, and then the function evaluates these three criteria for the applicant. If they are all true, the program returns "Approved", but if one is not true, the program returns "Denied." Functions are frequently required for more complex analytical/programming tasks that require automation.

Top
4. Logic

In many cases, logic is a necessary component of any program. In fact, in more ways than one, logic is the essence of programming. Whether we realize it or not, our interactions with technology always involve a behind-the-scenes logic. Through sequences of actions, we interact with programs in order to fulfill wants and needs. We can ask our Google Assistant or Alexa what the weather will be like tomorrow, we can change the channel on the television, or we can order a book on Amazon. All of these actions involve programming logic.

The foundation of logic is the If...Then...Else. Logic involves checking a condition, such as "Is it raining outside?" Depending on whether evaluation of the conidition is True or False, then we execute a decision. For example, if it is raining outside, you should take your umbrella with you. If it is not, you should not take your umbrella. Logic is integrated into programming in the same way that it is hardwired into our decision-making processes. In many ways, the human brain is a metaphor for programming: Programs are artificial intelligence.

Extending the If...Then...Else example above, logic in programming can be summarized with the following psuedo-code:


If {condition} Then {decision} Else {decision} 
	
Importantly, a {decision} node may involve the decision to "do nothing," i.e., to take no action.

To put a real-life decision into the logical form above, imagine approaching a traffic light while driving. Whether we realize it or not, the line of code above is executed in our brains. Of course, our brains are not always predictable: we may accelerate through a yellow light in some cases where we brake in others. However, to simplify this process in pseudo-code, we actually need to incorporate two "Ifs" in order to handle the three possible lights: red, yellow, and green.

If {light="red"} Then {press the brake} Else If {light="green"} Then {proceed} Else {be careful}
	
The code above can be parsed as follows: If the light is red, then you will press the brake. The Else If then provides directions if the light is not red, which implies it is either yellow or green. Therefore, if the light is not red and if the light is green, then you will proceed. The final Else means that the light is not red or green, which implies it is yellow. In this case, a driver has to "be careful": the driver has to make a quick judgment of the relative risk of stopping or proceeding. This is difficult to program in a systematic way. However, given the advent of autonomous (driver-less) vehicles, this difficult decision-making process is being actively translated into programming statements. And, as you can imagine, this is why programming and analytics always involve ethical decisions.

As the examples above illustrate, logic is used in programming to create decision trees. For example, if a user selects "Yes", then the program will execute a sequence of events. Infinitely long and infinitely complex decision trees can be programmed simply from the If...Then...Else logic.

Decision Tree with Logic
Top
     a. Compound Conditions

Frequently more than one condition needs to be evaluated in order to make a decision. Before deciding whether you need to take an umbrella with you, you may check whether it is currently raining OR whether rain is forecast in the future. Only one of many conditions may need to be true in this instance in order to make the decision "take my umbrella". This is an example of a compound condition.

There are two common keywords used with compound conditions: AND and OR. Some languages use symbols for these keywords. For example, R uses & in lieu of "AND". Nontheless, the uses of AND and OR are equivalent across programming languages.

For those familiar with Venn Diagrams or probability, AND represents an intersection of two conditions. This means that both conditions joined by the AND must be satisfied. You can also join an infinite number of conditions with AND.


{condition 1} AND {condition 2} AND {condition 3} ...
	
OR, on the other hand, represents a union of two conditions. This means that at least one condition joined by the OR must be satisfied. As with AND, you can join an infinite number of conditions with OR.

{condition 1} OR {condition 2} OR {condition 3} ...
	
Finally, you can join AND and OR in as many ways as required, but in these cases parentheses (or some form of bracketing) is recommended.

{condition 1} AND ( {condition 2} OR  {condition 3} )
	
In this example, condition 1 must be satisfied and either condition 2 or 3 must be satisfied.

The truth table below shows use of AND and OR and the universe of possible results:

Condition X Condition Y Compound Condition Result
T T X AND Y T
T F X AND Y F
F T X AND Y F
F F X AND Y F
T T X OR Y T
T F X OR Y T
F T X OR Y T
F F X OR Y F
Top
5. Operators

Operators are symbols and keywords utilized to perform various operations in programming languages. There are a wide array of operators for any given program and they are equally as likely to vary across programming languages as they are to be the same. The tables below list the most common operators as well as their symbols/keywords across the R, T-SQL, VBA, and SAS programming languages. Much like mathematical expressions, which adhere to the order of operations, there is operational precedence for all types of operators. For example, in SAS, AND operators are processed before OR operators.

  1. Mathematical Operators: performs mathematical operations such as addition and multiplication.
  2. Operation R T-SQL VBA SAS
    Addition + + + +
    Subtraction - - - -
    Mulitplication * * * *
    Division / / / /
    Exponentiation ^ ^ ^ **
    Natural Log log log log log

    It is log in VBA but ln in Excel.


  3. Comparison Operators: performs comparisons that return a boolean (True or False).
  4. Operation R T-SQL VBA SAS
    Less Than < < < < or LT
    Less Than or Equal To <= <= <= <= or LE
    Greater Than > > > > or GT
    Greater Than or Equal To/td> >= >= >= >= or GE
    Equal To == = = = or EQ
    Not Equal To != <> <> NE or ~= or ^=

  5. Logical Operators: performs operations with logical conditions. The NOT operator below toggles a boolean state. It turns a True into a False and a False into a True.
  6. Operation R T-SQL VBA SAS
    AND & AND AND & or AND
    OR | OR OR ! or | or OR
    NOT ! NOT NOT ^ or ~ or NOT
Top
6. Concatenation

When working with strings in programming, it is common to require concatenation to join them together. Concatenation simply refers to combining strings into a larger string, single string. The examples below show two or three strings and the result of concatenation. Double quotation marks are used by convention to represent strings.

String A String B String C Concatenated String
"Hello" "123" "Hello123"
"Hello" " " "Matt" "Hello Matt"
"2018" "" "A" "2018A"
"2018" "02" "01" "20180201"

From the examples above, there are a couple key points:

Concatenation is accomplished in very different ways across programming languages. The table below lists the programming language, a description of how to concatenate, and an example of concatenation in the language to join the strings "Hello", " ", and "Matt". As with most things in programming, there are in some cases alternative methods to concatenate strings. The methods below, however, will always accomplish the task.

Language How to Concatenate Concatenation Example
R Concatenation is performed through the concatenation function, paste(). The sep="" argument tells the function to not include a delimiter. paste("Hello", " ", "Matt", sep="")
This example uses the default sep=" " argument. The result is the same. Note that in R, omission of an argument with a default returns the default. paste("Hello", "Matt")
T-SQL Concatenation is achieved through the "+" symbol. Note that this operator is the same as the addition operator. 'Hello' + ' ' + 'Matt'
VBA Concatenation is achieved through the "&" symbol. "Hello" & " " & "Matt"
SAS Concatenation is achieved through the "||" symbols (double bars). Finally, SAS also has a number of concatenation functions. 'Hello' || ' ' || 'Matt'
Top
7. Loops

Loops are not necessarily a feature of every program, but they are certainly a common feature of programming. As an analyst, certain tasks will invevitably require a loop. A loop is simply a way to perform an iterative task repeatedly (i.e., in a loop). In my experience, loops have the most concrete examples in Excel VBA. Imagine you have a 50 row x 50 column block of cells in a spreadsheet (i.e., 2,500 cells). For every cell, you want to check if the value exceeds a pre-defined threshold and perform conditional formatting on the cell for when it does. You also want to count how many cells in the 2,500 range satisfy this criteron.

As with nearly every programming task, there is more than one way to accomplish the task, some of which actually do not require a loop. However, once you become familiar with the logic of loops and how they are programmed in a particular language, you will default to a loop in order to solve tasks rather than what are usually more complicated alternatives.

In the example of a 50 x 50 Excel range, one can easily establish the following logic to solve the task:

  1. Check row 1, column 1. Add 1 to a running total.
  2. Check row 1, column 2. Add 1 to the running total.
  3. ...
  4. Check row 50, column 50. Add 1 to the running total.
It is not difficult to see how quickly these nearly-identical tasks, when repeated in this manner, would prove a programming nightmare. Even though the tasks themselves are not complex, programming 2,500 near-identical lines would be very onerous. However, a loop can quickly accomplish this task. The pseudo-code below demonstrates how a loop for this task would work.


{Check if cell > threshold}
{Add 1 to a running total if cell > threshold}
{Move to next cell}
	

In most programming languages, the {Move to next cell} portion wraps the code that is meant to be repeated, either through a specific syntax or symbols. In programming, wrapping usually refers to placing a block of code within a pre-defined structure (referred to as a wrapper). For example, a wrapper function would wrap another function within it (in this instance, it makes the original function more customizable. A loop usually takes some variation on the pseudo-code below.


For [every cell in the 50 x 50 range]
{Check if cell > threshold}
{Add 1 to a running total if cell > threshold}
{continue until all cells are looped}
	

In the example above, the 2,500-line code that would explicitly perform the task can be condensed into 4 lines with the use of a loop. (In Excel VBA, it would also be written in 4 lines). All programming languages have loops available, but they usually involve different symbols and syntax. Regardless, loops are a simple and efficient way to accomplish iterative tasks.

Top
8. Source Code Formatting

Formatting code does not impact how the code is processed, but it does have a considerable impact on the "governance" of the code, including its readability by other programmers/users as well as validation that the code is performing as intended. Most importantly, formatting your code helps you work within your own code. I consider myself a notoriously unorganized coder, and I always marvel at organized code because it is highly readable.

While there are not necessarily any "rules" for formatting code, there tend to be some standards:

  1. Comments in code are always advised. In fact, I have never heard of someone inheriting code where there were too many comments. It is always the opposite--there are too few comments. Comments are useful for (1) identifying what the code is doing and (2) identifying future enhancements.
  2. Indentation of self-contained portions of code, such as If statements and loops, allow these portions of code to stand out.
  3. Whitespace allows breaks in the code. 1,000 lines of code without whitespace to divide it is far less readable than adding 50 lines of whitespace to break up the code into logically-defined sections.
  4. Finally, there are some conventions to naming variables within programs. It is advisable to name variables in a way that identifies their type and what value they are storing. For example, mtx may identify a matrix variable in R, and dblTemp may refer to a temporary numeric variable that has double-point precision.
Of the list above, only comments vary from language to language. Comments within a programming language are denoted with a symbol or character. Everything after a comment symbol/character are displayed as a comment within the code and do not execute. The code below shows how commenting working in various languages.


# R
# A comment in R is the number/pound (#) symbol.
x = 1 #comments can be inserted in-line

-----

' VBA
' A comment in Excel VBA is the single apostrophe (') symbol.
x = 1 'comments can be inserted in-line

-----

-- T-SQL
-- A comment in Excel T-SQL is two dashes (--).
select 1+1 --comments can be inserted in-line

-----

* SAS;

* There are two types of comments in SAS
The first is an asterisk (*) followed by a semi-colon (;);

/*The second begins with a foward slash (/) and asterisk (*)
and ends with an asterisk and foward slash /*

data temp /* only the second type can be used in-line */;

data temp; * in-line after the statement is recommended;
run; /* this way, errors are less likely to occur */
	

Indentation of code allows "blocks" of text that span multiple lines to be visually set apart from the remainder of the code. These indented blocks of code are usually used for code that executes loops or other statements that require several statements. An example of a loop in the R language is shown below with indendation.


for (i in 1:10) {
	x[i] = i*2
	y[i] = i*3+2
}
	

An example of an If block in Excel VBA is shown below.


If rng > 0 Then
	x = rng*2
	y = rng*3+2
Else
	x = rng
	y=rng*2+2
End If
	
Top
9. Debugging

Coders always encounter programming errors. Without understanding that errors are a part of programming, a first-time coder can easily find them frustrating if not downright insurmountable. "Debugging" errors is one of the key components of becoming a coder.

First, it is important to understand why errors occur in the first place. There are three reasons for errors:

  1. Errors occur for first-time coders because of unfamiliarity with the language.
  2. Errors occur for veteran coders when they attempt to code something unfamiliar.
  3. Errors occur for everyone because of rushing.

Combinations of (1) and (3) or (2) and (3) will almost always result in extended debugging processes. The best method when dealing with something completely unfamiliar is to begin with error-free code (an example from the internet or class). Then, modify piece-by-piece while handling each error as it occurs. If trying to run a code chunk with many errors, you will encounter more than one type of error through the debugging process. For a beginning coder, this may be the point at which you throw your hands in the air and say there is no use. For the veteran coder, a change in the error after a code change is new, and usually useful, information. Even then, the veteran coder may curse under his or her breath.

The point of all this rambling is:

Errors are your friend!

I have seen beginning coders never produce an error-free script. This is partly due to unfamiliarity, but it is also due to their attitude toward errors. Write the entire program and then run the code to see what happens is never a good route. Break the task/program into bite-sized chunks that you tackle individually and sequentially. Code downstream from an error will almost always produce more errors:


> x <- "1"
> z <- x*2
Error in x * 2 : non-numeric argument to binary operator
> print(z)
Error in print(z) : object 'z' not found
	

In the R example above, x is stored as the value 1, but it is a string. Therefore, when multiplied by 2, you get an error based on incompatible types. The third error only results from the fact that the second statement failed (i.e., nothing was assigned to the variable z. Otherwise, there is no error:


> x <- 1
> z <- x*2
> print(z)
[1] 2
	

Without accounting for the first error, you cannot account for subsequent errors. Therefore, you must always debug errors in the order in which they occur.

The advice above is the most important part of understanding the debugging process, together with: (1) have patience and (2) Google often. Even though I am (mostly) patient and a proficient Googler, I have had to spend hours at times on the most elusive bugs. This is the nature of programming.

Finally, there are types of errors that can help you in the debugging process. I think of them in 3 groups, shown below. There are others out there, but these are the main culprits.

  1. Syntax Errors: These involve an error in the way in which the code is written. This can be a misplaced or missing comma, a misspelled keyword or variable, or, if your language is case-sensitive, improper case.
  2. Here's a syntax error in R because the as.character function has an improperly capitalized "c":

    
    > as.Character(1)
    Error in as.Character(1) : could not find function "as.Character"
    	

    Here's a syntax error in Excel VBA:

    VBA Syntax Error

    Note that the error message received in Excel VBA is somewhat more straightforward than the error message in R. Also, Excel VBA turns the text to the color red when there is a syntax error.

  3. Type Mismatch Errors: These involve an error from attempting to perform a function with an incompatiable variable type. See Variables above. This can be a string multiplied by a numeric as in the R example above, or it can result from passing a list in R to a function that only accepts vectors.
  4. Logic Errors: These are "invisible" errors, but they are the worst kind of error because they can result in incorrect conclusions or inferences. A logic error results from incorrect logic within the program itself. For example, say your manager wants to apply an attrition assumption for lost business of 1.5%. Within the program, a logic error would result if you misplace the decimal point: x*1.5 where x is your gross business. Without any other errors, the code would execute and spit out an attrition assumption. However, the attrition would be 1.5 times your gross business, which would imply more lost business than you have. In certain applications, this error may be easy to spot, but in other applications or with more subtle errors, it may go undetected without thorough code testing. Therefore, the best way to prevent logic errors is to test your code and test it often. I've also found a second set of eyes (such as my manager) helpful as well. Don't accept a counter-intuitive answer without making sure there aren't any obvious logic errors.
Top
Top
Github Icon with Link
MRRisley Home