MRRisley:=("SQL Data Types")

Note: This guide is meant to be a companion to Itzik Ben-Gan's T-SQL Fundamentals, 3rd edition.
Github Icon with Link
Back to SQL Main Page
1. The CAST function

The CAST function is extremely useful in T-SQL. Data type conversion functions are a regular feature in programming. This is because when we read in data to a program, the program assumes a data type for us unless we specify it. In most cases the program gets it right enough, but in others, such as with dates, the program may need to be told to treat it as a date versus a character string.

The CAST function takes a single argument--the field or value you are converting and the type to which you are converting it:


cast(value as data_type)
	

The following examples demonstrate the various uses of the CAST function.


select 
/*line 1*/  cast('9999' as float) 'string to float',
/*line 2*/  cast(9999.9 as int) 'decimal to integer',
/*line 3*/  cast(3.14 as dec(6, 5)) 'adding decimal places',
/*line 4*/  cast('12/18/2015' as datetime) 'datetime',
/*line 5*/  cast('5/6/1971' as smalldatetime) 'smalldatetime',
/*line 6*/  cast('January 31, 2018' as date) 'date',
/*line 7*/  cast('hello world' as varchar) 'varchar'
	
string to float        decimal to integer adding decimal places                   datetime                smalldatetime           date       varchar
---------------------- ------------------ --------------------------------------- ----------------------- ----------------------- ---------- ------------------------------
9999                   9999               3.14000                                 2015-12-18 00:00:00.000 1971-05-06 00:00:00     2018-01-31 hello world

(1 row(s) affected)
	

The query executes the following:

Top
2. Number Data

A language-neutral description of numeric data in analytics programming can be found here.

Similar to other programming languages, T-SQL relies on types of numeric values for the purposes of managing data storage, which is mostly a question of decimal precision when it comes to analytics. However, it is important to understand the memory requirements with each data type and also ways in which queries can be optimized if relying upon large quantities of data conversions.

Numeric data within SQL can be divided as follows:

  1. Integer
  2. Decimal/Numeric
  3. Float

There are also bit and money types, but these are rarely required for analytics purposes, and I've never seen the money type used in the real world. The bit type is useful in database design for dummy variables due to its low storage requirements. It takes the values of 0, 1, and NULL.

    a. Integers

Integers are counting numbers and have no decimal precision. There are four types, shown below. For most applications, int will suffice and uses little memory.

Type Range Storage
tinyint 0 to 255 1 byte
smallint -32,768 to 32,767 2 bytes
int -2,147,483,648 to 2,147,483,647 4 bytes
bigint -263 to 263-1 8 bytes

This information sourced from the relevant Microsoft Doc.

The integer data type is generally useful for data that does not require division or other operations that may produce a decimal point. Operations on integers will always return the type integer, which has no decimal precision.

When the value of an integer type exceeds the range shown in the chart above, an "overflow" error is returned:


select cast(-5 as tinyint)
	
Msg 220, Level 16, State 2, Line 2
Arithmetic overflow error for data type tinyint, value = -5.
	

In this example, we try to coerce -5 to the tinyint type. Because tinyint only has a range of 0 to 255, the error results.

    b. Decimal/Numeric

Decimal and numeric values are identical in T-SQL. They have fixed levels of precision and scale. Precision refers to the total number of digits stored both to the left and the right of the decimal point. It has a value of 1 through 38 (the maximum). Scale refers to the number of digits stored only to the right of the decimal place.

The default scale is zero. Scale is always greater than or equal to zero, and precision is always greater than or equal to scale.

Let's see how this works in practice. When referring to the decimal and numeric data types, scale and precision can be supplied in parentheses. Precision is provided first, then scale.


select cast(1.234 as dec(4, 3)) 'p=4, s=3',			--original
cast(1.234 as numeric(4, 3)) 'Numeric p=4, s=3',		--line 1
cast(1.234 as dec(5, 4)) 'p=5, s=4',				--line 2
cast(1.234 as dec(5, 3)) 'p=5, s=3',				--line 3
cast(1.234 as dec(3, 2)) 'p=3, s=2',				--line 4
cast(1.234 as dec(2, 1)) 'p=2, s=1',				--line 5
cast(1.234 as dec(1, 0)) 'p=1, s=0',				--line 6
cast(1.234 as dec(2)) 'p=2, s=0'				--line 7
	
p=4, s=3                                Numeric p=4, s=3                        p=5, s=4                                p=5, s=3                                p=3, s=2                                p=2, s=1                                p=1, s=0                                p=2, s=0
--------------------------------------- --------------------------------------- --------------------------------------- --------------------------------------- --------------------------------------- --------------------------------------- --------------------------------------- ---------------------------------------
1.234                                   1.234                                   1.2340                                  1.234                                   1.23                                    1.2                                     1                                       1

(1 row(s) affected)
	

The query above coerces 1.234 to decimal/numeric type with a specified precision and scale. The first cast function represents the precision and scale of 1.234 as-is. 1.234 has a precision of 4 because it has four total place values, that is, places to the left and right of the decimal. Scale is 3 because it has 3 place values to the right of the decimal. The query above demonstrates the following:

Decimal/numeric types require more storage than integer types. With a precision (i.e., total place values) of 1-9, 5 bytes are required. For the maximum precision of 29-38, 17 bytes are required. See the Microsoft Doc on decimal/numeric data types for more information regarding memory requirements.

    c. Float

The float data type is an approximate-number data type for floating point numeric data. It is slightly more flexible than the decimal/numeric type because only precision is required. The precision argument refers to the number of bits rather than the number of place values as was the case with the decimal/numeric type. The maximum number of place values is still 38. In practice, there are only two relevant values for this argument: 24 and 53. Data type real is equivalent to float(24), which is also single precision. Double precision is float(53). Operations involving single precision are accurate up to 6-9 digits and double precision are generally accurate up to 15-17 decimal places. This is why float is an approximate-number data type.

As with the decimal/numeric type, be sure to think about what level of precision is required for your analysis. The default argument for float is 53, which requires 8 bytes of memory. This isn't as large as a decimal(38) type, but it can become significant in certain applications, especially if you store large amounts of data with these types on the server.

Top
3. Character Data

A language-neutral description of character data, or strings, in analytics programming can be found here.

    a. char and varchar data types

Character data have a few different possible data types in T-SQL. First, there are regular characters, which support the English language, and Unicode, which supports other language characters. For Unicode, see the relevant Microsoft Doc. This section will only discuss regular characters.

For regular characters, there are two data character types: char and varchar. The char data type is fixed length. When a string exceeds the fixed length, it is truncated. On the other hand, varchar is variable length. Consider the two examples below:


select
cast('This is a string.' as char(10)) 'char 10',
cast('This is a string.' as varchar(10)) 'varchar 10',
cast('This is a string.' as char) 'char default',
cast('This is a string.' as varchar) 'varchar default'
	
char 10    varchar 10 char default                   varchar default
---------- ---------- ------------------------------ ------------------------------
This is a  This is a  This is a string.              This is a string.

(1 row(s) affected)
	

In the first two lines, the specified column width for both the char and varchar data type is 10, meaning that the returned result will accomodate ten characters (remember that spaces are characters as well). In the next two lines, no length is specified, so the default length of 30 is utilized. Because the string is not longer than 30 characters, the full result is returned for both data types. In addition, both char and varchar data types have a maximum length of 8,000 characters.

So when is char recommended over varchar and vice versa? In the practice of analytics through data housed in a SQL server, it doesn't matter much. For data storage and more complex querying, Microsoft recommends using char when the possible values are relatively consistent in length and varchar when they can vary considerably.

See the Microsoft Doc for more detail.

    b. String Manipulation

String manipulation is a regular feature of analytics work. String manipulation usually has two forms:

  1. returning only a portion of a string, known as a substring
  2. creating new strings from various other strings, known as concatenation

There are a number of character functions within T-SQL that are useful for working within strings. The following functions are those that are fairly common.

Character Functions
SUBSTRING(string, start_position, length)

The SUBSTRING function returns a portion of a string. The start_position is a numeric reference for where to begin the substring. For example, a start_position of 5 would begin to return the substring starting at the fifth character of the original string. The length argument tells SQL how many characters from the start position to return as the substring. For example, a start_position of 5 and length of 5 would return a substring that is 5 characters long starting from the fifth character of the original string.


select
substring('This is a string.', 5, 5) 'Result'
	
Result
------
 is a

(1 row(s) affected)	
	

The resulting substring is: ' is a', which has a leading space. Remember that spaces are characters as well.



LEFT(string, length) and RIGHT(string, length)

These functions return a portion of a string beginning at the first character (from left or right). The length argument tells SQL how many characters from the first character to return as the substring.


select
left('This is a string.', 6) 'left',
right('This is a string.', 6) 'right'
	
left   right
------ ------
This i tring.

(1 row(s) affected)
	

The resulting substring from the LEFT function is 'This i', which is 6 characters long from the first character when reading from the left. The resulting substring from the RIGHT function is 'tring.', which is 6 characters long from the first character when reading from the right.



LEN(string)

This functions returns the number of characters in the string.


select
len('This is a string.') 'len'
	
len
-----------
17

(1 row(s) affected)
	

The string 'This is a string.' has 17 characters.



LTRIM(string) and RTRIM(string)

These functions truncate leading (LTRIM) or trailing (RTRIM) spaces from a string.


select
ltrim('   leading spaces') 'ltrim leading',
rtrim('trailing spaces    ') 'rtrim trailing'
	
ltrim leading     rtrim trailing
----------------- -------------------
leading spaces    trailing spaces

(1 row(s) affected)
	

The resulting substrings have no leading or trailing spaces. Note that LTRIM will not remove trailing spaces and RTRIM will not remove leading spaces.

Concatenation

Concatenation is the process of joining one or more strings into a larger string. A general overview and description of concatenation in analytics programming can be found here.

Concatenation can be accomplished in T-SQL through the + operator or through the CONCAT function.


select 
'example' + '1' 'Result 1',
'example ' + '2' 'Result 2',
'You ' + 'can ' + 'concatenate ' + 'many ' + 'strings.' 'Result 3',
concat('You ', 'can ', 'also ', 'use ', 'CONCAT().') 'Result 4'
	
Result 1 Result 2  Result 3                          Result 4
-------- --------- --------------------------------- --------------------------
example1 example 2 You can concatenate many strings. You can also use CONCAT().

(1 row(s) affected)
	
  • Result 1 returns 'example1' by concatenating 'example' and '1'. Concatenation with number values is possible.
  • Result 2 returns 'example 2' by concatenating 'example ' and '2'. Note the space character after the word 'example' and the returned space in the result.
  • Result 3 returns a concatenation from many substrings.
  • Result 4 returns a concatenation from the CONCAT function.
Top
4. Date Data
    a. Date/time data types

Working with dates in most programming languages is a painful task. T-SQL has been one of the better programs I've worked with it when it comes to dates/times.

SQL allows data types that include dates, such as 01/01/2000, and times, such as 12:45AM. date data types only have the date component, time data types only have the time component, and datetime have both. I For other data types, see the relevant Microsoft Doc. It is not uncommon that you will encounter datetime data types, but frequently the time dimension is the default of 12:00 AM, which displays as 00:00:00.000.

The Microsoft Doc recommends date, time, datetime2, and datetimeoffset. You may encounter datetime or smalldatetime in older code. datetimeoffset can also store time zone, which is appropriate for day/time data with a time zone component.

The following code demonstrates the datetime2, date, and time data types.


select 
cast('01/01/2018 10:42:03.123' as datetime2) 'Result 1',
cast('01/01/2018 10:42:03.123' as datetime2(3)) 'Result 2',
cast('01-01-2018 10:42PM' as datetime2(0)) 'Result 3',
cast('Jan 1 2015' as date) 'Result 4',
cast('4am' as time) 'Result 5'
	
Result 1                    Result 2                    Result 3                    Result 4   Result 5
--------------------------- --------------------------- --------------------------- ---------- ----------------
2018-01-01 10:42:03.1230000 2018-01-01 10:42:03.123     2018-01-01 22:42:00         2015-01-01 04:00:00.0000000

(1 row(s) affected)
	
    b. Date functions

The following date functions are those that I've commonly encountered:

YEAR(date) and MONTH(date) and DAY(date)

These functions return the four-digit year, the one/two-digit month of a date, and the one/two-digit day of a date.


select 
year('1-12-2013') 'Year',
month('1-12-2013') 'Month',
day('1-12-2013') 'Day'
	
Year        Month       Day
----------- ----------- -----------
2013        1           12

(1 row(s) affected)
	
DATEDIFF(time_dimension, date1, date2)

This function returns the number of years, months, days, etc. between two supplied dates. Importantly, this number is an integer. Results are rounded down to the nearest integer.


select 
datediff(year, '20150101', 'Jan 31, 2019') 'Result 1a',
datediff(year, '20150101', 'Dec 31, 2019') 'Result 1b',
datediff(day, '20150101', 'Dec 31, 2019')/365.0 'Result 1c',
datediff(month, '20150101', 'Dec 1, 2018') 'Result 2a',
datediff(month, '20150101', 'Dec 31, 2018') 'Result 2b',
datediff(month, '20150101', 'Jan 1, 2019') 'Result 2c',
datediff(month, '20150101', 'Jan 31, 2019') 'Result 2d',
datediff(day, '20150101', 'Jan 31, 2019')/(365/12.0) 'Result 2e',
datediff(day, '20150101', 'Jan 31, 2019') 'Result 3',
datediff(second, '20150101', 'Jan 31, 2019') 'Result 4'
	
Result 1a   Result 1b   Result 1c                               Result 2a   Result 2b   Result 2c   Result 2d   Result 2e                               Result 3    Result 4
----------- ----------- --------------------------------------- ----------- ----------- ----------- ----------- --------------------------------------- ----------- -----------
4           4           5.000000                                47          47          48          48          49.01917915658                          1491        128822400

(1 row(s) affected)
	
  1. Result 1a returns the number of years between January 1, 2015, and January 31, 2019, as 4 years.
  2. Result 1b also returns the number of years between January 1, 2015, and December 31, 2019 as 4 years. At January 1, 2020, the result becomes 5 years.
  3. Result 1c demonstrates a work-around to return a fractional year value. For applications that require an even more precise measue than this, an average days per year that incorporates leap year may be more appropriate.
  4. Result 2a, 2b, 2c, 2d demonstrate the similarily in number of months returned between two dates (47 and 48) despite the fact that difference between Dec 1, 2018, and Jan 31, 2019, is two months for all intents and purposes.
  5. Result 2e demonstrates a work-around to return a fractional month value. For applications that require an even more precise measure than this, an average days per month may be more appropriate.
  6. Result 3 returns the number of days between January 1, 2015, and January 31, 2019.
  7. Result 4 returns the number of seconds between January 1, 2015, 12:00AM and January 31, 2019, 12:00AM.
EOMONTH(date)

This function returns the last day in a month for a given date. This is useful for grouping dates by month/year rather than using month and year functions separately.


select 
count(*) 'Count',
eomonth(modifieddate) 'Month Modified'
from [AdventureWorks].[Person].[Address]
group by eomonth(modifieddate)
order by eomonth(modifieddate)	
	
Count       Month Modified
----------- --------------
1           2006-06-30
1           2007-01-31
2           2007-11-30
5           2007-12-31
2           2008-01-31
2           2008-02-29
1           2008-03-31
17          2008-11-30
61          2008-12-31
...

(61 row(s) affected)
	
Top
5. NULLs
    a. NULL values in T-SQL

A lot of SQL users find NULL values somewhat baffling at first. This is primarily because the SQL language relies on a three-part logic rather than on a simple True/False binary. A logical expression can be either True, False, or Unknown. An expression that evaluates as Unknown will return a NULL value. In general, performing a logical expression on a NULL value will return a NULL value.

Let's begin with the [AdventureWorks].[Person].[Person] table. The table has 19,972 records. Several fields contain NULL values.


select count(*) 'Count'
from [AdventureWorks].[Person].[Person]
where (title = 'Mr.' or title <> 'Mr.')
	
Count
-----------
1009

(1 row(s) affected)
	

The WHERE clause would seem to suggest all records should be returned. However, only 1,009 records are counted? Why? According to SQL, a NULL value is UNKNOWN. Therefore, when the field title is NULL this means that the value may be 'Mr.' or it may not. Therefore, as a NULL value, it neither is equivalent to 'Mr.' nor is it inequivalent. This is why logic in SQL has three values, demonstrated in the query below. An expression can be True or False or Unknown.


select count(*) 'Count'
from [AdventureWorks].[Person].[Person]
where (title = 'Mr.' or title <> 'Mr.' or title is null)
	
	
Count
-----------
19972

(1 row(s) affected)
	

It is good practice to understand whether your table has NULL values for given fields. From the query below, you may assume that there are no NULL values for the title field because they are not returned in the results:


select title, count(*) 'count'
from [AdventureWorks].[Person].[Person]
where title not in ('Mr.', 'Mrs.', 'Ms.')
group by title
	
	
title    count
-------- -----------
Ms       1
Sr.      11
Sra.     3

(3 row(s) affected)
	

It makes sense to think that a NULL value would be returned because it is not equal to 'Mr.', 'Mrs.', or 'Ms.'. However, whether this is the case is actually unknown according to SQL--any given NULL value may be equal to one of these, or it may not. Thus, the NULL value must be handled as a unique logical result:


select title, count(*) 'count'
from [AdventureWorks].[Person].[Person]
where title not in ('Mr.', 'Mrs.', 'Ms.')
or title is null
group by title
	
title    count
-------- -----------
Sr.      11
NULL     18963
Sra.     3
Ms       1

(4 row(s) affected)
	
    b. NULL functions
ISNULL(value, value if null)

ISNULL() returns an alternate value when the value is otherwise a NULL.


select 
null 'Result 1',
isnull(null, 0) 'Result 2'
	
Result 1    Result 2
----------- -----------
NULL        0

(1 row(s) affected)
	
NULLIF(value1, value2)

NULLIF() takes two value arguments. When those values are equivalent, it returns a NULL value. When not equivalent, it returns the first value. This function is convenient if you want to return a NULL value in lieu of a particular value. I have found this very useful for avoiding 'divide by zero' errors.

In the code below, two fields, a and b, are populated with the values keyword. The query returns these fields and divides a and b. While the numbers are fictional, a divide by zero error is not uncommon for such operations on real-life data. The convenience of NULLIF is that the error is circumvented.


select 
a, b, a/nullif(b, 0) 'a/b' 
from
(values (100, 100), (400, 4), (4, 2), (3, 0)) as X(a, b)
	
a           b           a/b
----------- ----------- -----------
100         100         1
400         4           100
4           2           2
3           0           NULL

(4 row(s) affected)
	

Without the NULLIF, the code will return an error.


select 
a, b, a/b 'a/b' 
from
(values (100, 100), (400, 4), (4, 2), (3, 0)) as X(a, b)
	
Msg 8134, Level 16, State 1, Line 4
Divide by zero error encountered.
	
Top
Github Icon with Link
Back to SQL Main Page