## The CAST function

The CAST function is extremely useful in T-SQL. Data type conversion functions are a regular feature in programming because when a program reads in data, it assumes a data type for us unless we specify one. In most cases the program gets it right, but in others, such as with dates, the program may need to be told to treat a value as a date rather than as a character string.
The CAST function takes the value you are converting and the data type to which you are converting it:

```sql
cast(value as data_type)
```
The following examples demonstrate the various uses of the CAST
function.
```sql
select
/*line 1*/ cast('9999' as float) 'string to float',
/*line 2*/ cast(9999.9 as int) 'decimal to integer',
/*line 3*/ cast(3.14 as dec(6, 5)) 'adding decimal places',
/*line 4*/ cast('12/18/2015' as datetime) 'datetime',
/*line 5*/ cast('5/6/1971' as smalldatetime) 'smalldatetime',
/*line 6*/ cast('January 31, 2018' as date) 'date',
/*line 7*/ cast('hello world' as varchar) 'varchar'
```
```
string to float  decimal to integer  adding decimal places  datetime                 smalldatetime        date        varchar
---------------  ------------------  ---------------------  -----------------------  -------------------  ----------  -----------
9999             9999                3.14000                2015-12-18 00:00:00.000  1971-05-06 00:00:00  2018-01-31  hello world

(1 row(s) affected)
```
The query converts values to the following:

- the float data type (i.e., a number)
- the datetime data type (which includes hours, minutes, seconds, and milliseconds)
- the smalldatetime data type (which includes hours and minutes; seconds are accurate only to the minute and display as :00)
- the date data type (which does not include time)
- the varchar (character) data type

A language-neutral description of numeric data in analytics programming can be found here.
Similar to other programming languages, T-SQL distinguishes among numeric types largely to manage data storage, which for analytics is mostly a question of decimal precision. It is also important to understand the memory requirements of each data type and the ways queries can be optimized when they rely on large quantities of data conversions.
Numeric data within SQL can be divided into integer types, exact decimal/numeric types, and approximate floating point types.
There are also bit and money types, but these are rarely required for analytics purposes, and I've never seen the money type used in the real world. The bit type is useful in database design for dummy variables due to its low storage requirements. It takes the values 0, 1, and NULL.
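For illustration (a minimal sketch), casting to bit coerces any nonzero value to 1:

```sql
select cast(0 as bit) 'zero',       -- 0
       cast(2 as bit) 'nonzero',    -- any nonzero value returns 1
       cast(null as bit) 'null'     -- NULL
```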
## Integer types

Integers are counting numbers and have no decimal precision. There are four integer types, shown below. For most applications, int will suffice, and it uses little memory.
| Type | Range | Storage |
|---|---|---|
| tinyint | 0 to 255 | 1 byte |
| smallint | -32,768 to 32,767 | 2 bytes |
| int | -2,147,483,648 to 2,147,483,647 | 4 bytes |
| bigint | -2^63 to 2^63-1 | 8 bytes |
This information is sourced from the relevant Microsoft Doc.
The integer data type is generally useful for data that does not require division or other operations that may produce a decimal point. Operations on integers will always return the type integer, which has no decimal precision.
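A quick sketch of this behavior: dividing two integers discards the fractional part, so cast one operand to a decimal type if you need the decimal result.

```sql
select 7/2 'integer division',                    -- returns 3, not 3.5
       cast(7 as dec(3, 1))/2 'decimal division'  -- returns 3.5
```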
When the value of an integer type exceeds the range shown in the chart above, an "overflow" error is returned:
```sql
select cast(-5 as tinyint)
```

```
Msg 220, Level 16, State 2, Line 2
Arithmetic overflow error for data type tinyint, value = -5.
```
In this example, we try to coerce -5 to the tinyint
type. Because tinyint
only has a range of 0 to 255, the error results.
## Decimal and numeric types

Decimal and numeric values are identical in T-SQL. They have fixed levels of precision and scale. Precision refers to the total number of digits stored both to the left and the right of the decimal point. It has a value of 1 through 38 (the maximum). Scale refers to the number of digits stored only to the right of the decimal point. The default scale is zero. Scale is always greater than or equal to zero, and precision is always greater than or equal to scale.
Let's see how this works in practice. When referring to the decimal and numeric data types, scale and precision can be supplied in parentheses. Precision is provided first, then scale.
```sql
select cast(1.234 as dec(4, 3)) 'p=4, s=3',          --original
       cast(1.234 as numeric(4, 3)) 'Numeric p=4, s=3', --line 1
       cast(1.234 as dec(5, 4)) 'p=5, s=4',          --line 2
       cast(1.234 as dec(5, 3)) 'p=5, s=3',          --line 3
       cast(1.234 as dec(3, 2)) 'p=3, s=2',          --line 4
       cast(1.234 as dec(2, 1)) 'p=2, s=1',          --line 5
       cast(1.234 as dec(1, 0)) 'p=1, s=0',          --line 6
       cast(1.234 as dec(2)) 'p=2, s=0'              --line 7
```
```
p=4, s=3   Numeric p=4, s=3   p=5, s=4   p=5, s=3   p=3, s=2   p=2, s=1   p=1, s=0   p=2, s=0
---------  -----------------  ---------  ---------  ---------  ---------  ---------  ---------
1.234      1.234              1.2340     1.234      1.23       1.2        1          1

(1 row(s) affected)
```
The query above coerces 1.234 to the decimal/numeric type with a specified precision and scale. The first cast function represents the precision and scale of 1.234 as-is: 1.234 has a precision of 4 because it has four total place values (places to the left and right of the decimal point), and a scale of 3 because it has three place values to the right of the decimal point.
The query above demonstrates the following:

- When the scale exceeds the number of decimal places supplied (line 2), the result is padded with trailing zeros (1.2340).
- When the scale is smaller than the number of decimal places supplied (lines 4-6), the result is rounded to the specified scale.
- When the scale is omitted (line 7), the default scale of zero is applied and the result is rounded to a whole number.
Decimal/numeric types require more storage than integer types. With a precision (i.e., total place values) of 1-9, 5 bytes are required. For the maximum precision of 29-38, 17 bytes are required. See the Microsoft Doc on decimal/numeric data types for more information regarding memory requirements.
## The float type

The float data type is an approximate-numeric data type for floating point numbers. It is slightly more flexible than the decimal/numeric type because only a precision is required. The precision argument refers to the number of bits used to store the mantissa rather than the number of place values, as was the case with the decimal/numeric type. The maximum number of place values is still 38.
In practice, there are only two relevant values for this argument: 24 and 53. The real data type is equivalent to float(24), which is single precision. Double precision is float(53). Operations involving single precision are generally accurate up to 6-9 significant digits, and double precision up to 15-17. This is why float is an approximate-numeric data type.
As with the decimal/numeric type, be sure to think about what level of precision is required for your analysis. The default argument for float is 53, which requires 8 bytes of memory. This isn't as large as a decimal(38) type, but it can become significant in certain applications, especially if you store large amounts of data with these types on the server.
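A sketch of the approximation in practice: casting the same literal to real (single precision) and float (double precision) shows the single-precision value losing digits. The exact digits displayed will vary by client, so no output is shown here.

```sql
select cast(1.234567890123456 as real) 'single precision',
       cast(1.234567890123456 as float) 'double precision'
```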
A language-neutral description of character data, or strings, in analytics programming can be found here.
## The char and varchar data types

Character data have a few different possible data types in T-SQL. First, there are regular characters, which support the English language, and Unicode characters, which support other languages. For Unicode, see the relevant Microsoft Doc. This section will only discuss regular characters.

For regular characters, there are two character data types: char and varchar. The char data type is fixed length: shorter values are padded with trailing spaces to the declared length, and longer values are truncated. The varchar data type, on the other hand, is variable length: it stores only the characters supplied, up to the declared length. Consider the two examples below:
```sql
select
cast('This is a string.' as char(10)) 'char 10',
cast('This is a string.' as varchar(10)) 'varchar 10',
cast('This is a string.' as char) 'char default',
cast('This is a string.' as varchar) 'varchar default'
```
```
char 10    varchar 10  char default                    varchar default
---------- ----------  ------------------------------  ------------------------------
This is a  This is a   This is a string.               This is a string.

(1 row(s) affected)
```
In the first two lines, the specified column width for both the char and varchar data types is 10, meaning that the returned result will accommodate ten characters (remember that spaces are characters as well). In the next two lines, no length is specified, so the default length of 30 is used. Because the string is not longer than 30 characters, the full result is returned for both data types. In addition, both char and varchar data types have a maximum specified length of 8,000 characters.
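The storage difference between the two can be sketched with the built-in DATALENGTH function, which returns the number of bytes used: char pads shorter values with trailing spaces to its declared length, while varchar stores only the characters supplied.

```sql
select datalength(cast('abc' as char(10))) 'char bytes',       -- 10 (padded with spaces)
       datalength(cast('abc' as varchar(10))) 'varchar bytes'  -- 3
```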
So when is char recommended over varchar and vice versa? For analytics against data housed in a SQL Server, it doesn't matter much. For data storage and more complex querying, Microsoft recommends using char when the possible values are relatively consistent in length and varchar when they can vary considerably.
See the Microsoft Doc for more detail.
## String functions

String manipulation is a regular feature of analytics work. It usually takes two forms:

- extracting or cleaning portions of existing strings
- combining (concatenating) strings into larger strings

There are a number of character functions within T-SQL that are useful for working with strings. The following functions are fairly common.
### SUBSTRING(string, start_position, length)

The SUBSTRING function returns a portion of a string. The start_position argument is a numeric reference for where to begin the substring; a start_position of 5 begins the substring at the fifth character of the original string. The length argument tells SQL how many characters from the start position to return. For example, a start_position of 5 and a length of 5 return a substring that is five characters long, starting from the fifth character of the original string.

```sql
select substring('This is a string.', 5, 5) 'Result'
```

```
Result
------
 is a

(1 row(s) affected)
```

The resulting substring is ' is a', which has a leading space. Remember that spaces are characters as well.
### LEFT(string, length) and RIGHT(string, length)

These functions return a portion of a string beginning at the first character from the left or the right. The length argument tells SQL how many characters to return.

```sql
select left('This is a string.', 6) 'left', right('This is a string.', 6) 'right'
```

```
left   right
------ ------
This i tring.

(1 row(s) affected)
```

The resulting substring from the LEFT function is 'This i', the first six characters reading from the left. The resulting substring from the RIGHT function is 'tring.', the last six characters reading from the right.
### LEN(string)

This function returns the number of characters in the string. Note that LEN does not count trailing spaces.

```sql
select len('This is a string.') 'len'
```

```
len
-----------
17

(1 row(s) affected)
```

The string 'This is a string.' has 17 characters.
### LTRIM(string) and RTRIM(string)

These functions remove leading (LTRIM) or trailing (RTRIM) spaces from a string.

```sql
select ltrim(' leading spaces') 'ltrim leading', rtrim('trailing spaces ') 'rtrim trailing'
```

```
ltrim leading     rtrim trailing
----------------- -------------------
leading spaces    trailing spaces

(1 row(s) affected)
```

The resulting strings have no leading or trailing spaces. Note that LTRIM will not remove trailing spaces and RTRIM will not remove leading spaces.
## Concatenation

Concatenation is the process of joining one or more strings into a larger string. A general overview and description of concatenation in analytics programming can be found here.
Concatenation can be accomplished in T-SQL through the + operator or through the CONCAT function.

```sql
select 'example' + '1' 'Result 1',
       'example ' + '2' 'Result 2',
       'You ' + 'can ' + 'concatenate ' + 'many ' + 'strings.' 'Result 3',
       concat('You ', 'can ', 'also ', 'use ', 'CONCAT().') 'Result 4'
```

```
Result 1 Result 2  Result 3                          Result 4
-------- --------- --------------------------------- --------------------------
example1 example 2 You can concatenate many strings. You can also use CONCAT().

(1 row(s) affected)
```
- Result 1 returns 'example1' by concatenating 'example' and '1'. Note that '1' here is a string; to concatenate numeric values with the + operator, cast them to a character type first.
- Result 2 returns 'example 2' by concatenating 'example ' and '2'. Note the space character after the word 'example' and the returned space in the result.
- Result 3 returns a concatenation of many substrings.
- Result 4 returns a concatenation from the CONCAT function, which implicitly converts its arguments to strings.
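One practical difference between the two approaches (assuming the default CONCAT_NULL_YIELDS_NULL setting): the + operator propagates NULL, while the CONCAT function treats NULL arguments as empty strings.

```sql
select 'example' + null 'plus operator',   -- NULL
       concat('example', null) 'concat'    -- 'example'
```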
## Dates and times

Working with dates in most programming languages is a painful task. T-SQL has been one of the better languages I've worked with when it comes to dates and times.
SQL allows data types that include dates, such as 01/01/2000, and times, such as 12:45AM. date data types only have the date component, time data types only have the time component, and datetime data types have both. For other date and time data types, see the relevant Microsoft Doc. It is not uncommon to encounter datetime data types in which the time component is the default of 12:00 AM, displayed as 00:00:00.000.
The Microsoft Doc recommends date, time, datetime2, and datetimeoffset. You may encounter datetime or smalldatetime in older code. The datetimeoffset data type can also store a time zone offset, which is appropriate for date/time data with a time zone component.
The following code demonstrates the datetime2, date, and time data types.
```sql
select
cast('01/01/2018 10:42:03.123' as datetime2) 'Result 1',
cast('01/01/2018 10:42:03.123' as datetime2(3)) 'Result 2',
cast('01-01-2018 10:42PM' as datetime2(0)) 'Result 3',
cast('Jan 1 2015' as date) 'Result 4',
cast('4am' as time) 'Result 5'
```
```
Result 1                    Result 2                 Result 3             Result 4    Result 5
--------------------------- ------------------------ -------------------- ----------  ----------------
2018-01-01 10:42:03.1230000 2018-01-01 10:42:03.123  2018-01-01 22:42:00  2015-01-01  04:00:00.0000000

(1 row(s) affected)
```
The query demonstrates the following:

- Result 1 uses the datetime2 data type, which allows fractional second precision up to 100 nanoseconds.
- Result 2 supplies an argument of 3 to the datetime2 data type, returning fractional second precision to the thousandths place value (i.e., 3 decimal places).

The following date functions are those that I've commonly encountered:
### YEAR(date), MONTH(date), and DAY(date)

These functions return the four-digit year, the one- or two-digit month, and the one- or two-digit day of a date.
```sql
select year('1-12-2013') 'Year', month('1-12-2013') 'Month', day('1-12-2013') 'Day'
```

```
Year        Month       Day
----------- ----------- -----------
2013        1           12

(1 row(s) affected)
```
### DATEDIFF(time_dimension, date1, date2)

This function returns the number of years, months, days, etc. between two supplied dates. Importantly, the result is an integer count of the time-dimension boundaries crossed between the two dates, not elapsed time rounded to an integer, which can produce surprising results.
```sql
select datediff(year, '20150101', 'Jan 31, 2019') 'Result 1a',
       datediff(year, '20150101', 'Dec 31, 2019') 'Result 1b',
       datediff(day, '20150101', 'Dec 31, 2019')/365.0 'Result 1c',
       datediff(month, '20150101', 'Dec 1, 2018') 'Result 2a',
       datediff(month, '20150101', 'Dec 31, 2018') 'Result 2b',
       datediff(month, '20150101', 'Jan 1, 2019') 'Result 2c',
       datediff(month, '20150101', 'Jan 31, 2019') 'Result 2d',
       datediff(day, '20150101', 'Jan 31, 2019')/(365/12.0) 'Result 2e',
       datediff(day, '20150101', 'Jan 31, 2019') 'Result 3',
       datediff(second, '20150101', 'Jan 31, 2019') 'Result 4'
```

```
Result 1a   Result 1b   Result 1c   Result 2a   Result 2b   Result 2c   Result 2d   Result 2e        Result 3    Result 4
----------- ----------- ----------- ----------- ----------- ----------- ----------- ---------------  ----------- -----------
4           4           5.000000    47          47          48          48          49.01917915658   1491        128822400

(1 row(s) affected)
```
- Result 1a returns the number of years between January 1, 2015, and January 31, 2019, as 4 years.
- Result 1b also returns the number of years between January 1, 2015, and December 31, 2019 as 4 years. At January 1, 2020, the result becomes 5 years.
- Result 1c demonstrates a work-around to return a fractional year value. For applications that require an even more precise measure than this, an average days per year that incorporates leap years may be more appropriate.
- Results 2a, 2b, 2c, and 2d demonstrate that the month count depends only on month boundaries crossed: Dec 1 and Dec 31, 2018, both return 47, while Jan 1 and Jan 31, 2019, both return 48, even though the span from Dec 1, 2018, to Jan 31, 2019, is two months for all intents and purposes.
- Result 2e demonstrates a work-around to return a fractional month value. For applications that require an even more precise measure than this, an average days per month may be more appropriate.
- Result 3 returns the number of days between January 1, 2015, and January 31, 2019.
- Result 4 returns the number of seconds between January 1, 2015, 12:00AM and January 31, 2019, 12:00AM.
### EOMONTH(date)

This function returns the last day of the month for a given date. This is useful for grouping dates by month/year rather than using the MONTH and YEAR functions separately.

```sql
select count(*) 'Count', eomonth(modifieddate) 'Month Modified'
from [AdventureWorks].[Person].[Address]
group by eomonth(modifieddate)
order by eomonth(modifieddate)
```

```
Count       Month Modified
----------- --------------
1           2006-06-30
1           2007-01-31
2           2007-11-30
5           2007-12-31
2           2008-01-31
2           2008-02-29
1           2008-03-31
17          2008-11-30
61          2008-12-31
...

(61 row(s) affected)
```
## NULL values

A lot of SQL users find NULL values somewhat baffling at first, primarily because the SQL language relies on three-valued logic rather than on a simple True/False binary. A logical expression can be True, False, or Unknown. An expression that evaluates as Unknown returns a NULL value, and, in general, performing a logical operation on a NULL value returns a NULL value.
Let's begin with the [AdventureWorks].[Person].[Person]
table. The table has 19,972 records. Several fields contain NULL values.
```sql
select count(*) 'Count'
from [AdventureWorks].[Person].[Person]
where (title = 'Mr.' or title <> 'Mr.')
```

```
Count
-----------
1009

(1 row(s) affected)
```
The WHERE clause would seem to suggest that all records should be returned. However, only 1,009 records are counted. Why? According to SQL, a NULL value is unknown. When the field title is NULL, the value may be 'Mr.' or it may not; as a NULL value, it is neither equivalent to 'Mr.' nor inequivalent. This is why logic in SQL has three values, as demonstrated in the query below: an expression can be True, False, or Unknown.
```sql
select count(*) 'Count'
from [AdventureWorks].[Person].[Person]
where (title = 'Mr.' or title <> 'Mr.' or title is null)
```

```
Count
-----------
19972

(1 row(s) affected)
```
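Because T-SQL has no standalone boolean type, a CASE expression is a convenient way to observe three-valued logic directly. In this sketch, the comparison NULL = NULL evaluates to Unknown rather than True, so the ELSE branch is taken:

```sql
select case when null = null then 'True' else 'not True' end 'null = null'  -- returns 'not True'
```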
It is good practice to understand whether your table has NULL values in given fields. From the query below, you might assume that there are no NULL values in the title field because none are returned in the results:
```sql
select title, count(*) 'count'
from [AdventureWorks].[Person].[Person]
where title not in ('Mr.', 'Mrs.', 'Ms.')
group by title
```

```
title    count
-------- -----------
Ms       1
Sr.      11
Sra.     3

(3 row(s) affected)
```
It makes sense to think that a NULL value would be returned because it is not equal to 'Mr.', 'Mrs.', or 'Ms.'. However, whether this is the case is actually unknown according to SQL--any given NULL value may be equal to one of these, or it may not. Thus, the NULL value must be handled as a unique logical result:
```sql
select title, count(*) 'count'
from [AdventureWorks].[Person].[Person]
where title not in ('Mr.', 'Mrs.', 'Ms.')
   or title is null
group by title
```

```
title    count
-------- -----------
Sr.      11
NULL     18963
Sra.     3
Ms       1

(4 row(s) affected)
```
### ISNULL(value, value_if_null)

ISNULL() returns an alternate value when the first value is NULL.

```sql
select null 'Result 1', isnull(null, 0) 'Result 2'
```

```
Result 1    Result 2
----------- -----------
NULL        0

(1 row(s) affected)
```
### NULLIF(value1, value2)

NULLIF() takes two value arguments. When the values are equivalent, it returns a NULL value; when they are not, it returns the first value. This function is convenient if you want to return a NULL value in lieu of a particular value, and I have found it very useful for avoiding divide-by-zero errors.

In the code below, two fields, a and b, are populated with the VALUES keyword. The query returns these fields and divides a by b. While the numbers are fictional, a divide-by-zero error is not uncommon for such operations on real-life data. The convenience of NULLIF is that the error is circumvented.

```sql
select a, b, a/nullif(b, 0) 'a/b'
from (values (100, 100), (400, 4), (4, 2), (3, 0)) as X(a, b)
```

```
a           b           a/b
----------- ----------- -----------
100         100         1
400         4           100
4           2           2
3           0           NULL

(4 row(s) affected)
```

Without the NULLIF, the code returns an error:

```sql
select a, b, a/b 'a/b'
from (values (100, 100), (400, 4), (4, 2), (3, 0)) as X(a, b)
```

```
Msg 8134, Level 16, State 1, Line 4
Divide by zero error encountered.
```