1 DATA statements

We have seen a number of PROC statements, which execute procedures on a supplied dataset. On the other hand, DATA statements create, manipulate, and import/export the data.

DATA and PROC steps are the foundation of working in SAS. Data is rarely ready for analysis “as is”. It is often said that data scrubbing and cleaning make up 80% of the work of an analytics task. I have found this to be my experience as well.

2 DATA statement basics

The SAS documentation for DATA steps is here.

In general, a DATA statement specifies:

  1. an output data set, which can be temporary (the WORK library) or permanent
  2. a beginning dataset
    • an already existing dataset
    • manually-input data
    • an imported dataset
  3. logic to create other variables

Let’s start with a DATA statement that uses the datalines statement:

data temp;
    input first_name $1-8 last_name $9-19 worth;
    datalines;
Michael Jordan     1.7
Jeff    Bezos      136.9
Elon    Musk       22.4
Bill    Gates      94.7
Mark    Zuckerberg 53.6
Kylie   Jenner     0.9
;
run;

proc print data=work.temp; run;

proc contents data=work.temp; run;

The code above accomplishes the following:

  1. data temp; creates a dataset named temp in the WORK library
  2. input first_name $1-8 last_name $9-19 worth; specifies that three variables will be read into the dataset: first_name, last_name, and worth. It also specifies the type of each variable and where to find it on each line.
    • first_name $1-8 specifies that the variable first_name is of type character ($) and spans columns 1-8. This means SAS will treat the first 8 characters of each line as the variable first_name. Strictly speaking, $1-8 is column input rather than a named informat.
    • last_name $9-19 specifies that the character variable last_name occupies columns 9-19.
    • worth has neither $ nor a column range after it. Without $, the value is read as numeric, and numeric variables are stored with a default length of 8 bytes.
  3. datalines; specifies that what follows it is to be treated as raw data. You can also use cards or lines.
  4. proc print data=work.temp; run; prints the data
  5. proc contents shows the details of the dataset

The printed data are shown below:

2.1 SAS Informats and Formats

I will not spend much time on SAS informats. An informat tells SAS the format of the data we are reading in. For example, if the data is a date, we need to tell SAS to read it as a date. SAS has made importing data and specifying data types relatively easy through point-and-click methods. Nonetheless, it is important to know what an informat is. Here’s the SAS documentation for informats.

SAS formats, on the other hand, control how the data is displayed in output. This is an important distinction that you should understand: an informat governs how values are read in, while a format governs how stored values are shown. Informats and formats have similar syntax. We will use SAS formats to prettify output.
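To make the distinction concrete, here is a minimal sketch that uses both. The data set name hires, its variables, and the two data lines are made up for illustration; mmddyy10., comma10., date9., and dollar12.2 are standard SAS informats and formats.

```sas
data hires;
    /* informats: tell SAS how to READ the raw text */
    input name $ hire_date :mmddyy10. salary :comma10.;
    /* formats: tell SAS how to DISPLAY the stored values */
    format hire_date date9. salary dollar12.2;
    datalines;
Ann 01/15/2019 52,000
Bob 12/03/2020 1,250,000
;
run;

proc print data=hires; run;
```

Both hire_date and salary are stored as plain numbers (a SAS date is a count of days since January 1, 1960). The format statement only changes how they print, so removing it would leave the stored values unchanged.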

3 Manipulating Data

We will now manipulate the diamonds data set. First, let’s take a deeper look at the structure of the dataset.

proc contents data=mrrlib.diamonds; run;

Through a DATA step, we’ll make the following changes:

  1. drop the variable VAR1, which was just the observation number from the R export
  2. transform variables carat and price using the natural logarithm
  3. rename variables x, y, and z to length, width, and depth by copying them into new variables and dropping the originals
  4. drop the original variable depth, which is the total depth percentage, as well as the variable table

data temp;
    set mrrlib.diamonds(drop=depth);
    
    length=x;
    width=y;
    depth=z;
    lcarat=log(carat);
    lprice=log(price);
    
    drop var1 x y z table;
run;

proc contents data=temp; run;

proc print data=temp(obs=10); run;

The code explained:

  1. data temp; creates a temporary dataset named temp in the WORK library. Remember that the WORK library is the default.
  2. set mrrlib.diamonds(drop=depth); uses the diamonds dataset as the starting data for temp and drops the original variable depth as the data is read in
  3. the assignment statements that follow create variables that were not a part of the original dataset.
    • you can rename a variable by assigning its value to a new variable and then dropping the original variable
    • notice that I drop the original depth in the set statement. If we listed depth in the drop statement instead, SAS would remove the new variable we create with depth=z, because the drop statement applies to the output data set. depth represented a different variable in the original dataset.
  4. drop var1 x y z table; drops the listed variables from the final data set
  5. log() is the natural log in SAS
  6. proc contents data=temp; run; displays the new dataset
  7. proc print data=temp(obs=10); run; prints 10 observations starting at observation 1 (the default)
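As an alternative to copying and dropping, the same result can be sketched with the rename= data set option. This is a variation, not the approach used above, and the output name temp2 is hypothetical. It relies on SAS applying drop= before rename= when both appear on the same data set, which is why the original depth can be dropped and z renamed to depth in a single set statement.

```sas
data temp2;
    /* drop= is applied first, then rename=, so the old depth is gone
       before z takes over the name depth */
    set mrrlib.diamonds(drop=var1 depth table
                        rename=(x=length y=width z=depth));

    /* log() is the natural logarithm */
    lcarat = log(carat);
    lprice = log(price);
run;

proc contents data=temp2; run;
```

Comparing the proc contents output for temp and temp2 is a quick way to check that both versions contain the same variables.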

Homework #2 will involve exploring how these transformations affect the X-Y analysis we demonstrated here in class.