We have seen a number of PROC statements, which execute procedures on a supplied dataset. On the other hand, DATA statements create, manipulate, and import/export the data.
DATA and PROC steps are the foundation of working in SAS. Data is rarely ready for analysis “as is”. It is often said that data scrubbing and cleaning make up 80% of the work of an analytics task. I have found this to be my experience as well.
See the SAS documentation for DATA steps for full details.
In general, a DATA statement specifies:
Let’s start with a DATA statement that uses the datalines statement:
data temp;
    /* first_name is in columns 1-8, last_name in columns 9-19, worth follows */
    input first_name $1-8 last_name $9-19 worth;
    datalines;
Michael Jordan     1.7
Jeff    Bezos      136.9
Elon    Musk       22.4
Bill    Gates      94.7
Mark    Zuckerberg 53.6
Kylie   Jenner     0.9
;
run;

proc print data=work.temp; run;
proc contents data=work.temp; run;
The code above accomplishes the following:
data temp;
creates a dataset named temp in the WORK library.
input first_name $1-8 last_name $9-19 worth;
specifies that three variables will be put into the dataset: first_name, last_name, and worth. It also specifies how each variable is read: its type and, for the two names, its columns.
first_name $1-8
specifies that the variable first_name is of type character ($) and spans columns 1-8, so SAS treats the first 8 characters of each line as first_name. The $1-8 part does the job of an informat here.
last_name $9-19
specifies that the character variable last_name occupies columns 9-19.
worth
has no $ or column range after it. Without $, the value is read as a numeric type with a default length of 8 bytes.
datalines;
specifies that what follows it is to be treated as raw data. You can also use cards or lines.
proc print data=work.temp; run;
prints the data.
proc contents data=work.temp; run;
shows the details of the dataset.
The printed data are shown below:
I will not spend much time on SAS informats. An informat tells SAS the format of the data we are reading in. For example, if the data is a date, we need to tell SAS to read it as a date. SAS has made importing data and specifying data types relatively easy through point-and-click methods. Nonetheless, it is important to know what an informat is. The SAS documentation for informats lists the ones available.
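To make the idea concrete, here is a minimal sketch of reading a date with an informat. The dataset name visits, its variables, and its two rows are made up for illustration. mmddyy10. is a standard SAS date informat, and the colon in front of it asks SAS to read up to the next blank.

```sas
/* Hypothetical example: visits, patient, and visit_date are illustrative names */
data visits;
    /* :mmddyy10. reads text such as 01/15/2020 as a SAS date value */
    input patient $ visit_date :mmddyy10.;
    datalines;
Ann 01/15/2020
Bob 03/02/2020
;
run;

/* No format is attached, so visit_date prints as a plain number:
   the count of days since January 1, 1960 */
proc print data=visits; run;
```

Without the informat, SAS would try to read 01/15/2020 as an ordinary number and store a missing value instead.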
SAS formats, on the other hand, allow you to format the data when displayed in output. This is an important distinction that you should understand. They have similar syntax. We will use SAS formats to prettify output.
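The distinction can be sketched as follows, assuming a small made-up dataset (the name demo and its two rows are hypothetical). The informat in the INPUT statement controls how the raw text is read. The FORMAT statement in PROC PRINT only changes how the stored values are displayed.

```sas
/* Hypothetical example: demo and its rows are made up for illustration */
data demo;
    /* informat: read the second field as a date */
    input name $ joined :mmddyy10. worth;
    datalines;
Ann 01/15/2020 136.9
Bob 03/02/2020 0.9
;
run;

/* formats: date9. shows dates in the form 15JAN2020,
   dollar8.1 adds a dollar sign and one decimal place.
   The stored values themselves are unchanged. */
proc print data=demo;
    format joined date9. worth dollar8.1;
run;
```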
We will now manipulate the diamonds data set. First, let’s take a deeper look at the structure of the dataset.
proc contents data=mrrlib.diamonds; run;
Through a DATA step, we’ll make the following changes:
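data temp;
    set mrrlib.diamonds(drop=depth);  /* read diamonds without the original depth */
    length=x;                         /* copy x, y, z into descriptive names */
    width=y;
    depth=z;
    lcarat=log(carat);                /* natural logs of carat and price */
    lprice=log(price);
    drop var1 x y z table;            /* leave these out of the output dataset */
run;

proc contents data=temp; run;
proc print data=temp(obs=10); run;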
The code explained:
data temp;
creates a temporary dataset named temp in the WORK library. Remember that the WORK library is the default.
set mrrlib.diamonds(drop=depth);
uses the diamonds dataset as the starting data for temp and drops the original variable depth.
drop var1 x y z table;
drops the listed variables from the final dataset.
log()
is the natural log in SAS.
proc contents data=temp; run;
displays the details of the new dataset.
proc print data=temp(obs=10); run;
prints 10 observations starting at observation 1 (the default).
Homework #2 will involve exploring how these transformations affect the X-Y analysis we demonstrated here in class.
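The pieces explained above can be tried without the course library. The sketch below assumes a three-row stand-in dataset named mini with made-up values (it is not the diamonds data). It applies the same log() transformation and uses firstobs= alongside obs= to show that obs= names the last observation to process rather than a count.

```sas
/* Hypothetical stand-in for mrrlib.diamonds; the values are made up */
data mini;
    input carat price;
    datalines;
0.25 400
0.50 1500
1.00 5000
;
run;

data mini_log;
    set mini;
    lcarat = log(carat);   /* natural log, so log(1.00) is 0 */
    lprice = log(price);
run;

/* Starts at observation 2 and stops at observation 3, so two rows print */
proc print data=mini_log(firstobs=2 obs=3); run;
```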