Details

  1. Due by the start of next class.
  2. Turn in through the link on Blackboard.
  3. Use the manipulated diamonds dataset. See the SAS Data Manipulation lecture.

Part 1: Test Normality

Using PROC UNIVARIATE and the HISTOGRAM statement, determine whether the log transformations of price and carat result in relatively more normal data.

The analysis variables are: price, lprice, carat, lcarat.

You can execute the analysis for the four variables in one PROC or four separate PROCs, one for each variable.

The goal is to determine the effectiveness of the log transformation.

  1. Use ODS graphics.
  2. Use the ods select statement to return only the normality tests (testsfornormality) and the histograms (histogram).
  3. Use the normaltest option for PROC UNIVARIATE to display the tests for normality.
  4. Use the normal option on the HISTOGRAM statement to plot a normal density curve over the data. Remember that statement options must follow a slash (/).
  5. Use the noprint option within the normal option to suppress the tables that test the fit of the data against the normal density curve. Remember that sub-options within an option require parentheses.

The PROC UNIVARIATE documentation is here.
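As a starting point, the five requirements above can be combined into a single step. This is a minimal sketch, not the only acceptable layout. The dataset name work.diamonds_manip is a placeholder (an assumption); substitute whatever you named the manipulated dataset in the lecture.

```sas
/* Sketch only: work.diamonds_manip is a hypothetical name for the
   manipulated diamonds dataset from the lecture. */
ods graphics on;

/* Keep only the normality tests and the histograms in the output. */
ods select TestsForNormality Histogram;

proc univariate data=work.diamonds_manip normaltest;
    var price lprice carat lcarat;
    /* Options follow the slash; sub-options go in parentheses. */
    histogram price lprice carat lcarat / normal(noprint);
run;
```

Running four separate PROCs works the same way: repeat the ods select statement and the PROC with one variable in each.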

Screenshot the output.

Then, based on the output, answer the following questions:

  1. Does the normal density plot fit price or lprice or neither more accurately?
  2. Does the normal density plot fit carat or lcarat or neither more accurately?
  3. Do the normality tests suggest the data are normal?

Part 2: Plot lprice vs. lcarat

  1. Generate a scatter plot using PROC SGPLOT.
  2. Plot lprice on the y-axis and lcarat on the x-axis.
  3. Specify the transparency option to reduce the effect of overplotting.
  4. Add a title.
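The steps above might look like the following sketch. The dataset name work.diamonds_manip, the title text, and the transparency value of 0.9 are all assumptions chosen for illustration; pick a transparency that makes the dense regions readable in your own output.

```sas
/* Sketch only: dataset name, title text, and transparency value
   are illustrative assumptions. */
title "Log Price vs. Log Carat";

proc sgplot data=work.diamonds_manip;
    /* transparency ranges from 0 (opaque) to 1 (fully transparent) */
    scatter x=lcarat y=lprice / transparency=0.9;
run;

/* Clear the title so it does not carry over to later output. */
title;
```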

Screenshot the output.

Answer the following question:

  1. In comparison to the plot of price vs. carat in the lecture material, does the log transformation result in a more nearly linear relationship between the two variables?

Part 3: Manipulate the data again

Create another temporary dataset that adds the square roots of carat and price. SQRT() is the SAS function, or you can raise the variable to the one-half power with **(1/2). Be sure to keep the log-transformed variables (lprice and lcarat) in the dataset as well (see Parts 4 and 5).
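A DATA step along these lines would do it. The dataset names (work.diamonds_manip, work.diamonds_sqrt) and the new variable names (sqrt_price, sqrt_carat) are hypothetical; use your own. Both ways of taking the square root are shown so you can compare them.

```sas
/* Sketch only: all dataset and new variable names are hypothetical. */
data work.diamonds_sqrt;
    /* Reading the manipulated dataset carries lprice and lcarat along. */
    set work.diamonds_manip;
    sqrt_price = sqrt(price);      /* SQRT() function */
    sqrt_carat = carat**(1/2);     /* same result by exponentiation */
run;
```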

Part 4: Correlation Matrix

Use PROC CORR to compute Pearson’s r correlation coefficient for four variables: lprice, lcarat, the square root of price, and the square root of carat. Suppress the simple statistics in the output.
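One possible form of the step is sketched below. The dataset name work.diamonds_sqrt and the variable names sqrt_price and sqrt_carat are assumptions standing in for whatever you created in Part 3.

```sas
/* Sketch only: work.diamonds_sqrt, sqrt_price, and sqrt_carat are
   hypothetical names for the Part 3 dataset and variables. */
proc corr data=work.diamonds_sqrt pearson nosimple;
    var lprice lcarat sqrt_price sqrt_carat;
run;
```

The nosimple option drops the table of means and standard deviations, leaving only the correlation matrix.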

Using the output, answer the following questions. Remember, we are only interested in how carat size influences the price of the diamond. This means we are not interested in whether there is a relationship between price and the square root of price, for example. There will be, but it doesn’t have any practical implications.

  1. What is Pearson’s r for each of the following:
    • lprice by lcarat
    • square root of price by square root of carat
    • lprice by the square root of carat
  2. According to (1), which transformation has the strongest linear relationship? The weakest?

Part 5: Correlation Matrix Plot

Use either PROC CORR or PROC SGSCATTER to create a correlation matrix with scatterplots for the same four variables. Include histograms of the data on the diagonal of the matrix. You do not need to “prettify” the output.
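Either approach can be sketched as follows; you only need one of them. The dataset and square-root variable names are hypothetical placeholders for your Part 3 results. The maxpoints=none setting in the PROC CORR version is included on the assumption that your dataset has more observations than the procedure plots by default.

```sas
/* Sketch only: work.diamonds_sqrt, sqrt_price, and sqrt_carat are
   hypothetical names. */
ods graphics on;

/* Option A: PROC SGSCATTER with histograms on the diagonal. */
proc sgscatter data=work.diamonds_sqrt;
    matrix lprice lcarat sqrt_price sqrt_carat / diagonal=(histogram);
run;

/* Option B: PROC CORR scatter plot matrix with histograms.
   maxpoints=none lifts the default cap on plotted observations. */
proc corr data=work.diamonds_sqrt nosimple
          plots(maxpoints=none)=matrix(histogram);
    var lprice lcarat sqrt_price sqrt_carat;
run;
```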

Screenshot the output.

Answer the following questions:

  1. Does the square root transformation of carat appear to be more normally distributed than the log transformation of carat?
  2. Do the plots of the data support your answer to (2) in Part 4? Why?

Submission:

Your submission should include:

  1. a Word document with the screenshots and answers to the questions above
  2. your SAS code

Upload to Blackboard!