Kaggle Titanic: Simple prediction using SAS

This is the first Kaggle dataset I have worked on. If you haven’t heard of Kaggle before, it’s a platform where users and companies upload data sets for statisticians and data miners to compete on. Participants build predictive models for these data sets, and some competitions carry large cash prizes. Most participants use Python, but SAS works too. Here I will use SAS and the Titanic training and test data sets to predict survival on the Titanic.

First, we will use the training dataset and PROC FREQ to tabulate survival by sex on the Titanic.

Screen Shot 2017-11-15 at 11.08.16 PM

We can see that 74.20% of women survived, compared with 18.89% of men. With this information, we will make a simple prediction: every woman survived and every man did not.
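The screenshot shows only the output. A minimal sketch of a PROC FREQ call that could produce a table like this, assuming the training data has already been imported as a data set named train with variables sex and survived (the names used in the code later in this post):

```sas
/* Two-way table of sex by survival.                            */
/* The row percentages give the survival rate within each sex.  */
PROC FREQ DATA = train;
  TABLES sex * survived;
RUN;
```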

Now we will import the test set to SAS and run the code below:

DATA gender_submit(KEEP=survived passengerid );
  SET test;
  /* Predict survival for every female and death for everyone else */
  IF sex = "female" THEN survived = 1;
  ELSE survived = 0;
RUN;

The new data set, ‘gender_submit’, codes all females as survivors (survived=1) and everyone else, i.e. all the males, as non-survivors (survived=0). Although this is a very simple prediction, it follows the guidelines for an acceptable Kaggle submission.
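The import and export steps are not shown in the post. One way they might look is sketched below; the file paths are hypothetical placeholders, and the DATA step that builds gender_submit would run between the two procedures. Kaggle expects a CSV with a passenger ID column and a survival column, so it is worth checking the exported header before uploading.

```sas
/* Hypothetical path: point this at the downloaded Kaggle test file */
PROC IMPORT DATAFILE = "/path/to/test.csv"
  OUT = test
  DBMS = CSV
  REPLACE;
  GETNAMES = YES;
RUN;

/* (the DATA step creating gender_submit runs here) */

/* Hypothetical path: write the prediction file to upload */
PROC EXPORT DATA = gender_submit
  OUTFILE = "/path/to/gender_submit.csv"
  DBMS = CSV
  REPLACE;
RUN;
```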

More in-depth analysis:

Screen Shot 2017-11-19 at 1.45.45 PM

This two-way frequency table can be read the following way:

  1. 81 females did not survive (0) while 233 did (1).
  2. 52.53% of all passengers were male and did not survive. 26.15% of all passengers were female and did survive.
  3. Of the male passengers, 81.11% died, whereas 74.20% of the female passengers survived.
  4. Of those who died, 85.25% were male. Of those who survived, 68.13% were female.

Screen Shot 2017-11-19 at 2.09.26 PM

Screen Shot 2017-11-19 at 2.09.33 PM




We can also add the passenger class variable to the PROC FREQ analysis. An interesting finding is the high proportion of females who survived in first and second class (both over 90%), compared with a survival rate of only 50% for females in third class.
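A sketch of how class could be added to the table request. It assumes the class variable kept its Kaggle name, pclass, when the training file was imported; that name is not shown in the post.

```sas
/* Three-way request: one sex-by-survived table is printed per passenger class. */
/* NOCOL and NOPERCENT leave only counts and row percentages,                   */
/* so the survival rate by sex can be read directly within each class.          */
PROC FREQ DATA = train;
  TABLES pclass * sex * survived / NOCOL NOPERCENT;
RUN;
```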

DATA train_bin;
  LENGTH age_grp $20;
  SET train;
  /* Passengers with a missing age keep a blank age_grp */
  IF .< age <= 10    THEN age_grp = "0-le10";
  ELSE IF 10<age<=20 THEN age_grp = "gt10-le20";
  ELSE IF 20<age<=30 THEN age_grp = "gt20-le30";
  ELSE IF 30<age<=40 THEN age_grp = "gt30-le40";
  ELSE IF 40<age<=50 THEN age_grp = "gt40-le50";
  ELSE IF 50<age     THEN age_grp = "gt50";
RUN;

PROC FREQ DATA = train_bin;
  /* Row percentages only: survival rate within each age group */
  TABLES age_grp *survived /nocol nopercent ;
RUN;

The code above bins passenger age into 10-year groups that are easier to read in a table. Running PROC FREQ on train_bin produces the table below:

Screen Shot 2017-11-19 at 2.24.31 PM

For example, this table shows that 59.38% of children aged 10 or under survived.

Finally, we can see how the port of embarkation related to survival. PROC FREQ can also produce a plot of this.

Screen Shot 2017-11-19 at 2.31.01 PM

Screen Shot 2017-11-19 at 2.30.51 PM

For example, of those who embarked at Cherbourg (C), 55.36% survived. This is an incidental finding with no obvious basis for prediction, but interesting nonetheless.
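The code behind these screenshots is not shown. A minimal sketch of one way to get both the table and a plot from PROC FREQ, assuming the port variable kept its Kaggle name, embarked:

```sas
/* ODS Graphics must be on for PROC FREQ to draw plots */
ODS GRAPHICS ON;

PROC FREQ DATA = train;
  /* FREQPLOT draws a frequency plot of the two-way table */
  TABLES embarked * survived / NOCOL NOPERCENT PLOTS = FREQPLOT;
RUN;
```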




Logistic Regression Analysis

This week we will do a logistic regression on my little wine consumption study. Since my response variable is quantitative (number of wine drinks per year), I have to bin it into 2 categories. Using the SAS code below, I created WINECAT=0 for those who drank fewer than 52 drinks of wine per year (i.e. less than 1 glass per week) and WINECAT=1 for those who drank more than 52 drinks of wine per year (i.e. more than 1 glass per week). My hypothesis: among individuals who drink wine (N=14561 in the NESARC data set), those who have had major depression in their lifetime are more likely to be in WINECAT=1, i.e. more likely to drink more than 1 glass of wine per week, because who doesn’t like wine when you’re depressed!


Screen Shot 2017-10-19 at 10.42.11 PM
Binning the quantitative response variable
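The actual code is only in the screenshot. As a rough sketch, the binning could be written as a DATA step like the one below. The data set names nesarc and wine and the variable name wine_per_year are hypothetical stand-ins, and placing exactly 52 drinks in WINECAT=0 is an assumption, since the description above does not say where that value goes.

```sas
DATA wine;
  SET nesarc;                          /* hypothetical input data set */
  /* wine_per_year is a hypothetical name for wine drinks per year */
  IF wine_per_year = . THEN winecat = .;        /* keep missing as missing */
  ELSE IF wine_per_year > 52 THEN winecat = 1;  /* more than 1 glass per week */
  ELSE winecat = 0;                             /* 52 or fewer (assumption for exactly 52) */
RUN;
```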


Now, onto our table with the parameter estimates and the p-value (shown below).

Screen Shot 2017-10-19 at 11.05.34 PM

Hmmm, interesting. MAJORDEPLIFE is significant with a p-value of 0.0054; however, the parameter estimate is -0.1593. This indicates that those who have or have had major depression are less likely to fall into WINECAT=1 (more than 1 drink per week), the opposite of my hypothesis. Equation of the line: WINECAT = -1.6661 + (-0.1593)MAJORDEPLIFE.

Wait, this equation does not work as written because our response variable WINECAT is 0 or 1; the estimates are on the log-odds scale. We must turn to probability and the odds ratio. The odds ratio estimate for this logistic regression is shown below:

Screen Shot 2017-10-19 at 11.12.54 PM

Since both the response and explanatory variables are binary, we can interpret this odds ratio as follows: among individuals who drink wine, the odds of drinking more than 1 glass of wine per week for those with a history of major depression (MAJORDEPLIFE=1) are 0.853 times the odds for wine drinkers without major depression, i.e. about 15% lower.

We can be 95% confident that the odds ratio in the population of wine drinkers lies between 0.762 and 0.954 (the 95% confidence limits). Because this interval does not include 1, it agrees with the significant p-value.
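To make the link between the parameter estimate and the odds ratio concrete, here is a small sketch. The PROC LOGISTIC call is an assumption about how such a model might be fitted (the data set name wine is hypothetical). The DATA _NULL_ step only redoes the arithmetic on the estimates reported above: exponentiating -0.1593 gives roughly 0.853, matching the odds ratio table, and the two predicted probabilities come out at roughly 0.16 and 0.14.

```sas
/* DESCENDING models the probability that winecat = 1 rather than 0 */
PROC LOGISTIC DATA = wine DESCENDING;
  MODEL winecat = majordeplife;
RUN;

/* Convert the reported estimates to an odds ratio and predicted probabilities */
DATA _NULL_;
  b0 = -1.6661;                           /* intercept from the output above */
  b1 = -0.1593;                           /* MAJORDEPLIFE estimate */
  odds_ratio = EXP(b1);                   /* odds ratio for MAJORDEPLIFE = 1 vs 0 */
  p_no_dep = 1 / (1 + EXP(-b0));          /* P(WINECAT=1) without depression */
  p_dep    = 1 / (1 + EXP(-(b0 + b1)));   /* P(WINECAT=1) with depression */
  PUT odds_ratio= p_no_dep= p_dep=;       /* values are written to the SAS log */
RUN;
```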

Let’s include more variables in our logistic regression:



Screen Shot 2017-10-20 at 1.33.13 AM

When we look at the table above we can see that confounding occurs, as major depression is no longer significantly associated with wine drinking. We can say that being currently unemployed and looking for work AND/OR cannabis use confounds the relationship between major depression and wine use.

But which one is it?


Screen Shot 2017-10-20 at 1.37.05 AM
S3BQ1A5=”Ever used Cannabis?”



Screen Shot 2017-10-20 at 1.40.28 AM

Interesting, it looks like neither of the new variables by itself confounds the relationship between major depression and wine drinking. Cannabis use is not significant when major depression is taken into account, but present employment situation is, with a p-value of 0.0219 and a parameter estimate of 0.3391.

Screen Shot 2017-10-20 at 1.47.25 AM

The odds ratio for the unemployed/laid off and looking for work variable is 1.404. Therefore, among wine drinkers, those who are currently unemployed or laid off and looking for work have 1.404 times the odds of drinking more than 1 glass of wine per week (i.e. WINECAT=1). Confidence limits = 1.05 and 1.876. This variable does not confound the major depression variable.




Test a Multiple Regression Model

We’re also going to include the amount of cannabis used as another variable to see if it has any correlation with the amount of wine consumed per year.

Using the MEANS procedure, we get a mean frequency of cannabis use per year of 101.77.

Screen Shot 2017-10-15 at 2.13.41 AM

Centering this variable gives a mean of 0.004943, i.e. approximately zero.
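A sketch of what the centering step might look like. The centered name jntfreqyr_c appears later in this post; the uncentered name jntfreqyr and the data set names wine and wine_c are assumptions.

```sas
/* Step 1: find the mean of the uncentered variable (reported above as 101.77) */
PROC MEANS DATA = wine MEAN;
  VAR jntfreqyr;
RUN;

/* Step 2: subtract that mean so the new variable is centered near zero */
DATA wine_c;
  SET wine;
  jntfreqyr_c = jntfreqyr - 101.77;
RUN;

/* Step 3: confirm the centered mean is approximately zero */
PROC MEANS DATA = wine_c MEAN;
  VAR jntfreqyr_c;
RUN;
```

Subtracting a rounded mean is why the centered mean lands close to zero rather than exactly on it.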

Screen Shot 2017-10-15 at 2.24.07 AM

When performing the regression analysis, unfortunately, neither variable had a significant p-value. The parameter estimates for both were negative (as shown in the table below):

Screen Shot 2017-10-15 at 2.31.07 AM

MAJORDEPLIFE is highly significant on its own (p-value of 0.0003, beta of -5.853), so we can say that jntfreqyr_c confounds the relationship between MAJORDEPLIFE and the number of wine drinks consumed per year.

Screen Shot 2017-10-15 at 1.24.08 PM


Screen Shot 2017-10-15 at 2.43.58 AM

We can see that this variable is almost significant (p-value of 0.07) with a parameter estimate of 2.15.

When all variables are added (table below), MAJORDEPLIFE gets a negative parameter estimate of -8.22 and a lower p-value of 0.08. JNTFREQYR_C is still not significant, with a p-value of 0.36. Surprisingly, answering yes to “Ever had trouble falling/staying asleep when effects of alcohol were wearing off” became significant (p-value of 0.023), with a beta coefficient of -12.06. This indicates that those who said no to trouble falling asleep were associated with higher wine drinking frequency.

Screen Shot 2017-10-15 at 1.13.37 PM

The 95% confidence intervals for each variable are listed below:

Screen Shot 2017-10-15 at 1.32.51 PM

The intervals for the variables with nonsignificant p-values all include 0, i.e. no association.

Assessing model residuals:

Q-Q plot:

Screen Shot 2017-10-15 at 1.51.30 PM

Since our model residuals do not follow the straight line, we can assume that the residuals for this model are not normally distributed. Therefore, there may be other variables that should be included in our model to improve estimation.

Standardized residuals for all observations:

Screen Shot 2017-10-15 at 2.11.06 PM

As you can see, most of the residuals fall between 0 and -1. However, there are many outliers (observations more than 3 standard deviations from the mean). Since more than 1% of observations are more than 3 standard deviations from the mean, we can say that our model is a poor fit for the data. More explanatory variables would have to be added to get a better-fitting model.

Leverage plot:

Screen Shot 2017-10-15 at 2.23.34 PM

Looking at our leverage plot, we can see that there are 3 outliers with high leverage, which therefore have a high influence on the estimated regression model.
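The diagnostics above come from screenshots. The sketch below shows one way such output could be requested from PROC GLM and how the share of extreme standardized residuals could be checked. The data set wine_c and the response name wine_per_year are hypothetical, and only the two explanatory variables whose names appear in this post are included.

```sas
ODS GRAPHICS ON;

/* PLOTS(UNPACK)=ALL requests the diagnostic plots as separate graphs,   */
/* including the Q-Q plot and the leverage plot.                         */
/* CLPARM adds 95% confidence limits to the parameter estimates.         */
PROC GLM DATA = wine_c PLOTS(UNPACK) = ALL;
  MODEL wine_per_year = majordeplife jntfreqyr_c / SOLUTION CLPARM;
  /* Save raw and standardized residuals for further checks */
  OUTPUT OUT = results RESIDUAL = res STUDENT = stdres;
RUN;
QUIT;

/* Flag observations more than 3 standard deviations from zero */
DATA results_flag;
  SET results;
  IF stdres NE . THEN extreme = (ABS(stdres) > 3);
RUN;

/* The mean of a 0/1 flag is the proportion of extreme residuals */
PROC MEANS DATA = results_flag MEAN;
  VAR extreme;
RUN;
```

A mean of extreme above 0.01 would correspond to the "more than 1% of observations" rule of thumb used above.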



Test a basic linear regression model

Frequency table of explanatory variable:

Screen Shot 2017-09-27 at 9.34.19 PM

SAS code to produce the regression model is shown below:

Screen Shot 2017-09-27 at 9.20.56 PM

Since the explanatory variable (MAJORDEPLIFE) is categorical, there is no scatterplot with a fitted line; however, an equation of the line is still estimated:


Screen Shot 2017-09-27 at 9.23.19 PM

The intercept of the line is 37.01, and the slope is -5.85. This means that those without depression (MAJORDEPLIFE = 0) consume an average of 37 drinks of wine per year, while those with depression (MAJORDEPLIFE = 1) consume an average of approximately 31 drinks a year (37.01 - 5.85 = 31.16). This is the opposite of my original hypothesis (i.e. that a history of depression goes with more wine use).
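With a single 0/1 explanatory variable, the intercept is the mean of the group coded 0 and the intercept plus the slope is the mean of the group coded 1. A quick way to check this is to compute the group means directly, as sketched below (the data set name wine and the response name wine_per_year are hypothetical):

```sas
/* Mean wine drinks per year within each depression group.            */
/* The two means should match 37.01 and 37.01 - 5.85 = 31.16          */
/* from the regression equation, up to rounding.                      */
PROC MEANS DATA = wine MEAN N;
  CLASS majordeplife;
  VAR wine_per_year;
RUN;
```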

The bar chart of these averages is shown below:

Screen Shot 2017-09-27 at 9.28.11 PM.png

The results of the linear regression model indicated that major depression (Beta=-5.85, p=.0003) was significantly and negatively associated with the number of wine drinks per year.


Introduction to Regression Analysis – Wine and Depression


For this analysis on wine and depression, the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) is used. It is a nationwide American longitudinal survey of alcohol and drug use and related psychiatric and medical comorbidities. The NESARC sample (N=43,093) consists of civilian, non-institutionalized adults. All dwelling types were included, such as independent homes, military bases, boarding houses and rooming houses. The study oversampled Blacks and Hispanics, as well as young adults (18-24), increasing their representation in the study. For this analysis, the sample was restricted to those who drank any wine in the last 12 months (N=14561).


U.S. Census Bureau Field Representatives collected the data in 2001–2002 through computer-assisted personal interviews (CAPI). In each household, one adult was selected for the interview, which was conducted in the respondent’s home after consent had been received.


Lifetime major depression (i.e. having experienced major depression at least once in one’s lifetime) was assessed using the Alcohol Use Disorder and Associated Disabilities Interview Schedule (AUDADIS-IV) and the DSM-IV (Grant et al., 2003; Grant, Harford, Dawson, & Chou, 1995). For the wine drinkers, the alcohol module of the AUDADIS-IV includes a variety of questions on the frequency and quantity of wine consumption, for example the quantitative variable “How often drank wine in the last 12 months?” or the qualitative variable “Type of wine consumed in the last 12 months?”. This analysis will focus on the quantitative variable, i.e. the amount of wine consumed and its association with depression.



An initial chi-squared test compared cannabis use in the last 12 months (0=no, 1=yes) with medical contact in the last 12 months (shown as a mean percentage). The graph shows that the proportion with medical contact is higher among cannabis users (rising from ~15% to ~24%). This result is statistically significant (see the chi-squared value and p-value below).

Screen Shot 2017-09-01 at 11.17.27 AM

Screen Shot 2017-09-15 at 3.33.21 PM


We will now include lifetime nicotine dependence as a moderator to the analysis. Below shows the SAS code used:

 Screen Shot 2017-09-15 at 3.46.44 PM

For those with no nicotine dependence (TABLIFEDX=0), there is a small increase in medical contact from cannabis nonusers to users (14% and 20.7%, respectively).

Screen Shot 2017-09-15 at 3.45.35 PM

For those with nicotine dependence (TABLIFEDX=1), there is also a small increase in medical contact from cannabis nonusers to users (21% and 28.8%, respectively).

Screen Shot 2017-09-15 at 3.49.45 PM

Since the p-value for both chi-squared tests is <0.0001, we can say that nicotine dependence does not moderate the relationship between cannabis use and medical contact in the last 12 months, i.e. the finding that those who used cannabis have higher rates of medical contact holds in both groups.
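The code used is only in the screenshot. One common way to test for moderation in SAS is to run the chi-squared test separately within each level of the third variable, sketched below. TABLIFEDX comes from the text above; the data set name nesarc and the variable names cannabis12 and medcontact are hypothetical.

```sas
/* BY-group processing needs the data sorted by the moderator */
PROC SORT DATA = nesarc OUT = nesarc_sorted;
  BY tablifedx;
RUN;

/* One chi-squared test per level of nicotine dependence */
PROC FREQ DATA = nesarc_sorted;
  BY tablifedx;
  TABLES medcontact * cannabis12 / CHISQ;
RUN;
```

If the association has the same direction and stays significant in both groups, as reported above, the third variable is not acting as a moderator.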



Since my original question looks at 2 categorical variables, I decided to take a look at how the age of first cannabis use is correlated with the number of beers drunk per year.

I produced the SAS code below:

Screen Shot 2017-09-15 at 12.19.24 AM

The correlation procedure will look at “Age First Used Cannabis” vs “Number of Beers consumed per year”.

The correlation output is below:

Screen Shot 2017-09-15 at 12.19.49 AM

In this Pearson correlation, r=-0.04468, so there is a very small negative correlation between age at first cannabis use and number of beers consumed in a year. The p-value is 0.0009, so the relationship is unlikely to be due to chance alone. The r-squared value is 0.0019963024, meaning age at first cannabis use accounts for only about 0.2% of the variability in beers consumed per year.
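For reference, a sketch of a correlation request of this kind, followed by the r-squared arithmetic. The data set name nesarc and the variable names age_first_cannabis and beers_per_year are hypothetical; the real code is in the screenshot above.

```sas
/* Pearson correlation between the two quantitative variables */
PROC CORR DATA = nesarc PEARSON;
  VAR age_first_cannabis beers_per_year;
RUN;

/* r-squared is the square of the reported correlation coefficient */
DATA _NULL_;
  r = -0.04468;
  r_squared = r**2;
  PUT r_squared=;     /* value is written to the SAS log */
RUN;
```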