This is the first Kaggle dataset I have worked on. If you haven’t heard of Kaggle before, it’s a platform where users and companies upload datasets for statisticians and data miners to compete on. Competitors build predictive models for these datasets, and some competitions carry large cash prizes. Most competitors use Python, but SAS can also be used. For this post, I will use SAS and the Titanic datasets to predict survival on the Titanic.
First, we will use the training dataset and PROC FREQ to determine survivorship by sex on the Titanic.
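A minimal sketch of that step is shown here. It assumes the Kaggle training file has already been imported as a dataset named train and that the columns keep their Kaggle names, sex and survived.

```sas
/* Assumption: WORK.train holds the imported Kaggle training file, */
/* with the columns sex and survived (0 = died, 1 = survived).     */
PROC FREQ DATA = train;
    TABLES sex * survived;   /* rows = sex, columns = survived */
RUN;
```

In the resulting two-way table, the row percentages give the survival rate within each sex.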
We can see that 74.20% of women survived, compared with 18.89% of men. With this information, we will make a simple prediction: every woman survived and every man did not.
Now we will import the test set to SAS and run the code below:
DATA gender_submit(KEEP=survived passengerid);
    SET test;
    /* Predict survival for every female passenger, death for everyone else */
    IF sex = "female" THEN survived = 1;
    ELSE survived = 0;
RUN;
The new dataset, called ‘gender_submit’, codes all females as survivors (survived = 1) and everyone else, i.e. all the males, as non-survivors (survived = 0). Although this is a very simple prediction, it follows the guidelines for an acceptable Kaggle submission.
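To upload the prediction, the dataset has to be written out as a CSV file. One way to do that is PROC EXPORT, sketched here. The output path is a placeholder, not something from the original analysis.

```sas
/* Write the prediction to a CSV file for upload.                  */
/* Assumption: the path below is hypothetical; replace it with a   */
/* folder your SAS session can write to.                           */
PROC EXPORT DATA = gender_submit
    OUTFILE = "/path/to/gender_submit.csv"
    DBMS = CSV
    REPLACE;
RUN;
```

Before uploading, it is worth checking that the header row matches the sample submission file that Kaggle provides.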
More in-depth analysis:
This two-way frequency table can be read the following way:
- 81 females did not survive (0) while 233 did (1).
- 52.53% of the total passengers were male and did not survive. 26.15% of the total passengers were female and did survive.
- Of the male passengers, 81.11% died, whereas 74.20% of the female passengers survived.
- Of those who died, 85.25% were male. Of those who survived, 68.13% were female.
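As a quick check on how the row percentages are derived, the female survival rate follows directly from the two counts in the first bullet:

```latex
\[
\frac{233}{81 + 233} = \frac{233}{314} \approx 74.20\%
\]
```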
We can also include the variable “class” in the PROC FREQ analysis. An interesting finding is the high proportion of females who survived in first and second class (both over 90%), compared with a survival rate of only 50% for females in third class.
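A three-way table request along these lines would produce that breakdown. This is a sketch that assumes the class variable keeps its Kaggle name, pclass; adjust it if the column was renamed on import.

```sas
/* Assumption: the passenger class column is named pclass, as in the Kaggle CSV. */
/* The first variable stratifies the output, so PROC FREQ prints one             */
/* sex-by-survived table for each class.                                         */
PROC FREQ DATA = train;
    TABLES pclass * sex * survived / NOCOL NOPERCENT;
RUN;
```

With NOCOL and NOPERCENT, each cell shows only the count and the row percentage, which is the survival rate by sex within that class.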
DATA train_bin;
    LENGTH age_grp $20;
    SET train;
    /* Bin age into 10-year groups; a missing age leaves age_grp blank */
    IF . < age <= 10 THEN age_grp = "0-le10";
    ELSE IF 10 < age <= 20 THEN age_grp = "gt10-le20";
    ELSE IF 20 < age <= 30 THEN age_grp = "gt20-le30";
    ELSE IF 30 < age <= 40 THEN age_grp = "gt30-le40";
    ELSE IF 40 < age <= 50 THEN age_grp = "gt40-le50";
    ELSE IF 50 < age THEN age_grp = "gt50";
RUN;

/* Survival by age group: counts and row percentages only */
PROC FREQ DATA = train_bin;
    TABLES age_grp * survived / NOCOL NOPERCENT;
RUN;
The code above bins the passengers’ ages into 10-year groups, which makes the output easier to read. Running PROC FREQ on train_bin produces the table below:
For example, this table shows that 59.38% of children aged 10 or younger survived.
Finally, we can see how the port where the passengers embarked made a difference in survivorship. We can also use PROC FREQ to visualize this graphically.
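One way to get both the table and a plot from PROC FREQ is the PLOTS= option with ODS Graphics enabled. The sketch below assumes the port column is named embarked, as in the Kaggle file; the stacked layout is just one possible choice and may differ from the plot originally produced.

```sas
/* Assumption: the port of embarkation column is named embarked. */
ODS GRAPHICS ON;

PROC FREQ DATA = train;
    /* Two-way table plus a stacked frequency plot of survival by port */
    TABLES embarked * survived / NOCOL NOPERCENT
           PLOTS = FREQPLOT(TWOWAY = STACKED);
RUN;

ODS GRAPHICS OFF;
```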
For example, of those who embarked at Cherbourg (C), 55.36% survived. This is an incidental finding with no real basis for use in prediction, but it is interesting nonetheless.