Lock5withR includes a data set with data from a student survey. It includes the following variables.
variable | description |
---|---|
Year | Year in school |
Gender | Student’s gender: ‘F’ or ‘M’ |
Smoke | Smoker? ‘No’ or ‘Yes’ |
Award | Preferred award: ‘Academy’ ‘Nobel’ ‘Olympic’ |
HigherSAT | Which SAT is higher? ‘Math’ or ‘Verbal’ |
Exercise | Hours of exercise per week |
TV | Hours of TV viewing per week |
Height | Height (in inches) |
Weight | Weight (in pounds) |
Siblings | Number of siblings |
BirthOrder | Birth order, 1=oldest |
VerbalSAT | Verbal SAT score |
MathSAT | Math SAT score |
SAT | Combined Verbal + Math SAT |
GPA | College grade point average |
Pulse | Pulse rate (beats per minute) |
Piercings | Number of body piercings |
For now, let’s focus on just two variables: Sex and Award. (Each student was asked whether they would rather win an Academy Award, a Nobel Prize, or an Olympic gold medal. Award records their answers.) If the members of your group were added to the data set (just for these two variables), what would the new rows of data look like?
Write down some questions we might answer using the Sex and/or Award variables. Which of your questions/answers need both variables? Which only require one of the variables?
Our main tools for investigating questions like these will be tally() for numerical summaries and gf_bar() for bar plots.
Run these commands to find out.
library(Lock5withR) # Load the package that contains the data
library(mosaic)     # Provides tally(); also loads ggformula, which provides gf_bar()
gf_bar( ~ Award, data = StudentSurvey)
tally( ~ Award, data = StudentSurvey)
Run the commands below to make numerical tables of different kinds.
tally( ~ Award | Sex, data = StudentSurvey, format = "percent")
tally( Award ~ Sex, data = StudentSurvey, format = "prop")
tally( Award ~ Sex, data = StudentSurvey, margins = TRUE)
tally( Award ~ Sex, data = StudentSurvey, margins = TRUE, format = "percent")
Which tables do you like best for this question?
When you use proportions or percents, be sure to check which things add up to 1 or 100%. (Possible answers: rows, columns, or the whole table.)
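One way to check which direction a table of proportions was normalized is to add up its rows and columns. The sketch below uses base R only, with a small made-up table of counts; the numbers and the names counts, by_table, by_row, and by_column are hypothetical, chosen for easy arithmetic, and are not taken from StudentSurvey.

```r
# Hypothetical counts, for illustration only (not the StudentSurvey data)
counts <- matrix(
  c(10, 30, 60,    # first column (group A)
    20, 20, 10),   # second column (group B)
  nrow = 3,
  dimnames = list(Award = c("Academy", "Nobel", "Olympic"),
                  Group = c("A", "B"))
)

by_table  <- prop.table(counts)              # divide by the grand total
by_row    <- prop.table(counts, margin = 1)  # divide each row by its row total
by_column <- prop.table(counts, margin = 2)  # divide each column by its column total

sum(by_table)       # the whole table adds to 1
rowSums(by_row)     # each row adds to 1
colSums(by_column)  # each column adds to 1
```

The same row and column sums can be applied to a table of proportions from any source to see what was used as the denominator.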
gf_bar() can create a variety of bar charts.
Try these examples.
gf_bar( ~ Award, data = StudentSurvey, fill = ~Sex)
gf_bar( ~ Award, data = StudentSurvey, fill = ~Sex, position = "dodge")
gf_bar( ~ Award | Sex, data = StudentSurvey, fill = ~Sex)
gf_bar( ~ Sex, data = StudentSurvey, fill = ~ Award)
Which do you like best for answering this question?
We can also use gf_props() or gf_percents() to make bar charts on a proportion or percent scale.
Try these (or use gf_percents() instead of gf_props() if you want percents instead of proportions):
gf_props( ~ Award, data = StudentSurvey, fill = ~Sex)
gf_props( ~ Award, data = StudentSurvey, fill = ~Sex, position = "dodge")
gf_props( ~ Award, data = StudentSurvey, fill = ~Sex, position = "dodge",
          denom = ~fill)
gf_props( ~ Award | Sex, data = StudentSurvey, fill = ~Sex)
gf_props( ~ Sex, data = StudentSurvey, fill = ~ Award)
gf_props( ~ Sex, data = StudentSurvey, fill = ~ Award, denom = ~x)
In each case, determine which segments add to 1 (or 100 percent).
What does denom do?
A nationwide US telephone survey conducted by the Pew Foundation in October 2010 asked 2625 adults ages 18 and older, “Some people say there is only one true love for each person. Do you agree or disagree?” The survey participants were selected at random and contacted by landline or cell phone. In addition to the answer to the question, surveyors recorded the sex of each person surveyed.
What is the population for this study?
What are some potential sources of bias in this study? Do you expect the bias to be relatively small or potentially large?
What are the cases in this study?
What are the variables? Are they categorical or quantitative?
Write down what the first few rows of the data set would look like if your group members were the first few cases.
Of those surveyed, 735 people agreed, 1812 disagreed, and 78 answered “don’t know”.
It is important to distinguish between the proportion of people in the population who would answer a certain way and the proportion of people in our sample who did answer a certain way. We have terminology and notation to distinguish between the two.
summary | parameter | statistic |
---|---|---|
proportion | \(p\) | \(\hat p\) (read: p hat) |
mean | \(\mu\) (Greek letter “mu”) | \(\overline x\) (read: x bar) or \(\hat \mu\) |
standard deviation | \(\sigma\) (Greek letter “sigma”) | \(s\) or \(\hat\sigma\) |
The notation for the median and the IQR is less standardized.
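As a small illustration of the statistic side of this distinction, the sample proportions \(\hat p\) can be computed from the counts reported for the Pew survey. This is only a sketch in base R; the names counts, n, and p_hat are chosen here for illustration. The population proportion \(p\) cannot be computed this way, because it depends on people who were not surveyed.

```r
# Counts reported for the Pew survey sample
counts <- c(agree = 735, disagree = 1812, dont_know = 78)

n <- sum(counts)      # sample size; should match the 2625 adults surveyed
p_hat <- counts / n   # sample proportions (statistics, not parameters)

n
round(p_hat, 3)
```

For example, the sample proportion who agreed is 735/2625 = 0.28.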
Here is the two-way table for the Pew study.
answer | Male | Female |
---|---|---|
agree | 372 | 363 |
disagree | 807 | 1005 |
don’t know | 34 | 44 |
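If you would like to work with this table in R rather than by hand, one possible approach is to enter the counts as a matrix. The sketch below uses base R only; the object name pew and the dimension labels are assumptions made for this example.

```r
# Enter the two-way table of counts (matrix() fills column by column)
pew <- matrix(
  c(372, 807, 34,     # Male column
    363, 1005, 44),   # Female column
  nrow = 3,
  dimnames = list(answer = c("agree", "disagree", "don't know"),
                  sex = c("Male", "Female"))
)

addmargins(pew)                         # counts with row and column totals
round(prop.table(pew, margin = 2), 3)   # proportions within each sex (columns add to 1)
```

The row totals from addmargins() should agree with the 735, 1812, and 78 reported earlier, which is a quick check that the table was entered correctly.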
Use the table to answer the following questions.