This week’s lab will show you how to apply statistical methods and resampling techniques to a dataset from the natural sciences, the Arbuthnot birth records. Along the way, we will see how statistical methods help put the scientific method into practice, and you will get hands-on experience with the kind of data analysis a scientist performs after completing a series of experimental measurements.

Natural science, data science

Many of the datasets we’ve worked through in our labs this semester have come from fields outside of the natural sciences. That doesn’t mean the skills we’re building have no useful application in fields such as physics, chemistry, and biology. To demonstrate this, we will apply statistical methods this week to a dataset from the natural sciences, using it to build hypothesis tests about the male-to-female birth ratio in London from 1629 to 1710.

About this week’s dataset

The Arbuthnot dataset records the annual numbers of male and female christenings (baptisms) in London from 1629 to 1710, along with their ratio. Dr. John Arbuthnot (1710) was a physician, writer, and mathematician who used these christening records to carry out the first known significance test, comparing observed data to a null hypothesis. He was interested in the ratio of male to female births in London over this period. The data showed that in every one of these 82 years there were more male than female christenings.

Instructions for installing the HistData package

First, type install.packages("HistData") in the Console. Next, run the setup chunk, which is located at the top of your .Rmd file.
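For reference, a minimal setup chunk for this lab might look like the sketch below. The exact contents of the setup chunk in your provided .Rmd may differ, so treat this only as an illustration of which packages the lab relies on.

    library(tidyverse)   # ggplot2, dplyr, and the pipe (%>%) for plotting and wrangling
    library(infer)       # resampling-based inference, used later in this lab
    library(HistData)    # provides the Arbuthnot dataset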

Visualizing and quantifying the distribution

Let’s start with the usual practice of getting to know our dataset. There’s only one relevant variable in this dataset, Ratio, so it’s the distribution of the male-to-female birth ratio that matters. Let’s appraise this distribution by creating some visualizations:

  1. Before we get started with visualization, run help(Arbuthnot) in the Console to look at the dataset documentation and identify which two years are probably duplicates. Next, run View(Arbuthnot) in the Console to confirm your guess by reporting the male and female baptism counts in those two years. Finally, drop one of those two years from the dataset using the filter() function and save the result as a new variable (see the sketch below for the general pattern). The de-duped dataset will be used for the remainder of the assignment.
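    A rough sketch of the filtering step, assuming you name the new dataset arbuthnot_deduped (a placeholder; use whatever name you prefer):

    arbuthnot_deduped <- Arbuthnot %>%
      filter(Year != ...)   # replace ... with the duplicated year you chose to drop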

Now, let’s get crackin’.

  2. Visualize the dataset distribution as a boxplot — use

    ggplot(data = ...) +
      geom_boxplot(aes(x = "", y = Ratio)) + 
      coord_flip()

    and as a probability mass function (PMF) — use geom_histogram() with y = ..density.. inside aes() — with a binwidth that allows you to see the full dataset. Describe the center, shape, and spread of the distribution (don’t forget to mention the outliers).

    Make sure to use the de-duped dataset that you created in Exercise 1.
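    A sketch of the PMF piece, assuming the de-duped dataset from Exercise 1 is named arbuthnot_deduped (a placeholder) and using a binwidth you will likely want to adjust:

    ggplot(data = arbuthnot_deduped) +
      geom_histogram(aes(x = Ratio, y = ..density..), binwidth = 0.01)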

  3. Create a graph that shows both a density plot (using geom_density()) and a PMF (using geom_histogram()) on the same plot.

    To do this, you will need to create a ggplot with both geom functions. (Make sure to use the y = ..density.. argument again in geom_histogram.)
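    A minimal sketch of the layered plot, again using the placeholder name arbuthnot_deduped:

    ggplot(data = arbuthnot_deduped, mapping = aes(x = Ratio)) +
      geom_histogram(aes(y = ..density..), binwidth = 0.01) +
      geom_density()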

  4. Finally, to wrap up this initial exploration, quantify the distribution by computing its summary statistics. The following R functions are useful for computing the summary statistics of a dataset:

    • mean(): Computes the average

    • median(): Computes the median

    • min(): Finds the minimum value

    • max(): Finds the maximum value

    • sd(): Computes the standard deviation

    • IQR(): Computes the interquartile range

    Calculate the following summary statistics: the mean, median, maximum, minimum, standard deviation, and the interquartile range (IQR).

    ... %>%
      summarize(
        mean = mean(Ratio),
        median = median(Ratio),
        sd = sd(Ratio),
        iqr = IQR(Ratio),
        min = min(Ratio),
        max = max(Ratio)
      )

    (Replace the ... with the de-duped dataset from Exercise 1)

    Which summary statistics are sensitive to removing the outliers? Which ones are not?
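    One way to check is to recompute the summary statistics with the outlier years removed and compare the two sets of results. A sketch (the cutoff is a placeholder; choose it based on the outliers you identified in your plots):

    arbuthnot_deduped %>%
      filter(Ratio < ...) %>%   # replace ... with a cutoff that excludes the outliers
      summarize(
        mean = mean(Ratio),
        median = median(Ratio),
        sd = sd(Ratio),
        iqr = IQR(Ratio)
      )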

Inferring a trend

Because there is a spread in the Ratio measurements in Arbuthnot’s dataset, the measured Ratio should be reported as a mean value with error bars. The error bars are typically found by calculating a confidence interval. A typical choice is a 95% confidence interval, which can be estimated using computational simulations that resample the dataset.

To perform our statistical resampling, we will use the tidyverse-inspired infer package, which will help us to compute confidence intervals and perform hypothesis tests.
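As an illustration of the workflow, a 95% bootstrap percentile confidence interval for the mean ratio could be computed along the following lines (a sketch, again using the placeholder name arbuthnot_deduped for the de-duped dataset):

    arbuthnot_boot <- arbuthnot_deduped %>%
      specify(response = Ratio) %>%                    # the variable of interest
      generate(reps = 10000, type = "bootstrap") %>%   # resample the data with replacement
      calculate(stat = "mean")                         # compute the mean of each resample

    arbuthnot_boot %>%
      get_confidence_interval(level = 0.95, type = "percentile")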

If you did not already install the required version of the infer package in CDS 101, you can easily install it by running the following in your Console window:

devtools::install_version("infer", version = "0.5.0", repos = "http://cran.us.r-project.org")

You will then need to re-run the setup chunk to reload the correct version of the infer package.

  5. We can use infer to perform a two-sided hypothesis test. The code for doing this is similar to the resampling code above; we just need to add an additional hypothesize(...) step.

    Of course, in order to run a hypothesis test we need some sort of hypothesis to test against, which will allow us to define the null distribution. We’re going to compare our experimental data (the mean Ratio we measured in the real world) to a theoretical null hypothesis in which the mean male-to-female birth ratio is 1.

    We can calculate the probability of obtaining a result at least as extreme as our experimental data in a world in which the null hypothesis is true. This probability is called the p-value. We also need to select a significance level \(\alpha\), which serves as a kind of evidence threshold used when deciding whether or not we can reject the null hypothesis. A common choice for \(\alpha\) is 0.05, which is the value that we will use.

    Write down (in words) the null hypothesis and the alternative hypothesis for comparing this dataset against a hypothesized mean value of 1.

    Note that this is a two-sided hypothesis test, which is different from the one-sided tests that you have covered in CDS 101.

    • In a one-sided test, our alternative hypothesis takes the form: Is our observed statistic significantly greater than the value we are comparing it to? (or significantly less than).

    • In a two-sided test, our alternative hypothesis takes the form: Is our observed statistic significantly different than the value we are comparing it to? In other words, we don’t care about the direction of the difference, only that there is one.

    (In this hypothesis test, the observed statistic is the mean of the ratio of male to female births.)

    We can use this code to generate the null distribution needed to perform the hypothesis test:

    `dataset name`_null <- `dataset name` %>%
      specify(formula = Ratio ~ NULL) %>%              # Ratio is the response; there is no explanatory variable
      hypothesize(null = "point", mu = 1) %>%          # point null hypothesis: the true mean ratio is 1
      generate(reps = 10000, type = "bootstrap") %>%   # simulate 10,000 resamples under the null
      calculate(stat = "mean")                         # compute the mean of each simulated resample

    Now that we have a null distribution, we can use it in combination with the experimental average of the male-to-female birth ratio to calculate the p-value. The p-value is the probability that, if the null hypothesis were true and we repeated the experiment, we would obtain a result that is the same as or more extreme than the reported experimental measurement. Put another way, we need to count the data points in the simulated null distribution that are the same as or more extreme than the experimental measurement. At this point we have the simulated null distribution of the mean ratio; what we still need is the mean male-to-female birth ratio actually observed in the data, called the observed statistic. This value serves as the threshold when computing the p-value, since it determines which values in the null distribution count as equally or more extreme. To find the observed statistic, we would run the following:

    `dataset name`_obs_stat <- `dataset name` %>%
      specify(Ratio ~ NULL) %>%       # same response variable as in the null distribution
      calculate(stat = "mean")        # the mean ratio observed in the real data

    If the computed p-value is less than 0.05, our significance level, then we reject the null hypothesis in favor of the alternative hypothesis. To find the p-value, we would run the following:

    `dataset name`_null %>%
      get_p_value(obs_stat = `dataset name`_obs_stat, direction = "two_sided")

    Finally, let’s visualize our observed statistic to see where it lies in the null distribution and check that the picture is consistent with our p-value. Run the following code:

    visualize(`dataset name`_null) +
      shade_p_value(obs_stat = `dataset name`_obs_stat, direction = "two_sided")

    Based on the code that you’ve run above to make statistical inferences, explain the result of your hypothesis test.

  6. Use the infer package to run a two-sided hypothesis test (with \(\alpha = 0.05\)) against a null hypothesis in which the mean ratio is equal to 1.05, as sketched below. (In the previous exercise we used 1, i.e. equal numbers of male and female births; however, 1.05 is approximately the ratio of male to female births as of 2020.)

    Is there a difference between the result of this hypothesis test and the previous one? Can you explain why or why not?
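    The only change from the previous exercise is the value passed to mu; a sketch, using the same placeholder naming convention as before:

    `dataset name`_null_105 <- `dataset name` %>%
      specify(Ratio ~ NULL) %>%
      hypothesize(null = "point", mu = 1.05) %>%   # the hypothesized mean is now 1.05 instead of 1
      generate(reps = 10000, type = "bootstrap") %>%
      calculate(stat = "mean")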

How to submit

To submit your lab assignment, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!

Credits

This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exercises and instructions written by Felicia Natalie Wijaya, James Glasbrenner, Dominic White, and Ajay Kulkarni for CDS-102.
