I am a keen follower of R Bloggers, the blog aggregator about the R programming language and data science. It is a very good resource for exploring the possibilities of this statistical programming language, which is a standard in the field, especially among more advanced researchers. R has a steep learning curve, so it is advisable to learn a good deal about statistical concepts and reasoning before diving into the language. It also helps to know something about programming. But the open source commitment of the language's developers and its wide community of users provide an open and plentiful trove of resources.
I receive e-mails from R Bloggers every single day of the week. And every day, be it a weekday, a Saturday, a Sunday or a holiday, there are always good posts: one about a new technique, another offering a new perspective, yet another on a relevant resource or an innovative way of dealing with R and its main applications, and so on… Not to be missed.
Today was one of those days when an article from R Bloggers caught my attention again. This time the name of the blog and the title of the article were the triggers. After reading the post I thought I should share and reproduce part of it, for it is about the importance of data literacy, knowledge of statistics, and what this means for the 21st century. We are living through an age in which knowledge of these subjects is increasing, but in which the mismatch between what society needs to know and what it really knows about them is also increasing. This is having wide economic impacts, as companies and organizations have started complaining about the mismatch. On the other hand, if there is a mismatch, there are also, as never before, all the resources needed to bridge the gaps. If this isn’t happening as often as desired, then the reasons may be of a different nature than a statistical regularity or anomaly; perhaps we are talking about a contradiction of human nature…
Let us dive a little deeper into the post; I will conclude with some remarks afterwards.
In an article called A Paradox in the Interpretation of Group Comparisons published in Psychological Bulletin, Lord (1967) made famous the following controversial story:
A university is interested in investigating the effects of the nutritional diet its students consume in the campus restaurant. Various types of data were collected including the weight of each student in the month of January and their weight in the month of June of the same year. The objective of the University is to know if the diet has greater effects on men than on women. This information is analyzed by two statisticians.
The first statistician observes that at the end of the semester (June), the average weight of the men is identical to their average weight at the beginning of the semester (January). The same holds for the women. The only difference is that the women started the year with a lower average weight (which is to be expected from their build). On average, neither men nor women gained or lost weight during the course of the semester. The first statistician concludes that there is no evidence of any significant effect of diet (or any other factor) on student weight. In particular, there is no evidence of a differential effect between the sexes, since neither group shows systematic differences.
The second statistician examines the data more carefully. He notes that there is a group of men and women who started the semester with the same weight. This group consists of thin men and overweight women. He observes that, on average, those men gained weight and those women lost weight. The second statistician concludes that, controlling for initial weight, the university diet has a positive differential effect on men relative to women: among men and women with the same initial weight, the men on average end the semester heavier than the women, since the men gained weight and the women lost weight.
The following chart shows the reasoning of both statisticians. The black line is the 45-degree line, the green points are the data for the men and the red points are the data for the women:
The reasoning of the first statistician focuses on the means of both distributions, specifically on the coordinates (x = 60, y = 60) for the women and (x = 70, y = 70) for the men, where the black, red and green lines appear to coincide. The reasoning of the second statistician is limited to the region where the red and green dots overlap, specifically x = (60, 70), y = (60, 70). Suppose we have access to this dataset as shown in the following illustration, where the first column denotes the initial weight of the students, the second column the final weight, the third column the difference between the two weights, and the last column the sex of the student.
The findings of the first statistician are obtained through a simple regression analysis that takes the difference between weights as the response variable. It yields a regression coefficient equal to zero for the variable sex, which indicates that there is no significant difference in weight change between men and women.
The findings of the second statistician are obtained through an analysis of covariance, taking the final weight as the response variable and sex and the initial weight of the individual as covariates. This method yields a regression coefficient for sex equal to 5.98, which implies that there is a significant difference in final weight according to sex once initial weight is held fixed.
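The two analyses can be sketched in R as follows. This is a minimal sketch, not the original author's code: the `lm()` calls, the seed and the names `fit_change` and `fit_ancova` are assumptions, and the data are simulated with the same setup the post gives in its simulation section (women's initial weights uniform between 50 and 70, men 10 units heavier, common slope 0.4).

```r
set.seed(123)  # assumed seed, only so that the sketch is reproducible

# Simulated data following the setup of the post's simulation section
N <- 1000
b <- 10
beta1 <- 0.4
Mujer1 <- runif(N, 50, 70)   # women's initial weights
Hombre1 <- Mujer1 + b        # men's initial weights
Mujer2 <- (1 - beta1) * mean(Mujer1) + beta1 * Mujer1 + rnorm(N, sd = 1)
Hombre2 <- (mean(Hombre1) - beta1 * (mean(Mujer1) + b)) +
  beta1 * Hombre1 + rnorm(N, sd = 1)

datos <- data.frame(
  inicio = c(Mujer1, Hombre1),
  final  = c(Mujer2, Hombre2),
  sexo   = rep(c(0, 1), each = N)   # 0 = woman, 1 = man
)
datos$dif <- datos$final - datos$inicio

# First statistician: regress the weight change on sex
fit_change <- lm(dif ~ sexo, data = datos)

# Second statistician: analysis of covariance, final weight on sex
# and initial weight
fit_ancova <- lm(final ~ sexo + inicio, data = datos)

print(coef(fit_change)["sexo"])
print(coef(fit_ancova)["sexo"])
```

Under this setup the coefficient for `sexo` should come out close to zero in the first model and close to 6 in the second, in line with the 5.98 reported above.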
For Imbens and Rubin (2015), both are right when it comes to describing the data, although both lack sound reasoning for establishing any kind of causality between the university diet and the loss or gain of weight among the students. Regardless, I still find more interesting the analysis that arises from comparing men and women who started with the same weight (i.e. all data restricted to x = (60, 70), y = (60, 70)).
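The restricted comparison can be illustrated with a short sketch. It is not the author's code: the seed and the names `overlap` and `mean_change` are assumptions, the data are simulated with the setup from the post's simulation section, and for simplicity the restriction is applied to the initial weight only.

```r
set.seed(123)  # assumed seed, only so that the sketch is reproducible

# Simulated data following the setup of the post's simulation section
N <- 1000
b <- 10
beta1 <- 0.4
Mujer1 <- runif(N, 50, 70)
Hombre1 <- Mujer1 + b
Mujer2 <- (1 - beta1) * mean(Mujer1) + beta1 * Mujer1 + rnorm(N, sd = 1)
Hombre2 <- (mean(Hombre1) - beta1 * (mean(Mujer1) + b)) +
  beta1 * Hombre1 + rnorm(N, sd = 1)

datos <- data.frame(
  inicio = c(Mujer1, Hombre1),
  final  = c(Mujer2, Hombre2),
  sexo   = rep(c(0, 1), each = N)   # 0 = woman, 1 = man
)
datos$dif <- datos$final - datos$inicio

# Keep only students whose initial weight lies in the overlapping range
overlap <- subset(datos, inicio >= 60 & inicio <= 70)

# Average weight change by sex within that range
mean_change <- aggregate(dif ~ sexo, data = overlap, FUN = mean)
print(mean_change)
```

In this range the women are the heavier half of their group and the men the lighter half of theirs, so the women's average change should be negative and the men's positive, even though neither group changes on average overall.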
Lord’s paradox summarizes the analysis of two statisticians who study the average weight of some students within a particular university. At the end of the semester (June), the average weight of the men is identical to their average weight at the beginning of the semester (January). The same holds for the women. The only difference is that the women started the year with a lower average weight (which is to be expected from their natural build). On average, neither men nor women gained or lost weight during the semester.
To perform the simulation, we assume that the final weight of both the men and the women follows a linear relationship with the initial weight. Thus, it is assumed that $y_{2i}^{M} = \beta_{0}^{M} + \beta_{1} y_{1i}^{M} + \varepsilon_{i}$ for the weight of women, and $y_{2i}^{H} = \beta_{0}^{H} + \beta_{1} y_{1i}^{H} + \varepsilon_{i}$ for the weight of men, where $y_{1i}^{M}$ denotes the weight of the i-th female at the beginning of the semester and $y_{2i}^{M}$ denotes the weight of the i-th female at the end of the semester. The notation for men (H) follows the same logic.
Now, note that because of their natural build, men must weigh more than women. Suppose that on average the weight of men is equal to that of women plus a constant $c$. In addition, the mean weight of each group is the same at both times. Then we have $\bar{y}^{M} = \beta_{0}^{M} + \beta_{1}\bar{y}^{M}$ and $\bar{y}^{H} = \beta_{0}^{H} + \beta_{1}\bar{y}^{H} = \beta_{0}^{H} + \beta_{1}(\bar{y}^{M} + c)$. Hence, after some algebra, we have $\beta_{0}^{M} = (1 - \beta_{1})\bar{y}^{M}$ and $\beta_{0}^{H} = \bar{y}^{H} - \beta_{1}(\bar{y}^{M} + c)$.
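A short worked step, added here as a sketch, shows how far apart the two intercepts are. Substituting the assumption that the men's mean equals the women's mean plus $c$ into the expression for the men's intercept gives:

```latex
\begin{align*}
\beta_{0}^{H} &= \bar{y}^{H} - \beta_{1}(\bar{y}^{M} + c)
               = (1 - \beta_{1})(\bar{y}^{M} + c), \\
\beta_{0}^{H} - \beta_{0}^{M} &= (1 - \beta_{1})(\bar{y}^{M} + c) - (1 - \beta_{1})\bar{y}^{M}
               = (1 - \beta_{1})\,c .
\end{align*}
```

Because both groups share the same slope, this gap is the vertical distance between the two regression lines, which is what the sex coefficient in the second statistician's analysis estimates. With the values used in the simulation code (slope 0.4, and the constant, named `b` there, set to 10), the gap is $(1 - 0.4) \times 10 = 6$, consistent with the 5.98 reported earlier.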
The following code simulates a data set that follows the relationship proposed by Lord:
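    N <- 1000                  # number of students in each group
    b <- 10                    # constant gap between men and women (c in the text)
    l <- 50                    # lower bound of the women's initial weight
    u <- 70                    # upper bound of the women's initial weight
    Mujer1 <- runif(N, l, u)   # women's initial weights
    Hombre1 <- Mujer1 + b      # men's initial weights
    beta1 <- 0.4               # common slope
    Mujerb0 <- (1 - beta1) * mean(Mujer1)                    # women's intercept
    Hombreb0 <- mean(Hombre1) - beta1 * (mean(Mujer1) + b)   # men's intercept
    sds <- 1                   # standard deviation of the noise
    Mujer2 <- Mujerb0 + beta1 * Mujer1 + rnorm(N, sd = sds)
    Hombre2 <- Hombreb0 + beta1 * Hombre1 + rnorm(N, sd = sds)

The graph can be done with the following piece of code:

    datos <- data.frame(inicio = c(Mujer1, Hombre1), final = c(Mujer2, Hombre2))
    datos$dif <- datos$final - datos$inicio
    datos$sexo <- c(rep(0, N), rep(1, N))   # 0 = woman, 1 = man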
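The code above builds the data frame but does not show the plotting call itself. A minimal base R sketch of a chart like the one described earlier (black 45-degree line, green points for men, red points for women) might look like this. The function name `draw_lord_plot`, the seed, the axis labels and the colours of the fitted lines are assumptions, not the author's code; the data are re-simulated so that the block runs on its own.

```r
set.seed(123)  # assumed seed, only so that the sketch is reproducible

# Simulated data following the setup above
N <- 1000
b <- 10
beta1 <- 0.4
Mujer1 <- runif(N, 50, 70)
Hombre1 <- Mujer1 + b
Mujer2 <- (1 - beta1) * mean(Mujer1) + beta1 * Mujer1 + rnorm(N, sd = 1)
Hombre2 <- (mean(Hombre1) - beta1 * (mean(Mujer1) + b)) +
  beta1 * Hombre1 + rnorm(N, sd = 1)

datos <- data.frame(
  inicio = c(Mujer1, Hombre1),
  final  = c(Mujer2, Hombre2),
  sexo   = rep(c(0, 1), each = N)   # 0 = woman, 1 = man
)

# Hypothetical helper: scatter plot with the 45-degree line and one
# fitted line per group; returns the two fits invisibly
draw_lord_plot <- function(datos) {
  cols <- ifelse(datos$sexo == 1, "green", "red")
  plot(datos$inicio, datos$final, col = cols, pch = 20, cex = 0.5,
       xlab = "Initial weight (January)", ylab = "Final weight (June)")
  abline(a = 0, b = 1, col = "black", lwd = 2)   # 45-degree line

  fit_women <- lm(final ~ inicio, data = datos, subset = sexo == 0)
  fit_men   <- lm(final ~ inicio, data = datos, subset = sexo == 1)
  abline(fit_women, col = "darkred", lwd = 2)
  abline(fit_men, col = "darkgreen", lwd = 2)

  invisible(list(women = fit_women, men = fit_men))
}
```

Calling `draw_lord_plot(datos)` draws the chart. The two fitted lines should cross the 45-degree line near the group means, around 60 for the women and 70 for the men, which is where the first statistician's reasoning is centred.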
This was a post from a blog that I found through the R Bloggers aggregator. The blog is called Data Literacy – the blog of Andrés Gutiérrez. The name of the blog caught my attention, as did its epigraph by Herbert G. Wells. The important points to mention here are:
- Data literacy is an important topic of study for which the available resources are cheap to access. Nevertheless, the learning is hard and the learning curve of the R programming language is a steep one.
- Learning statistics may be a nice way to learn how to think about the complex reality that surrounds us wherever we look. It is full of paradoxes such as the one presented in this post. But I think it is with this kind of hard reasoning and problem-solving mindset that we best prepare ourselves for any challenge.
- The open source community around data science and the R programming language may compensate for the shortcomings of our often contradictory human societies.
Farewell to the data science community around aggregators such as R Bloggers.
Body text images: Data Literacy – The blog of Andrés Gutiérrez
Featured Image: R Stats + Digital Analytics: 8 Blogs you should Follow