How Statistics Became a Model-blind Data-reduction Enterprise? Sewall Wright

Guinea Pigs

Wright studied genetics at Harvard. While he was working at the University of Chicago, he became interested in the inheritance of coat colour in guinea pigs. He found that it is nearly impossible to breed an all-white or all-coloured guinea pig, even in the most inbred families. This contradicted the prevailing prediction of the time, that a particular trait should become “fixed” after multiple generations of inbreeding.

To understand the effect of the different factors contributing to the coat colour of guinea pigs, he drew the following diagram:

Path Diagram

Each arrow points in one direction, from a factor to another factor that it causes. Omitting an arrow also conveys a significant assumption: that the two factors have no direct causal relationship. By solving some algebraic equations (according to the book), Wright managed to express the causal quantities in terms of correlations measured in the data. A cautious reader may object that Wright had assumed the causal diagram in the first place, and that this assumption is what made estimating the causal coefficients possible. Indeed, without some causal hypothesis, you cannot draw causal conclusions. The goal of causal analysis is not to prove that X is a cause of Y. That is the problem of causal discovery, which is far more difficult. In contrast, causal analysis answers causal queries given a causal hypothesis. Still remember the firing squad example? If we do not draw the causal diagram of the firing squad, we cannot answer those what-if questions.

Let’s do some real calculations on a simpler problem: how much will a guinea pig’s birth weight change if it spends one more day in the womb? We may compare the birth weights of guinea pigs that spent 66 days in the womb with those that spent 67 days. The guinea pigs that spent a day longer in the womb weighed an average of 5.66g more at birth. Does that mean a guinea pig grows at 5.66g per day before it is born? Not necessarily. Time is just one factor. A pup with more siblings will weigh less, and litter size is also related to time in the womb. How can we separate the two causes?

We want to estimate the direct effect of P on X, which is represented by the lowercase letter p. However, P is also connected to X by another, non-causal path, i.e. P <– L –> Q –> X. The total association of 5.66g per day is equal to p + (l * l’ * q). Since Q is unobserved, we cannot solve this single equation directly. However, the path diagram lets us write the other measured correlations in terms of the path coefficients as well: the correlation of (L,P) gives us l’, and the correlation of (L,X) gives us a further equation involving l * q. Solving these equations together, we obtain p.
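To make the algebra concrete, here is a minimal sketch in Python. It is not Wright's data or his exact calculation. It assumes standardized variables (so path coefficients and correlations are on the same scale rather than in grams per day), the usual path-tracing rules for this diagram, and made-up coefficient values. Reading L as litter size and Q as an unobserved growth rate is also an assumption for illustration, as are the names `simulate`, `corr` and `solve_paths`.

```python
import random

# Hypothetical standardized path coefficients, chosen only for illustration.
TRUE = {"l_prime": -0.5, "l": -0.6, "q": 0.5, "p": 0.4}

def simulate(n, seed=0):
    """Generate (L, P, X) from the assumed diagram; Q is generated but never recorded."""
    rng = random.Random(seed)
    lp, l, q, p = TRUE["l_prime"], TRUE["l"], TRUE["q"], TRUE["p"]
    # Noise scales are chosen so that every variable has unit variance.
    sd_p = (1 - lp ** 2) ** 0.5
    sd_q = (1 - l ** 2) ** 0.5
    sd_x = (1 - (p ** 2 + q ** 2 + 2 * p * q * lp * l)) ** 0.5
    L, P, X = [], [], []
    for _ in range(n):
        litter = rng.gauss(0, 1)                        # L
        gest = lp * litter + rng.gauss(0, sd_p)         # P <- L
        growth = l * litter + rng.gauss(0, sd_q)        # Q <- L (unobserved)
        weight = p * gest + q * growth + rng.gauss(0, sd_x)  # X <- P, Q
        L.append(litter)
        P.append(gest)
        X.append(weight)
    return L, P, X

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def solve_paths(r_lp, r_lx, r_px):
    """Solve the three path-tracing equations
         r_LP = l'
         r_LX = l*q + l'*p
         r_PX = p + l'*(l*q)
       for l', p and the product l*q."""
    l_prime = r_lp
    p = (r_px - r_lp * r_lx) / (1 - r_lp ** 2)
    lq = (r_lx - r_lp * r_px) / (1 - r_lp ** 2)
    return l_prime, p, lq

L, P, X = simulate(20000)
r_lp, r_lx, r_px = corr(L, P), corr(L, X), corr(P, X)
l_prime_hat, p_hat, lq_hat = solve_paths(r_lp, r_lx, r_px)

print(f"naive association corr(P, X) = {r_px:.2f}")
print(f"estimated p = {p_hat:.2f}, estimated l*q = {lq_hat:.2f}")
```

In this toy setup the naive association between P and X overstates the direct effect p, because part of it flows through the path P <– L –> Q –> X. Only L, P and X are used in the estimate, yet the diagram is enough to recover p without ever observing Q.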

Wright received some criticism of his method, for example, “the user has to have a hypothesis and must devise an appropriate diagram of multiple causal sequences.” R.A. Fisher said, “Statistics may be regarded as …the study of methods of the reduction of data.” But causal analysis is not just about data. We must incorporate some understanding of the process that produces the data. Karlin said, “Finally, and we think most fruitfully, one can adopt an essentially model-free approach, seeking to understand the data interactively by using a battery of displays, indices, and contrasts. This approach emphasizes the concept of robustness in interpreting results.” In other words, he is saying that the data already contain all the wisdom; they only need to be massaged.

In the next post, we will talk about Bayesian inference, Prior Belief + New Evidence –> Revised Belief, which gives us a way of combining the observed evidence with our prior knowledge. Does it solve the causation problem?
