DNA melting: Identifying the unknown sample

Overview

In the DNA lab, you had four samples. Each sample had a true melting temperature $ M_j $ (where $ j $ is an integer from 1 to 4). The instructors told you that the fourth sample was identical to one of the other three samples. Therefore, the unknown should have exactly the same melting temperature as sample 1, 2, or 3. Your job was to figure out which one matched the unknown.

Measurement procedure

Most groups measured each sample group in triplicate: $ N=3 $. (Some special students did something a little bit different.) This resulted in 12 observations, $ O_{j,i} $, where $ j $ is the sample group and $ i $ is the experimental trial number (an integer from 1 to 3). The majority of lab groups calculated the average melting temperature of each sample group, $ \bar{O}_j = \frac{1}{N}\sum_{i=1}^{N} O_{j,i} $, and guessed that sample 4 matched whichever of the known samples had the closest average melting temperature.

Seems reasonable.
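
For concreteness, here is a minimal MATLAB sketch of that procedure. The variable names and the melting temperatures in the 4x3 matrix observations are made up for illustration; row j holds the N = 3 trials for sample group j.

  % Made-up triplicate melting temperatures (degrees C), one row per sample group.
  observations = [ 84.2 84.9 84.5;     % sample 1
                   88.1 87.6 87.9;     % sample 2
                   81.3 81.0 81.7;     % sample 3
                   81.5 80.9 81.4 ];   % sample 4 (the unknown)

  groupMeans = mean( observations, 2 );    % average the three trials in each row
  [ ~, closestKnown ] = min( abs( groupMeans(1:3) - groupMeans(4) ) );
  fprintf( 'Sample 4 is closest in average melting temperature to sample %d\n', closestKnown );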

Measurement uncertainty

The observations included measurement error: $ O_{j,i}=M_j+E_{j,i} $. The presence of measurement error leads to the possibility that an unfortunate confluence of error terms could have caused you to misidentify the unknown sample. It’s not hard to imagine what factors tend to increase the likelihood of such an unfortunate fluke: the true means are close together, or the error terms are large. To get a handle on the possibility that your results were total crap due to bad luck alone (not incompetence), it is necessary to have some kind of model for the distribution of the error terms. How about this? The error terms are normally distributed with mean $ \mu=0 $ and standard deviation $ \sigma $. (Note that this model assumes the same error distribution for every sample group.)
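
If it helps to see the model written out, here is a minimal MATLAB sketch of it. The true melting temperatures in M and the value of sigma are invented purely for illustration.

  % Simulate O(j,i) = M(j) + E(j,i) with normally distributed error terms.
  M     = [ 84.5; 87.9; 81.3; 81.3 ];   % made-up true melting temperatures (sample 4 matches sample 3)
  sigma = 0.5;                          % made-up error standard deviation, identical for every group
  N     = 3;                            % trials per sample group

  E = sigma * randn( 4, N );            % E(j,i) ~ Normal( 0, sigma^2 )
  O = repmat( M, 1, N ) + E;            % one simulated set of 12 observations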

Within the confines of this model, it is possible to estimate the chance that your result was a fluke. There are 6 possible pairwise hypotheses to test (the sketch after the list writes out the same six pairs in code):

  1. $ M_4\stackrel{?}{=}M_1 $
  2. $ M_4\stackrel{?}{=}M_2 $
  3. $ M_4\stackrel{?}{=}M_3 $
  4. $ M_1\stackrel{?}{=}M_2 $
  5. $ M_1\stackrel{?}{=}M_3 $
  6. $ M_2\stackrel{?}{=}M_3 $
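
Written out in code, the comparisons are just the six unordered pairs of sample groups. The pairs matrix below is only a bookkeeping device; a later sketch reuses it.

  % The six null hypotheses from the list above, one row per comparison.
  pairs = [ 4 1;     % 1. M4 ?= M1
            4 2;     % 2. M4 ?= M2
            4 3;     % 3. M4 ?= M3
            1 2;     % 4. M1 ?= M2
            1 3;     % 5. M1 ?= M3
            2 3 ];   % 6. M2 ?= M3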

Evaluating the hypotheses

Student’s t-test offers a method for assigning a numerical degree of confidence to each null hypothesis. Essentially, the test considers the entire universe of possible outcomes of your experimental study. Imagine that you repeated the study an infinite number of times. (This may not be hard for you to imagine.) Repeating the study ad infinitum would elicit all possible outcomes of the error terms $ E_{j,i} $. The t-test categorizes each outcome into one of two realms: those that are more favorable to the null hypothesis than the result you got (i.e., the two sample means in question fall closer together than yours did), and those that are less favorable (the two sample means fall farther apart than yours did).

The t-test can be summarized by a p-value, which is equal to the percentage of possible outcomes that are less favorable to the null hypothesis than the result you got. A low p-value means that there are relatively few possible results less favorable to the null hypothesis than the one you got and many results that are more favorable, so it is probably reasonable to reject the null hypothesis. Rejecting the null hypothesis amounts to concluding that the two melting temperatures are likely not the same.
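
MATLAB's ttest2 function (in the Statistics and Machine Learning Toolbox) computes this p-value for a pair of sample groups. A minimal sketch, using made-up triplicate measurements for hypothesis 3:

  % Two-sample t-test of null hypothesis 3 ( M4 ?= M3 ) on made-up data.
  sample3 = [ 81.3 81.0 81.7 ];   % triplicate melting temperatures for sample 3
  sample4 = [ 81.5 80.9 81.4 ];   % triplicate melting temperatures for the unknown

  [ h, p ] = ttest2( sample3, sample4 );   % h = 1 means "reject at the default 5% level"
  fprintf( 'p-value for hypothesis 3: %.2f\n', p );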

In most circumstances, the experimenter chooses a significance level, such as 10%, 5%, or 1%, in advance of examining the data; a null hypothesis is rejected whenever its p-value falls below the significance level. Another way to think of this: if you chose a significance level of 5%, the null hypothesis were actually true, and you repeated the study 100 times, you would expect to incorrectly reject the null hypothesis because of bad luck on about 5 occasions.
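
Putting the pieces together, here is a sketch that runs all six uncorrected t-tests against a significance level chosen in advance. The data are the same made-up numbers as before; note that this does not yet deal with the multiple-comparison issue discussed below.

  % Run all six pairwise t-tests at a significance level chosen in advance.
  observations = [ 84.2 84.9 84.5;     % sample 1 (made-up data)
                   88.1 87.6 87.9;     % sample 2
                   81.3 81.0 81.7;     % sample 3
                   81.5 80.9 81.4 ];   % sample 4 (the unknown)
  pairs = [ 4 1; 4 2; 4 3; 1 2; 1 3; 2 3 ];   % hypotheses 1-6 from the list above
  alpha = 0.05;                                % significance level chosen in advance

  pValues = zeros( 6, 1 );
  for k = 1:6
      [ ~, pValues(k) ] = ttest2( observations( pairs(k,1), : ), ...
                                  observations( pairs(k,2), : ) );
  end
  rejected = pValues < alpha;   % logical vector: which null hypotheses were rejected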

Which unknown?

If all went well, exactly two of the first three null hypotheses were rejected and null hypotheses 4-6 were all rejected. For example, if the unknown was the same as sample 3, null hypotheses 1, 2, 4, 5, and 6 ought to have been rejected.
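
The bookkeeping that turns a rejection pattern into an identification might look like the sketch below. The logical vector rejected is filled in by hand here to represent the sample-3 example; in practice it would come from the tests themselves.

  % Hypothetical rejection pattern: hypotheses 1, 2, 4, 5, and 6 rejected, hypothesis 3 not.
  rejected = logical( [ 1 1 0 1 1 1 ] );

  if all( rejected(4:6) ) && sum( rejected(1:3) ) == 2
      matchedSample = find( ~rejected(1:3) );
      fprintf( 'The unknown appears to match sample %d\n', matchedSample );
  else
      disp( 'The rejection pattern does not uniquely identify the unknown.' );
  end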

A problem comes up when you compare multiple means using the t-test. For example, if you chose a significance level of 5% for each t-test, there could be nearly a 30% chance (about 26%, if the six tests were independent and all of the null hypotheses were true) of incorrectly rejecting at least one null hypothesis somewhere among the 6 comparisons. The multcompare function in MATLAB implements a correction to the t-test procedure to ensure that the family-wise error rate (FWER) is less than the significance level. In other words, the chance of any of the six hypotheses being incorrectly rejected due to bad luck is kept below the value you set for the FWER. There is an optional argument to multcompare that lets you set the FWER.
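
A sketch of that approach is shown below, again using the made-up observations matrix. anova1 collects the group statistics that multcompare needs, and the 'Alpha' argument sets the family-wise error rate.

  % Compare all pairs of sample groups with a family-wise error rate of 5%.
  observations = [ 84.2 84.9 84.5;     % sample 1 (made-up data)
                   88.1 87.6 87.9;     % sample 2
                   81.3 81.0 81.7;     % sample 3
                   81.5 80.9 81.4 ];   % sample 4 (the unknown)
  meltingTemps = observations(:);            % all 12 observations as a column vector
  group        = repmat( (1:4)', 3, 1 );     % sample-group label for each observation

  [ ~, ~, stats ] = anova1( meltingTemps, group, 'off' );   % 'off' suppresses the ANOVA display
  comparisons     = multcompare( stats, 'Alpha', 0.05 );    % one row per pairwise comparison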

Some people argued that only the first three hypotheses are relevant to the question at hand. Maybe. But imagine what the defense counsel would say if it turned out that only hypotheses 1 and 2 were rejected and hypotheses 4-6 were not. You might be able to testify that the murderer must be suspect number 3. But the defense would argue, "How can you say it's number 3 if you can't even tell suspect 3 from 1 or 2?" It would be unconvincing to present a result that implicated suspect 3 unless 4 or 5 of the hypotheses were rejected (numbers 1, 2, 5, and 6, with number 4 rejected for extra awesomeness).

A few people also argued that the p-value of a single hypothesis test is a good measure of confidence. In the example above where the unknown was sample 3, the confidence measure would be the p-value associated with hypothesis 3. It is important to keep in mind that the p-value only characterizes how likely it is that a particular result was obtained by accident. It does not compare the likelihood of one result to another. Consider the two situations summarized below. In both cases, the significance level was chosen to be 5%.

  1. hypothesis 2 has a p-value of 4.9% (so it is rejected) and hypothesis 3 has a p-value of 5.1% (not rejected)
  2. hypothesis 2 has a p-value of 0.049% (rejected) and hypothesis 3 has a p-value of 5.1% (not rejected)

Clearly, there is a much smaller chance that the sample was erroneously identified in the second case, even though the p-values for hypothesis 3 are identical. (Possibly the means of samples 2 and 3 are farther apart in the second case.)

If you used multcompare in your analysis, a good measure of confidence is the FWER that you chose. Since not all of the hypotheses may be required to uniquely identify the unknown, the FWER is slightly conservative. So what. You required more things to be true than you strictly needed to. But it is likely that you would have gained very little by removing the unnecessary hypothesis from consideration. It is even more likely that the error terms did not perfectly satisfy the assumptions of the test, so your calculation is at best an approximation of the probability of this type of error.

You can probably think of ways the simple error model we relied on might be deficient. For example, there is no attempt to include any kind of systematic error. If there were significant systematic error sources in your experiment, your estimate of the chance that an unlucky accident happened may be very far from the truth. Because most real experiments do not perfectly satisfy the assumptions of the test, it is usually ridiculous to report an extremely small p-value. (That doesn't stop people from doing it, though.)