Difference between revisions of "DNA Melting: Processing DNA Melting Data"

Revision as of 17:16, 28 August 2011

20.309: Biological Instrumentation and Measurement

Overview

Fantastic. Now you have some simulated data or real data. It's time to estimate $ \scriptstyle \Delta S^{\circ} $, $ \scriptstyle \Delta H^{\circ} $, and the melting temperature. Nonlinear regression is one approach to extracting these values.

Most students use Matlab to analyze their raw data. Several scholars in the past elected to use Python, instead. If you choose this path, be aware that the wiki pages for Python data analysis are significantly less developed than the corresponding Matlab cribs. Nonlinear fitting in Python has been challenging. If you decide to use Python, consult and add your thoughts to this page: DNA melting data analysis in Python.

In outline, the data analysis steps are:

Write a function that models DNA melting
Load data file produced by LabVIEW VI or simulation
Provide initial guesses for model parameters and run the regression
Normalize results and convert to correct units
Estimate the melting temperature
Create plots

The model function is essentially the inverse of simulating DNA melting. The arguments to the function should include a vector of all the model parameters and explanatory variables. Make the signature of the function should match the requirements of the curve fitter you plan to use, for example : lsqcurvefit in Matlab.

Organization of data analysis script

Since you will be analyzing several experimental runs in a similar way, it makes sense to save the particulars of each experimental run (such as the name of the file that the data is stored in and the KCl concentration) in an array of data structures. Matlab implements a data type called a cellular array. Cellular arrays use pointy braces {} instead of parenthesis (). If foo is a cellular array, then foo{1} is the first element.

The following code demonstrates how to create the cellular array of data structures. The code fills in the first element of a cellular array (DnaMatchSampleInfo):

DnaMatchSampleInfo = {}  % null cellular array

DnaMatchSampleInfo{1}.filename = 'DNA Melting Data\20bp 100mM.txt';
DnaMatchSampleInfo{1}.SampleName = '20 bp 100mM';
DnaMatchSampleInfo{1}.KclConcentration = 100;
DnaMatchSampleInfo{1}.DnaConcentration = 30;
DnaMatchSampleInfo{1}.FilterNormalizedCutoffFrequency = 0.1;
DnaMatchSampleInfo{1}.FilterOrder = 25;
DnaMatchSampleInfo{1}.NormalizationMin = 0.01;
DnaMatchSampleInfo{1}.NormalizationMax = 0.99;
DnaMatchSampleInfo{1}.TrimStart = 0;
DnaMatchSampleInfo{1}.TrimEnd = 0;

... fill in more entries here ...

After initializing DnaMatchSampleInfo, each element will contain a data structure with the parameters of a particular experimental run. You can create additional cellular arrays for different sets of experiments. For example, DnaLengthSampleInfo might contain parameters for the 20, 30, and 40 base pair samples.

Using the information in the cellular arrays, it is easy to write a for loop that will process all of the data for a particular set of experiments. The function (called DnaMelt might look something like this:

% Function to process DNA melting data from LabVIEW VI
% sampleInfo is a cellular array of data structures
% output is a cellular array containing original data and computed values
function out = DnaMelt(sampleInfo)

% initialize variables
out = {};

for ii=1:length(sampleInfo)

    ... process the data ...

end

... process/plot specific results about whole data set...

Loading a data file

Use the load command to read in a data file.

    rawData = load(sampleInfo{ii}.filename);

The first column contains RTD voltage samples and the second contains the corresponding fluorescence voltages.

Converting voltages to temperature and relative fluorescence

These are straightforward mathematical manipluations on the dataset using the properties of the RTD and your normalized/corrected/adjusted fluorescence signal with the equation for dsDNA fraction.

Designing an FIR low pass filter

The matlab commands fir1 and freqz are useful for designing a low pass filter. Use fir1 to generate the filter kernel and freqz to plot its frequency response. Increasing the order of the filter will make the transition between pass and stop bands sharper.

Try plotting a few different filter designs:

freqz(fir1(30, 0.15))
freqz(fir1(300, 0.15))

Applying the FIR filter

Use the conv command to apply the filter. But be careful how you treat the samples near the beginning and end of the signal. conv returns a vector of length m + n - 1, where m and n are the lenghts of the two operands. conv will assume that all values outside of the defined signal are zero. This will distort your signal near the beginning and end. You can handle this by pre-padding your data with made-up data (usually the initial and final values) or just chopping off the extra samples.

The following functions may be of use:

% usage: ConvolveAndClip(kernel, data)
% convolves <kernel> with <data> and trims the ends of the result to length
% length(data) - length(kernel)
function out=ConvolveAndClip(kernel, data)

temp = conv(kernel, data);
trim = length(kernel);
out = temp(trim:end-trim);

% usage: PadAndConvolve(kernel, data)
% Pads <data> with initial and final values, convolves <data> with 
% <kernel>, and trims the result to the same length as <data>
function out=PadAndConvolve(kernel, data)

frontPaddingLength = floor(length(kernel)/2);
rearPaddingLength = ceil(length(kernel)/2);
frontPadding = data(1) * ones(frontPaddingLength, 1);
rearPadding = data(end) * ones(rearPaddingLength, 1);

s = size(data);

if(s(1) > s(2))
    paddedData = [frontPadding;data;rearPadding];
else
    paddedData = [frontPadding' data rearPadding'];
end

out = ConvolveAndClip(kernel, paddedData);

Ensuring that F(Temperature) is a function

Because there is noise in the temperature readout, your raw data is not guaranteed to be a function. You will run into all sorts of trouble taking the finite difference later on if your data is not a function. This can be solved by either sorting the data by temperature (using Matlab's sort command) or binning the samples by temperature (by iterating through the data with a for loop.)

Each approach has merits and disadvantages. Sorting by temperature can result in some very small ΔT values, which tend to be very noisy. Binning has the advantage of resulting in a uniformly sampled dataset — provided that there is at least one sample in each bin.

Computing the finite difference

The diff command will compute the finite difference of a discrete-time signal, which is an approximation of a continuous derivative. Remember that you need to compute ΔFluorescence/ΔTemperature, not ΔFluorescence/ΔTime. Therefore, you must divide by ΔTemperature.

The differencing operation is particularly sensitive to noise. (What is the frequency response of a differencer?) If your derivative plots are noisy, you may be able to improve them by applying additional filtering.

One estimate of T_m is the peak value of the numerical derivative.

Fitting the model

In the "Simulating DNA Melting" tutorial for Matlab or Python, there is a function for computing the theoretical value of dsDNA concentration as a function of DNA concentration, temperature, ΔH° and ΔS°.

To perform the fit, use the matlab function from the tutorial. You can use Matlab's curve fitter (lsqcurvefit) to estimate best-fit values for ΔH° and ΔS°.

lsqcurvefit requires the fitting function to be implemented a particular way. You may find the following code excerpt, which declares a suitable function called myFunction, useful. It also computes the R squared value for the fit.

   % Create user function for fitting
    DnaConcentration = sampleInfo{ii}.DnaConcentration;
    myFunction = @ (x, xdata)  DnaFraction(DnaConcentration, xdata, x(1), x(2));

    % Fit data to model
    fitOptions = optimset('TolFun',1E-30,'TolX',1E-10,'MaxFunEvals',1E4,'MaxIter',1E4);
    [fitValues, resnorm] = lsqcurvefit(myFunction, [-150 -71E3], temperature, ...
                    fluorescence, [-inf -inf], [inf inf], fitOptions);
    rSquared = 1 - resnorm / norm(fluorescence - mean(fluorescence))^2;

Plot legends in Matlab

Annoyingly, Matlab handles legends differently than other plot commands. The hold command does not apply to the legend command. (Direct complaints to http://www.mathworks.com/support.) The following code excerpt adds a plot legend. This code runs after the main loop. The cellular array out contains the results from processing. This code also places an "X" at the melting temperature.

% Compute text cell array for plot legends and plot an
% "X" at the estimated melting temperature

for ii=1:length(out)
    legendText{2 * ii - 1} = sampleInfo{ii}.SampleName;
    legendText{2 * ii} = [sampleInfo{ii}.SampleName ' Best Fit'];
    figure(1)
    plot(out{ii}.maxDerivativeTemperature, out{ii}.fluorescence(out{ii}.maxDerivativeIndex),...
        'linewidth',2,'marker','x','markersize',18,'color',out{ii}.plotColor);
    plot(out{ii}.maxModelDerivativeTemperature, out{ii}.fitFluorescence(...
        out{ii}.maxModelDerivativeIndex), 'linewidth',2,'marker','x','markersize',18,'color',...
        out{ii}.modelPlotColor);
    figure(2)
    plot(out{ii}.maxDerivativeTemperature, out{ii}.dFdT(out{ii}.maxDerivativeIndex),...
        'linewidth',2,'marker','x','markersize',18,'color',out{ii}.plotColor);
    plot(out{ii}.maxModelDerivativeTemperature, out{ii}.dModelFdT(...
        out{ii}.maxModelDerivativeIndex), 'linewidth',2,'marker','x','markersize',18,'color',...
        out{ii}.modelPlotColor);
end

% Add the legends
figure(1)
legend(legendText, 'location', 'southwest');
hold off;
figure(2)
legend(legendText, 'location', 'southwest');
hold off;

Report Requirements

Report requirements will be posted separately.

@@ Line 5: / Line 5: @@
 ==Overview==
-Fantastic. Now that you have some [[Matlab:Simulating DNA melting|simulated data]] or [[Lab Manual:Measuring DNA Melting Curves|real data]], it's time to estimate <math>\scriptstyle \Delta S^{\circ}</math>, <math>\scriptstyle \Delta H^{\circ}</math>, and the melting temperature. Nonlinear regression is one approach to extracting these values.
+Fantastic. Now you have some [[Matlab:Simulating DNA melting|simulated data]] or [[Lab Manual:Measuring DNA Melting Curves|real data]]. It's time to estimate <math>\scriptstyle \Delta S^{\circ}</math>, <math>\scriptstyle \Delta H^{\circ}</math>, and the melting temperature. Nonlinear regression is one approach to extracting these values.
 Most students use Matlab to analyze their raw data. Several scholars in the past elected to use Python, instead. If you choose this path, be aware that the wiki pages for Python data analysis are significantly less developed than the corresponding Matlab cribs. Nonlinear fitting in Python has been challenging. If you decide to use Python, consult and add your thoughts to this page: [[Python:DNA melting data analysis |DNA melting data analysis in Python]].

Difference between revisions of "DNA Melting: Processing DNA Melting Data"

Revision as of 17:16, 28 August 2011

Overview

Organization of data analysis script

Loading a data file

Converting voltages to temperature and relative fluorescence

Designing an FIR low pass filter

Applying the FIR filter

Ensuring that F(Temperature) is a function

Computing the finite difference

Fitting the model

Plot legends in Matlab

Report Requirements

Navigation menu

Personal tools

Namespaces

Variants

Views

Actions

Search

Navigation

Tools