The study of intracellular processes using graph theory -

This workshop is a MAA Summer Research program for small groups of under-represented minority students. The workshop director is a recipient of a grant funded by NSF and NSA under the MAA National Research Experiences for Undergraduates Program (NREUP).

 

Students selected to participate in this program will receive the following:

·         Six weeks room and board on the ETSU campus. (July 5th- August 13th )

·         A $3400 stipend

·         Travel to Oak Ridge National Laboratory for an end-of-workshop field trip.

·         A fascinating hands-on exploration into the inner workings of the living cell - - mathematics will never be the same!!

 

To apply, return to Debra Knisley’s home web page and click on the application form link. You must qualify as an under-represented minority in the mathematical sciences (African-American or Hispanic) to apply. A description of the projects is given below.

 

Projects:

The Protein Folding Problem:  How does a linear polymer of amino acids assemble into a three-dimensional object capable of executing a precise chemical function? Within each cell, the DNA code is read and the instructions it provides tells the cell how to build a particular protein for a particular function. The proteins are constructed by the head-to-tail joining of amino acids chosen from a 20-letter alphabet. The 20 natural amino acids have a common backbone, but a variable side chain. As they grow in length, they begin to fold and it is this folding property that controls its function. These proteins can be modeled as simple graphs whose vertices can be labeled from the 20- letter alphabet. The graph can then be studied in a purely graph-theoretical way to determine such properties as diameter, colorability, crossing number and toughness. Is it possible that the proteins fold in such a way to preserve certain graphical characteristics? Online tutorials such as those found in MathMol 1 will be used to supplement instruction. MathMol is  being developed at the NYU/ACF Scientific Visualization Center in association with the NYU Collaborative for Excellence in Teacher Preparation and the YMCA Beacon Technology Center .

1. http://www.nyu.edu/pages/mathmol

2. http://folding.stanford.edu/education/protein.html

3. http://grail.lsd.ornl.gov/structure/resource/tutor.html

 

DNA Graphs:  DNA is responsible for building, at the right time and in the right places, the proteins, which enable life. The sequence of the bases in DNA determines the characteristics of the protein it builds. Even though the human genome project is all but complete, the rise of genomic medicine, which entails large-scale genetic sequencing of an individual or group of individuals, will ensure continued demand for efficient DNA sequencing techniques. The DNA sequencing problem stated mathematically is: Given the list of all subsequences of fixed length n (equal to the length of the substring given by a probe), which are present in R , and the number of occurrences of each subsequence, determine R. This reconstruction problem can now be formulated in terms of directed graphs since the ends of DNA are distinguished and therefore giving the sequence direction. From the manner in which the graph is constructed, the sequence of the target fragment corresponds to a Hamiltonian path in the graph.1,2 Generalizing the concept of DNA graphs, we consider the (nondirected) alphabet-overlap graph. In the alphabet-overlap graph, the kn vertices are labeled with the sequences of length n from an alphabet of size k. Two vertices are adjacent if their corresponding tags (specified subsequence) are the same. What size tag guarantees these graphs are Hamiltonian? What is their chromatic number and why do we care? 3,4,5. To supplement the biology component, we will use modules from the Bioinfomatics and the Human Genome Project such as below.

 

BSCS, with the support of the Department of Energy, has developed a new curriculum for high school biology that explores how scientists extract useful information from the Human Genome Project. The curriculum, BSCS's fifth module related to the Human Genome Project, includes background information for teachers and five classroom lessons. The lessons use both print and Web-based activities to help students learn how computers are used to assemble DNA sequences, locate genes, and obtain clues about gene functions. In this context, the ethical, social, and legal implications of genetic databases and informed consent are considered. 3

1.        Applications of Graph Theory in DNA Sequencing by Hybridization, P. Adams, D. Bryant, S. Barnes, Bulletin of the Institute of Combinatorics and its Applications, Vol.31 (2001).

2.        DNA Physical Mapping and Alternating Eulerian Cycles in Colored Graphs, P. Pevzner Algorithmica, Vol.13 (1995)

3.   http://biogeometry.cs.duke.edu/education/index.html

4.   http://archives.math.utk.edu/mathbio/molecularbiology.html

5.    http://www.ornl.gov/TechResources/Human_Genome/education

 

Graph Theoretic Techniques in Computational Molecular Biology:   The vast amounts of data stemming from both the Human Genome Project and the follow-up project of predicting protein folds (Proteomics) require faster and better algorithms. Algorithms whose designs depend on an underlying graphical structure are called graph algorithms. Graph algorithms are frequently used in computational biology.  An initial step in the analysis of gene expression is the identification of groups of genes that exhibit similar expression. This is known as the algorithmic problem of clustering. The main feature of the problem is how to determine the correct clustering or grouping of the genes so that genes in the same cluster are highly similar to each other. 1 One approach is to associate each genes expression pattern with a vertex and connect all vertices with an edge. Then label the edges according to a distance function based on similarity. We now find the minimum spanning tree of the labeled complete graph. The clustering problem is now a subtree problem. 2 Given the complete labeled graph, and the corresponding minimum spanning tree, is there a better approach to determine the clusters that does not use the subtree approach?  For instance, the line graph of the minimum spanning tree is a collection of cliques. Is there a correlation between the cliques and the clusters?  A similar question can be posed regarding other graph-theoretic based algorithms. The students will have an opportunity to discuss their ideas directly with the proteomics group at Oak Ridge.