Welcome to the IQB 2005 Summer program!

The IQB summer program is supported by the National Science Foundation (NSF DUE-0337406)

Math Quantitative Biology Biology

 

Protein folding group

    Led by: Debra Knisley and Celia McIntosh

 
 
 
     

 

Graph Theory Based Research of UDP-Glucosyltransferases

Holly Hicks

 Our objective was to create a program that can differentiate between families of glucosyltransferases using the mathematics of graph theory. The research involved creating a modeling system for the folding of each protein and analyzing the resulting graphical model for characteristic invariants. The graphical invariants are used to train a neural network, a computer program designed to recognize the characteristic invariants of each protein and to classify the proteins fed to the system as either a glucosyltransferase acting on flavonoids or a glucosyltransferase acting on non-flavonoids. To reach this goal, we created either weighted or non-weighted variations of the graph theory based models, each either containing or excluding the plant secondary product glucosyltransferase (PSPG) box. After using Maple to extract sets of numerical invariants from the models, we used all four forms of the modeling system to train the neural network. This improved our ability to critique and analyze the findings of each run of the neural network and increased the chance of distinguishing between flavonoid and non-flavonoid glucosyltransferases.

 

Protein Folding

Brad Wild

This summer, we had four full-time students and one part-time student in our group. Patricia Carey, Holly Hicks, Daniel Lamb, and Brad Wild were the full-timers. Shannon McConnell was the part-timer. We were supplied thirteen proteins by the biologists, Holly and Shannon. Eleven of these proteins are in group A, and the other two are in group B. Our goal was to use graph theory to distinguish between the two groups, as follows.

            First, we submitted the thirteen proteins to PHYRE, a protein fold recognition server. PHYRE gave us a prediction of what each protein looks like after it has folded. We had to use PHYRE because the actual folded structures are not known. Once we had an idea of what the proteins look like, the two mathematicians, Daniel and Brad, set about creating graphical representations of the proteins using a branch of mathematics called graph theory. Graphs are made of vertices (or nodes or dots) and edges (or relations or lines). Once each protein had a graph, the mathematicians obtained invariants from the graphs. Invariants are simply numerical quantities that describe some of a graph's properties. One easy example is the number of vertices and edges, which gives an idea of how big and how dense the graph is. After the mathematicians gathered the invariants, the invariants were given to Patricia to be analyzed. Patricia, with the help of Rhydon Jackson, used a neural network (an artificial brain) to search the invariants for patterns, in hopes of using these patterns to distinguish group A from group B. However, we were not able to separate the two groups as well as hoped. While we did make much progress, we will need more time to fully achieve our goal.
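As a rough illustration of what "obtaining invariants" means, the sketch below computes a few simple invariants (vertex count, edge count, density, degree sequence) from an edge list. It is written in Python for convenience and is not the group's actual procedure; the function name `graph_invariants` and the toy graph, whose vertices stand for secondary-structure elements, are made up for this example.

```python
from collections import defaultdict


def graph_invariants(edges):
    """Return simple invariants of an undirected simple graph given as an edge list.

    Vertices that appear in no edge cannot be represented in an edge list,
    so they are not counted here.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency[u].add(v)
        adjacency[v].add(u)

    n = len(adjacency)
    # Each edge is seen from both of its endpoints, hence the division by 2.
    m = sum(len(neighbours) for neighbours in adjacency.values()) // 2
    degrees = sorted((len(nb) for nb in adjacency.values()), reverse=True)
    # Density: fraction of all possible vertex pairs that are joined by an edge.
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    return {
        "vertices": n,
        "edges": m,
        "density": density,
        "degree_sequence": degrees,
    }


# Hypothetical toy graph; the labels are illustrative only.
toy_edges = [
    ("helix1", "sheet1"),
    ("sheet1", "loop1"),
    ("loop1", "helix2"),
    ("helix2", "helix1"),
    ("sheet1", "helix2"),
]
inv = graph_invariants(toy_edges)
print(inv)
```

A vector of such numbers, one vector per protein, is the kind of input that could then be handed to a classifier such as a neural network.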

            Once our goal has been reached, there are a number of immediate applications. We can look at proteins that we are not sure belong to group A or group B and see where we think they should go. Also, if we know that a protein is in group A but our method claims it is not, we can begin to question the accuracy of PHYRE. As such our method may help in protein fold prediction.

 

Protein Folding

Daniel Lamb 

            This summer, we modeled two different types of proteins that are contained in plants by using Graph Theory, a discrete branch of mathematics.  The physical shapes of these proteins are not known, so modeling them proved to be theoretical and abstract.  After making the graphs, we investigated their properties in hopes of finding a way to distinguish between the two types of proteins.  We used an artificial neural network, a self-trainable piece of computer software, in an attempt to find hidden patterns in the graph properties.

            The protein folding group was split into two teams, with Holly Hicks, Rhydon Jackson, and Brad Wild in one and Patricia Carey, Shannon McConnell and myself in the other.  The groups had the same goal, but two different mathematical approaches were taken.  Drs. Debra Knisley and Celia McIntosh oversaw both groups.  In my group, Shannon helped explain the biology and worked with me to develop a method for creating the graphs based on the protein recognized as the closest match to the thirteen in our study.  Patricia Carey was instrumental in analyzing the data with the artificial neural network, which she designed.  My work included making the models for all thirteen proteins and pulling the graphical properties out of these models.

            The results of this project are important because they put us one step closer towards being able to characterize proteins without needing to know their three dimensional shape.  With more work, our methods could be developed to determine functionality of proteins and help with the synthesis of enzymes for drug design.  The next stage of our work will be with the neural network, where the results of this project will be analyzed more thoroughly.  

 

Protein Folding

Shannon McConnell

As a member of the Protein Folding Group I have worked with students and professors from different fields of study.  It has been a very informative and interesting experience in which I hope to have played a beneficial role.  The combination of biology, math, and computer science students and professors gave an important edge to our research, since we had a wealth of information that gave everyone an invaluable understanding of all aspects of the project.  The biology student was responsible for supplying the study protein sequences (glucosyltransferases) and an understanding of the basic structures and functions of flavonoids, proteins and enzymes.  The math student was responsible for inputting the data into a program that would graphically analyze it, producing a database of information that could then be used by the computer science student.  The computer science student entered the data into a neural network that would look for patterns and give the final results of the structural analysis.

Initially I was responsible for acquiring glucosyltransferase sequences that would be used to study structural patterns in proteins/enzymes of known specificity.  This would allow us to compile a database of glucosyltransferases with specific patterns in structure as produced through graphical analysis.  Once I was involved with the group more, I helped explain the structural properties and characteristics of the study substrates and enzyme sequences, specifically flavonoids and glucosyltransferases.  I also contributed a general explanation of protein and enzyme structure and function.  Later in our research we began to analyze each protein/enzyme in order to differentiate the secondary structures such as the alpha helix, the beta sheet, and the loop.  After organizing the data and identifying the different secondary structures, we entered the data into a program that analyzed and graphically displayed similarities in structure between our thirteen known sequences of glucosyltransferases.  Overall our research project showed conclusive results that expressed the significance and potential of identifying glucosyltransferases using graphical analyses, i.e. graph theory.

 

 

Complexity group

  Led by: Steve Karsai and Jeff Knisley

   

Videos:

mike1.AVI

dmitry1.AVI

Bottom up approach of division of labor

Mike Phillips

Division of labor is one of the most commonly studied colony-level behaviors in social insects and is a prime example of self-organization in biology.  Our project looks at one of the most central questions of self-organization:  How can social wasps, with only a few individual behaviors and no global intelligence, build complex nest structures?  We propose an agent-based model that demonstrates that this behavior can be explained by a simple division of labor that emerges as a result of local interactions between the wasps on the nest.  We also show that water is not only an important building material, but also the prime regulator of the system.

 This project is a continuation from Summer 2004 when we created a basic model and coded a simulation to test its predictions.  This summer was primarily focused on simplifying and fine-tuning the model, data collection, and data analysis.  For the first several weeks we tested the simulation to look for bugs and any parameters that could be simplified or eliminated.  After the group determined changes that could benefit the model I implemented the changes in our simulation code.  This was an iterative process of testing and re-coding that took up about the first third of the summer.

 Once satisfied that our simulation was error-free and that our model was as simple and effective as we could make it, we turned our attention to running simulation experiments.  To do this we used a separate helper program I wrote that analyzed the data files from parallel runs of the simulation.  We designed experiments that would test the dependency of the model on its parameters and initial conditions, as well as experiments that replicate perturbation experiments previously conducted on real-world wasp colonies (Karsai and Wenzel, 2000) and in a previous ODE model (Karsai and Balazsi, 2002).

 As the simulation data were being collected and analyzed, we also began writing a paper on our results that we will soon publish.  The final result of our project is that this new model accurately reproduces the behavior observed in real-world colonies (Karsai and Wenzel, 2000) and confirms the predictions of the ODE model (Karsai and Balazsi, 2002), even though it is based on a completely different modeling framework.

 

Computational approach of division of labor

Dmitry Yampolsky

            By surpassing many linear methods for interpreting variable correlations, neural networks present the rare opportunity of predicting not only the outcome of a natural process but also the detail and complexity of its steps. Unfortunately a model is still only as precise as the theory behind it. Using a variety of approaches narrows the gap between the theory and the fact.

            To explore a variation on an emergent behavior model, and to demonstrate the danger of neglecting variables during coding, a self-organizing swarm model, whose units act according to only a local range of input, was constructed. The model was written in platform-independent C code so that it is easy to customize and modify, and it was inspired partly by a previous and concurrent IQB project which portrays wasp colony behavior. Graphic output was used to make the results easier to interpret.

            Results of the positioning behavior version of the model showed a swarming self-organization which can occur independently of an insect's initial position, experience or the density of its surroundings. Moreover, global order is attained thanks to a reaction to local conditions that is independent and customizable in each individual. The variable that unexpectedly turned up as significant is the position of an individual in the sequence of moves made. When positions can be filled by only one unit at a time, the one which gets to move first is better off. Such experience affects future behavior of the individual and the model as a whole, leading to an orderly arrangement not conceived at the time of the model's initial planning.
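The effect of move order can be seen even in a very small toy setting. The sketch below is a hypothetical Python illustration, not the C model described above: agents take turns in a fixed sequence, each cell holds only one agent, and every agent prefers the free cell nearest the centre. The function name `settle` and the centre-seeking preference are assumptions made for this example.

```python
def settle(num_agents, num_cells):
    """Let agents claim cells one at a time, in a fixed move order.

    Agent 0 moves first, agent 1 second, and so on. Each agent takes the
    free cell closest to the centre; ties go to the lower cell index.
    Returns a dict mapping each agent to its final distance from the centre.
    """
    if num_agents > num_cells:
        raise ValueError("each cell holds one agent, so agents must not outnumber cells")

    centre = num_cells // 2
    occupied = {}  # cell index -> agent that holds it
    for agent in range(num_agents):
        free_cells = [c for c in range(num_cells) if c not in occupied]
        best = min(free_cells, key=lambda c: (abs(c - centre), c))
        occupied[best] = agent
    return {agent: abs(cell - centre) for cell, agent in occupied.items()}


distances = settle(num_agents=5, num_cells=9)
for agent in sorted(distances):
    print(f"agent {agent} (move order {agent + 1}): distance {distances[agent]}")
```

Because earlier movers remove the best cells from the pool, an agent's outcome here depends only on its place in the sequence, which is the kind of hidden variable the paragraph above warns about.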

 

 

Microarray group

  Led by: Karl Joplin, Lev Yampolsky, Jeff Knisley and Edith Seier

 
   

Videos:

Lev1.AVI

Karl1.AVI

Jennifer.AVI

Patricia1.AVI

Neural networks and microarrays

Patricia Carey

The purpose of an artificial neural network is to learn patterns in data and to recognize them in test data.  The artificial neural network is first taught different patterns and is then given a test pattern to determine which of the known patterns the test pattern most resembles.  Typically an artificial neural network is made up of three layers: input, hidden, and output.  Our focus is the hidden layer.  It is made up of sets of connecting weights that enable it to examine the test data and recognize previously learned patterns within the test data.  It is then able to report whether the test data contains any of these previously learned patterns.  The weights are adjusted by a combination of the sigmoid function and the delta function.  The data that was acquired by Drs. Joplin, Knisley, and Miller has a wide range and two separate channels.  The first channel is for up-regulated genes and the second for down-regulated genes.  This data set was too large to run through a single artificial neural network, so the Monte Carlo method was used to reduce the problem size.  This method determines how many random groups of a certain size can be processed instead of one large group, so several smaller artificial neural networks were run instead of one large one.  After each of the artificial neural networks was taught its set of data, each value in the second set of weights (the alphas) corresponded to a gene’s importance in regulation.  The genes whose alphas fell outside a certain critical range remained within the network, while those within the range were eliminated from the network.  This process was repeated for each network until only about 2.5% of the original genes were left.  The remaining genes were then run through another neural network and their alpha values were recorded.
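To make the weight adjustment concrete, here is a minimal Python sketch of a single sigmoid unit trained with the delta rule. It reads the "delta function" above as the standard delta rule, which is an assumption, and it uses one unit rather than the three-layer network used in the project. The training patterns are a made-up toy example (logical AND), not microarray data, and the function names are chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Output of one sigmoid unit for the given inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_delta_rule(samples, learning_rate=0.5, epochs=5000):
    """Train one sigmoid unit with the delta rule; returns (weights, bias)."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = predict(weights, bias, inputs)
            # Error scaled by the sigmoid derivative, output * (1 - output).
            delta = (target - output) * output * (1.0 - output)
            weights = [w + learning_rate * delta * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * delta
    return weights, bias


# Toy patterns only: the unit should learn to fire when both inputs are 1.
patterns = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_delta_rule(patterns)
for inputs, target in patterns:
    print(inputs, target, round(predict(weights, bias, inputs), 3))
```

After training, the size and sign of each weight indicate how strongly the corresponding input drives the output, which is the sense in which a weight can be read as a measure of an input's importance.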

The developmental bioinformatics group was composed of Dr. Joplin, Dr. Knisley, Dr. Miller, Jennifer Cooke, and me.  Jennifer Cooke worked on determining down-regulated and up-regulated genes, while I tried to find her candidate genes to examine through my artificial neural networks.

Although the research for this project was not able to be completed, once it is, it will output a set of statistically significant candidate genes for the biologist to examine.  These genes will be statistically significant because of the Monte Carlo method.  This research method is important because determining genes with this type of data set has never been successfully completed before. 

 

Diapause in Sarcophaga

Jennifer K. Cooke

             For the last seven weeks, I have been working under Dr. Joplin and Dr. Miller, alongside Robert Morgan and Chau Nguyen.  Dr. Joplin was our project director.  Our goal was to characterize the genes that are differentially regulated in diapause, in the fly Sarcophaga crassipalpis, using heterologous microarray analysis. I started by gathering information on several genes which are expressed in relatively equal amounts in diapausing and nondiapausing specimens, otherwise known as the 'mid' genes.  The information I have gathered on these genes includes protein function, protein sequence, mRNA sequence, and areas of high conservation.  From those areas of high conservation, I have designed primers for a few of these genes which can be used in RT-PCR analysis.  With primer B, RT can be performed using Oligo-dT and total RNA, and then PCR is run using both primers and the RT product.  Once PCR is complete, its products can be visualized using agarose gel electrophoresis.  If bands do not appear in the resulting gel, PCR reamplification can be done with the primers and the PCR product.  If there are still no bands in the gel, RT can be rerun using gene-specific reverse primers, and PCR can be performed using the original primers.  Once bands are present, they can be cut from the gel and frozen in preparation for cloning and sequencing. Once data has been gathered from several of these mid genes, they can be used as control standards in experimental procedures with the differentially regulated genes.

            One of these primers, for the gene pgant6, was ordered.  Once it arrived, it was used in PCR amplification and the products were visualized in an agarose gel.  Results from the first gel were inconclusive, so the product was reamplified.  The second gel was also inconclusive, and we determined that the RNA used for RT, which was several months old, must have been degraded.

As soon as the insect pupae were ready, fresh RNA was extracted from both diapausing and nondiapausing specimens.  Once again, the first gel yielded no clear results.  However, after the PCR products had been reamplified, I was able to obtain a gel with two clear bands in the diapause column.  This result can mean one of two things.  The first possibility is that the new nondiapause RNA had already degraded prior to being used.  The second possibility is that the microarray gave an inaccurate result for this particular gene, and that it is in fact differentially regulated in diapause.  In order to determine which of these two possibilities is correct, the analysis will have to be repeated with another fresh batch of nondiapause RNA.  For now, the two bands from the diapause column have been cut out and the DNA has been extracted.  With this DNA, we can run another PCR analysis and attempt to clone the bands in another gel.

 

Statistical Approach to Multifactorial Microarray Analysis

Erin Ashton

An algorithm capable of detecting interaction between factors was developed for two-factor microarray data analysis.  In an exploratory analysis lacking replication, no direct test for interactions is possible. All we can measure is the heterogeneity of the slopes of the two factors, estimated as D = (R11 - R12) - (R21 - R22), where R11 is the response under level 1 of factor 1 and level 1 of factor 2, etc.  We simulated single-replicate data by calculating the D value for randomly chosen replicates drawn from a replicated data set, using two subsets of genes: the 100 genes with the highest D (scaled by the gene’s average) and the set of ribosomal proteins.  The rate of false positives can then be estimated from the cumulative distribution of D for the ribosomal proteins (housekeeping controls) and the rate of false negatives from the cumulative distribution of D for the set of genes with the most significant interaction.  A cut-off of D=1 results in approximately 20% false positives and 20% false negatives, which is acceptable for an exploratory method. This gives a “poor man’s” approach to constructing a list of candidate genes in a data set with a single replicate per treatment.  Q-PCR (quantitative polymerase chain reaction) analysis has allowed us to begin verification of the gene status indicated by the microarray.  The statistical approach was tested on simulated data and then applied to our data, which examined the up- and down-regulation of genes in male and female Drosophila by the onset of mating.  Olfactory and gustatory receptors were examined as the significant genes.
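The logic of the cut-off can be sketched in a few lines of Python. The example takes D as the standard 2×2 interaction contrast, the difference between the two slopes, (R11 - R12) - (R21 - R22), and estimates error rates as the share of control genes above the cut-off and the share of interacting genes below it. Comparing the absolute value of D with the cut-off, leaving D unscaled, and all function names are assumptions of this sketch; the numbers are made up and are not results from the study.

```python
def interaction_d(r11, r12, r21, r22):
    """Difference between the factor-2 slopes at the two levels of factor 1.

    r_ij is the response at level i of factor 1 and level j of factor 2.
    D is zero when the two slopes are parallel (no interaction).
    """
    return (r11 - r12) - (r21 - r22)


def fraction_exceeding(d_values, cutoff):
    """Share of genes whose |D| is at or above the cut-off."""
    return sum(1 for d in d_values if abs(d) >= cutoff) / len(d_values)


def error_rates(control_d, interacting_d, cutoff):
    """Return (false_positive_rate, false_negative_rate) for a given cut-off."""
    # Controls are assumed to have no interaction, so any flagged control is a false positive.
    false_positive = fraction_exceeding(control_d, cutoff)
    # Genes with a known interaction that fall below the cut-off are missed.
    false_negative = 1.0 - fraction_exceeding(interacting_d, cutoff)
    return false_positive, false_negative


# Made-up D values for illustration only.
control_d = [0.1, -0.3, 1.2, 0.4]
interacting_d = [1.5, 2.1, 0.8, -1.7]
fp, fn = error_rates(control_d, interacting_d, cutoff=1.0)
print(f"false positives: {fp:.2f}, false negatives: {fn:.2f}")
```

Raising the cut-off lowers the false-positive rate at the cost of more false negatives, so scanning a range of cut-offs over both sets of D values is one way to choose a threshold that balances the two.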

 

 

Videos from the weekly sport events:

soccer002.AVI

soccer003.AVI