Mark D. Yandell, PhD

Dr. Mark Yandell is an internationally recognized expert in comparative and functional genomics. As a postdoctoral fellow at the Human Genome Project at Washington University, St. Louis, he co-developed PolyBayes, a widely commercially licensed package and the first probabilistic algorithm for sequence variant discovery. He then joined Celera Genomics, where he led the group that wrote much of the software used to annotate and analyze the Drosophila, human, mouse, and mosquito genomes. From 2001 to 2005 he was a senior scientist for HHMI, where he led a comparative genomics group at the Berkeley Drosophila Genome Project. Since 2005, he has been an Associate Professor in the Eccles Institute of Human Genetics, University of Utah. He has served on the Scientific Advisory Boards of the Saccharomyces Genome Database, the Rice and Amborella genome annotation projects, and VectorBase. He is currently Director of the Eccles Institute’s Bioinformatics program and (co-)teaches Bioinformatics Programming for Molecular Biology and Evolutionary Genetics & Genomics. He is also a co-author of the O’Reilly book on BLAST. Current projects in his laboratory include an NHGRI-funded project to develop software for the creation and quality control of genome annotations; an NHGRI-funded project to develop software tools for personal genome analysis, including VAAST, a probabilistic disease-gene finder for personal genomes; an NIGMS-funded project to develop software for high-throughput image analysis; and a program grant from the NSF to develop a plant-specific genome annotation engine.

Title and Abstract:

Annotating genomes and their sequence-variants using interoperable, machine-readable data standards

Department of Human Genetics, Eccles Institute of Human Genetics, University of Utah and School of Medicine, Salt Lake City, Utah, USA

The ever-falling cost of sequencing is having a dramatic impact on the research community, changing which genomes are sequenced as well as how and where. Indeed, costs have now fallen to the point where a sequenced genome is often only one component of a genomics-centered research plan, with many of today’s projects also involving significant transcriptome and re-sequencing efforts. The scale of these projects is staggering, and they present many challenges in quality control and curation. Datasets of this size preclude ad hoc manual curation and require automated approaches to data management and quality control. This in turn makes the use of interoperable, machine-readable data standards essential. Fortunately, several widely used data standards are available for the genomics domain. These include GFF for the representation of genome annotations and their associated evidence, and VCF and GVF for the representation of sequence variants. I will show how the use of these standardized formats is empowering individual investigators and small collaborative groups to annotate, manage, curate and analyze even very large genome datasets. I will also discuss the challenges presented by genome re-sequencing, especially with regard to annotating these data in an interoperable, machine-readable fashion. Finally, I will highlight a few examples from my own group illustrating how genome annotation and re-sequencing efforts can be combined for rapid identification of the genes and alleles underlying human disease and the characteristic traits of plant cultivars and animal breeds.