Bioinformatics Frequently Asked Questions
This resource is maintained by and © Damian Counsell, UK
Medical Research Council Rosalind Franklin Centre for Genomic Research (the RFCGR)
1998-2004. It is made available under a modified version of
the Open Publication Licence.
- added Georgia State University's courses---thanks to Eric VanWieren
- added FH Weihenstephan in Freising---thanks to Tobias Kailich
- added Johns Hopkins' courses---thanks to Tim Young
- added three new bioinformatics courses from Germany---thanks to Sebastian Kurscheid
- added courses at University of Illinois---thanks to Amit Sabnis
$Revision: 1.207 $ $Date: 2005/04/05 13:06:07 $
Send your questions to me, Damian Counsell, and
I'll try to bring you answers. Alternatively, if you have your own answers,
mail them to me and I'll incorporate them. The practical section in particular is full of
gaps so your contributions to that are particularly welcome; I am slowly completing
and extending the entries when I have the time.
Although I am happy to tackle questions of general interest to all
visitors to the site, please note that:
- I cannot answer queries specific to you alone,
- I am not a careers adviser,
- I try not to offer opinions on the relative merits of bioinformatics
- I won't answer your essay questions, assignments, or homework,
- I won't provide you with a list of companies for you to market
your bioinformatics product to,
- I won't suggest a project for your Master's/PhD,
- I have not devised a bioinformatic cure for cancer---and neither
have you, and
- This FAQ is perpetually under construction.
I hope, however, that the information here helps with your studies, career and
I acknowledge the help of many other individuals
in creating this part of the Bioinformatics.Org site. If you have contributed
and I have forgotten to credit you, please email
me and I will correct my oversight immediately.
Bioinformatics is, I believe, a special kind of engineering discipline---it
certainly isn't a "pure" science. It has been enormously successful
in its short existence and I think its successes have been
the result of a practical and rigorous approach which I hope to encourage in
anyone interested in entering the field.
This document is not a scientific paper or textbook (yet). You will find blunt
opinions here. If you disagree with me about any of the following please tell me.
I hope to learn a lot from your inevitable and welcome criticisms.
There is certainly one sense in which I consider myself a pure scientist:
I'm open to rational persuasion.
I write this resource and hold the copyright for the purposes of protecting
its content from intellectual property pirates. By that I mean I want to keep
this out of the hands of people who steal the work of others for commercial
gain, and those who abuse and extend the powers of IP law at the expense of
the disadvantaged---rather than those who would like to copy or mirror this
resource for educational reasons. (This may sound overdramatic, but the FAQ
has already been pirated for doubtful purposes.)
Definitions: What is Bioinformatics?
Bioinformatics: What is bioinformatics?
Roughly, bioinformatics describes any use of computers to handle biological
information.
In practice, the definition used by most people is narrower; bioinformatics
to them is a synonym for "computational molecular biology"---the
use of computers to characterize the molecular components of living things.
What is Bioinformatics?---The Tight Definition
Most biologists talk about "doing bioinformatics" when they use computers to store, retrieve, analyze or predict the composition or
the structure of biomolecules. As computers become more powerful you
could probably add simulate to this list of bioinformatics verbs. "Biomolecules" include
your genetic material---nucleic acids---and the products of your genes: proteins.
These are the concerns of "classical" bioinformatics, dealing primarily
with sequence analysis.
Fredj Tekaia at the Institut Pasteur offers this
definition of bioinformatics:
"The mathematical, statistical and computing methods that aim
to solve biological problems using DNA and amino acid sequences and related
It is a mathematically interesting property of most large biological molecules
that they are polymers: ordered chains of simpler molecular
modules called monomers. Think of the monomers as beads or
building blocks which, despite having different colours and shapes, all have
the same thickness and the same way of connecting to one another.
Monomers that can combine in a chain are of the same general class, but each
kind of monomer in that class has its own well-defined set of characteristics.
Many monomer molecules can be joined together to form a single, far larger, macromolecule.
Macromolecules can have exquisitely specific informational content and/or chemical
According to this scheme, the monomers in a given macromolecule of DNA or
protein can be treated computationally as letters of an alphabet,
put together in pre-programmed arrangements to carry messages or do work in
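As a rough illustration of treating monomers as letters of an alphabet, the sketch below stores a short DNA sequence as a Python string and counts its letters. The function name `composition`, the constant `DNA_ALPHABET` and the example sequence are made up for this illustration; real sequence files often contain additional ambiguity codes that this sketch simply rejects.

```python
from collections import Counter

# Assumption for this sketch: plain DNA with only the four standard bases.
DNA_ALPHABET = set("ACGT")


def composition(sequence, alphabet=DNA_ALPHABET):
    """Count each letter of a sequence, rejecting letters outside the alphabet."""
    sequence = sequence.upper()
    unexpected = set(sequence) - alphabet
    if unexpected:
        raise ValueError(f"letters outside the alphabet: {sorted(unexpected)}")
    return Counter(sequence)


if __name__ == "__main__":
    counts = composition("ATGGCGTAA")
    for base in "ACGT":
        print(base, counts[base])
```

The same function would work for proteins if it were given the twenty-letter amino acid alphabet instead.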
The greatest achievement of bioinformatics methods, the Human Genome Project, is currently
being completed. Because of this the nature and priorities of bioinformatics
research and applications are changing. People often talk portentously of our
living in the "post-genomic" era.
My personal view is that this will affect bioinformatics in several ways:
- Now that we possess multiple whole genomes we can look for differences and similarities
between all the genes of multiple species. From such studies we can draw
particular conclusions about species and general ones about evolution. This
kind of science is often referred to as comparative genomics.
- There are now technologies designed to measure the relative number of copies
of a genetic message (levels of gene expression) at different stages in development
or disease or in different tissues. Such technologies, for example DNA
microarrays, will grow in importance.
- Other, more direct, large-scale ways of identifying gene functions and
associations (for example yeast
two-hybrid methods) will grow in significance and with them the accompanying
bioinformatics of functional genomics.
- There will be a general shift in emphasis (of sequence analysis especially)
from genes themselves to gene products. This will lead to:
  - attempts to catalogue the activities and characterize interactions
  between all gene products (in humans): proteomics.
  - attempts to crystallize and/or predict the structures of all proteins
  (in humans): structural genomics.
  - fewer DNA double-helices in bad sci-fi movies.
- What some people refer to as research or medical
informatics, the management of all biomedical experimental data
associated with particular molecules or patients---from mass spectroscopy,
to in vitro assays to clinical side-effects---will move from the
concern of those working in drug company and hospital I.T. (information
technology) into the mainstream of cell and molecular biology and migrate
from the commercial and clinical to academic sectors.
This FAQ concentrates on classical bioinformatics, but will, I hope, grow to
cover more of the "post-genomic" aspects of the field. It is worth
noting that all of the above non-classical areas of research depend upon established
sequence analysis techniques.
Definitions of Fields Related to
Molecular biology itself grew out
of biophysics. The British Biophysical Society defines biophysics as:
"an interdisciplinary field which applies techniques
from the physical sciences to understanding biological structure
about the various facets of the discipline can be found at the society's site hosted
at Birkbeck College, London.
Mike Goodrich wrote to ask what the status of biophysics was given
the definition of computational biology submitted by Paul Schulte
(below). A recent
article in The Scientist [free
registration required] dealt with this question---thanks to Jo Wixon
(Managing Editor of Comparative
and Functional Genomics) for the reference.
What is Computational Biology?
Computational biologists might object (please
do), but, I find that people use "computational biology" when
discussing that subset of bioinformatics (in the broadest sense)
closest to the field of classical general biology.
Computational biologists are more interested in evolutionary,
population and theoretical biology than in cell and molecular
biomedicine. It is inevitable that molecular biology is profoundly
important in computational biology, but it is certainly not what
computational biology is all about (see next paragraph). In these
areas of computational biology it seems that computational biologists
have tended to prefer statistical models for biological phenomena
over physico-chemical ones. This is often wise...
One computational biologist (Paul J Schulte) did object to the above
and makes the entirely valid point that this definition derives from
a popular use of the term, rather than a correct one. Paul works
on water flow in plant cells. He points out that biological fluid
dynamics is a field of computational biology in itself. He argues
that this, and any application of computing to biology, can be described
as "computational biology" (see also the "loose" definition
of bioinformatics below). Where we disagree, perhaps, is in the
conclusion he draws from this---which I reproduce in full:
"Computational biology is not a "field",
but an "approach" involving the use of computers to study
biological processes and hence it is an area as diverse as biology
Richard Durbin, Head of Informatics at the Wellcome Trust Sanger Institute,
expressed an interesting opinion on this distinction in an interview:
"I do not think all biological computing is bioinformatics, e.g. mathematical
modelling is not bioinformatics, even when connected with biology-related
problems. In my opinion, bioinformatics has to do with management
and the subsequent use of biological information, particular genetic
What is Medical Informatics?
The Medical Informatics
FAQ (no relation) provides the following definition:
"Biomedical Informatics is an emerging discipline
that has been defined as the study, invention, and implementation
of structures and algorithms to improve communication, understanding
and management of medical information."
That FAQ also points here
Aamir Zakaria, the author of the FAQ, emphasises that medical informatics
is more concerned with structures and algorithms for the manipulation
of medical data, rather than with the data itself.
This suggests that one difference between bioinformatics and medical
informatics as disciplines lies with their approaches to the data;
there are bioinformaticians interested in the theory behind the manipulation
of that data and there are bioinformatics scientists concerned
with the data itself and its biological implications. (I believe
that a good bioinformatics researcher should be interested in both
of these aspects of the field.)
Medical informatics, for practical reasons, is more likely to deal
with data obtained at "grosser" biological levels---that
is information from super-cellular systems, right up to the population
level---while most bioinformatics is concerned with information about
cellular and biomolecular structures and systems.
On both of these points I'd be happy for any medical informatics
specialists to correct me.
What is Cheminformatics?
The Web advertisement for Cambridge Healthtech Institute's Sixth
Annual Cheminformatics conference describes the field thus:
"the combination of chemical synthesis, biological
screening, and data-mining approaches used to guide drug discovery
but this, again, sounds more like a field being identified by some
of its most popular (and lucrative) activities, rather than by including
all the diverse studies that come under its general heading.
The story of
one of the most successful drugs of all time, penicillin,
seems bizarre, but the way we discover and develop drugs even now
has similarities, being the result of chance, observation and a lot
of slow, intensive chemistry. Until recently, drug design always
seemed doomed to continue to be a labour-intensive, trial-and-error
process. The possibility of using information technology, to plan
intelligently and to automate processes related to the chemical synthesis
of possible therapeutic compounds is very exciting for chemists and
biochemists. The rewards for bringing a drug to market more rapidly
are huge, so naturally this is what a lot of cheminformatics works
a page with a commercial slant which links to some interesting
discussions of the term "cheminformatics", what it means,
whether or not it exists as a distinct discipline, and even whether
it should be replaced by "chemoinformatics".
The span of academic cheminformatics is wide and is exemplified
by the interests of the cheminformatics groups at the Centre
for Molecular and Biomolecular Informatics at the University of Nijmegen in the Netherlands.
These interests include:
- Synthesis Planning
- Reaction and Structure Retrieval
- 3-D Structure Retrieval
- Computational Chemistry
- Visualisation Tools and Utilities
Trinity University's Cheminformatics Web page,
for another example, concerns itself with cheminformatics as the
use of the Internet in chemistry.
What is Genomics?
Genomics is a field which existed before the completion of the sequences
of genomes, but in the crudest of forms, for example the oft-re-referenced
estimate of 100 000 genes in the human genome derived from a(n) (in)famous
piece of "back of an envelope" genomics, guessing the weight
of chromosomes and the density of the genes they bear. Genomics is
any attempt to analyze or compare the entire genetic complement of
a species or species (plural). It is, of course, possible to compare
genomes by comparing more-or-less representative subsets of genes
What is Mathematical Biology?
Mathematical biology is easier to distinguish from bioinformatics
than computational biology. Mathematical biology also tackles biological
problems, but the methods it uses to tackle them need not be numerical
and need not be implemented in software or hardware. Indeed, such
methods need not "solve" anything; in mathematical biology
it would be considered reasonable to publish a result which merely
establishes that a biological problem belongs to a particular general
The distinction between bioinformatics and mathematical biology
was illuminated by an email I received from Alex Kasman at the College
of Charleston. According to his working definition, he distinguished bioinformatics which
(under the tight definition at
"...seems to focus almost exclusively on specific
algorithms that can be applied to large molecular biological data
...from mathematical biology which...
"...includes things of theoretical interest which
are not necessarily algorithmic, not necessarily molecular in nature,
and are not necessarily useful in analyzing collected data."
What is Proteomics?
A review on proteomics in the journal Nature defined the field
"The term proteome was first
coined to describe the set of proteins encoded by the genome.
The study of the proteome, called proteomics, now evokes not only
all the proteins in any given cell, but also the set of all protein
isoforms and modifications, the interactions between them, the
structural description of proteins and their higher-order complexes,
and for that matter almost everything 'post-genomic'."
Michael J. Dunn, the Editor-in-Chief of Proteomics defines
the "proteome" as:
"the PROTEin complement of the genOME"
and proteomics to be concerned with:
"qualitative and quantitative studies of gene expression
at the level of the functional proteins themselves"
"an interface between protein biochemistry and molecular
Characterizing the many tens of thousands of proteins expressed
in a given cell type at a given time---whether measuring their molecular
weights or isoelectric points, identifying their ligands or determining
their structures---involves the storage and comparison of vast numbers
of data. Inevitably this requires bioinformatics. Here is a
constructively skeptical review by Lukas Huber.
What is Pharmacogenomics?
Pharmacogenomics is the application of genomic approaches and technologies
to the identification of drug targets. Examples include trawling
entire genomes for potential receptors by bioinformatics means, or
by investigating patterns of gene expression in both pathogens and
hosts during infection, or by examining the characteristic expression
patterns found in tumours or patient samples for diagnostic purposes
(possibly in the pursuit of potential cancer therapy targets).
The term "pharmacogenomics" is also used for the more "trivial"---but
arguably more useful---application of bioinformatics approaches to
the cataloguing and processing of information relating to pharmacology
and genetics, for example the accumulation of information in databases
like this one.
(Thanks to Ivanovi.)
What is Pharmacogenetics?
All individuals respond differently to drug treatments; some positively,
others with little obvious change in their conditions and yet others
with side effects or allergic reactions. Much of this variation is
known to have a genetic basis. Pharmacogenetics is a subset of pharmacogenomics
which uses genomic/bioinformatic methods to identify genomic correlates,
for example SNPs (Single Nucleotide Polymorphisms),
characteristic of particular patient response profiles and use those
markers to inform the administration and development of therapies.
Strikingly, such approaches have been used to "resurrect" drugs
thought previously to be ineffective, but subsequently found to work
within a subset of patients. They can also be used for optimizing
the doses of chemotherapy for particular patients.
Overview of most common bioinformatics programs
Everyday bioinformatics is done with sequence search programs like BLAST, sequence
analysis programs like the EMBOSS and Staden packages, structure
prediction programs like THREADER or PHD, or molecular
imaging/modelling programs like RasMol and WHATIF.
Overview of most common bioinformatics technology
Currently, a lot of bioinformatics work is concerned with the technology
of databases. (Thanks again to Ivanovi.) These databases include both "public" repositories
of gene data like GenBank or
the Protein DataBank (the PDB), and
private databases, like those used by research groups involved in
gene mapping projects or those held by biotech companies. Making
such databases accessible via open standards is very important. Consumers
of bioinformatics data use a range of computer platforms: from the
more powerful and forbidding UNIX boxes favoured by the developers
and curators to the far friendlier Macs often found populating the
labs of computer-wary biologists.
Databases of existing sequencing data can be used to identify
homologues of new molecules that have been amplified and sequenced
in the lab. The property of sharing a common ancestor, homology,
can be a very powerful indicator in bioinformatics (see below).
Acquisition of sequence data
Bioinformatics tools can be used to obtain sequences of genes or
proteins of interest, either from material obtained, labelled, prepared
and examined in electric fields by individual researchers/groups
or from repositories of sequences from previously investigated material.
Analysis of data
Both types of sequence can then be analysed in many ways with bioinformatics
They can be assembled. Note that this is one of the occasions
when the meaning of a biological term differs markedly from a computational
one (see the amusing confusion over
the issue at Web-based geek forum Slashdot). Computer scientists, banish
from your mind any thought of assembly language. Sequencing can only
be performed for relatively short stretches of a biomolecule and
finished sequences are therefore prepared by arranging overlapping "reads" of monomers (single
beads on a molecular chain) into a single continuous passage of "code". This is
the bioinformatic sense of assembly.
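The idea of arranging overlapping reads into one continuous sequence can be sketched with a deliberately naive greedy merge. This is only a toy under strong assumptions: the reads are error-free, all come from the same strand, and the function names `overlap` and `greedy_assemble` are invented for this example. Real assemblers cope with sequencing errors, repeats and both strands, and use far more sophisticated methods.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    size = overlap(left, right, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps: leave the pieces separate
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

With these three hypothetical reads the sketch joins them into a single sequence; reads that share no sufficiently long overlap are left as separate pieces.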
They can be mapped---that is, their sequences can
be parsed to find sites where so-called "restriction enzymes" will
cut them.
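A minimal sketch of this kind of mapping: scan a DNA string for the recognition sequences of two well-known restriction enzymes and report where each would cut. EcoRI recognises GAATTC and BamHI recognises GGATCC, and both cut after the first G on the strand shown. The table layout, the function name `restriction_map` and the test sequence are assumptions made for this example, and only one strand is considered.

```python
# Recognition site and cut offset (number of bases from the start of the site
# to the cut on the strand shown): EcoRI cuts G^AATTC, BamHI cuts G^GATCC.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
}


def restriction_map(sequence, enzymes=ENZYMES):
    """Return, for each enzyme, the 0-based positions at which it would cut."""
    sequence = sequence.upper()
    cuts = {}
    for name, (site, offset) in enzymes.items():
        positions = []
        start = sequence.find(site)
        while start != -1:
            positions.append(start + offset)
            start = sequence.find(site, start + 1)
        cuts[name] = positions
    return cuts


if __name__ == "__main__":
    print(restriction_map("AAGAATTCTTGGATCCAAGAATTC"))
```

Each reported position is the index of the first base after the cut, so a position of 3 means the fragment boundary falls between the third and fourth letters of the sequence.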
They can be compared, usually by aligning corresponding
segments and looking for matching and mismatching letters in their
sequences. Genes or proteins that are sufficiently similar are likely
to be related and are therefore said to be "homologous" to
each other---the whole truth is rather more complicated than this.
Such cousins are called "homologues".
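One classic way of comparing two sequences by alignment is the Needleman-Wunsch dynamic programming method for global alignment. The sketch below computes only the best alignment score, not the alignment itself. The scoring values (+1 for a match, -1 for a mismatch, -2 for a gap) are arbitrary choices for this example; real tools use substitution matrices and more careful gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of the best global alignment of a and b."""
    # previous[j] holds the best score for aligning the processed prefix of a
    # with the first j letters of b.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + pair,  # align the two letters
                previous[j] + gap,       # gap in b
                current[j - 1] + gap,    # gap in a
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("GATTACA", "GATTTCA"))
```

A high score only says that two sequences are similar. Whether that similarity reflects common ancestry is a separate judgement, usually made with statistical help.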
If a homologue (a related molecule) exists, then a newly discovered
protein may be modelled---that is the three dimensional structure
of the gene product can be predicted without doing laboratory experiments.
Bioinformatics is used in primer design. Primers are short
sequences needed to make many copies of (amplify) a piece of DNA
as used in PCR (the Polymerase Chain Reaction).
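Two of the simplest calculations involved in primer design can be sketched as follows: taking the reverse complement of the far end of a template to obtain the reverse primer, and estimating a melting temperature with the Wallace rule (2 degrees C for each A or T, 4 degrees C for each G or C), which is a rough rule of thumb for short oligonucleotides. The function names, the default primer length and the template are invented for this illustration. Real primer design also checks for hairpins, primer dimers and uniqueness in the genome.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def wallace_tm(primer):
    """Rough melting temperature in degrees C by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def primer_pair(template, length=18):
    """Naive primers: the first `length` bases, and the reverse complement
    of the last `length` bases, of the region to be amplified."""
    template = template.upper()
    forward = template[:length]
    reverse = reverse_complement(template[-length:])
    return forward, reverse


if __name__ == "__main__":
    forward, reverse = primer_pair("ATGGCGTACCTGAAAGTTCAG", length=6)
    print(forward, wallace_tm(forward))
    print(reverse, wallace_tm(reverse))
```

In practice the two primers are usually chosen so that their estimated melting temperatures are close to each other.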
Bioinformatics is used to attempt to predict the function of
actual gene products.
Information about the similarity, and, by implication, the relatedness
of proteins is used to trace the "family trees" of
different molecules through evolutionary time.
There are various other applications of computer analysis to sequence
data, but, with so much raw data being generated by the Human Genome
Project and other initiatives in biology, computers are presently
essential for many biologists just to manage their day-to-day results
Molecular modelling / structural biology is a growing field which
can be considered part of bioinformatics. There are, for example,
tools which allow you (often via the Net) to make pretty good predictions of
the secondary structure of proteins arising from a given amino acid
sequence, often based on known "solved" structures and
other sequenced molecules acquired by structural biologists.
Structural biologists use "bioinformatics" to handle the
vast and complex data from X-ray crystallography, nuclear magnetic
resonance (NMR) and electron microscopy investigations and create
the 3-D models of molecules that seem to be everywhere in the media.
Unfortunately the word "map" is used
in several different ways in biology/genetics/bioinformatics. The
definition given above is the one most frequently used in this context,
but a gene can be said to be "mapped" when its parent chromosome
has been identified, when its physical or genetic distance from other
genes is established and---less frequently---when the structure and
locations of its various coding components (its "exons")
What is Bioinformatics?---The Loose Definition
There are other fields---for example medical imaging / image analysis
which might be considered part of bioinformatics. There is also a
whole other discipline of biologically-inspired computation: genetic
algorithms, AI, neural networks. Often these areas interact in
strange ways. Neural networks, inspired by crude models of the functioning
of nerve cells in the brain, are used in a program called PHD to
predict, surprisingly accurately, the secondary structures of proteins
from their primary sequences.
What almost all bioinformatics has in common is the processing of
large amounts of biologically-derived information, whether DNA sequences
or breast X-rays.
How old is the discipline?
"How old is bioinformatics?" The answer to this one depends
on which source you choose to read.
From T K Attwood and D J Parry-Smith's "Introduction to
Bioinformatics", Prentice-Hall 1999 [Longman Higher Education;
"The term bioinformatics is used to encompass almost all computer applications
in biological sciences, but was originally coined in the mid-1980s for the
analysis of biological sequence data."
From Mark S. Boguski's article in the "Trends Guide to
Bioinformatics" Elsevier, Trends Supplement 1998 p1:
"The term "bioinformatics" is a relatively recent invention,
not appearing in the literature until 1991 and then only in the context of
the emergence of electronic publishing...
"...However, some of my role models when I was a graduate student (Margaret
O. Dayhoff, Russell F. Doolittle, Walter M. Fitch and Andrew D. McLachlan)
had been building databases, developing algorithms and making biological
discoveries by sequence analysis since the 1960s---long before anyone thought
to label this activity with a special term (if anything it was called `molecular
evolution'). Even a relatively new kid on the block, the National Center
for Biotechnology Information (NCBI), is celebrating its 10th anniversary
this year, having been written into existence by US Congressman Claude Pepper
and President Ronald Reagan in 1988. So bioinformatics has, in fact, been
in existence for more than 30 years and is now middle-aged."
Books: Can you recommend any bioinformatics books?
It's notoriously difficult to find any books on bioinformatics
itself that cater well for all of those coming from computing, from
mathematics and from biology backgrounds. The few textbooks available
in the field tend to be eyewateringly expensive as well. I've
divided suggested reading into books of general
interest, those best suited to people coming
from a computational/mathematical background and books
for biologists interested in bioinformatics. Where a book is
also listed in Bioinformatics.Org's books section I have linked
the title to the relevant entry there. Links to other lists of bioinformatics
books follow this section of suggested reading.
Many people are curious about the Human Genome (Project). The completion
of the first draft probably represents bioinformatics' coming
of age as a discipline. The first couple of books are aimed at the
A gossipy and insightful account of the race to sequence the genome
can be found in "The Sequence" by Kevin Davies
[Weidenfeld; ISBN 0297646982]. Matt Ridley's "Genome" [Fourth
Estate; ISBN 185702835X] is both an interesting layperson's introduction
to the issues raised by the bioinformatic revolution and an overview
of its biology and enormous scope. If I remember rightly, Ridley's
book received a slightly snooty review from Walter Bodmer. This is
understandable, since his and Robin McKie's excellent "pre-genomic" guide
to the Human Genome Mapping Project, "The Book of Life" [Oxford
Paperbacks; ISBN 0195114876] was undeservedly in a remainders bin
when I bought my copy a couple of years ago.
If you are a non-biological scientist (or a non-scientist) and are
hooked by these, why not go back to the "real beginning" of
the race and read James Watson's entertaining and indiscreet
memoir of his and Francis Crick's determination of the structure
of DNA, "The Double Helix" [Penguin; ISBN
0140268774]---now updated with an introduction by media don Steve
Nigel Barber at Peterborough Regional College in the UK recommends
Gary Zweiger's "Transducing the Genome" [McGraw-Hill
Professional Publishing: ISBN 0071369805]. The summary at
Amazon makes it sound a tad pretentious, but all the reviews seem
pretty positive so it might be worth a read.
If you are a quantitative scientist and would like a deeper knowledge
of contemporary (molecular) biology, but you want to acquire it as
painlessly as possible you could try the following:
- Donna Rae Siegfried's Biology for Dummies [Wiley; ISBN
0-7645-5326-7] is fun, well thought out and a lot more informative than the
title might suggest. If only all biology textbooks were this entertaining
- If you already have some biological knowledge and would like to get a grip
on modern biomolecular science then Richard J. Epstein's Human
Molecular Biology is an elegant, colourful and detailed guide.
There are two classic competing texts in cell and molecular biology
which Maximilian Haeussler reminds me to include: Alberts et al's Molecular
Biology of the Cell [Garland Science: ISBN 0815340729]
and Molecular Biology of the Gene [Benjamin Cummings:
If you are a hardcore maths/computing person Michael
Waterman's "Introduction to Computational Biology" [Chapman & Hall/CRC
Statistics and Mathematics; ISBN 0412993910] and Pavel
Pevzner's "Computational Molecular Biology - An
Algorithmic Approach" [The MIT Press (A Bradford
Book); ISBN 0262161974] will give you all the discrete maths you
can shake a stick at, but perfunctory introductions to the biology.
Bioinformatics.Org's very own Jeff Bizzaro recommends Dan Gusfield's "Algorithms
on Strings, Trees and Sequences" [Cambridge,
1997 ISBN 0-52158-519-8], Richard Durbin, S. Eddy, A. Krogh,
G. Mitchison "Biological
Sequence Analysis: Probabilistic Models of Proteins and Nucleic
Acids" [Cambridge, 1997 ISBN 0-52162-971-3] (which
I think is one of the clearest and most comprehensive guides
to alignment algorithms) and---for that full "computers-to-biology
conversion"--- Geoffrey M. Cooper "The
Cell: A Molecular Approach" [ASM Press, 1996
ISBN 0-87893-119-8]. Jeff Ames writes that a second edition of
this book is now available [Sinauer Associates, Incorporated,
2000 ISBN 0-87893-106-6] and that this version---if you can find
it in the shops---comes with a CD.
Applying bioinformatics to biological research
One outstanding general text for the biologist is David W.
Mount's "Bioinformatics" [Cold Spring
Harbor Press; ISBN 0879696087]. It's not cheap, but it's
the best I've seen if you are studying bioinformatics itself.
Bioinformatics has been dismissed by some as "the science of
BLAST searches". The best collection of advice so far on doing
BLAST searches is O'Reilly's BLAST book
by Ian Korf, Mark Yandell and Joseph Bedell [O'Reilly ISBN 0-596-00299-8].
I reviewed it enthusiastically, but not uncritically, for the UK
UNIX Users' Group magazine. I'd go as far as to say that
all biologists thinking of using BLAST in their research should read
the relevant sections before they even go near a computer.
If you wish to use general bioinformatics tools, especially
if you are a little wary of computers, my new "best" book
for Dummies" [John Wiley and Sons ISBN 0764516965].
It is (obviously) aimed at people who are beginners, who are happier
using the Web rather than typing commands, and who are more interested
in learning than in impressing people---the writing is friendly, clear
and unpretentious. However, like several of my other tips (below)
it concentrates on Web-based resources so it will, inevitably, date.
(This is partially compensated for by there being a companion
Also, if you're coming to the subject as a computer user with
a biological background, looking to exploit the many tools available,
you might want to try Terry Attwood and David Parry-Smith's "Introduction
to Bioinformatics" [Longman Higher Education; ISBN
0582327881], or Des Higgins
and Willie Taylor's "Bioinformatics: Sequence Structure
and Databanks" [Oxford University Press; ISBN 0199637903].
Another excellent practical introduction is Andreas
Baxevanis and Francis Ouellette's "Bioinformatics: A
Practical Guide to the Analysis of Genes and Proteins" [Wiley-Interscience;
ISBN 0471383910], now in its new and improved second edition. Bax
teaches bioinformatics all over Canada and the experience shows.
Arthur Lesk has also produced an excellent teaching book particularly
for protein bioinformatics in his Introduction
Bioinformatics.Org also recommends Cynthia Gibas and Per Jambeck's "Developing
Bioinformatics Skills" [O'Reilly, 2001 ISBN
Stuart Brown recommends his own book "Bioinformatics:
A Biologist's Guide to Biocomputing and the Internet" [Eaton
Pub Co; ISBN: 188129918X]. If he sends me a review copy I might
recommend it too ;-) .
"Darwin's Radio" by Greg Bear [Ballantine
Books, ISBN: 0345435249] is a wonderful hard SF thriller which stretches
ideas derived from genome discoveries to their breaking point. It's
gripping and humane.
Leonard Crane, the author of Ninth Day of Creation kindly
sent me a copy for review. So far it's an excellent read. I haven't
finished it yet, not because it isn't a rattling good story,
but because, like "Darwin's Radio", it
is very long and because I am very busy. If you'd like to read
a well-researched, but speculative, novel containing actual scenes
of practising bioinformatics then try it.
Ken Allen contributed the following reviews:
"Frameshift [Tor Books, ISBN: 0812571088] by Robert J.
Sawyer---based around the HGP---reasonable read, but poor / confused ending."
Calculating God [Tor Books, ISBN: 0812580354] by the same author---has
a subtler bio connection and is a much better read. Near the start an alien
spacecraft lands, the alien emerges and says 'take me to your paleontologist'
Further suggestions for this section are welcome.
Other lists of bioinformatics
See also compbiology.org's list, Steve Brenner's list, and
Choon Tan's collection of books.
Centres of Bioinformatics Activity:
Where is bioinformatics done?
The biggest and best source of bioinformatics links I have encountered
is the Genome Web at the Rosalind
Franklin Centre for Genomics Research at the Genome
Campus near Cambridge, UK.
Most of the links below come from that resource. My list is necessarily
limited by comparison.
[XXXX INSERT DETAILS OF MORE SEQUENCING CENTRES HERE]
[XXXX INSERT DETAILS OF STANDARDS CENTRES HERE]
What virtual centres
(for example consortia and communities) for bioinformatics activity
[XXXX INSERT MORE DETAILS OF VIRTUAL BIOINFORMATICS CENTRES HERE]
Online Resources: What bioinformatics
Websites are there?
The front page of Bioinformatics.Org itself
is a bioinformatics 'Blog.
The Bio-Web links to resources
online for molecular and cell biologists and covers current news
in various biological/computational fields.
Genehack is the first bioinformatics 'Blog
I ever encountered.
The Australian National Genomic Information Service (ANGIS) is operated
by the Australian
Genomic Information Centre (currently at The University of Sydney) to offer
software, databases, documentation, training and support for biologists
"The University of Maryland AgNIC gateway is a guide to quality
agricultural biotechnology information on the Internet."
Christy Hightower, Engineering Librarian at the Science and Engineering
Library, University of California Santa Cruz has
already done this better than me. Visit her excellent
article about bioinformatics Net resources in Issues
in Science and Technology Librarianship.
Humberto Ortiz Zuazaga kindly introduced me to The International Society for Computational
Biology which he points out "has links to programs of
study and online courses in computational biology and to job postings".
Collections of Tools
You can start right here at
Bioinformatics.Org if you are looking for a bioinformatics toolbox.
I cannot recommend strongly enough the Rosalind Franklin Centre's "GenomeWeb".
Of historical interest only now, I guess, is the legendary "Pedro's
Molecular Biology Search and Analysis Tools".
Bioinformatics.Org is an international
organization which promotes freedom and openness in the field of
bioinformatics and is the root domain of a damned fine Website :-)
CCP11 (Collaborative Computational Project 11) is another product of the UK's
Genome Campus. To quote their Web site, it was...
"...established to foster the broad bioinformatics
community and the UK research community in particular. Its purpose
is to facilitate the transfer of knowledge and expertise through
conferences, workshops, a newsletter and the use of the world wide
web. CCP11 is funded by the BBSRC and is hosted at the MRC
Rosalind Franklin Centre for Genomics Research RFCGR located on the Wellcome
Trust Genome Campus, Cambridge."
Jennifer Steinbachs runs compbiology.org which is a general
computational biology site as well as being a portal to her own work.
BioPlanet is well worth
visiting. It describes itself as "a not-for-profit site, funded
with our resources, for [its users'] benefit".
ColorBasePair is a densely
packed portal with lots of bioinformatics links.
Nick Yates runs his own informative bioinformatics site, unsurprisingly
called nick-yates.com. He doesn't
aim to make money from it, but it may have paid-for ads. Check out
the glossaries---they are better than mine.
A great place to start, whether you come from a biological, physical
or computational background is at Martin Vingron's superb online
bioinformatics tutorial. (Begin by choosing a section from the
left-hand-side menu bar.)
Tom Smith and Don Emmeluth have produced a nice little exploration of
bioinformatics using NCBI resources and tools.
I recently stumbled upon a promising set of online
lecture notes currently under construction by B. Steipe at
the Genzentrum (Gene Center) at
München (University of Munich).
Chemistry for all
A defiantly frames-free chemistry tutorial
Mathematics for biologists
First of all, an almost completely painless
introduction to the horrors of the quadratic equation by Peter
Whalen, James Walker, and Drew Marticorena.
C. J. Schwarz of
the Department of Statistics and Actuarial
Science, Simon Fraser University has
produced a course in statistics which is accompanied by a set of sound, online PDF handouts.
Here is a great guide to
a whole array of statistical learning/teaching resources prepared
by Juha Puranen of the University
of Helsinki (English).
Computers for biologists
Programming for biologists
General introduction to biology for
Estrella Mountain Community
College in the States offers this excellent short
introduction to biology (actually "The Nature of Science
and Biology"). It's a great place for keyboard jockeys
to start their journey to enlightenment. Thanks to Alex O'Neill
for pointing out the broken link.
The Dolan DNA Learning Center at Cold Spring Harbor has an outstanding
interactive tutorial introducing genetics. To take full advantage
of the multimedia elements you should download the Flash and Real players.
Molecular biology for computer scientists
The Institute of Arable Crop Research Beginner's
Guide to Molecular Biology
Protein chemistry for computer scientists
Unilever Education Advanced Series tutorial
Cell biology for computer scientists
The University of Arizona has
made available a high-quality
tutorial in cell biology. Not only does it cover the facts, but
it also attempts to introduce some of the philosophy of the field---recommended.
Even better, it's also available en Español and in
Once you've worked your way through that you might like to see
some scanning electron microscope images of
some of the structures you've read about taken by members of John Heuser's lab.
for computer scientists
Bob Patterson maintains his "Darwiniana" with
Other lists of bioinformatics tutorials
Education: Where can I study Bioinformatics...
straight to introduction to education section
This section is not complete,
but contributions to broaden its coverage are welcome. Please
do not direct questions about eligibility, course quality or admissions
policy to me; ask the individual institutions directly. Use
the links to obtain contact details. If an institution doesn't
provide telephone numbers/email addresses or snailmail details on
its Web site it doesn't deserve your patronage.
This resource focuses on complete, full-time degree programmes rather
than on individual study modules. Curating a list of the latter would
be a full-time job. You can go to other places, however, if you are
looking for short courses. Thanks to various contributors, including Wentian Li
who pointed me to this list at
Rockefeller which is mirrored at various other sites. And to Humberto
Ortiz Zuazaga for mailing me a link to the ISCB, where you can find this list.
If you are interested in U.S. programmes, here's a
list from Curtin and here's a
list from Stanford. Thanks to Amelie Stein who also supplied
some of the individual entries in this section.
Those wanting to find programmes in the Asia Pacific region could
have a look at this resource maintained
by the Asia Pacific Bioinformatics Network (APBioNet). Thanks to Sentausa.
In the UK, The Bioinformatics
Resource (part of the BBSRC's CCP11 project)
maintains (among many other resources) lists of (mainly)
British Masters and PhDs in
bioinformatics. If you have any suggestions or updates please contact me
with them. You can publicize your course and offer a public service
at the same time.
University, Grahamstown, South Africa offers an MSc. in Bioinformatics
and Computational Molecular Biology. Thanks to Natalie Twine.
Cathal Seoighe wrote a while back about the South African National
Bioinformatics Institute (SANBI). Ruediger Braeuning has since
written to point out that bioinformatics training in South Africa
has been radically reorganized. He says:
"A new institute, the National Bioinformatics Network (NBN), has been
created. We have nodes at Universities all over the country (UWC, UCT, SUN,
RU, UKZN, UP, WITS). Our main tasks are to:
- develop capacity in Bioinformatics
- perform world-class research
- support local Biotechnology initiatives
"We do offer courses on various topics in Bioinformatics ranging in
length from 3 days to several weeks. We also train Bioinformaticists on MSc,
PhD and post doc level. Undergraduate programs are currently being developed.
Bursaries are available. For more information visit our
South African National Bioinformatics Institute (SANBI) Honours Bioinformatics
Course at the University of the Western Cape. Next
year the same institute will be offering a Master's in bioinformatics---thanks
to Cathal Seoighe.
If you know of any other bioinformatics courses on the African continent
please feel free to mail me
Thanks to Jordan Patterson for the information that the University
of Alberta offers four-year Biology or Computer
Science degrees with a specialization in bioinformatics. The Faculty
of Computer Science there offers Master's and PhD training
Benjamin Horsman wrote to tell me that Simon Fraser University and the University
of British Columbia are collaborating on a new Bioinformatics
training program with the British Columbia Cancer Agency. The
program offers post-graduate diploma, Master's, and PhD
training in Bioinformatics. Now Simon Fraser University also offers a
joint major programme in Molecular Biology and Biochemistry (MBB)
and Computer Science in Bioinformatics. Thanks to Brittany
Nielsen for the info.
Thanks to Olga Likhodi for the information that Seneca College, Toronto offers
a post-graduate diploma
Peter Kublik informs me that from 2003/2004 the University of Calgary will offer
programme. He's part of the first intake.
The University of Waterloo, Department of Computer Science offers undergraduate and graduate courses in
bioinformatics. More information is here.
The Keck Graduate Institute claims that computational biology is
a core element of the curriculum in its Master of Bioscience degree.
Stanford University offers
academic and professional (distance-learning) MSs in Biomedical Bioinformatics as
well as its PhD programme. Thanks to Betty Cheng.
Thanks to Momchil Georgiev for the information that the University of California at San Diego offers
a graduate programme and to Dana Brehm for the news that there is now a new
bachelor's program. To quote her:
"[This is an] undergraduate, interdisciplinary program
for undergraduates leading to a B.S. degree. The new Bioinformatics
major is offered by the Division of Biology, and the departments
of Chemistry/Biochemistry, Computer Science and Engineering, and
Bioengineering. A student may choose to major in Bioinformatics in
any one of the four departments or division. The Division of Biology
currently offers two Bioinformatics courses, and with the advent
of the cross-disciplinary major, even more courses are going to be
taught 2002-03 and 2003-04."
University of California, Irvine Informatics in Biology and Medicine
David Delong wrote to me to point out that the College of Natural and Agricultural Sciences at
the University of California, Riverside is developing a "Center in
Genomics and Bioinformatics" which will offer a PhD
curriculum in genomics and bioinformatics from academic year
Catherine Velazquez says that The University of California, Santa Cruz offers a
new undergraduate BS course in bioinformatics.
They have a Frequently
Asked Questions. Now they also offer an MS/PhD in
Bioinformatics. Thanks to Kevin Karplus for the update.
Javier Rojas Balderrama emailed me to point out that Yale University offers a Bioinformatics and
Computational Biology track as
part of its combined Biological and Biomedical
Sciences graduate programme.
Georgia Institute of Technology Masters
of Science in Bioinformatics
According to Eric VanWieren, Georgia State University offers a Master's
and PhD in Computer Science with a focus
on bioinformatics. The university's Bachelor of Science in Computer
Science also offers a "Fundamentals of Bioinformatics" course.
The University of Illinois
at Chicago offers graduate programmes covering Bioengineering
Bioinformatics through its Bioengineering department as well as an
undergraduate course track. Thanks to Amit Sabnis.
IUPUI offers an MS programme
Indiana University also offers
an MS programme
Iowa State University offers
an Interdisciplinary Ph.D. Program in Bioinformatics
and Computational Biology (BCB).
The Jackson Lab, a world centre
of mouse genome informatics, offers a graduate training program.
Tim Young wrote to say that Johns Hopkins University in Maryland
offers an MS in Bioinformatics through the Zanvyl Krieger School
of Arts and Sciences Advanced Academic Programs and Whiting
School of Engineering Engineering and Applied Science Programs for
Professionals. They are also offering a Bioinformatics concentration
MS in Biotechnology program.
Boston University offers a graduate programme and so does its
partner Northeastern University.
Northeastern also offers a Graduate Certificate in
Brandeis University offers
both a Master
of Science in Bioinformatics and a Graduate
Certificate in Bioinformatics. Thanks to Matt Foster.
The Department of Computer Science at
UMass Lowell offers various degrees from Bachelor's through to
PhD. level in Computer Science with Bioinformatics options.
At the National Autonomous University of Mexico a doctoral program
in biomedical sciences is available. Their Computational Molecular
Biology Group is here.
The University of Minnesota offers
a graduate programme in bioinformatics.
Thanks to Anu Haniharan for drawing my attention to mixing up the
Minnesota and New Jersey paragraphs.
The University of Nebraska Lincoln offers an Interdisciplinary
The Graduate Program of the Pathology-Microbiology Department at
the University of Nebraska Medical
Center (University of Nebraska at Omaha)
offers a specialty
track in bioinformatics.
Rama Penta wrote to say that Stevens Institute of Technology offers
a Master's programme in Bioinformatics.
The message also states that the University of Medicine
and Dentistry New Jersey (UMDNJ) offers a programme in biomedical informatics.
Thanks to Anu Haniharan for drawing my attention to mixing up the
Minnesota and New Jersey paragraphs.
Moustafa wrote to say that Ramapo College in New Jersey is the only
school in New Jersey offering a
Bachelor's degree in bioinformatics.
New York State
The University at Buffalo has
been involved in establishing a "Center of Excellence
in Bioinformatics". It used to offer a range of courses in bioinformatics
and related subjects, but all the course links seem to be dead now.
Thanks to Jeff Ligas for the original notification.
Canisius College---also in
Buffalo, NY---has had a state-approved B.S. in Bioinformatics since 2001.
Thanks to Deb Burhans.
Cornell and Rockefeller Universities, together
with the Sloan-Kettering Research Institute
offer a "Tri-institutional program
in Computational Biology and Medicine". Thanks to Brant
Rensselaer Polytechnic Institute offers
both undergraduate and graduate
programmes in bioinformatics
Rochester Institute of Technology offers BS, MS and BS/MS programmes
in Bioinformatics. Thanks to Brandon H.
According to Maureen Downey, the College of Staten Island, part
of the City University of New York also offers a
challenging program in bioinformatics.
If you know of any other bioinformatics courses on the American
continent please feel free to mail me
Duke University's Center
for Bioinformatics and Computational Biology offers various bioinformatics programmes.
The North Carolina State University Statistical Genetics and Bioinformatics
Program offers Master's
Bioinformatics and PhDs in
The University of North Carolina at
Chapel Hill offers a programme in Bioinformatics and Computational Biology
Andrew Johnson writes: "There is a relatively new Biomedical
Informatics program in Ohio. (I'm entering the program in a few
months). Though the department stands
alone, it is in the College of Medicine at the Ohio
State Medical Center. Entrance is offered through a new Integrated Biomedical Sciences Graduate
The University of Pennsylvania offers
some of the best known and longest
established bioinformatics programmes at Bachelor's, Master's
and PhD levels. Thanks to Louis Licamele for pointing out my oversight
(I just assumed I'd already listed them!) He also points out
that Georgetown University is planning
bioinformatics courses too.
Tom Andrews, a student on the course, has written to me to tell
me that Texas A&M University at
Corpus Christi is currently offering a BS computer science degree
Jeremy Read told me that St. Edward's University in Austin offers a
B.S. in Bioinformatics.
The Keck Center for Computational
Biology---a joint venture of Baylor College of Medicine; University
of Houston; Rice University; University of Texas Health Science
Center, Houston; M.D. Anderson Cancer Center; and University of
Texas Medical Branch, Galveston---offers undergraduate (not
2003) and graduate level
training in Computational
The University of Texas, El Paso offers
George Mason University offers
both M.S. and PhD. programmes in Bioinformatics.
The Virginia Polytechnic Institute
and State University's Bioinformatics Institute offers graduate options in Bioinformatics. Thanks
to William S. Preissner for
correcting this entry.
Raymond Lau drew my attention to the Bachelor of Science degree in bioinformatics at the University of Hong Kong.
Niranjan Swaroop Sharma wrote to tell me about the Bioinformatics Institute
of India which is offering a
whole range of bioinformatics programmes and qualifications
in both regular and distance learning formats. I would have reported
on this earlier, but have not been able to view the site in Mozilla. I finally viewed the site
using Konqueror today (24Jul03).
Perhaps some tinkering with the ASP code is needed there...
Vaibhav Sinha wrote to tell me that the Institute of Bioinformatics and Applied
Biotechnology (IBAB) in Bangalore is offering bioinformatics courses.
Thanks to Surjeet Singh for drawing my attention to the Indian Institute
of Information Technology-Allahabad which runs
a Master of Technology (M. Tech Bioinformatics) degree.
According to Rahul Agrawal, the Indian Institute of Technology Delhi,
New Delhi provides courses in Biochemical
Engineering and Biotechnology. He adds that another branch of
the Institute, IIT Kharagpur also provides
various courses in
There is an Advanced (Graduate) Diploma in
Bioinformatics in the Bioinformatics Centre at the Jawaharlal
Madurai Kamaraj University in
Madurai, India claims to have been the first in the country to initiate
a bioinformatics programme and advanced diploma in bioinformatics
at its School of Biotechnology
Risabh Bhandari writes to say:
"The recently rechristened CBT (Center for Biochemical Technology) [link
dead 13Nov02] which is a CSIR Lab [in New] Delhi has started
a PG Diploma
in Bioinformatics in association with Informatics institute.
The course covers a large area in the field with [its] primary focus
on computational and programming concepts. The course is 6 months
in duration, [and] conducted at the national Head office of [the]
The University of Pune, Maharashtra offers its MSc. in Bioinformatics and Advanced
Diploma in Bioinformatics at the Bioinformatics Centre, India.
Uma Paresmeswaran wrote to say that SASTRA, which is based near Trichy,
Tamil Nadu, will be offering a B.Tech.Programme in
Bioinformatics from 2003/2004, the first institute in India offering
this course at the undergraduate level?
There is, according to Aditi Arur, an MSc distance education program in
Bioinformatics, offered by Sikkim Manipal University India.
Dr Amir Feisal Merican wrote to say that the Institute of Biological
Sciences, Faculty of Sciences, University of Malaya, Kuala Lumpur,
is offering a BSc (Bioinformatics) undergraduate degree
programme. Yam confirmed that this degree has been taught
for 3 years.
Kebangsaan University, Malaysia
(UKM) will start to offer a Bachelor's Degree in Bioinformatics
to its next intake, in July, 2003.
Thanks to Abdul Hameed for pointing out that two universities in
Pakistan---COMSATS Institute of Technology and
the Mohammad Ali Jinnah University---will
offer four-year Bachelor of Science degrees in bioinformatics
from September 2003.
The Bioinformatics Centre of
the National University of Singapore offers Undergraduate and
PhD programmes in conjunction with the life sciences departments
and research institutions at NUS.
Lam Ah Wah wrote to tell me that the Nanyang Technological University (NTU)
starts a BioInformatics undergraduate and part-time post-graduate
MSc course in Jul 2002. Be warned: their Web site has a hideous frame/window-based
"portal" which breaks half a dozen rules of good
interface design. Chua Hian Koon managed to find a better link, and I browsed from
there to the syllabus here.
If you know of any other bioinformatics courses in Asia please feel
free to mail me
The Research School
of Biological Sciences, at the Australian National University in
Canberra offers PhD., MSc. and Honours
programs in Bioinformatics.
You can obtain a Graduate
Certificate in Bioinformatics from Curtin University of Technology in
As of 2001 Flinders University in
Adelaide offers a Bachelor's
of Science in Bioinformatics.
The Biochemistry Department of La Trobe University in Victoria
also offers an undergraduate
course in Bioinformatics.
The University of Melbourne offers undergraduate
study in Bioinformatics. Thanks to Gad.
There are (according to H L View) PhD, MPhil and Honours programmes
in bioinformatics (plus a bioinformatics minor) available at Murdoch
University's Centre for Bioinformatics and Biological
Rachel Oh said that it is possible to study a near-bioinformatics programme
at QUT (Queensland University of Technology): the B. Sci (biotech
maj.) & IT (in software engineering & data comms) IF29. A copy of
the course is available by searching their Website.
The University of New South Wales in
Sydney offers a Bachelor
of Engineering in Bioinformatics.
According to Jonathan Watts, "Queensland University of Technology
in Brisbane QLD offers a
Bachelor of Applied Science Innovation, with a major in Bioinformatics" from
Sydney University in New South
Wales offers a Bachelor's
of Science and a postgraduate, Master
of Applied Science degree in Bioinformatics. Thanks to Dominic
Lau and Sebastien Gerega for the update.
If you know of any other bioinformatics courses in Australasia please
feel free to mail me
Thanks to Danushka for the information that the University of Auckland, New Zealand
has a BSc (Hons)
option is offered as part of degree courses at the Graz University of Technology (Technische
Universität Graz) in Graz, Austria.
A consortium including nearly all the French-speaking universities
of Belgium (Bruxelles, Liège, Louvain, Mons, Namur and Gembloux) is offering the "Inter-University
DEA/DES (Master) in Bioinformatics".
The Department of Engineering at
the Katholieke Universiteit Leuven
offers a Master of Bioinformatics degree.
The Bioinformatics Centre at The University
of Copenhagen offers a two-year masters program in bioinformatics.
Thanks to Thomas Litman.
The Technical University of Denmark, Center
for Biological Sequence Analysis offers a two-year International
MSc. in bioinformatics.
Syddansk Universitet (The University
of Southern Denmark) offers both BSc- and MSc- level Bioinformatik / Experimental
Bioinformatics. Thanks to Fiona Nielsen for the updated link---"Center for Experimental Bioinformatics".
The Finnish Graduate School in Computational
Biology, Bioinformatics, and Biometry or "ComBi" is
a joint venture of the University of Helsinki (English), the University of Turku (English) and the University
of Tampere (English).
Fabio Pardi writes that the Université Paris VII offers a DEA
en Analyse de Génomes et Modélisation Moléculaire.
Thanks to Brant Inman again for this link to
Isabelle da Piedade kindly provided this list of Master's and PhD
programmes in France:
Thanks to Amelie Stein for several of these entries.
The Technische Fachhochschule
Berlin (University of Applied Science) offers an
MSc in Bioinformatics and the Freie Universität Berlin (Free
University) offers both an
MSc. and a BSc. in Bioinformatics. Thank you to Sebastian Kurscheid
for this information.
Alexandra Reitelmann wrote to say that Bonn-Aachen
International Center for Information Technology (B-IT) is offering
a new English-language Master's
programme in Life Science Informatics. The B-IT is a joint venture
between the University of Bonn, the RWTH Aachen University, the University of Applied
Sciences Bonn Rhein-Sieg, and Fraunhofer Institutszentrum Birlinghoven
The Institut für Informatik at Johann Wolfgang Goethe-Universität
Frankfurt am Main offers a programme in Bioinformatik.
The Fachhochschule Bingen also
offers a bioinformatics
degree. Thanks to Manuel Schmidt.
can be studied at the Fachhochschule (University of Applied Sciences)
Oldenburg/ Ostfriesland/Wilhelmshaven. Thanks to Gerd Klaassen.
Bioinformatics is taught at Friedrich-Schiller-Universität,
Jena. Thanks to Lisa Mullan for the updated link.
Zentrum für Bioinformatik at the Universität Leipzig teaches Bioinformatik.
You can do a PhD in bioinformatics in the Department of Computational Molecular
Biology at the Max Planck Institute for Molecular Genetics.
Thanks to Martin Okrslar---and to Pooja Jain for the correction
to my broken link.
The Technische Universität München and Ludwig-Maximilians-Universität München also
The Universität Tübingen (University
of Tübingen) also offers Bioinformatik.
Here are their own Frequently
Asked Questions (in German only) about studying bioinformatics
Tobias Kailich kindly pointed out that FH Weihenstephan in Freising
(near Munich) offers opportunities to study Bioinformatik / Bioinformatics.
Conor Meehan wrote to say that the National University of Ireland Maynooth
set up a four-year Bachelor's course in Computational
Biology and Bioinformatics two years ago.
Ben Gurion University,
Beer Sheva offers places on the Bioinformatics
Track to a select few of its admitted students to the School of Computer Science.
Tel Aviv University offers a
BSc. in Bioinformatics. Thanks to Racheli Zakarin for the link.
The famous Weizmann Institute in
Rehovot teaches an MSc. called "Multidisciplinary Program in
Computational Biology and Bioinformatics". This PDF document has
more information. Gad Abraham, who told me about this, points out
that "all studies there are conducted in English and that there
are no tuition fees"
The Netherlands (Holland)
The Centre for Molecular and Biomolecular Informatics (CMBI) at
the University of Nijmegen offers
a degree in bioinformatics. This is a one or two year course leading
to a degree with the formal title of "Master in Life Sciences",
but the subtitle "Bioinformatics".
The Institutt for informatikk (Department of Informatics)
of the University of Bergen, Norway offers a
Master's degree in bioinformatics.
There is a post-graduate programme in bioinformatics organized
by the Instituto Gulbenkian de Ciência
(IGC) and the Faculty of Sciences of the University
of Lisboa. (Thanks to Pedro Fernandes.)
Francisco Rocha wrote to say that Escola Superior de Biotecnologia
(ESB) teaches a bioinformatics programme [follow the
link labelled "Bioinformática"] in both Lisbon and
Oporto. The teaching institution is the Universidade Católica do Porto.
Bjorn Olsson writes that, as well as a 4-year Master's Degree
in Bioinformatics, the University of Skövde offers a number
of short courses and allows computer science master's students
to include bioinformatics in their degree. There is more information here.
Daniel Nilsson drew my attention to the MSc in Bioinformatics Engineering
in Uppsala. Thanks to Erik Kanders for correcting the link.
There are also opportunities to study bioinformatics on the "normal" biotech
courses in Gothenburg, Linköping and Umeå.
The Stockholm Bioinformatics Centre, Stockholm University, offers PhD-level
shorter courses in bioinformatics subjects.
The School of Mathematical and Computing Sciences at Chalmers offers undergraduate and
Master's programmes in bioinformatics. Thanks to Samuel
Fabio Pardi wrote that the Swiss Institute of Bioinformatics offered
a Master's degree (DEA). It was
a collaboration between the Swiss Institute of Bioinformatics and
three faculties of the Universities of Geneva and Lausanne. According
to Javier Rojas Balderrama this programme is now closed.
In 2002 I prepared a review
of bioinformatics education in the UK for the journal Briefings
in Bioinformatics. The article ends with a detailed listing of
all current and some future undergraduate and graduate courses
in bioinformatics in the UK as of September 2002, along with links.
You can read
a preprint here.
Bioinformatics is among the specialisms available on Aberdeen University's MSc/PgDip
The University of Abertay,
Dundee has an MSc./PG Dip in Bioinformatics.
Thanks to Dr Nagesh.
Birkbeck College is a British centre
with a proud tradition in educating working and/or mature students
to the highest academic standards.
The University of Birmingham and UMIST offers undergraduate courses
Cambridge University is planning
an MPhil in
Computational Molecular Biology to start in 2004-2005. Thanks
to Antony Quinn for
In October 2004, Cardiff University started
two different courses: Bioinformatics or
Genetic Epidemiology and Bioinformatics either full-time or part-time
and at MSc/PG Cert or Diploma level. Thanks to Ian Brewis, who pointed
out that Cardiff's programme is distinguished by offering students
a stronger thread of genetic epidemiology for those students interested
In April 2002 City University's Bioinformatics
group moved to the University of Glasgow Department of Computer Science.
Thanks to Will Bachelor for alerting me to the existence of this
group. City still offers MScs in Pharmaceutical
Information Management and Health Informatics
Cranfield University at
Silsoe offers an MSc. in Bioinformatics.
Hussein Zedan pointed out that De Montfort University, Leicester
was going to start its MSc. in Bioinformatics in
September 2003 in both full- and part-time formats.
The University of East Anglia offers an MSc. in
Bioinformatics. Thanks to Dr Nagesh.
Edinburgh University offers
in Quantitative Genetics and Genome Analysis and an MRes (MSc./Diploma
by Research) in Life Sciences in which you can specialize in Quantitative
trait analysis and genomics.
The University of Exeter offers various graduate programmes (MSc/MRes/PgCert/PgDip)
in Bioinformatics. (Thanks to M Antro for an update.)
The University of Glasgow offers
an MRes in Bioinformatics.
In November 2004, Fiona Croll alerted me to Heriot-Watt University's Bioinformatics
(IT) MSc jointly taught by the university's School of Mathematical
and Computer Sciences and its School of Life Sciences.
Imperial College offers
a new MSc in Computational
Genetics and Bioinformatics and MRes Biomolecular
There are MRes studentships available
on the courses at Leeds University.
On 20Jan03 UKeU, the UK government-backed company set up to provide
online degrees from UK universities to students worldwide, announced
a new Master's level programme in Bioinformatics from the
Universities of Leeds and Manchester. (Thanks again to Jo Wixon for
University of Liverpool M.Sc.,
Postgraduate Diploma and Postgraduate Certificate in Biosystems & Informatics
Manchester University also teaches
bioinformatics to its undergraduates as well as offering a
taught MSc. course in
Newcastle University's MRes
in Bioinformatics began in September 2003.
The University of Nottingham's undergraduate biochemistry
degrees feature bioinformatics prominently.
Oxford University has a Master's
degree course with an interesting flexible structure. Thank
you to Helen Parkinson and
Clare Hayes for this information.
Thank you to David
Parkinson (no relation to Helen, above) for pointing out to
me that for the past two years Sheffield Hallam University has offered
in Bioinformatics at its Graduate School
in Science, Engineering and Technology.
The University of Sheffield Centre for
Bioinformatics and Computational Biology offers taught courses
related to bioinformatics.
Rafiu Fakunle emailed to tell me that Queen Mary, University of London offers
degree in bioinformatics.
Royal Holloway College in the
University of London offers an MSc. in Computer
Science by Research in which a bioinformatics specialism is available.
University College London (UCL) offers a final year undergraduate
Proteins and Computers".
Together with Harrow School of Computer Science, The University of Westminster, a new
university in London, offers an MSc. in Bioinformatics as both a full- and part-time
course. Again this is aimed primarily at graduates of the biological
York University's Department
of Biology offers Master's courses
and PhDs in both computational
biology and biomolecular science.
If you know of any other bioinformatics courses in Europe please
feel free to mail me
Many visitors to the FAQ ask about bioinformatics distance learning.
Eventually I will try to gather together all those courses on this
list that can be taken remotely---if I ever have the time. Unfortunately
I don't at the moment. All I can suggest is that you examine
the courses yourself through the links provided in the FAQ. Many
can be taken over the Net or offer components that can be studied
at a distance. (And, if you do compile such a list for yourself,
do please email it to me and I will post it here for the benefit
of our users with, as usual, a full credit for your efforts.)
If you are thinking of studying at a UK institution you might want
to search through the pre-print of
my review of UK bioinformatics education for the word "distance".
At the moment I think the courses at Birkbeck, Exeter and Oxford
offer either full or part distance learning options.
Careers: How can I become a
How can I get involved?
If you want to get involved in bioinformatics, now is an exciting
time, but (certainly for less senior practitioners) it looks as though
demand for bioinformaticians is currently falling, partly for general
economic reasons, partly, perhaps, because drugs companies in particular
have been disappointed with the pay-off from their investment in
This section is opinionated; there are people in the field, both
computer scientists and biologists, who I would love to provoke (or
convert). If you are a newcomer, and especially if you come from
one of bioinformatics' component pure disciplines, I hope my ranted
warnings will help you to avoid the mistakes of your predecessors---and
I write as one of the mistaken. David S. Roos put
it well in his review in
the journal Science:
"Lack of familiarity with the intellectual questions
that motivate each side can also lead to misunderstandings. For example,
writing a computer program that assembles overlapping expressed sequence
tags (EST) sequences may be of great importance to the biologist
without breaking any new ground in computer science. Similarly, proving
that it is impossible to determine a globally optimal phylogenetic
tree under certain conditions may constitute a significant finding
in computer science, while being of little practical use to the biologist."
How can I get
involved?---I am a "newbie"
Please read the education section above for information about some
of the places you can currently study bioinformatics. Please
do not direct questions about eligibility, course quality or admissions
policy to me; ask the individual institutions directly.
If you are a high school student / sixth former, think about taking
an interdisciplinary computational biology or bioinformatics bachelor's
degree of the sort offered at, for example, Manchester University
in the UK or UPenn in the States. Don't worry if you can't
find a place on such a course or there isn't one nearby; perhaps
the best way to approach this subject is from two sides. Do a bachelor's
degree in one area while taking a healthy interest in the other---or
(if you can afford to) complement a first degree in one part of the
discipline with a second degree in the second.
If you already have a degree in a biological discipline there are
similar Master's courses---both interdisciplinary (e.g. Birkbeck's
in London) and conversion type courses---for biologists or others
to learn computer science, for example.
If you are currently doing a computer science or biology PhD, try
to take advantage of the opportunity to take courses in the "other" discipline.
How can I get involved?---I am a biologist
To a biologist I would say: take as many real computing courses
as you can. It's important not just to learn a programming language,
but also to learn the discipline of computing; to structure
and document your work in a rigorous way. What courses you take might
be directed by the kind of work you are interested in doing when
you graduate---whether you see yourself supporting bioinformatics
applications or building them. For the former you need all-round
familiarity with the programs themselves and the hardware and software
needed to run them---plus your existing understanding of biology.
For the latter you need to learn a structured programming language
and the principles of good program design---plus the ability to talk
to and understand biologists.
Courses biologists might consider taking:
Of all the computing courses available it is most important
that you have a proper introduction to the UNIX operating system(s).
Most current bioinformatics software (especially the free stuff)
runs on "open" platforms like Linux and
the Web. The UNIX philosophy is elegant, powerful, and frustrating.
Master it and you will save a lot of time.
Learn some maths. Basic statistics, logic/set theory and a little
calculus would be my recommendation. Many practising biologists
have little or no grasp of elementary concepts like statistical
significance, permutations and combinations and the principles
of good experimental design. Logic will come in handy at the
very least if you want to query databases in an intelligent way.
If you're interested in development, learn a real programming
language: Pascal, C(++), Java or Fortran.
Perl and HTML are the stuff that holds the Web together. A grasp
of these is essential for a lot of the Web/database work being
done by many bioinformaticians at the moment.
Good old BASIC can be very useful as an introduction to programming
or as a tool in its own right, but none of these latter languages
is built to crunch numbers and tackle real world biological problems---which
isn't to say people don't try...
How can I get involved?---I am a
One thing that I will emphasise repeatedly in this section is the
simple value of doing some "proper" biological laboratory
science. I have sat through many talks during which a bioinformatics "scientist" describes
in great detail how his---it's usually "his"---application
of a trendy mathematical tool offers a supposed insight into a (sometimes
supposed) biological problem. Nine times out of ten I know that this
method will never be so much as sneezed on by a practising biologist.
Quantitative scientists sometimes talk about their interest in studying
some aspect of "God's mind". Biologists, in contrast,
are interested in "Mother Nature". You might meditate on
God in the hope of some revelation, but to understand Nature you
have to meet her in the flesh. You are as likely to be useful to
biologists working in isolation at the keyboard as you are to conceive
with your clothes on. Desk-bound bioinformaticians have written
code that has turned out to be popular with biologists, but almost
always because they have collaborated with biologists.
Courses quantitative scientists might consider taking:
- Molecular biology
"MoBi" was the bioinformatics of its day; desperately
fashionable, the province of new, higher-paid practitioners and
considered with slight suspicion by more traditional biologists.
It was once a great achievement to sequence a modest stretch
of DNA; now it's a job for robots. Today the technology of
molecular biology is very well established. Scientists can buy
kits to perform the sort of genetic manipulations that would
make your parents' jaws drop. Some of the kits are so simple
your small children could use them (with a modest amount of training
Despite the profusion of commercial kits, there is still a requirement
for real skill in molecular biology and the general level of
scientific understanding required to be a good biological scientist---rather
than just completing a practical class---doesn't come easy.
Living matter, the stuff you have to work with, is unpredictable
and responds slowly---except when it's dying. Even supposedly
fast-growing bacteria can take a long time to yield up their
Now, fashions in biomedical research are shifting from molecular
biology back to cell biology and protein biochemistry, but it's
well worth offering yourself up as a volunteer for some vacation
work in a molecular biology lab. The term is now more often used
to refer to the technological tools provided by MoBi to biology
in general, rather than to fundamental research in the field
itself. Those tools are common to a vast array of different kinds
of research, from archaeology to zoology.
- Protein (bio)chemistry
Protein (bio)chemistry is experiencing a revival. Proteins are
still more delicate and fussy than nucleic acids. The same advice
that applies to molecular biology applies to protein biochemistry.
That stuff bioinformatics people refer to as "wet lab science" is
much harder than it looks.
You might find it more difficult to get access to a good protein
lab than a good molecular biology lab and do protein science
with real wizards, but the very least you can do is read about
the theoretical aspects of the subject.
For insights into the principles of protein structure, try,
for example, Carl Branden and John Tooze's "Introduction
to Protein Structure" [Garland ISBN 0-8153-2305-0]. Physicists
in particular might find the lack of general unifying principles
in this area overwhelming. Unfortunately there's no substitute
for acquiring a "feel" for the subject by examining
a lot of examples. Still the most critical stages in the successful
prediction of protein structure from sequence are those requiring
Thomas E. Creighton has been responsible for a range of standard
texts on protein chemistry. If you are working in a protein lab
you are likely to come across his "Protein Function : A
Practical Approach" [ISBN 019963615X] and the rather more
expensive and theoretical "Proteins : Structures and Molecular
Properties" [ISBN 071677030X]
- Evolutionary biology
It's a worn quote, but worth repeating:
"The mechanisms that bring evolution about certainly need study
and clarification. There are no alternatives to evolution as history
that can withstand critical examination. Yet we are constantly learning
new and important facts about evolutionary mechanisms. Nothing in
biology makes sense except in the light of evolution."
Theodosius Dobzhansky in "American Biology Teacher" vol.35
Darwin's theory is one of the simplest and most misunderstood
in science. Start with a good layperson's introduction, Richard
Dawkins' "The Selfish Gene" (and remember: it's
a metaphor, stupid) or Steve Jones' paraphrasing
of Darwin's original "On the Origin of Species", "Almost
Like a Whale". All biologists agree on the underlying principles,
but they are nearly ready to kill one another over the details.
After reading a decent book on evolutionary biology you should
have at least a handful of good questions. Now you are ready
to take a class in the subject. Take your questions with you.
You'll probably start an argument---or a fight.
You might also like to peruse Cynthia Gibas's answers to
similar questions from computational scientists on the O'Reilly Web site.
These damned biologists are making me use Word instead of LaTeX
to write up---what can I do?
More general advice
Use the software
Get access to an installation of EMBOSS and/or Staden and get someone
to lead you through the tools available. RasMol is a simple,
but powerful and elegant molecular imaging program which can teach
you a great deal about biological macromolecules; try a tutorial.
Get out on the Web and do some productive surfing for a change
:-) . The best starting point is the Human Genome Mapping Project
Resource Centre's "GenomeWeb". There's so much
stuff out there -- and most of it is free to academics.
Where can I find Bioinformatics jobs?
Start here at Bioinformatics.Org's Job
Then move on to the appointments / careers sections of the major
scientific journals, or, better, search their Web jobs pages with "bioinformatics":
Appropriately for a Web-dependent discipline, there are a variety
of specialist commercial Web sites which carry bioinformatics jobs:
There are also a number of companies actively recruiting in the
area. Here are a few:
This section includes some simple rules-of-thumb to apply when performing
common bioinformatics tasks. I try to give a reference to a more
detailed source of guidance where I know of one.
How do I find a sequence?
The most common task in bioinformatics must be the acquisition of
some bioinformatics data on which to operate. Usually this is in the
form of a nucleic acid or protein sequence, stored as characters
in the appropriate alphabet together with a header of related information:
for example, some kind of unique identifying number, the species from
which the original biological substrate was obtained, the names of
any authors who published the sequence and so on.
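As a rough illustration of the "header plus sequence" idea, here is a minimal Python sketch. It assumes the widely used FASTA text layout (a header line starting with ">" followed by one or more lines of sequence characters). The function name parse_fasta and the example record, including its identifier and species, are invented for illustration and do not come from any real database.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header line (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins: remember its header text.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated onto the current record.
            records[header] += line
    return records


# Hypothetical record: identifier, description and species are invented.
example = """>ID0001 example protein [Imaginary species]
MKTAYIAKQR
QISFVKSHFS
""".splitlines()

records = parse_fasta(example)
for header, sequence in records.items():
    print(header, len(sequence))
```

Real database entries usually carry much richer annotation than a single header line, so treat this only as a way of seeing the basic structure.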
You may have already generated your own sequence data experimentally.
In this case you are likely to want to find sequences which are identical
or similar (and therefore possibly related) to yours. The task is
then one of similarity search.
...I have a description.
A paradoxical problem generated by the success of the bioinformatics
revolution is the increasing difficulty of navigating the huge amount
of data available. Once you could print out most of the existing
sequence databases onto paper and cram them into a single binder.
Now a search for "actin" alone will pull out hundreds and
hundreds of sequences. The key to finding what you want is to develop
your own discriminatory skills rather than rely on computers to figure
out what it is you're really after.
Make sure you are clear about your aim first. If you are looking
for a sequence for a specific scientific purpose then you might be
best to start with a relevant human-generated publication. For example,
you have cloned a gene which is part of a well-characterised biochemical
pathway and you want to find other sequences of the same functional
gene product in other species (orthologues) Entrez PubMed is your
PubMed is a huge and very comprehensive database of the biomedical
scientific literature, created by the U.S. National Library of Medicine (NLM). Entrez PubMed
is another indispensable resource of the U.S. National Center for Biotechnology Information (NCBI).
Both are part of the U.S. Department of Health and Human Services'
National Institutes of Health.
Swiss-Prot is curated by human beings.
Use SRS at the RFCGR
[XXXX INSERT DETAILED ADVICE HERE]
Use Boolean logic
[XXXX INSERT DETAILED ADVICE HERE]
[XXXX INSERT DETAILED ADVICE HERE]
an accession number.
[XXXX INSERT DETAILED SEQUENCE ADVICE HERE]
This section will be expanded---and there will be a more basic and
detailed explanation for novice searchers, but, in the meantime,
here are the top tips cribbed from the excellent paper by
Hugh B. Nicholas Jr., David W. Deerfield II and Alexander J. Ropelewski:
- Use a local favourite program on the Web server of your choice.
- Use at least two and preferably three similarity tables.
- If using Smith-Waterman or FASTA algorithms, ensure that the gap opening
penalty is high enough.
- If the initial search finds no or insufficient matches, repeat it with a
highly diverged matrix and/or with a Smith-Waterman-based server.
- If this doesn't work, try switching from a PAM matrix to a BLOSUM matrix.
...I'm not sure whether or not to use
Hugh, David and Alexander again on when not to use the default search
parameters provided by a server.
- ...when the homologues you are looking for to match your query are highly
- ...when the query or matches are short.
- ...when you are only interested in a specific (in the sense of "species")
subset of database matches with a particular evolutionary relationship to
your sequence of interest---a relationship not implied by the default settings.
How can I align two sequences?
This section will also be expanded for newbies; until then, here
are Hugh, David and Alexander's tips for alignment:
- Use an appropriately divergent matrix (I'll be adding a table soon
to explain this).
- Reduce your gap penalty relative to that you used for your database search.
- Use the MaxSegs/Waterman-Eggert version of the dynamic programming algorithm
to provide the best local alignment and also to search for repeats.
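To make the idea of a dynamic programming local alignment a little more concrete, here is a minimal Python sketch of the basic Smith-Waterman recurrence. It is not the MaxSegs/Waterman-Eggert variant recommended above: it reports only the single best local score and does not search for repeats. The match, mismatch and gap values are arbitrary assumptions chosen for illustration, and a single linear gap penalty stands in for the separate opening and extension penalties that real servers use.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] holds the best score of a local alignment that ends
    # at a[i-1] and b[j-1]; row 0 and column 0 stay at zero.
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = h[i - 1][j - 1] + pair   # align a[i-1] with b[j-1]
            up = h[i - 1][j] + gap          # gap in b
            left = h[i][j - 1] + gap        # gap in a
            # The zero lets a new local alignment start anywhere.
            h[i][j] = max(0, diag, up, left)
            best = max(best, h[i][j])
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Making the gap argument less negative corresponds to the tip about reducing the gap penalty: the recurrence then tolerates more indels before it abandons an alignment.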
How can I predict the function of a gene
[XXXX INSERT FUNCTION PREDICTION ADVICE HERE]
How can I predict the structure of a sequence?
You could start with any one of these excellent guides (listed strictly
in alphabetical order):
How can I simulate a biomolecule?
Here's Peter J. Steinbach's "Introduction
to Macromolecular Simulation"
How can I write up?
Go here to download
some detailed advice. Go here for
Glossary of bioinformatics
Here I attempt to define some common terms in bioinformatics. I
have tried to balance clarity, brevity and rigour. Let me know if
I let one of these priorities over-ride the others.
What is an alignment?
When two symbolic representations of DNA or protein sequences are
arranged next to one another so that their most similar elements
are juxtaposed they are said to be aligned. Many
bioinformatics tasks depend upon successful alignments. Alignments
are conventionally shown as traces.
In a symbolic sequence each base or residue monomer in each sequence
is represented by a letter. The convention is to print the single-letter
codes for the constituent monomers in order in a fixed font (from
the N-most to C-most end of the protein sequence in question or from
5' to 3' of a nucleic acid molecule). This is based on the
assumption that the combined monomers are evenly spaced along the single
dimension of the molecule's primary structure. From now on I
shall refer to an alignment of two protein sequences.
Every element in a trace is either a match or a gap.
Where a residue in one of two aligned sequences is identical to its
counterpart in the other the corresponding amino-acid letter codes
in the two sequences are vertically aligned in the trace: a match.
When a residue in one sequence seems to have been deleted since the
assumed divergence of the sequence from its counterpart, its "absence" is
labelled by a dash in the derived sequence. When a residue appears
to have been inserted to produce a longer sequence a dash appears
opposite in the unaugmented sequence. Since these dashes represent "gaps" in
one or other sequence, the action of inserting such spacers is known
A deletion in one sequence is symmetric with an insertion in the
other. When one sequence is gapped relative to another a deletion
in sequence a can be seen as an insertion in sequence b.
Indeed, the two types of mutation are referred to together as indels.
If we imagine that at some point one of the sequences was identical
to its primitive homologue, then a trace can represent the three
ways divergence could occur (at that point).
Biological interpretation of an alignment
A trace can represent a substitution:
A trace can represent a deletion:
A trace can represent an insertion:
For obvious reasons I do not represent a silent mutation.
Traces may represent recent genetic changes which obscure older
changes. Here I have only represented point mutations for simplicity.
Actual mutations often insert or delete several residues.
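The column types described in this entry can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two rows are treated as an "ancestral" and a "derived" sequence (a trace alone does not tell you which is which), dashes mark gaps, and both the function name classify_columns and the example sequences are invented.

```python
from typing import List


def classify_columns(ancestral: str, derived: str, gap: str = "-") -> List[str]:
    """Label each column of a two-row trace."""
    if len(ancestral) != len(derived):
        raise ValueError("both rows of a trace must have the same length")
    labels = []
    for x, y in zip(ancestral, derived):
        if x == gap and y == gap:
            raise ValueError("a column cannot be a gap in both rows")
        if y == gap:
            labels.append("deletion")      # residue lost from the derived row
        elif x == gap:
            labels.append("insertion")     # residue added to the derived row
        elif x == y:
            labels.append("match")
        else:
            labels.append("substitution")
    return labels


# Invented example rows, written with single-letter amino acid codes.
print(classify_columns("ACDEF-H", "ACNE-GH"))
```

Swapping the two arguments turns every "deletion" into an "insertion" and vice versa, which is the symmetry that the term "indel" captures.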
What is a
Thanks to Bioinformatics.Org member Ravi Jain for the following
answer, which I present verbatim.
DNA microarrays consist of thousands of immobilized DNA sequences
present on a miniaturized surface the size of a business card or
less. Arrays are used to analyze a sample for the presence of gene
variations or mutations (genotyping), or for patterns of gene expression,
performing the equivalent of ca. 5 000 to 10 000 individual "test
tube" experiments in approximately two days of time.
Robotic technology is employed in the preparation of most arrays.
The DNA sequences are bound to a surface such as a nylon membrane
or glass slide at precisely defined locations on a grid. Using an
alternate method, some arrays are produced using laser lithographic
processes and are referred to as biochips or gene chips. The composition
of DNA on the arrays is of two general types:
- Oligonucleotides or DNA fragments (approximately 20-25 nucleotide bases).
These arrays are frequently used in genotyping experiments. The sequences
of alternate gene forms may be included for detection of mutations or normal
- Complete or partial cDNA (approximately 500-5 000 nucleotide bases). These
arrays are generally used for relative gene expression analysis of two or
more samples; however, oligonucleotide-based arrays may also be used for
DNA samples are prepared from the cells or tissues of interest.
For genotyping analysis, the sample is genomic DNA. For expression
analysis, the sample is cDNA, DNA copies of RNA. The DNA samples
are tagged with a radioactive or fluorescent label and applied to
the array. Single stranded DNA will bind to a complementary strand
of DNA. At positions on the array where the immobilized DNA recognizes
a complementary DNA in the sample, binding or hybridization occurs.
The labeled sample DNA marks the exact positions on the array where
binding occurs, allowing automatic detection. The output consists
of a list of hybridization events, indicating the presence or the
relative abundance of specific DNA sequences that are present in
What is a homologue?
Homology is a much-misused term and existed in biology long before the
notion of protein sequences. Strictly, homology cannot be qualified;
it is not correct to state that two proteins are "30% homologous" with
each other, for example. If we could look back far enough in the
evolutionary histories of any two molecules under comparison, we
would be guaranteed to find a common ancestor eventually, but this
is not true homology. An example of this would be the relationship
between two variants of a single ancestral enzyme resulting from
a gene duplication event.
As a rule-of-thumb, true homology should be assigned only when the
feature which leads us to suspect a relationship between molecules
is one we consider likely to have derived from the molecules' common
ancestor. To quote Page and Holmes [Molecular Evolution:
A Phylogenetic Approach, Roderick D. M. Page and Edward C. Holmes;
Blackwell Scientific; ISBN 0865428891]:
"The classic molecular example is the parallel evolution
of amino acid sequences in the lysozyme enzyme in leaf-eating langur
monkeys and in cows. Both animals have independently evolved foregut
fermentation using bacteria, and in both cases lysozyme has been
recruited to degrade these bacteria. Therefore, langur and cow lysozymes
are homologous as genes; however, as digestive enzymes they are not
homologous because this functionality was not present in the ancestral
Although sequence determines structure, it is possible for two proteins
to have very different sequences and functions and share a common fold.
In fact, most gene products with similar three-dimensional structures
are insufficiently similar at the sequence level for true homology
or analogy (non-homologous similarity) to be distinguished.
What is an ontology?
Biology is changing from being a descriptive to an analytical science.
Accurate and consistent descriptions are, however, vital to analysis.
The idea of ontologies has been co-opted from philosophy and artificial
intelligence to partition bioinformatic knowledge in a way which
can be reliably navigated by computers.
This preprint of a review
by Ele Holloway of
the European Bioinformatics Institute gives
a more detailed insight into the varied approaches to ontologies
in bioinformatics by covering a recent meeting on the subject. The
final version appears in Comparative
and Functional Genomics.
What is a scoring matrix?
The following explanation was edited from a contribution by Amelie
The aim of a sequence
alignment is to match "the most similar elements" of
two sequences. This similarity must be evaluated somehow. For example,
consider the following two alignments:
They seem quite similar: both contain one "indel" and
one substitution, just at different positions. However, if we think
of the letters as amino acid residues rather than elements of strings,
alignment (a) is the better one, because isoleucine (I) and leucine
(L) are similar sidechains, while tryptophan (W) has a very different
structure. This is a physico-chemical measure; we might prefer these
days to say that leucine simply substitutes for isoleucine more frequently---without
giving an underlying "reason" for this observation.
However we explain it, it is much more likely that a mutation changed
I into L and that W was lost, as in (a), than that W changed into
L and I was lost. We would expect that a change from I to L would
not affect the function as much as a mutation from W to L---but this
deserves its own topic.
To quantify the similarity achieved by an alignment, scoring
matrices are used: they contain a value for each possible
substitution, and the alignment score is the sum of the
matrix's entries for each aligned amino acid pair. For gaps
(indels), a special gap score is necessary---a very simple
one is just to add a constant penalty score for each indel. The optimal
alignment is the one which maximizes the alignment score.
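The scoring scheme just described (sum the matrix entry for every aligned pair and add a constant penalty for every indel) can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the two example alignments are invented stand-ins and not necessarily those discussed earlier in this entry, the handful of scores in TOY_SCORES are illustrative values only, and the gap penalty of -4 is arbitrary. Use a published PAM or BLOSUM matrix for real work.

```python
from typing import Dict, Tuple

# Illustrative substitution scores only; a real matrix has an entry
# for every pair of the twenty amino acids.
TOY_SCORES: Dict[Tuple[str, str], int] = {
    ("K", "K"): 5, ("A", "A"): 4,
    ("I", "L"): 2, ("L", "W"): -2,
}


def pair_score(x: str, y: str, scores: Dict[Tuple[str, str], int]) -> int:
    """Look up a substitution score, treating the matrix as symmetric."""
    if (x, y) in scores:
        return scores[(x, y)]
    return scores[(y, x)]


def alignment_score(row_a: str, row_b: str,
                    scores: Dict[Tuple[str, str], int],
                    gap_penalty: int = -4, gap: str = "-") -> int:
    """Sum pair scores over aligned columns, with a constant penalty per indel."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            total += gap_penalty
        else:
            total += pair_score(x, y, scores)
    return total


# Hypothetical alignment (a): I is substituted by L and W is lost.
score_a = alignment_score("KIWA", "KL-A", TOY_SCORES)
# Hypothetical alignment (b): W is substituted by L and I is lost.
score_b = alignment_score("KIWA", "K-LA", TOY_SCORES)
print(score_a, score_b)
```

With these toy numbers the alignment that pairs I with L scores higher than the one that pairs W with L, which matches the intuition given above.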
PAM matrices are a common family of score matrices. PAM
stands for Point Accepted Mutation (sometimes expanded as Percent Accepted Mutations),
where "accepted" means that the mutation has been adopted
by the sequence in question. Thus, using the PAM 250 scoring
matrix means that about 250 mutations per 100 amino acids may have
happened, while with PAM 10 only 10 mutations per 100 amino acids
are assumed, so that only very similar sequences will reach useful
PAM matrices contain positive and negative values: if the alignment
score is greater than zero, the sequences are considered to be related
(they are similar with respect to the scoring matrix used); if the
score is negative, it is assumed that they are not related. "Relationship" here
may refer to evolution as well as functionality of the proteins,
and of course the choice of the matrix affects the result, so one
has to make an assumption on the similarity of the sequences in order
to receive a useful result: rather distant sequences won't produce
a good alignment using PAM 10, and the optimal alignment of two very
similar sequences with PAM 500 may be less useful than that with
Finally, it should be noted that some scoring matrices use similarity to
evaluate alignments while others use distance, so be
careful when interpreting the results!
After this brief and necessarily superficial overview, you
might want to read some more about scoring matrices.
Thanks to the following people for questions:
- Jonathan Després
- Salma B. Rafi
- Amelie Stein
- Michael Wentzel
Thanks to the following people for corrections, links and sources:
- Anuradha Acharya
- Rahul Agrawal
- Ken Allen
- Tom Andrews
- M Antro
- Aditi Arur
- Paulo Almeida
- Jeff Ames
- Jim Auer
- Will Bachelor
- Justin Baker
- Javier Rojas Balderrama
- Nigel Barber
- Risabh Bhandari
- Ruediger Braeuning
- Ian A Brewis
- Pierre Bushel
- Debra Burhans
- Andrea Cabibbo
- Chua Hian Koon
- Betty Cheng
- Leonard Crane
- Fiona Croll
- Paul Curley
- David Delong
- Maureen Downey
- Steffen Durinck
- Rafiu Fakunle
- Pedro Fernandes
- Matthew Foster
- Momchil Georgiev
- Sebastien Gerega
- Jesmminder Gill
- Georges Grinstein
- Mike Goodrich
- Brandon H.
- Maximilian Haeussler
- Abdul Hameed
- Anu Haniharan
- Samuel Hargestam
- Clare Hayes
- H. L. Hiew
- Ele Holloway
- Matt Hope
- Benjamin Horsman
- Brant Inman
- Pooja Jain
- Andrew Johnson
- Tobias Kailich
- Erik Kanders
- Kevin Karplus
- Beatrice Kilel
- Gerd Klaassen
- David Klemitz
- Peter Kublik
- Sebastian Kurscheid
- Dominic Lau
- Raymond Lau
- Darren Lee
- Wentian Li
- Louis Licamele
- Jeff Ligas
- Olga Likhodi
- Thomas Litman
- Steve Masticola
- Matt at ColorBasePair.com
- James McInerney
- Conor Meehan
- Junaid A. Mehta
- Lisa Mullan
- David Murphy
- Feisal Merican
- Markus Montigel
- Dr. Nagesh
- Alex O'Neill
- Brittany Nielsen
- Fiona Nielsen
- Daniel Nilsson
- Rachel Oh
- Martin Okrslar
- Bjorn Olsson
- Uma Parameswaran
- Fabio Pardi
- David Parkinson
- Helen Parkinson
- Rama Penta
- Isabelle de Piedade
- Jean-Etienne Poirrier
- William S. Preissner
- Antony Quinn
- Jeremy Read
- G. Deepak Reddy
- Alexandra Reitelmann
- Francisco Rocha
- John Rowland
- Vishal Rupani
- Amit Sabnis
- Manuel Schmidt
- Cathal Seoighe
- Niranjan Swaroop Sharma
- Richard Sheehan
- Nihar Sheth
- Bolanle Shoge
- Vaibhav Sinha
- Amelie Stein
- Jennifer Steinbachs
- Mattias Thorslund
- James Thompson
- Natalie Twine
- Eric VanWieren
- Catherine Velazquez
- Lam Ah Wah
- Jonathan Watts
- Kathy Wiederin
- Linda Wilson
- Tim Young
- Zuthur Yew
- Racheli Zakarin
- Hussein Zedan
- Humberto Ortiz Zuazaga
- Michael Zuker
Thanks to the following people for suggesting answers:
- Jeff Bizzaro
- Paul Boardman
- Ravi Jain
- Alex Kasman
- Sangeeta Sawant
- Fredj Tekaia
- Jo Wixon
Author and licensing
This resource is maintained by and © Damian Counsell, UK
Medical Research Council Rosalind Franklin Centre for Genomic Research
(the RFCGR) 1998-2004. It is made available under a modified version of
the Open Publication Licence.
It is currently mirrored at eBioinfogen
The FAQ has also been mirrored, without credit or any attempt to
link to the Open Content Licence, at the so-called "National
Bioinformatics Institute". If you are thinking of handing over
money for their "certification" you can draw your own conclusions
about their standing from this fact.
The first version of this Bioinformatics FAQ was prepared when I
was responsible for bioinformatics in the Section for Cell and Molecular Biology at
the Institute of Cancer Research (the
ICR) in London.
I am now a Bioinformatics Specialist at the Rosalind Franklin Centre
for Genomic Research, part of the Proteomics
Group and am supported by the Medical Research Council. This page
does not represent their views, but I will happily read
your criticisms. Although I may act on your advice I take no responsibility
for anything that might happen if you browse here.
Version control information
$Revision: 1.207 $ $Date: 2005/04/05 13:06:07 $ $Author: counsell