Bioinformatics is, effectively, an attempt to extract useful information
from what looks, to the untrained observer, like several gigabytes of
random junk. The Human Genome Project and others like it have produced
sequence data in huge quantities. Sadly, though, a very long string
written in a four-letter alphabet (A, C, G and T) is not the easiest
thing to interpret. One of the most productive kinds of information that
bioinformatics has obtained from these sequences is the location of
regions that look like they might be
genes. Genes tend to have fairly predictable structures: they are often
preceded by a region with a higher than average number of adjacent Cs
and Gs, followed within a few kilobases by a start codon (ATG, which
encodes methionine) that functions as a start signal. Writing software
that can predict these with any great degree of accuracy has proven
somewhat more difficult than originally anticipated. One of the major
problems is a growing awareness that all sorts of other factors, such as
the way in which the DNA is folded, also influence things.
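
As a rough illustration of the heuristic just described (a stretch rich in adjacent Cs and Gs, followed some distance later by a start signal), here is a minimal Python sketch. The function names, the window size, the density threshold and the maximum gap are all assumptions made up for this example; real gene finders rely on considerably richer statistical models, and this is not meant as one.

```python
def cpg_density(seq: str) -> float:
    """Fraction of adjacent base pairs in seq that read 'CG'."""
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    hits = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    return hits / (len(seq) - 1)


def candidate_gene_starts(seq: str, window: int = 200,
                          threshold: float = 0.1, max_gap: int = 3000):
    """Return (window_start, atg_position) pairs.

    Scans non-overlapping windows; for each window whose CG density
    meets the (arbitrary, illustrative) threshold, looks for the first
    'ATG' within max_gap bases after the end of that window.
    """
    seq = seq.upper()
    results = []
    for start in range(0, max(len(seq) - window + 1, 0), window):
        if cpg_density(seq[start:start + window]) >= threshold:
            search_from = start + window
            atg = seq.find("ATG", search_from, search_from + max_gap)
            if atg != -1:
                results.append((start, atg))
    return results


if __name__ == "__main__":
    # Toy sequence: a CG-rich stretch, a short spacer, then ATG.
    toy = "CGCGCGCGCG" + "TTTTT" + "ATG" + "AAAAAAAAAAAA"
    print(candidate_gene_starts(toy, window=10, threshold=0.3, max_gap=20))
```

On real genomic data a scan like this would flag a great many false positives, which is one way of seeing why accurate prediction turned out to be harder than expected.
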
Effectively, it's all information theory. Bioinformaticians have been given a stack of data that is known to contain a large amount of information, and they're trying to get it out. For the next few years, at least, much of this is going to be guesswork resting on a lot of assumptions. Even so, it's a field that has already produced plenty of useful results and is likely to produce more. A full understanding of how the genome actually works will probably have to wait until the entire biochemistry of a cell can be simulated.
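
To make the information-theory framing slightly more concrete, one very simple measure is the Shannon entropy of a sequence's base composition, in bits per base. The sketch below is only an illustration: the function name `shannon_entropy` is an assumption of this example, and the measure ignores the order of the bases, so it says nothing about where genes are.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy of the base composition of seq, in bits per base."""
    seq = seq.upper()
    if not seq:
        return 0.0
    total = len(seq)
    counts = Counter(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # Four equally frequent letters give the maximum of 2 bits per base.
    print(shannon_entropy("ACGT" * 10))  # 2.0
```

A four-letter alphabet can carry at most 2 bits per base, so a sequence whose composition is close to that ceiling looks, by this measure alone, much like the "random junk" an untrained observer would see.
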