I worry about abstraction in biology. By abstraction, I mean any process by which we transform data so that we end up interpreting a transformed version of the data rather than the original data. What might be an example of this? I always think of molecular phylogenetics. The actual data in molecular phylogenetics are individual DNA sequences. There is a layer of abstraction in aligning those sequences to infer homology, and then another layer of abstraction when those alignments are fed into an algorithm to infer a phylogeny.
Of course, abstraction is inevitable. In the example of molecular phylogenetics, abstraction has occurred before we even get to the DNA sequences. If those sequences are generated by Sanger sequencing, then what the sequencer actually reads is fragment length and the fluorescence of labelled nucleotides in a capillary, which a program then abstracts into a linear DNA sequence. But I would argue that in this case we can assume a nearly one-to-one correspondence between the actual DNA sequence and our abstracted linear sequence in a text file. I think where we need to be concerned is when we have reason to think that the relationship between the actual data and the abstracted data is diverging from that one-to-one relationship.
So why do we need to be concerned about abstraction? I think because it can generate problems that are non-identifiable. Using the same example of molecular phylogenetics, the step of aligning sequences and assuming homology might not be appropriate because of pseudogenes or alignment errors (particularly in very large datasets). Many issues can also arise at the next step of reconstructing phylogenies, whether artefacts like long-branch attraction that stem from problems in the data, or because the method of reconstruction is inappropriate or incorrectly implemented. And the phylogeny is rarely the last step of the process: it can then be used to map the evolution of traits, trace the biogeography of a group, or understand historical demography. For example, if we want to know how many times a trait has evolved convergently, we must use hierarchically abstracted data to make the biological inference we are actually interested in. If our abstraction has created misleading patterns in the data at any stage, this can lead to inaccurate inferences.
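The dependence of that final inference on upstream abstraction can be made concrete with a toy sketch. The following is a minimal illustration, not anyone's published method: it applies Fitch parsimony to a hypothetical binary trait scored for four made-up taxa (A–D, with invented states), and counts the minimum number of state changes implied by two alternative four-taxon topologies. The point is simply that the same tip data yield a different number of inferred trait origins depending on which tree an earlier abstraction step produced.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes on `tree` (nested tuples of tip
    names) under Fitch parsimony, given `states`: tip name -> state."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):            # tip: its observed state set
            return {states[node]}
        left, right = (visit(child) for child in node)
        if left & right:                     # children overlap: no change needed
            return left & right
        changes += 1                         # children disagree: count one change
        return left | right

    visit(tree)
    return changes

# Hypothetical tip states for an illustrative binary trait (1 = present).
states = {"A": 1, "B": 0, "C": 1, "D": 0}

tree1 = (("A", "B"), ("C", "D"))   # topology groups unlike states together
tree2 = (("A", "C"), ("B", "D"))   # topology groups like states together

print(fitch_changes(tree1, states))  # 2 changes: trait appears convergent
print(fitch_changes(tree2, states))  # 1 change: a single origin suffices
```

Under the first topology the trait requires a minimum of two changes (apparent convergence); under the second, one change suffices. Nothing about the trait data differs between the two runs, only the abstracted phylogeny feeding into them.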
I have been using the example of molecular phylogenetics and phylogenetic comparative biology, but that is only because these are fields where I have a lot of practical experience with data and methods. I can come up with similar abstraction issues in species distribution modelling, visual modelling, and biophysical modelling, and of course there are many others.
What should we do about abstraction? Well, I don’t think the solution is to avoid abstraction altogether. As scientists, we are already pretty comfortable with uncertainty, and in historical fields like phylogenetics and phylogenetic comparative biology, making inferences from heavily abstracted data is unavoidable. One thing we can and should do is acknowledge the uncertainty that results from abstraction and resist the urge to write papers more forcefully than our data and methods merit. I know that making a forceful case for a single narrative is often the best bet for acceptance and impact in higher-profile journals. But we should be precise about the confidence of our inferences. From a pragmatic perspective, if we are making abstractions that can be ground-truthed with actual data (e.g., inferring species distributions or physiology from remote sensing data), we should do so, either in the same studies or by encouraging studies that do. Additionally, even for historical data that cannot be ground-truthed, we should conduct robust and severe tests of hypotheses (sensu John Platt and Deborah Mayo) using different types of data and analyses (e.g., integrating fossil data into biogeographic studies based on molecular phylogenies). I think the best work already employs these practices, but there is great scope for their wider adoption.