Tuesday, March 09, 2010
23andMe - I may be beginning to understand . . .
Whit Athey is one of those smart guys we all wish we were. He's a retired physicist with a doctorate in physics and biochemistry. He wrote the Y Haplogroup Predictor that many of us use. He also explains things.
I want to share his response to a recent question about 23andMe 'matching' on the ISOGG list.
The question was:
I am missing something in the definition of terms, apparently. If each parent contributes 50% of his genes to the child, why isn't the percentage more like 50% than 85%?
On the other hand, if as I think I have read, chimpanzees are 98% the same as humans, why isn't the comparison of unrelated individuals more in the range of 98%?
If you can, please define the terms being used and explain why a parent and child comparison isn't 50%.
Whit's response:
Yes, all humans are more than 99% identical at each base location. This is confusing to a lot of people when they get involved with 23andMe data. The half-million locations or SNPs that were chosen for the Illumina chip (that 23andMe uses) were chosen precisely because they are much more variable than the average location. People, on average, are alike about 75% of the time at these particular 500,000 locations. Therefore, being "75% similar on a genome-wide comparison" is just an artifact of the Illumina set of SNPs. It would be very easy to select 500,000 locations where everyone would be 99% similar (500,000 random locations would probably do the trick), but it would not produce very interesting data. It would probably be very difficult to find 500,000 locations where people would be only 50% similar - maybe impossible - I don't know.
The "genome-wide comparison" is mostly meaningless as I said in my post the other day. For this particular set of SNPs, you get "genome-wide comparisons" for siblings and parent-child results of around 84% and, of course, about 75% for unrelated people. If we only had this measure to use, we would get nowhere fast. It's the long half-identical segments that are significant.
Maybe you would wonder how even the half-identical segments could be meaningful if there is a 75% probability of being identical anyway at a given location. While that much is true for a single location, the probability that two consecutive locations would by chance alone have the same base is (0.75)(0.75) = 0.56. The probability of 1000 consecutive SNPs having the same base by chance would be (0.75)^1000 (0.75 to the 1000th power), which is such a small number that I hesitate to try to write it. Therefore, when 23andMe finds a run of 1000 consecutive SNPs (adjacent on the same chromosome) that have the same state (on one of the chromosomes), they can reliably report that this is significant and could only occur as identical by descent. The calculation is actually a little more complicated than that, because at each SNP you are comparing two bases in one person with two bases in the other.
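As an aside to Whit's explanation, here is a minimal Python sketch of that arithmetic for anyone who wants to check it. It assumes, as the simple multiplication does, that each SNP matches independently with probability 0.75; the function names are hypothetical and are not anything 23andMe publishes. Because the result for 1000 SNPs is so small, the sketch also reports it as a power of ten.

```python
import math


def chance_run_probability(p_match: float, n_snps: int) -> float:
    """Probability that n_snps consecutive SNPs all match by chance.

    Assumption: every SNP matches independently with the same
    probability p_match (0.75 in the text above).
    """
    return p_match ** n_snps


def log10_chance_run_probability(p_match: float, n_snps: int) -> float:
    """Base-10 logarithm of the same probability, easier to read when tiny."""
    return n_snps * math.log10(p_match)


if __name__ == "__main__":
    print(chance_run_probability(0.75, 2))           # two consecutive SNPs
    print(chance_run_probability(0.75, 1000))        # 1000 consecutive SNPs
    print(log10_chance_run_probability(0.75, 1000))  # exponent of ten
```

The exponent for 1000 SNPs comes out near -125, that is, a decimal point followed by more than a hundred zeros before the first nonzero digit.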
Note that the SNP locations occur about every 6000 base locations on average, and the assumption is that if you have, for example, 1000 consecutive matching bases at the locations of the SNPs, then all the bases in between each consecutive pair of SNPs (the other 5999 out of the 6000) are assumed to be the same too, resulting in 6,000,000 consecutive matching bases. These are the "half-identical segments."
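Again as an aside, the idea of a half-identical run and its estimated length can be sketched in a few lines of Python. This is only an illustration under stated assumptions: genotypes are written as two-letter strings such as "AG", two people count as half-identical at a SNP when they share at least one base there, and the spacing is the 6000-base average mentioned above. The function names and the toy data are hypothetical, and 23andMe's real method is not described in this post.

```python
AVG_SNP_SPACING = 6000  # average bases between tested SNPs, per the text above


def is_half_identical(g1: str, g2: str) -> bool:
    """True if the two genotypes share at least one base, e.g. 'AG' vs 'GG'."""
    return any(base in g2 for base in g1)


def longest_half_identical_run(person_a, person_b):
    """Return (start_index, length) of the longest run of consecutive
    SNPs at which the two people are half-identical."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, (g1, g2) in enumerate(zip(person_a, person_b)):
        if is_half_identical(g1, g2):
            if run_len == 0:
                run_start = i  # a new run begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # the run is broken by a SNP with no shared base
    return best_start, best_len


def estimated_segment_bases(n_snps: int, spacing: int = AVG_SNP_SPACING) -> int:
    """Assume every base between matching SNPs matches too."""
    return n_snps * spacing


if __name__ == "__main__":
    # Toy genotypes for five consecutive SNPs (made up for illustration).
    a = ["AA", "AG", "CC", "TT", "GG"]
    b = ["AG", "GG", "CT", "CC", "GG"]
    print(longest_half_identical_run(a, b))
    print(estimated_segment_bases(1000))
```

In the toy data the first three SNPs share a base and the fourth does not, so the longest run is three SNPs long. A run of 1000 SNPs scales up to the 6,000,000 bases mentioned above.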
A parent passes along exactly 50% of his/her 22 autosomal chromosomes to his/her child, so you are right about that. Depending on how the X and Y are counted in such calculations, you can see how the percentage would be moved slightly off 50%. The X is much larger than the Y, so your father passes along just as many chromosomes as your mother, but the amount of DNA can be different for sons than for daughters. I don't actually know exactly how 23andMe calculates these things, but you can see the potential for some small differences from 50%.
I may be beginning to understand . . .