TG review: 12 kits at FTDNA
Note: If you are already familiar with the concepts behind DNA matching, you may wish to skip the primer following this Introduction and proceed directly to Section III, below.
There has been much discussion in the genetic genealogy community lately about Triangulated Groups ("TG's") and what they can (or cannot) tell us about distant relationships. I reviewed 12 kits at Family Tree DNA to quantify the TG's by size, both in terms of the number of segments and segment size. I then reviewed TG's for common ancestors to assess whether relationships could be plausibly attributed to matching scenarios. This part of the review is ongoing; links to these "case studies" are at the bottom of this page and on the DNA Home page.
II. Autosomal DNA: a brief primer
A. TG's: what are they?
TG's are the Holy Grail of DNA. They are groups of 3 or more people who are matching on a segment of DNA on the same chromosome in a chromosomal pair.1 Such individuals share a common ancestor somewhere in their trees from whom they "inherit" the segment.
There are 2 kinds of TG's: genetic TG's and genealogical TG's. Genetic TG's consist of matches who share DNA with you and others. Genealogical TG's consist of matches in genetic TG's for whom you have identified your common ancestor through a paper trail or other evidence. You will not always be able to identify the ancestor you share with your matches in genetic TG's, and you won't share DNA with all of your distant cousins.
Having a minimum of three people matching on a segment helps eliminate the possibility that an undetected alternative common ancestor passed the segment. For people descended from endogamous populations (where the same families marry into each other as a matter of custom, convenience or necessity), having multiple common ancestors is not just a possibility, but a probability. Such individuals are often related in complex ways, sharing multiple sets of common ancestors, and/or having the same set of ancestors in their trees multiple times. This greatly complicates the "assignment" of a segment to a specific ancestor, especially when not all of your ancestors at a given level are accounted for.
TG's sometimes seem ill suited for the task we assign them – helping us identify common ancestors with distant cousins. Since DNA is passed in a purely random fashion, the chances diminish at each generational remove that we and our distant cousins received the same overlapping DNA.
Consider that you got approximately 6.25% of your DNA from each of your great-great-grandparents, and you can see how the odds are against three 3rd cousins receiving the exact same or overlapping segment of randomly passed DNA from them.
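The arithmetic behind that percentage can be sketched as follows. This is a minimal illustration that assumes an idealized model in which the expected share halves at every generation; the function name `expected_share` is my own, and actual inheritance varies around these averages.

```python
def expected_share(generations_back: int) -> float:
    """Average fraction of autosomal DNA inherited from ONE ancestor
    who sits `generations_back` generations above you (1 = parent)."""
    return 0.5 ** generations_back


# Walk back from parent to great-great-grandparent.
for label, gens in [
    ("parent", 1),
    ("grandparent", 2),
    ("great-grandparent", 3),
    ("great-great-grandparent", 4),
]:
    print(f"{label}: {expected_share(gens):.2%}")
```

The last line printed is the 6.25% figure for a single great-great-grandparent, the generation that 3rd cousins have in common.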
Nevertheless, most of us have observed TG's in our kits – TG's which are presumably formed with cousins more distant than 2nd cousins. The fact that we are forming TG's at all with distant cousins seems to defy the simple mathematical odds, and suggests that other factors are at play.
B. Founder populations
Two factors, both discussed in an Ancestry “Learning Academy” video on DNA Circles, are founder populations and endogamy. Founder populations are small societies formed by population "bottlenecks" - a population reduction due to factors such as plague, which wiped out half of Europe in the 14th century, or the settlement of North America by Europeans in the 17th century onwards. The smaller gene pool results in less divergent DNA being passed around with more of the same segments passed to descendants.
Ancestry also gives the example of fast growing populations which result in more cousin marriages and multiple common ancestors for individuals. Ancestry does not do a good job in my opinion of explaining why this happens in fast growing populations – wouldn’t more people lead to more genetic diversity? But suffice to say that both distant and more recent endogamy result in more of the same DNA being shared, and that, as a result, matches are often sharing more DNA than we would expect for their relationship.
In certain places in the United States, cousin marriages continued well after the 18th century. My 2nd great grandfather had only 5 discrete sets of 2nd great-grandparents, instead of the usual eight, because his parents were double second cousins.
D. IBD v IBS
The foregoing discussion is meant to contextualize the TG's reported here. The case studies drawn from this review include examples of 4th through 8th cousins in a TG, often with relatively large segments. These distant relationships are supplemented by closer cousin relationships (e.g. 3rd cousin) and the correct grandparent on the segment (I have mapped my Kits to their four grandparents).
The critical issue is the age of the segments, but only large DNA sharing reliably predicts the nearness of a relationship. A segment's age is really in part a function of the totality of DNA shared, but since distant cousins (say 3rd and beyond) usually share only one segment of DNA IBD, total amount of DNA IBD shared is equivalent to longest segment for many of those relationships.
IBD stands for "Identical By Descent," and refers to a segment of DNA that individuals share because each inherited it from a common ancestor. In genealogical practice, the label is applied to segments large enough that a recent common ancestor is very likely. Segments that are not IBD are typically characterized as reflecting a "chance" match between people that does not reflect shared ancestry, and are called IBS ("Identical By State"). But since all humans today share a most recent common ancestor who, by some estimates, lived as recently as 2,000 years ago, what does that mean? At some level, doesn't our DNA reflect a "match" with everyone else in the world?
Yes – and no. For all intents and purposes, the distinction between IBD and IBS (also called IBC, "Identical By Chance") is the difference between segments that are connecting us to recent common ancestors and those that only appear to do so or are too small to tell. "Recent" is a loosely defined term. Ancestry uses "recent" to describe matches in its "tables of confidence" – as in a 99% "[l]ikelihood of a single recent common ancestor" for people that share a total of 45-60 cM. Ancestry does not define "recent," but there seems to be some informal understanding among genetic genealogists that it equates to roughly 10 generations. A 5 cM segment shared with others may be from a recent common ancestor or one who lived 900 years ago. We cannot isolate it to a narrower date range and so its use as a genealogical tool is virtually nil. On the other hand, a segment that is 23 cM almost certainly connects us to an MRCA (most recent common ancestor) within the past 400 years or so.
According to Ancestry's "table of confidence" ratings, individuals who share over 60 cM are "virtually 100%" certain to share a recent common ancestor; those who share between 45-60 cM have a 99% chance; individuals who share 30-45 cM a 95% chance and those who share 16-30 cM an "above 50%" chance of sharing a most recent common ancestor (or ancestral couple).
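The tiers quoted above can be written as a small lookup. This is only a sketch of the ranges as I have quoted them, not Ancestry's own method; the function name is hypothetical, and how the exact boundary values (30, 45 and 60 cM) are assigned is my assumption, since the quoted ranges overlap at their edges.

```python
def ancestry_confidence(total_cm: float) -> str:
    """Return the confidence label quoted above for a total shared cM value.

    Assumption: a boundary value (exactly 30 or 45 cM) is placed in the
    higher tier, and exactly 60 cM in the 45-60 tier.
    """
    if total_cm > 60:
        return "virtually 100%"
    if total_cm >= 45:
        return "99%"
    if total_cm >= 30:
        return "95%"
    if total_cm >= 16:
        return "above 50%"
    return "not covered by the tiers quoted here"


for cm in (72.0, 50.0, 35.0, 20.0, 9.0):
    print(cm, "cM ->", ancestry_confidence(cm))
```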
It is readily apparent that there is a significant difference in the confidence levels between the 2 categories 30-45 cM and 16-30 cM. Without any guidance as to what "recent" means, if we are to believe that it covers ten generations, then individuals who share segments of up to 30 cM could be related through a most recent common ancestor who lived in the 18th century or earlier. Even individuals who share a segment of 30-45 cM could be sharing a most recent common ancestor sometime in the 18th century.
E. HIR's, Mbp's and SNP's
Segments that appear to connect us to our matches – but do not – are typically in the gray zone of IBD / IBS (say anywhere between 8-15 cM). In these cases, discrete HIR's of DNA are being "stitched" together by the software that analyzes our DNA to create the appearance of a single, longer HIR. HIR stands for "Half Identical Region" and is a section of DNA where 2 or more people have 1 of the 2 alleles (see below for definition of "allele") in common across an unbroken succession of base pairs. (Positions and lengths along a chromosome are commonly given in "Mbp's," or megabase pairs; 1 Mbp is one million base pairs.) Base pairs are the "rungs" of the DNA ladder (the familiar double helix) but only a subset of them are compared for autosomal DNA analysis. These are "SNP's" (Single Nucleotide Polymorphisms). SNP's are base pairs where the reported values differ between people and/or populations. They are interspersed throughout the other base pairs that make up our DNA. Since SNP's vary more between people than other base pairs do, having the same values on a succession of SNP's is a measure of how closely people are related, based on the length of the HIR (the unbroken sequence of half-identical SNP's).
Each base pair consists of two markers – A, T, C or G. These letters stand for the biological compounds adenine, thymine, cytosine and guanine, the building blocks of DNA. Since every SNP has a particular position or "address" on our chromosomes, and since the address applies to both chromosomes in the pair, we have 2 base pairs, or 4 markers, at every SNP address. But, since A always pairs with T, and C always pairs with G, only one of the markers is reported for each base pair. Reporting the other would yield the same result expressed in a different way. Think of a photographic negative – same image, different version. The marker that is reported is called the "allele."
So, there are 2 alleles for every SNP – one from our mother's chromosome and one from our father's. The computer software that analyzes our DNA reports the alleles, but it does not know which one is from which chromosome. This is determined through a process called "phasing." Phasing groups the alleles into the maternal and paternal sides of our chromosomes. Phasing is achieved in 2 ways: by comparing a child's DNA with that of her parent(s), or through "population phasing," which Ancestry does, and which looks at how alleles are related to one another based on the predominant values for a given SNP for the majority of a population. In this way, Ancestry assigns the alleles to one side or the other, based on how the alleles are grouped in the population at large.
In sum, in a TG, a group of people have one of the 2 alleles in common on a run of SNP's - an unbroken chain of base pairs falling on one of the 2 chromosomes in that chromosomal pair - and corresponding to one side of their tree (maternal or paternal). The boundaries of a segment are defined by "opposite homozygotes" (e.g., CC and GG), which do not match at all.
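To make the idea of a half-identical run concrete, here is a toy sketch. It assumes each kit is simply a list of unphased two-letter genotypes at the same SNP positions; the function names and the sample data are made up for illustration, and real matching software applies further thresholds (such as a minimum cM length) that are not modeled here.

```python
def half_identical(g1: str, g2: str) -> bool:
    """True if two unphased genotypes (e.g. "AG", "GG") share at least one allele."""
    return bool(set(g1) & set(g2))


def half_identical_runs(kit_a, kit_b, min_snps=1):
    """Return (start, end) index pairs (inclusive) of unbroken runs of
    half-identical SNPs. A run ends where the genotypes share no allele,
    e.g. at opposite homozygotes such as CC vs GG."""
    runs, start = [], None
    n = min(len(kit_a), len(kit_b))
    for i in range(n):
        if half_identical(kit_a[i], kit_b[i]):
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_snps:
                runs.append((start, i - 1))
            start = None  # the mismatch breaks the run
    if start is not None and n - start >= min_snps:
        runs.append((start, n - 1))
    return runs


# Hypothetical genotypes at six SNP positions; position 3 is CC vs GG.
kit_a = ["AG", "CC", "AT", "CC", "GG", "AA"]
kit_b = ["GG", "CT", "TT", "GG", "AG", "AA"]
print(half_identical_runs(kit_a, kit_b))  # [(0, 2), (4, 5)]
```

The opposite homozygotes at position 3 split what would otherwise be one run into two, which is the boundary behavior described above.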
III. Approach and methodology
Based on observations of how TG's were formed in the kits I administer at FTDNA, I classified 6 groups of TG's based on the following criteria:
B. Recording Segments
I recorded all segments above 15 cM. Segments were grouped based on the criteria above and if they were a match in the FTDNA Matrix. Recording TG's where one or more longer segments overlapped with shorter segments that were not all overlapping with each other posed challenges. I created 3 additional columns in my spreadsheet where I broke out these "derivative" TG's ("DTG's") (an admittedly imperfect term). All TG's (including DTG's) had to meet the criteria above to be recorded.
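The basic recording test can be sketched as below. This is a simplified illustration, not the spreadsheet logic itself: it assumes a candidate TG needs at least 3 segments, each above 15 cM, all overlapping one another, with every pair of matches ICW in the Matrix. The Group criteria and the DTG case (where not all segments overlap each other) are not modeled, and the data layout and names are hypothetical.

```python
from itertools import combinations

MIN_CM = 15.0  # recording threshold stated above


def overlaps(seg_a, seg_b):
    """Segments are (chromosome, start_bp, end_bp); True if they overlap."""
    return seg_a[0] == seg_b[0] and seg_a[1] < seg_b[2] and seg_b[1] < seg_a[2]


def is_candidate_tg(matches, icw_pairs):
    """matches: {name: (chromosome, start_bp, end_bp, cM)} for one kit.
    icw_pairs: set of frozensets naming two matches who are ICW each other."""
    if len(matches) < 3:
        return False
    if any(seg[3] <= MIN_CM for seg in matches.values()):
        return False
    for a, b in combinations(matches, 2):
        if not overlaps(matches[a][:3], matches[b][:3]):
            return False
        if frozenset((a, b)) not in icw_pairs:
            return False
    return True


# Hypothetical matches on chromosome 7.
matches = {
    "Match A": ("7", 10_000_000, 40_000_000, 22.0),
    "Match B": ("7", 15_000_000, 45_000_000, 18.5),
    "Match C": ("7", 12_000_000, 38_000_000, 16.2),
}
icw = {frozenset(pair) for pair in combinations(matches, 2)}
print(is_candidate_tg(matches, icw))  # True
```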
C. Reporting Segments
All TG's but not all DTG's were reported. A DTG had to have at least 50% unique segments (not duplicated in another TG) to be reported and only those unique segments within the DTG were reported. This avoided duplication while facilitating inclusivity.
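The 50% rule for DTG's can be sketched as follows. The segment identifiers are placeholders, and the function name is my own; the sketch only assumes that each segment can be labeled and checked against the segments already placed in another TG.

```python
def report_dtg(dtg_segments, segments_in_other_tgs):
    """Return (reported, unique_segments) for a derivative TG.

    dtg_segments: list of segment ids in the DTG.
    segments_in_other_tgs: set of segment ids already placed in another TG.
    A DTG is reported only if at least 50% of its segments are unique,
    and then only the unique segments are reported.
    """
    unique = [s for s in dtg_segments if s not in segments_in_other_tgs]
    reported = bool(dtg_segments) and len(unique) / len(dtg_segments) >= 0.5
    return reported, (unique if reported else [])


already_in_tg = {"seg1", "seg2"}
print(report_dtg(["seg1", "seg2", "seg3", "seg4"], already_in_tg))  # (True, ['seg3', 'seg4'])
print(report_dtg(["seg1", "seg2", "seg5"], already_in_tg))          # (False, [])
```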
The following example from the spreadsheet illustrates how segments were recorded and reported:
All data for 03HA is reported (segments and cM). 2 segments are reported in 03HAa, and no segments are reported in 03HAb, as fewer than 50% are unique (not duplicated in another TG). A tally of all TG's is recorded in the Tally[#] column by entering the TG's Group code in the row of its first segment. Whether a DTG is reported is indicated by an "N" or "Y" in the Rpt[#]TG column and whether a segment is reported by an "N" or "Y" in the Rpt[#]Seg column.
"DS" in the final column stands for duplicate surname. Where two or more persons in a TG had the same surname, only the person with the highest cM was reported; the others were omitted, on the assumption that the relationship was possibly that of a parent-child, sibling or first cousin. While it is obviously possible that this did not weed out all close relationships, the fact that many females list double or hyphenated surnames suggests that women often report their biological surnames. The DS designation was extended to segments that did not fall within TG's by applying it to all such persons (not just the duplicated match as in the case of TG's or DTG's). This explains the omission of many segments (since all segments above 15 cM were reported, not just those grouped into TG's).
Although I did not put a "cap" on segment size for reporting purposes, the longest segment with someone not known to be a 3rd cousin or closer to the kits was 55.85 cM. Most segments above 40 cM were variations (half cousin, once removed, etc.) on known 3rd cousin relationships or closer.
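The duplicate surname ("DS") rule described above, as applied within a TG, can be sketched like this. The names, surnames and cM values are invented, and the case-insensitive surname comparison is my assumption.

```python
def apply_duplicate_surname_rule(members):
    """members: list of (name, surname, cM) tuples for one TG.
    Keeps, per surname, only the member with the highest cM.
    Returns (kept, flagged_ds)."""
    best = {}  # surname -> index of the member with the highest cM so far
    for i, (_, surname, cm) in enumerate(members):
        key = surname.lower()
        if key not in best or cm > members[best[key]][2]:
            best[key] = i
    keep = set(best.values())
    kept = [m for i, m in enumerate(members) if i in keep]
    flagged = [m for i, m in enumerate(members) if i not in keep]
    return kept, flagged


tg_members = [
    ("Match A", "Smith", 22.4),
    ("Match B", "Smith", 18.1),
    ("Match C", "Jones", 16.0),
]
kept, flagged = apply_duplicate_surname_rule(tg_members)
print(kept)     # Match A and Match C are reported
print(flagged)  # Match B is marked DS
```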
The kits I reviewed included my parents and their cousins, the closest being two 1st cousins once removed. (I sometimes write "my" Kits, with a capital K, to refer to my mother and father.) One kit, Kathryn, is not related to us at all. I selected cousins on widely spaced lines to minimize duplication of TG's. The list below gives each kit's name; its relationship to my parents or grandparents; and, for my parents' cousins, which of my parent's grandparents the match comes through ("MGF" = maternal grandfather, etc.).
- Sherry – my mother
- Jim – my father
- Sybil – my Maternal Grandfather's Paternal 1st cousin
- Julia – my Paternal Grandfather's Paternal 1st cousin
- Susan – Mom's 2nd cousin / MGM
- Jon – Mom's 2nd cousin / MGF
- Gary – Mom's 1/2 2nd cousin / PGM
- Connie – Mom's 4th cousin / PGF
- Mary – Dad's 2nd cousin / MGM
- Wayne – Dad's 3rd cousin 1R / PGF
- Gerald – Dad's 4th cousin / PGF
A link to the spreadsheet with the data is here.
Total segments reviewed: 3,906
Total segments grouped into TG's: 1,637
Total TG's: 397
A. TG totals by Group & Kit
B. Segment totals by Group & Kit
C. Average no. of segments & segment size per Group
D. Group distributions across kits as percentages
E. TG & segment distributions across kits
F. TG & segment distributions across Groups
The kits fell into 4 rough groupings, from the largest to the smallest amount of DNA sharing:
- Gerald, Jon & Sybil – TG's: 43, 44, 46; segments: 181, 207, 185
- Gary, Julia, Kathryn, Sherry & Wayne – TG's: 36, 33, 39, 36 & 37; segments: 147, 130, 159, 158 & 150
- Connie, Jim & Mary – TG's: 25, 24 & 21; segments: 90, 97 & 86
- Susan – TG's: 13; segments: 47
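As a cross-check, the per-kit figures in the list above can be tallied against the overall totals. This sketch assumes the numbers in each line are given in the same order as the kit names; the segments-per-TG ratio is my own derived figure, not one taken from the spreadsheet.

```python
# (TG's, segments grouped into TG's) per kit, transcribed from the list above.
kits = {
    "Gerald": (43, 181), "Jon": (44, 207), "Sybil": (46, 185),
    "Gary": (36, 147), "Julia": (33, 130), "Kathryn": (39, 159),
    "Sherry": (36, 158), "Wayne": (37, 150),
    "Connie": (25, 90), "Jim": (24, 97), "Mary": (21, 86),
    "Susan": (13, 47),
}

total_tgs = sum(tgs for tgs, _ in kits.values())
total_segments = sum(segs for _, segs in kits.values())
print(total_tgs, total_segments)  # 397 1637

# Average number of segments per TG for each kit.
for name, (tgs, segs) in kits.items():
    print(f"{name}: {segs / tgs:.2f} segments per TG")
```

The sums agree with the totals reported earlier (397 TG's and 1,637 segments grouped into TG's).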
Of the first 3 kits, only Gerald's tree is presently researched across all lines and I was not able to identify factors - such as endogamy - which might explain his high numbers. However, Gerald's lines are all in the colonies by the early 1700s. This is true also of the lines which my Kits share with Jon and Sybil.
The 5 people in the next group seemed to fall in the middle range of DNA sharing, and what I assume would be the average in a larger sampling.
At the lower end were my father, Jim, his maternal 2nd cousin, Mary, and my mother's 4th cousin, Connie. My father's maternal grandfather was a 1st generation German-American, his parents having emigrated in the early 1860s. My father has about a quarter fewer matches at FTDNA than my mother and her siblings, and I assume the more recent migration of this quarter of his tree is responsible for his lower numbers. Connie's lines have all been in the U.S. for over 200 years, with most in the colonies from the early-mid 1700s. I am unfamiliar with Mary's lines that I do not share with her, but I believe her father's ancestors were Irish who emigrated in the mid 1800's.
Susan, my mother's 2nd cousin, had about a quarter of the DNA sharing compared to the highest kits. I am unfamiliar with her maternal ancestry and that of her paternal grandmother.
Group A had an uneven distribution, accounting for 15-20% of the TG's in 6 of the kits and 30-42% in 6 others.
Group B had a more uniform distribution, accounting for 12-23% of the TG's in 8 of the kits, with 4 others as outliers (5 & 8%; and 28 & 32%).
Group C had the most uniform distribution, accounting for 5.4-12.5% of the TG's in all kits except one (16.66%).
Group D had an uneven distribution, accounting for 8-36% of the TG's across the kits.
Group E had a fairly uniform distribution, accounting for 11-19% of the TG's in 9 of the kits and less in the remainder (5, 7 & 8%).
Group F accounted for among the fewest TG's, with 3-19% of the TG's across the kits.
There was a rough correlation between the size of a Group (its number of TG's) and how it was distributed across kits. The largest Groups, A and D, had the least uniform distribution; the smallest - C - the most uniform; and Groups B & E, mid-range in terms of size, a distribution in the middle.
Group A contained the most TG's: 99; while Group C, which differed from Group A only in that it was not restricted to 3 segments per TG, had the fewest: 41. These 2 Groups also sat at opposite ends in terms of their distribution across kits, with Group A among the least uniform and Group C the most uniform. My assumption is that Group A represents ancestral segments, and Group C, with no restriction on Group size, represents even deeper ancestry. Some residents of the British Isles have reported very large match groups overlapping on relatively small (8-15 cM range) segments. For the most part, I did not observe this phenomenon in the kits I reviewed. I wonder if Americans across the board share less "deep ancestry" in common than their British counterparts. Since many Brits are descendants of inhabitants of the same island stretching back many hundreds of years, this seems plausible. The high number of TG's with small segment size but fewer segments in the TG's (for Americans) as compared with larger groups with these small segment sizes (the Brits) could reflect a version of "deep ancestry" for Americans which dates to colonial times, as opposed to that for Brits which dates to (say) 1500 and before.
Group B lifted the restrictions on segment size in Group A, and saw a decrease in the total number of TG's in that Group. Perhaps that was the result of incorporating segments that formed closer cousin relationships.
Group D contained the next highest number of TG's: 90. It represented a middle ground among the Groups, with no restriction on the number of segments but a maximum of 2 segments over 20 cM. This Group would seem well positioned to incorporate segments dating from a period of extensive DNA sharing among members of colonial "founder populations." It would contain "ancestral segments" defined in terms of those populations as well as a smaller proportion of larger, more recent segments.
Group F, with the largest segments, had the next fewest TG's. This is rather what I expected, since I assume this Group represents closer cousin sharing. Just as we would expect very large segments or total amounts of DNA to be shared with a small, closely related group of people, so would I expect my nearer distant cousins to inhabit the TG's which I have defined in part as containing longer segments.
The above are my observations and may not reflect what is actually happening in the kits. The testing and analysis of atDNA is still young and there is much to be understood. It's difficult to say how we are matching others when there is a wide range of possibilities for the sharing amounts. The other limitation was the study's sample size, which I would like to have doubled at least, to see whether these sharing scenarios applied to kits in this demographic (Americans with varying degrees of colonial ancestry) on a bigger scale. Finally, it occurred to me only as I was writing this conclusion that by counting segments of smaller length within TG's containing longer ones, to some extent I was misrepresenting those segments. Therefore, I have listed the segments grouped by size irrespective of what Group I included them in here:
How do we determine the age of segments short of extracting distant ancestors' DNA from their remains and comparing it with our own? At least one approach (and perhaps the only one available to the lay genetic genealogist) is to observe sharing scenarios in conjunction with matches' pedigrees. While it seems probable that at a certain level it would be impossible to eliminate the possibility of alternative and hidden pathways for shared DNA – given the increased likelihood of multiple common ancestors as we move back in time – it seems possible to me to identify common ancestral lines as the source for shared segments in some situations. In the case studies drawn from my Kits, I have tried to show how TG's can solve problems and break down brick walls - by matching on a lineage, not an individual. A link to the first of these case studies is below. They will be supplemented as existing TG's are reviewed, new matches are made and connections found.
1. The strict definition of the triangulation of a segment of atDNA is that 3 or more matches on a segment have confirmed they are all overlapping on the same segment (i.e., they are on the same side of the chromosome in the pair). This can be done by each of them individually confirming this is the case. Currently, the only automated tools able to perform triangulation are the Tier 1 Triangulation tool at gedmatch.com and the Autosomal DNA Segment Analyzer ("ADSA") at dnagedcom.com. For segments above 15 cM approximately (those under discussion here), an approximation of triangulation can be achieved using the Matrix tool at FTDNA in conjunction with the Chromosome Browser. When you see a group of people all in common with each other ("ICW") in the Matrix that are overlapping on at least 15 cM, then it is usually the case that they are matching on the same chromosome in the pair. In this study, the potential of their not matching on the same chromosome may be more of an issue for Groups A and B which are restricted to 3 segments. In general, the more people ICW each other and the larger the segment, the more certain that triangulation has been achieved. IMO, 4+ people who are all ICW each other on a segment of >17-18 cM will, in almost every case, constitute a triangulated segment.