Macro-level surname sampling is the analysis of surnames at a country-wide level.

Most names are very rare

Perhaps the most surprising thing about surnames is how many very rare names there are. We are well aware of the Smiths and Joneses, but there are very few such common names. On the UK Info Disc 2000, which claims to include the UK electoral roll, about 42% of names occur once, 16% of names occur twice, 7% occur three times, and so on with ever decreasing numbers. Phone books are not such a good sample of the population, but they show a similar pattern. In the telephone directories for all England and Wales from about 1980 (when there were far fewer ex-directory subscribers than today), about 45% of names occur just once, which agrees well with the Info Disc figure. Similar results have been found in a study of names in the Swiss telephone directories. (Source: Barrai, I., Scapoli, C., Beretta, M., Nesti, C., Mamolini, E., and Rodriguez-Larralde, A., Annals of Human Biology (1996), vol 23, pp 431-455).

The reason why we are aware of the common names is that there are more people with them. There are more than half a million Smiths in Britain. However, as far as we can estimate, there are perhaps 200 000 people with names which only occur once in this country, around 140 000 people whose names occur twice, and about 110 000 people whose names occur three times.

Mathematics - and the total number of names

If we plot the logarithm of the number of surnames (n) occurring once, twice, three times, etc., against the logarithm of the number of times (x) a surname occurs (once, twice, etc.), we get close to a straight line. The graph shows the distribution of a thousand names selected at random from the UK Info Disc 2000.

Leading names like Smith and Jones are positioned close to where the graph dips to the horizontal x axis; conversely the hundreds of names which occur only once are clustered near to the vertical y axis.

Mathematically this means that n must be related to x by an expression of the form

$n = b\,x^{-a}$ This is expression 1.

where b and a are constants.
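As a rough illustration of how the constants in expression 1 can be read off a log-log plot, the following Python sketch fits a straight line to log n against log x by ordinary least squares. The data points here are synthetic: they are generated from expression 1 itself, using the values b = 324 and a = 1.49 quoted just below for the 1000-name sample, so the sketch only shows that the method recovers known constants. The function name fit_power_law is my own and is not taken from any library.

```python
import math

def fit_power_law(xs, ns):
    """Least-squares straight line through (log x, log n).

    Returns (b, a) for the model n = b * x**(-a).
    """
    log_x = [math.log(x) for x in xs]
    log_n = [math.log(n) for n in ns]
    mean_x = sum(log_x) / len(log_x)
    mean_n = sum(log_n) / len(log_n)
    slope = (
        sum((u - mean_x) * (v - mean_n) for u, v in zip(log_x, log_n))
        / sum((u - mean_x) ** 2 for u in log_x)
    )
    intercept = mean_n - slope * mean_x
    # The slope of the log-log line is -a; the intercept is log b.
    return math.exp(intercept), -slope

# Synthetic points generated from expression 1 (an assumption for
# illustration only; these are not the Info Disc counts).
xs = [1, 2, 3, 5, 10, 20, 50]
ns = [324 * x ** -1.49 for x in xs]

b_fit, a_fit = fit_power_law(xs, ns)
print(f"b = {b_fit:.1f}, a = {a_fit:.2f}")
```

With real counts the points would scatter about the line, and the fitted constants would depend on which values of x are included.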

It can be seen from the graph that the 1000-name sample is fitted well if b = 324 and a = 1.49 approximately. However, if we try to match this up to the frequencies of the commoner surnames, we find that there are fewer common names than we would expect from this expression. Matching up the 1000-name sample shown in the graph with the common-names information suggests that overall the frequency distribution of names in Britain is given by something like

$n = 200000\,x^{-1.5} \times 1.025^{-c}$ This is expression 2.

where $c = x^{0.4}$

Expression 2 fits the data well all the way from the 200 000 or so unique names to the half a million people called Smith. It is just a curve fitted to the data and has no fundamental significance, but it enables us to estimate the total number of surnames in Britain, something which is surprisingly hard to measure directly.

Expression 2 suggests that there are a little under half a million surnames in Britain.
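The estimate can be checked by simply adding up expression 2 over every possible value of x. The Python sketch below does this. The upper limit of one million holders is my own assumption, chosen to lie safely above the half a million or so Smiths; the terms beyond that point are negligible. The function name names_with_count is also just illustrative.

```python
def names_with_count(x):
    """Expression 2: estimated number of surnames held by exactly x people."""
    c = x ** 0.4
    return 200_000 * x ** -1.5 * 1.025 ** -c

# Assumed cut-off, a little above the commonest name's frequency.
MAX_HOLDERS = 1_000_000

total_names = sum(names_with_count(x) for x in range(1, MAX_HOLDERS + 1))

# People whose name occurs exactly twice or exactly three times.
people_twice = 2 * names_with_count(2)
people_thrice = 3 * names_with_count(3)

print(f"estimated number of surnames: {total_names:,.0f}")
print(f"people with a name occurring twice: {people_twice:,.0f}")
print(f"people with a name occurring three times: {people_thrice:,.0f}")
```

The same function gives the numbers of people whose names occur twice or three times, which can be compared with the rough figures of 140 000 and 110 000 given earlier.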

All of the above assumes that we define every spelling variation as a different surname. If we try to group variants together, for example regarding Clark and Clarke as the same name, then the number of surnames depends solely on how we group the names.

The problem of getting a sample

The above results depend on taking a random sample of names in a population, so that each name, however common or rare, has an equal chance of being selected. Having selected the name, we count how many people have that name. Random sampling of names is difficult. It is easier to pick people at random from a population to form a sample, and then count how many times each surname occurs in the sample, but this gives a different result, and does not generally give a true picture of the frequency distribution of names in the population. The random-person method gives a form of distribution usually called the Yule distribution. It crops up in various other systems, such as the distribution of word frequency in texts, and was explained on the basis of very simple assumptions by Herbert Simon, a Nobel Prize-winner for economics. (Source: Simon, H.A., Biometrika, (1955) vol 42, pp 425-440)
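The difference between the two ways of sampling can be seen in a toy example. The Python sketch below uses an invented population (one name with 1000 holders and 100 names with a single holder each, none of it real data) and compares how often each method lands on a rare name. It does not attempt to reproduce the Yule distribution; it only shows that picking people favours a name in proportion to its number of holders.

```python
import random

# Invented toy population: one very common name, 100 unique names.
name_counts = {"Common": 1000}
name_counts.update({f"Rare{i}": 1 for i in range(100)})

# One list entry per person, for the random-person method.
people = [name for name, holders in name_counts.items() for _ in range(holders)]

rng = random.Random(0)  # fixed seed so the run is repeatable
DRAWS = 500

# Random-name method: every surname is equally likely to be picked.
picked_names = rng.choices(list(name_counts), k=DRAWS)
rare_share_by_name = sum(name_counts[n] == 1 for n in picked_names) / DRAWS

# Random-person method: every person is equally likely to be picked,
# so a name is chosen in proportion to how many people hold it.
picked_people = rng.choices(people, k=DRAWS)
rare_share_by_person = sum(name_counts[n] == 1 for n in picked_people) / DRAWS

print(f"share of draws hitting a unique name, by name:   {rare_share_by_name:.2f}")
print(f"share of draws hitting a unique name, by person: {rare_share_by_person:.2f}")
```

In this toy population 100 of the 101 names are unique, but only 100 of the 1100 people hold a unique name, so the two shares come out very differently.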

I discussed this distribution in relation to surnames in an article in the Journal of One-Name Studies Vol 6 No.6, pp 119-124 (April 1998), although various aspects of that article have been overtaken by later work summarised above.

How does the distribution arise?

It seems incredible that, after 25 generations or so of hereditary surnames in England, there are so many unique or very rare surnames. One factor is that surname generation is still going on, through immigration, double-barrelling, or deliberate creation of new spelling variants. It ought to be possible to explain the form of the distribution shown in the graph in terms of generation and extinction processes, but this has not yet been done convincingly.

Probability theory has tackled one aspect - the chance that a name with just one holder will become extinct. The probability of extinction turns out to be an astonishing 89%, although the exact figure depends on the assumptions made about the probability of having no sons, one son, two sons, etc. This calculation is given in various texts on advanced probability, such as Probability by Peter Whittle (John Wiley, 1970) ISBN 0-471-01657-8, pp 124-125. This means that it is very probable that any one male 25 generations ago will have no descendants in the direct male line today, and so a name with just one holder would probably have become extinct.
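To show how the answer depends on the assumed family sizes, here is a minimal Python sketch of the standard branching-process calculation: a male line is extinct within g+1 generations if the line of every son is extinct within g generations. The distribution of sons used below is entirely hypothetical, picked so that the arithmetic is easy to check by hand; it is not the one used by Whittle, and it gives an eventual extinction probability of 0.5 rather than 89%.

```python
def extinct_by_generation(son_probs, generations):
    """Probability that one man's male line has died out within `generations`.

    son_probs[k] is the assumed probability of a man having exactly k sons.
    """
    q = 0.0  # at generation 0 the founder himself is still there
    for _ in range(generations):
        # Each of the k sons must independently head a line that dies out.
        q = sum(p * q ** k for k, p in enumerate(son_probs))
    return q

# Hypothetical chances of having 0, 1 or 2 sons (average 1.25 sons).
HYPOTHETICAL_SONS = [0.25, 0.25, 0.5]

after_25 = extinct_by_generation(HYPOTHETICAL_SONS, 25)
eventual = extinct_by_generation(HYPOTHETICAL_SONS, 500)

# If the average number of sons is below one, every line dies out in the end.
shrinking = extinct_by_generation([0.5, 0.25, 0.25], 500)

print(f"extinct within 25 generations: {after_25:.4f}")
print(f"extinct eventually:            {eventual:.4f}")
print(f"shrinking population:          {shrinking:.4f}")
```

Changing the list of probabilities changes the result considerably, which is the point made above about the exact figure depending on the assumptions.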

If there is indeed a 0.89 probability of extinction of a name with just one holder, the probability of a name with n holders instead of just one becoming extinct is $0.89^n$, assuming that the holders' male lines die out independently of one another. For example, if there were five holders of a name, the chance that none of them eventually has any male descendants is $0.89^5$, which equals about 0.56, so in this case there is still a better than even chance that the name will become extinct. However, if there are 50 holders of a name in the first generation, the probability that the name will become extinct is only 0.3% - a 99.7% chance that the name will survive.
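The arithmetic in this paragraph is easy to reproduce. The short Python sketch below raises the single-holder figure of 0.89 to the power of the number of holders, which assumes that each holder's male line dies out independently of the others. The function name is my own.

```python
P_EXTINCT_ONE_HOLDER = 0.89  # figure quoted in the text for a single holder

def name_extinction_probability(holders, p_one=P_EXTINCT_ONE_HOLDER):
    """Chance that every one of `holders` independent male lines dies out."""
    return p_one ** holders

for holders in (1, 5, 50):
    p = name_extinction_probability(holders)
    print(f"{holders:>2} holders: extinction {p:.1%}, survival {1 - p:.1%}")
```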

The theoretical result has been roughly confirmed by Christopher Sturges and Brian Haggett, who did a computer simulation to calculate the number of descendants from each member of an original population. They found that after 23 generations, about 76% of the original population had no male descendants, so if a name had just one holder in the original population, there was a 76% probability that it would become extinct - the difference from 89% is probably mainly because they made different assumptions about the chances of having particular numbers of sons in each family. This work was published as Sturges and Haggett, Inheritance of English Surnames (Hawgood Computing, 1987) ISBN 0-948151-02-1.
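A simulation of this general kind can be sketched in a few lines of Python. This is not Sturges and Haggett's program, and the chances of having 0, 1, 2 or 3 sons used here are hypothetical values of my own, chosen so that the male population neither grows nor shrinks on average. The sketch follows the male line from each founder for 23 generations and reports the fraction of founders left with no male-line descendants.

```python
import random

def fraction_without_male_line(son_probs, generations, founders, seed=0):
    """Simulate each founder's male line and return the fraction that die out.

    son_probs[k] is the assumed probability of a man having exactly k sons.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    son_numbers = list(range(len(son_probs)))
    died_out = 0
    for _founder in range(founders):
        males = 1
        for _generation in range(generations):
            if males == 0:
                break
            # Draw a number of sons for every living male in this generation.
            males = sum(rng.choices(son_numbers, weights=son_probs, k=males))
        if males == 0:
            died_out += 1
    return died_out / founders

# Hypothetical chances of 0, 1, 2 or 3 sons (average exactly one son).
HYPOTHETICAL_SONS = [0.4, 0.3, 0.2, 0.1]

extinct_fraction = fraction_without_male_line(
    HYPOTHETICAL_SONS, generations=23, founders=2000
)
print(f"founders with no male line after 23 generations: {extinct_fraction:.0%}")
```

As with the theoretical calculation, the fraction obtained depends heavily on the family sizes assumed, so the output of this sketch should not be compared directly with the 76% or 89% figures.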

This contribution is dedicated to a Miss Smith I know who at the time of writing is about to marry a Mr Fegyveres. Perhaps they will have many sons!

Copyright 2000 Trevor Ogden