Linguistic Search

Transcription

The first challenge in matching international names is to recognise names that are identical in their original script but differ once they have been transcribed into Latin characters.

For example, محمد can be written in Latin characters in many different ways, including Muhamad, Mahomet or Mhmad, but should not be confused with محمود, even though the romanised version of the latter (Mahmud) looks similar. Matching Mhmad with Mahomet without also returning the false positive Mahmud cannot be achieved with mathematical logic or generic phonetic algorithms alone. Language-specific rules are key to identifying all valid variants of a name without over-matching.
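The idea of matching on the original-script name rather than on the Latin spelling can be sketched as follows. This is a minimal illustration only: the VARIANTS table and the function same_original_name are hypothetical and hand-built from the examples above, whereas a real system would derive the variants from language-specific transcription rules.

```python
# Hypothetical, hand-built table: romanised spelling -> original-script name.
# It contains only the examples from the text above.
VARIANTS = {
    "muhamad": "محمد",
    "mahomet": "محمد",
    "mhmad": "محمد",
    "mahmud": "محمود",
}


def same_original_name(a: str, b: str) -> bool:
    """Return True only if both spellings map to the same original-script name."""
    origin_a = VARIANTS.get(a.casefold())
    origin_b = VARIANTS.get(b.casefold())
    # Unknown spellings never match in this sketch.
    return origin_a is not None and origin_a == origin_b


print(same_original_name("Mhmad", "Mahomet"))  # True
print(same_original_name("Mhmad", "Mahmud"))   # False
```

Comparing on the original-script key is what keeps Mahmud out of the results: it is close to Mhmad as a string of Latin letters, but it maps to a different name.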

Similar issues arise when matching names that originate in a Cyrillic script. Борис Ельцин is known in the English-speaking world as Boris Yeltsin, but his surname is written Eltsine in France and Jelzin in Germany. Transcriptions of Chinese names can also differ significantly. A single common surname may be transcribed from Mandarin as Xiao, Hsiao, Shiau or Syau, or as Siu if transcribed from Cantonese.
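The way one Cyrillic surname yields several Latin spellings can be sketched with per-language rule tables. The tables below are toy assumptions that cover only the letters of Ельцин and simplify each convention (for instance, the French word-final rule is reduced to a single letter); the names BASE, INITIAL, FINAL and transcribe are hypothetical and do not represent any official transcription standard.

```python
# Toy rule tables, one per target language, covering only the letters of "Ельцин".
BASE = {
    "en": {"е": "e", "л": "l", "ь": "", "ц": "ts", "и": "i", "н": "n"},
    "fr": {"е": "e", "л": "l", "ь": "", "ц": "ts", "и": "i", "н": "n"},
    "de": {"е": "e", "л": "l", "ь": "", "ц": "z", "и": "i", "н": "n"},
}
# Overrides for the first letter of the name.
INITIAL = {"en": {"е": "ye"}, "fr": {}, "de": {"е": "je"}}
# Overrides for the last letter of the name.
FINAL = {"en": {}, "fr": {"н": "ne"}, "de": {}}


def transcribe(name: str, lang: str) -> str:
    """Transcribe a Cyrillic name using the toy rules for one target language."""
    letters = name.casefold()
    out = []
    for i, ch in enumerate(letters):
        if i == 0 and ch in INITIAL[lang]:
            out.append(INITIAL[lang][ch])
        elif i == len(letters) - 1 and ch in FINAL[lang]:
            out.append(FINAL[lang][ch])
        else:
            out.append(BASE[lang][ch])
    return "".join(out).capitalize()


for lang in ("en", "fr", "de"):
    print(lang, transcribe("Ельцин", lang))
# en Yeltsin
# fr Eltsine
# de Jelzin
```

A matcher that knows these rules can treat Yeltsin, Eltsine and Jelzin as the same name, because all three can be generated from the same Cyrillic original.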

As well as the spelling of transliterative variants, other features of the way a name from a non-Latin script is presented in Latin characters can complicate accurate identity matching. One such feature is the way syllables are split when Chinese names are romanised. For example, 亞 男 can be written as Yanan or Ya Nan, but Yan An is a totally different name (沿 安).
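The syllable-split problem can be sketched by comparing names as sequences of syllables rather than as strings of letters. The sketch below makes two assumptions: the SYLLABLES set is a tiny hypothetical inventory (a real one has a few hundred entries), and unspaced input is taken to follow the Hanyu Pinyin convention that a non-initial syllable beginning with a, o or e is marked with an apostrophe. Names found in real data do not always follow that convention, so this is an illustration rather than a complete method.

```python
from itertools import product

# Hypothetical, tiny syllable inventory for the example names only.
SYLLABLES = {"ya", "yan", "nan", "an"}


def segment(token: str) -> list[tuple[str, ...]]:
    """Return every way to split an unspaced token into known syllables."""
    results: list[tuple[str, ...]] = []

    def walk(rest: str, acc: list[str]) -> None:
        if not rest:
            results.append(tuple(acc))
            return
        for end in range(1, len(rest) + 1):
            syl = rest[:end]
            if syl not in SYLLABLES:
                continue
            # Assumed convention: a non-initial syllable starting with a, o or e
            # would carry an apostrophe, so it cannot appear unmarked here.
            if acc and syl[0] in "aoe":
                continue
            walk(rest[end:], acc + [syl])

    walk(token, [])
    return results


def readings(name: str) -> set[tuple[str, ...]]:
    """Syllable sequences a romanised name may stand for.

    Spaces and apostrophes are treated as fixed syllable boundaries.
    """
    tokens = name.casefold().replace("'", " ").split()
    per_token = [segment(t) for t in tokens]
    return {sum(combo, ()) for combo in product(*per_token)}


def may_match(a: str, b: str) -> bool:
    """True if the two romanised names share at least one syllable sequence."""
    return bool(readings(a) & readings(b))


print(may_match("Yanan", "Ya Nan"))   # True
print(may_match("Ya Nan", "Yan An"))  # False
print(may_match("Yanan", "Yan An"))   # False
```

Ya Nan and Yan An contain exactly the same letters in the same order, so only the syllable boundaries distinguish them.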

Traphoty is able to match transliterative variants without returning excessive numbers of irrelevant hits, and is also able to prioritise such matches so that names identical in their original script are graded as stronger matches than those based on phonetic similarities or typographic errors.
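The grading described above can be sketched as a simple ordering of match types. This is not Traphoty's implementation, which is not documented here: MatchKind, Hit and rank are hypothetical names, and the match types of the example hits are assigned by hand for illustration.

```python
from enum import IntEnum
from typing import NamedTuple


class MatchKind(IntEnum):
    """Hypothetical match grades; a higher value means a stronger match."""
    TYPOGRAPHIC = 1    # likely typing error
    PHONETIC = 2       # sounds similar
    SAME_ORIGINAL = 3  # identical in the original script


class Hit(NamedTuple):
    name: str
    kind: MatchKind


def rank(hits: list[Hit]) -> list[Hit]:
    """Order hits from strongest to weakest match grade."""
    return sorted(hits, key=lambda h: h.kind, reverse=True)


# Illustrative hits for the query "Yeltsin"; the grades are assigned by hand.
hits = [
    Hit("Yeltsni", MatchKind.TYPOGRAPHIC),
    Hit("Jelzin", MatchKind.SAME_ORIGINAL),
    Hit("Yeltzin", MatchKind.PHONETIC),
]
print([h.name for h in rank(hits)])
# ['Jelzin', 'Yeltzin', 'Yeltsni']
```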