Name               DSLIP Description                                  Info
------------------ ----- -------------------------------------------  --------
String
::KeyboardDistance Rdpfp Approximate String Comparison                KRBURTON


Approximate string comparison.






NAME
    String::KeyboardDistance - String Comparison Algorithm

SYNOPSIS
      use String::KeyboardDistance qw(:all);
      my $s1 = 'Apple';
      my $s2 = 'Wople';

      # compute a match probability
      my $pr = qwerty_keyboard_distance_match('Apple','Wople');

      # find the keyboard distance between two strings
      my $dst = qwerty_keyboard_distance('IBM','HAL');

      # find the keyboard distance between two characters
      $dst = qwerty_char_distance('a','v');

      print "maximum distance: $qwerty_max_distance\n";

DESCRIPTION
    This module implmements a version of keyboard distance for fuzzy string
    matching. Keyboard distance is a measure of the physical distance
    between two keys on a keyboard. For example, 'g' has a distance of 1
    from the keys 'r', 't', 'y', 'f', 'h', 'v', 'b', and 'n'. Immediate
    diagonals (like ''r, 'y', 'v', and 'n') are considered to have a
    distance of 1 instead of 1.414 to help to prevent horizontal/vertical
    bias.

    A match probability between two strings is computed from the total
    distances between corresponding characters divided by the length of the
    longer string multiplied by the maximum distance between the two
    furthest keys on the keyboard.

    The functions in this module use a simple grid of keys. For the qwerty
    mapping, the grid is similar to the following:

      | 0   1   2   3   4   5   6   7   8   9   10  11  12  13
    --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    0 | ` | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | - | = |   |
    1 |   | q | w | e | r | t | y | u | i | o | p | [ | ] | \ |
    2 |   | a | s | d | f | g | h | j | k | l | ; | ' |   |   |
    3 |   | z | x | c | v | b | n | m | , | . | / |   |   |   |
    --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

    The grids for both qwerty and dvorak are based on PC style keyboards.
    Shifted characters have the same coordinates (a and A, 6 and ^).

  EXPORT

    This module exports no symbols by default. The following functions are
    available for export through EXPORT_OK:

      build_qwerty_map
      max_qwerty_distance
      qwerty_char_distance
      qwerty_keyboard_distance
      qwerty_keyboard_distance_match

      build_dvorak_map
      max_dvorak_distance
      dvorak_char_distance
      dvorak_keyboard_distance
      dvorak_keyboard_distance_match

      grid_distance

    The following varialbes are availalbe for export through EXPORT_OK:

      @qwerty_grid 
      $qwerty_map
      $qwerty_max_distance

      @dvorak_grid 
      $dvorak_map
      $dvorak_max_distance

    Additionaly, this module supports the following export tags:

      :Functions - import all functions
      :Variables - import all variables
      :all       - import both functions and variables

API REFERENCE
  build_qwerty_map

      param:  array reference to receive the map [optional]
      return: the map (array reference)

    This function builds a map of each character to its corresponding
    location on the keyboard. The location is derived by looking at the
    keyboard as a simple grid in which the location of keys on the keyboard
    are considereed to be the same as their shifted values. The following
    keys are considered to have the same location:

      1 and !
      r and r
      / and ?

    The map is an array, where the index is the value returned by chr() for
    the character at that location. The value is an array ref describing the
    location of the character on the keyboard. The first element is the y
    position, the second is the x, and the third represents the shift value
    (0 for non-shifted, 1 for shifted). All non-keyable characters,
    including tabs and spaces, will have undef values in the map, meaning
    they have no point on the grid. These non-key characterss should be
    considered to be the maximum distance away from any other character
    except themselves.

    The map is constructed at run-time from the @qwerty_grid package global,
    and cached in the $qwerty_map package global.

  build_dvorak_map

      param:  array reference to receive the map [optional]
      return: the map (array reference)

    This function is identical to build_qwerty_map, with the exception that
    it builds a map for dvorak keyboards, and the map is constructed from
    the @dvorak_grid package global, and cached in the $dvorak_map package
    global.

  max_qwerty_distance

      return: maximum distance

    This function dynamicly computes the maximum distance between keys in
    the qwerty map. The maximum key distance is stored in the
    $qwerty_max_distance package global.

  max_dvorak_distance

      return: maximum distance

    This function dynamicly computes the maximum distance between keys in
    the dvorak map. The maximum key distance is stored in the
    $dvorak_max_distance package global.

  qwerty_char_distance

      param:   char 1
      param:   char 2
      return:  distance

    This function computes the distance between the two characters passed on
    a qwerty keyboard.

  dvorak_char_distance

      param:   char 1
      param:   char 2
      return:  distance

    This function computes the distance between the two characters passed on
    a dvorak keyboard.

  grid_distance

      param:  x1 - point 1's x coordinate
      param:  y1 - point 1's y coordinate
      param:  x2 - point 2's x coordinate
      param:  y2 - point 2's y coordinate
      return: distance between points

    This function returns the distance between two points. If the two points
    have an x distance of 1, and a y distance of 1, then they are considered
    to be a distance of 1 apart. This is meant to help prevent
    horizontal/vertical bias in the distancing function. Otherwise we use
    the following formula:

      sqrt( (x1 - x2)**2 + (y1 - y2)**2 );

  qwerty_keyboard_distance

      param:  string1
      param:  string2
      return: distance

    Returns the sum of the distances between corresponding characters in the
    two strings. If one string is longer than the other the remaining
    characters are counted as having the same value as the maximum distance.

  qwerty_keyboard_distance_match

      param:  string1
      param:  string2
      return: probability of match

    The probability of a match is:

      Pr = 1 - ( D / (L * M) )

    Where D is the distance between the two strings, L is the length of the
    longer string, and M is the maximum character distance.

  dvorak_keyboard_distance

      param:  string1
      param:  string2
      return: distance

    Returns the sum of the distances between corresponding characters in the
    two strings. If one string is longer than the other the remaining
    characters are counted as having the same value as the maximum distance.

  dvorak_keyboard_distance_match

      param:  string1
      param:  string2
      return: probability of match

    The probability of a match is computed in the same way as for
    qwerty_keyboard_distance_match().

BUGS
    The grids are specific to US style PC keyboards. It is possible to
    substitue and use your own keyboard maps, but it's not well documented.
    Also, even US keyboards differ, for instance, in where they have the
    backslash and pipe characters. The grids in this module assume that it's
    in the last position of the second row.

    The space and tab keys are ignored entierly. Whitespace is irrelevant
    for the most common use of this module (for the author at least), which
    is matching words or names.

    Numbers are distanced from their locations at the top of the keyboard.
    It should be an option to perform a keyboard distance using only the
    numeric keypad.

    If one string is a substring of the other (abstained vs stained), the
    matching could be improved by either starting at the substring offset,
    or by prepending spaces onto the front of the shorter string so it
    matched the substring offset.

    Currently the maps, and the max lengths are computed dynamicly. The
    module would load faster if these were hard coded data.

    It should be possible to speed up the qwerty_char_distance and
    dvorak_char_distance functions by caching (for example, by using the
    Memoize module).

AUTHOR
    Kyle R. Burton <krburton@cpan.org>

    HMS 625 Ridge Pike Building E Suite 400 Conshohocken, PA 19428

SEE ALSO
    perl(1). String::Approx. Text::DoubleMetaphone. String::Similarity.