Need help designing an algorithm for similarities between words

Click For Summary
SUMMARY

The discussion focuses on designing a C++ algorithm to suggest similar words based on user input, utilizing a predefined list of the 5000 most common English words and a distance map representing pairwise distances between keyboard keys. The user aims to implement a method that accounts for typing errors by leveraging the Levenshtein distance, modified to include weights for character substitutions. The challenge lies in developing a comprehensive formula that effectively handles missing or additional letters during word comparison.

PREREQUISITES
  • C++ programming proficiency
  • Understanding of Levenshtein distance algorithm
  • Familiarity with data structures such as std::vector and std::map
  • Knowledge of keyboard layout and distance metrics
NEXT STEPS
  • Implement a weighted Levenshtein distance algorithm in C++
  • Research techniques for handling missing and additional letters in word comparisons
  • Explore optimization strategies for searching through large datasets in C++
  • Study user input error correction algorithms for text input systems
USEFUL FOR

Developers creating text input suggestion systems, C++ programmers interested in algorithm design, and anyone looking to improve user experience in typing applications.

Jamin2112
Messages
973
Reaction score
12
For fun, I'm trying to make a C++ program that takes a word from the user and comes up with an ordered list of suggested words, similar to the kind of thing you have on your cell phone when sending SMS messages.

So far I have:
  • An std::vector<std::string> of the 5000 most common English words
    [*]An std::map<std::pair<char,char>,int> of the pairwise distances of keys on a standard computer keyboard. For instance, (A,A) → 0, (A,Q) → 1, (W,C) → 3, (Z,M) → 6.


The idea is that when the user types in a word, it is checked against every one of the 5000 most common English words using some algorithm that uses my map. That map is supposed to help detect if a user hit a wrong key or two when typing in a word. Any ideas?
 
Technology news on Phys.org
What do you do with missing or additional letters?

The Levenshtein distance could be interesting, you just have to apply weights to the character substitutions in some way.
 
mfb said:
What do you do with missing or additional letters?

That's my problem -- I don't know!

An easy distance formula for comparing words of the same length is to sum up the distances between letters at corresponding indices. For instance, ("rhe","the")→1+0+0=1, so "the" is going to be one of the top suggestions. But yea, I'm having trouble thinking of a general formula.
 

Similar threads

  • · Replies 5 ·
Replies
5
Views
4K
  • · Replies 6 ·
Replies
6
Views
2K
  • · Replies 75 ·
3
Replies
75
Views
7K
Replies
5
Views
2K
  • · Replies 3 ·
Replies
3
Views
4K
  • · Replies 24 ·
Replies
24
Views
2K
Replies
10
Views
2K
  • · Replies 8 ·
Replies
8
Views
2K
Replies
8
Views
2K
  • · Replies 6 ·
Replies
6
Views
3K