Finding a word in a dictionary

  • Thread starter Jamin2112
  • Start date
In summary, the class Dictionary uses a std::map to map words to their corresponding definitions. The class also has a private variable dict which is a std::map. The add_def and get_def functions are simple wrappers for std::map.
  • #1
Jamin2112
986
12
So I've been making a class called Dictionary in C++. I just added a class function that tries to return the definition of a word by searching through a list of words in alphabetical order. The thing is, the word you're looking for might not be in there. Hence this should be a slightly modified version of the Bisection Method in math, because in that method you know there's a zero of the function.

Does this sound right?

Pseudo-code:

w = the word we're looking for
a = first word in dictionary
b = last word in dictionary
---while a ≠ b -----------------------
c = the word directly between a and b (Use integer division by 2 to find it)
if c > w, then a = c; otherwise, b = c
---------------------------------------
if a = w, then return the definition of a; otherwise, return an error message

Code:

Code:
std::string Dictionary::getDef(std::string w) {
	std::vector<Entry>::iterator a(dict.begin()), b(dict.end()), c;
	while (a != b) { 
		c = a + (b - a)/2; 
		if (word_val(c->word) > word_val(w)) {
			a = c;
		} else { 
			b = c;
		}
	}
	if (a->word == w) { 
		return (a->def);
	} else { 
		return ("ERROR: Word not found.");
	}
	
};

word_val is a function that returns an unsigned integer. It's used to check whether one word is alphabetically greater or less than another, i.e. word_val(a) > word_val(b) iff a is alphabetically greater than b.

dict a private variable of my class, a vector of structures each of which contains a string for a word and another for the corresponding definition.
 
Technology news on Phys.org
  • #2
Do you know the stdlib call bsearch()?
 
  • #3
Here are some hints:
  1. Use more descriptive variable names in pseudo- and actual code: try lower, upper, current and target instead of a, b, c and w.
  2. Look at your loop: what happens when the target word has been found? Introduce a variable "found" (set initially to false) so that the loop stops when it needs to.
  3. Comparing strings by assigning integer values to them is probably not a good idea. If strcmp or string::compare is not good enough for you, write your own as a class member with similar return values.
  4. It is going to be difficult to add words to your dictionary. To overcome this we can use a binary tree structure; the Wikipedia article is a good starting point. With a binary tree, searching is built into the data structure. This can be further refined as a self balancing binary tree. A different approach often used for dictionary lookup is a hash table.
 
Last edited:
  • #4
Jamin2112 said:
So I've been making a class called Dictionary in C++.
It's good to learn how to roll your own "computer science data structures" (linked lists, arrays, binary search trees, associative arrays, ...). However, once you progress to a certain point, it's even better to know that others have done this before and that there are standard treatments of many of these concepts. As someone who wants employment in the field, you should be at this point. The C++ standard library has exactly what you want.

It's called a "map". A std::map<KeyType,MappedType> provides a means for mapping keys to values. In your case, a word to the definition of that word. A std::map is a self-balancing binary search tree, specifically, a red-black tree. C++11 offers an alternative to std::map, std::unordered_map. The access interfaces are very similar to std::map. The underlying implementations and hence the timing behaviors are very different. An unordered_map is implemented in terms of a rather different kind of computer science data structure, a hash table.

All of these concepts (trees, binary trees, binary search trees, hash tables): You should know them as a modern day computer programmer. These are extremely important computer science data structures.

Here's your class using a std::map:
Code:
class Dictionary {
public:
   typedef std::map< std::string, std::string > DictType;

   // Add or replace a definition for a word.
   void add_def (const std::string & word, const std::string & def) {
      dict[word] = def;
   }

   // Find the definition. Entry must exist.
   const std::string& get_def (const std::string & word) const {
      DictType::const_iterator entry = dict.find (word);
      if (entry == dict.end()) {
         throw (std::logic_error("Word \"" + word + "\" not found"));
      }
      return entry->second;
   }

private:
   DictType dict;
};
A couple of notes on the above:
  • Note that I'm passing and returning strings as references. Try not to pass or return strings, or big structures in general. There's a lot of hidden overhead in making copies of those passed/returned values for objects such as strings, vectors, etc. Try to use references or pointers instead.
  • Note that the functions add_def and get_def are very simple wrappers. There is very little reason for this class to exist. Just use std::map<std::string,std::string> .
Now I'll look at three key parts of your code.

Item 1: std::string Dictionary::getDef(std::string w)
Try not to pass strings by value. Here it would be much better to pass by reference, and specifically, const reference.

Another problem: w is a terrible name for a variable. Calling it "word" would have been much better. The cost of that much more meaningful name is typing three letters.Item 2: if (word_val(c->word) > word_val(w))
There are lots of things wrong here.
  • Why are you recomputing this word value? Your word_val function is most likely a very expensive function. Your keys aren't going to change, and you are going to use this a lot. It would make more sense to store the word value as part of your dictionary entry. Similarly, the word value of the input word isn't going to change. Compute it once. Sometimes these optimizations fall into the category of "premature optimization". Not in this case.
  • That function is fraught with problems. You are implicitly assuming that you can come up with a unique mapping from an arbitrary string to an integer. What do you do if two different words have the same word value?
  • Why are you doing this at all? You should have used if (c->word > w) . std::string overloads the comparison operators, and they work just the way you would expect.
Item 3: return ("ERROR: Word not found.")
This is a *bad* idea. What do you do when someone adds the word "word_not_found_error" to your dictionary with that string as the definition? Much better options:
  • Use a special return value such as a null pointer that can never be interpreted as a "good" return value.
  • Pass another argument whose value will be set to indicate success or failure.
  • Return a pair of values (e.g., std::pair), one to indicate success/failure, the other, the definition of the word.
  • Throw an exception. This is C++, after all.
 
Last edited:
  • #5
Thanks for your comprehensive feedback.

D H said:
It's good to learn how to roll your own "computer science data structures" (linked lists, arrays, binary search trees, associative arrays, ...). However, once you progress to a certain point, it's even better to know that others have done this before and that there are standard treatments of many of these concepts. As someone who wants employment in the field, you should be at this point. The C++ standard library has exactly what you want.

...

I guess that's what I'm going to focus on -- making sure I know, off the top of my head, how to use all the tools in the standard library. Up until now I had been focusing on making home-made data structures to solve the classic interview questions (e.g. find whether an array has any repeated elements).

Item 1: std::string Dictionary::getDef(std::string w)
Try not to pass strings by value. Here it would be much better to pass by reference, and specifically, const reference.

Another habit I need to form.

[*]That function is fraught with problems. You are implicitly assuming that you can come up with a unique mapping from an arbitrary string to an integer. What do you do if two different words have the same word value?

I assure you it's an injective function. The only issue is overload if the string is too long (I was ignoring that potential problem for now).
 
  • #6
Jamin2112 said:
I guess that's what I'm going to focus on -- making sure I know, off the top of my head, how to use all the tools in the standard library. Up until now I had been focusing on making home-made data structures to solve the classic interview questions (e.g. find whether an array has any repeated elements).
People have taken Imran Ghory's and Joel Spolsky's good advice on interviewing to absurd levels. Solving those arcane interview problems will not help you on the job, and it will not help interviewers determine who is good and who isn't. What will help you much more is practicing on something a bit larger. Make a program that reads a dictionary, takes an input word, and finds all of the words in the dictionary that are subsets of that word. For example, this->this, hit, sit, his, hi, it, I, plus one other four letter word. Write a *simple* game program, one where there's no intelligence in the program, just checking. Then try graduating to a more complex game. Seeing that this physics forums.com, make a gravity simulator. Then add a GUI. There are lots of possibilities.

Another thing that will help is to learn some design patterns. There are lots of books on design patterns for C++. Read one, but not a For Dummies (if there is one). Check the reviews because there are bad books and good ones.
I assure you it's an injective function. The only issue is overload if the string is too long (I was ignoring that potential problem for now).
I can pretty much assure you that it is not. It's called the birthday problem. How many people do you need to ask, on average, what their birthdate is and expect to find a duplicate?

Your word_val function is what's called a "hash function." Even the very best are subject to collisions. (BTW, you should have used std::hash() for this.) Better yet, you should not have used a hash function at all. Why did you do that? That you thought you needed one is a bit problematic.
 
Last edited:
  • #7
Jamin2112 said:
I assure you it's an injective function.
You just failed the interview.

The function might be injective for all the test cases you have used, you might even have proved it to be true according to some set of assumptions, but what happens when the code is adapted for Version X and the assumptions are no longer valid? Code that works today but breaks tomorrow costs an employer even more than code that doesn't work at all.

What DH has written is mainly excellent, but I'll differ on one point. The correct way to implement a string comparison function in this and similar algorithms is not to use comparison operators, but to use string::compare (why? Think about how many comparisons are done within the loop). And if you are using natural language words that have not been cleaned into A-Z ASCII (even in an English dictionary you need to make sure 'wrong-headed' comes between 'wrongful' and 'wrongly', string::compare puts it at the end) you can simply replace string::compare with a hand-rolled alternative that returns similar values.
 
  • #8
D H said:
I can pretty much assure you that it is not. It's called the birthday problem. How many people do you need to ask, on average, what their birthdate is and expect to find a duplicate?

The answer is 24 or something.

Anyways, I don't understand why you think it's impossible to construct an injective function f that maps strings to positive numbers, such that f(s1) < f(s2) if and only if s1 is alphabetically less than s2. For instance, if we limit ourselves to strings of length 1, f:A→N where A = {a, b, ..., z}, N = {1, 2, ..., 26}, with f(a) = 1, f(b) = 2, ..., f(z) = 26, does the trick.

Better yet, you should not have used a hash function at all. Why did you do that?

Obviously I was going for a "cool" approach to a problem.
 
  • #9
MrAnchovy said:
The correct way to implement a string comparison function in this and similar algorithms is not to use comparison operators, but to use string::compare (why? Think about how many comparisons are done within the loop).
There's essentially *no* difference in performance if all that is being done is one comparison:
Code:
while (a != b) {
   c = a + (b - a)/2;
   if (c->word > w) { // or if (w.compare(c->word) < 0), same thing
      a = c;
   }
   else { 
      b = c;
   }
}
Compile optimized and whatever minute difference there is in performance will disappear. With most implementations, that is.

However, it would have been a good idea to add an early escape from that loop when the desired object is found. Now it's a good idea to use compare:
Code:
while (a != b) { 
   c = a + (b - a)/2;
   int comp = w.compare (c->word);
   if (comp > 0) {
      b = c;
   }
   else {
      a = c;
      if (comp == 0) {
         break;
      }
   }
}
Now the code is avoiding a needless comparison by using the fact that string::compare returns -1, 0, or +1 to indicate <, ==, and >. Otherwise, why bother?


Jamin2112 said:
The answer is 24 or something.
Exactly. Much less than the naive 365/2. Now imagine that your hash function is near perfect, which means that in the case of a 32 bit hash (e.g., unsigned int), the probability that two random words have the same hash value is 2-32. How many words will you need to add to the dictionary before there is a 50% chance of a collision? Those are terrible odds, BTW. We want a very small probability of a collision. It takes surprisingly few words for the odds to rise above "very small".

Anyways, I don't understand why you think it's impossible to construct an injective function ...
Simple: 267>232. Seven letter words cause trouble. Look at that sentence "Anyways, I don't understand why you think it's impossible to construct an injective function ..." and count the seven letter words in it.


Obviously I was going for a "cool" approach to a problem.
That is a bad habit. Try to kick that habit.
 
  • #10
D H said:
There's essentially *no* difference in performance if all that is being done is one comparison:
Of course, but the code that was posted with only one comparison per iteration was broken - the value of the pointer to the matching element is never stored. A working version of the code inevitably requires more than one boolean-valued comparison per iteration.

Everything else you say is right on the mark.
 

1. How do I find a specific word in a dictionary?

To find a specific word in a dictionary, you can use the alphabetical index located at the beginning or end of the dictionary. Look for the first letter of the word you are searching for and then locate the page number that corresponds to that letter. Flip to that page and scan through the list of words until you find the one you are looking for.

2. What if I don't know how to spell the word I am looking for?

If you are unsure of the spelling of a word, you can try using the phonetic spelling guide located in the front or back of the dictionary. This guide uses symbols to represent the sounds of letters and can help you find the correct spelling of a word.

3. Can I use a dictionary to find the meaning of a word?

Yes, dictionaries are designed to provide definitions for words. Simply locate the word you are looking for and read the definition provided. Some dictionaries may also include example sentences or synonyms for the word.

4. Is there a limit to the number of words a dictionary can contain?

The number of words in a dictionary can vary depending on the edition and type of dictionary. Some dictionaries may contain hundreds of thousands of words, while others may only contain a few thousand. It is important to choose a dictionary that fits your specific needs.

5. Can I use a dictionary to find the origin of a word?

Yes, many dictionaries include information about the origin of words, also known as etymology. This can be useful in understanding the history and development of a word. Look for the word's root or language of origin in the definition or additional notes provided in the dictionary.

Similar threads

  • Programming and Computer Science
Replies
22
Views
2K
Replies
3
Views
762
  • Programming and Computer Science
Replies
11
Views
1K
  • Programming and Computer Science
Replies
2
Views
682
  • Programming and Computer Science
Replies
6
Views
8K
  • Programming and Computer Science
Replies
12
Views
1K
  • Programming and Computer Science
3
Replies
70
Views
3K
  • Programming and Computer Science
Replies
5
Views
2K
  • Programming and Computer Science
Replies
4
Views
781
Replies
9
Views
1K
Back
Top