Finding a word in a dictionary

Jamin2112 · Sep 19, 2013

So I've been making a class called Dictionary in C++. I just added a class function that tries to return the definition of a word by searching through a list of words in alphabetical order. The thing is, the word you're looking for might not be in there. Hence this should be a slightly modified version of the Bisection Method in math, because in that method you know there's a zero of the function.

Does this sound right?

Pseudo-code:

w = the word we're looking for
a = first word in dictionary
b = last word in dictionary
---while a ≠ b -----------------------
c = the word directly between a and b (Use integer division by 2 to find it)
if c > w, then a = c; otherwise, b = c
---------------------------------------
if a = w, then return the definition of a; otherwise, return an error message

Code:

Code:

std::string Dictionary::getDef(std::string w) {
	std::vector<Entry>::iterator a(dict.begin()), b(dict.end()), c;
	while (a != b) { 
		c = a + (b - a)/2; 
		if (word_val(c->word) > word_val(w)) {
			a = c;
		} else { 
			b = c;
		}
	}
	if (a->word == w) { 
		return (a->def);
	} else { 
		return ("ERROR: Word not found.");
	}
	
};

word_val is a function that returns an unsigned integer. It's used to check whether one word is alphabetically greater or less than another, i.e. word_val(a) > word_val(b) iff a is alphabetically greater than b.

dict a private variable of my class, a vector of structures each of which contains a string for a word and another for the corresponding definition.

jim mcnamara · Sep 20, 2013

Do you know the stdlib call bsearch()?

pbuk · Sep 20, 2013

Here are some hints:

Use more descriptive variable names in pseudo- and actual code: try lower, upper, current and target instead of a, b, c and w.
Look at your loop: what happens when the target word has been found? Introduce a variable "found" (set initially to false) so that the loop stops when it needs to.
Comparing strings by assigning integer values to them is probably not a good idea. If strcmp or string::compare is not good enough for you, write your own as a class member with similar return values.
It is going to be difficult to add words to your dictionary. To overcome this we can use a binary tree structure; the Wikipedia article is a good starting point. With a binary tree, searching is built into the data structure. This can be further refined as a self balancing binary tree. A different approach often used for dictionary lookup is a hash table.

D H · Sep 20, 2013

Jamin2112 said:

So I've been making a class called Dictionary in C++.

It's good to learn how to roll your own "computer science data structures" (linked lists, arrays, binary search trees, associative arrays, ...). However, once you progress to a certain point, it's even better to know that others have done this before and that there are standard treatments of many of these concepts. As someone who wants employment in the field, you should be at this point. The C++ standard library has exactly what you want.

It's called a "map". A std::map<KeyType,MappedType> provides a means for mapping keys to values. In your case, a word to the definition of that word. A std::map is a self-balancing binary search tree, specifically, a red-black tree. C++11 offers an alternative to std::map, std::unordered_map. The access interfaces are very similar to std::map. The underlying implementations and hence the timing behaviors are very different. An unordered_map is implemented in terms of a rather different kind of computer science data structure, a hash table.

All of these concepts (trees, binary trees, binary search trees, hash tables): You should know them as a modern day computer programmer. These are extremely important computer science data structures.

Here's your class using a std::map:

Code:

class Dictionary {
public:
   typedef std::map< std::string, std::string > DictType;

   // Add or replace a definition for a word.
   void add_def (const std::string & word, const std::string & def) {
      dict[word] = def;
   }

   // Find the definition. Entry must exist.
   const std::string& get_def (const std::string & word) const {
      DictType::const_iterator entry = dict.find (word);
      if (entry == dict.end()) {
         throw (std::logic_error("Word \"" + word + "\" not found"));
      }
      return entry->second;
   }

private:
   DictType dict;
};

A couple of notes on the above:

Note that I'm passing and returning strings as references. Try not to pass or return strings, or big structures in general. There's a lot of hidden overhead in making copies of those passed/returned values for objects such as strings, vectors, etc. Try to use references or pointers instead.
Note that the functions add_def and get_def are very simple wrappers. There is very little reason for this class to exist. Just use std::map<std::string,std::string> .

Now I'll look at three key parts of your code.

Item 1: std::string Dictionary::getDef(std::string w)
Try not to pass strings by value. Here it would be much better to pass by reference, and specifically, const reference.

Another problem: w is a terrible name for a variable. Calling it "word" would have been much better. The cost of that much more meaningful name is typing three letters.Item 2: if (word_val(c->word) > word_val(w))
There are lots of things wrong here.

Why are you recomputing this word value? Your word_val function is most likely a very expensive function. Your keys aren't going to change, and you are going to use this a lot. It would make more sense to store the word value as part of your dictionary entry. Similarly, the word value of the input word isn't going to change. Compute it once. Sometimes these optimizations fall into the category of "premature optimization". Not in this case.
That function is fraught with problems. You are implicitly assuming that you can come up with a unique mapping from an arbitrary string to an integer. What do you do if two different words have the same word value?
Why are you doing this at all? You should have used if (c->word > w) . std::string overloads the comparison operators, and they work just the way you would expect.

Item 3: return ("ERROR: Word not found.")
This is a *bad* idea. What do you do when someone adds the word "word_not_found_error" to your dictionary with that string as the definition? Much better options:

Use a special return value such as a null pointer that can never be interpreted as a "good" return value.
Pass another argument whose value will be set to indicate success or failure.
Return a pair of values (e.g., std::pair), one to indicate success/failure, the other, the definition of the word.
Throw an exception. This is C++, after all.

Jamin2112 · Sep 20, 2013

Thanks for your comprehensive feedback.

D H said:

It's good to learn how to roll your own "computer science data structures" (linked lists, arrays, binary search trees, associative arrays, ...). However, once you progress to a certain point, it's even better to know that others have done this before and that there are standard treatments of many of these concepts. As someone who wants employment in the field, you should be at this point. The C++ standard library has exactly what you want.

...

I guess that's what I'm going to focus on -- making sure I know, off the top of my head, how to use all the tools in the standard library. Up until now I had been focusing on making home-made data structures to solve the classic interview questions (e.g. find whether an array has any repeated elements).

Item 1: std::string Dictionary::getDef(std::string w)
Try not to pass strings by value. Here it would be much better to pass by reference, and specifically, const reference.

Another habit I need to form.

[*]That function is fraught with problems. You are implicitly assuming that you can come up with a unique mapping from an arbitrary string to an integer. What do you do if two different words have the same word value?

I assure you it's an injective function. The only issue is overload if the string is too long (I was ignoring that potential problem for now).

D H · Sep 20, 2013

Jamin2112 said:

I guess that's what I'm going to focus on -- making sure I know, off the top of my head, how to use all the tools in the standard library. Up until now I had been focusing on making home-made data structures to solve the classic interview questions (e.g. find whether an array has any repeated elements).

People have taken Imran Ghory's and Joel Spolsky's good advice on interviewing to absurd levels. Solving those arcane interview problems will not help you on the job, and it will not help interviewers determine who is good and who isn't. What will help you much more is practicing on something a bit larger. Make a program that reads a dictionary, takes an input word, and finds all of the words in the dictionary that are subsets of that word. For example, this->this, hit, sit, his, hi, it, I, plus one other four letter word. Write a *simple* game program, one where there's no intelligence in the program, just checking. Then try graduating to a more complex game. Seeing that this physics forums.com, make a gravity simulator. Then add a GUI. There are lots of possibilities.

Another thing that will help is to learn some design patterns. There are lots of books on design patterns for C++. Read one, but not a For Dummies (if there is one). Check the reviews because there are bad books and good ones.

I assure you it's an injective function. The only issue is overload if the string is too long (I was ignoring that potential problem for now).

I can pretty much assure you that it is not. It's called the birthday problem. How many people do you need to ask, on average, what their birthdate is and expect to find a duplicate?

Your word_val function is what's called a "hash function." Even the very best are subject to collisions. (BTW, you should have used std::hash() for this.) Better yet, you should not have used a hash function at all. Why did you do that? That you thought you needed one is a bit problematic.

pbuk · Sep 20, 2013

Jamin2112 said:

I assure you it's an injective function.

You just failed the interview.

The function might be injective for all the test cases you have used, you might even have proved it to be true according to some set of assumptions, but what happens when the code is adapted for Version X and the assumptions are no longer valid? Code that works today but breaks tomorrow costs an employer even more than code that doesn't work at all.

What DH has written is mainly excellent, but I'll differ on one point. The correct way to implement a string comparison function in this and similar algorithms is not to use comparison operators, but to use string::compare (why? Think about how many comparisons are done within the loop). And if you are using natural language words that have not been cleaned into A-Z ASCII (even in an English dictionary you need to make sure 'wrong-headed' comes between 'wrongful' and 'wrongly', string::compare puts it at the end) you can simply replace string::compare with a hand-rolled alternative that returns similar values.

Jamin2112 · Sep 20, 2013

D H said:

I can pretty much assure you that it is not. It's called the birthday problem. How many people do you need to ask, on average, what their birthdate is and expect to find a duplicate?

The answer is 24 or something.

Anyways, I don't understand why you think it's impossible to construct an injective function f that maps strings to positive numbers, such that f(s₁) < f(s₂) if and only if s₁ is alphabetically less than s₂. For instance, if we limit ourselves to strings of length 1, f:A→N where A = {a, b, ..., z}, N = {1, 2, ..., 26}, with f(a) = 1, f(b) = 2, ..., f(z) = 26, does the trick.

Better yet, you should not have used a hash function at all. Why did you do that?

Obviously I was going for a "cool" approach to a problem.

D H · Sep 20, 2013

MrAnchovy said:

The correct way to implement a string comparison function in this and similar algorithms is not to use comparison operators, but to use string::compare (why? Think about how many comparisons are done within the loop).

There's essentially *no* difference in performance if all that is being done is one comparison:

Code:

while (a != b) {
   c = a + (b - a)/2;
   if (c->word > w) { // or if (w.compare(c->word) < 0), same thing
      a = c;
   }
   else { 
      b = c;
   }
}

Compile optimized and whatever minute difference there is in performance will disappear. With most implementations, that is.

However, it would have been a good idea to add an early escape from that loop when the desired object is found. Now it's a good idea to use compare:

Code:

while (a != b) { 
   c = a + (b - a)/2;
   int comp = w.compare (c->word);
   if (comp > 0) {
      b = c;
   }
   else {
      a = c;
      if (comp == 0) {
         break;
      }
   }
}

Now the code is avoiding a needless comparison by using the fact that string::compare returns -1, 0, or +1 to indicate <, ==, and >. Otherwise, why bother?

Jamin2112 said:

The answer is 24 or something.

Exactly. Much less than the naive 365/2. Now imagine that your hash function is near perfect, which means that in the case of a 32 bit hash (e.g., unsigned int), the probability that two random words have the same hash value is 2^-32. How many words will you need to add to the dictionary before there is a 50% chance of a collision? Those are terrible odds, BTW. We want a very small probability of a collision. It takes surprisingly few words for the odds to rise above "very small".

Anyways, I don't understand why you think it's impossible to construct an injective function ...

Simple: 26⁷>2³². Seven letter words cause trouble. Look at that sentence "Anyways, I don't understand why you think it's impossible to construct an injective function ..." and count the seven letter words in it.

Obviously I was going for a "cool" approach to a problem.

That is a bad habit. Try to kick that habit.

pbuk · Sep 20, 2013

D H said:

There's essentially *no* difference in performance if all that is being done is one comparison:

Of course, but the code that was posted with only one comparison per iteration was broken - the value of the pointer to the matching element is never stored. A working version of the code inevitably requires more than one boolean-valued comparison per iteration.

Everything else you say is right on the mark.

Finding a word in a dictionary

1. How do I find a specific word in a dictionary?

2. What if I don't know how to spell the word I am looking for?

3. Can I use a dictionary to find the meaning of a word?

4. Is there a limit to the number of words a dictionary can contain?

5. Can I use a dictionary to find the origin of a word?

Similar threads

Hot Threads

Recent Insights