I need to write a function that reads words from a text file and

SUMMARY

The discussion focuses on creating a function to read words from a text file and write them to an output text file while eliminating duplicates. The proposed solution involves using a vector or a Hashtable for efficient duplicate detection. A sample C# implementation demonstrates the use of a Dictionary to store unique words, enhancing performance by avoiding unnecessary comparisons. Key considerations include understanding the format of the input text file and ensuring proper word extraction.

PREREQUISITES
  • Understanding of C++ and C# programming languages
  • Familiarity with file I/O operations in C++
  • Knowledge of data structures such as vectors and dictionaries
  • Basic algorithm design for duplicate detection
NEXT STEPS
  • Implement a function using C++ STL containers for file reading and duplicate removal
  • Explore the use of HashSets in C# for efficient duplicate management
  • Learn about string manipulation techniques in C++ for word extraction
  • Investigate performance optimization strategies for large text file processing
USEFUL FOR

Software developers, particularly those working with file processing and text manipulation, as well as beginners looking to improve their understanding of data structures and algorithms for duplicate handling.

Nusc
I need to write a function that reads words from a text file and writes the words to an output text file while removing all duplicate words.



void remove_dup(string filename)
{
    vector<string> a;
    string word;
    ifstream infile(filename.c_str());
    if (infile.fail())
        exit(0);
    while (!infile.eof())
    {
        infile >> word;

I'm stuck.

Should I use an if statement?

if( a.at(i) == a.at(i) )
a.erase(i,i); ?
else

{
 
Dealing with text is always a complicated task in programming, especially for beginners. How are the words formatted in the text file? Are they separated by a delimiter of some sort (e.g. a comma or semicolon)? Or is it one word per line?

So you need to make sure that you are indeed getting your words into the vector.

A straightforward algorithm would be, using my own rough pseudo-code:

Code:
WhileNot (EndOfFile)
  TempWord = GetNextWordFromTextFile()
  DuplicateWordTest = false
  ForEach (Word in Vector)
    If (TempWord == Word)
      DuplicateWordTest = true
      BreakForLoop
    EndIf
  EndFor

  If (DuplicateWordTest == false)
    Vector.AddWord(TempWord)
  EndIf
EndWhile

Now you will have a vector that is free from any duplicates, with each word in the order it first appeared.
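In C++ the same idea might look roughly like this. It is only a sketch: the function name unique_words_linear is invented here, and it assumes the words have already been read into a vector. std::find does the inner loop, so the cost grows with the square of the number of distinct words.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: keeps the first occurrence of each word,
// checking every candidate against the words kept so far.
std::vector<std::string> unique_words_linear(const std::vector<std::string>& words) {
    std::vector<std::string> unique;
    for (const std::string& w : words) {
        // std::find returns end() when w has not been kept yet.
        if (std::find(unique.begin(), unique.end(), w) == unique.end()) {
            unique.push_back(w);
        }
    }
    return unique;
}
```

That is fine for small files; for large ones the repeated scans start to dominate.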
 
Instead of a vector, you should use a hash table (a Hashtable or Dictionary). This gives better performance because you don't have to compare each word against every previous word. For example, in C#:
Code:
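// text is assumed to already hold the contents of the input file.
string[] words = text.Split(new char[] { ' ' });
Dictionary<string, bool> index = new Dictionary<string, bool>(words.Length);

foreach (string word in words) {
    // Print a word only the first time it is seen, then remember it.
    if (!index.ContainsKey(word)) {
        Console.Write(" " + word);
        index[word] = true;
    }
}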
 
