What sort of Math do I need for a plagiarism detection algorithm?

SUMMARY

The discussion focuses on the mathematical foundations required for developing a plagiarism detection algorithm, specifically a function f:S×S→[0, 1] that quantifies plagiarism between two strings. The algorithm compares strings of different lengths by checking substrings of the larger string against the smaller one. Key mathematical concepts include recursive functions and similarity measures, such as the Levenshtein distance, which is commonly used in text similarity analysis. The implementation of the function in C++ is also discussed, highlighting the need for statistical methods to enhance accuracy.

PREREQUISITES
  • Understanding of recursive functions in programming
  • Familiarity with string manipulation techniques in C++
  • Knowledge of similarity measures, particularly Levenshtein distance
  • Basic concepts of cluster analysis in data science
NEXT STEPS
  • Research advanced string matching algorithms, such as Knuth-Morris-Pratt
  • Explore statistical methods for improving plagiarism detection accuracy
  • Learn about the implementation of Levenshtein distance in C++
  • Investigate clustering techniques for text analysis and similarity measurement
USEFUL FOR

Software developers, data scientists, and researchers focused on text analysis, plagiarism detection, and algorithm development will benefit from this discussion.

Jamin2112
I'm trying to figure out the theory behind a simple plagiarism detection algorithm I'm making. The essence is that it's a function

f:S×S→[0, 1]

where S is the set of all strings and [0, 1] is the plagiarism quotient with 0 being no plagiarism and 1 being a completely copied string (that is, f(s1,s2)=1 if s1=s2).

This is analogous to a function that maps two real-valued vectors to the unit interval, something like norm(v1, v2). The only problem here is that the strings can be of different lengths. I guess that in that case my algorithm will have to take the smaller string and check whether the larger string contains any substrings that look plagiarized from the smaller one.

Any ideas?
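One possible starting point for the equal-length case, in the spirit of the vector analogy above, is the fraction of positions at which the two strings agree (a Hamming-style similarity). The name hamming_similarity is hypothetical and not from the thread, and returning 0 for unequal lengths is an assumption made only to keep the sketch total:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): fraction of positions
// at which two equal-length strings hold the same character.
// The result lies in [0, 1], with 1 meaning the strings are identical.
double hamming_similarity(const std::string& a, const std::string& b) {
    // Assumption: callers pass equal-length strings; anything else scores 0.
    if (a.size() != b.size()) return 0.0;
    // Two empty strings are identical, so they score 1.
    if (a.empty()) return 1.0;

    std::size_t matches = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == b[i]) ++matches;
    }
    return static_cast<double>(matches) / static_cast<double>(a.size());
}
```

This satisfies f(s1,s2)=1 when s1=s2, but it is sensitive to shifts: inserting a single character at the front of one string can drive the score close to 0, which is one reason edit-distance measures are often preferred.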
 
Ok, I just started to make the function like I described above. It's recursive and I still need the Math stuff figured out. Suggestions welcome.
Code:
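#include <cstddef>
#include <string>
#include <utility>

double plgrsm_qtnt(std::string s1, std::string s2) {
	if (s1.length() == s2.length()) {
		// math/stat stuff goes here ...
		// Placeholder until a real similarity measure is chosen:
		// an exact match scores 1, anything else scores 0.
		return s1 == s2 ? 1.0 : 0.0;
	}
	// Make s1 the smaller string if it isn't already:
	if (s1.length() > s2.length())
		std::swap(s1, s2);
	// Check s1 against every substring of s2 that has s1's length
	// and return the largest quotient found.
	double max_qtnt = 0.0; // since plgrsm_qtnt falls in [0, 1]
	const std::size_t s1_len = s1.length(), s2_len = s2.length();
	for (std::size_t i = 0; i + s1_len <= s2_len; ++i) {
		// std::string::substr takes (position, count), not (first, last)
		const std::string s2_sbstr = s2.substr(i, s1_len);
		const double this_qtnt = plgrsm_qtnt(s1, s2_sbstr);
		if (this_qtnt > max_qtnt)
			max_qtnt = this_qtnt;
	}
	return max_qtnt;
}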
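
The summary at the top mentions Levenshtein distance as a common similarity measure. As a rough sketch of how it could fill in the "math/stat stuff", the edit distance can be divided by the longer string's length and subtracted from 1 to get a value in [0, 1]. The names levenshtein and levenshtein_similarity and this particular normalization are assumptions for illustration, not something the thread settles on:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: classic dynamic-programming Levenshtein distance,
// i.e. the minimum number of single-character insertions, deletions and
// substitutions needed to turn a into b. Keeps only two rows of the table.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Hypothetical normalization (an assumption): 1 - distance / max length.
// Identical strings score 1; strings with nothing in common score 0.
double levenshtein_similarity(const std::string& a, const std::string& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 1.0; // both empty: treat as identical
    return 1.0 - static_cast<double>(levenshtein(a, b)) /
                     static_cast<double>(longest);
}
```

Because edit distance already handles strings of different lengths, levenshtein_similarity could be applied to s1 and s2 directly, or used inside the equal-length branch of plgrsm_qtnt so that the sliding-window search still finds a short passage buried in a longer text.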
 
