What sort of Math do I need for a plagiarism detection algorithm?

Jamin2112 · Nov 12, 2013

I'm trying to figure out the theory behind a simple plagiarism detection algorithm I'm making. The essence is that it's a function

f:S×S→[0, 1]

where S is the set of all strings and the value in [0, 1] is the plagiarism quotient, with 0 meaning no plagiarism and 1 meaning a completely copied string (that is, f(s₁,s₂)=1 if s₁=s₂).

This is analogous to a function that maps two real-valued vectors to the unit interval, something like norm(v₁, v₂). The only problem here is that the strings can be of different lengths. I guess that in that case my algorithm will have to take the smaller string and check whether the larger string contains any substrings that look plagiarized from the smaller one.

Any ideas?

Jamin2112 · Nov 12, 2013

Ok, I just started to make the function like I described above. It's recursive and I still need the Math stuff figured out. Suggestions welcome.

Code:

#include <cstddef>
#include <string>
#include <utility>

double plgrsm_qtnt(std::string s1, std::string s2) {
	if (s1.length() == s2.length()) {
		// math/stat stuff goes here ...
		
		
		// ...
		// Placeholder so the function always returns a value:
		// 1 for identical strings, 0 otherwise.
		return s1 == s2 ? 1.0 : 0.0;
	} else {
		// Make s1 the smaller string if it isn't already:
		if (s1.length() > s2.length())
			std::swap(s1, s2);
		// Check s1 against every substring of s2 that has the same
		// length as s1 and return the largest quotient found.
		double max_qtnt(0); // since plgrsm_qtnt falls in [0, 1]
		const std::size_t s1_len(s1.length()), s2_len(s2.length());
		for (std::size_t i(0); i + s1_len <= s2_len; ++i) {
			// std::string::substr takes (position, count)
			std::string s2_sbstr = s2.substr(i, s1_len);
			double this_qtnt = plgrsm_qtnt(s1, s2_sbstr);
			if (this_qtnt > max_qtnt)
				max_qtnt = this_qtnt;
		}
		return max_qtnt;
	}
}
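
One possible filler for the equal-length branch, purely as an illustration since the metric is still open: the fraction of positions at which the two strings agree. The name matching_fraction is hypothetical, and this is only a sketch of a very crude measure (a single inserted character shifts everything after it and ruins the score).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption, not a settled choice of metric):
// fraction of positions at which two equal-length strings hold the same character.
double matching_fraction(const std::string& a, const std::string& b) {
    if (a.size() != b.size())
        return 0.0; // this sketch only scores equal-length strings
    if (a.empty())
        return 1.0; // two empty strings are identical
    std::size_t same = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == b[i])
            ++same;
    }
    return static_cast<double>(same) / static_cast<double>(a.size());
}
```

It returns 1 exactly when the two strings are equal, which matches the requirement f(s₁,s₂)=1 if s₁=s₂.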

jim mcnamara · Nov 12, 2013

Search engines use cluster analysis to avoid displaying very similar content over and over again in search results.

This is also used for plagiarism detection. Basically, texts are broken into fragments and a similarity distance is calculated using common substrings. One common similarity algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance

http://en.wikipedia.org/wiki/Cluster_analysis
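
As a rough illustration of how the Levenshtein distance could be turned into a score in [0, 1], here is a minimal sketch. The normalization (one minus the distance divided by the longer length) and the function names levenshtein and similarity are assumptions made for this example, not something prescribed above. Note that it works directly on strings of different lengths.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Dynamic-programming Levenshtein distance, keeping only two rows of the table.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = j; // building the first j characters of b from "" takes j insertions
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i; // reducing the first i characters of a to "" takes i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Assumed normalization: 1 - distance / (length of the longer string),
// so identical strings score 1 and strings with nothing in common score 0.
double similarity(const std::string& a, const std::string& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0)
        return 1.0; // two empty strings are identical
    return 1.0 - static_cast<double>(levenshtein(a, b)) / static_cast<double>(longest);
}
```

For whole documents this would probably be applied to fragments rather than to the full texts, since the table costs time proportional to the product of the two lengths.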
