What sort of Math do I need for a plagiarism detection algorithm?

Jamin2112 · Nov 12, 2013

I'm trying to figure out the theory behind a simple plagiarism detection algorithm I'm making. The essence is that it's a function

f:S×S→[0, 1]

where S is the set of all strings and [0, 1] is the plagiarism quotient with 0 being no plagiarism and 1 being a completely copied string (that is, f(s₁,s₂)=1 if s₁=s₂).

This is analogous to a function that maps two real-valued vectors to the unit interval, something like norm(v₁, v₂). The only problem here is that the strings can be of different lengths. I guess that in that case my algorithm will have to take the smaller string and check whether the larger string contains any substrings that look like they were plagiarized from the smaller one.
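One way to make the vector analogy concrete, offered only as a rough sketch and not a settled design, is to map each string to a vector of character counts and take the cosine of the angle between the two vectors. This sidesteps the length problem because both vectors always have 256 entries. The function name `cosine_similarity` is made up for illustration, and the measure ignores character order, so any two anagrams score 1.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <string>

// Hypothetical helper: cosine similarity of byte-count vectors.
// Returns a value in [0, 1]; identical strings give 1.
double cosine_similarity(const std::string& a, const std::string& b) {
    std::array<double, 256> ca{}, cb{};  // zero-initialized count vectors
    for (unsigned char c : a) ca[c] += 1.0;
    for (unsigned char c : b) cb[c] += 1.0;

    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < ca.size(); ++i) {
        dot += ca[i] * cb[i];
        na += ca[i] * ca[i];
        nb += cb[i] * cb[i];
    }
    // Convention assumed here: two empty strings are identical,
    // one empty string shares nothing with a non-empty one.
    if (na == 0.0 || nb == 0.0) return (na == nb) ? 1.0 : 0.0;
    // Clamp to guard against rounding slightly above 1.
    return std::min(1.0, dot / std::sqrt(na * nb));
}
```

For example, `cosine_similarity("abc", "xyz")` is 0 because the strings share no characters, while `cosine_similarity("abc", "cab")` is 1, which shows why counts alone are too coarse for real plagiarism detection.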

Any ideas?

Jamin2112 · Nov 12, 2013

Ok, I just started to make the function like I described above. It's recursive and I still need the Math stuff figured out. Suggestions welcome.

Code:

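#include <cstddef>
#include <string>
#include <utility>

double plgrsm_qtnt(std::string s1, std::string s2) {
	if (s1.length() == s2.length()) {
		// math/stat stuff goes here ...


		// ...
		// Placeholder (an assumption, not the final measure) so that every path returns a value:
		return s1 == s2 ? 1.0 : 0.0;
	} else {
		// Make s1 the shorter string if it isn't already:
		if (s1.length() > s2.length())
			std::swap(s1, s2);
		// Check s1 against every substring of s2 that has s1's length and return the largest quotient found
		double max_qtnt(0); // since plgrsm_qtnt falls in [0, 1]
		const std::size_t s1_len(s1.length()), s2_len(s2.length());
		for (std::size_t i(0); i + s1_len <= s2_len; ++i) {
			// std::string::substr takes (start position, count), not (start, end)
			std::string s2_sbstr = s2.substr(i, s1_len);
			double this_qtnt = plgrsm_qtnt(s1, s2_sbstr);
			if (this_qtnt > max_qtnt)
				max_qtnt = this_qtnt;
		}
		return max_qtnt;
	}
}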

jim mcnamara · Nov 12, 2013

Search engines use cluster analysis to avoid displaying very similar content over and over again in their results.

This is also used for plagiarism detection. Basically, texts are broken into fragments and a similarity distance is calculated using common substrings. One common similarity algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance

http://en.wikipedia.org/wiki/Cluster_analysis
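As a rough illustration of how the Levenshtein distance linked above could be turned into a score in [0, 1], here is a minimal sketch. The names `levenshtein` and `levenshtein_similarity` are made up for this example, and dividing by the longer string's length is just one possible normalization, not the only one.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Edit distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    // Two-row dynamic programming table: prev is row i-1, curr is row i.
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Assumed normalization: 1 for identical strings, 0 when every
// position of the longer string has to be edited.
double levenshtein_similarity(const std::string& a, const std::string& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 1.0; // two empty strings are identical
    return 1.0 - static_cast<double>(levenshtein(a, b)) /
                 static_cast<double>(longest);
}
```

Since the distance is never larger than the length of the longer string, the result always stays in [0, 1], and it works directly on strings of different lengths. The cost is O(|a|·|b|) time per pair, which is one reason texts are usually broken into fragments first.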
