CS110 Assignment 5 - Plagiarism Detector
Author = "Kaymin Martin-Burnett"
Contributors = "Ha Tran Nguyen Phuong", "Finn Macken", "Junran Shi"
Introduction
The problem at hand is that students submitting assignments may copy from each other and violate academic guidelines. The input to this problem is two submissions, each concatenated into a single string of all the words submitted. By finding all the common substrings between these strings, an assessor can be alerted to the possibility of plagiarism. A computational solution, such as a program, can speed up the assessment process and reduce errors. But what is the best data structure and algorithm for this computational problem?
Choosing a data structure and algorithm
The computational problem in context is that submitted assignments can be tens of thousands of words long. Professors using this program do not want to sit around for hours or days while similarities between the texts are found, so the objective is to minimize search time. One of the simplest data structures we could use is an array: we could store substrings in array elements and check for matching substrings between arrays. However, numerous inefficiencies arise with this choice of data structure. Most importantly, the array must be checked sequentially, element by element, even if the element we are searching for does not exist. The runtime of this procedure therefore grows with the size of the array, giving a time complexity of O(n). Instead of checking every single element, how might we search for a particular element faster? Rather than memorizing indexes for our search, we can use associative arrays.
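To make the O(n) cost concrete, here is a minimal sketch of sequential search (illustrative code, not part of my submission):

    def linear_search(arr, target):
        """Check each element in turn; an absent target costs all n comparisons."""
        for i, element in enumerate(arr):
            if element == target:
                return i   # found: return its index
        return -1          # not found, but we still scanned the whole array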
An associative array helps us avoid the inefficient runtime of a simple array because it uses a key-value mapping. A hash table is an implementation of an associative array using this key-value mapping technique. We can think of the hash table as using a function that acts as a 'black box' between each key and the buckets of data. We take a key, put it into the black box, which processes it according to a set of rules and then outputs the right bucket to search for our data. This 'black box' is a hash function, and it saves us time: the operations inside it are relatively simple and take O(1). A drawback is that these operations can result in two different keys producing the same value. This problem is like a library filing 20 different books under the same sorting code. So, while the hash function saves us time by directing us to a specific shelf and rack in a library full of books, each book should ideally have a unique code. We will explore further how rehashing, or collision strategies, can mitigate this problem. For now, we can compare the array's search time complexity of O(n) to the hash table's average-case O(1) and justify the latter's choice for our context.
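As a hedged sketch of the black-box idea (the bucket count and the use of Python's built-in hash() are illustrative assumptions, not the assignment's actual parameters):

    NUM_BUCKETS = 8  # illustrative table size, not the assignment's

    table = [[] for _ in range(NUM_BUCKETS)]  # one 'shelf' (bucket) per code

    def bucket_for(key):
        """The 'black box': O(1) work maps a key straight to its bucket."""
        return hash(key) % NUM_BUCKETS

    def insert(key, value):
        table[bucket_for(key)].append((key, value))

    def lookup(key):
        # Only one bucket is scanned, instead of the whole table
        for k, v in table[bucket_for(key)]:
            if k == key:
                return v
        return None

Note that two different keys can still land in the same bucket (a collision), which is why each bucket here is a list rather than a single slot.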
ALGORITHM - The first algorithm choice is provided by the assignment instructions: rolling hashing. Rolling hashing gets its name from the window of characters that it uses to search for matching substrings. The window updates as it pushes out its first character and pulls in the next character just outside the window. At each position of the window, the hash function computes a value that is cross-checked against the value we are searching for. A visual representation of this process is found below. In Figure 1, we can once again see how the hash function is similar to a library: first we scan the library for the shelf (bucket of data) that contains the right combination of uniquely coded books, then we only look at that shelf for the specific book (data) we need. (#analogies) (Diagram inspired by Stable Sort (2019))
There is a problem with the implementation in Figure 1: the hash function produces the same value for ("cdef") and ("fedc") because its operation is just a summation of character values, which ignores order. To avoid this problem, we can use modular arithmetic to create hash values that take a character's position into account. The modulus is needed because the constant we use to weight a character's position can get very large when we want no collisions at all (i.e. giving every character position a unique value).
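To make this concrete, here is a small illustrative check (the base of 256 is an assumed positional weight; q = 211 anticipates the choice justified in the next paragraph):

    Q = 211  # prime modulus, justified below

    def sum_hash(s):
        """Order-blind summation hash: anagrams always collide."""
        return sum(ord(c) for c in s) % Q

    def poly_hash(s, base=256):  # base is an assumed positional weight
        """Positional (polynomial) hash: character order changes the result."""
        h = 0
        for c in s:
            h = (h * base + ord(c)) % Q
        return h

    print(sum_hash("cdef") == sum_hash("fedc"))    # True: collision
    print(poly_hash("cdef") == poly_hash("fedc"))  # False for these inputs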
I have selected 211 as my q for mod q with the following justification. Assume that this plagiarism detector will be used on class polls, because there we will almost always get clean text (no images or diagrams). Class polls have a maximum of 600 characters. If our rolling hash window considers substrings of 3 characters (k = 3), we have around 200 windows. The modulus has to be a prime number, so I will go with the next larger prime number, 211.
Implementation 1 - Rolling Hashing
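A minimal sketch of how a rolling-hash (Rabin-Karp style) search could look; BASE = 256 is an illustrative weight and q = 211 is the modulus justified above. This is a sketch of the technique under those assumptions, not my exact submission code:

    Q = 211     # prime modulus, justified above
    BASE = 256  # assumed weight per character position

    def poly_hash(s):
        """Order-aware (polynomial) hash of a whole string, mod Q."""
        h = 0
        for c in s:
            h = (h * BASE + ord(c)) % Q
        return h

    def rolling_matches(text, pattern):
        """Slide a len(pattern)-character window across text, comparing
        hashes and re-verifying characters whenever the hashes agree."""
        k = len(pattern)
        if k == 0 or k > len(text):
            return []
        high = pow(BASE, k - 1, Q)     # weight of the outgoing character
        target = poly_hash(pattern)
        window = poly_hash(text[:k])
        matches = []
        for i in range(len(text) - k + 1):
            if window == target and text[i:i + k] == pattern:
                matches.append(i)      # verified match at position i
            if i + k < len(text):
                # O(1) roll: drop text[i], pull in text[i + k]
                window = ((window - ord(text[i]) * high) * BASE
                          + ord(text[i + k])) % Q
        return matches

    # Example with k = 3, as in the class-poll scenario above
    print(rolling_matches("the cat sat on the mat", "the"))  # [0, 15]

The key design point is the O(1) window update: subtract the outgoing character's weighted value, multiply by the base, and add the incoming character, all mod q. Matching hashes are re-verified character by character because collisions remain possible.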
Implementation 2 - Cuckoo Method
What makes a good hash function? Firstly, it must be deterministic: the same key must always hash to the same value. Secondly, it should minimize collisions; we can refer back to the library analogy, where our library program gives the same reference code to 20 different books. Minimizing collisions means that the distribution is likely uniform throughout the table (Tang, 2011). Of course, it should also be efficient to compute. With this in mind, I will be using the cuckoo method for my hash function.
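A minimal sketch of cuckoo hashing under the assumptions above (two tables, two hash functions; the specific functions h1/h2 and the eviction limit are illustrative choices, not my exact submission):

    SIZE = 211  # same table size as Implementation 1, for a fair comparison

    def h1(key):
        # hash() is deterministic within one Python run
        return hash(key) % SIZE

    def h2(key):
        # a second, differently distributed function (illustrative choice)
        return (hash(key) // SIZE) % SIZE

    table1 = [None] * SIZE
    table2 = [None] * SIZE

    def cuckoo_insert(key, max_kicks=32):
        """Insert key, evicting ('kicking out') residents between the two
        tables, like a cuckoo chick pushing eggs out of a nest."""
        for _ in range(max_kicks):
            i = h1(key)
            if table1[i] is None:
                table1[i] = key
                return True
            key, table1[i] = table1[i], key  # evict table 1's resident
            j = h2(key)
            if table2[j] is None:
                table2[j] = key
                return True
            key, table2[j] = table2[j], key  # evict table 2's resident
        return False  # too many kicks: a full version would rehash/resize

    def cuckoo_lookup(key):
        """Worst-case O(1): the key can only live in one of two slots."""
        return table1[h1(key)] == key or table2[h2(key)] == key

The design pay-off is worst-case O(1) lookup: any key can live in only two possible slots, so a search never probes more than two places.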
I continue using the same hash table size so that I can fairly compare the two implementations.