CS110 Assignment 5 - Plagiarism Detector
Author = "Kaymin Martin-Burnett"
Contributors = "Ha Tran Nguyen Phuong", "Finn Macken", "Junran Shi"
Introduction
The problem at hand is that students submitting assignments may copy from each other and violate academic guidelines. The input to this problem is two submissions, each concatenated into a single string of all the words submitted. By finding all the common substrings between these strings, an assessor can be alerted to the possibility of plagiarism. A computational solution, such as a program, can speed up the assessment process and reduce errors. But what is the best data structure and algorithm for this computational problem?
Choosing a data structure and algorithm
The computational problem in context is that submitted assignments can be tens of thousands of words long. Professors using this program do not want to sit around for hours or days while similarities between the texts are found, so the objective is to minimize search time. One of the simplest data structures we could use is an array: we could store substrings in array elements and check for matching substrings between arrays. However, numerous inefficiencies arise with this choice of data structure. Most importantly, the array must be checked sequentially, element by element, even if the element we are searching for does not exist. The runtime of this procedure therefore grows with the size of the array, giving a time complexity of O(n). Instead of checking every single element, how might we search for a particular element faster? Rather than memorizing indexes for our search, we can use associative arrays.
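To make the O(n) cost concrete, here is a minimal sketch of sequential search (illustrative code, not part of my submission):

    def linear_search(arr, target):
        """Check each element in turn; an absent target costs all n comparisons."""
        for i, element in enumerate(arr):
            if element == target:
                return i   # found: return its index
        return -1          # not found, but we still scanned the whole array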
An associative array helps us avoid the inefficient runtime of a simple array because it uses a key-value mapping. A hash table is an implementation of an associative array using this key-value mapping technique. We can think of the hash table as using a function that acts as a 'black box' between each key and the buckets of data. We take a key, put it into the black box, which processes it according to a set of rules and then outputs the right bucket to search for our data. This 'black box' is a hash function, and it saves us time: the operations inside it are relatively simple and take O(1). A drawback is that these operations can result in two different keys producing the same value. This problem is like a library filing 20 different books under the same sorting code. So, while the hash function saves us time by directing us to a specific shelf and rack in a library full of books, each book should ideally have a unique code. We will explore further how rehashing, or collision strategies, can mitigate this problem. For now, we can compare the array's search time complexity of O(n) to the hash table's average-case O(1) and justify the latter's choice for our context.
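As a hedged sketch of the black-box idea (the bucket count and the use of Python's built-in hash() are illustrative assumptions, not the assignment's actual parameters):

    NUM_BUCKETS = 8  # illustrative table size, not the assignment's

    table = [[] for _ in range(NUM_BUCKETS)]  # one 'shelf' (bucket) per code

    def bucket_for(key):
        """The 'black box': O(1) work maps a key straight to its bucket."""
        return hash(key) % NUM_BUCKETS

    def insert(key, value):
        table[bucket_for(key)].append((key, value))

    def lookup(key):
        # Only one bucket is scanned, instead of the whole table
        for k, v in table[bucket_for(key)]:
            if k == key:
                return v
        return None

Note that two different keys can still land in the same bucket (a collision), which is why each bucket here is a list rather than a single slot.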
ALGORITHM - The first algorithm choice is provided by the assignment instructions: rolling hashing. Rolling hashing gets its name from the window of characters that it uses to search for matching substrings. The window updates as it pushes out its first character and pulls in the next character just outside the window. At each position of the window, the hash function computes a value that is cross-checked against the value we are searching for. A visual representation of this process is found below. In Figure 1, we can once again see how the hash function is similar to a library: first we scan the library for the shelf (bucket of data) that contains the right combination of uniquely coded books, then we only look at that shelf for the specific book (data) we need. (#analogies) (Diagram inspired by Stable Sort (2019))
There is a problem with the implementation in Figure 1: the hash function produces the same value for ("cdef") and ("fedc") because its operation is just a summation of character values, which ignores order. To avoid this problem, we can use modular arithmetic to create hash values that take a character's position into account. The modulus is needed because the constant we use to weight a character's position can get very large when we want no collisions at all (i.e. giving every character position a unique value).
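To make this concrete, here is a small illustrative check (the base of 256 is an assumed positional weight; q = 211 anticipates the choice justified in the next paragraph):

    Q = 211  # prime modulus, justified below

    def sum_hash(s):
        """Order-blind summation hash: anagrams always collide."""
        return sum(ord(c) for c in s) % Q

    def poly_hash(s, base=256):  # base is an assumed positional weight
        """Positional (polynomial) hash: character order changes the result."""
        h = 0
        for c in s:
            h = (h * base + ord(c)) % Q
        return h

    print(sum_hash("cdef") == sum_hash("fedc"))    # True: collision
    print(poly_hash("cdef") == poly_hash("fedc"))  # False for these inputs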
I have selected 211 as my q for mod q with the following justification. Assume that this plagiarism detector will be used on class polls, because there we will almost always get clean text (no images or diagrams). Class polls have a maximum of 600 characters. If our rolling hash window considers substrings of 3 characters (k = 3), we have around 200 windows. The modulus has to be a prime number, so I will go with the next larger prime number, 211.
Implementation 1 - Rolling Hashing
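A minimal sketch of how a rolling-hash (Rabin-Karp style) search could look; BASE = 256 is an illustrative weight and q = 211 is the modulus justified above. This is a sketch of the technique under those assumptions, not my exact submission code:

    Q = 211     # prime modulus, justified above
    BASE = 256  # assumed weight per character position

    def poly_hash(s):
        """Order-aware (polynomial) hash of a whole string, mod Q."""
        h = 0
        for c in s:
            h = (h * BASE + ord(c)) % Q
        return h

    def rolling_matches(text, pattern):
        """Slide a len(pattern)-character window across text, comparing
        hashes and re-verifying characters whenever the hashes agree."""
        k = len(pattern)
        if k == 0 or k > len(text):
            return []
        high = pow(BASE, k - 1, Q)     # weight of the outgoing character
        target = poly_hash(pattern)
        window = poly_hash(text[:k])
        matches = []
        for i in range(len(text) - k + 1):
            if window == target and text[i:i + k] == pattern:
                matches.append(i)      # verified match at position i
            if i + k < len(text):
                # O(1) roll: drop text[i], pull in text[i + k]
                window = ((window - ord(text[i]) * high) * BASE
                          + ord(text[i + k])) % Q
        return matches

    # Example with k = 3, as in the class-poll scenario above
    print(rolling_matches("the cat sat on the mat", "the"))  # [0, 15]

The key design point is the O(1) window update: subtract the outgoing character's weighted value, multiply by the base, and add the incoming character, all mod q. Matching hashes are re-verified character by character because collisions remain possible.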
Implementation 2 - Cuckoo Method
What makes a good hash function? Firstly, it must be deterministic: the same key must always hash to the same value. Secondly, it should minimize collisions; we can refer back to the library analogy, where our library program gives the same reference code to 20 different books. Minimizing collisions means that the distribution is likely uniform throughout the table (Tang, 2011). Of course, it should also be efficient to compute. With this in mind, I will be using the cuckoo method for my hash function.
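A minimal sketch of cuckoo hashing under the assumptions above (two tables, two hash functions; the specific functions h1/h2 and the eviction limit are illustrative choices, not my exact submission):

    SIZE = 211  # same table size as Implementation 1, for a fair comparison

    def h1(key):
        # hash() is deterministic within one Python run
        return hash(key) % SIZE

    def h2(key):
        # a second, differently distributed function (illustrative choice)
        return (hash(key) // SIZE) % SIZE

    table1 = [None] * SIZE
    table2 = [None] * SIZE

    def cuckoo_insert(key, max_kicks=32):
        """Insert key, evicting ('kicking out') residents between the two
        tables, like a cuckoo chick pushing eggs out of a nest."""
        for _ in range(max_kicks):
            i = h1(key)
            if table1[i] is None:
                table1[i] = key
                return True
            key, table1[i] = table1[i], key  # evict table 1's resident
            j = h2(key)
            if table2[j] is None:
                table2[j] = key
                return True
            key, table2[j] = table2[j], key  # evict table 2's resident
        return False  # too many kicks: a full version would rehash/resize

    def cuckoo_lookup(key):
        """Worst-case O(1): the key can only live in one of two slots."""
        return table1[h1(key)] == key or table2[h2(key)] == key

The design pay-off is worst-case O(1) lookup: any key can live in only two possible slots, so a search never probes more than two places.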
I continue using the same hash table size so that I can fairly compare the two implementations.