## Foundations of Data Science: Programming and Linear Algebra

Group members:

Boon Zhong Boh (142870)

Tomas Latrille (121175)

Manuel Oom (141252)

Kai Rauenbühler (141358)

__Question 1: Class Snumpy__

Snumpy is a class that implements some of the functionality of the popular numerical computing library NumPy.

The class defines:

an array as `[1,2,3,4,5,6]`,

a vector as a column matrix with m rows and 1 column in the form of `[[1],[1],[2]]`,

a matrix with 1 row and n columns in the form of `[[1,2,3,4,5]]` or with m rows and n columns as `[[1,2,3],[3,4,5],[4,5,6]]`.

Note: Refer to all the methods' codes for the relevant parameter checks and error handling.

## ones

The ones method takes the desired **array_length** as a parameter and returns an array of 1's with the length specified in **array_length**.

Parameters:

- array_length: integer

Returns:

- Array of 1's with the parameter's length.

## zeros

The zeros method takes the desired **array_length** as a parameter and returns an array of 0's with the length specified in **array_length**.

Parameters:

- array_length: integer

Returns:

- Array of 0's with the parameter's length.
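Both methods can be sketched as one-liners over plain Python lists (a minimal illustration; the actual Snumpy methods also perform the parameter checks mentioned above):

```python
def ones(array_length):
    # Return a flat list of 1's of the requested length.
    return [1] * array_length

def zeros(array_length):
    # Return a flat list of 0's of the requested length.
    return [0] * array_length
```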

## reshape

The reshape method takes an array (**lst**) and converts it into the dimensions specified by the tuple (**row_column_size_tuple**).

Parameters:

- lst: list
- row_column_size_tuple: tuple

Returns:

- List transformed to a matrix (see above for description of matrix data type).
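A minimal sketch of the underlying idea, assuming the list-of-lists matrix representation described above:

```python
def reshape(lst, row_column_size_tuple):
    # Split a flat list into `rows` sublists of `cols` elements each.
    rows, cols = row_column_size_tuple
    if rows * cols != len(lst):
        raise ValueError("cannot reshape: sizes do not match")
    return [lst[r * cols:(r + 1) * cols] for r in range(rows)]
```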

## shape

The shape method takes a matrix (**mat**) and returns its dimensions in a tuple (**row_column_size_tuple**) in the form of (number of rows, number of columns).

Parameters:

- mat: matrix (see above for description of matrix data type).

Returns:

- Lengths of the corresponding matrix dimensions as a tuple.
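In the list-of-lists representation this is a two-line sketch (the real method also validates its input):

```python
def shape(mat):
    # Rows = outer length; columns = length of any (rectangular) row.
    return (len(mat), len(mat[0]))
```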

## append

The append method takes either two vectors or two matrices as parameters and appends them either by row or by column, depending on the **append_by_row** parameter.

Parameters:

- mat1: matrix (see above for description of matrix data type)
- mat2: matrix (see above for description of matrix data type)
- append_by_row: boolean (optional, default = True)

Returns:

- A new matrix that is the combination of the two input matrices.
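A minimal sketch of both modes, assuming the inputs have compatible dimensions (the actual method checks this):

```python
def append(mat1, mat2, append_by_row=True):
    # By row: stack mat2's rows below mat1's.
    if append_by_row:
        return mat1 + mat2
    # By column: glue each row of mat2 onto the matching row of mat1.
    return [r1 + r2 for r1, r2 in zip(mat1, mat2)]
```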

## get

The get method takes a matrix (**mat**) and a tuple (**row_column_index_tuple**) as parameters and returns the requested value based on 0-based indexing.

Parameters:

- mat: matrix (see above for description of matrix data type)
- row_column_index_tuple: tuple

Returns:

- The element in *mat* specified by *row_column_index_tuple*.
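A minimal sketch, assuming 0-based indexing as described:

```python
def get(mat, row_column_index_tuple):
    # First index selects the row, second the column (both 0-based).
    row, col = row_column_index_tuple
    return mat[row][col]
```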

## add

The add method takes two matrices (**mat1** and **mat2**) and sums them up.

Parameters:

- mat1: matrix (see above for description of matrix data type)
- mat2: matrix (see above for description of matrix data type)

Returns:

- The sum of the two provided matrices.

## subtract

The subtract method takes two matrices (**mat1** and **mat2**) and subtracts **mat2** from **mat1**.

Parameters:

- mat1: matrix (see above for description of matrix data type)
- mat2: matrix (see above for description of matrix data type)

Returns:

- The difference of the two provided matrices.
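Both operations reduce to element-wise arithmetic over the list-of-lists representation; a minimal sketch, assuming equal shapes (which the actual methods verify):

```python
def add(mat1, mat2):
    # Element-wise sum of two same-shaped matrices.
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(mat1, mat2)]

def subtract(mat1, mat2):
    # Element-wise difference mat1 - mat2.
    return [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(mat1, mat2)]
```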

## dotproduct

The dotproduct method takes two matrices (**mat1** and **mat2**) and computes their dot product.

Broad explanation of the algorithm:

1. Determine whether the supplied matrices are column matrices.
2. If column matrices are supplied, only their lengths are assessed to see whether the dot product can be performed; otherwise, the usual rule that the number of columns of **mat1** equals the number of rows of **mat2** must hold. This checks that the supplied matrices have dimensions such that the dot product can be applied to them.
3. A helper function (getMatrixColbyIndex) is defined to retrieve a column of **mat2** as the dot product progresses.
4. A for-loop goes through each row of **mat1**. During each row, getMatrixColbyIndex retrieves the columns of **mat2** one by one and a dot product is carried out each time. The result from each dot product is stored in a list.
5. The dimensions of the resulting matrix are obtained using the __shape__ method and the row and column dimensions of **mat1** and **mat2**.
6. The dot product is returned by reshaping the list from point 4, using the __reshape__ method and the dimensions from point 5.

Parameters:

- mat1: matrix (see above for description of matrix data type)
- mat2: matrix (see above for description of matrix data type)

Returns:

- The dotproduct of the two provided matrices.
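The general (non-column-vector) case of the algorithm above can be sketched as follows; the inline get_col helper plays the role of getMatrixColbyIndex, and the dimension checks are omitted:

```python
def dotproduct(mat1, mat2):
    def get_col(mat, j):
        # Retrieve column j of a list-of-lists matrix.
        return [row[j] for row in mat]

    # Entry (i, j) of the result is the dot product of row i of mat1
    # with column j of mat2; results accumulate in a flat list.
    result = []
    for row in mat1:
        for j in range(len(mat2[0])):
            col = get_col(mat2, j)
            result.append(sum(a * b for a, b in zip(row, col)))
    # Reshape the flat list to (rows of mat1) x (columns of mat2).
    rows, cols = len(mat1), len(mat2[0])
    return [result[r * cols:(r + 1) * cols] for r in range(rows)]
```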

## solver

The solver method takes two matrices, **mat1** as the coefficient matrix and **mat2** as the "dependent variable" values, and solves the system of linear equations.

1. Checks that the supplied matrices have dimensions that are reasonable for a system of linear equations.
2. Combines **mat1** and **mat2** into an augmented matrix.
3. Arranges the rows of the augmented matrix into an order that more closely resembles row echelon form.
4. Conducts row reduction. If a ZeroDivisionError is detected, there is no unique solution for the system and solver will exit, notifying the user.
5. Once row reduction is successfully completed, back substitution is performed to return the unique solution for the system.

Parameters:

- mat1: matrix (see above for description of matrix data type)
- mat2: matrix (see above for description of matrix data type)

Returns:

- The unique solution of the system, or a message that no solution was found.
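A compact sketch of the elimination-plus-back-substitution procedure described above; unlike the actual solver, which catches ZeroDivisionError, this illustration uses partial pivoting and returns None when no unique solution exists:

```python
def solver(mat1, mat2):
    n = len(mat1)
    # Augmented matrix [mat1 | mat2], with mat2 a column matrix.
    aug = [list(row) + [mat2[i][0]] for i, row in enumerate(mat1)]
    for i in range(n):
        # Swap in a row with a nonzero pivot before reducing.
        pivot = next((r for r in range(i, n) if aug[r][i] != 0), None)
        if pivot is None:
            return None  # no unique solution
        aug[i], aug[pivot] = aug[pivot], aug[i]
        # Row reduction: eliminate column i from the rows below.
        for r in range(i + 1, n):
            factor = aug[r][i] / aug[i][i]
            aug[r] = [a - factor * b for a, b in zip(aug[r], aug[i])]
    # Back substitution from the last row upward.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x
```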

__Question 2: Hamming's Code__

The purpose of Hamming(7,4) is to encode a 4-bit code ($w$) into a 7-bit code ($c_w$) such that when it is sent over a noisy channel, any 1-bit error can be corrected, or any 1-bit or 2-bit error(s) can be detected.

__Codeword Generation__

$w$ contains the data bits, while 3 other bits (parity bits) are added to the data bits during the encoding process. Each parity bit "oversees" 3 data bits, such that the XOR of a parity bit and its data bits is always 0, and each parity bit oversees a unique set of data bits. In other words, we can define each parity bit ($p_i$, where $i = 1,2,3$) as follows:

$p_1 = d_1$ XOR $d_2$ XOR $d_4$,

$p_2 = d_1$ XOR $d_3$ XOR $d_4$,

$p_3 = d_2$ XOR $d_3$ XOR $d_4$, where $d_i$, $i = 1,2,3,4$, are the data bits.

Positioning of the data-bits and the parity bits in $c_w$ is deliberate. Parity bits are placed in positions corresponding to index 001 ($p_1$), 010 ($p_2$) and 100 ($p_3$) in binary (only a single 1 is turned on). In fact, each parity bit is assigned data-bits that have 1 in their index in the same position:

$p_1$ with index 001 in binary (1 in base 10) is assigned data bits with index 011(3), 101(5) and 111(7) in $c_w$,

$p_2$ with index 010 in binary (2 in base 10) is assigned data bits with index 011(3), 110(6) and 111(7) in $c_w$,

$p_3$ with index 100 in binary (4 in base 10) is assigned data bits with index 101(5), 110(6) and 111(7) in $c_w$.

The encode() function implements $G \cdot w$, where $G$ is the generator matrix. $G$ is defined such that each row corresponds to a bit of the resulting $c_w$, and the columns correspond to the data bits that the bit in that row depends on. For example, rows 1 to 7 of $G$ correspond to $p_1$, $p_2$, $d_1$, $p_3$, $d_2$, $d_3$, $d_4$, and the 1's of the first row indicate which data bits $p_1$ has influence over, i.e. $d_1$, $d_2$ and $d_4$.

The XOR matrix multiplication $G \cdot w$ then gives $c_w$.
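Concretely, the generator matrix and the XOR multiplication can be sketched as follows (a minimal illustration of the scheme described above, not the project's exact code):

```python
# Rows follow the bit order p1, p2, d1, p3, d2, d3, d4.
G = [
    [1, 1, 0, 1],  # p1 = d1 XOR d2 XOR d4
    [1, 0, 1, 1],  # p2 = d1 XOR d3 XOR d4
    [1, 0, 0, 0],  # d1
    [0, 1, 1, 1],  # p3 = d2 XOR d3 XOR d4
    [0, 1, 0, 0],  # d2
    [0, 0, 1, 0],  # d3
    [0, 0, 0, 1],  # d4
]

def encode(w):
    # XOR matrix multiplication: each output bit is the parity
    # (sum mod 2) of the data bits selected by the matching row of G.
    return [sum(g * b for g, b in zip(row, w)) % 2 for row in G]
```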

__Parity Check__

In the parity_check() function, $H$ is defined such that each row corresponds to a parity bit, and the columns correspond to itself and the data bits that it has influence over. For example, rows 1 to 3 correspond to $p_1$, $p_2$ and $p_3$, and the 1's of the first row correspond to $p_1$, $d_1$, $d_2$ and $d_4$ in $c_w$. The rows of $H$ are also the coefficients of the parity check equations, where:

$p_1$ XOR $d_1$ XOR $d_2$ XOR $d_4$ = 0,

$p_2$ XOR $d_1$ XOR $d_3$ XOR $d_4$ = 0,

$p_3$ XOR $d_2$ XOR $d_3$ XOR $d_4$ = 0.

The XOR-multiplication $H \cdot c_w$ generates the syndrome vector, which indicates which of the parity equations are not satisfied. The syndrome vector also identifies the location of a single error$^1$. The first row of the syndrome vector corresponds to the least significant bit, so that $\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$ corresponds to position $2^0 = 1$ of $c_w$.

__Decoding__

In the decode() function, we first run the parity_check() function to detect if there is any error. The function will return a message if either a 1 or 2-bit error is found. In the event of a 1-bit error, the decode() function will flip the bit at the position indicated by the parity_check() function. The $c_w$ with the flipped bit will then be multiplied with $R$, which is simply a mapping matrix, to retrieve the data-bits from $c_w$. The decode() function then returns the original $w$.
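This decoding path can be sketched as follows (a self-contained illustration using the bit layout above; the helper names are not the project's exact code):

```python
# Each row of H covers one parity equation over c_w.
H = [
    [1, 0, 1, 0, 1, 0, 1],  # p1 equation: positions 1, 3, 5, 7
    [0, 1, 1, 0, 0, 1, 1],  # p2 equation: positions 2, 3, 6, 7
    [0, 0, 0, 1, 1, 1, 1],  # p3 equation: positions 4, 5, 6, 7
]

def parity_check(cw):
    # Syndrome = H . cw over GF(2); read LSB-first, it gives the
    # 1-based position of a single-bit error (0 means no error found).
    syndrome = [sum(h * b for h, b in zip(row, cw)) % 2 for row in H]
    return syndrome[0] * 1 + syndrome[1] * 2 + syndrome[2] * 4

def decode(cw):
    cw = list(cw)
    pos = parity_check(cw)
    if pos != 0:
        cw[pos - 1] ^= 1  # flip the bit flagged by the syndrome
    # The mapping R simply reads the data bits out of positions 3, 5, 6, 7.
    return [cw[2], cw[4], cw[5], cw[6]]
```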

In the event of a 2-bit error, however, the decode() function cannot distinguish this from a 1-bit error. We can tell from the parity_check() function within the decode() function that there is an error, because it does not return a zero syndrome vector. It will attempt to correct the error based on the position returned by the parity_check() function, but this will not correct the error. The decode() function will ultimately return an erroneous decoded 4-bit code that is different from the original $w$.

Similarly, for a 3-bit error, the decode() function will ultimately return an erroneous decoded 4-bit code that is different from the original $w$. In fact, some combinations of 3-bit errors can even pass the parity_check() function, in which case it reports that no error is detected despite the 3-bit error.

__Extra Clarifications__

Four-bit and seven-bit lists -- which are used for dot-product operations in this question -- should be provided as flat lists (e.g. [1,0,1,0] and [1,0,1,0,1,1,0]). This is notably different from Question 1, where the matrices or vectors supplied to most functions are lists of lists (e.g. [[1,2,3,4], [5,6,7,8]]).

While the guidelines recommend that the format of vectors be defined in the same manner throughout this project, we have decided that, since this question only involves vectors, we would relax that requirement, making user input and function definitions easier.

$^1$https://web.stanford.edu/class/ee387/handouts/notes04.pdf

__Question 3: Text Document Similarity__

This algorithm evaluates similarity between input documents and a search document. All documents must be provided by the user and for this specific implementation Walt Whitman's poems were used.

## How does the algorithm work?

### Inputs

Set of Basis or Input Documents: $B$

Search Document: $s$

Each document $b_{i}$ in $B$ will have $\{w_{i,1} \ldots w_{i,a}\}$ words, where $a$ can vary between input documents.

The search document $s$ will also have $\{s_{1} \ldots s_{b}\}$ words.

### Merging Input Documents

With the words of each of the input documents, it is possible to build a set of unique words

$$ basis = \{x \in \{ w_{1,1} \ldots w_{1,n}, w_{2,1} \ldots w_{p,q}\}\} $$

We simplify $basis$ to:

$$ basis = \{word_{1} \ldots word_{r}\} $$

Ergo, $basis$ contains only the unique words present across all input documents.
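Building the basis amounts to taking the union of all words; a minimal sketch (build_basis is a hypothetical helper name, and the set is sorted so vector positions stay stable):

```python
def build_basis(documents):
    # Union of all words across the input documents (each document is
    # a list of words), returned as a sorted list of unique words.
    return sorted({word for doc in documents for word in doc})
```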

### Vectorizing Search Document and Input Document

With $basis$ it is possible to build vectors for our input and search documents by counting the occurrence of each word present in $basis$. This process is called Vectorization$^1$. If the search document contains a word that does not exist in $basis$, that word will not be counted. As a result, the input document and search document vectors will always have the same length.

Therefore we will have several term frequency vectors:

$\vec{b_{i}}$, representing the vector for each input document, where $i = 1,\ldots,n$ and $n$ is the total number of input documents

$\vec{s}$, representing the vector for the search document
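Vectorization then reduces to counting occurrences of each basis word; a minimal sketch, assuming the basis is an ordered list so vector positions are stable:

```python
def vectorize(words, basis):
    # Term-frequency vector: count how often each basis word occurs in
    # the document; words outside the basis are simply ignored, so all
    # vectors share the basis length.
    return [words.count(term) for term in basis]
```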

### Vector Operations to evaluate Similarity

Similarity between vectors will be assessed according to three methods:

#### Dot Product

$$ \vec{s} \cdot \vec{b_{i}} $$

#### Euclidean Distance

Represents the distance in $n$-dimensional space between two vectors

$$ d\left( \vec{s},\vec{b_{i}}\right) = \sqrt {\sum_{d=1}^{n} \left( s_{d}-b_{i,d}\right)^2 } $$

#### Cosine Similarity$^2$

Represents the "similarity" in terms of the angle between two vectors. A cosine of 0 means the two vectors are orthogonal and share no words, while a cosine of 1 means they point in the same direction, i.e. one vector is a positive scalar multiple of the other.

$$ sim(\vec{s}, \vec{b_{i}}) = \cos(\theta ) = {\vec{s} \cdot \vec{b_{i}} \over |\vec{s}|\,|\vec{b_{i}}|} $$
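The three measures can be sketched directly from the formulas above:

```python
import math

def dot(u, v):
    # Dot product of two equal-length term-frequency vectors.
    return sum(a * b for a, b in zip(u, v))

def euclidean_distance(u, v):
    # Straight-line distance between the two vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    # Cosine of the angle between the vectors: dot product divided
    # by the product of their magnitudes.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
```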

### Computational Complexity

As we can see, the need to iterate both through the input texts and through the words in them makes this algorithm $O(n^{2})$; all other operations are of the same or lower complexity.

__Extra Clarifications__

Similar to Question 2, we only require vectors to be defined as lists instead of lists of lists in this question.

### Implementation

### Input Documents and Search document

Input documents are merged into a set of words; this set is the *basis*. Word vectors are built according to that basis.

We chose three fragments of Walt Whitman's poems and two dummy inputs as input documents to check similarity against the search document.

### Merging all word sets into one (corpus)

### StopWords Removal

Stopwords, i.e. extremely common words of a language, are removed from the basis. For this purpose we used the list of 179 most common English words from NLTK, stored in stopwords.py.

Stopwords are removed to ensure that we evaluate similarity between documents according to the content itself, without taking into account articles, connectors or extremely common words of the English language.

#### Vectorizing the search document according to the basis

#### Vectorizing input documents according to the basis

### Similarity

Similarity between a search document and one of the input documents (components of the basis) is evaluated through three measures:

- Dotproduct
- Euclidean Distance
- Cosine

### Results

Some visualizations might not appear properly in the .pdf version; it is therefore recommended to view the published notebook on Deepnote.

### Correlation Matrix

The results tell us that all methods tend to move together: higher similarity usually means a higher dot product and a lower Euclidean distance.

It is important to note that the file called "Song Of Myself" was excluded from the input documents. The poem was commented out in the definition of the input documents but could still be used. The main reason for excluding it is its disproportionate size: with more than 1000 lines (10x the others), it behaves differently and tends to bias the dot product and Euclidean distance indicators.

This is a remarkable fact: dot product and Euclidean distance appear to be biased by document size. Cosine similarity, on the other hand, is not biased toward size and provides a standardized measure between 0 and 1, enabling better comparisons across situations. We therefore conclude that cosine similarity is indeed the better-performing measure for evaluating text similarity.