import solution
import random
import numpy as np
[nltk_data] Downloading package wordnet to
[nltk_data] /Users/jaroslavsafar/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
iterations = 50
num_of_topics = 20
alpha = 0.1
gamma = 0.1
random.seed(42)
np.random.seed(42)
train_docs, dictionary, train_newsgroups = solution.load_and_preprocess('train')
num_of_docs = len(train_docs)
num_of_words = len(dictionary)
11314 documents loaded.
Example document:
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----
Example document - lemmatized and stemmed:
['lerxst', 'thing', 'subject', 'nntp', 'post', 'host', 'organ', 'univers', 'maryland', 'colleg', 'park', 'line', 'wonder', 'enlighten', 'door', 'sport', 'look', 'late', 'earli', 'call', 'bricklin', 'door', 'small', 'addit', 'bumper', 'separ', 'rest', 'bodi', 'know', 'tellm', 'model', 'engin', 'spec', 'year', 'product', 'histori', 'info', 'funki', 'look', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']
Dictionary size: 6591
Example document - filtered:
[29, 20, 11, 30, 17, 5, 21, 31, 9, 6, 27, 15, 14, 7, 4, 6, 25, 0, 3, 24, 23, 1, 13, 18, 8, 26, 32, 22, 10, 12, 15, 16, 28, 2, 19]
Maximum document length: 4621
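The preprocessing itself happens inside solution.load_and_preprocess, which is not shown here. A minimal dependency-free sketch of that kind of pipeline follows; the transcript's WordNet download suggests the real code uses NLTK's WordNetLemmatizer plus a stemmer, so the crude suffix stripper below is a toy stand-in, and all function names are assumptions:

```python
import re

def crude_stem(token):
    """Toy stand-in for NLTK lemmatize+stem: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text, min_len=3):
    """Lowercase, keep alphabetic tokens, drop short ones, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if len(t) >= min_len]

def build_dictionary(docs):
    """Assign an integer id to every word type that survives filtering."""
    vocab = sorted({w for doc in docs for w in doc})
    return {w: i for i, w in enumerate(vocab)}

def encode(doc, dictionary):
    """Replace tokens by ids, dropping out-of-vocabulary words
    (as the test split must do with the training dictionary)."""
    return [dictionary[w] for w in doc if w in dictionary]
```

The "filtered" lists above are exactly this id encoding; the test split reuses the training dictionary, which is why its ids can exceed the per-document vocabulary.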
lda_result = solution.lda(docs=train_docs, num_of_topics=num_of_topics, num_of_words=num_of_words, iterations=iterations, alpha=alpha, gamma=gamma)
/Users/jaroslavsafar/Developer/Unsupervised_ML_in_NLP/hw1/solution.py:109: RuntimeWarning: overflow encountered in exp2
perplexity = np.exp2(entropy)
-----------------------------
Begin iteration 1:
    [... iterations 2-49 elided ...]
-----------------------------
Begin iteration 50:
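solution.lda is not shown in this transcript. It returns per-token topic assignments plus the doc-topic, word-topic, and topic count tables, which is the signature of a collapsed Gibbs sampler. A minimal sketch of that algorithm follows; the variable names and the roles of alpha (document-topic smoothing) and gamma (topic-word smoothing) are assumptions about the actual implementation:

```python
import numpy as np

def gibbs_lda(docs, num_of_topics, num_of_words, iterations, alpha, gamma, rng=None):
    """Collapsed Gibbs sampling for LDA over docs given as lists of word ids."""
    if rng is None:
        rng = np.random.default_rng(42)
    doc_topics = np.zeros((len(docs), num_of_topics), dtype=np.int64)
    word_topics = np.zeros((num_of_words, num_of_topics), dtype=np.int64)
    topics_count = np.zeros(num_of_topics, dtype=np.int64)
    # Random initial topic assignment for every token.
    z = [rng.integers(num_of_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topics[d, t] += 1
            word_topics[w, t] += 1
            topics_count[t] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove this token's current assignment from the counts...
                doc_topics[d, t] -= 1
                word_topics[w, t] -= 1
                topics_count[t] -= 1
                # ...and resample it from the full conditional
                # p(t) ∝ (n_dt + alpha) * (n_wt + gamma) / (n_t + V * gamma).
                p = (doc_topics[d] + alpha) * (word_topics[w] + gamma) \
                    / (topics_count + num_of_words * gamma)
                t = rng.choice(num_of_topics, p=p / p.sum())
                z[d][i] = t
                doc_topics[d, t] += 1
                word_topics[w, t] += 1
                topics_count[t] += 1
    return z, doc_topics, word_topics, topics_count
```

Returning the raw count tables (rather than normalized distributions) is what makes the held-out inference step below possible: word_topics_count and topics_count summarize everything the trained model knows about the vocabulary.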
topics, saved_topics_distribution, entropies, doc_topics_count, word_topics_count, topics_count = lda_result
solution.plot_distribution_over_topics(saved_topics_distribution, num_of_topics)
solution.plot_entropies(entropies)
solution.plot_word_histogram(selected_topics=[1, 5, 15], word_topics_count=word_topics_count, n=20, dictionary=dictionary)
test_docs, _, test_newsgroups = solution.load_and_preprocess('test', dictionary=dictionary)
7532 documents loaded.
Example document:
From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)
Subject: Need info on 88-89 Bonneville
Organization: University at Buffalo
Lines: 10
News-Software: VAX/VMS VNEWS 1.41
Nntp-Posting-Host: ubvmsd.cc.buffalo.edu
I am a little confused on all of the models of the 88-89 bonnevilles.
I have heard of the LE SE LSE SSE SSEI. Could someone tell me the
differences are far as features or performance. I am also curious to
know what the book value is for prefereably the 89 model. And how much
less than book value can you usually get them for. In other words how
much are they in demand this time of year. I have heard that the mid-spring
early summer is the best time to buy.
Neil Gandler
Example document - lemmatized and stemmed:
['ubvmsd', 'buffalo', 'neil', 'gandler', 'subject', 'need', 'info', 'bonnevill', 'organ', 'univers', 'buffalo', 'line', 'news', 'softwar', 'vnew', 'nntp', 'post', 'host', 'ubvmsd', 'buffalo', 'littl', 'confus', 'model', 'bonnevill', 'hear', 'ssei', 'tell', 'differ', 'featur', 'perform', 'curious', 'know', 'book', 'valu', 'prefer', 'model', 'book', 'valu', 'usual', 'word', 'demand', 'time', 'year', 'hear', 'spring', 'earli', 'summer', 'best', 'time', 'neil', 'gandler']
Dictionary size: 6591
Example document - filtered:
[729, 2363, 280, 12, 30, 729, 120, 220, 715, 20, 11, 729, 536, 3475, 18, 107, 222, 499, 627, 124, 201, 13, 855, 226, 1018, 18, 855, 226, 445, 839, 3163, 152, 32, 107, 2099, 7, 147, 853, 152, 2363]
Maximum document length: 3954
lda_new_data_results = solution.lda_new_data(docs=test_docs, num_of_topics=num_of_topics, num_of_words=num_of_words,
word_topics_count=word_topics_count, topics_count=topics_count,
iterations=iterations, alpha=alpha, gamma=gamma)
-----------------------------
Begin iteration 1:
    [... iterations 2-49 elided ...]
-----------------------------
Begin iteration 50:
_, lda_perplexity, simple_perplexity = lda_new_data_results
print(f'{lda_perplexity=}, {simple_perplexity=}')
lda_perplexity=1466.6730423878778, simple_perplexity=2627.0348276547684
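The "overflow encountered in exp2" warning earlier comes from line 109 of solution.py, which is not shown; exponentiating a total (rather than per-token) entropy makes np.exp2 overflow for any realistically sized corpus. Perplexity is conventionally 2 raised to the average negative log2-probability per token, which keeps the exponent small. A hedged sketch, assuming that is the quantity being computed:

```python
import numpy as np

def perplexity(log2_probs):
    """Perplexity = 2 ** (average negative log2-probability per token).

    Averaging BEFORE exponentiating keeps the exponent near the
    per-token entropy (a few bits) instead of the corpus total,
    avoiding the exp2 overflow warned about above.
    """
    entropy_per_token = -np.mean(log2_probs)
    return np.exp2(entropy_per_token)
```

On this scale the values printed above are directly interpretable: the LDA model is as uncertain as a uniform choice among ~1467 words per token, versus ~2627 for the simple baseline.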
iterations = 20
num_of_topics = 10
alpha = 0.01
gamma = 0.01
random.seed(42)
np.random.seed(42)
lda_result_2 = solution.lda(docs=train_docs, num_of_topics=num_of_topics, num_of_words=num_of_words, iterations=iterations, alpha=alpha, gamma=gamma)
topics_2, saved_topics_distribution_2, entropies_2, doc_topics_count_2, word_topics_count_2, topics_count_2 = lda_result_2
/Users/jaroslavsafar/Developer/Unsupervised_ML_in_NLP/hw1/solution.py:109: RuntimeWarning: overflow encountered in exp2
perplexity = np.exp2(entropy)
-----------------------------
Begin iteration 1:
    [... iterations 2-19 elided ...]
-----------------------------
Begin iteration 20:
solution.plot_distribution_over_topics(saved_topics_distribution_2, num_of_topics, name='topics_distribution_2.png')
solution.plot_entropies(entropies_2, name='entropies_2.png')
solution.plot_word_histogram(selected_topics=[1, 5, 9], word_topics_count=word_topics_count_2, n=20, dictionary=dictionary, name='word_histogram_2.png')
lda_new_data_results_2 = solution.lda_new_data(docs=test_docs, num_of_topics=num_of_topics, num_of_words=num_of_words,
word_topics_count=word_topics_count_2, topics_count=topics_count_2,
iterations=iterations, alpha=alpha, gamma=gamma)
_, lda_perplexity_2, simple_perplexity_2 = lda_new_data_results_2
-----------------------------
Begin iteration 1:
    [... iterations 2-19 elided ...]
-----------------------------
Begin iteration 20:
print(f'{lda_perplexity_2=}, {simple_perplexity_2=}')
lda_perplexity_2=1885.4735159650388, simple_perplexity_2=2627.0559503741624
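simple_perplexity is nearly identical across both runs (~2627) even though K, alpha, gamma, and the iteration count all changed, which is consistent with it being a topic-free smoothed unigram baseline computed from the training counts alone. A minimal sketch of such a baseline; this is an assumption about what solution.py computes, not a copy of it:

```python
import numpy as np

def unigram_perplexity(test_docs, word_counts, gamma=0.1):
    """Smoothed unigram baseline: p(w) ∝ count(w) + gamma.

    No topic structure is used, so the result is essentially
    independent of the LDA hyperparameters.
    """
    probs = (word_counts + gamma) / (word_counts.sum() + gamma * len(word_counts))
    log2p = [np.log2(probs[w]) for doc in test_docs for w in doc]
    return np.exp2(-np.mean(log2p))
```

Against that fixed baseline, the comparison isolates the hyperparameters: the first run (K=20, alpha=gamma=0.1, 50 iterations) reaches a noticeably lower held-out perplexity (~1467) than the second (K=10, alpha=gamma=0.01, 20 iterations, ~1885), while both clearly beat the unigram model.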