What Makes a Word "Complex"?
Text-simplification systems, now widely used across devices and programs, rely on measuring the complexity of words. The dataset studied here records several variables for each word, including word length, I_ZScore, and reaction time, which raises the question of how these factors correlate with one another. This notebook also cross-references the words in the study with words used in mainstream news headlines, to observe whether media outlets tend toward wording that is sophisticated yet approachable. We use linear regression and prediction to address the following questions:
- Correlation between word length and difficulty
- Correlation between reaction time and difficulty
- Observing word difficulty in news headlines
Ten rows and the corresponding ten columns of our cleaned data table are shown below.
What is I_ZScore?
I_ZScore in this dataset is the word's mean lexical decision latency, expressed on a relative (z-score) scale. The measurement comes from timed trials in which participants distinguish real words from nonlexical letter strings. Lexical decision latency generally correlates with word frequency: words that occur more often in text or speech are recognized faster and more accurately than words that occur less often. Scores at or below zero indicate a faster mean decision time and thus a simpler word, while scores above zero indicate words that are harder to recognize.
The graph "Word Length vs Word Difficulty" above shows a positive correlation between word length and word difficulty: as word length increases, word difficulty tends to increase as well. The second graph above shows the line of best fit.
The correlation coefficient for the graph above is 0.5607, a moderate positive correlation.
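As a minimal sketch of how a correlation coefficient and a line of best fit can be computed, the example below uses NumPy on a small set of made-up values. The arrays `lengths` and `z_scores` and the helper names `standard_units`, `correlation`, and `fit_line` are assumptions for illustration; they are not the notebook's data or functions, so the printed numbers will not match 0.5607.

```python
import numpy as np

# Hypothetical toy data (NOT from the study): word lengths and I_ZScore values.
lengths = np.array([3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
z_scores = np.array([-0.6, -0.5, -0.2, -0.3, 0.1, 0.0, 0.4, 0.3])

def standard_units(values):
    """Convert values to standard units (mean 0, population SD 1)."""
    return (values - values.mean()) / values.std()

def correlation(x, y):
    """Pearson correlation: the mean of the products in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

def fit_line(x, y):
    """Slope and intercept of the least-squares regression line."""
    r = correlation(x, y)
    slope = r * y.std() / x.std()
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

r = correlation(lengths, z_scores)
slope, intercept = fit_line(lengths, z_scores)
print(f"r = {r:.4f}, slope = {slope:.4f}, intercept = {intercept:.4f}")
```

The slope is the correlation rescaled by the ratio of the two standard deviations, which is why a positive correlation always produces an upward-sloping line of best fit.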
What is Reaction Time?
Reaction time in this dataset is the time, in milliseconds, that a subject takes to indicate that they have identified the given stimulus word. As with I_ZScore, we can infer that more difficult or complex words take a subject longer to identify.
The graph above shows a positive correlation between reaction time and word difficulty: the longer it takes someone to identify the stimulus, the higher the word's difficulty tends to be. The second graph above shows the line of best fit.
Below: a function for the error between the regression line's prediction and the actual value.
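One common way to express such an error is the residual (actual minus predicted) for each point, summarized as a root mean squared error. The sketch below assumes that approach; the function names `prediction_errors` and `rmse` and the sample values are hypothetical and may differ from the notebook's own function.

```python
import numpy as np

def prediction_errors(x, y, slope, intercept):
    """Residuals: actual y minus the regression line's prediction."""
    predicted = slope * x + intercept
    return y - predicted

def rmse(x, y, slope, intercept):
    """Root mean squared error of the line's predictions."""
    errors = prediction_errors(x, y, slope, intercept)
    return np.sqrt(np.mean(errors ** 2))

# Hypothetical values: the line y = 2x fits every point except the last.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 9.0])

errors = prediction_errors(x, y, slope=2.0, intercept=0.0)
error_size = rmse(x, y, slope=2.0, intercept=0.0)
print(errors, error_size)
```

A smaller RMSE means the line's predictions sit closer to the observed values, in the same units as the variable being predicted.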
The difficulty_scores array holds each headline's difficulty based on its normalized score, as computed by the hd_difficulty_score method above. Because the dataset is sizeable, a randomized sample is used to measure each headline's word difficulty; a word counts toward the score only if it appears in the sample dataset.
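To make the idea concrete, here is one possible way to score a headline, offered as a sketch rather than a reproduction of hd_difficulty_score. It assumes min-max normalization of I_ZScore to the range 0 to 1, averages the normalized scores of the headline's words that appear in the lookup, and returns 0 when none do. The dictionary `word_z`, the function names, and the example headline are all hypothetical.

```python
import re

# Hypothetical lookup of word -> I_ZScore (NOT values from the dataset).
word_z = {"cat": -0.8, "market": -0.2, "inflation": 0.4, "quantitative": 1.2}

def min_max_normalize(scores):
    """Rescale scores linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(scores.values()), max(scores.values())
    return {word: (s - lo) / (hi - lo) for word, s in scores.items()}

def headline_difficulty(headline, normalized):
    """Average normalized score of the headline's known words (assumed rule).

    Words missing from the lookup are skipped; a headline with no known
    words scores 0.0, which is an assumption made for this sketch.
    """
    words = re.findall(r"[a-z']+", headline.lower())
    known = [normalized[w] for w in words if w in normalized]
    return sum(known) / len(known) if known else 0.0

normalized = min_max_normalize(word_z)
score = headline_difficulty("Inflation rattles the market", normalized)
unknown = headline_difficulty("Storm closes bridges", normalized)
print(score, unknown)
```

Under this rule, headlines whose words are all missing from the sample collapse to a score of 0, so the size of the word sample directly affects how many headlines receive a meaningful score.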
The average normalized difficulty score over a sample of 1000 headlines is about 0.54787, which is broadly expected but slightly above the midpoint of a normalized scale from 0 to 1. However, underlying factors that could affect this value include:
- A lack of accurate predictions for words not represented in the data
- Headlines tending to use carefully chosen words (which may be more or less difficult) depending on the media audience and topics
The standard deviation of this data is also considerable, indicating a notable spread among the headlines' normalized difficulty scores. Some underlying factors affecting this calculation could include:
- Articles of specific/complex content significantly affecting difficulty
- Not enough articles included in the sample, or too many headline words unrepresented in the word difficulty sample set (equivalently, too many headlines having a normalized I_Zscore of 0)
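The effect of zero-valued scores on the mean and standard deviation can be checked directly. The sketch below draws a random sample of scores and reports the mean, the population standard deviation, the fraction of zeros, and the mean with zeros excluded. The `scores` list and the function `summarize_scores` are hypothetical; they only illustrate the kind of check described, not the notebook's results.

```python
import random
import statistics

def summarize_scores(scores, sample_size, seed=0):
    """Summarize a random sample of normalized headline difficulty scores."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(scores, min(sample_size, len(scores)))
    nonzero = [s for s in sample if s != 0]
    return {
        "mean": statistics.fmean(sample),
        "std": statistics.pstdev(sample),
        "zero_fraction": (len(sample) - len(nonzero)) / len(sample),
        # None when every sampled score is zero.
        "mean_nonzero": statistics.fmean(nonzero) if nonzero else None,
    }

# Hypothetical scores: three headlines had no words in the lookup (score 0).
scores = [0.0, 0.0, 0.45, 0.6, 0.3, 0.9, 0.0, 0.55]
summary = summarize_scores(scores, sample_size=8)
print(summary)
```

Comparing `mean` with `mean_nonzero` gives a rough sense of how much unmatched headlines pull the average down and widen the spread.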