Project 7: Power Law Distributions: Zipf's Law
Data satisfying the power law
Abstract
In this lab, we were tasked with obtaining data for systems that can exhibit a power law and curve fitting the data to find a relationship between its parameters. I used the frequency of occurrence of common family names in the US, as recorded by the US Census Bureau in 1990.
I graphed the log of the rank of the names against the log of the frequency and performed a curve fit. The logs of these two quantities are linearly related.
This data shows attributes similar to an observation that the linguist George Zipf made about the relationship between rank and frequency in languages spoken around the world, which he called the rank vs. frequency rule.
Fitting the graphed rank vs. frequency data gave a slope of -1.28 $\pm$ 0.01 and an intercept of 7.26 $\pm$ 0.01. The linear relationship between the logs is what a power law predicts, and the slope is reasonably close to the value of -1 expected from Zipf's law. The squared correlation coefficient was about 0.9878.
Description
Zipf's law is an empirical law, formulated using mathematical statistics, which states that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power law probability distributions. The Zipf distribution is related to the zeta distribution, but they are not identical.
Zipf's law was originally formulated in terms of quantitative linguistics, stating that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation.
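The straight line expected on a log-log plot follows directly from the power-law form. As a short derivation, assume frequency $f$, rank $r$, a constant $C$, and an exponent $s$ (symbols introduced here for illustration only):

$$
f(r) = \frac{C}{r^{s}}
\quad\Longrightarrow\quad
\log f = \log C - s \log r
$$

A plot of $\log f$ against $\log r$ is therefore a straight line with slope $-s$ and intercept $\log C$. The ideal Zipf case described above corresponds to $s = 1$, that is, a slope of -1.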
Algorithm
To verify Zipf's law as a power law:
- I imported population and rank data from a CSV file.
- I graphed the log of the rank of the names against the log of the frequency.
- I used the curve fitting function from the scientific computing library to find a relationship between the logs. To verify Zipf's law as a power law, we needed a linear fit with a slope close to -1.
Implementation and Code
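A minimal sketch of the steps listed under Algorithm, assuming the scientific computing library is SciPy, that the logs are base 10, and that log frequency is treated as the dependent variable (none of these is stated explicitly). The names `linear` and `fit_log_log` are hypothetical, and the data is a synthetic stand-in rather than the census CSV.

```python
import numpy as np
from scipy.optimize import curve_fit


def linear(x, slope, intercept):
    """Straight-line model for the log-log data."""
    return slope * x + intercept


def fit_log_log(rank, frequency):
    """Fit log10(frequency) = slope * log10(rank) + intercept.

    Returns (slope, intercept, slope_err, intercept_err), where the
    errors are one standard deviation taken from the fit covariance.
    """
    log_rank = np.log10(np.asarray(rank, dtype=float))
    log_freq = np.log10(np.asarray(frequency, dtype=float))
    params, cov = curve_fit(linear, log_rank, log_freq)
    errors = np.sqrt(np.diag(cov))
    return params[0], params[1], errors[0], errors[1]


# Synthetic stand-in data (NOT the census data): a power law with
# exponent 1.2 and a small amount of multiplicative noise.
rng = np.random.default_rng(0)
rank = np.arange(1, 1001)
frequency = 1.0e6 / rank**1.2 * 10 ** rng.normal(0.0, 0.02, rank.size)

slope, intercept, slope_err, intercept_err = fit_log_log(rank, frequency)
print(f"slope     = {slope:.2f} +/- {slope_err:.2f}")
print(f"intercept = {intercept:.2f} +/- {intercept_err:.2f}")
```

With real data, the `rank` and `frequency` arrays would instead come from the imported CSV columns, for example via `numpy.loadtxt` or `pandas.read_csv`.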
Findings
We get an r-squared value of 0.9878, which is very close to 1. This means the regression line fits the data to a high degree.
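For reference, r-squared can be computed from the residuals of the fit. The sketch below is one common way to do it, not necessarily the code used in this lab. The function name `r_squared` and the example values are hypothetical.

```python
import numpy as np


def r_squared(y, y_fit):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_fit = np.asarray(y_fit, dtype=float)
    ss_res = np.sum((y - y_fit) ** 2)       # scatter left over after the fit
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total scatter about the mean
    return 1.0 - ss_res / ss_tot


# Hypothetical example values (not the census data).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([7.1, 6.0, 4.9, 4.1])
slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
r2 = r_squared(y, slope * x + intercept)
print(f"r-squared = {r2:.4f}")
```

For a least-squares straight line with an intercept, this quantity equals the square of the Pearson correlation coefficient between the two variables.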
Results and Conclusion
In this lab, we curve fitted data that seems to follow a power law, and it was clear that there was an inverse relationship. I graphed the log of the rank of the names against the log of the frequency and performed a curve fit. The logs of these two quantities are linearly related.
Fitting the graphed rank vs. frequency data gave a slope of -1.28 $\pm$ 0.01 and an intercept of 7.26 $\pm$ 0.01. For this relationship to obey a power law, the log-log data should follow a straight line, and for Zipf's law in its ideal form the slope of that line should be -1. The slope of -1.28 $\pm$ 0.01 is fairly close to what was expected, although it differs from -1 by more than the quoted uncertainty. The squared correlation coefficient of about 0.9878 is close to 1 and therefore signifies a high level of correlation between the data and the fit.