import pandas as pd

# read a dataset of movie reviewers into a DataFrame (pipe-delimited, no header row)
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols, index_col='user_id')
users.head()
users.shape
Count the duplicated zip codes
users.zip_code.duplicated().sum()  # number of rows whose zip code already appeared in an earlier row
Count the rows that are duplicated in full (every column matches an earlier row)
users.duplicated().sum()
Show the rows that are duplicated
users.loc[users.duplicated(keep='first'), :]  # first occurrence is not flagged; shows the later repeats
users.loc[users.duplicated(keep='last'), :]  # last occurrence is not flagged; shows the earlier repeats
users.loc[users.duplicated(keep=False), :]  # flags every occurrence; shows all rows that have a copy
users.drop_duplicates(keep='first').shape  # drops the 7 repeated rows, keeping the first occurrence of each
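The effect of `keep` is easier to see on a small frame. The sketch below uses a hypothetical `toy` DataFrame (not part of the movie users dataset) in which rows 0, 2 and 4 are identical, and prints the mask that each `keep` option produces.

```python
import pandas as pd

# Hypothetical toy data for illustration: rows 0, 2 and 4 are identical
toy = pd.DataFrame({
    'age': [30, 25, 30, 41, 30],
    'zip_code': ['10001', '94110', '10001', '60601', '10001'],
})

first_mask = toy.duplicated(keep='first')  # True for every repeat after the first occurrence
last_mask = toy.duplicated(keep='last')    # True for every repeat before the last occurrence
all_mask = toy.duplicated(keep=False)      # True for every row that has at least one copy

print(first_mask.tolist())  # [False, False, True, False, True]
print(last_mask.tolist())   # [True, False, True, False, False]
print(all_mask.tolist())    # [True, False, True, False, True]

# drop_duplicates removes exactly the rows flagged by the matching mask
deduped = toy.drop_duplicates(keep='first')
print(deduped.shape)        # (3, 2)
```

Note that `keep` names the occurrence that is left unflagged, so filtering with `duplicated(keep='first')` shows the later repeats, not the first rows.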
Treat users as duplicates only when they share both the same age and the same zip code
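One way to do this is the `subset` parameter, which both `duplicated` and `drop_duplicates` accept. The sketch below uses a hypothetical `people` DataFrame rather than the real dataset, so the counts are for illustration only.

```python
import pandas as pd

# Hypothetical data for illustration: rows 0 and 1 share age and zip_code but differ in gender
people = pd.DataFrame({
    'age': [30, 30, 25, 30],
    'gender': ['M', 'F', 'F', 'M'],
    'zip_code': ['10001', '10001', '94110', '60601'],
})

# Comparing every column: no two rows match exactly
full_row_dupes = people.duplicated().sum()

# Comparing only age and zip_code: row 1 repeats row 0
age_zip_dupes = people.duplicated(subset=['age', 'zip_code']).sum()

# Drop rows that repeat an earlier (age, zip_code) pair, keeping the first occurrence
deduped = people.drop_duplicates(subset=['age', 'zip_code'])

print(full_row_dupes, age_zip_dupes, deduped.shape)
```

The same call pattern should apply to `users`, for example `users.duplicated(subset=['age', 'zip_code']).sum()`.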