Importing the data:
/datasets/movie-datasets/tmdb_5000_credits.csv
/datasets/movie-datasets/tmdb_5000_movies.csv
cast (34 entries, one per line):
order | cast_id | id | gender | name | character | credit_id
0 | 4 | 85 | 2 | Johnny Depp | Captain Jack Sparrow | 52fe4232c3a36847f800b50d
1 | 5 | 114 | 2 | Orlando Bloom | Will Turner | 52fe4232c3a36847f800b511
2 | 6 | 116 | 1 | Keira Knightley | Elizabeth Swann | 52fe4232c3a36847f800b515
3 | 12 | 1640 | 2 | Stellan Skarsgård | William "Bootstrap Bill" Turner | 52fe4232c3a36847f800b52d
4 | 10 | 1619 | 2 | Chow Yun-fat | Captain Sao Feng | 52fe4232c3a36847f800b525
5 | 9 | 2440 | 2 | Bill Nighy | Captain Davy Jones | 52fe4232c3a36847f800b521
6 | 7 | 118 | 2 | Geoffrey Rush | Captain Hector Barbossa | 52fe4232c3a36847f800b519
7 | 14 | 1709 | 2 | Jack Davenport | Admiral James Norrington | 52fe4232c3a36847f800b535
8 | 13 | 2449 | 2 | Kevin McNally | Joshamee Gibbs | 52fe4232c3a36847f800b531
9 | 11 | 2441 | 2 | Tom Hollander | Lord Cutler Beckett | 52fe4232c3a36847f800b529
10 | 19 | 2038 | 1 | Naomie Harris | Tia Dalma | 52fe4232c3a36847f800b549
11 | 8 | 378 | 2 | Jonathan Pryce | Governor Weatherby Swann | 52fe4232c3a36847f800b51d
12 | 37 | 1430 | 2 | Keith Richards | Captain Teague Sparrow | 52fe4232c3a36847f800b5b3
13 | 16 | 1710 | 2 | Lee Arenberg | Pintel | 52fe4232c3a36847f800b53d
14 | 15 | 1711 | 2 | Mackenzie Crook | Ragetti | 52fe4232c3a36847f800b539
15 | 18 | 4031 | 2 | Greg Ellis | Lieutenant Theodore Groves | 52fe4232c3a36847f800b545
16 | 55 | 1715 | 2 | David Bailie | Cotton | 57e28d2ec3a3681a01005b5c
17 | 17 | 4030 | 2 | Martin Klebba | Marty | 52fe4232c3a36847f800b541
18 | 57 | 939 | 0 | David Schofield | Ian Mercer | 57e28d78c3a36808b900bf4f
19 | 62 | 2450 | 1 | Lauren Maher | Scarlett | 57e28ec5c3a3681a50005855
20 | 63 | 2452 | 1 | Vanessa Branch | Giselle | 57e28ed692514123f5005635
21 | 60 | 1714 | 2 | Angus Barnett | Mullroy | 57e28db2c3a3681a01005bc7
22 | 59 | 1713 | 0 | Giles New | Murtogg | 57e28da192514118f7006008
23 | 58 | 22075 | 2 | Reggie Lee | Tai Huang | 57e28d8ec3a3681a01005bab
24 | 64 | 61259 | 2 | Dominic Scott Kay | Henry Turner | 57e29119925141151100a6cc
25 | 39 | 33500 | 1 | Takayo Fischer | Mistress Ching | 52fe4232c3a36847f800b5bd
26 | 40 | 1224149 | 2 | David Meunier | Lieutenant Greitzer | 52fe4232c3a36847f800b5c1
27 | 49 | 429401 | 2 | Ho-Kwan Tse | Hadras | 56d1871c92514174680010cf
28 | 56 | 1123 | 0 | Andy Beckwith | Clacker | 57e28d4b92514125710055cb
29 | 51 | 1056117 | 2 | Peter Donald Badalamenti II | Penrod | 56ec8c14c3a3682260003c53
30 | 61 | 21700 | 2 | Christopher S. Capp | Cotton's Parrot (voice) | 57e28dcc9251412463005678
31 | 65 | 1430 | 2 | Keith Richards | Captain Teague | 58bc2a37c3a368663003740b
32 | 66 | 2603 | 2 | Hakeem Kae-Kazim | Captain Jocard | 58bc2a8e925141609e03a179
33 | 67 | 70577 | 0 | Ghassan Massoud | Captain Ammand | 58e2a21ac3a36872af00f9c2
crew (32 entries, one per line):
department | job | name | id | gender | credit_id
Camera | Director of Photography | Dariusz Wolski | 120 | 2 | 52fe4232c3a36847f800b579
Directing | Director | Gore Verbinski | 1704 | 2 | 52fe4232c3a36847f800b4fd
Production | Producer | Jerry Bruckheimer | 770 | 2 | 52fe4232c3a36847f800b54f
Writing | Screenplay | Ted Elliott | 1705 | 2 | 52fe4232c3a36847f800b503
Writing | Screenplay | Terry Rossio | 1706 | 2 | 52fe4232c3a36847f800b509
Editing | Editor | Stephen E. Rivkin | 1721 | 0 | 52fe4232c3a36847f800b57f
Editing | Editor | Craig Wood | 1722 | 2 | 52fe4232c3a36847f800b585
Sound | Original Music Composer | Hans Zimmer | 947 | 2 | 52fe4232c3a36847f800b573
Production | Executive Producer | Mike Stenson | 2444 | 2 | 52fe4232c3a36847f800b555
Production | Producer | Eric McLeod | 2445 | 2 | 52fe4232c3a36847f800b561
Production | Producer | Chad Oman | 2446 | 2 | 52fe4232c3a36847f800b55b
Production | Producer | Peter Kohn | 2447 | 0 | 52fe4232c3a36847f800b567
Production | Producer | Pat Sandston | 2448 | 0 | 52fe4232c3a36847f800b56d
Production | Casting | Denise Chamian | 2215 | 1 | 52fe4232c3a36847f800b58b
Art | Production Design | Rick Heinrichs | 1226 | 2 | 52fe4232c3a36847f800b597
Art | Art Direction | John Dexter | 553 | 2 | 52fe4232c3a36847f800b59d
Production | Casting | Priscilla John | 3311 | 1 | 52fe4232c3a36847f800b591
Art | Set Decoration | Cheryl Carasik | 4032 | 1 | 52fe4232c3a36847f800b5a3
Costume & Make-Up | Costume Design | Liz Dann | 4033 | 0 | 52fe4232c3a36847f800b5a9
Costume & Make-Up | Costume Design | Penny Rose | 4034 | 1 | 52fe4232c3a36847f800b5af
Sound | Music Supervisor | Bob Badami | 5132 | 2 | 56427ce8c3a3686a53000d8b
Art | Conceptual Design | James Ward Byrkit | 146439 | 2 | 55993c15c3a36855db002f33
Costume & Make-Up | Makeup Department Head | Ve Neill | 406204 | 1 | 52fe4232c3a36847f800b5b9
Crew | Stunts | John Dixon | 1259516 | 2 | 56e47f7892514132690017bd
Crew | CGI Supervisor | Dottie Starling | 1336716 | 0 | 5740be639251416597000849
Directing | Script Supervisor | Pamela Alch | 1344278 | 1 | 56427c639251412fc8000dc1
Crew | Special Effects Coordinator | Allen Hall | 1368867 | 0 | 57083101c3a3681d320004e6
Sound | Music Editor | Melissa Muik | 1368884 | 0 | 56427d5ec3a3686a62000d4a
Directing | Script Supervisor | Sharron Reynolds | 1395290 | 1 | 56427c7b9251412fd4000e07
Sound | Music Editor | Barbara McDermott | 1399327 | 0 | 56427d2bc3a3686a53000d9b
Directing | Script Supervisor | Karen Golden | 1400738 | 1 | 56427cb4c3a3686a53000d87
Sound | Music Editor | Katie Greathouse | 1534197 | 0 | 56427d169251412fd4000e23
genres (id | name):
12 | Adventure
14 | Fantasy
28 | Action
production companies (name | id):
Walt Disney Pictures | 2
Jerry Bruckheimer Films | 130
Second Mate Productions | 19936
production countries (iso_3166_1 | name):
US | United States of America
keywords (id | name):
270 | ocean
726 | drug abuse
911 | exotic island
1319 | east india trading company
2038 | love of one's life
2052 | traitor
2580 | shipwreck
2660 | strong woman
3799 | ship
5740 | alliance
5941 | calypso
6155 | afterlife
6211 | fighter
12988 | pirate
157186 | swashbuckler
179430 | aftercreditsstinger
spoken languages (iso_639_1 | name):
en | English
0 | 19995 | Avatar
1 | 285 | Pirates of the Caribbean: At World's End
Dropping Columns
The only column we need to drop is the ID column, since it is repeated. All the other columns are equally important for classifying the movies and building keywords.
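As a minimal sketch of this step, assume the two tables are merged on `title` and that the credits table calls its ID column `movie_id`. Both are assumptions, since the notebook's own merge code is not shown. The tiny in-memory tables and the helper name `merge_and_drop_duplicate_id` are illustrative stand-ins, not the real CSV contents.

```python
import pandas as pd

# Tiny stand-ins for the two CSV files; column names are assumptions.
movies = pd.DataFrame({
    "id": [19995, 285],
    "title": ["Avatar", "Pirates of the Caribbean: At World's End"],
    "overview": ["placeholder overview 1", "placeholder overview 2"],
})
credits = pd.DataFrame({
    "movie_id": [19995, 285],
    "title": ["Avatar", "Pirates of the Caribbean: At World's End"],
    "cast": ["[]", "[]"],
    "crew": ["[]", "[]"],
})

def merge_and_drop_duplicate_id(movies, credits):
    """Join the two tables on title, then drop the repeated ID column."""
    merged = movies.merge(credits, on="title")
    return merged.drop(columns=["movie_id"])

merged = merge_and_drop_duplicate_id(movies, credits)
print(list(merged.columns))
```

After the merge, `id` and `movie_id` carry the same values, so keeping one of them loses no information.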
0 | 19995 | Avatar
1 | 285 | Pirates of the Caribbean: At World's End
2 | 206647 | Spectre
3 | 49026 | The Dark Knight Rises
4 | 49529 | John Carter
0
19995
[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": …
1
285
[{"cast_id": 4, "character": "Captain Jack Sparrow", "credit_id": "52fe4232c3a36847f800b50d", "gender": 2, "id": 85, "name": "Johnny Depp", "order": 0}, {"cast_id": 5, "character": "Will Turner", "credit_id": "52fe4232c3a36847f800b511", "gender": 2, "id": 114, "name": "Orlando Bloom", "order": 1}, {"cast_id": 6, "character": "Elizabeth Swann", "credit_id": "52fe4232c3a36847f800b515", "gender": 1, "id": 116, "name": "Keira Knightley", "order": 2}, {"cast_id": 12, "character": "William \"Bootstrap Bill\" Turner", "credit_id": "52fe4232c3a36847f800b52d", "gender": 2, "id": 1640, "name": "Stellan Skarsg\u00e5rd", "order": 3}, {"cast_id": 10, "character": "Captain Sao Feng", "credit_id": "52fe4232c3a36847f800b525", "gender": 2, "id": 1619, "name": "Chow Yun-fat", "order": 4}, {"cast_id": 9, "character": "Captain Davy Jones", "credit_id": "52fe4232c3a36847f800b521", "gender": 2, "id": 2440, "name": "Bill Nighy", "order": 5}, {"cast_id": 7, "character": "Captain Hector Barbossa", "credit_id"…
2
206647
[{"cast_id": 1, "character": "James Bond", "credit_id": "52fe4d22c3a368484e1d8d6b", "gender": 2, "id": 8784, "name": "Daniel Craig", "order": 0}, {"cast_id": 14, "character": "Blofeld", "credit_id": "54805866c3a36829ab002592", "gender": 2, "id": 27319, "name": "Christoph Waltz", "order": 1}, {"cast_id": 13, "character": "Madeleine", "credit_id": "546f934fc3a3682f9a002ca5", "gender": 1, "id": 121529, "name": "L\u00e9a Seydoux", "order": 2}, {"cast_id": 10, "character": "M", "credit_id": "53e86503c3a368399c0031f0", "gender": 2, "id": 5469, "name": "Ralph Fiennes", "order": 3}, {"cast_id": 17, "character": "Lucia", "credit_id": "54805920c3a36829ae0022c5", "gender": 1, "id": 28782, "name": "Monica Bellucci", "order": 4}, {"cast_id": 8, "character": "Q", "credit_id": "52fe4d22c3a368484e1d8d87", "gender": 2, "id": 17064, "name": "Ben Whishaw", "order": 5}, {"cast_id": 11, "character": "Moneypenny", "credit_id": "53e8650cc3a368399c0031f4", "gender": 1, "id": 2038, "name": "Naomie Harris", "o…
3
49026
[{"cast_id": 2, "character": "Bruce Wayne / Batman", "credit_id": "52fe4781c3a36847f8139869", "gender": 2, "id": 3894, "name": "Christian Bale", "order": 0}, {"cast_id": 8, "character": "Alfred Pennyworth", "credit_id": "52fe4781c3a36847f8139881", "gender": 2, "id": 3895, "name": "Michael Caine", "order": 1}, {"cast_id": 5, "character": "James Gordon", "credit_id": "52fe4781c3a36847f8139875", "gender": 2, "id": 64, "name": "Gary Oldman", "order": 2}, {"cast_id": 3, "character": "Selina Kyle / Catwoman", "credit_id": "52fe4781c3a36847f813986d", "gender": 1, "id": 1813, "name": "Anne Hathaway", "order": 3}, {"cast_id": 4, "character": "Bane", "credit_id": "52fe4781c3a36847f8139871", "gender": 2, "id": 2524, "name": "Tom Hardy", "order": 4}, {"cast_id": 15, "character": "Miranda Tate", "credit_id": "52fe4781c3a36847f813988d", "gender": 1, "id": 8293, "name": "Marion Cotillard", "order": 5}, {"cast_id": 6, "character": "Blake", "credit_id": "52fe4781c3a36847f8139879", "gender": 2, "id": 2…
4
49529
[{"cast_id": 5, "character": "John Carter", "credit_id": "52fe479ac3a36847f813ea75", "gender": 2, "id": 60900, "name": "Taylor Kitsch", "order": 0}, {"cast_id": 20, "character": "Dejah Thoris", "credit_id": "52fe479ac3a36847f813eab3", "gender": 1, "id": 21044, "name": "Lynn Collins", "order": 1}, {"cast_id": 7, "character": "Sola", "credit_id": "52fe479ac3a36847f813ea79", "gender": 1, "id": 2206, "name": "Samantha Morton", "order": 2}, {"cast_id": 3, "character": "Tars Tarkas", "credit_id": "52fe479ac3a36847f813ea6d", "gender": 2, "id": 5293, "name": "Willem Dafoe", "order": 3}, {"cast_id": 8, "character": "Tal Hajus", "credit_id": "52fe479ac3a36847f813ea7d", "gender": 2, "id": 19159, "name": "Thomas Haden Church", "order": 4}, {"cast_id": 2, "character": "Matai Shang", "credit_id": "52fe479ac3a36847f813ea69", "gender": 2, "id": 2983, "name": "Mark Strong", "order": 5}, {"cast_id": 4, "character": "Tardos Mors", "credit_id": "52fe479ac3a36847f813ea71", "gender": 2, "id": 8785, "name":…
4559
380097
[]
2662
370980
[{"cast_id": 5, "character": "Jorge Mario Bergoglio da giovane", "credit_id": "566001d292514179040024b1", "gender": 0, "id": 18478, "name": "Rodrigo de la Serna", "order": 0}, {"cast_id": 6, "character": "Jorge Mario Bergoglio da anziano", "credit_id": "566001d992514179130025b6", "gender": 2, "id": 127252, "name": "Sergio Hern\u00e1ndez", "order": 1}, {"cast_id": 7, "character": "Franz Jalics", "credit_id": "566001df925141220400163f", "gender": 2, "id": 28514, "name": "\u00c0lex Brendem\u00fchl", "order": 2}, {"cast_id": 8, "character": "Giovane Prete", "credit_id": "566001e592514179130025b9", "gender": 2, "id": 1133330, "name": "Maximilian Dirr", "order": 3}, {"cast_id": 9, "character": "Esther Ballestrino", "credit_id": "566001ed925141790a00258d", "gender": 1, "id": 18499, "name": "Mercedes Mor\u00e1n", "order": 4}, {"cast_id": 11, "character": "Padre Pedro", "credit_id": "566001fc92514179060023ec", "gender": 0, "id": 1544273, "name": "Andres Gil", "order": 5}, {"cast_id": 12, "char…
4147
459488
[{"cast_id": 0, "character": "Narrator", "credit_id": "592b25b79251413b54061b9b", "gender": 0, "id": 1354401, "name": "Tony Oppedisano", "order": 1}]
2662
370980
[{"cast_id": 5, "character": "Jorge Mario Bergoglio da giovane", "credit_id": "566001d292514179040024b1", "gender": 0, "id": 18478, "name": "Rodrigo de la Serna", "order": 0}, {"cast_id": 6, "character": "Jorge Mario Bergoglio da anziano", "credit_id": "566001d992514179130025b6", "gender": 2, "id": 127252, "name": "Sergio Hern\u00e1ndez", "order": 1}, {"cast_id": 7, "character": "Franz Jalics", "credit_id": "566001df925141220400163f", "gender": 2, "id": 28514, "name": "\u00c0lex Brendem\u00fchl", "order": 2}, {"cast_id": 8, "character": "Giovane Prete", "credit_id": "566001e592514179130025b9", "gender": 2, "id": 1133330, "name": "Maximilian Dirr", "order": 3}, {"cast_id": 9, "character": "Esther Ballestrino", "credit_id": "566001ed925141790a00258d", "gender": 1, "id": 18499, "name": "Mercedes Mor\u00e1n", "order": 4}, {"cast_id": 11, "character": "Padre Pedro", "credit_id": "566001fc92514179060023ec", "gender": 0, "id": 1544273, "name": "Andres Gil", "order": 5}, {"cast_id": 12, "char…
4437
292539
[]
We add the overview for this movie:
4437
292539
[]
Converting JSON to Lists
Let's write a function that parses the JSON and returns only the names of the top 4 cast members.
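A function of this kind might look like the sketch below. The name `top_names` and its `limit` parameter are hypothetical (the notebook's own function is not shown), and it assumes each cell holds a JSON string containing a list of objects with a `name` key, as in the cast output above.

```python
import json

def top_names(json_text, limit=4):
    """Return the 'name' of the first `limit` entries of a JSON list string.

    Pass limit=None to keep every name.
    """
    entries = json.loads(json_text)
    names = [entry["name"] for entry in entries]
    return names if limit is None else names[:limit]

# A shortened cast cell built from the sample record shown earlier.
sample_cast = json.dumps([
    {"name": "Johnny Depp", "order": 0},
    {"name": "Orlando Bloom", "order": 1},
    {"name": "Keira Knightley", "order": 2},
    {"name": "Stellan Skarsgård", "order": 3},
    {"name": "Chow Yun-fat", "order": 4},
])

print(top_names(sample_cast))
print(top_names("[]"))  # a movie with no cast entries gives an empty list
```

The sketch takes entries in the order they are stored, which in the sample data matches the `order` field.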
Great, now let's convert the cast column into a list of names:
0 | 19995 | ['Sam Worthington', 'Zoe Saldana', 'Sigourney Weaver', 'Stephen Lang']
1 | 285 | ['Johnny Depp', 'Orlando Bloom', 'Keira Knightley', 'Stellan Skarsgård']
2 | 206647 | ['Daniel Craig', 'Christoph Waltz', 'Léa Seydoux', 'Ralph Fiennes']
3 | 49026 | ['Christian Bale', 'Michael Caine', 'Gary Oldman', 'Anne Hathaway']
4 | 49529 | ['Taylor Kitsch', 'Lynn Collins', 'Samantha Morton', 'Willem Dafoe']
Let's convert the genres column into a list of genres:
Let's do the same for the crew column, but retrieve only the director's name:
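One way to pick out the director is sketched here. It assumes the crew cell is a JSON string whose entries carry `job` and `name` keys, as in the crew output above. The function name `director_names` is hypothetical.

```python
import json

def director_names(crew_json):
    """Return the names of crew members whose job is exactly 'Director'."""
    crew = json.loads(crew_json)
    return [member["name"] for member in crew if member.get("job") == "Director"]

# A shortened crew cell built from the sample record shown earlier.
sample_crew = json.dumps([
    {"department": "Camera", "job": "Director of Photography", "name": "Dariusz Wolski"},
    {"department": "Directing", "job": "Director", "name": "Gore Verbinski"},
    {"department": "Production", "job": "Producer", "name": "Jerry Bruckheimer"},
])

print(director_names(sample_crew))
```

Comparing `job` for exact equality keeps related roles such as "Director of Photography" out of the result.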
Now we split the overview column into words and store each overview as a list:
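A small sketch of the split, assuming whitespace tokenization and that a missing overview (for example NaN) should become an empty list. The helper name `overview_to_words` and the sample sentence are made up for illustration.

```python
def overview_to_words(overview):
    """Split an overview string on whitespace; non-strings become an empty list."""
    if not isinstance(overview, str):
        return []
    return overview.split()

print(overview_to_words("A pirate sails the ocean"))
print(overview_to_words(float("nan")))  # missing value
```

With pandas this could be applied column-wise, for instance through `Series.apply`.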
Making a new dataframe
0 | 19995 | Avatar
1 | 285 | Pirates of the Caribbean: At World's End
2 | 206647 | Spectre
3 | 49026 | The Dark Knight Rises
4 | 49529 | John Carter
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
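The warning above typically appears when a column is assigned on a DataFrame that was itself selected from another DataFrame. One common way to avoid it is to take an explicit copy before modifying the selection. The sketch below assumes the new dataframe is built by selecting a few columns. The column names `id`, `title`, `tags` and `extra` are illustrative, not taken from the notebook.

```python
import pandas as pd

movies = pd.DataFrame({
    "id": [19995, 285],
    "title": ["Avatar", "Pirates of the Caribbean: At World's End"],
    "tags": [["Sam", "Worthington"], ["Johnny", "Depp"]],
    "extra": ["x", "y"],  # a column we do not carry over
})

# .copy() makes new_df an independent DataFrame instead of a slice of movies,
# so later assignments cannot trigger SettingWithCopyWarning.
new_df = movies[["id", "title", "tags"]].copy()

# Join each list of tags into one string, then lowercase it.
new_df["tags"] = new_df["tags"].apply(" ".join)
new_df["tags"] = new_df["tags"].str.lower()

print(new_df["tags"].tolist())
```

The original `movies` frame is left untouched by the assignments on `new_df`.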
Machine Learning Imports
Making function for stemming of words
Tokenization and Stemming of Words
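The notebook's stemming function is not shown. To illustrate the idea of tokenizing a tag string and reducing each word to a stem, here is a deliberately crude suffix-stripping stand-in. It is not the Porter algorithm, and the names `crude_stem` and `stem_text` are hypothetical. In practice a library stemmer such as NLTK's `PorterStemmer` would be the usual choice.

```python
# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_text(text):
    """Lowercase, split on whitespace, stem each token, and rejoin."""
    return " ".join(crude_stem(token) for token in text.lower().split())

print(stem_text("Pirates sailing sailed ships"))  # pirat sail sail ship
```

Stemming matters here because it lets different forms of a word count as the same feature when the tags are vectorized.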
Now, let's vectorize the words, converting them into numbers.
We extract the 5000 most frequent words from the tags.
Vectorization of Words
We fit CountVectorizer (from scikit-learn) on the tags, then vectorize each movie's tags and convert the result into a NumPy array.
Most of the array will consist of 0s, because no single movie's tags contain all 5000 words. Each row has nonzero counts only for the words that appear in that movie's tags.
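To make the idea concrete without depending on scikit-learn, the following pure-Python sketch mimics a count vectorizer with a capped vocabulary. It is a simplified stand-in, not CountVectorizer itself: tokenization, tie-breaking and stop-word handling differ. The function names and the three sample tag strings are made up for illustration.

```python
from collections import Counter

def build_vocabulary(documents, max_features):
    """Keep the max_features most frequent words, returned alphabetically."""
    counts = Counter(word for doc in documents for word in doc.split())
    # Rank by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return sorted(word for word, _ in ranked[:max_features])

def vectorize(documents, vocabulary):
    """Turn each document into a list of word counts over the vocabulary."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word in doc.split():
            if word in index:
                row[index[word]] += 1
        vectors.append(row)
    return vectors

tags = ["pirat ship ocean pirat", "space marin planet", "spy ship"]
vocab = build_vocabulary(tags, max_features=5)
vectors = vectorize(tags, vocab)

zeros = sum(row.count(0) for row in vectors)
print(vocab)
print(vectors)
print(f"{zeros} of {len(vectors) * len(vocab)} entries are zero")
```

Even in this toy case most entries are zero, which is the sparsity described above. With a 5000-word vocabulary the effect is far stronger.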
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
warnings.warn(msg, category=FutureWarning)
Calculating Similarity using cosine distance
We use cosine distance as the distance between two movies.
Cosine distance: the distance between two vectors, measured by the angle between them.
Distance is inversely related to similarity: high similarity means low distance.
Here we have a matrix, that is, an array of arrays, where each inner array holds the distances between a given movie and all other movies. So the shape of the matrix is 4808x4808.
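A minimal pure-Python sketch of this computation follows, taking cosine distance as 1 minus cosine similarity. The three short vectors are illustrative, and returning 1.0 for an all-zero vector is a convention chosen for this sketch. The notebook's own call is not shown.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention for this sketch: treat empty vectors as unrelated
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(vectors):
    """Pairwise distances: row i holds the distances from item i to every item."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]

vectors = [[0, 1, 2, 0, 1], [1, 0, 0, 1, 0], [0, 0, 0, 0, 1]]
matrix = distance_matrix(vectors)

print(len(matrix), "x", len(matrix[0]))  # one row and one column per item
for row in matrix:
    print([round(d, 3) for d in row])
```

With N movies the result has shape N x N. Its diagonal is (numerically close to) 0, since every movie is at distance 0 from itself.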
Let's look at the distances of all movies from the first movie.
Top picks for you:
Top picks for you:
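The recommendation lists themselves are not preserved in the output above. As a rough sketch of how such a list could be produced from a distance matrix, the hypothetical `recommend` function below returns the titles with the smallest distance to a chosen movie. The distance values are made up purely for illustration and say nothing about the real data.

```python
def recommend(title, titles, distances, top_n=5):
    """Return the top_n titles closest to `title`, excluding the movie itself."""
    idx = titles.index(title)
    ranked = sorted(
        (i for i in range(len(titles)) if i != idx),
        key=lambda i: distances[idx][i],
    )
    return [titles[i] for i in ranked[:top_n]]

titles = ["Avatar", "Pirates of the Caribbean: At World's End", "Spectre"]
# Made-up symmetric distance matrix; NOT computed from the dataset.
distances = [
    [0.0, 0.9, 0.7],
    [0.9, 0.0, 0.4],
    [0.7, 0.4, 0.0],
]

print("Top picks for you:")
for pick in recommend("Avatar", titles, distances, top_n=2):
    print(pick)
```

Sorting by ascending distance is equivalent to sorting by descending similarity, so the closest movies come first.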