Author: Aaphsaarah Rahman

Final Project - Data Science 1
In [3]:
from IPython.display import HTML

HTML('''
<script src='//code.jquery.com/jquery-3.3.1.min.js'></script>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
$('div .jp-CodeCell .jp-Cell-inputWrapper').hide();
} else {
$('div.input').show();
$('div .jp-CodeCell .jp-Cell-inputWrapper').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Code on/off"></form>''')
Out[3]:
In [4]:
import pandas as pd
import numpy as np
import seaborn as sns 

import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

%matplotlib inline

Abstract

It is hard to say which features make a movie a blockbuster. It is easy to guess what they might be, but a guess may not be the real reason. The goal here was to carry out exploratory data analysis in Python and draw conclusions from it. This report uses two datasets, one from IMDB and one from TMDB. The first has 28 features and the second 20, and each covers close to 5000 movies. The top directors appear to play a crucial role in making a movie successful. Some features in the movie databases are strongly correlated, while others show no meaningful relationship. Facebook likes of the actors and of the whole cast are closely related to each other, while financial and audience measures such as budget, gross income, popularity and critic reviews form a second related group.

Introduction and Background

Movies have a huge impact on society. Nowadays people don't just watch movies; they also absorb a lot of information from the cinematic world. Cinema has the capacity to influence communities both locally and globally. Many kinds of movies are made each year, and each is reviewed and criticised by many organisations.
A lot of effort goes into making a movie. It can turn out to be a success or a failure depending on various factors, such as the choice of director, actors, theme, budget and storyline. The themes, direction and storylines of movies are also changing with time. The aim of this report is to analyse these trends and how the features in the movie databases are correlated.

Two datasets from two movie databases are used to analyse the trend:

In [5]:
tmdb_movies = pd.read_csv("Dataset_all/tmdb_5000_movies.csv")
movie_metadata = pd.read_csv("Dataset_all/movie_metadata.csv")

print('1. The movie_metadata (IMDB) dataset contains:', len(movie_metadata),'movies and',len(movie_metadata.columns),'features.')
print('2. The tmdb_movies (TMDB) dataset contains:', len(tmdb_movies),'movies and',len(tmdb_movies.columns),'features.')
1. The movie_metadata (IMDB) dataset contains: 5043 movies and 28 features.
2. The tmdb_movies (TMDB) dataset contains: 4803 movies and 20 features.
In [6]:
print('Below are the 28 features of IMDB dataset: \n ')
s_m=sorted( movie_metadata.columns)
for col1 in s_m: 
    print(col1, end=' . ')
    
print('\n \nBelow are the 20 features of TMDB dataset: \n')

s_t=sorted( tmdb_movies.columns)
for col2 in s_t: 
    print( col2,  end=' . ')
Below are the 28 features of IMDB dataset: 
 
actor_1_facebook_likes . actor_1_name . actor_2_facebook_likes . actor_2_name . actor_3_facebook_likes . actor_3_name . aspect_ratio . budget . cast_total_facebook_likes . color . content_rating . country . director_facebook_likes . director_name . duration . facenumber_in_poster . genres . gross . imdb_score . language . movie_facebook_likes . movie_imdb_link . movie_title . num_critic_for_reviews . num_user_for_reviews . num_voted_users . plot_keywords . title_year . 
 
Below are the 20 features of TMDB dataset: 

budget . genres . homepage . id . keywords . original_language . original_title . overview . popularity . production_companies . production_countries . release_date . revenue . runtime . spoken_languages . status . tagline . title . vote_average . vote_count . 

The IMDB dataset spans about 100 years and 66 countries and includes gross earnings. It contains 2399 unique director names and thousands of actors and actresses. The TMDB dataset spans 101 years (1916-2017).

The aim is to find a trend in the movie industry by answering the following:

3. Perform PCA to find a trend in the movie features.

Methodology

Several libraries had to be loaded to run the analysis in Python. For Principal Component Analysis, the sklearn library was imported: the data is standardised with sklearn's StandardScaler and the PCA is run with sklearn.decomposition.
For the plots, seaborn and matplotlib are imported, and matplotlib's ggplot style is applied.

As with any dataset, the data had to be cleaned after loading. All rows containing NaNs (empty values) were removed. To find the popular directors, the two datasets were merged on the movie title, but this raised an issue: because they come from different databases, the titles are written differently. Troubleshooting showed that the titles in the IMDB movie_title column have trailing white space, which had to be removed. Once the trailing white space was stripped, the two datasets were merged with an inner join, forming a new combined dataset of 4516 movies and 48 features.

The first rows of the combined data frame are shown below.

In [7]:
# drop every row that contains a NaN
movie_metadata = movie_metadata.dropna()
tmdb_movies = tmdb_movies.dropna()

# strip trailing white space from the movie_title column, otherwise the titles won't match when merging
movie_metadata['movie_title'] = movie_metadata['movie_title'].str.rstrip()

# inner join on the movie title: only movies present in both datasets are kept
combine_df = pd.merge(movie_metadata, tmdb_movies, left_on="movie_title", right_on="original_title", how='inner')
combine_df.head()
#combine_df.shape=(4516, 48)
Out[7]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres_x ... production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... [{"iso_3166_1": "US", "name": "United States o... 2009-12-10 2787965087 162.0 [{"iso_639_1": "en", "name": "English"}, {"iso... Released Enter the World of Pandora. Avatar 7.2 11800
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... [{"iso_3166_1": "US", "name": "United States o... 2007-05-19 961000000 169.0 [{"iso_639_1": "en", "name": "English"}] Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.9 4500
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... [{"iso_3166_1": "GB", "name": "United Kingdom"... 2015-10-26 880674609 148.0 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... Released A Plan No One Escapes Spectre 6.3 4466
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... [{"iso_3166_1": "US", "name": "United States o... 2012-07-16 1084939099 165.0 [{"iso_639_1": "en", "name": "English"}] Released The Legend Ends The Dark Knight Rises 7.6 9106
4 Color Andrew Stanton 462.0 132.0 475.0 530.0 Samantha Morton 640.0 73058679.0 Action|Adventure|Sci-Fi ... [{"iso_3166_1": "US", "name": "United States o... 2012-03-07 284139100 132.0 [{"iso_639_1": "en", "name": "English"}] Released Lost in our world, found in another. John Carter 6.1 2124

5 rows × 48 columns
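The whitespace issue described above can be reproduced on a small example. The sketch below is illustrative only: the two data frames `imdb` and `tmdb` and their rows are made-up stand-ins, not the real datasets. It shows that an inner merge finds no matches until the trailing spaces are stripped, and that an outer merge with `indicator=True` is one way to see which titles still fail to match.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical rows, not the real data).
imdb = pd.DataFrame({
    "movie_title": ["Avatar ", "Spectre ", "Only In IMDB "],
    "imdb_score": [7.9, 6.8, 5.0],
})
tmdb = pd.DataFrame({
    "original_title": ["Avatar", "Spectre", "Only In TMDB"],
    "vote_average": [7.2, 6.3, 4.0],
})

# Before cleaning, no title matches because of the trailing space.
raw_merge = pd.merge(imdb, tmdb, left_on="movie_title",
                     right_on="original_title", how="inner")

# Strip trailing white space, then merge again.
imdb["movie_title"] = imdb["movie_title"].str.rstrip()
clean_merge = pd.merge(imdb, tmdb, left_on="movie_title",
                       right_on="original_title", how="inner")

# An outer merge with indicator=True labels each row as
# 'both', 'left_only' or 'right_only'.
outer = pd.merge(imdb, tmdb, left_on="movie_title",
                 right_on="original_title", how="outer", indicator=True)
counts = outer["_merge"].value_counts()

print(len(raw_merge), len(clean_merge))
print(counts)
```

Checking the `left_only` and `right_only` rows of such an outer merge is a quick way to spot other differences in how titles are written.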

From this combined data frame, the names of the popular directors are tabulated together with the IMDB score, TMDB score, popularity and revenue of their movies. Some further exploratory data analysis can be found in the Appendix.

A cluster map and a correlation heat map are used to interpret the data. For the correlation analysis the data had to be cleaned again, producing a data frame with only the selected columns and only numerical values. The correlation heat map was then drawn with the correlation coefficients annotated. The cluster map arranges the correlation matrix as a hierarchically clustered map, which shows the bigger picture in hierarchical order and indicates how many clusters can be formed.
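As a minimal sketch of the "numerical values only" step, the example below uses pandas' `select_dtypes` on a made-up data frame. This is an alternative to the column-by-column loop used later in this report, not the method the report itself uses, and the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical mini data frame with one text column and three numeric ones.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [1.0, 2.0, 3.0, 4.0],
    "revenue": [2.0, 4.1, 5.9, 8.2],
    "score": [7.0, 5.0, 8.0, 6.0],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()

print(corr.round(2))
```

The resulting square matrix is what a heat map or cluster map visualises: one coefficient per pair of features, with 1.0 on the diagonal.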

For PCA, three features were first selected from the merged dataset. The data was then standardised with StandardScaler and the PCA model was fitted, giving the explained variance ratio of each component. Second, the number of components was plotted against the cumulative explained variance. Third, the PCA scores were computed and shown in a scatter plot. Finally, a biplot was drawn, which combines the loadings and the PCA scores in one graph.
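The PCA steps listed above can be sketched end to end on synthetic data. The data below is randomly generated for illustration (it is not the movie data), and the variable names are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: three features, the second strongly tied to the first.
a = rng.normal(size=200)
X = np.column_stack([
    a,
    2.0 * a + rng.normal(scale=0.3, size=200),
    rng.normal(size=200),
])

# Step 1: standardise each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Step 2: fit the PCA and project the data onto the components (scores).
pca = PCA()
scores = pca.fit_transform(X_scaled)

# Step 3: share of variance per component, and its running total.
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

print(ratios.round(3))
print(cumulative.round(3))
```

Standardising first matters when features are on very different scales (for example revenue in dollars next to a score out of 10), because otherwise the feature with the largest variance dominates the first component.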


Results

The table generated below shows the following.
The 10 directors behind the highest-revenue films are:
James Cameron, Joss Whedon, Colin Trevorrow, James Wan, Chris Buck, Shane Black, Kyle Balda, Anthony Russo, Michael Bay and Peter Jackson.
Their movies generated the highest revenue in the combined dataset. The table shows the IMDB and TMDB scores, the popularity of each movie and the revenue it generated. This suggests that the success of a movie depends on how the director directs it and carries it through to the end. These names are well known in the movie industry, and several of them appear more than once in the table.

In [8]:
# group by the desired columns (one group per movie) and sort by revenue
c2=combine_df.groupby(['director_name','vote_average','imdb_score' ,'popularity','movie_title'])['revenue'].mean().sort_values(ascending=False).head(20)
c2.to_frame() # making it into dataframe
Out[8]:
revenue
director_name vote_average imdb_score popularity movie_title
James Cameron 7.2 7.9 150.437577 Avatar 2787965087
7.5 7.7 100.025899 Titanic 1845034188
Joss Whedon 7.4 8.1 144.448633 The Avengers 1519557910
Colin Trevorrow 6.5 7.0 418.708552 Jurassic World 1513528810
James Wan 7.3 7.2 102.322217 Furious 7 1506249360
Joss Whedon 7.3 7.5 134.279229 Avengers: Age of Ultron 1405403694
Chris Buck 7.3 7.6 165.125366 Frozen 1274219009
Shane Black 6.8 7.2 77.682080 Iron Man 3 1215439994
Kyle Balda 6.4 6.4 875.581305 Minions 1156730962
Anthony Russo 7.1 8.2 198.372395 Captain America: Civil War 1153304495
Michael Bay 6.1 6.3 28.529607 Transformers: Dark of the Moon 1123746996
Peter Jackson 8.1 8.9 123.630332 The Lord of the Rings: The Return of the King 1118888979
Sam Mendes 6.9 7.8 93.004993 Skyfall 1108561013
Michael Bay 5.8 5.7 116.840296 Transformers: Age of Extinction 1091405097
Christopher Nolan 7.6 8.5 112.312950 The Dark Knight Rises 1084939099
Lee Unkrich 7.6 8.3 59.995418 Toy Story 3 1066969703
Gore Verbinski 7.0 7.3 145.847379 Pirates of the Caribbean: Dead Man's Chest 1065659812
Rob Marshall 6.4 6.7 135.413856 Pirates of the Caribbean: On Stranger Tides 1045713802
Tim Burton 6.4 6.5 78.530105 Alice in Wonderland 1025491110
Christopher Nolan 8.2 9.0 187.322927 The Dark Knight 1004558444
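Because the table above has one row per movie, a director can appear several times. A minimal sketch of how the same kind of data could be summarised per director is given below. The data frame `films` and its values are made up for illustration, and this aggregation is an assumption about what a per-director view might look like, not a step the report performs.

```python
import pandas as pd

# Hypothetical films; names and revenues are placeholders.
films = pd.DataFrame({
    "director_name": ["Director A", "Director A", "Director B", "Director C"],
    "movie_title": ["Film 1", "Film 2", "Film 3", "Film 4"],
    "revenue": [300, 100, 250, 50],
})

# One row per director: total revenue, best single film and film count.
per_director = (
    films.groupby("director_name")["revenue"]
    .agg(total="sum", best="max", films="count")
    .sort_values("total", ascending=False)
)

print(per_director)
```

Ranking by `total` rewards directors with many films, while ranking by `best` matches the per-movie view of the table above.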

Correlation Analysis

The correlation heatmap and cluster map show the relationship between each feature. The lighter the color the stronger the correlation.

The correlation heat map is annotated with the correlation coefficients. It shows that the Facebook likes of the actors and of the total cast are highly correlated with each other, although they have little relation to revenue or gross income. The director's Facebook likes are not noticeably correlated with any other feature.
Budget, gross income, movie Facebook likes, number of critic reviews, number of user votes, popularity, revenue and vote count are highly correlated with one another. One interpretation is that the more critics and users review and vote for a movie, the more popular it is, and the more revenue it is likely to earn.

The clustermap maps the matrix in a hierarchically-clustered map. It shows there are 3 major clusters.

  • IMDB score and TMDB score (vote_average) can be clustered as one. Since they have little relation with the other features, they are clustered with director Facebook likes.
  • The Facebook likes of the first, second and third actor/actress and of the whole cast are closely correlated.
  • The remaining features, which include the financial figures, critic reviews, user reviews, popularity and movie Facebook likes, are correlated and can be clustered together.
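Groupings like the ones above can also be read off numerically by listing the feature pairs in order of correlation. The sketch below uses a small correlation matrix whose coefficients are made up for illustration; they are not the values from the heat map.

```python
import numpy as np
import pandas as pd

# Made-up correlation matrix (illustrative values only).
names = ["budget", "revenue", "imdb_score"]
corr = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]],
    index=names, columns=names,
)

# Keep the upper triangle only, so each pair appears once and the
# diagonal of 1.0 values is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs)
```

The first entries of `pairs` are the most strongly correlated feature pairs, which is the same information the lightest off-diagonal cells of a heat map convey.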
In [9]:
#preparing data frame for correlation
# keeping numeric columns only
str_list=[] #names of the columns that hold strings
for colname, colvalue in combine_df.items(): # items() replaces iteritems(), which was removed in pandas 2.0
    if type(colvalue[1])==str:
        str_list.append(colname)
num_list= combine_df.columns.difference(str_list) # column names that are not in str_list
combine_df_new = combine_df[num_list]
#combine_df_new.head()    #shape=4516 rows × 23 columns
In [10]:
col4=['actor_1_facebook_likes','actor_2_facebook_likes','actor_3_facebook_likes','cast_total_facebook_likes','director_facebook_likes','budget_y','gross','imdb_score','movie_facebook_likes','num_critic_for_reviews','num_user_for_reviews','num_voted_users','popularity','revenue','vote_average','vote_count']
corr_df=combine_df_new[col4]
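f,ax=plt.subplots(figsize=(12,10))
plt.title('Correlation of Movie Features')
# annotated heat map of the pairwise correlation coefficients
sns.heatmap(corr_df.astype(float).corr(), linewidths=0.25, vmax=1.0,square=True, cmap="magma", linecolor="black", annot=True )
# widen the y-limits by half a cell (workaround for matplotlib 3.1.1, which cut off the top and bottom rows of heat maps)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
# hierarchically clustered version of the same correlation matrix
sns.clustermap(corr_df.corr(),cmap="magma",standard_scale=1)
plt.show()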

Principal Component Analysis

Principal Component Analysis was used as an unsupervised learning method to find patterns in the data. Three features were chosen: popularity, revenue and imdb_score. Fitting the PCA gave three explained variance ratios.

In [11]:
a1=[ 'popularity','revenue','imdb_score' ]
x=combine_df[a1]  #formed the data for pca
x.head()
Out[11]:
popularity revenue imdb_score
0 150.437577 2787965087 7.9
1 139.082615 961000000 7.1
2 107.376788 880674609 6.8
3 112.312950 1084939099 8.5
4 43.926995 284139100 6.6
In [12]:
T=x.values #assign all values to T

scaler = StandardScaler() # standardising the data (zero mean, unit variance per feature)
T_scaled=scaler.fit_transform(T) #fitting the scaler and transforming the values
In [13]:
pca = PCA()
T_7d=pca.fit_transform(T_scaled)
print('pca explained variance ratio =',pca.explained_variance_ratio_)
pca explained variance ratio = [0.59308083 0.26451655 0.14240262]

The line plot of cumulative explained variance against the number of components shows that the first two principal components capture about 86% of the variance (0.593 + 0.265 ≈ 0.858). Hence the space was collapsed from 3 dimensions to 2 dimensions.
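The cumulative figures can be checked directly from the explained variance ratios printed above. The sketch below copies those three ratios and computes the running total, along with the smallest number of components needed to reach a chosen threshold. The thresholds of 0.85 and 0.95 are illustrative choices, not values taken from the report.

```python
import numpy as np

# Explained variance ratios as printed by the PCA cell above.
ratios = np.array([0.59308083, 0.26451655, 0.14240262])

# Running total of explained variance: roughly 0.593, 0.858, 1.0.
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches a threshold.
n_for_85 = int(np.argmax(cumulative >= 0.85) + 1)
n_for_95 = int(np.argmax(cumulative >= 0.95) + 1)

print(cumulative.round(3))
print(n_for_85, n_for_95)
```

With these ratios, two components are enough for an 85% threshold, while a 95% threshold would require all three.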

In [14]:
components = np.arange(1,4) 
plt.plot(components, np.cumsum(pca.explained_variance_ratio_))

plt.xlabel('Number of Components') 
plt.ylabel('Cumulative Explained Variance')
plt.show()
In [15]:
pca = PCA(n_components=2)
pca_scores = pca.fit_transform(T_scaled)  #fit the model and form the pca scores
pca_loadings = pca.components_  #loadings of the fitted model
#Note that we have collapsed the dimensionality of our space from 3 dimensions to 2 dimensions

In the scatter plot, most points are clustered near one corner, and the remaining points fan out away from it.

In [19]:
plt.scatter(pca_scores[:,0], pca_scores[:,1] )
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Scatter plot of Principal Component Analysis ')
Out[19]:
Text(0.5, 1.0, 'Scatter plot of Principal Component Analysis ')

Revenue and popularity appear to be the most closely correlated. The first PC is a combination of revenue, popularity and IMDB score. The second PC is dominated somewhat more by IMDB score.

In [20]:
def myplot(score,coeff,labels=None): 
    # biplot: scaled PCA scores as points, loadings as arrows
    xs = score[:,0] 
    ys = score[:,1] 
    n = coeff.shape[0] 
    # rescale the scores so that they fit in the same range as the loadings
    scalex = 1.0/(xs.max() - xs.min()) 
    scaley = 1.0/(ys.max() - ys.min()) 
    plt.scatter(xs * scalex,ys * scaley)
    if labels is None: 
        labels = a1  # fall back to the feature names used for the PCA
    for i in range(n): 
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5) 
        plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    # axis settings are applied once, after all arrows are drawn
    plt.xlim(-1,1)
    plt.ylim(-1,1) 
    plt.xlabel("PC1") 
    plt.ylabel("PC2") 
    plt.grid()

myplot(pca_scores[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.xlabel("PC1") 
plt.ylabel("PC2")
Out[20]:
Text(0, 0.5, 'PC2')
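The arrows in a biplot are the loadings, and they can also be read as a table. The sketch below builds such a table on synthetic data in which popularity and revenue are generated to be correlated and imdb_score is independent. The data is randomly generated for illustration, so the numbers will differ from the movie data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["popularity", "revenue", "imdb_score"]
rng = np.random.default_rng(1)

# Synthetic stand-in: the first two features share a common factor.
base = rng.normal(size=300)
data = pd.DataFrame({
    "popularity": base + rng.normal(scale=0.5, size=300),
    "revenue": base + rng.normal(scale=0.5, size=300),
    "imdb_score": rng.normal(size=300),
})

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(data[features]))

# Rows of components_ are the principal axes; transposing gives one row
# per feature and one column per component.
loadings = pd.DataFrame(pca.components_.T, index=features, columns=["PC1", "PC2"])

print(loadings.round(3))
```

Features whose loadings on a component have the same sign and similar size point in a similar direction in the biplot, which is how correlated features show up there.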

Conclusion

In the movie industry, the director plays a crucial role in making a film a success. The Facebook likes of actors, actresses and the cast are correlated with each other but do not appear to influence gross income, ratings or the other features. Gross income, budget, revenue, critic reviews, user reviews and movie Facebook likes are all correlated and influence one another. IMDB score and TMDB score do not correlate strongly with any other feature, which may mean that they are independent of the features considered here. The acting, the director and the storyline might be the deciding factors, but this cannot be concluded from the data.

Appendix

The dataset from TMDB.

In [21]:
#Appendix 1
tmdb_movies = pd.read_csv("Dataset_all/tmdb_5000_movies.csv")
tmdb_movies.head(2)
#tmdb_movies.shape=(4803, 20)
Out[21]:
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.avatarmovie.com/ 19995 [{"id": 1463, "name": "culture clash"}, {"id":... en Avatar In the 22nd century, a paraplegic Marine is di... 150.437577 [{"name": "Ingenious Film Partners", "id": 289... [{"iso_3166_1": "US", "name": "United States o... 2009-12-10 2787965087 162.0 [{"iso_639_1": "en", "name": "English"}, {"iso... Released Enter the World of Pandora. Avatar 7.2 11800
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... http://disney.go.com/disneypictures/pirates/ 285 [{"id": 270, "name": "ocean"}, {"id": 726, "na... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.082615 [{"name": "Walt Disney Pictures", "id": 2}, {"... [{"iso_3166_1": "US", "name": "United States o... 2007-05-19 961000000 169.0 [{"iso_639_1": "en", "name": "English"}] Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.9 4500

The dataset from IMDB.

In [22]:
#Appendix 2

movie_metadata = pd.read_csv("Dataset_all/movie_metadata.csv")
movie_metadata.head(3)
#movie_metadata.shape=(5043, 28)
Out[22]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000

3 rows × 28 columns

Some exploratory data analysis. The table below combines the directors' names with their movie titles, sorted by highest IMDB score. The result is similar to the revenue ranking: directors who produce blockbuster films also tend to have the highest IMDB scores, and their work is recognised. The following tables are similar and incorporate budget, revenue, gross, IMDB and TMDB scores.

In [23]:
#Appendix 3

c3=['director_name','imdb_score','movie_title']
c4=combine_df[c3]
c4 = c4[(c4[['director_name']] != 0).all(axis=1)]

director_1=c4.groupby(["director_name","movie_title"])["imdb_score"].max().sort_values(ascending=False).head(15) # highest IMDB score per director and movie, top 15
director_1
Out[23]:
director_name         movie_title                                      
Francis Ford Coppola  The Godfather                                        9.2
Christopher Nolan     The Dark Knight                                      9.0
Peter Jackson         The Lord of the Rings: The Return of the King        8.9
Steven Spielberg      Schindler's List                                     8.9
Christopher Nolan     Inception                                            8.8
David Fincher         Fight Club                                           8.8
Peter Jackson         The Lord of the Rings: The Fellowship of the Ring    8.8
                      The Lord of the Rings: The Two Towers                8.7
Lana Wachowski        The Matrix                                           8.7
Christopher Nolan     Interstellar                                         8.6
Tony Kaye             American History X                                   8.6
Bryan Singer          The Usual Suspects                                   8.6
David Fincher         Se7en                                                8.6
Robert Zemeckis       Back to the Future                                   8.5
Frank Darabont        The Green Mile                                       8.5
Name: imdb_score, dtype: float64
In [24]:
# group by the desired columns (one group per movie) and sort by budget
c2=combine_df.groupby(['director_name','vote_average','imdb_score' ,'popularity','movie_title'])['budget_x'].mean().sort_values(ascending=False).head(20)
c2.to_frame() # making it into dataframe
Out[24]:
budget_x
director_name vote_average imdb_score popularity movie_title
Gore Verbinski 6.9 7.1 139.082615 Pirates of the Caribbean: At World's End 300000000.0
Andrew Stanton 6.1 6.6 43.926995 John Carter 263700000.0
Nathan Greno 7.4 7.8 48.681969 Tangled 260000000.0
Sam Raimi 5.9 6.2 115.699814 Spider-Man 3 258000000.0
Joss Whedon 7.3 7.5 134.279229 Avengers: Age of Ultron 250000000.0
David Yates 7.4 7.5 98.885637 Harry Potter and the Half-Blood Prince 250000000.0
Peter Jackson 7.1 7.5 120.965743 The Hobbit: The Battle of the Five Armies 250000000.0
Rob Marshall 6.4 6.7 135.413856 Pirates of the Caribbean: On Stranger Tides 250000000.0
Zack Snyder 5.7 6.9 155.790452 Batman v Superman: Dawn of Justice 250000000.0
Christopher Nolan 7.6 8.5 112.312950 The Dark Knight Rises 250000000.0
Anthony Russo 7.1 8.2 198.372395 Captain America: Civil War 250000000.0
Sam Mendes 6.3 6.8 107.376788 Spectre 245000000.0
James Cameron 7.2 7.9 150.437577 Avatar 237000000.0
Marc Webb 6.5 7.0 89.866276 The Amazing Spider-Man 230000000.0
Peter Jackson 7.6 7.9 94.370564 The Hobbit: The Desolation of Smaug 225000000.0
Barry Sonnenfeld 6.2 6.8 52.035179 Men in Black 3 225000000.0
Gore Verbinski 7.0 7.3 145.847379 Pirates of the Caribbean: Dead Man's Chest 225000000.0
Zack Snyder 6.5 7.2 99.398009 Man of Steel 225000000.0
Joss Whedon 7.4 8.1 144.448633 The Avengers 220000000.0
Gore Verbinski 5.9 6.5 49.046956 The Lone Ranger 215000000.0
In [25]:
#Appendix 4

a1=['director_name','imdb_score','movie_title', 'revenue', 'gross']
a2=combine_df[a1]
a2 = a2[(a2[['director_name', 'revenue']] != 0).all(axis=1)]

director_2=a2.groupby(["director_name","movie_title",'revenue','gross'])["imdb_score"].max().sort_values(ascending=False).head(15) # highest IMDB score per director and movie, top 15
director_2
Out[25]:
director_name         movie_title                                        revenue     gross      
Francis Ford Coppola  The Godfather                                      245066411   134821952.0    9.2
Christopher Nolan     The Dark Knight                                    1004558444  533316061.0    9.0
Steven Spielberg      Schindler's List                                   321365567   96067179.0     8.9
Peter Jackson         The Lord of the Rings: The Return of the King      1118888979  377019252.0    8.9
Christopher Nolan     Inception                                          825532764   292568851.0    8.8
David Fincher         Fight Club                                         100853753   37023395.0     8.8
Peter Jackson         The Lord of the Rings: The Fellowship of the Ring  871368364   313837577.0    8.8
Lana Wachowski        The Matrix                                         463517383   171383253.0    8.7
Peter Jackson         The Lord of the Rings: The Two Towers              926287400   340478898.0    8.7
Tony Kaye             American History X                                 23875127    6712241.0      8.6
David Fincher         Se7en                                              327311859   100125340.0    8.6
Bryan Singer          The Usual Suspects                                 23341568    23272306.0     8.6
Christopher Nolan     Interstellar                                       675120017   187991439.0    8.6
                      The Dark Knight Rises                              1084939099  448130642.0    8.5
Ridley Scott          Alien                                              104931801   78900000.0     8.5
Name: imdb_score, dtype: float64
In [26]:
#Appendix 5
c=combine_df.groupby(['director_name','imdb_score' ])['revenue'].mean().sort_values(ascending=False).head(15)


plt.style.use('fivethirtyeight')    #choosing style, colour of plot
c.unstack().plot.barh(figsize=(11,11))   #pandas creates its own figure, so the size is set here

plt.title("Barplot of the Director names and revenue earned by the movie with its IMDB score", fontsize=18) #labelling title
plt.ylabel("Director names", fontsize=18)      #labelling y-axis
plt.xlabel("Revenue in billion", fontsize=18)           #labelling x-axis
plt.legend(fontsize=11,loc=0)       #placing the legend in a suitable position, with font size 11
plt.xticks(fontsize=15)            #fixing font size of the x axis elements
plt.yticks(fontsize=15)                  #fixing font size of the y axis elements


plt.show()
print(c.head(15))
director_name    imdb_score
James Cameron    7.9           2.787965e+09
                 7.7           1.845034e+09
Joss Whedon      8.1           1.519558e+09
Colin Trevorrow  7.0           1.513529e+09
James Wan        7.2           1.506249e+09
Joss Whedon      7.5           1.405404e+09
Chris Buck       7.6           1.274219e+09
Shane Black      7.2           1.215440e+09
Kyle Balda       6.4           1.156731e+09
Anthony Russo    8.2           1.153304e+09
Michael Bay      6.3           1.123747e+09
Peter Jackson    8.9           1.118889e+09
Sam Mendes       7.8           1.108561e+09
Michael Bay      5.7           1.091405e+09
Lee Unkrich      8.3           1.066970e+09
Name: revenue, dtype: float64

Correlation map of the combined dataset. It groups the Facebook likes together in one group, and the financial variables, critic reviews, user reviews, popularity, etc. in another group.

In [27]:
##Appendix 7

no_id_c=combine_df_new.drop(columns=['id'])
f,ax=plt.subplots(figsize=(15,10))
plt.title('Correlation of Movie Features')
sns.heatmap(no_id_c.astype(float).corr(), linewidths=0.25, vmax=1.0, square=True, cmap="magma", linecolor="black", annot=True )
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()

Producing a data frame with only the numerical columns of the TMDB dataset.

In [25]:
# filtering data output is only numerical values

str_list2=[] #empty list
for colname, colvalue in tmdb_movies.items(): # items() replaces iteritems(), which was removed in pandas 2.0
    if type(colvalue[1])==str:
        str_list2.append(colname)
num_list= tmdb_movies.columns.difference(str_list2) # will get only numeric values
tmdb_m_num = tmdb_movies[num_list]
tmdb_m_num #shape=4803 x 7
Out[25]:
budget id popularity revenue runtime vote_average vote_count
0 237000000 19995 150.437577 2787965087 162.0 7.2 11800
1 300000000 285 139.082615 961000000 169.0 6.9 4500
2 245000000 206647 107.376788 880674609 148.0 6.3 4466
3 250000000 49026 112.312950 1084939099 165.0 7.6 9106
4 260000000 49529 43.926995 284139100 132.0 6.1 2124
... ... ... ... ... ... ... ...
4798 220000 9367 14.269792 2040920 81.0 6.6 238
4799 9000 72766 0.642552 0 85.0 5.9 5
4800 0 231617 1.444476 0 120.0 7.0 6
4801 0 126186 0.857008 0 98.0 5.7 7
4802 0 25975 1.929883 0 90.0 6.3 16

4803 rows × 7 columns

Correlation heat map and cluster map of the TMDB numerical dataset. Other than runtime and vote count, all features appear to be correlated. The cluster map, however, indicates 3 clusters:

  1. Runtime and Vote count
  2. Budget and Revenue
  3. Vote count and Popularity
In [26]:
#ap
no_id_tmdb_m_num=tmdb_m_num.drop(columns=['id'])

plt.figure(figsize=(12,10))
plt.title('Correlation of Movie Features')
ax=sns.heatmap(no_id_tmdb_m_num.astype(float).corr(), vmax=1,  cmap="magma", linecolor="black",annot=True)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
sns.clustermap(no_id_tmdb_m_num.corr(),cmap="magma",standard_scale=1)
plt.show()
#with sns.axes_style("white"):
#    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True,  cmap="YlGnBu")
#    plt.show()