from IPython.display import HTML
HTML('''
<script src='//code.jquery.com/jquery-3.3.1.min.js'></script>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
$('div .jp-CodeCell .jp-Cell-inputWrapper').hide();
} else {
$('div.input').show();
$('div .jp-CodeCell .jp-Cell-inputWrapper').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Code on/off"></form>''')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
%matplotlib inline
It is hard to say which features make a movie a blockbuster. It is easy to guess, but a guess may not be the real reason. The goal here was to carry out an exploratory data analysis in Python and draw conclusions from it. The report uses two datasets, one from IMDB and one from TMDB. One has 28 features and the other 20, and each contains close to 5000 movies. The top directors appear to play a crucial role in making a movie successful. Some features in the movie databases are strongly correlated and others show no meaningful relationship. Facebook likes for actors and cast are correlated with one another, and financial and audience measures such as budget, gross income, popularity and critic reviews are correlated with one another.
Movies have a huge impact on society. Nowadays people do not just watch movies; they also absorb a lot of information from the cinematic world. Cinema has the capacity to influence communities both locally and globally. Many kinds of movies are made each year, and each is reviewed and criticised by many organisations.
A lot of effort goes into making a movie. It can turn out to be a success or a failure depending on various factors, such as the choice of director, actors, theme, budget and storyline. The themes, direction and storylines of movies also change over time, and the movie industry evolves with them. The aim of this report is to analyse these trends and how the features in the movie databases are correlated.
Two datasets from two movie databases are used for the analysis:
tmdb_movies = pd.read_csv("Dataset_all/tmdb_5000_movies.csv")
movie_metadata = pd.read_csv("Dataset_all/movie_metadata.csv")
print('1. The movie_metadata (IMDB) dataset contains:', len(movie_metadata), 'movies and', len(movie_metadata.columns), 'features.')
print('2. The tmdb_movies (TMDB) dataset contains:', len(tmdb_movies), 'movies and', len(tmdb_movies.columns), 'features.')
1. The movie_metadata (IMDB) dataset contains: 5043 movies and 28 features.
2. The tmdb_movies (TMDB) dataset contains: 4803 movies and 20 features.
print('Below are the 28 features of IMDB dataset: \n ')
s_m = sorted(movie_metadata.columns)
for col1 in s_m:
    print(col1, end=' . ')
print('\n \nBelow are the 20 features of TMDB dataset: \n')
s_t = sorted(tmdb_movies.columns)
for col2 in s_t:
    print(col2, end=' . ')
Below are the 28 features of IMDB dataset:
actor_1_facebook_likes . actor_1_name . actor_2_facebook_likes . actor_2_name . actor_3_facebook_likes . actor_3_name . aspect_ratio . budget . cast_total_facebook_likes . color . content_rating . country . director_facebook_likes . director_name . duration . facenumber_in_poster . genres . gross . imdb_score . language . movie_facebook_likes . movie_imdb_link . movie_title . num_critic_for_reviews . num_user_for_reviews . num_voted_users . plot_keywords . title_year .
Below are the 20 features of TMDB dataset:
budget . genres . homepage . id . keywords . original_language . original_title . overview . popularity . production_companies . production_countries . release_date . revenue . runtime . spoken_languages . status . tagline . title . vote_average . vote_count .
The IMDB dataset spans about 100 years and 66 countries and includes gross earnings. It has 2399 unique director names and thousands of actors and actresses. The TMDB dataset spans 101 years (1916-2017).
A few libraries had to be loaded to run the analysis in Python. For Principal Component Analysis (PCA), sklearn was imported: the data is standardised with sklearn's StandardScaler and the PCA is run with sklearn.decomposition.PCA.
For the plots, seaborn and matplotlib are imported, and matplotlib's ggplot style is applied.
Like every dataset, the data had to be cleaned after loading. All rows containing NaNs (empty values) were removed. To find the popular directors, the two datasets were merged on the movie title, but this raised an issue: because the datasets come from different databases, the titles are written differently. Troubleshooting showed that the movie title column in the IMDB data has trailing whitespace, which had to be removed. After stripping it, the two datasets were merged with an inner join, forming a combined dataset of 4516 movies and 48 features.
The head of the combined data frame is shown below.
# drop rows that contain NaN values
movie_metadata = movie_metadata.dropna()
tmdb_movies = tmdb_movies.dropna()
# strip trailing whitespace from movie_title, otherwise the titles will not match on merge
movie_metadata['movie_title'] = movie_metadata['movie_title'].str.rstrip()
# inner join on movie title: only movies present in both datasets are kept
combine_df = pd.merge(movie_metadata, tmdb_movies, left_on="movie_title", right_on="original_title", how='inner')
combine_df.head()
#combine_df.shape=(4516, 48)
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres_x | ... | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Color | James Cameron | 723.0 | 178.0 | 0.0 | 855.0 | Joel David Moore | 1000.0 | 760505847.0 | Action|Adventure|Fantasy|Sci-Fi | ... | [{"iso_3166_1": "US", "name": "United States o... | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |
1 | Color | Gore Verbinski | 302.0 | 169.0 | 563.0 | 1000.0 | Orlando Bloom | 40000.0 | 309404152.0 | Action|Adventure|Fantasy | ... | [{"iso_3166_1": "US", "name": "United States o... | 2007-05-19 | 961000000 | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 |
2 | Color | Sam Mendes | 602.0 | 148.0 | 0.0 | 161.0 | Rory Kinnear | 11000.0 | 200074175.0 | Action|Adventure|Thriller | ... | [{"iso_3166_1": "GB", "name": "United Kingdom"... | 2015-10-26 | 880674609 | 148.0 | [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 |
3 | Color | Christopher Nolan | 813.0 | 164.0 | 22000.0 | 23000.0 | Christian Bale | 27000.0 | 448130642.0 | Action|Thriller | ... | [{"iso_3166_1": "US", "name": "United States o... | 2012-07-16 | 1084939099 | 165.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The Legend Ends | The Dark Knight Rises | 7.6 | 9106 |
4 | Color | Andrew Stanton | 462.0 | 132.0 | 475.0 | 530.0 | Samantha Morton | 640.0 | 73058679.0 | Action|Adventure|Sci-Fi | ... | [{"iso_3166_1": "US", "name": "United States o... | 2012-03-07 | 284139100 | 132.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 |
5 rows × 48 columns
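The trailing-whitespace issue described above can be reproduced on a small scale. The following is a minimal sketch, assuming two made-up frames (`imdb_toy` and `tmdb_toy` are hypothetical stand-ins, not the real datasets); it only illustrates why the inner join loses rows until the title column is stripped.

```python
import pandas as pd

# Hypothetical stand-ins for the two sources; titles carry trailing spaces on purpose.
imdb_toy = pd.DataFrame({
    "movie_title": ["Avatar ", "Spectre ", "Tangled"],
    "imdb_score": [7.9, 6.8, 7.8],
})
tmdb_toy = pd.DataFrame({
    "original_title": ["Avatar", "Spectre", "Tangled"],
    "vote_average": [7.2, 6.3, 7.4],
})

# Without cleaning, only the title that has no trailing space finds a match.
raw_merge = pd.merge(imdb_toy, tmdb_toy, left_on="movie_title",
                     right_on="original_title", how="inner")

# After stripping trailing whitespace, every title matches.
imdb_clean = imdb_toy.assign(movie_title=imdb_toy["movie_title"].str.rstrip())
clean_merge = pd.merge(imdb_clean, tmdb_toy, left_on="movie_title",
                       right_on="original_title", how="inner")

print(len(raw_merge), len(clean_merge))
```

Comparing the row counts before and after the strip is a quick way to check whether a join key needs cleaning.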
From this combined data frame, the table lists the popular directors with the TMDB score, IMDB score, popularity and revenue of their movies. Some further exploratory data analysis can be found in the Appendix.
A cluster map and a correlation heat map are used to interpret the data. For the correlation, the data had to be cleaned again by building a data frame with only the selected numerical columns. The correlation heat map was then drawn with the correlation coefficients annotated. The cluster map arranges the correlation matrix in a hierarchically clustered map, which shows the bigger picture in hierarchical order and suggests how many clusters can be formed.
For the PCA, three features from the merged dataset were used first. The data was standardised with StandardScaler and the PCA model was fitted, which gives the PCA explained variance ratio. Second, the number of components was plotted against the cumulative explained variance. Third, the table of PCA scores and its scatter plot were produced. Finally, the biplot was drawn, which combines the loadings and the PCA scores in one graph.
The table generated below shows the following.
The directors of the highest-revenue films are:
James Cameron, Joss Whedon, Colin Trevorrow, James Wan, Chris Buck, Shane Black, Kyle Balda, Anthony Russo and Michael Bay.
Their movies generated the highest revenue of their time. The table shows the TMDB and IMDB scores, the popularity of each movie and the revenue it generated. This suggests that the success of a movie depends in part on how the director directs it and carries it through to the end. These names are well known in the movie industry, and James Cameron and Joss Whedon each appear twice among the top rows.
# group by the desired columns and take the top 20 movies by revenue
c2=combine_df.groupby(['director_name','vote_average','imdb_score' ,'popularity','movie_title'])['revenue'].mean().sort_values(ascending=False).head(20)
c2.to_frame() # convert the Series to a data frame
revenue | |||||
---|---|---|---|---|---|
director_name | vote_average | imdb_score | popularity | movie_title | |
James Cameron | 7.2 | 7.9 | 150.437577 | Avatar | 2787965087 |
7.5 | 7.7 | 100.025899 | Titanic | 1845034188 | |
Joss Whedon | 7.4 | 8.1 | 144.448633 | The Avengers | 1519557910 |
Colin Trevorrow | 6.5 | 7.0 | 418.708552 | Jurassic World | 1513528810 |
James Wan | 7.3 | 7.2 | 102.322217 | Furious 7 | 1506249360 |
Joss Whedon | 7.3 | 7.5 | 134.279229 | Avengers: Age of Ultron | 1405403694 |
Chris Buck | 7.3 | 7.6 | 165.125366 | Frozen | 1274219009 |
Shane Black | 6.8 | 7.2 | 77.682080 | Iron Man 3 | 1215439994 |
Kyle Balda | 6.4 | 6.4 | 875.581305 | Minions | 1156730962 |
Anthony Russo | 7.1 | 8.2 | 198.372395 | Captain America: Civil War | 1153304495 |
Michael Bay | 6.1 | 6.3 | 28.529607 | Transformers: Dark of the Moon | 1123746996 |
Peter Jackson | 8.1 | 8.9 | 123.630332 | The Lord of the Rings: The Return of the King | 1118888979 |
Sam Mendes | 6.9 | 7.8 | 93.004993 | Skyfall | 1108561013 |
Michael Bay | 5.8 | 5.7 | 116.840296 | Transformers: Age of Extinction | 1091405097 |
Christopher Nolan | 7.6 | 8.5 | 112.312950 | The Dark Knight Rises | 1084939099 |
Lee Unkrich | 7.6 | 8.3 | 59.995418 | Toy Story 3 | 1066969703 |
Gore Verbinski | 7.0 | 7.3 | 145.847379 | Pirates of the Caribbean: Dead Man's Chest | 1065659812 |
Rob Marshall | 6.4 | 6.7 | 135.413856 | Pirates of the Caribbean: On Stranger Tides | 1045713802 |
Tim Burton | 6.4 | 6.5 | 78.530105 | Alice in Wonderland | 1025491110 |
Christopher Nolan | 8.2 | 9.0 | 187.322927 | The Dark Knight | 1004558444 |
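The table above ranks individual films, so a director with several hits appears on several rows. A director-level view would aggregate over each director's films instead. The sketch below shows one possible way to do that with pandas named aggregation; the frame `toy` and its values are hypothetical and only illustrate the shape of the calculation.

```python
import pandas as pd

# Hypothetical toy data: directors "A", "B", "C" and made-up revenue and scores.
toy = pd.DataFrame({
    "director_name": ["A", "A", "B", "C", "C"],
    "revenue": [300, 100, 250, 50, 70],
    "imdb_score": [8.0, 7.0, 7.5, 6.0, 6.4],
})

# One row per director: number of films, mean revenue and mean IMDB score.
per_director = (
    toy.groupby("director_name")
       .agg(n_movies=("revenue", "size"),
            mean_revenue=("revenue", "mean"),
            mean_imdb=("imdb_score", "mean"))
       .sort_values("mean_revenue", ascending=False)
)

print(per_director)
```

On the merged data the same pattern would apply to `combine_df`, with the caveat that directors with a single film are ranked on that one film alone.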
The correlation heatmap and cluster map show the relationship between each pair of features. The lighter the color, the stronger the correlation.
The correlation heat map is annotated with the correlation coefficients. It shows that the actors' Facebook likes and the total cast Facebook likes are highly correlated with each other, although they have little relationship with revenue or profit. The director's Facebook likes are not noticeably correlated with any other feature.
Budget, gross income, movie Facebook likes, number of critic reviews, number of user votes, popularity, revenue and vote count are highly correlated with one another. It can be interpreted that the more critic reviews and user reviews a movie receives, the more popular it tends to be, and the revenue is likely to be higher as well.
The cluster map arranges the correlation matrix in a hierarchically clustered map. It shows 3 major clusters.
# prepare the data frame for correlation
# keep numeric columns only
str_list = []  # names of string-valued columns
for colname, colvalue in combine_df.items():  # items() replaces iteritems(), which was removed in pandas 2.0
    if type(colvalue.iloc[1]) == str:
        str_list.append(colname)
num_list = combine_df.columns.difference(str_list)  # remaining columns are numeric
combine_df_new = combine_df[num_list]
#combine_df_new.head() #shape=4516 rows × 23 columns
col4=['actor_1_facebook_likes','actor_2_facebook_likes','actor_3_facebook_likes','cast_total_facebook_likes','director_facebook_likes','budget_y','gross','imdb_score','movie_facebook_likes','num_critic_for_reviews','num_user_for_reviews','num_voted_users','popularity','revenue','vote_average','vote_count']
corr_df=combine_df_new[col4]
f,ax=plt.subplots(figsize=(12,10))
plt.title('Correlation of Movie Features')
sns.heatmap(corr_df.astype(float).corr(), linewidths=0.25, vmax=1.0,square=True, cmap="magma", linecolor="black", annot=True )
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
sns.clustermap(corr_df.corr(),cmap="magma",standard_scale=1)
plt.show()
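The numeric-column filtering and correlation step above can also be written more compactly. The following is a minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration); it uses `select_dtypes` in place of the manual loop over columns and is not the method used in this report.

```python
import pandas as pd

# Hypothetical toy frame with one text column and three numeric columns.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "budget": [1.0, 2.0, 3.0, 4.0],
    "revenue": [2.0, 4.1, 5.9, 8.2],
    "score": [7.0, 5.0, 8.0, 6.0],
})

# Keep only numeric columns, then compute the pairwise Pearson correlations.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

print(corr.round(2))
```

The resulting `corr` frame is what a call such as `sns.heatmap(corr, annot=True)` would draw.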
Principal Component Analysis, an unsupervised learning method, was used to find patterns in the data. Three features were chosen: popularity, revenue and imdb_score. This gives three PCA explained variance ratios.
a1=[ 'popularity','revenue','imdb_score' ]
x=combine_df[a1] #formed the data for pca
x.head()
popularity | revenue | imdb_score | |
---|---|---|---|
0 | 150.437577 | 2787965087 | 7.9 |
1 | 139.082615 | 961000000 | 7.1 |
2 | 107.376788 | 880674609 | 6.8 |
3 | 112.312950 | 1084939099 | 8.5 |
4 | 43.926995 | 284139100 | 6.6 |
T = x.values  # assign all values to T
scaler = StandardScaler()  # standardises each feature to zero mean and unit variance
T_scaled = scaler.fit_transform(T)  # fit the scaler and transform the values
pca = PCA()
T_7d=pca.fit_transform(T_scaled)
print('pca explained variance ratio =',pca.explained_variance_ratio_)
pca explained variance ratio = [0.59308083 0.26451655 0.14240262]
The line plot of cumulative explained variance against the number of components shows that the first 2 principal components account for about 86% of the variance (0.593 + 0.265). The space was therefore collapsed from 3 dimensions to 2 dimensions.
components = np.arange(1,4)
plt.plot(components, np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()
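The cumulative totals behind this plot can be checked directly from the explained variance ratios printed earlier. The sketch below reuses those three printed values; the 0.95 threshold is only an illustrative assumption, not a rule taken from this report.

```python
import numpy as np

# Explained variance ratios as printed by the PCA fit above.
ratios = np.array([0.59308083, 0.26451655, 0.14240262])

# Running total of variance explained by the first k components.
cumulative = np.cumsum(ratios)
for k, total in enumerate(cumulative, start=1):
    print(f"{k} component(s): {total:.3f}")

# Smallest number of components whose cumulative share reaches a chosen threshold.
threshold = 0.95  # hypothetical cut-off for illustration
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print("components needed:", n_needed)
```

Keeping two components is then a trade-off between a simpler two-dimensional view and the share of variance that is left out.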
pca = PCA(n_components=2)
pca_scores = pca.fit_transform(T_scaled)  # fit the 2-component model and compute the PCA scores
pca_loadings = pca.components_  # loadings of the fitted components
# Note that the dimensionality of the space has been collapsed from 3 dimensions to 2 dimensions
The points look clustered around one corner, but they fan out in the opposite direction.
plt.scatter(pca_scores[:,0], pca_scores[:,1])
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Scatter plot of Principal Component Analysis')
plt.show()
It looks like revenue and popularity are more strongly correlated with each other. The first PC is a combination of revenue, popularity and IMDB score. The second PC is dominated a little more by IMDB score.
def myplot(score, coeff, labels=None):
    """Biplot: scaled PCA scores as points, loadings as arrows."""
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    # rescale the scores so that points and arrows fit on the same axes
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley)
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1], color = 'r', alpha = 0.5)
        if labels is None:
            # fall back to the feature names in a1
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, a1[i], color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.grid()
myplot(pca_scores[:,0:2], np.transpose(pca.components_[0:2, :]))
plt.show()
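The arrows in the biplot are the PCA loadings, and they can also be read as a small table. The sketch below is a hedged illustration on synthetic data: the frame `toy`, the random values and the name `toy_pca` are assumptions, and the column names are borrowed from the analysis only to make the table easier to read. It does not reproduce the numbers from the merged dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: two columns share a common signal, the third is independent noise.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
toy = pd.DataFrame({
    "popularity": base + 0.3 * rng.normal(size=n),
    "revenue": base + 0.3 * rng.normal(size=n),
    "imdb_score": rng.normal(size=n),
})

# Standardise, fit a 2-component PCA, and label the loadings by feature and component.
scaled = StandardScaler().fit_transform(toy)
toy_pca = PCA(n_components=2).fit(scaled)
loadings = pd.DataFrame(toy_pca.components_.T,
                        index=toy.columns, columns=["PC1", "PC2"])

print(loadings.round(2))
```

Each row of `loadings` gives the coordinates of one arrow in the biplot, so features with similar rows point in similar directions.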
In the movie industry, directors do play a crucial role in making a film a success. The Facebook likes of actors, actresses and cast are correlated with one another but do not appear to influence gross income or ratings. Gross income, budget, revenue, critic reviews, user reviews and movie Facebook likes are all correlated with one another. IMDB score and TMDB score do not correlate strongly with any other feature, which may mean they are independent of the features considered here. The acting, the director and the storyline might be the deciding factors, but this cannot be concluded from the data.
The dataset from TMDB.
#Appendix 1
tmdb_movies = pd.read_csv("Dataset_all/tmdb_5000_movies.csv")
tmdb_movies.head(2)
#tmdb_movies.shape=(4803, 20)
budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | [{"iso_3166_1": "US", "name": "United States o... | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |
1 | 300000000 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | http://disney.go.com/disneypictures/pirates/ | 285 | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | [{"iso_3166_1": "US", "name": "United States o... | 2007-05-19 | 961000000 | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 |
The dataset from IMDB.
#Appendix 2
movie_metadata = pd.read_csv("Dataset_all/movie_metadata.csv")
movie_metadata.head(3)
#movie_metadata.shape=(5043, 28)
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | ... | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Color | James Cameron | 723.0 | 178.0 | 0.0 | 855.0 | Joel David Moore | 1000.0 | 760505847.0 | Action|Adventure|Fantasy|Sci-Fi | ... | 3054.0 | English | USA | PG-13 | 237000000.0 | 2009.0 | 936.0 | 7.9 | 1.78 | 33000 |
1 | Color | Gore Verbinski | 302.0 | 169.0 | 563.0 | 1000.0 | Orlando Bloom | 40000.0 | 309404152.0 | Action|Adventure|Fantasy | ... | 1238.0 | English | USA | PG-13 | 300000000.0 | 2007.0 | 5000.0 | 7.1 | 2.35 | 0 |
2 | Color | Sam Mendes | 602.0 | 148.0 | 0.0 | 161.0 | Rory Kinnear | 11000.0 | 200074175.0 | Action|Adventure|Thriller | ... | 994.0 | English | UK | PG-13 | 245000000.0 | 2015.0 | 393.0 | 6.8 | 2.35 | 85000 |
3 rows × 28 columns
Some exploratory data analysis. This table combines director names and their respective movie titles, sorted by highest IMDB score. The result suggests that directors who produce blockbuster films also tend to have films with the highest IMDB scores, and that their work is recognised. The next table is similar and incorporates revenue, gross, IMDB and TMDB scores.
#Appendix 3
c3=['director_name','imdb_score','movie_title']
c4=combine_df[c3]
c4 = c4[(c4[['director_name']] != 0).all(axis=1)]
director_1=c4.groupby(["director_name","movie_title"])["imdb_score"].max().sort_values(ascending=False).head(15) # highest IMDB score per director and movie, top 15
director_1
director_name         movie_title
Francis Ford Coppola  The Godfather                                        9.2
Christopher Nolan     The Dark Knight                                      9.0
Peter Jackson         The Lord of the Rings: The Return of the King        8.9
Steven Spielberg      Schindler's List                                     8.9
Christopher Nolan     Inception                                            8.8
David Fincher         Fight Club                                           8.8
Peter Jackson         The Lord of the Rings: The Fellowship of the Ring    8.8
                      The Lord of the Rings: The Two Towers                8.7
Lana Wachowski        The Matrix                                           8.7
Christopher Nolan     Interstellar                                         8.6
Tony Kaye             American History X                                   8.6
Bryan Singer          The Usual Suspects                                   8.6
David Fincher         Se7en                                                8.6
Robert Zemeckis       Back to the Future                                   8.5
Frank Darabont        The Green Mile                                       8.5
Name: imdb_score, dtype: float64
# group by the desired columns and take the top 20 movies by budget
c2=combine_df.groupby(['director_name','vote_average','imdb_score' ,'popularity','movie_title'])['budget_x'].mean().sort_values(ascending=False).head(20)
c2.to_frame() # convert the Series to a data frame
budget_x | |||||
---|---|---|---|---|---|
director_name | vote_average | imdb_score | popularity | movie_title | |
Gore Verbinski | 6.9 | 7.1 | 139.082615 | Pirates of the Caribbean: At World's End | 300000000.0 |
Andrew Stanton | 6.1 | 6.6 | 43.926995 | John Carter | 263700000.0 |
Nathan Greno | 7.4 | 7.8 | 48.681969 | Tangled | 260000000.0 |
Sam Raimi | 5.9 | 6.2 | 115.699814 | Spider-Man 3 | 258000000.0 |
Joss Whedon | 7.3 | 7.5 | 134.279229 | Avengers: Age of Ultron | 250000000.0 |
David Yates | 7.4 | 7.5 | 98.885637 | Harry Potter and the Half-Blood Prince | 250000000.0 |
Peter Jackson | 7.1 | 7.5 | 120.965743 | The Hobbit: The Battle of the Five Armies | 250000000.0 |
Rob Marshall | 6.4 | 6.7 | 135.413856 | Pirates of the Caribbean: On Stranger Tides | 250000000.0 |
Zack Snyder | 5.7 | 6.9 | 155.790452 | Batman v Superman: Dawn of Justice | 250000000.0 |
Christopher Nolan | 7.6 | 8.5 | 112.312950 | The Dark Knight Rises | 250000000.0 |
Anthony Russo | 7.1 | 8.2 | 198.372395 | Captain America: Civil War | 250000000.0 |
Sam Mendes | 6.3 | 6.8 | 107.376788 | Spectre | 245000000.0 |
James Cameron | 7.2 | 7.9 | 150.437577 | Avatar | 237000000.0 |
Marc Webb | 6.5 | 7.0 | 89.866276 | The Amazing Spider-Man | 230000000.0 |
Peter Jackson | 7.6 | 7.9 | 94.370564 | The Hobbit: The Desolation of Smaug | 225000000.0 |
Barry Sonnenfeld | 6.2 | 6.8 | 52.035179 | Men in Black 3 | 225000000.0 |
Gore Verbinski | 7.0 | 7.3 | 145.847379 | Pirates of the Caribbean: Dead Man's Chest | 225000000.0 |
Zack Snyder | 6.5 | 7.2 | 99.398009 | Man of Steel | 225000000.0 |
Joss Whedon | 7.4 | 8.1 | 144.448633 | The Avengers | 220000000.0 |
Gore Verbinski | 5.9 | 6.5 | 49.046956 | The Lone Ranger | 215000000.0 |
#Appendix 4
a1=['director_name','imdb_score','movie_title', 'revenue', 'gross']
a2=combine_df[a1]
a2 = a2[(a2[['director_name', 'revenue']] != 0).all(axis=1)]
director_2=a2.groupby(["director_name","movie_title",'revenue','gross'])["imdb_score"].max().sort_values(ascending=False).head(15) # highest IMDB score per director and movie, top 15
director_2
director_name         movie_title                                        revenue     gross
Francis Ford Coppola  The Godfather                                      245066411   134821952.0    9.2
Christopher Nolan     The Dark Knight                                    1004558444  533316061.0    9.0
Steven Spielberg      Schindler's List                                   321365567   96067179.0     8.9
Peter Jackson         The Lord of the Rings: The Return of the King      1118888979  377019252.0    8.9
Christopher Nolan     Inception                                          825532764   292568851.0    8.8
David Fincher         Fight Club                                         100853753   37023395.0     8.8
Peter Jackson         The Lord of the Rings: The Fellowship of the Ring  871368364   313837577.0    8.8
Lana Wachowski        The Matrix                                         463517383   171383253.0    8.7
Peter Jackson         The Lord of the Rings: The Two Towers              926287400   340478898.0    8.7
Tony Kaye             American History X                                 23875127    6712241.0      8.6
David Fincher         Se7en                                              327311859   100125340.0    8.6
Bryan Singer          The Usual Suspects                                 23341568    23272306.0     8.6
Christopher Nolan     Interstellar                                       675120017   187991439.0    8.6
                      The Dark Knight Rises                              1084939099  448130642.0    8.5
Ridley Scott          Alien                                              104931801   78900000.0     8.5
Name: imdb_score, dtype: float64
#Appendix 5
c=combine_df.groupby(['director_name','imdb_score' ])['revenue'].mean().sort_values(ascending=False).head(15)
plt.style.use('fivethirtyeight') # choose the style and colours of the plot
c.unstack().plot.barh(figsize=(11,11)) # set the size here; a separate plt.figure() call would leave an empty figure
plt.title("Barplot of the Director names and revenue earned by the movie with its IMDB score", fontsize=18) # title
plt.ylabel("Director names", fontsize=18) # y-axis label
plt.xlabel("Revenue in billion", fontsize=18) # x-axis label
plt.legend(fontsize=11,loc=0) # place the legend in a suitable position, with font size 11
plt.xticks(fontsize=15) # font size of the x-axis tick labels
plt.yticks(fontsize=15) # font size of the y-axis tick labels
plt.show()
print(c.head(15))
director_name    imdb_score
James Cameron    7.9           2.787965e+09
                 7.7           1.845034e+09
Joss Whedon      8.1           1.519558e+09
Colin Trevorrow  7.0           1.513529e+09
James Wan        7.2           1.506249e+09
Joss Whedon      7.5           1.405404e+09
Chris Buck       7.6           1.274219e+09
Shane Black      7.2           1.215440e+09
Kyle Balda       6.4           1.156731e+09
Anthony Russo    8.2           1.153304e+09
Michael Bay      6.3           1.123747e+09
Peter Jackson    8.9           1.118889e+09
Sam Mendes       7.8           1.108561e+09
Michael Bay      5.7           1.091405e+09
Lee Unkrich      8.3           1.066970e+09
Name: revenue, dtype: float64
Correlation map of the combined dataset. It shows the Facebook likes grouped together in one group, and the financial variables, critic reviews, user reviews, popularity, etc. in another group.
##Appendix 7
no_id_c=combine_df_new.drop(columns=['id'])
f,ax=plt.subplots(figsize=(15,10))
plt.title('Correlation of Movie Features')
sns.heatmap(no_id_c.astype(float).corr(), linewidths=0.25, vmax=1.0, square=True, cmap="magma", linecolor="black", annot=True )
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
Producing a data frame with only the numerical values of TMDB.
# keep numeric columns only
str_list2 = []  # names of string-valued columns
for colname, colvalue in tmdb_movies.items():  # items() replaces iteritems(), which was removed in pandas 2.0
    if type(colvalue.iloc[1]) == str:
        str_list2.append(colname)
num_list = tmdb_movies.columns.difference(str_list2)  # remaining columns are numeric
tmdb_m_num = tmdb_movies[num_list]
tmdb_m_num #shape=4803 x 7
budget | id | popularity | revenue | runtime | vote_average | vote_count | |
---|---|---|---|---|---|---|---|
0 | 237000000 | 19995 | 150.437577 | 2787965087 | 162.0 | 7.2 | 11800 |
1 | 300000000 | 285 | 139.082615 | 961000000 | 169.0 | 6.9 | 4500 |
2 | 245000000 | 206647 | 107.376788 | 880674609 | 148.0 | 6.3 | 4466 |
3 | 250000000 | 49026 | 112.312950 | 1084939099 | 165.0 | 7.6 | 9106 |
4 | 260000000 | 49529 | 43.926995 | 284139100 | 132.0 | 6.1 | 2124 |
... | ... | ... | ... | ... | ... | ... | ... |
4798 | 220000 | 9367 | 14.269792 | 2040920 | 81.0 | 6.6 | 238 |
4799 | 9000 | 72766 | 0.642552 | 0 | 85.0 | 5.9 | 5 |
4800 | 0 | 231617 | 1.444476 | 0 | 120.0 | 7.0 | 6 |
4801 | 0 | 126186 | 0.857008 | 0 | 98.0 | 5.7 | 7 |
4802 | 0 | 25975 | 1.929883 | 0 | 90.0 | 6.3 | 16 |
4803 rows × 7 columns
Correlation heat map and cluster map of the TMDB numerical dataset. It looks like all features other than runtime and vote count are correlated. The cluster map, however, illustrates 3 clusters.
#ap
no_id_tmdb_m_num=tmdb_m_num.drop(columns=['id'])
plt.figure(figsize=(12,10))
plt.title('Correlation of Movie Features')
ax=sns.heatmap(no_id_tmdb_m_num.astype(float).corr(), vmax=1, cmap="magma", linecolor="black",annot=True)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
sns.clustermap(no_id_tmdb_m_num.corr(),cmap="magma",standard_scale=1)
plt.show()
#with sns.axes_style("white"):
# ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True, cmap="YlGnBu")
# plt.show()