from IPython.display import HTML

HTML('''
<script src='//code.jquery.com/jquery-3.3.1.min.js'></script>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
$('div .jp-CodeCell .jp-Cell-inputWrapper').hide();
} else {
$('div.input').show();
$('div .jp-CodeCell .jp-Cell-inputWrapper').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Code on/off"></form>''')

import pandas as pd
import numpy as np
import seaborn as sns 

import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

%matplotlib inline

Abstract¶

It is hard to say which features make a movie a blockbuster. It is easy to guess, but a guess may not be the real reason. The goal here was to carry out an exploratory data analysis in Python and draw conclusions from it. The report uses two datasets, one from IMDB and one from TMDB. The first has 28 features and the second has 20, and each contains close to 5000 movies. The top directors appear to play a crucial role in making a movie successful. Some features in the movie databases are strongly correlated and others show no significant relationship. Facebook likes of actors and of the total cast are related to one another, while financial and audience measures such as budget, gross income, popularity and critic reviews form a second related group.

Introduction and Background¶

Movies have a huge impact on society. Nowadays people do not just watch movies; they also absorb a lot of information from the cinematic world. Cinema has the capacity to influence communities both locally and globally. Many kinds of movies are made each year, and each is reviewed and criticised by many organisations.
A lot of effort goes into making a movie. It can turn out to be a success or a failure depending on various factors, such as the choice of director, actors, theme, budget and storyline. The theme, direction and storyline of movies are also changing with time, and the trends of the movie industry are evolving. The aim of this report is to analyse these trends and how the features in the movie databases are correlated.

Two datasets from two movie databases are used to analyse the trend:

tmdb_movies = pd.read_csv("Dataset_all/tmdb_5000_movies.csv")
movie_metadata = pd.read_csv("Dataset_all/movie_metadata.csv")

print('1. The movie_metadata (IMDB) dataset contains:', len(movie_metadata), 'movies and', len(movie_metadata.columns), 'features.')
print('2. The tmdb_movies (TMDB) dataset contains:', len(tmdb_movies), 'movies and', len(tmdb_movies.columns), 'features.')

1. The movie_metadata (IMDB) dataset contains: 5043 movies and 28 features.
2. The tmdb_movies (TMDB) dataset contains: 4803 movies and 20 features.

print('Below are the 28 features of IMDB dataset: \n ')
s_m=sorted( movie_metadata.columns)
for col1 in s_m: 
    print(col1, end=' . ')
    
print('\n \nBelow are the 20 features of TMDB dataset: \n')

s_t=sorted( tmdb_movies.columns)
for col2 in s_t: 
    print( col2,  end=' . ')

Below are the 28 features of IMDB dataset: 
 
actor_1_facebook_likes . actor_1_name . actor_2_facebook_likes . actor_2_name . actor_3_facebook_likes . actor_3_name . aspect_ratio . budget . cast_total_facebook_likes . color . content_rating . country . director_facebook_likes . director_name . duration . facenumber_in_poster . genres . gross . imdb_score . language . movie_facebook_likes . movie_imdb_link . movie_title . num_critic_for_reviews . num_user_for_reviews . num_voted_users . plot_keywords . title_year . 
 
Below are the 20 features of TMDB dataset: 

budget . genres . homepage . id . keywords . original_language . original_title . overview . popularity . production_companies . production_countries . release_date . revenue . runtime . spoken_languages . status . tagline . title . vote_average . vote_count .

The IMDB dataset spans about 100 years and 66 countries, and includes gross earnings. It contains 2399 unique director names and thousands of actors and actresses. The TMDB dataset also spans 101 years (1916-2017).

The aim is to find trends in the movie industry by answering the following:¶

1. Does having a popular Director increase the chance of having a successful movie? Merge the datasets and find the relation. Name the popular Directors making the highest revenue in movies.¶

2. Through a Correlation Analysis find a relation between the features in the movie set. Which are the most related features?¶

3. Perform PCA to find a trend in the movie features.¶

Methodology¶

To run the analysis in Python, a few libraries had to be loaded. For Principal Component Analysis, the data were standardised with sklearn's StandardScaler and the PCA was run with sklearn.decomposition.
For the plots, seaborn and matplotlib (with its ggplot style) were imported.

As with any dataset, the data had to be cleaned after loading. All rows containing NaNs (empty values) were removed. To find the popular directors, the two datasets were merged on the movie title, which raised an issue: because the datasets come from different databases, the titles are written differently. After troubleshooting, it was found that the IMDB movie title column had trailing white space, which had to be removed. After stripping the trailing white space from the column, the two datasets were merged with an inner join, forming a new combined dataset of 4516 movies and 48 features.

The Data Frame of the combined data is right below.

# Drop rows that contain NaN values

movie_metadata = movie_metadata.dropna()
tmdb_movies = tmdb_movies.dropna()

# Strip trailing white space from the movie_title column, otherwise the titles will not match on merge
movie_metadata['movie_title'] = movie_metadata['movie_title'].str.rstrip()

# Inner merge on movie title: only movies common to both datasets are kept
combine_df = pd.merge(movie_metadata, tmdb_movies, left_on="movie_title", right_on="original_title", how='inner')
combine_df.head()
#combine_df.shape=(4516, 48)
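
The trailing white space problem can be reproduced on a toy example. The following is a minimal sketch and not the notebook's own code: the two small frames are made up for illustration, the helper name `inner_matches` is hypothetical, and `indicator=True` is used only to show which titles fail to match.

```python
import pandas as pd

# Hypothetical toy frames; two of the IMDB-style titles carry a trailing space.
imdb = pd.DataFrame({"movie_title": ["Avatar ", "Spectre ", "Skyfall"],
                     "imdb_score": [7.9, 6.8, 7.8]})
tmdb = pd.DataFrame({"original_title": ["Avatar", "Spectre", "Frozen"],
                     "revenue": [2787965087, 880674609, 1274219009]})

def inner_matches(left, right):
    """Count the rows that survive an inner merge on the title columns."""
    merged = left.merge(right, left_on="movie_title",
                        right_on="original_title", how="inner")
    return len(merged)

before = inner_matches(imdb, tmdb)   # titles with a trailing space do not match
imdb["movie_title"] = imdb["movie_title"].str.rstrip()
after = inner_matches(imdb, tmdb)    # matches after stripping

# An outer merge with indicator=True labels each row as left_only, right_only or both,
# which helps to inspect the titles that still fail to match.
check = imdb.merge(tmdb, left_on="movie_title", right_on="original_title",
                   how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(before, after)
print(unmatched[["movie_title", "original_title", "_merge"]])
```

Inspecting the unmatched rows in this way is one option for checking whether other formatting differences between the two title columns remain after the merge.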

From this combined data frame, the names of the popular directors with their IMDB score, TMDB score, popularity and movie revenue are listed in a table. Some further exploratory data analysis can be found in the Appendix.

A cluster map and a correlation heat map are used to interpret the data. For the correlation, the data had to be cleaned again by producing a data frame with selected numerical columns only. The correlation heat map was then drawn with the correlation coefficients annotated. The clustermap arranges the matrix as a hierarchically clustered map, which allows the bigger picture to be seen in hierarchical order and shows how many clusters can be made.

For PCA, 3 features from the merged dataset were used first. The data were then standardised with StandardScaler and the PCA model was fitted, giving the PCA explained variance ratio. Second, the number of components was plotted against the cumulative explained variance. Third, a table of PCA scores was formed and shown as a scatter plot. Finally, the biplot was drawn, which combines the loadings and the PCA scores in one graph.

Results¶

Popular Directors¶

The table generated by the code below shows the directors behind the highest-revenue films:
James Cameron, Joss Whedon, Colin Trevorrow, James Wan, Chris Buck, Shane Black, Kyle Balda, Anthony Russo and Michael Bay.
Their movies generated the highest revenue of their time. The table shows the IMDB and TMDB scores, the popularity of each movie and the revenue it generated. This suggests that the success of a movie depends in part on how the director handles it from start to finish. These names are well known in the movie industry, and some of them (James Cameron and Joss Whedon) appear more than once.

# Select the desired columns, group by them and keep the 20 highest revenues
c2=combine_df.groupby(['director_name','vote_average','imdb_score' ,'popularity','movie_title'])['revenue'].mean().sort_values(ascending=False).head(20)
c2.to_frame() # convert the result to a data frame
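
The grouping above includes the movie title, so each row is a single film. If the question is about directors rather than individual films, one option is to aggregate across each director's films. This is a minimal sketch under assumptions: it uses the column names `director_name` and `revenue` from the merged frame, the small frame is only a stand-in built from a few rows of the revenue table, and the function name `revenue_by_director` is hypothetical.

```python
import pandas as pd

# Stand-in for combine_df, using a few rows from the revenue table.
toy = pd.DataFrame({
    "director_name": ["James Cameron", "James Cameron", "Joss Whedon",
                      "Joss Whedon", "Colin Trevorrow"],
    "movie_title": ["Avatar", "Titanic", "The Avengers",
                    "Avengers: Age of Ultron", "Jurassic World"],
    "revenue": [2787965087, 1845034188, 1519557910, 1405403694, 1513528810],
})

def revenue_by_director(df):
    """Summarise revenue per director: film count, total and mean revenue."""
    return (df.groupby("director_name")["revenue"]
              .agg(films="count", total_revenue="sum", mean_revenue="mean")
              .sort_values("total_revenue", ascending=False))

summary = revenue_by_director(toy)
print(summary)
```

Whether total or mean revenue is the better measure of a director's popularity is a design choice; the total favours directors with many films in the dataset, while the mean favours those with a few very successful ones.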

Correlation Analysis¶

The correlation heatmap and cluster map show the relationship between each feature. The lighter the color the stronger the correlation.

The correlation heat map is annotated with the correlation coefficients. It shows that the Facebook likes of the actors and of the total cast are highly correlated with each other, although they have little effect on revenue or gross income. The director's Facebook likes are not related to any other feature.
Budget, gross income, movie Facebook likes, number of critic reviews, number of user votes, popularity, revenue and vote count are highly correlated and have an effect on each other. It can be interpreted that the more critic reviews and user reviews a movie gets, the more popular it becomes, and the revenue most likely increases as well.

The clustermap maps the matrix in a hierarchically-clustered map. It shows there are 3 major clusters.

The IMDB score and the TMDB score (vote_average) can be clustered as one. Since they have little relation to the other features, they are clustered with the director's Facebook likes.
The Facebook likes of the first, second and third actor/actress, along with those of the total cast, are correlated with each other.
The remaining features, which include the financial measures, critic reviews, user reviews, popularity and movie Facebook likes, are correlated and can be clustered together.

# Prepare the data frame for correlation: keep numeric columns only
str_list = []  # names of the columns that hold strings
for colname, colvalue in combine_df.items():  # items() replaces iteritems(), which was removed in pandas 2.0
    if isinstance(colvalue.iloc[1], str):
        str_list.append(colname)
num_list = combine_df.columns.difference(str_list)  # the remaining columns are numeric
combine_df_new = combine_df[num_list]
#combine_df_new.head()    #shape=4516 rows × 23 columns

col4=['actor_1_facebook_likes','actor_2_facebook_likes','actor_3_facebook_likes','cast_total_facebook_likes','director_facebook_likes','budget_y','gross','imdb_score','movie_facebook_likes','num_critic_for_reviews','num_user_for_reviews','num_voted_users','popularity','revenue','vote_average','vote_count']
corr_df=combine_df_new[col4]
f,ax=plt.subplots(figsize=(12,10))
plt.title('Correlation of Movie Features')
sns.heatmap(corr_df.astype(float).corr(), linewidths=0.25, vmax=1.0,square=True, cmap="magma", linecolor="black", annot=True )
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
sns.clustermap(corr_df.corr(),cmap="magma",standard_scale=1)
plt.show()
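
Reading the strongest relationships off an annotated heat map by eye can be error-prone, so it may help to list the feature pairs by absolute correlation as well. The sketch below is an illustration under assumptions: the helper name `top_correlated_pairs` is hypothetical, and the toy frame only stands in for `corr_df`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=5):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = df.select_dtypes(include="number").corr()
    # Keep the upper triangle only, so each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

# Toy stand-in for corr_df: b is a multiple of a, c tends to fall as a rises.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "label": list("vwxyz"),  # non-numeric column, ignored by select_dtypes
})
result = top_correlated_pairs(toy)
print(result)
```

Applied to the notebook's data, a call such as `top_correlated_pairs(corr_df, n=10)` would give a ranked list to set beside the heat map.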

Principal Component Analysis¶

Principal Component Analysis was used as an unsupervised learning method to find patterns in the data. Three features were chosen: popularity, revenue and imdb_score. This gives three PCA explained variance ratios.

a1=[ 'popularity','revenue','imdb_score' ]
x=combine_df[a1]  #formed the data for pca
x.head()

T=x.values  # assign all values to T

scaler = StandardScaler()  # standardise the data (zero mean, unit variance)
T_scaled=scaler.fit_transform(T)  # fit the scaler and transform the values

pca = PCA()
T_7d=pca.fit_transform(T_scaled)
print('pca explained variance ratio =',pca.explained_variance_ratio_)

pca explained variance ratio = [0.59308083 0.26451655 0.14240262]

The line plot of cumulative explained variance against the number of components shows that the first two principal components account for about 86% of the variance (0.593 + 0.265). Hence the space was collapsed from 3 dimensions to 2 dimensions.

components = np.arange(1,4) 
plt.plot(components, np.cumsum(pca.explained_variance_ratio_))

plt.xlabel('Number of Components') 
plt.ylabel('Cumulative Explained Variance')
plt.show()
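
The cumulative share of variance can also be checked numerically from the explained variance ratios printed earlier. This is a small sketch: the helper name `components_needed` and the threshold values are assumptions chosen for illustration.

```python
import numpy as np

# Explained variance ratios as printed by the PCA fit on the three features.
ratios = np.array([0.59308083, 0.26451655, 0.14240262])
cumulative = np.cumsum(ratios)

def components_needed(cumulative_ratios, threshold):
    """Smallest number of components whose cumulative ratio reaches the threshold."""
    return int(np.searchsorted(cumulative_ratios, threshold) + 1)

print(cumulative)
print(components_needed(cumulative, 0.85))  # threshold of 85% (assumed)
print(components_needed(cumulative, 0.95))  # threshold of 95% (assumed)
```

The second entry of `cumulative` is the share of variance retained when only the first two components are kept.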

pca = PCA(n_components=2)
pca_scores = pca.fit_transform(T_scaled)  # fit the model and compute the PCA scores in one step
pca_loadings = pca.components_  # loadings: one row per principal component
# Note: the dimensionality of the space has been reduced from 3 dimensions to 2 dimensions

The points in the plot are clustered around one corner and then fan out in the opposite direction.

plt.scatter(pca_scores[:,0], pca_scores[:,1] )
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Scatter plot of Principal Component Analysis ')

Text(0.5, 1.0, 'Scatter plot of Principal Component Analysis ')

Revenue and popularity appear to be more correlated with each other. The first PC is a combination of revenue, popularity and IMDB score. The second PC is somewhat more dominated by IMDB score.

def myplot(score, coeff, labels=None):
    """Biplot: rescaled PCA scores as points and loadings as arrows."""
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    # Rescale the scores so that they fit in the same frame as the loadings
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley)
    if labels is None:
        labels = a1  # fall back to the feature names used for the PCA
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i], color='g', ha='center', va='center')
    # Limits, labels and grid are set whether or not labels were passed
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.grid(True)

myplot(pca_scores[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.xlabel("PC1") 
plt.ylabel("PC2")

Text(0, 0.5, 'PC2')
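
The arrows in a biplot can be hard to read precisely, so a table of the loadings is a useful complement. The following is a minimal sketch and not the notebook's result: the data are synthetic stand-ins generated with a fixed seed, only the column names are taken from the analysis, and "loadings" follows the notebook's usage for the rows of `pca.components_`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: revenue is built to track popularity, imdb_score is independent.
rng = np.random.default_rng(0)
n = 200
popularity = rng.normal(size=n)
toy = pd.DataFrame({
    "popularity": popularity,
    "revenue": 0.8 * popularity + 0.2 * rng.normal(size=n),
    "imdb_score": rng.normal(size=n),
})

scaled = StandardScaler().fit_transform(toy)
pca = PCA(n_components=2).fit(scaled)

# One row per feature, one column per principal component.
loadings = pd.DataFrame(pca.components_.T, index=toy.columns, columns=["PC1", "PC2"])
print(loadings.round(3))
```

Each column of the table is a unit-length direction, so the size of an entry shows how strongly that feature contributes to the component.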

Conclusion¶

In the movie industry, directors appear to play a crucial role in making a film a success. Facebook likes of actors, actresses and the total cast are correlated with one another but do not appear to influence gross income or ratings. Gross income, budget, revenue, critic reviews, user reviews and movie Facebook likes are all correlated and influence one another. The IMDB score and TMDB score do not correlate strongly with the other features, which may mean that they are largely independent of them. The acting, the director and the storyline might be the deciding factors, but this cannot be concluded from the data.

Appendix¶

The dataset from TMDB.

#Appendix 1
tmdb_movies = pd.read_csv("Dataset_all/tmdb_5000_movies.csv")
tmdb_movies.head(2)
#tmdb_movies.shape=(4803, 20)

The dataset from IMDB.

#Appendix 2

movie_metadata = pd.read_csv("Dataset_all/movie_metadata.csv")
movie_metadata.head(3)
#movie_metadata.shape=(5043, 28)

Some exploratory data analysis. This is the combined list of directors and their respective movie titles, sorted by highest IMDB score. The result is similar to the revenue ranking: several of the directors who produce blockbuster films also have the highest IMDB scores, and their work is recognised. The next table is similar, incorporating revenue, gross, IMDB and TMDB scores.

#Appendix 3

c3=['director_name','imdb_score','movie_title']
c4=combine_df[c3]
c4 = c4[(c4[['director_name']] != 0).all(axis=1)]

director_1=c4.groupby(["director_name","movie_title"])["imdb_score"].max().sort_values(ascending=False).head(15) # 15 highest IMDB scores per director and movie title
director_1

director_name         movie_title                                      
Francis Ford Coppola  The Godfather                                        9.2
Christopher Nolan     The Dark Knight                                      9.0
Peter Jackson         The Lord of the Rings: The Return of the King        8.9
Steven Spielberg      Schindler's List                                     8.9
Christopher Nolan     Inception                                            8.8
David Fincher         Fight Club                                           8.8
Peter Jackson         The Lord of the Rings: The Fellowship of the Ring    8.8
                      The Lord of the Rings: The Two Towers                8.7
Lana Wachowski        The Matrix                                           8.7
Christopher Nolan     Interstellar                                         8.6
Tony Kaye             American History X                                   8.6
Bryan Singer          The Usual Suspects                                   8.6
David Fincher         Se7en                                                8.6
Robert Zemeckis       Back to the Future                                   8.5
Frank Darabont        The Green Mile                                       8.5
Name: imdb_score, dtype: float64

# Select the desired columns, group by them and keep the 20 highest budgets
c2=combine_df.groupby(['director_name','vote_average','imdb_score' ,'popularity','movie_title'])['budget_x'].mean().sort_values(ascending=False).head(20)
c2.to_frame() # making it into dataframe

#Appendix 4

a1=['director_name','imdb_score','movie_title', 'revenue', 'gross']
a2=combine_df[a1]
a2 = a2[(a2[['director_name', 'revenue']] != 0).all(axis=1)]

director_2=a2.groupby(["director_name","movie_title",'revenue','gross'])["imdb_score"].max().sort_values(ascending=False).head(15) # 15 highest IMDB scores, with revenue and gross
director_2

director_name         movie_title                                        revenue     gross      
Francis Ford Coppola  The Godfather                                      245066411   134821952.0    9.2
Christopher Nolan     The Dark Knight                                    1004558444  533316061.0    9.0
Steven Spielberg      Schindler's List                                   321365567   96067179.0     8.9
Peter Jackson         The Lord of the Rings: The Return of the King      1118888979  377019252.0    8.9
Christopher Nolan     Inception                                          825532764   292568851.0    8.8
David Fincher         Fight Club                                         100853753   37023395.0     8.8
Peter Jackson         The Lord of the Rings: The Fellowship of the Ring  871368364   313837577.0    8.8
Lana Wachowski        The Matrix                                         463517383   171383253.0    8.7
Peter Jackson         The Lord of the Rings: The Two Towers              926287400   340478898.0    8.7
Tony Kaye             American History X                                 23875127    6712241.0      8.6
David Fincher         Se7en                                              327311859   100125340.0    8.6
Bryan Singer          The Usual Suspects                                 23341568    23272306.0     8.6
Christopher Nolan     Interstellar                                       675120017   187991439.0    8.6
                      The Dark Knight Rises                              1084939099  448130642.0    8.5
Ridley Scott          Alien                                              104931801   78900000.0     8.5
Name: imdb_score, dtype: float64

#Appendix 5
c=combine_df.groupby(['director_name','imdb_score' ])['revenue'].mean().sort_values(ascending=False).head(15)


plt.style.use('fivethirtyeight')    # choose the style and colours of the plot
c.unstack().plot.barh(figsize=(11,11))   # set the size on the plot itself; a separate plt.figure() call only creates an empty figure

plt.title("Barplot of the Director names and revenue earned by the movie with its IMDB score", fontsize=18) # title
plt.ylabel("Director names", fontsize=18)      # y-axis label
plt.xlabel("Revenue in billion", fontsize=18)           # x-axis label
plt.legend(fontsize=11,loc=0)       # place the legend in a suitable position, with font size 11
plt.xticks(fontsize=15)            # font size of the x-axis tick labels
plt.yticks(fontsize=15)                  # font size of the y-axis tick labels


plt.show()
print(c.head(15))

director_name    imdb_score
James Cameron    7.9           2.787965e+09
                 7.7           1.845034e+09
Joss Whedon      8.1           1.519558e+09
Colin Trevorrow  7.0           1.513529e+09
James Wan        7.2           1.506249e+09
Joss Whedon      7.5           1.405404e+09
Chris Buck       7.6           1.274219e+09
Shane Black      7.2           1.215440e+09
Kyle Balda       6.4           1.156731e+09
Anthony Russo    8.2           1.153304e+09
Michael Bay      6.3           1.123747e+09
Peter Jackson    8.9           1.118889e+09
Sam Mendes       7.8           1.108561e+09
Michael Bay      5.7           1.091405e+09
Lee Unkrich      8.3           1.066970e+09
Name: revenue, dtype: float64

Correlation map of the combined dataset. It shows the Facebook likes features as one similar group, and the financial variables, critic reviews, user reviews, popularity, etc. as another group.

##Appendix 7

no_id_c=combine_df_new.drop(columns=['id'])
f,ax=plt.subplots(figsize=(15,10))
plt.title('Correlation of Movie Features')
sns.heatmap(no_id_c.astype(float).corr(), linewidths=0.25, vmax=1.0, square=True, cmap="magma", linecolor="black", annot=True )
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()

Producing a data frame only with numerical value of TMDB

# filtering data output is only numerical values

str_list2 = []  # names of the columns that hold strings
for colname, colvalue in tmdb_movies.items():  # items() replaces iteritems(), which was removed in pandas 2.0
    if isinstance(colvalue.iloc[1], str):
        str_list2.append(colname)
num_list = tmdb_movies.columns.difference(str_list2)  # the remaining columns are numeric
tmdb_m_num = tmdb_movies[num_list]
tmdb_m_num #shape=4803 x 7

Correlation heat map and cluster map of the TMDB numerical dataset. Apart from runtime and vote count, all other features appear to be correlated. The clustermap, however, shows 3 clusters:

Runtime and Vote count
Budget and Revenue
Vote count and Popularity

#ap
no_id_tmdb_m_num=tmdb_m_num.drop(columns=['id'])

plt.figure(figsize=(12,10))
plt.title('Correlation of Movie Features')
ax=sns.heatmap(no_id_tmdb_m_num.astype(float).corr(), vmax=1,  cmap="magma", linecolor="black",annot=True)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
sns.clustermap(no_id_tmdb_m_num.corr(),cmap="magma",standard_scale=1)
plt.show()
#with sns.axes_style("white"):
#    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True,  cmap="YlGnBu")
#    plt.show()

	color	director_name	num_critic_for_reviews	duration	director_facebook_likes	actor_3_facebook_likes	actor_2_name	actor_1_facebook_likes	gross	genres_x	...	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	Color	James Cameron	723.0	178.0	0.0	855.0	Joel David Moore	1000.0	760505847.0	Action\|Adventure\|Fantasy\|Sci-Fi	...	[{"iso_3166_1": "US", "name": "United States o...	2009-12-10	2787965087	162.0	[{"iso_639_1": "en", "name": "English"}, {"iso...	Released	Enter the World of Pandora.	Avatar	7.2	11800
1	Color	Gore Verbinski	302.0	169.0	563.0	1000.0	Orlando Bloom	40000.0	309404152.0	Action\|Adventure\|Fantasy	...	[{"iso_3166_1": "US", "name": "United States o...	2007-05-19	961000000	169.0	[{"iso_639_1": "en", "name": "English"}]	Released	At the end of the world, the adventure begins.	Pirates of the Caribbean: At World's End	6.9	4500
2	Color	Sam Mendes	602.0	148.0	0.0	161.0	Rory Kinnear	11000.0	200074175.0	Action\|Adventure\|Thriller	...	[{"iso_3166_1": "GB", "name": "United Kingdom"...	2015-10-26	880674609	148.0	[{"iso_639_1": "fr", "name": "Fran\u00e7ais"},...	Released	A Plan No One Escapes	Spectre	6.3	4466
3	Color	Christopher Nolan	813.0	164.0	22000.0	23000.0	Christian Bale	27000.0	448130642.0	Action\|Thriller	...	[{"iso_3166_1": "US", "name": "United States o...	2012-07-16	1084939099	165.0	[{"iso_639_1": "en", "name": "English"}]	Released	The Legend Ends	The Dark Knight Rises	7.6	9106
4	Color	Andrew Stanton	462.0	132.0	475.0	530.0	Samantha Morton	640.0	73058679.0	Action\|Adventure\|Sci-Fi	...	[{"iso_3166_1": "US", "name": "United States o...	2012-03-07	284139100	132.0	[{"iso_639_1": "en", "name": "English"}]	Released	Lost in our world, found in another.	John Carter	6.1	2124

director_name	vote_average	imdb_score	popularity	movie_title	revenue
James Cameron	7.2	7.9	150.437577	Avatar	2787965087
James Cameron	7.5	7.7	100.025899	Titanic	1845034188
Joss Whedon	7.4	8.1	144.448633	The Avengers	1519557910
Colin Trevorrow	6.5	7.0	418.708552	Jurassic World	1513528810
James Wan	7.3	7.2	102.322217	Furious 7	1506249360
Joss Whedon	7.3	7.5	134.279229	Avengers: Age of Ultron	1405403694
Chris Buck	7.3	7.6	165.125366	Frozen	1274219009
Shane Black	6.8	7.2	77.682080	Iron Man 3	1215439994
Kyle Balda	6.4	6.4	875.581305	Minions	1156730962
Anthony Russo	7.1	8.2	198.372395	Captain America: Civil War	1153304495
Michael Bay	6.1	6.3	28.529607	Transformers: Dark of the Moon	1123746996
Peter Jackson	8.1	8.9	123.630332	The Lord of the Rings: The Return of the King	1118888979
Sam Mendes	6.9	7.8	93.004993	Skyfall	1108561013
Michael Bay	5.8	5.7	116.840296	Transformers: Age of Extinction	1091405097
Christopher Nolan	7.6	8.5	112.312950	The Dark Knight Rises	1084939099
Lee Unkrich	7.6	8.3	59.995418	Toy Story 3	1066969703
Gore Verbinski	7.0	7.3	145.847379	Pirates of the Caribbean: Dead Man's Chest	1065659812
Rob Marshall	6.4	6.7	135.413856	Pirates of the Caribbean: On Stranger Tides	1045713802
Tim Burton	6.4	6.5	78.530105	Alice in Wonderland	1025491110
Christopher Nolan	8.2	9.0	187.322927	The Dark Knight	1004558444

	popularity	revenue	imdb_score
0	150.437577	2787965087	7.9
1	139.082615	961000000	7.1
2	107.376788	880674609	6.8
3	112.312950	1084939099	8.5
4	43.926995	284139100	6.6

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.avatarmovie.com/	19995	[{"id": 1463, "name": "culture clash"}, {"id":...	en	Avatar	In the 22nd century, a paraplegic Marine is di...	150.437577	[{"name": "Ingenious Film Partners", "id": 289...	[{"iso_3166_1": "US", "name": "United States o...	2009-12-10	2787965087	162.0	[{"iso_639_1": "en", "name": "English"}, {"iso...	Released	Enter the World of Pandora.	Avatar	7.2	11800
1	300000000	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	http://disney.go.com/disneypictures/pirates/	285	[{"id": 270, "name": "ocean"}, {"id": 726, "na...	en	Pirates of the Caribbean: At World's End	Captain Barbossa, long believed to be dead, ha...	139.082615	[{"name": "Walt Disney Pictures", "id": 2}, {"...	[{"iso_3166_1": "US", "name": "United States o...	2007-05-19	961000000	169.0	[{"iso_639_1": "en", "name": "English"}]	Released	At the end of the world, the adventure begins.	Pirates of the Caribbean: At World's End	6.9	4500

	color	director_name	num_critic_for_reviews	duration	director_facebook_likes	actor_3_facebook_likes	actor_2_name	actor_1_facebook_likes	gross	genres	...	num_user_for_reviews	language	country	content_rating	budget	title_year	actor_2_facebook_likes	imdb_score	aspect_ratio	movie_facebook_likes
0	Color	James Cameron	723.0	178.0	0.0	855.0	Joel David Moore	1000.0	760505847.0	Action\|Adventure\|Fantasy\|Sci-Fi	...	3054.0	English	USA	PG-13	237000000.0	2009.0	936.0	7.9	1.78	33000
1	Color	Gore Verbinski	302.0	169.0	563.0	1000.0	Orlando Bloom	40000.0	309404152.0	Action\|Adventure\|Fantasy	...	1238.0	English	USA	PG-13	300000000.0	2007.0	5000.0	7.1	2.35	0
2	Color	Sam Mendes	602.0	148.0	0.0	161.0	Rory Kinnear	11000.0	200074175.0	Action\|Adventure\|Thriller	...	994.0	English	UK	PG-13	245000000.0	2015.0	393.0	6.8	2.35	85000

Analysis on Movie Databases - IMDB and TMDB

Author: Aaphsaarah Rahman
