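Analysis, Processing, and Visualization of the "Netflix Movies, TV Shows, and User Viewing Data" Dataset

Time: 2018-10-24 02:10:12

I. Finding a dataset

From Kaggle: "Netflix Movies and TV Shows" by Shivam Bansal

II. Dataset analysis

1. First, load the CSV file with the pandas module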

import pandas as pd

data = pd.read_csv('movie_data.csv')
data  # display the data

Output: a preview of the first and last five rows of `data`, 5043 rows × 11 columns. The columns are num_critic_for_reviews, duration, gross, genres, num_voted_users, num_user_for_reviews, language, country, budget, title_year, and imdb_score. Several cells are NaN. Row 4, for example, is a Documentary with 8 votes and an imdb_score of 7.1, and NaN in every other column.

2. Next, we process the IMDB score column of the dataset:

score1, score2, score3, score4 = 0, 0, 0, 0
for s in data.imdb_score:
    # Half-open bins, so a score of exactly 5 or 7 goes to the next bin up
    # instead of falling through to the last one
    if s < 5:
        score1 += 1
    elif s < 7:
        score2 += 1
    elif s < 9:
        score3 += 1
    else:
        score4 += 1
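The same bucketing can be written as a small reusable helper. The following is a minimal sketch using only the standard library, not the author's code; `bin_counts` and the sample scores are hypothetical names and values introduced for illustration. It relies on `bisect_right` so that a value equal to a boundary lands in the upper bucket.

```python
from bisect import bisect_right
import math

def bin_counts(values, edges):
    """Count values into len(edges) + 1 half-open buckets.

    Bucket 0 holds values < edges[0], bucket k holds
    edges[k-1] <= value < edges[k], and the last bucket holds
    values >= edges[-1]. NaN values are skipped.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            continue  # a missing score belongs to no bucket
        counts[bisect_right(edges, v)] += 1
    return counts

# Hypothetical sample scores, not taken from the dataset
sample_scores = [4.9, 5.0, 6.3, 7.0, 7.9, 8.5, 9.0, 9.3]
print(bin_counts(sample_scores, [5, 7, 9]))  # [1, 2, 3, 2]
```

With the DataFrame loaded, `bin_counts(data.imdb_score, [5, 7, 9])` should give the four counts in the same order as the pie-chart labels.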

Import a plotting package. Here I use matplotlib.

import matplotlib.pyplot as plt

labels = '0-5', '5-7', '7-9', '>9'  # name of each wedge
sizes = score1, score2, score3, score4  # value of each wedge
colors = 'yellowgreen', 'gold', 'lightskyblue', 'lightcoral'  # wedge colors
explode = 0, 0.1, 0, 0  # how far each wedge is pulled out from the center
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=50)
plt.axis('equal')
plt.show()

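3. Next, count the Chinese films from 1987- and plot them as a line chart

import numpy as np

y = [0] * 33  # y values: one counter per year, starting at 1987
x = np.arange(1987, 1987 + len(y))  # x values: the years matching the counters in y
for country, year in zip(data.country, data.title_year):  # walk the country and year columns together
    if country == 'China' and pd.notna(year):  # skip rows with a missing year
        m = int(year) - 1987
        if 0 <= m < len(y):  # ignore years outside the plotted range
            y[m] += 1
plt.figure(figsize=(10, 4), dpi=100)  # create the canvas
plt.plot(x, y)
plt.show()  # display the figure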
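The per-year tally can also be expressed with `collections.Counter`, which avoids manual index arithmetic. This is a minimal sketch under stated assumptions: `films_per_year` is a hypothetical helper, and the sample rows are made up for illustration rather than taken from the dataset.

```python
from collections import Counter
import math

def films_per_year(countries, years, country, first_year, n_years):
    """Return n_years counts for `country`, starting at first_year."""
    counts = Counter()
    for c, y in zip(countries, years):
        if c != country:
            continue
        if y is None or (isinstance(y, float) and math.isnan(y)):
            continue  # skip rows with a missing year
        counts[int(y)] += 1
    # A Counter returns 0 for years that never occurred
    return [counts[first_year + k] for k in range(n_years)]

# Hypothetical sample rows, not taken from the dataset
countries = ['China', 'USA', 'China', 'China', 'China']
years = [1987.0, 1987.0, 1989.0, float('nan'), 1985.0]
print(films_per_year(countries, years, 'China', 1987, 3))  # [1, 0, 1]
```

Years before `first_year` (1985 in the sample) are counted internally but never read back, so they cannot wrap around to the end of the list the way a negative list index would.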

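4. Next, count the number of films per country

import matplotlib as mpl

mpl.rcParams["font.sans-serif"] = ["SimHei"]
mpl.rcParams["axes.unicode_minus"] = False  # make Chinese text render correctly
plt.figure(figsize=(8, 6))
labels = list(data.country.dropna().unique())  # distinct countries, via pandas built-ins
fracs = []
for i in labels:
    fracs.append(data.loc[data.country == i].shape[0])
# labels holds the countries; fracs holds the number of films for each country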

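Import the packages needed to draw the world map

from pyecharts import options as opts
from pyecharts.charts import Map, Geo
import os

Draw the map

map_data = []  # a separate name, so the DataFrame `data` stays available for later steps
for index in range(len(labels)):
    country_info = [labels[index], fracs[index]]
    map_data.append(country_info)
c = (
    Map()
    .add("Global distribution of Netflix films", map_data, "world")
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
        title_opts=opts.TitleOpts(),
        # The USA count (3807) is far larger than any other country's,
        # so the scale maximum is set to 200 to keep the rest readable
        visualmap_opts=opts.VisualMapOpts(max_=200),
    )
)
c.render_notebook()  # show the chart inside Jupyter
# os.system("render.html")  # open it as HTML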
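The scale maximum of 200 above was picked by hand. One possible alternative, shown here only as a sketch, is to derive the cap from the counts themselves with a nearest-rank percentile, so that a single outlier such as the USA does not set the scale. `scale_cap` is a hypothetical helper, and every sample count except 3807 is invented for illustration.

```python
def scale_cap(counts, pct=90):
    """Nearest-rank percentile of counts, usable as a color-scale maximum."""
    if not counts:
        raise ValueError("counts must not be empty")
    ordered = sorted(counts)
    rank = -(-pct * len(ordered) // 100)  # ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]

# Hypothetical per-country counts; only 3807 (USA) comes from the text above
sample_counts = [3807, 448, 215, 130, 126, 97, 55, 30, 12, 5]
print(scale_cap(sample_counts))      # 448
print(scale_cap(sample_counts, 50))  # 97
```

The result could then be passed as the `max_` argument used in the chart code, in place of the fixed 200.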

5. Next, we compute the share of each film genre

PS: I count the genres one by one here. If there is a better way, please let me know.

(action, adventure, fantasy, sciencefiction, mystery, family, thriller, documentary,
 romance, comedy, animation, musical, western, history, drama, crime) = (0,) * 16
for i in data.genres:
    # A film can carry several genres, so every test is an independent `if`
    if "Action" in i:
        action += 1
    if "Adventure" in i:
        adventure += 1
    if "Fantasy" in i:
        fantasy += 1
    if "Sci-Fi" in i:
        sciencefiction += 1
    if "Mystery" in i:
        mystery += 1
    if "Family" in i:
        family += 1
    if "Thriller" in i:
        thriller += 1
    if "Documentary" in i:
        documentary += 1
    if "Romance" in i:
        romance += 1
    if "Comedy" in i:
        comedy += 1
    if "Animation" in i:
        animation += 1
    if "Musical" in i:
        musical += 1
    if "Western" in i:
        western += 1
    if "History" in i:
        history += 1
    if "Drama" in i:
        drama += 1
    if "Crime" in i:
        crime += 1
print(action, adventure, fantasy, sciencefiction, mystery, family, thriller, documentary,
      romance, comedy, animation, musical, western, history, drama, crime)

Output: 1153 923 610 616 500 546 1411 121 1107 1872 242 132 97 207 2594 889
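As one possible answer to the question above, the sixteen counters can be replaced by a single `collections.Counter` that splits each genre string on `|`. This is a minimal standard-library sketch, not the author's method; `count_genres` is a hypothetical helper and the sample list is a small stand-in for `data.genres`.

```python
from collections import Counter

def count_genres(genre_strings, sep='|'):
    """Count how many entries mention each genre."""
    counts = Counter()
    for entry in genre_strings:
        if not isinstance(entry, str):
            continue  # skip missing values such as NaN
        counts.update(g.strip() for g in entry.split(sep) if g.strip())
    return counts

# Small stand-in for data.genres
sample = ['Action|Adventure|Fantasy|Sci-Fi', 'Action|Thriller',
          'Documentary', float('nan')]
counts = count_genres(sample)
for genre, n in counts.most_common():
    print(genre, n)
```

Unlike the substring tests, this matches whole genre names and picks up every genre present in the column without listing them in advance. Calling `count_genres(data.genres)` should work on the real column, since iterating a pandas Series yields its values.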

print("Pie chart of film genres")  # draw the pie chart, same approach as before
labels = ('action', 'adventure', 'fantasy', 'sciencefiction', 'mystery', 'family',
          'thriller', 'documentary', 'romance', 'comedy', 'animation', 'musical',
          'western', 'history', 'drama', 'crime')
sizes = (action, adventure, fantasy, sciencefiction, mystery, family, thriller,
         documentary, romance, comedy, animation, musical, western, history, drama, crime)
colors = ('yellowgreen', 'gold', 'lightskyblue', 'lightcoral') * 4  # four colors, repeated
explode = (0,) * 16  # no wedge is pulled out
plt.pie(sizes, radius=2.5, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=50)
plt.show()

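5. Next, we look at the budgets of films on Netflix alongside their total gross and audience approval

data = data.loc[data.budget.notnull()]  # drop the rows whose budget is missing

# Select rows 100-124 by position: after the filter above, some index labels
# no longer exist, so label-based lookups such as data.budget[i] can fail
y1 = data.budget.iloc[100:125].tolist()
y2 = data.gross.iloc[100:125].tolist()
x = np.arange(25)  # x values of the line chart
plt.figure(figsize=(8, 4))  # create the canvas
plt.plot(x, y1, '.-', label='Budget / investment')  # first line
plt.plot(x, y2, '.-', label='Gross / revenue')  # second line
plt.legend()
plt.xlabel('Film index')
plt.ylabel('USD (billions)')
plt.ylim((0, 1000000000))
plt.title('Budget and total gross of Netflix films')
plt.show()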
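The filter above removes rows with a missing budget, but `gross` can still be NaN in the remaining rows. A minimal sketch of keeping only complete pairs and relating gross to budget follows. `complete_pairs` and `gross_to_budget` are hypothetical helpers, and the sample values are the budget and gross figures visible in the data preview from step 1.

```python
import math

def is_missing(v):
    """True for None and for float NaN."""
    return v is None or (isinstance(v, float) and math.isnan(v))

def complete_pairs(budgets, grosses):
    """Keep only the (budget, gross) pairs where both values are present."""
    return [(b, g) for b, g in zip(budgets, grosses)
            if not is_missing(b) and not is_missing(g)]

def gross_to_budget(pairs):
    """Gross divided by budget for each pair with a positive budget."""
    return [g / b for b, g in pairs if b > 0]

# Values as shown in the preview of the dataset (rows 0, 1, 5040, 5041)
budgets = [237000000.0, 300000000.0, 1400.0, float('nan')]
grosses = [760505847.0, 309404152.0, float('nan'), 10443.0]

pairs = complete_pairs(budgets, grosses)
ratios = gross_to_budget(pairs)
print(len(pairs), [round(r, 2) for r in ratios])
```

A ratio above 1 means the film grossed more than its budget. This is only one rough reading of the two columns and says nothing about audience approval by itself.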

6. Finally, we analyze the audience like (vote) counts in the dataset

① Pie chart of the distribution of audience vote counts over the whole dataset

data = pd.read_csv('movie_metadata.csv')
score1, score2, score3, score4, score5 = 0, 0, 0, 0, 0
for v in data.num_voted_users:
    # Half-open bins, so counts of exactly 2000, 10000, 20000, or 50000 are not dropped
    if v < 2000:
        score1 += 1
    elif v < 10000:
        score2 += 1
    elif v < 20000:
        score3 += 1
    elif v < 50000:
        score4 += 1
    else:
        score5 += 1
labels1 = 'Under 2k', '2k-10k', '10k-20k', '20k-50k', '50k and over'
sizes = score1, score2, score3, score4, score5
colors = 'yellowgreen', 'gold', 'lightskyblue', 'lightcoral', 'gold'
explode = 0, 0, 0, 0, 0
plt.pie(sizes, explode=explode, labels=labels1, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=50)
plt.axis('equal')
plt.title('Pie chart of audience likes')
plt.show()

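② Randomly draw 100 records from the dataset and make a scatter plot

import random

indices, d2 = [], []  # named indices rather than list, to avoid shadowing the built-in
for i in range(100):
    indices.append(random.randint(1, 4551))
for i in indices:
    d2.append(data.num_user_for_reviews[i])
d1 = np.random.randn(100)  # random x positions, used only to spread the points out
plt.scatter(d1, d2)
plt.title("Scatter plot of audience likes / vote counts")
plt.show()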
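Drawing indices with `random.randint` can pick the same row more than once, and with the bounds 1 and 4551 it never picks row 0 or any row after 4551. A minimal sketch of sampling without replacement over the full range follows. `sample_indices` is a hypothetical helper, and the seed is there only to make the draw repeatable.

```python
import random

def sample_indices(n_rows, k, seed=None):
    """Return k distinct row positions drawn from range(n_rows)."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    return rng.sample(range(n_rows), k)

idx = sample_indices(5043, 100, seed=0)
print(len(idx), len(set(idx)))  # 100 100
```

With the DataFrame loaded, the sampled values could then be taken by position, for example `data.num_user_for_reviews.iloc[idx]`.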

This completes the analysis of the dataset.
