Data Analysis (NumPy, Pandas, Matplotlib): Common APIs

Date: 2019-01-18 07:10:19

Numpy

Pandas

Series

DataFrame

Matplotlib

Plotting with Series and DataFrame

seaborn

Scipy

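NumPy:

np.array(any structure that can be interpreted as a NumPy array)

np.arange(start (default 0), stop, step (default 1))

np.zeros(number of elements or a shape tuple, dtype='type')

np.ones(number of elements or a shape tuple, dtype='type')

Array dimensions: np.ndarray.shape

Element type: np.ndarray.dtype

Number of elements: np.ndarray.size

Convert the element type of ary: b = ary.astype(float)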

Custom compound dtype:

data = [
    ('zs', [90, 80, 85], 15),
    ('ls', [92, 81, 83], 16),
    ('ww', [95, 85, 95], 15)
]

b = np.array(data, dtype=[('name', 'str_', 2),
                          ('scores', 'int32', 3),
                          ('ages', 'int32', 1)])

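View reshaping (data shared with the original): reshape() and ravel()  # ravel() flattens to a 1-D array

Copy reshaping (independent data): flatten()

In-place reshaping, which changes the dimensions of the original array directly and returns no new array: a.resize(2, 2, 2)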
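The difference between a view and a copy is easiest to see by writing through each result. The following is a minimal sketch; the array names are illustrative, and `refcheck=False` is passed to `resize` only so that the call also works in environments that hold extra references to the array.

```python
import numpy as np

a = np.arange(8)

view = a.reshape(2, 4)     # view: shares memory with a
view[0, 0] = 99            # writing through the view also changes a[0]

flat_view = view.ravel()   # ravel() returns a view here because the data is contiguous

copy = a.flatten()         # flatten() always returns an independent copy
copy[1] = -1               # a[1] is not affected

d = np.arange(8)
d.resize((2, 2, 2), refcheck=False)  # in place: d itself now has shape (2, 2, 2)

print(a[0], a[1], d.shape)
```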

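Combining and splitting multidimensional arrays:

# Stack vertically, producing a new array
c = np.vstack((a, b))

# Split vertically into two arrays
d, e = np.vsplit(c, 2)

# Stack horizontally, producing a new array
c = np.hstack((a, b))

# Split horizontally into two arrays
d, e = np.hsplit(c, 2)

# Stack along the depth (third) axis, producing a new array
i = np.dstack((a, b))

# Split along the depth (third) axis into two arrays
k, l = np.dsplit(i, 2)

# Stack two arrays as two rows (row_stack is an alias of vstack, deprecated since NumPy 2.0)
c = np.row_stack((a, b))

# Stack two arrays as two columns
d = np.column_stack((a, b))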
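To make the resulting shapes concrete, here is a small sketch with two illustrative 2x3 arrays; the variable names are not from the original notes.

```python
import numpy as np

a = np.arange(1, 7).reshape(2, 3)   # [[1 2 3], [4 5 6]]
b = a * 10

v = np.vstack((a, b))               # shape (4, 3)
top, bottom = np.vsplit(v, 2)       # two (2, 3) arrays

h = np.hstack((a, b))               # shape (2, 6)
left, right = np.hsplit(h, 2)       # two (2, 3) arrays

deep = np.dstack((a, b))            # shape (2, 3, 2)
front, back = np.dsplit(deep, 2)    # two (2, 3, 1) arrays

# column_stack turns 1-D arrays into the columns of a 2-D array
cols = np.column_stack((np.array([1, 2, 3]), np.array([4, 5, 6])))  # shape (3, 2)

print(v.shape, h.shape, deep.shape, cols.shape)
```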

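Loading a file: np.loadtxt(
    '../aapl.csv',         # file path
    delimiter=',',         # delimiter
    usecols=(1, 3),        # read columns 1 and 3 (indices start at 0)
    unpack=False,          # whether to unpack by column
    dtype='U10, f8',       # element type of each returned column
    converters={1: func}   # dict mapping a column index to a converter function
)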
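A runnable sketch of the call above, under the assumption that the file looks like a simple comma-separated price table. The CSV content here is made up and read from an in-memory buffer instead of '../aapl.csv', and no converter is used.

```python
import io
import numpy as np

# Hypothetical CSV content standing in for '../aapl.csv'
csv_text = (
    "AAPL,28-01-2011,344.17,336.10\n"
    "AAPL,31-01-2011,335.80,339.32\n"
    "AAPL,01-02-2011,341.30,345.03\n"
)

# unpack=True with a structured dtype returns one array per selected column
dates, closes = np.loadtxt(
    io.StringIO(csv_text),
    delimiter=',',
    usecols=(1, 3),      # date column and closing-price column
    unpack=True,
    dtype='U10, f8',     # 10-character string, 8-byte float
)

print(dates, closes)
```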

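Arithmetic mean: np.mean(array)

Weighted average: np.average(closing_prices, weights=volumes)

Maximum / minimum / range (peak to peak) of an array: np.max() np.min() np.ptp()

Generate 9 random integers in the interval [10, 100): np.random.randint(10, 100, 9)

Index of the largest / smallest element of an array: np.argmax() np.argmin()

Build a new array from the element-wise maximum / minimum of two arrays of the same shape: np.maximum() np.minimum()

Median: np.median(closing_prices)

Population standard deviation: np.std(closing_prices)

Sample standard deviation: np.std(closing_prices, ddof=1)

Representable range (max, min) of np.int16: np.iinfo(np.int16)

Convolution: numpy.convolve(a (array to convolve), b (kernel), mode ('valid', 'full' or 'same'))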
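One common use of convolution in this kind of analysis is a simple moving average. The sketch below uses made-up prices and a 3-point kernel to show how the three modes differ in output length.

```python
import numpy as np

prices = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.ones(3) / 3                          # 3-point averaging kernel

sma_valid = np.convolve(prices, kernel, 'valid')  # only fully overlapping windows
sma_same = np.convolve(prices, kernel, 'same')    # same length as prices
sma_full = np.convolve(prices, kernel, 'full')    # length len(prices) + len(kernel) - 1

print(sma_valid, len(sma_same), len(sma_full))
```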

Least-squares solution of Ax = B: x = np.linalg.lstsq(A, B)[0]

Linear fit:

A = np.column_stack((x_data, np.ones_like(x_data)))

x = np.linalg.lstsq(A, y_data)[0]

trend_line = x_data * x[0] + x[1]  (x[0] is the slope k, x[1] is the intercept b)
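The linear fit can be checked on data whose slope and intercept are known in advance. This is a minimal sketch with synthetic points on the line y = 2.5x + 1; `rcond=None` is passed only to select NumPy's current default and avoid a warning in older versions.

```python
import numpy as np

x_data = np.arange(10, dtype=float)
y_data = 2.5 * x_data + 1.0                 # synthetic points on a known line

# Each row of A is [x, 1], so the unknowns are the slope k and the intercept b
A = np.column_stack((x_data, np.ones_like(x_data)))
k, b = np.linalg.lstsq(A, y_data, rcond=None)[0]

trend_line = x_data * k + b
print(k, b)
```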

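Covariance: np.cov() (normalizes by N-1 by default), np.mean(dev_a*dev_b) (the mean of the product of the deviations, i.e. the population covariance)

Variance: np.var()

Correlation coefficient: cov_ab/(np.std(a)*np.std(b)) (the covariance divided by the product of the standard deviations)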
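As a sanity check, the hand-computed covariance and correlation coefficient can be compared against `np.cov` and `np.corrcoef`. The two sample arrays below are made up for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

dev_a = a - a.mean()                 # deviations from the mean
dev_b = b - b.mean()

cov_ab = np.mean(dev_a * dev_b)      # population covariance (divides by N)
corr = cov_ab / (np.std(a) * np.std(b))

# np.cov divides by N-1 unless ddof=0 is given; both return 2x2 matrices
cov_matrix = np.cov(a, b, ddof=0)
corr_matrix = np.corrcoef(a, b)

print(cov_ab, corr)
```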

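Pandas:

Series:

map(): accepts a function (such as a lambda) or a dict-like object that holds a mapping

Create an empty Series: s = pd.Series()

Create a Series from an ndarray: data = np.array(['a','b','c','d'])

s = pd.Series(data)

s = pd.Series(data,index=[100,101,102,103])

Create a Series from a dict: data = {'a': 0., 'b': 1., 'c': 2.}

s = pd.Series(data)

Create a Series from a scalar: s = pd.Series(5, index=[0, 1, 2, 3])

Access a Series: by position or by label

Convert to the datetime type: pd.to_datetime(dates)

Get the difference in days as numbers: delta.dt.days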
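A short sketch tying the Series operations together: creation from a dict, access by label and by position, and `map()` with both a dict and a lambda. The values are illustrative only.

```python
import pandas as pd

s = pd.Series({'a': 0.0, 'b': 1.0, 'c': 2.0})

by_label = s['b']            # access by label
by_position = s.iloc[1]      # access by position

letters = pd.Series(['a', 'b', 'c'])
mapped = letters.map({'a': 1, 'b': 2, 'c': 3})   # dict-based mapping
doubled = s.map(lambda v: v * 2)                 # function-based mapping

scalar = pd.Series(5, index=[0, 1, 2, 3])        # the scalar is repeated for every index entry

print(by_label, mapped.tolist(), doubled.tolist())
```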

Date-related attributes:

Series.dt.year The year of the datetime.

Series.dt.month The month as January=1, December=12.

Series.dt.day The days of the datetime.

Series.dt.hour The hours of the datetime.

Series.dt.minute The minutes of the datetime.

Series.dt.second The seconds of the datetime.

Series.dt.microsecond The microseconds of the datetime.

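Series.dt.week The week ordinal of the year (removed in pandas 2.0; use Series.dt.isocalendar().week).

Series.dt.weekofyear The week ordinal of the year (removed in pandas 2.0; use Series.dt.isocalendar().week).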

Series.dt.dayofweek The day of the week with Monday=0, Sunday=6.

Series.dt.weekday The day of the week with Monday=0, Sunday=6.

Series.dt.dayofyear The ordinal day of the year.

Series.dt.quarter The quarter of the date.

Series.dt.is_month_start Indicates whether the date is the first day of the month.

Series.dt.is_month_end Indicates whether the date is the last day of the month.

Series.dt.is_quarter_start Indicator for whether the date is the first day of a quarter.

Series.dt.is_quarter_end Indicator for whether the date is the last day of a quarter.

Series.dt.is_year_start Indicate whether the date is the first day of a year.

Series.dt.is_year_end Indicate whether the date is the last day of the year.

Series.dt.is_leap_year Boolean indicator if the date belongs to a leap year.

Series.dt.days_in_month The number of days in the month.

Create a date sequence: pd.date_range('/08/21', periods=5,freq='M')

Build a time series over an interval: pd.date_range(start, end) (pd.bdate_range excludes Saturdays and Sundays)
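The date helpers can be tried on a short daily range. The dates below are arbitrary, and a daily frequency is used in place of the monthly one above to keep the sketch independent of frequency-alias changes between pandas versions.

```python
import pandas as pd

dates = pd.date_range('2019-01-18', periods=5, freq='D')
s = pd.Series(dates)

weekdays = s.dt.dayofweek            # Monday=0 ... Sunday=6
delta = s - s.iloc[0]                # Series of timedeltas
days = delta.dt.days                 # elapsed days as integers

# bdate_range keeps business days only (no Saturdays or Sundays)
business = pd.bdate_range('2019-01-18', '2019-01-22')

print(weekdays.tolist(), days.tolist(), len(business))
```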

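DataFrame:

Create an empty DataFrame: df = pd.DataFrame()

Create a DataFrame from a list: data = [1,2,3,4,5]

df = pd.DataFrame(data)

Create a DataFrame from a dict: data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28,34,29,42]}

df = pd.DataFrame(data, index=['s1','s2','s3','s4'])

df.memory_usage(index=True, deep=False)

index: whether to include the memory usage of the DataFrame's index in the returned Series. With index=True, the index's memory usage is the first item of the output.

deep: if True, inspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.

Returns: a Series whose index is the original column names and whose values are the memory usage of each column in bytes

Return a boolean DataFrame marking missing values: df.isna()

Column access: by column label

Column addition: create a new column label and assign data to it

Column deletion: del(df['one']), df.pop('two')

Number of columns: df.shape[1]

Show column information: df.info()

Row access: df.loc['row_label'] returns a Series whose index is the original column names; df.iloc[[2,3]] accesses the rows at positions 2 and 3

Row addition: df.append(df2) (removed in pandas 2.0; use pd.concat([df, df2]) there)

Row deletion: df.drop(0)

Number of rows: df.shape[0]

Access a single element: df.iloc[2,3] (the element at row position 2, column position 3), df['column_label']['row_label']

Modify data: locate the element and assign to it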
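The row and column operations above can be walked through on the small name/age table. This is a sketch only: the extra 'Score' column and the appended row are invented, and `pd.concat` is used for row addition so that it also runs on pandas 2.x.

```python
import pandas as pd

data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])

df['Score'] = [90, 85, 70, 95]       # column addition (hypothetical column)
row = df.loc['s2']                   # row by label: a Series indexed by column names
two_rows = df.iloc[[2, 3]]           # rows at positions 2 and 3
popped = df.pop('Score')             # column deletion, returns the removed column

extra = pd.DataFrame({'Name': ['Anna'], 'Age': [31]}, index=['s5'])
df2 = pd.concat([df, extra])         # row addition
df3 = df2.drop('s1')                 # row deletion by label

element = df3.iloc[0, 1]             # row position 0, column position 1
df3.loc['s2', 'Age'] = 35            # modify a single element

print(df3)
```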

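Common DataFrame attributes:

| No. | Attribute or method | Description |
| ---- | ---------- | ----------------------------------- |
| 1 | `axes` | Returns the list of row/column axis labels (index). |
| 2 | `dtypes` | Returns the data type of each column (`dtype` for a Series). |
| 3 | `empty` | Returns `True` if the object is empty. |
| 4 | `ndim` | Returns the number of dimensions of the underlying data: `1` for a Series, `2` for a DataFrame. |
| 5 | `size` | Returns the number of elements in the underlying data. |
| 6 | `values` | Returns the data as an `ndarray`. |
| 7 | `head()` | Returns the first `n` rows. |
| 8 | `tail()` | Returns the last `n` rows. |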

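Statistical functions:

| No. | Function | Description |
| ---- | ---------- | ----------------------------------- |
| 1 | `count()` | Number of non-null observations |
| 2 | `sum()` | Sum of all values |
| 3 | `mean()` | Mean of all values |
| 4 | `median()` | Median of all values |
| 5 | `std()` | Standard deviation of the values |
| 6 | `min()` | Minimum of all values |
| 7 | `max()` | Maximum of all values |
| 8 | `abs()` | Absolute value |
| 9 | `prod()` | Product of the elements |
| 10 | `cumsum()` | Cumulative sum |
| 11 | `cumprod()` | Cumulative product |
| 12 | `describe(include=['object', 'number'])` | Summary statistics in one call, here for both object and numeric columns |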

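Sort by row/column labels (ascending/descending): sort_index(axis, ascending=True/False)

Sort by the values of a column (by) (ascending/descending): sort_values(by='Age', ascending=True/False)

Group by a field: grouped = df.groupby('Year')

Get one specific group: res = grouped.get_group(name)

Aggregate: grouped['Points'].agg([np.sum, np.mean, np.std])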
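Sorting, grouping and aggregation fit together as in the sketch below. The 'Team' column and all values are invented; aggregation functions are given by name ('sum', 'mean', 'std'), which is equivalent here and avoids warnings in recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2014, 2015, 2014, 2015],
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],   # hypothetical data
    'Points': [876, 789, 863, 673],
})

by_points = df.sort_values(by='Points', ascending=False)

grouped = df.groupby('Year')
group_2014 = grouped.get_group(2014)                     # rows with Year == 2014
stats = grouped['Points'].agg(['sum', 'mean', 'std'])    # one row per year

print(stats)
```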

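Merge two DataFrames: pd.merge(d1, d2, how='left')  (how selects the merge method)

| Merge method | SQL equivalent | Description |
| -------- | ------------------ | ---------------- |
| `left` | `LEFT OUTER JOIN` | Use the keys of the left object |
| `right` | `RIGHT OUTER JOIN` | Use the keys of the right object |
| `outer` | `FULL OUTER JOIN` | Use the union of the keys |
| `inner` | `INNER JOIN` | Use the intersection of the keys |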
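The four merge methods differ only in which keys survive, which a pair of tiny tables makes visible. The column names and values below are illustrative.

```python
import pandas as pd

d1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Tom', 'Jack', 'Steve']})
d2 = pd.DataFrame({'id': [2, 3, 4], 'score': [80, 90, 70]})

left = pd.merge(d1, d2, how='left')     # keys 1, 2, 3 (score is NaN for id 1)
right = pd.merge(d1, d2, how='right')   # keys 2, 3, 4 (name is NaN for id 4)
outer = pd.merge(d1, d2, how='outer')   # keys 1, 2, 3, 4
inner = pd.merge(d1, d2, how='inner')   # keys 2, 3

print(len(left), len(right), len(outer), len(inner))
```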

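Pivot table: data.pivot_table(index=['class_id', 'gender'], values=['score'],
columns=['age'], margins=True, aggfunc='max')

(Rows are class_id and gender, columns are age, the score values are aggregated, row and column totals are added by margins=True, and the aggregation method is max)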
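The pivot table call can be tried on a handful of rows. The data below is made up; only the column names follow the call above.

```python
import pandas as pd

data = pd.DataFrame({
    'class_id': ['c1', 'c1', 'c1', 'c2', 'c2', 'c2'],
    'gender':   ['M',  'M',  'F',  'M',  'F',  'F'],
    'age':      [15,   15,   16,   15,   16,   16],
    'score':    [90,   80,   85,   70,   95,   88],
})

table = data.pivot_table(
    index=['class_id', 'gender'],   # row labels
    values=['score'],               # values to aggregate
    columns=['age'],                # column labels
    margins=True,                   # add the 'All' row and column
    aggfunc='max',
)

# Columns form a two-level index: ('score', age) plus ('score', 'All')
best_c1_boys_15 = table.loc[('c1', 'M'), ('score', 15)]
overall_best = table[('score', 'All')].max()

print(table)
```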

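Matplotlib:

mp.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese characters

mp.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly

Draw a line plot: mp.plot(xarray, yarray, linestyle='', linewidth=1, color='', alpha=0.5, label='')

Draw vertical lines: mp.vlines(vval, ymin, ymax, ...)

Draw horizontal lines: mp.hlines(yval, xmin, xmax, ...)

Show the figure: mp.show()

Remove the frame: mp.box(False)

Add text: mp.text(x, y, content, transform=ax.transAxes)  (transform=ax.transAxes keeps the text at a fixed relative position, for example during an animation)

Set the axis ranges: mp.xlim(x_limit_min, x_limit_max), mp.ylim(y_limit_min, y_limit_max)

Set the axis ticks: mp.xticks(x_val_list, x_text_list), mp.yticks(y_val_list, y_text_list)

Get the current axes: ax = mp.gca()

Get one of its spines: axis = ax.spines['name']  (one of 'left', 'right', 'top', 'bottom')

Set the position of the spine: axis.set_position((type (usually 'data'), val))

Set the color of the spine: axis.set_color(color)

Show the legend: mp.legend()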
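The basic plotting calls combine as in the sketch below. It assumes `mp` is `matplotlib.pyplot`, which the calls in these notes suggest but never state; the Agg backend and the in-memory PNG are used only so the example runs without a display.

```python
import io

import matplotlib
matplotlib.use('Agg')               # non-interactive backend, no window needed
import matplotlib.pyplot as mp      # assumed meaning of the mp alias
import numpy as np

x = np.linspace(-np.pi, np.pi, 200)

mp.figure('Line plot', figsize=(4, 3))
mp.plot(x, np.sin(x), linestyle='-', linewidth=1,
        color='dodgerblue', alpha=0.8, label='sin(x)')
mp.vlines(0, -1, 1, colors='gray')             # vertical line at x = 0
mp.hlines(0, -np.pi, np.pi, colors='gray')     # horizontal line at y = 0
mp.xlim(-np.pi, np.pi)
mp.xticks([-np.pi, 0, np.pi], [r'$-\pi$', '0', r'$\pi$'])

ax = mp.gca()
ax.spines['top'].set_color('none')             # hide the top and right spines
ax.spines['right'].set_color('none')
ax.spines['left'].set_position(('data', 0))    # move the remaining spines to the origin
ax.spines['bottom'].set_position(('data', 0))

mp.legend()

buf = io.BytesIO()
mp.savefig(buf, format='png')                  # render to memory instead of mp.show()
```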

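Special points: mp.scatter(xarray, yarray,
    marker='',      # marker style, see matplotlib.markers
    s='',           # size
    edgecolor='',   # edge color
    facecolor='',   # fill color
    zorder=3        # drawing layer (the larger the number, the higher the layer)
)

Annotation: mp.annotate(
    r'$\frac{\pi}{2}$',          # text shown in the annotation
    xycoords='data',             # coordinate system of the annotated point ('data' means data coordinates)
    xy=(x, y),                   # coordinates of the annotated point
    textcoords='offset points',  # coordinate system of the annotation text ('offset points' means an offset from the annotated point)
    xytext=(x, y),               # coordinates of the annotation text
    fontsize=14,                 # font size of the annotation text
    arrowprops=dict()            # dict defining the style of the arrow from the text to the annotated point
)

Create a Matplotlib window manually: mp.figure(
    '',              # window title bar text
    figsize=(4, 3),  # window size (tuple)
    dpi=120,         # pixel density
    facecolor=''     # figure background color
)

Set the chart title, shown above the chart: mp.title(title, fontsize=12)

Set the label of the horizontal axis: mp.xlabel(x_label_str, fontsize=12)

Set the label of the vertical axis: mp.ylabel(y_label_str, fontsize=12)

Set the tick parameters: mp.tick_params(..., labelsize=8, ...)

Set the grid lines: mp.grid(linestyle='')

Use a tight layout so that all chart elements fit inside the window: mp.tight_layout()

Matrix layout: mp.subplot(rows, cols, num)

fig is the figure (width a, height b) and ax is the axes: fig, ax = mp.subplots(rows, cols, figsize=(a, b))

Grid layout: import matplotlib.gridspec as mg

Split: gs = mg.GridSpec(rows, cols)

Merge: mp.subplot(gs[0, :2])

Free layout: mp.axes([left_bottom_x, left_bottom_y, width, height])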

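Tick locators: ax = mp.gca()

Set the major tick locator of the horizontal axis

ax.xaxis.set_major_locator(mp.NullLocator())

Set the minor tick locator of the horizontal axis to a multiple locator with spacing 0.1

ax.xaxis.set_minor_locator(mp.MultipleLocator(0.1))

# Null locator: draws no ticks
mp.NullLocator()

# Max-N locator: draws at most nbins+1 ticks
mp.MaxNLocator(nbins=3)

# Fixed locator: draws ticks at the positions given in locs
mp.FixedLocator(locs=[0, 2.5, 5, 7.5, 10])

# Auto locator: the tick positions are chosen automatically
mp.AutoLocator()

# Index locator: offset sets the first tick, base sets the spacing between adjacent ticks
mp.IndexLocator(offset=0.5, base=1.5)

# Multiple locator: starting from 0, draws ticks at the given interval (default 1)
mp.MultipleLocator()

# Linear locator: divides the range into numticks-1 equal parts and draws numticks ticks
mp.LinearLocator(numticks=21)

# Log locator: draws ticks at powers of base
mp.LogLocator(base=2)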
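A minimal sketch of attaching locators to an axis, again assuming `mp` is `matplotlib.pyplot`. The figure size and axis range are arbitrary choices for the example.

```python
import matplotlib
matplotlib.use('Agg')               # non-interactive backend, no window needed
import matplotlib.pyplot as mp      # assumed meaning of the mp alias

fig, ax = mp.subplots(figsize=(6, 1))
ax.set_xlim(0, 10)

ax.yaxis.set_major_locator(mp.NullLocator())                       # no ticks on the y axis
ax.xaxis.set_major_locator(mp.FixedLocator([0, 2.5, 5, 7.5, 10]))  # ticks at fixed positions
ax.xaxis.set_minor_locator(mp.MultipleLocator(0.5))                # minor ticks every 0.5

fig.canvas.draw()

major_x = list(ax.xaxis.get_majorticklocs())
major_y = list(ax.yaxis.get_majorticklocs())
print(major_x, major_y)
```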

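Draw tick grid lines: ax.grid(
    which='',      # 'major' / 'minor' / 'both': major ticks, minor ticks, or both
    axis='',       # 'x' / 'y' / 'both': which axis to draw grid lines for
    linewidth=1,   # line width
    linestyle='',  # line style
    color='',      # color
    alpha=0.5      # opacity
)

Semi-log plot in which the y axis uses a logarithmic scale: mp.semilogy(y)

Fill the closed region between two curves with a color: mp.fill_between(
    x,             # horizontal coordinates
    sin_x,         # vertical coordinates of the points on the lower boundary curve
    cos_x,         # vertical coordinates of the points on the upper boundary curve
    sin_x<cos_x,   # fill condition, filled where True
    color='',      # fill color
    alpha=0.2      # opacity
)

Draw a bar chart: mp.bar(
    x,             # array of horizontal positions
    y,             # array of bar heights
    width,         # bar width
    color='',      # fill color
    label='',      # legend label
    alpha=0.2      # opacity
)

Draw a pie chart: mp.pie(
    values,         # list of values
    spaces,         # list of offsets between the wedges
    labels,         # list of labels
    colors,         # list of colors
    '%d%%',         # format of the percentage labels
    shadow=True,    # whether to draw a shadow
    startangle=90,  # start angle when the pie is drawn counterclockwise
    radius=1        # radius
)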

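Draw contours:

mp.contourf(x, y, z, 8, cmap='jet')

cntr = mp.contour(
    x,               # x coordinates of the grid matrix (2-D array)
    y,               # y coordinates of the grid matrix (2-D array)
    z,               # z coordinates of the grid matrix (2-D array)
    8,               # divide the contours into 8 levels
    colors='black',  # contour line color
    linewidths=0.5   # line width
)

Generate the grid coordinate matrices: x, y = np.meshgrid(np.linspace(-3, 3, n),
np.linspace(-3, 3, n))

Add height labels to the contour plot: mp.clabel(cntr, inline_spacing=1, fmt='%.1f',
fontsize=10)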
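The contour calls need a 2-D height field to work on. The sketch below assumes `mp` is `matplotlib.pyplot`; the function used for z is an arbitrary smooth surface chosen for illustration and is not from the original notes.

```python
import matplotlib
matplotlib.use('Agg')               # non-interactive backend, no window needed
import matplotlib.pyplot as mp      # assumed meaning of the mp alias
import numpy as np

n = 100
x, y = np.meshgrid(np.linspace(-3, 3, n),
                   np.linspace(-3, 3, n))     # two (n, n) coordinate matrices

# Arbitrary smooth surface used as the height field
z = (1 - x / 2 + x ** 5 + y ** 3) * np.exp(-x ** 2 - y ** 2)

mp.figure('Contour', figsize=(4, 3))
mp.contourf(x, y, z, 8, cmap='jet')                             # filled contours
cntr = mp.contour(x, y, z, 8, colors='black', linewidths=0.5)   # contour lines
labels = mp.clabel(cntr, inline_spacing=1, fmt='%.1f', fontsize=10)
```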

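Draw a heat map image: mp.imshow(z, cmap='jet', origin='lower')

3D plotting:

from mpl_toolkits.mplot3d import axes3d

ax3d = mp.gca(projection='3d')  # Matplotlib 3.6 and later no longer accept this; use mp.figure().add_subplot(projection='3d')

Draw a 3D scatter plot: ax3d.scatter(
    x,             # array of x coordinates
    y,             # array of y coordinates
    z,             # array of z coordinates
    marker='',     # marker style
    s=10,          # size
    zorder='',     # layer number
    color='',      # color
    edgecolor='',  # edge color
    facecolor='',  # fill color
    c=v,           # color values, mapped to colors through cmap
    cmap=''        # color map
)

Draw a 3D surface: ax3d.plot_surface(
    x,             # x coordinates of the grid matrix (2-D array)
    y,             # y coordinates of the grid matrix (2-D array)
    z,             # z coordinates of the grid matrix (2-D array)
    rstride=30,    # row stride
    cstride=30,    # column stride
    cmap='jet'     # color map
)

Draw a 3D wireframe: ax3d.plot_wireframe(x, y, z, rstride=30, cstride=30,
linewidth=1, color='dodgerblue')

Simple animation:

import matplotlib.animation as ma

Define the behavior of the update function:

def update(number):
    pass

anim = ma.FuncAnimation(mp.gcf(), update, interval=10)

mp.show()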

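Plotting with Series and DataFrame

Parameters of Series.plot:

| Parameter | Description |
| --------- | ----------- |
| `label` | Label used for the legend |
| `ax` | The matplotlib subplot object to draw on; if not set, the current matplotlib subplot is used |
| `style` | Style string passed to matplotlib (such as 'ko--') |
| `alpha` | Fill opacity of the plot (between 0 and 1) |
| `kind` | One of 'line', 'bar' (bar chart), 'barh' (horizontal bar chart), 'kde' (kernel density estimate), 'pie' (pie chart) |
| `logy` | Use a logarithmic scale on the Y axis |
| `use_index` | Use the object's index as tick labels |
| `rot` | Rotation of the tick labels (0 to 360) |
| `xticks` | Values used as X-axis ticks |
| `yticks` | Values used as Y-axis ticks |
| `xlim` | Limits of the X axis (for example [0,10]) |
| `ylim` | Limits of the Y axis |
| `grid` | Show the axis grid lines (on by default) |

Example with kind='pie': train_data.label.value_counts().plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0,0.1],ax=axe[0])

Parameters of DataFrame.plot:

| Parameter | Description |
| --------- | ----------- |
| `subplots` | Draw each DataFrame column in a separate subplot |
| `sharex` | If subplots=True, share the same X axis, including ticks and limits |
| `sharey` | If subplots=True, share the same Y axis |
| `figsize` | Tuple giving the size of the figure |
| `title` | String giving the title of the figure |
| `legend` | Add a subplot legend (True by default) |
| `sort_columns` | Draw the columns in alphabetical order; by default the current column order is used |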

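seaborn:

sns.countplot(x=None, y=None, hue=None, data=None, order=None,

hue_order=None, orient=None, color=None, palette=None,

saturation=0.75, dodge=True, ax=None, **kwargs): draws a bar chart of counts

Parameters:

x: bars along the x axis, counting the observations in each category of x

y: bars along the y axis, counting the observations in each category of y

hue: in addition to the split by x or y, further split the counts by the categories of hue

data: DataFrame, array, or list of arrays used as the data set for plotting; x and y then name fields in data. At least one of x and y must be given

order and hue_order: lists giving the order of the x or y categories and the order of the hue categories, respectively

orient: force the orientation, v: vertical; h: horizontal

palette: use a different color palette

sns.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None): histogram and kernel density estimate plot (deprecated since seaborn 0.11 in favor of histplot and displot)

hist: whether to draw the histogram, True by default

kde: whether to draw the kernel density estimate, True by default

rug: whether to draw a small tick for each observation (rug plot), False by default

fit: a parametric distribution to fit to the data and draw as a black line, so that its agreement with the observed data can be judged visually

axlabel: label of the x axis

label: legend label for the relevant component of the plot

ax: the axes to draw on

norm_hist: if True, the histogram height shows a density rather than a count (implied when a KDE is drawn)

bins: int or list, controls the binning of the histogram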

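Scipy:

scipy.stats.probplot(x, sparams=(), dist='norm', fit=True, plot=None, rvalue=False)

By default the data are compared against a normal distribution

x: array_like. Sample/response data from which probplot creates the plot.

sparams: tuple, optional. Distribution-specific shape parameters (shape parameters plus location and scale).

dist: str or stats.distributions instance, optional. Distribution or distribution function name. The default is 'norm', for a normal probability plot. Objects that look like a stats.distributions instance (that is, they have a ppf method) are also accepted.

fit: bool, optional. If True (the default), fit a least-squares regression (best-fit) line to the sample data.

plot: object, optional.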
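Without a plot object, probplot simply returns the quantile pairs and the fitted line, which is enough to judge normality numerically. The sketch below uses a synthetic normal sample; the seed, mean and standard deviation are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=200)   # synthetic, normally distributed data

# osm: theoretical quantiles, osr: ordered sample values
# slope, intercept, r: least-squares line through the quantile pairs (fit=True by default)
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist='norm')

# For normal data the points lie close to a line, so r is close to 1,
# and the slope and intercept approximate the standard deviation and the mean
print(slope, intercept, r)
```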
