
Quantitative Investing Study Notes: ESG Factor Return Analysis

Date: 2020-06-16 15:07:17

Corporate Responsibility Factor (ESG)

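ESG contributes not only by lifting the share price; it can also lower the investment risk of the company, so the Sharpe ratio of the investment is comparatively higher. ESG factor stock-selection performance: judging from the composite ESG scores available so far, ESG shows fairly stable stock-selection and "landmine-avoidance" ability within the CSI 300.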

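Residual risk: companies with high ESG scores have lower tail risk. The residual volatility of the CAPM model is used to measure each stock's tail risk. After industry neutralization, stocks are sorted into five groups by total ESG score; the fifth group has the lowest residual volatility (Res_Vol), significantly lower than the other groups.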
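To make the residual-risk measure concrete, here is a minimal sketch of how residual volatility could be computed for a single stock: regress its returns on market returns (the CAPM form) and take the standard deviation of the residuals. The helper name `residual_volatility` and the synthetic return series are assumptions for illustration; the article does not show its own implementation, and the risk-free adjustment is omitted here.

```python
import numpy as np

def residual_volatility(stock_ret, market_ret):
    """Hypothetical helper: fit r_stock = alpha + beta * r_market + eps by
    ordinary least squares and return (beta, std of the residuals eps)."""
    y = np.asarray(stock_ret, dtype=float)
    x = np.asarray(market_ret, dtype=float)
    X = np.column_stack([np.ones_like(x), x])      # intercept + market return
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # coef = [alpha, beta]
    resid = y - X @ coef
    return coef[1], resid.std(ddof=2)              # ddof=2: two fitted parameters

# Synthetic data (assumption): true beta of 1.2 plus idiosyncratic noise
rng = np.random.default_rng(0)
market = rng.normal(0.0, 0.01, size=250)
noise = rng.normal(0.0, 0.02, size=250)
stock = 0.0002 + 1.2 * market + noise

beta, res_vol = residual_volatility(stock, market)
print(f"beta={beta:.2f}, residual volatility={res_vol:.4f}")
```

Computing this per stock and then averaging within each ESG group would give the kind of group comparison described above.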

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
import time
import statsmodels.api as sm
from datetime import datetime, timedelta
from scipy.stats import ttest_ind

pd.set_option('display.max_columns', None)
sns.set_style('white')
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly

Reading the data

esg = pd.read_excel('ESG指标.xlsx', sheet_name='指标', skiprows=3)
closeprice_adj = pd.read_excel('ESG指标.xlsx', sheet_name='收盘价', skiprows=3)

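Building the ESG factor

The ESG data we obtained are letter grades, so for convenience we map them to numeric scores according to the grading scale (0 to 1 for grades B through AAA). Many companies are missing ratings; companies with missing ratings are left untreated.

Filling the data: missing values are filled with the value from the previous period.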

esg.head()

df_test = pd.DataFrame(data=[[1, 2, 3], [4, 5, np.nan]], columns=['A', 'B', 'C'])
df_test.fillna(0)

esg = esg.ffill()  # forward-fill; fillna(method='ffill') is deprecated in recent pandas
esg.head()

Checking the share of missing data

sorted(list(set(esg['Date'].dt.year)))

(Output: a list of years; the values were lost in extraction.)

esg['Date'].dt.year

(Output: a Series of years indexed 0 to 163; the year values were lost in extraction.)
Name: Date, Length: 164, dtype: int64

df_cal = pd.DataFrame()

# Assign with "=" (the original cell used "==", which raised a KeyError on the empty df_cal)
for year in sorted(set(esg['Date'].dt.year)):
    yearly = esg[esg['Date'].dt.year == year]
    # Share of missing values per column in this year
    df_cal[str(year)] = yearly.isnull().sum() / yearly.shape[0]
print(df_cal)

import import_ipynb  # lets one notebook import another; install it with pip first
import lib_fun as fun  # utility functions

importing Jupyter notebook from lib_fun.ipynb

Converting ESG ratings to floating-point scores

esg_map={'AAA':1,'AA':0.8,'A':0.6,'BBB':0.4,'BB':0.2,'B':0,'CCC':-0.2,'CC':-0.4}

esg = esg.replace(esg_map)

Because much of the early data is empty, only data from a later start year onward is used.

esg = esg.set_index('Date').stack()

# esg.values.astype(float)  # convert the values to a float array

array([0.6, 0.6, 0.4, ..., 0.4, 0.8, 0.8])

esg.head()

Date
-02-27  000001.SZ    0.6
        000002.SZ    0.6
        000063.SZ    0.4
        000066.SZ    0.4
        000069.SZ    0.6
dtype: float64

pd.Series(esg.values).hist(figsize=(20, 16), bins=100, label='factor_distribution')
plt.legend()

[Figure: histogram of the raw ESG factor values (output_23_1.png)]

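Industry analysis of the ESG factor

Distribution of factor values

As the histogram shows, the factor values are concentrated between 0 and 1, with large empty stretches around -0.4 and -0.2. The reason is that the factor has a small number of outliers that are too few to be visible in the figure. To make better use of the factor's predictive information across all stocks, the influence of these outliers has to be removed, so the factor is winsorized later on.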

# Read the industry classification data
IND = pd.read_excel('ESG指标.xlsx', sheet_name='行业分类')
IND.head()

IND = IND.set_index('Date').stack()

IND_df = pd.concat([IND, esg], axis=1)

IND_df.head()

IND_df.columns=['industry','ESG']

plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly

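Industry distribution of the ESG factor

As the bar chart below shows, factor values are poorly comparable across industries: the value for commerce and trade (商业贸易) is clearly higher than for other industries, while the remaining industries sit at roughly 0.8, 0.6 and 0.4. The factor therefore needs industry neutralization to remove the effect of cross-industry differences on factor values (the procedure is described later).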

IND_median = IND_df.groupby('industry').median()
plt.rc("font", family="SimHei", size=12)
IND_median.plot(kind='bar', figsize=(13, 4), grid=True)

[Figure: bar chart of the median raw ESG factor by industry (output_33_1.png)]

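Changes in the ESG factor by industry over time

(X axis: time; Y axis: industry)

The heat map shows how each industry's ESG factor changes over time (darker colors mean higher ESG, lighter colors mean lower ESG):

The ESG scores of the steel and chemical industries have stayed very low throughout, which may be related to the nature of these industries, in particular their relatively high carbon emissions.

The ESG score of real estate has recently been relatively high, which may be related to industry governance efforts paying off.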

IND_df.head()

industry_map={'银行':'bank',}

IND_df = IND_df.replace(industry_map)

pd.to_datetime(['19900304','19900204','19900304','19900306'])

DatetimeIndex(['1990-03-04', '1990-02-04', '1990-03-04', '1990-03-06'], dtype='datetime64[ns]', freq=None)

IND_df.index.get_level_values(level=0).month

Int64Index([2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            ...
            9, 9, 9, 9, 9, 9, 9, 9, 9, 9],
           dtype='int64', name='Date', length=49200)

IND_case.groupby(['Date', 'industry'])['ESG'].median()

Date    industry
-01-30  bank        0.8
        交通运输    0.4
        休闲服务    0.5
        传媒        0.4
        公用事业    0.8
                    ...
-09-24  通信        0.8
        采掘        0.8
        钢铁        0.8
        非银金融    0.8
        食品饮料    0.8
Name: ESG, Length: 2106, dtype: float64

IND_case.groupby(['Date', 'industry'])['ESG'].median().unstack('industry')

81 rows × 26 columns

# Keep only dates after a cutoff year (the year after ">" is missing in the source)
IND_case = IND_df[IND_df.index.get_level_values(level=0).year>]
plt.subplots(1, 1, figsize=(18, 8))
sns.heatmap(IND_case.groupby(['Date', 'industry'])['ESG'].median().unstack('Date'), cmap='YlGnBu')

[Figure: heat map of the median ESG factor by industry (Y axis) over time (X axis) (output_42_1.png)]

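Correlation between the ESG factor and the size factor

Size is a very significant style factor, so analyzing the correlation between the ESG factor and the size factor helps us understand where the ESG factor's returns come from.

As the figure below shows, the correlation between the ESG factor and the size factor stays above 0.3 quite persistently and reaches 0.4 to 0.5 in some periods, so the ESG factor needs size neutralization.

Another key point: over time, the correlation first rises, reaches a peak, and then declines.

The size factor is the natural logarithm of the company's total market capitalization.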
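The correlation used in this section is a Spearman rank correlation, that is, the Pearson correlation of the ranks. The sketch below spells that out in plain Python on a made-up cross-section; the function names and the sample numbers are assumptions for illustration, and the article itself calls `scipy.stats.spearmanr`.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical cross-section: market caps and ESG scores for six stocks
market_cap = [5200.0, 800.0, 150.0, 2300.0, 60.0, 950.0]
esg_score = [0.8, 0.6, 0.4, 0.6, 0.2, 0.8]
size = [math.log(m) for m in market_cap]   # SIZE = ln(market cap)

rho = spearman(esg_score, size)
print(round(rho, 3))
```

Because the logarithm is monotone, the rank correlation with `size` equals the rank correlation with raw market cap; the log transform matters for the later regression step rather than for this diagnostic.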

# Read the market-cap (SIZE) data
SIZE = pd.read_excel('ESG指标.xlsx', sheet_name='SIZE')
SIZE = SIZE.set_index('Date').stack()
SIZE = SIZE.apply(np.log)  # size factor definition: SIZE = natural log of market cap (in units of 100 million)

SIZE_df = pd.concat([SIZE, esg], axis=1)
SIZE_df = SIZE_df.reset_index()  # keep the index from interfering with the plot

SIZE_df.columns=['Date','level_1','SIZE','esg']

SIZE_df['Date'] = pd.to_datetime(SIZE_df['Date'])
rank_corr_with_mktcap = (
    SIZE_df.dropna()
    .groupby(by='Date')
    .apply(lambda df: st.spearmanr(df.esg, df.SIZE)[0])
)
rank_corr_with_mktcap.plot(figsize=(15, 4), grid=True)

[Figure: rank correlation between the raw ESG factor and the size factor over time (output_48_1.png)]

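Factor data preprocessing

Based on the analysis above, the ESG factor needs winsorization plus size and industry neutralization.

Before size and industry neutralization, outliers in the factor need to be adjusted.

After size and industry neutralization, each cross-section of the factor needs to be standardized.

MAD winsorization

MAD winsorization replaces the mean and standard deviation with robust statistics: the sample mean is replaced by the sample median, and the sample standard deviation by the sample MAD (Median Absolute Deviation). Observations more than three times $MAD_e$ away from the median are treated as outliers: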

$$md = \mathrm{median}(x_i,\ i = 1, 2, \dots, n)$$

$$MAD = \mathrm{median}(|x_i - md|,\ i = 1, 2, \dots, n)$$

$$MAD_e = 1.483 \times MAD$$
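As a worked example of the three formulas, the sketch below clips a small made-up sample to the band from md - 3·MAD_e to md + 3·MAD_e, using only the standard library. The function name `mad_winsorize` and the sample values are assumptions for illustration.

```python
from statistics import median

def mad_winsorize(values, n_mad=3.0, scale=1.483):
    """Clip values to [md - n_mad * MAD_e, md + n_mad * MAD_e]."""
    md = median(values)
    mad = median(abs(x - md) for x in values)
    mad_e = scale * mad
    lower, upper = md - n_mad * mad_e, md + n_mad * mad_e
    clipped = [min(max(x, lower), upper) for x in values]
    return clipped, (lower, upper)

raw = [0.2, 0.4, 0.4, 0.6, 0.6, 0.6, 0.8, 5.0]   # 5.0 is a made-up outlier
clipped, (lower, upper) = mad_winsorize(raw)
print(lower, upper)
print(clipped)
```

Here the median is 0.6 and the MAD is about 0.2, so the upper bound is about 0.6 + 3 × 1.483 × 0.2 ≈ 1.49; only the outlier is pulled in, and the other values are unchanged.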

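Industry and size neutralization

Industry and size neutralization regresses the winsorized factor on industry dummy variables and on log market cap, and takes the regression residuals as the new factor values.

Industries are classified using the Shenwan (SW) level-1 industry classification.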
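The regression step can be sketched with NumPy alone. This is a minimal illustration on a made-up cross-section of six stocks in two industries, not the article's own routine (which uses `statsmodels`); all array values and industry labels are assumptions.

```python
import numpy as np

# Hypothetical cross-section: factor values, log market cap, industry labels
factor = np.array([0.9, 0.5, 0.8, 0.3, 0.4, 0.1])
log_size = np.array([8.0, 6.5, 7.2, 7.8, 5.9, 6.6])
industry = np.array(["bank", "bank", "bank", "steel", "steel", "steel"])

# One dummy column per industry. Together the dummies already span the
# intercept, so no separate constant column is added here.
labels = np.unique(industry)
dummies = (industry[:, None] == labels[None, :]).astype(float)
X = np.column_stack([log_size, dummies])

coef, *_ = np.linalg.lstsq(X, factor, rcond=None)
neutral = factor - X @ coef        # regression residual = neutralized factor

print(np.round(neutral, 4))
```

By construction the residuals are orthogonal to every regressor, so the neutralized factor has zero mean within each industry and zero linear correlation with log size in the sample. Adding a constant on top of a full set of dummies, as the article's function does, makes the design matrix rank-deficient; a least-squares solver that handles rank deficiency still yields the same residuals, because the fitted values span the same column space.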

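Standardization

Standardization rescales the data to mean 0 and standard deviation 1, so that different factors become additive:

$$X = \frac{X - \mathrm{mean}(X)}{\mathrm{std}(X)}$$

Finally, the industry-neutralized factor is standardized cross-sectionally to obtain the standardized z-score.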
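A short standard-library sketch of the z-score, on made-up residual values. It uses the sample standard deviation (divisor n - 1), which matches the default of pandas' `.std()`; the function name and the data are assumptions for illustration.

```python
from statistics import mean, stdev

def zscore(values):
    """Standardize to mean 0 and sample standard deviation 1."""
    m, s = mean(values), stdev(values)   # stdev divides by n - 1
    return [(x - m) / s for x in values]

residuals = [0.12, -0.05, 0.30, -0.22, -0.15]   # made-up neutralized values
z = zscore(residuals)
print([round(v, 3) for v in z])
```

Note that the whole difference X - mean(X) is divided by std(X); dividing only the mean by the standard deviation would give a different, unstandardized result.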

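def factor_process(factor_name, data, mode):
    """
    Factor preprocessing: median-based winsorization -> size and industry
    neutralization -> standardization.
    Inputs:
        factor_name: name of the factor to preprocess
        data: raw factor data for one date; the last columns must be the
              size factor followed by the industry dummy variables
        mode: 'yes' to neutralize, 'no' to skip neutralization
    Returns:
        data: the factor data with factor_name processed
    """
    # Median (MAD) winsorization
    D_m = data[factor_name].median()
    D_mad = abs(data[factor_name] - D_m)
    dm1 = 1.483 * D_mad.median()
    upper = D_m + 3 * dm1
    lower = D_m - 3 * dm1
    # Clip values to the bounds
    data[factor_name] = data[factor_name].clip(lower, upper)
    # Column position of the size factor
    n = list(data.columns).index('SIZE')
    # Neutralization
    if mode == 'yes':
        y = np.array(data[factor_name])
        # Size plus industry dummies as regressors (to avoid spurious regression and factor collinearity)
        # dtype=float keeps boolean dummy columns from producing an object array
        x = np.array(data[data.columns[n:]], dtype=float)
        x = sm.add_constant(x, has_constant='add')
        model = sm.OLS(y, x, missing='drop')
        results = model.fit()
        data[factor_name] = results.resid
    # Standardization
    data[factor_name] = (data[factor_name] - data[factor_name].mean()) / data[factor_name].std()
    return data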

Raw factor data

factor_org = pd.concat([esg, SIZE, IND], axis=1)
factor_org.reset_index(inplace=True)
factor_org.head(3)

factor_org.columns=['Date','secID','esg','SIZE','industry']

factor_org['industry']

0            银行
1          房地产
2            通信
3          计算机
4          房地产
           ...
49195      计算机
49196        电子
49197    机械设备
49198    医药生物
49199        电子
Name: industry, Length: 49200, dtype: object

pd.concat([factor_org,pd.get_dummies(factor_org['industry'])],axis=1)

49200 rows × 31 columns

factor_stand = []
# Factors to process
factor_list = ['esg', 'SIZE']
date_list = sorted(factor_org['Date'].unique())
for date in date_list:
    tdata = factor_org[factor_org['Date'] == date]
    tdata = tdata.reset_index(drop=True).dropna()
    if len(tdata) == 0:
        continue
    # Convert industry to dummy variables
    indu_dummies = pd.get_dummies(tdata['industry'])
    del tdata['industry']
    tdata = pd.concat([tdata, indu_dummies], axis=1)
    # Standardize size (no neutralization)
    tdata = factor_process('SIZE', tdata, 'no')
    # Process the remaining factors
    for factor_name in factor_list[:-1]:
        tdata = factor_process(factor_name, tdata, 'yes')
    factor_stand.append(tdata)
factor_stand = pd.concat(factor_stand)
factor_stand.head()
# Store the standardized factors

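factor_stand = factor_stand.sort_values(by=['Date', 'secID'])  # assign the result; sort_values is not in-place
factor_stand.reset_index(drop=True, inplace=True)
# Drop the industry dummies, keeping only the factor columns
factor_stand = factor_stand[['Date', 'secID'] + factor_list]
factor_stand = factor_stand.fillna(0)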

Preprocessed factor data

The factor is winsorized, neutralized for industry and size, and standardized.

factor_stand = factor_stand[['Date', 'secID', 'esg']]
factor_stand.head(3)

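Distribution of factor values after winsorization and industry and size neutralization

As the figure below shows, the processed factor is more evenly distributed than the raw factor (closer to a normal distribution), so the factor's information can be used more effectively to separate good stocks from bad ones.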

factor_stand.hist(figsize=(8, 4), bins=20, label='factor_distribution')
plt.legend()

[Figure: histogram of the processed ESG factor values (output_64_1.png)]

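Industry distribution of the ESG factor after winsorization and industry and size neutralization

As the figure below shows, after winsorization and industry and size neutralization, factor values are more comparable across industries.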

factor_stand_copy = factor_stand.copy()
factor_stand_copy.set_index(['Date', 'secID'], inplace=True)
IND_df = pd.concat([IND, factor_stand_copy], axis=1)

IND_df.columns=['industry','esg']

IND_median = IND_df.groupby('industry').median()
plt.rc("font", family="SimHei", size=12)
IND_median.plot(kind='bar', figsize=(13, 4), grid=True)

[Figure: bar chart of the median processed ESG factor by industry (output_68_1.png)]

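Relationship between size and the ESG factor after winsorization and industry and size neutralization

As the figure below shows, after winsorization and industry and size neutralization, the ESG factor's correlation with size drops markedly.

Before processing, the correlation with size was about 0.3; after processing it falls to about 0.02 and never exceeds 0.06.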

SIZE_df = pd.concat([SIZE, factor_stand_copy], axis=1)
SIZE_df = SIZE_df.reset_index()

SIZE_df.columns=['Date','secID','SIZE','esg']

[Figure: rank correlation between the processed ESG factor and the size factor over time (output_72_1.png)]

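Factor testing

Analyzing the effectiveness of the factor for stock selection

Data preparation:

Raw and preprocessed factors

Index constituents: CSI 300

forward_return: each stock's return over the following month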

# Raw / preprocessed factors
esg = factor_org.pivot(index='Date', columns='secID', values='esg')
esg_neu = factor_stand.pivot(index='Date', columns='secID', values='esg')

# forward_return
closeprice_adj = pd.read_excel('ESG指标.xlsx', sheet_name='收盘价')
closeprice_adj.head(100)

100 rows × 301 columns

closeprice_adj = closeprice_adj.set_index('Date')
chgpct_a = closeprice_adj.pct_change()
forward_return = chgpct_a.shift(-1)
forward_return.head()
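The alignment produced by `pct_change()` followed by `shift(-1)` is easy to misread, so here is a small sketch on a made-up price series for one hypothetical stock. The dates and prices are assumptions for illustration only.

```python
import pandas as pd

# Made-up month-end adjusted closes for one hypothetical stock
close = pd.Series(
    [10.0, 11.0, 9.9, 10.89],
    index=pd.to_datetime(["2020-01-31", "2020-02-28", "2020-03-31", "2020-04-30"]),
)

ret = close.pct_change()          # return realized over the period ending at t
forward_return = ret.shift(-1)    # return over the period starting at t

print(pd.DataFrame({"close": close, "ret": ret, "forward_return": forward_return}))
```

The row for the first date now carries the return earned over the following period, which is what a factor value observed on that date should be compared against; the last row is NaN because no later price exists.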
