Workshop Actuaria
Pricing Game
Goal of this study: build a profitable company and win as many contracts as possible in 2011.
We will predict an optimal premium for each new contract. To do so, we build a model in two steps:
- a logistic regression model (does a claim occur or not?),
- a claim-cost estimation model.
These two models are then combined to estimate, under constraints, the optimal premium for each policyholder.
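As a preview, here is a minimal sketch of that final combination step, mirroring the weighting scheme used in the last section (p_claim, cost_score and total_claims_estimate are placeholders for quantities built later in this notebook):
import numpy as np
def combine_premiums(p_claim, cost_score, total_claims_estimate):
    # Weight each policy by its predicted claim probability and cost score,
    # then rescale so that the premiums sum to the estimated total claim cost.
    coeff = p_claim * cost_score * 10 + 1
    return coeff * total_claims_estimate / coeff.sum()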
Preliminary exploration
import pandas as pd
import numpy as np
%matplotlib inline
import xgboost as xgb
from sklearn import metrics
from xgboost import plot_importance
# Read the source data for the game
training_df = pd.read_csv("training.csv", sep=';')
pricing_df = pd.read_csv("pricing.csv", sep=',')
del pricing_df["Unnamed: 0"]
print(training_df.shape)
print(pricing_df.shape)
(100021, 20)
(36311, 15)
print(training_df.dtypes)
training_df.describe()
PolNum int64
CalYear int64
Gender object
Type object
Category object
Occupation object
Age int64
Group1 int64
Bonus int64
Poldur int64
Value int64
Adind int64
SubGroup2 object
Group2 object
Density float64
Exppdays int64
Numtppd int64
Numtpbi int64
Indtppd float64
Indtpbi float64
dtype: object
PolNum | CalYear | Age | Group1 | Bonus | Poldur | Value | Adind | Density | Exppdays | Numtppd | Numtpbi | Indtppd | Indtpbi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1.000210e+05 | 100021.000000 | 100021.000000 | 100021.000000 | 100021.000000 | 100021.000000 | 100021.000000 | 100021.000000 | 100021.000000 | 100021.000000 | 100021.000000 | 100021.000000 | 100021.000000 | 100021.000000 |
mean | 2.002003e+08 | 2009.499895 | 41.122514 | 10.692625 | -6.921646 | 5.470781 | 16454.675268 | 0.512142 | 117.159270 | 327.588007 | 0.147449 | 0.046790 | 106.135007 | 222.762829 |
std | 6.217239e+04 | 0.500002 | 14.299349 | 4.687286 | 48.633165 | 4.591194 | 10506.742732 | 0.499855 | 79.500907 | 73.564636 | 0.436917 | 0.219546 | 444.949188 | 1859.422836 |
min | 2.001149e+08 | 2009.000000 | 18.000000 | 1.000000 | -50.000000 | 0.000000 | 1000.000000 | 0.000000 | 14.377142 | 91.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 2.001399e+08 | 2009.000000 | 30.000000 | 7.000000 | -40.000000 | 1.000000 | 8380.000000 | 0.000000 | 50.625783 | 340.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 2.001649e+08 | 2009.000000 | 40.000000 | 11.000000 | -30.000000 | 4.000000 | 14610.000000 | 1.000000 | 94.364623 | 365.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 2.002608e+08 | 2010.000000 | 51.000000 | 14.000000 | 10.000000 | 9.000000 | 22575.000000 | 1.000000 | 174.644525 | 365.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 2.002858e+08 | 2010.000000 | 75.000000 | 20.000000 | 150.000000 | 15.000000 | 49995.000000 | 1.000000 | 297.385170 | 365.000000 | 7.000000 | 3.000000 | 12878.369910 | 69068.026292 |
print(pricing_df.dtypes)
pricing_df.describe()
PolNum int64
CalYear int64
Gender object
Type object
Category object
Occupation object
Age int64
Group1 int64
Bonus int64
Poldur int64
Value int64
Adind int64
SubGroup2 object
Group2 object
Density float64
dtype: object
PolNum | CalYear | Age | Group1 | Bonus | Poldur | Value | Adind | Density | |
---|---|---|---|---|---|---|---|---|---|
count | 3.631100e+04 | 36311.0 | 36311.000000 | 36311.000000 | 36311.000000 | 36311.000000 | 36311.000000 | 36311.000000 | 36311.000000 |
mean | 2.003507e+08 | 2011.0 | 41.186087 | 10.712153 | -6.594145 | 5.559335 | 16518.524689 | 0.518465 | 116.666024 |
std | 1.446972e+04 | 0.0 | 14.306933 | 4.690578 | 49.074019 | 4.626096 | 10528.519190 | 0.499666 | 79.787179 |
min | 2.003257e+08 | 2011.0 | 18.000000 | 1.000000 | -50.000000 | 0.000000 | 1005.000000 | 0.000000 | 14.377142 |
25% | 2.003381e+08 | 2011.0 | 30.000000 | 7.000000 | -40.000000 | 2.000000 | 8400.000000 | 0.000000 | 50.351845 |
50% | 2.003507e+08 | 2011.0 | 40.000000 | 11.000000 | -30.000000 | 4.000000 | 14665.000000 | 1.000000 | 93.382351 |
75% | 2.003633e+08 | 2011.0 | 51.000000 | 14.000000 | 10.000000 | 9.000000 | 22700.000000 | 1.000000 | 171.372936 |
max | 2.003757e+08 | 2011.0 | 75.000000 | 20.000000 | 150.000000 | 15.000000 | 49995.000000 | 1.000000 | 297.385170 |
Target analysis
Classification target: claim vs no claim
import seaborn as sns
sns.countplot(training_df["Numtppd"] + training_df["Numtpbi"])
The number of bodily-injury and material-damage claims follows a Poisson-like distribution.
Scoring the claim risk will be hard to solve because of the strong class imbalance.
Given this distribution, we choose not to take the number of claims into account in our regression target.
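As a quick check (not part of the original data exploration), the imbalance can be quantified directly:
# Share of policies with at least one claim (third-party material or bodily injury)
claim_rate = ((training_df["Numtppd"] + training_df["Numtpbi"]) > 0).mean()
print("Claim rate: %.1f%%" % (100 * claim_rate))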
Regression target: total claim cost (material damage and bodily injury)
# We first restrict the analysis to policies with claims, pro-rated by the contract's exposure
training_df["IndtTotal"] = training_df.apply( lambda x : (x["Exppdays"]/365) * (x["Indtppd"] + x["Indtpbi"]), axis=1)
train_couts_sinistre = training_df.loc[training_df["IndtTotal"] > 0, "IndtTotal" ]
import seaborn as sns
from matplotlib import pyplot as plt
def plot_dist(target):
    # Configure the display parameters
    plt.style.use(style='ggplot')
    plt.rcParams['figure.figsize'] = (18, 6)
    # Plot the distribution of claim costs
    sns.distplot(target.values, bins=50, kde=False)
    plt.xlabel('Coûts des sinistres', fontsize=12)
    plt.show()
    return target.skew()
plot_dist(train_couts_sinistre)
The distribution has a long right tail, with what look like very extreme values.
def show_outliers(target):
plt.figure(figsize=(12,6))
plt.scatter(range(target.shape[0]), np.sort(target.values),color='blue')
plt.xlabel('index', fontsize=12)
plt.ylabel('Coûts des sinistres', fontsize=12)
plt.show()
show_outliers(train_couts_sinistre)
The two previous plots suggest that total claim amounts beyond the 98th percentile
behave differently and should be handled separately.
# Estimate the 98th-percentile threshold
percentile_98 = np.percentile(train_couts_sinistre,98)
percentile_98
15851.449026353968
# Drop the rows above the threshold chosen as the outlier limit on the target
def remove_outliers(df, ulimit):
return df[df["IndtTotal"]<ulimit]
training_df = remove_outliers(training_df, percentile_98)
show_outliers(training_df.loc[training_df["IndtTotal"]>0, "IndtTotal"])
We now rework the distribution of our regression target.
# Reshape the distribution to make it easier to learn
training_df["LogIndtTotal"] = np.log1p(training_df["IndtTotal"])
# Review the new distribution of the log target
sns.distplot(training_df.loc[training_df["LogIndtTotal"]>0, "LogIndtTotal"], bins=50, kde=False)
plt.xlabel('Log Coûts Sinistres', fontsize=12)
plt.show()
We obtain a roughly Gaussian distribution that should be easier to learn; we will apply the inverse transform (expm1) to the predictions to get back to claim costs.
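A quick sanity check that expm1 is the exact inverse of the log1p transform used here:
# log1p / expm1 round-trip on a few values of the target
sample = training_df["IndtTotal"].head()
assert np.allclose(np.expm1(np.log1p(sample)), sample)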
Analysis of the training-set variables
Looking for correlations between targets and variables
def display_categ(categ, target):
fig, ax = plt.subplots(figsize=(12,8))
ax = sns.boxplot(x=categ, y=target, data=training_df[(training_df[target]>0) &(training_df[target]<20000) ])
display_categ("Gender", "Indtppd")
The Unemployed and Retired classes stand out (see the Occupation boxplots below).
display_categ("Occupation", "Indtppd")
display_categ("Type", "Indtppd")
display_categ("Type", "Indtpbi")
display_categ("Group1", "Indtpbi")
Feature Engineering
import numpy as np
# The targets Indtpbi and Indtppd must be adjusted for their exposure
training_df["Indtpbi_expo"] = training_df["Indtpbi"] * training_df["Exppdays"]
training_df["Indtppd_expo"] = training_df["Indtppd"] * training_df["Exppdays"]
# Fix Poldur and create the age at first contract
# The Poldur variable looks inconsistent in some cases, so we enforce the rule: Age - Poldur >= 18
training_df["Age_premier_contrat"] = training_df["Age"] - training_df["Poldur"]
training_df["Age_premier_contrat"] = training_df.apply(lambda x : max(18, x["Age_premier_contrat"]), axis=1)
training_df["Poldur"] = training_df["Age"] - training_df["Age_premier_contrat"]
pricing_df["Age_premier_contrat"] = pricing_df["Age"] - pricing_df["Poldur"]
pricing_df["Age_premier_contrat"] = pricing_df.apply(lambda x : max(18, x["Age_premier_contrat"]), axis=1)
pricing_df["Poldur"] = pricing_df["Age"] - pricing_df["Age_premier_contrat"]
We create bonus-related variables that may add predictive weight to our models.
# Variable: has never had a claim
train_work_df = training_df.copy()
train_work_df["BonusNorm"] = train_work_df.apply(lambda x : 1+x["Bonus"]/100, axis=1)
train_work_df["BonusMax"] = train_work_df.apply(lambda x : 0.95**x["Poldur"], axis=1)
train_work_df["Bonustheorique"] = train_work_df.apply(lambda x : max(x["BonusMax"], 0.5), axis=1)
train_work_df["SansSinistre"] = train_work_df.apply(lambda x : int(x["Bonustheorique"] == x["BonusNorm"]), axis=1 )
training_df["SansSinistre"] = train_work_df["SansSinistre"].copy()
pricing_work_df = pricing_df.copy()
pricing_work_df["BonusNorm"] = pricing_work_df.apply(lambda x : 1+x["Bonus"]/100, axis=1)
pricing_work_df["BonusMax"] = pricing_work_df.apply(lambda x : 0.95**x["Poldur"], axis=1)
pricing_work_df["Bonustheorique"] = pricing_work_df.apply(lambda x : max(x["BonusMax"], 0.5), axis=1)
pricing_work_df["SansSinistre"] = pricing_work_df.apply(lambda x : int(x["Bonustheorique"] == x["BonusNorm"]), axis=1 )
pricing_df["SansSinistre"] = pricing_work_df["SansSinistre"].copy()
train_work_df.loc[train_work_df["SansSinistre"]==1,["Bonus","Poldur","BonusNorm","BonusMax","Bonustheorique", "SansSinistre" ]]
Bonus | Poldur | BonusNorm | BonusMax | Bonustheorique | SansSinistre | |
---|---|---|---|---|---|---|
4 | 0 | 0 | 1.0 | 1.000000 | 1.0 | 1 |
19 | 0 | 0 | 1.0 | 1.000000 | 1.0 | 1 |
151 | -50 | 15 | 0.5 | 0.463291 | 0.5 | 1 |
253 | -50 | 14 | 0.5 | 0.487675 | 0.5 | 1 |
... | ... | ... | ... | ... | ... | ... |
99958 | -50 | 15 | 0.5 | 0.463291 | 0.5 | 1 |
99997 | 0 | 0 | 1.0 | 1.000000 | 1.0 | 1 |
4211 rows × 6 columns
sns.lineplot(data =train_work_df[train_work_df["CalYear"]==2010], y="BonusNorm", x="Poldur")
sns.countplot(x="Poldur", data=training_df)
Encoding the categorical variables
import seaborn as sns
# Convert the vehicle's Group1 variable to categorical
training_df["Group1"] = training_df["Group1"].astype(object)
pricing_df["Group1"] = pricing_df["Group1"].astype(object)
# Build the list of categorical columns to one-hot encode
categ_col_list = list(training_df.select_dtypes(include=[object]))
categ_col_list.remove("SubGroup2") # too many distinct values; we would need to study how to group them
# if some groupings other than Group2 turned out to be worthwhile
# Plot the data distributions of the categorical variables
def countplot_features(df, column_list, n_wide=1):
fig, ax = plt.subplots(int(len(column_list)/n_wide), n_wide, figsize = (18,24))
for i, col in enumerate(column_list) :
plt.sca(ax[int(i/n_wide), (i%n_wide)])
sns.countplot(x=col, data=df)
countplot_features(training_df, categ_col_list, 3)
There are enough observations in each category to keep all these groups distinct from one another.
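The cardinality of each retained variable can be checked quickly (SubGroup2 was excluded above precisely because of its high cardinality):
# Number of distinct levels per categorical variable kept for encoding
print(training_df[categ_col_list].nunique())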
# One-hot encoding of the train and test dataframes
def dummies(train, test, columns=None):
    if columns is not None:
        for column in columns:
            train[column] = train[column].apply(lambda x: str(x))
            test[column] = test[column].apply(lambda x: str(x))
            # Keep only the category levels present in both train and test
            good_cols = [column + '_' + i for i in train[column].unique() if i in test[column].unique()]
            train = pd.concat((train, pd.get_dummies(train[column], prefix=column)[good_cols]), axis=1)
            test = pd.concat((test, pd.get_dummies(test[column], prefix=column)[good_cols]), axis=1)
            del train[column]
            del test[column]
    return train, test
train_dummies, pricing_dummies = dummies(training_df, pricing_df , columns = categ_col_list)
col_list = train_dummies.columns.to_list()
print(col_list)
print(str(len(col_list))+ " variables")
['PolNum', 'CalYear', 'Age', 'Bonus', 'Poldur', 'Value', 'Adind', 'SubGroup2', 'Density', 'Exppdays', 'Numtppd', 'Numtpbi', 'Indtppd', 'Indtpbi', 'IndtTotal', 'LogIndtTotal', 'Indtpbi_expo', 'Indtppd_expo', 'Age_premier_contrat', 'SansSinistre', 'Gender_Male', 'Gender_Female', 'Type_C', 'Type_E', 'Type_D', 'Type_B', 'Type_A', 'Type_F', 'Category_Large', 'Category_Medium', 'Category_Small', 'Occupation_Employed', 'Occupation_Unemployed', 'Occupation_Housewife', 'Occupation_Self-employed', 'Occupation_Retired', 'Group1_18', 'Group1_11', 'Group1_5', 'Group1_12', 'Group1_13', 'Group1_7', 'Group1_3', 'Group1_9', 'Group1_14', 'Group1_8', 'Group1_6', 'Group1_20', 'Group1_16', 'Group1_4', 'Group1_1', 'Group1_10', 'Group1_2', 'Group1_15', 'Group1_19', 'Group1_17', 'Group2_L', 'Group2_O', 'Group2_Q', 'Group2_N', 'Group2_R', 'Group2_M', 'Group2_T', 'Group2_P', 'Group2_U', 'Group2_S']
66 variables
for col in ['SubGroup2', 'Numtppd', 'Numtpbi', 'Indtppd', 'Indtpbi', 'Indtpbi_expo', 'Indtppd_expo','IndtTotal', 'LogIndtTotal']:
col_list.remove(col)
Model 1: binary classification of the claim risk
We use an XGBoost model for this binary classification.
def analyze_model(model_pred_proba, actual_target):
np.set_printoptions(precision=2)
# Transform predictions to input uniform data into the confusion matrix function
y_pred = [0 if x < 0.5 else 1 for x in model_pred_proba]
class_names = [0, 1]
# Plot non-normalized confusion matrix
plot_confusion_matrix(actual_target, y_pred, classes=class_names,
title='Confusion matrix')
plt.show()
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(y_true, y_pred, classes,
normalize=False,
title=None,
cmap=plt.cm.Blues):
if not title:
if normalize:
title = 'Normalized confusion matrix'
else:
title = 'Confusion matrix, without normalization'
# Compute confusion matrix
cm = confusion_matrix(y_true, y_pred)
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
ax.figure.colorbar(im, ax=ax)
# We want to show all ticks...
ax.set(xticks=np.arange(cm.shape[1]),
yticks=np.arange(cm.shape[0]),
# ... and label them with the respective list entries
xticklabels=classes, yticklabels=classes,
title=title,
ylabel='True label',
xlabel='Predicted label')
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
for j in range(cm.shape[1]):
ax.text(j, i, format(cm[i, j], fmt),
ha="center", va="center",
color="white" if cm[i, j] > thresh else "black")
fig.tight_layout()
return ax
# First XGBClassifier, used to estimate the claim risk on a contract
import xgboost as xgb
from sklearn import metrics
from xgboost import plot_importance
# We split the training set by calendar year so that the 2010 hold-out gives the best possible estimate of year 2011
X = train_dummies[col_list]
X_train = X[X["CalYear"] == 2009]
X_test = X[X["CalYear"] == 2010]
# We turn the Numtppd and Numtpbi variables into a binary target
y_train_Numt = training_df[training_df["CalYear"] == 2009].apply(lambda row: 0 if (row['Numtppd']+row['Numtpbi'])==0 else 1, axis=1)
y_test_Numt = training_df[training_df["CalYear"] == 2010].apply(lambda row: 0 if (row['Numtppd']+row['Numtpbi'])==0 else 1, axis=1)
# Signed (-1/1) version of the targets for the XGBoost model
y_train_Numt_s = training_df[training_df["CalYear"] == 2009].apply(lambda row: -1 if (row['Numtppd']+row['Numtpbi'])==0 else 1, axis=1)
y_test_Numt_s = training_df[training_df["CalYear"] == 2010].apply(lambda row: -1 if (row['Numtppd']+row['Numtpbi'])==0 else 1, axis=1)
col_list.remove("Exppdays")
X_2011 = pricing_dummies[col_list]
The results show a large number of false positives. We choose to give more weight to the minority class (policyholders who had a claim) in order to reduce the number of false positives.
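A common heuristic for scale_pos_weight is the ratio of negative to positive examples in the training set; the value below is only a starting point, and the grid search further on also explores it:
# Heuristic value for scale_pos_weight: count(negatives) / count(positives)
neg, pos = (y_train_Numt == 0).sum(), (y_train_Numt == 1).sum()
print("Suggested scale_pos_weight ~ %.1f" % (neg / pos))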
###########################################################################################
# First classifier for the binary claim target: "Numtppd" or "Numtpbi"
###########################################################################################
from sklearn.metrics import roc_auc_score
xgb_clf = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=0.7, gamma=0, learning_rate=0.01, max_delta_step=0,
max_depth=2, min_child_weight=1, missing=9999999999,
n_estimators=600, n_jobs=-1, nthread=None,
objective='binary:logistic', random_state=0, reg_alpha=0.01,
reg_lambda=1, scale_pos_weight=8, seed=None, silent=True,
subsample=0.4)
xgb_clf.fit(X_train, y_train_Numt_s)
xgb_clf_pred_proba = xgb_clf.predict_proba(X_test)[:, 1]
# Plot the ROC curve
fpr, tpr, _ = metrics.roc_curve(y_test_Numt, xgb_clf_pred_proba)
plt.plot(fpr,tpr)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.title("ROC Curve")
plt.show()
print("Score AUC : %.4f" % roc_auc_score( y_test_Numt, xgb_clf_pred_proba))
# Display the confusion matrix
analyze_model(xgb_clf_pred_proba, y_test_Numt)
plot_importance(xgb_clf, max_num_features=20)
plt.show()
Score AUC : 0.7393
Confusion matrix, without normalization
[[22601 19214]
[ 1740 6270]]
Hyperparameter optimisation
# Hyperparameter search with stratified cross-validation
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from random import seed
params={
'max_depth': [2, 3, 4],
'subsample': [0.5],
'colsample_bytree': [ 0.6, 0.8 ],
'colsample_bylevel' : [1],
'n_estimators': [ 200, 400, 600 ],
'scale_pos_weight' : [8, 10, 12],
'learning_rate' : [0.005, 0.01, 0.1],
'reg_alpha' : [0.05, 0.1],
'reg_lambda': [0.1, 0.5],
'min_child_weight' : [1]
}
xgb_model = xgb.XGBClassifier()
clf = GridSearchCV(xgb_model, params, n_jobs=-1,
cv=StratifiedKFold( n_splits=3, shuffle=True),
scoring='roc_auc',
verbose=2, refit=True)
clf.fit(X_train, y_train_Numt_s)
Fitting 3 folds for each of 648 candidates, totalling 1944 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 25 tasks | elapsed: 38.1s
[Parallel(n_jobs=-1)]: Done 146 tasks | elapsed: 6.6min
[Parallel(n_jobs=-1)]: Done 349 tasks | elapsed: 19.2min
[Parallel(n_jobs=-1)]: Done 632 tasks | elapsed: 34.5min
[Parallel(n_jobs=-1)]: Done 997 tasks | elapsed: 53.4min
[Parallel(n_jobs=-1)]: Done 1442 tasks | elapsed: 79.8min
[Parallel(n_jobs=-1)]: Done 1944 out of 1944 | elapsed: 113.6min finished
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-57-3f19e2fbc935> in <module>
27 clf.fit(X_train, y_train_Numt_s)
28
---> 29 best_parameters, score, _ = max(clf.grid_scores_, key=lambda x: x[1])
30 print('Score AUC : ', score)
AttributeError: 'GridSearchCV' object has no attribute 'grid_scores_'
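The grid_scores_ attribute no longer exists in recent scikit-learn versions; the fitted search object exposes its results through best_score_, best_params_ and cv_results_ instead:
# Read the cross-validation results with the current GridSearchCV API
print("Best AUC    : %.4f" % clf.best_score_)
print("Best params :", clf.best_params_)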
best_est = clf.best_estimator_
print(best_est)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.8, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=2,
min_child_weight=1, missing=None, n_estimators=200, n_jobs=1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0.05, reg_lambda=0.5, scale_pos_weight=8, seed=None,
silent=None, subsample=0.5, verbosity=1)
# We drop Exppdays, which is absent from the 2011 dataset to predict
# this variable is folded into the claim-cost target of model 2
del X_train["Exppdays"]
With these optimised hyperparameters, the number of false positives has clearly decreased, although the number of false negatives has increased. The AUC has also improved.
clf_pred_proba = clf.predict_proba(X_test)[:, 1]
# Plot the ROC curve
fpr, tpr, _ = metrics.roc_curve(y_test_Numt, clf_pred_proba)
plt.plot(fpr,tpr)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.title("ROC Curve")
plt.show()
print("Score AUC : %.4f" % roc_auc_score( y_test_Numt, clf_pred_proba))
# Display the confusion matrix
analyze_model(clf_pred_proba, y_test_Numt)
# Analysis of the most influential variables
xgb_grid = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.8, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=2,
min_child_weight=1, missing=None, n_estimators=200, n_jobs=1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0.05, reg_lambda=0.5, scale_pos_weight=8, seed=None,
silent=None, subsample=0.5, verbosity=1)
xgb_grid.fit(X_train, y_train_Numt_s)
plot_importance(xgb_grid, max_num_features=20)
plt.show()
Score AUC : 0.7477
Confusion matrix, without normalization
[[34449 7366]
[ 3888 4122]]
Analysis of the probabilities associated with the classes of our confusion matrix
result_df = pd.DataFrame({'True_Value':y_test_Numt, 'Pred_Value':clf_pred_proba})
pred_1_true_0 = result_df[(result_df["Pred_Value"]>0.5) & (result_df["True_Value"]==0)]
print( " Proba des Faux positifs : " + str(pred_1_true_0.Pred_Value.mean()) )
pred_1_true_1 = result_df[(result_df["Pred_Value"]>0.5) & (result_df["True_Value"]==1)]
print( " Proba des Vrais positifs : " + str(pred_1_true_1.Pred_Value.mean()) )
pred_0_true_1 = result_df[(result_df["Pred_Value"]<0.5) & (result_df["True_Value"]==1)]
print( " Proba des Faux négatifs : " + str(pred_0_true_1.Pred_Value.mean()) )
pred_0_true_0 = result_df[(result_df["Pred_Value"]<0.5) & (result_df["True_Value"]==0)]
print( " Proba des Vrais négatifs : " + str(pred_0_true_0.Pred_Value.mean()) )
Proba des Faux positifs : 0.616603
Proba des Vrais positifs : 0.66039294
Proba des Faux négatifs : 0.3152266
Proba des Vrais négatifs : 0.2463552
Retraining on 2009 and 2010, then predicting 2011
Our hyperparameters are now fixed. We train our model on the whole training set.
del X["Exppdays"]
y_train_final_s = training_df.apply(lambda row: -1 if (row['Numtppd']+row['Numtpbi'])==0 else 1, axis=1)
xgb_grid = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.8, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=2,
min_child_weight=1, missing=None, n_estimators=200, n_jobs=1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0.05, reg_lambda=0.5, scale_pos_weight=8, seed=None,
silent=None, subsample=0.5, verbosity=1)
xgb_grid.fit(X, y_train_final_s)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.8, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=2,
min_child_weight=1, missing=None, n_estimators=200, n_jobs=1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0.05, reg_lambda=0.5, scale_pos_weight=8, seed=None,
silent=None, subsample=0.5, verbosity=1)
xgb_grid_pred_proba = xgb_grid.predict_proba(X_2011)[:, 1]
pd.DataFrame({"proba_pred":xgb_grid_pred_proba}).to_csv("./prediction_prob_sinistre.csv")
Model 2: NN regression of the claim cost
Data preparation
X = train_dummies[col_list]
Index(['CalYear', 'Age', 'Bonus', 'Poldur', 'Value', 'Adind', 'Density',
'IndtTotal', 'LogIndtTotal', 'Age_premier_contrat', 'SansSinistre',
'Gender_Male', 'Gender_Female', 'Type_C', 'Type_E', 'Type_D', 'Type_B',
'Type_A', 'Type_F', 'Category_Large', 'Category_Medium',
'Category_Small', 'Occupation_Employed', 'Occupation_Unemployed',
'Occupation_Housewife', 'Occupation_Self-employed',
'Occupation_Retired', 'Group1_18', 'Group1_11', 'Group1_5', 'Group1_12',
'Group1_13', 'Group1_7', 'Group1_3', 'Group1_9', 'Group1_14',
'Group1_8', 'Group1_6', 'Group1_20', 'Group1_16', 'Group1_4',
'Group1_1', 'Group1_10', 'Group1_2', 'Group1_15', 'Group1_19',
'Group1_17', 'Group2_L', 'Group2_O', 'Group2_Q', 'Group2_N', 'Group2_R',
'Group2_M', 'Group2_T', 'Group2_P', 'Group2_U', 'Group2_S'],
dtype='object')
train_dummies, pricing_dummies = dummies(training_df, pricing_df , columns = categ_col_list)
col_list = train_dummies.columns.to_list()
print(col_list)
print(str(len(col_list))+ " variables")
for col in ['SubGroup2', 'Numtppd', "Exppdays", 'Numtpbi','Indtpbi', 'Indtppd', 'Indtppd_expo', 'Indtpbi_expo', 'PolNum']:
col_list.remove(col)
# Regression target: log of the total claim cost
y_train_LogIndt = train_dummies.loc[train_dummies["CalYear"] == 2009, 'LogIndtTotal' ]
y_test_LogIndt = train_dummies.loc[train_dummies["CalYear"] == 2010, 'LogIndtTotal']
# Keep only the policies with claims
# We train on 2009
X_train = train_dummies[train_dummies["CalYear"] == 2009]
X_test = train_dummies[train_dummies["CalYear"] == 2010]
# keeping only policyholders with claims
X_train = X_train.loc[ X_train['IndtTotal'] > 0, col_list ]
X_test = X_test.loc[ X_test['IndtTotal'] > 0, col_list ]
y_train_LogIndt = X_train.loc[X_train['IndtTotal'] > 0, "LogIndtTotal"]
y_test_LogIndt = X_test.loc[X_test['IndtTotal'] > 0, "LogIndtTotal"]
y_X_LogIndt = X.loc[X['IndtTotal'] > 0, "LogIndtTotal"]
for col in ['LogIndtTotal', 'IndtTotal']:
del X_train[col]
del X_test[col]
del X[col]
['PolNum', 'CalYear', 'Age', 'Bonus', 'Poldur', 'Value', 'Adind', 'SubGroup2', 'Density', 'Exppdays', 'Numtppd', 'Numtpbi', 'Indtppd', 'Indtpbi', 'IndtTotal', 'LogIndtTotal', 'Indtpbi_expo', 'Indtppd_expo', 'Age_premier_contrat', 'SansSinistre', 'Gender_Male', 'Gender_Female', 'Type_C', 'Type_E', 'Type_D', 'Type_B', 'Type_A', 'Type_F', 'Category_Large', 'Category_Medium', 'Category_Small', 'Occupation_Employed', 'Occupation_Unemployed', 'Occupation_Housewife', 'Occupation_Self-employed', 'Occupation_Retired', 'Group1_18', 'Group1_11', 'Group1_5', 'Group1_12', 'Group1_13', 'Group1_7', 'Group1_3', 'Group1_9', 'Group1_14', 'Group1_8', 'Group1_6', 'Group1_20', 'Group1_16', 'Group1_4', 'Group1_1', 'Group1_10', 'Group1_2', 'Group1_15', 'Group1_19', 'Group1_17', 'Group2_L', 'Group2_O', 'Group2_Q', 'Group2_N', 'Group2_R', 'Group2_M', 'Group2_T', 'Group2_P', 'Group2_U', 'Group2_S']
66 variables
print(X_train.shape)
print(X_test.shape)
(7565, 55)
(8010, 55)
# We standardise our variables
X_train.columns
Index(['CalYear', 'Age', 'Bonus', 'Poldur', 'Value', 'Adind', 'Density',
'Age_premier_contrat', 'SansSinistre', 'Gender_Male', 'Gender_Female',
'Type_C', 'Type_E', 'Type_D', 'Type_B', 'Type_A', 'Type_F',
'Category_Large', 'Category_Medium', 'Category_Small',
'Occupation_Employed', 'Occupation_Unemployed', 'Occupation_Housewife',
'Occupation_Self-employed', 'Occupation_Retired', 'Group1_18',
'Group1_11', 'Group1_5', 'Group1_12', 'Group1_13', 'Group1_7',
'Group1_3', 'Group1_9', 'Group1_14', 'Group1_8', 'Group1_6',
'Group1_20', 'Group1_16', 'Group1_4', 'Group1_1', 'Group1_10',
'Group1_2', 'Group1_15', 'Group1_19', 'Group1_17', 'Group2_L',
'Group2_O', 'Group2_Q', 'Group2_N', 'Group2_R', 'Group2_M', 'Group2_T',
'Group2_P', 'Group2_U', 'Group2_S'],
dtype='object')
Regression with a neural network
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD
from sklearn import preprocessing
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Scale data for NN (fit the scaler on the training set only)
scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
n_epochs = 100
def create_baseline():
    # Build the model
    model = Sequential()
    model.add(Dense(512, activation='relu', input_shape=(X_train_scaled.shape[1],)))
    model.add(Dense(256, activation='relu'))
    model.add(Dense(128, activation='linear'))
    model.add(Dense(1))
    # Compile the model
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model
model_reg = create_baseline()
history_reg = model_reg.fit(X_train_scaled, y_train_LogIndt,
validation_data=(X_test_scaled, y_test_LogIndt),
epochs=n_epochs,
verbose=1,
batch_size=1024)
Train on 7565 samples, validate on 8010 samples
Epoch 1/100
7565/7565 [==============================] - 2s 315us/step - loss: 15.2680 - mean_absolute_error: 3.2318 - val_loss: 4.5242 - val_mean_absolute_error: 1.6506
Epoch 5/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.6552 - mean_absolute_error: 1.2799 - val_loss: 2.4886 - val_mean_absolute_error: 1.1919
Epoch 9/100
7565/7565 [==============================] - 0s 14us/step - loss: 2.3547 - mean_absolute_error: 1.1822 - val_loss: 2.4065 - val_mean_absolute_error: 1.1876
Epoch 10/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.3294 - mean_absolute_error: 1.1667 - val_loss: 2.4289 - val_mean_absolute_error: 1.2053
[... epochs 11-99 omitted: the training loss keeps dropping (down to ~0.18) while the validation loss keeps rising ...]
Epoch 100/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.1752 - mean_absolute_error: 0.3160 - val_loss: 3.8977 - val_mean_absolute_error: 1.5471
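The validation loss bottoms out around epoch 9 and then rises steadily: the network overfits. A sketch of the same training with early stopping (restore_best_weights requires a reasonably recent Keras):
from keras.callbacks import EarlyStopping
# Stop once val_loss has not improved for 10 epochs and roll back to the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model_reg = create_baseline()
history_reg = model_reg.fit(X_train_scaled, y_train_LogIndt,
                            validation_data=(X_test_scaled, y_test_LogIndt),
                            epochs=n_epochs,
                            batch_size=1024,
                            verbose=0,
                            callbacks=[early_stop])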
Regression with XGBoost
#XGBoost parameters list
xgb_params = {
"colsample_bytree" : 0.2,
"gamma" : 0,
"learning_rate" : 0.01,
"max_depth":3,
"reg_alpha":0.8,
"reg_lambda": 0.6,
"subsample": 0.5,
'objective': 'reg:linear',
'eval_metric': 'rmse',
'min_child_weight':1, #1.5
'silent': 1,
'seed':42
}
#Data Matrix is an internal data structure used by XGBoost and optimized for both memory efficiency and training speed
xgtrain = xgb.DMatrix(X_train, y_train_LogIndt, feature_names=X_train.columns)
xgtest = xgb.DMatrix(X_test, y_test_LogIndt, feature_names=X_test.columns)
watchlist = [ (xgtrain,'train'), (xgtest, 'test') ]
#Increase the number of rounds when running locally
num_rounds = 30000
#Train until the test RMSE stops improving (early stopping)
xgb_model = xgb.train(xgb_params, xgtrain, num_rounds, watchlist, early_stopping_rounds=3000, verbose_eval=200)
[0] train-rmse:5.94873 test-rmse:6.01846
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.
Will train until test-rmse hasn't improved in 3000 rounds.
C:\Users\query\Anaconda3_New\envs\tf_gpu\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
if getattr(data, 'base', None) is not None and \
[200] train-rmse:1.74871 test-rmse:1.76097
[400] train-rmse:1.56195 test-rmse:1.54713
[600] train-rmse:1.54913 test-rmse:1.53645
[800] train-rmse:1.54155 test-rmse:1.53476
[1000] train-rmse:1.53422 test-rmse:1.5344
[1200] train-rmse:1.52804 test-rmse:1.53452
[1400] train-rmse:1.52189 test-rmse:1.53434
[1600] train-rmse:1.51561 test-rmse:1.53463
[1800] train-rmse:1.50961 test-rmse:1.53534
[2000] train-rmse:1.50402 test-rmse:1.53558
[2200] train-rmse:1.49826 test-rmse:1.53675
[2400] train-rmse:1.49321 test-rmse:1.53734
[2600] train-rmse:1.48745 test-rmse:1.53793
[2800] train-rmse:1.4825 test-rmse:1.53851
[3000] train-rmse:1.47743 test-rmse:1.53932
[3200] train-rmse:1.47257 test-rmse:1.53953
[3400] train-rmse:1.46751 test-rmse:1.54063
[3600] train-rmse:1.46279 test-rmse:1.54136
[3800] train-rmse:1.458 test-rmse:1.54194
[4000] train-rmse:1.4533 test-rmse:1.54259
Stopping. Best iteration:
[1096] train-rmse:1.5313 test-rmse:1.5342
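With early stopping, the returned booster records the best iteration; predictions can then be limited to that number of trees (a short sketch using the attributes set by xgb.train):
# Predict on the validation matrix using only the trees up to the best iteration
best_pred = xgb_model.predict(xgtest, ntree_limit=xgb_model.best_ntree_limit)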
X_train.columns
Index(['CalYear', 'Age', 'Bonus', 'Poldur', 'Value', 'Adind', 'Density',
'Age_premier_contrat', 'SansSinistre', 'Gender_Male', 'Gender_Female',
'Type_C', 'Type_E', 'Type_D', 'Type_B', 'Type_A', 'Type_F',
'Category_Large', 'Category_Medium', 'Category_Small',
'Occupation_Employed', 'Occupation_Unemployed', 'Occupation_Housewife',
'Occupation_Self-employed', 'Occupation_Retired', 'Group1_18',
'Group1_11', 'Group1_5', 'Group1_12', 'Group1_13', 'Group1_7',
'Group1_3', 'Group1_9', 'Group1_14', 'Group1_8', 'Group1_6',
'Group1_20', 'Group1_16', 'Group1_4', 'Group1_1', 'Group1_10',
'Group1_2', 'Group1_15', 'Group1_19', 'Group1_17', 'Group2_L',
'Group2_O', 'Group2_Q', 'Group2_N', 'Group2_R', 'Group2_M', 'Group2_T',
'Group2_P', 'Group2_U', 'Group2_S'],
dtype='object')
#Plot XGB feature importance
fig, ax = plt.subplots(figsize=(8, 12))
xgb.plot_importance(xgb_model, ax=ax)
import operator
importance = xgb_model.get_fscore()
importance = sorted(importance.items(), key=operator.itemgetter(1), reverse=True)
# Keep every feature, ordered by decreasing importance
xgb_feature_list = [feature for feature, score in importance]
xgb_feature_list_df = pd.DataFrame(xgb_feature_list)
features_list = ["Value", "Density", "Age", "Age_premier_contrat", "Bonus", 'Poldur']
#xgb_reg = xgboost.XGBRegressor(n_estimators=100, learning_rate=0.08, gamma=0, subsample=0.75,
# colsample_bytree=1, max_depth=7)
xgb_reg = xgb.XGBRegressor( colsample_bytree = 0.1, gamma=0, learning_rate= 0.01, max_depth=4, reg_alpha=0.8,
reg_lambda= 0.6, subsample= 0.8,
objective ='reg:linear', eval_metric='rmse', min_child_weight=1, seed=42)
xgb_reg.fit(X_train[features_list],y_train_LogIndt)
predictions = xgb_reg.predict(X_test[features_list], ntree_limit=1496)
predictions
[16:22:01] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
array([4.16, 4.11, 4.17, ..., 4.13, 4.13, 4.25], dtype=float32)
# Total predicted claim cost
np.expm1(predictions).astype(int).sum()
506798
The predicted claim costs turn out to be very far from the observed amounts.
We therefore use these estimates as scores, normalising them by their maximum value.
Estimating the claim costs for 2011
X_train_cout= train_dummies.loc[train_dummies["IndtTotal"]>0 , features_list]
y_X_LogIndt = train_dummies.loc[train_dummies["IndtTotal"]>0 , "LogIndtTotal"]
xgb_reg = xgb.XGBRegressor( colsample_bytree = 0.1, gamma=0, learning_rate= 0.01, max_depth=4, reg_alpha=0.8,
reg_lambda= 0.6, subsample= 0.8,
objective ='reg:linear', eval_metric='rmse', min_child_weight=1, seed=42)
xgb_reg.fit(X_train_cout, y_X_LogIndt)
predictions = xgb_reg.predict(X_2011[features_list], ntree_limit=1496)
predictions = predictions/max(predictions)
pd.DataFrame({"score_cout_sinistre":predictions}).to_csv("./score_cout_sinistre.csv")
[16:34:40] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Defining an insurance premium from the two models we built
t2009 = training_df[training_df.CalYear == 2009]
t2010 = training_df[training_df.CalYear == 2010]
nba2009 = t2009[(t2009.Numtppd != 0) | (t2009.Numtpbi != 0)].shape[0]
nba2010 = t2010[(t2010.Numtppd != 0) | (t2010.Numtpbi != 0)].shape[0]
total2009=training_df[training_df.CalYear==2009].Indtppd.sum()+training_df[training_df.CalYear==2009].Indtpbi.sum()
print("total 2009 = ",round(total2009)," euros")
total2010=training_df[training_df.CalYear==2010].Indtppd.sum()+training_df[training_df.CalYear==2010].Indtpbi.sum()
print("total 2010 = ",round(total2010)," euros")
total 2009 = 15357219.0 euros
total 2010 = 17539471.0 euros
We compute the average cost per accident.
pma2009=total2009/nba2009
pma2010=total2010/nba2010
print("Prix moyen par accident (2009) ",round(pma2009))
print("Prix moyen par accident (2010) ",round(pma2010))
Prix moyen par accident (2009) 1987.0
Prix moyen par accident (2010) 2143.0
We compute the average cost per accident over 2009 and 2010 combined.
total=training_df.Indtppd.sum()+training_df.Indtpbi.sum()
cout_moyen=total/(nba2009+nba2010)
print(cout_moyen)
2067.4139303798534
Reading the claim-probability predictions (xgboost).
pred=pd.read_csv("prediction_prob_sinistre.csv")
We estimate the number of predicted accidents.
n=pred[pred.proba_pred>0.5].count()[0]
We estimate the total claim cost. It is overestimated, which will leave us a margin.
S=n*cout_moyen
print(round(S),"euros")
36808238.0 euros
We compute weighting coefficients, linear in the predicted claim probability (roughly a tenfold spread between the lowest and highest weight).
pred['coeff']=pred['proba_pred']*10 + 1
Sum of the coefficients
C=pred['coeff'].sum()
print(C)
220470.62589862
Computing the premium for each policyholder
pred['prime']=(pred['coeff']*S)/C
Checking the weighting (the premiums sum back to S)
round(pred['prime'].sum())
36808238.0
Premium statistics
pred['prime'].describe()
count 36311.000000
mean 1013.693856
std 334.333860
min 250.738988
25% 762.103640
50% 991.660942
75% 1289.144943
max 1777.911799
Name: prime, dtype: float64
We save the premiums.
pred.to_csv("./prediction_prime_sinistre.csv")
Reading the claim-amount predictions (xgboost)
pred2=pd.read_csv("./score_cout_sinistre.csv")
pred2.keys()
Index(['Unnamed: 0', 'score_cout_sinistre'], dtype='object')
Incorporating the score on the claim amount.
pred['proba_cout']=pred2['score_cout_sinistre']
pred['coeff2']=pred['proba_pred']*pred['proba_cout']*10 + 1
C2=pred['coeff2'].sum()
pred['prime2']=(pred['coeff2']*S)/C2
round(pred['prime2'].sum())
36808238.0
pred.to_csv("./prediction_prime_sinistre.csv")
pricing_df['tarif']=pred['prime2']
We save the prices.
pricing_df[['PolNum','tarif']].to_csv("./polnum_tarifs.csv")
Conclusion
Despite a fairly high estimate of the total claim cost, our premium predictions seem consistent.
We could further improve our predictions by working on the following points:
- modelling the 2% of policyholders we treated as outliers with a generalised Pareto distribution (see the sketch after this list),
- separate models for material damage and bodily injury,
- clustering (age, density, etc.),
- improving our models by creating variables based on per-group averages,
- applying a correction to the premium amounts in order to win more contracts.
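As an illustration of the first point, a minimal sketch of fitting a generalised Pareto distribution to the tail removed earlier (assumes scipy is available; the series and threshold come from the outlier analysis above):
from scipy.stats import genpareto
# Excesses over the 98th-percentile threshold used to define the outliers
excesses = train_couts_sinistre[train_couts_sinistre > percentile_98] - percentile_98
shape, loc, scale = genpareto.fit(excesses, floc=0)
print("GPD shape: %.3f, scale: %.1f" % (shape, scale))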