Workshop Actuaria

Pricing Game


Goal of this study: build a profitable company and win as many contracts as possible in 2011.

We will predict an optimal premium for each new contract. To do so, we build the model in two steps:

  • a logistic-loss classification model (does a claim occur or not?),
  • a model estimating claim costs.

These two models are then combined to estimate, under constraints, the optimal premium for each individual.
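Concretely (as done in the last section), the premium is obtained by allocating an overall estimated claim budget $S$ across contracts, proportionally to a coefficient built from the two model outputs:

$$\mathrm{coeff}_i = 10\,\hat p_i\,\hat s_i + 1, \qquad \mathrm{prime}_i = \frac{\mathrm{coeff}_i}{\sum_j \mathrm{coeff}_j}\,S,$$

where $\hat p_i$ is the predicted claim probability and $\hat s_i$ the normalized claim-cost score of contract $i$.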

Preliminary exploration

import pandas as pd
import numpy as np
%matplotlib inline
import xgboost as xgb
from sklearn import metrics
from xgboost import plot_importance

# Read the source data for the game
training_df = pd.read_csv("training.csv", sep=';')
pricing_df = pd.read_csv("pricing.csv", sep=',')
del pricing_df["Unnamed: 0"]

print(training_df.shape)
print(pricing_df.shape)
(100021, 20)
(36311, 15)
print(training_df.dtypes)
training_df.describe()
PolNum          int64
CalYear         int64
Gender         object
Type           object
Category       object
Occupation     object
Age             int64
Group1          int64
Bonus           int64
Poldur          int64
Value           int64
Adind           int64
SubGroup2      object
Group2         object
Density       float64
Exppdays        int64
Numtppd         int64
Numtpbi         int64
Indtppd       float64
Indtpbi       float64
dtype: object

          count         mean          std           min           25%           50%           75%           max
PolNum    1.000210e+05  2.002003e+08  6.217239e+04  2.001149e+08  2.001399e+08  2.001649e+08  2.002608e+08  2.002858e+08
CalYear   100021.0      2009.499895   0.500002      2009.0        2009.0        2009.0        2010.0        2010.0
Age       100021.0      41.122514     14.299349     18.0          30.0          40.0          51.0          75.0
Group1    100021.0      10.692625     4.687286      1.0           7.0           11.0          14.0          20.0
Bonus     100021.0      -6.921646     48.633165     -50.0         -40.0         -30.0         10.0          150.0
Poldur    100021.0      5.470781      4.591194      0.0           1.0           4.0           9.0           15.0
Value     100021.0      16454.675268  10506.742732  1000.0        8380.0        14610.0       22575.0       49995.0
Adind     100021.0      0.512142      0.499855      0.0           0.0           1.0           1.0           1.0
Density   100021.0      117.159270    79.500907     14.377142     50.625783     94.364623     174.644525    297.385170
Exppdays  100021.0      327.588007    73.564636     91.0          340.0         365.0         365.0         365.0
Numtppd   100021.0      0.147449      0.436917      0.0           0.0           0.0           0.0           7.0
Numtpbi   100021.0      0.046790      0.219546      0.0           0.0           0.0           0.0           3.0
Indtppd   100021.0      106.135007    444.949188    0.0           0.0           0.0           0.0           12878.369910
Indtpbi   100021.0      222.762829    1859.422836   0.0           0.0           0.0           0.0           69068.026292
print(pricing_df.dtypes)
pricing_df.describe()
PolNum          int64
CalYear         int64
Gender         object
Type           object
Category       object
Occupation     object
Age             int64
Group1          int64
Bonus           int64
Poldur          int64
Value           int64
Adind           int64
SubGroup2      object
Group2         object
Density       float64
dtype: object

          count         mean          std           min           25%           50%           75%           max
PolNum    3.631100e+04  2.003507e+08  1.446972e+04  2.003257e+08  2.003381e+08  2.003507e+08  2.003633e+08  2.003757e+08
CalYear   36311.0       2011.0        0.0           2011.0        2011.0        2011.0        2011.0        2011.0
Age       36311.0       41.186087     14.306933     18.0          30.0          40.0          51.0          75.0
Group1    36311.0       10.712153     4.690578      1.0           7.0           11.0          14.0          20.0
Bonus     36311.0       -6.594145     49.074019     -50.0         -40.0         -30.0         10.0          150.0
Poldur    36311.0       5.559335      4.626096      0.0           2.0           4.0           9.0           15.0
Value     36311.0       16518.524689  10528.519190  1005.0        8400.0        14665.0       22700.0       49995.0
Adind     36311.0       0.518465      0.499666      0.0           0.0           1.0           1.0           1.0
Density   36311.0       116.666024    79.787179     14.377142     50.351845     93.382351     171.372936    297.385170

Analysis of the targets

Classification target: claim vs. no claim

import seaborn as sns  
sns.countplot(training_df.apply( lambda x : x["Numtppd"]+x["Numtpbi"],axis=1))
<matplotlib.axes._subplots.AxesSubplot at 0x1a22a86ac8>

png

The number of bodily injury and property damage claims roughly follows a Poisson distribution.

Scoring the claim risk will be challenging because of the strong class imbalance.

Given this distribution, we choose not to include the number of claims in our regression target.
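A quick sanity check (a minimal sketch) of these two observations: for a Poisson-like count, the mean and the variance should be of the same order, and the share of zero-claim contracts quantifies the class imbalance.

# Sketch: compare mean and variance of the claim count, and measure the class imbalance
n_claims = training_df["Numtppd"] + training_df["Numtpbi"]
print("mean: %.3f   variance: %.3f" % (n_claims.mean(), n_claims.var()))
print("share of contracts without any claim: %.1f%%" % (100 * (n_claims == 0).mean()))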

Regression target: sum of the claim costs (property damage and bodily injury)

# Total claim cost prorated by the contract's exposure; the analysis is then restricted to contracts with claims
training_df["IndtTotal"] = training_df.apply( lambda x : (x["Exppdays"]/365) * (x["Indtppd"] + x["Indtpbi"]), axis=1)

train_couts_sinistre = training_df.loc[training_df["IndtTotal"] > 0, "IndtTotal" ]
import seaborn as sns
from matplotlib import pyplot as plt

def plot_dist(target):
    # Configure the display parameters
    plt.style.use(style='ggplot')
    plt.rcParams['figure.figsize'] = (18, 6)
    
    # Plot the distribution of claim costs
    sns.distplot(target.values, bins=50, kde=False)
    plt.xlabel('Claim costs', fontsize=12)
    plt.show()
    
    # Report the skewness (a large positive value confirms the long right tail)
    print("Skewness: %.2f" % target.skew())
    
plot_dist(train_couts_sinistre)    

png

The distribution has a long right tail, with some values that look extreme.

def show_outliers(target):
    plt.figure(figsize=(12,6))
    plt.scatter(range(target.shape[0]), np.sort(target.values),color='blue')
    plt.xlabel('index', fontsize=12)
    plt.ylabel('Claim costs', fontsize=12)
    plt.show()

show_outliers(train_couts_sinistre)

png

The two previous plots suggest that total claim amounts beyond the 98th percentile behave quite differently and should be handled separately.

# Estimate the 98th percentile threshold
percentile_98 = np.percentile(train_couts_sinistre,98)
percentile_98
15851.449026353968
# Drop the rows beyond the threshold chosen as the outlier limit on the target
def remove_outliers(df, ulimit):
    
    return df[df["IndtTotal"]<ulimit] 
    

training_df = remove_outliers(training_df, percentile_98)
show_outliers(training_df.loc[training_df["IndtTotal"]>0, "IndtTotal"])

png

We now rework the distribution of our regression target.

# Reshape the distribution to make learning easier
training_df["LogIndtTotal"] = np.log1p(training_df["IndtTotal"])

# Review the new distribution of our log-target
sns.distplot(training_df.loc[training_df["LogIndtTotal"]>0, "LogIndtTotal"], bins=50, kde=False)
plt.xlabel('Log claim costs', fontsize=12)
plt.show()

png

We obtain a roughly Gaussian shape that should be easier to learn; we will apply the inverse transform (np.expm1) to the predictions to get back to a final prediction on the original scale.
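As a reminder, np.expm1 is the exact inverse of the np.log1p transform used above (a minimal check):

# Round-trip check: expm1(log1p(x)) recovers x
x = np.array([0.0, 150.0, 2500.0, 15000.0])
assert np.allclose(np.expm1(np.log1p(x)), x)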

Analysis of the training set variables

Looking for correlations between the targets and the variables

def display_categ(categ, target):
    fig, ax = plt.subplots(figsize=(12,8))
    ax = sns.boxplot(x=categ, y=target, data=training_df[(training_df[target]>0) &(training_df[target]<20000) ])  

display_categ("Gender", "Indtppd")

png

In the Occupation plot below, the Unemployed and Retired classes stand out.

display_categ("Occupation", "Indtppd")

png

display_categ("Type", "Indtppd")

png

display_categ("Type", "Indtpbi")

png

display_categ("Group1", "Indtpbi")

png

Feature Engineering

import numpy as np

# The targets Indtpbi and Indtppd must be adjusted for their exposure
training_df["Indtpbi_expo"] = training_df["Indtpbi"] * training_df["Exppdays"]
training_df["Indtppd_expo"] = training_df["Indtppd"] * training_df["Exppdays"]

# Fix Poldur and create Age_premier_contrat (age at first contract)
# Poldur looks inconsistent in some cases, so we enforce the rule: Age - Poldur >= 18
training_df["Age_premier_contrat"] = training_df["Age"] - training_df["Poldur"]
training_df["Age_premier_contrat"] = training_df.apply(lambda x :  max(18, x["Age_premier_contrat"]), axis=1)
training_df["Poldur"] = training_df["Age"] - training_df["Age_premier_contrat"]

pricing_df["Age_premier_contrat"] = pricing_df["Age"] - pricing_df["Poldur"]
pricing_df["Age_premier_contrat"] = pricing_df.apply(lambda x :  max(18, x["Age_premier_contrat"]), axis=1)
pricing_df["Poldur"] = pricing_df["Age"] - pricing_df["Age_premier_contrat"]



We create bonus-related variables that may add predictive weight to our models.

# Feature: has never had an accident
train_work_df = training_df.copy()
train_work_df["BonusNorm"] = train_work_df.apply(lambda x : 1+x["Bonus"]/100, axis=1)

train_work_df["BonusMax"] = train_work_df.apply(lambda x : 0.95**x["Poldur"], axis=1)
train_work_df["Bonustheorique"] = train_work_df.apply(lambda x : max(x["BonusMax"], 0.5), axis=1)

train_work_df["SansSinistre"] = train_work_df.apply(lambda x : int(x["Bonustheorique"] == x["BonusNorm"]), axis=1 )
training_df["SansSinistre"] = train_work_df["SansSinistre"].copy()



pricing_work_df = pricing_df.copy()
pricing_work_df["BonusNorm"] = pricing_work_df.apply(lambda x : 1+x["Bonus"]/100, axis=1)

pricing_work_df["BonusMax"] = pricing_work_df.apply(lambda x : 0.95**x["Poldur"], axis=1)
pricing_work_df["Bonustheorique"] = pricing_work_df.apply(lambda x : max(x["BonusMax"], 0.5), axis=1)

pricing_work_df["SansSinistre"] = pricing_work_df.apply(lambda x : int(x["Bonustheorique"] == x["BonusNorm"]), axis=1 )
pricing_df["SansSinistre"] = pricing_work_df["SansSinistre"].copy()

train_work_df.loc[train_work_df["SansSinistre"]==1,["Bonus","Poldur","BonusNorm","BonusMax","Bonustheorique", "SansSinistre" ]]

        Bonus  Poldur  BonusNorm  BonusMax  Bonustheorique  SansSinistre
4           0       0        1.0  1.000000             1.0             1
19          0       0        1.0  1.000000             1.0             1
125         0       0        1.0  1.000000             1.0             1
142         0       0        1.0  1.000000             1.0             1
151       -50      15        0.5  0.463291             0.5             1
162         0       0        1.0  1.000000             1.0             1
188       -50      15        0.5  0.463291             0.5             1
207         0       0        1.0  1.000000             1.0             1
229         0       0        1.0  1.000000             1.0             1
253       -50      14        0.5  0.487675             0.5             1
271         0       0        1.0  1.000000             1.0             1
295       -50      14        0.5  0.487675             0.5             1
300         0       0        1.0  1.000000             1.0             1
316       -50      14        0.5  0.487675             0.5             1
335       -50      14        0.5  0.487675             0.5             1
348         0       0        1.0  1.000000             1.0             1
363         0       0        1.0  1.000000             1.0             1
365         0       0        1.0  1.000000             1.0             1
416         0       0        1.0  1.000000             1.0             1
439         0       0        1.0  1.000000             1.0             1
462       -50      14        0.5  0.487675             0.5             1
515       -50      15        0.5  0.463291             0.5             1
562         0       0        1.0  1.000000             1.0             1
594       -50      15        0.5  0.463291             0.5             1
595         0       0        1.0  1.000000             1.0             1
651         0       0        1.0  1.000000             1.0             1
660         0       0        1.0  1.000000             1.0             1
678       -50      15        0.5  0.463291             0.5             1
699         0       0        1.0  1.000000             1.0             1
739         0       0        1.0  1.000000             1.0             1
...       ...     ...        ...       ...             ...           ...
99352       0       0        1.0  1.000000             1.0             1
99419       0       0        1.0  1.000000             1.0             1
99422     -50      14        0.5  0.487675             0.5             1
99436       0       0        1.0  1.000000             1.0             1
99447     -50      14        0.5  0.487675             0.5             1
99510     -50      15        0.5  0.463291             0.5             1
99511       0       0        1.0  1.000000             1.0             1
99542       0       0        1.0  1.000000             1.0             1
99584       0       0        1.0  1.000000             1.0             1
99589       0       0        1.0  1.000000             1.0             1
99608     -50      15        0.5  0.463291             0.5             1
99627     -50      14        0.5  0.487675             0.5             1
99632     -50      15        0.5  0.463291             0.5             1
99637     -50      15        0.5  0.463291             0.5             1
99640       0       0        1.0  1.000000             1.0             1
99646     -50      14        0.5  0.487675             0.5             1
99654       0       0        1.0  1.000000             1.0             1
99703       0       0        1.0  1.000000             1.0             1
99711     -50      15        0.5  0.463291             0.5             1
99729     -50      15        0.5  0.463291             0.5             1
99738       0       0        1.0  1.000000             1.0             1
99746     -50      15        0.5  0.463291             0.5             1
99769       0       0        1.0  1.000000             1.0             1
99852     -50      15        0.5  0.463291             0.5             1
99880       0       0        1.0  1.000000             1.0             1
99918     -50      15        0.5  0.463291             0.5             1
99937     -50      15        0.5  0.463291             0.5             1
99939       0       0        1.0  1.000000             1.0             1
99958     -50      15        0.5  0.463291             0.5             1
99997       0       0        1.0  1.000000             1.0             1

4211 rows × 6 columns

sns.lineplot(data =train_work_df[train_work_df["CalYear"]==2010], y="BonusNorm", x="Poldur")
<matplotlib.axes._subplots.AxesSubplot at 0x20c2f720240>

png

sns.countplot(x="Poldur", data=training_df)
<matplotlib.axes._subplots.AxesSubplot at 0x20c2defac88>

png

Encoding the categorical variables

import seaborn as sns

# Treat the vehicle variable Group1 as categorical
training_df["Group1"]  = training_df["Group1"].astype(object)
pricing_df["Group1"] = pricing_df["Group1"].astype(object)

# Build the list of categorical columns to encode
categ_col_list = list(training_df.select_dtypes(include=[object]))
categ_col_list.remove("SubGroup2") # too many distinct values; grouping them could be studied
                                   # if groupings other than Group2 look promising

# Plot the value counts of the categorical variables
def countplot_features(df, column_list, n_wide=1):
    fig, ax = plt.subplots(int(len(column_list)/n_wide), n_wide, figsize = (18,24))
    for i, col in enumerate(column_list) :
        plt.sca(ax[int(i/n_wide), (i%n_wide)])
        sns.countplot(x=col, data=df)

countplot_features(training_df, categ_col_list, 3)

png

There are enough observations in each category to keep all of these groups distinct from one another.

# One-hot encode the train and test dataframes
def dummies(train, test, columns = None):
    if columns is not None:
        for column in columns:
            train[column] = train[column].apply(lambda x: str(x))
            test[column] = test[column].apply(lambda x: str(x))
            # keep only the categories present in both train and test so the encoded columns match
            good_cols = [column+'_'+i for i in train[column].unique() if i in test[column].unique()]
            train = pd.concat((train, pd.get_dummies(train[column], prefix = column)[good_cols]), axis = 1)
            test = pd.concat((test, pd.get_dummies(test[column], prefix = column)[good_cols]), axis = 1)
            del train[column]
            del test[column]
    return train, test

train_dummies, pricing_dummies = dummies(training_df, pricing_df , columns = categ_col_list)
col_list = train_dummies.columns.to_list()
print(col_list)
print(str(len(col_list))+ " variables")
['PolNum', 'CalYear', 'Age', 'Bonus', 'Poldur', 'Value', 'Adind', 'SubGroup2', 'Density', 'Exppdays', 'Numtppd', 'Numtpbi', 'Indtppd', 'Indtpbi', 'IndtTotal', 'LogIndtTotal', 'Indtpbi_expo', 'Indtppd_expo', 'Age_premier_contrat', 'SansSinistre', 'Gender_Male', 'Gender_Female', 'Type_C', 'Type_E', 'Type_D', 'Type_B', 'Type_A', 'Type_F', 'Category_Large', 'Category_Medium', 'Category_Small', 'Occupation_Employed', 'Occupation_Unemployed', 'Occupation_Housewife', 'Occupation_Self-employed', 'Occupation_Retired', 'Group1_18', 'Group1_11', 'Group1_5', 'Group1_12', 'Group1_13', 'Group1_7', 'Group1_3', 'Group1_9', 'Group1_14', 'Group1_8', 'Group1_6', 'Group1_20', 'Group1_16', 'Group1_4', 'Group1_1', 'Group1_10', 'Group1_2', 'Group1_15', 'Group1_19', 'Group1_17', 'Group2_L', 'Group2_O', 'Group2_Q', 'Group2_N', 'Group2_R', 'Group2_M', 'Group2_T', 'Group2_P', 'Group2_U', 'Group2_S']
66 variables
for col in ['SubGroup2', 'Numtppd', 'Numtpbi', 'Indtppd', 'Indtpbi', 'Indtpbi_expo', 'Indtppd_expo','IndtTotal', 'LogIndtTotal']:
    col_list.remove(col)

Model 1: binary classification of the claim risk

We use an XGBoost model to perform this binary classification.

def analyze_model(model_pred_proba, actual_target):
    np.set_printoptions(precision=2)
    
    # Transform predictions to input uniform data into the confusion matrix function
    y_pred = [0 if x < 0.5 else 1 for x in model_pred_proba]
    class_names = [0, 1]
    
    # Plot non-normalized confusion matrix
    plot_confusion_matrix(actual_target, y_pred, classes=class_names,
                      title='Confusion matrix')
    
    plt.show()
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax
# First XGBClassifier, used to estimate the claim risk on a contract 
import xgboost as xgb
from sklearn import metrics
from xgboost import plot_importance

# Split the training data by calendar year so that the validation setup mimics predicting 2011 
X = train_dummies[col_list]
X_train = X[X["CalYear"] == 2009]
X_test = X[X["CalYear"] == 2010]

# Turn Numtppd and Numtpbi into a single binary target
y_train_Numt = training_df[training_df["CalYear"] == 2009].apply(lambda row: 0 if (row['Numtppd']+row['Numtpbi'])==0 else 1, axis=1)
y_test_Numt = training_df[training_df["CalYear"] == 2010].apply(lambda row: 0 if (row['Numtppd']+row['Numtpbi'])==0 else 1, axis=1)

# Signed (-1/1) version of the targets for the XGBoost model
y_train_Numt_s = training_df[training_df["CalYear"] == 2009].apply(lambda row: -1 if (row['Numtppd']+row['Numtpbi'])==0 else 1, axis=1)
y_test_Numt_s = training_df[training_df["CalYear"] == 2010].apply(lambda row: -1 if (row['Numtppd']+row['Numtpbi'])==0 else 1, axis=1)



col_list.remove("Exppdays")
X_2011 = pricing_dummies[col_list]

The results show a large number of false positives. We choose to give more weight to the minority class (the policyholders who had a claim) in order to bring the number of false positives down.
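A common heuristic (a sketch, not necessarily how the value below was chosen) sets scale_pos_weight close to the ratio of negative to positive examples:

# Imbalance ratio on the 2009 training split, as a guide for scale_pos_weight
neg, pos = (y_train_Numt == 0).sum(), (y_train_Numt == 1).sum()
print("negatives / positives = %.1f" % (neg / pos))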

###########################################################################################
# First classifier on the binary claim target: "Numtppd" or "Numtpbi"
###########################################################################################
from sklearn.metrics import roc_auc_score

xgb_clf = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.7, gamma=0, learning_rate=0.01, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=9999999999,
       n_estimators=600, n_jobs=-1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0.01,
       reg_lambda=1, scale_pos_weight=8, seed=None, silent=True,
       subsample=0.4)

xgb_clf.fit(X_train, y_train_Numt_s)
xgb_clf_pred_proba = xgb_clf.predict_proba(X_test)[:, 1]
    
# Plot the ROC curve 
fpr, tpr, _ = metrics.roc_curve(y_test_Numt, xgb_clf_pred_proba)
plt.plot(fpr,tpr)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.title("ROC Curve")
plt.show()

print("Score AUC : %.4f" % roc_auc_score( y_test_Numt, xgb_clf_pred_proba))

# Plot the confusion matrix
analyze_model(xgb_clf_pred_proba, y_test_Numt)
    
plot_importance(xgb_clf, max_num_features=20)
plt.show()

png

Score AUC : 0.7393
Confusion matrix, without normalization
[[22601 19214]
 [ 1740  6270]]

png

png

Hyperparameter optimization

# Hyperparameter search with stratified cross-validation
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from random import seed


params={
        'max_depth': [2, 3, 4],
        'subsample': [0.5],       
        'colsample_bytree': [ 0.6, 0.8 ], 
        'colsample_bylevel' : [1],
        'n_estimators': [ 200, 400, 600 ], 
        'scale_pos_weight' : [8, 10, 12], 
        'learning_rate' : [0.005, 0.01, 0.1], 
        'reg_alpha' : [0.05, 0.1],
        'reg_lambda': [0.1, 0.5], 
        'min_child_weight' : [1]
        }

xgb_model = xgb.XGBClassifier()

clf = GridSearchCV(xgb_model, params, n_jobs=-1, 
                   cv=StratifiedKFold( n_splits=3, shuffle=True), 
                   scoring='roc_auc',
                   verbose=2, refit=True)

clf.fit(X_train, y_train_Numt_s)


Fitting 3 folds for each of 648 candidates, totalling 1944 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   38.1s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  6.6min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 19.2min
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed: 34.5min
[Parallel(n_jobs=-1)]: Done 997 tasks      | elapsed: 53.4min
[Parallel(n_jobs=-1)]: Done 1442 tasks      | elapsed: 79.8min
[Parallel(n_jobs=-1)]: Done 1944 out of 1944 | elapsed: 113.6min finished



---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-57-3f19e2fbc935> in <module>
     27 clf.fit(X_train, y_train_Numt_s)
     28 
---> 29 best_parameters, score, _ = max(clf.grid_scores_, key=lambda x: x[1])
     30 print('Score AUC : ', score)


AttributeError: 'GridSearchCV' object has no attribute 'grid_scores_'
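The grid_scores_ attribute was removed from scikit-learn; the fitted GridSearchCV object exposes the same information through best_score_, best_params_ and cv_results_ (a minimal replacement for the failing lines):

# Equivalent of the removed grid_scores_ attribute
print('Score AUC : ', clf.best_score_)
print(clf.best_params_)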
best_est = clf.best_estimator_
print(best_est)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=2,
              min_child_weight=1, missing=None, n_estimators=200, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0.05, reg_lambda=0.5, scale_pos_weight=8, seed=None,
              silent=None, subsample=0.5, verbosity=1)
# Drop Exppdays, which is not available in the 2011 dataset to be predicted;
# this variable is already folded into the claim-cost target of model 2
del X_train["Exppdays"]

With these optimized hyperparameters, the number of false positives has dropped sharply, although the number of false negatives has increased. The AUC is also improved.

clf_pred_proba = clf.predict_proba(X_test)[:, 1]
    
# Plot the ROC curve 
fpr, tpr, _ = metrics.roc_curve(y_test_Numt, clf_pred_proba)
plt.plot(fpr,tpr)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.title("ROC Curve")
plt.show()

print("Score AUC : %.4f" % roc_auc_score( y_test_Numt, clf_pred_proba))

# Plot the confusion matrix
analyze_model(clf_pred_proba, y_test_Numt)
    
# Analyse the variables that influence the classification the most 
xgb_grid = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=2,
              min_child_weight=1, missing=None, n_estimators=200, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0.05, reg_lambda=0.5, scale_pos_weight=8, seed=None,
              silent=None, subsample=0.5, verbosity=1)
xgb_grid.fit(X_train, y_train_Numt_s) 
plot_importance(xgb_grid, max_num_features=20)
plt.show()    

png

Score AUC : 0.7477
Confusion matrix, without normalization
[[34449  7366]
 [ 3888  4122]]

png

png

Analysis of the predicted probabilities in each cell of the confusion matrix

result_df = pd.DataFrame({'True_Value':y_test_Numt, 'Pred_Value':clf_pred_proba})

pred_1_true_0 = result_df[(result_df["Pred_Value"]>0.5) & (result_df["True_Value"]==0)]
print( "Mean probability of the false positives: " + str(pred_1_true_0.Pred_Value.mean()) )

pred_1_true_1 = result_df[(result_df["Pred_Value"]>0.5) & (result_df["True_Value"]==1)]
print( "Mean probability of the true positives: " + str(pred_1_true_1.Pred_Value.mean()) )

pred_0_true_1 = result_df[(result_df["Pred_Value"]<0.5) & (result_df["True_Value"]==1)]
print( "Mean probability of the false negatives: " + str(pred_0_true_1.Pred_Value.mean()) )

pred_0_true_0 = result_df[(result_df["Pred_Value"]<0.5) & (result_df["True_Value"]==0)]
print( "Mean probability of the true negatives: " + str(pred_0_true_0.Pred_Value.mean()) )

Mean probability of the false positives: 0.616603
Mean probability of the true positives: 0.66039294
Mean probability of the false negatives: 0.3152266
Mean probability of the true negatives: 0.2463552

Retraining on 2009 and 2010, then predicting 2011

Our hyperparameters are now fixed. We train our model on the full training set.

del X["Exppdays"]

y_train_final_s = training_df.apply(lambda row: -1 if (row['Numtppd']+row['Numtpbi'])==0 else 1, axis=1)

xgb_grid = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=2,
              min_child_weight=1, missing=None, n_estimators=200, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0.05, reg_lambda=0.5, scale_pos_weight=8, seed=None,
              silent=None, subsample=0.5, verbosity=1)
xgb_grid.fit(X, y_train_final_s) 
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=2,
              min_child_weight=1, missing=None, n_estimators=200, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0.05, reg_lambda=0.5, scale_pos_weight=8, seed=None,
              silent=None, subsample=0.5, verbosity=1)
xgb_grid_pred_proba = xgb_grid.predict_proba(X_2011)[:, 1]
pd.DataFrame({"proba_pred":xgb_grid_pred_proba}).to_csv("./prediction_prob_sinistre.csv")

Model 2: NN regression of the claim cost

Data preparation

X = train_dummies[col_list]
Index(['CalYear', 'Age', 'Bonus', 'Poldur', 'Value', 'Adind', 'Density',
       'IndtTotal', 'LogIndtTotal', 'Age_premier_contrat', 'SansSinistre',
       'Gender_Male', 'Gender_Female', 'Type_C', 'Type_E', 'Type_D', 'Type_B',
       'Type_A', 'Type_F', 'Category_Large', 'Category_Medium',
       'Category_Small', 'Occupation_Employed', 'Occupation_Unemployed',
       'Occupation_Housewife', 'Occupation_Self-employed',
       'Occupation_Retired', 'Group1_18', 'Group1_11', 'Group1_5', 'Group1_12',
       'Group1_13', 'Group1_7', 'Group1_3', 'Group1_9', 'Group1_14',
       'Group1_8', 'Group1_6', 'Group1_20', 'Group1_16', 'Group1_4',
       'Group1_1', 'Group1_10', 'Group1_2', 'Group1_15', 'Group1_19',
       'Group1_17', 'Group2_L', 'Group2_O', 'Group2_Q', 'Group2_N', 'Group2_R',
       'Group2_M', 'Group2_T', 'Group2_P', 'Group2_U', 'Group2_S'],
      dtype='object')
train_dummies, pricing_dummies = dummies(training_df, pricing_df , columns = categ_col_list)
col_list = train_dummies.columns.to_list()
print(col_list)
print(str(len(col_list))+ " variables")

for col in ['SubGroup2', 'Numtppd', "Exppdays", 'Numtpbi','Indtpbi', 'Indtppd',  'Indtppd_expo', 'Indtpbi_expo', 'PolNum']:
    col_list.remove(col)

# Regression target: log of the total claim cost (LogIndtTotal)
y_train_LogIndt = train_dummies.loc[train_dummies["CalYear"] == 2009, 'LogIndtTotal' ]
y_test_LogIndt = train_dummies.loc[train_dummies["CalYear"] == 2010, 'LogIndtTotal']



    
# Keep only the contracts that had claims
#          We train on 2009 
X_train = train_dummies[train_dummies["CalYear"] == 2009]
X_test = train_dummies[train_dummies["CalYear"] == 2010]

# i.e. the policyholders with at least one claim
X_train = X_train.loc[ X_train['IndtTotal'] > 0, col_list ]
X_test = X_test.loc[ X_test['IndtTotal'] > 0, col_list ]

y_train_LogIndt = X_train.loc[X_train['IndtTotal'] > 0, "LogIndtTotal"]
y_test_LogIndt = X_test.loc[X_test['IndtTotal'] > 0, "LogIndtTotal"]
y_X_LogIndt = X.loc[X['IndtTotal'] > 0, "LogIndtTotal"]

for col in ['LogIndtTotal', 'IndtTotal']:
    del X_train[col]
    del X_test[col]
    del X[col]


['PolNum', 'CalYear', 'Age', 'Bonus', 'Poldur', 'Value', 'Adind', 'SubGroup2', 'Density', 'Exppdays', 'Numtppd', 'Numtpbi', 'Indtppd', 'Indtpbi', 'IndtTotal', 'LogIndtTotal', 'Indtpbi_expo', 'Indtppd_expo', 'Age_premier_contrat', 'SansSinistre', 'Gender_Male', 'Gender_Female', 'Type_C', 'Type_E', 'Type_D', 'Type_B', 'Type_A', 'Type_F', 'Category_Large', 'Category_Medium', 'Category_Small', 'Occupation_Employed', 'Occupation_Unemployed', 'Occupation_Housewife', 'Occupation_Self-employed', 'Occupation_Retired', 'Group1_18', 'Group1_11', 'Group1_5', 'Group1_12', 'Group1_13', 'Group1_7', 'Group1_3', 'Group1_9', 'Group1_14', 'Group1_8', 'Group1_6', 'Group1_20', 'Group1_16', 'Group1_4', 'Group1_1', 'Group1_10', 'Group1_2', 'Group1_15', 'Group1_19', 'Group1_17', 'Group2_L', 'Group2_O', 'Group2_Q', 'Group2_N', 'Group2_R', 'Group2_M', 'Group2_T', 'Group2_P', 'Group2_U', 'Group2_S']
66 variables
print(X_train.shape)
print(X_test.shape)
(7565, 55)
(8010, 55)
# Standardize our variables
X_train.columns
Index(['CalYear', 'Age', 'Bonus', 'Poldur', 'Value', 'Adind', 'Density',
       'Age_premier_contrat', 'SansSinistre', 'Gender_Male', 'Gender_Female',
       'Type_C', 'Type_E', 'Type_D', 'Type_B', 'Type_A', 'Type_F',
       'Category_Large', 'Category_Medium', 'Category_Small',
       'Occupation_Employed', 'Occupation_Unemployed', 'Occupation_Housewife',
       'Occupation_Self-employed', 'Occupation_Retired', 'Group1_18',
       'Group1_11', 'Group1_5', 'Group1_12', 'Group1_13', 'Group1_7',
       'Group1_3', 'Group1_9', 'Group1_14', 'Group1_8', 'Group1_6',
       'Group1_20', 'Group1_16', 'Group1_4', 'Group1_1', 'Group1_10',
       'Group1_2', 'Group1_15', 'Group1_19', 'Group1_17', 'Group2_L',
       'Group2_O', 'Group2_Q', 'Group2_N', 'Group2_R', 'Group2_M', 'Group2_T',
       'Group2_P', 'Group2_U', 'Group2_S'],
      dtype='object')

Regression with a neural network

from keras.models import Sequential 
from keras.layers import Dense, Activation
from keras.optimizers import SGD
from sklearn import preprocessing

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Scale the data for the NN: fit the scaler on the training set only, then reuse it for the test set
scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) 
X_test_scaled = scaler.transform(X_test) 

n_epochs = 100

def create_baseline():
    # Build the model
    model = Sequential()
    model.add(Dense(512, activation='relu', input_shape=(X_train_scaled.shape[1],)))    
    model.add(Dense(256, activation='relu'))
    model.add(Dense(128, activation='linear'))
    
    model.add(Dense(1)) 

    # Compile the model
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model



model_reg = create_baseline() 



history_reg = model_reg.fit(X_train_scaled, y_train_LogIndt, 
                            validation_data=(X_test_scaled, y_test_LogIndt),
                            epochs=n_epochs,
                            verbose=1,
                            batch_size=1024)
Train on 7565 samples, validate on 8010 samples
Epoch 1/100
7565/7565 [==============================] - 2s 315us/step - loss: 15.2680 - mean_absolute_error: 3.2318 - val_loss: 4.5242 - val_mean_absolute_error: 1.6506
Epoch 2/100
7565/7565 [==============================] - 0s 11us/step - loss: 4.2387 - mean_absolute_error: 1.6752 - val_loss: 4.4673 - val_mean_absolute_error: 1.7924
Epoch 3/100
7565/7565 [==============================] - 0s 12us/step - loss: 3.2320 - mean_absolute_error: 1.4122 - val_loss: 3.2058 - val_mean_absolute_error: 1.3541
Epoch 4/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.8502 - mean_absolute_error: 1.2922 - val_loss: 2.8909 - val_mean_absolute_error: 1.3710
Epoch 5/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.6552 - mean_absolute_error: 1.2799 - val_loss: 2.4886 - val_mean_absolute_error: 1.1919
Epoch 6/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.5082 - mean_absolute_error: 1.2042 - val_loss: 2.4921 - val_mean_absolute_error: 1.2326
Epoch 7/100
7565/7565 [==============================] - 0s 14us/step - loss: 2.4443 - mean_absolute_error: 1.2131 - val_loss: 2.4083 - val_mean_absolute_error: 1.1840
Epoch 8/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.3914 - mean_absolute_error: 1.1822 - val_loss: 2.4314 - val_mean_absolute_error: 1.2057
Epoch 9/100
7565/7565 [==============================] - 0s 14us/step - loss: 2.3547 - mean_absolute_error: 1.1822 - val_loss: 2.4065 - val_mean_absolute_error: 1.1876
Epoch 10/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.3294 - mean_absolute_error: 1.1667 - val_loss: 2.4289 - val_mean_absolute_error: 1.2053
Epoch 11/100
7565/7565 [==============================] - 0s 14us/step - loss: 2.3068 - mean_absolute_error: 1.1689 - val_loss: 2.4122 - val_mean_absolute_error: 1.1892
Epoch 12/100
7565/7565 [==============================] - 0s 14us/step - loss: 2.2835 - mean_absolute_error: 1.1594 - val_loss: 2.4242 - val_mean_absolute_error: 1.1991
Epoch 13/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.2634 - mean_absolute_error: 1.1495 - val_loss: 2.4360 - val_mean_absolute_error: 1.2035
Epoch 14/100
7565/7565 [==============================] - 0s 14us/step - loss: 2.2372 - mean_absolute_error: 1.1493 - val_loss: 2.4321 - val_mean_absolute_error: 1.1968
Epoch 15/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.2213 - mean_absolute_error: 1.1409 - val_loss: 2.4448 - val_mean_absolute_error: 1.2054
Epoch 16/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.1994 - mean_absolute_error: 1.1355 - val_loss: 2.4480 - val_mean_absolute_error: 1.2036
Epoch 17/100
7565/7565 [==============================] - 0s 14us/step - loss: 2.1799 - mean_absolute_error: 1.1314 - val_loss: 2.4507 - val_mean_absolute_error: 1.2042
Epoch 18/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.1629 - mean_absolute_error: 1.1308 - val_loss: 2.4495 - val_mean_absolute_error: 1.2024
Epoch 19/100
7565/7565 [==============================] - 0s 10us/step - loss: 2.1424 - mean_absolute_error: 1.1203 - val_loss: 2.4637 - val_mean_absolute_error: 1.2118
Epoch 20/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.1266 - mean_absolute_error: 1.1189 - val_loss: 2.4653 - val_mean_absolute_error: 1.2055
Epoch 21/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.1063 - mean_absolute_error: 1.1121 - val_loss: 2.4821 - val_mean_absolute_error: 1.2143
Epoch 22/100
7565/7565 [==============================] - 0s 14us/step - loss: 2.0927 - mean_absolute_error: 1.1086 - val_loss: 2.4826 - val_mean_absolute_error: 1.2121
Epoch 23/100
7565/7565 [==============================] - 0s 17us/step - loss: 2.0731 - mean_absolute_error: 1.1038 - val_loss: 2.4879 - val_mean_absolute_error: 1.2138
Epoch 24/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.0585 - mean_absolute_error: 1.0976 - val_loss: 2.5072 - val_mean_absolute_error: 1.2259
Epoch 25/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.0410 - mean_absolute_error: 1.0940 - val_loss: 2.5042 - val_mean_absolute_error: 1.2199
Epoch 26/100
7565/7565 [==============================] - 0s 14us/step - loss: 2.0238 - mean_absolute_error: 1.0931 - val_loss: 2.5082 - val_mean_absolute_error: 1.2177
Epoch 27/100
7565/7565 [==============================] - 0s 12us/step - loss: 2.0075 - mean_absolute_error: 1.0878 - val_loss: 2.5228 - val_mean_absolute_error: 1.2237
Epoch 28/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.9884 - mean_absolute_error: 1.0805 - val_loss: 2.5327 - val_mean_absolute_error: 1.2295
Epoch 29/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.9673 - mean_absolute_error: 1.0759 - val_loss: 2.5327 - val_mean_absolute_error: 1.2233
Epoch 30/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.9519 - mean_absolute_error: 1.0711 - val_loss: 2.5496 - val_mean_absolute_error: 1.2338
Epoch 31/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.9346 - mean_absolute_error: 1.0670 - val_loss: 2.5660 - val_mean_absolute_error: 1.2401
Epoch 32/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.9189 - mean_absolute_error: 1.0653 - val_loss: 2.5650 - val_mean_absolute_error: 1.2307
Epoch 33/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.9020 - mean_absolute_error: 1.0596 - val_loss: 2.5715 - val_mean_absolute_error: 1.2329
Epoch 34/100
7565/7565 [==============================] - 0s 14us/step - loss: 1.8758 - mean_absolute_error: 1.0530 - val_loss: 2.5794 - val_mean_absolute_error: 1.2394
Epoch 35/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.8505 - mean_absolute_error: 1.0442 - val_loss: 2.5980 - val_mean_absolute_error: 1.2463
Epoch 36/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.8325 - mean_absolute_error: 1.0408 - val_loss: 2.6015 - val_mean_absolute_error: 1.2468
Epoch 37/100
7565/7565 [==============================] - 0s 14us/step - loss: 1.8091 - mean_absolute_error: 1.0330 - val_loss: 2.6161 - val_mean_absolute_error: 1.2510
Epoch 38/100
7565/7565 [==============================] - 0s 14us/step - loss: 1.7862 - mean_absolute_error: 1.0279 - val_loss: 2.6120 - val_mean_absolute_error: 1.2470
Epoch 39/100
7565/7565 [==============================] - 0s 17us/step - loss: 1.7661 - mean_absolute_error: 1.0222 - val_loss: 2.6411 - val_mean_absolute_error: 1.2530
Epoch 40/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.7415 - mean_absolute_error: 1.0154 - val_loss: 2.6370 - val_mean_absolute_error: 1.2529
Epoch 41/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.7176 - mean_absolute_error: 1.0083 - val_loss: 2.6681 - val_mean_absolute_error: 1.2556
Epoch 42/100
7565/7565 [==============================] - 0s 14us/step - loss: 1.6883 - mean_absolute_error: 0.9992 - val_loss: 2.6677 - val_mean_absolute_error: 1.2633
Epoch 43/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.6596 - mean_absolute_error: 0.9929 - val_loss: 2.7175 - val_mean_absolute_error: 1.2846
Epoch 44/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.6465 - mean_absolute_error: 0.9926 - val_loss: 2.7183 - val_mean_absolute_error: 1.2776
Epoch 45/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.6141 - mean_absolute_error: 0.9802 - val_loss: 2.7149 - val_mean_absolute_error: 1.2741
Epoch 46/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.5927 - mean_absolute_error: 0.9744 - val_loss: 2.7407 - val_mean_absolute_error: 1.2756
Epoch 47/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.5442 - mean_absolute_error: 0.9587 - val_loss: 2.7453 - val_mean_absolute_error: 1.2854
Epoch 48/100
7565/7565 [==============================] - 0s 10us/step - loss: 1.5149 - mean_absolute_error: 0.9494 - val_loss: 2.7547 - val_mean_absolute_error: 1.2839
Epoch 49/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.4876 - mean_absolute_error: 0.9405 - val_loss: 2.7789 - val_mean_absolute_error: 1.2862
Epoch 50/100
7565/7565 [==============================] - 0s 14us/step - loss: 1.4632 - mean_absolute_error: 0.9331 - val_loss: 2.7941 - val_mean_absolute_error: 1.2983
Epoch 51/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.4151 - mean_absolute_error: 0.9177 - val_loss: 2.8231 - val_mean_absolute_error: 1.3098
Epoch 52/100
7565/7565 [==============================] - 0s 15us/step - loss: 1.3863 - mean_absolute_error: 0.9110 - val_loss: 2.8383 - val_mean_absolute_error: 1.3096
Epoch 53/100
7565/7565 [==============================] - 0s 14us/step - loss: 1.3394 - mean_absolute_error: 0.8941 - val_loss: 2.8434 - val_mean_absolute_error: 1.3080
Epoch 54/100
7565/7565 [==============================] - 0s 12us/step - loss: 1.2951 - mean_absolute_error: 0.8799 - val_loss: 2.8632 - val_mean_absolute_error: 1.3135
Epoch 55/100
7565/7565 [==============================] - 0s 10us/step - loss: 1.2600 - mean_absolute_error: 0.8660 - val_loss: 2.8956 - val_mean_absolute_error: 1.3195
Epoch 56/100
7565/7565 [==============================] - 0s 10us/step - loss: 1.2268 - mean_absolute_error: 0.8554 - val_loss: 2.9044 - val_mean_absolute_error: 1.3248
Epoch 57/100
7565/7565 [==============================] - 0s 10us/step - loss: 1.1820 - mean_absolute_error: 0.8378 - val_loss: 2.9582 - val_mean_absolute_error: 1.3282
Epoch 58/100
7565/7565 [==============================] - 0s 10us/step - loss: 1.1562 - mean_absolute_error: 0.8294 - val_loss: 2.9762 - val_mean_absolute_error: 1.3423
Epoch 59/100
7565/7565 [==============================] - 0s 10us/step - loss: 1.1225 - mean_absolute_error: 0.8185 - val_loss: 2.9871 - val_mean_absolute_error: 1.3355
Epoch 60/100
7565/7565 [==============================] - 0s 10us/step - loss: 1.0903 - mean_absolute_error: 0.8026 - val_loss: 3.1149 - val_mean_absolute_error: 1.3858
Epoch 61/100
7565/7565 [==============================] - 0s 10us/step - loss: 1.0852 - mean_absolute_error: 0.8101 - val_loss: 3.0336 - val_mean_absolute_error: 1.3488
Epoch 62/100
7565/7565 [==============================] - 0s 10us/step - loss: 1.0202 - mean_absolute_error: 0.7798 - val_loss: 3.0746 - val_mean_absolute_error: 1.3602
Epoch 63/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.9805 - mean_absolute_error: 0.7623 - val_loss: 3.1015 - val_mean_absolute_error: 1.3780
Epoch 64/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.9215 - mean_absolute_error: 0.7389 - val_loss: 3.1360 - val_mean_absolute_error: 1.3692
Epoch 65/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.8876 - mean_absolute_error: 0.7246 - val_loss: 3.1642 - val_mean_absolute_error: 1.3814
Epoch 66/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.8425 - mean_absolute_error: 0.7078 - val_loss: 3.1564 - val_mean_absolute_error: 1.3802
Epoch 67/100
7565/7565 [==============================] - 0s 14us/step - loss: 0.7857 - mean_absolute_error: 0.6808 - val_loss: 3.2486 - val_mean_absolute_error: 1.3968
Epoch 68/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.7722 - mean_absolute_error: 0.6793 - val_loss: 3.2855 - val_mean_absolute_error: 1.4128
Epoch 69/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.7417 - mean_absolute_error: 0.6607 - val_loss: 3.2923 - val_mean_absolute_error: 1.4177
Epoch 70/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.6889 - mean_absolute_error: 0.6382 - val_loss: 3.3449 - val_mean_absolute_error: 1.4299
Epoch 71/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.6491 - mean_absolute_error: 0.6197 - val_loss: 3.4674 - val_mean_absolute_error: 1.4666
Epoch 72/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.6672 - mean_absolute_error: 0.6320 - val_loss: 3.3820 - val_mean_absolute_error: 1.4346
Epoch 73/100
7565/7565 [==============================] - 0s 13us/step - loss: 0.6617 - mean_absolute_error: 0.6294 - val_loss: 3.4675 - val_mean_absolute_error: 1.4440
Epoch 74/100
7565/7565 [==============================] - 0s 14us/step - loss: 0.6236 - mean_absolute_error: 0.6112 - val_loss: 3.4038 - val_mean_absolute_error: 1.4406
Epoch 75/100
7565/7565 [==============================] - 0s 14us/step - loss: 0.5727 - mean_absolute_error: 0.5865 - val_loss: 3.4552 - val_mean_absolute_error: 1.4562
Epoch 76/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.5416 - mean_absolute_error: 0.5678 - val_loss: 3.4976 - val_mean_absolute_error: 1.4558
Epoch 77/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.4930 - mean_absolute_error: 0.5394 - val_loss: 3.4640 - val_mean_absolute_error: 1.4518
Epoch 78/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.4586 - mean_absolute_error: 0.5189 - val_loss: 3.5031 - val_mean_absolute_error: 1.4646
Epoch 79/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.4342 - mean_absolute_error: 0.5043 - val_loss: 3.5909 - val_mean_absolute_error: 1.4729
Epoch 80/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.4430 - mean_absolute_error: 0.5123 - val_loss: 3.5971 - val_mean_absolute_error: 1.4834
Epoch 81/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.3982 - mean_absolute_error: 0.4828 - val_loss: 3.6713 - val_mean_absolute_error: 1.4953
Epoch 82/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.4449 - mean_absolute_error: 0.5179 - val_loss: 3.6827 - val_mean_absolute_error: 1.4980
Epoch 83/100
7565/7565 [==============================] - 0s 14us/step - loss: 0.4095 - mean_absolute_error: 0.4934 - val_loss: 3.6537 - val_mean_absolute_error: 1.4979
Epoch 84/100
7565/7565 [==============================] - 0s 14us/step - loss: 0.3566 - mean_absolute_error: 0.4557 - val_loss: 3.6348 - val_mean_absolute_error: 1.4892
Epoch 85/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.3300 - mean_absolute_error: 0.4360 - val_loss: 3.7105 - val_mean_absolute_error: 1.5023
Epoch 86/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.3157 - mean_absolute_error: 0.4266 - val_loss: 3.6977 - val_mean_absolute_error: 1.5055
Epoch 87/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.2924 - mean_absolute_error: 0.4094 - val_loss: 3.7523 - val_mean_absolute_error: 1.5125
Epoch 88/100
7565/7565 [==============================] - 0s 8us/step - loss: 0.3008 - mean_absolute_error: 0.4176 - val_loss: 3.7830 - val_mean_absolute_error: 1.5191
Epoch 89/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.2699 - mean_absolute_error: 0.3937 - val_loss: 3.8019 - val_mean_absolute_error: 1.5202
Epoch 90/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.2618 - mean_absolute_error: 0.3894 - val_loss: 3.8120 - val_mean_absolute_error: 1.5202
Epoch 91/100
7565/7565 [==============================] - 0s 8us/step - loss: 0.2564 - mean_absolute_error: 0.3870 - val_loss: 3.8103 - val_mean_absolute_error: 1.5256
Epoch 92/100
7565/7565 [==============================] - 0s 8us/step - loss: 0.2346 - mean_absolute_error: 0.3667 - val_loss: 3.7884 - val_mean_absolute_error: 1.5226
Epoch 93/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.2177 - mean_absolute_error: 0.3522 - val_loss: 3.8668 - val_mean_absolute_error: 1.5492
Epoch 94/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.2317 - mean_absolute_error: 0.3682 - val_loss: 3.9287 - val_mean_absolute_error: 1.5617
Epoch 95/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.2162 - mean_absolute_error: 0.3551 - val_loss: 3.8762 - val_mean_absolute_error: 1.5386
Epoch 96/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.2371 - mean_absolute_error: 0.3761 - val_loss: 3.9478 - val_mean_absolute_error: 1.5498
Epoch 97/100
7565/7565 [==============================] - 0s 12us/step - loss: 0.2687 - mean_absolute_error: 0.4078 - val_loss: 3.8780 - val_mean_absolute_error: 1.5454
Epoch 98/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.2417 - mean_absolute_error: 0.3842 - val_loss: 3.9039 - val_mean_absolute_error: 1.5536
Epoch 99/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.1896 - mean_absolute_error: 0.3318 - val_loss: 3.9239 - val_mean_absolute_error: 1.5588
Epoch 100/100
7565/7565 [==============================] - 0s 10us/step - loss: 0.1752 - mean_absolute_error: 0.3160 - val_loss: 3.8977 - val_mean_absolute_error: 1.5471
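The training log shows clear overfitting: the validation loss bottoms out around epochs 7 to 9 (val_loss ≈ 2.41) and then climbs steadily while the training loss keeps falling. A minimal sketch, assuming a Keras version that provides EarlyStopping with restore_best_weights, would stop at the best validation loss instead of running a fixed 100 epochs:

from keras.callbacks import EarlyStopping

# Stop once the validation loss has not improved for 10 epochs and keep the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

model_reg = create_baseline()
history_reg = model_reg.fit(X_train_scaled, y_train_LogIndt,
                            validation_data=(X_test_scaled, y_test_LogIndt),
                            epochs=n_epochs,
                            batch_size=1024,
                            verbose=0,
                            callbacks=[early_stop])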

Regression with XGBoost

#XGBoost parameters list
xgb_params = {
    "colsample_bytree"   : 0.2, 
    "gamma" : 0,
    "learning_rate" : 0.01,
    "max_depth":3, 
    "reg_alpha":0.8,
    "reg_lambda": 0.6,
    "subsample": 0.5,
    'objective': 'reg:linear',
    'eval_metric': 'rmse',
    'min_child_weight':1, #1.5 
    'silent': 1,
    'seed':42
}

#Data Matrix is an internal data structure used by XGBoost and optimized for both memory efficiency and training speed
xgtrain = xgb.DMatrix(X_train, y_train_LogIndt, feature_names=X_train.columns)
xgtest = xgb.DMatrix(X_test, y_test_LogIndt, feature_names=X_test.columns)
watchlist = [ (xgtrain,'train'), (xgtest, 'test') ]

# Increase the number of rounds when running locally
num_rounds = 30000 

# Train until the test RMSE stops improving (early stopping)
xgb_model = xgb.train(xgb_params, xgtrain, num_rounds, watchlist, early_stopping_rounds=3000, verbose_eval=200)


[0]	train-rmse:5.94873	test-rmse:6.01846
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 3000 rounds.


C:\Users\query\Anaconda3_New\envs\tf_gpu\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \


[200]	train-rmse:1.74871	test-rmse:1.76097
[400]	train-rmse:1.56195	test-rmse:1.54713
[600]	train-rmse:1.54913	test-rmse:1.53645
[800]	train-rmse:1.54155	test-rmse:1.53476
[1000]	train-rmse:1.53422	test-rmse:1.5344
[1200]	train-rmse:1.52804	test-rmse:1.53452
[1400]	train-rmse:1.52189	test-rmse:1.53434
[1600]	train-rmse:1.51561	test-rmse:1.53463
[1800]	train-rmse:1.50961	test-rmse:1.53534
[2000]	train-rmse:1.50402	test-rmse:1.53558
[2200]	train-rmse:1.49826	test-rmse:1.53675
[2400]	train-rmse:1.49321	test-rmse:1.53734
[2600]	train-rmse:1.48745	test-rmse:1.53793
[2800]	train-rmse:1.4825	test-rmse:1.53851
[3000]	train-rmse:1.47743	test-rmse:1.53932
[3200]	train-rmse:1.47257	test-rmse:1.53953
[3400]	train-rmse:1.46751	test-rmse:1.54063
[3600]	train-rmse:1.46279	test-rmse:1.54136
[3800]	train-rmse:1.458	test-rmse:1.54194
[4000]	train-rmse:1.4533	test-rmse:1.54259
Stopping. Best iteration:
[1096]	train-rmse:1.5313	test-rmse:1.5342
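Early stopping selects iteration 1096. When predicting, it is worth limiting the booster to that iteration (a sketch, assuming this xgboost version exposes best_ntree_limit after early stopping):

# Predict the log-costs on the validation year using only the trees up to the best iteration
best_preds_log = xgb_model.predict(xgtest, ntree_limit=xgb_model.best_ntree_limit)
print("test RMSE at the best iteration: %.4f" % np.sqrt(np.mean((best_preds_log - y_test_LogIndt) ** 2)))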
X_train.columns
Index(['CalYear', 'Age', 'Bonus', 'Poldur', 'Value', 'Adind', 'Density',
       'Age_premier_contrat', 'SansSinistre', 'Gender_Male', 'Gender_Female',
       'Type_C', 'Type_E', 'Type_D', 'Type_B', 'Type_A', 'Type_F',
       'Category_Large', 'Category_Medium', 'Category_Small',
       'Occupation_Employed', 'Occupation_Unemployed', 'Occupation_Housewife',
       'Occupation_Self-employed', 'Occupation_Retired', 'Group1_18',
       'Group1_11', 'Group1_5', 'Group1_12', 'Group1_13', 'Group1_7',
       'Group1_3', 'Group1_9', 'Group1_14', 'Group1_8', 'Group1_6',
       'Group1_20', 'Group1_16', 'Group1_4', 'Group1_1', 'Group1_10',
       'Group1_2', 'Group1_15', 'Group1_19', 'Group1_17', 'Group2_L',
       'Group2_O', 'Group2_Q', 'Group2_N', 'Group2_R', 'Group2_M', 'Group2_T',
       'Group2_P', 'Group2_U', 'Group2_S'],
      dtype='object')
#Plot XGB feature importance
fig, ax = plt.subplots(figsize=(8, 12))
xgb.plot_importance(xgb_model, ax=ax)

import operator
importance = xgb_model.get_fscore()
importance = sorted(importance.items(), key=operator.itemgetter(1),reverse=True)
xgb_feature_list=[]
for i in range(0,len(importance)-1):
    xgb_feature_list.append(importance[i][0])
xgb_feature_list_df = pd.DataFrame(xgb_feature_list)

png

features_list = ["Value", "Density", "Age", "Age_premier_contrat", "Bonus", 'Poldur']
#xgb_reg = xgboost.XGBRegressor(n_estimators=100, learning_rate=0.08, gamma=0, subsample=0.75,
#                           colsample_bytree=1, max_depth=7)

xgb_reg = xgb.XGBRegressor( colsample_bytree = 0.1, gamma=0, learning_rate= 0.01, max_depth=4, reg_alpha=0.8,
                           reg_lambda= 0.6, subsample= 0.8,
                            objective ='reg:linear', eval_metric='rmse', min_child_weight=1, seed=42)

xgb_reg.fit(X_train[features_list],y_train_LogIndt)
predictions = xgb_reg.predict(X_test[features_list], ntree_limit=1496)
predictions
[16:22:01] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.





array([4.16, 4.11, 4.17, ..., 4.13, 4.13, 4.25], dtype=float32)
# Total predicted claim cost
np.expm1(predictions).astype(int).sum()

506798

The predicted claim costs turn out to be very far from the observed amounts.
We therefore choose to use these estimates only as relative scores, normalizing them by the maximum predicted value.

Estimating the claim costs for 2011


X_train_cout= train_dummies.loc[train_dummies["IndtTotal"]>0 , features_list]
y_X_LogIndt = train_dummies.loc[train_dummies["IndtTotal"]>0 , "LogIndtTotal"]

xgb_reg = xgb.XGBRegressor( colsample_bytree = 0.1, gamma=0, learning_rate= 0.01, max_depth=4, reg_alpha=0.8,
                           reg_lambda= 0.6, subsample= 0.8,
                            objective ='reg:linear', eval_metric='rmse', min_child_weight=1, seed=42)

xgb_reg.fit(X_train_cout, y_X_LogIndt)

predictions = xgb_reg.predict(X_2011[features_list], ntree_limit=1496)
predictions = predictions/max(predictions)

pd.DataFrame({"score_cout_sinistre":predictions}).to_csv("./score_cout_sinistre.csv")
[16:34:40] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

Defining an insurance premium from the two models

t2009=training_df[training_df.CalYear==2009]
t2010=training_df[training_df.CalYear==2010]
# Number of contracts with at least one claim in each year
nba2009=t2009[(t2009.Numtppd!=0) | (t2009.Numtpbi!=0)].shape[0]
nba2010=t2010[(t2010.Numtppd!=0) | (t2010.Numtpbi!=0)].shape[0]
total2009=training_df[training_df.CalYear==2009].Indtppd.sum()+training_df[training_df.CalYear==2009].Indtpbi.sum()
print("total 2009 = ",round(total2009)," euros")
total2010=training_df[training_df.CalYear==2010].Indtppd.sum()+training_df[training_df.CalYear==2010].Indtpbi.sum()
print("total 2010 = ",round(total2010)," euros")
total 2009 =  15357219.0  euros
total 2010 =  17539471.0  euros

We compute the average cost per accident.

pma2009=total2009/nba2009
pma2010=total2010/nba2010
print("Average cost per accident (2009) ",round(pma2009))
print("Average cost per accident (2010) ",round(pma2010))
Average cost per accident (2009)  1987.0
Average cost per accident (2010)  2143.0

We compute the average cost per accident over 2009 and 2010 combined.

total=training_df.Indtppd.sum()+training_df.Indtpbi.sum()
cout_moyen=total/(nba2009+nba2010)
print(cout_moyen)
2067.4139303798534

Reading the claim-probability predictions (XGBoost).

pred=pd.read_csv("prediction_prob_sinistre.csv")

We estimate the number of predicted accidents.

n=pred[pred.proba_pred>0.5].count()[0]

We estimate the total claim cost. It is overestimated, which will let us build in a margin.

S=n*cout_moyen
print(round(S),"euros")
36808238.0 euros

We compute coefficients (roughly a factor of 10 between the smallest and the largest, linear in the predicted claim probability).

pred['coeff']=pred['proba_pred']*10 + 1

Sum of the coefficients

C=pred['coeff'].sum()
print(C)
220470.62589862

Computing the premium for each individual

pred['prime']=(pred['coeff']*S)/C
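By construction these premiums redistribute exactly the estimated total $S$, since

$$\sum_i \mathrm{prime}_i = \frac{S}{C}\sum_i \mathrm{coeff}_i = S \quad\text{with } C = \sum_i \mathrm{coeff}_i,$$

which the check below confirms numerically.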

Checking the weighting

round(pred['prime'].sum())
36808238.0

Premium statistics

pred['prime'].describe()
count    36311.000000
mean      1013.693856
std        334.333860
min        250.738988
25%        762.103640
50%        991.660942
75%       1289.144943
max       1777.911799
Name: prime, dtype: float64

We save the premiums.

pred.to_csv("./prediction_prime_sinistre.csv")

Reading the claim-cost score predictions (XGBoost)

pred2=pd.read_csv("./score_cout_sinistre.csv")
pred2.keys()
Index(['Unnamed: 0', 'score_cout_sinistre'], dtype='object')

Combining the claim probability with the claim-cost score.

pred['proba_cout']=pred2['score_cout_sinistre']
pred['coeff2']=pred['proba_pred']*pred['proba_cout']*10 + 1
C2=pred['coeff2'].sum()
pred['prime2']=(pred['coeff2']*S)/C2
round(pred['prime2'].sum())
36808238.0
pred.to_csv("./prediction_prime_sinistre.csv")
pricing_df['tarif']=pred['prime2'] 

We save the final rates.

pricing_df[['PolNum','tarif']].to_csv("./polnum_tarifs.csv")

Conclusion

Despite a rather high estimate of the total claim cost, our premium predictions look consistent.

We could further improve our predictions by working on the following points:

  • modeling the 2% of individuals we treated as outliers with a generalized Pareto distribution,
  • building separate models for property damage and bodily injury claims,
  • performing a clustering (age, density, etc.),
  • enriching our models with features based on per-group averages,
  • adjusting the premium amounts in order to win more contracts.