Toptal challenge

11 minute read

Pandas and classification exercise

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from random import randint
from matplotlib import pyplot as plt
import seaborn as sns

For this exercise, only three libraries are required: Pandas, NumPy, and scikit-learn (the latter for the decision tree and the confusion matrix). Additionally, we import the random, Matplotlib, and Seaborn libraries to make the exercise more complete and more visual.

Next, we load both the training and test datasets. The exercise itself only requires the training one.

data_test = pd.read_csv('test.csv')
data_train = pd.read_csv('train.csv')
data_train.head(6)
duration credit_amount installment_rate present_residence age existing_credits dependents checking_account_status_A12 checking_account_status_A13 checking_account_status_A14 ... other_installment_plans_A142 other_installment_plans_A143 housing_A152 housing_A153 job_A172 job_A173 job_A174 telephone_A192 foreign_worker_A202 target
0 36 8086.0 2.0 4.0 42.0 4.0 1 1 0 0 ... 0 1 1 0 0 0 1 1 0 1
1 15 3812.0 1.0 4.0 23.0 1.0 1 0 0 1 ... 0 1 1 0 0 1 0 1 0 0
2 36 2145.0 2.0 1.0 24.0 2.0 1 0 0 0 ... 0 1 1 0 0 1 0 1 0 1
3 12 2578.0 3.0 4.0 55.0 1.0 1 0 0 0 ... 0 1 0 1 0 0 1 0 0 0
4 21 5003.0 1.0 4.0 29.0 2.0 1 0 0 1 ... 0 0 1 0 0 1 0 1 0 1
5 15 2327.0 2.0 3.0 25.0 1.0 1 0 1 0 ... 0 1 1 0 1 0 0 0 0 1

6 rows × 49 columns
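Before modeling, a quick look at the overall shape and class balance helps frame the problem. A minimal sketch, assuming the data has been loaded as above:

# Overall dimensions: 900 rows and 49 columns, as reported below
print(data_train.shape)

# Class balance of the binary target
print(data_train['target'].value_counts())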

Modeling on the Training dataset

We construct a function to split the dataset into training and test subsets, with the test subset holding 30% of the rows and a random state of 42 so the results can be reproduced.

def uplo(db):
    # Separate the target column from the feature matrix
    y = db['target']
    X = db.drop('target', axis=1)
    # Hold out 30% of the rows for testing; fix the seed for reproducibility
    xt, xte, yt, yte = train_test_split(X, y, test_size=0.3, random_state=42)
    print('The shape of dataset:', X.shape[0])
    print('The shape of the X dataset:', xt.shape[0])

    return xt, xte, yt, yte
X_train, X_test, y_train, y_test = uplo(data_train)
The shape of dataset: 900
The shape of the X dataset: 630

The full dataset has 900 rows, and the resulting X_train split has 630.
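As a sanity check (a small sketch, assuming uplo has already been run on data_train as above), the split proportions can be verified directly:

# The two splits together recover the 900 original rows
assert X_train.shape[0] + X_test.shape[0] == 900
print(X_train.shape[0] / 900)  # exactly 0.7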

Correlation matrix

Next, we build a correlation matrix as a way to identify the variables most strongly related to the target; this step is not required by the exercise.

As an example, here we look at the first ten variables, sorted by their correlation with the target.

corrMatrix = data_train.corr()
print(corrMatrix['target'][:10].sort_values(ascending=False))
duration                       0.223492
credit_amount                  0.168680
checking_account_status_A12    0.116657
installment_rate               0.065554
present_residence              0.007676
dependents                    -0.002630
existing_credits              -0.055938
checking_account_status_A13   -0.082670
age                           -0.097674
checking_account_status_A14   -0.315427
Name: target, dtype: float64
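Note that the [:10] slice simply takes the first ten rows of the target column of the correlation matrix. To rank every variable by the strength of its relationship with the target instead, one could sort by absolute value; a minimal sketch:

# Rank all features by |correlation with target|, strongest first
print(corrMatrix['target'].drop('target').abs().sort_values(ascending=False).head(10))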

From the correlation matrix (considering only these first ten variables), we take duration and age as our important variables: the former has the highest positive correlation with the target and the latter a notable negative one.

col = ['target', 'duration', 'credit_amount', 'checking_account_status_A12',
       'installment_rate', 'present_residence', 'dependents',
       'existing_credits', 'checking_account_status_A13',
       'age', 'checking_account_status_A14']
corr = data_train[col].corr()

ax = sns.heatmap(corr,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 200, n=200),
    square=True)

ax.set_xticklabels(ax.get_xticklabels(),
    rotation=45, horizontalalignment='right');

Figure: correlation heatmap of the ten selected variables and the target.

Based on the correlation matrix, we keep the two variables with a considerable impact on the target variable (one positive and one negative correlation): duration and age.

In the next step, we create an empty list to store the metric of each classification tree, then pick five random variables from the training set and append the two correlated variables selected above.

fp = []
# Pick five random column indices; valid positions run from 0 to the
# number of feature columns minus 1 (randint(2, 49) could exceed them)
nu = [randint(0, X_train.shape[1] - 1) for _ in range(5)]
nam = list(X_train.columns[nu])
nam.extend(['duration', 'age'])
nam

As the result below shows, five variables were chosen at random, and duration and age were appended.

['present_residence',
 'credit_history_A31',
 'job_A172',
 'purpose_A41',
 'purpose_A48',
 'duration',
 'age']
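Since randint is unseeded and may pick the same column twice, the selection changes between runs. One way to make it reproducible and duplicate-free is random.sample with a fixed seed; a sketch, where the seed value is just an illustrative choice:

import random

random.seed(42)  # illustrative seed; any fixed value makes the run repeatable
nu = random.sample(range(X_train.shape[1]), 5)  # five distinct column indices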

Classification function: construction

Finally, we define a function that takes five arguments:

i) the empty list for metrics,

ii) the training feature matrix,

iii) the test feature matrix,

iv) the target variable for the training set, and

v) the target variable for the test set.

The most relevant part of the function permutes, one at a time, each of the selected random variables plus duration and age. Each permutation is stored in a temporary DataFrame which is used only for that classification run.

Once we fit the classifier, we predict on the test dataset, build the confusion matrix, and append its cm[1, 0] entry, the count of positive cases predicted as negative, to the metric list. After the loop finishes, we fit the classifier on the original, unshuffled training data and record its metric as a baseline.
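To make the cm[1, 0] entry concrete, here is a minimal sketch with made-up labels (scikit-learn places true labels on the rows and predictions on the columns):

toy = confusion_matrix([0, 1, 1, 1], [0, 0, 1, 1])
print(toy)        # [[1 0]
                  #  [1 2]]
print(toy[1, 0])  # 1: one sample with true label 1 was predicted as 0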

def stree(fl, xt, xte, yt, yte):
    for i in nam:
        # Shuffle the values of one column while leaving the others intact
        df = xt.copy()
        df[i] = np.random.permutation(df[i].values)
        clf = DecisionTreeClassifier()
        clf = clf.fit(df, yt)

        # Predict the response for the test dataset
        y_pred = clf.predict(xte)
        cm = confusion_matrix(yte, y_pred)
        print(cm[1, 0])
        fl.append(cm[1, 0])

    # Baseline: fit on the unshuffled training data
    clf = clf.fit(xt, yt)
    y_base = clf.predict(xte)
    cm = confusion_matrix(yte, y_base)
    fl.append(cm[1, 0])
    return fl
stree(fp,X_train,X_test,y_train,y_test)
71
71
71
71
26
61
54
re = nam.copy()
re.append('BASE')

Once our classification function has run, we plot the metric (the cm[1, 0] error count) for each variable, plus the estimation on the original training data (called BASE).

We see that the variable with the least error is purpose_A48, followed by our BASE model; the next best are age and duration, while the remaining variables share the highest error.

fig, ax = plt.subplots(figsize = (9, 6))
ax.bar(re, fp)
plt.setp(plt.gca().get_xticklabels(), rotation=25, horizontalalignment='right', fontsize=8)
plt.title('Train data: Selected variables')
plt.show()

Figure: error counts per permuted variable on the training data, with the BASE model for comparison.

Modeling on the Test dataset

Finally, to test the previous model, we reproduce the same variables and classification-tree specification on the test dataset.

The test sample is considerably smaller, with 100 rows in total and only 70 rows in its X split.

X_train, X_test, y_train, y_test = uplo(data_test)
The shape of dataset: 100
The shape of the X dataset: 70
fpt = []
stree(fpt,X_train,X_test,y_train,y_test)
10
10
10
10
10
5
8
ret = nam.copy()
ret.append('TEST')

After running the classification model again, we plot its metrics. This time the variable with the least error is duration, followed by the TEST baseline and age; the remaining variables all have a higher error.

fig, ax = plt.subplots(figsize = (9, 6))
plt.bar(ret, fpt)
plt.setp(plt.gca().get_xticklabels(), rotation=25, horizontalalignment='right', fontsize=8)
plt.title('Test data: Selected variables')
plt.show()

Figure: error counts per permuted variable on the test data, with the TEST baseline for comparison.

To conclude, we build a DataFrame zipping the results for the training sample only.

lt = pd.DataFrame(list(zip(re, fp)), columns=['Name', 'CM val'])
lt.index = lt.Name
lt.drop(columns='Name', inplace=True)

lt
                    CM val
Name
present_residence       71
credit_history_A31      71
job_A172                71
purpose_A41             71
purpose_A48             26
duration                61
age                     54
BASE                    31
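To read the ranking off directly, the table can be sorted by its error column; a minimal sketch:

# Variables ordered from least to most error
print(lt.sort_values('CM val'))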

To present the main variables and their metrics, we construct a dictionary with three entries:

i) the error of the BASE model,

ii) the results for the most important variables, which must differ from the BASE model, and

iii) the metric of the variable model with the least error, which in our case does not come from the BASE model.

Dict = {
    # Error of the BASE model (the last value appended by stree)
    'fp': [fp[-1]],
    # Metrics of the most important variables, distinct from the BASE model
    'most_important': lt.loc[["purpose_A48", "age"]].to_dict(),
    # Least error across all the variable models
    'fp_most_important': lt['CM val'].min()}
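For reference, this is a sketch of the structure the dictionary takes, filled in with the training-sample values from the table above:

print(Dict)
# {'fp': [31],
#  'most_important': {'CM val': {'purpose_A48': 26, 'age': 54}},
#  'fp_most_important': 26}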