VSCode (code server) in the Cloud; A Simple Classification Example on Kaggle

Developers across the globe have diverse preferences when it comes to their favourite code editor. Visual Studio Code is one of the most widely used IDEs, largely because of its flexibility.

However, is it possible to run VSCode on a cloud platform like Kaggle? It sounds unlikely, right?

Kudos to the ML community for the Python library colabcode, which enables exactly this with just two lines of code. In this blog, I will use VSCode on a Kaggle kernel to build a simple classification model with the help of this awesome library.

  • About colabcode

  • Installing and getting started with colabcode

  • Building a simple classification model on Kaggle in the VSCode environment

colabcode

colabcode is a Python library by Abhishek Thakur, built to run code server on Google Colab, Kaggle Notebooks, or any other cloud platform.

Installing and getting started with colabcode

install colabcode

To install colabcode, open Kaggle, launch a new notebook, and run the command below (prefix it with ! when running it in a notebook cell).

pip install colabcode

import colabcode

Import the ColabCode class from the colabcode library by running the snippet below:

from colabcode import ColabCode

run colabcode

ColabCode(port=10000, password="XXXX")

Note: XXXX is a placeholder for your desired password. Both arguments are optional: you can run ColabCode without specifying a password, and you can choose any free port.
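For example, a minimal sketch with the defaults (no password) looks like this; once it starts, colabcode prints a public (ngrok) tunnel URL that opens the VSCode editor in your browser.

from colabcode import ColabCode

# Run with the defaults: no password, code server on port 10000.
# colabcode then prints a public (ngrok) URL that opens the
# VSCode editor in the browser.
ColabCode(port=10000)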

Building a simple classification model

The dataset used for modelling comes from one of the competitions on Zindi. The challenge was to predict which customers of a telecommunications company will churn. You can find the dataset here.
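Before preprocessing, it helps to take a quick look at the data. The snippet below is a minimal sketch: it assumes the file path used later in this notebook, and that the target column is named CHURN, as in the modelling code.

import pandas as pd

train = pd.read_csv("/kaggle/input/expresso-train/Train.csv")

print(train.shape)                    # number of rows and columns
print(train.isnull().sum())           # missing values per column
print(train['CHURN'].value_counts())  # class balance of the target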

The steps, from preprocessing through modelling, are:

  • Missing Values: Filled categorical features with "missing" and numeric features with 0

  • Dropped redundant features that are not useful for the model

  • Used Label Encoding to convert categorical data to numeric.

  • Scaled the data using RobustScaler, since the data contains outliers

  • Trained a simple LGBMClassifier with log-loss as the evaluation metric

'''
This notebook is a simple tutorial that shows how to train a simple model
in VSCode (code server) on Kaggle.

Train: the training data used in the modelling.
'''


# import libraries
import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
import lightgbm as lgb



train = pd.read_csv("/kaggle/input/expresso-train/Train.csv")


def fill_missing(train_data):
    '''
    This function cleans the data.

    Missing values: fill numeric features with 0 and categorical
    features with the string "missing".
    '''
    # numeric missing values filled with 0
    numeric_df = train_data.select_dtypes(exclude=['object']).fillna(0)

    # categorical missing values filled with the string "missing"
    cat_df = train_data.select_dtypes(include=['object']).fillna('missing')

    # join numeric_df and cat_df back together
    df_train = numeric_df.join(cat_df)

    return df_train


def drop_redundant(train_data):
    '''
    This function drops features that are redundant or not useful for modelling.

    Parameters:
    train_data: the raw training DataFrame
    '''
    df_train = fill_missing(train_data=train_data)
    df_train.drop(['MRG', 'TOP_PACK', 'ZONE1', 'user_id'], axis=1, inplace=True)

    return df_train


def Label_encode(train_data):
    '''
    This function uses the category_encoders library to transform
    categorical features to numeric.
    '''
    df_train = drop_redundant(train_data=train_data)
    cat_cols = ['REGION', 'TENURE']
    encoder = ce.OrdinalEncoder(cols=cat_cols)
    df_train = encoder.fit_transform(df_train)

    return df_train


def scale(train_data):
    '''
    This function scales the features with RobustScaler, which is robust to
    the outliers in the data. The target column CHURN is left unscaled so
    it stays binary.
    '''
    df_train = Label_encode(train_data=train_data)

    features = df_train.drop('CHURN', axis=1)
    scaler = RobustScaler()
    scaled = scaler.fit_transform(features)

    df_train_n = pd.DataFrame(scaled, columns=features.columns)
    df_train_n['CHURN'] = df_train['CHURN'].values

    return df_train_n




def fit(data):
    '''
    This function contains everything about the modelling:
    cross-validation and fitting.
    '''

    data = scale(train_data=data)

    # separate X and y
    X = data.drop('CHURN', axis=1)
    y = data['CHURN']

    # stratified 10-fold cross-validation (shuffle is required
    # when a random_state is set)
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

    err_lgb = []

    for fold, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
        print('Fold {}'.format(fold))
        X_train, X_test = X.loc[train_index], X.loc[test_index]
        y_train, y_test = y.loc[train_index], y.loc[test_index]

        estimator = lgb.LGBMClassifier(learning_rate=0.175, n_estimators=28)

        # eval_metric belongs in fit(), not in the constructor; early
        # stopping is dropped since a 200-round patience could never
        # trigger with only 28 estimators
        estimator.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_test, y_test)],
                      eval_metric='binary_logloss')

        pred = estimator.predict_proba(X_test)[:, 1]
        err = log_loss(y_test, pred)
        err_lgb.append(err)

    cv_log_loss = np.mean(err_lgb)
    print("Nice job: {} log-loss".format(cv_log_loss))

    return cv_log_loss



fit(data=train)

References

Keep an eye on these resources for more information about colabcode:

  • colabcode GitHub repository here

  • Video tutorial on colabcode here