VSCode (code-server) in the Cloud: A Simple Classification Example on Kaggle
Developers around the world each have a favourite code editor. Visual Studio Code is among the most widely used, largely because of its flexibility.
But is it possible to run VSCode on a hosted notebook platform like Kaggle? Probably not, right?
Kudos to the ML community for colabcode, a Python library that makes this possible with just two lines of code. In this post, I use VSCode on a Kaggle kernel to build a simple classification model with the help of this library.
About colabcode
Installing and getting started with colabcode
Building a simple classification model on Kaggle in the VSCode environment
colabcode
colabcode is a Python library by Abhishek Thakur, built to run code-server on Google Colab, Kaggle Notebooks, or any other cloud platform.
Installing and getting started with colabcode
install colabcode
To install colabcode, open a Kaggle kernel, launch a notebook to work in, and run the command below.
pip install colabcode
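If the install did not succeed (on Kaggle, pip needs the notebook's Internet setting switched on), the import in the next step fails with ModuleNotFoundError. A minimal sketch of a check that uses only the standard library; `is_installed` is a hypothetical helper name, not part of colabcode:

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("colabcode"):
    print("colabcode is ready to import")
else:
    print("colabcode not found; re-run the install cell")
```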
import colabcode
Import the ColabCode class from the colabcode library by running the snippet below:
from colabcode import ColabCode
run colabcode
ColabCode(port=10000, password="XXXX")
Note: XXXX is a placeholder for a password of your choice. The password is optional, so you can also start the server without one, and you can use a different port.
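A hard-coded password ends up saved in the notebook, so one option is to generate it at run time instead. A small sketch using Python's standard `secrets` module; the variable name `password` is only for illustration, and the ColabCode call is left as a comment so the sketch runs without colabcode installed:

```python
import secrets

# Generate a random, URL-safe password instead of typing one into the notebook.
password = secrets.token_urlsafe(16)
print("code-server password:", password)

# Then pass it to the same call shown above:
# ColabCode(port=10000, password=password)
```

Copy the printed password before opening the editor, since a new one is generated on every run.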
Building a simple Classification model
The dataset used for modelling comes from a competition on Zindi. The challenge was to predict which customers of a telecommunications company will churn. You can find the dataset here.
The steps taken up to modelling are:
Missing Values: Filled categorical features with "missing" and numeric features with 0
Dropped redundant features that are not useful for the model
Used Label Encoding to convert categorical data to numeric.
Scaled the data using RobustScaler, because the data contains outliers
Trained a simple LightGBM classifier (LGBMClassifier) with log loss as the evaluation metric
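To make the preprocessing steps above concrete, here is a plain-Python sketch on a few toy rows. The values and the `amount` column are made up for illustration, and the real pipeline below uses pandas, category_encoders and scikit-learn, whose encoders may assign integers differently. The scaling step uses the median and the interquartile range (IQR), which is the idea behind RobustScaler.

```python
from statistics import median, quantiles

# Toy rows standing in for the real data (values are invented).
rows = [
    {"REGION": "north", "amount": 100.0},
    {"REGION": None, "amount": None},
    {"REGION": "south", "amount": 300.0},
    {"REGION": "north", "amount": 200.0},
    {"REGION": "south", "amount": 5000.0},  # an outlier
]

# Step 1: fill missing values ("missing" for categorical, 0 for numeric).
for row in rows:
    if row["REGION"] is None:
        row["REGION"] = "missing"
    if row["amount"] is None:
        row["amount"] = 0.0

# Step 2: encode each category as an integer, in order of first appearance.
codes = {}
for row in rows:
    codes.setdefault(row["REGION"], len(codes) + 1)
encoded_region = [codes[row["REGION"]] for row in rows]

# Step 3: robust scaling, (x - median) / IQR, so the outlier does not set the scale.
amounts = [row["amount"] for row in rows]
q1, _, q3 = quantiles(amounts, n=4, method="inclusive")
med = median(amounts)
scaled_amounts = [(x - med) / (q3 - q1) for x in amounts]

print(codes)
print(encoded_region)
print(scaled_amounts)
```

The outlier still ends up far from the rest after scaling, but the typical values keep a sensible spread because the median and IQR are barely affected by it.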
'''
This Notebook is a simple tutorial that shows how to train a simple model on vscode (codeserver) on kaggle
Train: The training data used in the modelling
'''
#import library
import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import shuffle
from sklearn.metrics import log_loss
import lightgbm as lgb
train = pd.read_csv("../kaggle/input/expresso-train/Train.csv")
def fill_missing(train_data):
    '''
    This function is used to clean the data
    Missing values: fill numerical features with 0 and fill categorical features with "missing"
    '''
    # numeric missing features filled with 0
    numeric_df = train_data.select_dtypes(exclude = ['object']).fillna(0)
    # categorical missing features filled with the string "missing"
    cat_df = train_data.select_dtypes(include = ['object']).fillna('missing')
    # join numeric_df and cat_df back together
    df_train = numeric_df.join(cat_df)
    return df_train
def drop_redundant(train_data):
    '''
    This function drops features that are redundant or otherwise not useful for modelling
    parameter:
    train_data: the raw training data; it is first passed through fill_missing()
    '''
    df_train = fill_missing(train_data = train_data)
    df_train = df_train.drop(['MRG', 'TOP_PACK', 'ZONE1', 'user_id'], axis = 1)
    return df_train
def Label_encode(train_data):
    '''
    This function uses the category_encoders library to transform categorical features to numeric
    '''
    df_train = drop_redundant(train_data = train_data)
    cat_cols = ['REGION', 'TENURE']
    encoder = ce.OrdinalEncoder(cols=cat_cols)
    df_train = encoder.fit_transform(df_train)
    return df_train
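def scale(train_data):
    '''This function scales the features with RobustScaler (median and IQR); the CHURN target is left unscaled
    '''
    df_train = Label_encode(train_data = train_data)
    # scale only the features so the target keeps its original 0/1 labels
    features = df_train.drop('CHURN', axis = 1)
    scaler = RobustScaler()
    df_train_scaled = scaler.fit_transform(features)
    df_train_n = pd.DataFrame(df_train_scaled, columns = features.columns, index = features.index)
    df_train_n['CHURN'] = df_train['CHURN']
    return df_train_n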
def fit(data):
    '''This function contains everything about the modelling: cross-validation and fitting
    '''
    data = scale(train_data = data)
    # separate X and y
    X = data.drop('CHURN', axis = 1)
    y = data['CHURN']
    # shuffle=True is needed for random_state to have any effect
    skf = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 42)
    err_lgb = []
    for train_index, test_index in skf.split(X, y):
        print('Train:', train_index, 'Test:', test_index)
        # split() yields positional indices, so select rows with iloc
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        estimator = lgb.LGBMClassifier(learning_rate = 0.175, n_estimators = 28)
        # log loss is the evaluation metric; with only 28 trees, early stopping
        # after 200 rounds cannot trigger unless n_estimators is raised
        estimator.fit(X_train, y_train,
                      eval_set = [(X_train, y_train), (X_test, y_test)],
                      eval_metric = 'binary_logloss',
                      callbacks = [lgb.early_stopping(stopping_rounds = 200)])
        pred = estimator.predict_proba(X_test)[:,1]
        err = log_loss(y_test, pred)
        err_lgb.append(err)
    cv_log_loss = np.mean(err_lgb)
    print("Nice Job with {} as log-loss".format(cv_log_loss))
    return cv_log_loss
fit(data = train)
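The number printed at the end is the mean log loss across the folds. To give a feel for what that metric measures, here is a minimal plain-Python sketch of binary log loss. `binary_log_loss` is a hypothetical helper written for illustration (the pipeline above uses `log_loss` from scikit-learn), and the labels and probabilities are made up.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 0, 1]
confident = binary_log_loss(y_true, [0.9, 0.1, 0.2, 0.8])
unsure = binary_log_loss(y_true, [0.5, 0.5, 0.5, 0.5])

print(confident)  # lower is better
print(unsure)     # always predicting 0.5 gives ln(2), about 0.693
```

A model that always predicts 0.5 scores about 0.693, so a cross-validated log loss well below that indicates the classifier is learning something useful.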
References
Keep a watch on these addresses for more information about colabcode