Auto-ViML — Minimum Guarantee Machine Learning Package

Thanga Sami
8 min read · Jun 9, 2021

INTRODUCTION:

If a single package could handle all the ML work by itself (feature engineering, data preprocessing, cleaning, handling data imbalance, model selection, and hyperparameter tuning) and generate output for any hackathon competition, wouldn't we call it a superstar, all-in-one package?

Yes, Auto-ViML does exactly that. It runs the entire machine learning process by itself and achieves good scores in hackathon competitions as well.

In this article, let us see in detail how Auto-ViML works on one of the Analytics Vidhya hackathon problems.

The dataset we used for our analysis is available on the AV hackathon site below.

https://datahack.analyticsvidhya.com/contest/janatahack-healthcare-analytics-ii

The Auto-ViML package can be installed with the pip command below.

pip install autoviml

from autoviml.Auto_ViML import Auto_ViML

The model can be built using the call below; a short sketch follows the argument list. Apart from the train and test data, we also need to specify the target variable. No preprocessing is needed; the model takes care of it by itself.

Input Arguments:

train — Train Data Set with target Column

target — Target column name

test — Test Dataset without target Column

submission — Path of the submission file. It can be left as an empty string, in which case the working folder is used as the default path.

scoring_parameter — ‘balanced_accuracy’ in our case. If not specified, it will pick an appropriate scoring parameter for the problem and build the model with it.

hyper_param — RandomSearch (RS) or Grid Search (GS). Default is RS.
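
For intuition, here is the difference between the two search strategies in plain scikit-learn; this is only an illustration of random vs. grid search, not Auto-ViML's internals:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
params = {"n_estimators": [100, 200, 400], "max_depth": [3, 5, None]}

# GS evaluates every combination (9 candidates here);
# RS samples only n_iter of them, which is why it is much faster on large grids.
gs = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3).fit(X, y)
rs = RandomizedSearchCV(RandomForestClassifier(random_state=0), params,
                        n_iter=4, cv=3, random_state=0).fit(X, y)
print(gs.best_params_, rs.best_params_)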

feature_reduction — Default = ‘True’ but it can be set to False if you don’t want automatic feature_reduction since, in Image data sets like digits and MNIST, you get better results when you don’t reduce features automatically. You can always try both and see.

Boosting_Flag — 4 possible choices:

None — builds a Linear model;

False — builds a Random Forest or Extra Trees (bagging) model;

True — builds an XGBoost model;

CatBoost — builds a CatBoost model.

Add_Poly:

0 — do nothing;

1 — adds interaction variables only, such as x1*x2, x2*x3, … x9*x10, etc.;

2 — adds interaction and squared variables such as x1**2, x2**2, etc.;

3 — adds both interaction and squared variables, such as x1*x2, x1**2, x2*x3, x2**2, etc.
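
For intuition, this is analogous to what scikit-learn's PolynomialFeatures produces; the sketch below is purely illustrative, since Auto-ViML handles this internally via the Add_Poly flag:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0], [3.0, 4.0]])  # two features x1, x2

# interaction_only=True resembles Add_Poly=1 (x1*x2 terms only, no squares)
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(interactions.fit_transform(X))   # columns: x1, x2, x1*x2

# degree=2 without interaction_only adds squared terms as well (similar to Add_Poly=3)
full = PolynomialFeatures(degree=2, include_bias=False)
print(full.fit_transform(X))           # columns: x1, x2, x1**2, x1*x2, x2**2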

Stacking_Flag:

Default is False. If set to True, it will add an additional feature derived from the predictions of another model.

Binning_Flag:

Default is False. If set to True, it will convert the top numeric variables into binned variables using a technique known as “Entropy” binning. This is very helpful for certain datasets (especially ones that are hard to model).
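
As a rough illustration of entropy-based binning (not Auto-ViML's exact implementation), a shallow decision tree trained with the entropy criterion can supply the cut points for a numeric variable:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=1000).reshape(-1, 1)   # one numeric variable
y = (x.ravel() + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# A shallow tree with the entropy criterion finds information-gain-based split points
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(x, y)
cuts = np.sort(tree.tree_.threshold[tree.tree_.threshold > -2])  # drop leaf markers (-2)
x_binned = np.digitize(x.ravel(), cuts)    # the binned version of the variable
print(cuts, np.bincount(x_binned))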

Imbalanced_Flag:

Default is False. If set to True, it will use SMOTE from imbalanced-learn to oversample the rare class in an imbalanced dataset and make the classes balanced (for example, 50-50 in a binary classification). This also works for regression problems where the target variable has a highly skewed distribution; Auto_ViML creates additional samples using SMOTE for highly imbalanced data.
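
The underlying idea is imbalanced-learn's SMOTE oversampler; here is a minimal sketch of what balancing the rare class looks like on a toy dataset (Auto-ViML wires this in for you when the flag is True):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A deliberately imbalanced toy dataset (roughly 95% vs 5%)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after :", Counter(y_res))   # classes are now balanced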

Verbose: 0 — limited output; 1 — more charts; 2 — lots of charts and output.
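
Putting it all together, a call for this hackathon might look like the sketch below. The keyword names follow the Auto_ViML README and the flag values mirror what the log later shows (XGBoost boosting, SMOTE oversampling), so treat this as an assumption rather than the exact command used:

import pandas as pd
from autoviml.Auto_ViML import Auto_ViML

train = pd.read_csv("train.csv")   # AV healthcare train file (path is an assumption)
test = pd.read_csv("test.csv")     # test file without the 'Stay' column
target = "Stay"

# Keyword names follow the Auto_ViML README; verify against your installed version.
model1, Features1, trainm, testm = Auto_ViML(
    train, target, test,
    sample_submission="",              # empty string -> working folder used as default
    scoring_parameter="balanced_accuracy",
    hyper_param="RS",                  # RandomizedSearch
    feature_reduction=True,
    Boosting_Flag=True,                # True -> XGBoost, as used in the run below
    Add_Poly=0,
    Stacking_Flag=False,
    Binning_Flag=False,
    Imbalanced_Flag=True,              # SMOTE oversampling of the rare classes
    verbose=2,
)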

Return Values:

Features1 — details of the features Auto-ViML used for the analysis

trainm — preprocessed train dataset with predictions

testm — preprocessed test dataset with predictions

model1 — details of the optimal model. In our case, CalibratedClassifierCV was identified as the best model by Auto-ViML, and its details are given below.

CalibratedClassifierCV(base_estimator=OneVsRestClassifier(estimator=XGBClassifier(base_score=None,
booster='gbtree',
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None,
gamma=None,
gpu_id=None,
importance_type='gain',
interaction_constraints=None,
learning_rate=None,
max_delta_step=None,
max_depth=None,
min_child_weight=None,
missing=nan,
monotone_constraints=None,
n_estimators=200,
n_jobs=-1,
nthread=-1,
num_parallel_tree=None,
random_state=99,
reg_alpha=None,
reg_lambda=None,
scale_pos_weight=None,
subsample=None,
tree_method=None,
validate_parameters=None,
verbosity=None)),
cv=5, method='isotonic')
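
With the run complete, the returned testm frame already carries the predictions, so a submission file can be assembled directly. The prediction column name below is an assumption (check testm.columns in your own run), and test/testm come from the call sketched earlier:

import pandas as pd

# testm is returned by Auto_ViML; 'Stay_predictions' is an assumed column name,
# so replace it with whatever prediction column your version actually produces
print(testm.columns.tolist())

submission = pd.DataFrame({
    "case_id": test["case_id"],          # IDs taken from the original test file
    "Stay": testm["Stay_predictions"],   # assumed prediction column (hypothetical name)
})
submission.to_csv("submission.csv", index=False)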

INSIDE Auto-ViML:

The analysis performed by Auto-ViML is listed below.

##############  D A T A   S E T  A N A L Y S I S  #######################
ALERT! Changing hyperparameter search to RS. Otherwise XGBoost will take too long for 10,000+ rows.
Training Set Shape = (318438, 19)
Training Set Memory Usage = 46.16 MB
Test Set Shape = (137057, 18)
Test Set Memory Usage = 18.82 MB
Single_Label Target: ['Stay']
################ Multi_Classification VISUALIZATION Started #####################
Random shuffling the data set before training
Using RandomizedSearchCV for Hyper Parameter Tuning. This is 3X faster than GridSearchCV...
ALERT! Setting Imbalanced_Flag to True in Auto_ViML for Multi_Classification problems improves results!
Class -> Counts -> Percent
0-10: 23604 -> 7.4%
11-20: 78139 -> 24.5%
21-30: 87491 -> 27.5%
31-40: 55159 -> 17.3%
41-50: 11743 -> 3.7%
51-60: 35018 -> 11.0%
61-70: 2744 -> 0.9%
71-80: 10254 -> 3.2%
81-90: 4838 -> 1.5%
91-100: 2765 -> 0.9%
More than 100 Days: 6683 -> 2.1%
CAUTION: In Multi-Class Boosting (2+ classes), TRAINING WILL TAKE A LOT OF TIME!
String or Multi Class target: Stay transformed as follows: {'21-30': 0, '11-20': 1, '31-40': 2, '51-60': 3, '0-10': 4, '41-50': 5, '71-80': 6, 'More than 100 Days': 7, '81-90': 8, '91-100': 9, '61-70': 10}
Alert! Rare Class is not 1 but 10 in this data set
############## C L A S S I F Y I N G V A R I A B L E S ####################
Classifying variables in data set...
Number of Numeric Columns = 3
Number of Integer-Categorical Columns = 5
Number of String-Categorical Columns = 8
Number of Factor-Categorical Columns = 0
Number of String-Boolean Columns = 0
Number of Numeric-Boolean Columns = 0
Number of Discrete String Columns = 0
Number of NLP String Columns = 0
Number of Date Time Columns = 0
Number of ID Columns = 1
Number of Columns to Delete = 1
18 Predictors classified...
This does not include the Target column(s)
2 variables removed since they were ID or low-information variables
['case_id', 'source']
############# D A T A P R E P A R A T I O N AND C L E A N I N G #############
Filling missing values with "missing" placeholder and adding a column for missing_flags
Columns with most missing values: ['City_Code_Patient', 'Bed Grade']
and their missing value totals: [4532, 113]
Completed missing value Imputation. No more missing values in train.
2 new missing value columns added: ['Bed Grade_Missing_Flag', 'City_Code_Patient_Missing_Flag']
Test data has no missing values. Continuing...
Completed Label Encoding and Filling of Missing Values for Train and Test Data
Multi_Classification problem: hyperparameters are being optimized for balanced_accuracy
############# R E M O V I N G H I G H L Y C O R R E L A T E D V A R S #################
Removing highly correlated variables using SULA method among (18) numeric variables
No numeric vars removed since none have high correlation with each other in this data...
Splitting features into float and categorical (integer) variables:
(3) float variables ...
(15) categorical vars...
############## F E A T U R E S E L E C T I O N BY X G B O O S T ####################
Current number of predictors = 18
Finding Important Features using Boosted Trees algorithm...
using 18 variables...
using 14 variables...
using 10 variables...
using 6 variables...
using 2 variables...
Found 16 important features
Performing limited feature engineering for binning, add_poly and KMeans_Featurizer flags ...
Train CV Split completed with TRAIN rows = 254750 , CV rows = 63688
Binning_Flag set to False or there are no float vars in data set to be binned
KMeans_Featurizer set to False or there are no float variables in data
Performing MinMax scaling of train and validation data
############### XGBoost M O D E L B U I L D I N G B E G I N S ####################
Rows in Train data set = 254750
Features in Train data set = 16
Rows in held-out data set = 63688
Finding Best Model and Hyper Parameters for XGBoost model...
Baseline Accuracy Needed for Model = 99.14%
CPU Count = 8 in this device
Using XGBoost Model, Estimated Training time = 140.11 mins
################## Imbalanced Model Training ############################
Imbalanced Training using SMOTE Rare Class Oversampling method...
Using SMOTE's over-sampling techniques to make the 11 classes balanced...
class_weights = [0.03308772 0.03704803 0.05248281 0.0826697 0.12264519 0.24653067
0.28232465 0.43312308 0.59827153 1.04697518 1.05508387]
class_weighted_rows = {0: 69993, 1: 62511, 2: 44127, 3: 28014, 4: 18883, 5: 9394,
6: 8203, 7: 5347, 8: 3871, 9: 2315, 10: 2315}
Regression-resampler is erroring. Continuing...
########################################################
XGBoost Model Prediction Results on Held Out CV Data Set:
Multi Class Model Metrics Report
#####################################################

Accuracy = 47.8%
Balanced Accuracy (average recall) = 32.7%
Average Precision (macro) = 73.1%
Precisions by class:
44.7% 47.0% 51.6% 47.2% 70.8% 100.0% 90.2% 77.7% 78.1% 96.4% 100.0%
Recall Scores by class:
70.2% 57.8% 27.9% 56.7% 9.9% 1.0% 10.8% 57.6% 44.2% 19.2% 4.6%
F1 Scores by class:
54.7% 51.8% 36.2% 51.5% 17.4% 2.0% 19.3% 66.1% 56.4% 32.0% 8.7%
#####################################################
precision recall f1-score support
0 0.45 0.70 0.55 17498
1 0.47 0.58 0.52 15628
2 0.52 0.28 0.36 11032
3 0.47 0.57 0.51 7004
4 0.71 0.10 0.17 4721
5 1.00 0.01 0.02 2349
6 0.90 0.11 0.19 2051
7 0.78 0.58 0.66 1336
8 0.78 0.44 0.56 967
9 0.96 0.19 0.32 553
10 1.00 0.05 0.09 549
accuracy 0.48 63688
macro avg 0.73 0.33 0.36 63688
weighted avg 0.54 0.48 0.44 63688
[[12290 4627 171 324 53 0 0 13 20 0 0]
[ 5416 9027 591 512 78 0 0 0 4 0 0]
[ 4403 2044 3082 1417 39 0 7 28 12 0 0]
[ 1142 505 1297 3973 15 0 3 41 26 2 0]
[ 2045 2197 2 8 469 0 0 0 0 0 0]
[ 1470 517 135 175 6 24 2 15 5 0 0]
[ 236 104 413 994 0 0 222 67 14 1 0]
[ 74 44 72 336 2 0 6 769 32 1 0]
[ 64 38 87 319 0 0 2 30 427 0 0]
[ 59 24 87 252 0 0 2 19 4 106 0]
[ 274 80 41 116 0 0 2 8 3 0 25]]
################# E N S E M B L E M O D E L ##################
Time taken = 105 seconds
Based on trying multiple models, Best type of algorithm for this data set is Bagging_Classifier
#############################################################################
Displaying results of weighted average ensemble of 5 classifiers
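
For reference, the held-out metrics printed above (balanced accuracy, the per-class report, and the confusion matrix) are the standard scikit-learn metrics; here is a tiny self-contained sketch with dummy labels standing in for the held-out split:

import numpy as np
from sklearn.metrics import (balanced_accuracy_score, classification_report,
                             confusion_matrix)

# Dummy stand-ins for the held-out labels and predictions Auto_ViML reports on
y_cv   = np.array([0, 1, 2, 2, 1, 0, 2, 1, 0, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 0, 0, 2])

print("Balanced accuracy:", balanced_accuracy_score(y_cv, y_pred))
print(classification_report(y_cv, y_pred))
print(confusion_matrix(y_cv, y_pred))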

Auto-ViML Performance:

Auto-ViML gives a good performance score compared with other hyperparameter-tuned models. The Analytics Vidhya performance comparison is given below for reference.

Final Remarks:

Personally, I believe in manual preprocessing and model tuning. However, Auto-ViML gives good results and can be used as a ready-made, quick solution for our analysis.

If you are interested in my articles, you can follow me on Medium for more like this. You can also get in touch with me on LinkedIn.

Thank you for reading my article!

