Auto-ViML — Minimum Guarantee Machine Learning Package

Thanga Sami
8 min read · Jun 9, 2021

INTRODUCTION:

If a single package could handle all the ML work by itself (feature engineering, data preprocessing, cleaning, handling data imbalance, model selection, and hyperparameter tuning) and generate output for any hackathon competition, wouldn't we call it a superstar, all-in-one package?

Yes, Auto-ViML does exactly that. It runs the entire machine learning process by itself and achieves good scores in hackathon competitions as well.

In this article, let us see in detail how Auto-ViML works on one of the Analytics Vidhya hackathon problems.

The dataset we used for our analysis is available on the AV hackathon site below.

https://datahack.analyticsvidhya.com/contest/janatahack-healthcare-analytics-ii

The Auto-ViML package can be installed with the pip command below.

pip install autoviml

from autoviml.Auto_ViML import Auto_ViML

The model can be built using the call below; a short sketch follows the argument list. Apart from the train and test data, we also need to specify the target variable. No preprocessing is needed; the model takes care of it by itself.

Input Arguments:

train — Train Data Set with target Column

target — Target column name

test — Test Dataset without target Column

submission — Path of the submission file. It can be left as an empty string, in which case the working folder is used as the default path.

scoring_parameter — ‘balanced_accuracy’ in our case. If not specified, it will pick an appropriate scoring parameter for the problem and build the model with it.

hyper_param — RandomSearch (RS) or Grid Search (GS). Default is RS.
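
For intuition, here is the difference between the two search strategies in plain scikit-learn; this is only an illustration of random vs. grid search, not Auto-ViML's internals:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
params = {"n_estimators": [100, 200, 400], "max_depth": [3, 5, None]}

# GS evaluates every combination (9 candidates here);
# RS samples only n_iter of them, which is why it is much faster on large grids.
gs = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3).fit(X, y)
rs = RandomizedSearchCV(RandomForestClassifier(random_state=0), params,
                        n_iter=4, cv=3, random_state=0).fit(X, y)
print(gs.best_params_, rs.best_params_)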

feature_reduction — Default = ‘True’ but it can be set to False if you don’t want automatic feature_reduction since, in Image data sets like digits and MNIST, you get better results when you don’t reduce features automatically. You can always try both and see.

Boosting_Flag — 4 possible choices:

None — builds a Linear model;

False — builds a Random Forest or Extra Trees (bagging) model;

True — builds an XGBoost model;

CatBoost — builds a CatBoost model.

Add_Poly:

0 — do nothing;

1 — adds interaction variables only, such as x1*x2, x2*x3, … x9*x10, etc.;

2 — adds interaction and squared variables such as x1**2, x2**2, etc.;

3 — adds both interaction and squared variables, such as x1*x2, x1**2, x2*x3, x2**2, etc.
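
For intuition, this is analogous to what scikit-learn's PolynomialFeatures produces; the sketch below is purely illustrative, since Auto-ViML handles this internally via the Add_Poly flag:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0], [3.0, 4.0]])  # two features x1, x2

# interaction_only=True resembles Add_Poly=1 (x1*x2 terms only, no squares)
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(interactions.fit_transform(X))   # columns: x1, x2, x1*x2

# degree=2 without interaction_only adds squared terms as well (similar to Add_Poly=3)
full = PolynomialFeatures(degree=2, include_bias=False)
print(full.fit_transform(X))           # columns: x1, x2, x1**2, x1*x2, x2**2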

Stacking_Flag:

Default is False. If set to True, it will add an additional feature derived from the predictions of another model.

Binning_Flag:

Default is False. If set to True, it will convert the top numeric variables into binned variables using a technique known as “Entropy” binning. This is very helpful for certain datasets (especially ones that are hard to model).
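
As a rough illustration of entropy-based binning (not Auto-ViML's exact implementation), a shallow decision tree trained with the entropy criterion can supply the cut points for a numeric variable:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=1000).reshape(-1, 1)   # one numeric variable
y = (x.ravel() + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# A shallow tree with the entropy criterion finds information-gain-based split points
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(x, y)
cuts = np.sort(tree.tree_.threshold[tree.tree_.threshold > -2])  # drop leaf markers (-2)
x_binned = np.digitize(x.ravel(), cuts)    # the binned version of the variable
print(cuts, np.bincount(x_binned))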

Imbalanced_Flag:

Default is False. If set to True, it will use SMOTE from imbalanced-learn to oversample the rare class in an imbalanced dataset and make the classes balanced (for example, 50-50 in a binary classification). This also works for regression problems where the target variable has a highly skewed distribution; Auto_ViML creates additional samples using SMOTE for highly imbalanced data.
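
The underlying idea is imbalanced-learn's SMOTE oversampler; here is a minimal sketch of what balancing the rare class looks like on a toy dataset (Auto-ViML wires this in for you when the flag is True):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A deliberately imbalanced toy dataset (roughly 95% vs 5%)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after :", Counter(y_res))   # classes are now balanced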

Verbose: 0 — limited output; 1 — more charts; 2 — lots of charts and output.
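
Putting it all together, a call for this hackathon might look like the sketch below. The keyword names follow the Auto_ViML README and the flag values mirror what the log later shows (XGBoost boosting, SMOTE oversampling), so treat this as an assumption rather than the exact command used:

import pandas as pd
from autoviml.Auto_ViML import Auto_ViML

train = pd.read_csv("train.csv")   # AV healthcare train file (path is an assumption)
test = pd.read_csv("test.csv")     # test file without the 'Stay' column
target = "Stay"

# Keyword names follow the Auto_ViML README; verify against your installed version.
model1, Features1, trainm, testm = Auto_ViML(
    train, target, test,
    sample_submission="",              # empty string -> working folder used as default
    scoring_parameter="balanced_accuracy",
    hyper_param="RS",                  # RandomizedSearch
    feature_reduction=True,
    Boosting_Flag=True,                # True -> XGBoost, as used in the run below
    Add_Poly=0,
    Stacking_Flag=False,
    Binning_Flag=False,
    Imbalanced_Flag=True,              # SMOTE oversampling of the rare classes
    verbose=2,
)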

Return Values:

Features1 — details of the features Auto-ViML used for the analysis

trainm — preprocessed train dataset with predictions

testm — preprocessed test dataset with predictions

model1 — details of the optimal model. In our case, CalibratedClassifierCV was identified as the best model by Auto-ViML, and its details are given below.

CalibratedClassifierCV(base_estimator=OneVsRestClassifier(estimator=XGBClassifier(base_score=None,
booster='gbtree',
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None,
gamma=None,
gpu_id=None,
importance_type='gain',
interaction_constraints=None,
learning_rate=None,
max_delta_step=None,
max_depth=None,
min_child_weight=None,
missing=nan,
monotone_constraints=None,
n_estimators=200,
n_jobs=-1,
nthread=-1,
num_parallel_tree=None,
random_state=99,
reg_alpha=None,
reg_lambda=None,
scale_pos_weight=None,
subsample=None,
tree_method=None,
validate_parameters=None,
verbosity=None)),
cv=5, method='isotonic')
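
With the run complete, the returned testm frame already carries the predictions, so a submission file can be assembled directly. The prediction column name below is an assumption (check testm.columns in your own run), and test/testm come from the call sketched earlier:

import pandas as pd

# testm is returned by Auto_ViML; 'Stay_predictions' is an assumed column name,
# so replace it with whatever prediction column your version actually produces
print(testm.columns.tolist())

submission = pd.DataFrame({
    "case_id": test["case_id"],          # IDs taken from the original test file
    "Stay": testm["Stay_predictions"],   # assumed prediction column (hypothetical name)
})
submission.to_csv("submission.csv", index=False)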

INSIDE Auto-ViML:

The analysis performed by Auto-ViML is listed below.

##############  D A T A   S E T  A N A L Y S I S  #######################
ALERT! Changing hyperparameter search to RS. Otherwise XGBoost will take too long for 10,000+ rows.
Training Set Shape = (318438, 19)
Training Set Memory Usage = 46.16 MB
Test Set Shape = (137057, 18)
Test Set Memory Usage = 18.82 MB
Single_Label Target: ['Stay']
################ Multi_Classification VISUALIZATION Started #####################
Random shuffling the data set before training
Using RandomizedSearchCV for Hyper Parameter Tuning. This is 3X faster than GridSearchCV...
ALERT! Setting Imbalanced_Flag to True in Auto_ViML for Multi_Classification problems improves results!
Class -> Counts -> Percent
0-10: 23604 -> 7.4%
11-20: 78139 -> 24.5%
21-30: 87491 -> 27.5%
31-40: 55159 -> 17.3%
41-50: 11743 -> 3.7%
51-60: 35018 -> 11.0%
61-70: 2744 -> 0.9%
71-80: 10254 -> 3.2%
81-90: 4838 -> 1.5%
91-100: 2765 -> 0.9%
More than 100 Days: 6683 -> 2.1%
CAUTION: In Multi-Class Boosting (2+ classes), TRAINING WILL TAKE A LOT OF TIME!
String or Multi Class target: Stay transformed as follows: {'21-30': 0, '11-20': 1, '31-40': 2, '51-60': 3, '0-10': 4, '41-50': 5, '71-80': 6, 'More than 100 Days': 7, '81-90': 8, '91-100': 9, '61-70': 10}
Alert! Rare Class is not 1 but 10 in this data set
############## C L A S S I F Y I N G V A R I A B L E S ####################
Classifying variables in data set...
Number of Numeric Columns = 3
Number of Integer-Categorical Columns = 5
Number of String-Categorical Columns = 8
Number of Factor-Categorical Columns = 0
Number of String-Boolean Columns = 0
Number of Numeric-Boolean Columns = 0
Number of Discrete String Columns = 0
Number of NLP String Columns = 0
Number of Date Time Columns = 0
Number of ID Columns = 1
Number of Columns to Delete = 1
18 Predictors classified...
This does not include the Target column(s)
2 variables removed since they were ID or low-information variables
['case_id', 'source']
############# D A T A P R E P A R A T I O N AND C L E A N I N G #############
Filling missing values with "missing" placeholder and adding a column for missing_flags
Columns with most missing values: ['City_Code_Patient', 'Bed Grade']
and their missing value totals: [4532, 113]
Completed missing value Imputation. No more missing values in train.
2 new missing value columns added: ['Bed Grade_Missing_Flag', 'City_Code_Patient_Missing_Flag']
Test data has no missing values. Continuing...
Completed Label Encoding and Filling of Missing Values for Train and Test Data
Multi_Classification problem: hyperparameters are being optimized for balanced_accuracy
############# R E M O V I N G H I G H L Y C O R R E L A T E D V A R S #################
Removing highly correlated variables using SULA method among (18) numeric variables
No numeric vars removed since none have high correlation with each other in this data...
Splitting features into float and categorical (integer) variables:
(3) float variables ...
(15) categorical vars...
############## F E A T U R E S E L E C T I O N BY X G B O O S T ####################
Current number of predictors = 18
Finding Important Features using Boosted Trees algorithm...
using 18 variables...
using 14 variables...
using 10 variables...
using 6 variables...
using 2 variables...
Found 16 important features
Performing limited feature engineering for binning, add_poly and KMeans_Featurizer flags ...
Train CV Split completed with TRAIN rows = 254750 , CV rows = 63688
Binning_Flag set to False or there are no float vars in data set to be binned
KMeans_Featurizer set to False or there are no float variables in data
Performing MinMax scaling of train and validation data
############### XGBoost M O D E L B U I L D I N G B E G I N S ####################
Rows in Train data set = 254750
Features in Train data set = 16
Rows in held-out data set = 63688
Finding Best Model and Hyper Parameters for XGBoost model...
Baseline Accuracy Needed for Model = 99.14%
CPU Count = 8 in this device
Using XGBoost Model, Estimated Training time = 140.11 mins
################## Imbalanced Model Training ############################
Imbalanced Training using SMOTE Rare Class Oversampling method...
Using SMOTE's over-sampling techniques to make the 11 classes balanced...
class_weights = [0.03308772 0.03704803 0.05248281 0.0826697 0.12264519 0.24653067
0.28232465 0.43312308 0.59827153 1.04697518 1.05508387]
class_weighted_rows = {0: 69993, 1: 62511, 2: 44127, 3: 28014, 4: 18883, 5: 9394,
6: 8203, 7: 5347, 8: 3871, 9: 2315, 10: 2315}
Regression-resampler is erroring. Continuing...
########################################################
XGBoost Model Prediction Results on Held Out CV Data Set:
Multi Class Model Metrics Report
#####################################################

Accuracy = 47.8%
Balanced Accuracy (average recall) = 32.7%
Average Precision (macro) = 73.1%
Precisions by class:
44.7% 47.0% 51.6% 47.2% 70.8% 100.0% 90.2% 77.7% 78.1% 96.4% 100.0%
Recall Scores by class:
70.2% 57.8% 27.9% 56.7% 9.9% 1.0% 10.8% 57.6% 44.2% 19.2% 4.6%
F1 Scores by class:
54.7% 51.8% 36.2% 51.5% 17.4% 2.0% 19.3% 66.1% 56.4% 32.0% 8.7%
#####################################################
precision recall f1-score support
0 0.45 0.70 0.55 17498
1 0.47 0.58 0.52 15628
2 0.52 0.28 0.36 11032
3 0.47 0.57 0.51 7004
4 0.71 0.10 0.17 4721
5 1.00 0.01 0.02 2349
6 0.90 0.11 0.19 2051
7 0.78 0.58 0.66 1336
8 0.78 0.44 0.56 967
9 0.96 0.19 0.32 553
10 1.00 0.05 0.09 549
accuracy 0.48 63688
macro avg 0.73 0.33 0.36 63688
weighted avg 0.54 0.48 0.44 63688
[[12290 4627 171 324 53 0 0 13 20 0 0]
[ 5416 9027 591 512 78 0 0 0 4 0 0]
[ 4403 2044 3082 1417 39 0 7 28 12 0 0]
[ 1142 505 1297 3973 15 0 3 41 26 2 0]
[ 2045 2197 2 8 469 0 0 0 0 0 0]
[ 1470 517 135 175 6 24 2 15 5 0 0]
[ 236 104 413 994 0 0 222 67 14 1 0]
[ 74 44 72 336 2 0 6 769 32 1 0]
[ 64 38 87 319 0 0 2 30 427 0 0]
[ 59 24 87 252 0 0 2 19 4 106 0]
[ 274 80 41 116 0 0 2 8 3 0 25]]
################# E N S E M B L E M O D E L ##################
Time taken = 105 seconds
Based on trying multiple models, Best type of algorithm for this data set is Bagging_Classifier
#############################################################################
Displaying results of weighted average ensemble of 5 classifiers
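
For reference, the held-out metrics printed above (balanced accuracy, the per-class report, and the confusion matrix) are the standard scikit-learn metrics; here is a tiny self-contained sketch with dummy labels standing in for the held-out split:

import numpy as np
from sklearn.metrics import (balanced_accuracy_score, classification_report,
                             confusion_matrix)

# Dummy stand-ins for the held-out labels and predictions Auto_ViML reports on
y_cv   = np.array([0, 1, 2, 2, 1, 0, 2, 1, 0, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 0, 0, 2])

print("Balanced accuracy:", balanced_accuracy_score(y_cv, y_pred))
print(classification_report(y_cv, y_pred))
print(confusion_matrix(y_cv, y_pred))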

Auto-ViML Performance:

Auto-ViML gives a good performance score compared with other hyperparameter-tuned models. The Analytics Vidhya performance comparison is given below for reference.

Final Remarks:

Personally, I believe in manual preprocessing and model tuning. However, Auto-ViML gives good results and can be used as a ready-made, quick solution for our analysis.

If you are interested in my articles, you can follow me on Medium for more like this. You can also get in touch with me on LinkedIn.

Thank you for reading my article!

