
Hands-Free Machine Learning With Auto-Sklearn 2

Auto-Sklearn is a library for automatic machine learning developed by the AutoML group from Freiburg-Hannover, Germany.

Generate tabular classification predictions using auto-sklearn 2

Main steps when using the library:

# Import libraries
import autosklearn.experimental.askl2
import autosklearn.metrics

# Configure model parameters
classifier = autosklearn.experimental.askl2.AutoSklearn2Classifier(
    time_left_for_this_task = 600, # in seconds
    metric = autosklearn.metrics.roc_auc)

# Fit model
classifier.fit(X_train, y_train)

# Predict on test data
y_pred = classifier.predict(X_test)

To find out more about other auto-sklearn applications, visit the examples webpage.

This notebook is available on GitHub or can be downloaded here.

Note: to install auto-sklearn on macOS, use the commands below. Additional details can be found in this comment.

brew install swig
brew link swig
pip install -U auto-sklearn

Download the dataset from Kaggle

The dataset being used is from the Kaggle Titanic competition.

import fastkaggle
print("fastkaggle version: ", fastkaggle.__version__)

fastkaggle version:  0.0.7

comp = 'titanic' # competition name
path = fastkaggle.setup_comp(comp,
                  install = 'fastai "timm >= 0.6.2.dev0"')

# Import basic dependencies such as np, pd
import fastai
from fastai.imports import *
print("fastai version: ", fastai.__version__)

fastai version:  2.7.9

!ls {path}

gender_submission.csv  test.csv  train.csv

Process and clean the data

Additional transformation and normalization are handled by auto-sklearn 2.

df = pd.read_csv(path/'train.csv', index_col = 'PassengerId')
df.head()

             Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
PassengerId
1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38.0  1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35.0  1      0      113803            53.1000  C123   S
5            0         3       Allen, Mr. William Henry                             male    35.0  0      0      373450            8.0500   NaN    S

Define the features (X_train) and the label (y_train). The Name column is dropped along with the label.

y_train = df['Survived']
X_train = df.drop(['Survived', 'Name'], axis = 1)

Since auto-sklearn 2 does not accept string columns, it is necessary to convert them into categorical columns.

# Create a function that finds string columns and converts them to categorical
## Import dependency
from fastai.tabular.all import *

def to_cat(df):
    '''
    Convert string-type columns of a dataframe into categorical columns, in place
    '''
    # Identify string/categorical columns in the dataframe
    _, cat = cont_cat_split(df, 1)

    # Convert each of them to the categorical dtype
    for col in cat:
        df[col] = pd.Categorical(df[col])

to_cat(X_train)

X_train.head()

             Pclass  Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
PassengerId
1            3       male    22.0  1      0      A/5 21171         7.2500   NaN    S
2            1       female  38.0  1      0      PC 17599          71.2833  C85    C
3            3       female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
4            1       female  35.0  1      0      113803            53.1000  C123   S
5            3       male    35.0  0      0      373450            8.0500   NaN    S
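For readers without fastai installed, the same idea can be sketched with plain pandas. This is only an illustration under assumptions: the helper name strings_to_categorical and the toy frame are hypothetical, and unlike to_cat it returns a copy instead of modifying its argument and only touches text columns.

```python
import pandas as pd

def strings_to_categorical(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with text columns cast to the category dtype."""
    out = frame.copy()
    for col in out.columns:
        dtype = out[col].dtype
        # Text is stored either as generic object columns or as pandas' string dtype
        if dtype == object or isinstance(dtype, pd.StringDtype):
            out[col] = out[col].astype("category")
    return out

# Hypothetical toy data mimicking a few Titanic columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Embarked": ["S", "C", None],
})

converted = strings_to_categorical(toy)
print(converted.dtypes)
```

Returning a copy keeps the original frame available for comparison, at the cost of extra memory on large datasets.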

Data exploration

The pandas_profiling package provides quick and valuable insights into the data.

import pandas_profiling
print("pandas_profiling version: ", pandas_profiling.__version__)

pandas_profiling version:  3.2.0

X_train.profile_report(progress_bar = False).to_notebook_iframe()
# Use .to_notebook_iframe() for HTML format or .to_widgets() for the built-in widget view

Configure and train the model

import autosklearn
print("autosklearn version: ", autosklearn.__version__)

autosklearn version:  0.14.7

import autosklearn.experimental.askl2
import autosklearn.metrics

# Configure model parameters
cls = autosklearn.experimental.askl2.AutoSklearn2Classifier(
    seed = 42,
    time_left_for_this_task = 600, # in seconds
    metric = autosklearn.metrics.roc_auc,
    memory_limit = None,
    n_jobs = -1) # Use all CPUs available

%%capture
# Train the model
cls.fit(X_train, y_train)

[WARNING] [2022-08-25 04:49:26,365:Client-AutoML(42):870bbeda-2420-11ed-af95-acde48001122] Time limit for a single run is higher than total time limit. Capping the limit for a single run to the total time given to SMAC (599.729022)
[WARNING] [2022-08-25 04:49:26,365:Client-AutoML(42):870bbeda-2420-11ed-af95-acde48001122] Capping the per_run_time_limit to 299.0 to have time for a least 2 models in each process.

print(cls.sprint_statistics())

auto-sklearn results:
  Dataset name: 870bbeda-2420-11ed-af95-acde48001122
  Metric: roc_auc
  Best validation score: 0.884487
  Number of target algorithm runs: 232
  Number of successful target algorithm runs: 232
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0
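The roc_auc metric reported in the statistics above is the area under the ROC curve, which can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that definition on made-up labels and scores follows. The function name pairwise_auc is hypothetical, and this is not meant to reflect how auto-sklearn computes the metric internally.

```python
from itertools import product

def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0   # positive ranked above negative
        elif p == n:
            wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 3 of 4 pairs are ordered correctly: 0.75
```

The pairwise loop is quadratic in the number of examples, so it is only suitable for illustrating the definition on small inputs.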

To save the progress thus far, we can use fastai's save_pickle function to store the trained model.

save_pickle('cls.pkl', cls)

If the pickled model needs to be accessed later, run the following:

cls = load_pickle('cls.pkl')
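If fastai is not available, the standard library's pickle module can do a similar job. The sketch below uses a plain dictionary as a stand-in for the trained model and a temporary directory for the file. Both are assumptions made for illustration, as are the helper names save_object and load_object.

```python
import pickle
import tempfile
from pathlib import Path

def save_object(path, obj):
    """Serialize `obj` to `path` with the highest available pickle protocol."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_object(path):
    """Load a previously pickled object from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)

# A dictionary stands in for the trained classifier in this sketch
stand_in_model = {"seed": 42, "time_left_for_this_task": 600}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cls.pkl"
    save_object(target, stand_in_model)
    restored = load_object(target)

print(restored == stand_in_model)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.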

Model insights

The contents of the model ensemble can be viewed below.

print(cls.leaderboard())

          rank  ensemble_weight                type      cost  duration
model_id                                                               
115          1             0.02  passive_aggressive  0.117159  3.054357
164          2             0.02  passive_aggressive  0.119444  2.595886
175          3             0.92                 sgd  0.121888  2.583323
159          4             0.04                 sgd  0.124069  2.279986

import PipelineProfiler
%pip show pipelineprofiler

Name: pipelineprofiler
Version: 0.1.18
Summary: Pipeline Profiler tool. Enables the exploration of D3M pipelines in Jupyter Notebooks
Home-page: https://github.com/VIDA-NYU/PipelineVis
Author: Jorge Piazentin Ono, Sonia Castelo, Roque Lopez, Enrico Bertini, Juliana Freire, Claudio Silva
Author-email: jorgehpo@nyu.edu
License: UNKNOWN
Location: /Users/tompham/Library/Python/3.10/lib/python/site-packages
Requires: networkx, notebook, numpy, python-dateutil, scikit-learn, scipy
Required-by: 
Note: you may need to restart the kernel to use updated packages.

profiler_data = PipelineProfiler.import_autosklearn(cls)
PipelineProfiler.plot_pipeline_matrix(profiler_data)
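The ensemble_weight column in the leaderboard above indicates how much each model contributes to the final prediction. Assuming the ensemble combines its members through a weighted average of their predicted probabilities, the effect of these weights can be sketched as follows. Only the weights come from the leaderboard. The per-model probabilities are invented for a single hypothetical passenger, and the function name weighted_average is an assumption.

```python
# Ensemble weights taken from the leaderboard, keyed by model_id
weights = {115: 0.02, 164: 0.02, 175: 0.92, 159: 0.04}

# Invented survival probabilities from each model for one hypothetical passenger
probs = {115: 0.30, 164: 0.35, 175: 0.80, 159: 0.70}

def weighted_average(weights, values):
    """Combine per-model values using normalized weights."""
    total = sum(weights.values())
    return sum(weights[k] * values[k] for k in weights) / total

ensemble_prob = weighted_average(weights, probs)
print(round(ensemble_prob, 3))  # about 0.777, close to model 175's own 0.80
```

With a weight of 0.92, model 175 dominates the result even though it is ranked third by cost.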

Use the trained model to make predictions

Process the test data in the same way as the training features (X_train).

df_test = pd.read_csv(path/'test.csv', index_col = 'PassengerId')
df_test = df_test.drop('Name', axis = 1)
to_cat(df_test)
df_test.head()

             Pclass  Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
PassengerId
892          3       male    34.5  0      0      330911            7.8292   NaN    Q
893          3       female  47.0  1      0      363272            7.0000   NaN    S
894          2       male    62.0  0      0      240276            9.6875   NaN    Q
895          3       male    27.0  0      0      315154            8.6625   NaN    S
896          3       female  22.0  1      1      3101298           12.2875  NaN    S

# Make prediction
prediction = cls.predict(df_test)

# Convert the prediction from an ndarray to a DataFrame
subm = pd.DataFrame(prediction,
                    index = df_test.index,
                    columns = ['Survived'])

Save the prediction as a .csv file.

subm.to_csv('subm.csv')
# View the first few rows
!head subm.csv

PassengerId,Survived
892,0
893,1
894,0
895,0
896,1
897,0
898,1
899,0
900,1
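Before uploading, it can be worth checking that the file has the expected shape. The sketch below is a hypothetical sanity check written with the standard library. It parses a small in-memory sample copied from the output above rather than the real subm.csv, and the function name check_submission is an assumption.

```python
import csv
import io

def check_submission(text):
    """Validate header and values of a submission CSV; return the row count."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != ["PassengerId", "Survived"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = 0
    for row in reader:
        int(row["PassengerId"])  # raises ValueError if the id is not an integer
        if row["Survived"] not in {"0", "1"}:
            raise ValueError(f"Survived must be 0 or 1, got {row['Survived']!r}")
        rows += 1
    return rows

# First rows of the submission shown above
sample = "PassengerId,Survived\n892,0\n893,1\n894,0\n"
print(check_submission(sample))  # 3 data rows
```

To check the real file, the text could be read with Path('subm.csv').read_text() and passed to the same function.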

Submit the prediction directly to the Kaggle competition. View the scores on this webpage.

# Submit to competition
from kaggle import api
api.competition_submit_cli('subm.csv', # file name
                           'auto-sklearn 2 - 10m', # version description
                           comp) # competition name

100%|██████████| 2.77k/2.77k [00:00<00:00, 5.35kB/s]

Successfully submitted to Titanic - Machine Learning from Disaster

This submission scored 79.186% accuracy, which placed it in the top 6% of all submissions. Note that the leaderboard also contains numerous submissions with 100% accuracy that were obtained by cheating.
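Accuracy here is simply the fraction of passengers whose predicted label matches the true one. A minimal sketch on invented labels follows. The real test labels are held by Kaggle and are not part of this post, so the lists and the function name accuracy are purely illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Invented labels for eight hypothetical passengers
true_labels = [0, 1, 0, 0, 1, 0, 1, 0]
predicted   = [0, 1, 0, 1, 1, 0, 0, 0]
print(accuracy(true_labels, predicted))  # 6 of 8 correct: 0.75
```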

This post is licensed under CC BY 4.0 by the author.
