Auto-Sklearn is a library for automated machine learning developed by the AutoML group from Freiburg-Hannover, Germany.

Generate tabular classification predictions using auto-sklearn 2

Main components of the library:

  
# Import libraries
import autosklearn.experimental.askl2
import autosklearn.metrics

# Configure model parameters
classifier = autosklearn.experimental.askl2.AutoSklearn2Classifier(
    time_left_for_this_task = 600, # in seconds
    metric = autosklearn.metrics.roc_auc)

# Fit model
classifier.fit(X_train, y_train)

# Predict on test data
y_pred = classifier.predict(X_test)

To find out more about other auto-sklearn applications, visit the examples webpage.

This notebook is available on GitHub or can be downloaded here.

Note: to install auto-sklearn on macOS, use the commands below. Additional details can be found in this comment.

  
brew install swig
brew link swig
pip install -U auto-sklearn

Download the dataset from Kaggle

The dataset being used is from the Kaggle Titanic competition.

  
import fastkaggle
print("fastkaggle version: ", fastkaggle.__version__)

fastkaggle version:  0.0.7

  
comp = 'titanic' # competition name
path = fastkaggle.setup_comp(comp,
                  install = 'fastai "timm >= 0.6.2.dev0"')

  
# Import basic dependencies such as np, pd
import fastai
from fastai.imports import *
print("fastai version: ", fastai.__version__)

fastai version:  2.7.9

  
!ls {path}

gender_submission.csv  test.csv  train.csv

Process and clean the data

Additional transformation and normalization are handled by auto-sklearn 2.

  
df = pd.read_csv(path/'train.csv', index_col = 'PassengerId')
df.head()

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

Define the features (X_train) and the label (y_train).

  
y_train = df['Survived']
X_train = df.drop(['Survived', 'Name'], axis = 1)

Since auto-sklearn 2 does not accept string columns, it is necessary to convert them into categorical columns.

  
# Create a function that finds string columns and marks them as categorical
## Import dependency (provides cont_cat_split)
from fastai.tabular.all import *

def to_cat(df):
    '''
    Convert string-type columns of a dataframe into categorical columns, in place
    '''
    # Split columns into continuous and categorical names.
    # max_card = 1: integer columns with more than one distinct value stay continuous
    _, cat = cont_cat_split(df, 1)

    # Convert each categorical column to the pandas Categorical type
    for col in cat:
        df[col] = pd.Categorical(df[col])

  
to_cat(X_train)

  
X_train.head()

	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	3	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	female	38.0	1	0	PC 17599	71.2833	C85	C
3	3	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	female	35.0	1	0	113803	53.1000	C123	S
5	3	male	35.0	0	0	373450	8.0500	NaN	S
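For readers who do not have fastai installed, the same conversion can be sketched with plain pandas. This is only an approximation of to_cat under one assumption: every non-numeric column is treated as categorical. The helper name strings_to_categorical and the toy dataframe are made up for illustration, and unlike to_cat the helper returns a copy instead of modifying its input.

```python
import pandas as pd

def strings_to_categorical(frame):
    """Return a copy of `frame` with every non-numeric column cast to 'category'."""
    out = frame.copy()
    for col in out.columns:
        # Assumption: anything that is not numeric should become categorical
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].astype("category")
    return out

# Toy data shaped like a few Titanic columns (illustrative only)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Cabin": [None, "C85", None],
})

converted = strings_to_categorical(toy)
print(converted.dtypes)
```

Missing values such as the empty Cabin entries stay missing after the cast; they are not turned into a category of their own.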

Data exploration

The package pandas_profiling provides quick and valuable insights into the data.

  
import pandas_profiling
print("pandas_profiling version: ", pandas_profiling.__version__)

pandas_profiling version:  3.2.0

  
X_train.profile_report(progress_bar = False).to_notebook_iframe()
# Use .to_notebook_iframe() for HTML format or .to_widgets() for the built-in widget view

Configure and train the model

  
import autosklearn    
print("autosklearn version: ", autosklearn.__version__)

autosklearn version:  0.14.7

  
import autosklearn.experimental.askl2
import autosklearn.metrics

# Configure model parameters
cls = autosklearn.experimental.askl2.AutoSklearn2Classifier(
    seed = 42,
    time_left_for_this_task = 600, # in seconds
    metric = autosklearn.metrics.roc_auc,
    memory_limit = None,
    n_jobs = -1) # Use all CPUs available

  
%%capture
# Train the model
cls.fit(X_train, y_train)

[WARNING] [2022-08-25 04:49:26,365:Client-AutoML(42):870bbeda-2420-11ed-af95-acde48001122] Time limit for a single run is higher than total time limit. Capping the limit for a single run to the total time given to SMAC (599.729022)
[WARNING] [2022-08-25 04:49:26,365:Client-AutoML(42):870bbeda-2420-11ed-af95-acde48001122] Capping the per_run_time_limit to 299.0 to have time for a least 2 models in each process.

  
print(cls.sprint_statistics())

auto-sklearn results:
  Dataset name: 870bbeda-2420-11ed-af95-acde48001122
  Metric: roc_auc
  Best validation score: 0.884487
  Number of target algorithm runs: 232
  Number of successful target algorithm runs: 232
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0

To save the progress thus far, we can use fastai’s function save_pickle to store the trained model.

  
save_pickle('cls.pkl', cls)

If the pickled model needs to be accessed later, run the following:

  
cls = load_pickle('cls.pkl')
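If fastai is not available, a similar round trip can be sketched with the standard library. This assumes that save_pickle and load_pickle behave like thin wrappers around pickle; the helper names save_model and load_model are hypothetical, and a plain dictionary stands in for the trained classifier so the sketch runs without auto-sklearn.

```python
import os
import pickle
import tempfile

def save_model(path, model):
    """Serialize `model` to `path` with pickle."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)

def load_model(path):
    """Load a pickled object back from `path`."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Stand-in for the trained classifier (illustrative only)
stand_in = {"seed": 42, "time_left_for_this_task": 600}

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "cls.pkl")
    save_model(target, stand_in)
    restored = load_model(target)

print(restored == stand_in)
```

As with any pickle file, only load files you created yourself, and expect the file to be tied to the library versions used when it was saved.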

Model insights

The contents of the model ensemble can be viewed below.

  
print(cls.leaderboard())

          rank  ensemble_weight                type      cost  duration
model_id                                                               
115          1             0.02  passive_aggressive  0.117159  3.054357
164          2             0.02  passive_aggressive  0.119444  2.595886
175          3             0.92                 sgd  0.121888  2.583323
159          4             0.04                 sgd  0.124069  2.279986
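To make the leaderboard columns more concrete, here is a small sketch of how the numbers could be read. It rests on two assumptions: that cost is 1 minus the validation ROC AUC, and that the ensemble combines its members by a weighted average of their predicted probabilities. The weights and costs are copied from the table above, while the per-model probabilities are invented for a single imaginary passenger.

```python
# Ensemble weights and costs copied from the leaderboard above
weights = {115: 0.02, 164: 0.02, 175: 0.92, 159: 0.04}
costs = {115: 0.117159, 164: 0.119444, 175: 0.121888, 159: 0.124069}

# Assumption: cost = 1 - validation ROC AUC, so the score is recovered as 1 - cost
validation_auc = {model: round(1 - cost, 6) for model, cost in costs.items()}

# Hypothetical P(Survived = 1) from each member for one passenger
member_probs = {115: 0.70, 164: 0.60, 175: 0.40, 159: 0.50}

def ensemble_probability(weights, probs):
    """Weighted average of member probabilities, normalized by the total weight."""
    total = sum(weights.values())
    return sum(weights[m] * probs[m] for m in weights) / total

p = ensemble_probability(weights, member_probs)
label = int(p >= 0.5)  # threshold the averaged probability to get a class

print(validation_auc)
print(round(p, 3), label)
```

In this toy case the two passive_aggressive models favour survival, but the sgd model with weight 0.92 dominates the average, so the ensemble predicts 0.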

  
import PipelineProfiler
%pip show pipelineprofiler

Name: pipelineprofiler
Version: 0.1.18
Summary: Pipeline Profiler tool. Enables the exploration of D3M pipelines in Jupyter Notebooks
Home-page: https://github.com/VIDA-NYU/PipelineVis
Author: Jorge Piazentin Ono, Sonia Castelo, Roque Lopez, Enrico Bertini, Juliana Freire, Claudio Silva
Author-email: jorgehpo@nyu.edu
License: UNKNOWN
Location: /Users/tompham/Library/Python/3.10/lib/python/site-packages
Requires: networkx, notebook, numpy, python-dateutil, scikit-learn, scipy
Required-by: 
Note: you may need to restart the kernel to use updated packages.

  
profiler_data = PipelineProfiler.import_autosklearn(cls)
PipelineProfiler.plot_pipeline_matrix(profiler_data)

Use the trained model to make predictions

Process the test data the same way as the training features (X_train).

  
df_test = pd.read_csv(path/'test.csv', index_col = 'PassengerId')
df_test = df_test.drop('Name', axis = 1)
to_cat(df_test)
df_test.head()

	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
892	3	male	34.5	0	0	330911	7.8292	NaN	Q
893	3	female	47.0	1	0	363272	7.0000	NaN	S
894	2	male	62.0	0	0	240276	9.6875	NaN	Q
895	3	male	27.0	0	0	315154	8.6625	NaN	S
896	3	female	22.0	1	1	3101298	12.2875	NaN	S

  
# Make prediction
prediction = cls.predict(df_test)

  
# Convert the prediction from an ndarray to a dataframe
subm = pd.DataFrame(prediction,
                    index = df_test.index,
                    columns = ['Survived'])

Save the prediction as a .csv file.

  
subm.to_csv('subm.csv')
# View the first few rows
!head subm.csv

PassengerId,Survived
892,0
893,1
894,0
895,0
896,1
897,0
898,1
899,0
900,1
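Before submitting, it can be worth checking that the file has the shape shown above. The sketch below uses only the standard library and assumes the expected format is exactly a PassengerId,Survived header followed by integer ids and 0/1 labels, as in the output above. The function name check_submission is hypothetical, and the sample text reuses the first rows printed above.

```python
import csv
import io

# First rows of the submission file, as printed above
sample = """PassengerId,Survived
892,0
893,1
894,0
"""

def check_submission(text):
    """Validate header and values of a submission CSV; return the number of rows."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != ["PassengerId", "Survived"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    for row in rows:
        int(row["PassengerId"])  # raises ValueError if the id is not an integer
        if row["Survived"] not in {"0", "1"}:
            raise ValueError(f"unexpected label: {row['Survived']}")
    return len(rows)

print(check_submission(sample))
```

To check the real file, the same function could be called with the contents of subm.csv read as text.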

Submit the prediction directly to the Kaggle competition. View the scores on this webpage.

  
# Submit to competition
from kaggle import api
api.competition_submit_cli('subm.csv', # file name
                           'auto-sklearn 2 - 10m', # version description
                           comp) # competition name

100%|██████████| 2.77k/2.77k [00:00<00:00, 5.35kB/s]

Successfully submitted to Titanic - Machine Learning from Disaster

This submission has an accuracy score of 79.186%, which is in the top 6% of all submissions. Note: numerous top submissions report 100% accuracy, which comes from cheating.
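As a rough sanity check, the score can be turned back into a count of correct predictions. This sketch assumes that the score is plain accuracy and that it is computed over all 418 rows of test.csv; neither assumption is confirmed above, so treat the result as an estimate.

```python
import math

n_test_rows = 418            # assumption: number of passengers in test.csv
reported_accuracy = 0.79186  # score reported for this submission

# Nearest whole number of correct predictions that matches the score
correct = round(reported_accuracy * n_test_rows)
wrong = n_test_rows - correct

# Recompute the accuracy from the count and truncate to five decimals
recomputed = math.floor(correct / n_test_rows * 1e5) / 1e5

print(correct, wrong, recomputed)
```

Under these assumptions each additional correct prediction moves the score by about 0.24 percentage points, which is why small changes in the model can shift the ranking noticeably.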

Hands-Free Machine Learning With Auto-Sklearn 2