Hello,
I'd like to discuss a crucial aspect of machine learning (ML) projects: data preprocessing, specifically normalization, standardization, and scaling. While these techniques are closely related, each behaves differently (the differences are beyond the scope of this post).
The current TB GUI offers a "Normalize the dataset?" option on the Advanced tab. However, to deploy a trained solution to production data, it's essential to apply the same preprocessing parameters that were used for training, and these parameters are currently not accessible.
To overcome this limitation, I've started using the TB Python library, leveraging scikit-learn's extensive preprocessing methods (see the sketch after this list):
• StandardScaler()
• MinMaxScaler()
• MaxAbsScaler()
• RobustScaler()
• PowerTransformer()
• QuantileTransformer()
• Normalizer()
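As a rough illustration of my current workflow, here is a minimal sketch of trying several scikit-learn scalers on the same training data and keeping each fitted object so its learned parameters can be reused later. The array X_train is just a placeholder for real training features, and the dictionary of candidate scalers is my own convention, not anything provided by TB.

```python
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler,
    PowerTransformer, QuantileTransformer, Normalizer,
)

X_train = np.random.rand(100, 5)  # placeholder training features

# Candidate preprocessing methods to compare (names are arbitrary)
scalers = {
    "standard": StandardScaler(),
    "minmax": MinMaxScaler(),
    "maxabs": MaxAbsScaler(),
    "robust": RobustScaler(),
    "power": PowerTransformer(),
    "quantile": QuantileTransformer(n_quantiles=50),
    "normalizer": Normalizer(),
}

transformed = {}
for name, scaler in scalers.items():
    # fit_transform() learns the scaling parameters from the training data
    # only, then applies them; the fitted scaler objects stay in `scalers`
    # so the same parameters can be applied to new data later.
    transformed[name] = scaler.fit_transform(X_train)
```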
Through experimentation, I've found that most datasets (though not all!) require preprocessing to achieve satisfactory results. To determine the best method, I apply each technique separately and compare the outcomes. I also save the scaling parameters for later use with validation and production data, as sketched below.
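For completeness, here is a minimal sketch of how I persist a fitted scaler so the exact same parameters are applied to validation/production data. It uses joblib, which is what scikit-learn suggests for serializing its estimators; the file name and placeholder arrays are assumptions for illustration only.

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 5)  # placeholder training features
X_prod = np.random.rand(20, 5)    # placeholder production features

# Learn the scaling parameters from the training data only
scaler = StandardScaler().fit(X_train)

# Save the fitted scaler at training time
joblib.dump(scaler, "scaler.joblib")

# Later, in the deployment/validation script, reload it and apply the
# same mean/std learned during training to the production data
scaler_loaded = joblib.load("scaler.joblib")
X_prod_scaled = scaler_loaded.transform(X_prod)
```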
To summarize, I propose enhancing the TB GUI with the following features:
- Integrate additional preprocessing methods, similar to those offered by scikit-learn.
- Make these methods accessible on the Advanced tab for user selection.
- Implement a mechanism to save the scaling parameters during model training, similar to the existing menu option for exporting solutions to various languages.
This enhancement would enable users to:
• Preprocess training datasets using advanced techniques.
• Save and reuse scaling parameters for validation and production data.
• Improve overall model performance and deployment efficiency.
Thank you for considering this request.