Hello,
I'd like to discuss a crucial aspect of machine learning (ML) projects: data preprocessing, specifically normalization, standardization, and scaling. While these techniques are closely related, each behaves differently (the differences are beyond the scope of this post).
The current TB GUI offers a "Normalize the dataset?" option on the Advanced tab. However, to deploy a trained solution to production data, it's essential to apply the same preprocessing parameters that were used for training, and these parameters are currently not accessible.
To overcome this limitation, I've started using the TB Python library, leveraging scikit-learn's extensive preprocessing methods (see the sketch after this list):
• StandardScaler()
• MinMaxScaler()
• MaxAbsScaler()
• RobustScaler()
• PowerTransformer()
• QuantileTransformer()
• Normalizer()
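As a rough illustration of my current workflow, here is a minimal sketch of trying several scikit-learn scalers on the same training data and keeping each fitted object so its learned parameters can be reused later. The array X_train is just a placeholder for real training features, and the dictionary of candidate scalers is my own convention, not anything provided by TB.

```python
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler,
    PowerTransformer, QuantileTransformer, Normalizer,
)

X_train = np.random.rand(100, 5)  # placeholder training features

# Candidate preprocessing methods to compare (names are arbitrary)
scalers = {
    "standard": StandardScaler(),
    "minmax": MinMaxScaler(),
    "maxabs": MaxAbsScaler(),
    "robust": RobustScaler(),
    "power": PowerTransformer(),
    "quantile": QuantileTransformer(n_quantiles=50),
    "normalizer": Normalizer(),
}

transformed = {}
for name, scaler in scalers.items():
    # fit_transform() learns the scaling parameters from the training data
    # only, then applies them; the fitted scaler objects stay in `scalers`
    # so the same parameters can be applied to new data later.
    transformed[name] = scaler.fit_transform(X_train)
```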
Through experimentation, I've found that most datasets (though not all!) require preprocessing to achieve satisfactory results. To determine the best method, I apply each technique separately and compare the outcomes. I also save the scaling parameters for later use with validation and production data, as sketched below.
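For completeness, here is a minimal sketch of how I persist a fitted scaler so the exact same parameters are applied to validation/production data. It uses joblib, which is what scikit-learn suggests for serializing its estimators; the file name and placeholder arrays are assumptions for illustration only.

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 5)  # placeholder training features
X_prod = np.random.rand(20, 5)    # placeholder production features

# Learn the scaling parameters from the training data only
scaler = StandardScaler().fit(X_train)

# Save the fitted scaler at training time
joblib.dump(scaler, "scaler.joblib")

# Later, in the deployment/validation script, reload it and apply the
# same mean/std learned during training to the production data
scaler_loaded = joblib.load("scaler.joblib")
X_prod_scaled = scaler_loaded.transform(X_prod)
```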
To summarize, I propose enhancing the TB GUI with the following features:
- Integrate additional preprocessing methods, similar to those offered by scikit-learn.
- Make these methods accessible on the Advanced tab for user selection.
- Implement a mechanism to save the scaling parameters during model training, similar to the existing menu option for exporting solutions to various languages.
This enhancement would enable users to:
• Preprocess training datasets using advanced techniques.
• Save and reuse scaling parameters for validation and production data.
• Improve overall model performance and deployment efficiency.
Thank you for considering this request.