
Automated model training with lazypredict

Automated Machine Learning (AutoML) refers to automating the components of a data science model development pipeline. AutoML reduces a data scientist's workload and speeds up the workflow. It can be used to automate various pipeline components, including data understanding, EDA, data processing, model training, and hyperparameter tuning.

For an end-to-end machine learning project, the complexity of each pipeline component depends on the project. There are various open-source AutoML libraries that speed up each of these components. Read this article to know about 8 such AutoML libraries that automate the machine learning pipeline.
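lazypredict's `LazyClassifier` fits dozens of baseline models in a single call and ranks them. The loop below is a minimal scikit-learn sketch of the kind of comparison it automates; the dataset and the three model choices are illustrative, not lazypredict's actual internals.

```python
# A minimal sketch of the model comparison lazypredict automates:
# fit several baseline classifiers and rank them by test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=5000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

# Rank models from best to worst test accuracy
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

lazypredict wraps a loop like this (over many more estimators) behind one `fit` call and returns the leaderboard as a DataFrame.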

In this article…

k-Modes and k-Prototype algorithm intuition and usage

Clustering is an unsupervised machine learning technique that divides the population into several clusters or groups in such a way that data points in a cluster are similar to each other, and data points in different clusters are dissimilar. k-Means is a popular clustering algorithm that is limited to numerical data only.

Why can't k-Means be used for categorical features?

k-Means is a popular centroid-based clustering algorithm that divides the data points of the entire population into k clusters, each having an almost equal number of data points. …
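To make the k-Modes idea concrete, here is a small pure-Python sketch of the two pieces it substitutes for k-Means' machinery: matching dissimilarity (count of mismatched attributes) in place of Euclidean distance, and per-attribute modes in place of means. The records below are hypothetical.

```python
# Two ingredients k-Modes swaps in for categorical data:
# a matching dissimilarity instead of Euclidean distance,
# and the per-attribute mode instead of the mean.
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two records disagree."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Per-attribute mode: the 'centroid' of a categorical cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

cluster = [("red", "suv", "petrol"),
           ("red", "sedan", "petrol"),
           ("blue", "suv", "petrol")]

centroid = cluster_mode(cluster)
print(centroid)                                       # ('red', 'suv', 'petrol')
print(matching_dissimilarity(cluster[2], centroid))   # 1 mismatched attribute
```

k-Prototype combines both worlds, using Euclidean distance for the numerical columns and matching dissimilarity for the categorical ones.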

Process your text data for NLP tasks using the CleanText library

Natural Language Processing (NLP) is a subfield of AI involving interactions between computers and natural language. It revolves around training a data science model that can understand and act on natural language. A typical NLP project follows a pipeline to train a model, with steps including text cleaning, tokenization, stemming, encoding into numerical vectors, etc., followed by model training.

The data used for NLP tasks is textual, mainly sourced from the internet. Most of the time, the text used for NLP modeling is dirty and needs to be cleaned…
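As a rough illustration of the kind of steps CleanText bundles into a single call, here is a hand-rolled sketch using only the standard library. This is not CleanText's API, just the category of cleaning it automates: lowercasing, stripping URLs, removing punctuation, and collapsing whitespace.

```python
# A hand-rolled sketch of typical text-cleaning steps for NLP.
import re
import string

def basic_clean(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)            # drop URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()             # collapse spaces
    return text

raw = "Check THIS out: https://example.com !! It's great."
print(basic_clean(raw))   # "check this out its great"
```

A library call replaces all of these hand-written rules (plus harder ones like Unicode fixing and emoji handling) with a single configurable function.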

Combination of Oversampling and Undersampling techniques

In classification tasks, one may encounter a situation where the target class labels are not equally distributed. Such a dataset is termed imbalanced data. Class imbalance can be a blocker when training a data science model: the model is trained mainly on the majority class and becomes biased towards predicting it.

Hence, handling class imbalance is essential before proceeding to the modeling pipeline. There are various class-balancing techniques that solve the problem of class imbalance by either generating new samples of the minority class or…
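As a minimal sketch of combining the two ideas, the NumPy snippet below randomly oversamples the minority class and undersamples the majority class toward a common size. Libraries such as imbalanced-learn offer smarter combined variants (e.g. SMOTE followed by edited nearest neighbours); the data and target size here are illustrative.

```python
# Combine oversampling (minority) and undersampling (majority)
# so both classes end up with the same number of samples.
import numpy as np

rng = np.random.default_rng(0)

def combine_resample(X, y, minority, majority, target):
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y == majority)
    # Oversample minority with replacement, undersample majority without
    min_pick = rng.choice(min_idx, size=target, replace=True)
    maj_pick = rng.choice(maj_idx, size=target, replace=False)
    keep = np.concatenate([min_pick, maj_pick])
    return X[keep], y[keep]

# Imbalanced toy data: 90 majority (class 0) vs 10 minority (class 1)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

X_bal, y_bal = combine_resample(X, y, minority=1, majority=0, target=50)
print(np.bincount(y_bal))   # 50 samples of each class
```

Random resampling like this is the crudest combination; SMOTE-based variants synthesize new minority points instead of duplicating existing ones.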

Speed up your Pandas workflow using the PyPolars library

Pandas is one of the most important Python packages for data scientists to play around with data. The Pandas library is used mostly for data exploration and visualization, as it comes with tons of built-in functions. However, Pandas fails to handle large datasets, as it does not scale or distribute its processing across all the cores of the CPU.

To speed up the computations, one can utilize all the cores of the CPU. There are various open-source libraries, including Dask, Vaex, Modin, Pandarallel, and PyPolars, that parallelize computations across multiple cores of the CPU…
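The principle these libraries exploit can be sketched with the standard library alone: split the work into chunks and spread them over a process pool. PyPolars/Polars does this internally (in Rust), so users get the parallelism without writing any of this; the helper names below are hypothetical.

```python
# A stdlib sketch of multi-core parallelism: chunk a column
# aggregation and sum the chunks in a process pool.
import os
from concurrent.futures import ProcessPoolExecutor

def chunk_sum(chunk):
    return sum(chunk)

def make_chunks(values, n_chunks):
    size = max(1, len(values) // n_chunks)
    return [values[i:i + size] for i in range(0, len(values), size)]

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = make_chunks(data, os.cpu_count() or 4)
    with ProcessPoolExecutor() as pool:
        total = sum(pool.map(chunk_sum, chunks))
    print(total == sum(data))   # True: same answer, computed in parallel
```

For a single cheap aggregation the process-spawning overhead can outweigh the gain; the speedup matters for the large, repeated computations these DataFrame libraries target.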

PyWedge — Interactive package to speed up data science modeling workflow

A data scientist spends most of their time performing exploratory data analysis (EDA). There are various components of the data science modeling pipeline, including EDA, data processing, hyperparameter tuning, baseline modeling, and model deployment.

There are various open-source Python libraries that can speed up some components of the pipeline. Read this article to know 4 such libraries that can automate the EDA component. When it comes to the entire modeling pipeline, starting from EDA and moving through data processing, baseline modeling, and hyperparameter tuning, it takes a lot of a data scientist's time and energy.

Here PyWedge comes into the…

Select the best set of features using recursive feature selection

A real-world dataset contains a lot of relevant and redundant features. It is true that more data, in terms of the number of instances or rows, leads to training a better machine learning model.

Before proceeding, we must know why it's not recommended to use all the features. To train a robust machine learning model, the data must be free of redundant features. There are various reasons why feature selection is important:

  • Garbage In, Garbage Out: The quality of data that goes towards training the model determines the quality of the output model. …
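A minimal sketch of recursive feature elimination using scikit-learn's `RFE`: a model is fit repeatedly, the weakest features are dropped each round, and the process repeats until the requested number remain. A synthetic dataset stands in for a real one here.

```python
# Recursive feature elimination with scikit-learn's RFE.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 10 features, only 4 of which are informative
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=42)

selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=4)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = selected; larger = eliminated earlier
```

The transformed data `selector.transform(X)` keeps only the selected columns and can be fed straight into the rest of the pipeline.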

How to extract and convert tables from PDFs into Pandas Dataframe using Camelot

A standard principle in data science is that more data leads to training a better model. Data can be present in any format, and data collection and data preparation are important components of a model development pipeline. It's the task of the data scientist to get the data for a case study into the desired format before proceeding with data preprocessing and the other components of the pipeline.

A lot of structured/semi-structured or unstructured data can be present in tabular format in text-based PDF documents and in…

Develop ARIMA, SARIMAX, FB Prophet, VAR, and ML models using Auto-TS library

Automated Machine Learning (AutoML) refers to automating some of the components of the machine learning pipeline. AutoML speeds up a data scientist's workflow by automating some of the model development processes. Automation also allows non-experts to train a basic machine learning model without deep knowledge of the field.

There are various open-source AutoML Python libraries, including TPOT, MLBox, and Auto-Sklearn, that can automate a classification or regression machine learning model. To automate an NLP problem, one can use the AutoNLP library.

In this article, we will discuss how to automate a time-series forecasting model implementation using an…
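As a taste of the model families Auto-TS searches over, here is a NumPy-only sketch of the autoregressive (AR) ingredient of ARIMA, fit by least squares on a simulated series. This is not Auto-TS's API, just the simplest building block it automates the selection of.

```python
# Fit an AR(1) model y_t = c + phi * y_{t-1} by least squares.
import numpy as np

rng = np.random.default_rng(1)

# Simulate an AR(1) series with c = 2.0 and phi = 0.7
n, c_true, phi_true = 500, 2.0, 0.7
y = np.zeros(n)
for t in range(1, n):
    y[t] = c_true + phi_true * y[t - 1] + rng.normal(scale=0.5)

# Regress y_t on [1, y_{t-1}]
A = np.column_stack([np.ones(n - 1), y[:-1]])
(c_hat, phi_hat), *_ = np.linalg.lstsq(A, y[1:], rcond=None)

print(f"c_hat={c_hat:.2f}, phi_hat={phi_hat:.2f}")  # close to 2.0 and 0.7
```

ARIMA adds differencing (the "I") and moving-average terms (the "MA") on top of this, and Auto-TS tries whole families of such models (ARIMA, SARIMAX, Prophet, VAR, ML) and picks the best performer.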

Deep dive understanding of Fuzzy C-Means Clustering Algorithm

Clustering is an unsupervised machine learning technique that divides the population into several groups or clusters such that data points in the same group are similar to each other, and data points in different groups are dissimilar.

In other words, clusters are formed in such a way that:

  • Data points in the same cluster are close to each other and hence they are very similar
  • Data points in different clusters are far apart and are different from each other.

Clustering is used to identify some segments or groups in your dataset. Clustering can be divided into two subgroups:
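The Fuzzy C-Means update rules can be sketched in NumPy: each point receives a degree of membership in every cluster (each row of the membership matrix U sums to 1) rather than a hard assignment, with the fuzziness controlled by an exponent m > 1. A minimal sketch on synthetic blobs:

```python
# Fuzzy C-Means: alternate between computing weighted centers
# and updating soft memberships until convergence.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distance of every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)                    # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

# Two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])

centers, U = fuzzy_c_means(X, c=2)
print(np.round(centers, 1))      # centers near the two blob means
print(U.sum(axis=1)[:3])         # each row of U sums to 1
```

With m close to 1, memberships approach 0/1 and the algorithm behaves like k-Means; larger m spreads membership more evenly across clusters.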
