A real-world dataset contains a mix of relevant and redundant features. It is true that more data in terms of the number of instances or rows leads to a better machine learning model, but the same does not hold for the number of features.
Before proceeding, we must understand why it is not recommended to use the full set of features. To train a robust machine learning model, the data must be free of redundant features. There are various reasons why feature selection is important:
A standard principle in data science is that more data leads to a better model. That data can be present in any format, which makes data collection and data preparation important components of the model development pipeline. Whatever format the required data for a case study arrives in, it is the data scientist's task to get it into the desired shape before proceeding with data preprocessing and the other components of the pipeline.
A lot of structured, semi-structured, and unstructured data is present in tabular form inside text-based PDF documents and in…
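As one possible illustration (the excerpt does not name a specific tool), the open-source camelot library can pull tables out of text-based PDFs into Pandas DataFrames; the file name below is a placeholder:

```python
# Sketch: extract a table from a text-based PDF with camelot
# (the library choice and file name are assumptions for illustration).
import camelot

# Read tables from page 1 of a hypothetical text-based PDF.
tables = camelot.read_pdf("report.pdf", pages="1")

# Each detected table is exposed as a pandas DataFrame.
df = tables[0].df
print(df.head())
```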
Automated Machine Learning (AutoML) refers to automating components of the machine learning pipeline. AutoML speeds up the workflow of a data scientist by taking over parts of the model development process, and it allows non-experts to train a basic machine learning model without deep expertise in the field.
There are various open-source AutoML Python libraries, including TPOT, MLBox, and Auto-Sklearn, that can automate a classification or regression machine learning model. To automate an NLP problem, one can use the AutoNLP library.
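For a taste of the workflow these libraries automate, here is a minimal TPOT sketch for a classification task; the toy dataset and the small search budget are illustrative choices, not prescriptions:

```python
# Sketch: AutoML for classification with TPOT (one of the libraries above).
# The toy dataset and small search budget are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Evolve candidate pipelines for a few generations, then score the best one.
tpot = TPOTClassifier(generations=5, population_size=20,
                      random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the winning pipeline as a standalone Python script.
tpot.export("best_pipeline.py")
```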
In this article, we will discuss how to automate the implementation of a time-series forecasting model using an…
Clustering is an unsupervised machine learning technique that divides the population into several groups or clusters such that data points in the same group are similar to each other, and data points in different groups are dissimilar.
In other words, clusters are formed in such a way that points within a cluster are highly similar to one another, while points in different clusters are highly dissimilar.
Clustering is used to identify segments or groups in a dataset. Clustering techniques can be divided into two subgroups:
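Before getting into those subgroups, here is a minimal k-means sketch as a concrete illustration; k-means and k = 3 are illustrative choices, not the article's prescription:

```python
# Sketch: k-means clustering with scikit-learn on a toy dataset.
# k-means and k=3 are illustrative choices.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Assign each point to the nearest of k=3 centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels[:10])
print(kmeans.cluster_centers_)
```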
More training data in terms of the number of instances leads to a better data science model, but this does not hold true for the number of features. A real-world dataset has a lot of features: some of them are useful for training a robust data science model, while others are redundant and can hurt the model's performance.
Feature selection is an important element of a data science model development workflow. Evaluating all possible combinations of features is an exponential problem: a dataset with n features has 2^n candidate subsets. …
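Rather than searching all 2^n subsets, a common shortcut is univariate feature selection. Below is a minimal scikit-learn sketch; the dataset and the choice of k = 10 are illustrative assumptions, not from the article:

```python
# Sketch: univariate feature selection with scikit-learn, instead of
# searching all 2^n subsets. Dataset and k=10 are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Score each feature independently with an ANOVA F-test; keep the top 10.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)  # (569, 30) -> (569, 10)
```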
Pandas is a very popular Python library, as it provides a usable, flexible high-level API together with a high-performance implementation. Pandas offers a vast list of APIs for data wrangling and exploration, but it ignores the performance and scalability of those computations: on a large-sized dataset it is very slow, or fails outright, because it utilizes only a single CPU core and never makes full use of all the cores.
In this article, we will discuss 4 open-source libraries that can parallelize the existing Pandas ecosystem across multiple cores of the CPU:
Dask is an open-source Python library…
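For a flavor of the drop-in style these libraries share, here is a minimal Dask sketch; the CSV file and column names are placeholders:

```python
# Sketch: Pandas-like API parallelized with Dask.
# File and column names are placeholders.
import dask.dataframe as dd

# Lazily read the CSV in partitions instead of loading it all at once.
df = dd.read_csv("large_dataset.csv")

# Calls build a task graph; .compute() runs it across CPU cores.
result = df.groupby("category")["value"].mean().compute()
print(result)
```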
Pandas is one of the most popular Python libraries for data exploration and visualization. It comes with tons of built-in functions that make data exploration easier. But when it comes to handling a large-sized dataset it falls short, as it performs all manipulations on a single CPU core; Pandas does not take advantage of all the available cores to scale up its computations.
All the available CPU cores can be used to speed up large and complex computations. There are various open-source libraries such as Vaex, Dask, Modin, and…
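As an example of how little code changes, here is a minimal Modin sketch (Modin is one of the libraries named above); the file name is a placeholder, and Modin needs a backend such as Ray or Dask installed:

```python
# Sketch: Modin as a drop-in Pandas replacement.
# Only the import changes; a backend (Ray or Dask) must be installed.
import modin.pandas as pd

df = pd.read_csv("large_dataset.csv")  # placeholder file name

# The familiar Pandas call now runs across all available CPU cores.
print(df.describe())
```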
Statistics is an integral part of Data Science and Machine Learning. It is a subfield of mathematics that formalizes relationships between variables in the form of mathematical equations, and it uses those relationships to predict outcomes. Statistics involves the study of data collection, analysis, interpretation, presentation, and organization.
There are a lot of statistical tests for measuring relationships within or between variables. During a data science project, a question often arises in a data scientist's mind: which statistical technique should be used for which kind of data or variables, and when…
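As one concrete example of such a test, here is a minimal two-sample t-test with SciPy on synthetic data; the test choice and the data are illustrative, since which test to use is exactly what the article goes on to discuss:

```python
# Sketch: a two-sample t-test with SciPy on synthetic data.
# The test and data are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=100)
group_b = rng.normal(loc=0.5, scale=1.0, size=100)

# Null hypothesis: the two samples share the same population mean.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```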
Natural Language Processing (NLP) is a subfield of artificial intelligence concerned with the interactions between computers and natural human languages. NLP involves text processing, text analysis, applying machine learning algorithms to text and speech, and much more.
Text processing is a key element in the pipeline of an NLP or text-based data science project. Regular expressions are used for a variety of purposes such as feature extraction, string replacement, and other string manipulations. Regular expressions, also known as regex, are a tool available in many programming languages and in many Python libraries.
Regex is basically a set of…
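As a small illustration of those use cases, here is a sketch using Python's built-in re module for feature extraction and string replacement; the sample text and patterns are illustrative:

```python
# Sketch: feature extraction and string replacement with Python's re module.
# Sample text and patterns are illustrative.
import re

text = "Order #4521 shipped on 2021-03-15 to john@example.com"

# Feature extraction: pull out dates and email addresses.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

# String replacement: mask the email address.
masked = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<email>", text)
print(dates, emails, masked)
```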
The standard principle in data science is that more training data leads to a better machine learning model. This is true for the number of instances, but not for the number of features. A real-world dataset contains a lot of redundant features that may impact the performance of the model.
A data scientist needs to be selective about the features chosen for modeling. A dataset contains a lot of features, some of them useful and others not. To select all the possible combinations of features and then proceed to pick the best set is…
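One cheap heuristic that sidesteps enumerating subsets is to drop one feature from every highly correlated pair; the sketch below assumes a Pandas DataFrame, and the 0.9 correlation threshold is an illustrative choice:

```python
# Sketch: drop one feature from each highly correlated pair with Pandas.
# Dataset and the 0.9 threshold are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

# Absolute pairwise correlations; keep only the upper triangle so
# each feature pair is considered exactly once.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop a feature if it correlates > 0.9 with an earlier-kept feature.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)
print(f"{X.shape[1]} features -> {X_reduced.shape[1]} features")
```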