Writes About Data Science | Top 1000 writer on Medium | Top Writer in AI | Data Scientist | Connect:

Essential guide to Icecream package for code debugging

Image by Dhruvil Patel from Pixabay

Debugging is an important but tiresome task for every developer: the process of finding and fixing errors in a program. It becomes essential whenever your code throws an error or produces output that is not what you expected.

Syntactic errors and semantic errors are the two types of errors a developer faces in a program. Syntactic errors are caused by mistyped commands, incorrect indentation, and similar slips, and are easily fixed by following the Python traceback. …
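The icecream package replaces scattered print() calls with ic(), which prints both the expression and its value. As a rough illustration of the idea only (this is not icecream's actual implementation), a few lines of inspect-based Python can mimic that behavior:

```python
import inspect

def ic_sketch(value):
    """Minimal sketch of what icecream's ic() does: print the
    calling expression alongside its value, then return the value
    so it can be used inline."""
    frame = inspect.currentframe().f_back
    # Best-effort grab of the source line at the call site.
    context = inspect.getframeinfo(frame).code_context
    call_text = context[0].strip() if context else "ic_sketch(...)"
    print(f"{call_text} -> {value!r}")
    return value

x = 21
y = ic_sketch(x * 2)  # prints the call site and the value, returns 42
```

The real package is used as `from icecream import ic; ic(x * 2)`, and because ic() returns its argument it can be dropped into the middle of an expression without changing program behavior.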

Essential guide to various dimensionality reduction techniques in Python

Image by Gerd Altmann from Pixabay

Exploratory Data Analysis is an important component of the data science model development pipeline. A data scientist spends most of their time on data cleaning, feature engineering, and other data wrangling techniques. Dimensionality reduction is one of the techniques data scientists use while performing feature engineering.

Dimensionality Reduction is the process of transforming a higher-dimensional dataset to a comparable lower-dimensional space. A real-world dataset often has a lot of redundant features. Dimensionality reduction techniques can be used to get rid of such redundant features or convert the n-dimensional datasets to 2 or 3 dimensions for visualization.

In this…
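As a minimal sketch of one such technique, principal component analysis (PCA) can be implemented with NumPy alone via the SVD of the centered data; this is a toy illustration on a small in-memory array, not a replacement for a library implementation:

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project X (n_samples, n_features) onto its top principal
    components using the SVD of the mean-centered data."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 10-dimensional data
X_2d = pca_reduce(X, n_components=2)
print(X_2d.shape)  # (100, 2) -- ready for a 2-D scatter plot
```

The same reduction is available as `sklearn.decomposition.PCA` with extra conveniences such as explained-variance ratios.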

Achieve 280x faster data frame iteration

(Image by Author)

Pandas is one of the most popular Python libraries in the data science community, as it offers a vast API with flexible data structures for data exploration and visualization. It is the most preferred library for cleaning, transforming, manipulating, and analyzing data.

The vast API makes Pandas easy to use, but when it comes to handling and processing large datasets, it fails to scale computations across all the CPU cores. Dask and Vaex are open-source libraries that scale these computations to speed up the workflow.

Feature engineering and feature explorations require iterating through the data frame. There are various methods…
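As a small sketch of why the choice of iteration method matters, the snippet below computes the same derived column with row-by-row iterrows() and with a vectorized expression; the results are identical, but the vectorized version runs orders of magnitude faster on large frames:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5), "b": np.arange(5) * 10})

# Slow: Python-level iteration, one Series object built per row.
slow = [row["a"] + row["b"] for _, row in df.iterrows()]

# Fast: vectorized column arithmetic, executed in compiled code.
fast = (df["a"] + df["b"]).tolist()

print(slow == fast)  # True -- same result, very different speed
```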

Speed up Pandas processing workflows with the Swifter package

(Image by Author)

Pandas is one of the most popular Python packages in the data science community, as it offers a vast API and flexible data structures for data exploration and visualization. When it comes to handling and processing large datasets, however, it fails to scale.

One can load and process a large dataset in chunks, or use distributed parallel-computing libraries such as Dask, Pandarallel, and Vaex. The Modin library or the multiprocessing package can also be used to execute Python functions in parallel and speed up the workflow. In my previous articles, I have discussed hands-on implementations of Dask, Vaex, Modin, and multiprocessing.

Sometimes we are not willing…
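Swifter's appeal is that it wraps the familiar apply call. Since I can't assume the package is installed here, the sketch below uses plain Pandas apply, with the swifter form shown only as a comment:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10, dtype=float)})

def square(v):
    return v * v

# Plain Pandas apply: runs the Python function row by row, single-core.
result = df["x"].apply(square)

# With swifter installed, the call is nearly identical -- it decides
# whether to vectorize, run on one core, or parallelize:
#   import swifter
#   result = df["x"].swifter.apply(square)

print(result.iloc[-1])  # 81.0
```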

Unlike other models, scikit-learn lacks a feature importance implementation for the Voting Classifier

Image by Arek Socha from Pixabay

Machine learning models are increasingly employed in complex, high-stakes settings such as financial technology and medical science. Despite this increase in utilization, there’s a lack of techniques to explain the models. The higher the interpretability of a model, the easier it becomes for someone to comprehend its results. There are various advanced techniques and algorithms to interpret models, including LIME, SHAP, etc.

Feature importance is the simplest and most efficient technique to interpret how much each feature contributes to an estimator. It can lead to a better interpretation of the estimator and to model improvements…
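One common workaround, sketched below with NumPy only (this is an illustration, not scikit-learn's API), is to aggregate the per-model feature importances of the ensemble's fitted base estimators, e.g. by averaging them:

```python
import numpy as np

# Hypothetical feature_importances_ arrays collected from the fitted
# base estimators of a voting ensemble (one row per model).
importances = np.array([
    [0.50, 0.30, 0.20],   # model 1
    [0.40, 0.40, 0.20],   # model 2
    [0.60, 0.25, 0.15],   # model 3
])

# A simple average across models yields an ensemble-level importance.
ensemble_importance = importances.mean(axis=0)
print(ensemble_importance)  # approx [0.500, 0.317, 0.183]
```

A weighted average (weighting each model by its validation score) is a natural refinement of the same idea.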

Make the data frame more intuitive in the Jupyter Notebook

Image by Mocho from Pixabay

Pandas is one of the most popular Python libraries in the data science community, as it offers flexible data structures and a vast API for data explorations and visualization. A data scientist spends most of the time exploring the data and performing exploratory data analysis. Jupyter Notebook provides an interactive platform to perform exploratory data analysis and is most preferred by Data Scientists and Data Analysts.

dataframe.head() is a Pandas function that displays the top 5 rows of a data frame. Pandas uses predefined HTML+CSS commands to display the data frame in a formatted way on the…
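Before reaching for custom styling, Pandas' built-in display options already control much of how a frame renders in the notebook; a small sketch (the option names below are standard Pandas options):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.random.default_rng(1).normal(size=4)})

# Limit how many rows are shown and format floats notebook-wide.
pd.set_option("display.max_rows", 10)
pd.set_option("display.float_format", "{:.2f}".format)

# In a notebook, df.style (e.g. df.style.background_gradient())
# goes further and customizes the HTML+CSS rendering itself.
print(df.head())  # floats now render with two decimal places
```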

Essential guide to multiprocessing in Python

(Image by Author)

Python is a popular programming language and the most preferred among the data science community. It is slower than many other popular programming languages because of its dynamic nature and versatility: Python code is interpreted at runtime instead of being compiled to native code at compile time.

C code executes 10 to 100 times faster than equivalent Python code. However, if you compare development time, Python is faster than C. For data science case studies, development time is often far more critical than runtime performance. …
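When runtime does matter, the standard-library multiprocessing module can spread CPU-bound work across cores. A minimal sketch using a worker pool:

```python
from multiprocessing import Pool

def square(n):
    # Top-level function so it can be pickled and sent to workers.
    return n * n

if __name__ == "__main__":
    # Distribute the work across 4 worker processes.
    with Pool(processes=4) as pool:
        results = pool.map(square, range(8))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The `if __name__ == "__main__"` guard is required on platforms that spawn (rather than fork) worker processes.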

Benchmark timings for various cross-validation-based hyperparameter tuning techniques

(Image by Author), Edited using Pixlr

To train robust machine learning models, one must select the best-suited machine learning algorithm along with the best set of corresponding hyperparameters. To find what works best for a use case, a data scientist may need to manually train hundreds of models with different sets of hyperparameters and compare their performance. This manual model search is tedious and slows down the modeling pipeline.

Hyperparameter tuning refers to the process of choosing the optimal set of hyperparameters for a model. It is recommended to search an estimator's hyperparameter space for the best cross-validation score. Various…
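The core of grid search is a loop over every combination in the parameter grid, scoring each by cross-validation. The sketch below shows that loop in plain Python; the grid values and the stand-in scoring function are made up for illustration (a real pipeline would use something like scikit-learn's GridSearchCV):

```python
from itertools import product

# Hypothetical hyperparameter grid for some estimator.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
}

def cv_score(params):
    """Stand-in for a real cross-validation score, just so the
    search loop below is runnable; peaks at depth=5, lr=0.1."""
    return -abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.1)

# Exhaustive grid search: enumerate every combination, keep the best.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
best = max(candidates, key=cv_score)
print(best)  # {'max_depth': 5, 'learning_rate': 0.1}
```

Randomized search follows the same skeleton but samples a fixed number of candidates instead of enumerating all of them, which is why it scales better to large grids.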

Benchmark time comparison between Scikit-learn and Faiss k-Means implementation

Image by anncapictures from Pixabay

k-Means clustering is a centroid-based unsupervised clustering method. This technique partitions the data points into k clusters or groups, each represented by its centroid.

For a set of n data points, the k-means algorithm, a.k.a. Lloyd's algorithm, iterates to minimize the intra-cluster distance and maximize the inter-cluster distance.
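A plain NumPy sketch of Lloyd's algorithm makes the two alternating steps explicit (the simple spread-out initialization here is an assumption for illustration; real implementations use smarter schemes like k-means++):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=10):
    """Sketch of Lloyd's algorithm: alternately assign each point
    to its nearest centroid, then recompute each centroid as the
    mean of its assigned points."""
    # Initialize centroids from k points spread across the dataset.
    centroids = X[np.linspace(0, len(X) - 1, k, dtype=int)].copy()
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid per point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: cluster means (keep old centroid if cluster empty).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated blobs separate cleanly into two clusters.
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
centroids, labels = lloyd_kmeans(X, k=2)
```

Faiss accelerates exactly this computation; the assignment step dominates the cost, which is why optimized nearest-neighbor search gives such large speedups.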

Benchmark time comparison of using various data formats for reading and saving operations

Image by Pexels from Pixabay

Data Science is all about working with data. The entire data science model development pipeline involves data wrangling, data exploration, exploratory data analysis, feature engineering, and modeling. Reading and saving intermediate files is a common task in a model development pipeline, and a data scientist often prefers reading and saving Pandas data frames in CSV format. Working with small or moderately sized data is easy and does not require much overhead, but with large datasets the workflow slows down due to resource limitations.

CSV, Excel, or other text…
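A minimal benchmark sketch comparing two formats (CSV text vs. Python's binary pickle) shows the shape of such a comparison; the exact timings depend entirely on the machine and data:

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.default_rng(0).normal(size=(50_000, 4)),
    columns=list("abcd"),
)

tmp = tempfile.mkdtemp()
csv_path = os.path.join(tmp, "data.csv")
pkl_path = os.path.join(tmp, "data.pkl")

# Time a save + load round trip for each format.
start = time.perf_counter()
df.to_csv(csv_path, index=False)
csv_df = pd.read_csv(csv_path)
csv_time = time.perf_counter() - start

start = time.perf_counter()
df.to_pickle(pkl_path)
pkl_df = pd.read_pickle(pkl_path)
pkl_time = time.perf_counter() - start

# Binary formats skip text parsing and typically round-trip faster.
print(f"csv: {csv_time:.3f}s  pickle: {pkl_time:.3f}s")
```

Pickle also preserves dtypes exactly, whereas CSV round trips must re-infer them; columnar formats such as Parquet and Feather (which need an extra dependency like pyarrow) extend the same trade-off.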

Satyam Kumar