Awesome Python Data Science
Probably the best curated list of data science software in Python
Contents
Machine Learning
General Purpose Machine Learning
-
scikit-learn - Machine learning in Python.
-
PyCaret
- An open-source, low-code machine learning library in Python.
-
Shogun
- Machine learning toolbox.
-
xLearn
- High Performance, Easy-to-use, and Scalable Machine Learning Package.
-
cuML
- RAPIDS Machine Learning Library.
-
modAL
- Modular active learning framework for Python3.
-
Sparkit-learn
- PySpark + scikit-learn = Sparkit-learn.
-
mlpack
- A scalable C++ machine learning library (Python bindings).
-
dlib
- Toolkit for making real-world machine learning and data analysis applications in C++ (Python bindings).
-
MLxtend
- Extension and helper modules for Python’s data analysis and machine learning libraries.
-
hyperlearn
- Rewritten Sklearn and Statsmodels routines: 50%+ faster, 50%+ less RAM usage, with GPU support.
-
Reproducible Experiment Platform (REP)
- Machine Learning toolbox for Humans.
-
scikit-multilearn
- Multi-label classification for Python.
-
seqlearn
- Sequence classification toolkit for Python.
-
pystruct
- Simple structured learning framework for Python.
-
sklearn-expertsys
- Highly interpretable classifiers for scikit-learn.
-
RuleFit
- Implementation of the RuleFit algorithm.
-
metric-learn
- Metric learning algorithms in Python.
-
pyGAM
- Generalized Additive Models in Python.
-
causalml
- Uplift modeling and causal inference with machine learning algorithms.
Gradient Boosting
-
XGBoost
- Scalable, Portable, and Distributed Gradient Boosting.
-
LightGBM
- A fast, distributed, high-performance gradient boosting framework.
-
CatBoost
- An open-source gradient boosting on decision trees library.
-
ThunderGBM
- Fast GBDTs and Random Forests on GPUs.
-
NGBoost
- Natural Gradient Boosting for Probabilistic Prediction.
-
TensorFlow Decision Forests
- A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.
Ensemble Methods
-
ML-Ensemble - High performance ensemble learning.
-
Stacking
- Simple and useful stacking library written in Python.
-
stacked_generalization
- Library for machine learning stacking generalization.
-
vecstack
- Python package for stacking (machine learning technique).
Imbalanced Datasets
-
imbalanced-learn
- Module to perform under-sampling and over-sampling with various techniques.
-
imbalanced-algorithms
- Python-based implementations of algorithms for learning on imbalanced data.
Random Forests
Kernel Methods
-
pyFM
- Factorization machines in Python.
-
fastFM
- A library for Factorization Machines.
-
tffm
- TensorFlow implementation of an arbitrary order Factorization Machine.
-
liquidSVM
- An implementation of SVMs.
-
scikit-rvm
- Relevance Vector Machine implementation using the scikit-learn API.
-
ThunderSVM
- A fast SVM Library on GPUs and CPUs.
Deep Learning
PyTorch
-
PyTorch
- Tensors and Dynamic neural networks in Python with strong GPU acceleration.
-
pytorch-lightning
- PyTorch Lightning is just organized PyTorch.
-
ignite
- High-level library to help with training neural networks in PyTorch.
-
skorch
- A scikit-learn compatible neural network library that wraps PyTorch.
-
Catalyst
- High-level utils for PyTorch DL & RL research.
-
ChemicalX
- A PyTorch-based deep learning library for drug pair scoring.
TensorFlow
-
TensorFlow
- Computation using data flow graphs for scalable machine learning by Google.
-
TensorLayer
- Deep Learning and Reinforcement Learning Library for Researchers and Engineers.
-
TFLearn
- Deep learning library featuring a higher-level API for TensorFlow.
-
Sonnet
- TensorFlow-based neural network library.
-
tensorpack
- A Neural Net Training Interface on TensorFlow.
-
Polyaxon
- A platform that helps you build, manage and monitor deep learning models.
-
tfdeploy
- Deploy TensorFlow graphs for fast evaluation and export to TensorFlow-less environments running numpy.
-
tensorflow-upstream
- TensorFlow ROCm port.
-
TensorFlow Fold
- Deep learning with dynamic computation graphs in TensorFlow.
-
TensorLight
- A high-level framework for TensorFlow.
-
Mesh TensorFlow
- Model Parallelism Made Easier.
-
Ludwig
- A toolbox that allows one to train and test deep learning models without the need to write code.
-
Keras - A high-level neural networks API running on top of TensorFlow.
-
keras-contrib
- Keras community contributions.
-
Hyperas
- Keras + Hyperopt: A straightforward wrapper for convenient hyperparameter optimization.
-
Elephas
- Distributed Deep learning with Keras & Spark.
-
qkeras
- A quantization deep learning library.
MXNet
-
MXNet
- Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler.
-
Gluon
- A clear, concise, simple yet powerful and efficient API for deep learning (now included in MXNet).
-
Xfer
- Transfer Learning library for Deep Neural Networks.
-
MXNet
- HIP Port of MXNet.
JAX
-
JAX
- Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.
-
FLAX
- A neural network library for JAX that is designed for flexibility.
-
Optax
- A gradient processing and optimization library for JAX.
Others
-
transformers
- State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
-
Tangent
- Source-to-Source Debuggable Derivatives in Pure Python.
-
autograd
- Efficiently computes derivatives of numpy code.
-
Caffe
- A fast open framework for deep learning.
-
nnabla
- Neural Network Libraries by Sony.
Automated Machine Learning
-
auto-sklearn
- An AutoML toolkit and a drop-in replacement for a scikit-learn estimator.
-
Auto-PyTorch
- Automatic architecture search and hyperparameter optimization for PyTorch.
-
AutoKeras
- AutoML library for deep learning.
-
AutoGluon
- AutoML for Image, Text, Tabular, Time-Series, and MultiModal Data.
-
TPOT
- AutoML tool that optimizes machine learning pipelines using genetic programming.
-
MLBox
- A powerful Automated Machine Learning python library.
Natural Language Processing
-
torchtext
- Data loaders and abstractions for text and NLP.
-
gluon-nlp
- NLP made easy.
-
KerasNLP
- Modular Natural Language Processing workflows with Keras.
-
spaCy - Industrial-Strength Natural Language Processing.
-
NLTK
- Modules, data sets, and tutorials supporting research and development in Natural Language Processing.
-
CLTK
- The Classical Language Toolkit.
-
gensim - Topic Modelling for Humans.
-
pyMorfologik
- Python binding for Morfologik.
-
skift
- Scikit-learn wrappers for Python fastText.
-
Phonemizer
- Simple text-to-phonemes converter for multiple languages.
-
flair
- Very simple framework for state-of-the-art NLP.
Computer Audition
-
torchaudio
- An audio library for PyTorch.
-
librosa
- Python library for audio and music analysis.
-
Yaafe
- Audio features extraction.
-
aubio
- A library for audio and music analysis.
-
Essentia
- Library for audio and music analysis, description, and synthesis.
-
LibXtract
- A simple, portable, lightweight library of audio feature extraction functions.
-
Marsyas
- Music Analysis, Retrieval, and Synthesis for Audio Signals.
-
muda
- A library for augmenting annotated audio data.
-
madmom
- Python audio and music signal processing library.
Computer Vision
-
torchvision
- Datasets, Transforms, and Models specific to Computer Vision.
-
PyTorch3D
- PyTorch3D is FAIR’s library of reusable components for deep learning with 3D data.
-
gluon-cv
- Provides implementations of the state-of-the-art deep learning models in computer vision.
-
KerasCV
- Industry-strength Computer Vision workflows with Keras.
-
OpenCV
- Open Source Computer Vision Library.
-
Decord
- An efficient video loader for deep learning with smart shuffling that’s super easy to digest.
-
MMEngine
- OpenMMLab Foundational Library for Training Deep Learning Models.
-
scikit-image
- Image Processing SciKit (Toolbox for SciPy).
-
imgaug
- Image augmentation for machine learning experiments.
-
imgaug_extension
- Additional augmentations for imgaug.
-
Augmentor
- Image augmentation library in Python for machine learning.
-
albumentations
- Fast image augmentation library and easy-to-use wrapper around other libraries.
-
LAVIS
- A One-stop Library for Language-Vision Intelligence.
Time Series
-
sktime
- A unified framework for machine learning with time series.
-
darts
- A Python library for easy manipulation and forecasting of time series.
-
statsforecast
- Lightning fast forecasting with statistical and econometric models.
-
mlforecast
- Scalable machine learning-based time series forecasting.
-
neuralforecast
- Scalable neural network-based time series forecasting.
-
tslearn
- Machine learning toolkit dedicated to time-series data.
-
tick
- Module for statistical learning, with a particular emphasis on time-dependent modeling.
-
greykite
- A flexible, intuitive, and fast forecasting library.
-
Prophet
- Automatic Forecasting Procedure.
-
PyFlux
- Open source time series library for Python.
-
bayesloop
- Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.
-
luminol
- Anomaly Detection and Correlation library.
-
dateutil - Powerful extensions to the standard datetime module.
-
maya
- Makes it easy to parse strings and change time zones.
-
Chaos Genius
- ML-powered analytics engine for outlier/anomaly detection and root cause analysis.
Reinforcement Learning
-
Gymnasium
- An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym).
-
PettingZoo
- An API standard for multi-agent reinforcement learning environments, with popular reference environments and related utilities.
-
MAgent2
- An engine for high performance multi-agent environments with very large numbers of agents, along with a set of reference environments.
-
Stable Baselines3
- A set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.
-
Shimmy
- An API conversion tool for popular external reinforcement learning environments.
-
EnvPool
- C++-based high-performance parallel environment execution engine (vectorized env) for general RL environments.
-
RLlib - Scalable Reinforcement Learning.
-
Tianshou
- An elegant PyTorch deep reinforcement learning library.
-
Acme
- A library of reinforcement learning components and agents.
-
Catalyst-RL
- PyTorch framework for RL research.
-
d3rlpy
- An offline deep reinforcement learning library.
-
DI-engine
- OpenDILab Decision AI Engine.
-
TF-Agents
- A library for Reinforcement Learning in TensorFlow.
-
TensorForce
- A TensorFlow library for applied reinforcement learning.
-
TRFL
- TensorFlow Reinforcement Learning.
-
Dopamine
- A research framework for fast prototyping of reinforcement learning algorithms.
-
keras-rl
- Deep Reinforcement Learning for Keras.
-
garage
- A toolkit for reproducible reinforcement learning research.
-
Horizon
- A platform for Applied Reinforcement Learning.
-
rlpyt
- Reinforcement Learning in PyTorch.
-
cleanrl
- High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG).
-
Machin
- A reinforcement learning library designed for PyTorch.
-
SKRL
- Modular reinforcement learning library (on PyTorch and JAX) with support for NVIDIA Isaac Gym, Isaac Orbit and Omniverse Isaac Gym.
-
Imitation
- Clean PyTorch implementations of imitation and reward learning algorithms.
Graph Machine Learning
-
pytorch_geometric
- Geometric Deep Learning Extension Library for PyTorch.
-
pytorch_geometric_temporal
- Temporal Extension Library for PyTorch Geometric.
-
PyTorch Geometric Signed Directed
- A signed/directed graph neural network extension library for PyTorch Geometric.
-
dgl
- Python package built to ease deep learning on graphs, on top of existing DL frameworks.
-
Spektral
- Deep learning on graphs.
-
StellarGraph
- Machine Learning on Graphs.
-
Graph Nets
- Build Graph Nets in TensorFlow.
-
TensorFlow GNN
- A library to build Graph Neural Networks on the TensorFlow platform.
-
PyTorch-BigGraph
- Generate embeddings from large-scale graph-structured data.
-
Auto Graph Learning
- An autoML framework & toolkit for machine learning on graphs.
-
Karate Club
- An unsupervised machine learning library for graph-structured data.
-
Little Ball of Fur
- A library for sampling graph structured data.
-
GreatX
- A graph reliability toolbox based on PyTorch and PyTorch Geometric (PyG).
-
Jraph
- A Graph Neural Network Library in JAX.
Learning-to-Rank & Recommender Systems
-
LightFM
- A Python implementation of LightFM, a hybrid recommendation algorithm.
-
Spotlight - Deep recommender models using PyTorch.
-
Surprise
- A Python scikit for building and analyzing recommender systems.
-
RecBole
- A unified, comprehensive and efficient recommendation library.
-
allRank
- allRank is a framework for training learning-to-rank neural models based on PyTorch.
-
TensorFlow Recommenders
- A library for building recommender system models using TensorFlow.
-
TensorFlow Ranking
- Learning to Rank in TensorFlow.
Probabilistic Graphical Models
-
pomegranate
- Probabilistic and graphical models for Python.
-
pgmpy
- A Python library for working with Probabilistic Graphical Models.
-
pyAgrum - A GRaphical Universal Modeler.
Probabilistic Methods
-
pyro
- A flexible, scalable deep probabilistic programming library built on PyTorch.
-
PyMC
- Bayesian Stochastic Modelling in Python.
-
ZhuSuan - Bayesian Deep Learning.
-
GPflow - Gaussian processes in TensorFlow.
-
InferPy
- Deep Probabilistic Modelling Made Easy.
-
PyStan
- Bayesian inference using the No-U-Turn sampler (Python interface).
-
sklearn-bayes
- Python package for Bayesian Machine Learning with scikit-learn API.
-
skpro
- Supervised domain-agnostic prediction framework for probabilistic modelling by The Alan Turing Institute.
-
PyVarInf
- Bayesian Deep Learning methods with Variational Inference for PyTorch.
-
emcee
- The Python ensemble sampling toolkit for affine-invariant MCMC.
-
hsmmlearn
- A library for hidden semi-Markov models with explicit durations.
-
pyhsmm
- Bayesian inference in HSMMs and HMMs.
-
GPyTorch
- A highly efficient and modular implementation of Gaussian Processes in PyTorch.
-
sklearn-crfsuite
- A scikit-learn-inspired API for CRFsuite.
Model Explanation
-
dalex
- moDel Agnostic Language for Exploration and explanation. 
-
Shapley
- A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
-
Alibi
- Algorithms for monitoring and explaining machine learning models.
-
anchor
- Code for “High-Precision Model-Agnostic Explanations” paper.
-
aequitas
- Bias and Fairness Audit Toolkit.
-
Contrastive Explanation
- Contrastive Explanation (Foil Trees).
-
yellowbrick
- Visual analysis and diagnostic tools to facilitate machine learning model selection.
-
scikit-plot
- An intuitive library to add plotting functionality to scikit-learn objects.
-
shap
- A unified approach to explain the output of any machine learning model.
-
ELI5
- A library for debugging/inspecting machine learning classifiers and explaining their predictions.
-
Lime
- Explaining the predictions of any machine learning classifier.
-
FairML
- A Python toolbox for auditing machine learning models for bias.
-
L2X
- Code for replicating the experiments in the paper Learning to Explain: An Information-Theoretic Perspective on Model Interpretation.
-
PDPbox
- Partial dependence plot toolbox.
-
PyCEbox
- Python Individual Conditional Expectation Plot Toolbox.
-
Skater
- Python Library for Model Interpretation.
-
model-analysis
- Model analysis tools for TensorFlow.
-
themis-ml
- A library that implements fairness-aware machine learning algorithms.
-
treeinterpreter
- Interpreting scikit-learn’s decision tree and random forest predictions.
-
AI Explainability 360
- Interpretability and explainability of data and machine learning models.
-
Auralisation
- Auralisation of learned features in CNN (for audio).
-
CapsNet-Visualization
- A visualization of the CapsNet layers to better understand how it works.
-
lucid
- A collection of infrastructure and tools for research in neural network interpretability.
-
Netron
- Visualizer for deep learning and machine learning models (no Python code, but visualizes models from most Python Deep Learning frameworks).
-
FlashLight
- Visualization tool for your neural network.
-
tensorboard-pytorch
- Tensorboard for PyTorch (and chainer, mxnet, numpy, …).
Genetic Programming
-
gplearn
- Genetic Programming in Python.
-
PyGAD
- Genetic Algorithm in Python.
-
DEAP
- Distributed Evolutionary Algorithms in Python.
-
karoo_gp
- A Genetic Programming platform for Python with GPU support.
-
monkeys
- A strongly-typed genetic programming framework for Python.
-
sklearn-genetic
- Genetic feature selection module for scikit-learn.
Optimization
-
Optuna
- A hyperparameter optimization framework.
-
pymoo
- Multi-objective Optimization in Python.
-
pycma
- Python implementation of CMA-ES.
-
Spearmint
- Bayesian optimization.
-
BoTorch
- Bayesian optimization in PyTorch.
-
scikit-opt
- Heuristic Algorithms for optimization.
-
sklearn-genetic-opt
- Hyperparameters tuning and feature selection using evolutionary algorithms.
-
SMAC3
- Sequential Model-based Algorithm Configuration.
-
Optunity
- A library containing various optimizers for hyperparameter tuning.
-
hyperopt
- Distributed Asynchronous Hyperparameter Optimization in Python.
-
hyperopt-sklearn
- Hyper-parameter optimization for sklearn.
-
sklearn-deap
- Use evolutionary algorithms instead of gridsearch in scikit-learn.
-
sigopt_sklearn
- SigOpt wrappers for scikit-learn methods.
-
Bayesian Optimization
- A Python implementation of global optimization with Gaussian processes.
-
SafeOpt
- Safe Bayesian Optimization.
-
scikit-optimize
- Sequential model-based optimization with a scipy.optimize interface.
-
Solid
- A comprehensive gradient-free optimization framework written in Python.
-
PySwarms
- A research toolkit for particle swarm optimization in Python.
-
Platypus
- A Free and Open Source Python Library for Multiobjective Optimization.
-
GPflowOpt
- Bayesian Optimization using GPflow.
-
POT
- Python Optimal Transport library.
-
Talos
- Hyperparameter Optimization for Keras Models.
-
nlopt
- Library for nonlinear optimization (global and local, constrained or unconstrained).
-
OR-Tools - An open-source software suite for optimization by Google; provides a unified programming interface to a half dozen solvers: SCIP, GLPK, GLOP, CP-SAT, CPLEX, and Gurobi.
Feature Engineering
General
-
Featuretools
- Automated feature engineering.
-
Feature Engine
- Feature engineering package with sklearn-like functionality.
-
OpenFE
- Automated feature generation with expert-level performance.
-
skl-groups
- A scikit-learn addon to operate on set/"group"-based features.
-
Feature Forge
- A set of tools for creating and testing machine learning features.
-
few
- A feature engineering wrapper for sklearn.
-
scikit-mdr
- A sklearn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction.
-
tsfresh
- Automatic extraction of relevant features from time series.
-
dirty_cat
- Machine learning on dirty tabular data (especially string-based variables for classification and regression).
-
NitroFE
- Moving window features.
-
sk-transformer
- A collection of pandas and scikit-learn compatible transformers for all kinds of preprocessing and feature engineering steps.
Feature Selection
-
scikit-feature
- Feature selection repository in Python.
-
boruta_py
- Implementations of the Boruta all-relevant feature selection method.
-
BoostARoota
- A fast xgboost feature selection algorithm.
-
scikit-rebate
- A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning.
-
zoofs
- A feature selection library based on evolutionary algorithms.
Visualization
General Purposes
-
Matplotlib
- Plotting with Python.
-
seaborn
- Statistical data visualization using matplotlib.
-
prettyplotlib
- Painlessly create beautiful matplotlib plots.
-
python-ternary
- Ternary plotting library for Python with matplotlib.
-
missingno
- Missing data visualization module for Python.
-
chartify
- Python library that makes it easy for data scientists to create charts.
-
physt
- Improved histograms.
Interactive plots
-
animatplot
- A python package for animating plots built on matplotlib.
-
plotly - A Python library that makes interactive and publication-quality graphs.
-
Bokeh
- Interactive Web Plotting for Python.
-
Altair - Declarative statistical visualization library for Python. Many data transformations can be done directly in the code that creates the graph.
-
bqplot
- Plotting library for IPython/Jupyter notebooks.
-
pyecharts
- A Python interactive plotting library based on Echarts, a charting and visualization library.
Map
-
folium - Makes it easy to visualize data on an interactive OpenStreetMap map.
-
geemap
- Python package for interactive mapping with Google Earth Engine (GEE).
Automatic Plotting
-
HoloViews
- Stop plotting your data - annotate your data and let it visualize itself.
-
AutoViz
- Visualize data automatically with one line of code (ideal for machine learning).
-
SweetViz
- Visualize and compare datasets, target values and associations, with one line of code.
NLP
-
pyLDAvis
- Interactive topic model visualization.
Deployment
-
fastapi - A modern, fast (high-performance) web framework for building APIs with Python.
-
streamlit - Makes it easy to deploy machine learning models.
-
streamsync
- No-code in the front, Python in the back. An open-source framework for creating data apps.
-
gradio
- Create UIs for your machine learning model in Python in 3 minutes.
-
Vizro
- A toolkit for creating modular data visualization applications.
-
datapane - A collection of APIs to turn scripts and notebooks into interactive reports.
-
binder - Enables sharing and executing Jupyter Notebooks.
Statistics
-
pandas_summary
- Extension to pandas dataframes describe function.
-
Pandas Profiling
- Create HTML profiling reports from pandas DataFrame objects.
-
statsmodels
- Statistical modeling and econometrics in Python.
-
stockstats
- Supplies a wrapper StockDataFrame based on the pandas.DataFrame with inline stock statistics/indicators support.
-
weightedcalcs
- A pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.
-
scikit-posthocs
- Pairwise Multiple Comparisons Post-hoc Tests.
-
Alphalens
- Performance analysis of predictive (alpha) stock factors.
Data Manipulation
Data Frames
-
pandas - Powerful Python data analysis toolkit.
-
polars
- A fast multi-threaded, hybrid-out-of-core DataFrame library.
-
Arctic
- High-performance datastore for time series and tick data.
-
datatable
- Data.table for Python.
-
pandas_profiling
- Create HTML profiling reports from pandas DataFrame objects.
-
cuDF
- GPU DataFrame Library.
-
blaze
- NumPy and pandas interface to Big Data.
-
pandasql
- Allows you to query pandas DataFrames using SQL syntax.
-
pandas-gbq
- pandas Google Big Query.
-
xpandas
- Universal 1d/2d data containers with Transformers functionality for data analysis by The Alan Turing Institute.
-
pysparkling
- A pure Python implementation of Apache Spark’s RDD and DStream interfaces.
-
modin
- Speed up your pandas workflows by changing a single line of code.
-
swifter
- A package that efficiently applies any function to a pandas dataframe or series in the fastest available manner.
-
pandas-log
- A package that allows providing feedback about basic pandas operations and finds both business logic and performance issues.
-
vaex
- Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second.
-
xarray
- Xarray combines the best features of NumPy and pandas for multidimensional data selection by supplementing numerical axis labels with named dimensions for more intuitive, concise, and less error-prone indexing routines.
Pipelines
-
pdpipe
- Easy pipelines for pandas DataFrames.
-
SSPipe - Python pipe (|) operator with support for DataFrames, NumPy, and PyTorch.
-
pandas-ply
- Functional data manipulation for pandas.
-
Dplython
- Dplyr for Python.
-
sklearn-pandas
- pandas integration with sklearn.
-
Dataset
- Helps you conveniently work with random or sequential batches of your data and define data processing.
-
pyjanitor
- Clean APIs for data cleaning.
-
meza
- A Python toolkit for processing tabular data.
-
Prodmodel
- Build system for data science pipelines.
-
dopanda
- Hints and tips for using pandas in an analysis environment.
-
Hamilton
- A microframework for dataframe generation that applies Directed Acyclic Graphs specified by a flow of lazily evaluated Python functions.
Data-centric AI
-
cleanlab
- The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
-
snorkel
- A system for quickly generating training data with weak supervision.
-
dataprep
- Collect, clean, and visualize your data in Python with a few lines of code.
Synthetic Data
-
ydata-synthetic
- A package to generate synthetic tabular and time-series data leveraging the state-of-the-art generative models.
Distributed Computing
-
Horovod
- Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
-
PySpark - Exposes the Spark programming model to Python.
-
Veles
- Distributed machine learning platform.
-
Jubatus
- Framework and Library for Distributed Online Machine Learning.
-
DMTK
- Microsoft Distributed Machine Learning Toolkit.
-
PaddlePaddle
- PArallel Distributed Deep LEarning.
-
dask-ml
- Distributed and parallel machine learning.
-
Distributed
- Distributed computation in Python.
Experimentation
-
mlflow
- Open source platform for the machine learning lifecycle.
-
Neptune - A lightweight ML experiment tracking, results visualization, and management tool.
-
dvc - Data Version Control | Git for Data & Models | ML Experiments Management.
-
envd
- 🏕️ machine learning development environment for data science and AI/ML engineering teams.
-
Sacred
- A tool to help you configure, organize, log, and reproduce experiments.
-
Ax
- Adaptive Experimentation Platform.
Data Validation
-
great_expectations
- Always know what to expect from your data.
-
pandera
- A lightweight, flexible, and expressive statistical data testing library.
-
deepchecks
- Validation & testing of ML models and data during model development, deployment, and production.
-
evidently
- Evaluate and monitor ML models from validation to production.
-
TensorFlow Data Validation
- Library for exploring and validating machine learning data.
Evaluation
-
recmetrics
- Library of useful metrics and plots for evaluating recommender systems.
-
Metrics
- Machine learning evaluation metrics.
-
sklearn-evaluation
- Model evaluation made easy: plots, tables, and markdown reports.
-
AI Fairness 360
- Fairness metrics for datasets and ML models, explanations, and algorithms to mitigate bias in datasets and models.
Computations
-
numpy - The fundamental package needed for scientific computing with Python.
-
Dask
- Parallel computing with task scheduling.
-
bottleneck
- Fast NumPy array functions written in C.
-
CuPy
- NumPy-like API accelerated with CUDA.
-
scikit-tensor
- Python library for multilinear algebra and tensor factorizations.
-
numdifftools
- Solve automatic numerical differentiation problems in one or more variables.
-
quaternion
- Add built-in support for quaternions to numpy.
-
adaptive
- Tools for adaptive and parallel sampling of mathematical functions.
-
NumExpr
- A fast numerical expression evaluator for NumPy that comes with an integrated computing virtual machine to speed calculations up by avoiding memory allocation for intermediate results.
Web Scraping
-
BeautifulSoup - The easiest library for beginners to scrape static websites.
-
Scrapy - Fast and extensible scraping library. Lets you write rules and create customized scrapers without touching the core.
-
Selenium - Use the Selenium Python API to access all functionalities of Selenium WebDriver in an intuitive way, like a real user.
-
Pattern
- High-level scraping for well-established websites such as Google, Twitter, and Wikipedia. Also has NLP, machine learning algorithms, and visualization.
-
twitterscraper
- Efficient library to scrape Twitter.
Spatial Analysis
-
GeoPandas
- Python tools for geographic data.
-
PySal
- Python Spatial Analysis Library.
Quantum Computing
-
qiskit
- Qiskit is an open-source SDK for working with quantum computers at the level of circuits, algorithms, and application modules.
-
cirq
- A Python framework for creating, editing, and invoking Noisy Intermediate Scale Quantum (NISQ) circuits.
-
PennyLane
- Quantum machine learning, automatic differentiation, and optimization of hybrid quantum-classical computations.
-
QML
- A Python Toolkit for Quantum Machine Learning.
Conversion
-
sklearn-porter
- Transpile trained scikit-learn estimators to C, Java, JavaScript, and others.
-
ONNX
- Open Neural Network Exchange.
-
MMdnn
- A set of tools to help users inter-operate among different deep learning frameworks.
-
treelite
- Universal model exchange and serialization format for decision tree forests.
Contributing
Contributions are welcome! :sunglasses: </br>
Read the <a href=https://github.com/krzjoa/awesome-python-datascience/blob/master/CONTRIBUTING.md>contribution guideline</a>.
License
This work is licensed under the Creative Commons Attribution 4.0 International License - CC BY 4.0