MaxwellLZH/python-packages-for-data-geeks

A curated list of nice packages for us data people

Time Series

name owner stars description
AnomalyDetection twitter 3.1k Anomaly detection with R
stumpy TDAmeritrade 879 Stumpy is a powerful and scalable python library that can be used for a variety of time series data mining tasks
gluon-ts awslabs 765 Gluonts - probabilistic time series modeling in python
RobustSTL LeeDoYup 120 Unofficial implementation of robuststl: a robust seasonal-trend decomposition algorithm for long time series (aaai 2019)

Feature Engineering

name owner stars description
featuretools FeatureLabs 4.8k An open source python library for automated feature engineering
Augly facebookresearch 3.3k A data augmentations library for audio, image, text, and video.
great_expectations great-expectations 2.7k Always know what to expect from your data.
categorical-encoders scikit-learn-contrib 1.1k A library of sklearn compatible categorical variable encoders
fancy-impute iskandr 735 Multivariate imputation and matrix completion algorithms implemented in python
dirty-cat dirty-cat 158 Encoding methods for dirty categorical variables

Pandas Extensions

name owner stars description
pandas-profiling pandas-profiling 5.9k Create html profiling reports from pandas dataframe objects
pdpipe pdpipe 557 Easy pipelines for pandas dataframes.
pydqc SauceCat 211 Python automatic data quality check toolkit
pandas_flavor Zsailer 186 The easy way to write your own flavor of pandas
pandas-log eyaltrabelsi 154 The goal of pandas-log is to provide feedback about basic pandas operations. It provides simple wrapper functions for the most common functions that add additional logs

Feature Selection

name owner stars description
scikit-features jundongl 845 Open-source feature selection repository in python
boruta scikit-learn-contrib 615 Python implementations of the boruta all-relevant feature selection method.
ppscore 8080labs 321 Predictive power score (pps) in python
minepy minepy 114 Minepy - maximal information-based nonparametric exploration
stability-selection scikit-learn-contrib 94 Scikit-learn compatible implementation of stability selection.

Model Tuning

name owner stars description
mlflow mlflow 5.3k Open source platform for the machine learning lifecycle
nni microsoft 4.5k An open source automl toolkit for neural architecture search, model compression and hyper-parameter tuning.
metaflow Netflix 2k Build and manage real-life data science projects with ease.
skopt scikit-optimize 1.6k Sequential model-based optimization with a scipy.optimize interface
optuna optuna 1.5k A hyperparameter optimization framework

AutoML

name owner stars description
jina jina-ai 7.6k Cloud-native neural search framework for any kind of data
autokeras keras-team 7k An automl system based on keras
tpot EpistasisLab 6.5k A python automated machine learning tool that optimizes machine learning pipelines using genetic programming.
auto-sklearn automl 4.1k Automated machine learning with scikit-learn
darts quark0 2.8k Differentiable architecture search for convolutional and recurrent networks

Dimension Reduction

name owner stars description
umap lmcinnes 3.4k Uniform manifold approximation and projection
star-clustering josephius 83 A clustering algorithm that automatically determines the number of clusters and works without hyperparameter fine-tuning.

Machine Learning

name owner stars description
pattern clips 7.2k Web mining module for python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
VowpalWabbit VowpalWabbit 6.7k Vowpal wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
xlearn aksnzhy 2.6k High performance, easy-to-use, and scalable machine learning (ml) package, including linear model (lr), factorization machines (fm), and field-aware factorization machines (ffm) for python and cli interface.
lightning scikit-learn-contrib 1.3k Large-scale linear classification, regression and ranking in python
Metrics benhamner 1.2k Machine learning evaluation metrics, implemented in python, r, haskell, and matlab / octave
mlens flennerhag 553 Ml-ensemble – high performance ensemble learning
NGBoost stanfordmlgroup 335 Natural gradient boosting for probabilistic prediction
polylearn scikit-learn-contrib 191 A library for factorization machines and polynomial networks for classification and regression in python.

Bayesian Statistics

name owner stars description
pyro pyro-ppl 6.2k Deep universal probabilistic programming with python and pytorch
pymc pymc-devs 4.6k Probabilistic programming in python: bayesian modeling and probabilistic machine learning with theano
Edward blei-lab 4.6k A probabilistic programming language in tensorflow. deep generative models, variational inference.

Deep Learning

name owner stars description
Autograd HIPS 4.4k Efficiently computes derivatives of numpy code.
RAdam LiyuanLucasLiu 1.8k On the variance of the adaptive learning rate and beyond
einops arogozhnikov 1.6k Deep learning operations reinvented (for pytorch, tensorflow, chainer, gluon and others)
Pytorch Metric Learning KevinMusgrave 1.3k The easiest way to use deep metric learning in your application. modular, flexible, and extensible. written in pytorch.

Model Training

name owner stars description
horovod horovod 11.8k Distributed training framework for tensorflow, keras, pytorch, and apache mxnet.
tfx tensorflow 1.2k Tfx is an end-to-end platform for deploying production ml pipelines

Distributed

name owner stars description
ray ray-project 13.3k An open source framework that provides a simple, universal api for building distributed applications. ray is packaged with rllib, a scalable reinforcement learning library, and tune, a scalable hyperparameter tuning library.
dask dask 7.5k Parallel computing with task scheduling

Federated Learning

name owner stars description
FATE FederatedAI 1.1k An industrial level federated learning framework

Confident Learning

name owner stars description
cleanlab cgnorthcutt 1.2k Find label errors in datasets, weak supervision, and learning with noisy labels.

Causal Inference

name owner stars description
Edward blei-lab 4.6k A probabilistic programming language in tensorflow. deep generative models, variational inference.
dowhy microsoft 2.3k Dowhy is a python library for causal inference that supports explicit modeling and testing of causal assumptions. dowhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
CausalML uber 2.1k Uplift modeling and causal inference with machine learning algorithms
EconML microsoft 943 Alice (automated learning and intelligence for causation and economics) is a microsoft research project aimed at applying artificial intelligence concepts to economic decision making. one of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal …

NLP Preprocessing

name owner stars description
jieba fxsjy 23k Jieba Chinese word segmentation
HanLP hankcs 19.4k Natural language processing for the next decade. tokenization, part-of-speech tagging, named entity recognition, syntactic & semantic dependency parsing, document classification
datasets huggingface 8.6k 🤗 the largest hub of ready-to-use nlp datasets for ml models with fast, easy-to-use and efficient data manipulation tools
Chinese Word Embeddings Embedding 6.6k 100+ pre-trained Chinese word vectors
sentencepiece google 3.3k Unsupervised text tokenizer for neural network-based text generation.
ckiptagger ckiplab 1.1k Ckip neural chinese word segmentation, pos tagging, and ner
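jiagu ownthink 1k Jiagu deep learning natural language processing toolkit: knowledge graph relation extraction, Chinese word segmentation, part-of-speech tagging, named entity recognition, sentiment analysis, new word discovery, keywords, text summarization, text clustering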
TextAttack QData 902 Textattack 🐙 is a python framework for adversarial attacks, data augmentation, and model training in nlp
MiNLP XiaoMi 494 Xiaomi natural language processing toolkits
fastHan fastnlp 163 fastHan is a Chinese natural language processing toolkit built on fastNLP and PyTorch, as convenient to call as spaCy.

NLP Models

name owner stars description
pytorch-transformers huggingface 17.8k 🤗 transformers: state-of-the-art natural language processing for tensorflow 2.0 and pytorch.
xlnet zihangdai 5.3k Xlnet: generalized autoregressive pretraining for language understanding
MatchZoo NTMC-Community 3.2k Facilitating the design, comparison and sharing of deep text matching models.
GPT2-Chinese Morizeyao 2.4k Chinese version of gpt2 training code, using bert tokenizer.
ALBERT brightmart 1.7k A lite bert for self-supervised learning of language representations, with a large collection of pre-trained Chinese ALBERT models
bertforkeras bojone 1.2k Light reimplement of bert for keras
AliceMind alibaba 820 Alibaba's collection of encoder-decoders from mind (machine intelligence of damo) lab
FinBert valuesimplex 356
gensen Maluuba 284 Learning general purpose distributed sentence representations via large scale multi-task learning

Representation Learning

name owner stars description
sentence-transformers UKPLab 3k Sentence embeddings with bert & xlnet
top2vec ddangelov 489 Top2vec learns jointly embedded topic, document and word vectors.
glyce embedding ShannonAI 238 Code for neurips 2019 - glyce: glyph-vectors for chinese character representations

Image Processing

name owner stars description
imgaug aleju 9k Image augmentation for machine learning experiments.
albumentations albumentations-team 7.2k Fast image augmentation library and easy to use wrapper around other libraries. documentation: https://albumentations.ai/docs/ paper about library: https://www.mdpi.com/2078-2489/11/2/125
imagededup idealo 2.7k 😎 finding duplicate images made easy!
imutils jrosebr1 2.6k A series of convenience functions to make basic image processing operations such as translation, rotation, resizing, skeletonization, and displaying matplotlib images easier with opencv and python.

Object Detection

name owner stars description
mmdetection open-mmlab 7.4k Open mmlab detection toolbox and benchmark
keras-YOLO3 qqwweee 5.8k A keras implementation of yolov3 (tensorflow backend)
Light Facial Detection Linzaer 3.9k 💎 1MB lightweight face detection model
SSD-Tensorflow balancap 3.8k Single shot multibox detector in tensorflow
detr facebookresearch 3.4k End-to-end object detection with transformers
FastMaskRCNN CharlesShang 3k Mask rcnn in tensorflow
u2net NathanUA 873 The code for our newly accepted paper in Pattern Recognition 2020: "U^2-Net: Going Deeper with Nested U-Structure for Salient Object Detection."
TFace Tencent 454 A trusty face recognition research platform developed by tencent youtu lab

OCR

name owner stars description
easyOCR JaidedAI 8.3k Ready-to-use ocr with 40+ languages supported including chinese, japanese, korean and thai
chineseocr-lite ouyanghuiyu 5.7k Ultra-lightweight Chinese OCR; supports vertical text recognition and ncnn inference (DBNet (1.8M) + CRNN (2.5M) + AngleNet (378KB)); total model size is only 4.7M
InvoiceNet naiveHobo 1.5k Deep neural network to extract intelligent information from invoice documents.

Recommendation

name owner stars description
recommenders microsoft 6.6k Best practices on recommendation systems
DeepCTR shenweichen 3.3k Easy-to-use,modular and extendible package of deep-learning based ctr models.
DeepFM ChenglongChen 1.5k Tensorflow implementation of deepfm for ctr prediction.
neural-collaborative-filtering hexiangnan 988 Neural collaborative filtering
deepmatch shenweichen 781 A deep matching model library for recommendations & advertising. it's easy to train models and to export representation vectors which can be used for ann search.
xDeepFM Leavingseason 656

Outlier Detection

name owner stars description
alibi-detect SeldonIO 206 Algorithms for outlier and adversarial instance detection, concept drift and metrics.

Graph

name owner stars description
graph_nets deepmind 3.9k Build graph nets in tensorflow
dgl dmlc 3.4k Python package built to ease deep learning on graph, on top of existing dl frameworks.
graphSAGE williamleif 1.9k Representation learning on large graphs using stochastic graph convolutions.
SNAP snap-stanford 1.5k Stanford network analysis platform (snap) is a general purpose network analysis and graph mining library.
stellargraph stellargraph 1.2k Stellargraph - machine learning on graphs
plato Tencent 874 Plato, Tencent's high-performance distributed graph computing framework
spektral danielegrattarola 810 Graph neural networks with keras and tensorflow 2.
simple-graph dpapathanasiou 499 This is a simple graph database in SQLite, inspired by "SQLite as a document database"

Searching

name owner stars description
faiss facebookresearch 9.7k A library for efficient similarity search and clustering of dense vectors.
annoy spotify 6.7k Approximate nearest neighbors in c++/python optimized for memory usage and loading/saving to disk
haystack deepset-ai 2.1k 🔍 end-to-end python framework for building natural language search interfaces to data. leverages transformers and the state-of-the-art of nlp. supports dpr, elasticsearch, hugging face’s hub, and much more!

Adversarial Learning

name owner stars description
pytorch-CycleGAN-and-pix2pix junyanz 13.1k Image-to-image translation in pytorch
CycleGAN junyanz 9.7k Software that can generate photos from paintings, turn horses into zebras, perform style transfer, and more.
GANHacks soumith 8.4k Starter from "How to train a GAN?" at NIPS2016
pix2pix phillipi 7.7k Image-to-image translation with conditional adversarial nets
DCGAN carpedm20 6.5k A TensorFlow implementation of "Deep Convolutional Generative Adversarial Networks"
ALAE podgorskiy 1.8k [cvpr2020] adversarial latent autoencoders
DoppelGANger fjxmlzn 45 Using gans for sharing networked time series data: challenges, initial promise, and open questions, imc 2020

Model Interpretation

name owner stars description
SHAP slundberg 7.2k A unified approach to explain the output of any machine learning model.
LIME marcotcr 6.8k Lime: explaining the predictions of any machine learning classifier
Tensorwatch microsoft 2.5k Debugging, monitoring and visualization for python machine learning and data science
eli5 TeamHG-Memex 1.8k A library for debugging/inspecting machine learning classifiers and explaining their predictions
PDPBox SauceCat 382 Python partial dependence plot toolbox

Visualization

name owner stars description
Dash plotly 10.8k Analytical web apps for python & r. no javascript required.
prettymaps marceloprates 7.2k A small set of python functions to draw pretty maps from openstreetmap data. based on osmnx, matplotlib and shapely libraries.
Seaborn mwaskom 6.6k Statistical data visualization using matplotlib
Plotly plotly 5.8k An open-source, interactive graphing library for python (includes plotly express) ✨
streamlit streamlit 5.4k Streamlit — the fastest way to build custom ml tools
folium python-visualization 4.3k Python data. leaflet.js maps.
altair altair-viz 4.3k Declarative statistical visualization library for python
dash sample apps plotly 2k Open-source demos hosted on dash gallery
scikit-plot reiinakano 1.7k An intuitive library to add plotting functionality to scikit-learn objects.
CNN-Visualizer poloclub 1.7k Learning convolutional neural networks with interactive visualization. https://poloclub.github.io/cnn-explainer/

Development Toolkit

name owner stars description
free-apis public-apis 65.9k A collective list of free apis for use in software and web development.
bash-bible dylanaraps 23.3k 📖 a collection of pure bash alternatives to external processes.
python-fire google 15.8k Python fire is a library for automatically generating command line interfaces (clis) from absolutely any python object.
black psf 13.7k The uncompromising python code formatter
PySnooper cool-RR 12.9k Never use print for debugging again
poetry sdispater 7.1k Python dependency management and packaging made easy.
free api fangzesheng 6.5k A collection of free API services, acting as a porter of APIs
fastapi tiangolo 6.5k Fastapi framework, high performance, easy to learn, fast to code, ready for production
playwright-python microsoft 4.9k Python version of the playwright testing and automation library.
hypothesis HypothesisWorks 4k Hypothesis is a powerful, flexible, and easy to use library for property-based testing.
modin modin-project 3.6k Modin: speed up your pandas workflows by changing a single line of code
pyautogui asweigart 3.2k A cross-platform gui automation python module for human beings. used to programmatically control the mouse & keyboard.
jupytext mwouts 3k Jupyter notebooks as markdown documents, julia, python or r scripts
papermill nteract 2.7k 📚 parameterize, execute, and analyze notebooks
handcalcs connorferster 2.3k Python library for converting python calculations into rendered latex.
lark lark-parser 2k Lark is a parsing toolkit for python, built with a focus on ergonomics, performance and modularity.
sqlfluff sqlfluff 1.8k A sql linter and auto-formatter for humans
handout danijar 1.8k Turn python scripts into handouts with markdown and figures
urwid urwid 1.7k Console user interface library for python (official repo)
more-itertools more-itertools 1.5k More routines for operating on iterables, beyond itertools
xarray pydata 1.5k N-d labeled arrays and datasets in python
icecream - debugging gruns 1.4k 🍦 sweet and creamy print debugging.
pygooglenews kotartemiy 816 If google news had a python library
bottleneck pydata 540 Fast numpy array functions written in c
wily tonybaloney 445 A python application for tracking, reporting on timing and complexity in python code

Tutorial

name owner stars description
Python 100 days jackfrued 70.8k Python - from novice to master in 100 days
Command line tutorial in one page jlevy 66.5k Master the command line, in one page
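Deep Learning 500 Questions scutan90 35.3k Deep learning in 500 questions: common topics in probability, linear algebra, machine learning, deep learning, computer vision and more are explained in question-and-answer form, to help the author and any readers who need it. The book has 18 chapters and more than 500,000 characters. Given the author's limited expertise, readers are kindly asked to point out any errors. To be continued... For collaboration, contact scutjy2015@163.com. All rights reserved; infringement will be prosecuted. Tan, 2018.06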
Learn Regex ziishaned 31.9k Learn regex the easy way
500 lines or less aosabook 23.9k 500 lines or less
Data Science Tutorial notebook donnemartin 17.7k Data science python notebooks: deep learning (tensorflow, theano, caffe, keras), scikit-learn, kaggle, big data (spark, hadoop mapreduce, hdfs), matplotlib, pandas, numpy, scipy, python essentials, aws, and various command lines.
Awesome tensorflow jtoy 15.3k Tensorflow - a curated list of dedicated resources http://tensorflow.org
NLP progress sebastianruder 13k Repository to track the progress in natural language processing (nlp), including the datasets and the current state-of-the-art for the most common nlp tasks.
Neural Networks and Deep Learning - Xipeng Qiu nndl 12k "Neural Networks and Deep Learning" by Xipeng Qiu (邱锡鹏)
wtfpython-cn leisurelicht 9.2k Chinese translation of wtfpython (work complete). My ability is limited, so help improving the translation is welcome
object-detection-papers hoya012 8.1k A paper list of object detection using deep learning.
MLAlgorithms rushter 7.8k Minimal and clean examples of machine learning algorithms implementations
numpy-ml ddbourgin 7.8k Machine learning, in numpy
Reinforcement-learning-introduction ShangtongZhang 7.7k Python implementation of reinforcement learning: an introduction
deep learning drizzle kmario23 7.1k Drench yourself in deep learning, reinforcement learning, machine learning, computer vision, and nlp by learning from these exciting lectures!!
Google Research google-research 6k Google ai research
GNN Papers thunlp 5.7k Must-read papers on graph neural networks (gnn)
minGPT karpathy 5.3k A minimal pytorch re-implementation of the openai gpt (generative pretrained transformer) training
UGATIT taki0112 4.4k Official tensorflow implementation of u-gat-it: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation
tensorflow2_tutorials_chinese czy36mengfei 4k TensorFlow 2 tutorials in Chinese, continuously updated (current version: TensorFlow 2.0), tag: tensorflow 2.0 tutorials
Tensorflow-2.x-Tutorials dragen1860 3.7k Tensorflow 2.x version's tutorials and examples, including cnn, rnn, gan, auto-encoders, fasterrcnn, gpt, bert examples, etc. Introductory example code and hands-on tutorials for TF 2.0.
Machine Learning Notes from Prof. Yida Xu roboticcam 3.5k My continuously updated machine learning, probabilistic models and deep learning notes and demos (1500+ slides), with video links
Awesome graph classification benedekrozemberczki 2.5k A collection of important graph embedding, classification and representation learning papers with implementations.
NLP-Beginner FudanNLP 2.3k NLP beginner tutorial
openNRE thunlp 2k An open-source package for neural relation extraction (nre)
Microsoft NLP examples microsoft 1.9k Natural language processing best practices & examples
anomaly detection yzhao062 1.9k Anomaly detection related books, papers, videos, and toolboxes
Stanford Natural Language Understanding Course cgpotts 727 Code for stanford cs224u
Generative Models in TF2 timsainb 690 Implementations of a number of generative models in tensorflow 2. gan, vae, seq2seq, vaegan, gaia, spectrogram inversion. everything is self contained in a jupyter notebook for easy export to colab.
Dimensional reduction algos heucoder 553 Python implementations of dimensionality reduction algorithms such as PCA, LDA, MDS, LLE and t-SNE
Generative Deep Learning davidADSP 491 The official code repository for examples in the o'reilly book 'generative deep learning'
reinforcement learning dalmia 417 Notes for the reinforcement learning course by david silver along with implementation of various algorithms.
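Keras Text classification yongzhuo 277 Chinese long-text classification, short-sentence classification, multi-label classification, and two-sentence similarity (chinese text classification of keras nlp, multi-label classify, or sentence classify, long or short); base classes for building character, word and sentence embedding layers (embeddings) and network layers (graph); fasttext, textcnn, charcnn, textrnn, rcnn, dcnn, dpcnn, vdcnn, crnn, bert, xlnet, albert, attention, deepmoji, han, capsule network (capsulenet), transformer-encode, seq2seq, ent, dmn,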
Graph neural network implementation by Microsoft microsoft 161 Tensorflow implementations of graph neural networks

Fun Stuff

name owner stars description
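funNLP fighting41love 14.9k Chinese and English sensitive words, language detection, domestic and international mobile/phone number location and carrier lookup, gender inference from names, mobile number extraction, ID card number extraction, email extraction, Chinese and Japanese name databases, Chinese abbreviation database, character decomposition dictionary, word sentiment values, stop words, reactionary word list, violence and terrorism word list, traditional/simplified Chinese conversion, English imitation of Chinese pronunciation, Wang Feng lyrics generator, job title lexicon, synonym lexicon, antonym lexicon, negation word lexicon, car brand lexicon, car parts lexicon, segmentation of continuous English text, various Chinese word vectors, company name collection, classical poetry database, IT lexicon, finance lexicon, idiom lexicon, place name lexicon, historical figures lexicon, poetry lexicon, medical lexicon, food and drink lexicon, legal lexicon, automobile lexicon, animal lexicon, Chinese chat corpus, Chinese rumor data, Baidu Chinese question-answering dataset, collection of sentence similarity matching algorithms, BERT resources, text generation and summarization tools, cocoNLP information extraction tool, regex matching for Chinese phone numbers, Tsinghua University XLORE: Chinese-English cross-lingual encyclopedic knowledge graph, Tsinghua University artificial intelligence technology…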
tiler nuno-faria 3.8k 👷 build images with images
Hacking neural nets Kayzaks 1.9k A small course on exploiting and defending neural networks
KnockKnock huggingface 1.7k 🚪✊knock knock: get notified when your training ends with only two additional lines of code
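break-captcha zhaipro 1.4k Automatic recognition of 12306 CAPTCHAs using machine learning algorithms
GNE kingname 908 General-purpose extractor for the body text of news web pages, alpha version.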
pyforest 8080labs 689 Pyforest - feel the bliss of automated imports

Trading

name owner stars description
zipline quantopian 11.2k Zipline, a pythonic algorithmic trading library
tensortrade tensortrade-org 2k An open source reinforcement learning framework for training, evaluating, and deploying robust trading agents.
mlfinlab hudson-and-thames 1.3k Mlfinlab helps portfolio managers and traders who want to leverage the power of machine learning by providing reproducible, interpretable, and easy to use tools.
tf-quant google 773 High-performance tensorflow library for quantitative finance.

Contribution Guide

Add your favourite packages to package.json, and run package_info.py to update the page :)
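
The layout of package.json and the internals of package_info.py are not shown on this page, so the following is only a rough sketch of what such an update step might look like. It assumes a hypothetical schema (section title mapped to a list of entries with `name`, `owner`, `stars` and `description`), uses illustrative star counts, and renders each section sorted by stars, which matches the ordering of the tables above. The function names are made up for illustration.

```python
import json

# Hypothetical package.json content: section title -> list of entries.
# The real schema is not documented here; star counts are illustrative.
SAMPLE_JSON = """
{
  "Time Series": [
    {"name": "stumpy", "owner": "TDAmeritrade", "stars": 879,
     "description": "Time series data mining"},
    {"name": "AnomalyDetection", "owner": "twitter", "stars": 3100,
     "description": "Anomaly detection with R"}
  ]
}
"""


def format_stars(count):
    """Abbreviate a star count the way the tables show it (879, 2k, 3.1k)."""
    if count < 1000:
        return str(count)
    text = f"{count / 1000:.1f}"
    if text.endswith(".0"):
        text = text[:-2]  # show 2k rather than 2.0k
    return text + "k"


def render_section(title, entries):
    """Render one section as a Markdown table, most-starred entries first."""
    rows = sorted(entries, key=lambda entry: entry["stars"], reverse=True)
    lines = [
        f"## {title}",
        "",
        "| name | owner | stars | description |",
        "| --- | --- | --- | --- |",
    ]
    for entry in rows:
        lines.append(
            f"| {entry['name']} | {entry['owner']} | "
            f"{format_stars(entry['stars'])} | {entry['description']} |"
        )
    return "\n".join(lines)


def render_page(sections):
    """Join all rendered sections into one page body."""
    return "\n\n".join(
        render_section(title, entries) for title, entries in sections.items()
    )


if __name__ == "__main__":
    print(render_page(json.loads(SAMPLE_JSON)))
```

A real script would presumably also fetch the current star count and description for each repository instead of reading them from the file; that part is left out because it depends on details this page does not give.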
