Data Management Challenges in Machine Learning (Fall 2018)

Motivation

Big data processing poses many challenges, which are often characterized by the three V's (volume, velocity, and variety). At the same time, machine learning is increasingly used by all kinds of data-driven applications. This course explores the interactions between these two exciting fields. This blog post provides one perspective on such interactions.

Topics

To explore both directions of this interaction, the course is divided into two parts:

  1. Utilizing machine learning technologies to solve hard data management challenges, such as data cleaning
  2. Utilizing data management technologies to solve hard machine learning challenges, such as data representation and training data curation

Objectives

The course covers a wide range of modern challenges and sub-topics in both data management and machine learning. Students will become familiar with these sub-topics and gain a deep understanding of one of them through presentations and course projects.

Furthermore, since this is a graduate seminar, another important objective is to train students in the basic skills of a researcher. The course will create a number of opportunities for students to learn how to read a paper, how to write a paper review, how to give a good research talk, and how to ask questions during a talk.

Logistics

Pre-requisites

Grading (Subject to change)

Schedule (Subject to change)

TBD
Date Topic Content Presenter
08/21 Course Objective Course Introduction and Logistics Xu Chu [slides]
08/21 Data Cleaning and ML Introduction to Part 1 Xu Chu [slides]
08/23 Data Exploration and ML Introduction to Part 2 Xu Chu [slides]
08/23 Systems and ML Introduction to Part 3 Xu Chu [slides]
Part 1: Data Cleaning and ML
TBD ML for Data Cleaning SLiMFast: Guaranteed Results for Data Fusion and Source Reliability
HoloClean: Holistic Data Repairs with Probabilistic Inference
Student [slides]
Student [slides]
TBD Statistical Distortion: Consequences of Data Cleaning
Student [slides]
Student [slides]
Optional reading Data Cleaning is a ML Problem that Needs Data Systems Help N.A.
TBD Data Cleaning for ML A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data
ActiveClean: Interactive Data Cleaning For Statistical Modeling
Student [slides]
Student [slides]
TBD Cleaning Crowdsourced Labels Using Oracles For Supervised Learning
BoostClean: Automated Error Detection and Repair for Machine Learning
Student [slides]
Student [slides]
Optional reading Impacts of Dirty Data: an Experimental Evaluation N.A.
TBD ML for Data Deduplication Interactive Deduplication using Active Learning
On active learning of record matching packages.
Student [slides]
Student [slides]
TBD CrowdER: crowdsourced entity resolution
Distributed Data Deduplication
Student [slides]
Student [slides]
Optional reading Duplicate Record Detection: A Survey N.A.
TBD Data Wrangling/Transformation Potter’s Wheel: An Interactive Data Cleaning System
Transform-Data-by-Example (TDE): Extensible Data Transformation using Functions
Student [slides]
Student [slides]
TBD Training Data Enrichment Snorkel: Rapid Training Data Creation with Weak Supervision
Combining Labeled and Unlabeled Data with Co-Training
Student [slides]
Student [slides]
Part 2: Data Exploration and ML
TBD Relational Data Profiling TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies
FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances
Student [slides]
Student [slides]
TBD Discovering Denial Constraints
Efficient Denial Constraint Discovery with Hydra
Student [slides]
Student [slides]
Optional reading Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms
Profiling Relational Data – A Survey
N.A.
TBD Model Interpretation “Why Should I Trust You?” Explaining the Predictions of Any Classifier
Anchors: High-Precision Model-Agnostic Explanations
Student [slides]
Student [slides]
TBD A Unified Approach to Interpreting Model Predictions
Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation.
Student [slides]
Student [slides]
Optional reading Interpretable ML Symposium
Interpretable ML by H2O
N.A.
TBD Visualization and Interpretation Visual Exploration of Machine Learning Results using Data Cube Analysis
ACTIVIS: Visual Exploration of Industry-Scale Deep Neural Network Models
Student [slides]
Student [slides]
Optional reading Recent progress and trends in predictive visual analytics N.A.
TBD Feature Engineering and Selection Deep Feature Synthesis: Towards Automating Data Science Endeavors
ExploreKit: Automatic Feature Generation and Selection
Student [slides]
Student [slides]
TBD One button machine for automating feature engineering in relational databases
XGBoost: A Scalable Tree Boosting System
Student [slides]
Student [slides]
Optional reading Discover Feature Engineering, How to Engineer Features and How to Get Good at It
An Introduction to Variable and Feature Selection
N.A.
Part 3: Systems and ML
TBD Managing ML Pipeline TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
ProvDB: A System for Lifecycle Management of Collaborative Analysis Workflows
Student [slides]
Student [slides]
TBD MODELDB: A System for Machine Learning Model Management
Towards Unified Data and Lifecycle Management for Deep Learning
Student [slides]
Student [slides]
TBD Streaming ML Systems Online Machine Learning in Big Data Streams
MacroBase: Prioritizing Attention in Fast Data
Student [slides]
Student [slides]
TBD Overcoming catastrophic forgetting in neural networks
Measuring Catastrophic Forgetting in Neural Networks
Student [slides]
Student [slides]
Course Project Presentations
TBD Final Project TBD Student [slides]

  © Xu Chu 2018