Cloud Computing CS 15-319
Apache Mahout
Feb 13, 2012
Shannon Quinn

MapReduce Review
- Scalable programming model
  - Map phase
  - Shuffle
  - Reduce phase
- MapReduce implementations
  - Google
  - Hadoop (this is our focus!)
[Figure from lecture 6: MapReduce. Map phase: chunks C0-C3, mappers M0-M3, IO0-IO3. Shuffling data. Reduce phase: reducers R0-R1, FO0-FO1.]
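
To make the three phases concrete, here is a minimal in-memory sketch of map, shuffle, and reduce using word counting. It is not Hadoop code: the function names and the sample chunks are invented for illustration, and everything runs in a single process.

```python
from collections import defaultdict

def map_phase(chunk):
    """Mapper: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Shuffle: group all intermediate values by key across mappers."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: combine every value that shares a key."""
    return key, sum(values)

def run_job(chunks):
    intermediate = [map_phase(c) for c in chunks]  # one mapper call per chunk
    grouped = shuffle(intermediate)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical input chunks.
chunks = ["map shuffle reduce", "map reduce", "map"]
word_counts = run_job(chunks)
```

In a real cluster the mapper calls run in parallel on separate machines and the shuffle moves data over the network; the sketch only shows the data flow.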

Apache Mahout
- A scalable machine learning library
- Built on Hadoop
- Philosophy of Mahout (and Hadoop by proxy)

What does Mahout do?

Recommendation

Classification

Clustering

Other Mahout Algorithms
- Dimensionality Reduction
- Regression
- Evolutionary Algorithms

Mahout
1. Recommendation
2. Classification
3. Clustering

Recommendation Overview
- Help users find items they might like based on historical preferences

Recommendation Overview Mathematically

Recommendation Overview
Ratings matrix (rows are users, columns are items; "?" is the rating to predict):

Alice   5    1    4
Bob     ?    2    5
Peter   4    3    2

Predicted rating for Bob's missing entry: 1.5

*based on example by Sebastian Schelter
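
One way to arrive at the 1.5 shown for Bob is item-based collaborative filtering. The sketch below is an assumption about the method, not something the slides state: it uses Pearson correlation between item columns and a similarity-weighted average normalized by the absolute similarities. The item names are placeholders, since the slide's column labels were not preserved.

```python
from math import sqrt

# Ratings from the slide; None marks Bob's unknown rating.
# Item names are placeholders: the original column labels are not available.
ratings = {
    "Alice": {"item_a": 5, "item_b": 1, "item_c": 4},
    "Bob":   {"item_a": None, "item_b": 2, "item_c": 5},
    "Peter": {"item_a": 4, "item_b": 3, "item_c": 2},
}

def pearson(item_x, item_y):
    """Pearson correlation over the users who rated both items."""
    pairs = [(r[item_x], r[item_y]) for r in ratings.values()
             if r[item_x] is not None and r[item_y] is not None]
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    norm_x = sqrt(sum((x - mean_x) ** 2 for x, _ in pairs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for _, y in pairs))
    return cov / (norm_x * norm_y)

def predict(user, target):
    """Similarity-weighted average of the user's known ratings."""
    known = {i: r for i, r in ratings[user].items()
             if i != target and r is not None}
    sims = {i: pearson(target, i) for i in known}
    weighted = sum(sims[i] * known[i] for i in known)
    return weighted / sum(abs(s) for s in sims.values())

prediction = predict("Bob", "item_a")
```

With this data the first item correlates at -1 with the second and +1 with the third, so the prediction is (-1 * 2 + 1 * 5) / 2 = 1.5, matching the slide up to floating-point rounding.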

Recommendation in Mahout
- 1st Map phase: process input
- 1st Reduce phase: list by user
- 2nd Map phase: emit co-occurred ratings
- 2nd Reduce phase: compute similarities
*based on example by Sebastian Schelter
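
The two MapReduce passes can be sketched in plain Python. This is an illustration under assumptions, not Mahout's implementation: the "user,item,rating" input format, the helper names, and the choice of Pearson correlation as the similarity measure are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical input format: one "user,item,rating" record per line.
lines = ["alice,a,5", "alice,b,1", "alice,c,4",
         "bob,b,2", "bob,c,5",
         "peter,a,4", "peter,b,3", "peter,c,2"]

def group_by_key(pairs):
    """Stand-in for the shuffle: collect values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# 1st map phase: process input into (user, (item, rating)) pairs.
def map_parse(line):
    user, item, rating = line.split(",")
    return user, (item, float(rating))

# 1st reduce phase: list by user (the grouping itself builds the list).
by_user = group_by_key(map_parse(line) for line in lines)

# 2nd map phase: emit co-occurred ratings for each item pair a user rated.
def map_cooccurrences(item_ratings):
    for (item_x, r_x), (item_y, r_y) in combinations(sorted(item_ratings), 2):
        yield (item_x, item_y), (r_x, r_y)

co_rated = group_by_key(
    pair
    for item_ratings in by_user.values()
    for pair in map_cooccurrences(item_ratings)
)

# 2nd reduce phase: compute one similarity per item pair (Pearson, assumed).
def reduce_similarity(rating_pairs):
    n = len(rating_pairs)
    mean_x = sum(x for x, _ in rating_pairs) / n
    mean_y = sum(y for _, y in rating_pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rating_pairs)
    norm = sqrt(sum((x - mean_x) ** 2 for x, _ in rating_pairs)
                * sum((y - mean_y) ** 2 for _, y in rating_pairs))
    return cov / norm if norm else 0.0

similarities = {pair: reduce_similarity(v) for pair, v in co_rated.items()}
```

Keying the second pass on the item pair means each reducer sees every co-occurring rating for one pair and nothing else, which is what lets the similarity computation spread across reducers.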

Mahout
1. Recommendation
2. Classification
3. Clustering

Classification Overview
- Assigning data to discrete categories
- Train a model on labeled data (spam / not spam)
- Run the model on new, unlabeled data (spam?)

Naïve Bayes Example
Prob(token | label) =

Naïve Bayes Example
- Not spam: President Obama's Nobel Prize Speech
- Spam: spam email content

Naïve Bayes Example
Order a trial Adobe chicken daily EAB-List new summer savings, welcome!
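
A minimal sketch of the idea behind the example: estimate Prob(token | label) from token counts in labeled text, then score a new message under each label. The training sentences are invented for illustration, and add-one smoothing is an assumption; the slide's own formula was not preserved.

```python
from collections import Counter
from math import log

def train(labeled_docs):
    """Count tokens per label and documents per label."""
    token_counts = {}
    doc_counts = Counter()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = {t for counts in token_counts.values() for t in counts}
    return token_counts, doc_counts, vocabulary

def log_prob_token(token, label, token_counts, vocabulary):
    """log Prob(token | label) with add-one smoothing (an assumption)."""
    counts = token_counts[label]
    return log((counts[token] + 1) / (sum(counts.values()) + len(vocabulary)))

def classify(text, model):
    token_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in token_counts:
        score = log(doc_counts[label] / total_docs)  # log prior
        for token in text.lower().split():
            if token in vocabulary:  # tokens never seen in training are skipped
                score += log_prob_token(token, label, token_counts, vocabulary)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training set, invented for illustration only.
model = train([
    ("spam", "order a trial now summer savings"),
    ("spam", "new savings order today"),
    ("not spam", "the president gave a speech"),
    ("not spam", "peace prize speech today"),
])
label = classify("order new summer savings", model)
```

Working in log space avoids underflow when many small probabilities are multiplied together.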

Naïve Bayes in Mahout
Complex!
Training
1. Read the features
2. Calculate per-document statistics
3. Normalize across categories
4. Calculate normalizing factor of each label
Testing
- Classification

Other Classification Algorithms
- Stochastic Gradient Descent
- Support Vector Machines
- Random Forests

Mahout
1. Recommendation
2. Classification
3. Clustering

Clustering Overview
- Grouping unstructured data
- Small intra-cluster distance
- Large inter-cluster distance
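
The two distance goals can be measured directly. The sketch below is one simple way to do it, assuming Euclidean distance and centroid-based definitions; the function names and sample points are hypothetical.

```python
from itertools import combinations
from math import dist

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def mean_intra_cluster_distance(cluster):
    """Average distance from each point to its own cluster's centroid."""
    center = centroid(cluster)
    return sum(dist(p, center) for p in cluster) / len(cluster)

def min_inter_cluster_distance(clusters):
    """Smallest distance between any two cluster centroids."""
    centers = [centroid(c) for c in clusters]
    return min(dist(a, b) for a, b in combinations(centers, 2))

# Two well-separated toy clusters.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
intra = [mean_intra_cluster_distance(c) for c in clusters]
inter = min_inter_cluster_distance(clusters)
```

A good clustering keeps the intra-cluster values small relative to the inter-cluster value.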

K-Means Clustering Example
- Dogs
- Cats

K-Means Clustering in Mahout
[Figure from lecture 6: MapReduce. Map phase: chunks C0-C3, mappers M0-M3, IO0-IO3. Shuffling data. Reduce phase: reducers R0-R1, FO0-FO1.]

K-Means Clustering in Mahout
- Assume: # clusters <<< # points
- Mappers M0-M3 emit <clusterid, observation> pairs to reducers R0-R1

K-Means Clustering in Mahout
- Map phase: assign cluster IDs
- Reduce phase: reset centroids
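
A single K-Means iteration in this map/reduce shape might look like the sketch below. It is an in-memory illustration, not Mahout's code: the function names are hypothetical, Euclidean distance is assumed, and the current centroids are simply passed to every mapper call, which is reasonable when there are far fewer clusters than points.

```python
from collections import defaultdict
from math import dist

def map_assign(point, centroids):
    """Map phase: emit <clusterid, observation> for the nearest centroid."""
    cluster_id = min(range(len(centroids)),
                     key=lambda i: dist(point, centroids[i]))
    return cluster_id, point

def reduce_centroid(points):
    """Reduce phase: reset a centroid to the mean of its assigned points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans_iteration(points, centroids):
    assigned = defaultdict(list)  # stands in for the shuffle
    for point in points:
        cluster_id, observation = map_assign(point, centroids)
        assigned[cluster_id].append(observation)
    # A centroid with no assigned points is left where it was.
    return [reduce_centroid(assigned[i]) if assigned[i] else centroids[i]
            for i in range(len(centroids))]

points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]
new_centroids = kmeans_iteration(points, [(0.0, 0.0), (10.0, 0.0)])
```

Each iteration is one MapReduce job; the centroids it produces become the input to the next.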

K-Means Clustering in Mahout
Important notes
- --maxiter
- --convergencedelta
- method
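
The first two options control when the iterations stop. The sketch below shows the general idea on one-dimensional points. The parameter names only echo the flags on the slide, and the stopping rule (stop once no centroid moves by more than the delta) is an assumption rather than Mahout's documented behavior.

```python
def kmeans_1d(points, centroids, max_iter=10, convergence_delta=0.01):
    """Iterate until centroids stop moving or max_iter is reached.

    The stopping rule (largest centroid shift <= convergence_delta)
    is an assumption made for this illustration.
    """
    for iteration in range(1, max_iter + 1):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute centroids; an empty cluster keeps its old centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        shift = max(abs(u - c) for u, c in zip(updated, centroids))
        centroids = updated
        if shift <= convergence_delta:
            break
    return centroids, iteration

data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, iterations_used = kmeans_1d(data, [0.0, 5.0])
```

A small delta or a large iteration cap trades extra MapReduce jobs for tighter centroids; a cap of 1 returns after a single pass regardless of how far the centroids moved.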

Other Clustering Algorithms
- Latent Dirichlet Allocation
  - Topic models
- Fuzzy K-Means
  - Points are assigned multiple clusters
- Canopy clustering
  - Fast approximations of clusters
- Spectral clustering
  - Treat points as a graph
  - K-Means & Eigencuts
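
To illustrate what "assigned multiple clusters" means for Fuzzy K-Means, the sketch below computes soft memberships for one point from its distances to each centroid. It uses the textbook fuzzy c-means weighting with fuzziness m; whether Mahout uses exactly this form is not confirmed by the slides, and the function name is hypothetical.

```python
def fuzzy_memberships(distances, m=2.0):
    """Membership of one point in each cluster, given centroid distances.

    Textbook fuzzy c-means weighting with fuzziness m > 1.
    Assumes every distance is strictly positive.
    """
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
            for d_i in distances]

# A point at distance 1 from the first centroid and 3 from the second.
memberships = fuzzy_memberships([1.0, 3.0])
```

The memberships sum to 1, and the closer centroid gets the larger share (about 0.9 versus 0.1 here), whereas plain K-Means would assign the point entirely to the first cluster.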

Mahout in Summary
- Scalable library
- Three primary areas of focus
- Other algorithms
- All in your friendly neighborhood MapReduce
http://mahout.apache.org/

Thank you!