About this Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Mahout


About this Tutorial

Apache Mahout is an open source project that is primarily used for producing scalable machine learning algorithms. This brief tutorial provides a quick introduction to Apache Mahout and explains how it can be applied to make recommendations and organize documents into more usable clusters.

Audience

This tutorial has been prepared for professionals aspiring to learn the basics of Mahout and develop applications involving machine learning techniques such as recommendation, classification, and clustering.

Prerequisites

Before you proceed with this tutorial, we assume that you have prior exposure to Core Java, Hadoop, and any of the Linux operating system flavors.

Copyright & Disclaimer

Copyright 2015 by Tutorials Point (I) Pvt. Ltd.

All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing, or republishing any contents or a part of the contents of this e-book in any manner without written consent of the publisher.

We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness, or completeness of our website or its contents, including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at contact@tutorialspoint.com

Table of Contents

About this Tutorial
Audience
Prerequisites
Copyright & Disclaimer
Table of Contents

1. MAHOUT INTRODUCTION
   What is Apache Mahout?
   Features of Mahout
   Applications of Mahout

2. MAHOUT MACHINE LEARNING
   What is Machine Learning?
   Supervised Learning
   Unsupervised Learning
   Recommendation
   Classification
   Clustering

3. MAHOUT ENVIRONMENT
   Pre-Installation Setup
   Installing Java
   Downloading Hadoop
   Installing Hadoop
   core-site.xml
   hdfs-site.xml
   yarn-site.xml
   mapred-site.xml
   Verifying Hadoop Installation
   Downloading Mahout
   Maven Repository

4. MAHOUT RECOMMENDATION
   Recommendation
   Mahout Recommender Engine
   Building a Recommender using Mahout

5. MAHOUT CLUSTERING
   Applications of Clustering
   Procedure of Clustering
   Clustering Algorithms

6. MAHOUT CLASSIFICATION
   What is Classification?
   How Classification Works
   Applications of Classification
   Naive Bayes Classifier
   Procedure of Classification

1. MAHOUT INTRODUCTION

We live in an age where information is available in abundance. The information overload has reached such heights that it is sometimes difficult to manage even our own mailboxes. Imagine the volume of data and records that popular websites (the likes of Facebook, Twitter, and YouTube) have to collect and manage on a daily basis. It is not uncommon even for lesser-known websites to receive huge amounts of information in bulk.

Normally we fall back on data mining algorithms to analyze bulk data, identify trends, and draw conclusions. However, a data mining algorithm running on a single machine can rarely process very large datasets and deliver results quickly; the computational tasks need to be run on multiple machines distributed over the cloud.

We now have frameworks that allow us to break down a computation task into multiple segments and run each segment on a different machine. Mahout is one such data mining framework; it normally runs on top of the Hadoop infrastructure to manage huge volumes of data.

What is Apache Mahout?

A mahout is one who drives an elephant as its master. The name comes from the project's close association with Apache Hadoop, which uses an elephant as its logo.

Hadoop is an open-source framework from Apache that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models.

Apache Mahout is an open source project that is primarily used for creating scalable machine learning algorithms. It implements popular machine learning techniques such as:

- Recommendation
- Classification
- Clustering

Apache Mahout started as a sub-project of Apache's Lucene in 2008. In 2010, Mahout became a top-level project of Apache.
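The idea from the chapter introduction, breaking a computation into segments that can each run on a different machine, can be illustrated with a small word count. This is only a sketch in plain Python that runs in a single process; it does not use Hadoop or Mahout, and the function names (`split_into_segments`, `map_segment`, `merge_counts`, `word_count`) and the sample lines are made up for illustration.

```python
from collections import Counter
from functools import reduce


def split_into_segments(records, n_segments):
    """Split the input into roughly equal segments, one per (hypothetical) worker."""
    return [records[i::n_segments] for i in range(n_segments)]


def map_segment(segment):
    """Map step: count the words in one segment, independently of all others."""
    counts = Counter()
    for line in segment:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(records, n_segments=3):
    segments = split_into_segments(records, n_segments)
    # On a real cluster, each map_segment call could run on a different machine.
    partials = [map_segment(segment) for segment in segments]
    return reduce(merge_counts, partials, Counter())


# Hypothetical input records.
lines = [
    "mahout runs on hadoop",
    "hadoop stores big data",
    "mahout scales with hadoop",
]
totals = word_count(lines)
print(totals["hadoop"], totals["mahout"])
```

Because each segment is counted without looking at the others, the final totals do not depend on how many segments the input is split into. That independence is what lets the work be spread across machines.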

Features of Mahout

The main features of Apache Mahout are listed below.

- The algorithms of Mahout are written on top of Hadoop, so it works well in a distributed environment. Mahout uses the Apache Hadoop library to scale effectively in the cloud.
- Mahout offers the coder a ready-to-use framework for doing data mining tasks on large volumes of data.
- Mahout lets applications analyze large sets of data effectively and quickly.
- Includes several MapReduce-enabled clustering implementations such as k-means, fuzzy k-means, Canopy, Dirichlet, and Mean-Shift.
- Supports Distributed Naive Bayes and Complementary Naive Bayes classification implementations.
- Comes with distributed fitness function capabilities for evolutionary programming.
- Includes matrix and vector libraries.

Applications of Mahout

- Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo! use Mahout internally.
- Foursquare helps you find places, food, and entertainment available in a particular area. It uses the recommender engine of Mahout.
- Twitter uses Mahout for user interest modelling.
- Yahoo! uses Mahout for pattern mining.

2. MAHOUT MACHINE LEARNING

Apache Mahout is a highly scalable machine learning library that enables developers to use optimized algorithms. Mahout implements popular machine learning techniques such as recommendation, classification, and clustering. Therefore, it is prudent to have a brief section on machine learning before we move further.

What is Machine Learning?

Machine learning is a branch of science that deals with programming systems in such a way that they automatically learn and improve with experience. Here, learning means recognizing and understanding the input data and making wise decisions based on the supplied data.

It is very difficult to hand-code the right decision for every possible input. To tackle this problem, algorithms are developed. These algorithms build knowledge from specific data and past experience using the principles of statistics, probability theory, logic, combinatorial optimization, search, reinforcement learning, and control theory.

The developed algorithms form the basis of various applications such as:

- Vision processing
- Language processing
- Forecasting (e.g., stock market trends)
- Pattern recognition
- Games
- Data mining
- Expert systems
- Robotics

Machine learning is a vast area and it is quite beyond the scope of this tutorial to cover all its features. There are several ways to implement machine learning techniques; the most commonly used ones are supervised and unsupervised learning.

Supervised Learning

Supervised learning deals with learning a function from available training data. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. Common examples of supervised learning include:

- classifying e-mails as spam,
- labeling webpages based on their content, and
- voice recognition.

There are many supervised learning algorithms, such as neural networks, Support Vector Machines (SVMs), and Naive Bayes classifiers. Mahout implements the Naive Bayes classifier.

Unsupervised Learning

Unsupervised learning makes sense of unlabeled data, without any labeled examples to train on. It is a powerful tool for analyzing available data and looking for patterns and trends, and it is most commonly used for clustering similar input into logical groups. Common approaches to unsupervised learning include:

- k-means,
- self-organizing maps, and
- hierarchical clustering.

Recommendation

Recommendation is a popular technique that provides close recommendations based on user information such as previous purchases, clicks, and ratings.

- Amazon uses this technique to display a list of recommended items that you might be interested in, drawing information from your past actions. Recommender engines work behind Amazon to capture user behavior and recommend selected items based on your earlier actions.
- Facebook uses the recommender technique to identify and recommend entries for the "people you may know" list.
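To make the recommendation idea concrete, the following is a minimal sketch of user-based collaborative filtering: items a user has not rated are ranked by the ratings of similar users. It is plain Python rather than Mahout's recommender engine, and the ratings data, the helper names (`cosine_similarity`, `recommend`), and the choice to compute similarity only over co-rated items are assumptions made for illustration.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"book": 5, "laptop": 3, "phone": 4},
    "bob": {"book": 5, "laptop": 3, "phone": 4, "camera": 5},
    "carol": {"book": 1, "laptop": 5, "tablet": 4},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=2):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda item: scores[item] / weights[item], reverse=True)
    return ranked[:top_n]


print(recommend("alice", ratings))  # ['camera', 'tablet']
```

Here bob's ratings match alice's exactly on the items they share, so his rating for the camera carries the most weight. Carol is less similar, so her tablet rating ranks second.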

Classification

Classification, also known as categorization, is a machine learning technique that uses known data to determine how new data should be classified into a set of existing categories. Classification is a form of supervised learning.

- Mail service providers such as Yahoo! and Gmail use this technique to decide whether a new mail should be classified as spam. The categorization algorithm trains itself by analyzing users' habits of marking certain mails as spam. Based on that, the classifier decides whether a future mail should be deposited in your inbox or in the spam folder.
- The iTunes application uses classification to prepare playlists.
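The spam example can be sketched with a tiny Naive Bayes classifier, the algorithm family the tutorial says Mahout implements. The code below is an illustrative plain-Python version with add-one smoothing, not Mahout's distributed implementation; the training mails, labels, and function names (`train`, `classify`) are hypothetical.

```python
from collections import Counter
from math import log

# Hypothetical training mails that users have already marked.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "inbox"),
    ("lunch meeting tomorrow", "inbox"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}
    label_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocabulary.add(word)
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]
            score += log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(training)
print(classify("cheap money", model))             # spam
print(classify("agenda for the meeting", model))  # inbox
```

The smoothing term keeps words that were never seen under a label (such as "for" and "the" above) from driving the probability to zero.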

Clustering

Clustering is used to form groups, or clusters, of similar data based on common characteristics. Clustering is a form of unsupervised learning.

- Search engines such as Google and Yahoo! use clustering techniques to group data with similar characteristics.
- Newsgroups use clustering techniques to group various articles based on related topics.

The clustering engine goes through the input data completely and, based on the characteristics of each item, decides which cluster it should be grouped under. Consider the following example.

Our library of tutorials contains topics on various subjects. When we receive a new tutorial at TutorialsPoint, it gets processed by a clustering engine that decides, based on its content, where it should be grouped.
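As a rough illustration of how a clustering engine groups items by their characteristics, here is a small k-means sketch. k-means is one of the algorithms listed among Mahout's features, but this is a single-machine, plain-Python version. The two-dimensional points stand in for whatever numeric features a real engine would extract from its input, and the initial centroids are chosen by hand to keep the run deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centroids, max_iterations=20):
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments have stabilised
        centroids = new_centroids
    return centroids, clusters


# Hypothetical feature vectors forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, initial_centroids=[points[0], points[3]])
print(clusters[0])
print(clusters[1])
```

No labels are supplied anywhere: the groups emerge only from the distances between points, which is what makes this unsupervised.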

3. MAHOUT ENVIRONMENT

This chapter teaches you how to set up Mahout. Java and Hadoop are the prerequisites of Mahout. The steps to download and install Java, Hadoop, and Mahout are given below.

Pre-Installation Setup

Before installing Hadoop in a Linux environment, we need to set up Linux using ssh (Secure Shell). Follow the steps mentioned below to set up the Linux environment.

Creating a User

It is recommended to create a separate user for Hadoop to isolate the Hadoop file system from the Unix file system. Follow the steps given below to create a user:

- Open root using the command su.
- Create a user from the root account using the command useradd username.
- Now you can open an existing user account using the command su username.

Open the Linux terminal and type the following commands to create a user.

    $ su
    password:
    # useradd hadoop
    # passwd hadoop
    New passwd:
    Retype new passwd

SSH Setup and Key Generation

SSH setup is required to perform different operations on a cluster, such as starting and stopping daemons and running distributed daemon shell operations. To authenticate different users of Hadoop, a public/private key pair must be provided for a Hadoop user and shared with the different users.
