Three Critical Steps to Improving Product Data Quality WHITE PAPER


SAS White Paper

Table of Contents

Introduction
Business Effects of Poor Product Data Quality
Unique Challenges of Product Data Quality
Improving Product Data Quality
    Categorization
    Standardization
        Brands
        Units
        Before and After Standardization
    Matching
        After Matching
Summary

Contributor: Jim Harris is a recognized data quality thought leader with 20 years of enterprise data management industry experience. Harris is an independent consultant, speaker and freelance writer for hire. Harris is the blogger-in-chief at Obsessive-Compulsive Data Quality, an independent blog offering a vendor-neutral perspective on data quality. Harris is the host of the popular podcast OCDQ Radio, and is very active on Twitter, where you can follow him @ocdqblog.

Introduction

Convincing your organization to view data as a strategic corporate asset and, by extension, data quality as a strategic corporate discipline can be challenging. The relationship between business processes and the data used and created by those processes is not always obvious and tangible. In other words, how does the organization's data affect its business decisions and its ability to succeed?

Because the strategic importance of one corporate asset (the products your organization sells) has never been in question, the data describing those products must be of sufficient quality to support optimal business performance.

Let's imagine you work for a company called Acme Foods and are making a presentation to executive management about the need for improvements in product data quality. You tell the eight executives in the room that each has on the table in front of him or her a different product from a current list of Acme Foods' top 100 best-selling products. The executives are confused because they all have the same kind of candy bar in front of them. Each has an attached card with a number and some text. You explain that the number is the sales rank and the text is the product description, which came directly from the Acme Foods master product catalog. They pass their candy bars around the room, pausing to read the attached cards. After a few minutes, you display the following chart:

SALES RANK  PRODUCT DESCRIPTION
1           E<3MC2 Bar Milk Chocolate Square Net Weight 3.5 oz.
5           Everybody Loves Milk Chocolate Squared NET WGT 99G Bar
8           Milk Chocolate Square that Everybody Loves in a 3.5 oz Bar
15          35oz box of e-heart milk chocolate squares chocolate candy bars
21          Square Bar (ELMC2) 99g of Milk Chocolatey Goodness
35          E-Heart Emoticon-Milk-Chocolate-Squared 3-and-1/2 ounce BAR
42          Square Chocolate E3MC2 Bar 99g (3.5z) Milk Chocolate
55          Milk Chocolate Squares Everyone Loves 10 1.5 OZ Squares (15 oz BAG)

You point out a few of the obvious product data quality issues:

• Numerous variations in the official brand name (E<3MC2), which stands for Everybody Loves Milk Chocolate Squared.
• Six records describe one product (excluding products 15 and 55), meaning six of the top 100 best-sellers are the same product.
• Product 15 is not a duplicate because of a different unit count based on packaging, and Product 55 is not a duplicate because of a different unit size (i.e., it is a bag of 10 smaller chocolate squares instead of one larger chocolate square candy bar).

Business Effects of Poor Product Data Quality

Confronted with a tangible demonstration of the product data quality issues plaguing Acme Foods, the executives discuss some of the business effects:

• Sales Forecasting: Incorrect sales numbers negatively affect the ability to predict sales trends and plan future product marketing and promotions.
• Spending Analysis: Incorrect sales numbers negatively affect the procurement planning for purchasing the raw materials to make the products.
• Supply Chain Optimization: Incorrect procurement levels trigger manufacturing disruptions and inefficiencies throughout the supply chain.
• Inventory Management: Incorrect inventory levels cause order fulfillment delays in distribution channels, leading to delayed revenues or lost sales.

In more general terms, the result is poor product data quality that:

• Increases costs.
• Decreases revenue.
• Increases risks.
• Disrupts daily operations.
• Causes bad tactical business decisions.
• Undermines strategic corporate planning.

Even though Acme Foods prides itself on excellent business process management as well as hiring, investing in great people and implementing the latest technology, none of these best practices can save it from poor-quality data. Data must be viewed as a strategic corporate asset and data quality as a strategic corporate discipline because high-quality data serves as a solid foundation for success, enabling better business decisions and optimal business performance.

Using Acme Foods as a fictional case study, this white paper will describe a general approach for planning your organization's efforts to improve product data quality. It will provide a data-example-driven perspective of some of the unique challenges of product data quality, as well as discuss and demonstrate the three critical steps to improving product data quality.

Unique Challenges of Product Data Quality

Congratulations! The Acme Foods executives just approved a product data quality improvement project. Now what? How will you approach this daunting challenge?

Product data presents some unique challenges. The first is that "product" is a generic term that can mean many different things. For example, a product could refer to:

• Raw materials used to manufacture products (e.g., the cocoa beans that Acme Foods purchases as a raw material for manufacturing chocolate).
• Semifinished goods from an intermediate stage of product development (e.g., the couverture chocolate that Acme Foods uses to make candy bars).
• Finished goods, which may be a single product, a package containing several products, or multiple products within the same brand based on packaging variations in the unit size and unit type.

Other data domains, such as customer name and postal address, have a relatively small set of easily defined and recognized data attributes and data quality standards (but these are not always consistently enforced).

The complex product supply chain includes manufacturers, distributors, suppliers, wholesalers, retailers and other vendors. All of these organizations typically maintain their own product catalogs, often with inconsistent data quality standards.

There are some standards for product data quality, but they are not yet as widely adopted as standards for other data domains. Examples of these standards include:

• United Nations Standard Products and Services Code (UNSPSC): defines more than 20,000 categories of common commodities and services.
• Uniform Code Council (UCC): specializes in data standards for bar codes and electronic data interchange (EDI), primarily for North America.
• European Article Numbering (EAN): European standards similar to the UCC.
• EPCglobal: collectively established by the UCC and EAN to develop standards for electronic product codes (EPC) and radio-frequency identification (RFID).
• Universal Product Code (UPC): worldwide bar code standard for the electronic identification of containers, pallets, cases, products and SKUs.

These standards can assist with establishing consistent product descriptions and assigning unique product identifiers. However, these identifiers can suffer from the same data entry errors and data formatting variations as identifying attributes for other data domains. Also, these identifiers may not always be available and could be replaced with proprietary product identifiers, or even database surrogate keys.

Effectively implementing these or other product data standards often requires matching based on product description, which is usually unstructured, meaning that most product data attributes are buried within a free-form text field.

When you are creating your own product data standards or receiving third-party product data that follows a different standard (or none at all), recognizing and extracting product data attributes from a free-form text field will be your primary task.

Categorizing, standardizing and matching product descriptions are three fundamental challenges to overcome when improving product data quality. Data quality tools provide considerable assistance with these challenges. However, compared to other data domains, a product data quality project will typically require more customization of what the data quality tool provides out of the box. Most of the customization effort is teaching the tool how to understand what are essentially the vocabulary, spelling and grammar of the product data language.

Improving Product Data Quality

The three critical steps to improving product data quality are:

1. Categorization.
2. Standardization.
3. Matching.

The remainder of this paper will discuss and demonstrate these concepts from a data-example-driven perspective using the fictional products of Acme Foods.

Categorization

Determining the product category is an important first step because the category provides context for the product description, where the same words, abbreviations and symbols can mean something different within different product categories.

For example, consider the following product descriptions:

E<3MC2 Bar Milk Chocolate Square Net Weight 3.5 oz.
Two Chocolate Energy Drinks (15oz Cans) Extreme Cacao
Sugar Extreme Six Pack of Sugar Chewing Gum (Net Weight 7.41 oz.)
Sugar Water 12 fl. oz. Carbonated Soft Drink Can
Everybody Loves Milk Chocolate Squared NET WGT 99G Bar
6-PACK 24 Fl. Oz. Bottles Sugar Water Carbonated Soft Drink
Milk Chocolate Square that Everybody Loves in a 3.5 oz Bar
Extreme Cacao Chocolate Energy Drink 15 FL OZ Aluminum Can
Sugary Carbonated Water FL OZ 24 Plastic Bottle Soft Drink
35oz box of e-heart milk chocolate squares chocolate candy bars
Sugar Water Two Liter Plastic Bottle Carbonated Soft Drink
Square Bar (ELMC2) 99g of Milk Chocolatey Goodness
sugar water 12 floz fizzy drink CASE 24 aluminum cans (carbonated)
E-Heart Emoticon-Milk-Chocolate-Squared 3-and-1/2 ounce BAR
12-12 FL OZ Cans Sugar Water Carbonated Soft Drink
Square Chocolate E3MC2 Bar 99g (3.5z) Milk Chocolate
Carbonated Sugar Water Aluminum Can Six Pack Soda Pop 12 OZ (FL)
Milk Chocolate Squares Everyone Loves 10 1.5 OZ Squares (15 OZ BAG)
Eight Ounce Aluminum Cans Sugar Water (Six Pack) Carbonated Soda
Non-Sugar-Free Chewing Gum Net Weight 35 grams Sugary X-treme

Many large organizations have diverse product catalogs using a complex taxonomy or hierarchy of product categories, which are often managed by different groups of subject-matter experts (SMEs). Categories are sometimes keywords that are found within the product description, but most often the category must be extrapolated from a semantic understanding of the product description.

By determining the category for these product descriptions, we can begin to conquer the challenge of improving product data quality by using category as a filter to route records to category-specific standardization processes.

Data quality tools provide assistance by parsing the free-form product description to search for the key words, phrases and other logic needed for categorization. For simplicity, the data examples we are working with represent only two categories: candy and beverage. But simply categorizing all product descriptions containing the word "chocolate" as candy and "sugar" as beverage would incorrectly categorize both the Chocolate Energy Drink and the Sugar Chewing Gum.
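The pitfall described above can be made concrete with a small sketch. It compares the naive single-keyword rule with a rule that looks for product-type phrases first. The phrase lists and function names are hypothetical stand-ins for SME knowledge, not the logic of any particular data quality tool, and simple substring tests stand in for real natural language processing.

```python
# Hypothetical product-type phrases, as SMEs might supply them.
# Beverage phrases are checked first so that "Chocolate Energy Drink"
# is not captured by a candy rule.
CATEGORY_RULES = [
    ("beverage", ("energy drink", "soft drink", "fizzy drink", "soda")),
    ("candy", ("chewing gum", "candy bar", "bar", "squares")),
]


def naive_category(description):
    """The single-keyword rule the text warns against."""
    text = description.lower()
    if "chocolate" in text:
        return "candy"
    if "sugar" in text:
        return "beverage"
    return "unknown"


def rule_category(description):
    """Return the first category whose product-type phrase appears."""
    text = description.lower()
    for category, phrases in CATEGORY_RULES:
        if any(phrase in text for phrase in phrases):
            return category
    return "unknown"


energy_drink = "Extreme Cacao Chocolate Energy Drink 15 FL OZ Aluminum Can"
chewing_gum = "Sugar Extreme Six Pack of Sugar Chewing Gum (Net Weight 7.41 oz.)"

for description in (energy_drink, chewing_gum):
    print(naive_category(description), rule_category(description), description)
```

Ordering the rules by specificity is a design choice made only for this sketch; a real catalog with a deep category hierarchy would need far richer rules.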

Therefore, the automated categorization process provided by the data quality tool has to use natural language processing and represent the knowledge of data SMEs. The Acme Foods SMEs have helped us properly categorize the product descriptions:

CATEGORY  PRODUCT DESCRIPTION
Candy     E<3MC2 Bar Milk Chocolate Square Net Weight 3.5 oz.
Candy     Everybody Loves Milk Chocolate Squared NET WGT 99G Bar
Candy     Milk Chocolate Square that Everybody Loves in a 3.5 oz Bar
Candy     35oz box of e-heart milk chocolate squares chocolate candy bars
Candy     Square Bar (ELMC2) 99g of Milk Chocolatey Goodness
Candy     E-Heart Emoticon-Milk-Chocolate-Squared 3-and-1/2 ounce BAR
Candy     Square Chocolate E3MC2 Bar 99g (3.5z) Milk Chocolate
Candy     Milk Chocolate Squares Everyone Loves 10 1.5 OZ Squares (15 OZ BAG)
Candy     Non-Sugar-Free Chewing Gum Net Weight 35 grams Sugary X-treme
Candy     Sugar Extreme Six Pack of Sugar Chewing Gum (Net Weight 7.41 oz.)
Beverage  Eight Ounce Aluminum Cans Sugar Water (Six Pack) Carbonated Soda
Beverage  Sugar Water 12 fl. oz. Carbonated Soft Drink Can
Beverage  Carbonated Sugar Water Aluminum Can Six Pack Soda Pop 12 OZ (FL)
Beverage  12-12 FL OZ Cans Sugar Water Carbonated Soft Drink
Beverage  sugar water 12 floz fizzy drink CASE 24 aluminum cans (carbonated)
Beverage  Sugary Carbonated Water FL OZ 24 Plastic Bottle Soft Drink
Beverage  6-PACK 24 Fl. Oz. Bottles Sugar Water Carbonated Soft Drink
Beverage  Sugar Water Two Liter Plastic Bottle Carbonated Soft Drink
Beverage  Extreme Cacao Chocolate Energy Drink 15 FL OZ Aluminum Can
Beverage  Two Chocolate Energy Drinks (15oz Cans) Extreme Cacao

Please Note: It is a recommended best practice to design your categorization process as a separate function so that the technical processes are aligned naturally with the category-specific business rules provided by the product data SMEs.

Standardization

Free-form fields often contain numerous variations resulting from data-entry errors, different conventions for representing the same value and a general lack of data quality standards. Additional variations are introduced by multiple data sources, each with its own unique data characteristics and data quality challenges.

Standardization parses free-form fields to break them into smaller fields to gain improved visibility of the available input data, create a more consistent representation, apply standard values and, when possible, populate missing values.

It is important to note that sometimes what appear to be semantic inconsistencies in product data are intentional variations to accommodate such aspects as regional and linguistic differences, as well as special promotions.

The standardization process should be designed to be as modular as possible to support a plug-and-play approach for various components, similar to the recommendation that categorization and standardization should be separate processes.

Data quality is determined by evaluating the data's fitness for its designated purpose. However, in the vast majority of cases, data has multiple business uses, and data of sufficient quality for one use may not be sufficient for other business uses. When the standardization process has a flexible architecture, it is easier to convert to various product data standards and support a wider range of business purposes.

Most of the product attributes in our data examples, such as unit count, unit size, unit measure and unit type, are stored within the overloaded description field. Even when the data source contains these attributes as separate fields, they can be sparsely populated or contain defaults or other values conflicting with the content of the product description field.

Our product data standardization process uses the following fields:

• Brand: the brand name of the Acme Foods product.
• Unit Count: the number of units in the packaged product.
• Unit Size: the number associated with the unit of measurement.
• Unit Measure: the unit of measurement for the product.
• Unit Type: the packaging type of the product.
• Product Description: the remaining description not covered by the above fields.

Please Note: Many additional fields are commonly created when standardizing product data, especially to facilitate improved matching, but this paper focuses on the above fields for the purposes of demonstrating standardization concepts.

Brands

Let's begin by focusing on only the products in the candy category:

CATEGORY  PRODUCT DESCRIPTION
Candy     E<3MC2 Bar Milk Chocolate Square Net Weight 3.5 oz.
Candy     Everybody Loves Milk Chocolate Squared NET WGT 99G Bar
Candy     Milk Chocolate Square that Everybody Loves in a 3.5 oz Bar
Candy     35oz box of e-heart milk chocolate squares chocolate candy bars
Candy     Square Bar (ELMC2) 99g of Milk Chocolatey Goodness
Candy     E-Heart Emoticon-Milk-Chocolate-Squared 3-and-1/2 ounce BAR
Candy     Square Chocolate E3MC2 Bar 99g (3.5z) Milk Chocolate
Candy     Milk Chocolate Squares Everyone Loves 10 1.5 OZ Squares (15 OZ BAG)
Candy     Non-Sugar-Free Chewing Gum Net Weight 35 grams Sugary X-treme
Candy     Sugar Extreme Six Pack of Sugar Chewing Gum (Net Weight 7.41 oz.)

Our candy SMEs have identified the portions of each product description that belong in the new brand field we are creating in this two-step process.

The first step is to separate the brand name content from the product description:

CATEGORY  BRAND                                       PRODUCT DESCRIPTION
Candy     E<3MC2                                      Bar Milk Chocolate Square Net Weight 3.5 oz.
Candy     Everybody Loves Milk Chocolate Squared      NET WGT 99G Bar
Candy     Milk Chocolate Square that Everybody Loves  in a 3.5 oz Bar
Candy     e-heart milk chocolate squares              35oz box of chocolate candy bars
Candy     (ELMC2)                                     Square Bar 99g of Milk Chocolatey Goodness
Candy     E-Heart Emoticon-Milk-Chocolate-Squared     3-and-1/2 ounce BAR
Candy     E3MC2                                       Square Chocolate Bar 99g (3.5z) Milk Chocolate
Candy     Milk Chocolate Squares Everyone Loves       10 1.5 OZ Squares (15 OZ BAG)
Candy     Sugary X-treme                              Non-Sugar-Free Chewing Gum Net Weight 35 grams
Candy     Sugar Extreme                               Six Pack of Sugar Chewing Gum (Net Weight 7.41 oz.)

The second step is to standardize the representation of the brand names:

CATEGORY  BRAND          PRODUCT DESCRIPTION
Candy     E<3MC2         Bar Milk Chocolate Square Net Weight 3.5 oz.
Candy     E<3MC2         NET WGT 99G Bar
Candy     E<3MC2         in a 3.5 oz Bar
Candy     E<3MC2         35oz box of chocolate candy bars
Candy     E<3MC2         Square Bar 99g of Milk Chocolatey Goodness
Candy     E<3MC2         3-and-1/2 ounce BAR
Candy     E<3MC2         Square Chocolate Bar 99g (3.5z) Milk Chocolate
Candy     E<3MC2         10 1.5 OZ Squares (15 OZ BAG)
Candy     Sugar Extreme  Non-Sugar-Free Chewing Gum Net Weight 35 grams
Candy     Sugar Extreme  Six Pack of Sugar Chewing Gum (Net Weight 7.41 oz.)

Please Note: Implement these steps separately to make it easier to apply different standards when appropriate (e.g., using regional brand names in a local language).
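The two brand steps can be sketched as two separate functions, which keeps them independently replaceable as the note above recommends. This is a minimal illustration that assumes the SMEs maintain a table of known brand variants; the table, the function names and the longest-match-first strategy are assumptions made for this example, not a description of how any data quality tool works.

```python
# Hypothetical variant table keyed by the standard brand name.
BRAND_VARIANTS = {
    "E<3MC2": [
        "E<3MC2",
        "E3MC2",
        "(ELMC2)",
        "Everybody Loves Milk Chocolate Squared",
        "Milk Chocolate Square that Everybody Loves",
        "e-heart milk chocolate squares",
        "E-Heart Emoticon-Milk-Chocolate-Squared",
        "Milk Chocolate Squares Everyone Loves",
    ],
    "Sugar Extreme": ["Sugar Extreme", "Sugary X-treme"],
}


def split_brand(description):
    """Step 1: separate the raw brand text from the description.

    Longer variants are tried first so that a long brand phrase wins
    over a shorter variant contained in the same text.
    """
    lowered = description.lower()
    variants = [v for group in BRAND_VARIANTS.values() for v in group]
    for variant in sorted(variants, key=len, reverse=True):
        start = lowered.find(variant.lower())
        if start != -1:
            end = start + len(variant)
            rest = description[:start] + " " + description[end:]
            return description[start:end], " ".join(rest.split())
    return "", description


def standardize_brand(raw_brand):
    """Step 2: map the raw brand text onto its standard representation."""
    key = raw_brand.lower()
    for standard, variants in BRAND_VARIANTS.items():
        if key in (variant.lower() for variant in variants):
            return standard
    return raw_brand


raw, rest = split_brand("35oz box of e-heart milk chocolate squares chocolate candy bars")
print(standardize_brand(raw), "|", rest)
```

Because step 2 is only a lookup, swapping in a regional brand table would not require touching the extraction logic in step 1.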

Units

Now let's focus on only the products in the beverage category, which have already been branded following the same process described in the previous section:

CATEGORY  BRAND          PRODUCT DESCRIPTION
Beverage  Sugar Water    Eight Ounce Aluminum Cans (Six Pack) Carbonated Soda
Beverage  Sugar Water    12 fl. oz. Carbonated Soft Drink Can
Beverage  Sugar Water    Carbonated Aluminum Can Six Pack Soda Pop 12 OZ (FL)
Beverage  Sugar Water    12-12 FL OZ Cans Carbonated Soft Drink
Beverage  Sugar Water    12 floz fizzy drink CASE 24 aluminum cans (carbonated)
Beverage  Sugar Water    Carbonated FL OZ 24 Plastic Bottle Soft Drink
Beverage  Sugar Water    6-PACK 24 Fl. Oz. Bottles Carbonated Soft Drink
Beverage  Sugar Water    Two Liter Plastic Bottle Carbonated Soft Drink
Beverage  Extreme Cacao  Chocolate Energy Drink 15 FL OZ Aluminum Can
Beverage  Extreme Cacao  Two Chocolate Energy Drinks (15oz Cans)

Our beverage SMEs have identified the portions of each product description that belong in the new unit fields we are creating in this two-step process.

The first step is to separate the unit information from the product description:

CATEGORY  BRAND          COUNT     SIZE   MEASURE  TYPE            PRODUCT DESCRIPTION
Beverage  Sugar Water    Six Pack  Eight  Ounce    Aluminum Cans   Carbonated Soda
Beverage  Sugar Water    (blank)   12     fl. oz.  Can             Carbonated Soft Drink
Beverage  Sugar Water    Six Pack  12     OZ (FL)  Aluminum Can    Carbonated Soda Pop
Beverage  Sugar Water    12        12     FL OZ    Cans            Carbonated Soft Drink
Beverage  Sugar Water    CASE 24   12     floz     aluminum cans   fizzy drink (carbonated)
Beverage  Sugar Water    (blank)   24     FL OZ    Plastic Bottle  Carbonated Soft Drink
Beverage  Sugar Water    6-PACK    24     Fl. Oz.  Bottles         Carbonated Soft Drink
Beverage  Sugar Water    (blank)   Two    Liter    Plastic Bottle  Carbonated Soft Drink
Beverage  Extreme Cacao  (blank)   15     FL OZ    Aluminum Can    Chocolate Energy Drink
Beverage  Extreme Cacao  Two       15     oz       Cans            Chocolate Energy Drinks
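Part of this separation step can be approximated with pattern matching. The sketch below pulls out only the unit size and unit measure, and only when the size is written in digits and immediately followed by a measure token. The pattern and function name are hypothetical; spelled-out sizes such as "Eight Ounce" or reordered text such as "FL OZ 24" are deliberately left unhandled to show why SME-driven customization is still needed.

```python
import re

# Hypothetical pattern: a number followed by a unit-of-measure token.
# The lookahead stops "g" or "oz" from matching the start of a longer word.
SIZE_MEASURE = re.compile(
    r"(?P<size>\d+(?:\.\d+)?)\s*"
    r"(?P<measure>fl\.?\s*oz\.?|floz|oz\.?|grams?|g)(?![A-Za-z])",
    re.IGNORECASE,
)


def split_size_measure(description):
    """Return (size, measure, remaining description) for the first match."""
    match = SIZE_MEASURE.search(description)
    if match is None:
        # Nothing recognizable: leave the description untouched.
        return "", "", description
    rest = description[:match.start()] + " " + description[match.end():]
    return match.group("size"), match.group("measure"), " ".join(rest.split())


print(split_size_measure("12 fl. oz. Carbonated Soft Drink Can"))
print(split_size_measure("Two Liter Plastic Bottle Carbonated Soft Drink"))
```

The extracted values are kept exactly as written in the source text, so that standardizing them remains a separate step.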

The second step is to standardize the representation of the unit information:

CATEGORY  BRAND          COUNT  SIZE  MEASURE  TYPE    PRODUCT DESCRIPTION
Beverage  Sugar Water    6      8     FL OZ    CAN     Carbonated Soft Drink
Beverage  Sugar Water    1      12    FL OZ    CAN     Carbonated Soft Drink
Beverage  Sugar Water    6      12    FL OZ    CAN     Carbonated Soft Drink
Beverage  Sugar Water    12     12    FL OZ    CAN     Carbonated Soft Drink
Beverage  Sugar Water    24     12    FL OZ    CAN     Carbonated Soft Drink
Beverage  Sugar Water    1      24    FL OZ    BOTTLE  Carbonated Soft Drink
Beverage  Sugar Water    6      24    FL OZ    BOTTLE  Carbonated Soft Drink
Beverage  Sugar Water    1      2     L        BOTTLE  Carbonated Soft Drink
Beverage  Extreme Cacao  1      15    FL OZ    CAN     Chocolate Energy Drink
Beverage  Extreme Cacao  2      15    FL OZ    CAN     Chocolate Energy Drink

Please Note: Missing unit counts were populated with 1 as their default value, and the remaining content of the original product description has also been standardized.

Before and After Standardization

CATEGORY  PRODUCT DESCRIPTION
Candy     E<3MC2 Bar Milk Chocolate Square Net Weight 3.5 oz.
Candy     Everybody Loves Milk Chocolate Squared NET WGT 99G Bar
Candy     Milk Chocolate Square that Everybody Loves in a 3.5 oz Bar
Candy     35oz box of e-heart milk chocolate squares chocolate candy bars
Candy     Square Bar (ELMC2) 99g of Milk Chocolatey Goodness
Candy     E-Heart Emoticon-Milk-Chocolate-Squared 3-and-1/2 ounce BAR
Candy     Square Chocolate E3MC2 Bar 99g (3.5z) Milk Chocolate
Candy     Milk Chocolate Squares Everyone Loves 10 1.5 OZ Squares (15 OZ BAG)
Candy     Non-Sugar-Free Chewing Gum Net Weight 35 grams Sugary X-treme
Candy     Sugar Extreme Six Pack of Sugar Chewing Gum (Net Weight 7.41 oz.)
Beverage  Eight Ounce Aluminum Cans Sugar Water (Six Pack) Carbonated Soda
Beverage  Sugar Water 12 fl. oz. Carbonated Soft Drink Can
Beverage  Carbonated Sugar Water Aluminum Can Six Pack Soda Pop 12 OZ (FL)
Beverage  12-12 FL OZ Cans Sugar Water Carbonated Soft Drink
Beverage  sugar water 12 floz fizzy drink CASE 24 aluminum cans (carbonated)
Beverage  Sugary Carbonated Water FL OZ 24 Plastic Bottle Soft Drink
Beverage  6-PACK 24 Fl. Oz. Bottles Sugar Water Carbonated Soft Drink
Beverage  Sugar Water Two Liter Plastic Bottle Carbonated Soft Drink
Beverage  Extreme Cacao Chocolate Energy Drink 15 FL OZ Aluminum Can
Beverage  Two Chocolate Energy Drinks (15oz Cans) Extreme Cacao
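Once the unit values have been separated from the description, standardizing them is largely a matter of lookups. The following sketch shows one way the beverage unit values could be mapped onto standard values, including the default unit count of 1. The lookup tables and function names are hypothetical, and the measure table is intentionally specific to the beverage category: a bare "oz" is read as fluid ounces here, which would be wrong for candy.

```python
import re

# Hypothetical lookup tables for the beverage category only.
NUMBER_WORDS = {"two": 2, "six": 6, "eight": 8}
BEVERAGE_MEASURES = {
    "fl oz": "FL OZ",
    "floz": "FL OZ",
    "oz fl": "FL OZ",
    "oz": "FL OZ",
    "ounce": "FL OZ",
    "liter": "L",
}


def clean(text):
    """Lowercase, drop punctuation and digits, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z ]", " ", text.lower()).split())


def to_number(text, default):
    """Read a number from digits or a known number word, else the default."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match:
        value = float(match.group())
        return int(value) if value.is_integer() else value
    for word in clean(text).split():
        if word in NUMBER_WORDS:
            return NUMBER_WORDS[word]
    return default


def standardize_beverage_units(count, size, measure, unit_type):
    """Map raw unit values onto standard count, size, measure and type."""
    packaging = clean(unit_type)
    if "can" in packaging:
        std_type = "CAN"
    elif "bottle" in packaging:
        std_type = "BOTTLE"
    else:
        std_type = packaging.upper()
    std_measure = BEVERAGE_MEASURES.get(clean(measure), clean(measure).upper())
    # A missing unit count defaults to 1.
    return to_number(count, 1), to_number(size, None), std_measure, std_type


print(standardize_beverage_units("Six Pack", "Eight", "Ounce", "Aluminum Cans"))
print(standardize_beverage_units("", "Two", "Liter", "Plastic Bottle"))
```

Keeping the tables per category is what makes categorization a useful first step: the same token can then be given a different standard value in a different category.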

After applying all of the standardization logic described above, we can easily see the dramatic improvement in the data quality of our product data examples:

CATEGORY  BRAND          COUNT  SIZE  MEASURE  TYPE    PRODUCT DESCRIPTION
Candy     E<3MC2         1      3.5   OZ       BAR     Milk Chocolate Square
Candy     E<3MC2         1      3.5   OZ       BAR     Milk Chocolate Square
Candy     E<3MC2         1      3.5   OZ       BAR     Milk Chocolate Square
Candy     E<3MC2         10     3.5   OZ       BAR     Milk Chocolate Square
Candy     E<3MC2         1      3.5   OZ       BAR     Milk Chocolate Square
Candy     E<3MC2         1      3.5   OZ       BAR     Milk Chocolate Square
Candy     E<3MC2         1      3.5   OZ       BAR     Milk Chocolate Square
Candy     E<3MC2         10     1.5   OZ       BAG     Milk Chocolate Square
Candy     Sugar Extreme  1      35    G        PACK    Sugar Chewing Gum
Candy     Sugar Extreme  6      35    G        PACK    Sugar Chewing Gum
Beverage  Sugar Water    6      8     FL OZ    CAN     Carbonated Soft Drink
Beverage  Sugar Water    1      12    FL OZ    CAN     Carbonated Soft Drink
Beverage  Sugar Water    6      12    FL OZ    CAN     Carbonated Soft Drink
Beverage  Sugar Water    12     12    FL OZ    CAN     Carbonated Soft Drink
Beverage  Sugar Water    24     12    FL OZ    CAN     Carbonated Soft Drink
Beverage  Sugar Water    1      24    FL OZ    BOTTLE  Carbonated Soft Drink
Beverage  Sugar Water    6      24    FL OZ    BOTTLE  Carbonated Soft Drink
Beverage  Sugar Water    1      2     L        BOTTLE  Carbonated Soft Drink
Beverage  Extreme Cacao  1      15    FL OZ    CAN     Chocolate Energy Drink
Beverage  Extreme Cacao  2      15    FL OZ    CAN     Chocolate Energy Drink

Matching

Matching for product data is usually performed either to compare records within or across data sources in order to determine if they correspond to the same product (i.e., are duplicates) or to match records against a standard product reference (e.g., UNSPSC, in order to obtain the product commodity classification code).

Matching often uses standardization to prepare its input. This facilitates a direct evaluation of comparable fields (e.g., brand name to brand name) and more reliable comparisons based on standardized values. It also reduces failures to match records because of data variations and increases the probability of effective match results.

SAS White Paper The standardization of our data examples has normalized the product descriptions to the point that the six duplicate records in the candy category, which were highlighted in the introduction, can now be easily identified as exact matches: CATEGORY BRAND COUNT SIZE MEASURE TYPE E<3MC 2 1 3.5 OZ BAR Milk Chocolate Square E<3MC 2 1 3.5 OZ BAR Milk Chocolate Square E<3MC 2 1 3.5 OZ BAR Milk Chocolate Square E<3MC 2 10 3.5 OZ BAR Milk Chocolate Square E<3MC 2 1 3.5 OZ BAR Milk Chocolate Square E<3MC 2 1 3.5 OZ BAR Milk Chocolate Square E<3MC 2 1 3.5 OZ BAR Milk Chocolate Square E<3MC 2 10 1.5 OZ BAG Milk Chocolate Square Sugar Extreme 1 35 G PACK Sugar Chewing Gum Sugar Extreme 6 35 G PACK Sugar Chewing Gum If the six duplicates were consolidated into a single record, then the E<3MC 2 brand could be properly represented as the following three unique Acme Foods products: CATEGORY BRAND COUNT SIZE MEASURE TYPE E<3MC 2 1 3.5 OZ BAR Milk Chocolate Square E<3MC 2 10 3.5 OZ BAR Milk Chocolate Square E<3MC 2 10 1.5 OZ BAG Milk Chocolate Square Data quality tools support the advanced duplicate consolidation logic often necessary for selecting or constructing the consolidated record (also known as the survivor or golden copy ). Obviously, exact matching on rigorously standardized data is neither a recommended best practice nor a limitation imposed by data quality tools, which provide advanced matching techniques for overcoming data variations and other data quality issues. Although those techniques are beyond the scope of this paper, standardization will still play an important supporting role, especially for improving candidate selection for automated and interactive matching and for searching the product catalog. 
Data quality tools also provide some way to rank their match and search results (e.g., numeric probabilities, weighted percentages, odds ratios or confidence levels). These rankings are the primary means of differentiating automatic matches, automatic nonmatches and potential matches that require manual review and verification by an SME.
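To make the three-way split concrete, here is a hedged Python sketch that turns a similarity score into one of the three outcomes. Everything specific in it is an assumption made for illustration: the two thresholds are arbitrary, and the score comes from `difflib.SequenceMatcher` in the Python standard library as a stand-in for whatever ranking a given data quality tool actually produces.

```python
from difflib import SequenceMatcher

# Hypothetical cutoffs chosen for illustration; real tools expose their
# own scoring scales and let the business tune the thresholds.
AUTO_MATCH = 0.90
AUTO_NONMATCH = 0.60


def score(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; a stand-in scoring method."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def classify(similarity: float) -> str:
    """Map a score to an automatic match, automatic nonmatch, or review."""
    if similarity >= AUTO_MATCH:
        return "match"
    if similarity < AUTO_NONMATCH:
        return "nonmatch"
    return "review"  # potential match: route to an SME for verification


# Identical values always score 1.0 and are matched automatically.
print(classify(score("Sugar Water", "SUGAR WATER")))
```

Narrowing the band between the two thresholds reduces the manual review workload at the cost of more automatic decisions, which is a trade-off for the business to set rather than the tool.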

After Matching

After matching has performed duplicate identification and consolidation, the updated Acme Foods product catalog now has dramatically improved product data quality:

CATEGORY BRAND COUNT SIZE MEASURE TYPE
E<3MC² 1 3.5 OZ BAR Milk Chocolate Square
E<3MC² 10 3.5 OZ BAR Milk Chocolate Square
E<3MC² 10 1.5 OZ BAG Milk Chocolate Square
Sugar Extreme 1 3.5 G PACK Sugar Chewing Gum
Sugar Extreme 6 3.5 G PACK Sugar Chewing Gum
Sugar Water 6 8 FL OZ CAN Carbonated Soft Drink
Sugar Water 1 12 FL OZ CAN Carbonated Soft Drink
Sugar Water 6 12 FL OZ CAN Carbonated Soft Drink
Sugar Water 12 12 FL OZ CAN Carbonated Soft Drink
Sugar Water 24 12 FL OZ CAN Carbonated Soft Drink
Sugar Water 1 24 FL OZ BOTTLE Carbonated Soft Drink
Sugar Water 6 24 FL OZ BOTTLE Carbonated Soft Drink
Sugar Water 1 2 L BOTTLE Carbonated Soft Drink
Extreme Cacao 1 15 FL OZ CAN Chocolate Energy Drink
Extreme Cacao 2 15 FL OZ CAN Chocolate Energy Drink

Searching and matching against this new internal standard product reference can prevent future duplicates from being added to the Acme Foods product catalog.
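One way to picture that safeguard is a check that runs before any insert, sketched below in Python. It assumes the incoming record has already been through the same standardization as the catalog, and it uses exact lookup only; the function name `add_product` and the six-pack of Extreme Cacao are hypothetical and introduced purely for illustration.

```python
# A few standardized rows from the consolidated catalog, used as the
# internal standard product reference. A set gives fast exact lookup.
catalog = {
    ("Sugar Water", 6, 12, "FL OZ", "CAN", "Carbonated Soft Drink"),
    ("Sugar Water", 1, 2, "L", "BOTTLE", "Carbonated Soft Drink"),
    ("Extreme Cacao", 1, 15, "FL OZ", "CAN", "Chocolate Energy Drink"),
}


def add_product(reference, record):
    """Add a standardized record unless the reference already holds it.

    Returns True if the record was added, False if it was a duplicate.
    """
    if record in reference:
        return False
    reference.add(record)
    return True


# Already in the catalog, so it is rejected as a duplicate.
existing = ("Sugar Water", 1, 2, "L", "BOTTLE", "Carbonated Soft Drink")
print(add_product(catalog, existing))

# Hypothetical new product, not part of the example data.
new_item = ("Extreme Cacao", 6, 15, "FL OZ", "CAN", "Chocolate Energy Drink")
print(add_product(catalog, new_item))
```

In practice the exact lookup would be replaced or supplemented by the ranked matching that data quality tools provide, so that near-duplicates are caught as well.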

Summary

Product data presents some challenges that are different from other data domains. The root cause is often the product description, which is usually unstructured, meaning that most product data attributes are buried within a free-form text field.

This paper provided a data-example-driven perspective of some of the unique challenges of product data quality, and discussed and demonstrated the three critical steps to improving product data quality:

1. Categorization: Organizes product descriptions by category, aligning technical processes and business rules with SMEs, and routes product descriptions to category-specific standardization rules.
2. Standardization: A two-step process that separates the content of the product description into new fields, then applies standard values. Implementing these steps separately makes it easier to apply different standards when appropriate (e.g., regional standards in a local language).
3. Matching: Identifies and consolidates duplicate products within a source, facilitates improved search capability, and supports matching against an internal or external standard product reference.

The fictional data examples from the Acme Foods product catalog demonstrated that I love Sugar Water and Everybody Loves Milk Chocolate Squared (E<3MC²). But if there is only one fact that you take away from this white paper, let it be this one: Everybody loves high-quality product data.

To learn more about data quality, visit: sas.com/software/data-management/data-quality-category/index.html

About SAS

SAS is the leader in business analytics software and services, and the largest independent vendor in the business intelligence market. Through innovative solutions, SAS helps customers at more than 65,000 sites improve performance and deliver value by making better decisions faster. Since 1976 SAS has been giving customers around the world THE POWER TO KNOW.

SAS Institute Inc. World Headquarters +1 919 677 8000
To contact your local SAS office, please visit: sas.com/offices

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright 2013, SAS Institute Inc. All rights reserved. 106029_S118297_1213