DIR2017. Training Neural Rankers with Weak Supervision. Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Sascha Rothe, Jaap Kamps, and W. Bruce Croft


Transcription:

Training Neural Rankers with Weak Supervision. DIR2017. Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Sascha Rothe, Jaap Kamps, and W. Bruce Croft

Motivation. Deep neural nets are data hungry: for many tasks, the more data you have, the better your model will be! This amount of data is not always available for many IR tasks. One option: unsupervised neural network based methods. Our idea: use a well-established unsupervised method as the training signal. Weak supervision: connecting symbolic IR with data-driven methods.

General Idea. Leverage large amounts of unsupervised data to infer weak labels, and use that signal for learning supervised models as if we had the ground-truth labels.

Weak Supervision for Ranking: Pseudo-Labeling. BM25 plays the role of the pseudo-labeler in our learning scenario. Given a target collection and a large set of training queries (without relevance judgments), we use the pseudo-labeler to rank/score the documents for each query in the training query set.
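To make the pseudo-labeling step concrete, here is a minimal Python sketch of generating weak training data with a BM25 scorer. The bm25_score helper, the data structures, and the top_k cutoff are hypothetical stand-ins for illustration, not the authors' implementation.

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_len, k1=1.2, b=0.75):
    # Hypothetical BM25 scorer acting as the weak annotator.
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        df = doc_freqs.get(t, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc_terms) / avg_len))
        score += idf * norm
    return score

def pseudo_label(queries, docs, doc_freqs, top_k=1000):
    # For each training query (no relevance judgments needed), score all
    # documents with BM25 and keep the top-k (query, doc, weak_score)
    # triples as weakly supervised training data.
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs.values()) / n_docs
    weak_data = []
    for q_id, q_terms in queries.items():
        scored = [(d_id, bm25_score(q_terms, d_terms, doc_freqs, n_docs, avg_len))
                  for d_id, d_terms in docs.items()]
        scored.sort(key=lambda x: x[1], reverse=True)
        weak_data.extend((q_id, d_id, s) for d_id, s in scored[:top_k])
    return weak_data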

Ranking Architectures: Score model. The goal in this architecture is to learn a scoring function for a single (q, d) pair. Point-wise loss: linear regression with MSE.
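A plausible form of this point-wise objective (a sketch consistent with the slide, not copied from the paper), where s_theta is the network's score and the BM25 score of the pseudo-labeler serves as the regression target over a batch B:

\[
\mathcal{L}_{\text{score}} \;=\; \frac{1}{|B|} \sum_{(q,d)\in B} \bigl( s_\theta(q,d) - s_{\mathrm{BM25}}(q,d) \bigr)^2
\]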

Ranking Architectures: Rank model. The goal in this architecture is to learn a ranking function. Pair-wise at training / point-wise at inference. Loss: hinge loss over document pairs (q, d1, d2).
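One standard instantiation of such a pair-wise hinge objective (a sketch under the assumption that the preference direction comes from the BM25 pseudo-labels, with margin epsilon):

\[
\mathcal{L}_{\text{rank}} \;=\; \max\Bigl\{ 0,\; \epsilon \;-\; \operatorname{sign}\bigl( s_{\mathrm{BM25}}(q,d_1) - s_{\mathrm{BM25}}(q,d_2) \bigr)\,\bigl( s_\theta(q,d_1) - s_\theta(q,d_2) \bigr) \Bigr\}
\]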

Ranking Architectures: RankProb model. The goal in this architecture is to learn a ranking function. Pair-wise loss: logistic regression over document pairs (q, d1, d2).
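A plausible reading of this pair-wise logistic objective (a sketch, not the exact formulation from the talk): the model predicts the probability that d1 should be ranked above d2 for q, and the target probability p-tilde is derived from the BM25 pseudo-labels, giving a cross-entropy loss:

\[
\mathcal{L}_{\text{rankprob}} \;=\; -\,\tilde{p}\,\log p_\theta(d_1 \succ d_2 \mid q) \;-\; (1-\tilde{p})\,\log\bigl(1 - p_\theta(d_1 \succ d_2 \mid q)\bigr)
\]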

Input Representations. Dense vector representation: fully featurized, exactly the inputs BM25 uses. Sparse vector representation: bag of words.

Input Representations. Embedding vector representation: a joint embedding matrix (vocabulary size x embedding size) for the terms in the query and the document, learning the representation of terms; a compositionality function (from word representations to query/document representations); and learned global weights of terms.
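One plausible way to write this compositionality function (a sketch for illustration; the normalization is an assumption, not necessarily the paper's exact formulation): the query or document representation is a weighted sum of its term embeddings, where E is the joint embedding matrix and w the learned global term-weighting function:

\[
\vec{v}(q) \;=\; \sum_{i=1}^{|q|} \hat{w}(t_i)\, E(t_i), \qquad \hat{w}(t_i) \;=\; \frac{\exp\bigl(w(t_i)\bigr)}{\sum_{j=1}^{|q|} \exp\bigl(w(t_j)\bigr)}
\]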

Experimental Setup. Target data collections: the ClueWeb09 Category B dataset and the Robust dataset. Training query set: AOL queries (after some filtering, more than 6M queries for each set). Hyper-parameters: width and depth of the network, learning rate, dropout, and embedding size, optimized using batched GP bandits with an expected-improvement acquisition function.

How do the neural models with different training objectives and input representations compare?

Method               Robust04                        ClueWeb
                     MAP     P@20    ndcg@20         MAP     P@20    ndcg@20
BM25                 0.2503  0.3569  0.4102          0.1021  0.2418  0.2070
Score + Dense        0.1961  0.2787  0.3260          0.0689  0.1518  0.1430
Score + Sparse       0.2141  0.3180  0.3604          0.0701  0.1889  0.1495
Score + Embed        0.2423  0.3501  0.3999          0.1002  0.2513  0.2130
Rank + Dense         0.1940  0.2830  0.3317          0.0622  0.1516  0.1383
Rank + Sparse        0.2213  0.3216  0.3628          0.0776  0.1989  0.1816
Rank + Embed         0.2811  0.3773  0.4302          0.1306  0.2839  0.2216
RankProb + Dense     0.2192  0.2966  0.3278          0.0702  0.1711  0.1506
RankProb + Sparse    0.2246  0.3250  0.3763          0.0894  0.2109  0.1916
RankProb + Embed     0.2837  0.3802  0.4389          0.1387  0.2967  0.2330

How do the neural models with different training objectives and input representations compare? Take home message: (1) Define an objective that enables your model to go beyond the imperfection of the weakly annotated data (ranking instead of calibrated scoring). (2) Let the network decide about the representation: feeding the network with featurized input kills the model's creativity!

How meaningful are the compositionality weights learned in the embedding vector representation? (Figure 4: strong linear correlation between the learned weights; (a) Robust04, Pearson correlation 0.8243; (b) ClueWeb, Pearson correlation 0.7014.)

How meaningful are the compositionality weights learned in the embedding vector representation? Take home message: by just seeing individual local instances from the data, the network learns such a global statistic.

How well do other alternatives for the embedding and weighting functions in the embedding vector representation perform?

Table 3: Performance of the rankprob model with variants of the embedding vector representation on different datasets (statistical significance tested at the 0.05 level using a paired two-tailed t-test, with Bonferroni correction).

Embedding type                               Robust04                        ClueWeb
                                             MAP     P@20    ndcg@20         MAP     P@20    ndcg@20
Pretrained (external) + Uniform weighting    0.1656  0.2543  0.3017          0.0612  0.1300  0.1401
Pretrained (external) + IDF weighting        0.1711  0.2755  0.3104          0.0712  0.1346  0.1469
Pretrained (external) + Weight learning      0.1880  0.2890  0.3413          0.0756  0.1344  0.1583
Pretrained (target) + Uniform weighting      0.1217  0.2009  0.2791          0.0679  0.1331  0.1587
Pretrained (target) + IDF weighting          0.1402  0.2230  0.2876          0.0779  0.1674  0.1540
Pretrained (target) + Weight learning        0.1477  0.2266  0.2804          0.0816  0.1729  0.1608
Learned + Uniform weighting                  0.2612  0.3602  0.4180          0.0912  0.2216  0.1841
Learned + IDF weighting                      0.2676  0.3619  0.4200          0.1032  0.2419  0.1922
Learned + Weight learning                    0.2837  0.3802  0.4389          0.1387  0.2967  0.2330

For the embedding function E, three alternatives were tried: (1) pre-trained word embeddings learned from an external corpus (Google News), (2) pre-trained word embeddings learned from the target corpus (using the skip-gram model [27]), and (3) embeddings learned during network training. For the compositionality function, the alternatives were: (1) uniform weighting (simple averaging, a common approach for compositionality functions), (2) IDF as fixed weights instead of learning the weighting function W, and (3) learned weights.

(Figure 5: performance of the rankprob model with learned embedding, pre-trained embedding, and learned embedding with pre-trained embedding as initialization, on (a) Robust04 and (b) ClueWeb.)

How well do other alternatives for the embedding and weighting functions in the embedding vector representation perform? Take home message: if you get enough data, you can learn embeddings that are better fitted to your task just by updating them based on the objective of the downstream task. But you need a lot of data: THANKS TO WEAK SUPERVISION!

How useful is learning with weak supervision as pre-training for supervised ranking? (Differences tested for statistical significance at the 0.05 level using a paired two-tailed t-test, with Bonferroni correction.)

Method                                  Robust04                        ClueWeb
                                        MAP     P@20    ndcg@20         MAP     P@20    ndcg@20
Weakly supervised                       0.2837  0.3802  0.4389          0.1387  0.2967  0.2330
Fully supervised                        0.1790  0.2863  0.3402          0.0680  0.1425  0.1652
Weakly supervised + Fully supervised    0.2912  0.4126  0.4509          0.1520  0.3077  0.2461

How useful is learning with weak supervision as pre-training for supervised ranking? Take home message: you want to train a neural network for your task but you have only a small amount of supervised data? You can compensate for it by pre-training your network on weakly annotated data.

Avoiding your teacher's mistake! Training a neural ranker with controlled weak supervision. Main goal: controlling the effect of imperfect weak training instances by down-weighting them. (Architecture diagram components: Representation Learning; Supervision Layer, with prediction loss w.r.t. the weak labels; Confidence Network, estimating the goodness of instances; Weak Annotator; True Labels.)
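A minimal sketch of what down-weighting imperfect weak instances could look like (an assumption for illustration, not the exact formulation from the talk): the loss of the ranker f_theta on each weakly labeled instance is scaled by the confidence network's estimate of that instance's goodness,

\[
\mathcal{L} \;=\; \sum_{i} c_\phi(x_i)\; \ell\bigl( f_\theta(x_i),\, \tilde{y}_i \bigr)
\]

where y-tilde_i is the weak label from the weak annotator and c_phi is the confidence network's output for instance x_i.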

Training. Two modes: full supervision mode and weak supervision mode (same architecture as above: Representation Learning, Supervision Layer with prediction loss w.r.t. the weak labels, Confidence Network estimating the goodness of instances, Weak Annotator, True Labels).

Thank you!