Training Neural Rankers with Weak Supervision
DIR 2017
Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Sascha Rothe, Jaap Kamps, and W. Bruce Croft
Motivation
Deep neural nets are data hungry: for many tasks, the more data you have, the better your model will be!
This amount of data is not always available for many IR tasks; hence the interest in unsupervised neural network-based methods.
Our idea: use a well-established unsupervised method as the training signal.
Weak supervision: connecting symbolic IR with data-driven methods.
General Idea
Leverage a large amount of unsupervised data to infer weak labels, and use that signal to train supervised models as if we had ground-truth labels.
Weak Supervision for Ranking: Pseudo-Labeling
BM25 plays the role of the pseudo-labeler in our learning scenario.
Take a target collection and a large set of training queries (without relevance judgments), and use the pseudo-labeler to rank/score the documents for each query in the training query set (see the sketch below).
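A minimal sketch of this pseudo-labeling step, assuming the third-party rank_bm25 package (the deck does not name an implementation); the toy corpus and the weak_labels helper are hypothetical:

```python
# BM25 as a pseudo-labeler: score unjudged documents to create weak labels.
from rank_bm25 import BM25Okapi

corpus = [
    "neural ranking models for information retrieval",
    "bm25 is a classic probabilistic retrieval function",
    "weak supervision turns heuristics into training signal",
]
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

def weak_labels(query: str, k: int = 2):
    """Score every document for a query; the top-k (doc_index, bm25_score)
    pairs become weakly labeled training instances."""
    scores = bm25.get_scores(query.split())
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return ranked[:k]

print(weak_labels("neural retrieval"))
```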
Ranking Architectures: Score Model
The goal in this architecture is to learn a scoring function for a query-document pair $(q, d)$.
Point-wise loss (linear regression, with MSE against the weak BM25 score):
$\mathcal{L}_{\text{score}} = \big(s_\theta(q, d) - s_{\mathrm{BM25}}(q, d)\big)^2$
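A rough PyTorch sketch of this point-wise objective; the Scorer network below is an illustrative stand-in, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Scorer(nn.Module):
    """Toy scoring function s(q, d); the paper's network is deeper."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, q, d):
        return self.ff(torch.cat([q, d], dim=-1)).squeeze(-1)

scorer = Scorer()
q, d = torch.randn(4, 8), torch.randn(4, 8)   # (batch, dim) representations
bm25_target = torch.rand(4) * 20              # weak scores from the pseudo-labeler
loss = F.mse_loss(scorer(q, d), bm25_target)  # linear regression with MSE
loss.backward()
```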
Ranking Architectures: Rank Model
The goal in this architecture is to learn a ranking function over a query $q$ and a document pair $(d_1, d_2)$: pair-wise at training time, point-wise at inference time.
Pair-wise loss (hinge loss):
$\mathcal{L}_{\text{rank}} = \max\big\{0,\ \varepsilon - \operatorname{sign}\big(s_{\mathrm{BM25}}(q, d_1) - s_{\mathrm{BM25}}(q, d_2)\big)\big(s_\theta(q, d_1) - s_\theta(q, d_2)\big)\big\}$
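Continuing the same toy setup, the pair-wise hinge objective might look like this (the document pair and its BM25 scores are illustrative tensors; epsilon is the margin):

```python
d1, d2 = torch.randn(4, 8), torch.randn(4, 8)
bm25_d1, bm25_d2 = torch.rand(4) * 20, torch.rand(4) * 20

epsilon = 1.0                                   # hinge margin
sign = torch.sign(bm25_d1 - bm25_d2)            # which document BM25 prefers
diff = scorer(q, d1) - scorer(q, d2)            # model's score difference
loss = torch.clamp(epsilon - sign * diff, min=0).mean()  # hinge loss
# At inference the pair machinery is dropped: documents are ranked
# point-wise by scorer(q, d).
```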
Ranking Architectures: RankProb Model
The goal in this architecture is to learn a ranking function that predicts the probability of $d_1$ ranking above $d_2$ for query $q$.
Pair-wise loss (logistic regression):
$\mathcal{L}_{\text{rankprob}} = -\,y \log \hat{p} - (1 - y)\log(1 - \hat{p})$, with $\hat{p} = p_\theta(d_1 \succ d_2 \mid q)$ and $y$ the preference implied by the BM25 scores.
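A sketch of the rankprob objective in the same toy setup; note this is a simplification that uses a score difference as the logit, whereas the paper feeds the whole triple (q, d1, d2) through a single network:

```python
logit = scorer(q, d1) - scorer(q, d2)           # log-odds that d1 beats d2
target = (bm25_d1 > bm25_d2).float()            # weak preference from BM25
loss = F.binary_cross_entropy_with_logits(logit, target)  # logistic regression
```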
Input Representations
Dense vector representation: fully featurized; exactly the inputs that BM25 uses.
Sparse vector representation: bag of words.
Input Representations
Embedding vector representation (sketch below):
learning the representation of terms via a joint embedding matrix (vocabulary size x embedding size) for terms in the query and the document;
a compositionality function (from word representations to a query/document representation);
learning a global weight for each term.
[Diagram: terms w_1 ... w_n are embedded and composed into a single vector.]
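A sketch of this representation, assuming a softmax-normalized weighted average as the compositionality function; the class name and normalization choice are illustrative, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class WeightedBagEmbedding(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # joint term embeddings
        self.weight = nn.Embedding(vocab_size, 1)         # global per-term weight

    def forward(self, term_ids):                          # (batch, n_terms)
        vecs = self.embed(term_ids)                       # (batch, n_terms, dim)
        w = torch.softmax(self.weight(term_ids).squeeze(-1), dim=-1)
        return (w.unsqueeze(-1) * vecs).sum(dim=1)        # weighted average

encoder = WeightedBagEmbedding(vocab_size=30_000, embed_dim=128)
query_vec = encoder(torch.randint(0, 30_000, (4, 5)))     # 4 queries x 5 terms
```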
Experimental Setup
Target data collections: the ClueWeb09 Category B and Robust04 datasets.
Training query set: AOL queries (after some filtering, we got more than 6 million queries for each set).
Hyper-parameters: width and depth of the network, learning rate, dropout, and embedding size, optimized using batched GP bandits with an expected-improvement acquisition function (sketch below).
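The deck does not name its tuner; as a rough analogue, scikit-optimize's Gaussian-process minimizer with an expected-improvement acquisition function gives a sequential (not batched) sketch. The search space and the evaluate_ranker stub are hypothetical:

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

space = [
    Integer(1, 5, name="depth"),                        # network depth
    Integer(64, 1024, name="width"),                    # hidden-layer width
    Real(1e-5, 1e-2, prior="log-uniform", name="lr"),   # learning rate
    Real(0.0, 0.5, name="dropout"),
]

def evaluate_ranker(depth, width, lr, dropout):
    # Hypothetical stand-in: in practice, train the ranker with these
    # hyper-parameters and return MAP on a held-out query set.
    return 1.0 / (1.0 + depth * width * lr * (1.0 - dropout))

def objective(params):
    depth, width, lr, dropout = params
    return -evaluate_ranker(depth, width, lr, dropout)  # minimize negative MAP

result = gp_minimize(objective, space, n_calls=30, acq_func="EI")
```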
How do the neural models with different training objectives and input representations compare?

Method             Robust04                    ClueWeb
                   MAP      P@20     ndcg@20   MAP      P@20     ndcg@20
BM25               0.2503   0.3569   0.4102    0.1021   0.2418   0.2070
Score + Dense      0.1961▼  0.2787▼  0.3260▼   0.0689▼  0.1518▼  0.1430▼
Score + Sparse     0.2141▼  0.3180▼  0.3604▼   0.0701▼  0.1889▼  0.1495▼
Score + Embed      0.2423▼  0.3501   0.3999    0.1002   0.2513   0.2130
Rank + Dense       0.1940▼  0.2830▼  0.3317▼   0.0622▼  0.1516▼  0.1383▼
Rank + Sparse      0.2213▼  0.3216▼  0.3628▼   0.0776▼  0.1989▼  0.1816▼
Rank + Embed       0.2811   0.3773   0.4302    0.1306   0.2839   0.2216
RankProb + Dense   0.2192▼  0.2966▼  0.3278▼   0.0702▼  0.1711▼  0.1506▼
RankProb + Sparse  0.2246▼  0.3250▼  0.3763▼   0.0894▼  0.2109▼  0.1916
RankProb + Embed   0.2837   0.3802   0.4389    0.1387   0.2967   0.2330

(▼ stands in for the marker garbled as "ô" in the source; it appears on runs that fall below the BM25 baseline.)
How do the neural models with different training objectives and input representations compare?
Take-home message (results in the table above):
1. Define an objective that enables your model to go beyond the imperfection of the weakly annotated data (ranking instead of calibrated scoring).
2. Let the network decide about the representation: feeding the network a featurized input kills the model's creativity!
How meaningful are the compositionality weights learned in the embedding vector representation?
[Figure 4: Strong linear correlation between the weight learned for each term and its global statistic (IDF). (a) Robust04, Pearson correlation 0.8243; (b) ClueWeb, Pearson correlation 0.7014.]
How meaningful are the compositionality weights learned in the embedding vector representation?
Take-home message: by seeing only individual local instances from the data, the network learns such a global statistic (Figure 4 above).
How well do other alternatives for the embedding and weighting functions in the embedding vector representation perform?

For the embedding function E we tried: (1) pre-trained word embeddings learned from an external corpus (Google News), (2) pre-trained word embeddings learned from the target corpus (using the skip-gram model), and (3) embeddings learned during network training. For the compositionality function we tried: (1) uniform weighting (simple averaging, the common approach), (2) IDF as fixed weights instead of learning the weighting function W, and (3) learned weights.

Table 3: Performance of the rankprob model with variants of the embedding vector representation on different datasets.

Embedding type                             Robust04                    ClueWeb
                                           MAP      P@20     ndcg@20   MAP      P@20     ndcg@20
Pretrained (external) + Uniform weighting  0.1656   0.2543   0.3017    0.0612   0.1300   0.1401
Pretrained (external) + IDF weighting      0.1711   0.2755   0.3104    0.0712   0.1346   0.1469
Pretrained (external) + Weight learning    0.1880   0.2890   0.3413    0.0756   0.1344   0.1583
Pretrained (target) + Uniform weighting    0.1217   0.2009   0.2791    0.0679   0.1331   0.1587
Pretrained (target) + IDF weighting        0.1402   0.2230   0.2876    0.0779   0.1674   0.1540
Pretrained (target) + Weight learning      0.1477   0.2266   0.2804    0.0816   0.1729   0.1608
Learned + Uniform weighting                0.2612   0.3602   0.4180    0.0912   0.2216   0.1841
Learned + IDF weighting                    0.2676   0.3619   0.4200    0.1032   0.2419   0.1922
Learned + Weight learning                  0.2837   0.3802   0.4389    0.1387   0.2967   0.2330

Improvements of the best variant over all other models are statistically significant at the 0.05 level, using the paired two-tailed t-test with Bonferroni correction.

[Figure 5: Performance of the rankprob model with learned embedding, pre-trained embedding, and learned embedding with pre-trained embedding as initialization. (a) Robust04; (b) ClueWeb.]
How well do other alternatives for the embedding and weighting functions in the embedding vector representation perform?
Take-home message (results in Table 3 above): if you get enough data, you can learn embeddings that are better fitted to your task, simply by updating them based on the objective of the downstream task.
But you need a lot of data: THANKS TO WEAK SUPERVISION!
How useful is learning with weak supervision as pretraining for supervised ranking?

Method                                 Robust04                    ClueWeb
                                       MAP      P@20     ndcg@20   MAP      P@20     ndcg@20
Weakly supervised                      0.2837   0.3802   0.4389    0.1387   0.2967   0.2330
Fully supervised                       0.1790   0.2863   0.3402    0.0680   0.1425   0.1652
Weakly supervised + Fully supervised   0.2912   0.4126   0.4509    0.1520   0.3077   0.2461

Improvements are statistically significant at the 0.05 level, using the paired two-tailed t-test with Bonferroni correction.
How useful is learning with weak supervision as pretraining for supervised ranking?
Take-home message: you want to train a neural network for your task but you've got just a small amount of supervised data? You can compensate for it by pretraining your network on weakly annotated data (see the sketch below).
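A sketch of that recipe, reusing the toy Scorer from the earlier sketches; the two batch lists are stand-ins for a large weakly labeled stream and a small human-judged set:

```python
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

def fit(batches):
    for q, d, y in batches:
        opt.zero_grad()
        loss = F.mse_loss(scorer(q, d), y)
        loss.backward()
        opt.step()

# Stage 1: pretrain on many BM25 pseudo-labeled batches.
weak_batches = [(torch.randn(4, 8), torch.randn(4, 8), torch.rand(4) * 20)
                for _ in range(100)]
fit(weak_batches)

# Stage 2: fine-tune the same weights on a handful of judged batches.
judged_batches = [(torch.randn(4, 8), torch.randn(4, 8), torch.rand(4))
                  for _ in range(5)]
fit(judged_batches)
```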
Avoiding Your Teacher's Mistakes! Training a neural ranker with controlled weak supervision.
MAIN GOAL: controlling the effect of imperfect weak training instances by down-weighting them.
[Architecture diagram: a shared representation-learning layer feeds (i) a supervision layer trained with a prediction loss w.r.t. the weak labels, and (ii) a confidence network estimating the goodness of instances; the weak annotator supplies weak labels, and a small set of true labels trains the confidence network.]
Training
The network is trained in two alternating modes (sketch below):
Full Supervision mode: batches with true labels train the confidence network to estimate the goodness of instances, i.e., how far the weak annotator is from the truth.
Weak Supervision mode: batches with weak labels train the supervision layer and the representation, with each instance's prediction loss w.r.t. the weak labels down-weighted by the confidence network's output.
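A heavily simplified sketch of the two modes; the shapes, the MSE prediction loss, and the agreement-based target for the confidence network are assumptions, and in the real model the representation layers are shared between the two paths:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

repr_dim = 16
confidence_net = nn.Sequential(nn.Linear(repr_dim, 1), nn.Sigmoid())

def weak_mode_loss(shared_repr, prediction, weak_label):
    """Weak supervision mode: the prediction loss w.r.t. the weak label is
    down-weighted by the (frozen) confidence network's goodness estimate."""
    with torch.no_grad():
        goodness = confidence_net(shared_repr).squeeze(-1)
    per_instance = F.mse_loss(prediction, weak_label, reduction="none")
    return (goodness * per_instance).mean()

def full_mode_loss(shared_repr, weak_label, true_label):
    """Full supervision mode: fit the confidence network to predict the
    agreement between the weak annotator and the true label."""
    goodness = confidence_net(shared_repr).squeeze(-1)
    agreement = 1.0 - (weak_label - true_label).abs().clamp(0.0, 1.0)
    return F.binary_cross_entropy(goodness, agreement)
```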
Thank you!