Selection bias in innovation studies: A simple test

Work in progress
Gaétan de Rassenfosse, University of Melbourne (MIAESR and IPRIA), Australia
Annelies Wastyn, KULeuven, Belgium
IPTS Workshop, June 2011

Innovation production functions (IPF) are often used in innovation studies
They relate a firm's innovation output (I) to its research input (R): I = f(R)
They are used to study the impact of innovation policies, the contribution of innovation to productivity growth, etc.
Inventions are not observed, so patents are used as a proxy.
Patents are an imperfect measure with well-known limitations (Jefferson, 1929; Pavitt, 1985; Griliches, 1990): not all inventions are patented, and the value of patents varies widely (the majority are worthless)
We look at a third shortcoming: patents are counted in a simple manner

A patent protects an invention in one market
Priority filing (PF)
Second filing (SF) = extension of the protection to a foreign market
[Figure: world map; picture downloaded from worldmapsphotos.com]

Firms have a variety of patenting routes available to them...
[Figure: filing routes of eight example inventions (Invention 1 to Invention 8) across offices: Belgium, France, US, EPO, WIPO, ROW]
In theory: global count of priority filings.
In practice: patents are counted at one reference office.
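The difference between the two counting rules can be illustrated with a minimal Python sketch. The firms, invention identifiers, and office codes below are made up for illustration; they are not taken from the data used in this study.

```python
from collections import Counter

# Hypothetical priority filings: (firm, invention id, office of first filing).
priority_filings = [
    ("firm_a", 1, "BE"), ("firm_a", 2, "BE"), ("firm_a", 3, "EPO"),
    ("firm_b", 4, "EPO"), ("firm_b", 5, "EPO"), ("firm_b", 6, "US"),
]

def global_count(filings):
    """Count every priority filing of a firm, whatever the office."""
    return Counter(firm for firm, _, _ in filings)

def single_office_count(filings, office="EPO"):
    """Count only the priority filings made at one reference office."""
    return Counter(firm for firm, _, off in filings if off == office)

total = global_count(priority_filings)
at_epo = single_office_count(priority_filings)

# Both firms have the same global count, but the share seen at the
# reference office differs from firm to firm.
for firm in sorted(total):
    print(firm, total[firm], at_epo[firm], at_epo[firm] / total[firm])
```

In this toy example the single office count ranks the two firms differently from the global count, which is harmless only if the share observed at the office is unrelated to firm characteristics.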

Which is fine (but noisy)... unless the decision to select one office is not random
[Figure: relationship between firm size and innovation. Horizontal axis: firm size (small to large). Vertical axis: level of innovation (number of patents per year, low to high). Series: global count and count at EPO, with firm i marked.]

The objective of this research is twofold
1. Study whether the single office count leads to a selection bias or just to noise
2. Propose a methodology to identify potential selection biases when the researcher is limited to information collected at one office

AGENDA

0. Context
1. Motivations
2. The problem and proposed solution
3. Data
4. Empirical analysis
5. Conclusions

MOTIVATIONS
The single office count is very popular
Early evidence that it may lead to a selection bias

Single office count is a widespread practice
Random sample of 20 recent articles in A* or A journals that estimate IPF for European firms:
17 use a single office count (mostly EPO)
1 uses two offices (EPO + national)
2 use a broader count
All articles provide very little information on the patent indicator used

Risk of a selection bias
No study has explicitly looked at this question
Some results suggest that the filing route is not random:
Seip (2010): large Dutch companies are much more likely than SMEs to file at the EPO
van Zeebroeck and van Pottelsberghe (2011) and Jensen, Thomson and Yong (2011): patent value affects the filing route
This raises the spectre of a selection bias

PROBLEM AND PROPOSED SOLUTION
Exploit information on the mix between priority filings and second filings

The selection of patents may bias estimates of IPF
The true unobserved output for firm i is (in log-additive form):
$\ln(P_i) = x_i \beta + \varepsilon_i$ [IPF]
Only a fraction $\pi_i$ of the output is observed at the reference office:
$\ln(\pi_i) = x_i \alpha + u_i$
Hence the observed output $P_i^{o}$ can be written as:
$\ln(P_i^{o}) = \ln(\pi_i P_i) = \ln(\pi_i) + \ln(P_i) = x_i (\alpha + \beta) + \varepsilon_i + u_i$
Selection bias if $\alpha \neq 0$.
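A small simulation may help to see the argument at work. The sketch below assumes a single regressor x, a log output equal to beta * x plus noise, and a log observed share equal to alpha * x plus noise. It uses ordinary least squares on simulated continuous logs rather than a count model, and it ignores the constraint that the observed share cannot exceed one. All parameter values and function names are hypothetical.

```python
import random

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def simulate(alpha, beta=0.5, n=20000, seed=0):
    """Return (slope with true log output, slope with observed log output)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # True log output: beta * x + noise.
    ln_p = [beta * xi + rng.gauss(0.0, 0.3) for xi in x]
    # Log share observed at the reference office: alpha * x + noise.
    ln_pi = [alpha * xi + rng.gauss(0.0, 0.3) for xi in x]
    # Observed log output is the sum of the two.
    ln_obs = [a + b for a, b in zip(ln_pi, ln_p)]
    return ols_slope(x, ln_p), ols_slope(x, ln_obs)

for alpha in (0.0, 0.2):
    true_slope, observed_slope = simulate(alpha)
    print(f"alpha={alpha}: true={true_slope:.3f}, observed={observed_slope:.3f}")
```

With alpha equal to zero the observed slope stays close to beta and the single office count only adds noise. With a non-zero alpha the observed slope moves towards alpha + beta.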

Objective: looking for randomness in the selection
We would like to test whether π is random
π is not observed, so direct inference is impossible
The patenting process gives us one piece of information on the structural form of π: the patents observed at the reference office are of two types. Priority filings are filed directly at the reference office, and second filings are filed at the reference office at a later stage

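Objective: looking for randomness in the selection
The variable π can thus be expressed in a generic way as
$\pi_i = \pi_i^{PF} + (1 - \pi_i^{PF}) \, \pi_i^{SF}$
where $\pi_i^{PF}$ is the share of output filed as priority filings at the reference office and $\pi_i^{SF}$ is the share of the remaining output that reaches it as second filings.
The variable π depends on x when at least one of the two components depends on x. In this case, the following ratio
$\frac{\pi_i^{PF}}{\pi_i} = \frac{\pi_i^{PF}}{\pi_i^{PF} + (1 - \pi_i^{PF}) \, \pi_i^{SF}}$
also depends on x. This ratio is known to the researcher.
(Limited) risk of false positives and false negatives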
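The decomposition can be checked with a few lines of Python. The sketch assumes that the share of output filed as priority filings at the reference office is called pi_pf and that the share of the remaining output arriving as second filings is called pi_sf. These names and the numbers are hypothetical. The last part constructs one case in which π changes while the ratio does not, as an illustration of how a false negative could arise.

```python
def observed_share(pi_pf, pi_sf):
    """Share of a firm's output seen at the reference office (PF plus SF)."""
    return pi_pf + (1.0 - pi_pf) * pi_sf

def pf_ratio(pi_pf, pi_sf):
    """Share of priority filings among everything seen at the office."""
    return pi_pf / observed_share(pi_pf, pi_sf)

def pf_ratio_from_counts(n_pf, n_sf):
    """Same ratio, computed from counts the researcher can observe."""
    return n_pf / (n_pf + n_sf)

# Hypothetical firm with 100 inventions.
inventions = 100
pi_pf, pi_sf = 0.3, 0.5
n_pf = pi_pf * inventions                    # priority filings at the office
n_sf = (1.0 - pi_pf) * pi_sf * inventions    # second filings at the office
print(observed_share(pi_pf, pi_sf), pf_ratio(pi_pf, pi_sf),
      pf_ratio_from_counts(n_pf, n_sf))

# Illustrative false negative: pick a second firm whose pi_sf is chosen so
# that the ratio is unchanged although the observed share differs.
target = pf_ratio(0.3, 0.5)
pi_pf_2 = 0.4
pi_sf_2 = pi_pf_2 * (1.0 - target) / (target * (1.0 - pi_pf_2))
print(observed_share(pi_pf_2, pi_sf_2), pf_ratio(pi_pf_2, pi_sf_2))
```

The ratio needs only counts of priority filings and second filings at the reference office, which is why it is available even when the global count is not.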

DATA
Novel data on the whole population of patents by Belgian companies

Three databases are used
Three waves of O&O statistieken by the Ministerie van de Vlaamse Gemeenschap: survey data on R&D (2002-2008)
Bureau van Dijk's Belfirst: administrative data
Patstat database by the OECD-EPO: data on patents (2000-2007)

                 Full sample       Subsample (N = 345)
                 N      Mean       Min    Mean    Max      Std. Dev.   Diff.
EMP (FTE)        861    536        4      608     5,685    933         *
R&D (mio)        762    14         0      25      1,153    129         ***
AGE              871    30         1      31      151      28
COMP (c)         902    2.12       1      2.12    3        0.57        -
COMP_LOC (d)     946    0.07       0      0.04    1        -           -
COMP_REG (d)     946    0.28       0      0.26    1        -           -
COMP_WORLD (d)   946    0.65       0      0.70    1        -           -

The identification of patents proceeds in three steps
1. Among all the priority filings, identify those by Belgian inventors (de Rassenfosse et al.)
2. Identify the companies
3. Match these companies with R&D data
[Figure: population of priority patent applications filed worldwide (i.e. regardless of the patent office), narrowed down step by step to lists of companies (Cy 1, Cy 2, Cy 3, Cy 4, etc.)]

Even though 85% of all the patents are observed (through PF and SF), there is partial or no information for half the companies in the sample
[Figure: three charts with the categories None / Some / All: 27% none, 45% some, 28% all; 42% no PF at EPO, 26% some, 31% all; 30% none, 34% some, 36% all]
Correct information for 53% of companies
Partial information for 34% of companies
No information for 13% of companies

EMPIRICAL ANALYSIS
The empirical analysis proceeds in two steps

Step 1: Innovation production function (IPF)
IPF are estimated as Poisson models (Hausman et al., 1984):
$E[P_{it} \mid x_{it}, \eta_i] = \exp(x_{it} \beta + \eta_i) = \mu_{it} \nu_i$ for $i = 1, \dots, N$ and $t = 1, \dots, T$
where the fixed effect is approximated with the pre-sample mean of the patent series (Blundell et al., JE, 2002)
Three dependent variables: the true count, the count of PF at EPO, and the count of PF and SF at EPO
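A minimal sketch of how pre-sample information could be turned into regressors is given below. It assumes one common way of applying the pre-sample mean approach: the log of the firm's average pre-sample patent count enters as a control, together with a dummy for firms without any pre-sample patent. The function name and the data layout are hypothetical, and any link to the PRE_PAT and NO_PRE_PAT variables in the regression tables is an assumption rather than something stated on the slides.

```python
import math

def pre_sample_controls(patents_by_year, first_sample_year):
    """Return (log pre-sample mean, dummy for no pre-sample patents).

    patents_by_year maps a year to the firm's patent count in that year.
    Years strictly before first_sample_year form the pre-sample period.
    """
    pre = [n for year, n in patents_by_year.items() if year < first_sample_year]
    mean = sum(pre) / len(pre) if pre else 0.0
    if mean > 0:
        return math.log(mean), 0
    # The log is undefined at zero: set the control to 0 and flag the firm.
    return 0.0, 1

# Hypothetical firms; the estimation sample is assumed to start in 2000.
firm_with_history = {1998: 2, 1999: 4, 2000: 7, 2001: 5}
firm_without_history = {1998: 0, 1999: 0, 2000: 1, 2001: 3}

print(pre_sample_controls(firm_with_history, 2000))
print(pre_sample_controls(firm_without_history, 2000))
```

The two returned values would then be added to the regressors of the Poisson model in place of the unobserved firm effect.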

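Step 2: Selection equation
The test for a selection bias is estimated as a Bernoulli model following Papke and Wooldridge (JAE, 1996):
$E[r_{it} \mid x_{it}] = h(x_{it} \gamma)$
where $r_{it}$ is the ratio of priority filings to all filings observed at the reference office and h(.) is a link function such as the logistic function.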
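The selection equation can be sketched in plain Python as a Bernoulli quasi-likelihood with a logistic link, fitted by Newton's method. This is an illustrative toy with one regressor and an intercept, not the estimation code behind the results in this deck. The function names and the data are hypothetical, and the robust standard errors needed for actual inference are left out.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_fractional_logit(x, y, n_iter=25):
    """Maximise sum(y*log(p) + (1-y)*log(1-p)) with p = logistic(a + b*x).

    y holds fractions in [0, 1], here the share of PF among filings.
    Returns the intercept a and the slope b.
    """
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = logistic(a + b * xi)
            w = p * (1.0 - p)
            g0 += yi - p              # score with respect to a
            g1 += (yi - p) * xi       # score with respect to b
            h00 += w                  # negative Hessian entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negative Hessian) * step = score.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical data: the PF share rises with x (for example log firm size).
x_grid = [i / 10 for i in range(-20, 21)]
shares = [logistic(-0.5 + 1.0 * xi) for xi in x_grid]
a_hat, b_hat = fit_fractional_logit(x_grid, shares)
print(a_hat, b_hat)
```

A slope that is clearly different from zero, judged against a robust standard error, would indicate that the mix of priority filings and second filings varies with x, which is the signal of a selection bias that the test looks for.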

Step 1 (IPF): columns (1) to (4). Step 2 (selection equation): column (5).

                   (1)          (2)          (3)          (4)          (5)
Dep. variable:     True count   True count   PF at EPO    PF+SF at EPO Ratio
ln(emp)            0.470 ***    0.453 ***    0.386 ***    0.432 ***    -0.042
                   (0.092)      (0.099)      (0.107)      (0.109)      (0.303)
ln(rd/emp)         0.276 ***    0.267 ***    0.434 ***    0.228 ***    0.806 *
                   (0.073)      (0.074)      (0.127)      (0.083)      (0.471)
ln(age)            -0.010       0.006        -0.496 ***   -0.001       -0.736 **
                   (0.120)      (0.142)      (0.143)      (0.165)      (0.337)
COMP                            0.062        1.004 ***    0.120        1.618 **
                                (0.219)      (0.204)      (0.241)      (0.682)
PRE_PAT            0.347 ***    0.35 *       -0.220       0.463 **
                   (0.171)      (0.181)      (0.305)      (0.211)
NO_PRE_PAT         0.284        0.323        -0.787 ***   0.410
                   (0.312)      (0.331)      (0.402)      (0.344)
NO_PATENT                                                              -34.246 ***
                                                                       (0.415)
Constant           -4.723 ***   -4.874 ***   -5.354 ***   -4.989 ***   -1.791
                   (0.659)      (0.624)      (0.702)      (0.695)      (3.037)
Industry dummies   Y ***        Y ***        Y ***        Y ***        Y ***
Year dummies       Y ***        Y ***        Y **         Y ***        Y
Observations       388          345          345          345          345
Log pseudolik.     -47          -525         -477         -284         -440
R2                 0.55         0.55         0.57         0.51         0.80

Step 1 (IPF): columns (1) to (3). Step 2 (selection equation): column (4).

                   (1)          (2)          (3)          (4)
Dep. variable:     True count   PF at EPO    PF+SF at EPO Ratio
ln(emp)            0.459 ***    0.414 ***    0.441 ***    0.080
                   (0.092)      (0.103)      (0.099)      (0.090)
ln(rd/emp)         0.264 ***    0.576 ***    0.215 **     1.098 **
                   (0.079)      (0.131)      (0.086)      (0.073)
ln(age)            0.016        -0.366 **    0.023        -0.513 *
                   (0.130)      (0.159)      (0.149)      (0.123)
COMP_LOC           -0.660       -1.439       -1.617 ***   18.317 ***
                   (1.058)      (1.327)      (0.599)
COMP_REG           -0.028       0.461        -0.221       0.734
                   (0.217)      (0.351)      (0.243)
PRE_PAT            0.341 **     -0.243       0.441 **
                   (0.170)      (0.295)      (0.182)
NO_PRE_PAT         0.334        -0.507       0.437
                   (0.328)      (0.390)      (0.344)
NO_PATENT                                                 -49.734 ***
                                                          (0.467)
Constant           -4.799 ***   -4.755 ***   -4.794 ***   -1.549
                   (0.660)      (0.712)      (0.719)      (3.420)
Industry dummies   Y ***        Y ***        Y ***        Y ***
Year dummies       Y ***        Y **         Y ***        Y
Observations       345          345          345          345
Log pseudolik.     -298         -438         -48          -477
Pseudo R2          0.55         0.52         0.52         0.80

CONCLUSION
Two contributions

Wrap up
We look at the widespread practice of using a single reference office for counting patents
1. The single office count biases estimates of the IPF
2. We propose a simple way to test for the existence of a selection bias
The methodology makes it possible to detect biases arising from both PF and SF
It is silent about the direction of the bias

Implications
A global count is warranted.
If limited to one office, report estimates for (1) priority filings, (2) total filings (i.e. priority filings and second filings), and (3) the determinants of the proxy variable. If the coefficients are not significant in (3), one can be reasonably confident that the selection bias does not affect the findings.
Application to the competitive environment of the firm: the patent indicator used affects the findings. The effect of competition on innovation is observed only with international, high-value patents.
Empirical studies have not generated clear conclusions about the relationship between innovation and competition (Gilbert, 2006): future studies should pay particular attention to the way patents are counted.

Thank you. (gaetand@unimelb.edu.au)

Counting both priority filings and second filings increases the number of observations
Distribution of priority filings: EPO 45%, Belgium 25%, ROW 25%, US 5%
Share of priority filings identified when second filings are taken into account: EPO 85%
85% of Belgian patents end up at the EPO
Belgium has one of the highest rates of patents transferred to the EPO
The Belgian case is a very strong test of our claim: if a bias exists with Belgian data, it is likely to be worse for other countries