Eliciting permanent and transitory undeclared work from matched administrative and survey data

  • Original Paper
  • Published in: Empirica

Abstract

We study the undeclared work patterns of Hungarian employees in relatively stable jobs, using a panel dataset that matches individual-level self-reported Labour Force Survey data with administrative records of the Pension Directorate for 2001–2006. We estimate the determinants of undeclared work using Heckman-type random-effects panel probit models, and develop a two-regime model to separate permanent and transitory undeclared work, where the latter follows a Markov chain. We find that about 6–7% of workers went permanently unreported for six consecutive years, and a further 4% were transitorily unreported in any given year. The models show lower reporting rates—especially in the permanent segment—among males, high-school graduates, those in agriculture and transport, small firms and various forms of atypical employment. Transitory non-reporting may be partly explained by administrative records missing for technical reasons. The results suggest that (1) the “aggregate labour input method” widely used in Europe can indeed be a simple yet reliable tool to estimate the size of informal employment, although it slightly overestimates the true magnitude of black work and (2) the long-term pension consequences of undeclared work may be substantial because of the high share of permanent non-reporting.

Fig. 1
Fig. 2

Source: matched LFS–NPID sample. The columns show the size distribution of all employment spells and partly or fully unreported spells. In 93.1% of cases the ratio is zero or one

Notes

  1. Examples are the consumption-income discrepancy method, the monetary method, the electricity consumption method and econometric methods such as MIMIC (Renooy et al. 2004; Schneider 2012).

  2. For instance, 11.5% of employment in full-time equivalent units for Italy in 2004 (Baldassarini 2007), 17–21% of official GDP for Slovenia between 1995 and 2004 (Nastav and Bojnec 2007), 8% of total employment for Spain in 2002 and 20–30% of GDP for Romania between 1996 and 2002 (for these and other EU country results, see GHK/FGB 2009, pp. 49 and 77).

  3. Hungarians typically have no precise information on their accumulated accrual points and expected pensions before they actually retire. The computer records (since 1988) and printed material that records registered workdays and contribution payments are available to individuals before their retirement only after a lengthy administrative procedure.

  4. Similar small differences occur if the data are reweighted according to the LFS weights that are supplied by the CSO to correct for non-response in the LFS. In the following we will use the unweighted data unless stated otherwise. For further details, see Bálint et al. (2010).

  5. For the study of such events, we selected employed workers from the 2001–2006 waves of the LFS, who had entered their jobs more than 12 months before the LFS interview and counted those who said that 12 months before the interview they had not been working.

  6. Homogeneity of the three distributions is not rejected by a chi-squared test (p-value 0.22).

  7. The likelihood function in the RE probit model is given as an integral with respect to the unobserved heterogeneity \( c_{i}^{P} \). In the MSL procedure this is approximated with adaptive Gauss–Hermite quadrature and then maximized. We use 12 integration points, but the results are not sensitive to this choice. Technically, the estimates are obtained with the xtprobit command of the Stata software (version 12).

  8. The average marginal effects are calculated using the gllapred command of the GLLAMM package of Stata, after bootstrapping the estimated model 1000 times.

  9. The raw data in Table 6 of “Appendix 1” show that 29% responded positively in the first wave and 25% in subsequent waves.

  10. Confidentiality is a major concern of the CSO. We do not know any case of misusing individual statistical data since 1946 (when census data were used to identify ethnic Germans).

  11. In addition, Table 7 of “Appendix 1” displays the composition of workers split according to whether they requested NPID data at all. Most high-risk groups (such as people with secondary school attainment, farmers, service and construction workers, those working in atypical forms of employment, casual workers, the self-employed and those in micro-firms) are roughly equally represented in the two groups of LFS respondents. Men and residents of Budapest are under-represented among those who requested NPID data, which we attribute to the low response probability of these groups in population surveys in general. Young people are obviously much less interested in their pension prospects than their older counterparts. However, workers in the first year of their tenure were less likely to make it to the merged sample.

  12. We do not use the linear specification [Eq. (1)] in models with endogenous selection, because the estimation of such models depends crucially on the distribution of the error term (e.g. normality). \( R_{it} \) is close to binary, and hence a binary model on \( Q_{it} \) is more appropriate. For the same reason we do not use a tobit-type model for \( R_{it} \), which is bounded on the unit interval, or a double hurdle model for the joint analysis of sample membership and reporting rates.

  13. Technically, the system is estimated using the gllamm package of the Stata software (Rabe-Hesketh et al. 2004).

  14. To keep the model structure relatively simple, we do not incorporate endogenous selection into the two-regime model. According to Sect. 4.2, the model with endogenous selection yields qualitatively similar results to the baseline ones.

  15. The notation here differs slightly from the previous section because \( \varvec{X}_{i} \) now does not contain the year 2008 values of the time-varying variables (work experience and tenure).

  16. This partly reflects the higher than average ratio of undeclared work among the self-employed. Of those in one-member firms, 90% reported themselves as self-employed and, conversely, more than 60% of self-employed people reported a firm size of one. Hence, although the marginal effect of self-employment is negligible in our models, Table 3 shows that self-employed people are 20 percentage points more likely to be unreported than employees.

  17. The parameters of settlement type dummies (county capitals, other urban areas and villages) become economically and statistically insignificant once the micro-regional unemployment rates and residence in Budapest are controlled for, therefore these dummies are not included in the equations.

  18. According to the LFS, the median worker spends ten consecutive years on average in the same job, provided she has spent at least 2 years there.

  19. These workers make up less than one tenth of the employment stock at any time according to LFS.

Acknowledgements

The authors would like to thank Anikó Bíró, Márton Csillag and Gábor Kézdi for useful comments on an earlier version of the paper. Péter Elek was supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences and later by the ÚNKP-17-4 New National Excellence Program of the Ministry of Human Capacities in Hungary.

Author information

Corresponding author

Correspondence to Péter Elek.

Appendices

Appendix 1

See Fig. 2 and Tables 6, 7 and 8.

Table 6 The percentage share of selected groups among respondents requesting NPID data after the first LFS visit versus later visits
Table 7 The composition of employed LFS respondents inside and outside the merged sample (per cent)
Table 8 Descriptive statistics of the simulated ratios of workers undeclared for exactly k years \( \left( {k = 1, 2, \ldots , 6} \right) \) in the 6-year period and the observed distribution in the sample (in per cent)

Appendix 2: Maximum likelihood estimation of the two-regime model

Let \( t_{i} \) denote worker \( i \)’s first year in the sample (which can take values between 2001 and 2006) and let us use the notation \( p_{it}^{\left( k \right)} = \Pr \left( {Q_{it} = k | J_{i} = 0, \varvec{ X}_{it} } \right) \) for the conditional probability of transitory non-reporting (\( k = 1 \)) and reporting (\( k = 0 \)), respectively, at time \( t \).

If \( \prod\nolimits_{{t = t_{i} }}^{2006} {Q_{it} } = 0 \) (i.e. if the person is reported at least once), then for \( k_{t} \in \{ 0, 1\} \) \( ( {t = t_{i} , \ldots , 2006} ) \):

$$ \Pr \left( {\mathop {\bigcap }\limits_{{t = t_{i} }}^{2006} \left\{ {Q_{it} = k_{t} } \right\} | \left\{ {\varvec{X}_{it} } \right\}} \right) = \left( {1 - \varPhi \left( {\varvec{X}_{i}\varvec{\beta}_{Z} } \right)} \right)*p_{{i,t_{i} }}^{{\left( {k_{{t_{i} }} } \right)}} *\mathop \prod \limits_{{t = t_{i} + 1}}^{2006} p_{it}^{{\left( {k_{t - 1} ,k_{t} } \right)}} $$
(9)

because in this case, by definition, the worker belongs to the transitory regime and then the joint probability of her undeclared pattern can be written as a product of the starting probability \( p_{{i,t_{i} }}^{{( {k_{{t_{i} }} } )}} \) and the transition probabilities \( p_{it}^{{\left( {k_{t - 1} ,k_{t} } \right)}} \).

If \( \prod\nolimits_{{t = t_{i} }}^{2006} {Q_{it} } = 1 \) (i.e. if the person is never reported), then

$$ \Pr \left( {\mathop {\bigcap }\limits_{{t = t_{i} }}^{2006} \left\{ {Q_{it} = 1} \right\} | \left\{ {\varvec{X}_{it} } \right\}} \right) = \varPhi \left( {\varvec{X}_{i}\varvec{\beta}_{Z} } \right) + \left( {1 - \varPhi \left( {\varvec{X}_{i}\varvec{\beta}_{Z} } \right)} \right)*p_{{i,t_{i} }}^{\left( 1 \right)} *\mathop \prod \limits_{{t = t_{i} + 1}}^{2006} p_{it}^{{\left( {11} \right)}} , $$
(10)

where the first term shows the contribution of the permanent regime and the second term gives the probability that a person of the transitory regime is undeclared for the whole period.
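As a minimal sketch of Eqs. (9)–(10), the likelihood contribution of a single worker can be written as follows. The function and argument names are hypothetical and do not come from the paper; the sketch assumes that the index \( \varvec{X}_{i}\varvec{\beta}_{Z} \), the start probability \( p_{{i,t_{i} }}^{\left( 1 \right)} \) and the transition probabilities \( p_{it}^{{\left( {01} \right)}} \) and \( p_{it}^{{\left( {11} \right)}} \) have already been computed and are passed in as plain numbers.

```python
import math


def norm_cdf(x):
    """Standard normal CDF, i.e. Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def worker_likelihood(q, xb_z, p_start1, p01, p11):
    """Likelihood contribution of one worker, following Eqs. (9)-(10).

    q        : list of 0/1 values Q_it for t = t_i, ..., 2006
    xb_z     : the index X_i * beta_Z of the permanent-regime probit
    p_start1 : start probability of being undeclared in year t_i
    p01, p11 : transition probabilities into the undeclared state,
               aligned with q (entry 0 is never used)
    All argument names are illustrative assumptions.
    """
    prob_permanent = norm_cdf(xb_z)

    # Probability of the observed pattern within the transitory regime:
    # start probability times the product of transition probabilities.
    path = p_start1 if q[0] == 1 else 1.0 - p_start1
    for t in range(1, len(q)):
        to_undeclared = p11[t] if q[t - 1] == 1 else p01[t]
        path *= to_undeclared if q[t] == 1 else 1.0 - to_undeclared

    transitory_term = (1.0 - prob_permanent) * path
    if all(v == 1 for v in q):
        # Never reported: either regime can generate the pattern, Eq. (10).
        return prob_permanent + transitory_term
    # Reported at least once: only the transitory regime, Eq. (9).
    return transitory_term


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    p01 = [None, 0.05, 0.06]
    p11 = [None, 0.40, 0.50]
    print(worker_likelihood([1, 1, 1], -1.5, 0.10, p01, p11))
    print(worker_likelihood([0, 1, 0], -1.5, 0.10, p01, p11))
```

A useful sanity check of such an implementation is that the contributions of all possible 0/1 patterns of a worker sum to one.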

For workers who entered the sample at the start of their current job (i.e. for whom \( C_{{i,t_{i} }} = 1 \)), Eq. (7) implies that \( p_{{i,t_{i} }}^{\left( k \right)} = p_{{i,t_{i} }}^{{\left( {0,k} \right)}} \) and hence the above likelihood calculation is complete for them. However, the majority of our observations are left-censored because \( C_{i,2001} > 1 \) for most workers who entered our sample in 2001. The missing observations from their work history could be tackled, for instance, using the expectation–maximization (EM) algorithm, or we can follow a computationally less intensive but equally satisfactory approach. Indeed, we first note that \( p_{it}^{\left( 1 \right)} \) can be calculated recursively by considering the probability of being undeclared in the previous year (\( p_{i,t - 1}^{\left( 1 \right)} \)) and the transition probabilities (\( p_{it}^{{\left( {11} \right)}} \) and \( p_{it}^{{\left( {01} \right)}} \)):

$$ p_{it}^{\left( 1 \right)} = p_{it}^{{\left( {11} \right)}} *p_{i,t - 1}^{\left( 1 \right)} + p_{it}^{{\left( {01} \right)}} *\left( {1 - p_{i,t - 1}^{\left( 1 \right)} } \right) . $$
(11)

Hence, to obtain \( p_{i,2001}^{\left( 1 \right)} \) and \( p_{i,2001}^{\left( 0 \right)} \) (only these are needed in Eqs. (9)–(10) for the likelihood), we can go back to (for example) year 1997 and approximate \( p_{i,1997}^{\left( 1 \right)} \) by the stationary distribution of the two-state Markov chain whose transition probabilities are fixed at their 1997 levels:

$$ p_{i,1997}^{\left( 1 \right)} \approx p_{i,1997}^{{\left( {01} \right)}} /\left( {p_{i,1997}^{{\left( {10} \right)}} + p_{i,1997}^{{\left( {01} \right)}} } \right) , $$
(12)

and then calculate recursively the starting values \( p_{i,2001}^{\left( 1 \right)} \) and \( p_{i,2001}^{\left( 0 \right)} \). This final step is only an approximation, because (1) the transition probabilities are slightly time-varying and (2) the Markov chain may not have reached the stationary distribution for some workers by 1997. Nevertheless, since the probability of return to the reported state, \( p_{it}^{{\left( {10} \right)}} \), turns out to be relatively large in our case (see Sect. 4.1), the corresponding Markov chain has good mixing properties, and it seems to be enough to go back 4 years in time to get a satisfactory approximation for \( p_{i,2001}^{\left( 1 \right)} \) in Eqs. (9)–(10). We note that the maximum likelihood estimates turn out to be almost identical, irrespective of whether 1996, 1997 or 1998 is used as the start year in the approximation.
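The initialization described above can be sketched in a few lines. Eq. (12) follows from Eq. (11) by setting \( p_{it}^{(1)} = p_{i,t-1}^{(1)} \) and using \( p_{it}^{(10)} = 1 - p_{it}^{(11)} \). The function names, the dictionary layout and all numerical values below are hypothetical and serve only to illustrate the recursion; they are not estimates from the paper.

```python
def stationary_prob(p01, p10):
    """Stationary probability of the undeclared state, as in Eq. (12)."""
    return p01 / (p10 + p01)


def start_prob(p01_by_year, p11_by_year, first_year=1997, target_year=2001):
    """Approximate the probability of being undeclared in target_year.

    The chain is started from the stationary distribution implied by the
    transition probabilities of first_year and then moved forward one
    year at a time with the recursion of Eq. (11).
    p01_by_year and p11_by_year map a calendar year to the transition
    probabilities of one worker (an assumed data layout).
    """
    p10_first = 1.0 - p11_by_year[first_year]
    p1 = stationary_prob(p01_by_year[first_year], p10_first)
    for year in range(first_year + 1, target_year + 1):
        p1 = p11_by_year[year] * p1 + p01_by_year[year] * (1.0 - p1)
    return p1


if __name__ == "__main__":
    # Hypothetical, slightly time-varying transition probabilities.
    years = range(1996, 2002)
    p01 = {y: 0.05 + 0.002 * (y - 1996) for y in years}
    p11 = {y: 0.40 - 0.005 * (y - 1996) for y in years}
    for first in (1996, 1997, 1998):
        print(first, start_prob(p01, p11, first_year=first))
```

With a large return probability to the reported state, the dependence on the chosen start year dies out quickly, which is the good mixing property the text refers to.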

About this article

Cite this article

Elek, P., Köllő, J. Eliciting permanent and transitory undeclared work from matched administrative and survey data. Empirica 46, 547–576 (2019). https://doi.org/10.1007/s10663-018-9403-0

  • DOI: https://doi.org/10.1007/s10663-018-9403-0
