Pusan National University

Posted: 2018.03.12

Updated: 2018.03.30

Author: 이혜영

2018.03.30 (Fri) (Dr. 조문정 / US Bureau of Labor Statistics)

An invited lecture will be held as detailed below. Everyone interested is warmly invited to attend.

1. Date and time: Friday, March 30, 2018, 11:00 a.m. to 11:45 a.m.

2. Venue: Department of Statistics seminar room (Room 222, Natural Sciences Research and Laboratory Building, 자연대연구실험동)

3. Speaker: Dr. 조문정 (US Bureau of Labor Statistics, Office of Survey Methods Research)

4. Title: Classification and Regression Trees and Forests for Imputing Data from Sample Surveys

Abstract

Analysis of sample survey data often requires adjustments to account for missing values in the outcome variables of principal interest. Standard adjustment methods based on item imputation or on propensity weighting factors rely heavily on the availability of auxiliary variables for both responding and non-responding units. Their application can be challenging in cases for which the auxiliary variables are numerous and are themselves subject to substantial incomplete-data problems. This paper shows how classification and regression trees and forests can overcome these difficulties and compares them with traditional likelihood methods in terms of estimation bias and mean squared error. The development is centered on a component of income data from the U.S. Consumer Expenditure Survey, which is subject to a relatively high rate of item missingness. Classification tree and forest methods are used to model the unit-level propensity for item missingness in the income component. Regression tree and forest methods are used to model the conditional mean structure of the income component.

Both sets of methods are then used to produce estimators of the mean of the income component, adjusted for item nonresponse. Thirteen methods for estimating a population mean are compared in a series of simulation experiments. The results show that if the number of auxiliary variables with missing values is not small, or if they have substantial missingness, likelihood methods can be rendered impracticable or even inapplicable.

Tree and forest methods are always applicable, are relatively fast, and have higher efficiency than likelihood methods under real-data situations with incomplete-data patterns similar to those in the abovementioned survey. Their efficiency loss under conditions ideal for likelihood methods is observed to be between 10% and 25%.
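
Illustrative sketch (not from the talk): the two adjustment strategies named in the abstract, propensity weighting and imputation from a conditional mean, can be tried on a toy example. The code below is a minimal sketch under strong assumptions. The leaf labels "A" and "B" stand in for the terminal nodes of an already fitted tree, the income values are made up, and the function names are hypothetical. It does not fit any tree or forest and is not the speaker's implementation.

```python
from collections import defaultdict

# Hypothetical toy data: (leaf label, income); None marks item nonresponse.
# The leaf label stands in for the terminal node of an already fitted tree.
records = [
    ("A", 10.0), ("A", 12.0), ("A", None), ("A", None),
    ("B", 30.0), ("B", 34.0), ("B", 32.0), ("B", None),
]


def leaf_summaries(records):
    """Per leaf: estimated response propensity and respondent mean."""
    counts = defaultdict(lambda: [0, 0, 0.0])  # [units, respondents, income total]
    for leaf, income in records:
        cell = counts[leaf]
        cell[0] += 1
        if income is not None:
            cell[1] += 1
            cell[2] += income
    # Assumption: every leaf contains at least one respondent.
    return {
        leaf: {"propensity": r / n, "mean": total / r}
        for leaf, (n, r, total) in counts.items()
    }


def respondent_mean(records):
    """Unadjusted mean over respondents only."""
    observed = [income for _, income in records if income is not None]
    return sum(observed) / len(observed)


def propensity_weighted_mean(records, leaves):
    """Weight each respondent by the inverse of its leaf's response propensity."""
    numerator = denominator = 0.0
    for leaf, income in records:
        if income is not None:
            weight = 1.0 / leaves[leaf]["propensity"]
            numerator += weight * income
            denominator += weight
    return numerator / denominator


def imputed_mean(records, leaves):
    """Fill each missing income with its leaf's respondent mean, then average."""
    filled = [
        income if income is not None else leaves[leaf]["mean"]
        for leaf, income in records
    ]
    return sum(filled) / len(filled)


leaves = leaf_summaries(records)
print("respondent mean:         ", round(respondent_mean(records), 3))
print("propensity-weighted mean:", round(propensity_weighted_mean(records, leaves), 3))
print("imputed mean:            ", round(imputed_mean(records, leaves), 3))
```

In this toy setting the low-income leaf has the lower response rate, so the unadjusted respondent mean is pulled upward, while both adjusted estimators correct for it. They coincide here only because the same leaves are used for the propensity model and the mean model. In the setting the abstract describes, the classification and regression models are fitted separately and the two estimators would generally differ.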