3. Calculation of undercount

3.2 Preliminary calculations of undercount
3.3 Final calculations of undercount
3.4 Comparison of preliminary and final calculations
3.6 Limitations of PES calculations and resulting estimates
Table 3.1: Preliminary and final undercount by province and EA type
Table 3.2: Undercount of household by province

3.1 Introduction

The same basic methodology was used to calculate both the preliminary and final undercount rates. The difference arose in the data used to calculate the components. The basic steps involved in calculating undercount are as follows.

Firstly, the data from the PES is used to calculate an adjustment factor:

Adjustment factor = (No. of people in PES in scope of census) / (No. of people in PES counted in census)

Secondly, this adjustment factor is applied to the census counts:

Population estimate = Adjustment factor × census count

Lastly, the undercount rate can be calculated:

Undercount rate = (Population estimate − census count) / Population estimate
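The three steps in the introduction can be illustrated with a short worked sketch. The function names and all figures below are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from Census 96 or the PES.

```python
def adjustment_factor(pes_in_scope, pes_counted):
    """Step 1: ratio of PES persons in scope of the census to PES persons counted in it."""
    if pes_counted <= 0:
        raise ValueError("number of PES persons counted in the census must be positive")
    return pes_in_scope / pes_counted


def population_estimate(factor, census_count):
    """Step 2: apply the adjustment factor to the census count."""
    return factor * census_count


def undercount_rate(estimate, census_count):
    """Step 3: share of the estimated population that the census did not count."""
    return (estimate - census_count) / estimate


# Hypothetical stratum: 1 000 PES persons in scope, 900 of them counted in the census,
# and a census count of 45 000 persons.
factor = adjustment_factor(1000, 900)
estimate = population_estimate(factor, 45000)
rate = undercount_rate(estimate, 45000)
print(f"factor={factor:.4f} estimate={estimate:.0f} undercount={rate:.1%}")
```

Note that the undercount rate depends only on the adjustment factor: it equals 1 minus the reciprocal of the factor, so the census count cancels out within a stratum.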
3.2 Preliminary calculations of undercount

The preliminary undercount rates were based solely on the answers to the question in the PES: "Was the person counted in the census?". People who responded "Yes, here" or "Yes, elsewhere" were treated as enumerated in the census, while people who responded "No" were treated as missed. People who did not respond, or responded "Don't know", were excluded from the estimation.

Calculation of the preliminary undercount rate was performed within the strata of province and EA type. The resulting adjustment factors were applied to preliminary census counts determined from a 25% sample of census boxes. No adjustment for undercount was made for persons in hostels and institutions (see Appendix E for definitions of EA types including hostels and institutions). For more information on the preliminary calculations of undercount, see the publication on Census 96 preliminary estimates.

This simple method was used because the discrepancies in boundaries between census and PES EAs, as described in Section 2.7 on matching, meant that alternatives such as comparing counts in the census and PES were not feasible.

3.3 Final calculations of undercount

The final calculations of undercount were, in effect, a combination of responses to the PES question and the matching results. The matching results were, in themselves, insufficient for the calculations because:
- EAs corresponding to hostels were excluded from the matching process, since the hostel population tended to be very mobile and the quality of the resulting PES data was often poor, with many missing responses. As mentioned earlier, a different method had to be used to calculate weights for persons in hostels and institutions. This is described below in section 3.3.3 on weighting.

- Matching had also been impossible for a few PES EAs where, even after visiting provinces and using all other available information, a corresponding census EA could not be located. This was largely the result of administrative problems; most of the people in these EAs reported that they had been enumerated. These 23 EAs were therefore not used in the calculation of the final estimates. In effect, it was assumed that the magnitude and distribution of undercount in the remaining EAs was similar. After the exclusion of the 18 EAs for hostels and the 23 EAs for which matching was impossible, the sample used for final calculation of the undercount was 765 EAs.

- Matching was limited to comparing the PES questionnaire with the census questionnaire at the same address. In some other countries, an attempt is made to confirm whether people who said they were counted somewhere other than the PES address were in fact enumerated. This was not attempted in South Africa as the circumstances, such as a lack of precise address information and the practical difficulty of accessing individual questionnaires elsewhere, made this too difficult. As a result, the questionnaire did not include questions on alternative addresses. The main impacts of this were the increased role of imputation and the incomplete identification of overcount, although attempts were made to address this, as explained below.

After matching, all persons were classed as either matched, missed or unresolved.
Those coded as matched or missed were considered resolved and were assigned a probability of having been counted of 1 or 0, respectively, for use in subsequent calculations. For those coded as unresolved, a probability of being enumerated was imputed in order to produce estimates of undercount. The probability was calculated from the resolved cases by making use of a technique known as CHAID (Chi-square Automatic Interaction Detection).

The CHAID model was run on the resolved cases, excluding people who reported being enumerated elsewhere (see section 3.3.2 below), with a probability of 1 allocated to a matched case and a probability of 0 to a missed case, as noted above. The following variables were used as predictors in the CHAID analysis: EA type, the response to the question whether the person was counted, age group, gender, population group and household size. The CHAID technique ranks the predictors by the strength of their predictive power and so identifies the statistically significant predictors and the interactions between them.

This analysis was done separately for each province. In every instance, the response to the question on whether a person was counted was indicated as the most significant predictor. Further predictors varied between the different provinces. Thus, for each province, the CHAID model created a number of branches from combinations of categories of the predictive variables. The proportion of persons enumerated among the resolved cases was calculated within each CHAID branch. This was interpreted as the estimated probability of being enumerated for cases with the characteristics defined by each branch. For example, in Mpumalanga, for persons aged between 19 and 48 who were in households of 4 to 8 persons and who said they were counted, the probability of enumeration was 0,9113. Accordingly, each unresolved record in each province was allocated the appropriate probability of enumeration.
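The allocation of probabilities described above can be sketched as follows. This is only an illustration of the final step, assuming the CHAID branches have already been determined: it does not grow a CHAID tree, and the branch labels, record layout and data are hypothetical rather than taken from the PES.

```python
from collections import defaultdict


def branch_probabilities(records):
    """Proportion of matched cases among the resolved cases in each branch.

    records: iterable of (branch, status) pairs, where status is
    'matched', 'missed' or 'unresolved'. Unresolved cases are ignored here.
    """
    resolved = defaultdict(int)
    matched = defaultdict(int)
    for branch, status in records:
        if status in ("matched", "missed"):
            resolved[branch] += 1
            if status == "matched":
                matched[branch] += 1
    return {branch: matched[branch] / resolved[branch] for branch in resolved}


def allocate_probabilities(records, probabilities):
    """Assign 1 to matched, 0 to missed, and the branch proportion to unresolved cases."""
    allocated = []
    for branch, status in records:
        if status == "matched":
            p = 1.0
        elif status == "missed":
            p = 0.0
        else:
            p = probabilities[branch]
        allocated.append((branch, status, p))
    return allocated


# Hypothetical PES records in two branches, A and B.
records = (
    [("A", "matched")] * 9 + [("A", "missed")] * 1 + [("A", "unresolved")] * 2
    + [("B", "matched")] * 3 + [("B", "missed")] * 1 + [("B", "unresolved")] * 1
)
probabilities = branch_probabilities(records)
allocated = allocate_probabilities(records, probabilities)
```

In this hypothetical data, unresolved cases in branch A receive the proportion matched among the ten resolved cases in that branch, and likewise for branch B.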
A number of people with responses of "Yes, elsewhere" to the PES question on where they were counted in the census were, in fact, found on the census questionnaire at the PES address. This raises the possibility that these people were counted more than once (overcounted) if they were actually counted elsewhere, as reported. The fact that census enumeration took place over an extended period, during which time many people moved around, meant that there was some scope for people to be overcounted in Census 96.

It was possible to use the information gathered in the PES to make allowance for some aspects of overcount. Accordingly, the imputation process was extended to impute the probability that people who responded "Yes, elsewhere" were correct and were enumerated elsewhere, based on the accuracy of responses of "Yes, here". However, it was not possible, given the various constraints relating to logistics, time and funding, to conduct an exercise that would give a comprehensive estimate of overcount. While the allowance for overcount is not complete (for example, there is no allowance for people being counted more than twice), it still goes some way towards addressing the issue.

Once a probability of being enumerated had been allocated to every PES record, another model was used to apply the results to the final census data. This modelling technique, XAID, was used to determine the appropriate weighting classes and the associated weights to be applied to each person record in the census. XAID is a version of the methodology used for CHAID, but with a continuous dependent variable rather than a dichotomous (that is, 0 or 1) dependent variable. In the XAID analysis, the allocated probability of being enumerated was taken as the dependent variable. The set of predictors was the same as in the CHAID analysis above, with the exception of the response to the question on whether a person was counted, since this variable is not applicable to census records.
The XAID analysis was run on all PES records in each province. The significant variables and their order of appearance in the XAID branches varied between the provinces. Age group and household size, however, figured prominently. The XAID model determined combinations of the predictive variables that were significant in modelling the probability of being enumerated. The characteristics defined by the XAID branches were then taken as the weighting classes, and the average values of the probability of being enumerated were interpreted as the estimated counted rates. The reciprocal values of these counted rates were taken as the weights associated with all census records falling in the identified weighting classes. For example, in Mpumalanga, for African persons aged 19 to 48 in households of 1 to 5 persons and in informal urban areas, the counted rate was 0,8339, yielding a reciprocal of 1,1992, which was the respective weight.

Once the weights had been calculated, they were applied to final census data and checked for anomalies. Where anomalies were identified, adjustments were made, although this occurred in very few cases and the adjustments were minor. For example, some situations were identified where the age distribution within a population group was unrealistically distorted by weights that had adjusted adjoining age groups by very different amounts. Other anomalies were also found with respect to EA type. Accordingly, the XAID results were re-examined and the weighting classes used, or the weights within the classes, were recalculated to smooth the distortions. In all, adjustments were made in five provinces, usually with negligible impact on the overall undercount rate. In some cases, the undercount decreased very slightly; in others it increased.

For the weighting of persons in hostels and institutions (as determined by EA type as detailed in Appendix E), a different approach was taken.
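The relationship between counted rates and weights for persons outside hostels and institutions, as described above, can be sketched as follows. The weighting classes are assumed to be given already (the XAID step itself is not reproduced), and the class labels and probabilities in the second part are hypothetical; only the counted rate of 0,8339 comes from the Mpumalanga example in the text.

```python
from statistics import mean


def weight_from_counted_rate(counted_rate):
    """Weight for a weighting class: the reciprocal of its counted rate."""
    if not 0 < counted_rate <= 1:
        raise ValueError("counted rate must lie in (0, 1]")
    return 1.0 / counted_rate


def class_weights(pes_records):
    """Counted rate and weight per weighting class.

    pes_records: iterable of (weighting_class, probability_of_enumeration) pairs.
    The counted rate is the average allocated probability within the class.
    """
    by_class = {}
    for weighting_class, probability in pes_records:
        by_class.setdefault(weighting_class, []).append(probability)
    result = {}
    for weighting_class, probabilities in by_class.items():
        rate = mean(probabilities)
        result[weighting_class] = (rate, weight_from_counted_rate(rate))
    return result


# The Mpumalanga example from the text: a counted rate of 0,8339.
mpumalanga_weight = round(weight_from_counted_rate(0.8339), 4)  # 1.1992

# Hypothetical PES records for one weighting class (label and values are illustrative).
weights = class_weights([("class X", 1.0), ("class X", 1.0), ("class X", 0.0), ("class X", 0.8)])
```

Each census record in a class would then be multiplied by that class's weight, so a class with a lower counted rate contributes more to the adjusted population.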
It was not possible to use the results from the XAID as information on household size and EA type was not applicable. Instead, categories were developed manually based on examination of the data and using combinations of population group, province, urban/non-urban, age and sex. Within these categories, the appropriate adjustment factor was calculated based on all the records in the PES and applied to each person enumerated in hostels and institutions in the census.

The final weighting matrices used are presented in Appendix C, Table C.3 (persons not in hostels and institutions) and Table C.4 (persons in hostels and institutions).

3.4 Comparison of preliminary and final calculations

The undercount rate in the preliminary estimates was 6,8% while the final adjustment rate was 10,7%. This change is not surprising given the differences in the methodology. The preliminary estimates relied solely on the accuracy of the responses of the household informant as to who in the household had or had not been enumerated. There are a number of reasons why these responses may not be accurate:
The difference between relying on people's responses and actually looking at the census questionnaire was the main reason for the difference between preliminary and final undercount rates. This can be seen in the detailed table comparing the responses and the matching results (before and after imputation), which is included in Appendix C, Table C.2.

However, there were some other factors that also affected the comparison. The various corrections to records carried out during the matching process and the finalisation of the dataset would also have had an impact. Thus, if estimates were calculated using the same methodology as for the preliminary estimates but based on the final dataset, they would differ slightly from those published as preliminary estimates.

In addition, the method used for applying the calculated adjustment factors differed. The adjustment factors for the preliminary estimates were calculated in and applied to each EA type within each province. For the final estimates, adjustment factors took into account a wider range of factors: not only EA type but also age, gender, population group and household size. Adjustment factors by EA type for the preliminary estimates are thus not immediately comparable with those for the final estimates. Nevertheless, the table below compares preliminary and final estimates by EA type for each province.

Table 3.1: Preliminary* and final undercount by province and EA type
* As published in Table 3, Census 96: Preliminary estimates (CSS, 1997).

The undercount rate increased between the preliminary and final estimates in all provinces and most EA types. Some provinces and EA types were affected more than others. In particular, tribal areas, and Northern Province and Eastern Cape, which have a large proportion of the population in tribal areas, showed the greatest differences between the preliminary and final estimates. This may reflect persons believing that they were counted at a tribal level, even though a census enumerator had not visited them. Thus, the preliminary undercount rate in tribal areas was surprisingly low (3,3%) while the final estimate, incorporating the matching results, was similar to other EA types (10,2%, compared to 10,7% for South Africa).

The only areas where a decrease occurred were farming areas in some provinces. Particularly in provinces with a small proportion of farmers, such as Gauteng, this may reflect the difference in the method of calculating and applying the adjustment factors for the final estimates. However, it is also possible, for example, that some people on farms were enumerated indirectly, i.e. without their knowledge, by farmers or other farmworkers, and that this was picked up in the matching process.

Undercount of households was not addressed for the preliminary estimates, when the focus was on producing an estimate of the number of persons in South Africa. However, it was important for the final estimates to ensure that the person and household estimates from the census corresponded. It was also important to provide accurate data for planning purposes, as mentioned in Section 1, since the census obtained information on households' access to services including electricity, water and telephones.

The method used to calculate the undercount of households for the final release of census data was similar to that used for persons.
During matching, whether or not a household had been enumerated in the census was recorded on the matching sheet and this was the basis for establishing the undercount, along with whether the household reported being enumerated in the census.

In the PES, each householder was asked whether the household was visited by the census. However, this question was not used in the final undercount calculation. Instead, a variable was derived from the responses of the persons in the household to the question "Was this person counted in the census?". This was used in preference to the question asked of households because the captured responses for households sometimes contradicted those for persons and the household responses appeared to be less accurate.

As with individuals, during matching households were coded as resolved (either matched or missed) or unresolved. Where the match status for a household was unresolved, a probability of enumeration was imputed using a CHAID model. The most important factor in the model was the derived variable concerning whether a household had been visited in the census. Other factors included in the model were household size, EA type and population group of the first person in the household.

Once the imputation was completed, weighting classes and weights were calculated for the census data using an XAID model based on a similar set of characteristics: household size, EA type and population group. Again, as with individuals, the weights varied within different combinations of categories in each province. For example, in Mpumalanga, for a household of five or more persons with the first person African and in an informal urban area, the counted rate was 0,9199, yielding a reciprocal of 1,0871, which was the respective weight. The weighting matrices used are presented in Appendix C, Table C.5.

The final undercount rates for households are shown in Table 3.2.

Table 3.2: Undercount of households* by province
* excluding institutions and hostels

People could be missed in the census either as a result of being missed within an enumerated household, or because the entire household which they were in was missed. Calculations based on the final undercount rates for persons and households indicate that just under two-thirds of the people missed in the census were in missed households.

3.6 Limitations of PES calculations and resulting estimates

The 1996 PES was an advance on that conducted in the counted portion of South Africa in 1991, notably in undertaking matching and being conducted countrywide soon after enumeration. Even so, it was subject to a number of factors that affected the quality of the data. It is important to note these limitations in order to use the results effectively and to improve procedures for the next PES. Despite the limitations, analysis of the final calculations indicates that they appear to yield a fairly accurate representation of the undercount in the 1996 population census.

Given the limited planning time of a year before the full-scale enumeration, it was difficult to give full attention and planning to the PES. This meant that the methodologies and procedures had to be revised through the stages of the PES. Stats SA intends to conduct a thorough study to evaluate the 1996 PES and to improve the methodology and implementation of the PES for the 2001 census exercise, in order to ensure that all the required information is collected efficiently and the accuracy of the undercount calculations is increased.

Some of the problems already mentioned involved questionnaire design, data entry and matching. It is this last problem that probably had the greatest impact on the final estimates of undercount. The matching process is inherently difficult, even in countries where most of the country has a formal address system and the expected undercount is low.
For instance, there is always a tendency for the PES and census alike to miss the same people who are difficult to contact or do not want to be identified, which may lead to the estimate of undercount being slightly understated.

In South Africa, additional problems were encountered in matching, as has been explained in Section 2.7. A conservative approach was taken towards matching and, if there was any doubt about whether or not a household or person was enumerated, it was set to unresolved. This could happen if, for example, the addresses in a particular area were vague and there were a number of households in the PES and census which did not appear to match, but which could not conclusively be said to have been missed. Differences in names and household structure between the census and PES mean that there could be some matches of households and persons that are not at first apparent. It is impossible to resolve such situations without revisiting the EA, so these persons and households were set to unresolved.

As a result of this approach, the proportion of persons and dwellings coded as unresolved tended to be fairly high, with 22% of all persons in the PES coded as unresolved. While it is usually easy to confirm a match (the address and most of the occupants are the same), it is often difficult to confirm a miss, particularly where addresses are vague as in the example above. Thus, a number of missed households and persons would have been coded as unresolved and, as a result, this would have lowered the estimated rate of missed households and persons.

It is not possible to predict any overall bias of the undercount calculations, as they were also subject to bias in other directions resulting from the matching exercise and other factors. However, it is possible that the final indications of undercount are slight underestimates, given the high unresolved rate.
The issues concerning matching for the 1996 PES will be addressed in the development of the post-enumeration survey for the next census. The methodology behind matching and estimation will be further developed, which will lead to improvements in many areas. For example, clarifications in procedures for matching coders should enable them to resolve a greater proportion of cases. In addition, improved mapping and administrative procedures should reduce the difficulty of locating EAs where households may have been enumerated. In particular, the GIS should remove the problems encountered with boundaries differing between the census and PES, simplifying the matching process and increasing the accuracy of estimates.

Despite these problems, the ambitious and arduous matching exercise for the 1996 PES provided valuable information for the calculation of final undercount rates, and the experience indicates that some form of matching can be performed successfully in South Africa.