PubH 6052 FINAL EXAM
1. What is a codebook? Who might create one, what would she or he include in it, and what
purpose or purposes would it serve?
2. Explain the distinction between variable labels and value labels in an electronic dataset.
3. True or False: The linear regression model cannot handle curvilinear relationships between
independent and dependent variables.
4. True or False: By convention, if we conduct a statistical hypothesis test and obtain a p-value
of .3, we would reject the null hypothesis.
5. Suppose you have two SPSS datasets. The first contains the variables ID, X1, X2, and X3 for
participants 1 through 100; the second contains the variables ID, X4, X5, and X6 for the same
100 participants. Suppose that the datasets are named EvalPre.sav and EvalPost.sav, and are
saved on your computer in the following file location:
C:\Documents and Settings\Evaluation\EvalData\
And suppose, finally, that you want to combine these datasets to create a new dataset, to be
named EvalPrePost.sav, containing ID and X1 through X6 for all 100 participants. What SPSS
syntax would you use to accomplish this?
6. A research team is studying cognitive decline in old age. They collect data on 300 people
between the ages of 75 and 95 years. One of the key variables is a measure of one particular
aspect of cognitive functioning: Executive function (named EXFUNC in the dataset). For this
study it is measured using a test that produces values ranging from 0 to 100, with higher values
representing better executive function. The investigators fit a linear regression model to their
data and obtain the following estimated model:
EXFUNC_i = 161.73 – 1.05 AGE_i + e_i
According to this model, by how many points does the typical score on the executive function
scale decline between age 80 and 90?
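As a rough illustration of how a fitted line like the one in question 6 is read, the Python sketch below evaluates the estimated equation at two ages and takes the difference. The helper name `predicted_exfunc` is hypothetical, the error term is ignored, and this is only a sketch of the arithmetic, not SPSS syntax.

```python
# Coefficients copied from the estimated model in question 6.
INTERCEPT = 161.73
AGE_SLOPE = -1.05


def predicted_exfunc(age):
    """Predicted executive function score at a given age (hypothetical helper)."""
    return INTERCEPT + AGE_SLOPE * age


# Difference in predicted scores between two ages. The intercept cancels,
# so the result equals the slope multiplied by the age gap.
change = predicted_exfunc(90) - predicted_exfunc(80)
print(round(change, 2))
```

Because the model is linear in AGE, the same change applies to any ten-year gap within the range of the data.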
7. Suppose your boss gives you a dataset and asks you to run frequencies on the variables X1
and X4, and descriptive statistics on the variables X2, X3, X5, and X6. What SPSS syntax
would you use to accomplish this task? (Please present only the command(s) that generate the
frequencies and descriptive statistics.)
8. Suppose your dataset has a variable, X1, that was derived from a questionnaire item with
response options ranging from Strongly Disagree (coded 1) to Strongly Agree (coded 5).
Because the wording of this item runs in the opposite direction of the wording of several related
items, you want to create a reverse-coded version of this variable on which Strongly Disagree
will be coded 5 while Strongly Agree will be coded 1. What SPSS syntax would you use to
accomplish this task?
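The reverse-coding idea in question 8 can be sketched outside SPSS as follows. This Python fragment is only an illustration under the assumption of a 1-to-5 response scale; the function name `reverse_code` and the sample responses are hypothetical and not part of the exam.

```python
def reverse_code(value, low=1, high=5):
    """Reverse a code on a low..high scale, e.g. 1 <-> 5 and 2 <-> 4 (hypothetical helper)."""
    if not low <= value <= high:
        raise ValueError(f"value {value} is outside the {low}-{high} scale")
    # Subtracting from (low + high) mirrors the code around the scale midpoint.
    return low + high - value


# Made-up responses covering every category of the scale.
responses = [1, 2, 3, 4, 5]
reversed_responses = [reverse_code(v) for v in responses]
print(reversed_responses)
```

The midpoint (3) maps to itself, which is a quick check that the reversal was applied to the intended scale.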
9. An investigator interested in regional differences in breastfeeding attitudes and practices
conducts a national survey. The survey includes a multi-item instrument measuring
breastfeeding attitudes. The resulting breastfeeding attitudes scale takes values ranging from 1 to
5, with higher numbers representing more favorable attitudes toward breastfeeding. This scale
score is named BRATT in the dataset. The dataset also includes a variable named region which
takes the following values: 1 = Northeast, 2 = Southeast, 3 = Midwest, 4 = Southwest, 5 = Rocky
Mountains, and 6 = West Coast. The investigator creates a set of dummy variables and runs a
linear regression model with the breastfeeding attitudes scale as the dependent variable. The
estimated model is
BRATT_i = 3.68 + 0.57 NORTHEAST_i – 0.13 SOUTHEAST_i – 0.41 SOUTHWEST_i – 0.04 ROCKIES_i +
0.77 WESTCOAST_i + e_i
According to this model, what is the mean score on the breastfeeding attitudes scale for
respondents residing in the Midwest?
10. Suppose you have a dataset with items X7, X8, X9, and X10 and you want to sum these
items to create a scale score with variable name SumX. What SPSS syntax would you use to
accomplish this?
11. Suppose you have a dataset that contains a variable, named BMI, that gives measured BMI
values for 100 adults. And suppose that you want to create a new variable, BMI3CAT, that
categorizes participants as normal, overweight, or obese according to the following scheme.
BMI | BMI3CAT
< 25 | 1
≥ 25 and < 30 | 2
≥ 30 | 3
What SPSS syntax would you use to accomplish this?
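To make the boundaries in the question 11 table concrete, the following Python sketch applies the same cut-points to a single BMI value. It is an illustration only: the function name `bmi3cat` is hypothetical, it assumes a non-missing numeric BMI, and it is not the SPSS syntax the question asks for.

```python
def bmi3cat(bmi):
    """Map a BMI value to the three categories in the table (hypothetical helper)."""
    if bmi < 25:
        return 1  # normal: BMI below 25
    if bmi < 30:
        return 2  # overweight: BMI of at least 25 but below 30
    return 3      # obese: BMI of 30 or more


# Values placed on and around the cut-points to show which side each falls on.
for value in (24.9, 25.0, 29.9, 30.0):
    print(value, bmi3cat(value))
```

Note that the boundary values 25 and 30 belong to the higher category in each case, which is the detail most easily miscoded.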
12. Some researchers claim that exclusive breastfeeding of infants from birth to six months of
age can boost a child’s intelligence. Others are skeptical and believe that previous findings to
this effect may be attributable to confounding by maternal socioeconomic status. That is, higher
socioeconomic status mothers may be more likely to practice exclusive breastfeeding through six
months of age; and high maternal socioeconomic status may contribute to the development of
intelligence in the child through mechanisms other than breastfeeding. An investigator studying
these issues has a dataset on 1422 mother-child dyads. The dataset contains the following key
variables: CHILDIQ, the intelligence of the child as measured via the Stanford-Binet IQ test at
age 6 years, with higher scores indicating greater intelligence; BRSTFD, the mother’s self-report
of whether or not she breastfed the child exclusively through six months of age (0 = no, 1 = yes);
and MOMSES, an index of maternal socioeconomic status derived from information about
educational attainment, income, and her own parents’ occupations. The investigator finds that
the mean score on the Stanford-Binet IQ test was 107.32 among the 454 children who were
exclusively breastfed for six months; and 102.55 for the 968 children who were not exclusively
breastfed through six months of age. Thus, the average breastfed child had an IQ 4.77 points
higher than the average non-breastfed child. To determine the extent to which this difference
could be attributable to confounding by maternal socioeconomic status rather than to an actual
effect of breastfeeding on intelligence, the investigator next runs the following linear regression
model:
CHILDIQ_i = β0 + β1 BRSTFD_i + β2 MOMSES_i + e_i.
If the difference is due in part to confounding by maternal SES, how would the value of the
coefficient β1 in this model likely compare to the raw difference of 4.77?
13. True or False: A boxplot is a useful way of examining the distribution of a dichotomous
variable.
14. Suppose you have a dataset that includes the variable BMI3CAT as described in question 11,
and you wish to create a bar chart that shows how many people fall into the three categories:
normal, overweight, and obese. What SPSS syntax would you use to obtain that bar chart?
15. Suppose you are trying to use linear regression analysis to determine whether the effect of
one variable, X1, on another variable, Y, depends upon the value taken by a third variable, X2.
What type of term should you include in your regression model?
A. A curvilinear term
B. A logistic term
C. An interaction term
D. An orthogonal term
E. None of the above
16. True or False: A cross-tabulation is a useful way of examining how two categorical variables
are related.
17. Suppose you have obtained from SPSS the correlation matrix in Appendix 1. According to
the information in this matrix, which two variables exhibit the strongest linear relationship?
18. In your own words, what is “confounding” and why is it sometimes a problem in
observational studies?
19. Suppose that your agency has been evaluating an intervention using a posttest-only control
group design. There were 155 people in the treatment group and 140 in the control group. The
outcome variable is continuous and, according to boxplots, appears to follow a bell-shaped
distribution with similar variances in the treatment and control groups. What statistical test
would be most appropriate for testing the null hypothesis of no intervention effect?
20. Suppose your agency has been evaluating an intervention using a one-group pretest-posttest
design with 20 participants. The focal variable is continuous but, in looking at the boxplots, you
see that it is skewed heavily to the left both before and after the intervention. What statistical test
would be most appropriate for testing the null hypothesis of no intervention effect (i.e., no
change from before to after the intervention)?
21. Suppose you wish to include a nominal variable that takes four different values as an
independent variable in a logistic regression model. How many dummy variables should you
include in your logistic regression model in order to accomplish this?
22. True or False: ANCOVA is often used to test the null hypothesis of no intervention effect in
the context of an impact evaluation using pretest-posttest control-group design with a continuous
dependent variable.
23. Suppose your agency has been evaluating an intervention using a one-group pretest-posttest
design with 200 participants. The focal variable is dichotomous. What statistical test would be
most appropriate for testing the null hypothesis of no intervention effect (i.e., no change from
before to after the intervention)?
24. A linear regression model with a single dummy variable predicting a continuous dependent
variable is equivalent to which of the following statistical tests?
A. Fisher’s Exact Test
B. Independent samples t-test (unequal variances version)
C. Paired t-test
D. Mann-Whitney U test
E. Independent samples t-test (equal variances version)
25. Suppose that your agency has been evaluating an intervention using a posttest-only control
group design. There were 95 people in the treatment group and 111 in the control group. The
outcome variable takes the following three ordinal values: normal, overweight, and obese. What
statistical test would be most appropriate for testing the null hypothesis of no intervention effect?
26. True or False: When running a two-sample t-test, if the Levene’s test gives a p-value of .023,
you should look at the equal variances rather than the unequal variances version of the t-test.
27. True or False: Before you run a logistic regression model, you must first use a COMPUTE
command to enact the log-odds or logit transformation on the dichotomous dependent variable.
28. Suppose you want to conduct a chi-square test of independence and Fisher’s exact test for
the relationship between two dichotomous variables. The first variable, named TREAT, is coded
0 for control group members and 1 for treatment group members. The second variable, named
POSTSMOKE, indicates smoking status assessed one month after the intervention being
evaluated; it is coded 0 for participants who were not smoking, and 1 for participants who were
smoking, at that time. What SPSS syntax would you use to obtain these hypothesis tests?
29. Suppose you want to conduct a paired t-test to see if post-intervention knowledge scores,
KNOWPOST, differ significantly on average from pre-intervention knowledge scores,
KNOWPRE, in a dataset containing information on people exposed to the intervention. What
SPSS syntax would you use to obtain the paired t-test?
30. What do you get when you exponentiate a logistic regression coefficient?
A. A relative risk
B. A hazard ratio
C. A quadratic term
D. An odds ratio
E. An intercept
31. Suppose you want to conduct a Mann-Whitney U test to test for an effect of an
intervention on a continuous but highly skewed outcome in a small-scale experiment using a
posttest-only control group design. The variable TREAT is coded 0 for control group members
and 1 for treatment group members. The continuous but skewed outcome variable is a measure
of blood glucose; in the dataset it is named GLUCOSE. What SPSS syntax would you use to
obtain the Mann-Whitney U test?
32. Suppose that the analysis of data from an impact evaluation using a posttest-only control
group design with a dichotomous outcome variable, SICKPOST, resulted in the SPSS output
appearing in Appendix 2. How would you quantify the estimated effect of the intervention in
terms of a relative risk, and do the statistical tests provide support for the effectiveness of this
intervention?
33. Suppose that you are part of a team that has been analyzing data from an impact evaluation
that used a one-group pretest-posttest design. Measures of social support were obtained before and
after the intervention using the same instrument. They are continuous. The pre-intervention
social support variable appears in the dataset as SSPRE, and the post-intervention appears as
SSPOST. A paired t-test was conducted and you have been presented with the output appearing
in Appendix 3. How would you quantify the estimated effect of the intervention in terms of a
mean difference, and do the statistical tests provide support for the effectiveness of this
intervention?
34. Suppose you have a dataset containing survey data on 783 adolescents. For each adolescent,
you have a variable EVERSEX that indicates whether she or he has ever had sexual intercourse (0
for no, 1 for yes). You’re interested in how the likelihood of sexual activity varies in relation to
three variables: perceived parental disapproval of teen sexual activity, perceived peer norms
valuing sexual activity, and age. Perceived parental disapproval is a scale score derived from
multiple Likert-type questionnaire items, and is named PARDIS in your dataset. Perceived peer
norms is also a scale score, and is named PEERNORM. Age is a continuous variable computed
as the difference between the date of data collection and each respondent’s date of birth, and is
named AGE in your dataset. What SPSS syntax would you use to run a logistic regression model
with these three independent variables predicting EVERSEX?
35. Suppose that your team has been analyzing data from an impact evaluation that used a
posttest-only control group design. There were 227 participants in the treatment group, and 236
in the control group. The key outcome is a scale score measuring social support. In the dataset,
the experimental group variable is named TREAT, and takes the value 0 for control group
members and 1 for treatment group members; while the outcome variable is named SOCSUPP.
A member of your team used SPSS to conduct a two-sample t-test, and you have been presented
with the output appearing in Appendix 4. How would you quantify the estimated effect of the
intervention in terms of a mean difference, and does the statistical test provide support for the
effectiveness of the intervention?
36. One more question. Suppose your team has been analyzing data from an impact evaluation
that used a one-group pretest-posttest design. In the evaluation 50 participants were classified as
either sick or not sick both before and after the intervention. The pre-intervention measure
appears in the dataset as a variable named SICKPRE, taking values 0 for not sick and 1 for sick.
The post-intervention measure uses the same coding scheme and is named SICKPOST. A
member of your team has used McNemar’s test to see whether the proportion of participants who
were sick declined significantly over the course of the intervention, and you have been presented
with the output appearing in Appendix 5. Looking at this output, did the proportion classified as
sick increase or decrease over the course of the intervention, and was this change statistically
significant?