PSY721 - Advanced Tests and Measurements Study Guide
Reliability, in a broad statistical sense, is synonymous with:
consistency.
A source of error variance may take the form of:
item sampling, test takers' reactions to environment-related variables such as room temperature and
lighting, and test taker variables such as amount of sleep the night before a test, amount of anxiety, or
drug effects (all of these.)
Which type of reliability estimate is obtained by correlating pairs of scores from the same person (or
people) on two different administrations of the same test?
a test-retest estimate
A reliability coefficient is:
an index, the proportion of total variance attributable to true variance, and unaffected by a systematic source
of error (All of these.)
What is the difference between alternate forms and parallel forms of a test?
Alternate forms do not necessarily yield test scores with equal means and variances.
Which of the following types of reliability estimates is the most expensive due to the costs involved in
test development?
alternate-form
An estimate of test-retest reliability is often referred to as a coefficient of stability when the time
interval between the test and retest is more than:
6 months.
As the reliability of a test increases, the standard error of measurement:
decreases.
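This inverse relationship follows from the standard formula SEM = SD·√(1 − r), where r is the reliability coefficient. A minimal sketch (the function name and the SD = 15 example are illustrative assumptions, not from the guide):

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r): shrinks as reliability grows."""
    return sd * math.sqrt(1 - reliability)

# Illustration: with SD = 15, raising reliability from .80 to .95
# cuts the SEM roughly in half.
print(round(standard_error_of_measurement(15, 0.80), 2))  # 6.71
print(round(standard_error_of_measurement(15, 0.95), 2))  # 3.35
```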
Which type of reliability estimate would be appropriate only when evaluating the reliability of a test that
measures a trait that is relatively stable over time?
test-retest
Which of the following is true of systematic error?
It has no effect on the reliability of a measure.
Computer-scorable items have tended to eliminate error variance due to:
scorer differences.
Which of the following might lead to a decrease in test-retest reliability?
the passage of time between the two administrations of the test, coaching designed to increase test
scores between the two administrations of the test, and practice with similar test materials between the
two administrations of the test (All of these.)
If items from a test are measuring the same trait, estimates of reliability yielded from KR-20 will typically
be ________ as compared to estimates from split-half methods.
higher
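KR-20 is computed as r = (k/(k−1))·(1 − Σpq/σ²), where p is the proportion passing each item and σ² is the variance of total scores. A hedged sketch of that formula (the function name, the sample 0/1 response matrix, and the use of the population variance are assumptions for illustration):

```python
def kr20(item_responses):
    """Kuder-Richardson formula 20 for dichotomous (0/1) items.
    item_responses: one list of 0/1 item scores per examinee."""
    n = len(item_responses)          # number of examinees
    k = len(item_responses[0])       # number of items
    # proportion of examinees passing each item
    p = [sum(person[i] for person in item_responses) / n for i in range(k)]
    pq = sum(pi * (1 - pi) for pi in p)
    totals = [sum(person) for person in item_responses]
    mean = sum(totals) / n
    var_total = sum((t - mean) ** 2 for t in totals) / n  # population variance
    return (k / (k - 1)) * (1 - pq / var_total)

# Four examinees, three items, near-Guttman response pattern:
print(kr20([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]))  # 0.75
```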
Which of the following is TRUE for estimates of alternate- and parallel-forms reliability?
Two test administrations with the same group are required, Test scores may be affected by factors such
as motivation, fatigue, or intervening events like practice, learning, or therapy, and Item sampling is a
source of error variance (All of these.)
If traditional measures of reliability are applied to criterion-referenced tests, the reliability estimates
will likely be:
spuriously high.
Test-retest estimates of reliability are referred to as measures of ________, and split-half reliability
estimates are referred to as measures of ________.
stability; internal consistency
For a heterogeneous test, measures of internal-consistency reliability will tend to be ________
compared with other methods of estimating reliability.
lower
Which of the following factors may influence a split-half reliability estimate?
fatigue, anxiety, and item difficulty (all of these.)
KR-20 is the statistic of choice for tests with which types of items?
multiple-choice and true-false (all of these.)
The Spearman-Brown formula is used for:
estimating the reliability of a whole test from the reliability of one half of the test, determining how many
additional items are needed to increase reliability up to a certain level, and determining how many
items can be eliminated without reducing reliability below a predetermined level (all of these.)
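Both uses follow from the Spearman-Brown prophecy formula, r_new = n·r / (1 + (n − 1)·r), which can also be solved for the lengthening factor n. A minimal sketch (function names and the numeric examples are illustrative assumptions):

```python
def spearman_brown(r, n):
    """Predicted reliability when test length is multiplied by n
    (n = 2 corrects a half-test correlation up to full length)."""
    return n * r / (1 + (n - 1) * r)

def length_needed(r, r_target):
    """Factor by which the test must be lengthened to reach r_target."""
    return r_target * (1 - r) / (r * (1 - r_target))

# A half-test correlation of .60 implies a full-test reliability of about .75:
print(round(spearman_brown(0.60, 2), 2))   # 0.75
# To raise a reliability of .75 to .90, the test must be about 3x as long:
print(round(length_needed(0.75, 0.90), 2))  # 3.0
```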
Typically, adding items to a test will have what effect on the test's reliability?
Reliability will increase.
Which of the following is NOT an acceptable way to divide a test when using the split-half reliability
method?
Assign easy items to one half of the test and difficult items to the other half.
Coefficient alpha is appropriate to use with all of the following test formats EXCEPT:
essay exam with no partial credit awarded.
Which of the following is TRUE about coefficient alpha?
It is a characteristic of a particular set of scores, not of the test itself.
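Because alpha is computed from a particular set of scores, it can be sketched directly from the score matrix using α = (k/(k−1))·(1 − Σσ²ᵢ/σ²_total); with 0/1 items it reduces to KR-20. The function name, the small polytomous data set, and the population-variance choice below are illustrative assumptions:

```python
def coefficient_alpha(item_scores):
    """Cronbach's alpha; item_scores is one list of numeric item
    scores per examinee (items may be polytomous)."""
    n = len(item_scores)         # examinees
    k = len(item_scores[0])      # items

    def var(xs):                 # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(var([person[i] for person in item_scores]) for i in range(k))
    total_var = var([sum(person) for person in item_scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Three examinees, two Likert-style items that rank examinees consistently:
print(round(coefficient_alpha([[5, 4], [4, 4], [2, 1]]), 3))  # 0.968
```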
A police officer mistakenly records the blood alcohol level of a suspected drunk driver after
administering a breathalyzer test. This mistake is most related to which type of reliability?
interscorer
A coefficient alpha over .9 may indicate that:
the items in the test are redundant.
Which best conveys the meaning of an inter-scorer reliability estimate of .90?
Ninety percent of the variance in the scores assigned by the scorers was attributed to true differences
and 10% to error.
If a time limit is long enough to allow test takers to attempt all items, and if some items are so difficult
that no test taker is able to obtain a perfect score, then the test is referred to as a ________ test.
power
If a test is homogeneous:
it is functionally uniform throughout, it will likely yield a high internal-consistency reliability estimate
compared with test-retest, and it would be reasonable to expect a high degree of internal consistency
(all of these.)
Which type(s) of reliability estimates would be most appropriate for a measure of heart rate?
test-retest
Typically, speed tests:
contain items of a uniform difficulty level.
Which type(s) of reliability estimates would be appropriate for a speed test?
test-retest, alternate-form, and split-half from two independent testing sessions (all of these.)
Generalizability theory is most closely related to
test reliability.
In classical test theory, there exists only one true score. In Cronbach's generalizability theory, how many
of these true scores exist?
many, depending on the number of different universes
Traditional measures of reliability are inappropriate for criterion-referenced tests because variability:
is minimized with criterion-referenced tests.
A test is considered valid when the test:
measures what it purports to measure.
Face validity refers to:
the appearance of relevancy of the test items.
Which is NOT a method of evaluating the validity of a test?
evaluating the percentage of passing and failing grades on the test
Predictive and concurrent validity can be subsumed under:
criterion-related validity
Face validity:
may influence the way the test-taker approaches the situation, relates more to what the test appears to
measure than what the test may actually measure, and has received little attention and is given short
shrift as compared to other indices of validity (all of these.)
Which assessment technique has the MOST face validity?
administering a word processing test to a person applying to be a word processor
Relating scores obtained on a test to other test scores or data from other assessment procedures is
typically done in an effort to establish the __________ validity of a test.
criterion-related
An instructor announces that an examination will cover the topics of reliability and validity. A student
boasts that he will read and study only the material on reliability. In fact, all the test questions are only
on reliability. The best conclusion a student of assessment could draw from this is that:
the examination lacked content validity.
Before constructing a comprehensive final examination, your instructor reviews the objectives of the
course, the textbook, and all lecture notes. Your instructor is making an effort to maximize the
__________ validity of the final examination.
content
Lawshe devised a method for determining agreement among raters or judges who rate items on how
essential they are. This method provides a way to quantify what type of validity?
content
In calculating the content validity ratio, panelists are asked to determine:
if the skill or knowledge measured by the item is essential.
A standard against which a test or test score is evaluated is known as:
a criterion.
The minimum value of a content validity ratio necessary to be statistically significant at the .05 level is
dependent on:
the number of panelists judging the items.
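Lawshe's content validity ratio is CVR = (nₑ − N/2)/(N/2), where nₑ is the number of panelists rating the item "essential" and N is the panel size; the critical value for significance rises as the panel shrinks. A minimal sketch (the function name and panel sizes are illustrative assumptions):

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's CVR: -1 when no panelist rates the item essential,
    0 at an even split, +1 when all panelists rate it essential."""
    half = n_panelists / 2
    return (n_essential - half) / half

# 9 of 10 panelists rate an item essential:
print(content_validity_ratio(9, 10))   # 0.8
# An even split yields a CVR of zero:
print(content_validity_ratio(5, 10))   # 0.0
```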
Which may best be viewed as varieties of criterion-related validity?
concurrent validity and predictive validity
The form of criterion-related validity that reflects the degree to which a test score is correlated with a
criterion measure obtained at the same time that the test score was obtained is known as:
concurrent validity.
The form of criterion-related validity that reflects the degree to which a test score correlates with a
criterion measure that was obtained some time subsequent to the test score is known as:
predictive validity.
A key difference between concurrent and predictive validity has to do with:
the time frame during which data on the criterion measure are collected.
Which is an example of a criterion?
achievement test scores, success in being able to repair a defective toaster, and student ratings of
teaching effectiveness (all of these.)
An index of utility can be distinguished from an index of reliability and an index of validity in that an
index of utility can tell us something about:
the practical value of the information derived from what a test measures.
Test validity:
sets a ceiling on test utility.
One of the noneconomic benefits of a diagnostic test used to make decisions about involuntary
hospitalization of psychiatric patients is a benefit to:
society-at-large.
Costs associated with testing include all of the following EXCEPT:
return on investment.
The end-point of a utility analysis is typically an educated decision about:
which of many possible courses of action is optimal.
A utility analysis is conducted using:
expectancy tables, Naylor-Shine tables, and Taylor-Russell tables (All of these.)
If targeted test-takers for a particular test consistently fail to follow the directions for taking the test,
then:
the test could still have great utility and the test could still be valid (both of these.)
Validity is to ____________ as utility is to ____________.
accuracy; usefulness
A potential noneconomic benefit of a well-run evaluation program is:
increase in quantity and quality of workers' on-the-job performance, decrease in time it takes to train
new workers, and reduction in the number of workplace accidents (All of these.)
The Angoff method of setting cutting scores relies heavily on:
the judgment of experts.
The "Achilles heel" of the Angoff method is:
interrater reliability.
A hospital uses a compensatory model of selection in hiring surgeons. In their hiring evaluations, ratings
regarding past safety record are given more weight than ratings regarding the surgeon's "bedside
manner." From this, one could reasonably conclude that the people who are in charge of hiring surgeons
believe that:
bedside manner is less important compared to surgical safety.
The term item-mapping refers to an IRT-based method of:
setting cut scores that entails an ordering or histographic representation of test items.
Which of the following is a direct economic cost that could result as a consequence of NOT evaluating
personnel for employment positions within a large corporation?
the cost of lawsuits against the corporation
The idea for a new test may come from:
social need, review of the available literature, and common sense appeal (all of these.)
This term is used to refer to the preliminary research surrounding the creation of a prototype of a test:
pilot work, pilot study, and pilot research (all of these.)
Often used for the purpose of licensing persons in professions, these tests are called:
criterion-referenced tests
Likert scales measure attitudes using continuums. A continuum of items measuring ___________ could
be used for a Likert scale.
like it or not, agree/disagree, and approve/do not approve (All of these.)
Test items that contain alternatives with five points ranging from "strongly agree" to "strongly disagree"
are characterized as using this approach to scaling:
Likert scaling.
Guttman scales:
typically are constructed so that agreement with one statement may predict agreement with another
statement.
Which is an example of the selected-response item format?
a multiple-choice item
Having a large item pool available during test revision is:
an advantage because poor items can be deleted in favor of the good items.
A well-written true-false item:
has a correct response that is veritably true or false, and not subject to debate.
Computer-adaptive testing has been found to:
reduce by as much as half the number of test items administered.
Item branching refers to:
administering certain test items on a test depending on the test-takers' responses to previous test
items.
Which statement is TRUE of the test tryout phase of test construction?
Test conditions should be as similar to the actual administration as possible.
The item-validity index is key in determining:
criterion-related validity.
An item-difficulty index of 1 occurs when:
all examinees answer the item correctly.
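The item-difficulty index p is simply the proportion of examinees who answer the item correctly, so it equals 1 only when everyone passes. A minimal sketch (the function name and the 0/1 response vectors are illustrative assumptions):

```python
def item_difficulty(responses):
    """Item-difficulty index p: proportion of correct (1) responses;
    higher p means an easier item."""
    return sum(responses) / len(responses)

print(item_difficulty([1, 1, 1, 1]))  # 1.0  (every examinee correct)
print(item_difficulty([1, 0, 1, 0]))  # 0.5  (half the examinees correct)
```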