Ann Del Vecchio, PhD
Michael Guerrero, PhD

Evaluation Assistance Center - Western Region
New Mexico Highlands University
Albuquerque, New Mexico

December, 1995


The following terms and definitions should help the reader understand the information that follows:

Basal and Ceiling Rules are guides for minimizing testing time. Test items are arranged in order of difficulty, with the easiest item first and the most difficult item last. The chronological age of the examinee is used to identify the basal, or easiest, items that the examinee is expected to answer. The ceiling for a particular examinee is identified once a specified number of consecutive items have been missed (anywhere from 4 items to an entire page of items may need to be answered incorrectly to identify the ceiling).
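The ceiling rule described above can be sketched in a few lines of Python. This is only an illustration: the function name, the True/False response format, and the default of 4 consecutive misses are assumptions, since each published test sets its own basal and ceiling rules.

```python
def find_ceiling(responses, start_index, misses_required=4):
    """Return the index of the item at which the ceiling is reached.

    responses: list of booleans in item order (True = correct).
    start_index: hypothetical basal starting item chosen from the examinee's age.
    misses_required: illustrative number of consecutive misses that ends testing.
    Returns None if the examinee never reaches the ceiling.
    """
    consecutive_misses = 0
    for index in range(start_index, len(responses)):
        if responses[index]:
            consecutive_misses = 0  # a correct answer resets the count
        else:
            consecutive_misses += 1
            if consecutive_misses == misses_required:
                return index
    return None


# Illustrative response record, easiest item first.
responses = [True, True, True, False, True, False, False, False, False, True]
ceiling_index = find_ceiling(responses, start_index=2)
```

Items after the returned index would not be administered, which is how the rule shortens testing time.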

Correlation Coefficient is a statistical measure of the linear or curvilinear relationship between two variables, scores, or assessments. The correlation coefficient ranges from -1.0 to +1.0; when there is no relationship between two variables, it equals 0. A negative value indicates that as the value of one variable increases, the other variable tends to decrease. A positive value indicates that as one increases in value, so does the other and that as one decreases in value, so does the other. Correlation is used in both reliability and validity studies for tests.
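As a minimal sketch, the most commonly reported coefficient, the Pearson product-moment correlation, can be computed as follows. Note that Pearson's r captures linear relationships only; the function name and the sample scores are illustrative assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Made-up scores: the second list rises with the first, the third falls.
form_a = [1, 2, 3, 4]
rising = [2, 4, 6, 8]
falling = [8, 6, 4, 2]
positive_r = pearson_r(form_a, rising)
negative_r = pearson_r(form_a, falling)
```

With these made-up scores, the first pair yields a coefficient at the positive end of the range and the second pair a coefficient at the negative end.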

Criterion Referenced Test (CRT) is a test developed and used to estimate how much of the content and skills covered in a specific content area have been acquired by the examinee. Performance is judged in relation to a set of criteria rather than in comparison to the performance of other individuals tested with a norm-referenced test (NRT).

Discrete Point refers to test items that measure a single unit of content knowledge in a particular domain. These items are usually multiple choice, true-false, or fill-in-the-blank and allow for only one correct answer.

Holistic Score is the assignment of a single score that reflects an overall impression of performance on a measure. Scores are defined by prescribed descriptions of the levels of performance, examples of benchmarks at each level, or by scoring rubrics.

Interpolation is a process of using available points based on actual data to estimate values between two points. Interpolation is usually done by calculation based on the distance between two known points but also can be done geometrically by connecting the points.
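The calculation-based approach can be illustrated with simple linear interpolation, which assumes the value changes at a constant rate between the two known points. The function name and the score values below are hypothetical.

```python
def interpolate(x, x0, y0, x1, y1):
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical norms table: a raw score of 20 maps to 40, a raw score of 30 maps to 60.
# Estimate the value for a raw score of 25, which the table does not list.
estimate = interpolate(25, 20, 40, 30, 60)
```

Because 25 lies halfway between the two known raw scores, the estimate falls halfway between the two known values.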

Language Dominance refers to the general observation that a bilingual or multilingual individual has greater facility in one language than in the others. However, this linguistic facility can vary based on the context of language use (e.g., school, church) and linguistic skill (e.g., speaking, writing).

Lexical refers to the lexicon of a language and is roughly equivalent to the vocabulary or dictionary of a language. Another term used by linguists at this level of a language is semantics. Semantics is the study of meanings at the word or sentence level.

Morphological refers to how words are constructed; a morpheme is essentially the smallest unit of language which conveys meaning. Some morphemes can stand alone (e.g., test) while others can only appear in conjunction with other morphemes (e.g., ed).

Norm Group is the sample of examinees drawn from a particular population and whose test scores are used as the foundation for the development of group norms. Group norms are the statistical data that represent the average (usually mean score) results for various groups rather than the scores of individuals within one of these groups or individuals across all groups. Only to the degree that persons in the norm group are like the persons to whom one wishes to administer a test can proper interpretation of test results be made. In other words, the test is not valid for persons who are not like the norm group.

Norm Referenced Test (NRT) is an instrument developed and used to estimate how the individuals being assessed compare to other individuals in terms of performance on the test. Individual performance is judged in comparison to other individuals tested, rather than against a set of criteria (criterion referenced test) or in a broad knowledge area (domain referenced test).

Normal Curve Equivalent (NCE) is a transformation of raw test scores to a scale with a mean of 50 and a standard deviation of 21.06. NCEs permit conversion of percentile ranks to a scale that has equal intervals of performance differences across the full range of scores and that can be arithmetically manipulated. Percentile ranks cannot be used in arithmetic calculations.
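A minimal sketch of the conversion, assuming the starting point is a percentile rank and that scores are normally distributed: the percentile rank is converted to a normal deviate (z score), which is then rescaled to a mean of 50 and a standard deviation of 21.06. The function name is an assumption, and published norms tables should be preferred in practice.

```python
from statistics import NormalDist


def percentile_to_nce(percentile_rank):
    """Convert a percentile rank (between 0 and 100, exclusive) to an NCE."""
    # z score below which the given proportion of a normal distribution falls.
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return 50 + 21.06 * z


nce_at_median = percentile_to_nce(50)
nce_at_99th = percentile_to_nce(99)
```

The standard deviation of 21.06 is what makes NCEs of 1, 50, and 99 line up with percentile ranks of 1, 50, and 99.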

Phonological (graphonic) refers to the smallest distinguishable unit of sound in a language that helps convey meaning (i.e., a phoneme). In isolation, however, the phoneme /p/ means nothing. Graphonic refers to the visual, graphic representation of the phonological system of a language, which makes reading and writing possible.

Pragmatic is the dimension of language which is concerned with the appropriate use of language in social contexts. In other words, pragmatics has to do with the variety of functions to which language is put (e.g., giving instructions, complaining, requesting) and how these functions are governed depending on the social context (e.g., speaking to a teacher versus a student; at school versus at home).

Reliability is the degree to which a test or assessment consistently measures whatever it measures. It is expressed numerically, usually as a correlation coefficient. There are several different types of reliability including:

Alternate Forms Reliability is sometimes referred to as parallel forms or equivalent forms reliability. Alternate forms of a test are test forms designed to measure the same content area using items that are different yet equivalent. This type of reliability is conducted by correlating the scores on two different forms of the same test.

Intra Rater Reliability is the degree to which a test yields consistent results across different administrations when the same assessor rates the same individual performing at the same level (intra-rater).

Inter Rater Reliability is the degree to which an instrument yields the same results for the same individual at the same time with more than one assessor (inter-rater).

Internal Consistency Reliability is sometimes called split half reliability and is the degree to which specific observations or items consistently measure the same attribute. It is measured in a variety of ways including Kuder-Richardson 20 or 21, Coefficient Alpha (also known as Cronbach's Alpha), and the Spearman-Brown Prophecy Formula. These methods yield a coefficient that measures the degree of relationship between test items.
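As one illustration of an internal consistency estimate, coefficient alpha can be computed from a table of item scores. This is a sketch under stated assumptions: the function name and data layout (one row per examinee, one column per item) are hypothetical, and sample variances are used throughout.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Coefficient alpha for a list of examinees, each a list of item scores."""
    k = len(item_scores[0])                 # number of items
    columns = list(zip(*item_scores))       # scores regrouped by item
    item_variance_sum = sum(variance(column) for column in columns)
    totals = [sum(row) for row in item_scores]
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Made-up data: every item ranks the three examinees identically.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha_consistent = coefficient_alpha(consistent)

# Made-up data: the two items are unrelated to each other.
unrelated = [[1, 0], [0, 1], [1, 1], [0, 0]]
alpha_unrelated = coefficient_alpha(unrelated)
```

The first data set produces the maximum value because the items agree perfectly; the second produces a value of zero because the items share no variance.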

Rubric is sometimes referred to as a scoring rubric and is a set of rules, guidelines, or benchmarks at different levels of performance, or prescribed descriptors for use in quantifying measures of attributes and performance. Rubrics can be holistic, analytic or primary trait depending upon how discretely the defined behavior or performance is to be rated.

Stratified Sampling Design is the process of selecting a sample in such a way that identified subgroups in the population are represented in the sample in the same proportion that they exist in the population.
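Proportional allocation of this kind can be sketched as follows. The function name, the `stratum_of` callback, and the example population are assumptions made for illustration; because each subgroup's quota is rounded, the total drawn can differ slightly from the requested size when the proportions do not divide evenly.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, sample_size, seed=0):
    """Draw a sample whose subgroups mirror their share of the population."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)

    sample = []
    for members in strata.values():
        # Quota proportional to the subgroup's share of the population.
        quota = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, quota))
    return sample


# Hypothetical population: 60 examinees in subgroup "A", 40 in subgroup "B".
population = [("A", i) for i in range(60)] + [("B", i) for i in range(40)]
sample = stratified_sample(population, lambda member: member[0], sample_size=10)
```

With this population, a sample of 10 contains 6 members of subgroup "A" and 4 of subgroup "B", matching the 60/40 split.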

Syntactic level of a language is equivalent to the grammar of the language. This level (sometimes referred to as syntax) involves the way words are combined in rule-governed order.

Validity is the degree to which a test measures what it is supposed to measure. A test is not valid per se; it is valid for a particular purpose and for a particular group. Validity evidence can come from different sources such as theory, research or statistical analyses. There are different kinds of validity including:

Content Validity is the degree to which a test measures an intended content area. Item validity is concerned with whether the test items represent measurement in the intended content area. Sampling validity is concerned with how well the test samples the total content area. Content validity is usually determined by expert judgement of the appropriateness of the items to measure the specified content area.

Construct Validity is the degree to which a test measures an independent hypothetical construct. A construct is an intangible, unobserved trait such as intelligence which explains behavior. Validating a test of a construct involves testing hypotheses deduced from a theory concerning the construct.

Concurrent Validity is the degree to which the scores on a test are related to the scores on another, already established test administered at the same time, or to some other valid criterion available at the same time. The relationship method of determining concurrent validity involves determining the relationship between scores on the test and scores on some other established test or criterion. The discrimination method of establishing concurrent validity involves determining whether test scores can be used to discriminate between persons who possess a certain characteristic and those who do not, or those who possess it to a greater degree. This type of validity is sometimes referred to as criterion-related validity.

Predictive Validity is the degree to which a test can predict how well an individual will do in a future situation. It is determined by establishing the relationship between scores on the test and some measure of success in the situation of interest. The test that is used to predict success is referred to as the predictor and the behavior that is predicted is the criterion.

