Implicit in all of the argument which has been advanced in favor of measuring results is the contention that education means change. If changes are brought about in the children who are being educated, then there must be the possibility of measurement. These changes may take place in habit, in knowledge, in methods of work, in interests and ideals, and in power of appreciation. Probably no one would question the possibility of measuring the change which takes place in the formation of habits. We have quite commonly been willing to measure growth in knowledge by tests which demand that students not only remember facts, but that they show some ability to apply them. Whether or not a student commands a particular method of work can be determined by observation of his method of procedure as well as by the results that he secures. If interests or ideals are changed, there necessarily follows a change in activity. Any real power of appreciation will be accompanied by some change in expression.

The fact that we do not yet have scales or units of measurement which will enable us to evaluate accurately the results obtained in all of the different forms of school activity is not an argument against the possibility of measurement. In any field the development of units of measurement is dependent upon careful investigation, and upon a realization of the imperfection of the units already used. It is only as we insist upon measurement that we can hope to have our units refined. Take for example the problem of grades or marks which are commonly assigned to students as a measure of their efficiency in doing school work. Any investigation of these units will show that there is very great variation in their application by different members of the teaching corps.

The way to bring about a remedy is not to abolish all marks or grades, but rather to study the problem of the proper distribution of marks, and, if necessary, to weight differently the marks of different instructors. The more imperfect the unit of measure which we now apply, the greater the need for insisting upon accurate measurement.

The first step in the development of scientific inquiry in any field is found in accurate description of the phenomena involved. The demand that we measure results in education is simply a demand that the basis for scientific investigation be made available by means of this accurate description of the situation as it exists. Some investigators in education have already been able to take the further steps in scientific inquiry which have enabled them to foretell with considerable accuracy the results which might be expected in education under given conditions. Further progress is, however, dependent upon that sort of measurement which will discover problems which are not now clearly defined or which have not yet been thought of. Of course, as inferences are made in the light of the problems suggested, there will be still further necessity for accurate measurement. When those who are charged with the responsibility of determining educational policy appeal to fact rather than to opinion, when we are able to evaluate accurately that which we achieve, educational progress will be assured and a profession of education will have been established.

We shall always have most excellent work in teaching done by those for whom scientific investigation, as such, means little. The investigations made will, however, modify the practice of these same teachers through changed demands and because of the demonstration of the validity of the new method of work by those who can appreciate the significance of results achieved by investigators. It is certainly to be expected that open-minded teachers will experiment for themselves and will aid in the work of the expert who must use the schools as his laboratory. Much depends upon the coöperation and good will of all who are engaged in teaching. It is not too much to expect that the spirit of investigation will be found in large measure to characterize those whose privilege it is to provide the situation in which intellectual development is stimulated.

An example of a study involving the coöperation of the pupils, teachers, and supervisory officers of twenty-six separate schools or school systems is Dr. C. W. Stone’s study on Arithmetical Abilities and Some of the Factors Determining Them. The following quotations from Dr. Stone’s study indicate the type of result which we may hope to get from such investigations.

In Reasoning

The scores for the reasoning problems were determined from the results of two preliminary tests—one, giving one hundred 6th grade pupils all the time they needed to do the problems as well as they could in the order as printed (see page 11); and another, giving one hundred 6th grade pupils all the time they needed to do the problems as well as they could in the reverse order from that as printed. The results as tabulated below in table II show that scores for reasoning problems of Grade VI pupils can be very definitely arranged in a scale on the basis of relative difficulty. Just what the scale should be can only be determined by determining the form of distribution and the location of the zero point. From what is known of these the scale of weighting shown in the last column of table II is believed to be the best, and this is the one employed in the computations of this study. However, in order to enable the reader to satisfy himself as to which is the best method, the scores of the twenty-six systems were calculated on each of three other bases—(1) counting each problem reasoned correctly a score of 1; (2) counting each problem reasoned correctly a score based on the ratio of its difficulty as shown in the next to the last column of table II; and (3) counting the scores made on only the first six problems for which presumably all pupils of all systems had ample time. See Appendix, p. 98.
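The alternative bases of scoring described in the passage above may be illustrated by a brief sketch. It is not Dr. Stone's own computation: the weights of table II are not reproduced in this passage, so the weights used here are hypothetical, as are the function name `score_paper` and its parameters. The passage also does not say whether the third basis (first six problems only) was counted with or without weights, so the sketch leaves that choice to the caller.

```python
def score_paper(correct, weights=None, first_n=None):
    """Score one pupil's reasoning paper.

    correct  -- list of booleans, one per problem in printed order
    weights  -- optional per-problem weights (hypothetical values here);
                if omitted, each correct problem counts as 1
    first_n  -- if given, only the first `first_n` problems are counted
    """
    if first_n is not None:
        correct = correct[:first_n]
    if weights is None:
        return float(sum(1 for ok in correct if ok))
    # zip stops at the shorter sequence, so truncation by first_n is respected
    return float(sum(w for ok, w in zip(correct, weights) if ok))


# Hypothetical paper of eight problems and hypothetical difficulty weights.
paper = [True, True, False, True, False, True, True, False]
assumed_weights = [1, 1, 1, 2, 2, 2, 3, 3]

unit_score = score_paper(paper)                        # each correct problem = 1
weighted_score = score_paper(paper, assumed_weights)   # weighted by difficulty
first_six_score = score_paper(paper, first_n=6)        # first six problems only
```

Comparing the rank of a school system under each basis is then a matter of scoring the same papers three ways, which is the comparison the quoted passage offers to the reader.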

In both reasoning and fundamentals the scores used as a measure of the achievement of a system were computed by combining the scores of one hundred pupils. Where more than one hundred pupils were tested, the papers used were drawn at random, the number drawn from each class being determined by the ratio of its number to the total number tested in the system. Where less than one hundred pupils were tested, the combined scores made were raised to the basis of one hundred pupils.
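The rule for reducing every system to a basis of one hundred pupils can likewise be sketched in a few lines. This is an illustration under stated assumptions, not the procedure as Dr. Stone carried it out: the passage does not say how fractional quotas were rounded, so the sketch assumes a largest-remainder rounding, and the names `proportional_quotas` and `system_score` are hypothetical.

```python
import random


def proportional_quotas(class_sizes, target=100):
    """Number of papers to draw from each class, in proportion to class size.

    Rounding by largest remainder is an assumption; the source states only
    that the number drawn follows the ratio of class size to total tested.
    """
    total = sum(class_sizes.values())
    exact = {name: size * target / total for name, size in class_sizes.items()}
    quotas = {name: int(share) for name, share in exact.items()}
    shortfall = target - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - quotas[n], reverse=True)
    for name in by_remainder[:shortfall]:
        quotas[name] += 1
    return quotas


def system_score(scores_by_class, target=100, rng=None):
    """Combined score of a system, expressed on a basis of `target` pupils."""
    if rng is None:
        rng = random.Random()
    total = sum(len(scores) for scores in scores_by_class.values())
    if total == 0:
        raise ValueError("no pupils were tested")
    if total <= target:
        # Fewer than (or exactly) one hundred: raise the combined score
        # to the basis of one hundred pupils.
        combined = sum(sum(scores) for scores in scores_by_class.values())
        return combined * target / total
    # More than one hundred: draw papers at random, class by class.
    sizes = {name: len(scores) for name, scores in scores_by_class.items()}
    quotas = proportional_quotas(sizes, target)
    return float(sum(sum(rng.sample(scores, quotas[name]))
                     for name, scores in scores_by_class.items()))
```

Passing a seeded `random.Random` as `rng` makes the draw repeatable, which is convenient when the same system is to be scored on more than one basis.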