The first and most significant result was that the students
did much more poorly than we expected. There are many
possible causes: our expectations may have been too
high, the problems may have been too hard or a poor fit
to the students' backgrounds and interests, the time
allowed may have been too short, and so on.
We did answer the question we asked in the Introduction
section: Do students in introductory computing courses
know how to program at the expected skill level? The
results from this trial assessment provide the answer
"No!" and suggest that the problem is fairly universal.
Many of the solutions would not compile due to syntax
errors. This suggests that many students have not even
acquired the technical skills needed for getting a program
ready to run. While all the results were poor, School V's
students performed significantly better than those at the
other universities.
Two important factors may have contributed to this
difference: (1) the School V instructor had given the
students a complete, worked answer to a similar problem
to study, and (2) all students were required to take the
exercise, which was given as an examination. Thus,
sources of difference among the universities in this study
could include the type of preparation, motivation on this
exercise (e.g., examination vs. extra credit), student
characteristics (e.g., volunteer vs. compulsory
participation), and issues such as curriculum and
teaching style.
The School V instructor, who gave the exercise as an
examination, applied local grading criteria in addition to
the criteria defined for this trial assessment. We found
that the correlation between the local grade and the General
Evaluation score was high, but not overwhelming. One
interpretation of this is that the two scores consider
somewhat different features. It would be interesting to
study these differences in order to gain a better
understanding of the way instructors normally grade
programming assignments and to contrast this with the
criteria we used in this study. Local grades may consider
more than performance on a single assignment. For
example, a teacher may wish to reward effort or dramatic
improvement, and there are certainly good reasons for
doing so. Assessment in a study such as this one,
however, considers performance at a particular instant.
Given this difference in context, it is not surprising that
the grade and the assessment score may differ.
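For readers who wish to reproduce this kind of comparison, the agreement between two scoring schemes is typically measured with a Pearson correlation over the paired scores. The sketch below uses illustrative numbers only, not data from this study:

    from statistics import mean, stdev

    def pearson_r(xs, ys):
        """Pearson correlation between two equal-length lists of scores."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
        return cov / (stdev(xs) * stdev(ys))

    # Illustrative scores only -- not data from the study.
    local_grades   = [72, 85, 60, 90, 55, 78]
    general_scores = [65, 80, 58, 88, 40, 70]
    print(round(pearson_r(local_grades, general_scores), 2))

A coefficient well above zero but clearly below 1.0 would match the "high, but not overwhelming" description, and is consistent with the two scores measuring overlapping but distinct features.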
We clearly misjudged the complexity of the exercises.
The higher General Evaluation score of the students who
worked on exercise P2 (infix notation without precedence)
implied that this exercise was in some sense easier than
exercise P1 (RPN notation). (Before conducting the
study, we had rated P2 as being of "moderate" difficulty
and P1 as being the "simplest".) This underscores how
much we still do not know about student learning and
performance. P1 was undoubtedly difficult for students
who had never studied stacks or other basic data structures.
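To make the role of the stack concrete, the sketch below (in Python, chosen here only for brevity; it is not one of the languages used in the study) contrasts the two tasks: evaluating an RPN expression naturally requires a stack of pending operands, whereas an infix expression without precedence can be folded left to right with a single running value.

    import operator

    OPS = {"+": operator.add, "-": operator.sub,
           "*": operator.mul, "/": operator.truediv}

    def eval_rpn(tokens):
        """Evaluate a postfix (RPN) expression, e.g. ['3', '4', '+', '2', '*'] -> 14."""
        stack = []
        for tok in tokens:
            if tok in OPS:
                b, a = stack.pop(), stack.pop()   # operands must be kept on a stack
                stack.append(OPS[tok](a, b))
            else:
                stack.append(float(tok))
        return stack.pop()

    def eval_infix_no_precedence(tokens):
        """Evaluate infix strictly left to right, e.g. ['3', '+', '4', '*', '2'] -> 14."""
        result = float(tokens[0])
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            result = OPS[op](result, float(operand))   # a single running value suffices
        return result

That the precedence-free infix exercise can be solved without any auxiliary data structure is one plausible reason it turned out to be easier in practice.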
The result about bi-modality is troubling: there are two
distinct performance groups in our data sets. This result
suggests that our current teaching approach is leading to
one kind of performance for one sizable group of students
and another kind of performance for another sizable group.
We need to keep in mind that different groups of students
have different needs and strengths; we must ensure that the
results from one group do not obscure our view of the
other.