A multi-national, multi-institutional study of assessment of programming skills of first-year CS students

Report by the ITiCSE 2001 Working Group on Assessment of Programming Skills of First-year CS Students
Michael McCracken (chair)
Georgia Institute of Technology, USA
mike@cc.gatech.edu
Vicki Almstrum
University of Texas at Austin, USA
almstrum@cs.utexas.edu
Danny Diaz
Georgia Institute of Technology, USA
ddiaz@cc.gatech.edu
Mark Guzdial
Georgia Institute of Technology, USA
guzdial@cc.gatech.edu
Dianne Hagan
Monash University, Australia
Dianne.Hagan@infotech.monash.edu.au
Yifat Ben-David Kolikant
Weizmann Institute of Science, Israel
ntifat@wisemail.weizmann.ac.il
Cary Laxer
Rose-Hulman Institute of Technology, USA
Cary.Laxer@rose-hulman.edu
Lynda Thomas
University of Wales, Aberystwyth, UK
ltt@aber.ac.uk
Ian Utting
University of Kent, UK
I.A.Utting@ukc.ac.uk
Tadeusz Wilusz
Cracow University of Economics, Poland
eiwilusz@cyf-kr.edu.pl
ABSTRACT
In computer science, an expected outcome of a student's
education is programming skill. This working group
investigated the programming competency students have
as they complete their first one or two courses in
computer science. In order to explore options for
assessing students, the working group developed a trial
assessment of whether students can program. The
underlying goal of this work was to initiate dialog in the
Computer Science community on how to develop these
types of assessments. Several universities participated in
our trial assessment and the disappointing results suggest
that many students do not know how to program at the
conclusion of their introductory courses. For a combined
sample of 216 students from four universities, the average
score was 22.89 out of 110 points on the general
evaluation criteria developed for this study. From this
trial assessment we developed a framework of expectations
for first-year courses and suggestions for further work to
develop more comprehensive assessments.
KEYWORDS
INTRODUCTION
Programming is one of many skills that computer science
students are expected to master. In addition, most science,
mathematics, engineering, and technology (SMET)
programs expect that their students will acquire
programming skills as a part of their education. The
question is whether these requirements are being met. Are
the appropriate assessment measures in place to determine
if the students have acquired the necessary programming
skills? We think not, but wanted to gather evidence that
would confirm or refute our observations.
This working group arose from concerns expressed by
many computer science educators about their students'
lack of programming skills. Quite often these concerns
were focused on basic mastery of fundamental skills of
programming. A study by [8] identified similar
deficiencies in programming skill, although their study
focused on the teaching of programming. In several other
studies that have considered issues of learning to program,
assessment has been a part of their methodology. For
example, [6] studied students learning Basic; [7] looked at
conceptual "bugs" of novice programmers; and [9] studied
novice programmers' misconceptions. While the results
from these studies can help computer science educators
improve the teaching of programming, they do not answer
this question: Do students in introductory computing
courses know how to program at the expected skill level?
This working group collected data from several
universities and found that the students' level of skill was
not commensurate with their instructors' expectations.
Two issues are central to our effort:
Learning to program is a key objective in most
introductory computing courses, yet many computing
educators have voiced concern over whether their
students are learning the necessary programming
skills in those courses.
The development of CC2001 [1] represents the next
evolutionary cycle of the requirements for computing
education. These requirements are slated to become
the new standard for computer science education and
will form the basis for accreditation of computer
science programs in the USA. The requirements for introductory computing courses in the Ironman version of CC2001 prescribe the set of expected programming skills students should acquire but include little information on assessment. The efforts
of this working group may contribute to developing
assessments for use by CC2001 implementers.
The remainder of this report is organized into eight major
sections. We begin by describing a framework for
learning objectives during the first year of computing
courses. The next section explores a variety of
assessment approaches and motivates the choice we made
for this study. Next we describe the methodology for the
trial assessment, including the work we did in the months
before the ITiCSE conference. In the analysis section, we
describe what we gleaned from the data during our working
group's meetings at the conference. The remaining
sections interpret the results, discuss implications and
possibilities for further analysis, raise issues to be
addressed in follow-on studies, and propose a model for
driving this work further.
A FRAMEWORK FOR FIRST-YEAR LEARNING OBJECTIVES
When faced with understanding student performance, a
natural question is "What should be assessed?" The
working group discussed these issues and developed a
framework of first-year learning objectives, both to clarify
what we expected students to have learned during their first
year and to allow us to evaluate how well the instruments
for this study assessed the learning objectives.
For first-year computing students, a fairly universal
expectation is that they should learn the process of
solving problems in the domain of computer science, in
order to produce compilable, executable programs that are
correct and in the appropriate form. As the framework for
the learning objectives of the first year, we expect
computing students to learn to successfully follow these
steps:
1. Abstract the problem from its description
2. Generate sub-problems
3. Transform sub-problems into sub-solutions
4. Re-compose the sub-solutions into a working program
5. Evaluate and iterate
In general, all Computer Science programmes aim to
produce students who can reliably follow these steps in
solving discipline-specific problems, independent of the
particular programming paradigm being used. This also
remains as a (possibly implicit) goal as students progress
through their programmes, although the domain of
application, as well as the scale and complexity of
problems addressed, changes. The following clarifies what
is involved in each of these problem-solving steps.
1. Abstract the problem from its description -
First-year assessment exercises are generally framed in
terms of a concrete, usually informal, specification of a
problem for which students are required to implement a
solution. Starting from this specification, students must
first identify the relevant aspects of the problem
statement. Next, students must model those elements in
an appropriate abstraction framework, which is probably
predetermined based on the approach being used in the
solution space (e.g., procedural, OO, functional, logic)
and heavily influenced by the teaching approach.
2. Generate sub-problems - The scope and
importance of this step in the problem-solving process
may be dependent on the design approach adopted. A
functional decomposition of a structured program often
requires further decomposition. In an object-oriented
solution, the previous step has probably designed the
classes needed, although at this stage, there may be
factorization of methods out of others already in the
design.
3. Transform sub-problems into sub-solutions - Here, the student must decide on an implementation
strategy for individual classes, procedures, functions, or
modules, as well as on appropriate language constructs
(solution representations). This includes deciding on data
structures and programming techniques. A crucial aspect
of this step is the implementation (and testing) of the sub-
solutions. The solution should be correct and in the
appropriate form, that is, it not only produces the right
output but is also modularized, generalized, and conforms
to standards. Some language constructs may be
inappropriate in particular domains or particular
pedagogies; for example, it is not possible to use
recursion in all languages. This step is typically the first
point in the process at which significant involvement
with tools (e.g. a compiler) is possible.
4. Re-compose - In this step, the student must take
the sub-solutions and put them back together to generate
the solution to the problem. This step probably involves
creating an algorithm that controls the sequence of events.
5. Evaluate and iterate - Finally, the student must
determine whether the earlier steps in the process have
resulted in a good solution to the problem and take
appropriate action if not. The solution must be tested
thoroughly, and some of the earlier steps may be revisited
if the solution fails any tests. The solution must be
debugged to correct runtime and logic errors.
While the above framework of learning objectives
represents an ideal and generalized situation, there are
some problems with this abstraction. Particular pedagogic
approaches and tool-chain support might change details of
the sequence. For instance, an approach based on extreme
programming (XP) [2] would make the testing activity
much more central, so work on that aspect would begin
much earlier in the process. The availability of tools such
as BlueJ [3] would enable testing to be performed more
easily at step 3, rather than waiting until step 5. Use of
design tools and notations can encourage students to check
submissions at an earlier stage in the process. Whatever
the variations, however, all of the steps in the process
should still take place.
ASSESSMENT INSTRUMENTS FOR FIRST-YEAR CS
This section reviews general requirements for assessment
and describes types of assessment frequently used in first-
year computing courses. In reviewing these strategies, we
discuss how well each meets the general requirements for
assessment. We emphasize that assessment must be tied
to the educational objectives discussed in the preceding
section on the learning objectives framework. We
conclude this section by evaluating how well the trial
assessment met these assessment requirements.
Two main categories of assessment are objective testing and performance-based assessment. Objective forms of
assessment, such as multiple-choice questions, can
provide a cost-effective means for determining student
knowledge about areas such as language syntax or
program behavior. Objective testing can provide instant
feedback and can be used for both formative and
summative assessment. On the other hand, multiple-
choice questions cannot directly test students' ability to
create working computer programs.
In performance-based assessment, students are assessed for
their ability to create programs. Criteria for performance-
based assessments include: fairness, generalizability,
cognitive complexity, content quality (depth) and coverage
(breadth), meaningfulness, and cost [4,5]. Below, we
present three common forms of performance-based
assessment instruments and discuss how well they meet
the learning objectives framework from the previous
section, as well as the seven criteria given earlier in this
paragraph.
1. Take-home programming assignments
Typically a number of these assignments are given
during a course. Such assignments tend to be fairly
large scale with a fairly generous maximum
timeframe set for completing them (up to several
weeks). Such assignments tend to cover all five
aspects of the learning objectives framework. They
generally contain a large amount of cognitive
complexity. They are fair, generalizable, and
meaningful in the sense that students are operating in
an environment that is close to reality; however,
students are penalized if they are unable to spend
enough time completing the assignment. This type of
assessment is more vulnerable to plagiarism than are
some of the other assessment approaches.
2. Examinations (short answer)
These examinations (such as asking students to
generate code fragments) can be used to assess all five
learning objectives, although items on such
examinations often tend to concentrate on steps 3 and
4 of the learning objectives framework
(decomposition into sub-problems and transformation
into sub-solutions). It is difficult (but not
impossible) to make short-answer examinations
meaningful or generalizable because of the limited
time available for students to complete them, but
they can provide cognitive complexity at low cost.
3. Charettes (the method used in this study)
Charettes are short assignments, typically carried out
during a fixed-length laboratory session that occurs on
a regular basis. The closed nature of these sessions
reduces the opportunity for plagiarism. Charettes
provide coverage of the learning objectives
framework, although in a manner that is more
superficial and less cognitively complex than is
possible with larger take-home assignments. The
experience of completing a charette may not be as
meaningful or generalizable as larger assignments.
Charettes may be unfair to students who have test anxiety or trouble with time pressure.
Once an assessment instrument is chosen, the scoring
criteria must be determined. One approach to scoring
would be a raw assessment of whether the program works
(although this is not particularly useful for formative
assessment). It is common for first-year computing
instructors to examine the source code and other written
materials as part of their assessment strategy. Another approach to assessment is to combine one of the above with interviews in which the students describe their process and product and thus demonstrate that they understand what they have presented.
In this study, the form of assessment used was the
charette, a short, lab-based assignment. We selected this
assessment type to foster a fairly uniform environment
across universities at a relatively low cost. Our charette
provided fairness in the sense that all students were
operating in a similar environment, although this
approach can be seen as discriminatory against students
with test-taking anxiety. The exercises did offer cognitive
complexity and covered all parts of the learning objectives
framework reasonably well. In the Methodology and
Analysis sections, we explain the criteria we used in
assessing the students' programs.
METHODOLOGY
To help determine the programming ability of first-year
computing students, the working group developed a set of
three related programming exercises that students at
several universities would be asked to solve. The
exercises, which varied in difficulty, were designed so that,
theoretically, students in any type of Computer Science
programme should be able to solve them. Students could
use any programming language to implement their
solutions; we assumed that they would use the language
that they were required to use for the course they were
taking at the time. Students would only have to complete
one exercise of their instructor's choosing. The opinion
of the working group's participating schools was that a
student at the end of the first year of study should be able
to solve the most difficult exercise of the three in about an
hour and a half.
The exercises focused on arithmetic expression evaluation.
The easiest of the three exercises (P1) required a computer
program to evaluate a postfix expression. The second
exercise (P2) required a computer program to evaluate an
infix expression with no operator precedence (the
operations were to be performed strictly left to right, with
no parentheses present). The last exercise (P3) required a
computer program to evaluate an infix expression with
parenthesis precedence (operations were to be performed
left to right, with parentheses forcing sub-expressions to
be evaluated first). Each exercise stated that input tokens
(numbers and operation symbols) would be separated by
white space to ease the process of entering data. Infix
expressions would contain only binary operations (+, -, *, /, ^); postfix expressions could contain unary negation (~) as well. The exercises are described in Appendix A.
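
To give a concrete sense of the scale of the exercises, the sketch below shows the kind of stack-based evaluator a P1 (postfix) solution might contain. It is illustrative only and is not taken from Appendix A: the class and method names are our own, tokens are assumed to be separated by white space as the exercise statements specify, and the error detection rewarded by the General Evaluation Criteria is omitted for brevity.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Scanner;

    // Illustrative sketch of a P1-style postfix (RPN) evaluator.
    public class PostfixSketch {

        public static double evaluate(String expression) {
            Deque<Double> stack = new ArrayDeque<>();
            Scanner tokens = new Scanner(expression);    // tokens are white-space separated
            while (tokens.hasNext()) {
                String t = tokens.next();
                if (t.equals("~")) {                     // unary negation
                    stack.push(-stack.pop());
                } else if (t.length() == 1 && "+-*/^".contains(t)) {
                    double right = stack.pop();          // operands come off in reverse order
                    double left = stack.pop();
                    stack.push(apply(t.charAt(0), left, right));
                } else {
                    stack.push(Double.parseDouble(t));   // a number token
                }
            }
            return stack.pop();                          // a well-formed expression leaves one value
        }

        private static double apply(char op, double a, double b) {
            switch (op) {
                case '+': return a + b;
                case '-': return a - b;
                case '*': return a * b;
                case '/': return a / b;
                default:  return Math.pow(a, b);         // '^'
            }
        }

        public static void main(String[] args) {
            // "3 4 + 2 *" evaluates to (3 + 4) * 2 = 14.0
            System.out.println(evaluate("3 4 + 2 *"));
        }
    }

A P2 solution would replace the stack discipline with strict left-to-right accumulation of a running value, and a P3 solution would add handling of parenthesized sub-expressions, as described above.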
To enable the work of students from different universities
under different instructors to be compared meaningfully,
the working group developed the General Evaluation (GE)
Criteria shown in Appendix B. The criteria considered
whether a student's program could run without error,
process several arithmetic expressions, produce correct
results, and determine when expressions contained errors.
These criteria were strictly execution-based. To assess the
style component of the GE Criteria, the source code was
inspected.
The Degree of Closeness (DoC) Criteria given in
Appendix C provided a subjective evaluation of how close
a student's source code was to a correct solution. Students
at some of the universities were also asked to complete a
questionnaire (see Appendix D) that gathered demographic
information, programming background, and reactions to
the task.
Instructors at four universities administered the trial
assessment as a laboratory-based exercise in their
respective courses. Two used the first exercise (P1, postfix
evaluation), one used the second exercise (P2, infix
evaluation with no parentheses), and one used all three
exercises, administering a different exercise in each of
three sections of the same course. Students had either 1
hour (at one university) or 1.5 hours (at three universities)
to write a computer program to solve the exercise they
were given using the language they were taught in their
classes (which happened to be either Java or C++). When
finished, students submitted their executable programs and
printed copies of their source code for assessment. At one
university, the exercise was set up as an examination
required of all students, while at the other three
universities, the participants were volunteers who received
extra credit points.
The computer programs were evaluated using the criteria
in Appendices B and C. The GE Criteria assess how
accurately the students implemented their solutions, and
thus concentrate on the last two learning objectives (re-
composition into a working program and evaluation). The
DoC Criteria assess the results of the abstraction process
and thus enabled us to see how well the students met the
first three learning objectives (abstraction, decomposition,
and transformation into sub-solutions). In addition, the
instructor who gave the exercise as an examination graded
the programs in the traditional manner in order to be
consistent with the grading criteria for the remainder of the
course. Outcomes of the assessments were reported to the
working group leader for tabulation and cross-institutional
analysis.

ANALYSIS
Each instructor who administered the exercise applied the
General Evaluation (GE) Criteria (Appendix B). All
instructors produced an aggregate score for the General
Evaluation Criteria; most instructors also reported the four
component scores (execution, verification, validation, and
style). In contrast, the DoC Criteria (Appendix C) were
applied to the source code from all four universities by
evaluators at a single university. The evaluators also
generated comments to explain their reasons for giving
each DoC score. In an informal inter-rater reliability test
on scoring against the DoC Criteria, we found a high
degree of correlation between evaluators.
Two of the four universities administered a local version
of the Student Questionnaire (Appendix D). For all four
universities, the exercise number (P1, P2, or P3) was
recorded for each student as well as the programming
language used (Java or C++ in all cases). The four
participating universities were randomly assigned the
codes School S, School T, School U, and School V. The
instructor at School V reported a local grade on the
exercise (which was given as an examination). We
assigned each student an encoded student ID number in
order to ensure anonymity.
Once the raw data from each university were entered and
validated, the analysis followed two independent paths.
One path was a quantitative analysis based on the GE
score, the DoC score, and the other data available for each
student. The second path was a subjective analysis that
focused on several of the unsuccessful attempts to solve
the assigned exercise, looking at comments embedded in
the source code and information from the questionnaires.
We present the outcomes of these analyses in the next
three subsections.
Analysis of General Evaluation Score
The average General Evaluation (GE) score (combining
the execution, verification, validation, and style
components) for all students, all exercises, at all schools
(n = 217) was 22.9 out of 110 (standard deviation 25.2).
The scoring for each of P1 (Schools S, T, and V), P2
(Schools U and V), and P3 (School V only) appears in
Table 1. Overall performance was generally fairly low.

Exercise        Average (stdev)
P1 (n = 117)    21.0 (24.2)
P2 (n = 77)     24.1 (27.7)
P3 (n = 23)     31.0 (20.9)

Table 1: GE average score by exercise

We assumed in this study that we would be able to safely
combine data from multiple universities in our analyses.
However, there are differences between the students at
different universities (e.g., in raw talent, in previous
experience, in courses completed), between how they are
taught, in how the exercises were applied (e.g.,
examination grade vs. extra credit points, time allowed,
hints given), and, especially, in how the GE Criteria were
applied. We used a statistical test (Student's t-test) to
compare the universities on each of the exercises.
Schools S and T did not differ significantly on P1, but every other combination (Schools V and T on P1, Schools V and S on P1, Schools U and V on P2) did differ significantly (p < 0.00001).
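
For reference, the two-sample t statistic underlying such a comparison has the form

    t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}

where \bar{x}_i, s_i, and n_i are the mean, standard deviation, and number of students in the two groups of GE scores being compared. (The report does not record which variant of the test was applied; the unpooled, unequal-variance form is shown here as one plausible choice.)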
Table 2 summarizes the scores for each school across all
the exercises. (Only School V used more than one
exercise, P1, P2, and P3.) School V had considerably
higher scores than the other universities. Note, however,
that we cannot simply conclude that School V's students
performed better; the differences may be due to factors
such as how the GE Criteria were applied, what types of
students participated, or how motivated students were to
do well.

Figure 1: Distribution of GE scores on the combined P1 dataset (histogram)

Schools S and T are not statistically different on P1, so we can combine those scores with more confidence, gaining the benefits of an increased sample size and of describing students across multiple universities. On this combined P1 dataset (combining Schools S and T, n = 94),
the average General Evaluation score is 14.0 (standard
deviation 18.0). Figure 1 shows that the distribution of
these scores is bi-modal. While the majority of the
students did very poorly, there is a second "hump" in the
distribution, indicating a set of students with somewhat
better performance.
Bi-modal distributions ("two humps") appear throughout
this data. Another example is the combined P2 dataset
(combining Schools U and V), which has a similar bi-
modal profile (Figure 2).
The majority of students working on P2 scored below 10
points and fewer than ten students earned between 10 and
35 points, while over thirty students scored between 36
and 54 points.
With such low scores, we were curious to know where the
students lost points. The GE Criteria had four
components: execution (did the program run?), verification (did it handle input correctly?), validation (is it the right kind of calculator?), and style (does it meet standards?).
Though the scores are uniformly low, as a percentage of
possible scores, students did best on the execution
component (implying that, overall, they wrote programs
that compiled and ran) and the style component (implying
that the source code looked good). The lowest component
scores were on the verification and validation components
(Table 3).
Analysis of DoC Scores
The Degree of Closeness (DoC) score, a five-point scale
that rates how close a student's program is to being a
working solution (see Appendix C), is particularly
interesting to study because a single set of raters assigned
the DoC scores for all four universities. Therefore, any
differences in universities can be attributed to differences
among the universities themselves, rather than to
differences in applying the criteria.
We discovered that the GE and DoC Criteria do measure similar phenomena. The correlation between the GE score and the DoC score was significant (Pearson's r = 0.66).
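
For completeness, the reported coefficient is the standard Pearson product-moment correlation over the students scored on both instruments,

    r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}

where x_i and y_i are student i's GE and DoC scores and \bar{x} and \bar{y} are the corresponding means.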
The overall average DoC score (combining universities and exercises, n = 217) was 2.3 out of a possible 5 points (standard deviation 1.2).

School (exercise)                   Average (stdev)
School S (n = 73) - P1              14.0 (18.6)
School T (n = 21) - P1              12.0 (16.3)
School U (n = 47) - P2               8.9 (11.4)
School V (n = 23) - P1              48.7 (25.7)
School V (n = 30) - P2              47.8 (29.1)
School V (n = 23) - P3              30.9 (20.9)
Totals for School V on P1, P2, P3   43.0 (26.7)

Table 2: GE average score by university

GE Component (maximum possible)   Average score (stdev)   As percentage of max score on component
Execution (maximum: 30)           7.2 (11.8)              23.9%
Verification (maximum: 60)        1.6 (5.8)                2.8%
Validation (maximum: 10)          0.3 (1.8)                3.2%
Style (maximum: 10)               4.6 (3.4)               46.2%

Table 3: Average GE component scores and percentage of each component achieved

Figure 2: Distribution of GE scores on the P2 dataset (histogram)

In general, student performance was low as measured by the DoC Criteria. The average
DoC score for each exercise appears in Table 4. Students
did best overall on the simple infix calculator exercise
(P2), and next best on the RPN calculator (P1). This may
be due to students' familiarity with infix calculators and
notation and their lack of familiarity with RPN
calculators, or perhaps due to mismatches between the
demands of the exercise (e.g., stacks for RPN calculators)
and the curriculum at a particular school.

Exercise        Average (stdev)
P1 (n = 118)    2.2 (1.2)
P2 (n = 77)     2.4 (1.2)
P3 (n = 23)     2.0 (0.9)

Table 4: DoC score by exercise

The distribution of DoC scores for the universities is
shown in the first five rows of Table 5, with the average
score for each university in the final row. School V had
the highest DoC score, with School S second. The
difference between universities is statistically significant
(on a Student's t-test, p < 0.01).
At School T, we had the unusual circumstance of two
different programming languages used in the exercises.
About half of School T's students solved P1 using C++ (n = 10) and the rest solved the exercise using Java (n = 11). We calculated the average DoC score for each of these groups separately, then compared (using a Student's t-test) each group to a comparison group (School S's students) who solved P1 using Java. While School T's C++ programmers did significantly better than School T's Java programmers (p < 0.001), it is striking that the Java programmers at School T differ significantly from School S's Java programmers (p < 0.001), while School S's Java programmers and School T's C++ programmers do not differ significantly. Table 6 gives the average and standard
deviation for each of these groups.

Group                               Average (stdev)
School T's C++ Students (n = 10)    1.7 (0.8)
School T's Java Students (n = 11)   1.0 (0.0)
School S's Java Students (n = 73)   2.2 (1.1)

Table 6: Average score on P1 by School T's Java and C++ programmers and School S's students

Qualitative Analysis of Selected Solutions
In our qualitative analysis of the data, our goal was to
better understand some of the outcomes reported in the
previous sections. We investigated the question "What
went wrong?" (from both an instructor and a student point
of view) for the students who produced an unsuccessful
solution. The analysis was based on the students' source
code as well as their responses to the Student
Questionnaire (Appendix D). The analysis focused on
students from Schools S and V whose DoC score was 1 or
2 and compared their performance with that of students at
the same schools whose DoC score was 4 or 5.
First we investigated the data from the instructor's point of view to see how students were approaching the
exercise. For the students whose DoC score was 4 or 5,
we can say that little or nothing went wrong (i.e. they
produced working solutions that really solved the
exercise). These students can be characterized as
individuals who figured out a solution for the exercise and
either completed the exercise or were in the final phases of
implementing a solution. In analyzing what went wrong
for the students who earned a DoC score of 1, the results
can be classified into three types:
Type 1 (null result): the student handed in an empty file.
Type 2 (unplanned result): the student's work showed no
evidence of a plan to solve the problem. One
explanation for this performance is that the student
followed a heuristic in which they first did what they
knew how to do, deferring the tasks about which they
were uncertain, but were then unable to proceed
beyond that point.
Type 3 (unimplemented plan): there is evidence that the
student had a plan but did not carry it out. These
students apparently understood what they needed to do
and appeared to have a general structure for a solution.
We further subdivide this type into two subtypes.
For type 3a (unimplemented plan with promising
approach), there was evidence that the student had
identified a reasonable structure for solving the
exercise. For type 3b (unimplemented plan with poor
approach), the student apparently had a plan, but it
was a poor one for the solution.

                   School S    School T    School U    School V
Score of 5             3           0           0           9
Score of 4             5           0           2          15
Score of 3            22           2          11          19
Score of 2            18           3          15          18
Score of 1            25          16          19          15
n                     73          21          47          76
Average (stdev)    2.2 (1.1)   1.3 (0.7)   1.9 (0.9)   2.7 (1.2)

Table 5: DoC score distribution by university

Next, we investigated the data from the student's point of
view to better understand why the process of completing
the exercise went so well for some students and so poorly
for others. We contrasted student attribution of difficulties
for students at School S whose DoC score was 1 with the
attributions of students at the same school whose DoC
score was 5. In the Student Questionnaire (Appendix D),
students were asked to rank the difficulty of the exercise
on the scale [easy, difficult, hard, impossible]. None of
the School S students who earned a DoC score of 1 (n = 25) rated the exercise as easy. Six of these students did not respond to the questionnaire. Of the remaining nineteen students, six ranked the exercise as difficult, nine ranked the exercise as hard, and four ranked the exercise as impossible (and these were not necessarily the Type 1 students). For the three School S students whose DoC score was 5, one thought the exercise was easy, one thought it was difficult, and one thought it was hard.
To gain some insights into why, we read the reflections
reported by Type 1 students (null result) and students who
earned a DoC score of 5. We found that the Type 1
students attributed blame for their difficulties to factors
outside of their control. They blamed the amount of time
available to solve the problem, their unfamiliarity with
the computers in the lab, their lack of Java knowledge,
and other external factors. None of the Type 1 students
mentioned factors related to the process of solving the
exercise. In contrast, students whose DoC score was 5
competently described the difficulties they experienced in
the process of creating a solution. Many of these
explanations illuminated particular aspects of the design
phase or particularly challenging sub-problems.
Examples of comments made by such students were
"Simple errors got the best of me" (problem difficulty
rated as difficult), "Could not solve for error case" (problem difficulty rated as hard), and "Implementation is wrong but easy" (problem difficulty rated as easy). Most
of the students with DoC scores of 5 included comments
in their source code that documented the cases for which
the program did not work.
Due to the limited timeframe for the working group
collaboration, this qualitative analysis is preliminary and
incomplete. The Results section includes additional
observations from the qualitative analysis and ideas for
further qualitative analysis of this data, as suggested by
the results to this point.
RESULTS
The first and most significant result was that the students
did much more poorly than we expected. There are many
possible causes: Our expectations may have been too
high, the problems may have been too hard or a poor fit
to the students' background and interests, the students may not have been given enough time, and so on.
We did answer the question we asked in the Introduction
section: Do students in introductory computing courses
know how to program at the expected skill level? The
results from this trial assessment provide the answer
"No!" and suggest that the problem is fairly universal.
Many of the solutions would not compile due to syntax
errors. This suggests that many students have not even
acquired the technical skills needed for getting a program
ready to run. While all the results were poor, School V's
students did significantly better than the other universities.
Two important factors that may have contributed to this
difference are: (1) The School V instructor had given the
students an example to study, which was a complete
answer to a similar problem, and (2) All students were
required to take the exercise, which was given as an
examination. Thus, sources of difference among the
universities in this study could include type of
preparation, motivation on this exercise (e.g.,
examination vs. extra credit), student characteristics (e.g.
volunteers or compulsory participation), and issues such
as curriculum and teaching style.
The School V instructor, who gave the exercise as an
examination, applied local grading criteria in addition to
the criteria defined for this trial assessment. We found
that the correlation between the local grade and the General
Evaluation score was high, but not overwhelming. One
interpretation of this is that the two scores consider
somewhat different features. It would be interesting to
study these differences in order to gain a better
understanding of the way instructors normally grade
programming assignments and to contrast this with the
criteria we used in this study. Local grades may consider
more than performance on a single assignment. For
example, a teacher may wish to reward effort or dramatic
improvement, and there are certainly good reasons for
doing so. Assessment in a study such as this one,
however, considers performance at a particular instant.
Given this difference in contexts, it is not surprising that
the grade and the assessment score may differ.
We clearly misjudged the complexity of the exercises.
The higher General Evaluation score of the students who
worked on exercise P2 (infix notation without precedence)
implied that this exercise was in some sense easier than
exercise P1 (RPN notation). (Before conducting the
study, we had rated P2 as being of "moderate" difficulty
and P1 as being "simplest"). This points out more of
what we still do not know about student learning and
performance. P1 was undoubtedly difficult for students
who had never studied stacks or other basic data structures.
The result about bi-modality is troubling. There are two
distinct groups of performance in our datasets. This result
suggests that our current teaching approach is leading to
one kind of performance for one sizable group of students
and another kind of performance for another sizable group.
We need to keep in mind that different groups of students
have different needs and strengths; we must ensure that the
results from one group do not obscure our view of the
other.

While the basis for comparison between programming
languages is small for this trial assessment, we did
unearth an interesting contrast. One school of thought
says: "Java is better than C++ for education" or "Languages matter a lot; students learn better with X
than Y." In this study, Java programmers from School S
resembled C++ programmers from School T more than
they resembled the Java programmers at School T. This
suggests that the difference was not simply due to the
programming language. Issues of how the course is
taught and who the students are influence the outcome,
rather than being simply a matter of programming
language X vs. programming language Y. Future
investigations must dig into how learning differs with
different programming languages.
The fact that students did well on the style component of
the General Evaluation Criteria indicates that students are
responding to their instructors' admonishments about
commenting and formatting of code. The other
component scores (execution, verification, and validation)
indicate that the code that students write does not meet
specification; the only way to evaluate this is to run the
students' code. An implication of this is the importance of
actually executing student programs.
The significant number of solutions with a DoC score of
1 or 2 (i.e. students who were "clueless") raises the
suspicion that those students need additional work during their first-year courses on developing skills for the first
learning objective in our framework (abstracting the
problem from a given description).
Many of the students who failed on this trial assessment
had no idea how to solve the exercise. On the Student
Questionnaire, the last question asked students:
What was the most difficult part of this assigned task? Was it the timed aspect of the problem, was the problem too difficult, etc.? The following quotes are responses from
students whose DoC score was 1 or 2:
"I didn't have enough time"
"I'm not good with stacks/queues."
"Too cold environment, problem was too hard." [We
believe the first phrase refers to the temperature in the
physical setting.]
The most frequent student complaint was a lack of
sufficient time to complete the exercise. This implies
that these students could not accurately identify the main
source of their difficulties in solving the exercise and
therefore tended to attribute their lack of success to factors other than themselves, such as a lack of time or
the "cold" environment. In a multi-factor analysis, [11]
found that attributing blame to external factors (such as
"luck") was not uncommon, but was particularly hard to
overcome. Once students attributed their failure to
unstable factors that were out of their control, they rarely
succeeded in future attempts.
One implication of this finding is that the implementation
of first-year courses should make better use of available
assessment methods and tools. Students should receive
accurate feedback that allows them to become aware of
their own limitations and difficulties, although such
feedback alone will not necessarily convince a student that
the reason he or she failed is at least partially internal
rather than purely external.
Students often have the perception that the focus of their
first-year courses is to learn the syntax of the target
programming language. This perception can lead students
to concentrate on implementation activities, rather than
activities such as planning, design, or testing. Generally,
this perception does not come directly from what their
instructors are telling them and, in fact, this belief seems
to be robust even in the face of instructors' statements to
the contrary. Students often skip the early stages in the
problem-solving process, perhaps because they see these
steps as either difficult or unimportant. It is also possible
that instruction has focused on the later stages, with an
implicit assumption that the earlier stages are well
understood or easy to understand.
The information from the students' reflections can provide
useful information for improving the assessment process.
The following two quotes are drawn from the responses to
the same Student Questionnaire item as above by students
whose DoC score was 2:
"I had a plan, I did not know how to carry it out in
Java."
"The problem was too difficult, I lost a lot of time
trying to understand how the computer work."
These quotes are from students who seemed to accurately
identify their own difficulties and who took responsibility
for their own performance. These students knew that they
should go through a process of understanding, planning,
and implementing. The earlier students' reflections give
us little information about whether they were following
these steps of problem-solving; in fact, the earlier students
appear to have been lost and unable to point out what they
do not know, blaming the environment or their poor
understanding of a class of concepts.
The students' reflections provided useful information
about the influence of the setting on student performance.
Five School V students who earned a DoC score of 1 or 2
complained that they had a plan but could not handle the
environment themselves and therefore could not translate
their solution into a working computer program. When
we interviewed the School V instructor, we learned that
while the setting was indeed lab-based as specified in the
instructions for how to administer the exercise, it was also
the first time these students had taken a laboratory-based
examination. This helps to explain why these students
found it difficult to work on their own and performed
rather poorly. Several students reported in the Student
Questionnaire that stress played a major role in their
unsuccessful performance, while others reported that they
needed time just to figure out how a postfix calculator
works. Being aware of such factors can help us as
instructors to refine our assessment tools and give better
guidelines on how to administer the tools. These data also
give us insights into the students' performance that can be
used to refine our approach to evaluating their knowledge.
DISCUSSION
In analyzing the data from universities in different
countries, we have found that the problems we observed
with programming skills seem to be independent of
country and educational system. The most obvious
similarity we observed was that the most difficult part for
students seemed to be abstracting the problem to be solved
from the exercise description. At all universities, the main
student complaint was a lack of time to complete the
exercise.
In this trial assessment, as in the "real world", it may be
that black-box assessment of students' submissions
reinforces students' views of implementation and syntax
as the key focus of computer programming. Here we
explore some possible reasons for the observed situation.
1. Students may have inappropriate (bad) programming habits. When beginning their university studies,
many students have prior experience in computer
programming. Often students with such experience
treat the source code as simple text rather than as an
executable computer program that is supposed to
accomplish a specific task. Their goal is simply to
obtain a program that compiles cleanly; often they are
then surprised by what the program really does when
presented with data.
2. Switching to modern (Java) object-oriented
programming tools.
Anecdotal evidence and some
research results (e.g. [10]) suggest that teaching an
object-oriented approach to computer programming
(for example, using a Java environment) requires
more time before students have sufficient knowledge
about the programming environment to solve
problems on their own (which suggests that less time
is required to achieve the needed level of familiarity
with the environment in a procedural or functional
approach). Therefore it is very likely that first-year
courses using an object-oriented approach do not have
room in the syllabus for fundamental data structures
such as stacks, queues, and trees.
3. Closed lab time constraint. In terms of the way this
In terms of the way this
trial assessment was administered, time pressure may
have contributed to the poor results.
The qualitative analysis of selected solutions helped
explain student performance and therefore highlights where
future studies must improve over this trial assessment.
One direction for further analysis would be to give a more
in-depth characterization of the nature of student
knowledge and difficulties within each DoC score (i.e.
from 1 to 5). We could investigate this by considering
the quality of the source code, the internal documentation,
and the data from the Student Questionnaire. It would be
useful to consider these issues from both the
instructor's point of view and the student's point of view.
A student's reflections can provide important clues to
whether the student understands his or her own limitations
in knowledge. For example, the terminology that the
student uses to describe his or her difficulties provides
glimpses into the student's processes and problem-solving
knowledge. These insights could help us better understand
whether students are becoming competent in correctly
identifying (and overcoming) their own difficulties.
In general, data analysis using qualitative approaches can
provide information to help improve educational processes
and refine assessment tools. For example, being aware of
the factors revealed by qualitative analysis can assist us in
developing better instructions for administering this trial
assessment. The information generated by the qualitative
analysis can also help make us aware of aspects of our
students' behavior that we otherwise would not notice.
Finally, the information from qualitative analysis can
provide better and more accurate insights into what
students know and how they use that knowledge.
To efficiently teach computer programming skills is
difficult. The kinds of assessment that instructors use
throughout their courses must provide appropriate
information for understanding students' processes of
developing programming skill. This trial assessment
showed that most of the participating students failed to
achieve one of the basic goals of a first-year computer
science course: to acquire at least a basic level of skill
with computer programming. This implies that it was the
students' knowledge, rather than their skills, that enabled
them to successfully complete their first-year courses. It is
possible either that performance-based assessment tends to
be improperly implemented or that it is often sacrificed in
order to make assessment more objective.
ISSUES TO BE ADDRESSED IN FOLLOW-ON STUDIES
Several aspects of this study gave us cause for concern or
raised points that must be addressed in future studies of
this kind. These areas include the administration of the
study, the exercises, and the challenges of multi-
institutional collaboration.
Issues related to administration of the exercise
There are difficulties in comparing the performance of
students with different programming backgrounds. In
some universities, first-year students enter having already
taken a general introduction to programming course,
whereas in others most students are programming novices
at the start of their first year of studies. Although some
of the latter group may have prior programming
experience from school, other universities, or self-
learning, the preponderance of novices in the sample
would affect the results from those universities. In future
studies, we might specify the level of prior programming
experience or the specific programming knowledge that
the students are assumed to have for each exercise. It
would then be fairer to allow instructors to choose the
appropriate exercise to give to their students. The
background questionnaire should also be modified to
solicit information on students' prior programming
knowledge.
Students were expected to solve the problem in whatever
language they were learning in their course. As it
happened, in our study all the students were learning either
C++ or Java. The language of implementation affects the
difficulty of the solution. For example, it is much easier
to read data from a keyboard in C++ or even C than in
Java. Many courses teach Java using classes supplied to
simplify input from the keyboard, but it was specifically
stated in the instructions that students were not allowed to
use such classes. The exercises should be chosen so that
it is not necessary to use a technique that is clearly more
difficult in one language than another.
These exercises were designed to be done using computers
in a laboratory environment. The laboratory session must
be monitored to ensure that nobody uses external means
such as email or the Internet to obtain help with the
solution. It was unclear from the trial assessment
instructions whether the exercise could be done on an
open-book basis. It was also unclear whether instructors
were allowed to prepare the students for doing the exercise.
Such issues should be explicitly addressed in the
instructions in future collaborative assessment studies.
In some universities that participated in the study, the
students were volunteers. In others, the exercise was
compulsory. If students are asked to volunteer for a
programming exercise, anyone who is weak in
programming is likely to choose not to do it. This means
that, in order to gain a true picture of the programming
skills of students, the exercise must be compulsory for
students. The only way to ensure that all students will
attempt an exercise is to make its results count towards
their final mark in a course. It must therefore fit into the
assessment strategy of the course in which they are
enrolled, as an examination for which a number of marks
are allocated. In the future, it would help the analysis to
record information about the conditions for each
administration of the exercise, for example, examination vs. extra credit and volunteers vs. compulsory.
If the exercise is compulsory, a one-and-a-half hour
laboratory consisting of only one question may be unfair.
This is particularly true if this style of assessment is so
different from what students have already done in their
courses that they cannot determine where to start.
An
assessment of programming skill may need to take into
account the fact that, in the "real world", a programmer
usually does not have such a short time limit for
understanding a problem and writing the required computer
program. In addition, real-world programmers are
generally free to refer to books and other resources if
needed. Students whose primary language is not English
may need a considerable amount of time to read the
specification in order to understand what is required. In
future studies, it may be necessary to allow students much
more time than it is likely to take them to solve the
problem. For example, if a teaching assistant can solve
the problem in half an hour, it may be necessary to allow
students up to three or four hours to solve it. Some
students suffer from examination anxiety. To counter
this, it would be possible to give students a week, say, to
do the exercise, although this introduces more
opportunities for plagiarism, and the assessment strategy
would have to take this into account. Another approach
would be to treat the topic area for the exercise as a case
study that the instructor presents during one or more
lectures. Basic materials for presenting the case study
could be distributed to the participants. This would
introduce some consistency in how the case study was
introduced to students and could make it easier for students
to quickly understand the requirements of the exercise in
the closed-lab setting.
This study was not culturally neutral. For some
universities, the exercises and instructions had to be
translated into a language other than English. One way
to minimize the effect of this difference would be to
ensure a centralized translation to each language, which
would ensure that all universities using a particular natural
language use the same specification. Ideally, there should
also be a validation step to ensure that the translated
version of the exercise gives exactly the same
specification as the original English version.
In future studies, instructors must receive sufficient notice
of the study so that they have time to incorporate it into
their assessment strategies for a particular semester. This
point was a major factor in why additional universities did
not participate in this trial assessment.
Issues related to the exercises
The exercises used in this study were probably
discouraging for students with mathematical anxiety.
Such students exist even in Computer Science
programmes and are more likely to exist in other kinds of
computing programmes that do not include compulsory
mathematics courses or have strong mathematics
prerequisites, such as a programme focused on commercial
applications of computing. In future studies, a set of
exercises of equivalent programming difficulty could be
devised, and participating instructors could choose the
most appropriate exercise for students in their programme.
Alternatively, students could be allowed to choose the
exercise that they felt most comfortable attempting.

The exercises in this assessment should have solutions
that are unlikely to appear in the textbooks typically used
by students in the first year. In this way, students who
had used such textbooks would not be at an advantage over
those who had not. To address this in future studies, a
review panel, consisting of a representative sample of
instructors, could be asked to provide feedback on the
appropriateness of the task, the level students would need
to be at to successfully solve the exercises, and whether
they knew of any resources that would give some students
an unfair advantage in solving any of the exercises. The
review panel could include instructors from different
countries, with different natural languages, teaching in
different kinds of degree programmes, and using different
programming languages.
In our study, the exercises were most easily solved using a
procedural approach, and it was not easy for a student to
decide which classes, attributes, and methods would be
required if an object-oriented approach were taken. This
may have confused many students. Given that most first-
year programmes currently seem to be using an object-
oriented language, the exercises should include options for
which a natural solution can be designed using an object-
oriented approach.
The specifications of the exercises in this study included
details that were not relevant to the solution, which made
it difficult for many students to achieve the first learning
objective in our framework (abstracting the problem from
the description). As stated earlier, many students (those
with DoC scores of 1 or 2) did not seem to get past
that point in the problem-solving process. In the future,
extra effort should be expended to make each specification
as clear and simple as possible. One way to achieve this
would be to ask the review panel mentioned earlier to
suggest changes to the exercise descriptions, as well as to
the instructions for administering the exercises.
Issues related to multi-institutional collaboration
This trial assessment is an example of collaboration on a
single project across a variety of universities. Multi-
institutional collaboration offers advantages as well as
challenges. Among the advantages are an increased
experience pool, a larger cumulative pool of students, and
a wider variety of student profiles (increasing the potential
for generalizability of results). At the same time, multi-
institutional collaboration includes many challenges,
some of which are addressed earlier in this section. Being
separated physically makes it more difficult to coordinate
protocols for conducting the exercises. It is also more
difficult to make the data consistent (with respect to
formats, field names, etc.) and complete (one university
may collect data that is "lost" at another university,
simply because the second instructor did not know to
capture that information). Another important challenge is
making the exercises sufficiently general so that they are
neutral with respect to both culture and the university.
Experience in this trial assessment suggests that we did
not fully succeed in this. Our conclusion is that we must
be cautious in defining general exercises, since we cannot
assume that all first year programs cover the same
material in content or emphasis, even within the
boundaries of established curriculum standards and
accreditation criteria.
Based on the experiences with this trial assessment, we
offer the following advice for doing multi-institutional
collaborations:
1. Appoint one research coordinator, who will be the main contact point for making decisions on the entire project. In our case, the WG leader was the research coordinator, who guided the entire process.
2. Do a trial run of the entire study, including analysis, in order to work out details of data formats and instruments.
3. Ensure that all source data can be traced to the interpreted data. For example, ensure that the printouts and files with the source code are marked in a way that associates each with the coded ID of the student who completed it.
CONTINUING THE QUEST
Because our preliminary work suggests that the problems
we have observed are universal, the working group feels it
is worthwhile to expand this trial assessment to include a
broader base of computer science educators and
universities.
We envision establishing a central web site
related to assessment of programming skills. Such a site
could provide a gathering spot for links and materials
related to this type of assessment, while at the same time
being easily usable from throughout the world. The web
site could include a registration process in order to allow
restricted access to various parts of the assessment site.
The programming assessment site must support three
main types of activities:
Assessment development. The system should enable instructors throughout the world to participate in this collaborative project. For example, the web site should have features to support individuals who wish to submit new ideas or produce new assessments (perhaps following pre-defined templates obtained from the web site). The web site can also provide a technical forum where individuals developing assessment tools can discuss personal assessment experiences with others involved in the project.
Support for carrying out assessment and self-assessment. This feature can serve two groups of users: students and instructors. The assessment web site can provide both groups of users with ready-to-use assessments and background information. As the instruments are filled out, the web site can collect the results and allow users to submit comments and feedback. Individual students would be able to use these tools for self-assessment and tracking personal progress. The assessment web site could also establish a worldwide database to accumulate information about students' computing knowledge and programming skills as measured by these assessments (a sketch of the kind of record such a database might hold follows this list). Such a database would provide a basis for understanding student attributes within a single university, a single country, or even globally.
Communication environment. While much of the information in the assessment web site will have strictly controlled access based on an individual's registered profile, the system could also allow the general public to access certain information about assessment. This would allow anyone interested in any aspect of assessing programming skills to exchange ideas and comments.
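The working group did not define a data model for the proposed results database; purely as an illustration, the following Java sketch shows the kind of record it might accumulate for each completed assessment. Every field name here is an assumption on our part, not part of the study.

// Hypothetical sketch only: the working group defined no schema, and all of
// these field names are assumptions used for illustration.
public class AssessmentRecord {
    String studentId;     // coded ID, never the student's name
    String university;    // participating institution
    String country;
    String exerciseId;    // "P1", "P2", or "P3" from Appendix A
    int generalScore;     // out of 110, per the General Evaluation Criteria
    Integer docScore;     // Degree of Closeness (1-5), or null if the program worked
    String comments;      // free-form feedback from the student or instructor
    // Accessors, validation, and persistence would follow whatever schema a
    // steering committee eventually agrees on.
}

Aggregating such records by university or country is what would support the single-institution, national, or global comparisons described above.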
In order to realize the vision of an assessment web site,
several organizational aspects are needed, including:
a steering committee to guide the various efforts;
a series of meetings, perhaps on an annual basis,
where policy and structure can be defined;
a committee devoted to maintaining the system; and
one or more moderators who track day-to-day
submissions from the public.
In order to foster interaction while establishing and
building the assessment web site, a series of meetings
could be held at regular intervals to gather individuals
interested in contributing to this project. The meeting
agenda would include developing the philosophy and
strategy of assessment, accepting or rejecting proposed
changes to the whole system, and managerial
responsibilities such as designating the steering
committee. It would make sense for these meetings to take place in conjunction with a major conference such as the SIGCSE Technical Symposium or the ITiCSE Conference.
committee would be responsible for guiding the
implementation strategy between the periodic meetings.
The system maintenance group would be the professionals
responsible for maintaining the system. Finally, the
moderators would monitor the content of the system on a
day-to-day basis.
The site with information from this working group is
located at the URL:
http://www.cc.gatech.edu/projects/iticsewg/csas.html.
ACKNOWLEDGEMENTS
The chair of this working group thanks each member for
her or his individual contributions. The members were
what made this working group a success. This project
required a great deal of dedication and effort by the
members before, during and after the conference.
The group would also like to thank the organizers of the
conference, Sally Fincher and Bruce Klein, and the
working group leader, Roger Boyle, for giving us the
opportunity to do this project. Finally, the group would
like to thank Georgia Tech students Blake Markham and
Prashanth Kolli, who helped with a lot of the logistics of
the project.
REFERENCES
1. ACM & IEEE-CS Joint Task Force on Computing Curricula 2001 (2001). Computing Curricula 2001, Ironman Draft. Association for Computing Machinery and the Computer Society of the Institute of Electrical and Electronics Engineers. Available: http://www.acm.org/sigcse/cc2001 [2001, 5/16/01].
2. Beck, K. (2000). Extreme Programming Explained: Embrace Change. The XP Series, Addison-Wesley, Boston.
3. BlueJ (2001). BlueJ, the Interactive Java Environment. Available: http://www.bluej.org [24 July 2001].
4. Hambleton, R.K. (1996). Advances in Assessment Models, Methods, and Practices. In D.C. Berliner and R.C. Calfee (Eds.), Handbook of Educational Psychology. New York: Simon & Schuster Macmillan.
5. Linn, R. L., Baker, E. L., and Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), pp. 15-21.
6. Mayer, R. E. (1981). A psychology of how novices learn computer programming. Computing Surveys, 1, pp. 121-141.
7. Pea, R. (1986). Language independent conceptual bugs in novice programming. Educational Computing Research, 2(1), pp. 25-36.
8. Soloway, E., Ehrlich, K., Bonar, J., & Greenspan, J. (1982). What do novices know about programming? In A. Badre and B. Shneiderman (Eds.), Directions in Human-Computer Interactions, Norwood, NJ: Ablex, pp. 27-54.
9. Spohrer, J., & Soloway, E. (1986). Novice mistakes: Are the folk wisdoms correct? Communications of the ACM, 29(7), pp. 624-632.
10. Wiedenbeck, S., Ramalingam, V., Sarasamma, S. and Corritore, C.L. (1999). A comparison of the comprehension of object-oriented and procedural programs by novice programmers. Interacting With Computers, 11(3), March, pp. 255-282.
11. Wilson, B. C., & Shrock, S. (2001). Contributing to success in an introductory computer science course: A study of twelve factors. In I. Russell (Ed.), The Proceedings of the Thirty-second SIGCSE Technical Symposium on Computer Science Education, SIGCSE Bulletin inroads, 33(1), pp. 184-188.

APPENDICES
The information given in these appendices reflects updates
made after completing the trial assessment. Some
changes were introduced to clarify issues and to complete
points that were missed during the initial development.
The original and modified versions of the exercises and the
instruments are available via the working group's web site
at the URL
http://www.cc.gatech.edu/projects/iticsewg/csas.html.
Appendix A. Overview of the Exercises
The content of three exercises developed for use in this
study was distributed electronically to the participating
instructors so they could easily cut and paste the text in
creating their local versions of the assignment. As a
baseline for difficulty levels, we hypothesized that second-
semester computing students should be able to do the
most difficult exercise of the three, Exercise #3, in 1.5
hours. To improve consistency, participating instructors
received the following guidelines for how to administer
the task.
* The students should work individually in a closed lab setting (proctored, with all work completed in the allotted time).
* The student's goal is to produce a working and tested program in the time allotted.
* This is a programming exercise, so students should produce a computer program. Any design documentation, though important to solving the problem, is not important to this assessment.
The three exercises, referred to in the body of the paper as
P1, P2, and P3, were as follows:
Exercise #1 (P1): Programming an RPN calculator; difficulty level: 1 (simplest)
Exercise #2 (P2): Programming an "infix" calculator without precedence; difficulty level: 2 (moderate difficulty)
Exercise #3 (P3): Programming an "infix" calculator with simple precedence (i.e. precedence determined by parentheses only; no consideration given to operator precedence); difficulty level: 3 (most challenging)
The exercise description included a common introduction
for all three exercises. We suggested that students would
need ten minutes to read and understand this background
information. The main ideas in the introduction were:
An explanation of the two main notations for hand-
held calculators: Reverse Polish Notation (RPN) (also
known as "postfix", which is generally used by
Hewlett Packard calculators) and "infix" (which is
generally used by Texas Instruments calculators).
A description of how "post-fix" and "in-fix"
expressions should be processed.
A discussion of why RPN is simpler to implement
(i.e. no precedence issues) while at the same time it is
less intuitive for most users.
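To make the contrast concrete, the following minimal Java sketch (ours, not part of the original exercise materials) evaluates a whitespace-delimited postfix token stream with a single stack. It covers only the four basic arithmetic operators, omitting the power and negation operators that the full exercises require, and the class and method names are our own.

import java.util.ArrayDeque;
import java.util.Deque;

// Minimal illustration of stack-based RPN ("postfix") evaluation. No
// precedence handling is needed: operands are pushed, and each operator pops
// its two arguments and pushes the result.
public class RpnSketch {
    static double evaluate(String[] tokens) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String t : tokens) {
            if (t.equals("+") || t.equals("-") || t.equals("*") || t.equals("/")) {
                double right = stack.pop();   // second operand entered
                double left = stack.pop();    // first operand entered
                stack.push(apply(t, left, right));
            } else {
                stack.push(Double.parseDouble(t));   // operand
            }
        }
        return stack.pop();   // the single remaining value is the result
    }

    static double apply(String op, double a, double b) {
        switch (op) {
            case "+": return a + b;
            case "-": return a - b;
            case "*": return a * b;
            default:  return a / b;   // "/"
        }
    }

    public static void main(String[] args) {
        // The RPN expression "3 4 + 2 *" corresponds to the infix expression "(3 + 4) * 2".
        System.out.println(evaluate("3 4 + 2 *".split("\\s+")));   // prints 14.0
    }
}

An infix calculator, by contrast, must resolve precedence (in Exercise #3, parenthesis grouping) before it can apply an operator, which is why the RPN exercise was rated the simplest of the three.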
The individual descriptions of the three exercises provided the following information (a minimal driver-loop sketch that meets these requirements follows the list):
User input is to come from the terminal's standard
input; output should be directed to standard output for
the terminal.
The solution can utilize standard library routines
provided by the language; no proprietary or other such
libraries may be used.
The operations that the particular calculator can
process include addition, subtraction, multiplication,
division, the power operator, and the inverse, or
negation, operator. The "infix" calculator with
precedence (Exercise #3) also included parenthesis
pairs, which are used to indicate simple precedence.
The description of each calculator shows the relative
format for a line of input. For all of the calculators,
some form of white space will delimit tokens
(numbers and operators).
User input will be entered non-interactively (so that
the program is not allowed to query the user for
additional information once the expression is entered),
with the exception of the prompt to solicit the next
line of input.
The program should terminate when the input
contains only the letter
`q'.
When an error is detected in the input, the program
should output an informative message and allow the
user to begin entering a new expression.
At the end of each calculation, the calculator should
be cleared so the data structure containing the
intermediate results is empty and ready for processing
a new expression.
Floating point arithmetic should be assumed and the
program should allow non-integer expressions as
valid input.
Through several lines of a sample session, the
description demonstrates a number of expressions and
the results from the associated calculations for the
specific calculator.
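Most of the requirements above concern the read-evaluate-print protocol rather than the arithmetic itself. The following Java sketch, again ours rather than study material, shows one way a driver loop could satisfy them: it prompts, reads one line at a time from standard input, quits on a lone 'q', reports errors without stopping, and starts each expression with a fresh evaluation state. It reuses the hypothetical RpnSketch class from the earlier sketch and, like it, omits the power and negation operators.

import java.util.Scanner;

// Illustrative driver loop only; it delegates evaluation to the hypothetical
// RpnSketch class sketched earlier in this appendix.
public class CalculatorDriver {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        while (true) {
            System.out.print("> ");               // prompt for the next line of input
            if (!in.hasNextLine()) break;         // end of input stream
            String line = in.nextLine().trim();
            if (line.equals("q")) break;          // quit command: a line containing only 'q'
            if (line.isEmpty()) continue;
            try {
                // Each line is evaluated from scratch, so no intermediate
                // results are carried over between expressions.
                System.out.println(RpnSketch.evaluate(line.split("\\s+")));
            } catch (RuntimeException e) {
                // Malformed token or expression: report and let the user start over.
                System.out.println("Error: could not evaluate \"" + line + "\"");
            }
        }
    }
}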
Appendix B. General Evaluation Criteria
Because this was a programming exercise intended to
evaluate the programming skills of the participants, the
evaluation focused on skills. The General Evaluation
Criteria were designed to give reasonably consistent
evaluations while allowing the participating instructors to
still follow their normal grading process.
The total number of marks that a particular program could earn was 110 (the short sketch after the criteria below simply totals this allocation). In the following, we have listed the allocation of marks immediately after each item. The style section was optional, since some instructors do not have style requirements in their introductory classes.
Execution (30 marks) - Does the program execute without error in its initial form? Does it compile without error? Does the program run successfully (no core dump or equivalent failure)?
Verification (total of 60 marks, as broken down in the itemized list) - Does the program correctly produce answers to the benchmark data set? This includes the following issues:
* (10 marks) The program should allow for multiple inputs of different arithmetic expressions (i.e., it should clear out the data structure properly between different expressions).
* (10 marks) The program should terminate correctly (i.e., entering the quit command should terminate the program).
* (30 marks) The program should correctly process data sets containing expressions typically evaluated with a calculator. (Some sample expressions were provided to the instructors. The samples were not meant to be exhaustive, but to provide a benchmark.)
* (10 marks) The program should react properly to erroneous inputs.
Validation (10 marks) - Does the program represent the calculator type asked for in the exercise specification?
Style (10 marks) - Does the style of the program conform to local standards, including naming conventions and indentation? (The style measure was optional.)
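As a check on the arithmetic, the small Java sketch below (not taken from any grading tool used in the study) simply totals the allocations listed above to the 110-mark maximum.

// Totals the mark allocation of the General Evaluation Criteria; the numbers
// come directly from the breakdown above.
public class GeneralCriteriaMarks {
    public static void main(String[] args) {
        int execution    = 30;                  // compiles and runs without failure
        int verification = 10 + 10 + 30 + 10;   // multiple inputs, termination, benchmark data, error handling
        int validation   = 10;                  // implements the requested calculator type
        int style        = 10;                  // optional local style conventions
        System.out.println(execution + verification + validation + style);   // prints 110
    }
}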
Appendix C. DoC Evaluation Criteria
As a more subjective measure of the quality of a solution,
the working group developed an indicator that we came to
call the DoC score, for "Degree of Closeness" (or, with
tongues firmly in cheeks, "Depth of Cluelessness"). The
DoC score applies to programs that did not work and
indicates how close the solution was to working.
To assign the DoC score for a student's program, the
evaluator inspected the source code. The scores ranged
from 5 to 1, with 5 being the best. Generally, the
evaluators added notes to explain the reasons for the
assigned score.
DoC Score - Interpretation
5 - Touchdown. The program should have compiled and worked. If it did not work, it could be that the student simply ran out of time.
4 - Close but something missing. While the basic structure and functionality is apparent in the source code, the program is incomplete in some way. For example, it might have been missing a method or a part of a method, but everything else seemed fine.
3 - Close but far away. In reading the source code, the outline of a viable solution was apparent, including meaningful comments, stub code, or a good start on the code.
2 - Close but even farther away. The outline, comments, and stub code showed that the student had some idea about what was needed, but completed very little of the program.
1 - Not even close. The source code shows that the student had no idea about how to approach the problem.

Appendix D. Student Questionnaire
This version of the questionnaire was used at an American university. This questionnaire must be customized for each
participating university to solicit equivalent information.
Part 1: Personal Information
Name: ______________________________ IDNUM: ____________________
(please circle the correct choices below)
Sex: Male  Female
Class Rank: Freshman  Sophomore  Junior  Senior
Overall GPA: <2.0  2.0-2.5  2.5-3.0  3.0-3.5  >3.5
What grade do you expect to make in the course?  A  B  C  D  F
Major:
Part 2: Background
Where did you first learn to program in Java / C++? (please circle one)
Before High School    High School    College    Other:
Do you have any experience programming outside a classroom environment? If so, please explain.
Part 3: Study Reaction
Did you feel that the assigned task was difficult? What level of difficulty would you rank it? (please circle the level of difficulty)
Easy    Difficult    Hard    Impossible    Other: _______________________
What was the most difficult part of this assigned task? Was it the timed aspect of the problem, was the problem too difficult,
etc.? Please try to explain in a way that makes the difficulties clear for us.