American Association
for Higher Education
1998 AAHE Assessment Conference, Invited Address
June 14, 1998
A Summary, in Advance
A speaker benefits from having an easy
straw man to knock over. Here's mine. If you're going to
evaluate a program, common wisdom says that you should
- assess the educational outcomes of that program (only),
- look at how well the average student achieves those goals, and
- develop your tests and inquiry so that, ideally, you will be able to report achievement rather than being forced to look at and talk about failure.
I'm going
to try and knock over all three of those contentions, to
argue that each one of them is radically incomplete as a way
of looking at our programs of instruction. The problems they
share have particular relevance to the uses of technology,
but the problems are also important to the study of almost
any educational program.
First,
I'll argue that evaluation is more than just a matter
of outcomes assessment. Although the fourth principle of
good practice in assessment reminds us to look at students'
experiences, not just at what they learn, the commonplace
view seems to be that assessment can begin and end with the
question "Did they learn it?" I'll try to point out some of
the benefits of attending to means, not just ends.
Then in a
clever little pun I'm going to shift from means to the mean,
that is, the average. I'll talk about the crucial information
that is missed when we look only at common goals and average
scores, especially in programs that use technology to expand
creative work and work on open-ended problems.
In the
third and final segment of this talk, I'll argue that good
news can be hidden in bad news, that patterns of persistent
failure can yield fresh insight into a program's most dearly
held values and that this kind of evaluative data can
provide a foundation for a fresh approach to faculty
development.
I. The
"Means" Matter.
Ends matter, but so do means. If we
don't study how a result was achieved (as opposed to
the way we planned to achieve it), data about whether
the result was achieved is not very useful.
The
simplest form of this argument is really easy to make.
Imagine that we evaluate a program only by comparing its
outcomes with something else, for example with the program's
performance last year or with the outcomes from a competitor
program. The data show that the program could be performing
10% better, let's say. Without some insight into what people
actually did in the program (as opposed to what they said
they would do behind those closed classroom doors or while
off doing homework), how can we decide what to do next to
improve those outcomes? Since learning is most directly the
result of what students do, studying what students actually
did in a course, as opposed to what we hope or fear that
they did, yields useful information.
How can
typical faculty members and administrators look at
process -- at the means -- in ways that complement
outcomes and that can guide changes in policy and practice?
Asserting Some Definitions
That's a big question, but before answering it, I'm overdue
to assert some definitions. I say "assert" because none of
the following terms has a widely agreed-upon definition, so
it's my responsibility to say what I mean by each of them.
Figure 1 sets up a relationship among
technology, user behavior, and learner and other outcomes.
On the left-hand side of figure 1 is a box representing the
technology of the program, which includes not just computers
but chalkboards, the campus, and the way that faculty are
organized, that is the hardware, software, and social
technology of the situation. The middle box represents what
people chose to do with the technology. The right hand box
is the outcomes of what they did. For example, our
technology, right here and right now, includes this lecture
hall and me: that's the left box. The "users" of the
technology are you; you're choosing to pay some degree of
attention and some of you are taking notes: that's the
middle box. If someone were to test you later on what you
remember or what you've done as a result of this talk, those
are the outcomes, the right hand box.
In
addition to technology, user behavior, and outcomes, I need
to clarify some other ambiguous terms. When I say
assessment, I mean measuring the outcomes included in
the right hand box. When I say evaluation, I'm
talking about inquiry into how well the three boxes are
functioning together -- are users doing what was expected
with the technology (and, if not, why not) and, if so, are
the desired outcomes occurring (why or why not). So
assessment produces information that is crucial for
evaluation.
When I
say learning, I'll be talking about the middle box,
the user behavior. And when I talk about learning
outcomes, I'll be talking about the right hand box. So,
usually when I use a phrase like teaching and learning, I'll
mean what teachers and learners are doing right now (not
students' learning outcomes).
Notice
some other relationships among the boxes in this figure.
First, a dotted line from technology to user behavior
reminds us that the user has choices about what to do with
the technology and that technology is not the only
determinant of user behavior. What users do with technology
is often not what the teacher or designer assumed and hoped
that they would do. That's one proof that the technology is
indeed empowering!
Second,
lots of arrows go into the outcomes box, not just the line
that goes from technology to user behavior. Whatever the
user does with the technology is only one influence on the
outcomes. For example, how much did the users already know
before the intervention started? Because so many other
factors can affect outcomes, it's risky to reason purely
from outcomes data about how to change technology or
behavior.
"You Idiot"
"You idiot," people have occasionally
said to me (using politer terms, I'll admit). "It's simple
to figure out the importance of technology using only
outcomes data. You just do a controlled experiment." They
claim that it's possible to learn all we need to know about
the outcome by studying only the right hand box, if we are
very careful about how we make the comparison. A controlled
experiment into the role of technology occurs when we set up
two versions of a process that are identical except for the
technology.
But how
often can faculty members do an experiment that's so
carefully designed that the design can rule out all
"extraneous" factors and enable valid inferences about the
technology's distinctive role? For example, how can typical
experiments "control" what the students do? Although
controlled experiments may be possible in big research
studies, we're talking about evaluation of what is being
done here and now, not about research that focuses on
averages in multiple sites. Tip O'Neill once said that all
politics is local. I hope we can agree that "all education
is local." What happens on the average (research) tells us
only a very little about what is happening to us
(evaluation). Most of the factors leading into the right
hand box are very context-specific, very much about what's
happening right here, right now, this year, with these
people. If we can't "control" for variations in student
motivation and talent, in precisely what the faculty member
does, and in what's going on in the rest of students'
studies, outcomes cannot tell for sure whether the
technology itself worked.
As if
that weren't enough, there's a second difficulty in relying
only on outcomes data to make sense of technology. We would
like to compare outcomes of two methods we have used, Method
1 with Method 2, in order to decide which is better or
whether we're making progress. But can we directly compare
outcomes? What if the faculty member took advantage of the
technology to change the goals of the course in Method 2?
After all, one common reason to use technology is to help
change what is being learned.
For
example, consider a course in statistics (or graphic arts,
or any of the other courses whose content is intimately tied
to the use of some technology). Method 1, let's say, is a
statistics course of study taught 30 years ago with paper
and pencil methods. Because students could use only paper
and pencil (and maybe a simple adding machine) to do
homework or tests, the course of study could teach only
certain statistical techniques and certain ways of thinking
about data. The assignments, quizzes and exams fit that
vision of the course.
Method 2
is a contemporary course in which students use graphing
calculators, powerful computers with graphical displays, and
huge statistical databases on the Internet. Because the
field itself and the available tools have changed
dramatically, faculty have made major changes in what they
want students to learn. The course of study is now organized
around different kinds of statistical techniques. Students
also learn different attitudes and approaches to dealing
with data, approaches that are more iterative, more visual.
And, of course, the tests of achievement are dramatically
different from 30 years ago.
So, if
the tests of achievement are different for the two groups, Method 1 and Method 2, we
cannot compare average test scores -- outcomes -- to decide
how valuable the computers are. Let's stick with our
statistics example. Let's assume that the average score of
78% on the final exam is the same in the experimental group
and the comparison group. Other outcomes measures such as
job placement rates and student satisfaction are also
unchanged. Because we know that computers are currently
important for learning marketable skills in statistics, we
have to conclude that a simple comparison of outcomes is
producing inadequate, even misleading, results.
If comparing outcomes is
inadequate, what do we do?
I suggest two solutions. We can do
better with the assessment comparison than my example
suggested above, so we'll begin there. Then I'll return to
the basic problem, which even the following suggestion
doesn't totally resolve.
For the
statistics course, we can produce a more productive result
by comparing tests as well as test scores. We can ask a
panel of judges whose judgment we trust -- employers,
graduate school representatives, faculty members who teach
the courses that have statistics as a prerequisite -- to
examine not just the scores but also the tests themselves.
We can ask them to choose Method 1 with its tests and test
results or Method 2 with its tests and test results,
considering, of course, the cost of teaching each method.
Judges can report which method they prefer and why. That
process is one way out of this quandary about outcomes.
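Here is a minimal sketch of that quandary and of the judge-panel way out of it. Everything in it is invented for illustration: the exam scores, the judges, and their reasons are hypothetical, not data from any real course. The point is simply that identical mean scores on different tests tell us nothing by themselves, while a panel reviewing each method together with its own tests and results can still register a preference and a rationale.

```python
# Hypothetical illustration: identical mean scores on *different* tests
# say nothing about which course of study is more valuable.

method_1_scores = [70, 74, 78, 82, 86]   # paper-and-pencil course, graded on its own exam
method_2_scores = [66, 75, 78, 81, 90]   # computer-based course, graded on a different exam

mean_1 = sum(method_1_scores) / len(method_1_scores)
mean_2 = sum(method_2_scores) / len(method_2_scores)
print(f"Mean exam scores: Method 1 = {mean_1:.0f}%, Method 2 = {mean_2:.0f}%")
# Both means are 78%, yet the exams test different skills, so the tie is uninformative.

# A judge panel (employers, graduate school representatives, downstream faculty)
# instead reviews each method together with its tests and results,
# then states a preference and a reason.
judgments = [
    ("employer",              "Method 2", "graduates can work with real data sets"),
    ("graduate school rep",   "Method 2", "more emphasis on exploratory, visual analysis"),
    ("downstream instructor", "Method 1", "stronger hand-computation fundamentals"),
]
for method in ("Method 1", "Method 2"):
    votes = [j for j in judgments if j[1] == method]
    print(f"{method}: preferred by {len(votes)} judge(s)")
    for who, _, reason in votes:
        print(f"  {who}: {reason}")
```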
But we
still have a problem: just knowing that respected judges
preferred the computer-supported course doesn't tell us
enough to enable further improvement of the course and
advances in cost efficiency. Although we know the results of
the course, we still know very little about how the results
were achieved, even in a course we taught ourselves, because
so much depends on what students did when we couldn't see
them and on what they were thinking at the time. To
improve a course of study, faculty members usually need
information on how the technology was actually used
to complement whatever outcomes data or inferences about
outcomes that they can gather.
Looking at the Means
A second solution to our problem involves looking at the means. After identifying
educationally important practices (the middle box in figure
1) that depend on the technology, we can select practices we
suspect can make the difference between good outcomes and
bad. For example, we might consider the seven principles of
good practice, Gamson and Chickering's answer to the
question "What does research tell us are practices that
usually lead to good learning outcomes?" If we wanted to
explore the value of technology, we might find that some of
the seven principles (e.g., student-faculty interaction and
active learning) were implemented more thoroughly in Method
2 and that the technology was being used by students in
their active learning and their interaction with faculty.
Finally, if we were unable to measure directly
whether outcomes were "better" than for a comparison
(perhaps we're studying physics 101 and physics 103), it
would still be interesting to know whether one group's use
of technology was helping them implement the seven
principles of good practice better than the other group was
implementing them. These seven principles are so important
because there's so much research showing that implementing
these kinds of practices yields better learning outcomes.
For
example, imagine you're in an institution that has spent a
lot of money on E-mail and Internet connectivity. Your
institution wants to educate students who are better skilled
at working in teams than graduates were a decade ago.
Further, you may have data showing that graduates of a
program are getting better at working in teams, but you'd
still like to know whether the E-mail had anything to do
with that improvement. A necessary step is determining
whether the E-mail was used by students to work in teams.
How often did they use it? Are different types of students,
such as commuting students or students whose native language
is not English, using E-mail more than other types of
students? Are some kinds of students benefiting more or less
than the norm? When trying to work in teams, did students
find the E-mail a real help, or did they make their teams
work despite barriers posed by the E-mail media and the
E-mail system? Answers to those questions and others like
them would help to show what, if anything, your E-mail
investment had to do with the improvement in outcomes.
Suppose
that you found that E-mail was not being used
effectively to support improvements in the skills of
graduating students. Then other questions might occur to
you. How about the training for using E-mail for this purpose? How reliable is the E-mail system? How often do students use their computers for other purposes (something that might affect how often they log on)?
By
getting answers to these questions you begin to build up a
story of the role that the technology is playing or failing
to play in supporting the strategies in which you are
interested.
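Here is a minimal sketch, in Python, of what that story-building can look like in practice. The survey records, student categories, and rating scale below are entirely hypothetical; they are meant only to show how answers about the middle box (did students use E-mail for team work, and how well did it serve them?) can be broken out by type of student.

```python
# Hypothetical survey records: (student type, used e-mail for team work?,
# rating of e-mail as a medium for team work on a 1-5 scale, or None if unused)
responses = [
    ("commuter", True, 5), ("commuter", True, 4),  ("commuter", False, None),
    ("resident", True, 3), ("resident", False, None), ("resident", True, 2),
    ("ESL",      True, 4), ("ESL",      True, 4),  ("ESL",      False, None),
]

for group in ("commuter", "resident", "ESL"):
    rows = [r for r in responses if r[0] == group]
    users = [r for r in rows if r[1]]                 # those who used e-mail for team work
    ratings = [r[2] for r in users]
    usage_rate = len(users) / len(rows)
    avg_rating = sum(ratings) / len(ratings) if ratings else float("nan")
    print(f"{group:9s} used e-mail for teams: {usage_rate:.0%}; "
          f"mean rating among users: {avg_rating:.1f}")
```

If, say, commuting students turn out to use E-mail for team work heavily and rate it well while residential students bypass it, the E-mail investment's connection to improved team skills starts to look plausible for some students and doubtful for others, which is exactly the kind of story outcomes data alone cannot tell.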
To sketch
technology's role in helping students learn, you can address
at least five types of questions, four of which are not
outcomes assessment. The first three correspond to the
three boxes (the triad) in Figure 1.
- Questions about the technology per se (e.g., could students get access to it? how reliable was it? how good was the general training? are some students more familiar or skilled with the technology than others?)
- Questions about the practice or behavior per se (e.g., how often are students asked to work in groups? what training do they get in team skills? are some students already good at this coming into the program?)
- Questions about the outcome per se (e.g., what changes are there in the team skills of graduating students? how often are they called on to use those skills after graduation? how well do they do in those settings?). This is where outcomes assessment fits.
Then we
have two more sets of questions, about the arrows:
- Questions about the technology's use for the practice (e.g., how satisfactory was E-mail as a medium for team work? how often was it used for that purpose?)
- Questions about the practice's fostering of the outcomes (e.g., did commuting students who rate high in group skills also work extensively via E-mail? do graduates interviewed about their work in groups talk about group work they did in college that involved E-mail?)
It turns
out that many different disciplines and types of
institutions are using technology in similar ways, for
similar reasons, and with similar anxieties. That's what
makes the Flashlight Program possible and useful. This
project, which I direct, has been developing and
distributing survey and interview questions of these five
types. Many Flashlight items focus on the seven principles
of good practice, the ways that students and faculty use
technology to implement those principles, and some of the
most common problems that can block the functioning of such
triads. Information about Flashlight is available on our Web
site <http://www.tltgroup.org>. If you click on "FLASHLIGHT"
in the table of contents, you'll find material including a
summary of the issues and technologies we currently cover
("The Flashlight Program: Spotting an Elephant in the
Dark").
The site
also includes links to Flashlight-based research reports,
such as one by Gary Brown at Washington State University.
Brown's report provides an example of using Flashlight to
study how an outcome was achieved in an experimental seminar
program for at-risk students at WSU. Higher GPAs indicated
that the students coming out of this program were probably
benefiting, but had technology played a role? Armed with
Flashlight data about student learning practices, Brown and
his colleagues developed a convincing story about how the
freshman year gains were achieved: technology was being used
to implement principles of good practice. These findings
were used as part of a successful argument to
institutionalize the program.
II. What the
Mean Misses
The second part of my straw man focuses
on "what the mean misses." When I was at The Evergreen State
College in the late 1970s, I served as Director of
Educational Research. As part of my job, I would
periodically ask a faculty member how I could help in doing
evaluations. I'd say, "You pick the question. I will provide
all of the money and half the time needed to answer the
question. You will need to do the other half of the work. So
if you want to find out something, let's work together on
devising a really good question." Faculty often replied "OK,
what's a good question look like?" I would answer, "Imagine
your program as a black box. A mass of students is marching
in one end of the box and some time later they come out the
other side, changed. How do you want them to be different
after the program is over? Once you tell me that, we'll see
if we can come up with a 'difference detector' that is very
carefully geared to noticing whether this change has in fact
happened, and we'll go on from there."
I quickly
discovered that there were three kinds of faculty at
Evergreen. One sort of faculty member
enthusiastically and decisively answered how students should
be different, and we went on from there. A second group
answered my question "How do you want the students leaving
to be different from the students entering?" rather more
hesitantly. They had an answer, but they and I weren't too
satisfied with it. Finally a third group couldn't answer my
question at all: they couldn't say how they wanted students
to become different as a result of their program. So I
concluded, being 27 at the time, that this was the
difference between very good faculty, mediocre faculty, and
faculty who really didn't know what they were doing.
I then
moved to the Fund for the Improvement of Postsecondary
Education (FIPSE) where one of my duties was to work with
applicants and project directors on their evaluation plans.
I would ask them the same question: "How do you want people
to be changed as a result of their encounter with your
project?" Amazingly, FIPSE project directors fell into the
same three categories of great FIPSE project directors,
mediocre directors, and directors who never should have been
funded in the first place. Except that categorization was
clearly ludicrous. Many of these projects were clearly
superlative, despite the fact that my categories slotted
them as directionless. But if they were so good, why
couldn't they answer this seemingly simple question: "What
do you want the average student to learn as a result of his
or her encounter with your project?"
It took
me some years to see the difficulty. My question had
presumed a particular goal that was uniform for every
student: some particular way in which students were all to
be changed by the program. Figure 2 helps illustrate my
presumption. In Figure 2 each student is represented by an
arrow. Students' knowledge before entering the program is
represented by the base of the arrow -- some know more than
others at the start. The tips of their arrows represent
their capabilities by the end of the program. We can see
that they learned different amounts, but (we assume) they
all learned the same kind of thing -- the only thing we're concerned about -- learning in line with the program's
educational goal.
I now
call this the uniform impact perspective on education
because the educator's goals are what count: these goals are
the same for all students and a good program impacts even
students who initially don't want to learn. It's a very
legitimate, logical way to look at education. But, as you
know, it's not the only way to look at education.
Figure 3
offers a second perspective. It presumes that the
educational program is an opportunity. Different people come
in with different needs and different capabilities.
Accidents and coincidences happen. Students are creative in
different ways, leading to still more diversity of outcomes
from the "same" course or experience. After the program,
former students move into different life situations, further
changing the shape of the program's successes and failures.
In short, for many reasons, different people learn different
things as a result of their encounter with a learning
opportunity. These differences in learning are qualitative,
that is different in kind, and quantitative, that is
different in degree.
Figure 3
might represent all four people in a very tiny English
class. One masters grammar, one becomes a great poet, one
falls in love with Jane Austen's novels, and one picks up
skills that eventually lead to a job in advertising.
Imposing a uniform impact perspective labels the course a
failure. If its goal were to teach poetry, the average
learner became only slightly more interested. If its goal
were to teach grammar, ditto. Almost no one learned about
Jane Austen. And so on. But if the goal were that learners
took away something of life-changing importance related to
English, the course was 100% successful.
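To make the contrast concrete, here is a minimal sketch with entirely made-up numbers for those four imagined students. Measured against any single uniform goal, the class mean looks dismal; measured by the unique uses criterion (did each student take away something valuable related to English?), the same class scores 100%.

```python
# Hypothetical gains (0-100) of four students on several uniform goals,
# plus a connoisseur's judgment of whether each took away something valuable.
students = {
    "grammar master":   {"grammar": 95, "poetry": 10, "austen": 5,  "valuable": True},
    "budding poet":     {"grammar": 20, "poetry": 90, "austen": 15, "valuable": True},
    "Austen devotee":   {"grammar": 15, "poetry": 20, "austen": 95, "valuable": True},
    "future ad writer": {"grammar": 30, "poetry": 25, "austen": 10, "valuable": True},
}

# Uniform impact view: pick one goal and average everyone against it.
for goal in ("grammar", "poetry", "austen"):
    class_mean = sum(s[goal] for s in students.values()) / len(students)
    print(f"uniform impact, goal = {goal:7s}: class mean = {class_mean:.0f} (looks like failure)")

# Unique uses view: what fraction took away something of real value, whatever it was?
success_rate = sum(s["valuable"] for s in students.values()) / len(students)
print(f"unique uses criterion: {success_rate:.0%} of students gained something life-changing")
```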
These
qualitative differences in learning can sometimes be quite
big from one learner to the next, especially if the
instruction is meant to be empowering, research-oriented,
exploratory, individualized. And, of course, learner
empowerment is often the intent of using computers and
telecommunications.
I call
this perspective unique uses because it begins with
the assumption that learners are unique and that we are
interested in how they've made use of the educational
opportunity that is facing them. The key to assessing
learning in unique uses terms is not whether students all
learned some particular thing (uniform impact) but rather
whether they learned something -- anything -- that was quite valuable (by some broad, multi-faceted standard or process we use for determining value). In the English class of four students described above, the unique uses criterion used was whether the learning was of life-changing importance and
whether it had something to do with English.
College
effectiveness ought to be viewed mainly from the unique uses
perspective, especially in the liberal (liberating) arts.
What on the average is a college supposed to achieve
for its liberal graduates? College-wide learning goals are
difficult to agree upon, if restricted to specifying what
all learners must learn: the lowest common denominator of
geography majors, literature majors and physics majors. On
the other hand, if the goal for graduates is (also) that
something terrific happens to them as a direct result of
their college education, no matter what that outcome is,
we will notice very different things about their learning
and their lives. We might notice that two members of one
graduating class won Nobel Prizes, for example, and credited
their undergraduate educations in their acceptance speeches,
even though we'd never put "winning a Nobel Prize" on a list
of uniform impact goals for undergraduates.
Each
perspective -- uniform impact and unique uses -- picks up
something different about what's going on in that single
reality. This is not, in other words, a case of the good new
perspective versus the bad old one. In almost any
educational program these are two quite legitimate ways to
assess learning and to evaluate program performance. Each
focuses on elements that the other tends to ignore.
When
designing any assessment or evaluation, the relative
importance of those two perspectives is going to depend on
the educational program itself and the client's needs. For a
training program, a uniform impact perspective might catch
virtually everything of interest to a policy maker: did
every doctor in the program master that particular
open-heart surgical operation? On the other side, evaluating
the educational performance of a university may warrant
relatively modest attention to the uniform impact perspective:
most of the important outcomes differ in kind from one
department to the next and from one student to the next.
Usually, however, both perspectives are required to do a
fair and reasonably complete assessment or evaluation.
As
teachers, we apply both perspectives all the time. We want
students to master subject-verb agreement, so, with subjective, expert judgment, we design a test of that skill.
The students' scores signal (we hope) whether they have a
deep, lasting understanding of subject-verb agreement.
That's uniform impact assessment. We may evaluate the
course's performance each year in this area by the average
scores of students on the test. In the same course, we also
assign the theme "What characteristics of a college course
help us learn?" We give the resulting papers to an external
grader who grades each essay -- A, B, C, D, C, C, B, A. If we
ask the grader, "Why did you give those two papers a B?" or
"What did those three 'C' papers have in common?" the answer
might well be "They have nothing whatever in common, those
three C papers, except they were all C work." That's a
unique uses assessment. The grader had different reasons for
assigning each of the A's, each of the B's, and each of the
other grades. We might then ask the grader, "How good is
this year's version of the course?" And the grader (if he
had graded essays for this course before) probably would
have an opinion. That opinion might also include an expert
judgment on how good the course was in stimulating a variety
of types of good writing. That's a unique uses evaluation.
That's
just what happened at Brown University in a study of the use
of a precursor of the World Wide Web (Beeman et al., 1988).
As was customary for this English course, Professor George
Landow used an external grader on the essays for his
experimental section. The grader had years of experience grading final exam essays for this course. When she was shown the essay questions in advance, she told Landow what he might want to reconsider. "This will be a very difficult essay test," she warned. He said, "No, no, that's all right. I
want to give this test." The external grader must have
agreed in the end that students performed well on the test
because she gave many of the students A's. There was
probably a great diversity of achievement among those
students, different kinds of excellence, because of the web
of resources and the manner in which Landow had taken
advantage of that web in organizing the section's work. So,
after assessing each student's excellence, the grader drew
an evaluative conclusion: excellent course.
The next
point of distinction between the uniform impact perspective
and the unique uses point of view is their contrasting
definitions of excellence.
Through
uniform impact lenses, we see excellence in the ability to
produce the desired goal. One approach is better than
another if it's better at adding value in that particular
direction and can do so consistently even in a somewhat
different setting and with different staff. The term
"teacher-proof" is one variation on this theme: the program
produces results even if teachers aren't especially good.
For example, a calculus program is wonderful because even
when students come in hating calculus, they love it by the
end of the program, and their scores on calculus achievement
tests are really high. In uniform impact terms this is a
wonderful, wonderful program.
Determining whether a program is excellent in unique uses terms, on the other hand, involves evaluating the magnitude and variety of the students' best achievements, after assessing their work one at a time. Judging a
program design as excellent involves asking how many
different ways it has been adapted to different
settings and produced appropriate excellence.
Here's an
example of the recognition of the importance of variety. In
1987 I was involved with one of the first large-scale uses
of "chat rooms" in composition programs. The approach,
developed originally by Professor Trent Batson of Gallaudet
University, was called the ENFI Project, Educational
Networking For Interaction. Faculty in the project to some
extent did their own thing, embroidering on the basic ENFI
motif. But shouldn't they all be doing the same thing if the
evaluation was to mean anything? Batson had after all been
funded to test the practice of chat rooms in multiple
settings.
For
better or worse, nonetheless, faculty were using somewhat
different technology, and somewhat different teaching
methods, thereby exercising their academic freedom with a
vengeance. The uniform impact puzzle then was "Are they all
doing 'ENFI'?" From a unique uses perspective, however, we
could ask, "Has the concept of ENFI stimulated each faculty
member to do something wonderful and effective for his or
her students?" In fact, it would be a mark of the strength
of the ENFI concept if different adaptations of the
ENFI idea usually worked, even if in different ways (Bruce,
Peyton, and Batson, 1993).
For me,
by the way, Shakespeare's plays are a great example of this
sense of "excellence." I've grown over the years to prefer
Shakespeare to almost any other playwright, because no
matter how many times I see Macbeth or Hamlet,
the play is produced differently from the last time and the
differences are part of why the production is good. Even the
same producer and the same director and the same actors
create a different "Twelfth Night" each time. That's the
unique uses brand of excellence.
What kind
of evidence is sought in a uniform impact assessment? Very
sensitive instruments are specifically designed to pick up
progress in a particular direction: progress in achieving
the goal. Is this kind of evidence "objective"? Let's
consider the role of subjective judgment and expertise in
uniform impact assessment. A lot of judgment is used to
design instruments that are valid and reliable enough to
detect small differences in learning, the difference between
a B and a B+, let's say. The subjective judgment embedded in
these assessment instruments includes many somewhat
arbitrary decisions about what particular performances can
be trusted to stand in for the larger ability and about why
that larger ability is worth attention.
One difference between the assessment of unique uses and the assessment of uniform impacts is that with unique uses the act of judgment is much more visible, out on the surface. Although both types of assessment require expertise and subjective judgment, in uniform impact assessment what the judges have done lies buried beneath the test itself. The test does not foreground the decisions that led to the choice of its features, or the effort spent making sure that the test does indeed measure what faculty expect it to measure.
In unique
uses, on the other hand, students are assessed one at a
time. The people who place a value on the learning of each
student must be "connoisseurs," to use Elliott Eisner's
phrase. The external grader at Brown University whom I
mentioned, for example, had been grading exams for years for
many different teachers at Brown, all of whom taught
different sections of the same literature course. When she
said a paper was a B paper, there was a lot of expertise to
give some credence to her judgment. She was a connoisseur.
To do a
unique uses evaluation, we usually need a particular sort of
connoisseur. We may be interested only in outcomes that
relate somehow to a literature course, for example. But even
within those bounds of novels and poetry and falling in love
with words and understanding grammar, the connoisseur has a
wide range of judgments to make, comparing apples and
oranges.
How are
the two perspectives on evaluation different when it comes
to communicating findings in a convincing way? Some people
assume that uniform impact is more credible because
decision-makers only want numbers. Well, yes and no. About
twenty years ago Empire State College had a vice president
for evaluation named Ernie Palola. I was visiting Empire
after their evaluation shop had been in operation for
several years. Ernie pointed out a format for reporting
evaluations of which they were very proud. On top of a
single heavy sheet of paper was a frequently asked
evaluative question about this new college. Underneath was
the answer to that question, usually in the form of a number
and a table and a couple paragraphs of explanation. Each
page was a self-mailer, so if somebody mailed or phoned
in that particular question, this sheet of paper was folded,
stapled, and mailed to the inquirer. The report was brief,
quantitative, and to the point.
Although
Ernie was very proud of this way of communicating evaluative
data to the public, he said wryly, "The paradox for us is
that our most popular report, even now, is the first one
that this office issued. It's about 40 pages long, it has no
pictures, it has no numbers, it's solid text." As I recall,
this popular report was entitled something like "10 out of
30." Written after Empire State's first year of operation,
it consisted of long narratives about several of Empire
State's first students. Each chapter told a story of the
encounter by the student with the institution, what the
student did, and how well it seemed to work. Empire State
College: one student at a time.
The
stories added up to a story of a college, bigger than the
stories of the individual students. As has been often
observed, narrative is a very powerful way of teaching and a
very powerful way of learning. Those stories were a great
way to understand what this very strange institution was
about and how well it was doing. I can't imagine numbers
accomplishing this level of explanation and understanding
because numbers alone assume an unspoken context: how much
or how many of some quantity that evaluator and reader both
understand. With Empire State there was no shared, vivid
understanding. The stories helped supply that context.
Without such shared context, the number may be not nearly as
informative or decisive as the evaluator thinks it will be.
III. What the Good
News Misses
The third thing missed by my straw man
of evaluations that rely solely on outcomes assessment has
to do with the obsession with good news. The false analogy
between assessment and evaluation on the one hand and
grading on the other leads us too often to design
evaluations that focus on finding good news. That
perspective, obviously, misses important stuff.
The
obvious gap is that you need to detect problems before you
can fix them. This is more than a cognitive issue. Uri
Treisman once remarked, "Our problems are our most important
assets." What he meant was that energy and resources flow to
important problems. The more urgent and well documented the
problem, the more resources can flow to its solution. Not
everyone realizes that problems can be an asset. Some
faculty members, for example, avoid using items that focus
on worrisome issues because they don't want to look bad.
But if
you think about it in Treisman's way of resources flowing to
problems, imagine that you want to improve something about
your program. Don't you need to be able to document that
it's not working well in order to make the case that you're
going to need more money or help? Now that's not to say that
documenting a problem automatically leads to money, but it
does mean that you're going to have an easier time crafting
your request for more resources if you know more about
what's going wrong. As a long-time FIPSE program officer, I
can attest that we were much more responsive to proposals
that began by graphically documenting a real problem for
learners. Although there also had to be an opportunity to
solve the problem, identifying the problem was crucial.
But
there's a deeper sense of "looking for bad news" that I'd
like to explore. I'll begin with the project I mentioned
before about chat rooms, ENFI, Educational Networking For
Interaction. Visualize a scene: in a classroom you see a
circle of computers with big monitors. Students and a
faculty member are sitting behind computers, not talking to
each other, all typing. The dialogue of the class is
appearing and scrolling up the screen.
ENFI
provided a genre of dialogue that was midway between
informal oral discourse and the formal written academic
discourse that the students were trying to learn. This
mid-level written conversation provided a very different
ground and a different set of instructional possibilities
for the faculty member. It was an exciting new idea at a
time in the mid 1980s when the term chat room was not
yet widely known.
Trent
Batson, who had invented this approach, had asked the
Annenberg/CPB project where I worked for money for a
large-scale evaluation of this approach to teaching. He had
assembled a team of faculty members from seven colleges and
universities. When the Annenberg/CPB Project funded the ENFI
project, I, as the monitor of the grant, attended the first
meeting of the faculty after their courses had gotten under
way.
It was
about two months into the first semester, and the discussion
among these faculty had been going on, as I recall, for
about an hour and a half, maybe two hours. At that point
Laurie George, an English faculty member at the New York
Institute of Technology, turned to her colleague Marshall
Kremers and said rather quietly, "Marshall, you should tell
your story." He said equally quietly that he didn't want to.
She elbowed him a little bit and said, "No, you really
should talk about this, it's very important." So he
reluctantly began.
Kremers
said that on the second or third day of class the students
in their writing had suddenly just erupted in obscenities
and profanities that filled up everyone's screens. The
professor became just one line of text that kept getting
pushed off the screen by the flood of obscenities coming
onto the screen. Kremers kept typing "Let's get back on the
subject" or "Won't you quiet down?" but the flood of student
writing always pushed his words off the screen. Although he
thought about pulling the lectern out from the corner and
pounding on it, he decided, "No, this is an experiment; I've
got to stick with the paradigm."
So
Kremers walked out on his class. He came back later, either
in the same class hour or the next class meeting, but it
happened again: they blew him out of the classroom. It
happened a third time. The fourth time, he told us, he
managed to crush the rebellion. I don't think I've ever seen
a faculty member looking more ashamed or more guilty over
something that had happened in his classroom. He concluded
by saying, "I don't know what I did wrong." And there was a
long silence. And then somebody else in the room said,
"Well, you know, something like that happened to me."
Someone else added, "Yes, yes, something like that happened
to me, too." It turned out about a quarter of the people in
the room had had an experience something like that.
Diane
Thompson, an English faculty member at Northern Virginia
Community College, said, "Yes, something like that happened
to me, too. But this is the third semester I've been
teaching in this kind of environment. One of the things that
I've learned is that we rather glibly say that these are
'empowering' technologies, but we haven't really thought
about what 'empowering' means. Think about the French
Revolution! Think about what happened when those people got
a little bit of power. They started breaking windows and
doing some pretty nasty things testing their power.
"But this
is not all bad news. If you want to run a successful
composition course, the really important thing is to have
energy flowing into writing. And that's what you've got
there, Marshall," she said. "The challenge here is not to
crush the rebellion; it's to channel the energy!"
Well, all
of a sudden everybody was talking about how to channel the
energy. Meanwhile I was sitting there thinking that I'd seen
something like this before, at Evergreen. In fact it
happened pretty frequently because Evergreen was unlike
other teaching environments that most faculty had
experienced. Faculty coming to Evergreen often blamed
themselves for something that went wrong, something that
actually happened pretty frequently, although they didn't
know that because they were new to the institution.
But there
were some differences between Evergreen and the situation in
which Kremers found himself. First of all, Evergreen faculty
always taught in teams, new faculty members being teamed
with experienced faculty members. Experienced faculty would
counsel a newcomer, "This is the kind of thing that happens
at Evergreen. You may have done something particular to pull
the trigger, but this kind of thing goes wrong easily at
Evergreen. It's not a problem that can be easily eliminated
or avoided. You can, however, build on our past experience.
You might try this; you might try that." That sort of
conversation happened a lot at Evergreen. But Marshall
Kremers did not teach in a team. If he hadn't been part of
our evaluation team and able to learn with us, he might well
have simply stopped using ENFI.
A second
difference from Evergreen that also put Kremers at risk was
that he was dealing with new technology. Because technology
and its uses change every year, there isn't much chance to
accumulate a history about what has been going on, the way
that Evergreen's veteran faculty understood the dilemmas
posed for faculty.
I think
often about what a hair's breadth it was -- whether, if Laurie George hadn't been there to say, twice, to Marshall, "you really ought to tell your story," this experience would have come out at all. But she did prompt him to share his story, and
I'm told that he has written a couple of valuable articles
about it since then.
If we
taught people to fly the way that we teach them to use most
educational innovations, we would say to the not-yet-pilots,
"Look. This is an airplane. It's really great for going all
sorts of places. You could go to Portales, New Mexico; you
could go to Paris; you could go almost anywhere you want.
Now why don't you step into the cabin with me, and we'll
take off. We'll fly around a little bit, and we'll land back
here again. And then I'm going to hand you the keys to the
airplane, and if you want to go to Paris, it's east of here.
This button on the control panel is the radio, and if you
need a help line just push it because we usually have
somebody on duty and hopefully they can help you if you run
into trouble between here and Paris!" That's how we teach
most faculty to use technology in teaching in their
disciplines. We sell them on the technology and teach the
rudiments, but we don't prepare them for problems they might
encounter as part of the teaching activity. I define that as
a career risk.
We ought
to give faculty practice in "simulators," for want of a
better word, that enable them to get into and then out of
trouble in situations that are actually safe. One familiar
example of a simulator is a teaching case study that is
discussed by a seminar of faculty, but I don't know of any
teaching case studies that spring from a technology-related
problem like the one that hit Marshall Kremers. And I
suspect there aren't very many that have to do with really
innovative approaches to teaching generally; the ones I've
seen deal with classic problems, not emerging ones. The use
of simulators is awfully important because, number one,
faculty members need to have a reasonably safe experience,
safe to their careers, especially if they're junior faculty.
Getting into trouble with technology can be very traumatic. Junior faculty members
are often advised not to have anything to do with technology
until after they've gotten tenure, which is not exactly the
way for a university or a college to make fast progress.
Now I can
make my real point, about the good news that can be hidden
in bad news. Remember that first observation that Diane
Thompson made about the French Revolution and about
empowerment. I've never thought about empowerment the same
way since that day. Diane's observation about the dark side
of empowerment gave me a richer, more useful way of
understanding a whole range of phenomena. We gain a fuller
and richer understanding of the strengths of what we are
doing by looking the problems that it causes squarely in the
eye.
Here,
too, my experience at Evergreen was helpful. I decided what
core practices and goals to evaluate at Evergreen by first
asking what problems the College couldn't definitively
solve. Those dilemmas were the flip side of its strengths.
It couldn't solve such problems completely without
abandoning the corresponding strengths, so the problems
remained unsolved. For example, a perennial problem at
Evergreen was the student complaint of an insufficient
choice of courses. That stubborn problem helped point my
attention as an evaluator to Evergreen's practice of faculty
teaching only one course at a time, sometimes for a full
academic year, as part of a team. By deploying its effort
that way Evergreen was able to do many valuable things -- it made narrative evaluations much more feasible, for example, and gave faculty and students the kind of flexibility I mentioned earlier -- but one price was that the College could offer only a tiny fraction of the courses that a college its size would ordinarily teach. That problem was insoluble unless the College abandoned one of its core strengths. That's why an important part of my evaluation was then targeted on these full-time teaching and learning practices, because the insoluble problem had attracted my attention.
So
dilemmas and core strengths are often the flip sides of the
same practices. The more stubborn the problem, the more
important is the underlying goal or strategy for the
institution over the long haul.
Any
program offers a wide range of practices and values. Which
ones should an evaluator study? You can do worse than first
looking for insoluble problems, and then using them to
identify the most important, long-term goals and values.
Let's
apply this kind of thinking to faculty development and new
technology. I have a proposal to make. It comes in four
parts.
A. Research to identify
dilemmas
The first part is that I would urge
faculty to do more research aimed at discovering the dark
side of the force. Pick a new instructional situation,
teaching courses on the Web for example. Get people together
who have had a little bit of experience with such teaching.
Reassure them: "This is not going to get out; it's not going
to destroy your career; it's just within this room. Now
identify some of the most embarrassing things that have
happened to you as a result of the thing you've tried to do
with technology, or worrisome things, things that really
frayed your nerves or whatever. It's probably something that
never happened to anybody but you. That's OK. We want to
share the really bad stuff, though. And then we'll wait and
see whether other people say, 'You know, something like that
happened to me.' Because we're going to be looking for the
patterns, not necessarily universal patterns." Remember that
what happened to Kremers only happened to a quarter of the
people in the room. But if you've got 10 or 15 people there,
things that happened to two or three people would be, I
think, quite enough to be significant.
This
important scholarship is something that many faculty members
and institutions ought to do because there are so many
variations in what we do and, thus, so many dilemmas to
discover. Because this research is time-consuming, no
institution is going to be able to do it across the board.
There is, therefore, plenty of room for lots of people to do
this kind of research.
B. Develop "simulators"
Second, based on discovered dilemmas,
we then need to develop simulators -- teaching case studies,
role-plays, video trigger tapes for discussions, computer
simulations. Although I don't know what they all might look
like, they would have in common their ability to enable
faculty, teaching assistants and adjunct faculty to
encounter these kinds of situations in a safe setting where
they can try out different sorts of responses. Many of these
simulators will involve group discussion.
If you've
never used a case study before, don't underestimate a case
study by just reading it. Case studies are often not
fascinating reading. After describing a problem, they stop.
The case study itself is like the grain of sand in the
oyster. The value is not in what you learn by reading the
case. It's the pearl that develops as people say, "Here is
why I think the problem occurred and what I would do about
it."
For
example, I've been in other discussions about the kind of
anarchy that Marshall Kremers discovered, and not everyone
takes off from where Diane Thompson did, about empowerment.
Other folks have different kinds of analyses about why
Kremers' problem happened and thus different ways of
responding to it. For example, some might say that this kind
of problem happens frequently in groups. Or other
participants might point out that chat rooms can be
fundamentally, subtly annoying because of the difficulty in
timing your comments, so some kind of explosion is likely.
Each different analysis suggests a different set of
indicators to anticipate, and different responses when
trouble begins to develop. Because of the variety of
possible analyses, I favor relatively unstructured
simulators that give participants more freedom to suggest a
variety of analyses of the problem.
C. Shedding light on the core
ideas
The third step is to brainstorm about the dilemmas and ask
what strengths they reveal by their intransigence. Each
dilemma can reflect the underside of a goal or strength,
just as the Kremers anarchy reflects a richer view of an
empowered student. After using such a simulator, the
participants all can reflect: "What light does this shed on
the larger situation? How does this change our ideas about
the nature of what we're trying to do?" These kinds of role
plays and simulations can provide a setting for developing
richer, more balanced and nuanced insights into values and
activities that are most important for the education of
students.
D. Using simulators for faculty
development on a national or international scale
Finally we ought to make these kinds of
simulators more widely available. A simulator developed for
geography at a community college in Alaska may well have
relevance to an elite selective private university. The
biggest surprise in my visits to many institutions in this
country and abroad is that while faculty members differ in
the specifics of what they teach and learn, the dilemmas
that they face are comparatively universal, across
disciplinary lines, types of institutions, even national
boundaries and language barriers. For example, Kremers'
experience with anarchy in a chat room can appear wherever
chat rooms are used, which is in lots of fields and lots of
settings. A teaching case study that had transcripts of how
students exploded in a chat room environment could even be
translated into other languages and be used appropriately in
many countries around the world. Case studies developed in
the UK could be employed in the US.
How to
get the simulators into wide use? There are many
possibilities. For example, the TLT Group, of which I am a
part, could be helpful in offering workshops around the
world based on your simulators, face-to-face or online. I'm
hoping we can collect simulators developed in many places
and make the whole collection available internationally.
Disciplinary associations could perform the same
dissemination function within their fields.
I think
faculty could write and get funded proposals to create and
disseminate simulators. Faculty could go in different
directions and approach different funders to get support for
doing simulators in their arena.
Closing
Remarks
My straw man, basing evaluation on the
assessment of the average outcomes while looking for good
news, is not a bad thing, but it's a radically incomplete
way of evaluating academic programs.
First,
studying strategy-in-use, not just outcomes, is really
important. We must examine what people are actually doing to
achieve the outcome. The Flashlight Project's tools, for
example, prompt faculty to use data about strategy-in-use as
a part of the story about why outcomes might or might not be
changing. Look at people's satisfaction with the tools that
are in hand when used for that strategy and that goal. Great
outcomes might be achieved despite the tools rather than
because of them; that's just one of many reasons why
evaluations need to attend to means, not just ends.
Second,
attend to unique uses, not just uniform impacts. Today's
innovations, especially those using technology, tend to be
empowering. Like the library, they increase the role of
divergent learning: learning that is different for each
learner. If we fail to use unique uses assessments and
evaluations, we blind ourselves to a whole class of benefits
and problems.
Third,
look for bad news as well as good news, particularly because
the worst piece of news, the dilemmas, often are the flip
side of what's most important about a program and shed some
real light on the program's strengths. By developing
simulators that help people cope with problems that cannot
be definitively eliminated, you can protect the careers of the
people who are working in your institution. And if you can
help prepare them to deal with this bad stuff they're much
more likely to help their students learn.
References
- Beeman, William O., Kenneth T. Anderson, Gail Bader, James Larkin, Anne P. McClard, Patrick McQuillan, and Mark Shields (1988). Intermedia: A Case Study of Innovation in Higher Education. Providence, RI: Institute for Research in Information and Scholarship, Brown University.
- Bruce, Bertram, Joy Peyton, and Trent Batson (Eds.) (1993). Network-Based Classrooms: Promises and Realities. New York: Cambridge University Press.
- Chickering, Arthur and Zelda Gamson (1987). "Seven Principles of Good Practice in Undergraduate Education." AAHE Bulletin (March): 3-7.