What Outcomes Assessment Misses



American Association for Higher Education
1998 AAHE Assessment Conference, Invited Address
June 14, 1998

The Straw Man | The Means Matter | What the Mean Misses | What the Good News Misses | Closing Remarks | References

A Summary, in Advance

A speaker benefits from having an easy straw man to knock over. Here's mine. If you're going to evaluate a program, common wisdom says that you should

  1. assess the educational outcomes of that program (only)
  2. look at how well the average student achieves those goals, and
  3. develop your tests and inquiry so that, ideally, you will be able to report achievement rather than being forced to look at and talk about failure.

I'm going to try and knock over all three of those contentions, to argue that each one of them is radically incomplete as a way of looking at our programs of instruction. The problems they share have particular relevance to the uses of technology, but the problems are also important to the study of almost any educational program.

First, I'll argue that evaluation is more than just a matter of outcomes assessment. Although the fourth principle of good practice in assessment reminds us to look at students' experiences, not just at what they learn, the commonplace view seems to be that assessment can begin and end with the question "Did they learn it?" I'll try to point out some of the benefits of attending to means, not just ends.

Then in a clever little pun I'm going to shift from means to the mean, that is the average. I'll talk about the crucial information that is missed when we look only at common goals and average scores, especially in programs that use technology to expand creative work and work on open-ended problems.

In the third and final segment of this talk, I'll argue that good news can be hidden in bad news, that patterns of persistent failure can yield fresh insight into a program's most dearly held values and that this kind of evaluative data can provide a foundation for a fresh approach to faculty development.


I. The "Means" Matter.

Ends matter, but so do means. If we don't study how a result was achieved (as opposed to the way we planned to achieve it), data about whether the result was achieved is not very useful.

The simplest form of this argument is really easy to make. Imagine that we evaluate a program only by comparing its outcomes with something else, for example with the program's performance last year or with the outcomes from a competitor program. The data show that the program could be performing 10% better, let's say. Without some insight into what people actually did in the program (as opposed to what they said they would do behind those closed classroom doors or while off doing homework), how can we decide what to do next to improve those outcomes? Since learning is most directly the result of what students do, studying what students actually did in a course, as opposed to what we hope or fear that they did, yields useful information.

How can typical faculty members and administrators look at process -- at the means -- in ways that complement outcomes and that can guide changes in policy and practice?

Asserting Some Definitions

That's a big question, but before answering it, I'm overdue to assert some definitions. I say "assert" because none of the following terms has a widely agreed-upon definition, so it's my responsibility to say what I mean by each of them.

Figure 1. Technology, User Behavior, Learner (and Other) Outcomes

Figure 1 sets up a relationship among technology, user behavior, and learner and other outcomes. On the left-hand side of Figure 1 is a box representing the technology of the program, which includes not just computers but chalkboards, the campus, and the way that faculty are organized, that is, the hardware, software, and social technology of the situation. The middle box represents what people chose to do with the technology. The right-hand box is the outcomes of what they did. For example, our technology, right here and right now, includes this lecture hall and me: that's the left box. The "users" of the technology are you; you're choosing to pay some degree of attention and some of you are taking notes: that's the middle box. If someone were to test you later on what you remember or what you've done as a result of this talk, those are the outcomes, the right-hand box.

In addition to technology, user behavior, and outcomes, I need to clarify some other ambiguous terms. When I say assessment, I mean measuring the outcomes included in the right hand box. When I say evaluation, I'm talking about inquiry into how well the three boxes are functioning together -- are users doing what was expected with the technology (and, if not, why not) and, if so, are the desired outcomes occurring (why or why not). So assessment produces information that is crucial for evaluation.

When I say learning, I'll be talking about the middle box, the user behavior. And when I talk about learning outcomes, I'll be talking about the right hand box. So, usually when I use a phrase like teaching and learning, I'll mean what teachers and learners are doing right now (not students' learning outcomes).

Notice some other relationships among the boxes in this figure. First, a dotted line from technology to user behavior reminds us that the user has choices about what to do with the technology and that technology is not the only determinant of user behavior. What users do with technology is often not what the teacher or designer assumed and hoped that they would do. That's one proof that the technology is indeed empowering!

Second, lots of arrows go into the outcomes box, not just the line that goes from technology to user behavior. Whatever the user does with the technology is only one influence on the outcomes. For example, how much did the users already know before the intervention started? Because so many other factors can affect outcomes, it's risky to reason purely from outcomes data about how to change technology or behavior.


"You Idiot"

"You idiot," people have occasionally said to me (using politer terms, I'll admit). "It's simple to figure out the importance of technology using only outcomes data. You just do a controlled experiment." They claim that it's possible to learn all we need to know about the outcome by studying only the right hand box, if we are very careful about how we make the comparison. A controlled experiment into the role of technology occurs when we set up two versions of a process that are identical except for the technology.

But how often can faculty members do an experiment that's so carefully designed that the design can rule out all "extraneous" factors and enable valid inferences about the technology's distinctive role? For example, how can typical experiments "control" what the students do? Although controlled experiments may be possible in big research studies, we're talking about evaluation of what is being done here and now, not about research that focuses on averages in multiple sites. Tip O'Neill once said that all politics is local. I hope we can agree that "all education is local." What happens on the average (research) tells us only a very little about what is happening to us (evaluation). Most of the factors leading into the right hand box are very context-specific, very much about what's happening right here, right now, this year, with these people. If we can't "control" for variations in student motivation and talent, in precisely what the faculty member does, and in what's going on in the rest of students' studies, outcomes cannot tell for sure whether the technology itself worked.

As if that weren't enough, there's a second difficulty in relying only on outcomes data to make sense of technology. We would like to compare outcomes of two methods we have used, Method 1 with Method 2, in order to decide which is better or whether we're making progress. But can we directly compare outcomes? What if the faculty member took advantage of the technology to change the goals of the course in Method 2? After all, one common reason to use technology is to help change what is being learned.

For example, consider a course in statistics (or graphic arts, or any of the other courses whose content is intimately tied to the use of some technology). Method 1, let's say, is a statistics course of study taught 30 years ago with paper and pencil methods. Because students could use only paper and pencil (and maybe a simple adding machine) to do homework or tests, the course of study could teach only certain statistical techniques and certain ways of thinking about data. The assignments, quizzes and exams fit that vision of the course.

Method 2 is a contemporary course in which students use graphing calculators, powerful computers with graphical displays, and huge statistical databases on the Internet. Because the field itself and the available tools have changed dramatically, faculty have made major changes in what they want students to learn. The course of study is now organized around different kinds of statistical techniques. Students also learn different attitudes and approaches to dealing with data, approaches that are more iterative, more visual. And, of course, the tests of achievement are dramatically different from 30 years ago.

So, if the tests of achievement are different for the comparison group, Method 1, and the experimental group, Method 2, we cannot compare average test scores -- outcomes -- to decide how valuable the computers are. Let's stick with our statistics example. Let's assume that the average score of 78% on the final exam is the same in the experimental group and the comparison group. Other outcomes measures such as job placement rates and student satisfaction are also unchanged. Because we know that computers are currently important for learning marketable skills in statistics, we have to conclude that a simple comparison of outcomes is producing inadequate, even misleading, results.

If comparing outcomes is inadequate, what do we do?

I suggest two solutions. We can do better with the assessment comparison than my example suggested above, so we'll begin there. Then I'll return to the basic problem, which even the following suggestion doesn't totally resolve.

For the statistics course, we can get a more useful result by comparing tests as well as test scores. We can ask a panel of judges whose judgment we trust -- employers, graduate school representatives, faculty members who teach the courses that have statistics as a prerequisite -- to examine not just the scores but also the tests themselves. We can ask them to choose Method 1 with its tests and test results or Method 2 with its tests and test results, considering, of course, the cost of teaching each method. Judges can report which method they prefer and why. That process is one way out of this quandary about outcomes.

But we still have a problem: just knowing that respected judges preferred the computer-supported course doesn't tell us enough to enable further improvement of the course and advances in cost efficiency. Although we know the results of the course, we still know very little about how the results were achieved, even in a course we taught ourselves, because so much depends on what students did when we couldn't see them and on what they were thinking at the time. To improve a course of study, faculty members usually need information on how the technology was actually used, to complement whatever outcomes data or inferences about outcomes they can gather.

Looking at the Means

A second solution to our problem involves looking at the means. After identifying educationally important practices (the middle box in Figure 1) that depend on the technology, we can select practices we suspect can make the difference between good outcomes and bad. For example, we might consider the seven principles of good practice, Gamson and Chickering's answer to the question "What does research tell us are practices that usually lead to good learning outcomes?" If we wanted to explore the value of technology, we might find that some of the seven principles (e.g., student-faculty interaction and active learning) were implemented more thoroughly in Method 2 and that the technology was being used by students in their active learning and their interaction with faculty. Finally, if we were unable to measure directly whether outcomes were "better" than for a comparison (perhaps we're studying physics 101 and physics 103), it would still be interesting to know whether one group's use of technology was helping them implement the seven principles of good practice better than the other group was implementing them. These seven principles are so important because there's so much research showing that implementing these kinds of practices yields better learning outcomes.

For example, imagine you're in an institution that has spent a lot of money on E-mail and Internet connectivity. Your institution wants to educate students who are better skilled at working in teams than graduates were a decade ago. Further, you may have data showing that graduates of a program are getting better at working in teams, but you'd still like to know whether the E-mail had anything to do with that improvement. A necessary step is determining whether the E-mail was used by students to work in teams. How often did they use it? Are different types of students, such as commuting students or students whose native language is not English, using E-mail more than other types of students? Are some kinds of students benefiting more or less than the norm? When trying to work in teams, did students find the E-mail a real help, or did they make their teams work despite barriers posed by the E-mail medium and the E-mail system? Answers to those questions and others like them would help to show what, if anything, your E-mail investment had to do with the improvement in outcomes.

Suppose that you found that E-mail was not being used effectively to support improvements in the skills of graduating students. Then other questions might occur to you. How about the training for using E-mail for this purpose? How reliable is the E-mail service? How often do students use their computers for other purposes (that might affect how often they log on)?

By getting answers to these questions you begin to build up a story of the role that the technology is playing or failing to play in supporting the strategies in which you are interested.
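The subgroup questions above lend themselves to a very simple tabulation. What follows is only a minimal sketch, not a Flashlight instrument: the survey rows, the group labels, and the helper name `mean_use_by_group` are all hypothetical, invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical survey rows: (student group, team-related E-mail messages per week).
# These numbers are invented for illustration only.
responses = [
    ("commuting", 6),
    ("commuting", 4),
    ("residential", 1),
    ("residential", 3),
    ("non-native English speaker", 7),
    ("non-native English speaker", 5),
]


def mean_use_by_group(rows):
    """Return the average weekly team-related E-mail use for each student group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of messages, number of students]
    for group, messages in rows:
        totals[group][0] += messages
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# The overall average hides the differences that the per-group averages reveal.
overall = sum(messages for _, messages in responses) / len(responses)
by_group = mean_use_by_group(responses)

print(f"overall: {overall:.2f}")
for group, average in by_group.items():
    print(f"{group}: {average:.2f}")
```

With data like these, the single overall figure would say little, while the breakdown would show at a glance which kinds of students are relying on E-mail for team work and which are not.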

To sketch technology's role in helping students learn, you can address at least five types of questions, four of which are not outcomes assessment. The first three correspond to the three boxes (the triad) in Figure 1.

  1. Questions about the technology per se (e.g. could students get access to it? how reliable was it? how good was the general training? are some students more familiar or skilled with the technology than others?)
  2. Questions about the practice or behavior, per se (e.g., how often are students asked to work in groups? what training do they get in team skills? are some students already good at this coming into the program?)
  3. Questions about the outcome, per se (e.g., what changes are there in team skills of graduating students? how often are they called on to use those skills after graduation? how well do they do in those settings?). This is where outcomes assessment fits.

Then we have two more sets of questions, about the arrows:

  4. Questions about the technology's use for the practice (e.g., how satisfactory was E-mail as a medium for team work? how often was it used for that purpose?)
  5. Questions about the practice's fostering of the outcomes (e.g., did commuting students who rate high in group skills also work extensively via E-mail? do graduates interviewed about their work in groups talk about group work they did in college that involved E-mail?)

It turns out that many different disciplines and types of institutions are using technology in similar ways, for similar reasons, and with similar anxieties. That's what makes the Flashlight Program possible and useful. This project, which I direct, has been developing and distributing survey and interview questions of these five types. Many Flashlight items focus on the seven principles of good practice, the ways that students and faculty use technology to implement those principles, and some of the most common problems that can block the functioning of such triads. Information about Flashlight is available on our Web site <http://www.tltgroup.org>. If you click on "FLASHLIGHT" in the table of contents, you'll find material including a summary of the issues and technologies we currently cover ("The Flashlight Program: Spotting an Elephant in the Dark").

The site also includes links to Flashlight-based research reports, such as one by Gary Brown at Washington State University. Brown's report provides an example of using Flashlight to study how an outcome was achieved in an experimental seminar program for at-risk students at WSU. Higher GPAs indicated that the students coming out of this program were probably benefiting, but had technology played a role? Armed with Flashlight data about student learning practices, Brown and his colleagues developed a convincing story about how the freshman year gains were achieved: technology was being used to implement principles of good practice. These findings were used as part of a successful argument to institutionalize the program.


II. What the Mean Misses

The second part of my straw man focuses on "what the mean misses." When I was at The Evergreen State College in the late 1970s, I served as Director of Educational Research. As part of my job, I would periodically ask a faculty member how I could help in doing evaluations. I'd say, "You pick the question. I will provide all of the money and half the time needed to answer the question. You will need to do the other half of the work. So if you want to find out something, let's work together on devising a really good question." Faculty often replied "OK, what's a good question look like?" I would answer, "Imagine your program as a black box. A mass of students is marching in one end of the box and some time later they come out the other side, changed. How do you want them to be different after the program is over? Once you tell me that, we'll see if we can come up with a 'difference detector' that is very carefully geared to noticing whether this change has in fact happened, and we'll go on from there."

I quickly discovered that there were three kinds of faculty at Evergreen. One sort of faculty member enthusiastically and decisively answered how students should be different, and we went on from there. A second group answered my question "How do you want the students leaving to be different from the students entering?" rather more hesitantly. They had an answer, but they and I weren't too satisfied with it. Finally a third group couldn't answer my question at all: they couldn't say how they wanted students to become different as a result of their program. So I concluded, being 27 at the time, that this was the difference between very good faculty, mediocre faculty, and faculty who really didn't know what they were doing.

I then moved to the Fund for the Improvement of Postsecondary Education (FIPSE) where one of my duties was to work with applicants and project directors on their evaluation plans. I would ask them the same question: "How do you want people to be changed as a result of their encounter with your project?" Amazingly, FIPSE project directors fell into the same three categories: great project directors, mediocre directors, and directors who never should have been funded in the first place. Except that categorization was clearly ludicrous. Many of these projects were clearly superlative, despite the fact that my categories slotted them as directionless. But if they were so good, why couldn't they answer this seemingly simple question: "What do you want the average student to learn as a result of his or her encounter with your project?"

It took me some years to see the difficulty. My question had presumed a particular goal that was uniform for every student: some particular way in which students were all to be changed by the program. Figure 2 helps illustrate my presumption. In Figure 2 each student is represented by an arrow. Students' knowledge before entering the program is represented by the base of the arrow -- some know more than others at the start. The tips of their arrows represent their capabilities by the end of the program. We can see that they learned different amounts, but (we assume) they all learned the same kind of thing -- the only thing we're concerned about -- learning in line with the program's educational goal.

I now call this the uniform impact perspective on education because the educator's goals are what count: these goals are the same for all students and a good program impacts even students who initially don't want to learn. It's a very legitimate, logical way to look at education. But, as you know, it's not the only way to look at education.

Figure 2. Uniform Impact Perspective on Learning (four upward-pointing arrows)

Figure 3 offers a second perspective. It presumes that the educational program is an opportunity. Different people come in with different needs and different capabilities. Accidents and coincidences happen. Students are creative in different ways, leading to still more diversity of outcomes from the "same" course or experience. After the program, former students move into different life situations, further changing the shape of the program's successes and failures. In short, for many reasons, different people learn different things as a result of their encounter with a learning opportunity. These differences in learning are qualitative, that is different in kind, and quantitative, that is different in degree.

Figure 3 might represent all four people in a very tiny English class. One masters grammar, one becomes a great poet, one falls in love with Jane Austen's novels, and one picks up skills that eventually lead to a job in advertising. Imposing a uniform impact perspective labels the course a failure. If its goal were to teach poetry, the average learner became only slightly more interested. If its goal were to teach grammar, ditto. Almost no one learned about Jane Austen. And so on. But if the goal were that learners took away something of life-changing importance related to English, the course was 100% successful.

Figure 3. Unique Uses Perspective (arrows pointing in different directions)

These qualitative differences in learning can sometimes be quite big from one learner to the next, especially if the instruction is meant to be empowering, research-oriented, exploratory, individualized. And, of course, learner empowerment is often the intent of using computers and telecommunications.

I call this perspective unique uses because it begins with the assumption that learners are unique and that we are interested in how they've made use of the educational opportunity that is facing them. The key to assessing learning in unique uses terms is not whether students all learned some particular thing (uniform impact) but rather whether they learned something -- anything -- that was quite valuable (by some broad, multi-faceted standard or process we use for determining value). In the English class of four students described above, the unique uses criterion was whether the learning was of life-changing importance and whether it had something to do with English.
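To make the contrast concrete, here is a minimal sketch of the four-student English class scored both ways. Everything in it is an assumption made for illustration: the 0-10 gain scale, the individual numbers, and the cutoff for "quite valuable" learning. In practice that last judgment would come from an expert reader, not from a numeric threshold.

```python
# Hypothetical learning gains for the four-student English class, on an assumed 0-10 scale.
# Each student made one large gain, each in a different direction.
gains = {
    "student_1": {"grammar": 9, "poetry": 1, "austen": 0, "advertising": 1},
    "student_2": {"grammar": 1, "poetry": 9, "austen": 1, "advertising": 0},
    "student_3": {"grammar": 0, "poetry": 1, "austen": 9, "advertising": 0},
    "student_4": {"grammar": 1, "poetry": 0, "austen": 0, "advertising": 9},
}

THRESHOLD = 8  # assumed stand-in for an expert's judgment of "quite valuable"


def uniform_impact(gains, goal):
    """Average gain of all students toward one pre-set goal."""
    return sum(student[goal] for student in gains.values()) / len(gains)


def unique_uses(gains, threshold=THRESHOLD):
    """Share of students whose best gain, in any direction, meets the threshold."""
    successes = sum(
        1 for student in gains.values() if max(student.values()) >= threshold
    )
    return successes / len(gains)


for goal in ("grammar", "poetry", "austen", "advertising"):
    print(f"uniform impact, {goal}: {uniform_impact(gains, goal):.2f} out of 10")
print(f"unique uses: {unique_uses(gains):.0%} of students")
```

Under these assumed numbers, every single-goal average comes out low, yet every student took away something substantial. Neither calculation is wrong; each one simply picks up what the other leaves out.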

College effectiveness ought to be viewed mainly from the unique uses perspective, especially in the liberal (liberating) arts. What on the average is a college supposed to achieve for its liberal arts graduates? College-wide learning goals are difficult to agree upon, if restricted to specifying what all learners must learn: the lowest common denominator of geography majors, literature majors and physics majors. On the other hand, if the goal for graduates is (also) that something terrific happens to them as a direct result of their college education, no matter what that outcome is, we will notice very different things about their learning and their lives. We might notice that two members of one graduating class won Nobel Prizes, for example, and credited their undergraduate educations in their acceptance speeches, even though we'd never put "winning a Nobel Prize" on a list of uniform impact goals for undergraduates.

Each perspective -- uniform impact and unique uses -- picks up something different about what's going on in that single reality. This is not, in other words, a case of the good new perspective versus the bad old one. In almost any educational program these are two quite legitimate ways to assess learning and to evaluate program performance. Each focuses on elements that the other tends to ignore.

When designing any assessment or evaluation, the relative importance of those two perspectives is going to depend on the educational program itself and the client's needs. For a training program, a uniform impact perspective might catch virtually everything of interest to a policy maker: did every doctor in the program master that particular open-heart surgical operation? On the other side, evaluating the educational performance of a university may warrant relatively modest attention to the uniform impact perspective: most of the important outcomes differ in kind from one department to the next and from one student to the next. Usually, however, both perspectives are required to do a fair and reasonably complete assessment or evaluation.

As teachers, we apply both perspectives all the time. We want students to master subject-verb agreement, so with subjective, expert judgment we design a test of that skill. The students' scores signal (we hope) whether they have a deep, lasting understanding of subject-verb agreement. That's uniform impact assessment. We may evaluate the course's performance each year in this area by the average scores of students on the test. In the same course, we also assign the theme "What characteristics of a college course help us learn?" We give the resulting papers to an external grader who grades the essays A, B, C, D, C, C, B, A. If we ask the grader, "Why did you give those two papers a B?" or "What did those three 'C' papers have in common?" the answer might well be "They have nothing whatever in common, those three C papers, except they were all C work." That's a unique uses assessment. The grader had different reasons for assigning each of the A's, each of the B's, and each of the other grades. We might then ask the grader, "How good is this year's version of the course?" And the grader (if he had graded essays for this course before) probably would have an opinion. That opinion might also include an expert judgment on how good the course was in stimulating a variety of types of good writing. That's a unique uses evaluation.

That's just what happened at Brown University in a study of the use of a precursor of the World Wide Web (Beeman et al., 1988). As was customary for this English course, Professor George Landow used an external grader on the essays for his experimental section. The grader had years of experience grading final exam essays for this course and, when she was shown the essay questions in advance, told Landow what he might want to consider. "This will be a very difficult essay test," she warned. He said, "No, no, that's all right. I want to give this test." The external grader must have agreed in the end that students performed well on the test because she gave many of the students A's. There was probably a great diversity of achievement among those students, different kinds of excellence, because of the web of resources and the manner in which Landow had taken advantage of that web in organizing the section's work. So, after assessing each student's excellence, the grader drew an evaluative conclusion: excellent course.

The next point of distinction between the uniform impact perspective and the unique uses point of view is their contrasting definitions of excellence.

Through uniform impact lenses, we see excellence in the ability to produce the desired goal. One approach is better than another if it's better at adding value in that particular direction and can do so consistently even in a somewhat different setting and with different staff. The term "teacher-proof" is one variation on this theme: the program produces results even if teachers aren't especially good. For example, a calculus program is wonderful because even when students come in hating calculus, they love it by the end of the program, and their scores on calculus achievement tests are really high. In uniform impact terms this is a wonderful, wonderful program.

To determine whether a program is excellent in unique uses terms, on the other hand, one evaluates the magnitude and variety of the best achievements of the students, after assessing the students' work one at a time. Judging a program design as excellent involves asking how many different ways it has been adapted to different settings and produced appropriate excellence.

Here's an example of the recognition of the importance of variety. In 1987 I was involved with one of the first large-scale uses of "chat rooms" in composition programs. The approach, developed originally by Professor Trent Batson of Gallaudet University, was called the ENFI Project, Educational Networking For Interaction. Faculty in the project to some extent did their own thing, embroidering on the basic ENFI motif. But shouldn't they all be doing the same thing if the evaluation was to mean anything? Batson had after all been funded to test the practice of chat rooms in multiple settings.

For better or worse, nonetheless, faculty were using somewhat different technology, and somewhat different teaching methods, thereby exercising their academic freedom with a vengeance. The uniform impact puzzle then was "Are they all doing 'ENFI'?" From a unique uses perspective, however, we could ask, "Has the concept of ENFI stimulated each faculty member to do something wonderful and effective for his or her students?" In fact, it would be a mark of the strength of the ENFI concept if different adaptations of the ENFI idea usually worked, even if in different ways (Bruce, Peyton, and Batson, 1993).

For me, by the way, Shakespeare's plays are a great example of this sense of "excellence." I've grown over the years to prefer Shakespeare to almost any other playwright, because no matter how many times I see Macbeth or Hamlet, the play is produced differently from the last time and the differences are part of why the production is good. Even the same producer and the same director and the same actors create a different "Twelfth Night" each time. That's the unique uses brand of excellence.

What kind of evidence is sought in a uniform impact assessment? Very sensitive instruments are specifically designed to pick up progress in a particular direction: progress in achieving the goal. Is this kind of evidence "objective?" Let's consider the role of subjective judgment and expertise in uniform impact assessment. A lot of judgment is used to design instruments that are valid and reliable enough to detect small differences in learning, the difference between a B and a B+, let's say. The subjective judgment embedded in these assessment instruments includes many somewhat arbitrary decisions about what particular performances can be trusted to stand in for the larger ability and about why that larger ability is worth attention.

One difference between the assessment of unique uses and uniform impacts is that the act of judgment is much more out in the open with unique uses. Although both types of assessment require expertise and subjective judgment, in uniform impact assessment the judges' work is buried beneath the test itself. The test does not foreground the decisions that shaped its features, or the effort spent making sure that it does indeed measure what faculty expect it to measure.

In unique uses, on the other hand, students are assessed one at a time. The people who place a value on the learning of each student must be "connoisseurs," to use Elliot Eisner's phrase. The external grader at Brown University whom I mentioned, for example, had been grading exams for years for many different teachers at Brown, all of whom taught different sections of the same literature course. When she said a paper was a B paper, there was a lot of expertise to give some credence to her judgment. She was a connoisseur.

To do a unique uses evaluation, we usually need a particular sort of connoisseur. We may be interested only in outcomes that relate somehow to a literature course, for example. But even within those bounds of novels and poetry and falling in love with words and understanding grammar, the connoisseur has a wide range of judgments to make, comparing apples and oranges.

How are the two perspectives on evaluation different when it comes to communicating findings in a convincing way? Some people assume that uniform impact is more credible because decision-makers only want numbers. Well, yes and no. About twenty years ago Empire State College had a vice president for evaluation named Ernie Palola. I was visiting Empire after their evaluation shop had been in operation for several years. Ernie pointed out a format for reporting evaluations of which they were very proud. On top of a single heavy sheet of paper was a frequently asked evaluative question about this new college. Underneath was the answer to that question, usually in the form of a number and a table and a couple paragraphs of explanation. Each page was a self-mailer, so if somebody would mail or phone in that particular question, this sheet of paper was folded, stapled, and mailed to the inquirer. The report was brief, quantitative, and to the point.

Although Ernie was very proud of this way of communicating evaluative data to the public, he said wryly, "The paradox for us is that our most popular report, even now, is the first one that this office issued. It's about 40 pages long, it has no pictures, it has no numbers, it's solid text." As I recall, this popular report was entitled something like "10 out of 30." Written after Empire State's first year of operation, it consisted of long narratives about several of Empire State's first students. Each chapter told a story of the encounter by the student with the institution, what the student did, and how well it seemed to work. Empire State College: one student at a time.

The stories added up to a story of a college, bigger than the stories of the individual students. As has been often observed, narrative is a very powerful way of teaching and a very powerful way of learning. Those stories were a great way to understand what this very strange institution was about and how well it was doing. I can't imagine numbers accomplishing this level of explanation and understanding because numbers alone assume an unspoken context: how much or how many of some quantity that evaluator and reader both understand. With Empire State there was no shared, vivid understanding. The stories helped supply that context. Without such shared context, a number may not be nearly as informative or decisive as the evaluator thinks it will be.


III. What the Good News Misses

The third thing missed by my straw man of evaluations that rely solely on outcomes assessment has to do with the obsession with good news. The false analogy between assessment and evaluation on the one hand and grading on the other leads us too often to design evaluations that focus on finding good news. That perspective, obviously, misses important stuff.

The obvious gap is that you need to detect problems before you can fix them. This is more than a cognitive issue. Uri Treisman once remarked, "Our problems are our most important assets." What he meant was that energy and resources flow to important problems. The more urgent and well documented the problem, the more resources can flow to its solution. Not everyone realizes that problems can be an asset. Some faculty members, for example, avoid using items that focus on worrisome issues because they don't want to look bad.

But if you think about it in Treisman's way of resources flowing to problems, imagine that you want to improve something about your program. Don't you need to be able to document that it's not working well in order to make the case that you're going to need more money or help? Now that's not to say that documenting a problem automatically leads to money, but it does mean that you're going to have an easier time crafting your request for more resources if you know more about what's going wrong. As a long-time FIPSE program officer, I can attest that we were much more responsive to proposals that began by graphically documenting a real problem for learners. Although there also had to be an opportunity to solve the problem, identifying the problem was crucial.

But there's a deeper sense of "looking for bad news" that I'd like to explore. I'll begin with the project I mentioned before about chat rooms, ENFI, Educational Networking For Interaction. Visualize a scene: in a classroom you see a circle of computers with big monitors. Students and a faculty member are sitting behind computers, not talking to each other, all typing. The dialogue of the class is appearing and scrolling up the screen.

ENFI provided a genre of dialogue that was midway between informal oral discourse and the formal written academic discourse that the students were trying to learn. This mid-level written conversation provided a very different ground and a different set of instructional possibilities for the faculty member. It was an exciting new idea at a time, in the mid-1980s, when the term "chat room" was not yet widely known.

Trent Batson, who had invented this approach, had asked the Annenberg/CPB Project, where I worked, for money for a large-scale evaluation of this approach to teaching. He had assembled a team of faculty members from seven colleges and universities. When the Annenberg/CPB Project funded the ENFI project, I, as the monitor of the grant, attended the first meeting of the faculty after their courses had gotten under way.

It was about two months into the first semester, and the discussion among these faculty had been going on, as I recall, for about an hour and a half, maybe two hours. At that point Laurie George, an English faculty member at the New York Institute of Technology, turned to her colleague Marshall Kremers and said rather quietly, "Marshall, you should tell your story." He said equally quietly that he didn't want to. She elbowed him a little bit and said, "No, you really should talk about this, it's very important." So he reluctantly began.

Kremers said that on the second or third day of class the students in their writing had suddenly just erupted in obscenities and profanities that filled up everyone's screens. The professor became just one line of text that kept getting pushed off the screen by the flood of obscenities coming onto the screen. Kremers kept typing "Let's get back on the subject" or "Won't you quiet down?" but the flood of student writing always pushed his words off the screen. Although he thought about pulling the lectern out from the corner and pounding on it, he decided, "No, this is an experiment; I've got to stick with the paradigm."

So Kremers walked out on his class. He came back later, either in the same class hour or the next class meeting, but it happened again: they blew him out of the classroom. It happened a third time. The fourth time, he told us, he managed to crush the rebellion. I don't think I've ever seen a faculty member looking more ashamed or more guilty over something that had happened in his classroom. He concluded by saying, "I don't know what I did wrong." And there was a long silence. And then somebody else in the room said, "Well, you know, something like that happened to me." Someone else added, "Yes, yes, something like that happened to me, too." It turned out about a quarter of the people in the room had had an experience something like that.

Diane Thompson, an English faculty member at Northern Virginia Community College, said, "Yes, something like that happened to me, too. But this is the third semester I've been teaching in this kind of environment. One of the things that I've learned is that we rather glibly say that these are 'empowering' technologies, but we haven't really thought about what 'empowering' means. Think about the French Revolution! Think about what happened when those people got a little bit of power. They started breaking windows and doing some pretty nasty things testing their power.

"But this is not all bad news. If you want to run a successful composition course, the really important thing is to have energy flowing into writing. And that's what you've got there, Marshall," she said. "The challenge here is not to crush the rebellion; it's to channel the energy!"

Well, all of a sudden everybody was talking about how to channel the energy. Meanwhile I was sitting there thinking that I'd seen something like this before, at Evergreen. In fact it happened pretty frequently because Evergreen was unlike other teaching environments that most faculty had experienced. Faculty coming to Evergreen often blamed themselves for something that went wrong, something that actually happened pretty frequently, although they didn't know that because they were new to the institution.

But there were some differences between Evergreen and the situation in which Kremers found himself. First of all, Evergreen faculty always taught in teams, new faculty members being teamed with experienced faculty members. Experienced faculty would counsel a newcomer, "This is the kind of thing that happens at Evergreen. You may have done something particular to pull the trigger, but this kind of thing goes wrong easily at Evergreen. It's not a problem that can be easily eliminated or avoided. You can, however, build on our past experience. You might try this; you might try that." That sort of conversation happened a lot at Evergreen. But Marshall Kremers did not teach in a team. If he hadn't been part of our evaluation team and able to learn with us, he might well have simply stopped using ENFI.

A second difference from Evergreen that also put Kremers at risk was that he was dealing with new technology. Because technology and its uses change every year, there isn't much chance to accumulate a history about what has been going on, the way that Evergreen's veteran faculty understood the dilemmas posed for faculty.

I think often about what a hair's breadth it was: if Laurie George hadn't been there to say, twice, to Marshall, "You really ought to tell your story," would this experience have come out at all? But she did prompt him to share his story, and I'm told that he has written a couple of valuable articles about it since then.

If we taught people to fly the way that we teach them to use most educational innovations, we would say to the not-yet-pilots, "Look. This is an airplane. It's really great for going all sorts of places. You could go to Portales, New Mexico; you could go to Paris; you could go almost anywhere you want. Now why don't you step into the cabin with me, and we'll take off. We'll fly around a little bit, and we'll land back here again. And then I'm going to hand you the keys to the airplane, and if you want to go to Paris, it's east of here. This button on the control panel is the radio, and if you need a help line just push it because we usually have somebody on duty and hopefully they can help you if you run into trouble between here and Paris!" That's how we teach most faculty to use technology in teaching in their disciplines. We sell them on the technology and teach the rudiments, but we don't prepare them for problems they might encounter as part of the teaching activity. I define that as a career risk.

We ought to give faculty practice in "simulators," for want of a better word, that enable them to get into and then out of trouble in situations that are actually safe. One familiar example of a simulator is a teaching case study that is discussed by a seminar of faculty, but I don't know of any teaching case studies that spring from a technology-related problem like the one that hit Marshall Kremers. And I suspect there aren't very many that have to do with really innovative approaches to teaching generally; the ones I've seen deal with classic problems, not emerging ones. The use of simulators is awfully important because faculty members need to have a reasonably safe experience, safe to their careers, especially if they're junior faculty. Trouble with technology can be very traumatic. Junior faculty members are often advised not to have anything to do with technology until after they've gotten tenure, which is not exactly the way for a university or a college to make fast progress.

Now I can make my real point, about the good news that can be hidden in bad news. Remember that first observation that Diane Thompson made about the French Revolution and about empowerment. I've never thought about empowerment the same way since that day. Diane's observation about the dark side of empowerment gave me a richer, more useful way of understanding a whole range of phenomena. We gain a fuller and richer understanding of the strengths of what we are doing by looking the problems that it causes squarely in the eye.

Here, too, my experience at Evergreen was helpful. I decided what core practices and goals to evaluate at Evergreen by first asking what problems the College couldn't definitively solve. Those dilemmas were the flip side of its strengths. It couldn't solve such problems completely without abandoning the corresponding strengths, so the problems remained unsolved. For example, a perennial problem at Evergreen was the student complaint of an insufficient choice of courses. That stubborn problem helped point my attention as an evaluator to Evergreen's practice of faculty teaching only one course at a time, sometimes for a full academic year, as part of a team. By deploying its effort that way Evergreen was able to do many valuable things (it made narrative evaluations much more feasible, for example, and gave faculty and students the kind of flexibility I mentioned earlier), but one price was that the College could offer only a tiny fraction of the courses that a college its size would ordinarily teach. That problem was insoluble unless the College abandoned one of its core strengths. That's why an important part of my evaluation was then targeted on these full-time teaching and learning practices: the insoluble problem had attracted my attention.

So dilemmas and core strengths are often the flip sides of the same practices. The more stubborn the problem, the more important is the underlying goal or strategy for the institution over the long haul.

Any program offers a wide range of practices and values. Which ones should an evaluator study? You can do worse than first looking for insoluble problems, and then using them to identify the most important, long-term goals and values.

Let's apply this kind of thinking to faculty development and new technology. I have a proposal to make. It comes in four parts.

A. Research to identify dilemmas

The first part is that I would urge faculty to do more research aimed at discovering the dark side of the force. Pick a new instructional situation, teaching courses on the Web for example. Get people together who have had a little bit of experience with such teaching. Reassure them, "This is not going to get out; it's not going to destroy your career; it's just within this room. Now identify some of the most embarrassing things that have happened to you as a result of the thing you've tried to do with technology, or worrisome things, things that really frayed your nerves or whatever. It's probably something that never happened to anybody but you. That's OK. We want to share the really bad stuff, though. And then we'll wait and see whether other people say, 'You know, something like that happened to me.' Because we're going to be looking for the patterns, not necessarily universal patterns." Remember that what happened to Kremers only happened to a quarter of the people in the room. But if you've got 10 or 15 people there, things that happened to two or three people would be, I think, quite enough to be significant.

This important scholarship is something that many faculty members and institutions ought to do because there are so many variations in what we do and, thus, so many dilemmas to discover. Because this research is time-consuming, no institution is going to be able to do it across the board. There is, therefore, plenty of room for lots of people to do this kind of research.

B. Develop "simulators"

Second, based on discovered dilemmas, we then need to develop simulators -- teaching case studies, role-plays, video trigger tapes for discussions, computer simulations. Although I don't know what they all might look like, they would have in common their ability to enable faculty, teaching assistants and adjunct faculty to encounter these kinds of situations in a safe setting where they can try out different sorts of responses. Many of these simulators will involve group discussion.

If you've never used a case study before, don't judge a case study by just reading it. Case studies are often not fascinating reading. After describing a problem, they stop. The case study itself is like the grain of sand in the oyster. The value is not in what you learn by reading the case. It's the pearl that develops as people say, "Here is why I think the problem occurred and what I would do about it."

For example, I've been in other discussions about the kind of anarchy that Marshall Kremers discovered, and not everyone takes off from where Diane Thompson did, about empowerment. Other folks have different kinds of analyses about why Kremers' problem happened and thus different ways of responding to it. For example, some might say that this kind of problem happens frequently in groups. Or other participants might point out that chat rooms can be fundamentally, subtly annoying because of the difficulty in timing your comments, so some kind of explosion is likely. Each different analysis suggests a different set of indicators to anticipate, and different responses when trouble begins to develop. Because of the variety of possible analyses, I favor relatively unstructured simulators that give participants more freedom to suggest a variety of analyses of the problem.

C. Shedding light on the core ideas

The third step is to brainstorm about the dilemmas and ask what strengths they reveal by their intransigence. Each dilemma can reflect the underside of a goal or strength, just as the Kremers anarchy reflects a richer view of an empowered student. After using such a simulator, the participants all can reflect: "What light does this shed on the larger situation? How does this change our ideas about the nature of what we're trying to do?" These kinds of role plays and simulations can provide a setting for developing richer, more balanced and nuanced insights into values and activities that are most important for the education of students.

D. Using simulators for faculty development on a national or international scale

Finally we ought to make these kinds of simulators more widely available. A simulator developed for geography at a community college in Alaska may well have relevance to an elite selective private university. The biggest surprise in my visits to many institutions in this country and abroad is that while faculty members differ in the specifics of what they teach and learn, the dilemmas that they face are comparatively universal, across disciplinary lines, types of institutions, even national boundaries and language barriers. For example, Kremers' experience with anarchy in a chat room can appear wherever chat rooms are used, which is in lots of fields and lots of settings. A teaching case study that had transcripts of how students exploded in a chat room environment could even be translated into other languages and be used appropriately in many countries around the world. Case studies developed in the UK could be employed in the US.

How to get the simulators into wide use? There are many possibilities. For example, the TLT Group, of which I am a part, could be helpful in offering workshops around the world based on your simulators, face-to-face or online. I'm hoping we can collect simulators developed in many places and make the whole collection available internationally. Disciplinary associations could perform the same dissemination function within their fields.

I think faculty could write proposals to create and disseminate simulators, and get them funded. Faculty could go in different directions and approach different funders to get support for doing simulators in their arena.


Closing Remarks

My straw man, basing evaluation on the assessment of the average outcomes while looking for good news, is not a bad thing, but it's a radically incomplete way of evaluating academic programs.

First, studying strategy-in-use, not just outcomes, is really important. We must examine what people are actually doing to achieve the outcome. The Flashlight Project's tools, for example, prompt faculty to use data about strategy-in-use as a part of the story about why outcomes might or might not be changing. Look at people's satisfaction with the tools that are in hand when used for that strategy and that goal. Great outcomes might be achieved despite the tools rather than because of them; that's just one of many reasons why evaluations need to attend to means, not just ends.

Second, attend to unique uses, not just uniform impacts. Today's innovations, especially those using technology, tend to be empowering. Like the library, they increase the role of divergent learning: learning that is different for each learner. If we fail to use unique uses assessments and evaluations, we blind ourselves to a whole class of benefits and problems.

Third, look for bad news as well as good news, particularly because the worst pieces of news, the dilemmas, are often the flip side of what's most important about a program and shed some real light on the program's strengths. By developing simulators that help people cope with problems that cannot be definitively eliminated, you can protect the careers of the people who are working in your institution. And if you can help prepare them to deal with this bad stuff, they're much more likely to help their students learn.


References

    • Beeman, William O., Kenneth T. Anderson, Gail Bader, James Larkin, Anne P. McClard, Patrick McQuillan, and Mark Shields (1988). Intermedia: A Case Study of Innovation in Higher Education. Providence, RI: Institute for Research in Information and Scholarship, Brown University.
    • Bruce, Bertram, Joy Peyton, and Trent Batson (Eds.) (1993). Network-Based Classrooms: Promises and Realities. New York: Cambridge University Press.
    • Chickering, Arthur, and Zelda Gamson (1987). "Seven Principles of Good Practice in Undergraduate Education." AAHE Bulletin (March): 3-7.

