Thursday, August 14, 2014

Peer evaluations: one month retrospective

A few weeks ago we launched a new feature called Project Evaluations in the Computer Programming section on Khan Academy.

This feature allows students to request evaluations on the coding projects that they complete.
Another peer on the site then sees the evaluation request, evaluates the project against a rubric, and if the project passes, the requesting student earns points for it.

This blog post will analyze the data we’ve gathered since launching the feature, and talk about various changes we’ve made to address challenges that came up.

Uh oh, too many requests!

On the first day, we got a big surge of project evaluation requests. We thought this might have been because the feature had just been released, but it turned out to be the normal level of traffic.

Overall, only 20% of evaluation requests were being satisfied per day in the first few days. That meant that every day, the queue of projects waiting for an evaluation got longer. The challenge was clear: how do we increase the number of evaluations without sacrificing their quality?
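
To make the backlog math concrete, here is a minimal sketch in Python, with made-up numbers purely for illustration (the real daily figures varied), of how a 20% satisfaction rate compounds into an ever-growing queue:

    # Hypothetical numbers, for illustration only.
    requests_per_day = 100    # new evaluation requests arriving each day
    evaluations_per_day = 20  # only ~20% of requests being satisfied per day

    backlog = 0
    for day in range(1, 5):
        backlog += requests_per_day - evaluations_per_day
        print("Day %d: %d projects still waiting for an evaluation" % (day, backlog))

    # Day 1: 80, Day 2: 160, Day 3: 240, Day 4: 320 -> the queue (and the wait) keeps growing.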


First 4 days since launch (Horizontal axis) vs. New evaluation requests (Purple) per day and evaluations (Pink) per day (Vertical axis)


We overcame this challenge relatively quickly, but only by making dozens of small changes; see below for a full list.

The same chart over a longer period of time shows the effect of these small changes. You'll notice that the evaluations (pink) bars start to dominate the requests (purple) bars per day.

Days since launch (Horizontal axis) vs. New evaluation requests (Purple) per day and evaluations (Pink) per day (Vertical axis)

Aggregating all of this data together gives us a pretty good picture of where we're at today.

Overall "Evaluated" (Green) vs "Pending evaluation" (Blue)

How did we increase evaluations?

We started off by thinking about what sort of people might evaluate projects, and there are a few key groups:
  • current students: those that are going through the curriculum themselves
  • alumni: the students that have completed the curriculum (but may not be particularly active)
  • super users: those that already know how to program (via our curriculum or elsewhere) and are already active in the community
  • coaches of current students

When we launched, we only let curriculum alumni perform evaluations, so all of the evaluations were coming from students that were both very active (enough to notice the new feature) and had completed the curriculum. They're still some of our most prolific reviewers, but we realized we needed to bring in more types of reviewers from across our student base, and that's where we made a lot of changes:

Lowering the bar overall:
  • We made it so that simply completing any coding challenge would qualify you to be a reviewer (which would enable us to bring in more types of reviewers).
  • To offset the lowered bar for performing evaluations, we added moderation tools and flagging to ensure quality is maintained.
Bringing in current students as evaluators:
  • We added a next action after a student submits and passes a project: a link to evaluate another student's project. Those projects are picked based on the highest difficulty level that the student has passed themselves (there's a sketch of this matching after the full list of changes below).
Bringing in more alumni reviewers:
  • We sent out notifications letting thousands of people know about the new feature, and that they were eligible to evaluate. 
  • We added a video that every student will watch at the end of the curriculum, showing them how they can be active in the community in various ways.
  • We added a notification that will get sent to every student when they earn the curriculum badge, encouraging them to be active.
Encouraging more super users:
  • We made the page listing evaluation requests easier to find and linked to it from more places.
  • We enabled the reviewers to be more efficient by presenting them immediately with a link to do another evaluation when they're done with the current one, or to skip the current evaluation if it isn't right for them.
  • We added features to make evaluations more useful (such as allowing for formatted code inside evaluations).
  • We added evaluations to profiles so that reviewers can show off the evaluations they've done.
Bringing in coach reviewers:
  • We improved coach integration so coaches can view which of their students' projects are pending evaluation.
  • We added a notification for coaches to find out about a student's project evaluation request.
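
Here is the matching sketch referenced under "Bringing in current students as evaluators" above. It's a hypothetical Python illustration (the names and data structures are invented, not our actual implementation) of offering a pending request at or below the highest difficulty level the student has passed:

    # Illustrative sketch only; names and data structures are hypothetical.
    def pick_request_to_evaluate(levels_passed, pending_requests):
        # levels_passed: difficulty levels of projects this student has passed
        # pending_requests: list of (project_id, difficulty) tuples awaiting evaluation
        if not levels_passed:
            return None
        max_level = max(levels_passed)
        # Only offer projects the student is qualified to judge, preferring the
        # hardest one at or below their own highest passed difficulty level.
        eligible = [req for req in pending_requests if req[1] <= max_level]
        return max(eligible, key=lambda req: req[1]) if eligible else None

    pending = [("proj-a", 1), ("proj-b", 2), ("proj-c", 3)]
    print(pick_request_to_evaluate([1, 2], pending))  # ('proj-b', 2)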


Besides increasing the number of reviewers, we also reduced the number of requests by adding messaging that prompts students to fix problems on their own before asking for an evaluation.

How accurate are the evaluations?

We've seen very few cases of bad evaluations (like students making unconstructive negative comments, or failing projects that should pass), and we're generally very pleased with the quality of evaluations.

We've heard reviewers comment that they've learned from the requester's code, and seen reviewers share valuable tidbits of knowledge that wouldn't be gained otherwise.

We do have more work in progress for monitoring and understanding the quality of evaluations at a deeper level.

What's the distribution of evaluations across reviewers?


Initially, there was a very small number of reviewers, with a few outliers.

Since making these changes, we've managed to bring in a lot more people trying reviews. Over 4,000 people have performed at least one review: 2,000 have performed 1-2 evaluations, and around 100 have performed over 60.

The top evaluator has performed a staggering 3,400 evaluations on their own.


Shows the number of people (vertical axis) who performed that many evaluations (horizontal axis)

How many projects pass?


Another interesting metric is how often students actually pass the evaluation. During the first few days, only 54% of evaluations passed. By adding ways for users to catch their own problems before submitting for review, we've helped raise this to a 66% passing rate. This led to fewer failed evaluations, and in general it feels better when your evaluation passes!

It's worth noting that evaluations are an iterative process. When a project is marked as needing more work, the user will typically make changes and request another evaluation. So even without our changes, we'd expect to see a growing number of passing evaluations over time.
 

"Passing" evaluations (Green) vs. "Needs more work" evaluations (Blue)



Closely monitoring changes in data


We've caught up with evaluations versus requests, but we'll continue to watch this data closely.

From what we've gathered so far, it seems like peer evaluation works well and people are getting good evaluations on their projects. We'll be monitoring and focusing on the quality of evaluations even more in the coming weeks.

