Evaluation helps you know where you are and what you need to do to get where you want to be. In this chapter we discuss why and how to plan your evaluation. This chapter will help you to understand motivations for evaluation, the different types of evaluation, and the interpretation of evaluation results. We encourage you to read a more extensive resource before proceeding with evaluation, such as the User’s Guide for Evaluating Learning Outcomes in Citizen Science,1 which gives a good general introduction to the field of evaluation and has many parallels to crowdsourcing in the cultural heritage domain.
In this context, evaluation is a systematic approach to gathering quantitative or qualitative data with which to assess a project’s progress or success. Evaluation design is driven by project objectives, which will differ between projects, and flow from such things as a high-level theory of change for your project, stakeholder needs, project values, and recognized opportunities. We discuss creating goals and objectives for your project further in the “Designing cultural heritage crowdsourcing projects” chapter.
As we have highlighted throughout the book, the values of your project show up in all areas of your work — evaluation is no different. The values of being human-centric, and focusing on accessibility, inclusion, and equity are still key. In this chapter, we outline a fairly traditional approach to capturing outputs and evaluation. However, in the long run, we invite you to consider that, for example, clean and simple metrics only speak to the output and not the context of the community you are measuring. Also, just as we promote values of co-production elsewhere in the book, you can also involve projects’ participants in designing the evaluation approach that you will use.
A few words of caution before you set out to evaluate: evaluation can encompass a range of actions, from capturing simple outputs to looking for evidence for longer-term impact in the wider society. Focusing on content-production metrics has a bias towards productivity, rather than engagement and human change. This can take away from the potential of transformative change coming from crowdsourcing projects. One stakeholder shared this perspective in our “what do you wish you’d known at the start?” survey:
“Don’t measure impact on the number of uploads or contributions. [Ours] is a legacy project with a longtail so its impact can be measured by the way it has changed people’s views of themselves and their ancestors.”2
Keep in mind that measurements can influence individual and group behaviors, consciously and unconsciously. Picking specific metrics may sway you to focus unduly on activities that will produce these metrics, at the expense of other elements of the project. Evaluation is ultimately a tool for reflection, not your only success metric, so do your best to not let your evaluation methods overwhelmingly influence your project design choices. Let the dog wag the tail, and not the tail wag the dog!
Embracing the evaluation mindset can be a transformative experience.3 For example, evaluation can help you use appropriate metrics to track your progress toward an objective of expanding participation amongst marginalized or under-resourced groups, then scale activities that are succeeding in these aims or modify activities that are failing. This type of mid-point evaluation is called formative evaluation.
The evaluation that happens at the end of the project to document outcomes is called summative evaluation. This type of evaluation might frequently include requirements dictated by supervisory or funding bodies. Summative evaluation can inform reflection and guide designs of future work, as well as forming part of the report you may need to make to project stakeholders.
A further distinction often made when talking about evaluation is between quantitative and qualitative data; an evaluation strategy often collects both. Quantitative data involves numbers that might be the raw input for statistical analyses (such as the number of new participants in your project last month). Qualitative data often involves textual responses, which might need a certain amount of downstream coding before statistical analysis, or could simply be a source of good ideas.
For example, useful qualitative data that feeds into bug fixing or usability work might look like these user comments:
The “submit” button doesn’t activate when the enter button is pressed, it has to be clicked, but it would be easier to press enter
The flow from one document to the next is difficult, is there any way it could be easier?
Enabling feedback as a form of evaluation means that participants can contribute their fantastic ideas.
Evaluation should be a means to determine whether you have successfully arrived at your desired outcomes, but it is also an important way to check on a project’s health at mid-points on the journey. Be prepared to change your project in response to what you discover. We only ask participants to give their time to meaningful tasks, and similarly, we only evaluate as a purposeful exercise. Always be ready to respond honestly and thoughtfully to the discoveries that are made during evaluation.
The process of identifying your evaluation strategy can also lead to further refinement of your project’s objectives, perhaps leading to clarifications. For this reason, project evaluation should be part of your original project design as discussed in the “Designing cultural heritage crowdsourcing projects” chapter.
If not carefully considered during project design, evaluation can feel like an afterthought, or worse, a painful bureaucratic hurdle to leap when applying for funding. From this point of view, evaluation can seem like a form of self-imposed policing. Rather than thinking of evaluation as overbearing auditing getting in the way of doing actual work, reframe evaluation as a benefit to your project. Thoughtfully designed and appropriately implemented evaluation can lead to discovery and adaptation, leading to continual improvement of our practice and deepening our relationships with partners and participants.
Before you start an evaluation, it is helpful to recognize the intended audience and what that audience will find most useful. Think about how you will frame an understanding of the project and its progress so that it is useful for your project audience. This audience could include funders, senior management, participants, other crowdsourcing practitioners, or the project organizing team itself. Some of those stakeholders might have data requirements that need to be met, such as content generated, the amount of time spent on a project, how data is used, number of publications, and other signals of impact that might be required by a funder.
Strategic decisions will need to be made early in your project regarding the type of data that you collect, the timing and methods of data collection, and who will collect and analyze the data. Data collection from your participants might occur before and after their experience with your project. It might involve questionnaires, social media posts, in-person conversations, or the collection of behavioral data during participation. You might seek out (or be required by a funding or supervisory body to hire) a professional evaluator for your work; professional organizations, such as the American Evaluation Association, sometimes provide “find an evaluator” services. All of these strategic decisions can have implications for the time and money allocation by your project.
However you decide to analyze your project, and whatever your analyses reveal, reporting deserves discussion. Projects of all sizes will have onlookers, from funders to intellectually adjacent colleagues to participants. Those people will want to know how it is going, but you must be careful when reporting, since data without context can be misleading.
Where feasible, discuss what results would be of interest to the audience of your evaluation. Sometimes the audience wants to see total numbers in metrics describing reach (such as how many pieces of data were categorized or words transcribed, how many people contributed to the work, or read a blog post). Sometimes the important metric is the depth of impact (for example, was the experience in some way transformative for participants, did the work result in a publication?). The process can lead to a deeper understanding of your project’s higher-level goals among your stakeholders.
Prior agreements might be productively made with relevant stakeholders about what evaluation data you will share and how you will share it. A weekly stats spreadsheet? A quarterly or annual report? Generally, keep things simple and high-level for funders, but be ready to provide more detailed data to participants or partners who may even prefer raw data.
When thinking about ways you can report on your project, it is useful to consider any built-in capabilities of the platform you are working on. For example, there is a range of community-maintained tools tracking contributions within Wikimedia projects, as well as tools tracking pageviews of all articles containing images from a specific category.4
It seems to be the fate of many evaluation forms to languish on shelves in folders, or as numbers on a spreadsheet. Your challenge is to turn the evaluation data that you collect for your project into actionable insights shared with the right people. Visualizations can make evaluation results more engaging, and a “What? So What? Now What?”5 format can both help you distill your evaluation results to their essence, and present them so that their implications are clear to all.
Internal reporting should not be ignored, even if that reporting only passes between the project manager or leader and the rest of the team. Keeping some amount of reporting private can sometimes enable you to be more open and critical. When you do not need to worry about how the data or your interpretation of it might be received, you can evaluate it more fairly.
The most important audiences for evaluations are your participants and the project team. Sharing progress with participants demonstrates your commitment to respecting and valuing their work. Within the project team, evaluation can shine a light on where more work is needed, such as offering more explanation or support, smoother tooling, or better communication.
The processes for evaluation can also be evaluated. This is another reason to start evaluating early in a project: this will give you time to iterate if you find the data being gathered does not enable progress to be measured against the aims.
If you expect to use the evaluation data to draw conclusions on human behavior in any way other than for formative evaluation (for example, if you would like to publish it), you might need to seek approval for your evaluation from your Institutional Review Board (in the United States) or a similar body elsewhere. Institutional Review Boards in the US help protect “the rights and welfare of human research subjects.”6 Even if you think that your Institutional Review Board is unlikely to deem your evaluation data usage as requiring their approval, it is better to hear that from them at the beginning than it is to discover, at the end, that it did require approval and cannot be used for your planned publication.
It is tempting to approach reporting as a return-on-investment analysis, but that may not be appropriate in cultural heritage and not-for-profit contexts. Throughout this book we have advocated for both tangible and intangible results, and this also applies to the process of choosing the right metrics for your project. Success for cultural heritage projects may not always be easily quantifiable.
The most straightforward way to recognize success is to evaluate progress towards project objectives. You will benefit from establishing project objectives that are specific, measurable, achievable, realistic, and time-bound (SMART) to advance project goals during project design. The more clearly you have defined your objectives, the more fairly you (or another stakeholder) can evaluate the project. For more on SMART goals, please see the “Designing cultural heritage crowdsourcing projects” chapter.
What does a metric look like? Examples of common metrics include 1) the total number of participants, 2) the number of new participants recruited per month, 3) the number of returning participants per month, 4) the average amount of activity completed by a participant per month, 5) the average number of comments per participant to an online forum supporting the project, 6) the amount of time spent managing the project, and 7) up-time for the project website.7 Data quality metrics are another common type of metric and are discussed in the “Working with crowdsourced data” chapter.
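As a rough illustration of how metrics 1) to 4) might be derived, the following sketch works from a simple log of contributions. The log format, the participant names, and the `monthly_metrics` function are hypothetical and not taken from any particular platform; a real export will need its own parsing.

```python
from collections import defaultdict
from datetime import date

# Hypothetical contribution log: (participant_id, date of contribution).
# The names and dates are invented for illustration only.
contributions = [
    ("ana", date(2023, 1, 5)),
    ("ana", date(2023, 1, 20)),
    ("ben", date(2023, 1, 9)),
    ("ana", date(2023, 2, 2)),
    ("cat", date(2023, 2, 14)),
    ("cat", date(2023, 2, 15)),
]


def monthly_metrics(log):
    """Return new/returning participant counts and average activity per month."""
    by_month = defaultdict(list)
    for participant, day in log:
        by_month[(day.year, day.month)].append(participant)

    seen = set()  # participants who were active in any earlier month
    report = {}
    for month in sorted(by_month):
        active = set(by_month[month])
        report[month] = {
            "new_participants": len(active - seen),
            "returning_participants": len(active & seen),
            "avg_contributions_per_participant": len(by_month[month]) / len(active),
        }
        seen |= active
    return report


metrics = monthly_metrics(contributions)
total_participants = len({participant for participant, _ in contributions})

for month, values in metrics.items():
    print(month, values)
print("Total participants:", total_participants)
```

Here a "returning" participant is anyone who also contributed in an earlier month. That definition is an assumption; your project may prefer a different one, such as activity in the immediately preceding month only.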
Your project’s recorded values for metrics should be assessed against informative and fair baselines. Baselines could be historical data that you collected for an earlier project that was similar in scope, or they could be earlier estimates that you made for the metric in your current project, such as an expected increase in recruitment rates month over month.
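A minimal sketch of comparing a metric against a baseline might look like the following. All of the figures are invented, and the baseline of 40 new participants per month stands in for whatever historical value or earlier estimate your project actually has.

```python
def percent_change(current, baseline):
    """Percentage change of `current` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) * 100 / baseline


# Hypothetical recruitment counts (new participants per month).
new_participants = {"2023-01": 40, "2023-02": 50, "2023-03": 45}

# Hypothetical baseline, e.g. the monthly average from an earlier, similar project.
baseline = 40

# Change of each month relative to the month before it.
months = sorted(new_participants)
month_over_month = {
    later: percent_change(new_participants[later], new_participants[earlier])
    for earlier, later in zip(months, months[1:])
}

# Change of each month relative to the fixed baseline.
versus_baseline = {
    month: percent_change(count, baseline)
    for month, count in new_participants.items()
}

print(month_over_month)
print(versus_baseline)
```

Reporting both views side by side can help: a month that looks like a drop compared with the previous month may still sit comfortably above the baseline.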
Project evaluation can evolve through time, as projects evolve in nature, scope, or intention, and project metrics should be regularly revisited to assess their continued value.
Most projects fail to meet one or more objectives, and planning for this can make them successful failures or sideways successes. One way to have a successful failure is to recognize it early. “Fail early, fail fast” is a maxim one often hears. If it is going to happen, you should get it over with quickly and learn from it, so that any resources invested have value in next-iteration improvements.
Another important element to pulling success from a failure is to evaluate the failure itself, take lessons from it and ideally document and share it so that others do not make the same mistake. Many failures contain a wealth of information that can be used to avoid repeating mistakes.8
Sometimes, the results you receive are not what you hoped to find. You never know what is in your data until you have it, but you will usually have some kind of expectation — and sometimes what you are expecting just is not there. However, these kinds of failures can be helpful. It may be that the unexpected content is useful or interesting in a different way. There might also be value in proving the absence of something. While your institution may not be eager to draw attention to projects that did not produce what was hoped for, it can still be important to release the results, so that the content might be useful to others.
What else can be gained by undertaking evaluation as part of your project plan beyond evidence for reporting and a sense of progress toward project goals? Some benefits of evaluating include gaining new insights, learning from participants, building relationships, and understanding longer-term social change.
Gaining new insights relies on data having been accurately collated and being assessed by people who know the collection. Often this is the participants themselves, who may have developed deep expertise through the process of contributing.
A common output for cultural heritage projects is data that is incorporated into a cataloging system, and/or published online as its own dataset. While all catalog enrichment creates new and structured ways of describing a collection, it may take dedicated time spent researching this data at scale to see a full picture of the new knowledge in qualitative terms. If resources are available, following up regularly during and after a project to see what people have been doing with the data can reveal significant impacts of the project and help guide decisions about the worth, content, and structure of future projects. New insights might also be shared on social media, forums, or in comments to organizers.
Many projects, such as those on the Zooniverse platform, have a Talk forum (or other message board) for participants and project teams to discuss topics of interest, as well as to ask for help. It is here that surprising aspects of a project can emerge when a contributor might ask whether anyone else has noticed or can help explain a feature. In rare circumstances, social media posts might also lead to insights into why people have chosen not to contribute.
Feedback from participants can provide critical insights into ways to improve information about a task and other factors in usability. Sometimes these critiques can sound abrupt or overly critical, especially on social media, but taking the time to thank commenters, and perhaps asking for further information about their perspectives, can lead to fruitful relationships.
Do not be afraid to ask participants to talk with you or your evaluator directly as part of your evaluation. Whether in-person, by phone, or in a video conference, interviews, discussions, and focus groups allow you to hear the thoughts, ideas, and emotions that might be hard for participants to express in other ways. Always be respectful of participant time in setting up these interactions. It is not uncommon to offer interviewees tokens of appreciation for making time in their schedule — even a project t-shirt or gift card to a coffee shop will mean a lot. However these direct interactions can be arranged, they help sharpen your project’s understanding of the human beings behind your crowd.
As part of the process of applying for recognition as an official Zooniverse project,9 a subset of registered Zooniverse volunteers (who have signed up as project testers) are asked to provide beta feedback. Project teams are encouraged to take feedback from these volunteers seriously. In addition to the beta survey sent to volunteers, the Living with Machines Zooniverse Talk board included a “General project discussion” topic with the description “Puzzled by an instruction? Got a question or an idea for improving things? Share here!”10 but Ridge also reviewed other comments and social media posts for ideas for usability tweaks. After making changes in response to feedback from Zooniverse and other volunteers, she summarized actions taken and thanked contributors in a forum post.11
Using a message board or forum space for evaluation might include noting how many responses each post gets, what type of new topics are raised, patterns in the questions asked or hashtags used about collection items, whether contributors and project organizers are excited — evident in the language and phrasing they use — and whether they are sharing discoveries or asking for help with interpretation.
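Some of these forum measures are simple counts. The sketch below assumes the forum posts have already been exported into a list of records; the `thread` and `text` fields and the sample posts are hypothetical, and treating any post containing a question mark as a question is a deliberately crude assumption.

```python
import re
from collections import Counter

# Hypothetical forum export: one record per post, in posting order.
posts = [
    {"thread": "General project discussion",
     "text": "Has anyone else noticed the stamp on this page? #postmarks"},
    {"thread": "General project discussion",
     "text": "Yes, I saw it on two other letters. #postmarks"},
    {"thread": "Help with handwriting",
     "text": "Can someone help me read this word?"},
    {"thread": "Help with handwriting",
     "text": "I think it says 'harvest'. #handwriting"},
    {"thread": "Help with handwriting",
     "text": "Thank you!"},
]

# Responses per thread: every post after the first one counts as a response.
posts_per_thread = Counter(post["thread"] for post in posts)
replies_per_thread = {thread: n - 1 for thread, n in posts_per_thread.items()}

# Hashtag usage across all posts, case-insensitive.
hashtags = Counter(
    tag.lower() for post in posts for tag in re.findall(r"#\w+", post["text"])
)

# Crude proxy for requests for help: posts that contain a question mark.
question_posts = sum(1 for post in posts if "?" in post["text"])

print(replies_per_thread)
print(hashtags.most_common())
print("Posts containing a question:", question_posts)
```

Counts like these will not tell you whether contributors are excited or what their questions mean; they are best used to point you toward the threads worth reading closely.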
In larger projects, it may be feasible to treat the contents of a message board as a text corpus and analyze it using Natural Language Processing (NLP) tools to identify topics, sentiments, etc. For example, the American Museum of Natural History used an NLP and sentiment analysis service to review visitor comments at scale.12 In smaller projects, this may be both unnecessary and beyond the skills of a project team. As with all evaluation methods, you must ask yourself if the results are worth the effort.
If a project aims to work with a small, dedicated community (such as a group of people who have traditionally been excluded from cultural heritage spaces), having very large numbers of participants could be seen as disappointing; it takes away from the primary aim of building ties between this community and the project team. In this case, a more effective metric might be how many interactions there were per participant, or what the project inspired a participant to do. A crowdsourced archaeology project recently completed in the UK began with 80% of participants being new to archaeology sites and saw a substantial increase (138%) in visits to the site by the end of year one.13
When a project works with or about living people, or with topics that have direct connections to living people, histories, and institutions, participants can potentially be incorporated directly into evaluation. For example, people might be enthusiastic about validating data about themselves for quality control evaluation. This makes a refreshing change from many areas of cultural heritage where the subjects or creators of artifacts are no longer around to be consulted.
Where we are keen to reach new groups, in addition to evaluating who our current participants are, we can evaluate our means of communication. A popular approach that is readily automatable is social media analytics: for example, is a particular hashtag being discussed by members of the community we are seeking to work with?
It is important to value your participants and their right to privacy over collecting more evaluation data. It is imperative that the dignity of the human beings you may study — living and dead — remain central to your work. Though your evaluation might benefit from more data, always prioritize the privacy of the people over the work. Additionally, be sure to comply with privacy laws in your respective jurisdictions. We discuss balancing these choices further in the “Identifying, aligning, and enacting values in your project” chapter.
Relationships rarely cohere quickly when building trust is involved. There are no shortcuts to developing meaningful and respectful relationships. If this is an aim, your objectives might include events like community or school workshops. Measuring how far communications about such events travel on social media, views of online broadcasts or attendance at events themselves is straightforward, and in the short-term can be a proxy for evaluation more suited to this overarching aim.
In cases where an organization shares your goal of building relationships, your crowdsourcing project is likely to be just one part of a larger scheme of work. This often means that someone engaged in the larger work at the organization will remain in their role after the end of a crowdsourcing project, creating space, even in shorter projects, for long-term evaluation. Even if there are no immediate plans for long-term evaluation beyond the project, assume that there will be others coming behind you to build on the work that has been done. This effort will pay off in the integrity of the work you produce and in the respect with which you deal with the datasets and the people you work with, and it will certainly make things easier for the future work that we hope will be generated long after your project has concluded.
Depending on where the project team members are located within a cultural heritage organization, methods of evaluation might include measuring engagement in formal or informal education settings, such as online workshops or arranged school visits. If continued engagement with teachers is possible, additional measures could include whether colleagues co-develop classroom resources and whether they are frequently downloaded. If children are encouraged to work on this aspect of cultural heritage through school projects, and teachers can dedicate time to the relationship, can they report back annually, for example by sharing the best projects for inclusion in a blog post or similar? The Generic Learning Outcomes model14 used in the United Kingdom can serve as a framework for thinking about evaluations of this type.
One way to frame your work on crowdsourcing projects might be through the Theory of Change.15 This is a methodology that links project design decisions to larger, long-term social outcomes, pointing to impact within the wider community in which your project operates. It typically involves a logic model that links outcomes in If-Then statements. Mapping the various outcome steps you are planning means that you can compare and design your evaluation strategy to assess the logic of your pathways to achievement. To illustrate how a high-level strategy and a theory of change can translate into metrics to evaluate your projects, we highlight an example from Wikimedia UK in the following case study.
Top line organization-wide Theory of Change: Wikimedia UK believes that to achieve our vision of a more tolerant, informed, and democratic society we need to improve the representation of diverse people in the knowledge ecosystem, increase civic engagement by building digital literacy, and secure policy changes which increase access to open information for all. To effectively achieve these goals we must also work on strengthening our voice and sector recognition. Without access to knowledge, we cannot build understanding. Without diversity of content, this understanding is limited.
Theory of Change for Knowledge Equity strategic program: Wikimedia UK is helping to create more complete information online; by supporting marginalised people to become contributors and community leaders, and by uncovering and sharing knowledge created by and about underrepresented people and subjects.
Long-term outcome: Wikimedia reflects our diverse society and is free from systemic bias.
Metrics to check progress:
Content pages created or improved
Images/media added to Wikimedia Commons
Reach of content, e.g. image/article views
Newly registered editors
Total number of participants
In-depth diversity statistics for lead volunteers
Language diversity, e.g. how many languages have you worked across?
Content diversity, e.g. percentage of events where the focus is on underrepresented content
Geographical reach, e.g. percentage of events outside of London
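Percentage-style metrics such as the last two in this list can be computed from a simple record of events. The sketch below is illustrative only and does not reflect how Wikimedia UK actually records or calculates its figures; the event records, field names, and `share` helper are all hypothetical.

```python
# Hypothetical event records; names, cities, and flags are invented.
events = [
    {"name": "Edit-a-thon A", "city": "London", "underrepresented_focus": True},
    {"name": "Workshop B", "city": "Edinburgh", "underrepresented_focus": True},
    {"name": "Training C", "city": "Cardiff", "underrepresented_focus": False},
    {"name": "Edit-a-thon D", "city": "London", "underrepresented_focus": True},
]


def share(items, predicate):
    """Percentage of items for which predicate(item) is true."""
    if not items:
        return 0.0
    return 100 * sum(1 for item in items if predicate(item)) / len(items)


# Content diversity: share of events focused on underrepresented content.
content_diversity = share(events, lambda e: e["underrepresented_focus"])

# Geographical reach: share of events held outside London.
geographical_reach = share(events, lambda e: e["city"] != "London")

print(f"Events focused on underrepresented content: {content_diversity:.0f}%")
print(f"Events outside London: {geographical_reach:.0f}%")
```

With only a handful of events, a single event shifts these percentages considerably, so it is worth reporting the underlying counts alongside them.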
Divergences, or missed outcomes, can require course correction during a project, though others can be happy surprises. You need to be prepared for change before starting an evaluation, and this does not only apply to instances where things have gone wrong, or not worked. Sometimes, and not always for discernible reasons, projects catch participants’ imaginations and take off. Some projects just have “magic sauce.” Flexible project planning allows for surprises, whether outcomes are greater or lesser than you hoped for. You can read more about this in the “Designing cultural heritage crowdsourcing projects” chapter.
Evaluation should be considered an essential part of every project. In this chapter, we shared framing for evaluation including what is gained by closely inspecting your practice, what can be learned from participants, and how evaluation can help you build relationships. Evaluation can even lead you toward long-term social goals, catalyzed by evidence and thorough planning.
The chapter also provides views into the types of metrics that can be helpful in crowdsourcing cultural heritage and the need for setting baselines. Reflecting on successful failures may lead your project in new directions, prompt a different set of questions, or reframe your understanding of your participants and data.
These practices may aid you as you explore where you can improve your practice, engagement, and even system design. As the saying goes, you don’t fatten a pig by weighing it.
It is good to evaluate — evaluation and our response to it build to success
Do not evaluate for the sake of it — as with pigs, the weighing is the check on progress, not the progress itself
Always be ready to make changes in response to evaluation — it improves our work and demonstrates our commitment to everyone involved
One final note: not only can undertaking evaluation improve your project and planning, but sharing outcomes of your evaluation can also improve practice in your wider networks. We encourage you to clearly articulate your goals and identify the affiliated metrics of success, evaluate regularly, and embrace an improvement mindset.