
10. Working with crowdsourced data

Published on Apr 29, 2021

Crowdsourcing projects can produce a variety of outcomes, such as increased understanding of the value of cultural heritage collections and the development of new community support. However, the primary motivation for designing a new crowdsourcing project is often the creation of data. In addition to that focal data, your project might also generate a variety of other data types that are not immediately obvious but also merit consideration, such as Personally Identifiable Information (PII) and behavioral data. Some of the data modeling work is discussed in the “Choosing tasks and workflows” chapter.

In this chapter, we provide a framework for thinking about data management, including where it intersects with project values and legal requirements under an umbrella of data ethics.1 This framework progresses through three steps, beginning with a consideration of design, followed by an assessment of different approaches to collection. The final step anticipates how the data will be processed, including allowances for archiving and access considerations once the project has been completed.

Planning for data

In a small survey of stakeholders we ran for this book, we asked ‘what do you wish you'd known at the start?’.2 One heartfelt response said:

“In general I wish I knew how long these projects take to plan and execute so we could begin sooner. Specifically though I wish I had more experience working with data particularly in terms of thinking about what to do with the data once you have it (it would have been useful in terms of knowing what we wanted to ask volunteers to do)”

As we covered in the “Choosing tasks and workflows” and “Designing cultural heritage crowdsourcing projects” chapters, it is vitally important to keep your resulting data in mind even in the early stages of planning your project. While the data generated through your project is not the only factor you should consider, having a sense of what kind of data you want — and what form you want it to be in — will help you to define your project goals and choose appropriate tasks for your participants.

When preparing for your project, you will also want to reflect on your project values as they relate to the results you are seeking and who will help you create them (for more information on this, see the “Identifying, aligning, and enacting values in your project” chapter). Common ways that projects enact values around data in cultural heritage crowdsourcing projects include:

  • Defining the rights and reuse of data created by participants in a project

  • Respecting the cultural contexts of the source materials and designing appropriate systems of access for the data related to those source materials

  • Preparing data for reuse, including consideration of lowest common denominator formats

  • Providing access to data from the project interface or organizational repository

  • Documenting data models and workflow considerations so those who seek to use the data will understand how it was created

  • Protecting the data affiliated with participants and their behavior

  • Designing the path for data to be identified with its source material

  • Preparing the skills, resources, and systems to use, transform, and/or roundtrip the data

Integrating crowdsourcing data into existing catalogs or collections management systems may not be straightforward. Data created in engaging workflows such as free-text tags or full-text transcription may not as easily fit into systems that have been designed to store and serve standard vocabularies and limited fields.3 Catalogs might not be able to reference items at the same level of granularity as crowdsourcing interfaces. For example, the catalog record for a volume or archival box may only include forms of information and metadata that describe it as an item, while a crowdsourcing project’s tasks could address regions of a specific page within that digitized volume or box. Crowdsourced data might be stored in extension systems that link digital assets and extended interpretive content to core catalogs, which will require coordination with metadata and technical teams. Roundtripping data, including data synchronization across collections and databases, and the desire to ingest the results of open access data corrected or enhanced by others, is still beyond the capabilities of many commercial information systems.4 These complexities can impact your project in many ways. We encourage you to plan for other eventual uses of the data produced through your crowdsourcing project in collaboration with relevant departments across the organization.

Consider what type of output you will be working with, and any standards that need to be met, early in your Project Design process (see the “Designing cultural heritage crowdsourcing projects” chapter for additional information). This type of backward design can help in the early stages as you make large-scale decisions about technology and project setup (for more information, see the “Aligning tasks, platforms, and goals” chapter and “Choosing tasks and workflows” chapter, respectively). Some of these questions may include:

  • With what technology will your results need to interact? For example, does your institution use a data management system that requires a certain file format for ingestion? Can it accommodate full-text transcription or expand with additional metadata fields?

  • Who is the audience for your results? Who will be using this data? What will they want to do with it? How will the format of your project results shape your audience’s ability to work with your data?

  • What project-related information will you include in your results?

  • How will you acknowledge the contribution of your participants in generating these results? Will attribution be part of the project output?

  • How will you assess data quality, for example, the correctness of results?

Case study: making Zooniverse data more easily re-usable with ALICE

A frequent question about Zooniverse projects is whether the results can easily be used to train machine learning models in other systems such as Transkribus, a system designed for handwritten text recognition, which requires plain text files to train new models. The Zooniverse Project Builder currently makes results available as JSON-formatted data embedded in a CSV file. This data can be difficult to process into a human-readable format without some technical skill. It can also be difficult to link the results back to source material without a robust pipeline to link different versions of source images and metadata with transcribed text. In response to these questions, we built the Zooniverse Aggregate Line Inspector and Collaborative Editor (ALICE)5 as a way to expand access to transcription data for teams without the resources to hire data specialists, or from smaller institutions with fewer opportunities for collaboration across departments. ALICE features include in-app review of transcription data and additional export formats, including plain text files, to make results more immediately shareable and reusable.
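To make "JSON-formatted data embedded in a CSV file" concrete, the following is a minimal Python sketch of unpacking such an export into plain text. It is not ALICE or any official Zooniverse tool. The two-row sample is invented, and the shape of each annotation (a dictionary with "task" and "value" keys) is an assumption for illustration; real exports vary by task type and carry many more columns.

```python
import csv
import io
import json

# Invented two-row sample; the annotation structure is an assumption.
SAMPLE_EXPORT = '''classification_id,annotations
101,"[{""task"": ""T0"", ""value"": ""30 m east from the wharf""}]"
102,"[{""task"": ""T0"", ""value"": ""Dear Sir""}]"
'''


def extract_text(export_file):
    """Yield (classification_id, text) pairs from a CSV whose
    'annotations' column holds a JSON-encoded list."""
    for row in csv.DictReader(export_file):
        annotations = json.loads(row["annotations"])
        # Keep only annotations whose value is plain text; other task
        # types (drawings, dropdowns) would need their own handling.
        lines = [a["value"] for a in annotations
                 if isinstance(a.get("value"), str)]
        yield row["classification_id"], "\n".join(lines)


results = list(extract_text(io.StringIO(SAMPLE_EXPORT)))
```

Each pair in `results` could then be written to its own plain text file. Keeping the classification identifier next to the text is what lets the result be linked back to its source material later.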

Data management plans

Data management plans articulate a project’s approach to a mix of topics relevant to data. These plans are now often required by funding agencies and foundations, although there is no generally agreed-upon set of criteria for their contents or format. You might nonetheless expect to be asked to articulate such things as 1) a high-level inventory of the data to be produced and why you have chosen different types of data, 2) a description of how you will provide context for your data (metadata or paradata), 3) the standards to which the data and metadata will conform, 4) the means by which the data will be stored, backed up, and protected, 5) data sharing policies, 6) data archiving plans and ongoing storage costs, 7) ethical and legal considerations, and 8) the roles and responsibilities for implementing the data management plan. The DMPTool6 is a data management plan builder from Stanford University Libraries and California Digital Library that helps you produce data management plans to the specifications of a target funding agency or foundation. This tool might be a great place to start, even if you are not submitting a funding request but simply seek a tool to guide your writing. DMPonline7 is a similar tool created by the Digital Curation Centre at the University of Edinburgh. You might find examples shared by funding bodies to be useful for your planning.8 Data management plans should be regularly revisited and updated as your project evolves or your operating environment changes.

Case study: working with digital data in archaeological projects: DigVentures Born Digital project

The question of how to handle digital archive data has been a hot topic in archaeology for years, and the capacity to harness contributions from the crowd has now amplified the problem to another level. This issue has been particularly pertinent to DigVentures, a platform enabling civic participation in archaeology and heritage projects. At DigVentures, we have demonstrated how crowdsourcing can transform the way archaeologists work: providing innovative research tools, improving how sites are investigated, and giving new life to knowledge about the past. As with all parts of an archaeological archive, crowdsourced data contributes to the long-term preservation of sites by providing key information which can be accessed by researchers and the public alike. How that information can be used in the future is an important consideration and, as new technologies become the norm, we must be sure that our archival processes adapt to incorporate innovative methods, tools, and data. Historic England commissioned our team, working with the Chartered Institute for Archaeologists, to develop up-to-date guidance for everyone (not just crowdsourcing teams) working with digital data in archaeological projects.9 The guidance forms part of the Archaeological Archives Forum’s ongoing series of practice guides for archives management aimed at practitioners. The project has been running since August 2018 and we have put together a toolkit entitled ‘Work Digital. Think Archive,’ a case in point of how the principles of managing digital data from crowds can also improve an organization’s digital data strategies in general.

Data lifecycles

Data lifecycles describe the series of activities relevant to the creation, use, and long-term persistence (or not) of data. Data management plans might cover all or only a portion of these activities. Different domains articulate data lifecycles in different ways, but the steps articulated by the Data Observation Network for Earth (DataONE10) might be a reasonable starting point for your project:

[Figure: Moving through the Data Lifecycle, inspired by Data Observation Network for Earth (DataONE)]

This example lifecycle is detailed further by DataONE’s Public Participation in Scientific Research Working Group (2013) in ways that might have parallels to your cultural heritage crowdsourcing project. However, a few features merit special mention in our context. “Assure” involves quality assessment and control mechanisms that could reasonably evolve throughout the project to meet your data requirements, and the “Data quality” section later in the chapter provides guidance in thinking about this. “Describe” documents the plans for metadata. If participant acknowledgment is a value of your project, you will need to adequately describe the “who?” of the data’s origin, but this metadata could also include answers to how, when, and other questions related to data creation. Other project values might further inform your thinking about your data lifecycle. For example, data sharing might be an important outcome that is not recognized in this DataONE example, as could data destruction, such as of PII.

Aspects of your project’s data lifecycle will be constrained by data lifecycle decisions made by those who designed your chosen software solutions, and we encouraged you to keep that factor in mind in the “Aligning tasks, platforms, and goals” chapter. Well-designed platforms should make best practices in data management easier to follow.11

Data ethics

Consideration of the ethical creation, management, storage, access, and dissemination of data from a project is always important. Cultural heritage crowdsourcing projects have additional layers of ethical consideration for data because of the values embedded in cultural heritage projects. Values including acknowledgment of participants, respect for communities, and dedication to openness and access influence not only how the project runs but can also be seen in the care and management of project data.

While planning for the collection, use, and sharing of data from your project, specific care must be taken to meet the legal requirements appropriate to your location. Specific legal requirements will differ but consistently require care with any data that can be used to identify and locate an individual, in person or online, without their consent. These obligations apply to all organizations, but cultural institutions, with their high public trust and missions, may be held to a more rigorous standard. Legal restrictions will help you protect the privacy of your project’s participants by limiting the information you request, safely storing the information you do collect, and limiting the re-use of data that includes PII.

Directly consult the appropriate legal restrictions in your jurisdiction such as the European Union’s General Data Protection Regulation (GDPR).12 The United States relies on a complex set of legal requirements around data privacy, often specific to an individual state, rather than a single law. If your organization has legal counsel, they are the place to start.

Some general guidelines that help avoid later legal consequences include: do not collect data you do not need; do not store data longer than you need it; and verify that the technical systems you use protect whatever data has already been collected (see the “Aligning tasks, platforms and goals” chapter). Do not wait until after a security breach to consider what type of participant information might have been exposed.
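The first guideline, not collecting or keeping data you do not need, can be applied mechanically when records leave the platform. The Python sketch below is one possible approach and not a compliance recipe. The field names echo a typical crowdsourcing export, while the allow-list, the `contributor` field, and the keyed-hash pseudonym are assumptions made for illustration. Pseudonymized identifiers may still count as personal data in some jurisdictions, so confirm any such design with legal counsel.

```python
import hashlib
import hmac

# Assumed allow-list: only the fields this hypothetical project needs.
NEEDED_FIELDS = {"classification_id", "workflow_id", "created_at",
                 "annotations"}


def minimize(record, secret_key):
    """Return a copy of record that keeps only needed fields and swaps
    the user name for a keyed pseudonym (HMAC-SHA256, truncated)."""
    kept = {k: v for k, v in record.items() if k in NEEDED_FIELDS}
    if record.get("user_name"):
        digest = hmac.new(secret_key,
                          record["user_name"].encode("utf-8"),
                          hashlib.sha256)
        # The same name and key always give the same pseudonym, so
        # contributions can still be grouped by contributor.
        kept["contributor"] = digest.hexdigest()[:16]
    return kept


raw = {"classification_id": 101, "user_name": "ada",
       "user_ip": "203.0.113.5", "annotations": []}
safe = minimize(raw, secret_key=b"replace-with-a-real-secret")
```

A project that values named attribution would store opted-in names separately instead of discarding them. The point of the sketch is only that fields such as IP addresses never reach the shared dataset.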

Ethical responsibility

The legal obligation does not cover all the ethical responsibilities for your project’s data. Many additional questions can be raised about the ethical collection, storage, and use of data. While the principle of open data informs many cultural heritage projects, the information in cultural institutions’ collections often contains details that can be harmful to communities, heritage sites, and individuals. Even when we can legally release a data set, we benefit from considering whether we should.

In most cases, participants should be able to access their own contributions. This allows them to repurpose that work for their own goals, such as family history research or membership in a heritage society. Participant reuse of crowdsourced data can also highlight needs of the project not envisioned during project design.13

Specific kinds of data and specific communities may have more stringent needs than those required by law. Archaeological records may be used by looters if site location information is present. A parallel danger exists in citizen science, where, for example, the potential use of bird sighting data by exotic animal trappers led the eBird project to restrict visualization and data exports of some endangered species.14 Even historic mental health records are often restricted. Crowdsourcing projects working with these have used segmentation to present only a small portion of a record to each user, preserving record privacy, albeit at a cost of interrupting the flow of user experience.15 Letters and diaries written by living persons with a reasonable expectation of privacy should be treated cautiously. The Children’s Museum of Indianapolis redacted names and places from scanned letters written to teenaged AIDS patient Ryan White in the 1980s before posting the images on their crowdsourcing platform and included clear instructions on how to request that a letter be taken down.16

Indigenous communities may have several concerns about projects working with material about them. Since most materials in colonial archives were captured by outsiders to the community, they may contain items that were stolen or photographs that were staged or taken without the consent of the subject. Historic archival description practices may use derogatory or belittling terminology, reinforce ethnic stereotypes, or use racial slurs. Re-presenting offensive descriptions to a new public as if they were neutral or authoritative is problematic; at minimum, contextual information should be added, but ideally the crowdsourcing task design will incorporate or replace colonial data with the perspectives of marginalized people.17 Indiscriminate public access to Indigenous materials may also be problematic, as sacred, private, or ceremonially restricted objects and practices were often collected without the permission of the community; any project putting them online should at minimum provide a way for the community to request that the data be taken down.

The archival holdings of many institutions record acts of violence and trauma. Documentation of genocide, enslavement, and hate crimes can be used as source data for crowd projects but requires additional ethical considerations. Project responsibility extends both to the participants working with the data and to the victims of violence represented.

For victims, the project team might reflect on the ways that data collected increases or decreases respect for the individuals and restores or disrupts human dignity. Does the resulting data further understanding of their humanity and individuality or does it risk sensationalizing their experience? One way to address these issues is through the appropriate framing of the source material. Does the data represent the perspective of a victim or the language of a perpetrator of violence?

For participants, the project team might consider the extent to which data from a violent past potentially traumatizes someone encountering it today. Do participants have enough information to understand the content of a data source before they immerse themselves in it? Do participants have the opportunity to consciously decide to proceed or avoid data tasks involving records of violence? Addressing these issues often requires careful contextualization of data sets and tasks and providing participants the choice to decline work with some content.

Data ethics frameworks

Many existing guidelines and frameworks provide a basis for ethical considerations of data used by cultural crowdsourcing projects. While the cultural crowdsourcing community does not share a specific framework of ethics, practices of individual projects, as well as the data practices in the larger scientific community, serve as a model.

FAIR data principles (an acronym for Findable, Accessible, Interoperable, and Reusable), while not explicitly ethical, serve as a strong starting point for considering access, use, and reuse of data created by a project. Created by academic, publishing, and industry representatives, the FAIR principles create guidelines for open sharing and access to scholarly data.18

The CARE Principles build on and move beyond the open-access guidelines of FAIR to promote human-centered data management and practice. Created in 2018 at the Indigenous Data Sovereignty Principles for the Governance of Indigenous Data Workshop, the CARE Principles address the failure of data sharing practices to account for imbalances of power. “The emphasis on greater data sharing alone creates a tension for Indigenous Peoples who are also asserting greater control over the application and use of Indigenous data and Indigenous Knowledge for collective benefit.”19

The CARE framework introduces issues of Collective benefit, Authority, Responsibility, and Ethics to data practices. While these principles specifically address the needs and concerns of Indigenous Peoples, the CARE principles also raise important considerations for all communities and crowd projects in general. Who will benefit from the data being created? How much authority do the subjects of the data and the crowd participants creating data have over the final product’s content and use? Has the project planned for the long-term risks of access to the data?

Many cultural heritage crowdsourcing projects, as well as citizen science projects, embed ethical data practices in their projects. These practices make sure that data created and used by a project respond to the concerns and needs of the community.

  • Use the data — the project organizers hold responsibility for putting data created to use and not wasting the time of participants by burying or forgetting the results. Project results can be used in many ways including advancing research, improving discovery of collections, or even in exhibitions.

  • Make the data accessible — open data has been discussed above as an important aspect of cultural crowd projects contributing to the larger academic community. For some projects, the data might be released after a short waiting period to provide researchers the opportunity to publish their results. The Citizen Science Association Trustworthy Data Practices research project includes Open Data as one of the ethical practices for a project.20

  • Give credit — participants helped create the data and should receive credit for their work. Many organizations credit participants along with each individual data point they work on. These micro-attributions become part of the dataset. For example, the History Unfolded project, led by the United States Holocaust Memorial Museum, gives all contributors the option to list their name against all their contributing records.21

  • Share results — participants deserve to know the results of the research that they have made possible. How will your project inform the community of the research being conducted? The Zooniverse platform, for example, requires projects to “Communicate research findings to their communities, via open access publication, a blog or elsewhere.”22

Data quality

This section will specifically focus on data output as a project result, though the authors acknowledge that data is only part of a much broader understanding of the “results” of crowdsourcing projects. We will introduce methods for evaluating and validating resulting data, while also drawing on concepts around data lifecycles and data ethics.

Data quality management merits consideration early in your project’s design. Intended uses of the data might vary considerably from project to project, from statistical analysis for research to indexing for resource discovery, to legal decision-making to policy-making, among many others. Your intended use will differ from the next person’s, and you are a key stakeholder in your project. However, we encourage you to also consider how your crowdsourced product might reasonably be used by others as you make decisions about data quality. At the same time, we want to discourage attempts to hold your data to the unrealistic standard of being fit for all possible uses, and instead consider elements such as the adaptability of your data format and opportunities for post-processing if wide use and re-use is one of your goals.

Crowdsourced data is variable much in the same way that all human-generated data is variable. There will be a margin of human error, but that is not unique to crowdsourced data. A frequent misconception of crowdsourced data is that it is less trustworthy or of lesser quality than data produced by experts. In fact, crowdsourcing projects across disciplines have shown that participants can produce very high-quality results (even at the level of experts23) when given adequate instruction and support. In a 2018 publication, the team behind the Transcribe Bentham project showed how creating a new version of their Transcription Desk interface (the platform volunteers used to submit and review transcriptions) helped with their data quality by making the task more straightforward for volunteers.24

Similarly, we have found the project, platform, and task design to be much greater determinants of data quality than participants alone. Crowdsourced data output that is messy or difficult to work with is often a result of infrequent testing and iteration, rather than an inexperienced or unintelligent crowd. Overly complicated tasks may increase the likelihood of “incorrect” results; data output goals that include highly specific formatting requirements may increase the likelihood that the data produced will not fit within these boundaries. While the quality of crowdsourced data is ultimately dependent on many factors, it is clear that, when thoughtfully designed, crowdsourcing can and does produce high-quality, useful results that regularly pass peer review.

Dimensions of data quality

Data quality is a multi-faceted concept involving such things as accuracy, auditability, and orderliness, with potentially different facets valued in different settings. Below, we consider data quality facets that we have found to be relevant to our own projects.

The clear documentation of data requirements recommended above will be useful in quality management decisions that have implications for task design and platform selection, among other steps. You need not adopt special language in your data requirements documentation — use the language that you and your collaborators will understand. However, you might wish to use the list below as a framework, with a line or section specifying the relevance of each to your end use of the data.

[Figure: Weaving together the three dimensions of data quality: fidelity, completeness, and accuracy]

Fidelity

Fidelity might be the data quality facet that most immediately comes to mind, though you might be using “accuracy” to identify the idea. Fidelity is the digital representation of an object following project guidelines. A transcription project might ask the participant to type the letters as written in the image of a page or label. If the page reads “30 m east from the wharf” and the contributor types out “30 meters east from the wharf” or “20 m east from the wharf,” both would represent reduced fidelity. Fidelity rates can be assessed in a random subset of the data.
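Assessing fidelity on a random subset could look like the Python sketch below. It assumes a reviewer re-keys each sampled item from the image, and it treats fidelity as an exact match after trimming surrounding whitespace. Both the strict matching rule and the function names are illustrative choices, and a project's own guidelines might tolerate more variation.

```python
import random


def draw_sample(item_ids, sample_size, seed=0):
    """Pick a reproducible random subset of item ids for review."""
    rng = random.Random(seed)
    ids = sorted(item_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def fidelity_rate(contributions, checked):
    """Share of reviewed items whose contributed text matches the
    reviewer's reading exactly (ignoring surrounding whitespace)."""
    if not checked:
        raise ValueError("no reviewed items to compare")
    matches = sum(contributions[item_id].strip() == text.strip()
                  for item_id, text in checked.items())
    return matches / len(checked)


contributions = {
    "p1": "30 m east from the wharf",
    "p2": "30 meters east from the wharf",  # abbreviation expanded
    "p3": "20 m east from the wharf",       # digit misread
    "p4": "30 m east from the wharf",
}
# In practice a reviewer would key only the ids from draw_sample();
# here all four items are reviewed to keep the example small.
checked = {item_id: "30 m east from the wharf" for item_id in contributions}
rate = fidelity_rate(contributions, checked)
```

With two of the four contributions departing from the page as written, `rate` is 0.5. Fixing the random seed means the same subset can be drawn again if the review needs to be repeated or audited.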

Accuracy

Accuracy is correspondence with generally agreed-upon reality. Reductions in accuracy can be introduced at the time that a cultural heritage item is produced, or when a translation of the information in that item occurs. Returning to the “30 m east from the wharf” example, a project might encourage participants to translate obvious abbreviations. If the participants translate the phrase as “30 miles east from the wharf” when the original intention was “30 minutes east from the wharf,” that would represent reduced accuracy. A further reduction in accuracy could arise if the original author of the sentence were misrepresenting the distance, and an average walker would have traveled just 20 minutes from the wharf to the location.

Assessments of accuracy are often framed by scale. In this example, that scale is spatial (miles) and temporal (minutes). A project geocoding historical locations might assign the wrong coordinates to the example location if working with a translation of the passage that is “30 miles east from the wharf.” However, another project determining whether an event occurred in one state or another might have all the information needed for an accurate assessment even with the reduced accuracy described in the example above, should the wharf be on a river that forms the western boundary of that state.

Assessment of accurate information content in cultural heritage objects might also reasonably be the focus of a crowdsourcing project, should the information content of an object be verifiable by either a second heritage object or a modern resource (such as airborne LiDAR data,25 in the case of relocating historical locations that can be identified through relatively slight modifications of the landscape).

Completeness

Completeness is the presence of data considered “in-scope” by the project at the point of use of the data. The absence of data can arise in various ways, including participant error (e.g., failure to type the “30 m” in our example above), but also loss or corruption of an image from a collection being queued up on a platform, loss of data contributions due to failures of internet connectivity, errors in data post-processing, and other mechanisms. Some reductions in completeness can be assessed and addressed by building checkpoints into your data management process.
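One simple checkpoint is to compare what was queued against what came back. The Python sketch below assumes you hold a list of the item identifiers you uploaded and that each exported row names the item it describes. The `subject_id` field name is a placeholder for whatever your platform calls that column.

```python
def missing_subjects(uploaded_ids, exported_rows, id_field="subject_id"):
    """Return uploaded ids that never appear in the exported data."""
    seen = {row[id_field] for row in exported_rows}
    return sorted(set(uploaded_ids) - seen)


uploaded = ["img-001", "img-002", "img-003"]
exported = [
    {"subject_id": "img-001", "text": "30 m east from the wharf"},
    {"subject_id": "img-003", "text": "Dear Sir"},
    {"subject_id": "img-003", "text": "Dear Sir,"},
]
gaps = missing_subjects(uploaded, exported)
```

Here `gaps` holds `img-002`, which could point to a failed upload, an image no participant reached, or contributions lost in post-processing. Running the same check after each processing step narrows down where the loss occurred.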

Some projects might define data completeness based not on a total number of data points generated by a project, but instead on metrics describing the pattern of those data points. For example, measures of convergence — many contributors agreeing — across data might be a more satisfying measure of completeness for your project.
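A convergence measure can be as simple as the share of contributors who gave the most common answer. The Python sketch below treats an item as complete once a minimum number of people have responded and a minimum share of them agree. The thresholds of three responses and 60% agreement are arbitrary placeholders, not recommendations.

```python
from collections import Counter


def consensus(values, min_votes=3, min_share=0.6):
    """Return (agreed_value, share). agreed_value is None until enough
    contributors have responded and enough of them agree."""
    if not values:
        return None, 0.0
    top_value, top_count = Counter(values).most_common(1)[0]
    share = top_count / len(values)
    if len(values) >= min_votes and share >= min_share:
        return top_value, share
    return None, share


agreed, share = consensus(["30 m", "30 m", "20 m"])
pending, pending_share = consensus(["30 m", "20 m"])
```

In the first call two of three contributors agree, so "30 m" is accepted. In the second there are too few responses, so the item would stay in the queue. Exact string comparison is a deliberate simplification: free-text transcriptions usually need alignment or normalization before agreement can be measured fairly.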

In our experience, successful project teams often choose to begin exploring the resulting data before the project ends, recognizing that even incomplete data can hold exciting discoveries and that early exploration allows teams to ensure that their project is producing useful, usable results.

Other dimensions

Bias leads to variation in the data produced by different contributors as a result of variation in interpretation, translation, or filtering information about the cultural heritage object. If historical documents are processed by speakers of a majority language, records written in minority languages might be left out of the resulting data because the participants do not feel competent in that language. While the process is innocuous, the result is the erasure of minority presence from the resulting data set. Without careful design, bias can also be proliferated as a result of the historical contexts, information systems, and professional practices related to source materials.

Auditability is the ability to identify the provenance of data and involves preserving the relationship of the cultural heritage object, the crowdsourcing task workflow, the contributor, and the data produced by the contributor. Auditability is best considered at the step of selecting a platform since a platform will need to record relevant information in its database for this to be available later.

An example from Zooniverse output26 demonstrates the data model and the ability to audit the origins of certain types of information:

  • classification_id

  • user_name

  • user_id

  • user_ip

  • workflow_id

  • workflow_name

  • workflow_version

  • created_at

  • metadata (generated through the classification task, e.g., browser information, user language, subject dimensions)

  • annotations (e.g., the classification data itself, which varies based on task type)

  • subject metadata (varies based on what was included in the original data upload)
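As a sketch of how the fields listed above support auditing, the Python below gathers the provenance-related columns of one exported row into a single record. The `Provenance` class and the sample values are invented for illustration. Only the column names come from the list above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Provenance:
    """Who produced a data point, when, and under which workflow."""
    classification_id: str
    user_name: str
    workflow_id: str
    workflow_version: str
    created_at: str


def provenance_of(row):
    """Build a provenance record from an exported row. A missing
    column raises KeyError, flagging data that cannot be audited."""
    return Provenance(**{f.name: row[f.name] for f in fields(Provenance)})


row = {
    "classification_id": "101",
    "user_name": "ada",
    "workflow_id": "77",
    "workflow_version": "12.4",
    "created_at": "2021-04-29 10:15:00 UTC",
    "annotations": "[]",
}
trail = provenance_of(row)
```

Recording the workflow version matters because instructions and task design often change during a project, and two contributions to the same item may have been made under different guidance.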

Timeliness is a measure of the up-to-date quality of the data. Cultural heritage objects might be thought of as changing very little over time (one hopes; conservators have a never-ending battle with deterioration). However, the interpretation of those cultural heritage objects might change substantially. For example, a poorly understood language might become better understood with additional scholarship, or animal taxonomy might change with further fieldwork so that an earlier name becomes out-of-date. In our experience, relatively few projects produce data that does not lose timeliness as the years pass.

Orderliness is the conformity of data to a standard. The standard could involve agreed-upon community standards established for interoperability, standards established for data ingestion at a data catalog, or other requirements that may emerge.

Case study: Wikidata contributor community agreeing on data standards27

In the context of Wikidata editing, data schemas can provide a standardised structure for data on a subject area. Previously there hasn’t been a standard, community-agreed way to record specific types of items (e.g. a train, plant, human author, human astronaut, etc.). Inconsistent data modelling has made it tricky to query Wikidata, and tricky for others to contribute consistently.

Wikidata Schemas28 were released on Wikidata in 2021 as a potential standard; one example is EntitySchema:E1029 for a human. They are encoded in the ShEx ("Shape Expressions") language, which is extremely flexible but may exclude general non-technical Wikidata editors due to the complexity of learning the language. Once created, Wikidata Schemas can be used to automate checking and reporting on items, for example by generating lists of what needs to be fixed or a completeness score. At the time of writing, this is an ongoing discussion within the Wikidata community.

Quality control methods

How do you ensure that the quality of crowdsourced data meets the goals of the project? How can you optimize for participant productivity and user experience while maintaining rigor? What are the implications of particular strategies for processing data after the crowdsourcing project has finished?

Quality control methodologies fall into two broad categories, each with its own strengths and weaknesses. Multi-track, or multiple single-key entry, approaches present the same digitized asset to two or more people, producing two or more independently created versions of the “same” data. Single-track, or single-key entry, approaches present an asset to one user. The system may then ask a different user to review or correct the first user’s contribution.

Below, we take an in-depth look at the strengths and weaknesses of two examples of quality control: Wikisource (single-track) and Zooniverse (multi-track).

Single-track quality control: Wikisource

Wikimedia projects such as Wikisource and Wikipedia employ a single-track approach, in that any editor may see and correct a previous editor’s work. Every edit is publicly visible, and a different view shows exactly what has been changed. In Wikisource30 — the Wikimedia text transcription project — pages within a document must be reviewed by a proofreader and validated by a second proofreader, according to this workflow:

Participants can locate pages needing transcription, proofreading, or validation within a document on the document’s index page. This type of interface makes the reliability of pages transparent and directs participants to the types of tasks needed for each page.

Index page for The Grateful Dead, showing pages validated, needing review, needing proofreading (https://en.wikisource.org/wiki/Index:The_Grateful_Dead.djvu)

Strengths: usable data, collaborative learning

One of the strengths of single-track approaches is that serial volunteer efforts converge on data in a form that may be immediately usable in research or discovery systems. Participants are instructed to transform data into formats that match the needs of the project, such as full text for searching or form fields for indexing. This substantially reduces the need for data processing after the contributions are exported from the crowdsourcing platform.

Although participants learn from their own experience performing a task in all systems, single-track approaches allow them to observe other users’ contributions during the review process, and to learn from changes made by other users while reviewing their work. This kind of iterative process can mirror classroom settings, as corrections to work are an important instructional tool. Key to these interactions being fruitful is having a place for participants to discuss disagreements and justify corrections, to ensure that changes are ultimately understood by all parties involved. The potential benefits of this type of system can be considered alongside possible weaknesses, including those described below.

Weaknesses: bias, founder effect, conflict, motivation

A frequent aim of crowdsourcing tasks is to take hard-to-process digital artifacts, such as photographs or handwritten text, and convert them into accessible forms including captions or digital text. Since the accessible forms are by their nature easier for contributors to read, a reviewer in a single-track approach is apt to focus on the previous participant’s work, rather than the source material. As a result, mistakes which seem reasonable to the reviewer are less likely to be spotted than mistakes that do not seem to make sense. Because a frequent cause of participant error is hypercorrection — normalizing words that look misspelled in the original — a reviewer is more likely to miss a mistake that “looks” correct. In addition to allowing errors to slip through undetected, this can also cause systemic bias in the resulting data.

Hypercorrection passing review in Missouri State Archives 1970 Death Certificates (https://www.sos.mo.gov/archives/about/volunteers)

Since participants can see each other’s work, those who contributed early to the project or who are especially active can have an enormous influence on the community. If their practices do not align with the needs of the project, their influence can damage the quality of other participants’ data. This can happen if early, prolific participants adopt practices contrary to project instructions, such as modernizing spelling of historic texts in projects that request verbatim et literatim transcription, lowering the quality of the contributions. Well-meaning participants may also adopt sophisticated encoding conventions for unclear or underlined text, introducing mark-up that may need to be stripped in projects collecting data for full-text search. Since volunteers often move from one crowdsourcing project to another, they may follow the instructions they learned on prior projects rather than reading project-specific instructions.

Since single-track approaches allow participants to correct each other’s work, the correction process can become a site of interpersonal conflict. These conflicts are most famous as “edit wars” on Wikipedia, where pages about disputed territories may become arguments between partisans of the different claimants, but they also occur when editors rewrite articles to change British English to American English or vice versa.31 This problem can be addressed through clear policies and active moderation, but recurring disputes may drive away participants.

Projects which require each contribution to be reviewed may run into the challenge of user motivation. For some participants, transcribing text or classifying images is fun, but reviewing someone else’s contributions may feel like grading homework. If a single-track review is performed only by staff members, it can cost significant amounts of staff time. If participants carry out the review, your project team may need to specially recruit and motivate them to do reviewing tasks.

These potential outcomes should be taken into account when planning for data. Strategies for shaping contributions through participant engagement include providing adequate access to context and instructions and consistently affirming project goals (e.g., what level of accuracy and completeness you are seeking, and for what purposes).

Multi-track quality control: Zooniverse

The most popular crowdsourcing platform supporting multi-track quality control is the Zooniverse.32 Multi-track contributions are built into the platform structure, allowing project creators to set their requirements for what constitutes “multi” based on the needs of their project. Teams can also adjust these settings throughout their project, increasing or lowering the number of contributions needed based on early reviews of data quality. This section will use case studies from several Zooniverse projects to illustrate the strengths and weaknesses of multi-track methods, and the way that the platform team have tested their methods to ensure new tools are supporting those using the platform.

Strengths: automated evaluation, reduced bias

As multi-track crowd contributions are produced independently, it is possible to use algorithms to determine correctness. If ten people are presented with an image and asked the question, “Is this an apple?”, nine or ten responses of “yes” allow the project to confidently classify that image as an apple without any need for review by project staff. For many tasks, this automation can save project team or participant time that would be spent reviewing contributions.
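The apple example can be sketched as a simple agreement check. This is an illustration only: the 0.9 threshold mirrors the "nine or ten out of ten" wording above and is an assumption, not a recommended value or any platform's actual rule.

```python
from collections import Counter


def consensus(responses, threshold=0.9):
    """Return (answer, share) if the top answer reaches the threshold, else (None, share)."""
    answer, count = Counter(responses).most_common(1)[0]
    share = count / len(responses)
    return (answer if share >= threshold else None), share


clear_case = ["yes"] * 9 + ["no"]       # nine of ten agree
split_case = ["yes"] * 6 + ["no"] * 4   # no confident answer

print(consensus(clear_case))
print(consensus(split_case))
```

Items that come back with no answer are the ones worth routing to staff or to further participants, so that review time is spent only where the crowd disagrees.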

At their best, multi-track methods also support even distribution of participant effort, helping to ensure that the number of contributions solicited is optimized for producing quality data, and helping to avoid overclassification of data and wasting participants’ time.

Case study: automation for optimizing volunteer effort

The Snapshot Serengeti project,33 which asked volunteers to classify species of animals in camera trap images from Serengeti National Park, used automated methods to determine whether participants were reaching consensus more quickly on certain images than others. The maximum number of contributions an image could receive was 25. However, if the first five contributions said there was no animal present in the image, it was considered “blank” and removed from the project early. Similarly, if 10 participants submitted the same contribution, this was considered complete based on participant consensus. This allowed participants to focus their efforts on images that truly needed all 25 contributions, rather than those that were more easily completed.34
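The three rules described above can be sketched as a small function. This follows only the description in this case study, not the project's actual code, and the `"blank"` label and the function name are hypothetical.

```python
from collections import Counter

MAX_CONTRIBUTIONS = 25  # upper limit per image
BLANK_STREAK = 5        # first five contributions all report no animal
AGREEMENT_COUNT = 10    # ten matching contributions


def retirement_reason(contributions):
    """Return why an image can be retired, or None if it needs more contributions."""
    if len(contributions) >= BLANK_STREAK and all(
        c == "blank" for c in contributions[:BLANK_STREAK]
    ):
        return "blank"
    if contributions and Counter(contributions).most_common(1)[0][1] >= AGREEMENT_COUNT:
        return "consensus"
    if len(contributions) >= MAX_CONTRIBUTIONS:
        return "limit reached"
    return None


print(retirement_reason(["blank"] * 5))
print(retirement_reason(["zebra"] * 10))
print(retirement_reason(["zebra", "lion"] * 3))
```

Writing the rules down as named constants makes it easy to adjust them after an early review of data quality.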

Independence of contributions can help eliminate bias from early contributors; there is no need to worry that an early contributor who did not follow instructions will be copied by later contributors since their work will not be visible.

Weaknesses: data granularity, labor duplication, motivation, learning

Like single-track approaches to quality control, multi-track approaches introduce their own problems. Projects should decide based on weighing data quality needs and labor availability.

Because it is easiest to reconcile fine-grained data such as multiple-choice responses or number transcription, projects attempting to use multi-track methods may run into challenges when crowdsourcing tasks produce coarse-grained data. This is a particular challenge for text transcription, as full-line or full-page transcripts can vary based on the participant’s style (including elements as small as the number of spaces they place after a period), which may lead to false positives by algorithms detecting disagreement between contributions of the same data.
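A small sketch of the spacing problem: two transcriptions that a human would call identical fail an exact comparison until whitespace is normalized. The sample strings are invented. Whether normalization is acceptable is a project decision, since a project that wants to preserve original spacing would not apply it.

```python
import re


def normalize(text):
    """Collapse runs of whitespace and trim the ends before comparing."""
    return re.sub(r"\s+", " ", text).strip()


a = "Dear Sir.  I write to you today."  # two spaces after the period
b = "Dear Sir. I write to you today. "  # one space, plus a trailing space

print(a == b)
print(normalize(a) == normalize(b))
```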

As multi-track approaches produce several versions of the same data, some form of consolidation will be necessary to convert the data into a usable form. If large tasks have been reduced to fine-grained microtasks, data may need to be aggregated to be usable. For example, a text transcription project might break page transcription into independent line transcription micro-tasks to simplify automated evaluation and reconciliation. The resulting validated lines would need to be reassembled into pages of text before they could be ingested into digital library systems.
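The reassembly step might look like the following sketch, which assumes each validated line carries a page identifier and a line number. The record layout and sample text are hypothetical.

```python
from collections import defaultdict

# Hypothetical validated line records: (page_id, line_number, text)
validated_lines = [
    ("page-2", 1, "Yours faithfully,"),
    ("page-1", 2, "I write to you today."),
    ("page-1", 1, "Dear Sir,"),
]


def assemble_pages(lines):
    """Group lines by page and join them in line-number order."""
    pages = defaultdict(list)
    for page_id, line_number, text in lines:
        pages[page_id].append((line_number, text))
    return {
        page_id: "\n".join(text for _, text in sorted(page_lines))
        for page_id, page_lines in sorted(pages.items())
    }


pages = assemble_pages(validated_lines)
print(pages["page-1"])
```

This only works if the page and line identifiers are recorded when the microtasks are created, which is one reason to plan the data model before launch.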

Because multi-track approaches require two or more participants to perform the same work, they can be seen to waste volunteer labor, slowing down the overall productivity of the project. Some potential volunteers may choose to avoid multi-track projects, feeling that these projects either do not trust their work or do not value their labor. Successful projects using multi-track approaches monitor quality and adjust the number of independent participants who are presented with the same data to minimize wasted effort (see the Snapshot Serengeti example above). Additionally, successful projects using these approaches will communicate clearly to participants why the methods are necessary and how they work.

Some of the mechanisms used to enforce multi-track quality control can interfere with participant motivation. In particular, the data processing needed to present an unclassified image to three people may mean that the image is no longer accessible to participants once the third person has classified it. If that image is in the middle of a sequence being classified by a different user, that user may be forced to skip the image, interrupting the immersive experience of a narrative presented by the images. This is likely to be more common for complex task types such as transcription, which often require longer periods for participants to submit their contributions. For quicker cultural heritage task types such as tagging or drawing boxes, the risk may be lower.

While single-track methods run the risk of early contributors biasing later participants, they have the advantage that participants can learn from each other’s contributions. Reviewing another user’s work or seeing one’s own work corrected by another participant is a valuable way to build skills with the crowdsourcing task and expertise over the source material. By preventing users from seeing each other’s contributions, multi-track methods may slow down potential skill development among users.

Case study: testing the influence of Zooniverse tools on the quality of transcription results

A recent study by our team examined many of the weaknesses of multi-track methods described above. We used an A/B experiment to study our platform’s traditional multi-track transcription method against a hybrid collaborative approach that allowed participants to see and interact with the other transcriptions being submitted, while still technically soliciting multiple transcriptions of the same text.

The method included metrics for distributing effort evenly across a page (or pages) of text, as well as a consensus model that ensured that “completeness” was based on participants agreeing on a transcription, rather than solely being based on the number of transcriptions received. While participants did not edit one another’s work (as in a single-track method), this hybrid version is in a sense a combination of multi-track and review-based transcription methods — essentially asking participants to review and agree with one another in order to achieve consensus. In essence, the hybrid method allows participants to look behind the scenes and see the multi-track method at work.

Our A/B study compared this method with traditional independent multi-track methods, and we ultimately found that the collaborative multi-track method produced significantly higher-quality data.35 As a result, we are adding the collaborative “hybrid” tool to the Project Builder toolkit for wider use.

Balancing data quality and participant experience

The single- and multi-track methods described above each have their own strengths and weaknesses. It is up to your project team to choose the approach that fits your values and goals, and which produces results that you can work with given your team’s skill sets. If you choose a more complex option such as multi-track methods based on concerns over data quality, you will need to be sure you have a data reconciliation plan in place before you invite participants to start contributing to your project. Quality control methods should not be used to police the contributions of your participants, but should instead be viewed as a way to optimize effort and make sure that the labor being donated to your project is being used effectively.

It is also important to consider the preferences of your community. As we note in many chapters of this book, participants in cultural heritage crowdsourcing are not a monolith; some will prefer task types that support single-track methods, others will prefer task types that align well with multi-track methods. Ultimately, participants will choose to take part in projects based on their own enjoyment of the task(s) as well as their interest in the subject matter. It is up to you and your team to consider your participants’ preferences alongside your own data-driven needs.

For example, in 2013, Smithsonian Institution and Quotient, Inc software engineers and data architects created the Smithsonian Transcription Center.36 Their original orientation to system design was through the lens of the types of collections data and the forms of data existing systems could accommodate. The adoption of a modular approach and open source technology (Drupal) allowed them to begin building a platform that was “extensible, versatile, able to be integrated, and adaptable to future needs.” Within a year, the dual mission to collect knowledge to promote discovery and support the journey of participants helped establish workflows that could be adapted for emerging requirements.

The approach to data-oriented but people-focused development created a platform with specific data generation functionality for enhancing collections records, project administration, and resulting data exports in JSON, CSV, and PDF formats. None of these development decisions relied on Smithsonian Institution infrastructure; instead, they were designed with that infrastructure in mind, including the ability to index the results directly with the collections management systems and alignment with the Smithsonian Web ecosystem at that time.

Balancing data quality and the value of specialist and unique tags

Gathering tags to label images can require different ways of thinking about data quality.37 While some projects look for agreement between taggers,38 you might also value unique tags that do not match other tags39 or help overcome the “semantic gap” between the language used by ordinary people and that used in cataloging.40 In some cases, unique tags might represent specialist knowledge. For example, a sailor will know more specific nautical terms than a non-sailor and be able to provide more precise labels for different kinds of sails, ropes, and knots than other people. Spelling mistakes may also be useful in broadening the discoverability of items, even if these tags are theoretically incorrect. Verification tasks or machine learning-based methods may be able to automatically exclude problematic unique tags added by spambots or would-be vandals.
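One way to keep both kinds of value is to separate tags that several participants agree on from tags that only one participant supplied, and to set the unique tags aside for review instead of discarding them. The sketch below assumes a threshold of two participants for "agreement". The sample tags are invented.

```python
from collections import Counter

# Hypothetical tags from four participants for one image.
tags_by_participant = [
    ["ship", "sail"],
    ["ship", "boat"],
    ["ship", "sail", "spinnaker"],
    ["Ship ", "harbour"],
]


def split_tags(tag_lists, min_agreement=2):
    """Count each tag once per participant, then split agreed tags from unique ones."""
    counts = Counter(
        tag for tags in tag_lists for tag in {t.strip().lower() for t in tags}
    )
    agreed = sorted(t for t, n in counts.items() if n >= min_agreement)
    unique = sorted(t for t, n in counts.items() if n == 1)
    return agreed, unique


agreed, unique = split_tags(tags_by_participant)
print(agreed)
print(unique)
```

Here "spinnaker" ends up in the unique list, where a reviewer can recognize it as specialist knowledge, not noise.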

Processing and accessing results

Processing data and quality control can have elements in common. Some projects that use a single-track approach review and reconcile data within the crowdsourcing project. In these cases, quality control happens before data is exported from the project platform. There are exceptions — for example, projects led by the Alabama Department of Archives and History take this approach and clean the data further after data export by using a spreadsheet where different errors are easy to pick up.

Your ideal version of data from your project may not directly match what is feasible to produce as a direct outcome of the project, depending on your available resources. The data that comes directly from your project will vary according to many factors including platform, source material, task type, etc. This section will present some considerations to keep in mind as you plan for processing your project results. You can read more about useful things to consider when designing your project to ensure your data is as close to your ideal outcome as possible in the “Choosing tasks and workflows” chapter.

Raw data refers to the results that come directly from your project. In some cases, your raw data will be ready for use or analysis, but sometimes it will require processing before it is ready to be used, for example, to be ingested into your institutional repository. This is known as post-processing, and it can include many different procedures, from quality control methods to data cleaning to data aggregation.

Data cleaning

Data cleaning is the process of imposing structure and order onto results that are often considered “messy” in order to process them with software that expects consistent and complete data. While an accepted part of working with crowdsourced data, data cleaning seems to be both unavoidable and unattainable, particularly when considering that the source material we are drawing from is itself often messy and non-standardized. While few (if any) fields would claim their data as perfectly tidy and consistent, source artifacts in cultural heritage are particularly prone to messiness, complexity, ambiguity, and contradiction, because of the nature of their production, use, survival, preservation, and curation over the years or even centuries. Layers of interpretation add to this complexity. Crowdsourced transcription can be powerful in its ability to demonstrate to participants the humanity inherent in historic documents and artifacts, a practice that is encouraged by practitioners, researchers, and educators.

Data cleaning encompasses processes such as identifying incorrect or missing information, fixing errors, removing duplicates, and resolving contradictions. Many existing tools can help with this process. Those new to the process may want to start with OpenRefine, a free and open-source tool for working with data, for which many useful tutorials exist online.41
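For teams that prefer scripting, a few of these steps can be sketched in plain Python. The records and field names below are invented for illustration, and the rules (case-insensitive duplicate matching, digits-only years) are assumptions that a real project would replace with its own.

```python
# Hypothetical exported records; field names are invented for illustration.
records = [
    {"id": "1", "name": "Mary Smith", "year": "1901"},
    {"id": "2", "name": "mary smith ", "year": "1901"},
    {"id": "3", "name": "John Doe", "year": ""},
    {"id": "4", "name": "Ann Lee", "year": "19O1"},  # letter O typed for zero
]


def clean(rows):
    """Return (kept rows, issues) after removing duplicates and flagging problems."""
    seen, kept, issues = set(), [], []
    for row in rows:
        key = (row["name"].strip().lower(), row["year"].strip())
        if key in seen:
            issues.append((row["id"], "duplicate"))
            continue
        seen.add(key)
        if not row["year"].strip():
            issues.append((row["id"], "missing year"))
        elif not row["year"].strip().isdigit():
            issues.append((row["id"], "invalid year"))
        kept.append(row)
    return kept, issues


kept, issues = clean(records)
print(issues)
```

The sketch flags missing and invalid values but keeps those rows, so that a person decides how to resolve them and the decision can be documented.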

In “Against Cleaning,” Katie Rawson and Trevor Muñoz suggest “indexing” as a replacement for “cleaning,” describing their own process working with data output from the New York Public Library’s What’s on the Menu? project. Ultimately, for their project, the authors chose to embrace diversity within their dataset, reminding us that, “However tidy values may look when grouped into rows or columns or neatly delimited records, this tidiness privileges the structure of a container, rather than the data inside it.”42 As with the quality control methods described above, the key to this process is to consider your project’s values alongside your needs, as well as the needs of the potential users and re-users of your resulting data.

Data aggregation

If your crowdsourcing project uses multi-track methods, you may need to aggregate your data. This will be the case if your goals include determining a single outcome for each piece of data included in your project. Methods for this process are largely dependent upon the tasks and tool types being used. For example, the Zooniverse platform documentation includes examples of aggregation methods using Python, broken down by the task types available in the Project Builder toolkit, as well as documentation for setting up Python on your machine and a code of conduct for potential contributors to the codebase.43 Even with robust documentation, however, this process can be difficult, especially for those who do not have experience using Python.

Case study: alternatives to aggregation in the Mutual Muses project

For some cultural heritage projects, particularly those which focus on transcribing text, true “aggregation” of data may not be an option for your team. In those cases, other approaches are possible, such as the method identified by a team in the Digital Art History program at the Getty Research Institute for their crowdsourcing project, Mutual Muses.44 The team was working with output from the free-text transcription of full pages of text and decided that aggregating large amounts of text was not feasible for their team. Instead, they opted to automate the process differently, asking: “given a set of transcriptions all based on the same document, which one has the most information in it that is also the most backed up by its fellow transcriptions? In other words, which transcription do all the other transcriptions agree with the most?” They used a combination of existing software and custom code to create a process that compared the (six) transcriptions received for each page of text and found which single transcription featured the most agreement with the other transcriptions submitted. The team described their process in detail in a follow-up blog post about their project, and shared their code and results in a GitHub repository.45
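The question the team asked can be illustrated with a short sketch. This is not the Getty team's code: it substitutes the character-level similarity ratio from Python's standard `difflib` module for whatever comparison they used, and the sample transcriptions are invented.

```python
from difflib import SequenceMatcher


def most_supported(transcriptions):
    """Return (index, mean similarity) of the transcription closest to all the others."""
    if len(transcriptions) < 2:
        raise ValueError("need at least two transcriptions to compare")
    best_index, best_score = 0, -1.0
    for i, candidate in enumerate(transcriptions):
        others = [t for j, t in enumerate(transcriptions) if j != i]
        score = sum(
            SequenceMatcher(None, candidate, other).ratio() for other in others
        ) / len(others)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


versions = [
    "Dear Sir, I write to you today.",
    "Dear Sir, I write to you today",
    "Dear Sir I wrote to you to-day.",
]
index, score = most_supported(versions)
print(versions[index], round(score, 3))
```

Because the result is always one participant's actual transcription, no text is synthesized that nobody typed, which keeps the provenance of the chosen version simple.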

Whether you choose to aggregate your data or take an alternate approach (such as the most-supported transcription method described above) is up to you and your team to determine, according to your project values. We recommend deciding on the appropriate methods before you launch your crowdsourcing project, otherwise you risk wasting participants’ time.

It may be helpful to define audiences or personas who will use the data from your project. For example, in the Engaging Crowds: citizen research and heritage data at scale project,46 three audiences were borne in mind during the design of its crowdsourcing projects: the general interested public including the people who participated in the projects, the collections-holding organizations and their cataloging systems, and Machine Learning and Artificial Intelligence applications that could use the resulting data to train future programs. It may be that you will design different data-cleaning processes for the audiences you identify, or perhaps that you will intervene further in one version of your dataset. As with all processes involving data and code, it is good practice to document both the steps you take and the decisions that lie behind them. This will align with values around openness and transparency and allow future users of the data to understand the dataset, even where they might have preferred a different approach.

Access to results

In addition to considering the ethical frameworks for data management and use, it is important to think about who will have access to the resulting data from your project, both in terms of viewing the data as well as reusability.

When thinking about access to results, bear in mind that your participants are donating their time and effort to your project. How would they feel if they were not able to access data that they had helped to create? How does this align with the values of your project? Additionally, if the re-use of this data would be facilitated by access to the source images that were used to generate it, failing to make the source images freely available may impede the ability to re-use the project results. Similar consideration should be given to additional products created using this data, for example, scholarly articles and books. Licensing terms or other organizational or ethical considerations do not always allow for free sharing of some data, especially images. While it may not be in your power to change your organization’s policy on sharing images, it is important to be open and clear with participants about this from the project start so everyone involved can make an informed decision on their participation.

Questions of access to the results of crowdsourcing projects are not new. In 2010, Rose Holley produced a list of “Tips for crowdsourcing,” generated through personal experience as well as through conversations with other crowdsourcing practitioners, which includes the dictum “Make the results/outcome of your work transparent and visible.”47 Though this advice has been present in the field of crowdsourcing for more than a decade, the process of generating data is still often prioritized over sharing results. This is evident in continued conversations across cultural heritage institutions about their practices and processes and formed part of the discussions in 2020’s “After the crowds disperse: crowdsourced data rediscovered and researched” workshop, which brought together many professionals and volunteers interested in cultural heritage crowdsourcing to discuss what happens to data within collections-holding organizations when participants have finished working on a project.48

While it is unlikely that colleagues within organizations would run crowdsourcing projects when the results would not feed into a catalog or finding system and enhance access to and knowledge of cultural heritage, there are many other potential audiences for your data. The steps needed to make your data not only available but also truly accessible to various audiences can be designed into the project from its inception.

Using results

In this chapter, as elsewhere in the book, we have stressed the ethical requirement to honor the work of participants in a crowdsourcing project. One aspect of this is in ensuring the work of the project is accessible both to the participants and the wider public. For more discussion of respecting participants, see the chapters on “Identifying, aligning, and enacting values in your project” and “Understanding and connecting to participant motivations”.

We are familiar with the idea that it is other people who often do the most exciting things with our data. To enable this, sharing your data is an important step in the closing stages of a project. After you have completed any data cleaning and processing, you may choose to deposit a copy of the data in a data repository. We recommend using a repository that can assign your data a Digital Object Identifier (DOI) so it can be referenced easily. If subsequent work augments or enhances the data, remember to add and link the updated version to the repository entry. Both the original dataset and any augmented or annotated versions need to include clear and accessible documentation.

An important part of this documentation is the license and/or terms of use. We recommend using the most open license available to you, to enable the most permissive re-use of the data produced by participants. We understand that this choice may not be in your power, depending on the organization you work for, but even a restrictive license is better than no license at all. In many jurisdictions, having no license explicitly attached to data will prevent anyone from re-using it, as the copyright rests with the perceived owners.

Case study: principles around access to the Colored Conventions Project corpus

The Colored Conventions Project asks people to agree to operate by a specific set of principles when using the results of its crowdsourcing transcription project. After the completion of the first phase of Transcribe Minutes (2014-2017), the CCP published the transcriptions on its website. Recognizing the harm caused by converting Black lives into data to forcibly remove them from their contexts, they set forth a handful of principles to shape the use of the CCP corpus. Visitors to the website must first click through an agreement to these principles before they can access the page with downloadable materials. These terms of use align with Principle 5 of the CCP Principles,49 presented on the project website: “We affirm the role of Black people as data creators and elevate the ways in which Black conventions generated data and statistics to advance, affirm and advocate for Black economic and organizational success and access. We also recognize that data has long served in the processes and recording of the destruction and devaluation of Black lives and communities. We seek to avoid exploiting Black subjects as data and to account for the contexts out of which Black subjects as data arise. We seek to name Black people and communities as an affirmation of the Black humanity inherent in Black data/curation. We remind ourselves that all data and datasets are shaped by decisions about whose histories are recorded, remembered, and valued.”

Here are the terms to which potential users of the CCP Corpus must agree:50

  • I honor CCP’s commitment to a use of data that humanizes and acknowledges the Black people whose collective organizational histories are assembled here. Although the subjects of datasets are often reduced to abstract data points, I will contextualize and narrate the conditions of the people who appear as “data” and to name them when possible.

  • I will include the above language in my first citation of any data I pull/use from the CCP Corpus.

  • I will be sensitive to a standard use of language that again reduces 19th-century Black people to being objects. Words like “item” and “object,” standard in digital humanities and data collection, fall into this category.

  • I will acknowledge that Colored Conventions were produced through collectives rather than by the work of singular figures or events.

  • I will fully attribute the Colored Conventions Project for corpora content.

Examples of additional uses people have found for crowdsourced data include linking it to Wikipedia and using it to train Machine Learning algorithms. The latter can be particularly useful where you have a large collection in similar handwriting, and participant-transcribed text can be used to train a Handwritten Text Recognition (HTR) algorithm via tools such as Transkribus.51

Case study: using public domain crowdsourced data to create the Newspaper Navigator application

In 2020, Library of Congress Innovator in Residence and Computer Science Ph.D. student Benjamin Charles Germain Lee created Newspaper Navigator, a visual search application that can be trained through Machine Learning and participant input.52 Lee was inspired by the LC Labs experiment Beyond Words53 and designed a pipeline to extract images from the over 16 million pages of historical newspaper content in Chronicling America, the access point for the National Digital Newspaper Program.

The main goals of Newspaper Navigator were to extract those images and then to reimagine visual search using that corpus of images; however, the project was made possible by the contributions of volunteers to the Beyond Words data corpus.54 From its design, participants were made aware that the classifications they created (in this case, by segmenting images, editing their OCR captions or transcribing the captions directly, and classifying the images against a controlled vocabulary) would be released into the public domain upon their creation.

This is the rights statement that was available on the data page for Beyond Words: “The data contributed by volunteers like you can be used in many different ways. We are giving back to our community by making this data public. All contributions to this application are released into the public domain. You are free to use this data set in any way you want. The data is provided as JSON data. Please note: this data set only contains crowd-verified records and may initially be very small. It may change or grow as volunteers like you contribute to Beyond Words.” The dataset was available for download from the project’s data page. When Lee downloaded the dataset, it consisted of only approximately 3,500 images but it seeded the successful data extraction pipeline for Newspaper Navigator. He paid it forward by making his modified dataset(s)55 available for reuse by releasing it into the public domain.56

You may have particular plans to re-use your data beyond a project’s primary goal (such as improving a finding aid). We recommend keeping in mind, from the start of the project, who else might want to use the data and for what:

  • Who else might use your data?

  • For what purposes?

  • Can you embed guidance on how to use the data that would be particularly useful to particular groups, such as family or local historians?

  • Can you facilitate computational access to the data?

  • Can other people find your data?

  • Will people understand from your documentation how the data was produced?

  • Can they refer to it uniquely?

  • Can they share additions or enhancements to your data?
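Several of these questions (computational access, documentation of how the data was produced, and a way to refer to the data uniquely) can be partly addressed by shipping the data with a small machine-readable description. The following is a minimal sketch in Python, not a recommended standard: the function name build_data_package, the field names, and the example records are all hypothetical, and the SHA-256 checksum is only one assumed way of letting re-users confirm they hold the same version of the data.

```python
import hashlib
import json


def build_data_package(records, title, license_name, how_produced):
    """Bundle records with the documentation a re-user would need.

    All field names here are illustrative assumptions, not a standard.
    """
    # Serialize deterministically so identical records give an identical checksum.
    payload = json.dumps(records, sort_keys=True, ensure_ascii=False)
    checksum = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {
        "title": title,
        "license": license_name,
        "how_produced": how_produced,
        "record_count": len(records),
        "sha256": checksum,
        "records": records,
    }


# Hypothetical example records; a real export would come from your platform.
records = [
    {"page_id": "p-001", "transcription": "Minutes of the meeting", "reviewed": True},
    {"page_id": "p-002", "transcription": "List of delegates", "reviewed": False},
]

package = build_data_package(
    records,
    title="Example transcription export",
    license_name="CC0-1.0",
    how_produced="Transcribed by volunteers; reviewed where marked.",
)

# Print the package as JSON, the form in which it might be offered for download.
print(json.dumps(package, indent=2, ensure_ascii=False))
```

A checksum identifies one exact version of the data but is not a substitute for a persistent identifier or a repository deposit; it simply gives people citing or extending the data a way to state which version they used.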

Summary

This chapter went through the whole lifecycle of data in a crowdsourcing project, giving you dimensions to consider and underpinning them with case studies that point to practical applications. We started with getting ready to use data (e.g., data management plans), then turned to data ethics, presenting various frameworks that could be used. We highlighted data quality dimensions, such as accuracy and completeness, then moved on to quality control methods, again giving you various options based on practical examples. We also teased out tensions in quality control, such as quality versus user experience. The chapter ended with advice on processing and accessing the data resulting from your project.

We hope that we have provided tools and frameworks to handle your project’s data. Planning the types, structure, and purpose of the data enables us to make the most of it and ensure that it suits the uses we have for it — and attempts to support future, unforeseen uses. It may be useful to go to the “Evaluating your crowdsourcing project” chapter next, where we also discuss data in the context of capturing and measuring outputs and impact.
