Quality Assurance

Data Management

Quality control is an important aspect of data management. It was discussed at the IODE/MarBEF data management workshop in Oostende in March 2006, and a short report based on these discussions can be found below. If you wish to contribute to any of these issues, please contact Sarah Faulwetter and the MarBEF Data Management team at data@marbef.org.

Please download the summary report of the Data Quality Working Group.

Tools to assist with data management and quality control for (marine) biodiversity data

Database Management Tools

Name Description Further information / Links Costs Desktop / Web tool
BRAHMS Database software for botanical research and collection management. It supports the management of names, collection curation and taxonomic research. link none Desktop
Desktop GARP DesktopGarp is a software package for biodiversity and ecological research that allows the user to predict and analyze wild species distributions. link none Desktop
Specify The Specify Software Project supports biological museums and herbaria with collection data management software, and data conversion, helpdesk and training services link none Desktop
GBIF Data Repository Tool An integrated data warehouse product for GBIF nodes that allows hosting of data in document format, data validation, and sharing via the DiGIR protocol. Also includes a simple image server and a BioCASE-compatible directory tool for keeping track of institutions and collections. link none Desktop
speciesBase speciesBase is a generic Microsoft® Access database developed in Visual Basic for Applications (VBA) to record generic taxonomic data. Its development was motivated by the increasing number of collections that wish to computerize their holdings and share data through the speciesLink network. speciesBase is a system to record taxonomic data associated with inventories, supported by foreign data tables. link none Desktop
Unicorn The system is based on MS-Access and provides a dedicated application for storing and editing data resulting from benthic surveys ? Desktop
Taxis Desktop application to manage taxonomic data, specimen collection, character descriptions and identification keys, biogeographical records, literature references, images and much more. link none Desktop

Georeferencing Tools, GIS Tools

Name Description Further information / Links Costs Desktop / Web tool
GeoLocate GEOLocate is a comprehensive electronic georeferencing solution facilitating the task of assigning geographic coordinates to the locality data associated with natural history collections. link none Desktop
SAGA GIS SAGA is a powerful open source GIS with good support from the user community. It reads various formats and can create Shapefiles. link none Web
Quantum GIS Quantum GIS (QGIS) is an Open Source Geographic Information System (GIS) that runs on Linux, Unix, Mac OSX, and Windows. QGIS supports vector, raster, and database formats. Some of the major features include support for spatially enabled PostGIS tables, support for shapefiles, ArcInfo coverages, Mapinfo, and other formats supported by OGR, Raster support for a large number of formats, Identify features, Display attribute tables, Select features, GRASS Digitizing and Feature labeling. link none Web
ArcExplorer ArcExplorer is a lightweight GIS data viewer developed by ESRI. This freely available software offers an easy way to perform a variety of basic GIS functions, including display, query, and data retrieval applications. It can be used on its own with local data sets or as a client to Internet data and map servers. However, it cannot be used to create shapefiles. link none Web
Diva-GIS DIVA-GIS is a free mapping program, sometimes called geographic information system (GIS), that can be used for many different purposes. It is particularly useful for mapping and analyzing biodiversity data, such as the distribution of species, or other 'point-distributions'. Shapefiles can be created. link none Desktop
BioGeoMancer A georeferencing service for collectors, curators and users of natural history specimens. link none Web
geoLoc A tool to assist biological collections in georeferencing their data. link none Web
Georeferencing Calculator A Java applet created to aid in the georeferencing of descriptive localities such as those found in museum-based natural history collections. link none Web
spOutlier An automated tool used to detect outliers in latitude, longitude and altitude, and to identify errant on-shore or off-shore records in natural history collections data. link none Web
GRASS GIS Commonly referred to as GRASS, this is a Geographic Information System (GIS) used for geospatial data management and analysis, image processing, graphics/maps production, spatial modeling, and visualization. GRASS is currently used in academic and commercial settings around the world, as well as by many governmental agencies and environmental consulting companies. link none Web
MapServer MapServer is an Open Source development environment for building spatially-enabled internet applications. MapServer is not a full-featured GIS system, nor does it aspire to be. Instead, MapServer excels at rendering spatial data (maps, images, and vector data) for the web. link none Web
openModeller openModeller is a fundamental niche modelling library, providing a uniform method for modelling distribution patterns using a variety of modelling algorithms. link none Web
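
Several of the tools above screen coordinates for suspicious values; spOutlier, for example, is described as detecting outliers in latitude, longitude and altitude. The source does not say which method spOutlier uses, so the following is only a minimal sketch of one common approach, the modified z-score based on the median absolute deviation. The function name, the threshold of 3.5 and the sample latitudes are assumptions made for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    Illustrative only: this is not the algorithm of any tool listed above.
    """
    med = median(values)
    # Median absolute deviation: a spread measure that is robust to outliers.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the values equal the median; the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical latitudes for one species; the last record has a sign error.
latitudes = [51.2, 51.4, 51.3, 51.1, 51.5, -51.3]
print(flag_outliers(latitudes))
```

A flagged record is only a candidate for review: a genuine range extension looks the same to this test as a typing error, so the decision stays with the data custodian.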

Data Validation Tools

Name Description Further information / Links Costs Desktop / Web tool
GBIF Data Tester This extensible Java package is intended to support a number of functions within the GBIF Data Portal. For instance, when new data sets are registered with GBIF and indexed, this software will be used to identify possible errors in them. These issues can then be reported to the custodian of the data for correction in a standard format generated by DataTester. Tests that can be executed include the following: (1) Reporting unrecognized values for data elements (e.g. country names or basis of record values); (2) Checking that coordinates fall within the boundaries of named geographic areas; (3) Finding scientific names that are not known to external lists such as the Catalogue of Life or nomenclators; (4) Checking that scientific names have an appropriate format. link none Web
Data Cleaning (CRIA) An on-line data checking and error identification tool developed by CRIA to help curators of datasets made available via the speciesLink distributed information system to identify possible errors in their databases. Errors include both nomenclatural and geographic. link none Web
TREx TREx is a set of routines providing additional facilities for Excel. These routines are used to QC aspects of benthic survey data. ? Desktop
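
The kinds of test listed for the GBIF Data Tester (unrecognized values, coordinates outside a named area, malformed scientific names) can be sketched in a few lines. This is not the DataTester API: the record field names, the controlled vocabulary, the rough bounding box for Belgium and the simplified name pattern are all assumptions chosen for illustration. The lookup against external name lists (test 3) is left out because it needs a reference list.

```python
import re

# Assumed controlled vocabulary; a real one would come from the data standard in use.
BASIS_OF_RECORD = {"specimen", "observation", "literature"}

# Assumed rough bounding boxes (min_lat, max_lat, min_lon, max_lon), illustration only.
COUNTRY_BOXES = {"Belgium": (49.4, 51.6, 2.5, 6.5)}

# Simplified pattern: capitalised genus followed by one lower-case epithet.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]+(-[a-z]+)?$")


def check_record(record):
    """Return a list of human-readable problems found in one occurrence record."""
    problems = []
    if record["basis_of_record"] not in BASIS_OF_RECORD:
        problems.append("unrecognized basis of record")
    box = COUNTRY_BOXES.get(record["country"])
    if box is None:
        problems.append("unrecognized country")
    else:
        min_lat, max_lat, min_lon, max_lon = box
        inside = (min_lat <= record["latitude"] <= max_lat
                  and min_lon <= record["longitude"] <= max_lon)
        if not inside:
            problems.append("coordinates outside named area")
    if not BINOMIAL.match(record["scientific_name"]):
        problems.append("scientific name has unexpected format")
    return problems


# Hypothetical record: longitude has the wrong sign and the genus is not capitalised.
record = {
    "scientific_name": "mytilus edulis",
    "country": "Belgium",
    "latitude": 51.23,
    "longitude": -2.92,
    "basis_of_record": "observation",
}
print(check_record(record))
```

Returning a list of problems rather than raising on the first one lets a whole data set be screened in one pass and the results sent back to the custodian together.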

Taxonomic Names Tools

Name Description Further information / Links Costs Desktop / Web tool
TaxonGrab TaxonGrab is a tool written in PHP for the purpose of taxonomic name extraction. It accepts URLs, files and free text for checking and returns a list of taxonomic names that are contained in the document. Tool none Web
uBio Tools A suite of tools, described in detail below. Overview can be found at the general tool page of uBio. List of tools none Web
LinkIT LinkIT is a tool that allows you to build automated link outs from names in your own pages, databases, and texts to expert systems. Links are created dynamically and only build links if the referenced system has information. Tool none Web
findIT findIT is available both as a web application and as a SOAP method for embedding name recognition into your own applications. One of findIT's rulesets is based on an analysis of scientific name suffixes. Read the WSDL file. Tool none Web
Author Abbreviation Resolver The Author Abbreviation Resolver is a thesaurus for resolving abbreviations of author names in scientific nomenclature. It is available as a SOAP method. Tool none Web
TNS Name Mapper The TNS Name Mapper superimposes lists of names against multiple checklists and classifications. Tool none Web
ParseIT This tool accepts a complex scientific name and breaks it into its component parts. This tool is useful for identifying different forms of the same name, especially in combination with the author abbreviation service and findIT SOAP Tool none Web
CanonizeIT This function complements the parser by deconstructing a scientific name into its canonical (or simplest) form. Documentation. Tool none Web
CrawLIT CrawLIT can locate all the names within a collection of content and match the results against NameBank and various authority lists. The index of names can be linked to different name concepts to increase access to your content. Tool none Web
CrawLIT V.2 Another version of CrawLIT. The MRIB crawler was built for the USGS. Tool none Web
CrawLIT V.3 Another version reconciles the crawled site to a classification such as ITIS or Species 2000 Tool none Web
Deconstruct a checklist This application parses various standard checklist formats into a single normalized form. Tool none Web
CompareIt CompareIt takes a URL or a list of names as input and compares the taxon names with a current taxonomy such as Species 2000 or ITIS and reports on the current status of the name and other metrics. URL input is first piped to FindIt so a web page with names can be checked against a current taxonomy. Tool none Web
GoldenGATE GoldenGATE is a tool to mark up elements in taxonomic descriptions through XML. The document editor is intended for the creation of new XML content from plain text data, and for inserting additional markup into document-centric XML documents in order to make them more data-centric and machine readable. The idea of the GoldenGATE document editor is to support a user in creating XML markup as far as possible. This comprises automation support for manual editing of XML as well as fully automated creation of markup. The latter includes, for instance, the application of natural language processing (NLP) algorithms. Tool none Desktop
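
Reducing a name to its canonical (simplest) form, as described for CanonizeIT above, can be illustrated with a small heuristic. This is a sketch under stated assumptions, not uBio's method: the function name is hypothetical, and the rule used here simply keeps the genus plus the lower-case epithets that directly follow it, stopping at the first token that looks like authorship, a year or a rank marker. It would therefore misread lower-case author particles such as "de" or "van" as epithets, and it drops infraspecific epithets that follow a marker such as "var.".

```python
import re

# An epithet is assumed to consist of lower-case letters and hyphens only.
EPITHET = re.compile(r"[a-z-]+")


def canonical_form(name):
    """Return the genus plus the directly following lower-case epithets."""
    tokens = name.split()
    if not tokens:
        return ""
    parts = [tokens[0]]  # the first token is taken to be the genus
    for token in tokens[1:]:
        if not EPITHET.fullmatch(token):
            break  # authorship, year, rank marker or bracket: stop here
        parts.append(token)
    return " ".join(parts)


for raw in ("Mytilus edulis Linnaeus, 1758", "Carcinus maenas (Linnaeus, 1758)"):
    print(canonical_form(raw))
```

Comparing canonical forms rather than full name strings makes it easier to recognise that two records citing the same name with differently formatted authorship refer to the same taxon name.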

Species Checklists

Name Taxonomic Coverage Remarks Link
MarineSpecies Biota MarineSpecies.org aims to provide an authoritative list of names of all marine species globally, ever published. It is a contribution to Species2000 and the Catalogue of Life, and will serve as the taxonomic backbone of OBIS. The aim is to provide an authoritative and comprehensive list of names of marine organisms, including information on synonymy. While highest priority goes to valid names, other names in use are included so that this register can serve as a guide to interpret taxonomic literature. link
ERMS Biota The European Register of Marine Species (ERMS) is an authoritative taxonomic list of species occurring in the European marine environment link
Species2000 Biota A comprehensive worldwide catalogue for checking the nomenclatural status, the classification and naming of species link
ITIS Biota Authoritative taxonomic information on plants, animals, fungi, and microbes of mainly North America link
uBio Biota Biological Indexer, lists all variations of taxonomic names, lexical variants, classifications, information about the nomenclatural status of names. link
ZooBank Biota The aim of ZooBank is to provide an online, open-access, register for new animal names and taxonomic acts in zoology link
Fishbase Fish Comprehensive information about marine and freshwater fish, including taxonomy, distribution, ecology and much more link
Hexacorallians of the World Sea anemones A compilation of publications concerning taxonomy, nomenclature, and geographic distribution of extant hexacorallians link
NeMys Nematodes; Mysida; Biota Generic online species information system, acting as a digital platform, storing all kinds of information for biological taxa, focussing on Mysids and Nematodes. link
Clemam Mollusca CLEMAM is a taxonomically oriented database of the marine Mollusca of Europe and adjacent areas link
Cephbase Cephalopoda A database-driven web site on all living cephalopods (octopus, squid, cuttlefish and nautilus) link
AlgaeBase Algae; Seagrasses AlgaeBase is a database of information on algae that includes terrestrial, marine and freshwater organisms. At present, the data for the marine algae, particularly seaweeds, are the most complete link
NCBI database Biota The NCBI taxonomy database contains the names of all organisms that are represented in the NCBI genetic databases with at least one nucleotide or protein sequence link


Title Abstract Link
Chapman AD (2005). Principles and Methods of Data Cleaning – Primary Species and Species-Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. This document examines methods for preventing as well as detecting and cleaning errors in primary biological collections databases. It discusses guidelines, methodologies and tools that can assist the natural history collections community and the observational communities to follow best practice in digitizing, documenting and validating information. But first, it also sets out a set of simple principles that should be followed in any data cleaning exercises. link
Chapman AD (2005). Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. Data quality and errors in data are often neglected issues with environmental databases, modeling systems, GIS, decision support systems, etc. Too often, data are used uncritically without consideration of the errors contained within, and this can lead to erroneous results, misleading information, unwise environmental decisions and increased costs. The rapid increase in the exchange and availability of taxonomic and species-occurrence data has now made the consideration of the principles concerning data quality an important agenda item as users of the data begin to require more and more detail on the quality of this information. This paper expands on these issues and discusses a number of principles of data quality that should become core to the business of the natural history collections and observational communities as they release their data to the broader community. link
Chapman AD (2005). Uses of Primary Species-Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. This paper examines uses for primary species occurrence data in research, education and in other areas of human endeavor, and provides examples from the literature of many of these uses. The paper examines not only data from labels, or from observational notes, but the data inherent in museum and herbarium collections themselves, which are long-term storage receptacles of information and data that are still largely untouched. Projects include the study of the species and their distributions through both time and space, their use for education, both formal and public, for conservation and scientific research, use in medicine and forensic studies, in natural resource management and climate change, in art, history and recreation, and for social and political use. link
Chapman AD, Wieczorek J (eds) (2006). Guide to Best Practices for Georeferencing. Copenhagen: Global Biodiversity Information Facility. This publication is one of the outputs from the BioGeomancer project and discusses Best Practices for Georeferencing biological species (specimen and observational) data. The publication presents examples of how to georeference a wide range of different location types, and provides information and examples on how to determine the extent and maximum uncertainty distance for locations based on the information provided. link
Dalcin EC (2004). Data Quality Concepts and Techniques Applied to Taxonomic Databases. Ph.D. Thesis; Faculty of Medicine, Health and Life Sciences, School of Biological Sciences. The thesis presents a discussion about improving data quality in taxonomic databases, considering conventional Data Cleansing techniques and applying generic data content error patterns to taxonomic data. Techniques of taxonomic error detection are explored, with special attention to scientific name spelling errors. Database quality assessment procedures and metrics are discussed in the context of taxonomic databases and the previously introduced concepts of Taxonomic Data Domains and Taxonomic Data Quality Dimensions. Four questions related to Taxonomic Database Quality are discussed, followed by conclusions and recommendations involving information system design and implementation and the processes involved in taxonomic data management and information flow. link
Maletic JI, Marcus A (2000). Data Cleansing: Beyond Integrity Analysis. In: Proceedings of the Conference on Information Quality (IQ2000), Boston. The paper analyzes the problem of data cleansing and automatically identifying potential errors in data sets. An overview of the diminutive amount of existing literature concerning data cleansing is given. Methods for error detection that go beyond integrity analysis are reviewed and presented. The applicable methods include: statistical outlier detection, pattern matching, clustering, and data mining techniques. Some brief results supporting the use of such methods are given. The future research directions necessary to address the data cleansing problem are discussed. link
Morris PJ (2005). Relational Database Design and Implementation for Biodiversity Informatics. PhyloInformatics; 7:1-66. The complexity of natural history collection information and similar information within the scope of biodiversity informatics poses significant challenges for effective long term stewardship of that information in electronic form. This paper discusses the principles of good relational database design, how to apply those principles in the practical implementation of databases, and examines how good database design is essential for long term stewardship of biodiversity information. Good design and implementation principles are illustrated with examples from the realm of biodiversity information, including an examination of the costs and benefits of different ways of storing hierarchical information in relational databases. This paper also discusses typical problems present in legacy data, how they are characteristic of efforts to handle complex information in simple databases, and methods for handling those data during data migration. link
Stribling JB, Moulton SR, et al. (2003). Determining the quality of taxonomic data. J. N. Am. Benthol. Soc.; 22(4):621–631. In this article, Stribling et al. discuss data quality issues to be considered when conducting taxonomic analyses for biological assessments. They differentiate between 2 broad areas of taxonomy (research and production taxonomic investigations) and consider how approaches to organism identification can vary between these 2 areas. The authors stress the importance of evaluating and communicating data quality, and that knowledge of quality assurance/quality control elements is essential before drawing conclusions from biological assessment results. link
Wieczorek J, Guo Q, et al. (2004). The point-radius method for georeferencing locality descriptions and calculating associated uncertainty. International Journal of Geographical Information Science; 18(8):745–767. Natural history museums store millions of specimens of geological, biological, and cultural entities. Data related to these objects are in increasing demand for investigations of biodiversity and its relationship to the environment and anthropogenic disturbance. A major barrier to the use of these data in GIS is that collecting localities have typically been recorded as textual descriptions, without geographic coordinates. We describe a method for georeferencing locality descriptions that accounts for the idiosyncrasies, sources of uncertainty, and practical maintenance requirements encountered when working with natural history collections. Each locality is described as a circle, with a point to mark the position most closely described by the locality description, and a radius to describe the maximum distance from that point within which the locality is expected to occur. The calculation of the radius takes into account aspects of the precision and specificity of the locality description, as well as the map scale, datum, precision and accuracy of the sources used to determine coordinates. This method minimizes the subjectivity involved in the georeferencing process. The resulting georeferences are consistent, reproducible, and allow for the use of uncertainty in analyses that use these data. link
Pyle RL (2004). Taxonomer: a relational data model for managing information relevant to taxonomic research. PhyloInformatics; 1:1-54. Taxonomic research, as a field of biological sciences, is fundamentally an exercise in information management. Modern computer technology offers the potential for both streamlining the taxonomic process, and increasing its accuracy. Effective use of computer technology to successfully manage taxonomic information is predicated upon the implementation of data models that accommodate the diverse forms of information important to taxonomic researchers. Although sophisticated data models have been developed to manage some information relevant to taxonomic research (e.g., natural history specimen information; descriptive data relating to morphological and molecular characters of specimens), similarly robust models for managing information about taxonomic names and how they are applied to taxonomic concepts, though they exist, have not attained widespread use and adoption. link

Title Description Link
Storing Hierarchical Data in a Database Article on different ways of storing hierarchical data in relational databases. link
ICES/OSPAR/HELCOM STGQAB report Section 7.3 - Review data validation guidelines of the HELCOM COMBINE manual; Annex 5: Proposal for revision of chapter B.4 of the HELCOM COMBINE Manual (B.4.3.1 Data checks applied to individual data points and variables) link
IOC Ocean Teacher Different tools, manuals and tutorials on Marine Data Management link
Digital Taxonomy Very extensive list of software, registers, resources, links etc. related to biodiversity data management. Some links are outdated. link

Compiled by Sarah Faulwetter, 22.01.2007, URLs accessed this date.
Updated 03.04.2007

Web site hosted and maintained by Flanders Marine Institute (VLIZ)
page last modified: 2008-07-03 17:25:10 GMT+1