CRESCYNT Toolbox – Data Cleaning

cleaning_ancient-1807518_1280_pixabay

Data cleaning. Data cleansing. Data preparation. Data wrangling. Data munging.

Garbage In, Garbage Out.

If you’re like most people, your data is self-cleaning, meaning: you clean it yourself! We often hear that 80% of our “data time” is spent in data cleaning to enable 20% in analysis. Wouldn’t it be great to work through data prep faster and keep more of our data time for analysis, exploration, visualization, and next steps?

Here we look over the landscape of tools to consider, then come back to where our feet may be right now to offer specific suggestions for workbook users – lessons learned the hard way over a long time.

The end goal is for our data to be accurate, human-readable, machine-readable, and calculation-ready.

Software for data cleaning:

RapidMiner may be the best free (for academia) non-coding tool available right now. It was built for data mining, which doesn’t have to be your purpose for it to work hard for you. It has a diagram interface that’s very helpful. It almost facilitates a “workflow discovery” process as you incrementally try, tweak, build, and re-use workflow paths that grow during the process of data cleaning. It makes quick work of plotting histograms for each data column to instantly SEE distributions, zeros, outliers, and number of valid entries. It also records and tracks commands (like a baby Jupyter notebook). When pulling in raw datasets, it automatically keeps the originals intact: RapidMiner makes changes only to a copy of the raw data, and then one can export the finished files to use with other software. It’s really helpful in joining data from multiple sources, and pulling subsets for output data files. Rapid Miner Studio: Data Prep.

R is popular in domain sciences and has a number of powerful packages that help with data cleaning. Make use of RStudio as you clean and manipulate data with dplyr and tidyr. New packages are frequently released, such as assertr, janitor, and datamaid. A great thing about R is its active community in supporting learning. Check out this swirl tutorial on Getting and Cleaning Data – or access through DataCamp. The most comprehensive list of courses on R for data cleaning is here via R-bloggers. There’s lovely guidance for data wrangling in R by Hadley Wickham – useful even outside of R.

Data cleaning tool recommendations by KD Nuggets, Quora, and Varonis are a little dated and business-oriented, but these survivors may be worth investigating:

  • Trifacta Wrangler was built for desktop use, and designed for many steps of data wrangling: cleaning and beyond. See intro video, datasheet, demo with Tableau.
  • DataCleaner – community or commercial versions; can use SQL databases. Mostly designed for business applications; videos show what it can do.
  • OpenRefine gets the legacy spotlight (was Google Refine… now community held). Free, open source, and still in use. Here’s a recent walkthrough. Helps fix messy text and categorical data; less useful for other science research data.

There are some great tools to potentially steal borrow that started in data journalism:

  • Tabula is “a tool for liberating data tables trapped inside PDF files” – extracts text-based pdfs (not scans) to data tables.
  • csvkit is “a suite of command-line tools for converting to and working with CSV, the king of tabular file formats.” Helpful for converting Excel to csv cleanly, csv to json, json to csv, working with sql, and more.
  • agate is “a Python data analysis library that is optimized for humans instead of machines…. an alternative to numpy and pandas that solves real-world problems with readable code.” Here’s the cookbook.

Finally, Python itself is clearly a very powerful open source tool available for data cleaning. Look into it with this DataCamp course, pandas and other Python libraries, or this kaggle competition walkthrough.

Manual Data Munging. If you’re using Excel, Open Office, or Google Sheets to clean your data (e.g., small complex datasets common to many kinds of research), you may know all the tricks you need. For those newer to data editing, here are some tips.

  • To start: save a copy of your original file with a new name (e.g., tack on “initials-mod” plus the current date: YYYYMMDD). Then make your original file read-only to protect it. Pretend it’s in an untouchable vault. Use only your modifiable copy.
  • Create a Changes page where you record the edits you make in the order you make them. This also lets you scribble notes for changes you plan to make or items you need to track down but haven’t yet executed (Done and To-Do lists).
  • First edit: if you don’t have a unique ID for each row, add a first column with a simple numeric sequence before doing anything else.
  • Create a copy of that spreadsheet page, leave the original intact, and make modifications only to the newly copied page. If each new page is created on the left, the older pages are allowed to accumulate to the right (less vulnerable to accidental editing). Name each tab usefully.
  • Second edit: if your column headings take up more than one row, consolidate that information to one row. Do not Merge cells. Include units but no special characters or spaces: use only letters, numbers, dashes and underlines.
  • Add a Data Definitions page to record your old column headings, your new column headings, and explain what each column heading means. Include units here and also in column headings where possible.
  • In cells with text entries, do not use bare commas. Either use semicolons and dashes instead of commas in your text, or enclose text entries in quotation marks (otherwise creates havoc exporting to and importing from csv).
  • Add a Comments column, usually at the end of other columns, to record any notes that apply to individual rows or a subset of rows. Hit Save, now and often.
  • Now you’re free to sort each column to find data entry typos (e.g., misplaced decimals), inconsistent formats, or missing values. The danger here is failing to select the entire spreadsheet before sorting – always select the square northwest of cell A1 (or convert the spreadsheet to a table). This is where you’ll be glad you numbered each row at the start: to compare with the original.
  • If there’s a short note like data source credit that MUST accompany the page and must not get sorted, park it in the column header row to the right of the meaningful headers so it won’t get sorted, lost, or confused with actual data.
  • If you use formulas, document the formulas in your Data Definitions page (replace cells with column_names), and copy-paste as value-only as soon as practical.
  • Make sure there is only one kind of data in each column: do not mix numeric and text entries. Instead, create extra columns if needed.
  • Workbooks should be saved each day of editing with that day’s date (as YYYYMMDD) as part of the filename so you can get back to an older copy. At the end of your session clean up your Changes page, moving To-Do to Done and planning next steps.

Find more spreadsheet guidance here (a set of guidelines recently developed for participants in another project – good links to more resources at its end).

Beyond Workbooks. If you can execute and document your data cleaning workflows in a workbook like Excel, Open Office, or Google Sheets, then you can take your data cleaning to the next level. Knowing steps and sequences appropriate for your specific kinds of datasets will help enormously when you want to convert to using tools such as RapidMiner, R, or Python that can help with some automation and much bigger datasets.

Want more depth? Check out Data Preparation Tips, Tricks, and Tools: An Interview with the Insiders  “If you are not good at data preparation, you are NOT a good data scientist…. The validity of any analysis is resting almost completely on the preparation.” – Claudia Perlich

Happy scrubbing! Email or comment with your own favorite tips. Cheers, Ouida Meier

 

>>>Go to NSF EarthCube or the CRESCYNT website or the blog Masterpost.<<<

CRESCYNT Toolbox – Data Cleaning

CRESCYNT Toolbox – Discovery of Online Datasets

cinergi-coralscreendump
Data discovery at cinergi.sdsc.edu

Announcing recent progress for data discovery in support of coral reef research!

Take advantage of this valuable community resource: a data discovery search engine with a special nose for locating coral reef research data sources: cinergi.sdsc.edu.

A major way CRESCYNT has made progress is by serving as a collective coral reef use case for EarthCube groups that are building great new software tools. One of those is a project called CINERGI. It registers resources – especially online repositories and individual online datasets, plus documents and software tools – and then enriches the descriptors to make the resources more searchable. The datasets themselves stay in place: a record of the dataset’s location and description are registered and augmented for better find and filter. Registered datasets and other resources, of course, keep whatever access and use license their authors have given them.

CINERGI already has over a million data sources registered, and over 11,000 of these are specifically coral reef datasets and data repositories. The interface now also features a geoportal to support spatial search options.

The CINERGI search tool is now able to incorporate ANY online resources you wish, so if you don’t find your favorite resources or want to connect your own publications, data, data products, software, code, and other resources, please contribute. If it’s a coral-related resource, be sure to include the word “coral” somewhere in your title or description so it can be retrieved that way later as well. (Great retrieval starts with great metadata!)

To add new resources: Go to cinergi.sdsc.edu, and click on CONTRIBUTE. Fill in ESPECIALLY the first fields – title, description, and URL – then as much of the rest as you can.

Try it out!

Thanks to EarthCube, the CINERGI Data Discovery Hub, and the great crew at the San Diego Supercomputer Center and partners for making this valuable tool possible for coral reef research and other geoscience communities. Here are slides and a video to learn more.

 

>>>Go to NSF EarthCube or the CRESCYNT website or the blog Masterpost.<<<

CRESCYNT Toolbox – Discovery of Online Datasets

CRESCYNT at EarthCube All Hands Meeting 2017

EarthCube domain scientists, computer scientists, data scientists, and new members gathered in Seattle June 7-9, 2017 to communicate progress, connect over projects and science challenges, plan for future collaborative work, and welcome new participants.

Most of the presentations and posters from the meeting are available here. CRESCYNT program manager Ouida Meier delivered an invited talk on sci-tech matchmaking (video|slides, helped facilitate breakout sessions focused on clarifying requirements and resources for virtual workbenches (summary), and presented CRESCYNT coral reef use cases and workflow collaboration during a poster session. Discussion and collective brainstorming throughout the meeting was very dynamic and fruitful.

CRESCYNTposter_EC-AHM_Seattle_2017June

Download a larger pdf of the CRESCYNT poster – Earth Cube AHM 2017.

Read more EarthCube in the News.

 

>>>Go to NSF EarthCube or the CRESCYNT website or the blog Masterpost.<<<

CRESCYNT at EarthCube All Hands Meeting 2017

Resources for Coral Reef Education – by Judy Lemus

divers_lpittman_pixabay.jpgWe all recognize that communication and education about science concepts and the process of science is more important than ever.  Fortunately, coral reefs are charismatic ecosystems that inspire much curiosity, concern, and interest from many sectors of society.  While there is no shortage of stunning images and videos online, resources that combine these visuals with robust educational content can be more challenging to identify; they do exist and I’ve put together some of my favorites here. The list is not exhaustive, and we welcome your suggestions for great additions.

EDUCATIONAL WEBSITES. These resources provide educational information about coral reefs across multiple levels and concepts, often using multimedia.

Khaled bin Sultan Living Oceans Foundation Coral Reef Ecology Curriculum. The KSLOF has perhaps the most comprehensive website on coral reef ecology. The site is set up as a course with several units and resources with very nice graphics and high quality videos geared specifically for students and teachers. Lessons are aligned with the Next Generation Science Standards, Ocean Literacy Principles, and Common Core State Standards for K-12, but some of the material could easily be used in a college level course. A major downside to this site is that one must register to use it.

Smithsonian Ocean Portal. The Smithsonian’s website for coral and coral reefs is not as media-rich as the KSLOF, but does have a great deal of scientific information about corals.  Only a couple of lesson plans are offered, but the richness of the content lies in the embedded links to additional images and other stories. The science is backed up with oversight by Smithsonian coral reef biologist Nancy Knowlton.

MarineBio Coral Reefs. The MarineBio website is somewhat of a clearinghouse for other marine bio resources, but the educational content on coral reefs is good quality and quite extensive if you follow the links.  Like the Smithsonian site, there are links to both internal and external resources. The short videos featured throughout the site, generally from outside sources, are particularly engaging.

OTHER WEBSITES WITH EXTENSIVE INFORMATION ABOUT CORAL REEFS

National Ocean Service

NOAA Coral Reef Conservation Program

USGS Coral Reef Project

Coral Reef Alliance

Teach Ocean Science

ReefBase

Great Barrier Reef Foundation

Coral Triangle Initiative

Endangered Reefs, Threatened People

Coral Health Atlas

VIDEOS ABOUT CORALS AND CORAL REEFS. There are loads of videos of corals and coral reefs on the web; these excellent examples incorporate educational content.

Catlin Seaview

Chasing Coral (available through Netflix)

Climate Change: Coral Reefs on the Edge

Exploring the Coral Reef: Learn about Oceans for Kids

Corals Under Confocal

Coral bleaching caused by heating water (time-lapse)

Life Noggin – What Happens if All the Coral Dies? (animation)

Coral Bleaching Animation – HHMI BioInteractive Video (animation)

Coral Bleaching on the Great Barrier Reef (animation)

SCIENCE NEWS SITES. These science news websites regularly post stories on coral reefs.

ScienceDaily

LiveScience


Thanks to Dr. Judy Lemus for this cream-of-the-crop list. Judy is a Faculty Specialist in Science Education at the Hawaii Institute of Marine Biology; fortunately for us, she is also the Education Node Leader for CRESCYNT. You can download Judy’s list in pdf format.

>>>Go to NSF EarthCube or the CRESCYNT website or the blog Masterpost.<<<

Resources for Coral Reef Education – by Judy Lemus

CRESCYNT Toolbox – EarthCube Sci-Tech Matchup – Lightning Talks

Get a fast intro to new Ready-for-Science EarthCube Tools!

We’ve helped arrange a series of lightning talks that will feature tools developed by EarthCube “building block” projects for direct use by scientists. Many EarthCube-built tools are designed to serve as internal components of an EarthCube platform. Other tools were built for scientists as direct end users, and a collection of these are now ripe for adoption. Will some of these help you with your research work?
The current collection will be shown over a span of two online sessions. Find log-in details for Wed., Feb. 15 and Fri., Feb. 17 (no RSVP required – just show up).
Join us!

Wednesday, Feb. 15, 2017
4-5pm EST / 1-2pm PST / 11am-12pm HST – login link HERE
GeoDataspace/GeoTrust, Tanu Malik
ECOGEO Virtual Machine, Elisha Wood-Charlson
LinkedEarth, Julien Emile-Geay
OntoSoft, Yolanda Gil
Flyover Country, Amy Myrbo

Friday, Feb. 17, 2017
4-5:30pm EST / 1-2:30pm PST / 11am-12:30pm HST – login link HERE
CHORDS, Mike Daniels
SuAVE, Ilya Zaslavsky
CINERGI, Ilya Zaslavsky
X-DOMES Ontology Registry, Janet Fredericks
X-DOMES SensorML Registry, Janet Fredericks
iSamplesIGSN, Kerstin Lehnert
GeoDeepDive, Shanan Peters
Digital Crust, Shanan Peters
ECITE, Sara Graves
Earth System Bridge, Scott Peckham

UPDATE: All of the videos from the first rounds of talks are now on the EarthCube YouTube channel – here’s the playlist. Slides are now accessible at the EarthCube Tools Inventory, including additional presentations.

CRESCYNT Toolbox – EarthCube Sci-Tech Matchup – Lightning Talks

CRESCYNT Toolbox – R Resources for Graphing and Visualization

In a previous post we offered some solid supportive resources for learning R – a healthy  dinner with lots of great vegetables. Here we offer a dessert cart of rich resources for data visualization and graphing. It’s a powerful motivation for using R.
rgraphgallery

First up is The New R Graph Gallery – extensive, useful, and actually new. “It contains more than 200 data visualizations categorized by type, along with the R code that created them. You can browse the gallery by types of chart (boxplots, maps, histograms, interactive charts, 3-D charts, etc), or search the chart descriptions. Once you’ve found a chart you like, you can admire it in the gallery (and interact with it, if possible), and also find the R code which you can adapt for your own use. Some entries even include mini-tutorials describing how the chart was made.” (Description by Revolutions.)

Sometimes we want (or need) plain vanilla – something clean and elegant rather than extravagant. Check out A Compendium of Clean Graphs in R, including code. Many examples are especially well-suited for the spartan challenge of conveying information in grayscale. The R Graph Catalog is a similar resource.

If you’re just getting started with R, take a look at the Painless Data Visualization section (p. 17 onward) in this downloadable Beginner’s Guide.

In R, ggplot2,  based on the Grammar of Graphics, is perhaps the single most popular R package for data visualization. The R Cookbook‘s section on Graphs using ggplot2 is a helpful precursor to the R Graphics Cookbook. DataCamp’s DataVis with ggplot2 has a free segment of intro lessons.

For more on visualization and other capabilities, check out this recommended list of useful R packages in the R Studio support blog – succinct and terrific.

If you’re already skilled in R and want a new challenge, an indirect method of harnessing some of the power of D3.js for interactive web visualizations is available through plotly for R. Here’s getting started with plotly and ggplot2, plotly and Shiny, and a gallery. The resources offer code and in some cases the chance to open a visualization and modify its data.

Have a favorite resource? Please share as a comment, or email us!

CRESCYNT Toolbox – R Resources for Graphing and Visualization