Attendance is free and open to the public, online or in person.
Written while embedded in our CRESCYNT Data Science for Coral Reefs workshops. Amazingly, everyone who participated in workshop 1 – Data Science for Coral Reefs: Data Rescue – learned even more than they thought they would. We’ve had wonderful NCEAS trainers, spectacular participants with amazing datasets, and a lot of hard work over 4 days (March 7-10, 2018).
UPDATE: Here is the Data Rescue workshop agenda we used, with links to all of the training slides.
In the second intensive workshop – Data Science for Coral Reefs: Data Integration and Team Science – people will be introduced to RStudio and GitHub if they have not used them before, and then we will work on exploring techniques for integrating disparate datasets. We’ll start with a pair of datasets at a time, and efforts may involve extracting data from one dataset based on observations from another, or upscaling, downscaling, resampling, or summarizing to make intervals and scales mesh – exactly the kind of process that coral reef researchers have said is a recurring challenge in asking bigger science questions.
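To make "summarizing to make intervals and scales mesh" concrete, here is a minimal sketch in Python using only the standard library. It is not taken from the workshop materials: the logger readings, survey rows, field names, and function names are all hypothetical. It averages sub-daily temperature readings to one value per day, then attaches that daily value to survey observations made on the same date.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical logger readings: (ISO timestamp, temperature in deg C).
logger_readings = [
    ("2018-03-07T00:00", 26.0),
    ("2018-03-07T12:00", 28.0),
    ("2018-03-08T00:00", 27.0),
    ("2018-03-08T12:00", 29.0),
]

# Hypothetical survey observations: (date, site, percent coral cover).
surveys = [
    ("2018-03-07", "site_A", 41.5),
    ("2018-03-08", "site_A", 40.0),
    ("2018-03-09", "site_A", 39.0),  # no logger data exists for this day
]


def daily_means(readings):
    """Summarize sub-daily readings to one mean value per calendar day."""
    by_day = defaultdict(list)
    for stamp, value in readings:
        day = datetime.fromisoformat(stamp).date().isoformat()
        by_day[day].append(value)
    return {day: mean(values) for day, values in by_day.items()}


def attach_daily_temperature(survey_rows, readings):
    """Look up the daily mean temperature for each survey date.

    Surveys with no matching logger day get None rather than a guess.
    """
    temps = daily_means(readings)
    return [
        {"date": d, "site": s, "cover_pct": c, "mean_temp_c": temps.get(d)}
        for d, s, c in survey_rows
    ]


merged = attach_daily_temperature(surveys, logger_readings)
for row in merged:
    print(row)
```

Leaving unmatched days as None keeps the gap visible, so the decision about whether to interpolate or drop those rows is made deliberately later.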
UPDATE: Here is the Data Integration and Team Science workshop agenda we used, with links to all of those training slides and exercises.
Each workshop group is writing a paper to summarize and share lessons learned, so please stay tuned for those!
We experimented with an unusual process for these workshops: two days of training followed by two days of workathon. We’re liking it! Tell us what you think about these topics and training materials. What other workshop outputs would you like to see?
We’re extremely pleased to be able to offer two workshops in March 2018 at NCEAS. The first is CRESCYNT Data Science for Coral Reefs Workshop 1: Data Rescue. Apply here.
When: March 7-10, 2018
Where: NCEAS, Santa Barbara, California, USA
Recommended for senior scientists with rich “dark” data on coral reefs that needs to be harvested and made accessible in an open repository. Students or staff working with senior scientists are also encouraged to apply. Days 1 and 2 of the workshop will cover the basic principles of data archiving and data repositories, including Darwin Core and EML metadata formats, how to write good metadata, how to archive data on the KNB data repository and elsewhere, data preservation workflow and best practices, and how to improve data discoverability and reusability. Participants will then spend approximately 2 days working in pairs to archive their own data using these principles, so applying with a team member from your research group is highly recommended.
The workshop is limited to 20 participants. We encourage you to apply via this form. Workshop costs will be covered with support from NSF EarthCube – CRESCYNT RCN. Participants will publish data during the workshop process, and we anticipate widely sharing workshop outcomes, including workflows and recommendations. Because coral reef science embodies a wide range of data types (spreadsheets, images, videos, field notes, large ‘omics text files, etc.), anticipate some significant pre-workshop prep effort.
Related post: CRESCYNT Toolbox – Estate Planning for Your Data
Data cleaning. Data cleansing. Data preparation. Data wrangling. Data munging.
Garbage In, Garbage Out.
If you’re like most people, your data is self-cleaning, meaning: you clean it yourself! We often hear that 80% of our “data time” is spent in data cleaning to enable 20% in analysis. Wouldn’t it be great to work through data prep faster and keep more of our data time for analysis, exploration, visualization, and next steps?
Here we look over the landscape of tools to consider, then come back to where our feet may be right now to offer specific suggestions for workbook users – lessons learned the hard way over a long time.
The end goal is for our data to be accurate, human-readable, machine-readable, and calculation-ready.
Software for data cleaning:
RapidMiner may be the best free (for academia) non-coding tool available right now. It was built for data mining, which doesn’t have to be your purpose for it to work hard for you. It has a diagram interface that’s very helpful. It almost facilitates a “workflow discovery” process as you incrementally try, tweak, build, and re-use workflow paths that grow during the process of data cleaning. It makes quick work of plotting histograms for each data column to instantly SEE distributions, zeros, outliers, and number of valid entries. It also records and tracks commands (like a baby Jupyter notebook). When pulling in raw datasets, it automatically keeps the originals intact: RapidMiner makes changes only to a copy of the raw data, and then one can export the finished files to use with other software. It’s really helpful in joining data from multiple sources, and pulling subsets for output data files. Rapid Miner Studio: Data Prep.
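The per-column checks described above (valid entries, zeros, outliers) can also be approximated in a few lines of code. The sketch below uses plain Python and is not how RapidMiner works internally. The function name, the sample depth column, and the choice of the common 1.5 × IQR rule for flagging outliers are all assumptions made for illustration.

```python
import math
from statistics import quantiles


def profile_column(raw_values):
    """Count valid, missing, and zero entries, and flag possible outliers."""
    numbers = []
    missing = 0
    for raw in raw_values:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            missing += 1  # blanks, "NA", and other non-numeric entries
            continue
        if math.isnan(value):
            missing += 1  # a literal "nan" parses as a float but is not data
        else:
            numbers.append(value)

    outliers = []
    if len(numbers) >= 2:  # quantiles() needs at least two data points
        q1, _, q3 = quantiles(numbers, n=4)
        fence = 1.5 * (q3 - q1)  # assumed rule: 1.5 x interquartile range
        outliers = [x for x in numbers if x < q1 - fence or x > q3 + fence]

    return {
        "valid": len(numbers),
        "missing": missing,
        "zeros": sum(1 for x in numbers if x == 0),
        "outliers": outliers,
    }


# Hypothetical depth column (meters) as read from a spreadsheet export.
depth_m = ["3.2", "3.5", "0", "", "NA", "3.1", "3.4", "35.0", "3.3"]
summary = profile_column(depth_m)
print(summary)
```

A flagged value is only a prompt to go back and look: here the 35.0 might be a misplaced decimal point and the 0 might be a placeholder for "not measured", but only the field notes can say.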
R is popular in domain sciences and has a number of powerful packages that help with data cleaning. Make use of RStudio as you clean and manipulate data with dplyr and tidyr. New packages are frequently released, such as assertr, janitor, and dataMaid. A great thing about R is its active community in supporting learning. Check out this swirl tutorial on Getting and Cleaning Data – or access through DataCamp. The most comprehensive list of courses on R for data cleaning is here via R-bloggers. There’s lovely guidance for data wrangling in R by Hadley Wickham – useful even outside of R.
There are some great tools to potentially borrow that started in data journalism:
Finally, Python itself is clearly a very powerful open source tool available for data cleaning. Look into it with this DataCamp course, pandas and other Python libraries, or this kaggle competition walkthrough.
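As a small taste of what scripted cleaning looks like, here is a sketch that uses only Python's standard library (pandas offers much more, but this runs anywhere). The sample data, the list of missing-value codes, and the helper names are hypothetical. Title-casing the text cells is an assumption that suits site names and may not suit other text columns.

```python
import csv
import io

# Hypothetical messy export: stray spaces, inconsistent case, mixed missing codes.
RAW = """Site , Depth (m),Coral Cover %
 Reef A ,3.2, 41.5
reef a,NA,40
Reef B,5.0,
"""

# Assumed set of codes that mean "no value" in this dataset.
MISSING_CODES = {"", "na", "n/a", "null", "-999"}


def clean_header(name):
    """Make a header machine-friendly, e.g. ' Depth (m)' -> 'depth_m'."""
    kept = "".join(ch.lower() if ch.isalnum() else " " for ch in name)
    return "_".join(kept.split())


def clean_cell(text):
    """Trim whitespace, map missing codes to None, convert numbers to float."""
    text = text.strip()
    if text.lower() in MISSING_CODES:
        return None
    try:
        return float(text)
    except ValueError:
        # Not numeric: collapse inner spaces and standardize capitalization.
        return " ".join(text.split()).title()


def clean_csv(raw_text):
    """Return the cleaned rows as a list of dicts keyed by cleaned header."""
    reader = csv.reader(io.StringIO(raw_text))
    header = [clean_header(h) for h in next(reader)]
    return [
        dict(zip(header, (clean_cell(c) for c in row)))
        for row in reader
        if row  # skip completely blank lines
    ]


rows = clean_csv(RAW)
for row in rows:
    print(row)
```

Because the raw text is never modified, the script can be rerun from the original file whenever a rule changes, which keeps the cleaning steps documented and repeatable.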
Manual Data Munging. If you’re using Excel, Open Office, or Google Sheets to clean your data (e.g., small complex datasets common to many kinds of research), you may know all the tricks you need. For those newer to data editing, here are some tips.
Find more spreadsheet guidance here (a set of guidelines recently developed for participants in another project – good links to more resources at its end).
Beyond Workbooks. If you can execute and document your data cleaning workflows in a workbook like Excel, Open Office, or Google Sheets, then you can take your data cleaning to the next level. Knowing steps and sequences appropriate for your specific kinds of datasets will help enormously when you want to convert to using tools such as RapidMiner, R, or Python that can help with some automation and much bigger datasets.
Want more depth? Check out Data Preparation Tips, Tricks, and Tools: An Interview with the Insiders. “If you are not good at data preparation, you are NOT a good data scientist…. The validity of any analysis is resting almost completely on the preparation.” – Claudia Perlich
Happy scrubbing! Email or comment with your own favorite tips. Cheers, Ouida Meier
Announcing recent progress for data discovery in support of coral reef research!
Take advantage of this valuable community resource: a data discovery search engine with a special nose for locating coral reef research data sources: cinergi.sdsc.edu.
A major way CRESCYNT has made progress is by serving as a collective coral reef use case for EarthCube groups that are building great new software tools. One of those is a project called CINERGI. It registers resources – especially online repositories and individual online datasets, plus documents and software tools – and then enriches the descriptors to make the resources more searchable. The datasets themselves stay in place: a record of the dataset’s location and description is registered and augmented for better finding and filtering. Registered datasets and other resources, of course, keep whatever access and use license their authors have given them.
CINERGI already has over a million data sources registered, and over 11,000 of these are specifically coral reef datasets and data repositories. The interface now also features a geoportal to support spatial search options.
The CINERGI search tool is now able to incorporate ANY online resources you wish, so if you don’t find your favorite resources or want to connect your own publications, data, data products, software, code, and other resources, please contribute. If it’s a coral-related resource, be sure to include the word “coral” somewhere in your title or description so it can be retrieved that way later as well. (Great retrieval starts with great metadata!)
Thanks to EarthCube, the CINERGI Data Discovery Hub, and the great crew at the San Diego Supercomputer Center and partners for making this valuable tool possible for coral reef research and other geoscience communities. Here are slides and a video to learn more.
EarthCube domain scientists, computer scientists, data scientists, and new members gathered in Seattle June 7-9, 2017 to communicate progress, connect over projects and science challenges, plan for future collaborative work, and welcome new participants.
Most of the presentations and posters from the meeting are available here. CRESCYNT program manager Ouida Meier delivered an invited talk on sci-tech matchmaking (video | slides), helped facilitate breakout sessions focused on clarifying requirements and resources for virtual workbenches (summary), and presented CRESCYNT coral reef use cases and workflow collaboration during a poster session. Discussion and collective brainstorming throughout the meeting were very dynamic and fruitful.
Download a larger pdf of the CRESCYNT poster – EarthCube AHM 2017.
Read more EarthCube in the News.
We all recognize that communication and education about science concepts and the process of science are more important than ever. Fortunately, coral reefs are charismatic ecosystems that inspire much curiosity, concern, and interest from many sectors of society. While there is no shortage of stunning images and videos online, resources that combine these visuals with robust educational content can be more challenging to identify. They do exist, and I’ve put together some of my favorites here. The list is not exhaustive, and we welcome your suggestions for great additions.
EDUCATIONAL WEBSITES. These resources provide educational information about coral reefs across multiple levels and concepts, often using multimedia.
Khaled bin Sultan Living Oceans Foundation Coral Reef Ecology Curriculum. The KSLOF has perhaps the most comprehensive website on coral reef ecology. The site is set up as a course with several units and resources with very nice graphics and high quality videos geared specifically for students and teachers. Lessons are aligned with the Next Generation Science Standards, Ocean Literacy Principles, and Common Core State Standards for K-12, but some of the material could easily be used in a college level course. A major downside to this site is that one must register to use it.
Smithsonian Ocean Portal. The Smithsonian’s website for coral and coral reefs is not as media-rich as the KSLOF, but does have a great deal of scientific information about corals. Only a couple of lesson plans are offered, but the richness of the content lies in the embedded links to additional images and other stories. The science is backed up with oversight by Smithsonian coral reef biologist Nancy Knowlton.
MarineBio Coral Reefs. The MarineBio website is somewhat of a clearinghouse for other marine bio resources, but the educational content on coral reefs is good quality and quite extensive if you follow the links. Like the Smithsonian site, there are links to both internal and external resources. The short videos featured throughout the site, generally from outside sources, are particularly engaging.
OTHER WEBSITES WITH EXTENSIVE INFORMATION ABOUT CORAL REEFS
VIDEOS ABOUT CORALS AND CORAL REEFS. There are loads of videos of corals and coral reefs on the web; these excellent examples incorporate educational content.
Chasing Coral (available through Netflix)
Coral bleaching caused by heating water (time-lapse)
Coral Bleaching on the Great Barrier Reef (animation)
SCIENCE NEWS SITES. These science news websites regularly post stories on coral reefs.
Thanks to Dr. Judy Lemus for this cream-of-the-crop list. Judy is a Faculty Specialist in Science Education at the Hawaii Institute of Marine Biology; fortunately for us, she is also the Education Node Leader for CRESCYNT. You can download Judy’s list in pdf format.