Scientists need better ways to analyze and integrate their data and collaborate with other scientists; new computing technologies and tools can help with this. However, it’s difficult to overcome the challenge of disparate perspectives and the absence of a common vocabulary: this is true of multidisciplinary science teams, and true when scientists try to talk with computer scientists. Workflows, as a way to help design and implement a workbench, are needed both as a collaboration space and a blueprint for implementation.
Take a look at a recent presentation to the EarthCube science committee (video) or an earlier presentation offered at ASLO 2017 (slides and voice) to see a flexible and low-tech way to simultaneously (1) facilitate necessary sci-tech interactions for your own lab and (2) begin to sketch out a blueprint for work that needs to be done. Subsequent technical implementation is possible with new tools including Common Workflow Language (CWL) as a set of specifications, Dockers as modular and sharable containers for either fully developed tools or small pieces of code, and Nextflow as an efficient and highly scalable definitive software language to make the computational work happen. Look for a post in the near future by Mahdi Belcaid to describe the technical implementation of these workflows.
OPPORTUNITY! We will be hosting one or two in-person skills training workshops in the coming months, with your expenses covered by our NSF EarthCube CRESCYNT grant, focused particularly on training early career professionals, and will work through some challenging coral reef use cases and their cyberinfrastructure needs. We collected some great use cases at ICRS, but would like additional cases to consider, so we invite you to describe your own research challenges through this google form. Please contact us for more on this, or other issues. Thanks!
If you’re attending ASLO (Association for Sciences in Limnology and Oceanography) in Hawaii or will be in Honolulu on Feb. 26th and care about better ways to collaborate, solve data and workflow challenges, or take the next steps in the relentless digital revolution, join us in person!
We’re excited to be able to offer this workshop as a real-time webinar – please participate remotely if you can! We have an amazing lineup of presenters and workshop guides. Don’t miss out!
Funding gratefully acknowledged from NSF EarthCube CRESCYNT Coral Reef Science and Cyberinfrastructure Network, Ruth D. Gates, PI (crescynt.org)
Get a fast intro to new Ready-for-Science EarthCube Tools!
We’ve helped arrange a series of lightning talks that will feature tools developed by EarthCube “building block” projects for direct use by scientists. Many EarthCube-built tools are designed to serve as internal components of an EarthCube platform. Other tools were built for scientists as direct end users, and a collection of these are now ripe for adoption. Will some of these help you with your research work?
The current collection will be shown over a span of two online sessions. Find log-in details for Wed., Feb. 15 and Fri., Feb. 17 (no RSVP required – just show up).
Having dedicated my PhD to automating the annotation of coral reef survey images, I have seen my fair share of surveys and talked to my fair share of coral ecologists. In these conversations, I always heard the same story: collecting survey images is quick, fun and exciting. Annotating them is, on the other hand, slow, boring, and excruciating.
When I started CoralNet (coralnet.ucsd.edu) back in 2012 the main goal was to make the manual annotation work less tedious by deploying automated annotators alongside human experts. These automated annotators were trained on previously annotated data using what was then the state-of-the-art in computer vision and machine learning. Experiments indicated that around 50% of the annotation work could be done automatically without sacrificing the quality of the ecological indicators (Beijbom et al. PLoS ONE 2015).
The Alpha version of CoralNet was thus created and started gaining popularity across the community. I think this was partly due to the promise of reduced annotation burden, but also because it offered a convenient online system for keeping track of and managing the annotation work. By the time we started working on the Beta release this summer, the Alpha site had over 300,000 images with over 5 million point annotations – all provided by the global coral community.
There was, however, a second purpose of creating CoralNet Alpha. Even back in 2012 the machine learning methods of the day were data-hungry. Basically, the more data you have, the better the algorithms will perform. Therefore, the second purpose of creating CoralNet was quite simply to let the data come to me rather than me chasing people down to get my hands on their data.
At the same time the CoralNet Alpha site was starting to buckle under increased usage. Long queues started to build up in the computer vision backend as power-users such as NOAA CREP and Catlin Seaview Survey uploaded tens of thousands of images to the site for analysis assistance. Time was ripe for an update.
As it turned out the timing was fortunate. A revolution has happened in the last few years, with the development of so-called deep convolutional neural networks. These immensely powerful, and large nets are capable of learning from vast databases to achieve vastly superior performance compared to methods from the previous generation.
During my postdoc at UC Berkeley last year, I researched ways to adapt this new technology to the coral reef image annotation task in the development of CoralNet Beta. Leaning on the vast database accumulated in CoralNet Alpha, I tuned a net with 14 hidden layers and 150 million parameters to recognize over 1,000 types of coral substrates. The results, which are in preparation for publication, indicate that the annotation work can be automated to between 80% and 100% depending on the survey. Remarkably: in some situations, the classifier is more consistent with the human annotators than those annotators are with themselves. Indeed, we show that the combination of confident machine predictions with human annotations beat both the human and the machine alone!
Using funding from NOAA CREP and CRCP, I worked together with UCSD alumnus Stephen Chan to develop CoralNet Beta: a major update which includes migration of all hardware to Amazon Web Services, and a brand new, highly parallelizable, computer vision backend. Using the new computer vision backend the 350,000 images on the site were re-annotated in one week! Software updates include improved search, import, export and visualization tools.
With the new release in place we are happy to welcome new users to the site; the more data the merrier!
– Many thanks to Oscar Beijbom for this guest posting as well as significant technological contributions to the analysis and understanding of coral reefs. You can find Dr. Beijbom on GitHub, or see more of his projects and publications here. You can also find a series of video tutorials on using CoralNet (featuring the original Alpha interface) on CoralNet’s vimeo channel, and technical details about the new Beta version in the release notes.
In a previous post we offered some solid supportive resources for learning R – a healthy dinner with lots of great vegetables. Here we offer a dessert cart of rich resources for data visualization and graphing. It’s a powerful motivation for using R.
First up is The New R Graph Gallery – extensive, useful, and actually new. “It contains more than 200 data visualizations categorized by type, along with the R code that created them. You can browse the gallery by types of chart (boxplots, maps, histograms, interactive charts, 3-D charts, etc), or search the chart descriptions. Once you’ve found a chart you like, you can admire it in the gallery (and interact with it, if possible), and also find the R code which you can adapt for your own use. Some entries even include mini-tutorials describing how the chart was made.” (Description by Revolutions.)
Sometimes we want (or need) plain vanilla – something clean and elegant rather than extravagant. Check out A Compendium of Clean Graphs in R, including code. Many examples are especially well-suited for the spartan challenge of conveying information in grayscale. The R Graph Catalog is a similar resource.
If you’re just getting started with R, take a look at the Painless Data Visualization section (p. 17 onward) in this downloadable Beginner’s Guide.
If you’re already skilled in R and want a new challenge, an indirect method of harnessing some of the power of D3.js for interactive web visualizations is available through plotly for R. Here’s getting started with plotly and ggplot2, plotly and Shiny, and a gallery. The resources offer code and in some cases the chance to open a visualization and modify its data.
Science is a team sport. Collaborations allow us to ask more ambitious science questions, but also intensify the need to connect disparate datasets across scales of time and space. Solving data interoperability challenges requires technological solutions not yet in place, so we’re taking the initiative to review potential solutions.
A platform from EarthCube is some time and distance away, but we have a chance to start assembling tools already at hand and in use for coral reef research workflows and do some testing. The process also helps us ground the ideal in the practical.
What are some criteria for a great infrastructure platform?
Ideally, solutions are: 1) modular, so when an improved tool is available it can be incorporated without restructuring the system; 2) free or low cost, so solutions are sustainable for most research labs; and 3) open source, allowing continued development from multiple disciplines and directions. However, we also want to start where people are, with the tools we’re already using – many of these are less than ideal but we make them work. That’s our starting place, and we want to hear about all of your tools.
It is tempting to set up a workbench for the challenge of analysis alone, but in a coral reef research lab we immediately crash into the realities of group data collection, field and lab work, physical specimens, and intersecting projects. All of these characteristics create additional layers of challenge. In the long run, infrastructure should help capture data and metadata generation at the source, and ease tracking, analysis, and replicability.
Good infrastructure solves more problems than it creates in compliance, skill demand, and management. An effective system helps graduate students and postdocs develop robust skills in managing data into the future, with guidelines that work for people, labs, and collaborators. Interoperability challenges must be solved for datasets that range from remote sensing to ecological surveys to bioinformatics work. Data cleaning, analysis, visualization, and mapping must be supported in flexible ways to clearly communicate research insights.
We have an opportunity to construct a preliminary array of tools and workflows on a cyberinfrastructure workbench for coral reef research, so if this is going to be a broadly useful start we need to know what you already like using and what more you wish you had. Dream big to start with, and along the way we’ll acknowledge the distinction between perfect and good-enough solutions.
We are driven to learn like sharks: constantly take in new flows, or die. In a recent workshop, when coral reef scientists were asked: “How many of you use R?” 60% raised a hand. To: “How many of you are comfortable with and love using R?” only about 15% kept a hand up.
Here’s where to go to learn to love R more.
You likely already know of the R Project, free and open source software for statistical computing and graphics. You may already know of the reliability of the Comprehensive R Archive Network or CRAN repository, favored by many over other potential sources of community-generated code because of their metadata and testing requirements; it now hosts over 9,300 packages (sorted by date and name).
You may not know of the new R course finder, an online directory you can search and filter to find the best online R course for your next step (note there are often free versions or segments of even the pay courses listed). There are YouTube videos for R learning, like twotorials (two-minute tutorials) and YaRrr! (because pirates) with book.
A very recent new book is getting rave reviews from both statistics and programming viewpoints: The Book of R by Tilman Davies (preview it here). The author writes:
“The Book of R …represents the introduction to the language that I wish I’d had when I began exploring R, combined with the first-year fundamentals of statistics as a discipline, implemented in R…. Try not to be afraid of R. It will do exactly what you tell it to – nothing more, nothing less. When something doesn’t work as expected or an error occurs, this literal behavior works in your favor…. Especially in your early stages of learning…try to use R for everything, even for very simple tasks or calculations you might usually do elsewhere. This will force your mind to switch to ‘R mode’ more often, and it’ll get you comfortable with the environment quickly.”
We’ll soon host a guest blogpost on some exploratory coral symbiont data analyses, visualizations, and comments generated in R Markdown, which is RStudio’s method for preserving code and output in one running web document. The work is beautiful and useful, and highlights the use of an electronic notebook as a way to capture and share data exploration, analysis and visualization, and to tell a data story. (A major advance to that software was announced this week in the form of R Notebook, which will ship within the next couple of months.)
Why is it worth learning to love R more?
R helps make sure your data work is reproducible (such an issue for science), repeatable (valuable for any processing you have to do periodically), and reusable (on other datasets or data versions, or by colleagues or your future self).
A couple of high-level languages, like R and Python, are becoming more popular each year, and are finding their way as general purpose tools into analytical platforms. These will serve as primary sources of flexibility in cyberinfrastructure platforms now available or under development. Our future selves thank us for the learning investment.
I would suggest that they get a copy of the R for Data Science book written by Hadley Wickham and Garrett Grolemund…. Also, when you have questions or run into problems don’t give up. There’s a lot of great activity around R on stackoverflow and other places and there’s an excellent chance you’re going to find the answers to your questions if you look carefully for them.
Further Update: In January 2018, Kaggle released resources for Hands-On Data Science learning, including lessons for R in data setup, data visualization, and machine learning.