Computational biology and data science/machine learning are the vanguard of digital transformation for early discovery biotech. They enable teams to analyze and interpret data whose volume and complexity would otherwise put it beyond the reach of most biologists.
A number of factors make these subjects fundamentally difficult: Low signal-to-noise ratio in biological data. The subtleties of a subject with more exceptions than rules. The rapidly evolving landscape of new instruments, assays, and techniques.
But the biggest headache, in practice, is something much simpler: The communications overhead of adding a second person to a previously one-person process.
Before the advent of large-scale genetic sequencing and digital microscopy, individual biologists would typically handle every step in their experiments, from generating samples to analyzing readouts. So when it came time to do the analysis, they not only had the full context needed to interpret the data, they also had the data organized exactly how they wanted.
Today’s computational biologists and data scientists, on the other hand, have to get up to speed mid-project with enough context to interpret the data they’re handed. And because most biologists are trained for that one-person loop, they often don’t structure data in a way that makes collaboration easy.
Understanding what this context looks like starts with distinguishing two kinds of data that come out of a biology lab.
The first is the readout data produced by instruments that take readings from a sample, plate, etc. Digital microscope images. Sequence fragments from a sequencer. The output of a plate reader. This is what most people think of when they talk about experiment data.
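To make that concrete, a plate reader’s export often reduces to little more than well indices and numbers. Here’s a minimal sketch in Python with pandas (the column names and values are hypothetical, just to show the shape of the data):

```python
import pandas as pd

# Hypothetical plate-reader export: one absorbance value per well.
# Note how little the readout alone tells you about the experiment.
readout = pd.DataFrame({
    "well":       ["A1", "A2", "A3", "B1", "B2", "B3"],
    "absorbance": [0.91, 0.47, 0.12, 0.88, 0.45, 0.11],
})
print(readout)
```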
But this is only half the story. Or maybe even less. Because readout data on its own is often useless.
Each value in a readout dataset is linked to a sample ID or a well index. But that identifier alone doesn’t tell you what was in that sample or well. To interpret each value and understand how it compares to the other values in the dataset, an analyst needs to know what went into that sample or well, both in terms of materials and process.
Metadata is the record of everything that happened to create a sample before it went into the instrument: Compounds and concentrations. Incubation times and temperatures. Samples and reagents. The context required to interpret the data.
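In code, interpreting a readout amounts to joining it with that metadata record. A minimal sketch continuing the hypothetical plate-reader example above (the metadata fields are illustrative, not a prescribed schema):

```python
import pandas as pd

# Readout from the plate-reader sketch above, repeated here so this
# snippet runs on its own.
readout = pd.DataFrame({
    "well":       ["A1", "A2", "A3", "B1", "B2", "B3"],
    "absorbance": [0.91, 0.47, 0.12, 0.88, 0.45, 0.11],
})

# Hypothetical metadata captured at the bench: what went into each well.
metadata = pd.DataFrame({
    "well":         ["A1", "A2", "A3", "B1", "B2", "B3"],
    "compound":     ["DMSO", "cmpd_042", "cmpd_042"] * 2,
    "conc_uM":      [0.0, 1.0, 10.0] * 2,
    "incubation_h": [24] * 6,
})

# Only after this join does each value mean anything: 0.12 stops being
# a number attached to well A3 and becomes "cmpd_042 at 10 uM, 24 h".
annotated = readout.merge(metadata, on="well")
print(annotated.groupby(["compound", "conc_uM"])["absorbance"].mean())
```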
When a biologist manages an experiment end-to-end, they can record this metadata while they’re in the lab, in a form that’s just detailed enough to piece it back together later. As long as they do the analysis within a day or two, they’ll remember enough to fill in the blanks.
But this won’t work for a computational biologist or data scientist who wasn’t in the lab, and may not even have been involved in the planning.
To make things even more complicated, many computational biology and data science projects involve longitudinal analysis across data from multiple experiments.
So it’s not enough for the analyst to track down the metadata from a single experiment. They may need it from dozens of experiments, in a consistent, computer-readable format. And they often interrogate the data based on factors that the bench scientist didn’t consider important at the time.
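In practice, that longitudinal step often looks like stitching per-experiment tables into one frame with a shared schema, which only works if every experiment recorded the same fields in the same way. A rough sketch under that assumption (the experiment IDs and fields are hypothetical; harmonizing names and units across experiments is usually the hard part, and this sketch assumes it’s already done):

```python
import pandas as pd

# Per-experiment annotated tables, each with the same metadata fields.
experiments = {
    "EXP-001": pd.DataFrame({
        "compound": ["cmpd_042"], "conc_uM": [10.0],
        "incubation_h": [24], "absorbance": [0.12],
    }),
    "EXP-014": pd.DataFrame({
        "compound": ["cmpd_042"], "conc_uM": [10.0],
        "incubation_h": [48], "absorbance": [0.09],
    }),
}

# One consistent, computer-readable table spanning all experiments,
# queryable on factors (like incubation time) that the original
# scientists may never have treated as a variable of interest.
longitudinal = pd.concat(
    [df.assign(experiment=exp_id) for exp_id, df in experiments.items()],
    ignore_index=True,
)
print(longitudinal.groupby("incubation_h")["absorbance"].mean())
```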
In this context, it’s no longer enough for biologists to track their lab work the way they’ve done in the past. They need to capture more details in a more consistent form. But for the typical bench scientist, already swamped with tight deadlines and focused on the complexities immediately in front of them, this is easier said than done. Even if they’re willing to learn new tricks, they don’t have the bandwidth to add even more to their workflows. They need tools that make collecting detailed, consistent metadata fast and intuitive, naturally guiding them through the process and making the effort inherently rewarding.
Many Electronic Lab Notebooks (ELNs) and Lab Information Management Systems (LIMS) were built with that single, end-to-end biologist in mind. So they’re designed to let biologists quickly capture the sparse metadata needed to reconstruct the complete picture later from memory. When it comes to capturing more detailed and consistent metadata, these tools just don’t cut it.
To take full advantage of larger, more complex data sources and of its data science and computational biology capabilities, a modern biotech lab needs an ELN and/or LIMS that not only supports scientists in capturing more detailed and consistent metadata, but actively encourages them to do so.
That’s why Sapio built its lab informatics platform with AI and advanced analytics in mind from the start. By integrating an ELN and LIMS into its analytics suite, Sapio has struck a delicate balance between letting bench scientists work how they want while ensuring that the organization can leverage data science and computational biology to its fullest potential.