Being a Data Scientist: My Experience and Toolset

             

If I had to use a few words to give myself a title for my position at UNC, I might not have said I was a data scientist. When I was starting my career there was no such thing, but looking at my CV / Resume, I have:

I have done dozens of projects, and apparently I’ve amassed a fair bit of knowledge along the way without really noticing. Sometimes I answer a question and I think, “How did I know that, anyway?”

Well, yesterday I started mentoring at Thinkful in their Flexible Data Science Bootcamp, and I have to say that I love it already. I like their approach, because it blends 1-on-1 time with remote learning and goes out of its way to support its mentors in being good educators and not just experts.

But as I dig through my data science know-how, I want to share it with more than one student at a time. So this is the first in a series of posts about what it’s like to be a data scientist or, perhaps more accurately, what I did as a data scientist and how that might relate to someone new to the field.

Some of it will include direct examples of doing data science projects in Python, and some of it will be more about the tools of the trade and how to work with open source tools to do data science. And some posts, like this one, will be more about “life as a data scientist.”

What I did as a Data Scientist at Renci

I would say that most of what I did at Renci fell into two camps.

  1. Take a set of requirements from a domain scientist, typically working from an experimental methods section (published or in progress), and translate them into code.
  2. Develop systems for collaboration, dataset sharing, and access to HPC resources for research groups of domain scientists.

What’s a Domain Scientist?

When I think of a domain scientist or domain expert, I typically think of someone outside general computer science who may or may not know something about programming, but who has a question they’re trying to answer, a requirement they need to fulfill, or a body of work that they add to over time. Their work typically requires scientific methods or knowledge of the science of information classification, accessibility, and archiving (think librarians). These include:

Any of these domain scientists can also be data scientists, but they are different hats to wear and data science distinctly requires a focus on computation and algorithms.

When does a Domain Scientist need a Data Scientist?

Outgrowing Excel.

After you reach 100,000 rows, or when you have sheets with “lookup keys”, pivot tables, and formulae that rely on hidden sheets, you have likely outgrown Excel. It’s not that Excel can’t handle it (although a worksheet is capped at 1,048,576 rows). It’s that Excel sheets have a tendency to turn into spaghetti code for a variety of reasons:
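To make the “lookup keys” and pivot table case concrete, here is a minimal sketch of the same two operations in plain Python. The site and sample data, and the function names `join_county` and `pivot_mean`, are made up for illustration. It sticks to the standard library so it runs anywhere; in practice a library such as pandas (with `merge` and `pivot_table`) is the more usual choice.

```python
from collections import defaultdict

# Hypothetical data: what would be two sheets in a workbook.
sites = {  # the "lookup" sheet, keyed by site id
    "S1": {"county": "Orange"},
    "S2": {"county": "Durham"},
}
samples = [  # the "data" sheet, one row per measurement
    {"site": "S1", "year": 2014, "value": 3.0},
    {"site": "S1", "year": 2015, "value": 5.0},
    {"site": "S2", "year": 2014, "value": 2.0},
    {"site": "S2", "year": 2014, "value": 4.0},
]


def join_county(rows, lookup):
    """Roughly what a VLOOKUP does: attach the county for each row's site key."""
    return [{**row, "county": lookup[row["site"]]["county"]} for row in rows]


def pivot_mean(rows, index, column, value):
    """Roughly what a pivot table does: mean of `value` per (index, column) cell."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row[index], row[column])].append(row[value])
    return {key: sum(vals) / len(vals) for key, vals in cells.items()}


joined = join_county(samples, sites)
table = pivot_mean(joined, index="county", column="year", value="value")
```

The point is less the code itself than the fact that each step is a named, testable function instead of a formula hidden in a cell.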

Performance on Large or Unusual Datasets.

“Big Data” means something different to everyone, but generally it means data that is unmanageable, or that is hard to run your models on because each run takes too long. When this happens, someone with expertise in coding and data science can rewrite a scientist’s model so that the data is managed separately from the model, and so that the model runs efficiently on appropriate hardware.
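One way to picture “managing data separately from the model” is the sketch below. It assumes a CSV with a numeric column (the column name `depth` and the in-memory file are hypothetical). The data side streams rows one at a time, and the model side, here just a one-pass mean and standard deviation using Welford's algorithm, accepts any iterable, so the full dataset never has to fit in memory.

```python
import csv
import io
import math


def read_values(handle, column):
    """Data side: stream one numeric column from a CSV, one row at a time."""
    for row in csv.DictReader(handle):
        yield float(row[column])


def mean_and_std(values):
    """Model side: Welford's one-pass mean and sample standard deviation.

    Accepts any iterable, so it never needs the full dataset in memory.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return mean, std


# Hypothetical in-memory file standing in for a large CSV on disk.
fake_file = io.StringIO("depth\n2\n4\n4\n4\n5\n5\n7\n9\n")
mean, std = mean_and_std(read_values(fake_file, "depth"))
```

Because `mean_and_std` knows nothing about files, the same function could be fed from a database cursor or a network stream without changes.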

Cleaning and Data Preparation.

There is a hierarchy we often talked about in information and knowledge management: Data < Information < Knowledge. Within that hierarchy are levels: scanned paper documents, OCR-ed documents, and spreadsheets with commentary littered all over them and inconsistent representations of what’s “missing.”
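As a small illustration of the “inconsistent representations of missing” problem, here is a hedged sketch that maps a handful of missing-value markers onto a single `None`. The marker list is an assumption made for the example; every real dataset needs its own, ideally agreed with the domain scientist.

```python
# Hypothetical list of markers; every real dataset needs its own.
MISSING_MARKERS = {"", "na", "n/a", "none", "missing", "-", "-999"}


def clean_cell(raw):
    """Return a float, or None when the cell holds a 'missing' marker."""
    text = raw.strip()
    if text.lower() in MISSING_MARKERS:
        return None
    return float(text)


def clean_column(cells):
    """Clean a column and report how many cells were missing."""
    cleaned = [clean_cell(cell) for cell in cells]
    n_missing = sum(value is None for value in cleaned)
    return cleaned, n_missing


raw_column = ["4.2", "N/A", " 3.9 ", "", "-999", "missing", "5.0"]
cleaned, n_missing = clean_column(raw_column)
```

Anything that is neither a number nor a known marker raises a `ValueError`, which is deliberate: it is usually better to find out about a new kind of “missing” than to guess.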

Creating Algorithms from Methods.

Most often domain scientists simply aren’t computer people. In this case, a data scientist works directly with the domain scientist, walking through their research methods with them and gathering the specifics that can turn their ideas into code.

Visualization, especially Geographic and Interactive.

Visualization requires a specialized skillset that most scientists, even the coders, don’t have. Visualization tools available to the domain scientist often produce “rough drafts” of graphics, and are limited to “canned” representations that require customization to represent data effectively. Visualization is - to some extent - an art, but there are scientific principles as well that the data scientist will learn that help them:

Spatial (geographic) representations are even more specialized, as map projections, layering, and reproduction issues make producing effective maps its own subfield of visualization.
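To give a flavor of why projections matter, here is a minimal sketch of the spherical Mercator projection, the one behind most web maps. The function names are my own for this example. The second function shows how much the projection stretches distances as you move away from the equator.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius, used by Web Mercator


def mercator(lon_deg, lat_deg, radius=EARTH_RADIUS_M):
    """Project longitude/latitude in degrees to spherical Mercator metres."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    x = radius * lon
    y = radius * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y


def scale_factor(lat_deg):
    """How much Mercator stretches distances at a given latitude."""
    return 1.0 / math.cos(math.radians(lat_deg))


x, y = mercator(-79.05, 35.91)  # roughly Chapel Hill, NC
stretch_at_60 = scale_factor(60.0)  # distances appear about twice as long
```

That stretch is why areas far from the equator look oversized on a Mercator map, and why picking a projection is a real design decision and not a default.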

Machine Learning & Data Mining

Machine learning, text mining, and data mining are their own separate fields of study. They are part of “data science,” but require a significant amount of specialization. Machine learning refers to methods that let a machine derive its own model from data, for example to categorize or classify elements of a dataset. The line between machine learning and data mining is not well drawn, but in my experience machine learning techniques increasingly favor “unsupervised” learning algorithms.
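For a sense of what “unsupervised” means, here is a toy sketch of k-means clustering on one-dimensional data. K-means is my choice of example, not the only unsupervised method, and the data and starting centroids are made up. No labels are supplied: the algorithm finds the two groups on its own by alternating between assigning points to the nearest centroid and moving each centroid to the mean of its points.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical measurements that visibly fall into two groups.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, groups = kmeans_1d(data, centroids=[0.0, 6.0])
```

Real projects would reach for a tested library implementation, but the two-step loop above is the whole idea.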

Sharing, Collaboration, and Information and Knowledge Management

Like machine learning and data mining, I mentally separate this task because it requires a different skillset, and typically you’re working with a team or a professional organization. Skills the data scientist uses here tend to fall into the “library sciences.” Building or setting up an effective system for sharing and knowledge management means:

Data Science Tools

This is not meant to be exhaustive at all. It is just a sampling of the tools that are out there, and to some extent a list of tools I have used in the past to get things done. It is in the same general order as the section above. Where possible I have listed Python libraries and standalone tools; some of the libraries here are in other languages, but they are widely used. I do not cover R at all, because it is its own ecosystem, and there are tens of thousands of packages for R that do everything you want.

Tools for working with larger and more complex Excel-like data
Tools for working with spatial (geographic) data
Tools for working with unusual datasets
Tools for creating performant code with large datasets

Note that any of the database systems from above also fall into this category.

Tools for cleaning data
Tools for Visualization
Some data mining and ML tools
Tools for Sharing, Collaboration, and Information and Knowledge Management
