The Art and Science of Data Visualization
“The greatest value of a picture is when it forces us to notice what we never expected to see.” (John Tukey)
People tend to be very receptive to visual information, which is why it is often said that a picture is worth a thousand words. This adage is especially true in data science, where data visualization transforms complex datasets into more readily digestible insights.
What is data visualization? Very simply, data visualization is the thoughtful display of data to facilitate understanding. As scientists, we use data visualization to understand data and communicate results. The information we derive from data is used to answer questions for research or science-based advice and then carefully packaged visually to enhance understanding.
At AJM, we are often asked to provide solutions related to complex environmental challenges. Whether related to wildlife, aquatics, or wetlands, we specialize in providing innovative solutions tailored to answering specific questions. Data visualization begins with elements of design. A good designer understands the needs and goals of the target audience and how to best represent the data graphically. Suppose we are asked the following question:
"How does the relative abundance of different plant functional groups (e.g., graminoids, forbs, and woody plants) vary between grazed and non-grazed plots over time?"
The first step in this scenario involves asking a series of questions:
Who is the target audience?
What are the goals for interpretation?
What types of analyses need to be communicated?
Are there data constraints?
These considerations are crucial in the data design and visualization process. Your target audience could be the general public, industry stakeholders, government regulators, or research scientists, and knowing which one you are addressing informs decisions about how information is communicated. For instance, if the work is intended for publication in a peer-reviewed journal, then it’s safe to assume that the reviewers will be highly educated, experienced in applying the scientific method, and trained in similar fields.
A thorough review of the literature and background information clarifies the goals for interpretation, which are needed to set the objectives and frame the hypotheses. Returning to our research question above, if the objective is to determine how the relative abundance of different plant functional groups varies over time between grazed and non-grazed plant communities, then we could hypothesize that relative abundance will differ between the two. This provides a basis for developing a robust study design to test hypotheses and determine whether predictions are accurate. For the plant functional groups scenario, the design could include sampling quadrats in grazed and non-grazed meadows over multiple years to evaluate how the relative abundance of different functional groups varies over time (Figure 1). We will use a simulated plant functional groups data set to demonstrate the data design and visualization process.
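To make the idea of a simulated data set concrete, here is a minimal Python sketch of how quadrat data might be generated and converted to relative abundance. Everything in it is an assumption for illustration: the group names, years, cover ranges, and function names are hypothetical and are not the data or code behind the figures in this article.

```python
import random

# Hypothetical setup: groups, treatments, and years are illustrative only.
GROUPS = ["graminoids", "forbs", "woody"]
TREATMENTS = ["grazed", "non-grazed"]
YEARS = [2000, 2005, 2010]


def simulate_quadrat(rng):
    """Return made-up percent cover per functional group for one quadrat."""
    return {group: rng.uniform(5.0, 40.0) for group in GROUPS}


def relative_abundance(cover):
    """Convert cover values to proportions that sum to 1 within a quadrat."""
    total = sum(cover.values())
    return {group: value / total for group, value in cover.items()}


def simulate_dataset(seed=42):
    """Build one record per treatment, year, and functional group."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    records = []
    for treatment in TREATMENTS:
        for year in YEARS:
            proportions = relative_abundance(simulate_quadrat(rng))
            for group, share in proportions.items():
                records.append(
                    {
                        "treatment": treatment,
                        "year": year,
                        "group": group,
                        "relative_abundance": share,
                    }
                )
    return records


if __name__ == "__main__":
    for row in simulate_dataset()[:3]:
        print(row)
```

Storing one row per treatment, year, and group (a "long" layout) is a design choice that tends to make later grouping and plotting steps simpler.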
Data visualization is part art and part science. It requires thought about the aesthetics of the data and an understanding of how people digest information. A notable pioneer in data visualization is Edward Tufte, who wrote “The Visual Display of Quantitative Information” in 1983 to describe the best and worst examples of data visualization. Tufte once said, “Graphical elegance is often found in simplicity of design and complexity of data.” If we adhere to this guiding statement with our plant functional group scenario above, we can develop comprehensive data designs and create meaningful visualizations that identify trends over time (Figure 2).
The panels in Figure 2 show the trends in the relative abundance of different plant functional groups in grazed and non-grazed meadows from 1970 to 2015. The values are standardized to a mean of zero and a standard deviation of one to allow comparisons across functional groups and grazing conditions. For example, in grazed meadows, forbs, graminoids, and woody plants show a decreasing trend, indicating that, in this simulated data set, grazing negatively impacts their relative abundance, whereas invasive plants increase, suggesting that grazing may facilitate their spread. In contrast, the trends are more stable for all plant groups under non-grazing conditions.
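Standardizing to a mean of zero and a standard deviation of one is commonly done with a z-score: subtract the mean from each value, then divide by the standard deviation. The sketch below shows one way to do this in Python. It assumes the sample standard deviation; the article does not say which form was used for the figure, and the function name and example values are hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    centre = mean(values)
    spread = stdev(values)  # assumption: sample (n - 1) standard deviation
    if spread == 0:
        raise ValueError("cannot standardize values that are all identical")
    return [(value - centre) / spread for value in values]


if __name__ == "__main__":
    # Made-up relative abundances for one functional group across years.
    forb_share = [0.42, 0.38, 0.35, 0.30, 0.27]
    print(standardize(forb_share))
```

Standardizing each functional group separately puts groups with very different typical abundances on a common scale, which is what allows their trends to be compared across panels.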
Another objective may revolve around evaluating the changes in plant community composition between grazing conditions over time. An interesting way to visualize the similarity (or dissimilarity) of complex data is to use multivariate statistical techniques such as non-metric multidimensional scaling (NMDS). This approach reduces high-dimensional data to two or three dimensions to better visualize and interpret patterns (Figure 3).
The plot above reduces the complexity of the plant community data to visualize similarities and differences between plant functional groups and grazing conditions. The axes show the two dimensions resulting from the NMDS analysis, and points that sit closer together are considered more similar. A notable pattern is that invasive plant species are more distinct in the grazed sites and separate from the other plant groups. This plot clearly shows the overall differences between the plant groups, using marks (e.g., points) and channels (e.g., distinct colours and shapes) to represent the data visually; however, it tells us nothing about changes over time.
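NMDS works from a matrix of pairwise dissimilarities between samples and tries to preserve their rank order in the reduced space. The sketch below computes only that input matrix, using the Bray-Curtis dissimilarity, which is a common choice for abundance data in community ecology. The choice of Bray-Curtis is an assumption, since the article does not state which measure was used, and the sample names and values are made up. The ordination itself would normally be run with a dedicated statistics package.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical)."""
    if len(a) != len(b):
        raise ValueError("abundance vectors must have the same length")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty samples are treated as identical here
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def dissimilarity_matrix(samples):
    """Return {(name_i, name_j): dissimilarity} for every pair of samples."""
    names = list(samples)
    return {
        (p, q): bray_curtis(samples[p], samples[q])
        for p in names
        for q in names
    }


if __name__ == "__main__":
    # Hypothetical cover values ordered as [graminoids, forbs, woody].
    samples = {
        "grazed_1": [10, 5, 0],
        "grazed_2": [8, 6, 1],
        "non_grazed_1": [2, 9, 12],
    }
    for pair, value in dissimilarity_matrix(samples).items():
        print(pair, round(value, 3))
```

Samples with a small dissimilarity would be expected to plot close together in the ordination, which is the basis for reading nearby points as similar.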
We can adjust our plotting settings to view changes in the plant community data over time by using facets (Figure 4). Breaking the data into bite-sized pieces improves its visual representation and facilitates comparisons across groups and years. These small multiples, a series of similar panels shown side by side, allow us to make comparisons and draw conclusions.
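Behind the scenes, faceting amounts to splitting the data into subsets by a chosen variable and drawing one panel per subset. Plotting libraries typically handle this step for you; the sketch below shows only the splitting, in plain Python, so the idea is visible. The record layout and the function name `facet_by` are hypothetical.

```python
from collections import defaultdict


def facet_by(records, key):
    """Split records into panels, one per distinct value of `key`."""
    panels = defaultdict(list)
    for record in records:
        panels[record[key]].append(record)
    # Sort the panels so they appear in a predictable order, e.g. by year.
    return dict(sorted(panels.items()))


if __name__ == "__main__":
    # Made-up ordination scores for a few samples.
    records = [
        {"year": 2010, "treatment": "grazed", "nmds1": 0.4, "nmds2": -0.1},
        {"year": 2000, "treatment": "grazed", "nmds1": 0.1, "nmds2": 0.2},
        {"year": 2000, "treatment": "non-grazed", "nmds1": -0.3, "nmds2": 0.0},
        {"year": 2010, "treatment": "non-grazed", "nmds1": -0.2, "nmds2": 0.1},
    ]
    for year, rows in facet_by(records, "year").items():
        print(year, len(rows), "points in this panel")
```

Each returned subset would then be drawn in its own panel, ideally with shared axis ranges so the panels can be compared directly.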
How we represent data can influence how the data is perceived. Data visualization is not quite art or science – it’s a blend of both. A good visualization designer encodes data into visual form, and the user decodes it to improve understanding. By carefully considering the design elements and the needs of the audience, we can create visualizations that are both informative and engaging, ultimately leading to better data-driven decision-making.
Remember, whether you’re plotting simple bar charts or complex multi-dimensional data, always aim for clarity, simplicity, and purpose. As John Tukey said, the essence of data analytics is “taking boring flat data and bringing it to life through visualization.”
By: Fonya Irvine, Senior Aquatic Science Biologist