
Data Artist

Data Science Dojo
Rachel Schlotfeldt | July 10

Can data ever be significant without interpretation and visualization? Here’s why data artists and creative thinking matter in data science. 

Data artists and creative thinking have significance in data science

In a recent Atlantic article on the rise of the data artist, a stark distinction is drawn between data presentations and data art. Data visualization projects like David McCandless's various "Information is Beautiful" pieces fall on the side of the latter, whereas projects such as "Rich Blocks Poor Blocks" might be considered "closer to the data." But what are the implications of treating certain approaches to data analytics as artistic and others as more scientifically objective?

The presence of the artist makes itself known in projects that explicitly use visuals to evoke sentiment in individuals who, surrounded by a modern sea of information, must grapple with the ways their identities and environments are being newly defined. But the process of cleaning data, selecting variables, and asking certain questions of a dataset could be called an artistic process as well, imbued with individual choices and shaped by ideological conditioning.

The Titanic dataset and artistic interpretation

The Titanic dataset is the "Hello World" of data, used to build models that predict an individual's chance of survival based on gender, age, ticket price, accommodation class, and accompanying travelers, among other variables. The dataset powers Data Science Dojo's Titanic Survival Predictor, which outputs the statistical chance of survival based on those variables. When looking at the different factors that affect survival rates, how are the choices made a product of the data scientist's own version of the truth? To answer this question, we can look at some visualizations of survival versus age.

When we initially pose the question of whether age has a significant impact on survival, a multi-box plot produced in Microsoft's Azure Machine Learning Studio negates the assumption that age and survival rates are significantly intertwined. In the first plot, 0 (deceased) and 1 (survived) have similar median values, represented by the horizontal line within each box.

There is some variance in the min and max, as well as the interquartile range, but the outliers average this out and little meaningful change is apparent. This pushes up against the assumption that children were evacuated from the ship first.

[Figure: box plot of age by survival outcome, Titanic dataset]
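The group-wise comparison behind that box plot can be sketched in a few lines of pandas. The column names `age` and `survived` and the tiny sample below are illustrative assumptions, not the real Titanic data:

```python
import pandas as pd

def age_summary_by_survival(df: pd.DataFrame) -> pd.DataFrame:
    """Five-number summary of age per survival group --
    the same numbers a box plot encodes visually."""
    return (
        df.groupby("survived")["age"]
          .describe()[["min", "25%", "50%", "75%", "max"]]
    )

# Tiny illustrative sample (not the real dataset):
sample = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 1],
    "age":      [22, 35, 54, 4, 27, 58],
})
print(age_summary_by_survival(sample))
```

Comparing the `50%` (median) rows for the two groups is exactly the check the article describes: similar medians suggest age alone does not separate survivors from the deceased.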

Asking the right questions

But what happens when we reflect on the other intuitive assumptions we make about our definitions of age and our categorical understandings of childhood and adulthood? Feature engineering involves constructing new variables from raw data so that models, and the humans interpreting them, can capture knowledge beyond the original representation. By understanding the gaps in ideology, and how we come to a dataset primed to express prior enculturated forms of knowledge, the data scientist can develop richer answers to the questions posed.

[Figure: feature engineering in the data science workflow]

In the Titanic dataset, the question of age versus chance of survival can be reframed by examining what we define as a "child." Average life expectancy has, unsurprisingly, increased over the years: the "Our World in Data" visualization shows that England in 1911, a year before the Titanic sank in 1912, had an average life expectancy of 51.4.

Compare this to the average life expectancy of 80 in 2011, and it becomes clear how easy it is to retroactively apply our definitions of adulthood to the dataset. For our model, eight years old was chosen as the boundary of childhood. With this inference, the corresponding pie chart looks like this:

[Figure: pie chart visualization]

This is more aligned with assumptions of how the data should look. But doesn't this process of redefining age force the data scientist to confront the ideological gaps at play? Doesn't making creative choices for the purposes of dissemination and palatability allow an audience to extract meaning? The process of choosing the age boundary itself involves this task.
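One way to make that choice explicit is a small feature-engineering sketch. The `is_child` flag, the column names, and the toy data below are assumptions for illustration; the cutoff of eight is the boundary discussed in the text, and changing it changes the story the numbers tell:

```python
import pandas as pd

CHILD_AGE_CUTOFF = 8  # the boundary chosen in the article; try 11 or 14 and compare

def add_child_flag(df: pd.DataFrame, cutoff: int = CHILD_AGE_CUTOFF) -> pd.DataFrame:
    """Engineer an is_child feature from a chosen, era-aware definition of childhood."""
    out = df.copy()
    out["is_child"] = out["age"] < cutoff
    return out

# Toy data for illustration only:
sample = pd.DataFrame({
    "age":      [3, 6, 10, 30, 45, 7],
    "survived": [1, 1, 0, 0, 1, 1],
})
rates = add_child_flag(sample).groupby("is_child")["survived"].mean()
print(rates)
```

The point of the sketch is that the cutoff is a single, visible parameter: the analyst's definition of childhood is written into the code rather than hidden in the visualization.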

The following figure illustrates the age distribution plotted against survival rate. The age of eight was not chosen arbitrarily: as the plot shows, the first significant drop in survival rate comes between the ages of 8 and 11, and another significant decrease comes after the age of 14. Depending on the data scientist's definition of adulthood, the visualization and shock value of the data will change.

[Figure: distribution of age plotted against survival rate]
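The band-by-band comparison behind that choice can be sketched as follows. The bin edges mirror the age bands discussed above, while the column names and sample data are illustrative assumptions:

```python
import pandas as pd

def survival_by_age_bin(df: pd.DataFrame, bins: list) -> pd.Series:
    """Survival rate per age band -- the evidence used to pick a
    defensible child/adult boundary rather than an arbitrary one."""
    bands = pd.cut(df["age"], bins=bins, right=False)
    return df.groupby(bands, observed=True)["survived"].mean()

# Illustrative data only; the edges [0, 8, 11, 14, 80] echo the bands in the text.
sample = pd.DataFrame({
    "age":      [2, 5, 7, 9, 10, 13, 15, 20, 40],
    "survived": [1, 1, 1, 0,  0,  1,  0,  0,  1],
})
print(survival_by_age_bin(sample, bins=[0, 8, 11, 14, 80]))
```

Scanning the resulting rates for the first large drop is the programmatic version of reading the distribution plot by eye.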

So what?

So what then comes next? The data, in this case, is not working to hegemonically instantiate definitions of fact and fiction, 0s and 1s; rather, the answers themselves are constructed from the patterns in the data. It is this iterative process of asking questions of the data, using the data to reframe the answers, and understanding the possibility of variance that requires a dynamic understanding of how meaning is made, challenging our assumption that science is irrefutable.

Michel Foucault discusses the role of the author, tying this discussion back to what are considered the hard sciences. He writes that “scientific discourses began to be received for themselves, in the anonymity of an established or always demonstrable truth; their membership in a systematic ensemble, and not the reference to the individual who produced them, stood as their guarantee.

The author function faded away, and the inventor's name served only to christen a theorem, proposition, particular effect, property, body, group of elements, or pathological syndrome" (Foucault, "What is an Author?"). But what would happen if we broke down the idea of the author, taking the characteristics of creativity and self-expression and applying them to the figure of the data scientist? Restricting our idea of who can possess creativity limits the interplay between disciplines. This kind of restriction can only lead to dangerous inferences created without question, dampening the possibility for any degree of transparency.

Reapplying creative ingenuity

If we reapply our understanding of creative ingenuity to data science beyond explicit "data art," reusing the characteristics that constitute how we treat data, then a new space opens. The creative figure here works to allow a discursive space that encourages play, subversion, and a recombination of outcomes.

As the Titanic dataset shows, the visual is marked by the ways in which the data is created. If this version of the truth is broken down into something more moldable and transient, then perhaps the untapped potential of big data to break apart established preconceived notions can comfortably ease individuals into new kinds of truth.
