Can data ever be significant without interpretation? In a recent Atlantic article on the rise of the data artist, a stark distinction is drawn between data presentations and data art. Visualization projects like R. Luke DuBois’ “A More Perfect Union” and David McCandless’ various “Information is Beautiful” projects fall on the side of the latter, whereas projects such as “Rich Blocks Poor Blocks” might be considered “closer to the data”. However, what are the implications of seeing certain approaches to data analytics as artistic and others as more scientifically objective? The presence of the artist makes itself known in projects that explicitly use visuals to evoke sentiments in the individual who, surrounded by a modern sea of information, must grapple with the ways in which their identity and environment are being newly defined. But the process of cleaning data, selecting variables, and asking certain questions of a dataset could be defined as an artistic process as well, imbued with individual choices and altered by ideological conditioning.
The Titanic Dataset is the “Hello World” of data, used to create models that predict an individual’s chance of survival based on their gender, age, ticket price, accommodation class, and accompanying travelers, among other variables. The dataset is used for Data Science Dojo’s Titanic Survival Predictor, which outputs the statistical chance of survival based upon the above variables. When looking at the different factors that affect survival rates, how are the choices made a product of the data scientist’s own version of the truth? To answer this question, we can look at some visualizations of survival versus age.
When we first pose the question of whether age has a significant impact on survival, a multibox plot produced in Microsoft Azure Machine Learning Studio appears to negate the assumption that age and survival rates are significantly intertwined. In the first plot, 0 (deceased) and 1 (survived) have similar median values, represented by the horizontal line within each box. There is some variation in the minimum, the maximum, and the interquartile range, but even with the outliers taken into account, little significant difference is apparent. This pushes back against the assumption that children were evacuated from the ship first.
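The comparison that a box plot makes can also be sketched numerically. The snippet below is a minimal illustration, not the Azure Machine Learning Studio workflow: it computes the quartiles and median of age for each survival outcome using only Python's standard library. The `passengers` list and the `age_summary` helper are hypothetical, and the ten records are made-up toy values constructed to mimic the pattern described above, not rows from the real Titanic data.

```python
from statistics import median, quantiles

# Hypothetical toy records (age, survived). These are NOT real Titanic rows;
# they were invented so that the two groups have similar medians.
passengers = [
    (4, 1), (7, 1), (19, 0), (26, 0), (27, 1),
    (28, 0), (30, 1), (35, 1), (40, 0), (58, 0),
]

def age_summary(records, outcome):
    """Return (Q1, median, Q3) of age for one survival outcome (0 or 1)."""
    ages = sorted(age for age, survived in records if survived == outcome)
    # method="inclusive" treats the sample as the whole population of interest.
    q1, _, q3 = quantiles(ages, n=4, method="inclusive")
    return q1, median(ages), q3

for outcome in (0, 1):
    q1, med, q3 = age_summary(passengers, outcome)
    print(f"survived={outcome}: Q1={q1}, median={med}, Q3={q3}")
```

In this toy sample the medians are nearly identical even though the two youngest passengers both survived, which is the kind of detail a median-centered summary can hide.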
However, what happens when we reflect upon the other intuitive assumptions we make about age, and upon our categorical understandings of childhood and adulthood? Feature engineering involves constructing variables that allow humans, and the models they build, to acquire knowledge beyond the raw data representation. By understanding the gaps in ideology, and how we come to a dataset with a need to express prior, encultured forms of knowledge, the data scientist is able to develop richer answers to the questions posed.
In the Titanic dataset, the question of age versus chance of survival can be reframed by examining what we define as a “child”. It is not surprising that average life expectancy has increased over the years. Looking at the “Our World in Data” visualization, it can be seen that England had an average life expectancy of 51.4 in 1911, a year before the sinking of the Titanic in 1912. Compare this to the average life expectancy of 80 in 2011. It becomes clear how easy it is to retroactively apply our own definitions of adulthood to the dataset. For our model, eight years old was chosen as the boundary of childhood. With this inference, the corresponding pie chart looks like:
This is perhaps more aligned with assumptions of how the data should look. But doesn’t this process of redefining age force the data scientist to understand the ideological gaps at play, and to make creative choices for the purposes of dissemination and palatability so that an audience can extract meaning? The process of choosing the age boundary itself involves this task. The following figure illustrates age distribution plotted against survival rate. The age of eight was not chosen arbitrarily: the plot shows that the first significant drop in survival rate came between the ages of 8 and 11. Another significant decrease came after the age of 14. Depending on the data scientist’s definition of adulthood, the visualization and the shock value of the data will change.
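How much the picture depends on the chosen boundary can be illustrated with a small sketch. The code below is only an assumption-laden example, not the model described here: `survival_rate_by_group` is a hypothetical helper, the cutoff is treated as inclusive (a passenger aged exactly eight counts as a child, which the text does not specify), and the records are invented toy values rather than real Titanic rows.

```python
from collections import defaultdict

# Hypothetical toy records (age, survived). These are NOT real Titanic rows.
passengers = [
    (4, 1), (7, 1), (10, 0), (13, 1), (19, 0),
    (26, 0), (27, 1), (30, 1), (40, 0), (58, 0),
]

def survival_rate_by_group(records, child_cutoff):
    """Label each passenger 'child' (age <= child_cutoff) or 'adult',
    then return the share of survivors in each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for age, survived in records:
        # Assumption: the boundary age itself is counted as childhood.
        group = "child" if age <= child_cutoff else "adult"
        counts[group][0] += survived
        counts[group][1] += 1
    return {group: s / n for group, (s, n) in counts.items()}

# The same records, read through three different definitions of childhood.
for cutoff in (8, 14, 18):
    print(cutoff, survival_rate_by_group(passengers, cutoff))
```

Moving the cutoff changes which passengers are counted as children, and with it the survival rates that a pie chart built on those groups would display. The data stay the same; only the definition shifts.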
So what, then, comes first? The data in this case is not working to hegemonically instantiate definitions of fact and fiction, 0’s and 1’s; rather, the answers themselves are constructed from the patterns in the data. It is this iterative process of asking questions of the data, using the data to reframe the answers, and understanding the possibility of variance that requires a dynamic understanding of the process of making meaning, challenging our relationship to the assumption that science is irrefutable.
Michel Foucault discusses the role of the author, tying this discussion back to what are considered the hard sciences. He writes that “scientific discourses began to be received for themselves, in the anonymity of an established or always redemonstrable truth; their membership in a systematic ensemble, and not the reference to the individual who produced them, stood as their guarantee. The author function faded away, and the inventor’s name served only to christen a theorem, proposition, particular effect, property, body, group of elements, or pathological syndrome” (Foucault, “What is an Author?”). However, what would happen if we broke down the idea of the author, taking the characteristics of creativity and self-expression and reapplying them to the figure of the data scientist? Restricting our idea of who can possess creativity limits the interplay between disciplines. This kind of restriction can only lead to dangerous inferences created without question, dampening the possibility for any degree of transparency.
If we extend our understanding of creative ingenuity in data science beyond explicit “data art”, and apply the characteristics that constitute that understanding to the ways in which we treat data, then a new space opens up. The creative figure here works to allow a discursive space that encourages play, subversion, and a recombination of outcomes. As can be seen through the Titanic dataset, the visual is marked by the ways in which the data is created. If this version of the truth is broken down to become something more moldable and transient, then perhaps the untapped potential of big data to break apart established preconceived notions can comfortably ease individuals into new kinds of truth.