data wrangling

Data Science Dojo
Dave Langer

Feature engineering and data wrangling are key skills for a data scientist. Learn how to accelerate your R coding to deliver more, and better, features.

Earlier this month I had the privilege of traveling to Amsterdam to teach data science to an excellent group of folks. As is so often the case, I learned as much from the students as they learned from me.

Understanding feature engineering and data wrangling

For example, one of the students asked for some R programming assistance around data wrangling and feature engineering. The scenario in question really intrigued me. I knew how I could solve the problem using traditional non-functional programming techniques (e.g., using loops), but I was looking for something more elegant.

In the hotel that evening I fired up RStudio and started noodling on the problem using my current go-to solution for data wrangling in R – the mighty dplyr package. I had so much fun working through the scenario. Here’s some example code from the video showing dplyr in action.

[splus]
library(dplyr)    # mutate(), left_join(), select(), %>%
library(stringr)  # str_extract()

#======================================================================
# Add the new feature for the Title of each passenger
#
train <- train %>%
  mutate(Title = str_extract(Name, "[a-zA-Z]+\\."))
table(train$Title)

#======================================================================
# Condense titles down to small subset
#
titles.lookup <- data.frame(Title = c("Mr.", "Capt.", "Col.", "Don.", "Dr.",
                                      "Jonkheer.", "Major.", "Rev.", "Sir.",
                                      "Mrs.", "Dona.", "Lady.", "Mme.", "Countess.",
                                      "Miss.", "Mlle.", "Ms.",
                                      "Master."),
                            New.Title = c(rep("Mr.", 9),
                                          rep("Mrs.", 5),
                                          rep("Miss.", 3),
                                          "Master."),
                            stringsAsFactors = FALSE)
View(titles.lookup)

# Replace Titles using lookup table
train <- train %>%
  left_join(titles.lookup, by = "Title")
View(train)

# Overwrite Title with the condensed value and drop the helper column
train <- train %>%
  mutate(Title = New.Title) %>%
  select(-New.Title)
View(train)
[/splus]
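
The snippet above assumes the Titanic training data is already loaded as `train`. As a minimal, self-contained sketch of the same lookup-table pattern, the following runs on a tiny made-up data frame. The `toy` rows and the shortened `toy.lookup` table are invented for illustration and are not the real data set, and the sketch assumes the dplyr and stringr packages are installed.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the Titanic training data (invented rows)
toy <- data.frame(
  Name = c("Smith, Mr. John", "Jones, Mlle. Anne",
           "Brown, Col. James", "White, Master. Tom"),
  stringsAsFactors = FALSE
)

# Hypothetical, shortened lookup table mapping raw titles to a condensed set
toy.lookup <- data.frame(
  Title     = c("Mr.", "Col.", "Mlle.", "Master."),
  New.Title = c("Mr.", "Mr.", "Miss.", "Master."),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  mutate(Title = str_extract(Name, "[a-zA-Z]+\\.")) %>%  # first run of letters followed by "."
  left_join(toy.lookup, by = "Title") %>%                # attach the condensed title
  mutate(Title = New.Title) %>%                          # overwrite the raw title
  select(-New.Title)                                     # drop the helper column

print(toy$Title)
```

One thing to watch with this pattern: a title that is missing from the lookup table comes back as `NA` after the `left_join`, so a quick `table(toy$Title, useNA = "ifany")` is a reasonable sanity check.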

Now compare the above elegant (if I do say so myself ;-)) code with the following code from my series:

[splus]
# Expand upon the relationship between `Survived` and `Pclass` by adding the new `Title` variable to the
# data set and then explore a potential 3-dimensional relationship.

# Create a utility function to help with title extraction.
# Note: "." is a regex wildcard in these patterns, so "Mr." would also match
# "Mrs."; that is why "Mrs." is checked before "Mr."
extractTitle <- function(name) {
  name <- as.character(name)

  if (length(grep("Miss.", name)) > 0) {
    return ("Miss.")
  } else if (length(grep("Master.", name)) > 0) {
    return ("Master.")
  } else if (length(grep("Mrs.", name)) > 0) {
    return ("Mrs.")
  } else if (length(grep("Mr.", name)) > 0) {
    return ("Mr.")
  } else {
    return ("Other")
  }
}

# Build the titles vector one row at a time
titles <- NULL
for (i in seq_len(nrow(data.combined))) {
  titles <- c(titles, extractTitle(data.combined[i, "name"]))
}
data.combined$title <- as.factor(titles)

# Re-map titles to be more exact
titles[titles %in% c("Dona.", "the")] <- "Lady."
titles[titles %in% c("Ms.", "Mlle.")] <- "Miss."
titles[titles == "Mme."] <- "Mrs."
titles[titles %in% c("Jonkheer.", "Don.")] <- "Sir."
titles[titles %in% c("Col.", "Capt.", "Major.")] <- "Officer"
table(titles)

# Make title a factor
data.combined$new.title <- as.factor(titles)

# Collapse titles based on visual analysis
indexes <- which(data.combined$new.title == "Lady.")
data.combined$new.title[indexes] <- "Mrs."

indexes <- which(data.combined$new.title == "Dr." |
                 data.combined$new.title == "Rev." |
                 data.combined$new.title == "Sir." |
                 data.combined$new.title == "Officer")
data.combined$new.title[indexes] <- "Mr."
[/splus]

Beautiful!
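
To make the contrast concrete, the row-by-row loop can also be collapsed into a single vectorized expression. The following is only a sketch, not code from the video series: it uses dplyr's `case_when` with stringr's `str_detect`, the `passengers` data frame is made up for illustration, and it matches the titles literally via `fixed()` rather than as regular expressions the way `grep` does by default.

```r
library(dplyr)
library(stringr)

# Hypothetical passenger names (invented for illustration)
passengers <- data.frame(
  name = c("Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Ann",
           "Poe, Master. Tim", "Moe, Don. Luis"),
  stringsAsFactors = FALSE
)

# case_when() evaluates the conditions in order for every row at once,
# so no explicit loop or growing vector is needed
passengers <- passengers %>%
  mutate(title = case_when(
    str_detect(name, fixed("Miss."))   ~ "Miss.",
    str_detect(name, fixed("Master.")) ~ "Master.",
    str_detect(name, fixed("Mrs."))    ~ "Mrs.",
    str_detect(name, fixed("Mr."))     ~ "Mr.",
    TRUE                               ~ "Other"
  ))

print(passengers$title)
```

Because the patterns are literal here, "Mr." no longer matches "Mrs.", so the order of the conditions matters less than it does in the `grep`-based function.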

In our Bootcamp we spend a lot of time emphasizing that, in the bulk of scenarios, a data scientist is best served by focusing their time on data wrangling and (most importantly) feature engineering. So often the quality of the features trumps everything else: algorithm selection, hyperparameter tuning, blending, etc. My work on this video series is aligned to our teachings on the importance of both in R. Hopefully folks get as much out of my new series as I am getting out of making it.

Enjoy and happy data sleuthing!

Data wrangling cheat sheet

Here is a cheat sheet:

[Image: data wrangling cheat sheet]
Data Science Dojo
Srishti Puri
April 6

This article lists the top 54 most shared data science quotes, grouped by theme: data as an analogy, the importance of data, data analytics adoption, data wrangling, data privacy and security, and the future of data.

 

The growing reliance on data analytics has reset business practices, opening frontiers from innovation to productivity and competition. Moreover, these technologies are now available at a much lower cost, making data a growing torrent flowing into every area of the global economy.

In this data-driven world of technological innovation, let’s take a look at some of the most popular data science quotes.

Learn with amazing data science quotes

 

Experts from every area of the economy have spoken about the capability and impact of data science. We have curated a list of some famous and useful data science quotes for you:

Data science quotes about “data as an analogy”

 

1. “Information is the oil of the 21st century, and analytics is the combustion engine.” – Peter Sondergaard, Chairman of the Board at DecideAct.

2. “Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.” – Geoffrey Moore, management consultant and author of Crossing the Chasm.

3. “If you wanna do data science, learn how it is a technical, cultural, economic, and social discipline that has the ability to consolidate and rearrange societal power structures.” – Hugo Bowne-Anderson, Head of Developer Relations at OuterBounds.

4. “Possessed is the right word. I often tell people, I don’t necessarily want to be a data scientist. You just kind of are a data scientist. You just can’t help but look at that data set and go, I feel like I need to look deeper. I feel like that’s not the right fit.” – Jennifer Shin, data science/machine learning/AI expert and founder of 8 Path Solutions.

5. “My least favorite description [of Deep Learning] is, ‘It works just like the brain.’ I don’t like people saying this because, while Deep Learning gets an inspiration from biology, it’s very, very far from what the brain does.” – Yann LeCun, VP & Chief AI Scientist at Meta.

6. “AI is the new electricity. Just as electricity transformed industry after industry 100 years ago, I think AI will do the same.” – Andrew Ng, Founder & CEO of Landing AI, Founder of deeplearning.ai, Co-Chairman and Co-Founder of Coursera, and Adjunct Professor at Stanford University.

7. “Much of the power of artificial intelligence stems from its very mindlessness. Immune to the vagaries and biases that attend conscious thought, computers can perform their lightning-quick calculations without distraction or fatigue, doubt or emotion. The coldness of their thinking complements the heat of our own.” – Nicholas G. Carr, American writer on technology and business.

8. “We’ve defined our relationship with technology not as that of body and limb or even sibling and sibling, but as that of master and slave.” […] “With roles reversed, the metaphor also informs society’s nightmares about technology. As we become dependent on our technological slaves…we turn into slaves ourselves.” – Nicholas G. Carr, American writer on technology and business.

Data science quotes about “the importance of data”

 

9. “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every two days.” – Eric Schmidt, Founding Partner, Innovation Endeavors.

 

10. “We are moving slowly into an era where big data is the starting point, not the end.” – Pearl Zhu, Author.

 

11. “Most of the world will make decisions by either guessing or using their guts. They will be either lucky or wrong.” – Suhail Doshi, chief executive officer, Mixpanel.

 

12. “We’re entering a new world in which data may be more important than software.” – Tim O’Reilly, founder, O’Reilly Media.

 

13. “Without big data, you are blind and deaf in the middle of a freeway.” – Geoffrey Moore, management consultant and theorist.

 

14. “Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.” – Aaron Levenstein, business professor at Baruch College.

 

15. “A data scientist is someone who can obtain, scrub, explore, model, and interpret data, blending hacking, statistics, and machine learning. Data scientists not only are adept at working with data but appreciate data itself as a first-class product.” – Hilary Mason, founder, Fast Forward Labs.

 

16. “Data Scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.” – Mike Loukides, editor, O’Reilly Media.

 

17. “Too often we forget that genius, too, depends upon the data within its reach, that even Archimedes could not have devised Edison’s inventions.” – Ernest Dimnet, priest, writer, and lecturer.

 

18. “The core advantage of data is that it tells you something about the world that you didn’t know before.” – Hilary Mason, data scientist and founder of Fast Forward Labs.

 

Data science quotes about “data analytics adoption”

 

19. “The biggest challenge of making the evolution from a knowing culture to a learning culture—from a culture that largely depends on heuristics in decision making to a culture that is much more objective and data-driven and embraces the power of data and technology—is really not the cost. Initially, it ends up being imagination and inertia…

 

What I have learned in my last few years is that the power of fear is quite tremendous in evolving oneself to think and act differently today, and to ask questions today that we weren’t asking about our roles before. And it’s that mindset change—from an expert-based mindset to one that is much more dynamic and much more learning-oriented, as opposed to a fixed mindset—that I think is fundamental to the sustainable health of any company, large, small, or medium.” – Murli Buluswar, chief science officer, AIG.

 

20. “What we found challenging, and what I find in my discussions with a lot of my counterparts that is still a challenge, is finding the set of tools that enable organizations to efficiently generate value through the process.

 

I hear about individual wins in certain applications but having a more cohesive ecosystem in which this is fully integrated is something we are all struggling with, in part because it’s still very early days. Although we’ve been talking about it seemingly quite a bit over the past few years, the technology is still changing; the sources are still evolving.” – Ruben Sigala, former EVP and chief marketing officer, Caesars Entertainment.

 

21. “The human side of analytics is the biggest challenge to implementing big data.” – Paul Gibbons, author of “The Science of Successful Organizational Change.”

 

22. “Every day, three times per second, we produce the equivalent of the amount of data that the Library of Congress has in its entire print collection, right? But most of it is like cat videos on YouTube or 13-year-olds exchanging text messages about the next Twilight movie.” – Nate Silver, founder and editor in chief of FiveThirtyEight.

 

23. “One of the biggest challenges is around data privacy and what is shared versus what is not shared. And my perspective on that is consumers are willing to share if there’s value returned. One-way sharing is not going to fly anymore. So how do we protect and how do we harness that information and become a partner with our consumers rather than kind of just a vendor for them?” – Zoher Karu, head of data and analytics, APAC and EMEA.

 

24. “The human side of analytics is the biggest challenge to implementing big data.” – Paul Gibbons, author of “The Science of Successful Organizational Change.”

 

25. “The first change we had to make was just to make our data of higher quality. We have a lot of data, and sometimes we just weren’t using that data, and we weren’t paying as much attention to its quality as we now need to… The second area is working with our people and making certain that we are centralizing some aspects of our business. We are centralizing our capabilities, and we are democratizing its use. I think the other aspect is that we recognize as a team and as a company that we ourselves do not have sufficient skills, and we require collaboration across all sorts of entities outside of American Express.

 

This collaboration comes from technology innovators, it comes from data providers, it comes from analytical companies. We need to put a full package together for our business colleagues and partners so that it’s a convincing argument that we are developing things together, that we are co-learning, and that we are building on top of each other.” – Ash Gupta, former American Express executive; president, Payments and E-Commerce Innovation, LLC.

 

26. “On average, people should be more skeptical when they see numbers. They should be more willing to play around with the data themselves.” – Nate Silver, founder and editor in chief of FiveThirtyEight.

 

27. “Think analytically, rigorously, and systematically about a business problem and come up with a solution that leverages the available data.” – Michael O’Connell, chief analytics officer, TIBCO.


 

Data science quotes about “data wrangling”

 

28. “The data fabric is the next middleware.” – Todd Papaioannou, entrepreneur, investor, and mentor.

 

29. “The goal is to turn data into information and information into insight.” – Carly Fiorina, former chief executive officer, Hewlett Packard.

 

30. “No data is clean, but most is useful.” – Dean Abbott, Co-founder and Chief Data Scientist at SmarterHQ.

 

31. “Errors using inadequate data are much less than those using no data at all.” – Charles Babbage, mathematician, engineer, inventor, and philosopher.

 

32. “Data are just summaries of thousands of stories–tell a few of those stories to help make the data meaningful.” – Chip and Dan Heath, authors of “Made to Stick” and “Switch.”

 

33. “In the spirit of science, there really is no such thing as a ‘failed experiment.’ Any test that yields valid data is a valid test.” – Adam Savage, creator of MythBusters.

 

34. “If somebody tortures the data enough (open or not), it will confess anything.” – Paolo Magrassi, former vice president, research director, Gartner.

 

35. “I think you can have a ridiculously enormous and complex data set, but if you have the right tools and methodology, then it’s not a problem.” – Aaron Koblin, entrepreneur in data and digital technologies.

 

36. “Data that is loved tends to survive.” – Kurt Bollacker, computer scientist.

 

37. “Data is like garbage. You’d better know what you are going to do with it before you collect it.” – Mark Twain.

 

38. “We are surrounded by data but starved for insights.” – Jay Baer, marketing and customer experience expert.

 

39. “With data collection, ‘the sooner the better’ is always the best answer.” – Marissa Mayer, IT executive and co-founder of Lumi Labs, former Yahoo! President and CEO.

 

40. “Errors using inadequate data are much less than those using no data at all.” – Charles Babbage, mathematician, philosopher, inventor, and mechanical engineer.

 

Learn more about data wrangling

 

Data science quotes about “data privacy and security”

 

41. “The price of freedom is eternal vigilance. Don’t store unnecessary data, keep an eye on what’s happening, and don’t take unnecessary risks.” – Chris Bell, former U.S. congressman.

 

42. “It’s so cheap to store all data. It’s cheaper to keep it than to delete it. And that means people will change their behavior because they know anything they say online can be used against them in the future.” – Mikko Hypponen, security and privacy expert.

 

43. “In (the) digital era, privacy must be a priority. Is it just me, or is secret blanket surveillance obscenely outrageous?” – Al Gore, former vice president of the United States.

 

44. “You happily give Facebook terabytes of structured data about yourself, content with the implicit tradeoff that Facebook is going to give you a social service that makes your life better.” – John Battelle, founder, Wired magazine.

 

45. “Better be despised for too anxious apprehensions than ruined by too confident security.” – Edmund Burke, British philosopher and statesman.

 

46. “Everything we do in the digital realm—from surfing the web to sending an email to conducting a credit card transaction to, yes, making a phone call—creates a data trail. And if that trail exists, chances are someone is using it—or will be soon enough.” – Douglas Rushkoff, author of “Throwing Rocks at the Google Bus.”

 

Data science quotes about “the future of data”

 

47. “The world is one big data problem.” – Andrew McAfee, principal research scientist at MIT.

 

48. “Big data will spell the death of customer segmentation and force the marketer to understand each customer as an individual within 18 months or risk being left in the dust.” – Virginia M. (Ginni) Rometty, chairman, president, and CEO of IBM.

 

49. “Every company has big data in its future, and every company will eventually be in the data business.” – Thomas H. Davenport, American academic and author specializing in analytics, business process innovation, and knowledge management.

 

50. “We should teach the students, as well as executives, how to conduct experiments, how to examine data, and how to use these tools to make better decisions.” – Dan Ariely, professor of psychology and behavioral economics at Duke University and a founding member of the Center for Advanced Hindsight.

 

51. “Autodidacts—the self-taught, un-credentialed, data-passionate people—will come to play a significant role in many organizations’ data science initiatives.” – Neil Raden, founder and principal analyst, Hired Brains Research.

 

52. “There’s a digital revolution taking place both in and out of government in favor of open-sourced data, innovation, and collaboration.” – Kathleen Sebelius, former U.S. Secretary of Health and Human Services.

 

53. “Big data will replace the need for 80% of all doctors.” – Vinod Khosla, co-founder of Sun Microsystems and founder of Khosla Ventures.

 

54. “I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.” – Hal Varian, chief economist at Google.

 

This extensive list of data science quotes highlights the growing impact of the field on how modern businesses operate. Take inspiration from what leaders have said about data analytics, data wrangling, data privacy, and more. These quotes offer a unique view into the world of data and a good place to start exploring it.
