Data Science Dojo
Rahim Rasool | February 17

Data Science Dojo has created an archive of 32 data sets for you to use to practice and improve your skills as a data scientist.

The repository covers a diverse range of themes, difficulty levels, sizes, and attributes. The data sets are grouped by difficulty level so that there is something suitable for everyone.

They let you challenge your knowledge and get hands-on practice in areas including, but not limited to, exploratory data analysis, data visualization, data wrangling, machine learning, and everything else essential to learning data science.

 


The data sets below are sorted in order of increasing difficulty for convenience (Beginner, Intermediate, Advanced). We recommend you test yourself with all of the data sets we’ve provided. We’ve presented a challenging question with each one, but feel free to use them in any way you wish.

1. Find out the age of Abalone from physical measurements

Level: Beginner

Recommended Use: Regression Models

Domain: Environment

Link to Dataset

2. Predict student’s knowledge level

Level: Beginner

Recommended Use: Classification/Clustering

Domain: Education/Web

Link to Dataset

This data set has 403 rows and 6 columns. It is a real data set about the students’ knowledge status about the subject of Electrical DC Machines.

3. Can you predict the price of a house?

Level: Beginner

Recommended Use: Regression Models

Domain: Real Estate

Link to Dataset

With 414 rows and 7 columns related to various attributes of a house, this data set provides the market historical data of real estate valuations which are collected from Sindian Dist., New Taipei City, Taiwan.

 


 

4. Can you estimate the location from Wi-Fi signal strength?

Level: Beginner

Recommended Use: Classification Models

Domain: Mobile/Location

Link to Dataset

This beginner-level data set has 2,000 rows and 8 columns. It contains the Wi-Fi signal strengths from 7 Wi-Fi devices, observed on a smartphone in an indoor space, which can be used to estimate which of four rooms the phone is in.
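As a starting point for this question, here is a minimal sketch of a nearest-centroid classifier: it averages the signal vectors seen in each room and assigns a new reading to the closest average. The readings, room labels, and function names below are made up for illustration and are not taken from the data set; any other classifier would work equally well.

```python
from math import dist
from statistics import mean


def fit_centroids(samples):
    """samples: list of (signal_vector, room). Returns {room: centroid vector}."""
    by_room = {}
    for signals, room in samples:
        by_room.setdefault(room, []).append(signals)
    # Average each of the 7 signal columns separately for every room.
    return {
        room: [mean(column) for column in zip(*vectors)]
        for room, vectors in by_room.items()
    }


def predict_room(centroids, signals):
    """Return the room whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda room: dist(centroids[room], signals))


# Hypothetical readings (dBm) from 7 Wi-Fi devices, labeled by room.
training = [
    ([-40, -55, -60, -70, -65, -80, -85], 1),
    ([-42, -57, -58, -72, -66, -79, -86], 1),
    ([-70, -45, -50, -60, -75, -85, -80], 2),
    ([-68, -47, -52, -61, -74, -84, -82], 2),
]
centroids = fit_centroids(training)
predicted = predict_room(centroids, [-41, -56, -59, -71, -65, -80, -85])
print(predicted)
```

With the real data you would fit on a training split covering all four rooms and measure accuracy on held-out rows.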

5. Predict the acceptability of a car

Level: Beginner

Recommended Use: Classification Models

Domain: Automobile

Link to Dataset

The data set has 1,728 rows and 7 columns. Car attributes, such as price and technology, are described across 6 variables, including “Buying Price”, “Maintenance”, and “Safety”. Each of the 6 variables takes one of several categorical values. The car’s acceptability, the seventh attribute, is the outcome variable.

6. Predict the seminal quality of an individual

Level: Beginner

Recommended Use: Regression/Classification Models

Domain: Healthcare/Life

Link to Dataset

This data set has 10 attributes. It includes semen samples from 100 volunteers, analyzed according to the WHO 2010 criteria. It can be used to determine whether a diagnosis can be reached without a laboratory approach, which involves expensive tests that are sometimes uncomfortable for patients. The attributes in this data set can be collected easily with a questionnaire and used to estimate sperm concentration.

7. Estimate the chance of bankruptcy from qualitative parameters by experts

Level: Beginner

Recommended Use: Classification Models

Domain: Finance/Banking

Link to Dataset

This data set has 250 rows and 7 columns. It contains 6 qualitative parameters from experts which can be used to predict bankruptcy.

If you want to further develop your data modeling skill set, consider attending Data Science Dojo’s data science bootcamp.

8. Can you predict the fuel efficiency of a car?

Level: Intermediate

Recommended Use: Regression Models

Domain: Automobiles

Link to Dataset

This data set has 398 rows and 9 columns. It provides mileage, horsepower, model year, and other technical specifications for cars.

9. Was that chest pain an indicator of heart disease?

Level: Intermediate

Recommended Use: Classification Models

Domain: Health Sciences

Link to Dataset

This data set provides health examination data for 303 patients who presented with chest pain and might have been suffering from heart disease. The data set has 14 attributes that can be used to determine whether a patient was diagnosed with heart disease.

10. Predict the total demand for orders

Level: Intermediate

Recommended Use: Regression Models

Domain: Business

Link to Dataset

This intermediate-level data set has 60 rows and 13 columns. The data was collected over 60 days from the real database of a Brazilian logistics company. It has twelve predictive attributes and a target, which is the total number of orders for daily treatment.

11. Find out if a donor will give blood in March 2007

Level: Intermediate

Recommended Use: Classification Models

Domain: Business

Link to Dataset

This data set has 748 instances and 5 attributes. The data is from the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. About every three months, the center drives its blood transfusion service bus to a university in Hsin-Chu City to collect donated blood.

12. Forecast pollution level of a city

Level: Intermediate

Recommended Use: Regression Models

Domain: Environment

Link to Dataset

This data set has 43,824 rows and 13 columns. It contains the PM2.5 data from the US Embassy in Beijing. Meteorological data from Beijing Capital International Airport is also included. The data set can be used for pollution level forecasting using the Air Quality attributes provided. It will also offer experience in Multivariate Time Series Forecasting.
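Multivariate time series forecasting usually starts by turning the ordered readings into supervised (features, target) pairs, where the features are the previous few observations of every variable. The sketch below shows one way to do that. The column layout (PM2.5 first, then temperature and wind speed), the values, and the function name are assumptions made for illustration, not the data set's actual schema.

```python
def make_supervised(rows, n_lags, target_index=0):
    """Turn time-ordered rows into (features, target) pairs.

    features: the previous n_lags rows, flattened into one list
    target:   the target column of the current row
    """
    pairs = []
    for t in range(n_lags, len(rows)):
        window = rows[t - n_lags:t]
        features = [value for row in window for value in row]
        pairs.append((features, rows[t][target_index]))
    return pairs


# Hypothetical hourly readings: (pm25, temperature, wind_speed)
hourly = [
    (120.0, -4.0, 1.8),
    (135.0, -5.0, 2.7),
    (150.0, -5.0, 3.6),
    (140.0, -6.0, 5.4),
]
pairs = make_supervised(hourly, n_lags=2)
for features, target in pairs:
    print(features, "->", target)
```

Each pair can then be fed to any regression model. Keep the split between training and test data chronological so the model never sees the future.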

13. Will the patient survive for at least one year after a heart attack?

Level: Intermediate

Recommended Use: Classification Models

Domain: Healthcare

Link to Dataset

This data set has 132 rows and 12 columns. It provides data that can be used to classify whether patients will survive for at least one year after a heart attack. All patients listed in the data set suffered heart attacks at some point in the past. Some are still alive and some are not.

14. Estimate compressive strength of concrete

Level: Intermediate

Recommended Use: Regression Models

Domain: Civil Engineering/Construction

Link to Dataset

This data set has 1,030 rows and 9 columns. Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in a laboratory.

15. Discover patterns relating to liver disorders and alcohol consumption

Level: Intermediate

Recommended Use: Classification/Regression/Clustering Models

Domain: Healthcare

Link to Dataset

This data set has 345 rows and 7 columns. The data set does not contain any variable representing the presence or absence of a liver disorder. The first five columns represent the result of various blood tests which may be of use in diagnosing alcohol-related liver disorders. The sixth represents the number of alcoholic drinks consumed per day by the subject (self-reported).

16. Predict which stock will provide the greatest rate of return

Level: Intermediate

Recommended Use: Clustering/Regression/Classification Models

Domain: Business/Finance

Link to Dataset

This data set has 750 rows and 16 columns. It contains weekly data for the Dow Jones Industrial Index, used in computational investing research. Each record holds one week of data for one stock, including the percentage return that the stock earns in the following week. Ideally, this could be used to determine which stock will produce the greatest rate of return in the following week.
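One way to frame this as a classification problem is to derive, for every week, the stock with the highest following-week return and use it as the label. The sketch below assumes each record has been reduced to a (week, ticker, next-week return) tuple. The tickers, values, and function name are hypothetical and not taken from the data set.

```python
def best_stock_per_week(records):
    """records: iterable of (week, ticker, next_week_return_pct).

    Returns {week: ticker with the highest following-week return}.
    """
    best = {}
    for week, ticker, next_return in records:
        if week not in best or next_return > best[week][1]:
            best[week] = (ticker, next_return)
    return {week: ticker for week, (ticker, _) in best.items()}


# Hypothetical records; real rows carry many more columns.
sample = [
    ("week_1", "AAA", 1.2),
    ("week_1", "BBB", -0.4),
    ("week_1", "CCC", 2.1),
    ("week_2", "AAA", 0.3),
    ("week_2", "BBB", 0.9),
    ("week_2", "CCC", -1.5),
]
winners = best_stock_per_week(sample)
print(winners)
```

The following-week return is the quantity being predicted, so it must not be used as an input feature when you train a model.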

17. Assess heating and cooling load requirements of buildings

Level: Intermediate

Recommended Use: Regression/Classification Models

Domain: Energy

Link to Dataset

This data set has 768 rows and 10 columns. It can be used for assessing the heating load and cooling load requirements of buildings (that is, energy efficiency) as a function of building parameters. The buildings differ in glazing area, glazing area distribution, and orientation, among other parameters.

18. Determine the type of glass using oxide content

Level: Intermediate

Recommended Use: Classification Models

Domain: Physical

Link to Dataset

This data set has 214 rows and 10 columns. It provides details about 6 types of glass, defined in terms of their oxide content (e.g., Na, Fe, K).

19. Predict the chance of survival

Level: Intermediate

Recommended Use: Classification Models

Domain: Healthcare

Link to Dataset

This data set has 155 rows and 20 columns. It provides various attributes of patients suffering from hepatitis and can be used to predict a patient’s chance of survival, among other purposes.

20. Find patterns from spending data at wholesale

Level: Intermediate

Recommended Use: Classification/Clustering

Domain: Business/Retail

Link to Dataset

This data set has 440 rows and 8 columns. The data refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.

21. Group similar travel reviews

Level: Intermediate

Recommended Use: Clustering/Classification Models

Domain: Web

Link to Dataset

This data set, populated by crawling TripAdvisor.com, has 980 rows and 11 columns. It covers reviews of destinations across East Asia in 10 categories. Each traveler rating is mapped as Excellent (4), Very Good (3), Average (2), Poor (1), or Terrible (0), and an average rating is recorded for each category per user.
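To make the rating scheme concrete, the following sketch maps rating labels to the numeric scale described above and averages them per user and category. The users, categories, and function name are hypothetical. The published data set already contains the averaged values, so this only illustrates how such averages could be derived.

```python
from collections import defaultdict
from statistics import mean

# Rating scale as described in the text above.
RATING_SCALE = {"Excellent": 4, "Very Good": 3, "Average": 2, "Poor": 1, "Terrible": 0}


def average_ratings(reviews):
    """reviews: iterable of (user, category, label).

    Returns {user: {category: mean numeric rating}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for user, category, label in reviews:
        grouped[user][category].append(RATING_SCALE[label])
    return {
        user: {category: mean(scores) for category, scores in categories.items()}
        for user, categories in grouped.items()
    }


# Hypothetical reviews, not taken from the data set.
sample_reviews = [
    ("user_1", "museums", "Excellent"),
    ("user_1", "museums", "Average"),
    ("user_1", "beaches", "Poor"),
    ("user_2", "museums", "Terrible"),
]
averages = average_ratings(sample_reviews)
print(averages)
```

The resulting per-user vectors of category averages are the kind of input a clustering algorithm can group.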

22. Relate returns of Istanbul Stock Exchange with other international indices

Level: Intermediate

Recommended Use: Regression/Classification Models

Domain: Business/Finance

Link to Dataset

This data set has 536 rows and 9 columns. It includes returns of the Istanbul Stock Exchange (ISE) along with seven other international indices: SP, DAX, FTSE, NIKKEI, BOVESPA, MSCI_EU, and MSCI_EM. It can be used to find a predictive relationship between the ISE100 and other international stock market indices.

23. Predict bike rental count (hourly/daily) based on the environmental & seasonal settings

Level: Intermediate

Recommended Use: Regression Models

Domain: Social

Link to Dataset

This data set, consisting of 17,379 rows and 17 columns, contains the hourly and daily counts of rental bikes in the Capital Bikeshare system for 2011 and 2012, with the corresponding weather and seasonal information. Bike rental demand is highly correlated with environmental and seasonal settings.

24. Detect Room Occupancy through Light, Temperature, Humidity, and CO2 sensors

Level: Intermediate

Recommended Use: Classification Models

Domain: Energy/Buildings

Link to Dataset

This data set has 20,560 rows and 7 attributes. It provides experimental data used for binary classification (room occupancy of an office room) from Temperature, Humidity, Light, and CO2. Ground-truth occupancy was obtained from time-stamped pictures that were taken every minute.

25. Estimate whether a person’s income exceeds $50K/year

Level: Intermediate

Recommended Use: Classification Models

Domain: Social/Government

Link to Dataset

This data set was extracted from the census bureau database. It has 48,842 instances and 15 attributes, which include age, sex, education level, and other relevant details about a person.

26. Coronavirus (COVID-19) Dataset

Level: Intermediate

Recommended Use: Classification Models

Domain: Health Sciences

Link to Dataset

The recent outbreak of the novel coronavirus has caused great concern around the world. It has affected tens of thousands of people, mostly in China. The outbreak, which originated in the Chinese city of Wuhan, has been declared a global emergency by the World Health Organization (WHO).

This data set consists of 4 files collected from various sources. The first file, 2019ncovdata.csv, contains daily information on the number of 2019-nCoV cases across the globe. The other three files contain time series data of confirmed cases, deaths, and recovered cases, respectively.

This data set has been sourced from Kaggle and Johns Hopkins University. This dataset is provided to the public strictly for educational and academic research purposes.

27. Detect Autistic Spectrum Disorder (ASD) Cases

Level: Advanced

Recommended Use: Classification Models

Domain: Healthcare/Social Sciences

Link to Dataset

This advanced-level data set contains Autistic Spectrum Disorder (ASD) screening test data for 704 adults. It has 21 attributes, including the test takers’ demographics and their answers to the 10 questions in the screening test. Each test taker’s ASD status is recorded under the Class/ASD variable.

28. Estimate the probability of default

Level: Advanced

Recommended Use: Classification Models

Domain: Business/Finance

Link to Dataset

This data set has 30,000 rows and 24 columns. The data set could be used to estimate the probability of default payment by credit card clients using the data provided.

29. Predict if a note is genuine

Level: Advanced

Recommended Use: Classification Models

Domain: Banking/Finance

Link to Dataset

This advanced-level data set has 1,372 rows and 5 columns. The data were extracted from images of genuine and forged banknote-like specimens, which were taken for the evaluation of a banknote authentication procedure and later digitized. A wavelet transform tool was used to extract features from the images.

30. Find a short-term forecast on the electricity consumption of a single home

Level: Advanced

Recommended Use: Regression/Clustering Models

Domain: Electricity

Link to Dataset

This data set has 2,075,259 rows and 9 columns. It provides measurements of electric power consumption in one household, sampled once per minute over almost 4 years. Several electrical quantities and some sub-metering values are available.
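With one reading per minute, a common first step before short-term forecasting is to aggregate the series to a coarser interval. The sketch below averages minute-level readings into hourly values. The timestamps, power values, and function name are made up for illustration, and hourly averaging is only one reasonable choice of interval.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_mean(readings):
    """readings: iterable of (timestamp, kilowatts).

    Returns {hour_start: mean kilowatts}, ordered by hour.
    """
    buckets = defaultdict(list)
    for timestamp, kilowatts in readings:
        # Truncate each timestamp to the start of its hour.
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(kilowatts)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Hypothetical minute-level readings, not taken from the data set.
minute_readings = [
    (datetime(2007, 1, 1, 17, 58), 2.0),
    (datetime(2007, 1, 1, 17, 59), 4.0),
    (datetime(2007, 1, 1, 18, 0), 1.0),
    (datetime(2007, 1, 1, 18, 1), 3.0),
    (datetime(2007, 1, 1, 18, 2), 5.0),
]
hourly = hourly_mean(minute_readings)
for hour, kilowatts in hourly.items():
    print(hour, kilowatts)
```

Aggregating shrinks the two million rows to a size that is easier to explore and smooths out minute-to-minute noise, at the cost of hiding very short spikes.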

31. Predict the number of shares on social networks

Level: Advanced

Recommended Use: Regression/Classification Models

Domain: Business/Web

Link to Dataset

This data set has 39,644 rows and 61 columns. It summarizes a heterogeneous set of features about articles published by Mashable over 2 years and can be used to predict the number of shares of an article on social networks.

32. Amazon Product Reviews Data

Level: Advanced

Recommended Use: Text Analytics

Domain: E-commerce

Link to Dataset

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 – July 2014.

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

This dataset is well suited to sentiment analysis and similar text analytics tasks.

Need some help? Check out Data Science Dojo’s online data science boot camp!
