Name(s): Meghana Paruchuri and Rishitha Talluri
Meals in Minutes: Predicting Prep Time with Nutritional Insights
INTRODUCTION
For this project, we will be working with the recipe and ratings dataset that describes different foods to make and related information. The main question we are aiming to answer is: what types of recipes tend to take longer to cook? Specifically, we aim to use the “healthiness” of the recipe to predict cooking time.
Understanding the relationship between nutrition and cooking time is important in helping home cooks make more informed decisions about what to prepare, especially if they’re looking for healthier recipes that fit into tight schedules. It could also provide insight onto how the nutritional values of a dish may be correlated with its cooking time.
The initial dataset for recipes has 231637 rows × 12 columns, whereas the dataset for ratings has 1132367 rows × 5 columns. When we move forward with the next part of the project, we will clean and merge the two datasets. The columns that are relevant to our research include:
- “minutes”: how long it takes to prepare the recipe
- “nutrition”: nutrition information written as [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]
- “id”: the recipe id
DATA CLEANING AND EXPLORATORY DATA ANALYSIS
🧼 Data Cleaning
1. Merging Recipes and Ratings
- Combined the recipes and ratings datasets using a left join on recipe ID
- This ensures every recipe is preserved while incorporating all related user ratings and reviews
- Resulted in a wider dataset where each recipe may have multiple entries (one per rating)
2. Replacing Zero Ratings with NaN
- Ratings with a value of 0 likely means the recipe was “not rated” rather than one that truly received a rating of 0
- Replacing them with NaN ensures they’re excluded from average calculations, improving accuracy and general analysis of our data
3. Creating Average Ratings per Recipe
- Computed the mean rating for each recipe
- Mapped this back into the main dataset as a new column avg_rating
- This creates a useful metric for recipe quality based on the general consensus of all the reviews for each recipe
4. Splitting Nutrition Information into Columns
- The original nutrition column was a single string with all nutrient values
- Used regex to extract numeric values and split them into individual columns (calories, fat, protein, etc.)
- This enables us to perform quantitative analysis of the different nutrition values, allowing us to model and visualize these values
5. Removing Extreme Outliers
- Removed recipes with extremely long preparation times (>2 days) and high calorie counts (<4000 cals)
- These extreme values are most likely user entry errors or edge cases that will skew our analysis
name | minutes | calories | fat | sugar | sodium | protein | saturated_fat | carbs |
---|---|---|---|---|---|---|---|---|
1 brownies in the world best ever | 40 | 138.4 | 10.0 | 50.0 | 3.0 | 3.0 | 19.0 | 6.0 |
1 in canada chocolate chip cookies | 45 | 595.1 | 46.0 | 211.0 | 22.0 | 13.0 | 51.0 | 26.0 |
412 broccoli casserole | 40 | 194.8 | 20.0 | 6.0 | 32.0 | 22.0 | 36.0 | 3.0 |
412 broccoli casserole | 40 | 194.8 | 20.0 | 6.0 | 32.0 | 22.0 | 36.0 | 3.0 |
412 broccoli casserole | 40 | 194.8 | 20.0 | 6.0 | 32.0 | 22.0 | 36.0 | 3.0 |
1️⃣ Univariate Analysis
Distribution of Calories
This plot is a histogram demonstrating the distribution of calories. We can see that the curve is right skewed, with the majority of recipes falling under the 200 - 400 calorie range, then trailing off as the number of calories increases. It is important to note that there are a few outliers in the 3000 and above range as well.
Distribution of Preparation Times
This plot is also a histogram showing the univariate distribution of the ‘minutes’ column. This curve is heavily right skewed, where the vast majority of recipes take under 49 minutes to make, with minimal to none taking above 550 minutes
2️⃣ Bivariate Analysis
Distribution of Preperation Times by Protein Level
This box plot reveals that recipes with higher protein levels (PDV) may have slightly longer cooking times, as shown by the upward trend in median prep time across protein groups. This suggests that “healthier” high-protein recipes typically take more time to prepare than lower-protein options, helping answer our question about how recipe healthiness relates to cooking duration.
3️⃣ Interesting Aggregates
Distribution of Healthy Recipes and Average Preperation Time
The “healthy” tag plot actually shows that recipes marked healthy have shorter average prep times compared to non-healthy recipes. While this isn’t what we expected there are other factors that could explain this. Unhealthy recipes may be family-sized portions (casseroles, baked goods) while healthy ones may be single size portions.
Distribution of Average Prepartion Time based on Recipe by Sodium Level
The lower sodium level recipes have shorter average prep times compared to higher sodium level recipes. This could also be explained by other factors such as scale of recipes. Restaurant meals take awhile to make, are high in sodium, and produce high volumes of food.
4️⃣ Imputation
There are no NaN values in the nutrition columns (which are the features we are using), so we don’t need to conduct any imputation.
Framing a Prediction Problem
Problem Type
This is a regression problem, as the target variable we are predicting — the number of minutes it takes to make a recipe — is a continuous numerical value rather than a categorical variable.
Response Variable
We are predicting the minutes column, which represents how long it takes to prepare a recipe.
Features Used
Our features come from the nutrition data, which includes:
- Calories
- Sodium
- Protein
- Carbohydrates
These features are available before a user begins cooking, making them valid for training the prediction model
Why This Problem?
Understanding the relationship between a recipe’s nutritional information and how long it takes to prepare can help users make better choices based on health goals and time constraints. Especially with people who live busy, on-the-go lifestyles – like the average college student – it is important to understand if they can make healthy recipes in minimal time.
Evaluation Metric
We are going to use the R² metric to evaluate how well the features are correlated with the dependent variable, preparation time. This makes it an insightful metric for regression, as we want our model to fit our data well (as this means more accurate predictions). This is directly indicated by an R² value close to 1.
Baseline Model
This is a linear regression model that predicts the cooking time in minutes it takes to make a recipe based on calories and protein. Minutes, calories, and protein are all quantitative features in the model. We didn’t need to do any encodings because all the features were quantitative. However, the features were standardized using StandardScaler to ensure they were on the same scale, which helps improve the performance of linear regression by preventing features with larger magnitudes from dominating the model. The model was trained on 80% of the data, with the remaining 20% reserved for testing. Performance was evaluated using R². This model performs poorly because we get a R² close to 0 which means the model explains almost none of the variance in cooking times. This also means it performs almost as poorly as a constant prediction of the mean number of minutes.
Final Model
Sodium_level helps distinguish between processed and fresh recipes. Higher-sodium dishes often require less prep time as they may be prepackaged, while low-sodium meals may take longer since you may need to make them from scratch. Log_calories transformation accounts for the non-linear relationship between calorie content and cooking time, preventing high-calorie outliers from skewing predictions. The modeling algorithm chosen for this task was Random Forest Regressor. The preprocessing pipeline standardizes log_calories and one-hot encodes sodium_level (and dropping one to prevent multicollinearity). The model was improved using GridSearchCV, optimizing mean squared error, and the best hyperparameters were a max_depth of 45 and 100 n_estimators. The final model achieves an R² of approximately 0.437, a major improvement from the baseline model which had an R² of approximately 0.03 . The improvement means our final model predicts cooking times much more accurately than the baseline. It now explains 44% of what affects cooking time and makes smaller errors in predictions.