Vegetarian Protein Source Analysis
Status: active
I created a MySQL database of nutritional data from foods I like that are high in protein. I plan to use Python, specifically the NumPy and pandas modules, to analyze that data. That analysis will be a great help in planning a healthier diet for myself.
Future plans
- Do some research to determine more potentially useful data points.
- Research foods I could have overlooked or have never heard of that would be worth analyzing.
- Collect more data, both from a larger selection of the food I have on hand and from the internet.
- Finalize a good design for the database. Potentially a table for raw numerical data, a table for calculated data, and a table for non-numerical data.
- Categorize the foods by type and subtype, e.g. processed plant protein and legume for types, wheat gluten and soy protein for subtypes.
- Update existing data with categories.
- Restructure the database to match my design. Create triggers, procedures, and functions where appropriate.
- Insert the new data.
- Research a more scientific method for ranking how healthy each food is.
- Apply appropriate statistical techniques to analyze the data using Pandas, informed by my research.
Progress journal
2024-11-13
I collected nutritional data from the packaging of foods I like that are significant sources of protein, recording it in a note on my phone. I took care to maintain a consistent structure for the data in the note.
A few lines from my note
Lentils: 7g per 125ml, 0.4g fat, 14g carbs
Firm tofu: 14g per 85g, 7g fat, 3g carbs
Yves “the good” burger: 13g per 75g, 2g fat, 7g carbs
2024-12-01
I copied my note into a text file on my home server, and wrote a Python function to parse that file into a Python list of tuples.
parse_txt_data.py
import re


def split_at_unit(arg_str: str) -> tuple[float, str]:
    """
    Split str into a float and a unit string
    e.g. '125ml' becomes (125.0, 'ml')
    """
    split_re = re.compile(r"(\d+\.?\d*)\s*(\w+)")
    match = split_re.search(arg_str)
    if match:
        groups = match.groups()
        return (float(groups[0]), groups[1])
    print(f"string '{arg_str}' is not a number")
    return (-1, "")


def parse(txt_file) -> list:
    """
    Parse the specific text file in question into a list of its entries
    The line structure of the text file:
    <name>: <protein_val> per <ref_amount>, <fat_val> fat, <carb_val> carbs
    name is the name of a food item
    ref_amount is the amount of that food the nutrition values are based on
    protein_val, fat_val, and carb_val are the number values of those components of the food
    ref_amount may be a volume in millilitres or a mass in grams
    The nutrition values are masses in grams
    """
    result = []
    for line in txt_file.readlines():
        name = line[: (col_idx := line.find(":"))]
        food_values = []
        # parse the numbers
        for val in line[col_idx + 1 :].split(","):
            # parse the reference amount
            if (per_idx := val.find("per")) >= 0:
                ref_split = split_at_unit(val[per_idx + 4 :])
                food_values.append(ref_split[0])
                food_values.append(ref_split[1])
                protein_split = split_at_unit(val[:per_idx])
                food_values.append(protein_split[0])
                food_values.append(protein_split[1])
            else:
                # get the food values
                val_split = split_at_unit(val)
                food_values.append(val_split[0])
                food_values.append(val_split[1])
        result.append((name, *food_values))
    return result
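As a quick sanity check, the unit-splitting behaviour can be exercised on its own. This is a condensed standalone copy of split_at_unit (the no-match print is dropped for brevity), with a couple of example inputs:

```python
import re


def split_at_unit(arg_str: str) -> tuple[float, str]:
    """Split a string like '125ml' into (125.0, 'ml')."""
    match = re.search(r"(\d+\.?\d*)\s*(\w+)", arg_str)
    if match:
        return (float(match.group(1)), match.group(2))
    return (-1, "")


print(split_at_unit("125ml"))  # (125.0, 'ml')
print(split_at_unit(" 0.4g"))  # (0.4, 'g')
print(split_at_unit("fat"))    # (-1, '')
```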
With the data in a workable format, I set up my Python project virtual environment. After running venv, I created the other basic files for a Python module, including adding the mysql-connector-python module to the requirements file. I set up pre-commit with my usual generic hooks, some pre-commit hooks specifically for Python, and the Python code formatter black.
Then I installed and configured the MySQL server on my home server. Using the MySQL command line interface, I created the nutrition database, and a database user called "protein" with access permissions for only that database.
To avoid hard coding the database password, I created a .env file in my Python project with the credentials for the user I created. I used the python-dotenv module to load those credentials into my main Python program.
I finished the initial setup of the database using Python. I created an array of the column names for my data in my Python program. Then I connected to my local MySQL server as the user "protein" and created the vegprotein table.
The CREATE TABLE statement for the vegprotein table
CREATE TABLE vegprotein(
    vegprotein_id INT(11) PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    ref_num FLOAT NOT NULL,
    ref_unit VARCHAR(255) NOT NULL DEFAULT 'g',
    protein_num FLOAT NOT NULL,
    protein_unit VARCHAR(255) NOT NULL DEFAULT 'g',
    fat_num FLOAT NOT NULL,
    fat_unit VARCHAR(255) NOT NULL DEFAULT 'g',
    carb_num FLOAT NOT NULL,
    carb_unit VARCHAR(255) NOT NULL DEFAULT 'g',
    CONSTRAINT gt0 CHECK (ref_num > 0 AND protein_num > 0 AND fat_num > 0 AND carb_num > 0)
)
My choices of data types were mostly standard. I chose FLOAT for the numeric data, because my anticipated use cases wouldn't need more than single precision. I set the default unit to "g" for each column, because grams are the most common unit for nutritional data. I added a constraint to enforce positive values for the numerical data, as none of the foods in my sample had zero protein, carbohydrates, or fat. (I have since changed the constraint to allow zero values.)
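The revised constraint (zero values allowed) can also be mirrored in Python before data reaches the database, so a bad row fails fast with a clear message. This validate_row helper is a hypothetical addition, not part of the program above; it assumes rows shaped like the parser's output, with the numeric values at the odd indices:

```python
def validate_row(row: tuple) -> bool:
    """Check that ref_num, protein_num, fat_num, and carb_num are all >= 0.

    Expects (name, ref_num, ref_unit, protein_num, protein_unit,
    fat_num, fat_unit, carb_num, carb_unit).
    """
    # row[1::2] picks out the four numeric fields
    return all(isinstance(v, (int, float)) and v >= 0 for v in row[1::2])


print(validate_row(("Lentils", 125.0, "ml", 7.0, "g", 0.4, "g", 14.0, "g")))  # True
print(validate_row(("Bad row", -1.0, "g", 7.0, "g", 0.4, "g", 14.0, "g")))    # False
```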
I already knew exactly how I could insert all of my data with a for loop, so I decided to challenge myself with the cursor's executemany method instead. List comprehensions are one of my favourite Python features, so I also had more fun with that less straightforward approach.
The Python code for inserting the data I collected
import os

import mysql.connector
from dotenv import load_dotenv

from parse_txt_data import parse

PROTEIN_COLUMNS = [
    "vegprotein_id",
    "name",
    "ref_num",
    "ref_unit",
    "protein_num",
    "protein_unit",
    "fat_num",
    "fat_unit",
    "carb_num",
    "carb_unit",
]

if __name__ == "__main__":
    with open("./pyvegprotein/data.txt", "r", encoding="utf-8") as f:
        food_data = parse(f)
    load_dotenv()
    DB_HOST = os.getenv("db_host")
    DB_USER = os.getenv("db_user")
    DB_PASSWORD = os.getenv("db_password")
    DB_NAME = os.getenv("db_name")
    proteindb = mysql.connector.connect(
        host=DB_HOST, user=DB_USER, password=DB_PASSWORD, database=DB_NAME
    )
    cursor = proteindb.cursor()
    col_str = "".join([s + ", " for s in PROTEIN_COLUMNS[1:]])
    col_str = col_str[:-2]
    val_str = "%s, " * (len(PROTEIN_COLUMNS) - 2) + "%s"
    sql_str = f"INSERT INTO vegprotein ({col_str}) VALUES ({val_str})"
    cursor.executemany(sql_str, food_data)
    proteindb.commit()
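The column and placeholder strings can be checked without touching the database. For reference, `", ".join(...)` builds the same strings as the comprehension above in one step:

```python
PROTEIN_COLUMNS = [
    "vegprotein_id",
    "name",
    "ref_num",
    "ref_unit",
    "protein_num",
    "protein_unit",
    "fat_num",
    "fat_unit",
    "carb_num",
    "carb_unit",
]

# skip the auto-increment primary key; the database assigns it
col_str = ", ".join(PROTEIN_COLUMNS[1:])
val_str = ", ".join(["%s"] * len(PROTEIN_COLUMNS[1:]))
sql_str = f"INSERT INTO vegprotein ({col_str}) VALUES ({val_str})"
print(sql_str)
```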
2024-12-08
I researched where to find food density data so I could convert reference amounts in millilitres to grams. Finding food densities wasn't as simple as I had hoped. This PDF from a United Nations organization was the most reliable source of food density data I found. I had to look elsewhere for data on some foods, but I don't want to give those sites any additional traffic.
I wanted an opportunity to use relational database features, so I decided to put the density data into a new table. I created the fooddata table with three columns: food_id, the primary key; vegprotein_id, a foreign key to the vegprotein table; and density, the density of the food in grams per millilitre. I chose that table name because I expected to add more numerical data at a later date, likely by moving some columns from vegprotein to fooddata. (The scope I originally had in mind for this project didn't warrant more than one table. I'm now working on designing the database properly.)
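The conversion the density column enables is a single multiplication: a reference amount in millilitres times a density in grams per millilitre gives a mass in grams. A minimal sketch, using the lentils figures from my data:

```python
def ml_to_grams(volume_ml: float, density_g_per_ml: float) -> float:
    """Convert a volume in millilitres to a mass in grams."""
    return volume_ml * density_g_per_ml


# Lentils: a 125 ml reference amount at 0.85 g/ml
print(ml_to_grams(125, 0.85))
```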
With two tables to work with, I was ready to start writing useful SQL. I started with a query for viewing the data relevant to converting volumes to masses.
My query that selects foods with volumes
SELECT name, ref_num, ref_unit, density
FROM vegprotein INNER JOIN fooddata
ON vegprotein.vegprotein_id = fooddata.vegprotein_id
WHERE ref_unit = 'ml'
The result:
+-----------------------------+---------+----------+---------+
| name | ref_num | ref_unit | density |
+-----------------------------+---------+----------+---------+
| Lentils | 125 | ml | 0.85 |
| “Dark red” kidney beans | 125 | ml | 0.79 |
| Organic chickpeas | 125 | ml | 0.71 |
+-----------------------------+---------+----------+---------+
3 rows in set (0.00 sec)
2024-12-09
I turned that query into a stored procedure.
My GetFoodsWithVolumes procedure
CREATE PROCEDURE GetFoodsWithVolumes ()
BEGIN
    SELECT name, ref_num, ref_unit, density
    FROM vegprotein INNER JOIN fooddata
        ON vegprotein.vegprotein_id = fooddata.vegprotein_id
    WHERE ref_unit = 'ml';
END
Then I created an SQL function for calculating the mass of a food with a given vegprotein_id using the data in the database.
My getMass function CREATE statement
CREATE FUNCTION getMass (vegid INT(11))
RETURNS FLOAT READS SQL DATA
BEGIN
    SET @ml := 0;
    SET @gml := 0;
    SELECT ref_num, density
    INTO @ml, @gml
    FROM vegprotein INNER JOIN fooddata
        ON vegprotein.vegprotein_id = fooddata.vegprotein_id
    WHERE vegprotein.vegprotein_id = vegid;
    RETURN @ml * @gml;
END
That function turned getting the masses of foods with reference amounts in millilitres into a simple select statement.
A simple select statement using my getMass function
SELECT name, getMass(vegprotein_id) AS 'mass(grams)'
FROM vegprotein
WHERE ref_unit = 'ml'
The result:
+-----------------------------+------------------------+
| name | mass(grams) |
+-----------------------------+------------------------+
| Lentils | 106.25 |
| “Dark red” kidney beans | 98.75 |
| Organic chickpeas | 88.75 |
+-----------------------------+------------------------+
3 rows in set (0.00 sec)
2024-12-10
I had written individual INSERT statements for each row I added to the fooddata table. I didn't want to do that every time I needed to insert new data, so I wrote two Python functions to automate more of the process.
Python functions for inserting data using mysql.connector
def insert_food(db, food: tuple):
    """
    insert a food row into db
    """
    cursor = db.cursor()
    col_str = ", ".join(PROTEIN_COLUMNS[1:])
    val_str = ", ".join(["%s"] * len(PROTEIN_COLUMNS[1:]))
    # let the driver quote and escape the values
    cursor.execute(f"INSERT INTO vegprotein ({col_str}) VALUES ({val_str})", food)
    db.commit()


def insert_density(db, density: float, like_str: str):
    """
    insert a food density value into db
    density: the value to insert in g/ml
    like_str: the string that should match the corresponding food item. Supplied to the SQL LIKE operator
    """
    cursor = db.cursor()
    cursor.execute("SELECT vegprotein_id FROM vegprotein WHERE name LIKE %s", (like_str,))
    row = cursor.fetchone()
    if row is None:
        raise ValueError(f"no food matches {like_str!r}")
    cursor.execute(
        "INSERT INTO fooddata (density, vegprotein_id) VALUES (%s, %s)",
        (density, row[0]),
    )
    db.commit()
I can see modifying insert_density to do some of the work of producing a string for the LIKE operator in the future. It does the job well enough for now though.
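One possible shape for that improvement is a small helper that escapes SQL wildcard characters in a fragment of the food name and wraps it in %. The name make_like_pattern is my own placeholder, not something already in the project:

```python
def make_like_pattern(name_fragment: str) -> str:
    """Build a LIKE pattern matching any name that contains the fragment.

    Escapes % and _ so they match literally rather than as wildcards.
    """
    escaped = name_fragment.replace("%", r"\%").replace("_", r"\_")
    return f"%{escaped}%"


print(make_like_pattern("tofu"))  # %tofu%
```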
2024-12-11
I started working on the data science part of the project. First I wrote a query to order my list of foods by their ratio of protein to total mass.
My protein ratio query
SELECT
name,
CAST(FORMAT(((protein_num / mass) * 100), 0) AS UNSIGNED) AS protein_percent,
protein_num,
mass
FROM vegprotein
ORDER BY protein_percent DESC
The result:
+------------------------------------+-----------------+-------------+--------+
| name | protein_percent | protein_num | mass |
+------------------------------------+-----------------+-------------+--------+
| “Only peanuts” peanut butter | 27 | 4 | 15 |
| Yves Canadian bacon | 26 | 10 | 38 |
| Field roast classic smoked sausage | 26 | 22 | 85 |
| Tofurky Italian sausages | 25 | 23 | 92 |
| PC Tempeh | 20 | 17 | 85 |
| Mixed nuts | 18 | 9 | 50 |
| Yves “the good” burger | 17 | 13 | 75 |
| Firm tofu | 16 | 14 | 85 |
| Eggs | 12 | 13 | 105 |
| Organic chickpeas | 10 | 9 | 88.75 |
| Soft tofu | 8 | 7 | 85 |
| Lentils | 7 | 7 | 106.25 |
| “Dark red” kidney beans | 7 | 7 | 98.75 |
+------------------------------------+-----------------+-------------+--------+
13 rows in set (0.00 sec)
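The rounding in that query, FORMAT to zero decimal places and then CAST to an integer, corresponds closely to Python's round, which makes the percentages easy to spot-check (the two can differ at exact .5 halfway values, which none of these rows hit):

```python
def protein_percent(protein_g: float, mass_g: float) -> int:
    """Protein as a whole-number percentage of total mass."""
    return round(protein_g / mass_g * 100)


# "Only peanuts" peanut butter: 4 g protein per 15 g total
print(protein_percent(4, 15))      # 27
# Yves "the good" burger: 13 g protein per 75 g total
print(protein_percent(13, 75))     # 17
# Lentils: 7 g protein per 106.25 g total
print(protein_percent(7, 106.25))  # 7
```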
I wanted to calculate more relationships between the numbers I had, so it made sense to store the results of those calculations in a new table: calculations.
The CREATE TABLE statement for the calculations table
CREATE TABLE calculations(
    calc_id INT(11) PRIMARY KEY AUTO_INCREMENT,
    vegprotein_id INT(11),
    protein_ratio FLOAT,
    carb_ratio FLOAT,
    fat_ratio FLOAT,
    health_ratio FLOAT,
    FOREIGN KEY(vegprotein_id) REFERENCES vegprotein(vegprotein_id)
        ON DELETE CASCADE
        ON UPDATE CASCADE
)
Then I did all the calculations in an INSERT statement. To start simply, I subtracted the masses of carbohydrates and fat from the mass of protein for each food, then divided that number by the total mass to arrive at a number I labelled health_ratio.
The health_ratio formula:
g: the total mass in grams
pg: the mass of protein in grams
cg: the mass of carbs in grams
fg: the mass of fat in grams
health_ratio = (pg - cg - fg) / g
That calculation was only a small part of the analysis I have planned. I know I want to make a similar calculation with a different weighting applied to each nutrient value. Further research will help me determine appropriate weights and additional calculations to perform.
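A weighted version of the calculation could look like the sketch below: one coefficient per nutrient, with unit weights reproducing the current health_ratio. The weights are placeholders pending that research, not recommended values:

```python
def health_ratio(protein_g: float, carb_g: float, fat_g: float, mass_g: float,
                 w_protein: float = 1.0, w_carb: float = 1.0, w_fat: float = 1.0) -> float:
    """Weighted (protein - carbs - fat) / total mass.

    With all weights at 1.0 this is the current health_ratio formula.
    """
    return (w_protein * protein_g - w_carb * carb_g - w_fat * fat_g) / mass_g


# Lentils: 7 g protein, 14 g carbs, 0.4 g fat, 106.25 g total mass
print(round(health_ratio(7, 14, 0.4, 106.25) * 100, 2))  # -6.96
```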
The statement that inserted data into the calculations table
INSERT INTO calculations(
vegprotein_id,
protein_ratio,
carb_ratio,
fat_ratio,
health_ratio
)
SELECT
vegprotein_id,
protein_num / mass,
carb_num / mass,
fat_num / mass,
(protein_num - carb_num - fat_num) / mass
FROM vegprotein
Because I stored the calculated data in a format that was easier for a machine to work with, I wrote a function to format it into human-readable percentages.
My percentage function
DELIMITER //
CREATE FUNCTION percentage(float_num FLOAT, dec_num INT)
RETURNS VARCHAR(255) DETERMINISTIC
BEGIN
    DECLARE float_str VARCHAR(255);
    DECLARE dec_length INT;
    DECLARE _dec INT;
    SET float_str := CAST(float_num AS CHAR);
    SET dec_length := CHAR_LENGTH(MID(float_str, (LOCATE('.', float_str) + 1)));
    SET _dec := IF(dec_num > dec_length, dec_length, dec_num);
    SET float_str := CONCAT(FORMAT((float_num * 100), _dec), '%');
    RETURN float_str;
END//
DELIMITER ;
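The same formatting is close to a one-liner in Python, which is convenient for checking the SQL function's output from the application side. Unlike the SQL version, this sketch doesn't cap the decimal places at the number actually present in the value:

```python
def percentage(ratio: float, decimals: int = 2) -> str:
    """Format a ratio like 0.2632 as a percent string like '26.32%'."""
    return f"{ratio * 100:.{decimals}f}%"


print(percentage(0.2632))  # 26.32%
print(percentage(0.2, 1))  # 20.0%
```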
Finally, I wrote a query to rank all the foods by health_ratio.
A very informative query
SELECT
name,
percentage(protein_ratio, 2) AS protein,
percentage(carb_ratio, 2) AS carbs,
percentage(fat_ratio, 2) AS fat,
FORMAT(health_ratio * 100, 2) AS health
FROM calculations AS calcs
JOIN vegprotein AS vp
ON calcs.vegprotein_id = vp.vegprotein_id
ORDER BY health_ratio DESC
The resulting table:
+------------------------------------+---------+--------+--------+--------+
| name | protein | carbs | fat | health |
+------------------------------------+---------+--------+--------+--------+
| Yves Canadian bacon | 26.32% | 2.63% | 1.05% | 22.63 |
| Field roast classic smoked sausage | 25.88% | 10.59% | 9.41% | 5.88 |
| Yves “the good” burger | 17.33% | 9.33% | 2.67% | 5.33 |
| Firm tofu | 16.47% | 3.53% | 8.24% | 4.71 |
| Soft tofu | 8.24% | 1.18% | 4.12% | 2.94 |
| PC Tempeh | 20.0% | 8.24% | 9.41% | 2.35 |
| Tofurky Italian sausages | 25.00% | 8.70% | 14.13% | 2.17 |
| Eggs | 12.38% | 0.95% | 10.48% | 0.95 |
| Lentils | 6.59% | 13.18% | 0.38% | -6.96 |
| “Dark red” kidney beans | 7.09% | 18.23% | 0.51% | -11.65 |
| Organic chickpeas | 10.14% | 32.68% | 3.94% | -26.48 |
| “Only peanuts” peanut butter | 26.67% | 20.0% | 46.67% | -40.00 |
| Mixed nuts | 18.00% | 24.00% | 52.00% | -58.00 |
+------------------------------------+---------+--------+--------+--------+
The rough order of the list was more or less what I was expecting. That nuts have too much fat and legumes too many carbohydrates had been clear to me before running the query. I also knew the two types of tofu were roughly equivalent, and I anticipated that Yves' bacon would be near the top.
There were some surprises in the list, however. From reading the packaging, tempeh looked like it would rank higher. And I didn't expect the Tofurky sausages to be the worst of the wheat gluten products, likely because they're my favourite food on this list.
The bad news for me is that peanut butter and chickpeas are relatively bad choices for my diet, so I'll need to be careful with how much I eat. The good news is that the top three foods (Yves' Canadian bacon, field roast sausages, and Yves' standard burgers) are all foods I like a lot, and I'm already eating the burgers and sausages often. Now that I know how healthy the bacon is for me, I'll start finding more meals to add it to.