Defects of an Alleged Statistical "Proof"

that Fruit and Mother's Milk are Similar

Outline of the "obligate crank science proof"

There exists in fruitarian circles an alleged statistical "proof" that one might encounter, which claims that the nutritional composition of mother's milk is "closest" to that of fruit. The proof uses correlation and covariance, and the claim is made that these are "robust" methods for comparing foods.

The alleged proof is probably the most detailed analysis ever presented by a fruitarian in an attempt to support the claim that fruit is similar to mother's milk. That is, if the proof were valid, it would be one of the few pieces of evidence anyone has been able to put forward in support of the "fruit is like mother's milk" theory. Let's review the major steps in the alleged proof so that if you happen to encounter it or those promoting it, you'll recognize the "proof" (and its numerous problems) as the make-

- The food composition (nutrient) profiles from USDA Handbook 8 were averaged in broad categories: fruit, dairy, beef, grain, etc. The profile for human milk is a single food nutrient profile and is not averaged.
- Correlations were computed between the human milk profile and the average category profiles calculated in step 1. The correlations were highest between human milk and two categories: fruit (0.93) and poultry (0.96).
- The covariances between the human milk profile and the two categories mentioned above were examined; the covariance between human milk and fruit was 3162, and between human milk and poultry, 6853. As the value of the milk-fruit covariance is lower, the claim is then made that fruit is "closer" to milk than (any) other foods.

As some readers may find it difficult to understand detailed discussions of statistics, this brief summary section is to give you an overall assessment of the proof, and a relatively non-

**Synopsis:** As we'll see shortly, the above statistical "proof" of the similarity between milk and fruit is all of the following:

- Incorrect and/or invalid at every step.
- Fallacious, hence does not prove anything.
- In my opinion, a prime example of crank science--indeed, it is one of the worst examples of crank science that I have ever seen.

- The nutrient profiles used are category averages from the USDA handbooks. Thus, the "fruit" profile is actually *not* an average of only raw fruits, but includes processed fruits as well. In other words, the data are *not* what they need to be to test the hypothesis. From this point on (i.e., from the very beginning of the proof), the analysis is invalid--after all, an analysis done on the wrong data does not prove anything.
- Treating missing data as zero may introduce bias and increase correlations.
- Correlation is not an appropriate way to compare lists (nutrient profiles) that lack internal consistency; i.e., the USDA nutrient profiles are dissimilar items measured in 7 different units. This assertion follows from the definition of correlation, and is discussed in detail further below.
- The use of correlation in the proof is based on a major structural assumption, which also happens to be an implicit assumption. The result is that the proof is not as objective as the fruitarian extremist might want you to believe it to be.
- In figurative terms, correlation can be interpreted as a kind of "standardized covariance"; i.e., correlation and covariance are related. Note that covariance changes when a variable is rescaled (multiplied by a coefficient); correlation does not. The fact that covariance is dependent on scale makes its use in the proof inappropriate (discussed below).
- The combination of correlation and covariance does not "prove" that milk is "closer" to fruit than to poultry. As previously mentioned, correlation as used in the proof is inappropriate.
- Humorous but true: If one pretends that the data, method, and results from the proof are actually valid, then use of the rescaling property of covariance "proves" that 100 g of mother's milk is "closest" to 46 g of poultry--not (100 g of) fruit!

Let's now examine in depth the wide assortment of errors to be found in the alleged proof in the subsections that follow. The section immediately following discusses the numerous flaws due to statistical errors in the fallacious crank science proof. As a comparison, the subsequent, and final, section directly contrasts the crank science proof with the approach used in this paper. The discussion assumes that you are comfortable with statistical concepts. (If that does not describe you, you might prefer to skip the material.) However, note that even if you cannot completely follow the statistical concepts, the analysis of the myriad errors in the crank science proof may still help, by way of comparison, to illustrate the underlying reasoning behind the statistical approach used in

Errors of the "obligate crank science proof"

Recall that the hypothesis of interest is that fruit--

However, that is not what the extremist's proof does. Instead, it reportedly averages *all* the nutrient profiles in the fruit category of USDA

As the USDA handbook fruit category average, then, is not necessarily the same as an average of raw fruits, the data analyzed by the extremist are *not* the data needed to test the hypothesis of interest. In other words, the "wrong" data are analyzed, and the entire proof is invalid and irrelevant from this

However, as there are many more interesting errors in the remaining structure of the proof, let us continue our discussion of the errors.

**Note:** Before leaving this topic, readers should be aware that the USDA handbook categories are very broad. For example, USDA Handbook 8-1 (dairy) includes eggs and egg products, non-

The "proof" reports that missing data are treated as zeroes for purposes of analysis. Of particular interest here is treating missing values as zeroes in computing averages. There is a subtle but statistically important difference between treating missing data as zeroes, and excluding it from the analysis. The difference occurs in determining the value of N, which is the number of data points in the analysis, as used in calculation of sample means,

If missing data is treated as zeroes, the value of N will reflect both missing and non-

Note that it is desirable to use data that is complete, or nearly so, in an analysis. In some cases that makes it appropriate to exclude an item from an averaged profile if it has "excessive"
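To make the role of N concrete, here is a minimal Python sketch. The nutrient values are hypothetical (they are not taken from the USDA handbooks), and `None` is simply an assumed way of marking a value that is missing from a table. The sketch only illustrates how the two treatments of missing data lead to different averages.

```python
# Hypothetical values of one nutrient across four foods in a category.
# None marks a value that is missing from the table (not a true zero).
values = [12.0, None, 8.0, None]

# Treatment 1: missing values become zeroes, so N counts every food.
zero_filled = [0.0 if v is None else v for v in values]
mean_zero_filled = sum(zero_filled) / len(zero_filled)  # N = 4

# Treatment 2: missing values are excluded, so N counts only reported values.
reported = [v for v in values if v is not None]
mean_excluding_missing = sum(reported) / len(reported)  # N = 2

print(mean_zero_filled)        # 5.0
print(mean_excluding_missing)  # 10.0
```

With these made-up numbers, zero-filling halves the category average, because the sum is unchanged while N doubles.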

Covariance is a gross measure of the joint variation of two variables. The theoretical definition is (notation explained after formula):

Cov(X,Y) = E[ (X-E(X)) * (Y-E(Y)) ]

where:

Cov(X,Y) = covariance between variables X, Y; and,

E(*) is the expected value (expectation) operator, i.e., in this case, the true mean.

Readers unfamiliar with the E(*) notation can simply substitute Avg(*), or average, for the expectation operator. Thus, for example, (X-E(X)) simply means the value of X, with its average subtracted.

It is also the case that:

Cov(X,Y) = E[X*Y] - ( E(X)*E(Y) ),

which when estimated yields the standard formula:

Cov(X,Y) = Avg(X*Y) - ( Avg(X)*Avg(Y) )

where Avg(*) is the average, i.e., arithmetic mean.

Note the relationship between variance and covariance:

Variance of X = Var(X) = Cov(X,X),

and note the terminology: SD(X) = standard deviation of X = square root of Var(X).

Note that covariance changes when data are rescaled:

Cov(aX,cY) = a*c * Cov(X,Y)

where a,c are arbitrary coefficients.

Correlation is a measure of the linear relationship between two variables, X and Y. If the relationship between two variables X and Y is nearly a straight line (of non-zero slope), then the correlation will be close to 1 or -1. Correlation and covariance are closely related:

Corr(X,Y) = Cov(X,Y)/[SD(X)*SD(Y)]

where Corr(X,Y) = Correlation between variables X and Y, and SD(X) is the standard deviation of variable X.

Correlation is always in the range of -1 to +1. Hence, in a figurative sense, correlation can be considered to be a sort of "standardized" covariance.
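The formulas above can be checked numerically with a short Python sketch. The data and the coefficients `a` and `c` are arbitrary illustrative values, and the helper names (`avg`, `cov`, `sd`, `corr`) are introduced here only for illustration. The sketch uses the estimated formula Cov(X,Y) = Avg(X*Y) - ( Avg(X)*Avg(Y) ).

```python
from math import isclose, sqrt

def avg(xs):
    """Arithmetic mean, i.e., Avg(*)."""
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Cov(X,Y) = Avg(X*Y) - Avg(X)*Avg(Y)."""
    return avg([x * y for x, y in zip(xs, ys)]) - avg(xs) * avg(ys)

def sd(xs):
    """SD(X) = square root of Var(X), where Var(X) = Cov(X,X)."""
    return sqrt(cov(xs, xs))

def corr(xs, ys):
    """Corr(X,Y) = Cov(X,Y) / [SD(X)*SD(Y)]."""
    return cov(xs, ys) / (sd(xs) * sd(ys))

# Arbitrary illustrative data and rescaling coefficients.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 3.0, 8.0]
a, c = 3.0, 0.5
ax = [a * v for v in x]
cy = [c * v for v in y]

# Covariance picks up the factor a*c; correlation is unchanged.
print(cov(x, y), cov(ax, cy))
print(corr(x, y), corr(ax, cy))
```

Rescaling multiplies the covariance by a*c, while the correlation stays the same (for coefficients of the same sign, as here).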

For a nice introduction to the topics of correlation and covariance, see:

*Probability and Statistics,* Addison-

PART 1.

The crank science proof uses correlations to compare milk and the derived category averages. However, let us consider the following.

- The USDA nutrient profiles are a collection of different nutrient values measured in 7 different units: weight (gm, mg, mcg); energy levels (kcal, kJ); vitamin A (RE, IU).
- The definition/formula for covariance, variance, and correlation all involve calculating a mean and subtracting it from the raw data. Consider the mean or (internal) average of the set of numbers in any one specific (USDA) nutrient profile (whether a solo food or category average does not matter). What units does that average/mean have? Obviously, one cannot assign *any* units to such an average, as it is a mixture of numbers displaying 7 different types of units, per above. Hence, quantities like E(X) or Avg(X), in the calculation of covariance, variance, and correlation are *not meaningful.* It follows from this that covariance, variance, and correlation are not meaningful in this context, and that the use of covariance and correlation in the proof is inappropriate. In other words, in the alleged proof, the fruitarian extremist applied a statistical technique without bothering with the critical detail of whether it is appropriate for the data. (By the way, in preparing to do the analysis given in this paper, I considered, then rejected, the use of correlations across entire nutrient lists, for this very reason, and additional reasons discussed below.)

- Characteristics of the USDA nutrient profiles that impact data analysis. The USDA food profile data for food energy are linear functions (more precisely, rescalings) of each other (i.e., the same basic quantity in related units: kcal and kJ); similar remarks apply to the data for vitamin A, measured in RE and IU. The proximate composition data (less the food energy data) is linearly dependent, as it must add to 100 g in most cases. Further, the energy data--whether in kcal or kJ--is approximately a linear function of the proximate composition data (i.e., calories are a function of the fat, carbohydrate, and protein content of a food). The result of these redundancies and internal linear relationships in the USDA nutrient profiles may be to increase correlation and to introduce bias.
- Correlations are given and references to (implicitly) significant differences in correlations are suggested in the (very poorly written) "proof," all without any formal tests of significance. Such an approach is necessarily incomplete and dubious.
- An important structural limit in applying correlations to nutrient lists (not an error): The definition of correlation is such that it does not acknowledge differences in the importance of the factors used in computation. In other words, in this case, the grams of fat and sugar (which are critical differences) are just as important as the levels of vitamins A and C are in calculating correlations. However, as milk is a fatty food, and fruit a sweet food, such differences are of great real-world importance. That is why I analyzed the proximate composition (fat, carbohydrate, protein) in calories, separately, in my paper. Ignoring such differences by limiting analysis to correlations (and covariances), only, is irrational and produces an incomplete and misleading analysis.
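The point about mixed units can be illustrated with a small Python sketch. The two four-nutrient profiles below are hypothetical (they are not USDA data), and the helper names are introduced only for this example. Each list holds protein, fat, and carbohydrate in grams plus calcium in milligrams; the only change between the two calculations is re-expressing calcium in grams.

```python
from math import sqrt

def avg(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    # Cov(X,Y) = Avg(X*Y) - Avg(X)*Avg(Y)
    return avg([x * y for x, y in zip(xs, ys)]) - avg(xs) * avg(ys)

def corr(xs, ys):
    # Corr(X,Y) = Cov(X,Y) / [SD(X)*SD(Y)]
    return cov(xs, ys) / sqrt(cov(xs, xs) * cov(ys, ys))

# Hypothetical profiles: protein (g), fat (g), carbohydrate (g), calcium (mg).
food_a = [1.0, 4.0, 7.0, 32.0]
food_b = [20.0, 5.0, 0.0, 12.0]

def calcium_mg_to_g(profile):
    """Same food, same amounts; calcium is merely re-expressed in grams."""
    return profile[:3] + [profile[3] / 1000.0]

corr_calcium_in_mg = corr(food_a, food_b)
corr_calcium_in_g = corr(calcium_mg_to_g(food_a), calcium_mg_to_g(food_b))

print(corr_calcium_in_mg)
print(corr_calcium_in_g)
```

With these made-up numbers the correlation is slightly positive when calcium is in milligrams and negative when it is in grams, even though the foods themselves have not changed. A statistic that depends on an arbitrary choice of units in this way cannot serve as an objective measure of how "close" two foods are.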

PART 2.

The use of correlations in the alleged proof is based on a very important, implicit structural assumption. Given a suitable data set with two variables, then the calculation of correlations and other statistics is an objective exercise. However, note the term "suitable data set"--

The implicit *assumption*--*assumption*--

If one compares the USDA and German tables, one notices many structural differences. Similar remarks apply if one compares these two tables to other standard tables (e.g., British), or to published papers in the journals. The conclusion one quickly reaches is that there is no universally acknowledged, complete, "standard" nutrient list for food composition. Correlations based on different nutrient lists could yield different results.

Thus the allegedly objective analysis--

Remark: Inasmuch as regression and correlation are closely related, I deliberately limited regression analysis in this paper to narrowly defined data sets of closely related quantities, all of which were measured in the same units (amino acid profile, fatty acid profile).

PART 3.

Let's assume that you can somehow get a nutrient list that will be accepted as standard by most observers. Let's also assume you have data in the format of such a list, and you want to compare (despite the limitations therein) the nutrient lists (foods) via correlation. How, then, can you "fix" the problems found in the fallacious crank science proof? Let's consider some possible fixes and see why they won't work well,

**Fix 1:** Drop energy and vitamin A data. Convert all remaining data (all of which are weights) into one measure--

**Comments:**

This gives you data all in one measure--

**Fix 2:** Eliminate units of measure by converting the nutrient lists into lists of ratios, i.e., (nutrient list value)/(index list value). The candidates for serving as an "index list" are as follows.

- List of RDA/RDI values for nutrients.

- The nutrient list for human milk.

- Another constructed list.

- RDA/RDI lists: Both the nutrients and values therein are controversial and incomplete. There is no RDA/RDI for many important nutrients, and the different authorities often disagree as to the values for the RDA/RDI. You will not get anything approaching a consensus using this approach.
- If human milk is used for calculating ratios, then you end up trying to calculate a correlation between a nutrient ratio list--milk--that is constant, i.e., all 1's, versus other nutrient lists. A constant list has variance (and standard deviation) of zero, hence to compute correlation, you end up dividing by zero--which is mathematically undefined. This approach definitely won't work.
- It may be very difficult to produce an index list that is internally consistent and that can be defended statistically/logically. The problem that milk has lactose, and other foods do not, presents some major challenges in constructing such a list. An overall average or "small" reference value for lactose would yield a high lactose content ratio for dairy, zero for other foods. This could influence correlations. Of course, the greatest challenge in producing any index list to use in calculating ratios for analysis is that any/all index lists constructed are subject to challenge, both in regard to index list values and the nutrients included in such an index list. Also, correlation is limited as an analysis tool, per the remarks above (it does not accurately reflect the important differences in proximate composition, sugar breakdown, etc.). An analysis that is limited to correlation (and covariance) alone, and that does not consider proximate composition and carbohydrate breakdown (e.g., the crank science proof) is an incomplete and invalid analysis.
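The failure of the "human milk as index list" option can be shown in a few lines of Python. The nutrient values are hypothetical placeholders (not real composition data), chosen to be non-zero so that the ratios themselves are defined. The helper names are assumptions made for this sketch.

```python
from math import sqrt

def avg(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    # Cov(X,Y) = Avg(X*Y) - Avg(X)*Avg(Y)
    return avg([x * y for x, y in zip(xs, ys)]) - avg(xs) * avg(ys)

def sd(xs):
    return sqrt(cov(xs, xs))

# Hypothetical nutrient lists for milk and one other food.
milk = [1.0, 4.4, 6.9, 32.0]
other = [0.5, 0.3, 12.0, 8.0]

# Ratios against the milk index list: milk becomes a constant list of 1's.
milk_ratios = [m / m for m in milk]
other_ratios = [o / m for o, m in zip(other, milk)]

sd_milk = sd(milk_ratios)  # zero, because the list is constant

try:
    r = cov(milk_ratios, other_ratios) / (sd_milk * sd(other_ratios))
except ZeroDivisionError:
    r = None  # the correlation is undefined

print(milk_ratios, sd_milk, r)
```

The standard deviation of the milk ratio list is zero, so the correlation formula requires a division by zero and no correlation can be computed.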

The claim in the proof that lower covariances for milk and fruit (vs. milk and poultry) suggest the numbers for milk and fruit are lower in magnitude (hence may be "closer" in gross terms) is approximately correct. However, the covariances are irrelevant because the average used for fruit is wrong, and the use of correlations is invalid per the above discussion. Hence the conclusion of the proof is logically and statistically invalid and does

However, we can have some fun with the invalid logic of the proof. Recall that covariance changes with scale, but correlation does not. Now consider the covariances calculated in the proof: Cov(M,F) = 3162, and Cov(M,P) = 6853, where

Cov(M,cP) = 3162, and Corr(M,cP) = Corr(M,P) = 0.96,

versus Cov(M,F) = 3162, and Corr(M,F) = 0.93.

That is, using the results from the crank science proof, the best match of foods is not

Remark: As correlation does not change with rescaling but covariance does, the analytical advantage of poultry (higher correlation) can be maintained. If fruit is rescaled to yield a different covariance, simply rescale poultry yet again to match the fruit covariance.
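The rescaling arithmetic can be reproduced in a few lines of Python. The covariances and correlations are the figures reported in the proof. The coefficient c is derived here from the rule Cov(M,cP) = c * Cov(M,P), and the variable names are introduced only for this sketch. Treating the scaled profile as an amount in grams assumes the profiles are expressed per 100 g of food.

```python
# Figures reported in the crank science proof.
cov_milk_fruit = 3162.0
cov_milk_poultry = 6853.0
corr_milk_fruit = 0.93
corr_milk_poultry = 0.96

# Cov(M, cP) = c * Cov(M, P); choose c so the rescaled poultry covariance
# equals the milk-fruit covariance.
c = cov_milk_fruit / cov_milk_poultry

# Rescaling leaves the correlation unchanged: Corr(M, cP) = Corr(M, P).
corr_milk_scaled_poultry = corr_milk_poultry

# Assuming per-100 g profiles, scaling by c corresponds to 100*c grams.
grams_poultry = 100.0 * c

print(round(c, 3))           # 0.461
print(round(grams_poultry))  # 46
```

After rescaling, poultry ties fruit on covariance and still has the higher correlation (0.96 versus 0.93), which is where the figure of 46 g of poultry comes from.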

To summarize the above in plain language: the covariance results of the crank science proof, rescaled, suggest that poultry, in smaller amounts, is a better match to milk than

Differences: crank science "proof" versus this paper

Here we'll look at some of the major differences between the analytical approach of this paper and the confused, sloppy approach of the crank science proof that uses correlation and covariance. Such an analysis reveals the following.

- The crank science proof used only the USDA handbook tables for analysis. As was discussed, these tables do *not* provide a breakdown of types of sugar. In contrast, the analysis in this paper uses the German tables, which generally provide a sugar breakdown. Such information is important, as fruit is of central interest here, and sugar is the major source of calories in most (but not all) fruits; also milk contains an "uncommon" sugar, lactose.
- The crank science proof used USDA handbook category averages--quite literally, the wrong data for fruit. In contrast, the analysis in this paper uses realistic averages of small numbers of raw fruits. The averages used in this paper were chosen based on real-world experience as a fruitarian (no fruitarian would eat an average of the USDA fruit handbook, as it includes many cooked and canned fruits). Additionally, the German data was augmented (where necessary) with data from the USDA and/or British tables, to provide relatively "complete" data for the analysis of this paper. Note also that since fruitarians typically rely on a small number of (low-priced, in-season) fruits as staples, the approach in this paper--comparing milk to a realistic average of fruits--provides a real-world test of how the claim "fruit is just like mother's milk" actually holds up in practice.
- The crank science proof used only correlations and covariances, nothing else. The use of correlations in the proof is inappropriate for the reasons cited previously.

In sharp contrast to the above, the analysis in this paper emphasizes basic, common sense points, as follows.

- The percentage breakdown by calories--proximate composition--is of great importance, and limiting analysis to correlations and covariances (per the crank science proof) effectively ignores the importance of the proximate composition. All of the hand-waving and bogus crank science statistics cannot overcome the simple reality that milk at 52.53% calories from fat is a "fatty food," and sweet fruits at 84.87% calories from sugar are a "sweet food," and that sugar and fat are indeed different (hence milk and fruit are different). This point is common sense and obvious to all, with the notable exception of the extremist fruitarian promoters of the fallacious crank science "fruit is just like mother's milk" theories.
- Types of sugars: The sugar breakdown is an explicit part of the analysis in this paper. This is important, as lactose (milk) and fructose (fruits) are slowly assimilated, while glucose and sucrose (fruits) are more rapidly assimilated. Note also that an analysis based only on correlations and covariances does not give proper emphasis to differences in proximate composition or sugar breakdown, per the points above.

- Instead of general correlations, the analysis in this paper uses linear regression (a technique closely related to correlation), within limited, narrowly defined areas--the comparison of the amino acid and fatty acid profiles. Regression is reasonable in those situations as the data are all of the same type, closely related, and measured in the same units. Such narrow use of regression can be more easily defended, logically, than the muddled and uninformed crank science "proof's" approach of attempting correlation on nutrient profiles that are internally inconsistent, and which are arguably incomplete.
