Other Peer Institution Selection Article
** If you are using the Microsoft Internet Explorer or Google Chrome browser,
you may not be able to read the formulas in this article. The
formulas were written in MathML, a W3C standard, and can be viewed in
Firefox.
Summary
The goal of this article is to provide mathematical insight into the
selection of nearest peer institutions. The discussion begins with a
general review of distance in mathematics and extends the idea to the
measurement of nearness. Examples are used to demonstrate the
importance of a properly selected distance function.
The idea of peer
In many fields of study, it is a useful practice to tag objects that are similar to a particular object, the Anchor, as its peers. Similarity is customarily measured by distance: the smaller the distance, the more similar an object is to the Anchor and the more likely it is to be selected as a peer of the Anchor.
Distances in Mathematics
In mathematics, a metric or distance function is a function that defines the distance between two elements or objects (see Wikipedia article).
However, given the abstract nature of mathematics, set theory only
sets criteria for the behavior of a distance function. It leaves the explicit
definition of the distance to the case of application, since the explicit
definition is irrelevant in the context of set theory.
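For reference, the criteria usually required of a distance function can be written as follows. The symbols d, x, y, and z are notation introduced here for illustration and do not come from the article itself.

```latex
\begin{align*}
d(x, y) &\ge 0 \\
d(x, y) &= 0 \iff x = y \\
d(x, y) &= d(y, x) \\
d(x, z) &\le d(x, y) + d(y, z)
\end{align*}
```

Any function satisfying these criteria qualifies as a metric, which is why the explicit formula can be chosen to suit the application.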
For cases where the elements are points in Euclidean geometry, the distance is commonly defined as:
$$D(X, x) = \sqrt{\sum_i (X_i - x_i)^2}$$
However, this is not the only permissible definition of distance.
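As a minimal sketch of this point, the Python below compares the common Euclidean definition with one alternative, the sum of absolute differences (often called the Manhattan distance). The function names and the two sample points are assumptions chosen for illustration.

```python
import math


def euclidean(X, x):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(X, x)))


def manhattan(X, x):
    """Sum of absolute coordinate differences, another valid metric."""
    return sum(abs(a - b) for a, b in zip(X, x))


# Hypothetical points, chosen only to make the arithmetic easy to follow.
anchor = (0.0, 0.0)
point = (3.0, 4.0)

print(euclidean(anchor, point))  # 5.0
print(manhattan(anchor, point))  # 7.0
```

Both functions measure the same pair of points, yet they report different distances, so the choice of definition matters.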
Distance in a case of study
The lack of an explicit definition of distance for a case of study calls into
question the distance between mathematics and the real world.
However, contrary to common belief, mathematics is not isolated in its own
abstract world; plenty of mathematical branches grew out of real-world
problems. For example, the criteria for a distance function are
abstracted qualities of the common distance definition in Euclidean
geometry.
The absence of a definition, in essence, offers the opportunity to select a sensible definition for the case of study.
In the land of peer selection,
properties are constantly associated with objects, and distances are
customarily defined through those properties.
In the case of higher-education-institution objects, possible properties
are fall enrollment headcount, percent of male enrollment, total
revenue, and so on, like those collected by the IPEDS (Integrated Postsecondary Education Data System) survey.
Distance of Interest for this article
Of all the possible choices of distance definition, the following are of
demonstrative interest. The subscript i denotes the various properties,
X denotes the Anchor object in the set, and x denotes a different object.
D denotes the distance.
1. Sum of the square of the difference
$$D = \sum_i (x_i - X_i)^2$$
2. Sum of the ranking of the difference
$$D = \sum_i R_i$$
where $R_i$ is the ranking of $|x_i - X_i|$ among the objects being compared: the minimum value has a rank of 1, the next smallest has a rank of 2, and so on.
3. Sum of the square of the percent difference
$$D = \sum_i \left( \frac{x_i - X_i}{X_i} \right)^2$$
4. Sum of the absolute value of the percent difference
$$D = \sum_i \left| \frac{x_i - X_i}{X_i} \right|$$
For the sake of demonstration, IPEDS-like higher education institution objects are considered. The Anchor institution, My Inst, along with three other institutions and their fabricated property values, is listed in Table 1. These institutions are compared in Figure 1. The data are fabricated to demonstrate that a good methodology should not depend on the data to produce a reasonable result. Figure 1 clearly shows that Inst-1 is the institution most similar to My Inst, the Anchor institution, followed by Inst-2 and Inst-3.
Table 1 - demonstrative data
Inst | Enrollment | % Men | Revenue |
Institution 1 | 1900 | 63% | 105,000 |
Institution 2 | 2050 | 90% | 101,000 |
Institution 3 | 1970 | 35% | 104,000 |
My Institution | 2000 | 60% | 100,000 |
Figure 1 - comparing institutions
Distance evaluated with each illustrative definition
Table 2 - Sum of the square of the difference
Inst | Enrollment | Percent Men | Revenue | Distance |
Inst 1 | 10000 | 0.0009 | 25,000,000 | 25,010,000.0 |
Inst 2 | 2500 | 0.09 | 1,000,000 | 1,002,500.1 |
Inst 3 | 900 | 0.0625 | 16,000,000 | 16,000,900.1 |
With the 'sum of the square of the difference' approach, the similarity ranking is Inst-2, Inst-3, and then Inst-1. Table 2 demonstrates that, in this model, the property with the largest values (here, revenue) overshadows the differences in the other properties. It is, therefore, important to scale the properties to comparable magnitudes.
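The Table 2 figures can be reproduced with a short Python sketch. The dictionary layout, the property keys, and the function name are assumptions made for illustration; the values are the fabricated ones from Table 1, with percent men written as a fraction.

```python
# Fabricated values from Table 1 (percent men expressed as a fraction).
anchor = {"enrollment": 2000, "pct_men": 0.60, "revenue": 100_000}
others = {
    "Inst-1": {"enrollment": 1900, "pct_men": 0.63, "revenue": 105_000},
    "Inst-2": {"enrollment": 2050, "pct_men": 0.90, "revenue": 101_000},
    "Inst-3": {"enrollment": 1970, "pct_men": 0.35, "revenue": 104_000},
}


def sum_squared_difference(anchor, other):
    """Definition 1: sum over properties of the squared raw difference."""
    return sum((other[key] - anchor[key]) ** 2 for key in anchor)


distances = {
    name: sum_squared_difference(anchor, props) for name, props in others.items()
}

# Smallest distance first, so the first name is the closest peer.
ranking = sorted(distances, key=distances.get)
for name in ranking:
    print(name, round(distances[name], 1))
# Order printed: Inst-2, Inst-3, Inst-1
```

The revenue term contributes millions to each total while the percent-men term contributes less than 0.1, which is the overshadowing effect described above.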
Table 3 - Sum of the Ranking of difference
Inst | Enrollment | Percent Men | Revenue | Distance |
Inst 1 | 3 | 1 | 3 | 7 |
Inst 2 | 2 | 3 | 1 | 6 |
Inst 3 | 1 | 2 | 2 | 5 |
The 'Sum of the Ranking' practice considers Inst-3 the most similar peer, with Inst-2 and Inst-1 following. The problem with this approach may not be obvious, but a couple of points emerge on careful observation. First, the true differences are misrepresented by category-like integers, which by themselves cannot even avoid rounding errors. Second, ordering replaces the distance between any two adjacent values with 1, regardless of how far apart the values actually are. In Table 3, for example, the percent-men differences of 3 and 25 percentage points receive the adjacent ranks 1 and 2, just as the differences of 25 and 30 percentage points receive ranks 2 and 3.
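A minimal Python sketch of the ranking approach follows. The data layout and function name are assumptions made for illustration, and ties are not handled because the fabricated Table 1 data contains none.

```python
# Fabricated values from Table 1 (percent men expressed as a fraction).
anchor = {"enrollment": 2000, "pct_men": 0.60, "revenue": 100_000}
others = {
    "Inst-1": {"enrollment": 1900, "pct_men": 0.63, "revenue": 105_000},
    "Inst-2": {"enrollment": 2050, "pct_men": 0.90, "revenue": 101_000},
    "Inst-3": {"enrollment": 1970, "pct_men": 0.35, "revenue": 104_000},
}


def rank_sum_distances(anchor, others):
    """Definition 2: for each property, rank institutions by absolute
    difference from the anchor (1 = smallest), then sum the ranks."""
    totals = {name: 0 for name in others}
    for key in anchor:
        diffs = {name: abs(props[key] - anchor[key]) for name, props in others.items()}
        # Assumption: no ties, so a plain sort gives unambiguous ranks.
        for rank, name in enumerate(sorted(diffs, key=diffs.get), start=1):
            totals[name] += rank
    return totals


totals = rank_sum_distances(anchor, others)
print(totals)  # {'Inst-1': 7, 'Inst-2': 6, 'Inst-3': 5}
```

Note that the sizes of the differences never enter the totals; only their order does.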
Table 4 - Sum of the square of the percent difference
Inst | Enrollment | % Men | Revenue | Distance |
Inst 1 | 0.3% | 0.3% | 0.3% | 0.8% |
Inst 2 | 0.1% | 25.0% | 0.0% | 25.1% |
Inst 3 | 0.0% | 17.4% | 0.2% | 17.5% |
In the 'Sum of the square of the percent difference' process, Inst-1 is ranked as the most similar, followed by Inst-3 and Inst-2.
Under this approach, differences are represented by the percent difference from the Anchor, with no categorization attempted. Another benefit is that the approach is straightforward and easy to explain to audiences. Weighting of each property, as will be discussed later, is also plain to see and easy to identify.
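The Table 4 calculation can be sketched in Python as below. The data layout and function name are assumptions made for illustration; the values are the fabricated ones from Table 1.

```python
# Fabricated values from Table 1 (percent men expressed as a fraction).
anchor = {"enrollment": 2000, "pct_men": 0.60, "revenue": 100_000}
others = {
    "Inst-1": {"enrollment": 1900, "pct_men": 0.63, "revenue": 105_000},
    "Inst-2": {"enrollment": 2050, "pct_men": 0.90, "revenue": 101_000},
    "Inst-3": {"enrollment": 1970, "pct_men": 0.35, "revenue": 104_000},
}


def sum_squared_percent_difference(anchor, other):
    """Definition 3: each difference is divided by the anchor's value
    before squaring, so every property is on the same relative scale."""
    return sum(((other[key] - anchor[key]) / anchor[key]) ** 2 for key in anchor)


distances = {
    name: sum_squared_percent_difference(anchor, props)
    for name, props in others.items()
}

# Smallest distance first, so the first name is the closest peer.
ranking = sorted(distances, key=distances.get)
for name in ranking:
    print(name, f"{distances[name]:.4f}")
# Order printed: Inst-1, Inst-3, Inst-2
```

Dividing by the anchor's value assumes that value is non-zero, which holds for the properties used here.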
Lesson learned
Property values vary in magnitude, so using the raw values directly understates the differences in properties of smaller magnitude. Using rankings can mask the differences in value and, in effect, assigns a difference of 1 to all adjacent values.
The square vs. the absolute value (definition 3 vs. definition 4)
The fact that $a^2 + b^2 < (a + b)^2$ for positive $a$ and $b$ implies that the squared method favors multiple smaller differences over a single larger difference, while the absolute value method weights several small differences and a single bigger difference of the same total equally. Visually, the squared method makes sense.
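A small Python sketch can make the contrast concrete. The two lists of percent differences are hypothetical values invented for this illustration: one institution differs from the Anchor by 10% on two properties, the other by 20% on a single property.

```python
def sum_squared_percent(percent_diffs):
    """Definition 3 applied to already-computed percent differences."""
    return sum(d ** 2 for d in percent_diffs)


def sum_abs_percent(percent_diffs):
    """Definition 4 applied to already-computed percent differences."""
    return sum(abs(d) for d in percent_diffs)


# Hypothetical percent differences from the Anchor (not from Table 1).
two_small = [0.10, 0.10, 0.00]   # two properties off by 10% each
one_large = [0.20, 0.00, 0.00]   # one property off by 20%

# The absolute-value method cannot tell the two institutions apart.
print(sum_abs_percent(two_small), sum_abs_percent(one_large))

# The squared method treats the institution with two small differences as closer.
print(sum_squared_percent(two_small) < sum_squared_percent(one_large))  # True
```

Under the squared method the single 20% difference costs twice as much as the two 10% differences combined.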
Weighting properties
Once the distance is standardized to the 'sum of the square of the percent difference', weighting can easily be done by multiplying each 'square of the percent difference' by a weight before taking the sum.
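A hedged Python sketch of such weighting follows. The function name and the weight values are hypothetical, chosen only to show the mechanics; the article does not prescribe any particular weights.

```python
# Fabricated values from Table 1 (percent men expressed as a fraction).
anchor = {"enrollment": 2000, "pct_men": 0.60, "revenue": 100_000}
inst_2 = {"enrollment": 2050, "pct_men": 0.90, "revenue": 101_000}


def weighted_distance(anchor, other, weights):
    """Multiply each squared percent difference by its weight, then sum."""
    return sum(
        weights[key] * ((other[key] - anchor[key]) / anchor[key]) ** 2
        for key in anchor
    )


# Unit weights reproduce the unweighted definition 3.
unit_weights = {"enrollment": 1.0, "pct_men": 1.0, "revenue": 1.0}

# Hypothetical weights: halve the influence of percent men, double revenue.
custom_weights = {"enrollment": 1.0, "pct_men": 0.5, "revenue": 2.0}

print(f"{weighted_distance(anchor, inst_2, unit_weights):.6f}")
print(f"{weighted_distance(anchor, inst_2, custom_weights):.6f}")
```

Because each weight multiplies a single visible term, its effect on the final distance is easy to trace.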
* While a customized distance can be used in Nearest Neighbor analysis, Nearest Neighbor analysis is a specific topic within cluster analysis.