Title: | Fit Multi-Modal Mallows' Models to Ranking Data |
---|---|
Description: | An EM algorithm to fit Mallows' Models to full or partial rankings, with or without ties. Based on Adkins and Flinger (1998) <doi:10.1080/03610929808832223>. |
Authors: | Erik Gregory [aut, cre] |
Maintainer: | Erik Gregory <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1 |
Built: | 2025-02-24 03:01:03 UTC |
Source: | https://github.com/cran/RMallow |
Fits the Mallows' model to ranking data. Data can be partially or fully-ranked.
Package: | RMallow |
Type: | Package |
Version: | 1.0 |
Date: | 2012-02-18 |
License: | GPL (>= 2) |
Erik Gregory Maintainer: <[email protected]>
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
"Estimating a Population Distribution of Sequences of k Items from Cross- Sectional Data". Laurel A. Smith (Beckett) and Denis A. Evans. Journal of the Royal Statistical Society. Series C (Applied Statistics). Vol. 40, No. 1 (1991), pp.31-42. Blackwell Publishing for the Royal Statistical Society. Accessed 16/08/2010. http://www.jstor.org/stable/2347903 .
"A Non-iterative procedure for maximum likelihood estimation of the parameters of Mallows' Model Based on Partial Rankings". Laura Adkins and Michael Flinger. Communication in Statistics - Theory and Methods, 27:9, 2199-2220. 1998, Marchel Dekker, Inc. http://dx.doi.org/10.1080/03610929808832223 .
Calculates all of the Kendall's distances between two different sets of rankings.
AllKendall(r, seqs, data.info = NULL)
AllKendall(r, seqs, data.info = NULL)
r |
One set of sequences. |
seqs |
Another set of sequences. |
data.info |
Optional argument, a 0/1/NA matrix specifying all of the relevant information to calculate Kendall's difference for "r". Used for efficiency in "Solve". |
Matrix where output[i, j] represents the distance from sequence "i" in "r" to sequence "j" in "seqs".
Erik Gregory
data1 <- do.call("rbind", list(1:5, 5:1, c(3, 2, 1, 4, 5))) data2 <- do.call("rbind", list(1:5, 5:1)) # AllKendall(data1, data2)
data1 <- do.call("rbind", list(1:5, 5:1, c(3, 2, 1, 4, 5))) data2 <- do.call("rbind", list(1:5, 5:1)) # AllKendall(data1, data2)
Used to calculate the sequence Kendall distance distribution in N! space.
AllSeqDists(seqs)
AllSeqDists(seqs)
seqs |
Matrix or data frame of sequences. |
Vector of the distances from the sequences to 1:N.
Erik Gregory
Fit Mallows model N times and select most likely model. The EM algorithm to fit Multi-Modal Mallows' models is prone to getting stuck in local maxima, so we run it several times and selec the best one.
BestFit(datas, N, iter, G)
BestFit(datas, N, iter, G)
N |
number of times to run the model |
iter |
maximum number of iterations for each run |
G |
Number of cluster centers |
datas |
data set to fit |
best fitting model.
Calculate the normalizing coefficient, as a function of the lambda parameter, and the size of the sequence space.
C_lam(lambda, dists = NULL, dists.table = NULL)
C_lam(lambda, dists = NULL, dists.table = NULL)
lambda |
Spread parameter for Mallows' model. |
dists |
Vector of all distances from each sequence to 1:N |
dists.table |
Table version of "dists" above. |
Normalizing coefficient of Mallows' model in N! space with lambda = lambda.
Erik Gregory
Sequences in a fully-ordered sequence space have a unique Kendall Information vector associated with them. This function creates the sequence from the Kendall information vector.
ConstructSeqs(prefs, n.abils)
ConstructSeqs(prefs, n.abils)
prefs |
Ordering preference between columns in the data. 1 cooresponds to an increase, 0 to a decrease. |
n.abils |
Number of columns in the original data set. |
List of fully-ordered sequences, one for each row of prefs.
Erik Gregory
ConstructSeqs(matrix(c(1, 1, 1, 0, 0, 0), nrow = 1), 4) # Should output (4, 1, 2, 3)
ConstructSeqs(matrix(c(1, 1, 1, 0, 0, 0), nrow = 1), 4) # Should output (4, 1, 2, 3)
Simple synthetic data set containing 3 modal sequences in 15! space, with some noise added.
The format is: num [1:1700, 1:15] 1 15 1 15 15 12 10 4 1 15 ...
data(datas) head(datas)
data(datas) head(datas)
This function counts the number of fully-ordered vectors at each distance in N! space.
DistanceDistribution(N = 3)
DistanceDistribution(N = 3)
N |
Integer value, greater than or equal to 3. |
Table-like structure, where the names represent the distance from the modal sequence of each sequence in N! space, and the values represent the number of sequences at that distance in the sequence space.
Erik Gregory
This data is a pre-processed version of the 1980 American Psychological Association Presidential candidate ranking data. It has uninformative rankings removed, and values pre-simplified into partial rankings.
The format is: int [1:1378, 1:3] 1 1 1 1 2 2 1 1 2 2 ... - attr(*, "dimnames")=List of 2 ..$ : chr [1:1378] "1" "2" "3" "6" ... ..$ : chr [1:3] "Carter" "Reagan" "Anderson"
The American Psychological Association, http://www.electionstudies.org/studypages/1980prepost/1980prepost.htm
data(elect) head(elect)
data(elect) head(elect)
Assigns each ranking the probability that it belongs to each cluster, given current parameters.
EStep(R, r, p, lambda, G, N, C, all.dists = NULL)
EStep(R, r, p, lambda, G, N, C, all.dists = NULL)
R |
Current cluster modal sequences. |
r |
The data of partial or full rankings. |
p |
The proportion of the data currently assigned to each cluster. |
lambda |
The lambda parameters from Mallow's model for each cluster. |
G |
Number of clusters, length(R). |
N |
Number of rows in the data. |
C |
Vector of normalizing coefficients for the clusters. |
all.dists |
For efficiency, provide all of the Kendall distances between each sequence and each cluster mode. |
Matrix where output[i, j] represents the current probability that subject "i" belongs to cluster "j".
Erik Gregory
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
Data formatting function.
FormatOut(R, p, lambda, z, datas, likelihood)
FormatOut(R, p, lambda, z, datas, likelihood)
R |
The modal sequences. |
p |
Proportion of data in each cluster. |
lambda |
Mallows' spread parameters for each cluster. |
z |
Probability of cluster membership for each individual. |
datas |
Matrix of partial sequences. |
likelihood |
Vector of the log-likelihood of the model at each iteration. |
R |
The modal sequences |
p |
Proportion in each cluster |
lambda |
Spread parameters for each cluster |
datas |
Rankings merged with their cluster membership, distance from each cluster center, and probability of each cluster membership |
min.like |
Likelihood at each iteration |
Erik Gregory
Performs each column-wise comparison on a matrix of sequences. A 0 value denotes that there is an increase between the two columns, 1 a decrease, and NA indicates that the column values are identical in the row.
KendallInfo(r, inds = NULL)
KendallInfo(r, inds = NULL)
r |
Matrix of sequences. |
inds |
Possibly efficiency increase when doing repeated calculations, currently not used. |
Matrix of 0s, 1s, and NAs representing pairwise comparisons of vector values.
Erik Gregory
http://en.wikipedia.org/wiki/Kendall_tau_distance
Objective function to find the root of in calculating the lambda parameters for each cluster.
Lambda(lambda, rhs, dists, dists.table = NULL)
Lambda(lambda, rhs, dists, dists.table = NULL)
lambda |
lambda value to calculate the function output at. |
rhs |
Right-hand side of the equation in the referenced paper. |
dists |
Not used. |
dists.table |
Table of distances between each sequence and the modal sequence in N! space. |
Output of the objective function to determine the root of. Goal is zero.
Erik Gregory
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
Calculates the log-likelihood of the data with the current parameters and Kendall's distance.
Likelihood(z, p, C.lam, lambda, all.dists.data)
Likelihood(z, p, C.lam, lambda, all.dists.data)
z |
Probability of each cluster membership. |
p |
Proportion in each cluster. |
C.lam |
Vector of normalizing coefficients for Mallows' model. |
lambda |
Current spread parameters |
all.dists.data |
All distances from the data to the modal sequences. |
Current log-likelihood of the data with the current parameters.
Erik Gregory
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
Fits the Multi-Modal Mallows' model to partial or full ranking data, using Kendall's metric and an EM algorithm. This is essentially metric sequence clustering.
Mallows(datas, G, iter = 10, hyp = NULL, plot.like = FALSE)
Mallows(datas, G, iter = 10, hyp = NULL, plot.like = FALSE)
datas |
Matrix of partial or fully-ranked data. |
G |
Number of modes, 2 or greater. |
iter |
Maximum number of iterations. |
hyp |
Hypothesis sequence vector, to initialize one of the cluster centers at. |
plot.like |
Should the likelihood be printed at each iteration? |
See output of FormatOut
Erik Gregory
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
This is identical to counting the number of fully-ordered vectors at each bubble sort distance in (N+1)! space.
NextTable(last.table, N.last)
NextTable(last.table, N.last)
last.table |
Table of distances in N! space. |
N.last |
N |
Table of distances in (N+1)! space.
Erik Gregory
Initialize sequence modes for the clustering process.
Rgen(G, hyp = NULL, abils)
Rgen(G, hyp = NULL, abils)
G |
number of cluster centers, including the hypothesis if provided |
hyp |
a single sequence of length |
abils |
number of items being ranked |
A list of G cluster centers, each of length abils
Erik Gregory
Rgen(3, 1:5, 5)
Rgen(3, 1:5, 5)
Calculates Kendall's distances of each sequence in N! space. This is VERY Inefficient for N >= 8. See DistanceDistribution for an astronomical improvement (possibly on the order of 10^10).
SeqDistribution(N)
SeqDistribution(N)
N |
Length of the ranking. Preferrably less than 9. |
Vector of Kendall distances from 1:N to each sequence in N! space.
Erik Gregory
Simplifies sequences so that each tie group is only of distance 1 to the next tie group. For example, we would simplify (1, 1, 2, 4, 4, 5) to (1, 1, 2, 3, 3, 4).
SimplifySequences(loss.time)
SimplifySequences(loss.time)
loss.time |
Matrix of sequences to be simplified. |
Simplified sequences, as described in Description.
Erik Gregory
The data has 3 modal sequences, and we can compare this to the two.mode data set.
The format is: List of 5 $ R :List of 3 ..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ... ..$ : int [1:15] 1 3 5 7 9 2 4 6 8 10 ... ..$ : int [1:15] 15 14 13 12 11 10 9 8 7 6 ... $ p : num [1:3] 0.447 0.118 0.435 $ lambda : num [1:3] 2.01 1000 2.04 $ datas :'data.frame': 1700 obs. of 23 variables: ..$ X1 : num [1:1700] 1 15 1 15 15 12 10 4 1 15 ... ..$ X2 : num [1:1700] 2 14 2 14 14 13 13 12 2 14 ... ..$ X3 : num [1:1700] 3 13 3 13 13 2 4 6 3 13 ... ..$ X4 : num [1:1700] 4 12 4 12 12 8 7 1 4 12 ... ..$ X5 : num [1:1700] 5 11 5 11 11 9 14 5 5 11 ... ..$ X6 : num [1:1700] 6 10 6 10 10 1 8 10 6 10 ... ..$ X7 : num [1:1700] 7 9 7 9 9 15 1 13 7 9 ... ..$ X8 : num [1:1700] 8 8 8 8 8 10 9 9 8 8 ... ..$ X9 : num [1:1700] 9 7 9 7 7 6 5 14 9 7 ... ..$ X10 : num [1:1700] 10 6 10 6 6 11 11 8 10 6 ... ..$ X11 : num [1:1700] 11 5 11 5 5 3 15 2 11 5 ... ..$ X12 : num [1:1700] 12 4 12 4 4 14 12 11 12 4 ... ..$ X13 : num [1:1700] 13 3 13 3 3 7 2 7 13 3 ... ..$ X14 : num [1:1700] 14 2 14 2 2 5 3 15 14 2 ... ..$ X15 : num [1:1700] 15 1 15 1 1 4 6 3 15 1 ... ..$ clust : int [1:1700] 1 3 1 3 3 3 3 1 1 3 ... ..$ pvals.1: num [1:1700] 1.00 1.03e-91 1.00 2.04e-93 1.03e-91 ... ..$ pvals.2: num [1:1700] 0 0 0 0 0 0 0 0 0 0 ... ..$ pvals.3: num [1:1700] 1.02e-92 1.00 1.34e-93 1.00 1.00 ... ..$ seq : Factor w/ 3 levels "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15",..: 1 3 1 3 3 3 3 1 1 3 ... ..$ dists.1: num [1:1700] 0 105 0 105 105 61 58 46 0 105 ... ..$ dists.2: num [1:1700] 10 95 10 95 95 61 54 54 10 95 ... ..$ dists.3: num [1:1700] 105 0 105 0 0 44 47 59 105 0 ... $ min.like: num [1:100] -122710 -51439 -50310 -49976 -49718 ...
data(three.mode) head(three.mode[[4]])
data(three.mode) head(three.mode[[4]])
"datas" has 3 modes, but we observe here what happens when we try to fit it with 2 modal sequences. The most prominent modal sequences are 1:15, 15:1
The format is: List of 5 $ R :List of 2 ..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ... ..$ : int [1:15] 15 14 13 12 11 10 9 8 7 6 ... $ p : num [1:2] 0.557 0.443 $ lambda : num [1:2] 2.05 2.02 $ datas :'data.frame': 1700 obs. of 21 variables: ..$ X1 : num [1:1700] 1 15 1 15 15 12 10 4 1 15 ... ..$ X2 : num [1:1700] 2 14 2 14 14 13 13 12 2 14 ... ..$ X3 : num [1:1700] 3 13 3 13 13 2 4 6 3 13 ... ..$ X4 : num [1:1700] 4 12 4 12 12 8 7 1 4 12 ... ..$ X5 : num [1:1700] 5 11 5 11 11 9 14 5 5 11 ... ..$ X6 : num [1:1700] 6 10 6 10 10 1 8 10 6 10 ... ..$ X7 : num [1:1700] 7 9 7 9 9 15 1 13 7 9 ... ..$ X8 : num [1:1700] 8 8 8 8 8 10 9 9 8 8 ... ..$ X9 : num [1:1700] 9 7 9 7 7 6 5 14 9 7 ... ..$ X10 : num [1:1700] 10 6 10 6 6 11 11 8 10 6 ... ..$ X11 : num [1:1700] 11 5 11 5 5 3 15 2 11 5 ... ..$ X12 : num [1:1700] 12 4 12 4 4 14 12 11 12 4 ... ..$ X13 : num [1:1700] 13 3 13 3 3 7 2 7 13 3 ... ..$ X14 : num [1:1700] 14 2 14 2 2 5 3 15 14 2 ... ..$ X15 : num [1:1700] 15 1 15 1 1 4 6 3 15 1 ... ..$ clust : int [1:1700] 1 2 1 2 2 2 2 1 1 2 ... ..$ pvals.1: num [1:1700] 1.00 4.15e-94 1.00 4.15e-94 4.15e-94 ... ..$ pvals.2: num [1:1700] 5.4e-93 1.0 5.4e-93 1.0 1.0 ... ..$ seq : Factor w/ 2 levels "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15",..: 1 2 1 2 2 2 2 1 1 2 ... ..$ dists.1: num [1:1700] 0 105 0 105 105 61 58 46 0 105 ... ..$ dists.2: num [1:1700] 105 0 105 0 0 44 47 59 105 0 ... $ min.like: num [1:100] -178063 -139298 -58290 -54074 -53902 ...
data(two.mode) head(two.mode[[4]])
data(two.mode) head(two.mode[[4]])
The two-modes seem to divide well between Democrats and Republicans...
The format is: List of 5 $ R :List of 2 ..$ : int [1:3] 1 3 2 ..$ : int [1:3] 3 1 2 $ p : num [1:2] 0.541 0.459 $ lambda : num [1:2] 2.19 2.32 $ datas :'data.frame': 1378 obs. of 9 variables: ..$ Carter : int [1:1378] 1 1 1 1 2 2 1 1 2 2 ... ..$ Reagan : int [1:1378] 1 2 2 2 1 1 2 3 1 1 ... ..$ Anderson: int [1:1378] 1 2 2 3 3 3 3 2 3 3 ... ..$ clust : int [1:1378] 1 1 1 1 2 2 1 1 2 2 ... ..$ pvals.1 : num [1:1378] 0.541 0.992 0.992 0.932 0.131 ... ..$ pvals.2 : num [1:1378] 0.45893 0.00809 0.00809 0.06802 0.86945 ... ..$ seq : Factor w/ 2 levels "1 3 2","3 1 2": 1 1 1 1 2 2 1 1 2 2 ... ..$ dists.1 : num [1:1378] 0 0 0 1 2 2 1 0 2 2 ... ..$ dists.2 : num [1:1378] 0 2 2 2 1 1 2 3 1 1 ... $ min.like: num [1:100] -6421 -3386 -2916 -2811 -2799 ...
American Psychological Association http://www.electionstudies.org/studypages/anes_mergedfile_1980/anes_mergedfile_1980.htm
data(two.seq) head(two.seq[[4]])
data(two.seq) head(two.seq[[4]])
Updates the Lambda parameters to maximize the likelihood of the data under Mallows' model.
UpdateLambda(r, R, z, G, dists.to.Rg, dists.table, top.bound = 1000)
UpdateLambda(r, R, z, G, dists.to.Rg, dists.table, top.bound = 1000)
r |
Matrix of partial rankings. |
R |
Current modal sequences. |
z |
Current probabilities of memberships in each cluster. |
G |
Number of modal sequences. |
dists.to.Rg |
Matrix of the distances between the data and the current modal sequences. |
dists.table |
Table of the distance distribution in N! space, under Kendall's metric. |
top.bound |
The maximum value for the lambda parameter. |
Vector of new lambda parameters for the clusters.
Erik Gregory
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
Updates the proportion of data assigned to each cluster.
UpdateP(z)
UpdateP(z)
z |
Probabilities that each sequence is in each cluster. |
Proportion of data in each cluster.
Erik Gregory
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
Maximizes the likelihood of the data by updating the cluster centers of the model.
UpdateR(r, z, infos = NULL)
UpdateR(r, z, infos = NULL)
r |
Matrix of sequences being clustered. |
z |
Probability of cluster membership for each sequence and each cluster. |
infos |
The KendallInfo matrix for "r". |
New cluster centers for each cluster.
Erik Gregory
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.