| Title: | Fit Multi-Modal Mallows' Models to Ranking Data |
|---|---|
| Description: | An EM algorithm to fit Mallows' mixture models to full or partial rankings, with or without ties. The Mallows model is a distance-based probability model over rankings, where probability decreases exponentially with distance from a central (modal) ranking. This package implements mixture models to handle multi-modal ranking data, as described in Murphy and Martin (2003) <doi:10.1016/S0167-9473(02)00165-2> and Adkins and Fligner (1998) <doi:10.1080/03610929808832223>. |
| Authors: | Erik Gregory [aut, cre] |
| Maintainer: | Erik Gregory <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 1.2.0 |
| Built: | 2026-06-06 07:53:57 UTC |
| Source: | https://github.com/e-gregory/rmallow |
Fits the Mallows' model to ranking data using an EM algorithm. Data can be partially or fully-ranked, with or without ties.
The main functions are:
Distance computation functions:
AllKendall: Compute Kendall distances between rankings
KendallInfo: Compute pairwise comparison information
DistanceDistribution: Distance distribution in N! space
Utility functions:
SimplifySequences: Normalize tied rankings
ConstructSeqs: Reconstruct sequences from preferences
Erik Gregory [email protected]
Murphy, T.B. and Martin, D. (2003). Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41, 645-655. doi:10.1016/S0167-9473(02)00165-2
Beckett, L.A. and Evans, D.A. (1991). Estimating a Population Distribution of Sequences of k Items from Cross-Sectional Data. Journal of the Royal Statistical Society, Series C, 40(1), 31-42.
Adkins, L. and Fligner, M. (1998). A Non-iterative procedure for maximum likelihood estimation of the parameters of Mallows' Model Based on Partial Rankings. Communications in Statistics - Theory and Methods, 27(9), 2199-2220. doi:10.1080/03610929808832223
Calculates all Kendall's distances between each ranking in r and
each sequence in seqs.
AllKendall(r, seqs, data_info = NULL, use_cpp = TRUE)AllKendall(r, seqs, data_info = NULL, use_cpp = TRUE)
r |
Matrix of rankings (each row is a ranking). |
seqs |
Matrix of reference sequences (each row is a sequence). |
data_info |
Optional precomputed Kendall information matrix for |
use_cpp |
Logical; if TRUE (default), use the fast C++ implementation. |
Matrix where entry [i, j] represents the Kendall distance
from ranking i in r to sequence j in seqs.
Erik Gregory
data1 <- rbind(1:5, 5:1, c(3, 2, 1, 4, 5)) data2 <- rbind(1:5, 5:1) AllKendall(data1, data2)data1 <- rbind(1:5, 5:1, c(3, 2, 1, 4, 5)) data2 <- rbind(1:5, 5:1) AllKendall(data1, data2)
Calculates the Kendall distance from each sequence to the canonical ordering (1, 2, ..., N).
AllSeqDists(seqs, use_cpp = TRUE)AllSeqDists(seqs, use_cpp = TRUE)
seqs |
Matrix or data frame of sequences, where each row is a ranking. |
use_cpp |
Logical; if TRUE (default), use the fast C++ implementation. |
Integer vector of distances from each sequence to 1:N.
Erik Gregory
seqs <- rbind(1:5, 5:1, c(1, 3, 2, 4, 5)) AllSeqDists(seqs) # Returns: 0, 10, 1seqs <- rbind(1:5, 5:1, c(1, 3, 2, 4, 5)) AllSeqDists(seqs) # Returns: 0, 10, 1
The EM algorithm for Mallows' model can converge to local maxima. This function runs the algorithm multiple times with different random initializations and returns the model with the highest likelihood.
BestFit(datas, N, iter, G, verbose = TRUE)BestFit(datas, N, iter, G, verbose = TRUE)
datas |
Matrix of rankings to fit (N x n_items). |
N |
Number of times to run the model with different initializations. |
iter |
Maximum number of EM iterations for each run. |
G |
Number of cluster centers (modal sequences). |
verbose |
Logical; if TRUE, print progress messages. |
The model fit with the highest log-likelihood. See Mallows
for details of the return value.
Erik Gregory
Calculates the normalizing constant C(lambda) for Mallows' model in a sequence space of size N!.
C_lam(lambda, dists = NULL, dists_table = NULL)C_lam(lambda, dists = NULL, dists_table = NULL)
lambda |
Spread parameter for Mallows' model (non-negative). |
dists |
Optional vector of all distances from each sequence to the
modal sequence 1:N. If NULL, |
dists_table |
Optional table of distance counts. If NULL, computed
from |
The normalizing constant C(lambda).
Erik Gregory
# For a 4-item ranking space dist_tab <- DistanceDistribution(4) C_lam(0.5, dists_table = dist_tab) C_lam(1.0, dists_table = dist_tab)# For a 4-item ranking space dist_tab <- DistanceDistribution(4) C_lam(0.5, dists_table = dist_tab) C_lam(1.0, dists_table = dist_tab)
Reconstructs ranking sequences from their Kendall information representation. Each fully-ordered sequence has a unique Kendall information vector.
ConstructSeqs(prefs, n_abils, use_cpp = TRUE)ConstructSeqs(prefs, n_abils, use_cpp = TRUE)
prefs |
Matrix of pairwise preferences, where each row represents one sequence's preferences. A value of 1 indicates a decrease (first element > second), 0 indicates an increase. |
n_abils |
Number of items in the original ranking. |
use_cpp |
Logical; if TRUE (default), use the fast C++ implementation. |
List of integer vectors, each representing a reconstructed ranking.
Erik Gregory
# Reconstruct a sequence where item 4 is ranked first # For 4 items, there are 4*3/2 = 6 pairwise comparisons prefs <- matrix(c(1, 1, 1, 0, 0, 0), nrow = 1) ConstructSeqs(prefs, 4) # Returns list(c(4, 1, 2, 3)) # Multiple sequences at once prefs2 <- rbind(c(0, 0, 0, 0, 0, 0), c(1, 1, 1, 1, 1, 1)) ConstructSeqs(prefs2, 4) # Returns list(c(1,2,3,4), c(4,3,2,1))# Reconstruct a sequence where item 4 is ranked first # For 4 items, there are 4*3/2 = 6 pairwise comparisons prefs <- matrix(c(1, 1, 1, 0, 0, 0), nrow = 1) ConstructSeqs(prefs, 4) # Returns list(c(4, 1, 2, 3)) # Multiple sequences at once prefs2 <- rbind(c(0, 0, 0, 0, 0, 0), c(1, 1, 1, 1, 1, 1)) ConstructSeqs(prefs2, 4) # Returns list(c(1,2,3,4), c(4,3,2,1))
Synthetic data set containing 3 modal sequences in 15! space, with some noise added. Useful for testing and demonstrating the package.
A numeric matrix with 1700 rows (observations) and 15 columns (items).
data(datas) head(datas) dim(datas)data(datas) head(datas) dim(datas)
Efficiently computes the count of permutations at each Kendall distance from the canonical ordering (1, 2, ..., N). Uses a recurrence relation rather than enumeration, making it efficient even for large N.
DistanceDistribution(N = 3, use_cpp = TRUE)DistanceDistribution(N = 3, use_cpp = TRUE)
N |
Integer value, the length of rankings (must be >= 1). |
use_cpp |
Logical; if TRUE (default), use the fast C++ implementation. |
Named numeric vector where names are distances (0 to N*(N-1)/2) and values are counts of permutations at that distance.
Erik Gregory
SeqDistribution for the brute-force approach
# Distribution for 5-item rankings dist_tab <- DistanceDistribution(5) dist_tab # Verify total equals 5! sum(dist_tab) # Works efficiently even for large N dist_tab_20 <- DistanceDistribution(20) length(dist_tab_20) # Maximum distance is 20*19/2 = 190# Distribution for 5-item rankings dist_tab <- DistanceDistribution(5) dist_tab # Verify total equals 5! sum(dist_tab) # Works efficiently even for large N dist_tab_20 <- DistanceDistribution(20) length(dist_tab_20) # Maximum distance is 20*19/2 = 190
Pre-processed version of the 1980 American Psychological Association Presidential candidate ranking data. Uninformative rankings have been removed and values have been pre-simplified into partial rankings.
An integer matrix with 1378 rows (voters) and 3 columns (Carter, Reagan, Anderson).
American National Election Studies, https://electionstudies.org/
data(elect) head(elect) table(elect[, "Carter"]) # Distribution of Carter rankingsdata(elect) head(elect) table(elect[, "Carter"]) # Distribution of Carter rankings
Computes the probability that each ranking belongs to each cluster, given the current model parameters.
EStep(R, r, p, lambda, G, N, C, all_dists = NULL, use_cpp = TRUE)EStep(R, r, p, lambda, G, N, C, all_dists = NULL, use_cpp = TRUE)
R |
List of current cluster modal sequences. |
r |
Matrix of rankings (N x n_items). |
p |
Vector of cluster proportions (length G). |
lambda |
Vector of spread parameters (length G). |
G |
Number of clusters. |
N |
Number of observations. |
C |
Vector of normalizing constants (length G). |
all_dists |
Optional precomputed distance matrix (N x G). If NULL,
computed from |
use_cpp |
Logical; if TRUE (default), use the fast C++ implementation. |
Matrix (N x G) where entry [i, j] is the probability that
observation i belongs to cluster j.
Erik Gregory
Murphy, T.B. and Martin, D. (2003). Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41, 645-655.
Formats the results of the Mallows model fitting procedure into a structured list with cluster assignments and diagnostics.
FormatOut(R, p, lambda, z, datas, likelihood)FormatOut(R, p, lambda, z, datas, likelihood)
R |
List of modal sequences (length G). |
p |
Vector of cluster proportions (length G). |
lambda |
Vector of spread parameters (length G). |
z |
Matrix of membership probabilities (N x G). |
datas |
Original data matrix of rankings (N x n_items). |
likelihood |
Vector of log-likelihood values at each iteration. |
List with components:
Modal sequences for each cluster
Proportion of data in each cluster
Spread parameters for each cluster
Data frame with original rankings plus cluster assignments, membership probabilities, assigned modal sequence, and distances to each cluster center
Log-likelihood at each iteration
Erik Gregory
Performs each column-wise comparison on a matrix of sequences. A 0 value denotes an increase between two columns, 1 denotes a decrease, and NA indicates tied values.
KendallInfo(r, inds = NULL, use_cpp = TRUE)KendallInfo(r, inds = NULL, use_cpp = TRUE)
r |
Matrix of sequences, where each row is a ranking. |
inds |
Optional matrix of column index pairs for comparisons. If NULL,
all pairs are computed using |
use_cpp |
Logical; if TRUE (default), use the fast C++ implementation. |
Matrix of 0s, 1s, and NAs representing pairwise comparisons. Each column corresponds to a pair of positions in the original ranking.
Erik Gregory
https://en.wikipedia.org/wiki/Kendall_tau_distance
# Full ranking r <- matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), nrow = 2, byrow = TRUE) KendallInfo(r) # Ranking with ties r_ties <- matrix(c(1, 1, 2, 3, 3), nrow = 1) KendallInfo(r_ties)# Full ranking r <- matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), nrow = 2, byrow = TRUE) KendallInfo(r) # Ranking with ties r_ties <- matrix(c(1, 1, 2, 3, 3), nrow = 1) KendallInfo(r_ties)
Computes the objective function whose root gives the MLE of lambda
in Mallows' model. Used internally by UpdateLambda.
Lambda(lambda, rhs, dists = NULL, dists_table = NULL)Lambda(lambda, rhs, dists = NULL, dists_table = NULL)
lambda |
Lambda value to evaluate the function at. |
rhs |
Right-hand side of the estimating equation (weighted mean distance). |
dists |
Not used (kept for backward compatibility). |
dists_table |
Table of distance distribution in N! space. |
Value of the objective function. The goal is to find lambda where this equals zero.
Erik Gregory
Murphy, T.B. and Martin, D. (2003). Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41, 645-655.
Calculates the complete-data log-likelihood of the rankings given the current model parameters.
Likelihood(z, p, C_lam, lambda, all_dists_data)Likelihood(z, p, C_lam, lambda, all_dists_data)
z |
Matrix of cluster membership probabilities (N x G). |
p |
Vector of cluster proportions (length G). |
C_lam |
Vector of normalizing constants (length G). |
lambda |
Vector of spread parameters (length G). |
all_dists_data |
Matrix of distances from data to modal sequences (N x G). |
Numeric log-likelihood value.
Erik Gregory
Murphy, T.B. and Martin, D. (2003). Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41, 645-655.
Fits a mixture of Mallows' models to partial or full ranking data using an EM algorithm with Kendall's distance metric.
Mallows(datas, G, iter = 100L, hyp = NULL, plot_like = FALSE, verbose = TRUE, tol = 1e-06, use_cpp = TRUE)Mallows(datas, G, iter = 100L, hyp = NULL, plot_like = FALSE, verbose = TRUE, tol = 1e-06, use_cpp = TRUE)
datas |
Matrix of partial or fully-ranked data (N x n_items). |
G |
Number of modes (clusters), must be >= 1. |
iter |
Maximum number of EM iterations (default 100). |
hyp |
Optional hypothesis sequence to initialize one cluster center. |
plot_like |
Logical; if TRUE, plot likelihood at each iteration. |
verbose |
Logical; if TRUE, print progress messages. |
tol |
Convergence tolerance. Algorithm stops when likelihood change is less than this value. |
use_cpp |
Logical; if TRUE (default), use the fast C++ implementations. |
List with components:
List of modal sequences for each cluster
Vector of cluster proportions
Vector of spread parameters
Data frame with cluster assignments and diagnostics
Vector of log-likelihood at each iteration
Erik Gregory
Murphy, T.B. and Martin, D. (2003). Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41, 645-655.
Adkins, L. and Fligner, M. (1998). A Non-iterative procedure for maximum likelihood estimation of the parameters of Mallows' Model Based on Partial Rankings. Communications in Statistics - Theory and Methods, 27:9, 2199-2220.
Given the Kendall distance distribution in N! space, computes the distribution in (N+1)! space. This uses the recurrence relation for Mahonian numbers (number of permutations with k inversions).
NextTable(last_table, n_last, use_cpp = TRUE)NextTable(last_table, n_last, use_cpp = TRUE)
last_table |
Table or numeric vector of distance counts in N! space. |
n_last |
The value of N (current space is N!). |
use_cpp |
Logical; if TRUE (default), use the fast C++ implementation. |
Numeric vector of distance counts in (N+1)! space.
Erik Gregory
# Get distribution for 5! from 4! tab4 <- DistanceDistribution(4) tab5 <- NextTable(tab4, 4) # Verify: should have 5! = 120 total permutations sum(tab5)# Get distribution for 5! from 4! tab4 <- DistanceDistribution(4) tab5 <- NextTable(tab4, 4) # Verify: should have 5! = 120 total permutations sum(tab5)
Generates initial modal sequences for the Mallows mixture model clustering. If a hypothesis sequence is provided, it becomes the first cluster center. Remaining cluster centers are initialized randomly.
Rgen(G, hyp = NULL, abils)Rgen(G, hyp = NULL, abils)
G |
Number of cluster centers (modes). |
hyp |
Optional hypothesis sequence to use as the first cluster center. Must be a permutation of 1:abils. |
abils |
Number of items being ranked. |
List of G integer vectors, each a permutation of 1:abils.
Erik Gregory
# Three random cluster centers for 5-item rankings set.seed(42) Rgen(3, abils = 5) # With a specific hypothesis sequence Rgen(3, hyp = c(5, 4, 3, 2, 1), abils = 5)# Three random cluster centers for 5-item rankings set.seed(42) Rgen(3, abils = 5) # With a specific hypothesis sequence Rgen(3, hyp = c(5, 4, 3, 2, 1), abils = 5)
Computes Kendall's distance from 1:N to each permutation in N! space by
enumerating all permutations. This is computationally expensive for N >= 8.
For larger N, use DistanceDistribution instead.
SeqDistribution(N, use_cpp = TRUE)SeqDistribution(N, use_cpp = TRUE)
N |
Length of the ranking. Should be less than 9 for practical use. |
use_cpp |
Logical; if TRUE (default), use the fast C++ implementation. |
Integer vector of Kendall distances from 1:N to each permutation.
Erik Gregory
DistanceDistribution for an efficient alternative
# All distances in 4! space dists <- SeqDistribution(4) table(dists)# All distances in 4! space dists <- SeqDistribution(4) table(dists)
Transforms rankings so that tie groups use consecutive integers. For example, (1, 1, 2, 4, 4, 5) becomes (1, 1, 2, 3, 3, 4).
SimplifySequences(rankings, use_cpp = TRUE)SimplifySequences(rankings, use_cpp = TRUE)
rankings |
Matrix of rankings to simplify, where each row is a ranking. |
use_cpp |
Logical; if TRUE (default), use the fast C++ implementation. |
Matrix of simplified rankings with the same dimensions.
Erik Gregory
# Single ranking with gaps rankings <- matrix(c(1, 1, 2, 4, 4, 5), nrow = 1) SimplifySequences(rankings) # Returns: 1 1 2 3 3 4 # Multiple rankings rankings <- rbind( c(1, 3, 3, 5), c(2, 2, 4, 6) ) SimplifySequences(rankings)# Single ranking with gaps rankings <- matrix(c(1, 1, 2, 4, 4, 5), nrow = 1) SimplifySequences(rankings) # Returns: 1 1 2 3 3 4 # Multiple rankings rankings <- rbind( c(1, 3, 3, 5), c(2, 2, 4, 6) ) SimplifySequences(rankings)
The datas data set fitted with 3 modal sequences.
Compare to two.mode to see the effect of model selection.
A list with components:
List of 3 modal sequences
Vector of cluster proportions
Vector of spread parameters
Data frame with cluster assignments
Log-likelihood trajectory
data(three.mode) three.mode$R # Modal sequences three.mode$p # Cluster proportionsdata(three.mode) three.mode$R # Modal sequences three.mode$p # Cluster proportions
The datas data set fitted with only 2 modal sequences.
Since the true data has 3 modes, this shows what happens with
model misspecification.
A list with components:
List of 2 modal sequences
Vector of cluster proportions
Vector of spread parameters
Data frame with cluster assignments
Log-likelihood trajectory
data(two.mode) two.mode$R # Only 2 modal sequencesdata(two.mode) two.mode$R # Only 2 modal sequences
The elect data fitted with 2 modal sequences. The two modes
appear to correspond roughly to Democratic and Republican voting patterns.
A list with components:
List of 2 modal sequences
Vector of cluster proportions
Vector of spread parameters
Data frame with cluster assignments
Log-likelihood trajectory
American National Election Studies
data(two.seq) two.seq$R # Modal sequences (voting patterns) two.seq$p # Proportion in each groupdata(two.seq) two.seq$R # Modal sequences (voting patterns) two.seq$p # Proportion in each group
Updates the lambda (spread) parameters for each cluster to maximize the likelihood given current membership probabilities.
UpdateLambda(r, R, z, G, dists_to_Rg, dists_table, top_bound = 1000)UpdateLambda(r, R, z, G, dists_to_Rg, dists_table, top_bound = 1000)
r |
Matrix of rankings (N x n_items). |
R |
List of current modal sequences. |
z |
Matrix of membership probabilities (N x G). |
G |
Number of clusters. |
dists_to_Rg |
Matrix of distances from data to modal sequences (N x G). |
dists_table |
Table of distance distribution in N! space. |
top_bound |
Maximum allowed value for lambda (default 1000). |
Vector of updated lambda parameters (length G).
Erik Gregory
Murphy, T.B. and Martin, D. (2003). Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41, 645-655.
Updates the proportion of data assigned to each cluster based on current membership probabilities.
UpdateP(z)UpdateP(z)
z |
Matrix of membership probabilities (N x G). |
Vector of updated cluster proportions (length G).
Erik Gregory
Murphy, T.B. and Martin, D. (2003). Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41, 645-655.
# Example membership matrix for 10 observations, 3 clusters z <- matrix(c( 0.9, 0.05, 0.05, 0.8, 0.1, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.8, 0.33, 0.33, 0.34, 0.9, 0.05, 0.05, 0.05, 0.9, 0.05, 0.05, 0.05, 0.9, 0.7, 0.2, 0.1, 0.2, 0.3, 0.5 ), nrow = 10, byrow = TRUE) UpdateP(z)# Example membership matrix for 10 observations, 3 clusters z <- matrix(c( 0.9, 0.05, 0.05, 0.8, 0.1, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.8, 0.33, 0.33, 0.34, 0.9, 0.05, 0.05, 0.05, 0.9, 0.05, 0.05, 0.05, 0.9, 0.7, 0.2, 0.1, 0.2, 0.3, 0.5 ), nrow = 10, byrow = TRUE) UpdateP(z)
Updates the cluster centers (modal sequences) to maximize the likelihood given current membership probabilities.
UpdateR(r, z, infos = NULL, use_cpp = TRUE)UpdateR(r, z, infos = NULL, use_cpp = TRUE)
r |
Matrix of rankings (N x n_items). |
z |
Matrix of membership probabilities (N x G). |
infos |
Optional precomputed Kendall information matrix for |
use_cpp |
Logical; if TRUE (default), use the fast C++ implementation. |
List of G updated modal sequences.
Erik Gregory
Murphy, T.B. and Martin, D. (2003). Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41, 645-655.