The illusory promise of the Aligned Rank Transform
Appendix. Additional experimental results
We present complementary results and new experiments that investigate additional scenarios. We also compare INT and RNK with other nonparametric methods. Unless explicitly mentioned in each section, we follow the experimental methodology presented in the main article. At the end of each section, we summarize our conclusions.
1 Results for
Although we only presented results for
For results from different experiments, we refer readers to our raw result files.
Conclusion
Results for
2 Main effects in the presence of interactions
In all experiments assessing Type I error rates reported in our article, we assumed no interaction effects. However, we also need to understand whether weak or strong interaction effects could affect the sensitivity of the methods in detecting main effects. This experiment evaluates the Type I error rate of the methods in the presence of an interaction effect alone, or alternatively, in the presence of a simultaneous main effect. We focus again on the three two-factor experimental designs we evaluated for our previous experiment, setting the sample size to
To simulate populations in which interactions emerge in the absence of main effects, we examine perfectly symmetric cross-interactions. To this end, we slightly change the method we use to encode the levels of each factor, such that levels are uniformly positioned around 0. For a factor with three levels, we numerically encode the levels as
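As a rough illustration of why codes centered on 0 produce an interaction without main effects, the Python sketch below builds population cell means from a product term only. The unit spacing of the codes, the 4 x 3 layout, the coefficient name `a12`, and the helper names are all assumptions made for this example; they are not the values used in our experiment.

```python
from statistics import mean

def centered_codes(k):
    """Evenly spaced codes for k levels, symmetric around 0 (spacing of 1 is illustrative)."""
    return [i - (k - 1) / 2 for i in range(k)]

def cell_means(codes1, codes2, a12):
    """Population cell means when only a product interaction a12 * x1 * x2 is present."""
    return [[a12 * x1 * x2 for x2 in codes2] for x1 in codes1]

codes1 = centered_codes(4)  # [-1.5, -0.5, 0.5, 1.5]
codes2 = centered_codes(3)  # [-1.0, 0.0, 1.0]
cells = cell_means(codes1, codes2, a12=2.0)

# Marginal means: average each factor's levels over the levels of the other factor.
marginal_x1 = [mean(row) for row in cells]
marginal_x2 = [mean(col) for col in zip(*cells)]
```

Because each set of codes sums to zero, every marginal mean vanishes while the individual cell means do not. This is the sense in which the cross-interaction is perfectly symmetric.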
Interaction effect only. We first test how the interaction effect alone influences the Type I error rate on
Interaction effect combined with main effect. We also evaluate the Type I error rate on
Conclusion
When distributions are skewed, the presence of an interaction effect can cause ART to detect a non-existent main effect. ART is more sensitive to such problems than parametric ANOVA. The performance of INT and RNK can also be affected by the presence of interaction effects, especially when the interaction effect is combined with a main effect, and distributions are either binomial or ordinal. More generally, main effects should be interpreted with caution when strong interactions exist.
3 Missing data
We evaluate how missing data can affect the performance of the four methods. Specifically, we study a scenario where a random sample of
Main effects. Figure 7 presents Type I error rates for the main effect of
Interaction effects. Figure 8 and Figure 9 present Type I error rates for the interaction effect in the presence of a single main effect or two parallel main effects. The error levels for all methods, including ART, are now very similar to the ones observed with no missing data.
Conclusion
ART is sensitive to the presence of missing data when at least
4 Log-normal distributions
In a different experiment, we evaluate log-normal distributions with a wider range of
Main effects. Figure 11 presents our results on Type I error rates for main effects. As expected, ART’s inflation of error rates is less serious when distributions are closer to normal, while the problem becomes worse as distributions are more skewed.
Interaction effects. We observe similar patterns for the Type I error rate of the interaction effect in the presence of a single main effect (Figure 12) or two parallel main effects (Figure 13). As shown in Figure 13, any advantage of ART over RNK and INT for testing interaction disappears even when distributions exhibit light skew levels. We also observe again that the performance of RNK and INT remains identical across all skew levels.
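The fact that RNK and INT behave identically across skew levels follows from the invariance of ranks under strictly increasing transformations. The Python sketch below illustrates this under the assumption that log-normal responses are obtained by exponentiating a common latent normal variable scaled by a parameter (called sigma here). The Rankit-style offset of 0.5 in the INT formula and the function names are also assumptions of this example, not a restatement of our implementation.

```python
import math
import random
from statistics import NormalDist

def average_ranks(values):
    """Ranks from 1 to N, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def inverse_normal_transform(values, offset=0.5):
    """Rank-based INT: normal quantile of (rank - offset) / N (offset choice is assumed)."""
    n = len(values)
    return [NormalDist().inv_cdf((r - offset) / n) for r in average_ranks(values)]

rng = random.Random(1)
latent = [rng.gauss(0.0, 1.0) for _ in range(30)]

# Two log-normal samples built from the same latent values (sigma = 0.2 and 2.0).
light_skew = [math.exp(0.2 * z) for z in latent]
heavy_skew = [math.exp(2.0 * z) for z in latent]

same_ranks = average_ranks(light_skew) == average_ranks(heavy_skew)
same_int = inverse_normal_transform(light_skew) == inverse_normal_transform(heavy_skew)
```

Both comparisons hold because exponentiation preserves the ordering of the latent values, so any analysis applied to the ranks or to the INT scores receives exactly the same input regardless of sigma.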
Conclusion
ART’s robustness issues become less severe as log-normal distributions become less skewed and thus closer to normal. However, even under distributions with light skew, ART’s Type I error rates for interactions reach higher levels than those of both INT and RNK when parallel main effects are present.
5 Binomial distributions
We also evaluate a wider range of parameters for the binomial distribution. We focus on the lower range of probabilities
Main effects. We present our results for the main effect in Figure 14. We observe that ART’s Type I error rates increase as the number of repetitions decreases and the probability of success approaches zero, reaching very high levels when
Interaction effects. Figure 15 shows similar patterns for the Type I error rate of the interaction effect in the presence of a single main effect. When both main effects increase beyond a certain level (see Figure 16), all methods seem to fail to control the error rate. ART again demonstrates the worst behavior, systematically inflating error rates even when main effects are absent. In several cases, RNK performs better than INT.
Conclusion
ART is extremely problematic under binomial distributions, raising Type I error rates to very high levels even in the absence of other effects. Testing interactions in the presence of parallel main effects can also be problematic for the other methods.
6 Ordinal data
Given the frequent use of ART with ordinal data, we evaluate our complete set of ordinal scales, based on both equidistant and flexible thresholds, with additional experimental designs.
Main effects. Figure 17 presents Type I error rates for the main effect. ART preserves error rates at nominal levels under the
Interaction effects. Figure 18 and Figure 19 present Type I error rates for the interaction effect in the presence of a single main effect or two parallel main effects. These results lead to similar conclusions. Even in cases where ART keeps error rates close to nominal levels (e.g., under the between-subjects design with equidistant thresholds), the performance of parametric ANOVA is consistently better.
Conclusion
ART’s inflation of Type I error rates with ordinal data is confirmed across a range of designs. For the between-subjects and mixed designs, the problem primarily concerns ordinal scales with flexible thresholds. However, for within-subjects designs, ART also inflates error rates for scales with equidistant thresholds, particularly when the number of levels is as low as five or seven. Again, all methods may fail to correctly infer interactions when parallel main effects exceed a certain threshold.
7 ART with median alignment
We evaluate a modified implementation of ART (ART-MED), where we use medians instead of means to align ranks. This approach draws inspiration from results by Salter and Fawcett (1993), showing that median alignment corrects ART’s unstable behavior under the Cauchy distribution. We only test the
We emphasize that Salter and Fawcett (1993) only apply mean and median alignment to interactions. Our implementation for main effects is based on the alignment approach of Wobbrock et al. (2011), where we simply replace means by medians.
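To make the alignment step concrete, the Python sketch below aligns responses for the interaction of a two-factor between-subjects layout, with the location estimate passed as a parameter. It is one plausible reading of "replacing means by medians", in which marginal and grand medians are computed on the raw observations; the function name, the data layout, and the toy data are assumptions of this example and not our actual implementation.

```python
from statistics import mean, median

def align_for_interaction(rows, center=mean):
    """Align responses for the A x B interaction.

    rows: list of (a, b, y) tuples; center: location estimate (mean or median).
    Returns y - center(level of A) - center(level of B) + center(all), which equals
    the cell residual plus the estimated interaction effect.
    """
    grand = center([y for _, _, y in rows])
    by_a, by_b = {}, {}
    for a, b, y in rows:
        by_a.setdefault(a, []).append(y)
        by_b.setdefault(b, []).append(y)
    center_a = {a: center(ys) for a, ys in by_a.items()}
    center_b = {b: center(ys) for b, ys in by_b.items()}
    return [y - center_a[a] - center_b[b] + grand for a, b, y in rows]

# Toy balanced 2 x 3 data with purely additive main effects and no noise.
effects_a = {"a1": 0, "a2": 4}
effects_b = {"b1": 0, "b2": 1, "b3": 5}
rows = [(a, b, 10 + ea + eb)
        for a, ea in effects_a.items()
        for b, eb in effects_b.items()
        for _ in range(2)]

aligned_mean = align_for_interaction(rows, center=mean)
aligned_median = align_for_interaction(rows, center=median)
```

On this additive toy data, both variants yield a constant aligned response, meaning that the main effects have been stripped out. The aligned values would then be ranked and submitted to ANOVA, with only the interaction term interpreted.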
Main effects. Our results presented in Figure 20 demonstrate that median alignment (ART-MED), or at least our implementation of the method, is not appropriate for testing main effects. Although Type I error rates are now lower for the Cauchy distribution compared to the original method, they are still above nominal levels. In addition, they are significantly higher for all other distributions.
Interaction effects. In contrast, median alignment works surprisingly well for interactions, correcting deficiencies of ART, especially when main effects are absent or weak. Figure 21 and Figure 22 present our results. Despite this improved performance, we cannot recommend using the method because it still cannot compete with INT. Additionally, its advantages over parametric ANOVA are only apparent for the Cauchy distribution.
Conclusion
Using median instead of mean alignment with ART significantly improves the method’s performance in testing interactions across all the distributions we tested. However, we cannot recommend it, as the method is still less robust than INT. Furthermore, it is unclear how to apply median alignment for testing main effects: using medians with the alignment method of Wobbrock et al. (2011) results in extremely high error rates.
8 Nonparametric tests in single-factor designs
We compare PAR, RNK, and INT to nonparametric tests for within- and between-subjects single-factor designs, where the factor has two, three, or four levels. Depending on the design, we use different nonparametric tests. For within-subjects designs, we use the Wilcoxon signed-rank test if the factor has two levels (2 within) and the Friedman test if the factor has three (3 within) or four (4 within) levels. For between-subjects designs (2 between, 3 between, and 4 between), we use the Kruskal–Wallis test.
Power. Figure 23 compares the power of the various methods as the magnitude of the main effect increases, where we use the abbreviation NON to designate a nonparametric test. We observe that primarily INT, but also RNK, generally exhibit better power than the nonparametric tests. Differences are more pronounced for within-subjects designs, corroborating Conover’s (2012) observation that the rank transformation results in a test that is superior to the Friedman test under certain conditions.
We expect that the accuracy of ANOVA on rank-transformed values will decrease with smaller samples. However, our tests with smaller samples of
Type I error rate under equal and unequal variances. Figure 24 presents the rate of positives under conditions of equal (
For between-subjects designs, we observe that the Kruskal–Wallis test and RNK yield very similar results. This is not surprising, as RNK is known to be a good approximation of the Kruskal–Wallis test (Conover 2012). INT’s positive rates are similar, although slightly higher under the binomial distribution. For within-subjects designs, differences among methods are more pronounced. The Wilcoxon signed-rank test (2 within) inflates rates well above
Figure 25 presents the same results but for
Conclusion
We do not see significant benefits in using dedicated nonparametric tests over RNK or INT. INT can replace nonparametric tests even for single-factor designs. If, after transforming the data, the assumptions of homoscedasticity or sphericity are still not met, applying common correction procedures (e.g., a Greenhouse–Geisser correction for sphericity violations) on the transformed data can reduce the risk of Type I errors.
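The close agreement between RNK and the Kruskal–Wallis test in between-subjects designs has a simple algebraic basis, which the Python sketch below checks numerically. For a single between-subjects factor, the tie-corrected Kruskal–Wallis statistic H and the F statistic of a one-way ANOVA on the pooled ranks are monotone functions of each other. The function names and the toy data are assumptions of this example.

```python
def average_ranks(values):
    """Ranks from 1 to N, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def rank_sums_of_squares(groups):
    """Between-group and total sums of squares of the pooled ranks, plus N."""
    pooled = [y for g in groups for y in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    grand = (n + 1) / 2  # mean of the ranks, with or without ties
    ss_total = sum((r - grand) ** 2 for r in ranks)
    ss_between, start = 0.0, 0
    for g in groups:
        group_ranks = ranks[start:start + len(g)]
        start += len(g)
        ss_between += len(g) * (sum(group_ranks) / len(g) - grand) ** 2
    return ss_between, ss_total, n

def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H, written as (N - 1) * SS_between / SS_total."""
    ss_between, ss_total, n = rank_sums_of_squares(groups)
    return (n - 1) * ss_between / ss_total

def f_on_ranks(groups):
    """F statistic of a one-way ANOVA applied to the pooled ranks (RNK)."""
    ss_between, ss_total, n = rank_sums_of_squares(groups)
    k = len(groups)
    return (ss_between / (k - 1)) / ((ss_total - ss_between) / (n - k))

groups = [[3, 5, 5, 8], [5, 9, 9, 12], [10, 12, 15, 15]]  # toy data with ties
n, k = sum(len(g) for g in groups), len(groups)
h = kruskal_wallis_h(groups)
f = f_on_ranks(groups)
f_from_h = (h / (k - 1)) / ((n - 1 - h) / (n - k))  # equals f up to rounding
```

Since F increases with H, the two procedures order samples identically and differ only in the reference distribution used to obtain a p-value (chi-square for the Kruskal–Wallis test, F for RNK).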
9 ANOVA-type statistic (ATS)
We compare PAR, RNK, and INT to the ANOVA-type statistic (ATS) (Brunner and Puri 2001) for two-factor designs. We use its implementation in the R package nparLD (Noguchi et al. 2012), which does not support between-subjects designs. Thus, we only evaluate it for the
Type I error rates: Main effects. Figure 26 presents Type I error rates for the main effect of
Figure 27 presents results for the main effect of
Type I error rates: Interactions. Figure 28 presents Type I error rates for the interaction in the presence of a single main effect. Results are again very similar for all three nonparametric methods under the mixed design. In contrast, the error rates of ATS tend to be lower than nominal levels under the within-subjects design, often falling below
When two parallel main effects are present, ATS and RNK lead to very similar trends (see Figure 29). Overall, INT appears to be a more robust method with the exception of the binomial distribution, for which error rates are higher for this method.
Power: Main effects. As shown in Figure 30 and Figure 31, ATS appears as the most powerful method for detecting effects of
Power: Interactions. Figure 32 shows results on power for interactions. INT emerges again as the most powerful method. The power of ATS is particularly low under the within-subjects design.
Conclusion
Although ATS appears to be a valid alternative, it does not offer clear performance advantages over INT, which is also simpler and more versatile.
10 Generalizations of nonparametric tests
Finally, we evaluate the generalizations of nonparametric tests recommended by Lüpsen (2018; 2023) as implemented in his np.anova function (Lüpsen 2021). Specifically, we evaluate the generalization of the van der Waerden test (VDW), and the generalization of the Kruskal–Wallis and Friedman tests (KWF). Their implementation uses R’s aov function and requires considering random slopes in the error term of the model, that is, using Error(s/(x1*x2)) for the two-factor within-participants design and Error(s/x2) for the mixed design, where s is the subject identifier variable. We also use the aov function and the same formulation of the error term for all other methods.
Type I error rates: Main effects. Figure 33 presents Type I error rates for the main effect of
Type I error rates: Interactions. Figure 34 presents the Type I error rates for the interaction, in the presence of a single main effect. Once again, the error rates of VDW and KWF decrease rapidly in both the between-subjects and mixed designs. Figure 35 shows the results when both main effects are present. Under the within-subjects design, KWF is the poorest-performing technique. While VDW outperforms RNK, it remains inferior to INT. For the between-subjects and mixed designs, KWF and VDW yield low or near-zero error rates when main effects are large, likely due to their extreme loss of power in these scenarios.
Power: Main effects. Figure 36 presents power results for detecting the main effect of
Figure 38 also presents results for the main effect of
Power: Interactions. We also present results on the power of methods to detect interactions in Figure 39, confirming the advantage of INT across all designs. Figure 40 provides a clearer picture of how power is affected by the presence of a main effect. We observe that the power of all methods drops as the main effect of
Conclusion
Our results do not support Lüpsen’s (2018; 2023) conclusions. The behavior of the generalized nonparametric tests presents issues in numerous scenarios. While these tests exhibit lower error rates under specific conditions, this is due to a significant loss of power when other effects are at play. Therefore, we advise against the use of these methods.