relevel vs relevel_by_frequency

stuartking · October 2022

I am using H2O's Python API to construct a GLM. To set the reference level for a categorical input I use relevel(). Based on the documentation, it would seem that relevel_by_frequency() accomplishes the same thing, assuming I want the most frequent level as my base. What I find strange is that the GLM coefficient estimates differ depending on whether I use relevel('most frequent category') or relevel_by_frequency(). Is this behavior expected/understandable?

JM05 · November 2022

It's not clear how you're using [relevel], but the following two calls should produce different results if there are many categories:

relevel('most_frequent_cat')
vs.
relevel_by_frequency()

The first will only move the most frequent category to the top, whereas the second will reorder all of them.

You should expect strictly equivalent behavior if you move only the most frequent category in both cases:

relevel('most_frequent_cat')
vs
relevel_by_frequency(top_n=1)
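
To make the difference in level ordering concrete, here is a minimal plain-Python sketch of the behavior described above. It does not call H2O: `relevel_sketch` and `relevel_by_frequency_sketch` are hypothetical helpers written for illustration, and the toy levels and counts are made up. It only mimics the reordering as described in this thread, not H2O's actual implementation.

```python
from collections import Counter


def relevel_sketch(levels, base):
    """Move `base` to the front; keep the other levels in their existing order."""
    return [base] + [lv for lv in levels if lv != base]


def relevel_by_frequency_sketch(levels, values, top_n=-1):
    """Order levels by descending count.

    With top_n > 0, move only the top_n most frequent levels to the front
    and leave the remaining levels in their existing order (assumption
    based on the description in this thread).
    """
    counts = Counter(values)
    # sorted() is stable, so tied levels keep their original relative order.
    by_freq = sorted(levels, key=lambda lv: -counts[lv])
    if top_n < 0:
        return by_freq
    top = by_freq[:top_n]
    return top + [lv for lv in levels if lv not in top]


# Toy data (illustrative only): "c" is the most frequent level.
levels = ["a", "b", "c", "d"]
values = ["c"] * 5 + ["d"] * 3 + ["a"] * 2 + ["b"] * 4

moved_one = relevel_sketch(levels, "c")
reordered_all = relevel_by_frequency_sketch(levels, values)
top_one = relevel_by_frequency_sketch(levels, values, top_n=1)

print(moved_one)      # ['c', 'a', 'b', 'd']
print(reordered_all)  # ['c', 'b', 'd', 'a']
print(top_one)        # ['c', 'a', 'b', 'd']
```

In this sketch the base level is "c" in all three cases; only the order of the non-base levels differs between the first and second call.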

stuartking · November 2022

Thanks for the response, @JuanM. It's still unclear why the GLM would behave differently using the two methods, since the coefficient for each non-base level indicates the effect of being a member of that level relative to the base level. If each method makes the most frequent category the reference level, the coefficients for the other categories should be the same. Maybe I've misinterpreted what the function is actually doing? I posted a detailed example on SO that I hope will clarify the situation. Thanks again for the help.
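
The expectation above can be checked by hand in the simplest case. This is a minimal sketch under stated assumptions: a single categorical predictor, an identity link, and no regularization. It does not call H2O or reproduce its GLM; `dummy_coefficients` is a hypothetical helper and the data are made up. Under those assumptions, the least-squares coefficient of each non-base level equals that level's mean response minus the base level's mean response, so the order of the non-base levels does not affect it.

```python
from statistics import mean


def dummy_coefficients(level_order, data):
    """Coefficients of an unregularized least-squares fit on one categorical
    predictor with dummy coding, where level_order[0] is the base level.

    In this special case each coefficient is (level mean - base mean).
    """
    base = level_order[0]
    means = {lv: mean(y for x, y in data if x == lv) for lv in level_order}
    return {lv: means[lv] - means[base] for lv in level_order[1:]}


# Toy (level, response) pairs; "c" is the most frequent level.
data = [("c", 10.0), ("c", 12.0), ("c", 11.0), ("b", 7.0), ("b", 9.0), ("a", 4.0)]

# Same base level "c", different ordering of the non-base levels.
coef_one = dummy_coefficients(["c", "a", "b"], data)
coef_all = dummy_coefficients(["c", "b", "a"], data)

print(coef_one)  # {'a': -7.0, 'b': -3.0}
print(coef_all)  # {'b': -3.0, 'a': -7.0}
```

Both orderings give the same coefficient per level here; whether a real H2O fit matches this depends on settings this sketch leaves out.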

stuartking · December 2022

Hi @JuanM. Following up to see if I can provide any additional info to help you answer my question. Should I post the same working example code from the Stack Overflow link above here? Thanks again for the help.

JM05 · December 2022

@stuartking Just wanted to let you know that we're discussing this internally and should have an answer for you soon.

stuartking · September 2023

@JM05 A new issue related to this topic was opened on GitHub: https://github.com/h2oai/h2o-3/issues/15761
