relevel vs relevel_by_frequency
I am using H2O's Python API to construct a GLM. To set the reference level for a categorical input I use relevel(). Based on the documentation, it seems that relevel_by_frequency() would accomplish the same thing, assuming I want the most frequent level as my base. What I find strange is that the GLM coefficient estimates differ depending on whether I use relevel('most frequent category') or relevel_by_frequency(). Is this behavior expected, or at least explainable?
It's not clear how you're using relevel, but the following two calls should produce different results if there are many categories:
relevel('most_frequent_cat') vs. relevel_by_frequency()
The first will only move the most frequent category to the top, whereas the second will reorder all of them.
You should expect strictly equivalent behavior if you force only the most frequent category to move in both cases:
relevel('most_frequent_cat') vs relevel_by_frequency(top_n=1)
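To make the difference in ordering concrete, here is a minimal plain-Python sketch of the two reorderings as described above. It does not call H2O: the helpers move_to_front and order_by_frequency and the sample data are made up for illustration, and the tie-breaking rule is an assumption.

```python
from collections import Counter


def move_to_front(levels, ref):
    """Illustrates relevel(ref): move one level to the front, leave the rest in place."""
    return [ref] + [lv for lv in levels if lv != ref]


def order_by_frequency(levels, values, top_n=None):
    """Illustrates the described relevel_by_frequency behavior (hypothetical helper).

    With top_n=None every level is reordered by descending count.
    With top_n=k only the k most frequent levels move to the front;
    the remaining levels keep their original relative order.
    """
    counts = Counter(values)
    # Assumption: ties keep their original relative order (sorted() is stable).
    ranked = sorted(levels, key=lambda lv: -counts[lv])
    if top_n is None or top_n >= len(levels):
        return ranked
    head = ranked[:top_n]
    return head + [lv for lv in levels if lv not in head]


# Made-up categorical column: "c" is the most frequent category.
values = ["c"] * 5 + ["a"] * 1 + ["d"] * 3 + ["b"] * 2
levels = sorted(set(values))  # starting order: a, b, c, d

by_name = move_to_front(levels, "c")
by_freq = order_by_frequency(levels, values)
by_freq_top1 = order_by_frequency(levels, values, top_n=1)

print(by_name)       # only "c" moved
print(by_freq)       # every level reordered
print(by_freq_top1)  # same order as by_name
```

In this sketch all three results start with "c", so the reference level is the same in each case; only the order of the remaining levels differs between the first two.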
Thanks for the response, @JuanM. It's still unclear to me why the GLM would behave differently under the two methods, since the coefficient for each non-base level indicates the effect of being a member of that level relative to the base level. If each method makes the most frequent category the reference level, the coefficients for the other categories should be the same. Maybe I've misinterpreted what the function is actually doing? I posted a detailed example on Stack Overflow that I hope will clarify the situation. Thanks again for the help.
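The expectation above can be checked in the simplest setting with a small plain-Python sketch. It assumes an unregularized least-squares fit with an intercept and a single dummy-coded categorical predictor, where the intercept is the base level's mean response and each coefficient is that level's mean minus the base mean. The function dummy_coefficients and the data are hypothetical; this is not H2O code and does not model any of H2O's GLM options.

```python
from statistics import mean


def dummy_coefficients(x, y, level_order):
    """Closed-form least-squares fit of y on one dummy-coded categorical x.

    The first entry of level_order is the reference (base) level.
    """
    group_mean = {
        lv: mean([yi for xi, yi in zip(x, y) if xi == lv]) for lv in level_order
    }
    base = level_order[0]
    coefs = {"Intercept": group_mean[base]}
    for lv in level_order[1:]:
        # Effect of belonging to lv relative to the base level.
        coefs[lv] = group_mean[lv] - group_mean[base]
    return coefs


# Made-up data: "c" is the most frequent category.
x = ["c", "c", "c", "a", "d", "d", "b"]
y = [1.0, 2.0, 3.0, 5.0, 4.0, 6.0, 7.0]

# Same base level, different order for the remaining levels.
fit_moved_one = dummy_coefficients(x, y, ["c", "a", "b", "d"])
fit_by_frequency = dummy_coefficients(x, y, ["c", "d", "a", "b"])

print(fit_moved_one)
print(fit_by_frequency)
print(fit_moved_one == fit_by_frequency)
```

Under these assumptions the two fits agree level by level, which is the behavior I expected to see from the GLM as well.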
Hi @JuanM. Following up to see if I can provide any additional info to help you answer my question. Should I post the working example code from the Stack Overflow link above here as well? Thanks again for the help.
@stuartking Just want to let you know that we're discussing this internally and should have an answer for you soon.