relevel vs relevel_by_frequency

I am using H2O's Python API to construct a GLM. To set the reference level for a categorical input I use relevel(). Based on the documentation, it seems that relevel_by_frequency() would accomplish the same thing, assuming I want the most frequent level as my base. What I find strange is that the GLM coefficient estimates differ depending on whether I use relevel('most frequent category') or relevel_by_frequency(). Is this behavior expected/understandable?

Answers

  • JM05
    JM05 Administrator Posts: 26

    It's not clear how you're using relevel, but the following two calls should produce different results if there are many categories:

    relevel('most_frequent_cat')
    vs.
    relevel_by_frequency()
    

    The first moves only the most frequent category to the top, whereas the second reorders all of them.

    You should expect strictly equivalent behavior if you force only the most frequent category to move in both cases:

    relevel('most_frequent_cat')
    vs
    relevel_by_frequency(top_n=1)
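
To illustrate the distinction described above, here is a minimal pure-Python sketch of the level ordering each call is said to produce. It is not H2O code and does not call H2O: the function names `relevel_sketch` and `relevel_by_frequency_sketch` are hypothetical, the convention that `top_n=-1` means "reorder all levels" is an assumption, and so is the tie-breaking rule (ties keep their original order).

```python
from collections import Counter


def relevel_sketch(levels, base):
    """Move `base` to the front; the other levels keep their original order."""
    return [base] + [lv for lv in levels if lv != base]


def relevel_by_frequency_sketch(levels, values, top_n=-1):
    """Order levels by descending frequency in `values`.

    top_n < 0 reorders every level (assumed default); top_n = k moves only
    the k most frequent levels to the front and leaves the rest untouched.
    """
    counts = Counter(values)
    # sorted() is stable, so equally frequent levels keep their original order
    by_freq = sorted(levels, key=lambda lv: -counts[lv])
    if top_n < 0:
        return by_freq
    top = by_freq[:top_n]
    return top + [lv for lv in levels if lv not in top]


levels = ["a", "b", "c", "d"]
values = ["c"] * 5 + ["d"] * 3 + ["a"] * 2 + ["b"] * 1

moved_one = relevel_sketch(levels, "c")
reordered_all = relevel_by_frequency_sketch(levels, values)
moved_top_one = relevel_by_frequency_sketch(levels, values, top_n=1)

print(moved_one)
print(reordered_all)
print(moved_top_one)
```

In this sketch all three results share the same first level, and only the order of the remaining levels differs between the first two.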
    
  • stuartking
    stuartking Member Posts: 6

    Thanks for the response, @JuanM. It's still unclear why the GLM would behave differently under the two methods, since the coefficient for each non-base level indicates the effect of being a member of that level relative to the base level. If each method makes the most frequent category the reference level, the coefficients for the other categories should be the same. Maybe I've misinterpreted what the function is actually doing? I posted a detailed example on Stack Overflow that I hope will clarify the situation. Thanks again for the help.
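
The expectation in this post can be checked on a toy problem. The sketch below fits an unregularized least-squares model with a single dummy-coded categorical predictor, once for each of two level orderings that share the same reference level. It is only an illustration of the argument: `fit_ols_dummy` and the data are hypothetical, and H2O's GLM may apply regularization or other settings that this does not model.

```python
from fractions import Fraction


def fit_ols_dummy(levels, x, y):
    """Least-squares fit of y on an intercept plus 0/1 dummies for levels[1:].

    levels[0] is the reference level. Every level must occur in x.
    Exact rational arithmetic avoids floating-point noise in the comparison.
    """
    names = ["Intercept"] + levels[1:]
    p = len(names)
    # Design matrix: intercept column, then one indicator per non-reference level
    X = [[Fraction(1)] + [Fraction(int(xi == lv)) for lv in levels[1:]] for xi in x]
    # Augmented normal equations [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * Fraction(yi) for r, yi in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination
    for c in range(p):
        piv = next(r for r in range(c, p) if A[r][c] != 0)
        A[c], A[piv] = A[piv], A[c]
        pivot_value = A[c][c]
        A[c] = [v / pivot_value for v in A[c]]
        for r in range(p):
            if r != c:
                factor = A[r][c]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    return {name: A[i][p] for i, name in enumerate(names)}


x = ["c", "c", "c", "a", "a", "b", "d", "d"]
y = [10, 12, 14, 3, 5, 7, 20, 22]

# Same reference level "c", different order for the remaining levels
fit_a = fit_ols_dummy(["c", "a", "b", "d"], x, y)
fit_b = fit_ols_dummy(["c", "a", "d", "b"], x, y)

print(fit_a == fit_b)
```

In this unregularized setting each coefficient is the level's mean response minus the reference level's mean, so the order of the non-reference levels has no effect on the fitted values.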

  • stuartking
    stuartking Member Posts: 6

    Hi @JuanM. Following up to see if I can provide any additional info to help you answer my question. Should I post the same working example code from the Stack Overflow link above here? Thanks again for the help.

  • JM05
    JM05 Administrator Posts: 26

    @stuartking Just wanted to let you know that we're discussing this internally and should have an answer for you soon.

  • stuartking
    stuartking Member Posts: 6

    @JM05 A new issue related to this topic was opened on GitHub: https://github.com/h2oai/h2o-3/issues/15761