Does the column order in scoring dataframes matter to H2O models using mojo format?

4iesous · March 2023

My ML Ops team suggests that the reason I'm getting different scores than they are (even though the scoring process flow in R and the mojo model are supposedly the same) may be that the columns in the dataframes we each score are in a different order. I can't verify for myself that they are using the exact same mojo model, etc., so I'm wondering: is it actually possible that the H2O mojo *.zip model file ignores the column names and simply relies on the order of the columns in the dataframe for scoring? It sounds far-fetched to me, but I wanted a second opinion. Thanks.

4iesous · March 2023

To add to my question above, one reason this seems strange is that when we each run the mojo model, we get warnings like those below. They seem to indicate that the model matches each named column in the dataframe being scored against the values (levels) seen when the model was built, i.e. it doesn't seem to blindly ignore the column names and depend only on the number and order of the variables the model was originally built with.

Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
Test/Validation dataset column 'first_individual' has levels not trained on: ["18F"]
2: In doTryCatch(return(expr), name, parentenv, handler) :
Test/Validation dataset column 'k0083' has levels not trained on: ["1F", "2F", "4F", "8F", "9H"]
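
One way to settle this empirically would be to score the same rows twice, once with the columns shuffled, and compare the results. The sketch below illustrates that idea in Python (not R) with two toy scorers. It does not call H2O at all: `score_by_name`, `score_by_position`, the column names, and the weights are all hypothetical, made up purely to show what each behavior would look like.

```python
# Toy illustration only. Nothing here is an H2O function; the columns,
# weights, and scorers are hypothetical stand-ins.

TRAINING_COLUMNS = ["age", "income", "tenure"]       # assumed training order
WEIGHTS = {"age": 2, "income": 3, "tenure": 5}       # made-up coefficients


def score_by_name(row):
    """Look each feature up by column name, so input order is irrelevant."""
    return sum(WEIGHTS[col] * row[col] for col in TRAINING_COLUMNS)


def score_by_position(values):
    """Ignore names and assume the values arrive in training order."""
    return sum(WEIGHTS[train_col] * value
               for train_col, value in zip(TRAINING_COLUMNS, values))


def reorder(columns, values, target_order):
    """Rearrange one row's values into `target_order`, matching by name."""
    lookup = dict(zip(columns, values))
    return [lookup[col] for col in target_order]


# The same row, laid out in two different column orders.
original_cols = ["age", "income", "tenure"]
original_vals = [40, 7, 3]
shuffled_cols = ["tenure", "age", "income"]
shuffled_vals = reorder(original_cols, original_vals, shuffled_cols)

name_a = score_by_name(dict(zip(original_cols, original_vals)))
name_b = score_by_name(dict(zip(shuffled_cols, shuffled_vals)))
pos_a = score_by_position(original_vals)
pos_b = score_by_position(shuffled_vals)

print("name-based scores equal:    ", name_a == name_b)
print("position-based scores equal:", pos_a == pos_b)
```

If the real mojo scoring behaves like the name-based scorer, shuffling the columns of my dataframe should leave every score unchanged. If any score moves, column order would be a plausible explanation for the discrepancy.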

