A couple of people have mentioned clusters as a way to scale out rather than scale up.
This can work (we do it that way) but there are limitations. You are basically limited to the algorithms implemented in Spark and/or h2o, which is a much smaller set than what's available in single-machine R or Python libraries. It was enough for us, but you need to check it covers your use case.
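To make the gap concrete, here is a toy sketch. The algorithm names in `mllib_sample` are real Spark MLlib estimators; the ones in `single_machine_only` are examples of common scikit-learn algorithms with no built-in MLlib equivalent (as of recent Spark versions). The sets are small hand-picked illustrations, not exhaustive inventories.

```python
# Hand-picked samples only, to illustrate that the cluster engine's
# algorithm set is a subset of what single-machine libraries offer.
mllib_sample = {"LogisticRegression", "RandomForestClassifier",
                "GBTClassifier", "KMeans", "ALS"}
single_machine_only = {"DBSCAN", "TSNE", "GaussianProcessRegressor",
                       "SpectralClustering"}  # no built-in MLlib equivalent

def available_on_cluster(algo: str) -> bool:
    # The check you have to do before committing to a scale-out design:
    # is this algorithm actually implemented by the cluster engine?
    return algo in mllib_sample

print(available_on_cluster("KMeans"))   # True
print(available_on_cluster("DBSCAN"))   # False
```

If a project depends on one of the missing algorithms, you are back to either writing it yourself or scaling up a single machine.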
I don’t know anything about scaling out/clustering GPU resources; we targeted CPU.
You can write your own distributed algorithms on top of Spark/h2o, but that is a serious development effort.
A backend Spark/h2o cluster running on Linux can be accessed via R, Python, and other tools by users on other OSs.
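As a sketch of what that client setup looks like: a standalone Spark master listens on `spark://host:port` (port 7077 by default), and a client on any OS with pyspark installed just points its session at that URL. The hostname below is a made-up placeholder; the actual `SparkSession` call is shown in a comment since it needs a live cluster to run.

```python
def spark_master_url(host: str, port: int = 7077) -> str:
    # Standalone Spark masters use the spark:// URL scheme;
    # 7077 is the default master port.
    return f"spark://{host}:{port}"

url = spark_master_url("linux-cluster.example.com")  # placeholder hostname
print(url)  # spark://linux-cluster.example.com:7077

# With pyspark installed on a Windows/macOS client, connecting is then:
# from pyspark.sql import SparkSession
# spark = (SparkSession.builder
#          .master(url)
#          .appName("remote-client")
#          .getOrCreate())
```

h2o works similarly: its R and Python packages talk to the cluster over HTTP (`h2o.connect`/`h2o.init` pointed at the cluster's host and port), so the client OS doesn't matter.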