This is an early introductory document covering the differences between two supercomputers which I happen to have access to: TSUBAME 3.0 and ABCI.
More content might be added when I have time.
The following table compares the compute nodes. The characteristics are relatively similar, but ABCI is generally more powerful and uses slightly more recent hardware.
In particular, one would not train a mixed-precision model on TSUBAME, since its P100 GPUs lack the Tensor Cores that make FP16 training worthwhile on ABCI's V100s.
|Item||TSUBAME 3.0||ABCI|
|CPU||Intel(R) Xeon(R) CPU E5-2680 v4||Intel(R) Xeon(R) Gold 6148|
|CPU Core Count (Threads)||28 (56)||40 (80)|
|GPU||Tesla P100-SXM2||Tesla V100-SXM2|
|GPU VRAM||16GB (-10MB)||16GB (-0MB)|
|Interconnect||4x Intel Omni-Path||2x Mellanox Infiniband EDR|
|Shared Storage||/gs/hs0, /gs/hs1, /gs/hs2||/fs1, /fs2, /fs3|
|Max Duration - Batch (Hours)||24||72|
|Max Duration - Interactive (Hours)||24||12|
|Hourly Rate - Standard (JPY)||80||200|
TSUBAME 3.0 allows one to submit up to two trial jobs using qsub. This is not supported in ABCI.
(Getting a shell through a trial job via qrsh seems to be rather tricky, possibly because trial jobs have extremely low priority. I have not yet been able to confirm the cause.)
TSUBAME makes its interactive nodes world-accessible through a single domain, which is load balanced between two nodes. (Both nodes, login1 for example, can also be accessed directly by changing the name in the domain.)
ABCI, on the other hand, does not allow direct access to the interactive nodes, and one needs to go through a proxy. The proxy is on the domain as.abci.ai.
One needs to use this as a jump node to get to the actual login servers. This can be done by ssh-ing into as.abci.ai, then ssh-ing into es from there. This is tedious and annoying, but recent versions of OpenSSH simplify this through a feature called "jump hosts". Ignore the official documentation, which gives you a convoluted method for connecting, and use this:
ssh -J username@as.abci.ai username@es
es is load balanced between 4 nodes, which can be accessed directly.
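To avoid typing the jump host every time, the same route can be stored in the SSH client configuration. The following is only a sketch: the function name `abci_ssh_config` and the host aliases `abci-as` and `abci-es` are made up for illustration, and it assumes the same account name is used on both the proxy and the login servers. `ProxyJump` is the ssh_config counterpart of the `-J` flag.

```shell
#!/bin/sh
# Print an ssh_config stanza that routes connections to "es" through as.abci.ai.
# The aliases abci-as and abci-es are hypothetical names chosen for this example.
abci_ssh_config() {
    user="$1"   # your ABCI account name (assumed identical on both hosts)
    cat <<EOF
Host abci-as
    HostName as.abci.ai
    User ${user}

Host abci-es
    HostName es
    User ${user}
    ProxyJump abci-as
EOF
}
```

After appending the output to your configuration (for example `abci_ssh_config username >> ~/.ssh/config`), `ssh abci-es` should behave like the `ssh -J` command above. Both `-J` and `ProxyJump` need OpenSSH 7.3 or later.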
Since both use the exact same job scheduler (Univa Grid Engine), job submission generally works the same. The only thing you need to change when requesting a resource is the resource type, so the following command in TSUBAME:
qrsh -g my-tsubame-group -l q_node=1 -l h_rt=00:10:00
Would be this in ABCI:
qrsh -g my-abci-group -l rt_F=1 -l h_rt=00:10:00
The same rules apply in job scripts.
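As a rough illustration of what that means for a job script, the sketch below generates a minimal script for either system, with the resource type as the only difference. The function name `make_job_script` is made up for this example, and the script body is a placeholder; it assumes the resource types `q_node` and `rt_F` from the commands above and the standard Grid Engine `#$` directive prefix.

```shell
#!/bin/sh
# Emit a minimal Grid Engine job script for the given system ("tsubame" or "abci").
# Only the resource type line differs between the two.
make_job_script() {
    case "$1" in
        tsubame) resource="q_node=1" ;;
        abci)    resource="rt_F=1" ;;
        *) echo "unknown system: $1" >&2; return 1 ;;
    esac
    cat <<EOF
#!/bin/sh
#\$ -cwd
#\$ -l ${resource}
#\$ -l h_rt=00:10:00
echo "running on \$(hostname)"
EOF
}
```

One would then write the result to a file and submit it with the group given on the command line, for example `make_job_script abci > job.sh` followed by `qsub -g my-abci-group job.sh`.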
Both systems use Environment Modules (the module command) to enable optional software packages. The naming convention is slightly different: ABCI uses two levels for module versions, while TSUBAME uses only one.
- ABCI: cuda/10.0/10.0.130
- TSUBAME: cuda/10.0.130
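If the same script has to run on both systems, the naming difference can be hidden behind a small lookup. This is only a sketch: `cuda_module_for` is a hypothetical helper, and the module names are the two examples listed above, which may not match what is currently installed.

```shell
#!/bin/sh
# Map a system name to the CUDA module name used there.
# Module names are the examples from the list above, not an exhaustive catalogue.
cuda_module_for() {
    case "$1" in
        abci)    echo "cuda/10.0/10.0.130" ;;
        tsubame) echo "cuda/10.0.130" ;;
        *) echo "unknown system: $1" >&2; return 1 ;;
    esac
}
```

In a job script this could be used as `module load "$(cuda_module_for abci)"`.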
Choosing The Right Computer
ABCI is 2.5x as expensive as TSUBAME (200 vs. 80 JPY per hour at the standard rate). For Tokyo Tech users whose project billing falls under Article 4, Paragraph 4 (第4条第4項, non-public results), the difference is only 25%. The general rule of thumb seems to be this:
- I absolutely need FP16: ABCI, although it's worth noting that one can get 2x the nodes on TSUBAME for the same price, so if you can set up distributed training, it might be more cost effective to do it there.
- I need more than 256GB RAM: ABCI
- I need lots of CPU cores: ABCI
- I need bleeding edge CPU instructions: ABCI
- TSUBAME is overbooked: ABCI, obviously
- Otherwise: TSUBAME
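The cost trade-off behind these rules can be checked with some quick arithmetic. The sketch below assumes the standard hourly rates from the table are charged per node per hour and ignores discounts, the points conversion, and any priority multipliers; the function names are made up for this example.

```shell
#!/bin/sh
# Standard hourly rates in JPY, taken from the comparison table.
TSUBAME_RATE=80
ABCI_RATE=200

# Cost in JPY of a job: rate x nodes x hours.
job_cost() {
    rate="$1"; nodes="$2"; hours="$3"
    echo $(( rate * nodes * hours ))
}

# Number of whole nodes a budget buys for a job of the given length.
nodes_for_budget() {
    budget="$1"; rate="$2"; hours="$3"
    echo $(( budget / (rate * hours) ))
}
```

For example, `job_cost "$ABCI_RATE" 4 10` gives 8000 JPY for four ABCI nodes over ten hours, and `nodes_for_budget 8000 "$TSUBAME_RATE" 10` shows that the same budget covers ten TSUBAME nodes for the same duration.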
That said, both are way cheaper than EC2 or GCP.
I wish TSUBAME billing could be simplified; the points system seems to be designed to obfuscate how much money one is spending on compute.