Random ramblings of an anonymous software engineer. Contains occasional profanity. Personal opinions, not related to my employer.


This is an early introductory document covering the differences between two supercomputers I happen to have access to.

More content might be added when I have time.


The following table compares the compute nodes. Their characteristics are relatively similar, but ABCI is generally more powerful and uses slightly more recent hardware.

In particular, one would not train a mixed-precision model on TSUBAME, since the P100 lacks the Tensor Cores that make FP16 training attractive on the V100.

                                     TSUBAME 3.0                      ABCI
CPU                                  Intel(R) Xeon(R) E5-2680 v4      Intel(R) Xeon(R) Gold 6148
CPU Clock                            2.40GHz                          2.40GHz
CPU Core Count (Threads)             28 (56)                          40 (80)
RAM                                  256GB                            360GB
GPU                                  Tesla P100-SXM2                  Tesla V100-SXM2
GPU VRAM                             16GB (-10MB)                     16GB (-0MB)
Interconnect                         4x Intel Omni-Path               2x Mellanox InfiniBand EDR
Full Resource                        q_node                           rt_F
Scratch (NVMe)                       /scr                             /local
Apps                                 /apps                            /bb/apps
Home FS                              Lustre                           Lustre
Shared                               /gs/hs0, /gs/hs1, /gs/hs2        /fs1, /fs2, /fs3
Shared FS                            Lustre                           GPFS
Max Duration - Batch (Hours)         24                               72
Max Duration - Interactive (Hours)   24                               12
Hourly Rate - Standard (JPY)         80                               200

Trial Jobs

TSUBAME 3.0 allows one to submit up to two trial jobs using qsub. ABCI has no equivalent.

(Getting a shell through a trial job via qrsh seems to be rather tricky, possibly because trial jobs have extremely low priority. For the time being the cause of this is inconclusive.)
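For reference, a sketch of what a trial submission might look like. This reflects my understanding of the mechanism, which is an assumption: trial jobs appear to be submitted by simply omitting the -g group flag, and they are limited to short runtimes.

```
# Assumed trial submission on TSUBAME 3.0: no -g group flag.
# Trial jobs are limited in runtime and concurrent count.
qsub -l q_node=1 -l h_rt=0:10:00 job.sh
```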

Interactive Node

TSUBAME's interactive node is world-accessible through the following domain, which is load balanced between two nodes (login0 and login1; both can be reached directly by changing the name in the domain).

ABCI, on the other hand, does not allow direct access to the interactive node; one needs to go through a proxy. The proxy is on the domain:

One then uses this as a jump node to reach the actual login servers: ssh into the proxy, then ssh into es. This is tedious and annoying, but recent versions of OpenSSH simplify it through a feature called "jump hosts". Ignore the official documentation, which gives you a convoluted method for connecting, and use this:

ssh -J username@es

es is load balanced between four nodes, which can also be accessed directly.
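The jump can also be made permanent in ~/.ssh/config so that a single ssh command suffices. A sketch, with the proxy hostname left as a placeholder since it is not reproduced here:

```
# ~/.ssh/config (sketch; PROXY_HOST stands in for the ABCI proxy domain)
Host abci
    HostName es
    User username
    ProxyJump username@PROXY_HOST
```

With this in place, `ssh abci` handles the hop through the proxy automatically.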

Requesting Resources

Since both use the exact same job scheduler (Univa Grid Engine), job submission generally works the same. The only thing you need to change when requesting a resource is the resource type, so the following command on TSUBAME:

qrsh -g my-tsubame-group -l q_node=1 -l h_rt=00:10:00

Would be this in ABCI:

qrsh -g my-abci-group -l rt_F=1 -l h_rt=00:10:00
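To make the mapping concrete, here is a small hypothetical helper (my own construction, not part of either system) that builds the appropriate qrsh invocation. It only prints the command it would run, so it is safe to try anywhere:

```shell
#!/bin/sh
# Hypothetical wrapper: build the qrsh command for either system.
# Group names and resource types follow the examples above.
make_qrsh() {
  system="$1"   # "tsubame" or "abci"
  group="$2"    # billing group to charge
  hours="$3"    # wall-clock limit, h_rt format
  if [ "$system" = "tsubame" ]; then
    res="q_node=1"   # full node on TSUBAME 3.0
  else
    res="rt_F=1"     # full node on ABCI
  fi
  echo "qrsh -g $group -l $res -l h_rt=$hours"
}

make_qrsh tsubame my-tsubame-group 00:10:00
make_qrsh abci my-abci-group 00:10:00
```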

The same rules apply in job scripts.
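As a sketch, a batch script for ABCI might look like the following; the #$ directives mirror the qrsh flags above, and the group is passed at submission time (e.g. `qsub -g my-abci-group job.sh`, with the same placeholder group name as before):

```shell
#!/bin/sh
# Sketch of an ABCI batch script. The #$ lines are scheduler
# directives read by Univa Grid Engine; on TSUBAME you would
# request q_node=1 instead of rt_F=1.
#$ -l rt_F=1
#$ -l h_rt=00:10:00
#$ -cwd
MSG="Job running on $(hostname)"
echo "$MSG"
```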

Module Naming

Both systems use Environment Modules to enable optional software packages. The naming convention differs slightly: ABCI uses two levels for module versions, while TSUBAME uses only one.

  • ABCI: cuda/10.0/10.0.130
  • TSUBAME: cuda/10.0.130
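For example, loading the same CUDA toolkit on each system (version strings taken from the list above; the exact versions available will drift over time):

```
module load cuda/10.0/10.0.130   # ABCI: family level, then full version
module load cuda/10.0.130        # TSUBAME: full version at the top level
```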

Choosing The Right Computer

ABCI is 2.5x more expensive than TSUBAME (for Tokyo Tech users; if your project billing falls under 第4条第4項 (成果非公開, i.e. non-disclosed results), it's a 25% difference), so the general rule of thumb seems to be this:

  • I absolutely need FP16: ABCI, although it's worth noting that one can get 2x the nodes on TSUBAME for the same cost, so if you can set up distributed training, it might be more cost-effective to do it there.
  • I need more than 256GB RAM: ABCI
  • I need lots of CPU cores: ABCI
  • I need bleeding edge CPU instructions: ABCI
  • TSUBAME is overbooked: ABCI, obviously
  • Otherwise: TSUBAME

That said, both are way cheaper than EC2 or GCP.

Closing Remarks

I wish TSUBAME billing were simpler; the points system seems designed to obfuscate how much money one is spending on compute.