17 Aug 2019

ABCI and TSUBAME 3.0

This is an early introductory document covering the differences between two supercomputers which I happened to have access to.

More content might be added when I have time.

Hardware

The following table compares the compute nodes. The characteristics are relatively similar, but ABCI is generally more powerful and uses slightly more recent hardware.

In particular, one would not train a mixed-precision model on TSUBAME, since its P100 GPUs lack the Tensor Cores found on the V100.

	TSUBAME 3.0	ABCI
CPU	Intel(R) Xeon(R) CPU E5-2680 v4	Intel(R) Xeon(R) Gold 6148
CPU Clock	2.40 GHz	2.40 GHz
CPU Core Count (Threads)	28 (56)	40 (80)
RAM	256GB	360GB
GPU	Tesla P100-SXM2	Tesla V100-SXM2
GPU VRAM	16GB (-10MB)	16GB (-0MB)
Interconnect	4x Intel Omni-Path	2x Mellanox InfiniBand EDR
Full Resource	q_node	rt_F
Scratch (NVMe)	/scr	/local
Apps	/apps	/bb/apps
Home FS	Lustre	Lustre
Shared	/gs/hs0, /gs/hs1, /gs/hs2	/fs1, /fs2, /fs3
Shared FS	Lustre	GPFS
Max Duration - Batch (Hours)	24	72
Max Duration - Interactive (Hours)	24	12
Hourly Rate - Standard (JPY)	80	200

Trial Jobs

TSUBAME 3.0 allows one to submit up to two trial jobs using qsub. This is not supported in ABCI.

(Getting a shell through a trial job via qrsh seems to be rather tricky, possibly because trial jobs have extremely low priority. The cause is still unclear.)

Interactive Node

TSUBAME's interactive node is publicly accessible through the following domain, which is load balanced between two nodes (login0 and login1; either can be accessed directly by changing the name in the domain).

login.t3.gsic.titech.ac.jp

ABCI, on the other hand, does not allow direct access to the interactive node; one needs to go through a proxy. The proxy is on the domain:

as.abci.ai

One needs to use this as a jump node to get to the actual login servers. This can be done by ssh-ing into as.abci.ai, then ssh-ing into es. That is tedious and annoying, but recent versions of OpenSSH (7.3 and later) simplify it through a feature called "jump hosts". Ignore the official documentation, which gives you a convoluted method for connecting, and use this:

ssh -J username@as.abci.ai username@es

es is load balanced between four nodes, each of which can also be accessed directly.
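For repeated logins, the same jump-host setup can also be kept in the OpenSSH client configuration through the ProxyJump option. The following is a minimal sketch, not an official recipe: the helper name abci_ssh_config and the Host aliases abci-as and abci are made up for illustration, and username is a placeholder for your own account.

```shell
# Print an OpenSSH client config stanza for reaching ABCI through the proxy.
# The function name and the Host aliases (abci-as, abci) are illustrative
# assumptions; only as.abci.ai and es come from the text above.
abci_ssh_config() {
    user="$1"
    cat <<EOF
Host abci-as
    HostName as.abci.ai
    User ${user}

Host abci
    HostName es
    User ${user}
    ProxyJump abci-as
EOF
}

# Example: review the output first, then append it to ~/.ssh/config yourself.
# abci_ssh_config username >> ~/.ssh/config
```

With such a stanza in place, "ssh abci" should behave like the -J command above.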

Requesting Resources

Since both use the exact same job scheduler (Univa Grid Engine), job submission generally works the same. The only thing you need to change when requesting a resource is the resource type, so the following command in TSUBAME:

qrsh -g my-tsubame-group -l q_node=1 -l h_rt=00:10:00

Would be this in ABCI:

qrsh -g my-abci-group -l rt_F=1 -l h_rt=00:10:00

The same rules apply in job scripts.
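If you switch between the two systems often, the one difference can be wrapped in a small helper. This is only a sketch: the function names and the system labels tsubame and abci are hypothetical, while q_node, rt_F, and the qrsh flags are taken from the commands above. It prints the command rather than running it.

```shell
# Map a system label to the full-node resource type from the table above.
# The labels "tsubame" and "abci" are illustrative, not official names.
full_node_resource() {
    case "$1" in
        tsubame) echo "q_node" ;;
        abci)    echo "rt_F" ;;
        *)       echo "unknown system: $1" >&2; return 1 ;;
    esac
}

# Print (not run) the qrsh command for one full node.
# Arguments: system label, group name, wall-clock limit (HH:MM:SS).
qrsh_command() {
    resource="$(full_node_resource "$1")" || return 1
    echo "qrsh -g $2 -l ${resource}=1 -l h_rt=$3"
}

qrsh_command abci my-abci-group 00:10:00
# prints: qrsh -g my-abci-group -l rt_F=1 -l h_rt=00:10:00
```

In a job script, the same resource request would normally appear as a Grid Engine directive line such as "#$ -l rt_F=1" instead of a command-line flag.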

Module Naming

Both systems use Environment Modules (the module command) to enable optional software packages. The naming convention is slightly different: ABCI uses two levels for module versions, while TSUBAME only uses a top level.

ABCI: cuda/10.0/10.0.130
TSUBAME: cuda/10.0.130
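The naming difference can be illustrated with a small conversion helper. This is a sketch under an assumption: the rule that ABCI's middle level is the first two version components is inferred from the single cuda example above and may not hold for every module. The function name is made up for illustration.

```shell
# Convert a TSUBAME-style module name (name/X.Y.Z) to the ABCI-style
# two-level form (name/X.Y/X.Y.Z). The rule for the middle level is an
# assumption inferred from the cuda example only.
tsubame_to_abci_module() {
    name="${1%%/*}"        # part before the first slash, e.g. cuda
    version="${1#*/}"      # part after the first slash, e.g. 10.0.130
    major="${version%%.*}" # first version component, e.g. 10
    rest="${version#*.}"   # remainder after the first dot, e.g. 0.130
    minor="${rest%%.*}"    # second version component, e.g. 0
    echo "${name}/${major}.${minor}/${version}"
}

tsubame_to_abci_module cuda/10.0.130   # prints cuda/10.0/10.0.130
```

Running "module avail" on each system remains the reliable way to check the actual names.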

Choosing The Right Computer

ABCI is 2.5x as expensive as TSUBAME. (For Tokyo Tech users whose project billing falls under Article 4, Paragraph 4 (第4条第4項, 成果非公開: results not made public), it's a 25% difference.) The general rule of thumb seems to be this:

I absolutely need FP16: ABCI, although it's worth noting that one can get 2x the nodes on TSUBAME for the same money, so if you can set up distributed training, it might be more cost-effective to do it there.
I need more than 256GB RAM: ABCI
I need lots of CPU cores: ABCI
I need bleeding edge CPU instructions: ABCI
TSUBAME is overbooked: ABCI, obviously
Otherwise: TSUBAME

That said, both are way cheaper than EC2 or GCP.
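To make the trade-off concrete, here is a rough cost calculation using the standard hourly rates from the hardware table. It assumes the rates are per full node per hour and ignores TSUBAME's points system and any group discounts; the variable and function names are made up for illustration.

```shell
# Hourly standard rates in JPY, taken from the table above.
# Assumption: each rate is for one full node (q_node or rt_F) per hour.
TSUBAME_RATE=80
ABCI_RATE=200

# Cost in JPY of running some number of full nodes for some number of hours.
# Arguments: hourly rate, node count, hours. Integer arithmetic only.
job_cost() {
    echo $(( $1 * $2 * $3 ))
}

job_cost "$ABCI_RATE" 1 24      # one ABCI node for a day: 4800
job_cost "$TSUBAME_RATE" 2 24   # two TSUBAME nodes for a day: 3840
```

Under these assumptions, two TSUBAME nodes for a day still cost less than a single ABCI node for the same period.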

Closing Remarks

I wish TSUBAME billing could be simplified; the points system seems to be designed to obfuscate how much money one is spending on compute.