Will ML training drive massive growth in networking?

This originally came up in an earlier blog comment, but it’s an interesting question and one not necessarily restricted to the changes driven by deep learning training and other often GPU-hosted workloads. This trend has been underway for a long time and is most obvious when looking at networking, the example raised in that comment. When configuring systems, it’s very important that the most expensive components are the limiting resource. Servers are, by far, the largest cost in a data center. Networking costs tend to run at around 15%. It would be nuts to allow an expensive server to be underutilized because the network was the bottleneck. You can’t allow a ~15% cost to block utilization of a ~60% cost.
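To make that concrete, here’s a back-of-the-envelope sketch in Python. The cost shares and utilization figures are illustrative assumptions rather than measured numbers, but they show why even a sizable increase in networking spend pays for itself if it lets the servers run close to fully utilized:

```python
# Back-of-envelope sketch of the argument above. All numbers are assumed,
# illustrative only: servers at ~60% of data center cost, networking at ~15%,
# and a network bottleneck that caps server utilization.

def cost_per_useful_server_dollar(server_share, network_share, server_utilization):
    """Server + network capital spent per dollar of *utilized* server capital."""
    return (server_share + network_share) / (server_share * server_utilization)

# Network is the bottleneck and caps servers at 70% utilization (assumed).
bottlenecked = cost_per_useful_server_dollar(0.60, 0.15, 0.70)

# Spend 50% more on networking (0.15 -> 0.225) so servers can run at ~95% (assumed).
unblocked = cost_per_useful_server_dollar(0.60, 0.225, 0.95)

print(f"network-bottlenecked:      {bottlenecked:.2f}")  # ~1.79
print(f"network over-provisioned:  {unblocked:.2f}")     # ~1.45
```

Under these assumed numbers, each dollar of useful server work costs noticeably less once the network stops being the limiter, even though total networking spend went up by half.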

This makes perfect sense and, in fact, the basic rule goes back to manufacturing assembly line design, dating from well before data centers. If your manufacturing process uses one particularly expensive machine, then you want that machine to be fully utilized. If you can’t fully utilize that machine because some less expensive resource is the bottleneck, it’s a bad design.

The same is true in data center design. You want the most expensive resource to be fully utilized. Servers cost far more than networking, so any design that allows the workload to bottleneck on networking is a poor design. As easy as this is to understand, for the decades prior to cloud computing, networking almost always was the bottleneck. This is partly because Cisco, and to a lesser extent Juniper, were very expensive, vertically integrated suppliers. Consequently, networking margins were far higher than server margins, so it was closer to reasonable to bottleneck on networking resources, and just about every data center during the pre-cloud era did exactly that.

However, if you look closely at that historical example, even with crazy expensive networking equipment, bottlenecking on it still didn’t make economic sense. It wasn’t the right decision even at the time, and three big changes have happened since:

1) Cloud providers operate at scale and understand the economics, so they quickly add networking resources to avoid bottlenecking on the network and underutilizing the most expensive component (servers),

2) Cloud providers have the scale and ability to do custom networking hardware designs that drop networking costs dramatically. As networking costs drop, the argument against being bottlenecked on these resources gets even stronger: the need to ensure the network is not the bottleneck becomes increasingly clear as the relative cost of networking is reduced through internal development,

3) Modern workloads are, in many cases, more networking intensive. For the most part, this is simply a fact unrelated to the economic argument above. It just means that the ratio of networking resources to servers needs to increase to meet the needs of these workloads without bottlenecking on networking, which, as argued above, isn’t a good economic decision.

Machine learning is the poster child workload for being network intensive and, since all the rules above continue to apply, the networking resource to server ratio will continue to escalate. This won’t change, but I should make a quick note on ML training. The reason it is so network intensive is that a single server is far too slow for many training jobs. If a single server can’t train fast enough, then multiple servers have to be used. Because training is a tightly coupled workload, there is a lot of network traffic in this model.
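As a rough illustration of how much traffic that tight coupling generates, here’s a sketch of the per-server bandwidth needed to synchronize gradients with a ring all-reduce, one common data-parallel approach. The model size, worker count, and step rate below are assumed for illustration and aren’t taken from any specific job:

```python
# Rough sketch of why multi-server training is network hungry. Assumptions:
# data-parallel training with ring all-reduce gradient synchronization, and
# illustrative model/step numbers (not from the post).

def ring_allreduce_bytes_per_worker(model_bytes, num_workers):
    """Bytes each worker sends (and receives) per all-reduce in a ring."""
    return 2 * (num_workers - 1) / num_workers * model_bytes

model_bytes = 1e9 * 4      # assumed: 1B parameters, 4 bytes each (fp32 gradients)
workers = 8                # assumed: 8 servers in the training job
steps_per_second = 2       # assumed training step rate

per_step = ring_allreduce_bytes_per_worker(model_bytes, workers)
gbits_per_second = per_step * steps_per_second * 8 / 1e9

print(f"per-step traffic per server: {per_step / 1e9:.1f} GB")     # ~7.0 GB
print(f"sustained network demand:    {gbits_per_second:.0f} Gb/s") # ~112 Gb/s
```

With these assumed numbers, a single training job wants more than a 100 Gb/s NIC can sustain per server, which is exactly why these workloads push the networking-to-server ratio up.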

In perhaps an obvious prediction, since the process is already well underway: custom hardware or GPUs will be added to servers in large numbers with specialized, inside-the-server interconnects. For these workloads, the general purpose CPUs will become just schedulers and coordinators for large numbers of specialized ASICs, and the overall training workload that a single server can handle will go up. Clearly these training workloads are growing very fast, but I suspect that specialized hardware, with libraries and frameworks that exploit it, will allow a greater percentage of these workloads to run on a single server.
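The back-of-the-envelope reason this relieves network pressure is simple: inside-the-server interconnects run an order of magnitude or more faster than the NIC, so gradient exchange that stays inside the box never touches the data center network. The bandwidth figures below are assumed, order-of-magnitude illustrations only:

```python
# Sketch of why packing accelerators into one server relieves network pressure.
# Both bandwidth figures are assumed, order-of-magnitude illustrations.

intra_server_gbps = 300 * 8   # assumed: ~300 GB/s NVLink-class accelerator link
network_nic_gbps = 100        # assumed: 100 Gb/s NIC per server

ratio = intra_server_gbps / network_nic_gbps
print(f"inside-the-server interconnect is roughly {ratio:.0f}x the NIC bandwidth")
```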

I mention this because it will slightly relieve the pressure to accelerate the growth of networking resources. However, the rule above, that networking shouldn’t be the bottleneck, still holds, and network resources will continue to grow fast.

In summary, the need to recover from old Cisco-era designs means networking resources need to “catch up,” which means more networking growth. The massive growth of machine learning workloads will be partly satisfied by custom ASIC investments but not fully and, consequently, networking requirements will continue to grow fast. Networking resources cost less than server resources, so networking should never be the bottleneck resource.

When all these factors are considered together, the need to increase the ratio of networking resources to server resources will continue for years to come. The move from 10G to 25G/40G happened in a fraction of the time the industry needed to move from 1G to 10G. Network resources need to continue to accelerate partly because of new workloads like machine learning training, but more conventional workloads are drivers of this change as well. The net? It’s a good time to be designing, building, or selling machine learning ASICs or networking components.

