
Blizzard explain why they are having so many outages with Diablo 2 Resurrected (legacy service causing issues, some tech issues too)

cormack12

Gold Member
Source: https://us.forums.blizzard.com/en/d...king-on-it-and-how-we’re-moving-forward/28164

[..]we’ll briefly give you some context as to how our server databases work. First, there’s our global database, which exists as the single source of truth for all your character information and progress. As you can imagine, that’s a big task for one database, and wouldn’t cope on its own. So to alleviate load and latency on our global database, each region–NA, EU, and Asia–has individual databases that also store your character’s information and progress, and your region’s database will periodically write to the global one. Most of your in-game actions are performed against this regional database because it’s faster, and your character is “locked” there to maintain the individual character record integrity. The global database also has a back-up in case the main fails.

With that in mind, to explain what’s been going on, we’ll be focusing on the downtimes experienced from Saturday, October 9 to now.

On Saturday morning Pacific time, we suffered a global outage due to a sudden, significant surge in traffic. This was a new threshold that our servers had not experienced at all, not even at launch. This was exacerbated by an update we had rolled out the previous day intended to enhance performance around game creation–these two factors combined overloaded our global database, causing it to time out. We decided to roll back that Friday update we’d previously deployed, hoping that would ease the load on the servers leading into Sunday while also giving us the space to investigate deeper into the root cause.

On Sunday, though, it became clear what we’d done on Saturday wasn’t enough–we saw an even higher increase in traffic, causing us to hit another outage. Our game servers were observing the disconnect from the database and immediately attempted to reconnect, repeatedly, which meant the database never had time to catch up on the work we had completed because it was too busy handling a continuous stream of connection attempts by game servers. During this time, we also saw we could make configuration improvements to our database event logging, which is necessary to restore a healthy state in case of database failure, so we completed those, and undertook further root cause analysis.
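For anyone wondering what a reconnect storm looks like in practice: when every game server retries the instant its connection drops, the database spends its time accepting connections instead of catching up. A common mitigation is exponential backoff with random jitter, so retries spread out over time. Blizzard's post does not say how they fixed it, so the sketch below is only an illustration of that general technique; `backoff_delays`, `reconnect`, and all parameter values are hypothetical.

```python
import random
import time


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized wait (in seconds) per reconnect attempt.

    The upper bound doubles with each attempt and is capped at `cap`.
    Picking a random value below that bound ("full jitter") keeps many
    servers from retrying at the same moment.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


def reconnect(connect, attempts=6, sleep=time.sleep, rng=None):
    """Call connect() until it succeeds, sleeping a jittered delay after each failure.

    `connect` is a hypothetical callable that raises ConnectionError on failure.
    """
    for delay in backoff_delays(attempts, rng=rng):
        try:
            return connect()
        except ConnectionError:
            sleep(delay)
    raise ConnectionError("database still unavailable after retries")
```

With the default values the wait ceiling grows 0.5 s, 1 s, 2 s, 4 s and so on up to 30 s, which gives an overloaded database room to drain its backlog between waves of connection attempts.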

This leads us into Monday, October 11, when we made the switch between the global databases. This led to another outage, when our backup database was erroneously continuing to run its backup process, meaning that it spent most of its time trying to copy from the other database when it should’ve been servicing requests from servers. During this time, we discovered further issues, and we made further improvements–we found a since-deprecated-but-taxing query we could eliminate entirely from the database, we optimized eligibility checks for players when they join a game, further alleviating the load, and we have further performance improvements in testing as we speak. We also believe we fixed the database-reconnect storms we were seeing, because we didn’t see it occur on Tuesday.

Then Tuesday, we hit another concurrent player high, with a few hundred thousand players in one region alone. This made us hit another incident of degraded database performance, the cause of which is currently being worked on by our database engineers. We also reached out to other engineers around Blizzard to work on smaller fixes as our own team focused on core server issues, and we reached out to our third-party partners for assistance as well.

Why this is happening:

In staying true to the original game, we kept a lot of legacy code. However, one legacy service in particular is struggling to keep up with modern player behavior.

This service, with some upgrades from the original, handles critical pieces of game functionality, namely game creation/joining, updating/reading/filtering game lists, verifying game server health, and reading characters from the database to ensure your character can participate in whatever it is you’re filtering for. Importantly, this service is a singleton, which means we can only run one instance of it in order to ensure all players are seeing the most up-to-date and correct game list at all times. We did optimize this service in many ways to conform to more modern technology, but as we previously mentioned, a lot of our issues stem from game creation.

We mention “modern player behavior” because it’s an interesting point to think about. In 2001, there wasn’t nearly as much content on the internet around how to play Diablo II “correctly” (Baal runs for XP, Pindleskin/Ancient Sewers/etc for magic find, etc). Today, however, a new player can look up any number of amazing content creators who can teach them how to play the game in different ways, many of them including lots of database load in the form of creating, loading, and destroying games in quick succession. Though we did foresee this–with players making fresh characters on fresh servers, working hard to get their magic-finding items–we vastly underestimated the scope we derived from beta testing.

Additionally, overall, we were saving too often to the global database: There is no need to do this as often as we were. We should really be saving you to the regional database, and only saving you to the global database when we need to unlock you–this is one of the mitigations we have put in place. Right now we are writing code to change how we do this entirely, so we will almost never be saving to the global database, which will significantly reduce the load on that server, but that is an architecture redesign which will take some time to build, test, then implement.
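To make the last point of the quote concrete, here is a toy model of the mitigation as described: routine saves go to the regional database, and the global database is written only when the character is unlocked. This is a sketch of the idea in the post, not Blizzard's code; the `Database` and `RegionalSession` classes and their methods are made up for illustration.

```python
class Database:
    """In-memory stand-in for a character database that counts its writes."""

    def __init__(self, name):
        self.name = name
        self.records = {}
        self.writes = 0

    def save(self, character, state):
        self.records[character] = dict(state)
        self.writes += 1


class RegionalSession:
    """Holds a character "locked" in one region until unlock() is called."""

    def __init__(self, regional, global_db, character):
        self.regional = regional
        self.global_db = global_db
        self.character = character
        # Start from whatever the global source of truth has on record.
        self.state = dict(global_db.records.get(character, {}))

    def save(self, **changes):
        # Routine saves only touch the faster regional database.
        self.state.update(changes)
        self.regional.save(self.character, self.state)

    def unlock(self):
        # The global database is written once, when the lock is released.
        self.global_db.save(self.character, self.state)
```

Usage, again purely illustrative:

```python
global_db = Database("global")
na = Database("NA")
session = RegionalSession(na, global_db, "sorceress")
session.save(level=2)
session.save(level=3, gold=100)
session.unlock()
print(na.writes, global_db.writes)
```

The regional database absorbs every save while the global one sees a single write per session, which is where the reduction in global load comes from.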
 

sn0man

Member
Sounds like a lot of great fixes. I’m looking forward to the fix that makes the game resilient to server outages by saving and loading of the game directly from the client device.
 

Lanrutcon

Member
Love the game, but we're talking about tech from 2 decades ago: there were bound to be legacy issues.

I wonder how their services are going to cope once botting takes off again.
 

killatopak

Gold Member
An unfortunate effect of still having the original engine running in the background. It’s worth it though and hopefully they fix it soon.
 

BadBurger

Is 'That Pure Potato'
Unfortunate. As someone who has to help support and maintain numerous legacy technologies, databases, code, whatever due to legal reasons in healthcare IT, I can sympathize. Shit gets hairy and sometimes the result is a Frankenstein monster that barely runs even under ideal conditions.
 
They are literally using the same Battle.net netcode and infrastructure from 2000. It preserves the game exactly as it was, but it also ensures all the problems of decades-obsolete code will be there too. Since the source code is lost, there's probably not a lot they can do about it.
 

treemk

Banned
I knew the netcode felt exactly like the old one, "we left the lag and desync but dropped in some woke censorship cause we care about preserving the original"

Get the fuck out of here
 

MikeM

Member
I like the transparency. I wish there was more of it in the gaming business.

I also get the feeling that this game has sold far more than anticipated, given that their initially planned server capacity was really underestimated.
 

Hari Seldon

Member
I know I was hyped about this, but after New World dropped I have not touched it. I wonder how bad this would be without New World. Seems like they have overlapping player bases.
 

Kacho

Gold Member
I'm guessing the fix will take at least 3 months to build, test and deploy. Glad I switched over to offline toons. I'll jump back over to bnet when these issues are resolved.
 