This project was inspired by watching Artosis encounter smurfs on the ladder. Smurfs are typically high-level players who play anonymously under alternate in-game names against lower-level players, often for nefarious reasons.
Sometimes two "players" can be easily associated with each other if they share the same Battlenet ID, but if there's no explicit Battlenet connection, it becomes much harder. Often, it's simply impossible to know that two different in-game names belong to the same player. In these cases, manual expert inspection of the replay itself is helpful; players are consistent from a mechanical point of view, in the pace they play, the hotkeys they use, etc., and an expert can identify a player based on these characteristics in the replay.
This project attempts to automate this process.
I use the term "player" to refer to the human playing the game. I use the term "in-game name" to refer to the name available in-game and on the replay. In this experiment, I'm trying to cluster in-game names based on the mechanical characteristics of the replay.
In this experiment, all I have to go on is the in-game name, and while in-game names tend to be unique, uniqueness isn't enforced. This means that all techniques and results below are inherently limited in how much certainty they can provide.
A Starcraft Broodwar 1v1 replay file consists of two in-game names and the actions/commands each respective player performed during the match. The intuition is that this sequence of actions contains information that can be used to identify a player. We can build this intuition into a model by translating actions into tokens and training a transformer on a predict-the-next-token task. To predict the next token ever more accurately, the model should eventually learn subtle intricacies distinct to individual players. Those intricacies are then available to us in the form of embeddings from the trained model, which we can use to cluster players.
For this project, I chose to clone Karpathy's nanoGPT. There's no particular reason I chose it over the many alternatives, other than that it's simple, open-source, and works well enough.
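For concreteness, here's a minimal sketch of how a model like this might be configured with nanoGPT's `GPTConfig`. The hyperparameters are illustrative assumptions rather than my exact settings, except that the context length and embedding width match the numbers that appear later in this post.

```python
# Minimal sketch: wiring a replay-token vocabulary into nanoGPT.
# Hyperparameters are illustrative assumptions, not the exact values used.
from model import GPTConfig, GPT  # model.py from the nanoGPT repo

config = GPTConfig(
    block_size=1024,  # context length used for training
    vocab_size=2048,  # rounded up from the ~1000-2000 action tokens
    n_layer=12,
    n_head=16,
    n_embd=1024,      # matches the 1024-dim embeddings extracted later
    dropout=0.0,
)
model = GPT(config)
```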
I took a straightforward approach to tokenizing actions. Actions can be viewed as a simple sequence of (action, time delta) pairs. By quantizing each action's characteristics (e.g., ignoring coordinate information on click actions) and the time deltas (all deltas were bucketed into one of seven tokens), I was able to settle on around 1000-2000 unique tokens.
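A sketch of what this tokenization might look like; the bucket boundaries and action fields below are hypothetical stand-ins for the real scheme:

```python
# Tokenization sketch. Bucket edges and action fields are hypothetical;
# the real scheme settled on ~1000-2000 unique tokens.
import bisect

# Seven time-delta buckets (frame boundaries are illustrative).
DELTA_EDGES = [1, 2, 4, 8, 16, 32]  # buckets: <1, 1, 2-3, 4-7, 8-15, 16-31, >=32

def delta_token(frames: int) -> str:
    return f"DT_{bisect.bisect_right(DELTA_EDGES, frames)}"  # DT_0 .. DT_6

def action_token(action: dict) -> str:
    # Quantize: keep the action kind and a coarse detail (e.g., hotkey slot,
    # unit type); drop noisy fields such as click coordinates.
    kind = action["kind"]              # e.g., "HOTKEY", "RIGHT_CLICK", "TRAIN"
    detail = action.get("detail", "")  # e.g., hotkey slot, unit id
    return f"{kind}_{detail}" if detail != "" else kind

def tokenize(actions: list[dict]) -> list[str]:
    # Each (action, time delta) pair becomes two tokens.
    out = []
    for a in actions:
        out.append(delta_token(a["frames_since_last"]))
        out.append(action_token(a))
    return out
```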
Each player's actions in each replay were separated and tokenized. I processed around 100k replay files, yielding about 200k strands of tokens (two per replay). Of these, 20k were set aside as a test set and the rest were used for training.
I used a context length of 1024 tokens, which covers about 1-5 minutes of a replay depending on the player's APM. I constructed each training batch from random 1024-token segments of the strands. I let my 3090 run for about 24 hours, using the test slice to make sure the training loss didn't drift too far from the test loss. Beyond that, I had no other immediate evaluation method.
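Batch construction amounts to sampling random windows from the strands. A rough sketch, assuming each strand is a 1-D tensor of token ids:

```python
# Sketch of batch construction: sample random 1024-token windows from
# the token strands. Names and shapes here are assumptions.
import torch

BLOCK_SIZE = 1024

def get_batch(strands: list[torch.Tensor], batch_size: int):
    xs, ys = [], []
    while len(xs) < batch_size:
        strand = strands[torch.randint(len(strands), (1,)).item()]
        if len(strand) < BLOCK_SIZE + 1:
            continue  # skip strands shorter than one context window
        i = torch.randint(len(strand) - BLOCK_SIZE, (1,)).item()
        xs.append(strand[i : i + BLOCK_SIZE])
        ys.append(strand[i + 1 : i + 1 + BLOCK_SIZE])  # next-token targets
    return torch.stack(xs), torch.stack(ys)
```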
With this trained model, I generated tens of thousands of 1024-dimensional embeddings by pulling activations out of one of the last layers of the transformer. Each embedding was associated with the in-game name found in the replay file.
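One way to extract such an embedding from a nanoGPT model is a forward hook on a late block; the choice of block and the mean-pooling over the sequence here are my assumptions, not a statement of the exact method:

```python
# Sketch: capture activations from a late transformer block via a hook.
import torch

features = {}

def hook(module, inputs, output):
    # output: (batch, seq_len, n_embd); mean-pool to one vector per window
    features["emb"] = output.mean(dim=1)

handle = model.transformer.h[-2].register_forward_hook(hook)  # nanoGPT block

with torch.no_grad():
    model(tokens[None, :])          # tokens: a (1024,) LongTensor window
embedding = features["emb"][0]      # one 1024-dim vector for this window
handle.remove()
```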
I selected a few of these embeddings from an account I was familiar with and used Euclidean distance to find close embeddings. I noticed a few of the closest embeddings were from the same in-game name as well as some known alternate in-game names. I knew I was on to something.
I constructed a little game to test this more rigorously. I started by randomly selecting 20 embeddings for each in-game name. For each of those embeddings, I computed the distance to all other embeddings, keeping the closest 100. I then tallied the number of occurrences of each in-game name in those 2000 close embeddings. I reasoned that the higher an in-game name's tally, the more likely it was that the two in-game names referred to the same player.
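In code, the game looks roughly like this (array names and layout are assumptions):

```python
# Sketch of the tally "game".
# embs: (N, 1024) float array; names: length-N list of in-game names.
import numpy as np
from collections import Counter

def tally_for(name, embs, names, n_queries=20, k=100, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Randomly pick n_queries embeddings belonging to the queried name.
    idxs = rng.choice(np.flatnonzero(np.array(names) == name),
                      size=n_queries, replace=False)
    counts = Counter()
    for i in idxs:
        d = np.linalg.norm(embs - embs[i], axis=1)  # Euclidean distances
        d[i] = np.inf                               # exclude the query itself
        for j in np.argpartition(d, k)[:k]:         # k nearest neighbors
            counts[names[j]] += 1
    return counts  # tallies over n_queries * k = 2000 neighbor slots
```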
Plotting those counts yielded a half-normal distribution, meaning basic statistical outlier analysis should work well here; counts with a high z-score (number of standard deviations from the mean) should correlate with an increased chance that the two respective in-game names belong to the same player.
I evaluated the strength of this technique by noting the z-score of the queried in-game name itself. In other words, if we have Name_1 to Name_200 and we query Name_1, what's the z-score of Name_1 specifically? Let's call this the "Self Z-Score." In my evaluation of the technique above, the mean Self Z-Score was around 7, which is fairly high, and the Self Z-Score consistently ranked in the top 3 among thousands of other z-scores. This strongly indicates that the system works.
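Built on the tallies above, the Self Z-Score is just a standard z-score, with the queried name scored against all tallied names:

```python
# Sketch of the Self Z-Score: how much of an outlier is the queried
# name's own tally among all tallied names?
import numpy as np

def self_z_score(name, counts):
    vals = np.array(list(counts.values()), dtype=float)
    return (counts[name] - vals.mean()) / vals.std()

# e.g., self_z_score("Name_1", tally_for("Name_1", embs, names))
```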
I combined this information with cwal.gg's existing cluster information (i.e., known in-game name clusters built from Battlenet IDs) to produce an output covering the top 1000 players on the Broodwar ladder: their existing cwal.gg cluster information, along with predictions of related clusters based on the techniques above. I ignored any z-score under 3 as likely noise. I'd classify anything in the 3-5 range as an "interesting possible connection," anything in the 5-7 range as "likely connected," and anything 7+ as "very likely connected," but these categories are purely intuitive.
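Spelled out as code, that intuitive bucketing is simply:

```python
# The intuitive z-score buckets from above, as code.
def label(z: float) -> str | None:
    if z < 3:
        return None  # likely just noise
    if z < 5:
        return "interesting possible connection"
    if z < 7:
        return "likely connected"
    return "very likely connected"
```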
This information is available here.
I experimented with smaller context sizes and lower-dimensional embeddings so I could perform faster evaluations, but performance on the Self Z-Score metric was much worse. I also experimented with a different tokenization that uses only time-delta tokens, hoping to create a more general, rhythm-based identification technique. This resulted in a high Self Z-Score for a few players, but the average was much lower.
I also thought about using the embeddings in a classification task. This has some downsides, however. You can only predict a small number of classes/players, and it's difficult to obtain labels that are known to belong to only one class; the training set must be very carefully curated to ensure that the label on each example rightfully belongs to a single player.
All in all, I think the method above works reasonably well, although I'm sure there is plenty of room for improvement. I didn't experiment with tokenization very much; I'd like to try techniques that scrub more race-specific information. I'd also like to devise different evaluation functions, particularly ones that can be run during training as a companion to the loss. It would also be better to calculate precision and recall at each z-score threshold. However, this is problematic for the same reasons described for the classification task above: it's not possible to calculate precision unless you have carefully curated the data and its labels.
I'd also like to use this information to locate the exact mechanical characteristics that make a player unique. Maybe it's a certain style of hotkey spam, a particular army-movement technique, etc. I could imagine all sorts of ways to use variable-length contexts to programmatically surface these insights.
Finally, I'd also like to try this for SC2. The replay files there include IDs that should be globally unique, making any generated results inherently more accurate.