Last month we announced the Netwar System Community Edition, the OVA for which is still not posted publicly. In our defense, what should have been a couple days with our core system has turned into a multifaceted month long bug hunt. A good portion could be credited to “unfamiliar with Search Guard”, but there is a hard kernel of “WTF, Twitter, WTF?!?” that we want to describe for other analysts.
Core System Configuration
First, some words about what we’ve done with the core of the system use day to day. After much experimentation we settled on the following configuration for our Elasticsearch dependent environment.
- HP Z620 workstations with dual eight core Xeons.
- 128 gig of ram.
- Dual Seagate IronWolf two terabyte drives in a mirror.
- Single Samsung SSD for system and ZFS cache.
- Trio of VirtualBox VMs with 500 gig of storage each.
- 32 gig for host, ZFS ARC (cache) limited to 24 gig.
- 24 gig per VM, JVM limited to 12 to 16 gig.
There are many balancing acts in this, too subtle and too niche to dig into here. It should be noted that FreeBSD Mastery:ZFS is a fine little book, even if you’re using Linux. The IronWolf drives are helium filled gear meant for NAS duty. In retrospect, paying the 50% premium for IronWolf Pro gear would have been a good move and we’ll go that way as we outgrow these.
We’ve started with a pair of machines, we’re defaulting to three shards per index, and a single replica for each. The Elasticsearch datacenter zones feature proved useful; pulling the network cable on one machine triggers some internal recovery processes, but there is no downtime from the user’s perspective. We’re due for a third system with similar specifications, it will receive the same configuration including a zone of its own, and we’ll move from one replica per index to two. This will be a painless shift to N+1 redundancy.
API Mysteries At Scale
Our first large scale project has been profiling the followers of 577 MPs in the U.K. Parliament. There are 20.6M follow relationships with 6.6M unique accounts. Extracting their profiles would require forty hours with our current configuration … but there are issues.
Users haven’t seen a Twitter #FailWhale in years, but as continuous users of the API we expect to see periods of misbehavior on about a monthly basis. February featured some grim adjustments, signs that Twitter is further clamping down on bots, which nipped our read only analytical activities. There are some features that seem to be permanently throttled now based on IP address.
When we arrived at what we thought was the end of the road, we had 6.26M profiles in Elasticsearch rather than the 6.6M we knew to exist, a discrepancy of about 350,000. We tested all 6.6M numeric IDs against the index and found just 325,000 negative responses. We treated that set as a new batch and the system captured 255,000, leaving only 70,000 missing. Repeating the process again with the 70,000 we arrived at a place where the problem was amenable to running a small batch in serial fashion.
Watching a batch of a thousand of these stragglers, roughly a quarter got an actual response, a quarter came back as suspended, and the remainder came back as page not found. The last response is expected when an account has renamed or self suspended, but we were using numeric ID rather than screen name.
And the API response to this set was NOT deterministic. Run the process again with the same data, the percentages were similar, but different accounts were affected.
A manual inspection of the accounts returned permits the formulation of a theory as to why this happens. We know the distribution of the creation dates of these accounts:
The bulk of the problematic accounts are dated between May and August of 2018. Recall that Twitter completed its acquisition of Smyte and shut down 70 million bots during that time frame. May in the histogram is the first month where account creation dates are level. A smaller set clustered around the same day in mid-December of 2012, another fertile time period for bot creation.
The affected accounts have many of the characteristics we associate with bots:
- Steeply inverted following to follow ratio.
- Complete lack of relationships to others.
- Relatively few tweets.
- Default username with eight trailing digits.
An account that was created and quickly abandoned will share these attributes. So our theory regarding the seeming problem with the API is as follows:
These accounts that can not be accessed in a deterministic fashion using the API are in some sort of Smyte induced purgatory. They are not accessible, protected, empty of content, suspended, or renamed, which are five conditions our code already recognizes. There is a new condition, likely “needs to validate phone number”, and accounts that have not done this are only likely of interest to their botnet operators, or researchers delving very deeply into the system’s behavior.
But What Does This MEAN?
Twitter has taken aggressive steps to limit the creation of bots. Accounts following MPs seem to have fairly evenly distributed creation dates, less the massive hump from early 2016 to mid 2018. We know botnet operators are liquidating collections of accounts that have been wiped of prior activity for as little as $0.25 each. There are reportedly offerings of batches of accounts considered to be ‘premium’, but what we know of this practice is anecdotal.
Our own experience is limited to maintaining a couple platoons of collection oriented accounts, and Twitter has erected new barriers, requiring longer lasting phone numbers, and sometimes voice calls rather than SMS.
This coming month we are going to delve into the social bot market, purchasing a small batch, which we will host on a remote VPS and attempt to use for collection work.
The bigger implication is this … Twitter’s implementation of Smyte is good, but it’s created a “hole in the ocean problem”, a reference to modern submarines with acoustic signatures that are less than the noise floor in their environment. If the affected accounts are all bots, and they’re just standing deadwood of no use to anyone, that’s good. But if they can be rehabilitated or repurposed, they are still an issue.
Seems like we have more digging to do here …
Mystery Partially Resolved …
So there was an issue with the API, but an issue on our side.
When a Twitter account gets suspended, it’s API tokens will still permit you to check its credentials. So a script like this reports all is well:
But if three of the sixty four accounts used in doing numeric ID to profile lookups have been suspended … 3/64 = 4.69% failure rate. That agrees pretty well with some of the trouble we observed. We have not had cause to process another large batch of numeric IDs yet, but when we do, we’ll check this theory against the results.