This site has been quiet the last five weeks, but many good things happened in the background. One of those good things has been progress on a small Netwar System demonstrator virtual machine, tentatively named the Community Edition.
What can you do with Netwar System CE? It supports using one or two Twitter accounts to record content on an ongoing basis, making the captured information available via the Kibana graphical front end to Elasticsearch. Once the accounts are authorized, the system checks them every two minutes for any list whose name begins with “nsce-”, and accounts on those lists are recorded.
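The selection step can be sketched in a few lines of Python. This is an illustrative sketch, not the actual Netwar System code: it assumes list metadata has already been fetched from the Twitter API as dicts with a "name" field.

```python
# Illustrative sketch: pick out the lists the system should record.
# The real system polls the Twitter API every two minutes; here we
# only show the name-prefix selection logic.
TRACK_PREFIX = "nsce-"

def select_tracked_lists(lists):
    """Return the lists whose names mark them for collection."""
    return [l for l in lists if l["name"].startswith(TRACK_PREFIX)]
```

Accounts that are members of the selected lists would then be handed to the collector.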
Each account used for recording produces a tw<name> index containing tweets and a tu<name> index containing the profiles of the accounts.
The tw* and tu* are index patterns that cover the respective content from all three accounts. The root account is the system manager and we assume users might place a set of API tokens on that account for command line testing.
This is a view from Kibana’s Discover tab. The timeframe can be controlled via the time picker at the upper right, the Search box permits filtering, the activity-per-date histogram appears at the top, and in this case we can see a handful of Brexit related tweets.
There are a variety of visualization tools within Kibana. Here we see a cloud of hashtags used by the collected accounts. The time picker can be adjusted to a certain time frame, search terms may be added so that the cloud reflects only hashtags used in conjunction with the search term, and there are many further refinements that can be made.
What does it take to run Netwar System CE? The following is a minimal configuration of a desktop or laptop that could host it:
8 gig of ram
solid state disk
four core processor
There are entry level Dell laptops on Amazon with these specifications in the $500 range.
The VM itself is very lightweight – two cores, four gig of ram, and the OVA file for the VM is just over four gig to download.
As shipped, the system has the following limits:
Tracking via two accounts
Disk space for about a million tweets
Collects thirty Twitter accounts per hour per account
If you are comfortable with the Linux command line it is fairly straightforward to add additional accounts. If you have some minimal Linux administration capabilities you could add a virtual disk, relocate the Elasticsearch data, and have room for more tweets.
If you are seeking to do a larger project, you should not just multiply these numbers to determine overall capacity. An eight gig VM running our adaptive code can cover about three hundred accounts per hour and a sixty four gig server can exceed four thousand.
After implementing Search Guard ten days ago I was finally pushed into using Elasticsearch 6. Having noticed that 6.5.0 was out, I decided to wait until Search Guard, which seems to lag about a week behind, managed to get their update done.
The 6.5.0 release proved terribly buggy, but now here we are with 6.5.1, running tests in A Small Development Environment, and the results are impressive. The combination of this code and an upgrade from Ubuntu 16.04 to 18.04 has made the little test machine, which we refer to as ‘hotpot’, as speedy as our three node VPS based cluster.
This is a solid long term average of fully collecting over eleven accounts per minute, but the curious thing is that it’s not obvious what resource is limiting throughput. RAM utilization eventually ratcheted up to 80% but the CPU load average has not been more than 20% the whole time.
There is still a long learning curve ahead, but what I think I see here is that an elderly four core i7, if it has a properly tuned zpool disk subsystem, will be able to support a group of eight users in constant collection mode.
And that makes this page of Kimsufi Servers intriguing. The KS-9 looks to be the sweet spot, due to the presence of SSDs instead of spindles. If our monthly hardware is $21 that puts us in a place where maybe a $99/month small team setup makes sense to offer.
There is much to be done with Search Guard before this can happen, but hopefully we’ll be ready at the start of 2019.
One of the perennial problems in this field is the antiquated notion of jurisdiction, as well as increasing pressure on Westphalian Sovereignty. JP and I touched on this during our November 5th appearance on The View Up Here. The topic is complex and visual, so this post offers some images to back up the audio there.
Regional Internet Registries
The top level administrative domains for the network layer of the internet are the five Regional Internet Registries. These entities were originally responsible for blocks of 32 bit IPv4 addresses and 16 bit Autonomous System numbers. Later we added 128 bit IPv6 addresses and 32 bit Autonomous System numbers as the original numbers were being exhausted.
When you plug your home firewall into your cable modem it receives an IP address from your service provider and a default route. That outside IP is globally unique, like a phone number, and the default route is where any non-local traffic is sent.
Did you ever stop to wonder where your cable modem provider gets their internet service? The answer is that there is no ‘default route’ for the world, they connect at various exchange points, and they share traffic there. The ‘default route’ for the internet is a dynamic set of not quite 700,000 blocks of IP addresses, known as prefixes, which originate from 59,000 Autonomous Systems.
The Autonomous System can be thought of as being similar to a telephone system country code. It indicates at a high level where a specific IP address prefix is located. The prefix can be thought of as an area code or city code: a more specific location within the given Autonomous System.
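The “more specific wins” behavior of prefixes can be sketched with Python’s standard ipaddress module. The routes below use documentation IP ranges and private AS numbers, not real announcements:

```python
import ipaddress

# Hypothetical sample of announced prefixes and their originating AS
# numbers (documentation ranges and private ASNs, for illustration only).
ROUTES = {
    "203.0.113.0/24": 64500,
    "198.51.100.0/22": 64501,
    "198.51.100.128/25": 64502,
}

def originating_asn(address):
    """Longest-prefix match: the most specific prefix containing the
    address determines which Autonomous System it belongs to."""
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, asn in ROUTES.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return best[1] if best else None
```

Note that 198.51.100.130 falls inside both the /22 and the /25 above; routers prefer the /25 because it is more specific, which is exactly how a prefix narrows down a location within an Autonomous System.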
There isn’t a neat global map for this stuff, but if you’re trying to make a picture, imagine a large bunch of grapes. The ones on the outside of the bunch are the hosting companies and smaller ISPs, who only touch a couple neighbors. The ones in the middle of the bunch touch many neighbors and are similar in position to the big global data carriers.
Domain Name Service
Once a new ISP has circuits from two or more upstream providers they can apply for an Autonomous System number and ask for IP prefixes. Those prefixes used to come straight from the RIRs, but these days you have to be a large provider to do that. Most are issued to smaller service providers by the large ones, but the net effect is the same.
Having addresses is just a start, the next step is finding interesting things to do. This requires the internet’s phone book – the Domain Name System. This is how we map names, like netwarsystem.com, to an IP address, like 18.104.22.168. There is also a reverse DNS domain that is meant to associate IP addresses with names. If you try to check that IP I just mentioned it’ll fail, which is a bit funny, as that’s not us, that’s kremlin[.]ru.
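The reverse DNS domain mentioned above works by rewriting the address into a special name under in-addr.arpa and querying that for a PTR record. Python’s standard library can build that name, which is enough to show the mechanism without touching the network:

```python
import ipaddress

def reverse_dns_name(ip):
    """Build the special domain used for a reverse DNS (PTR) lookup.

    For IPv4 the octets are reversed and placed under in-addr.arpa;
    a resolver then queries that name for a PTR record that should
    map back to a hostname.
    """
    return ipaddress.ip_address(ip).reverse_pointer
```

For example, reverse_dns_name("192.0.2.1") yields "1.2.0.192.in-addr.arpa". Whether a PTR record actually exists at that name is up to whoever controls the address space, which is why forward and reverse data so often disagree.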
Domain Name Registrars & Root DNS Servers
How do you get a DNS name to use in the first place? Generally speaking, you have to pay a Registrar a fee for your domain name, there is some configuration done regarding your Start Of Authority, which is a fancy way of saying which name servers are responsible for your domain, then this is pushed to the DNS Root Servers.
There are nominally thirteen root servers. That doesn’t mean thirteen computers; it means twelve different organizations manage them (Verisign handles two), and their addresses are ‘anycast’, meaning they originate from multiple locations while the actual systems themselves are hidden from direct access. This is sort of a CDN for DNS data, and it exists due to the endless attacks directed at these systems.
Verisign’s two systems are in datacenters on every continent and have over a hundred staff involved in their ongoing operation.
Layers Of Protection
And then things start to get fuzzy, because people who are in conflict will protect both their servers and their access.
Our web server is behind the Cloudflare Content Distribution Network. There are other CDNs out there and they exist to accelerate content as well as protect origin servers from attack. We like this service because it keeps our actual systems secret. This would be one component of that Adversary Resistant Hosting that we don’t otherwise discuss here.
When accessing the internet it is wise to conceal one’s point of origin if there may be someone looking back. This is Adversary Resistant Networking, which is done with Virtual Private Networks, the Tor anonymizing network, misattribution services like Ntrepid, and other methods that require some degree of skill to operate.
Peeling The Onion
Once you understand how all the pieces fit together there are still complexity and temporal issues.
Networked machines can generate enormous amounts of data. We previously used Splunk and recently shifted to Elasticsearch, both of which are capable of handling tens of millions of datapoints per day, even on the limited hardware we have available to us. Both systems permit time slicing of data as well as many other ways to abstract and summarize.
Data visualization can permit one to see relationships that are impenetrable to a manual examination. We use Paterva’s Maltego for some of this sort of work and we reach for Gephi when there are larger volumes to handle.
Some of the most potent tools in our arsenal are RiskIQ and Farsight. These services collect passive DNS resolution data, showing bindings between names and IP addresses when they were active. RiskIQ collects time series domain name registration data. We can examine SSL certificates, trackers from various services, and many other aspects of hosting in order to accurately attribute activity.
The world benefits greatly from citizen journalists who dig into all sorts of things. This is less than helpful when it comes to complex infrastructure problems. Some specific issues that have arisen:
People who are not well versed in the technologies used can manage to sound credible to the layman. There have been numerous instances where conspiracy theorists have made comical attribution errors, in particular geolocation data for IPs being used to assert correlations where none exist.
There is a temporal component that arises when facing any opponent with even a bit of tradecraft, and freely available tools don’t typically address that, so would-be investigators are left piecing things together, often without all of the necessary information.
Free access to quality tools like Maltego and RiskIQ is intentionally limited. RiskIQ in particular causes problems in the hands of the uninitiated – a domain hosted on a Cloudflare IP will have thousands of fellows, but the free system will only show a handful. There have been many instances of people making inferences based on that limited data that have no connection to objective reality.
We do not have a y’all come policy in this area, we specifically seek out those who have the requisite skills to do proper analysis, who know when they are out on a limb. When we do find such an individual who has a legitimate question, we can bring a great deal of analytical power to bear.
That specific scenario happened today, which triggered the authoring of this article. We may never be able to make the details public, but an important thing happened earlier, and the world is hopefully a little safer for it.
The early version of the Netwar System ran with a handful of Twitter accounts and a flat file system. Today we use a 64 gig Xeon with 48 Twitter accounts for internal studies and a trio of 16 gig VPSes for botnetsu.press, our semi-public service. The requirements for an R&D system exceed a virtual machine, unless you’ve got a Xeon grade desktop.
We happen to have a Dell m4600 laptop and eight unallocated Twitter accounts, so this has been built out as an R&D environment. The system has a four core i7, 16 gig of ram, and in addition to the system volume there is a 60 gig msata SSD and a 500 gig spindle in the disk carrier that fits in the CD/DVD bay. This is essentially a miniature of our larger Xeon system.
Disk performance has always been our problem with Elasticsearch, so the msata drive was split into cache and log space for a 465G ZFS partition.
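That split can be sketched with a couple of zpool commands. The pool and device names here are assumptions for illustration, not a recorded configuration; partition the msata drive first, then attach the slices:

```shell
# Hypothetical pool and device names - adjust for your hardware.
# One slice becomes L2ARC read cache, the other a separate intent
# log (SLOG) for the pool holding the Elasticsearch data.
zpool add tank cache /dev/sdb1
zpool add tank log /dev/sdb2
```

Putting the intent log on flash is what takes the pressure off the spindle for Elasticsearch’s write-heavy workload.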
Once you’ve got them all installed you’ll see the following ports in use.
A few caveats, first be sure these are the final lines in /etc/security/limits.conf or you will quickly learn to hate Elasticsearch.
elasticsearch - nofile 300000
root - nofile 300000
Next, examine the configurations for Elasticsearch and Kibana in /etc. You’ll want to ensure there is more than the default 2 gig for the JVM and modify the Kibana config so you can reach port 5601 from elsewhere.
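As a sketch, the relevant settings look like the following. The exact heap values depend on your hardware; these numbers are examples, not the shipped configuration:

```shell
# /etc/elasticsearch/jvm.options - give the JVM more than the 2 gig default
-Xms4g
-Xmx4g

# /etc/kibana/kibana.yml - bind beyond localhost so port 5601 is
# reachable from other machines (the default binds to localhost only)
server.host: "0.0.0.0"
```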
We have come to the point where we must release configuration advice and some Python code in order for others to learn to use the system. We’re going to trust that the requisite system integration capabilities, analytical tradecraft, and team management skills are going to limit the number of players who can actually do this. There isn’t a specific Github repository for this just yet, but there will be in the coming days.
The initial wave of NPCs was taken out by Twitter, about 1,500 altogether according to reporting. A small number lingered, somehow slipping past the filter, and now they are regrouping. A tweet regarding the initial outbreak collected several new likes, among them this group of five:
And their 740 closest friends are all pretty homogeneous:
A fast serial collection of the 740 accounts they follow was undertaken; 735 of them came through collection, and the missing accounts were empty, locked, or suspended. Their mentions reveal some accounts that are early adopters, survivors of the first purge, or otherwise influential.
These accounts made 469,889 mentions of others. First we’ll look at 285,102 mentions of normal accounts, then we’ll see 184,787 mentions of Celebrity, Media, and Political accounts. Given that there are 67,000 accounts involved in this mention map, we’ll employ some methods we don’t normally use. This layout was done with OpenOrd rather than Force Atlas 2 and the name size denotes volume of mentions produced.
The large names here are based on Eigenvector centrality – they are likely popular members of the group, or in the case of Yotsublast, a popular content creator aligned with NPC messaging.
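Eigenvector centrality itself is simple enough to sketch without Gephi. This toy version treats the mention graph as undirected, as we do for display, and uses plain power iteration; it is a minimal illustration, not our production tooling:

```python
def eigenvector_centrality(adj, iters=100):
    """Power iteration on an undirected mention graph.

    adj maps an account to the accounts it mentioned; scores are
    normalized so the most central node ends up at 1.0.
    """
    nodes = set(adj)
    for mentioned in adj.values():
        nodes.update(mentioned)
    neighbors = {n: set() for n in nodes}
    for a, mentioned in adj.items():
        for b in mentioned:
            neighbors[a].add(b)
            neighbors[b].add(a)
    score = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # The +score[n] shift keeps the iteration from oscillating on
        # bipartite structures (e.g. pure hub-and-spoke mention patterns).
        new = {n: score[n] + sum(score[m] for m in neighbors[n]) for n in nodes}
        norm = max(new.values())
        score = {n: v / norm for n, v in new.items()}
    return score
```

An account that everyone in the group mentions floats to the top, which is why popular members and aligned content creators dominate the large names in the graph.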
Usually we filter CMP – Celebrities, Media, and Politicians. These accounts are actively seeking attention so it is interesting to see who they reach out to in order to achieve that in these 184,787 mentions to about 18,500 others.
Attempting different splines with Eigenvector centrality leads to, after several tries, this mess.
Beyond the core at the bottom, Kathy Griffin, Alexandria Ocasio-Cortez, and Hillary Clinton are singled out for attention.
Mentions are directed but the best way to handle them at this scale seems to be treating them as undirected and using the KBrace filter. This is a manageable set of accounts to examine and the groupings make intuitive sense.
The 742 accounts were placed into our “slow cooker” but only 397 were visible. It isn’t clear why 350 were missed, but Twitter’s quality filter may have something to do with that.
Unlike the group of accounts in yesterday’s A Deadpool Of Bots, this wake/sleep cycle over the last ten days looks like humans making their own accounts to join in the fun. Given a good sized sample of tweets, an average adult will only consistently be inactive from 0200 – 0500, so those empty three hour windows, except for the first day, are a pretty convincing sign.
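The wake/sleep check amounts to binning tweets by hour of day and looking for empty windows. A minimal sketch, assuming the timestamps have already been normalized upstream to ISO 8601 strings (Twitter’s native created_at format differs):

```python
from collections import Counter
from datetime import datetime

def hourly_activity(timestamps):
    """Count tweets per hour of day from ISO-8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)

def quiet_hours(counts):
    """Hours of the day with no recorded activity at all."""
    return [h for h in range(24) if counts[h] == 0]
```

A consistent empty block in the small hours of the poster’s local time is the human signature; automation tends to either never sleep or sleep on an implausibly rigid schedule.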
Their hashtag usage is entirely what one would expect.
Given the tight timeframe it was interesting to look at an area graph of their daily hashtag use for the last ten days.
As a society we have barely begun to adapt to automated propaganda, and now we’re facing a human wave playing at being automation. This is an interesting, helpful thing, as it provides a perfect contrast to what we explored in A Deadpool Of Bots.
Yesterday in chat someone pointed out a small set of accounts that followed this Dirty Dozen of known harassment artists.
The accounts all had the format <first name><last name><two digits>. We extracted two dozen from the followers of these twelve accounts, then ran their followers and found a total of 112 accessible accounts that had this same format. Suspecting a botnet, we created a mention map to see how they have been spending their time.
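The name-format check is a one-line regular expression. This sketch assumes the capitalized CamelCase form we observed ("JohnSmith42" style); real screen names could vary in capitalization, so treat the pattern as illustrative:

```python
import re

# Hypothetical pattern for <first name><last name><two digits>,
# e.g. "JohnSmith42" - assumes capitalized name components.
NAME_PATTERN = re.compile(r"^[A-Z][a-z]+[A-Z][a-z]+\d{2}$")

def matches_format(screen_name):
    """True if a screen name fits the suspected factory format."""
    return bool(NAME_PATTERN.match(screen_name))
```

Running a filter like this over a follower list is how the initial two dozen grew to 112.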
This image is immediately telling for those used to examining mention maps. There are too many communities (denoted by different colors) present for such a small group and the isolated or nearly isolated islands just don’t look like a human interaction pattern.
Adjusting from outbound degree to Eigenvector centrality, it was immediately clear what the focus of this group of accounts was. The next level of zoom in the names revealed two other cryptocurrency news sites and a leader in the field as the targets of these accounts.
Thinking that 112 was a small number, we extracted their 7,029 unique followers’ IDs and got their names. Nearly 600 matched the first/last/digits format, but there were other similarities as well. We placed all 7,029 in our “slow cooker”, set to capture all of their tweets.
We were expecting to find signs of a botnet, but it appears the entire set of accounts are part of the same effort. The 6,897 we managed to collect were all created in the same twelve week period. The gap between creation times and the steady production of about eighty accounts per day seems to indicate a small hand run operation in a country with cheap labor.
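The “about eighty accounts per day” figure is just the creation dates averaged over the observed span. A minimal sketch of that arithmetic, assuming the created_at values have been reduced to dates:

```python
from collections import Counter
from datetime import date

def average_creations_per_day(created):
    """Mean account creations per day across the observed span.

    `created` is a list of datetime.date values, one per account;
    the span runs from the earliest to the latest creation date.
    """
    counts = Counter(created)
    span_days = (max(counts) - min(counts)).days + 1
    return sum(counts.values()) / span_days
```

A steady rate like this, with no big bursts, is what suggests hand-run registration rather than a scripted mass signup.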
The network is transparently focused on cryptocurrency over the long haul. Adjusting the timeframe to the last thirty days moved the keywords around a bit but the word cloud is largely the same. The clue to how these accounts got into the mix is there in the lower right quadrant – #followme and #followback indicate a willingness to engage whomever from the world at large, in addition to their siblings.
The reason we have pursued this so far, when it looks like just a cryptocurrency botnet, is this clue.
Here are five bad actors with two other accounts created right in between them, time-wise. This is the most striking example, but there are others like it. And understand the HUMINT that triggered this – a group of people who do nothing but take down racist hate talkers all day feel besieged by a group that manages to immediately regenerate after losing an account.
Our Working Theory
What we think we are seeing here is a pool of low end crypto pump and dump accounts that were either created for or later sold to a ringleader in this radicalized right wing group.
Now that we have roughly 7,000 of them on record, we have to decide what to do. This is just such a blatant example of automation that Twitter might immediately take it down if they notice. The 6.5 million tweets we collected are utterly dull – the prize here is the user profile dataset. We’d need some mods to our software, but maybe we need to collect all the followers for this group of 7,000 and figure out what the actual boundaries of this botnet truly are.
This has been a tiresome encounter for those who make it their business to drive hate speech from Twitter, but this may be the light at the end of the tunnel. If one group is using pools of purchased accounts to put their foot soldiers back in play the minute they get suspended, others are doing this, too. No effort was made to conceal this one from even moderate analysis efforts. If we demonstrate this is a pattern and Twitter is forced to act, we may well find that a lot of the heat will go out of political discourse on that platform.
Earlier today we captured seventy-one Twitter accounts that we classified into three groups. These are Durant’s Dullards (21), Team Pillow Forts (23), and TheShed (27). The first are associated with RowdyPolitics[.]com, the second group are associated with CitJourno[.]org and Patribotics[.]blog, while the last group are unified by being stable, long term personas who are often forced to replace accounts due to suspension.
Visually, the Fortress of Pillowtude is on the left, the cluster of red accounts are the RowdyPolitics people, and The Shed’s frequent reincarnations leave them scattered around the perimeter on the right with fewer mentions.
This particular graphic has been filtered to remove 934 ‘CMP’ accounts – Celebrities, Media, and Politicians. The working theory behind this is that those accounts are ubiquitous, they cross group boundaries, and thus are not terribly useful for diagnostics. That thinly populated space in the middle contains less notable CMP figures that haven’t been removed yet … but more importantly, some of those are ‘weak ties’, as covered in Mark Granovetter’s 1973 classic social network analysis paper The Strength of Weak Ties.
Seeing The Whole Forest
While these groups lead in the creation, curation, and elevation of content, we want to be able to see them in the context of their operating environment. Graphs like this are useful for discerning structure, for identifying certain types of relationships, but those accounts generated over 262,000 mentions and over 12,000 others were mentioned twice or more. This is where we set aside Gephi and take up Elasticsearch.
Selecting the 2,739 accounts mentioned ten or more times is a good balance between getting what is important and not overrunning our available resources. Recent performance tuning means our collection system can now handle forty-eight accounts in parallel. This run took 70 minutes to collect 6.48M tweets from the 2,235 accounts that were actually available, an average of 32 accounts/minute. The 504 missing accounts are mostly those from The Shed that have been banned.
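The threshold selection is a straightforward frequency count. A minimal sketch, assuming the mentions have been flattened into one entry per mention event:

```python
from collections import Counter

def frequently_mentioned(mentions, threshold=10):
    """Screen names mentioned at least `threshold` times.

    `mentions` is a flat iterable of screen names, one per
    mention event across the whole tweet corpus.
    """
    counts = Counter(mentions)
    return {name for name, n in counts.items() if n >= threshold}
```

Raising or lowering the threshold is how we trade coverage against collection time.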
We want to see both overall features as well as group specifics, so JSON filters were created for each group. Applying them, we can see the top hashtags in use by each group over the last week. The fourth cloud is the overall set of hashtags employed by every account they mentioned. Here we begin to see what each group’s contribution to the overall conversation may have been.
6.5 million lines of text is a lot to digest. When we employ Kibana we have powerful ways to search, filter, and abstract content, coupled with fine grained control of time. If we want to know the top hashtags over the prior seven days, limited to those that occurred with #MAGA or #Anonymous, and see how they compare volume wise, that’s easily done.
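Under the hood, Kibana builds Elasticsearch queries like the sketch below: a filter narrowing the documents, plus a terms aggregation counting hashtags. The field names ("hashtags", "@timestamp") are assumptions about the index mapping for illustration, not the actual Netwar System schema:

```python
# Hedged sketch of an Elasticsearch query-DSL body: top hashtags over
# the last seven days, limited to tweets that also carried #maga.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"hashtags": "maga"}},            # co-occurrence filter
                {"range": {"@timestamp": {"gte": "now-7d/d"}}},  # last 7 days
            ]
        }
    },
    "aggs": {
        # Bucket and count the hashtags in the matching documents.
        "top_hashtags": {"terms": {"field": "hashtags", "size": 20}}
    },
    "size": 0,  # we only want the aggregation, not the raw tweets
}
```

Kibana’s visualizations are essentially friendly front ends for aggregations of this shape.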
What if we want to see who first noticed the news of Elena Khusyaynova’s indictment on Friday? A few mouse clicks and we have the data from when the story broke. Long term observations are just as smooth – if we set the system up to spool content, it’ll just continuously capture the accounts that we decide are interesting.
We are just getting started with the Kibana interface to Elasticsearch, using it as an advanced text search engine, and doing some simple infographics in the spirit of descriptive statistics. There are complex, powerful tools out there, such as Timesketch and Wazuh, that are built on the Elasticsearch foundation. If we find just the right person, we may start branching in that direction.
Articles here are written by a single author (thus far) but represent the collective views of a loose group of two dozen collaborators, hence the use of the first person plural ‘we’. We take on civil investigations, criminal defense, penetration testing, and geopolitical/cybersecurity threat assessments.
Group members have native fluency in English, French, German, Spanish, and Romanian, and we do a fair job with Arabic when it is required. Several of us have corporate or ISP infrastructure backgrounds, and our tools, both chosen and created, reflect this internal integration capability.
This is an inventory of the major systems we currently employ.
The Gephi data visualization package is a piece of free software which permits the handling of networks with tens of thousands of nodes and hundreds of thousands of links. We use this for macro scale examinations of Twitter and some types of financial data, coding import procedures to express complex metrics, when required. When you see colorful network maps, this is likely the source.
The Maltego OSINT link analysis system began life as a penetration tester’s toolkit. It offers a rich set of entities, integration of many free and paid services, and local transform creation. There is a team collaboration feature for paid subscribers and the free Community Edition can read any graph we produce. This is used internally in the same way a financial audit firm would employ a spreadsheet – it is a de facto standard for recording and sharing investigation information.
Sentinel Visualizer is a law enforcement/intel grade link analysis package that supports both geospatial and temporal analysis. This only comes out in the face of paying engagements with large volumes of data, as it has a somewhat intimidating learning curve.
Hunch.ly is a Google Chrome extension that preserves the trail of web sites one visits, applying a standing list of selectors to each page and permitting the addition of investigator’s notes. This tool supports the notion of multiple named investigations, preserves content statically, and can export in a variety of formats. Users are free to follow their noses without the burden of bookmarking and making screen shots while investigating, then later attempting to share their findings in a coherent fashion. The system recently began supporting local Maltego transforms.
The RiskIQ service is an aggregator of a dozen passive threat data repositories in addition to its own native tracking of domain registrations, DNS, SSL certificates, and other threat assessment data. The service is delivered as a web based search engine and a companion set of Maltego transforms. This system is a panopticon for bad actor infrastructure which we use daily.
The Elasticsearch platform is used for many things, but for us it is a full text search engine with temporal analysis capabilities that will easily handle tens of thousands of Twitter accounts that have produced tens of millions of tweets. This is a construction kit for us, the right way to collate and correlate the work of teams of Actors, Collectors, and Directors. We currently curate 25 million tweets from ISIS accounts that were collected by TRAC, we support Liberty STRATCOM with collection and analysis, and the botnetsu.press system is in use by activists who track violent right wing groups in the west.
What not to do is just as important as the right stuff. Here are some things we avoided, that we tested but did not implement, or that we have used but later abandoned.
Analyst’s Notebook – nonstarter, 2x the cost of Sentinel Visualizer, and not nearly as open.
Windows – with the exception of Sentinel Visualizer, we don’t have anything that is Windows dependent. Generally speaking, things have to behave for Linux and OSX, with Windows support being nice, but not required.
Splunk – we tried to love it, truly we did. It just didn’t work out.
OSSIM – largely abandonware from what we hear. AlienVault’s Open Threat Exchange is doing fine though, and it all turns up in RiskIQ.
Aeon, Timeline, etc – we always jump at collaborative timeline tools, then later end up sitting back and being annoyed. SaaS solutions are out there, but we have confidentiality concerns that hold us back from using them.
TimeSketch – very cool, an Elastic based tool, but more incident response focused than intel oriented.
SpiderFoot – very cool, but we settled on RiskIQ/Maltego installed on a remotely accessible workstation. This is one we should put back up and use enough to advise others.
There have been many more digressions over the years; these are some of the more formative ones.