An Analyst's Workstation

Six months ago we published An Analyst's Environment, which describes some of the tools we use that are a bit beyond the typical lone-gun grassroots analyst. Since then our VPS-based Elasticsearch cluster has given way to Xeon equipment in racks, which led to Xeon equipment under desks.

Looking back over the past two months, we see a quickly maturing "build sheet" for analyst workstations. This is in no small part due to our discovery of Budgie, an Ubuntu Linux offshoot. Some of our best qualitative analysts are on Macs and they are extremely defensive of their work environment. Budgie permits at least some of that activity to move to Linux, and we think this will become increasingly common.

Do not assume that "I already use Ubuntu" is sufficient to evaluate Budgie. The Budgie team is spending a lot of time smoothing off the rough edges. At the very least, put it in a VM and give it a look.

Once installed, we’re including the following packages by default:

  • Secure communications are best handled with Wire.
  • The Hunch.ly web capture package requires Google Chrome.
  • Chromium provides a separate unrecorded browser.
  • Maltego CE link analysis package is useful even if constrained.
  • Evernote is popular with some of our people; Tusk works on Linux.
  • XMind Zen provides mind mapping that works on all platforms.
  • Timeline has been a long-term player and keeps adding features.
  • Gephi data visualization works, no matter what size screen is used.

Both Talkwalker Alerts and Inoreader feeds are RSS based. People seem to be happy with the web interface, but what happens when you're somewhere without network access? There are a number of RSS-related applications in Budgie's slick software store. Someone is going to have to go through them and see which best fits that particular use case.

Budgie’s many packages for handling RSS feeds.

There have been many iterations of this set of recommendations, most conditioned by the desire to support Windows as well as Mac and Linux. The proliferation of older Xeon equipment, in particular the second-generation HP Z420/Z620/Z820, which starts in usable condition at around $150, means we no longer have that constraint.

Sampling of inexpensive HP Z420s on eBay in May of 2019.

Starting with that base, 64 gig of additional memory is about $150, and another $200 will cover a 500 gig Crucial solid state disk and the fanless entry level Nvidia GT 1030.

The specific combination of the Z420 and the Xeon E5-2650L v2 benchmarks on par with the current MacBook Pro, will be literally an order of magnitude faster on Gephi, the most demanding of those applications, and will happily work for hours on end without making a sound. The Mac, on the other hand, will be making about as much noise as a Shop-Vac after just five minutes.

That chip and some Thermal Grizzly Kryonaut should not cost you more than $60 and will take a base Z420 from four cores to ten. So there you have it – mostly free software, a workstation you can build incrementally, and the foundation required to sort out complex problems.

Analyzing Twitter Streams

Our prior work on Twitter content has involved bulk collection of the following types of data:

  • Tweets, including raw text suitable for stylometry.
  • Activity time for the sake of temporal signatures.
  • Mentions including temporal data for conversation maps.
  • User ID data for profile searches.
  • Follower/following relationships, often using Maltego.

Early on this involved simply running multiple accounts in parallel, each working on its own set of tasks. Seemingly quick results were a matter of knowing what to collect and letting things happen. Hardware upgrades around the start of 2019 permitted us to run sixteen accounts in parallel … then thirty-two … and finally sixty-four, which exceeded the bounds of 100 Mbit internet service.

We had never done much with the Twitter streaming API until just two weeks ago, but our expanded ability to handle large volumes of raw data has made it a very interesting proposition. There are now ten accounts engaged in collecting either a mix of terms or following lists of hundreds of high-value accounts.
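
Conceptually each of those collectors is simple: open a long-lived connection, hand Twitter a set of track terms or follow IDs, and write every status that arrives to disk for indexing later. A minimal sketch, assuming Tweepy 3.x; the credentials, terms, and account IDs are placeholders.

    import json
    import tweepy

    # Placeholder credentials for one of the collecting accounts.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

    class CollectListener(tweepy.StreamListener):
        def on_status(self, status):
            # Keep the raw JSON; splitting it into documents happens downstream.
            with open("stream.jsonl", "a") as fh:
                fh.write(json.dumps(status._json) + "\n")

        def on_error(self, status_code):
            # Returning False on a 420 disconnects instead of hammering the API.
            if status_code == 420:
                return False

    stream = tweepy.Stream(auth=auth, listener=CollectListener())
    # Some accounts track a mix of terms ...
    stream.filter(track=["term1", "term2"])
    # ... others follow lists of high value account IDs instead:
    # stream.filter(follow=["123456", "789012"])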

Indexing Many Streams

What we get from streams at this time includes:

  • Tweet content.
  • RT’d tweet content.
  • Quoted tweet content.
  • Twitter user data for the source.
  • Twitter user data for accounts mentioned.
  • Twitter user data for accounts that are RT’d.
  • User to mentioned account event including timestamp.
  • User to RT’d account event including timestamp.

This data is currently accumulating in a mix of Elasticsearch indices. We recognize that we have at least three document types (see the sketch after this list):

  • Tweets.
  • User data.
  • Interaction data.
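
A minimal sketch of how one streamed status might be split into those three document types, assuming a recent elasticsearch-py client; the index names and field selection here are our own and still subject to change.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    def index_status(tweet):
        # 1. The tweet itself (RT'd and quoted content arrive as nested tweets
        #    and can be passed back through this same function).
        es.index(index="tweets", id=tweet["id_str"], body={
            "text": tweet.get("extended_tweet", {}).get("full_text", tweet["text"]),
            "created_at": tweet["created_at"],
            "user_id": tweet["user"]["id_str"],
        })

        # 2. User data for the source account.
        user = tweet["user"]
        es.index(index="users", id=user["id_str"], body={
            "screen_name": user["screen_name"],
            "description": user.get("description") or "",
            "created_at": user["created_at"],
        })

        # 3. Interaction documents: one edge per mention, with a timestamp,
        #    which is the minimal data needed to link the accounts involved.
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            es.index(index="interactions", body={
                "source": user["id_str"],
                "target": mention["id_str"],
                "kind": "mention",
                "created_at": tweet["created_at"],
            })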

Our current setup is definitely beta at this point. We probably need more attention on the natural language processing aspect of the tweets themselves, particularly as we expand into handling multiple European languages. User data could stand having hashtags extracted from profiles, which we missed the first time around; otherwise this seems pretty simple.

The interaction data is where things become uncertain. It is good to have this content in Elasticsearch for the sake of filtering. It is unclear precisely how much we should permit to accumulate in these derivative documents; at this point they’re just the minimal data from each tweet that permits establishing the link between accounts involved. Do we also do this for hashtags?

Once we have this, the next question is what do we do with it? The search, sorting, and time slicing of Elasticsearch is nice, but this is really network data, and we want to visualize it.

Maltego is out of the running before we even start; its 10k node maximum has been a barrier for a long time. Gephi is unusable on a 4k Linux display due to font sizing for readability, and on a smaller display it will do just enough with a half-million-node network to leave one hanging with an analysis half finished.

The right answer(s) seem to be to get moving on Graphistry and Neo4j. An EVGA GTX 1060 turned up here a few weeks ago, displacing a GT 1030 that went to an associate. Given the uptime requirements for Elasticsearch, not much has happened towards Graphistry use other than the physical install. It looks like Docker is a requirement, and that's a synonym for "invasive nuisance".

Neo4j has some visualization abilities but its real attraction is the native handling of storage and queries for graphs. Our associates who engage in analysis ask questions that are easily answered with Elasticsearch … and other questions that are utterly impossible to resolve with any tool we currently wield.
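
As an example of that second kind of question, here is a sketch against the official Neo4j Python driver; the Account label, MENTIONED relationship, and connection details are our assumptions rather than an existing schema.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    # Accounts two mention-hops away from a seed account: painful to express
    # as an Elasticsearch query, a one-liner in Cypher.
    query = """
    MATCH (a:Account {screen_name: $seed})-[:MENTIONED]->()-[:MENTIONED]->(b:Account)
    RETURN DISTINCT b.screen_name AS screen_name
    LIMIT 100
    """

    with driver.session() as session:
        for record in session.run(query, seed="some_account"):
            print(record["screen_name"])

    driver.close()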

Conclusion

Expanding capacity has permitted us to answer some questions … but on balance it has uncovered more mysteries than it has resolved. This next month is going to involve getting some standards in place for assessing incoming streams, and pressing on both means of handling graph data to see which one we can bring to bear first.

Suggested Reading: Complex Network Analysis in Python

Complex Network Analysis in Python

There has been a gap in the social network analysis world since @ladamic stopped offering her excellent class via Coursera. I received my copy of Complex Network Analysis in Python earlier today, devoured the first five chapters, and I am pleased to report this book is a quality alternative to the class.

The book presumes you have enough Python to load stuff with pip and some pre-existing motivation to explore networks. Some key things to know:

  • The features and performance considerations of four Python network analysis modules are explained in detail; invaluable for those who are trying to scale up their efforts.
  • The visualization package Gephi is introduced in a very accessible fashion, and the advice on moving data between it and your Python scripts is clear and simple (see the sketch after this list).
  • There are a variety of real-world examples included.
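
A sketch of what that Python-to-Gephi handoff looks like in practice, assuming NetworkX and a placeholder edge list standing in for real mention data.

    import networkx as nx

    # Placeholder mention edges; the real data comes from our collection scripts.
    mentions = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"), ("alice", "bob")]

    G = nx.DiGraph()
    for source, target in mentions:
        if G.has_edge(source, target):
            # Repeated mentions become edge weight.
            G[source][target]["weight"] += 1
        else:
            G.add_edge(source, target, weight=1)

    # Gephi opens GEXF directly via File > Open.
    nx.write_gexf(G, "mentions.gexf")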

The thing that is missing is Twitter – which is mentioned just once in passing on page 63. This seems like a good opening – the Complex Network Analysis GitHub repository is going to contain the Twitter-related code we produce as we work through the examples in this text.

Making America Grey Again

The initial wave of NPCs was taken out by Twitter, about 1,500 altogether according to reporting. A small number lingered, somehow slipping past the filter, and now they are regrouping. A tweet regarding the initial outbreak collected several new likes, among them this group of five:

Five NPCs

And their 740 closest friends are all pretty homogeneous:

NPCs: Diversity Through Conformity

We undertook a fast serial collection of the 740 accounts they follow. Their mentions reveal some accounts that are early adopters, survivors of the first purge, or otherwise influential. 735 of them came through collection; the missing accounts were empty, locked, or suspended.

These accounts made 469,889 mentions of others. First we'll look at 285,102 mentions of normal accounts, then we'll see 184,787 mentions of Celebrity, Media, and Political accounts. Given that there are 67,000 accounts involved in this mention map, we'll employ some methods we don't normally use. This layout was done with OpenOrd rather than Force Atlas 2, and name size denotes the volume of mentions produced.

Many NPC Mentions

The large names here are based on Eigenvector centrality – they are likely popular members of the group, or in the case of Yotsublast, a popular content creator aligned with NPC messaging.

Popular NPC Accounts & Allies

Usually we filter CMP – Celebrities, Media, and Politicians. These accounts are actively seeking attention, so it is interesting to see who they reach out to in order to get it: 184,787 mentions directed at about 18,500 others.


NPC Messaging Targets

Attempting different splines with Eigenvector centrality leads to, after several tries, this mess.

Smaller Messaging Targets

Beyond the core at the bottom, Kathy Griffin, Alexandria Ocasio-Cortez, and Hillary Clinton are singled out for attention.

K-brace Filter Level 4

Mentions are directed, but the best way to handle them at this scale seems to be treating them as undirected and using the K-brace filter. This is a manageable set of accounts to examine, and the groupings make intuitive sense.

The 742 accounts were placed into our “slow cooker” but only 397 were visible. It isn’t clear why 350 were missed, but Twitter’s quality filter may have something to do with that.

NPC Creation Times

Unlike the group of accounts in yesterday's A Deadpool Of Bots, this wake/sleep cycle over the last ten days looks like humans making their own accounts to join in the fun. Given a good-sized sample of tweets, an average adult will only consistently be inactive from 0200 – 0500, so those empty three-hour windows, except for the first day, are a pretty convincing sign.
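
The check itself is just an hourly histogram of tweet timestamps; a minimal sketch, assuming the created_at format the Twitter API returns.

    from collections import Counter
    from datetime import datetime

    def hourly_activity(created_at_strings):
        # e.g. "Mon Oct 22 14:03:07 +0000 2018"
        hours = Counter()
        for raw in created_at_strings:
            ts = datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")
            hours[ts.hour] += 1
        return [hours.get(h, 0) for h in range(24)]

    # A human-run account shows a contiguous block of near-empty hours overnight;
    # automation tends to fill all twenty-four buckets evenly.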

NPC Hashtags

Their hashtag usage is entirely what one would expect.

NPC Daily Hashtag Use

Given the tight timeframe it was interesting to look at an area graph of their daily hashtag use for the last ten days.


As a society we have barely begun to adapt to automated propaganda, and now we’re facing a human wave playing at being automation. This is an interesting, helpful thing, as it provides a perfect contrast to what we explored in A Deadpool Of Bots.

A Deadpool Of Bots

Yesterday in chat someone pointed out a small set of accounts that followed this Dirty Dozen of known harassment artists.

Dirty Dozen De Jure

The accounts all had the format <first name><last name><two digits>. We extracted two dozen from the followers of these twelve accounts, then ran their followers and found a total of 112 accessible accounts with this same format. Suspecting a botnet, we created a mention map to see how they have been spending their time.
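
The name filter is a simple pattern match; a sketch, where the capitalized FirstLast assumption is ours and the real accounts may be looser about case.

    import re

    # <first name><last name><two digits>, e.g. JohnSmith42
    NAME_PATTERN = re.compile(r"^[A-Z][a-z]+[A-Z][a-z]+\d{2}$")

    def looks_like_botnet_name(screen_name):
        return bool(NAME_PATTERN.match(screen_name))

    candidates = ["JohnSmith42", "janedoe7", "MaryJones09", "real_account"]
    print([name for name in candidates if looks_like_botnet_name(name)])
    # ['JohnSmith42', 'MaryJones09']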

112 Bots & Mentions

This image is immediately telling for those used to examining mention maps. There are too many communities (denoted by different colors) present for such a small group and the isolated or nearly isolated islands just don’t look like a human interaction pattern.

ICObench Nexus

Adjusting from outbound degree to Eigenvector centrality, it was immediately clear what the focus of this group of accounts was. The next level of zoom on the names revealed two other cryptocurrency news sites and a leader in the field as the targets of these accounts.

Thinking that 112 was a small number, we extracted their 7,029 unique followers’ IDs and got their names. Nearly 600 matched the first/last/digits format, but there were other similarities as well. We placed all 7,029 in our “slow cooker”, set to capture all of their tweets.

6,897 Bots

We were expecting to find signs of a botnet, but it appears the entire set of accounts is part of the same effort. The 6,897 we managed to collect were all created in the same twelve-week period. The gap between creation times and the steady production of about eighty accounts per day seems to indicate a small, hand-run operation in a country with cheap labor.

Hashtags Used

The network is transparently focused on cryptocurrency over the long haul. Adjusting the timeframe to the last thirty days moved the keywords around a bit but the word cloud is largely the same. The clue to how these accounts got into the mix is there in the lower right quadrant – #followme and #followback indicate a willingness to engage whomever from the world at large, in addition to their siblings.

The reason we have pursued this so far, when it looks like just a cryptocurrency botnet, is this clue.

The Big Clue

Here are five bad actors with two other accounts created right in between them time-wise. This is the most striking example, but there are others like it. And understand the HUMINT that triggered this – a group of people who do nothing but take down racist hate talkers all day feel besieged by a group that manages to immediately regenerate after losing an account.

Our Working Theory

What we think we are seeing here is a pool of low end crypto pump and dump accounts that were either created for or later sold to a ringleader in this radicalized right wing group.

Now that we have roughly 7,000 of them on record, we have to decide what to do. This is just such a blatant example of automation that Twitter might immediately take it down if they notice. The 6.5 million tweets we collected are utterly dull – the prize here is the user profile dataset. We’d need some mods to our software, but maybe we need to collect all the followers for this group of 7,000 and figure out what the actual boundaries of this botnet truly are.
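
The expansion step would look something like the following sketch, assuming Tweepy 3.x; with standard rate limits a crawl over 7,000 accounts runs for days on a single token.

    import tweepy

    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    def follower_ids(user_id):
        try:
            return list(tweepy.Cursor(api.followers_ids, user_id=user_id).items())
        except tweepy.TweepError:
            # Protected or suspended accounts simply drop out of the expansion.
            return []

    known = ["123456", "789012"]  # placeholder for the ~7,000 collected account IDs
    edges = {uid: follower_ids(uid) for uid in known}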

This has been a tiresome encounter for those who make it their business to drive hate speech from Twitter, but this may be the light at the end of the tunnel. If one group is using pools of purchased accounts to put their foot soldiers back in play the minute they get suspended, others are doing this, too. No effort was made to conceal this one from even moderate analysis efforts. If we demonstrate this is a pattern and Twitter is forced to act, we may well find that a lot of the heat will go out of political discourse on that platform.

Tools Of The Trade

Articles here are written by a single author (thus far) but represent the collective views of a loose group of two dozen collaborators, hence the use of the first person plural ‘we’. We take on civil investigations, criminal defense, penetration testing, and geopolitical/cybersecurity threat assessments.

Group members have native fluency in English, French, German, Spanish, and Romanian, and we do a fair job with Arabic when it is required. Several of us have corporate or ISP infrastructure backgrounds, and our tools, both chosen and created, reflect this internal integration capability.

This is an inventory of the major systems we currently employ.

Gephi

The Gephi data visualization package is a piece of free software which permits the handling of networks with tens of thousands of nodes and hundreds of thousands of links. We use this for macro scale examinations of Twitter and some types of financial data, coding import procedures to express complex metrics, when required. When you see colorful network maps, this is likely the source.

Maltego

The Maltego OSINT link analysis system began life as a penetration tester’s toolkit. It offers a rich set of entities, integration of many free and paid services, and local transform creation. There is a team collaboration feature for paid subscribers and the free Community Edition can read any graph we produce. This is used internally in the same way a financial audit firm would employ a spreadsheet – it is a de facto standard for recording and sharing investigation information.

Sentinel Visualizer

Sentinel Visualizer is a law enforcement/intel grade link analysis package that supports both geospatial and temporal analysis. This only comes out in the face of paying engagements with large volumes of data, as it has a somewhat intimidating learning curve.

Hunch.ly

Hunch.ly is a Google Chrome extension that preserves the trail of websites one visits, applying a standing list of selectors to each page and permitting the addition of investigators' notes. This tool supports the notion of multiple named investigations, preserves content statically, and can export in a variety of formats. Users are free to follow their noses without the burden of bookmarking and making screenshots while investigating, then later attempting to share their findings in a coherent fashion. The system recently began supporting local Maltego transforms.

RiskIQ

The RiskIQ service is an aggregator of a dozen passive threat data repositories in addition to its own native tracking of domain registrations, DNS, SSL certificates, and other threat assessment data. The service is delivered as a web based search engine and a companion set of Maltego transforms. This system is a panopticon for bad actor infrastructure which we use daily.

Elasticsearch

The Elasticsearch platform is used for many things, but for us it is a full text search engine with temporal analysis capabilities that will easily handle tens of thousands of Twitter accounts that have produced tens of millions of tweets. This is a construction kit for us, the right way to collate and correlate the work of teams of Actors, Collectors, and Directors. We currently curate 25 million tweets from ISIS accounts that were collected by TRAC, we support Liberty STRATCOM with collection and analysis, and the botnetsu.press system is in use by activists who track violent right wing groups in the west.
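
As a small example of the temporal side, here is a sketch of a per-day histogram of one account's output, assuming the elasticsearch-py client; the index and field names are our own choices.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    body = {
        "size": 0,
        "query": {"term": {"user_id": "123456"}},
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "created_at", "interval": "day"}
            }
        },
    }

    result = es.search(index="tweets", body=body)
    for bucket in result["aggregations"]["per_day"]["buckets"]:
        print(bucket["key_as_string"], bucket["doc_count"])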

Negative Decisions

What not to do is just as important as the right stuff. Here are some things we avoided, that we tested but did not implement, or that we have used but later abandoned.

Analyst’s Notebook – nonstarter, 2x the cost of Sentinel Visualizer, and not nearly as open.

Windows – with the exception of Sentinel Visualizer, we don’t have anything that is Windows dependent. Generally speaking, things have to behave for Linux and OSX, with Windows support being nice, but not required.

Splunk – we tried to love it, truly we did. It just didn’t work out.

OSSIM – largely abandonware from what we hear. AlienVault’s Open Threat Exchange is doing fine though, and it all turns up in RiskIQ.

Aeon, Timeline, etc – we always jump at collaborative timeline tools, then later end up sitting back and being annoyed. SaaS solutions are out there, but we have confidentiality concerns that hold us back from using them.

TimeSketch – very cool, an Elastic based tool, but more incident response focused than intel oriented.

SpiderFoot – very cool, but we settled on RiskIQ/Maltego installed on a remotely accessible workstation. This is one we should put back up and use enough to advise others.

There have been many more digressions over the years; these are some of the more formative ones.