Mixed N.U.T.S.

The United Kingdom is divided into twelve N.U.T.S. (Nomenclature des Unités Territoriales Statistiques) regions. These are Northern Ireland, Scotland, Wales, and nine regional divisions within England.

United Kingdom N.U.T.S.

We had been running a Brexit related Twitter stream for some time, but as offerings from Social Media Intelligence Unit matured, we needed some finer grain detail. After a bit of grubbing around to match Twitter accounts to MP names, and then MP names to regions, we had twelve text files of account names. We collected the followers for each account, then consolidated those profiles into region specific indices. Those twelve indices were concatenated into a single national index. Then finally this national index was merged with our master index of user accounts.
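The consolidation step can be sketched in a few lines of Python. This is only an illustration of the idea, not our actual indexing code: the `consolidate` function and the sample ids are made up, and a real run works against Elasticsearch indices rather than in-memory lists.

```python
from collections import defaultdict


def consolidate(region_followers):
    """Merge per-region follower lists into one national mapping.

    region_followers: dict of region name -> iterable of follower user ids.
    Returns a dict of user id -> sorted list of regions where it was seen,
    so the regional source of every profile is preserved.
    """
    national = defaultdict(set)
    for region, followers in region_followers.items():
        for user_id in followers:
            national[user_id].add(region)
    return {uid: sorted(seen) for uid, seen in national.items()}


# Hypothetical sample data: the user ids are invented.
regions = {
    "Scotland": [101, 102, 103],
    "Wales": [103, 104],
    "London": [101, 103, 105],
}
national = consolidate(regions)

# Accounts that follow MPs in more than one region.
cross_region = {uid for uid, seen in national.items() if len(seen) > 1}
```

Keeping the region list on each national record is what lets the later filtering by source stay cheap.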

This triple redundancy does use space, but not much. A smaller index is a faster index, and there is some advantage to knowing the source of the content in great detail. This is also conditioned a bit by our upcoming Neo4j graph database implementation. The only thing we are sure of is that it will be an iterative process. Smaller pieces are easier to test than one massive corpus that requires much filtering to untangle.

Having done the organizing work, we made the MP accounts available in a Github repository. There are a number of people who have access to our Elasticsearch system and not all of them are able to whip up filters out of text files, so we prebuilt for most of the things they might want to do.

And with this we can do things that won’t be seen elsewhere …

For example, you can easily see the friends and followers for an MP’s account. Did you ever wonder how many friends they have all together? About 1.28m relationships to 506k unique accounts. We decided to drill down a bit further. How many of those 506k friends do not follow back any MPs at all?

Some of those accounts with MP audience who do not follow back will be family/friends, but if we squelch the ones with only one contact, those will vanish. Intuitively, we should see regional media, NGO, and government accounts next. Those that cross regions are likely influencers. Early returns on processing show a fairly steady 25% of those 506k do not follow any MPs. A set of 127k nodes with fewer than 600 hubs is going to be a dense graph, but not an impossible one to explore.
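A rough sketch of that filter, assuming we already hold each MP's friends and followers as sets. The function name, the `min_contacts` parameter, and the sample accounts are all hypothetical; the real work runs against the indices.

```python
from collections import Counter


def unreciprocated(friends_of, followers_of, min_contacts=2):
    """Find accounts MPs follow that follow no MP in return.

    friends_of:   dict of MP -> set of accounts that MP follows.
    followers_of: dict of MP -> set of accounts following that MP.
    min_contacts: squelch accounts followed by fewer MPs than this.
    Returns a dict of account -> number of MPs following it.
    """
    followed_by = Counter()
    for friends in friends_of.values():
        followed_by.update(friends)  # each MP counts once per account

    follows_some_mp = set().union(*followers_of.values())

    return {
        account: count
        for account, count in followed_by.items()
        if count >= min_contacts and account not in follows_some_mp
    }


# Invented sample: three MPs and a handful of accounts.
friends_of = {
    "mp_a": {"bbc", "ngo", "mum"},
    "mp_b": {"bbc", "ngo", "fan"},
    "mp_c": {"bbc"},
}
followers_of = {
    "mp_a": {"fan", "ngo"},
    "mp_b": {"fan"},
    "mp_c": set(),
}
hubs = unreciprocated(friends_of, followers_of)
with_singletons = unreciprocated(friends_of, followers_of, min_contacts=1)
```

With the default threshold the single-contact account ("mum" here) drops out, which is the squelch described above.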

Time passes …

The final count proved to be 123,862 not followed by MPs, or 24.46%, which was a trifle surprising as the above estimate was done when the set was 10% processed.

And as a pleasant bonus, while those 123k accounts were processing we made good progress on being able to identify MPs by both party and region. U.K. politics watchers will note we got the colour right for four of the largest parties, and managed to visually distinguish one more. The accounts that were followed by MPs but did not reciprocate contributed to being able to arrange the graph so clearly.

Applying Eigenvector centrality for name sizing revealed some influencers.
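Gephi computes this for us, but the idea is simple enough to sketch: an account scores highly when it is followed by other high scoring accounts. The version below is a plain power iteration written for illustration, not Gephi's implementation, and the tiny star-shaped graph is invented.

```python
def eigenvector_centrality(edges, iterations=100, tol=1e-9):
    """Power-iteration estimate of eigenvector centrality.

    edges: list of (source, target) pairs meaning "source follows target".
    Returns a dict of node -> score, normalized to unit Euclidean length.
    """
    nodes = sorted({node for edge in edges for node in edge})
    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Start from the previous scores (a shift by the identity matrix),
        # which keeps the iteration from oscillating on bipartite graphs.
        new = dict(score)
        for source, target in edges:
            new[target] += score[source]
        norm = sum(value * value for value in new.values()) ** 0.5
        new = {node: value / norm for node, value in new.items()}
        if sum(abs(new[node] - score[node]) for node in nodes) < tol:
            return new
        score = new
    return score


# Invented example: one hub in mutual follow relationships with three others.
edges = []
for leaf in ("a", "b", "c"):
    edges.append((leaf, "hub"))
    edges.append(("hub", leaf))
scores = eigenvector_centrality(edges)
```

Sizing labels by these scores is what makes the hub stand out from its neighbours in the drawing.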

The region data didn’t sort as neatly, but this was only 5% of the total network. Once it’s complete we should see it resolve into regional groupings, but they’ll be a bit more ragged than the cleaner ideological divides.

Being able to include and employ custom attributes is something we can do with Gephi, but not with Graphistry. If we wanted to accomplish that we’d need a graph database capable back end, like Neo4j, and the license required to connect it.

This is an interesting step, one that was easy in retrospect, but which we should have done some time ago. We’ll make up for lost time, now that we’ve got the capability.

Taking Inventory, Taking Action

We recently published Climate Strike Observation, which covered our setup work ahead of the global protests scheduled for the week of 20 – 29 September. This is a specific example of the first step in what has become a fairly well structured analytical process.

When presented with a mystery, either anticipated, currently happening, or in the past, the first thing we do is a recitation of relevant assets. Much of our content is related to Twitter, but this is by no means our sole source. We have sources of Telegram data, as well as Google/Talkwalker Alerts, web site scrapes, and document caches to which we have access. All of that is slowly getting massaged into a workable environment based on Open Semantic Search.

Since Twitter is the most developed aspect, and discussing it gives us no opportunity to slip on sources and methods, we’ll stick to it as the example.

The system, as it is employed today, can capture about sixty streams, which are based on text terms or numeric userids. Each stream has two indices, one for tweets, and one for the user accounts who made them or who were mentioned in them. Documents, Elastic’s term for items in an index, are uniquely identified with the “snowflake number” Twitter uses. Embedding Twitter’s numbers offers many benefits, both on and off the platform.
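One such benefit is that a snowflake number carries its own creation time: the bits above the lowest 22 hold milliseconds since Twitter's custom epoch. The sketch below decodes that, with a function name of our own choosing. It only applies to ids issued in snowflake form, so older tweet and account ids will not decode sensibly.

```python
from datetime import datetime, timedelta, timezone

# Twitter's snowflake epoch in milliseconds: 2010-11-04T01:42:54.657Z.
TWITTER_EPOCH_MS = 1288834974657
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def snowflake_to_datetime(snowflake_id):
    """Return the UTC creation time embedded in a snowflake id."""
    # The low 22 bits hold worker and sequence numbers; drop them.
    milliseconds = (snowflake_id >> 22) + TWITTER_EPOCH_MS
    return UNIX_EPOCH + timedelta(milliseconds=milliseconds)


# Build an id for a known instant to show the round trip.
one_day_ms = 24 * 60 * 60 * 1000
example_id = one_day_ms << 22
decoded = snowflake_to_datetime(example_id)
```

Because the timestamp sits in the high bits, sorting documents by id also sorts them by creation time.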

Let’s assume we want to know about an event in Canada. What resources do we have available?

  • Canadian MPs stream – 2.5 million tweets from 270k accounts.
  • Canada Proud streams – 950k tweets from over 70k accounts.
  • Rebel Media stream – 430k tweets from 100k accounts.
  • National Observer – 362k tweets from 116k accounts.
  • Gab Canada – Gab users’ tweets & profiles tied to Canadian MPs.
  • TrudeauMustGo – streaming anti-Trudeau terms.
  • CAFife – stream related to Robert Fife’s recent Trudeau smear.
  • TrudeauBots – a stream based on some automated accounts.

All but the last of these streams are active today. There are other perspectives available as well.

  • campfollowers – every follower of every Canadian MP.
  • canada350 – 8k followers of 350’s Canadian role account.
  • collectusers – 55m profiles of mostly politically engaged accounts.

We have access to the staff of the National Observer, the host of The View Up Here podcast, some of his show’s guests, and other personal relationships. Each of the FVEYS countries nominally speaks English, but these are five mutually intelligible dialects. Local knowledge is a must for any serious analysis.

As far as tools, we have the Kibana interface for Elasticsearch, which means an analyst can create a visualization, and share it in a variety of ways. We have Maltego, Gephi, and Graphistry for data visualization. Elasticsearch can be made to produce both CSV and GML files, the latter being a native graph format.

Summarizing: Tasked with an outstanding question, likely guided by local knowledge, we apply a variety of tools to extract meaning from the noise.

One of the most important functions Elasticsearch provides is the ability to precisely control time. We keep the following as search parameters:

  • Twitter’s status_at timestamp for each tweet in a tweet index.
  • The created_at timestamp for the account that tweeted.
  • created_at for each account in a user index.
  • status_at is each user’s last action.
  • collected_at shows when we most recently saw each account.

Those are hard timestamps, but there are many ‘softer’ forms of time. We can bracket events using phrases such as “not prior to date X” or “between event Y and event Z”. An example where this is used is inferring follower arrival date ranges based on the fact that the Twitter API returns userids in reverse arrival order.
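The follower arrival inference can be sketched as a running maximum. An account cannot follow anyone before it exists, and it arrived after everyone listed behind it, so the newest creation date seen so far is a "not prior to" bound. The function name and sample data are invented, and this gives only the lower edge of the range; the upper edge is whenever the list was collected.

```python
from datetime import date


def arrival_not_before(followers_newest_first):
    """Bound each follower's arrival date from below.

    followers_newest_first: list of (user_id, created) pairs in the
    order the API returns them, most recent follower first.
    Returns a dict of user_id -> earliest date the follow could have begun.
    """
    bounds = {}
    latest_creation = None
    # Walk from the oldest follower to the newest, carrying the maximum.
    for user_id, created in reversed(followers_newest_first):
        if latest_creation is None or created > latest_creation:
            latest_creation = created
        bounds[user_id] = latest_creation
    return bounds


# Invented follower list, newest arrival first.
followers = [
    (5, date(2012, 3, 1)),
    (4, date(2019, 6, 10)),
    (3, date(2009, 1, 5)),
    (2, date(2015, 8, 20)),
    (1, date(2011, 2, 2)),
]
bounds = arrival_not_before(followers)
```

Account 3 was created in 2009, but since it arrived after an account created in August 2015, it cannot have followed before then.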

Our data visualization tools in descending order of use are:

  • Kibana’s many Visualization type options.
  • Gephi for handling large scale social network visualization.
  • Maltego for more detailed SNA and infrastructure analysis.
  • Graphistry is new to us and it mostly competes with Gephi.

Kibana directly handles time. Gephi will intake CSV or GML files that are produced using time aware tools. Maltego generally doesn’t come out until we start tracking infrastructure or making detailed notes about a small set of entities. Graphistry can handle volumes of data that would smash Gephi flat, but at the cost of much less control over the graph’s appearance.

Summarizing: We approach the outstanding question with multiple tools that are applied to multiple streams of content.

Having completed this inventory the very first question we ask is “Do we have anything that matters?” We stream multiple perspectives ranging from the 336 MP accounts on a continuing basis down to a handful of hashtags related to a specific event. The hashtag based stream almost always starts because someone noticed an event getting attention. We can often find the first few updates by applying the same terms to the continuing collection indices.

Once we have picked the best index to use, we begin to narrow down the window of time as well as the participants, using Kibana. This likely leads to some Saved Searches, which may be shared as a Kibana URL with others, or used as the basis for a Visualization, which can also be shared. Kibana has the ability to export CSV files for use in other tools.
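Behind the Kibana screens, narrowing by time and participants amounts to a filtered query. The sketch below builds such a request body as a plain Python dict using Elasticsearch's standard `bool`, `range`, and `terms` clauses. The `status_at` field comes from our tweet indices as described above; the `user.screen_name` field and the account names are assumptions for illustration and depend on the index mapping.

```python
import json


def windowed_query(start, end, screen_names=None, time_field="status_at"):
    """Build a query body restricted to a time window and, optionally,
    to a set of participating accounts."""
    filters = [{"range": {time_field: {"gte": start, "lt": end}}}]
    if screen_names:
        # Hypothetical field name; adjust to the actual mapping.
        filters.append({"terms": {"user.screen_name": sorted(screen_names)}})
    return {"query": {"bool": {"filter": filters}}}


body = windowed_query(
    "2019-09-20T00:00:00Z",
    "2019-09-21T00:00:00Z",
    screen_names={"example_account_b", "example_account_a"},
)
print(json.dumps(body, indent=2))
```

Filter clauses do not affect scoring, which suits this kind of yes-or-no narrowing.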

There is a command line tool that will accept sets of files containing numeric follower IDs, which can include a file of accounts that participated in a given stream. The output is a GML file containing not just names, but numeric data such as follower count, number of favorites, and so on.
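For readers who have not seen GML, here is a minimal sketch of what such output looks like and how it could be written. This is not the command line tool itself; the function, attribute names, and accounts are invented. Node ids are kept as small sequential integers, since Twitter's numeric ids are larger than some GML readers expect.

```python
def to_gml(nodes, edges):
    """Render a directed graph as GML text.

    nodes: dict of screen name -> dict of numeric attributes.
    edges: list of (follower, followed) screen-name pairs.
    """
    index = {name: number for number, name in enumerate(nodes)}
    lines = ["graph [", "  directed 1"]
    for name, attributes in nodes.items():
        lines.append("  node [")
        lines.append(f"    id {index[name]}")
        lines.append(f'    label "{name}"')
        for key, value in attributes.items():
            lines.append(f"    {key} {value}")
        lines.append("  ]")
    for source, target in edges:
        lines.append("  edge [")
        lines.append(f"    source {index[source]}")
        lines.append(f"    target {index[target]}")
        lines.append("  ]")
    lines.append("]")
    return "\n".join(lines) + "\n"


# Invented accounts and attribute values.
nodes = {
    "mp_account": {"followers": 52000, "favorites": 310},
    "follower_one": {"followers": 140, "favorites": 2200},
}
edges = [("follower_one", "mp_account")]
gml_text = to_gml(nodes, edges)
```

The numeric attributes written on each node are what Gephi later exposes for filtering and sizing.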

Pulling the GML file into Gephi, we can arrange, filter, and label nodes on the basis of the numeric attributes. We can always highlight the accounts that contributed their followers, which is all but impossible using a CSV import. We run Louvain community detection to color label communities in the graph. We run Force Atlas 2 to lay out the graph. We filter on the various attributes and then return for more layout work.

The visualizations start as the means analysts use to separate signal from noise. They are complete when the analyst can use the graphics in a report on the question at hand. All through this process the analyst may be checking with the source of the question or the local expertise.

Summarizing: If the available information was deemed sufficient to answer a given question, a series of searches and visualizations are used to isolate the incident, culminating in some carefully honed views being used to illustrate a narrative.


Human observers equipped with only a small mobile device screen are at the mercy of algorithms. The only protection Twitter offers is against one on one harassment that crosses certain lines. While that is a welcome relief from what the platform was like, it does nothing to address the disinformation problem.

And it may be impossible to untangle U.S. free speech, profiling, and marketing methods in order to get at the disinformation problem at the level on which it is dispensed.

We have the experience and facilities to untangle such things. There are very few who do. This really needs to be solved at a policy level, rather than playing whack-a-mole on a weekly basis.

This was a very meta post, but our investigation into the #LeaveAlliance hashtag trending yesterday was what set it in motion. Usually such work for SMIU is used internally, but this one was made open, so you can see the principles described in this post applied to a real world problem.

Climate Strike Observations

There is a global #ClimateStrike planned for September 20th through the 27th. This is a series of peaceful protests regarding government inaction on climate change. Some are announced in advance, there will certainly be popup activities as things get rolling, and there are multiple groups involved. Several of us have done work for various environmental and clean energy groups, and we are well aware of corporate PR campaigns and their automated backing.

We began tracking Extinction Rebellion’s efforts in early July of 2019. We only found 112 accounts then, but an update this week has revealed at least 397 that are active. We have an index of 214,000 tweets and mentions for those original 112.

Our systems have spent the last many hours grinding on follower profiles for several different constituencies:

  • Extinction Rebellion’s 397 accounts
  • 350 & Bill McKibben
  • DeSmogBlog
  • 17 Deniers profiled on DeSmogBlog

The denier accounts are from a study done about five years ago, so it is in no way a complete representation. We will probably have a look at the players in Lewandowsky et al.’s Recursive Fury study, too.

The creation date curve for followers is a clue to legitimacy; bots became a rising force in 2015, and older accounts that have been “bleached” also participate in such activities. Bill McKibben’s followers show no sign of such activity.
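A creation date curve is just a count of follower accounts per creation period. The sketch below tallies by year and flags any year well above the median, which is a crude rule of thumb we made up for this example rather than our actual test. The function names, the `factor` threshold, and the dates are all invented.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def creation_curve(created_at_values):
    """Count accounts per creation year from ISO formatted date strings."""
    return Counter(datetime.fromisoformat(value).year for value in created_at_values)


def spikes(curve, factor=3):
    """Return the years whose count exceeds factor times the median count."""
    typical = median(curve.values())
    return sorted(year for year, count in curve.items() if count > factor * typical)


# Invented creation dates for a small follower sample.
sample = (
    ["2009-04-01"] * 2
    + ["2010-07-15"] * 2
    + ["2011-05-17"] * 9
    + ["2012-11-30"] * 2
    + ["2013-02-09"] * 3
)
curve = creation_curve(sample)
flagged = spikes(curve)
```

A flagged year is only a prompt to look closer, since a legitimate surge of interest produces the same bump.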

And the @350 account seems to have a similar audience, except for this intriguing spike of accounts created in 2011. We’ll dig deeper into that later.

The followers of the Deniers are only 50% collected at this point, but the difference is stark. Given that there is one very large account in the mix this curve may balance out a bit once things are done, but the ramp up in 2015 and the sharp drop in mid-2018 are signs of automated accounts that we see for almost all legislators in English speaking countries.

The Canadian 350 contingent is tiny, just 8,000 members, but the creation date curve parallels the global account.

Our observation plan is still coming together, but for the moment we are collecting:

  • Climate Strike hashtags & keywords
  • Deniers & their mentions
  • Extinction Rebellion’s 397 accounts & mentions
  • Extinction Rebellion specific hashtags & keywords

This study is a natural progression from recent findings by @RVAwonk and @JessBots which appeared in How Maxime Bernier hijacked Canada’s #ClimateChange discussion.

Several Upgrade Efforts

It was pointed out that stating we had Six Hours Of Downtime, then not saying much, was creating the perception of an uncertain future. That is not at all the case; here is what’s been happening.

The outage was triggered by a power transient and it was the first of a couple of problems we had, which triggered some introspection regarding design and implementation choices. Here we are a month later and we have a few bits of wisdom to share:

  • Elasticsearch 6.5.4 was OK, but 6.8.2 is fine.
  • Headless Debian 9.9 is much smaller than Lubuntu.
  • Do not make fixed size disks for your virtual machines.
  • Definitely do not try to run ZFS in a VM on top of ZFS. Just don’t.
  • Three VMs on a single spindle pair is a PILOT configuration.
  • Give data VMs one more core than they seem to need.
  • A brief visit to swap seems to be OK.

Taking care of all these equipment and software upgrades has left us with a system that will respond to queries against our collection of 53 million Twitter profiles in one to three seconds, instead of multiple thirty second waits before it finally does the job. While capturing fifty streams. And merging two large user profile indices.

That last point requires some explanation: the Lucene component of Elasticsearch is known to make heavy use of “off-heap” memory. Data VMs have been set with 8 gig of Java heap space and anywhere from 16 to 32 gig of RAM. No matter what, they all “stick their toes” into their swap pools, but never more than a few megabytes worth. Normally swap access is the last gasp before things spiral out of control, but they’ve been doing this for days under intentionally demanding conditions and we haven’t been able to bowl them over.

What is driving this dramatic increase in volume is our rapidly evolving stream handling capability and an increase in analysts using the system. We currently have the following areas of operation:

  • Legislatures – stream members and their interactions for 12 countries.
  • European MPs, U.S. Governors, and U.S. Presidential candidates.
  • Campaigns – non-election related advocacy.
  • Disinformation – things that do not appear to be legitimate.
  • Threat Monitoring – likely trouble areas get proactively captured.

Threat Monitoring is, somewhat sadly, the busiest area thanks to America’s penchant for mass shootings. Seven analysts inhabit a task oriented channel and they always have more to examine than they have time to do.

Disinformation has a strong overlap with Threat Monitoring, and whatever is on fire at the moment has been taking precedence over delving into the campaigns that create the preconditions for trouble. More hands and eyes will improve things in this area.

Campaigns & Legislatures have been purely the domain of Social Media Intelligence Unit. There are reports that go out, but they don’t see the light of day. We should probably periodically pick an area and publish the same sort of work here. In our copious spare time.

As we hinted above, we have added storage to our existing systems, but that is an interim measure. We are currently collecting hardware specifications and reviewing data center locations and rates. The cheapest is Hurricane Electric in Fremont, but the best, given what we do, might be Raging Wire in Sacramento. That’s the oldest and largest of Twitter’s four datacenters, for those not familiar with the name.

Legislatures and much of Campaigns are on their way to that datacenter. Disinformation has no public facing facet and will remain in house where we can keep an eye on it. Threat Monitoring is a mixed bag; some will stay in the office, some will involve tight integration with client systems.

Those who pay attention to the Netwar System Github will have noticed many changes over the last few days. We are approaching the time where we will declare the Elasticsearch 6.8.2 upgrade complete, and then maybe have another go at offering a Netwar System Community Edition VM. When this is ready there will be an announcement here.