The United Kingdom is divided into twelve N.U.T.S. – Nomenclature des Unités Territoriales Statistiques. These are Northern Ireland, Scotland, Wales, and nine regional divisions within England.
We had been running a Brexit related Twitter stream for some time, but as offerings from Social Media Intelligence Unit matured, we needed some finer grain detail. After a bit of grubbing around to match Twitter accounts to MP names, and then MP names to regions, we had twelve text files of account names. We collected the followers for each account, then consolidated those profiles into region specific indices. Those twelve indices were concatenated into a single national index. Then finally this national index was merged with our master index of user accounts.
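The consolidation described above can be sketched in a few lines. This is only an illustration: plain Python sets stand in for the Elasticsearch indices, and the function name, region keys, and user IDs are hypothetical.

```python
from collections import defaultdict

def build_national_index(region_followers):
    """Merge per-region follower sets into one mapping of user ID -> regions seen in."""
    national = defaultdict(set)
    for region, follower_ids in region_followers.items():
        for user_id in follower_ids:
            national[user_id].add(region)
    return dict(national)

# Hypothetical sample data: two regions with one shared follower.
regions = {
    "scotland": {101, 102, 103},
    "wales": {103, 104},
}
national = build_national_index(regions)

# Accounts that appear in more than one region.
cross_region = sorted(uid for uid, seen in national.items() if len(seen) > 1)
print(len(national), cross_region)
```

Keeping the region labels attached to each account is what preserves the "source of the content" after the merge.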
This triple redundancy does use space, but not very much. A smaller index is a faster index, and there is some advantage to knowing the source of the content in great detail. This is also conditioned a bit by our upcoming Neo4j graph database implementation. The only thing we are sure of is that it will be an iterative process. Smaller pieces are easier to test than one massive corpus that requires much filtering to untangle.
Having done the organizing work, we made the MP accounts available in a Github repository. There are a number of people who have access to our Elasticsearch system and not all of them are able to whip up filters out of text files, so we prebuilt for most of the things they might want to do.
And with this we can do things that won’t be seen elsewhere …
For example, you can easily see the friends and followers for an MP’s account. Did you ever wonder how many friends they have all together? About 1.28m relationships to 506k unique accounts. We decided to drill down a bit further. How many of those 506k friends do not follow back any MPs at all?
Some of those accounts with MP audience who do not follow back will be family/friends, but if we squelch the ones with only one contact, those will vanish. Intuitively, we should see regional media, NGO, and government accounts next. Those that cross regions are likely influencers. Early returns on processing show a fairly steady 25% of those 506k do not follow any MPs. A set of 127k nodes with fewer than 600 hubs is going to be a dense graph, but not an impossible one to explore.
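A minimal sketch of the follow-back question, assuming the friend and follower lists are already in memory as plain Python collections. The function name, the `min_contacts` parameter, and the sample IDs are hypothetical and are not taken from our tooling.

```python
from collections import Counter

def non_reciprocating(mp_friends, mp_followers, min_contacts=2):
    """Return accounts that MPs follow but which follow no MP back.

    Only accounts followed by at least `min_contacts` MPs are kept, which
    squelches the single-contact family and friends accounts.
    """
    # How many MPs follow each account (an MP's "friends" are the accounts it follows).
    contact_counts = Counter()
    for friends in mp_friends.values():
        contact_counts.update(set(friends))

    # Every account that follows at least one MP.
    follows_some_mp = set()
    for followers in mp_followers.values():
        follows_some_mp.update(followers)

    return {
        account: count
        for account, count in contact_counts.items()
        if account not in follows_some_mp and count >= min_contacts
    }

# Hypothetical sample data.
mp_friends = {"mp_a": [1, 2, 3], "mp_b": [2, 3, 4]}
mp_followers = {"mp_a": [3, 9], "mp_b": [4]}
print(non_reciprocating(mp_friends, mp_followers))
```

Setting `min_contacts=1` returns every non-reciprocating account, which is the figure the percentage above refers to.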
Time passes …
The final count proved to be 123,862 not followed by MPs, or 24.46%, which was a trifle surprising as the above estimate was done when the set was 10% processed.
And as a pleasant bonus, while those 123k accounts were processing we made good progress on being able to identify MPs by both party and region. U.K. politics watchers will note we got the colour right for four of the largest parties, and managed to visually distinguish one more. The accounts that were followed by MPs but did not reciprocate contributed to being able to arrange the graph so clearly.
Applying Eigenvector centrality for name sizing revealed some influencers.
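For readers unfamiliar with the measure, eigenvector centrality scores a node highly when its neighbours also score highly. The sketch below is a plain power iteration over an undirected graph. It is not Gephi's implementation, and the function name and sample graph are hypothetical.

```python
import math

def eigenvector_centrality(adjacency, max_iter=200, tol=1e-9):
    """Power iteration on an undirected graph given as {node: set(neighbours)}."""
    nodes = list(adjacency)
    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        previous = scores
        # Start from the previous vector (equivalent to iterating with A + I),
        # which avoids oscillation on bipartite graphs such as stars.
        scores = dict(previous)
        for node in nodes:
            for neighbour in adjacency[node]:
                scores[neighbour] += previous[node]
        # Normalise so the scores stay bounded.
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        scores = {n: v / norm for n, v in scores.items()}
        if sum(abs(scores[n] - previous[n]) for n in nodes) < tol:
            break
    return scores

# Hypothetical star graph: one hub connected to three leaves.
star = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
}
scores = eigenvector_centrality(star)
print(max(scores, key=scores.get))
```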
The region data didn’t sort as neatly, but this was only 5% of the total network. Once it’s complete we should see it resolve into regional groupings, but they’ll be a bit more ragged than the cleaner ideological divides.
Being able to include and employ custom attributes is something we can do with Gephi, but not with Graphistry. If we wanted to accomplish that we’d need a graph database capable back end, like Neo4j, and the license required to connect it.
This is an interesting step, one that was easy in retrospect, but which we should have done some time ago. We’ll make up for lost time, now that we’ve got the capability.
We recently published Climate Strike Observation, which covered our setup work ahead of the global protests scheduled for the week of 20 – 29 September. This is a specific example of the first step in what has become a fairly well structured analytical process.
When presented with a mystery, either anticipated, currently happening, or in the past, the first thing we do is a recitation of relevant assets. Much of our content is related to Twitter, but this is by no means our sole source. We have sources of Telegram data, as well as Google/Talkwalker Alerts, web site scrapes, and document caches to which we have access. All of that is slowly getting massaged into a workable environment based on Open Semantic Search.
Since Twitter is the most developed aspect, as well as not providing us an opportunity to slip on sources and methods, we’ll stick to it as the example.
The system, as it is employed today, can capture about sixty streams, which are based on text terms or numeric userids. Each stream has two indices, one for tweets, and one for the user accounts who made them or who were mentioned in them. Documents, Elastic’s term for items in an index, are uniquely identified with the “snowflake number” Twitter uses. Embedding Twitter’s numbers offers many benefits, both on and off the platform.
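One such benefit is that a snowflake number carries its own creation time: the bits above the lowest 22 hold milliseconds since Twitter's snowflake epoch. A small sketch follows. The function names are hypothetical, and the decoding only applies to IDs issued under the snowflake scheme, not to older sequential IDs.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Twitter's snowflake epoch, in Unix milliseconds

def snowflake_to_datetime(snowflake_id):
    """Decode the creation time embedded in the upper bits of a snowflake ID."""
    timestamp_ms = (int(snowflake_id) >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)

def datetime_to_min_snowflake(moment):
    """Smallest snowflake ID that could have been issued at `moment`."""
    timestamp_ms = int(moment.timestamp() * 1000)
    return (timestamp_ms - TWITTER_EPOCH_MS) << 22

# Round trip for an arbitrary date.
start = datetime(2019, 9, 20, tzinfo=timezone.utc)
print(snowflake_to_datetime(datetime_to_min_snowflake(start)))
```

Because the IDs sort by time, a date window can be turned into an ID range and applied to any index keyed on snowflake numbers.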
Let’s assume we want to know about an event in Canada. What resources do we have available?
Canadian MPs stream – 2.5 million tweets from 270k accounts.
Canada Proud streams – 950k tweets from over 70k accounts.
Rebel Media stream – 430k tweets from 100k accounts.
National Observer – 362k tweets from 116k accounts.
Gab Canada – Gab users’ tweets & profiles tied to Canadian MPs.
TrudeauMustGo – streaming anti-Trudeau terms.
CAFife – stream related to Robert Fife’s recent Trudeau smear.
TrudeauBots – a stream based on some automated accounts.
All but the last of these streams are active today. There are other perspectives available as well.
campfollowers – every follower of every Canadian MP.
canada350 – 8k followers of 350’s Canadian role account.
collectusers – 55m profiles of mostly politically engaged accounts.
We have access to the staff of the National Observer, the host of The View Up Here podcast, some of his show’s guests, and other personal relationships. Each of the FVEYS countries nominally speaks English, but these are five mutually intelligible dialects. Local knowledge is a must for any serious analysis.
As far as tools, we have the Kibana interface for Elasticsearch, which means an analyst can create a visualization, and share it in a variety of ways. We have Maltego, Gephi, and Graphistry for data visualization. Elasticsearch can be made to produce both CSV and GML files, the latter being a native graph format.
Summarizing: Tasked with an outstanding question, likely guided by local knowledge, we apply a variety of tools to extract meaning from the noise.
One of the most important functions Elasticsearch provides is the ability to precisely control time. We keep the following as search parameters:
Twitter’s status_at timestamp for each tweet in a tweet index.
The created_at timestamp for the account that tweeted.
created_at for each account in a user index.
status_at is each user’s last action.
collected_at shows when we most recently saw each account.
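As an illustration of how one of these timestamps becomes a search parameter, the sketch below builds an Elasticsearch query body with a range filter. The `status_at` field name comes from the list above. The function name and the date window are hypothetical, and the body would still need to be sent to the cluster by whatever client is in use.

```python
import json

def tweets_between(start_iso, end_iso, field="status_at"):
    """Build an Elasticsearch query body limiting results to a time window."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # gte/lt gives a half-open window: start included, end excluded.
                    {"range": {field: {"gte": start_iso, "lt": end_iso}}}
                ]
            }
        },
        "sort": [{field: {"order": "asc"}}],
    }

body = tweets_between("2019-09-20T00:00:00Z", "2019-09-28T00:00:00Z")
print(json.dumps(body, indent=2))
```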
Those are hard timestamps, but there are many ‘softer’ forms of time. We can bracket events using phrases such as “not prior to date X” or “between event Y and event Z”. An example where this is used is inferring follower arrival date ranges based on the fact that the Twitter API returns userids in reverse arrival order.
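The follower arrival inference can be sketched as a running maximum. An account cannot follow anyone before it exists, and it arrived after everyone behind it in the list, so its arrival is "not prior to" the latest creation date seen so far when walking from oldest arrival to newest. The function name and sample data are hypothetical. The upper bound here is simply the collection time.

```python
from datetime import datetime

def arrival_brackets(followers_newest_first, collected_at):
    """Bracket each follower's arrival time.

    followers_newest_first: list of (user_id, account_created_at) tuples in the
    order the API returned them (most recent arrival first).
    Returns {user_id: (not_before, not_after)}.
    """
    brackets = {}
    not_before = None
    # Walk from the oldest arrival to the newest so the lower bound only moves forward.
    for user_id, created_at in reversed(followers_newest_first):
        if not_before is None or created_at > not_before:
            not_before = created_at
        brackets[user_id] = (not_before, collected_at)
    return brackets

# Hypothetical sample: three followers as returned by the API, newest first.
sample = [
    (303, datetime(2019, 5, 1)),
    (202, datetime(2015, 1, 1)),
    (101, datetime(2018, 3, 1)),
]
brackets = arrival_brackets(sample, collected_at=datetime(2019, 9, 1))
for user_id, (low, high) in brackets.items():
    print(user_id, low.date(), high.date())
```

In the sample, account 202 was created in 2015 but arrived after an account created in March 2018, so its arrival cannot be earlier than March 2018.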
Our data visualization tools in descending order of use are:
Kibana’s many Visualization type options.
Gephi for handling large scale social network visualization.
Maltego for more detailed SNA and infrastructure analysis.
Graphistry is new to us and it mostly competes with Gephi.
Kibana directly handles time. Gephi will take in CSV or GML files that are produced using time aware tools. Maltego generally doesn’t come out until we start tracking infrastructure or making detailed notes about a small set of entities. Graphistry can handle volumes of data that would smash Gephi flat, but at the cost of much less control over the graph’s appearance.
Summarizing: We approach the outstanding question with multiple tools that are applied to multiple streams of content.
Having completed this inventory the very first question we ask is “Do we have anything that matters?” We stream multiple perspectives ranging from the 336 MP accounts on a continuing basis down to a handful of hashtags related to a specific event. The hashtag based stream almost always starts because someone noticed an event getting attention. We can often find the first few updates by applying the same terms to the continuing collection indices.
Once we have picked the best index to use, we begin to narrow down the window of time as well as the participants, using Kibana. This likely leads to some Saved Searches, which may be shared as a Kibana URL with others, or used as the basis for a Visualization, which can also be shared. Kibana has the ability to export CSV files for use in other tools.
There is a command line tool that will accept sets of files containing numeric follower IDs, which can include a file of accounts that participated in a given stream. The output is a GML file containing not just names, but numeric data such as follower count, number of favorites, and so on.
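To show what such output looks like, here is a minimal GML writer. It is not the command line tool itself. The function name, attribute keys, and sample accounts are hypothetical, and node ids are kept as small integers rather than raw Twitter IDs.

```python
def to_gml(nodes, edges):
    """nodes: {node_id: {"label": str, numeric attrs...}}; edges: [(source, target)]."""
    lines = ["graph [", "  directed 1"]
    for node_id, attrs in nodes.items():
        lines.append("  node [")
        lines.append(f"    id {node_id}")
        for key, value in attrs.items():
            if isinstance(value, str):
                # GML strings are double quoted, so swap any embedded double quotes.
                escaped = value.replace('"', "'")
                lines.append(f'    {key} "{escaped}"')
            else:
                lines.append(f"    {key} {value}")
        lines.append("  ]")
    for source, target in edges:
        lines.append("  edge [")
        lines.append(f"    source {source}")
        lines.append(f"    target {target}")
        lines.append("  ]")
    lines.append("]")
    return "\n".join(lines)

# Hypothetical sample: one follower pointing at one MP account.
nodes = {
    1: {"label": "mp_account", "followers": 1200, "favorites": 45},
    2: {"label": "follower_account", "followers": 80, "favorites": 310},
}
gml = to_gml(nodes, [(2, 1)])
print(gml)
```

The numeric attributes written on each node are what make filtering and sizing by follower count possible once the file is loaded into a graph tool.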
Pulling the GML file into Gephi, we can arrange, filter, and label nodes on the basis of the numeric attributes. We can always highlight the accounts that contributed their followers, which is all but impossible using a CSV import. We run Louvain community detection to color label communities in the graph. We run Force Atlas 2 to lay out the graph. We filter on the various attributes and then return for more layout work.
The visualizations start as the means analysts use to separate signal from noise. They are complete when the analyst can use the graphics in a report on the question at hand. All through this process the analyst may be checking with the source of the question or the local expertise.
Summarizing: If the available information was deemed sufficient to answer a given question, a series of searches and visualizations are used to isolate the incident, culminating in some carefully honed views being used to illustrate a narrative.
Human observers equipped with only a small mobile device screen are at the mercy of algorithms. The only protection Twitter offers is against one on one harassment that crosses certain lines. While that is a welcome relief from what the platform was like, it does nothing to address the disinformation problem.
And it may be impossible to untangle U.S. free speech, profiling, and marketing methods in order to get at the disinformation problem at the level on which it is dispensed.
We have the experience and facilities to untangle such things. There are very few who do. This really needs to be solved at a policy level, rather than playing whack-a-mole on a weekly basis.
This was a very meta post, but our investigation into the #LeaveAlliance hashtag trending yesterday was what set it in motion. Usually such work for SMIU is used internally, but this one was made open, so you can see the principles described in this post applied to a real world problem.
There is a global #ClimateStrike planned for September 20th through the 27th. This is a series of peaceful protests regarding government inaction on climate change. Some are announced in advance, there will certainly be popup activities as things get rolling, and there are multiple groups involved. Several of us have done work for various environmental and clean energy groups, and we are well aware of corporate PR campaigns and their automated backing.
We began tracking Extinction Rebellion’s efforts in early July of 2019. We only found 112 accounts then, but an update this week has revealed at least 397 that are active. We have an index of 214,000 tweets and mentions for those original 112.
Our systems have spent the last many hours grinding on follower profiles for several different constituencies:
Extinction Rebellion’s 397 accounts
350 & Bill McKibben
17 Deniers profiled on DeSmogBlog
The denier accounts are from a study done about five years ago, so it is in no way a complete representation. We will probably have a look at the players in Lewandowsky et al.’s Recursive Fury study, too.
The creation date curve for followers is a clue to legitimacy; bots became a rising force in 2015, and older accounts that have been “bleached” also participate in such activities. Bill McKibben’s followers show no sign of such activity.
And the @350 account seems to have a similar audience, except for this intriguing spike of accounts created in 2011. We’ll dig deeper into that later.
The followers of the Deniers are only 50% collected at this point, but the difference is stark. Given that there is one very large account in the mix this curve may balance out a bit once things are done, but the ramp up in 2015 and the sharp drop in mid-2018 are signs of automated accounts that we see for almost all legislators in English speaking countries.
The Canadian 350 contingent is tiny, just 8,000 members, but the creation date curve parallels the global account.
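The creation date curves discussed above amount to a histogram of account creation dates. A minimal sketch, assuming the `created_at` values have already been parsed into `datetime` objects. The function names and sample dates are hypothetical.

```python
from collections import Counter
from datetime import datetime

def creation_curve(created_dates):
    """Count accounts per (year, month) of creation, in chronological order."""
    counts = Counter((d.year, d.month) for d in created_dates)
    return sorted(counts.items())

def share_created_since(created_dates, year):
    """Fraction of accounts created in `year` or later."""
    if not created_dates:
        return 0.0
    recent = sum(1 for d in created_dates if d.year >= year)
    return recent / len(created_dates)

# Hypothetical sample of follower creation dates.
dates = [
    datetime(2011, 3, 1),
    datetime(2011, 3, 15),
    datetime(2015, 6, 1),
    datetime(2017, 1, 1),
]
curve = creation_curve(dates)
print(curve)
print(share_created_since(dates, 2015))
```

Comparing the share of accounts created since 2015 across audiences is one crude way to put a number on the visual difference between the curves.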
Our observation plan is still coming together, but for the moment we are collecting:
It was pointed out that stating we had Six Hours Of Downtime, then not saying much, was creating the perception of an uncertain future. That is not at all the case; here is what’s been happening.
The outage was triggered by a power transient and it was the first of a couple of problems we had, which triggered some introspection regarding design and implementation choices. Here we are a month later and we have a few bits of wisdom to share:
Elasticsearch 6.5.4 was OK, but 6.8.2 is fine.
Headless Debian 9.9 is much smaller than Lubuntu.
Do not make fixed size disks for your virtual machines.
Definitely do not try to run ZFS in a VM on top of ZFS. Just don’t.
Three VMs on a single spindle pair is a PILOT configuration.
Give data VMs one more core than they seem to need.
A brief visit to swap seems to be OK.
Taking care of all these equipment and software upgrades has left us with a system that will respond to queries against our collection of 53 million Twitter profiles in one to three seconds, instead of multiple thirty second waits before it finally does the job. While capturing fifty streams. And merging two large user profile indices.
The point about swap requires some explanation – the Lucene component of Elasticsearch is known to make heavy use of “off-heap” memory. Data VMs have been set with 8 gig of Java heap space and anywhere from 16 to 32 gig of ram. No matter what, they all “stick their toes” into their swap pools, but never more than a few megabytes worth. Normally swap access is the last gasp before things spiral out of control, but they’ve been doing this for days under intentionally demanding conditions and we haven’t been able to bowl them over.
What is driving this dramatic increase in volume is our rapidly evolving stream handling capability and an increase in analysts using the system. We currently have the following areas of operation:
Legislatures – stream members and their interactions for 12 countries.
European MPs, U.S. Governors, and U.S. Presidential candidates.
Campaigns – non-election related advocacy.
Disinformation – things that do not appear to be legitimate.
Threat Monitoring – likely trouble areas get proactively captured.
Threat Monitoring is, somewhat sadly, the busiest area thanks to America’s penchant for mass shootings. Seven analysts inhabit a task oriented channel and they always have more to examine than they have time to do.
Disinformation has a strong overlap with Threat Monitoring, and whatever is on fire at the moment has been taking precedence over delving into the campaigns that create the preconditions for trouble. More hands and eyes will improve things in this area.
Campaigns & Legislatures have been purely the domain of Social Media Intelligence Unit. There are reports that go out, but they don’t see the light of day. We should probably periodically pick an area and publish the same sort of work here. In our copious spare time.
As we hinted above, we have added storage to our existing systems, but that is an interim measure. We are currently collecting hardware specifications and reviewing data center locations and rates. The cheapest is Hurricane Electric in Fremont, but the best, given what we do, might be Raging Wire in Sacramento. That’s the oldest and largest of Twitter’s four datacenters, for those not familiar with the name.
Legislatures and much of Campaigns are on their way to that datacenter. Disinformation has no public facing facet and will remain in house where we can keep an eye on it. Threat Monitoring is a mixed bag; some will stay in the office, some will involve tight integration with client systems.
Those who pay attention to the Netwar System Github will have noticed many changes over the last few days. We are approaching the time where we will declare the Elasticsearch 6.8.2 upgrade complete, and then maybe have another go at offering a Netwar System Community Edition VM. When this is ready there will be an announcement here.
Earlier today a power transient knocked both of our systems offline. I noticed it at the house and quickly figured out that it hit the office, too.
Recall the configuration of our systems:
HP Z620 workstations
Dual NAS grade disks
ZFS based mirroring
Multiple VMs hosting data
The multiple VM configuration we use, three per workstation, is in place for two reasons. Elasticsearch has requirements on the number of available systems before it will behave properly, such that two systems alone are not workable. The second reason is the 32 gig memory limit for a Java virtual machine. These 192 gig systems easily support three large VMs and a good sized ZFS cache.
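One reading of the node count requirement is the master election quorum. In Elasticsearch 6.x the `discovery.zen.minimum_master_nodes` setting is normally set to a majority of the master-eligible nodes, and the arithmetic shows why a pair of nodes is awkward. The sketch below is illustrative only. The function names are hypothetical and it says nothing about how our cluster is actually configured.

```python
def master_quorum(master_eligible_nodes):
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible_nodes // 2 + 1

def tolerated_failures(master_eligible_nodes):
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible_nodes - master_quorum(master_eligible_nodes)

for n in (1, 2, 3):
    print(n, master_quorum(n), tolerated_failures(n))
```

With two nodes the quorum is also two, so losing either one leaves no majority. Three nodes is the smallest count that tolerates a single failure.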
We tested this architecture prior to putting any data on it. That involved pulling the power to the shared switch, turning down or otherwise degrading a single system, and in general mistreating things to understand their failure modes.
But we never simply pulled the plug on both systems at once. As it was a sag rather than a full outage, one of the workstations restarted spontaneously, while the other required a power on restart. Both systems booted cleanly, but Elasticsearch as an operational service was nowhere to be found after the VMs were restarted, and the bug hunt began.
After ninety minutes of head scratching and digging it became apparent that the 100% disk usage was some built in maintenance procedure required before the cluster would even start to recover in an observable fashion. Shutting down two VMs on each system permitted one on each to finish, then turning the other two on let the cluster begin to recover.
The cluster would likely have recovered on its own eventually, but the disk contention for three VMs all trying to validate data at once seems to perform worse than three individual machines recovering serially. We’ll do something about this revealed architectural fault the next time we expand, although it isn’t clear at this time precisely what that will be.
We lost no data, but we do have a six hour gap in our streaming coverage, and we lost six hours in processing time on the current batch of user IDs. This could have been something much worse. There will be some power conditioners ordered in the near future and a weekly fifteen minute outage on Sunday for simultaneous ZFS snapshots of the data VMs would seem to be a wise precaution.
This tweet came into the collection thanks to a study of #Qanon we did earlier this year. The actual inception of our current cluster hardware appears to have been on January 29th of 2019. The very earliest it could have been created was December 19th of 2018 – the release date for Elasticsearch 6.5.4.
The system is resilient to the loss of any one system, which was given an unintended test last night, with an inadvertent shutdown of one of the servers in the cluster. Recovery takes a couple of minutes given the services and virtual machines, but there was not even an interruption in processing.
Today we began the process of upgrading to the June 20th, 2019 release of Elasticsearch 6.8.1. There are a number of reasons for doing this:
Index Life Cycle Management (6.6)
Cross Cluster Replication (6.6)
Elasticsearch 7 Upgrade Assistant (6.6)
Rolling Upgrade To Elasticsearch 7 (6.7)
Better Index Type Labeling (6.7)
Security Features Bundled for Community Edition (6.8)
Conversion From Ubuntu to Debian Linux
We are not jumping directly to Elasticsearch 7.x due to some fairly esoteric issues involving field formats and concerns regarding some of the Python libraries that we use. Ubuntu has been fine for both desktop and server use, but we recently began using the very fussy Open Semantic Search, and it behaves well with Debian. Best of all, the OVA of a working virtual machine with the Netwar System code installed and running is just 1.9 gig.
Alongside the production ready Elasticsearch based system we are including Neo4j with some example data and working code. The example data is a small network taken from Canadian Parliament members and the code produces flat files suitable for import as well as native GML file output for Gephi. We ought to be storing relationships to Neo4j as we see them in streams, but this is still new enough that we are not confident shipping it.
Some questions that have cropped up and our best answers as of today:
Is Open Semantic Search going to be part of Netwar System?
We are certainly going to be doing a lot of work with OSS and this seems like a likely outcome, given that it has both Elasticsearch and Neo4j connectors. The driver here is the desire to maintain visibility into Mastodon instances as communities shift off Twitter – we can use OSS to capture RSS feeds.
Will Netwar System still support Search Guard?
Yes, because their academic licensing permits things that the community edition of Elasticsearch does not. We are not going to do Search Guard integration into the OVA, however. There are a couple reasons for that:
Doesn’t make sense on a single virtual machine.
Duplicate configs means a bad actor would have certificate based access to the system.
Eager, unprepared system operators could expose much more than just their collection system if they try to use it online.
Netdata monitoring provides new users insight into Elasticsearch behavior, and we have not managed to make that work with SSL secured systems.
We are seeking a sensible free/paid break point for this system. It’s not clear where a community system would end and an enterprise system would begin.
Is there a proper FOSS license?
Not yet, but we are going to follow customs in this area. A university professor should expect to be able to run a secure system for a team oriented class project without incurring any expense. Commercial users who want phone support will incur an annual cost. There will be value add components that will only be available to paying customers. Right now 100% of revenue is based on software as a service and we expect that to continue to be the norm.
So the BSD license seems likely.
When will the OVA be available?
It’s online this morning for internal users. If it doesn’t explode during testing today, a version with our credentials removed should be available Tuesday or Wednesday. Most of the work required to support http/https transparently was finished during first quarter. Once it’s up we’ll post a link to it here and there will be announcements on Twitter and LinkedIn.
An associate earlier this week mentioned having trouble getting Open Semantic Desktop Search to behave. This system offers an intriguing collection of capabilities, including an interface for Elasticsearch. Many hours later, we are picking our way through a minefield. This project is about to shift from Debian 9 to 10, and things are in terrible disarray.
First, some words about frustration free experimentation. If you store your virtual machines on a ZFS file system you can snapshot each time you complete an install step. If something goes wrong later, the snapshot/rollback procedure is essentially instantaneous. This is dramatically more useful than exporting VMs to OVA as a checkpoint. Keep in mind the file system will be dismounted during rollback; it’s best to have some VM specific space set aside.
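A small sketch of the snapshot and rollback steps, wrapped in Python so the commands can be reviewed before anything runs. The dataset name `tank/vms/oss` and the snapshot labels are hypothetical. The `zfs snapshot` and `zfs rollback` subcommands are standard, and `-r` on rollback destroys any snapshots taken after the target.

```python
import subprocess

def snapshot_cmd(dataset, label):
    """Command to snapshot a dataset, e.g. after completing an install step."""
    return ["zfs", "snapshot", f"{dataset}@{label}"]

def rollback_cmd(dataset, label):
    """Command to roll a dataset back to a snapshot.

    -r also destroys any snapshots taken after the target.
    """
    return ["zfs", "rollback", "-r", f"{dataset}@{label}"]

def run(cmd, dry_run=True):
    """Print the command by default; execute it only when dry_run is False."""
    if dry_run:
        print(" ".join(cmd))
        return 0
    return subprocess.run(cmd, check=True).returncode

# Hypothetical dataset holding one VM's disk images.
DATASET = "tank/vms/oss"
run(snapshot_cmd(DATASET, "debian-installed"))
run(rollback_cmd(DATASET, "debian-installed"))
```

Shut the VM down before rolling back, since the file system is dismounted during the operation.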
The project wants Debian proper, so take the time to get Debian 9.9 installed. The desktop OVA wanted a single processor and five gig of ram. Four cores and eight gig seemed to be a sensible amount for a server. Do remember to add a host-only interface under VirtualBox so you can have direct ssh and web access.
There are some precursors that you will need to put in place before trying to install the monolithic package.
apt install celeryd
apt install python3-pip
apt install python3-celery
apt install python-flower
Celery is a task queue manager and Flower provides a graphical interface to it at port 5555. These are missing from the monolithic package. You will also need to add the following to your /etc/security/limits.conf
Now Reboot The System So The Limits Are In Effect
Now you’re ready to install the monolithic package. This is going to produce an error indicating there are packages missing. You correct this problem with this command:
apt install -f
This is going to take a long time to run, maybe ten or fifteen minutes. It will reach 99% pretty quickly – and that’s about the 50% mark in terms of time and tasks. Once this is done, shut the system down, and take a snapshot. Be patient when you reboot it; the services are complex and hefty, and took a couple of minutes to all become available on our i7 test system. This is what the system looks like when fully operational.
The Celery daemon config also needs attention. The config in /etc/default/celeryd must be edited so that it is ENABLED, and the chroot to /opt/Myproject will cause a failure to start due to a missing directory. It seems safe to just turn this off.
Neo4j will be bound to just localhost and will not have a password. Since we’re building a server, rather than a specialty desktop, let’s fix this, too. The file is /etc/neo4j/neo4j.conf, these steps will permit remote access.
systemctl restart neo4j
visit http://yoursolrIP:7474 and set password
visit Config area in OSS web interface, add Neo4j credentials
Having completed these tasks, reboot the system to ensure it starts cleanly. You should find the Open Semantic Search interface here:
http://<IP of VM>/search
This seems like a good stopping point, but we are by no means finished. You can manually add content from the command line with the opensemanticsearch commands:
There are still many problems to be resolved. Periodic collection from data sources is not working, and web interface submissions are problematic as well. Attempts to parse RSS feeds generate numerous parse errors. Web pages do not import smoothly from either our WordPress based site or one hosted on the WordPress commercial site.
We will keep coming back to this area, hopefully quickly moving past the administration details, and getting into some actual OSINT collection tradecraft.
There is a surplus of articles out there regarding the steps mobile device makers take to keep you focused on their system. Similar strategies are employed by social media players, on both mobile and desktop. Payouts are variable in size and appear irregularly. Notifications are red, likely a bit animated, and if nothing is happening, particularly bad applications will offer you some hint that something is coming.
These design principles are effective and ethical … for a slot machine manufacturer. If one is trying to get some actual work done, this is the worst environment imaginable. If one is not neurotypical, the hazards are dramatically worse. I am in the shallow end of the autism spectrum pool, where I manage to pass much of the time, but I exercise tight control over my physical and digital workspaces in order to be productive.
I work at home, in a dim, quiet room, sitting at a desk facing a quiet, leafy side street. Dragonflies and hummingbirds are more common than passing vehicles. Two small desk lamps provide pools of light. There are a couple of playlists that contain low key, lyrics free music.
Earlier this year I switched from laptop to desktop, acquiring a 27″ 4K primary display. The smaller display to the right is a 24″ Samsung monitor that will rotate between landscape and portrait. My thinking was this would provide a space for reading. This was just a theory and it lasted all of two hours before I put the display back to landscape.
That experiment failed, but as I have grown more used to having such an enormous number of pixels in front of me, it has taken on an important role. I have a fairly small virtual machine running, its display set to 1920×1080, and the 24″ display is functionally a second machine for me, dedicated to things that are important but not urgent.
Important, Urgent: my billable hours, assisting others to get theirs.
Important, Not Urgent: system administration & software development.
Urgent, Unimportant: email, group chats that are off task.
Not Urgent & Unimportant: news of the day, Twitter drama, etc.
The smaller side display covers many things that matter, but which are best left alone unless specific things need doing. These include:
Chromium: Tweetdeck dedicated to a specific social media campaign.
Chrome: Tweetdeck dedicated to another campaign.
Firefox: Netdata observation of servers prone to overloading.
Bash shell: Four of them showing various performance metrics.
If I alt+tab into this system, it traps the keyboard. I can quickly cycle through the three major areas (two campaigns, system monitoring). I have to click to get free, and then I don’t look at it again, sometimes not till the next day. A grayscale screen lock image appears after five minutes of inactivity, reminding me it’s on, but offering no inducement to interact.
Email is encapsulated in a similar fashion. There are a couple VMs that do nothing but provide compartments for various accounts. I check them, do what is needed, and then close them. Updates come throughout the day, offering random payouts, typically of low value. I dispatch them all at once, usually AM & PM, and otherwise ignore.
App selection on the primary system is key. The Wire application works on desktops and mobiles, it supports up to three different accounts on the desktop, and it provides muting for busy group chats. I think I have one of every other chat system known to man, but I never use them unless I am summoned for some specific reason.
I have a virtual machine I’ve named Hunchly, after the browser activity recording tool by the same name, Hunch.ly. This hosts a couple of related tools and a broad, long term, low intensity social media presence, which lets me peer into various systems without getting enmeshed with personal contacts.
Software development and the related large scale social media analytics tasks are still on the host OS, but that will be the next thing to change. I have slowly begun to use PyCharm for development work due to a fairly new collaboration with another developer who favors it. I just learned there is OpenCYPHER support, which is going to facilitate our transition to using Neo4j for some social network analysis.
The smartphone is the equivalent of the dorm room you were compelled to inhabit your first year at college. There isn’t a lot of room and it’s a near constant ruckus. Your laptop or desktop is the cramped studio space overlooking the campus town bar strip that you moved into as a junior. A bit more room, but still endless distractions.
A machine that will support VirtualBox (free) in the style you need requires sixteen gig of ram and a solid state disk. This $485 Dell Precision M4600 is twin to the machine I use when I’m mobile. I can just copy virtual machine directories from desktop to laptop. As long as they’re two or four cores and four or eight gigabytes, this works well.
Application developers, content providers, and social network operators have no incentive to do any better. If you want to reclaim your time, sticking to the housing metaphor above, it’s like buying a sturdy farmhouse on a country road after it’s been empty for a while. You will have a bit more work to do on an ongoing basis, but the space, the quiet, and a roomy machine shed behind the two car garage? There is a reason we have an “escape to the country” societal meme. Apply that thinking to your online presence and you may be pleasantly surprised by the outcome.
Six months ago we published An Analyst’s Environment, which describes some tools we use that are a bit beyond the typical lone gun grassroots analyst. Since then our VPS based Elasticsearch cluster has given way to some Xeon equipment in racks, which led to Xeon equipment under desks.
Looking back over the past two months, we see a quickly maturing “build sheet” for analyst workstations. This is in no small part due to our discovery of Budgie, an Ubuntu Linux offshoot. Some of our best qualitative analysts are on Macs and they are extremely defensive of their work environment. Budgie permits at least some of that activity to move to Linux, and it’s thought that this will become increasingly common.
Do not assume that “I already use Ubuntu” is sufficient to evaluate Budgie. They are spending a lot of time taking off the rough edges. At the very least, put it in a VM and give it a look.
Once installed, we’re including the following packages by default:
The Hunch.ly web capture package requires Google Chrome.
Chromium provides a separate unrecorded browser.
Maltego CE link analysis package is useful even if constrained.
Evernote is popular with some of our people, Tusk works on Linux.
XMind Zen provides mind mapping that works on all platforms.
Timeline has been a long term player and keeps adding features.
Gephi data visualization works, no matter what sized screen is used.
Both Talkwalker Alerts and Inoreader feeds are RSS based. People seem to be happy with the web interface, but what happens when you’re in a place without network access? There are a number of RSS related applications in Budgie’s slick software store. Someone is going to have to go through them and see which best fits that particular use case.
There have been so many iterations of this set of recommendations, most conditioned by the desire to support Windows as well as Mac and Linux. The proliferation of older Xeon equipment, in particular the second generation HP Z420/Z620/Z820, which start in usable condition at around $150, means we no longer have that constraint.
Sampling of inexpensive HP Z420s on Ebay in May of 2019.
Starting with that base, 64 gigabytes of additional memory is about $150, and another $200 will cover a 500 gigabyte Crucial solid state disk and the fanless entry level Nvidia GT1030.
The specific combination of the Z420 and the Xeon E5-2650L v2 has a benchmark that matches the current MacBook Pro. It will be literally an order of magnitude faster on Gephi, the most demanding of those applications, and it will happily work for hours on end without making a sound. The Mac, on the other hand, will be making about as much noise as a Shopvac after just five minutes.
That chip and some Thermal Grizzly Kryonaut should not cost you more than $60 and will take a base Z420 from four cores to ten. So there you have it – mostly free software, a workstation you can build incrementally, and then you have the foundation required to sort out complex problems.
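To make the running total explicit, here is a small tally of the figures quoted above. The prices are the approximate ones from the text, the line item names are just labels chosen for this sketch, and the CPU figure is the stated ceiling of $60 rather than an actual quote.

```python
# Approximate prices in USD, taken from the figures quoted in the text.
# The labels are illustrative; actual listings will vary.
build = {
    "base HP Z420": 150,
    "64 GB additional memory": 150,
    "500 GB SSD plus GT1030": 200,
    "Xeon E5-2650L v2 plus thermal paste (upper bound)": 60,
}

total = sum(build.values())

for part, price in build.items():
    print(f"{part:<52} ${price}")
print(f"{'total':<52} ${total}")
```

Because the build is incremental, the same table doubles as a shopping order: the base machine is usable on day one and each later line is an independent upgrade.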
One of the barriers to making the Netwar System more broadly available has been taking the step of making a system publicly available. We used an Apache reverse proxy and passwords at first, but that was clumsy and limited. We also wanted to make select Visualizations available via iframe embedding, and that is a forbidding transit without a proper map.
Search Guard provides a demo install and it’s great for what it is, providing a working base configuration and most importantly furnishing an operational Certificate Authority. If you took a look around their site, you know they make their money with enterprise licensing, often with a compliance angle. The demo’s top level config file gives equal time to Active Directory, Kerberos, LDAP, and JSON Web Tokens. An enterprise solution architect is going to give a pleased nod to this – there’s something for everyone in there.
But our position is a bit different. Certificate Authority creation is a niche that anyone who has handled an ISP or hosting operation understands, but there seems to be an implicit assumption in the Search Guard documents – that the reader is already familiar with Elasticsearch in an enterprise context. None of us had ever used it in an enterprise setting, so it’s been a steep learning curve.
This post can be seen as a follow on to Installing Netwar System, which covers how to commission a system without securing it for a team. We currently use Elasticsearch 6.5.4 in production. This article is going to cover using 6.7.1, which we need to start exploring.
The instructions for Installing Netwar System cover not just Elasticsearch, they also address the tuning that has to be done in sysctl.conf and limits.conf, which you need to do if you plan on putting any sort of load on the system.
The first step is installing the Search Guard demo, starting from /usr/share/elasticsearch:
This will complete the install, but with the following warnings:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: plugin requires additional permissions @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
java.io.FilePermission /proc/sys/net/core/somaxconn read
java.lang.RuntimePermission accessClassInPackage.com.sun.jndi.ldap
java.lang.RuntimePermission accessClassInPackage.sun.misc
java.lang.RuntimePermission accessClassInPackage.sun.nio.ch
java.lang.RuntimePermission accessClassInPackage.sun.security.x509
java.lang.RuntimePermission accessDeclaredMembers
java.lang.RuntimePermission accessUserInformation
java.lang.RuntimePermission createClassLoader
java.lang.RuntimePermission getClassLoader
java.lang.RuntimePermission setContextClassLoader
java.lang.RuntimePermission shutdownHooks
java.lang.reflect.ReflectPermission suppressAccessChecks
java.net.NetPermission getNetworkInformation
java.net.NetPermission getProxySelector
java.net.SocketPermission * connect,accept,resolve
java.security.SecurityPermission getProperty.ssl.KeyManagerFactory.algorithm
java.security.SecurityPermission insertProvider.BC
java.security.SecurityPermission org.apache.xml.security.register
java.security.SecurityPermission putProviderProperty.BC
java.security.SecurityPermission setProperty.ocsp.enable
java.util.PropertyPermission * read,write
java.util.PropertyPermission org.apache.xml.security.ignoreLineBreaks write
javax.security.auth.AuthPermission doAs
javax.security.auth.AuthPermission modifyPrivateCredentials
javax.security.auth.kerberos.ServicePermission * accept
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.
-> Installed search-guard-6
This is fine, no action required. The next step is to execute the actual demo installation script.
cd /usr/share/elasticsearch/plugins/search-guard-6/tools/
chmod 755 install_demo_configuration.sh
./install_demo_configuration.sh
This will install the demo’s files in /etc/elasticsearch and it will provide a tip on how to update its configuration, using the following command. I grew weary of hunting for this, so I turned it into a shell script called /usr/local/bin/updatesg.
What this is doing, briefly, is picking up the files in the sgconfig directory and applying them to the running instance, using a client certificate called ‘kirk’, which was signed by the local root Certificate Authority.
There are five files that are part of the configuration and a sixth that was merged with your /etc/elasticsearch/elasticsearch.yml file.
The sg_config.yml file defines where the system learns how to authenticate users. I have created a minimal file that only works for the management cert and locally defined users.
The sg_internal_users.yml file is the equivalent of the /etc/passwd and /etc/group files of Unix. You add a username, you hash their password, and they need at least one item in their roles: subsection. The roles: are how authorization for various activities is conferred on the user.
Defining roles is more complex than it needs to be for a standalone system, a result of the broad support for different enterprise systems. There are three files involved in this. The first are the action groups. Here’s an example:
An action group contains one or more permissions, and optionally it can reference one or more other action groups.
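The nesting is easier to see when flattened. What follows is a minimal sketch of how a group that references other groups resolves to a plain set of permissions. The group names (EXAMPLE_SEARCH, EXAMPLE_GET, EXAMPLE_READ) are invented for illustration and are not taken from the demo files, and the resolver is our own illustration of the idea, not Search Guard's code.

```python
def expand_action_group(name, groups, _seen=None):
    """Return the flat set of permissions granted by an action group.

    An entry that is itself a key in `groups` is treated as a reference
    to another action group and expanded recursively; anything else is
    treated as a literal permission string.
    """
    seen = _seen if _seen is not None else set()
    if name in seen:
        # Guard against groups that reference each other in a loop.
        return set()
    seen.add(name)

    permissions = set()
    for entry in groups[name]:
        if entry in groups:
            permissions |= expand_action_group(entry, groups, seen)
        else:
            permissions.add(entry)
    return permissions


# Hypothetical action groups, shaped like a name -> list-of-entries mapping.
ACTION_GROUPS = {
    "EXAMPLE_SEARCH": ["indices:data/read/search*"],
    "EXAMPLE_GET": ["indices:data/read/get*"],
    "EXAMPLE_READ": [
        "EXAMPLE_SEARCH",
        "EXAMPLE_GET",
        "indices:data/read/field_caps*",
    ],
}

for permission in sorted(expand_action_group("EXAMPLE_READ", ACTION_GROUPS)):
    print(permission)
```

The point of the bundle is that a role only has to name EXAMPLE_READ and it picks up everything underneath it.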
The next thing to look at is the sg_roles.yml file. This is where those action groups get assigned to a particular Search Guard role. You’ll see the action groups referenced in ALLCAPS, and there are additional very specific authorizations that some types of users need. These are applied to the cluster: and to groups of indices:. They also apply to tenants:, which is an enterprise feature that won’t appear in the demo install. A tenant is somewhat analogous to a Unix group, it’s a container of indices and related things (like Visualizations).
The first role is the internal stuff for the Kibana server itself. The second is a modification I made in order to permit read only access to one specific index (usertest) and a cluster of indices with similar names (usersby*). The blep tenant does not do anything yet, we’re just starting to explore that feature.
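For a feel of what the two index entries cover, here is a small sketch that treats them as shell-style wildcards. This is an approximation for reasoning about which index names fall inside the role; Search Guard's own pattern matching may behave differently in edge cases. The function name and the sample index names other than usertest are made up for the example.

```python
from fnmatch import fnmatchcase

# The two index entries described above: one exact name, one wildcard.
READ_ONLY_PATTERNS = ["usertest", "usersby*"]


def role_can_read(index_name, patterns=READ_ONLY_PATTERNS):
    """Return True if the index name matches any pattern in the role.

    fnmatchcase gives case-sensitive glob matching, which is only an
    approximation of how the real plugin evaluates index patterns.
    """
    return any(fnmatchcase(index_name, pattern) for pattern in patterns)


# Hypothetical index names, chosen to show both hits and misses.
for name in ["usertest", "usersbyregion", "usertest2", "tweets"]:
    print(name, role_can_read(name))
```

Note that usertest2 falls outside the role: the first entry is an exact name, so only the usersby prefix picks up a family of indices.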
Finally, all of this stuff has to be turned into effects within Elasticsearch itself. A user with a Search Guard role that has permissions, either directly, or via an action group bundle, has to be connected to an Elasticsearch role. The sg_roles_mapping.yml file does this.
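The chain from a login to a set of Search Guard roles can be sketched as a lookup across two tables. This is a simplified model under our own assumptions: the user names, role names, and the exact keys ("roles", "users", "backendroles") are illustrative stand-ins for what lives in sg_internal_users.yml and sg_roles_mapping.yml, so check them against your own files rather than trusting this layout.

```python
# Hypothetical stand-in for sg_internal_users.yml: each user carries
# at least one entry in their roles list.
INTERNAL_USERS = {
    "analyst1": {"roles": ["readonly_team"]},
    "ops1": {"roles": ["admin"]},
}

# Hypothetical stand-in for sg_roles_mapping.yml: each Search Guard role
# lists the user roles and/or user names it should apply to.
ROLES_MAPPING = {
    "sg_example_readonly": {"backendroles": ["readonly_team"]},
    "sg_example_admin": {"backendroles": ["admin"], "users": ["ops1"]},
}


def search_guard_roles_for(username):
    """Return the sorted Search Guard roles a user ends up with."""
    user_roles = set(INTERNAL_USERS[username]["roles"])
    matched = []
    for sg_role, mapping in ROLES_MAPPING.items():
        mapped_by_name = username in mapping.get("users", [])
        mapped_by_role = bool(user_roles & set(mapping.get("backendroles", [])))
        if mapped_by_name or mapped_by_role:
            matched.append(sg_role)
    return sorted(matched)


print(search_guard_roles_for("analyst1"))
print(search_guard_roles_for("ops1"))
```

If a user can log in but sees nothing, this is the chain to walk: a missing entry at any one of the hops produces the same empty result.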
One of the biggest challenges in this has been the lack of visual aids. Search Guard has no diagrams to go with their documentation and Elasticsearch is nearly as bare. I happened to find this one while working on this article. It would have been an immense help to have seen it six months ago.
You will also need to install Kibana, using these instructions. The following are the essential lines from /etc/kibana/kibana.yml.
Our first production Elasticsearch cluster was 5.6.12, then we upgraded to 6.5.1, and finally settled on 6.5.4. We’re on our third set of hardware for the cluster, and along the way there have been a number of problems. The following are things to always check with a Search Guard install:
Which user owns /usr/share/elasticsearch?
Which user owns /etc/elasticsearch?
Which user owns /usr/share/kibana?
Which user owns /etc/kibana?
Where is the Elasticsearch data stored?
Which user owns the data directory?
Are the perms in /etc directories all set to 600?
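Answering those questions by hand gets tedious, so here is a minimal sketch that reports owner and mode for the directories named above and lists any files whose mode is not 600. It uses only the standard library and assumes a Linux host. The helper names are our own, and the data directory is left out because its location depends on your install.

```python
import os
import pwd
import stat


def describe(path):
    """Return (owner name, octal mode string) for a path."""
    info = os.stat(path)
    try:
        owner = pwd.getpwuid(info.st_uid).pw_name
    except KeyError:
        # The uid has no passwd entry; fall back to the raw number.
        owner = str(info.st_uid)
    mode = format(stat.S_IMODE(info.st_mode), "03o")
    return owner, mode


def files_not_matching(directory, expected="600"):
    """Return the files under a directory whose mode differs from expected."""
    offenders = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full_path = os.path.join(root, name)
            if describe(full_path)[1] != expected:
                offenders.append(full_path)
    return sorted(offenders)


if __name__ == "__main__":
    # The four directories from the checklist; missing ones are skipped.
    for path in [
        "/usr/share/elasticsearch",
        "/etc/elasticsearch",
        "/usr/share/kibana",
        "/etc/kibana",
    ]:
        if os.path.exists(path):
            print(path, *describe(path))
```

Run it as root, since the configuration directories are often unreadable to ordinary users, and call files_not_matching on the two /etc directories to see which files still need a chmod.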
Just in case it wasn’t abundantly clear, DO NOT TRY THIS WITH PRODUCTION DATA. You will certainly hit some sort of snag your first time out and it’s quite possible to leave yourself with a system you cannot access. Make a virtual machine, use zfs for the spaces Elasticsearch uses, and take a snapshot at each milestone, so you aren’t rebuilding from scratch. Not doing these things was the price we paid for any fluency we might display with the platform now.
Having come this far, it would appear the next natural step would be doing the work required to build a self-signed Certificate Authority suitable for use with a small cluster of machines. That will have to wait until next week. In the meantime, this post provides plenty of guidance for your experiments.