Installing Netwar System

Online conflicts are evolving rapidly and escalating. Three years ago we judged that it was best to not release a conflict oriented tool, even one that is used purely for observation. Given the events since then, this notion of not proliferating seems … quaint.

So we released the Netwar System code, the companion ELK utilities, and this week we are going to revisit the Twitter Utils, a set of small scripts that are part of our first generation software, and which are still used for some day to day tasks.

When you live with a programming language and a couple of fairly complex distributed systems, there are troubles that arise which can be dispatched almost without thought. A new person attempting to use such a system might founder on one of them, so this post memorializes what is required for a from-scratch install on a fresh Ubuntu 18.04 system.

Python

We converted to Python 3 a while ago. The default install includes Python 3.6.7, but you also need pip and git.

apt install python3-pip
apt install git
ln -s /usr/bin/python3 /usr/bin/python
ln -s /usr/bin/pip3 /usr/bin/pip

The next step is to clone the Netwar System repository into your local directory, make the commands executable, and place them on your path.

git clone git@github.com:NetwarSystem/NetwarSystem.git

cd NetwarSystem

chmod 755 tw-*

chmod 755 F-queue

cp tw-* /usr/local/bin/

cp F-queue /usr/local/bin/

Once that’s done, it’s time to install lots of packages. This is normally done like this:

pip install -r REQUIREMENTS.txt

But our REQUIREMENTS.txt for the Netwar System was pretty stale. We think it’s OK now, but here is how we updated it. A little bit of grep/sort/uniq provided this list of missing packages.

configparser
elasticsearch
elasticsearch_dsl
psutil
py2neo
redis
setproctitle
squish2
tweepy
walrus
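
If you ever need to regenerate such a list, here is a rough sketch of the kind of grep/sort/uniq pipeline involved; the exact invocation we used was not preserved, so treat it as illustrative. Comparing its output against pip list shows what still needs installing.

grep -hE '^(import|from) ' tw-* F-queue | awk '{print $2}' | cut -d. -f1 | sort | uniq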

You can manually install those and they’ll all work, except for squish2, the name for our internal package that contains the code to “squish” bulky, low value fields out of tweets and user profiles. This requires special handling like so.

cd NetwarSystem/squish2
pip install -e .

If you have any errors related to urllib3, SSL, or XML, those might be subtle dependency problems. Post them as issues on Github.

Elasticsearch Commands

There are a bunch of Elasticsearch related scripts in the ELKSG repository. You should clone them and then copy them into your path.

git clone git@github.com:NetwarSystem/ELKSG.git

cd ELKSG

chmod 755 elk*

cp elk* /usr/local/bin/

The ELK software can handle a simple install, or one with Search Guard. This is the simple setup, so add this final line to your ~/.profile so the scripts know where to find Elasticsearch.

export ELKHOST="http://localhost:9200"

Debian Packages

You need the following four pieces of software to get the system running in standalone mode.

  • Redis
  • Netdata
  • Elasticsearch
  • Neo4j

Redis and Netdata are simple.

apt update
apt install redis

There is an install procedure for Netdata that is really slick. Copy one command, paste it in a shell, it does the install, and makes the service active on port 19999.
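
At the time of writing, the one-liner from the Netdata site looked like the following; check their documentation for the current form before pasting anything into a root shell.

bash <(curl -Ss https://my-netdata.io/kickstart.sh)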

Elasticsearch and Neo4j require a bit more work to get the correct version. The first three lines below, shown commented out, used to install the Oracle JDK 8, but Oracle changed its licensing in late April of 2019. The OpenJDK install seems to make its dependents happy, but we are trialing it on a new system; our production machines still run the last working official Oracle setup.

# add-apt-repository ppa:webupd8team/java

# apt update

# apt install oracle-java8-installer

add-apt-repository ppa:openjdk-r/ppa

apt update

apt install openjdk-8-jre

apt install curl apt-transport-https

curl -s https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add -

echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | tee /etc/apt/sources.list.d/elastic-6.x.list

apt update

apt install elasticsearch=6.5.4

apt install kibana=6.5.4

mv /etc/apt/sources.list.d/elastic-6.x.list /etc/apt/sources.list.d/elastic-6.x.idle

systemctl enable elasticsearch

systemctl start elasticsearch

systemctl enable kibana

systemctl start kibana

The mv line keeps the Elasticsearch repository file in your sources directory but takes it out of play, since apt only reads files ending in .list. This is so you can update the rest of your system without stepping on the specific version needed.

Neo4j is similar, but it’s fine to track the latest version. Also note that Neo4j is a Java app – it needs the same Java installer we added for Elasticsearch.

wget -O - https://debian.neo4j.org/neotechnology.gpg.key | apt-key add -

echo 'deb https://debian.neo4j.org/repo stable/' | tee -a /etc/apt/sources.list.d/neo4j.list

apt update

apt install neo4j=1:3.5.4

Note that the version mentioned there is just what happens to be in the Neo4j install instructions on the day this article was written. This is not sensitive the way Elasticsearch is.

At this point you should have all four applications running. The one potential problem is Kibana, which may fail to start because it depends on Elasticsearch, which takes a couple minutes to come alive the first time it is run. Try these commands to verify:

systemctl status redis
systemctl status elasticsearch
systemctl status kibana
systemctl status neo4j

In terms of open TCP ports, try the following, which checks the access ports for Kibana, Redis, Neo4j, Elasticsearch, and Netdata.

netstat -lan | awk '/:5601|:6379|:7474|:9200|:19999/'

And that’s that – you’ve got the software installed. Now we need to configure some things.

Linux & Packages Configuration

There are a number of things that need adjusting in order for the system to run smoothly. Elasticsearch will cause dropped packets under load, so let's add these two lines to /etc/sysctl.conf:

net.core.netdev_budget=3500
net.core.netdev_budget_usecs=35000

And then make them immediately active:

sysctl -w net.core.netdev_budget=3500
sysctl -w net.core.netdev_budget_usecs=35000

We also need to adjust the file handle and process limits upward for Elasticsearch's Lucene component and Neo4j's worker threads. Add these lines to /etc/security/limits.conf; the actual file uses tab stops, which do not survive the blog formatting. Here it's simplest to reboot to make these settings active.

elasticsearch   -      nofile   300000
neo4j           -      nofile   300000
root            -      nofile   300000
neo4j           hard   nproc    10000
neo4j           soft   nproc    10000

If you're running this software on your desktop, pointing a web browser at port 5601 will show Kibana and 7474 will show Neo4j. If you're using a standalone or virtual machine, you'll need to open some access. Here are three sed one-liners that will do that.

sed -i 's/#network.host: 192.168.0.1/network.host: 0.0.0.0/' /etc/elasticsearch/elasticsearch.yml

sed -i 's/#server.host: \"localhost\"/server.host: 0.0.0.0/' /etc/kibana/kibana.yml

sed -i 's/#dbms.connectors.default_listen/dbms.connectors.default_listen/' /etc/neo4j/neo4j.conf

systemctl restart elasticsearch

systemctl restart kibana

systemctl restart neo4j

Elasticsearch doesn’t require a password in this configuration, but Neo4j does, and it’ll make you change it from the default of ‘neo4j’ the first time you log in to the system.

OK, point your browser at port 19999, and you should see this:

Netdata status on a working system.

Notice the elasticsearch local and Redis local tabs at the lower right. You can get really detailed information on what Elasticsearch is doing, which is helpful when you are just starting to explore its capabilities.

Configuring Your First Twitter Account

You must have a set of Twitter application keys to take the next step. You’ll need to add the Consumer Key and Consumer Secret to the tw-auth command. Run it, paste the URL it offers into a browser, log in with your Twitter account, enter the seven digit PIN from the browser into the script, and it will create a ~/.twitter file that looks something like this.
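
What follows is a purely hypothetical sketch of that file; apart from elksg, elksguser, and elksgpass, which the next paragraph discusses, the field names are invented, and the real layout is whatever tw-auth writes plus the edits described below.

[netwar]                                    # hypothetical INI-style layout
consumerkey = <your Twitter Consumer Key>
consumersecret = <your Twitter Consumer Secret>
accesstoken = <written by tw-auth after the PIN step>
accesssecret = <written by tw-auth after the PIN step>
neopass = <the Neo4j password you set>      # hypothetical field name
elksg = http://localhost:9200
elksguser = placeholder
elksgpass = placeholder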



You’ll need to enter the Neo4j password you set earlier. The elksg variable has to point to the correct host and port. The elksguser/elksgpass entries are just placeholders. If you got this right, this command will cough up your login shell name and Twitter screen name.

tw-myname

Next, you can check that your Elasticsearch commands are working:

elk-health

Now is the time to get Elasticsearch ready to accept Twitter data. Mostly this involves making sure it recognizes timestamps. Issue these commands:

elk-userids

elk-tuindices

elk-newidx

elk-mylog

elk-set2k

The first three ensure that timestamps work for the master user index, any tu* index related to a specific collection, and any tw* index containing tweets. The mylog command ensures the perflog index is searchable. The last command bumps the field limit on indices. Experienced Elasticsearch users will be scratching their heads on this one; we still have much to learn here, so feel free to educate us on how to permanently handle that problem.
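
The elk-set2k name suggests it raises Elasticsearch's index.mapping.total_fields.limit setting to 2000; that is an assumption on our part, but the equivalent manual command looks like this, with the index name and value purely illustrative:

curl -s -H 'Content-Type: application/json' -X PUT "$ELKHOST/usertest/_settings" -d '{"index.mapping.total_fields.limit": 2000}'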

If you want to see what these did, this command will show you a lot of JSON.

elk-showfmt

And now we’re dangerously close to actually getting some content in Elasticsearch. Try the following commands:

tw-friendquick NetwarSystem > test.txt

tw-load4usertest test.txt

tw-showusertest

This should produce a file with around 180 numeric Twitter IDs that are followed by @NetwarSystem, load them into Redis for processing, and the last command will give you a count of how many are loaded. This is the big moment; try this command next:

time tw-queue2usertest

That command should spew a bunch of JSON as it runs. The preceding time command will tell you how long it took, a useful thing when performance tuning long running processes.

Now try this one:

elk-list

You should get back two very long lines of text – one for the usertest index, showing about 180 documents, and the other for perflog, which will just have a few.

There, you’ve done it! Now let’s examine the results.

Into Kibana

Your next steps require the Kibana graphical interface. Point your browser at port 5601 on your system. You’ll be presented with the Kibana welcome page. You can follow their tutorial if you’d like. Once you’ve done that, or skipped it, you will do the following:

  • Go to the Management tab
  • Select Index Patterns
  • Create an Index Pattern for the usertest index

There should be a couple of choices for time fields – one for when the user account was created, the other for the date of their last tweet. Once you've done this, go to the Discover tab, which should default to your newly created Index Pattern. Play with the time picker at the upper right, find the Relative option, and set it to 13 years. You should see a creation date histogram something like this:

Conclusion

Writing this post involved grinding off every burr we found in the GitHub repositories, which was an all-day job, but we've come to the point where you have cut & pasted all you can. The next steps will involve watching videos about how to use Kibana, laying hands on a copy of Elasticsearch: The Definitive Guide, and installing Graphileon so you can explore the Neo4j data.

If you need help, DM @NetwarSystem your email address, and we’ll send you an invite to the Netwar System Slack server.

Six Hours Of Downtime

Earlier today a power transient knocked both of our systems offline. I noticed it at the house and quickly figured out that it hit the office, too.

Recall the configuration of our systems:

  • HP Z620 workstations
  • Dual NAS grade disks
  • ZFS based mirroring
  • Multiple VMs hosting data

The multiple VM configuration we use, three per workstation, is in place for two reasons. Elasticsearch has requirements on the number of available systems before it will behave properly, such that two systems alone are not workable. The second reason is the 32 gig memory limit for a Java virtual machine. These 192 gig systems easily support three large VMs and a good sized ZFS cache.

We tested this architecture prior to putting any data on it. That involved pulling the power to the shared switch, turning down or otherwise degrading a single system, and in general mistreating things to understand their failure modes.

But we never simply pulled the plug on both systems at once. As it was a sag rather than a full outage, one of the workstations restarted spontaneously, while the other required a power on restart. Both systems booted cleanly, but Elasticsearch as an operational service was nowhere to be found after the VMs were restarted, and the bug hunt began.

After ninety minutes of head scratching and digging it became apparent that the 100% disk usage was due to some built-in maintenance procedure required before the cluster would even start to recover in an observable fashion. Shutting down two VMs on each system permitted one on each to finish, then turning the other two on let the cluster begin to recover.

The cluster would likely have recovered on its own eventually, but the disk contention for three VMs all trying to validate data at once seems to perform worse than three individual machines recovering serially. We’ll do something about this revealed architectural fault the next time we expand, although it isn’t clear at this time precisely what that will be.

We lost no data, but we do have a six hour gap in our streaming coverage, and we lost six hours of processing time on the current batch of user IDs. This could have been something much worse. We will order some power conditioners in the near future, and a weekly fifteen minute outage on Sunday for simultaneous ZFS snapshots of the data VMs seems like a wise precaution.

Migrating To Elasticsearch 6.8.1

Consulting our combined collection this morning, we found the oldest to be this 2008 anniversary tweet from @vipe248.

This tweet came into the collection thanks to a study of #Qanon we did earlier this year. The actual inception of our current cluster hardware appears to have been on January 29th of 2019. The very earliest it could have been created was December 19th of 2018 – the release date for Elasticsearch 6.5.4.

The cluster is resilient to the loss of any one system, which was given an unintended test last night with an inadvertent shutdown of one of the servers. Recovery takes a couple of minutes given the services and virtual machines involved, but there was not even an interruption in processing.

Today we began the process of upgrading to the June 20th, 2019 release of Elasticsearch 6.8.1. There are a number of reasons for doing this:

  • Index Life Cycle Management (6.6)
  • Cross Cluster Replication (6.6)
  • Elasticsearch 7 Upgrade Assistant (6.6)
  • Rolling Upgrade To Elasticsearch 7 (6.7)
  • Better Index Type Labeling (6.7)
  • Security Features Bundled for Community Edition (6.8)
  • Conversion From Ubuntu to Debian Linux

We are not jumping directly to Elasticsearch 7.x due to some fairly esoteric issues involving field formats and concerns regarding some of the Python libraries that we use. Ubuntu has been fine for both desktop and server use, but we recently began using the very fussy Open Semantic Search, and it behaves well with Debian. Best of all, the OVA of a working virtual machine with the Netwar System code installed and running is just 1.9 gig.

Alongside the production ready Elasticsearch based system we are including Neo4j with some example data and working code. The example data is a small network taken from Canadian Parliament members and the code produces flat files suitable for import as well as native GML file output for Gephi. We ought to be storing relationships to Neo4j as we see them in streams, but this is still new enough that we are not confident shipping it.

Some questions that have cropped up and our best answers as of today:

Is Open Semantic Search going to be part of Netwar System?

We are certainly going to be doing a lot of work with OSS and this seems like a likely outcome, given that it has both Elasticsearch and Neo4j connectors. The driver here is the desire to maintain visibility into Mastodon instances as communities shift off Twitter – we can use OSS to capture RSS feeds.

Will Netwar System still support Search Guard?

Yes, because their academic licensing permits things that the community edition of Elasticsearch does not. We are not going to do Search Guard integration into the OVA, however. There are a couple reasons for that:

  • Doesn’t make sense on a single virtual machine.
  • Duplicate configs means a bad actor would have certificate based access to the system.
  • Eager, unprepared system operators could expose much more than just their collection system if they try to use it online.
  • Netdata monitoring provides new users insight into Elasticsearch behavior, and we have not managed to make that work with SSL secured systems.
  • We are seeking a sensible free/paid break point for this system. It’s not clear where a community system would end and an enterprise system would begin.

Is there a proper FOSS license?

Not yet, but we are going to follow customs in this area. A university professor should expect to be able to run a secure system for a team oriented class project without incurring any expense. Commercial users who want phone support will incur an annual cost. There will be value add components that will only be available to paying customers. Right now 100% of revenue is based on software as a service and we expect that to continue to be the norm.

So the BSD license seems likely.

When will the OVA be available?

It's online this morning for internal users. If it doesn't explode during testing today, a version with our credentials removed should be available Tuesday or Wednesday. Most of the work required to support http/https transparently was finished during the first quarter. Once it's up we'll post a link to it here and there will be announcements on Twitter and LinkedIn.

Coaxing Open Semantic Server Into Operating Condition

An associate earlier this week mentioned having trouble getting Open Semantic Desktop Search to behave. This system offers an intriguing collection of capabilities, including an interface for Elasticsearch. Many hours later, we are picking our way through a minefield. This project is about to shift from Debian 9 to 10, and things are in terrible disarray.

First, some words about frustration free experimentation. If you store your virtual machines on a ZFS file system you can snapshot each time you complete an install step. If something goes wrong later, the snapshot/rollback procedure is essentially instantaneous. This is dramatically more useful than exporting VMs to OVA as a checkpoint. Keep in mind the file system will be unmounted during rollback; it's best to have some VM-specific space set aside.
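
A minimal sketch of that workflow, assuming a dataset named tank/vms/oss dedicated to the VM, with the VM shut down before snapshotting or rolling back:

zfs snapshot tank/vms/oss@monolithic-package-installed    # take a checkpoint after each install step
zfs rollback tank/vms/oss@monolithic-package-installed    # instant return to that checkpoint if a later step fails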

The project wants Debian proper, so take the time to get Debian 9.9 installed. The desktop OVA wanted a single processor and five gig of ram. Four cores and eight gig seemed to be a sensible amount for a server. Do remember to add a host-only interface under VirtualBox so you can have direct ssh and web access.

There are some precursors that you will need to put in place before trying to install the monolithic package.

  • apt install celeryd
  • apt install python3-pip
  • apt install python3-celery
  • apt install python-flower

Celery is a task queue manager and Flower provides a graphical interface to it at port 5555. These are missing from the monolithic package. You will also need to add the following to your /etc/security/limits.conf

Now Reboot The System So The Limits Are In Effect

Now you’re ready to install the monolithic package. This is going to produce an error indicating there are packages missing. You correct this problem with this command:

apt install -f
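
For context, the whole sequence looks roughly like this; the package file name is illustrative and depends on the release you downloaded from the Open Semantic Search site.

dpkg -i open-semantic-search_*.deb    # complains about missing dependencies
apt install -f                        # pulls in the missing packages and finishes the configuration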

This is going to take a long time to run, maybe ten or fifteen minutes. It will reach 99% pretty quickly – and that's about the 50% mark in terms of time and tasks. Once this is done, shut the system down and take a snapshot. Be patient when you reboot it; the services are complex and hefty, and they took a couple of minutes to all become available on our i7 test system. This is what the system looks like when fully operational.

  • 25672 – RabbitMQ message broker
  • 8080 – spaCy natural language processing
  • 4369 – RabbitMQ beam protocol
  • 22 – ssh, installed for remote access
  • 25 – SMTP for local email, part of Debian
  • 7687 – Neo4j BOLT (server to server) protocol
  • 5672 – RabbitMQ
  • 9998 – Apache Tika document handling service
  • 7983 – Apache Solr (stop port)
  • 80 – Apache web server
  • 7473 – Neo4j SSL web console
  • 7474 – Neo4j web console
  • 8983 – Apache Solr

Once this is done, you must address GitHub issue #29, "flower doesn't start automatically". You'll need the /etc/rc.local file that their process installs early on and then later removes.

The Celery daemon config also needs attention. The config in /etc/default/celeryd must be edited so that it is ENABLED, and the chroot to /opt/Myproject will cause a failure to start due to a missing directory. It seems safe to just turn this off.
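
A sketch of what those edits amounted to on our install; the variable names follow the stock /etc/default/celeryd example, so double check them against your copy of the file.

ENABLED="true"                       # the daemon ships disabled
# CELERYD_CHDIR="/opt/Myproject/"    # comment out the chdir; this directory does not exist here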

Neo4j will be bound to just localhost and will not have a password. Since we’re building a server, rather than a specialty desktop, let’s fix this, too. The file is /etc/neo4j/neo4j.conf, these steps will permit remote access.

  • dbms.security.auth_enabled=true
  • dbms.connectors.default_listen_address=0.0.0.0
  • systemctl restart neo4j
  • visit http://yoursolrIP:7474 and set password
  • visit Config area in OSS web interface, add Neo4j credentials

Having completed these tasks, reboot the system to ensure it starts cleanly. You should find the Open Semantic Search interface here:

http://<IP of VM>/search

This seems like a good stopping point, but we are by no means finished. You can manually add content from the command line with the opensemanticsearch commands:

  • opensemanticsearch-delete
  • opensemanticsearch-enrich
  • opensemanticsearch-filemonitoring
  • opensemanticsearch-index-dir
  • opensemanticsearch-index-file
  • opensemanticsearch-index-rss
  • opensemanticsearch-index-sitemap
  • opensemanticsearch-index-sparql
  • opensemanticsearch-index-web
  • opensemanticsearch-index-web-crawl
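
As a usage sketch, the indexing commands take the obvious target as an argument; the URLs below are illustrative and our best guess at typical usage, so check each command's help output before relying on it.

opensemanticsearch-index-web "https://netwarsystem.com/"
opensemanticsearch-index-rss "https://netwarsystem.com/feed/"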

There are still many problems to be resolved. Periodic collection from data sources is not working, and web interface submissions are problematic as well. Attempts to parse RSS feeds generate numerous parse errors. Web pages do not import smoothly from our WordPress based site, nor from one hosted on the commercial WordPress.com service.

We will keep coming back to this area, hopefully quickly moving past the administration details, and getting into some actual OSINT collection tradecraft.

Attention Conservation: Be In Charge

There is a surplus of articles out there regarding the steps mobile device makers take to keep you focused on their system. Similar strategies are employed by social media players, on both mobile and desktop. Payouts are variable in size and appear irregularly. Notifications are red, likely a bit animated, and if nothing is happening, particularly bad applications will offer you some hint that something is coming.

These design principles are effective and ethical … for a slot machine manufacturer. If one is trying to get some actual work done, this is the worst environment imaginable. If one is not neurotypical, the hazards are dramatically worse. I am in the shallow end of the autism spectrum pool, where I manage to pass much of the time, but I exercise tight control over my physical and digital workspaces in order to be productive.

Physical Environment:

I work at home, in a dim, quiet room, sitting at a desk facing a quiet, leafy side street. Dragonflies and hummingbirds are more common than passing vehicles. Two small desk lamps provide pools of light. There are a couple of playlists that contain low key, lyrics free music.

Earlier this year I switched from laptop to desktop, acquiring a 27″ 4K primary display. The smaller display to the right is a 24″ Samsung monitor that will rotate between landscape and portrait. My thinking was this would provide a space for reading. This was just a theory and it lasted all of two hours before I put the display back to landscape.

That experiment failed, but as I have grown more used to having such an enormous number of pixels in front of me, it has taken on an important role. I have a fairly small virtual machine running, its display set to 1920×1080, and the 24″ display is functionally a second machine for me, dedicated to things that are important but not urgent.

Prioritization:

I have long used the Eisenhower Matrix for sorting tasks.

Here are examples for each quadrant:

  • Important, Urgent: my billable hours, assisting others to get theirs.
  • Important, Not Urgent: system administration & software development.
  • Urgent, Unimportant: email, group chats that are off task.
  • Not Urgent & Unimportant: news of the day, Twitter drama, etc.

Virtual Environment:

The smaller side display covers many things that matter, but which are best left alone unless specific things need doing. These include:

  • Chromium: Tweetdeck dedicated to a specific social media campaign.
  • Chrome: Tweetdeck dedicated to another campaign.
  • Firefox: Netdata observation of servers prone to overloading.
  • Bash shell: Four of them showing various performance metrics.

If I alt+tab into this system, it traps the keyboard. I can quickly cycle through the three major areas (two campaigns, system monitoring). I have to click to get free, and then I don't look at it again, sometimes not till the next day. A grayscale screen lock image appears after five minutes of inactivity, reminding me it's on, but offering no inducement to interact.

Email is encapsulated in a similar fashion. There are a couple VMs that do nothing but provide compartments for various accounts. I check them, do what is needed, and then close them. Updates come throughout the day, offering random payouts, typically of low value. I dispatch them all at once, usually AM & PM, and otherwise ignore.

App selection on the primary system is key. The Wire application works on desktops and mobiles, it supports up to three different accounts on the desktop, and it provides muting for busy group chats. I think I have one of every other chat system known to man, but I never use them unless I am summoned for some specific reason.

Work Focus:

I have a virtual machine I’ve named Hunchly, after the browser activity recording tool by the same name, Hunch.ly. This hosts a couple of related tools and a broad, long term, low intensity social media presence, which lets me peer into various systems without getting enmeshed with personal contacts.

Software development and the related large scale social media analytics tasks are still on the host OS, but that will be the next thing to change. I have slowly begun to use PyCharm for development work due to a fairly new collaboration with another developer who favors it. I just learned there is OpenCYPHER support, which is going to facilitate our transition to using Neo4j for some social network analysis.

Mobility:

The smartphone is the equivalent of the dorm room you were compelled to inhabit your first year at college. There isn’t a lot of room and it’s a near constant ruckus. Your laptop or desktop is the cramped studio space overlooking the campus town bar strip that you moved into as a junior. A bit more room, but still endless distractions.

A machine that will support VirtualBox (free) in the style you need requires sixteen gig of ram and a solid state disk. This $485 Dell Precision M4600 is twin to the machine I use when I’m mobile. I can just copy virtual machine directories from desktop to laptop. As long as they’re two or four cores and four or eight gigabytes, this works well.

Conclusions:

Application developers, content providers, and social network operators have no incentive to do any better. If you want to reclaim your time, sticking to the housing metaphor above, it's like buying a sturdy farmhouse on a country road after it's been empty for a while. You will have a bit more work to do on an ongoing basis, but the space, the quiet, and a roomy machine shed behind the two car garage? There is a reason we have an "escape to the country" societal meme. Apply that thinking to your online presence and you may be pleasantly surprised by the outcome.

An Analyst's Workstation

Six months ago we published An Analyst's Environment, which describes some tools we use that are a bit beyond the typical lone gun grassroots analyst. Since then our VPS based Elasticsearch cluster has given way to some Xeon equipment in racks, which led to Xeon equipment under desks.

Looking back over the past two months, we see a quickly maturing "build sheet" for analyst workstations. This is in no small part due to our discovery of Budgie, an Ubuntu Linux offshoot. Some of our best qualitative analysts are on Macs and they are extremely defensive of their work environment. Budgie permits at least some of that activity to move to Linux, and it's thought that this will become increasingly common.

Do not assume that “I already use Ubuntu” is sufficient to evaluate Budgie. They are spending a lot of time taking off the rough edges. At the very least, put it in a VM and give it a look.

Once installed, we’re including the following packages by default:

  • Secure communications are best handled with Wire.
  • The Hunch.ly web capture package requires Google Chrome.
  • Chromium provides a separate unrecorded browser.
  • Maltego CE link analysis package is useful even if constrained.
  • Evernote is popular with some of our people, Tusk works on Linux.
  • XMind Zen provides mind mapping that works on all platforms.
  • Timeline has been a long term player and keeps adding features.
  • Gephi data visualization works, no matter what sized screen is used.

Both Talkwalker Alerts and Inoreader feeds are RSS based. People seem to be happy with the web interface, but what happens when you're in a place without network access? There are a number of RSS related applications in Budgie's slick software store. Someone is going to have to go through them and see which best fits that particular use case.

Budgie’s many packages for handling RSS feeds.

There have been so many iterations of this set of recommendations, most conditioned by the desire to support Windows as well as Mac and Linux. The proliferation of older Xeon equipment, in particular the second generation HP Z420/Z620/Z820, which start in usable condition at around $150, means we no longer have that constraint.

Sampling of inexpensive HP Z420s on Ebay in May of 2019.

Starting with that base, 64 gig of additional memory is about $150, and another $200 will cover a 500 gig Crucial solid state disk and the fanless entry level Nvidia GT1030.

The specific combination of the Z420 and the Xeon E5-2650L v2 has a benchmark that matches the current MacBook Pro, it will be literally an order of magnitude faster on Gephi, the most demanding of those applications, and it will happily work for hours on end without making a sound. The Mac, on the other hand, will be making about as much noise as a Shopvac after just five minutes.

That chip and some Thermal Grizzly Kryonaut should not cost you more than $60 and will take a base Z420 from four cores to ten. So there you have it – mostly free software, a workstation you can build incrementally, and then you have the foundation required to sort out complex problems.

Installing Search Guard

One of the barriers to making the Netwar System more broadly available has been the step of exposing a system to the public internet. We used an Apache reverse proxy and passwords at first, but that was clumsy and limited. We also wanted to make select Visualizations available via iframe embedding, and that is a forbidding transit without a proper map.

Search Guard provides a demo install and it’s great for what it is, providing a working base configuration and most importantly furnishing an operational Certificate Authority. If you took a look around their site, you know they make their money with enterprise licensing, often with a compliance angle. The demo’s top level config file gives equal time to Active Directory, Kerberos, LDAP, and JSON Web Tokens. An enterprise solution architect is going to give a pleased nod to this – there’s something for everyone in there.

But our position is a bit different. Certificate Authority creation is a niche that anyone who has handled an ISP or hosting operation understands, but there seems to be an implicit assumption in the Search Guard documents – that the reader is already familiar with Elasticsearch in an enterprise context. None of us had ever used it in an enterprise setting, so it’s been a steep learning curve.

This post can be seen as a follow on to Installing Netwar System, which covers how to commission a system without securing it for a team. We currently use Elasticsearch 6.5.4 in production, this article is going to cover using 6.7.1, which we need to start exploring.

The instructions for Installing Netwar System cover not just Elasticsearch; they also address the tuning that has to be done in sysctl.conf and limits.conf, which you need if you plan on putting any sort of load on the system.

The first step is installing the Search Guard demo, starting from /usr/share/elasticsearch:

bin/elasticsearch-plugin install -b com.floragunn:search-guard-6:6.7.1-24.3

This will complete the install, but with the following warnings:

 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: plugin requires additional permissions @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
java.io.FilePermission /proc/sys/net/core/somaxconn read
java.lang.RuntimePermission accessClassInPackage.com.sun.jndi.ldap
java.lang.RuntimePermission accessClassInPackage.sun.misc
java.lang.RuntimePermission accessClassInPackage.sun.nio.ch
java.lang.RuntimePermission accessClassInPackage.sun.security.x509
java.lang.RuntimePermission accessDeclaredMembers
java.lang.RuntimePermission accessUserInformation
java.lang.RuntimePermission createClassLoader
java.lang.RuntimePermission getClassLoader
java.lang.RuntimePermission setContextClassLoader
java.lang.RuntimePermission shutdownHooks
java.lang.reflect.ReflectPermission suppressAccessChecks
java.net.NetPermission getNetworkInformation
java.net.NetPermission getProxySelector
java.net.SocketPermission * connect,accept,resolve
java.security.SecurityPermission getProperty.ssl.KeyManagerFactory.algorithm
java.security.SecurityPermission insertProvider.BC
java.security.SecurityPermission org.apache.xml.security.register
java.security.SecurityPermission putProviderProperty.BC
java.security.SecurityPermission setProperty.ocsp.enable
java.util.PropertyPermission * read,write
java.util.PropertyPermission org.apache.xml.security.ignoreLineBreaks write
javax.security.auth.AuthPermission doAs
javax.security.auth.AuthPermission modifyPrivateCredentials
javax.security.auth.kerberos.ServicePermission * accept
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.
-> Installed search-guard-6

This is fine, no action required. The next step is to execute the actual demo installation script.

cd /usr/share/elasticsearch/plugins/search-guard-6/tools/
chmod 755 install_demo_configuration.sh
./install_demo_configuration.sh

This will install the demo’s files in /etc/elasticsearch and it will provide a tip on how to update its configuration, using the following command. I grew weary of hunting for this, so I turned it into a shell script called /usr/local/bin/updatesg.

/usr/share/elasticsearch/plugins/search-guard-6/tools/sgadmin.sh -cd "/usr/share/elasticsearch/plugins/search-guard-6/sgconfig" -icl -key "/etc/elasticsearch/kirk-key.pem" -cert "/etc/elasticsearch/kirk.pem" -cacert "/etc/elasticsearch/root-ca.pem" -nhnv
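
A minimal sketch of that wrapper, which simply preserves the command above; remember to chmod 755 /usr/local/bin/updatesg after creating it.

#!/bin/sh
# /usr/local/bin/updatesg - reapply the files in sgconfig/ to the running cluster
/usr/share/elasticsearch/plugins/search-guard-6/tools/sgadmin.sh \
  -cd "/usr/share/elasticsearch/plugins/search-guard-6/sgconfig" -icl \
  -key "/etc/elasticsearch/kirk-key.pem" \
  -cert "/etc/elasticsearch/kirk.pem" \
  -cacert "/etc/elasticsearch/root-ca.pem" -nhnv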

What this is doing, briefly, is picking up the files in the sgconfig directory and applying them to the running instance, using a client certificate called ‘kirk’, which was signed by the local root Certificate Authority.

There are five files that are part of the configuration and a sixth that was merged with your /etc/elasticsearch/elasticsearch.yml file.

  • sg_config.yml
  • sg_internal_users.yml
  • sg_action_groups.yml
  • sg_roles.yml
  • sg_roles_mapping.yml
  • elasticsearch.yml.example

The sg_config.yml file defines where the system learns how to authenticate users. I have created a minimal file that only works for the management cert and locally defined users.

The sg_internal_users.yml file is the equivalent of the /etc/passwd and /etc/group files of Unix. You add a username, you hash their password, and they need at least one item in their roles: subsection. The roles: are how authorization for various activities are conferred on the user.
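As an illustrative sketch, an entry in sg_internal_users.yml under Search Guard 6 looks roughly like the following; the user name and role are invented, and the hash comes from the hash.sh tool that ships in the plugin's tools directory.

analyst:
  hash: $2y$12$...................    # paste the bcrypt hash emitted by tools/hash.sh -p 'password'
  roles:
    - readonly                        # a backend role name you will reference in sg_roles_mapping.yml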

Defining roles is more complex than it needs to be for a standalone system, a result of the broad support for different enterprise systems. There are three files involved in this. The first is the action groups file.

An action group contains one or more permissions, and optionally it can reference one or more other action groups.
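
As an illustrative sketch only (the group name and permissions are made up, and the exact YAML layout differs between Search Guard releases, so model yours on the sg_action_groups.yml the demo ships):

NETWAR_READ:
  - "indices:data/read*"
  - SEARCH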

The next thing to look at is the sg_roles.yml file. This where those action groups get assigned to a particular Search Guard role. You’ll see the action groups referenced in ALLCAPS, and there are additional very specific authorizations that some types of users need. These are applied to the cluster: and to groups of indices:. They also apply to tenants:, which is an enterprise feature that won’t appear in the demo install. A tenant is somewhat analogous to a Unix group, it’s a container of indices and related things (like Visualizations).

The first role is the internal stuff for the Kibana server itself. The second is a modification I made in order to permit read only access to one specific index (usertest) and a cluster of indices with similar names (usersby*). The blep tenant doesn't do anything yet; we're just starting to explore that feature.

Finally, all of this stuff has to be turned into effects within Elasticsearch itself. A user with a Search Guard role that has permissions, either directly, or via an action group bundle, has to be connected to an Elasticsearch role. The sg_roles_mapping.yml file does this.
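
Tying the last two paragraphs together, here is a sketch of what a read-only role and its mapping might look like under Search Guard 6; the names are illustrative, and the nesting should be checked against the files the demo installs.

# sg_roles.yml - read-only access to usertest and the usersby* indices
sg_netwar_readonly:
  cluster:
    - CLUSTER_COMPOSITE_OPS_RO
  indices:
    'usertest':
      '*':
        - READ
    'usersby*':
      '*':
        - READ

# sg_roles_mapping.yml - hand that role to an internal user and a backend role
sg_netwar_readonly:
  users:
    - analyst
  backendroles:
    - readonly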

One of the biggest challenges in this has been the lack of visual aids. Search Guard has no diagrams to go with their documentation and Elasticsearch is nearly as bare. I happened to find this one while working on this article. It would have been an immense help to have seen it six months ago.


You will also need to install Kibana, using these instructions. The following are the essential lines from /etc/kibana/kibana.yml.

xpack.security.enabled: false
xpack.spaces.enabled: false
elasticsearch.username: "kibanaserver"
elasticsearch.password: "kibanaserver"
elasticsearch.requestHeadersWhitelist: [ "Authorization", "sgtenant" ]
elasticsearch.ssl.verificationMode: none
elasticsearch.url: "https://localhost:9200"

Possible Pitfalls

Our first production Elasticsearch cluster was 5.6.12, then we upgraded to 6.5.1, and finally settled on 6.5.4. We’re on our third set of hardware for the cluster, and along the way there have been a number of problems. The following are things to always check with a Search Guard install:

  • Which user owns /usr/share/elasticsearch?
  • Which user owns /etc/elasticsearch?
  • Which user owns /usr/share/kibana?
  • Which user owns /etc/kibana?
  • Where is the Elasticsearch data stored?
  • Which user owns the data directory?
  • Are the perms in /etc directories all set to 600?

Just in case it wasn’t abundantly clear, DO NOT TRY THIS WITH PRODUCTION DATA. You will certainly hit some sort of snag your first time out and it’s quite possible to leave yourself with a system you can not access. Make a virtual machine, use zfs for the spaces Elasticsearch uses, and take a snapshot at each milestone, so you aren’t rebuilding from scratch. Not doing these things was the price we paid for any fluency we might display with the platform now.

Having come this far, it would appear the next natural step would be doing the work required to build a self-signed Certificate Authority suitable for use with a small cluster of machines. That will have to wait until next week, in the meantime this post provides plenty of guidance for your experiments.

Analyzing Twitter Streams

Our prior work on Twitter content has involved bulk collection of the following types of data:

  • Tweets, including raw text suitable for stylometry.
  • Activity time for the sake of temporal signatures.
  • Mentions including temporal data for conversation maps.
  • User ID data for profile searches.
  • Follower/following relationships, often using Maltego.

Early on this involved simply running multiple accounts in parallel, each working on their own set of tasks. Seemingly quick results were a matter of knowing what to collect and letting things happen. Hardware upgrades around the start of 2019 permitted us to run sixteen accounts in parallel … then thirty two … and finally sixty four, which exceeded the bounds of 100mbit internet service.

We had never done much with the Twitter streaming API until just two weeks ago, but our expanded ability to handle large volumes of raw data has made this a very interesting proposition. There are now ten accounts engaged in collecting either a mix of terms or following lists of hundreds of high value accounts.

Indexing Many Streams

What we get from streams at this time includes:

  • Tweet content.
  • RT’d tweet content.
  • Quoted tweet content.
  • Twitter user data for the source.
  • Twitter user data for accounts mentioned.
  • Twitter user data for accounts that are RT’d.
  • User to mentioned account event including timestamp.
  • User to RT’d account event including timestamp.

This data is currently accumulating in a mix of Elasticsearch indices. We recognize that we have at least three document types:

  • Tweets.
  • User data.
  • Interaction data.

Our current setup is definitely beta at this point. We probably need more attention on the natural language processing aspect of the tweets themselves, particularly as we expand into handling multiple European languages. User data could stand having hashtags extracted from profiles, which we missed the first time around; otherwise this seems pretty simple.

The interaction data is where things become uncertain. It is good to have this content in Elasticsearch for the sake of filtering. It is unclear precisely how much we should permit to accumulate in these derivative documents; at this point they’re just the minimal data from each tweet that permits establishing the link between accounts involved. Do we also do this for hashtags?

Once we have this, the next question is what do we do with it? The search, sorting, and time slicing of Elasticsearch is nice, but this is really network data, and we want to visualize it.

Maltego is out of the running before we even start; 10k nodes maximum has been a barrier for a long time. Gephi is unusable on a 4k Linux display due to font sizing for readability, and it will do just enough on a half million node network to leave one hanging with an analysis half finished on a smaller display.

The right answer(s) seem to be to get moving on Graphistry and Neo4j. An EVGA GTX 1060 turned up here a few weeks ago, displacing a GT 1030 that went to an associate. Given the uptime requirements for Elasticsearch, not much has happened towards Graphistry use other than the physical install. It looks like Docker is a requirement, and that's a synonym for "invasive nuisance".

Neo4j has some visualization abilities but its real attraction is the native handling of storage and queries for graphs. Our associates who engage in analysis ask questions that are easily answered with Elasticsearch … and other questions that are utterly impossible to resolve with any tool we currently wield.

Conclusion

Expanding capacity has permitted us to answer some questions … but on balance it's uncovered more mysteries than it has resolved. This next month is going to involve getting some standards in place for assessing incoming streams, and pressing on both means of handling graph data, to see which one we can bring to bear first.

Twitter Bots Concealed By API

Last month we announced the Netwar System Community Edition, the OVA for which is still not posted publicly. In our defense, what should have been a couple days with our core system has turned into a multifaceted month long bug hunt. A good portion could be credited to “unfamiliar with Search Guard”, but there is a hard kernel of “WTF, Twitter, WTF?!?” that we want to describe for other analysts.

Core System Configuration

First, some words about what we've done with the core system we use day to day. After much experimentation we settled on the following configuration for our Elasticsearch dependent environment.

  • HP Z620 workstations with dual eight core Xeons.
  • 128 gig of ram.
  • Dual Seagate IronWolf two terabyte drives in a mirror.
  • Single Samsung SSD for system and ZFS cache.
  • Trio of VirtualBox VMs with 500 gig of storage each.
  • 32 gig for host, ZFS ARC (cache) limited to 24 gig.
  • 24 gig per VM, JVM limited to 12 to 16 gig.

There are many balancing acts in this, too subtle and too niche to dig into here. It should be noted that FreeBSD Mastery:ZFS is a fine little book, even if you’re using Linux. The IronWolf drives are helium filled gear meant for NAS duty. In retrospect, paying the 50% premium for IronWolf Pro gear would have been a good move and we’ll go that way as we outgrow these.

We’ve started with a pair of machines, we’re defaulting to three shards per index, and a single replica for each. The Elasticsearch datacenter zones feature proved useful; pulling the network cable on one machine triggers some internal recovery processes, but there is no downtime from the user’s perspective. We’re due for a third system with similar specifications, it will receive the same configuration including a zone of its own, and we’ll move from one replica per index to two. This will be a painless shift to N+1 redundancy.
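
The "datacenter zones" feature is Elasticsearch shard allocation awareness. A sketch of how it is wired up in elasticsearch.yml, with illustrative attribute values: each VM gets tagged with the physical box it lives on, and the allocator then keeps a primary and its replica out of the same zone.

node.attr.zone: z620-one                               # z620-two on the VMs hosted by the other workstation
cluster.routing.allocation.awareness.attributes: zone  # set on every node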

API Mysteries At Scale

Our first large scale project has been profiling the followers of 577 MPs in the U.K. Parliament. There are 20.6M follow relationships with 6.6M unique accounts. Extracting their profiles would require forty hours with our current configuration … but there are issues.

Users haven’t seen a Twitter #FailWhale in years, but as continuous users of the API we expect to see periods of misbehavior on about a monthly basis. February featured some grim adjustments, signs that Twitter is further clamping down on bots, which nipped our read only analytical activities. There are some features that seem to be permanently throttled now based on IP address.

When we arrived at what we thought was the end of the road, we had 6.26M profiles in Elasticsearch rather than the 6.6M we knew to exist, a discrepancy of about 350,000. We tested all 6.6M numeric IDs against the index and found just 325,000 negative responses. We treated that set as a new batch and the system captured 255,000, leaving only 70,000 missing. Repeating the process again with the 70,000 we arrived at a place where the problem was amenable to running a small batch in serial fashion.

Watching a batch of a thousand of these stragglers, roughly a quarter got an actual response, a quarter came back as suspended, and the remainder came back as page not found. The last response is expected when an account has renamed or self suspended, but we were using numeric ID rather than screen name.

And the API response to this set was NOT deterministic. Run the process again with the same data and the percentages were similar, but different accounts were affected.

A manual inspection of the accounts returned permits the formulation of a theory as to why this happens. We know the distribution of the creation dates of these accounts:

MP Followers Account Creation Dates

The bulk of the problematic accounts are dated between May and August of 2018. Recall that Twitter completed its acquisition of Smyte and shut down 70 million bots during that time frame. May in the histogram is the first month where account creation dates are level. A smaller set clustered around the same day in mid-December of 2012, another fertile time period for bot creation.

The affected accounts have many of the characteristics we associate with bots:

  • Steeply inverted following to followers ratio.
  • Complete lack of relationships to others.
  • Relatively few tweets.
  • Default username with eight trailing digits.

An account that was created and quickly abandoned will share these attributes. So our theory regarding the seeming problem with the API is as follows:

These accounts that can not be accessed in a deterministic fashion using the API are in some sort of Smyte induced purgatory. They are not accessible, protected, empty of content, suspended, or renamed, which are five conditions our code already recognizes. There is a new condition, likely “needs to validate phone number”, and accounts that have not done this are only likely of interest to their botnet operators, or researchers delving very deeply into the system’s behavior.

But What Does This MEAN?

Twitter has taken aggressive steps to limit the creation of bots. Accounts following MPs seem to have fairly evenly distributed creation dates, less the massive hump from early 2016 to mid 2018. We know botnet operators are liquidating collections of accounts that have been wiped of prior activity for as little as $0.25 each. There are reportedly offerings of batches of accounts considered to be ‘premium’, but what we know of this practice is anecdotal.

Our own experience is limited to maintaining a couple platoons of collection oriented accounts, and Twitter has erected new barriers, requiring longer lasting phone numbers, and sometimes voice calls rather than SMS.

This coming month we are going to delve into the social bot market, purchasing a small batch, which we will host on a remote VPS and attempt to use for collection work.

The bigger implication is this … Twitter’s implementation of Smyte is good, but it’s created a “hole in the ocean problem”, a reference to modern submarines with acoustic signatures that are less than the noise floor in their environment. If the affected accounts are all bots, and they’re just standing deadwood of no use to anyone, that’s good. But if they can be rehabilitated or repurposed, they are still an issue.

Seems like we have more digging to do here …

Mystery Partially Resolved …

So there was an issue with the API calls, but the issue was on our side.

When a Twitter account gets suspended, its API tokens will still permit you to check its credentials. So a script like this reports all is well:

But if three of the sixty-four accounts used in doing numeric ID to profile lookups have been suspended … 3/64 = 4.69% failure rate. That agrees pretty well with some of the trouble we observed. We have not had cause to process another large batch of numeric IDs yet, but when we do, we'll check this theory against the results.

Netwar System Community Edition

This site has been quiet the last five weeks, but many good things happened in the background. One of those good things has been progress on a small Netwar System demonstrator virtual machine, tentatively named the Community Edition.

What can you do with Netwar System CE? It supports using one or two Twitter accounts to record content on an ongoing basis, making the captured information available via the Kibana graphical front end to Elasticsearch. Once the accounts are authorized the system checks them every two minutes for any list that begins with “nsce-“, and accounts on those lists are recorded.

Each account used for recording produces a tw<name> index containing tweets and a tu<name> index containing the profiles of the accounts.

Netwar System CE Indices

The tw* and tu* are index patterns that cover the respective content from all three accounts. The root account is the system manager and we assume users might place a set of API tokens on that account for command line testing.

This is a view from Kibana's Discover tab. The timeframe can be controlled via the time picker at the upper right, the Search box permits filtering, the activity per date histogram appears at the top, and in this case we can see a handful of Brexit related tweets.

Netwar System CE Tag Cloud

There are a variety of visualization tools within Kibana. Here we see a cloud of hashtags used by the collected accounts. The time picker can be adjusted to a certain time frame, search terms may be added so that the cloud reflects only hashtags used in conjunction with the search term, and there are many further refinements that can be made.

What does it take to run Netwar System CE? The following is a minimal configuration of a desktop or laptop that could host it:

  • 8 gig of ram
  • solid state disk
  • four core processor

There are entry level Dell laptops on Amazon with these specifications in the $500 range.

The VM itself is very light weight – two cores, four gig of ram, and the OVA file for the VM is just over four gig to download.

As shipped, the system has the following limits:

  • Tracking via two accounts
  • Disk space for about a million tweets
  • Collects thirty Twitter accounts per hour per account

If you are comfortable with the Linux command line it is fairly straightforward to add additional accounts. If you have some minimal Linux administration capabilities you could add a virtual disk, relocate the Elasticsearch data, and have room for more tweets.

If you are seeking to do a larger project, you should not just multiply these numbers to determine overall capacity. An eight gig VM running our adaptive code can cover about three hundred accounts per hour and a sixty four gig server can exceed four thousand.

If you are willing to give a system like this a try, contact Neal Rauhauser on LinkedIn or DM the @NetwarSystem account.