Migrating To Elasticsearch 6.8.1

Consulting our combined collection this morning, we found the oldest to be this 2008 anniversary tweet from @vipe248.

This tweet came into the collection thanks to a study of #Qanon we did earlier this year. The actual inception of our current cluster hardware appears to have been on January 29th of 2019. The very earliest it could have been created was December 19th of 2018 – the release date for Elasticsearch 6.5.4.

The system is resilient to the loss of any one system, which was given an unintended test last night, with an inadvertent shutdown of one of the servers in the cluster. Recovery takes a couple of minutes given the services and virtual machines, but there was not even an interruption in processing.

Today, for a variety of reasons, we began the process of upgrading to the June 20th, 2019 release of Elasticsearch 6.8.1. There are a number of reasons for doing this:

  • Index Life Cycle Management (6.6)
  • Cross Cluster Replication (6.6)
  • Elasticsearch 7 Upgrade Assistant (6.6)
  • Rolling Upgrade To Elasticsearch 7 (6.7)
  • Better Index Type Labeling (6.7)
  • Security Features Bundled for Community Edition (6.8)
  • Conversion From Ubuntu to Debian Linux

We are not jumping directly to Elasticsearch 7.x due to some fairly esoteric issues involving field formats and concerns regarding some of the Python libraries that we use. Ubunt1.u has been fine for both desktop and server use, but we recently began using the very fussy Open Semantic Search, and it behaves well with Debian. Best of all, the OVA of a working virtual machine with the Netwar System code installed and running is just 1.9 gig.

Alongside the production ready Elasticsearch based system we are including Neo4j with some example data and working code. The example data is a small network taken from Canadian Parliament members and the code produces flat files suitable for import as well as native GML file output for Gephi. We ought to be storing relationships to Neo4j as we see them in streams, but this is still new enough that we are not confident shipping it.

Some questions that have cropped up and our best answers as of today:

Is Open Semantic Search going to be part of Netwar System?

We are certainly going to be doing a lot of work with OSS and this seems like a likely outcome, given that it has both Elasticsearch and Neo4j connectors. The driver here is the desire to maintain visibility into Mastodon instances as communities shift off Twitter – we can use OSS to capture RSS feeds.

Will Netwar System still support Search Guard?

Yes, because their academic licensing permits things that the community edition of Elasticsearch does not. We are not going to do Search Guard integration into the OVA, however. There are a couple reasons for that:

  • Doesn’t make sense on a single virtual machine.
  • Duplicate configs means a bad actor would have certificate based access to the system.
  • Eager, unprepared system operators could expose much more than just their collection system if they try to use it online.
  • Netdata monitoring provides new users insight into Elasticsearch behavior, and we have not managed to make that work with SSL secured systems.
  • We are seeking a sensible free/paid break point for this system. It’s not clear where a community system would end and an enterprise system would begin.
Is there a proper FOSS license?

Not yet, but we are going to follow customs in this area. A university professor should expect to be able to run a secure system for a team oriented class project without incurring any expense. Commercial users who want phone support will incur an annual cost. There will be value add components that will only be available to paying customers. Right now 100% of revenue is based on software as a service and we expect that to continue to be the norm.

So the BSD license seems likely.

When will the OVA be available?

It’s online this morning for internal users. If it doesn’t explode during testing today, a version with our credentials removed should be available Tuesday or Wednesday. Most of the work required to support http/https transparently was finished during first quarter. One it’s up we’ll post a link to it here and there will be announcements on Twitter and LinkedIn.

Netwar System Community Edition

This site has been quiet the last five weeks, but many good things happened in the background. One of those good things has been progress on a small Netwar System demonstrator virtual machine, tentatively named the Community Edition.

What can you do with Netwar System CE? It supports using one or two Twitter accounts to record content on an ongoing basis, making the captured information available via the Kibana graphical front end to Elasticsearch. Once the accounts are authorized the system checks them every two minutes for any list that begins with “nsce-“, and accounts on those lists are recorded.

Each account used for recording produces a tw<name> index containing tweets and a tu<name> index containing the profiles of the accounts.

Netwar System CE Indices

The tw* and tu* are index patterns that cover the respective content from all three accounts. The root account is the system manager and we assume users might place a set of API tokens on that account for command line testing.

This is a view from Kibana’s Discovery tab. The timeframe can be controller via the time picker at the upper right, the Search box permits filtering, the activity per date histogram appears at the top, and in the case we can see a handful of Brexit related tweets.

Netwar System CE Tag Cloud

There are a variety of visualization tools within Kibana. He we see a cloud of hashtags used by the collected accounts. The time picker can be adjusted to a certain time frame, search terms may be added so that the cloud reflects only hashtags used in conjunction with the search term, and there are many further refinements that can be made.

What does it take to run Netwar System CE? The following is a minimal configuration of a desktop or laptop that could host it:

  • 8 gig of ram
  • solid state disk
  • four core processor

There are entry level Dell laptops on Amazon with these specifications in the $500 range.

The VM itself is very light weight – two cores, four gig of ram, and the OVA file for the VM is just over four gig to download.

As shipped, the system has the following limits:

  • Tracking via two accounts
  • Disk space for about a million tweets
  • Collects thirty Twitter accounts per hour per account

If you are comfortable with the Linux command line it is fairly straightforward to add additional accounts. If you have some minimal Linux administration capabilities you could add a virtual disk, relocate the Elasticsearch data, and have room for more tweets.

If you are seeking to do a larger project, you should not just multiply these numbers to determine overall capacity. An eight gig VM running our adaptive code can cover about three hundred accounts per hour and a sixty four gig server can exceed four thousand.

If you are willing to give a system like this a try, contact Neal Rauhauser on LinkedIn or DM the @NetwarSystem account.