This tweet came into the collection thanks to a study of #Qanon we did earlier this year. The actual inception of our current cluster hardware appears to have been on January 29th of 2019. The very earliest it could have been created was December 19th of 2018 – the release date for Elasticsearch 6.5.4.
The system is resilient to the loss of any one server, a design that got an unintended test last night when one of the servers in the cluster was inadvertently shut down. Recovery takes a couple of minutes given the services and virtual machines involved, but there was not even an interruption in processing.
Today we began the process of upgrading to Elasticsearch 6.8.1, released June 20th, 2019. There are a number of reasons for doing this:
Index Life Cycle Management (6.6)
Cross Cluster Replication (6.6)
Elasticsearch 7 Upgrade Assistant (6.6)
Rolling Upgrade To Elasticsearch 7 (6.7)
Better Index Type Labeling (6.7)
Security Features Bundled for Community Edition (6.8)
Conversion From Ubuntu to Debian Linux
We are not jumping directly to Elasticsearch 7.x due to some fairly esoteric issues involving field formats and concerns regarding some of the Python libraries that we use. Ubuntu has been fine for both desktop and server use, but we recently began using the very fussy Open Semantic Search, and it behaves well with Debian. Best of all, the OVA of a working virtual machine with the Netwar System code installed and running is just 1.9 gig.
Alongside the production ready Elasticsearch based system we are including Neo4j with some example data and working code. The example data is a small network taken from Canadian Parliament members and the code produces flat files suitable for import as well as native GML file output for Gephi. We ought to be storing relationships to Neo4j as we see them in streams, but this is still new enough that we are not confident shipping it.
Some questions that have cropped up and our best answers as of today:
Is Open Semantic Search going to be part of Netwar System?
We are certainly going to be doing a lot of work with OSS and this seems like a likely outcome, given that it has both Elasticsearch and Neo4j connectors. The driver here is the desire to maintain visibility into Mastodon instances as communities shift off Twitter – we can use OSS to capture RSS feeds.
Will Netwar System still support Search Guard?
Yes, because their academic licensing permits things that the community edition of Elasticsearch does not. We are not going to integrate Search Guard into the OVA, however, for a couple of reasons:
Doesn’t make sense on a single virtual machine.
Duplicate configs would mean a bad actor has certificate based access to the system.
Eager, unprepared system operators could expose much more than just their collection system if they try to use it online.
Netdata monitoring provides new users insight into Elasticsearch behavior, and we have not managed to make that work with SSL secured systems.
We are seeking a sensible free/paid break point for this system. It’s not clear where a community system would end and an enterprise system would begin.
Is there a proper FOSS license?
Not yet, but we are going to follow custom in this area. A university professor should expect to be able to run a secure system for a team oriented class project without incurring any expense. Commercial users who want phone support will incur an annual cost. There will be value add components that will only be available to paying customers. Right now 100% of revenue comes from software as a service and we expect that to continue to be the norm.
So the BSD license seems likely.
When will the OVA be available?
It’s online this morning for internal users. If it doesn’t explode during testing today, a version with our credentials removed should be available Tuesday or Wednesday. Most of the work required to support http/https transparently was finished during the first quarter. Once it’s up we’ll post a link to it here and there will be announcements on Twitter and LinkedIn.
Online conflicts are evolving rapidly and escalating. Three years ago we judged that it was best to not release a conflict oriented tool, even one that is used purely for observation. Given the events since then, this notion of not proliferating seems … quaint.
So we released the Netwar System code, the companion ELK utilities, and this week we are going to revisit the Twitter Utils, a set of small scripts that are part of our first generation software, and which are still used for some day to day tasks.
When you live with a programming language and a couple of fairly complex distributed systems, there are troubles that arise which can be dispatched almost without thought. A new person attempting to use such a system might founder on one of these, so this post is going to memorialize what is required for a from scratch install on a fresh Ubuntu 18.04 installation.
We converted to Python 3 a while ago. The default install includes Python 3.6.7, but you also need pip and git.
You can manually install those and they’ll all work, except for squish2, the name for our internal package that contains the code to “squish” bulky, low value fields out of tweets and user profiles. This requires special handling, like so:
cd NetwarSystem/squish2
pip install -e .
If you have any errors related to urllib3, SSL, or XML, those might be subtle dependency problems. Post them as issues on Github.
There are a bunch of Elasticsearch related scripts in the ELKSG repository. You should clone them and then copy them into your path.
The ELK software can handle a simple install, or one with Search Guard. This is the simple setup, so add this final line to your ~/.profile so the scripts know where to find Elasticsearch.
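Assuming Elasticsearch is on the same machine on the default port, and using the elksg variable name that the scripts reference later in this post, the line would look something like this:

```shell
# Assumes Elasticsearch is local on port 9200; the elksg variable name
# is the one the ELKSG scripts read. Adjust host/port to your install.
export elksg="http://localhost:9200"
```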
You need the following four pieces of software to get the system running in standalone mode.
Redis and Netdata are simple.
apt update
apt install redis
There is an install procedure for Netdata that is really slick. Copy one command, paste it in a shell, it does the install, and makes the service active on port 19999.
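For reference, the kickstart one-liner looked like this when we ran it; check Netdata’s site for the current command, and read the script before piping anything into a shell on a machine you care about.

```shell
# Netdata one-line installer as documented in mid-2019; installs the
# service and starts it listening on port 19999.
bash <(curl -Ss https://my-netdata.io/kickstart.sh)
```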
Elasticsearch and Neo4j require a bit more work to get the correct version. The stricken lines used to install the Oracle JDK 8, but they changed licensing in late April of 2019. The OpenJDK install seems to make its dependents happy, but we are trialing this with a new system. Our production stuff still has the last working official Oracle setup.
wget -O - https://debian.neo4j.org/neotechnology.gpg.key | apt-key add -
echo 'deb https://debian.neo4j.org/repo stable/' | tee -a /etc/apt/sources.list.d/neo4j.list
apt update
apt install neo4j=1:3.5.4
Note that the version mentioned there is just what happens to be in the Neo4j install instructions on the day this article was written. This is not sensitive the way Elasticsearch is.
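Elasticsearch goes on the same way, with the apt repository from Elastic’s 6.x install documentation and the version pinned. Pinning Kibana to the same version is our assumption that you want the matching dashboard; substitute whichever 6.x release you are standardizing on.

```shell
# Elastic 6.x apt repository, per their install docs; versions are
# pinned because the collection code is sensitive to them.
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add -
echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | tee /etc/apt/sources.list.d/elastic-6.x.list
apt update
apt install elasticsearch=6.8.1 kibana=6.8.1
```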
At this point you should have all four applications running. The one potential problem is Kibana, which may fail to start because it depends on Elasticsearch, which takes a couple minutes to come alive the first time it is run. Try these commands to verify:
systemctl status redis
systemctl status elasticsearch
systemctl status kibana
systemctl status neo4j
In terms of open TCP ports, try the following, which checks the access ports for Kibana, Redis, Neo4j, and Elasticsearch.
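A sketch of such a check using ss (netstat works too); the ports are Kibana 5601, Redis 6379, Neo4j 7474, and Elasticsearch 9200:

```shell
# List listening TCP sockets and keep only the four ports we care about.
ss -ltn | grep -E ':(5601|6379|7474|9200)\b'
```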
We also need to adjust the file handle and process limits upward for Elasticsearch’s Lucene component and Neo4j’s worker threads. Add these lines to /etc/security/limits.conf, and note that there are tab stops in the actual file, so it looks terrible on the blog. The simplest way to make these settings active is to reboot.
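The additions are along these lines; the exact values are our assumption based on the Elasticsearch and Neo4j documentation, so tune them for your hardware:

```shell
# /etc/security/limits.conf additions (whitespace is tabs in the real file)
elasticsearch	-	nofile	65536
elasticsearch	-	nproc	4096
neo4j	-	nofile	40000
```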
If you’re running this software on your desktop, pointing a web browser at port 5601 will show Kibana and 7474 will show Neo4j. If you’re using a standalone or virtual machine, you’ll need to open some access. Here are three one liners with sed that will do that.
sed -i 's/#network.host: 192.168.0.1/network.host: 0.0.0.0/' /etc/elasticsearch/elasticsearch.yml
sed -i 's/#server.host: "localhost"/server.host: 0.0.0.0/' /etc/kibana/kibana.yml
sed -i 's/#dbms.connectors.default_listen/dbms.connectors.default_listen/' /etc/neo4j/neo4j.conf
systemctl restart elasticsearch
systemctl restart kibana
systemctl restart neo4j
Elasticsearch doesn’t require a password in this configuration, but Neo4j does, and it’ll make you change it from the default of ‘neo4j’ the first time you log in to the system.
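If you would rather handle that from the shell instead of the browser, Neo4j 3.x ships a helper for setting the initial password; it has to run before the first login initializes the auth database.

```shell
# Set Neo4j's initial password from the command line (before first login).
neo4j-admin set-initial-password 'pick-a-real-password'
```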
OK, point your browser at port 19999, and you should see this:
Notice the elasticsearch local and Redis local tabs at the lower right. You can get really detailed information on what Elasticsearch is doing, which is helpful when you are just starting to explore its capabilities.
Configuring Your First Twitter Account
You must have a set of Twitter application keys to take the next step. You’ll need to add the Consumer Key and Consumer Secret to the tw-auth command. Run it, paste the URL it offers into a browser, log in with your Twitter account, enter the seven digit PIN from the browser into the script, and it will create a ~/.twitter file that looks something like this.
You’ll need to enter the Neo4j password you set earlier. The elksg variable has to point to the correct host and port. The elksguser/elksgpass entries are just placeholders. If you got this right, this command will cough up your login shell name and Twitter screen name.
Next, you can check that your Elasticsearch commands are working:
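Whatever wrapper you use, a raw health check is the quickest way to confirm the cluster itself is answering; this sketch uses the elksg variable set earlier, falling back to localhost.

```shell
# Ask Elasticsearch for cluster health; a "green" or "yellow" status
# means it is up and reachable.
curl -s "${elksg:-http://localhost:9200}/_cluster/health?pretty"
```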
Now is the time to get Elasticsearch ready to accept Twitter data. Mostly this involves making sure it recognizes timestamps. Issue these commands:
The first three ensure that timestamps work for the master user index, any tu* index related to a specific collection, and any tw* index containing tweets. The mylog command ensures the perflog index is searchable. The last command bumps the field limit on indices. Experienced Elasticsearch users will be scratching their heads on this one – we still have much to learn here, feel free to educate us on how to permanently handle that problem.
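For the curious, here is roughly what those commands amount to at the raw API level. This is a sketch, not the actual ELKSG scripts; the template name and the 5000 field limit are our assumptions for illustration.

```shell
# Index template so new tw*/tu* indices parse Twitter's created_at
# timestamps as dates (Elasticsearch 6.x mapping syntax).
curl -s -XPUT "localhost:9200/_template/twitter-dates" \
  -H 'Content-Type: application/json' -d '{
    "index_patterns": ["tw*", "tu*"],
    "mappings": { "_doc": { "properties": {
      "created_at": { "type": "date", "format": "EEE MMM dd HH:mm:ss Z yyyy" }
    } } }
  }'

# Bump the per-index field limit; note this is per index, which is why
# it has to be reapplied and why we are still hunting for a permanent fix.
curl -s -XPUT "localhost:9200/usertest/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index.mapping.total_fields.limit": 5000 }'
```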
If you want to see what these did, this command will show you a lot of JSON.
And now we’re dangerously close to actually getting some content in Elasticsearch. Try the following commands:
tw-friendquick NetwarSystem > test.txt
This should produce a file with around 180 numeric Twitter IDs for accounts followed by @NetwarSystem, load them into Redis for processing, and the last command will give you a count of how many are loaded. This is the big moment; try this command next:
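If the count looks off, you can ask Redis directly. The key name below is a guess on our part, so list the keys first to find the real one.

```shell
# Inspect what the loader wrote into Redis; 'keys' shows what exists,
# and llen counts entries under a (hypothetical) key name.
redis-cli keys '*'
redis-cli llen NetwarSystem
```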
That command should spew a bunch of JSON as it runs. The preceding time command will tell you how long it took, a useful thing when performance tuning long running processes.
Now try this one:
You should get back two very long lines of text – one for the usertest index, showing about 180 documents, and the other for perflog, which will just have a few.
There, you’ve done it! Now let’s examine the results.
Your next steps require the Kibana graphical interface. Point your browser at port 5601 on your system. You’ll be presented with the Kibana welcome page. You can follow their tutorial if you’d like. Once you’ve done that, or skipped it, you will do the following:
Go to the Management tab
Select Index Patterns
Create an Index Pattern for the usertest index
There should be a couple of choices for time fields – one for when the user account was created, and one for the date of their last tweet. Once you’ve done this, go to the Discover tab, which should default to your newly created Index Pattern. Play with the time picker at the upper right, find the Relative option, and set it to 13 years. You should see a creation date histogram something like this:
Writing this post involved grinding off every burr we found in the Github repositories, which was an all day job, but we’ve come to the point where you have cut & pasted all you can. The next steps will involve watching videos about how to use Kibana, laying hands on a copy of Elasticsearch: The Definitive Guide, and installing Graphileon so you can explore the Neo4j data.