Coaxing Open Semantic Server Into Operating Condition

An associate earlier this week mentioned having trouble getting Open Semantic Desktop Search to behave. This system offers an intriguing collection of capabilities, including an interface for Elasticsearch. Many hours later, we are picking our way through a minefield. This project is about to shift from Debian 9 to 10, and things are in terrible disarray.

First, some words about frustration-free experimentation. If you store your virtual machines on a ZFS file system you can snapshot each time you complete an install step. If something goes wrong later, the snapshot/rollback procedure is essentially instantaneous. This is dramatically more useful than exporting VMs to OVA as a checkpoint. Keep in mind the file system will be dismounted during rollback; it’s best to have some VM specific space set aside.
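The snapshot-per-step habit is easy to script. What follows is a minimal sketch, assuming a hypothetical dataset named tank/vms that holds the VM directory; zfs snapshot and zfs rollback are the standard commands, while the dataset and step names are invented for illustration. Note that zfs will only roll back to the most recent snapshot unless -r is given, and -r destroys any later snapshots.

```python
import subprocess

# Hypothetical dataset holding the VirtualBox VM directory; adjust to your pool.
DATASET = "tank/vms"


def snapshot_cmd(step: str, dataset: str = DATASET) -> list[str]:
    """Build the command that checkpoints the dataset after an install step."""
    return ["zfs", "snapshot", f"{dataset}@{step}"]


def rollback_cmd(step: str, dataset: str = DATASET,
                 destroy_later: bool = False) -> list[str]:
    """Build the command that returns the dataset to an earlier checkpoint.

    Without -r, zfs refuses to roll back past newer snapshots;
    with -r, those newer snapshots are destroyed.
    """
    cmd = ["zfs", "rollback"]
    if destroy_later:
        cmd.append("-r")
    cmd.append(f"{dataset}@{step}")
    return cmd


def run(cmd: list[str]) -> None:
    """Execute a zfs command; this needs root and a powered-off VM."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print rather than execute, so the sketch is safe to try anywhere.
    print(" ".join(snapshot_cmd("01-debian-installed")))
    print(" ".join(rollback_cmd("01-debian-installed")))
```

Swap the print calls for run() once the dataset name matches your own system.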

The project wants Debian proper, so take the time to get Debian 9.9 installed. The desktop OVA wanted a single processor and five gig of ram. Four cores and eight gig seemed to be a sensible amount for a server. Do remember to add a host-only interface under VirtualBox so you can have direct ssh and web access.

There are some precursors that you will need to put in place before trying to install the monolithic package.

  • apt install celeryd
  • apt install python3-pip
  • apt install python3-celery
  • apt install python-flower

Celery is a task queue manager and Flower provides a graphical interface to it at port 5555. These are missing from the monolithic package. You will also need to add the following to your /etc/security/limits.conf

Now Reboot The System So The Limits Are In Effect

Now you’re ready to install the monolithic package. This is going to produce an error indicating there are packages missing. You correct this problem with this command:

apt install -f

This is going to take a long time to run, maybe ten or fifteen minutes. It will reach 99% pretty quickly – and that’s about the 50% mark in terms of time and tasks. Once this is done, shut the system down, and take a snapshot. Be patient when you reboot it; the services are complex and hefty, and they took a couple of minutes to all become available on our i7 test system. This is what the system looks like when fully operational.

  • 25672 – RabbitMQ message broker
  • 8080 – spaCy natural language processing
  • 4369 – RabbitMQ beam protocol
  • 22 – ssh, installed for remote access
  • 25 – SMTP for local email, part of Debian
  • 7687 – Neo4j BOLT (server to server) protocol
  • 5672 – RabbitMQ
  • 9998 – Apache Tika document handling service
  • 7983 – Apache Solr
  • 80 – Apache web server
  • 7473 – Neo4j SSL web console
  • 7474 – Neo4j web console
  • 8983 – Apache Solr
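Since the services take a while to come up, a quick check of the ports listed above saves some guessing. This is a minimal sketch using only the Python standard library; the port numbers and labels come from the list, everything else (function names, the one second timeout) is our own choice for illustration. Run it on the VM itself, as some services bind only to localhost.

```python
import socket
import sys

# Ports and labels taken from the list above.
EXPECTED_PORTS = {
    22: "ssh",
    25: "SMTP",
    80: "Apache web server",
    4369: "RabbitMQ beam protocol",
    5672: "RabbitMQ",
    7473: "Neo4j SSL web console",
    7474: "Neo4j web console",
    7687: "Neo4j BOLT",
    7983: "Apache Solr",
    8080: "spaCy",
    8983: "Apache Solr",
    9998: "Apache Tika",
    25672: "RabbitMQ message broker",
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def missing_services(host: str, ports: dict = EXPECTED_PORTS) -> list:
    """Return the sorted list of expected ports that are not answering."""
    return sorted(p for p in ports if not is_port_open(host, p))


if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    for port in missing_services(host):
        print(f"{port} ({EXPECTED_PORTS[port]}) is not answering yet")
```

No output means everything in the list accepted a connection.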

Once this is done, you must address GitHub issue #29, “flower doesn’t start automatically”. You’ll need this /etc/rc.local file, which their process installs early on, then later removes.

The Celery daemon config also needs attention. The config in /etc/default/celeryd must be edited so that it is ENABLED, and the chroot to /opt/Myproject will cause a failure to start due to a missing directory. It seems safe to just turn this off.

Neo4j will be bound to just localhost and will not have a password. Since we’re building a server, rather than a specialty desktop, let’s fix this, too. The file is /etc/neo4j/neo4j.conf; these steps will permit remote access.

  • dbms.security.auth_enabled=true
  • dbms.connectors.default_listen_address=0.0.0.0
  • systemctl restart neo4j
  • visit http://yoursolrIP:7474 and set password
  • visit Config area in OSS web interface, add Neo4j credentials
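The two config changes above can be applied by script rather than by hand. This is a minimal sketch that works on the text of a neo4j.conf-style file; the two setting names are the ones listed above, while the function name and the sample text are hypothetical. It replaces a key whether or not its line is commented out and appends any key it never finds.

```python
def apply_settings(conf_text: str, settings: dict) -> str:
    """Set key=value pairs in neo4j.conf-style text.

    The first line for a key, commented out or not, is replaced in place;
    later duplicates of that key are dropped; keys that never appear are
    appended at the end.
    """
    written = set()
    out = []
    for line in conf_text.splitlines():
        # "#key=value" and "key=value" both reduce to "key" here.
        key = line.lstrip("#").strip().split("=", 1)[0].strip()
        if key in settings:
            if key not in written:
                out.append(f"{key}={settings[key]}")
                written.add(key)
        else:
            out.append(line)
    out.extend(f"{k}={v}" for k, v in settings.items() if k not in written)
    return "\n".join(out) + "\n"


REMOTE_ACCESS = {
    "dbms.security.auth_enabled": "true",
    "dbms.connectors.default_listen_address": "0.0.0.0",
}

if __name__ == "__main__":
    # Hypothetical sample text; on a real system read /etc/neo4j/neo4j.conf,
    # keep a backup, write the result back, then restart neo4j.
    sample = "#dbms.security.auth_enabled=false\n# a comment\n"
    print(apply_settings(sample, REMOTE_ACCESS), end="")
```

Setting the password and adding the credentials to the OSS web interface remain manual steps.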

Having completed these tasks, reboot the system to ensure it starts cleanly. You should find the Open Semantic Search interface here:

http://<IP of VM>/search

This seems like a good stopping point, but we are by no means finished. You can manually add content from the command line with the opensemanticsearch commands:

  • opensemanticsearch-delete
  • opensemanticsearch-enrich
  • opensemanticsearch-filemonitoring
  • opensemanticsearch-index-dir
  • opensemanticsearch-index-file
  • opensemanticsearch-index-rss
  • opensemanticsearch-index-sitemap
  • opensemanticsearch-index-sparql
  • opensemanticsearch-index-web
  • opensemanticsearch-index-web-crawl

There are still many problems to be resolved. Periodic collection from data sources is not working, and web interface submissions are problematic as well. Attempts to parse RSS feeds generate numerous parse errors. Web pages do not import smoothly from our WordPress based site, nor from one hosted on the WordPress commercial site.

We will keep coming back to this area, hopefully quickly moving past the administration details, and getting into some actual OSINT collection tradecraft.

Attention Conservation: Be In Charge

There is a surplus of articles out there regarding the steps mobile device makers take to keep you focused on their system. Similar strategies are employed by social media players, on both mobile and desktop. Payouts are variable in size and appear irregularly. Notifications are red, likely a bit animated, and if nothing is happening, particularly bad applications will offer you some hint that something is coming.

These design principles are effective and ethical … for a slot machine manufacturer. If one is trying to get some actual work done, this is the worst environment imaginable. If one is not neurotypical, the hazards are dramatically worse. I am in the shallow end of the autism spectrum pool, where I manage to pass much of the time, but I exercise tight control over my physical and digital workspaces in order to be productive.

Physical Environment:

I work at home, in a dim, quiet room, sitting at a desk facing a quiet, leafy side street. Dragonflies and hummingbirds are more common than passing vehicles. Two small desk lamps provide pools of light. There are a couple of playlists that contain low key, lyrics free music.

Earlier this year I switched from laptop to desktop, acquiring a 27″ 4K primary display. The smaller display to the right is a 24″ Samsung monitor that will rotate between landscape and portrait. My thinking was this would provide a space for reading. This was just a theory and it lasted all of two hours before I put the display back to landscape.

That experiment failed, but as I have grown more used to having such an enormous number of pixels in front of me, it has taken on an important role. I have a fairly small virtual machine running, its display set to 1920×1080, and the 24″ display is functionally a second machine for me, dedicated to things that are important but not urgent.

Prioritization:

I have long used the Eisenhower Matrix for sorting tasks.

Here are examples for each quadrant:

  • Important, Urgent: my billable hours, assisting others to get theirs.
  • Important, Not Urgent: system administration & software development.
  • Urgent, Unimportant: email, group chats that are off task.
  • Not Urgent & Unimportant: news of the day, Twitter drama, etc.

Virtual Environment:

The smaller side display covers many things that matter, but which are best left alone unless specific things need doing. These include:

  • Chromium: Tweetdeck dedicated to a specific social media campaign.
  • Chrome: Tweetdeck dedicated to another campaign.
  • Firefox: Netdata observation of servers prone to overloading.
  • Bash shell: Four of them showing various performance metrics.

If I alt+tab into this system, it traps the keyboard. I can quickly cycle through the three major areas (two campaigns, system monitoring). I have to click to get free, and then I don’t look at it again, sometimes not till the next day. A grayscale screen lock image appears after five minutes of inactivity, reminding me it’s on, but offering no inducement to interact.

Email is encapsulated in a similar fashion. There are a couple VMs that do nothing but provide compartments for various accounts. I check them, do what is needed, and then close them. Updates come throughout the day, offering random payouts, typically of low value. I dispatch them all at once, usually AM & PM, and otherwise ignore.

App selection on the primary system is key. The Wire application works on desktops and mobiles, it supports up to three different accounts on the desktop, and it provides muting for busy group chats. I think I have one of every other chat system known to man, but I never use them unless I am summoned for some specific reason.

Work Focus:

I have a virtual machine I’ve named Hunchly, after the browser activity recording tool by the same name, Hunch.ly. This hosts a couple of related tools and a broad, long term, low intensity social media presence, which lets me peer into various systems without getting enmeshed with personal contacts.

Software development and the related large scale social media analytics tasks are still on the host OS, but that will be the next thing to change. I have slowly begun to use PyCharm for development work due to a fairly new collaboration with another developer who favors it. I just learned there is OpenCYPHER support, which is going to facilitate our transition to using Neo4j for some social network analysis.

Mobility:

The smartphone is the equivalent of the dorm room you were compelled to inhabit your first year at college. There isn’t a lot of room and it’s a near constant ruckus. Your laptop or desktop is the cramped studio space overlooking the campus town bar strip that you moved into as a junior. A bit more room, but still endless distractions.

A machine that will support VirtualBox (free) in the style you need requires sixteen gig of ram and a solid state disk. This $485 Dell Precision M4600 is twin to the machine I use when I’m mobile. I can just copy virtual machine directories from desktop to laptop. As long as they’re two or four cores and four or eight gigabytes, this works well.

Conclusions:

Application developers, content providers, and social network operators have no incentive to do any better. If you want to reclaim your time, sticking to the housing metaphor above, it’s like buying a sturdy farmhouse on a country road after it’s been empty for a while. You will have a bit more work to do on an ongoing basis, but the space, the quiet, and a roomy machine shed behind the two car garage? There is a reason we have an “escape to the country” societal meme. Apply that thinking to your online presence and you may be pleasantly surprised by the outcome.

An Analyst’s Workstation

Six months ago we published An Analyst’s Environment, which describes some tools we use that are a bit beyond the typical lone gun grassroots analyst. Since then our VPS based Elasticsearch cluster has given way to some Xeon equipment in racks, which led to Xeon equipment under desks.

Looking back over the past two months, we see a quickly maturing “build sheet” for analyst workstations. This is in no small part due to our discovery of Budgie, an Ubuntu Linux offshoot. Some of our best qualitative analysts are on Macs and they are extremely defensive of their work environment. Budgie permits at least some of that activity to move to Linux, and it’s thought that this will become increasingly common.

Do not assume that “I already use Ubuntu” is sufficient to evaluate Budgie. They are spending a lot of time taking off the rough edges. At the very least, put it in a VM and give it a look.

Once installed, we’re including the following packages by default:

  • Secure communications are best handled with Wire.
  • The Hunch.ly web capture package requires Google Chrome.
  • Chromium provides a separate unrecorded browser.
  • Maltego CE link analysis package is useful even if constrained.
  • Evernote is popular with some of our people; Tusk works on Linux.
  • XMind Zen provides mind mapping that works on all platforms.
  • Timeline has been a long term player and keeps adding features.
  • Gephi data visualization works, no matter what sized screen is used.

Both Talkwalker Alerts and Inoreader feeds are RSS based. People seem to be happy with the web interface, but what happens when you’re in a place without network access? There are a number of RSS related applications in Budgie’s slick software store. Someone is going to have to go through them and see which best fits that particular use case.

Budgie’s many packages for handling RSS feeds.

There have been so many iterations of this set of recommendations, most conditioned by the desire to support Windows, as well as Mac and Linux. The proliferation of older Xeon equipment, in particular the second generation HP Z420/Z620/Z820, which start in useable condition at around $150, means we no longer have that constraint.

Sampling of inexpensive HP Z420s on Ebay in May of 2019.

Starting with that base, 64 gig of additional memory is about $150, and another $200 will cover a 500 gig Crucial solid state disk and the fanless entry level Nvidia GT1030.

The specific combination of the Z420 and the Xeon E5-2650L v2 has a benchmark that matches the current MacBook Pro. It will be literally an order of magnitude faster on Gephi, the most demanding of those applications, and it will happily work for hours on end without making a sound. The Mac, on the other hand, will be making about as much noise as a Shopvac after just five minutes.

That chip and some Thermal Grizzly Kryonaut should not cost you more than $60 and will take a base Z420 from four cores to ten. So there you have it – mostly free software, a workstation you can build incrementally, and then you have the foundation required to sort out complex problems.

Analyzing Twitter Streams

Our prior work on Twitter content has involved bulk collection of the following types of data:

  • Tweets, including raw text suitable for stylometry.
  • Activity time for the sake of temporal signatures.
  • Mentions including temporal data for conversation maps.
  • User ID data for profile searches.
  • Follower/following relationships, often using Maltego.

Early on this involved simply running multiple accounts in parallel, each working on their own set of tasks. Seemingly quick results were a matter of knowing what to collect and letting things happen. Hardware upgrades around the start of 2019 permitted us to run sixteen accounts in parallel … then thirty two … and finally sixty four, which exceeded the bounds of 100mbit internet service.

We had never done much with the Twitter streaming API until just two weeks ago, but our expanded ability to handle large volumes of raw data has made this a very interesting proposition. There are now ten accounts engaged in collecting either a mix of terms or following lists of hundreds of high value accounts.

Indexing Many Streams

What we get from streams at this time includes:

  • Tweet content.
  • RT’d tweet content.
  • Quoted tweet content.
  • Twitter user data for the source.
  • Twitter user data for accounts mentioned.
  • Twitter user data for accounts that are RT’d.
  • User to mentioned account event including timestamp.
  • User to RT’d account event including timestamp.

This data is currently accumulating in a mix of Elasticsearch indices. We recognize that we have at least three document types:

  • Tweets.
  • User data.
  • Interaction data.
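To make the three document types concrete, here is a minimal sketch that splits one tweet into tweet, user, and interaction documents. It assumes the Twitter API v1.1 status layout (id_str, text, created_at, user, entities.user_mentions, retweeted_status); the output field names such as "kind", "source", and "target" are hypothetical, not our actual index mappings.

```python
def split_tweet(tweet: dict) -> dict:
    """Split one v1.1 status object into the three document types."""
    author = tweet["user"]
    when = tweet["created_at"]
    retweeted = tweet.get("retweeted_status")
    rt_user_id = retweeted["user"]["id_str"] if retweeted else None

    docs = {
        "tweets": [{
            "id": tweet["id_str"],
            "user_id": author["id_str"],
            "created_at": when,
            "text": tweet["text"],
        }],
        "users": [author],
        "interactions": [],
    }

    def link(kind: str, target_id: str) -> None:
        # Minimal data needed to establish the link between two accounts.
        docs["interactions"].append({
            "kind": kind,
            "source": author["id_str"],
            "target": target_id,
            "tweet_id": tweet["id_str"],
            "created_at": when,
        })

    for mention in tweet.get("entities", {}).get("user_mentions", []):
        # A retweet also lists the retweeted account as a mention;
        # skip it here so the event is only counted once, as a retweet.
        if mention["id_str"] == rt_user_id:
            continue
        link("mention", mention["id_str"])

    if retweeted:
        # The retweeted status carries a full user object worth keeping.
        docs["users"].append(retweeted["user"])
        link("retweet", rt_user_id)

    return docs
```

Mention entries only carry an ID and screen name, so this sketch does not create user documents for them; full profiles for mentioned accounts would need a separate lookup.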

Our current setup is definitely beta at this point. We probably need more attention on the natural language processing aspect of the tweets themselves, particularly as we expand into handling multiple European languages. User data could stand having hashtags extracted from profiles, which we missed the first time around; otherwise this seems pretty simple.

The interaction data is where things become uncertain. It is good to have this content in Elasticsearch for the sake of filtering. It is unclear precisely how much we should permit to accumulate in these derivative documents; at this point they’re just the minimal data from each tweet that permits establishing the link between accounts involved. Do we also do this for hashtags?

Once we have this, the next question is what do we do with it? The search, sorting, and time slicing of Elasticsearch is nice, but this is really network data, and we want to visualize it.

Maltego is out of the running before we even start; 10k nodes maximum has been a barrier for a long time. Gephi is unusable on a 4k Linux display due to font sizing for readability, and it will do just enough on a half million node network to leave one hanging with an analysis half finished on a smaller display.

The right answer(s) seem to be to get moving on Graphistry and Neo4j. An EVGA GTX 1060 turned up here a few weeks ago, displacing a GT 1030 to an associate. Given the uptime requirements for Elasticsearch, not much has happened towards Graphistry use other than the physical install. It looks like Docker is a requirement, and that’s a synonym for “invasive nuisance”.

Neo4j has some visualization abilities but its real attraction is the native handling of storage and queries for graphs. Our associates who engage in analysis ask questions that are easily answered with Elasticsearch … and other questions that are utterly impossible to resolve with any tool we currently wield.

Conclusion

Expanding capacity has permitted us to answer some questions … but on balance it’s uncovered more mysteries than it has resolved. This next month is going to involve getting some standard in place for assessing incoming streams, and pressing on both means of handling graph data, to see which one we can bring to bear first.

Twitter Bots Concealed By API

Last month we announced the Netwar System Community Edition, the OVA for which is still not posted publicly. In our defense, what should have been a couple days with our core system has turned into a multifaceted month long bug hunt. A good portion could be credited to “unfamiliar with Search Guard”, but there is a hard kernel of “WTF, Twitter, WTF?!?” that we want to describe for other analysts.

Core System Configuration

First, some words about what we’ve done with the core of the system we use day to day. After much experimentation we settled on the following configuration for our Elasticsearch dependent environment.

  • HP Z620 workstations with dual eight core Xeons.
  • 128 gig of ram.
  • Dual Seagate IronWolf two terabyte drives in a mirror.
  • Single Samsung SSD for system and ZFS cache.
  • Trio of VirtualBox VMs with 500 gig of storage each.
  • 32 gig for host, ZFS ARC (cache) limited to 24 gig.
  • 24 gig per VM, JVM limited to 12 to 16 gig.

There are many balancing acts in this, too subtle and too niche to dig into here. It should be noted that FreeBSD Mastery: ZFS is a fine little book, even if you’re using Linux. The IronWolf drives are helium filled gear meant for NAS duty. In retrospect, paying the 50% premium for IronWolf Pro gear would have been a good move and we’ll go that way as we outgrow these.

We’ve started with a pair of machines, we’re defaulting to three shards per index, and a single replica for each. The Elasticsearch datacenter zones feature proved useful; pulling the network cable on one machine triggers some internal recovery processes, but there is no downtime from the user’s perspective. We’re due for a third system with similar specifications, it will receive the same configuration including a zone of its own, and we’ll move from one replica per index to two. This will be a painless shift to N+1 redundancy.
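The shard arithmetic behind that shift is worth spelling out. This is a minimal sketch: number_of_shards and number_of_replicas are the standard Elasticsearch index settings, while the helper names are ours and the even spread of copies across machines is an assumption.

```python
def index_settings(shards: int = 3, replicas: int = 1) -> dict:
    """Body for an Elasticsearch create-index request."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }


def shard_copies(shards: int, replicas: int) -> int:
    """Total shard copies per index: each primary plus its replicas."""
    return shards * (1 + replicas)


def copies_per_machine(shards: int, replicas: int, machines: int) -> float:
    """Shard copies each machine holds per index, assuming an even spread."""
    return shard_copies(shards, replicas) / machines


if __name__ == "__main__":
    for machines, replicas in ((2, 1), (3, 2)):
        total = shard_copies(3, replicas)
        each = copies_per_machine(3, replicas, machines)
        print(f"{machines} machines, {replicas} replica(s): "
              f"{total} copies per index, {each:g} per machine")
```

Two machines with one replica means six shard copies per index, three on each machine. Three machines with two replicas means nine copies, still three per machine, so each system carries a full copy of every index and the per-machine load does not change.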

API Mysteries At Scale

Our first large scale project has been profiling the followers of 577 MPs in the U.K. Parliament. There are 20.6M follow relationships with 6.6M unique accounts. Extracting their profiles would require forty hours with our current configuration … but there are issues.

Users haven’t seen a Twitter #FailWhale in years, but as continuous users of the API we expect to see periods of misbehavior on about a monthly basis. February featured some grim adjustments, signs that Twitter is further clamping down on bots, which nipped our read only analytical activities. There are some features that seem to be permanently throttled now based on IP address.

When we arrived at what we thought was the end of the road, we had 6.26M profiles in Elasticsearch rather than the 6.6M we knew to exist, a discrepancy of about 350,000. We tested all 6.6M numeric IDs against the index and found just 325,000 negative responses. We treated that set as a new batch and the system captured 255,000, leaving only 70,000 missing. Repeating the process again with the 70,000 we arrived at a place where the problem was amenable to running a small batch in serial fashion.
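The repeat-until-small procedure above amounts to a set difference inside a loop. This is a minimal sketch, not our production code: fetch stands in for whatever submits a batch of numeric IDs to the API and returns the IDs that came back with profiles, and the pass limit is an arbitrary choice.

```python
from typing import Callable, Iterable, Set


def reconcile(expected_ids: Iterable[int],
              indexed_ids: Set[int],
              fetch: Callable[[Set[int]], Set[int]],
              max_passes: int = 3) -> Set[int]:
    """Re-request missing IDs until a pass captures nothing new.

    indexed_ids is updated in place with whatever is captured;
    the return value is the set of stragglers still missing.
    """
    missing = set(expected_ids) - indexed_ids
    for _ in range(max_passes):
        if not missing:
            break
        # Ignore anything the fetcher returns that we did not ask for.
        captured = fetch(set(missing)) & missing
        if not captured:
            break
        indexed_ids |= captured
        missing -= captured
    return missing


if __name__ == "__main__":
    # Toy stand-in for the API: IDs divisible by four never resolve.
    def flaky_fetch(ids: Set[int]) -> Set[int]:
        return {i for i in ids if i % 4 != 0}

    have = set(range(10))
    left = reconcile(range(20), have, flaky_fetch)
    print(sorted(left))
```

In our case the first pass went from 325,000 missing to 70,000, and the remainder was small enough to run serially and watch.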

Watching a batch of a thousand of these stragglers, roughly a quarter got an actual response, a quarter came back as suspended, and the remainder came back as page not found. The last response is expected when an account has been renamed or self suspended, but we were using numeric ID rather than screen name.

And the API response to this set was NOT deterministic. Run the process again with the same data, the percentages were similar, but different accounts were affected.

A manual inspection of the accounts returned permits the formulation of a theory as to why this happens. We know the distribution of the creation dates of these accounts:

MP Followers Account Creation Dates

The bulk of the problematic accounts are dated between May and August of 2018. Recall that Twitter completed its acquisition of Smyte and shut down 70 million bots during that time frame. May in the histogram is the first month where account creation dates are level. A smaller set clustered around the same day in mid-December of 2012, another fertile time period for bot creation.

The affected accounts have many of the characteristics we associate with bots:

  • Steeply inverted following to follower ratio.
  • Complete lack of relationships to others.
  • Relatively few tweets.
  • Default username with eight trailing digits.

An account that was created and quickly abandoned will share these attributes. So our theory regarding the seeming problem with the API is as follows:

These accounts that cannot be accessed in a deterministic fashion using the API are in some sort of Smyte induced purgatory. They are not accessible, protected, empty of content, suspended, or renamed, which are five conditions our code already recognizes. There is a new condition, likely “needs to validate phone number”, and accounts that have not done this are likely of interest only to their botnet operators, or researchers delving very deeply into the system’s behavior.

But What Does This MEAN?

Twitter has taken aggressive steps to limit the creation of bots. Accounts following MPs seem to have fairly evenly distributed creation dates, less the massive hump from early 2016 to mid 2018. We know botnet operators are liquidating collections of accounts that have been wiped of prior activity for as little as $0.25 each. There are reportedly offerings of batches of accounts considered to be ‘premium’, but what we know of this practice is anecdotal.

Our own experience is limited to maintaining a couple platoons of collection oriented accounts, and Twitter has erected new barriers, requiring longer lasting phone numbers, and sometimes voice calls rather than SMS.

This coming month we are going to delve into the social bot market, purchasing a small batch, which we will host on a remote VPS and attempt to use for collection work.

The bigger implication is this … Twitter’s implementation of Smyte is good, but it’s created a “hole in the ocean problem”, a reference to modern submarines with acoustic signatures that are less than the noise floor in their environment. If the affected accounts are all bots, and they’re just standing deadwood of no use to anyone, that’s good. But if they can be rehabilitated or repurposed, they are still an issue.

Seems like we have more digging to do here …

Mystery Partially Resolved …

So there was an issue with the API, but an issue on our side.

When a Twitter account gets suspended, its API tokens will still permit you to check its credentials. So a script like this reports all is well:

But if three of the sixty four accounts used in doing numeric ID to profile lookups have been suspended … 3/64 = 4.69% failure rate. That agrees pretty well with some of the trouble we observed. We have not had cause to process another large batch of numeric IDs yet, but when we do, we’ll check this theory against the results.
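That failure rate can be checked against the earlier numbers. This is a back-of-the-envelope sketch, assuming lookups are spread evenly across the sixty four accounts and that every lookup handed to a suspended account fails; both assumptions are ours and the function name is hypothetical.

```python
def expected_failures(total_lookups: int, accounts: int, suspended: int) -> float:
    """Lookups expected to fail if work is spread evenly across accounts
    and every lookup handed to a suspended account fails."""
    return total_lookups * suspended / accounts


if __name__ == "__main__":
    rate = 3 / 64
    failures = expected_failures(6_600_000, accounts=64, suspended=3)
    print(f"failure rate: {rate:.4%}")
    print(f"expected failures in 6.6M lookups: {failures:,.0f}")
```

Under those assumptions 6.6M lookups would lose 309,375 profiles, which is in the neighborhood of the 325,000 negative responses from the first pass.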

 

An Analyst’s Environment

This week we had a chance to work with an analyst who is new to our environment. The conversation revealed some things we find pedestrian that are exciting to a new person, so we’re going to detail them.

Alerts

Many people use Google’s Alerts, but far fewer are familiar with the service Talkwalker offers. This company offers social media observation tools and their free alerts service seems to be a way to gather cognitive excess, to learn what things might matter to actual humans. These alerts arrive as email, or as an RSS feed, which is a very valuable format.

Feed Reading

Google Reader used to be a good feed reader, but it was canceled some years ago. Alternatives today include Feedly and Inoreader. The first is considered the best for day to day reading activity, while Inoreader gets high marks for archival and automation. The paid version, just $49 per year, will comfortably handle hundreds of feeds, including the RSS output from the above mentioned Talkwalker.

Content Preservation

Talkwalker Alerts never sleep, Inoreader provides all sorts of automation, but how does one preserve some specific aspect of the overall take? We like Hunch.ly for faithful capture. This $129 tool is a Chrome extension that faithfully saves every page visited. It offers ‘selectors’, text strings that are standing queries in an ‘investigation’, and an investigation can be exported as a single zip file, which another user can then import. That is an amazingly powerful capability for small groups, who are otherwise typically trying to synchronize with an incomplete, error filled manual process.

Link Analysis

Alerting, feed tracking, and content preservation are important, but the Hunch.ly investigation is the right quantum of information for an individual or a small group. Larger bodies of information where linkages matter are best handled with Maltego Community Edition, which is free. There are transforms (queries) that will pull information from a Hunch.ly case, but the volume of information returned exceeds the CE version’s twelve item limit.

Maltego Classic is $1,000 with a $499 annual maintenance fee. This is well worth the cost for serious investigation work, particularly when there is a need to live share data among multiple analysts.

Costs Of Doing Business

We are extremely fond of FOSS tools, but there are some specialized tasks where it simply makes no sense to try to “roll your own”. This $1,200 kit of tools is a force multiplier for any investigator, dramatically enhancing accuracy and productivity.
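For anyone checking the math on that figure, here is a trivial sketch using the prices quoted above; it covers first year costs only and leaves out the Maltego maintenance fee, which applies in later years.

```python
# First-year prices as quoted in this post, in US dollars.
KIT = {
    "Inoreader (paid, per year)": 49,
    "Hunch.ly": 129,
    "Maltego Classic": 1000,
}

FIRST_YEAR_TOTAL = sum(KIT.values())

if __name__ == "__main__":
    for item, price in KIT.items():
        print(f"{item}: ${price:,}")
    print(f"Total: ${FIRST_YEAR_TOTAL:,}")
```

The total comes to $1,178, which we round to $1,200.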

The Shape Of The Internet

One of the perennial problems in this field is the antiquated notion of jurisdiction, as well as increasing pressure on Westphalian Sovereignty. JP and I touched on this during our November 5th appearance on The View Up Here. The topic is complex and visual, so this post offers some images to back up the audio there.

Regional Internet Registries

Regional Internet Registries

The top level administrative domains for the network layer of the internet are the five Regional Internet Registries. These entities were originally responsible for blocks of 32 bit IPv4 addresses and 16 bit Autonomous System numbers. Later we added 128 bit IPv6 addresses and 32 bit Autonomous System numbers as the original numbers were being exhausted.

When you plug your home firewall into your cable modem it receives an IP address from your service provider and a default route. That outside IP is globally unique, like a phone number, and the default route is where any non-local traffic is sent.

Did you ever stop to wonder where your cable modem provider gets their internet service? The answer is that there is no ‘default route’ for the world, they connect at various exchange points, and they share traffic there. The ‘default route’ for the internet is a dynamic set of not quite 700,000 blocks of IP addresses, known as prefixes, which originate from 59,000 Autonomous Systems.

The Autonomous System can be thought of as being similar to a telephone system country code. It indicates from a high level where a specific IP address prefix is located. The prefix can be thought of as an area code or city code; it’s a more specific location within the given Autonomous System.
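The relationship between an address, its prefix, and the originating Autonomous System can be sketched with Python’s ipaddress module. The routing table below is entirely hypothetical: the prefixes are documentation ranges and the AS numbers come from the block reserved for documentation, so none of this reflects real routes. When prefixes overlap, the most specific one wins.

```python
import ipaddress

# Hypothetical routing table: documentation prefixes mapped to
# documentation AS numbers. These are not real routes.
ROUTES = {
    "192.0.2.0/24": 64500,
    "198.51.100.0/24": 64501,
    "198.51.100.128/25": 64502,
}


def origin_as(ip: str, routes: dict = ROUTES):
    """Return (prefix, asn) for the most specific prefix containing ip,
    or None when no prefix covers the address."""
    addr = ipaddress.ip_address(ip)
    matches = [
        net for net in map(ipaddress.ip_network, routes)
        if addr in net
    ]
    if not matches:
        return None
    # Longest prefix length means the most specific route.
    best = max(matches, key=lambda net: net.prefixlen)
    return str(best), routes[str(best)]


if __name__ == "__main__":
    for ip in ("192.0.2.10", "198.51.100.200", "203.0.113.1"):
        print(ip, "->", origin_as(ip))
```

The real table is the not quite 700,000 prefixes mentioned above, but the lookup logic is the same.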

There isn’t a neat global map for this stuff, but if you’re trying to make a picture, imagine a large bunch of grapes. The ones on the outside of the bunch are the hosting companies and smaller ISPs, who only touch a couple neighbors. The ones in the middle of the bunch touch many neighbors and are similar in position to the big global data carriers.

Domain Name Service

Once a new ISP has circuits from two or more upstream providers they can apply for an Autonomous System number and ask for IP prefixes. Those prefixes used to come straight from the RIRs, but these days you have to be a large provider to do that. Most are issued to smaller service providers by the large ones, but the net effect is the same.

Having addresses is just a start, the next step is finding interesting things to do. This requires the internet’s phone book – the Domain Name System. This is how we map names, like netwarsystem.com, to an IP address, like 95.173.136.70. There is also a reverse DNS domain that is meant to associate IP addresses with names. If you try to check that IP I just mentioned it’ll fail, which is a bit funny, as that’s not us, that’s kremlin[.]ru.
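Forward and reverse lookups are easy to try from Python’s standard library. This is a minimal sketch; the function names are ours, and the two lookup functions need a working resolver, so they return None rather than raise when a name or address does not resolve.

```python
import ipaddress
import socket


def reverse_name(ip: str) -> str:
    """Name that a reverse (PTR) query for ip actually asks about."""
    return ipaddress.ip_address(ip).reverse_pointer


def forward_lookup(name: str):
    """Resolve a name to an IPv4 address, or None on failure."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        return None


def reverse_lookup(ip: str):
    """Resolve an IP address back to a name, or None on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


if __name__ == "__main__":
    ip = "95.173.136.70"
    print(reverse_name(ip))    # the in-addr.arpa name used for reverse DNS
    print(reverse_lookup(ip))  # depends on live DNS, may be None
```

The reverse domain is just the address octets in reverse order under in-addr.arpa, which is why reverse DNS is controlled by whoever holds the IP prefix rather than whoever owns the name.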

Domain Name Registrars & Root DNS Servers

How do you get a DNS name to use in the first place? Generally speaking, you have to pay a Registrar a fee for your domain name. There is some configuration done regarding your Start Of Authority, which is a fancy way of saying which name servers are responsible for your domain, and then this is pushed to the DNS Root Servers.

There are nominally thirteen root servers. That doesn’t mean thirteen computers; it means twelve different organizations manage them (Verisign handles two), and their addresses are ‘anycast’, which means they originate from multiple locations, while the actual systems themselves are hidden from direct access. This is sort of a CDN for DNS data, and it exists due to endless attacks that are directed at these systems.

Verisign’s two systems are in datacenters on every continent and have over a hundred staff involved in their ongoing operation.

Layers Of Protection

And then things start to get fuzzy, because people who are in conflict will protect both their servers and their access.

Our web server is behind the Cloudflare Content Distribution Network. There are other CDNs out there and they exist to accelerate content as well as protect origin servers from attack. We like this service because it keeps our actual systems secret. This would be one component of that Adversary Resistant Hosting that we don’t otherwise discuss here.

When accessing the internet it is wise to conceal one’s point of origin if there may be someone looking back. This is Adversary Resistant Networking, which is done with Virtual Private Networks, the Tor anonymizing network, misattribution services like Ntrepid, and other methods that require some degree of skill to operate.

Peeling The Onion

Once you understand how all the pieces fit together there are still complexity and temporal issues.

Networked machines can generate enormous amounts of data. We previously used Splunk and recently shifted to Elasticsearch, both of which are capable of handling tens of millions of datapoints per day, even on the limited hardware we have available to us. Both systems permit time slicing of data as well as many other ways to abstract and summarize.

Data visualization can permit one to see relationships that are impenetrable to a manual examination. We use Paterva’s Maltego for some of this sort of work and we reach for Gephi when there are larger volumes to handle.

Some of the most potent tools in our arsenal are RiskIQ and Farsight. These services collect passive DNS resolution data, showing bindings between names and IP addresses when they were active. RiskIQ collects time series domain name registration data. We can examine SSL certificates, trackers from various services, and many other aspects of hosting in order to accurately attribute activity.
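
The value of passive DNS is the time dimension, so a small sketch may help. The observations below are invented for illustration, using reserved documentation names and addresses, and the field names are our assumption rather than the schema of RiskIQ or Farsight. The point is that two names sharing an IP only matters if their active windows overlap.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    name: str
    ip: str
    first_seen: date
    last_seen: date


# Hypothetical observations; not real passive DNS data.
OBSERVATIONS = [
    Observation("alpha.example", "192.0.2.10", date(2018, 1, 5), date(2018, 3, 1)),
    Observation("bravo.example", "192.0.2.10", date(2018, 6, 1), date(2018, 9, 30)),
    Observation("alpha.example", "198.51.100.7", date(2018, 3, 2), date(2018, 10, 31)),
]


def names_on_ip(observations, ip, when):
    """Return the names bound to `ip` on the day `when`, sorted."""
    return sorted(
        o.name
        for o in observations
        if o.ip == ip and o.first_seen <= when <= o.last_seen
    )


def shared_ip_overlap(a: Observation, b: Observation) -> bool:
    """True only if both names sat on the same IP during overlapping date ranges."""
    return a.ip == b.ip and a.first_seen <= b.last_seen and b.first_seen <= a.last_seen
```

In this toy data, alpha.example and bravo.example both used 192.0.2.10, but months apart, so `shared_ip_overlap` reports no overlap. Ignoring the dates would suggest a link that the evidence does not support.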

Conclusion

The world benefits greatly from citizen journalists who dig into all sorts of things. This is less than helpful when it comes to complex infrastructure problems. Some specific issues that have arisen:

  • People who are not well versed in the technologies used can manage to sound credible to the layman. There have been numerous instances where conspiracy theorists have made comical attribution errors, in particular using geolocation data for IPs to assert correlations where none exist.
  • There is a temporal component that arises when facing any opponent with even a bit of tradecraft, and freely available tools don’t typically address it, so would-be investigators are left piecing things together, often without all of the necessary information.
  • Free access to quality tools like Maltego and RiskIQ is intentionally limited. RiskIQ in particular causes problems in the hands of the uninitiated – a domain hosted on a Cloudflare IP will have thousands of fellows, but the free system will only show a handful. There have been many instances of people making inferences based on that limited data that have no connection to objective reality.

We do not have a y’all come policy in this area; we specifically seek out those who have the requisite skills to do proper analysis, who know when they are out on a limb. When we do find such an individual who has a legitimate question, we can bring a great deal of analytical power to bear.

That specific scenario happened today, which triggered the authoring of this article. We may never be able to make the details public, but an important thing happened earlier, and the world is hopefully a little safer for it.

 

Domestic Extremist? Or Something Else?

What does this site say to you at first glance?

patrioticfreedomfighter[.]com
This is one of nearly two dozen sites pushing fringe right wing views that are all associated with Mark Edward Baker, as detailed in this story by McClatchy. When I first heard of this the initial thought was that something so slippery could be a foreign influence operation. I came to a much different conclusion, but it took many hours of digging.

Here is the full list of domains involved:

1776christian[.]com

americangunnews[.]com

americanlibertyreport[.]com

americanviralheadlines[.]com

christianpatriotdaily[.]com

conservativezone[.]com

factsnotmemes[.]com

financialmorningdigest[.]com

firearmdaily[.]com

freedomnewsreport[.]com

frontpagepatriot[.]com

healthiervideos[.]com

liberalliedetector[.]com

libertyplanet[.]com

libertyvideonews[.]com

memesorfacts[.]com

nationalgunnetwork[.]com

patrioticfreedomfighter[.]com

patrioticviralnews[.]com

readytofirenews[.]com

uspoliticsandnews[.]com

wealthauthority[.]com
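
The domains above are written in defanged form, with `[.]` in place of each dot so nothing is accidentally clickable. A minimal sketch of converting in both directions before feeding a list to a resolver or lookup tool follows. The function names are ours and the bracket convention is the only one handled.

```python
def refang(indicator: str) -> str:
    """Restore a defanged indicator so it can be fed to a resolver or search tool."""
    return indicator.strip().replace("[.]", ".")


def defang(indicator: str) -> str:
    """Make a domain or IP non-clickable by bracketing every dot."""
    # Refang first so already-defanged input is not bracketed twice.
    return refang(indicator).replace(".", "[.]")


def normalize(indicators):
    """Refang, lowercase, and de-duplicate while preserving first-seen order."""
    seen = {}
    for item in indicators:
        seen.setdefault(refang(item).lower(), None)
    return list(seen)


if __name__ == "__main__":
    print(normalize(["1776christian[.]com", "WealthAuthority[.]com"]))
```

Normalizing before any bulk lookup keeps duplicates and case differences from inflating the count of distinct names.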

The physical plant for this is a circus – 434 unique IP addresses and they all seem to be tied to the operation.

Mark Edward Baker Internet Footprint

A simpler exam of the SOA for each domain yielded a deeper clue in the form of the gomarkb@gmail.com address used for registration. It’s connected to another cluster of domains.

gomarkb@gmail.com domains

We are not going to revisit the merry chase this guy provides – fire up hunch.ly and go at it. He uses the alias Mark Bentley, be on the lookout for LOP, which is short for League of Power, and his wife Jennifer is a signatory on some of the paperwork. He has at least half a dozen PO boxes in Florida and a similar setup in Reno, Nevada, which appears to be his origin. Once I was sure I had a real name, I was more interested in what his business model is and if there were any foreign ties.

If you poke around for League of Power you’ll find complaints about his $27 scam work from home DVD. This guy’s ideology is getting other people’s money and providing little to nothing in return. This is pretty common to see on both sides of the aisle – grifters working the earnest, but naive masses. This guy clearly focuses on the right – different skills are needed to run a similar game against the left.

Here is the one image that more or less sums up what he is doing:

Mailgun Usage

You would have to be in the business of examining attribution resistant hosting to notice this, but it was like a flashing neon sign for me. Domains that don’t want to be traced typically have no email handling at all. This guy’s business model is list building, which he’ll use for maybe some political stuff, but it will be an ongoing bulk mail target after the election.
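
The tell described here can be reduced to a simple check on a domain's MX hosts. This is a minimal sketch, assuming the MX hostnames have already been collected by some other means (a `dig MX` run, a passive DNS export). The provider suffix list is our assumption for illustration, based on Mailgun's usual `mxa.mailgun.org` and `mxb.mailgun.org` hosts, and would need extending for other bulk mail services.

```python
def classify_mail_handling(mx_hosts):
    """Classify a domain by its MX hosts: 'none', 'bulk-mail provider', or 'other'."""
    # Assumed provider suffixes for illustration; extend as needed.
    bulk_suffixes = ("mailgun.org",)
    hosts = [h.rstrip(".").lower() for h in mx_hosts]
    if not hosts:
        # No MX at all is typical of hosting that does not want to be traced.
        return "none"
    if any(h == s or h.endswith("." + s) for h in hosts for s in bulk_suffixes):
        return "bulk-mail provider"
    return "other"


if __name__ == "__main__":
    print(classify_mail_handling(["mxa.mailgun.org.", "mxb.mailgun.org."]))
```

A result of "bulk-mail provider" across a whole cluster of domains points toward list building rather than attribution resistance.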

 

What about foreign influence being behind this? The article mentions that conservativezone[.]com[.]com had been used. That’s a spearphishing move. It resolves to these geniuses:

conservativezone[.]com[.]com
What is AS206349? A dinky autonomous system in Bulgaria with a history of IP address hijacking. Baker got some service from these guys, but it wasn’t anything he wanted to receive. They may have noticed he was gathering lists of the easily duped and decided this would be a good phishing hole for them.

 

People who are way into the bipolar politics in the U.S. tend to judge things as either on their side, or the opposition. There are nuances on that spectrum that get ignored: foreign influence, weird cyberstalker types, and just outright fraud, like we’re seeing here. Don’t jump to sticking a red or blue label on something until you’ve had a good chance to inspect it and conjure up some alternate theories.

Russian Infrastructure, Domestic Threats

Let’s take a look at a curious thing in RiskIQ – the October 22nd registration for the 0hour1[.]com domain.

0HOUR1 Registration

The nameservers at westcall[.]ru are part of a large ISP in St. Petersburg. The registrant’s trail is an intentional mess, but let’s see what we can find on Brian Durant. It’s helpful to know his birth name – Fiore DiPietrantonio.

Durant came to my attention because his crew are bothering a CVE researcher I know, and threatening a man in Brooklyn whom they mistakenly identified as him. There is a decent Threadreader on Durant from @trebillion that provided me a starting point.

Achtung! If you choose to pursue this, you must turn your OSINT tradecraft up to eleven. The name change is an attempt to leave behind a shady past, there is intentional deception at work prior to the politics, and don’t chase after a pretty face on a largely empty persona.

As a sign of how much of a hassle this backtrail was, take a look at my hunch.ly case for it. And this is the second one – the first draft had so much crud in it I found it easier to just start over and revisit the stuff I confirmed.

Brian Durant Investigations

About that empty female persona … here is where it starts.

rowdypolitics[.]com
Which of these do you think are legitimate?

Meghan Thompson

If you picked only #1 and #2, with a faint urge to check #3, give yourself a gold star.

Three Domains & Hosting

I got distracted writing this and spent half an hour playing with the RiskIQ response to the Maltego Domain Analysis transform. GoDaddy is a terrible swamp that typically reveals nothing, so I collapsed it to a point to better see the other things. The westcall[.]ru nameserver mistake only showed in the registration; they caught it before it started turning up in passive DNS. So what are these three other things we see?

192.169.82.86 is part of The Swamp – the tiny allocations in 192.0.0.0/7 that were handed out for free back in the eighties. RiskIQ shows over 9,000 names for it, Maltego finds 552, ARIN says it’s a point to point link from Limestone Networks to a customer. It’s been passed around for thirty years or more and I don’t think it tells us much. The reverse lookup is the last one someone cared to enter and those don’t seem to matter much on shared servers. Make a mental note to come back, only if all else fails.
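
Both checks mentioned here, block membership and the reverse lookup name, can be done offline with Python's standard `ipaddress` module. This is a small sketch; the `SWAMP` constant simply encodes the 192.0.0.0/7 range as stated above, and the function names are ours.

```python
import ipaddress

# The range named in the text; /7 spans 192.0.0.0 through 193.255.255.255.
SWAMP = ipaddress.ip_network("192.0.0.0/7")


def in_swamp(ip: str) -> bool:
    """True if the address falls inside the 192.0.0.0/7 block."""
    return ipaddress.ip_address(ip) in SWAMP


def ptr_name(ip: str) -> str:
    """Return the in-addr.arpa name that a reverse (PTR) lookup would query."""
    return ipaddress.ip_address(ip).reverse_pointer


if __name__ == "__main__":
    print(in_swamp("192.169.82.86"), ptr_name("192.169.82.86"))
```

The PTR name is only the question to ask. Whatever comes back is whatever the address holder last bothered to publish, which is why it carries little weight on shared servers.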

86.106.93.230 is a European address, you can tell just by the leading octet, and with a little poking we find it’s in AS44901 – BelCloud Hosting, but it’s listed with over 10,000 other names.

What about 149.56.202.49? Looking at the times it was active we find that it had the system to itself, running what looks to be WordPress under cPanel.

magaforamerica[.]com
This is turning into a common theme – people trying to do their own hosting and then giving up after a couple days. The DNS tab provided another interesting clue that I just noticed as I was drafting this article.

videowhispers@gmail.com

And what’s going on here? Every time I look at this thing I find another eastern European/Russian link.

videowhispers[.]com

 

So … I thought this was going to be a declaration and instead it’s a problem statement – there is more digging to do here. But there is one piece of digging that is done – the ID of Brian Durant’s associate who threatened me when I first started probing is within easy reach. Check the DM that @NetwarSystem received on October 17th and the menacing voicemail from October 27th.

This “very reasonable dude” is @MistaBRONCO, which has been a stable alias for him for at least six years. It was on his Flickr account, where a close inspection of cat photos turned up this gem. Handsome boy, isn’t he?

BRONCO, International Cat Of Mystery

So our internet tough guy who cleverly pressed *69 before he left me a message on a Google Voice number that hasn’t had a phone attached to it in five years is laid low by the ol’ surname & multiple phone numbers on a pet’s nametag. Amateur hour here – stable name, personal details all over the place. This is all recorded in hunch.ly and the good bits of the YouTube channel that put the voice with this cat are safe, too.

So I’ve got a voice threat, another guy that this genius misidentified as Ca1m has received threats, the source lives on Long Island, the target is in Brooklyn. Since these are both covered by the NYC FBI field office and I was pointedly told to “come correct”, I had the following exchange with the counter-terrorism SA in that office, whom I’ve known since Occupy days.

Alerting Agent Smith

That’s as correct as I can play it in the wee hours of a Friday right before a midterm in which Russian influence is certainly still a problem.

How’d I do?

 

SocialLinks: An Actual Investigation

We have a special treat today, an actual investigation and preservation exercise on a Russian company we have had our eye on for the last year. First, to understand why today is their lucky day, peruse this New York Times article:

Here’s the lede from the article for those who refuse to click through:

On the same day Facebook announced that it had carried out its biggest purge yet of American accounts peddling disinformation, the company quietly made another revelation: It had removed 66 accounts, pages and apps linked to Russian firms that build facial recognition software for the Russian government.

This is a good step, but it isn’t the only housecleaning needed.

Available OSINT

If you open the current Maltego client you will see the Transform Hub, a listing of all service providers you can integrate. This is what you see this morning in the left hand column if you open the application on a 4k monitor. Note what is first and what is last.

SocialLinks Offerings In Maltego Transform Hub

I am the admin for the Maltego Group on LinkedIn. This is a screen shot I collected this morning. Implicit in this is a violation of Facebook terms of service; the source is SocialLinks CEO Alexandr Alexeev.

SocialLinks CEO Alexandr Alexeev Violates Facebook TOS

The company’s YouTube page has a variety of other demonstrations that indicate similar practices on other social media platforms. I snagged a screen shot from the video, which shows what they were doing a year ago.

SocialLinks Social Media Platforms

This alone is enough to set off alarm bells given the current climate. We know a bit more about this company, because I assumed their appearance on the Transform Hub meant they were a quality provider. I spent $130 for a month of service in November of 2017, setting off six weeks of curious encounters, both with them, and with uninterested counter-intelligence investigators on three continents.

Our Encounter

Needing to capture some Instagram content for a civil case, I grew weary of chasing threads after I got acquainted with the players, so I went looking for an automation solution that would capture their social network. I had an Instagram account to use from one of the parties to the case, but I assumed it would be burned the minute any evidence from it showed in a filing.

What I received just plain didn’t work. Not with Instagram, not with my personal LinkedIn network, not with Github, which should be fairly open.

SocialLinks Troubles

This went on for a while, at first with me sharing details of DNS resolution issues for their servers and other stuff that seemed like normal troubleshooting. When I went to LinkedIn and made contact with CEO Alexandr Alexeev, I was immediately suspicious given his Moscow location.

We had an initial conversation on LinkedIn, below is an example of how things started. We agreed to meet on Wire, and there I was asked about VPNs, provisioning proxies in the U.S., and other means to circumvent access restrictions. I played along, thinking that some counter-intel attention might be forthcoming.

Alexander Alexeev 2018-01-20

Counter-intelligence Failures

I felt there was plenty of reason for attention on this situation, but after a month of asking for attention, no fewer than three governments who should have been interested all struck out without so much as a single swing.

  • United States – target of Russian election interference.
  • United Kingdom – target of Russian referendum interference.
  • South Africa – home of Maltego maker Paterva.

I’m not saying that I typed this all into IC3, pressed ‘send’, and crossed my fingers. I had conversations with two people in the U.S. IC community, I talked to James Patrick at Liberty STRATCOM, and I have an associate who has a personal relationship with a brigadier in Hawks, South Africa’s Directorate for Priority Crime Investigation.

There are two ways to interpret this – either the system has already noticed and is working the problem, or the system is utterly overloaded and we are on our own. The fact that I got no response from any of the three countries made me think it was the latter. Having covered my own position, I sat back to see what happened next.

Further Cause To Act

Two months ago I met up with someone who had experiences with Social Links that were very similar to mine. I had already come to believe I was dealing with someone trying to “handle” me, albeit clumsily, and probably seeking advice from someone else in the process. Talking to this other party hardened that impression. That’s all I am going to say about this.

Removal

As a last ditch effort, I reached out to an FBI Special Agent I know last week. They had the weekend to think it over, Monday to act, and the NYT article is the last piece of stimulus I need. I just removed Alexeev from the LinkedIn Maltego Group and later today I am going to remove his posts, after announcing why he was removed.

Alexandr Alexeev Removal

Preservation & Publication

The particulars of what happened are important so I set out to preserve the details. Here are the steps I took:

  • Started Hunch.ly, preserved forty pages, both public and private. This application preserves content in an admissible fashion.
  • Launched MediaHuman’s YouTube Downloader just in case they or YouTube decide to wipe their channel.
  • Preserved the @_SocialLinks_ Twitter account for further analysis using both Maltego and our internal tools.
  • Wrapped up the exported Hunch.ly casefile, the Maltego graphs, and the videos, transferred it to a person who will hand carry it to a former U.S. Attorney we employ for certain sorts of touchy situations.

That last bit is just for my protection, given that our President and Attorney General both appear to be compromised by the Russian government. I will not be subject to a raid that deprives me of exculpatory evidence, followed by a politically motivated prosecution.

Conclusion

In the absence of any response from the agencies that ought to handle problems like this, I guess we are in charge for the moment. If you’ve had similar troubles with this company, I advise you to back up everything, transfer it to legal counsel so that it is protected from seizure by attorney-client privilege, then reach out to your local FBI field office.

I have shared the particulars with a couple reporters, too. If you have similar experiences and wouldn’t mind being interviewed on this, feel free to contact me.