- Mon 31 May 2021
- server admin
- Gaige B. Paulsen
- #server admin, #elasticsearch
As I've possibly mentioned here before, ClueTrust is using Elasticsearch to perform analysis of log information. Recently, I finally decided to take some of our telemetry information and pull it into Elasticsearch as a data exploration and statistical tool.
Importing structured XML data into Elasticsearch
Although there are some filters and Logstash methods that have this capability, the XML that we use is extremely regular (strict schemas, etc.), and I felt it would be better to import it directly and intentionally, based on the DOM code that I'd created in 2009 when preparing for Cartographica to ship.
For purposes of illustration, the basic form of the Cartographica telemetry files is:
- Preamble
- Crash logs (yep, they're embedded)
- Event stream
- Errors
- Events (launch, quit, and other)
- Exceptions
- Statistics (at quit and other times)
Due to the way that Elasticsearch works, it turns out this is a really workable input, generating a (possibly too verbose) set of items from each telemetry report, including:
- Report
- Launch
- Event
- Crash
- Error
- Exception
- Statistic
For most of these items, the format is regular and arguments are inserted directly into the record (so a Crash has a crash log along with some interpretive data, as well as the preamble from the report). The same holds true for Events and Errors, which are basically individual data points inside of a preamble+launch context. The only oddity is the Statistic report, which contains many "columns" of data for each statistic event. It's not lost on me that this idempotent data set is very SNMP-like.
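To make that concrete, here is a rough sketch of two such documents as they might go into the bulk API. The report.* field names match the ones used by the transform later in this post; the index name, the event fields, and all of the literal values are illustrative shorthand rather than the actual schema:

POST _bulk
{ "index": { "_index": "ct-app-logs-2021.05" } }
{ "type": "launch", "report": { "timestamp": "2021-05-28T14:02:11Z", "host": { "id": "host-0001", "os": { "version": "11.3.1" } }, "application": { "version": "1.5.1" } } }
{ "index": { "_index": "ct-app-logs-2021.05" } }
{ "type": "event", "event": { "name": "importVector", "format": "Shapefile" }, "report": { "timestamp": "2021-05-28T14:03:40Z", "host": { "id": "host-0001", "os": { "version": "11.3.1" } } } }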
Searching for meaning among the data
Due to the choice to separate these out as separate objects in Elasticsearch, most statistical information is straightforward to ascertain. Want to know what formats are most popular? Look for importVector or importRaster events and tabulate the number of times each format and/or driver is used. Interested in how frequently a particular analysis tool is used? Look for its corresponding event.
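As a sketch of what that tabulation looks like, using the illustrative event fields from above (the real mapping may differ):

GET ct-app-logs-*/_search
{
  "size": 0,
  "query": {
    "term": { "event.name": "importVector" }
  },
  "aggs": {
    "formats": {
      "terms": { "field": "event.format" }
    }
  }
}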
When (and how) to pivot your data
The one piece that had me stumped for a few days was: how do I determine how many active users are on which version of macOS? I've got launch data and a unique (but pseudonymous) host identifier. The obvious approach is to create buckets based on the OS and count unique host IDs... unfortunately, that creates a data problem with users who have upgraded during the time period being examined. Using this technique, a user who was running macOS 11 and upgraded on each release day would account for 10 separate macOS counters.
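That naive approach looks something like this sketch, bucketing on the OS version and counting distinct hosts with a cardinality sub-aggregation:

GET ct-app-logs-*/_search
{
  "size": 0,
  "aggs": {
    "os_versions": {
      "terms": { "field": "report.host.os.version" },
      "aggs": {
        "hosts": {
          "cardinality": { "field": "report.host.id" }
        }
      }
    }
  }
}

A host that upgraded during the window appears in every bucket it ever reported from, which is exactly the over-counting described above.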
What I really needed was to look at the data for just the most recent report for each host ID. Basically, I needed to pivot around the host ID. After much too much time trying to find a complex way through this, I finally searched on "pivot elasticsearch" and found Pivot Transformations, which turns out to be just what I needed. By creating a transformed index with the latest method, I was able to get an index that only pointed to the most recent documents for each host ID. Once I had this, I could aggregate using terms to find the operating system, resulting in a bucket of OS versions used most recently by each host ID.
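Against the pivoted index (called ct-app-hosts below), that aggregation is about as simple as it gets; something like:

GET ct-app-hosts/_search
{
  "size": 0,
  "aggs": {
    "os_versions": {
      "terms": { "field": "report.host.os.version" }
    }
  }
}

Because the pivoted index holds exactly one document per host ID, the doc count in each bucket is a count of hosts.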
Pivot Transformations create an alternate index to documents in another index. In my case, I used a latest transformation, which maintains only the most recent item for each unique key, based on the specified timestamp field and possibly limited by a filter. Here is the definition I used:
{
  "source": {
    "index": [
      "ct-app-logs-*"
    ]
  },
  "latest": {
    "unique_key": [
      "report.host.id"
    ],
    "sort": "report.timestamp"
  },
  "description": "Application Hosts",
  "frequency": "1m",
  "dest": {
    "index": "ct-app-hosts"
  }
}
This creates the new index based on the existing ct-app-logs-* pattern, pulling out items by the unique report.host.id key and using report.timestamp to determine which item is the most recent. This boils down 12 indexes containing 18.5M documents into a single index containing 16K documents.
The destination index, ct-app-hosts, was set up ahead of time using a basic clone of the original index.
If desired, I could add a query key, which would have limited the scope of the documents.
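For completeness, here is how a definition like this gets installed and started via the transform APIs; the name ct-app-hosts-latest is just my label for the example, not necessarily what the production transform is called. (A transform that should update continuously would also include a sync section naming the time field; as written, this runs as a batch.)

PUT _transform/ct-app-hosts-latest
{
  "source": { "index": [ "ct-app-logs-*" ] },
  "latest": {
    "unique_key": [ "report.host.id" ],
    "sort": "report.timestamp"
  },
  "description": "Application Hosts",
  "frequency": "1m",
  "dest": { "index": "ct-app-hosts" }
}

POST _transform/ct-app-hosts-latest/_start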
Upping the ante for aggregation
Once I got this going, I was having some issues pulling information out of the data due to some variances in how versions were managed. In particular, I was interested in seeing major OS versions (macOS 10.15, macOS 11, iOS 14, etc.) and maybe the same for the application version.
To facilitate this, I used runtime fields, setting up two additional mappings in the destination index (ct-app-hosts) mentioned above. To do this, I PUT a new index definition containing the following:
{
  "mappings": {
    "runtime": {
      "major_app": {
        "type": "keyword",
        "script": {
          "source": """
            // skip documents with no application version
            def myField = doc['report.application.version'];
            if (myField.empty)
              emit("");
            else {
              def dom = myField.value;
              // strip anything from the first a, d, or b onward (e.g. 1.4.5b7 -> 1.4.5)
              for (String suffix : ['a','d','b']) {
                if (dom.indexOf(suffix) > 0) {
                  dom = dom.substring(0, dom.indexOf(suffix));
                }
              }
              // drop the final version component, unless only one '.' remains
              int last = dom.lastIndexOf('.');
              if (last == dom.indexOf('.'))
                emit(dom);
              else
                emit(dom.substring(0, last));
            }""",
          "lang": "painless"
        }
      },
      "major_os": {
        "type": "keyword",
        "script": {
          "source": """
            // skip documents with no OS version
            def myField = doc['report.host.os.version'];
            if (myField.empty)
              emit("");
            else {
              def dom = myField.value;
              // drop the final version component
              int last = dom.lastIndexOf('.');
              def major = dom.substring(0, last);
              // normalize Big Sur, which can report as either 11.x or 10.16
              if (major.startsWith('11') || major == '10.16') {
                emit('11');
              } else {
                emit(major);
              }
            }""",
          "lang": "painless"
        }
      }
    }
  }
}
This creates a new column, major_app, which:
- Checks that report.application.version exists
- Removes any suffix starting with a, b, or d (so 1.4.5b7 becomes 1.4.5)
- Removes the last . (unless it is also the first ., such as in 1.4)
Similarly, it creates a column, major_os, which:
- Checks that report.host.os.version exists
- Removes the last .
- Makes sure to emit 11 for 10.16 (Big Sur reports itself as either 11.x or 10.16, depending on how the version is queried)
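With the runtime fields in place, the original question reduces to a terms aggregation on major_os over the pivoted index, optionally broken down by major_app; roughly:

GET ct-app-hosts/_search
{
  "size": 0,
  "aggs": {
    "os": {
      "terms": { "field": "major_os" },
      "aggs": {
        "app": {
          "terms": { "field": "major_app" }
        }
      }
    }
  }
}

Each bucket's doc count is the number of hosts whose most recent report came from that OS version, so a machine that upgraded is counted exactly once.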
With these two powerful tools, I was able to create a clear, concise, and constantly up-to-date resource for OS and application usage.