The purpose of this blog post is to share some ideas on how to identify trends and potential performance issues in your data, machines, or infrastructure. I am not here to tell you how to solve those issues; I just want to share the approach we are using to check how things are working and, maybe, to find improvements.
In one of the utilities we follow there are four server machines, accessed through Terminal Server, where users log in, start ArcMap and do their job. What we want is to check that all the machines perform well and to spot trends that may affect the users' work; finding problematic datasets along the way would be a plus.
Collect some data
The main tool to collect useful information is ESRI PerfQAnalyzer (https://blogs.esri.com/esri/arcgis/2017/01/05/perfqanalyzer-new-10-5-version-build172-available-for-download/).
We configured this tool on every machine, scheduled to run each hour, to fetch a list of specified FeatureClasses on a specified extent (I can detail this part further if needed).
The output is a collection of log files with a lot of information in them; the entries we are interested in are:
Open workspace Milliseconds ET: "529,2688"
Opened (esriGeometryPolyline) feature class: "RETI.ScFlCostruzione", Storage: "esriGeometryStorageST", Milliseconds ET: "584,1443"
Fetch found "0" "RETI.ScFlCostruzione" feature(s) within extent (665133,542399998. 5170859,32710001. 665142,551599999. 5170862,6987. 103532), Milliseconds ET: "24,6214"
As you can see we have an Open Workspace entry, which can be useful, along with the open and fetch times for every FeatureClass scheduled in the script.
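Our parser is a C# console application, but the extraction logic is simple enough to sketch in a few lines of Python. This is only an illustration, not our actual code; the regular expression is an assumption based on the sample lines above, and note that the logs use a comma as decimal separator ("24,6214" ms), which we normalize to a dot before converting:

```python
import re

# Assumed pattern, derived from the sample fetch line shown above.
FETCH_RE = re.compile(
    r'Fetch found "(\d+)" "([^"]+)" feature\(s\).*Milliseconds ET: "([\d,]+)"')

def parse_fetch_line(line):
    """Return (feature_class, found_count, elapsed_ms) or None if no match."""
    m = FETCH_RE.search(line)
    if m is None:
        return None
    count, fclass, ms = m.groups()
    # The logs use a comma as decimal separator; convert it for float().
    return fclass, int(count), float(ms.replace(",", "."))

line = ('Fetch found "0" "RETI.ScFlCostruzione" feature(s) within extent '
        '(...), Milliseconds ET: "24,6214"')
print(parse_fetch_line(line))  # → ('RETI.ScFlCostruzione', 0, 24.6214)
```

The open-workspace and open-feature-class lines can be handled the same way with their own patterns.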
One piece of information we wanted to add was something about the users, to see whether active users and open ArcMap sessions could influence the performance. For this purpose we created a C# console application that appends to each log file (so we can parse a single file) the counts of active users, connected users and open ArcMap sessions:
Total Users: 1
Active Users: 1
Open ArcMap: 0
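Reading those appended counters back is straightforward. A minimal Python sketch, assuming the counters always use the exact labels shown above:

```python
import re

# Assumed format: the three "Label: number" lines our console app appends.
COUNT_RE = re.compile(r'^(Total Users|Active Users|Open ArcMap):\s*(\d+)$', re.M)

tail = """Total Users: 1
Active Users: 1
Open ArcMap: 0"""

counts = {name: int(value) for name, value in COUNT_RE.findall(tail)}
print(counts)  # → {'Total Users': 1, 'Active Users': 1, 'Open ArcMap': 0}
```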
Parsing the log files
Now we have four folders full of wonders, and we need to gather all the information into a common place. The best place to analyze data is Excel, so we created a C# console application whose job is collecting and aggregating everything. We decided to create two worksheets for every machine, one for the data and one for the graphs, plus one worksheet used to compare the data from all the different machines.
The C# library used for the Excel operations is EPPlus (EPPlus - Create advanced Excel spreadsheets on the server), a very powerful tool as you will see.
The C# app accesses the remote folders and, for each log file inside them, parses the relevant information and transfers it into the corresponding worksheet.
Here is what the data sheets look like:
A lot of columns and a lot of rows (we converted the times to seconds). At the bottom of the fetch times we added SUM, AVERAGE and MAX functions on the columns, and under these three rows we added the user and ArcMap counts:
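In our workbook those three rows are plain Excel formulas written by EPPlus, but the computation itself is trivial. A Python sketch with invented sample values (the second class name is made up for the example):

```python
# Sample fetch times in milliseconds per FeatureClass column.
# "RETI.OtherClass" is a hypothetical name used only for this example.
fetch_ms = {
    "RETI.ScFlCostruzione": [24.6214, 31.0, 584.1443],
    "RETI.OtherClass":      [12.5, 9.75, 10.0],
}

summary = {}
for fclass, values in fetch_ms.items():
    secs = [ms / 1000.0 for ms in values]  # convert to seconds, as we do in the sheet
    summary[fclass] = {
        "SUM": sum(secs),
        "AVERAGE": sum(secs) / len(secs),
        "MAX": max(secs),
    }

print(round(summary["RETI.ScFlCostruzione"]["MAX"], 4))  # → 0.5841
```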
Creating the graphs and analyzing the results
A big wall of text is not really useful, except maybe for the statistics calculated at the bottom of the columns, but with some visual aids we'll try to bring out some value.
The first graph we want is a line visualization of the single fetch times of every class, compared to the active users and ArcMap sessions. To achieve this we used EPPlus to create a line chart, add all the fetch times, then add an area chart on the secondary Y axis and add the counts to it. This is a portion of the result on the first machine:
As we can see, things are starting to pop out. The green areas are the ArcMap counts, and there is a big spike at midnight which is not related to user activity (no one was using ArcMap; we checked and it was a scheduled compress operation), but during daytime there wasn't really anything strange. Comparing the graphs between machines we don't see any particular difference, but the spike is really high, and so is the time frame, so something may be hidden right there. Using the filter options of the Excel graph we can look at just a particular time frame during the working day:
The beauty of Excel shines here, recreating the graph with new scales. This zoom on a portion of the data once again gives us new information: the pink line tends to be above the others, which happens on the other machines too, giving us a first thing to look at, with a precise name.
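Spotting the "line above the others" can also be automated instead of eyeballed. A minimal sketch that ranks FeatureClasses by average fetch time; all class names except RETI.ScFlCostruzione are invented for the example:

```python
from statistics import mean

# Hypothetical fetch times in seconds per FeatureClass.
fetch_secs = {
    "RETI.ScFlCostruzione": [0.020, 0.025, 0.031],
    "RETI.ClassA": [0.150, 0.180, 0.160],
    "RETI.ClassB": [0.040, 0.035, 0.050],
}

# Sort classes from slowest to fastest by mean fetch time.
ranked = sorted(fetch_secs, key=lambda c: mean(fetch_secs[c]), reverse=True)
print(ranked[0])  # → RETI.ClassA
```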
The second graph we wanted to create is one with the statistics and the Open Workspace times, again compared to the user and ArcMap counts (NOTE: the left axis shows times and each gridline counts as 10 seconds; I am showing only the right one with users, where each gridline counts as one):
Now we can see similar results, but with a more evident spike in the morning around 4:30. At that time every FeatureClass performed slightly worse, which added up in the total fetch time, while the average and max stayed low. This is something else to be investigated; it happens on every machine, on multiple days. Once again we see that there is no evident correlation between the active users and the fetch times.
Another interesting thing that popped out is on the fourth machine, whose Open Workspace times are almost nonexistent. We investigated this and it turns out that this machine is on the same network as the database, while the other three are on a different switch. This kind of information is precious for highlighting network latencies that can be improved.
Lastly, we wanted a recap graph putting everything together to compare the different machines, so we created a line chart showing the sum of the fetch times from the different machines and added a stacked area chart with the count of open ArcMap sessions, to have a total and also an idea of whether the load balancer is doing its job. This is the graph filtered on the same time span as above:
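The series behind that recap chart are just two aggregations over the per-machine samples: one fetch-time series per machine, plus one ArcMap total across all machines per timestamp. A sketch with invented machine names and numbers:

```python
from collections import defaultdict

# Hypothetical samples: (timestamp, machine, total_fetch_seconds, open_arcmaps).
samples = [
    ("08:00", "TS1", 1.2, 3), ("08:00", "TS2", 0.9, 2),
    ("09:00", "TS1", 1.5, 4), ("09:00", "TS2", 1.1, 5),
]

fetch_by_machine = defaultdict(dict)  # one line series per machine
arcmaps_total = defaultdict(int)      # one stacked-area total per timestamp
for ts, machine, fetch, arcmaps in samples:
    fetch_by_machine[machine][ts] = fetch
    arcmaps_total[ts] += arcmaps

print(fetch_by_machine["TS1"])  # → {'08:00': 1.2, '09:00': 1.5}
print(dict(arcmaps_total))      # → {'08:00': 5, '09:00': 9}
```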
What we discover is that two machines are faster and two are slower. Looking at the complete graph we see the usual compress spikes and those mysterious 4:30 spikes we are looking into, but the important thing we now know is that there is no correlation between the TOTAL number of open ArcMap sessions (which should be the working users) and the total fetch times. I am not really a DBA, I am just a programmer, but to me this means that the database has no problem serving the data to multiple users, and the problems may be somewhere else (indexes? performance collection?).
Having a lot of data may be scary, but showing it in the right way, in an automated way, will save you time and give you warnings. We now have different things to look at, like the mysterious spike at 4:30 or why two machines are slower, and we also have some FeatureClasses to check. The tool is very fast and can be scheduled, saving the time a user would have spent creating the right graphs or putting together the data from the raw log files.
We added some parameters to the tool, to let it gather only daytime work logs, to use the open time instead of the fetch time, and to gather just one month of data.
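The filtering itself is a simple predicate over each sample's timestamp. A hedged sketch; the cutoff hours, the one-month window and the timestamp format are assumptions, not the tool's actual parameters:

```python
from datetime import datetime

def keep_sample(timestamp, now, daytime_only=True, max_age_days=31):
    """Decide whether a log sample passes the daytime and age filters.

    The 8:00-18:00 working hours and the "%Y-%m-%d %H:%M" format are
    assumptions made for this example.
    """
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
    if daytime_only and not (8 <= ts.hour < 18):
        return False
    return (now - ts).days <= max_age_days

now = datetime(2018, 3, 1, 12, 0)
print(keep_sample("2018-02-20 09:30", now))  # → True
print(keep_sample("2018-02-20 04:30", now))  # → False (outside working hours)
```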
Do you have similar ways of looking at your performance? Do you have ideas on how to improve this tool, or do you want more information about it?
A good idea could be to add Oracle information to the graphs, like active connections, but we don't have access to that kind of information; note that you don't need administrative privileges on the database to collect the data reported here.
Let me know what you think!