Trace Investigation with HDInsight Hive query


This article is to parallelize  trace investigations using Hadoop Map Reduce and thereby reduce time and effort in investigations. 

Current Problem

1.       During on call events there is a time and effort investment in recognizing the exact issues. There is no mechanism which can give a prediction of probable issues.
2.       Traces are big data. Huge GB files are scanned to filter out exact traces. Currently we scan through traces which we download as buffer locally. Here Parallel execution of multiple filters is putting more load to the system.
3.       We are having 10-15 filter strings but executing all of them in one go is not possible in current scenario.

Proposed Solution
We propose a solution of using HDInsight to use Hive Query and do parallel execution using map reduce . Map Reduce is a technology to divide the large data in multiple chunks and send it to mappers. Mappers are executors which work on small data size and provide output. Outputs from different mappers are combined and reduced to a cumulative output using Reduceer.

We can execute Hive query language implementation on the winfab traces to accomplish multiple known filters consequently on the big data and filter out the traces in text format.

Before going forward here are few terminology that will help us the technologies used for this parsing.

What is Hive
Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.
Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

What is HDInsight

HDInsight is a Hadoop distribution powered by the cloud. This means HDInsight was architected to handle any amount of data, scaling from terabytes to petabytes on demand. You can spin up any number of nodes at anytime.

Powershell Script

In powershell execute Get-AzurePublishSettingsFile.
Currently we have a hdinsight created in a different Azure account.

1.       Save the publishsetting file in some location
2.       Execute Import-AzurePublishSettingsFile -PublishSettingsFile <<filepath where you saved the publishsettingfile>>
3.       Verify connection using Use-AzureHDInsightCluster -Name wftraceanalyzer
Note: There is an issue in connecting to the cluster, it might be because of incompatible powershell and Hadoop SDK. Please read this link
4.       After verification, execute the following powershell script

$clusterName = "<<hdinsight cluster name>>"
$subscriptionName ="<<subscription name>>"
$storageAccountName="<<storage account name>>"
$storageAccountKey = "<<storage key>>"
$destfolder ="hive/warehouse/traces/"
$logoutputfolder = "F:\Hadoop\Trace"

$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

$files = Get-ChildItem $inputfilepath\*.* -Include *.trace | Where-Object {$_.Name -match "fabric_trace"}
foreach($file in $files)
    $fileName = $file.Name   
    $blobName = "$destfolder$fileName"
    Set-AzureStorageBlobContent -File $file -Container $containerName -Blob $blobName -Context $destContext -Force

#Hive query to create the table on the file structure foundation and load the trace data in hive table
$querystring="DROP TABLE trace; CREATE TABLE trace (date STRING, level STRING, process INT, thread INT, type STRING, text STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; LOAD DATA INPATH '/hive/warehouse/traces/' INTO TABLE trace;"
$jobDef = New-AzureHDInsightHiveJobDefinition -Query $querystring
$hivejob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef

#Troubleshooting 1:type~RE.PrimaryReplicate@<PartitionId>
$querystring= "SELECT * from trace where type like '%RE.PrimaryReplicate%'"
$response = Invoke-Hive $querystring | Out-File $logoutputfolder\REPrimaryReplicate.txt

#Troubleshooting 2:type~FM.*Upgrade
$querystring= "SELECT * from trace where type like '%FM%Upgrade%'"
Invoke-Hive $querystring | Out-File $logoutputfolder\FMUpgrade.txt

#Troubleshooting 3:type~FM.AppUpgrade
$querystring="SELECT * from trace where type like '%FM%AppUpgrade%'"
Invoke-Hive $querystring | Out-File $logoutputfolder\FMAppUpgrade.txt

#Troubleshooting 4:type~FabricNode.NodeOpening
$querystring="SELECT * from trace where type like '%FabricNode%NodeOpening%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\FabricNodeOpening.txt

#Troubleshooting 5:type~CM.*Upgrade && level~Warning
$querystring="SELECT * from trace where type like '%CM%Upgrade%' and level like '%arning%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\CMUpgrade.txt

#Troubleshooting 6:type~FM.*Upgrade && ~pending
$querystring="SELECT * from trace where type like '%FM%Upgrade%' and text like '%pending%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\FMUpgradePending.txt

#Troubleshooting 7:type~FM.FabricUpgrade && ~pending
$querystring="SELECT * from trace where type like '%FM%FabricUpgrade%' and text like '%pending%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\FMFabricUpgradePending.txt

#Troubleshooting 8:type~RA.Upgrade
$querystring="SELECT * from trace where type like '%RA%Upgrade%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\RAUpgrade.txt

#Troubleshooting 9:type~RA.FT|RAP.FT|RAP.Api && text ~ "on node <node_id>"
$querystring="SELECT * from trace where type like '%RA%FT' or type like '%RAP%FT%' or type like '%RAP%Api' and text like '%on node%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\RAFT.txt

#Troubleshooting 9:type~RE.PrimaryCatchUpQuorum
$querystring="SELECT * from trace where type like '%RE%PrimaryCatchUpQuorum'"
Invoke-Hive $querystring| Out-File $logoutputfolder\REPrimaryCatchUpQuorum.txt

#Troubleshooting 10:type~Lease.SlowSendCompletion
$querystring="SELECT * from trace where type like '%Lease%SlowSendCompletion%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\LeaseSlowSend.txt

#Troubleshooting 11:type~Config && text~value
$querystring="SELECT * from trace where type like '%Config%' and text like '%value%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\Config.txt

5.       There are bunch of filters which are taken and sent using Hive query and corresponding logs are stored in output folder

Input folder: where the traces are kept

Output folder:

The execution takes 5-15 min based on the data size. Eventually it gives all the traces.



  1. very informative blog and useful article thank you for sharing with us , keep posting learn more Big Data Hadoop Online Training Hyderabad


Post a Comment

Popular posts from this blog

Firebase authentication with Ionic creator

Big Data - SWOT Analysis

LINKEDIN api call using NODE.JS OAUTH module