Analyzing Search Clicks Data Using Flume, Hadoop, Hive, Pig, Oozie, ElasticSearch, Akka, Spring Data, Spark Streaming, HBase.
The repository contains unit/integration test cases that generate analytics based on click events related to product search on an e-commerce website.
Getting Started
The project is a Maven project and can be built with Eclipse. Check the pom dependencies for the relevant version of each application. It uses the Cloudera Hadoop distribution, version 2.3.0-cdh5.0.0.
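For example, assuming Maven is installed, a command-line build would look like:

$ mvn clean install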
Functionality
The scenarios covered in the application for search analytics using big data are as follows:
Events based:
Job based:
Hadoop
The application uses a mini HDFS and mini MR cluster for the test cases.
If you want to use an external HDFS location instead, change the relevant configurations and use it accordingly.
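As an illustration only (not the project's actual test bootstrap), a mini HDFS cluster can be started in a test roughly like this; the `/searchevents` path is a hypothetical location:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;

    public class MiniHdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Start an in-process HDFS cluster with a single DataNode for tests.
            MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
            FileSystem fs = cluster.getFileSystem();

            // Hypothetical location for search click events; the real path is project specific.
            fs.mkdirs(new Path("/searchevents"));

            System.out.println("Mini HDFS running at: " + fs.getUri());
            cluster.shutdown();
        }
    }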
Flume
FlumeAgentService controls mapping of search events to both HDFS and ES based on a multiplexing selector approach.
The application uses the built-in rolling file sink for the EmbeddedAgent. You can also set up and start an external Flume agent and point the embedded agent to it.
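As a rough sketch of such an external agent (agent/channel names, header values and the port are illustrative, not the project's exact settings), a multiplexing selector routing search events to an HDFS channel and an ES channel could look like:

    # Illustrative external Flume agent; sink definitions for HDFS/ES follow in later sections
    agent.sources = searchSource
    agent.channels = hdfsChannel esChannel
    agent.sinks = hdfsSink esSink

    agent.sources.searchSource.type = avro
    agent.sources.searchSource.bind = 0.0.0.0
    agent.sources.searchSource.port = 41414
    agent.sources.searchSource.channels = hdfsChannel esChannel

    # Multiplexing selector: route events by a header value to one or both channels
    agent.sources.searchSource.selector.type = multiplexing
    agent.sources.searchSource.selector.header = eventType
    agent.sources.searchSource.selector.mapping.searchclick = hdfsChannel esChannel
    agent.sources.searchSource.selector.default = hdfsChannel

    agent.channels.hdfsChannel.type = memory
    agent.channels.esChannel.type = memory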
JSONSerDe:
To map the JSON data in Hive queries, a custom SerDe is used. Create the jar and add it to your own Hive environment to query the data if you use an external Flume source as configured above.
To create the JSON SerDe jar,
$ jar cf jaihivejsonserde-1.0.jar org/jai/hive/serde/JSONSerDe.class
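In a Hive session, the SerDe can then be registered and used on the external table. The column names and HDFS location below are illustrative; only the SerDe class, the search_clicks table name and the year/month/day/hour partitioning come from this project:

    ADD JAR /path/to/jaihivejsonserde-1.0.jar;

    CREATE EXTERNAL TABLE IF NOT EXISTS search_clicks (
      customerid BIGINT,
      querystring STRING,
      clickeddocid STRING,
      createdtimestampinmillis BIGINT
    )
    PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING)
    ROW FORMAT SERDE 'org.jai.hive.serde.JSONSerDe'
    LOCATION '/searchevents';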
ElasticSearch
ElasticSearchJsonBodyEventSerializer:
A custom ES serializer is used to put data from Hadoop into ElasticSearch using Hive.
To create the ES JSON serializer jar,
$ cd target/classes
$ jar cf jaiflumeesjsonserializer-1.0.jar org/jai/flume/sinks/elasticsearch/serializer/ElasticSearchJsonBodyEventSerializer.class
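The serializer is then referenced from a Flume ElasticSearch sink configuration; the index name, index type, cluster name and host below are illustrative values, not the project's exact settings:

    agent.sinks.esSink.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
    agent.sinks.esSink.channel = esChannel
    agent.sinks.esSink.hostNames = 127.0.0.1:9300
    agent.sinks.esSink.clusterName = elasticsearch
    agent.sinks.esSink.indexName = recentlyviewed
    agent.sinks.esSink.indexType = clickevent
    agent.sinks.esSink.batchSize = 100
    agent.sinks.esSink.serializer = org.jai.flume.sinks.elasticsearch.serializer.ElasticSearchJsonBodyEventSerializer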
Product Search Functionality
ElasticSearch is used to index product data and to enable filtering on the products.
SearchCriteria stores different user selections, which can be a specific query string, sorting information, pagination information, different facet/filter selections, etc.
SearchQueryInstruction generates JSON data for customer clicks (a sample event is sketched after the list below), covering:
- Hadoop file storage based on Year/Month/Day/Hour
- ElasticSearch recently viewed items by customers
- Hive partition information
- External table search_clicks pointing to the above HDFS data location
- ElasticSearch customer top queries information
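A hypothetical click event produced by SearchQueryInstruction might look like the following; the field names and values here are illustrative, not the project's exact schema:

    {
      "eventid": "6a4c86f2-6e29-4f27-9b4e-0d2205f63a7b",
      "customerid": 12345,
      "sessionid": "a1b2c3d4",
      "querystring": "red shoes",
      "sortorder": "desc",
      "pagenumber": 1,
      "totalhits": 120,
      "clickeddocid": "PROD-9876",
      "createdtimestampinmillis": 1399386458000
    }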
Oozie
A coordinator job runs hourly to create Hive partitions based on the Hadoop data.
A bundle job queries the top query strings and indexes them to ElasticSearch on a daily basis.
LocalOozie is used to start the Oozie server for testing purposes.
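A rough sketch of the hourly coordinator mentioned above (the app name, workflow path and property names are illustrative, not the project's exact definitions):

    <coordinator-app name="add-hive-partition-coord" frequency="${coord:hours(1)}"
                     start="${startTime}" end="${endTime}" timezone="UTC"
                     xmlns="uri:oozie:coordinator:0.4">
        <action>
            <workflow>
                <app-path>${workflowAppPath}</app-path>
                <configuration>
                    <!-- Pass the nominal time down to the Hive action as partition values -->
                    <property>
                        <name>YEAR</name>
                        <value>${coord:formatTime(coord:nominalTime(), 'yyyy')}</value>
                    </property>
                    <property>
                        <name>MONTH</name>
                        <value>${coord:formatTime(coord:nominalTime(), 'MM')}</value>
                    </property>
                    <property>
                        <name>DAY</name>
                        <value>${coord:formatTime(coord:nominalTime(), 'dd')}</value>
                    </property>
                    <property>
                        <name>HOUR</name>
                        <value>${coord:formatTime(coord:nominalTime(), 'HH')}</value>
                    </property>
                </configuration>
            </workflow>
        </action>
    </coordinator-app>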
Spring Data Hadoop
Spring Data is used for Hive server management. Bean and context loading support manages the dependent start/shutdown of the different servers/services.
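A minimal sketch of a Spring for Apache Hadoop context managing an embedded Hive server (the property placeholder and port are illustrative, not the project's actual context files):

    <beans xmlns="http://www.springframework.org/schema/beans"
           xmlns:hdp="http://www.springframework.org/schema/hadoop"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.springframework.org/schema/beans
               http://www.springframework.org/schema/beans/spring-beans.xsd
               http://www.springframework.org/schema/hadoop
               http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

        <!-- Hadoop configuration shared by the Hive server -->
        <hdp:configuration>
            fs.defaultFS=${hd.fs}
        </hdp:configuration>

        <!-- Embedded Hive server started/stopped with the Spring context lifecycle -->
        <hdp:hive-server port="10234" auto-startup="true"/>

    </beans>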
Spark Streaming
Spark Streaming is integrated with Flume events to deliver the top search queries or the top viewed products in the last hour.
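A rough sketch of how Flume events could be consumed from Spark Streaming in Java; the host, port, batch/window sizes and the event-body parsing are illustrative assumptions, not the project's actual job:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.flume.FlumeUtils;
    import org.apache.spark.streaming.flume.SparkFlumeEvent;

    public class TopSearchQueriesStream {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("TopSearchQueries");
            // One-minute batches; the window below covers the last hour.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(60 * 1000));

            // Receive events pushed by a Flume Avro sink to this host/port (illustrative values).
            JavaDStream<SparkFlumeEvent> flumeStream = FlumeUtils.createStream(jssc, "localhost", 41111);

            // Pull the query string out of the event body; real parsing is project specific.
            JavaDStream<String> queries = flumeStream.map(new Function<SparkFlumeEvent, String>() {
                public String call(SparkFlumeEvent flumeEvent) {
                    return new String(flumeEvent.event().getBody().array());
                }
            });

            // Count each query over a sliding one-hour window, refreshed every minute.
            JavaPairDStream<String, Long> counts =
                    queries.countByValueAndWindow(new Duration(60 * 60 * 1000), new Duration(60 * 1000));
            counts.print();

            jssc.start();
            jssc.awaitTermination();
        }
    }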
HBase
A MiniHbaseCluster is set up to store the data. Spring Data is used for the HBase client. Integration with the Flume agent stores data directly in HBase using HbaseSink. HbaseJsonSerializer serializes the JSON data.
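As a sketch of the Flume side of this (a fragment of the agent configuration from the Flume section above): the table name, column family and the serializer's package path are assumptions here, not the project's exact values.

    agent.sinks.hbaseSink.type = org.apache.flume.sink.hbase.HBaseSink
    agent.sinks.hbaseSink.channel = hbaseChannel
    agent.sinks.hbaseSink.table = searchclicks
    agent.sinks.hbaseSink.columnFamily = client
    agent.sinks.hbaseSink.batchSize = 100
    agent.sinks.hbaseSink.serializer = org.jai.flume.sinks.hbase.serializer.HbaseJsonSerializer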
Schema Design,
HBase functionality,
Blog Posts
Check the blog posts below for details on how each functionality is used:
- Customer product search clicks analytics using big data
- Flume: Gathering customer product search clicks data using Apache Flume
- Hive: Query customer top search query and product views count using Apache Hive
- ElasticSearch-Hadoop: Indexing product views count and customer top search query from Hadoop to ElasticSearch
- Oozie: Scheduling Coordinator/Bundle jobs for Hive partitioning and ElasticSearch indexing