Online vs Offline Big Data Solutions

Big Data can take both online and offline forms. Online Big Data refers to data that is created, ingested, transformed, managed and/or analyzed in real time to support operational applications and their users. Big Data is born online. Latency for these applications must be very low and availability must be high in order to meet SLAs and user expectations for modern application performance. This includes a vast array of applications, from social networking news feeds to analytics to real-time ad servers to complex CRM applications. Examples of online Big Data databases include MongoDB and other NoSQL databases.

Offline Big Data encompasses applications that ingest, transform, manage and/or analyze Big Data in a batch context. They typically do not create new data. For these applications, response time can be slow (up to hours or days), which is often acceptable for this type of use case. Since they usually produce a static (vs. operational) output, such as a report or dashboard, they can even go offline temporarily without impacting the overall goal or end product. Examples of offline Big Data applications include Hadoop-based workloads; modern data warehouses; extract, transform, load (ETL) applications; and business intelligence tools.

Organizations evaluating which Big Data technologies to adopt should consider how they intend to use their data. Those looking to build applications that support real-time, operational use cases will need an operational data store like MongoDB. Those that need a place to conduct long-running analysis offline, perhaps to inform decision-making processes, will find that offline solutions like Hadoop can be an effective tool. Organizations pursuing both use cases can do so in tandem, and they will sometimes find integrations between online and offline Big Data technologies. For instance, MongoDB provides integration with Hadoop.
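
As a rough illustration of the difference in access patterns, the sketch below contrasts an online, low-latency lookup against MongoDB with an offline batch job submitted to Hadoop. The database, collection, jar and path names are hypothetical placeholders, not part of any specific integration.

# Online: an operational read served directly by MongoDB with low latency.
# "appdb", "users" and the query value are made-up examples.
mongo appdb --eval 'printjson(db.users.findOne({ _id: "user-42" }))'

# Offline: a long-running batch analysis submitted to a Hadoop cluster.
# "analytics.jar", the main class and the HDFS paths are placeholders.
hadoop jar analytics.jar com.example.UserActivityReport \
    /data/events/2014-04 /reports/user-activity/2014-04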

Caching static objects for better performance in AEM 5.6.1

Some content, i.e. CSS, images and icons, does not change frequently, so the system should be configured so that these objects do not expire in the client browser, reducing unnecessary HTTP traffic.

This offloads requests from the server and also improves the user-perceived response time for pages and other HTTP requests, because the browser stores the objects in its cache based on the expiration date configured for the path.

Every HTTP response carries header information, and the HTTP protocol allows the browser to cache objects based on that information.

Apache configuration changes

Uncomment this in httpd.conf.

LoadModule expires_module modules/mod_expires.so

and add the following directives for the content path or any directory/folder that you want cached.

<Location /libs>
ExpiresActive On
ExpiresByType text/css "access plus 1 month"
ExpiresByType application/x-javascript "access plus 1 hour"
ExpiresByType image/png "access plus 1 month"
ExpiresByType image/gif "access plus 1 month"
</Location>


<Location /etc/designs>
ExpiresActive On
ExpiresByType image/gif "access plus 1 hour"
ExpiresByType image/png "access plus 1 hour"
ExpiresByType image/svg+xml "access plus 1 hour"
ExpiresByType image/jpeg "access plus 1 hour"
ExpiresByType image/jpg "access plus 1 hour"
ExpiresByType application/x-javascript "access plus 1 hour"
ExpiresByType text/css "access plus 1 hour"
</Location>
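
After adding these directives, it is worth validating the configuration and reloading Apache so the new expiry rules take effect; a minimal sketch:

# Check the configuration syntax, then reload Apache gracefully.
apachectl configtest
apachectl -k graceful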


How to ensure it is effective?

If expiry is not active

We will see a log entry like this in Apache's access.log for every image object on each request:
127.0.0.1 - - [11/Apr/2014:16:02:35 -0500] "GET /content/dam/IMAGES/Brand%20Assets/hero-home-1.jpg HTTP/1.1" 200 496832

That means an image file of 496832 bytes is being downloaded on each request; 200 is the HTTP success status.

If expiry is active, ideally we should see logs like this:
127.0.0.1 - - [11/Apr/2014:16:53:32 -0500] "GET /content/dam/IMAGES/Brand%20Assets/hero-home-1.jpg HTTP/1.1" 304 -


The 304 status indicates that the browser has already cached the image and the server has no newer version, so Apache does not send a new copy back to the browser, reducing HTTP traffic.

Run a load test and check the logs.
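
Before a full load test, a quick spot check is to request one of the cached paths with curl and confirm that the Expires and Cache-Control headers are present; the host and path below are example values only.

# Fetch only the response headers for a sample CSS file under /etc/designs.
# Replace the host and path with values from your own environment.
curl -I http://localhost/etc/designs/myproject/clientlibs.css | grep -iE "HTTP/|expires|cache-control"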


Custom Rewriter Transformer to rewrite any HTML Output generated by Sling Rendering Process

Problem Context: I came across a unique business problem and found that no such solution was available on the web, so writing this post might help others. Sling link rewriting based on the sling rewriter pipeline is mainly focused on how to rewrite anchor links inside or …

MySQL Setup, first login as root, and troubleshooting

Presuming you are using Red Hat or CentOS and the yum command is available:

  1. yum install mysql-server
  2. /etc/init.d/mysqld start
  3. mysqladmin -u root password '{password}'
  4. If step 3 gives an error, reset the default password this way:
    /etc/init.d/mysqld stop
    mysqld_safe --skip-grant-tables &
    mysql -u root
    mysql> use mysql;
    mysql> update user set password=PASSWORD("newrootpassword") where User='root';
    mysql> flush privileges;
    mysql> quit
    /etc/init.d/mysqld stop
    /etc/init.d/mysqld start
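
Once MySQL is back up with the normal grant tables, it is worth confirming that the new root password actually works; a minimal check:

# Log in with the new password and run a trivial query to confirm access.
mysql -u root -p"newrootpassword" -e "SELECT VERSION();"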


Strategic thoughts when choosing new big data storage technology

Before 2000, the primary challenge for companies was to enable systems so that transactional data could be captured faster for organizational productivity. Now the gear has shifted towards delivering information to business users through reporting, analytical systems and actionable drill-down dashboards, built on the files, data, and audio and video streams that organizations have stored on proprietary clustered and open source file systems according to their business needs and suitability.

Organizations have been using storage technology for decades to store information on clustered file systems mounted on multiple servers (and some that are not), but the complexity of the underlying storage environment increases as new servers/systems are added for scalability.

Now, organizations that want to monetize, better analyze and capitalize on their information channels and integrate them with the business are, depending on the big data storage they opted for in the past, facing challenges of large-scale data indexing and on-demand availability with low latency. So some of them are choosing, changing or integrating better enterprise-class large-scale data storage, connected through some connector technology, to tap and monetize the value of the information they hold in less time.

We should understand the technologies, protocols and network challenges, and their features, when thinking of adopting new big data storage. There are a few architectural approaches to how clustering works in such scenarios.

Shared disk: It uses a SAN (storage area network) at the block level. There are again a few approaches: some distribute information across all the servers in the cluster and some employ a centralized metadata server. SGI CXFS, IBM GPFS, DataPlow, Microsoft CSV, Oracle CFS, Red Hat GFS, Sun QFS, VMware VMFS, Ceph, etc. are among the most widely used cluster file systems.

Distributed file system: It uses a network protocol; Lustre's data storage technology is very popular in this space, Ceph has also come up with an offering here, and Microsoft has DFS too.

NAS: It uses file-based protocols, i.e. NFS and SMB (Server Message Block)/CIFS (Common Internet File System).

Shared-nothing architecture: Each storage node communicates changes to the others or to a master for replication. Ceph, Lustre and Hadoop are a few implementers of this.

The most suitable technology selection reduces time to solution and keeps the budget under control as well, so based on the above architectures, let us list the general and most critical SLAs expected from such solutions.

Common selection parameters for big data storage technology

  • High availability
  • Scalability of data storage within fixed IT budgets
  • Fault tolerance
  • Cost of ownership, commodity hardware
  • Global workload sharing
  • MapReduce algorithm support on high bandwidth
  • Reduced time to solution
  • Centralized storage management
  • Support for a wide range of hardware and software
  • High application I/O support for analytic systems
  • Event stream processing/storage
  • Caching of data for better performance
  • Holistic network design (Unified Ethernet Fabric)

List of big data storage technologies available in the market

Most of the features listed above are claimed to be supported by the technologies below, except Unified Ethernet Fabric, which is a separate case; Cisco has network-related offerings to scale out big data storage.

HDFS

It is the de facto solution in big data technology for large-scale data processing over clusters of commodity hardware and is very suitable for that. However, if you are trying to process dynamic datasets (data in motion), ad-hoc analysis or graph data structures, stop and read about Google's alternatives to the MapReduce paradigm (Percolator, Dremel and Pregel). Cassandra and enterprise versions of HDFS are trying to provide improvements and solutions in this area.
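
For the common batch case, interacting with HDFS is typically a matter of staging data into the distributed file system and running a MapReduce job over it; the paths and jar name below are illustrative placeholders, not from any particular distribution.

# Stage a local log file into HDFS, then run the stock wordcount example over it.
# "hadoop-examples.jar" stands in for whichever examples jar ships with your distribution,
# and the output part-file naming can differ between MapReduce versions.
hadoop fs -mkdir /data/logs
hadoop fs -put access.log /data/logs/
hadoop jar hadoop-examples.jar wordcount /data/logs /output/wordcount
hadoop fs -cat /output/wordcount/part-* | head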

GPFS

It has been available in the market since 1993 and thousands of organizations are using it (pharmaceutical, financial institutions, life sciences, US national weather forecasting, the energy sector, etc.). It runs on commodity hardware as well and supports many operating systems and platforms. It claims to work with low-latency ad-hoc analysis and streaming data at very high volumes. It is a proprietary offering from IBM and of course one of the very suitable big data storage options if licensing is not a concern. Cluster manager failure, file system manager failure, secondary cluster/configuration server failure and rack failure are claimed to be addressed with GPFS-SNC.

Lustre Distributed File System

This is a well-recognized, scalable cluster computing file storage system that is widely used by supercomputers, and it has open licensing. There are many commercial suppliers that bundle it with hardware, like NetApp and Dell. It also claims to fulfill all of the requirements listed above, including low latency for analytical systems.

Isilon's OneFS by EMC

This is a major entrant in the big data storage arena, and companies like Oracle and IBM are also taming this big data beast. EMC re-engineered HDFS and created its own version of data storage in late 2011, which uses the MapR file system. The MapR file system claims to be an alternative to HDFS with full random-access read/write support, plus advanced snapshot and mirroring features, and it addresses the single point of failure of HDFS's centralized NameNode metadata.

NetApp’s RAID array on HDFS.

NetApp claims its improvements to HDFS make it faster and more reliable, but it still relies on HDFS.

Cleversafe's dispersed computation file system offers highly scalable object-based storage.

Appistry and KosmosFS are a few more computational big data storage options.

Conclusion

In order to monetize information and present actionable insights to the business for proactive organizational decision making, analytic systems rely heavily on data storage technologies, which drive how fast data is made available to frontend and middleware applications. As we know, GFS-inspired HDFS is cheaper and rock solid, but for scalability beyond petabytes perhaps the enterprise-class solutions, such as EMC, IBM's GPFS or the many others available in the market, are worth considering.

But remember that many commercial offerings do not run on commodity hardware, and the cost advantage of HDFS and related bundles is fundamental to its current success and growing popularity. The low-latency issue with HDFS can be addressed with the right design/implementation by skilled big data technical experts. Whether an organization prefers a pre-bundled commercial offering or open source HDFS, which allows customizing to organizational needs in a flexible manner, ultimately depends on the business requirements and the investment budget for the solution.

IBM Big Data GPFS Protocol

GPFS is an advanced offering similar to HDFS that has existed in the market since 1993, and thousands of companies are using it, e.g. in the energy sector, social media, analytics and finance.

It primarily focuses on the following items:

  1. High performance
  2. Availability
  3. Advanced replication
  4. Multi-site caching
  5. Reliability
  6. Fault tolerance

Vestas Wind Systems is one notable GPFS use case; they have been using it for more than a decade.

* They have created a wind pattern information system for clients that provides good analytics of wind information from all around the world.

* GPFS is very suitable for situations where NFS and many legacy protocols cannot work and terabytes of data need to be available to analytic systems on demand.

References

http://www.ibm.com/platformcomputing

http://www-03.ibm.com/systems/software/gpfs/

Leading real-time Big Data processors in the market

Solution | Vendor | Type | Description
Storm | Twitter | Streaming | Twitter's streaming big-data analytics solution
S4 | Yahoo! | Streaming | Distributed stream computing platform from Yahoo!
Hadoop | Apache | Batch | First open source implementation of the MapReduce paradigm
Spark | UC Berkeley AMPLab | Batch | Recent analytics platform that supports in-memory data sets and resiliency
Disco | Nokia | Batch | Nokia's distributed MapReduce framework
HPCC | LexisNexis | Batch | HPC cluster for big data
Druid | Metamarkets | Streaming | Analytics and streaming

Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data

Brief background (something we are seeing all the time)

Big Data refers to information that cannot be processed with traditional processors and tools, but those traditional tools should not be confused with Big Data solutions. Increasingly, enterprises face the challenge of how to access this wealth of information, but they don't know how to get value from it because it sits in its most raw form, in unstructured or semi-structured formats.

Why think of a Big Data solution?

Big Data can be interpreted in many different ways, which is why it is commonly framed as the three Vs, Volume, Velocity and Variety, the characteristics that define it.

  1. Analyzing raw unstructured or semi-structured data from a wide variety of sources. Big Data solutions can work on structured content too.
  2. Big Data solutions are ideal for iterative and exploratory analysis where the business has no predefined formula, unlike traditional BI solutions, where BI solution providers have proprietary, fixed formulas for specific industries such as retail.
  3. Big Data solutions are ideal when analysis has to be done on the whole data set, not on a sample of the data, otherwise it would not be as effective.

Traditional BI tools have always worked on data that is pre-processed, while Big Data analysis is mostly done while data is in motion, not at rest. That is why stream processing with low latency at high data volumes is a key factor in Big Data technologies.

Broader use case for Big Data Solutions

  1. IT log analytics
  2. Fraud detection patterns
  3. Social media patterns
  4. Energy sector
  5. Health care
  6. Retail sector
  7. Patterns for modeling and management

Big Data platforms available in the market

Broader Technical Spectrum/Stack in Big Data provided by different vendors

Leading Research Investment in Big data

XML Based Workflow Engine

In this article, I would like to explore the power of XQuery, XML and XSLT for workflow-based applications. Workflow here is described in a very general way: one can think of documents moving between different people, where each can perform a different task. Processing an insurance claim, a claim against a government minimum-job-guarantee program, or a bug report system are examples that come to mind.

XML fits very well for workflow applications because it gives very good adaptability for integrating with any external application.

We will talk about highly configurable, reusable enterprise applications, where loosely coupled services are a basic requirement. Until recently XQuery has been regarded as just a query language, but it is a full server-side language like JSP and PHP, with query capability on top.
This blog shows how to create your own very generic custom workflow engine. The conceptualization is based on the JBoss jBPM workflow engine, but this is a very lightweight, minimal implementation: XQuery and an XML database like eXist are enough, and no Java code is required. This will be the first post in the series, with running sample examples… Please drop a mail to chandra.dream@gmail.com for the code.
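
As a rough sketch of how little plumbing this needs, the commands below store a workflow document in eXist over its REST interface and then query it with a small inline XQuery. The host, collection name and document structure are hypothetical examples, not the engine described above, and authentication is omitted for brevity.

# Store a hypothetical claim workflow document into an eXist collection via REST.
curl -X PUT -H "Content-Type: application/xml" \
     --data '<workflow id="claim-123"><task assignee="reviewer" status="open">Review claim</task></workflow>' \
     http://localhost:8080/exist/rest/db/workflows/claim-123.xml

# Run an ad-hoc XQuery over the collection to list open tasks.
curl -G http://localhost:8080/exist/rest/db/workflows \
     --data-urlencode '_query=for $t in collection("/db/workflows")//task[@status="open"] return $t'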