Category: Big Data

Middle level technology solution company challenge: Embracing Enterprise Application Framework

I worked in start up open source technology company for 7 years and was amazing journey in building platform using Apache ServiceMix,  Lifreay Portal, JBoss middle ware suites, Alfresco and many others as  integrated solution to achieve business need for large banking, social care  in Africa, Europe and Indian market.

I recently had three days in-house TIBCO Training includes TIBCO Business works, EMS, Designer, Active Space , spot fire and other tools related with deployment , Continuous integration and logging framework. Team did great jobs in terms of delivering basic building blocks and architectural nuts and bolts which are required to develop service Web Service, Rest Full Service, querying Database using TIBCO Designer no custom code, it’s all ready made TIBCO provided pallets and Hops and I simply said wow because I am coming from ETL background where ready-made hops and tools saves developers time and reduces risk of long and buggy code for file reading/transformation/querying Database and many more.

Here is one of the sample service designs in TIBCO designer to read some data from database (No Java/.net class no custom code)

TIBCO Designer

As i software solution architect I am convinced that developing a service is much easier with TIBCO than  writing Custom Java or .Net code and writing JSON and other transformation code, it helps to reduce buggy code and faster turnaround time for solution. It can’t be denied Initial development time would be higher for those who are new to this.

NoSQL/Data Grid (Active Space) in TIBCO

I always built robust /scalable low-cost solution based on liferay, alfresco on JBoss and many more .. no software license but supporting thousand concurrent public users. Map/reduce, Lucene, Indexing , SOLR are basic ingredient the solution we developed. I am keen follower of this scalable platform such as Amazon, Netflix , E-Bay, price-line etc .

Active space is new addition to TIBCO suite to support high number of concurrent users in competition of NoSQL/Data-grid/Cache technology which TIBCO claims runs on commodity server, but in Demo I found they are running two nodes on 512 GB memory which are beyond my thoughts why they solution in such a way, later I realized its in memory non-persistent solution where data resides in RAM not on Disk.

Challenge in Adoption Non Relational Approach?

I got a chance to speak few architects who have developed financial platform Microsoft .net for more than decades and very often I hear in leadership meeting reducing downtime , upgrade etc. I always ask question to myself why can’t we move to new generation database  and here is reply

“ This new technology and framework are bad adoption and does not fit in our solution, we are very happy with our existing solution and we can scale in this only. New solution are very slow , no documentation so not acceptable blah blah blah”.

Fitting Everything together?

One most important philosophies underpin that one size does not fit all, for many year traditional Database server are used for storing all types, it doesn’t matter all data types fits for relational Data model. General perception is that media, Analytic and social media application are suitable for NoSQL Database and structured relations data but that’s not correct.

Following can be implemented in exciting software to scale and reduce downtime.

  1. Memcached is used by Both Amazon and Netflix for frequently  requested content/data same can be achieved by Active Space to reduce load on database and faster response to user requests.
  2. Let us stick using highly secure data in RDBMS but active/active set up can be done to reduce downtime and scale it.
  3. Move other kind of information for example user profile and other data in distributed  NoSQL Database and redesign the service.

 

Online vs Offline Bigdata solution

Big Data can take both online and offline forms. Online Big Data refers to data that is created, ingested, trans- formed, managed and/or analyzed in real-time to support operational applications and their users. Big Data is born online. Latency for these applications must be very low and availability must be high in order to meet SLAs and user expectations for modern application performance. This includes a vast array of applications, from social networking news feeds, to analytics to real-time ad servers to complex CRM applications. Examples of online Big Data databases include MongoDB and other NoSQL databases.

Offline Big Data encompasses applications that ingest, transform, manage and/or analyze Big Data in a batch context. They typically do not create new data. For these applications, response time can be slow (up to hours or days), which is often acceptable for this type of use case. Since they usually produce a static (vs. operational) output, such as a report or dashboard, they can even go offline temporarily without impacting the overall goal or end product. Examples of offline Big Data applications include Hadoop-based workloads; modern data warehouses; extract, transform, load (ETL) applications; and business intelligence tools.

Organizations evaluating which Big Data technologies to adopt should consider how they intend to use their data. For those looking to build applications that support real-time, operational use cases, they will need an operational data store like MongoDB. For those that need a place to conduct long-running analysis offline, perhaps to inform decision-making processes, offline solutions like Hadoop can be an effective tool. Organizations pursuing both use cases can do so in tandem, and they will sometimes find integrations between online and offline Big Data technologies. For instance, MongoDB provides integration with Hadoop.

Strategic thoughts when choosing new big data storage technology

Before 2000, primary challenges for companies were to enable the systems so that transactional data could be captured faster for organizational productivity, now gear is shifted towards delivery of information to the business users through reporting, analytical system and actionable drill down dashboard etc that organization have stored in files, data, audio and video stream etc on propriety clustered and open source file system based on their business need and suitability.

Organizations are using storage technology for decades to store the information on clustered file system which are mounted on multiple servers and few are not but complexities of the underlying storage environment increases as new servers/system are added for scalability.

Now, organizations, which want to monetize, better analyze and capitalize the information channels and integrate with the business, depending on the big data storage, opted in the past they are facing challenges of large scale data indexing, availability on demand with low latency. So some of them are choosing/changing or integrating with better enterprise class large scale data storage through some connecter technology to trap and monetize the value of information they have in reduced time manner.

We should know the technology, protocols, network challenges when thinking of  adopt new big data storage and their features.  There are few architectural approaches how clustering works in such scenario.

Shared Disk: It uses SAN storage area network at a block level. It has again few approaches, some distribute information across all over the server in cluster and some employ centralized metadata server.  SGI CXFS, IBM GPFS,  DataPlow, Microsoft CSV, Oracle CFS, Redhat GFS, SUN QFS, VMware VMFS, Ceph etc are most widely used cluster file systems.

Distributed file system: It uses a network protocol and Lustre’s data storage technology is very popular on this, Ceph has also come up with this and Microsoft has DFS too.

NAS: It uses file based protocol i.e. NFS, SMB (Server Message Block)/CIFS (Common Internet File System).

Shared Nothing Architecture: Each storage nodes communicates changes to other or to master for replication. Ceph, Lustre and Hadoop are few implementers for this.

The most suitable technology selection reduces time to solution and control the budgets as well, so based on the above architecture, let us list down the general and most critical SLAs from the solutions.

Common selection parameters for big data storage technology?

  • High availability
  • Scalability of data storage with fixed IT budgets
  • Fault tolerance
  • Cost of ownership, commodity hardware
  • Global workload sharing
  • Map Reduce algorithm support on high bandwidth
  • Reduced time to solution
  • Centralize storage management
  • Support wide range of hardware and software
  • High application IO support for analytic system
  • Event stream processor/storage
  • Caching of data for better performance
  • Holistic network design (Unified Ethernet Fabric)

List of Big data Storage technology available in the market

Most of the features are claimed to be supported by these listed except Unified Ethernet Fabric that is separate case and CISCO has network related offering to scale out the big data storage.

HDFS

It is the de facto solution in big data technology for large scale data processing over clusters of commodity hardware and is very much suitable However if you are trying to process dynamic datasets (data in motion) , ad-hoc analysis or graph data structure, please stop and read about Goggle’s better alternative to map reduce paradigm (Percolator, Dremel and Pregel). Cassandra and other Enterprise version of HDFS are trying to provide improvement and solutions in this area.

GPFS

It has been available in the market since 1993 and thousands of organizations are using it (Pharmaceutical, Financial Institutes, Life Science, USA National Whether Forecast, Energy Sector etc). It runs on commodity hardware as well and support many OS and platform. It claims to work with low latency ad-hoc analysis and streaming data at very high volumes. It is a propriety offering from IBM and of course one of the very suitable big data storage options if licensing is not a concern. Cluster Manager Failure, File System manager level failure, Secondary cluster, configuration Server failure and Rack Failure are claimed to be addressed with GPFS SNC.

Lustre Distributed File System

This is a very recognized scalable cluster computing file storage system that is widely used by super computers and it has open licensing. There are many commercial suppliers of this bundled with hardware like netApp and Dell. It also claims to fulfill all of requirement listed above including low latency for analytical system.

Isilon’sOneFS by EMC

This is a major induction in the big data storage arena and companies like Oracle and IBM are taming this big data beast. EMC was has re-engineered HDFS and created its own version of data storage later in 2011. It uses the MapR File system. MAPR File System claim to be the alternate file system to HDFS which has full random access read /write right. Snapshots and mirrors advanced feature that addressed centralized metadata of name node in HDFS Single point failure issue.

NetApp’s RAID array on HDFS.

Netapps claims it’s improvement to HDFS to make it faster and reliable but still rely on HDFS.

Clever safe Dispersed Computation file system highly scalable object based storage.

Appistry and KosmosFS are few more computational big data storage options.

Conclusion

In order to monetize and present actionable insights of information to business for provocative organizational decision making, analytic system heavily rely on data storage technologies that drives how data is  made available to frontend middle ware application at a faster rate and reduced interval. As we know GFS based HDFS is cheaper and rock solid but for scalability over Peta bytes perhaps the enterprise class solution may be EMC, IBM or GPFS or many available in the market?

But remember many commercial offering don’t run on commodity hardware and cost advantage of HDFS and related bundles are fundamental for current success and growing popularity. Low latency issue with HDFS can be addressed with right design/implementation by skilled big data technical experts that are organization choice weather they desire for per-bundled commercial offering or open source HDFS that allows customizing organizational needs in a flexible manner but it all depends on business requirement and investment budgets for the solutions.

IBM Big Data GPFS Protocol

GPFS is advanced Protocol similar to HDFS and exist in market since 1993 and thousand of compnaies are using like Energy sector, Social Media, Analytics  industries and finance.

It primarily focus on following items.

  1. High performance
  1. Availability
  1. Advance replication
  1. Multi site caching
  1. Reliability
  1. Fault tolerance.

VESTAS Wind System is one of the very use for GPFS they using for more than decades.

* Wind pattern information system they have created for Clients that gives good analytic system of Wind information all around the world.

* GPFS is very suitable for situation where NFS and many legacy protocol can not work and terabytes of data need to be available for analytic system on demand.

References

http://www.ibm.com/platformcomputing

http://www-03.ibm.com/systems/software/gpfs/

Undertanding Big Data, Analytics for Enterprise Class Hadoop and Streaming Data

Brief Background (We are seeing all the time)

Big data applies to the Information that can’t be processed with traditional processors and  tools but same traditional tools should not be mixed Big Data solutions. Increasingly   Enterprises are having challenges how to access the wealth of information but they don’t know how to get the values of it because it’s sitting in most raw form unstructured or semistructured format.

Thinking of BIG Data Solution why?

Big Data can be interpreted in many different way and that is why it’s conforming as V3. Velocity, Volume and varieties that characterizes it.

  1. Analyzing the raw unstructured, semiunstcrtured data from wide variety of sources. Big Data Solution can work on structured content too.
  2. Big Data are ideal for iterative and  exploratory analysis where business has not any predefined formula just like Traditional BI solutions where BI Solution providers has proprietary fixed formula for specific industries like Retail etc
  3. Big Data solution are ideal when analysis has to be done on whole set of data not on sample of data else it wouldn’t be as effective.

Traditional BI Tools has been always working on data that is per-processed while BIG data does analysis when data is in motion not in the rest most of the time. That’s where Stream Processing with low latency with high volume of data is key factor in Big Data technologies.

Broader use case for Big Data Solutions

  1. IT logs analytic
  2. Fraud Detection patterns
  3. The social media pattern
  4. Energy Sector
  5. Health care
  6. Retails sector
  7. Patterns for modeling and Management

Big Data Platform available in the market

Broader Technical Spectrum/Stack in Big Data provided by different vendors

Leading Research Investment in Big data