Big Data Hadoop Training (1)

This Big Data and Hadoop training course is designed to provide the knowledge and skills needed to become a successful Hadoop developer. The course calls for in-depth knowledge of core concepts along with implementation on varied industry use cases. SoftwareSkool provides various online training courses that are currently in high demand. We designed our e-learning platform around proven teaching methods so that every learner can master the material by the end of the course. Contact Us: Ph No: 4097912424

  • What Is Big Data
  • What Is Hadoop
  • Characteristics of Big Data
  • Characteristics of Hadoop
  • Big Data Storage Considerations
  • Understanding Hadoop Technology and Storage
  • Big Data Technologies
  • Hadoop HDFS Architecture
  • Why Big Data
  • Why Hadoop
  • Future of Big Data
  • Future of Hadoop

What is Big Data
  • Big data is just what the name suggests: a
    collection of large datasets that cannot be
    processed using traditional computing techniques.
    Big data is not merely data; it has become a
    complete subject that involves various tools,
    techniques and frameworks.

What is Hadoop
  • Hadoop is a free, Java-based programming
    framework that supports the processing of large
    data sets in a distributed computing environment.
    It is part of the Apache project sponsored by the
    Apache Software Foundation.

Characteristics of Big Data
  • We have all heard of the 3Vs of big data, which
    are Volume, Variety and Velocity. Yet Inderpal
    Bhandari, Chief Data Officer at Express Scripts,
    noted in his presentation at the Big Data
    Innovation Summit in Boston that there are
    additional Vs that IT, business and data
    scientists should be concerned with, most notably
    big data Veracity. The three core
    characteristics are
  • Volume
  • Velocity
  • Variety

  • Volume refers to the vast amounts of data
    generated every second. We are not talking
    terabytes but zettabytes or brontobytes. If we
    take all the data generated in the world between
    the beginning of time and 2008, the same amount
    of data will soon be generated every minute. This
    makes most data sets too large to store and
    analyze using traditional database technology.
    New big data tools use distributed systems so
    that we can store and analyze data across
    databases that are dotted around anywhere in the
    world.

  • Velocity refers to the speed at which new data
    is generated and the speed at which data moves
    around. Just think of social media messages
    going viral in seconds. Technology now allows us
    to analyze the data while it is being generated
    (sometimes referred to as in-memory analytics),
    without ever putting it into databases. Velocity
    is the speed at which data is created, stored,
    analyzed and visualized. In the past, when batch
    processing was common practice, it was normal to
    receive an update from the database every night
    or even every week. Computers and servers
    required substantial time to process the data
    and update the databases. In the big data era,
    data is created in real time or near real time.
    With the availability of Internet-connected
    devices, wireless or wired, machines and devices
    can pass on their data the moment it is created.

  • Variety refers to the different types of data we
    can now use. In the past we focused only on
    structured data that fitted neatly into tables or
    relational databases, such as financial data. In
    fact, 80% of the world's data is unstructured
    (text, images, video, voice, and so on). With big
    data technology we can now analyze and bring
    together data of different types, such as
    messages, social media conversations, photos,
    sensor data, and video or voice recordings. In
    the past, all data that was created was
    structured data that fitted neatly into columns
    and rows, but those days are over. Nowadays, 90%
    of the data generated by an organization is
    unstructured. Data today comes in many different
    formats: structured data, semi-structured data,
    unstructured data and even complex structured
    data. This wide variety of data requires a
    different approach and different techniques to
    store all the raw data.

Characteristics of Hadoop
  • Hadoop provides a reliable shared storage (HDFS)
    and analysis system (MapReduce).
  • Hadoop is highly scalable and, unlike relational
    databases, scales linearly. Because of this
    linear scaling, a Hadoop cluster can contain
    tens, hundreds, or even thousands of servers.
  • Hadoop is very cost effective as it can work with
    commodity hardware and does not require expensive
    high-end hardware.
  • Hadoop is highly flexible and can process both
    structured as well as unstructured data.
  • Hadoop has built-in fault tolerance. Data is
    replicated across multiple nodes (replication
    factor is configurable) and if a node goes down,
    the required data can be read from another node
    which has the copy of that data. And it also
    ensures that the replication factor is
    maintained, even if a node goes down, by
    replicating the data to other available nodes.
  • Hadoop works on the principle of write once and
    read multiple times.
  • Hadoop is optimized for large and very large data
    sets. A small input such as 10 MB generally takes
    longer to process in Hadoop than in a traditional
    system.
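
The MapReduce analysis model named above can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop's own API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and a real job would run the map and reduce steps in parallel across cluster nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

Because each line is mapped independently and each word is reduced independently, both steps can be spread over many servers, which is what lets this model scale with cluster size.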

Big Data Storage Considerations
  • Our experience building an industry leading
    Big Data storage platform has taught us a few
    things about the storage challenges faced by
    organizations. Customers have shared with us some
    of the general pros and cons of the storage
    options they have considered when choosing a
    storage platform.

Open Source
  • Pros
  • Free with community support
  • Scalable
  • Runs on inexpensive commercial-off-the-shelf
    (COTS) hardware
  • Cons
  • Community support is not sufficient and there is
    a reliance on outside consultancy
  • Investment to build and maintain in-house
  • In-house support, testing and tuning
  • No guaranteed SLA
  • Long lead time to get into production

Conventional Storage Systems
  • Pros
  • Enterprise-class support and quality
  • Long term lifecycle/release management
  • Appliance based model
  • Cons
  • Expensive license and support
  • Locked-in/proprietary hardware
  • Scalability and manageability issues such as file
    system, namespace, data protection, disaster
    prevention, etc

Software-defined Storage
  • Pros
  • Enterprise-class support and quality
  • Long term lifecycle/release management
  • Massively scalable, built for today's and
    emerging workloads
  • Easy to manage: self-healing, non-disruptive
  • Runs on inexpensive COTS hardware
  • Cons
  • Some solutions require additional software with a
    separate license
  • Scalability varies with solutions
  • Data migration is required with some solutions

Understanding Hadoop technology and storage
  • Because Hadoop stores three copies of each piece
    of data, storage in a Hadoop cluster must be able
    to accommodate a large number of files. To
    support the Hadoop architecture, traditional
    storage systems may not always work. The links
    below explain how Hadoop clusters and HDFS work
    with various storage systems, including
    network-attached storage (NAS), SANs and object
    storage.
  • Software vendors have gotten the message that
    Hadoop is hot -- and many are responding by
    releasing Hadoop connectors that are designed to
    make it easier for users to transfer information
    between traditional relational databases and the
    open source distributed processing system.
  • Oracle, Microsoft and IBM are among the vendors
    that have begun offering Hadoop connector
    software as part of their overall big data
    management strategies. But it isn't just the
    relational database management system (RDBMS)
    market leaders that are getting in on the act.

Big Data Technologies
  • Big data is a broad term for data sets so large
    or complex that traditional data processing
    applications cannot handle them.
  • This presentation covers nine big data
    technologies
  • Crowd sourcing
  • Data fusion
  • Data integration
  • Genetic algorithm
  • Machine learning
  • Natural language processing
  • Signal processing
  • Time series
  • Simulation

Crowd sourcing
  • Crowdsourcing, a modern business term coined in
    2005, is defined by Merriam-Webster as the
    process of obtaining needed services, ideas, or
    content by soliciting contributions from a large
    group of people, and especially from an online
    community, rather than from traditional
    employees or suppliers. A portmanteau of "crowd"
    and "outsourcing", its more specific definitions
    are still heavily debated.

Data fusion
  • Data fusion is the process of integrating
    multiple data and knowledge representing the
    same real-world object into a consistent,
    accurate, and useful representation. Fusion of
    the data from 2 sources (dimensions 1 and 2) can
    yield a classifier superior to any classifier
    based on dimension 1 or dimension 2 alone. Data
    fusion processes are often categorized as low,
    intermediate or high, depending on the
    processing stage at which fusion takes place.
    Low-level data fusion combines several sources
    of raw data to produce new raw data. The
    expectation is that fused data is more
    informative and synthetic than the original
    inputs.
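
One simple instance of low-level fusion is combining two noisy measurements of the same quantity. The sketch below uses inverse-variance weighting, a technique chosen here for illustration rather than one named in the slides; the function name fuse_measurements is hypothetical, and the variances are assumed to be known and positive.

```python
def fuse_measurements(value_1, variance_1, value_2, variance_2):
    """Fuse two measurements of the same quantity by inverse-variance weighting.

    Each measurement is weighted by 1/variance, so the more precise source
    counts for more. Returns the fused value and its variance.
    """
    if variance_1 <= 0 or variance_2 <= 0:
        raise ValueError("variances must be positive")
    weight_1 = 1.0 / variance_1
    weight_2 = 1.0 / variance_2
    fused_value = (weight_1 * value_1 + weight_2 * value_2) / (weight_1 + weight_2)
    fused_variance = 1.0 / (weight_1 + weight_2)
    return fused_value, fused_variance


if __name__ == "__main__":
    # Two sensors report 10.0 and 12.0 with the same uncertainty.
    print(fuse_measurements(10.0, 4.0, 12.0, 4.0))
```

The fused variance is always smaller than either input variance, which matches the expectation that the fused data is more informative than each source on its own.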

Data integration
  • Data integration involves combining data
    residing in different sources and providing
    users with a unified view of these data. This
    process becomes significant in a variety of
    situations, both commercial (when two similar
    companies need to merge their databases) and
    scientific (combining research results from
    different bioinformatics repositories, for
    example). Data integration appears with
    increasing frequency as the volume of data and
    the need to share existing data explode. It has
    become the focus of extensive theoretical work,
    and numerous open problems remain.
Genetic Algorithm
  • In the field of artificial intelligence, a
    genetic algorithm (GA) is a search heuristic
    that mimics the process of natural selection.
    This heuristic (also sometimes called a
    metaheuristic) is routinely used to generate
    useful solutions to optimization and search
    problems. Genetic algorithms belong to the
    larger class of evolutionary algorithms (EA),
    which generate solutions to optimization
    problems using techniques inspired by natural
    evolution, such as inheritance, mutation and
    selection.
Machine learning
  • Machine learning is a subfield of computer
    science that evolved from the study of pattern
    recognition and computational learning theory in
    artificial intelligence. Machine learning
    explores the study and construction of
    algorithms that can learn from and make
    predictions on data. Such algorithms operate by
    building a model from example inputs in order to
    make data-driven predictions or decisions,
    rather than following strictly static program
    instructions. Machine learning is closely
    related to and often overlaps with computational
    statistics, a discipline that also specializes
    in prediction-making. It has strong ties to
    mathematical optimization, which delivers
    methods, theory and application domains to the
    field. Machine learning is employed in a range
    of computing tasks where designing and
    programming explicit algorithms is impractical.
Natural language processing
  • Natural language processing (NLP) is a field of
    computer science, artificial intelligence, and
    computational linguistics concerned with the
    interactions between computers and human
    (natural) languages. As such, NLP is related to
    the area of human-computer interaction. Many
    challenges in NLP involve natural language
    understanding, that is, enabling computers to
    derive meaning from human or natural language
    input.
Signal processing
  • Signal processing is an enabling technology that
    encompasses the fundamental theory,
    applications, algorithms, and implementations of
    processing or transferring information contained
    in many different physical, symbolic, or
    abstract formats broadly designated as signals.
    It uses mathematical, statistical,
    computational, heuristic, and linguistic
    representations, formalisms, and techniques for
    representation, modelling, analysis, synthesis,
    discovery, recovery, sensing, acquisition,
    extraction, learning, security, or forensics.

Time series
  • A time series is a sequence of data points,
    typically consisting of successive measurements
    made over a time interval. Examples of time
    series are ocean tides, counts of sunspots, and
    the daily closing value of the Dow Jones
    Industrial Average. Time series are very
    frequently plotted as line charts. Time series
    are used in statistics, signal processing,
    pattern recognition, econometrics, mathematical
    finance, weather forecasting, intelligent
    transport and trajectory forecasting, earthquake
    prediction, electroencephalography, control
    engineering, astronomy, communications
    engineering, and largely in any domain of
    applied science and engineering which involves
    temporal measurements.
Simulation
  • Simulation is the imitation of the operation of a
    real-world process or system over time. The act
    of simulating something first requires that a
    model be developed; this model represents the
    key characteristics or behaviors/functions of
    the selected physical or abstract system or
    process. The model represents the system itself,
    whereas the simulation represents the operation
    of the system over time.
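
The split between a model and a simulation can be made concrete with a toy example. Here the model is a tank that loses a fixed fraction of its contents per time step, and the simulation is the loop that advances it through time. The class TankModel, its fields and the function simulate are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass
class TankModel:
    """The model: the state and behavior of the system itself."""
    level: float           # current contents of the tank
    drain_fraction: float  # fraction of the contents lost per time step

    def step(self):
        """Advance the system by one time step."""
        self.level -= self.level * self.drain_fraction


def simulate(model, steps):
    """The simulation: operate the model over time and record each state."""
    history = [model.level]
    for _ in range(steps):
        model.step()
        history.append(model.level)
    return history


if __name__ == "__main__":
    print(simulate(TankModel(level=100.0, drain_fraction=0.5), 3))
```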

Hadoop HDFS Architecture
  • Hadoop provides a distributed filesystem and a
    framework for the analysis and transformation of
    very large data sets using the MapReduce [DG04]
    paradigm. While the interface to HDFS is
    patterned after the Unix filesystem,
    faithfulness to standards was sacrificed in
    favor of improved performance for the
    applications at hand. An important
    characteristic of Hadoop is the partitioning of
    data and computation across many (thousands) of
    hosts, and the execution of application
    computations in parallel close to their data. A
    Hadoop cluster scales computation capacity,
    storage capacity and I/O bandwidth by simply
    adding commodity servers. Hadoop clusters at
    Yahoo! span 40,000 servers and store 40
    petabytes of application data, with the largest
    cluster being 4000 servers. One hundred other
    organizations worldwide report using Hadoop.
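
The idea of partitioning a file across hosts can be sketched as splitting it into fixed-size blocks and assigning each block to several nodes. This is a deliberately simplified illustration: the function place_blocks is hypothetical, the 128 MB block size and three replicas are assumptions, and the round-robin assignment below is not HDFS's actual placement policy, which also takes racks into account.

```python
import math


def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Split a file into blocks and assign each block to `replication` nodes.

    Returns a dict mapping block index to the list of nodes holding a copy.
    Placement is plain round-robin, chosen only to keep the sketch short.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + offset) % len(nodes)]
                            for offset in range(replication)]
    return placement


if __name__ == "__main__":
    layout = place_blocks(300, ["node1", "node2", "node3", "node4"])
    for block, holders in layout.items():
        print(block, holders)
```

With every block held on several different nodes, a computation can usually be scheduled on a node that already stores the block it needs, and losing one node does not lose the data.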

Why Big Data
  • Data are now woven into every sector and function
    in the global economy, and, like other essential
    factors of production such as hard assets and
    human capital, much of modern economic activity
    simply could not take place without them. The use
    of Big Data (large pools of data that can be
    brought together and analyzed to discern patterns
    and make better decisions) will become the basis
    of competition and growth for individual firms,
    enhancing productivity and creating significant
    value for the world economy by reducing waste and
    increasing the quality of products and services.

Why Hadoop
  • Apache Hadoop enables big data applications for
    both operations and analytics and is one of the
    fastest-growing technologies providing
    competitive advantage for businesses across
    industries. Hadoop is a key component of the
    next-generation data architecture, providing a
    massively scalable distributed storage and
    processing platform. Hadoop enables organizations
    to build new data-driven applications while
    freeing up resources from existing systems. MapR
    is a production-ready distribution for Apache
    Hadoop.

Future of Big Data
  • Clearly Big Data is in its infancy, and there is
    much more to be discovered. For most companies
    it is currently just a buzzword: it has great
    potential, but not many truly know what it is
    all about. A clear sign that there is more to
    big data than is currently shown on the market
    is that the big software companies either do not
    have or do not present their Big Data solutions,
    and those that have them, like Google, do not
    use them in a commercial way. Companies need to
    decide what kind of strategy to use to implement
    Big Data. They could take a more revolutionary
    approach and move all the data to the new Big
    Data environment, so that all the reporting,
    modelling and interrogation is executed using
    the new business intelligence based on Big Data.
    This approach is already used by many
    analytics-driven organizations that put all
    their data in the Hadoop environment and build
    business intelligence solutions on top of it.

Future of Hadoop
  • Dynamic caching
  • Multiple network interface support
  • Support NVRAM
  • Hardware Security Modules

Dynamic caching
  • Access pattern based caching of hot data
  • LRU, LRU2
  • Cache partial blocks
  • Dynamic migration of data between storage tiers
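
The LRU policy listed above evicts whichever entry was used least recently when the cache is full. A minimal sketch using Python's standard OrderedDict follows; the class LRUCache and the block names in the example are illustrative and say nothing about how Hadoop implements its own caching.

```python
from collections import OrderedDict


class LRUCache:
    """A fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        """Return the cached value, or None on a miss; a hit marks the key as fresh."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Insert or update a value, evicting the stalest entry if over capacity."""
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)


if __name__ == "__main__":
    cache = LRUCache(2)
    cache.put("block-a", 1)
    cache.put("block-b", 2)
    cache.get("block-a")     # block-a is now the most recently used
    cache.put("block-c", 3)  # evicts block-b, the least recently used
    print(cache.get("block-b"), cache.get("block-a"), cache.get("block-c"))
```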

Multiple network interface support
  • Better aggregated bandwidth utilization
  • Isolation of traffic

Support NVRAM
  • Better durability without write performance cost
  • File system metadata to NVRAM for better

Hardware Security Modules
  • Better key management
  • Processing that requires higher security only on
    these nodes
  • Important requirement for Financials and