Data Persistence


In this blog i'm going to talk about Data Persistence.

Persistence is "the continuance of an effect after its cause is removed". In the context of storing data in a computer system, this means that the data survives after the process with which it was created has ended. In other words, for a data store to be considered persistent, it must write to non-volatile storage.

 Information system role

At the most basic level, an information system (IS) is a set of components that work together to manage data processing and storage. Its role is to support the key aspects of running an organization, such as communication, record-keeping, decision making, data analysis and more. Companies use this information to improve their business operations, make strategic decisions and gain a competitive edge.
Information systems typically include a combination of software, hardware and telecommunication networks. For example, an organization may use customer relationship management systems to gain a better understanding of its target audience, acquire new customers and retain existing clients. This technology allows companies to gather and analyze sales activity data, define the exact target group of a marketing campaign and measure customer satisfaction.

Uses

There are different types of information systems and each has a different role. Business intelligence (BI) systems, for instance, can turn data into valuable insights.
This kind of technology allows for faster, more accurate reporting, better business decisions and more efficient resource allocation. Another major benefit is data visualization, which enables analysts to interpret large amounts of information, predict future events and find patterns in historical data.
Organizations can also use enterprise resource planning (ERP) software to collect, manage and analyze data across different areas, from manufacturing to finance and accounting. This type of information system consists of multiple applications that provide a 360-degree view of business operations. NetSuite ERP, PeopleSoft, Odoo and Intacct are just a few examples of ERP software.
Like other information systems, ERP provides actionable insights and helps you decide on the next steps. It also makes it easier to achieve regulatory compliance, increase data security and share information between departments. Additionally, it helps to ensure that all of your financial records are accurate and up-to-date.
In the long run, ERP software can reduce operational costs, improve collaboration and boost your revenue. Nearly half of the companies that implement this system report major benefits within six months.
At the end of the day, information systems can give you a competitive advantage and provide the data you need to make faster, smarter business decisions. Depending on your needs, you can opt for transaction processing systems, knowledge management systems, decision support systems and more. When choosing one, consider your budget, industry and business size. Look for an information system that aligns with your goals and can streamline your day-to-day operations.

Data, Database, Database server and Database management system


 Data are raw facts can be processed and converted to meaningful information.
Data can exist in a variety of forms — as numbers or text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored in a person's mind. Since the mid-1900s, people have used the word data to mean computer information that is transmitted or stored.
Strictly speaking, data is the plural of datum, a single piece of information. In practice, however, people use data as both the singular and plural form of the word, and as a mass noun (like "sand").

 There are two types of data.
Quantitative data cane be measurable. But qualitative can't.

 Data can be stored in,
  • Files
  • Databases
Databases are created and managed in database servers.
A database (DB), in the most general sense, is an organized collection of data. More specifically, a database is an electronic system that allows data to be easily accessed, manipulated and updated.
In other words, a database is used by an organization as a method of storing, managing and retrieving information. Modern databases are managed using a database management system (DBMS).

 SQL is used to process db.

Database types

•Hierarchical databases 
•Network databases 
•Relational databases 
•Non-relational databases (NoSQL) 
•Object-oriented databases 
•Graph databases 
•Document databases

DBMSs are used to connect to the DB servers and manage the DBs and data in them.


E.g:
  • PHPMyAdmin
  • MySQL Workbench

Files vs Databases


A database is generally used for storing related, structured data, with well defined data formats, in an efficient manner for insert, update and/or retrieval (depending on application).
On the other hand, a file system is a more unstructured data store for storing arbitrary, probably unrelated data. The file system is more general, and databases are built on top of the general data storage services provided by file systems.

A database is generally used for storing related, structured data, with well defined data formats, in an efficient manner for insert, update and/or retrieval (depending on application).
On the other hand, a file system is a more unstructured data store for storing arbitrary, probably unrelated data. The file system is more general, and databases are built on top of the general data storage services provided by file systems. 
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
For very complex operations, the filesystem is likely to be very slow.
When handling small data sets with arbitrary, probably unrelated data, file is more efficient than database. For simple operations, read, write, file operations are faster and simple.

Pros of the File System

  • Performance can be better than when you do it in a database. To justify this, if you store large files in DB, then it may slow down the performance because a simple query to retrieve the list of files or filename will also load the file data if you used Select * in your query. In a files system, accessing a file is quite simple and light weight.
  • Saving the files and downloading them in the file system is much simpler than it is in a database since a simple "Save As" function will help you out. Downloading can be done by addressing a URL with the location of the saved file.
  • Migrating the data is an easy process. You can just copy and paste the folder to your desired destination while ensuring that write permissions are provided to your destination.
  • It's cost effective in most cases to expand your web server rather than pay for certain databases.
  • It's easy to migrate it to cloud storage i.e. Amazon S3, CDNs, etc. in the future.

Cons of the File System

  • Loosely packed. There are no ACID (Atomicity, Consistency, Isolation, Durability) operations in relational mapping, which means there is no guarantee. Consider a scenario in which your files are deleted from the location manually or by some hacking dudes. You might not know whether the file exists or not. Painful, right?

  • Low security. Since your files can be saved in a folder where you should have provided write permissions, it is prone to safety issues and invites trouble, like hacking. It's best to avoid saving in the file system if you cannot afford to compromise in terms of security.

Pros of Database

  • ACID consistency, which includes a rollback of an update that is complicated when files are stored outside the database.
  • Files will be in sync with the database and cannot be orphaned, which gives you the upper hand in tracking transactions.
  • Backups automatically include file binaries.
  • It's more secure than saving in a file system.

Cons of Database

  • You may have to convert the files to blob in order to store them in the database.
  • Database backups will be more hefty and heavy.
  • Memory is ineffective. Often, RDBMSs are RAM-driven, so all data has to go to RAM first. Yeah, that’s right. Have you ever thought about what happens when an RDBMS has to find and sort data? RDBMS tracks each data page — even the lowest amount of data read and written — and it has to track if it’s in-memory or if it’s on-disk, if it’s indexed or if it's sorted physically etc.

Data warehouse Vs BigData  


The Difference Between Big Data vs Data Warehouse, are explained in the points presented below:

  1. Data Warehouse is an architecture of data storing or data repository. Whereas Big Data is a technology to handle huge data and prepare the repository.
  2. Any kind of DBMS data accepted by Data warehouse, whereas Big Data accept all kind of data including transnational data, social media data, machinery data or any DBMS data.
  3. Data warehouse only handles structure data (relational or not relational), but big data can handle structure, non-structure, semi-structured data.
  4. Big data normally used a distributed file system to load huge data in a distributed way, but data warehouse doesn’t have that kind of concept.
  5. From a business point of view, as big data has a lot of data, analytics on that will be very fruitful, and the result will be more meaningful which help to take proper decision for that organization. Whereas Data warehouse mainly helps to analytic on informed information.
  6. Data warehouse means the relational database, so storing, fetching data will be similar with normal SQL query. And big data is not following proper database structure, we need to use hive or spark SQL to see the data by using hive specific query.
  7. 100% data loaded into data warehousing are using for analytics reports. But whatever data loaded by Hadoop, maximum 0.5% used on analytics reports till now. Others data are loaded into the system, but in not use status.
  8. Data Warehousing never able to handle humongous data (totally unstructured data). Big data (Apache Hadoop) is the only option to handle humongous data.
  9. The timing of fetching increasing simultaneously in data warehouse based on data volume. Means, it will take small time for low volume data and big time for a huge volume of data just like DBMS. But in case of big data, it will take a small period of time to fetching huge data (as it specially designed for handling huge data), but taken huge time if we somehow try to load or fetch small data in HDFS by using map reduce.

  SQL statements, Prepared statements, and Callable statements

PreparedStatement- Use this for general-purpose access to your database. Useful when you are using static SQL statements at runtime. The Statement interface cannot accept parameters.

The PreparedStatement interface extends the Statement interface, which gives you added functionality with a couple of advantages over a generic Statement object.
This statement gives you the flexibility of supplying arguments dynamically.

Creating PreparedStatement Object

PreparedStatement pstmt = null;
try {
   String SQL = "Update Employees SET age = ? WHERE id = ?";
   pstmt = conn.prepareStatement(SQL);
   . . .
}
catch (SQLException e) {
   . . .
}
finally {
   . . .
}
All parameters in JDBC are represented by the ? symbol, which is known as the parameter marker. You must supply values for every parameter before executing the SQL statement.
The setXXX() methods bind values to the parameters, where XXX represents the Java data type of the value you wish to bind to the input parameter. If you forget to supply the values, you will receive an SQLException.
Each parameter marker is referred by its ordinal position. The first marker represents position 1, the next position 2, and so forth. This method differs from that of Java array indices, which starts at 0.
All of the Statement object's methods for interacting with the database (a) execute(), (b) executeQuery(), and (c) executeUpdate() also work with the PreparedStatement object. However, the methods are modified to use SQL statements that can input the parameters.
 
 

 CallableStatement -
Use this when you want to access the database stored procedures. The CallableStatement interface can also accept runtime input parameters.


Just as a Connection object creates the Statement and PreparedStatement objects, it also creates the CallableStatement object, which would be used to execute a call to a database stored procedure.

Creating CallableStatement Object

Suppose, you need to execute the following Oracle stored procedure −
CREATE OR REPLACE PROCEDURE getEmpName 
   (EMP_ID IN NUMBER, EMP_FIRST OUT VARCHAR) AS
BEGIN
   SELECT first INTO EMP_FIRST
   FROM Employees
   WHERE ID = EMP_ID;
END;
NOTE: Above stored procedure has been written for Oracle, but we are working with MySQL database so, let us write same stored procedure for MySQL as follows to create it in EMP database −
DELIMITER $$

DROP PROCEDURE IF EXISTS `EMP`.`getEmpName` $$
CREATE PROCEDURE `EMP`.`getEmpName` 
   (IN EMP_ID INT, OUT EMP_FIRST VARCHAR(255))
BEGIN
   SELECT first INTO EMP_FIRST
   FROM Employees
   WHERE ID = EMP_ID;
END $$

DELIMITER ;
Three types of parameters exist: IN, OUT, and INOUT. The PreparedStatement object only uses the IN parameter. The CallableStatement object can use all the three.

ORM

 ORM is for Object Relational Mapping.
There are different structures for holding data at runtime 
•Application holds data in objects 
•Database uses tables (entities)
Mismatches between relational and object models
•Granularity: Object model has more granularity than relational model.
•Subtypes: Subtypes (means inheritance) are not supported by all types of relational databases. •Identity: Like object model, relational model does not expose identity while writing equality. •Associations: Relational models cannot determine multiple relationships while looking into an object domain model.
 •Data navigation: Data navigation between objects in an object network is different in both models
ORM implementations in JAVA
 •Java Beans
 •JPA

Beans use POJO POJO 
stands for Plain Old Java Object. It is an ordinary Java object, not bound by any special restriction other than those forced by the Java Language Specification and not requiring any class path. POJOs are used for increasing the readability and re-usability of a program. POJOs have gained most acceptance because they are easy to write and understand. They were introduced in EJB 3.0 by Sun microsystems

A POJO should not: 
 •Extend pre-specified classes. 
•Implement pre-specified interfaces.
 •Contain pre-specified annotations.
Beans 
•Beans are special type of Pojos. There are some restrictions on POJO to be a bean.
•All JavaBeans are POJOs but not all POJOs are JavaBeans. •Serializable i.e. they should implement Serializable interface. Still some POJOs who don’t implement Serializable interface are called POJOs because Serializable is a marker interface and therefore not of much burden.
•Fields should be private. This is to provide the complete control on fields
•Fields should have getters or setters or both. 
•A no-argconstructor should be there in a bean. 
•Fields are accessed only by constructor or getter setters.


Java Persistence API (JPA) 
•An API/specification for ORM 
•Uses 
       •POJO classes 
       •XML based mapping file (represent the DB) 
       •A provider (implementation of JPA)


JPA implementations 
•Hybernate 
•JDO 
•EclipseLink
•ObjectDB

REFERENCE 

https://searchstorage.techtarget.com/definition/Persistent-storage 

https://www.techopedia.com/definition/1185/database-db 

https://stackoverflow.com/questions/38120895/database-vs-file-system-storage 

https://www.tutorialspoint.com/jdbc/jdbc-statements.htm 

Comments

Popular posts from this blog

How to run multiple Transformations from one Job in Pentaho

Data Analytics

Distributed Systems