Stages at which Data Quality Problems Arise

  1. Data Acquisition Processes:

    Data enters a company’s systems in many ways, both manual and automated, and data migration in large volumes can introduce serious errors. Several factors contribute to the poor quality of incoming data:

  1.1 Initial Data Conversion:

    When data is migrated from an existing database to a new one, a host of quality issues can arise. The source data may itself be incorrect owing to its own limitations; the mapping from the old schema to the new one may be inconsistent; or the conversion routines may map fields incorrectly. ‘Legacy’ systems often have metadata that is out of sync with the schema actually implemented. Since the data dictionary serves as the basis for the conversion mappings and algorithms, a dictionary that disagrees with the actual data can lead to major data quality problems.

  1.2 System Consolidation:

    Business mergers and acquisitions lead to data consolidations. Since the focus is mostly on streamlining business processes, merging the data is usually given less attention. This can be catastrophic, especially if the data experts of the acquired company are not involved in the consolidation and the metadata is out of sync. Merging two databases that do not have compatible fields can force data into fields where it does not belong and seriously harm data accuracy.

  1.3 Manual Data Entry:

    Data is often entered into the system manually and is hence prone to human error. Because user data is typically entered through user-friendly interfaces, it may not map directly onto the internal data representation. In addition, end users tend to enter ‘shortcut’ values in fields they perceive to be unimportant but which may be crucial to internal data management. A data operator who lacks the expertise to understand the data may put values in the wrong fields or mistype the information.

  1.4 Batch Feeds:

    Automated processes are often used to load large volumes of similar data in batches, since this saves time and effort. But the systems pushing in this bulk data can pump in equally huge amounts of wrong data, which is especially disastrous when data travels through a series of downstream databases. Wrong data can trigger incorrect processes, which in turn produce incorrect decisions with large adverse impacts on the firm. The data flow across the integrated systems is rarely tested end to end, so any upgrade of software in the data chain with inadequate regression testing can severely damage live data.

  1.5 Real-Time Interfaces:

    Real-time interfaces are the opposite of batch feeds. As real-time applications become the norm for interactive, responsive user experiences, data enters the database in real time and often propagates quickly through the chain of interconnected databases. This triggers actions and responses that may be visible to the user almost immediately, leaving little room for validation and verification. It opens a large hole in data quality assurance, where a single wrong entry can cause havoc at the back end.

  2. Internal Data Changes:

    A company may run processes that modify data already residing within its systems, which can inadvertently introduce errors. The following processes are responsible for internal changes to enterprise data:

  2.1 Data Processing:

    Enterprise data needs to be processed regularly for summarization, calculation and clean-up. There may be a well-tested, proven cycle of such processing from the past, but the code of the collation programs, the processes themselves and the actual data all evolve with time, so a repeated cycle may not yield the same results. The processed data may be completely off the mark, and if it forms the basis of further processing, the error can multiply as it travels downstream.

  2.2 Data Cleansing:

    Every company needs to rectify its incorrect data periodically. Manual cleansing has largely been replaced by automation that saves time and effort; while very helpful, it carries the risk of wrongly affecting thousands of records at once. The cleansing software may have bugs, or the data specifications on which the cleansing algorithms are based may themselves be incorrect. This can turn absolutely valid data invalid, reversing the very advantage of the cleansing exercise.

  2.3 Data Purging:

    Old data constantly needs to be removed from the system to save valuable storage space and to reduce the effort of maintaining mammoth volumes of obsolete information. Purging destroys data by design, so a wrong or accidental deletion is hazardous to data quality. Just as with cleansing, bugs or incorrect data specifications in the purging software can unleash unwarranted destruction of valuable data, and valid data may incorrectly fit the purging criteria and get erased.
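
    A minimal R sketch of that last failure mode (the table, column names and cutoff are invented for illustration):

    # Hypothetical purge: delete orders older than the retention cutoff.
    orders <- data.frame(
      id         = 1:3,
      status     = c("closed", "open", "open"),
      order_date = as.Date(c("2001-03-10", "1917-01-15", "2017-02-01"))  # row 2: typo for 2017
    )
    cutoff <- as.Date("2010-01-01")
    to_purge <- orders$order_date < cutoff   # rows 1 and 2 match; row 2 is a live order
    orders[to_purge, ]                       # purging these rows erases valid, active data
    # A safer specification cross-checks another field before destroying anything:
    to_purge_safe <- orders$order_date < cutoff & orders$status == "closed"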

  3. Data Manipulation:

    Data inside the company’s databases is also subjected to manipulation through system upgrades, database redesigns and similar exercises. These cause data quality issues because the personnel involved may not be data experts and the data specifications may be unreliable. Such deterioration of data is termed data decay. Some of the reasons it occurs:

  3.1 Changes in Data Not Captured:

    Data represents real-world objects, which change on their own over time, and the stored representation may not keep up with those changes. Such data ages automatically and degrades into a meaningless form. In interconnected systems, changes made in one branch may also fail to propagate to the interfacing systems. This can create huge inconsistencies that show up adversely at a later stage, often after the damage is done.

  3.2 System Upgrades:

    System upgrades are inevitable, and such exercises rely heavily on the data specifications for the expected data representation. In reality, the data is often far from the documented behavior, and the result is chaotic. A poorly tested system upgrade can therefore cause irreversible damage to data quality.

  3.3 New Uses of Data:

    Businesses need to find new revenue-generating uses for existing data, and this can open up a new set of issues. Data collected for one purpose may not suit another objective, and reusing it can lead to incorrect interpretations and assumptions in the new area.

  3.4 Loss of Expertise:

    Data and its experts form a tight bond: the expert has an ‘eye’ for wrong data, is well versed in the exceptions, and knows how to extract the relevant data and discard the rest, thanks to long years of association with the ‘legacy’ systems. When such experts retire, move on or are let go after a merger, the new data handlers may be unaware of the anomalies the experts used to catch, and wrong data can travel unchecked into a process.

  3.5 Automation of Internal Processes:

    As more applications with higher levels of automation share huge volumes of data, users get more exposure to erroneous internal data that was previously ignored, and companies stand to lose credibility when such data is exposed. Automation cannot replace the need to validate information, and both intentional and unintentional tweaking of data by users can contribute to data decay that is outside the company’s control.

    In conclusion, data quality can be lost through the processes that bring data into the system, through the processes that clean up and modify the data inside it, and through ageing and decay, where the data itself may not change in time with the real world.

 

[Amazon](500150) Error setting/closing connection: Connection timed out.

This error appears when connecting to Redshift with SQLWorkbench/J:

Error: [Amazon](500150) Error setting/closing connection: Connection timed out

Solution (console steps; a CLI sketch follows the list):

  1. Go to the EC2 Management Console.
  2. In the left navigation pane, under the Network & Security header, click Security Groups. (https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#SecurityGroups:sort=groupId)
  3. Find the row named "launch-wizard-1" in the Group Name column and select it. Then select the Inbound tab and click Edit.
  4. Add a rule: Type = Redshift, Source = My IP (the address auto-fills).
  5. Save the rule.
  6. Try to connect again from within SQLWorkbench/J.
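
The same rule can be added via the AWS CLI; a sketch, wrapped in R's system() like the other snippets on this page (the security group ID and source IP are placeholders to replace with your own; 5439 is Redshift's default port):

# Open the Redshift port (5439) to a single source IP on the given security group.
system("aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 5439 --cidr 203.0.113.10/32")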
 

R Plot PCH Symbols Chart

Following is a chart of the pch plotting symbols used in R plots. When pch is 21-25, the parameters col= (border color) and bg= (fill color) should be specified. pch can also be a character, such as "#", "%", "A" or "a", in which case that character itself is plotted.
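
The chart can be regenerated with a few lines of base R; the grid layout below is an arbitrary choice, while the pch codes 0-25 and the col=/bg= behavior are standard:

# Draw pch symbols 0-25 in a grid, labelled with their codes.
# For pch 21-25, col= sets the border colour and bg= sets the fill.
plot(0, 0, type = "n", xlim = c(0.5, 6.5), ylim = c(0.5, 5.5),
     axes = FALSE, xlab = "", ylab = "", main = "R plot pch symbols")
for (i in 0:25) {
  x <- i %% 6 + 1
  y <- 5 - i %/% 6
  points(x, y, pch = i, cex = 2, col = "blue", bg = "orange")
  text(x, y - 0.35, labels = i, cex = 0.8)
}
points(4, 1, pch = "#", cex = 2)   # a character pch plots the character itself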

 

Google cloud compute images list

Output of gcloud compute images list (mid-2017):

centos-7-v20170620                                centos-cloud       centos-7                              READY
coreos-alpha-1465-0-0-v20170706                   coreos-cloud       coreos-alpha                          READY
coreos-beta-1437-3-0-v20170630                    coreos-cloud       coreos-beta                           READY
coreos-stable-1409-6-0-v20170706                  coreos-cloud       coreos-stable                         READY
debian-8-jessie-v20170619                         debian-cloud       debian-8                              READY
debian-9-stretch-v20170619                        debian-cloud       debian-9                              READY
cos-beta-60-9592-31-0                             cos-cloud          cos-beta                              READY
cos-dev-61-9715-0-0                               cos-cloud          cos-dev                               READY
cos-stable-58-9334-74-0                           cos-cloud          cos-stable                            READY
cos-stable-59-9460-64-0                           cos-cloud          cos-stable                            READY
rhel-6-v20170620                                  rhel-cloud         rhel-6                                READY
rhel-7-v20170620                                  rhel-cloud         rhel-7                                READY
sles-11-sp4-v20170621                             suse-cloud         sles-11                               READY
sles-12-sp2-v20170620                             suse-cloud         sles-12                               READY
sles-12-sp1-sap-v20170620                         suse-sap-cloud     sles-12-sp1-sap                       READY
sles-12-sp2-sap-v20170620                         suse-sap-cloud     sles-12-sp2-sap                       READY
ubuntu-1404-trusty-v20170703                      ubuntu-os-cloud    ubuntu-1404-lts                       READY
ubuntu-1604-xenial-v20170619a                     ubuntu-os-cloud    ubuntu-1604-lts                       READY
ubuntu-1610-yakkety-v20170619a                    ubuntu-os-cloud    ubuntu-1610                           READY
ubuntu-1704-zesty-v20170619a                      ubuntu-os-cloud    ubuntu-1704                           READY
windows-server-2008-r2-dc-v20170615               windows-cloud      windows-2008-r2                       READY
windows-server-2012-r2-dc-core-v20170615          windows-cloud      windows-2012-r2-core                  READY
windows-server-2012-r2-dc-v20170615               windows-cloud      windows-2012-r2                       READY
windows-server-2016-dc-core-v20170615             windows-cloud      windows-2016-core                     READY
windows-server-2016-dc-v20170615                  windows-cloud      windows-2016                          READY
sql-2012-enterprise-windows-2012-r2-dc-v20170615  windows-sql-cloud  sql-ent-2012-win-2012-r2              READY
sql-2012-standard-windows-2012-r2-dc-v20170615    windows-sql-cloud  sql-std-2012-win-2012-r2              READY
sql-2012-web-windows-2012-r2-dc-v20170615         windows-sql-cloud  sql-web-2012-win-2012-r2              READY
sql-2014-enterprise-windows-2012-r2-dc-v20170615  windows-sql-cloud  sql-ent-2014-win-2012-r2              READY
sql-2014-standard-windows-2012-r2-dc-v20170615    windows-sql-cloud  sql-std-2014-win-2012-r2              READY
sql-2014-web-windows-2012-r2-dc-v20170615         windows-sql-cloud  sql-web-2014-win-2012-r2              READY
sql-2016-enterprise-windows-2012-r2-dc-v20170615  windows-sql-cloud  sql-ent-2016-win-2012-r2              READY
sql-2016-enterprise-windows-2016-dc-v20170615     windows-sql-cloud  sql-ent-2016-win-2016                 READY
sql-2016-express-windows-2012-r2-dc-v20170615     windows-sql-cloud  sql-exp-2016-win-2012-r2              READY
sql-2016-express-windows-2016-dc-v20170615        windows-sql-cloud  sql-exp-2016-win-2016                 READY
sql-2016-standard-windows-2012-r2-dc-v20170615    windows-sql-cloud  sql-std-2016-win-2012-r2              READY
sql-2016-standard-windows-2016-dc-v20170615       windows-sql-cloud  sql-std-2016-win-2016                 READY
sql-2016-web-windows-2012-r2-dc-v20170615         windows-sql-cloud  sql-web-2016-win-2012-r2              READY
sql-2016-web-windows-2016-dc-v20170615            windows-sql-cloud  sql-web-2016-win-2016                 READY
 

Essentials of Machine Learning Algorithms

3 Types of Machine Learning Algorithms

  1. Supervised Learning

How it works: These algorithms use a target/outcome variable (the dependent variable) that is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to the desired outputs. The training process continues until the model achieves the desired level of accuracy on the training data. Examples of supervised learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.
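
A small illustration in R, using the built-in iris data and a decision tree from the rpart package (which ships with standard R distributions); the train/test split and seed are arbitrary choices:

# Supervised learning: learn a mapping from labelled training data,
# then check accuracy on held-out test data.
library(rpart)
set.seed(42)
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]
model <- rpart(Species ~ ., data = train)     # target: Species; predictors: the rest
pred  <- predict(model, test, type = "class")
mean(pred == test$Species)                    # accuracy on unseen rows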

  2. Unsupervised Learning

How it works: In these algorithms we do not have any target or outcome variable to predict or estimate. They are used for clustering a population into different groups, which is widely applied to segmenting customers into groups for specific interventions. Examples of unsupervised learning: the Apriori algorithm, K-means.
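
The same iris data, clustered without its labels; k = 3 and the seed are arbitrary choices:

# Unsupervised learning: k-means groups the rows using only the measurements.
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
table(cluster = km$cluster, species = iris$Species)  # compare to the unused labels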

  3. Reinforcement Learning:

How it works: The machine is trained to make specific decisions. It is exposed to an environment in which it continually trains itself by trial and error, learning from past experience and trying to capture the best possible knowledge in order to make accurate business decisions. Example of reinforcement learning: the Markov Decision Process.
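
A toy Markov Decision Process solved by value iteration in R; the two states, two actions, transition probabilities and rewards are all made-up numbers for illustration:

# Value iteration on a 2-state, 2-action MDP.
# P[s, a, s2] = probability of moving from s to s2 under action a; R[s, a] = reward.
P <- array(0, dim = c(2, 2, 2))
P[1, 1, ] <- c(0.9, 0.1); P[1, 2, ] <- c(0.2, 0.8)
P[2, 1, ] <- c(0.5, 0.5); P[2, 2, ] <- c(0.1, 0.9)
R <- matrix(c(1, 0, 2, 3), nrow = 2, byrow = TRUE)
gamma <- 0.9                          # discount factor
V <- c(0, 0)                          # state values, improved iteratively
for (iter in 1:1000) {
  Q <- R + gamma * apply(P, c(1, 2), function(p) sum(p * V))  # Q[s, a]
  V_new <- apply(Q, 1, max)
  if (max(abs(V_new - V)) < 1e-8) break
  V <- V_new
}
policy <- apply(Q, 1, which.max)      # best action in each state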

List of Common Machine Learning Algorithms

Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:

  1. Linear Regression
  2. Logistic Regression
  3. Decision Tree
  4. SVM
  5. Naive Bayes
  6. KNN
  7. K-Means
  8. Random Forest
  9. Dimensionality Reduction Algorithms
  10. Gradient Boost & Adaboost

 

Error: tar: Failed to set default locale in R while installing packages

Error: tar: Failed to set default locale

Solution: set the default language (a macOS fix via defaults write; restart R for it to take effect):

system("defaults write org.R-project.R force.LANG en_US.UTF-8")

check it:

system("locale")
LANG="en_IN.UTF-8"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

 

Difference between Supervised Learning and Unsupervised Learning

If an algorithm learns something from the training data so that the knowledge can be applied to the test data, it is referred to as supervised learning; classification is an example. If the algorithm does not learn anything beforehand, because there is no response variable and no training data, it is referred to as unsupervised learning; clustering is an example.