The basis of statistics consists of numbers or observations, usually obtained by some process of counting or measurement. These are collectively known as data. Data may be of two broad types: primary data and secondary data.
Primary data: Data collected primarily for the purpose of the given enquiry are called primary data.
These are collected by the enquirer, either on his own or through some agency set up for the purpose, directly from the field of enquiry.
This type of data may be used with great confidence, because the enquirer will himself decide upon the coverage of the data.
Secondary data
The ordinary user of economic and social statistics will find that the data have already been collected by some other agency, government or private; these may exist in either a published or an unpublished form. His job will then be simply to gain access to the source and get hold of the data. Such data are called secondary data. As a matter of routine, and as an essential basis of administration, government departments collect data on diverse topics that touch the life of the people.
Private agencies like banks and industrial concerns regularly compile figures on their assets and liabilities, number of employees, income of employees, etc. The enquirer may get his material ready-made from such agencies, or he may get the data in a rough form and adapt them to his needs.
In making use of secondary data, the enquirer has to be particularly careful about the nature of the data, their coverage, the definitions on which they are based and their degree of reliability.
Sometimes the information collected from different agencies is inadequate for the purpose of his enquiry. He will then have to decide whether to collect his own data and base his enquiry solely on them, or to supplement them with secondary data collected from other agencies.
Collection of data
A fundamental question to be considered at the outset is whether the collection of data should be done by complete enumeration or by sampling.
In complete enumeration, each and every individual of the group to which the data are to relate is covered, and information is gathered for each individual separately.
In sampling, only some individuals forming a representative part of the group are covered, either because the group is too large or because the items on which information is sought are too numerous.
Complete enumeration may lead to greater accuracy and greater refinement in analysis, but it may be a very expensive and time-consuming operation.
A sample designed and taken with care can produce results that may be sufficiently accurate for the purpose of the enquiry, and it can save much time and money.
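The trade-off between complete enumeration and sampling can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population of 10,000 household incomes is made up, the sample size of 500 is arbitrary, and simple random sampling without replacement stands in for whatever sample design an actual enquiry would use.

```python
import random
import statistics

# Hypothetical population: monthly incomes of 10,000 households (made-up values).
rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = [rng.randint(5_000, 50_000) for _ in range(10_000)]

# Complete enumeration: every individual of the group is covered.
census_mean = statistics.mean(population)

# Sampling: only 500 individuals, drawn at random without replacement.
sample = rng.sample(population, 500)
sample_mean = statistics.mean(sample)

print(f"Complete enumeration: mean = {census_mean:.2f} from {len(population)} records")
print(f"Sample:               mean = {sample_mean:.2f} from {len(sample)} records")
print(f"Relative difference:  {abs(sample_mean - census_mean) / census_mean:.2%}")
```

The sample uses one twentieth of the records; how close its mean comes to the complete-enumeration mean depends on the sample size and on how variable the population is.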
The information sought may be gathered from the individuals of the whole group (called the population) or from those of the sample by one of three methods:
i) The questionnaire method
ii) The interviewer method
iii) The method of direct observation.
Questionnaire method: Each informant (or respondent) is provided with a questionnaire, usually sent by mail with return postage prepaid, and is asked to supply the information in the form of answers to the questions.
This method can be effective only when the informants have attained a certain level of education.
The drawback of the method is that the informants may not take sufficient interest in the enquiry, even if they are sufficiently enlightened. Consequently, the data may involve a high percentage of non-response and thus fail to reflect the true state of the field of enquiry.
Interviewer method: Enumerators go from one informant to another and elicit the required information. This method is used in population censuses. It also has to be employed when the informants are not all literate or, even if literate, have not attained the requisite educational level.
For instance, if one is interested in family income and expenditure on different items, one may arrange to interview the head of each family and collect the information sought from him. The data collected by this method are likely to be more accurate, since a tactful investigator may persuade the informant to supply the required information and the meaning of each question may be properly explained to him so that the answer may be correct and to the point.
Method of Direct Observation: The enquirer or his assistants get the data directly from the field of enquiry without having to depend on the co-operation of informants.
When data are needed on the heights and weights of, say, 200 college students, the students will be approached individually and the height (say, in cm) of each measured with a tape and the weight (say, in kg) measured with a weighing balance.
If data are needed on the sentence lengths in a novel by, say, Bankimchandra, the enquirer himself will go through the book and note for each sentence its length, i.e. the number of words contained therein.
On the other hand, if data are required on the incidence of blindness among a group of people, one will just observe each member of the group and note whether he or she is or is not blind.
The direct method of data collection may, therefore, involve measurement, counting or simple observation.
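The counting involved in the sentence-length example can be sketched in a few lines. This is a rough illustration, not a proper text-processing method: the function name `sentence_lengths` and the sample passage are made up, and sentences are assumed to end with a full stop, exclamation mark or question mark followed by a space, which will misfire on abbreviations.

```python
import re

def sentence_lengths(text):
    """Return the number of words in each sentence of `text`."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    # This is an assumption and fails on abbreviations such as "Dr." or "e.g.".
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are taken to be whitespace-separated tokens.
    return [len(s.split()) for s in sentences]

# Made-up passage, used only to show the counting.
passage = "The river was calm. Nobody spoke for a while! Was it really so late?"
print(sentence_lengths(passage))  # [4, 5, 5]
```

Each number in the result is one observation on the variable "sentence length", obtained without depending on the co-operation of any informant.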
Population or universe
It is the aggregate of all possible values of a variable or possible objects whose characteristics are of interest in any particular investigation or enquiry.
Example: If the incomes of the citizens of a country are of interest to us, the aggregate of relevant incomes will constitute the population. This population is finite, whereas the population consisting of all possible outcomes (heads, tails) in successive tosses of a coin is infinite. It should also be apparent from the above discussion that a statistical population as defined here need not have anything to do with a human population.
A sample is a part of a population. Although we are primarily interested in the properties of a population or universe, it is often impracticable or even impossible to study the entire universe. Hence, inferences about a population are usually drawn on the basis of a sample. It is, therefore, essential that the sample studied should be representative of the population.
Data are of two types:
1. Qualitative data
2. Quantitative data
Qualitative data: In certain statistical investigations, we are concerned only with the presence or absence of some characteristic in a set of objects or individuals. In this situation, we only count how many individuals do or do not possess the characteristic.
For example, if we have a record of births, we may be concerned only with whether each baby is male or not and count the number of male babies.
Similarly, if a coin is tossed a number of times, we may only note the number of heads in the given set of tosses.
This type of data is called Qualitative or enumeration data and the characteristic used to classify an individual into different categories is called an attribute.
The term qualitative arises from the fact that differences between a set of individuals with respect to the given characteristic can only be stated in qualitative units.
It should be pointed out that an attribute may specify any number of classes or categories, and every individual under consideration must belong to one and only one of them.
In some situations, the categories of an attribute are either natural or have clearly defined boundaries, so that individuals can be classified into a category without any ambiguity.
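The idea that qualitative data amount to counting individuals in mutually exclusive categories can be sketched as follows. The birth records are made up, and the helper name `tabulate` is only illustrative; the check inside it reflects the requirement that every individual belongs to one and only one of the listed categories.

```python
from collections import Counter

CATEGORIES = ("male", "female")  # the classes of the attribute "sex"

# Hypothetical birth records; each baby falls into exactly one class.
births = ["male", "female", "female", "male", "male", "female", "male"]

def tabulate(observations, categories):
    """Count how many observations fall into each listed category."""
    # Reject any observation that does not belong to a listed category.
    unknown = set(observations) - set(categories)
    if unknown:
        raise ValueError(f"observations outside the listed categories: {unknown}")
    counts = Counter(observations)
    return {c: counts[c] for c in categories}

print(tabulate(births, CATEGORIES))  # {'male': 4, 'female': 3}
```

The records themselves are not numbers; numbers appear only once the individuals in each category are counted.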
Variable
A variable is a measurable quantity which can assume any of a prescribed set of values, called the domain of the variable.
Thus, the height of a person, the yield of a crop, the price of a commodity, the number of children in a family are some examples of variables.
Quantitative data: When we are interested in a variable, we either note or measure the actual magnitude of some character for each of the individuals or units under consideration. The resulting data are called quantitative data because it is possible to express the differences between individuals on a quantitative scale.
Discrete and continuous variables
Discrete variable: When a variable can assume only isolated values, it is called a discrete variable.
For example, if the number of children in a family is the variable of interest, it is obvious that it cannot assume fractional values and hence it is a discrete variable. Most discrete variables can only assume the values 0, 1, 2, ... But this need not be so. The important point is that the possible values of the variable are separated from one another.
Continuous variable: A variable is said to be continuous if it can theoretically assume any value within a given range or ranges. Examples of such variables are the height of a person, the price of a commodity and time. In practice, it is not possible to measure, say, the height of a person beyond a certain degree of fineness because of the limitations of our measuring devices. This implies that observational data will always be discrete in nature, but the concept of a continuous variable is extremely useful.
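The point that a continuous variable becomes discrete once it is recorded can be shown with a few made-up heights. In this sketch the "true" heights are hypothetical values, and the measuring device is assumed to read to the nearest centimetre.

```python
# Hypothetical "true" heights in cm; in theory any value in a range is possible.
true_heights = [162.37, 162.41, 170.08, 169.96, 175.5012]

# A tape read to the nearest centimetre records only whole numbers.
recorded = [round(h) for h in true_heights]

print(recorded)            # [162, 162, 170, 170, 176]
print(len(set(recorded)))  # 3 distinct recorded values from 5 distinct heights
```

Five different heights collapse into three recorded values, which is the artificial discreteness introduced by the measuring instrument.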
From both the theoretical and practical point of view, quantitative data are more important than qualitative data. Nevertheless, we may note here that the statistical procedures for dealing with one type of data are often applicable to the other type.
Attribute and variable: Although statistics always deals with numerical data, such data may arise in one of two ways. In some cases, the data are numerical to start with, e.g. when we record the height for each of a group of men or number of rooms in each house of a town.
In other cases, numbers arise only secondarily. When we record the sex of each newborn baby during a month or the language of each book in a library, the data are not numbers initially.
We get numbers if, subsequently, we note the number of male babies and that of female babies, or the number of books written in English, the number written in French, the number written in Bengali and so forth.
We may, therefore, say that the first type of data arises if we are observing, for each individual, a character that is expressed in numbers. Such a character will be referred to as a quantitative character, a variable or a variate.
For the second type of data, the character observed (viz. the sex of a baby or the language in which a book is written) is not expressible in numerical terms. Such a character is, therefore, called a qualitative character or an attribute.
Thus, the outcome of a coin toss is an attribute with two categories, head or tail, and the sex of a person has two categories, male or female.
But often the categories do not have clearly defined boundaries, and the assignment of an individual to a category may raise difficult problems. This may be due to the fact that the boundaries are arbitrary or rather vague and uncertain.
For example, the boundaries between the categories rich and poor, good and bad, employed and unemployed, are not sharply defined.
This sort of uncertainty should be recognized, and one ought to allow for it in the statistical analysis.
Frequency distribution of an attribute
During an investigation by a Research Bureau in 1973, 2000 inhabitants of three districts of a country were interviewed.
Each was asked, among other things, about the agitation of the ‘X’ corporation employees at that time. On getting the data, the sponsors of the investigation put them into a systematic form. They simply counted the number of those who knew about the agitation among the people interviewed and got the following table:
Table 1: Result of survey of ‘X’ corporation employees’ agitation
State of knowledge | Number of people (frequency) |
Aware | 500 |
Unaware | 1500 |
Total | 2000 |
The number 500 shows how many of the people interviewed were aware of the agitation. In statistical language, this is the frequency of the form ‘aware’ of the attribute, say ‘state of knowledge’, because it tells us how frequent this form was among the people interviewed. Similarly, the number 1500 is the frequency of the form ‘unaware’.
Perhaps a better picture is obtained if one uses, instead of the frequencies, the proportions (or the relative frequencies, as they are called). These are shown in Table 2.
Table 2: Proportion of people aware of ‘X’ corporation employees’ agitation
State of knowledge | Relative frequency |
Aware | 0.25 |
Unaware | 0.75 |
Total | 1.00 |
Table 1 shows how the total frequency, 2000, is distributed over the two classes, ‘aware’ and ‘unaware’. Such a table is, therefore, said to give a frequency distribution; in this case, the frequency distribution of an attribute that may be called ‘state of knowledge’.
Table 2 presents the same frequency distribution in a different form.
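The step from Table 1 to Table 2 is a division of each class frequency by the total frequency. A minimal sketch of that arithmetic, using the figures from Table 1 (the variable names are only illustrative):

```python
# Frequencies from Table 1.
frequencies = {"Aware": 500, "Unaware": 1500}

# Total frequency: the number of people interviewed.
total = sum(frequencies.values())

# Relative frequency of a class = class frequency / total frequency.
relative = {state: count / total for state, count in frequencies.items()}

print(total)                   # 2000
print(relative)                # {'Aware': 0.25, 'Unaware': 0.75}
print(sum(relative.values()))  # 1.0
```

The relative frequencies always add up to 1, which is why Table 2 can be compared directly with a similar table based on a different number of interviews.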
Tables 1 and 2 present a dichotomy, that is, a classification of individuals into two classes.
We may as well have frequency distributions of attributes with more than two classes. For instance, in the same survey, again, the people who knew of the agitation were asked whether they were sympathetic to the agitation or not. Their answers led to the following frequency distribution with three classes.
Table 3
Attitude | Number of people (frequency) |
Sympathetic | 150 |
Unsympathetic | 190 |
Indifferent | 165 |
Total | 500 |
N.B. The data given in Table 2 in the form of relative frequencies may be shown in a pie diagram or a divided bar diagram.
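For either diagram, each class receives a share proportional to its relative frequency: a share of the 360 degrees of the circle in a pie diagram, or a share of the bar's length in a divided bar diagram. The sketch below works this out for the figures of Table 2; the bar length of 10 cm is an arbitrary assumption made only for illustration.

```python
# Relative frequencies from Table 2.
relative_frequencies = {"Aware": 0.25, "Unaware": 0.75}

BAR_LENGTH_CM = 10  # assumed length of the divided bar, chosen arbitrarily

# Pie diagram: each class gets a sector whose angle is its share of 360 degrees.
angles = {state: rf * 360 for state, rf in relative_frequencies.items()}

# Divided bar diagram: each class gets a segment that is its share of the bar.
segments = {state: rf * BAR_LENGTH_CM for state, rf in relative_frequencies.items()}

print(angles)    # {'Aware': 90.0, 'Unaware': 270.0}
print(segments)  # {'Aware': 2.5, 'Unaware': 7.5}
```

The angles add up to 360 degrees and the segments to the full bar length, just as the relative frequencies add up to 1.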
Discrete and continuous variables: In studying data regarding quantitative characters, it is found that these may be of two principal types. In the first place, the character may take only some isolated values, like the number of letters in a word (word length), the number of members in a family (family size) and so forth. Alternatively, it may conceivably take any value within its range of variation. The height, weight or age of a man, the diameter of a bobbin, and the temperature, rainfall or humidity in a region are variables of this type.
Even in the second case, the actual measurements will present a discreteness, e.g. when heights are given correct to the nearest cm. But this discreteness is completely artificial, being due to the limitations of the measuring instrument.
Variables of the first type are called discontinuous or discrete, while those of the second type are called continuous.