Statistical determination of the sample. Population and sampling method

Sample

A sample, or sample population, is a set of cases (subjects, objects, events, specimens) selected from the general population by a defined procedure to participate in a study.

Sample characteristics:

  • Qualitative characteristics of the sample - who exactly we choose and what sampling methods we use for this.
  • Quantitative characteristics of the sample - how many cases we select, in other words, sample size.

Necessity of sampling

  • The object of study is very extensive. For example, consumers of a global company’s products are represented by a huge number of geographically dispersed markets.
  • There is a need to collect primary information.

Sample size

Sample size is the number of cases included in the sample population. For statistical reasons, a sample of at least 30-35 cases is recommended.

Dependent and independent samples

When comparing two (or more) samples, an important parameter is whether they are dependent. If a one-to-one correspondence can be established between the cases of the two samples (that is, each case in sample X corresponds to exactly one case in sample Y and vice versa), and the basis of this pairing matters for the trait being measured, the samples are called dependent. Examples of dependent samples:

  • pairs of twins,
  • two measurements of any trait before and after experimental exposure,
  • husbands and wives
  • and so on.

If there is no such relationship between the samples, they are considered independent.

Accordingly, dependent samples always have the same size, while the size of independent samples may differ.

Samples are compared using various statistical tests.

Representativeness

The sample may be considered representative or non-representative.

Example of a non-representative sample

  1. A study with experimental and control groups, which are placed in different conditions.
    • Study with experimental and control groups using a pairwise selection strategy
  2. A study using only one group - an experimental one.
  3. A study using a mixed (factorial) design - all groups are placed in different conditions.

Sample Types

Samples are divided into two types:

  • probabilistic
  • non-probabilistic

Probability samples

  1. Simple probability sampling:
    • Simple random sampling with replacement. This method assumes that each respondent is equally likely to be included in the sample. Cards bearing respondent numbers are prepared from the list of the general population. The cards are shuffled, one is drawn at random, its number is recorded, and the card is returned to the deck. The procedure is repeated as many times as the required sample size. Disadvantage: the same unit can be selected more than once.

The procedure for constructing a simple random sample includes the following steps:

1. obtain a complete list of the members of the population and number it. Such a list is called a sampling frame;

2. determine the intended sample size, that is, the expected number of respondents;

3. draw as many numbers from a table of random numbers as there are units needed in the sample. If the sample is to contain 100 people, 100 random numbers are taken from the table. The random numbers can also be generated by a computer program;

4. select from the sampling frame the observations whose numbers match the random numbers drawn.
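The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `simple_random_sample`, the `seed` parameter and the 1000-member frame are assumptions made for the example.

```python
import random

def simple_random_sample(frame, n, replace=False, seed=None):
    """Draw n units from a sampling frame (a numbered list of population members)."""
    rng = random.Random(seed)
    if replace:
        # With replacement: the drawn "card" goes back, so a unit may repeat.
        return [rng.choice(frame) for _ in range(n)]
    # Without replacement: each unit can be drawn at most once.
    return rng.sample(frame, n)

# Hypothetical sampling frame: members numbered 1..1000.
frame = list(range(1, 1001))
sample = simple_random_sample(frame, 100, seed=42)
print(len(sample), len(set(sample)))
```

Passing `replace=True` reproduces the card scheme in which each drawn card is returned to the deck.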

  • Simple random sampling has obvious advantages. The method is easy to understand, and the results of the study can be generalized to the population being studied. Most approaches to statistical inference assume that the data were collected by simple random sampling. However, the method has at least four significant limitations:

1. It is often difficult to create a sampling frame that would allow simple random sampling.

2. Simple random sampling may produce a sample that is very large or spread over a wide geographic area, which significantly increases the time and cost of data collection.

3. The results of simple random sampling often have lower precision and a larger standard error than the results of other probability methods.

4. Simple random sampling (SRS) may yield a non-representative sample. Although samples obtained this way represent the population adequately on average, some individual samples misrepresent it badly. This is especially likely when the sample size is small.

  • Simple random sampling without replacement. The procedure is the same, except that the drawn cards are not returned to the deck.
  1. Systematic probability sampling. This is a simplified version of simple probability sampling. Respondents are selected from the list of the general population at a fixed interval (K); the value of K is determined randomly. The most reliable results are obtained with a homogeneous population; otherwise the step size may coincide with some internal cyclic pattern in the list and bias the sample. Disadvantages: the same as for a simple probability sample.
  2. Serial (cluster) sampling. The selection units are statistical series (a family, a school, a team, etc.). Every element of a selected series is examined. The series themselves can be selected by random or systematic sampling. Disadvantage: the sample may be more homogeneous than the general population.
  3. Regional (stratified) sampling. When the population is heterogeneous, it is recommended to divide it into homogeneous parts before applying probability sampling with any selection technique; such a sample is called a regionalized sample. The groups can be natural formations (for example, city districts) or can be defined by any characteristic central to the study. The characteristic used for the division is called the stratification (zoning) characteristic.
  4. "Convenience" sample. Convenience sampling consists of contacting “convenient” sampling units: a group of students, a sports team, friends and neighbors. Because selection here does not rely on chance, it is more accurately a non-probability method. If the aim is to learn how people react to a new concept, this type of sampling is quite reasonable. Convenience sampling is often used to pretest questionnaires.
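Systematic selection from a list can be sketched as follows. The sketch assumes one common variant: the interval K is fixed as the list length divided by the sample size, and only the starting position within the first interval is random. The function name and the example frame are illustrative.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Take every k-th unit of the list after a random start in the first interval."""
    k = len(frame) // n  # sampling interval (assumed fixed here)
    if k < 1:
        raise ValueError("sample size exceeds the size of the frame")
    start = random.Random(seed).randrange(k)  # random start in 0..k-1
    return [frame[start + i * k] for i in range(n)]

# Hypothetical frame of 1000 numbered units, sample of 100 (interval of 10).
units = systematic_sample(list(range(1000)), 100, seed=3)
print(units[:5])
```

If the list itself has a period equal to the interval (for example, every tenth entry is a corner apartment), every selected unit shares that property, which is the cyclic-pattern risk mentioned above.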

Non-probability samples

Selection in such a sample is carried out not according to the principles of randomness, but according to subjective criteria - availability, typicality, equal representation, etc.

  1. Quota sampling - the sample is constructed as a model that reproduces the structure of the general population in the form of quotas (proportions) of the characteristics being studied. The number of sample elements with each combination of the studied characteristics is set so that it matches their share (proportion) in the general population. For example, if the general population consists of 5,000 people, of whom 2,000 are women and 3,000 are men, then the quota sample will contain 20 women and 30 men, or 200 women and 300 men. Quota samples are most often based on demographic criteria: gender, age, region, income, education, and others. Disadvantages: such samples are usually not representative, because it is impossible to take several social parameters into account at once. Advantages: the material is readily available.
  2. Snowball method. The sample is constructed as follows. Each respondent, starting with the first, is asked for the contact details of friends, colleagues and acquaintances who meet the selection conditions and could take part in the study. Thus, except for the first step, the sample is formed with the participation of the research subjects themselves. The method is often used when hard-to-reach groups of respondents have to be found and interviewed (for example, respondents with high incomes, respondents belonging to the same professional group, or respondents who share a hobby or interest).
  3. Spontaneous sampling - sampling of the so-called “first person you come across”. It is often used in television and radio polls. The size and composition of a spontaneous sample are not known in advance and are determined by a single parameter, the activity of the respondents. Disadvantages: it is impossible to establish which population the respondents represent, and therefore impossible to determine representativeness.
  4. Route survey - often used when the unit of study is the family. On a map of the settlement in which the survey will be carried out, all streets are numbered. Large numbers are then drawn using a table (or generator) of random numbers. Each large number is treated as consisting of three components: street number (the first 2-3 digits), house number, and apartment number. For example, in the number 14832, 14 is the street number on the map, 8 is the house number, and 32 is the apartment number.
  5. Regional sampling with selection of typical objects. If, after zoning, a typical object is selected from each group, i.e. an object that is close to the average in terms of most of the characteristics studied in the study, such a sample is called regionalized with the selection of typical objects.

  6. Modal sampling.
  7. Expert sampling.
  8. Heterogeneous sample.
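The proportional quotas used in quota sampling (item 1 above) come down to simple arithmetic. The sketch below reproduces the 5,000-person example; the function name `quota_allocation` is an assumption, and rounding is a simplification that may leave the quotas a unit off the target total when the shares do not divide evenly.

```python
def quota_allocation(population_counts, sample_size):
    """Split sample_size across groups in proportion to their population counts."""
    total = sum(population_counts.values())
    return {group: round(sample_size * count / total)
            for group, count in population_counts.items()}

# General population from the example: 2,000 women and 3,000 men.
quotas = quota_allocation({"women": 2000, "men": 3000}, 50)
print(quotas)  # {'women': 20, 'men': 30}
```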

Group Building Strategies

The selection of groups for participation in a psychological experiment is carried out using various strategies to ensure that internal and external validity are maintained to the greatest possible extent.

Randomization

Randomization, or random selection, is used to create simple random samples. The use of such a sample is based on the assumption that each member of the population is equally likely to be included in the sample. For example, to make a random sample of 100 university students, you can put pieces of paper with the names of all university students in a hat, and then take 100 pieces of paper out of it - this will be a random selection (Goodwin J., p. 147).

Pairwise selection

Pairwise selection is a strategy for constructing groups in which the groups are made up of subjects who are equivalent on secondary parameters that are significant for the experiment. The strategy is effective for experiments with experimental and control groups; the best option is to recruit twin pairs (mono- and dizygotic), as it allows you to create...

Stratometric selection

Stratometric selection is randomization with the allocation of strata (or clusters). With this method, the general population is divided into groups (strata) by certain characteristics (gender, age, political preferences, education, income level, etc.), and subjects with the corresponding characteristics are selected from each group.
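Selection by strata can be sketched as below. The sketch assumes proportional allocation, where the same fraction is drawn at random from every stratum. The text does not fix an allocation rule, so this choice, the function name and the toy population are illustrative.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Draw the same fraction at random, without replacement, from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # group units by their stratum label
    chosen = []
    for members in strata.values():
        k = round(len(members) * fraction)     # proportional allocation (assumed)
        chosen.extend(rng.sample(members, k))
    return chosen

# Hypothetical population: 200 units in stratum "f" and 300 in stratum "m".
population = [("f", i) for i in range(200)] + [("m", i) for i in range(300)]
picked = stratified_sample(population, lambda u: u[0], 0.1, seed=1)
print(len(picked))
```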

Approximate Modeling

Approximate modeling means drawing limited samples and generalizing conclusions from such a sample to a wider population. For example, when 2nd-year university students take part in a study, its findings are extended to “people aged 17 to 21 years”. The admissibility of such generalizations is extremely limited.

Approximate modeling is the formation of a model that, for a clearly defined class of systems (processes), describes its behavior (or desired phenomena) with acceptable accuracy.


Wikimedia Foundation. 2010.


Interval estimation of event probability. Formulas for calculating the sample size using a purely random sampling method

To determine the probabilities of events that interest us, we use a sampling method: we conduct n independent experiments, in each of which event A may or may not occur (the probability p of event A is the same in every experiment). The relative frequency p* of occurrences of event A in a series of n trials is then taken as a point estimate of the probability p of event A in a single trial. The value p* is called the sample share of occurrences of event A, and p is called the general share.

By a corollary of the central limit theorem (the de Moivre-Laplace theorem), for a large sample size the relative frequency of an event can be considered normally distributed with parameters M(p*) = p and

$$\sigma(p^*) = \sqrt{\frac{p(1-p)}{n}}$$

Therefore, for n > 30, a confidence interval for the general share can be constructed using the formulas:

$$p^* - \varepsilon < p < p^* + \varepsilon, \qquad \varepsilon = u_{cr}\sqrt{\frac{p^*(1-p^*)}{n}}$$

where u_cr is found from the tables of the Laplace function for the given confidence probability γ: 2Ф(u_cr) = γ.

With a small sample size, n ≤ 30, the maximum error ε is determined from the Student distribution table:

$$\varepsilon = t_{cr}\sqrt{\frac{p^*(1-p^*)}{n}}$$

where t_cr = t(k; α), with k = n - 1 degrees of freedom and probability α = 1 - γ (two-sided area).

The formulas are valid if the selection was random and with replacement (the general population is infinite); otherwise a correction for selection without replacement is needed (see the table).

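Average sampling error for the general share

Population | Infinite | Finite, of size N
Type of selection | With replacement | Without replacement
Average sampling error | $\sqrt{\frac{w(1-w)}{n}}$ | $\sqrt{\frac{w(1-w)}{n}\left(1-\frac{n}{N}\right)}$

Here w = p* is the sample share and n is the sample size.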

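Formulas for calculating the sample size using a purely random sampling method

Selection method | Sample size for the average | Sample size for the share
With replacement | $n = \frac{t^2\sigma^2}{\varepsilon^2}$ | $n = \frac{t^2 w(1-w)}{\varepsilon^2}$
Without replacement | $n = \frac{t^2\sigma^2 N}{\varepsilon^2 N + t^2\sigma^2}$ | $n = \frac{t^2 w(1-w) N}{\varepsilon^2 N + t^2 w(1-w)}$

Here w is the share of units, ε is the required accuracy, and t is the critical value corresponding to the confidence probability γ.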

General share problems

The question “Does the confidence interval cover the given value p0?” can be answered by testing the statistical hypothesis H0: p = p0. It is assumed that the experiments follow the Bernoulli scheme (the trials are independent and the probability p of event A is constant). From a sample of size n, the relative frequency p* of event A is determined: p* = m/n, where m is the number of occurrences of event A in the series of n trials. To test the hypothesis H0, statistics are used that have a standard normal distribution when the sample size is sufficiently large (Table 1).
Table 1 - Hypotheses about the general share

Hypothesis | H0: p = p0 | H0: p1 = p2
Assumptions | Bernoulli scheme | Bernoulli scheme
Sample estimates | $p^* = \frac{m}{n}$ | $p_1^* = \frac{m_1}{n_1}$, $p_2^* = \frac{m_2}{n_2}$, $\hat{p} = \frac{m_1+m_2}{n_1+n_2}$
Statistic K | $K = \frac{p^* - p_0}{\sqrt{p_0(1-p_0)/n}}$ | $K = \frac{p_1^* - p_2^*}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}$
Distribution of K | Standard normal N(0,1) | Standard normal N(0,1)

Example No. 1. Using random sampling with replacement, the firm's management surveyed 900 of its employees. Among the respondents there were 270 women. Construct a confidence interval that covers the true proportion of women in the entire company with a probability of 0.95.
Solution. By the condition, the sample proportion of women is p* = 270/900 = 0.3 (the relative frequency of women among all respondents). Since the selection is with replacement and the sample size is large (n = 900), the maximum sampling error is determined by the formula

$$\varepsilon = u_{cr}\sqrt{\frac{p^*(1-p^*)}{n}}$$

The value of u_cr is found from the table of the Laplace function from the relation 2Ф(u_cr) = γ, i.e. Ф(u_cr) = 0.95/2 = 0.475. The Laplace function (Appendix 1) takes the value 0.475 at u_cr = 1.96. Therefore, the marginal error is ε = 1.96·√(0.3·0.7/900) ≈ 0.03, and the desired confidence interval is
(p – ε, p + ε) = (0.3 – 0.03; 0.3 + 0.03) = (0.27; 0.33)
So, with a probability of 0.95, we can state that the proportion of women in the entire company is in the range from 0.27 to 0.33.
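The calculation in Example No. 1 can be checked numerically. The sketch below uses the large-sample formula for selection with replacement and takes the critical value from the standard normal quantile function in Python's `statistics` module rather than from a printed Laplace table; the function name `proportion_ci` is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(m, n, gamma):
    """Large-sample confidence interval for a share, selection with replacement."""
    p_star = m / n
    # Two-sided critical value: 2*Phi(u_cr) = gamma in the Laplace-table notation.
    u_cr = NormalDist().inv_cdf((1 + gamma) / 2)
    eps = u_cr * sqrt(p_star * (1 - p_star) / n)
    return p_star - eps, p_star + eps, eps

low, high, eps = proportion_ci(270, 900, 0.95)
print(round(eps, 3), round(low, 2), round(high, 2))
```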

Example No. 2. The owner of a parking lot considers a day “lucky” if the lot is more than 80% full. During the year, 40 inspections of the parking lot were carried out, of which 24 were “lucky”. With a probability of 0.98, find a confidence interval for the true proportion of “lucky” days during the year.
Solution. The sample proportion of “lucky” days is p* = 24/40 = 0.6.
Using the table of the Laplace function, we find the value of u_cr for the given confidence probability γ = 0.98:
Ф(u_cr) = γ/2 = 0.49, u_cr = 2.33.
Considering the selection to be without replacement (i.e., no two checks were carried out on the same day), we find the maximum error:

$$\varepsilon = u_{cr}\sqrt{\frac{p^*(1-p^*)}{n}\left(1-\frac{n}{N}\right)}$$

where n = 40, N = 365 (days). From here ε = 2.33·√(0.6·0.4/40·(1 – 40/365)) ≈ 0.17,
and the confidence interval for the general share is (p – ε, p + ε) = (0.6 – 0.17; 0.6 + 0.17) = (0.43; 0.77)
With a probability of 0.98, we can expect that the proportion of “lucky” days during the year will be in the range from 0.43 to 0.77.
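The correction for selection without replacement in Example No. 2 can be sketched in the same way. The helper name `proportion_margin_fpc` is an assumption, and the critical value again comes from the normal quantile function instead of a table, so it differs from 2.33 only in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

def proportion_margin_fpc(m, n, N, gamma):
    """Maximum error for a share when n units are drawn without replacement from N."""
    w = m / n
    u_cr = NormalDist().inv_cdf((1 + gamma) / 2)
    # The factor (1 - n/N) under the root is the correction for non-repetition.
    return u_cr * sqrt(w * (1 - w) / n * (1 - n / N))

eps = proportion_margin_fpc(24, 40, 365, 0.98)
print(round(eps, 2), round(0.6 - eps, 2), round(0.6 + eps, 2))
```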

Example No. 3. After checking 2500 products in a batch, 400 were found to be of the highest grade and the remaining 2100 were not. How many products need to be checked to determine the proportion of the highest grade with 95% confidence and an accuracy of 0.01?
We look for a solution using the formula for determining the sample size for selection with replacement:

$$n = \frac{t^2 w(1-w)}{\varepsilon^2}$$

Ф(t) = γ/2 = 0.95/2 = 0.475, and according to the Laplace table this value corresponds to t = 1.96.
Sample proportion w = 400/2500 = 0.16; sampling error ε = 0.01. Hence n = 1.96²·0.16·0.84/0.01² ≈ 5163.1, so at least 5164 products need to be checked.
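The sample-size formulas for a share can be sketched as one function. The name `sample_size_for_share` and the optional `N` argument are assumptions; the critical value t is passed in explicitly so that the table value 1.96 can be used as in the example.

```python
from math import ceil

def sample_size_for_share(w, eps, t, N=None):
    """Required sample size for estimating a share w with maximum error eps."""
    a = t ** 2 * w * (1 - w)
    if N is None:
        # Selection with replacement (or an unknown / unlimited population).
        return ceil(a / eps ** 2)
    # Selection without replacement from a finite population of size N.
    return ceil(a * N / (eps ** 2 * N + a))

n_required = sample_size_for_share(0.16, 0.01, 1.96)
print(n_required)  # 5164
```

Supplying a finite `N` switches to the formula for selection without replacement, which never asks for more units than the formula with replacement.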

Example No. 4. A batch of products is accepted if the probability that a product complies with the standard is at least 0.97. Among 200 randomly selected products of the tested batch, 193 were found to meet the standard. Can the batch be accepted at the significance level α = 0.02?
Solution. Let us formulate the main and alternative hypotheses.
H0: p = p0 = 0.97 - the unknown general share p is equal to the specified value p0 = 0.97. In terms of the problem, the probability that a part from the inspected batch meets the standard is 0.97, i.e. the batch can be accepted.
H1: p < 0.97 - the probability that a part from the inspected batch meets the standard is less than 0.97, i.e. the batch cannot be accepted. With this alternative hypothesis the critical region is left-sided.
The observed value of the statistic K (see Table 1) is calculated for the given values p0 = 0.97, n = 200, m = 193:

$$K_{obs} = \frac{m/n - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.965 - 0.97}{\sqrt{0.97\cdot 0.03/200}} \approx -0.415$$

We find the critical value from the table of the Laplace function from the equality

$$\Phi(K_{cr}) = \frac{1 - 2\alpha}{2}$$

According to the condition, α = 0.02, hence Ф(K_cr) = 0.48 and K_cr = 2.05. The critical region is left-sided, i.e. it is the interval (-∞; -K_cr) = (-∞; -2.05). The observed value K_obs = -0.415 does not belong to the critical region; therefore, at this significance level there is no reason to reject the main hypothesis. The batch of products can be accepted.
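The test in Example No. 4 can be reproduced with a few lines of Python. The function name `one_sample_share_z` is an assumption; the critical value is taken from the normal quantile function, which agrees with the table value 2.05 to two decimal places.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_share_z(m, n, p0):
    """Statistic K for H0: p = p0 under the Bernoulli scheme."""
    return (m / n - p0) / sqrt(p0 * (1 - p0) / n)

alpha = 0.02
k_obs = one_sample_share_z(193, 200, 0.97)
k_cr = NormalDist().inv_cdf(1 - alpha)  # left-sided test: reject H0 if K < -k_cr
reject = k_obs < -k_cr
print(round(k_obs, 3), round(k_cr, 2), reject)
```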

Example No. 5. Two factories produce the same type of parts. To assess their quality, samples were taken from the output of each factory with the following results: among 200 selected products from the first factory, 20 were defective, and among 300 products from the second factory, 15 were defective.
At a significance level of 0.025, determine whether there is a significant difference in the quality of the parts manufactured by these factories.
Solution. We test H0: p1 = p2 against the two-sided alternative H1: p1 ≠ p2. The sample shares of defective products are p1* = 20/200 = 0.1 and p2* = 15/300 = 0.05, and the pooled share is (20 + 15)/(200 + 300) = 0.07. The observed value of the statistic (Table 1) is K_obs = (0.1 – 0.05)/√(0.07·0.93·(1/200 + 1/300)) ≈ 2.15.
According to the condition, α = 0.025, hence Ф(K_cr) = (1 – α)/2 = 0.4875 and K_cr = 2.24. With a two-sided alternative, the acceptance region has the form (-2.24; 2.24). The observed value K_obs = 2.15 falls within this interval, i.e. at this significance level there is no reason to reject the main hypothesis. No significant difference in the quality of the factories' products has been found.
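The comparison of two shares in Example No. 5 can be sketched as follows. It assumes the pooled-share form of the statistic, which reproduces the observed value 2.15 quoted in the example; the function name `two_sample_share_z` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_share_z(m1, n1, m2, n2):
    """Statistic K for H0: p1 = p2, using the pooled share of both samples."""
    p1, p2 = m1 / n1, m2 / n2
    pooled = (m1 + m2) / (n1 + n2)
    return (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

alpha = 0.025
k_obs = two_sample_share_z(20, 200, 15, 300)
k_cr = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
reject = abs(k_obs) > k_cr
print(round(k_obs, 2), round(k_cr, 2), reject)
```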

Topic: Sampling method in statistics

1. The concept of sample observation, its tasks

Statistical observation can be organized as complete (continuous) or partial (non-continuous). Continuous observation examines every unit of the population being studied and involves large labor and material costs. Non-continuous observation examines only a certain part of the units, from which the properties of the whole population are judged. In statistical practice, the most common form of non-continuous observation is sample (selective) observation.

Selective observation is a type of non-continuous observation in which the units to be examined are selected at random, the selected part is studied, and the results are extended to the entire original population. The observation is organized so that the selected part represents the whole population on a reduced scale.

The population from which the selection is made is called the general population, and all its summary indicators are called general indicators.

The set of selected units is called the sample population, and all its summary indicators are called sample indicators.

There are a number of reasons why, in many cases, selective observation is preferred over continuous observation. The most significant of them are the following:

Saving time and money as a result of reducing the amount of work;

Minimizing damage or destruction of the objects under study (determining the tensile strength of yarn, testing light bulbs for burning time, checking canned food for good quality);

The need for a detailed study of each observation unit when it is impossible to cover all units (when studying the family budget);

Achieving greater accuracy of survey results by reducing errors that occur during registration.

The advantage of selective observation over continuous observation can be realized only if it is organized and carried out in strict accordance with the scientific principles of sampling theory. These principles are: ensuring the randomness of the selection of units (an equal opportunity for each to be included in the sample) and ensuring a sufficient number of them. Compliance with these principles gives an objective guarantee that the resulting sample population is representative. The representativeness of the selected population should be understood not as representing all characteristics of the population being studied, but only those characteristics that are studied or that significantly affect the summary general characteristics.

The main task of sample observation in economics is to obtain reliable judgments about the indicators of the average and share in the population based on the characteristics of the sample population (average and share). It should be borne in mind that in any statistical research (continuous and selective) errors of two types arise: registration and representativeness.

Registration errors can be random (unintentional) or systematic (tendentious). Random errors usually balance each other out, since they have no predominant direction toward overstating or understating the indicator being studied. Systematic errors are directed one way, as a result of a deliberate violation of the selection rules (biased goals). They can be avoided with proper organization and monitoring.

Representativeness errors are inherent only in selective observation and arise because the sample population does not completely reproduce the general population. They are the discrepancy between the indicator values obtained from the sample and the values of the same indicators that would be obtained by continuous observation carried out with the same accuracy, i.e. between the sample indicators and the corresponding general indicators.

For each specific sample observation, the size of the representativeness error can be determined from the appropriate formulas, which depend on the type, method and manner of forming the sample population.

By type, selection is individual, group or combined. In individual selection, individual units of the general population are selected into the sample; in group selection, qualitatively homogeneous groups or series of units are selected; combined selection is a combination of the first two types.

By method, selection is either with replacement (repeated) or without replacement (non-repetitive).

In sampling with replacement, the total number of units in the general population stays unchanged during sampling. A unit included in the sample is returned to the population after registration and keeps the same chance as all other units of being selected again (“selection according to the returned ball scheme”). Sampling with replacement is rare in socioeconomic practice. Usually the sample is organized according to a scheme without replacement.

In sampling without replacement, a unit included in the sample is not returned to the general population and takes no further part in the selection; that is, each subsequent draw is made from the general population without the previously selected units (“selection according to the unreturned ball scheme”). Thus, with sampling without replacement, the number of units in the general population decreases during the selection.

The manner of selection specifies the concrete mechanism or procedure for drawing units from the general population.

By the number of population units they cover, samples are divided into large and small (n < 30).

In the practice of sample surveys, the most widespread types of sampling are simple random, mechanical, typical, serial and combined.

The main characteristics of the general and sample populations are denoted by the following symbols:

N - size of the general population (number of units included in it);

n - sample size (number of units surveyed);

$\bar{x}$ - general average (average value of a characteristic in the general population);

$\tilde{x}$ - sample mean;

p - general share (the share of units possessing a given value of the characteristic in the total number of units in the general population);

w - sample share;

$\sigma^2$ - general variance (variance of a characteristic in the general population);

$S^2$ - sample variance of the same characteristic;

$\sigma$ - standard deviation in the general population;

S - standard deviation in the sample.

2. Sampling errors

During selective observation, the randomness of the selection of units must be ensured. Each unit must have an equal chance of being selected. This is what a simple random sample is based on.

A simple random sample is obtained by selecting units from the entire general population (without first dividing it into any groups) by drawing lots (most often) or by a similar method, for example with a table of random numbers. Random selection is not haphazard selection. The randomness principle requires that no factor other than chance can influence whether an item is included in or excluded from the sample. A lottery draw is an example of simple random selection: from all the tickets issued, a certain portion of numbers is drawn at random to receive the winnings, and every number has an equal opportunity to be drawn. The number of units selected into the sample population is usually determined from the accepted sampling fraction.

The sampling fraction is the ratio of the number of units in the sample population (n) to the number of units in the general population (N):

$$\frac{n}{N}$$

Thus, for a batch of 1000 parts, a 5% sample has size n = 50 units, a 10% sample has n = 100 units, and so on. With a correct, scientifically grounded organization of sampling, representativeness errors can be reduced to minimal values, so that sample observation becomes quite accurate.

Simple random selection “in its pure form” is rarely used in the practice of selective observation, but it is the starting point for all other types of selection; it contains and implements the basic principles of selective observation.

Let's consider some questions of the theory of the sampling method and the error formula for a simple random sample.

When the sampling method is applied in statistics, two main types of general indicators are usually used: the average value of a quantitative characteristic and the relative value of an alternative characteristic (the share of units in a statistical population that differ from all other units of this population only by possessing the characteristic being studied).

The sample share (w), or relative frequency, is the ratio of the number of units m possessing the characteristic being studied to the total number of units n in the sample population:

w = m/n.

For example, if out of 100 sample parts (n = 100), 95 parts turned out to be standard (m = 95), then the sample share is

w = 95 / 100 = 0.95.

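The reliability of sample indicators is characterized by the average and the maximum sampling error.

The sampling error, or representativeness error, is the difference between the corresponding sample and general characteristics. For the average, with sample mean $\tilde{x}$ and general average $\bar{x}$:

$$\varepsilon_{\tilde{x}} = \tilde{x} - \bar{x} \qquad (1)$$

For the share, with sample share w and general share p:

$$\varepsilon_{w} = w - p \qquad (2)$$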

Sampling error occurs only in sample observations. The greater the value of this error, the more the sample indicators differ from the corresponding general indicators.

Sample mean and sample proportion are inherently random variables which can take on different values ​​depending on which population units are included in the sample. Therefore, sampling errors are also random variables and can take on different values. Therefore, the average of possible errors is determined - the average sampling error.

What does the average sampling error depend on? If the principle of random selection is observed, the average sampling error is determined, first of all, by the sample size: other things being equal, the larger the sample, the smaller the average sampling error. By covering an increasing number of units of the general population with a sample survey, we characterize the entire general population more and more accurately.

The average sampling error also depends on the degree of variation of the characteristic being studied. The degree of variation is characterized by the variance $\sigma^2$, or by w(1 - w) for an alternative characteristic. The smaller the variation of the characteristic, and therefore the variance, the smaller the average sampling error, and vice versa. With zero variance (the characteristic does not vary), the average sampling error is zero, i.e. any unit of the general population characterizes the entire population exactly with respect to this characteristic.

The dependence of the average sampling error on the sample size and on the degree of variation of the characteristic is reflected in formulas for calculating the average sampling error under the conditions of selective observation, when the general characteristics ($\bar{x}$, p) are unknown and the actual sampling error therefore cannot be found directly from formulas (1) and (2).

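With random sampling with replacement, the average errors are theoretically calculated using the following formulas:

for the average of a quantitative characteristic

$$\mu_{\tilde{x}} = \sqrt{\frac{\sigma^2}{n}} \qquad (3)$$

for a share (alternative characteristic)

$$\mu_{w} = \sqrt{\frac{p(1-p)}{n}} \qquad (4)$$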

Since the variance $\sigma^2$ of a characteristic in the general population is in practice not known exactly, the variance $S^2$ calculated for the sample population is used instead, on the basis of the law of large numbers, according to which a sufficiently large sample reproduces the characteristics of the general population fairly accurately.

Thus, the formulas for the average sampling error with random sampling with replacement are as follows:

for the average of a quantitative characteristic

$$\mu_{\tilde{x}} = \sqrt{\frac{S^2}{n}} \qquad (5)$$

for a share (alternative characteristic)

$$\mu_{w} = \sqrt{\frac{w(1-w)}{n}} \qquad (6)$$

However, the variance of the sample population is not equal to the variance of the general population, and therefore the average sampling errors calculated using formulas (5) and (6) are approximate. Probability theory shows that the general variance is expressed through the sample variance by the following relation:

$$\sigma^2 = S^2\,\frac{n}{n-1} \qquad (7)$$

Since n/(n - 1) is close to one for sufficiently large n, we can assume that $\sigma^2 \approx S^2$, and therefore formulas (5) and (6) can be used in practical calculations of average sampling errors. Only for a small sample (when the sample size does not exceed 30) must the coefficient n/(n - 1) be taken into account, and the average error of a small sample is calculated by the formula:

$$\mu_{\tilde{x}} = \sqrt{\frac{S^2}{n-1}} \qquad (8)$$

For non-repetitive selection (sampling without replacement), the expression under the radical in the above formulas must be multiplied by $1 - n/N$, since the number of units in the general population decreases as units are selected. Therefore, for non-repetitive selection, the calculation formulas for the average sampling error take the following form:

for the average quantitative characteristic

$$\mu_{\tilde{x}} = \sqrt{\frac{S^2}{n}\left(1 - \frac{n}{N}\right)}$$ (9)

for a share (alternative attribute)

$$\mu_{w} = \sqrt{\frac{w(1 - w)}{n}\left(1 - \frac{n}{N}\right)}$$ (10)

Because $n$ is always less than $N$, the additional factor $1 - n/N$ is always less than one. It follows that the average error with non-repetitive selection is always smaller than with repeated selection. At the same time, when the sample is a relatively small percentage of the population, this factor is close to one (for example, with a 5% sample it is 0.95; with a 2% sample it is 0.98, etc.). Therefore, in practice formulas (5) and (6) are sometimes used to determine the average sampling error without this factor, even though the sample is organized as non-repetitive. This is done when the number of units in the general population $N$ is unknown or unlimited, or when $n$ is very small compared to $N$, so that introducing an additional factor close to unity would have virtually no effect on the average sampling error.
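The relationships above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `mean_error` and `share_error` are hypothetical, the sample variance S² and sample share w stand in for the unknown population values, and passing `N=None` means repeated selection (no factor 1 - n/N).

```python
import math

def mean_error(s2, n, N=None, small_sample=False):
    """Average sampling error of the mean.

    s2: sample variance S^2; n: sample size.
    N: population size (None means repeated selection, no 1 - n/N factor).
    small_sample: divide by n - 1 instead of n, as for samples of 30 or fewer.
    """
    denom = n - 1 if small_sample else n
    fpc = 1.0 if N is None else 1.0 - n / N  # finite-population factor
    return math.sqrt(s2 / denom * fpc)

def share_error(w, n, N=None):
    """Average sampling error of a share (alternative attribute)."""
    fpc = 1.0 if N is None else 1.0 - n / N
    return math.sqrt(w * (1.0 - w) / n * fpc)

# Hypothetical inputs: S^2 = 36, n = 100, and a 5% sample (N = 2000).
mu_repeated = mean_error(36.0, 100)
mu_non_repetitive = mean_error(36.0, 100, N=2000)
```

With a 5% sample the two values differ only by the factor sqrt(0.95), which is why the correction is often dropped for small sampling fractions.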

Mechanical sampling means that the general population is divided, according to a neutral criterion, into equal intervals (groups), and only one unit is selected for the sample from each such group. To avoid bias, the unit in the middle of each group should be selected.

When organizing mechanical selection, population units are first arranged (usually in a list) in a certain order (for example, alphabetically, by location, or in ascending or descending order of some indicator not related to the property being studied), after which a given number of units is selected mechanically, at a fixed interval. The size of the interval is the reciprocal of the sample share. Thus, with a 2% sample every 50th unit is selected and checked (1 : 0.02), and with a 5% sample every 20th unit (1 : 0.05), for example every 20th part coming off a machine.
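A minimal sketch of mechanical selection, assuming the population is already an ordered Python list and the sample share divides it into whole intervals. The function name `mechanical_sample` is hypothetical; the unit is taken from the middle of each interval, as recommended above.

```python
def mechanical_sample(units, share):
    """Pick one unit from the middle of each equal interval of an ordered list."""
    interval = round(1 / share)   # e.g. a 2% sample -> every 50th unit
    start = interval // 2         # index of the middle unit of the first interval
    return units[start::interval]

# Hypothetical population of 1000 numbered units, 2% sample.
population = list(range(1000))
sample = mechanical_sample(population, 0.02)
```

Here the slice step plays the role of the selection interval, so the sample size follows automatically from the length of the list.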

With a sufficiently large population, mechanical selection is close to pure random selection in terms of the accuracy of the results. Therefore, to determine the average error of mechanical sampling, the formulas for proper random non-repetitive sampling (9), (10) are used.

To select units from a heterogeneous population, the so-called typical sample is used. It applies when all units of the general population can be divided into several qualitatively homogeneous groups according to characteristics that influence the indicators being studied.

When surveying enterprises, such groups can be, for example, industry and sub-industry, or forms of ownership. Then, from each typical group, units are selected individually into the sample population by simple random or mechanical sampling.

Typical sampling is usually used when studying complex statistical populations, for example in a sample survey of the family budgets of workers and employees in particular sectors of the economy, or of the labor productivity of enterprise workers grouped by qualification.

Typical sampling gives more accurate results than other methods of selecting units into the sample population. Dividing the general population into types ensures that every typological group is represented in the sample, which makes it possible to eliminate the influence of intergroup variance on the average sampling error.

When determining the average error of a typical sample, the average of the within-group variances serves as the indicator of variation.

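The average sampling error is found using the formulas:

for the average quantitative characteristic

$$\mu_{\tilde{x}} = \sqrt{\frac{\overline{S_i^2}}{n}}$$ (repeated selection); (11)

$$\mu_{\tilde{x}} = \sqrt{\frac{\overline{S_i^2}}{n}\left(1 - \frac{n}{N}\right)}$$ (non-repetitive selection); (12)

for a share (alternative attribute)

$$\mu_{w} = \sqrt{\frac{\overline{w_i(1 - w_i)}}{n}}$$ (repeated selection); (13)

$$\mu_{w} = \sqrt{\frac{\overline{w_i(1 - w_i)}}{n}\left(1 - \frac{n}{N}\right)}$$ (non-repetitive selection), (14)

where $\overline{S_i^2}$ is the average of the within-group variances for the sample population;

$\overline{w_i(1 - w_i)}$ is the average of the within-group variances of the share (alternative characteristic) for the sample population.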
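As a rough illustration of the typical-sample error for the mean, the sketch below assumes that the within-group variances are averaged with the group sample sizes as weights (a common convention that the text does not spell out). The function name `typical_mean_error` and the group data are hypothetical.

```python
import math

def typical_mean_error(groups, N=None):
    """Average error of a typical (stratified) sample mean.

    groups: list of (within_group_variance, group_sample_size) pairs.
    N: population size; None means repeated selection.
    """
    n = sum(size for _, size in groups)
    # Average of within-group variances, weighted by group sample size (assumption).
    avg_within = sum(var * size for var, size in groups) / n
    fpc = 1.0 if N is None else 1.0 - n / N
    return math.sqrt(avg_within / n * fpc)

# Hypothetical data: two typical groups with variances 4 and 9.
groups = [(4.0, 30), (9.0, 70)]
mu_repeated = typical_mean_error(groups)
mu_non_repetitive = typical_mean_error(groups, N=2000)
```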

Serial sampling involves random selection from the general population not of individual units, but of their equal groups (nests, series) in order to subject all units in such groups to observation without exception.

The use of serial sampling is due to the fact that many goods for their transportation, storage and sale are packaged in bundles, boxes, etc. Therefore, when monitoring the quality of packaged goods, it is more rational to check several packages (series) than to select the required amount of product from all packages.

Since within groups (series) all units without exception are examined, the average sampling error (when selecting equal series) depends only on the intergroup (interseries) dispersion.

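The average sampling error for the average quantitative characteristic in serial selection is found using the formulas:

$$\mu_{\tilde{x}} = \sqrt{\frac{\delta_x^2}{r}}$$ (repeated selection); (15)

$$\mu_{\tilde{x}} = \sqrt{\frac{\delta_x^2}{r}\left(1 - \frac{r}{R}\right)}$$ (non-repetitive selection), (16)

where $r$ is the number of selected series and $R$ is the total number of series.

The between-group (inter-series) variance of a serial sample is calculated as follows:

$$\delta_x^2 = \frac{\sum (\tilde{x}_i - \tilde{x})^2}{r}$$

where $\tilde{x}_i$ is the average of the i-th series and $\tilde{x}$ is the overall average for the entire sample population.

The average sampling error for a share (alternative attribute) in serial selection:

$$\mu_{w} = \sqrt{\frac{\delta_w^2}{r}}$$ (repeated selection); (17)

$$\mu_{w} = \sqrt{\frac{\delta_w^2}{r}\left(1 - \frac{r}{R}\right)}$$ (non-repetitive selection). (18)

The between-group (inter-series) variance of the share in a serial sample is determined by the formula:

$$\delta_w^2 = \frac{\sum (w_i - w)^2}{r}$$ (19)

where $w_i$ is the share of the characteristic in the i-th series and $w$ is the overall share of the characteristic in the entire sample population.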
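The serial-sample error for the mean can be sketched as follows. The code assumes equal-size series, so that the overall mean is the simple mean of the series means, and takes the between-series variance as the mean squared deviation of the series means from the overall mean. The function name `serial_mean_error` and the data are hypothetical.

```python
import math

def serial_mean_error(series_means, R=None):
    """Average error of a serial (cluster) sample mean for equal-size series.

    series_means: mean of the characteristic in each selected series.
    R: total number of series in the population; None means repeated selection.
    """
    r = len(series_means)
    overall = sum(series_means) / r
    # Between-series variance: mean squared deviation of series means.
    delta2 = sum((m - overall) ** 2 for m in series_means) / r
    fpc = 1.0 if R is None else 1.0 - r / R
    return math.sqrt(delta2 / r * fpc)

# Hypothetical data: 4 series selected out of 16.
means = [10.0, 12.0, 14.0, 16.0]
mu_repeated = serial_mean_error(means)
mu_non_repetitive = serial_mean_error(means, R=16)
```

The same function works for a share if the list holds the shares w_i of the selected series instead of their means.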

In the practice of statistical surveys, in addition to the previously discussed selection methods, a combination of them is used (combined selection).

3. Extension of sample results to the general population

The ultimate goal of sample observation is to characterize the population based on sample results.

Sample averages and relative values are extended to the general population, taking into account the limit of their possible error.

In each specific sample, the discrepancy between the sample mean and the general mean, i.e. $|\tilde{x} - \bar{x}|$, may be less than the average sampling error $\mu$, equal to it, or greater than it.

Moreover, each of these discrepancies has a different probability (the objective possibility of an event occurring). Therefore, the actual discrepancy between the sample mean and the general mean can be treated as a certain marginal error that is tied to the average error and guaranteed with a certain probability $P$.

The maximum (marginal) sampling error for the average, $\Delta_{\tilde{x}}$, with repeated selection can be calculated using the formula:

$$\Delta_{\tilde{x}} = t\,\mu_{\tilde{x}} = t\sqrt{\frac{S^2}{n}}$$ (20)

where $t$ is the normalized deviation (the "confidence coefficient"), which depends on the probability with which the maximum sampling error is guaranteed;

$\mu_{\tilde{x}}$ is the average sampling error.

The maximum sampling error for the share with repeated selection can be written in a similar way:

$$\Delta_{w} = t\,\mu_{w} = t\sqrt{\frac{w(1 - w)}{n}}$$ (21)

With random non-repetitive selection, the expression under the radical in formulas (20) and (21) for the maximum sampling error must be multiplied by $1 - n/N$.

The formula for the maximum sampling error follows from the basic principles of the theory of the sampling method, formulated in a number of theorems of probability theory reflecting the law of large numbers.

Based on the theorem of P.L. Chebyshev (with clarifications by A.M. Lyapunov) with a probability as close to unity as possible, it can be argued that with a sufficiently large sample size and limited general dispersion, sample generalizing indicators (average, share) will differ as little as possible from the corresponding general indicators.

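Applied to the average value of the attribute, this theorem can be written as follows:

$$P\left(|\tilde{x} - \bar{x}| \le \Delta_{\tilde{x}}\right) = F(t)$$ (22)

and for the share of the attribute:

$$P\left(|w - p| \le \Delta_{w}\right) = F(t)$$ (23)

where

$$F(t) = \frac{2}{\sqrt{2\pi}} \int_0^{t} e^{-z^2/2}\,dz$$ (24)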

Thus, the magnitude of the maximum sampling error can be established with a certain probability.

The values of the function $F(t)$ for different values of $t$, expressed as a multiple of the average sampling error, are taken from specially compiled tables. Here are the values used most often for samples of a sufficiently large size ($n \ge 30$):

t      1.000  1.960  2.000  2.580  3.000
F(t)   0.683  0.950  0.954  0.990  0.997

The marginal sampling error describes the accuracy of the sample with a certain probability, whose value is determined by the coefficient $t$ (in practical calculations, as a rule, this probability should not be less than 0.95). Thus, when $t = 1$ the maximum error is $\Delta = \mu$. Therefore, with a probability of 0.683 it can be stated that the difference between the sample and general indicators will not exceed one average sampling error. In other words, in 68.3% of cases the representativeness error will not go beyond $\pm 1\mu$.

At $t = 2$, with a probability of 0.954, it will not go beyond $\pm 2\mu$;

at $t = 3$, with a probability of 0.997, it will not go beyond $\pm 3\mu$, etc.

As can be seen from the values of $F(t)$ above (see the last value), the probability of an error equal to or greater than three times the average sampling error, i.e. $\Delta \ge 3\mu$, is extremely small: 0.003, i.e. 1 - 0.997. Such unlikely events are considered practically impossible, and therefore $\Delta = 3\mu$ can be taken as the limit of possible sampling error.
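The tabulated values can be checked numerically. The sketch below assumes that F(t) is the probability that a standard normal variable falls within ±t, which can be expressed through the error function as erf(t/√2); this assumption reproduces the values quoted above to three decimal places. The function name `F` simply mirrors the notation of the text.

```python
import math

def F(t):
    """Probability that a standard normal deviate lies within +/- t."""
    return math.erf(t / math.sqrt(2))

# Print F(t) for the confidence coefficients used most often.
for t in (1.0, 1.96, 2.0, 2.58, 3.0):
    print(f"t = {t:<5} F(t) = {F(t):.3f}")
```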

Sample observation is carried out in order to extend the conclusions obtained from the sample data to the general population. One of the main tasks is to estimate the studied characteristics (parameters) of the general population using sample data.

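The maximum sampling error allows us to determine the limiting values of the population characteristics and their confidence intervals:

for the average: $\bar{x} = \tilde{x} \pm \Delta_{\tilde{x}}$, i.e. $\tilde{x} - \Delta_{\tilde{x}} \le \bar{x} \le \tilde{x} + \Delta_{\tilde{x}}$ (25)

for the share: $p = w \pm \Delta_{w}$, i.e. $w - \Delta_{w} \le p \le w + \Delta_{w}$ (26)

This means that with a given probability it can be stated that the value of the general average should be expected in the range from $\tilde{x} - \Delta_{\tilde{x}}$ to $\tilde{x} + \Delta_{\tilde{x}}$.

The confidence interval of the general share can be written in a similar way: from $w - \Delta_{w}$ to $w + \Delta_{w}$.

Along with the absolute value of the maximum sampling error, the maximum relative sampling error is also calculated. It is defined as the percentage ratio of the marginal sampling error to the corresponding characteristic of the sample population:

for the average, %: $\Delta_{\%} = \frac{\Delta_{\tilde{x}}}{\tilde{x}} \cdot 100$ (27)

for the share, %: $\Delta_{\%} = \frac{\Delta_{w}}{w} \cdot 100$ (28)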

Let's consider finding the average and maximum sampling errors, determining the confidence limits of the average and proportion using specific examples.

Task 1. To determine the speed of settlements with creditors of the enterprises of a corporation, a commercial bank drew a random sample of 100 payment documents, for which the average time for transferring and receiving money turned out to be 22 days ($\tilde{x} = 22$) with a standard deviation of 6 days ($S = 6$).

With probability $P = 0.954$, determine the maximum error of the sample average and the confidence limits of the average duration of settlements of the enterprises of this corporation.

Solution. The marginal error $\Delta_{\tilde{x}} = t\mu_{\tilde{x}}$ is determined by the repeated selection formula (20), since the size of the general population $N$ is unknown. From the values of $F(t)$ given above, for probability $P = 0.954$ we find $t = 2$.

Therefore, the maximum sampling error, days:

$$\Delta_{\tilde{x}} = t\sqrt{\frac{S^2}{n}} = 2\sqrt{\frac{6^2}{100}} = 1.2$$

The general average will be equal to $\bar{x} = \tilde{x} \pm \Delta_{\tilde{x}}$, and the confidence limits of the general average are calculated from the double inequality:

$$22 - 1.2 \le \bar{x} \le 22 + 1.2, \qquad 20.8 \le \bar{x} \le 23.2$$

Thus, with a probability of 0.954, it can be stated that the average duration of settlements for enterprises of this corporation ranges from 20.8 to 23.2 days.
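The arithmetic of Task 1 can be reproduced with a short script. It is only a sketch: the variable names are illustrative, and the repeated-selection formula is used because the population size is unknown.

```python
import math

# Data from Task 1.
n = 100        # number of payment documents in the sample
mean = 22.0    # sample average, days
s = 6.0        # sample standard deviation, days
t = 2          # confidence coefficient for probability 0.954

mu = math.sqrt(s ** 2 / n)   # average sampling error
delta = t * mu               # maximum (marginal) sampling error
lower, upper = mean - delta, mean + delta

print(f"delta = {delta:.1f} days, interval: {lower:.1f} to {upper:.1f} days")
```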

Task 2. In a survey of per capita income, 1,000 families of the region were sampled (2% mechanical sample), and 300 of them were found to be low-income.

It is required to determine the share of low-income families in the entire region with a probability of 0.997.

Solution. The sample share (the share of low-income families among the surveyed families) is equal to:

$$w = \frac{300}{1000} = 0.3$$

From the values of $F(t)$ given above, for probability 0.997 we find $t = 3$. The maximum error of the share is determined by the non-repetitive selection formula (mechanical sampling is always non-repetitive); for a 2% sample, $n/N = 0.02$:

$$\Delta_{w} = t\sqrt{\frac{w(1 - w)}{n}\left(1 - \frac{n}{N}\right)} = 3\sqrt{\frac{0.3 \cdot 0.7}{1000}(1 - 0.02)} \approx 0.043, \text{ or } 4.3\%$$

Maximum relative sampling error, %:

$$\Delta_{\%} = \frac{\Delta_{w}}{w} \cdot 100 = \frac{0.043}{0.3} \cdot 100 \approx 14.3$$

The general share is $p = w \pm \Delta_{w}$, and the confidence limits of the general share are calculated from the double inequality:

$$w - \Delta_{w} \le p \le w + \Delta_{w}$$

In our example:

$$0.3 - 0.043 \le p \le 0.3 + 0.043, \qquad 0.257 \le p \le 0.343$$

Thus, with a probability of 0.997, that is, almost with certainty, it can be stated that the share of low-income families among all families in the region ranges from 25.7 to 34.3%.
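The calculation for Task 2 can be checked step by step by evaluating the non-repetitive formula for a share directly. This is a sketch with illustrative variable names; the sampling fraction n/N = 0.02 follows from the 2% sample.

```python
import math

# Data from Task 2.
n = 1000                  # families in the sample
low_income = 300          # low-income families among them
sampling_fraction = 0.02  # n / N for a 2% sample
t = 3                     # confidence coefficient for probability 0.997

w = low_income / n                                         # sample share
mu = math.sqrt(w * (1 - w) / n * (1 - sampling_fraction))  # average error
delta = t * mu                                             # maximum error
relative = delta / w * 100                                 # relative error, %
lower, upper = w - delta, w + delta

print(f"w = {w:.3f}, mu = {mu:.4f}, delta = {delta:.4f}")
print(f"interval: {lower:.3f} to {upper:.3f}, relative error {relative:.1f}%")
```

Printing the average error and the maximum error separately makes it easy to see whether the coefficient t has been applied.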

Task 3. To determine the yield of grain crops, a sample survey of 100 farms in the region of various forms of ownership was carried out, as a result of which summary data was obtained (Table 6.1). It is necessary to determine with a probability of 0.954 the maximum error of the sample average and the confidence limits of the average yield of grain crops for all farms in the region.

Table 6.1

Distribution of yields among regional farms with different forms of ownership

Solution. Since the surveyed farms in the region are grouped by form of ownership, the maximum error of the average yield is determined by the formula for a typical sample with repeated selection (the size of the general population $N$ is unknown):

$$\Delta_{\tilde{x}} = t\sqrt{\frac{\overline{S_i^2}}{n}}$$

In this formula, the average of the within-group variances $\overline{S_i^2}$ is unknown.

It is calculated by the formula:

$$\overline{S_i^2} = \frac{\sum S_i^2 n_i}{\sum n_i}$$

From the values of $F(t)$ given above, for probability $P = 0.954$ we find $t = 2$.

Then the maximum sampling error, c/ha:

General average: $\bar{x} = \tilde{x} \pm \Delta_{\tilde{x}}$. To find its boundaries, you first need to calculate the average yield for the sample population $\tilde{x}$, c/ha:

Maximum relative sampling error, %:

We calculate the confidence limits of the general average from the double inequality:

$$\tilde{x} - \Delta_{\tilde{x}} \le \bar{x} \le \tilde{x} + \Delta_{\tilde{x}}$$

Thus, with a probability of 0.954 it can be guaranteed that the average yield of grain crops in the region will be no less than 20 c/ha, but no more than 22 c/ha.

Determining the required sample size. When designing a sample observation with a predetermined permissible sampling error, it is very important to determine correctly the size (volume) of the sample population that will ensure the specified accuracy of the results with a given probability. Formulas for the required sample size $n$ are easy to obtain directly from the sampling error formulas.

Thus, from the formulas for the maximum sampling error with repeated selection it is not difficult (after squaring both sides of the equality) to express the required sample size:

for the average quantitative characteristic

$$n = \frac{t^2 S^2}{\Delta_{\tilde{x}}^2}$$ (29)

for a share (alternative attribute)

$$n = \frac{t^2 w(1 - w)}{\Delta_{w}^2}$$ (30)

Similarly, from the formulas for the maximum sampling error with non-repetitive selection we find that

$$n = \frac{t^2 S^2 N}{\Delta_{\tilde{x}}^2 N + t^2 S^2}$$ (for the average); (31)

$$n = \frac{t^2 w(1 - w) N}{\Delta_{w}^2 N + t^2 w(1 - w)}$$ (for the share). (32)

These formulas show that as the estimated sampling error increases, the required sample size decreases significantly.

To calculate the sample size, you need to know the variance. It can be borrowed from previously conducted surveys of the same or similar population, and if there are none, then a special small sample survey must be conducted to determine the variance.

Task 4. To determine the average age of 1200 faculty students, it is necessary to conduct a sample survey using a random, non-repetitive selection method. It is preliminarily established that the standard deviation of the age of students is 10 years.

How many students need to be surveyed so that, with probability 0.954, the maximum sampling error does not exceed 3 years?

Solution. Let us calculate the required sample size, people, using the non-repetitive selection formula (31), taking into account that $t = 2$ at $P = 0.954$:

$$n = \frac{t^2 S^2 N}{\Delta^2 N + t^2 S^2} = \frac{2^2 \cdot 10^2 \cdot 1200}{3^2 \cdot 1200 + 2^2 \cdot 10^2} = \frac{480000}{11200} \approx 42.9$$

Thus, rounding up, a sample of 43 people ensures the specified accuracy with non-repetitive selection.
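The sample-size calculation can be sketched as a small function that evaluates the required size for a mean, with or without a known population size, and rounds up to a whole unit. The function name `required_n_mean` is hypothetical; the inputs are those of Task 4 (N = 1200, standard deviation 10, permissible error 3, t = 2).

```python
import math

def required_n_mean(s2, delta, t, N=None):
    """Required sample size for estimating a mean.

    s2: variance of the characteristic; delta: permissible maximum error;
    t: confidence coefficient; N: population size (None means repeated selection).
    """
    if N is None:
        n = t ** 2 * s2 / delta ** 2
    else:
        n = t ** 2 * s2 * N / (delta ** 2 * N + t ** 2 * s2)
    return math.ceil(n)  # round up so the accuracy requirement is met

n_non_repetitive = required_n_mean(10 ** 2, 3, 2, N=1200)
n_repeated = required_n_mean(10 ** 2, 3, 2)
```

Comparing the two results shows how knowledge of the population size reduces the number of units that have to be surveyed.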

The sampling method is widely used in statistical practice to obtain economic information.

The sampling method is becoming increasingly relevant in modern conditions of the transition to a market economy. Changes in the nature of economic relations, rent, ownership of individual groups and individuals determine changes in the functions of accounting and statistics, reduction and simplification of reporting. At the same time, increasing requirements for management increase the need to provide reliable information and further increase its efficiency. All this determines the wider use of the sampling method in economics.

Domestic statistics have already accumulated some experience in sample surveys.

In the theory of the sampling method, various selection methods and types of sampling have been developed to ensure representativeness. The selection method is the procedure for selecting units from the general population. There are two selection methods: repeated and non-repetitive.

In repeated selection, each randomly selected unit is returned to the general population after being surveyed and can be included in the sample again in subsequent selection. This method is based on the "returned ball" scheme: the probability of being included in the sample is the same for every unit of the population and does not change, however many units have already been selected.

In non-repetitive selection, each randomly selected unit is not returned to the general population after being surveyed. This method is based on the "non-returned ball" scheme: the probability of being included in the sample for each remaining unit of the general population increases as selection proceeds.

Depending on the methodology for forming the sample population, the following main types of sampling are distinguished:

  • simple random (random sampling proper);
  • mechanical;
  • typical (stratified, zoned);
  • serial (nested);
  • combined;
  • multi-stage;
  • multiphase;
  • interpenetrating.

Simple random sampling (random sampling proper) is formed in strict accordance with the scientific principles and rules of random selection. To obtain such a sample, the general population is strictly divided into sampling units, and then a sufficient number of units is selected in random repeated or non-repetitive order.

Random order is equivalent to drawing lots. In practice, it is most often implemented with special tables of random numbers. If, for example, 40 units are to be selected from a population containing 1587 units, then 40 four-digit numbers not exceeding 1587 are taken from the table.
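Today a pseudo-random number generator usually replaces the printed table. The sketch below uses Python's standard `random` module for the example of 40 units out of 1587; the fixed seed is an assumption made only so that the run is reproducible.

```python
import random

rng = random.Random(0)  # fixed seed, for reproducibility only

# Non-repetitive selection: 40 distinct unit numbers from 1..1587.
non_repetitive = rng.sample(range(1, 1588), 40)

# Repeated selection: each draw is made from the full population,
# so the same unit number may occur more than once.
repeated = [rng.randint(1, 1587) for _ in range(40)]
```

`random.sample` draws without replacement, which corresponds to non-repetitive selection, while independent `randint` calls correspond to the "returned ball" scheme.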

When a simple random sample is organized as a repeated sample, the standard error is calculated in accordance with formula (6.1). With the non-repetitive selection method, the formula for calculating the standard error will be:

$$\mu = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)}$$

where $1 - n/N$ is the proportion of units in the general population that were not included in the sample. Since this fraction is always less than one, the error with non-repetitive selection, other things being equal, is always smaller than with repeated selection. Non-repetitive selection is easier to organize than repeated selection, and it is used much more often. However, the standard error for non-repetitive sampling can be determined using the simpler formula (5.1). Such a replacement is possible if the proportion of units in the general population that were not included in the sample is large and, therefore, the value $1 - n/N$ is close to unity.

Forming a sample in strict accordance with the rules of random selection is practically very difficult, and sometimes impossible, since when using tables of random numbers it is necessary to number all units of the general population. Quite often, the population is so large that it is extremely difficult and impractical to carry out such preliminary work, so in practice other types of samples are used, each of which is not strictly random. However, they are organized in such a way as to ensure maximum approximation to the conditions of random selection.

In purely mechanical sampling, the entire general population must first be presented as a list of selection units, compiled in some order that is neutral with respect to the trait being studied, for example alphabetically. The list is then divided into as many equal parts as there are units to be selected. Next, according to a pre-established rule not related to the variation of the characteristic under study, one unit is selected from each part of the list. This type of sampling does not always provide random selection, and the resulting sample may be biased. First, the ordering of units in the general population may contain a non-random element. Second, if the starting point is chosen incorrectly, selecting one unit from each part of the population can also lead to a bias error. Nevertheless, a mechanical sample is easier to organize in practice than a simple random one, and it is the type most often used in sample surveys. The standard error in mechanical sampling is determined by the formula for simple random non-repetitive sampling (6.2).

A typical (zoned, stratified) sample has two goals:

  • to ensure that the corresponding typical groups of the general population are represented in the sample according to the characteristics of interest to the researcher;
  • to increase the accuracy of the sample survey results.

With a typical sample, before its formation begins, the general population of units is divided into typical groups. A very important point here is the correct choice of the grouping characteristic. The selected typical groups may contain the same or different numbers of selection units. In the first case, the sample population is formed with an equal share of selection from each group; in the second, with a share proportional to the group's share in the general population. If a sample is formed with an equal share of selection, it is essentially equivalent to a number of simple random samples from smaller populations, each of which is a typical group. Selection from each group is carried out randomly (repeated or non-repetitive) or mechanically. With a typical sample, with either an equal or an unequal share of selection, the influence of intergroup variation of the characteristic being studied on the accuracy of the results can be eliminated, since each of the typical groups is necessarily represented in the sample population. The standard error of the sample will depend not on the total variance $\sigma^2$ but on the average of the group variances $\overline{\sigma_i^2}$. Since the average of the group variances is always less than the total variance, all other things being equal, the standard error of a typical sample will be less than the standard error of a simple random sample.

When determining the standard errors of a typical sample, the following formulas are used:

With the repeated selection method:

$$\mu = \sqrt{\frac{\overline{\sigma_i^2}}{n}}$$

With the non-repetitive selection method:

$$\mu = \sqrt{\frac{\overline{\sigma_i^2}}{n}\left(1 - \frac{n}{N}\right)}$$

where $\overline{\sigma_i^2}$ is the average of the group variances in the sample population.

Serial (cluster) sampling is a way of forming a sample population in which groups of units (series, nests), rather than the individual units to be surveyed, are selected at random. Within the selected series (nests), all units are examined. Serial sampling is easier to organize and conduct in practice than the selection of individual units. However, with this type of sampling, first, the representation of each series is not ensured and, second, the influence of inter-series variation of the studied characteristic on the survey results is not eliminated. If this variation is significant, it increases the random error of representativeness. The researcher must take this circumstance into account when choosing the type of sample. The standard error of serial sampling is determined by the formulas:

With the repeated selection method:

$$\mu = \sqrt{\frac{\delta^2}{r}}$$

where $\delta^2$ is the inter-series variance of the sample population and $r$ is the number of selected series;

With the non-repetitive selection method:

$$\mu = \sqrt{\frac{\delta^2}{r}\left(1 - \frac{r}{R}\right)}$$

where $R$ is the number of series in the general population.

In practice, certain methods and types of samples are used depending on the purpose and objectives of sample surveys, as well as the possibilities of their organization and conduct. Most often, a combination of selection methods and types of sampling is used. Such samples are called combined. Combination is possible in different combinations: mechanical and serial sampling, typical and mechanical, serial and actually random, etc. Combined sampling is used to ensure the greatest representativeness with the least labor and monetary costs for organizing and conducting the survey.

With a combined sample, the standard error of the sample is made up of the errors at each stage and can be determined as the square root of the sum of the squared errors of the corresponding samples. Thus, if mechanical and typical samples were used together in a combined sample, the standard error can be determined by the formula

$$\mu = \sqrt{\mu_1^2 + \mu_2^2}$$

where $\mu_1$ and $\mu_2$ are the standard errors of the mechanical and typical samples, respectively.
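The rule "square root of the sum of squared errors" is easy to express in code. The sketch below uses a hypothetical helper `combined_error` and made-up standard errors for the mechanical and typical components; the maximum error is then obtained by multiplying by the confidence coefficient t.

```python
import math

def combined_error(*component_errors):
    """Standard error of a combined sample: root of the sum of squared errors."""
    return math.sqrt(sum(e ** 2 for e in component_errors))

# Hypothetical standard errors of the mechanical and typical samples.
mu_mechanical, mu_typical = 0.3, 0.4
mu_combined = combined_error(mu_mechanical, mu_typical)

t = 2                               # confidence coefficient for probability 0.954
delta_combined = t * mu_combined    # maximum error of the combined sample
```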

The distinctive feature of multi-stage sampling is that the sample population is formed gradually, stage by stage. At the first stage, first-stage units are selected using a predetermined method and type of selection. At the second stage, second-stage units are selected from each first-stage unit included in the sample, and so on. There can be more than two stages. At the last stage, the sample population is formed, and its units are the ones surveyed. For example, for a sample survey of household budgets, territorial subjects of the country are selected at the first stage, districts in the selected regions at the second, enterprises or organizations in each municipality at the third, and finally families in the selected enterprises at the fourth.

Thus, the sample population is formed at the last stage. Multistage sampling is more flexible than other types, although it generally produces less accurate results than a single-stage sample of the same size. However, it has one important advantage, which is that the sampling frame for multi-stage selection needs to be built at each stage only for those units that were included in the sample, and this is very important, since often there is no ready-made sampling frame.

The standard sampling error in multi-stage sampling for groups of equal size is determined by the formula

$$\mu = \sqrt{\frac{\mu_1^2}{n_1} + \frac{\mu_2^2}{n_2} + \frac{\mu_3^2}{n_3} + \dots}$$

where $\mu_1, \mu_2, \mu_3, \dots$ are the standard errors at the different stages;

$n_1, n_2, n_3, \dots$ are the sample sizes at the corresponding selection stages.

If the groups are unequal in volume, then theoretically this formula cannot be used. But if the total proportion of selection at all stages is constant, then in practice the calculation using this formula will not lead to a distortion of the error value.

The essence of multiphase sampling is that a subsample is formed from the initially formed sample population, the next subsample is formed from that subsample, and so on. The initial sample population represents the first phase, the subsample from it the second, etc. Multiphase sampling is advisable when:

  • different sample sizes are required to study different traits;
  • the variability of the studied characteristics is not the same and the required accuracy is different;
  • less detailed information must be collected for all units of the initial sample population (first phase), and more detailed information for the units of each subsequent phase.

One of the undoubted advantages of multiphase sampling is that information obtained in each phase can be used as additional information in all subsequent phases. This use of information increases the accuracy of the results of the sample survey.

When organizing multiphase sampling, you can use a combination of different methods and types of selection (typical sampling with mechanical sampling, etc.). Multiphase selection can be combined with multistage selection. At each stage, sampling can be multiphase.

The standard error in multiphase sampling is calculated for each phase separately in accordance with the formulas of the selection method and type of sampling with which its sample population was formed.

Interpenetrating samples are two or more independent samples from the same general population, formed by the same method and type of selection. It is advisable to resort to interpenetrating samples when preliminary results of a sample survey are needed in a short time. Interpenetrating samples are effective for assessing survey results: if the results of the independent samples agree, this indicates the reliability of the survey data. Interpenetrating samples can sometimes also be used to check the work of different researchers by having each of them survey a different sample.

The standard error for interpenetrating samples is determined by the same formula as for the typical proportional sample (5.3). Compared to other types, interpenetrating samples require more labor and money, so the researcher must take this into account when designing a sample survey.

The maximum errors for the various selection methods and types of sampling are determined by the formula $\Delta = t\mu$, where $\mu$ is the corresponding standard error.
