
Decoding characters into Russian. What is ANSI encoding and what is it used for? Incorrect display of characters

Site encoding is the correspondence between numbers and characters (digits, letters, signs and other special characters). The most common encodings are ASCII, Unicode UTF-8 and Windows-1251. On a page, a special meta tag in the head section is responsible for the encoding, for example <meta charset="utf-8">, which declares a specific character set for the pages. In this case it is UTF-8 Unicode.

In simple words, an encoding is a standard set of numbers that correspond to a specific set of written letters, digits, signs and other elements. Most often a site uses one encoding, but there are exceptions where several encodings are installed at once; this, however, can lead to incorrect display of the entire web resource. Many sites use the UTF-8 standard, since this encoding is supported by all well-known browsers, search engines, servers and other platforms. Situations also arise very often where the encoding specified on the website does not match the one installed on the server. The main reason for this is that the hosting provider does not support the specified encoding and substitutes its own, which leads to incorrect display of information. In short, an encoding is a table that describes the correspondence between a specific character and a number. Every character you see on a site is, for the computer, just a set of bits (zeros and ones).
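
To make this concrete, here is a minimal Python 3 sketch (my own illustration, not from the original article) showing the character-to-number correspondence directly:

    # Every character corresponds to a number in the encoding table
    print(ord('A'))              # 65  - the code of 'A' in ASCII/Unicode
    print(chr(65))               # 'A' - and back from the number to the character
    print('Hi'.encode('utf-8'))  # b'Hi' - the actual bytes stored for this text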

Types of site encodings

There are several types of encodings in the Internet world:

  • ASCII is the very first encoding, adopted by the American National Standards Institute. It used only 7 bits, whose 128 values held the English alphabet plus all the digits, signs and control characters. This encoding is not universal and was most often used on English-language sites.
  • Cyrillic encodings are the domestic variants. They use the second half of the code table, namely characters 128 through 255, and are used on Russian-language sites and blogs.
  • Windows-1250 through 1258 are standard 8-bit encodings that appeared with the release of the well-known Microsoft Windows operating system. The number indicates the language group the encoding targets: 1250 covers the languages of Central Europe, 1251 the Cyrillic alphabet, and so on.
  • KOI8 stands for "code for information interchange, 8-bit". These Russian Cyrillic standards are used on Unix-like systems, in the variants KOI-7, KOI8-R and KOI8-U.
  • Unicode is the well-known character encoding standard that can describe the characters of virtually every language in the world. Code points are written as "U+xxxx", where "xxxx" is a hexadecimal value. The most common family of its encoding forms is UTF (Unicode Transformation Format): UTF-8, UTF-16 and UTF-32.

Any of these encodings can be used on any site.

Universal and popular encodings

Today the most popular and widely known encoding is UTF-8; it provides maximum compatibility with older systems that used ordinary 8-bit characters. Most sites on the Internet use UTF-8, and it is this standard that is considered universal. UTF-8 supports both Cyrillic and Latin characters.

Hello, dear readers of this blog. Today we will talk about where krakozyabry (mojibake, unreadable character soup) come from on sites and in programs, what text encodings exist, and which ones should be used. Let's take a closer look at the history of their development, starting from basic ASCII and its extended versions CP866, KOI8-R and Windows 1251, and ending with the modern Unicode Consortium encodings UTF-16 and UTF-8.

To some, this information may seem superfluous, but if you only knew how many questions I receive about those crawled-out krakozyabry (unreadable sets of characters). Now I will be able to refer everyone to the text of this article so they can find their own mistakes. Well, get ready to absorb the information, and try to follow the story.

ASCII - basic text encoding for the Latin alphabet

The development of text encodings went hand in hand with the formation of the IT industry, and over that time they managed to undergo quite a lot of changes. Historically, it all started with EBCDIC (rather unflattering to pronounce in Russian), which made it possible to encode the letters of the Latin alphabet, Arabic numerals, punctuation marks and control characters.

Still, the starting point for the development of modern text encodings is the famous ASCII (American Standard Code for Information Interchange, which in Russian is usually pronounced "aski"). It describes the 128 characters most commonly used by English-speaking users: Latin letters, Arabic numerals and punctuation marks.

These 128 characters described in ASCII also include service symbols like brackets, hash marks, asterisks, etc. Actually, you can see them for yourself:

It is these 128 characters from the original version of ASCII that became the standard; in any other encoding you will surely meet them, and they will stand in this same order.

But the fact is that one byte of information can encode not 128 but 256 different values (two to the power of eight equals 256), so after the basic version of ASCII a whole series of extended ASCII encodings appeared, in which, besides the 128 basic characters, the characters of a national alphabet (for example, Russian) could also be encoded.

Here, perhaps, it is worth saying a little more about the number systems used in the description. Firstly, as you all know, a computer works only with numbers in the binary system, namely zeros and ones ("Boolean algebra", if anyone took it in college or school). A byte consists of eight bits, each representing a power of two, starting from two to the zero and up to two to the seventh:

It is not hard to see that there can be only 256 possible combinations of zeros and ones in such a construction. Converting a number from binary to decimal is quite simple: you just add up the powers of two whose bit positions contain ones.

In our example this is 1 (two to the zero power) plus 8 (two to the third), plus 32 (two to the fifth), plus 64 (two to the sixth), plus 128 (two to the seventh). The total is 233 in decimal notation. As you can see, everything is very simple.
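
The same arithmetic can be checked in a couple of lines of Python (a sketch of my own, assuming the byte from the example is 11101001):

    # Ones stand in bit positions 0, 3, 5, 6 and 7
    print(2**0 + 2**3 + 2**5 + 2**6 + 2**7)  # 233
    print(int('11101001', 2))                # 233 - the built-in conversion agrees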

But if you look closely at the table of ASCII characters, you will see that they are given in hexadecimal notation. For example, the asterisk corresponds to the hexadecimal number 2A in ASCII. You probably know that in the hexadecimal number system, besides Arabic numerals, the Latin letters from A (meaning ten) to F (meaning fifteen) are also used.

Well, to convert a binary number to hexadecimal, the following simple and intuitive method is used. Each byte of information is split into two pieces of four bits, as shown in the screenshot above. That is, each half-byte can encode only sixteen values in binary (two to the fourth power), which can easily be represented as one hexadecimal digit.

Note that in the left half of the byte the powers must again be counted starting from zero, not as shown in the screenshot. As a result, after some simple calculations, we get that the number E9 is encoded in the screenshot. I hope the course of my reasoning and the solution to this puzzle were clear to you. Well, now let's continue talking about text encodings.
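
Here is a small Python sketch of the nibble trick (the variable names are mine):

    byte = 0b11101001                # the example byte, 233 in decimal
    high = byte >> 4                 # upper four bits: 1110 -> 14 -> 'E'
    low = byte & 0b1111              # lower four bits: 1001 -> 9
    print(format(high, 'X'), format(low, 'X'))  # E 9
    print(format(byte, '02X'))                  # E9 - the whole byte at once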

Extended versions of ASCII - the CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was, as it were, the starting point for the development of all modern encodings (Windows 1251, Unicode, UTF-8).

Initially it contained only 128 characters: the Latin alphabet, Arabic numerals and a few other things. But the extended versions can use all 256 values that fit into one byte of information. That is, it became possible to add the letters of your own language to ASCII.

Here it is worth digressing once more to explain why text encodings are needed at all and why this is so important. Characters on your computer screen are formed from two things: sets of vector shapes (representations) of all kinds of characters, which are stored in font files, and a code that lets you pull out of this set of vector shapes (the font file) exactly the character that needs to be inserted in the right place.

Clearly, the fonts themselves are responsible for the vector shapes, while the operating system and the programs running in it are responsible for the encoding. That is, any text on your computer is a set of bytes, each of which encodes one single character of that text.

The program that displays this text on the screen (a text editor, browser, etc.), when parsing the code, reads the encoding of the next character and looks up the corresponding vector shape in the required font file attached to this text document. Everything is simple and banal.

This means that to encode any character we need (for example, one from a national alphabet), two conditions must be met: the vector shape of this character must exist in the font used, and the character must be encodable in one byte in an extended ASCII encoding. That is why a whole bunch of such variants exist; for encoding the characters of the Russian language alone there are several varieties of extended ASCII.

For example, one of the first to appear was CP866, an extended version of ASCII in which the characters of the Russian alphabet could be used.

That is, its upper half completely coincided with basic ASCII (128 Latin characters, numbers and other stuff), shown in the screenshot just above, while the lower half of the CP866 table had the form shown in the screenshot just below and allowed encoding another 128 characters (Russian letters and all kinds of pseudographics):

Notice that in the right-hand column the codes start with 8, because the codes 0 through 7 belong to the basic ASCII part (see the first screenshot). Thus, the Russian letter "М" in CP866 has the code 8C (at the intersection of the row labeled 8 and the column labeled C, in hexadecimal notation), which fits into one byte of information; given a suitable font with Russian characters, this letter will be displayed in the text without problems.
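
This is easy to verify with Python's built-in codecs (a quick check of my own):

    # One byte per character in the extended half of CP866
    print('М'.encode('cp866'))      # b'\x8c' - Cyrillic 'М' has code 8C
    print(b'\x9c'.decode('cp866'))  # 'Ь'     - and code 9C decodes to 'Ь'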

Where did such an amount of pseudographics in CP866 come from? The point is that this encoding for Russian text was developed back in those distant years when graphical operating systems were not nearly as widespread as they are now. In DOS and similar text-based operating systems, pseudographics made it possible to at least somehow diversify the appearance of text, which is why CP866 and all its other peers from the extended-ASCII category abound in it.

CP866 was distributed by IBM, but besides it, a number of other encodings were developed for Russian characters; for example, KOI8-R belongs to the same type (extended ASCII):

Its principle of operation is the same as that of the CP866 described a little earlier: each character of the text is encoded with one single byte. The screenshot shows the second half of the KOI8-R table, since the first half fully matches basic ASCII, shown in the first screenshot of this article.

Among the features of the KOI8-R encoding: the Russian letters in its table are not in alphabetical order, as they were, for example, in CP866.

If you look at the very first screenshot (of the basic part included in all extended encodings), you will notice that in KOI8-R the Russian letters occupy the same table cells as the similar-sounding Latin letters in the first half of the table. This was done for the convenience of switching from Russian characters to Latin ones by discarding just one bit (two to the seventh power, i.e. 128).
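
The bit-dropping trick can be demonstrated in Python (a sketch; the particular letter is my example):

    # In KOI8-R, clearing the high bit of a Cyrillic letter yields
    # the similar-sounding Latin letter from the 7-bit ASCII half
    b = 'м'.encode('koi8-r')[0]  # 0xCD - the byte for Cyrillic 'м'
    print(chr(b & 0x7F))         # 'M'  - the same cell in the ASCII half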

Windows 1251 - the modern version of extended ASCII, and why krakozyabry come out

The further development of text encodings was driven by the growing popularity of graphical operating systems, in which the need for pseudographics gradually disappeared. As a result, a whole group of encodings arose that were, in essence, still extended versions of ASCII (one text character is encoded with exactly one byte of information), but without the use of pseudographic characters.

They belonged to the so-called ANSI encodings, developed by the American National Standards Institute. In common parlance, the name "Cyrillic" was also used for the variant with Russian-language support. An example of this is Windows 1251.

It compared favorably with the previously used CP866 and KOI8-R in that the place of the pseudographic symbols was taken by the missing symbols of Russian typography (all except the accent mark), as well as the symbols used in Slavic languages close to Russian (Ukrainian, Belarusian, etc.):

Because of this abundance of Russian-language encodings, font manufacturers and software makers constantly had headaches, while we, dear readers, often got those notorious krakozyabry whenever there was confusion about which version was used in a text.

They came out very often when sending and receiving messages by e-mail, which led to the creation of very complex conversion tables that, in fact, could not fundamentally solve the problem; users often resorted to writing Russian in Latin letters (transliteration) in their correspondence to avoid the notorious krakozyabry when using Russian encodings like CP866, KOI8-R or Windows 1251.

In fact, the krakozyabry that appeared instead of Russian text were the result of applying the wrong encoding for this language, one that did not match the encoding in which the text message was originally encoded.

For example, if you try to display characters encoded with CP866 using the Windows 1251 code table, these same krakozyabry (a meaningless set of characters) come out, completely replacing the message text.
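
Here is a minimal Python sketch of exactly this failure (the sample word is my own):

    text = 'Привет'
    raw = text.encode('cp866')   # the bytes as a CP866 program would write them
    print(raw.decode('cp1251'))  # 'ЏаЁўҐв' - krakozyabry instead of the text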

A similar situation very often arises on forums or blogs, when text with Russian characters is mistakenly saved in the wrong encoding (not the one used on the site by default), or in the wrong text editor, one that adds garbage to the code that is invisible to the naked eye.

In the end, many people got tired of this situation with a multitude of encodings and constantly emerging krakozyabry, and the prerequisites appeared for creating a new universal variation that would replace all the existing ones and finally solve the root problem of unreadable texts. In addition, there was the problem of languages like Chinese, which have far more than 256 characters.

Unicode - the universal encodings UTF-8, 16 and 32

Those thousands of characters of the Southeast Asian language group could not possibly be described in the one byte of information allotted for encoding characters in extended versions of ASCII. As a result, a consortium called Unicode (the Unicode Consortium) was created through the collaboration of many IT industry leaders (those who produce software, encode hardware, create fonts) who were interested in the emergence of a universal text encoding.

The first variation released under the auspices of the Unicode Consortium was UTF-32. The number in the encoding's name is the number of bits used to encode one character. 32 bits equal the 4 bytes of information needed to encode one single character in the new universal UTF encoding.

As a result, the same text file encoded in extended ASCII and in UTF-32 will, in the latter case, have a size (weight) four times larger. This is bad, but now we can encode a number of characters equal to two to the thirty-second power (more than four billion characters, which covers any realistically necessary amount with a colossal margin).
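
A quick Python check (my own sketch; the 'utf-32-be' variant is used so that no BOM is prepended):

    # Every character costs four bytes in UTF-32
    print(len('A'.encode('utf-32-be')))      # 4
    print(len('Hello'.encode('utf-32-be')))  # 20 - versus 5 bytes in ASCII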

But many countries with languages of the European group did not need to use such a huge number of characters at all, yet with UTF-32 they got a fourfold increase in the weight of text documents for nothing, and as a result an increase in Internet traffic and in the volume of stored data. That is a lot, and no one could afford such waste.

As a result of the further development of Unicode, UTF-16 appeared. It turned out so successful that it was adopted by default as the base space for all the characters we use. It uses two bytes to encode one character. Let's see how this looks.

In the Windows operating system you can follow the path "Start" - "Programs" - "Accessories" - "System Tools" - "Character Map". A table opens with the vector shapes of all the fonts installed on your system. If you select the Unicode character set in the "Advanced options", you will be able to see, for each font separately, the entire assortment of characters included in it.

By the way, by clicking on any of them you can see its two-byte UTF-16 code, consisting of four hexadecimal digits:

How many characters can be encoded in UTF-16 with 16 bits? 65,536 (two to the power of sixteen), and it was this number that was adopted as the base space in Unicode. In addition, there are ways (surrogate pairs) to encode about a million more characters with it, but the standard limited the extended space to just over a million characters of text.
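
The two-byte codes are easy to inspect from Python (a sketch of my own; the big-endian variant avoids the BOM):

    # U+0416 is the Cyrillic letter 'Ж'
    print('Ж'.encode('utf-16-be'))  # b'\x04\x16' - two bytes matching code 0416
    print(hex(ord('Ж')))            # 0x416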

But even this successful version of the Unicode encoding did not bring much satisfaction to those who wrote, for example, programs only in English, because after the transition from extended ASCII to UTF-16 the weight of documents doubled (one byte per character in ASCII versus two bytes for the same character in UTF-16).

It was precisely to satisfy everyone and everything that the Unicode Consortium decided to come up with a variable-length encoding. It was called UTF-8. Despite the eight in its name, it really does have variable length: each character of the text can be encoded as a sequence of one to six bytes.

In practice, UTF-8 uses only the range from one to four bytes, because four bytes of code already cover everything that can even theoretically be needed. All Latin characters are encoded in one byte, just like in good old ASCII.

Notably, if only the Latin alphabet is encoded, even programs that do not understand Unicode will still read what is encoded in UTF-8. That is, the basic part of ASCII simply carried over into this brainchild of the Unicode Consortium.

Cyrillic characters in UTF-8 are encoded in two bytes, and Georgian ones, for example, in three. By creating UTF-16 and UTF-8, the Unicode Consortium solved the main problem: now fonts have a single code space. Font makers can simply fill it with vector shapes of text characters according to their strengths and capabilities.
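
These byte counts are easy to confirm with a small Python sketch (the sample characters are mine):

    for ch in 'A', 'Я', 'ა':  # Latin, Cyrillic, Georgian
        print(ch, len(ch.encode('utf-8')), 'byte(s)')
    # A 1 byte(s) / Я 2 byte(s) / ა 3 byte(s)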

In the above "Character Table" you can see that different fonts support a different number of characters. Some Unicode-rich fonts can be very heavy. But now they differ not in that they are created for different encodings, but in that the font manufacturer has filled or has not filled a single code space with certain vector forms to the end.

Krakozyabry instead of Russian letters - how to fix it

Now let's see how krakozyabry appear instead of text, or, in other words, how the correct encoding for Russian text is chosen. Actually, it is set in the program in which you create or edit this very text, or code using text fragments.

For editing and creating text files I personally use, in my opinion, the very good Notepad++. It can highlight the syntax of a good hundred programming and markup languages, and it can also be extended with plugins. Read a detailed review of this great program at the link provided.

The top menu of Notepad++ has an "Encodings" item, where you can convert an existing variant to the one used by default on your site:

For a site on Joomla 1.5 and higher, and likewise for a blog on WordPress, choose the option UTF-8 without BOM to avoid the appearance of krakozyabry. And what is this BOM prefix?

The fact is that when UTF-16 was being designed, for some reason it was decided to allow writing a character's code both in direct byte order (for example, 0A15) and in reverse (150A). And so that programs would understand in which order to read the codes, the BOM (Byte Order Mark, in other words a signature) was invented: a few extra bytes added at the very beginning of a document (for UTF-16 the mark is two bytes; the UTF-8 signature is three).

The UTF-8 encoding does not need a BOM, so adding the signature (those notorious extra three bytes at the beginning of the document) simply prevents some programs from reading the code. Therefore, when saving files in UTF-8, we should always choose the option without BOM (without signature). This way you protect yourself in advance from crawling krakozyabry.

Notably, some programs in Windows cannot do this (cannot save text in UTF-8 without BOM), for example the notorious Windows Notepad. It saves the document in UTF-8 but still adds the signature (three extra bytes) at the beginning, and these bytes are always the same. On servers, this trifle can create a problem: krakozyabry come out.
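
In Python the signature is plainly visible (a sketch; 'utf-8-sig' is the standard-library codec that writes the BOM, just as Windows Notepad does):

    import codecs
    print(codecs.BOM_UTF8)             # b'\xef\xbb\xbf' - the three signature bytes
    data = 'test'.encode('utf-8-sig')  # BOM + ordinary UTF-8 bytes
    print(data)                        # b'\xef\xbb\xbftest'
    print(data.decode('utf-8-sig'))    # 'test' - this codec strips the BOM again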

So never use the regular Windows Notepad to edit documents of your site if you do not want krakozyabry to appear. The best and simplest option, in my opinion, is the already mentioned Notepad++ editor, which has practically no drawbacks and consists only of advantages.

In Notepad++, when choosing an encoding, you can also convert text to UCS-2, which is inherently very close to UTF-16. Notepad++ can also encode text in ANSI, which for the Russian language means the Windows 1251 encoding described just above. Where does this information come from?

It is recorded in the registry of your Windows operating system: which encoding to choose for ANSI and which for OEM (for the Russian language it is CP866). If you set a different default language on your computer, these encodings are replaced with the corresponding ANSI and OEM encodings for that language.

After you save a document in the encoding you need in Notepad++, or open a document from the site for editing, you can see the name of the current encoding in the lower right corner of the editor:

To avoid krakozyabry, besides the steps described above, it is also useful to write information about the encoding into the head of the source code of every page of the site, so that there is no confusion on the server or local host.

In general, all markup languages except HTML use a special XML declaration, which indicates the text encoding, for example: <?xml version="1.0" encoding="windows-1251"?>.

Before starting to parse the code, the browser thus knows which version is used and how exactly to interpret the character codes of the language. Notably, if the document is saved in the default Unicode, the XML declaration can be omitted (the encoding will be assumed to be UTF-8 if there is no BOM, or UTF-16 if there is one).

In an HTML document, the encoding is specified with a Meta element, written between the opening and closing Head tags:

<head>
<meta charset="utf-8">
</head>

This notation differs quite a bit from the one adopted in HTML 4, but it complies with the new HTML 5 standard that is slowly being introduced, and it will be understood one hundred percent correctly by any browser currently in use.

The idea is that the Meta element declaring the encoding of an HTML document is best placed as high as possible in the document's head, so that by the time the text's first character outside basic ASCII is met (ASCII characters are always read correctly in any variation), the browser already has the information on how to interpret the codes of these characters.

Good luck to you! See you soon on the pages of this blog.


Later, ASCII was extended (originally it did not use all 8 bits): instead of 128, it became possible to encode 256 (2 to the 8th power) different symbols in a single byte of information.
This improvement made it possible to add to ASCII the accented and national characters of different countries, in addition to the existing Roman alphabet.
There are many extended ASCII variants, for the simple reason that there are many languages in the world. I think many of you have heard of the KOI8 encoding (code for information interchange, 8-bit), which is also an extended ASCII. KOI8 includes digits, letters of the Latin and Russian alphabets, punctuation marks, special characters and pseudographics.

ISO encodings

The International Organization for Standardization (ISO) has created a range of character sets for different alphabets and languages.

The ISO 8859 series

  • ISO 8859-1 (Latin-1): extended Latin, covering most Western European languages (English, Danish, Irish, Icelandic, Spanish, German, Italian, Norwegian, Portuguese, Romansh, Faroese, Swedish, Scottish Gaelic, and partly Dutch, Finnish and French), as well as some Eastern European (Albanian) and African (Afrikaans, Swahili) languages. Latin-1 lacks the euro sign and the capital letter Ÿ. This code page is considered the default encoding for HTML documents and e-mail, and it corresponds to the first 256 Unicode characters.
  • ISO 8859-2 (Latin-2): extended Latin, covering Central and Eastern European languages (Bosnian, Hungarian, Polish, Slovak, Slovenian, Croatian, Czech). Like Latin-1, Latin-2 lacks the euro sign.
  • ISO 8859-3 (Latin-3): extended Latin, covering Southern European languages (Maltese, Turkish and Esperanto).
  • ISO 8859-4 (Latin-4): extended Latin, covering Northern European languages (Greenlandic, Estonian, Latvian, Lithuanian and the Sami languages).
  • ISO 8859-5 (Latin/Cyrillic): Cyrillic, covering Slavic languages (Belarusian, Bulgarian, Macedonian, Russian, Serbian and partly Ukrainian).
  • ISO 8859-6 (Latin/Arabic): characters of the Arabic language. Other languages written in Arabic script are not supported. Correct display of text in ISO 8859-6 requires bidirectional support and context-dependent character shapes.
  • ISO 8859-7 (Latin/Greek): characters of modern Greek. Can also be used for Ancient Greek texts in monotonic orthography.
  • ISO 8859-8 (Latin/Hebrew): characters of modern Hebrew. Used in two variants: logical character order (requires bidirectional support) and visual character order.
  • ISO 8859-9 (Latin-5): a variant of Latin-1 in which rarely used Icelandic characters are replaced with Turkish ones. Used for the Turkish and Kurdish languages.
  • ISO 8859-10 (Latin-6): a variant of Latin-4 more convenient for the Scandinavian languages.
  • ISO 8859-11 (Latin/Thai): characters of the Thai language.
  • ISO 8859-13 (Latin-7): a variant of Latin-4 more convenient for the Baltic languages.
  • ISO 8859-14 (Latin-8): extended Latin, including Celtic languages such as Scottish Gaelic and Breton.
  • ISO 8859-15 (Latin-9): a variant of Latin-1 in which rarely used characters are replaced with those needed for full support of Finnish, French and Estonian. Latin-9 also adds the euro sign.
  • ISO 8859-16 (Latin-10): extended Latin, covering Southern and Eastern European languages (Albanian, Hungarian, Italian, Polish, Romanian, Slovenian, Croatian) as well as some Western European ones (Irish in the new orthography, German, Finnish, French). Like Latin-9, Latin-10 includes the euro sign.

For documents in English and most other Western European languages, the widely supported ISO-8859-1 encoding is used.

In HTML 4, ISO-8859-1 is the default encoding (in XHTML and HTML5 the default encoding is UTF-8).
When using a page encoding other than ISO-8859-1, you need to specify it in the <meta> tag.

For HTML4:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">

For HTML5:

<meta charset="windows-1251">

A well-known example of an ANSI encoding is Windows-1251.

Windows-1251 compares favorably with other 8-bit Cyrillic encodings (such as CP866 and ISO 8859-5) by containing almost all the characters used in Russian typography for plain text (only the accent mark is missing). It also contains all the characters for the other Slavic languages: Ukrainian, Belarusian, Serbian, Macedonian and Bulgarian.
Below is the Windows-1251 code table: each character is shown together with its hexadecimal Unicode code point, while the character's own Windows-1251 code is formed by the row digit and the column digit of its cell.

To display these characters in an HTML document, use a numeric character reference of the form &#code;, where code is the character's Unicode value in decimal (the hexadecimal form &#xcode; also works).
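
A hedged check of this syntax with Python's standard html module (the sample codes are my own):

    import html
    print(html.unescape('&#1046;'))  # 'Ж' - decimal Unicode value 1046
    print(html.unescape('&#x416;'))  # 'Ж' - the same code point in hexadecimal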

Windows-1251 (CP1251)

Columns run .0 through .F; each cell shows the character and its Unicode code point (hexadecimal). Position 98 is unused.

8. Ђ 402 | Ѓ 403 | ‚ 201A | ѓ 453 | „ 201E | … 2026 | † 2020 | ‡ 2021 | € 20AC | ‰ 2030 | Љ 409 | ‹ 2039 | Њ 40A | Ќ 40C | Ћ 40B | Џ 40F
9. ђ 452 | ‘ 2018 | ’ 2019 | “ 201C | ” 201D | • 2022 | – 2013 | — 2014 | (unused) | ™ 2122 | љ 459 | › 203A | њ 45A | ќ 45C | ћ 45B | џ 45F
A. (nbsp) A0 | Ў 40E | ў 45E | Ј 408 | ¤ A4 | Ґ 490 | ¦ A6 | § A7 | Ё 401 | © A9 | Є 404 | « AB | ¬ AC | (shy) AD | ® AE | Ї 407
B. ° B0 | ± B1 | І 406 | і 456 | ґ 491 | µ B5 | ¶ B6 | · B7 | ё 451 | № 2116 | є 454 | » BB | ј 458 | Ѕ 405 | ѕ 455 | ї 457
C. А 410 | Б 411 | В 412 | Г 413 | Д 414 | Е 415 | Ж 416 | З 417 | И 418 | Й 419 | К 41A | Л 41B | М 41C | Н 41D | О 41E | П 41F
D. Р 420 | С 421 | Т 422 | У 423 | Ф 424 | Х 425 | Ц 426 | Ч 427 | Ш 428 | Щ 429 | Ъ 42A | Ы 42B | Ь 42C | Э 42D | Ю 42E | Я 42F
E. а 430 | б 431 | в 432 | г 433 | д 434 | е 435 | ж 436 | з 437 | и 438 | й 439 | к 43A | л 43B | м 43C | н 43D | о 43E | п 43F
F. р 440 | с 441 | т 442 | у 443 | ф 444 | х 445 | ц 446 | ч 447 | ш 448 | щ 449 | ъ 44A | ы 44B | ь 44C | э 44D | ю 44E | я 44F

UNICODE encoding standard

Unicode is a character encoding standard that can represent the characters of almost all the world's written languages, as well as special characters. Unicode characters are encoded with unsigned integers. Unicode has several forms of representing characters in a computer: UTF-8, UTF-16 (UTF-16BE, UTF-16LE) and UTF-32 (UTF-32BE, UTF-32LE), where UTF stands for Unicode Transformation Format.

UTF-8 is now the common encoding, widely used in operating systems and on the web. Text composed of Unicode characters with numbers below 128 (the code range U+0000 through U+007F) coincides byte for byte with the same text in ASCII. Next come the ranges for the characters of various scripts, punctuation marks and technical symbols. Cyrillic characters occupy the code ranges U+0400 to U+052F, U+2DE0 to U+2DFF and U+A640 to U+A69F.
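
For instance (a minimal Python sketch, my own example):

    # ASCII range: one byte; Cyrillic range U+0400..U+052F: two bytes
    print('A'.encode('utf-8'))  # b'A'        - code below 128, same as ASCII
    print('Ж'.encode('utf-8'))  # b'\xd0\x96' - U+0416 takes two bytes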

The UTF-8 encoding is universal and has an impressive reserve for the future. This makes it the most convenient encoding for use on the Internet.

Before answering the question of what the ANSI Windows encoding is, let's first answer another one: "What is an encoding in general?"

Every computer and every system uses a certain set of characters, depending on the language its user works in, on his professional competence and personal preferences.

General definition of encoding

So, Russian uses 33 characters for its letters and English 26. Ten digits are used for counting (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), along with some special characters: minus, space, period, percent sign and so on.

Each of these characters is assigned a sequential number in a code table. For example, the letter "A" can be assigned the number 1, "Z" the number 26, and so on.

Properly speaking, the number representing a character as an integer is the character's code, and an encoding is, accordingly, the set of characters arranged in such a table.
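
Such a table is trivial to model in Python (a toy sketch; real encodings assign different numbers):

    # Assign sequential numbers to letters, as a code table does
    alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    codes = {ch: n for n, ch in enumerate(alphabet, start=1)}
    print(codes['A'], codes['Z'])  # 1 26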

Rich variety of code tables

At the moment there is a fairly large number of encodings and code tables used by different specialists: ASCII, developed in 1963 in America; Windows-1251, until recently popular thanks to Microsoft; KOI8-R and Guobiao; and many, many others. The process of their appearance and dying off continues to this day.

Among this huge list is the so-called ANSI encoding.

The fact is that at one time Microsoft created a whole set of code pages, Windows-1250 through Windows-1258. All of them are collectively called the ANSI encoding table, or ANSI code pages.

Interesting fact: one of the first code tables was ASCII, created in 1963 by the American National Standards Institute, abbreviated ANSI.

Among other things, these encodings include non-printable characters and so-called escape sequences (ESC), unique to each character set and often incompatible with one another. Used skillfully, however, they could hide and restore the cursor and move it from one position in the text to another, set tab stops, erase part of the terminal window being worked in, change text formatting on the screen and change colors (and even draw and emit sound signals!). In 1976, by the way, this was a pretty good tool for programmers. Incidentally, a terminal is a device required for inputting and outputting information; in those days it was a monitor and keyboard connected to a computer.

Incorrect display of symbols

Unfortunately, this system later led to numerous failures: instead of the desired poems, news feeds or favorite descriptions of computer games, it brought krakozyabry, meaningless, unreadable character sets. This ubiquitous error was caused simply by an attempt to display characters encoded in one code table by using another.

We most often encounter the consequences of incorrectly read codes on the Internet to this day, when our browser for some reason cannot accurately determine which of the Windows-**** encodings is currently in use, because the webmaster declared the generic ANSI encoding or an initially incorrect one, for example 1252 instead of 1251. The exact encodings are shown in the table below.

ANSI code table for Cyrillic: Windows-1251

Moreover, in 1986 the possibilities of ANSI were significantly expanded thanks to Ian E. Davis, who wrote the TheDraw package, which made it possible not just to use the basic (from our point of view) functions but to fully (or almost fully) draw!

Summing up

Thus, it can be seen that ANSI encoding, although it was quite a controversial solution, maintains its position to this day.

Over time, with the light hand of enthusiasts, the ancient ANSI terminal has migrated even to phones!
