[/ Copyright (c) 2019-2024 Ruben Perez Hidalgo (rubenperez038 at gmail dot com) Distributed under the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) ] [section:charsets Character sets] [nochunk] [heading Character set refresher] MySQL defines a character set as "a set of symbols and their respective encodings". `utf8mb4`, `utf16` and `ascii` are character sets supported by MySQL. A collation is a set of rules for comparing characters in a character set. For example, a case-insensitive collation will make strings that only differ in case compare equal. All collations are associated to a single character set. For example, `utf8mb4_spanish_ci` is a case-insensitive collation associated to the `utf8mb4` character set. Every character set has a default collation, which will be used if a character set without a collation is specified. For example, `latin1_swedish_ci` is the default collation for the `latin1` character set. You can find more information about these concepts in [mysqllink charset.html the official MySQL docs on character sets]. [heading The connection character set and collation] Every connection has an associated character set and collation. The connection's character set determines the encoding for character strings sent to and retrieved from the server. This includes SQL query strings, string fields and column names in metadata. The connection's collation is used for string literal comparison. Every session you establish can have its own different character set and collation. You can specify this in two ways: * When calling [refmem connection connect] or [refmem connection handshake], using [refmem handshake_params connection_collation]. You specify a numeric ID that identifies the collation to use, and your connection will use the character set associated to this collation. You can find collation IDs in the [include_file boost/mysql/mysql_collations.hpp] and [include_file boost/mysql/mariadb_collations.hpp] headers. The problem with this approach is that if you specify a collation ID that is unknown to the server (e.g. `utf8mb4_0900_ai_ci` for an old MySQL 5.7 server), the handshake operation will succeed but the connection [*will silently fall back to the server's default character set], (usually `latin1`, which is not Unicode). * At any time, issuing a __SET_NAMES__ SQL statement. For example, `"SET NAMES utf8mb4"` will set the current connection's character set to `utf8mb4` and the connection's collation to utf8mb4's default collation. If the character set is unknown, the `SET NAMES` statement will fail. You can use [refmem connection execute] to issue the statement: [charsets_set_names] [heading character_set_results and character_set_client] Both of the above methods are shortcuts to set several session-level variables. The ones that impact this library's behavior are: * [mysqllink server-system-variables.html#sysvar_character_set_client character_set_client] determines the encoding that SQL statements sent to the server should have. This includes the SQL strings passed to [refmem connection execute] and [refmem connection prepare_statement], and string parameters passed to [refmem statement bind]. Not all character sets are permissible in `character_set_client`. The server will accept setting this variable to any UTF-8 character set, but won't accept UTF-16. * [mysqllink server-system-variables.html#sysvar_character_set_results character_set_results] determines the encoding that the server will use to send any kind of result, including string fields retrieved by [refmem connection execute], metadata like [refmem metadata column_name] and error messages. Note that, when you define a string column with a character set (e.g. `"CREATE TABLE t1 (col1 VARCHAR(5) CHARACTER SET latin1)"`), the column's character set will be used for storage and comparisons, but not for client communication. If you set `character_set_results` to `utf16`, any field obtained by `SELECT`ing `col1` will be UTF16-encoded, and not latin1-encoded. Note also that [refmem metadata column_collation] reflects the charset and collation the server has converted the column to before sending it to the client. In the above example, `metadata::column_collation` will be the default collation for UTF16, rather than `latin1_swedish_ci`. The table below summarizes the encoding used by each piece of functionality in this library: [table:string_encoding [ [Functionality] [Encoding given by...] ] [ [SQL query strings passed to [refmem connection execute] and [refmem connection prepare_statement]] [`character_set_client`] ] [ [String values passed as parameters to [refmem statement bind]] [`character_set_client`] ] [ [ String fields retrieved by [refmem connection execute] or [refmem connection read_some_rows]:[br][br] [refmem field_view as_string][br] [refmem field_view get_string] ] [`character_set_results`] ] [ [ Metadata strings:[br][br] [refmem metadata database][br] [refmem metadata table][br] [refmem metadata original_table][br] [refmem metadata column_name][br] [refmem metadata original_column_name] ] [`character_set_results`] ] [ [Server-generated error messages: [refmem diagnostics server_message]] [`character_set_results`] ] [ [ Informational messages:[br][br] [refmem results info][br] [refmem execution_state info] ] [ ASCII. These can only contain ASCII characters and are always ASCII encoded. More info in [@https://dev.mysql.com/doc/c-api/8.0/en/mysql-info.html this section]. ] ] ] [heading:tracking (Experimental) Character set tracking] [reflink any_connection] attempts to track the connection's current character set. You can access this information using [refmem any_connection current_character_set] and [refmem any_connection format_opts]. [note This functionality is only relevant when using SQL formatting and escaping functions, like [reflink format_sql], [reflink format_context] or [reflink escape_string]. ] The MySQL protocol has limited support for character set tracking, so this task requires some help from the user. Some situations can make the current character set to be unknown. If this happens, [refmem any_connection current_character_set] and [refmem any_connection format_opts] return an `unknown_character_set` error. This is how tracking works: * Before connection establishment, the current character set is always unknown. * After [refmem any_connection connect] or [refmemunq any_connection async_connect] succeed, heuristics are used to determine the current character set. This is required because the server may reject the collation requested by [refmem connect_params connection_collation] and silently fall back to an unknown character set. If Boost.MySQL is not sure that the collation will be accepted, the current character set will be left unknown. Note that leaving [refmem connect_params connection_collation] to its default value always sets the current character set to [reflink utf8mb4_charset]. * After connection, you can call [refmem any_connection set_character_set] or [refmemunq any_connection async_set_character_set] to set the current character set to a known value. This will issue a `SET NAMES` statement and also update the value stored in the client. * Calling [refmem any_connection reset_connection] or [refmemunq any_connection async_reset_connection] resets the character set to the server's default, which is unknown (usually `latin1`). The current character set will be unknown until you call [refmemunq any_connection set_character_set] or [refmemunq any_connection async_set_character_set]. [warning [*Do not execute `SET NAMES`], `SET CHARACTER SET` or any other SQL statement that modifies `character_set_client` using `execute`. This will make character set information stored in the client invalid. ] [heading:custom (Experimental) Adding support for a character set] Built-in support is provided for `utf8mb4` ([reflink utf8mb4_charset]) and `ascii` ([reflink ascii_charset]). We strongly encourage you to always use `utf8mb4`. If you really need to use a different character set, you can implement them by creating [reflink character_set] objects. You can then pass them to functions like [refmemunq any_connection set_character_set] like the built-in ones. The structure has the following members: * [refmem character_set name] must match the name you would use in `SET NAMES`. * [refmem character_set next_char] is used to iterate the string. It must return the length in bytes of the first code point in the string, or 0 if the code point is invalid. For example, this is how you could implement the `utf8mb4` character set. For brevity, only a small part of the implementation is shown - have a look at the definition of [reflink utf8mb4_charset] for a full implementation. [charsets_next_char] [endsect]