Serialization in C++

This article focuses on automating the serialization process in C++. We will start with the basic mechanisms that simplify reading and writing data to input/output streams, and then describe a simple libclang-based code generation system. A link to a repository with a demo version of the library is given at the end of the article.

On the Russian Stack Overflow (ruSO), questions about serializing data in C++ come up periodically. Sometimes they are general in nature, when the author simply does not know where to start; sometimes they describe a specific problem. The purpose of this article is to summarize one possible way to implement serialization in C++, walking through the construction of such a system from the first steps to a logical conclusion, at which point it can already be used in practice.

1. Initial Information


This article uses a binary data format whose structure is determined by the types of the serialized objects. This approach lets us avoid third-party libraries and limit ourselves to the tools provided by the standard C++ library.

Since serialization consists of converting the state of an object into a stream of bytes, which is naturally accompanied by write operations, the term "write" will be used instead of "serialization" when describing low-level details. The same applies to "read" and "deserialize".

To keep the article short, only serialization examples are given (except where deserialization has details worth mentioning). The full code can be found in the repository linked at the end.

2. Supported types


First of all, it is worth deciding which types we plan to support - the implementation of the library directly depends on this.

For example, if the choice is limited to the fundamental types of C++, then a function template (a family of functions covering the values of integral types) and its explicit specializations will suffice. The primary template (used for types such as std::int32_t, std::uint16_t, etc.):

template<typename T>
auto write(std::ostream& os, T value) -> std::size_t
{
    const auto pos = os.tellp();
    os.write(reinterpret_cast<const char*>(&value), sizeof(value));
    return static_cast<std::size_t>(os.tellp() - pos);
}

Note: if the data obtained during serialization is to be transferred between machines with different byte orders, the value must be converted, for example, from the host byte order to network byte order before writing, and the reverse conversion must be performed on the remote machine. Corresponding changes would be needed both in the function that writes data to the output stream and in the function that reads from the input stream.
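For illustration, a write overload with a fixed (big-endian) byte order for 32-bit values might look like the following sketch. It is not part of the library described below, and the name write_be32 is made up for this example:

auto write_be32(std::ostream& os, std::uint32_t value) -> std::size_t
{
    const auto pos = os.tellp();
    // write the most significant byte first, regardless of the host byte order
    const char bytes[] = {
        static_cast<char>((value >> 24) & 0xFF),
        static_cast<char>((value >> 16) & 0xFF),
        static_cast<char>((value >> 8) & 0xFF),
        static_cast<char>(value & 0xFF)
    };
    os.write(bytes, sizeof(bytes));
    return static_cast<std::size_t>(os.tellp() - pos);
}

The reading side would then reassemble the value from the four bytes in the same order, independent of its native endianness.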

Specialization for bool:

constexpr auto t_value = static_cast<std::uint8_t>('T');
constexpr auto f_value = static_cast<std::uint8_t>('F');

template<>
auto write(std::ostream& os, bool value) -> std::size_t
{
    const auto pos = os.tellp();
    const auto tmp = (value) ? t_value : f_value;
    os.write(reinterpret_cast<const char*>(&tmp), sizeof(tmp));
    return static_cast<std::size_t>(os.tellp() - pos);
}

This approach establishes the following rule: if a value of type T can be represented as a sequence of bytes of length sizeof(T), the primary template can be used for it; otherwise, a specialization must be defined. The need for a specialization is dictated by how an object of type T is represented in memory.

Consider std::string: obviously we cannot take the address of an object of this type, cast it to a pointer to char, and write it to the output stream - which means we need a specialization:

template<>
auto write(std::ostream& os, const std::string& value) -> std::size_t
{
    const auto pos = os.tellp();
    const auto len = static_cast<std::uint32_t>(value.size());
    os.write(reinterpret_cast<const char*>(&len), sizeof(len));
    if (len > 0) os.write(value.data(), len);
    return static_cast<std::size_t>(os.tellp() - pos);
}

Two important points to make here:

  1. Not only the contents of the string are written to the output stream, but also its size.
  2. The string size is cast from std::string::size_type to std::uint32_t. What matters here is not the particular size of the target type, but the fact that it has a fixed width. Such a conversion avoids problems when, for example, data is transmitted over a network between machines with different machine word sizes.

So, we have established that values of fundamental types (and even objects of type std::string) can be written to the output stream using the write function template. Now let's see what changes are needed to add containers to the list of supported types. We have only one option for overloading - to use the parameter T as the type of the container elements. And while this works for std::vector:

template<typename T>
auto write(std::ostream& os, const std::vector<T>& value) -> std::size_t
{
    const auto pos = os.tellp();
    const auto len = static_cast<std::uint16_t>(value.size());
    os.write(reinterpret_cast<const char*>(&len), sizeof(len));
    auto size = static_cast<std::size_t>(os.tellp() - pos);
    if (len > 0)
    {
        std::for_each(value.cbegin(), value.cend(),
            [&](const auto& e) { size += ::write(os, e); });
    }
    return size;
}

it will not work for std::map, because the std::map template requires at least two parameters - the key type and the value type. At this point the function template no longer suffices - we need a more universal solution. Before figuring out how to add container support, let's remember that we also have user-defined classes. Obviously, even with the current solution, it would be unwise to overload the write function for every class that requires serialization. Ideally, we would like a single specialization of the write template that works with user-defined types. For this, however, classes must be able to control their own serialization, which means they need an interface that allows serializing and deserializing their objects. As we will see a little later, this interface will serve as a "common denominator" for the write template when working with user-defined classes. Let's define it.

class ISerializable
{
protected:
    ~ISerializable() = default;

public:
    virtual auto serialize(std::ostream& os) const -> std::size_t = 0;
    virtual auto deserialize(std::istream& is) -> std::size_t = 0;
    virtual auto serialized_size() const noexcept -> std::size_t = 0;
};

Any class that inherits from ISerializable agrees to:

  1. Override serialize - write the state (data members) to the output stream.
  2. Override deserialize - read the state (initialize data members) from the input stream.
  3. Override serialized_size - calculate the size of the serialized data for the current state of the object.

So, back to the write function template: we can, in principle, implement a specialization for the ISerializable class, but we will not be able to use it conveniently - take a look:

template<>
auto write(std::ostream& os, const ISerializable& value) -> std::size_t
{
    return value.serialize(os);
}

Each time we would have to cast the derived type to ISerializable to take advantage of this specialization. Recall that our goal from the very beginning was to simplify writing serialization-related code, not to complicate it. So, if the types supported by our library are not limited to fundamental types, we should look for another solution.

3. stream_writer


Function templates turned out to be not quite suitable for implementing a universal interface for writing data to a stream. The next option to check is a class template. We will follow the same methodology as with the function template: the primary template is used by default, and explicit specializations are added to support the necessary types.

In addition, we should take into account everything said above about ISerializable: obviously, we will not be able to handle its many derived classes without resorting to type traits. Since C++11, the standard library has provided the std::enable_if template, which allows a template specialization to be discarded when certain conditions are not met at compile time - and this is exactly what we are going to take advantage of.

The stream_writer class template:

template<typename T, typename U = void>
class stream_writer
{
public:
    static auto write(std::ostream& os, const T& value) -> std::size_t;
};

The definition of the write method:

template<typename T, typename U>
auto stream_writer<T, U>::write(std::ostream& os, const T& value) -> std::size_t
{
    const auto pos = os.tellp();
    os.write(reinterpret_cast<const char*>(&value), sizeof(value));
    return static_cast<std::size_t>(os.tellp() - pos);
}

The specialization for ISerializable will be as follows:

template<typename T>
class stream_writer<T, only_if_serializable<T>> : public stream_io<T>
{
public:
    static auto write(std::ostream& os, const T& value) -> std::size_t;
};

where only_if_serializable is a helper type:

template<typename T>
using only_if_serializable = std::enable_if_t<std::is_base_of_v<ISerializable, T>>;

Thus, if type T is a class derived from ISerializable, this specialization will be considered as a candidate for instantiation; if type T is not in the same class hierarchy as ISerializable, the specialization is excluded from the set of candidates.
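The definition of write for this specialization is not shown above; presumably it simply delegates to the object's own serialize override, mirroring the earlier function-template specialization - a sketch:

template<typename T>
auto stream_writer<T, only_if_serializable<T>>::write(std::ostream& os, const T& value) -> std::size_t
{
    // any ISerializable descendant knows how to write its own state
    return value.serialize(os);
}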

A fair question to ask here: how will this work? After all, the primary template will have the same template argument values as its specialization - <T, void>. Why would the specialization be preferred, and will it be at all? Answer: it will, because this behavior is prescribed by the standard (source):

(1.1) If exactly one matching specialization is found, the instantiation is generated from that specialization

The specialization for std::string now looks like this:

template<typename T>
class stream_writer<T, only_if_string<T>>
{
public:
    static auto write(std::ostream& os, const T& value) -> std::size_t;
};

template<typename T>
auto stream_writer<T, only_if_string<T>>::write(std::ostream& os, const T& value) -> std::size_t
{
    const auto pos = os.tellp();
    const auto len = static_cast<std::uint32_t>(value.size());
    os.write(reinterpret_cast<const char*>(&len), sizeof(len));
    if (len > 0) os.write(value.data(), len);
    return static_cast<std::size_t>(os.tellp() - pos);
}

where only_if_string is declared as:

template<typename T>
using only_if_string = std::enable_if_t<std::is_same_v<T, std::string>>;

It is time to return to containers. Now we can use the container type itself, parameterized with some type U (or <U, V>, as in the case of std::map), directly as the value of the parameter T of the stream_writer class template. Thus, nothing changes in our interface - which is exactly what we were aiming for. However, the question arises: what should the second template parameter of stream_writer be so that everything works correctly? That is the subject of the next chapter.

4. Concepts


First, I will give a brief description of the concepts used, and only then I will show updated examples.

template<typename T>
concept String = std::is_same_v<T, std::string>;

To be honest, this concept was defined purely for a little cheating, which we will see in the very next definition:

template<typename T>
concept Container = !String<T> && requires (T a)
{
    typename T::value_type;
    typename T::reference;
    typename T::const_reference;
    typename T::iterator;
    typename T::const_iterator;
    typename T::size_type;
    // std::same_as comes from <concepts>
    { a.begin() } -> std::same_as<typename T::iterator>;
    { a.end() } -> std::same_as<typename T::iterator>;
    { a.cbegin() } -> std::same_as<typename T::const_iterator>;
    { a.cend() } -> std::same_as<typename T::const_iterator>;
    a.clear();
};

The Container concept contains the requirements we "impose" on a type to make sure it really is one of the container types. This is exactly the set of requirements we will need when implementing stream_writer; the standard, of course, imposes many more.

template<typename T>
concept SequenceContainer = Container<T> && requires (T a, typename T::size_type count)
{
    a.resize(count);
};

A concept for sequence containers: std::vector, std::list, etc.

template<typename T>
concept AssociativeContainer = Container<T> && requires (T a)
{
    typename T::key_type;
};

A concept for associative containers: std::map, std::set, std::unordered_map, etc.

Now, to define the specialization for sequence containers, all we have to do is constrain the type T:

template<typename T> requires SequenceContainer<T>
class stream_writer<T, void> : public stream_io<T>
{
public:
    static auto write(std::ostream& os, const T& value) -> std::size_t;
};

template<typename T> requires SequenceContainer<T>
auto stream_writer<T, void>::write(std::ostream& os, const T& value) -> std::size_t
{
    const auto pos = os.tellp();
    // to support std::forward_list we have to use std::distance()
    const auto len = static_cast<std::uint16_t>(
        std::distance(value.cbegin(), value.cend()));
    os.write(reinterpret_cast<const char*>(&len), sizeof(len));
    auto size = static_cast<std::size_t>(os.tellp() - pos);
    if (len > 0)
    {
        using value_t = typename stream_writer::value_type;
        std::for_each(value.cbegin(), value.cend(),
            [&](const auto& item) { size += stream_writer<value_t>::write(os, item); });
    }
    return size;
}

Supported containers:

  • std::vector
  • std::deque
  • std::list
  • std::forward_list

Similarly for associative containers:

template<typename T> requires AssociativeContainer<T>
class stream_writer<T, void> : public stream_io<T>
{
public:
    static auto write(std::ostream& os, const T& value) -> std::size_t;
};

template<typename T> requires AssociativeContainer<T>
auto stream_writer<T, void>::write(std::ostream& os, const T& value) -> std::size_t
{
    const auto pos = os.tellp();
    const auto len = static_cast<typename stream_writer::size_type>(value.size());
    os.write(reinterpret_cast<const char*>(&len), sizeof(len));
    auto size = static_cast<std::size_t>(os.tellp() - pos);
    if (len > 0)
    {
        using value_t = typename stream_writer::value_type;
        std::for_each(value.cbegin(), value.cend(),
            [&](const auto& item) { size += stream_writer<value_t>::write(os, item); });
    }
    return size;
}

Supported containers:

  • std::map
  • std::unordered_map
  • std::set
  • std::unordered_set

In the case of map there is a small nuance concerning the implementation of stream_reader. The value_type of std::map<K, T> is std::pair<const K, T>, so when we try to cast a pointer to const K to a pointer to char while reading from the input stream, we get a compilation error. We can solve this problem as follows: we know that for associative containers value_type is either a single type K or std::pair<const K, V>, so we can write small helper class templates parameterized by value_type that determine the type we actually need.

For std::set, everything remains unchanged:

template<typename U, typename V = void>
struct converter
{
    using type = U;
};

For std::map, const is removed:

template<typename U>
struct converter<U, only_if_pair<U>>
{
    using type = std::pair<std::remove_const_t<typename U::first_type>,
                           typename U::second_type>;
};
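The only_if_pair helper is not shown in the article; a likely definition, by analogy with only_if_string (the is_pair trait here is an assumption made for the sketch):

template<typename T>
struct is_pair : std::false_type {};

template<typename U, typename V>
struct is_pair<std::pair<U, V>> : std::true_type {};

// enabled only when T is a std::pair specialization
template<typename T>
using only_if_pair = std::enable_if_t<is_pair<T>::value>;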

The definition of read for associative containers:

template<typename T> requires AssociativeContainer<T>
auto stream_reader<T, void>::read(std::istream& is, T& value) -> std::size_t
{
    const auto pos = is.tellg();
    typename stream_reader::size_type len = 0;
    is.read(reinterpret_cast<char*>(&len), sizeof(len));
    auto size = static_cast<std::size_t>(is.tellg() - pos);
    if (len > 0)
    {
        for (auto i = 0U; i < len; ++i)
        {
            using value_t = typename converter<typename stream_reader::value_type>::type;
            value_t v {};
            size += stream_reader<value_t>::read(is, v);
            value.insert(std::move(v));
        }
    }
    return size;
}


5. Auxiliary functions


Consider an example:

class User : public ISerializable
{
public:
    User(std::string_view username, std::string_view password)
        : m_username(username)
        , m_password(password)
    {}

    SERIALIZABLE_INTERFACE

protected:
    std::string m_username {};
    std::string m_password {};
};
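The SERIALIZABLE_INTERFACE macro is not expanded in the article; presumably it just declares the three ISerializable overrides, along the lines of this sketch (the exact expansion is an assumption):

// declares the overrides whose definitions are written (or generated) elsewhere
#define SERIALIZABLE_INTERFACE                                          \
public:                                                                 \
    auto serialize(std::ostream& os) const -> std::size_t override;     \
    auto deserialize(std::istream& is) -> std::size_t override;         \
    auto serialized_size() const noexcept -> std::size_t override;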

The definition of the serialize(std::ostream&) method for this class should look like this:

auto User::serialize(std::ostream& os) const -> std::size_t
{
    auto size = 0U;
    size += stream_writer<std::string>::write(os, m_username);
    size += stream_writer<std::string>::write(os, m_password);
    return size;
}

However, you must admit that specifying the type of the object being written to the output stream every time is inconvenient. Let's write a helper function that deduces the type T automatically:

template<typename T>
auto write(std::ostream& os, const T& value) -> std::size_t
{
    return stream_writer<T>::write(os, value);
}

Now the definition is as follows:

auto User::serialize(std::ostream& os) const -> std::size_t
{
    auto size = 0U;
    size += ::write(os, m_username);
    size += ::write(os, m_password);
    return size;
}

The final chapter will require a few more helper functions:

template<typename T>
auto write_recursive(std::ostream& os, const T& value) -> std::size_t
{
    return ::write(os, value);
}

template<typename T, typename... Ts>
auto write_recursive(std::ostream& os, const T& value, const Ts&... values)
{
    auto size = write_recursive(os, value);
    return size + write_recursive(os, values...);
}

template<typename... Ts>
auto write_all(std::ostream& os, const Ts&... values) -> std::size_t
{
    return write_recursive(os, values...);
}

The write_all function allows listing all objects to be serialized at once, while write_recursive ensures the correct order of writes to the output stream. If the order of evaluation were defined for fold expressions over the binary + operator, we could use them here. In the size_of_all function (not mentioned earlier; it computes the size of the serialized data), fold expressions are in fact used, since no input/output operations are involved.
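The body of size_of_all is not shown in the article; a minimal sketch of what it could look like, assuming a size_of(value) helper analogous to ::write that only measures the serialized size instead of writing anything:

template<typename... Ts>
auto size_of_all(const Ts&... values) -> std::size_t
{
    // no I/O happens here, so the unspecified evaluation order of the fold is harmless
    return (std::size_t { 0 } + ... + ::size_of(values));
}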

6. Code Generation


The libclang C API for clang is used to generate the code. At a high level, the task can be described as follows: we need to recursively walk the directory with the source code, check every header file for classes marked with a special attribute, and, if one is found, check its data members for the same attribute and build a comma-separated string from the names of those data members. All that remains is to write templates for the definitions of the ISerializable functions, into which we only have to substitute the enumeration of the required data members.
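For the User class from the previous chapter, the generated definitions might look roughly like this (a sketch of the idea, not the generator's literal output):

// generated from the annotated members m_username, m_password
auto User::serialize(std::ostream& os) const -> std::size_t
{
    return ::write_all(os, m_username, m_password);
}

auto User::serialized_size() const noexcept -> std::size_t
{
    return ::size_of_all(m_username, m_password);
}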

An example of a class for which the code will be generated:

class __attribute__((annotate("serializable"))) User : public ISerializable
{
public:
    User(std::string_view username, std::string_view password)
        : m_username(username)
        , m_password(password)
    {}

    User() = default;
    virtual ~User() = default;

    SERIALIZABLE_INTERFACE

protected:
    __attribute__((annotate("serializable"))) std::string m_username {};
    __attribute__((annotate("serializable"))) std::string m_password {};
};

The attributes are written in GNU style because libclang refuses to recognize the C++20 attribute syntax, nor does it support attributes other than annotate. Traversing the source directory:

for (const auto& file : fs::recursive_directory_iterator(argv[1]))
{
    if (file.is_regular_file() && file.path().extension() == ".hpp")
    {
        processTranslationUnit(file, dst);
    }
}

The definition of the processTranslationUnit function:

auto processTranslationUnit(const fs::path& path, const fs::path& targetDir) -> void
{
    const auto pathname = path.string();

    arg::Context context { false, false };
    auto translationUnit = arg::TranslationUnit::parse(context, pathname.c_str(), CXTranslationUnit_None);

    arg::ClassExtractor extractor;
    extractor.extract(translationUnit.cursor());

    const auto& classes = extractor.classes();
    for (const auto& [name, c] : classes)
    {
        SerializableDefGenerator::processClass(c, path, targetDir.string());
    }
}

In this function, only ClassExtractor is of interest to us - everything else is needed to build the AST. The definition of the extract function looks like this:

void ClassExtractor::extract(const CXCursor& cursor)
{
    clang_visitChildren(cursor,
        [](CXCursor c, CXCursor, CXClientData data)
        {
            if (clang_getCursorKind(c) == CXCursorKind::CXCursor_ClassDecl)
            {
                /* record the class name          */
                /* visit its data members         */
                /* collect the annotated members  */
            }
            return CXChildVisit_Continue;
        }
        , this);
}

Here we finally see the clang C API functions directly. Only the code needed to understand how libclang is used is shown; everything left behind the scenes contains nothing important - it is just the bookkeeping of class names, data members, and so on. More detailed code can be found in the repository.

Finally, the processClass function checks each found class for the serialization attribute and, if it is present, generates a file with the definitions of the required functions. Specific details can be found in the repository: where to get the namespace name(s) (this information is stored directly in the Class class) and the path to the header file.
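The attribute check itself can be done with the same clang_visitChildren mechanism: annotate attributes appear as CXCursor_AnnotateAttr children of the annotated cursor. A sketch of such a check (an illustration, not the repository's actual code):

auto hasSerializableAttribute(const CXCursor& cursor) -> bool
{
    bool found = false;
    clang_visitChildren(cursor,
        [](CXCursor c, CXCursor, CXClientData data)
        {
            if (clang_getCursorKind(c) == CXCursorKind::CXCursor_AnnotateAttr)
            {
                // the spelling of an annotate attribute is its string argument
                const auto spelling = clang_getCursorSpelling(c);
                if (std::string_view { clang_getCString(spelling) } == "serializable")
                {
                    *static_cast<bool*>(data) = true;
                }
                clang_disposeString(spelling);
            }
            return CXChildVisit_Continue;
        }
        , &found);
    return found;
}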

The task above uses the Argentum library, which, unfortunately, I cannot recommend: I started developing it for other purposes, but since this task needed exactly the functionality already implemented there, and I was lazy, I did not rewrite the code - I simply published it on Bintray and hooked it into the CMake file through the Conan package manager. All this library provides is thin wrappers over the clang C API for classes and data members.

And one more small remark: I am not providing a ready-made library, only describing how to write one.

UPD0: cppast can be used instead of libclang. Thanks to masterspline for the link.

1. github.com/isnullxbh/dsl
2. github.com/isnullxbh/Argentum

Source: https://habr.com/ru/post/479462/

