Quality Matters, #4: Applying (Removable) Diagnostic Measures To The recls Library

by Matthew Wilson

This column instalment first published in ACCU's Overload, #95, February 2010. All content copyright Matthew Wilson 2009-2010.

Abstract

This instalment, like the last, involves getting my hands dirty examining another (open-source) library; this time it's recls [RECLS], which provides recursive file-system searching via a (largely) platform-independent API. recls, which stands for recursive ls, was my first venture into open-source libraries that involved compilation of source (as opposed to pure header-only libraries), and it still bears the scars of the early mistakes I made, so there're rich pickings to be had. (I should also mention that recls was the exemplar project for a series of instalments of my former CUJ column, 'Positive Integration', between 2003 and 2005; all these instalments are available online from Dr Dobb's Journal; a list of them all is given in http://www.synesis.com.au/publications.html#columns. I'll attempt as little duplication with them as possible.)

I'll begin with an introduction to recursive search, illustrating why it is such an onerous task using operating system APIs (OS APIs), and give some examples of how it's made easier with recls. This will be followed by an introduction to the recls architecture: core API, core implementation, and language mappings. The various design decisions will be covered, to give you an understanding of some of the pros and cons to be discussed later.

Then we'll get all 'Software Quality' on it, examining the API, the implementation and the C++ mapping(s). Each examination will cover the extant version (1.8), the new version (1.9) that should be released by the time you read this, and further improvements required in future versions. Naturally, the discussion will be framed by our aspects of software quality [QM#1]: as well as the usual discussions of intrinsic characteristics, the problem area - interaction with the file-system and the complexity of the library - dictates the use of (removable) diagnostic measures and applied assurance measures. It is in the application of the latter two that the meat of this month's learning resides (for me in particular).

Introduction

recls had a proprietary precursor in the dim and distant past, which I originally wrote to obviate the two main issues with recursive file-system search:

Let's look at a couple of examples to illustrate. Listings 1 and 2 print all files under a given search directory, in UNIX and Windows respectively. Both examples suffer the first issue, since the search APIs yield only the name of the entry (file/directory) retrieved, requiring you to remember the full directory where you have just searched, in order to append each directory name and recurse again.

Listing 1

unsigned list_all_files_r(char const* path)
{
  STLSOFT_ASSERT(NULL != path);
  STLSOFT_ASSERT('\0' != 0[path]);
  std::string directory(path);
  if(directory[directory.size() - 1u] != '/')
  {
    directory += '/';
  }
  DIR* dir = ::opendir(path);
  if(NULL == dir)
  {
    ff::fmtln(std::cerr, "failed to search '{0}': {1} ({2})", path, stlsoft::error_desc(errno), errno);
    return ~0u;
  }
  else
  {
    stlsoft::scoped_handle<DIR*>  scoper(dir, ::closedir);
    unsigned n = 0u;
    { for(struct dirent* de; NULL != (de = ::readdir(dir));)
    {
      if( de->d_name[0] == '.' &&
          de->d_name[1] == '\0')
      {
        // '.'
      }
      else if(de->d_name[0] == '.' &&
              de->d_name[1] == '.' &&
              de->d_name[2] == '\0')
      {
        // '..'
      }
      else
      {
        std::string entryPath = directory + de->d_name;
        struct stat st;
        int r = ::stat(entryPath.c_str(), &st);
        if(0 != r)
        {
          ff::fmtln(std::cerr, "failed to stat '{0}': {1} ({2})", entryPath, stlsoft::error_desc(errno), errno);
        }
        else
        {
          if(S_ISREG(st.st_mode))
          {
            ff::fmtln(std::cout, "    {0}", entryPath);
            ++n;
          }
          else
          {
            n += list_all_files_r(entryPath.c_str());
          }
        }
      }
    }}
    return n;
  }
}

void list_all_files(char const* path)
{
  ff::fmtln(std::cout, "Searching '{0}'", path);
  unsigned n = list_all_files_r(path);
  if(~0u != n)
  {
    ff::fmtln(std::cout, "  {0} file(s) found", n);
  }
}
 
 

Listing 2

unsigned list_all_files_r(char const* path)
{
  STLSOFT_ASSERT(NULL != path);
  STLSOFT_ASSERT('\0' != 0[path]);
  std::string directory(path);
  if(directory[directory.size() - 1u] != '\\')
  {
    directory += '\\';
  }
  std::string searchSpec = directory + "*.*";
  WIN32_FIND_DATA data;
  HANDLE          h = ::FindFirstFile(searchSpec.c_str(), &data);
  if(INVALID_HANDLE_VALUE == h)
  {
    DWORD err = ::GetLastError();
    ff::fmtln(std::cerr, "failed to search '{0}': {1} ({2})", path, winstl::error_desc(err), err);
    return ~0u;
  }
  else
  {
    stlsoft::scoped_handle<HANDLE>  scoper(h, ::FindClose, INVALID_HANDLE_VALUE);
    unsigned n = 0u;
    do
    {
      if(data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
      {
        if( data.cFileName[0] == '.' &&
            data.cFileName[1] == '\0')
        {
          // '.'
        }
        else if(data.cFileName[0] == '.' &&
                data.cFileName[1] == '.' &&
                data.cFileName[2] == '\0')
        {
          // '..'
        }
        else
        {
          n += list_all_files_r((directory + data.cFileName).c_str());
        }
      }
      else
      {
        ff::fmtln(std::cout, "    {0}{1}", directory, data.cFileName);
        ++n;
      }
    } while(::FindNextFile(h, &data));
    return n;
  }
}

void list_all_files(char const* path)
{
  ff::fmtln(std::cout, "Searching '{0}'", path);
  unsigned n = list_all_files_r(path);
  if(~0u != n)
  {
    ff::fmtln(std::cout, "  {0} file(s) found", n);
  }
}
 

The second problem can be seen in the extra processing on UNIX. The UNIX search API - opendir()/readdir() - provides only the file name. To find out whether the entry you've just retrieved is a file or a directory you must issue another system call, stat(); you also have to call this to find out file size, timestamps, and so forth. Conversely, the Windows search API - FindFirstFile()/FindNextFile() - includes all such information in the WIN32_FIND_DATA structure that the search functions fill out each time an entry is found.
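The extra system call can be seen in isolation in the following minimal sketch, which assumes a POSIX system. The entry_kind enumeration and the classify_entry() function are hypothetical names introduced only for illustration; the sketch tests the file-type bits with the standard S_ISREG and S_ISDIR macros, rather than assuming that anything that is not a regular file must be a directory.

```cpp
#include <cassert>
#include <sys/stat.h>   // ::stat(), struct stat, S_ISREG, S_ISDIR (POSIX)

// Hypothetical helper type, not part of any listing in this column
enum class entry_kind { regular_file, directory, other, error };

// Issues the additional stat() call that readdir() makes necessary on
// UNIX, and classifies the entry from the file-type bits of st_mode.
inline entry_kind classify_entry(char const* path)
{
  assert(nullptr != path);
  struct stat st;
  if(0 != ::stat(path, &st))
  {
    return entry_kind::error;         // errno holds the reason
  }
  if(S_ISREG(st.st_mode))
  {
    return entry_kind::regular_file;
  }
  if(S_ISDIR(st.st_mode))
  {
    return entry_kind::directory;
  }
  return entry_kind::other;           // FIFO, socket, device, ...
}
```

Because stat() follows symbolic links, a link to a directory is reported as a directory here; a search that must not follow links would use lstat() instead.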

As I hope both examples clearly illustrate, with either operating system you've got to put in a lot of work just to do a basic search. The mundane preparation of the search directory (appended with the search-all pattern *.* in Windows) and the elision of the dots directories - . and .. - dominate the code. And neither of these is a terribly good exemplar: I've assumed everything not a regular file is a directory on UNIX, which does not always hold, and I've horribly overloaded the return value of the worker function list_all_files_r() to indicate an error condition. More robust versions would do it better, but would include even more code. The intrinsic software evaluations are not all that impressive:

So let's look at the alternative. Listings 3 and 4 show the same functionality obtained via recls' core API, in a step-wise manner (via Recls_Search()) and a callback manner (via Recls_SearchProcess()) respectively. Listing 5 shows the same functionality obtained via recls' C++ mapping (the new unified form available in version 1.9).

Listing 3

// Assumes introduction of recls namespace symbols

void list_all_files(char const* path)
{
  ff::fmtln(std::cout, "Searching '{0}'", path);
  hrecls_t    hSrch;
  recls_rc_t  rc  = Recls_Search(path, NULL, recls::FILES | recls::RECURSIVE, &hSrch);
  if(RECLS_FAILED(rc))
  {
    ff::fmtln(std::cerr, "failed to search '{0}': {1} ({2})", path, rc, int(rc));
  }
  else
  {
    stlsoft::scoped_handle<hrecls_t> scoper(hSrch, Recls_SearchClose);
    unsigned  n = 0u;
    entry_t   entry;
    { for(rc = Recls_GetDetails(hSrch, &entry); RECLS_SUCCEEDED(rc); rc = Recls_GetNextDetails(hSrch, &entry), ++n)
    {
      stlsoft::scoped_handle<entry_t> scoper2(entry, Recls_CloseDetails);
      ff::fmtln(std::cout, "    {0}", entry->path);
    }}
    ff::fmtln(std::cout, "  {0} file(s) found", n);
  }
}
 
 

Listing 4

// Assumes introduction of recls namespace symbols

int RECLS_CALLCONV_DEFAULT onFile(
    recls_entry_t               entry
,   recls_process_fn_param_t    param
)
{
  ff::fmtln(std::cout, "    {0}", entry->path);
  ++*static_cast<unsigned*>(param);
  return +1; // continue
}

void list_all_files(char const* path)
{
  ff::fmtln(std::cout, "Searching '{0}'", path);
  unsigned    n   = 0u;
  recls_rc_t  rc  = Recls_SearchProcess(path, NULL, recls::FILES | recls::RECURSIVE, onFile, &n);
  if(RECLS_SUCCEEDED(rc))
  {
    ff::fmtln(std::cout, "  {0} file(s) found", n);
  }
  else
  {
    ff::fmtln(std::cerr, "failed to search '{0}': {1} ({2})", path, rc, int(rc));
  }
}
 
 

Listing 5

void list_all_files(char const* path)
{
  ff::fmtln(std::cout, "Searching '{0}'", path);
  try
  {
    recls::search_sequence files(path, recls::wildcardsAll(), recls::FILES | recls::RECURSIVE);
    unsigned n = 0;
    { for(recls::search_sequence::const_iterator i = files.begin(); i != files.end(); ++i, ++n)
    {
      ff::fmtln(std::cout, "    {0}", *i);
    }}
    ff::fmtln(std::cout, "  {0} file(s) found", n);
  }
  catch(recls::recls_exception& x)
  {
    ff::fmtln(std::cerr, "failed to search '{0}': {1} ({2})", path, x, int(x.get_rc()));
  }
}
 

Clearly, each example has benefited from the use of a dedicated library, compared to the first two. Each is more expressive, for three reasons. First, the abstraction level of recursive file-system search has been raised. Second, portability is evidently increased: indeed, none of the examples exhibits any platform dependencies. Finally, recls' types are flexible: note that we can pass entry instances, or their path fields, directly to FastFormat [FF-1, FF-2, FF-3]. These factors also contribute to a likely increase in robustness, most particularly in the removal of the fiddly code for handling search directory, dots directories and file information. I'd also argue strongly that the transparency of the code is improved.

On the negative side, modularity has been reduced, since we now depend on recls and (albeit indirectly for Listings 3 and 4) on STLSoft [STLSOFT].

So, pretty good so far. However, the picture is not perfect. recls has some unpleasant characteristics, and they're not all addressed yet, even with the latest release. The purpose of this instalment is to use the flaws in recls to illustrate software quality issues involved in writing non-trivial software libraries with unpredictable operating-system interactions. Let's dig in.

The recls library

The recls architecture comprises three major parts:

As I've mentioned numerous times previously [QM#3, !(C ^ C++)], I prefer a C-API wherever possible, because it:

In the case of recls, the interoperability was the clincher, although I'm starting to withdraw from this position somewhat, as I'll discuss later.

The recls core API

The two main entities in recls are the search and the entry. A search comprises a root directory, a search specification, and a set of flags that moderate the search behaviour and the type of information retrieved. An entry is a file-system entry that is found as a result of executing the search at a given time. It provides read-only access to the full path, the drive (on Windows), the directory, the file (name and/or extension), the size (for files), the file-system-specific attributes, the timestamps, as well as other useful pseudo-properties such as search-relative path.

The "Search" Type

The search type is not visible to client code, and is manipulated as an opaque handle, hrecls_t, via API functions. The search type has a state, which is a non-reversible/non-resettable position referring to an item within the directory tree under the given search directory. (Note that the state reflects a localised snapshot: it remembers which file it's on, but which file comes next can change, depending on external operating-system activity. On a long enumeration it is possible to omit an item that was removed after the enumeration commenced, and to include an item that was not present at the time of commencement, just as is the case with manual enumeration.)
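The opaque-handle idea can be sketched as follows. Every name here (the demo namespace, search_state, hsearch_t and the search_*() functions) is hypothetical and is not the recls API; the sketch enumerates an in-memory list rather than a file system, purely to show a handle whose position can only move forward.

```cpp
#include <cstddef>
#include <string>
#include <vector>

namespace demo
{
  // Hypothetical names throughout: this is not the recls API.
  struct search_state;              // clients need only this declaration
  typedef search_state* hsearch_t;  // the opaque handle

  // The state behind the handle. In a real library this definition would
  // live in an implementation file, out of sight of client code.
  struct search_state
  {
    std::vector<std::string> items;
    std::size_t              pos;
  };

  inline hsearch_t search_open(std::vector<std::string> const& items)
  {
    search_state* s = new search_state();
    s->items = items;
    s->pos   = 0;
    return s;
  }

  // Copies the current item into *item; false once the search is exhausted
  inline bool search_current(hsearch_t h, std::string* item)
  {
    if(h->pos >= h->items.size())
    {
      return false;
    }
    *item = h->items[h->pos];
    return true;
  }

  // Advances the position; false when nothing further remains.
  // There is deliberately no way to rewind or reset.
  inline bool search_next(hsearch_t h)
  {
    if(h->pos < h->items.size())
    {
      ++h->pos;
    }
    return h->pos < h->items.size();
  }

  inline void search_close(hsearch_t h)
  {
    delete h;
  }
}
```

Client code holds only the handle and calls the functions, so the layout of the state can change between versions without breaking callers.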

The API functions of concern include:

The "Entry" Type

In contrast, the entry type is only semi-opaque. The API functions that retrieve the entry details from a search handle are defined in terms of the handle type recls_entry_t (aka recls::entry_t in C++ compilation units), as in:

  RECLS_API Recls_GetDetails(
    hrecls_t        hSrch
  , recls_entry_t*  phEntry
  );

In the same vein, the API functions that elicit individual characteristics about an entry do so in terms of the handle type, as in:

  RECLS_FNDECL(size_t) Recls_GetPathProperty(
    recls_entry_t hEntry
  , recls_char_t* buffer
  , size_t        cchBuffer
  );

Thus, it is possible to write application code in an operating system-independent manner. However, because different operating systems provide different file-system entry information, and application programmers may want access to that information, the underlying type for recls_entry_t, struct recls_entryinfo_t, is defined in the API (see Listing 6).

Listing 6

typedef struct recls_entryinfo_t const* recls_entry_t;

struct recls_strptrs_t
{
  recls_char_t const* begin;
  recls_char_t const* end;
};
struct recls_strptrsptrs_t
{
  struct recls_strptrs_t const* begin;
  struct recls_strptrs_t const* end;
};

#if !defined(RECLS_PURE_API)
struct recls_entryinfo_t
{
  recls_uint32_t              attributes;
  struct recls_strptrs_t      path;
# if defined(RECLS_PLATFORM_IS_WINDOWS)
  struct recls_strptrs_t      shortFile;
  recls_char_t                drive;
# endif /* RECLS_PLATFORM_IS_WINDOWS */
  struct recls_strptrs_t      directory;
  struct recls_strptrs_t      fileName;
  struct recls_strptrs_t      fileExt;
  struct recls_strptrsptrs_t  directoryParts;
# if defined(RECLS_PLATFORM_IS_WINDOWS)
  recls_time_t                creationTime;
# endif /* RECLS_PLATFORM_IS_WINDOWS */
  recls_time_t                modificationTime;
  recls_time_t                lastAccessTime;
# if defined(RECLS_PLATFORM_IS_UNIX)
  recls_time_t                lastStatusChangeTime;
# endif /* RECLS_PLATFORM_IS_UNIX */
  recls_filesize_t            size;
  struct recls_strptrs_t      searchDirectory;
  struct recls_strptrs_t      searchRelativePath;
  /* Remaining members are undocumented and subject to change */
  recls_uint64_t              checkSum;
  recls_uint32_t              extendedFlags[2];
  recls_byte_t                data[1];
};
#endif /* !RECLS_PURE_API */
 

You may have noted, from Listing 3, another reason to use the recls_entryinfo_t struct: it leads to more succinct code. That's because string access shims [XSTL, FF-2, IC++] are defined for the recls_strptrs_t type, as in:

  # if defined(RECLS_CHAR_TYPE_IS_WCHAR)
  inline wchar_t const* c_str_data_w(
  # else /* ? RECLS_CHAR_TYPE_IS_WCHAR */
  inline char const* c_str_data_a(
  # endif /* RECLS_CHAR_TYPE_IS_WCHAR */
    recls_strptrs_t const& ptrs
  )
  {
    return ptrs.begin;
  }
  # if defined(RECLS_CHAR_TYPE_IS_WCHAR)
  inline size_t c_str_len_w(
  # else /* ? RECLS_CHAR_TYPE_IS_WCHAR */
  inline size_t c_str_len_a(
  # endif /* RECLS_CHAR_TYPE_IS_WCHAR */
    recls_strptrs_t const& ptrs
  )
  {
    return static_cast<size_t>(ptrs.end - ptrs.begin);
  }

So when we write

  ff::fmtln(std::cout, "    {0}", entry->path);
 

the FastFormat application layer [FF-1, FF-2, FF-3] knows to invoke stlsoft::c_str_data_a() and stlsoft::c_str_len_a() (or the widestring equivalents, in a widestring build) to elicit the string slice representing the path.
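To show the mechanism in miniature, here is a sketch of string access shims for three string representations, with one generic function written purely in terms of the shims. The demo namespace, the slice type and to_std_string() are hypothetical stand-ins; the real shims live in STLSoft and recls, and FastFormat's internals are more involved than this.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

namespace demo
{
  // Hypothetical stand-in for a begin/end slice such as recls_strptrs_t
  struct slice
  {
    char const* begin;
    char const* end;
  };

  // The shims: one overloaded pair per string representation
  inline char const* c_str_data_a(slice const& s)       { return s.begin; }
  inline std::size_t c_str_len_a(slice const& s)        { return static_cast<std::size_t>(s.end - s.begin); }

  inline char const* c_str_data_a(std::string const& s) { return s.data(); }
  inline std::size_t c_str_len_a(std::string const& s)  { return s.size(); }

  inline char const* c_str_data_a(char const* s)        { return s; }
  inline std::size_t c_str_len_a(char const* s)         { return std::strlen(s); }

  // Generic code never names a concrete string type; it asks the shims
  // for a pointer and a length, so slices need not be NUL-terminated.
  template <typename S>
  std::string to_std_string(S const& s)
  {
    return std::string(c_str_data_a(s), c_str_len_a(s));
  }
}
```

Supporting a new string type then means adding two overloads, without touching the generic code.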

Time and Size

You may have looked at Listing 6 and wondered about the definitions of recls_time_t and recls_filesize_t. Here's where the platform-independence falls down. With 1.8 (and earlier), the time and size types were defined as follows:

  #if defined(RECLS_PLATFORM_IS_UNIX)
  typedef time_t         recls_time_t;
  typedef off_t          recls_filesize_t;
  #elif defined(RECLS_PLATFORM_IS_WINDOWS)
  typedef FILETIME       recls_time_t;
  typedef ULARGE_INTEGER recls_filesize_t;
  . . .

The decision to do this was pretty much a fallback, as I didn't think of better alternatives at the time. (If memory serves, the size type results from a time when I was still interested in maintaining compatibility with C++ compilers that did not have 64-bit integer types.) No-one's actually ever complained about this, so either no-one's using time/size information for multi-platform programming or they've learned to live with it. I've learned to live with the size thing by using conversion shims [IC++, XSTL] to abstract away the difference between the UNIX and Windows types, as in:

  ff::fmtln(std::cout, "size of {0} is {1}", entry->path, stlsoft::to_uint64(entry->size));

But it's still a pain, and a reduction in the transparency of client code. Time is more of a pain, and is considerably less easy to work around.

Both of these detract significantly from the discoverability of the library, and require change. With 1.9 I've redefined recls_filesize_t to be a 64-bit unsigned integer, and invoke the conversion shim internally. Alas, I've run out of time with the time attribute, and the inconsistent, platform-dependent time types abide. This will be addressed with 1.10, hopefully sometime later this year.
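The conversion-shim workaround for the size type can be sketched as a pair of overloads. The split_size struct below is a hypothetical stand-in for a Windows-style size held as two 32-bit halves, and the demo::to_uint64() overloads are written for illustration only; they are not the STLSoft implementations.

```cpp
#include <cstdint>

namespace demo
{
  // Hypothetical stand-in for a size stored as two 32-bit halves
  struct split_size
  {
    std::uint32_t low;
    std::uint32_t high;
  };

  // Shim for the split representation: recombine the halves
  inline std::uint64_t to_uint64(split_size const& sz)
  {
    return (static_cast<std::uint64_t>(sz.high) << 32) | sz.low;
  }

  // Shim for a signed integral size, as a UNIX off_t would be
  inline std::uint64_t to_uint64(std::int64_t sz)
  {
    return static_cast<std::uint64_t>(sz);
  }
}
```

Client code calls to_uint64() on whichever type the platform provides and receives the same 64-bit unsigned value either way, which is the conversion that 1.9 now performs internally.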

Intrinsic Quality

Let's do a quick check-list of the intrinsic software quality of the core API, and client code that uses it.

So, from a purely API perspective, clear wins for using recls are expressiveness and portability, with some flexibility thrown in the mix.

The recls core implementation

Unfortunately, the cheery picture I've painted thus far starts to peel and crack when we look at the implementation, which is hideously opaque (!transparent).

Implementation Language: C or C++?

The first thing to note is that the implementation language is C++. There are two reasons. First, and most significantly, this was so I could use a large number of components from STLSoft to assist in the implementation. The main ones are:

The other reason was that there is some runtime polymorphism going on inside, allowing for file search and FTP search (Windows-only) to share much of the same surrounding code. Thus, a search begun with Recls_SearchFtp() can be manipulated in exactly the same way as one begun with Recls_Search() by client code (and mapping layers). I've long outgrown the perverse pleasure one gets from writing polymorphic code in C, so it had to be C++.

While the first reason did prove itself, in that I was able to implement a large amount of functionality in a relatively short amount of time, I'm not sure that I would do the same again. Some of the code in there is insanely baroque; consider, for example, the constructor of the internal class ReclsFileSearchDirectoryNode (Listing 7).

Listing 7

ReclsFileSearchDirectoryNode::ReclsFileSearchDirectoryNode(
  recls_uint32_t            flags
, recls_char_t const*       searchDir
, size_t                    rootDirLen
, recls_char_t const*       pattern
, size_t                    patternLen
, hrecls_progress_fn_t      pfn
, recls_process_fn_param_t  param
)
  : m_current(NULL)
  , m_dnode(NULL)
  , m_flags(flags)
  , m_rootDirLen(rootDirLen)
  , m_searchDir()
  , m_searchDirLen(prepare_searchDir_(m_searchDir, searchDir))
  , m_pattern(pattern)
  , m_patternLen(patternLen)
  , m_directories(
      searchDir
#if defined(RECLS_PLATFORM_IS_WINDOWS)
    , types::traits_type::pattern_all()
#endif /* platform */
    , dssFlags_from_reclsFlags_(flags))
  , m_directoriesBegin(
      select_iter_if_(
        flags & RECLS_F_RECURSIVE
      , m_directories.begin()
      , m_directories.end()))
  , m_entries(
      searchDir
    , pattern
#ifdef RECLS_SUPPORTS_MULTIPATTERN_
    , types::traits_type::path_separator()
#endif /* RECLS_SUPPORTS_MULTIPATTERN_ */
    , essFlags_from_reclsFlags_(flags))
  , m_entriesBegin(m_entries.begin())
  , m_pfn(pfn)
  , m_param(param)
{
  . . .
 

This is really, really horrible. As Aussies like to say, 'How embarrassment?'

The class clearly has a large number of member variables; there are member initialiser-list ordering dependencies; there are even conditionally compiled differences in the constructor arguments of the member variables! The constructor body contains static assertions to ensure that the member ordering issues do not bite, but that hardly makes up for all the rest of it. As in many codebases, there were good reasons for each of these individual steps, but the end result is a big mess. I can tell you that adding new features to this codebase is a problem.
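For readers unfamiliar with the ordering hazard, the following minimal sketch shows one way such a static assertion might look. The dir_node class is hypothetical and far simpler than the class in Listing 7, and the use of offsetof with C++11's static_assert is my assumption about a reasonable form, not a reproduction of the recls source.

```cpp
#include <cstddef>   // offsetof, std::size_t

namespace demo
{
  // Hypothetical class, for illustration only
  class dir_node
  {
  public:
    explicit dir_node(std::size_t pathLen)
      : m_pathLen(pathLen)
      , m_patternOffset(m_pathLen + 1)  // reads m_pathLen: needs it initialised first
    {
      // Members are initialised in declaration order, whatever order the
      // initialiser list is written in, so pin the declaration order down.
      static_assert(
        offsetof(dir_node, m_pathLen) < offsetof(dir_node, m_patternOffset),
        "m_pathLen must be declared before m_patternOffset");
    }

    std::size_t path_len() const       { return m_pathLen; }
    std::size_t pattern_offset() const { return m_patternOffset; }

  private:
    std::size_t m_pathLen;
    std::size_t m_patternOffset;
  };
}
```

If someone later swaps the two member declarations, compilation fails, rather than m_patternOffset being silently computed from an uninitialised value.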

There are also some per-class memory allocation routines. In particular, the file entry instance type recls_entryinfo_t (see Listing 6) is of variable length, so that the path, search directory and (for Windows) the short file strings, along with the array of string slices that constitute the directory parts, are all allocated in a single block. This adds further complexity. Unlike the monstrous constructor shown above, however, I would defend this tactic for the entry info. Because it is immutable, and reference-counted (via a hidden prefixed field), it means that all of the complexity involved in dealing with the instances is encapsulated in one place, after which it can be copied easily (via adding a reference) and eventually released via a single call to free(). I've used this technique many times in the past, and I think it fine. (I may be deluding myself through habit, of course.)
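A much-reduced sketch of the single-block technique follows. All names (entry_info, block, entry_create() and so on) are hypothetical, the record holds only one string, and the reference count is not thread-safe; it is meant only to show the shape of the idea, not the recls implementation.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

namespace demo
{
  // Hypothetical, much-reduced analogue of an immutable entry record
  struct entry_info
  {
    char const* path_begin;
    char const* path_end;
    char        data[1];      // string storage runs on past the struct
  };

  // The whole allocation: hidden reference count, then the visible record
  struct block
  {
    long        refCount;     // the hidden prefixed field
    entry_info  info;
  };

  // Steps back from the visible record to the start of its block
  inline block* block_of(entry_info const* e)
  {
    char* p = reinterpret_cast<char*>(const_cast<entry_info*>(e));
    return reinterpret_cast<block*>(p - offsetof(block, info));
  }

  inline entry_info const* entry_create(char const* path)
  {
    std::size_t const len = std::strlen(path);
    // sizeof(block) already includes data[1], which holds the terminator
    block* b = static_cast<block*>(std::malloc(sizeof(block) + len));
    if(NULL == b)
    {
      throw std::bad_alloc();
    }
    b->refCount = 1;
    char* const strings = reinterpret_cast<char*>(&b->info) + offsetof(entry_info, data);
    std::memcpy(strings, path, len + 1);
    b->info.path_begin = strings;
    b->info.path_end   = strings + len;
    return &b->info;
  }

  // "Copying" an immutable record is just another reference
  inline entry_info const* entry_add_ref(entry_info const* e)
  {
    ++block_of(e)->refCount;
    return e;
  }

  // Returns true if this call released the last reference and freed the block
  inline bool entry_release(entry_info const* e)
  {
    block* b = block_of(e);
    if(0 == --b->refCount)
    {
      std::free(b);   // one call releases the record and all its strings
      return true;
    }
    return false;
  }
}
```

All the pointer arithmetic is confined to creation and to block_of(); everything else treats the record as a plain read-only struct.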

Intrinsic Quality

Let's do a quick check-list of the intrinsic software quality of the core implementation.

Anyone who can be bothered to download 1.8 and 1.9 will see a lot more files in the src/ directory for 1.9, as a consequence of my having started to pare away the components from each other. In 1.8, there were sixteen .cpp files, and I think I can say that six were good, eight were moderate, and two were bad. The refactoring has helped a lot, such that, of the 21 .cpp files in the 1.9 source directory, eleven are good, eight are moderate, and only two are bad. The numbers back up what I'm trying to do, which is to separate out all parts that are clear and good, or semi-clear and semi-good, in order to reduce the overall cost if/when a full refactoring happens. Of course, as shown above, the bad is still really bad. But now the badness is not impinging on the good.

As well as the refactoring reason - letting me see the wood for the trees - there's another reason for splitting up the files, which we'll get to in a minute or two.

The recls C++ mapping(s)

In versions prior to 1.9 recls has shipped with two separate mappings for C++ client code:

Enumerating with the original C++ mapping would look something like that shown in Listing 8.

Listing 8

void list_all_files(char const* path)
{
  ff::fmtln(std::cout, "Searching '{0}'", path);
  try
  {
    recls::cpp::FileSearch search(path, recls::Recls_GetWildcardsAll(), recls::FILES | recls::RECURSIVE);
    unsigned n = 0;
    for(; search.HasMoreElements(); search.GetNext(), ++n)
    {
      recls::cpp::FileEntry entry = search.GetCurrentEntry();
      ff::fmtln(std::cout, "    {0}", entry);
    }
    ff::fmtln(std::cout, "  {0} file(s) found", n);
  }
  catch(recls::cpp::ReclsException& x)
  {
    ff::fmtln(std::cerr, "failed to search '{0}': {1} ({2})", path, x, int(x.rc()));
  }
}
 

Both were provided partly because of recls' secondary role as a research and writing vehicle for my CUJ column, and partly because, at the time (2003), STL was still somewhat novel and unfamiliar to some C++ programmers. In the 6+ years since, I've found myself using the C++ mapping for enumeration in commercial projects precisely zero times, and I've had little feedback to suggest that users make much use of it, either.

So, given that I was already making significant breaking changes, and (temporarily) dropping other mappings, I decided to take the opportunity and merge the best features from the two mappings. Simplistically, the utility functions come from the former "C++" mapping, and the collections come from the former STL mapping.

Consequently, version 1.9 supports only a single C++ mapping, which comprises six types:

and a growing list (1.9 is still being polished as I write this) of utility functions:

Headers

The other change is that now you just include <recls/recls.hpp>, which serves two purposes:

The result is just a whole lot less to type, or to think about. More discoverable, if you will.

Properties

One other thing to note. In the last chapter (35) of Imperfect C++ [IC++], I described a set of (largely) portable techniques I'd devised for defining highly efficient properties (as we know them from C# and Ruby) for C++. So, for all compilers that support them (which is pretty much everything better than VC++ 6, which is pretty much everything of import these days), you have the option to elicit entry information via getters, as in

  std::string srp = entry.get_search_relative_path();
  uint64_t    sz  = entry.get_size();
  ????        ct  = entry.get_creation_time();
  // Still platform-dependent ;-/
  bool        ro  = entry.is_readonly();

or via properties, as in:

  std::string srp = entry.SearchRelativePath;
  uint64_t    sz  = entry.Size;
  ????        ct  = entry.CreationTime;
  // Still platform-dependent ;-/
  bool        ro  = entry.IsReadOnly;

if you like that kind of thing. (Which I do.)
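For readers who have not met properties in C++, here is a deliberately naive sketch of a read-only property. It is not the technique from Imperfect C++ and not the recls implementation: each property here stores a pointer back to its owner, a per-property cost that an efficient technique would aim to avoid. The names get_property and demo::entry are hypothetical.

```cpp
#include <cstdint>
#include <string>

namespace demo
{
  // Read-only property: converts to R by calling a getter on its owner
  template <typename C, typename R, R (C::*Getter)() const>
  class get_property
  {
  public:
    explicit get_property(C const& owner)
      : m_owner(&owner)
    {}
    operator R() const
    {
      return (m_owner->*Getter)();
    }
  private:
    C const* m_owner;
  };

  class entry
  {
  public:
    entry(std::string const& path, std::uint64_t size)
      : Path(*this)
      , Size(*this)
      , m_path(path)
      , m_size(size)
    {}
    // Copying would leave the copy's properties pointing at the source
    entry(entry const&) = delete;
    entry& operator =(entry const&) = delete;

    std::string   get_path() const { return m_path; }
    std::uint64_t get_size() const { return m_size; }

    get_property<entry, std::string, &entry::get_path>    Path;
    get_property<entry, std::uint64_t, &entry::get_size>  Size;

  private:
    std::string   m_path;
    std::uint64_t m_size;
  };
}
```

With this in place, `std::string p = e.Path;` and `std::string p = e.get_path();` yield the same value; the property is only syntactic sugar over the getter.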

Quality?

Let's do a quick check-list of the intrinsic software quality of the new C++ mapping.

Listing 9

search_sequence::const_iterator
search_sequence::begin() const
{
  hrecls_t    hSrch;
  recls_rc_t  rc = Recls_Search(m_directory, m_pattern, m_flags, &hSrch);
  if( RECLS_FAILED(rc) &&
      RECLS_RC_NO_MORE_DATA != rc)
  {
    throw recls_exception(rc);
  }
  return const_iterator(hSrch);
}

ftp_search_sequence::const_iterator
ftp_search_sequence::begin() const
{
  hrecls_t    hSrch;
  recls_rc_t  rc = Recls_SearchFtp(m_host.c_str(), m_username.c_str(), m_password.c_str(), m_directory, m_pattern, m_flags, &hSrch);
  if( RECLS_FAILED(rc) &&
      RECLS_RC_NO_MORE_DATA != rc)
  {
    throw recls_exception(rc);
  }
  return const_iterator(hSrch);
}


template <typename C, typename T, typename V>
basic_search_sequence_const_iterator<C, T, V>&
basic_search_sequence_const_iterator<C, T, V>::operator ++()
{
  RECLS_MESSAGE_ASSERT("Attempting to increment invalid iterator", NULL != m_handle);
  if(RECLS_FAILED(Recls_GetNext(m_handle->hSrch)))
  {
    m_handle->Release();
    m_handle = NULL;
  }
  return *this;
}

class entry
{
  . . .
public: /// Attribute Methods
  char_type const* c_str() const
  {
    STLSOFT_ASSERT(NULL != m_entry);
    return m_entry->path.begin;
  }
  . . .
  recls_time_t get_creation_time() const
  {
    STLSOFT_ASSERT(NULL != m_entry);
    return Recls_GetCreationTime(m_entry);
  }
  . . .
  string_type get_path() const
  {
    STLSOFT_ASSERT(NULL != m_entry);
    return string_type(m_entry->path.begin, m_entry->path.end);
  }
  string_type get_drive() const
  {
    STLSOFT_ASSERT(NULL != m_entry);
    return string_type(m_entry->path.begin, m_entry->directory.begin);
  }
  string_type get_directory_path() const
  {
    STLSOFT_ASSERT(NULL != m_entry);
    return string_type(m_entry->path.begin, m_entry->directory.end);
  }
  string_type get_directory() const
  {
    STLSOFT_ASSERT(NULL != m_entry);
    return string_type(m_entry->directory.begin, m_entry->directory.end);
  }
  string_type get_file() const
  {
    STLSOFT_ASSERT(NULL != m_entry);
    return string_type(m_entry->fileName.begin, m_entry->fileExt.end);
  }
  string_type get_file_name() const
  {
    STLSOFT_ASSERT(NULL != m_entry);
    return string_type(m_entry->fileName.begin, m_entry->fileName.end);
  }
  string_type get_file_extension() const
  {
    STLSOFT_ASSERT(NULL != m_entry);
    if(m_entry->fileExt.begin == m_entry->fileExt.end)
    {
      return string_type();
    }
    else
    {
      return string_type(m_entry->fileExt.begin - 1, m_entry->fileExt.end);
    }
  }
  . . .
private: /// Member Variables
  recls_entry_t m_entry;
};
 

Other mappings

I mentioned earlier that interoperability was a major motivator in choosing to provide a C API. In many cases, that's worked really well. For example, I've been able to rewrite the C++ interface for 1.9 with very little concern for changes in the core API between 1.8 and 1.9. The COM mapping was similarly implemented with very little difficulty against the core API; the fact that, in hindsight, I think the COM mapping implementation stinks is immaterial. I'm also pretty happy with the Python and Ruby mappings, although both will definitely benefit from a brush up when I update them to 1.9.

There have been problems with the model, however. First, the rather mundane issue that, being all in one distribution, every time I update, say, the Ruby mapping, I have to release the entire suite of core library and all mappings. This is just painful, and also muddies the waters for users of a subset of the facilities.

Second, and more significantly, with some languages the advantage of not having to reproduce the non-trivial search logic is outweighed by the hassles attendant on writing and maintaining the mapping code, and on distributing the resulting software. The clearest example of this is the .NET mapping. As well as the tiresome P/Invoke issues, a C# mapping requires an underlying C library to be packaged in a separate DLL. On top of the obvious issues for .NET security, the underlying DLL has to be managed manually, and one finds oneself still in 'DLL Hell'. (That's the classical version of DLL Hell, not the newer and often more vexing .NET-specific DLL Hell; but that's another story.) As a consequence of these factors, I spent some time last year rewriting recls for .NET from scratch, entirely in C#, in part necessitated by some commercial activities. The result, called recls 100% .NET [RECLS-100%-NET], was documented in an article I wrote for DDJ late last year [DDJ-RECLS-BLOG]. I may do other rewrites in the future, depending on how well version 1.9 plays with the other language mappings.

Quality assurance

If you remember back to [QM#2], when we cannot prove correctness we must rely on gathering evidence for robustness. A library like recls, with admittedly questionable robustness in the core implementation, positively cries out for us to do so.

To hand, we have (removable) diagnostic measures and/or applied assurance measures ([QM#1]). To save you scrabbling through back issues, I'll reiterate the lists now. (Removable) diagnostic measures can include:

Applied assurance measures can include:

Most/all of these can help us with a library like recls to reach a point of confidence at which we can 'adjudge [it] to behave according to the expectations of its stakeholders' [QM#2].

First, I'll discuss the items to which the library has been subjected in the past:

On reflection, this is not a bad list, and I guess it helps to explain why recls has become the pretty reliable thing it's been for the last 6+ years. As Steve McConnell says 'Successful quality assurance programs use several different techniques to detect different kinds of errors' [CC].

Nonetheless, the coverage is incomplete, occasional defects still occur, and I remain unsure about the behaviour of significant parts of the software under a range of conditions. More needs to be done.

Several measures either have not been used before, or have been used in a limited fashion. The two I believe are now most important are:

Diagnostic Logging

I hope you've noticed that many of my libraries work together without actually being coupled to each other. b64, FastFormat, Pantheios [PAN], recls, and others work together without having any knowledge of each other. A major reason for this is that they all represent strings as an abstract concept, namely string access shims [XSTL, FF-2, IC++]. But that's only a part of it. I think modularity is a huge part of the negative decision-making process of programmers - coupling brings hassle - so much so that I'll be devoting a whole instalment to the subject later this year.

The problem with working with any orthogonal layer of software service such as diagnostic logging, or indeed with any other software component, is that it is a design-time decision that imposes code-time, build-time and, in many cases, deployment-time consequences. Adding diagnostic logging to recls would be extremely easy to do by implementing in terms of Pantheios, which is a robust, efficient and flexible logging API library, as in:

  RECLS_API Recls_Stat(
    recls_char_t const* path
  , recls_uint32_t      flags
  , recls_entry_t*      phEntry
  )
  {
    pan::log_DEBUG("Recls_Stat(path=", path
      , ", flags="
      , pan::integer(flags, pan::fmt::fullHex)
      , ", ...)");

The costs of converting flags to a (hexadecimal) string, combining all the string fragments into a single statement, and emitting to the output stream would be paid only if the DEBUG level is enabled; otherwise there's effectively zero cost, on the order of a handful of processor cycles.
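The pay-only-when-enabled principle can be sketched in plain C. This is a minimal illustration, not how Pantheios is implemented: the severity values, the threshold variable, and the names is_enabled() and log_printf() are all hypothetical. The point is simply that the disabled path costs one integer comparison, and that formatting and output happen only after that test passes.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical severity levels: a lower value means more severe. */
enum { SEV_ERROR = 3, SEV_DEBUG = 7 };

static int  g_threshold   = SEV_ERROR; /* most verbose level currently enabled */
static int  g_formatCalls = 0;         /* how many times formatting was really done */
static char g_lastMessage[256];        /* the most recently formatted statement */

/* The cheap test: a single integer comparison. */
static int is_enabled(int severity)
{
    return severity <= g_threshold;
}

/* Formats and emits the statement only if the severity is enabled. */
static void log_printf(int severity, char const* fmt, ...)
{
    va_list args;

    assert(NULL != fmt);

    if (!is_enabled(severity))
    {
        return; /* disabled: no formatting, no output */
    }

    ++g_formatCalls;
    va_start(args, fmt);
    vsnprintf(g_lastMessage, sizeof(g_lastMessage), fmt, args);
    va_end(args);
    fprintf(stderr, "%s\n", g_lastMessage);
}
```

With the threshold left at SEV_ERROR, a call such as log_printf(SEV_DEBUG, "flags=0x%08x", flags) returns before any formatting is attempted; raising the threshold to SEV_DEBUG makes the same call format and emit the statement.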

Sounds great. The only problem with that is that building and using recls would involve one of two things:

There's the further issue that users may already have their own logging libraries, and prefer to use them rather than Pantheios. (<vainglory>Ok, I'm playing devil's advocate here, since who could imagine such a situation!</vainglory> But the general point stands.)

I think the answer is rather to allow a user to opt in to a diagnostic logging library if they choose. In C, the only ways to do this are:

I've opted for the second approach. Version 1.9 introduces the new API function Recls_SetApiLogFunction():

  typedef void (RECLS_CALLCONV_DEFAULT* recls_log_pfn_t)(
    int                 severity
  , recls_char_t const* fmt
  , va_list             args
  );

  struct recls_log_severities_t
  {
    /** An array of severities, ranked as follows:
      * - [0] - Fatal condition
      * - [1] - Error condition
      * - [2] - Warning condition
      * - [3] - Informational condition
      * - [4] - Debug0 condition
      * - [5] - Debug1 condition
      * - [6] - Debug2 condition
      * Specifying an element with a value <0
        disables logging for that severity.
      */
    int severities[7];
  #ifdef __cplusplus
    . . . // ctors
  #endif
  };

  RECLS_FNDECL(void) Recls_SetApiLogFunction(
    recls_log_pfn_t               pfn
  , int                           flags
  , recls_log_severities_t const* severities
  );

With this, the user can specify a log function, and an optional list of severity translations. By default, the severity translations are those compatible with Pantheios. And recls_log_pfn_t just so happens to have the same signature as pantheios_logvprintf(), the Pantheios (C) API function. But nothing within recls depends on, or knows anything about, Pantheios, so there's no coupling. You can just as easily define your own API logging function.
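To make the opt-in mechanism concrete, here is a minimal sketch of the pattern using stand-in declarations. Everything prefixed demo_ is hypothetical and is not part of recls; only the parameter shape of the callback (severity, format string, va_list) and the seven-element severity table are taken from the declarations above. The default severity values used here are arbitrary, and are not the defaults recls uses.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins that mirror the shape of the recls declarations. */
typedef void (*demo_log_pfn_t)(int severity, char const* fmt, va_list args);
typedef struct demo_log_severities_t { int severities[7]; } demo_log_severities_t;

/* "Library" state: no log function until the user opts in. */
static demo_log_pfn_t        s_logFn      = NULL;
static demo_log_severities_t s_severities = { { 0, 1, 2, 3, 4, 5, 6 } }; /* arbitrary */

/* Registers the user's function and, optionally, a severity translation table. */
static void demo_set_log_function(demo_log_pfn_t pfn, demo_log_severities_t const* severities)
{
    s_logFn = pfn;
    if (NULL != severities)
    {
        s_severities = *severities;
    }
}

/* Library-internal helper: rank is an index, 0 (fatal) to 6 (debug2). */
static void demo_log(int rank, char const* fmt, ...)
{
    va_list args;

    if (NULL == s_logFn || rank < 0 || rank > 6 || s_severities.severities[rank] < 0)
    {
        return; /* not opted in, or this severity is disabled */
    }
    va_start(args, fmt);
    s_logFn(s_severities.severities[rank], fmt, args);
    va_end(args);
}

/* User side: any function with the matching signature will do. */
static int  g_lastSeverity = -1;
static char g_lastMessage[128];

static void my_log(int severity, char const* fmt, va_list args)
{
    g_lastSeverity = severity;
    vsnprintf(g_lastMessage, sizeof(g_lastMessage), fmt, args);
    fprintf(stderr, "[%d] %s\n", severity, g_lastMessage);
}
```

A user would call demo_set_log_function(my_log, &table) once at start-up. Until then the library-side demo_log() calls do nothing, and afterwards each statement arrives at my_log() carrying the user's own severity value, or is dropped if the table entry is negative.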

Code Coverage

Well, I hope you've made it this far, because this is the meat of this instalment. We're going to see some code coverage in action. I'll be using the xCover library [XCOVER], which I discussed in a CVu article in March 2009 [XCOVER-CVu]. As CVu online is available only to members, non-ACCU members should seriously think about joining this great organisation.

xCover works, for those compilers that support it (VC++ 7+, GCC 4.3+), by borrowing the non-standard __COUNTER__ pre-processor symbol in marking execution points, and using it to record the passage of the thread of execution through the different branches of the code. At a given sequence point, usually before program exit, the xCover library can be asked to report on which execution points have not been executed. In combination with an automated functional test, this can be used to indicate code which may be unused.
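The general technique can be sketched in a few lines of C. This is emphatically not xCover's implementation: the names MARK(), g_hit, classify() and report_uncovered() are hypothetical, and the fixed-size table is an assumption made for brevity. It shows only how __COUNTER__ can hand each marked execution point a unique index at compile time, so that the points never reached can be listed on request.

```c
#include <assert.h>
#include <stdio.h>

/* Value of the counter before the first mark, so that indexes start at 0
 * even if an included header has already used __COUNTER__. */
enum { MARK_BASE = __COUNTER__ };

#define MAX_MARKS 32
static unsigned char g_hit[MAX_MARKS]; /* 1 if the mark has been executed */

/* Each expansion receives the next index and records its execution. */
#define MARK() ((void)(g_hit[__COUNTER__ - MARK_BASE - 1] = 1))

/* Code under test: three branches, each with its own mark. */
static char const* classify(int n)
{
    if (n < 0)
    {
        MARK(); /* index 0 */
        return "negative";
    }
    if (0 == n)
    {
        MARK(); /* index 1 */
        return "zero";
    }
    MARK(); /* index 2 */
    return "positive";
}

/* Must follow all the marked code: the total number of marks. */
enum { MARK_COUNT = __COUNTER__ - MARK_BASE - 1 };
_Static_assert(MARK_COUNT <= MAX_MARKS, "too many marks for the table");

/* Lists every mark not yet executed, and returns how many there are. */
static int report_uncovered(FILE* out)
{
    int uncovered = 0;
    int i;

    for (i = 0; i < MARK_COUNT; ++i)
    {
        if (!g_hit[i])
        {
            fprintf(out, "Uncovered code at index %d\n", i);
            ++uncovered;
        }
    }
    return uncovered;
}
```

If a test calls only classify(5) and classify(-1), report_uncovered() lists index 1, which tells you that the zero branch has no test exercising it.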

Consider the test program in Listing 10, which exercises the functional aspects of the Recls_CombinePaths() API function. It's written in C, but the same principle applies to a C++ test program. (If you're interested, the functional testing is done with the xTests library [XTESTS], a simple C/C++ unit/component test library that I bundle with all my other open-source libraries.)

Listing 10

/* test.unit.api.combine_paths.c */
static void test_1(void);
static void test_2(void);
static void test_3(void);
static void test_4(void);
int main(int argc, char **argv)
{
  int retCode = EXIT_SUCCESS;
  int verbosity = 2;
  XTESTS_COMMANDLINE_PARSEVERBOSITY(argc, argv, &verbosity);
  if(XTESTS_START_RUNNER("test.unit.api.combine_paths", verbosity))
  {
    XTESTS_RUN_CASE(test_1);
    XTESTS_RUN_CASE(test_2);
    XTESTS_RUN_CASE(test_3);
    XTESTS_RUN_CASE(test_4);
#ifdef XCOVER_VER
    XCOVER_REPORT_GROUP_COVERAGE("recls.core.extended.combine_paths", NULL);
#endif /* XCOVER_VER */
    XTESTS_PRINT_RESULTS();
    XTESTS_END_RUNNER_UPDATE_EXITCODE(&retCode);
  }
  return retCode;
}
. . .
static void test_4(void)
{
  char    result[101];
  size_t  cch = Recls_CombinePaths("abc", "def", &result[0], STLSOFT_NUM_ELEMENTS(result));
  result[cch] = '\0';
  XTESTS_TEST_INTEGER_EQUAL(7u, cch);
#if defined(RECLS_PLATFORM_IS_UNIX)
  XTESTS_TEST_MULTIBYTE_STRING_EQUAL("abc/def", result);
#elif defined(RECLS_PLATFORM_IS_WINDOWS)
  XTESTS_TEST_MULTIBYTE_STRING_EQUAL("abc\\def", result);
#endif
}
 

XCOVER_REPORT_GROUP_COVERAGE() is the salient statement. This requests that xCover report on all the uncovered marked execution points pertaining to the group "recls.core.extended.combine_paths". This grouping is applied to those parts of the codebase associated with combining paths by using xCover constructs. In this way, you divide your codebase logically, in order to support code coverage testing in association with automated functional testing. (You can also request an overall coverage report, or reports by source file, from within smoke tests, or your running application, as you see fit. It's just that I prefer to associate it with automated functional testing.)

At the moment - and this is why 1.9 is not yet released - I haven't yet refactored the implementation files such that the various areas of functionality are properly separated. So, running the test program from Listing 10 with Visual C++ 9 as I write this, I get output along the lines of Figure 1.

Figure 1

        ..\..\bin\recls.1.test.unit.api.combine_paths.vc9.mt.exe --verbosity=2
[Start of group recls.core.extended.combine_paths]:
Uncovered code at index 6 in file ../../src/api.extended.cpp, between lines 88 and 483
Uncovered code at index 7 in file ../../src/api.extended.cpp, between lines 88 and 483
. . .
Uncovered code at index 35 in file ../../src/api.extended.cpp, between lines 88 and 483
Uncovered code at index 38 in file ../../src/api.extended.cpp, between lines 502 and 783
. . .
Uncovered code at index 67 in file ../../src/api.extended.cpp, between lines 502 and 783
[End of group recls.core.extended.combine_paths]:
 

All of these are false positives from other core functions defined in the same implementation file: the Recls_CombinePaths() function is fully covered by test.unit.api.combine_paths.c.

Obviously I've some way to go, and that'll probably also entail adding further refinements to the xCover library, to make this work easier. When it's all nicely boxed off, I'll do a proper tutorial instalment about combining code coverage and automated functional testing. Despite the in-progress nature of the technology, I hope you get the clear point that the two techniques - code coverage analysis and automated functional testing - are a great partnership in applied quality assurance. The functional testing makes sure that whatever you test behaves correctly, and the code coverage analysis makes sure that everything of relevance is tested.

Such things are, as we all know, trivially simple to achieve in other languages (e.g. C#, Java). But despite being harder in C++, they are possible, and we should work towards using them whenever it's worth the effort, as it (almost always) is with a general-purpose open-source library.

Summary

I've examined a well-established open-source library, recls, and criticised it in terms of intrinsic quality characteristics, for the core API, core implementation, and the C++ mapping. Where it has come up short I have made adjustments in the forthcoming version 1.9 release, or have identified improvements to be made in subsequent versions.

I have examined the suite of (removable) diagnostic measures and applied assurance measures, and have reported on the ongoing work to refine code coverage analysis, in combination with automated functional testing, in the recls library. I'll revisit this work in this forum when it is mature.

Acknowledgements

As always, my friend Garth Lancaster has kindly given of his time to read this at the end of a long working week just before my deadline, without complaint (to my manners) and with salient criticisms (of my writing). He does want me to point out that 'How embarrassment?' is a playful part of the Australian vernacular, originating from a comedy show, and is not representative of an endemic poor standard of grammar.

I must also thank, and apologise to, not only Ric Parkin, as editor, but also all his band of reviewers, as I've really pushed them to the wire with my shocking lateness this time. Perhaps Ric will henceforth borrow some wisdom from my wife, and start artificially bringing due dates and times forward in order to effect a magical eleventh hour delivery with time to spare.

References

[!C^C++] !(C ^ C++), Matthew Wilson, CVu, November 2008

[CC] Code Complete, 2nd Edition, Steve McConnell, Microsoft Press, 2004

[DDJ-RECLS-BLOG] Recursive Search Examples, pt 2: C

[FF-1] An Introduction to FastFormat, part 1: The State of the Art, Matthew Wilson, Overload 89, February 2009

[FF-2] An Introduction to FastFormat, part 2: Custom Argument and Sink Types, Matthew Wilson, Overload 90, April 2009

[FF-3] An Introduction to FastFormat, part 3: Solving Real Problems, Quickly, Matthew Wilson, Overload 91, June 2009

[GOF] Design Patterns, Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides, Addison-Wesley, 1994

[IC++] Imperfect C++, Matthew Wilson, Addison-Wesley, 2004

[PAN] http://pantheios.org/

[QM#1] Quality Matters, Part 1: Introductions, and Nomenclature, Matthew Wilson, Overload 92, August 2009

[QM#2] Quality Matters, Part 2: Correctness, Robustness and Reliability, Matthew Wilson, Overload 93, October 2009

[QM#3] Quality Matters, Part 3: A Case Study in Quality, Matthew Wilson, Overload 94, December 2009

[RECLS] http://recls.org/

[RECLS100.NET] recls 100% .NET, Matthew Wilson, Dr Dobb's Journal, November 2009

[STLSOFT] The STLSoft libraries are a collection of (mostly well written, mostly badly documented) C and C++, 100% header-only, thin façades and STL extensions that are used in much of my commercial and open-source programming; available from http://stlsoft.org/

[UNIXem] A simple UNIX emulation library for Windows; available from http://www.synesis.com.au/software/unixem.html

[WINE] http://www.winehq.org/

[XCOVER] http://xcover.org/

[XCOVER-CVu] xCover: Code Coverage for C/C++, Matthew Wilson, CVu, March 2009

[XSTL] Extended STL, volume 1: Collections and Iterators, Matthew Wilson, Addison-Wesley, 2007

[XTESTS] http://xtests.org/