Wednesday, February 28, 2018

Struggles of a Rust Novice - Annoyances and Stumbling Blocks

Over the past few weekends, I've been spending time investigating more of the Rust language. Along with the web trifecta (HTML, CSS, JS) and OpenFrameworks, this is one of the things I've been meaning to spend some time learning/playing with this year.

It's a funny kind of language, as I've previously noted (during my very brief foray into some simple things last year). On one hand, a lot of the concepts seem nice/vaguely familiar to stuff I've played with in other languages, making it sometimes feel deceptively easy to use. But then, you go to compile the code, and proceed to spend the next 30-60 minutes trying to find a way to convert various things into the right types that the compiler insists you need. Admittedly, it's quite frustrating work at times (see notes on String handling below), but it pales in comparison to the black hole and deranged hell that is web-based dev (i.e. CSS-based Layouts in particular - aaaaaaargh!!! Despite now having "CSS Flow" and "CSS Grid", it seems that CSS Layouts and I still don't get along very much at all). Faced with a choice between the two (and having just experienced both back-to-back recently), I'd much rather face the Rust compiler any day.

Anyway, to actually get a proper feel for the language this time, I decided to use this opportunity to bash together a little processing tool that I was going to need for another one of the many projects I'm working on atm:
That is, a tool to take XSPF playlists generated by VLC (and containing a carefully sequenced list of music I've composed/recorded over the past year), extract the filenames (and process them to unpack/extract-out the important identifying bits of metadata I was embedding in the filenames), and then spit all this info out into a more easily consumable format (e.g. JSON).  
Sure, I could've done this all in Python (though it lacks an XML library that just gives me the DOM element tree with no "namespace" tag mangling, etc.), or a mix of VLC + Python (i.e. saving out an m3u list - 1 line per track - then processing that in Python), but this way was going to be more fun (and a good way to throw something a bit meatier/more realistic at Rust than the toy-examples I'd been playing with) :)


Hopefully the following post will be enlightening to some people out there - both other newbies stuck and struggling to get their heads around certain things they keep running into, but also for the Rust dev team in their self-professed "ergonomics drive" to lessen the humps that new devs face when learning Rust.


1) String Handling
I mentioned this last time, and I'm still ranting about it a year later, but IMHO, the string handling situation is absolutely nuts. There are something like 4-5 different types of strings you'll commonly encounter (and in particular, the first 2-3):
 * String
 * &str
 * &'static str
 * str   <--- Well, maybe not this one... It hardly ever comes up in any meaningful way, but I've included it for completeness
 * OsString

Coming from languages where there is only a single string type that you really need to care about (i.e. "str" in Python, "String" in Java, and "char *" in C - if you can call it a string-type :P), and where "string literals" do the expected thing and just get automatically converted/coerced into the appropriate types, the string-handling situation here is a bit bonkers. Perhaps it's not as bad for C++ people - after all, they have to deal with "std::string()" & ".c_str()" on top of good ol' "char *", let alone the likely prospect that each library they include ends up bringing along its own string wrapper class (with all the associated trouble that brings).


Perhaps part of the problem is that I'm not that familiar with this stuff yet. For instance, it often feels highly likely that I'm doing a lot more costly and unnecessary conversions to String than I need to, just to make things conceptually easier to handle (e.g. for storing in a struct, or for something to edit it down the line without worrying about who owns the actual character buffer being used). Then there are the conversions to &str's needing to be made every time I pass something to another function or try to match against some string literals (aka nearly all the time!)...
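For what it's worth, here's a small sketch of the cases where (as far as I can tell) no explicit conversion is needed at all: passing a borrowed String to a function that takes &str, and comparing a String against a literal with ==. The Track struct and the shout() helper are made up purely for illustration.

```rust
// Hypothetical struct that owns its text, so it stores a String
struct Track {
    title: String,
}

// Functions that only need to read text can take a &str
fn shout(text: &str) -> String {
    text.to_uppercase()
}

fn demo() -> (String, bool) {
    let track = Track {
        title: String::from("intro"),
    };

    // A &String is automatically coerced to &str at the call site
    let loud = shout(&track.title);

    // String implements comparison against &str, so == needs no conversion
    let is_intro = track.title == "intro";

    (loud, is_intro)
}
```

So the explicit conversions mostly seem to be needed for pattern matching (as in the match example that follows) and for turning literals into owned Strings.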

For example, consider the following piece of code:
1. fn parse_arguments() {
2.     let args : Vec<String> = env::args().collect();
3.     if let Some(mode) = args.get(1) {
4.         match mode.as_ref() {
5.             "foo" => println!("Foo"),
6.             "boo" => println!("Baaa"),
7.             _     => println!("Aaargh"),
8.        }
9.    }
10.}
Here we're trying to handle a command-line argument that may or may not be present - if it is present, we want to check its value to determine what we should do. It is complicated however by the fact that we're dealing with 3 different types of strings here:
  * String (line 2 - each item in that list is a String instance on the heap AFAIK)
  * &str   (line 4 - you need to convert the values to this form so that the matching will work)
  * &'static str (line 5, 6 - the LHS string literals are "static" strings with a persistent/long-lived lifetime - or something like that. They can be matched against &str but not String, because the latter is a Struct/wrapper, while the former is more of a comparable buffer type)

IMO, the most annoying part of this whole affair is that we have to do the ".as_ref()" thing. (On the other hand, it is slightly better than having to append ".to_string()" after each case label instead, which is even more stupid-looking, and which probably ends up being actually inefficient.)  Why is this such a pain?
    i)  AFAIK, std::string::String is a built-in part of the language, while static strings are an important part of life. We also have a really fancy auto-type-guessing engine, that manages to do a pretty swell job a lot of the time... except in this case, where you need to go through the trouble of manually doing all the type conversion stuff.  (This is all the more baffling since going the other way works perfectly fine in some cases)

   ii) If you're extracting the value-to-be-matched out of an Option/Result type (and let's face it, this is Rust, so you're likely to be doing that more often than not), then you cannot actually do the conversion at the same time as you're declaring that variable. That is, you can't do something like:
...                                                       
if let Some(mode).as_ref() = args.get(1) {                
  match mode {                                            
    ...                                                   
  }                                                       
}                                                         
 where you're declaring an unpacked value that also gets converted to the types you'd like to use it as. Instead, you'd have to separate out those steps.  This feels a bit like a let-down, when the destructuring stuff is so awesome/powerful in general.
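That said, one way to get close to this seems to be doing the conversion inside the Option before destructuring it, using Option::map with String::as_str. The sketch below is just that, a sketch: describe_mode() is a made-up name, and it takes the arguments as a slice (rather than calling env::args() itself) so that it's easy to try out.

```rust
// Returns a label for the mode given as the first real argument (index 1)
fn describe_mode(args: &[String]) -> &'static str {
    // args.get(1) is an Option<&String>; map() turns it into an Option<&str>,
    // which can then be matched directly against string literals
    match args.get(1).map(String::as_str) {
        Some("foo") => "Foo",
        Some("boo") => "Baaa",
        Some(_) => "Aaargh",
        None => "no mode given",
    }
}
```

In a real program, you'd collect env::args() into a Vec<String> as before, and pass a reference to that in. As a bonus, the "argument missing" case gets handled in the same match.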

----------------

Now, if you then run into OsString (to work with std::path::Path), you'll suddenly be faced with another pair of these conversions. Except, now, when you do these conversions, you have a few more in-between layers to unpack. This leads to the following ugly piece of code:
let extn_str = path.extension().unwrap()  /* get &OsStr (the borrowed form of OsString) */
                   .to_str().unwrap();    /* get &str - Need to unwrap the converted version */
Blegh! Just look at those two .unwrap()'s needed just to read a value that in any other language would just be a single step!  This is currently the best I can come up with for this - if you have any suggestions (e.g. using one of those .and_then() or similar constructions), I'd love to hear about them.
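For the record, here's a minimal sketch of how the .and_then() route might look. Option::and_then chains the two "might fail" steps into a single Option, so the caller only has one thing to check. The extension_of() wrapper is a name I've made up for illustration.

```rust
use std::ffi::OsStr;
use std::path::Path;

// Returns the file extension as a &str, or None if there is no extension
// or if it isn't valid UTF-8
fn extension_of(path: &Path) -> Option<&str> {
    // extension() gives Option<&OsStr>; OsStr::to_str gives Option<&str>
    path.extension().and_then(OsStr::to_str)
}
```

Unlike the double-unwrap version, this doesn't panic on files without an extension - you get None back and can decide what to do about it (e.g. with "if let" or ".unwrap_or()").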

2) The Module System
Just like with #1, I ranted about this a year ago, and was going to skip it this time (since I hadn't run into it much when I started writing this post). But then, I eventually ended up getting a compiler error that could only be solved by restructuring my code a bit to cope with these issues.

Basically, what I'm trying to do is to write a simple program that compiles to a single binary/executable, where the code is split up into different files/modules - just like I'd do in any other language. However, at some point, you're going to suddenly run into troubles when some of your code in one of the "modules" tries to call something from another module you've written, or worse, when it tries to use something from another Crate that someone else wrote (and which you've just imported into your project as a dependency via Cargo), yet all of these things would work perfectly fine if done in the src/main.rs file instead!

For example, here are some of the problems I encountered, and the "fixes" I applied to get them working. (Note: Knowing what I now know - read on below - you shouldn't really need to be doing these things)
  * Calling code from one module in another module - (e.g. "demo_runner.rs" needs to call get_input() from "terminal_utils.rs").  This was my first run-in with the weirdness of the module system.  The problem here was that up till that point, I'd been able to get away with doing "mod demo_runner" and "mod terminal_utils" at the top of "src/main.rs" to get the publicly exported stuff from those modules to be available for use in my code in "src/main.rs". So, I naturally assumed that the same approach would let me "import/include" those definitions when working in one of the module files. Wrong. I ended up making it work by changing the code in "demo_runner.rs" to use "use terminal_utils" instead of "mod terminal_utils".  Surprised and stumped, I tried changing the "src/main.rs" imports to "use" instead, only to be faced with a compile error... "Oh well", I thought, "Rust is just weird like that... :/"

 * Using functions/types defined in other crates (i.e. essential if you need to run Regex's, Random Numbers, etc.). Again, this initially seems to work fine, until it doesn't. For example, consider the following snippet that the Regex library provides:
extern crate regex;
use regex::Regex; 
To get this compiling in a module (bad way), you'd end up needing to fix the "use" lines by prepending "self::" to each path. That is, line 2 becomes:
use self::regex::Regex;
 * Using macros defined in other crates, within a module - (e.g. "#[macro_use] extern crate lazy_static;").  TBH, this is the one that finally stumped me completely, leading to the following discoveries...  AFAIK, there isn't actually any way to hack around these kinds of problems!



Thanks to this email (and looking at the approach used there), I finally figured out WTF was going on here. Here's my new understanding of the issue - something that the official Rust docs somehow don't make clear enough IMO ;)

First, the somewhat sane and easy to understand stuff (if you've used anything else). Hopefully I'm right about all this - it mostly seems to work something along these lines:
 * AFAIK, each <somename>.rs file corresponds with an implicit module called "somename", as you'd expect from Python or Java.
 * If you have a module called "my_module", you can have a sub-folder inside your Crate's folder with the very same name, containing a "mod.rs" file (i.e. equivalent to "__init__.py"), along with a bunch of .rs files that define "sub-modules for my_module".
* Within an arbitrary file, you can apparently use a "mod block" to define some stuff that goes into a particular module within your Crate.  I haven't gotten to the point of playing with or testing this out further, but from the sounds of things, this is a bit like your ability to add things to different namespaces from whatever file you're working in in C# (and also, C++ to a lesser degree). Basically, it's probably more useful for single-file tests of

Now, here's the real sticking point that's been leading to a lot of weird "import hacks" and/or other random problems trying to get code to compile:
 All code in Rust must/will reside within a Crate!

That's right, even if you're writing up a project with just a single file that gets translated to an executable (i.e. the src/main.rs default template), that file (and all the code it contains) is implicitly part of a Crate - albeit, an unnamed one. That's why you had to prepend "self::" to all the "use" statements when trying to import types from other Crates (as mentioned above).

The implications of this are both seemingly obvious, but also critically non-obvious to the point of causing massive headaches for days:
   - For libraries, this is easy to understand: you make a folder, chuck a bunch of source files inside that folder, and stick a "lib.rs" in there telling the world what the library exports from all the source files beside it.

    - For single-crate binaries (i.e. what I've been doing, and what practically every beginner will be doing initially), once you move beyond having just a single file (i.e. the "src/main.rs" main file and program entrypoint), you actually still need to define an "entrypoint" file (aka a modules-in-crate manifest in Rust-code) that declares what modules are available in the crate. (Apparently, it's also a good idea to do the external crate imports here too... it seems).

     It's not at all clear from the docs that this needs to happen, or what the file should be called (i.e. should you name it after your project/crate, or should you just use "lib.rs" on the basis that one day, you might well want to just turn the majority of the code into a library?).

EDIT:
Apparently, it seems that for "executable" crates, "src/main.rs" is the root file of the crate (while lib.rs is used for libraries).  See https://stackoverflow.com/a/39175997/6531515
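To make that a bit more concrete, here's a single-file sketch of the arrangement, using inline "mod" blocks as stand-ins for the separate files. In a real project, the crate root (src/main.rs) would instead contain "mod terminal_utils;" and "mod demo_runner;", with the block contents living in "src/terminal_utils.rs" and "src/demo_runner.rs". Some assumptions here: clean_input() and launch() are made-up helpers (a real get_input() would read from stdin), and the "crate::" path prefix assumes a toolchain from Rust 1.30 onwards.

```rust
// Stand-in for src/terminal_utils.rs (declared once, in the crate root)
mod terminal_utils {
    // Hypothetical helper; a real get_input() would read from stdin
    pub fn clean_input(raw: &str) -> String {
        raw.trim().to_lowercase()
    }
}

// Stand-in for src/demo_runner.rs (also declared once, in the crate root)
mod demo_runner {
    // No `mod terminal_utils;` here - the module already exists in the crate,
    // so we just bring it into scope by its path from the crate root
    use crate::terminal_utils;

    pub fn run(raw: &str) -> String {
        format!("running demo '{}'", terminal_utils::clean_input(raw))
    }
}

// Code in the crate root can reach both modules directly
pub fn launch(raw: &str) -> String {
    demo_runner::run(raw)
}
```

In other words: "mod" declares that a module exists (once, in the crate root or its parent module), while "use" just makes an already-declared thing nameable from wherever you are.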


3) Indexing Items in Strings/Arrays
There are perhaps 2 main annoyances here that I've encountered:
  i) You cannot use negative indexing to access stuff from the end of the array
 One of my favourite parts of Python's lists/slicing syntax is the ability to use negative indices to refer to stuff from the end of the list instead. Unfortunately, that doesn't work here, as list indices are usize's - that is, they are unsigned integers, which cannot have negative values!

 (Note: If you do try to use a negative value, it will spit out something about std::ops::Neg not being defined for that type - I wonder if it's possible to hack this in using that method?)

The downside of the current situation is that you're left doing some rather verbose things where you have to subtract values from the total length of the collection. Javascript also

  ii) Inconsistency over panic!() vs Option/Result types when accessing invalid indices
This one is a bit of a mixed bag. On one hand, I can see why they may have opted to make this panic instead of wrapping it up with Option/Result to indicate that it may fail (i.e. it's annoying enough that we sometimes have to handle these things in all the normal cases). On the other hand though, it feels a bit inconsistent that it behaves this way. Granted, there's the .get() function that does spit out Options.
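As a rough illustration of both points, here's a sketch of a helper that emulates Python-style "count from the end" indexing on top of .get(), so that out-of-range requests come back as None instead of panicking. The from_end() name and its 1-based convention (1 = last item, like Python's -1) are my own invention.

```rust
// Returns the n-th item counting from the end (1 = last item),
// or None if n is 0 or larger than the slice's length
fn from_end<T>(items: &[T], n: usize) -> Option<&T> {
    items
        .len()
        .checked_sub(n) // None if this would go "negative"
        .and_then(|index| items.get(index)) // None if index is out of range
}
```

For the simple cases, the standard library already has ".last()" and ".iter().rev()", which avoid the length arithmetic altogether.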


4) "Public" Structs/Members
It's annoying that there isn't any easy way to automatically just tag a struct and all its members as all being fully available for everyone outside a module (but perhaps in the same crate, or not) to see. Sometimes, that's the easiest way to do what you want to do, but now you need to go ahead and tag everything manually!

I know it's generally considered "bad" in OO circles, but sometimes, all you want is to be able to have plain "data" structs that you can share around all over and easily access the innards of without any fuss. Sure, it's "brittle" (cue standard argument about you wanting to change the representation of a certain datatype in the struct) - but TBH, if I'm going to be modifying that representation, the API will likely be changing too if I had a whole bunch of getters/setters for controlling access to that info anyway; thus, it's just more straightforward to allow full access, consequences be damned!

(Off topic rant: Encapsulation + Data-Hiding are seriously overrated IMHO.  People do it because the textbooks say it's a "god-lier" way to do it - yada yada about coupling, cohesion, separation of concerns, yada yada - BLEGH!  Personally, I've come to realise that in the end, it's all just busywork that people invented for themselves, that only serves to cause bloat and get in the way of doing stuff).

I wonder if it's possible to do this using some kind of struct decorators (much like the "#[derive(XYZ)]" decorator/tag-stuff I don't know the name for), macros, or Rust compiler extensions, where simply tagging the struct in a particular way would cause the struct and all its child members to get tagged with "pub". I guess you could do a macro like:
transparent_struct!(
      struct my_struct {
          member1 : String,
          member2 : i64
     }
);
That would inject the necessary stuff into the code definitions, as follows:
pub struct my_struct {
       pub member1: String,
       pub member2: i64,
}
Surely this is possible, right? I mean, people have already done similar things for templating HTML/CSS code from inside a Rust file (e.g. for having a React-like framework, just built using 100% Rust code), so something like this should in theory be doable (especially as we're just extending Rust syntax, and Rust's macros seem to work best when you're parsing and recombining standard constructs in novel ways).

Tips for libraries/ways that already do this are welcome!
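In the meantime, here's a minimal macro_rules! sketch of how such a macro might be put together. It's deliberately limited: it only handles plain named fields (no attributes, generics, or doc comments), and the "data" module and describe() function are just there to show the fields being reached from outside the defining module.

```rust
// Expands a plain struct definition into one where the struct
// and every field are marked `pub`
macro_rules! transparent_struct {
    (struct $name:ident { $($field:ident : $ty:ty),* $(,)* }) => {
        pub struct $name {
            $(pub $field: $ty),*
        }
    };
}

mod data {
    transparent_struct!(
        struct MyStruct {
            member1: String,
            member2: i64
        }
    );
}

// Outside the `data` module, all the fields are directly accessible
fn describe() -> String {
    let item = data::MyStruct {
        member1: String::from("Intro"),
        member2: 1,
    };
    format!("{}: {}", item.member2, item.member1)
}
```

Note that the macro has to be defined before the module that uses it, since macro_rules! macros are only visible to code that comes after them in the source.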


5) Partial Struct Initialisation
So, there are probably some "very good" reasons why initialising structs works the way it does now, but it's a bit of a pain that we can't just only fill out the struct values we know about, leaving everything else to get initialised to some sane-ish default values (e.g. 0 for all numbers, and empty/null values for all structs/options/refs/etc.). This is especially an issue with some larger structs...

For now, I guess the only solution is to do something like this:
pub struct MyStruct {
    pub member1: i64,
    pub member2: Option<bool>,
    pub member3: String,
}

impl MyStruct {
    // Internal-use only constructor to construct a "default" instance
    fn init_empty() -> Self {
        MyStruct {
            member1: 0,
            member2: None,
            member3: String::new(),
        }
    }

    pub fn new(name: &str, value: i64) -> Self {
        let mut ret = Self::init_empty();
        ret.member1 = value;
        // `name` is a borrowed &str, so copy it into an owned String
        ret.member3 = name.to_string();

        ret
    }
}
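An alternative that appears to cover most of this is deriving the standard Default trait and using the "struct update" syntax (..Default::default()) to fill in whatever fields you don't mention. A sketch using the same struct as above (the extra Debug derive is only there to make it easier to print/inspect):

```rust
// Default gives 0 for numbers, None for Options, and "" for Strings
#[derive(Debug, Default)]
pub struct MyStruct {
    pub member1: i64,
    pub member2: Option<bool>,
    pub member3: String,
}

impl MyStruct {
    pub fn new(name: &str, value: i64) -> Self {
        MyStruct {
            member1: value,
            member3: name.to_string(),
            // Every field not listed above comes from the default instance
            ..Default::default()
        }
    }
}
```

This only works when every field's type implements Default itself; otherwise, you'd need to write the "impl Default" by hand (which ends up looking a lot like init_empty() anyway).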

6) Extracting values from Option<>'s and Result<>'s
It's good that Rust places a premium on error handling, but sometimes, it's still a jolly good annoyance to deal with extracting the values out and getting them into the right types (#1) to do what you need. Case in point: See the OsString example above...
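One thing that seems to take some of the sting out of this is the "?" operator, which (as of Rust 1.22) also works on Options inside functions that return an Option. Here's a sketch, assuming a made-up filename convention where the track number comes first, followed by an underscore (e.g. "03_intro.flac"); track_number() is a hypothetical helper.

```rust
use std::path::Path;

// Extracts the leading track number from a filename like "03_intro.flac"
fn track_number(path: &Path) -> Option<u32> {
    // Each `?` returns None from the function early if that step fails
    let stem = path.file_stem()?.to_str()?;
    let prefix = stem.split('_').next()?;

    // parse() gives a Result; ok() converts it to an Option
    prefix.parse().ok()
}
```

The catch is that you need to wrap the extraction steps in a function with a suitable return type, so it doesn't help much for one-off extractions in the middle of main().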
