rebol : COMBINE: an alternative to REJOIN for Rebol/Red

Date: 13-May-2014/4:55:37-4:00

Characters: (none)

The REJOIN primitive in Rebol and Red has never pleased me, and I've been trying to either change its functionality or adjust the name for a while. Taking my ideas and incorporating some feedback from others, I've focused on something called COMBINE instead, and leaving REJOIN alone.

The effort has turned out so far to be fruitful, in my opinion. Yet to understand why COMBINE is "good", and REJOIN is "bad" (at least for many casual applications), you first have to understand what REJOIN is. It was meant to encapsulate two operations: joining and reducing.

Joining

At first joining will seem like the easier of the two concepts to get through. JOIN takes two parameters and produces a new result by joining them together.

>> join "foo" 10
== "foo10"

But JOIN hides a rather confusing matrix of behavior, because as a generic operation it can take any data type as a first parameter...and any as a second parameter. Some of the results can seem pretty sensible:

>> join 10 20
== "1020"

One might think that should be able to create a number by joining the digits, and not be surprised if it gave back the number 1020 instead of a string representation. But it's reasonable to imagine that it doesn't. Yet what would happen if you joined a value from the <tag> subclass of string with another tag?

>> join <span> </span>
== <span</span>>

Er... "WAT?" (as the saying goes). Or how about if you just want to join a couple of blocks together?

>> join [a b c] [d]      
** Script error: d has no value
** Where: reduce apply repend join
** Near: reduce :value part length only dup count

Sure...it's true that d is being manipulated as a symbol and has no value. But neither did a, b, or c. APPEND seems quite capable of joining blocks containing arbitrary symbols:

>> append [a b c] [d]     
== [a b c d]

That probably makes you wonder what the semantic difference between "joining" things and "appending" them is, anyway. The words seem awfully similar. For comparison:

>> append "foo" 10
== "foo10"

>> append 10 20
** Script error: append does not allow integer! for its series argument

>> append <span> </span>
== <span</span>>
]

It actually turns out that the behavior of JOIN is defined fairly plainly in terms of APPEND. Yet despite their close relationship as functions that take two arguments and moosh things together, JOIN is designed for certain conveniences.

First convenience: JOIN does not modify the series you are joining with.

Here's APPEND:
```
>> foo: [10 20 30]

>> append foo 40
== [10 20 30 40]

>> probe foo
== [10 20 30 40]
```
Now, here's JOIN:
```
>> bar: [10 20 30]

>> join bar 40
== [10 20 30 40]

>> probe bar
== [10 20 30]
```
The modification distinction is a shade of meaning that the words don't actually carry in English. But now I've told you it's there--so you know. While some definitions that come in the box are up for grabs, this one apparently isn't...you can call them whatever you like on your own system.
Second convenience: if you hand JOIN a value type as its first parameter that isn't a series type APPEND could understand, it will convert it to a human-readable string.

It uses a function called FORM to do this. If you use this behavior you become somewhat tied in to the hardcoded behavior of FORM, and you will not (yet) find a detailed ISO spec of what it does. As with many languages whose "stringification" is implementation-defined, if you have specific needs of (for instance) how a date is going to be rendered as a string that FORM can't (or won't) define, then JOINing with a date value won't serve your purposes either.
Third convenience: if the second parameter to JOIN is a block, it will be reduced before the append is performed.

Understanding the third convenience will require knowing what reducing is, so let's go ahead and tackle that.

Reducing

To "reduce" a block of data in the Rebol vernacular is to take it through an evaluation step, calling functions and substitions on any value that isn't of an "irreducible type" to the interpreter. So:

>> reduce [10 + 20]
== [30]

If you didn't have the addition operator in there, there would have been nothing (from REDUCE's point of view) to do.

>> reduce [10 20]
== [10 20]

It's at the foundation of understanding the logic of Rebol languages to know what value types are "reducible" and which are "irreducible". The design decisions behind this are what drive the entire trick. It's how a homoiconic data structure has managed to dress up so convincingly (at times) as a normal-looking language, while easily bending to do more creative things.

The "dead" value types are things like integers, strings (including string subclasses like <tags>, binary data blobs, coordinate pairs, and (interestingly) nested ordinary blocks:

>> reduce ["Nothing happening!" </nada> 10x20 #{DECAFBAD} [10 + 20]]
== ["Nothing happening!" </nada> 10x20 #{DECAFBAD} [10 + 20]]

"Live" value types are things like symbolic words, as well as more unusual block types like (parentheses style blocks) and path/style/blocks.

>> foo: "Signs of Life!"

>> bar: function [] [
       return quote foo
   ]

>> baz: object [x: 10 y: 20]

>> reduce [foo bar baz/x baz/y (30 + 40)]
== ["Signs of Life!" foo 10 20 70]

Note It's always important to reiterate that Rebol's foundational tools like reduce represent only one way of evaluating the underlying data structure. Being able to build your own mini-languages that do neat things, leveraging these as your building blocks, is the real point.

We see that the reduce operator performs only a single step. Because invoking the bar function returned a quoted symbolic word as a result, that symbolic word is in the result. It would take another reduce step to get it to a state that isn't further reducible.

>> reduce ["Signs of Life!" foo 10 20 70]
== ["Signs of Life!" "Signs of Life!" 10 20 70]

This ability to intermingle "live" and "dead" elements in a single block for processing is actually leveraged by many Rebol primitives, such as print

>> a: 10 b: 20

>> print ["a is" a "and b is" b "sum is" a + b]
a is 10 and b is 20 sum is 30

When you start thinking of reducing as a "service" and know what it will touch (and what it won't) you can start building dialects with confidence. JOIN uses the service, but only on the second parameter. That's why you can do nice things like this:

>> domain: http://blog.hostilefork.com

>> slug: "combine-alternative-rebol-red-rejoin"

>> join domain ["/" slug "/"]
== http://blog.hostilefork.com/combine-alternative-rebol-red-rejoin/

If you were using APPEND and didn't want to corrupt your domain to produce the derived URL from it, you'd have to copy. And to get the substitution for the post slug you'd need to explicitly call reduce. Even a simple example like this becomes less pleasing quickly:

>> append (copy domain) reduce ["/" slug "/"]
== http://blog.hostilefork.com/combine-alternative-rebol-red-rejoin/

A reasonable question would be that if implicit reduction is so convenient for the second argument, then why isn't it done to the first argument as well? The reason is that since JOIN is essentially a convenience function layered over APPEND, it winds up being used in a similar situation. The block as a first parameter represents "finalized" results, and you are only concerned about doing calculations on the new material being added to the tail.

Hearing all this, you may be convinced that JOIN is a plausible reaction to how to make the lower-level APPEND more convenient. Now you are ready to cast a critical eye on something that's definitely more "WAT?"-worthy...

"Rejoining" (?)

When I discovered rejoin, I did not know about the existence of join. REJOIN seemed to be the go-to answer for how people would concatenate strings. It seemed like a funny name (was there an unjoin somewhere?) and I didn't like it much.

I understood that it took one block parameter, that it would reduce and join the components together to make a string. It was a lot like what you could get out of print, except that print would throw in spaces between elements I often did not want. I wound up writing a lot of print rejoin [...], taking care of spaces manually so I could have control. You would need that control if you wanted to throw commas in after evaluations with no space:

>> print rejoin [
    {a is} space a 
    {,} space
    {and b is} space b
    {,} space
    {sum is} space a + b
    newline
   ]
a is 10, and b is 20, sum is 30

I'd actually go to the level of using the full word space to refer to the character constant, instead of embedding it awkwardly into the surrounding strings. At first it wound up being more verbose, but it drew attention to patterns. Those patterns could be used to build clear abstractions, which usually wound up being not only briefer but more generally correct.

Thus print throwing in line feeds and spaces was good for debug output, but a lot of formatted output needs to be specific. To the extent that I knew a join existed it seemed useless. How often would you want to join a list of irreducible values without reducing them first? Not very often--you probably would have entered them joined to begin with. And that's why I used print rejoin a lot and didn't understand why string concatenation had such a foreign name.

Note @DocKimbel has pointed out, "rejoindre" is the word for join in French. That makes it less surprising that he isn't bothered by it. :-/

But rejoin is not a string concatenation function. Like join it's kind of an "anything concatenation function", and inherits all of the "WAT?" properties that come with that:

>> rejoin ["Hello" <span> "World" </span>]
== "Hello<span>World</span>"

>> rejoin [<span> "Hello" </span> "World"]
== <spanHello</span>World>

>> rejoin [[1 2] 3 4]
== [1 2 3 4]

>> rejoin [1 [2 3] 4]  
== "12 34"

To make a long story short, what you get is a result that is strongly influenced by the first element in the block. You are effectively getting a JOIN that reduces both of its parameters instead of just the second, but the first parameter is merely the result of the first evaluation in the block.

This means that if you want to force the result type to get a more predictable result, you can do that by putting in an empty first element of your target type. So if you want a string output regardless of the elements being joined, you could put a string literal at the head. That could correct the wild output from the rejoin starting with a <span> tag above:

>> rejoin ["" <span> "Hello" </span> "World"]
== "<span>Hello</span>World"

Perhaps you think this operation is highly useful, or perhaps an aesthetic monstrosity. Whether it has a good or bad name for what it does as related to join, there has been strong resistance to eliminating it or tinkering with the names.

For my own part, I will say that though I still don't feel great about the names, I came to soften my anti-JOIN stance once I learned what it did. But in the case of rejoin, there is a newly named primitive that can replace it in essentially every case I'd care about...

A New Hope: COMBINE

Based on a pattern of needs that rejoin had not been meeting, I kicked off a proposal currently going under the name "combine". It's been evolving with some input from others.

The strongest impetus behind combine came from wanting to remedy how hard it was to put expressions that may result in none into rejoin. This pattern was constantly coming up:

>> rejoin ["abc" if 1 > 2 ["def"] "ghi"]
== "abcnoneghi"

Hence if and unless could basically never be used. Suppressing the string conversion of none required either:

>> rejoin ["abc" either 1 > 2 ["def"] [{}] "ghi"]
== "abcghi"

Another frustration for me was how the outermost block of a rejoin behaved differently from an inner block:

>> rejoin ["abc" "def" ["ghi" "jkl"] "mno" "pqr"]
== "abcdefghi jklmnopqr"

Remember that my first reason for using rejoin was to get more control over the spacing. This behavior creates the awkward need to request using no-space semantics, when that's already the mindset I'm in:

>> rejoin ["abc" "def" rejoin ["ghi" "jkl"] "mno" "pqr"]
== "abcdefghijklmnopqr"

Correcting those two behaviors seemed like enough at first. But thinking more about it, I felt that determining the type from the first non-none element was a mistake...and combine should be used to make strings only.

To try and mitigate performance concerns of doing type conversions between other target string types, it's possible to add refinements or allow COMBINE/INTO to target any string subclass with the result. But with a fresh start and opportunity to avoid the WAT--and REJOIN still around for anyone who believes they need it--its fickleness shouldn't be carried forward.

Going even further in realizing what I needed, I didn't just want blocks that were combined without spaces. I wanted a recursive evaluation...so that if an evaluation produced a block, that block would be combined as well. I'd liken this desire to wanting to have the compositional ability in parse rules to pull out frequent patterns; something I frequently wish to do.

Note As designed, this has the comfortable property that combine [x y] gets the same results as combine [combine x combine y].

I also wanted protections against converting things like word results into strings, which rejoin happily does. My own experiences had shown that getting into situations like this were usually mistakes:

>> rejoin ["abc" quote def "ghi"]
== "abcdefghi"

So COMBINE will error in this case. If an evaluation bottomed out to something like a WORD!, SET-WORD! or LIT-WORD!...and you really meant that to happen and form it in the middle of your string, it could be controlled with an /ANY refinement or similar. It's also be theoretically possible to keep running the evaluator to see if the def word resolved (as with the blocks), but it seems better to just have it be an error.

A suggestion from @rgchris was to provide a /WITH refinement to put delimiters in-between the elements being combined. In my sample I made it accept a string, a character, or a block that it would run combine on. My reasoning for accepting the block is that I like to have ways to call more attention to spaces in strings:

>> combine/with ["abc" ["def" "ghi"] "jkl"] ["," space]
== "abc, def, ghi, jkl"

Initially I thought that it was bad to apply the delimiter to nested blocks. Because in cases where you didn't want that, it would be too hard to remove them. It seemed easier to call a nested COMBINE than to take them out! But subsequent design discussions suggested being able to take a function that was aware of its depth:

>> delim: function [depth] [either/only depth = 1 ["," space] none]

>> combine/with ["abc" ["def" "ghi"] "jkl"] delim
== "abc, defghi, jkl"

It occurred to me that the wish to override the delimiter was so common in practice that using an out-of-band type, such as a refinement or issue, could signal that you didn't want the delimiter to be applied. As something brief and useful, I thought /+ might be a good pick. So that's what I am currently using to see how it works out.

>> combine/with ["abc" ["def" /+ "ghi"] "jkl"] ["," space]
== "abc, defghi, jkl"

Note I first tried using /-, but decided it looked too much like the tab escape sequence of ^-. So intead of thinking of it as "subtract the /WITH delimiter" I recast it as "do a concatenation before the COMBINE.".

Empty Block Handling

One edge case is what to do with an empty block or all none. At the moment this returns an empty string:

>> combine [none (if (1 > 2) "uh-oh") none]
== ""

Initially I'd hoped to distinguish this case by returning none. However, my prototype implementation has an /INTO refinement...and since that returns the insertion position in another series there would be no way to distinguish between an empty result and a none result. Technically speaking, I'm not really sure how often someone performing a combine would really need to distinguish the case of gluing together strings which turned out to be empty vs. nones.

Using as a basis for PRINT

One application that I decided would be important would be to use COMBINE/WITH a SPACE as the behavior of PRINT when it was passed a block. That lets you create small bits of "print DNA" in blocks, just as you can break down parse rules into sub-rules.

>> whose: ["whose name is" name]

>> know: ["I know a" type whose]

>> type: "dog" name: "Wesly"

>> print ["Statement:" know /+ "."]
Statement: I know a dog whose name is Wesly.

>> type: "language" name: "Red"

>> print [know "and it's really cool!"]
I know a language whose name is Red and it's really cool!

The recursive nature of the evaluation offers a lot of possibilities, and the ability to suppress spaces with /+ gives PRINT a much-needed ability. Not only is the common case covered cleanly, but the need to break out into REJOIN (or even COMBINE) is almost entirely eliminated.

COMBINE in Draem as a case study

I now have a non-trivial case study for its usage: the fledgling static website builder behind this site, called Draem. It continues to be an experiment...although little by little it is becoming less ad-hoc. Feel welcome to take a look through the sources, especially for htmlify.reb and make-templates.reb.

Note Draem targets Django templated HTML and not raw HTML at this point in time. Though the template substitutions are trivial; it could be replaced. I just happened to have other Django projects running so it was the shortest-path way to manage URL routing, basically.

The conversion to use combine solved several bugs in Draem, and the code looks better. I found myself experimenting with things like IF/ONLY in order to avoid needing to nest blocks. A sample combine:

combine [
    if/only main-tag [
        <li>
            link-to-tag main-tag
        </li>
    ]
    <li>
        <span>
            entry/header/title
        </span>
    </li>
]

Probably a matter of preference as to whether one thinks that reads better as:

combine [
    if main-tag [
        [
            <li>
                link-to-tag main-tag
            </li>
        ]
    ]
    <li>
        <span>
            entry/header/title
        </span>
    </li>
]

It's a new pattern to think about. May not be for everyone, but in my programming style I like how it tightens up the indentation.

We might observe that the rejoin equivalent--not even particularly bad in this case--is still a headache:

rejoin [
    ""
    either main-tag [               
        rejoin [
            ""
            <li>
                link-to-tag main-tag
            </li>
        ]
    ] [
        {}
    ]
    <li>
        <span>
            entry/header/title
        </span>
    </li>
]

I got a bit of a surprise when I tried to convert some append-based tricks to use map-each and build blocks to combine. So for example:

>> combine [map-each x [1 2] [[x]]]
== "22"

It's not confusing that it doesn't work. What's confusing is why when it got [[x] [x]] back it didn't error during the combine by saying x isn't defined. :-/

Suggestions?...

Visit the CureCode issue #2142 to offer your thoughts. Also: remember that this blog entry is in GitHub as dialected Rebol. So if you have suggested edits or ways of making this more clear, that's another avenue...send a pull request!

Setting up an Ubuntu Cloud Server

Coding on the Osborne One with BASIC FUN