Back to library index.
Package std-string (in std.i) - string manipulation
Index of documented functions or symbols:
DOCUMENT strcase(upper, string_array) or strcase, upper, string_array returns STRING_ARRAY with all strings converted to upper case if UPPER is non-zero. If UPPER is zero, result is lower case. (For characters >=0x80, the case conversion assumes the ISO8859-1 character set.) Called as a subroutine, strcase converts STRING_ARRAY in place.
SEE ALSO: string, strlen, strpart, strglob, strfind, strgrep, strword
DOCUMENT strchar(string_array) or strchar(char_array) converts STRING_ARRAY to an array of characters, or CHAR_ARRAY to an array of strings. The return value is always a 1D array, except in the second form if CHAR_ARRAY contains only a single string, the result will be a scalar string. Each string is stored in sequence including its trailing '\0' character, with any string(0) elements treated as if they were "". Going in the opposite direction, a '\0' before any non-'\0' characters produces string(0), so that "" can never be an element of the result, and if the final char (of the leading dimension) is not '\0', an implicit '\0' is assumed beyond the end of the input char array. For example, strchar(["a","b","c"]) --> ['a','\0','b','\0','c','\0'] strchar([['a','\0','b'],['c','\0','\0']]) --> ["a","b","c",string(0)] The string and pointer data types themselves also convert between string and char data, avoiding the quirks of strchar.
DOCUMENT streplace(string_array, start_end, to_string) replaces the part(s) START_END of STRING_ARRAY by TO_STRING. The leading dimension of START_END must be a multiple of 2, while any trailing dimensions must be conformable with the dimensions of STRING_ARRAY. The TO_STRING must be conformable with STRING_ARRAY if the leading dimension of START_END is 2. An element of START_END may represent "no match" (for example, when end2, then TO_STRING must have a leading dimension conformable with n (that is, of length either 1 or n). In this case, streplace performs multiple replacements within each string. In order for multiple replacements to be meaningful, the START_END must be disjoint and sorted, as returned by strfind or strgrep with a repeat count, or by strword. In other words, the first dimension of START_END should be non-decreasing, except where end "Goodbye, world!" streplace(s,[0,5,7,7], ["Goodbye","cruel "]) --> "Goodbye, cruel world!" streplace(s,[0,5,7,7,12,13], ["Goodbye","cruel ","?"]) --> "Goodbye, cruel world?" streplace(s,[0,5,0,-1,12,13], ["Goodbye","cruel ","?"]) --> "Goodbye, world?" streplace([s,s],[0,5], ["Goodbye", "Good bye"]) --> ["Goodbye, world!", "Good bye, world!"] streplace([s,s],[0,5,7,7], [["Goodbye","cruel "], ["Good bye",""]]) --> ["Goodbye, cruel world!", "Good bye, world!"]
DOCUMENT strfind(pat, string_array) or strfind(pat, string_array, off) finds pattern PAT in STRING_ARRAY. Optional OFF is an integer array conformable with STRING_ARRAY or 0-origin offset(s) within the string(s) at which to begin the search(es). The return value is a [start,end] offset pair specifying the beginning and end of the first match, or [len,-1] if none, with trailing dimensions the same as the dimensions of STRING_ARRAY. This return value is suitable as an input to the strpart or streplace functions. The strfind function is the simpler string pattern matcher: strfind - just finds a literal pattern (possibly case insensitive) strgrep - matches a pattern containing complex regular expressions Additionally, the strglob function does filename wildcard matching. Keywords: n= (default 1) returns list of first n matches, so leading dimension of result will be 2*n case= (default 1) zero for case-insensitive search back= (default 0) non-zero for backwards search If back!=0 and n>1, the last match is listed as the last start-end pair, so the output pairs still appear in increasing order, and the first few may be 0,-1 to indicate no match. Examples: s = ["one two three", "four five six"] strfind("o",s) --> [[0,1], [1,2]] strfind(" t",s) --> [[3,5], [13,-1]] strfind(" t",s,n=2) --> [[3,5,7,9], [13,-1,13,-1]] strfind("e",s,n=2,back=1) --> [[11,12,12,13], [0,-1,8,9]]
SEE ALSO: string, strglob, strgrep, strword, strpart, streplace
DOCUMENT strglob(pat, string_array) or strglob(pat, string_array, off) test if pattern PAT matches STRING_ARRAY. Optional OFF is an integer array conformable with STRING_ARRAY or 0-origin offset(s) within the string(s) at which to begin the search(es). The return value is an int with the same dimensions as STRING_ARRAY, 1 for a match, and 0 for no match. PAT can contain UNIX shell wildcard or "globbing" characters: matches any number of characters ? matches any single character [abcd] matches any single character in the list, which may contain ranges such as [a-z0-9A-Z] \c matches the character c (useful for c= a special character) (note that this is "\\c" in a yorick string) The strglob function is mostly intended for matching lists of file names. Note, in particular, that unlike strfind or strgrep, the entire string must match PAT. Keywords: case= (default 1) zero for case-insensitive search path= (default 0) 1 bit set means / must be matched by / 2 bit set means leading . must be matched by . esc= (default 1) zero means \ is not treated as an escape The underlying compiled routine is based on the BSD fnmatch function, contributed by Guido van Rossum. Examples: return all files in current directory with .pdb extension: d=lsdir("."); d(where(strglob("*.pdb", d))); return all subdirectories of the form "hackNN", case insensitive: d=lsdir(".",1); d(where(strglob("hack[0-9][0-9]", d, case=0)));
SEE ALSO: string, strfind, strgrep, strword, strpart, streplace
DOCUMENT strgrep(pat, string_array) or strgrep(pat, string_array, off) finds pattern PAT in STRING_ARRAY. Optional OFF is an integer array conformable with STRING_ARRAY or 0-origin offset(s) within the string(s) at which to begin the search(es). The return value is a [start,end] offset pair specifying the beginning and end of the first match, or [len,-1] if none, with trailing dimensions the same as the dimensions of STRING_ARRAY. This return value is suitable as an input to the strpart or streplace functions. The underlying compiled routine is based on the regexp package written by Henry Spencer (copyright University of Toronto 1986), slightly modified for yorick. PAT is a regular expression, simliar to the UNIX grep utility. Every "regular expression" syntax is slightly different; here is the syntax supported by strgrep: The following characters in PAT have special meanings: '[' followed by any sequence of characters followed by ']' is a "range", which matches any single one of those characters '^' first means to match any character NOT one in the sequence '-' in such a sequence indicates a range of characters (e.g.- "[A-Za-z0-9_]" matches any alphanumeric character or underscore, while "[^A-Za-z0-9_]" matches anything else) to include ']' in the sequence, place it first, to include '-' in the sequence, place it first or last (or first after a leading '^' in either case) Note that the following special characters lose their special meanings inside a range. '.' matches any single character '^' matches the beginning of the string (but no characters) '$' matches the end of the string (but no characters) (that is, ^ and $ serve to anchor a search so that it will only find a match at the beginning or end of the string) '\' (that is, a single backslash, which can only be entered into a yorick string by a double backslash "\\") followed by any single character eliminates any special meaning for that character, for example "\\." matches period, rather than any single character (its special meaning) '(' followed by a regular expression followed by ')' matches the regular expression, creating a sub-pattern, which is a type of atom (see below) '|' means "or"; it separates branches in a regular expression '*' after an atom matches 0 or more matches of the atom '+' after an atom matches 1 or more matches of the atom '?' after an atom matches 0 or 1 matches of the atom The definitions of "atom", "branch", and "regular expression" are: A "regular expression" (which is what PAT is) consists of zero or more "branches" separated by '|'; it matches anything that matches one of the branches. A "branch" consists of zero or more "pieces", concatenated; it matches a match for the first followed by a match for the second, etc. A "piece" is an "atom", optionally followed by '*', '+', or '?'; it matches the atom, or zero or more repetitions of the atom, as specified by the optional suffix. Finally, an "atom" is an ordinary single character, or a '\'-escaped single character (matching that character), or one of the special characters '.', '^', or '$', or a []-delimited range (matching any single character in the range), or a sub-pattern enclosed in () (matching the sub-pattern). A maximum of nine sub-patterns is allowed in PAT; these are numbered 1 through 9, in order of their opening '(' in PAT. This recursive definition of regular expressions often leads to ambiguities, both subtle and glaring. Here is Henry Spencer's synopsis of how his routines behave: ------------------------------------------------------------------- If a regular expression could match two different parts of the input string, it will match the one which begins earliest. If both begin in the same place but match different lengths, or match the same length in different ways, life gets messier, as follows. In general, the possibilities in a list of branches are considered in left-to-right order, the possibilities for `*', `+', and `?' are considered longest-first, nested constructs are considered from the outermost in, and concatenated constructs are considered leftmost- first. The match that will be chosen is the one that uses the earliest possibility in the first choice that has to be made. If there is more than one choice, the next will be made in the same manner (earliest possibility) subject to the decision on the first choice. And so forth. For example, `(ab|a)b*c' could match `abc' in one of two ways. The first choice is between `ab' and `a'; since `ab' is earlier, and does lead to a successful overall match, it is chosen. Since the `b' is already spoken for, the `b*' must match its last possibility -the empty string- since it must respect the earlier choice. In the particular case where no `|'s are present and there is only one `*', `+', or `?', the net effect is that the longest possible match will be chosen. So `ab*', presented with `xabbbby', will match `abbbb'. Note that if `ab*' is tried against `xabyabbbz', it will match `ab' just after `x', due to the begins-earliest rule. (In effect, the decision on where to start the match is the first choice to be made, hence subsequent choices must respect it even if this leads them to less-preferred alternatives.) ------------------------------------------------------------------- When PAT contains parenthesized sub-patterns, strgrep returns the [start,end] of the entire match by default, but you can also get the [start,end] of any or all of the sub-patterns using the sub= keyword (see below). If PAT does not contain any regular expression constructs, you should use the strfind function instead of strgrep. The strglob function, if appropriate, will also be faster than strgrep. Keywords: n= (default 1) returns list of first n matches, so leading dimension of result will be 2*n sub=[n1,n2,...] is a list of the sub-pattern [start,end] pairs to be returned. Thus 0 is the whole PAT, 1 is the first parenthesized sub-pattern, and so on. The leading dimension of the result will be 2*numberof(sub)*n. The sequence n1,n2,... must strictly increase: n1[0,13] strgrep("(Hello|Goodbye), *([a-z]*|[A-Z]*)!", s, sub=[1,2]) --> [0,5,7,12] strgrep("(Hello|Goodbye), *([a-z]*|[A-Z]*)!", s, sub=[0,2]) --> [0,13,7,12] strgrep("(Hello|Goodbye), *(([A-Z]*)|([a-z]*))!", s, sub=[0,2,3,4]) --> [0,13,7,12,13,-1,7,12]
SEE ALSO: string, strglob, strfind, strword, strpart, streplace, strgrepm
DOCUMENT strgrepm(pat, string_array) or strgrepm(pat, string_array, off) call strgrep, but simply return mask of same dimensions as STRING_ARRAY set to 1 where it matches the PAT, and 0 where it does not match. The strgrepm function does not accept any of the strgrep keywords, but it does accept the strglob case= keyword to indicate a case insensitive search.
DOCUMENT string The yorick string datatype is a character string, e.g.- "Hello, world!". Internally, strings are stored as 0-terminated sequences of characters, which are 8-bit bytes, the same as the char datatype.. Like numeric datatypes, string behaves as a function to convert objects to the string datatype. There are only two interesting conversions: string(0) is the nil string, like a 0 pointer This is the only string which is "false" in an if test. string(pc) where pc is an array of type pointer where each pointer is either 0 or points to an array of type char, copies the chars into an array of strings, adding a trailing '\0' if necessary pointer(sa) where sa is an array of stringa is the inverse conversion, copying each string to an array of char (including the terminal '\0') and returning an array of pointers to them The strchar() function may be a more convenient way to convert from string to char and back. Yorick provides the following means of manipulating string variables: s+t when s and t are strings, + means concatentation (this is not perfect nomenclature, since t+s != s+t) s(,sum,..) the sum index range concatentates along a dimension of an array of strings sum(s) concatenates all the strings in an array (in storage order) strlen(s) returns length(s) of string(s) s strcase(upper, s) converts s to upper or lower case strchar(s_or_c) converts between string and char arrays (quick and dirty alternative to string<->pointer) strpart(s, m:n) strpart(s, sel) extracts substrings (sel is a [start,end] list) string search functions: strglob(pat, s) shell-like wildcard pattern match, returns 0 or 1 strword(s, delim) parses s into word(s), returns a sel strfind(pat, s) simple pattern match, returns a sel strgrep(pat, s) regular expression pattern match, returns a sel streplace(s, sel, t) replaces sel in s by t strtrim trims leading and/or trailing blanks (based on strword) strmatch is a wrapper for strfind that simply returns whether there was a match or not rather than its exact offset strtok is a variant of strword that calls strpart in order to return the substrings rather than an sel index list The strword, strfind, and strgrep functions produce a sel, that is, a list of [start,end] offsets into an array of strings. These sel indicate portions of a string to be operated on for the strpart and streplace functions. The sread, swrite, and print functions operate on or produce strings. The rdline, rdfile, read, and write functions perform I/O on strings to text files.
DOCUMENT strlen(string_array) returns an long array with dimsof(STRING_ARRAY) containing the lengths of the strings. Both string(0) and "" have length 0.
SEE ALSO: string, strchar, strcase, strpart, strfind, strword
DOCUMENT strmatch(string_array, pattern) or strmatch(string_array, pattern, case_fold) or strmatch(string_array, pattern, case_fold) returns an int array with dimsof(STRING_ARRAY) with 0 where PATTERN was not found in STRING_ARRAY and 1 where it was found. If CASE_FOLD is specified and non-0, the pattern match is insensitive to case, that is, an upper case letter will match the same lower case letter and vice-versa. (Consider using strfind directly.)
DOCUMENT strpart(string_array, m:n) or strpart(string_array, start_end) or strpart, string_array, start_end returns another string array with the same dimensions as STRING_ARRAY which consists of characters M through N of the original strings. M and N are 1-origin indices; if M is omitted, the default is 1; if N is omitted, the default is the end of the string. If M or N is non-positive, it is interpreted as an index relative to the end of the string, with 0 being the last character, -1 next to last, etc. Finally, the returned string will be shorter than N-M+1 characters if the original doesn't have an Mth or Nth character, with "" (note that this is otherwise impossible) if neither an Mth nor an Nth character exists. A 0 is returned for any string which was 0 on input. In the second form, START_END is an array of [start,end] indices. A single pair [start,end] is equivalent to the range start+1:end, that is, start is the index of the character immediately before the substring (which is to say start is the number of characters skipped at the beginning of the string). If endlength, or if the original string is string(0), strpart returns string(0); otherwise, if end==start, strpart returns "". However, the START_END array may have any additional dimensions (beyond the leading dimension of length 2) which are conformable with the dimensions of the STRING_ARRAY. The result will be a string array with dimensions dimsof(STRING_ARRAY,START_END(1,..)). Furthermore, the leading dimension of START_END may have any even length, say 2*n, in which case the leading dimension of the result will be n. For example, strpart(a, [s1,e1,s2,e2,s3,e3,s4,e4]) is equivalent to (or shorthand for) strpart(a(-,..), [[s1,e1],[s2,e2],[s3,e3],[s4,e4]])(1,..) In the third form, called a subroutine, strpart operates on STRING_ARRAY in place. In this case START_END must have leading dimension of length 2, although it may have trailing dimensions as usual. Examples: strpart("Hello, world!", 4:6) --> "lo," strpart("Hello, world!", [3,6]) --> "lo," -it may help to think of [start,end] as the 0-origin offset of a "cursor" between the characters of the string strpart("Hello, world!", [3,3]) --> "" strpart("Hello, world!", [3,2]) --> string(0) strpart("Hello, world!", [3,20]) --> string(0) strpart("Hello, world!", [3,6,7,9]) --> ["lo,","wo"] strpart(["one","two"], [[1,2],[0,1]]) --> ["n","t"] strpart(["one","two"], [1,2,0,1]) --> [["n","o"],["w","t"]]
DOCUMENT strtok(string_array, delim) or strtok(string_array) or strtok(string_array, delim, n) strips the first token off of each string in STRING_ARRAY. A token is delimited by any of the characters in the string DELIM. If DELIM is blank, nil, or not given, the default DELIM is " \t\n" (blanks, tabs, or newlines). The result is a string array ts with dimensions 2-by-dimsof(STRING_ARRAY); ts(1,) is the first token, and ts(2,) is the remainder of the string (the character which terminated the first token will be in neither of these parts). The ts(2,) part will be 0 (i.e.- the null string) if no more characters remain after ts(1,); the ts(1,) part will be 0 if no token was present. A STRING_ARRAY element may be 0, in which case (0, 0) is returned for that element. With yorick-1.6, strtok has been extended to accept multiple delimiter sets DELIM for successive words, and a repeat count N for the final DELIM set. The operation is the same as for strword, except that the N<=0 special cases are illegal, and if DELIM consists of only a single set, N=2 is the default rather than N=1. The dimensions of the return value are thus min(2,numberof(DELIM)+N-1)-by-dimsof(STRING_ARRAY).
DOCUMENT strtrim(string_array) or strtrim(string_array, which) or strtrim, string_array, which returns STRING without leading and/or trailing blanks. WHICH=1 means to trim leading blanks only, WHICH=2 trims trailing blanks only, while WHICH=3 (the default) trims both leading and trailing blanks. Called as a subroutine, strtrim performs this operation in place. The blank= keyword, if present, is a list of characters to be considered "blanks". Use blank=[lead_delim,trail_delim] to get different leading and trailing "blanks" definitions. By default, blank=" \t\n". (See strword for more about delim syntax.)
DOCUMENT strword(string_array) or strword(string_array, delim) or strword(string_array, delim, n) or strword(string_array, off, delim, n) scans to the first character in STRING_ARRAY which is not in the DELIM list. DELIM defaults to " \t\n", that is, whitespace. The return value is a [start,end] offset pair, with trailing dimensions matching the dimensions of the given STRING_ARRAY. Note that this return value is suitable for use in the strpart or streplace functions. If the first character of DELIM is "^", the sense is reversed; strword scans to the first character in DELIM. (Except that if DELIM is the single character "^", it has its usual meaning.) Also, a "-" which is not the first (or second after "^") or last character of DELIM indicates a range of characters. Finally, if DELIM is "" or string(0), the scan stops immediately, since the first character (no matter what it is) is not in DELIM. Furthermore, DELIM can be a list of delimiter sets, where each element of the list delimits a new word, so the return value will be [start1,end1, ..., startN,endN], where N=numberof(DELIM), and start1 is the offset of the first character not in DELIM(1), characters with offset between end1 and start2 are in DELIM(2), characters with offset between end2 and start3 are in DELIM(3), and so on. If endM is the length of the string for some M[2,15] strword("Hello, world!") --> [0,13] strword("Hello, world!", , 2) --> [0,6,7,13] strword("Hello, world!", , -2) --> [0,6] strword("Hello, world!", ".!, \t\n", -2) --> [0,5] strword("Hello, world!", [string(0), ".!, \t\n"], 0) --> [0,12] strword("Hello, world!", "A-Za-z", 2) --> [5,7,12,13] strword("Hello, world!", "^A-Za-z", 2) --> [0,5,7,13] strword("Hello, world!", "^A-Za-z", 3) --> [0,5,7,12,13,-1] strword(" Hello, world!", [" \t\n",".!, \t\n"]) --> [2,7,9,15] strword(" Hello, world!", [" \t\n",".!, \t\n"], 2) --> [2,7,9,14,15,-1]