We haven't done a lot with text data in our programs to this point. The programs you have seen and written have done little more than read in simple text data and spit it back out, with virtually no manipulation or transformation of that information. Part of the reason I have delayed discussing text processing is that, aside from the concatenation operator (+), all string manipulation is done by calling methods. Now that you know how to call methods, we can discuss string processing techniques. You'll be able to break open a string value, extract individual characters or substrings, and do interesting things with them.
Recall that a string contains a sequence of 0 or more characters. A character is any individual Unicode symbol, including letters, digits, punctuation, white space, and other symbols.
Consider the following program:
Example 4.1. Word.cs
// Word.cs
// Displays a word in all capitals
using System;

class Word
{
    static void Main()
    {
        string word;
        Console.Write("Enter a word in lower case or mixed case: ");
        word = Console.ReadLine();
        string wordCapitalized = word.ToUpper();
        Console.WriteLine("The word in all caps is: " + wordCapitalized);
    } // main
} // Word
This program asks the user to enter a word, and displays the word in all uppercase letters. Notice the method call that converts the word to uppercase:
string wordCapitalized = word.ToUpper();
The ToUpper() method, a member of the String class, returns the string in all capitals. Take another look at this method call -- there's something odd about it, at least when you compare it to the examples discussed in the last section. Can you see what it is?
I'll give you a hint. The difference has to do with what comes before the dot. All the methods we've used up to now have been called by writing the name of the method's class before the dot. What comes before the dot in this method call?
String methods are different from the methods we've seen. Instead of writing the class name (String) before the dot, we write the name of a string variable before the dot.
Now, take a look at the method interface for the ToUpper() method, defined in the String class:
public string ToUpper()
What word is missing from this interface that appeared in all the method interfaces we have looked at so far?
If you said 'static,' you're right. The absence of the word 'static' is very significant. It means we're dealing with a very different kind of method from the methods we have been studying. The methods I introduced in the last section -- the ones marked static -- are called class methods, because when you call them, you always write the name of their class before the dot. The ToUpper method is the first example you have seen of what are called "instance methods". When you use an instance method, instead of writing the class name before the dot, you write the name of a variable whose data type is the class. The next section will have more to say on the subject of instance methods, but for now, just remember that most String methods expect a string variable before the dot in the method call statement.
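To make the contrast concrete, here is a minimal sketch that puts the two call styles side by side. Math.Max stands in for any static method you may have used; the class name CallStyles and the variable names are my own choices for illustration.

```csharp
// CallStyles.cs
// Contrasts a class (static) method call with an instance method call
using System;

class CallStyles
{
    static void Main()
    {
        // Class method: the class name (Math) comes before the dot
        int larger = Math.Max(3, 7);

        // Instance method: a string variable comes before the dot
        string greeting = "hello";
        string shout = greeting.ToUpper();

        Console.WriteLine(larger);   // prints 7
        Console.WriteLine(shout);    // prints HELLO
    }
}
```

In the first call the method gets all of its data from the parameters. In the second, the data comes from the variable in front of the dot.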
If you think about it a bit, it wouldn't make sense to write the name of the class before the dot with the ToUpper() method. Here's what the code would look like if we did that:
string wordCapitalized = String.ToUpper(); // wrong
Aside from the fact that this statement would trigger a compile error, it doesn't even make sense. A program could have several string variables in it, each holding a different value. Which one of those values would the ToUpper() method use? The method interface for ToUpper() doesn't allow you to specify the string you want to operate on as a parameter. Instead, you put the string variable before the dot, and the method operates on the data in that variable.
Take a look at this fragment:
string firstName = "John";
string lastName = "Doe";
string capName = firstName.ToUpper() + lastName.ToUpper();
The call firstName.ToUpper() yields "JOHN", because firstName is "John". The call lastName.ToUpper() yields "DOE". We're calling the same method -- ToUpper() -- twice, but each time we call it, we supply a different variable before the dot, and the result is determined by the data in that variable.
Here are some common methods used with strings:
Figure 4.2. Common String Methods
public string Substring(int start, int length)
Returns the section of this string that starts at index start and contains length characters
public int IndexOf(char ch)
Searches this String for the character ch, and returns the index of its first occurrence, or -1 if ch is not found
public string ToUpper()
Returns this String with letters converted to ALL CAPITALS
public string ToLower()
Returns this String with letters converted to all lowercase
public string Trim()
Returns this String with any leading or trailing blanks removed
The Substring method lets you peek inside a string value to extract an individual character or a run of consecutive characters. You provide the position of the first character you want and the number of characters to extract, and Substring returns those characters as a new string.
The start parameter specifies a starting position in the string. The first character in a string is at index 0, the second is at index 1, and so on. For example, the string value "Greet" has the following structure:
character | G | r | e | e | t |
index     | 0 | 1 | 2 | 3 | 4 |
Notice that "Greet" has a length of 5 (there are 5 characters in it), but the position of the last character is index 4. There is no index 5 in this string.
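Here is a short sketch that applies those index positions to the string "Greet". The class name SubstringDemo and the variable names are hypothetical; the only library call is Substring(start, length) from the table above.

```csharp
// SubstringDemo.cs
// Extracts pieces of the string "Greet" with Substring(start, length)
using System;

class SubstringDemo
{
    static void Main()
    {
        string word = "Greet";

        string first = word.Substring(0, 1);   // 1 character starting at index 0
        string middle = word.Substring(1, 3);  // 3 characters starting at index 1
        string last = word.Substring(4, 1);    // 1 character starting at index 4

        Console.WriteLine(first);    // prints G
        Console.WriteLine(middle);   // prints ree
        Console.WriteLine(last);     // prints t
    }
}
```

Notice that the second argument is a count of characters, not an ending index.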
Here's a brief code fragment that shows how you might use the Substring method to extract the first character from any string entered by the user:
Console.Write("Enter a word: ");
string msg = Console.ReadLine();
string firstLetter = msg.Substring(0, 1); // extract 1 character starting at position 0
Console.WriteLine("The first letter in the word is: " + firstLetter);
By the way, you have to be careful with the Substring method. If you give it a starting position that is too big, or if the length you provide extends past the end of the string, the method throws an ArgumentOutOfRangeException and, unless your code handles that error, your program will crash. In this example, if the user pressed Enter without typing a word, the Substring call would fail, because there is no character at position 0 in an empty string.
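One way to avoid that failure is to check how many characters the user typed before calling Substring. The sketch below is only one possible approach. It relies on the Length property, which is described later in this section, and it assumes ReadLine() actually returns a line of text. The class name FirstLetter is hypothetical.

```csharp
// FirstLetter.cs
// Checks the length of the entry before calling Substring
using System;

class FirstLetter
{
    static void Main()
    {
        Console.Write("Enter a word: ");
        string msg = Console.ReadLine();   // assumed to be a line of text (possibly empty)

        // Only call Substring(0, 1) when there is at least one character
        if (msg.Length >= 1)
        {
            string firstLetter = msg.Substring(0, 1);
            Console.WriteLine("The first letter in the word is: " + firstLetter);
        }
        else
        {
            Console.WriteLine("You didn't enter a word.");
        }
    }
}
```

If the user types "apple", the program reports the letter a. If the user just presses Enter, the else branch runs and Substring is never called.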
The IndexOf method does the opposite of the Substring method -- you give it the character you want to find, and it tells you the position at which that character first appears in the string.
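A small sketch of IndexOf at work on the string "Greet" follows. The class and variable names are hypothetical. The third call shows what happens when the character is not in the string at all: IndexOf returns -1.

```csharp
// IndexOfDemo.cs
// Looks up character positions in the string "Greet"
using System;

class IndexOfDemo
{
    static void Main()
    {
        string word = "Greet";

        int posOfG = word.IndexOf('G');   // 'G' is the first character
        int posOfE = word.IndexOf('e');   // 'e' appears twice; the first match is reported
        int posOfZ = word.IndexOf('z');   // 'z' does not appear at all

        Console.WriteLine(posOfG);   // prints 0
        Console.WriteLine(posOfE);   // prints 2
        Console.WriteLine(posOfZ);   // prints -1
    }
}
```

Because -1 is never a valid index, comparing the result with -1 is a simple way to test whether a string contains a given character.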
Note that ToUpper() and ToLower() do not change the string on which they are invoked. Instead, they return a new string value, which you must assign to a variable. For example, to convert a string to uppercase, you can't write this:
msg.ToUpper(); // no effect
The ToUpper() method does not affect the string msg itself. Instead, it makes a copy of msg with any letters converted to uppercase, and returns it. You must store the result in a variable. If you want, you can store the result in the same variable, like this:
msg = msg.ToUpper(); // the right way
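The same rule applies to the other methods in the table that return a string, such as Trim() and ToLower(). Here is a minimal sketch, with a hypothetical class name and a made-up sample value. The square brackets in the output are only there to make the blanks visible.

```csharp
// TrimDemo.cs
// Shows that Trim() and ToLower() return new strings rather than changing msg
using System;

class TrimDemo
{
    static void Main()
    {
        string msg = "  Hello  ";

        msg.Trim();                           // result is discarded; msg is unchanged
        Console.WriteLine("[" + msg + "]");   // prints [  Hello  ]

        msg = msg.Trim();                     // store the trimmed copy back in msg
        msg = msg.ToLower();                  // store the lowercase copy back in msg
        Console.WriteLine("[" + msg + "]");   // prints [hello]
    }
}
```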
Often it's necessary to determine the length of a string. If you look through the list of methods in the C# String class, you won't find a method that tells you how many characters are in a string. Instead, the String class contains a property called Length that you can use to determine that information. A property is a piece of data inside an object (as opposed to a method, which is an action an object can perform).
Here's a program that demonstrates how you access the Length property to determine the number of characters in a string:
Example 4.2. WordLength.cs
// WordLength.cs
// Displays the length of a word, and some additional information
using System;

class WordLength
{
    static void Main()
    {
        string word;
        Console.Write("Enter a word in lower case or mixed case: ");
        word = Console.ReadLine();
        int len = word.Length; // the number of characters
        if (len > 0)
        {
            Console.WriteLine("The word has " + len + " characters.");
            char lastChar = word[len-1];
            Console.WriteLine("The last character is: " + lastChar);
        }
        else
        {
            Console.WriteLine("You didn't enter a word.");
        }
    } // Main
} // WordLength
Here's the line to focus on:
int len = word.Length;
Note that when you use a property, you don't put parentheses after it.
The Length property tells you the number of characters in the string. For example, if the user entered "food", word.Length would be 4.
Now, look at this line, which extracts a single character from the string:
char lastChar = word[len-1];
Can you determine which character in the string is extracted? Tip: Remember that character positions in strings begin with 0.
If you said "the last one," you're right! For example, if the user entered a 5-letter word, the characters would be in positions numbered 0 through 4, but the Length would be 5. So, to get the index of the last character, you always subtract 1 from the value of the Length property.
You use square brackets after a string variable to extract a character at a given position. You can think of it as a shortcut for using Substring, if all you want is one character. The difference between using square brackets to extract a single character and using the Substring method is that the Substring method returns a string value, but the square brackets return a char value.
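The following sketch pulls the last character out of the same string both ways, so you can see the difference in data types. The class and variable names are hypothetical, and "food" is just a sample value.

```csharp
// CharVersusString.cs
// Extracts the last character of a string as a char and as a string
using System;

class CharVersusString
{
    static void Main()
    {
        string word = "food";

        char lastAsChar = word[word.Length - 1];                    // a char value
        string lastAsString = word.Substring(word.Length - 1, 1);   // a string of length 1

        Console.WriteLine(lastAsChar);     // prints d
        Console.WriteLine(lastAsString);   // prints d

        // A char is compared with a character literal in single quotes;
        // a string is compared with a string literal in double quotes.
        Console.WriteLine(lastAsChar == 'd');     // prints True
        Console.WriteLine(lastAsString == "d");   // prints True
    }
}
```

The two values look identical when printed, but they are different types, so they must be stored in different kinds of variables.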
The program in Example 4.2, “WordLength.cs” asked the user to enter a word, and reported the number of characters in the word. The ReadLine() method allows the user to enter any number of words, in which case the output would be misleading. Let's modify the program to detect when the user enters more than one word.
Here's some pseudocode that outlines the approach we will use:
Get text from user
Let isValid = true
While there are more characters in the user's entry to check:
    Get the next character in the word
    If it's a space:
        Let isValid = false
If isValid is true:
    Compute and display the number of characters
else
    Display an error message
The basic idea involves searching the string entered by the user to see if it contains a space character. We'll examine each character in the word to see if it is a space or not. If it is a space, we'll set a flag (isValid) to false (the flag is initially set to true). After we've checked all the letters in the word, if the flag (isValid) is still true, we know we haven't found a space, and we can proceed to display the character count. Otherwise, if the flag is false, we display an error message.
We'll use a loop and a counter variable to work through all the indexes in the user's string, extract the character at each position, and check it for a space. Here is the code:
Example 4.3. WordChecker.cs
// WordChecker.cs
// Displays the length of a word
using System;

class WordChecker
{
    static void Main()
    {
        Console.Write("Enter a word: ");
        string word = Console.ReadLine();
        int len = word.Length;
        bool isValid = true;
        int curIndex = 0;
        while (curIndex < len)
        {
            char nextChar = word[curIndex];
            if (nextChar == ' ')
            {
                isValid = false;
            }
            curIndex = curIndex + 1;
        }
        if (isValid == true)
        {
            Console.WriteLine("The word has " + len + " characters in it.");
        }
        else
        {
            Console.WriteLine("You entered an invalid word.");
        }
    } // main
} // class WordChecker
This works well, but it doesn't completely solve our problem. If we truly want the user to enter a word, then we shouldn't allow numbers and symbols. In other words, we want the user's entry to consist solely of letters.
The solution involves changing the validity test. Instead of designing the program to reject strings containing spaces, let's design it to reject strings containing anything other than letters. To do that, we need a way to test a char value to see whether it is a letter or not. We'll use a method in the Char class to help.
public static bool IsLetter(char ch)
Notice that this method is marked static, so when we use it, we must put the class name before the dot. If you look at the documentation for the IsLetter method, you find that it tests the char value you provide as a parameter and returns true if the value is a letter (either upper or lowercase); otherwise, it returns false. We only have to change one statement in the program to switch from rejecting spaces to rejecting non-letters. We'll change the if statement in the while loop to the following:
if (Char.IsLetter(nextChar) == false)
With this change in place, the computer now checks to see if each character is a letter. If it encounters anything other than a letter, such as a digit, punctuation, or space, the computer sets isValid to false.
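For reference, here is a sketch of how the whole program might look with that one statement changed. Everything else is carried over from Example 4.3, and the added comments are mine.

```csharp
// WordChecker.cs (modified)
// Displays the length of a word, rejecting entries that contain non-letters
using System;

class WordChecker
{
    static void Main()
    {
        Console.Write("Enter a word: ");
        string word = Console.ReadLine();
        int len = word.Length;
        bool isValid = true;
        int curIndex = 0;

        while (curIndex < len)
        {
            char nextChar = word[curIndex];
            // The only change: test for a non-letter instead of a space
            if (Char.IsLetter(nextChar) == false)
            {
                isValid = false;
            }
            curIndex = curIndex + 1;
        }

        if (isValid == true)
        {
            Console.WriteLine("The word has " + len + " characters in it.");
        }
        else
        {
            Console.WriteLine("You entered an invalid word.");
        }
    }
}
```

An entry such as "hello" is accepted, while "ab3" and "two words" are rejected. Note that an empty entry still passes this check, because the loop body never runs and isValid stays true; combining it with the len > 0 test from Example 4.2 would be one way to handle that case.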
By the way, if you check the C# API, you'll see that Char is not really a class, but something called a struct. There are some differences between classes and structs, but none that make a difference in the present discussion.