Quantcast
Channel: Scott Hanselman's Blog
Viewing all articles
Browse latest Browse all 1148

Back to Basics: Big O notation issues with older .NET code and improving for loops with LINQ deferred execution

$
0
0

Earlier today Brad Wilson and I were discussing a G+ post by Jostein Kjønigsen where he says, "see if you can spot the O(n^2) bug in this code."

public IEnumerable<Process> GetProcessesForSession(string processName, int sessionId)
{
var processes = Process.GetProcessByName(processName);
var filtered = from p in processes
where p.SessionId == sessionId
select p;
return filtered;
}

This is a pretty straightforward method that calls a .NET BCL (Base Class Library) method and filters the result with LINQ. Of course, when any function calls another one that you can't see inside (which is basically always) you've lost control. We have no idea what's going on in GetProcessesByName.

Let's look at the source to the .NET Framework method in Reflector. Our method calls Process.GetProcessesByName(string).

public static Process[] GetProcessesByName(string processName)
{
return GetProcessesByName(processName, ".");
}

Looks like this one is an overload that passes "." into the next method Process.GetProcessesByName(string, string) where the second parameter is the machineName.

This next one gets all the processes for a machine (in our case, the local machine) then spins through them doing a compare on each one in order to build a result array to return up the chain.

public static Process[] GetProcessesByName(string processName, string machineName)
{
if (processName == null)
{
processName = string.Empty;
}
Process[] processes = GetProcesses(machineName);
ArrayList list = new ArrayList();
for (int i = 0; i < processes.Length; i++)
{
if (string.Equals(processName, processes[i].ProcessName, StringComparison.OrdinalIgnoreCase))
{
list.Add(processes[i]);
}
}
Process[] array = new Process[list.Count];
list.CopyTo(array, 0);
return array;
}

if we look inside GetProcesses(string), it's another loop. This is getting close to where .NET calls Win32 and as these classes are internal there's not much I can do to fix this function other than totally rewrite the internal implementation. However, I think I've illustrated that we've got at least two loops here, and more likely three or four.

public static Process[] GetProcesses(string machineName)
{
bool isRemoteMachine = ProcessManager.IsRemoteMachine(machineName);
ProcessInfo[] processInfos = ProcessManager.GetProcessInfos(machineName);
Process[] processArray = new Process[processInfos.Length];
for (int i = 0; i < processInfos.Length; i++)
{
ProcessInfo processInfo = processInfos[i];
processArray[i] = new Process(machineName, isRemoteMachine, processInfo.processId, processInfo);
}
return processArray;
}

This code is really typical of .NET circa 2002-2003 (not to mention Java, C++ and Pascal). Functions return arrays of stuff and other functions higher up filter and sort.

When using this .NET API and for looping over the results several times, I'm going for(), for(), for() in a chain, like O(4n) here.

Note: To be clear, it can be argued that O(4n) is just O(n), cause it is. Adding a number like I am isn't part of the O notation. I'm just saying we want to avoid O(cn) situations where c is a large enough number to affect perf.

image

Sometimes you'll see nested for()s like this, so O(n^3) here where things get messy fast.

Squares inside squares inside squares representing nested fors

LINQ is more significant than people really realize, I think. When it first came out some folks said "is that all?" I think that's unfortunate. LINQ and the concept of "deferred execution" is just so powerful but I think a number of .NET programmers just haven't taken the time to get their heads around the concept.

Here's a simple example juxtaposing spinning through a list vs. using yield. The array version is doing all the work up front, while the yield version can calculate. Imagine a GetFibonacci() method. A yield version could calculate values "just in time" and yield them, while an array version would have to pre-calculate and pre-allocate.

public void Consumer()
{
foreach (int i in IntegersList()) {
Console.WriteLine(i.ToString());
}

foreach (int i in IntegersYield()) {
Console.WriteLine(i.ToString());
}
}

public IEnumerable<int> IntegersYield()
{
yield return 1;
yield return 2;
yield return 4;
yield return 8;
yield return 16;
yield return 16777216;
}

public IEnumerable<int> IntegersList()
{
return new int[] { 1, 2, 4, 8, 16, 16777216 };
}

Back to our GetProcess example. There's two issues at play here.

First, the underlying implementation where GetProcessesInfos eventually gets called is a bummer but it's that way because of how P/Invoke works and how the underlying Win32 API returns the data we need. It would certainly be nice if the underlying API was more granular. But that's less interesting to me than the larger meta-issue of a having (or in this case, not having) a LINQ-friendly API.

The second and more interesting issue (in my option) is the idea that the 2002-era .NET Base Class Library isn't really setup for LINQ-friendliness. None of the APIs return LINQ-friendly stuff or IEnumerable<anything> so that when you change together filters and filters of filters of arrays you end up with O(cn) issues as opposed to nice deferred LINQ chains.

When you find yourself returning arrays of arrays of arrays of other stuff while looping and filtering and sorting, you'll want to be aware of what's going on and consider that you might be looping inefficiently and it might be time for LINQ and deferred execution.

image

Here's a simple conversion attempt to change the first implementation from this classic "Array/List" style:

ArrayList list = new ArrayList();
for (int i = 0; i < processes.Length; i++)
{
if (string.Equals(processName, processes[i].ProcessName, StringComparison.OrdinalIgnoreCase))
{
list.Add(processes[i]);
}
}
Process[] array = new Process[list.Count];
list.CopyTo(array, 0);
return array;

To this more LINQy way. Note that returning from a LINQ query defers execution as LINQ is chainable. We want to assemble a chain of sorting and filtering operations and execute them ONCE rather than for()ing over many lists many times.

if (processName == null) { processName = string.Empty; }

Process[] processes = Process.GetProcesses(machineName); //stop here...can't go farther?

return from p in processes
where String.Equals(p.ProcessName, processName, StringComparison.OrdinalIgnoreCase)
select p; //the value of the LINQ expression being returned is an IEnumerable<Process> object that uses "yield return" under the hood
Here's the whole thing in a sample program.
static void Main(string[] args)
{
var myList = GetProcessesForSession("chrome.exe", 1);
}

public static IEnumerable<Process> GetProcessesForSession(string processName, int sessionID)
{
//var processes = Process.GetProcessesByName(processName);
var processes = HanselGetProcessesByName(processName); //my LINQy implementation
var filtered = from p in processes
where p.SessionId == sessionID
select p;
return filtered;
}

private static IEnumerable<Process> HanselGetProcessesByName(string processName)
{
return HanselGetProcessesByName(processName, ".");
}

private static IEnumerable<Process> HanselGetProcessesByName(string processName, string machineName)
{
if (processName == null)
{
processName = string.Empty;
}
Process[] processes = Process.GetProcesses(machineName); //can't refactor farther because of internals.

//"the value of the LINQ expression being returned is an IEnumerable<Process> object that uses "yield return" under the hood" (thanks Mehrdad!)

return from p in processes where String.Equals(p.ProcessName == processName, StringComparison.OrdinalIgnoreCase) select p;

/* the stuff above replaces the stuff below */
//ArrayList list = new ArrayList();
//for (int i = 0; i < processes.Length; i++)
//{
// if (string.Equals(processName, processes[i].ProcessName, StringComparison.OrdinalIgnoreCase))
// {
// list.Add(processes[i]);
// }
//}
//Process[] array = new Process[list.Count];
//list.CopyTo(array, 0);
//return array;
}

This is a really interesting topic to me and I'm interested in your opinion as well, Dear Reader. As parts of the .NET Framework are being extended to include support for asynchronous operations, I'm wondering if there are other places in the BCL that should be updated to be more LINQ friendly. Or, perhaps it's not an issue at all.

Your thoughts?



© 2011 Scott Hanselman. All rights reserved.



Viewing all articles
Browse latest Browse all 1148

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>