Programming Languages Series: Episode 2 – Memory Management in .NET

Programming Languages Series

I will write a few articles about programming languages, in order to correct some misleading ideas and misunderstandings.

  • Episode 1 – Expression Evaluation.
  • Episode 2 – Memory Management in .NET.

Episode 2 – Memory Management

There are many myths about .NET Memory Management, such as:

  • Value types are allocated on the Stack.
  • Everything is an object.
  • There is no Stack in .NET!

.NET memory is managed by the Microsoft .NET Common Language Runtime. The garbage collector in the CLR is responsible for tracking memory usage and knowing when to free memory.

Usually when we speak about data types, we distinguish two primary categories: value types and reference types.

Value Types: Actually, I found that the MSDN documentation, most .NET books (VB, C#), and some C++ books use an incorrect definition of value types. So what is the correct definition?

“A value type is a data type that can exist outside dynamic memory allocation. Unlike reference types, value types can be directly embedded into composite objects.” This definition is extremely abstract but true, so let’s make it clear: a value type can exist on the stack, or on the heap directly embedded into a composite object.


But there is another right way to think about value types: think about their by-design semantic meaning instead of the implementation details. The semantic meaning of ‘value type’ is: a value type is always copied ‘by value’, regardless of whether it lives on the stack or the heap.

When is a value type allocated on the stack, and when on the heap? Simply put: when you create a method-scope value type, it is allocated on the stack; but if you declare a field of a class, that member is allocated on the heap whether it is a value type or a reference type. The difference is that a value type is directly embedded into the composite object, while if the member is a reference type such as string, the object contains only a reference to the string object.

I hope it is now clear that a value type can live on the heap or the stack depending on its scope. Also, value types are always copied by value, which means that when you pass a value type to a method it is copied by value, unless you pass it by reference.

Everything is an object! Some people try to convince themselves that everything in .NET is an object, even the value types. So let’s first see what an object is:

“In computer science, an object, in the domain of object-oriented programming, usually means a compilation of attributes (object elements) and behaviors (methods) encapsulating an entity. However, outside the object-oriented programming domain, the word object may simply mean any entity that can be manipulated by the commands of a programming language, such as a value, variable, function, or data structure”   – Source: Wikipedia

So what do they mean by “Everything is an object”? Every type derives from a single root (System.Object), which means that every value can be implicitly cast to System.Object. More precisely, every value is representable as an instance of System.Object.

Actually, not everything derives from System.Object; for example, int* does not, which means you can’t pass an int* to a method that accepts System.Object, and you can’t call the GetHashCode() or ToString() methods on such types. What about value types? A lot of people wonder why and how value types derive from System.Object, and this actually leads some to weird conclusions. The simple answer is:

“It is better to think about value types by their design semantic meaning instead of the implementation details. The semantic meaning of ‘value type’ is: a value type is always copied ‘by value’, regardless of whether it lives on the stack or the heap.”

I know this answer will not satisfy everyone. Actually, the value types that derive from System.ValueType are not objects by the definition above, but every value type can be cast implicitly to System.Object.


Squeeze more performance from Parallelism


In many posts, such as The Future of the Parallelism and its Challenges, I mentioned that synchronizing access to shared resources is the major challenge in writing parallel code.

Synchronization and coordination take a long time out of the overall execution time, which reduces the benefits of parallelism; they also reduce scalability.

There are many forms of synchronization and coordination, such as:

  • Creating Task objects in frameworks such as Microsoft TPL, TBB, and the Parallel Runtime Library. Creating and enqueuing task objects requires synchronization that takes a long time, especially in recursive work such as the quicksort algorithm.
  • Synchronizing access to shared data.

But there are a few techniques to avoid these issues, such as Shared-Nothing, the Actor Model, and Hyper Objects (a.k.a. combinable objects). Simply put, if we reduce the shared data by re-architecting our code, we get huge benefits in performance and scalability. More can be read about parallelism here.


Shared-Nothing and Actor Model

Personally, I’m a fan of these concepts. Shared-Nothing is an architecture in which each node (i.e., execution thread) is independent and self-sufficient, and there is no single point of contention across the system. The principle behind Shared-Nothing is to make every execution thread independent from the other threads, which improves scalability and performance. A pure Shared-Nothing system can scale almost infinitely, simply by adding nodes in the form of inexpensive machines.

Processing Huge File – Case

Let’s assume that we want to process a multi-gigabyte file in a forward-only fashion, for example to count word frequency, build an index, etc. Our parallel implementation could be:

Load a chunk of data and process it in parallel, like this Cilk-style C++ pseudocode:

while (!Stream.EndOfFile())
{
    data = Stream.LoadChunk();

    for (i = 0; i < data.Length; ++i)
        spawn ProcessLine(data[i]);
}


I used Cilk C++ because it’s very simple; however, you can use your own parallelism framework.

This way of processing a huge file is ineffective because:

– Loading every chunk is done synchronously, which means the processors wait until each chunk is loaded.

We could reduce the size of the chunks to minimize the waiting time, but this introduces another problem: parallelism overhead. The chunks should be big enough to be worthwhile.

– The partitioning, thread construction, and other coordination code reduce the scalability and performance of this code, because every time you load a chunk you call into the parallel infrastructure to prepare a free thread, partition the data, and start executing Parallel.For(), etc.

Actually, there are a few methods to avoid the above issues, such as the Shared-Nothing architecture and the Actor Model. So let’s fix it…

We will re-architect our code to make every piece independent, so we will have components such as:

  • File Reader: reads from the file into the buffer.
  • Buffer: a chunk buffer between the File Reader and the Chunk Processors.
  • Chunk Processor: processes a chunk of data by looping over it and calling the ProcessLine function, without any parallelism.
  • ProcessLine: the ProcessLine function shown above.

The File Reader and every Chunk Processor work independently. At startup you initialize the Buffer (a fixed-size concurrent queue) and the File Reader on its own thread, then initialize the workers, each on a different thread mapped to a physical thread; or just start a new Task in your parallelism framework. The workers, in parallel, try to get a chunk from the buffer and process it.

What’s the difference? This method removes the direct dependence between the components, so each one can scale independently. Now you can reduce the chunk size to reduce the waiting time, because every worker is already working and there is no parallelism overhead attached to the main thread after reading the data.

Combinable Object (A.K.A. Hyper Object)

A combinable object, or as Cilk likes to call it, a ‘hyper object’, is a technique to avoid synchronization and its problems by creating one combinable object per thread; at the end of the task, the per-thread copies can be combined to give you the final object or result. With a combinable object, every thread accesses its own copy, which means there is no need for synchronization, and each copy is cached effectively in its processor’s cache. See the following simple C# code using the Parallel Runtime Library:

Combinable<List<string>> CO = new Combinable<List<string>>();

Parallel.For(0, ItemCount, i =>
{
    CO.Local.Add(i.ToString());   // each thread adds to its own List<>
});

List<string> result = CO.Combine((r, temp) => { r.AddRange(temp); });

The List<> object itself is not thread-safe, but the combinable object solves the problem by creating a different List<> object per thread, and at the end you can combine the objects together. But accessing the combinable object itself requires internal synchronization :( However, most parallel frameworks today implement a prepare (i.e., local initialization) technique to maximize the benefits of the combinable object. See the following simple C# code using the Parallel Runtime Library:

Combinable<List<string>> CO = new Combinable<List<string>>();

Parallel.For(0, ItemCount, () => CO.Local, (i, local) =>
{
    local.Add(i.ToString());      // reuse the cached thread-local List<>
});

List<string> result = CO.Combine((r, temp) => { r.AddRange(temp); });

It’s almost the same, except that we prepare and cache the thread-local copy of the List<> object before the loop body, which reduces the Local lookup to one call per thread.

In general, combinable objects, Shared-Nothing, and the Actor Model can improve your performance and scalability.


YouTube Architecture


Do you know …

  • YouTube is the most popular video website.
  • YouTube reaches one billion views per day. That’s at least 11,574 views per second, 694,444 views per minute, and 41,666,667 views per hour.
  • YouTube has a rich set of APIs so it can become your video platform: upload, edit, watch, search, and comment on video from your own site without visiting YouTube.
  • YouTube grew incredibly fast.


Sources: highscalability, and Google Video

Platform
  • Apache
  • Python
  • Linux (SuSe)
  • MySQL
  • psyco, a dynamic python->C compiler
  • lighttpd for video instead of Apache

Stats

  • Supports the delivery of over 100 million videos per day.
  • Founded 2/2005
  • 3/2006 30 million video views/day
  • 7/2006 100 million video views/day
  • 2 sysadmins, 2 scalability software architects
  • 2 feature developers, 2 network engineers, 1 DBA


    YouTube and most successful websites employ the following algorithm:

    while (true)
    {
        identify_and_fix_bottlenecks();
        drink();
        sleep();
        notice_new_bottleneck();
    }
    This loop runs many times a day.


Web Servers

  • NetScaler is used for load balancing and caching static content.
  • Run Apache with mod_fast_cgi.
  • Requests are routed for handling by a Python application server.
  • The application server talks to various databases and other information sources to get all the data and formats the HTML page.
  • Can usually scale web tier by adding more machines.
  • The Python web code is usually NOT the bottleneck; it spends most of its time blocked on RPCs.
  • Python allows rapid flexible development and deployment. This is critical given the competition they face.
  • Usually less than 100 ms page service times.
  • Use psyco, a dynamic python->C compiler that uses a JIT compiler approach to optimize inner loops.
  • For high CPU intensive activities like encryption, they use C extensions.
  • Some pre-generated cached HTML for expensive-to-render blocks.
  • Row level caching in the database.
  • Fully formed Python objects are cached.
  • Some data are calculated and sent to each application so the values are cached in local memory. This is an underused strategy. The fastest cache is in your application server and it doesn’t take much time to send precalculated data to all your servers. Just have an agent that watches for changes, precalculates, and sends.

Video Serving
  • Costs include bandwidth, hardware, and power consumption.
  • Each video is hosted by a mini-cluster. Each video is served by more than one machine. Using a cluster means:
    • More disks serving content which means more speed.
    • Headroom. If a machine goes down others can take over.
    • There are online backups.
    • Servers use the lighttpd web server for video:
    • Apache had too much overhead.
    • Uses epoll to wait on multiple fds.
    • Switched from single process to multiple process configuration to handle more connections.
  • Most popular content is moved to a CDN (content delivery network):
    • CDNs replicate content in multiple places. There’s a better chance of content being closer to the user, with fewer hops, and content will run over a more friendly network.
    • CDN machines mostly serve out of memory because the content is so popular there’s little thrashing of content into and out of memory.
  • Less popular content (1-20 views per day) uses YouTube servers in various colo sites.
  • There’s a long tail effect. A video may have a few plays, but lots of videos are being played. Random disk blocks are being accessed.
  • Caching doesn’t do a lot of good in this scenario, so spending money on more cache may not make sense. This is a very interesting point. If you have a long tail product caching won’t always be your performance savior.
  • Tune RAID controller and pay attention to other lower level issues to help.
  • Tune memory on each machine so there’s not too much and not too little.

Serving Video Key Points
  • Keep it simple and cheap.
  • Keep a simple network path. Not too many devices between content and users. Routers, switches, and other appliances may not be able to keep up with so much load.
  • Use commodity hardware. The more expensive the hardware, the more expensive everything else gets too (support contracts). You are also less likely to find help on the net.
  • Use simple common tools. They use mostly tools built into Linux and layer on top of those.
  • Handle random seeks well (SATA, tweaks).


Serving Thumbnails

  • Surprisingly difficult to do efficiently.
  • There are about 4 thumbnails for each video, so there are a lot more thumbnails than videos.
  • Thumbnails are hosted on just a few machines.
  • Saw problems associated with serving a lot of small objects:
    • Lots of disk seeks and problems with inode caches and page caches at OS level.
    • Ran into per directory file limit. Ext3 in particular. Moved to a more hierarchical structure. Recent improvements in the 2.6 kernel may improve Ext3 large directory handling up to 100 times, yet storing lots of files in a file system is still not a good idea.
    • A high number of requests/sec as web pages can display 60 thumbnails on page.
    • Under such high loads Apache performed badly.
    • Used squid (reverse proxy) in front of Apache. This worked for a while, but as load increased performance eventually decreased. Went from 300 requests/second to 20.
    • Tried using lighttpd, but it stalled with a single thread. Ran into problems with multi-process mode because each process would keep a separate cache.
    • With so many images setting up a new machine took over 24 hours.
    • Rebooting machine took 6-10 hours for cache to warm up to not go to disk.
  • To solve all their problems they started using Google’s BigTable, a distributed data store:
    • Avoids small file problem because it clumps files together.
    • Fast and fault tolerant. Assumes it’s working on an unreliable network.
    • Lower latency because it uses a distributed multilevel cache. This cache works across different collocation sites.
    • For more information on BigTable, read the BigTable paper.

Databases
  • The Early Years:
    • Use MySQL to store meta data like users, tags, and descriptions.
    • Served data off a monolithic RAID 10 Volume with 10 disks.
    • Living off credit cards so they leased hardware. When they needed more hardware to handle load it took a few days to order and get delivered.
    • They went through a common evolution: single server, went to a single master with multiple read slaves, then partitioned the database, and then settled on a sharding approach.
    • Suffered from replica lag. The master is multi-threaded and runs on a large machine so it can handle a lot of work. Slaves are single threaded and usually run on lesser machines and replication is asynchronous, so the slaves can lag significantly behind the master.
    • Updates cause cache misses which goes to disk where slow I/O causes slow replication.
    • Using a replicating architecture you need to spend a lot of money for incremental bits of write performance.
    • One of their solutions was to prioritize traffic by splitting the data into two clusters: a video-watch pool and a general cluster. The idea is that people want to watch video, so that function should get the most resources. The social networking features of YouTube are less important, so they can be routed to a less capable cluster.
  • The later years:
    • Went to database partitioning.
    • Split into shards with users assigned to different shards.
    • Spreads writes and reads.
    • Much better cache locality which means less IO.
    • Resulted in a 30% hardware reduction.
    • Reduced replica lag to 0.
    • Can now scale database almost arbitrarily.

Data Center Strategy
  • Used managed hosting providers at first. Living off credit cards, so it was the only way.
  • Managed hosting can’t scale with you. You can’t control hardware or make favorable networking agreements.
  • So they went to a colocation arrangement. Now they can customize everything and negotiate their own contracts.
  • Use 5 or 6 data centers plus the CDN.
  • Videos come out of any data center. Not closest match or anything. If a video is popular enough it will move into the CDN.
  • Video is bandwidth dependent, not really latency dependent. It can come from any colo.
  • For images latency matters, especially when you have 60 images on a page.
  • Images are replicated to different data centers using BigTable. Code looks at different metrics to know who is closest.

Lessons Learned
  • Stall for time. Creative and risky tricks can help you cope in the short term while you work out longer term solutions.
  • Prioritize. Know what’s essential to your service and prioritize your resources and efforts around those priorities.
  • Pick your battles. Don’t be afraid to outsource some essential services. YouTube uses a CDN to distribute their most popular content. Creating their own network would have taken too long and cost too much. You may have similar opportunities in your system. Take a look at Software as a Service for more ideas.
  • Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It’s true that nobody really knows what simplicity is, but if you aren’t afraid to make changes then that’s a good sign simplicity is happening.
  • Shard. Sharding helps to isolate and constrain storage, CPU, memory, and IO. It’s not just about getting more write performance.
  • Constant iteration on bottlenecks:
    • Software: DB, caching
    • OS: disk I/O
    • Hardware: memory, RAID
  • You succeed as a team. Have a good cross discipline team that understands the whole system and what’s underneath the system. People who can set up printers, machines, install networks, and so on. With a good team all things are possible.

Kngine Snippet Search


Today, the Web search engine is the Web’s gateway, especially for getting specific information. But unfortunately, search engines haven’t changed much as the Web has changed since the 90’s.

Since the 90’s, Web search engines have provided the same kind of results: links to documents. We at Kngine think it’s time to rethink the results’ content, because a lot of the time we need a summary, structured information, a list of things, or a direct answer.

Snippet Search aims to change the Web results that we are familiar with. But what is the problem with the current Web results?

Today when you search for something, you get a list of document links with small descriptions. But if you focus on the description that appears below the document title, you will find that it is not related to what you are looking for; and even if it is related, it is not complete, which means you will need to open more pages and search for what you are looking for (i.e., pore through the results).

It’s indisputable that Web search engines have made information retrieval much easier, but they still don’t satisfy users’ needs, and we still spend a lot of time searching for information.

We at Kngine aim to build the next generation of Web search engine on three main technologies:

  • Live Objects and Semantic Search.
  • Question Answer Engine.
  • Snippet Search.

Live Objects are mainly responsible for understanding the keyword (i.e., the concept) and providing structured, meaningful information such as a summary, information, specifications, a map, etc.; Live Objects also link different kinds of information together, such as city information with annual weather information. The Question Answer Engine component is responsible for answering users’ questions and providing direct answers to a lot of questions.

Today Kngine can answer many complex queries, such as: top countries by electricity production, Venera missions, Winston Churchill books, how many people live in Cairo, etc. But what about the more complex queries and the abstract questions, such as ‘Who is responsible for the Iraq War?’ or ‘How to speak Greek?’

Snippet Search

In September 2009 we started working on a different kind of indexing technique that aims to provide more relevant results in the short term, and to answer abstract questions in the long term. In general, Snippet Search works to provide a collection of rich, ranked paragraphs rather than a collection of document links. The paragraphs that Snippet Search provides will have complete meaning and will be related to what you are looking for, so we will be able to get what we are looking for directly, without poring through other pages.

See the difference with Snippet Search; see Snippet Search in action…


Happy new year, and thanks for making 2009 a success for Kngine

We wanted to send a short end-of-year note in appreciation for all you have done for us this year.  Thank you!  Over the past 12 months, we have grown from a research-project proof of concept to a complete search engine (in beta).  Our user base has also doubled many times.

We launched Kngine to help you search and get what you are looking for faster by providing customized results, such as simple answers to your questions, lists of things, meaningful information, and more relevant search results. Our goal is simple: to organize humanity’s systematic knowledge and experience and make it accessible to everyone.

Hopefully, you’ve found this to be true.  If so, and this being the giving season, we hope you’ll share the Kngine story with your friends and co-workers: tweet about Kngine now, or help them understand Kngine.

While we look for investors, we are also looking for your participation, so please feel free to donate to improve our Web 3.0 search engine.

We continue to add new features, enhance existing ones, and improve the overall user experience.  None of this would be possible without your support, feedback, and suggestions so please keep them coming.  Send an email to [email protected] or visit our support site and share your thoughts.

Thanks again for helping to make 2009 such a success for us.  We have much in store for 2010 and look forward to sharing it all with you in the coming year.

Happy New Year!


Java 2.0: Looking to the future (The future of Java)


I was lucky, like many others, to have started programming about ten years ago, and to have touched many languages such as Basic, Visual Basic, C++, C#, Java, Python, Scheme, Go, Erlang, and others.

There is usually a trend in the community to compare Java with C# or C++ and run benchmarks here and there. But I’m not here today to talk about Java vs. other languages; today I would like to take a look at the future of a great programming language like Java, in light of today’s issues and problems.

The Need for Version 2.0:

Java’s first release was 14 years ago; now the community is waiting for version 7 (1.7). But I think we need a version 2.0. I think we need a new start, a fresh start, without thinking too much about backward compatibility. This fresh start, as I imagine it, would eliminate a lot of the bad design decisions made over the last 14 years; it would also give us an opportunity to think outside the box to solve design issues and challenges such as parallelism, garbage collection, and optimizing the language for the managed environment.

I also imagine this new start as an opportunity to rethink how we should implement the VM in light of research such as XAX and Native Client.

Such a decision is very hard to take, but if you watch closely, you will find that many companies have taken it without losing control. For example: Apple’s transition to Mac OS X, Python 3, VB to VB.NET, and others.

But the question is: do we really need this transition? I think YES. If you look closely, you will find a lot of issues in the language and in the attached framework that can’t be solved without breaking some backward compatibility. As I imagine it, this transition would come with a new level of performance, parallelism in the language, fixes for a lot of issues, and new features. But as you know, nothing is free; we have to pay for it. For example, the Python community still maintains the old 2.x platform while working on the 3.x path, so any transition will take time.

Some of Java Issues:

Structures, and primitives by reference

When we talk about managed languages, we should consider the performance overhead of our decisions. One of the issues with Java today is that it does not support structures (value types), which means every object you create is a reference type that lives on the heap, except for the built-in primitive types (int, short, float, etc.). So every object you create puts pressure/overhead on the garbage collector, even a point object (X, Y). Java also doesn’t support passing primitives by reference, so you always need to box and unbox the values. Some people will say that theoretically it’s not legal to pass a reference to a value type, but I don’t agree with this statement. Boxing and unboxing values adds unnecessary overhead on memory and the garbage collector, while passing by reference is a practical feature that exists in other languages such as C#.

The Framework

The language framework (Java Class Library, .NET Framework, STL, etc.) is still the source of major problems and criticisms. I have written before about performance issues in the .NET Framework. Java also has issues in design and performance; I will not enumerate them here, but I will give you one example:

The BufferedInputStream class has a method called skip, which is supposed to skip the number of bytes you pass. But this method only skips within the current buffer, so you need to write another method, say skipFully, that calls skip in a loop and accumulates the skipped count that skip returns.

This is very bad and easy to miss; if you search inside the Java Class Library itself, you will find that this method is used a few times without considering the previous issue.

I think you can find more examples like this in Java Puzzlers.
